reveal.js

### Curse of Dimensionality

What is the ratio of the volume of a hypersphere to the volume of the hypercube that encloses it as $d \rightarrow \infty$?

---

### Curse of Dimensionality

For a hypersphere of radius $r$ inscribed in a hypercube of side $2r$, $V_{HC} = (2r)^d$ and $V_{HS} = \frac{\pi^{d/2}r^d}{\Gamma(\frac{d}{2} + 1)}$, thus

$\frac{V_{HS}}{V_{HC}} = \frac{\pi^{d/2}}{2^{d}\Gamma(\frac{d}{2} + 1)} \approx \frac{1}{\sqrt{\pi d}} \left (\frac{\pi e}{2 d} \right )^{d/2} \rightarrow 0$

as $d \rightarrow \infty$. Meanwhile, the distance from the center to each corner of the hypercube is $\sqrt{d} r$, which grows without bound while the hypersphere's radius stays at $r$.
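
A quick numerical check of this ratio, as a minimal sketch using only the Python standard library. The function names are illustrative and not part of the original slides; the exact ratio is computed in log space so the Gamma function does not overflow.

```python
import math

def sphere_to_cube_ratio(d: int) -> float:
    """Exact V_HS / V_HC for a d-ball of radius r inside a cube of side 2r."""
    # log(pi^(d/2)) - log(2^d) - log(Gamma(d/2 + 1)); r^d cancels out.
    log_ratio = (d / 2) * math.log(math.pi) - d * math.log(2) - math.lgamma(d / 2 + 1)
    return math.exp(log_ratio)

def stirling_ratio(d: int) -> float:
    """Large-d approximation (1 / sqrt(pi d)) * (pi e / (2 d))^(d/2)."""
    return (math.pi * math.e / (2 * d)) ** (d / 2) / math.sqrt(math.pi * d)

for d in (2, 3, 10, 50):
    print(d, sphere_to_cube_ratio(d), stirling_ratio(d))
```

For $d = 2$ and $d = 3$ the exact ratio reduces to the familiar $\pi/4$ and $\pi/6$; the approximation is only expected to be accurate for large $d$.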

---

### Curse of Dimensionality

1. Nearly all of the volume of a high-dimensional hypercube is far from the center and close to the boundary.
2. High-dimensional datasets are at risk of being sparse. The average distance between two uniformly random points:
   1. in a unit square is roughly 0.52.
   2. in a unit 3-d cube is roughly 0.66.
   3. in a unit 1,000,000-d hypercube is $\sim$408.25.
3. Distances from a random point to its nearest and farthest neighbor are similar.
4. Distance-based classification generalizes poorly unless the number of samples grows exponentially with $d$.
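
The average distances in item 2 can be estimated by simulation. This is a minimal Monte Carlo sketch using only the standard library; the function name, sample sizes, and seed are arbitrary choices. For large $d$ it compares against $\sqrt{d/6}$, which follows from $E[(x - y)^2] = 1/6$ for two independent uniform coordinates on $[0, 1]$.

```python
import math
import random

def mean_pairwise_distance(d: int, n_pairs: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the mean distance between two uniform points in [0, 1]^d."""
    rng = random.Random(seed)
    total = 0.0
    for _pair in range(n_pairs):
        sq_dist = 0.0
        for _coord in range(d):
            diff = rng.random() - rng.random()
            sq_dist += diff * diff
        total += math.sqrt(sq_dist)
    return total / n_pairs

avg_2d = mean_pairwise_distance(2)   # compare with the unit-square value quoted above
avg_3d = mean_pairwise_distance(3)   # compare with the unit-cube value quoted above
# Fewer pairs in high dimension to keep the pure-Python loop fast.
avg_1000d = mean_pairwise_distance(1000, n_pairs=2000)
print(avg_2d, avg_3d, avg_1000d, math.sqrt(1000 / 6))
```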

---

### Biological Networks

---

### Biological Networks

1. Highly interconnected with modular structure.
2. Weakly to strongly scale-free (fraction of nodes with degree $k$ follows a power law $k^{-\alpha}$).
3. Subsets of genes, proteins or regulatory elements tend to form highly correlated modules.
4. Functional genomics datasets tend to (not always!) occupy a low-dimensional subspace of the feature space (e.g., genes, proteins, regulatory elements).
5. Ideal for dimensionality reduction approaches to both visualize and analyze functional genomics data.

---

### Principal Components Analysis (PCA)

---

### PCA

Assume we have $n$ samples and $p$ features arranged in an $n \times p$ centered matrix $\mathbf{X}$, i.e., the mean of each feature across samples has been subtracted.

The unbiased sample covariance matrix is then

$$\Sigma_{XX} = \frac{1}{n-1} \mathbf{X}^{T} \mathbf{X}$$

PCA finds an orthogonal linear transformation $\mathbf{Z} = \mathbf{X} \mathbf{V}$ that diagonalizes $\Sigma_{XX}$, i.e., the covariance matrix of $\mathbf{Z}$ is diagonal.
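
A minimal NumPy sketch of this definition, assuming synthetic data (the sizes, seed, and variable names are illustrative). It takes $\mathbf{V}$ from an eigendecomposition of $\Sigma_{XX}$, one possible route to the diagonalizing transformation, and checks that the covariance of $\mathbf{Z}$ is diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
# Synthetic data with correlated features (purely illustrative).
X_raw = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
X = X_raw - X_raw.mean(axis=0)            # center each feature across samples

Sigma_XX = X.T @ X / (n - 1)              # unbiased sample covariance, p x p

# eigh is meant for symmetric matrices; columns of V are orthonormal eigenvectors.
eigvals, V = np.linalg.eigh(Sigma_XX)
order = np.argsort(eigvals)[::-1]         # sort by decreasing variance
eigvals, V = eigvals[order], V[:, order]

Z = X @ V                                 # transformed data
Sigma_ZZ = Z.T @ Z / (n - 1)              # should be diagonal with eigvals on the diagonal
print(np.round(Sigma_ZZ, 6))
```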

---

### Singular Value Decomposition

$\mathbf{X}$ can be decomposed as follows:

$$\mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}^{T}$$

where $\mathbf{U}$ and $\mathbf{V}$ are $n \times n$ and $p \times p$ orthogonal matrices, respectively, and $\mathbf{D}$ is an $n \times p$ rectangular diagonal matrix. The diagonal elements of $\mathbf{D}$ are the **singular values** of $\mathbf{X}$. The columns of $\mathbf{U}$ and $\mathbf{V}$ are the **left-singular vectors** and **right-singular vectors**, respectively.
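
The shapes above can be checked with `numpy.linalg.svd`, which returns the singular values as a 1-D array and $\mathbf{V}^{T}$ rather than $\mathbf{V}$. A minimal sketch on a random matrix (sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 4
X = rng.normal(size=(n, p))

# full_matrices=True gives U as n x n and Vt as p x p.
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Rebuild the n x p rectangular diagonal matrix D from the singular values.
k = min(n, p)
D = np.zeros((n, p))
D[:k, :k] = np.diag(s)

X_rebuilt = U @ D @ Vt
print(U.shape, D.shape, Vt.shape, np.allclose(X_rebuilt, X))
```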

---

### Singular Value Decomposition

The left-singular vectors and right-singular vectors of $\mathbf{X}$ are the eigenvectors of $\mathbf{X} \mathbf{X}^{T}$ and $\mathbf{X}^{T} \mathbf{X}$, respectively.

The nonzero singular values of $\mathbf{X}$ are the square roots of the eigenvalues of $\mathbf{X} \mathbf{X}^{T}$ and $\mathbf{X}^{T} \mathbf{X}$.
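
Both statements can be verified numerically. A minimal sketch on a random $6 \times 4$ matrix (an arbitrary choice), using the thin SVD so that $\mathbf{U}$ holds only the left-singular vectors paired with nonzero singular values:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # U: 6 x 4, s: (4,), Vt: 4 x 4
V = Vt.T

# Eigenvalues of the two Gram matrices, sorted in descending order.
evals_small = np.linalg.eigvalsh(X.T @ X)[::-1]    # 4 eigenvalues
evals_big = np.linalg.eigvalsh(X @ X.T)[::-1]      # 6 eigenvalues, 2 of them ~ 0

# Eigenvector equations: (X^T X) V = V diag(s^2) and (X X^T) U = U diag(s^2).
right_ok = np.allclose(X.T @ X @ V, V * s**2)
left_ok = np.allclose(X @ X.T @ U, U * s**2)
print(right_ok, left_ok, s**2, evals_small)
```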

---

### PCA

The covariance matrix of $\mathbf{Z} = \mathbf{X} \mathbf{V}$, where the columns of $\mathbf{V}$ are the right-singular vectors of $\mathbf{X}$, is

$$\Sigma_{ZZ} = \frac{1}{n-1} \mathbf{Z}^{T} \mathbf{Z} = \frac{1}{n-1} \mathbf{D}^{T} \mathbf{D} = \frac{1}{n-1} \mathbf{\hat{D}}^{2}$$

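where $\mathbf{\hat{D}}^{2}$ is the square $p \times p$ diagonal matrix of squared singular values (the all-zero rows of $\mathbf{D}$ truncated, assuming $n \geq p$), and we have used the SVD of $\mathbf{X}$, $(\mathbf{UDV}^{T})^{T} = \mathbf{V} \mathbf{D}^{T} \mathbf{U}^{T}$, $\mathbf{V}^{T} \mathbf{V} = \mathbf{I}\_{p}$, $\mathbf{U}^{T} \mathbf{U} = \mathbf{I}\_{n}$. The variance of the $j$-th principal component is therefore the $j$-th squared singular value divided by $n-1$.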
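
A minimal NumPy sketch of this result on synthetic centered data (sizes, seed, and variable names are illustrative). It computes PCA through the thin SVD and checks that the covariance of $\mathbf{Z}$ is diagonal with entries given by the squared singular values divided by $n - 1$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # correlated features
X = X - X.mean(axis=0)                                   # center each feature

U, s, Vt = np.linalg.svd(X, full_matrices=False)         # thin SVD: U is n x p
Z = X @ Vt.T                                             # principal component scores

Sigma_ZZ = Z.T @ Z / (n - 1)
pc_variances = s**2 / (n - 1)                            # diagonal of Sigma_ZZ
explained_ratio = pc_variances / pc_variances.sum()      # fraction of total variance per component
print(np.round(Sigma_ZZ, 6))
print(explained_ratio)
```

Since $\mathbf{Z} = \mathbf{X}\mathbf{V} = \mathbf{U}\mathbf{D}$, the scores can also be read off as the columns of `U` scaled by `s`, without forming the covariance matrix.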

---