### Curse of Dimensionality
What is the volume of the hypersphere divided by the hypercube as $d \rightarrow \infty$? --- ### Curse of Dimensionality $V_{HC} = (2r)^d$ and $V_{HS} = \frac{\pi^{d/2}r^d}{\Gamma(\frac{d}{2} + 1)}$ thus, $\frac{V_{HS}}{V_{HC}} = \frac{\pi^{d/2}}{2^{d}\Gamma(\frac{d}{2} + 1)} \rightarrow \frac{1}{\sqrt{\pi d}} \left (\frac{\pi e}{2 d} \right )^{d/2} \rightarrow 0$ as $d \rightarrow \infty$. And the distance between the center and corners of the hypercube is $\sqrt{d} r$. --- ### Curse of Dimensionality 1. Nearly all of high-dimensional space in a hypercube is distant from the center and close to the border. 2. High dimensional datasets at risk of being sparse. The average distance between two random points: 1. in a unit square is roughly 0.52. 2. in a unit 3-d cube is roughly 0.66. 3. in a unit 1,000,000-d hypercube is $\sim$408.25. 3. Distances from a random point to its nearest and farthest neighbor are similar. 4. Distance-based classification generalizes poorly unless # samples grows exponentially with $d$. --- ### Biological Networks
--- ### Biological Networks 1. Highly interconnected with modular structure. 2. Weakly to strongly scale-free (fraction of nodes with degree $k$ follows a power law $k^{-\alpha}$). 3. Subsets of genes, proteins or regulatory elements tend to form highly correlated modules. 4. Functional genomics datasets tend to (not always!) occupy a low dimensional subpace of the feature space (e.g., genes, proteins, regulatory elements). 5. Ideal for dimenstional reduction approaches to both visualize and analyze functional genomics data. --- ### Principal Components Analysis (PCA)
--- ### PCA Assume we have $n$ samples and $p$ features which are in the form of a $n \times p$ centered matrix $\mathbf{X}$ where we subtracted the mean across samples of each feature. The unbiased sample covariance matrix is then $$\Sigma_{XX} = \frac{1}{n-1} \mathbf{X}^{T} \mathbf{X}$$ PCA finds a linear transformation $\mathbf{Z} = \mathbf{X} \mathbf{V}$ that diagonalizes $\Sigma_{XX}$. --- ### Singular Value Decomposition $\mathbf{X}$ can be decomposed as follows: $$\mathbf{X} = \mathbf{U} \mathbf{D} \mathbf{V}^{T}$$ where $\mathbf{U}$ and $\mathbf{V}$ are $n \times n$ and $p \times p$ orthogonal matricies, respectively, and $\mathbf{D}$ is a $n \times p$ diagonal matrix. The diagonal elements of $\mathbf{D}$ are the **singular values** of $\mathbf{X}$. The columns of $\mathbf{U}$ and $\mathbf{V}$ are the **left-singular vectors** and **right-singular vectors**. --- ### Singular Value Decomposition The left singular vectors and right singular vectors of $\mathbf{X}$ are the eigenvectors of $\mathbf{X} \mathbf{X}^{T}$ and $\mathbf{X}^{T} \mathbf{X}$. The nonzero singular values of $\mathbf{X}$ are the square roots of the eigenvalues of $\mathbf{X} \mathbf{X}^{T}$ and $\mathbf{X}^{T} \mathbf{X}$. --- ### PCA The covariance matrix of $\mathbf{Z} = \mathbf{X} \mathbf{V}$ where the columns of $\mathbf{V}$ are the right-singular vectors of $\mathbf{X}$ is $$\Sigma_{ZZ} = \frac{1}{n-1} \mathbf{Z}^{T} \mathbf{Z} = \frac{1}{n-1} \mathbf{D}^{T} \mathbf{D} = \frac{1}{n-1} \mathbf{\hat{D}}^{2}$$ where $\mathbf{\hat{D}}^{2}$ is a square diagonal matrix (0s truncated), and we have used the SVD of $\mathbf{X}$, $(\mathbf{UDV}^{T})^{T} = \mathbf{V} \mathbf{D}^{T} \mathbf{U}^{T}$, $\mathbf{V}^{T} \mathbf{V} = \mathbf{I}\_{p}$, $\mathbf{U}^{T} \mathbf{U} = \mathbf{I}\_{n}$. --- ### Non-Negative Matrix Factorization Factorize a matrix $\mathbf{V}$ with all positive elements into a product of two matricies $\mathbf{W}$ (features matrix) and $\mathbf{H}$ (coefficients matrix) subject to the constraint that $\mathbf{W}$ and $\mathbf{H}$ have no negative elements. Formally, $\mathbf{V} \approx \mathbf{W} \mathbf{H}$ such that $\mathbf{W} \geq 0$ and $\mathbf{H} \geq 0$ where $\mathbf{V}$, $\mathbf{W}$ and $\mathbf{H}$ are $m \times n$, $m \times p$, and $p \times n$ matricies, $m$ is the number of samples, $n$ is the number of features, $p \ll n$, and possibly $p \ll m$. --- ### Non-Negative Matrix Factorization To approximate $\mathbf{V} \approx \mathbf{W} \mathbf{H}$, we need a cost function. We'll focus on a Euclidean distance based cost function between $\mathbf{V}$ and $\mathbf{W} \mathbf{H}$ $\Vert \mathbf{V} - \mathbf{WH} \Vert^{2} = \sum_{ij} (V_{ij} - (\mathbf{WH})_{ij})^2$ --- ### Non-Negative Matrix Factorization Minimize cost function using Lee and Seung's multiplicative update rule: 1. Initialize $\mathbf{H}$ and $\mathbf{W}$ with non negative values 2. Update $H_{ij}$ and $W_{ij}$ 1. $H_{ij} \leftarrow H_{ij} \frac{(\mathbf{W}^{T} \mathbf{V})_{ij}}{(\mathbf{W}^{T} \mathbf{WH})\_{ij}}$ 2. $W_{ij} \leftarrow W_{ij} \frac{(\mathbf{V} \mathbf{H}^{T})_{ij}}{(\mathbf{W} \mathbf{HH}^{T})\_{ij}}$ 3. Stop when $H_{ij}$ and $W_{ij}$ don't change within a specified tolerance --- ### Non-Negative Matrix Factorization
--- ### Non-Negative Matrix Factorization
--- ### Non-Negative Matrix Factorization
--- ### T-Distributed Stochastic Neighbor Embedding (T-SNE)
--- ### T-SNE A non-linear dimensional reduction approach that attempts to map a distribution of pairwise distances among $n$ high-dimensional samples from their high dimension to a distribution of pairwise distances of the $n$ samples in a low dimension. --- ### T-SNE Given $n$ high-dimensional samples $\mathbf{x}\_{1},\ldots,\mathbf{x}_{n}$ $p(j|i) = \frac{e^{-\Vert \mathbf{x}\_{j} - \mathbf{x}\_{i} \Vert^2/2\sigma\_{i}^{2}}}{\sum\_{k \ne i} e^{-\Vert \mathbf{x}\_{k} - \mathbf{x}\_{i} \Vert^2/2\sigma\_{i}^{2}}}$ for $i \ne j$ and 0 otherwise. Next, define $p_{ij} = \frac{p(j|i) + p(i|j)}{2n}$ which ensures that $p_{ij} = p_{ji}$ and $\sum_{i,j} p_{ij} = 1$ --- ### T-SNE The goal of t-SNE is to learn a map $\mathbf{X} \mapsto \mathbf{Y}$ such that the $\mathbf{y}\_{1},\ldots,\mathbf{y}_{n}$ are mapped to a lower dimensional space ($d = 2$ or $3$). In the lower dimensional space, the probability of $\mathbf{y}\_{i}$ and $\mathbf{y}\_{j}$ being associated/near each other is assumed to follow a t-distribution with one degree of freedom where for $i \ne j$ $q_{ij} = \frac{(1 + \Vert \mathbf{y}\_{j} - \mathbf{y}\_{j} \Vert^2)^{-1}}{\sum\_{k,l;k \ne l} (1 + \Vert \mathbf{y}\_{k} - \mathbf{y}\_{l} \Vert^2)^{-1}}$ and 0 otherwise. --- ### T-SNE The locations of the $\mathbf{y}\_{i}$ in the lower dimensional space are found by minimizing the Kullback-Leibler divergence between the $p_{ij}$ and $q_{ij}$ distributions $KL(p \Vert q) = \sum_{i,j;i \ne j} p_{ij} \log \frac{p\_{ij}}{q\_{ij}}$ which is a measure of the difference between distributions $p$ and $q$. The minimization of the Kullback-Leibler divergence with respect to $\mathbf{y}_{i}$ is performed using gradient descent. --- ### Uniform Manifold Approximation & Projection (UMAP)
--- ### UMAP
--- ### UMAP: Graph Construction Let $\mathbf{x}\_{i},\ldots,\mathbf{x}\_{n}$ be the input data. For each $\mathbf{x}\_{i}$ compute the $k$ nearest neighbors $\mathbf{x}\_{i1},\ldots,\mathbf{x}\_{ik}$ and $\rho\_{i} = \min \\{ d(\mathbf{x}\_{i},\mathbf{x}\_{ij}) | 1 \le j \le k, d > 0 \\}$ and set $\sigma\_{i}$ using $\sum_{j=1}^{k} e^{-\max(0,d(\mathbf{x}\_{i},\mathbf{x}\_{ij}) - \rho\_{i})/\sigma\_{i}} = \log\_{2}(k)$ --- ### UMAP: Graph Construction
--- ### UMAP: Graph Construction Define a weighted directed graph $\bar{G} = (V,E,w)$ where the verticies $V$ of $\bar{G}$ are the set $X$. Then form the set of directed edges with weights $w\_{h}$ $E = \\{ (\mathbf{x}\_{i},\mathbf{x}\_{ij}) | 1 \le j \le k, 1 \le i \le n \\}$ $w\_{h}(\mathbf{x}\_{i},\mathbf{x}\_{ij}) = e^{-\max(0,d(\mathbf{x}\_{i},\mathbf{x}\_{ij}) - \rho\_{i})/\sigma\_{i}}$ --- ### UMAP: Graph Construction
--- ### UMAP: Graph Construction Combine edges of $\bar{G}$ with adjacency matrix $\mathbf{A}$ into a unified undirected graph $G$ with adjacency matrix $\mathbf{B}$ $\mathbf{B} = \mathbf{A} + \mathbf{A}^{T} - \mathbf{A} \circ \mathbf{A}^{T}$ where $\circ$ is the Haddamard (or pointwise) product. If $A\_{ij}$ is the probability that the directed edge from $\mathbf{x}\_{i}$ to $\mathbf{x}\_{j}$ exists, then $B\_{ij}$ is the probability that at least one of the two directed edges (from $\mathbf{x}\_{i}$ to $\mathbf{x}\_{j}$ and from $\mathbf{x}\_{j}$ to $\mathbf{x}\_{i}$) exists. --- ### UMAP: Graph Construction
--- ### UMAP: Graph Layout Learn a mapping $\mathbf{X} \mapsto \mathbf{Y}$ by first initializing the low dimentional representation using spectral embedding. Spectral embedding is a dimensional reduction approach that maps a connected graph $G$ to a low dimension vector space in such a way that two points that are "close" in the graph are "close" in the low dimensional vector space. --- ### UMAP: Graph Layout Formally, this is done by calculating the eigenvectors of the Laplacian $\mathbf{L} = \mathbf{D} - \mathbf{B}$ of the adjacency matrix $B$ where $\mathbf{D}$ is a diagonal matrix with $D\_{ii} = \sum\_{j} B_{ij}$. Sort the eigenvalues $\lambda_{0},\lambda_{1},\lambda_{2},\ldots$, and the eigenvectors corresponding to $\lambda_{1}$ and $\lambda_{2}$ represent the embedding in a 2-d vector space. --- ### UMAP: Graph Layout Fit $w\_{l}$, a differentiable form of the 2-d weights to exp. Estimate $a$ and $b$ by performing non-linear regression of $w\_{l}(\mathbf{x},\mathbf{y}) = (1 + a (\Vert \mathbf{x} - \mathbf{y} \Vert\_{2}^{2})^{b})^{-1}$ to $\Psi(\mathbf{x},\mathbf{y}) = e^{-( \Vert \mathbf{x} - \mathbf{y} \Vert\_{2} - d_{min} )}$ if $\Vert \mathbf{x} - \mathbf{y} \Vert\_{2}$ is greater than $d_{min}$ and 1 otherwise. --- ### UMAP: Graph Layout The locations of the $\mathbf{y}\_{i}$ in the lower dimensional space are found by minimizing the binary cross entropy loss $\mathcal{L}$ using stochastic gradient descent where the loss is $$\begin{aligned} \mathcal{L} = & -\sum_{b \in B} w\_{h}(b) \log w\_{l}(b) \\\\\\ & + (1 - w\_{h}(b)) \log (1 - w\_{l}(b)) \\\\\\ \end{aligned}$$