Curse of Dimensionality
- High dimensions
- Average distance between two points becomes large -> data is sparse (see the sketch after this outline)
- Predictions much less reliable -> need more data
- Reduce dimensionality
- Speed up training
- Helps visualization
- Approaches
- Feature selection
- Feature extraction
- Transforms dataset into a new feature space with lower dimensionality
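A minimal sketch (toy data, assumed setup not from the notes) of the sparsity claim above: the average distance between uniformly random points in the unit hypercube grows with the dimensionality.

```python
# Sketch: average pairwise distance grows with dimensionality (assumed toy setup).
import numpy as np

rng = np.random.default_rng(0)
n_points = 1_000

for d in (2, 10, 100, 1_000):
    X = rng.random((n_points, d))                   # points in the unit hypercube [0, 1]^d
    i, j = rng.integers(0, n_points, size=(2, 5_000))
    dists = np.linalg.norm(X[i] - X[j], axis=1)     # 5,000 random pairwise distances
    print(f"d = {d:5d}  mean distance ~ {dists.mean():.2f}")
# The mean distance grows roughly like sqrt(d / 6), so the same amount of data
# covers the space more and more sparsely.
```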
Principal Component Analysis
Feature extraction, unsupervised. Finds the directions of maximum variance (minimum information loss) in the data and projects the data onto a new subspace with equal or fewer dimensions.
- Principal components
- The orthogonal axes of the new subspace
Algorithm
Find $$d \times k$$ matrix $$W$$ to map vector $$x$$ of dimension $$d$$ to new $$k$$-dimensional feature space.
$$z = xW, \quad x \in \mathbb{R}^d,\; z \in \mathbb{R}^k,\; W \in \mathbb{R}^{d \times k}$$
- Standardize $$d$$-dimensional dataset
- Construct covariance matrix
- $$\Sigma = \begin{bmatrix}\sigma_1^2 & \sigma_{12} & \sigma_{13} \\ \sigma_{21} & \sigma_2^2 & \sigma_{23} \\ \sigma_{31} & \sigma_{32} & \sigma_3^2 \end{bmatrix}$$
- Decompose covariance matrix into eigenvectors & eigenvalues
- $$\Sigma v = \lambda v$$
- Eigenvector: principal components
- Eigenvalue: the variance of the data along the corresponding principal component (its magnitude)
- Sort eigenvalues by decreasing order to rank the corresponding eigenvectors
- Select $$k$$ eigenvectors corresponding to the $$k$$ largest eigenvalues
- Construct projection matrix $$W$$ from the top $$k$$ eigenvectors
- Transform $$d$$-dimensional dataset into a new $$k$$-dimensional one
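A minimal NumPy sketch of these steps (random toy data and variable names are assumptions, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))                    # 100 samples, d = 5 features
k = 2                                            # target dimensionality

# 1. Standardize the d-dimensional dataset
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Construct the covariance matrix (d x d)
cov = np.cov(X_std, rowvar=False)

# 3. Decompose it into eigenvalues and eigenvectors (eigh: for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4./5. Sort eigenvalues in decreasing order and keep the top-k eigenvectors
order = np.argsort(eigenvalues)[::-1]
W = eigenvectors[:, order[:k]]                   # projection matrix, shape (d, k)

# 6. Transform the data into the new k-dimensional subspace: z = xW
Z = X_std @ W                                    # shape (100, k)

explained = eigenvalues[order[:k]].sum() / eigenvalues.sum()
print(Z.shape, f"variance explained: {explained:.2%}")
```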
Variance
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x^{(i)} - \mu\right)^2$$
Covariance
$$\sigma_{jk} = \frac{1}{n}\sum_{i=1}^{n}\left(x_j^{(i)} - \mu_j\right)\left(x_k^{(i)} - \mu_k\right)$$
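As a quick sanity check (toy numbers, assumed example), the elementwise formula matches `np.cov` when the $$\frac{1}{n}$$ normalization is used (`bias=True`):

```python
import numpy as np

X = np.array([[2.0, 0.0],
              [0.0, 2.0],
              [3.0, 6.0],
              [5.0, 4.0]])                        # n = 4 samples, 2 features
mu = X.mean(axis=0)
n = X.shape[0]

# Covariance of feature 0 and feature 1 via the formula above
sigma_01 = (X[:, 0] - mu[0]) @ (X[:, 1] - mu[1]) / n

print(sigma_01, np.cov(X, rowvar=False, bias=True)[0, 1])   # both 2.0
```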
PCA for Compression
Lossy compression: only keep the information that has the most effect on how the image looks.
Reconstruct with $$\hat{x} = zW^\top$$: $$W$$ is not square, so it has no true inverse, but its columns are orthonormal ($$W^\top W = I$$), so $$W^\top$$ projects back to the original space -> the reconstruction is approximate (lossy).
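A minimal sketch of compression and approximate reconstruction (random toy data, assumed setup):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))   # correlated features
X_centered = X - X.mean(axis=0)

# Projection matrix W: top-k eigenvectors of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))
W = eigenvectors[:, np.argsort(eigenvalues)[::-1][:3]]       # k = 3

Z = X_centered @ W          # compress:   (200, 10) -> (200, 3)
X_rec = Z @ W.T             # decompress: (200, 3)  -> (200, 10), approximate

print(f"mean squared reconstruction error: {np.mean((X_centered - X_rec) ** 2):.4f}")
```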