K-Means Algorithm
Init K cluster centroids mu[1]..mu[K] in R^n
Repeat:
    # Cluster assignment step
    for i = 1..m:
        c[i] := index (1..K) of cluster centroid closest to x[i]
    # Move centroid step
    for k = 1..K:
        mu[k] := average (mean) of points assigned to cluster k
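A minimal runnable sketch of this loop in Octave/MATLAB. The function name run_kmeans, the m x n data matrix X, and the max_iters budget are assumptions for illustration, not part of the original notes:

```matlab
function [c, mu] = run_kmeans(X, K, max_iters)
  m = size(X, 1);
  mu = X(randperm(m, K), :);               % init centroids to K random examples
  c = zeros(1, m);
  for iter = 1:max_iters
    % Cluster assignment step: index of the closest centroid for each example
    for i = 1:m
      [~, c(i)] = min(sum((mu - X(i, :)).^2, 2));
    end
    % Move centroid step: mean of the points assigned to each cluster
    for k = 1:K
      if any(c == k)                       % guard against empty clusters
        mu(k, :) = mean(X(c == k, :), 1);
      end
    end
  end
end
```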
- $$c^{(i)}$$: index of the cluster to which $$x^{(i)}$$ is assigned
- $$\mu_k$$: cluster centroid $$k$$
- $$\mu_{c^{(i)}}$$: cluster centroid of cluster to which example $$x^{(i)}$$ has been assigned
- Optimization objective:
$$
J(c^{(1)}, ..., c^{(m)}, \mu_1, ..., \mu_K) = \frac{1}{m} \sum_{i=1}^{m} \|x^{(i)} - \mu_{c^{(i)}}\|^2
$$
$$
\min_{c^{(1)}, ..., c^{(m)}, \mu_1, ..., \mu_K} J(c^{(1)}, ..., c^{(m)}, \mu_1, ..., \mu_K)
$$
The cluster assignment step minimizes $$J$$ with respect to $$c^{(1)}, ..., c^{(m)}$$ (holding the $$\mu_k$$ fixed), and the move centroid step minimizes $$J$$ with respect to $$\mu_1, ..., \mu_K$$.
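A hedged sketch of evaluating this distortion in Octave/MATLAB, assuming the assignments c (1 x m) and centroids mu (K x n) produced by the sketch above:

```matlab
% Average squared distance of each x(i) to its assigned centroid mu(c(i))
J = mean(sum((X - mu(c, :)).^2, 2));
```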
Random Initialization
Randomly pick $$K$$ training examples (requires $$K < m$$) and set $$\mu_1, ..., \mu_K$$ equal to these examples.
for i = 1..100:
    Randomly initialize mu[1]..mu[K]
    Run K-means to get c[1]..c[m] and mu[1]..mu[K]
    Compute the cost function (distortion) J(c, mu)
Pick the clustering with the lowest cost J
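A sketch of this loop, reusing the hypothetical run_kmeans from above; the 10-iteration budget per run is an arbitrary assumption:

```matlab
best_J = Inf;
for t = 1:100
  [c, mu] = run_kmeans(X, K, 10);        % random init happens inside
  J = mean(sum((X - mu(c, :)).^2, 2));   % distortion J(c, mu) as above
  if J < best_J
    best_J = J;  best_c = c;  best_mu = mu;
  end
end
```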
Choosing K
- Elbow method: plot the cost $$J$$ as a function of $$K$$ and pick the $$K$$ at the "elbow", where the decrease in $$J$$ flattens out (see the sketch below)
- Consider the downstream task: pick the $$K$$ that best serves the later purpose (e.g., 3 vs. 5 T-shirt sizes)
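A hedged sketch of the elbow method; the range K = 1..8 is an arbitrary assumption, and each run should ideally keep the best of several random initializations:

```matlab
Ks = 1:8;
Js = zeros(size(Ks));
for K = Ks
  [c, mu] = run_kmeans(X, K, 10);           % hypothetical helper from above
  Js(K) = mean(sum((X - mu(c, :)).^2, 2));  % distortion for this K
end
plot(Ks, Js, '-o'); xlabel('K'); ylabel('J');  % look for the "elbow"
```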
Dimensionality Reduction
Principal Component Analysis
Reduce the data from $$n$$ dimensions to $$k$$ dimensions: find $$k$$ vectors $$u^{(1)}, ..., u^{(k)}$$ onto which to project the data so as to minimize the projection error.
Algorithm
Data preprocessing
- Feature scaling/mean normalization:
$$
\mu_j = \frac{1}{m} \sum_{i=1}^m x_j^{(i)}
$$
Replace $$x_j^{(i)}$$ with $$x_j^{(i)} - \mu_j$$.
(Optional) If features are on different scales, replace $$x_j^{(i)}$$ with $$\frac{x_j^{(i)} - \mu_j}{s_j}$$, where $$s_j$$ is the standard deviation (or range) of feature $$j$$.
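A minimal sketch of this preprocessing in Octave/MATLAB, assuming X is an m x n data matrix with one example per row (mu_feat avoids clashing with the centroid variable above):

```matlab
mu_feat = mean(X);               % 1 x n vector of feature means mu_j
s = std(X);                      % 1 x n vector of scales s_j (std. deviations)
X_norm = (X - mu_feat) ./ s;     % mean-normalized, scaled features
```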
Reduce Dimensionality
- Compute covariance matrix:
$$
\Sigma = \frac{1}{m} \sum_{i=1}^m (x^{(i)})(x^{(i)})^T
$$
Sigma = (1/m) * X' * X;  % X is m x n (one example per row), so Sigma is n x n
- Compute eigenvectors:
[U, S, V] = svd(Sigma);  % Singular Value Decomposition; U in R^(n x n) has the eigenvectors u^(1)..u^(n) as columns, S in R^(n x n) is diagonal
- Take $$u^{(1)}, ..., u^{(k)}$$ (the first $$k$$ columns of $$U$$) as $$U_{reduce}$$, an $$n \times k$$ matrix
Ureduce = U(:, 1:k);
- Project each example: $$z = U_{reduce}^T x$$, with $$z \in R^k$$ (vectorized sketch below)
z = Ureduce' * x;
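Tying the steps above together, a hedged sketch that vectorizes the projection over all m examples at once, assuming X_norm from the preprocessing step:

```matlab
Sigma = (1/m) * (X_norm' * X_norm);  % n x n covariance matrix
[U, S, V] = svd(Sigma);              % columns of U are the eigenvectors
Ureduce = U(:, 1:k);                 % n x k: top k principal components
Z = X_norm * Ureduce;                % m x k: row i is z^(i)' = x^(i)' * Ureduce
```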
Reconstruction from Compressed Representation
$$
x_{approx} = U_{reduce} z
$$
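A sketch of the reconstruction, continuing the variable names above:

```matlab
x_approx = Ureduce * z;    % n x 1 approximation of a single example x
X_approx = Z * Ureduce';   % m x n: reconstruct all examples at once
```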
Choosing k (number of principal components)
- Average squared projection error: $$\frac{1}{m} \sum_{i=1}^m \|x^{(i)} - x_{approx}^{(i)}\|^2$$
- Total variation in the data: $$\frac{1}{m} \sum_{i=1}^m \|x^{(i)}\|^2$$
Choose $$k$$ to be the smallest value s.t. $$\frac{\frac{1}{m} \sum_{i=1}^m \|x^{(i)} - x_{approx}^{(i)}\|^2}{\frac{1}{m} \sum_{i=1}^m \|x^{(i)}\|^2} = 1 - \frac{\sum_{i=1}^k S_{ii}}{\sum_{i=1}^n S_{ii}} \le 0.01$$ (99% of variance retained).
Algorithm
[U, S, V] = svd(Sigma)  # run SVD once; S does not depend on k
k = 1
Repeat:
    If (S[1,1] + ... + S[k,k]) / (S[1,1] + ... + S[n,n]) >= 0.99: stop  # 99% of variance retained
    k := k + 1
Return this k (the smallest k retaining >= 99% of the variance); then compute Ureduce, z, and x_approx with it
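A hedged sketch of this check in Octave/MATLAB: svd(Sigma) runs once, and the diagonal of S gives the retained variance for every k at no extra cost:

```matlab
sv = diag(S);                      % singular values S(1,1), ..., S(n,n)
retained = cumsum(sv) / sum(sv);   % variance retained for k = 1..n
k = find(retained >= 0.99, 1);     % smallest k retaining >= 99% of the variance
```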
Implementation
- Run PCA (the mapping from $$x^{(i)}$$ to $$z^{(i)}$$, i.e. $$\mu_j$$, $$s_j$$, and $$U_{reduce}$$) only on the training set; the same mapping can then be applied to $$x_{cv}$$ and $$x_{test}$$ (see the sketch after this list)
- Use cases
    - Compression
        - Reduce the memory/disk needed to store the data
        - Speed up the learning algorithm
    - Visualization (reduce to $$k = 2$$ or $$3$$ so the data can be plotted)
- Invalid use cases
    - Preventing overfitting (use regularization instead)
    - Making PCA a default part of an ML system's design (run with the raw data first; add PCA only if needed)
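As noted in the first bullet above, the mapping learned on the training set is reused unchanged on new data; a minimal sketch with the variable names from the earlier sketches (X_test is hypothetical):

```matlab
Z_test = ((X_test - mu_feat) ./ s) * Ureduce;  % apply the training-set mapping
```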