Clustering Overview
- Similarity measure
- Squared Euclidean distance: places greater weight on points that are farther apart
- $$d(x, y)^2 = \lVert x - y \rVert_2^2$$
Feature normalization is important; otherwise the distance is dominated by features with large scales.
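As a quick illustration of why scaling matters, here is a minimal NumPy sketch with made-up values (not from the source):

```python
import numpy as np

# Two samples with features on very different scales (illustrative values)
x = np.array([25.0, 50_000.0])   # e.g. age, income
y = np.array([30.0, 52_000.0])

# Squared Euclidean distance: dominated entirely by the large-scale feature
d2 = np.sum((x - y) ** 2)

# After min-max scaling each feature, both features contribute comparably
X = np.vstack([x, y])
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
d2_scaled = np.sum((X_scaled[0] - X_scaled[1]) ** 2)

print(d2, d2_scaled)
```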
Prototype-Based Clustering
Prototype: centroid (the cluster mean) or medoid (the most representative actual data point); see the sketch below.
- Drawback
- $$k$$ has to be decided in advance
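A minimal NumPy sketch contrasting the two prototype types on a toy cluster (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))   # a toy cluster of 10 points

# Centroid: the feature-wise mean (usually not an actual data point)
centroid = X.mean(axis=0)

# Medoid: the actual data point with the smallest total distance to all others
pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # (10, 10) distance matrix
medoid = X[pairwise.sum(axis=1).argmin()]

print(centroid, medoid)
```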
K-Means
- Error measure
- Sum of Squared Errors (SSE) / inertia
- $$SSE = \sum_{i=1}^{n} \sum_{j=1}^{k} w^{(i,j)} \lVert x^{(i)} - \mu^{(j)} \rVert_2^2$$
- $$\mu^{(j)}$$: representative point for cluster $$j$$
- $$w^{(i, j)}$$: 1 if $$x^{(i)}$$ in cluster $$j$$, 0 otherwise
- Advantage
- Guaranteed to converge (SSE decreases monotonically)
- Drawback
- Finding the globally optimal clustering is NP-hard
- Convergence may be slow
- Resulting clustering may be poor (local optimum, sensitive to initialization)
Algorithm
- Randomly initialize $$k$$ centroids
- Assign every data point to its nearest centroid
- Recompute each centroid as the mean of its assigned points
- Repeat steps 2 and 3 until convergence (see the sketch below)
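A minimal NumPy sketch of the steps above (plain Lloyd's algorithm); `kmeans` and its parameters are illustrative names, not a library API:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm following the steps above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # 1. Random init: pick k distinct samples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign every data point to its nearest centroid
        labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1).argmin(axis=1)
        # 3. Recompute each centroid as the mean of its assigned points
        #    (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and SSE (inertia) under the converged centroids
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1).argmin(axis=1)
    sse = np.sum((X - centroids[labels]) ** 2)
    return labels, centroids, sse
```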
K-Means++
Place initial centroids far away from each other -> better & more consistent results.
Algorithm
- Randomly choose the first centroid, keep in a centroid set $$M$$
- For each sample $$x^{(i)}$$, find the minimum squared distance to any of the centroids in $$M$$
- Randomly select the next centroid $$\mu^{(p)}$$ with a weighted probability distribution:
- $$\frac{d(\mu^{(p)}, M)^2}{\sum_i d(x^{(i)}, M)^2}$$
- Repeat steps 2 and 3 until $$k$$ centroids are chosen
- Proceed with k-means
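A possible NumPy sketch of this seeding scheme; `kmeans_pp_init` is a hypothetical helper whose output would be passed to a regular k-means run:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: spread the initial centroids out (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # 1. Randomly choose the first centroid and put it in the centroid set M
    M = [X[rng.integers(len(X))]]
    while len(M) < k:
        # 2. Minimum squared distance of every sample to the centroids chosen so far
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(M)[None, :, :], axis=-1) ** 2, axis=1
        )
        # 3. Sample the next centroid with probability d(x, M)^2 / sum_i d(x_i, M)^2
        M.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(M)  # 4. proceed with standard k-means from these centroids
```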
Elbow Method
Choosing the optimal $$k$$: plot the SSE (inertia) for a range of $$k$$ values; the "elbow" where the decrease flattens out suggests a good choice.
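A typical way to produce the elbow plot with scikit-learn, assuming a toy `make_blobs` dataset and an arbitrary range of $$k$$ values:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Fit k-means for a range of k and record the SSE (exposed as inertia_)
ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("k")
plt.ylabel("SSE (inertia)")
plt.show()  # the bend ("elbow") where the curve flattens suggests a reasonable k
```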
Hierarchical Clustering
- Advantage
- No need to specify $$k$$
- Drawback
- Like k-means, struggles with non-convex cluster shapes such as the half-moons dataset
Divisive Hierarchical Clustering
Start with 1 cluster, iteratively split into smaller ones.
Agglomerative Hierarchical Clustering
Start with each data point as its own (singleton) cluster, iteratively merge until one big cluster remains.
- Which 2 to merge
- Single linkage: distance between the two most similar (closest) members
- Complete linkage: distance between the two most dissimilar members
Algorithm
- Compute distance matrix of all samples
- Represent each data point as a singleton cluster
- Merge the two closest clusters (complete linkage: based on the distance between their most dissimilar members)
- Update the distance matrix
- Repeat steps 3 and 4 until a single cluster remains
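One way to run complete-linkage agglomerative clustering with SciPy on a toy `make_blobs` dataset (parameters are illustrative); `fcluster` cuts the resulting tree into flat clusters:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

# Complete linkage: cluster distance = distance between the most dissimilar members
Z = linkage(X, method="complete", metric="euclidean")

dendrogram(Z)   # visualizes the full merge hierarchy
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 flat clusters
```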
Density-Based Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- Advantage
- No need to specify $$k$$
- Can find arbitrarily shaped clusters
- A notion of noise makes it robust to outliers
- Drawback
- Cannot detect meaningful clusters in data of varying density
- Core point: at least MinPts neighboring points fall within radius ε
- Border point: fewer than MinPts points within ε, but lies within radius ε of a core point
- Noise point: all other points (neither core nor border)
Algorithm
- Form a separate cluster for each core point
- Connect core points if they are not further away from each other than ε
- Assign each border point to the cluster of its corresponding core point
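A typical scikit-learn usage on the half-moons dataset, where k-means and most linkage criteria fail; `eps` and `min_samples` map to ε and MinPts, and the concrete values are illustrative:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps corresponds to the radius ε, min_samples to MinPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_   # cluster index per sample; -1 marks noise points
print(set(labels))    # typically {0, 1} here: the two half-moons
```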
OPTICS (Ordering Points to Identify the Clustering Structure)
Points are linearly ordered such that spatially closest points become neighbors in the ordering.
- Advantage
- Detects clusters with varying density
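A sketch of scikit-learn's OPTICS on two blobs of different density (dataset and parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Two blobs with very different spreads, where a single DBSCAN ε would struggle
X_dense, _ = make_blobs(n_samples=150, centers=[[0, 0]], cluster_std=0.3, random_state=0)
X_sparse, _ = make_blobs(n_samples=150, centers=[[6, 6]], cluster_std=1.5, random_state=0)
X = np.vstack([X_dense, X_sparse])

opt = OPTICS(min_samples=10).fit(X)
labels = opt.labels_                              # -1 marks noise, as in DBSCAN
reachability = opt.reachability_[opt.ordering_]   # valleys in this ordering correspond to clusters
```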
Quality of Clustering
Silhouette Coefficient
- Cluster cohesion $$a$$: average distance between sample $$x$$ & all other points in the same cluster
- Cluster separation $$b$$: average distance between sample $$x$$ & all samples in the nearest cluster
- Silhouette coefficient $$s$$:
- $$s = \frac{b - a}{\max(b, a)}$$ in the range of $$[-1, 1]$$
- Close to $$1$$: ideal (well separated); close to $$0$$: on the boundary between clusters; close to $$-1$$: likely assigned to the wrong cluster
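A short scikit-learn example computing the silhouette coefficient for a k-means clustering of a toy dataset (all parameters are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

s_mean = silhouette_score(X, labels)     # average s over all samples
s_all = silhouette_samples(X, labels)    # per-sample s = (b - a) / max(a, b)
print(s_mean, s_all.min(), s_all.max())
```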