Clustering Overview
- Similarity measure
- Squared Euclidean distance: places greater weight on points that are farther apart
- $$d(x, y)^2 = \lVert x - y \rVert_2^2$$
Feature normalization is important; otherwise the distance is dominated by features with large scales.
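As a quick illustration of why scaling matters, here is a minimal NumPy sketch with made-up values (not from the source):

```python
import numpy as np

# Two samples with features on very different scales (illustrative values)
x = np.array([25.0, 50_000.0])   # e.g. age, income
y = np.array([30.0, 52_000.0])

# Squared Euclidean distance: dominated entirely by the large-scale feature
d2 = np.sum((x - y) ** 2)

# After min-max scaling each feature, both features contribute comparably
X = np.vstack([x, y])
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
d2_scaled = np.sum((X_scaled[0] - X_scaled[1]) ** 2)

print(d2, d2_scaled)
```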
Prototype-Based Clustering
Prototype: centroid (the cluster mean) or medoid (the most representative actual data point); see the sketch below.
- Drawback
- $$k$$ has to be decided in advance
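A minimal NumPy sketch contrasting the two prototype types on a toy cluster (all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))   # a toy cluster of 10 points

# Centroid: the feature-wise mean (usually not an actual data point)
centroid = X.mean(axis=0)

# Medoid: the actual data point with the smallest total distance to all others
pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # (10, 10) distance matrix
medoid = X[pairwise.sum(axis=1).argmin()]

print(centroid, medoid)
```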
K-Means
- Error measure
- Sum of Squared Errors (SSE) / inertia
- $$SSE = \sum_{i=1}^{n} \sum_{j=1}^{k} w^{(i,j)} \lVert x^{(i)} - \mu^{(j)} \rVert_2^2$$
- $$\mu^{(j)}$$: representative point for cluster $$j$$
- $$w^{(i, j)}$$: 1 if $$x^{(i)}$$ in cluster $$j$$, 0 otherwise
- Advantage
- Guaranteed to converge (SSE decreases monotonically)
- Drawback
- Finding the globally optimal clustering is NP-hard
- Convergence may be slow
- Resulting clustering may be poor (local optimum, sensitive to initialization)
Algorithm
- Randomly initialize $$k$$ centroids
- Assign every data point to its nearest centroid
- Recompute each centroid as the mean of its assigned points
- Repeat steps 2 and 3 until convergence (see the sketch below)
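A minimal NumPy sketch of the steps above (plain Lloyd's algorithm); `kmeans` and its parameters are illustrative names, not a library API:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm following the steps above (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # 1. Random init: pick k distinct samples as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign every data point to its nearest centroid
        labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1).argmin(axis=1)
        # 3. Recompute each centroid as the mean of its assigned points
        #    (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Final assignment and SSE (inertia) under the converged centroids
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1).argmin(axis=1)
    sse = np.sum((X - centroids[labels]) ** 2)
    return labels, centroids, sse
```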
K-Means++
Place initial centroids far away from each other -> better & more consistent results.
Algorithm
- Randomly choose the first centroid, keep in a centroid set $$M$$
- For each sample $$x^{(i)}$$, find the minimum squared distance to any of the centroids in $$M$$
- Randomly select the next centroid $$\mu^{(p)}$$ with a weighted probability distribution:
- $$\frac{d(\mu^{(p)}, M)^2}{\sum_i d(x^{(i)}, M)^2}$$
- Repeat steps 2 and 3 until $$k$$ centroids are chosen
- Proceed with k-means
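A possible NumPy sketch of this seeding scheme; `kmeans_pp_init` is a hypothetical helper whose output would be passed to a regular k-means run:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding: spread the initial centroids out (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # 1. Randomly choose the first centroid and put it in the centroid set M
    M = [X[rng.integers(len(X))]]
    while len(M) < k:
        # 2. Minimum squared distance of every sample to the centroids chosen so far
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(M)[None, :, :], axis=-1) ** 2, axis=1
        )
        # 3. Sample the next centroid with probability d(x, M)^2 / sum_i d(x_i, M)^2
        M.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(M)  # 4. proceed with standard k-means from these centroids
```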
Elbow Method
Choosing the optimal $$k$$: plot the SSE (inertia) for a range of $$k$$ values; the "elbow" where the decrease flattens out suggests a good choice.
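A typical way to produce the elbow plot with scikit-learn, assuming a toy `make_blobs` dataset and an arbitrary range of $$k$$ values:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Fit k-means for a range of k and record the SSE (exposed as inertia_)
ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("k")
plt.ylabel("SSE (inertia)")
plt.show()  # the bend ("elbow") where the curve flattens suggests a reasonable k
```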
Hierarchical Clustering
- Advantage
- No need to specify $$k$$
- Drawback
- Like k-means, struggles with non-convex cluster shapes such as the half-moons dataset
Divisive Hierarchical Clustering
Start with 1 cluster, iteratively split into smaller ones.
Agglomerative Hierarchical Clustering
Start with each data point as its own (singleton) cluster, iteratively merge until one big cluster remains.
- Which 2 to merge
- Single linkage: distance between the two most similar (closest) members
- Complete linkage: distance between the two most dissimilar members
Algorithm
- Compute distance matrix of all samples
- Represent each data point as a singleton cluster
- Merge the two closest clusters (complete linkage: based on the distance between their most dissimilar members)
- Update the distance matrix
- Repeat steps 3 and 4 until a single cluster remains
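One way to run complete-linkage agglomerative clustering with SciPy on a toy `make_blobs` dataset (parameters are illustrative); `fcluster` cuts the resulting tree into flat clusters:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

# Complete linkage: cluster distance = distance between the most dissimilar members
Z = linkage(X, method="complete", metric="euclidean")

dendrogram(Z)   # visualizes the full merge hierarchy
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 flat clusters
```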
Density-Based Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
- Advantage
- No need to specify $$k$$
- Can find arbitrarily shaped clusters
- A notion of noise makes it robust to outliers
- Drawback
- Cannot detect meaningful clusters in data of varying density
- Core point: at least MinPts neighboring points fall within radius ε
- Border point: fewer than MinPts points within ε, but lies within radius ε of a core point
- Noise point: all other points (neither core nor border)
Algorithm
- Form a separate cluster for each core point
- Connect core points if they are not further away from each other than ε
- Assign each border point to the cluster of its corresponding core point
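A typical scikit-learn usage on the half-moons dataset, where k-means and most linkage criteria fail; `eps` and `min_samples` map to ε and MinPts, and the concrete values are illustrative:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps corresponds to the radius ε, min_samples to MinPts
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_   # cluster index per sample; -1 marks noise points
print(set(labels))    # typically {0, 1} here: the two half-moons
```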
OPTICS (Ordering Points to Identify the Clustering Structure)
Points are linearly ordered such that spatially closest points become neighbors in the ordering.
- Advantage
- Detects clusters with varying density
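A sketch of scikit-learn's OPTICS on two blobs of different density (dataset and parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Two blobs with very different spreads, where a single DBSCAN ε would struggle
X_dense, _ = make_blobs(n_samples=150, centers=[[0, 0]], cluster_std=0.3, random_state=0)
X_sparse, _ = make_blobs(n_samples=150, centers=[[6, 6]], cluster_std=1.5, random_state=0)
X = np.vstack([X_dense, X_sparse])

opt = OPTICS(min_samples=10).fit(X)
labels = opt.labels_                              # -1 marks noise, as in DBSCAN
reachability = opt.reachability_[opt.ordering_]   # valleys in this ordering correspond to clusters
```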
Quality of Clustering
Silhouette Coefficient
- Cluster cohesion $$a$$: average distance between sample $$x$$ & all other points in the same cluster
- Cluster separation $$b$$: average distance between sample $$x$$ & all samples in the nearest cluster
- Silhouette coefficient $$s$$:
- $$s = \frac{b - a}{\max(b, a)}$$ in the range of $$[-1, 1]$$
- Close to $$1$$: ideal (well separated); close to $$0$$: on the boundary between clusters; close to $$-1$$: likely assigned to the wrong cluster
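A short scikit-learn example computing the silhouette coefficient for a k-means clustering of a toy dataset (all parameters are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

s_mean = silhouette_score(X, labels)     # average s over all samples
s_all = silhouette_samples(X, labels)    # per-sample s = (b - a) / max(a, b)
print(s_mean, s_all.min(), s_all.max())
```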