Clustering Overview

  • Similarity measure
    • Squared Euclidean distance: places more weight on points that are farther apart
      • $$d(x, y)^2 = \| x - y \|_2^2$$

Feature normalization (e.g., standardization) is important before computing distances, so that no single feature dominates the distance measure.
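
A minimal sketch (the toy values are assumptions) showing why scaling matters: without standardization, the squared Euclidean distance is dominated by the feature with the largest scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (hypothetical values)
X = np.array([[1.0, 1000.0],
              [2.0, 1100.0],
              [1.5, 5000.0]])

def squared_euclidean(x, y):
    # d(x, y)^2 = ||x - y||_2^2
    return np.sum((x - y) ** 2)

print(squared_euclidean(X[0], X[1]))           # dominated by the second feature

X_std = StandardScaler().fit_transform(X)      # zero mean, unit variance per column
print(squared_euclidean(X_std[0], X_std[1]))   # both features now contribute comparably
```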

Prototype-Based Clustering

Prototype: centroid (average of the cluster's points) or medoid (the most representative point).

  • Drawback
    • $$k$$ has to be decided in advance

K-Means

  • Error measure
    • Sum of Squared Errors (SSE) / inertia
      • $$SSE = \sum_{i=1}^{n} \sum_{j=1}^{k} w^{(i,j)} \| x^{(i)} - \mu^{(j)} \|_2^2$$
      • $$\mu^{(j)}$$: representative point for cluster $$j$$
      • $$w^{(i, j)}$$: 1 if $$x^{(i)}$$ in cluster $$j$$, 0 otherwise
  • Advantage
    • Guaranteed to converge
  • Drawback
    • Finding the globally optimal clustering is NP-hard
    • Convergence may be slow
    • The result may be a poor local optimum, depending on the initialization
Algorithm
  1. Randomly initialize $$k$$ centroids
  2. Assign every data point to its nearest centroid
  3. Recalculate each centroid as the mean of its assigned points
  4. Repeat 2–3 until convergence
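
A minimal NumPy sketch of the loop above (the toy data, $$k$$, and the iteration cap are assumptions; empty clusters are not handled):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))   # assumed toy data
k = 3

# 1. Randomly initialize k centroids by picking k distinct data points
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(100):
    # 2. Assign every point to its nearest centroid (squared Euclidean distance)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)   # shape (n, k)
    labels = d2.argmin(axis=1)

    # 3. Recalculate each centroid as the mean of its assigned points
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

    # 4. Repeat 2-3 until the centroids stop moving
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

sse = d2[np.arange(len(X)), labels].sum()   # the SSE / inertia from the formula above
print(sse)
```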

K-Means++

Place the initial centroids far away from each other -> better & more consistent results.

Algorithm
  1. Randomly choose the first centroid and keep it in a centroid set $$M$$
  2. For each sample $$x^{(i)}$$, find the minimum squared distance $$d(x^{(i)}, M)^2$$ to any of the centroids in $$M$$
  3. Randomly select the next centroid $$\mu^{(p)}$$ with a weighted probability distribution:
    1. $$\frac{d(\mu^{(p)}, M)^2}{\sum_i d(x^{(i)}, M)^2}$$
  4. Repeat 2–3 until $$k$$ centroids are chosen
  5. Proceed with standard k-means
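
A sketch of the seeding step only (the toy data is an assumption); scikit-learn's `KMeans` uses this strategy by default via `init='k-means++'`.

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Choose the first centroid uniformly at random
    M = [X[rng.integers(len(X))]]
    while len(M) < k:
        # 2. Minimum squared distance of each sample to the current centroid set M
        d2 = ((X[:, None, :] - np.array(M)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # 3. Sample the next centroid with probability d(x, M)^2 / sum_i d(x_i, M)^2
        M.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(M)   # 5. hand these centroids to the standard k-means loop

X = np.random.default_rng(1).normal(size=(200, 2))   # assumed toy data
print(kmeanspp_init(X, k=3))
```
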
Elbow Method

Choosing the optimal $$k$$: run k-means for increasing values of $$k$$, plot the SSE (inertia) against $$k$$, and pick the $$k$$ at the "elbow", where the SSE stops decreasing sharply.
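
A sketch using scikit-learn (the data and the range of $$k$$ are assumptions); `inertia_` is the fitted SSE.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)   # assumed toy data

sse = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)   # SSE of the fitted clustering

plt.plot(ks, sse, marker='o')
plt.xlabel('k')
plt.ylabel('SSE (inertia)')
plt.show()
```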

Hierarchical Clustering

  • Advantage
    • No need to specify $$k$$
  • Drawback
    • Struggles with complex, non-convex cluster shapes such as the half-moons dataset (depending on the linkage used)

Divisive Hierarchical Clustering

Start with 1 cluster, iteratively split into smaller ones.

Agglomerative Hierarchical Clustering

Start with each sample as its own cluster, iteratively merge until one big cluster remains.

  • Which 2 clusters to merge
    • Single linkage: distance between the two most similar (closest) members
    • Complete linkage: distance between the two most dissimilar (farthest) members

Algorithm (complete linkage)
  1. Compute the distance matrix of all samples
  2. Represent each data point as a singleton cluster
  3. Merge the two closest clusters
    1. Distance measured between the most dissimilar members
  4. Update the distance matrix
  5. Repeat 3–4 until one single cluster remains
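
A sketch with SciPy's hierarchical clustering routines, using complete linkage as in step 3.1 (the toy data and the cut into 3 clusters are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(10, 2))   # assumed toy data

dist = pdist(X)                                  # 1. pairwise distance matrix (condensed)
Z = linkage(dist, method='complete')             # 2.-5. iterative merging, complete linkage
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the merge tree into 3 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would visualize the merge hierarchy
```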

Density-Based Clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

  • Advantage
    • No need to specify $$k$$
    • Can find arbitrarily shaped clusters
    • A notion of noise makes it robust to outliers
  • Drawback
    • Cannot detect meaningful clusters in data of varying density

  • Core point
    • At least MinPts neighboring points fall within radius ε
  • Border point
    • Fewer than MinPts neighbors within ε
    • But lies within radius ε of a core point
  • Noise point
    • Neither a core point nor a border point
Algorithm
  1. Form a separate cluster for each core point or connected group of core points
    1. Core points are connected if they are no farther apart than ε
  2. Assign each border point to the cluster of its corresponding core point
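
A sketch on the half-moons data mentioned above (`eps` and `min_samples` are assumed values corresponding to ε and MinPts):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))   # cluster labels; -1 marks noise points
```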

OPTICS (Ordering Points to Identify the Clustering Structure)

Points are linearly ordered such that spatially close points become neighbors in the ordering.

  • Advantage
    • Detects clusters with varying density
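
A sketch with scikit-learn's `OPTICS` (the toy data and `min_samples` are assumptions); the reachability values taken in the computed ordering form the reachability plot used to extract clusters of varying density.

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
# Two blobs of very different density (assumed toy data)
X = np.vstack([rng.normal(0.0, 0.3, size=(100, 2)),
               rng.normal(5.0, 1.5, size=(100, 2))])

opt = OPTICS(min_samples=10).fit(X)
print(set(opt.labels_))                           # extracted cluster labels, -1 = noise
reachability = opt.reachability_[opt.ordering_]   # reachability plot values, in order
```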

Quality of Clustering

Silhouette Coefficient
  • Cluster cohesion $$a$$: average distance between sample $$x$$ & all other points in the same cluster
  • Cluster separation $$b$$: average distance between sample $$x$$ & all samples in the nearest cluster
  • Silhouette coefficient $$s$$:
    • $$s = \frac{b - a}{\max(a, b)}$$, in the range $$[-1, 1]$$
    • $$1$$: ideal; $$-1$$: bad
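
A sketch with scikit-learn (the data and $$k$$ are assumptions); `silhouette_score` averages $$s$$ over all samples, while `silhouette_samples` returns the per-sample values.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # assumed toy data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))        # mean s over all samples, in [-1, 1]
print(silhouette_samples(X, labels)[:5])  # per-sample silhouette values
```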
