K-Nearest Neighbors

The Simplest Classifier

  • Assign a new sample the class of the most similar labelled samples
  • Instance-based: the model simply stores the training instances
    • Non-parametric learning: no fixed set of parameters is fitted

Algorithm

  • A training dataset of samples already classified
  • For each new sample, find the k nearest records
    • Assign it the majority class among those k nearest neighbors
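The steps above can be sketched in a few lines of plain Python. The function name `knn_predict`, the toy dataset, and the use of Euclidean distance are illustrative choices, not prescribed by the notes:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Euclidean distance from the query to every training record
    dists = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    dists.sort(key=lambda d: d[0])
    # Majority class among the k closest neighbors
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy dataset: two well-separated clusters
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_X, train_y, (1.1, 1.0), k=3))  # → a
```

Note there is no training phase: all work happens at prediction time, which is what "instance-based" means in practice.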

Choosing k

  • Bias-variance tradeoff
    • High variance (k too small): overfitting
    • High bias (k too large): underfitting
  • Common practice
    • $$k = \sqrt{\text{size of training sample}}$$
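As a quick sketch of the rule of thumb above (the sample size 150 is arbitrary; rounding to an odd k to reduce voting ties is a common convention, not part of the rule itself):

```python
import math

n = 150                   # hypothetical training-set size
k = round(math.sqrt(n))   # rule of thumb: k ≈ √n
if k % 2 == 0:
    k += 1                # prefer odd k to avoid tied votes in binary problems
print(k)                  # √150 ≈ 12.25 → 12 → 13
```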

Data Normalization

$$ X' = \frac{X - \min(X)}{\max(X) - \min(X)} $$

  • Problem: min/max values might change in the future
  • Fix: z-score standardization

$$ X' = \frac{X - \mu}{\sigma} $$
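Both normalization formulas can be sketched directly from their definitions. The sample data is made up for illustration; `z_score` uses the population standard deviation here, though the sample version (dividing by n − 1) is also common:

```python
def min_max(xs):
    """Rescale to [0, 1] using the observed min and max."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Center on the mean and scale by the (population) standard deviation."""
    mu = sum(xs) / len(xs)
    sigma = (sum((x - mu) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mu) / sigma for x in xs]

data = [10.0, 20.0, 30.0, 40.0, 50.0]
print(min_max(data))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(z_score(data))  # mean 30, σ ≈ 14.14
```

Unlike min-max scaling, z-scores stay well defined when future values fall outside the range seen in training.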

Nominal/Categorical Features

  • Dummy coding
    • Same as one-hot, but leave one out as base variable
  • One-hot coding
  • Ordinal coding
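The three encodings can be sketched without any library. The category lists and sample values are illustrative; in practice a library such as pandas or scikit-learn would handle this:

```python
def one_hot(values, categories):
    """One binary column per category; exactly one 1 per row."""
    return [[1 if v == c else 0 for c in categories] for v in values]

def dummy(values, categories):
    """Same as one-hot, but drop the first category as the base level."""
    return [row[1:] for row in one_hot(values, categories)]

def ordinal(values, order):
    """Map ordered categories to integer ranks."""
    rank = {c: i for i, c in enumerate(order)}
    return [rank[v] for v in values]

colors = ["red", "green", "blue"]
print(one_hot(["green", "red"], colors))  # [[0, 1, 0], [1, 0, 0]]
print(dummy(["green", "red"], colors))    # [[1, 0], [0, 0]]
print(ordinal(["low", "high", "mid"], ["low", "mid", "high"]))  # [0, 2, 1]
```

Ordinal coding only makes sense when the categories have a natural order; for nominal features like color, one-hot or dummy coding avoids implying a false ranking.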
