K-Nearest Neighbors
The Simplest Classifier
- Assigns a new sample the class of the most similar labelled samples
- Instance-based
- Non-parametric learning
Algorithm
- A training dataset of samples already classified
- For each new sample, find its k nearest records
- Assign it the majority class of those k nearest neighbors
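The steps above can be sketched in a few lines of plain Python. The function name `knn_predict` and the toy dataset are illustrative, not from the original notes:

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, x, k):
    """Classify x by majority vote among the k nearest training samples."""
    # Euclidean distance from x to every training sample
    dists = [math.dist(x, row) for row in train_X]
    # Indices of the k smallest distances
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    # Majority class among those k neighbors
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy dataset: two well-separated classes on a line
train_X = [(1.0,), (1.5,), (2.0,), (8.0,), (8.5,), (9.0,)]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_X, train_y, (2.5,), k=3))  # → a
```

Note that there is no training step beyond storing the data, which is exactly what "instance-based" and "non-parametric" mean here.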
Choosing k
- Bias-variance tradeoff
- Small k → high variance: overfitting
- Large k → high bias: underfitting
- Common practice
- $$k = \sqrt{\text{size of training sample}}$$
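A small sketch of this rule of thumb; rounding to an odd k (an assumption added here, not in the notes) avoids tie votes in two-class problems:

```python
import math

def heuristic_k(n_train):
    """Rule-of-thumb k = sqrt(n), rounded and nudged to be odd."""
    k = max(1, round(math.sqrt(n_train)))
    # Make k odd so a two-class majority vote cannot tie
    return k if k % 2 == 1 else k + 1

print(heuristic_k(100))  # → 11
```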
Data Normalization
$$X' = \frac{X - \min(X)}{\max(X) - \min(X)}$$
- Problem: min/max values might change in the future
Fix: z-score standardization
$$X' = \frac{X - \mu}{\sigma}$$
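Both formulas can be sketched directly; the function names are illustrative. Min-max scaling maps values into [0, 1], while the z-score centers them at mean 0 with unit standard deviation:

```python
def min_max_scale(xs):
    """Min-max normalization: rescale values into [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Z-score standardization: subtract the mean, divide by the std dev."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = (sum((x - mu) ** 2 for x in xs) / n) ** 0.5
    return [(x - mu) / sigma for x in xs]

vals = [10.0, 20.0, 30.0]
print(min_max_scale(vals))  # → [0.0, 0.5, 1.0]
print(z_score(vals))        # centered at 0
```

The z-score avoids the stated problem: it does not depend on the observed min/max, so a future out-of-range value does not break the scaling.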
Nominal/Categorical Features
- One-hot coding: one binary indicator column per category
- Dummy coding: same as one-hot, but leave one category out as the baseline
- Ordinal coding: integer codes for categories with a natural order
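A minimal sketch of the three encodings, assuming categories are sorted alphabetically for column order and, for ordinal coding, that the caller supplies the category order (both assumptions added here):

```python
def one_hot(values):
    """One-hot: one binary column per category (columns sorted alphabetically)."""
    cats = sorted(set(values))
    return [[int(v == c) for c in cats] for v in values]

def dummy(values):
    """Dummy coding: one-hot with the first category dropped as the baseline."""
    return [row[1:] for row in one_hot(values)]

def ordinal(values, order):
    """Ordinal: integer rank according to a caller-supplied category order."""
    rank = {c: i for i, c in enumerate(order)}
    return [rank[v] for v in values]

colors = ["red", "green", "blue", "green"]
print(one_hot(colors))   # columns: blue, green, red
print(dummy(colors))     # blue is the baseline (all-zeros row)
print(ordinal(["low", "high", "mid"], ["low", "mid", "high"]))  # → [0, 2, 1]
```

Dummy coding matters when features are fed to models sensitive to collinearity; for plain kNN distance computations, one-hot is usually fine.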