Machine Learning
Gives computers the ability to learn without being explicitly programmed.
Machine Learning vs. Data Mining vs. Artificial Intelligence
- Overlap significantly
- ML: learning properties from data and adapting to new data
- DM: discovering unknown properties in data
- AI: machines performing tasks that are characteristic of human intelligence
- ML is a way of achieving AI
- In the 1980s, AI was all about expert systems
- Expert system = knowledge base + inference engine
- Problems
- Knowledge created by hand
- Things in real world not always true/false
- ML improvement
- Knowledge learned from data
- Probability to represent the real world
Types
Human supervision or not
- Supervised learning
- k-NN
- Linear regression
- Logistic regression
- SVM
- Decision trees & random forests
- Neural networks
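A minimal supervised-learning sketch (scikit-learn assumed; Iris is a stock dataset, not from these notes):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Labeled data: features X plus known targets y -- the "supervision"
X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000)   # one of the supervised models listed above
clf.fit(X, y)
train_accuracy = clf.score(X, y)          # fraction of correct predictions
```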
- Unsupervised learning
- Clustering
- k-means
- Hierarchical cluster analysis (HCA)
- Expectation maximization
- Visualization & dimensionality reduction
- Principal component analysis (PCA)
- Kernel PCA
- Locally-linear embedding (LLE)
- t-distributed stochastic neighbor embedding (t-SNE)
- Anomaly detection
- Association rule learning
- Apriori
- Eclat
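The clustering idea can be sketched with k-means on synthetic data (scikit-learn and NumPy assumed):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two synthetic blobs; the algorithm sees only X, never any labels
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_   # cluster assignments discovered without supervision
```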
- Semi-supervised learning
- Reinforcement learning
- Observe environment, select & perform action, get reward or penalty, update policy
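The loop above as a toy epsilon-greedy bandit; the environment and its reward probabilities here are made up for illustration:

```python
import random

random.seed(0)
true_means = [0.2, 0.8]   # hidden reward probability of each action (made up)
q = [0.0, 0.0]            # learned value estimate per action
counts = [0, 0]
for step in range(2000):
    # Select & perform an action: explore with probability 0.1,
    # otherwise exploit the current best estimate
    if random.random() < 0.1:
        a = random.randrange(2)
    else:
        a = max(range(2), key=lambda i: q[i])
    reward = 1.0 if random.random() < true_means[a] else 0.0  # get reward or penalty
    counts[a] += 1
    q[a] += (reward - q[a]) / counts[a]                       # update the policy
best_action = max(range(2), key=lambda i: q[i])
```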
Learning incrementally on the fly or not
- Online
- "Data in motion"
- Model updated as data arrive
- Advantage
- Suitable for streaming data
- Suitable for resource-constrained systems
- Suitable for out-of-core learning (dataset cannot fit memory)
- Disadvantage
- Vulnerable to bad incoming data, so performance must be monitored
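A sketch of online learning with mini-batches (scikit-learn's `SGDClassifier` assumed; the data stream is simulated):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(random_state=0)
classes = np.array([0, 1])
for _ in range(20):   # simulate 20 mini-batches arriving from a stream
    X_batch = rng.normal(size=(32, 2))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)   # incremental update

# Evaluate on fresh data from the same stream
X_new = rng.normal(size=(200, 2))
y_new = (X_new[:, 0] + X_new[:, 1] > 0).astype(int)
accuracy = model.score(X_new, y_new)
```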
- Offline (batch)
- "Data at rest"
- Model estimated from all the data at once
- Advantage
- Faster convergence
- Disadvantage
- Inefficient if new data arrive frequently (must retrain on the full dataset)
- Not applicable to reinforcement learning
Generalize by building a model from patterns, or by comparing to stored instances
- Instance-based
- Memorize every training instance
- Generalize to new instances using a similarity measure
- Model-based
- Build a model of the given training data, and use it to make predictions
- Less vulnerable to bad data
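A sketch contrasting the two approaches on the same toy data (scikit-learn assumed): k-NN stores the instances, linear regression fits parameters:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0   # noiseless line y = 2x + 1

knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)   # instance-based: stores the data
lin = LinearRegression().fit(X, y)                   # model-based: learns slope & intercept

knn_pred = knn.predict([[4.5]])[0]   # average of the nearest stored instances
lin_pred = lin.predict([[4.5]])[0]   # read off the fitted line
```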
Data
- More data
- Generalize better
- May have sampling bias
- Less data
- Sampling noise (non-representative data by chance)
Data vs. Algorithm
Very different ML algorithms performed almost identically well on a complex problem given enough data.
But small- and medium-sized datasets are still very common.
How to better detect patterns?
- Data cleaning
- Remove outliers
- Ignore/fill in missing values
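The cleaning steps above, sketched with pandas (toy data; the outlier threshold is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 31, 28, 400, None, 29]})   # 400: outlier; None: missing
df["age"] = df["age"].fillna(df["age"].median())          # fill in missing values
df = df[df["age"] < 120]                                  # remove an obvious outlier
```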
- Feature engineering
- Select the most useful features
- Combine features
- Create new features
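A feature-engineering sketch (column names are illustrative, not from the notes):

```python
import pandas as pd

# Two raw features
df = pd.DataFrame({"total_rooms": [600, 1200], "households": [200, 300]})
# Combine them into a new, more informative feature
df["rooms_per_household"] = df["total_rooms"] / df["households"]
```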
- Prevent overfitting (regularization)
- Simplify model
- Fewer parameters
- Reduce features
- Gather more data
- Reduce noise in data
- Tune regularization hyperparameters to constrain the model
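A sketch of a regularization hyperparameter at work: Ridge's `alpha` shrinks the weights, simplifying the model (scikit-learn assumed; toy data):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=30)   # only the first feature matters

weak = Ridge(alpha=0.01).fit(X, y)     # little regularization
strong = Ridge(alpha=100.0).fit(X, y)  # strong regularization shrinks the weights

weak_norm = np.linalg.norm(weak.coef_)
strong_norm = np.linalg.norm(strong.coef_)
```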
- Prevent underfitting
- Build a more powerful model
- More parameters
- Feed better features
- Reduce regularization constraints
Testing & Validating
To see how well the derived model performs, try it on new data instances.
- Split dataset
- Training set
- Test set
- Generalization error (out-of-sample error)
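A sketch of the split and the test-set estimate of generalization error (scikit-learn assumed):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)          # 80% train / 20% test
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)          # 1 - accuracy estimates the
                                                   # generalization (out-of-sample) error
```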
No Free Lunch Theorem
No one model works best for every problem. -> It is common in ML to try multiple models.