Overview
- Motivation
- No single classifier is best for all circumstances
- Necessary & sufficient condition
- Individual classifiers (base learners) are accurate and diverse
Diversity vs. Accuracy
The correct predictions of accurate classifiers are positively correlated with one another, so diversity trades off against accuracy: intentionally non-optimal base learners can yield better ensemble performance. Common ways to inject diversity (a small sketch follows this list):
- Different models/algorithms
- Different hyper-parameters
- $$k$$ in KNN
- Threshold in decision tree
- Kernel function in SVM
- Initial weights in neural networks
- Different input representations of the same event
- Sensor fusion, e.g. combining sound and mouth-shape features for speech recognition
- Random subset of features
- Different training sets
- Random subset of samples
- Sequential training
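
A minimal sketch of building such diversity, assuming scikit-learn and the Iris dataset purely for illustration (neither is mentioned in the notes): each KNN base learner gets a different $$k$$ and a random feature subset, and the individual accuracies plus average pairwise disagreement are reported as a rough diversity check.

```python
# Diverse base learners via different hyper-parameters (k in KNN) and random
# feature subsets; dataset, learner, and the disagreement measure are illustrative.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)

preds = []
for k in (1, 3, 5, 7, 9):
    feats = rng.choice(X.shape[1], size=2, replace=False)   # random feature subset
    clf = KNeighborsClassifier(n_neighbors=k).fit(X[:, feats], y)
    preds.append(clf.predict(X[:, feats]))
preds = np.stack(preds)

# Individual accuracy and average pairwise disagreement (a crude diversity proxy).
acc = (preds == y).mean(axis=1)
disagreement = np.mean([(preds[i] != preds[j]).mean()
                        for i in range(len(preds)) for j in range(i + 1, len(preds))])
print("individual accuracies:", np.round(acc, 3))
print("average pairwise disagreement:", round(float(disagreement), 3))
```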
Majority Voting
- Error rate (a numeric sketch follows this list)
- $$\epsilon$$: error rate of an individual classifier
- Assuming independent errors, a majority-vote ensemble of $$n$$ classifiers errs when at least $$\lceil n/2 \rceil$$ members err: $$\epsilon_{ensemble} = P(y \ge \lceil n/2 \rceil) = \sum_{k=\lceil n/2 \rceil}^{n} {n \choose k} \epsilon^k (1-\epsilon)^{n-k}$$
- Why can very good ensembles often be constructed?
- Statistical
- When training data are limited, combining hypotheses reduces the risk of choosing the wrong one
- Computational
- Combining searches started from many local optima gives a better approximation to the true optimum
- Representational
- Combinations of hypotheses can represent functions that no single base learner can
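
A quick numeric check of the error-rate formula above; the values of $$n$$ and $$\epsilon$$ are illustrative, and independence between base classifiers is assumed as in the derivation.

```python
# Ensemble error of majority voting over n independent classifiers with error eps.
from math import ceil, comb

def ensemble_error(n, eps):
    k_min = ceil(n / 2)   # majority voting fails when at least ceil(n/2) members err
    return sum(comb(n, k) * eps**k * (1 - eps)**(n - k) for k in range(k_min, n + 1))

print(ensemble_error(n=11, eps=0.25))   # ~0.034, far below the individual error of 0.25
print(ensemble_error(n=11, eps=0.55))   # ~0.63, worse than individuals once eps > 0.5
```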
Implementation
- Weighting schemes: confidence, accuracy, Bayesian prior, etc. (a small sketch follows this list)
- Equal weighting
- Unequal weighting
- Weighted combination of predicted class probabilities
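
A small sketch of unequal weighting with hard votes; the weights (which could come from validation accuracy or confidence) and the toy predictions are illustrative.

```python
# Weighted plurality vote: each classifier's vote counts with its weight.
import numpy as np

def weighted_vote(predictions, weights, n_classes):
    """predictions: (n_classifiers, n_samples) integer labels; weights: (n_classifiers,)."""
    scores = np.zeros((n_classes, predictions.shape[1]))
    for pred, w in zip(predictions, weights):
        scores[pred, np.arange(predictions.shape[1])] += w   # add the weight to the voted class
    return scores.argmax(axis=0)

preds = np.array([[0, 1, 1],            # classifier 1
                  [0, 0, 1],            # classifier 2
                  [1, 0, 1]])           # classifier 3
print(weighted_vote(preds, weights=[0.5, 0.3, 0.4], n_classes=2))   # -> [0 0 1]
```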
Plurality Voting
Extension of majority voting to the multi-class setting: the class receiving the most votes wins. Individual classifiers can be of the same or different types; a minimal sketch follows the example below.
E.g. random forest: multiple decision trees.
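
A minimal sketch of the vote itself; the labels are illustrative and ties are broken arbitrarily by `Counter`.

```python
# Plurality voting: the class receiving the most votes wins.
from collections import Counter

def plurality_vote(votes):
    return Counter(votes).most_common(1)[0][0]

print(plurality_vote(["cat", "dog", "cat", "bird", "cat"]))   # -> cat
```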
Parallel/Sequential Combining
- Parallel: multi-expert combination
- Global
- All base learners generate outputs
- Local
- Select a few base learners based on the input (gating); a small sketch follows this list
- Sequential: multi-stage combination
- Start with simpler models and increase model complexity only for inputs that earlier models do not handle well
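
A minimal sketch of the local (gated) parallel variant, assuming scikit-learn, the Iris dataset, and a one-feature threshold gate purely for illustration: only the expert selected by the gate produces an output for each input.

```python
# Gated parallel combination: a simple gate routes each input to one expert.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
threshold = np.median(X[:, 0])          # toy gate: split on the first feature
left = X[:, 0] < threshold

experts = {
    True:  LogisticRegression(max_iter=1000).fit(X[left], y[left]),
    False: LogisticRegression(max_iter=1000).fit(X[~left], y[~left]),
}

# At prediction time, only the gated expert is consulted for each sample.
pred = np.array([experts[bool(x[0] < threshold)].predict(x[None, :])[0] for x in X])
print("gated ensemble training accuracy:", (pred == y).mean())
```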
Bagging
Random subsets of samples for each base learner (a minimal sketch follows this list).
- Parallel
- Random sampling with replacement (bootstrap)
- Reduces variance but cannot reduce bias; bias stays roughly the same as the base classifiers'
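
A minimal bagging sketch, assuming scikit-learn decision trees and the Iris dataset for illustration: each tree sees a bootstrap sample (drawn with replacement) and predictions are combined by plurality vote.

```python
# Bagging: bootstrap samples + plurality voting over the resulting trees.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
n = len(X)

trees = []
for _ in range(25):
    idx = rng.integers(0, n, size=n)    # bootstrap: n indices drawn with replacement
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

votes = np.stack([t.predict(X) for t in trees])                       # (n_trees, n_samples)
bagged = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print("bagged training accuracy:", (bagged == y).mean())
```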
Adaptive Boosting (AdaBoost)
Combines weak base learners, i.e. favors diversity over individual accuracy. Later classifiers focus on the examples that earlier classifiers handled poorly.
- Sequential
- Random sample without replacement
- Can reduce bias
Algorithm (a minimal code sketch follows the steps)
- For each boosting round
- Train a learner
- Predict class labels
- Compute weighted error rate
- Compute re-weighting coefficients
- Update weights
- Normalize weights
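
A minimal sketch of the boosting loop above for binary labels in $$\{-1, +1\}$$, using decision stumps as the weak learners; the synthetic dataset and the number of rounds are illustrative choices.

```python
# AdaBoost: each round trains a weak learner on weighted data, then re-weights samples.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
y = np.where(y == 0, -1, 1)             # AdaBoost convention: labels in {-1, +1}
n = len(X)

w = np.full(n, 1.0 / n)                 # initial sample weights: uniform
stumps, alphas = [], []
for _ in range(20):                     # boosting rounds
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)   # train a learner
    pred = stump.predict(X)                                                  # predict class labels
    err = w[pred != y].sum()                                                 # weighted error rate
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))                        # re-weighting coefficient
    w = w * np.exp(-alpha * y * pred)                                        # update weights
    w = w / w.sum()                                                          # normalize weights
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted sum of the weak learners' outputs.
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("AdaBoost training accuracy:", (np.sign(F) == y).mean())
```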