Decision Tree
- Training: building a tree
- Long (the entire tree must be grown from the training set)
- Prediction: following a single root-to-leaf path
- Short (proportional to the tree depth, roughly $$O(\log_2 m)$$ for a balanced tree)
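A toy illustration of prediction as a single root-to-leaf walk, assuming a hypothetical nested-dict representation of an already trained tree:

```python
# Hypothetical tree: internal nodes hold (feature index, threshold), leaves hold a class
tree = {
    "feature": 0, "threshold": 6.5,
    "left": {"leaf": 0},
    "right": {"feature": 1, "threshold": 2.0, "left": {"leaf": 1}, "right": {"leaf": 2}},
}

def predict(node, x):
    """Follow one path from the root down to a leaf."""
    while "leaf" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["leaf"]

print(predict(tree, [8.0, 3.0]))  # 8.0 > 6.5, then 3.0 > 2.0 -> class 2
```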
Pros & Cons
- Pros
- Simple to understand & interpret
- Handles numerical & categorical data
- Little data preparation needed, e.g. feature scaling/normalization is not required
- White box model
- Performs well with large datasets
- Mirrors human decision making closely
- Cons
- Sensitive to small changes in data
- NP-completeness of finding an optimal tree (training must be greedy/heuristic)
- Prone to overfitting
- Tree can become prohibitively large when decision boundaries are not orthogonal (axis-aligned) to the feature axes
Training Algorithm
Classification And Regression Tree (CART)
Assume binary splits and features with continuous values (a minimal split-search sketch follows the stopping criteria below).
- Split training set into 2 subsets with a single feature $$k$$ and threshold $$t_k$$
- Minimize $$J(k, t_k) = \frac{m_{left}}{m}G_{left} + \frac{m_{right}}{m}G_{right}$$
- $$G_{left/right}$$: impurity of the left/right subset
- $$m_{left/right}$$: number of instances in the left/right subset; $$m$$: total number of instances in the node
- Recursively do so for each subset
Stop when:
- Maximum depth reached
- Cannot find split that reduces impurity
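A minimal sketch of the greedy split search described above, assuming NumPy, Gini impurity, and hypothetical helper names (`gini`, `best_split`); a full implementation would recurse on the two resulting subsets until a stopping criterion is met:

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array y."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedy search for the (feature k, threshold t_k) pair minimizing
    J(k, t_k) = (m_left/m) * G_left + (m_right/m) * G_right."""
    m, n_features = X.shape
    best = (None, None, float("inf"))  # (k, t_k, J)
    for k in range(n_features):
        # Candidate thresholds: midpoints between consecutive sorted feature values
        values = np.unique(X[:, k])
        thresholds = (values[:-1] + values[1:]) / 2.0
        for t in thresholds:
            left, right = y[X[:, k] <= t], y[X[:, k] > t]
            J = (len(left) / m) * gini(left) + (len(right) / m) * gini(right)
            if J < best[2]:
                best = (k, t, J)
    return best

# Toy 2-feature dataset: feature 0 separates the classes perfectly
X = np.array([[2.0, 1.0], [3.0, 1.5], [10.0, 0.5], [11.0, 0.7]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))
```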
Properties
- Greedy; not guaranteed to be optimal (minimum average depth)
- NP-complete to find an optimal tree
Values
- Discrete: categorical features are split on category membership
- Continuous: numerical features are split on a threshold $$t_k$$
Impurity Measure
- Gini: computes faster
- $$G_i = 1 - \sum_{k=1}^{n} p_{i,k}^2$$
- $$p_{i,k}$$: ratio of class $$k$$ instances among the training instances in the i-th node
- Entropy: produces more balanced trees
- $$H_i = -\sum_{k=1}^{n} p_{i,k}\log(p_{i,k})$$
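A quick numerical check of both measures on a hypothetical node with assumed class proportions (NumPy, base-2 logarithm for the entropy):

```python
import numpy as np

# Hypothetical node with class proportions p_{i,k} (assumed values)
p = np.array([0.9, 0.1, 0.0])

gini = 1.0 - np.sum(p ** 2)              # G_i = 1 - sum_k p_{i,k}^2
nz = p[p > 0]                            # entropy sums only over p_{i,k} != 0
entropy = -np.sum(nz * np.log2(nz))      # H_i = -sum_k p_{i,k} log(p_{i,k})

print(gini)     # 0.18
print(entropy)  # ~0.469
```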
Regularization
- Maximum depth
- Minimum samples in a node before it can split
- Minimum samples in a leaf node
- Maximum leaf nodes
- Maximum features evaluated for splitting
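For instance, in scikit-learn these knobs correspond to constructor arguments of `DecisionTreeClassifier`; a sketch on the `make_moons` toy dataset with assumed hyperparameter values:

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=42)

# Each argument corresponds to one of the regularization knobs above
clf = DecisionTreeClassifier(
    max_depth=4,            # maximum depth
    min_samples_split=10,   # minimum samples in a node before it can split
    min_samples_leaf=4,     # minimum samples in a leaf node
    max_leaf_nodes=16,      # maximum leaf nodes
    max_features=None,      # maximum features evaluated for splitting (None = all)
    random_state=42,
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())
```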
Regression
Instead of a class, predict a continuous value: the mean target of the training instances in the leaf.
The same regularization hyperparameters apply.
Impurity Measure
- Mean Squared Error (MSE)
- $$MSE_{node} = \sum_{i \in node}(\hat{y}_{node} - y^{(i)})^2$$
- $$\hat{y}_{node} = \frac{1}{m_{node}} \sum_{i \in node} y^{(i)}$$
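A minimal regression sketch, assuming scikit-learn's `DecisionTreeRegressor` and synthetic quadratic data; splits minimize the MSE criterion and each leaf predicts the mean target of its instances:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy quadratic data
rng = np.random.RandomState(42)
X = rng.rand(200, 1) - 0.5
y = 4 * X[:, 0] ** 2 + 0.05 * rng.randn(200)

# Same CART algorithm, but splits minimize MSE instead of Gini/entropy
reg = DecisionTreeRegressor(max_depth=3, min_samples_leaf=10, random_state=42)
reg.fit(X, y)
print(reg.predict([[0.2]]))
```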