Decision Tree
- Training: building a tree
- Long (the entire tree must be grown from the training set)
- Prediction: following a single root-to-leaf path
- Short (proportional to the tree depth, roughly $$O(\log_2 m)$$ for a balanced tree)
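A toy illustration of prediction as a single root-to-leaf walk, assuming a hypothetical nested-dict representation of an already trained tree:

```python
# Hypothetical tree: internal nodes hold (feature index, threshold), leaves hold a class
tree = {
    "feature": 0, "threshold": 6.5,
    "left": {"leaf": 0},
    "right": {"feature": 1, "threshold": 2.0, "left": {"leaf": 1}, "right": {"leaf": 2}},
}

def predict(node, x):
    """Follow one path from the root down to a leaf."""
    while "leaf" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["leaf"]

print(predict(tree, [8.0, 3.0]))  # 8.0 > 6.5, then 3.0 > 2.0 -> class 2
```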
Pros & Cons
- Pros
- Simple to understand & interpret
- Handles numerical & categorical data
- Little data preparation needed, e.g. feature scaling/normalization is not required
- White box model
- Performs well with large datasets
- Mirrors human decision making closely
- Cons
- Sensitive to small changes in data
- NP-completeness of finding an optimal tree (training must be greedy/heuristic)
- Prone to overfitting
- Tree can become prohibitively large when decision boundaries are not orthogonal (axis-aligned) to the feature axes
Training Algorithm
Classification And Regression Tree (CART)
Assume binary splits and features with continuous values (a minimal split-search sketch follows the stopping criteria below).
- Split training set into 2 subsets with a single feature $$k$$ and threshold $$t_k$$
- Minimize $$J(k, t_k) = \frac{m_{left}}{m}G_{left} + \frac{m_{right}}{m}G_{right}$$
- $$G_{left/right}$$: impurity of the left/right subset
- $$m_{left/right}$$: number of instances in the left/right subset; $$m$$: total number of instances in the node
- Recursively do so for each subset
Stop when:
- Maximum depth reached
- Cannot find split that reduces impurity
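A minimal sketch of the greedy split search described above, assuming NumPy, Gini impurity, and hypothetical helper names (`gini`, `best_split`); a full implementation would recurse on the two resulting subsets until a stopping criterion is met:

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array y."""
    if len(y) == 0:
        return 0.0
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedy search for the (feature k, threshold t_k) pair minimizing
    J(k, t_k) = (m_left/m) * G_left + (m_right/m) * G_right."""
    m, n_features = X.shape
    best = (None, None, float("inf"))  # (k, t_k, J)
    for k in range(n_features):
        # Candidate thresholds: midpoints between consecutive sorted feature values
        values = np.unique(X[:, k])
        thresholds = (values[:-1] + values[1:]) / 2.0
        for t in thresholds:
            left, right = y[X[:, k] <= t], y[X[:, k] > t]
            J = (len(left) / m) * gini(left) + (len(right) / m) * gini(right)
            if J < best[2]:
                best = (k, t, J)
    return best

# Toy 2-feature dataset: feature 0 separates the classes perfectly
X = np.array([[2.0, 1.0], [3.0, 1.5], [10.0, 0.5], [11.0, 0.7]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))
```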
Properties
- Greedy; not guaranteed to be optimal (minimum average depth)
- NP-complete to find an optimal tree
Values
- Discrete: categorical features are split on category membership
- Continuous: numerical features are split on a threshold $$t_k$$
Impurity Measure
- Gini: computes faster
- $$G_i = 1 - \sum_{k=1}^{n} p_{i,k}^2$$
- $$p_{i,k}$$: ratio of class $$k$$ instances among the training instances in the i-th node
- Entropy: produces more balanced trees
- $$H_i = -\sum_{k=1}^{n} p_{i,k}\log(p_{i,k})$$
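A quick numerical check of both measures on a hypothetical node with assumed class proportions (NumPy, base-2 logarithm for the entropy):

```python
import numpy as np

# Hypothetical node with class proportions p_{i,k} (assumed values)
p = np.array([0.9, 0.1, 0.0])

gini = 1.0 - np.sum(p ** 2)              # G_i = 1 - sum_k p_{i,k}^2
nz = p[p > 0]                            # entropy sums only over p_{i,k} != 0
entropy = -np.sum(nz * np.log2(nz))      # H_i = -sum_k p_{i,k} log(p_{i,k})

print(gini)     # 0.18
print(entropy)  # ~0.469
```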
Regularization
- Maximum depth
- Minimum samples in a node before it can split
- Minimum samples in a leaf node
- Maximum leaf nodes
- Maximum features evaluated for splitting
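For instance, in scikit-learn these knobs correspond to constructor arguments of `DecisionTreeClassifier`; a sketch on the `make_moons` toy dataset with assumed hyperparameter values:

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=42)

# Each argument corresponds to one of the regularization knobs above
clf = DecisionTreeClassifier(
    max_depth=4,            # maximum depth
    min_samples_split=10,   # minimum samples in a node before it can split
    min_samples_leaf=4,     # minimum samples in a leaf node
    max_leaf_nodes=16,      # maximum leaf nodes
    max_features=None,      # maximum features evaluated for splitting (None = all)
    random_state=42,
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())
```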
Regression
Instead of a class, predict a continuous value: the mean target of the training instances in the leaf.
The same regularization hyperparameters apply.
Impurity Measure
- Mean Squared Error (MSE)
- $$MSE_{node} = \sum_{i \in node}(\hat{y}_{node} - y^{(i)})^2$$
- $$\hat{y}_{node} = \frac{1}{m_{node}} \sum_{i \in node} y^{(i)}$$
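A minimal regression sketch, assuming scikit-learn's `DecisionTreeRegressor` and synthetic quadratic data; splits minimize the MSE criterion and each leaf predicts the mean target of its instances:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy quadratic data
rng = np.random.RandomState(42)
X = rng.rand(200, 1) - 0.5
y = 4 * X[:, 0] ** 2 + 0.05 * rng.randn(200)

# Same CART algorithm, but splits minimize MSE instead of Gini/entropy
reg = DecisionTreeRegressor(max_depth=3, min_samples_leaf=10, random_state=42)
reg.fit(X, y)
print(reg.predict([[0.2]]))
```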