Regularization
- Bias: error from erroneous assumptions in the learning algorithm
- High bias = underfitting
- Variance: error from sensitivity to small fluctuations in the training set
- High variance = overfitting
L2 Regularization
On logistic regression:
$$ J(w) = -\sum_{i=1}^{n} \left[ y^{(i)}\log(\phi(z^{(i)})) + (1 - y^{(i)})\log(1-\phi(z^{(i)})) \right] + \frac{\lambda}{2}|w|^2
$$
$$ \Delta w_j = -\eta \frac{\partial J}{\partial w_j} = \eta \sum_{i=1}^{n}(y^{(i)} - \phi(z^{(i)}))x_j^{(i)} - \eta \lambda w_j
$$
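A minimal NumPy sketch of this update rule, written as batch gradient descent; the function name, the {0, 1} label encoding, and the hyperparameter defaults are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg_l2(X, y, lam=0.1, eta=0.05, epochs=200):
    """Batch gradient descent on the L2-regularized logistic loss.

    X: (n, d) feature matrix, y: (n,) labels in {0, 1},
    lam: regularization strength (lambda above), eta: learning rate.
    The bias term is left unregularized, as is common practice.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        phi = sigmoid(X @ w + b)                    # phi(z^(i)) for all i
        error = y - phi                             # y^(i) - phi(z^(i))
        w += eta * (X.T @ error) - eta * lam * w    # Delta w_j as above
        b += eta * error.sum()
    return w, b
```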
Support Vector Machine
- Linear/nonlinear classification, regression, outlier detection
- Suitable for classification of complex but small- or medium-sized datasets
Finds the maximum margin hyperplane that divides the points into positive & negative groups s.t. the distance between the hyperplane and the closest point from either group is maximized
Key insight
- Some examples are more important than others, i.e. the support vectors (see the sketch after this list)
- Neat trick: the kernel trick
- SVMs can be applied efficiently to nonlinearly separable data
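A minimal scikit-learn sketch of the idea, fitting a linear SVM with a very large C to approximate a hard margin; the toy data and parameter values are assumptions:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical linearly separable toy data.
X, y = make_blobs(n_samples=50, centers=2, random_state=0)

# A very large C approximates a hard margin (violations heavily penalized).
svc = SVC(kernel="linear", C=1e6).fit(X, y)

# Only the support vectors determine the decision boundary;
# every other training point could be removed without changing it.
print(svc.support_vectors_)
```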
Objective
$$ \begin{cases} w^T x_{pos} + b = 1 \\ w^T x_{neg} + b = -1 \end{cases} \Rightarrow w^T(x_{pos} - x_{neg}) = 2
$$
Maximize:
$$ \frac{w^T(x_{pos}-x_{neg})}{|w|} = \frac{2}{|w|}
$$
= minimize:
$$ \frac{1}{2}|w|^2
$$
subject to
$$ \begin{cases} w^T x^{(i)} + b \ge 1 &\text{if } y^{(i)} = 1 \\ w^T x^{(i)} + b \le -1 &\text{if } y^{(i)} = -1 \end{cases} \Rightarrow y^{(i)}(w^T x^{(i)} + b) \ge 1
$$
which is a convex quadratic optimization problem.
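Before moving to the dual, here is a minimal sketch of solving this primal problem directly with a general-purpose solver; the tiny hand-made dataset and the choice of SLSQP are assumptions (dedicated SVM solvers work on the dual instead):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set (hypothetical data for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):
    w = params[:2]
    return 0.5 * np.dot(w, w)            # (1/2)|w|^2

def margin_constraints(params):
    w, b = params[:2], params[2]
    return y * (X @ w + b) - 1           # y^(i)(w^T x^(i) + b) - 1 >= 0

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints={"type": "ineq", "fun": margin_constraints})
w, b = res.x[:2], res.x[2]
print(w, b)                              # the maximum-margin hyperplane
```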
Transform the primal problem to its dual:
$$ \min_{\alpha} \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} {x^{(i)}}^T \cdot x^{(j)} - \sum_{i=1}^{m} \alpha^{(i)}
$$
subject to
$$ \alpha^{(i)} \ge 0, \quad \sum_{i=1}^{m} \alpha^{(i)} y^{(i)} = 0
$$
where $$\alpha^{(i)}$$ are the Lagrange multipliers.
Solution of the primal problem:
$$ w = \sum_{i=1}^{m} \alpha^{(i)} y^{(i)} x^{(i)}
$$
$$ b = \frac{1}{n_s}\sum_{i:\, \alpha^{(i)} > 0}\left(y^{(i)} - w^T \cdot x^{(i)}\right)
$$
where $$n_s$$ is the number of support vectors (the examples with $$\alpha^{(i)} > 0$$).
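As a check, a fitted scikit-learn SVC exposes exactly these quantities: `dual_coef_` stores $$\alpha^{(i)} y^{(i)}$$ for the support vectors, so $$w$$ can be recovered as a single matrix product (the toy data and the large C are assumptions):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical toy data; a large C approximates a hard margin.
X, y = make_blobs(n_samples=50, centers=2, random_state=0)
svc = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds alpha^(i) * y^(i) for the support vectors only,
# so w = sum_i alpha^(i) y^(i) x^(i) is just a matrix product.
w = svc.dual_coef_ @ svc.support_vectors_
print(w, svc.coef_)       # the two should match
print(svc.intercept_)     # b, estimated from the support vectors
```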
Possible Problems
- Sensitivity to feature scales (standardize features first; see the sketch below)
- Sensitivity to outliers
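A minimal sketch of addressing the scaling issue by standardizing features inside a pipeline; the dataset and parameter values are assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Standardizing first keeps a single large-scale feature from
# dominating the margin computation.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
clf.fit(X, y)
print(clf.score(X, y))
```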
Soft Margin
- Hard margin
- Data must be linearly separable
- Sensitive to outliers
- Soft margin: allow margin violations
Objective
Minimize:
$$ \frac{1}{2}|w|^2 + C \sum^m_{i=1}\xi^{(i)}
$$
subject to
$$ y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi^{(i)}, \quad \xi^{(i)} \ge 0
$$
which is also a convex quadratic optimization problem and can be solved in the same way as the hard-margin problem above.
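A minimal sketch of how the hyperparameter C trades margin width against violations; the dataset, the feature selection, and the C values are assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Two overlapping iris classes, petal length/width only.
X, y = load_iris(return_X_y=True)
X, y = X[y != 0][:, 2:], y[y != 0]

# Smaller C -> wider street but more margin violations (higher bias);
# larger C -> fewer violations but a narrower street (higher variance).
for C in (0.01, 1.0, 100.0):
    clf = make_pipeline(StandardScaler(), LinearSVC(C=C, max_iter=10_000))
    clf.fit(X, y)
    print(C, clf.score(X, y))
```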
Non-Linear Data
Apply a polynomial transformation to map the points to a higher dimension s.t. they become linearly separable (a sketch follows the list below).
- Low-degree polynomial: cannot deal with complex datasets
- High-degree polynomial: too slow to transform every data point
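A minimal sketch of the explicit-transformation approach, adding low-degree polynomial features before a linear SVM; the moons dataset and the degree/C values are assumptions:

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# Nonlinearly separable toy data.
X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

# Explicit degree-3 polynomial feature map, then a linear SVM on top.
clf = make_pipeline(PolynomialFeatures(degree=3),
                    StandardScaler(),
                    LinearSVC(C=10, max_iter=10_000))
clf.fit(X, y)
print(clf.score(X, y))
```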
Kernel Tricks
In objective function:
$$ \min_{\alpha} \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} {x^{(i)}}^T \cdot x^{(j)} - \sum_{i=1}^{m} \alpha^{(i)}
$$
We can apply a kernel function to $$x^{(i)} \cdot x^{(j)}$$ s.t. $$K(x^{(i)}, x^{(j)}) = \phi(x^{(i)}) \cdot \phi(x^{(j)})$$ is computed solely from the original vectors, without ever applying the transformation $$\phi$$. Common kernels (a usage sketch follows the list):
- Linear
- $$K(a, b) = a^T \cdot b$$
- Polynomial
- $$K(a, b) = (\gamma a^T \cdot b + r)^d$$
- Gaussian RBF
- $$K(a, b) = e^{-\gamma |a-b|^2}$$
- Sigmoid
- $$K(a, b) = \tanh(\gamma a^T \cdot b + r)$$
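A minimal sketch of using kernels in scikit-learn, which achieves the effect of the polynomial feature map without materializing the transformed vectors; the dataset and hyperparameter values are assumptions:

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

# Polynomial kernel: behaves like a degree-3 feature map, computed implicitly.
poly_svc = make_pipeline(StandardScaler(),
                         SVC(kernel="poly", degree=3, coef0=1, C=5))
# Gaussian RBF kernel: gamma controls how local each support vector's
# influence is (larger gamma -> more wiggly decision boundary).
rbf_svc = make_pipeline(StandardScaler(),
                        SVC(kernel="rbf", gamma=5, C=1))

for name, clf in (("poly", poly_svc), ("rbf", rbf_svc)):
    clf.fit(X, y)
    print(name, clf.score(X, y))
```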
SVM Regression
Reverse the SVM objective: instead of fitting the widest possible street between two classes, fit as many instances as possible on the street while limiting margin violations.
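A minimal sketch of SVM regression with scikit-learn, where epsilon sets the width of the street and points inside it contribute no loss; the synthetic data and hyperparameter values are assumptions:

```python
import numpy as np
from sklearn.svm import SVR, LinearSVR

# Hypothetical 1-D regression data: a noisy quadratic.
rng = np.random.RandomState(42)
X = 2 * rng.rand(100, 1) - 1
y = (0.5 * X ** 2 + X + 0.1 * rng.randn(100, 1)).ravel()

# Linear SVM regression; epsilon is the half-width of the street.
lin_svr = LinearSVR(epsilon=0.1, max_iter=10_000).fit(X, y)

# Kernelized SVM regression for the nonlinear part.
poly_svr = SVR(kernel="poly", degree=2, C=100, epsilon=0.1).fit(X, y)

print(lin_svr.predict([[0.5]]), poly_svr.predict([[0.5]]))
```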