Regularization
- Bias: error from erroneous assumptions in the learning algorithm
- High bias = underfitting
- Variance: error from sensitivity to small fluctuations in the training set
- High variance = overfitting
L2 Regularization
On logistic regression:
$$ J(w) = -\sum_{i=1}^{n} \left[ y^{(i)}\log(\phi(z^{(i)})) + (1 - y^{(i)})\log(1-\phi(z^{(i)})) \right] + \frac{\lambda}{2}|w|^2
$$
$$ \Delta w_j = -\eta \frac{\partial J}{\partial w_j} = \eta \sum_{i=1}^{n}(y^{(i)} - \phi(z^{(i)}))x_j^{(i)} - \eta \lambda w_j
$$
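A minimal NumPy sketch of this update rule, written as batch gradient descent; the function name, the {0, 1} label encoding, and the hyperparameter defaults are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg_l2(X, y, lam=0.1, eta=0.05, epochs=200):
    """Batch gradient descent on the L2-regularized logistic loss.

    X: (n, d) feature matrix, y: (n,) labels in {0, 1},
    lam: regularization strength (lambda above), eta: learning rate.
    The bias term is left unregularized, as is common practice.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        phi = sigmoid(X @ w + b)                    # phi(z^(i)) for all i
        error = y - phi                             # y^(i) - phi(z^(i))
        w += eta * (X.T @ error) - eta * lam * w    # Delta w_j as above
        b += eta * error.sum()
    return w, b
```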
Support Vector Machine
- Linear/nonlinear classification, regression, outlier detection
- Suitable for classification of complex but small- or medium-sized datasets
Finds the maximum margin hyperplane that divides the points into positive & negative groups s.t. the distance between the hyperplane and the closest point from either group is maximized
Key insight
- Some examples are more important than others, i.e. the support vectors (see the sketch after this list)
- Neat trick: the kernel trick
- SVMs can be applied efficiently to nonlinearly separable data
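A minimal scikit-learn sketch of the idea, fitting a linear SVM with a very large C to approximate a hard margin; the toy data and parameter values are assumptions:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical linearly separable toy data.
X, y = make_blobs(n_samples=50, centers=2, random_state=0)

# A very large C approximates a hard margin (violations heavily penalized).
svc = SVC(kernel="linear", C=1e6).fit(X, y)

# Only the support vectors determine the decision boundary;
# every other training point could be removed without changing it.
print(svc.support_vectors_)
```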
Objective
$$ \begin{cases} w^T x_{pos} + b = 1 \\ w^T x_{neg} + b = -1 \end{cases} \Rightarrow w^T(x_{pos} - x_{neg}) = 2
$$
Maximize:
$$ \frac{w^T(x_{pos}-x_{neg})}{|w|} = \frac{2}{|w|}
$$
= minimize:
$$ \frac{1}{2}|w|^2
$$
subject to
$$ \begin{cases} w^T x^{(i)} + b \ge 1 &\text{if } y^{(i)} = 1 \\ w^T x^{(i)} + b \le -1 &\text{if } y^{(i)} = -1 \end{cases} \Rightarrow y^{(i)}(w^T x^{(i)} + b) \ge 1
$$
which is a convex quadratic optimization problem.
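Before moving to the dual, here is a minimal sketch of solving this primal problem directly with a general-purpose solver; the tiny hand-made dataset and the choice of SLSQP are assumptions (dedicated SVM solvers work on the dual instead):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable toy set (hypothetical data for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):
    w = params[:2]
    return 0.5 * np.dot(w, w)            # (1/2)|w|^2

def margin_constraints(params):
    w, b = params[:2], params[2]
    return y * (X @ w + b) - 1           # y^(i)(w^T x^(i) + b) - 1 >= 0

res = minimize(objective, x0=np.zeros(3), method="SLSQP",
               constraints={"type": "ineq", "fun": margin_constraints})
w, b = res.x[:2], res.x[2]
print(w, b)                              # the maximum-margin hyperplane
```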
Transform the primal problem to its dual:
$$ \min_{\alpha} \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} {x^{(i)}}^T \cdot x^{(j)} - \sum_{i=1}^{m} \alpha^{(i)}
$$
subject to
$$ \alpha^{(i)} \ge 0, \quad \sum_{i=1}^{m} \alpha^{(i)} y^{(i)} = 0
$$
where $$\alpha^{(i)}$$ are the Lagrange multipliers.
Solution of the primal problem:
$$ w = \sum_{i=1}^{m} \alpha^{(i)} y^{(i)} x^{(i)}
$$
$$ b = \frac{1}{n_s}\sum_{i:\, \alpha^{(i)} > 0}\left(y^{(i)} - w^T \cdot x^{(i)}\right)
$$
where $$n_s$$ is the number of support vectors (the examples with $$\alpha^{(i)} > 0$$).
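As a check, a fitted scikit-learn SVC exposes exactly these quantities: `dual_coef_` stores $$\alpha^{(i)} y^{(i)}$$ for the support vectors, so $$w$$ can be recovered as a single matrix product (the toy data and the large C are assumptions):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical toy data; a large C approximates a hard margin.
X, y = make_blobs(n_samples=50, centers=2, random_state=0)
svc = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds alpha^(i) * y^(i) for the support vectors only,
# so w = sum_i alpha^(i) y^(i) x^(i) is just a matrix product.
w = svc.dual_coef_ @ svc.support_vectors_
print(w, svc.coef_)       # the two should match
print(svc.intercept_)     # b, estimated from the support vectors
```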
Possible Problems
- Sensitivity to feature scales (standardize features first; see the sketch below)
- Sensitivity to outliers
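A minimal sketch of addressing the scaling issue by standardizing features inside a pipeline; the dataset and parameter values are assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Standardizing first keeps a single large-scale feature from
# dominating the margin computation.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
clf.fit(X, y)
print(clf.score(X, y))
```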
Soft Margin
- Hard margin
- Data must be linearly separable
- Sensitive to outliers
- Soft margin: allow margin violations
Objective
Minimize:
$$ \frac{1}{2}|w|^2 + C \sum^m_{i=1}\xi^{(i)}
$$
subject to
$$ y^{(i)}(w^T x^{(i)} + b) \ge 1 - \xi^{(i)}, \quad \xi^{(i)} \ge 0
$$
which is also a convex quadratic optimization problem and can be solved in the same way as the hard-margin problem above.
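A minimal sketch of how the hyperparameter C trades margin width against violations; the dataset, the feature selection, and the C values are assumptions:

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Two overlapping iris classes, petal length/width only.
X, y = load_iris(return_X_y=True)
X, y = X[y != 0][:, 2:], y[y != 0]

# Smaller C -> wider street but more margin violations (higher bias);
# larger C -> fewer violations but a narrower street (higher variance).
for C in (0.01, 1.0, 100.0):
    clf = make_pipeline(StandardScaler(), LinearSVC(C=C, max_iter=10_000))
    clf.fit(X, y)
    print(C, clf.score(X, y))
```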
Non-Linear Data
Apply a polynomial transformation to map the points to a higher dimension s.t. they become linearly separable (a sketch follows the list below).
- Low-degree polynomial: cannot deal with complex datasets
- High-degree polynomial: too slow to transform every data point
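A minimal sketch of the explicit-transformation approach, adding low-degree polynomial features before a linear SVM; the moons dataset and the degree/C values are assumptions:

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

# Nonlinearly separable toy data.
X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

# Explicit degree-3 polynomial feature map, then a linear SVM on top.
clf = make_pipeline(PolynomialFeatures(degree=3),
                    StandardScaler(),
                    LinearSVC(C=10, max_iter=10_000))
clf.fit(X, y)
print(clf.score(X, y))
```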
Kernel Tricks
In objective function:
$$ \min_{\alpha} \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} {x^{(i)}}^T \cdot x^{(j)} - \sum_{i=1}^{m} \alpha^{(i)}
$$
We can apply a kernel function to $$x^{(i)} \cdot x^{(j)}$$ s.t. $$K(x^{(i)}, x^{(j)}) = \phi(x^{(i)}) \cdot \phi(x^{(j)})$$ is computed solely from the original vectors, without ever applying the transformation $$\phi$$. Common kernels (a usage sketch follows the list):
- Linear
- $$K(a, b) = a^T \cdot b$$
- Polynomial
- $$K(a, b) = (\gamma a^T \cdot b + r)^d$$
- Gaussian RBF
- $$K(a, b) = e^{-\gamma |a-b|^2}$$
- Sigmoid
- $$K(a, b) = \tanh(\gamma a^T \cdot b + r)$$
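A minimal sketch of using kernels in scikit-learn, which achieves the effect of the polynomial feature map without materializing the transformed vectors; the dataset and hyperparameter values are assumptions:

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=42)

# Polynomial kernel: behaves like a degree-3 feature map, computed implicitly.
poly_svc = make_pipeline(StandardScaler(),
                         SVC(kernel="poly", degree=3, coef0=1, C=5))
# Gaussian RBF kernel: gamma controls how local each support vector's
# influence is (larger gamma -> more wiggly decision boundary).
rbf_svc = make_pipeline(StandardScaler(),
                        SVC(kernel="rbf", gamma=5, C=1))

for name, clf in (("poly", poly_svc), ("rbf", rbf_svc)):
    clf.fit(X, y)
    print(name, clf.score(X, y))
```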
SVM Regression
Reverse the SVM objective: instead of fitting the widest possible street between two classes, fit as many instances as possible on the street while limiting margin violations.
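A minimal sketch of SVM regression with scikit-learn, where epsilon sets the width of the street and points inside it contribute no loss; the synthetic data and hyperparameter values are assumptions:

```python
import numpy as np
from sklearn.svm import SVR, LinearSVR

# Hypothetical 1-D regression data: a noisy quadratic.
rng = np.random.RandomState(42)
X = 2 * rng.rand(100, 1) - 1
y = (0.5 * X ** 2 + X + 0.1 * rng.randn(100, 1)).ravel()

# Linear SVM regression; epsilon is the half-width of the street.
lin_svr = LinearSVR(epsilon=0.1, max_iter=10_000).fit(X, y)

# Kernelized SVM regression for the nonlinear part.
poly_svr = SVR(kernel="poly", degree=2, C=100, epsilon=0.1).fit(X, y)

print(lin_svr.predict([[0.5]]), poly_svr.predict([[0.5]]))
```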