Regularization

  • Bias: error from erroneous assumptions in the learning algorithm
    • High bias = underfitting
  • Variance: error from sensitivity to small fluctuations in the training set
    • High variance = overfitting
L2 Regularization

Applied to logistic regression, the regularized cost function and weight update become:

$$
J(w) = -\sum_{i=1}^{n} \left[ y^{(i)} \log\left(\phi(z^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - \phi(z^{(i)})\right) \right] + \frac{\lambda}{2} \|w\|^2
$$

$$
\Delta w_j = -\eta \frac{\partial J}{\partial w_j} = \eta \sum_{i=1}^{n} \left(y^{(i)} - \phi(z^{(i)})\right) x_j^{(i)} - \eta \lambda w_j
$$
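As a quick illustration, here is a minimal NumPy sketch of the regularized weight update above; the data `X`, labels `y`, learning rate `eta`, and regularization strength `lam` are illustrative placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_logreg_step(w, X, y, eta=0.01, lam=0.1):
    """One gradient-descent step with L2 weight decay, as in the update rule above."""
    phi = sigmoid(X @ w)                        # phi(z^(i)) for every sample
    grad_ll = X.T @ (y - phi)                   # sum_i (y^(i) - phi(z^(i))) x_j^(i)
    return w + eta * grad_ll - eta * lam * w    # the -eta*lambda*w term shrinks the weights

# usage (illustrative shapes): w = l2_logreg_step(np.zeros(X.shape[1]), X, y)
```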

Support Vector Machine

  • Linear/nonlinear classification, regression, outlier detection
  • Suitable for classification of complex but small- or medium-sized datasets
  • Finds the maximum margin hyperplane that divides the points into positive & negative groups s.t. the distance between the hyperplane and the closest point from either group is maximized

  • Key insight
    • Some examples are more important than others, i.e. the support vectors
  • Neat trick
    • Kernel trick
    • SVMs can be applied efficiently to nonlinearly separable data

Objective

$$
\begin{cases} w^T x_{pos} + b = 1 \\ w^T x_{neg} + b = -1 \end{cases} \Rightarrow w^T (x_{pos} - x_{neg}) = 2
$$

Maximize:

$$
\frac{w^T (x_{pos} - x_{neg})}{\|w\|} = \frac{2}{\|w\|}
$$

which is equivalent to minimizing:

$$
\frac{1}{2} \|w\|^2
$$

subject to

$$
\begin{cases} w^T x^{(i)} + b \ge 1 & \text{if } y^{(i)} = 1 \\ w^T x^{(i)} + b \le -1 & \text{if } y^{(i)} = -1 \end{cases} \Rightarrow y^{(i)} \left(w^T x^{(i)} + b\right) \ge 1
$$

which is a convex quadratic optimization problem.
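Since it is just a convex QP, it can be handed to a generic solver. Below is a minimal sketch with CVXPY (an assumed dependency): `X` is an $$m \times d$$ data matrix, `y` a vector of $$\pm 1$$ labels, and the data must be linearly separable for the hard-margin problem to be feasible.

```python
import cvxpy as cp

def hard_margin_svm(X, y):
    m, d = X.shape
    w = cp.Variable(d)
    b = cp.Variable()
    # minimize (1/2)||w||^2  subject to  y^(i) (w^T x^(i) + b) >= 1
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w + b) >= 1]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value
```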

Transform the primal problem to its dual:

$$
\min_{\alpha} \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} {x^{(i)}}^T x^{(j)} - \sum_{i=1}^{m} \alpha^{(i)}
$$

subject to

$$
\alpha^{(i)} \ge 0
$$

where the $$\alpha^{(i)}$$ are the Lagrange multipliers.
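To see where the dual comes from, form the Lagrangian of the primal problem and set its gradients with respect to $$w$$ and $$b$$ to zero:

$$
L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{m} \alpha^{(i)} \left[ y^{(i)} \left(w^T x^{(i)} + b\right) - 1 \right]
$$

$$
\frac{\partial L}{\partial w} = 0 \Rightarrow w = \sum_{i=1}^{m} \alpha^{(i)} y^{(i)} x^{(i)}, \qquad \frac{\partial L}{\partial b} = 0 \Rightarrow \sum_{i=1}^{m} \alpha^{(i)} y^{(i)} = 0
$$

Substituting $$w$$ back into $$L$$ gives the dual objective above; the second condition is an additional equality constraint on $$\alpha$$.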

Solution of the primal problem:

$$
w = \sum_{i=1}^{m} \alpha^{(i)} y^{(i)} x^{(i)}, \qquad b = \frac{1}{n_s} \sum_{i:\, \alpha^{(i)} > 0} \left( y^{(i)} - w^T x^{(i)} \right)
$$

where $$n_s$$ is the number of support vectors (the examples with $$\alpha^{(i)} > 0$$).
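A fitted linear SVC in scikit-learn (an assumed dependency) stores exactly these quantities, so $$w$$ can be reconstructed from the support vectors; the toy dataset and the large `C` (approximating a hard margin) are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)   # toy two-class data
clf = SVC(kernel="linear", C=1e6).fit(X, y)                   # very large C ~ hard margin

# dual_coef_ holds alpha^(i) * y^(i) for the support vectors
w_manual = clf.dual_coef_ @ clf.support_vectors_              # w = sum_i alpha^(i) y^(i) x^(i)
print(np.allclose(w_manual, clf.coef_))                       # matches the stored w (up to tolerance)
print(clf.intercept_)                                         # b
```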

Possible Problems
  • Sensitivity to feature scales (see the scaling sketch below)
  • Sensitivity to outliers
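Because the margin is measured in the original feature units, standardizing the features before fitting is the usual fix for scale sensitivity. A minimal sketch with scikit-learn (an assumed dependency); `X_train` and `y_train` are placeholders.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# scale every feature to zero mean and unit variance before fitting the SVM
svm_clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
# svm_clf.fit(X_train, y_train)
```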

Soft Margin

  • Hard margin
    • Data must be linearly separable
    • Sensitive to outliers
  • Soft margin: allow margin violations
Objective

Minimize:

$$
\frac{1}{2} \|w\|^2 + C \sum_{i=1}^{m} \xi^{(i)}
$$

subject to

$$
y^{(i)} \left(w^T x^{(i)} + b\right) \ge 1 - \xi^{(i)}, \qquad \xi^{(i)} \ge 0
$$

which is also a convex quadratic optimization problem and can be solved with the same technique as the hard-margin case.
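A minimal sketch of the soft-margin trade-off with scikit-learn (an assumed dependency): `C` scales the penalty on the slack variables $$\xi^{(i)}$$, so a small `C` tolerates more margin violations while a large `C` approaches the hard-margin behaviour; the values and placeholders below are illustrative.

```python
from sklearn.svm import LinearSVC

loose = LinearSVC(C=0.01, loss="hinge", max_iter=10000)    # wide street, many violations allowed
strict = LinearSVC(C=100.0, loss="hinge", max_iter=10000)  # narrow street, few violations allowed
# loose.fit(X_train, y_train); strict.fit(X_train, y_train)
```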

Non-Linear Data

Apply a polynomial transformation to map the points into a higher-dimensional space such that they become linearly separable.

  • Low-degree polynomial: cannot deal with complex datasets (a sketch of the explicit approach follows this list)
  • High-degree polynomial: too slow to transform every data point
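A minimal sketch of the explicit low-degree approach with scikit-learn (an assumed dependency); the degree and `C` are illustrative, and `X_train`/`y_train` are placeholders.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

poly_svm = make_pipeline(
    PolynomialFeatures(degree=3),                    # explicit transformation to a higher dimension
    StandardScaler(),
    LinearSVC(C=10, loss="hinge", max_iter=10000),   # linear SVM in the transformed space
)
# poly_svm.fit(X_train, y_train)
```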
Kernel Tricks

In the dual objective function:

$$
\min_{\alpha} \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha^{(i)} \alpha^{(j)} y^{(i)} y^{(j)} {x^{(i)}}^T x^{(j)} - \sum_{i=1}^{m} \alpha^{(i)}
$$

We can replace the dot product $${x^{(i)}}^T x^{(j)}$$ with a kernel function $$K(x^{(i)}, x^{(j)}) = \phi(x^{(i)}) \cdot \phi(x^{(j)})$$ that is computed solely from the original vectors, without ever applying the transformation $$\phi$$ (a usage sketch follows the list of kernels below).

  • Linear
    • $$K(a, b) = a^T \cdot b$$
  • Polynomial
    • $$K(a, b) = (\gamma a^T \cdot b + r)^d$$
  • Gaussian RBF
    • $$K(a, b) = e^{-\gamma \|a-b\|^2}$$
  • Sigmoid
    • $$K(a, b) = \tanh(\gamma a^T \cdot b + r)$$
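A minimal sketch of the kernel trick via scikit-learn (an assumed dependency): the same dual problem is solved, with the dot products replaced by the kernels listed above; the hyperparameter values are illustrative.

```python
from sklearn.svm import SVC

poly_kernel = SVC(kernel="poly", degree=3, coef0=1, C=5)   # K(a, b) = (gamma a.b + r)^d, with r = coef0
rbf_kernel = SVC(kernel="rbf", gamma=5, C=0.001)           # K(a, b) = exp(-gamma ||a - b||^2)
# poly_kernel.fit(X_train, y_train); rbf_kernel.fit(X_train, y_train)
```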

SVM Regression

Reverse the SVM objective: instead of fitting the widest possible street between two classes, fit as many instances as possible on the street (within a margin of width $$\epsilon$$) while limiting violations off it.
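A minimal sketch of SVM regression with scikit-learn (an assumed dependency): `epsilon` sets the width of the street, and only the instances that fall off it affect the fit; the hyperparameters are illustrative.

```python
from sklearn.svm import LinearSVR, SVR

linear_reg = LinearSVR(epsilon=1.5)                             # linear epsilon-insensitive regression
kernel_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)   # kernelized, non-linear regression
# linear_reg.fit(X_train, y_train); kernel_reg.fit(X_train, y_train)
```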
