Overview
$$ \begin{aligned} wx_i + b &\ge 1 \text{ when } y_i = 1\\ wx_i + b &\le -1 \text{ when } y_i = -1\\ \rightarrow y_i(wx_i + b) &\ge 1 \end{aligned}
$$
$$ \begin{aligned} wx^+ + b &= 1\\ wx^- + b &= -1\\ \rightarrow w(x^+ - x^-) &= 2\\ M &= \frac{(x^+ - x^-) \cdot w}{|w|} = \frac{2}{|w|} \end{aligned}
$$
- Maximizing $$M$$ is equivalent to minimizing $$\frac{1}{2}w^Tw$$ subject to $$y_i(wx_i + b) \ge 1$$ for all $$i$$
- This is a quadratic optimization problem; solve for $$w$$ and $$b$$ (see the sketch below)
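A minimal sketch of this primal problem on a hypothetical toy dataset, using `cvxpy` as an assumed-available QP solver (the data values are made up purely for illustration):

```python
import numpy as np
import cvxpy as cp

# Hypothetical toy data: two linearly separable classes
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

n, d = X.shape
w = cp.Variable(d)
b = cp.Variable()

# minimize (1/2) w^T w   subject to   y_i (w x_i + b) >= 1 for all i
problem = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(w)),
    [cp.multiply(y, X @ w + b) >= 1],
)
problem.solve()
print("w =", w.value, "b =", b.value)
```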
Solving the Quadratic Optimization Problem
Dual Problem
Find Lagrange multiplier $$\alpha$$ s.t.
- $$Q(\alpha) = \sum \alpha_i - \frac{1}{2}\sum\sum \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$ is maximized, where $$K(x_i, x_j) = x_i^Tx_j$$ in the linear case (general kernels below)
- $$\sum \alpha_i y_i = 0$$
- $$\alpha_i \ge 0$$
Solution
- $$w = \sum \alpha_i y_i x_i$$
- $$b = y_k - w^T x_k$$ for any $$x_k$$ s.t. $$\alpha_k \ne 0$$ (i.e. $$x_k$$ is a support vector); see the dual sketch below
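A sketch of the dual route on the same hypothetical toy data, again assuming `cvxpy`. For the linear kernel the double sum $$\sum\sum \alpha_i \alpha_j y_i y_j x_i^Tx_j$$ equals $$|\sum \alpha_i y_i x_i|^2$$, which is how the objective is written here:

```python
import numpy as np
import cvxpy as cp

# Same hypothetical toy data as above
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

alpha = cp.Variable(n)
# Q(alpha) = sum(alpha) - 1/2 * sum_ij alpha_i alpha_j y_i y_j x_i^T x_j;
# for the linear kernel the quadratic term is ||sum_i alpha_i y_i x_i||^2
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(y, alpha)))
constraints = [alpha >= 0, cp.sum(cp.multiply(y, alpha)) == 0]  # alpha_i >= 0, sum alpha_i y_i = 0
cp.Problem(objective, constraints).solve()

a = alpha.value
w = (a * y) @ X            # w = sum_i alpha_i y_i x_i
k = int(np.argmax(a))      # index of a support vector (alpha_k > 0)
b = y[k] - w @ X[k]        # b = y_k - w^T x_k
print("alpha =", np.round(a, 3), "w =", w, "b =", b)
```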
Classifying Function
- $$f(x) = wx + b = \sum \alpha_i y_i K(x_i, x) + b$$ (solving for $$\alpha$$ involves inner products between all pairs of training points; evaluating $$f(x)$$ only needs inner products between $$x$$ and the support vectors, as in the sketch below)
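A sketch of the resulting classifier, reusing `X`, `y`, `a` (the $$\alpha_i$$), and `b` from the dual sketch above; the kernel function and the `1e-6` support-vector threshold are illustrative choices:

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj

def decision_function(x, X, y, a, b, K=linear_kernel, tol=1e-6):
    """f(x) = sum_i alpha_i y_i K(x_i, x) + b, summing only over support vectors."""
    support = np.where(a > tol)[0]          # indices with alpha_i > 0
    return sum(a[i] * y[i] * K(X[i], x) for i in support) + b

# Predicted label is the sign of f(x), e.g.:
# np.sign(decision_function(np.array([1.0, 1.5]), X, y, a, b))
```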
Soft Margin Classification
Introduce slack variables $$\epsilon_i$$ to allow for noisy or mislabeled examples.
- Minimize $$\frac{1}{2}w^Tw + C \sum_{k=1}^R \epsilon_k$$, where $$y_i(wx_i + b) \ge 1 - \epsilon_i$$, $$\epsilon_i \ge 0$$ for all $$i$$
- $$C$$ controls overfitting: it trades margin width against total slack (larger $$C$$ penalizes violations more heavily, smaller $$C$$ allows a wider margin); see the sketch below
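In practice the soft-margin problem is usually solved with a library; a minimal sketch using scikit-learn's `SVC` (assumed available), where `C` plays exactly the role described above and the "noisy" point is invented for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data with one mislabeled point that the slack variables absorb
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0], [2.5, 2.5]])
y = np.array([1, 1, -1, -1, -1])              # last label is "noise"

# Smaller C -> wider margin, more slack allowed; larger C -> fits training data harder
clf = SVC(kernel="linear", C=0.1).fit(X, y)
print(clf.support_)                           # indices of the support vectors
print(clf.coef_, clf.intercept_)              # w and b (available for the linear kernel)
```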
Non-Linear SVM
Kernel Function
- Linear classifier: $$K(x_i, x_j) = x_i^Tx_j$$
- With mapping $$\Phi: x \rightarrow \phi(x)$$, $$K(x_i, x_j) = \phi(x_i)^T\phi(x_j)$$
- Polynomial of power $$p$$: $$K(x_i, x_j) = (1 + x_i^Tx_j)^p$$
- Gaussian kernel: $$K(x_i, x_j) = e^{-\frac{|x_i - x_j|^2}{2\sigma^2}}$$
- Sigmoid: $$K(x_i, x_j) = \tanh(\beta_0 x_i^Tx_j + \beta_1)$$
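The kernels above written out in NumPy, as a sketch; the parameter defaults (`p`, `sigma`, `beta0`, `beta1`) are placeholders, not recommended values:

```python
import numpy as np

def linear_kernel(xi, xj):
    return xi @ xj                                        # x_i^T x_j

def polynomial_kernel(xi, xj, p=2):
    return (1.0 + xi @ xj) ** p                           # (1 + x_i^T x_j)^p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(xi, xj, beta0=1.0, beta1=0.0):
    return np.tanh(beta0 * (xi @ xj) + beta1)
```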
Properties
- Flexibility in choosing kernel function
- Sparseness of solution
  - Only support vectors are used in the classifying function
- Able to handle large feature spaces
- Overfitting controlled by soft margin approach
- Always converges to a single global solution (the optimization problem is convex)
- Feature selection