Optimization Objective
Cost Function
$$\min_{\theta}\ C \sum_{i=1}^{m}\left[ y^{(i)} \text{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)})\, \text{cost}_0(\theta^T x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$$
- $$C$$ functions similarly to $$\frac{1}{\lambda}$$
- The $$\frac{1}{m}$$ factor is dropped; scaling the objective by a positive constant does not change the minimizing $$\theta$$
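A minimal NumPy sketch of this objective, assuming the hinge-style surrogates $$\text{cost}_1$$ and $$\text{cost}_0$$ from the lectures (flat beyond margins of $$\pm 1$$, linear otherwise); the function names and array shapes are assumptions.

```python
import numpy as np

def cost1(z):
    # Surrogate for y = 1: zero once z >= 1, grows linearly as z decreases.
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    # Surrogate for y = 0: zero once z <= -1, grows linearly as z increases.
    return np.maximum(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    # X: (m, n) design matrix, y: (m,) labels in {0, 1}, theta: (n,) parameters.
    z = X @ theta                               # theta^T x^(i) for every example
    data_term = C * np.sum(y * cost1(z) + (1 - y) * cost0(z))
    reg_term = 0.5 * np.sum(theta ** 2)         # (1/2) * sum_j theta_j^2
    return data_term + reg_term
```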
Hypothesis
$$h_{\theta}(x) = \begin{cases} 1 & \text{if } \theta^T x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
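The corresponding prediction, continuing the sketch above:

```python
def predict(theta, X):
    # h_theta(x) = 1 if theta^T x >= 0, else 0, applied to every row of X.
    return (X @ theta >= 0).astype(int)
```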
Large Margin Classifier
Kernels
Given $$x$$, compute new features depending on its proximity to landmarks $$l^{(1)}, l^{(2)}, \dots$$
$$f_i = \text{similarity}(x, l^{(i)}) = e^{-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}}$$
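A minimal NumPy sketch of this similarity (the function name is mine):

```python
import numpy as np

def gaussian_similarity(x, landmark, sigma2):
    # f = exp(-||x - l||^2 / (2 sigma^2)): close to 1 when x is near the landmark, near 0 when far.
    return np.exp(-np.sum((x - landmark) ** 2) / (2 * sigma2))
```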
Choosing the Landmarks
Choose $$l^{(i)} = x^{(i)}$$ for $$i = 1, \dots, m$$, i.e., place one landmark at each training example.
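A sketch of the resulting feature mapping, with one landmark per training example (the function name and array shapes are assumptions):

```python
import numpy as np

def kernel_features(X, landmarks, sigma2):
    # Pairwise squared distances between each example (rows of X) and each landmark,
    # then f = exp(-d^2 / (2 sigma^2)); result has shape (m, number of landmarks).
    d2 = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma2))

# Landmarks are the training examples themselves: l^(i) = x^(i)
# F = kernel_features(X_train, X_train, sigma2)
```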
SVM Parameters
- $$C$$
- Large: lower bias, higher variance
- Small: higher bias, lower variance
- $$\sigma^2$$
- Large: features vary more smoothly, higher bias, lower variance
- Small: features vary less smoothly, lower bias, higher variance
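A quick numeric illustration of the $$\sigma^2$$ effect (the distance value below is arbitrary): for the same squared distance, a larger $$\sigma^2$$ keeps the similarity closer to 1, i.e., the feature falls off more smoothly as $$x$$ moves away from the landmark.

```python
import numpy as np

d2 = 4.0                                  # an arbitrary squared distance ||x - l||^2
for sigma2 in (0.5, 1.0, 10.0):
    f = np.exp(-d2 / (2 * sigma2))
    print(f"sigma^2 = {sigma2:>4}: f = {f:.4f}")   # larger sigma^2 -> larger f, smoother variation
```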
Using an SVM
- Choose parameter $$C$$
- Choose kernel (similarity function)
- Linear kernel (no kernel)
- Predict $$y = 1$$ if $$\theta^T x \ge 0$$
- Gaussian kernel
- $$f_i = e^{-\frac{\|x-l^{(i)}\|^2}{2\sigma^2}}, \text{ where } l^{(i)} = x^{(i)}$$
- Choose $$\sigma^2$$
- Feature scaling
- Others
- Need to satisfy Mercer's Theorem
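In practice the optimization is delegated to a library. Below is a sketch with scikit-learn (the library choice, the toy data, and the parameter values are assumptions, not part of the notes). scikit-learn writes the RBF kernel as $$e^{-\gamma \|x - l\|^2}$$, so $$\gamma = \frac{1}{2\sigma^2}$$, and feature scaling is applied before fitting; swapping `kernel="rbf"` for `kernel="linear"` gives the no-kernel case.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data: the label depends nonlinearly on the two features, so a linear kernel would underfit.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

sigma2 = 0.5                                    # choose C and sigma^2 with a cross-validation set
clf = make_pipeline(
    StandardScaler(),                           # feature scaling before the Gaussian kernel
    SVC(C=1.0, kernel="rbf", gamma=1.0 / (2 * sigma2)),
)
clf.fit(X, y)
print(clf.score(X, y))                          # training accuracy on the toy data
```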
Multi-Class Classification - One-vs-All Method
Train $$K$$ SVMs, one distinguishing each class $$y = i$$ from the rest, obtaining parameters $$\theta^{(1)}, \dots, \theta^{(K)}$$. Predict the class $$i$$ with the largest $$(\theta^{(i)})^T x$$.
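A sketch of the prediction step, assuming the $$K$$ learned parameter vectors are stacked as rows of a $$K \times n$$ matrix `Theta` (the name and layout are assumptions):

```python
import numpy as np

def one_vs_all_predict(Theta, X):
    # scores[i, k] = (theta^(k))^T x^(i); predict the class with the largest score.
    scores = X @ Theta.T                # shape (m, K)
    return np.argmax(scores, axis=1)
```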
Logistic Regression vs. SVM
Let $$n$$ = # of features, $$m$$ = # of training examples.

| Setting | Recommendation |
| --- | --- |
| $$n$$ large (relative to $$m$$) | Logistic regression, or SVM with linear kernel |
| $$n$$ small (1-1,000), $$m$$ intermediate (10-10,000) | SVM with Gaussian kernel |
| $$n$$ small (1-1,000), $$m$$ large (50,000+) | Create more features, then logistic regression or SVM with linear kernel |
A neural network is likely to work well in all of these settings, but may be slower to train.