Optimization Objective

Cost Function

$$\min_{\theta} \; C \sum_{i=1}^{m} \left[ y^{(i)} \, \text{cost}_1(\theta^T x^{(i)}) + (1 - y^{(i)}) \, \text{cost}_0(\theta^T x^{(i)}) \right] + \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$$

  • $$C$$ functions similarly to $$\frac{1}{\lambda}$$
  • Drop the $$\frac{1}{m}$$ factor (it does not change the minimizing $$\theta$$)
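
A minimal sketch of this objective in Python, assuming the usual hinge-style costs $$\text{cost}_1(z) = \max(0, 1 - z)$$ and $$\text{cost}_0(z) = \max(0, 1 + z)$$ (the piecewise-linear costs the formula refers to):

```python
import numpy as np

def svm_cost(theta, X, y, C):
    """Sketch of the SVM objective above, assuming hinge-style costs.

    X is (m, n+1) with a leading column of ones; theta[0] is the bias; y is 0/1.
    """
    z = X @ theta                              # theta^T x^(i) for every example
    cost1 = np.maximum(0, 1 - z)               # applied where y^(i) = 1
    cost0 = np.maximum(0, 1 + z)               # applied where y^(i) = 0
    data_term = C * np.sum(y * cost1 + (1 - y) * cost0)
    reg_term = 0.5 * np.sum(theta[1:] ** 2)    # sum over j = 1..n (bias not regularized)
    return data_term + reg_term
```
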
Hypothesis

$$h_{\theta}(x) = \begin{cases} 1 & \text{if } \theta^T x \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

Large Margin Classifier

With $$C$$ large, the SVM picks the decision boundary that maximizes the margin (the distance to the nearest training examples), which is why it is called a large margin classifier.

Kernels

Given $$x$$, compute new features based on proximity to landmarks $$l^{(i)}$$.

$$f_i = \text{similarity}(x, l^{(i)}) = e^{-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}}$$
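
A minimal sketch of this similarity (Gaussian kernel) function; the value is 1 when $$x$$ sits on the landmark and approaches 0 as $$x$$ moves away:

```python
import numpy as np

def gaussian_similarity(x, l, sigma2):
    """f = exp(-||x - l||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma2))

print(gaussian_similarity(np.array([1.0, 2.0]), np.array([1.0, 2.0]), 1.0))  # 1.0
```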

Choosing the Landmarks

Choose $$l^{(i)} = x^{(i)}$$, i.e., place one landmark at each of the $$m$$ training examples.
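
A vectorized sketch of the resulting feature mapping; with the landmarks set to the training examples, each example gets $$m$$ features (the random data here is only for illustration):

```python
import numpy as np

def gaussian_features(X, landmarks, sigma2):
    """Map each row of X to f = (f_1, ..., f_k), where f_i = similarity(x, l^(i))."""
    diffs = X[:, None, :] - landmarks[None, :, :]    # (m, k, n) pairwise differences
    sq_dists = np.sum(diffs ** 2, axis=2)            # squared distances ||x - l^(i)||^2
    return np.exp(-sq_dists / (2 * sigma2))          # (m, k) feature matrix

X_train = np.random.rand(5, 2)
F = gaussian_features(X_train, X_train, sigma2=1.0)  # landmarks at the training examples
print(F.shape)                                       # (5, 5): one feature per landmark
```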

SVM Parameters
  • $$C$$
    • Large: lower bias, higher variance
    • Small: higher bias, lower variance
  • $$\sigma^2$$
    • Large: features vary more smoothly, higher bias, lower variance
    • Small: features vary less smoothly, lower bias, higher variance
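
A small numeric illustration of the $$\sigma^2$$ behavior (the point and landmark are arbitrary):

```python
import numpy as np

x = np.array([1.0, 1.0])
l = np.array([3.0, 2.0])                  # a fixed landmark, ||x - l||^2 = 5

for sigma2 in (0.5, 2.0, 10.0):
    f = np.exp(-np.sum((x - l) ** 2) / (2 * sigma2))
    print(f"sigma^2 = {sigma2:4.1f} -> f = {f:.4f}")
# Small sigma^2: f drops off sharply away from the landmark (higher variance).
# Large sigma^2: f decays slowly, so features vary smoothly (higher bias).
```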

Using an SVM

  • Choose parameter $$C$$
  • Choose kernel (similarity function)
    • Linear kernel (no kernel)
      • Predict $$y = 1$$ if $$\theta^T x \ge 0$$
    • Gaussian kernel
      • $$f_i = e^{-\frac{\|x-l^{(i)}\|^2}{2\sigma^2}}, \text{ where } l^{(i)} = x^{(i)}$$
      • Choose $$\sigma^2$$
      • Perform feature scaling before applying the Gaussian kernel (see the sketch after this list)
    • Others
      • Need to satisfy Mercer's Theorem
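
A minimal sketch of these choices using scikit-learn's SVC (an assumption; these notes don't name a package). scikit-learn's RBF kernel is $$e^{-\gamma\|x - l\|^2}$$, so $$\gamma = \frac{1}{2\sigma^2}$$, and the pipeline applies feature scaling first:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler    # feature scaling
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

sigma2 = 1.0
model = make_pipeline(
    StandardScaler(),
    SVC(C=1.0, kernel="rbf", gamma=1.0 / (2 * sigma2)),  # Gaussian kernel
)
model.fit(X, y)
print("training accuracy:", model.score(X, y))

# Linear kernel ("no kernel"): SVC(C=1.0, kernel="linear")
```
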
Multi-Class Classification - One-vs-All Method

Train $$K$$ SVMs, one distinguishing each class $$y = i$$ from the rest, to get parameters $$\theta^{(1)}, \dots, \theta^{(K)}$$. For a new $$x$$, pick the class $$i$$ with the largest $$(\theta^{(i)})^T x$$.
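
A sketch of this recipe with scikit-learn's LinearSVC (which already does one-vs-rest internally for multi-class labels; the manual loop below just mirrors the description above, with `decision_function` playing the role of $$(\theta^{(i)})^T x$$):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

# One binary SVM per class: y == k vs. the rest.
models = [LinearSVC(C=1.0, dual=False).fit(X, (y == k).astype(int)) for k in classes]

# Score each class with its classifier's decision value and pick the largest.
scores = np.column_stack([m.decision_function(X) for m in models])
pred = classes[np.argmax(scores, axis=1)]
print("training accuracy:", np.mean(pred == y))
```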

Logistic Regression vs. SVM

Let $$n$$ = number of features, $$m$$ = number of training examples.

  • $$n$$ larger than $$m$$
    • Logistic regression or SVM with a linear kernel
  • $$n$$ small (1-1000), $$m$$ intermediate (10-10000)
    • SVM with a Gaussian kernel
  • $$n$$ small (1-1000), $$m$$ large (50000+)
    • Create more features, then
    • Logistic regression or SVM with a linear kernel

Neural networks are likely to work well in all of these regimes, but are slower to train.
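
A hypothetical helper that encodes this rule of thumb; the thresholds are the rough ranges from these notes, not hard rules:

```python
def suggest_model(n_features, m_examples):
    """Hypothetical helper encoding the rule of thumb above."""
    if n_features >= m_examples:
        return "logistic regression or SVM with a linear kernel"
    if m_examples <= 10_000:
        return "SVM with a Gaussian kernel"
    return "create more features, then logistic regression or linear-kernel SVM"

print(suggest_model(10_000, 500))     # n larger than m
print(suggest_model(100, 5_000))      # n small, m intermediate
print(suggest_model(100, 100_000))    # n small, m large
```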
