Stochastic Gradient Descent

  1. Randomly shuffle training samples
  2. Repeat
    1. For each training sample $$i := 1..m, j := 0..n$$:
      1. $$\theta_j := \theta_j - \alpha (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}$$
  3. Learning rate $$\alpha$$ usually held constant
    1. Can decrease over time, e.g. $$\alpha = \frac{\text{const}_1}{\text{iterationNum} + \text{const}_2}$$, but the extra constants may be finicky to tune
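
The loop above can be sketched in NumPy for linear regression; the data, constants, and learning rate here are all illustrative:

```python
import numpy as np

# Synthetic linear-regression data (all names and constants are illustrative);
# X includes a bias column x_0 = 1.
rng = np.random.default_rng(0)
m, n = 100, 2
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, n))])
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta

theta = np.zeros(n + 1)
alpha = 0.01  # learning rate, held constant
for epoch in range(50):
    # 1. Randomly shuffle training samples
    for i in rng.permutation(m):
        # 2. Update every theta_j using just this one example:
        #    theta_j := theta_j - alpha * (h_theta(x^(i)) - y^(i)) * x_j^(i)
        error = X[i] @ theta - y[i]
        theta -= alpha * error * X[i]
```

Note that the whole parameter vector is updated at once per example, which is exactly the simultaneous update over $$j := 0..n$$.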

Checking for Convergence

  1. During learning, compute $$cost(\theta, (x^{(i)}, y^{(i)})) = \frac{1}{2}(h_{\theta}(x^{(i)}) - y^{(i)})^2$$ before updating $$\theta$$ with $$(x^{(i)}, y^{(i)})$$
  2. Every 1000 iterations (say), plot $$cost(\theta, (x^{(i)}, y^{(i)}))$$ averaged over the last 1000 examples processed
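A minimal sketch of this bookkeeping during one SGD pass (the data and window size are illustrative; in practice you would plot `averaged`):

```python
import numpy as np

# Illustrative stream of m examples for one SGD pass.
rng = np.random.default_rng(1)
m = 5000
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 1))])
y = X @ np.array([0.5, -1.0]) + rng.normal(scale=0.1, size=m)

theta = np.zeros(2)
alpha, window = 0.01, 1000
costs, averaged = [], []
for i in range(m):
    # cost(theta, (x^(i), y^(i))) computed BEFORE the update
    err = X[i] @ theta - y[i]
    costs.append(0.5 * err ** 2)
    theta -= alpha * err * X[i]
    if (i + 1) % window == 0:
        # one point of the convergence plot: mean cost over the last window
        averaged.append(float(np.mean(costs[-window:])))
```

If the averaged curve trends downward, SGD is converging; if it oscillates or rises, decrease $$\alpha$$.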

Mini-Batch Gradient Descent

  1. Mini-batch size $$b$$
  2. Repeat
    1. For each mini-batch starting at $$i := 1, (1+b), (1+2b), \dots$$ up to $$m$$, $$j := 0..n$$:
      1. $$\theta_j := \theta_j - \alpha \frac{1}{b}\sum_{k=i}^{i+b-1}(h_{\theta}(x^{(k)}) - y^{(k)}) x_j^{(k)}$$
Benefits
  1. Can win over batch gradient descent for very large datasets
    1. Each update is computationally cheaper
    2. Does not need to load the entire dataset into memory before making progress
  2. Can win over stochastic gradient descent because the sum over $$b$$ examples can be vectorized
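
The vectorization benefit shows up directly in code: each mini-batch update is one matrix-vector product instead of $$b$$ separate updates. A sketch with illustrative data (here $$b$$ divides $$m$$ evenly):

```python
import numpy as np

# Illustrative data; b is the mini-batch size and divides m.
rng = np.random.default_rng(2)
m, b = 200, 10
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 2))])
y = X @ np.array([1.0, -2.0, 0.5])

theta = np.zeros(3)
alpha = 0.05
for epoch in range(100):
    for i in range(0, m, b):
        Xb, yb = X[i:i + b], y[i:i + b]
        # one vectorized update: gradient averaged over the b examples,
        # i.e. (1/b) * sum_k (h_theta(x^(k)) - y^(k)) * x^(k)
        theta -= alpha * Xb.T @ (Xb @ theta - yb) / b
```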

Online Learning

  1. Repeat forever
    1. Get $$(x, y)$$
    2. Update $$\theta$$ using $$(x, y)$$
Benefits
  1. Can adapt to changing user preferences, since recent examples dominate
  2. Each sample can be discarded after the update, so no training set needs to be stored
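
The loop can be sketched as follows; `get_example` stands in for a hypothetical live data stream, and the constants are illustrative:

```python
import numpy as np

# Hypothetical endless stream of (x, y) pairs; get_example is illustrative.
rng = np.random.default_rng(3)
true_theta = np.array([0.2, 0.7, -0.4])

def get_example():
    x = np.hstack([1.0, rng.normal(size=2)])  # bias term plus two features
    return x, x @ true_theta

theta = np.zeros(3)
alpha = 0.1
for _ in range(2000):         # "repeat forever", truncated for the sketch
    x, y = get_example()      # 1. Get (x, y)
    theta -= alpha * (x @ theta - y) * x  # 2. Update theta; (x, y) is then discarded
```

Because each example is used once and dropped, memory stays constant no matter how long the stream runs.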
