Stochastic Gradient Descent
- Randomly shuffle training samples
- Repeat
- For each training sample $$i := 1..m$$ and each parameter $$j := 0..n$$:
- $$\theta_j := \theta_j - \alpha (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}$$
- Learning rate $$\alpha$$ usually held constant
- Can decrease over time, e.g. $$\alpha = \frac{\text{const}_1}{\text{iterationNum} + \text{const}_2}$$; may be finicky to tune
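A minimal NumPy sketch of this update loop (a linear-regression hypothesis $$h_{\theta}(x) = \theta^T x$$ is assumed; the function name, data shapes, and default hyperparameters are illustrative):

```python
import numpy as np

def sgd(X, y, alpha=0.01, epochs=10, seed=0):
    """Stochastic gradient descent for linear regression.

    X: (m, n+1) design matrix with a leading column of ones (x_0 = 1).
    y: (m,) vector of targets.
    """
    rng = np.random.default_rng(seed)
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(epochs):
        # Randomly shuffle the training samples before each pass.
        for i in rng.permutation(m):
            # Update every theta_j using only the i-th example:
            # theta_j := theta_j - alpha * (h_theta(x^(i)) - y^(i)) * x_j^(i)
            error = X[i] @ theta - y[i]
            theta -= alpha * error * X[i]
            # Optionally decay alpha instead of holding it constant,
            # e.g. alpha = const1 / (iteration_num + const2).
    return theta
```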
Checking for Convergence
- During learning, compute $$cost(\theta, (x^{(i)}, y^{(i)}))$$ before updating $$\theta$$
- Every few iterations (say 1000), plot $$cost(\theta, (x^{(i)}, y^{(i)}))$$ averaged over the examples processed since the last plot point (e.g. the last 1000)
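A sketch of this convergence check, built on the SGD loop above (the per-example cost is taken to be the squared error $$\frac{1}{2}(h_{\theta}(x^{(i)}) - y^{(i)})^2$$; the plotting details are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def sgd_with_cost_trace(X, y, alpha=0.01, epochs=1, window=1000, seed=0):
    """SGD that records cost(theta, (x^(i), y^(i))) BEFORE each update,
    then plots the cost averaged over every `window` consecutive examples."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    costs, averages = [], []
    for _ in range(epochs):
        for i in rng.permutation(m):
            error = X[i] @ theta - y[i]
            costs.append(0.5 * error ** 2)   # cost on this example, before updating theta
            theta -= alpha * error * X[i]
            if len(costs) % window == 0:
                averages.append(np.mean(costs[-window:]))
    plt.plot(averages)                       # should trend downward if SGD is converging
    plt.xlabel(f"blocks of {window} iterations")
    plt.ylabel("average cost")
    plt.show()
    return theta
```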
Mini-Batch Gradient Descent
- Mini-batch size $$b$$
- Repeat
- For each mini-batch, i.e. $$i := 1, (1+b), (1+2b), \dots$$ up to $$m$$, and each parameter $$j := 0..n$$:
- $$\theta_j := \theta_j - \alpha \frac{1}{b}\sum_{k=i}^{i+b-1}(h_{\theta}(x^{(k)}) - y^{(k)}) x_j^{(k)}$$
Benefits
- Can win over batch gradient descent for very large datasets
- Computationally less expensive
- Does not need to load the entire large dataset before running
- Can win over stochastic gradient descent because the per-batch update can be vectorized (see the sketch below)
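A vectorized mini-batch sketch (again assuming the linear-regression hypothesis; batch size, shuffling, and names are illustrative). Each update computes the gradient over $$b$$ examples with one matrix operation instead of $$b$$ separate scalar updates:

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.01, b=10, epochs=10, seed=0):
    """Mini-batch gradient descent: average the gradient over b examples
    per update, computed as a single vectorized matrix product."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)
        for start in range(0, m, b):
            batch = order[start:start + b]
            Xb, yb = X[batch], y[batch]
            # Vectorized form of (1/b) * sum_k (h_theta(x^(k)) - y^(k)) * x^(k)
            grad = Xb.T @ (Xb @ theta - yb) / len(batch)
            theta -= alpha * grad
    return theta
```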
Online Learning
- Repeat forever
- Get $$(x, y)$$
- Update $$\theta$$ using $$(x, y)$$
Benefits
- Can adapt to changing user preferences
- After training on a sample, it can be discarded
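A sketch of the online-learning loop (a logistic-regression hypothesis for a binary outcome is assumed here; the stream interface is hypothetical). Each incoming $$(x, y)$$ pair updates $$\theta$$ once and is then discarded:

```python
import numpy as np

def online_learning(stream, n_features, alpha=0.01):
    """Online learning: update theta from each incoming (x, y) pair,
    then let the pair go; the dataset is never stored.

    `stream` is any iterable yielding (x, y), with x of shape (n_features,)
    (including x_0 = 1) and y in {0, 1}.
    """
    theta = np.zeros(n_features)
    for x, y in stream:                           # "repeat forever" over a live stream
        h = 1.0 / (1.0 + np.exp(-(x @ theta)))    # sigmoid hypothesis h_theta(x)
        theta -= alpha * (h - y) * x              # same gradient-step form as SGD
        # (x, y) is discarded after the update, so the model can track
        # shifting user preferences over time.
    return theta
```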