Stochastic Gradient Descent
- Randomly shuffle training samples
- Repeat
- For each training sample $$i := 1..m$$ and each parameter $$j := 0..n$$:
- $$\theta_j := \theta_j - \alpha (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}$$
- Learning rate $$\alpha$$ usually held constant
- Can decrease over time, e.g. $$\alpha = \frac{\text{const}_1}{\text{iterationNum} + \text{const}_2}$$; may be finicky to tune
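A minimal NumPy sketch of this update loop (a linear-regression hypothesis $$h_{\theta}(x) = \theta^T x$$ is assumed; the function name, data shapes, and default hyperparameters are illustrative):

```python
import numpy as np

def sgd(X, y, alpha=0.01, epochs=10, seed=0):
    """Stochastic gradient descent for linear regression.

    X: (m, n+1) design matrix with a leading column of ones (x_0 = 1).
    y: (m,) vector of targets.
    """
    rng = np.random.default_rng(seed)
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(epochs):
        # Randomly shuffle the training samples before each pass.
        for i in rng.permutation(m):
            # Update every theta_j using only the i-th example:
            # theta_j := theta_j - alpha * (h_theta(x^(i)) - y^(i)) * x_j^(i)
            error = X[i] @ theta - y[i]
            theta -= alpha * error * X[i]
            # Optionally decay alpha instead of holding it constant,
            # e.g. alpha = const1 / (iteration_num + const2).
    return theta
```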
Checking for Convergence
- During learning, compute $$cost(\theta, (x^{(i)}, y^{(i)}))$$ before updating $$\theta$$
- Every few iterations (say 1000), plot $$cost(\theta, (x^{(i)}, y^{(i)}))$$ averaged over the examples processed since the last plot point (e.g. the last 1000)
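A sketch of this convergence check, built on the SGD loop above (the per-example cost is taken to be the squared error $$\frac{1}{2}(h_{\theta}(x^{(i)}) - y^{(i)})^2$$; the plotting details are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def sgd_with_cost_trace(X, y, alpha=0.01, epochs=1, window=1000, seed=0):
    """SGD that records cost(theta, (x^(i), y^(i))) BEFORE each update,
    then plots the cost averaged over every `window` consecutive examples."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    costs, averages = [], []
    for _ in range(epochs):
        for i in rng.permutation(m):
            error = X[i] @ theta - y[i]
            costs.append(0.5 * error ** 2)   # cost on this example, before updating theta
            theta -= alpha * error * X[i]
            if len(costs) % window == 0:
                averages.append(np.mean(costs[-window:]))
    plt.plot(averages)                       # should trend downward if SGD is converging
    plt.xlabel(f"blocks of {window} iterations")
    plt.ylabel("average cost")
    plt.show()
    return theta
```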
Mini-Batch Gradient Descent
- Mini-batch size $$b$$
- Repeat
- For each mini-batch, i.e. $$i := 1, (1+b), (1+2b), \dots$$ up to $$m$$, and each parameter $$j := 0..n$$:
- $$\theta_j := \theta_j - \alpha \frac{1}{b}\sum_{k=i}^{i+b-1}(h_{\theta}(x^{(k)}) - y^{(k)}) x_j^{(k)}$$
Benefits
- Can win over batch gradient descent for very large datasets
- Computationally less expensive
- Does not need to load the entire large dataset before running
- Can win over stochastic gradient descent because the per-batch update can be vectorized (see the sketch below)
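A vectorized mini-batch sketch (again assuming the linear-regression hypothesis; batch size, shuffling, and names are illustrative). Each update computes the gradient over $$b$$ examples with one matrix operation instead of $$b$$ separate scalar updates:

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.01, b=10, epochs=10, seed=0):
    """Mini-batch gradient descent: average the gradient over b examples
    per update, computed as a single vectorized matrix product."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = rng.permutation(m)
        for start in range(0, m, b):
            batch = order[start:start + b]
            Xb, yb = X[batch], y[batch]
            # Vectorized form of (1/b) * sum_k (h_theta(x^(k)) - y^(k)) * x^(k)
            grad = Xb.T @ (Xb @ theta - yb) / len(batch)
            theta -= alpha * grad
    return theta
```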
Online Learning
- Repeat forever
- Get $$(x, y)$$
- Update $$\theta$$ using $$(x, y)$$
Benefits
- Can adapt to changing user preferences
- After training on a sample, it can be discarded
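A sketch of the online-learning loop (a logistic-regression hypothesis for a binary outcome is assumed here; the stream interface is hypothetical). Each incoming $$(x, y)$$ pair updates $$\theta$$ once and is then discarded:

```python
import numpy as np

def online_learning(stream, n_features, alpha=0.01):
    """Online learning: update theta from each incoming (x, y) pair,
    then let the pair go; the dataset is never stored.

    `stream` is any iterable yielding (x, y), with x of shape (n_features,)
    (including x_0 = 1) and y in {0, 1}.
    """
    theta = np.zeros(n_features)
    for x, y in stream:                           # "repeat forever" over a live stream
        h = 1.0 / (1.0 + np.exp(-(x @ theta)))    # sigmoid hypothesis h_theta(x)
        theta -= alpha * (h - y) * x              # same gradient-step form as SGD
        # (x, y) is discarded after the update, so the model can track
        # shifting user preferences over time.
    return theta
```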