Perceptron
Based on one sample at a time.
$$ z = w_0x_0 + w_1x_1 + \dots + w_mx_m = w^T x
$$
where $$w_0 = -\theta$$, $$x_0 = 1$$.
$$ \phi(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ -1 & \text{otherwise} \end{cases}
$$
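A minimal NumPy sketch of the net input and threshold activation, with the bias absorbed as $$w_0$$ and a constant $$x_0 = 1$$ (the variable names and example values are illustrative, not from the source):

```python
import numpy as np

def net_input(w, x):
    """Net input z = w^T x, with w[0] acting as the bias (w_0 = -theta, x_0 = 1)."""
    return w[0] + np.dot(w[1:], x)

def predict(w, x):
    """Threshold (unit step) activation: +1 if z >= 0, otherwise -1."""
    return np.where(net_input(w, x) >= 0.0, 1, -1)

# Illustrative values only
w = np.array([-0.5, 0.2, 0.4])   # w[0] is the bias term
x = np.array([1.0, 1.0])
print(predict(w, x))             # -> 1, since z = -0.5 + 0.2 + 0.4 = 0.1 >= 0
```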
Learning Algorithm
Rosenblatt's Perceptron Rule
- Initialize the weights to 0 or small random numbers
- For each training pair $$(x^{(i)}, y^{(i)})$$:
    - Compute the output $$\hat{y}^{(i)} = \phi(w^T x^{(i)})$$
    - Update the weights: $$w_j = w_j + \Delta w_j$$, where $$\Delta w_j = \eta (y^{(i)} - \hat{y}^{(i)})x_j^{(i)}$$
Convergence (to zero misclassifications) is guaranteed only if the two classes are linearly separable; otherwise the weights never stop updating.
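A sketch of Rosenblatt's rule applied to a toy dataset (the learning rate, epoch count, and data are made-up placeholders):

```python
import numpy as np

def perceptron_fit(X, y, eta=0.1, n_epochs=10):
    """Train a perceptron: w_j <- w_j + eta * (y - y_hat) * x_j, one sample at a time."""
    w = np.zeros(X.shape[1] + 1)              # w[0] is the bias (w_0 = -theta)
    for _ in range(n_epochs):
        for xi, target in zip(X, y):
            y_hat = 1 if (w[0] + np.dot(w[1:], xi)) >= 0.0 else -1
            update = eta * (target - y_hat)   # 0 if the sample is correctly classified
            w[1:] += update * xi
            w[0] += update                    # x_0 = 1
    return w

# Toy linearly separable data (illustrative only)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(perceptron_fit(X, y))
```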
Simplification
For a correctly classified sample, $$y^{(i)} = \hat{y}^{(i)}$$, so $$\Delta w_j = 0$$. For a misclassified sample, $$\hat{y}^{(i)} = -y^{(i)}$$, so
$$ \Delta w_j = \eta (y^{(i)} - \hat{y}^{(i)}) x_j^{(i)} = 2\eta\, y^{(i)} x_j^{(i)}
$$
i.e. the update is applied only to misclassified input samples (the factor 2 can be absorbed into $$\eta$$).
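As a quick numeric check with made-up values ($$\eta = 0.1$$, a misclassified sample with $$y^{(i)} = 1$$, $$\hat{y}^{(i)} = -1$$, $$x^{(i)} = (1, 2)^T$$):
$$ \Delta w = \eta (y^{(i)} - \hat{y}^{(i)}) x^{(i)} = 0.1 \cdot 2 \cdot \begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 0.2 \\ 0.4 \end{pmatrix}
$$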
Adaline
Based on all samples, i.e. batch.
Linear activation function:
$$ \phi(w^T x) = w^Tx
$$
Cost function (Sum of Squared Errors, SSE):
$$ J(w) = \frac{1}{2} \sum_i (y^{(i)} - \phi(z^{(i)}))^2
$$
which is based on all samples.
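A minimal sketch of this cost in NumPy, assuming the bias is stored in w[0] as above (names are illustrative):

```python
import numpy as np

def sse_cost(w, X, y):
    """J(w) = 1/2 * sum_i (y_i - phi(z_i))^2 with the linear activation phi(z) = z."""
    output = X.dot(w[1:]) + w[0]   # phi(z^{(i)}) = w^T x^{(i)} for all samples at once
    errors = y - output
    return 0.5 * np.sum(errors ** 2)
```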
Learning Algorithm
Widrow-Hoff Rule
Since $$J(w)$$ is differentiable and convex, gradient descent can be applied to find the (global) minimum.
$$ \Delta w_j = -\eta \frac{\partial J}{\partial w_j} = \eta \sum_i (y^{(i)} - \phi(z^{(i)})) x_j^{(i)}
$$
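One full-batch Widrow-Hoff step might look like the following sketch (eta and the data layout are assumptions); note that the weight change follows the negative gradient:

```python
import numpy as np

def adaline_batch_step(w, X, y, eta=0.01):
    """One gradient-descent step: w <- w - eta * dJ/dw = w + eta * sum_i (y_i - phi(z_i)) x_i."""
    output = X.dot(w[1:]) + w[0]      # linear activation phi(z) = w^T x
    errors = y - output               # (y^{(i)} - phi(z^{(i)})) for all i
    w = w.copy()
    w[1:] += eta * X.T.dot(errors)    # sum over all samples
    w[0] += eta * errors.sum()        # gradient w.r.t. the bias (x_0 = 1)
    return w
```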
Overshoot: with too large a learning rate, a gradient step can overshoot the minimum, and the cost may increase instead of decrease.
Any other error function that is continuous and differentiable can be used instead; gradient descent is then used to find the absolute minimum. The error typically cannot reach 0, since the subtrahend is the continuous activation $$\phi(z^{(i)})$$ rather than the thresholded prediction $$\hat{y}^{(i)}$$, so it does not equal the $$\pm 1$$ label exactly even when every sample is classified correctly.
Proof
$$ \begin{aligned} \frac{\partial J}{\partial w_j} &= \frac{\partial}{\partial w_j}\frac{1}{2} \sum_i (y^{(i)} - \phi(z^{(i)}))^2 \\ &= \frac{1}{2}\frac{\partial}{\partial w_j}\sum_i (y^{(i)} - \phi(z^{(i)}))^2 \\ &= \frac{1}{2}\sum_i 2(y^{(i)} - \phi(z^{(i)}))\frac{\partial}{\partial w_j}(y^{(i)} - \phi(z^{(i)})) \\ &= \sum_i (y^{(i)} - \phi(z^{(i)}))\frac{\partial}{\partial w_j}\Big(y^{(i)} - \sum_k w_k x_k^{(i)}\Big) \\ &= \sum_i (y^{(i)} - \phi(z^{(i)}))(-x_j^{(i)}) \\ &= -\sum_i(y^{(i)} - \phi(z^{(i)}))x_j^{(i)} \end{aligned}
$$
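The analytic gradient can be sanity-checked against a central finite-difference approximation; the following is an illustrative sketch with random data, not part of the original derivation:

```python
import numpy as np

def sse_cost(w, X, y):
    output = X.dot(w[1:]) + w[0]
    return 0.5 * np.sum((y - output) ** 2)

def analytic_grad(w, X, y):
    """dJ/dw_j = -sum_i (y_i - phi(z_i)) x_j^(i); first entry is the bias component."""
    errors = y - (X.dot(w[1:]) + w[0])
    return np.concatenate(([-errors.sum()], -X.T.dot(errors)))

# Finite-difference check on random data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
w = rng.normal(size=4)
eps = 1e-6
num_grad = np.array([
    (sse_cost(w + eps * e, X, y) - sse_cost(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(len(w))
])
print(np.allclose(num_grad, analytic_grad(w, X, y), atol=1e-5))   # -> True
```

Because $$J(w)$$ is quadratic in $$w$$, the central difference is exact up to floating-point error, so the check should print True.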
Improvements
Normalization
Can use larger learning rate, since less likely to overshoot -> converge faster.
$$ x'_j = \frac{x_j - \mu_j}{\sigma_j}
$$
where $$\mu_j$$ and $$\sigma_j$$ are the mean and standard deviation of feature $$j$$.
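A sketch of per-feature standardization (in practice the mean and standard deviation would be computed on the training data only):

```python
import numpy as np

def standardize(X):
    """Scale each feature to zero mean and unit variance: x'_j = (x_j - mu_j) / sigma_j."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma
```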
Variations
Batch Adaline
Each step takes the entire training dataset.
- Guaranteed convergence to the weight set with minimum squared error
    - With a sufficiently small learning rate
    - Even when the training data is not linearly separable
Incremental Adaline
Approximates the complete gradient by its estimate from a single sample -> stochastic or online gradient descent (a sketch follows the list below).
$$ \Delta w = \eta (y^{(i)} - \phi(z^{(i)})) x^{(i)}
$$
- Often converges faster, since the weights are updated more frequently
- The training set should be shuffled for every epoch
- Never reaches the exact minimum of the cost function
    - Oscillates in the vicinity of the minimum
    - The oscillation can be reduced by using an adaptive learning rate
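A sketch of one such epoch, with reshuffling and a per-sample update (the learning rate and data layout are assumptions):

```python
import numpy as np

def adaline_sgd_epoch(w, X, y, eta=0.01, rng=None):
    """One incremental-Adaline (SGD) epoch: shuffle, then update w after each sample."""
    if rng is None:
        rng = np.random.default_rng()
    w = w.copy()
    for i in rng.permutation(len(y)):             # reshuffle for every epoch
        error = y[i] - (X[i].dot(w[1:]) + w[0])   # y^{(i)} - phi(z^{(i)})
        w[1:] += eta * error * X[i]
        w[0] += eta * error
    return w
```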