Perceptron
Based on one sample at a time.
$$ z = w_0x_0 + w_1x_1 + \dots + w_mx_m = w^T x
$$
where $$w_0 = -\theta$$, $$x_0 = 1$$.
$$ \phi(z) = \begin{cases} 1 & \text{if } z \ge 0 \\ -1 & \text{otherwise} \end{cases}
$$
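A minimal NumPy sketch of the net input and threshold activation, with the bias absorbed as $$w_0$$ and a constant $$x_0 = 1$$ (the variable names and example values are illustrative, not from the source):

```python
import numpy as np

def net_input(w, x):
    """Net input z = w^T x, with w[0] acting as the bias (w_0 = -theta, x_0 = 1)."""
    return w[0] + np.dot(w[1:], x)

def predict(w, x):
    """Threshold (unit step) activation: +1 if z >= 0, otherwise -1."""
    return np.where(net_input(w, x) >= 0.0, 1, -1)

# Illustrative values only
w = np.array([-0.5, 0.2, 0.4])   # w[0] is the bias term
x = np.array([1.0, 1.0])
print(predict(w, x))             # -> 1, since z = -0.5 + 0.2 + 0.4 = 0.1 >= 0
```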
Learning Algorithm
Rosenblatt's Perceptron Rule
- Initialize the weights to 0 or small random numbers
- For each training pair $$(x^{(i)}, y^{(i)})$$:
    - Compute the output $$\hat{y}^{(i)} = \phi(w^T x^{(i)})$$
    - Update the weights: $$w_j = w_j + \Delta w_j$$, where $$\Delta w_j = \eta (y^{(i)} - \hat{y}^{(i)})x_j^{(i)}$$
Convergence (to zero misclassifications) is guaranteed only if the two classes are linearly separable; otherwise the weights never stop updating.
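A sketch of Rosenblatt's rule applied to a toy dataset (the learning rate, epoch count, and data are made-up placeholders):

```python
import numpy as np

def perceptron_fit(X, y, eta=0.1, n_epochs=10):
    """Train a perceptron: w_j <- w_j + eta * (y - y_hat) * x_j, one sample at a time."""
    w = np.zeros(X.shape[1] + 1)              # w[0] is the bias (w_0 = -theta)
    for _ in range(n_epochs):
        for xi, target in zip(X, y):
            y_hat = 1 if (w[0] + np.dot(w[1:], xi)) >= 0.0 else -1
            update = eta * (target - y_hat)   # 0 if the sample is correctly classified
            w[1:] += update * xi
            w[0] += update                    # x_0 = 1
    return w

# Toy linearly separable data (illustrative only)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(perceptron_fit(X, y))
```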
Simplification
For a correctly classified sample, $$y^{(i)} = \hat{y}^{(i)}$$, so $$\Delta w_j = 0$$. For a misclassified sample, $$\hat{y}^{(i)} = -y^{(i)}$$, so
$$ \Delta w_j = \eta (y^{(i)} - \hat{y}^{(i)}) x_j^{(i)} = 2\eta\, y^{(i)} x_j^{(i)}
$$
i.e. the update is applied only to misclassified input samples (the factor 2 can be absorbed into $$\eta$$).
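As a quick numeric check with made-up values ($$\eta = 0.1$$, a misclassified sample with $$y^{(i)} = 1$$, $$\hat{y}^{(i)} = -1$$, $$x^{(i)} = (1, 2)^T$$):
$$ \Delta w = \eta (y^{(i)} - \hat{y}^{(i)}) x^{(i)} = 0.1 \cdot 2 \cdot \begin{pmatrix} 1 \\ 2 \end{pmatrix} = \begin{pmatrix} 0.2 \\ 0.4 \end{pmatrix}
$$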
Adaline
Based on all samples, i.e. batch.
Linear activation function:
$$ \phi(w^T x) = w^Tx
$$
Cost function (Sum of Squared Errors, SSE):
$$ J(w) = \frac{1}{2} \sum_i (y^{(i)} - \phi(z^{(i)}))^2
$$
which is based on all samples.
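A minimal sketch of this cost in NumPy, assuming the bias is stored in w[0] as above (names are illustrative):

```python
import numpy as np

def sse_cost(w, X, y):
    """J(w) = 1/2 * sum_i (y_i - phi(z_i))^2 with the linear activation phi(z) = z."""
    output = X.dot(w[1:]) + w[0]   # phi(z^{(i)}) = w^T x^{(i)} for all samples at once
    errors = y - output
    return 0.5 * np.sum(errors ** 2)
```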
Learning Algorithm
Widrow-Hoff Rule
Since $$J(w)$$ is differentiable and convex, gradient descent can be applied to find the (global) minimum.
$$ \Delta w_j = -\eta \frac{\partial J}{\partial w_j} = \eta \sum_i (y^{(i)} - \phi(z^{(i)})) x_j^{(i)}
$$
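One full-batch Widrow-Hoff step might look like the following sketch (eta and the data layout are assumptions); note that the weight change follows the negative gradient:

```python
import numpy as np

def adaline_batch_step(w, X, y, eta=0.01):
    """One gradient-descent step: w <- w - eta * dJ/dw = w + eta * sum_i (y_i - phi(z_i)) x_i."""
    output = X.dot(w[1:]) + w[0]      # linear activation phi(z) = w^T x
    errors = y - output               # (y^{(i)} - phi(z^{(i)})) for all i
    w = w.copy()
    w[1:] += eta * X.T.dot(errors)    # sum over all samples
    w[0] += eta * errors.sum()        # gradient w.r.t. the bias (x_0 = 1)
    return w
```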
Overshoot: with too large a learning rate, a gradient step can overshoot the minimum, and the cost may increase instead of decrease.
Any other error function that is continuous and differentiable can be used instead; gradient descent is then used to find the absolute minimum. The error typically cannot reach 0, since the subtrahend is the continuous activation $$\phi(z^{(i)})$$ rather than the thresholded prediction $$\hat{y}^{(i)}$$, so it does not equal the $$\pm 1$$ label exactly even when every sample is classified correctly.
Proof
$$ \begin{aligned} \frac{\partial J}{\partial w_j} &= \frac{\partial}{\partial w_j}\frac{1}{2} \sum_i (y^{(i)} - \phi(z^{(i)}))^2 \\ &= \frac{1}{2}\frac{\partial}{\partial w_j}\sum_i (y^{(i)} - \phi(z^{(i)}))^2 \\ &= \frac{1}{2}\sum_i 2(y^{(i)} - \phi(z^{(i)}))\frac{\partial}{\partial w_j}(y^{(i)} - \phi(z^{(i)})) \\ &= \sum_i (y^{(i)} - \phi(z^{(i)}))\frac{\partial}{\partial w_j}\Big(y^{(i)} - \sum_k w_k x_k^{(i)}\Big) \\ &= \sum_i (y^{(i)} - \phi(z^{(i)}))(-x_j^{(i)}) \\ &= -\sum_i(y^{(i)} - \phi(z^{(i)}))x_j^{(i)} \end{aligned}
$$
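The analytic gradient can be sanity-checked against a central finite-difference approximation; the following is an illustrative sketch with random data, not part of the original derivation:

```python
import numpy as np

def sse_cost(w, X, y):
    output = X.dot(w[1:]) + w[0]
    return 0.5 * np.sum((y - output) ** 2)

def analytic_grad(w, X, y):
    """dJ/dw_j = -sum_i (y_i - phi(z_i)) x_j^(i); first entry is the bias component."""
    errors = y - (X.dot(w[1:]) + w[0])
    return np.concatenate(([-errors.sum()], -X.T.dot(errors)))

# Finite-difference check on random data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=5)
w = rng.normal(size=4)
eps = 1e-6
num_grad = np.array([
    (sse_cost(w + eps * e, X, y) - sse_cost(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(len(w))
])
print(np.allclose(num_grad, analytic_grad(w, X, y), atol=1e-5))   # -> True
```

Because $$J(w)$$ is quadratic in $$w$$, the central difference is exact up to floating-point error, so the check should print True.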
Improvements
Normalization
Can use larger learning rate, since less likely to overshoot -> converge faster.
$$ x'_j = \frac{x_j - \mu_j}{\sigma_j}
$$
where $$\mu_j$$ and $$\sigma_j$$ are the mean and standard deviation of feature $$j$$.
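A sketch of per-feature standardization (in practice the mean and standard deviation would be computed on the training data only):

```python
import numpy as np

def standardize(X):
    """Scale each feature to zero mean and unit variance: x'_j = (x_j - mu_j) / sigma_j."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma
```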
Variations
Batch Adaline
Each step takes the entire training dataset.
- Guaranteed convergence to the weight set with minimum squared error
    - With a sufficiently small learning rate
    - Even when the training data is not linearly separable
Incremental Adaline
Approximates the complete gradient by its estimate from a single sample -> stochastic or online gradient descent (a sketch follows the list below).
$$ \Delta w = \eta (y^{(i)} - \phi(z^{(i)})) x^{(i)}
$$
- Often converges faster, since the weights are updated more frequently
- The training set should be shuffled for every epoch
- Never reaches the exact minimum of the cost function
    - Oscillates in the vicinity of the minimum
    - The oscillation can be reduced by using an adaptive learning rate
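A sketch of one such epoch, with reshuffling and a per-sample update (the learning rate and data layout are assumptions):

```python
import numpy as np

def adaline_sgd_epoch(w, X, y, eta=0.01, rng=None):
    """One incremental-Adaline (SGD) epoch: shuffle, then update w after each sample."""
    if rng is None:
        rng = np.random.default_rng()
    w = w.copy()
    for i in rng.permutation(len(y)):             # reshuffle for every epoch
        error = y[i] - (X[i].dot(w[1:]) + w[0])   # y^{(i)} - phi(z^{(i)})
        w[1:] += eta * error * X[i]
        w[0] += eta * error
    return w
```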