Perceptron

Based on one sample at a time.

$$ z = w_0x_0 + w_1x_1 + \dots + w_mx_m = w^T x

$$

where $$w_0 = -\theta$$, $$x_0 = 1$$.

$$ \phi(z) = \begin{cases} 1 & \text{if}\ z \ge 0 \\ -1 & \text{otherwise} \end{cases}

$$
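
A minimal NumPy sketch of the net input and threshold activation above; the helper names `net_input` and `predict` and the sample numbers are illustrative, not from the text.

```python
import numpy as np

def net_input(w, x):
    """z = w^T x, with w[0] = -theta and x_0 = 1 prepended to the sample."""
    return np.dot(w, np.concatenate(([1.0], x)))

def predict(w, x):
    """Unit step activation: phi(z) = 1 if z >= 0, else -1."""
    return 1 if net_input(w, x) >= 0 else -1

w = np.array([-0.5, 1.0, 2.0])   # w_0 = -theta, then w_1, w_2 (made-up values)
x = np.array([0.3, 0.4])
print(predict(w, x))             # z = -0.5 + 0.3 + 0.8 = 0.6 >= 0  ->  1
```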

Learning Algorithm

Rosenblatt's Perceptron Rule
  1. Initialize weights to 0 or small random numbers
  2. For each training pair $$(x^{(i)}, y^{(i)})$$:
    1. Compute output $$\hat{y}^{(i)}$$
    2. Update weights
      1. $$w_j = w_j + \Delta w_j$$, $$\Delta w_j = \eta (y^{(i)} - \hat{y}^{(i)})x_j^{(i)}$$

Convergence to minimal error is guaranteed only if the two classes are linearly separable.
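
The rule above, sketched in NumPy under the assumption of $$\{-1, +1\}$$ labels; the toy AND-style data and the hyperparameters are made up for illustration.

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=100, seed=0):
    """Rosenblatt's rule: w_j += eta * (y - y_hat) * x_j, one sample at a time.

    X: (n_samples, n_features) array, y: labels in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.01, size=X.shape[1] + 1)  # small random init; w[0] is -theta
    for _ in range(epochs):
        for xi, target in zip(X, y):
            y_hat = 1 if (w[0] + xi @ w[1:]) >= 0 else -1
            update = eta * (target - y_hat)          # 0 when the sample is classified correctly
            w[1:] += update * xi
            w[0] += update                           # bias term, since x_0 = 1
    return w

# Linearly separable toy data (AND-like), so the rule converges to zero errors
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, -1, -1, 1])
w = train_perceptron(X, y)
print([1 if (w[0] + xi @ w[1:]) >= 0 else -1 for xi in X])  # -> [-1, -1, -1, 1]
```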

Simplification

$$ \Delta w_j = \eta\, y^{(i)} x_j^{(i)}

$$

applied only to misclassified samples: a correct prediction gives $$y^{(i)} - \hat{y}^{(i)} = 0$$ (no update), while a wrong one gives $$2y^{(i)}$$, a constant factor that can be absorbed into $$\eta$$.
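
A small numeric check of the equivalence claimed above (assuming $$\{-1, +1\}$$ labels): when a sample is misclassified, $$y^{(i)} - \hat{y}^{(i)} = 2y^{(i)}$$, so the full update and the simplified one differ only by a constant factor. The values below are made up.

```python
import numpy as np

eta = 0.1
y_i, y_hat = 1, -1                     # a misclassified sample: true label vs. prediction
x_i = np.array([1.0, 0.3, 0.4])        # sample with x_0 = 1 prepended

full_update = eta * (y_i - y_hat) * x_i      # Rosenblatt form: eta * 2y * x
simplified = 2 * eta * y_i * x_i             # eta * y * x, with the factor 2 folded into eta
print(np.allclose(full_update, simplified))  # True
```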

Adaline

Based on all samples, i.e. batch.

Linear activation function:

$$ \phi(w^T x) = w^Tx

$$

Cost function (Sum of Squared Errors, SSE):

$$ J(w) = \frac{1}{2} \sum_i (y^{(i)} - \phi(z^{(i)}))^2

$$

which is based on all samples.
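
A direct transcription of the SSE cost, assuming a design matrix `X` with a prepended bias column; the function name and toy numbers are mine, not from the text.

```python
import numpy as np

def sse_cost(w, X, y):
    """J(w) = 1/2 * sum_i (y_i - phi(z_i))^2 with the linear activation phi(z) = z."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x_0 = 1
    output = X1 @ w                                # phi(z) = w^T x for every sample
    errors = y - output
    return 0.5 * np.sum(errors ** 2)

# Toy check with made-up numbers
X = np.array([[0.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, -1.0])
w = np.zeros(3)
print(sse_cost(w, X, y))   # 0.5 * (1 + 1) = 1.0
```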

Learning Algorithm

Widrow-Hoff Rule

Since $$J(w)$$ is continuous and convex, gradient descent is applied to find the minimum.

$$ \Delta w_j = -\eta \frac{\partial J}{\partial w_j} = \eta \sum_i (y^{(i)} - \phi(z^{(i)})) x_j^{(i)}

$$
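
One batch gradient-descent step as a NumPy sketch; the toy data, `eta`, and the function name `adaline_batch_step` are assumptions for illustration. With a sufficiently small learning rate, the printed SSE should decrease at every step.

```python
import numpy as np

def adaline_batch_step(w, X, y, eta=0.01):
    """One batch GD step: w <- w - eta * dJ/dw,
    i.e. Delta w_j = eta * sum_i (y_i - phi(z_i)) * x_ij."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    errors = y - X1 @ w                 # y_i - phi(z_i), phi linear
    return w + eta * X1.T @ errors      # the gradient uses *all* samples

# A few steps on toy data
X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 1.0]])
y = np.array([-1.0, 1.0, 1.0])
w = np.zeros(3)
for _ in range(5):
    w = adaline_batch_step(w, X, y, eta=0.05)
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    print(0.5 * np.sum((y - X1 @ w) ** 2))   # SSE shrinks step by step
```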

Overshoot: if the learning rate $$\eta$$ is too large, gradient descent can repeatedly step over the minimum instead of settling into it.

Other error functions can also be used, as long as they are continuous and differentiable; gradient descent is then applied to find the (global) minimum. The error generally cannot reach 0, since the cost compares $$y^{(i)}$$ with the continuous activation $$\phi(z^{(i)})$$ rather than the thresholded prediction $$\hat{y}^{(i)}$$.

Proof

$$ \begin{aligned} \frac{\partial J}{\partial w_j} &= \frac{\partial}{\partial w_j}\frac{1}{2} \sum_i (y^{(i)} - \phi(z^{(i)}))^2 \\ &= \frac{1}{2}\frac{\partial}{\partial w_j}\sum_i (y^{(i)} - \phi(z^{(i)}))^2 \\ &= \frac{1}{2}\sum_i 2(y^{(i)} - \phi(z^{(i)}))\frac{\partial}{\partial w_j}(y^{(i)} - \phi(z^{(i)})) \\ &= \sum_i (y^{(i)} - \phi(z^{(i)}))\frac{\partial}{\partial w_j}\Big(y^{(i)} - \sum_k w_k x_k^{(i)}\Big) \\ &= \sum_i (y^{(i)} - \phi(z^{(i)}))(-x_j^{(i)}) \\ &= -\sum_i(y^{(i)} - \phi(z^{(i)}))x_j^{(i)} \end{aligned}

$$
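
A quick numerical sanity check of the derivative just derived: compare the analytic gradient $$-\sum_i(y^{(i)} - \phi(z^{(i)}))x_j^{(i)}$$ against central finite differences on random data. The data sizes and tolerance are arbitrary choices.

```python
import numpy as np

def cost(w, X1, y):
    return 0.5 * np.sum((y - X1 @ w) ** 2)

def analytic_grad(w, X1, y):
    """dJ/dw_j = -sum_i (y_i - phi(z_i)) x_ij, as derived above."""
    return -X1.T @ (y - X1 @ w)

rng = np.random.default_rng(0)
X1 = np.hstack([np.ones((4, 1)), rng.normal(size=(4, 2))])  # bias column + 2 features
y = rng.choice([-1.0, 1.0], size=4)
w = rng.normal(size=3)

eps = 1e-6
num_grad = np.array([
    (cost(w + eps * e, X1, y) - cost(w - eps * e, X1, y)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(num_grad, analytic_grad(w, X1, y), atol=1e-4))  # True
```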

Improvements

Normalization

Allows a larger learning rate, since overshooting becomes less likely -> faster convergence.

$$ x'_j = \frac{x_j - \mu_j}{\sigma_j}

$$
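
A sketch of the standardization above in NumPy, where $$\mu_j$$ and $$\sigma_j$$ are the per-feature mean and standard deviation; the toy matrix is made up.

```python
import numpy as np

# Standardize each feature column: x'_j = (x_j - mu_j) / sigma_j
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))   # ~0 for every column
print(X_std.std(axis=0))    # 1 for every column
```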

Variations

Batch Adaline

Each step takes the entire training dataset.

  • Guaranteed convergence to weight set with minimum squared error
    • With sufficiently small learning rate
    • Even when training data is not linearly separable
Incremental Adaline

Approximates the complete gradient by a single-sample estimate -> stochastic or online GD (see the sketch after the list below).

$$ \Delta w = \eta (y^{(i)} - \phi(z^{(i)})) x^{(i)}

$$

  • Often converges faster, since the weights are updated more frequently
  • The training set should be shuffled before every epoch
  • Never reaches the exact minimum of the cost function
  • Oscillates in the vicinity of the minimum
    • Can be reduced by using an adaptive learning rate
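
A sketch of incremental (stochastic) Adaline under the same assumptions as before ($$\{-1, +1\}$$ labels, bias column prepended); the per-epoch shuffling follows the list above, and all names and toy values are illustrative.

```python
import numpy as np

def adaline_sgd(X, y, eta=0.01, epochs=20, seed=0):
    """Incremental (stochastic) Adaline: one weight update per sample,
    with the training set reshuffled before every epoch."""
    rng = np.random.default_rng(seed)
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    w = rng.normal(scale=0.01, size=X1.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(y))            # reshuffle every epoch
        for i in order:
            error = y[i] - X1[i] @ w               # y_i - phi(z_i)
            w += eta * error * X1[i]               # per-sample gradient estimate
    return w

X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y = np.array([-1.0, 1.0, 1.0, 1.0])
w = adaline_sgd(X, y)
print(w)   # hovers near, but not exactly at, the SSE minimum
```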
