Multilayer NN
- Hidden layer
- Detect features
- Can represent any continuous function with 1 hidden layer
- Can represent discontinuous functions with 2 hidden layers
Generalized Delta Rule (GDR)
- Init weights & thresholds
- Input to network
- Feed forward and determine the output of every unit
- Compare final output with desired output, calculate the error
- Backpropagate the error through the network for weight correction
- Minimize overall errors
Backpropagation Algorithm
- $$a$$: input
- $$c$$: predicted output
- $$c^t$$: desired/target output
- $$g$$: activation function at output layer
- $$f$$: activation function at hidden layers
- $$m$$: number of inputs
- $$n$$: number of outputs
- $$E = \frac{1}{2} \sum_{i=1}^n (c^t_i - c_i)^2$$: squared error function
- Init weights & thresholds to small values
- For each training pair ($$a_k$$, $$c^t_k$$):
- Transfer & activate input values to the next layer
- $$b^k_i = f\left(\sum_{j=1}^{m} a_j w^k_{ji} + \theta^k_i\right)$$
- Compute the error at the output layer using the derivative of the output activation
- $$\delta_i = g'(c_i)\,(c^t_i - c_i)$$
- Calculate error for each hidden layer relative to the error from the layer above
- $$\delta^k_i = f'(b^k_i) \sum_j w^k_{ij} \delta^{k+1}_j$$
- Adjust weights
- $$w'^k_{ij} = w^k_{ij} + \Delta w^k_{ij} = w^k_{ij} + \alpha \cdot b^k_i \cdot \delta^{k+1}_j$$
- Adjust thresholds
- $$\theta'^k_j = \theta^k_j + \Delta \theta^k_j = \theta^k_j + \beta \cdot \delta^{k+1}_j$$
- Repeat step 2 until the error is sufficiently low (a minimal sketch of one pass follows this list)
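A minimal NumPy sketch of one backpropagation pass in this notation, assuming a single hidden layer and sigmoid for both $$f$$ and $$g$$; the layer sizes, learning rates, and example input/target values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                        # sigmoid activation (used as both f and g here)
    return 1.0 / (1.0 + np.exp(-x))

m, h, n = 3, 4, 2                # inputs, hidden units, outputs (hypothetical)
alpha, beta = 0.5, 0.5           # weight / threshold learning rates (hypothetical)

# Step 1: initialize weights & thresholds to small values
w1 = rng.uniform(-0.5, 0.5, (m, h))
theta1 = rng.uniform(-0.5, 0.5, h)
w2 = rng.uniform(-0.5, 0.5, (h, n))
theta2 = rng.uniform(-0.5, 0.5, n)

a  = np.array([0.2, 0.7, 0.1])   # example input a_k (hypothetical)
ct = np.array([1.0, 0.0])        # example target c^t_k (hypothetical)

# Step 2a: transfer & activate input values layer by layer
b = f(a @ w1 + theta1)           # hidden outputs b_i
c = f(b @ w2 + theta2)           # network outputs c_i

# Step 2b: error at the output layer, delta = g'(c)(c^t - c), with g' = c(1 - c)
delta_out = c * (1.0 - c) * (ct - c)

# Step 2c: backpropagate the error to the hidden layer
delta_hid = b * (1.0 - b) * (w2 @ delta_out)

# Step 2d/2e: adjust weights and thresholds
w2 += alpha * np.outer(b, delta_out)
theta2 += beta * delta_out
w1 += alpha * np.outer(a, delta_hid)
theta1 += beta * delta_hid

E = 0.5 * np.sum((ct - c) ** 2)  # squared error for this pair (before the update)
print(E)
```

In practice this loop runs over all training pairs for many epochs until $$E$$ is sufficiently low.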
Gradient Based Method
To minimize a function $$E(x)$$, randomly init $$x^0$$, compute its gradient, then move in the opposite direction:
$$x^1 = x^0 - \alpha \cdot \frac{dE(x^0)}{dx}$$
Repeat this update until the change between successive points (e.g. $$x^1 - x^0$$) is sufficiently small.
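As a worked illustration, here is gradient descent on the hypothetical one-dimensional function $$E(x) = (x - 3)^2$$; the step size and starting point are also hypothetical:

```python
# Gradient descent on E(x) = (x - 3)^2, whose derivative is dE/dx = 2(x - 3);
# the minimum is at x = 3.
def dE(x):
    return 2.0 * (x - 3.0)

alpha = 0.1      # learning rate (step size)
x = 10.0         # initial point x^0
for _ in range(1000):
    x_new = x - alpha * dE(x)        # move against the gradient
    if abs(x_new - x) < 1e-6:        # stop when the change is small enough
        break
    x = x_new
print(x)         # close to 3.0
```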
Derivative of Sigmoid Function
$$f(x) = \frac{1}{1 + e^{-x}}$$
$$f'(x) = (-1)(1 + e^{-x})^{-2} \cdot e^{-x} \cdot (-1) = \frac{e^{-x}}{1 + e^{-x}} \cdot \frac{1}{1 + e^{-x}} = \frac{1 + e^{-x} - 1}{1 + e^{-x}} \cdot \frac{1}{1 + e^{-x}} = f(x)\,(1 - f(x))$$
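A quick numerical check of the identity $$f'(x) = f(x)(1 - f(x))$$ using a central finite difference; the test points are arbitrary:

```python
import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 1.5])
eps = 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)   # finite-difference derivative
analytic = f(x) * (1.0 - f(x))                    # f'(x) = f(x)(1 - f(x))
print(np.allclose(numeric, analytic, atol=1e-6))  # True
```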
Overfitting
When the model is too complex, predictive performance deteriorates, since minor fluctuations (noise) in the data are exaggerated.
Input/Output Scaling
Output scaling (to [0, 1]) is crucial when a sigmoid activation is used at the output layer; with scaled values, learning is also faster.
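A minimal min-max scaling sketch, assuming the scale is taken column-wise from the data itself (the example values are hypothetical):

```python
import numpy as np

def minmax_scale(x, lo=None, hi=None):
    """Scale each column of x linearly to [0, 1]."""
    lo = x.min(axis=0) if lo is None else lo
    hi = x.max(axis=0) if hi is None else hi
    return (x - lo) / (hi - lo)

data = np.array([[10.0, 200.0], [20.0, 400.0], [15.0, 300.0]])
print(minmax_scale(data))   # every column now lies in [0, 1]
```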
Accelerated Learning
Momentum
Provides a stabilizing effect on training (a sketch of the update follows this list).
$$\Delta w^k_{ij}(t) = \alpha \cdot b^k_i \cdot \delta^{k+1}_j + \beta \cdot \Delta w^k_{ij}(t - 1)$$
- $$\beta$$: momentum constant
- Accelerate descent in the steady downhill direction
- Slow down when the learning curve exhibits peaks or valleys
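A sketch of the momentum update for a single weight matrix; the layer sizes and constants are hypothetical, and `b`, `delta_next` stand for $$b^k$$ and $$\delta^{k+1}$$ in the rule above:

```python
import numpy as np

alpha, beta = 0.5, 0.9               # learning rate and momentum constant (hypothetical)
rng = np.random.default_rng(0)

w = rng.uniform(-0.5, 0.5, (4, 2))   # weights w^k_{ij}
prev_dw = np.zeros_like(w)           # previous update Delta w^k_{ij}(t - 1), initially zero

# Hypothetical activations b^k_i and back-propagated errors delta^{k+1}_j
b = rng.random(4)
delta_next = rng.random(2)

dw = alpha * np.outer(b, delta_next) + beta * prev_dw   # momentum term reuses the last update
w += dw
prev_dw = dw                          # remember this update for the next iteration
```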
Adaptive Learning Rate
Learning Rate
- Small $$\alpha$$
- Smooth learning curve
- Large $$\alpha$$
- Speed up learning process
- May cause instability
Adaptive Learning
- Error is increasing, fluctuating, or has become constant:
- Decrease $$\alpha$$
- Error is decreasing for several epochs:
- Increase $$\alpha$$ (one adaptation heuristic is sketched below)
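A sketch of one common adaptation heuristic along these lines; the window length, growth factor, and shrink factor are hypothetical choices:

```python
def adapt_learning_rate(alpha, error_history, grow=1.05, shrink=0.7):
    """Increase alpha after several epochs of decreasing error;
    decrease it when the error rises or stops improving."""
    if len(error_history) < 4:
        return alpha
    last = error_history[-4:]
    if all(e2 < e1 for e1, e2 in zip(last, last[1:])):   # steadily decreasing
        return alpha * grow
    if last[-1] >= last[-2]:                             # increasing or flat
        return alpha * shrink
    return alpha

# Example: the error rose in the last epoch, so alpha is reduced
print(adapt_learning_rate(0.5, [0.9, 0.7, 0.6, 0.65]))   # 0.35
```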
Hidden Nodes
- More hidden nodes
- Can fit the data better
- May overfit
- Fewer hidden nodes
- May underfit
- More training samples
- Provide a better chance of matching the original curve
- Fewer training samples
- Less chance of matching the original curve