Multilayer NN
- Hidden layer
- Detect features
- Can represent any continuous function with 1 hidden layer
- Can represent discontinuous functions with 2 hidden layers
Generalized Delta Rule (GDR)
- Init weights & thresholds
- Input to network
- Feed forward and determine the output of every unit
- Compare final output with desired output, calculate the error
- Backpropagate the error through the network for weight correction
- Minimize overall errors
Backpropagation Algorithm
- $$a$$: input
- $$c$$: predicted output
- $$c^t$$: desired/target output
- $$g$$: activation function at output layer
- $$f$$: activation function at hidden layers
- $$m$$: number of inputs
- $$n$$: number of outputs
- $$E = \frac{1}{2} \sum_{i=1}^n (c^t_i - c_i)^2$$: squared error function
- Init weights & thresholds to small values
- For each training pair ($$a_k$$, $$c^t_k$$):
- Transfer & activate input values to the next layer
- $$b^k_i = f\left(\sum_{j=1}^{m} a_j w^k_{ji} + \theta^k_i\right)$$
- Compute the error at the output layer using the derivative of the output activation
- $$\delta_i = g'(c_i)\,(c^t_i - c_i)$$
- Calculate error for each hidden layer relative to the error from the layer above
- $$\delta^k_i = f'(b^k_i) \sum_j w^k_{ij} \delta^{k+1}_j$$
- Adjust weights
- $$w'^k_{ij} = w^k_{ij} + \Delta w^k_{ij} = w^k_{ij} + \alpha \cdot b^k_i \cdot \delta^{k+1}_j$$
- Adjust thresholds
- $$\theta'^k_j = \theta^k_j + \Delta \theta^k_j = \theta^k_j + \beta \cdot \delta^{k+1}_j$$
- Repeat step 2 until the error is sufficiently low (a minimal sketch of one pass follows this list)
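A minimal NumPy sketch of one backpropagation pass in this notation, assuming a single hidden layer and sigmoid for both $$f$$ and $$g$$; the layer sizes, learning rates, and example input/target values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                        # sigmoid activation (used as both f and g here)
    return 1.0 / (1.0 + np.exp(-x))

m, h, n = 3, 4, 2                # inputs, hidden units, outputs (hypothetical)
alpha, beta = 0.5, 0.5           # weight / threshold learning rates (hypothetical)

# Step 1: initialize weights & thresholds to small values
w1 = rng.uniform(-0.5, 0.5, (m, h))
theta1 = rng.uniform(-0.5, 0.5, h)
w2 = rng.uniform(-0.5, 0.5, (h, n))
theta2 = rng.uniform(-0.5, 0.5, n)

a  = np.array([0.2, 0.7, 0.1])   # example input a_k (hypothetical)
ct = np.array([1.0, 0.0])        # example target c^t_k (hypothetical)

# Step 2a: transfer & activate input values layer by layer
b = f(a @ w1 + theta1)           # hidden outputs b_i
c = f(b @ w2 + theta2)           # network outputs c_i

# Step 2b: error at the output layer, delta = g'(c)(c^t - c), with g' = c(1 - c)
delta_out = c * (1.0 - c) * (ct - c)

# Step 2c: backpropagate the error to the hidden layer
delta_hid = b * (1.0 - b) * (w2 @ delta_out)

# Step 2d/2e: adjust weights and thresholds
w2 += alpha * np.outer(b, delta_out)
theta2 += beta * delta_out
w1 += alpha * np.outer(a, delta_hid)
theta1 += beta * delta_hid

E = 0.5 * np.sum((ct - c) ** 2)  # squared error for this pair (before the update)
print(E)
```

In practice this loop runs over all training pairs for many epochs until $$E$$ is sufficiently low.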
Gradient Based Method
To minimize a function $$E(x)$$, randomly init $$x^0$$, compute its gradient, then move in the opposite direction:
$$x^1 = x^0 - \alpha \cdot \frac{dE(x^0)}{dx}$$
Repeat this update until the change between successive points (e.g. $$x^1 - x^0$$) is sufficiently small.
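As a worked illustration, here is gradient descent on the hypothetical one-dimensional function $$E(x) = (x - 3)^2$$; the step size and starting point are also hypothetical:

```python
# Gradient descent on E(x) = (x - 3)^2, whose derivative is dE/dx = 2(x - 3);
# the minimum is at x = 3.
def dE(x):
    return 2.0 * (x - 3.0)

alpha = 0.1      # learning rate (step size)
x = 10.0         # initial point x^0
for _ in range(1000):
    x_new = x - alpha * dE(x)        # move against the gradient
    if abs(x_new - x) < 1e-6:        # stop when the change is small enough
        break
    x = x_new
print(x)         # close to 3.0
```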
Derivative of Sigmoid Function
$$f(x) = \frac{1}{1 + e^{-x}}$$
$$f'(x) = (-1)(1 + e^{-x})^{-2} \cdot e^{-x} \cdot (-1) = \frac{e^{-x}}{1 + e^{-x}} \cdot \frac{1}{1 + e^{-x}} = \frac{1 + e^{-x} - 1}{1 + e^{-x}} \cdot \frac{1}{1 + e^{-x}} = f(x)\,(1 - f(x))$$
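A quick numerical check of the identity $$f'(x) = f(x)(1 - f(x))$$ using a central finite difference; the test points are arbitrary:

```python
import numpy as np

def f(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, 0.0, 1.5])
eps = 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)   # finite-difference derivative
analytic = f(x) * (1.0 - f(x))                    # f'(x) = f(x)(1 - f(x))
print(np.allclose(numeric, analytic, atol=1e-6))  # True
```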
Overfitting
When the model is too complex, predictive performance deteriorates, since minor fluctuations (noise) in the data are exaggerated.
Input/Output Scaling
Output scaling (to [0, 1]) is crucial when a sigmoid activation is used at the output layer; with scaled values, learning is also faster.
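A minimal min-max scaling sketch, assuming the scale is taken column-wise from the data itself (the example values are hypothetical):

```python
import numpy as np

def minmax_scale(x, lo=None, hi=None):
    """Scale each column of x linearly to [0, 1]."""
    lo = x.min(axis=0) if lo is None else lo
    hi = x.max(axis=0) if hi is None else hi
    return (x - lo) / (hi - lo)

data = np.array([[10.0, 200.0], [20.0, 400.0], [15.0, 300.0]])
print(minmax_scale(data))   # every column now lies in [0, 1]
```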
Accelerated Learning
Momentum
Provides a stabilizing effect on training (a sketch of the update follows this list).
$$\Delta w^k_{ij}(t) = \alpha \cdot b^k_i \cdot \delta^{k+1}_j + \beta \cdot \Delta w^k_{ij}(t - 1)$$
- $$\beta$$: momentum constant
- Accelerate descent in the steady downhill direction
- Slow down when the learning curve exhibits peaks or valleys
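A sketch of the momentum update for a single weight matrix; the layer sizes and constants are hypothetical, and `b`, `delta_next` stand for $$b^k$$ and $$\delta^{k+1}$$ in the rule above:

```python
import numpy as np

alpha, beta = 0.5, 0.9               # learning rate and momentum constant (hypothetical)
rng = np.random.default_rng(0)

w = rng.uniform(-0.5, 0.5, (4, 2))   # weights w^k_{ij}
prev_dw = np.zeros_like(w)           # previous update Delta w^k_{ij}(t - 1), initially zero

# Hypothetical activations b^k_i and back-propagated errors delta^{k+1}_j
b = rng.random(4)
delta_next = rng.random(2)

dw = alpha * np.outer(b, delta_next) + beta * prev_dw   # momentum term reuses the last update
w += dw
prev_dw = dw                          # remember this update for the next iteration
```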
Adaptive Learning Rate
Learning Rate
- Small $$\alpha$$
- Smooth learning curve
- Large $$\alpha$$
- Speed up learning process
- May cause instability
Adaptive Learning
- Error is increasing, fluctuating, or has become constant:
- Decrease $$\alpha$$
- Error is decreasing for several epochs:
- Increase $$\alpha$$ (one adaptation heuristic is sketched below)
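A sketch of one common adaptation heuristic along these lines; the window length, growth factor, and shrink factor are hypothetical choices:

```python
def adapt_learning_rate(alpha, error_history, grow=1.05, shrink=0.7):
    """Increase alpha after several epochs of decreasing error;
    decrease it when the error rises or stops improving."""
    if len(error_history) < 4:
        return alpha
    last = error_history[-4:]
    if all(e2 < e1 for e1, e2 in zip(last, last[1:])):   # steadily decreasing
        return alpha * grow
    if last[-1] >= last[-2]:                             # increasing or flat
        return alpha * shrink
    return alpha

# Example: the error rose in the last epoch, so alpha is reduced
print(adapt_learning_rate(0.5, [0.9, 0.7, 0.6, 0.65]))   # 0.35
```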
Hidden Nodes
- More hidden nodes
- Can fit the data better
- May overfit
- Fewer hidden nodes
- May underfit
- More training samples
- Provide a better chance of matching the original curve
- Fewer training samples
- Less chance of matching the original curve