Multilayer NN

  • Hidden layer
    • Detects features in the input
    • A network with 1 hidden layer can represent any continuous function
    • A network with 2 hidden layers can represent discontinuous functions

Generalized Delta Rule (GDR)

  1. Init weights & thresholds
  2. Present an input pattern to the network
  3. Feed forward; determine the output of every unit
  4. Compare the final output with the desired output and calculate the error
  5. Backpropagate the error through the network for weight correction
  6. Minimize overall errors
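
A minimal sketch of these steps for a single sigmoid unit in Python/NumPy; the AND data, the learning rate, and the stopping threshold are illustrative choices, not values from the notes:

```python
import numpy as np

# One sigmoid unit trained on the AND function with the delta rule.
# Data, learning rate, and stopping threshold are illustrative choices.
rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs a
T = np.array([0, 0, 0, 1], dtype=float)                      # desired outputs c^t

w = rng.uniform(-0.5, 0.5, size=2)   # 1. init weights ...
theta = rng.uniform(-0.5, 0.5)       #    ... and threshold
alpha = 0.5                          # learning rate

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(10000):
    error = 0.0
    for a, t in zip(X, T):           # 2. present each input pattern
        c = sigmoid(a @ w + theta)   # 3. feed forward
        e = t - c                    # 4. compare with the desired output
        delta = c * (1 - c) * e      # 5. backpropagate: output error term
        w += alpha * delta * a       #    weight correction
        theta += alpha * delta
        error += 0.5 * e ** 2        # 6. accumulate squared error
    if error < 0.01:                 # stop when the error is sufficiently low
        break

print(epoch, error)
```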

Backpropagation Algorithm

  • $$a$$: input
  • $$c$$: predicted output
  • $$c^t$$: desired/target output
  • $$g$$: activation function at output layer
  • $$f$$: activation function at hidden layers
  • $$m$$: number of inputs
  • $$n$$: number of outputs
  • $$E = \frac{1}{2} \sum_{i=1}^n (c^t_i - c_i)^2$$: squared error function

  1. Init weights & thresholds to small values
  2. For each training pair ($$a_k$$, $$c^t_k$$):
    1. Transfer & activate input values to the next layer
      1. $$b^k_i = \operatorname{activate}\!\left(\sum_{j=1}^{m} a_j w^k_{ji} + \theta^k_i\right)$$
    2. Compute the error term at the output layer (error times the activation derivative)
      1. $$\Delta_i = g'(c_i)(c_i^t - c_i)$$
    3. Calculate the error for each hidden layer from the error propagated back from the layer above
      1. $$\delta^k_i = f'(b^k_i) \sum_j w^k_{ij} \delta^{k+1}_j$$
    4. Adjust weights
      1. $$w'^k_{ij} = w^k_{ij} + \Delta w^k_{ij} = w^k_{ij} + \alpha \cdot b^k_i \cdot \delta^{k+1}_j$$
    5. Adjust thresholds
      1. $$\theta'^k_j = \theta^k_j + \Delta \theta^k_j = \theta^k_j + \beta \cdot \delta^{k+1}_j$$
  3. Repeat step 2 until the error is sufficiently low (see the sketch below)
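
A runnable sketch of this procedure for one hidden layer, assuming sigmoid activations for both $$f$$ and $$g$$; the XOR data, layer sizes, and the learning rates $$\alpha = \beta = 0.5$$ are illustrative assumptions, not values from the notes:

```python
import numpy as np

# Backpropagation for one hidden layer, following the notation above:
# a = inputs, b = hidden outputs, c = network outputs, C_t = targets,
# W1/theta1 = input-to-hidden weights/thresholds, W2/theta2 = hidden-to-output.
# The XOR data, layer sizes, alpha, and beta are illustrative assumptions.
rng = np.random.default_rng(1)

A   = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
C_t = np.array([[0], [1], [1], [0]], dtype=float)

n_in, n_hid, n_out = 2, 4, 1
W1 = rng.uniform(-0.5, 0.5, (n_in, n_hid));  theta1 = np.zeros(n_hid)
W2 = rng.uniform(-0.5, 0.5, (n_hid, n_out)); theta2 = np.zeros(n_out)
alpha = beta = 0.5                         # weight / threshold learning rates

def sigmoid(x):                            # used for both f and g
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(20000):
    E = 0.0
    for a, c_t in zip(A, C_t):
        b = sigmoid(a @ W1 + theta1)        # feed forward: hidden activations
        c = sigmoid(b @ W2 + theta2)        # feed forward: network outputs
        Delta = c * (1 - c) * (c_t - c)     # output error term g'(c)(c^t - c)
        delta = b * (1 - b) * (W2 @ Delta)  # hidden error term f'(b) * sum_j w Delta
        W2 += alpha * np.outer(b, Delta); theta2 += beta * Delta   # adjust weights
        W1 += alpha * np.outer(a, delta); theta1 += beta * delta   # and thresholds
        E += 0.5 * np.sum((c_t - c) ** 2)
    if E < 1e-3:                            # repeat until error is sufficiently low
        break

print(epoch, E)
```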

Gradient Based Method

To minimize a function $$E(x)$$, randomly initialize $$x^0$$, compute the gradient at that point, then move in the opposite direction:

$$x^1 = x^0 - \alpha \cdot \frac{dE(x^0)}{dx}$$

Repeat this update until the change $$(x^1 - x^0)$$ between successive steps is sufficiently small.
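
For example, a minimal sketch of this rule on a one-dimensional quadratic; the function, $$\alpha$$, and the stopping tolerance are arbitrary choices for illustration:

```python
import numpy as np

# Gradient descent on E(x) = (x - 3)^2, whose minimum is at x = 3.
E     = lambda x: (x - 3.0) ** 2
dE_dx = lambda x: 2.0 * (x - 3.0)

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10)            # random initialization x^0
alpha = 0.1                         # step size

while True:
    x_new = x - alpha * dE_dx(x)    # move against the gradient
    if abs(x_new - x) < 1e-6:       # stop when the step is sufficiently small
        break
    x = x_new

print(x, E(x))
```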

Derivative of Sigmoid Function

$$f(x) = \frac{1}{1 + e^{-x}}$$

$$f'(x) = (-1)(1 + e^{-x})^{-2} \cdot e^{-x} \cdot (-1) = \frac{e^{-x}}{1 + e^{-x}} \cdot \frac{1}{1 + e^{-x}} = \frac{1 + e^{-x} - 1}{1 + e^{-x}} \cdot \frac{1}{1 + e^{-x}} = f(x)\,(1 - f(x))$$
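
A quick numerical sanity check of this identity; the test points and step size are arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
analytic = sigmoid(x) * (1 - sigmoid(x))                 # f(x) (1 - f(x))
h = 1e-6
numeric  = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # central difference

print(np.max(np.abs(analytic - numeric)))                # tiny: the identity holds
```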

Overfitting

When the model is too complex, its predictive performance on new data deteriorates, because minor fluctuations (noise) in the training data are exaggerated.
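
A small illustration of this effect, assuming noisy data generated from a straight line and comparing a simple against an overly complex polynomial fit; all data and degrees are made up for illustration:

```python
import numpy as np

# Fit a noisy linear relationship with a simple and an overly complex
# polynomial, then compare errors on held-out points.
rng = np.random.default_rng(0)

x_train = np.linspace(0, 1, 10)
x_test  = np.linspace(0.05, 0.95, 10)
f_true  = lambda x: 2 * x + 1
y_train = f_true(x_train) + rng.normal(0, 0.1, x_train.shape)
y_test  = f_true(x_test)  + rng.normal(0, 0.1, x_test.shape)

for degree in (1, 9):                       # simple vs. too-complex model
    coef = np.polyfit(x_train, y_train, degree)
    mse  = lambda x, y: np.mean((np.polyval(coef, x) - y) ** 2)
    print(degree, mse(x_train, y_train), mse(x_test, y_test))
# The degree-9 fit has lower training error but typically higher test error:
# it has amplified the noise in the training data.
```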

Input/Output Scaling

Output scaling to [0, 1] is crucial when the output layer uses a sigmoid activation, since the sigmoid can only produce values in (0, 1); with scaled inputs and outputs, learning is also faster.
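
A minimal sketch of min-max output scaling and the inverse transform applied to network predictions; all values are made up for illustration:

```python
import numpy as np

# Min-max scaling of targets to [0, 1] before training, and the inverse
# transform to recover predictions in the original units.
y = np.array([12.0, 48.0, 30.0, 95.0])            # raw target values (made up)

y_min, y_max = y.min(), y.max()
y_scaled = (y - y_min) / (y_max - y_min)          # now in [0, 1], matching the sigmoid range

# ... train the network on y_scaled ...

y_pred_scaled = np.array([0.25, 0.80])            # example network outputs
y_pred = y_pred_scaled * (y_max - y_min) + y_min  # back to the original units
print(y_scaled, y_pred)
```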

Accelerated Learning

Momentum

Adding a momentum term provides a stabilizing effect on training (a code sketch follows the list below):

$$\Delta w^k_{ij}(t) = \alpha \cdot b^k_i \cdot \delta^{k+1}_j + \beta \cdot \Delta w^k_{ij}(t-1)$$

  • $$\beta$$: momentum constant
    • Accelerates descent in the steady downhill direction
    • Slows down learning when the learning curve exhibits peaks or valleys
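
A sketch of the momentum update as a stand-alone step; variable names follow the backpropagation sketch above, and the values of $$\alpha$$, $$\beta$$, and the example activations are arbitrary:

```python
import numpy as np

# Weight update with momentum: add a fraction (beta) of the previous update
# to the new one. Shapes follow the backpropagation sketch above.
alpha, beta = 0.5, 0.9          # learning rate and momentum constant (arbitrary)

n_hid, n_out = 4, 1
W2 = np.random.uniform(-0.5, 0.5, (n_hid, n_out))
dW2_prev = np.zeros_like(W2)    # previous update, initially zero

def momentum_step(W, dW_prev, b, Delta):
    """One momentum update for the hidden-to-output weights."""
    dW = alpha * np.outer(b, Delta) + beta * dW_prev
    return W + dW, dW

# example values for one training pattern (illustrative)
b     = np.array([0.2, 0.7, 0.5, 0.9])   # hidden activations
Delta = np.array([0.05])                 # output error term
W2, dW2_prev = momentum_step(W2, dW2_prev, b, Delta)
print(W2)
```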

Adaptive Learning Rate

Learning Rate
  • Small $$\alpha$$
    • Produces a smooth learning curve
  • Large $$\alpha$$
    • Speeds up the learning process
    • May cause instability
Adaptive Learning
  • If the error is increasing, fluctuating, or stuck at a constant value:
    • Decrease $$\alpha$$
  • If the error has been decreasing for several consecutive epochs:
    • Increase $$\alpha$$
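
One possible implementation of this heuristic; the shrink/grow factors 0.7 and 1.05 and the patience of 3 epochs are assumptions for illustration, not values from the notes:

```python
# Adaptive learning rate heuristic: shrink alpha when the epoch error goes up
# or stalls, grow it slowly after several consecutive improvements.
# The factors 0.7 / 1.05 and the patience of 3 epochs are arbitrary choices.
def adapt_learning_rate(alpha, error, prev_error, good_epochs,
                        shrink=0.7, grow=1.05, patience=3):
    if prev_error is None:
        return alpha, 0
    if error >= prev_error:            # error increasing, fluctuating, or flat
        return alpha * shrink, 0
    good_epochs += 1
    if good_epochs >= patience:        # error decreasing for several epochs
        return alpha * grow, 0
    return alpha, good_epochs

# usage inside a training loop (the epoch errors here are made-up numbers)
alpha, prev_error, good = 0.5, None, 0
for error in [1.2, 1.0, 0.9, 0.85, 0.9, 0.8]:
    alpha, good = adapt_learning_rate(alpha, error, prev_error, good)
    prev_error = error
    print(round(alpha, 4))
```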

Hidden Nodes

  • More hidden nodes
    • Can fit the data better
    • May overfit
  • Fewer hidden nodes
    • May underfit
  • More training samples
    • Provide a better chance of matching the original curve
  • Fewer training samples
    • Provide less chance of matching the original curve
