Multi-Layer Neural Network

  • Hyper-parameters
    • Network architecture
    • Number of layers
    • Number of nodes in each layer
    • Bias

Types

  • Multi-layer perceptron (MLP)
    • Fully connected
    • One hidden layer
  • Convolutional neural networks (CNN)
    • Weight connections emulate convolution kernels in image processing (see the sketch after this list)
    • The same kernel weights are shared among all nodes in the same layer
    • Sparser and implicitly regularized compared with an MLP
      • Can be deeper/wider
    • Image related applications
  • Recurrent neural networks (RNN)
    • MLP + time variable
    • Sequential applications
  • etc.
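
To make the CNN weight-sharing point concrete, here is a minimal NumPy sketch (the kernel values and sizes are illustrative assumptions, not from the original notes). It builds the dense weight matrix that a 1D "valid" convolution corresponds to: every row reuses the same kernel weights, shifted by one position, and everything else is zero, which is why such a layer is sparser and more constrained than a fully connected one.

```python
import numpy as np

kernel = np.array([1.0, -2.0, 1.0])   # illustrative 3-tap kernel
n_in = 8
n_out = n_in - len(kernel) + 1        # "valid" convolution output length

# Equivalent fully connected weight matrix: shifted copies of the same kernel.
W = np.zeros((n_out, n_in))
for i in range(n_out):
    W[i, i:i + len(kernel)] = kernel

x = np.arange(n_in, dtype=float)
# The matrix product reproduces the convolution (kernel flipped for np.convolve).
assert np.allclose(W @ x, np.convolve(x, kernel[::-1], mode="valid"))
```

In a real CNN, only the kernel entries are trainable, so the layer has `len(kernel)` parameters instead of `n_out * n_in`.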

An MLP can approximate any function; an RNN can emulate any program.

Training

  1. Forward propagation
    1. $$z^{(h)} = a^{(in)}W^{(h)} + b^{(h)}$$
    2. $$a^{(h)} = \phi(z^{(h)})$$
    3. $$z^{(out)} = a^{(h)}W^{(out)} + b^{(out)}$$
    4. $$a^{(out)} = \phi(z^{(out)})$$
  2. Calculate error
    1. $$\delta^{(out)} = \frac{\partial E}{\partial z^{(out)}} = a^{(out)} - y$$ (for a cross-entropy loss with a sigmoid/softmax output, where the activation derivative cancels)
    2. $$\Delta^{(out)} = \frac{\partial E}{\partial W^{(out)}} = \frac{\partial E}{\partial z^{(out)}} \frac{\partial z^{(out)}}{\partial W^{(out)}} = (a^{(h)})^T\delta^{(out)}$$, $$\frac{\partial E}{\partial b^{(out)}} = \delta^{(out)}$$
  3. Backpropagation
    1. Derivative w.r.t. each weight in the network
      1. $$\delta^{(h)} = \frac{\partial E}{\partial z^{(h)}} = \frac{\partial E}{\partial z^{(out)}}\frac{\partial z^{(out)}}{\partial a^{(h)}}\frac{\partial a^{(h)}}{\partial z^{(h)}} = \delta^{(out)}(W^{(out)})^T \odot \frac{\partial \phi(z^{(h)})}{\partial z^{(h)}}$$
      2. $$\Delta^{(h)} = \frac{\partial E}{\partial W^{(h)}} = \frac{\partial E}{\partial z^{(h)}} \frac{\partial z^{(h)}}{\partial W^{(h)}} = (a^{(in)})^T\delta^{(h)}$$, $$\frac{\partial E}{\partial b^{(h)}} = \delta^{(h)}$$
    2. Update weights
      1. $$\Delta^{(l)} = \Delta^{(l)} + \lambda W^{(l)}$$ (L2 regularization; not applied to the bias)
      2. $$W^{(l)} = W^{(l)} - \eta \Delta^{(l)}$$ (a NumPy sketch of the full training loop follows this list)
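
The training steps above can be condensed into a minimal NumPy sketch for a network with a single hidden layer. The layer sizes, learning rate `eta`, regularization strength `lmbda`, and the toy data are illustrative assumptions; the output error uses the cross-entropy-plus-sigmoid convention so that the output delta is simply `a_out - Y`, matching step 2 above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative shapes: one hidden layer, one output layer.
rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 3
W_h, b_h = rng.normal(scale=0.1, size=(n_in, n_hidden)), np.zeros(n_hidden)
W_out, b_out = rng.normal(scale=0.1, size=(n_hidden, n_out)), np.zeros(n_out)

eta, lmbda = 0.1, 0.01          # learning rate and L2 strength (assumed values)
X = rng.normal(size=(5, n_in))  # toy batch
Y = np.eye(n_out)[rng.integers(n_out, size=5)]  # one-hot targets

# 1. Forward propagation
z_h = X @ W_h + b_h
a_h = sigmoid(z_h)
z_out = a_h @ W_out + b_out
a_out = sigmoid(z_out)

# 2. Error at the output (cross-entropy + sigmoid => delta_out = a_out - y)
delta_out = a_out - Y
grad_W_out = a_h.T @ delta_out
grad_b_out = delta_out.sum(axis=0)

# 3. Backpropagation to the hidden layer (sigmoid'(z_h) = a_h * (1 - a_h))
delta_h = (delta_out @ W_out.T) * a_h * (1.0 - a_h)
grad_W_h = X.T @ delta_h
grad_b_h = delta_h.sum(axis=0)

# 4. Update weights (L2 term added to weight gradients only, not biases)
grad_W_out += lmbda * W_out
grad_W_h += lmbda * W_h
W_out -= eta * grad_W_out
b_out -= eta * grad_b_out
W_h -= eta * grad_W_h
b_h -= eta * grad_b_h
```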

Derivative of Sigmoid

$$\sigma'(x) = \sigma(x)(1 - \sigma(x))$$
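
For reference, this identity follows directly from the definition $$\sigma(x) = \frac{1}{1 + e^{-x}}$$:

$$\sigma'(x) = \frac{d}{dx}\left(1 + e^{-x}\right)^{-1} = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \cdot \left(1 - \frac{1}{1 + e^{-x}}\right) = \sigma(x)(1 - \sigma(x))$$

This is convenient for backpropagation: $$\phi'(z^{(h)}) = a^{(h)}(1 - a^{(h)})$$ can be computed directly from the activations already stored during the forward pass.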

Activation Function

If all internal nodes use a linear activation, a multi-layer network can be reduced to an equivalent single-layer network.
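
For the two-layer network from the training section, with identity activations and biases omitted for brevity:

$$a^{(out)} = (a^{(in)}W^{(h)})W^{(out)} = a^{(in)}(W^{(h)}W^{(out)}) = a^{(in)}W'$$

so the whole network collapses to a single layer with weights $$W' = W^{(h)}W^{(out)}$$; the non-linear activation $$\phi$$ is what gives additional layers extra expressive power.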

Stochastic GD

GD computes the gradient on the whole training set for each update; SGD uses one sample (or a small mini-batch) at a time. A sketch of the two update loops follows the list below.

  • Each update is cheaper to compute
  • Adapts better to new data (online learning)
  • Escapes shallow local minima more easily thanks to noisy updates
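
A minimal sketch of the difference, assuming a generic `grad(W, X, y)` function that returns the gradient of the loss on the given samples (the function names, batch size, and learning rate are illustrative):

```python
import numpy as np

def train_gd(W, X, y, grad, eta=0.1, epochs=100):
    # Batch gradient descent: one update per pass over the whole training set.
    for _ in range(epochs):
        W = W - eta * grad(W, X, y)
    return W

def train_sgd(W, X, y, grad, eta=0.1, epochs=100, batch_size=32, seed=0):
    # Stochastic / mini-batch GD: many cheap updates per pass, on shuffled subsets.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(n)          # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            W = W - eta * grad(W, X[idx], y[idx])
    return W
```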
