Multi-Layer Neural Network
- Hyper-parameters
- Network architecture
- Number of layers
- Number of nodes in each layer
- Bias
Types
- Multi-layer perceptron (MLP)
- Fully connected
- One hidden layer
- Convolutional neural networks (CNN)
- Weight connection emulates convolution kernels in image processing
- Kernel weights shared among all nodes in the same layer (the same kernel slides over the input; see the 1-D convolution sketch after this list)
- Sparser connections & implicit regularization compared to an MLP
- Can be deeper/wider
- Image related applications
- Recurrent neural networks (RNN)
- MLP + time variable
- Sequential applications
- etc.
An MLP can emulate any function; an RNN can emulate any program.
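To make the CNN weight-sharing point concrete, here is a minimal 1-D convolution sketch in NumPy; the 3-tap kernel values, signal length, and stride-1 window are arbitrary choices for illustration, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D example: the same 3-tap kernel (shared weights) is applied
# at every position of the input, unlike an MLP where each output node would
# have its own independent weight vector.
kernel = np.array([0.25, 0.5, 0.25])   # assumed kernel values, for illustration
signal = rng.standard_normal(10)

# Each output position uses the identical `kernel`; only the input window slides.
conv_out = np.array([signal[i:i + 3] @ kernel for i in range(len(signal) - 2)])

# An equivalent fully connected layer would need a (10 x 8) weight matrix;
# this convolutional layer stores just 3 parameters, shared across all 8 outputs.
print(conv_out.shape)  # (8,)
```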
Training (a NumPy sketch of one full training step follows this list)
- Forward propagation
- $$z^{(h)} = a^{(in)}W^{(h)}$$
- $$a^{(h)} = \phi(z^{(h)})$$
- $$z^{(out)} = a^{(h)}W^{(out)}$$
- $$a^{(out)} = \phi(z^{(out)})$$
- Calculate error
- $$\delta^{(out)} = \frac{\partial E}{\partial a^{(out)}} = a^{(out)} - y$$
- $$\Delta^{(out)} = \frac{\partial E}{\partial W^{(out)}} = \frac{\partial E}{\partial a^{(out)}} \frac{\partial a^{(out)}}{\partial W^{(out)}} = (a^{(h)})^T\delta^{(out)}$$, $$\frac{\partial E}{\partial b^{(out)}} = \delta^{(out)}$$
- Backpropagation
- Derivative w.r.t. each weight in the network
- $$\delta^{(h)} = \frac{\partial E}{\partial a^{(h)}} = \frac{\partial E}{\partial a^{(out)}}\frac{\partial a^{(out)}}{\partial z^{(out)}}\frac{\partial z^{(out)}}{\partial a^{(h)}} = \delta^{(out)}(W^{(out)})^T \odot \frac{\partial \phi(z^{(h)})}{\partial z^{(h)}}$$
- $$\Delta^{(h)} = \frac{\partial E}{\partial W^{(h)}} = \frac{\partial E}{\partial a^{(h)}} \frac{\partial a^{(h)}}{\partial W^{(h)}} = (a^{(in)})^T\delta^{(h)}$$, $$\frac{\partial E}{\partial b^{(h)}} = \delta^{(h)}$$
- Update weights
- $$\Delta^{(l)} = \Delta^{(l)} + \lambda W^{(l)}$$ (L2 regularization term; not applied to the bias)
- $$W^{(l)} = W^{(l)} - \eta \Delta^{(l)}$$
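A minimal NumPy sketch of one training step for a single-hidden-layer network, following the matrix forms above. The layer sizes, learning rate `eta`, L2 strength `lam`, and the sigmoid output (so that $\delta^{(out)} = a^{(out)} - y$, matching the error expression above) are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Assumed toy dimensions: 4 inputs, 5 hidden nodes, 3 outputs, batch of 8.
a_in = rng.standard_normal((8, 4))
y = np.eye(3)[rng.integers(0, 3, size=8)]          # one-hot targets
W_h, b_h = 0.1 * rng.standard_normal((4, 5)), np.zeros(5)
W_out, b_out = 0.1 * rng.standard_normal((5, 3)), np.zeros(3)
eta, lam = 0.1, 0.01                               # learning rate, L2 strength

# Forward propagation
z_h = a_in @ W_h + b_h
a_h = sigmoid(z_h)
z_out = a_h @ W_out + b_out
a_out = sigmoid(z_out)

# Error at the output layer
delta_out = a_out - y
grad_W_out = a_h.T @ delta_out
grad_b_out = delta_out.sum(axis=0)

# Backpropagation to the hidden layer: sigmoid'(z_h) = a_h * (1 - a_h)
delta_h = (delta_out @ W_out.T) * a_h * (1.0 - a_h)
grad_W_h = a_in.T @ delta_h
grad_b_h = delta_h.sum(axis=0)

# L2 regularization (weights only, not biases), then the update step
grad_W_out += lam * W_out
grad_W_h += lam * W_h
W_out -= eta * grad_W_out
b_out -= eta * grad_b_out
W_h -= eta * grad_W_h
b_h -= eta * grad_b_h
```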
Derivative of Sigmoid
$$\sigma'(x) = \sigma(x)(1 - \sigma(x))$$
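This follows from the chain rule applied to $\sigma(x) = (1 + e^{-x})^{-1}$:
$$\sigma'(x) = \frac{d}{dx}\left(1 + e^{-x}\right)^{-1} = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x)\bigl(1 - \sigma(x)\bigr)$$
This is why backpropagation can reuse the stored activation $a = \sigma(z)$ to get $\sigma'(z) = a(1 - a)$ without recomputing the exponential.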
Activation Function
If all internal nodes have a linear activation, a multi-layer network can be reduced to a single-layer network.
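For example, with identity activations and the forward-propagation equations above:
$$a^{(out)} = z^{(out)} = a^{(h)}W^{(out)} = \bigl(a^{(in)}W^{(h)}\bigr)W^{(out)} = a^{(in)}\bigl(W^{(h)}W^{(out)}\bigr)$$
which is a single linear layer with weight matrix $W^{(h)}W^{(out)}$; bias terms fold in the same way. Nonlinear activations are what give extra layers additional expressive power.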
Stochastic GD
GD: computes the gradient on the whole training set; SGD: uses one sample or a small mini-batch at a time (see the sketch after this list).
- Each update is cheaper to compute
- Adapts better to new data (online learning)
- Escapes shallow local minima more easily (noisy updates)
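A minimal mini-batch SGD sketch on a toy linear-regression problem with squared error; the data, batch size, learning rate, and number of steps are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: 1000 samples, 3 features, known true weights plus noise.
X = rng.standard_normal((1000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.standard_normal(1000)

w = np.zeros(3)
eta, batch_size = 0.05, 32

for step in range(200):
    # SGD: gradient estimated from a small random subset instead of all 1000 samples.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    X_b, y_b = X[idx], y[idx]
    grad = (2.0 / batch_size) * X_b.T @ (X_b @ w - y_b)
    w -= eta * grad

print(w)  # should be close to true_w
```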