Multi-Layer Neural Network
- Hyper-parameters
- Network architecture
- Number of layers
- Number of nodes in each layer
- Bias
Types
- Multi-layer perceptron (MLP)
- Fully connected
- One hidden layer
- Convolutional neural networks (CNN)
- Weight connection emulates convolution kernels in image processing
- Kernel weights shared among all nodes in the same layer (the same kernel slides over the input; see the 1-D convolution sketch after this list)
- Sparser connections & implicit regularization compared to an MLP
- Can be deeper/wider
- Image related applications
- Recurrent neural networks (RNN)
- MLP + time variable
- Sequential applications
- etc.
An MLP can emulate any function; an RNN can emulate any program.
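To make the CNN weight-sharing point concrete, here is a minimal 1-D convolution sketch in NumPy; the 3-tap kernel values, signal length, and stride-1 window are arbitrary choices for illustration, not part of the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D example: the same 3-tap kernel (shared weights) is applied
# at every position of the input, unlike an MLP where each output node would
# have its own independent weight vector.
kernel = np.array([0.25, 0.5, 0.25])   # assumed kernel values, for illustration
signal = rng.standard_normal(10)

# Each output position uses the identical `kernel`; only the input window slides.
conv_out = np.array([signal[i:i + 3] @ kernel for i in range(len(signal) - 2)])

# An equivalent fully connected layer would need a (10 x 8) weight matrix;
# this convolutional layer stores just 3 parameters, shared across all 8 outputs.
print(conv_out.shape)  # (8,)
```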
Training (a NumPy sketch of one full training step follows this list)
- Forward propagation
- $$z^{(h)} = a^{(in)}W^{(h)}$$
- $$a^{(h)} = \phi(z^{(h)})$$
- $$z^{(out)} = a^{(h)}W^{(out)}$$
- $$a^{(out)} = \phi(z^{(out)})$$
- Calculate error
- $$\delta^{(out)} = \frac{\partial E}{\partial a^{(out)}} = a^{(out)} - y$$
- $$\Delta^{(out)} = \frac{\partial E}{\partial W^{(out)}} = \frac{\partial E}{\partial a^{(out)}} \frac{\partial a^{(out)}}{\partial W^{(out)}} = (a^{(h)})^T\delta^{(out)}$$, $$\frac{\partial E}{\partial b^{(out)}} = \delta^{(out)}$$
- Backpropagation
- Derivative w.r.t. each weight in the network
- $$\delta^{(h)} = \frac{\partial E}{\partial a^{(h)}} = \frac{\partial E}{\partial a^{(out)}}\frac{\partial a^{(out)}}{\partial z^{(out)}}\frac{\partial z^{(out)}}{\partial a^{(h)}} = \delta^{(out)}(W^{(out)})^T \odot \frac{\partial \phi(z^{(h)})}{\partial z^{(h)}}$$
- $$\Delta^{(h)} = \frac{\partial E}{\partial W^{(h)}} = \frac{\partial E}{\partial a^{(h)}} \frac{\partial a^{(h)}}{\partial W^{(h)}} = (a^{(in)})^T\delta^{(h)}$$, $$\frac{\partial E}{\partial b^{(h)}} = \delta^{(h)}$$
- Update weights
- $$\Delta^{(l)} = \Delta^{(l)} + \lambda W^{(l)}$$ (L2 regularization term; not applied to the bias)
- $$W^{(l)} = W^{(l)} - \eta \Delta^{(l)}$$
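A minimal NumPy sketch of one training step for a single-hidden-layer network, following the matrix forms above. The layer sizes, learning rate `eta`, L2 strength `lam`, and the sigmoid output (so that $\delta^{(out)} = a^{(out)} - y$, matching the error expression above) are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Assumed toy dimensions: 4 inputs, 5 hidden nodes, 3 outputs, batch of 8.
a_in = rng.standard_normal((8, 4))
y = np.eye(3)[rng.integers(0, 3, size=8)]          # one-hot targets
W_h, b_h = 0.1 * rng.standard_normal((4, 5)), np.zeros(5)
W_out, b_out = 0.1 * rng.standard_normal((5, 3)), np.zeros(3)
eta, lam = 0.1, 0.01                               # learning rate, L2 strength

# Forward propagation
z_h = a_in @ W_h + b_h
a_h = sigmoid(z_h)
z_out = a_h @ W_out + b_out
a_out = sigmoid(z_out)

# Error at the output layer
delta_out = a_out - y
grad_W_out = a_h.T @ delta_out
grad_b_out = delta_out.sum(axis=0)

# Backpropagation to the hidden layer: sigmoid'(z_h) = a_h * (1 - a_h)
delta_h = (delta_out @ W_out.T) * a_h * (1.0 - a_h)
grad_W_h = a_in.T @ delta_h
grad_b_h = delta_h.sum(axis=0)

# L2 regularization (weights only, not biases), then the update step
grad_W_out += lam * W_out
grad_W_h += lam * W_h
W_out -= eta * grad_W_out
b_out -= eta * grad_b_out
W_h -= eta * grad_W_h
b_h -= eta * grad_b_h
```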
Derivative of Sigmoid
$$\sigma'(x) = \sigma(x)(1 - \sigma(x))$$
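This follows from the chain rule applied to $\sigma(x) = (1 + e^{-x})^{-1}$:
$$\sigma'(x) = \frac{d}{dx}\left(1 + e^{-x}\right)^{-1} = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x)\bigl(1 - \sigma(x)\bigr)$$
This is why backpropagation can reuse the stored activation $a = \sigma(z)$ to get $\sigma'(z) = a(1 - a)$ without recomputing the exponential.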
Activation Function
If all internal nodes have a linear activation, a multi-layer network can be reduced to a single-layer network.
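For example, with identity activations and the forward-propagation equations above:
$$a^{(out)} = z^{(out)} = a^{(h)}W^{(out)} = \bigl(a^{(in)}W^{(h)}\bigr)W^{(out)} = a^{(in)}\bigl(W^{(h)}W^{(out)}\bigr)$$
which is a single linear layer with weight matrix $W^{(h)}W^{(out)}$; bias terms fold in the same way. Nonlinear activations are what give extra layers additional expressive power.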
Stochastic GD
GD: computes the gradient on the whole training set; SGD: uses one sample or a small mini-batch at a time (see the sketch after this list).
- Each update is cheaper to compute
- Adapts better to new data (online learning)
- Escapes shallow local minima more easily (noisy updates)
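A minimal mini-batch SGD sketch on a toy linear-regression problem with squared error; the data, batch size, learning rate, and number of steps are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: 1000 samples, 3 features, known true weights plus noise.
X = rng.standard_normal((1000, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.standard_normal(1000)

w = np.zeros(3)
eta, batch_size = 0.05, 32

for step in range(200):
    # SGD: gradient estimated from a small random subset instead of all 1000 samples.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    X_b, y_b = X[idx], y[idx]
    grad = (2.0 / batch_size) * X_b.T @ (X_b @ w - y_b)
    w -= eta * grad

print(w)  # should be close to true_w
```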