Technical Neural Nets 2
From Perceptron to MLP
Why a Nonlinear Transfer Function?
- smooth, continuous, differentiable → the backpropagation algorithm can be applied
- monotonically increasing
- bounded output (two such functions are sketched after this list)
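The logistic sigmoid and the hyperbolic tangent are the classic transfer functions satisfying all three properties. A minimal sketch (Python/NumPy is an assumption; the notes do not prescribe a language):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: smooth, monotonically increasing, bounded in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative f'(z) = f(z) * (1 - f(z)), used later by backpropagation."""
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_prime(z):
    """Derivative of tanh: f'(z) = 1 - tanh(z)**2; tanh is bounded in (-1, 1)."""
    return 1.0 - np.tanh(z) ** 2

z = np.linspace(-4.0, 4.0, 9)
print(sigmoid(z))        # stays strictly between 0 and 1 (bounded)
print(sigmoid_prime(z))  # strictly positive everywhere (monotonic increase)
```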
MLP Structure
- one input layer, one or more hidden layers, one output layer
- no weights in input layer
- each neuron has a set of weights
- forward unidirectional connections
- fully connected (a forward pass through such a network is sketched after this list)
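To make the structure concrete, a forward pass through a one-hidden-layer MLP can be sketched as follows (layer sizes, random weights, and the sigmoid transfer function are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

N, H, M = 3, 4, 2                 # input, hidden, and output neuron counts
W1 = rng.normal(size=(H, N + 1))  # input -> hidden weights, +1 column for the bias
W2 = rng.normal(size=(M, H + 1))  # hidden -> output weights, +1 column for the bias

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """Fully connected, strictly feed-forward pass: input -> hidden -> output."""
    x = np.append(x, 1.0)          # bias input (the input layer itself has no weights)
    out_h = sigmoid(W1 @ x)        # hidden layer: weighted sum, then transfer function
    out_h = np.append(out_h, 1.0)  # bias input for the output layer
    return sigmoid(W2 @ out_h)     # output layer: network output y

print(forward(np.array([0.2, -0.5, 1.0])))
```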
MLP Capability
Universal function approximators
It has been shown that an MLP with one hidden layer of a finite number of neurons with nonlinear transfer function is capable of approximating any continuous mapping from an N-dimensional input space to an M-dimensional output space with arbitrary accuracy.
Learning
Assessment of the Model
- Loss function (statistics, pattern recognition, …)
- Error function (function approximation, pattern recognition, …)
Supervised Learning Scheme
- Choose a model and obtain some teacher data.
- Initialize the model by setting its parameters.
- Pick some data X and pass it to the model.
- Apply the model to the data to produce the output Y(X).
- Compare the output Y to the teacher output Ŷ.
- Apply the learning algorithm to change the parameters.
- Decide whether to stop or continue.
- Post-process if necessary (a minimal loop following these steps is sketched after this list).
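To make the scheme concrete, here is a minimal loop that follows these steps for a deliberately tiny model, a single weight w fitted to teacher data y = 2x by gradient descent (the model, data, and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

# Choose a model (Y = w * X) and get some teacher data.
X = rng.uniform(-1.0, 1.0, size=100)
Y_teacher = 2.0 * X

# Initialize the model by setting the parameter.
w = 0.0
eta = 0.5  # learning rate

for epoch in range(200):
    # Pick the data and apply the model to produce the output Y(X).
    Y = w * X
    # Compare the output Y to the teacher (half-squared error, averaged).
    E = 0.5 * np.mean((Y_teacher - Y) ** 2)
    # Apply the learning algorithm (gradient descent) to change the parameter.
    grad = np.mean((Y - Y_teacher) * X)   # dE/dw
    w -= eta * grad
    # Decide whether to stop or continue.
    if E < 1e-6:
        break

# Post-processing: report the learned parameter.
print(f"learned w = {w:.4f}, final error = {E:.2e}")
```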
Error Function
[
E_p = \frac{1}{2}\sum_{m=1}^{M}(\hat{y}_m - y_m)^2
]
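A quick numerical check of the formula (values chosen only for illustration): with M = 2 outputs, teacher ŷ = (1, 0) and network output y = (0.8, 0.2),
[
E_p = \frac{1}{2}\big((1 - 0.8)^2 + (0 - 0.2)^2\big) = \frac{1}{2}(0.04 + 0.04) = 0.04
]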
Backpropagation
Table of Symbols and Meanings
| Symbol | Meaning | Explanation |
|---|---|---|
| p | Training pattern index | Identifies the current sample in the training set |
| Eₚ | Error for pattern p | (E_p = \frac{1}{2}\sum_m (\hat{y}_m - y_m)^2); half-squared error for one sample |
| η (eta) | Learning rate | Controls the step size for each weight update |
| wₙₕ, wₕₘ | Weights | (w_{nh}): from input neuron n → hidden neuron h; (w_{hm}): from hidden h → output m |
| Δw | Weight change | Amount each weight is updated: (-\eta \frac{\partial E_p}{\partial w}) |
| xₙ | Input neuron output | The _n_th input value (also written as (out_i) for previous layer output) |
| ~outₕ | Hidden neuron output | The output value from hidden neuron h, used as input to the next layer |
| yₘ | Actual output | The network’s prediction for output neuron m |
| ŷₘ | Target output | The desired (teacher) output for neuron m |
| netₕ, netₘ | Net input | Weighted sum before activation: (net_j = \sum_i w_{ij} out_i) |
| f(net) | Activation function | Nonlinear function applied to net (e.g., sigmoid or tanh) |
| f′(net) | Derivative of activation | Needed for gradient computation in BP |
| δₘ | Delta for output neuron m | ((y_m - \hat{y}_m) f′(net_m)); represents output layer error signal |
| δₕ | Delta for hidden neuron h | (f′(net_h)\sum_m w_{hm}δ_m); backpropagated error signal |
| ∇_W E | Gradient of error w.r.t. weights | Vector of all partial derivatives (\frac{\partial E}{\partial w}) |
| ΔW | Vector of all weight updates | (\Delta W = -\eta \, \nabla_W E) |
| K, M, H, N | Counts of neurons | (N)=input, (H)=hidden, (M)=output, (K)=next layer neurons |
| Bias | Constant input (usually = 1) | Allows shifting the activation threshold |
Mathematical Derivation
0. Setup and Goal
For a single training pattern (p):
[
E_p = \frac{1}{2}\sum_{m=1}^{M}(\hat{y}_m - y_m)^2
]
(half the sum of squared errors over the M output neurons for pattern p; the factor ½ cancels the 2 from the exponent when differentiating)
Weight update rule (gradient descent):
[
\Delta w = -\eta \frac{\partial E_p}{\partial w}
]
Each neuron’s net input and output:
[
net_m = \sum_{h=0}^{H} \tilde{out}_h \, w_{hm}, \quad y_m = f(net_m), \quad \tilde{out}_0 = 1 \text{ (bias input)}.
]
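For example (numbers purely illustrative), with H = 2 hidden neurons, outputs \tilde{out} = (1, 0.6, 0.3) including the bias, and weights w_{0m} = 0.1, w_{1m} = 0.4, w_{2m} = -0.2:
[
net_m = 0.1 \cdot 1 + 0.4 \cdot 0.6 + (-0.2) \cdot 0.3 = 0.28, \quad y_m = f(0.28)
]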
Typical activation derivatives:
- Sigmoid: (f'(z) = f(z)(1 - f(z)))
- Tanh: (f'(z) = 1 - \tanh^2(z))
Goal: compute (\frac{\partial E_p}{\partial w}) for every weight in the network.
1. Output Layer Weights (w_{hm})
(Connection from hidden neuron (h) → output neuron (m))
Apply the chain rule:
[
\frac{\partial E_p}{\partial w_{hm}} = \frac{\partial E_p}{\partial y_m} \cdot \frac{\partial y_m}{\partial net_m} \cdot \frac{\partial net_m}{\partial w_{hm}}
]
Step by step:
[
\frac{\partial E_p}{\partial y_m} = -(\hat{y}_m - y_m) = (y_m - \hat{y}_m), \quad
\frac{\partial y_m}{\partial net_m} = f'(net_m), \quad
\frac{\partial net_m}{\partial w_{hm}} = \tilde{out}_h
]
Combine:
[
\frac{\partial E_p}{\partial w_{hm}} = (y_m - \hat{y}_m) f'(net_m) \tilde{out}_h
]
Define the error signal (delta) for the output neuron:
[
\boxed{\delta_m = (y_m - \hat{y}_m) f'(net_m)}
]
Then:
[
\frac{\partial E_p}{\partial w_{hm}} = \delta_m \tilde{out}_h, \quad \boxed{\Delta w_{hm} = -\eta \, \delta_m \, \tilde{out}_h}
]
This is the delta rule for output weights.
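A direct transcription of the two boxed formulas into code (a self-contained sketch with made-up numbers; sigmoid output units are assumed, so f'(net_m) = y_m(1 - y_m)):

```python
import numpy as np

eta = 0.5

# Illustrative values for one training pattern.
out_h = np.array([1.0, 0.6, 0.3])   # hidden outputs, with the bias \tilde{out}_0 = 1
y     = np.array([0.8, 0.2])        # network outputs y_m = f(net_m)
y_hat = np.array([1.0, 0.0])        # teacher outputs

f_prime = y * (1.0 - y)             # sigmoid: f'(net_m) = y_m (1 - y_m)
delta_m = (y - y_hat) * f_prime     # boxed delta of the output layer

# The outer product gives dE_p/dw_{hm} for every (m, h) pair at once.
dE_dW2 = np.outer(delta_m, out_h)
dW2    = -eta * dE_dW2              # boxed update \Delta w_{hm}
print(dW2)
```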
2. Hidden Layer Weights (w_{nh})
(Connection from input neuron (n) → hidden neuron (h))
The hidden layer error depends on all output neurons, so again apply the chain rule:
[
\frac{\partial E_p}{\partial w_{nh}} = \sum_{m=1}^{M} \frac{\partial E_p}{\partial y_m} \frac{\partial y_m}{\partial net_m} \frac{\partial net_m}{\partial \tilde{out}_h} \frac{\partial \tilde{out}_h}{\partial net_h} \frac{\partial net_h}{\partial w_{nh}}
]
Substitute each term:
[
\frac{\partial net_m}{\partial \tilde{out}_h} = w_{hm}, \quad \frac{\partial \tilde{out}_h}{\partial net_h} = f'(net_h), \quad \frac{\partial net_h}{\partial w_{nh}} = x_n
]
and note that:
[
\frac{\partial E_p}{\partial y_m} \frac{\partial y_m}{\partial net_m} = \delta_m
]
So:
[
\frac{\partial E_p}{\partial w_{nh}} = \Big(\sum_{m=1}^{M} \delta_m \, w_{hm}\Big) f'(net_h) x_n
]
Define the error signal (delta) for the hidden neuron:
[
\boxed{\delta_h = f'(net_h) \sum_{m=1}^{M} w_{hm} \delta_m}
]
Thus:
[
\frac{\partial E_p}{\partial w_{nh}} = \delta_h x_n, \quad \boxed{\Delta w_{nh} = -\eta \, \delta_h \, x_n}
]
(Bias weights use the same rule, with (x_0 = 1)).
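The corresponding sketch for the hidden-layer weights (again with made-up numbers; W2 holds the weights w_{hm} with one row per output neuron, and sigmoid hidden units are assumed, so f'(net_h) = out_h(1 - out_h)):

```python
import numpy as np

eta = 0.5

x       = np.array([1.0, 0.2, -0.5])   # inputs x_n, with the bias x_0 = 1
out_h   = np.array([0.6, 0.3])         # hidden outputs \tilde{out}_h (bias excluded)
delta_m = np.array([-0.032, 0.032])    # output-layer deltas from the previous sketch
W2      = np.array([[0.4, -0.1],       # w_{hm}: row index = m, column index = h
                    [0.2,  0.7]])

f_prime_h = out_h * (1.0 - out_h)        # sigmoid: f'(net_h) = out_h (1 - out_h)
delta_h   = f_prime_h * (W2.T @ delta_m) # boxed delta of the hidden layer

dE_dW1 = np.outer(delta_h, x)            # dE_p/dw_{nh} = delta_h * x_n
dW1    = -eta * dE_dW1                   # boxed update \Delta w_{nh}
print(dW1)
```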
BP Conclusion
General weight update rule (using the same sign convention as in the derivation above):
[
\Delta w_{ij} = -\eta \, \delta_j \, out_i
]
Each weight change = learning rate × neuron j's delta × neuron i's output.
Output neuron:
[
\delta_m = (y_m - \hat{y}_m) \, f'(net_m)
]
The error comes directly from comparing the network output with the target.
Hidden neuron:
[
\delta_h = \Big(\sum_{k=1}^{K} \delta_k \, w_{hk}\Big) f'(net_h)
]
The error is backpropagated from the following layer (with K neurons).
In short:
BP adjusts each weight by how much that neuron contributed to the total error, propagating δ backward layer by layer.
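One way to convince yourself that the δ recursion really computes ∂E_p/∂w is a finite-difference check on a small network (a self-contained sketch, not part of the original notes; sigmoid units and random weights are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

N, H, M = 3, 4, 2
W1 = rng.normal(scale=0.5, size=(H, N + 1))   # input -> hidden (last column = bias)
W2 = rng.normal(scale=0.5, size=(M, H + 1))   # hidden -> output (last column = bias)
x     = rng.normal(size=N)
y_hat = rng.uniform(size=M)

def forward(W1, W2):
    xb    = np.append(x, 1.0)
    out_h = sigmoid(W1 @ xb)
    hb    = np.append(out_h, 1.0)
    return xb, out_h, hb, sigmoid(W2 @ hb)

def error(W1, W2):
    *_, y = forward(W1, W2)
    return 0.5 * np.sum((y_hat - y) ** 2)

# Backpropagation gradients for one pattern.
xb, out_h, hb, y = forward(W1, W2)
delta_m = (y - y_hat) * y * (1.0 - y)                      # output-layer deltas
dE_dW2  = np.outer(delta_m, hb)
delta_h = out_h * (1.0 - out_h) * (W2[:, :H].T @ delta_m)  # hidden deltas (bias column excluded)
dE_dW1  = np.outer(delta_h, xb)

# Finite-difference check on one weight of each layer.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W2p = W2.copy(); W2p[0, 0] += eps
print(dE_dW1[0, 0], (error(W1p, W2) - error(W1, W2)) / eps)  # the two numbers should agree closely
print(dE_dW2[0, 0], (error(W1, W2p) - error(W1, W2)) / eps)  # the two numbers should agree closely
```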