Technical Neural Nets 2

From Perceptron to MLP

Why Nonlinear Transfer Function

  • smooth, continuous, differentiable → the backpropagation algorithm can be applied
  • monotonically increasing
  • bounded output (see the sketch below)
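
For example, the logistic sigmoid satisfies all three requirements. A minimal check, assuming NumPy (the choice of sigmoid here is only an illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: smooth, monotonically increasing, bounded in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 1001)
out = sigmoid(z)

# bounded output: all values stay strictly between 0 and 1
assert np.all((out > 0) & (out < 1))

# monotonic increase: outputs never decrease as z grows
assert np.all(np.diff(out) >= 0)

# differentiable: the derivative has the simple closed form f'(z) = f(z)(1 - f(z)),
# which is exactly what backpropagation needs
dout = sigmoid(z) * (1 - sigmoid(z))
print(dout.max())  # ≈ 0.25, reached at z = 0
```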

MLP Structure

  • one input layer, one or more hidden layers, one output layer
  • the input layer has no weights; it only distributes the inputs
  • each hidden and output neuron has its own set of weights (one per incoming connection, plus a bias)
  • connections are feed-forward (unidirectional) only
  • adjacent layers are fully connected (a minimal forward pass is sketched below)
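
A minimal sketch of this structure, assuming NumPy, one hidden layer, and tanh as the transfer function (the layer sizes and activation choice are illustrative, not prescribed by the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

N, H, M = 3, 5, 2                               # input, hidden, output neuron counts

# weight matrices of the fully connected, feed-forward layers
# (the input layer itself carries no weights)
W_nh = rng.normal(scale=0.5, size=(N + 1, H))   # +1 row for the bias input x_0 = 1
W_hm = rng.normal(scale=0.5, size=(H + 1, M))   # +1 row for the hidden bias out_0 = 1

def forward(x):
    """One feed-forward pass: input -> hidden -> output."""
    x = np.concatenate(([1.0], x))              # prepend the bias input
    net_h = x @ W_nh                            # net input of the hidden neurons
    out_h = np.tanh(net_h)                      # nonlinear transfer function
    out_h = np.concatenate(([1.0], out_h))      # prepend the hidden bias
    net_m = out_h @ W_hm
    return np.tanh(net_m)                       # output layer (also nonlinear here)

print(forward(np.array([0.2, -0.7, 1.0])))
```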

MLP Capability

Universal function approximators
It has been shown that an MLP with a single hidden layer containing a finite number of neurons with a nonlinear transfer function can approximate any continuous mapping from an N-dimensional input space to an M-dimensional output space (on a compact domain) to arbitrary accuracy.


Learning

Assessment of the Model

  • Loss function (statistics, pattern recognition, …)
  • Error function (function approximation, pattern recognition, …)

Supervised Learning Scheme

  1. Choose a model and obtain teacher (training) data.
  2. Initialize the model by setting its parameters.
  3. Pick some data X and feed it to the model.
  4. Apply the model to the data to produce the output Y(X).
  5. Compare the output Y to the teacher output Ŷ.
  6. Apply the learning algorithm to adjust the parameters.
  7. Decide whether to stop or continue.
  8. Post-process if necessary. (A sketch of this loop follows below.)
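
A schematic version of this loop in Python (the `model` object, its methods, and the `loss` function are hypothetical placeholders for whatever model and error function are chosen, not a specific library API):

```python
def train(model, data, targets, loss, learning_rate=0.1, max_epochs=100, tol=1e-4):
    """Skeleton of the supervised learning scheme described above."""
    model.initialize_parameters()                              # step 2
    for epoch in range(max_epochs):
        total_error = 0.0
        for x, y_teacher in zip(data, targets):                # step 3: pick data X
            y = model.forward(x)                               # step 4: produce Y(X)
            total_error += loss(y, y_teacher)                  # step 5: compare Y to Ŷ
            model.update_parameters(learning_rate, y, y_teacher)  # step 6: learn
        if total_error < tol:                                  # step 7: stop or continue?
            break
    return model                                               # step 8: post-processing follows
```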

Error Function
[
E_p = \frac{1}{2}\sum_{m=1}^{M}(\hat{y}_m - y_m)^2
]
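
For example, with M = 2 outputs, targets ŷ = (1.0, 0.0) and network outputs y = (0.8, 0.3) (values chosen only for illustration):
[
E_p = \frac{1}{2}\left[(1.0 - 0.8)^2 + (0.0 - 0.3)^2\right] = \frac{1}{2}(0.04 + 0.09) = 0.065
]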


Back Propagation

Table of Symbols and Meanings

Symbol | Meaning | Explanation
p | Training pattern index | Identifies the current sample in the training set
Eₚ | Error for pattern p | (E_p = \frac{1}{2}\sum_m (\hat{y}_m - y_m)^2); half-squared error for one sample
η (eta) | Learning rate | Controls the step size of each weight update
wₙₕ, wₕₘ | Weights | (w_{nh}): from input neuron n → hidden neuron h; (w_{hm}): from hidden neuron h → output neuron m
Δw | Weight change | Amount each weight is updated: (-\eta \frac{\partial E_p}{\partial w})
xₙ | Input neuron output | The n-th input value (also written as (out_i) for the previous layer's output)
~outₕ | Hidden neuron output | The output (\tilde{out}_h) of hidden neuron h, used as input to the next layer
yₘ | Actual output | The network's prediction for output neuron m
ŷₘ | Target output | The desired (teacher) output for neuron m
netₕ, netₘ | Net input | Weighted sum before activation: (net_j = \sum_i w_{ij}\, out_i)
f(net) | Activation function | Nonlinear function applied to net (e.g., sigmoid or tanh)
f′(net) | Derivative of activation | Needed for the gradient computation in BP
δₘ | Delta for output neuron m | ((y_m - \hat{y}_m)\, f'(net_m)); the output-layer error signal
δₕ | Delta for hidden neuron h | (f'(net_h)\sum_m w_{hm}\,\delta_m); the backpropagated error signal
∇_W E | Gradient of error w.r.t. weights | Vector of all partial derivatives (\frac{\partial E}{\partial w})
ΔW | Vector of all weight updates | (\Delta W = -\eta\, \nabla_W E)
N, H, M, K | Counts of neurons | (N) = input, (H) = hidden, (M) = output, (K) = next-layer neurons
Bias | Constant input (usually = 1) | Allows shifting the activation threshold

Mathematical Derivation

0. Setup and Goal

For a single training pattern (p):
[
E_p = \frac{1}{2}\sum_{m=1}^{M}(\hat{y}_m - y_m)^2
]
(half the sum of squared errors over the M outputs; the factor ½ cancels the 2 that appears when differentiating)

Weight update rule (gradient descent):
[
\Delta w = -\eta \frac{\partial E_p}{\partial w}
]

Each neuron’s net input and output:
[
net_m = \sum_{g=0}^{H} \tilde{out}_g \, w_{gm}, \quad y_m = f(net_m), \quad \tilde{out}_0 = 1 \text{ (bias input)}.
]

Typical activation derivatives:

  • Sigmoid: (f'(z) = f(z)(1 - f(z)))
  • Tanh: (f'(z) = 1 - \tanh^2(z))

Goal: compute (\frac{\partial E_p}{\partial w}) for every weight in the network.
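
These closed forms can be checked numerically. A small sketch (assuming NumPy) that defines both derivatives and verifies them with a central finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # f'(z) = f(z) (1 - f(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_prime(z):
    # f'(z) = 1 - tanh^2(z)
    return 1.0 - np.tanh(z) ** 2

# finite-difference check of both closed-form derivatives
z = np.linspace(-4, 4, 9)
eps = 1e-6
assert np.allclose(sigmoid_prime(z), (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps), atol=1e-6)
assert np.allclose(tanh_prime(z), (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps), atol=1e-6)
```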


1. Output Layer Weights (w_{hm})

(Connection from hidden neuron (h) → output neuron (m))

Apply the chain rule:
[
\frac{\partial E_p}{\partial w_{hm}} = \frac{\partial E_p}{\partial y_m} \cdot \frac{\partial y_m}{\partial net_m} \cdot \frac{\partial net_m}{\partial w_{hm}}
]

Step by step:
[
\frac{\partial E_p}{\partial y_m} = -(\hat{y}_m - y_m) = (y_m - \hat{y}_m), \quad
\frac{\partial y_m}{\partial net_m} = f'(net_m), \quad
\frac{\partial net_m}{\partial w_{hm}} = \tilde{out}_h
]

Combine:
[
\frac{\partial E_p}{\partial w_{hm}} = (y_m - \hat{y}_m) f'(net_m) \tilde{out}_h
]

Define the error signal (delta) for the output neuron:
[
\boxed{\delta_m = (y_m - \hat{y}_m) f'(net_m)}
]

Then:
[
\frac{\partial E_p}{\partial w_{hm}} = \delta_m \, \tilde{out}_h, \quad \boxed{\Delta w_{hm} = -\eta \, \delta_m \, \tilde{out}_h}
]

This is the delta rule for output weights.
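
A sketch of this output-layer delta rule in NumPy (the tanh activations, layer sizes, and concrete numbers are illustrative assumptions, not fixed by the derivation):

```python
import numpy as np

eta = 0.5                                     # learning rate
x = np.array([1.0, 0.3, -0.8])                # inputs; x[0] = 1 is the bias input
W_nh = np.array([[ 0.2, -0.1],
                 [ 0.5,  0.3],
                 [-0.4,  0.6]])               # input -> hidden weights, shape (N+1, H)
W_hm = np.array([[ 0.1, -0.3],
                 [ 0.4,  0.2],
                 [-0.5,  0.7]])               # hidden -> output weights, shape (H+1, M)
y_hat = np.array([1.0, 0.0])                  # teacher output ŷ

# forward pass (tanh transfer function)
net_h = x @ W_nh
out_h = np.concatenate(([1.0], np.tanh(net_h)))   # hidden outputs, with bias out_0 = 1
net_m = out_h @ W_hm
y = np.tanh(net_m)

# delta rule for the output weights: δ_m = (y_m - ŷ_m) f'(net_m),  Δw_hm = -η δ_m out_h
delta_m = (y - y_hat) * (1 - np.tanh(net_m) ** 2)
dW_hm = -eta * np.outer(out_h, delta_m)
```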


2. Hidden Layer Weights (w_{nh})

(Connection from input neuron (n) → hidden neuron (h))

The hidden layer error depends on all output neurons, so again apply the chain rule:
[
\frac{\partial E_p}{\partial w_{nh}} = \sum_{m=1}^{M} \frac{\partial E_p}{\partial y_m} \frac{\partial y_m}{\partial net_m} \frac{\partial net_m}{\partial \tilde{out}_h} \frac{\partial \tilde{out}_h}{\partial net_h} \frac{\partial net_h}{\partial w_{nh}}
]

Substitute each term:
[
\frac{\partial net_m}{\partial \tilde{out}_h} = w_{hm}, \quad \frac{\partial \tilde{out}_h}{\partial net_h} = f'(net_h), \quad \frac{\partial net_h}{\partial w_{nh}} = x_n
]

and note that:
[
\frac{\partial E_p}{\partial y_m} \frac{\partial y_m}{\partial net_m} = \delta_m
]

So:
[
\frac{\partial E_p}{\partial w_{nh}} = \Big(\sum_{m=1}^{M} \delta_m \, w_{hm}\Big) f'(net_h) \, x_n
]

Define the error signal (delta) for the hidden neuron:
[
\boxed{\delta_h = f'(net_h) \sum_{m=1}^{M} w_{hm} \delta_m}
]

Thus:
[
\frac{\partial E_p}{\partial w_{nh}} = \delta_h x_n, \quad \boxed{\Delta w_{nh} = -\eta \, \delta_h \, x_n}
]

(Bias weights use the same rule, with (x_0 = 1)).
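
Continuing the sketch above, the hidden deltas and the input-to-hidden update follow the same pattern:

```python
# continues the previous sketch (eta, x, net_h, W_nh, W_hm, delta_m, dW_hm as defined there)

# hidden delta: δ_h = f'(net_h) Σ_m w_hm δ_m
# (row 0 of W_hm belongs to the hidden bias unit and receives no backpropagated error)
delta_h = (1 - np.tanh(net_h) ** 2) * (W_hm[1:, :] @ delta_m)

# hidden-layer weight update: Δw_nh = -η δ_h x_n  (x[0] = 1 covers the bias weights)
dW_nh = -eta * np.outer(x, delta_h)

# apply both updates
W_hm += dW_hm
W_nh += dW_nh
```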


BP Conclusion

  • General weight update rule:
    [
    \Delta w_{ij} = \eta \, \delta_j \, out_i
    ]
    Each weight change = learning rate × neuron j's delta × neuron i's output. (Here δ is defined with (ŷ − y), so the minus sign of gradient descent is already absorbed into δ; this is equivalent to the (-\eta) form used in the derivation above.)

  • Output neuron:
    [
    \delta_m = (\hat{y}_m - y_m) \, f'(net_m)
    ]
    Error comes directly from target vs. output.

  • Hidden neuron:
    [
    \delta_h = \Big(\sum_{k=1}^{K} \delta_k \, w_{hk}\Big) f'(net_h)
    ]
    Error is backpropagated from the next layer.


In short:

BP adjusts each weight by how much that neuron contributed to the total error, propagating δ backward layer by layer.
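
Putting the pieces together, a compact end-to-end sketch that trains a one-hidden-layer MLP with exactly these update rules (NumPy, tanh activations, and the XOR toy data are illustrative choices, not prescribed by the notes):

```python
import numpy as np

rng = np.random.default_rng(1)

# toy data: XOR, a mapping a single perceptron cannot represent
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y_hat = np.array([[0], [1], [1], [0]], dtype=float)

N, H, M = 2, 4, 1
eta = 0.5
W_nh = rng.normal(scale=0.5, size=(N + 1, H))   # input -> hidden, row 0 = bias weights
W_hm = rng.normal(scale=0.5, size=(H + 1, M))   # hidden -> output, row 0 = bias weights

for epoch in range(5000):
    for x_raw, y_hat in zip(X, Y_hat):
        # forward pass
        x = np.concatenate(([1.0], x_raw))               # bias input x_0 = 1
        net_h = x @ W_nh
        out_h = np.concatenate(([1.0], np.tanh(net_h)))  # hidden outputs with bias
        net_m = out_h @ W_hm
        y = np.tanh(net_m)

        # backpropagation of the per-pattern error E_p = 1/2 Σ (ŷ - y)^2
        delta_m = (y - y_hat) * (1 - np.tanh(net_m) ** 2)
        delta_h = (1 - np.tanh(net_h) ** 2) * (W_hm[1:, :] @ delta_m)

        W_hm -= eta * np.outer(out_h, delta_m)
        W_nh -= eta * np.outer(x, delta_h)

# inspect the trained network on the training patterns
for x_raw in X:
    x = np.concatenate(([1.0], x_raw))
    out_h = np.concatenate(([1.0], np.tanh(x @ W_nh)))
    print(x_raw, np.tanh(out_h @ W_hm))
```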

