ML0034 Backpropagation

What is backpropagation?

Answer

Backpropagation (short for "backward propagation of errors") is the central algorithm by which multilayer neural networks learn. At its core, it efficiently computes how each weight and bias in the network contributes to the overall prediction error (loss). The optimizer then updates those parameters in the direction that reduces the error the most.
By combining the chain rule from calculus with gradient‑based optimization (e.g., gradient descent), backpropagation makes training deep architectures tractable and underpins virtually all modern advances in deep learning.

Steps of backpropagation:
(1) Forward Pass: Inputs are propagated through the network to compute outputs. Intermediate activations are stored for later use.
(2) Compute Loss: Use a loss function to compare the network’s output to the actual target values.
(3) Backward Pass (Error Propagation): The error is computed at the output layer. The chain rule is applied to recursively calculate the gradients of the loss with respect to each weight, starting from the output layer back to the input layer.
(4) Gradient Calculation: For every neuron, determine how much its weights contributed to the error by computing partial derivatives.
(5) Update Weights: Adjust the weights using an optimization algorithm (e.g., gradient descent), by subtracting a fraction (learning rate) of the computed gradients. This step is repeated iteratively to gradually minimize the loss.
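The five steps above can be sketched end to end on a tiny network. The architecture (2-2-1), the toy data, the learning rate, and the iteration count below are all illustrative assumptions, not part of the original text:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: one input vector and its target (illustrative values).
x = np.array([0.5, -0.2])
T = np.array([0.7])

# Randomly initialized parameters for a 2-2-1 network.
W1 = rng.normal(size=(2, 2)); b1 = np.zeros(2)
W2 = rng.normal(size=(1, 2)); b2 = np.zeros(1)

lr = 1.0  # learning rate (assumed)
for step in range(1000):
    # (1) Forward pass: store intermediate activations.
    z1 = W1 @ x + b1; a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

    # (2) Compute loss: L = 1/2 (T - a)^2.
    L = 0.5 * np.sum((T - a2) ** 2)

    # (3) Backward pass: dL/dz at the output, then chain backwards.
    dz2 = (a2 - T) * a2 * (1 - a2)        # dL/da * da/dz (sigmoid)
    dz1 = (W2.T @ dz2) * a1 * (1 - a1)    # hidden-layer error signal

    # (4) Gradient calculation: dL/dw = dL/dz * input to that weight.
    dW2 = np.outer(dz2, a1); db2 = dz2
    dW1 = np.outer(dz1, x);  db1 = dz1

    # (5) Update weights: subtract lr times the gradients.
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
```

After enough iterations the output a2 approaches the target T and the loss shrinks toward zero, which is exactly the "repeated iteratively to gradually minimize the loss" behaviour described in step (5).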

More details for step (3): Backward Pass (Error Propagation)
At the Output Layer:
Imagine a neuron with an output value a (its activation) and a weighted sum z computed as:
z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + b
Suppose we use the mean squared error (MSE) as our loss function:
L = \frac{1}{2} (T - a)^2
Where T is the target value.
The derivative of the loss with respect to the activation is:
\frac{dL}{da} = a - T
To update weights, we need to know how the loss changes with respect to z. Using the chain rule, we have:
\frac{dL}{dz} = \frac{dL}{da} \cdot \frac{da}{dz}
For example, if the activation function is sigmoid, then:
\frac{da}{dz} = a (1 - a)
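This output-layer derivation can be checked numerically. The concrete values of z and T below are illustrative; the sigmoid and MSE choices follow the text:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.4          # weighted sum at the output neuron (assumed value)
a = sigmoid(z)   # activation
T = 1.0          # target (assumed value)

dL_da = a - T            # derivative of L = 1/2 (T - a)^2 w.r.t. a
da_dz = a * (1 - a)      # sigmoid derivative
dL_dz = dL_da * da_dz    # chain rule: dL/dz = dL/da * da/dz
```

A finite-difference check of L(sigmoid(z)) around z agrees with dL_dz, confirming the chain-rule expression.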

For Hidden Layers:
Consider a hidden neuron j that feeds into the output neurons. Its contribution to the loss is influenced by all neurons it connects to in the subsequent layer. The backpropagated error for neuron j is given by:
\frac{dL}{dz_j} = \left( \sum_{k} \frac{dL}{dz_k} \cdot w_{jk} \right) \cdot f'(z_j)
Here, f'(z_j) is the derivative of the activation function at neuron j.
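The hidden-layer formula is a weighted sum of the next layer's error signals, scaled by the local activation derivative. The numbers and the sigmoid choice below are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

dL_dz_next = np.array([0.1, -0.3])  # errors dL/dz_k at the next layer (assumed)
w_jk = np.array([0.5, 0.8])         # weights from neuron j to each neuron k (assumed)
z_j = 0.2                           # pre-activation of hidden neuron j (assumed)

a_j = sigmoid(z_j)
f_prime = a_j * (1 - a_j)                        # f'(z_j) for sigmoid
dL_dz_j = np.dot(dL_dz_next, w_jk) * f_prime     # sum_k dL/dz_k * w_jk, times f'(z_j)
```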

More details for step (4): Gradient Calculation
For Each Weight:
Once you have the error signal \frac{dL}{dz} for a neuron, the gradient with respect to a weight w_i connected to input x_i is:
\frac{dL}{dw_i} = \frac{dL}{dz} \cdot x_i
This shows that the gradient is directly proportional to the input: the larger an input x_i, the more its weight w_i contributed to the final error.

For the Bias:
Since the bias b contributes to z with a derivative of 1, the gradient for the bias is simply:
\frac{dL}{db} = \frac{dL}{dz}
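Both gradient rules from step (4) fit in a few lines. The inputs and the error signal below are illustrative values:

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])  # inputs to the neuron (assumed values)
dL_dz = 0.3                     # error signal at the neuron (assumed value)

dL_dw = dL_dz * x  # dL/dw_i = dL/dz * x_i, one gradient per weight
dL_db = dL_dz      # bias enters z with derivative 1, so dL/db = dL/dz
```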

