# Backpropagation Algorithm Pseudocode
Backpropagation (backward propagation of errors) is the fundamental algorithm that enables neural networks to learn. It efficiently computes gradients of the loss function with respect to every weight in the network using the chain rule of calculus.
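For a single sigmoid neuron with squared-error loss, the chain-rule product that backpropagation evaluates can be checked against a finite difference (a toy sketch; the numbers are arbitrary):

```python
import math

# One neuron: z = w*x + b, a = sigmoid(z), L = 0.5 * (a - y)^2
w, b, x, y = 0.5, 0.1, 2.0, 1.0

z = w * x + b
a = 1.0 / (1.0 + math.exp(-z))

# Chain rule: dL/dw = dL/da * da/dz * dz/dw
dL_da = a - y
da_dz = a * (1.0 - a)   # sigmoid derivative
dz_dw = x
grad_analytic = dL_da * da_dz * dz_dw

# Central finite difference of the same derivative, for comparison
def loss(w_):
    a_ = 1.0 / (1.0 + math.exp(-(w_ * x + b)))
    return 0.5 * (a_ - y) ** 2

eps = 1e-6
grad_numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)
print(grad_analytic, grad_numeric)  # the two agree to several decimal places
```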
## The Algorithm
### Neural Network Forward Pass

```
ALGORITHM: Forward Propagation
═════════════════════════════════════════════════════════════════
INPUT:
    x: Input vector of shape (n_input,)
    W: List of weight matrices [W¹, W², ..., Wᴸ]
    b: List of bias vectors [b¹, b², ..., bᴸ]
    activations: List of activation functions [σ¹, σ², ..., σᴸ]
OUTPUT:
    a: List of activations for each layer [a⁰, a¹, ..., aᴸ]
    z: List of pre-activations for each layer [z¹, z², ..., zᴸ]
PROCEDURE:
    a[0] ← x                // Input is the first "activation"
    L ← number of layers
    FOR l = 1 TO L:
        // Linear transformation
        z[l] ← W[l] · a[l-1] + b[l]
        // Apply activation function
        a[l] ← σ[l](z[l])
    RETURN a, z
```

### Backpropagation Algorithm
```
ALGORITHM: Backpropagation
═════════════════════════════════════════════════════════════════
INPUT:
    x: Input sample
    y: Target output
    W, b: Network weights and biases
    activations: Activation functions for each layer
    loss_function: Loss function (e.g., MSE, cross-entropy)
OUTPUT:
    ∂L/∂W: Gradients for all weight matrices
    ∂L/∂b: Gradients for all bias vectors
PROCEDURE:
    L ← number of layers

    // ========== FORWARD PASS ==========
    a, z ← forward_propagation(x, W, b, activations)

    // ========== BACKWARD PASS ==========
    // Step 1: Compute output layer error (δᴸ)
    // δᴸ = ∂L/∂aᴸ ⊙ σ'(zᴸ)
    ∂L/∂a[L] ← derivative_of_loss(a[L], y)   // e.g., (a[L] - y) for MSE
    δ[L] ← ∂L/∂a[L] ⊙ σ'[L](z[L])            // ⊙ is element-wise multiply

    // Step 2: Backpropagate error through hidden layers
    FOR l = L-1 DOWN TO 1:
        // Propagate error backward: δˡ = (Wˡ⁺¹)ᵀ δˡ⁺¹ ⊙ σ'(zˡ)
        δ[l] ← (W[l+1]ᵀ · δ[l+1]) ⊙ σ'[l](z[l])

    // Step 3: Compute gradients for weights and biases
    FOR l = 1 TO L:
        ∂L/∂W[l] ← δ[l] ⊗ a[l-1]ᵀ   // Outer product
        ∂L/∂b[l] ← δ[l]

    RETURN ∂L/∂W, ∂L/∂b
```

## Full Training Loop with Backpropagation
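Before the full loop, the forward and backward procedures above translate almost line-for-line into NumPy. A minimal sketch for a two-layer network with sigmoid activations and MSE loss (`forward` and `backprop` are illustrative names, not a library API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, b):
    """Forward pass; returns per-layer activations a and pre-activations z."""
    a, z = [x], [None]
    for l in range(1, len(W)):
        z.append(W[l] @ a[l - 1] + b[l])
        a.append(sigmoid(z[l]))
    return a, z

def backprop(x, y, W, b):
    """Single-sample backprop for sigmoid layers with MSE loss."""
    L = len(W) - 1
    a, z = forward(x, W, b)
    dW = [None] * (L + 1)
    db = [None] * (L + 1)
    # Output-layer error: delta^L = (a^L - y) ⊙ sigma'(z^L),
    # with sigma'(z) = a (1 - a) for sigmoid
    delta = (a[L] - y) * a[L] * (1 - a[L])
    for l in range(L, 0, -1):
        dW[l] = np.outer(delta, a[l - 1])
        db[l] = delta
        if l > 1:  # propagate the error one layer back
            delta = (W[l].T @ delta) * a[l - 1] * (1 - a[l - 1])
    return dW, db

# Toy usage: a 2 -> 3 -> 1 network (index 0 is an unused placeholder
# so that W[l] matches layer l, as in the pseudocode)
rng = np.random.default_rng(0)
W = [None, rng.normal(0, 1, (3, 2)), rng.normal(0, 1, (1, 3))]
b = [None, np.zeros(3), np.zeros(1)]
x, y = np.array([0.5, -0.2]), np.array([1.0])
dW, db = backprop(x, y, W, b)
print(dW[2].shape, dW[1].shape)  # (1, 3) (3, 2)
```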
```
ALGORITHM: Neural Network Training with Backpropagation
═════════════════════════════════════════════════════════════════
INPUT:
    X: Training data of shape (n_samples, n_input)
    Y: Target values of shape (n_samples, n_output)
    architecture: List of layer sizes [n_input, n_hidden₁, ..., n_output]
    learning_rate: η
    n_epochs: Number of training epochs
    batch_size: Mini-batch size
OUTPUT:
    W, b: Trained weights and biases
PROCEDURE:
    // 1. Initialize weights (He initialization shown; Xavier for sigmoid/tanh)
    L ← length(architecture) - 1
    FOR l = 1 TO L:
        W[l] ← random_normal(mean=0, std=√(2/n_in), shape=(n_out, n_in))
        b[l] ← zeros(n_out)

    // 2. Training loop
    FOR epoch = 1 TO n_epochs:
        shuffle(X, Y)   // Shuffle data, keeping X-Y pairs aligned

        // Mini-batch training
        FOR each batch (X_batch, Y_batch):
            // Initialize gradient accumulators
            ∂L/∂W_total ← [zeros_like(W[l]) for each layer]
            ∂L/∂b_total ← [zeros_like(b[l]) for each layer]

            // Accumulate gradients over the batch
            FOR each (x, y) in (X_batch, Y_batch):
                // Backpropagation for a single sample
                ∂L/∂W, ∂L/∂b ← backpropagation(x, y, W, b)
                FOR l = 1 TO L:
                    ∂L/∂W_total[l] ← ∂L/∂W_total[l] + ∂L/∂W[l]
                    ∂L/∂b_total[l] ← ∂L/∂b_total[l] + ∂L/∂b[l]

            // Average gradients
            FOR l = 1 TO L:
                ∂L/∂W_total[l] ← ∂L/∂W_total[l] / batch_size
                ∂L/∂b_total[l] ← ∂L/∂b_total[l] / batch_size

            // Update weights (gradient descent)
            FOR l = 1 TO L:
                W[l] ← W[l] - η × ∂L/∂W_total[l]
                b[l] ← b[l] - η × ∂L/∂b_total[l]

    // 3. Return trained parameters
    RETURN W, b
```

## Common Activation Functions and Their Derivatives
```
FUNCTION: Activation Functions and Derivatives
═════════════════════════════════════════════════════════════════
SIGMOID:
    σ(z)  = 1 / (1 + exp(-z))
    σ'(z) = σ(z) × (1 - σ(z))

TANH:
    σ(z)  = (exp(z) - exp(-z)) / (exp(z) + exp(-z))
    σ'(z) = 1 - σ(z)²

ReLU (Rectified Linear Unit):
    σ(z)  = max(0, z)
    σ'(z) = { 1 if z > 0
            { 0 if z ≤ 0

LEAKY ReLU:
    σ(z)  = { z     if z > 0
            { α × z if z ≤ 0     (α typically 0.01)
    σ'(z) = { 1 if z > 0
            { α if z ≤ 0

SOFTMAX (for output layer with multiple classes):
    σ(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)
    // Jacobian is more complex; often combined with cross-entropy loss
```

## Loss Functions
```
FUNCTION: Common Loss Functions and Their Gradients
═════════════════════════════════════════════════════════════════
MEAN SQUARED ERROR (Regression):
    L(ŷ, y) = (1/2) × ||ŷ - y||²
    ∂L/∂ŷ = ŷ - y

BINARY CROSS-ENTROPY (Binary Classification):
    L(ŷ, y) = -[y log(ŷ) + (1-y) log(1-ŷ)]
    ∂L/∂ŷ = -y/ŷ + (1-y)/(1-ŷ)
    // With sigmoid output, simplifies to:
    ∂L/∂z = ŷ - y

CATEGORICAL CROSS-ENTROPY (Multiclass):
    L(ŷ, y) = -Σᵢ yᵢ log(ŷᵢ)
    // With softmax output:
    ∂L/∂z = ŷ - y     (where y is one-hot encoded)
```

## Mathematical Foundation
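The softmax-plus-cross-entropy shortcut above (∂L/∂z = ŷ − y) can be verified numerically with a finite-difference check (a small NumPy sketch):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by max for numerical stability
    return e / e.sum()

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, -1.0, 0.5])
y = np.array([0.0, 1.0, 0.0])   # one-hot target

grad_shortcut = softmax(z) - y  # the claimed gradient dL/dz

# Finite-difference gradient for comparison
eps = 1e-6
grad_numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    grad_numeric[i] = (cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * eps)

print(np.max(np.abs(grad_shortcut - grad_numeric)))  # tiny difference
```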
### The Chain Rule

Backpropagation is an efficient application of the chain rule:

$$\frac{\partial L}{\partial w_{ij}^{(l)}} = \frac{\partial L}{\partial a_j^{(l)}} \cdot \frac{\partial a_j^{(l)}}{\partial z_j^{(l)}} \cdot \frac{\partial z_j^{(l)}}{\partial w_{ij}^{(l)}}$$

### The Error Signal (δ)
We define the error signal at layer $l$:

$$\delta_j^{(l)} = \frac{\partial L}{\partial z_j^{(l)}}$$

### Backpropagation Equations
$$\delta^{(L)} = \nabla_a L \odot \sigma'(z^{(L)})$$

$$\delta^{(l)} = ((W^{(l+1)})^T \delta^{(l+1)}) \odot \sigma'(z^{(l)})$$

$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^T$$

$$\frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}$$

## Complexity Analysis
| Operation | Time Complexity | Space Complexity |
|---|---|---|
| Forward pass | $O(\sum_l n_l \times n_{l-1})$ | $O(\sum_l n_l)$ |
| Backward pass | $O(\sum_l n_l \times n_{l-1})$ | $O(\sum_l n_l)$ |
| Full epoch | $O(N \times \sum_l n_l \times n_{l-1})$ | $O(\sum_l n_l \times n_{l-1})$ |
Where $N$ = samples, $n_l$ = neurons in layer $l$.
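The dominant term is the weight count $\sum_l n_l \times n_{l-1}$, which is easy to tally for a concrete architecture (the layer sizes here are illustrative):

```python
# Cost per forward or backward pass is proportional to the number of
# weights: sum over layers of n_l * n_{l-1}, plus n_l biases per layer.
architecture = [784, 128, 10]  # e.g., an MNIST-sized MLP

weights = sum(n_out * n_in for n_in, n_out in zip(architecture, architecture[1:]))
biases = sum(architecture[1:])
print(weights, biases)  # 101632 138
```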
## Common Issues and Solutions
VANISHING GRADIENTS:
Problem: Gradients become tiny in deep networks (especially with sigmoid/tanh)
Solutions:
- Use ReLU activation
- Use batch normalization
- Use residual connections (skip connections)
- Proper weight initialization
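The sigmoid case is easy to quantify: σ'(z) never exceeds 0.25, so each sigmoid layer shrinks the backpropagated error by at least a factor of 4, ignoring the weight matrices. A rough illustration:

```python
# sigma'(z) = sigma(z) * (1 - sigma(z)) peaks at 0.25 (at z = 0), so an
# error signal passing through `depth` sigmoid layers is scaled by at
# most 0.25**depth before it reaches the early layers.
for depth in (2, 5, 10, 20):
    print(f"{depth} layers: gradient scale <= {0.25 ** depth:.2e}")
```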
EXPLODING GRADIENTS:
Problem: Gradients become very large, causing unstable training
Solutions:
- Gradient clipping: g ← min(||g||, threshold) × (g/||g||)
- Lower learning rate
- Proper weight initialization
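The clipping rule above can be sketched in a few lines of NumPy (a minimal by-global-norm version; `clip_by_norm` is an illustrative name, not a library function):

```python
import numpy as np

def clip_by_norm(g, threshold):
    """Rescale gradient g so its L2 norm is at most threshold."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

g = np.array([3.0, 4.0])        # ||g|| = 5
clipped = clip_by_norm(g, 1.0)  # direction preserved, norm rescaled to 1
print(clipped)
```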
DEAD NEURONS (ReLU):
Problem: ReLU outputs 0 for all inputs → gradient = 0 → neuron never updates
Solutions:
- Use Leaky ReLU or ELU
- Careful weight initialization
- Lower learning rate

## Key Insights
- **Efficiency**: Backprop computes all gradients in $O(W)$ time, where $W$ is the number of weights, the same asymptotic cost as a forward pass.
- **Dynamic Programming**: Backprop is essentially dynamic programming, reusing intermediate computations.
- **Automatic Differentiation**: Modern frameworks (PyTorch, TensorFlow) implement backprop automatically.
- **Credit Assignment**: Backprop solves the credit assignment problem: determining how much each weight contributed to the error.
- **Locality**: Each weight update depends only on local information (connected activations and the incoming error signal).
## References
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533-536.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chapter 6. MIT Press. https://www.deeplearningbook.org/
- Nielsen, M. (2015). Neural Networks and Deep Learning, Chapter 2. http://neuralnetworksanddeeplearning.com/
- LeCun, Y., Bottou, L., Orr, G. B., & MΓΌller, K.-R. (1998). Efficient BackProp. In Neural Networks: Tricks of the Trade.
- Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. AISTATS.