Interactive Tutorial: Backpropagation Step by Step
Introduction
Backpropagation is the fundamental algorithm that allows neural networks to learn. It's how errors at the output propagate backward through the network, telling each weight exactly how much it contributed to the mistake and how to adjust.
Activity
Backpropagation Tutorial
How to Explore It
- You'll see a simple network with random initial weights.
- An input-output training example is presented.
- You compute the forward pass step by step.
- You calculate the error at the output.
- You propagate gradients backward using the chain rule.
- You update the weights using gradient descent.
- See how the network improves with each pass!
Interactive Demonstration
Backpropagation Step-by-Step Trainer
Step 1
Quick guide: each correct expression and its result are logged as you work through the pass.
- $z_j^{(k)} = \sum_i a_i^{(k-1)} w_{ij}^{(k)}$
- $a_j^{(k)} = \sigma(z_j^{(k)})$
- $\sigma(z) = \frac{1}{1 + e^{-z}}$
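The three formulas above can be sketched in Python. This is a minimal illustration, not the trainer's actual implementation; the weight layout `weights[i][j]` (from unit $i$ in the previous layer to unit $j$ in this one) is an assumption.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def forward_layer(activations_prev, weights):
    """Compute one layer's pre-activations z_j and activations a_j.

    weights[i][j] is the weight from unit i in the previous layer
    to unit j in this layer (hypothetical layout, for illustration).
    """
    n_out = len(weights[0])
    # z_j = sum_i a_i * w_ij
    z = [sum(a_i * weights[i][j] for i, a_i in enumerate(activations_prev))
         for j in range(n_out)]
    # a_j = sigma(z_j)
    a = [sigmoid(z_j) for z_j in z]
    return z, a
```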
Core Concepts
The Two Phases of Backpropagation
Training a neural network alternates between a forward pass and a backward pass, followed by a weight update:
- Forward Pass: Input flows through the network, layer by layer, producing an output
- Backward Pass: The error at the output flows backward, computing gradients for each weight
- Weight Update: Each weight is adjusted in the direction that reduces the error
This process repeats for many training examples until the network learns the desired behavior.
The Chain Rule is the Key
The magic of backpropagation is the chain rule from calculus:
$$\frac{\partial E}{\partial w} = \frac{\partial E}{\partial o} \cdot \frac{\partial o}{\partial net} \cdot \frac{\partial net}{\partial w}$$

This answers "How does the error change when we change this weight?" by breaking the question into simpler steps.
- $\frac{\partial E}{\partial o}$: How error changes with output
- $\frac{\partial o}{\partial net}$: How output changes with weighted sum (activation derivative)
- $\frac{\partial net}{\partial w}$: How weighted sum changes with weight (just the input!)
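The three chain-rule factors can be multiplied out directly. This sketch assumes a squared error $E = \frac{1}{2}(target - o)^2$ and a sigmoid output unit, so $\frac{\partial E}{\partial o} = o - target$; the function name is hypothetical.

```python
def output_weight_gradient(a_in, out, target):
    """Gradient dE/dw for one output weight via the chain rule,
    assuming squared error E = 1/2 * (target - out)^2 and a sigmoid unit.
    """
    dE_do = out - target          # how error changes with output
    do_dnet = out * (1.0 - out)   # sigmoid derivative, from the output
    dnet_dw = a_in                # the weighted sum's derivative is just the input
    return dE_do * do_dnet * dnet_dw
```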
The Activation Function
In this tutorial, we use the sigmoid activation function:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

Its derivative has a beautiful property:

$$\sigma'(x) = \sigma(x) \cdot (1 - \sigma(x))$$

This means if you know the output, you can easily compute the derivative!
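This identity is easy to verify numerically: compute the derivative from the output alone and compare it against a central finite difference. The helper names here are illustrative, and x = 0.3 is an arbitrary test point.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime_from_output(s):
    """Derivative from the output alone: sigma'(x) = s * (1 - s)."""
    return s * (1.0 - s)

# Numerical check of the identity at an arbitrary point x = 0.3:
x = 0.3
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # central difference
analytic = sigmoid_prime_from_output(sigmoid(x))         # from the output
```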
Learning Rate
The learning rate ($\eta$) controls how big each weight update is:
$$w_{new} = w_{old} - \eta \cdot \frac{\partial E}{\partial w}$$

- Too large: The network may overshoot and never converge
- Too small: Learning is very slow
- Just right: Smooth convergence to good solutions
In this tutorial, we use $\eta = 0.5$ for clear, visible updates.
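Putting the pieces together, the update rule can be sketched on a hypothetical one-weight sigmoid unit (squared error, $\eta = 0.5$ as in the tutorial); this is a toy illustration, not the tutorial's actual network.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_one_weight(x, target, w, eta=0.5, epochs=20):
    """Repeated forward/backward/update passes for a single sigmoid
    unit with squared error E = 1/2 * (target - out)^2.
    (Hypothetical one-weight network, for illustration only.)"""
    for _ in range(epochs):
        out = sigmoid(x * w)                           # forward pass
        grad = (out - target) * out * (1.0 - out) * x  # backward pass (chain rule)
        w -= eta * grad                                # w_new = w_old - eta * dE/dw
    return w

# With eta = 0.5, the error shrinks steadily over the epochs:
w0 = 0.1
w_final = train_one_weight(x=1.0, target=0.9, w=w0)
err_before = 0.5 * (0.9 - sigmoid(1.0 * w0)) ** 2
err_after = 0.5 * (0.9 - sigmoid(1.0 * w_final)) ** 2
```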