Gradient Descent Pseudocode
Gradient descent is the workhorse optimization algorithm in machine learning. It iteratively adjusts parameters to minimize a cost function by following the direction of steepest descent.
The Core Algorithm
Batch Gradient Descent (One Epoch)
ALGORITHM: Batch Gradient Descent (One Epoch)
─────────────────────────────────────────────────────────────────
INPUT:
    θ: Current parameter vector
    X: Full training dataset (n_samples, n_features)
    y: Target values (n_samples,)
    α: Learning rate
    loss_function: Function to compute loss
    gradient_function: Function to compute gradient
OUTPUT:
    θ: Updated parameter vector
PROCEDURE:
1. Compute predictions for all samples:
       ŷ ← predict(X, θ)
2. Compute loss (optional, for monitoring):
       L ← loss_function(y, ŷ)
3. Compute gradient over entire dataset:
       ∇L ← gradient_function(X, y, ŷ, θ)
4. Update parameters:
       θ ← θ - α · ∇L
5. RETURN θ

Full Training Loop
ALGORITHM: Gradient Descent Training
─────────────────────────────────────────────────────────────────
INPUT:
    X: Training data of shape (n_samples, n_features)
    y: Target values of shape (n_samples,)
    α: Learning rate (e.g., 0.01)
    max_epochs: Maximum number of epochs
    tolerance: Convergence threshold (e.g., 1e-6)
    loss_function: Function to compute loss
    gradient_function: Function to compute gradient
OUTPUT:
    θ: Optimized parameter vector
    history: List of loss values per epoch
PROCEDURE:
1. Initialize parameters θ randomly or to zeros
2. history ← empty list
3. FOR epoch = 1 TO max_epochs:
       a. // Forward pass: compute predictions
          ŷ ← predict(X, θ)
       b. // Compute and record loss
          L ← loss_function(y, ŷ)
          APPEND L to history
       c. // Compute gradient
          ∇L ← gradient_function(X, y, ŷ, θ)
       d. // Update parameters (gradient step)
          θ ← θ - α · ∇L
       e. // Check for convergence
          IF ||∇L|| < tolerance:
              PRINT "Converged at epoch", epoch
              BREAK
4. RETURN θ, history

Variants of Gradient Descent
We use Batch GD when we want a stable descent direction at each step, computed from the full dataset. It is the clearest baseline variant for understanding core gradient descent behavior.
Characteristics:
- Uses entire dataset for each update
- Single update per epoch
- Smooth and stable convergence
- Higher memory consumption
- Best for small datasets
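As a concrete sketch, the one-epoch batch update can be instantiated for linear regression with mean squared error; the linear model and MSE loss are illustrative assumptions here, since the pseudocode above is loss-agnostic:

```python
import numpy as np

def predict(X, theta):
    """Linear model: predictions are X @ theta."""
    return X @ theta

def batch_gd_epoch(theta, X, y, alpha=0.01):
    """One epoch of batch gradient descent on MSE loss.

    For L = (1/n) * ||X @ theta - y||^2, the gradient with respect
    to theta is (2/n) * X.T @ (X @ theta - y).
    """
    y_hat = predict(X, theta)                 # forward pass on the full dataset
    grad = (2 / len(y)) * X.T @ (y_hat - y)   # gradient over all samples
    return theta - alpha * grad               # single parameter update per epoch

# Tiny dataset where y = 2 * x, so the optimal theta is [2.0]
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = batch_gd_epoch(np.zeros(1), X, y, alpha=0.05)
```

One call performs exactly one update, so driving theta toward the optimum requires repeating the epoch, as the full training loop does.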
The full training loop above shows how gradient descent is used in real training: multiple epochs, loss tracking, and a stopping condition. It bridges the mathematical update rule and a practical implementation.
Characteristics:
- Complete training loop with multiple epochs
- Monitors loss at each epoch
- Includes convergence criterion
- Returns loss history for analysis
- Foundation for most modern gradient-based optimizers
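One way to realize the full loop in NumPy, again assuming a linear model with MSE loss (both illustrative choices, not mandated by the pseudocode):

```python
import numpy as np

def train_gd(X, y, alpha=0.05, max_epochs=1000, tolerance=1e-6):
    """Full-batch gradient descent on MSE loss for a linear model."""
    theta = np.zeros(X.shape[1])               # step 1: initialize parameters
    history = []                               # step 2: loss per epoch
    for epoch in range(1, max_epochs + 1):     # step 3: epoch loop
        y_hat = X @ theta                      # a. forward pass
        loss = np.mean((y_hat - y) ** 2)       # b. compute and record loss
        history.append(loss)
        grad = (2 / len(y)) * X.T @ (y_hat - y)  # c. gradient over dataset
        theta = theta - alpha * grad           # d. gradient step
        if np.linalg.norm(grad) < tolerance:   # e. convergence check
            break
    return theta, history

# y = 2 * x, so training should recover theta close to [2.0]
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
theta, history = train_gd(X, y)
```

On this toy problem the loop stops well before max_epochs once the gradient norm falls below the tolerance.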
We use SGD when datasets are large or we need very frequent, cheap updates. It introduces useful noise during training, which can help escape local minima and speed up early learning.
ALGORITHM: Stochastic Gradient Descent (One Epoch)
─────────────────────────────────────────────────────────────────
INPUT:
    θ: Current parameter vector
    X: Training data of shape (n_samples, n_features)
    y: Target values of shape (n_samples,)
    α: Learning rate
OUTPUT:
    θ: Updated parameter vector
PROCEDURE:
1. Shuffle the dataset indices
2. FOR each sample i in shuffled order:
       a. Get single sample: xᵢ, yᵢ
       b. Compute prediction: ŷᵢ ← predict(xᵢ, θ)
       c. Compute gradient for single sample:
          ∇Lᵢ ← gradient(xᵢ, yᵢ, ŷᵢ, θ)
       d. Update parameters:
          θ ← θ - α · ∇Lᵢ
3. RETURN θ

Characteristics:
- Update for each individual sample
- Very noisy but memory efficient
- Can escape local minima
- Excellent for very large datasets
- Less stable than Batch GD
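A minimal NumPy sketch of one SGD epoch, assuming the same illustrative linear-regression/MSE setup; note that the parameters are updated once per sample, inside the loop:

```python
import numpy as np

def sgd_epoch(theta, X, y, alpha=0.01, rng=None):
    """One epoch of SGD on MSE loss: one update per shuffled sample."""
    if rng is None:
        rng = np.random.default_rng(0)     # seeded for reproducibility
    order = rng.permutation(len(y))        # step 1: shuffle indices
    for i in order:                        # step 2: iterate samples
        x_i, y_i = X[i], y[i]              # a. single sample
        y_hat_i = x_i @ theta              # b. prediction for one sample
        grad_i = 2 * (y_hat_i - y_i) * x_i # c. per-sample MSE gradient
        theta = theta - alpha * grad_i     # d. immediate update
    return theta

# y = 2 * x; each noisy per-sample step still pulls theta toward 2
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = sgd_epoch(np.zeros(1), X, y, alpha=0.05)
```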
We use Mini-Batch GD to balance stability and efficiency: each update sees more than one sample, but not the whole dataset. That is why it is the standard choice in modern deep learning training.
ALGORITHM: Mini-Batch Gradient Descent (One Epoch)
─────────────────────────────────────────────────────────────────
INPUT:
    θ: Current parameter vector
    X: Training data of shape (n_samples, n_features)
    y: Target values of shape (n_samples,)
    α: Learning rate
    batch_size: Number of samples per batch (e.g., 32)
OUTPUT:
    θ: Updated parameter vector
PROCEDURE:
1. n ← number of samples
2. Shuffle the dataset
3. n_batches ← ceil(n / batch_size)
4. FOR batch_idx = 0 TO n_batches - 1:
       a. // Extract mini-batch
          start ← batch_idx × batch_size
          end ← min(start + batch_size, n)
          X_batch ← X[start:end]
          y_batch ← y[start:end]
       b. // Forward pass
          ŷ_batch ← predict(X_batch, θ)
       c. // Compute gradient on mini-batch
          ∇L ← gradient(X_batch, y_batch, ŷ_batch, θ)
       d. // Update parameters
          θ ← θ - α · ∇L
5. RETURN θ

Characteristics:
- Compromise between Batch GD and SGD
- One update per mini-batch (typically 32-256 samples)
- Good balance of speed and stability
- Less noisy than SGD, more memory-efficient than Batch GD
- Most widely used method in modern practice
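A sketch of one mini-batch epoch under the same illustrative linear/MSE assumptions, following the shuffle-then-slice steps above:

```python
import numpy as np

def minibatch_gd_epoch(theta, X, y, alpha=0.05, batch_size=2, rng=None):
    """One epoch of mini-batch gradient descent on MSE loss."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(y)                                 # step 1: sample count
    order = rng.permutation(n)                 # step 2: shuffle once per epoch
    X, y = X[order], y[order]
    n_batches = int(np.ceil(n / batch_size))   # step 3
    for b in range(n_batches):                 # step 4: one update per batch
        start = b * batch_size
        end = min(start + batch_size, n)       # last batch may be smaller
        Xb, yb = X[start:end], y[start:end]    # a. extract mini-batch
        y_hat = Xb @ theta                     # b. forward pass
        grad = (2 / len(yb)) * Xb.T @ (y_hat - yb)  # c. batch-averaged gradient
        theta = theta - alpha * grad           # d. update
    return theta

# y = 2 * x; with batch_size=2 the epoch performs two updates
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])
theta = minibatch_gd_epoch(np.zeros(1), X, y)
```

Averaging the gradient over the batch (rather than summing) keeps the effective step size roughly independent of batch_size, which is the common convention in practice.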
Mathematical Foundation
For a differentiable loss function $L(\theta)$, gradient descent updates parameters as:
$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)$$

Where:
- $\theta_t$ is the parameter vector at iteration $t$
- $\alpha$ is the learning rate (step size)
- $\nabla_\theta L$ is the gradient of the loss with respect to parameters
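The update rule can be checked on a one-dimensional toy loss $L(\theta) = \theta^2$ (a hypothetical example chosen for illustration): the gradient is $2\theta$, so each update multiplies $\theta$ by $1 - 2\alpha$ and drives it toward the minimum at 0.

```python
# Minimize the toy loss L(theta) = theta^2; its gradient is 2 * theta.
# Each update theta <- theta - alpha * (2 * theta) scales theta by (1 - 2 * alpha).
alpha = 0.1
theta = 5.0
for t in range(50):
    grad = 2 * theta              # analytic gradient of theta^2
    theta = theta - alpha * grad  # the update rule theta_{t+1} = theta_t - alpha * grad
```

With alpha = 0.1 the scale factor is 0.8 per step, so after 50 steps theta has shrunk to a tiny fraction of its starting value.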
Convergence Conditions
For convex functions with $L$-Lipschitz-continuous gradients, gradient descent converges when:
- The learning rate satisfies $\alpha < \frac{2}{L}$, where $L$ is the Lipschitz constant of the gradient
- With a fixed step $\alpha \leq \frac{1}{L}$, the suboptimality gap obeys $L(\theta_t) - L(\theta^*) \leq \frac{\|\theta_0 - \theta^*\|^2}{2\alpha t}$
Comparison of Variants
| Variant | Update Frequency | Convergence | Memory Usage | Noise Level |
|---|---|---|---|---|
| Batch GD | Once per epoch | Smooth, stable | High | Low |
| SGD | Once per sample | Noisy, can escape local minima | Low | High |
| Mini-Batch | Once per batch | Good balance | Medium | Medium |
Key Hyperparameters
- Learning Rate (α): Too high → divergence; too low → slow convergence
- Batch Size: Larger → more stable gradients; smaller → more noise, often better generalization
- Number of Epochs: Enough to converge, but not so many that the model overfits
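The learning-rate trade-off is easy to demonstrate on the toy loss $L(\theta) = \theta^2$ (an illustrative choice): its gradient $2\theta$ has Lipschitz constant 2, so the stability threshold from the convergence conditions above is $\alpha < 2/L = 1$.

```python
def run_gd(alpha, steps=20, theta0=1.0):
    """Iterate theta <- theta - alpha * (2 * theta) on L(theta) = theta^2."""
    theta = theta0
    for _ in range(steps):
        theta -= alpha * 2 * theta  # each step multiplies theta by (1 - 2 * alpha)
    return theta

# The gradient's Lipschitz constant is 2, so stability requires alpha < 1.
good = run_gd(alpha=0.4)   # fast convergence: scale factor 0.2 per step
slow = run_gd(alpha=0.01)  # slow convergence: scale factor 0.98 per step
bad = run_gd(alpha=1.1)    # divergence: scale factor -1.2 grows in magnitude
```

After 20 steps the well-tuned run is essentially at the optimum, the small-step run has barely moved, and the oversized step has blown up.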
References
- Ruder, S. (2016). An overview of gradient descent optimization algorithms. arXiv:1609.04747. https://arxiv.org/abs/1609.04747
- Bottou, L. (2010). Large-Scale Machine Learning with Stochastic Gradient Descent. COMPSTAT 2010.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning, Chapter 8. MIT Press. https://www.deeplearningbook.org/
- Stanford CS231n: Optimization notes. https://cs231n.github.io/optimization-1/