πŸ“ Gradient Descent Pseudocode

Gradient descent is the workhorse optimization algorithm in machine learning. It iteratively adjusts parameters to minimize a cost function by stepping in the direction of steepest descent, i.e. along the negative gradient.

The Core Algorithm

Batch Gradient Descent (One Epoch)

ALGORITHM: Batch Gradient Descent (One Epoch)
─────────────────────────────────────────────────────────────────

INPUT:
    ΞΈ: Current parameter vector
    X: Full training dataset (n_samples, n_features)
    y: Target values (n_samples,)
    Ξ±: Learning rate
    loss_function: Function to compute loss
    gradient_function: Function to compute gradient

OUTPUT:
    ΞΈ: Updated parameter vector

PROCEDURE:
    1. Compute predictions for all samples:
       Ε· ← predict(X, ΞΈ)
    
    2. Compute loss (optional, for monitoring):
       L ← loss_function(y, Ε·)
    
    3. Compute gradient over entire dataset:
       βˆ‡L ← gradient_function(X, y, Ε·, ΞΈ)
    
    4. Update parameters:
       ΞΈ ← ΞΈ - Ξ± Β· βˆ‡L
    
    5. RETURN ΞΈ
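As a runnable sketch, the one-epoch procedure might look like this in Python. The linear model and the 2/n MSE gradient are illustrative assumptions; the pseudocode above is model-agnostic.

```python
import numpy as np

def batch_gd_epoch(theta, X, y, alpha):
    """One epoch of batch gradient descent (linear model, MSE loss)."""
    n = X.shape[0]
    y_hat = X @ theta                        # 1. predictions for all samples
    loss = np.mean((y_hat - y) ** 2)         # 2. loss, recorded for monitoring
    grad = (2.0 / n) * (X.T @ (y_hat - y))   # 3. MSE gradient over the full dataset
    theta = theta - alpha * grad             # 4. gradient step
    return theta, loss                       # 5. updated parameters (plus loss)
```

Note that the entire dataset enters each matrix product, which is exactly what makes the step stable but memory-hungry.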

Full Training Loop

ALGORITHM: Gradient Descent Training
─────────────────────────────────────────────────────────────────

INPUT:
    X: Training data of shape (n_samples, n_features)
    y: Target values of shape (n_samples,)
    Ξ±: Learning rate (e.g., 0.01)
    max_epochs: Maximum number of epochs
    tolerance: Convergence threshold (e.g., 1e-6)
    loss_function: Function to compute loss
    gradient_function: Function to compute gradient

OUTPUT:
    ΞΈ: Optimized parameter vector
    history: List of loss values per epoch

PROCEDURE:
    1. Initialize parameters ΞΈ randomly or to zeros
    
    2. history ← empty list
    
    3. FOR epoch = 1 TO max_epochs:
        
        a. // Forward pass: compute predictions
           Ε· ← predict(X, ΞΈ)
        
        b. // Compute and record loss
           L ← loss_function(y, Ε·)
           APPEND L to history
        
        c. // Compute gradient
           βˆ‡L ← gradient_function(X, y, Ε·, ΞΈ)
        
        d. // Update parameters (gradient step)
           ΞΈ ← ΞΈ - Ξ± Β· βˆ‡L
        
        e. // Check for convergence
           IF ||βˆ‡L|| < tolerance:
               PRINT "Converged at epoch", epoch
               BREAK
    
    4. RETURN ΞΈ, history
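The loop translates almost line for line into Python. This sketch again assumes a linear model with MSE loss so that it runs end to end; `predict` and the gradient are inlined.

```python
import numpy as np

def train_gd(X, y, alpha=0.01, max_epochs=1000, tolerance=1e-6):
    """Full batch-gradient-descent training loop (linear model, MSE loss)."""
    theta = np.zeros(X.shape[1])                  # 1. initialize parameters to zeros
    history = []                                  # 2. loss recorded per epoch
    for epoch in range(1, max_epochs + 1):        # 3. training loop
        y_hat = X @ theta                         # a. forward pass
        loss = np.mean((y_hat - y) ** 2)          # b. compute and record loss
        history.append(loss)
        grad = (2.0 / len(y)) * (X.T @ (y_hat - y))  # c. gradient
        theta = theta - alpha * grad              # d. gradient step
        if np.linalg.norm(grad) < tolerance:      # e. convergence check
            break
    return theta, history
```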

πŸ“‹ Variants of Gradient Descent

We use Batch GD when we want a stable descent direction at each step, computed from the full dataset. It is the clearest baseline variant for understanding how gradient descent behaves.

Characteristics:

  • Uses entire dataset for each update
  • Single update per epoch
  • Smooth and stable convergence
  • Higher memory consumption
  • Best for small datasets

We include the full loop to show how gradient descent is used in real training: multiple epochs, loss tracking, and a stopping condition. This bridges the math update rule and practical implementation.

Characteristics:

  • Complete training loop with multiple epochs
  • Monitors loss at each epoch
  • Includes convergence criterion
  • Returns loss history for analysis
  • Foundation for all modern optimization methods

We use SGD when datasets are large or we need very frequent, cheap updates. It introduces useful noise during training, which can help escape local minima and speed up early learning.

ALGORITHM: Stochastic Gradient Descent (One Epoch)
─────────────────────────────────────────────────────────────────

INPUT:
    ΞΈ: Current parameter vector
    X: Training data of shape (n_samples, n_features)
    y: Target values of shape (n_samples,)
    Ξ±: Learning rate

OUTPUT:
    ΞΈ: Updated parameter vector

PROCEDURE:
    1. Shuffle the dataset indices
    
    2. FOR each sample i in shuffled order:
        
        a. Get single sample: xα΅’, yα΅’
        
        b. Compute prediction: Ε·α΅’ ← predict(xα΅’, ΞΈ)
        
        c. Compute gradient for single sample:
           βˆ‡Lα΅’ ← gradient(xα΅’, yα΅’, Ε·α΅’, ΞΈ)
        
        d. Update parameters:
           ΞΈ ← ΞΈ - Ξ± Β· βˆ‡Lα΅’
    
    3. RETURN ΞΈ

Characteristics:

  • Update for each individual sample
  • Very noisy but memory efficient
  • Can escape local minima
  • Excellent for very large datasets
  • Less stable than Batch GD

We use Mini-Batch GD to balance stability and efficiency: each update sees more than one sample, but not the whole dataset. That is why it is the standard choice in modern deep learning training.

ALGORITHM: Mini-Batch Gradient Descent (One Epoch)
─────────────────────────────────────────────────────────────────

INPUT:
    ΞΈ: Current parameter vector
    X: Training data of shape (n_samples, n_features)
    y: Target values of shape (n_samples,)
    Ξ±: Learning rate
    batch_size: Number of samples per batch (e.g., 32)

OUTPUT:
    ΞΈ: Updated parameter vector

PROCEDURE:
    1. n ← number of samples
    
    2. Shuffle the dataset
    
    3. n_batches ← ceil(n / batch_size)
    
    4. FOR batch_idx = 0 TO n_batches - 1:
        
        a. // Extract mini-batch
           start ← batch_idx Γ— batch_size
           end ← min(start + batch_size, n)
           X_batch ← X[start:end]
           y_batch ← y[start:end]
        
        b. // Forward pass
           Ε·_batch ← predict(X_batch, ΞΈ)
        
        c. // Compute gradient on mini-batch
           βˆ‡L ← gradient(X_batch, y_batch, Ε·_batch, ΞΈ)
        
        d. // Update parameters
           ΞΈ ← ΞΈ - Ξ± Β· βˆ‡L
    
    5. RETURN ΞΈ

Characteristics:

  • Perfect balance between Batch GD and SGD
  • Update for each mini-batch (typically 32-256 samples)
  • Good balance between speed and stability
  • Less noise than SGD, more memory-efficient than Batch GD
  • Most widely used method in modern practice

Mathematical Foundation

For a differentiable loss function $L(\theta)$, gradient descent updates parameters as:

$$\theta_{t+1} = \theta_t - \alpha \nabla_\theta L(\theta_t)$$

Where:

  • $\theta_t$ is the parameter vector at iteration $t$
  • $\alpha$ is the learning rate (step size)
  • $\nabla_\theta L$ is the gradient of the loss with respect to parameters

Convergence Conditions

For convex functions with Lipschitz-continuous gradients, gradient descent converges when:

  1. The learning rate satisfies: $\alpha < \frac{2}{L}$ where $L$ is the Lipschitz constant
  2. For convergence rate: $L(\theta_t) - L(\theta^*) \leq \frac{\|\theta_0 - \theta^*\|^2}{2\alpha t}$
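The step-size bound can be checked numerically. A sketch on the quadratic $L(\theta) = \frac{c}{2}\theta^2$ (an illustrative choice; its gradient $c\theta$ has Lipschitz constant exactly $c$, and each GD step multiplies $\theta$ by $1 - \alpha c$):

```python
def final_magnitude(alpha, c=4.0, theta0=1.0, steps=50):
    """Run GD on L(θ) = (c/2)Β·θ², whose gradient cΒ·θ is c-Lipschitz."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * c * theta   # θ ← (1 βˆ’ Ξ±c) Β· θ
    return abs(theta)

# With c = 4 the bound is alpha < 2/c = 0.5:
# alpha = 0.4 contracts toward the minimum, alpha = 0.6 diverges.
```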

Comparison of Variants

Variant     | Update Frequency | Convergence                     | Memory Usage | Noise Level
------------|------------------|---------------------------------|--------------|------------
Batch GD    | Once per epoch   | Smooth, stable                  | High         | Low
SGD         | Once per sample  | Noisy, can escape local minima  | Low          | High
Mini-Batch  | Once per batch   | Good balance                    | Medium       | Medium

Key Hyperparameters

  1. Learning Rate (Ξ±): Too high β†’ divergence; too low β†’ slow convergence
  2. Batch Size: Larger β†’ more stable; smaller β†’ more noise, better generalization
  3. Number of Epochs: Enough to converge but avoid overfitting
