πŸ“ Logistic Regression Pseudocode

Logistic regression is a fundamental algorithm for binary classification. Despite its name, it is a classification algorithm, not a regression one: it uses the logistic (sigmoid) function to map a linear combination of features to a probability.

The Algorithm

Training with Gradient Descent

ALGORITHM: Logistic Regression Training
─────────────────────────────────────────────────────────────────

INPUT:
    X: Feature matrix of shape (n_samples, n_features)
    y: Binary labels {0, 1} of shape (n_samples,)
    Ξ±: Learning rate (e.g., 0.01)
    max_iterations: Maximum number of iterations
    tolerance: Convergence threshold (e.g., 1e-6)

OUTPUT:
    w: Weight vector of shape (n_features,)
    b: Bias term (scalar)

PROCEDURE:
    1. Initialize weights: w ← zeros(n_features)
    2. Initialize bias: b ← 0
    3. n ← number of samples
    
    4. FOR iteration = 1 TO max_iterations:
        
        a. // Forward pass: compute linear combination
           z ← X Β· w + b
        
        b. // Apply sigmoid activation
           Ε· ← Οƒ(z) = 1 / (1 + exp(-z))
        
        c. // Compute gradients (from cross-entropy loss)
           dw ← (1/n) Β· Xα΅€ Β· (Ε· - y)
           db ← (1/n) Β· Ξ£(Ε· - y)
        
        d. // Update parameters
           w ← w - Ξ± Β· dw
           b ← b - Ξ± Β· db
        
        e. // Check convergence
           IF ||dw|| < tolerance AND |db| < tolerance:
               BREAK
    
    5. RETURN w, b
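The training procedure above can be sketched directly in NumPy. This is a minimal illustration, not a production implementation; the function name and the clipping inside the sigmoid (to avoid overflow in `exp`) are implementation choices, not part of the pseudocode.

```python
import numpy as np

def sigmoid(z):
    # Logistic function; clip z to avoid overflow in exp for large |z|
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def train_logistic_regression(X, y, alpha=0.01, max_iterations=1000, tolerance=1e-6):
    n, m = X.shape
    w = np.zeros(m)                      # step 1: initialize weights
    b = 0.0                              # step 2: initialize bias
    for _ in range(max_iterations):      # step 4
        z = X @ w + b                    # 4a: linear combination
        y_hat = sigmoid(z)               # 4b: sigmoid activation
        dw = X.T @ (y_hat - y) / n       # 4c: gradient w.r.t. weights
        db = np.sum(y_hat - y) / n       # 4c: gradient w.r.t. bias
        w -= alpha * dw                  # 4d: update weights
        b -= alpha * db                  # 4d: update bias
        if np.linalg.norm(dw) < tolerance and abs(db) < tolerance:
            break                        # 4e: convergence check
    return w, b                          # step 5
```

On a small linearly separable dataset this converges to a decision boundary between the two classes within a few thousand iterations at a moderate learning rate.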

Prediction

ALGORITHM: Logistic Regression Prediction
─────────────────────────────────────────────────────────────────

INPUT:
    x: Sample feature vector of shape (n_features,)
    w: Trained weight vector
    b: Trained bias term
    threshold: Decision threshold (default = 0.5)

OUTPUT:
    class: Predicted class {0, 1}
    probability: Probability of class 1

PROCEDURE:
    1. Compute linear combination:
       z ← w Β· x + b
    
    2. Apply sigmoid to get probability:
       p ← Οƒ(z) = 1 / (1 + exp(-z))
    
    3. Apply threshold to get class:
       IF p β‰₯ threshold:
           class ← 1
       ELSE:
           class ← 0
    
    4. RETURN class, p
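The prediction procedure is a few lines of NumPy; as a sketch, with an illustrative function name:

```python
import numpy as np

def predict(x, w, b, threshold=0.5):
    z = np.dot(w, x) + b                 # step 1: linear combination
    p = 1.0 / (1.0 + np.exp(-z))         # step 2: sigmoid probability
    cls = 1 if p >= threshold else 0     # step 3: apply decision threshold
    return cls, p                        # step 4
```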

The Sigmoid Function

FUNCTION: Sigmoid (Logistic Function)
─────────────────────────────────────────────────────────────────

INPUT:
    z: Real number or array

OUTPUT:
    Οƒ(z): Value(s) in range (0, 1)

DEFINITION:
    Οƒ(z) = 1 / (1 + exp(-z))

PROPERTIES:
    - Οƒ(0) = 0.5
    - lim(z β†’ +∞) Οƒ(z) = 1
    - lim(z β†’ -∞) Οƒ(z) = 0
    - Οƒ'(z) = Οƒ(z) Β· (1 - Οƒ(z))  // Derivative

Mathematical Foundation

The Model

Logistic regression models the probability that sample $x$ belongs to class 1:

$$P(y=1|x) = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$$

The Loss Function (Binary Cross-Entropy)

The per-sample loss is:

$$L(y, \hat{y}) = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]$$

Averaged over all $n$ samples, the cost function is:

$$J(w, b) = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)]$$
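The batch loss $J$ can be computed in a few lines. The small epsilon clamp below is an implementation detail to avoid `log(0)`, not part of the formula:

```python
import numpy as np

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Clamp predictions away from 0 and 1 so log() stays finite
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```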

The Gradients

$$\frac{\partial J}{\partial w} = \frac{1}{n} X^T (\hat{y} - y)$$

$$\frac{\partial J}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$$

Regularized Logistic Regression

To prevent overfitting, we add a regularization term:

ALGORITHM: L2-Regularized Logistic Regression
─────────────────────────────────────────────────────────────────

MODIFIED GRADIENT UPDATE (step 4d in training):

    // L2 Regularization (Ridge)
    dw ← (1/n) Β· Xα΅€ Β· (Ε· - y) + (Ξ»/n) Β· w
    
    // Note: bias is typically not regularized
    db ← (1/n) Β· Ξ£(Ε· - y)

WHERE:
    Ξ»: Regularization strength (hyperparameter)

The regularized loss function becomes:

$$J(w, b) = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)] + \frac{\lambda}{2n} \|w\|^2$$
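The modified update amounts to a small change in the gradient step; a sketch, where `lam` stands in for the regularization strength λ:

```python
import numpy as np

def regularized_gradients(X, y, y_hat, w, lam):
    n = X.shape[0]
    dw = X.T @ (y_hat - y) / n + (lam / n) * w   # L2 penalty shrinks the weights
    db = np.sum(y_hat - y) / n                   # bias is not regularized
    return dw, db
```

With `lam=0` this reduces to the unregularized gradients from the training algorithm.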

Multiclass Extension (Softmax Regression)

For $K$ classes, we extend to softmax regression:

ALGORITHM: Softmax Regression (Multinomial Logistic Regression)
─────────────────────────────────────────────────────────────────

INPUT:
    X: Feature matrix of shape (n_samples, n_features)
    y: Labels {0, 1, ..., K-1} of shape (n_samples,)

PREDICTION:
    1. Compute scores for each class:
       z_k ← X Β· w_k + b_k    for k = 0, 1, ..., K-1
    
    2. Apply softmax to get probabilities:
       P(y = k | x) = exp(z_k) / Ξ£β±Ό exp(z_j)
    
    3. Predict class with highest probability:
       Ε· ← argmax_k P(y = k | x)

Complexity Analysis

| Operation | Time Complexity | Space Complexity |
|---|---|---|
| Training (per iteration) | $O(n \cdot m)$ | $O(m)$ |
| Prediction (per sample) | $O(m)$ | $O(1)$ |

Where $n$ = samples, $m$ = features.

Key Insights

  1. Interpretability: Weights represent log-odds ratios; $e^{w_j}$ is the multiplicative change in odds per unit increase in feature $j$.

  2. Threshold Tuning: The default 0.5 threshold can be adjusted based on the cost of false positives vs. false negatives.

  3. Feature Scaling: Standardizing features speeds up convergence.

  4. Class Imbalance: Use class weights or resampling for imbalanced datasets.
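The odds-ratio interpretation in insight 1 can be checked directly: exponentiating a weight gives the multiplicative change in odds for a one-unit increase in that feature. A toy sketch with arbitrary illustrative values:

```python
import numpy as np

def odds(p):
    return p / (1 - p)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = 0.7, -1.0
p0 = sigmoid(w * 1.0 + b)     # probability at feature value 1
p1 = sigmoid(w * 2.0 + b)     # probability after a one-unit increase
ratio = odds(p1) / odds(p0)   # equals e^w exactly
```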
