# Logistic Regression Pseudocode
Logistic regression is a fundamental algorithm for binary classification. Despite its name, it is a classification method: it passes a linear combination of the features through the logistic (sigmoid) function to turn it into a probability of class membership.
## The Algorithm

### Training with Gradient Descent
ALGORITHM: Logistic Regression Training
─────────────────────────────────────────────────────────────────
INPUT:
    X: Feature matrix of shape (n_samples, n_features)
    y: Binary labels {0, 1} of shape (n_samples,)
    α: Learning rate (e.g., 0.01)
    max_iterations: Maximum number of iterations
    tolerance: Convergence threshold (e.g., 1e-6)

OUTPUT:
    w: Weight vector of shape (n_features,)
    b: Bias term (scalar)

PROCEDURE:
    1. Initialize weights: w ← zeros(n_features)
    2. Initialize bias: b ← 0
    3. n ← number of samples
    4. FOR iteration = 1 TO max_iterations:
        a. // Forward pass: compute linear combination
           z ← X · w + b
        b. // Apply sigmoid activation
           ŷ ← σ(z) = 1 / (1 + exp(-z))
        c. // Compute gradients (from cross-entropy loss)
           dw ← (1/n) · Xᵀ · (ŷ - y)
           db ← (1/n) · Σ(ŷ - y)
        d. // Update parameters
           w ← w - α · dw
           b ← b - α · db
        e. // Check convergence
           IF ||dw|| < tolerance AND |db| < tolerance:
               BREAK
    5. RETURN w, b
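As a concrete sketch, the training procedure above translates almost line-for-line into NumPy. The function and variable names here are my own, not from any particular library:

```python
import numpy as np

def train_logistic_regression(X, y, alpha=0.1, max_iterations=10_000, tolerance=1e-6):
    """Batch gradient descent for logistic regression (steps 1-5 above)."""
    n, n_features = X.shape
    w = np.zeros(n_features)               # 1. initialize weights
    b = 0.0                                # 2. initialize bias
    for _ in range(max_iterations):
        z = X @ w + b                      # a. forward pass
        y_hat = 1.0 / (1.0 + np.exp(-z))   # b. sigmoid activation
        dw = X.T @ (y_hat - y) / n         # c. gradients of the
        db = np.sum(y_hat - y) / n         #    cross-entropy loss
        w -= alpha * dw                    # d. parameter update
        b -= alpha * db
        if np.linalg.norm(dw) < tolerance and abs(db) < tolerance:
            break                          # e. converged
    return w, b

# Tiny toy problem: class 1 when the feature values are large
X = np.array([[0.0, 0.1], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = train_logistic_regression(X, y)
```

With enough iterations the learned boundary separates the two toy classes; in practice you would reach for a vetted implementation such as scikit-learn's `LogisticRegression` rather than rolling your own loop.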
### Prediction

ALGORITHM: Logistic Regression Prediction
─────────────────────────────────────────────────────────────────
INPUT:
    x: Sample feature vector of shape (n_features,)
    w: Trained weight vector
    b: Trained bias term
    threshold: Decision threshold (default = 0.5)

OUTPUT:
    class: Predicted class {0, 1}
    probability: Probability of class 1

PROCEDURE:
    1. Compute linear combination:
       z ← w · x + b
    2. Apply sigmoid to get probability:
       p ← σ(z) = 1 / (1 + exp(-z))
    3. Apply threshold to get class:
       IF p ≥ threshold:
           class ← 1
       ELSE:
           class ← 0
    4. RETURN class, p
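A matching prediction sketch; the hand-picked weights below are made up for the example, not trained:

```python
import numpy as np

def predict(x, w, b, threshold=0.5):
    """Return (predicted class, probability of class 1) for one sample."""
    z = np.dot(w, x) + b                     # 1. linear combination
    p = 1.0 / (1.0 + np.exp(-z))             # 2. sigmoid probability
    return (1 if p >= threshold else 0), p   # 3. threshold at 0.5 by default

w = np.array([2.0, -1.0])    # hypothetical trained weights
b = -0.5                     # hypothetical trained bias
cls, p = predict(np.array([1.0, 0.5]), w, b)   # z = 1.0, so p = σ(1.0) ≈ 0.73
```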
### The Sigmoid Function

FUNCTION: Sigmoid (Logistic Function)
─────────────────────────────────────────────────────────────────
INPUT:
    z: Real number or array

OUTPUT:
    σ(z): Value(s) in range (0, 1)

DEFINITION:
    σ(z) = 1 / (1 + exp(-z))

PROPERTIES:
    - σ(0) = 0.5
    - lim(z → +∞) σ(z) = 1
    - lim(z → -∞) σ(z) = 0
    - σ'(z) = σ(z) · (1 - σ(z))    // Derivative
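These properties are easy to verify numerically; a quick sketch:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# σ(0) = 0.5, and σ saturates toward 1 for large z
mid = sigmoid(0.0)
big = sigmoid(20.0)

# Derivative identity σ'(z) = σ(z)(1 - σ(z)), checked by central difference
z, h = 0.7, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic = sigmoid(z) * (1.0 - sigmoid(z))
```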
## Mathematical Foundation

### The Model
Logistic regression models the probability that sample $x$ belongs to class 1:
$$P(y=1|x) = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$$
### The Loss Function (Binary Cross-Entropy)

For a single sample:

$$L(y, \hat{y}) = -[y \log(\hat{y}) + (1-y) \log(1-\hat{y})]$$

Averaged over all $n$ samples, the cost function is:

$$J(w, b) = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)]$$
### The Gradients

$$\frac{\partial J}{\partial w} = \frac{1}{n} X^T (\hat{y} - y)$$

$$\frac{\partial J}{\partial b} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)$$
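A finite-difference check is a good way to convince yourself these gradient formulas are right. The sketch below compares the analytic gradient of $J$ with a numerical estimate on random data (all names are illustrative):

```python
import numpy as np

def bce_loss(w, b, X, y):
    """Average binary cross-entropy J(w, b)."""
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
w, b = rng.normal(size=3), 0.3

# Analytic gradient from the formula above
y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))
dw = X.T @ (y_hat - y) / len(y)

# Numerical estimate for the first weight via central differences
h = 1e-6
e = np.array([h, 0.0, 0.0])
numeric_dw0 = (bce_loss(w + e, b, X, y) - bce_loss(w - e, b, X, y)) / (2 * h)
```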
## Regularized Logistic Regression

To prevent overfitting, we add a regularization term:
ALGORITHM: L2-Regularized Logistic Regression
─────────────────────────────────────────────────────────────────
MODIFIED GRADIENT UPDATE (step 4c in training):
    // L2 Regularization (Ridge)
    dw ← (1/n) · Xᵀ · (ŷ - y) + (λ/n) · w
    // Note: bias is typically not regularized
    db ← (1/n) · Σ(ŷ - y)

WHERE:
    λ: Regularization strength (hyperparameter)

The regularized loss function becomes:
$$J(w, b) = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)] + \frac{\lambda}{2n} \|w\|^2$$
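In code, the change is a single extra term in the weight gradient. A sketch (the helper name and the λ value are illustrative):

```python
import numpy as np

def gradients_l2(X, y, w, b, lam):
    """Gradients of the L2-regularized binary cross-entropy loss."""
    n = len(y)
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    dw = X.T @ (y_hat - y) / n + (lam / n) * w   # penalty pulls weights toward 0
    db = np.sum(y_hat - y) / n                   # bias is left unregularized
    return dw, db

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([0.0, 1.0])
w = np.array([0.5, -0.5])
dw_plain, db_plain = gradients_l2(X, y, w, 0.0, lam=0.0)
dw_reg, db_reg = gradients_l2(X, y, w, 0.0, lam=2.0)   # weight grads differ by (λ/n)·w
```

Larger λ shrinks the weights more aggressively; the bias gradient is unchanged.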
## Multiclass Extension (Softmax Regression)

For $K$ classes, we extend to softmax regression:
ALGORITHM: Softmax Regression (Multinomial Logistic Regression)
─────────────────────────────────────────────────────────────────
INPUT:
    X: Feature matrix of shape (n_samples, n_features)
    y: Labels {0, 1, ..., K-1} of shape (n_samples,)

PREDICTION:
    1. Compute scores for each class:
       z_k ← X · w_k + b_k    for k = 0, 1, ..., K-1
    2. Apply softmax to get probabilities:
       P(y = k | x) = exp(z_k) / Σⱼ exp(z_j)
    3. Predict class with highest probability:
       ŷ ← argmax_k P(y = k | x)
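The softmax prediction step might be sketched as follows, with made-up weights for illustration (`W` stacks the per-class weight vectors as columns):

```python
import numpy as np

def softmax_predict(x, W, b):
    """Return (predicted class, probability vector) for one sample."""
    z = x @ W + b                        # scores z_k, one per class
    z = z - z.max()                      # shift for numerical stability
    p = np.exp(z) / np.sum(np.exp(z))    # softmax probabilities
    return int(np.argmax(p)), p

# 2 features, 3 classes, hand-picked weights
W = np.array([[1.0, 0.0, -1.0],
              [0.0, 1.0,  0.0]])
b = np.zeros(3)
cls, p = softmax_predict(np.array([2.0, 0.5]), W, b)   # scores [2.0, 0.5, -2.0]
```

Subtracting `z.max()` before exponentiating leaves the probabilities unchanged but avoids overflow for large scores.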
## Complexity Analysis

| Operation | Time Complexity | Space Complexity |
|---|---|---|
| Training (per iteration) | $O(n \cdot m)$ | $O(m)$ |
| Prediction (per sample) | $O(m)$ | $O(1)$ |
Where $n$ = samples, $m$ = features.
## Key Insights
- **Interpretability:** Each weight $w_j$ is the change in the log-odds of class 1 per unit increase in feature $j$; equivalently, $e^{w_j}$ is the multiplicative change in the odds.
- **Threshold tuning:** The default 0.5 threshold can be adjusted based on the relative cost of false positives vs. false negatives.
- **Feature scaling:** Standardizing features speeds up gradient-descent convergence.
- **Class imbalance:** Use class weights or resampling for imbalanced datasets.
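The odds-ratio interpretation can be verified in a few lines; `w_j` and `b` below are hypothetical values standing in for one trained weight and the rest of the linear term:

```python
import math

def odds(p):
    return p / (1.0 - p)

w_j = 0.7    # hypothetical weight for feature j
b = -1.0     # everything else in z held fixed

p0 = 1.0 / (1.0 + math.exp(-b))           # P(y=1) with x_j = 0
p1 = 1.0 / (1.0 + math.exp(-(b + w_j)))   # P(y=1) with x_j increased by 1

ratio = odds(p1) / odds(p0)               # multiplicative change in odds: e^{w_j}
```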
## References
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, Chapter 4. Springer. https://hastie.su.domains/ElemStatLearn/
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Chapter 4. Springer.
- scikit-learn Documentation: LogisticRegression. https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
- Andrew Ng's CS229 Notes: Logistic Regression. https://cs229.stanford.edu/