# Linear Regression Pseudocode
Linear regression is one of the foundational algorithms in machine learning. It finds the best-fitting line (or hyperplane) through data by minimizing the sum of squared errors between predictions and actual values.
## The Algorithm

### Gradient Descent Solution
When the dataset is very large, or when the matrix inversion required by the closed-form solution is too expensive, we use gradient descent:
```
ALGORITHM: Linear Regression (Gradient Descent)
─────────────────────────────────────────────────────────────────
INPUT:
    X: Feature matrix of shape (n_samples, n_features)
    y: Target vector of shape (n_samples,)
    α: Learning rate (e.g., 0.01)
    max_iterations: Maximum number of iterations
    tolerance: Convergence threshold (e.g., 1e-6)

OUTPUT:
    w: Weight vector of shape (n_features + 1,)

PROCEDURE:
    1. Add bias term: X_b ← [1, X]
    2. Initialize weights: w ← zeros(n_features + 1)
    3. n ← number of samples
    4. FOR iteration = 1 TO max_iterations:
        a. Compute predictions:  ŷ ← X_b · w
        b. Compute error:        error ← ŷ - y
        c. Compute gradient:     gradient ← (2/n) · X_bᵀ · error
        d. Update weights:       w ← w - α · gradient
        e. Check convergence:    IF ||gradient|| < tolerance: BREAK
    5. RETURN w
```
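The loop above can be sketched in NumPy almost line for line. This is a minimal illustration, not a production implementation; the function name `linear_regression_gd` and the default hyperparameters are my own choices:

```python
import numpy as np

def linear_regression_gd(X, y, alpha=0.01, max_iterations=10_000, tolerance=1e-6):
    """Gradient-descent linear regression, mirroring the pseudocode above."""
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])  # 1. prepend bias column
    w = np.zeros(X_b.shape[1])                      # 2. initialize weights
    n = X_b.shape[0]                                # 3. number of samples
    for _ in range(max_iterations):                 # 4. iterate
        y_hat = X_b @ w                             # 4a. predictions
        error = y_hat - y                           # 4b. error
        gradient = (2.0 / n) * X_b.T @ error        # 4c. gradient of MSE
        w -= alpha * gradient                       # 4d. update step
        if np.linalg.norm(gradient) < tolerance:    # 4e. convergence check
            break
    return w                                        # 5. fitted weights
```

The convergence test on the gradient norm corresponds to step 4e; checking the change in the cost between iterations is a common alternative.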
### Closed-Form Solution (Normal Equation)

For a dataset with $n$ samples and $m$ features, linear regression finds the weights $\mathbf{w}$ that minimize the Mean Squared Error (MSE).
```
ALGORITHM: Linear Regression (Normal Equation)
─────────────────────────────────────────────────────────────────
INPUT:
    X: Feature matrix of shape (n_samples, n_features)
    y: Target vector of shape (n_samples,)

OUTPUT:
    w: Weight vector of shape (n_features + 1,)

PROCEDURE:
    1. Add bias term: X_b ← [1, X]          // Prepend column of ones
    2. Compute optimal weights using the Normal Equation:
           w ← (X_bᵀ · X_b)⁻¹ · X_bᵀ · y
    3. RETURN w

PREDICTION:
    Given a new sample x:
        ŷ ← w₀ + w₁·x₁ + w₂·x₂ + ... + wₘ·xₘ
    // Or in matrix form: ŷ ← [1, x] · w
```
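A NumPy sketch of the closed-form procedure follows. One substitution of my own: `np.linalg.solve` is used instead of forming the inverse explicitly, which is numerically safer but solves the same system the pseudocode writes with $(\cdot)^{-1}$:

```python
import numpy as np

def linear_regression_normal(X, y):
    """Solve (X_bᵀ X_b) w = X_bᵀ y — the Normal Equation above."""
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend column of ones
    # Equivalent to (X_bᵀ X_b)⁻¹ X_bᵀ y, without an explicit inverse
    return np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

def predict(w, X):
    """ŷ = [1, x] · w for each row x of X (matrix-form prediction)."""
    X_b = np.hstack([np.ones((X.shape[0], 1)), X])
    return X_b @ w
```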
## Mathematical Foundation

The cost function (Mean Squared Error) that we minimize is:
$$J(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathbf{w}^T \mathbf{x}_i)^2$$

The gradient of this cost function with respect to the weights is:

$$\nabla_{\mathbf{w}} J = \frac{2}{n} \mathbf{X}^T (\mathbf{X}\mathbf{w} - \mathbf{y})$$

Setting the gradient to zero and solving gives the Normal Equation:

$$\mathbf{w}^* = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$
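The gradient formula can be sanity-checked against central finite differences of $J(\mathbf{w})$ — a generic verification trick, not part of the algorithm itself (all data here is random and illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
w = rng.normal(size=3)
n = len(y)

def J(w):
    """Mean squared error cost, as defined above."""
    r = X @ w - y
    return (r @ r) / n

# Analytic gradient: (2/n) Xᵀ (Xw - y)
analytic = (2.0 / n) * X.T @ (X @ w - y)

# Central differences along each coordinate axis
eps = 1e-6
numeric = np.array([
    (J(w + eps * e) - J(w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
# Since J is quadratic, the two should agree to within rounding error
```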
## Complexity Analysis

| Method | Time Complexity | Space Complexity |
|---|---|---|
| Normal Equation | $O(n \cdot m^2 + m^3)$ | $O(m^2)$ |
| Gradient Descent | $O(k \cdot n \cdot m)$ | $O(m)$ |
Where $n$ = samples, $m$ = features, $k$ = iterations.
## Key Insights

- **Normal Equation:** Best for small-to-medium datasets. No hyperparameters to tune, but the $O(m^3)$ matrix inversion becomes expensive for large feature sets.
- **Gradient Descent:** Scales better to large datasets. Requires tuning the learning rate and convergence criteria.
- **Feature Scaling:** Always normalize or standardize features when using gradient descent to ensure stable convergence.
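To illustrate the feature-scaling point, here is a sketch that standardizes each column to zero mean and unit variance before running a bare-bones gradient-descent loop; the data, the scales, and the learning rate are all invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two features on wildly different scales (0–1 vs. 0–10,000)
X = np.column_stack([rng.uniform(0, 1, 500), rng.uniform(0, 10_000, 500)])
y = 1.0 + 3.0 * X[:, 0] + 0.002 * X[:, 1]

# Standardize: zero mean, unit variance per column
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_s = (X - mu) / sigma

# Plain gradient descent on the standardized features
X_b = np.hstack([np.ones((len(y), 1)), X_s])
w = np.zeros(3)
for _ in range(5_000):
    gradient = (2.0 / len(y)) * X_b.T @ (X_b @ w - y)
    w -= 0.1 * gradient

pred = X_b @ w  # note: new samples must also be scaled with the same mu, sigma
```

Without the standardization step, the same learning rate would make the loop diverge, because the second feature's scale inflates its direction of the gradient by orders of magnitude.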
## References
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, Chapter 3. Springer. https://hastie.su.domains/ElemStatLearn/
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Chapter 3. Springer.
- scikit-learn Documentation: LinearRegression. https://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares
- Stanford CS229: Machine Learning Course Notes on Linear Regression. https://cs229.stanford.edu/