Support Vector Machines (SVM) Pseudocode
Support Vector Machines (SVM) are powerful supervised learning models that find the optimal hyperplane separating classes by maximizing the margin between them. SVMs can handle linear and non-linear classification through the kernel trick.
The Algorithm
Linear SVM: The Optimization Problem
ALGORITHM: Linear SVM Formulation
═════════════════════════════════════════════════════════════════
OBJECTIVE: Find the hyperplane w · x + b = 0 that maximizes the margin

PRIMAL PROBLEM (Hard Margin):
  minimize:   (1/2) ||w||²
  subject to: yᵢ(w · xᵢ + b) ≥ 1  for all i = 1, ..., n
  where:
    w: Weight vector (normal to the hyperplane)
    b: Bias term
    yᵢ ∈ {-1, +1}: Class labels
    xᵢ: Training samples

PRIMAL PROBLEM (Soft Margin - allows misclassification):
  minimize:   (1/2) ||w||² + C Σᵢ ξᵢ
  subject to: yᵢ(w · xᵢ + b) ≥ 1 - ξᵢ  for all i
              ξᵢ ≥ 0  for all i
  where:
    C: Regularization parameter (trade-off between margin width and training errors)
    ξᵢ: Slack variables (allow points inside the margin or misclassified)
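The soft-margin objective can be evaluated directly by replacing each slack variable ξᵢ with the hinge loss max(0, 1 - yᵢ(w · xᵢ + b)), which is its value at the optimum. A minimal NumPy sketch (the function name and toy values below are illustrative):

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Soft-margin primal objective: (1/2)||w||^2 + C * sum of slacks.

    At the optimum each slack variable equals the hinge loss
    max(0, 1 - y_i (w . x_i + b)), so the objective can be evaluated
    without solving for the slacks explicitly.
    """
    margins = y * (X @ w + b)                  # y_i (w . x_i + b)
    slacks = np.maximum(0.0, 1.0 - margins)    # xi_i
    return 0.5 * (w @ w) + C * slacks.sum()
```

For a hyperplane that separates every point with functional margin at least 1, the slack term vanishes and only the (1/2)||w||² term remains.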
SVM Training with SMO (Sequential Minimal Optimization)
ALGORITHM: SMO for SVM Training (Simplified)
═════════════════════════════════════════════════════════════════
INPUT:
  X: Training data of shape (n_samples, n_features)
  y: Labels {-1, +1} of shape (n_samples,)
  C: Regularization parameter
  kernel: Kernel function (default: linear)
  tolerance: Numerical tolerance (e.g., 1e-3)
  max_iterations: Maximum passes over the data
OUTPUT:
  α: Lagrange multipliers of shape (n_samples,)
  b: Bias term
PROCEDURE:
  1. Initialize: α ← zeros(n_samples), b ← 0
  2. // Precompute the kernel matrix (for efficiency)
     K[i,j] ← kernel(xᵢ, xⱼ) for all i, j
  3. passes ← 0
     WHILE passes < max_iterations:
       num_changed ← 0
       FOR i = 1 TO n_samples:
         // Calculate the prediction error for sample i
         Eᵢ ← (Σⱼ αⱼ yⱼ K[i,j]) + b - yᵢ
         // Check whether αᵢ violates the KKT conditions
         IF (yᵢ Eᵢ < -tolerance AND αᵢ < C) OR (yᵢ Eᵢ > tolerance AND αᵢ > 0):
           // Select the second α to optimize (heuristic: maximize |Eᵢ - Eⱼ|)
           j ← select_second_alpha(i, Eᵢ)
           Eⱼ ← (Σₖ αₖ yₖ K[j,k]) + b - yⱼ
           // Save the old alphas
           αᵢ_old ← αᵢ
           αⱼ_old ← αⱼ
           // Compute the bounds for αⱼ
           IF yᵢ ≠ yⱼ:
             L ← max(0, αⱼ - αᵢ)
             H ← min(C, C + αⱼ - αᵢ)
           ELSE:
             L ← max(0, αᵢ + αⱼ - C)
             H ← min(C, αᵢ + αⱼ)
           IF L = H:
             CONTINUE
           // Compute eta (the second derivative of the objective along the constraint line)
           η ← 2 K[i,j] - K[i,i] - K[j,j]
           IF η ≥ 0:
             CONTINUE
           // Update αⱼ
           αⱼ ← αⱼ - yⱼ(Eᵢ - Eⱼ) / η
           αⱼ ← clip(αⱼ, L, H)
           IF |αⱼ - αⱼ_old| < 1e-5:
             CONTINUE
           // Update αᵢ
           αᵢ ← αᵢ + yᵢ yⱼ (αⱼ_old - αⱼ)
           // Update the bias b
           b₁ ← b - Eᵢ - yᵢ(αᵢ - αᵢ_old) K[i,i] - yⱼ(αⱼ - αⱼ_old) K[i,j]
           b₂ ← b - Eⱼ - yᵢ(αᵢ - αᵢ_old) K[i,j] - yⱼ(αⱼ - αⱼ_old) K[j,j]
           IF 0 < αᵢ < C:
             b ← b₁
           ELSE IF 0 < αⱼ < C:
             b ← b₂
           ELSE:
             b ← (b₁ + b₂) / 2
           num_changed ← num_changed + 1
       IF num_changed = 0:
         passes ← passes + 1
       ELSE:
         passes ← 0
  4. // Extract support vectors (samples with αᵢ > 0)
     support_vectors ← indices where α > 0
  5. RETURN α, b, support_vectors
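The simplified SMO procedure translates almost line-for-line into NumPy. The sketch below is a hedged illustration, not production code: following Platt's simplified variant, it picks the second multiplier at random instead of using the max-|Eᵢ - Eⱼ| heuristic.

```python
import numpy as np

def smo_train(X, y, C=1.0, tol=1e-3, max_passes=10, kernel=None):
    """Simplified SMO: returns Lagrange multipliers alpha and bias b.

    The second multiplier j is chosen at random (Platt's simplified
    variant), not by the max-|E_i - E_j| heuristic.
    """
    n = X.shape[0]
    if kernel is None:
        kernel = lambda a, b: a @ b            # default: linear kernel
    # Precompute the kernel (Gram) matrix
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha, b, passes = np.zeros(n), 0.0, 0
    rng = np.random.default_rng(0)
    while passes < max_passes:
        num_changed = 0
        for i in range(n):
            E_i = (alpha * y) @ K[i] + b - y[i]
            if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                j = int(rng.integers(n - 1))
                if j >= i:
                    j += 1                     # random j != i
                E_j = (alpha * y) @ K[j] + b - y[j]
                a_i_old, a_j_old = alpha[i], alpha[j]
                if y[i] != y[j]:
                    L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
                else:
                    L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
                if L == H:
                    continue
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if eta >= 0:
                    continue
                alpha[j] = np.clip(alpha[j] - y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - a_j_old) < 1e-5:
                    continue
                alpha[i] += y[i] * y[j] * (a_j_old - alpha[j])
                b1 = b - E_i - y[i]*(alpha[i]-a_i_old)*K[i, i] - y[j]*(alpha[j]-a_j_old)*K[i, j]
                b2 = b - E_j - y[i]*(alpha[i]-a_i_old)*K[i, j] - y[j]*(alpha[j]-a_j_old)*K[j, j]
                if 0 < alpha[i] < C:
                    b = b1
                elif 0 < alpha[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                num_changed += 1
        passes = passes + 1 if num_changed == 0 else 0
    return alpha, b
```

On a small linearly separable toy set this recovers a separating hyperplane; the full Platt algorithm adds working-set heuristics that matter mainly for speed on larger problems.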
SVM Prediction
ALGORITHM: SVM Prediction
═════════════════════════════════════════════════════════════════
INPUT:
  x: New sample to classify
  α: Trained Lagrange multipliers
  b: Trained bias term
  X_sv: Support vectors
  y_sv: Labels of the support vectors
  kernel: Kernel function
OUTPUT:
  class: Predicted class {-1, +1}
  score: Decision function value (proportional to the signed distance to the hyperplane)
PROCEDURE:
  // Decision function: f(x) = Σᵢ αᵢ yᵢ K(xᵢ, x) + b, summed over support vectors
  score ← 0
  FOR i IN support_vector_indices:
    score ← score + α[i] × y_sv[i] × kernel(X_sv[i], x)
  score ← score + b
  // Classify based on the sign
  IF score ≥ 0:
    class ← +1
  ELSE:
    class ← -1
  RETURN class, score
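The prediction routine is a direct sum over the support vectors; a minimal Python version whose argument names mirror the pseudocode:

```python
def svm_predict(x, alpha_sv, y_sv, X_sv, b, kernel):
    """Evaluate f(x) = sum_i alpha_i y_i K(x_i, x) + b over the support
    vectors only, then classify by the sign of the score."""
    score = b
    for a_i, y_i, x_i in zip(alpha_sv, y_sv, X_sv):
        score += a_i * y_i * kernel(x_i, x)
    return (+1 if score >= 0 else -1), score
```

Note that only the support vectors (those with αᵢ > 0) and their labels and multipliers need to be stored; the rest of the training set plays no role at prediction time.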
Kernel Functions
FUNCTION: Common Kernel Functions
═════════════════════════════════════════════════════════════════
LINEAR KERNEL:
  K(x, z) = x · z = xᵀz
  // Equivalent to the standard dot product
  // Use when the data is linearly separable

POLYNOMIAL KERNEL:
  K(x, z) = (γ xᵀz + r)^d
  where:
    d: Degree of the polynomial (e.g., 2, 3)
    γ: Scale factor (default: 1/n_features)
    r: Independent coefficient (default: 0 or 1)

RADIAL BASIS FUNCTION (RBF/Gaussian):
  K(x, z) = exp(-γ ||x - z||²)
  where:
    γ: Kernel coefficient (default: 1/n_features)
  // Higher γ → narrower Gaussian → more complex decision boundary
  // The most popular choice for non-linear problems

SIGMOID KERNEL:
  K(x, z) = tanh(γ xᵀz + r)
  // Similar to a neural network with one hidden layer
  // Not always a valid Mercer kernel
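All four kernels are one-liners in NumPy. The defaults below (γ = 1, and the choices of r and d) are illustrative, not library defaults:

```python
import numpy as np

def linear_kernel(x, z):
    return float(x @ z)

def polynomial_kernel(x, z, gamma=1.0, r=1.0, d=3):
    return (gamma * float(x @ z) + r) ** d

def rbf_kernel(x, z, gamma=1.0):
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))

def sigmoid_kernel(x, z, gamma=1.0, r=0.0):
    return float(np.tanh(gamma * float(x @ z) + r))
```

Any of these can be passed wherever the pseudocode expects a kernel(x, z) callable; note that rbf_kernel(x, x) is always 1.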
Mathematical Foundation
The Dual Problem
The primal SVM problem is converted to its Lagrangian dual, which is easier to solve and depends on the data only through kernel evaluations:
$$\max_\alpha \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

Subject to:
- $0 \leq \alpha_i \leq C$ for all $i$
- $\sum_{i=1}^{n} \alpha_i y_i = 0$
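Given a candidate α, the dual objective and both constraints are straightforward to check numerically. A small sketch, where K is the precomputed kernel matrix:

```python
import numpy as np

def dual_objective(alpha, y, K):
    """W(alpha) = sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j K_ij."""
    v = alpha * y
    return alpha.sum() - 0.5 * (v @ K @ v)

def dual_feasible(alpha, y, C, eps=1e-9):
    """Box constraint 0 <= alpha_i <= C and equality constraint sum_i alpha_i y_i = 0."""
    return bool(np.all(alpha >= -eps) and np.all(alpha <= C + eps)
                and abs(alpha @ y) <= eps)
```

SMO maximizes W(α) while keeping every intermediate α feasible, which is why each step optimizes exactly two multipliers at once (the equality constraint couples them).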
The Kernel Trick
$$K(x, z) = \phi(x) \cdot \phi(z)$$

We compute the kernel directly, avoiding the expensive explicit transformation $\phi$.
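As a concrete check, the degree-2 polynomial kernel (x · z + 1)² equals the dot product under the standard explicit 6-dimensional feature map for 2-D inputs:

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, z) = (x . z + 1)^2 on 2-D inputs."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1**2, x2**2, s * x1 * x2, s * x1, s * x2, 1.0])

def poly2_kernel(x, z):
    return (float(x @ z) + 1.0) ** 2
```

The kernel costs O(m) per evaluation, while the explicit map needs O(m²) features for degree 2 and grows combinatorially with the degree, which is exactly the cost the kernel trick avoids.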
KKT Conditions
At the optimal solution, each $\alpha_i$ falls into one of three cases:
- $\alpha_i = 0$ ⇒ sample is correctly classified, outside the margin
- $0 < \alpha_i < C$ ⇒ sample lies exactly on the margin (a free support vector)
- $\alpha_i = C$ ⇒ sample is inside the margin or misclassified (a bounded support vector)
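These three cases make it easy to partition the training points of a fitted model from α alone. A small helper (the tolerance eps and the category labels are illustrative):

```python
def kkt_category(alpha, C, eps=1e-8):
    """Label each training point by which KKT case its alpha_i falls into."""
    cats = []
    for a in alpha:
        if a <= eps:
            cats.append("outside margin")
        elif a >= C - eps:
            cats.append("inside margin or misclassified")
        else:
            cats.append("on margin (support vector)")
    return cats
```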
Complexity Analysis
| Operation | Time Complexity | Space Complexity |
|---|---|---|
| Training (SMO) | $O(n^2 \cdot m)$ to $O(n^3 \cdot m)$ | $O(n^2)$ (kernel matrix) |
| Prediction | $O(n_{sv} \cdot m)$ | $O(n_{sv})$ |
Where $n$ = samples, $m$ = features, $n_{sv}$ = support vectors.
Multiclass SVM
ALGORITHM: One-vs-Rest (OvR) Multiclass SVM
═════════════════════════════════════════════════════════════════
TRAINING:
  FOR each class k = 1 TO K:
    y_binary ← [+1 if yᵢ = k else -1 for all i]
    svm[k] ← train_svm(X, y_binary)
PREDICTION:
  scores ← [svm[k].decision_function(x) for k = 1 TO K]
  predicted_class ← argmax(scores)
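A runnable OvR sketch: since the binary learner is interchangeable, this example substitutes a tiny subgradient-descent linear SVM for train_svm (the learning rate, epoch count, and helper names are all illustrative choices, not part of the algorithm):

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=300):
    """Stand-in binary trainer: subgradient descent on the soft-margin objective."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        for i in range(n):
            if y[i] * (X[i] @ w + b) < 1:          # hinge active: margin violated
                w += lr * (C * y[i] * X[i] - w / n)
                b += lr * C * y[i]
            else:                                   # only the regularizer pulls on w
                w -= lr * w / n
    return w, b

def ovr_train(X, y, n_classes):
    """One binary SVM per class: class k vs the rest."""
    return [train_linear_svm(X, np.where(y == k, 1.0, -1.0))
            for k in range(n_classes)]

def ovr_predict(models, x):
    """Pick the class whose decision function is largest."""
    return int(np.argmax([w @ x + b for w, b in models]))
```

Using the raw decision values (rather than hard ±1 predictions) for the argmax is what lets OvR break ties between the K binary classifiers.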
ALGORITHM: One-vs-One (OvO) Multiclass SVM
═════════════════════════════════════════════════════════════════
TRAINING:
  FOR each pair of classes (i, j) WHERE i < j:
    X_subset ← samples from class i or class j
    y_binary ← [+1 if class i, -1 if class j]
    svm[i,j] ← train_svm(X_subset, y_binary)
PREDICTION:
  votes ← zeros(K)
  FOR each pair (i, j):
    pred ← svm[i,j].predict(x)
    IF pred = +1:
      votes[i] ← votes[i] + 1
    ELSE:
      votes[j] ← votes[j] + 1
  predicted_class ← argmax(votes)
Key Insights
Maximum Margin: SVMs find the decision boundary that maximizes the margin, which tends to generalize better.
Sparse Solution: Only the support vectors matter; all other training points can be discarded after training.
Kernel Trick: Enables non-linear classification without computing an explicit feature mapping.
C Parameter: Low C → wider margin, more training errors tolerated; high C → narrower margin, fewer training errors.
γ Parameter (RBF): Low γ → smoother decision boundary; high γ → more complex boundary (risk of overfitting).
Feature Scaling: Essential for SVMs, since kernels and margins depend on distances; standardize features before training.
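A standard scaling recipe: fit the statistics on the training split only, then apply them to both splits so no test-set information leaks into preprocessing (the same idea as scikit-learn's StandardScaler). A minimal sketch:

```python
import numpy as np

def standardize(X_train, X_test):
    """Zero-mean, unit-variance scaling fitted on the training data only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)   # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```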
References
- Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
- Platt, J. (1998). Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Microsoft Research.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, Chapter 12. Springer. https://hastie.su.domains/ElemStatLearn/
- scikit-learn Documentation: SVM. https://scikit-learn.org/stable/modules/svm.html
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Chapter 7. Springer.