πŸ“ Support Vector Machines (SVM) Pseudocode

Support Vector Machines (SVM) are powerful supervised learning models that find the optimal hyperplane separating classes by maximizing the margin between them. SVMs can handle linear and non-linear classification through the kernel trick.

The Algorithm

Linear SVM: The Optimization Problem

ALGORITHM: Linear SVM Formulation
─────────────────────────────────────────────────────────────────

OBJECTIVE: Find hyperplane wΒ·x + b = 0 that maximizes margin

PRIMAL PROBLEM (Hard Margin):
    
    minimize:     (1/2) ||w||Β²
    
    subject to:   yα΅’(w Β· xα΅’ + b) β‰₯ 1    for all i = 1, ..., n
    
    where:
        w: Weight vector (normal to hyperplane)
        b: Bias term
        yᡒ ∈ {-1, +1}: Class labels
        xα΅’: Training samples

PRIMAL PROBLEM (Soft Margin - allows misclassification):
    
    minimize:     (1/2) ||w||Β² + C Ξ£α΅’ ΞΎα΅’
    
    subject to:   yα΅’(w Β· xα΅’ + b) β‰₯ 1 - ΞΎα΅’    for all i
                  ΞΎα΅’ β‰₯ 0                       for all i
    
    where:
        C: Regularization parameter (trade-off between margin and errors)
        ΞΎα΅’: Slack variables (allow points inside margin or misclassified)
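
The soft-margin objective above is equivalent to minimizing (1/2)||w||Β² plus C times the hinge loss Ξ£α΅’ max(0, 1 βˆ’ yα΅’(wΒ·xα΅’ + b)), which can be attacked directly with subgradient descent. A minimal NumPy sketch on a synthetic two-cluster dataset (the data, learning rate, and epoch count are illustrative assumptions, not part of the formulation):

```python
import numpy as np

# Two well-separated Gaussian clusters, labeled -1 and +1 (toy data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)

C, lr, epochs = 1.0, 0.01, 200
w, b = np.zeros(2), 0.0
for _ in range(epochs):
    margins = y * (X @ w + b)
    viol = margins < 1                     # points violating yα΅’(wΒ·xα΅’ + b) β‰₯ 1
    # Subgradient of (1/2)||w||Β² + C Ξ£α΅’ max(0, 1 - yα΅’(wΒ·xα΅’ + b))
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

train_acc = np.mean(np.sign(X @ w + b) == y)
```

With clusters this far apart, the learned hyperplane should classify essentially all training points correctly; on harder data the same loop trades margin width against hinge violations exactly as C dictates.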

SVM Training with SMO (Sequential Minimal Optimization)

ALGORITHM: SMO for SVM Training (Simplified)
─────────────────────────────────────────────────────────────────

INPUT:
    X: Training data of shape (n_samples, n_features)
    y: Labels {-1, +1} of shape (n_samples,)
    C: Regularization parameter
    kernel: Kernel function (default: linear)
    tol: Numerical tolerance (e.g., 1e-3)
    max_iterations: Maximum passes over data

OUTPUT:
    Ξ±: Lagrange multipliers of shape (n_samples,)
    b: Bias term
    support_vectors: Indices of samples with Ξ±α΅’ > 0

PROCEDURE:
    1. Initialize: Ξ± ← zeros(n_samples), b ← 0
    
    2. // Precompute kernel matrix (for efficiency)
       K[i,j] ← kernel(xα΅’, xβ±Ό) for all i, j
    
    3. passes ← 0
       WHILE passes < max_iterations:
           num_changed ← 0
           
           FOR i = 1 TO n_samples:
               // Calculate prediction error for sample i
               Eα΅’ ← (Ξ£β±Ό Ξ±β±Ό yβ±Ό K[i,j]) + b - yα΅’
               
               // Check if Ξ±α΅’ violates KKT conditions
               IF (yα΅’ Eα΅’ < -tol AND Ξ±α΅’ < C) OR (yα΅’ Eα΅’ > tol AND Ξ±α΅’ > 0):
                   
                   // Select second Ξ± to optimize (heuristic: max |Eα΅’ - Eβ±Ό|)
                   j ← select_second_alpha(i, Eα΅’)
                   Eβ±Ό ← (Ξ£β‚– Ξ±β‚– yβ‚– K[j,k]) + b - yβ±Ό
                   
                   // Save old alphas
                   Ξ±α΅’_old ← Ξ±α΅’
                   Ξ±β±Ό_old ← Ξ±β±Ό
                   
                   // Compute bounds for Ξ±β±Ό
                   IF yα΅’ β‰  yβ±Ό:
                       L ← max(0, Ξ±β±Ό - Ξ±α΅’)
                       H ← min(C, C + Ξ±β±Ό - Ξ±α΅’)
                   ELSE:
                       L ← max(0, Ξ±α΅’ + Ξ±β±Ό - C)
                       H ← min(C, Ξ±α΅’ + Ξ±β±Ό)
                   
                   IF L = H:
                       CONTINUE
                   
                   // Compute eta (second derivative of objective)
                   Ξ· ← 2 K[i,j] - K[i,i] - K[j,j]
                   IF Ξ· β‰₯ 0:
                       CONTINUE
                   
                   // Update Ξ±β±Ό
                   Ξ±β±Ό ← Ξ±β±Ό - yβ±Ό(Eα΅’ - Eβ±Ό) / Ξ·
                   Ξ±β±Ό ← clip(Ξ±β±Ό, L, H)
                   
                   IF |Ξ±β±Ό - Ξ±β±Ό_old| < 1e-5:
                       CONTINUE
                   
                   // Update Ξ±α΅’
                   Ξ±α΅’ ← Ξ±α΅’ + yα΅’ yβ±Ό (Ξ±β±Ό_old - Ξ±β±Ό)
                   
                   // Update bias b
                   b₁ ← b - Eα΅’ - yα΅’(Ξ±α΅’-Ξ±α΅’_old)K[i,i] - yβ±Ό(Ξ±β±Ό-Ξ±β±Ό_old)K[i,j]
                   bβ‚‚ ← b - Eβ±Ό - yα΅’(Ξ±α΅’-Ξ±α΅’_old)K[i,j] - yβ±Ό(Ξ±β±Ό-Ξ±β±Ό_old)K[j,j]
                   
                   IF 0 < Ξ±α΅’ < C:
                       b ← b₁
                   ELSE IF 0 < Ξ±β±Ό < C:
                       b ← bβ‚‚
                   ELSE:
                       b ← (b₁ + bβ‚‚) / 2
                   
                   num_changed ← num_changed + 1
           
           IF num_changed = 0:
               passes ← passes + 1
           ELSE:
               passes ← 0
    
    4. // Extract support vectors (samples with Ξ±α΅’ > 0)
       support_vectors ← indices where Ξ± > 0
    
    5. RETURN Ξ±, b, support_vectors
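
A direct NumPy translation of the procedure above, as a minimal sketch. One assumption: the second alpha is chosen uniformly at random rather than by the max |Eα΅’ βˆ’ Eβ±Ό| heuristic, which matches Platt's simplified SMO variant:

```python
import numpy as np

def smo_train(X, y, C=1.0, tol=1e-3, max_passes=5, kernel=None):
    """Simplified SMO. Returns Lagrange multipliers alpha and bias b."""
    if kernel is None:
        kernel = lambda a, b: a @ b                      # linear kernel
    n = X.shape[0]
    # Precompute kernel matrix
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha, b = np.zeros(n), 0.0
    rng = np.random.default_rng(0)
    passes = 0
    while passes < max_passes:
        num_changed = 0
        for i in range(n):
            E_i = (alpha * y) @ K[i] + b - y[i]
            # KKT violation check
            if (y[i] * E_i < -tol and alpha[i] < C) or \
               (y[i] * E_i > tol and alpha[i] > 0):
                j = rng.choice([k for k in range(n) if k != i])
                E_j = (alpha * y) @ K[j] + b - y[j]
                a_i_old, a_j_old = alpha[i], alpha[j]
                # Bounds for alpha_j
                if y[i] != y[j]:
                    L, H = max(0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
                else:
                    L, H = max(0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
                if L == H:
                    continue
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if eta >= 0:
                    continue
                alpha[j] = np.clip(alpha[j] - y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - a_j_old) < 1e-5:
                    continue
                alpha[i] += y[i] * y[j] * (a_j_old - alpha[j])
                # Bias update
                b1 = b - E_i - y[i] * (alpha[i] - a_i_old) * K[i, i] \
                       - y[j] * (alpha[j] - a_j_old) * K[i, j]
                b2 = b - E_j - y[i] * (alpha[i] - a_i_old) * K[i, j] \
                       - y[j] * (alpha[j] - a_j_old) * K[j, j]
                if 0 < alpha[i] < C:
                    b = b1
                elif 0 < alpha[j] < C:
                    b = b2
                else:
                    b = (b1 + b2) / 2
                num_changed += 1
        passes = passes + 1 if num_changed == 0 else 0
    return alpha, b
```

For a linear kernel the weight vector can then be recovered as w = Ξ£α΅’ Ξ±α΅’yα΅’xα΅’, i.e. `w = (alpha * y) @ X`.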

SVM Prediction

ALGORITHM: SVM Prediction
─────────────────────────────────────────────────────────────────

INPUT:
    x: New sample to classify
    Ξ±: Trained Lagrange multipliers
    b: Trained bias term
    X_sv: Support vectors
    y_sv: Labels of support vectors
    kernel: Kernel function

OUTPUT:
    class: Predicted class {-1, +1}
    score: Decision function value (distance to hyperplane)

PROCEDURE:
    // Decision function: f(x) = Ξ£α΅’ Ξ±α΅’ yα΅’ K(xα΅’, x) + b
    score ← 0
    FOR i = 1 TO len(X_sv):
        score ← score + Ξ±[i] Γ— y_sv[i] Γ— kernel(X_sv[i], x)
    score ← score + b
    
    // Classify based on sign
    IF score β‰₯ 0:
        class ← +1
    ELSE:
        class ← -1
    
    RETURN class, score
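
The prediction step maps to a few lines of NumPy (a sketch; the function name and signature are illustrative):

```python
import numpy as np

def svm_predict(x, alpha, b, X_sv, y_sv, kernel=lambda a, c: a @ c):
    """Evaluate f(x) = Ξ£α΅’ Ξ±α΅’ yα΅’ K(xα΅’, x) + b over the support vectors
    and classify by sign. Returns (class, score)."""
    score = sum(a * yi * kernel(xi, x)
                for a, yi, xi in zip(alpha, y_sv, X_sv)) + b
    return (1 if score >= 0 else -1), score
```

Note that only the support vectors enter the sum, which is why the rest of the training set can be discarded after training.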

Kernel Functions

FUNCTION: Common Kernel Functions
─────────────────────────────────────────────────────────────────

LINEAR KERNEL:
    K(x, z) = x Β· z = xα΅€z
    
    // Equivalent to standard dot product
    // Use when data is linearly separable

POLYNOMIAL KERNEL:
    K(x, z) = (Ξ³ xα΅€z + r)^d
    
    where:
        d: Degree of polynomial (e.g., 2, 3)
        Ξ³: Scale factor (default: 1/n_features)
        r: Coefficient (default: 0 or 1)

RADIAL BASIS FUNCTION (RBF/Gaussian):
    K(x, z) = exp(-Ξ³ ||x - z||Β²)
    
    where:
        Ξ³: Kernel coefficient (default: 1/n_features)
        // Higher Ξ³ β†’ narrower Gaussian β†’ more complex boundary
    
    // Most popular for non-linear problems

SIGMOID KERNEL:
    K(x, z) = tanh(Ξ³ xα΅€z + r)
    
    // Similar to neural network with one hidden layer
    // Not always a valid Mercer kernel
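
The four kernels above are one-liners in NumPy (default parameter values here are illustrative, not canonical):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z                                   # standard dot product

def polynomial_kernel(x, z, gamma=1.0, r=1.0, d=3):
    return (gamma * (x @ z) + r) ** d              # (Ξ³ xα΅€z + r)^d

def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))   # exp(-Ξ³ ||x - z||Β²)

def sigmoid_kernel(x, z, gamma=1.0, r=0.0):
    return np.tanh(gamma * (x @ z) + r)            # tanh(Ξ³ xα΅€z + r)
```

A quick sanity check: the RBF kernel satisfies K(x, x) = 1 for any x, since ||x βˆ’ x||Β² = 0.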

Mathematical Foundation

The Dual Problem

The primal SVM problem is usually solved through its Lagrangian dual, in which the training data appear only via inner products $x_i \cdot x_j$ — the property that makes the kernel trick possible:

$$\max_\alpha \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$

Subject to:

  • $0 \leq \alpha_i \leq C$ for all $i$
  • $\sum_{i=1}^{n} \alpha_i y_i = 0$

The Kernel Trick

$$K(x, z) = \phi(x) \cdot \phi(z)$$

where $\phi$ maps inputs into a (possibly very high-dimensional) feature space. Computing $K(x, z)$ directly avoids ever forming $\phi(x)$ explicitly.

KKT Conditions

For optimal solution, each $\alpha_i$ satisfies:

  • $\alpha_i = 0$ β†’ sample is correctly classified, strictly outside the margin
  • $0 < \alpha_i < C$ β†’ sample lies exactly on the margin (a support vector)
  • $\alpha_i = C$ β†’ sample is inside the margin or misclassified

Complexity Analysis

| Operation      | Time Complexity                      | Space Complexity          |
|----------------|--------------------------------------|---------------------------|
| Training (SMO) | $O(n^2 \cdot m)$ to $O(n^3 \cdot m)$ | $O(n^2)$ (kernel matrix)  |
| Prediction     | $O(n_{sv} \cdot m)$                  | $O(n_{sv})$               |

Where $n$ = samples, $m$ = features, $n_{sv}$ = support vectors.

Multiclass SVM

ALGORITHM: One-vs-Rest (OvR) Multiclass SVM
─────────────────────────────────────────────────────────────────

TRAINING:
    FOR each class k = 1 TO K:
        y_binary ← [+1 if yα΅’ = k else -1 for all i]
        svm[k] ← train_svm(X, y_binary)

PREDICTION:
    scores ← [svm[k].decision_function(x) for k = 1 to K]
    predicted_class ← argmax(scores)
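
A compact OvR sketch in NumPy. The helper `fit_linear_svm` is a stand-in trainer (hinge-loss subgradient descent on the soft-margin objective), an assumption made so the example is self-contained, not part of the pseudocode above:

```python
import numpy as np

def fit_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Stand-in binary trainer: subgradient descent on the hinge loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1               # margin violators
        w -= lr * (w - C * (y[viol, None] * X[viol]).sum(axis=0))
        b -= lr * (-C * y[viol].sum())
    return w, b

def ovr_fit(X, y, n_classes):
    # One binary classifier per class: class k vs everything else
    return [fit_linear_svm(X, np.where(y == k, 1.0, -1.0))
            for k in range(n_classes)]

def ovr_predict(models, x):
    # Pick the class whose classifier is most confident
    scores = [w @ x + b for w, b in models]
    return int(np.argmax(scores))
```

Taking the argmax of the raw decision values (rather than the signs) resolves the ambiguity when several binary classifiers, or none, claim the sample.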


ALGORITHM: One-vs-One (OvO) Multiclass SVM
─────────────────────────────────────────────────────────────────

TRAINING:
    FOR each pair of classes (i, j) where i < j:
        X_subset ← samples from class i or j
        y_binary ← [+1 if class i, -1 if class j]
        svm[i,j] ← train_svm(X_subset, y_binary)

PREDICTION:
    votes ← zeros(K)
    FOR each pair (i, j):
        pred ← svm[i,j].predict(x)
        IF pred = +1:
            votes[i] ← votes[i] + 1
        ELSE:
            votes[j] ← votes[j] + 1
    predicted_class ← argmax(votes)
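
The OvO voting step in isolation. Here `pairwise_predict` is a hypothetical lookup for the trained per-pair classifier svm[i,j]; in the test it is stubbed with a nearest-center rule purely to exercise the voting logic:

```python
import numpy as np
from itertools import combinations

def ovo_vote(x, n_classes, pairwise_predict):
    """Tally one vote per class pair: +1 from svm[i,j] votes for i,
    -1 votes for j. Returns the class with the most votes."""
    votes = np.zeros(n_classes, dtype=int)
    for i, j in combinations(range(n_classes), 2):
        votes[i if pairwise_predict(i, j, x) == +1 else j] += 1
    return int(np.argmax(votes))
```

OvO trains K(K-1)/2 classifiers versus OvR's K, but each on a smaller two-class subset, which can be faster overall when training cost grows superlinearly in the number of samples.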

Key Insights

  1. Maximum Margin: SVMs find the decision boundary that maximizes the margin, leading to better generalization.

  2. Sparse Solution: Only support vectors matter; other training points can be discarded.

  3. Kernel Trick: Enables non-linear classification without explicit feature mapping.

  4. C Parameter: Low C β†’ wider margin, more training errors tolerated (stronger regularization); High C β†’ narrower margin, fewer training errors (higher risk of overfitting).

  5. Ξ³ Parameter (RBF): Low Ξ³ β†’ smooth boundary; High Ξ³ β†’ complex boundary.

  6. Feature Scaling: Essential for SVM; use standardization.
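
The standardization mentioned in insight 6 is simple to do by hand (a sketch of what `sklearn.preprocessing.StandardScaler` does; the key point is that test data must be scaled with the *training* statistics):

```python
import numpy as np

def standardize(X_train, X_test):
    """Scale each feature to zero mean and unit variance,
    using statistics computed on the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0          # guard against constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```

Without scaling, features with large numeric ranges dominate the distances and dot products inside the kernel, distorting the margin.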

References

  • Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
  • Platt, J. (1998). Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Microsoft Research.
  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning, Chapter 12. Springer. https://hastie.su.domains/ElemStatLearn/
  • scikit-learn Documentation: SVM. https://scikit-learn.org/stable/modules/svm.html
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Chapter 7. Springer.