# Perceptron Training Algorithm Pseudocode
The Perceptron is the simplest form of a neural network: a single artificial neuron that learns to classify linearly separable data. Invented by Frank Rosenblatt in 1958, it laid the foundation for modern neural networks.
## The Algorithm

### Perceptron Model
```
PERCEPTRON: Single Artificial Neuron
─────────────────────────────────────────────────────────────────
STRUCTURE:
    Inputs:     x = [x₁, x₂, ..., xₙ]
    Weights:    w = [w₁, w₂, ..., wₙ]
    Bias:       b   (or w₀ with x₀ = 1)
    Activation: Step function

FORWARD PASS:
    1. Compute weighted sum:
       z = Σᵢ (wᵢ × xᵢ) + b = w · x + b

    2. Apply activation function:
       ŷ = step(z) = { 1 if z ≥ 0
                     { 0 if z < 0

       // Alternative: use {-1, +1} labels
       ŷ = sign(z) = { +1 if z ≥ 0
                     { -1 if z < 0
```

### Perceptron Learning Algorithm
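The forward pass above is just a dot product followed by a threshold, and the training loop repeatedly calls it on each sample. A minimal NumPy sketch (illustrative only; the function names are my own):

```python
import numpy as np

def step(z):
    """Heaviside step activation: 1 if z >= 0, else 0."""
    return np.where(z >= 0, 1, 0)

def forward(x, w, b):
    """Perceptron forward pass: weighted sum plus bias, then threshold."""
    z = np.dot(w, x) + b
    return step(z)

# Example: hand-picked weights that implement logical AND on binary inputs
w = np.array([1.0, 1.0])
b = -1.5
print(forward(np.array([1, 1]), w, b))  # 1
print(forward(np.array([1, 0]), w, b))  # 0
```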
```
ALGORITHM: Perceptron Training
─────────────────────────────────────────────────────────────────
INPUT:
    X: Training data of shape (n_samples, n_features)
    y: Binary labels {0, 1} or {-1, +1} of shape (n_samples,)
    η: Learning rate (default: 1.0)
    max_epochs: Maximum number of passes through the data

OUTPUT:
    w: Learned weight vector
    b: Learned bias term

PROCEDURE:
    1. Initialize weights and bias:
       w ← zeros(n_features)    // or small random values
       b ← 0

    2. FOR epoch = 1 TO max_epochs:
           errors ← 0
           FOR each training sample (xᵢ, yᵢ):
               // Compute prediction
               z ← w · xᵢ + b
               ŷ ← step(z)    // or sign(z)

               // Update only if misclassified
               IF ŷ ≠ yᵢ:
                   // Update rule (Perceptron Learning Rule)
                   // For {0, 1} labels:
                   w ← w + η × (yᵢ - ŷ) × xᵢ
                   b ← b + η × (yᵢ - ŷ)
                   // Equivalently, for {-1, +1} labels:
                   //   w ← w + η × yᵢ × xᵢ
                   //   b ← b + η × yᵢ
                   errors ← errors + 1

           // Check for convergence
           IF errors = 0:
               PRINT "Converged at epoch", epoch
               BREAK

    3. RETURN w, b
```

### Perceptron Prediction
```
ALGORITHM: Perceptron Prediction
─────────────────────────────────────────────────────────────────
INPUT:
    x: New sample of shape (n_features,)
    w: Trained weights
    b: Trained bias

OUTPUT:
    predicted_class: {0, 1} or {-1, +1}

PROCEDURE:
    z ← w · x + b
    IF z ≥ 0:
        RETURN 1    // or +1
    ELSE:
        RETURN 0    // or -1
```

### Pocket Algorithm (Handling Non-Separable Data)
```
ALGORITHM: Pocket Perceptron
─────────────────────────────────────────────────────────────────
// Keeps the best weights found so far in a "pocket"
// Useful when the data is not linearly separable

INPUT:
    X, y: Training data (labels in {-1, +1})
    η: Learning rate
    max_iterations: Maximum number of updates

OUTPUT:
    w_pocket: Best weights found
    b_pocket: Best bias found

PROCEDURE:
    w ← zeros(n_features)
    b ← 0
    w_pocket ← w
    b_pocket ← b
    best_accuracy ← 0

    FOR iteration = 1 TO max_iterations:
        // Find all misclassified samples
        misclassified ← [i for i where predict(xᵢ) ≠ yᵢ]
        IF misclassified is empty:
            BREAK    // Perfect classification

        // Pick a random misclassified point
        i ← random_choice(misclassified)

        // Perceptron update
        w ← w + η × yᵢ × xᵢ
        b ← b + η × yᵢ

        // Keep the new weights only if they are the best so far
        accuracy ← count_correct(X, y, w, b) / n_samples
        IF accuracy > best_accuracy:
            w_pocket ← w
            b_pocket ← b
            best_accuracy ← accuracy

    RETURN w_pocket, b_pocket
```

### Voted Perceptron
```
ALGORITHM: Voted Perceptron
─────────────────────────────────────────────────────────────────
// Stores all weight vectors encountered during training
// and predicts by weighted voting (Freund & Schapire, 1999)

TRAINING:
    weights_list ← empty list
    survival_counts ← empty list
    w ← zeros(n_features)
    b ← 0
    count ← 0

    FOR epoch = 1 TO max_epochs:
        FOR each (xᵢ, yᵢ):
            IF sign(w · xᵢ + b) ≠ yᵢ:
                // Store current weights and their survival count
                APPEND (w, b) to weights_list
                APPEND count to survival_counts
                // Update
                w ← w + yᵢ × xᵢ
                b ← b + yᵢ
                count ← 1
            ELSE:
                count ← count + 1

    // Don't forget the last weights
    APPEND (w, b) to weights_list
    APPEND count to survival_counts

PREDICTION:
    vote ← 0
    FOR i = 1 TO length(weights_list):
        (wᵢ, bᵢ) ← weights_list[i]
        cᵢ ← survival_counts[i]
        vote ← vote + cᵢ × sign(wᵢ · x + bᵢ)
    RETURN sign(vote)
```

## Mathematical Foundation
### Decision Boundary

The perceptron defines a hyperplane in feature space:

$$w \cdot x + b = 0$$

Points on one side of this hyperplane are classified as positive, points on the other side as negative:

- $w \cdot x + b > 0 \Rightarrow$ class 1
- $w \cdot x + b < 0 \Rightarrow$ class 0 (or $-1$)
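Putting the training and prediction pseudocode together with this decision rule, the whole classifier fits in a short NumPy class. This is a purely illustrative sketch (class and parameter names are my own, not a library API):

```python
import numpy as np

class Perceptron:
    """Minimal perceptron for labels in {0, 1}."""

    def __init__(self, learning_rate=1.0, max_epochs=100):
        self.lr = learning_rate
        self.max_epochs = max_epochs

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.w = np.zeros(n_features)
        self.b = 0.0
        for epoch in range(self.max_epochs):
            errors = 0
            for xi, yi in zip(X, y):
                y_hat = 1 if xi @ self.w + self.b >= 0 else 0
                if y_hat != yi:
                    # Perceptron learning rule for {0, 1} labels
                    self.w += self.lr * (yi - y_hat) * xi
                    self.b += self.lr * (yi - y_hat)
                    errors += 1
            if errors == 0:  # converged: a full pass with no mistakes
                break
        return self

    def predict(self, X):
        # Points with w.x + b >= 0 fall on the positive side of the hyperplane
        return np.where(X @ self.w + self.b >= 0, 1, 0)

# Logical OR is linearly separable, so training converges
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1])
model = Perceptron().fit(X, y)
print(model.predict(X))  # [0 1 1 1]
```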
### Convergence Theorem (Novikoff, 1962)

For linearly separable data, the perceptron makes at most

$$k \leq \left(\frac{R}{\gamma}\right)^2$$

mistakes (updates) before converging, where:

- $R = \max_i \|x_i\|$ (radius of the data)
- $\gamma = \min_i y_i (w^* \cdot x_i + b^*)$ (margin of the optimal separator $w^*$, normalized so $\|w^*\| = 1$)
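As a toy numeric illustration of the bound (the data and the unit-norm separator $w^* = (1, 0)$, $b^* = 0$ are my own choices):

```python
import math

# Hypothetical 2-D dataset with labels in {-1, +1}, linearly
# separated by the unit-norm hyperplane w* = (1, 0), b* = 0
points = [((2.0, 1.0), +1), ((1.0, -2.0), +1),
          ((-1.0, 1.0), -1), ((-3.0, 0.0), -1)]

# R: largest norm of any data point
R = max(math.hypot(x1, x2) for (x1, x2), _ in points)

# gamma: smallest signed margin y * (w* . x + b*); here w* . x = x1
gamma = min(y * x1 for (x1, x2), y in points)

bound = (R / gamma) ** 2
print(R, gamma, bound)  # 3.0 1.0 9.0 -> at most 9 mistakes
```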
### Why It Works

Each update moves the decision boundary toward correctly classifying the misclassified point. With $\{-1, +1\}$ labels:

$$w_{new} = w_{old} + y \cdot x$$

- If $y = +1$ and we predicted $-1$: adding $x$ increases $w \cdot x$ by $\|x\|^2 > 0$
- If $y = -1$ and we predicted $+1$: subtracting $x$ decreases $w \cdot x$ by $\|x\|^2$
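A quick numeric check of this property (the vectors are chosen arbitrarily for illustration):

```python
import numpy as np

w = np.array([-0.5, -1.0])
x = np.array([2.0, 1.0])
y = 1  # true label; the current w misclassifies x as negative

score_before = w @ x        # -2.0 -> sign(z) predicts -1
w_new = w + y * x           # perceptron update for {-1, +1} labels
score_after = w_new @ x     #  3.0 -> sign(z) now predicts +1

# The score rises by exactly ||x||^2 = 5.0
print(score_before, score_after, score_after - score_before)
```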
## The XOR Limitation

```
THE XOR PROBLEM:
─────────────────────────────────────────────────────────────────
XOR Truth Table:
    x₁ | x₂ | y
    ───┼────┼───
     0 |  0 | 0
     0 |  1 | 1
     1 |  0 | 1
     1 |  1 | 0

PROBLEM: No single line can separate the classes!
```

This limitation was famously highlighted by Minsky & Papert (1969), contributing to the first "AI Winter." The solution is the multi-layer perceptron: a neural network with hidden layers can solve XOR and other non-linearly separable problems.

## Complexity Analysis
| Operation | Time Complexity | Space Complexity |
|---|---|---|
| One update | $O(m)$ | $O(m)$ |
| Full training | $O(k \cdot m)$ | $O(m)$ |
| Prediction | $O(m)$ | $O(1)$ |
Where $m$ = number of features and $k$ = number of mistakes ($k \leq (R/\gamma)^2$ for separable data).
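The mistake count $k$ can be checked empirically. A sketch that trains on a toy separable dataset of my own choosing and compares the observed mistakes with the $(R/\gamma)^2$ bound:

```python
import numpy as np

# Toy separable dataset with labels in {-1, +1}, chosen for illustration;
# it is separated by the unit-norm hyperplane w* = (1, 0), b* = 0.
X = np.array([[2.0, 1.0], [1.0, -2.0], [-1.0, 1.0], [-3.0, 0.0]])
y = np.array([1, 1, -1, -1])

w, b, mistakes = np.zeros(2), 0.0, 0
for _ in range(100):                       # epochs
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (xi @ w + b) <= 0:         # mistake (ties count as errors)
            w += yi * xi
            b += yi
            errors += 1
            mistakes += 1
    if errors == 0:
        break

R = max(np.linalg.norm(xi) for xi in X)            # radius of the data
gamma = min(yi * xi[0] for xi, yi in zip(X, y))    # margin of w* = (1, 0)
print(mistakes, (R / gamma) ** 2)                  # mistakes <= 9.0
```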
## Key Insights

- **Guaranteed convergence:** For linearly separable data, the algorithm always converges.
- **Online learning:** Weights are updated immediately after each mistake; there is no need to store the whole dataset.
- **Margin dependence:** Convergence speed depends on the margin: the larger the margin, the faster the convergence.
- **Non-separable data:** The basic perceptron oscillates forever; use the Pocket, Voted, or Averaged variants instead.
- **Building block:** A single perceptron is limited, but stacking perceptrons creates powerful neural networks.
- **Feature engineering:** The perceptron can solve non-linear problems if given the right features.
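For example, XOR becomes linearly separable once the product $x_1 \cdot x_2$ is appended as a third feature. A small sketch (the training helper is a hypothetical one-function version of the pseudocode above):

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, max_epochs=100):
    """Train a {0, 1}-label perceptron; returns (w, b). Illustrative only."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            y_hat = 1 if xi @ w + b >= 0 else 0
            if y_hat != yi:
                w += lr * (yi - y_hat) * xi
                b += lr * (yi - y_hat)
                errors += 1
        if errors == 0:
            break
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])  # XOR: not separable in the raw 2-D space

# Engineered feature: append x1 * x2 as a third coordinate
X3 = np.column_stack([X, X[:, 0] * X[:, 1]])

w, b = train_perceptron(X3, y)
pred = np.where(X3 @ w + b >= 0, 1, 0)
print(pred)  # [0 1 1 0]
```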
## References
- Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review.
- Novikoff, A. B. J. (1962). On convergence proofs on perceptrons. Symposium on the Mathematical Theory of Automata.
- Minsky, M., & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press.
- Freund, Y., & Schapire, R. E. (1999). Large Margin Classification Using the Perceptron Algorithm. Machine Learning, 37(3), 277-296.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning, Chapter 4. Springer.