Word2Vec Algorithm Pseudocode
Word2Vec, introduced by Mikolov et al. at Google in 2013, learns dense vector representations (embeddings) of words from large text corpora. It captures semantic relationships so that similar words have similar vectors, enabling vector arithmetic like: king - man + woman ≈ queen.
The Algorithm
Word2Vec Architectures Overview
WORD2VEC: Two Architectures
─────────────────────────────────────────────────────────────────
1. SKIP-GRAM:
   Given a target word, predict surrounding context words.

   "The quick brown fox jumps"
              ↓
   Target:  "brown"
   Predict: ["quick", "fox"]   (window size 1)

   Best for: Infrequent words, smaller datasets

2. CBOW (Continuous Bag of Words):
   Given surrounding context words, predict the target word.

   "The quick ___ fox jumps"
              ↓
   Context: ["quick", "fox"]
   Predict: "brown"

   Best for: Frequent words, larger datasets
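Both architectures consume the same windowed slices of each sentence; only the prediction direction differs. A minimal sketch of the pair/example extraction (function names are illustrative, not from the original word2vec code):

```python
def skipgram_pairs(sentence, window):
    """Skip-gram: one (target, context) pair per context word in the window."""
    pairs = []
    for t, target in enumerate(sentence):
        for j in range(max(0, t - window), min(len(sentence), t + window + 1)):
            if j != t:
                pairs.append((target, sentence[j]))
    return pairs

def cbow_examples(sentence, window):
    """CBOW: one (context_list, target) example per sentence position."""
    examples = []
    for t, target in enumerate(sentence):
        context = [sentence[j]
                   for j in range(max(0, t - window), min(len(sentence), t + window + 1))
                   if j != t]
        if context:
            examples.append((context, target))
    return examples

sentence = ["the", "quick", "brown", "fox", "jumps"]
print(skipgram_pairs(sentence, 1))
print(cbow_examples(sentence, 1))
```

With window size 1, the target "brown" yields skip-gram pairs ("brown", "quick") and ("brown", "fox"), and the CBOW example (["quick", "fox"], "brown"), matching the diagram above.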
Skip-Gram Model

ALGORITHM: Skip-Gram Training (Basic Version)
─────────────────────────────────────────────────────────────────
INPUT:
    corpus: List of sentences (lists of word tokens)
    V: Vocabulary size
    d: Embedding dimension (typically 100-300)
    window: Context window size
    learning_rate: η

OUTPUT:
    W_in: Input embedding matrix (V × d) - these are the word vectors
    W_out: Output embedding matrix (d × V)

PROCEDURE:
1. // Initialize embeddings randomly
   W_in ← random_uniform(-0.5/d, 0.5/d, shape=(V, d))
   W_out ← zeros(d, V)

2. // Build vocabulary and word-to-index mapping
   word_to_idx ← {word: i for i, word in enumerate(vocabulary)}

3. // Training loop
   FOR each sentence in corpus:
       FOR t = 0 TO length(sentence) - 1:
           target_word ← sentence[t]
           target_idx ← word_to_idx[target_word]

           // Get context words within window
           FOR j = max(0, t - window) TO min(length(sentence) - 1, t + window):
               IF j ≠ t:
                   context_word ← sentence[j]
                   context_idx ← word_to_idx[context_word]

                   // Forward pass: predict context from target
                   v_target ← W_in[target_idx]       // (d,)
                   scores ← W_out.T · v_target       // (V,)
                   probs ← softmax(scores)           // (V,)

                   // Backward pass: gradient of cross-entropy loss
                   error ← probs.copy()
                   error[context_idx] -= 1           // ŷ - y

                   // Compute input gradient BEFORE updating W_out
                   grad_input ← W_out · error

                   // Update output weights
                   W_out ← W_out - η × outer(v_target, error)

                   // Update input embedding
                   W_in[target_idx] ← W_in[target_idx] - η × grad_input

4. RETURN W_in, W_out   // Usually only W_in is used
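One skip-gram update with a full softmax can be sketched in numpy as a sanity check (toy sizes and illustrative names, not the original implementation). With W_out initialized to zeros the first prediction is uniform, so the initial loss on any pair is ln V, and repeated updates on one (target, context) pair should drive that pair's loss down:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, lr = 10, 4, 0.1                        # toy vocabulary and dimension
W_in = rng.uniform(-0.5 / d, 0.5 / d, size=(V, d))
W_out = np.zeros((d, V))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def skipgram_step(target_idx, context_idx):
    global W_in, W_out
    v_target = W_in[target_idx]              # (d,)
    probs = softmax(W_out.T @ v_target)      # (V,)
    error = probs.copy()
    error[context_idx] -= 1.0                # ŷ - y
    grad_input = W_out @ error               # taken before W_out changes
    W_out -= lr * np.outer(v_target, error)
    W_in[target_idx] -= lr * grad_input

loss_before = -np.log(softmax(W_out.T @ W_in[3])[5])
for _ in range(100):
    skipgram_step(3, 5)
loss_after = -np.log(softmax(W_out.T @ W_in[3])[5])
print(loss_before, loss_after)
```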
CBOW Model

ALGORITHM: CBOW Training (Basic Version)
─────────────────────────────────────────────────────────────────
INPUT:
    corpus: List of sentences
    V: Vocabulary size
    d: Embedding dimension
    window: Context window size
    learning_rate: η

OUTPUT:
    W_in: Input embedding matrix (V × d)
    W_out: Output embedding matrix (d × V)

PROCEDURE:
1. // Initialize embeddings
   W_in ← random_uniform(-0.5/d, 0.5/d, shape=(V, d))
   W_out ← zeros(d, V)

2. // Training loop
   FOR each sentence in corpus:
       FOR t = 0 TO length(sentence) - 1:
           target_word ← sentence[t]
           target_idx ← word_to_idx[target_word]

           // Collect context word indices
           context_indices ← []
           FOR j = max(0, t - window) TO min(length(sentence) - 1, t + window):
               IF j ≠ t:
                   context_indices.append(word_to_idx[sentence[j]])
           IF context_indices is empty:
               CONTINUE

           // Forward pass: predict target from averaged context (CBOW)
           context_vectors ← [W_in[idx] for idx in context_indices]
           v_context ← mean(context_vectors, axis=0)   // (d,)
           scores ← W_out.T · v_context                // (V,)
           probs ← softmax(scores)

           // Backward pass
           error ← probs.copy()
           error[target_idx] -= 1

           // Compute input gradient BEFORE updating W_out
           grad_input ← W_out · error

           // Update output weights
           W_out ← W_out - η × outer(v_context, error)

           // Update input embeddings (distribute gradient across context words)
           FOR idx in context_indices:
               W_in[idx] ← W_in[idx] - η × (grad_input / len(context_indices))

3. RETURN W_in, W_out
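A matching numpy sketch of one CBOW update (again toy sizes, not the original code): the context vectors are averaged for the forward pass, and the input gradient is split evenly back across the context words. Repeated updates on one example should raise the predicted probability of its target well above the uniform 1/V:

```python
import numpy as np

rng = np.random.default_rng(1)
V, d, lr = 8, 3, 0.1
W_in = rng.uniform(-0.5 / d, 0.5 / d, size=(V, d))
W_out = np.zeros((d, V))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cbow_step(context_indices, target_idx):
    global W_in, W_out
    v_context = W_in[context_indices].mean(axis=0)   # average context vectors
    probs = softmax(W_out.T @ v_context)
    error = probs.copy()
    error[target_idx] -= 1.0
    grad_input = W_out @ error                       # before updating W_out
    W_out -= lr * np.outer(v_context, error)
    for idx in context_indices:                      # split gradient evenly
        W_in[idx] -= lr * grad_input / len(context_indices)

for _ in range(300):
    cbow_step([1, 4], 2)
p_target = softmax(W_out.T @ W_in[[1, 4]].mean(axis=0))[2]
print(p_target)
```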
Negative Sampling (Efficient Training)

ALGORITHM: Skip-Gram with Negative Sampling (SGNS)
─────────────────────────────────────────────────────────────────
// Full softmax over V words is expensive (O(V) per training example).
// Negative sampling approximates it with O(k+1) operations.

INPUT:
    corpus: Text corpus
    V, d, window: As before
    k: Number of negative samples (typically 5-20)
    learning_rate: η

PROCEDURE:
1. // Build noise distribution for negative sampling
   // P(w) ∝ count(w)^0.75 (flattens the unigram distribution, so rare
   // words are sampled relatively more often)
   word_counts ← count occurrences of each word
   noise_dist ← (word_counts ** 0.75) / sum(word_counts ** 0.75)

2. // Initialize embeddings
   W_target ← random_uniform(-0.5/d, 0.5/d, shape=(V, d))
   W_context ← zeros(V, d)

3. // Training loop
   FOR each sentence in corpus:
       FOR each (target, context) word pair from sentence:
           target_idx ← word_to_idx[target]
           context_idx ← word_to_idx[context]

           // Sample k negative words
           neg_samples ← []
           WHILE len(neg_samples) < k:
               neg_idx ← sample from noise_dist
               IF neg_idx ≠ context_idx:
                   neg_samples.append(neg_idx)

           // Get vectors
           v_target ← W_target[target_idx]
           v_context ← W_context[context_idx]

           // ===== POSITIVE EXAMPLE =====
           // Maximize log σ(v_target · v_context)
           score_pos ← dot(v_target, v_context)
           sigmoid_pos ← σ(score_pos)

           // Gradients for the positive pair
           grad_target ← (sigmoid_pos - 1) × v_context
           grad_context ← (sigmoid_pos - 1) × v_target

           // ===== NEGATIVE EXAMPLES =====
           // Maximize log σ(-v_target · v_negative)
           FOR neg_idx in neg_samples:
               v_neg ← W_context[neg_idx]
               score_neg ← dot(v_target, v_neg)
               sigmoid_neg ← σ(score_neg)
               grad_target ← grad_target + sigmoid_neg × v_neg
               W_context[neg_idx] ← W_context[neg_idx] - η × sigmoid_neg × v_target

           // Update embeddings
           W_target[target_idx] ← W_target[target_idx] - η × grad_target
           W_context[context_idx] ← W_context[context_idx] - η × grad_context

4. // Final embeddings: often average W_target and W_context
   RETURN W_target   // or (W_target + W_context) / 2
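The SGNS update above can be sketched in numpy. The corpus counts here are random toy values just to build a noise distribution; repeatedly training on one positive pair should push σ(v_target · v_context) for that pair well above its initial 0.5:

```python
import numpy as np

rng = np.random.default_rng(2)
V, d, k, lr = 12, 4, 3, 0.05
W_target = rng.uniform(-0.5 / d, 0.5 / d, size=(V, d))
W_context = np.zeros((V, d))

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

counts = rng.integers(1, 100, size=V).astype(float)  # toy word counts
noise_dist = counts ** 0.75
noise_dist /= noise_dist.sum()

def sgns_step(target_idx, context_idx):
    neg = []                                  # draw k negatives from noise_dist
    while len(neg) < k:
        s = rng.choice(V, p=noise_dist)
        if s != context_idx:
            neg.append(s)
    v_t = W_target[target_idx]
    v_c = W_context[context_idx]
    g_pos = sigmoid(v_t @ v_c) - 1.0          # positive-pair gradient scale
    grad_t = g_pos * v_c
    grad_c = g_pos * v_t
    for n in neg:                             # push negatives away from v_t
        g_neg = sigmoid(v_t @ W_context[n])
        grad_t += g_neg * W_context[n]
        W_context[n] -= lr * g_neg * v_t
    W_target[target_idx] -= lr * grad_t
    W_context[context_idx] -= lr * grad_c

for _ in range(500):
    sgns_step(0, 7)
p_pos = sigmoid(W_target[0] @ W_context[7])
print(p_pos)
```

Note that only k + 1 rows of W_context are touched per pair, which is the entire source of the speedup over the full softmax.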
Subsampling of Frequent Words

ALGORITHM: Subsampling
─────────────────────────────────────────────────────────────────
// Very frequent words like "the", "a", "is" carry little information
// and appear in too many contexts, so they are randomly discarded.

INPUT:
    sentence: List of words
    word_freqs: Relative frequency f(w) of each word in the corpus
    threshold: t (typically 10⁻⁵)

OUTPUT:
    subsampled_sentence: Sentence with some frequent words removed

PROCEDURE:
subsampled_sentence ← []
FOR word in sentence:
    freq ← word_freqs[word]
    // Probability of keeping the word:
    // higher frequency → lower probability of keeping
    p_keep ← (√(freq / t) + 1) × (t / freq)
    p_keep ← min(p_keep, 1.0)
    IF random() < p_keep:
        subsampled_sentence.append(word)
RETURN subsampled_sentence
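The keep-probability rule is easy to check numerically. A minimal sketch (frequencies are relative, i.e. fractions of the corpus; the example numbers are made up): a word at 5% corpus frequency is kept only about 1.4% of the time at t = 10⁻⁵, while any word rarer than the threshold is always kept.

```python
import math
import random

def p_keep(freq, t=1e-5):
    """P(keep word) = (sqrt(f/t) + 1) * (t/f), capped at 1."""
    return min((math.sqrt(freq / t) + 1) * (t / freq), 1.0)

def subsample(sentence, word_freqs, t=1e-5, rng=random.Random(0)):
    return [w for w in sentence if rng.random() < p_keep(word_freqs[w], t)]

print(p_keep(0.05))    # very frequent word: kept rarely
print(p_keep(1e-6))    # rare word: always kept
```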
Hierarchical Softmax (Alternative to Negative Sampling)

ALGORITHM: Hierarchical Softmax
─────────────────────────────────────────────────────────────────
// Uses a binary tree (Huffman coding) to reduce softmax from O(V) to O(log V)

STRUCTURE:
- Build a Huffman tree from word frequencies
- Each word is a leaf
- Path from root to word = sequence of binary decisions
- Each internal node has a parameter vector

PREDICTION:
    P(word | context) = ∏ σ(±nᵢ · v_context)
    where the path from root to word passes through internal nodes n₁, n₂, ..., n_L
    and the sign depends on whether the path branches left (-) or right (+) at each node

COMPLEXITY:
    O(log V) per training example (vs O(V) for full softmax)

TRADE-OFF:
- More complex to implement
- Works well for infrequent words
- Negative sampling often preferred in practice
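The path-product probability can be sketched directly, assuming the Huffman path (node vectors and branch signs) has already been built; the vectors below are invented toy values. Because σ(s) + σ(-s) = 1 at every internal node, the probabilities over all leaves sum to exactly 1:

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def hs_probability(path_node_vectors, path_signs, v_context):
    """P(word | context) = prod_i sigma(sign_i * (n_i . v_context))."""
    p = 1.0
    for n, sign in zip(path_node_vectors, path_signs):
        p *= sigmoid(sign * (n @ v_context))
    return p

rng = np.random.default_rng(3)
d = 4
nodes = [rng.normal(size=d) for _ in range(3)]   # ~log2(V) internal nodes
signs = [+1, -1, +1]                              # right, left, right
v = rng.normal(size=d)
p = hs_probability(nodes, signs, v)
print(p)
```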
Mathematical Foundation

Skip-Gram Objective
For a corpus of words $w_1, w_2, \ldots, w_T$:
$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c,\, j \neq 0} \log P(w_{t+j} \mid w_t)$$

where $c$ is the context window size.
Softmax Probability
$$P(w_O \mid w_I) = \frac{\exp(v'_{w_O} \cdot v_{w_I})}{\sum_{w=1}^{V} \exp(v'_w \cdot v_{w_I})}$$

Negative Sampling Objective
Instead of full softmax, maximize:
$$\log \sigma(v'_{w_O} \cdot v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[\log \sigma(-v'_{w_i} \cdot v_{w_I})\right]$$

where $P_n(w) \propto \text{count}(w)^{0.75}$ is the noise distribution.
Complexity Analysis
| Variant | Time per Example | Space |
|---|---|---|
| Full Softmax | $O(V \cdot d)$ | $O(V \cdot d)$ |
| Negative Sampling | $O((k+1) \cdot d)$ | $O(V \cdot d)$ |
| Hierarchical Softmax | $O(\log V \cdot d)$ | $O(V \cdot d)$ |
Where $V$ = vocabulary size, $d$ = embedding dimension, $k$ = negative samples.
Word Analogies
WORD ANALOGIES: The Famous Result
─────────────────────────────────────────────────────────────────
The learned embeddings capture semantic relationships:

vec("king") - vec("man") + vec("woman") ≈ vec("queen")
vec("Paris") - vec("France") + vec("Germany") ≈ vec("Berlin")
vec("walking") - vec("walk") + vec("swim") ≈ vec("swimming")

ANALOGY SOLVING:
Given: a is to b as c is to ?
Find:  argmax_d cos_similarity(vec(b) - vec(a) + vec(c), vec(d))
       where d ∉ {a, b, c}
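The analogy rule is a one-line argmax over cosine similarity, excluding the three query words. A self-contained sketch with tiny made-up embeddings, constructed so that vec("queen") equals vec("king") - vec("man") + vec("woman"):

```python
import numpy as np

def solve_analogy(emb, a, b, c):
    """a is to b as c is to ?  ->  argmax_d cos(vec(b) - vec(a) + vec(c), vec(d))"""
    query = emb[b] - emb[a] + emb[c]
    best, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):                # exclude the query words
            continue
        sim = query @ vec / (np.linalg.norm(query) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.1, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "apple": np.array([0.1, 0.9, 0.2]),
}
print(solve_analogy(emb, "man", "king", "woman"))
```

Excluding {a, b, c} matters in practice: without it, the nearest neighbor of the query vector is often b or c itself.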
Key Insights

Distributional Hypothesis: Words appearing in similar contexts have similar meanings; Word2Vec operationalizes this.
Dense vs Sparse: Word2Vec produces dense 100-300 dimensional vectors vs sparse one-hot (V dimensions).
Transfer Learning: Pre-trained embeddings can boost performance on downstream NLP tasks.
Negative Sampling: A brilliant computational trick that makes training tractable on large vocabularies.
Subsampling: Reducing frequent word occurrences improves embedding quality for rare words.
Linear Structure: Semantic relationships are captured as linear translations in vector space.
References
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. ICLR Workshop. arXiv:1301.3781
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NeurIPS.
- Goldberg, Y., & Levy, O. (2014). word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method. arXiv:1402.3722
- Rong, X. (2014). word2vec Parameter Learning Explained. arXiv:1411.2738
- Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing, Chapter 6. https://web.stanford.edu/~jurafsky/slp3/