A pure NumPy implementation of Skip-Gram with Negative Sampling (SGNS) for learning word embeddings from scratch.
```bash
# Clone and set up
git clone https://github.com/heisdanielade/word2vec-numpy.git
cd word2vec-numpy
uv sync

# Place a text corpus (any plain text file) at data/corpus.txt
# e.g., WikiText-2 from https://www.kaggle.com/datasets/zusmani/wikitext2

# Train and evaluate
uv run python main.py

# Run tests
uv run pytest tests/ -v
```

```
main.py
│
├─ preprocessing.py    Tokenise → Build vocab → Subsample frequent words
│
├─ data_generator.py   Generate skip-gram (centre, context) pairs
│                      + negative sampling from unigram^(3/4) distribution
│
├─ model.py            Forward: embedding lookup → dot products → sigmoid → SGNS loss
│                      Backward: analytical gradients via chain rule
│
├─ trainer.py          SGD with linear LR decay, sparse weight updates
│
└─ evaluation.py       Cosine similarity search, analogy reasoning, PCA visualisation
```
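The pair generation step in `data_generator.py` can be sketched roughly as follows (function and parameter names here are illustrative, not the repo's actual API):

```python
import numpy as np

def skipgram_pairs(token_ids, max_window=5, seed=0):
    """Generate (centre, context) pairs with a dynamic window:
    for each position, the effective window size is drawn uniformly
    from 1..max_window, so nearer words end up in more pairs overall."""
    rng = np.random.default_rng(seed)
    pairs = []
    for i, centre in enumerate(token_ids):
        w = int(rng.integers(1, max_window + 1))  # dynamic window size
        for j in range(max(0, i - w), min(len(token_ids), i + w + 1)):
            if j != i:
                pairs.append((centre, token_ids[j]))
    return pairs

pairs = skipgram_pairs([0, 1, 2, 3, 4], max_window=2)
```

Sampling the window per position is the same trick the original word2vec C code uses: it is equivalent, in expectation, to weighting context words by their distance from the centre word.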
| Component | Detail |
|---|---|
| Preprocessing | WikiText cleaning, min-frequency filtering, Mikolov subsampling |
| Data generation | Skip-gram pairs with dynamic window, unigram^(3/4) negative sampling |
| Forward pass | Numerically stable sigmoid, SGNS binary cross-entropy loss |
| Backward pass | Hand-derived gradients, verified via numerical gradient check |
| Training | Sparse SGD updates on active rows, linear LR decay |
| Evaluation | Cosine similarity, analogy tests (b − a + c), PCA projection plot |
| Testing | 38 unit tests covering all modules |
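The unigram^(3/4) noise distribution from the table above is a few lines of NumPy. A minimal sketch (names are illustrative, not the repo's actual API):

```python
import numpy as np

def build_noise_dist(counts):
    """Unigram frequencies raised to the 3/4 power (Mikolov et al.),
    which flattens the distribution so rare words are drawn more often
    than their raw frequency would suggest."""
    p = np.asarray(counts, dtype=np.float64) ** 0.75
    return p / p.sum()

def sample_negatives(noise_dist, k, seed=0):
    """Draw k negative word indices from the noise distribution."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(noise_dist), size=k, p=noise_dist)

noise = build_noise_dist([100, 10, 1])
negatives = sample_negatives(noise, k=5)
```

Because 100^0.75 / 10^0.75 ≈ 5.6 rather than 10, a word that is 10× more frequent is only about 5.6× more likely to be sampled as a negative.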
```
Tokens: 500,000
Vocabulary: 9,205 words
After subsampling: 98,053 tokens
Skip-gram pairs: 588,434
Batches per epoch: 1,150

Epoch 1/10 | loss: 3.5006 | lr: 0.022503 | time: 6.8s
Epoch 2/10 | loss: 2.6706 | lr: 0.020004 | time: 7.3s
...
Epoch 10/10 | loss: 2.5778 | lr: 0.000012 | time: 7.4s
```
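The logged learning rates are roughly consistent with word2vec's standard schedule: a per-batch linear decay from an initial rate of about 0.025 down to a small floor. A sketch under that assumption (the actual `trainer.py` constants may differ):

```python
def linear_lr(step, total_steps, lr0=0.025, lr_min=1e-5):
    """Linearly anneal the learning rate from lr0 toward lr_min,
    never letting it fall below the floor."""
    frac = step / total_steps
    return max(lr_min, lr0 * (1.0 - frac))

# With 10 epochs, the rate after epoch 1 is about 0.025 * 0.9 = 0.0225,
# matching the lr ≈ 0.0225 printed in the first log line above.
```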
PCA projection of selected word embeddings after training:
Notable clusters: boy/girl/woman/prince (people), lake/ocean/sea (water bodies), war/battle/military/army (conflict), town/village/country (places).
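A 2-D PCA projection like the one above needs no external library: centre the embedding matrix and project onto the top right-singular vectors. A minimal sketch (not necessarily how `evaluation.py` implements it):

```python
import numpy as np

def pca_project(embeddings, n_components=2):
    """Project embedding rows onto their top principal components
    via SVD of the mean-centred matrix."""
    X = embeddings - embeddings.mean(axis=0)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T

rng = np.random.default_rng(0)
coords = pca_project(rng.normal(size=(9, 50)))  # 9 words, 50-d embeddings → (9, 2)
```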
- **Sparse SGD**: updates only the embedding rows involved in each batch (via `np.add.at`), matching the original word2vec's per-pair update semantics. Full-matrix gradients stall training because the 1/B averaging makes updates vanishingly small.
- **Two embedding matrices**: `W_in` (centre embeddings) is initialised with small uniform values; `W_out` (context embeddings) is initialised to zero. After training, `W_in` is used as the final word embeddings.
- **Numerically stable sigmoid**: splits computation into positive and negative branches to avoid overflow in `exp()`.
- **Dynamic context window**: window size is randomly sampled per position, giving closer words higher effective weight across the dataset.
- Python ≥ 3.12
- NumPy (core computation)
- Matplotlib (embedding visualisation)
- Pytest (testing, dev dependency)
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NeurIPS 2013
