
Word2Vec-NumPy

A pure NumPy implementation of Skip-Gram with Negative Sampling (SGNS) for learning word embeddings from scratch.

Quickstart

```bash
# Clone and set up
git clone https://github.com/heisdanielade/word2vec-numpy.git
cd word2vec-numpy
uv sync

# Place a text corpus (any plain text file) at data/corpus.txt
# e.g., WikiText-2 from https://www.kaggle.com/datasets/zusmani/wikitext2

# Train and evaluate
uv run python main.py

# Run tests
uv run pytest tests/ -v
```

Architecture

```
main.py
  │
  ├─ preprocessing.py    Tokenise → Build vocab → Subsample frequent words
  │
  ├─ data_generator.py   Generate skip-gram (centre, context) pairs
  │                      + negative sampling from unigram^(3/4) distribution
  │
  ├─ model.py            Forward:  embedding lookup → dot products → sigmoid → SGNS loss
  │                      Backward: analytical gradients via chain rule
  │
  ├─ trainer.py          SGD with linear LR decay, sparse weight updates
  │
  └─ evaluation.py       Cosine similarity search, analogy reasoning, PCA visualisation
```
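The forward pass above (embedding lookup → dot products → sigmoid → SGNS loss) can be sketched in a few lines. The function names and shapes below are illustrative, not the repo's actual `model.py` API:

```python
import numpy as np

def log_sigmoid(x):
    """Stable log σ(x): only ever exponentiates a non-positive argument."""
    return np.where(x >= 0,
                    -np.log1p(np.exp(-np.abs(x))),
                    x - np.log1p(np.exp(-np.abs(x))))

def sgns_loss(W_in, W_out, centre, context, negatives):
    """SGNS loss for one (centre, context) pair with k negative samples.

    Hypothetical signature; the repo's model.py interface may differ.
    """
    v_c = W_in[centre]                       # centre embedding, shape (d,)
    pos = log_sigmoid(W_out[context] @ v_c)  # push σ(u_o · v_c) toward 1
    neg = log_sigmoid(-(W_out[negatives] @ v_c)).sum()  # push negatives toward 0
    return -(pos + neg)
```

The split in `log_sigmoid` is what keeps the loss finite even for very large dot products, since `exp()` is never called on a large positive number.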

What's Implemented

| Component | Detail |
| --- | --- |
| Preprocessing | WikiText cleaning, min-frequency filtering, Mikolov subsampling |
| Data generation | Skip-gram pairs with dynamic window, unigram^(3/4) negative sampling |
| Forward pass | Numerically stable sigmoid, SGNS binary cross-entropy loss |
| Backward pass | Hand-derived gradients, verified via numerical gradient check |
| Training | Sparse SGD updates on active rows, linear LR decay |
| Evaluation | Cosine similarity, analogy tests (b − a + c), PCA projection plot |
| Testing | 38 unit tests covering all modules |
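The unigram^(3/4) negative-sampling distribution in the table can be sketched directly in NumPy. The counts below are toy values, not from the repo's corpus:

```python
import numpy as np

# Toy word counts (not from the repo's corpus)
counts = np.array([100.0, 10.0, 1.0])

raw = counts / counts.sum()     # raw unigram distribution
probs = counts ** 0.75          # the 3/4 power flattens the distribution,
probs /= probs.sum()            # so rare words are drawn as negatives more often

rng = np.random.default_rng(0)
negatives = rng.choice(len(counts), size=5, p=probs)  # sample 5 negative word ids
```

Compared with raw frequencies, the powered distribution boosts rare words and damps the most frequent ones, which is the behaviour the original word2vec paper found to work best empirically.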

Training Output

```
Tokens:            500,000
Vocabulary:          9,205 words
After subsampling:  98,053 tokens
Skip-gram pairs:   588,434
Batches per epoch:   1,150

Epoch  1/10 | loss: 3.5006 | lr: 0.022503 | time:  6.8s
Epoch  2/10 | loss: 2.6706 | lr: 0.020004 | time:  7.3s
  ...
Epoch 10/10 | loss: 2.5778 | lr: 0.000012 | time:  7.4s
```

Embedding Visualisation

PCA projection of selected word embeddings after training:

*(figure: Word Embeddings — PCA Projection)*

Notable clusters: boy/girl/woman/prince (people), lake/ocean/sea (water bodies), war/battle/military/army (conflict), town/village/country (places).
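The cosine-similarity search and analogy test (b − a + c) used for evaluation can be sketched as follows. `most_similar` and the toy 2-D vectors are illustrative, not the repo's `evaluation.py` API:

```python
import numpy as np

def most_similar(W, query_vec, exclude, topn=3):
    """Rank rows of W by cosine similarity to query_vec (hypothetical helper)."""
    norms = np.linalg.norm(W, axis=1) * np.linalg.norm(query_vec)
    sims = (W @ query_vec) / np.maximum(norms, 1e-12)
    sims[list(exclude)] = -np.inf        # never return the query words themselves
    return np.argsort(-sims)[:topn]

# Toy 2-D embeddings: rows = man, woman, king, queen (made-up vectors)
W = np.array([[1., 0.], [0., 1.], [1., 2.], [0., 2.]])
a, b, c = 0, 2, 1                        # analogy "b - a + c": king - man + woman
query = W[b] - W[a] + W[c]
answer = most_similar(W, query, exclude={a, b, c})[0]  # index 3 (queen)
```

Excluding the three query words is important: the nearest neighbour of `b − a + c` is very often `b` itself.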

Key Design Decisions

  • Sparse SGD — Updates only the embedding rows involved in each batch (via np.add.at), matching the original word2vec's per-pair update semantics. Full-matrix gradients stall training because the 1/B averaging makes updates vanishingly small.
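A minimal sketch of the sparse update (toy shapes; not the trainer's actual code). The key point is that `np.add.at` accumulates gradients for repeated indices, whereas fancy-index `+=` would apply the gradient for a repeated row only once:

```python
import numpy as np

rng = np.random.default_rng(0)
W_in = rng.uniform(-0.05, 0.05, size=(10, 4))  # toy embedding matrix
W_before = W_in.copy()                          # kept only to inspect the update

centres = np.array([2, 2, 7])   # centre ids in a batch; note the repeated 2
grads = np.full((3, 4), 0.1)    # per-pair gradients for those rows
lr = 0.01

# np.add.at accumulates both contributions to row 2, matching per-pair SGD;
# W_in[centres] -= lr * grads would silently drop one of them.
np.add.at(W_in, centres, -lr * grads)
```

After this step only rows 2 and 7 have moved, and row 2 has received twice the single-pair update.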

  • Two embedding matrices — W_in (centre embeddings) initialised with small uniform values; W_out (context embeddings) initialised to zero. After training, W_in is used as the final word embeddings.

  • Numerically stable sigmoid — Splits computation into positive and negative branches to avoid overflow in exp().
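One common way to implement this split (illustrative; the repo's exact implementation may differ):

```python
import numpy as np

def stable_sigmoid(x):
    """Piecewise sigmoid that never calls exp() on a large positive argument."""
    out = np.empty_like(x, dtype=np.float64)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))  # exp of a non-positive number
    e = np.exp(x[~pos])                        # x < 0, so exp stays in (0, 1)
    out[~pos] = e / (1.0 + e)
    return out
```

A naive `1 / (1 + np.exp(-x))` overflows (and warns) for large negative `x`; the branched form keeps every `exp()` argument non-positive.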

  • Dynamic context window — Window size is randomly sampled per position, giving closer words higher effective weight across the dataset.
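A sketch of pair generation with a per-position window (toy token ids; the variable names are not from the repo's data_generator.py):

```python
import numpy as np

rng = np.random.default_rng(0)
max_window = 5
tokens = list(range(20))   # toy token ids

pairs = []
for i in range(len(tokens)):
    w = int(rng.integers(1, max_window + 1))  # per-position window in [1, max]
    for j in range(max(0, i - w), min(len(tokens), i + w + 1)):
        if j != i:
            pairs.append((tokens[i], tokens[j]))
# Distance-1 neighbours appear for every sampled w, distance-5 ones only when
# w == 5, so nearer words get higher effective weight across the corpus.
```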

Dependencies

  • Python ≥ 3.12
  • NumPy (core computation)
  • Matplotlib (embedding visualisation)
  • Pytest (testing, dev dependency)

References

  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781
  • Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NeurIPS 2013
