A pure NumPy implementation of Skip-Gram with Negative Sampling (SGNS) for learning word embeddings from scratch.
```bash
# Clone and set up
git clone https://github.com/heisdanielade/word2vec-numpy.git
cd word2vec-numpy
uv sync

# Place a text corpus (any plain text file) at data/corpus.txt
# e.g., WikiText-2 from https://www.kaggle.com/datasets/zusmani/wikitext2

# Train and evaluate
uv run python main.py

# Run tests
uv run pytest tests/ -v
```

```
main.py
│
├─ preprocessing.py    Tokenise → Build vocab → Subsample frequent words
│
├─ data_generator.py   Generate skip-gram (centre, context) pairs
│                      + negative sampling from unigram^(3/4) distribution
│
├─ model.py            Forward: embedding lookup → dot products → sigmoid → SGNS loss
│                      Backward: analytical gradients via chain rule
│
├─ trainer.py          SGD with linear LR decay, sparse weight updates
│
└─ evaluation.py       Cosine similarity search, analogy reasoning, PCA visualisation
```
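The pair generation step in `data_generator.py` can be sketched roughly as follows (function and parameter names here are illustrative, not the repo's actual API):

```python
import numpy as np

def skipgram_pairs(token_ids, max_window=5, seed=0):
    """Generate (centre, context) pairs with a dynamic window:
    for each position, the effective window size is drawn uniformly
    from 1..max_window, so nearer words end up in more pairs overall."""
    rng = np.random.default_rng(seed)
    pairs = []
    for i, centre in enumerate(token_ids):
        w = int(rng.integers(1, max_window + 1))  # dynamic window size
        for j in range(max(0, i - w), min(len(token_ids), i + w + 1)):
            if j != i:
                pairs.append((centre, token_ids[j]))
    return pairs

pairs = skipgram_pairs([0, 1, 2, 3, 4], max_window=2)
```

Sampling the window per position is the same trick the original word2vec C code uses: it is equivalent, in expectation, to weighting context words by their distance from the centre word.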
| Component | Detail |
|---|---|
| Preprocessing | WikiText cleaning, min-frequency filtering, Mikolov subsampling |
| Data generation | Skip-gram pairs with dynamic window, unigram^(3/4) negative sampling |
| Forward pass | Numerically stable sigmoid, SGNS binary cross-entropy loss |
| Backward pass | Hand-derived gradients, verified via numerical gradient check |
| Training | Sparse SGD updates on active rows, linear LR decay |
| Evaluation | Cosine similarity, analogy tests (b − a + c), PCA projection plot |
| Testing | 38 unit tests covering all modules |
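The unigram^(3/4) noise distribution from the table above is a few lines of NumPy. A minimal sketch (names are illustrative, not the repo's actual API):

```python
import numpy as np

def build_noise_dist(counts):
    """Unigram frequencies raised to the 3/4 power (Mikolov et al.),
    which flattens the distribution so rare words are drawn more often
    than their raw frequency would suggest."""
    p = np.asarray(counts, dtype=np.float64) ** 0.75
    return p / p.sum()

def sample_negatives(noise_dist, k, seed=0):
    """Draw k negative word indices from the noise distribution."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(noise_dist), size=k, p=noise_dist)

noise = build_noise_dist([100, 10, 1])
negatives = sample_negatives(noise, k=5)
```

Because 100^0.75 / 10^0.75 ≈ 5.6 rather than 10, a word that is 10× more frequent is only about 5.6× more likely to be sampled as a negative.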
```
Tokens: 500,000
Vocabulary: 9,205 words
After subsampling: 98,053 tokens
Skip-gram pairs: 588,434
Batches per epoch: 1,150

Epoch 1/10 | loss: 3.5006 | lr: 0.022503 | time: 6.8s
Epoch 2/10 | loss: 2.6706 | lr: 0.020004 | time: 7.3s
...
Epoch 10/10 | loss: 2.5778 | lr: 0.000012 | time: 7.4s
```
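The logged learning rates are roughly consistent with word2vec's standard schedule: a per-batch linear decay from an initial rate of about 0.025 down to a small floor. A sketch under that assumption (the actual `trainer.py` constants may differ):

```python
def linear_lr(step, total_steps, lr0=0.025, lr_min=1e-5):
    """Linearly anneal the learning rate from lr0 toward lr_min,
    never letting it fall below the floor."""
    frac = step / total_steps
    return max(lr_min, lr0 * (1.0 - frac))

# With 10 epochs, the rate after epoch 1 is about 0.025 * 0.9 = 0.0225,
# matching the lr ≈ 0.0225 printed in the first log line above.
```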
PCA projection of selected word embeddings after training:
Notable clusters: boy/girl/woman/prince (people), lake/ocean/sea (water bodies), war/battle/military/army (conflict), town/village/country (places).
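A 2-D PCA projection like the one above needs no external library: centre the embedding matrix and project onto the top right-singular vectors. A minimal sketch (not necessarily how `evaluation.py` implements it):

```python
import numpy as np

def pca_project(embeddings, n_components=2):
    """Project embedding rows onto their top principal components
    via SVD of the mean-centred matrix."""
    X = embeddings - embeddings.mean(axis=0)
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:n_components].T

rng = np.random.default_rng(0)
coords = pca_project(rng.normal(size=(9, 50)))  # 9 words, 50-d embeddings → (9, 2)
```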
- **Sparse SGD**: updates only the embedding rows involved in each batch (via `np.add.at`), matching the original word2vec's per-pair update semantics. Full-matrix gradients stall training because the 1/B averaging makes updates vanishingly small.
- **Two embedding matrices**: `W_in` (centre embeddings) is initialised with small uniform values; `W_out` (context embeddings) is initialised to zero. After training, `W_in` is used as the final word embeddings.
- **Numerically stable sigmoid**: splits computation into positive and negative branches to avoid overflow in `exp()`.
- **Dynamic context window**: window size is randomly sampled per position, giving closer words higher effective weight across the dataset.
- Python ≥ 3.12
- NumPy (core computation)
- Matplotlib (embedding visualisation)
- Pytest (testing, dev dependency)
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NeurIPS 2013
