Details: this project focuses on the foundational infrastructure of LLM pretraining, specifically the creation of a highly optimized tokenizer. The team will build a tokenizer from scratch that mimics the core functionality of tiktoken (encoding and decoding), prioritizing low latency. The core implementation will be written in C++ and exposed to Python through bindings.
Team Size: 2-3
Progression
- Locate a deduplicated dataset suitable for pretraining.
- Develop a Byte Pair Encoding (BPE) tokenizer in C++ that stores merges and provides fast mapping between new tokens and the original byte sequences. Merges should be stored and written in a format similar to the tokenizer format examples in the reference section (tokenizer.json).
- Validate correctness using Google Test and profile latency using Google Benchmark. (Because BPE is deterministic, the tokenizer should perform the same set of merges as the tiktoken library.)
- Implement linked-list data structures to asymptotically speed up the tokenization process (reference: Fast MinBPE).
- Implement batching to process data streams without loading the entire dataset into memory.
- Create bindings using pybind11 to export the module to Python.
- If there is time, also try to implement SentencePiece, or further optimize for benchmark latency.
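The pybind11 step might look like the following binding sketch. It requires pybind11 to build (via CMake's `pybind11_add_module`), and the `Tokenizer` class, its methods, and the `fast_bpe` module name are all assumptions about the eventual C++ API, not an existing interface.

```cpp
// Hypothetical binding sketch; Tokenizer's methods would be
// implemented in the core C++ library.
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>

#include <string>
#include <vector>

class Tokenizer {
public:
    std::vector<int> encode(const std::string& text) const;
    std::string decode(const std::vector<int>& ids) const;
};

PYBIND11_MODULE(fast_bpe, m) {
    m.doc() = "BPE tokenizer implemented in C++";
    pybind11::class_<Tokenizer>(m, "Tokenizer")
        .def(pybind11::init<>())
        .def("encode", &Tokenizer::encode)
        .def("decode", &Tokenizer::decode);
}
```

After building, Python code would simply do `import fast_bpe` and call `fast_bpe.Tokenizer().encode(...)`, which keeps the hot path in C++ while exposing a Pythonic surface.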
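The merge-training step described above could look like the following minimal sketch: greedily count adjacent token pairs over raw bytes, replace the most frequent pair with a new token id, and record each merge in order. (`train_bpe` and the `Merge` struct are illustrative names, not part of any existing API.)

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Pair = std::pair<int, int>;

struct Merge {
    Pair pair;   // the two tokens being merged
    int new_id;  // id assigned to the merged token
};

// Train up to `num_merges` merges on `text` and return them in order.
// Token ids 0-255 are raw bytes; merged tokens start at 256.
std::vector<Merge> train_bpe(const std::string& text, int num_merges) {
    std::vector<int> tokens(text.begin(), text.end());
    for (auto& t : tokens) t &= 0xFF;  // treat input as raw bytes
    std::vector<Merge> merges;
    int next_id = 256;
    for (int step = 0; step < num_merges; ++step) {
        // Count adjacent pairs.
        std::map<Pair, int> counts;
        for (size_t i = 0; i + 1 < tokens.size(); ++i)
            ++counts[{tokens[i], tokens[i + 1]}];
        if (counts.empty()) break;
        // Find the most frequent pair.
        auto best = counts.begin();
        for (auto it = counts.begin(); it != counts.end(); ++it)
            if (it->second > best->second) best = it;
        if (best->second < 2) break;  // no pair worth merging
        // Replace every occurrence of the pair, left to right.
        std::vector<int> out;
        for (size_t i = 0; i < tokens.size(); ++i) {
            if (i + 1 < tokens.size() &&
                Pair{tokens[i], tokens[i + 1]} == best->first) {
                out.push_back(next_id);
                ++i;  // the pair consumed two tokens
            } else {
                out.push_back(tokens[i]);
            }
        }
        merges.push_back({best->first, next_id});
        ++next_id;
        tokens = std::move(out);
    }
    return merges;
}
```

The returned merge list is exactly what would be serialized into the tokenizer.json-style format, and re-applying it in order reproduces the same tokenization deterministically.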
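The linked-list speedup can be sketched as follows: tokens live in a doubly linked list, so collapsing an adjacent pair into its merged token is an O(1) overwrite-and-erase instead of an O(n) vector shift. (Fast MinBPE additionally maintains a priority queue of candidate pairs; that part is omitted here, and `encode_with_merges` is an illustrative name.)

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Apply learned merges, in training order, to a byte string.
std::vector<int> encode_with_merges(
    const std::string& text,
    const std::vector<std::pair<std::pair<int, int>, int>>& merges) {
    std::list<int> tokens;
    for (unsigned char c : text) tokens.push_back(c);
    for (const auto& [pair, new_id] : merges) {
        auto it = tokens.begin();
        while (it != tokens.end()) {
            auto next = std::next(it);
            if (next != tokens.end() && *it == pair.first && *next == pair.second) {
                *it = new_id;        // overwrite the left token in place
                tokens.erase(next);  // O(1) removal of the right token
            } else {
                ++it;
            }
        }
    }
    return {tokens.begin(), tokens.end()};
}
```

Because each splice is constant time, the cost of applying a merge is proportional to the number of occurrences of that pair rather than the length of the whole sequence.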
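The batching step can be sketched as a small driver that reads the input stream in fixed-size chunks, so the full dataset never has to fit in memory. (`for_each_batch` is a hypothetical helper; a real implementation would also need to avoid splitting a multi-byte character or a merge across batch boundaries.)

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <string_view>

// Call `process` once per chunk of at most `batch_size` bytes.
// Returns the number of batches processed.
template <typename Fn>
size_t for_each_batch(std::istream& in, size_t batch_size, Fn process) {
    std::string buffer(batch_size, '\0');
    size_t batches = 0;
    while (in) {
        in.read(buffer.data(), static_cast<std::streamsize>(batch_size));
        std::streamsize got = in.gcount();
        if (got == 0) break;  // nothing left to read
        process(std::string_view(buffer.data(), static_cast<size_t>(got)));
        ++batches;
    }
    return batches;
}
```

The same loop works unchanged whether `in` is a `std::ifstream` over a multi-gigabyte corpus or a `std::istringstream` in a unit test.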
Technologies
- Languages: C++, Python, CMake
- Libraries & Frameworks: Google Benchmark, Google Test, Pybind11
- References:
- Original BPE implementation video: Let's build the GPT Tokenizer
- Tokenizer format: DeepSeek-R1 Example