Byte-Pair Encoding (BPE), a subword-based tokenization algorithm, implemented from scratch in Python
Updated Jan 30, 2023 - Python
LLM-inspired BiLSTM pipeline for real-time, multi-label toxicity inference across adversarial discourse modalities.
A clean, educational implementation of the Byte Pair Encoding algorithm used in modern language models like GPT.
Paper: A Comparison of Different Tokenization Methods for the Georgian Language
This repository hosts our comprehensive study on text tokenization methods, covering word-, character-, and subword-level algorithms such as BPE, WordPiece, and Unigram, and extends to discussions on multilingual, mathematical, and code tokenization. It examines their efficiency, consistency, semantic preservation, and influence on LLMs.
A minimal Python implementation of Byte Pair Encoding (BPE) with step-by-step visualization of merge operations and vocabulary updates.
BPE & Unigram Vocab Training library
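For readers new to the topic, the merge loop that several of the repositories above describe can be sketched roughly as follows. This is a minimal illustration, not the code of any listed project: the function names `train_bpe`, `merge_pair`, and `encode` are hypothetical, words are split into characters rather than bytes, and no end-of-word marker is used.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Each distinct word becomes a tuple of single-character symbols with a frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties go to the pair that was counted first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge rule to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize `word` by replaying the learned merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe(["low", "low", "lower", "lowest"], num_merges=2)
tokens = encode("lowest", merges)
```

Production tokenizers typically operate on bytes, pre-tokenize text before counting pairs, and use faster data structures than this full recount on every merge, so treat the sketch as a reading aid only.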