Extend existing tokenizers with new vocabulary from custom training data. Train a new tokenizer on your domain-specific data and merge it with any sentencepiece tokenizer (e.g., Gemma) by replacing unused language tokens.
```bash
pip install .
```

Python:

```python
from dalla_sp_tokenizer import TokenizerManager

# Using HuggingFace dataset
manager = TokenizerManager(
    original_tokenizer_path="google/gemma-2-9b-it",
    training_data="data/dataset",        # HF dataset directory
    vocab_size=32000,
    replacement_languages=["CJK_merged"],
    output_path="merged",
    hf_token="hf_..."                    # Optional, for gated models
)

results = manager.run_full_pipeline(
    analyze_original=True,
    analyze_new=True,
    evaluate_result=True,
    test_text_path="data/dataset"        # Can also use HF dataset for evaluation
)
```

CLI:
```bash
export HF_TOKEN="hf_..."  # Optional, for gated models

# Using HuggingFace dataset
tokenizer pipeline \
    --original google/gemma-2-9b-it \
    --training-data data/dataset \
    --vocab-size 32000 \
    --replacement-languages CJK_merged \
    --output merged \
    --evaluate \
    --test-text data/dataset

# Or using raw text file (still supported)
tokenizer pipeline \
    --original google/gemma-2-9b-it \
    --training-data data.txt \
    --vocab-size 32000 \
    --replacement-languages CJK_merged \
    --output merged \
    --evaluate \
    --test-text test.txt
```

Python:
```python
# Train with HF dataset
manager.train()

# Analyze tokenizer
manager.analyze()

# Merge tokenizers
manager.merge()

# Evaluate
manager.evaluate(test_text_path="data/dataset")
```

CLI:
```bash
# Train a new tokenizer with HF dataset
tokenizer train --training-data data/dataset --vocab-size 32000 --output-prefix custom_tokenizer

# Train with text file (still supported)
tokenizer train --training-data data.txt --vocab-size 32000 --output-prefix custom_tokenizer

# Analyze a tokenizer
tokenizer analyze google/gemma-2-9b-it

# Merge tokenizers
tokenizer merge --original google/gemma-2-9b-it --new custom_tokenizer.model --replacement-languages CJK_merged --output merged

# Evaluate with HF dataset
tokenizer evaluate --tokenizer merged/tokenizer.model --test-text data/dataset

# Evaluate with text file (still supported)
tokenizer evaluate --tokenizer merged/tokenizer.model --test-text test.txt
```

Run `tokenizer analyze <path>` to see the available languages in your tokenizer:
- `CJK_merged` - Chinese/Japanese/Korean characters
- `Hiragana` - Japanese Hiragana
- `Katakana` - Japanese Katakana
- `Arabic_merged` - Arabic script
- `Cyrillic` - Cyrillic script
- `Greek_merged` - Greek script
- `Latin_merged` - Latin script variants
- `Hangul Syllables` - Korean Hangul
- Train: Creates a new SentencePiece tokenizer from your custom data
- Analyze: Classifies tokens by language using Unicode ranges
- Merge: Replaces specified language tokens in the base tokenizer with new tokens
- Evaluate: Tests tokenizer efficiency via the fertility ratio (average tokens per word; lower is better), as sketched below
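To make the evaluation step concrete, the snippet below is a minimal sketch of a fertility check using the standard `sentencepiece` API. It assumes the merged model was written to `merged/tokenizer.model` and uses a naive whitespace word count; the tool's own implementation may differ.

```python
import sentencepiece as spm

# Illustrative fertility check (not the tool's exact implementation).
sp = spm.SentencePieceProcessor(model_file="merged/tokenizer.model")

text = open("test.txt", encoding="utf-8").read()
words = text.split()                      # naive whitespace word count
pieces = sp.encode(text, out_type=str)    # subword pieces produced by the tokenizer

fertility = len(pieces) / max(len(words), 1)
print(f"{len(pieces)} tokens / {len(words)} words = fertility {fertility:.2f}")
```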
The tool identifies tokens by their Unicode script/language, allowing you to replace underused languages in the base tokenizer with tokens optimized for your domain.
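For intuition, here is a minimal sketch of the Unicode-range idea: each token is labeled by the first script whose code-point range contains one of its characters. The ranges and labels below are illustrative only, not the tool's actual classification table.

```python
import sentencepiece as spm

# Illustrative Unicode ranges; the tool's real language table is more complete.
RANGES = {
    "CJK": (0x4E00, 0x9FFF),              # CJK Unified Ideographs
    "Hiragana": (0x3040, 0x309F),
    "Katakana": (0x30A0, 0x30FF),
    "Hangul Syllables": (0xAC00, 0xD7AF),
    "Cyrillic": (0x0400, 0x04FF),
    "Arabic": (0x0600, 0x06FF),
    "Greek": (0x0370, 0x03FF),
}

def classify_token(piece: str) -> str:
    """Label a token by the first script whose range contains one of its characters."""
    for ch in piece:
        for name, (lo, hi) in RANGES.items():
            if lo <= ord(ch) <= hi:
                return name
    return "Latin/other"

sp = spm.SentencePieceProcessor(model_file="merged/tokenizer.model")
for i in range(10):
    piece = sp.id_to_piece(i)
    print(i, piece, classify_token(piece))
```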