# Dalla SentencePiece Tokenizer Manipulation

Extend existing tokenizers with new vocabulary from custom training data. Train a new tokenizer on your domain-specific data and merge it with any SentencePiece tokenizer (e.g., Gemma) by replacing unused language tokens.

## Installation

```bash
pip install .
```

## Usage

### Full Pipeline

Python:

```python
from dalla_sp_tokenizer import TokenizerManager

# Using HuggingFace dataset
manager = TokenizerManager(
    original_tokenizer_path="google/gemma-2-9b-it",
    training_data="data/dataset",  # HF dataset directory
    vocab_size=32000,
    replacement_languages=["CJK_merged"],
    output_path="merged",
    hf_token="hf_..."  # Optional, for gated models
)

results = manager.run_full_pipeline(
    analyze_original=True, 
    analyze_new=True, 
    evaluate_result=True, 
    test_text_path="data/dataset"  # Can also use HF dataset for evaluation
)
```

CLI:

```bash
export HF_TOKEN="hf_..."  # Optional, for gated models

# Using HuggingFace dataset
tokenizer pipeline \
  --original google/gemma-2-9b-it \
  --training-data data/dataset \
  --vocab-size 32000 \
  --replacement-languages CJK_merged \
  --output merged \
  --evaluate \
  --test-text data/dataset

# Or using raw text file (still supported)
tokenizer pipeline \
  --original google/gemma-2-9b-it \
  --training-data data.txt \
  --vocab-size 32000 \
  --replacement-languages CJK_merged \
  --output merged \
  --evaluate \
  --test-text test.txt
```

### Individual Operations

Python:

```python
# Train with HF dataset
manager.train()
# Analyze tokenizer
manager.analyze()

# Merge tokenizers
manager.merge()

# Evaluate
manager.evaluate(test_text_path="data/dataset")
```

CLI:

```bash
# Train a new tokenizer with HF dataset
tokenizer train --training-data data/dataset --vocab-size 32000 --output-prefix custom_tokenizer

# Train with text file (still supported)
tokenizer train --training-data data.txt --vocab-size 32000 --output-prefix custom_tokenizer

# Analyze a tokenizer
tokenizer analyze google/gemma-2-9b-it

# Merge tokenizers
tokenizer merge --original google/gemma-2-9b-it --new custom_tokenizer.model --replacement-languages CJK_merged --output merged

# Evaluate with HF dataset
tokenizer evaluate --tokenizer merged/tokenizer.model --test-text data/dataset

# Evaluate with text file (still supported)
tokenizer evaluate --tokenizer merged/tokenizer.model --test-text test.txt
```

## Replacement Languages

Run `tokenizer analyze <path>` to see available languages in your tokenizer:

- CJK_merged - Chinese/Japanese/Korean characters
- Hiragana - Japanese Hiragana
- Katakana - Japanese Katakana
- Arabic_merged - Arabic script
- Cyrillic - Cyrillic script
- Greek_merged - Greek script
- Latin_merged - Latin script variants
- Hangul Syllables - Korean Hangul

## How It Works

1. **Train**: Creates a new SentencePiece tokenizer from your custom data
2. **Analyze**: Classifies tokens by language using Unicode ranges
3. **Merge**: Replaces specified language tokens in the base tokenizer with new tokens
4. **Evaluate**: Tests tokenizer efficiency via the fertility ratio (tokens per word; see the sketch below)
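
The fertility ratio is the average number of tokens produced per whitespace-separated word, so lower values mean a more efficient tokenizer for your text. Below is a minimal sketch of that calculation using the `sentencepiece` library directly; the package's own `evaluate` step may compute it differently, and the paths and text are illustrative.

```python
# Minimal sketch of the fertility metric (tokens per whitespace word).
# Assumes a standard SentencePiece model file; paths are illustrative.
import sentencepiece as spm

def fertility(model_path: str, text: str) -> float:
    sp = spm.SentencePieceProcessor(model_file=model_path)
    pieces = sp.encode(text, out_type=str)   # subword pieces for the text
    words = text.split()                     # whitespace-delimited words
    return len(pieces) / max(len(words), 1)  # lower means fewer tokens per word

print(fertility("merged/tokenizer.model", "An example domain-specific sentence."))
```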

The tool identifies tokens by their Unicode script/language, allowing you to replace underused languages in the base tokenizer with tokens optimized for your domain.
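
As a rough illustration of that classification step, the sketch below scans a SentencePiece model proto and flags pieces that fall in the CJK Unified Ideographs block. It assumes the `sentencepiece` protobuf bindings are installed; the helper `is_cjk`, the file path, and the single Unicode range are illustrative, not the package's actual API or its full range tables.

```python
# Illustrative only: classify pieces by a single Unicode range (CJK Unified
# Ideographs) and count the slots that could be overwritten with domain tokens.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

def is_cjk(piece: str) -> bool:
    # A real classifier would cover many more ranges (Hiragana, Hangul, ...)
    return any(0x4E00 <= ord(ch) <= 0x9FFF for ch in piece)

model = sp_pb2.ModelProto()
with open("base_tokenizer.model", "rb") as f:  # hypothetical path
    model.ParseFromString(f.read())

cjk_slots = [i for i, p in enumerate(model.pieces) if is_cjk(p.piece)]
print(f"{len(cjk_slots)} pieces in the CJK range could be replaced with new tokens")
```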
