How efficiently can different tokenizers process your training datasets?
Know your tokenization costs and bottlenecks BEFORE you train, not after.
TOKBLAZE is the first systematic, reproducible framework for tokenization performance analysis. Built in Rust for maximum performance, with Python bindings for ease of use, it helps AI researchers make evidence-based tokenizer choices and optimize their training pipelines.
The Hidden Bottleneck
When training large language models, researchers focus on GPU utilization, model architecture, and optimization techniques. However, there's a critical preprocessing step that can become a silent performance killer: tokenization.
Consider the scale: datasets like C4 contain 170 billion tokens, The Pile spans 825 gigabytes, and RedPajama reaches 1.2 trillion tokens. Before any of this data can feed into a model, it must be tokenized. If your tokenizer processes text at 50 MB/s instead of 200 MB/s, you're not just losing 4x throughput—you're potentially adding days or weeks to your training pipeline.
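As a rough back-of-the-envelope check (a sketch using the sizes above, not TOKBLAZE output), here is what the 50 MB/s vs. 200 MB/s gap means for a single pass over an 825 GB corpus like The Pile:

```python
# Rough single-pass preprocessing estimate; a sketch, not TOKBLAZE output.
# Assumes sustained end-to-end throughput and ignores I/O and sharding overhead.
dataset_size_gb = 825  # e.g. The Pile

for throughput_mbs in (50, 200):
    seconds = dataset_size_gb * 1024 / throughput_mbs
    print(f"{throughput_mbs} MB/s -> {seconds / 3600:.1f} hours per pass")

# 50 MB/s  -> ~4.7 hours per pass
# 200 MB/s -> ~1.2 hours per pass
```

Repeat that gap across multiple preprocessing runs, ablations, and corpora an order of magnitude larger, and the slower path quickly costs days of wall-clock time.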
The AI community has sophisticated tools for profiling GPU utilization and model performance, yet tokenization efficiency remains largely unmeasured territory. Researchers often choose tokenizers based on convention rather than evidence, potentially sacrificing significant efficiency gains.
- 🎯 Evidence-Based Selection: Make tokenizer choices grounded in empirical performance data specific to your datasets
- 📊 Performance Baselines: Create reproducible, shareable benchmarks for the community
- ⚡ Optimization Insights: Identify which datasets or text patterns are expensive to tokenize
- 🔬 Hardware-Aware Analysis: Understand how compute infrastructure affects preprocessing efficiency
```bash
# Clone and install (single command)
git clone https://github.com/your-username/tokblaze
cd tokblaze
pip install -e ".[full]"
```

```bash
# Simple token counting
tokblaze your_dataset.txt

# Performance analysis with stable measurements
tokblaze your_dataset.txt --tps --runs 5

# Full system monitoring
tokblaze your_dataset.txt --tps --cpu --mem --runs 5 --warmup 2
```

```text
Running 2 warmup + 5 measurement runs...
╭───────────────────────────── 🔥 TOKBLAZE Results ────────────────────────────────────────╮
│ 📄 File dataset.txt │
│ 📏 Size 1.2 GB │
│ 🔢 Tokens 312,456,789 │
│ ⚙️ Tokenizer TIK │
│ 💾 Chunk Size 16 MiB │
│ │
│ 📊 Runs 5 measurements (± std dev) │
│ 🚀 Throughput 187.3 MB/s ± 3.2 MB/s │
│ ⚡ Token Rate 45.2k tok/s ± 1.1k tok/s │
│ ✅ Stability Low variance (good) │
│ │
│ 🖥️ CPU Usage 78.4% ± 4.2% │
│ 💾 Memory 156.7 MB avg │
╰──────────────────────────────────────────────────────────────────────────────────────────╯
```
```bash
tokblaze <file_path> [OPTIONS]
```

| Option | Description | Default | Example |
|---|---|---|---|
| `--tokenizer`, `-t` | Tokenizer to use | `tik` | `-t tik` |
| `--chunk-mb` | Memory chunk size (MB) | `16` | `--chunk-mb 64` |
| `--tps` | Show throughput statistics | `False` | `--tps` |
| `--cpu` | Show CPU usage | `False` | `--cpu` |
| `--mem` | Show memory usage | `False` | `--mem` |
| `--verify` | Hash verification for reproducibility | `False` | `--verify` |
| Option | Description | Default | Example |
|---|---|---|---|
| `--runs` | Number of measurement runs | `1` | `--runs 5` |
| `--warmup` | Number of warmup runs | `1` | `--warmup 2` |
| Tokenizer | Code | Description |
|---|---|---|
| TikToken | `tik` | OpenAI's `cl100k_base` (GPT-3.5/GPT-4) |
More tokenizers coming soon: SentencePiece, HuggingFace Tokenizers
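Because the `tik` tokenizer targets OpenAI's `cl100k_base` encoding, you can cross-check TOKBLAZE's token counts on a small sample with the upstream `tiktoken` Python package (a sanity-check sketch, not part of TOKBLAZE; `sample.txt` is a placeholder):

```python
# Cross-check token counts against the reference tiktoken implementation.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # same encoding as the `tik` tokenizer

with open("sample.txt", "r", encoding="utf-8") as f:
    text = f.read()

# disallowed_special=() avoids errors if the data happens to contain
# special-token strings such as "<|endoftext|>"
tokens = enc.encode(text, disallowed_special=())
print(f"{len(tokens):,} tokens in {len(text.encode('utf-8')):,} bytes")
```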
```bash
# Quick token count
tokblaze dataset.txt
```

Use case: Get basic statistics about your dataset
```bash
# Single measurement
tokblaze dataset.txt --tps

# Stable benchmark with multiple runs
tokblaze dataset.txt --tps --runs 5 --warmup 2
```

Use case: Compare tokenizer performance across different datasets
```bash
# Monitor CPU and memory during tokenization
tokblaze dataset.txt --tps --cpu --mem --runs 3
```

Use case: Understand resource requirements for large-scale preprocessing
```bash
# Generate verifiable results with hash
tokblaze dataset.txt --tps --verify --runs 10 --warmup 3
```

Use case: Create benchmarks for research papers or team sharing
```bash
# Test different chunk sizes for optimal performance
tokblaze dataset.txt --tps --chunk-mb 8
tokblaze dataset.txt --tps --chunk-mb 32
tokblaze dataset.txt --tps --chunk-mb 64
```

Use case: Optimize for your specific hardware configuration
```bash
# Benchmark multiple files
tokblaze english_text.txt --tps --runs 5 > english_results.txt
tokblaze code_dataset.txt --tps --runs 5 > code_results.txt
tokblaze multilingual.txt --tps --runs 5 > multilingual_results.txt
```

Use case: Understand how tokenizer performance varies across content types
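For a quick, single-run feel for how content type affects tokenization speed, you can also time the reference `cl100k_base` encoder directly in Python (an illustrative sketch only; it lacks TOKBLAZE's warmup runs, multi-run statistics, and chunked I/O):

```python
# Single-run tiktoken timing per content type; illustrative only.
import time
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for path in ("english_text.txt", "code_dataset.txt", "multilingual.txt"):
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    n_bytes = len(text.encode("utf-8"))

    start = time.perf_counter()
    # disallowed_special=() avoids errors on special-token strings in the data
    n_tokens = len(enc.encode(text, disallowed_special=()))
    elapsed = time.perf_counter() - start

    print(f"{path}: {n_bytes / elapsed / 1e6:.1f} MB/s, "
          f"{n_bytes / n_tokens:.2f} bytes/token")
```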
- Throughput (MB/s): Raw data processing speed
- Token Rate (tok/s): Tokens generated per second
- ±Standard Deviation: Measurement stability (lower = more reliable); the sketch below shows how these run statistics are derived
- ✅ Low variance: Reliable, repeatable measurements
- ⚠️ High variance: System under load, results may be unstable
- System load warnings: Automatic detection of CPU/memory pressure
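The ± values and the variance flag are ordinary statistics over the measurement runs. A minimal sketch of how such statistics can be computed, assuming a simple coefficient-of-variation cutoff (the 5% threshold here is illustrative, not TOKBLAZE's actual rule):

```python
# Illustrative run statistics; not TOKBLAZE's internal implementation.
from statistics import mean, stdev

# Hypothetical per-run throughput measurements in MB/s (e.g. from --runs 5)
runs_mbs = [184.1, 189.6, 186.2, 190.4, 186.2]

avg = mean(runs_mbs)
spread = stdev(runs_mbs)   # sample standard deviation
cv = spread / avg          # coefficient of variation

print(f"Throughput {avg:.1f} MB/s ± {spread:.1f} MB/s")
# Example stability cutoff; TOKBLAZE's actual threshold may differ
print("Low variance (good)" if cv < 0.05 else "High variance: results may be unstable")
```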
```bash
# Research/benchmarking: High precision needed
tokblaze dataset.txt --tps --runs 10 --warmup 3

# Development/testing: Quick feedback
tokblaze dataset.txt --tps --runs 3

# Production monitoring: Single run acceptable
tokblaze dataset.txt --tps
```

| Dataset Size | Recommended Chunk Size | Memory Usage |
|---|---|---|
| < 100 MB | 8-16 MB | Low |
| 100 MB - 1 GB | 16-32 MB | Medium |
| > 1 GB | 32-64 MB | High |
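If you want to script these recommendations, a small helper along the following lines can pick a starting `--chunk-mb` value from the file size (a convenience sketch, not part of TOKBLAZE; it returns the upper end of each recommended range):

```python
# Suggest a starting --chunk-mb value from file size, following the table above.
# Convenience sketch, not part of TOKBLAZE.
import os

def suggest_chunk_mb(path: str) -> int:
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb < 100:
        return 16
    if size_mb < 1024:
        return 32
    return 64

# Pass the result to tokblaze, e.g.: tokblaze dataset.txt --tps --chunk-mb <value>
print(suggest_chunk_mb("dataset.txt"))
```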
- SSDs recommended: Faster I/O improves overall throughput
- More CPU cores: Better parallel processing performance
- Sufficient RAM: Avoid swapping during large dataset processing
```bash
# Profile your actual training data
tokblaze training_corpus.txt --tps --runs 5 --cpu --mem

# Compare preprocessing costs
tokblaze raw_data.txt --tps --runs 3      # Before cleaning
tokblaze cleaned_data.txt --tps --runs 3  # After cleaning
```

```bash
# Systematic evaluation across content types
for dataset in code.txt english.txt multilingual.txt; do
    echo "=== $dataset ==="
    tokblaze $dataset --tps --runs 5 --verify
done
```

Use TOKBLAZE results to estimate preprocessing costs for large-scale training:
```python
# Calculate preprocessing time for large dataset
throughput_mbs = 187.3   # From TOKBLAZE results
dataset_size_gb = 825    # The Pile size
preprocessing_hours = (dataset_size_gb * 1024) / throughput_mbs / 3600
print(f"Estimated preprocessing time: {preprocessing_hours:.1f} hours")
```

TOKBLAZE is designed to become a standard tool in AI research workflows. We welcome contributions in:
- New tokenizer implementations
- Performance optimizations
- Dataset integrations
- Benchmark standardization
- Documentation improvements
If TOKBLAZE helps your research, please cite:
```bibtex
@software{tokblaze2025,
  title={TOKBLAZE: High-Performance Tokenization Benchmarking for AI Research},
  author={Kyle Ryan},
  year={2025},
  url={https://github.com/kylejryan/tokblaze}
}
```

MIT License - see LICENSE.md
Ready to optimize your tokenization pipeline?
```bash
pip install -e ".[full]"
tokblaze your_dataset.txt --tps --runs 5
```