How efficiently can different tokenizers process your training datasets?
Know your tokenization costs and bottlenecks BEFORE you train, not after.
TOKBLAZE is the first systematic, reproducible framework for tokenization performance analysis. Built in Rust for maximum performance, with Python bindings for ease of use, it helps AI researchers make evidence-based tokenizer choices and optimize their training pipelines.
The Hidden Bottleneck
When training large language models, researchers focus on GPU utilization, model architecture, and optimization techniques. However, there's a critical preprocessing step that can become a silent performance killer: tokenization.
Consider the scale: datasets like C4 contain 170 billion tokens, The Pile spans 825 gigabytes, and RedPajama reaches 1.2 trillion tokens. Before any of this data can feed into a model, it must be tokenized. If your tokenizer processes text at 50 MB/s instead of 200 MB/s, you're not just losing 4x throughput—you're potentially adding days or weeks to your training pipeline.
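As a rough back-of-the-envelope check (a sketch using the sizes above, not TOKBLAZE output), here is what the 50 MB/s vs. 200 MB/s gap means for a single pass over an 825 GB corpus like The Pile:

```python
# Rough single-pass preprocessing estimate; a sketch, not TOKBLAZE output.
# Assumes sustained end-to-end throughput and ignores I/O and sharding overhead.
dataset_size_gb = 825  # e.g. The Pile

for throughput_mbs in (50, 200):
    seconds = dataset_size_gb * 1024 / throughput_mbs
    print(f"{throughput_mbs} MB/s -> {seconds / 3600:.1f} hours per pass")

# 50 MB/s  -> ~4.7 hours per pass
# 200 MB/s -> ~1.2 hours per pass
```

Repeat that gap across multiple preprocessing runs, ablations, and corpora an order of magnitude larger, and the slower path quickly costs days of wall-clock time.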
The AI community has sophisticated tools for profiling GPU utilization and model performance, yet tokenization efficiency remains largely unmeasured territory. Researchers often choose tokenizers based on convention rather than evidence, potentially sacrificing significant efficiency gains.
- 🎯 Evidence-Based Selection: Make tokenizer choices grounded in empirical performance data specific to your datasets
- 📊 Performance Baselines: Create reproducible, shareable benchmarks for the community
- ⚡ Optimization Insights: Identify which datasets or text patterns are expensive to tokenize
- 🔬 Hardware-Aware Analysis: Understand how compute infrastructure affects preprocessing efficiency
```bash
# Clone and install (single command)
git clone https://github.com/your-username/tokblaze
cd tokblaze
pip install -e ".[full]"
```

```bash
# Simple token counting
tokblaze your_dataset.txt

# Performance analysis with stable measurements
tokblaze your_dataset.txt --tps --runs 5

# Full system monitoring
tokblaze your_dataset.txt --tps --cpu --mem --runs 5 --warmup 2
```

```text
Running 2 warmup + 5 measurement runs...
╭───────────────────────────── 🔥 TOKBLAZE Results ────────────────────────────────────────╮
│ 📄 File dataset.txt │
│ 📏 Size 1.2 GB │
│ 🔢 Tokens 312,456,789 │
│ ⚙️ Tokenizer TIK │
│ 💾 Chunk Size 16 MiB │
│ │
│ 📊 Runs 5 measurements (± std dev) │
│ 🚀 Throughput 187.3 MB/s ± 3.2 MB/s │
│ ⚡ Token Rate 45.2k tok/s ± 1.1k tok/s │
│ ✅ Stability Low variance (good) │
│ │
│ 🖥️ CPU Usage 78.4% ± 4.2% │
│ 💾 Memory 156.7 MB avg │
╰──────────────────────────────────────────────────────────────────────────────────────────╯
```
```bash
tokblaze <file_path> [OPTIONS]
```

| Option | Description | Default | Example |
|---|---|---|---|
| `--tokenizer`, `-t` | Tokenizer to use | `tik` | `-t tik` |
| `--chunk-mb` | Memory chunk size (MB) | `16` | `--chunk-mb 64` |
| `--tps` | Show throughput statistics | `False` | `--tps` |
| `--cpu` | Show CPU usage | `False` | `--cpu` |
| `--mem` | Show memory usage | `False` | `--mem` |
| `--verify` | Hash verification for reproducibility | `False` | `--verify` |
| Option | Description | Default | Example |
|---|---|---|---|
| `--runs` | Number of measurement runs | `1` | `--runs 5` |
| `--warmup` | Number of warmup runs | `1` | `--warmup 2` |
| Tokenizer | Code | Description |
|---|---|---|
| TikToken | `tik` | OpenAI's `cl100k_base` (GPT-3.5/GPT-4) |
More tokenizers coming soon: SentencePiece, HuggingFace Tokenizers
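Because the `tik` tokenizer targets OpenAI's `cl100k_base` encoding, you can cross-check TOKBLAZE's token counts on a small sample with the upstream `tiktoken` Python package (a sanity-check sketch, not part of TOKBLAZE; `sample.txt` is a placeholder):

```python
# Cross-check token counts against the reference tiktoken implementation.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # same encoding as the `tik` tokenizer

with open("sample.txt", "r", encoding="utf-8") as f:
    text = f.read()

# disallowed_special=() avoids errors if the data happens to contain
# special-token strings such as "<|endoftext|>"
tokens = enc.encode(text, disallowed_special=())
print(f"{len(tokens):,} tokens in {len(text.encode('utf-8')):,} bytes")
```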
```bash
# Quick token count
tokblaze dataset.txt
```

Use case: Get basic statistics about your dataset
```bash
# Single measurement
tokblaze dataset.txt --tps

# Stable benchmark with multiple runs
tokblaze dataset.txt --tps --runs 5 --warmup 2
```

Use case: Compare tokenizer performance across different datasets
```bash
# Monitor CPU and memory during tokenization
tokblaze dataset.txt --tps --cpu --mem --runs 3
```

Use case: Understand resource requirements for large-scale preprocessing
```bash
# Generate verifiable results with hash
tokblaze dataset.txt --tps --verify --runs 10 --warmup 3
```

Use case: Create benchmarks for research papers or team sharing
```bash
# Test different chunk sizes for optimal performance
tokblaze dataset.txt --tps --chunk-mb 8
tokblaze dataset.txt --tps --chunk-mb 32
tokblaze dataset.txt --tps --chunk-mb 64
```

Use case: Optimize for your specific hardware configuration
```bash
# Benchmark multiple files
tokblaze english_text.txt --tps --runs 5 > english_results.txt
tokblaze code_dataset.txt --tps --runs 5 > code_results.txt
tokblaze multilingual.txt --tps --runs 5 > multilingual_results.txt
```

Use case: Understand how tokenizer performance varies across content types
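For a quick, single-run feel for how content type affects tokenization speed, you can also time the reference `cl100k_base` encoder directly in Python (an illustrative sketch only; it lacks TOKBLAZE's warmup runs, multi-run statistics, and chunked I/O):

```python
# Single-run tiktoken timing per content type; illustrative only.
import time
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for path in ("english_text.txt", "code_dataset.txt", "multilingual.txt"):
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    n_bytes = len(text.encode("utf-8"))

    start = time.perf_counter()
    # disallowed_special=() avoids errors on special-token strings in the data
    n_tokens = len(enc.encode(text, disallowed_special=()))
    elapsed = time.perf_counter() - start

    print(f"{path}: {n_bytes / elapsed / 1e6:.1f} MB/s, "
          f"{n_bytes / n_tokens:.2f} bytes/token")
```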
- Throughput (MB/s): Raw data processing speed
- Token Rate (tok/s): Tokens generated per second
- ±Standard Deviation: Measurement stability (lower = more reliable); the sketch below shows how these run statistics are derived
- ✅ Low variance: Reliable, repeatable measurements
- ⚠️ High variance: System under load, results may be unstable
- System load warnings: Automatic detection of CPU/memory pressure
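The ± values and the variance flag are ordinary statistics over the measurement runs. A minimal sketch of how such statistics can be computed, assuming a simple coefficient-of-variation cutoff (the 5% threshold here is illustrative, not TOKBLAZE's actual rule):

```python
# Illustrative run statistics; not TOKBLAZE's internal implementation.
from statistics import mean, stdev

# Hypothetical per-run throughput measurements in MB/s (e.g. from --runs 5)
runs_mbs = [184.1, 189.6, 186.2, 190.4, 186.2]

avg = mean(runs_mbs)
spread = stdev(runs_mbs)   # sample standard deviation
cv = spread / avg          # coefficient of variation

print(f"Throughput {avg:.1f} MB/s ± {spread:.1f} MB/s")
# Example stability cutoff; TOKBLAZE's actual threshold may differ
print("Low variance (good)" if cv < 0.05 else "High variance: results may be unstable")
```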
```bash
# Research/benchmarking: High precision needed
tokblaze dataset.txt --tps --runs 10 --warmup 3

# Development/testing: Quick feedback
tokblaze dataset.txt --tps --runs 3

# Production monitoring: Single run acceptable
tokblaze dataset.txt --tps
```

| Dataset Size | Recommended Chunk Size | Memory Usage |
|---|---|---|
| < 100 MB | 8-16 MB | Low |
| 100 MB - 1 GB | 16-32 MB | Medium |
| > 1 GB | 32-64 MB | High |
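If you want to script these recommendations, a small helper along the following lines can pick a starting `--chunk-mb` value from the file size (a convenience sketch, not part of TOKBLAZE; it returns the upper end of each recommended range):

```python
# Suggest a starting --chunk-mb value from file size, following the table above.
# Convenience sketch, not part of TOKBLAZE.
import os

def suggest_chunk_mb(path: str) -> int:
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb < 100:
        return 16
    if size_mb < 1024:
        return 32
    return 64

# Pass the result to tokblaze, e.g.: tokblaze dataset.txt --tps --chunk-mb <value>
print(suggest_chunk_mb("dataset.txt"))
```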
- SSDs recommended: Faster I/O improves overall throughput
- More CPU cores: Better parallel processing performance
- Sufficient RAM: Avoid swapping during large dataset processing
```bash
# Profile your actual training data
tokblaze training_corpus.txt --tps --runs 5 --cpu --mem

# Compare preprocessing costs
tokblaze raw_data.txt --tps --runs 3      # Before cleaning
tokblaze cleaned_data.txt --tps --runs 3  # After cleaning
```

```bash
# Systematic evaluation across content types
for dataset in code.txt english.txt multilingual.txt; do
    echo "=== $dataset ==="
    tokblaze $dataset --tps --runs 5 --verify
done
```

Use TOKBLAZE results to estimate preprocessing costs for large-scale training:
```python
# Calculate preprocessing time for large dataset
throughput_mbs = 187.3   # From TOKBLAZE results
dataset_size_gb = 825    # The Pile size
preprocessing_hours = (dataset_size_gb * 1024) / throughput_mbs / 3600
print(f"Estimated preprocessing time: {preprocessing_hours:.1f} hours")
```

TOKBLAZE is designed to become a standard tool in AI research workflows. We welcome contributions in:
- New tokenizer implementations
- Performance optimizations
- Dataset integrations
- Benchmark standardization
- Documentation improvements
If TOKBLAZE helps your research, please cite:
```bibtex
@software{tokblaze2025,
  title={TOKBLAZE: High-Performance Tokenization Benchmarking for AI Research},
  author={Kyle Ryan},
  year={2025},
  url={https://github.com/kylejryan/tokblaze}
}
```

MIT License - see LICENSE.md
Ready to optimize your tokenization pipeline?
```bash
pip install -e ".[full]"
tokblaze your_dataset.txt --tps --runs 5
```