
TOKBLAZE: High-Performance Tokenization Benchmarking for AI Research

How efficiently can different tokenizers process your training datasets?
Know your tokenization costs and bottlenecks BEFORE you train, not after.

TOKBLAZE is the first systematic, reproducible framework for tokenization performance analysis. Built in Rust for maximum performance with Python bindings for ease of use, it helps AI researchers make evidence-based tokenizer choices and optimize their training pipelines.

License: MIT · Python 3.8+

Why This Matters for AI Research

The Hidden Bottleneck

When training large language models, researchers focus on GPU utilization, model architecture, and optimization techniques. However, there's a critical preprocessing step that can become a silent performance killer: tokenization.

Consider the scale: datasets like C4 contain 170 billion tokens, The Pile spans 825 gigabytes, and RedPajama reaches 1.2 trillion tokens. Before any of this data can feed into a model, it must be tokenized. If your tokenizer processes text at 50 MB/s instead of 200 MB/s, you're not just losing 4x throughput—you're potentially adding days or weeks to your training pipeline.
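
To make the scale concrete, here is a back-of-the-envelope estimate of a single tokenization pass at those two throughputs (illustrative numbers only, not TOKBLAZE output; the 5 TB corpus is an assumed round figure for a large web-scale dataset):

# Rough single-pass tokenization time at two throughputs (illustrative only)
corpora_gb = {"The Pile (~825 GB)": 825, "Assumed ~5 TB web corpus": 5120}

for name, size_gb in corpora_gb.items():
    for mb_per_s in (50, 200):
        hours = size_gb * 1024 / mb_per_s / 3600  # GB -> MB, divide by MB/s for seconds, then -> hours
        print(f"{name:26s} @ {mb_per_s:3d} MB/s: {hours:5.1f} h per pass")

Repeated passes (data-cleaning iterations, tokenizer ablations, re-tokenization after vocabulary changes) multiply these figures accordingly.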

The Research Gap

The AI community has sophisticated tools for profiling GPU utilization and model performance, yet tokenization efficiency remains largely unmeasured territory. Researchers often choose tokenizers based on convention rather than evidence, potentially sacrificing significant efficiency gains.

How TOKBLAZE Advances the Field

  • 🎯 Evidence-Based Selection: Make tokenizer choices grounded in empirical performance data specific to your datasets
  • 📊 Performance Baselines: Create reproducible, shareable benchmarks for the community
  • ⚡ Optimization Insights: Identify which datasets or text patterns are expensive to tokenize
  • 🔬 Hardware-Aware Analysis: Understand how compute infrastructure affects preprocessing efficiency

Quick Start

Installation

# Clone and install from source
git clone https://github.com/kylejryan/tokblaze
cd tokblaze
pip install -e ".[full]"

Basic Usage

# Simple token counting
tokblaze your_dataset.txt

# Performance analysis with stable measurements
tokblaze your_dataset.txt --tps --runs 5

# Full system monitoring
tokblaze your_dataset.txt --tps --cpu --mem --runs 5 --warmup 2

Example Output

Running 2 warmup + 5 measurement runs...

╭───────────────────────────── 🔥 TOKBLAZE Results ────────────────────────────────────────╮
│  📄  File        dataset.txt                                                             │
│  📏  Size        1.2 GB                                                                  │
│  🔢  Tokens      312,456,789                                                             │
│  ⚙️   Tokenizer   TIK                                                                    │
│  💾  Chunk Size  16 MiB                                                                  │
│                                                                                          │
│  📊  Runs        5 measurements (± std dev)                                              │
│  🚀  Throughput  187.3 MB/s ± 3.2 MB/s                                                   │
│  ⚡  Token Rate  45.2k tok/s ± 1.1k tok/s                                                │
│  ✅  Stability   Low variance (good)                                                     │
│                                                                                          │
│  🖥️   CPU Usage   78.4% ± 4.2%                                                           │
│  💾  Memory      156.7 MB avg                                                            │
╰──────────────────────────────────────────────────────────────────────────────────────────╯

Command Reference

Basic Syntax

tokblaze <file_path> [OPTIONS]

Core Options

Option           Description                             Default   Example
--tokenizer, -t  Tokenizer to use                        tik       -t tik
--chunk-mb       Memory chunk size (MB)                  16        --chunk-mb 64
--tps            Show throughput statistics              False     --tps
--cpu            Show CPU usage                          False     --cpu
--mem            Show memory usage                       False     --mem
--verify         Hash verification for reproducibility   False     --verify

Advanced Options

Option     Description                  Default   Example
--runs     Number of measurement runs   1         --runs 5
--warmup   Number of warmup runs        1         --warmup 2

Supported Tokenizers

Tokenizer   Code   Description
TikToken    tik    OpenAI's cl100k_base (GPT-3.5/GPT-4)

More tokenizers coming soon: SentencePiece, HuggingFace Tokenizers
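
For reference, the tik code maps to OpenAI's open-source tiktoken package. A minimal standalone sketch of counting tokens with the same cl100k_base encoding (independent of TOKBLAZE, shown only to make the mapping concrete):

# Standalone cl100k_base token count with tiktoken (pip install tiktoken)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding behind the "tik" option

with open("dataset.txt", encoding="utf-8") as f:
    text = f.read()  # fine for small files; TOKBLAZE itself reads in chunks (--chunk-mb)

print(f"{len(enc.encode(text)):,} tokens")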

Usage Examples

1. Basic Profiling

# Quick token count
tokblaze dataset.txt

Use case: Get basic statistics about your dataset

2. Performance Benchmarking

# Single measurement
tokblaze dataset.txt --tps

# Stable benchmark with multiple runs
tokblaze dataset.txt --tps --runs 5 --warmup 2

Use case: Compare tokenizer performance across different datasets

3. System Resource Analysis

# Monitor CPU and memory during tokenization
tokblaze dataset.txt --tps --cpu --mem --runs 3

Use case: Understand resource requirements for large-scale preprocessing

4. Reproducible Benchmarks

# Generate verifiable results with hash
tokblaze dataset.txt --tps --verify --runs 10 --warmup 3

Use case: Create benchmarks for research papers or team sharing
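
Independent of the --verify flag, it also helps to record a checksum of the exact dataset file next to published numbers so collaborators can confirm they benchmarked the same data. A minimal sketch (standalone helper, not part of TOKBLAZE):

# Record a SHA-256 checksum of the benchmarked file alongside results
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print("dataset.txt sha256:", file_sha256("dataset.txt"))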

5. Performance Tuning

# Test different chunk sizes for optimal performance
tokblaze dataset.txt --tps --chunk-mb 8
tokblaze dataset.txt --tps --chunk-mb 32
tokblaze dataset.txt --tps --chunk-mb 64

Use case: Optimize for your specific hardware configuration

6. Comparative Analysis

# Benchmark multiple files
tokblaze english_text.txt --tps --runs 5 > english_results.txt
tokblaze code_dataset.txt --tps --runs 5 > code_results.txt
tokblaze multilingual.txt --tps --runs 5 > multilingual_results.txt

Use case: Understand how tokenizer performance varies across content types

Understanding Results

Performance Metrics

  • Throughput (MB/s): Raw data processing speed
  • Token Rate (tok/s): Tokens generated per second
  • ±Standard Deviation: Measurement stability (lower = more reliable)

Stability Indicators

  • ✅ Low variance: Reliable, repeatable measurements
  • ⚠️ High variance: System under load, results may be unstable
  • System load warnings: Automatic detection of CPU/memory pressure
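
As a rough illustration of how per-run measurements become the reported mean ± standard deviation and a stability label, here is a sketch of the arithmetic (hypothetical sample values and threshold, not TOKBLAZE's internal code):

# Turn per-run throughput samples into mean ± std dev and a simple stability flag
import statistics

runs_mb_per_s = [185.1, 189.9, 187.0, 183.8, 190.7]  # hypothetical per-run throughputs

mean = statistics.mean(runs_mb_per_s)
std = statistics.stdev(runs_mb_per_s)  # sample standard deviation
cv = std / mean                        # coefficient of variation

# 5% threshold is an illustrative choice, not TOKBLAZE's actual rule
label = "Low variance (good)" if cv < 0.05 else "High variance (results may be unstable)"
print(f"Throughput: {mean:.1f} MB/s ± {std:.1f} MB/s -> {label}")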

When to Use Multiple Runs

# Research/benchmarking: High precision needed
tokblaze dataset.txt --tps --runs 10 --warmup 3

# Development/testing: Quick feedback
tokblaze dataset.txt --tps --runs 3

# Production monitoring: Single run acceptable
tokblaze dataset.txt --tps

Performance Optimization

Chunk Size Guidelines

Dataset Size     Recommended Chunk Size   Memory Usage
< 100 MB         8-16 MB                  Low
100 MB - 1 GB    16-32 MB                 Medium
> 1 GB           32-64 MB                 High

Hardware Considerations

  • SSDs recommended: Faster I/O improves overall throughput
  • More CPU cores: Better parallel processing performance
  • Sufficient RAM: Avoid swapping during large dataset processing

Research Applications

Training Pipeline Optimization

# Profile your actual training data
tokblaze training_corpus.txt --tps --runs 5 --cpu --mem

# Compare preprocessing costs
tokblaze raw_data.txt --tps --runs 3        # Before cleaning
tokblaze cleaned_data.txt --tps --runs 3    # After cleaning

Tokenizer Comparison Studies

# Systematic evaluation across content types
for dataset in code.txt english.txt multilingual.txt; do
    echo "=== $dataset ===" 
    tokblaze $dataset --tps --runs 5 --verify
done

Computational Budget Planning

Use TOKBLAZE results to estimate preprocessing costs for large-scale training:

# Calculate preprocessing time for a large dataset
throughput_mbs = 187.3  # MB/s, from TOKBLAZE results
dataset_size_gb = 825   # The Pile size
# GB -> MB, divide by MB/s for seconds, then convert seconds to hours
preprocessing_hours = (dataset_size_gb * 1024) / throughput_mbs / 3600
print(f"Estimated preprocessing time: {preprocessing_hours:.1f} hours")

Contributing

TOKBLAZE is designed to become a standard tool in AI research workflows. We welcome contributions in:

  • New tokenizer implementations
  • Performance optimizations
  • Dataset integrations
  • Benchmark standardization
  • Documentation improvements

Citation

If TOKBLAZE helps your research, please cite:

@software{tokblaze2025,
  title={TOKBLAZE: High-Performance Tokenization Benchmarking for AI Research},
  author={Kyle Ryan},
  year={2025},
  url={https://github.com/kylejryan/tokblaze}
}

License

MIT License - see LICENSE.md


Ready to optimize your tokenization pipeline?

pip install -e ".[full]"
tokblaze your_dataset.txt --tps --runs 5
