A comprehensive toolkit for analyzing and benchmarking vision tokenizers with detailed experimental results.
This project provides three main functionalities:
- 🎮 Interactive Testing - Test tokenizers on your own images with a simple CLI
- 📊 Comprehensive Benchmark - Evaluate multiple tokenizers on the COCO dataset with detailed metrics
- 🌐 Interactive Demo - Web-based interface for real-time experimentation
| Tokenizer | Type | Resolution | Tokens | Codebook |
|---|---|---|---|---|
| TA-Tok + AR-DTok | Autoregressive | 512px | 729 → 256 | 65K → 16K |
| TA-Tok + SANA | Diffusion | 512px | 729 | 65K |
| TA-Tok + Lumina2 | Diffusion | 512px | 729 | 65K |
| MAGVIT-v2 | LFQ | 256px | 256 | 262K |
| TiTok-L-32 | 1D Latent | 256px | 32 | 4K |
| VAE (SD-MSE) | VAE | 512px | - | 256 |
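As a rough capacity comparison, a discrete tokenization costs about tokens × log2(codebook size) bits per image, ignoring entropy coding and assuming the codebook sizes in the table are exact powers of two (4K = 4096, 262K = 262144). A quick sketch:

```python
import math

# Bits needed to store one image as discrete token ids:
# num_tokens * log2(codebook_size), before any entropy coding.
def bits_per_image(num_tokens: int, codebook_size: int) -> float:
    return num_tokens * math.log2(codebook_size)

print(bits_per_image(32, 4096))     # TiTok-L-32: 384 bits
print(bits_per_image(256, 262144))  # MAGVIT-v2: 4608 bits
```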
Installation:

```bash
# 1. Create conda environment
conda create -n vtp python=3.11
conda activate vtp
# 2. Install all dependencies
pip install -r requirements.txt
```

Note: PyTorch CUDA wheels and flash-attention are included in requirements.txt. If you encounter issues with flash-attention compilation, install it separately:

```bash
pip install flash-attn==2.8.3 --no-build-isolation
```

Most models are automatically downloaded on first use. For Lumina2:

```bash
# Login to Hugging Face (required for Gemma-2-2b access)
huggingface-cli login
# Download Gemma-2-2b (one-time setup)
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('google/gemma-2-2b')"
```

All models are cached in the ./model_weights/ directory.
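If you want to prefetch weights ahead of time (e.g., on a machine that will later run offline), a minimal sketch using huggingface_hub is shown below. Pointing `cache_dir` at `./model_weights/` is an assumption about this project's cache layout, so adjust the path if your models land elsewhere:

```python
# Prefetch a Hugging Face repo into a local cache directory.
# NOTE: cache_dir="./model_weights" is an assumed layout, not verified
# against this project's loaders; google/gemma-2-2b is the Lumina2
# dependency mentioned above.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="google/gemma-2-2b", cache_dir="./model_weights")
```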
Test tokenizers on your own images:
```bash
# List available tokenizers
python playground.py --list
# Test single image
python playground.py --model magvit2_256 --input image.jpg --output results/
# Test folder of images
python playground.py --model tatok_ardtok_512 --input images/ --output results/
# Compare multiple tokenizers
python playground.py --model magvit2_256,titok_256,vae_512 --input image.jpg --output results/
# Save token arrays
python playground.py --model magvit2_256 --input image.jpg --output results/ --save-tokens
```

Run comprehensive benchmarks on the COCO dataset:
First, download the COCO dataset:

```bash
python datasets/download_coco_1k.py
```

Then run the benchmarks:

```bash
# Quick test (10 images)
python run_benchmark.py --config configs/discrete_tokenizers.yaml
# Full evaluation (1000 images) - Run in background
nohup python -u run_benchmark.py --config configs/discrete_tokenizers_full.yaml > benchmark.log 2>&1 &
```

Output (saved in the config's output_dir, e.g., results/discrete_tokenizers/):
- `benchmark_results.csv` - Detailed metrics
- `benchmark_summary.txt` - Text summary
- `benchmark_summary.png` - Visualization
- `{tokenizer_name}/` - Reconstructed images
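If you want to slice the per-image results yourself, a small pandas sketch follows. The column names (`tokenizer`, `psnr`, `ssim`, `lpips`) are assumptions about the CSV schema, so check `benchmark_results.csv` for the actual headers:

```python
# Aggregate per-image metrics into one row per tokenizer.
# NOTE: column names are assumed; inspect the CSV header first.
import pandas as pd

df = pd.read_csv("results/discrete_tokenizers/benchmark_results.csv")
summary = df.groupby("tokenizer")[["psnr", "ssim", "lpips"]].mean()
print(summary.sort_values("psnr", ascending=False))
```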
Launch the interactive web interface:
```bash
# Default (port 7860)
python run_demo.py
# Custom port
python run_demo.py --port 8080
```

Then open http://localhost:7860 (or your chosen port) in your browser.
Demo Interface:
The demo lets you:
- Upload any image and see reconstructed results from different tokenizers
- Compare visual quality across various tokenization methods

All models are preloaded at startup for fast inference.
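Conceptually, the demo wraps the same encode/decode round trip exposed by the Python API (shown later in this README). Below is a minimal single-model Gradio sketch, not the actual scripts/demo_gradio.py, which preloads every model and adds model selection:

```python
# Minimal round-trip demo for a single tokenizer, assuming the
# TokenizerPlayground API described in the Python API section below.
import gradio as gr
from playground import TokenizerPlayground

pg = TokenizerPlayground(device="cuda")
pg.load("magvit2_256")

def reconstruct(image):
    # Encode the uploaded image to tokens, then decode them back.
    return pg.decode(pg.encode(image))

gr.Interface(fn=reconstruct,
             inputs=gr.Image(type="pil"),
             outputs=gr.Image(label="Reconstruction")).launch(server_port=7860)
```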
Evaluation Metrics:
- PSNR (↑) - Peak Signal-to-Noise Ratio
- SSIM (↑) - Structural Similarity Index
- LPIPS (↓) - Learned Perceptual Image Patch Similarity
- FID (↓) - Fréchet Inception Distance
- MAE (↓) - Mean Absolute Error
- RMSE (↓) - Root Mean Square Error
- Encode/Decode Time (↓) - Inference speed
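For reference, the pixel-space metrics follow their standard definitions. A minimal NumPy sketch (illustrative only; the project's implementations live in vision_metrics/):

```python
import numpy as np

def pixel_metrics(original: np.ndarray, reconstructed: np.ndarray):
    """MAE, RMSE, and PSNR between two 8-bit images of equal shape."""
    x = original.astype(np.float64)
    y = reconstructed.astype(np.float64)
    mae = np.abs(x - y).mean()             # Mean Absolute Error
    rmse = np.sqrt(((x - y) ** 2).mean())  # Root Mean Square Error
    # PSNR in dB, using the 255 peak value of 8-bit images
    psnr = 20 * np.log10(255.0 / rmse) if rmse > 0 else float("inf")
    return mae, rmse, psnr
```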
Project Structure:

```
vision_tokenizer_playground/
├── playground.py              # 🎮 Interactive CLI
├── run_benchmark.py           # 📊 Benchmark runner
├── run_demo.py                # 🌐 Gradio demo launcher
├── requirements.txt           # Package dependencies
│
├── scripts/
│   └── demo_gradio.py         # Gradio demo implementation
│
├── vision_tokenizers/         # Tokenizer wrapper classes (unified interface)
│   ├── base.py                # Base tokenizer class (VisionTokenizerBase)
│   ├── tatok.py               # TA-Tok encoder wrapper
│   ├── tatok_combined.py      # TA-Tok + de-tokenizer combinations
│   ├── ar_dtok.py             # AR-DTok wrapper (uses tok.ar_dtok)
│   ├── sana_dtok.py           # SANA wrapper (uses tok.dif_dtok_sana)
│   ├── lumina2_dtok.py        # Lumina2 wrapper (uses tok.dif_dtok_lumina2)
│   ├── magvit2.py             # MAGVIT-v2 tokenizer
│   ├── titok.py               # TiTok tokenizer
│   ├── vae_ldm.py             # VAE baseline
│   ├── model_cache.py         # Model caching utilities
│   ├── magvit2_modules/       # Extracted MAGVIT2 code
│   └── titok_modules/         # Extracted TiTok code
│
│   Note: Wrapper classes provide a unified `encode()`/`decode()` interface.
│   TA-Tok variants internally use the `tok/` module for the actual implementation.
│
├── vision_metrics/            # Metric implementations
│   ├── reconstruction.py      # PSNR, SSIM, LPIPS
│   ├── fid.py                 # FID calculation
│   └── token_stats.py         # Token statistics
│
├── vision_benchmarks/         # Benchmark implementation
│   ├── benchmark.py           # Main benchmark logic
│   └── dataset_coco.py        # COCO dataset loader
│
├── configs/                   # Configuration files
│   ├── discrete_tokenizers.yaml
│   └── discrete_tokenizers_full.yaml
│
├── datasets/                  # Dataset utilities
│   └── download_coco_1k.py    # COCO dataset downloader
│
├── tok/                       # Tar project original code (low-level implementations)
│   ├── ta_tok.py              # TA-Tok encoder (original implementation)
│   ├── ar_dtok/               # AR-DTok implementation
│   ├── dif_dtok_sana.py       # SANA Dif-DTok
│   ├── dif_dtok_lumina2.py    # Lumina2 Dif-DTok
│   ├── lumina2_model.py       # Lumina2 model utilities
│   ├── mm_autoencoder.py      # Multi-modal autoencoder
│   ├── models.py              # Model definitions
│   ├── transport/             # Transport-based diffusion
│   └── utils.py               # Utility functions
│
│   Note: `tok/` contains the original Tar project code, while `vision_tokenizers/`
│   provides unified wrapper classes that use `tok/` internally.
│
├── docs/                      # Documentation assets
│   └── images/                # Figures and samples
│
├── data/                      # Dataset storage (auto-created)
│   └── coco/                  # COCO dataset
│
├── results/                   # Benchmark results (auto-created)
│
└── model_weights/             # Downloaded models (auto-created)
```
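The unified interface noted in the tree above boils down to an encode/decode pair. A hypothetical sketch of what vision_tokenizers/base.py may look like (the actual signatures are not shown in this README):

```python
# Hypothetical shape of VisionTokenizerBase; actual signatures may differ.
from abc import ABC, abstractmethod

import torch
from PIL import Image

class VisionTokenizerBase(ABC):
    """Unified wrapper interface implemented by every tokenizer."""

    @abstractmethod
    def encode(self, image: Image.Image) -> torch.Tensor:
        """Map an image to discrete token ids (or latents for the VAE)."""

    @abstractmethod
    def decode(self, tokens: torch.Tensor) -> Image.Image:
        """Reconstruct an image from tokens."""
```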
Edit `configs/discrete_tokenizers.yaml` to customize:

```yaml
dataset:
  name: coco
  root: data/coco
  num_samples: 10     # Number of images

tokenizers:
  - name: magvit2_256
    type: magvit2
    enabled: true     # Set to false to skip

metrics:
  reconstruction:
    - psnr
    - ssim
    - lpips
    - fid
```
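A quick way to sanity-check a config before a long run is to load it with PyYAML; this is an illustrative sketch, not the benchmark runner's actual loading code:

```python
# Load the YAML config and list which tokenizers are enabled.
import yaml

with open("configs/discrete_tokenizers.yaml") as f:
    cfg = yaml.safe_load(f)

enabled = [t["name"] for t in cfg["tokenizers"] if t.get("enabled", True)]
print(f"{cfg['dataset']['num_samples']} images, tokenizers: {enabled}")
```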
Use tokenizers programmatically:

```python
from playground import TokenizerPlayground
# Initialize
playground = TokenizerPlayground(device='cuda')
# Load tokenizer
playground.load('magvit2_256')
# Encode & decode
from PIL import Image
image = Image.open('example.jpg')
tokens = playground.encode(image)
reconstructed = playground.decode(tokens)
# Get info
info = playground.info()
print(f"Tokens: {info['num_tokens']}, Codebook: {info['codebook_size']}")
```

We evaluated 6 vision tokenizers on 1000 images from the COCO val2017 dataset; the table below compares them across all metrics.
| Tokenizer | PSNR (↑) | SSIM (↑) | LPIPS (↓) | FID (↓) | Encode Time (s) | Decode Time (s) |
|---|---|---|---|---|---|---|
| vae_sd_mse_q8_512 | 24.92 ± 4.99 | 0.6953 ± 0.1467 | 0.0830 ± 0.0522 | 12.45 | 0.039 ± 0.295 | 0.070 ± 0.344 |
| magvit2_256 | 18.65 ± 4.56 | 0.4951 ± 0.1717 | 0.3936 ± 0.1298 | 34.37 | 0.015 ± 0.181 | 0.011 ± 0.169 |
| titok_l32_256 | 15.02 ± 2.40 | 0.3766 ± 0.1657 | 0.5724 ± 0.1206 | 54.23 | 0.013 ± 0.197 | 0.028 ± 0.501 |
| tatok_lumina2_512 | 14.01 ± 2.21 | 0.3910 ± 0.1641 | 0.4452 ± 0.0948 | 44.01 | 0.028 ± 0.007 | 95.720 ± 0.038 |
| tatok_sana_512 | 13.11 ± 2.24 | 0.3702 ± 0.1629 | 0.4998 ± 0.1044 | 48.91 | 0.021 ± 0.002 | 2.008 ± 0.019 |
| tatok_ardtok_512 | 12.35 ± 1.97 | 0.3609 ± 0.1570 | 0.5357 ± 0.1135 | 57.95 | 0.016 ± 0.013 | 58.287 ± 41.953 |
Key Findings:

- Best Overall Quality: VAE (SD-MSE) achieves the best reconstruction quality across all metrics (PSNR: 24.92 dB, SSIM: 0.6953), but it uses a continuous latent space with quantization rather than discrete tokens.
- Best Discrete Tokenizer: Among discrete tokenizers, MAGVIT2 offers the best quality-speed trade-off, with a PSNR of 18.65 dB and the fastest decoding time (0.011 s), making it well suited to real-time applications.
- TA-Tok Variants: Lumina2 provides the best quality among the TA-Tok variants but is very slow to decode (95.72 s). SANA offers a good balance (2.008 s decode time), while AR-DTok is both slower and lower quality.
- Extreme Compression: TiTok-L-32 achieves moderate quality (PSNR: 15.02 dB) with very fast decoding (0.028 s) using only 32 tokens per 256×256 image, making it suitable for extreme-compression scenarios.
| Model | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 |
|---|---|---|---|---|---|
| Original | ![]() | ![]() | ![]() | ![]() | ![]() |
| MAGVIT2 (256px) | ![]() | ![]() | ![]() | ![]() | ![]() |
| TA-Tok + AR-DTok (512px) | ![]() | ![]() | ![]() | ![]() | ![]() |
| TA-Tok + Lumina2 (512px) | ![]() | ![]() | ![]() | ![]() | ![]() |
| TA-Tok + SANA (512px) | ![]() | ![]() | ![]() | ![]() | ![]() |
| TiTok-L-32 (256px) | ![]() | ![]() | ![]() | ![]() | ![]() |
| VAE SD-MSE (512px) | ![]() | ![]() | ![]() | ![]() | ![]() |
Full benchmark results are available in:
- `docs/images/figures/benchmark_summary.png` - Comprehensive visualization
- `docs/images/figures/benchmark_summary.txt` - Text summary with detailed statistics
- `results/discrete_tokenizers_full/benchmark_results_merged.csv` - Detailed per-image metrics (1000 images × 6 tokenizers)
Acknowledgements:
- Tar - TA-Tok and de-tokenizers
- Open-MAGVIT2 - MAGVIT-v2 implementation
- TiTok - TiTok implementation
- Diffusers - Stable Diffusion VAE




































