
🔬 Vision Tokenizer Analysis

A comprehensive toolkit for analyzing and benchmarking vision tokenizers with detailed experimental results.

📖 Overview

This project provides three main functionalities:

  1. 🎮 Interactive Testing - Test tokenizers on your own images with a simple CLI
  2. 📊 Comprehensive Benchmark - Evaluate multiple tokenizers on the COCO dataset with detailed metrics
  3. 🌐 Interactive Demo - Web-based interface for real-time experimentation

Supported Tokenizers

| Tokenizer         | Type           | Resolution | Tokens    | Codebook  |
|-------------------|----------------|------------|-----------|-----------|
| TA-Tok + AR-DTok  | Autoregressive | 512px      | 729 → 256 | 65K → 16K |
| TA-Tok + SANA     | Diffusion      | 512px      | 729       | 65K       |
| TA-Tok + Lumina2  | Diffusion      | 512px      | 729       | 65K       |
| MAGVIT-v2         | LFQ            | 256px      | 256       | 262K      |
| TiTok-L-32        | 1D Latent      | 256px      | 32        | 4K        |
| VAE (SD-MSE)      | VAE            | 512px      | -         | 256       |

🚀 Quick Start

Installation

# 1. Create conda environment
conda create -n vtp python=3.11
conda activate vtp

# 2. Install all dependencies
pip install -r requirements.txt

Note: PyTorch CUDA wheels and flash-attention are included in requirements.txt. If you encounter issues with flash-attention compilation, install it separately:

pip install flash-attn==2.8.3 --no-build-isolation
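
After installation, a quick sanity check with standard PyTorch calls confirms that the CUDA wheels installed correctly and a GPU is visible:

import torch

print(torch.__version__)          # installed PyTorch build
print(torch.cuda.is_available())  # True if a CUDA device is usable
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))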

Model Setup

Most models are automatically downloaded on first use. For Lumina2:

# Login to Hugging Face (required for Gemma-2-2b access)
huggingface-cli login

# Download Gemma-2-2b (one-time setup)
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('google/gemma-2-2b')"

All models are cached in the ./model_weights/ directory.

💻 Usage

1. Interactive Testing (CLI)

Test tokenizers on your own images:

# List available tokenizers
python playground.py --list

# Test single image
python playground.py --model magvit2_256 --input image.jpg --output results/

# Test folder of images
python playground.py --model tatok_ardtok_512 --input images/ --output results/

# Compare multiple tokenizers
python playground.py --model magvit2_256,titok_256,vae_512 --input image.jpg --output results/

# Save token arrays
python playground.py --model magvit2_256 --input image.jpg --output results/ --save-tokens

2. Benchmark Evaluation

Run comprehensive benchmarks on the COCO dataset:

First, download the COCO dataset:

python datasets/download_coco_1k.py

Then run benchmarks:

# Quick test (10 images)
python run_benchmark.py --config configs/discrete_tokenizers.yaml

# Full evaluation (1000 images) - Run in background
nohup python -u run_benchmark.py --config configs/discrete_tokenizers_full.yaml > benchmark.log 2>&1 &

Output (saved in the config's output_dir, e.g., results/discrete_tokenizers/):

  • benchmark_results.csv - Detailed metrics
  • benchmark_summary.txt - Text summary
  • benchmark_summary.png - Visualization
  • {tokenizer_name}/ - Reconstructed images
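
If you want to slice the per-image numbers yourself, the CSV loads cleanly with pandas. A minimal sketch (the tokenizer/psnr/ssim/lpips column names are assumptions about the CSV layout, not confirmed):

import pandas as pd

# Per-image metrics written by run_benchmark.py
df = pd.read_csv("results/discrete_tokenizers/benchmark_results.csv")

# Mean quality per tokenizer (column names below are assumed)
summary = df.groupby("tokenizer")[["psnr", "ssim", "lpips"]].mean()
print(summary.sort_values("psnr", ascending=False))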

3. Gradio Web Demo

Launch the interactive web interface:

# Default (port 7860)
python run_demo.py

# Custom port
python run_demo.py --port 8080

Then open http://localhost:7860 in your browser.

Demo Interface:

[Screenshot: Gradio demo interface]

The demo allows you to:

  • Upload any image and see reconstructed results from different tokenizers
  • Compare visual quality across various tokenization methods

All models are preloaded at startup for fast inference; a minimal wiring sketch follows below.
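
For reference, here is a hedged sketch of how such a demo can be wired to the TokenizerPlayground API (shown under Python API below). The real implementation lives in scripts/demo_gradio.py, so treat this as illustrative only:

import gradio as gr
from playground import TokenizerPlayground

# Preload one tokenizer at startup (the real demo preloads all of them)
playground = TokenizerPlayground(device="cuda")
playground.load("magvit2_256")

def reconstruct(image):
    # Round-trip the uploaded image: encode to tokens, then decode back
    tokens = playground.encode(image)
    return playground.decode(tokens)

demo = gr.Interface(
    fn=reconstruct,
    inputs=gr.Image(type="pil"),
    outputs=gr.Image(label="Reconstruction"),
)
demo.launch(server_port=7860)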

📊 Evaluation Metrics

  • PSNR (↑) - Peak Signal-to-Noise Ratio
  • SSIM (↑) - Structural Similarity Index
  • LPIPS (↓) - Learned Perceptual Image Patch Similarity
  • FID (↓) - Fréchet Inception Distance
  • MAE (↓) - Mean Absolute Error
  • RMSE (↓) - Root Mean Square Error
  • Encode/Decode Time (↓) - Inference speed
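
The pixel-space metrics (PSNR, MAE, RMSE) follow directly from their definitions; a minimal NumPy sketch for uint8 images (not the project's vision_metrics implementation):

import numpy as np

def pixel_metrics(original: np.ndarray, reconstructed: np.ndarray) -> dict:
    """PSNR / MAE / RMSE for two uint8 images of identical shape."""
    x = original.astype(np.float64)
    y = reconstructed.astype(np.float64)
    mse = float(np.mean((x - y) ** 2))
    return {
        "mae": float(np.mean(np.abs(x - y))),
        "rmse": mse ** 0.5,
        "psnr": 20 * np.log10(255.0) - 10 * np.log10(mse) if mse > 0 else float("inf"),
    }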

๐Ÿ“ Project Structure

vision_tokenizer_playground/
├── playground.py              # 🎮 Interactive CLI
├── run_benchmark.py           # 📊 Benchmark runner
├── run_demo.py                # 🌐 Gradio demo launcher
├── requirements.txt           # Package dependencies
│
├── scripts/
│   └── demo_gradio.py         # Gradio demo implementation
│
├── vision_tokenizers/         # Tokenizer wrapper classes (unified interface)
│   ├── base.py                # Base tokenizer class (VisionTokenizerBase)
│   ├── tatok.py               # TA-Tok encoder wrapper
│   ├── tatok_combined.py      # TA-Tok + de-tokenizer combinations
│   ├── ar_dtok.py             # AR-DTok wrapper (uses tok.ar_dtok)
│   ├── sana_dtok.py           # SANA wrapper (uses tok.dif_dtok_sana)
│   ├── lumina2_dtok.py        # Lumina2 wrapper (uses tok.dif_dtok_lumina2)
│   ├── magvit2.py             # MAGVIT-v2 tokenizer
│   ├── titok.py               # TiTok tokenizer
│   ├── vae_ldm.py             # VAE baseline
│   ├── model_cache.py         # Model caching utilities
│   ├── magvit2_modules/       # Extracted MAGVIT2 code
│   └── titok_modules/         # Extracted TiTok code
│
│   Note: Wrapper classes provide a unified `encode()`/`decode()` interface.
│   TA-Tok variants internally use the `tok/` module for the actual implementation.
│
├── vision_metrics/            # Metric implementations
│   ├── reconstruction.py      # PSNR, SSIM, LPIPS
│   ├── fid.py                 # FID calculation
│   └── token_stats.py         # Token statistics
│
├── vision_benchmarks/         # Benchmark implementation
│   ├── benchmark.py           # Main benchmark logic
│   └── dataset_coco.py        # COCO dataset loader
│
├── configs/                   # Configuration files
│   ├── discrete_tokenizers.yaml
│   └── discrete_tokenizers_full.yaml
│
├── datasets/                  # Dataset utilities
│   └── download_coco_1k.py    # COCO dataset downloader
│
├── tok/                       # Tar project original code (low-level implementations)
│   ├── ta_tok.py              # TA-Tok encoder (original implementation)
│   ├── ar_dtok/               # AR-DTok implementation
│   ├── dif_dtok_sana.py       # SANA Dif-DTok
│   ├── dif_dtok_lumina2.py    # Lumina2 Dif-DTok
│   ├── lumina2_model.py       # Lumina2 model utilities
│   ├── mm_autoencoder.py      # Multi-modal autoencoder
│   ├── models.py              # Model definitions
│   ├── transport/             # Transport-based diffusion
│   └── utils.py               # Utility functions
│
│   Note: `tok/` contains original Tar project code, while `vision_tokenizers/`
│   provides unified wrapper classes that use `tok/` internally.
│
├── docs/                      # Documentation assets
│   └── images/                # Figures and samples
│
├── data/                      # Dataset storage (auto-created)
│   └── coco/                  # COCO dataset
│
├── results/                   # Benchmark results (auto-created)
│
└── model_weights/             # Downloaded models (auto-created)

🔧 Configuration

Edit configs/discrete_tokenizers.yaml to customize:

dataset:
  name: coco
  root: data/coco
  num_samples: 10  # Number of images

tokenizers:
  - name: magvit2_256
    type: magvit2
    enabled: true  # Set to false to skip

metrics:
  reconstruction:
    - psnr
    - ssim
    - lpips
    - fid
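
Since configs are plain YAML, they can also be inspected or edited programmatically; a small sketch using PyYAML, with key names taken from the excerpt above:

import yaml

with open("configs/discrete_tokenizers.yaml") as f:
    cfg = yaml.safe_load(f)

# Only tokenizers with enabled: true take part in the run
enabled = [t["name"] for t in cfg["tokenizers"] if t.get("enabled", True)]
print(f"{cfg['dataset']['num_samples']} images, tokenizers: {enabled}")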

📚 Python API

Use tokenizers programmatically:

from playground import TokenizerPlayground

# Initialize
playground = TokenizerPlayground(device='cuda')

# Load tokenizer
playground.load('magvit2_256')

# Encode & decode
from PIL import Image
image = Image.open('example.jpg')
tokens = playground.encode(image)
reconstructed = playground.decode(tokens)

# Get info
info = playground.info()
print(f"Tokens: {info['num_tokens']}, Codebook: {info['codebook_size']}")

🎯 Experimental Results

We evaluated 6 vision tokenizers on 1000 images from the COCO val2017 dataset. The results provide a comprehensive comparison across multiple metrics.

Benchmark Summary

[Figure: benchmark summary visualization - see docs/images/figures/benchmark_summary.png]

Quantitative Results

| Tokenizer         | PSNR (↑)     | SSIM (↑)        | LPIPS (↓)       | FID (↓)      | Encode Time (s) | Decode Time (s) |
|-------------------|--------------|-----------------|-----------------|--------------|-----------------|-----------------|
| vae_sd_mse_q8_512 | 24.92 ± 4.99 | 0.6953 ± 0.1467 | 0.0830 ± 0.0522 | 12.45 ± 0.00 | 0.039 ± 0.295   | 0.070 ± 0.344   |
| magvit2_256       | 18.65 ± 4.56 | 0.4951 ± 0.1717 | 0.3936 ± 0.1298 | 34.37 ± 0.00 | 0.015 ± 0.181   | 0.011 ± 0.169   |
| titok_l32_256     | 15.02 ± 2.40 | 0.3766 ± 0.1657 | 0.5724 ± 0.1206 | 54.23 ± 0.00 | 0.013 ± 0.197   | 0.028 ± 0.501   |
| tatok_lumina2_512 | 14.01 ± 2.21 | 0.3910 ± 0.1641 | 0.4452 ± 0.0948 | 44.01 ± 0.00 | 0.028 ± 0.007   | 95.720 ± 0.038  |
| tatok_sana_512    | 13.11 ± 2.24 | 0.3702 ± 0.1629 | 0.4998 ± 0.1044 | 48.91 ± 0.00 | 0.021 ± 0.002   | 2.008 ± 0.019   |
| tatok_ardtok_512  | 12.35 ± 1.97 | 0.3609 ± 0.1570 | 0.5357 ± 0.1135 | 57.95 ± 0.00 | 0.016 ± 0.013   | 58.287 ± 41.953 |

Key Findings:

  • Best Overall Quality: VAE (SD-MSE) achieves the best reconstruction quality across all metrics (PSNR: 24.92 dB, SSIM: 0.6953), but it uses a continuous latent space with quantization rather than discrete tokens.

  • Best Discrete Tokenizer: Among discrete tokenization models, MAGVIT2 offers the best quality-speed trade-off with PSNR of 18.65 dB and the fastest decoding time (0.011s), making it ideal for real-time applications.

  • TA-Tok Variants: Lumina2 provides the best quality among the TA-Tok variants but is very slow to decode (95.72s). SANA offers a good balance (2.008s decode time), while AR-DTok is slower (58.3s) with lower quality.

  • Extreme Compression: TiTok-L-32 achieves moderate quality (PSNR: 15.02 dB) with very fast decoding (0.028s) using only 32 tokens for 256×256 images, making it suitable for extreme compression scenarios.

Sample Reconstructions

[Image grid: five COCO sample images, shown as an Original row followed by reconstructions from MAGVIT2 (256px), TA-Tok + AR-DTok (512px), TA-Tok + Lumina2 (512px), TA-Tok + SANA (512px), TiTok-L-32 (256px), and VAE SD-MSE (512px).]

Detailed Results

Full benchmark results are available in:

  • docs/images/figures/benchmark_summary.png - Comprehensive visualization
  • docs/images/figures/benchmark_summary.txt - Text summary with detailed statistics
  • results/discrete_tokenizers_full/benchmark_results_merged.csv - Detailed per-image metrics (1000 images × 6 tokenizers)

🙏 Acknowledgments

This toolkit builds on original code from the Tar project (the tok/ directory) and on the extracted MAGVIT-v2 and TiTok modules under vision_tokenizers/.