Generate images with multiple quantization budgets and compare quality against the baseline FP16 model.
```bash
# Generate 50 images for baseline + 9 budget levels (0.1 to 0.9)
python generate_experiment.py --num_prompts 50

# This creates:
# experiments/2025-10-31_14-30-45_50prompts/
#   baseline/      # FP16 images
#   budget_0.1/    # Heavily quantized
#   budget_0.2/
#   ...
#   budget_0.9/    # Lightly quantized
#   prompts.txt    # List of prompts used
#   config.json    # Experiment metadata
```

The `generate_experiment.py` script automates the complete evaluation workflow:
- Creates a timestamped directory in `experiments/`
- Loads N COCO prompts from `prompts/coco_val2017.txt`
- Generates baseline images with the FP16 model
- For each budget level:
  - Runs the greedy optimizer to get a bit allocation
  - Applies mixed-precision quantization
  - Generates images with the quantized model
  - Saves the quantization config
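The scaffolding step of this workflow can be sketched in a few lines of standard-library Python. The helper name `make_experiment_dir` is hypothetical, and the real script's internals may differ; this only mirrors the directory layout described above:

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path

def make_experiment_dir(base, num_prompts, budget_levels):
    """Create the timestamped experiment layout described above.
    (Hypothetical helper; the actual script may differ in details.)"""
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    root = Path(base) / f"{stamp}_{num_prompts}prompts"
    (root / "baseline").mkdir(parents=True)        # FP16 images go here
    for b in budget_levels:
        (root / f"budget_{b}").mkdir()             # one folder per budget level
    (root / "prompts.txt").touch()                 # filled in after prompt selection
    (root / "config.json").write_text(
        json.dumps({"num_prompts": num_prompts, "budget_levels": budget_levels})
    )
    return root

root = make_experiment_dir(tempfile.mkdtemp(), 50, [0.1, 0.5, 0.9])
print(sorted(p.name for p in root.iterdir()))
# → ['baseline', 'budget_0.1', 'budget_0.5', 'budget_0.9', 'config.json', 'prompts.txt']
```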
```bash
# Basic: 50 prompts, default budgets (0.1 to 0.9)
python generate_experiment.py --num_prompts 50

# Custom budget levels
python generate_experiment.py \
    --num_prompts 100 \
    --budget_levels 0.3 0.5 0.7

# Specify all options
python generate_experiment.py \
    --num_prompts 50 \
    --device cuda \
    --model_path CompVis/stable-diffusion-v1-4 \
    --flops_file results/flops_analysis/flops_analysis_unet.json \
    --sensitivity_file results/sensitivity_analysis/sensitivity_100_prompts.json \
    --budget_levels 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 \
    --seed 42 \
    --prompt_seed 42
```

| Flag | Type | Default | Description |
|---|---|---|---|
| `--num_prompts` | int | required | Number of COCO prompts to use |
| `--coco_path` | str | None | Path to COCO prompts file (default: `prompts/coco_val2017.txt`) |
| `--model_path` | str | `CompVis/stable-diffusion-v1-4` | Hugging Face model identifier |
| `--device` | str | `cuda` | Device: `cuda` or `cpu` |
| `--flops_file` | str | `results/flops_analysis/flops_analysis_unet.json` | Path to FLOPs analysis |
| `--sensitivity_file` | str | `results/sensitivity_analysis/sensitivity_100_prompts.json` | Path to sensitivity analysis |
| `--budget_levels` | float[] | `[0.1, 0.2, ..., 0.9]` | Budget multipliers (space-separated) |
| `--seed` | int | 42 | Random seed for image generation |
| `--prompt_seed` | int | 42 | Random seed for prompt selection |
| `--experiment_dir` | str | `experiments` | Base directory for experiments |
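The flag set above maps naturally onto an `argparse` definition. The actual parser code isn't shown in this document, so this sketch is an assumption that simply reproduces the table's types and defaults:

```python
import argparse

def build_parser():
    # Mirrors the flags table above; defaults are taken from that table.
    p = argparse.ArgumentParser(description="Generate quantization experiment images")
    p.add_argument("--num_prompts", type=int, required=True)
    p.add_argument("--coco_path", type=str, default=None)
    p.add_argument("--model_path", type=str, default="CompVis/stable-diffusion-v1-4")
    p.add_argument("--device", type=str, default="cuda", choices=["cuda", "cpu"])
    p.add_argument("--flops_file", type=str,
                   default="results/flops_analysis/flops_analysis_unet.json")
    p.add_argument("--sensitivity_file", type=str,
                   default="results/sensitivity_analysis/sensitivity_100_prompts.json")
    p.add_argument("--budget_levels", type=float, nargs="+",
                   default=[round(0.1 * i, 1) for i in range(1, 10)])  # 0.1 .. 0.9
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("--prompt_seed", type=int, default=42)
    p.add_argument("--experiment_dir", type=str, default="experiments")
    return p

args = build_parser().parse_args(
    ["--num_prompts", "50", "--budget_levels", "0.3", "0.5", "0.7"]
)
print(args.num_prompts, args.budget_levels)  # → 50 [0.3, 0.5, 0.7]
```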
For `--num_prompts 50` with 9 budget levels:

```
experiments/2025-10-31_14-30-45_50prompts/
├── baseline/
│   ├── image_0000.png
│   ├── image_0001.png
│   └── ... (50 images)
├── budget_0.1/
│   ├── image_0000.png
│   ├── image_0001.png
│   ├── ... (50 images)
│   └── quantization_config.json
├── budget_0.2/
│   └── ... (same structure)
├── budget_0.3/
├── budget_0.4/
├── budget_0.5/
├── budget_0.6/
├── budget_0.7/
├── budget_0.8/
├── budget_0.9/
├── prompts.txt      # List of prompts used
└── config.json      # Experiment metadata
```
Total images generated: N_prompts × (1 + N_budgets)
- Example: 50 prompts × 10 configs = 500 images
After generating images, use `evaluate_experiment.py` to compute FID and CLIP scores:

```bash
# Basic evaluation (prints summary table)
python evaluate_experiment.py experiments/2025-10-31_14-30-45_50prompts/

# Save results to CSV and JSON
python evaluate_experiment.py experiments/2025-10-31_14-30-45_50prompts/ --save_results

# Specify device
python evaluate_experiment.py experiments/2025-10-31_14-30-45_50prompts/ \
    --device cuda \
    --save_results
```

- FID (Fréchet Inception Distance): Measures distribution similarity between baseline and quantized images (lower is better)
- CLIP Score: Measures semantic consistency between prompts and generated images (higher is better)
- Bit-width Distribution: Number of layers at each precision level (4-bit, 8-bit, 16-bit)
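FID reduces to the Fréchet distance between two Gaussians fitted to Inception features of the two image sets. A minimal numpy/scipy sketch of that final formula (the evaluator likely uses a library implementation that also handles feature extraction):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """d^2 = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2))."""
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerics
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Identical feature statistics give FID ~ 0; numerical noise can even
# produce a value printed as -0.000, as in the summary table below.
mu, sigma = np.zeros(4), np.eye(4)
print(frechet_distance(mu, sigma, mu, sigma))
```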
With `--save_results`, two files are written:

- `evaluation_results.csv`: Table with all metrics
- `evaluation_results.json`: Complete results with metadata
```
================================================================================
EXPERIMENT EVALUATION SUMMARY
================================================================================
Experiment: 2025-11-03_13-27-00_100prompts
Number of prompts: 100
Baseline CLIP: 0.2960
--------------------------------------------------------------------------------------------
Budget      FID     CLIP   CLIP Δ       Target         Used   Util%   4-bit   8-bit  16-bit
--------------------------------------------------------------------------------------------
   0.1   76.958   0.3025  -0.0065     398944.4     398932.1    99.9      14      50     250
   0.2    5.936   0.2969  -0.0009     797888.8     797865.3   100.0       7      35     272
   0.3    7.143   0.2969  -0.0009    1196833.2    1196798.4    99.9       4      27     283
   0.4   -0.000   0.2960   0.0000    1595777.7    1595732.1    99.9       4      10     300
   0.5   -0.000   0.2960   0.0000    1994722.1    1994665.8    99.9       3      11     300
   0.6   -0.000   0.2960   0.0000    2393666.5    2393599.5    99.9       0       8     306
   0.7   -0.000   0.2960   0.0000    2792610.9    2792533.2    99.9       0       4     310
   0.8   -0.000   0.2960   0.0000    3191555.3    3191466.9    99.9       0       3     311
   0.9   -0.000   0.2960   0.0000    3590499.7    3590400.6    99.9       0       2     312
--------------------------------------------------------------------------------------------
Interpretation:
- FID: Lower is better (measures distribution similarity between baseline and quantized images)
- CLIP: Higher is better (measures semantic consistency with prompts)
- CLIP Δ: Degradation from baseline (lower is better)
- Target: Target budget in GBOPs (budget_multiplier × max_cost)
- Used: Actual BOPs used in GBOPs
- Util%: Budget utilization (Used / Target × 100)
================================================================================
```
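The Target and Util% columns follow directly from the definitions in the interpretation block. As a worked check for budget 0.1 (the full-precision cost `max_cost` is not printed in the summary, so the value here is inferred from the table and is an assumption):

```python
max_cost = 3_989_444.0   # full-precision cost in GBOPs (inferred: Target / multiplier)
multiplier = 0.1
used = 398_932.1         # BOPs actually allocated by the optimizer (from the table)

target = multiplier * max_cost       # Target column: 398944.4 GBOPs
utilization = used / target * 100.0  # Util% column: just under 100%
print(f"target={target:.1f} GBOPs, util={utilization:.2f}%")
```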
| Flag | Type | Default | Description |
|---|---|---|---|
| `experiment_dir` | str | required | Path to experiment directory |
| `--device` | str | `cuda` | Device: `cuda` or `cpu` |
| `--save_results` | flag | - | Save results to CSV and JSON |
| `--output_path` | str | `<experiment_dir>/evaluation_results.csv` | Custom output path |
- Use `--prompt_seed` for reproducibility
- Start with 50-100 prompts for quick validation
- Use 500-1000 prompts for final evaluation
- CUDA (NVIDIA): Best performance, recommended
- CPU: Very slow, only for testing
- Test 3-5 budgets initially: `0.3 0.5 0.7`
- Expand to 9 levels for comprehensive analysis: `0.1 0.2 ... 0.9`
- Focus on the range where quality transitions occur
- ~10-15 seconds per image (NVIDIA A100)
- ~20-30 seconds per image (NVIDIA 3090)
- Each PNG image: ~1-2 MB
- 50 prompts × 10 configs = ~500-1000 MB
- Plan accordingly for large experiments
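The runtime and storage figures above combine into a quick back-of-the-envelope estimator. The per-image averages used here are assumptions picked from the middle of the listed ranges (A100-class GPU, mid-range PNG size):

```python
def estimate(num_prompts, num_budgets, sec_per_image=12.5, mb_per_image=1.5):
    """Rough wall-clock and disk estimates for one experiment.
    Per-image averages are assumed, drawn from the ranges above."""
    images = num_prompts * (1 + num_budgets)  # baseline + one image set per budget
    hours = images * sec_per_image / 3600
    disk_mb = images * mb_per_image
    return images, hours, disk_mb

images, hours, disk_mb = estimate(num_prompts=50, num_budgets=9)
print(images, round(hours, 1), disk_mb)  # → 500 1.7 750.0
```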
- Plot quality vs cost curves: Visualize tradeoffs
  - X-axis: Computational cost (GBOPs)
  - Y-axis: Quality (FID, CLIP)
  - Find the optimal budget level
- Compare bit-width distributions: Analyze allocation patterns
  - Which layers are kept at high precision?
  - How does allocation change with budget?
- Select the optimal budget: Balance quality and computational cost
  - Budget 0.3-0.4: Often provides the best quality/cost tradeoff
  - Budget 0.5+: Minimal quality loss, higher cost
  - Budget <0.3: Significant quality degradation
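The budget-selection step can be automated with a simple threshold rule: take the lowest-cost budget whose FID stays within an acceptable tolerance. The FID values below are copied from the sample summary; the tolerance itself is an arbitrary illustrative choice:

```python
# (budget, FID vs baseline) pairs from the sample summary above
results = [(0.1, 76.958), (0.2, 5.936), (0.3, 7.143),
           (0.4, 0.0), (0.5, 0.0), (0.6, 0.0)]

FID_TOLERANCE = 10.0  # assumed acceptable distribution shift; tune per application

# Lowest-cost budget whose quality stays within tolerance
optimal = min(b for b, fid in results if fid <= FID_TOLERANCE)
print(optimal)  # → 0.2
```

A stricter tolerance (e.g. near-zero FID) would instead select budget 0.4, matching the range the guidance above calls the usual quality/cost sweet spot.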