
Phase 4: Image Generation Experiments

Generate images at multiple quantization budget levels and compare their quality against the baseline FP16 model.

Quick Start

# Generate 50 images for baseline + 9 budget levels (0.1 to 0.9)
python generate_experiment.py --num_prompts 50

# This creates:
# experiments/2025-10-31_14-30-45_50prompts/
#   baseline/          # FP16 images
#   budget_0.1/        # Heavily quantized
#   budget_0.2/
#   ...
#   budget_0.9/        # Lightly quantized
#   prompts.txt        # List of prompts used
#   config.json        # Experiment metadata

Image Generation Script

The generate_experiment.py script automates the complete evaluation workflow (a minimal Python sketch follows the list below):

  1. Creates timestamped directory in experiments/
  2. Loads N COCO prompts from prompts/coco_val2017.txt
  3. Generates baseline images with FP16 model
  4. For each budget level:
    • Runs greedy optimizer to get bit allocation
    • Applies mixed-precision quantization
    • Generates images with quantized model
    • Saves quantization config
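
The loop below is a minimal sketch of this workflow, not the script's actual code: the helper names (load_prompts, generate_images, run_greedy_optimizer, apply_quantization) and the model/analysis objects are hypothetical stand-ins for whatever generate_experiment.py uses internally.

# Minimal sketch of the generate_experiment.py workflow. All helper names
# and the model_fp16/flops/sensitivity objects are hypothetical stand-ins.
import json
from datetime import datetime
from pathlib import Path

def run_experiment(num_prompts, budget_levels, out_root="experiments"):
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    exp_dir = Path(out_root) / f"{stamp}_{num_prompts}prompts"
    exp_dir.mkdir(parents=True)

    # Step 2: sample N prompts from the COCO validation captions
    prompts = load_prompts("prompts/coco_val2017.txt", num_prompts)
    (exp_dir / "prompts.txt").write_text("\n".join(prompts) + "\n")

    # Step 3: FP16 baseline images
    generate_images(model_fp16, prompts, exp_dir / "baseline")

    # Step 4: one quantized run per budget level
    for budget in budget_levels:
        budget_dir = exp_dir / f"budget_{budget}"
        alloc = run_greedy_optimizer(flops, sensitivity, budget)  # bit allocation
        quantized = apply_quantization(model_fp16, alloc)
        generate_images(quantized, prompts, budget_dir)
        (budget_dir / "quantization_config.json").write_text(json.dumps(alloc, indent=2))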

Usage Examples

# Basic: 50 prompts, default budgets (0.1 to 0.9)
python generate_experiment.py --num_prompts 50

# Custom budget levels
python generate_experiment.py \
    --num_prompts 100 \
    --budget_levels 0.3 0.5 0.7

# Specify all options
python generate_experiment.py \
    --num_prompts 50 \
    --device cuda \
    --model_path CompVis/stable-diffusion-v1-4 \
    --flops_file results/flops_analysis/flops_analysis_unet.json \
    --sensitivity_file results/sensitivity_analysis/sensitivity_100_prompts.json \
    --budget_levels 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 \
    --seed 42 \
    --prompt_seed 42

CLI Options

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --num_prompts | int | required | Number of COCO prompts to use |
| --coco_path | str | None | Path to COCO prompts file (falls back to prompts/coco_val2017.txt) |
| --model_path | str | CompVis/stable-diffusion-v1-4 | Hugging Face model identifier |
| --device | str | cuda | Device: cuda or cpu |
| --flops_file | str | results/flops_analysis/flops_analysis_unet.json | Path to FLOPs analysis |
| --sensitivity_file | str | results/sensitivity_analysis/sensitivity_100_prompts.json | Path to sensitivity analysis |
| --budget_levels | float[] | [0.1, 0.2, ..., 0.9] | Budget multipliers (space-separated) |
| --seed | int | 42 | Random seed for image generation |
| --prompt_seed | int | 42 | Random seed for prompt selection |
| --experiment_dir | str | experiments | Base directory for experiments |

Output Structure

For --num_prompts 50 with 9 budget levels:

experiments/2025-10-31_14-30-45_50prompts/
├── baseline/
│   ├── image_0000.png
│   ├── image_0001.png
│   └── ... (50 images)
├── budget_0.1/
│   ├── image_0000.png
│   ├── image_0001.png
│   ├── ... (50 images)
│   └── quantization_config.json
├── budget_0.2/
│   └── ... (same structure)
├── budget_0.3/
├── budget_0.4/
├── budget_0.5/
├── budget_0.6/
├── budget_0.7/
├── budget_0.8/
├── budget_0.9/
├── prompts.txt        # List of prompts used
└── config.json        # Experiment metadata

Total images generated: N_prompts × (1 + N_budgets)

  • Example: 50 prompts × 10 configs = 500 images
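
Downstream scripts can rely on this fixed layout. A minimal sketch that walks one experiment directory, assuming only the file and directory names shown above and treating the JSON contents as opaque:

# Walk an experiment directory using only the layout shown above.
import json
from pathlib import Path

exp_dir = Path("experiments/2025-10-31_14-30-45_50prompts")
prompts = (exp_dir / "prompts.txt").read_text().splitlines()
config = json.loads((exp_dir / "config.json").read_text())  # metadata; schema not documented here

print(f"{len(prompts)} prompts, baseline images:",
      len(list((exp_dir / "baseline").glob("image_*.png"))))

for budget_dir in sorted(exp_dir.glob("budget_*")):
    n_images = len(list(budget_dir.glob("image_*.png")))
    has_qconfig = (budget_dir / "quantization_config.json").exists()
    print(f"{budget_dir.name}: {n_images} images, quantization config: {has_qconfig}")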

Quality Evaluation

After generating images, use evaluate_experiment.py to compute FID and CLIP scores:

# Basic evaluation (prints summary table)
python evaluate_experiment.py experiments/2025-10-31_14-30-45_50prompts/

# Save results to CSV and JSON
python evaluate_experiment.py experiments/2025-10-31_14-30-45_50prompts/ --save_results

# Specify device
python evaluate_experiment.py experiments/2025-10-31_14-30-45_50prompts/ \
    --device cuda \
    --save_results

What It Evaluates

  • FID (Fréchet Inception Distance): Measures distribution similarity between baseline and quantized images (lower is better)
  • CLIP Score: Measures semantic consistency between prompts and generated images (higher is better)
  • Bit-width Distribution: Number of layers at each precision level (4-bit, 8-bit, 16-bit)
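
The internals of evaluate_experiment.py are not shown here; the sketch below illustrates how these two metrics can be computed with torchmetrics, which is one common choice. The library, preprocessing, and CLIP backbone are assumptions, not necessarily what the script uses.

# Illustrative FID / CLIP computation with torchmetrics. This is a sketch of
# the metrics, not necessarily what evaluate_experiment.py does internally.
from pathlib import Path

import numpy as np
import torch
from PIL import Image
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

def load_images(folder):
    """Stack a folder's PNGs into a uint8 tensor of shape [N, 3, H, W]."""
    paths = sorted(Path(folder).glob("image_*.png"))
    return torch.from_numpy(
        np.stack([np.array(Image.open(p).convert("RGB")) for p in paths])
    ).permute(0, 3, 1, 2)

exp_dir = Path("experiments/2025-10-31_14-30-45_50prompts")
baseline = load_images(exp_dir / "baseline")
quantized = load_images(exp_dir / "budget_0.5")
# Assumes image_0000.png, image_0001.png, ... line up with prompts.txt order.
prompts = (exp_dir / "prompts.txt").read_text().splitlines()

# FID: distribution distance; the baseline images act as the "real" set.
fid = FrechetInceptionDistance(feature=2048)
fid.update(baseline, real=True)
fid.update(quantized, real=False)
print("FID:", fid.compute().item())

# CLIP Score: prompt-image agreement. Note torchmetrics reports a 0-100
# scale; divide by 100 to compare with the ~0.29 values in the tables here.
clip = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
clip.update(quantized, prompts)
print("CLIP:", clip.compute().item())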

Output Files

  • evaluation_results.csv: Table with all metrics
  • evaluation_results.json: Complete results with metadata

Example Output

================================================================================
EXPERIMENT EVALUATION SUMMARY
================================================================================
Experiment: 2025-11-03_13-27-00_100prompts
Number of prompts: 100
Baseline CLIP: 0.2960

----------------------------------------------------------------------------------------------------------------------------------
Budget   FID        CLIP         CLIP Δ     Target       Used         Util%    4-bit    8-bit    16-bit
----------------------------------------------------------------------------------------------------------------------------------
0.1      76.958     0.3025       -0.0065    398944.4     398932.1     99.9     14       50       250
0.2      5.936      0.2969       -0.0009    797888.8     797865.3     100.0    7        35       272
0.3      7.143      0.2969       -0.0009    1196833.2    1196798.4    99.9     4        27       283
0.4      -0.000     0.2960       0.0000     1595777.7    1595732.1    99.9     4        10       300
0.5      -0.000     0.2960       0.0000     1994722.1    1994665.8    99.9     3        11       300
0.6      -0.000     0.2960       0.0000     2393666.5    2393599.5    99.9     0        8        306
0.7      -0.000     0.2960       0.0000     2792610.9    2792533.2    99.9     0        4        310
0.8      -0.000     0.2960       0.0000     3191555.3    3191466.9    99.9     0        3        311
0.9      -0.000     0.2960       0.0000     3590499.7    3590400.6    99.9     0        2        312
----------------------------------------------------------------------------------------------------------------------------------

Interpretation:
  - FID: Lower is better (measures distribution similarity between baseline and quantized images)
  - CLIP: Higher is better (measures semantic consistency with prompts)
  - CLIP Δ: Degradation from baseline (lower is better)
  - Target: Target budget in GBOPs (budget_multiplier × max_cost)
  - Used: Actual BOPs used in GBOPs
  - Util%: Budget utilization (Used / Target × 100)
================================================================================
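
The budget columns follow directly from these definitions. A small worked example using the 0.2 row above; max_cost is back-solved from Target ÷ multiplier, since the script's actual full-precision cost constant is not shown in this document:

# Worked example for the budget columns, using the 0.2 row above.
# max_cost is back-solved from Target / multiplier (not a documented constant).
max_cost = 797888.8 / 0.2          # ~3989444.0 GBOPs at full precision
budget_multiplier = 0.2

target = budget_multiplier * max_cost   # Target column
used = 797865.3                         # Used column (actual BOPs of the allocation)
util = used / target * 100              # Util% column

print(f"Target={target:.1f}  Used={used:.1f}  Util={util:.1f}%")
# -> Target=797888.8  Used=797865.3  Util=100.0%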

CLI Options

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| experiment_dir | str | required | Path to experiment directory (positional) |
| --device | str | cuda | Device: cuda or cpu |
| --save_results | flag | - | Save results to CSV and JSON |
| --output_path | str | <experiment_dir>/evaluation_results.csv | Custom output path |

Tips and Best Practices

Prompt Selection

  • Use --prompt_seed for reproducibility
  • Start with 50-100 prompts for quick validation
  • Use 500-1000 prompts for final evaluation

Device Selection

  • CUDA (NVIDIA): Best performance, recommended
  • CPU: Very slow, only for testing

Budget Levels

  • Test 3-5 budgets initially: 0.3 0.5 0.7
  • Expand to 9 levels for comprehensive analysis: 0.1 0.2 ... 0.9
  • Focus on range where quality transitions occur

Generation Speed

  • ~10-15 seconds per image (NVIDIA A100)
  • ~20-30 seconds per image (NVIDIA RTX 3090)

Disk Space

  • Each PNG image: ~1-2 MB
  • 50 prompts × 10 configs = ~500-1000 MB
  • Plan accordingly for large experiments

Next Steps After Evaluation

  1. Plot quality vs cost curves: Visualize tradeoffs (see the plotting sketch after this list)

    • X-axis: Computational cost (GBOPs)
    • Y-axis: Quality (FID, CLIP)
    • Find optimal budget level
  2. Compare bit-width distributions: Analyze allocation patterns

    • Which layers are kept at high precision?
    • How does allocation change with budget?
  3. Select optimal budget: Balance quality and computational cost

    • Budget 0.3-0.4: Often provides best quality/cost tradeoff
    • Budget 0.5+: Minimal quality loss, higher cost
    • Budget <0.3: Significant quality degradation
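
A minimal plotting sketch for step 1, reading the CSV saved by evaluate_experiment.py. The column names used here ("fid", "clip_score", "used_gbops") are assumptions; check the actual CSV header the script writes.

# Minimal quality-vs-cost plot from evaluation_results.csv.
# Column names ("fid", "clip_score", "used_gbops") are assumptions; check
# the actual header produced by evaluate_experiment.py.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("experiments/2025-10-31_14-30-45_50prompts/evaluation_results.csv")

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(df["used_gbops"], df["fid"], marker="o")
ax1.set_xlabel("Computational cost (GBOPs)")
ax1.set_ylabel("FID (lower is better)")

ax2.plot(df["used_gbops"], df["clip_score"], marker="o")
ax2.set_xlabel("Computational cost (GBOPs)")
ax2.set_ylabel("CLIP Score (higher is better)")

fig.tight_layout()
fig.savefig("quality_vs_cost.png", dpi=150)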