Evaluation Strategy for Quantizing Qwen-VL #448
What
Let's talk about how to evaluate the quantized VL model.
First, running full benchmark evaluations after every change is prohibitively expensive when quantizing Qwen-VL. At the same time, relying on a single lightweight metric (e.g. perplexity) is insufficient for vision-language models (VLMs), where degradation often occurs in the vision encoder or multimodal fusion layers.
This issue proposes a two-stage evaluation strategy:
- Fast evaluation for rapid iteration and regression detection
- Final evaluation for quality validation and reportable results
Each stage has a clearly defined purpose and should not be used as a replacement for the other.
Design Principles
These design principles are a good starting point for deciding on the overall process.
- Separate iteration speed from evaluation rigor
- By separating iteration speed from evaluation rigor, we can iterate quickly during quantization development while still guaranteeing reliable, reportable metrics at the final stage.
- Use high-sensitivity, low-cost signals early
- Use official, comparable benchmarks only when needed
- Ensure FP16 vs quantized comparisons are fair and reproducible
Stage 1: Fast Evaluation (Iteration / Sanity Check)
- Quickly detect severe or obvious regressions caused by quantization
- Enable frequent experimentation with minimal cost
- Answer the question: “Did this quantization change break the model?”
Scope
- Small datasets
- Heuristic or relative metrics
- FP16 vs quantized comparison only
Methods
1. Text-only Perplexity (Optional)
- Run on a small text corpus if supported
- Used as a smoke test for language decoder degradation
- Known limitation: does not capture vision or multimodal failures
Example

```python
from transformers import AutoProcessor, AutoModelForVision2Seq, AutoTokenizer

from tico.quantization.wrapq.utils.metrics import perplexity

MODEL_ID = "Qwen/Qwen3-VL-4B-Instruct"

processor = AutoProcessor.from_pretrained(MODEL_ID, min_pixels=256*256, max_pixels=384*384)
tokenizer = getattr(processor, "tokenizer", None)
if tokenizer is None:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)

model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    device_map="cpu",
)
model.eval()

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "In a distant future, humans and machines coexist in uneasy peace.",
    "Explain the difference between supervised and unsupervised learning in simple terms.",
]

enc = tokenizer(texts[0], return_tensors="pt")
input_ids = enc["input_ids"].to(model.device)

ppl = perplexity(model, input_ids, model.device)
print("Text-only PPL:", ppl)
```

2. Mini VQA Evaluation
- Use a small subset (200–1,000 samples) from:
- VQAv2
- TextVQA
- (Optionally) DocVQA or OK-VQA
- Fixed prompt forcing the model to output only the final answer
- Metric:
- normalized exact match (or simple soft match)
- Highly sensitive to:
- vision encoder errors
- OCR degradation
- vision-language alignment issues
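The normalized-exact-match metric above could be sketched as follows. Note this is a simplified single-reference variant for fast iteration; the official VQAv2 metric uses ten annotator answers and a min(matches/3, 1) rule, so the names and normalization rules here (`normalize_answer`, `exact_match_score`, dropping articles and punctuation) are illustrative assumptions, not the official implementation:

```python
import re

# Hypothetical helper for the mini VQA pass: VQA-style answer
# normalization (lowercase, strip punctuation and articles) followed
# by exact match against a single reference answer.
_ARTICLES = {"a", "an", "the"}

def normalize_answer(text: str) -> str:
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    tokens = [t for t in text.split() if t not in _ARTICLES]
    return " ".join(tokens)

def exact_match_score(predictions, references) -> float:
    """Fraction of predictions whose normalized form equals the reference."""
    hits = sum(
        normalize_answer(p) == normalize_answer(r)
        for p, r in zip(predictions, references)
    )
    return hits / max(len(predictions), 1)

preds = ["The cat.", "two", "blue "]
refs = ["cat", "2", "blue"]
print(exact_match_score(preds, refs))  # "two" vs "2" does not match here
```

A soft-match variant could additionally map number words to digits before comparing.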
3. FP16 vs Quantized Output Agreement
- Evaluate on a fixed set of multimodal prompts
- Metrics:
- next-token top-1 agreement
- logit similarity (cosine or KL divergence)
- Does not require ground-truth labels
- Very sensitive to subtle regressions
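Both agreement metrics can be computed from per-position logits collected from the FP16 and quantized models on the same fixed prompts. A minimal sketch in plain Python (real code would operate on torch tensors; the function names and list-of-logit-vectors input shape are assumptions for illustration):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def top1_agreement(fp16_logits, quant_logits) -> float:
    """Fraction of positions where both models predict the same next token."""
    agree = sum(
        max(range(len(f)), key=f.__getitem__) ==
        max(range(len(q)), key=q.__getitem__)
        for f, q in zip(fp16_logits, quant_logits)
    )
    return agree / max(len(fp16_logits), 1)

def mean_kl(fp16_logits, quant_logits) -> float:
    """Mean KL(P_fp16 || P_quant) over positions, in nats."""
    total = 0.0
    for f, q in zip(fp16_logits, quant_logits):
        p, r = softmax(f), softmax(q)
        total += sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0)
    return total / max(len(fp16_logits), 1)

# Two positions over a 3-token vocabulary; same argmax at both positions.
fp16 = [[2.0, 0.5, -1.0], [0.1, 1.2, 0.3]]
quant = [[1.9, 0.6, -0.8], [0.2, 1.1, 0.4]]
print(top1_agreement(fp16, quant))  # 1.0
print(mean_kl(fp16, quant))         # small but nonzero
```

KL divergence catches distribution shifts even when the argmax is unchanged, which is why it tends to flag subtle quantization regressions earlier than accuracy-style metrics.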
4. Basic Performance Metrics (Optional)
- Prefill latency (with image input)
- Decode throughput (tokens/sec)
- Peak memory usage
- Relative speedup vs FP16
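A generic timing sketch for these metrics, assuming `prefill` and `decode_one_token` are placeholder callables standing in for the actual model calls (they are not real Qwen-VL APIs); running it once on the FP16 model and once on the quantized model gives the relative speedup:

```python
import time

def measure(prefill, decode_one_token, num_decode_tokens=32):
    """Return (prefill latency in seconds, decode throughput in tokens/sec)."""
    t0 = time.perf_counter()
    prefill()  # image + prompt encoding step
    prefill_latency = time.perf_counter() - t0

    t1 = time.perf_counter()
    for _ in range(num_decode_tokens):
        decode_one_token()
    decode_time = time.perf_counter() - t1
    throughput = num_decode_tokens / decode_time

    return prefill_latency, throughput

# Dummy stand-ins to show the call shape:
lat, tps = measure(lambda: time.sleep(0.01), lambda: time.sleep(0.001))
print(f"prefill={lat*1000:.1f} ms, decode={tps:.0f} tok/s")
```

Peak memory would be read separately (e.g. `torch.cuda.max_memory_allocated()` on GPU), and timings should be averaged over several runs after a warmup pass.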
Stage 2: Final Evaluation (Validation / Reporting)
- Produce trustworthy, reportable metrics
- Confirm that quantization preserves end-to-end multimodal quality
- Answer the question: “Is this model ready for release or comparison?”
Scope
- Official datasets and splits
- Standardized evaluation protocols
- Absolute metrics suitable for documentation and comparison
Methods
1. Multimodal Benchmarks (Primary)
- Datasets:
- VQAv2
- TextVQA
- DocVQA (or equivalent)
- OK-VQA (optional)
- Protocol:
- official splits
- fixed prompting
- deterministic decoding (temperature = 0)
- Metrics:
- official accuracy (preferred)
- relative drop vs FP16 baseline
- category-level breakdowns when available
2. Language-only Evaluation (Secondary)
- Use LM Eval Harness (lm-eval) for standardized comparison
- Suggested tasks:
- MMLU
- HellaSwag
- ARC-Challenge
- Few-shot: 0 or low
- Prefer log-likelihood mode over generation
- Goal: ensure language decoder quality is preserved
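A possible lm-eval invocation for this stage, shown as a sketch (the model path is a placeholder, and flags should be checked against the installed lm-evaluation-harness version):

```shell
# Language-only check with lm-evaluation-harness, 0-shot,
# log-likelihood tasks; point `pretrained` at the quantized checkpoint.
lm_eval --model hf \
  --model_args pretrained=Qwen/Qwen3-VL-4B-Instruct \
  --tasks mmlu,hellaswag,arc_challenge \
  --num_fewshot 0 \
  --batch_size 8
```

Running the same command against the FP16 baseline gives the relative-drop numbers for the report.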