Evaluation Strategy for Quantizing Qwen-VL #448
What
Let's talk about how to evaluate the quantized VL model.
First, running full benchmark evaluations after every change is prohibitively expensive when quantizing Qwen-VL. At the same time, relying on a single lightweight metric (e.g. perplexity) is insufficient for vision-language models (VLMs), where degradation often occurs in the vision encoder or multimodal fusion layers.
This issue proposes a two-stage evaluation strategy:
- Fast evaluation for rapid iteration and regression detection
- Final evaluation for quality validation and reportable results
Each stage has a clearly defined purpose and should not be used as a replacement for the other.
Design Principles
These design principles are a good starting point for deciding on the overall process.
- Separate iteration speed from evaluation rigor
- By separating iteration speed from evaluation rigor, we can iterate quickly during quantization development while still guaranteeing reliable, reportable metrics at the final stage.
- Use high-sensitivity, low-cost signals early
- Use official, comparable benchmarks only when needed
- Ensure FP16 vs quantized comparisons are fair and reproducible
Stage 1: Fast Evaluation (Iteration / Sanity Check)
- Quickly detect severe or obvious regressions caused by quantization
- Enable frequent experimentation with minimal cost
- Answer the question: “Did this quantization change break the model?”
Scope
- Small datasets
- Heuristic or relative metrics
- FP16 vs quantized comparison only
Methods
1. Text-only Perplexity (Optional)
- Run on a small text corpus if supported
- Used as a smoke test for language decoder degradation
- Known limitation: does not capture vision or multimodal failures
Example

```python
from transformers import AutoProcessor, AutoModelForVision2Seq, AutoTokenizer

from tico.quantization.wrapq.utils.metrics import perplexity

MODEL_ID = "Qwen/Qwen3-VL-4B-Instruct"

processor = AutoProcessor.from_pretrained(MODEL_ID, min_pixels=256*256, max_pixels=384*384)
tokenizer = getattr(processor, "tokenizer", None)
if tokenizer is None:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)

model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    device_map="cpu",
)
model.eval()

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "In a distant future, humans and machines coexist in uneasy peace.",
    "Explain the difference between supervised and unsupervised learning in simple terms.",
]

enc = tokenizer(texts[0], return_tensors="pt")
input_ids = enc["input_ids"].to(model.device)

ppl = perplexity(model, input_ids, model.device)
print("Text-only PPL:", ppl)
```

2. Mini VQA Evaluation
- Use a small subset (200–1,000 samples) from:
- VQAv2
- TextVQA
- (Optionally) DocVQA or OK-VQA
- Fixed prompt forcing the model to output only the final answer
- Metric:
- normalized exact match (or simple soft match)
- Highly sensitive to:
- vision encoder errors
- OCR degradation
- vision-language alignment issues
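The normalized-exact-match metric above could be sketched as follows. Note this is a simplified single-reference variant for fast iteration; the official VQAv2 metric uses ten annotator answers and a min(matches/3, 1) rule, so the names and normalization rules here (`normalize_answer`, `exact_match_score`, dropping articles and punctuation) are illustrative assumptions, not the official implementation:

```python
import re

# Hypothetical helper for the mini VQA pass: VQA-style answer
# normalization (lowercase, strip punctuation and articles) followed
# by exact match against a single reference answer.
_ARTICLES = {"a", "an", "the"}

def normalize_answer(text: str) -> str:
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    tokens = [t for t in text.split() if t not in _ARTICLES]
    return " ".join(tokens)

def exact_match_score(predictions, references) -> float:
    """Fraction of predictions whose normalized form equals the reference."""
    hits = sum(
        normalize_answer(p) == normalize_answer(r)
        for p, r in zip(predictions, references)
    )
    return hits / max(len(predictions), 1)

preds = ["The cat.", "two", "blue "]
refs = ["cat", "2", "blue"]
print(exact_match_score(preds, refs))  # "two" vs "2" does not match here
```

A soft-match variant could additionally map number words to digits before comparing.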
3. FP16 vs Quantized Output Agreement
- Evaluate on a fixed set of multimodal prompts
- Metrics:
- next-token top-1 agreement
- logit similarity (cosine or KL divergence)
- Does not require ground-truth labels
- Very sensitive to subtle regressions
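Both agreement metrics can be computed from per-position logits collected from the FP16 and quantized models on the same fixed prompts. A minimal sketch in plain Python (real code would operate on torch tensors; the function names and list-of-logit-vectors input shape are assumptions for illustration):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def top1_agreement(fp16_logits, quant_logits) -> float:
    """Fraction of positions where both models predict the same next token."""
    agree = sum(
        max(range(len(f)), key=f.__getitem__) ==
        max(range(len(q)), key=q.__getitem__)
        for f, q in zip(fp16_logits, quant_logits)
    )
    return agree / max(len(fp16_logits), 1)

def mean_kl(fp16_logits, quant_logits) -> float:
    """Mean KL(P_fp16 || P_quant) over positions, in nats."""
    total = 0.0
    for f, q in zip(fp16_logits, quant_logits):
        p, r = softmax(f), softmax(q)
        total += sum(pi * math.log(pi / ri) for pi, ri in zip(p, r) if pi > 0)
    return total / max(len(fp16_logits), 1)

# Two positions over a 3-token vocabulary; same argmax at both positions.
fp16 = [[2.0, 0.5, -1.0], [0.1, 1.2, 0.3]]
quant = [[1.9, 0.6, -0.8], [0.2, 1.1, 0.4]]
print(top1_agreement(fp16, quant))  # 1.0
print(mean_kl(fp16, quant))         # small but nonzero
```

KL divergence catches distribution shifts even when the argmax is unchanged, which is why it tends to flag subtle quantization regressions earlier than accuracy-style metrics.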
4. Basic Performance Metrics (Optional)
- Prefill latency (with image input)
- Decode throughput (tokens/sec)
- Peak memory usage
- Relative speedup vs FP16
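A generic timing sketch for these metrics, assuming `prefill` and `decode_one_token` are placeholder callables standing in for the actual model calls (they are not real Qwen-VL APIs); running it once on the FP16 model and once on the quantized model gives the relative speedup:

```python
import time

def measure(prefill, decode_one_token, num_decode_tokens=32):
    """Return (prefill latency in seconds, decode throughput in tokens/sec)."""
    t0 = time.perf_counter()
    prefill()  # image + prompt encoding step
    prefill_latency = time.perf_counter() - t0

    t1 = time.perf_counter()
    for _ in range(num_decode_tokens):
        decode_one_token()
    decode_time = time.perf_counter() - t1
    throughput = num_decode_tokens / decode_time

    return prefill_latency, throughput

# Dummy stand-ins to show the call shape:
lat, tps = measure(lambda: time.sleep(0.01), lambda: time.sleep(0.001))
print(f"prefill={lat*1000:.1f} ms, decode={tps:.0f} tok/s")
```

Peak memory would be read separately (e.g. `torch.cuda.max_memory_allocated()` on GPU), and timings should be averaged over several runs after a warmup pass.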
Stage 2: Final Evaluation (Validation / Reporting)
- Produce trustworthy, reportable metrics
- Confirm that quantization preserves end-to-end multimodal quality
- Answer the question: “Is this model ready for release or comparison?”
Scope
- Official datasets and splits
- Standardized evaluation protocols
- Absolute metrics suitable for documentation and comparison
Methods
1. Multimodal Benchmarks (Primary)
- Datasets:
- VQAv2
- TextVQA
- DocVQA (or equivalent)
- OK-VQA (optional)
- Protocol:
- official splits
- fixed prompting
- deterministic decoding (temperature = 0)
- Metrics:
- official accuracy (preferred)
- relative drop vs FP16 baseline
- category-level breakdowns when available
2. Language-only Evaluation (Secondary)
- Use LM Eval Harness (lm-eval) for standardized comparison
- Suggested tasks:
- MMLU
- HellaSwag
- ARC-Challenge
- Few-shot: 0 or low
- Prefer log-likelihood mode over generation
- Goal: ensure language decoder quality is preserved
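A possible lm-eval invocation for this stage, shown as a sketch (the model path is a placeholder, and flags should be checked against the installed lm-evaluation-harness version):

```shell
# Language-only check with lm-evaluation-harness, 0-shot,
# log-likelihood tasks; point `pretrained` at the quantized checkpoint.
lm_eval --model hf \
  --model_args pretrained=Qwen/Qwen3-VL-4B-Instruct \
  --tasks mmlu,hellaswag,arc_challenge \
  --num_fewshot 0 \
  --batch_size 8
```

Running the same command against the FP16 baseline gives the relative-drop numbers for the report.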