Introduce prefill_reference attention variant for faster PTQ evaluation #551
Description
Motivation
The current prefill attention wrapper is implemented with head unrolling in order to remove transpose operations and better match the expected execution pattern on accelerators.
However, this implementation is significantly slower during GPU evaluation, because attention is computed head-by-head in a loop. This becomes problematic when running large PTQ experiments where many quantization configurations must be evaluated.
In practice, we observed that evaluation can be ~5× faster when attention is implemented using a more GPU-friendly batched formulation.
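The two computation patterns can be sketched as follows. This is a minimal NumPy illustration of the difference (the function names, tensor layout `[heads, seq, dim]`, and scaling are assumptions for illustration, not the actual wrapper code); both formulations are mathematically equivalent, but the batched one maps to a single large matmul on GPU.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head_unrolled(q, k, v):
    """Reference-style attention: one head at a time, no batched transpose.
    q, k, v: [heads, seq, dim] (simplified layout for illustration)."""
    outs = []
    for h in range(q.shape[0]):  # head-wise loop, as in the prefill wrapper
        logits = q[h] @ k[h].T / np.sqrt(q.shape[-1])
        outs.append(softmax(logits) @ v[h])
    return np.stack(outs)

def attention_batched(q, k, v):
    """GPU-friendly variant: all heads in a single batched matmul."""
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # transpose reintroduced
    return softmax(logits) @ v
```

In float arithmetic the two produce the same outputs; the speedup comes purely from replacing a Python-level loop over heads with one batched operation.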
Problem
The accelerated implementation is not strictly equivalent to the current reference execution path.
In particular:
- the computation structure changes (head-wise loop → batched attention)
- `transpose` is reintroduced
- GQA expansion is implemented via `repeat_interleave`
- cache is not supported
- intermediate tensors are grouped differently
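The GQA expansion mentioned above can be sketched in NumPy, where `np.repeat` along the head axis mirrors `torch.repeat_interleave` (the function name and layout here are illustrative assumptions):

```python
import numpy as np

def expand_kv(k, n_query_heads):
    """Expand grouped KV heads so each query head gets a matching KV head.

    k: [n_kv_heads, seq, dim]. np.repeat along axis 0 behaves like
    torch.repeat_interleave(k, group, dim=0): each KV head is duplicated
    `group` times in place, so consecutive query heads share one KV head.
    """
    group = n_query_heads // k.shape[0]
    return np.repeat(k, group, axis=0)  # -> [n_query_heads, seq, dim]
```

This materializes the expanded KV tensor, which is part of why the accelerated path groups intermediate tensors differently than the head-unrolled reference.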
Additionally, observers are applied differently.
In the current prefill implementation, attention is computed head-by-head and observers / fake quantization are applied to each head’s intermediate tensors individually (e.g., logits, attention weights, and outputs).
In the accelerated implementation, attention is computed across all heads simultaneously and observers are applied to the aggregated tensors instead.
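The consequence of this difference can be shown with a toy min/max observer (a hypothetical stand-in for a real fake-quantization observer, used here only to illustrate calibration-range behavior):

```python
import numpy as np

class MinMaxObserver:
    """Toy stand-in for a quantization observer: tracks the running min/max
    range from which quantization scales would be derived (illustration only)."""
    def __init__(self):
        self.lo, self.hi = np.inf, -np.inf
    def observe(self, x):
        self.lo = min(self.lo, float(x.min()))
        self.hi = max(self.hi, float(x.max()))

rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 8, 8))  # per-head attention logits [heads, q, k]

# prefill path: one observer per head, each calibrated on that head alone
per_head = [MinMaxObserver() for _ in range(4)]
for h, obs in enumerate(per_head):
    obs.observe(logits[h])

# accelerated path: a single observer sees the aggregated all-heads tensor
aggregated = MinMaxObserver()
aggregated.observe(logits)
```

Each per-head range is contained in the aggregated range but is generally narrower, so the quantization parameters derived from them differ between the two paths even on identical inputs.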
Because of this, the accelerated implementation should be considered an approximate evaluation path, rather than a strict acceleration of the reference quantized execution. While the final task results are typically very similar, intermediate quantization behavior is not guaranteed to be identical.
Proposed Solution
Introduce a separate evaluation-oriented attention variant, for example:
- `prefill` — reference implementation
- `prefill_reference` — fast evaluation implementation
The roles would be:
`prefill`
- hardware-faithful implementation
- head-unrolled attention
- used for final evaluation and reported results
`prefill_reference`
- GPU-friendly batched attention
- significantly faster evaluation
- intended for experimentation and rapid iteration
Suggested Workflow
- Run most PTQ experiments using `prefill_reference` for fast iteration.
- Identify promising quantization configurations.
- Re-run evaluation with `prefill` to obtain final reported results.
This allows experimentation to remain fast while ensuring that final results reflect the reference execution path.
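A minimal sketch of this two-phase loop (the `evaluate` function, its scoring, the variant strings, and the config tuples are all placeholders, not a real API):

```python
def evaluate(config, attention_impl):
    """Placeholder for a real PTQ evaluation run; returns a task score.
    Toy scoring: more bits -> higher score (illustration only)."""
    weight_bits, act_bits = config
    return weight_bits + act_bits

configs = [(4, 4), (4, 8), (8, 8)]  # hypothetical (weight_bits, act_bits) pairs

# Phase 1: screen every config quickly with the batched variant
scores = {c: evaluate(c, "prefill_reference") for c in configs}
shortlist = sorted(scores, key=scores.get, reverse=True)[:2]

# Phase 2: re-run only the shortlist with the hardware-faithful variant
final = {c: evaluate(c, "prefill") for c in shortlist}
```

The expensive reference path is only paid for the handful of configurations that survive screening.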
Additional Motivation
Another important reason for separating these implementations is to avoid drifting too far from the reference implementation.
If the existing prefill wrapper is heavily modified to optimize evaluation speed, it becomes harder to return to the original baseline behavior later when investigating accuracy issues or exploring other performance optimizations.
By keeping the two variants separate:
- `prefill` remains a stable reference implementation
- `prefill_reference` can evolve independently for evaluation performance
This preserves flexibility and gives us more options when debugging or experimenting with future optimizations.