
Introduce prefill_reference attention variant for faster PTQ evaluation #551

@mhs4670go

Description


Motivation

The current prefill attention wrapper is implemented with head unrolling in order to remove transpose operations and better match the expected execution pattern on accelerators.

However, this implementation is significantly slower during GPU evaluation, because attention is computed head-by-head in a loop. This becomes problematic when running large PTQ experiments where many quantization configurations must be evaluated.

In practice, we observed that evaluation can be ~5× faster when attention is implemented using a more GPU-friendly batched formulation.
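To make the performance gap concrete, here is a minimal sketch of the two computation structures (shapes and code are illustrative, not the project's actual wrapper; the real head-unrolled wrapper also removes transposes, which this sketch keeps for brevity):

```python
import torch

def attention_unrolled(q, k, v):
    # Head-by-head loop, mirroring the structure of the current prefill
    # wrapper: many small matmul kernels, one group per head.
    outs = []
    for h in range(q.shape[1]):
        logits = q[:, h] @ k[:, h].transpose(-1, -2) / q.shape[-1] ** 0.5
        outs.append(torch.softmax(logits, dim=-1) @ v[:, h])
    return torch.stack(outs, dim=1)

def attention_batched(q, k, v):
    # GPU-friendly batched formulation: one large matmul over all heads.
    logits = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    return torch.softmax(logits, dim=-1) @ v

# Both paths produce the same floating-point result (up to tolerance);
# only the kernel launch pattern differs.
q, k, v = (torch.randn(1, 8, 16, 64) for _ in range(3))
assert torch.allclose(attention_unrolled(q, k, v), attention_batched(q, k, v), atol=1e-5)
```

The numerical outputs match in floating point; as described below, the divergence appears only once per-head observers and fake quantization enter the picture.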

Problem

The accelerated implementation is not strictly equivalent to the current reference execution path.

In particular:

  • the computation structure changes (head-wise loop → batched attention)
  • transpose is reintroduced
  • GQA expansion is implemented via repeat_interleave
  • the KV cache is not supported
  • intermediate tensors are grouped differently
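The GQA expansion point can be sketched as follows (shapes and head counts are assumed for illustration; this is not the project's code):

```python
import torch

n_heads, n_kv_heads, seq, dim = 8, 2, 16, 64
group = n_heads // n_kv_heads  # query heads sharing each KV head

k = torch.randn(1, n_kv_heads, seq, dim)

# repeat_interleave duplicates each KV head `group` times along the head
# axis, so the batched attention can treat the tensor as if it had a full
# set of n_heads KV heads.
k_expanded = k.repeat_interleave(group, dim=1)
assert k_expanded.shape == (1, n_heads, seq, dim)

# Query heads 0..3 all attend to KV head 0, heads 4..7 to KV head 1.
assert torch.equal(k_expanded[:, 0], k_expanded[:, group - 1])
```

In the head-unrolled path this duplication is unnecessary, because the loop can simply index the correct KV head for each query head.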

Additionally, observers are applied differently.

In the current prefill implementation, attention is computed head-by-head and observers / fake quantization are applied to each head’s intermediate tensors individually (e.g., logits, attention weights, and outputs).

In the accelerated implementation, attention is computed across all heads simultaneously and observers are applied to the aggregated tensors instead.
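The practical consequence of this difference can be illustrated with a toy observer (the `MinMaxObserver` class here is invented for illustration; the project's observer API may differ):

```python
import torch

class MinMaxObserver:
    """Toy observer: tracks a running min/max over observed tensors."""
    def __init__(self):
        self.min, self.max = float("inf"), float("-inf")
    def observe(self, t):
        self.min = min(self.min, t.min().item())
        self.max = max(self.max, t.max().item())

logits = torch.randn(1, 8, 16, 16)  # (batch, heads, q_len, k_len)

# prefill-style: each head gets its own observer, hence its own range.
per_head = [MinMaxObserver() for _ in range(logits.shape[1])]
for h, obs in enumerate(per_head):
    obs.observe(logits[:, h])

# accelerated-style: one observer sees the aggregated tensor.
agg = MinMaxObserver()
agg.observe(logits)

# The aggregated range is the union of the per-head ranges, so per-head
# quantization parameters are generally at least as tight.
assert agg.min == min(o.min for o in per_head)
assert agg.max == max(o.max for o in per_head)
```

This is why the two paths can produce slightly different quantization parameters even though the underlying floating-point attention outputs match.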

Because of this, the accelerated implementation should be considered an approximate evaluation path, rather than a strict acceleration of the reference quantized execution. While the final task results are typically very similar, intermediate quantization behavior is not guaranteed to be identical.

Proposed Solution

Introduce a separate evaluation-oriented attention variant, for example:

  • prefill — existing hardware-faithful reference implementation
  • prefill_reference — new GPU-friendly fast evaluation implementation

The roles would be:

prefill

  • hardware-faithful implementation
  • head-unrolled attention
  • used for final evaluation and reported results

prefill_reference

  • GPU-friendly batched attention
  • significantly faster evaluation
  • intended for experimentation and rapid iteration

Suggested Workflow

  1. Run most PTQ experiments using prefill_reference for fast iteration.
  2. Identify promising quantization configurations.
  3. Re-run evaluation with prefill to obtain final reported results.
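The workflow above could look roughly like this in driver code. Everything here is hypothetical: `eval_ptq`, its signature, the `attention` keyword, and the configurations are stand-ins for whatever the project's real entry points turn out to be.

```python
def eval_ptq(config, attention):
    # Stand-in scorer: a real run would quantize the model with `config`,
    # evaluate it using the requested attention variant, and return a metric.
    return sum(config["bits"]) / len(config["bits"])

configs = [{"bits": [8, 8]}, {"bits": [4, 8]}, {"bits": [4, 4]}]

# 1. Fast sweep over all configurations with the batched variant.
scores = {i: eval_ptq(cfg, attention="prefill_reference")
          for i, cfg in enumerate(configs)}

# 2. Keep only the most promising configurations.
best = sorted(scores, key=scores.get, reverse=True)[:2]

# 3. Re-evaluate finalists on the hardware-faithful path for reported results.
final = {i: eval_ptq(configs[i], attention="prefill") for i in best}
```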

This allows experimentation to remain fast while ensuring that final results reflect the reference execution path.

Additional Motivation

Another important reason for separating these implementations is to avoid drifting too far from the reference implementation.

If the existing prefill wrapper is heavily modified to optimize evaluation speed, it becomes harder to return to the original baseline behavior later when investigating accuracy issues or exploring other performance optimizations.

By keeping the two variants separate:

  • prefill remains a stable reference implementation
  • prefill_reference can evolve independently for evaluation performance

This preserves flexibility and gives us more options when debugging or experimenting with future optimizations.
