
Introduce prefill_reference attention variant for faster PTQ evaluation #551

@mhs4670go

Description


Motivation

The current prefill attention wrapper is implemented with head unrolling in order to remove transpose operations and better match the expected execution pattern on accelerators.

However, this implementation is significantly slower during GPU evaluation, because attention is computed head-by-head in a loop. This becomes problematic when running large PTQ experiments where many quantization configurations must be evaluated.

In practice, we observed that evaluation can be ~5× faster when attention is implemented using a more GPU-friendly batched formulation.
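To make the performance gap concrete, here is a minimal sketch of the two computation structures (shapes and code are illustrative, not the project's actual wrapper; the real head-unrolled wrapper also removes transposes, which this sketch keeps for brevity):

```python
import torch

def attention_unrolled(q, k, v):
    # Head-by-head loop, mirroring the structure of the current prefill
    # wrapper: many small matmul kernels, one group per head.
    outs = []
    for h in range(q.shape[1]):
        logits = q[:, h] @ k[:, h].transpose(-1, -2) / q.shape[-1] ** 0.5
        outs.append(torch.softmax(logits, dim=-1) @ v[:, h])
    return torch.stack(outs, dim=1)

def attention_batched(q, k, v):
    # GPU-friendly batched formulation: one large matmul over all heads.
    logits = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
    return torch.softmax(logits, dim=-1) @ v

# Both paths produce the same floating-point result (up to tolerance);
# only the kernel launch pattern differs.
q, k, v = (torch.randn(1, 8, 16, 64) for _ in range(3))
assert torch.allclose(attention_unrolled(q, k, v), attention_batched(q, k, v), atol=1e-5)
```

The numerical outputs match in floating point; as described below, the divergence appears only once per-head observers and fake quantization enter the picture.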

Problem

The accelerated implementation is not strictly equivalent to the current reference execution path.

In particular:

  • the computation structure changes (head-wise loop → batched attention)
  • transpose is reintroduced
  • GQA expansion is implemented via repeat_interleave
  • the KV cache is not supported
  • intermediate tensors are grouped differently
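The GQA expansion point can be sketched as follows (shapes and head counts are assumed for illustration; this is not the project's code):

```python
import torch

n_heads, n_kv_heads, seq, dim = 8, 2, 16, 64
group = n_heads // n_kv_heads  # query heads sharing each KV head

k = torch.randn(1, n_kv_heads, seq, dim)

# repeat_interleave duplicates each KV head `group` times along the head
# axis, so the batched attention can treat the tensor as if it had a full
# set of n_heads KV heads.
k_expanded = k.repeat_interleave(group, dim=1)
assert k_expanded.shape == (1, n_heads, seq, dim)

# Query heads 0..3 all attend to KV head 0, heads 4..7 to KV head 1.
assert torch.equal(k_expanded[:, 0], k_expanded[:, group - 1])
```

In the head-unrolled path this duplication is unnecessary, because the loop can simply index the correct KV head for each query head.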

Additionally, observers are applied differently.

In the current prefill implementation, attention is computed head-by-head and observers / fake quantization are applied to each head’s intermediate tensors individually (e.g., logits, attention weights, and outputs).

In the accelerated implementation, attention is computed across all heads simultaneously and observers are applied to the aggregated tensors instead.
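The practical consequence of this difference can be illustrated with a toy observer (the `MinMaxObserver` class here is invented for illustration; the project's observer API may differ):

```python
import torch

class MinMaxObserver:
    """Toy observer: tracks a running min/max over observed tensors."""
    def __init__(self):
        self.min, self.max = float("inf"), float("-inf")
    def observe(self, t):
        self.min = min(self.min, t.min().item())
        self.max = max(self.max, t.max().item())

logits = torch.randn(1, 8, 16, 16)  # (batch, heads, q_len, k_len)

# prefill-style: each head gets its own observer, hence its own range.
per_head = [MinMaxObserver() for _ in range(logits.shape[1])]
for h, obs in enumerate(per_head):
    obs.observe(logits[:, h])

# accelerated-style: one observer sees the aggregated tensor.
agg = MinMaxObserver()
agg.observe(logits)

# The aggregated range is the union of the per-head ranges, so per-head
# quantization parameters are generally at least as tight.
assert agg.min == min(o.min for o in per_head)
assert agg.max == max(o.max for o in per_head)
```

This is why the two paths can produce slightly different quantization parameters even though the underlying floating-point attention outputs match.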

Because of this, the accelerated implementation should be considered an approximate evaluation path, rather than a strict acceleration of the reference quantized execution. While the final task results are typically very similar, intermediate quantization behavior is not guaranteed to be identical.

Proposed Solution

Introduce a separate evaluation-oriented attention variant, for example:

  • prefill — existing hardware-faithful reference implementation
  • prefill_reference — new GPU-friendly fast evaluation implementation

The roles would be:

prefill

  • hardware-faithful implementation
  • head-unrolled attention
  • used for final evaluation and reported results

prefill_reference

  • GPU-friendly batched attention
  • significantly faster evaluation
  • intended for experimentation and rapid iteration

Suggested Workflow

  1. Run most PTQ experiments using prefill_reference for fast iteration.
  2. Identify promising quantization configurations.
  3. Re-run evaluation with prefill to obtain final reported results.
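The workflow above could look roughly like this in driver code. Everything here is hypothetical: `eval_ptq`, its signature, the `attention` keyword, and the configurations are stand-ins for whatever the project's real entry points turn out to be.

```python
def eval_ptq(config, attention):
    # Stand-in scorer: a real run would quantize the model with `config`,
    # evaluate it using the requested attention variant, and return a metric.
    return sum(config["bits"]) / len(config["bits"])

configs = [{"bits": [8, 8]}, {"bits": [4, 8]}, {"bits": [4, 4]}]

# 1. Fast sweep over all configurations with the batched variant.
scores = {i: eval_ptq(cfg, attention="prefill_reference")
          for i, cfg in enumerate(configs)}

# 2. Keep only the most promising configurations.
best = sorted(scores, key=scores.get, reverse=True)[:2]

# 3. Re-evaluate finalists on the hardware-faithful path for reported results.
final = {i: eval_ptq(configs[i], attention="prefill") for i in best}
```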

This allows experimentation to remain fast while ensuring that final results reflect the reference execution path.

Additional Motivation

Another important reason for separating these implementations is to avoid drifting too far from the reference implementation.

If the existing prefill wrapper is heavily modified to optimize evaluation speed, it becomes harder to return to the original baseline behavior later when investigating accuracy issues or exploring other performance optimizations.

By keeping the two variants separate:

  • prefill remains a stable reference implementation
  • prefill_reference can evolve independently for evaluation performance

This preserves flexibility and gives us more options when debugging or experimenting with future optimizations.
