This fork contains the implementation accompanying our paper:
Attention-Level Speculation
Authors: Jack Cai, Ammar Vora, Randolph Zhang, Mark O'Conner, Mark C. Jeffrey.
Conference Paper (Link Coming Soon!)
As Large Language Models (LLMs) scale in size and context length, inference latency becomes a bottleneck. Traditional tensor and data parallelism approaches struggle to scale efficiently across multiple devices.
Attention-Level Speculation (ALSpec) introduces a novel speculative execution mechanism that predicts the outputs of self-attention layers. This allows overlapping attention and non-attention computations—unlocking parallelism across layers that were previously executed sequentially. As a result, ALSpec achieves:
- Up to 5× reduction in attention latency at 128K sequence length
- Up to 1.65× improvement in end-to-end decode latency at large sequence lengths
- Maintains model quality, even with substantial attention approximation
- Proof-of-concept demonstrated on real hardware using Tenstorrent’s Wormhole™ n300 platform
This is a fork of TT-Metalium™, the open-source software stack used for high-performance ML inference on Tenstorrent hardware. This fork provides an example implementation of ALSpec, designed to demonstrate the feasibility of speculative self-attention on real silicon.
Note: This proof-of-concept targets a two-device system and currently does not integrate with tensor parallelism. Due to on-chip memory constraints, the maximum tested context length in this implementation is 32K.
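To make the mechanism concrete, here is a minimal sketch of the idea in plain PyTorch, assuming a simple recent-window predictor for the speculative attention output. The function, the predictor, and the acceptance test are illustrative assumptions only and are not the APIs provided by this fork.

```python
import torch
import torch.nn.functional as F

def speculative_attention_step(q, k_cache, v_cache, window=1024, tol=1e-2):
    """Illustrative sketch of attention-level speculation (not this fork's API).

    Shapes: q is [batch, heads, 1, dim]; k_cache and v_cache are [batch, heads, seq, dim].
    """
    # 1) Speculate: cheaply predict the attention output, here by attending over
    #    only the most recent `window` KV-cache entries (one possible predictor).
    speculative = F.scaled_dot_product_attention(
        q, k_cache[:, :, -window:], v_cache[:, :, -window:]
    )

    # 2) In ALSpec, downstream non-attention ops (MLP, next layer's QKV projections)
    #    consume `speculative` while exact attention over the full cache runs
    #    concurrently; the two calls are shown sequentially here for clarity.
    exact = F.scaled_dot_product_attention(q, k_cache, v_cache)

    # 3) Verify: if the prediction was close enough, the overlapped work is kept;
    #    otherwise downstream ops are re-executed from the exact output.
    accepted = torch.allclose(speculative, exact, atol=tol)
    return exact, accepted
```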
- 🧠 ALSpec end-to-end demo:
  - Llama-3.1-8B-Instruct integrated with ALSpec
  - Runs on a dual-device Wormhole™ n300 setup
  - Enabled via the `ALSpec` class
- ⚙️ `ttnn.experimental.speculative_scaled_dot_product_attention_decode`: Speculative flash-attention kernel implementation for Tenstorrent hardware
- 🔄 `ttnn.experimental.swap_tensor_async`: Custom CCL operation to synchronize residual and priority tensors across devices
- 🔧 `device.set_speculation_mode(...)`: Runtime modification enabling static graph dynamic concurrency (SGDC) with priority gating
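For orientation, the sketch below shows one way these components might compose in a single decode step. The argument lists, return values, and call order are assumptions made for illustration; consult the `ALSpec` class in this fork for the actual interfaces.

```python
import ttnn

def alspec_decode_step(device, q, k_cache, v_cache, residual):
    """Hypothetical composition of the components listed above; signatures are assumed."""
    # Enable static graph dynamic concurrency (SGDC) with priority gating on the device.
    device.set_speculation_mode(True)

    # Speculative flash-attention decode: emits a predicted attention output early so
    # that non-attention ops can proceed while the exact result is still in flight.
    attn_out = ttnn.experimental.speculative_scaled_dot_product_attention_decode(
        q, k_cache, v_cache
    )

    # Custom CCL op: exchange residual and priority tensors between the two n300
    # devices so the speculation can be verified or corrected.
    residual = ttnn.experimental.swap_tensor_async(residual)

    return attn_out, residual
```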
The original TT-Metalium™ README follows below.
Last Update: April 22, 2025
Notes:
- ttft = time to first token | t/s/u = tokens/second/user | t/s = tokens/second; where t/s = t/s/u * batch.
- TP = Tensor Parallel, DP = Data Parallel; Defines parallelization factors across multiple devices.
- The reported LLM performance is for an input sequence length (number of rows filled in the KV cache) of 128 for all models except Mamba (which can accept any sequence length).
- The t/s/u reported is the throughput of the first token generated after prefill, i.e. 1 / inter-token latency.
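As a worked example of the throughput formula above: the Whisper entry below reports 54.7 t/s/u at batch 1, so t/s = 54.7 × 1 = 54.7, while a hypothetical model running at 20 t/s/u with a batch of 32 would report t/s = 20 × 32 = 640.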
| Model | Batch | Hardware | ttft (ms) | t/s/u | Target t/s/u | t/s | TT-Metalium Release |
|---|---|---|---|---|---|---|---|
| Whisper (distil-large-v3) | 1 | n150 | 244 | 54.7 | 45 | 54.7 | v0.57.0-rc71 |
| Model | Batch | Hardware | fps | Target fps | Release |
|---|---|---|---|---|---|
| ResNet-50 (224x224) | 16 | n150 | 4,700 | 7,000 | |
| ResNet-50 (224x224) (DP=2) | 32 | n300 | 9,200 | 14,000 | |
| ResNet-50 (224x224) (DP=8) | 128 | QuietBox | 35,800 | 56,000 | |
| ResNet-50 (224x224) (DP=32) | 512 | Galaxy | 96,800 | 224,000 | |
| ViT (224x224) | 8 | n150 | 1,100 | 1,600 | |
| Stable Diffusion 1.4 (512x512) | 1 | n150 | 0.117 | 0.3 | |
| YOLOv4 (320x320) | 1 | n150 | 120 | 300 | |
| YOLOv4 (640x640) | 1 | n150 | 50 | 100 | |
| SegFormer Semantic Segmentation (512x512) | 1 | n150 | 90 | 300 | |
| Stable Diffusion 3.5 medium (512x512) | 1 | n150 | 0.06 | 0.3 | |
Notes:
- Stable Diffusion FPS is based on the time elapsed from submitting the input prompt to receiving the image from the VAE decoder.
| Model | Batch | Hardware | sen/sec | Target sen/sec | Release |
|---|---|---|---|---|---|
| BERT-Large | 8 | n150 | 270 | 400 | |
For the latest model updates and features, please see MODEL_UPDATES.md
For information on initial model procedures, please see Model Bring-Up and Testing
- Advanced Performance Optimizations for Models (updated March 4th, 2025)
- Programming Mesh of Devices (updated Sept 9th, 2024)
- ViT Implementation in TT-NN on GS (updated Sept 22nd, 2024)
- LLMs Bring up in TT-NN (updated Oct 29th, 2024)
- YOLOv4 Implementation in TT-NN on WH (updated November 8th, 2024)
- CNN Bring up & Optimization in TT-NN (updated Jan 22nd, 2025)
- Matrix Multiply FLOPS on WH (updated November 13th, 2024)
TT-Metalium is our low-level programming model, enabling kernel development for Tenstorrent hardware.
Get started with simple kernels.
- Matrix Engine (updated Sept 6th, 2024)
- Data Formats (updated Sept 7th, 2024)
- Reconfiguring Data Formats (updated Oct 17th, 2024)
- Handling special floating-point numbers (updated Oct 5th, 2024)
- Allocator (Updated Dec 19th, 2024)
- Tensor Layouts (updated Sept 6th, 2024)
- Saturating DRAM Bandwidth (updated Sept 6th, 2024)
- Flash Attention on Wormhole (updated Sept 6th, 2024)
- CNNs on TT Architectures (updated Sept 6th, 2024)
- Ethernet and Multichip Basics (Updated Sept 20th, 2024)
- Collective Communication Library (CCL) (Updated Sept 20th, 2024)
- Blackhole Bring-Up Programming Guide (Updated Dec 18th, 2024)
- Sub-Devices (Updated Jan 7th, 2025)
- Matmul OP on a Single_core
- Matmul OP on Multi_core (Basic)
- Matmul Multi_core Reuse (Optimized)
- Matmul Multi_core Multi-Cast (Optimized)
This repo is a part of Tenstorrent’s bounty program. If you are interested in helping to improve tt-metal, please make sure to read the Tenstorrent Bounty Program Terms and Conditions before heading to the issues tab. Look for the issues that are tagged with both “bounty” and difficulty level!
