🚀 Attention-Level Speculation (ALSpec)

This fork contains the implementation accompanying our paper:

Attention-Level Speculation
Authors: Jack Cai, Ammar Vora, Randolph Zhang, Mark O'Conner, Mark C. Jeffrey
ICML 2025 (paper link coming soon!)


As Large Language Models (LLMs) scale in size and context length, inference latency becomes a bottleneck. Traditional tensor and data parallelism approaches struggle to scale efficiently across multiple devices.

Attention-Level Speculation (ALSpec) introduces a novel speculative execution mechanism that predicts the outputs of self-attention layers. This allows overlapping attention and non-attention computations—unlocking parallelism across layers that were previously executed sequentially. As a result, ALSpec achieves:

  • Up to 5× reduction in attention latency at 128K sequence length
  • Up to 1.65× improvement in end-to-end decode latency at large sequence lengths
  • Maintained model quality, even with substantial attention approximation
  • A proof of concept demonstrated on real hardware, using Tenstorrent’s Wormhole™ n300 platform
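
To make the overlap concrete, below is a minimal, plain-Python sketch of the control flow described above. It is a conceptual illustration only, not this fork's kernels: the predictor, the acceptance tolerance, and the thread-based overlap are all assumptions made for the example.

```python
# Conceptual sketch of attention-level speculation (plain Python / NumPy,
# unrelated to this fork's actual kernels). The idea: start the exact
# attention, immediately run a cheap predictor, feed the prediction to the
# downstream (non-attention) computation, and fall back only on mispredict.
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def speculative_layer(x, exact_attention, predict_attention, mlp, tol=1e-2):
    with ThreadPoolExecutor(max_workers=2) as pool:
        exact_fut = pool.submit(exact_attention, x)   # slow, exact path
        guess = predict_attention(x)                  # cheap approximation
        spec_fut = pool.submit(mlp, x + guess)        # MLP overlaps attention
        exact = exact_fut.result()
        if np.max(np.abs(exact - guess)) <= tol:      # speculation accepted
            return spec_fut.result()
        return mlp(x + exact)                         # mispredict: recompute

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(16)
    exact = lambda v: np.tanh(v)                # stand-in exact attention
    predict = lambda v: np.tanh(v) + 1e-3       # stand-in cheap predictor
    mlp = lambda v: 2.0 * v                     # stand-in feed-forward block
    print(speculative_layer(x, exact, predict, mlp))
```

The fork implements this overlap at the kernel and device level (see the key additions below) rather than with host threads, but the accept-or-recompute structure is the same.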

🔧 About This Repository

This is a fork of TT-Metalium™, the open-source software stack used for high-performance ML inference on Tenstorrent hardware, based on tag v0.57.1. This fork provides an example implementation of ALSpec, designed to demonstrate the feasibility of speculative self-attention on real silicon.

Note: This proof of concept targets a two-device system and does not currently integrate with tensor parallelism. Due to on-chip memory constraints, the maximum tested context length in this implementation is 32K.

Key Additions in This Fork

  • 🧠 ALSpec end-to-end demo:
    • Llama-3.1-8B-Instruct integrated with ALSpec
    • Runs on a dual-device Wormhole™ n300 setup
    • Enabled via the ALSpec class
  • ⚙️ ttnn.experimental.speculative_scaled_dot_product_attention_decode:
    • Speculative flash-attention kernel implementation for Tenstorrent hardware
  • 🔄 ttnn.experimental.swap_tensor_async:
    • Custom CCL operation to synchronize residual and priority tensors across devices
  • 🔧 device.set_speculation_mode(...):
    • Runtime modification enabling static graph dynamic concurrency (SGDC) with priority gating (a combined usage sketch of these additions follows below)
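
As a rough illustration of how the pieces above could fit together, here is a hedged, non-runnable sketch. Only the names listed above come from this fork; every signature and argument, and the ttnn.open_mesh_device setup call, are assumptions made for illustration. Consult the demo code for the real interface.

```python
# Hedged usage sketch. Only ALSpec, set_speculation_mode,
# speculative_scaled_dot_product_attention_decode, and swap_tensor_async
# are named by this fork; all signatures/arguments below are assumptions.
import ttnn

# Two-device Wormhole n300 setup, as targeted by the end-to-end demo
# (mesh-open call and shape are assumed, not taken from the demo).
device = ttnn.open_mesh_device(ttnn.MeshShape(1, 2))

# Enable static graph dynamic concurrency (SGDC) with priority gating.
device.set_speculation_mode(True)  # argument shape is an assumption

# Inside the decode loop (the demo wraps this flow in the ALSpec class):
# the speculative flash-attention kernel predicts the attention output so
# that downstream non-attention layers can start before exact attention lands.
attn_out = ttnn.experimental.speculative_scaled_dot_product_attention_decode(
    q, k_cache, v_cache,  # decode-step tensors, prepared as in the demo
)

# Keep residual and priority tensors consistent across the two devices.
ttnn.experimental.swap_tensor_async(residual)  # CCL op added by this fork
```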



The original TT-Metalium™ README follows below.


TT-NN is a Python & C++ Neural Network OP library.


LLMs

| Model | Batch | Hardware | ttft (ms) | t/s/u | Target t/s/u | t/s | TT-Metalium Release | vLLM Tenstorrent Repo Release |
|---|---|---|---|---|---|---|---|---|
| QwQ 32B (TP=8) | 32 | QuietBox | 133 | 25.2 | 30 | 806.4 | v0.56.0-rc51 | e2e0002 |
| DeepSeek R1 Distill Llama 3.3 70B (TP=8) | 32 | QuietBox | 159 | 15.2 | 20 | 486.4 | v0.57.0-rc71 | 3f59287 |
| Llama 3.1 70B (TP=32) | 32 | Galaxy | | 45.1 | 80 | 1443.2 | avora/40-tks | |
| Llama 3.1 70B (TP=8) | 32 | QuietBox | 159 | 15.2 | 20 | 486.4 | v0.57.0-rc71 | 3f59287 |
| Llama 3.2 11B Vision (TP=2) | 16 | n300 | 2550 | 15.8 | 17 | 252.8 | v0.56.0-rc6 | e2e0002 |
| Qwen 2.5 7B (TP=2) | 32 | n300 | 126 | 32.5 | 38 | 1040.0 | v0.56.0-rc33 | e2e0002 |
| Qwen 2.5 72B (TP=8) | 32 | QuietBox | 333 | 14.5 | 20 | 464.0 | v0.56.0-rc33 | e2e0002 |
| Falcon 7B | 32 | n150 | 70 | 18.3 | 26 | 585.6 | v0.57.0-rc56 | |
| Falcon 7B (DP=8) | 256 | QuietBox | 88 | 15.5 | 26 | 3968.0 | v0.57.0-rc56 | |
| Falcon 7B (DP=32) | 1024 | Galaxy | 223 | 4.8 | 26 | 4915.2 | v0.56.0-rc6 | |
| Falcon 40B (TP=8) | 32 | QuietBox | | 10.9 | 36 | 348.8 | v0.57.0-rc71 | |
| Llama 3.1 8B | 32 | n150 | 104 | 24.6 | 23 | 787.2 | v0.57.0-rc71 | 3f59287 |
| Llama 3.2 1B | 32 | n150 | 23 | 67.6 | 160 | 2163.2 | v0.57.0-rc23 | f8b5b72 |
| Llama 3.2 3B | 32 | n150 | 53 | 43.5 | 60 | 1392.0 | v0.57.0-rc71 | 3f59287 |
| Mamba 2.8B | 32 | n150 | 37 | 12.9 | 41 | 412.8 | v0.57.0-rc71 | |
| Mistral 7B | 32 | n150 | | 9.9 | 25 | 316.8 | v0.51.0-rc28 | |
| Mixtral 8x7B (TP=8) | 32 | QuietBox | 207 | 16.6 | 33 | 531.2 | v0.57.0-rc71 | |

Last Update: April 22, 2025

Notes:

  • ttft = time to first token | t/s/u = tokens/second/user | t/s = tokens/second, where t/s = t/s/u × batch.
  • TP = Tensor Parallel, DP = Data Parallel; these define the parallelization factors across multiple devices.
  • The reported LLM performance is for an input sequence length (number of rows filled in the KV cache) of 128 for all models except Mamba (which can accept any sequence length).
  • The reported t/s/u is the throughput of the first token generated after prefill, i.e., 1 / inter-token latency.
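
As a worked example from the table above: Llama 3.1 8B on n150 reports 24.6 t/s/u at batch 32, so its t/s entry is 24.6 × 32 = 787.2.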

Speech-to-Text

| Model | Batch | Hardware | ttft (ms) | t/s/u | Target t/s/u | t/s | TT-Metalium Release |
|---|---|---|---|---|---|---|---|
| Whisper (distil-large-v3) | 1 | n150 | 244 | 54.7 | 45 | 54.7 | v0.57.0-rc71 |

CNNs

| Model | Batch | Hardware | fps | Target fps | Release |
|---|---|---|---|---|---|
| ResNet-50 (224x224) | 16 | n150 | 4,700 | 7,000 | |
| ResNet-50 (224x224) (DP=2) | 32 | n300 | 9,200 | 14,000 | |
| ResNet-50 (224x224) (DP=8) | 128 | QuietBox | 35,800 | 56,000 | |
| ResNet-50 (224x224) (DP=32) | 512 | Galaxy | 96,800 | 224,000 | |
| ViT (224x224) | 8 | n150 | 1,100 | 1,600 | |
| Stable Diffusion 1.4 (512x512) | 1 | n150 | 0.117 | 0.3 | |
| YOLOv4 (320x320) | 1 | n150 | 120 | 300 | |
| YOLOv4 (640x640) | 1 | n150 | 50 | 100 | |
| SegFormer Semantic Segmentation (512x512) | 1 | n150 | 90 | 300 | |
| Stable Diffusion 3.5 medium (512x512) | 1 | n150 | 0.06 | 0.3 | |

Notes:

  • Stable Diffusion FPS is based on the time elapsed from submitting the input prompt to receiving the image from the VAE decoder.
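
For example, Stable Diffusion 1.4's 0.117 fps corresponds to roughly 1 / 0.117 ≈ 8.5 seconds per 512×512 image.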

NLPs

| Model | Batch | Hardware | sen/sec | Target sen/sec | Release |
|---|---|---|---|---|---|
| BERT-Large | 8 | n150 | 270 | 400 | |

Model Updates

For the latest model updates and features, please see MODEL_UPDATES.md

Model Bring-Up and Testing

For information on initial model procedures, please see Model Bring-Up and Testing

TT-NN Tech Reports

Benchmarks



TT-Metalium is our low-level programming model, enabling kernel development for Tenstorrent hardware.

Getting started

Get started with simple kernels.

TT-Metalium Tech Reports

TT-Metalium Programming Examples

Hello World

Add Integers

Simple Tensor Manipulation

DRAM Data Movement

Eltwise

Matmul

Tenstorrent Bounty Program Terms and Conditions

This repo is part of Tenstorrent’s bounty program. If you are interested in helping to improve tt-metal, please read the Tenstorrent Bounty Program Terms and Conditions before heading to the issues tab. Look for issues tagged with both “bounty” and a difficulty level!
