This fork contains the implementation accompanying our paper:
Attention-Level Speculation
Authors: Jack Cai, Ammar Vora, Randolph Zhang, Mark O'Conner, Mark C. Jeffrey.
Conference Paper (Link Coming Soon!)
As Large Language Models (LLMs) scale in size and context length, inference latency becomes a bottleneck. Traditional tensor and data parallelism approaches struggle to scale efficiently across multiple devices.
Attention-Level Speculation (ALSpec) introduces a novel speculative execution mechanism that predicts the outputs of self-attention layers. This allows overlapping attention and non-attention computations—unlocking parallelism across layers that were previously executed sequentially. As a result, ALSpec achieves:
- Up to 5× reduction in attention latency at 128K sequence length
- Up to 1.65× improvement in end-to-end decode latency at large sequence lengths
- Maintains model quality, even with substantial attention approximation
- Proof-of-concept demonstrated on real hardware using Tenstorrent’s Wormhole™ n300 platform
This is a fork of TT-Metalium™, the open-source software stack used for high-performance ML inference on Tenstorrent hardware. This fork provides an example implementation of ALSpec, designed to demonstrate the feasibility of speculative self-attention on real silicon.
Note: This proof-of-concept targets a two-device system and currently does not integrate with tensor parallelism. Due to on-chip memory constraints, the maximum tested context length in this implementation is 32K.
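To make the mechanism concrete, here is a minimal sketch of the idea in plain PyTorch, assuming a simple recent-window predictor for the speculative attention output. The function, the predictor, and the acceptance test are illustrative assumptions only and are not the APIs provided by this fork.

```python
import torch
import torch.nn.functional as F

def speculative_attention_step(q, k_cache, v_cache, window=1024, tol=1e-2):
    """Illustrative sketch of attention-level speculation (not this fork's API).

    Shapes: q is [batch, heads, 1, dim]; k_cache and v_cache are [batch, heads, seq, dim].
    """
    # 1) Speculate: cheaply predict the attention output, here by attending over
    #    only the most recent `window` KV-cache entries (one possible predictor).
    speculative = F.scaled_dot_product_attention(
        q, k_cache[:, :, -window:], v_cache[:, :, -window:]
    )

    # 2) In ALSpec, downstream non-attention ops (MLP, next layer's QKV projections)
    #    consume `speculative` while exact attention over the full cache runs
    #    concurrently; the two calls are shown sequentially here for clarity.
    exact = F.scaled_dot_product_attention(q, k_cache, v_cache)

    # 3) Verify: if the prediction was close enough, the overlapped work is kept;
    #    otherwise downstream ops are re-executed from the exact output.
    accepted = torch.allclose(speculative, exact, atol=tol)
    return exact, accepted
```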
- 🧠 ALSpec end-to-end demo:
  - Llama-3.1-8B-Instruct integrated with ALSpec
  - Runs on a dual-device Wormhole™ n300 setup
  - Enabled via the `ALSpec` class
- ⚙️ `ttnn.experimental.speculative_scaled_dot_product_attention_decode`: Speculative flash-attention kernel implementation for Tenstorrent hardware
- 🔄 `ttnn.experimental.swap_tensor_async`: Custom CCL operation to synchronize residual and priority tensors across devices
- 🔧 `device.set_speculation_mode(...)`: Runtime modification enabling static graph dynamic concurrency (SGDC) with priority gating
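For orientation, the sketch below shows one way these components might compose in a single decode step. The argument lists, return values, and call order are assumptions made for illustration; consult the `ALSpec` class in this fork for the actual interfaces.

```python
import ttnn

def alspec_decode_step(device, q, k_cache, v_cache, residual):
    """Hypothetical composition of the components listed above; signatures are assumed."""
    # Enable static graph dynamic concurrency (SGDC) with priority gating on the device.
    device.set_speculation_mode(True)

    # Speculative flash-attention decode: emits a predicted attention output early so
    # that non-attention ops can proceed while the exact result is still in flight.
    attn_out = ttnn.experimental.speculative_scaled_dot_product_attention_decode(
        q, k_cache, v_cache
    )

    # Custom CCL op: exchange residual and priority tensors between the two n300
    # devices so the speculation can be verified or corrected.
    residual = ttnn.experimental.swap_tensor_async(residual)

    return attn_out, residual
```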
The original TT-Metalium™ README follows below.
Last Update: April 22, 2025
Notes:
- ttft = time to first token | t/s/u = tokens/second/user | t/s = tokens/second; where t/s = t/s/u * batch.
- TP = Tensor Parallel, DP = Data Parallel; Defines parallelization factors across multiple devices.
- The reported LLM performance is for an input sequence length (number of rows filled in the KV cache) of 128 for all models except Mamba (which can accept any sequence length).
- The t/s/u reported is the throughput of the first token generated after prefill, i.e. 1 / inter-token latency.
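As a worked example of the throughput formula above: the Whisper entry below reports 54.7 t/s/u at batch 1, so t/s = 54.7 × 1 = 54.7, while a hypothetical model running at 20 t/s/u with a batch of 32 would report t/s = 20 × 32 = 640.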
| Model | Batch | Hardware | ttft (ms) | t/s/u | Target t/s/u | t/s | TT-Metalium Release |
|---|---|---|---|---|---|---|---|
| Whisper (distil-large-v3) | 1 | n150 | 244 | 54.7 | 45 | 54.7 | v0.57.0-rc71 |
| Model | Batch | Hardware | fps | Target fps | Release |
|---|---|---|---|---|---|
| ResNet-50 (224x224) | 16 | n150 | 4,700 | 7,000 | |
| ResNet-50 (224x224) (DP=2) | 32 | n300 | 9,200 | 14,000 | |
| ResNet-50 (224x224) (DP=8) | 128 | QuietBox | 35,800 | 56,000 | |
| ResNet-50 (224x224) (DP=32) | 512 | Galaxy | 96,800 | 224,000 | |
| ViT (224x224) | 8 | n150 | 1,100 | 1,600 | |
| Stable Diffusion 1.4 (512x512) | 1 | n150 | 0.117 | 0.3 | |
| YOLOv4 (320x320) | 1 | n150 | 120 | 300 | |
| YOLOv4 (640x640) | 1 | n150 | 50 | 100 | |
| SegFormer Semantic Segmentation (512x512) | 1 | n150 | 90 | 300 | |
| Stable Diffusion 3.5 medium (512x512) | 1 | n150 | 0.06 | 0.3 | |
Notes:
- Stable Diffusion FPS is based on the time elapsed from submitting the input prompt to receiving the image from the VAE decoder.
| Model | Batch | Hardware | sen/sec | Target sen/sec | Release |
|---|---|---|---|---|---|
| BERT-Large | 8 | n150 | 270 | 400 | |
For the latest model updates and features, please see MODEL_UPDATES.md
For information on initial model procedures, please see Model Bring-Up and Testing
- Advanced Performance Optimizations for Models (updated March 4th, 2025)
- Programming Mesh of Devices (updated Sept 9th, 2024)
- ViT Implementation in TT-NN on GS (updated Sept 22nd, 2024)
- LLMs Bring up in TT-NN (updated Oct 29th, 2024)
- YOLOv4 Implementation in TT-NN on WH (updated November 8th, 2024)
- CNN Bring up & Optimization in TT-NN (updated Jan 22nd, 2025)
- Matrix Multiply FLOPS on WH (updated November 13th, 2024)
TT-Metalium is our low-level programming model, enabling kernel development for Tenstorrent hardware.
Get started with simple kernels.
- Matrix Engine (updated Sept 6th, 2024)
- Data Formats (updated Sept 7th, 2024)
- Reconfiguring Data Formats (updated Oct 17th, 2024)
- Handling special floating-point numbers (updated Oct 5th, 2024)
- Allocator (Updated Dec 19th, 2024)
- Tensor Layouts (updated Sept 6th, 2024)
- Saturating DRAM Bandwidth (updated Sept 6th, 2024)
- Flash Attention on Wormhole (updated Sept 6th, 2024)
- CNNs on TT Architectures (updated Sept 6th, 2024)
- Ethernet and Multichip Basics (Updated Sept 20th, 2024)
- Collective Communication Library (CCL) (Updated Sept 20th, 2024)
- Blackhole Bring-Up Programming Guide (Updated Dec 18th, 2024)
- Sub-Devices (Updated Jan 7th, 2025)
- Matmul OP on a Single_core
- Matmul OP on Multi_core (Basic)
- Matmul Multi_core Reuse (Optimized)
- Matmul Multi_core Multi-Cast (Optimized)
This repo is a part of Tenstorrent’s bounty program. If you are interested in helping to improve tt-metal, please make sure to read the Tenstorrent Bounty Program Terms and Conditions before heading to the issues tab. Look for the issues that are tagged with both “bounty” and difficulty level!
