npu-embeddings

Run a text embedding model on AMD's XDNA NPU (Strix Halo). Produces OpenAI-compatible /v1/embeddings output from sentence-transformers/all-MiniLM-L6-v2, accelerated by the NPU's INT8 matrix multiplication units.

Hardware

NPU: AMD XDNA "Strix Halo" (aie2p architecture, 6×8 tile array, 51 TOPS)
CPU: AMD Ryzen AI MAX+ 395
System: Framework Desktop, 128 GiB LPDDR5, Ubuntu 25.10

What This Does

The NPU is a coarse-grained reconfigurable array (CGRA): a grid of programmable VLIW+SIMD tiles connected by a configurable DMA interconnect. You program it by defining spatial dataflow graphs that specify which tiles run which compute kernels and how data flows between them.
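To make the "spatial dataflow graph" idea concrete, here is a toy model in plain Python. It is not the IRON API and does not touch the NPU: `Tile`, `run_pipeline`, the tile coordinates, and the small weight shapes are all hypothetical, chosen only to show one kernel per tile with data streaming from one tile to the next.

```python
# Toy model of a spatial dataflow graph. NOT the IRON API: Tile, run_pipeline,
# and the coordinates below are hypothetical names used for illustration only.
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class Tile:
    """One compute tile: a grid position plus the kernel placed on it."""
    col: int
    row: int
    kernel: Callable[[np.ndarray], np.ndarray]


def run_pipeline(tiles: List[Tile], data: np.ndarray) -> np.ndarray:
    """Pass data through the tiles in order, like a linear chain of FIFOs."""
    for tile in tiles:
        data = tile.kernel(data)
    return data


rng = np.random.default_rng(0)
w_up = rng.standard_normal((8, 16)).astype(np.float32)
w_down = rng.standard_normal((16, 8)).astype(np.float32)

# matmul -> activation -> matmul, one kernel per tile in a single column
pipeline = [
    Tile(col=0, row=2, kernel=lambda x: x @ w_up),
    Tile(col=0, row=3, kernel=lambda x: np.maximum(x, 0.0)),
    Tile(col=0, row=4, kernel=lambda x: x @ w_down),
]

x = rng.standard_normal((4, 8)).astype(np.float32)
y = run_pipeline(pipeline, x)
print(y.shape)
```

On real hardware the tiles run concurrently and the "edges" are DMA transfers between tile memories; the sequential loop here only captures the placement and the data path.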

This project:

Implements transformer building blocks as NPU kernels (C++ compiled for aie2p)
Defines tile-level dataflow designs (IRON Python → MLIR → xclbin)
Orchestrates full model inference from Python, dispatching matmuls to the NPU
Wraps it as a FastAPI server compatible with OpenWebUI's RAG pipeline

Architecture

Text → Tokenizer (CPU) → Embedding Lookup (CPU)
     → 6× Transformer Layer:
         Matmul (NPU) → Softmax (CPU) → Matmul (NPU) → LayerNorm (CPU)
         Matmul (NPU) → GELU (CPU) → Matmul (NPU) → LayerNorm (CPU)
     → Mean Pooling (CPU) → L2 Normalize → 384-dim embedding vector

All matmul operations run on the NPU (>95% of compute). The remaining ops (softmax, GELU, LayerNorm, pooling) run on the CPU because NPU dispatch overhead exceeds their compute time at these sizes.
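The split above can be sketched on the host side as follows. This is a NumPy-only illustration, not the code in inference/: the function names, the parameter dictionary keys, the random weights, and the `matmul` hook are assumptions. The dimensions (hidden size 384, 12 heads, FFN size 1536, 6 layers) are those of all-MiniLM-L6-v2. Every matrix product goes through `matmul`, which is where an NPU dispatch would be substituted; the other ops stay in NumPy.

```python
import numpy as np

HIDDEN, HEADS, FFN, LAYERS = 384, 12, 1536, 6  # all-MiniLM-L6-v2 dimensions


def layer_norm(x, gamma, beta, eps=1e-12):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def gelu(x):
    # tanh approximation of GELU, used here to avoid a SciPy dependency
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def encoder_layer(x, p, matmul=np.matmul):
    """One transformer layer on a (tokens, HIDDEN) array.

    `matmul` is the hook where an NPU dispatch would go (hypothetical);
    softmax, GELU, and LayerNorm run on the CPU.
    """
    tokens, head_dim = x.shape[0], HIDDEN // HEADS

    def split(t):  # (tokens, HIDDEN) -> (HEADS, tokens, head_dim)
        return t.reshape(tokens, HEADS, head_dim).transpose(1, 0, 2)

    q = split(matmul(x, p["wq"]) + p["bq"])
    k = split(matmul(x, p["wk"]) + p["bk"])
    v = split(matmul(x, p["wv"]) + p["bv"])
    probs = softmax(matmul(q, k.transpose(0, 2, 1)) / np.sqrt(head_dim))
    ctx = matmul(probs, v).transpose(1, 0, 2).reshape(tokens, HIDDEN)
    x = layer_norm(x + matmul(ctx, p["wo"]) + p["bo"], p["ln1_g"], p["ln1_b"])
    hidden = gelu(matmul(x, p["w_up"]) + p["b_up"])
    out = matmul(hidden, p["w_down"]) + p["b_down"]
    return layer_norm(x + out, p["ln2_g"], p["ln2_b"])


def pool_and_normalize(x):
    """Mean-pool over tokens (no padding assumed), then scale to unit L2 norm."""
    pooled = x.mean(axis=0)
    return pooled / np.linalg.norm(pooled)


def random_params(rng):
    """Random stand-in weights; real weights would come from the model."""
    def w(*shape):
        return (rng.standard_normal(shape) * 0.02).astype(np.float32)

    p = {name: w(HIDDEN, HIDDEN) for name in ("wq", "wk", "wv", "wo")}
    p.update({name: w(HIDDEN) for name in ("bq", "bk", "bv", "bo", "b_down")})
    p.update(w_up=w(HIDDEN, FFN), b_up=w(FFN), w_down=w(FFN, HIDDEN))
    for ln in ("ln1", "ln2"):
        p[f"{ln}_g"] = np.ones(HIDDEN, np.float32)
        p[f"{ln}_b"] = np.zeros(HIDDEN, np.float32)
    return p


rng = np.random.default_rng(0)
x = rng.standard_normal((16, HIDDEN)).astype(np.float32)  # stand-in for embedding lookup
for _ in range(LAYERS):
    x = encoder_layer(x, random_params(rng))
embedding = pool_and_normalize(x)
print(embedding.shape)
```

Passing a different callable as `matmul` (for example, one that quantizes its inputs and hands them to an xclbin) leaves the rest of the layer unchanged, which is the point of keeping the dispatch boundary at the matrix products.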

Repository Structure

kernels/          C++ kernel source for AIE2p tiles
designs/          IRON Python dataflow designs (generate MLIR → xclbin)
scripts/          Build scripts, weight quantization, calibration
inference/        Python inference engine and FastAPI server
docs/             Design documents, requirements, task tracking

Prerequisites

AMD XDNA driver (amdxdna DKMS 2.23.0+) — see docs/setup.md
XRT 2.23.0+ (/opt/xilinx/xrt/)
MLIR-AIE toolchain built from source (~/mlir-aie/)
Python 3.13+ with transformers, numpy, fastapi, uvicorn
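A quick way to check most of this list before building is a short script like the one below. It is a sketch, not part of the repository: `check_prerequisites` is a hypothetical helper, it only tests that the paths and packages named above exist, and it does not verify the driver or XRT version.

```python
import importlib.util
import sys
from pathlib import Path


def check_prerequisites():
    """Return {check name: bool} for the paths and packages listed above."""
    checks = {
        "python>=3.13": sys.version_info >= (3, 13),
        "xrt": Path("/opt/xilinx/xrt").is_dir(),
        "mlir-aie": (Path.home() / "mlir-aie").is_dir(),
    }
    for pkg in ("transformers", "numpy", "fastapi", "uvicorn"):
        # find_spec returns None when a top-level package is not installed
        checks[pkg] = importlib.util.find_spec(pkg) is not None
    return checks


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'ok     ' if ok else 'MISSING'}  {name}")
```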

Quick Start

# 1. Set up environment
source scripts/env.sh

# 2. Build all NPU programs (compiles kernels + designs → xclbins)
make all

# 3. Download and quantize model weights
python scripts/quantize.py

# 4. Run the embedding server
python inference/server.py
# → http://localhost:8000/v1/embeddings
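
Once the server is up, a client can request embeddings with the standard library alone. The sketch below assumes the endpoint follows the OpenAI embeddings request and response shape (`input` in the request, `data[i].embedding` in the response); the `model` value is a guess, since the name this server expects or ignores is not documented here.

```python
import json
import urllib.request

URL = "http://localhost:8000/v1/embeddings"  # from the Quick Start above


def build_request(texts, model="all-MiniLM-L6-v2"):
    """Build an OpenAI-style embeddings request (model name is an assumption)."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(payload):
    """Return the embedding vectors in input order."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts):
    """Send texts to the running server and return one vector per text."""
    with urllib.request.urlopen(build_request(texts)) as resp:
        return parse_embeddings(json.load(resp))


if __name__ == "__main__":
    vectors = embed(["hello world", "NPU embeddings"])
    print(len(vectors), len(vectors[0]))
```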

Build

The build compiles C++ kernels to AIE2p object files, then compiles IRON Python designs (which reference those kernels) into xclbin NPU programs.

source scripts/env.sh    # Set up MLIR-AIE toolchain paths
make kernels             # Compile .cc → .o (Peano/LLVM-AIE cross-compiler)
make designs             # Compile .py → .mlir → .xclbin (aiecc)
make all                 # Both
make clean               # Remove all build artifacts

Current Status

Working on NPU (verified against CPU reference):

INT8 matmul at BERT dimensions (128×768×768, 179 GOPS)
Multi-tile pipelines: matmul → activation → matmul → residual add (4 tiles)
Complete attention head: Q×K^T → softmax → probs×V → scale (4 tiles)
Complete FFN block: W_up → ReLU+requant → W_down → +residual (4 tiles)
LayerNorm (768 elements, 0.999999 correlation with float)
Row-wise softmax (32×32, perfect argmax preservation)
Real MiniLM model weights through NPU pipeline (bit-exact)

Next: Host-orchestrated full model inference at MiniLM dimensions.
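
For readers unfamiliar with what "verified against CPU reference" involves for an INT8 matmul, the sketch below shows one common scheme in NumPy: symmetric per-tensor quantization, integer matmul with INT32 accumulation, then dequantization and comparison with the float result. The scheme and the helper names are assumptions; scripts/quantize.py may well use a different one (per-channel scales, calibration data).

```python
import numpy as np


def quantize_symmetric(x):
    """Map a float tensor to INT8 using one scale for the whole tensor."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def int8_matmul(a_q, b_q):
    """INT8 x INT8 matmul with INT32 accumulation (CPU reference)."""
    return a_q.astype(np.int32) @ b_q.astype(np.int32)


rng = np.random.default_rng(0)
a = rng.standard_normal((128, 768)).astype(np.float32)  # activations
b = rng.standard_normal((768, 768)).astype(np.float32)  # weights

a_q, a_scale = quantize_symmetric(a)
b_q, b_scale = quantize_symmetric(b)

acc = int8_matmul(a_q, b_q)                        # INT32 accumulators
approx = acc.astype(np.float32) * (a_scale * b_scale)
reference = a @ b                                  # float result for comparison

corr = np.corrcoef(approx.ravel(), reference.ravel())[0, 1]
print(f"correlation with float reference: {corr:.6f}")
```

A bit-exact check then amounts to comparing the NPU's INT32 output with `acc` using `np.array_equal`, while correlation against `reference` measures the quantization error itself. For scale: if a multiply-accumulate counts as two operations, the 128×768×768 case is about 151 million operations, which at 179 GOPS would take roughly 0.84 ms.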

License

MIT
