jyatesdotdev/npu-embeddings

npu-embeddings

Run a text embedding model on AMD's XDNA NPU (Strix Halo). Produces OpenAI-compatible /v1/embeddings output from sentence-transformers/all-MiniLM-L6-v2, accelerated by the NPU's INT8 matrix multiplication units.

Hardware

  • NPU: AMD XDNA "Strix Halo" (aie2p architecture, 6×8 tile array, 51 TOPS)
  • CPU: AMD Ryzen AI MAX+ 395
  • System: Framework Desktop, 128 GiB LPDDR5, Ubuntu 25.10

What This Does

The NPU is a coarse-grained reconfigurable array (CGRA) — a grid of programmable VLIW+SIMD tiles connected by configurable DMA interconnect. You program it by defining spatial dataflow graphs: which tiles run which compute kernels, and how data flows between them.

This project:

  1. Implements transformer building blocks as NPU kernels (C++ compiled for aie2p)
  2. Defines tile-level dataflow designs (IRON Python → MLIR → xclbin)
  3. Orchestrates full model inference from Python, dispatching matmuls to the NPU
  4. Wraps it as a FastAPI server compatible with OpenWebUI's RAG pipeline

Architecture

Text → Tokenizer (CPU) → Embedding Lookup (CPU)
     → 6× Transformer Layer:
         Matmul (NPU) → Softmax (CPU) → Matmul (NPU) → LayerNorm (CPU)
         Matmul (NPU) → GELU (CPU) → Matmul (NPU) → LayerNorm (CPU)
     → Mean Pooling (CPU) → L2 Normalize → 384-dim embedding vector

All matmul operations run on the NPU (>95% of compute). The remaining ops (softmax, GELU, LayerNorm, pooling) run on the CPU because NPU dispatch overhead exceeds their compute time at these sizes.
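The last line of the diagram (mean pooling followed by L2 normalization) can be sketched in NumPy as follows. This is a minimal illustration, not the project's code: the function name `mean_pool_and_normalize` and the use of an attention mask to skip padding tokens are assumptions, following the usual sentence-transformers recipe for all-MiniLM-L6-v2.

```python
import numpy as np

def mean_pool_and_normalize(hidden, attention_mask):
    """Hypothetical helper: masked mean over tokens, then unit-length scaling.

    hidden:         (batch, seq, dim) float array from the last transformer layer
    attention_mask: (batch, seq) array of 0/1, where 0 marks padding tokens
    """
    mask = attention_mask[..., None].astype(hidden.dtype)  # (batch, seq, 1)
    summed = (hidden * mask).sum(axis=1)                    # padding contributes 0
    counts = np.clip(mask.sum(axis=1), 1e-9, None)          # guard against empty rows
    pooled = summed / counts                                # (batch, dim)
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)             # L2 normalize

# Random stand-in for real layer outputs: 2 sequences, 8 tokens, 384 dims.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((2, 8, 384)).astype(np.float32)
attention_mask = np.array([[1] * 8, [1] * 5 + [0] * 3])
embeddings = mean_pool_and_normalize(hidden, attention_mask)
print(embeddings.shape)
```

Because the output rows have unit length, cosine similarity between two embeddings reduces to a plain dot product.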

Repository Structure

kernels/          C++ kernel source for AIE2p tiles
designs/          IRON Python dataflow designs (generate MLIR → xclbin)
scripts/          Build scripts, weight quantization, calibration
inference/        Python inference engine and FastAPI server
docs/             Design documents, requirements, task tracking

Prerequisites

  • AMD XDNA driver (amdxdna DKMS 2.23.0+) — see docs/setup.md
  • XRT 2.23.0+ (/opt/xilinx/xrt/)
  • MLIR-AIE toolchain built from source (~/mlir-aie/)
  • Python 3.13+ with transformers, numpy, fastapi, uvicorn

Quick Start

# 1. Set up environment
source scripts/env.sh

# 2. Build all NPU programs (compiles kernels + designs → xclbins)
make all

# 3. Download and quantize model weights
python scripts/quantize.py

# 4. Run the embedding server
python inference/server.py
# → http://localhost:8000/v1/embeddings
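
Once the server is running, a client can query it with the standard library alone. The sketch below assumes the server accepts the usual OpenAI request shape (`input`, `model`) and returns `data[i].embedding`; the helper names and the `model` string are illustrative assumptions, not taken from this repository.

```python
import json
import urllib.request

def build_request(texts, url="http://localhost:8000/v1/embeddings",
                  model="all-MiniLM-L6-v2"):
    """Build an OpenAI-style embeddings POST (the model name is an assumption)."""
    payload = json.dumps({"input": texts, "model": model}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_embeddings(body):
    """Return the embedding vectors from a response body, ordered by index."""
    items = sorted(json.loads(body)["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]

def embed(texts, **kwargs):
    """Send the texts to the server and return one vector per input text."""
    with urllib.request.urlopen(build_request(texts, **kwargs), timeout=30) as resp:
        return parse_embeddings(resp.read())

# Usage (requires the server from step 4 to be running):
#   vectors = embed(["hello world", "text embeddings on an NPU"])
#   print(len(vectors), len(vectors[0]))
```

Splitting request building and response parsing from the network call keeps both halves testable without a live server.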

Build

The build compiles C++ kernels to AIE2p object files, then compiles IRON Python designs (which reference those kernels) into xclbin NPU programs.

source scripts/env.sh    # Set up MLIR-AIE toolchain paths
make kernels             # Compile .cc → .o (Peano/LLVM-AIE cross-compiler)
make designs             # Compile .py → .mlir → .xclbin (aiecc)
make all                 # Both
make clean               # Remove all build artifacts

Current Status

Working on NPU (verified against CPU reference):

  • INT8 matmul at BERT dimensions (128×768×768, 179 GOPS)
  • Multi-tile pipelines: matmul → activation → matmul → residual add (4 tiles)
  • Complete attention head: Q×K^T → softmax → probs×V → scale (4 tiles)
  • Complete FFN block: W_up → ReLU+requant → W_down → +residual (4 tiles)
  • LayerNorm (768 elements, 0.999999 correlation with float)
  • Row-wise softmax (32×32, perfect argmax preservation)
  • Real MiniLM model weights through NPU pipeline (bit-exact)
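
A CPU reference of the kind these checks rely on can be sketched in NumPy. The example below is an assumption-laden illustration, not the project's verification code: it uses per-tensor symmetric quantization and INT32 accumulation, and the helper names are hypothetical. The actual quantization scheme in scripts/quantize.py may differ.

```python
import numpy as np

def quantize_symmetric(x):
    """Per-tensor symmetric INT8 quantization (assumed scheme): x ~= scale * q."""
    scale = max(float(np.abs(x).max()), 1e-12) / 127.0
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul_reference(a_q, b_q):
    """INT8 x INT8 matmul with INT32 accumulation, computed on the CPU."""
    return a_q.astype(np.int32) @ b_q.astype(np.int32)

# Random stand-ins at the BERT dimensions listed above (128x768x768).
rng = np.random.default_rng(0)
a = rng.standard_normal((128, 768)).astype(np.float32)  # activations
b = rng.standard_normal((768, 768)).astype(np.float32)  # weights

a_q, a_scale = quantize_symmetric(a)
b_q, b_scale = quantize_symmetric(b)
acc = int8_matmul_reference(a_q, b_q)                   # int32, shape (128, 768)
approx = acc.astype(np.float32) * (a_scale * b_scale)   # dequantize to float

# Compare against the float matmul, as in the correlation figures above.
exact = a @ b
corr = np.corrcoef(approx.ravel(), exact.ravel())[0, 1]
print(acc.dtype, acc.shape, f"correlation={corr:.6f}")
```

An NPU result matching `acc` element for element under `np.array_equal` would be bit-exact in this sense; the correlation against `exact` measures the separate question of how much accuracy quantization itself costs.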

Next: Host-orchestrated full model inference at MiniLM dimensions.

License

MIT
