Run a text embedding model on AMD's XDNA NPU (Strix Halo). Produces OpenAI-compatible
/v1/embeddings output from sentence-transformers/all-MiniLM-L6-v2, accelerated by
the NPU's INT8 matrix multiplication units.
- NPU: AMD XDNA "Strix Halo" (aie2p architecture, 6×8 tile array, 51 TOPS)
- CPU: AMD Ryzen AI MAX+ 395
- System: Framework Desktop, 128 GiB LPDDR5, Ubuntu 25.10
The NPU is a coarse-grained reconfigurable array (CGRA): a grid of programmable VLIW+SIMD tiles connected by a configurable DMA interconnect. You program it by defining spatial dataflow graphs, which specify the compute kernel each tile runs and how data flows between tiles.
This project:
- Implements transformer building blocks as NPU kernels (C++ compiled for aie2p)
- Defines tile-level dataflow designs (IRON Python → MLIR → xclbin)
- Orchestrates full model inference from Python, dispatching matmuls to the NPU
- Wraps it as a FastAPI server compatible with OpenWebUI's RAG pipeline
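As an illustration of what "OpenAI-compatible" means for a client, here is a minimal sketch that builds an embeddings request and parses the response using only the standard library. The URL is the one from the quick-start steps. The `model` value and the helper names (`build_request`, `parse_embeddings`, `embed`) are assumptions made for this example, not names taken from the server code. The response layout follows the OpenAI embeddings format (a `data` list of items with `index` and `embedding` fields).

```python
import json
import urllib.request

# URL from the quick-start; the model name is an assumed placeholder.
EMBEDDINGS_URL = "http://localhost:8000/v1/embeddings"
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"


def build_request(texts, url=EMBEDDINGS_URL, model=MODEL_NAME):
    """Build an OpenAI-style embeddings POST request (not yet sent)."""
    body = json.dumps({"model": model, "input": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings(payload):
    """Return the embedding vectors ordered by their 'index' field."""
    items = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts):
    """Send the request to a running server; returns one vector per text."""
    with urllib.request.urlopen(build_request(texts)) as resp:
        return parse_embeddings(json.load(resp))
```

With the server running, `embed(["hello world"])` should return a list holding one 384-element vector, assuming the server follows the response format described above.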
Text → Tokenizer (CPU) → Embedding Lookup (CPU)
  → 6× Transformer Layer:
      Matmul (NPU) → Softmax (CPU) → Matmul (NPU) → LayerNorm (CPU)
      Matmul (NPU) → GELU (CPU) → Matmul (NPU) → LayerNorm (CPU)
  → Mean Pooling (CPU) → L2 Normalize → 384-dim embedding vector
All matmul operations run on the NPU (>95% of compute). Element-wise ops run on CPU because NPU dispatch overhead exceeds their compute time at these sizes.
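The CPU-side tail of the pipeline (mean pooling followed by L2 normalization) can be sketched in NumPy as below. This is a minimal illustration, not the project's inference code. The function name and the use of an attention mask to exclude padding tokens are assumptions, based on how sentence-transformers models are usually pooled.

```python
import numpy as np


def mean_pool_and_normalize(hidden, attention_mask, eps=1e-12):
    """Pool token states into one unit-length vector per sequence.

    hidden:         (batch, seq_len, dim) float array of final-layer outputs
    attention_mask: (batch, seq_len) array of 0/1, where 0 marks padding
    """
    mask = attention_mask[..., None].astype(hidden.dtype)  # (batch, seq, 1)
    summed = (hidden * mask).sum(axis=1)                   # padding adds 0
    counts = np.maximum(mask.sum(axis=1), 1.0)             # avoid divide-by-zero
    pooled = summed / counts                               # masked mean
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.maximum(norms, eps)                 # L2 normalize
```

For a batch of MiniLM outputs with shape `(batch, seq_len, 384)`, the result has shape `(batch, 384)`, and each row has unit L2 norm.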
kernels/    C++ kernel source for AIE2p tiles
designs/    IRON Python dataflow designs (generate MLIR → xclbin)
scripts/    Build scripts, weight quantization, calibration
inference/  Python inference engine and FastAPI server
docs/       Design documents, requirements, task tracking
- AMD XDNA driver (amdxdna DKMS 2.23.0+) — see docs/setup.md
- XRT 2.23.0+ (/opt/xilinx/xrt/)
- MLIR-AIE toolchain built from source (~/mlir-aie/)
- Python 3.13+ with transformers, numpy, fastapi, uvicorn
# 1. Set up environment
source scripts/env.sh
# 2. Build all NPU programs (compiles kernels + designs → xclbins)
make all
# 3. Download and quantize model weights
python scripts/quantize.py
# 4. Run the embedding server
python inference/server.py
# → http://localhost:8000/v1/embeddings
The build compiles C++ kernels to AIE2p object files, then compiles IRON Python designs (which reference those kernels) into xclbin NPU programs.
source scripts/env.sh # Set up MLIR-AIE toolchain paths
make kernels # Compile .cc → .o (Peano/LLVM-AIE cross-compiler)
make designs # Compile .py → .mlir → .xclbin (aiecc)
make all # Both
make clean # Remove all build artifacts
Working on NPU (verified against CPU reference):
- INT8 matmul at BERT dimensions (128×768×768, 179 GOPS)
- Multi-tile pipelines: matmul → activation → matmul → residual add (4 tiles)
- Complete attention head: Q×K^T → softmax → probs×V → scale (4 tiles)
- Complete FFN block: W_up → ReLU+requant → W_down → +residual (4 tiles)
- LayerNorm (768 elements, 0.999999 correlation with float)
- Row-wise softmax (32×32, perfect argmax preservation)
- Real MiniLM model weights through NPU pipeline (bit-exact)
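The checks named in the list above (bit-exact match, correlation with a float reference, argmax preservation) could be computed along these lines. This is a sketch under stated assumptions, not the project's test harness. The function names are hypothetical, and the reference matmul assumes INT8 inputs accumulated in int32, which is a common convention for INT8 kernels but is not confirmed here.

```python
import numpy as np


def int8_matmul_reference(a, b):
    """CPU reference: INT8 x INT8 matmul with int32 accumulation (assumed)."""
    assert a.dtype == np.int8 and b.dtype == np.int8
    return a.astype(np.int32) @ b.astype(np.int32)


def compare_outputs(npu_out, ref_out):
    """Summarize how closely an NPU result tracks a CPU reference."""
    npu = np.asarray(npu_out)
    ref = np.asarray(ref_out)
    # Pearson correlation over all elements (undefined for constant inputs).
    corr = np.corrcoef(
        npu.ravel().astype(np.float64), ref.ravel().astype(np.float64)
    )[0, 1]
    return {
        "bit_exact": bool(np.array_equal(npu, ref)),
        "correlation": float(corr),
        # True when every row picks the same winning column in both outputs.
        "argmax_preserved": bool(
            np.array_equal(npu.argmax(axis=-1), ref.argmax(axis=-1))
        ),
    }
```

An integer matmul stage would be expected to report `bit_exact` as true. For stages that approximate a float computation, such as LayerNorm or softmax, the correlation and argmax fields are the more meaningful ones.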
Next: Host-orchestrated full model inference at MiniLM dimensions.
MIT