The first-generation XDNA1 NPU — "Phoenix" / "Hawk Point" — in millions of Ryzen laptops, brought up on Linux with 100% open source. Models running on the silicon. Custom vector kernels hand-authored in the AIE ISA.
FastFlowLM and AMD Lemonade — the stacks everyone points to for "LLMs on the Ryzen AI NPU" — support XDNA2 only. If you own a Ryzen 7040 / 8040 laptop, AMD's answer for your NPU on Linux is "not available — use the GPU."
open-xdna is the answer that says "here's how."
$ xrt-smi examine
[0000:06:00.1] RyzenAI-npu1 aie2 6x5 ← the chip they left behind, alive on Linux
| Milestone | Result |
|---|---|
| 🔌 Gen-1 NPU driven by fully open-source XRT + driver + firmware | ✅ |
| ♻️ Survives reboot + kernel updates (DKMS) | ✅ |
| ⚙️ 512³ int16 matmul on the NPU | ✅ ~68 GFLOPS |
| 🧠 A real model's forward pass (2-layer MLP) on the NPU | ✅ bit-exact |
| ✂️ Non-bijunctive collapse (keep-strong / prune-weak) | ✅ prune 50→92% |
| 🛠️ Hand-authored AIE vector kernel (aie::ge + aie::select) | ✅ first compile |
| 🔁 Fused reduce_max→dynamic-τ collapse in one kernel | ✅ data-adaptive |
| 🔀 aie::reverse (vec_perm / shuffle) on the NPU | ✅ |
| 🎯 Full collapse (reduce+threshold+compact) in one AIE kernel | ✅ |
| 📈 Measured FFN net-win (NPU prune → dense down_proj) | ✅ 1.6× @ cos 0.998 |
| 🎲 Attention KV-prune (NPU evicts low-mass keys) | ✅ 4× @ cos 0.976 |
| 🖼️ Vision pipeline (rgba2gray→3×3 conv→threshold) on the NPU | ✅ PASS — the NPU's native CNN strength |
| 🧩 SigLIP/ViT patch-embed wired on the NPU | ✅ bit-exact (offload play; ~3.2× slower than CPU at base size — honest) |
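
For readers who want the idea behind the attention KV-prune row, here is a minimal NumPy reference sketch. It is not the repo's kernel: the "mass" definition (softmax weight summed over a set of probe queries), the function names, and the sizes are all assumptions made for illustration. Random data like this will not reproduce the measured cosine in the table.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Plain scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def select_keys(q_probe, k, keep):
    """Indices of the `keep` keys with the largest total attention mass.

    Assumption: mass = softmax weight summed over the probe queries.
    """
    assert 1 <= keep <= len(k)
    mass = softmax(q_probe @ k.T / np.sqrt(k.shape[-1])).sum(axis=0)
    return np.sort(np.argsort(mass)[-keep:])  # keep original key order

def cosine(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
n_keys, d = 256, 64                           # hypothetical sizes
q_probe = rng.standard_normal((16, d))
q_new = rng.standard_normal((4, d))
k = rng.standard_normal((n_keys, d))
v = rng.standard_normal((n_keys, d))

idx = select_keys(q_probe, k, keep=n_keys // 4)   # evict 75% of the keys
dense = attention(q_new, k, v)
pruned = attention(q_new, k[idx], v[idx])         # both GEMMs see 4x fewer keys
print(k[idx].shape, round(cosine(dense, pruned), 3))
```

Because the same index set is applied to both `k` and `v`, the score GEMM and the value GEMM shrink together, which is where a 4× reduction at 25% retention would come from.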
Hardware vendors retire silicon with software, not screwdrivers. AMD moved on to XDNA2 and told gen-1 owners the NPU "isn't available" for LLMs on Linux. The chip is fine. The driver's in the kernel. The compiler exists. What was missing was a recipe — so here's one, end to end, with receipts.
Getting a kernel onto an XDNA1 NPU on a current Linux needs four non-obvious moves (full detail: docs/BRINGUP.md, error-string fixes in the FAQ):
1. Install `libxrt_driver_xdna.so` (not just `libvxdna.so`); otherwise `xrt-smi` reports "0 devices found".
2. Load the matched staging `amdxdna.ko`; mainline (≤6.17) lacks ioctls the SHIM needs.
3. Put `llvm-objcopy` on `PATH`; GNU objcopy can't parse the AIE2 ELF.
4. Match the NPU firmware; stale firmware aborts commands (`ERT_CMD_STATE_ABORT`).

Upstream gap writeup: `docs/UPSTREAM_amdxdna_ioctls.md`. Mainline `amdxdna` (≤6.17) is missing the `GET_ARRAY`/telemetry ioctls the current XRT SHIM expects (why fix #2 is needed).
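
Three of the four moves above can be sanity-checked before running anything on the NPU. The sketch below is a hypothetical helper, not a script shipped in this repo. The library path in the usage section is an assumed install location, and the helper cannot tell whether the loaded `amdxdna` module is the matched staging build or whether the firmware version matches.

```python
import shutil
from pathlib import Path

def preflight(shim_path, modules_text, which=shutil.which):
    """Return {check name: bool} for the bring-up prerequisites we can see.

    shim_path    -- where libxrt_driver_xdna.so is expected (assumption)
    modules_text -- contents of /proc/modules
    which        -- injectable PATH lookup, defaults to shutil.which
    """
    loaded = {
        line.split()[0]
        for line in modules_text.splitlines()
        if line.strip()
    }
    return {
        "XDNA SHIM library present": Path(shim_path).is_file(),
        "amdxdna module loaded": "amdxdna" in loaded,
        "llvm-objcopy on PATH": which("llvm-objcopy") is not None,
    }

if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    text = proc_modules.read_text() if proc_modules.exists() else ""
    # Assumed default XRT install prefix; adjust to your system.
    results = preflight("/opt/xilinx/xrt/lib/libxrt_driver_xdna.so", text)
    for name, ok in results.items():
        print(f"{'OK  ' if ok else 'FAIL'} {name}")
```

A failing first line corresponds to the "0 devices found" symptom, and a failing third line to the objcopy parse error described in the list.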
bash scripts/setup_iron.sh # IRON/mlir-aie + Peano
sudo bash scripts/swap_driver.sh # matched staging driver
sudo bash scripts/install_firmware.sh 1.5.5.391 # matching NPU firmware
sudo bash scripts/run_example.sh basic/matrix_multiplication/single_core # 68 GFLOPS on the NPU
Run the hand-authored kernels:
python3 examples/npu_tiny_mlp.py # a model's matmuls on the NPU
python3 examples/npu_collapse_fused.py # fused reduce_max -> dynamic-tau collapse
python3 examples/npu_collapse_runtime.py # runtime-tau collapse (on-NPU aie::sub shift)
python3 examples/npu_shuffle_demo.py # aie::reverse (vec_perm) on the NPU
python3 examples/npu_pse_collapse.py # FULL collapse (reduce+threshold+compact) in ONE kernel
python3 examples/npu_ffn_prune.py # MEASURED FFN net-win + accuracy tradeoff
python3 examples/npu_attention_prune.py # MEASURED attention KV-prune (both GEMMs shrink)
python3 mlir-aie/programming_examples/vision/edge_detect/edge_detect.py -W 512 -H 512 # 2D conv vision pipeline on NPU
python3 examples/npu_patch_embed.py # SigLIP/ViT patch-embed on the NPU (measured, honest)
The AIE2 vector ISA is the same primitive class as PowerPC AltiVec/VSX, just with new mnemonics. The non-bijunctive collapse, hand-authored (examples/kernels/collapse.cc):
aie::vector<bfloat16,32> x = aie::load_v<32>(a + i); // vec_ld
aie::mask<32> keep = aie::ge(x, tau_v); // vec_cmpge → mask
aie::vector<bfloat16,32> out = aie::select(zero_v, x, keep); // vec_sel: keep ? x : 0
aie::store_v(c + i, out); // vec_st
Porting your own SIMD kernels? docs/ALTIVEC_TO_AIE.md maps the whole vocabulary.
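
A host-side reference is handy for checking the collapse kernel's output. The NumPy sketch below mirrors the compare-and-select shown above, plus two variants named in the milestone table. The function names are hypothetical, float32 stands in for bfloat16 (which NumPy lacks), and the rule `tau = frac * max(x)` is only an assumed way to derive a dynamic threshold from a reduce-max; the repo's kernels may compute it differently.

```python
import numpy as np

def collapse(x, tau):
    """Keep-strong / prune-weak: x where x >= tau, else 0.

    Mirrors aie::ge (build mask) followed by aie::select (mask ? x : 0).
    """
    return np.where(x >= tau, x, np.zeros_like(x))

def collapse_dynamic(x, frac=0.5):
    """Derive tau from the data, then collapse.

    Assumption: tau = frac * max(x); `frac` is a hypothetical knob.
    """
    tau = frac * x.max()
    return collapse(x, tau), tau

def collapse_compact(x, tau):
    """Threshold, then pack survivors to the front.

    Returns (values, indices) so a consumer can gather matching rows.
    """
    idx = np.flatnonzero(x >= tau)
    return x[idx], idx

x = np.array([0.1, 0.9, -0.3, 0.5], dtype=np.float32)
print(collapse(x, 0.5))
print(collapse_dynamic(x, frac=0.5)[0])
print(collapse_compact(x, 0.5))
```

Comparing the NPU buffer against `collapse(x, tau)` element by element is one way to confirm a kernel change did not alter which lanes survive.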
Every measurement, organized by model/architecture — what was tested, the result, measured-vs-projected: BENCHMARKS.md. Includes the honest negatives (gemma4 layer-prune net loss, vision offload-not-speed).
This is not a turnkey LLM server, and the NPU is not a fast matmul engine — measured, it's ~6× slower than the integrated Radeon 780M at dense GEMM and loses on energy-per-GFLOP for dense work (RESULTS.md). Its real value is a ~6.6 W power floor and cheap pruning/selection that lets a stronger device do less work. A prune-then-shrink FFN nets ~1.3–1.6× less work (not the naive 4× — you still pay the producer projection), gated on accuracy. We measure before we claim, and we mark every frontier.
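
The gap between the naive 4× and the ~1.3–1.6× net figure can be illustrated with simple GEMM counting. This is a back-of-the-envelope sketch under stated assumptions, not the accounting used in BENCHMARKS.md or RESULTS.md: every projection is treated as one equal-cost GEMM, the producer projections run dense, and only the consumer (down) projection shrinks with the fraction of hidden channels kept.

```python
def ffn_work_ratio(n_producer, keep, n_consumer=1):
    """Dense work divided by pruned work, in units of one GEMM.

    n_producer -- projections that must still run dense
                  (1 for a plain up-projection, 2 for gate + up)
    keep       -- fraction of hidden channels that survive pruning
    n_consumer -- projections that shrink with `keep` (the down-projection)
    """
    dense = n_producer + n_consumer
    pruned = n_producer + keep * n_consumer
    return dense / pruned

# Keeping 25% of channels shrinks the down-projection alone by 4x ...
print(f"down_proj only: {1 / 0.25:.2f}x")
# ... but the whole FFN gains far less once the producer is counted.
print(f"plain FFN:      {ffn_work_ratio(1, 0.25):.2f}x")
print(f"gated FFN:      {ffn_work_ratio(2, 0.25):.2f}x")
```

Under these assumptions a 25% keep rate gives 1.60× for a two-matrix FFN and 1.33× for a gated three-matrix FFN, which is consistent with the range quoted above. The measured numbers also depend on accuracy gating and on overheads that this count ignores.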
Run lspci -d 1022:1502. If it lists an "AMD IPU Device", your chip has the gen-1 NPU. Known parts: 7840HS/U · 8840HS/U · 8845HS · 7940HS · 8945HS · 7640HS/U · 8640HS · 8645HS · 8600G/8700G. (Desktop 8500G/8300G have no NPU: they use the Zen4c die.)
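
The same check can be scripted. This is a small sketch with hypothetical function names. It relies only on `lspci -d` filtering by vendor:device ID, so any non-empty output means a matching device exists, whatever name the local PCI ID database gives it.

```python
import subprocess

XDNA1_PCI_ID = "1022:1502"  # vendor:device ID quoted above

def has_xdna1_npu(lspci_output):
    """True if the filtered lspci output contains at least one device line."""
    return any(line.strip() for line in lspci_output.splitlines())

def probe():
    """Run lspci for the XDNA1 ID; False if lspci is missing or finds nothing."""
    try:
        result = subprocess.run(
            ["lspci", "-d", XDNA1_PCI_ID],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return False
    return has_xdna1_npu(result.stdout)

if __name__ == "__main__":
    print("XDNA1 NPU present" if probe() else "No XDNA1 NPU found")
```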
Heterogeneous-compute research (PSE non-bijunctive collapse, RAM coffers / NUMA weight banking, neuromorphic device routing) by Elyan Labs. See also ram-coffers · pse-vcipher-collapse.
The NPU was built for vision/CNN inference, so a multimodal model's image tower (patch-embed convs, preprocessing) is a far better NPU fit than text decode. A full conv pipeline already runs on the NPU; see docs/VISION_OFFLOAD.md for the offload, and docs/MULTIMODAL_3WAY.md for the full 3-processor picture (including gemma4:26b at 16.5 tok/s on the 780M via 96 GB, a 17 GB model that the 8 GB discrete GPU can't hold).
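
A patch-embed convolution maps onto a matmul engine because a convolution whose stride equals its kernel size is one GEMM over flattened patches. The NumPy sketch below shows that identity with hypothetical names and toy sizes. It says nothing about how examples/npu_patch_embed.py tiles or quantizes the work.

```python
import numpy as np

def patchify(img, p):
    """(H, W, C) image -> (num_patches, p*p*C) matrix, patches in row-major order."""
    h, w, c = img.shape
    assert h % p == 0 and w % p == 0, "image must tile evenly into patches"
    x = img.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (H/p, W/p, p, p, C)
    return x.reshape(-1, p * p * c)

def patch_embed(img, weight, bias, p):
    """Patch embedding as a single GEMM.

    weight -- (p*p*C, D) projection, i.e. the conv kernel flattened per output
    bias   -- (D,)
    """
    return patchify(img, p) @ weight + bias

rng = np.random.default_rng(0)
p, d_model = 4, 5                              # toy sizes for illustration
img = rng.integers(-8, 8, size=(8, 8, 3))      # integers keep the check exact
weight = rng.integers(-4, 4, size=(p * p * 3, d_model))
bias = rng.integers(-4, 4, size=(d_model,))

tokens = patch_embed(img, weight, bias, p)
print(tokens.shape)                            # (4, 5): 2x2 patches, 5 features
```

For scale, a 16×16 patch size on a 224×224 RGB image gives 196 patches of 768 values each, so the whole embed is a 196×768 by 768×D product.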
AGPLv3 (this repo's original work) — see LICENSE. A commercial / proprietary license (no copyleft) is available for closed-source or SaaS use: COMMERCIAL.md. AMD's XDNA/XRT and IRON/mlir-aie toolchains are Apache-2.0 WITH LLVM-exception and installed, not redistributed.