Skip to content

Scottcjn/open-xdna

Repository files navigation

⚡ open-xdna

Running AI on the AMD NPU that AMD says you can't.

The first-generation XDNA1 NPU — "Phoenix" / "Hawk Point" — in millions of Ryzen laptops, brought up on Linux with 100% open source. Models running on the silicon. Custom vector kernels hand-authored in the AIE ISA.

BCOS Certified License NPU Status Stack Built from


FastFlowLM and AMD Lemonade — the stacks everyone points to for "LLMs on the Ryzen AI NPU" — support XDNA2 only. If you own a Ryzen 7040 / 8040 laptop, AMD's answer for your NPU on Linux is "not available — use the GPU."

open-xdna is the answer that says "here's how."

$ xrt-smi examine
  [0000:06:00.1]   RyzenAI-npu1   aie2   6x5      ←  the chip they left behind, alive on Linux

🏆 Proven on silicon (every row measured, on a real Ryzen 7 8845HS)

Milestone Result
🔌 Gen-1 NPU driven by fully open-source XRT + driver + firmware
♻️ Survives reboot + kernel updates (DKMS)
⚙️ 512³ int16 matmul on the NPU ~68 GFLOPS
🧠 A real model's forward pass (2-layer MLP) on the NPU ✅ bit-exact
✂️ Non-bijunctive collapse (keep-strong / prune-weak) ✅ prune 50→92%
🛠️ Hand-authored AIE vector kernel (aie::ge+aie::select) ✅ first compile
🔁 Fused reduce_max→dynamic-τ collapse in one kernel ✅ data-adaptive
🔀 aie::reverse (vec_perm / shuffle) on the NPU
🎯 Full collapse (reduce+threshold+compact) in one AIE kernel
📈 Measured FFN net-win (NPU prune → dense down_proj) 1.6× @ cos 0.998
🎲 Attention KV-prune (NPU evicts low-mass keys) 4× @ cos 0.976
🖼️ Vision pipeline (rgba2gray→3×3 conv→threshold) on the NPU PASS — the NPU's native CNN strength
🧩 SigLIP/ViT patch-embed wired on the NPU ✅ bit-exact (offload play; ~3.2× slower than CPU at base size — honest)

😈 Why this exists

Hardware vendors retire silicon with software, not screwdrivers. AMD moved on to XDNA2 and told gen-1 owners the NPU "isn't available" for LLMs on Linux. The chip is fine. The driver's in the kernel. The compiler exists. What was missing was a recipe — so here's one, end to end, with receipts.

🔧 The four fixes nobody documented together

Getting a kernel onto an XDNA1 NPU on a current Linux needs four non-obvious moves (full detail: docs/BRINGUP.md, error-string fixes in the FAQ):

  1. Install libxrt_driver_xdna.so (not just libvxdna.so) — else xrt-smi reports "0 devices found".
  2. Load the matched staging amdxdna.ko — mainline (≤6.17) lacks ioctls the SHIM needs.
  3. Put llvm-objcopy on PATH — GNU objcopy can't parse the AIE2 ELF.
  4. Match the NPU firmware — stale firmware aborts commands (ERT_CMD_STATE_ABORT).

Upstream gap writeup: docs/UPSTREAM_amdxdna_ioctls.md — mainline amdxdna (≤6.17) is missing the GET_ARRAY/telemetry ioctls the current XRT SHIM expects (why fix #2 is needed).

🚀 Quick start

bash scripts/setup_iron.sh                      # IRON/mlir-aie + Peano
sudo bash scripts/swap_driver.sh                # matched staging driver
sudo bash scripts/install_firmware.sh 1.5.5.391 # matching NPU firmware
sudo bash scripts/run_example.sh basic/matrix_multiplication/single_core   # 68 GFLOPS on the NPU

Run the hand-authored kernels:

python3 examples/npu_tiny_mlp.py            # a model's matmuls on the NPU
python3 examples/npu_collapse_fused.py      # fused reduce_max -> dynamic-tau collapse
python3 examples/npu_collapse_runtime.py    # runtime-tau collapse (on-NPU aie::sub shift)
python3 examples/npu_shuffle_demo.py        # aie::reverse (vec_perm) on the NPU
python3 examples/npu_pse_collapse.py        # FULL collapse (reduce+threshold+compact) in ONE kernel
python3 examples/npu_ffn_prune.py           # MEASURED FFN net-win + accuracy tradeoff
python3 examples/npu_attention_prune.py     # MEASURED attention KV-prune (both GEMMs shrink)
python3 mlir-aie/programming_examples/vision/edge_detect/edge_detect.py -W 512 -H 512  # 2D conv vision pipeline on NPU
python3 examples/npu_patch_embed.py         # SigLIP/ViT patch-embed on the NPU (measured, honest)

🧬 The flex: you can hand-write AIE kernels like AltiVec

The AIE2 vector ISA is the same primitive class as PowerPC AltiVec/VSX — just new mnemonics. The non-bijunctive collapse, hand-authored (examples/kernels/collapse.cc):

aie::vector<bfloat16,32> x   = aie::load_v<32>(a + i);    // vec_ld
aie::mask<32>            keep = aie::ge(x, tau_v);         // vec_cmpge  → mask
aie::vector<bfloat16,32> out  = aie::select(zero_v, x, keep); // vec_sel: keep ? x : 0
aie::store_v(c + i, out);                                 // vec_st

Porting your own SIMD kernels? docs/ALTIVEC_TO_AIE.md maps the whole vocabulary.

📒 Full benchmark log (by model)

Every measurement, organized by model/architecture — what was tested, the result, measured-vs-projected: BENCHMARKS.md. Includes the honest negatives (gemma4 layer-prune net loss, vision offload-not-speed).

📊 The honest part

This is not a turnkey LLM server, and the NPU is not a fast matmul engine — measured, it's ~6× slower than the integrated Radeon 780M at dense GEMM and loses on energy-per-GFLOP for dense work (RESULTS.md). Its real value is a ~6.6 W power floor and cheap pruning/selection that lets a stronger device do less work. A prune-then-shrink FFN nets ~1.3–1.6× less work (not the naive 4× — you still pay the producer projection), gated on accuracy. We measure before we claim, and we mark every frontier.

🗺️ Does my chip have an XDNA1 NPU?

lspci -d 1022:1502 → "AMD IPU Device" = yes. 7840HS/U · 8840HS/U · 8845HS · 7940HS · 8945HS · 7640HS/U · 8640HS · 8645HS · 8600G/8700G. (Desktop 8500G/8300G have no NPU — Zen4c die.)

🔗 Related work — Elyan Labs

Heterogeneous-compute research (PSE non-bijunctive collapse, RAM coffers / NUMA weight banking, neuromorphic device routing) by Elyan Labs. See also ram-coffers · pse-vcipher-collapse.

🖼️ Multimodal: vision front-end on the NPU

The NPU was built for vision/CNN inference — a multimodal model's image tower (patch-embed convs, preprocessing) is a far better NPU fit than text decode. A full conv pipeline already runs on the NPU; see docs/VISION_OFFLOAD.md for the offload, and docs/MULTIMODAL_3WAY.md for the full 3-processor picture (incl gemma4:26b @ 16.5 tok/s on the 780M via 96 GB — a 17 GB model the 8 GB discrete can't hold).

📜 License

AGPLv3 (this repo's original work) — see LICENSE. A commercial / proprietary license (no copyleft) is available for closed-source or SaaS use: COMMERCIAL.md. AMD's XDNA/XRT and IRON/mlir-aie toolchains are Apache-2.0 WITH LLVM-exception and installed, not redistributed.

About

Open-source bring-up + verified matmul on the first-gen AMD XDNA1 (Phoenix/Hawk Point) NPU on Linux — the gen FastFlowLM/Lemonade skip. RyzenAI-npu1, mlir-aie/IRON, XRT.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages