A CXL Type 2 accelerator on an Intel Agilex 7 FPGA (IA-780i platform) with a Vortex
RV64 SIMT GPU and configurable memory-latency injection. PCI address 0000:ad:00.0,
device ID 8086:0DDB.
┌─────────────────────────────────────────────────────────────────────┐
│ Host CPU │
│ ├─ MMIO (BAR0 CSR writes) ──── CXL.io / PIO ────┐ │
│ ├─ Memory load/store ──── CXL.mem ──────────┤ │
│ └─ Cache coherency ──── CXL.cache ────────┤ │
└──────────────────────────────────────────────────────┼─────────────┘
PCIe/CXL 16-lane link │
┌──────────────────────────────────────────────────────┼─────────────┐
│ Intel CXL IP (intel_rtile_cxl_top) │ │
│ ├─ PCIe TLP decode ──────────────────────────────┐│ │
│ ├─ DVSEC registers (CXLCap, CXLCtl, Ranges) ││ │
│ ├─ HDM Decoder (HPA → DPA translation) ││ │
│ └─ Mailbox, Device Status, Component Regs ││ │
├─────────────────────────────────────────────────────┘│ │
│ │ │
│ ┌──── PIO AVMM bus (125 MHz, 64-bit) ──────────────┤ │
│ │ │ │
│ ▼ │ │
│ ex_default_csr_top │ │
│ └─ ex_default_csr_avmm_slave │ │
│ ├─ Vortex GPU CSRs (0x100–0x148) │ │
│ └─ launch trigger, status, perf counters │ │
│ │ │ │
│ ▼ │ │
│ ┌── afu_top ────────────────────────────────────────┤ │
│ │ │ │
│ │ vortex_gpu_wrapper │ │
│ │ └─ Vortex GPU core (RV64 SIMT) │ │
│ │ ├─ Port 0 (host mem) ─── tied off │ │
│ │ └─ Port 1 (dev mem) ───┐ │ │
│ │ │ │ │
│ │ axi_mc_arbiter ◄──────────────┘ │ │
│ │ ├─ m0: HDM Ch1 (CXL.mem from IP) ◄───────────┘ │
│ │ ├─ m1: GPU Port 1 (Vortex AXI) │
│ │ └─ s: merged → MC Channel 1 │
│ │ │ │
│ │ [Delay Buffer] ◄── Channel 0 only, configurable latency │
│ │ │ │
│ └──────────────┤ │
│ ▼ │
│ mc_top │
│ ├─ mc_single_chan_hdm_axi_fsm (AXI → EMIF AVMM) │
│ ├─ ECC encode/decode (Altera ECC IP) │
│ ├─ CDC FIFOs (ip_clk ↔ emif_clk) │
│ └─ EMIF DDR4-2666 (dual channel, 32GB each) │
│ ├─ dram0: Channel 0 (host CXL.mem + delay buffer) │
│ └─ dram1: Channel 1 (host CXL.mem + GPU shared) │
└────────────────────────────────────────────────────────────────────┘
cxltyp2_ed.sv Top (PCIe PHY, RTile PLL, CXL IP)
└─ ed_top_wrapper_typ2.sv Endpoint wrapper
├─ ex_default_csr_top CSR decoder (AVMM → registers)
│ └─ ex_default_csr_avmm_slave Register file (DVSEC + Vortex GPU)
├─ afu_top Application Function Unit
│ ├─ vortex_gpu_wrapper Vortex RV64 SIMT GPU
│ │ └─ Vortex core 2 AXI ports (port 0 tied off)
│ └─ axi_mc_arbiter 2-to-1 AXI mux (host + GPU → MC)
├─ mc_top Memory controller
│ ├─ mc_single_chan_hdm_axi_fsm Per-channel AXI→AVMM FSM
│ ├─ mc_devmem_top Error monitoring (SBE/DBE/Poison)
│ └─ mc_emif_avmm EMIF DDR4 instantiation
└─ cafu_csr0_cfg CXL Feature CSR (mem_enable, etc.)
Host MMIO read/write to BAR0 for GPU configuration and control.
Host CPU store/load
→ PCIe TLP (BAR0 target)
→ Intel CXL IP PIO engine
→ AVMM bus (125 MHz, 64-bit data, byte-enable)
address[21:0], writedata[63:0], readdata[63:0]
→ ex_default_csr_top
→ ex_default_csr_avmm_slave
├─ decode address → register select
├─ write: latch to register FF
└─ read: drive readdata from register/status
→ AVMM response → PIO completion TLP → Host
Bus: AVMM, 64-bit, ip2csr_avmm_clk (125 MHz)
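The slave's latch-on-write / drive-on-read behavior can be sketched as a small software model (illustrative only, not the RTL; the register offsets used below come from the CSR map later in this document):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Software model of the AVMM CSR slave: a write latches writedata into the
// selected register "FF"; a read drives readdata from the latched value.
class CsrSlave {
public:
    void write(uint32_t addr, uint64_t writedata) { regs_[addr] = writedata; }
    uint64_t read(uint32_t addr) const {
        auto it = regs_.find(addr);
        return it == regs_.end() ? 0 : it->second;  // unimplemented regs read 0
    }
private:
    std::unordered_map<uint32_t, uint64_t> regs_;   // offset -> latched value
};
```

A host MMIO write to BAR0+0x100 followed by a readback would observe exactly this round-trip.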
Host load/store to HPA range [0x180000000000, +16GB) via CXL.mem protocol.
Host CPU load/store to HPA
→ CXL.mem request TLP over PCIe/CXL link
→ Intel CXL IP HDM Decoder
HPA [51:6] → DPA translation (base/size/interleave)
→ HDM AXI interface (ip2hdm_aximm channels)
AXI4: 512-bit data, 52-bit address, 8-bit ID
→ afu_top
├─ Channel 0: [delay buffer] → mc_top ch0 → EMIF dram0
└─ Channel 1: axi_mc_arbiter (m0=host, m1=GPU) → mc_top ch1 → EMIF dram1
→ mc_single_chan_hdm_axi_fsm
AXI4 → AVMM conversion, ECC insertion
→ EMIF DDR4-2666 SDRAM
→ Read data + ECC check → response FIFO → CXL completion → Host
Bus: AXI4, 512-bit (64B cache line), ip2hdm_clk (~545 MHz SIP)
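With a single non-interleaved range, the HDM decoder's HPA → DPA step reduces to a bounds check plus an offset subtraction (a sketch using the 16GB range from this section; constant names are illustrative):

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// HPA -> DPA translation for this design's single, non-interleaved range.
// With one interleave way, translation is a plain offset subtraction on the
// cache-line-aligned address bits [51:6].
constexpr uint64_t kHdmBase = 0x180000000000ULL;  // HPA base from this doc
constexpr uint64_t kHdmSize = 16ULL << 30;        // 16GB

std::optional<uint64_t> hpa_to_dpa(uint64_t hpa) {
    if (hpa < kHdmBase || hpa >= kHdmBase + kHdmSize)
        return std::nullopt;          // outside the committed HDM range
    return hpa - kHdmBase;            // DPA 0x0 corresponds to the HPA base
}
```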
Host writes CSR registers, then triggers kernel execution.
Host writes BAR0 CSRs:
0x100: KERNEL_ADDR ← entry point (0x80000000)
0x108: KERNEL_ARGS ← pointer to arg struct in shared mem
0x110: GRID_DIM_X ← grid dimensions
0x114: GRID_DIM_Y
0x118: GRID_DIM_Z
0x11C: BLOCK_DIM_X ← block dimensions
0x120: BLOCK_DIM_Y
0x124: BLOCK_DIM_Z
0x140: COMPLETION_LO ← DCOH completion address
0x144: COMPLETION_HI
0x148: DCOH_ENABLE ← enable completion writeback
Host writes BAR0 + 0x128 = 1:
→ ex_default_csr_avmm_slave.LAUNCH register
→ vx_launch_trigger pulse (1 cycle)
→ vortex_gpu_wrapper starts execution
→ Vortex core fetches kernel binary from shared memory
→ Threads execute GEMM (grid-stride loop)
→ vx_fence() to flush stores
→ DCOH writeback: CompletionData.magic = 0xDEADBEEF
Host polls:
BAR0 + 0x12C (STATUS): 0x00=IDLE, 0x01=RUNNING, 0x02=DONE, 0xFF=ERROR
or polls completion→magic in shared memory (DCOH)
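The host-side half of this sequence can be sketched as follows. On real hardware `bar0` would be an `mmap()` of the BAR0 resource; here a plain buffer stands in so the write ordering is visible (offsets are from the CSR map in this document, grid/block values are example dimensions):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static uint8_t bar0[0x1000];  // stand-in for the mmap()'d BAR0 region

static void wr32(uint32_t off, uint32_t v) {
    std::memcpy(bar0 + off, &v, sizeof v);  // volatile MMIO store on real HW
}

void launch_kernel(uint64_t entry, uint64_t args_ptr) {
    wr32(0x100, static_cast<uint32_t>(entry));           // KERNEL_ADDR_LO
    wr32(0x104, static_cast<uint32_t>(entry >> 32));     // KERNEL_ADDR_HI
    wr32(0x108, static_cast<uint32_t>(args_ptr));        // KERNEL_ARGS_LO
    wr32(0x10C, static_cast<uint32_t>(args_ptr >> 32));  // KERNEL_ARGS_HI
    wr32(0x110, 8); wr32(0x114, 8); wr32(0x118, 1);      // grid dims (example)
    wr32(0x11C, 8); wr32(0x120, 4); wr32(0x124, 1);      // block dims (8,4,1)
    wr32(0x128, 1);                                      // LAUNCH last: trigger pulse
}
```

Writing LAUNCH last matters: all configuration registers must be latched before the one-cycle `vx_launch_trigger` fires.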
GPU threads read/write device DDR through AXI Port 1.
Vortex GPU core (executing kernel threads)
→ AXI Port 1 (device memory, 512-bit, 4-bit ID)
→ axi_mc_arbiter
m1 input (GPU): ID[7:0] = {1'b1, 3'b0, gpu_id[3:0]}
m0 input (host HDM Ch1): ID[7] = 0
Arbitration: priority/round-robin between host and GPU
→ s output → mc_top Channel 1
→ mc_single_chan_hdm_axi_fsm
→ ECC encode → EMIF dram1 → DDR4 SDRAM
→ Response: ID[7] demuxes back to GPU (1) or host (0)
Shared address space: Both host and GPU access the same DDR via Channel 1. GPU Port 0 (host memory access) is tied off — not used in this design.
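The ID tagging that lets `axi_mc_arbiter` demux responses can be captured in two helper functions (encoding taken from this section; function names are illustrative):

```cpp
#include <cassert>
#include <cstdint>

// GPU requests carry ID = {1'b1, 3'b0, gpu_id[3:0]}; host HDM requests keep
// ID[7] = 0. Bit 7 alone is enough to route each response to its master.
uint8_t gpu_axi_id(uint8_t gpu_id) {
    return static_cast<uint8_t>(0x80 | (gpu_id & 0x0F));  // {1, 000, gpu_id}
}
bool response_goes_to_gpu(uint8_t axi_id) {
    return (axi_id & 0x80) != 0;  // ID[7] == 1 -> GPU, 0 -> host
}
```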
Cache-coherent completion signaling without polling CSR.
GPU kernel finishes computation
→ Writes CompletionData struct to shared memory:
.status = 0 (success)
.result = FLOP count
.cycles = cycle counter
.timestamp = timer value
.magic = 0xDEADBEEF ← written last (release semantics)
→ Write flows through AXI Port 1 → MC → DDR
→ CXL.mem coherency ensures host CPU sees updated cache line
→ Host polling loop detects magic == 0xDEADBEEF
(or host uses mwait/monitor on the cache line)
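The "magic written last" handshake maps onto release/acquire ordering. The sketch below models it with C++ atomics; on the real device the ordering comes from `vx_fence()` plus CXL.mem coherency, not from C++ (field layout follows the CompletionData fields above, values are placeholders):

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

struct CompletionData {
    uint32_t status = 1;           // 0 = success
    uint64_t result = 0;           // FLOP count
    uint64_t cycles = 0;
    uint64_t timestamp = 0;
    std::atomic<uint32_t> magic{0};
};

void gpu_complete(CompletionData& c, uint64_t flops, uint64_t cyc) {
    c.status = 0;
    c.result = flops;
    c.cycles = cyc;
    c.magic.store(0xDEADBEEF, std::memory_order_release);  // written last
}

bool host_sees_done(const CompletionData& c) {
    // acquire load: once magic is observed, all earlier field writes are too
    return c.magic.load(std::memory_order_acquire) == 0xDEADBEEF;
}
```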
| Offset | Region | Access |
|---|---|---|
| 0x000000 | Vendor CSR space (AVMM slave) | R/W |
| 0x0E0000 | PCIe config space mirror | R |
| 0x150000 | CXL Component Registers (HDM decoder) | R/W |
| 0x180000 | CXL Device Registers (status, mailbox) | R/W |
| Offset | Name | Width | Access | Description |
|---|---|---|---|---|
| 0x100 | KERNEL_ADDR_LO | 32 | R/W | Kernel entry point [31:0] |
| 0x104 | KERNEL_ADDR_HI | 32 | R/W | Kernel entry point [63:32] |
| 0x108 | KERNEL_ARGS_LO | 32 | R/W | Kernel args pointer [31:0] |
| 0x10C | KERNEL_ARGS_HI | 32 | R/W | Kernel args pointer [63:32] |
| 0x110 | GRID_DIM_X | 32 | R/W | Grid dimension X |
| 0x114 | GRID_DIM_Y | 32 | R/W | Grid dimension Y |
| 0x118 | GRID_DIM_Z | 32 | R/W | Grid dimension Z |
| 0x11C | BLOCK_DIM_X | 32 | R/W | Block dimension X |
| 0x120 | BLOCK_DIM_Y | 32 | R/W | Block dimension Y |
| 0x124 | BLOCK_DIM_Z | 32 | R/W | Block dimension Z |
| 0x128 | LAUNCH | 32 | W | Write 1 to trigger kernel |
| 0x12C | STATUS | 8 | R | 0x00=IDLE 0x01=RUNNING 0x02=DONE 0xFF=ERROR |
| 0x130 | CYCLE_LO | 32 | R | Cycle counter [31:0] |
| 0x134 | CYCLE_HI | 32 | R | Cycle counter [63:32] |
| 0x138 | INSTR_LO | 32 | R | Instruction counter [31:0] |
| 0x13C | INSTR_HI | 32 | R | Instruction counter [63:32] |
| 0x140 | COMPLETION_LO | 32 | R/W | Completion address [31:0] |
| 0x144 | COMPLETION_HI | 32 | R/W | Completion address [63:32] |
| 0x148 | DCOH_ENABLE | 32 | R/W | Enable DCOH completion writeback |
| Domain | Frequency | Signals | Usage |
|---|---|---|---|
| SIP | ~545 MHz | ip2hdm_clk | HDM AXI, core datapath |
| CSR | 125 MHz | ip2csr_avmm_clk | BAR0 PIO, register access |
| CAFU | 125 MHz | ip2cafu_avmm_clk | CXL feature CSR access |
| EMIF | DDR-dependent | emif_usr_clk | DDR4 controller |
| JTAG | 33 MHz | altera_reserved_tck | Debug |
CDC crossings use async FIFOs: CSR↔AXI, AXI↔EMIF.
Offset Content Alignment
───────────── ─────────────────────── ──────────
0x00000000 GemmKernelArgs (72B) 64B (cache line)
0x00000040 CompletionData (64B) 64B (cache line)
0x00001000 Matrix A (M*K floats) 4KB
0x00001000+A Matrix B (K*N floats) 4KB
0x00001000+A+B Matrix C (M*N floats) 4KB
...
0x80000000 Kernel binary (.bin) 4B (instruction aligned)
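The matrix offsets in this layout can be computed mechanically. The helper below assumes B and C are packed at the next 4KB boundary after the preceding matrix (the table's "+A" / "+A+B" notation does not pin this down exactly, so `align4k` is an assumption):

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t align4k(uint64_t x) { return (x + 0xFFF) & ~0xFFFULL; }

struct GemmLayout { uint64_t a, b, c; };

GemmLayout gemm_offsets(uint64_t M, uint64_t N, uint64_t K) {
    GemmLayout L;
    L.a = 0x1000;                                   // Matrix A (M*K floats)
    L.b = align4k(L.a + M * K * sizeof(float));     // Matrix B (K*N floats)
    L.c = align4k(L.b + K * N * sizeof(float));     // Matrix C (M*N floats)
    return L;
}
```

For the 64x64x64 test case, A occupies 0x4000 bytes, placing B at 0x5000 and C at 0x9000.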
The "delay buffer" is a configurable read-latency injection stage on HDM
Channel 0. The read_delay register controls the depth of a FIFO-based delay
stage in afu_top, allowing characterization of how CXL.mem latency impacts
application performance.
- Channel 0: Host CXL.mem traffic passes through delay buffer → dram0
- Channel 1: Direct path (no delay) shared between host CXL.mem and GPU → dram1
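A behavioral sketch of the latency-injection idea: each read response sits in a FIFO until `read_delay` cycles have elapsed (this models the concept only; names and the cycle accounting are illustrative, not the RTL):

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <utility>

struct DelayBuffer {
    unsigned read_delay;                              // CSR-configured latency
    unsigned cycle = 0;
    std::deque<std::pair<unsigned, uint64_t>> fifo;   // (release_cycle, data)

    void accept_response(uint64_t data) {             // from mc_top ch0
        fifo.emplace_back(cycle + read_delay, data);
    }
    bool tick(uint64_t& out) {                        // advance one cycle
        ++cycle;
        if (!fifo.empty() && fifo.front().first <= cycle) {
            out = fifo.front().second;                // release toward CXL IP
            fifo.pop_front();
            return true;
        }
        return false;
    }
};
```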
BIOS sets config_lock=1 (write-once latch) before the OS can enable CXL.mem
in the DVSEC control register. The RTL fix forces mem_enable=1 at reset:
- cafu_csr0_cfg_pkg.sv: MEM_ENABLE_RESET = 1'b1 (was 1'b0)
- cafu_csr0_cfg.sv: FF reset value 1'h1 (was 1'h0)
- Effect: CXLCtl = 0x0007 (Cache+ IO+ Mem+) at power-on
cxl_type2_accel.c registers the device as both CXL cache device and memory
device. Probe sequence:
- pcim_enable_device() + pci_set_master()
- Allocate cxl_memdev_state (embeds cxl_dev_state at offset 0)
- Detect RCiEP topology → set cxlds->rcd = true
- Read/enable DVSEC: Cache+ IO+ Mem+
- Map component registers (BAR0+0x150000) for RAS
- Register cache device: devm_cxl_add_cachedev() (128MB, 64B lines)
- Set memory size: 16GB volatile
- DPA partition setup: cxl_mem_dpa_fetch() + cxl_dpa_setup()
- Register memory device: devm_cxl_add_memdev()
Additional kernel patches (applied):
| File | Fix |
|---|---|
| core/port.c | is_cxl_ep_device() for cachedev endpoint detection |
| core/pci.c | Skip DVSEC ranges with base=0; early HDM enable when no ranges |
| core/hdm.c | Don't emulate decoders when no DVSEC ranges |
| core/cdat.c | Defer gp_port init after RCH early-return |
| cxlmem.h | is_cxl_endpoint() uses is_cxl_ep_device() |
| acpi.c | Synthetic root decoder for RCH dports without CFMWS |
cxl_acpi_probe()
→ cxl_inject_synthetic_cfmws() (creates synthetic root decoder0.12)
HPA [0x180000000000, +16GB] targeting RCH dport (pci0000:ad)
→ cxl_mem endpoint probe
→ cxl_hdm_decode_init() HDM decoder enabled (no range validation)
→ decoder8.0 committed DPA 0x0 → 16GB
→ cxl create-region region12 created
→ dax_cxl /dev/dax12.0 (devdax mode)
Cross-compiled RV64 binary (kernels/gemm_kernel.bin, 824 bytes).
Toolchain: riscv64-unknown-elf-gcc -march=rv64imafdc -mabi=lp64d
Entry point (crt0.S):
- Compute global_tid = core_id * (warps * threads) + warp_id * threads + thread_id
- Set per-thread stack: sp = _stack_base - (gtid+1) * 4096
- Read kernel args from CSR mscratch (0x340)
- Call kernel_main(args)
- ecall to signal completion
GEMM kernel (gemm_kernel.c): grid-stride loop, each thread computes
C[row][col] = alpha * dot(A[row,:], B[:,col]) + beta * C[row][col].
Thread mapping: Block (8,4,1) = 32 threads/warp. Grid sized to cover output matrix. Each thread strides by total thread count.
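The per-thread grid-stride loop can be run sequentially on the host to see the indexing (this mirrors the description above, not the exact gemm_kernel.c source):

```cpp
#include <cassert>
#include <vector>

// Thread `tid` of `nthreads` computes every nthreads-th element of C,
// exactly as each SIMT thread does under the grid-stride mapping.
void gemm_thread(int tid, int nthreads, int M, int N, int K,
                 float alpha, float beta,
                 const std::vector<float>& A,   // M x K, row-major
                 const std::vector<float>& B,   // K x N, row-major
                 std::vector<float>& C) {       // M x N, row-major
    for (int idx = tid; idx < M * N; idx += nthreads) {   // grid-stride loop
        int row = idx / N, col = idx % N;
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[idx] = alpha * acc + beta * C[idx];
    }
}
```

Running all `tid` values 0..nthreads-1 covers the full output matrix with no overlap, regardless of how grid size relates to matrix size.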
# Build GPU kernel
cd kernels && make
# Build test
g++ -O2 -std=c++17 -o tests/test_gemm_coherent tests/test_gemm_coherent.cpp -lpthread
# Run (auto-detects real device, falls back to simulation)
./tests/test_gemm_coherent --dim 64 --verbose
# Force simulation mode
./tests/test_gemm_coherent --sim
# With kernel binary (real device)
./tests/test_gemm_coherent --kernel kernels/gemm_kernel.bin

| Binary | Purpose |
|---|---|
| tests/probe_bar0 | Safe MMIO discovery with SIGBUS recovery |
| tests/probe_cxl_deep | CXL mailbox commands, capability decode |
| tests/test_csr_readback | Vortex CSR write/readback validation |
| tests/test_kernel_launch | Kernel launch + DCOH completion demo |
# Enable PCI device
echo 1 > /sys/bus/pci/devices/0000:ad:00.0/enable
# Enable bus master
setpci -s ad:00.0 COMMAND=0x0146
# Load kernel modules
modprobe cxl_acpi cxl_type2_accel
# Switch DAX to devdax mode (default may be system-ram)
daxctl reconfigure-device --mode=devdax dax12.0

hardware_test_design/
cxltyp2_ed.sv Top-level (PCIe PHY + CXL IP)
ed_top_wrapper_typ2.sv Endpoint wrapper (AFU + CSR + MC)
common/afu/afu_top.sv AFU: GPU wrapper + AXI arbiter + delay buffer
common/mc_top/mc_top.sv Memory controller (ECC + EMIF)
common/mc_top/hdm_axi_if_pkg.sv HDM AXI interface (512-bit, 52-bit addr)
common/mc_top/mc_single_chan_hdm_axi_fsm.sv AXI→AVMM conversion FSM
common/ex_default_csr/ex_default_csr_avmm_slave.sv Register file
common/cafu_csr0/cafu_csr0_cfg_pkg.sv DVSEC reset values (mem_enable fix)
common/cafu_csr0/cafu_csr0_cfg.sv DVSEC register FFs
build/cafu_csr0_t2.rdl Register Description Language (full map)
kernels/
gemm_kernel.c SIMT GEMM kernel source
crt0.S Thread startup (stack + entry)
Makefile RV64 cross-compilation
gemm_kernel.bin Compiled binary (824 bytes)
tests/
test_gemm_coherent.cpp GEMM test (real device + DAX or simulation)
probe_bar0.cpp BAR0 MMIO region discovery
probe_cxl_deep.cpp CXL mailbox + register decode
test_csr_readback.cpp Vortex CSR validation
test_kernel_launch.cpp Kernel launch + DCOH demo