EdgeMATX-TinyML-Accelerator is a Verilog-based RISC-V accelerator project that integrates a 4x4 fixed-point matrix-multiplication engine with PicoRV32 through the PCPI custom-instruction interface.
The repository is organized around a simulation-first workflow:
- standalone accelerator validation,
- PicoRV32 + PCPI integration,
- firmware-driven regression and cycle comparison,
- preparation for later FPGA deployment on Pynq-Z2.
- 4x4 Q5.10 systolic matrix accelerator RTL.
- PicoRV32 integration through a custom PCPI instruction.
- Scripted regression, handoff validation, and 3-way cycle comparison.
- Live real-input flow for evaluator-provided matrices.
- Beginner-focused documentation for wrapper RTL, accelerator RTL, and systolic-array concepts.
- Standalone accelerator RTL is validated in simulation.
- PicoRV32 + PCPI integration is working end-to-end in simulation.
- Firmware-driven smoke, regression, handoff, professor-demo, and cycle-compare flows are available.
- FPGA timing closure and on-board performance measurement remain future work.
- start_here/: quick-entry docs and evaluation flow.
- integration/pcpi_demo/: main PicoRV32 + PCPI + accelerator flow.
- accel_standalone/: standalone accelerator RTL evaluation flow.
- picorv32/: vendored PicoRV32 core from YosysHQ.
- RISC-V/: vendored RV32I core reference implementation.
- midsem_sim/: compatibility shim for older standalone-flow paths.
- integration/pcpi_demo/legacy/: fallback/reference assets separated from active flow.
- start_here/README.md
- start_here/EVAL_FLOW.md
- integration/pcpi_demo/README.md
- docs/diagrams/pcpi_wrapper_realistic_block_diagram.drawio.xml
Required for all simulation flows:
- git
- PowerShell (Windows)
- iverilog and vvp (Icarus Verilog)
- python (Python 3)
Recommended for waveform inspection:
- gtkwave
Required for firmware rebuild (PCPI regression/handoff):
Option A: Native toolchain on Windows:
- riscv64-unknown-elf-gcc
- riscv64-unknown-elf-objcopy
- make
Option B: WSL fallback (Ubuntu), used by scripts when native toolchain is missing:
sudo apt-get update
sudo apt-get install -y gcc-riscv64-unknown-elf binutils-riscv64-unknown-elf make python3
Quick verify commands:
git --version
iverilog -V
vvp -V
python --version
gtkwave --version
wsl bash -lc "riscv64-unknown-elf-gcc --version | head -n 1"
This repository currently includes the upstream core as a vendor copy (not a submodule):
- Upstream: https://github.com/srpoyrek/RISC-V
- Imported on: 2026-02-28
- Import metadata: RISC-V/VENDORING.md
The RISC-V/ folder is tracked directly by this repo, and RISC-V/.git is intentionally removed.
Based on the upstream README, the core includes:
- 5-stage pipelined RISC-V architecture (RV32I)
- Verilog HDL implementation
- Modules such as control unit, hazard detection, forwarding, ALU, and memory-related blocks
- Testbench-based verification (ModelSim-oriented upstream setup)
- Show simulation-first progress in mid-sem using accel_standalone.
- Stabilize accelerator + custom instruction interface in simulation before FPGA deployment.
- Transition from analytic speedup estimates to measured board timings (ARM and mcycle).
- Accelerator is currently validated in standalone RTL simulation (accel_standalone).
- PicoRV32 is vendored and available in-repo (picorv32/).
- A first CPU integration milestone is implemented in simulation via PCPI (integration/pcpi_demo).
- Custom instruction path is tested with machine code loaded directly into testbench memory.
- PCPI demo now uses matrix base pointers (rs1/rs2), reads A/B from memory, and writes the C buffer back to memory.
- Firmware scaffold is added (firmware.S, linker script, Makefile, hex generation path) with fallback hex support when the toolchain is unavailable.
- Full board deployment is still pending.
- Add a proven integration-ready RV32 core (recommended: PicoRV32).
- Wrap the accelerator with a coprocessor/custom-op interface (start, busy, done).
- Add decode/handshake logic so a custom instruction triggers matrix multiply.
- Build CPU+accelerator simulation testbench and verify correctness plus stall behavior.
- Replace analytic speedup estimates with measured cycle counts (mcycle / ARM timing).
- Move to Vivado/Pynq-Z2 hardware integration after simulation sign-off.
Run from repository root:
.\accel_standalone\scripts\run_midsem_sim.ps1
Generated artifacts:
- accel_standalone/results/sim_output.log
- accel_standalone/results/MIDSEM_RESULTS.md
Compatibility shim (old path, still supported):
.\midsem_sim\scripts\run_midsem_sim.ps1
Run from repository root:
.\integration\pcpi_demo\scripts\run_pcpi_demo.ps1
Optional C firmware variant (same custom instruction semantics):
.\integration\pcpi_demo\scripts\run_pcpi_demo.ps1 -FirmwareVariant c
Note: the C smoke variant and cycle-compare flow both use the shared source integration/pcpi_demo/firmware/firmware_matmul_unified.c with compile-time mode/address macros.
The firmware offloads to the accelerator by explicitly emitting the custom instruction word 0x5420818b; the compiler does not convert loops into accelerator calls automatically.
Generated artifacts:
- integration/pcpi_demo/results/pcpi_demo.log
- integration/pcpi_demo/results/pcpi_demo_wave.vcd
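To make the custom instruction word 0x5420818b easier to read, the sketch below splits it into the standard RISC-V R-type fields and re-encodes it. The bit positions are the standard R-type layout; the function names are hypothetical, and which fields the PCPI wrapper actually matches on is an assumption that should be checked against the wrapper RTL.

```python
# Sketch: decode and re-encode a 32-bit RISC-V R-type instruction word.
# The field layout is the standard R-type format. Which of these fields the
# PCPI wrapper checks is NOT taken from the RTL; treat that as an assumption.


def decode_rtype(word: int) -> dict:
    """Split a 32-bit instruction word into its R-type fields."""
    return {
        "opcode": word & 0x7F,          # bits 6:0
        "rd": (word >> 7) & 0x1F,       # bits 11:7
        "funct3": (word >> 12) & 0x7,   # bits 14:12
        "rs1": (word >> 15) & 0x1F,     # bits 19:15
        "rs2": (word >> 20) & 0x1F,     # bits 24:20
        "funct7": (word >> 25) & 0x7F,  # bits 31:25
    }


def encode_rtype(opcode: int, rd: int, funct3: int, rs1: int, rs2: int, funct7: int) -> int:
    """Pack R-type fields back into a 32-bit instruction word."""
    return (
        ((funct7 & 0x7F) << 25)
        | ((rs2 & 0x1F) << 20)
        | ((rs1 & 0x1F) << 15)
        | ((funct3 & 0x7) << 12)
        | ((rd & 0x1F) << 7)
        | (opcode & 0x7F)
    )


if __name__ == "__main__":
    fields = decode_rtype(0x5420818B)
    print({name: hex(value) for name, value in fields.items()})
    print(hex(encode_rtype(**fields)))
```

For 0x5420818b this gives opcode 0x0b (the custom-0 opcode), rd = x3, rs1 = x1, rs2 = x2, funct3 = 0, and funct7 = 0x2a.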
Run from repository root:
.\integration\pcpi_demo\scripts\run_pcpi_regression.ps1
Generated artifacts:
- integration/pcpi_demo/results/cases/*.log
- integration/pcpi_demo/results/pcpi_regression_summary.md
- integration/pcpi_demo/results/pcpi_regression_summary.json
Run from repository root:
.\integration\pcpi_demo\scripts\run_pcpi_handoff.ps1
Run from repository root:
.\integration\pcpi_demo\scripts\run_pcpi_local_check.ps1
This script runs smoke (asm + c), full regression, and handoff, and exits non-zero on any failure.
Run from repository root:
.\integration\pcpi_demo\scripts\run_cycle_compare.ps1
This reports cycle counts for:
- software baseline without scalar MUL (rv32i, ENABLE_MUL=0)
- software baseline with scalar MUL (rv32im, ENABLE_MUL=1)
- custom-instruction accelerator path
and writes speedup ratios across all three.
Latest verified (2026-03-05):
- accel_cycles=673
- sw_nomul_cycles=26130 (rv32i, ENABLE_MUL=0)
- sw_mul_cycles=7975 (rv32im, ENABLE_MUL=1)
- sw_nomul/accel=38.8262x
- sw_mul/accel=11.8499x
- sw_nomul/sw_mul=3.2765x
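The speedup ratios above follow directly from the three cycle counts. The sketch below recomputes them; the function name and dictionary keys are hypothetical and are not taken from run_cycle_compare.ps1.

```python
# Sketch: recompute the three speedup ratios from measured cycle counts.
# The function name and key names are illustrative, not the script's own.


def speedup_ratios(accel_cycles: int, sw_nomul_cycles: int, sw_mul_cycles: int) -> dict:
    """Return speedup ratios rounded to four decimal places."""
    return {
        "sw_nomul/accel": round(sw_nomul_cycles / accel_cycles, 4),
        "sw_mul/accel": round(sw_mul_cycles / accel_cycles, 4),
        "sw_nomul/sw_mul": round(sw_nomul_cycles / sw_mul_cycles, 4),
    }


if __name__ == "__main__":
    # Cycle counts from the 2026-03-05 run listed above.
    for name, ratio in speedup_ratios(673, 26130, 7975).items():
        print(f"{name}={ratio}x")
```

With the listed counts this reproduces 38.8262x, 11.8499x, and 3.2765x.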
Run from repository root:
.\integration\pcpi_demo\scripts\run_pcpi_professor_demo.ps1
This runs an explainable set of matrix cases (identity, negative identity, zero, half-scale, signed passthrough) and produces a concise demo summary.
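The demo cases are easy to predict by hand, which a small software reference model makes concrete. The sketch below is not the accelerator's RTL behavior or the project's own checker: it assumes products are accumulated at full precision, the sum is shifted right by 10 bits (truncating), and the result wraps to signed 16 bits. The sample matrix and all names are hypothetical.

```python
# Reference-model sketch for a 4x4 Q5.10 matrix multiply.
# Assumptions (not taken from the RTL): full-precision accumulation, a
# truncating right shift by 10 bits, and wrap-around to signed 16 bits.

N = 4
FRAC_BITS = 10
ONE = 1 << FRAC_BITS  # 1.0 in Q5.10


def wrap_int16(value: int) -> int:
    """Wrap an integer into the signed 16-bit range."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def matmul_q5_10(a, b):
    """Multiply two 4x4 matrices of Q5.10 integers; return a Q5.10 matrix."""
    c = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            acc = sum(a[i][k] * b[k][j] for k in range(N))
            c[i][j] = wrap_int16(acc >> FRAC_BITS)
    return c


def scaled_identity(diag: int):
    """Build a 4x4 matrix with `diag` on the main diagonal, zeros elsewhere."""
    return [[diag if i == j else 0 for j in range(N)] for i in range(N)]


# Hypothetical sample input; all values are even so half-scale is exact.
A = [
    [1024, -512, 256, 0],
    [2048, 512, -1024, 128],
    [-256, 64, 32, -32],
    [4, -4, 8, -8],
]

if __name__ == "__main__":
    print("identity keeps A:", matmul_q5_10(A, scaled_identity(ONE)) == A)
    print("half-scale row 0:", matmul_q5_10(A, scaled_identity(ONE // 2))[0])
```

Under these assumptions, multiplying by the identity returns A unchanged, the negative identity negates every element, the zero matrix yields all zeros, and a diagonal of 0.5 halves every element.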
Run from repository root:
python .\integration\pcpi_demo\scripts\estimate_cycle_scaling.py --sizes 4,8,16,32,64
This generates estimated normal-core vs accelerator scaling tables (ideal and overhead-aware) in JSON form.
Mentor/evaluator-provided real matrices can be tested without touching baseline regression cases.json.
Fastest live-evaluation mode (edit one JSON, run one script):
- Edit integration/pcpi_demo/tests/live_real_input.json
- Run:
.\integration\pcpi_demo\scripts\run_pcpi_custom_cycle_compare.ps1
This single command automatically converts real values to Q5.10, generates firmware case data, and runs accelerator + SW no-MUL + SW MUL comparisons. It also writes per-variant outputs in real format:
- integration/pcpi_demo/results/custom_cases/live_eval_active_outputs_real.json
Current checked-in live profile (live_real_input.json) is tuned for near-50x no-MUL comparison and currently measures:
- accel=673, sw_nomul=36246, sw_mul=7975
- sw_nomul/accel=53.8574x
- sw_mul/accel=11.8499x
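The real-to-Q5.10 conversion used by the live flow can be pictured with a short sketch. This is not the code in real_to_q5_10_case.py: it assumes Q5.10 means a signed 16-bit value with 10 fractional bits (scale 1024, range -32.0 to about 31.999), round-to-nearest, and saturation on overflow. All function names are hypothetical.

```python
# Sketch: convert between real numbers and Q5.10 fixed-point integers.
# Assumptions (not taken from real_to_q5_10_case.py): signed 16-bit storage,
# 10 fractional bits, round-to-nearest, saturation on overflow.

FRAC_BITS = 10
SCALE = 1 << FRAC_BITS          # 1024
Q_MIN = -(1 << 15)              # -32768, i.e. -32.0
Q_MAX = (1 << 15) - 1           # 32767, i.e. about 31.999


def real_to_q5_10(x: float) -> int:
    """Quantize a real value to Q5.10, saturating at the 16-bit limits."""
    q = round(x * SCALE)  # Python rounds exact halves to the nearest even integer
    return max(Q_MIN, min(Q_MAX, q))


def q5_10_to_real(q: int) -> float:
    """Convert a Q5.10 integer back to a real value."""
    return q / SCALE


def q5_10_to_hex(q: int) -> str:
    """Render a Q5.10 integer as a 16-bit two's-complement hex word."""
    return format(q & 0xFFFF, "04x")


if __name__ == "__main__":
    for value in (1.0, -0.5, 0.25, 100.0):
        q = real_to_q5_10(value)
        print(value, q, q5_10_to_hex(q), q5_10_to_real(q))
```

Under these assumptions 1.0 maps to 1024, -0.5 maps to -512 (hex fe00), and an out-of-range input such as 100.0 saturates to 32767.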
Convert real values to Q5.10 and print preview only:
python .\integration\pcpi_demo\tests\real_to_q5_10_case.py --input-json .\integration\pcpi_demo\tests\sample_real_input.json
Convert and append timestamped custom case into isolated custom file:
python .\integration\pcpi_demo\tests\real_to_q5_10_case.py --input-json .\integration\pcpi_demo\tests\sample_real_input.json --append-custom
Run one custom case from custom case file:
.\integration\pcpi_demo\scripts\run_pcpi_custom_case.ps1 -CaseName <custom_case_name>
Run one custom case across all 3 performance variants (accelerator, SW no-MUL, SW MUL):
.\integration\pcpi_demo\scripts\run_pcpi_custom_cycle_compare.ps1 -CaseName <custom_case_name>
This writes per-variant logs plus a per-case cycle summary:
- integration/pcpi_demo/results/custom_cases/<case>_cycle_accel.log
- integration/pcpi_demo/results/custom_cases/<case>_cycle_sw_nomul.log
- integration/pcpi_demo/results/custom_cases/<case>_cycle_sw_mul.log
- integration/pcpi_demo/results/custom_cases/<case>_cycle_compare_summary.md
- integration/pcpi_demo/results/custom_cases/<case>_cycle_compare_summary.json
- integration/pcpi_demo/results/custom_cases/<case>_outputs_real.json
Explicitly clear generated custom cases:
python .\integration\pcpi_demo\tests\real_to_q5_10_case.py --clear-generated
- Generated outputs are intentionally ignored (do not commit):
  - integration/pcpi_demo/results/pcpi_cycle_*
  - integration/pcpi_demo/results/pcpi_prof_demo_*
  - integration/pcpi_demo/results/prof_demo_cases/*
  - pynq_z2_custom_core/build/*.out
- run_cycle_compare.ps1 and run_pcpi_professor_demo.ps1 now use a shared lock file (integration/pcpi_demo/firmware/.firmware_flow.lock) to avoid concurrent firmware rewrite races.
- run_pcpi_custom_case.ps1 also uses this lock.
- run_pcpi_custom_cycle_compare.ps1 also uses this lock.
- After any code/script/RTL/testbench change, update both:
  - README.md
  - handoff_project_context.md
- Consolidated tracked handoff/testing table is maintained at: integration/pcpi_demo/TEST_RESULTS_SUMMARY.md
- Mentor-facing progress brief is maintained at: mentor_progress_update.txt
- Beginner-to-advanced full project walkthrough is maintained at: integration/pcpi_demo/docs/MIDSEM_COMPLETE_PROJECT_GUIDE.md
- Dedicated RTL learning docs (wrapper, accelerator, systolic concept, end-to-end interaction) are at:
  - integration/pcpi_demo/docs/RTL_WRAPPER_LINE_BY_LINE.md
  - integration/pcpi_demo/docs/RTL_ACCELERATOR_LINE_BY_LINE.md
  - integration/pcpi_demo/docs/SYSTOLIC_ARRAY_FROM_SCRATCH.md
  - integration/pcpi_demo/docs/END_TO_END_BLOCK_INTERACTION.md
- Design-space and deployment tradeoff note is at: integration/pcpi_demo/docs/DESIGN_TRADEOFFS_AND_USE_CASES.md
- Interactive web visualizer for architecture + handshake animation is at:
  - integration/pcpi_demo/visualizer/README.md
  - integration/pcpi_demo/visualizer/index.html
  - Production URL: https://tinyml-pcpi-visualizer.vercel.app
  - It now includes per-arrow signal inspection, CPU stall/handoff guidance, project-level architecture info, PE dataflow view, step-back control, and draggable split-pane layout.
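The shared firmware lock file mentioned in the notes above guards against two flows rewriting firmware at the same time. The actual scripts are PowerShell and their locking mechanism is not shown here; the sketch below only illustrates one common approach, exclusive file creation with a timeout. The function name, parameters, and timeout values are hypothetical.

```python
# Sketch: an exclusive lock file with a timeout, using atomic file creation.
# This illustrates the general idea only; it is NOT how the PowerShell
# scripts implement .firmware_flow.lock. All names here are hypothetical.
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def firmware_lock(path: str, timeout_s: float = 30.0, poll_s: float = 0.1):
    """Hold `path` as a lock file for the duration of the with-block."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only one
            # process can create the lock file at a time.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire lock: {path}")
            time.sleep(poll_s)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(str(os.getpid()))  # record the owner for debugging
        yield path
    finally:
        os.remove(path)  # release the lock even if the body raised


if __name__ == "__main__":
    lock_path = os.path.join(tempfile.mkdtemp(), ".firmware_flow.lock")
    with firmware_lock(lock_path):
        print("lock held:", os.path.exists(lock_path))
    print("lock released:", not os.path.exists(lock_path))
```

A stale lock left behind by a crashed process would block later runs until the timeout expires, so a real flow would likely also need a manual or age-based cleanup path.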
Generated artifacts:
- integration/pcpi_demo/results/pcpi_handoff.log
- integration/pcpi_demo/results/pcpi_handoff_wave.vcd
- integration/pcpi_demo/results/pcpi_handoff_summary.md
This repository is licensed under the MIT License. See LICENSE.