Add janus_cube_pyc: 16×16 systolic array matrix multiplication accelerator #2
Merged
zhoubot merged 11 commits into LinxISA:main from Feb 14, 2026
Implemented a weight-stationary systolic array for matrix multiplication:
- 16×16 PE array (256 processing elements)
- 16-bit integer inputs (weights and activations)
- 32-bit accumulator for overflow prevention
- Memory-mapped interface for CPU integration
- Complete documentation in README.md

Features:
- Weight-stationary dataflow
- 32-cycle latency (1 load + 16 compute + 15 drain)
- 512-byte input buffers (Matrix A and W)
- 1024-byte output buffer (Matrix C)

Files:
- cube.py: Top-level module with FSM and integration
- cube_pe.py: Processing element (MAC operation)
- cube_array.py: 16×16 systolic array instantiation
- cube_buffer.py: Input/output buffer management
- cube_control.py: Control FSM (reference, not used)
- cube_types.py: Register group dataclasses
- cube_consts.py: Constants and memory map
- README.md: Complete documentation

Successfully tested MLIR emission (653KB output).

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
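The buffer sizes and the 32-cycle latency in this commit message follow from the array dimensions, and a small software model makes the arithmetic easy to check. The sketch below is plain Python, not pyCircuit code. It assumes C = A × W, signed 16-bit inputs, a 32-bit accumulator that wraps on overflow, activations broadcast along each PE row, and partial sums registered once per PE row. None of these dataflow details are stated in the commit message; they are one timing model that happens to reproduce 16 compute + 15 drain cycles. The names `wrap32` and `simulate` are hypothetical.

```python
N = 16          # array dimension from the commit message
IN_BYTES = 2    # 16-bit inputs
ACC_BYTES = 4   # 32-bit accumulators

INPUT_BUFFER_BYTES = N * N * IN_BYTES     # per input matrix (A or W)
OUTPUT_BUFFER_BYTES = N * N * ACC_BYTES   # matrix C
LATENCY_CYCLES = 1 + N + (N - 1)          # 1 load + 16 compute + 15 drain


def wrap32(x):
    """Wrap a Python int to signed 32-bit two's complement (assumed behavior)."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & (1 << 31) else x


def simulate(A, W):
    """Cycle-by-cycle model of a weight-stationary array (assumed dataflow).

    PE(i, j) holds W[i][j]. Row m of A reaches PE row i at cycle m + i, and
    each PE adds its product to the partial sum registered by the PE above.
    Returns (C, cycles) where cycles excludes the weight-load cycle.
    """
    n = len(W)
    rows = len(A)
    psum = [[0] * n for _ in range(n)]      # registered partial sums
    C = [[0] * n for _ in range(rows)]
    cycles = rows + n - 1                   # compute + drain
    for t in range(cycles):
        nxt = [[0] * n for _ in range(n)]
        for i in range(n):
            m = t - i                       # row of A seen by PE row i
            for j in range(n):
                above = psum[i - 1][j] if i > 0 else 0
                a = A[m][i] if 0 <= m < rows else 0
                nxt[i][j] = wrap32(above + a * W[i][j])
        psum = nxt
        m_out = t - (n - 1)                 # row of C leaving the bottom row
        if 0 <= m_out < rows:
            C[m_out] = list(psum[n - 1])
    return C, cycles
```

With 16-bit signed inputs, a single product can reach 2^30 and a 16-term dot product can reach 2^34, so a 32-bit accumulator holds typical data but not the worst case. The model wraps in that case; whether the hardware wraps or saturates is not stated here.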
- Relocated cube module to be alongside janus/bcc for better organization
- Updated all import paths from examples.linx_cpu_pyc.cube.* to janus.cube.*
- Updated README.md with new paths and references

The cube module now lives at janus/pyc/janus/cube/ alongside the Janus BCC CPU implementation.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Set the module name to "cube" for consistent naming in generated MLIR/Verilog/C++ output.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Generate janus_cube_pyc.v and janus_cube_pyc_gen.hpp
- Update cube.py with correct __pycircuit_name__ = "janus_cube_pyc"
- Add janus_cube_pyc to update_generated.sh

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace m.in_wire() with m.input()
- Replace m.const_wire() with m.const()
- Update cube.py, cube_array.py, cube_control.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
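The two renames in this commit are mechanical, so they can be expressed as a text rewrite. The sketch below is a hypothetical helper, not part of the PR. It assumes the module handle is always named `m` and only touches the method names listed in the commit message, leaving arguments untouched.

```python
import re

# Old method name -> new method name, as listed in the commit message.
RENAMES = {
    "in_wire": "input",
    "const_wire": "const",
}


def migrate_source(text: str) -> str:
    """Rewrite m.in_wire( / m.const_wire( calls to the new method names."""
    for old, new in RENAMES.items():
        # \b keeps identifiers such as "sm.in_wire(" from matching.
        text = re.sub(rf"\bm\.{old}\(", f"m.{new}(", text)
    return text
```

Applied to a line such as `x = m.in_wire("a")`, the helper yields `x = m.input("a")`. A real migration would still need a review pass for aliases of `m` and for names built dynamically.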
- Reduce cube.py from 2779 to 357 lines using for loops
- Add util.py with Consts dataclass for common constants
- Add FsmResult and MmioWriteResult dataclasses to avoid tuple unpacking
- Remove redundant files: cube_array.py, cube_buffer.py, cube_control.py, cube_pe.py
- Add C++ testbench (tb_janus_cube_pyc.cpp) with identity and 2x2 tests
- Add run script (run_janus_cube_pyc_cpp.sh)
- Update README with new file structure and JIT patterns

Key JIT patterns applied:
- Functions without @jit_inline execute at Python time
- @jit_inline functions compile to hardware
- Dataclasses for return values (JIT doesn't support tuple unpacking)
- return statements must be at top-level (not inside with blocks)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
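Two of the patterns above can be illustrated in plain Python: loops that run at Python time to replace hand-written repetition, and a dataclass return value with a single top-level return. This is a sketch only and uses no pyCircuit API. `PESlot`, `make_pe_slots`, `fsm_outputs`, and the `ST_*` state encoding are hypothetical names; only the `FsmResult` field names come from the PR.

```python
from dataclasses import dataclass

ARRAY_SIZE = 16  # 16x16 PE array, per the PR description


@dataclass(frozen=True)
class PESlot:
    """Hypothetical descriptor for one processing element."""
    row: int
    col: int
    name: str


def make_pe_slots(size: int = ARRAY_SIZE) -> list[list[PESlot]]:
    """Ordinary Python: a nested loop replaces size*size hand-written entries."""
    return [
        [PESlot(r, c, f"pe_r{r}_c{c}") for c in range(size)]
        for r in range(size)
    ]


# Hypothetical state encoding; the actual values are defined in cube_consts.py.
ST_IDLE, ST_LOAD, ST_COMPUTE, ST_DONE = range(4)


@dataclass
class FsmResult:
    load_weight: bool
    compute: bool
    done: bool


def fsm_outputs(state: int) -> FsmResult:
    """Decode control signals and return them as named fields."""
    load_weight = state == ST_LOAD
    compute = state == ST_COMPUTE
    done = state == ST_DONE
    # One return at the top level of the function, no tuple to unpack.
    return FsmResult(load_weight=load_weight, compute=compute, done=done)
```

Callers read `result.compute` rather than unpacking a tuple, which is the property the commit relies on to stay within what the JIT supports.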
- Add ARCHITECTURE.md with 15 detailed technical diagrams
- Add VISUAL_GUIDE.md with intuitive visual explanations
- Add IMPROVEMENT_PLAN.md for future development roadmap
- Add SystemVerilog testbench (tb_janus_cube_pyc.sv)
- Add run scripts for Icarus Verilog and Verilator
- Update README.md with documentation references

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Implement cube_v2.py with 4-stage pipelined architecture
- Add 64-entry L0A, L0B, ACC buffers with 64-bit MMIO interface
- Add 64-entry issue queue with out-of-order execution support
- Add MATMUL block instruction decoder
- Create CUBE_V2_SPEC.md with complete architecture documentation
- Generate PDF specification using reportlab
- Update README.md and ARCHITECTURE.md with v2 details
- Add Verilog testbench tb_cube_v2.v

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add PE module with @module decorator (cube_v2_pe.py)
- Add L0 Entry module with @module decorator (cube_v2_l0_entry.py)
- Add systolic array using m.instance() for 256 PEs (cube_v2_systolic_reuse.py)
- Add L0 buffer using m.instance() for 128 entries (cube_v2_l0_reuse.py)
- Add top-level module with module reuse (cube_v2_reuse.py)
- Update CUBE_V2_SPEC.md with optimization results
- Fix PrunePortsPass.cpp for LLVM 21 compatibility

Generated module structure:
- janus_cube_pyc (top)
- L0Entry × 128 (64 L0A + 64 L0B)
- CubePE × 256 (4 clusters × 64 PEs)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
This PR adds a new hardware module, janus_cube_pyc, a 16×16 systolic array matrix multiplication accelerator implemented in pyCircuit.

Features
Files Added/Modified
Source Files (janus/pyc/janus/cube/)
- cube.py - Main module with FSM, PE array, and memory interface (357 lines)
- cube_types.py - Dataclasses (CubeState, PERegs, FsmResult, MmioWriteResult)
- cube_consts.py - Constants (states, addresses, array size)
- util.py - Utility functions (Consts dataclass)
- README.md - Documentation

Generated Files (janus/generated/janus_cube_pyc/)
- janus_cube_pyc.v - Verilog RTL (~949 KB)
- janus_cube_pyc_gen.hpp - C++ header (~1.1 MB)

Test Files
- janus/tb/tb_janus_cube_pyc.cpp - C++ testbench
- janus/tools/run_janus_cube_pyc_cpp.sh - Test runner script

Architecture
Implementation Highlights
The code follows pyCircuit JIT compilation patterns:
- Functions without @jit_inline execute at Python time (before JIT)
  - _make_pe_regs, _make_weight_regs, etc.
- Functions with @jit_inline compile to hardware
  - _build_pe, _build_fsm
- Dataclasses for return values avoid tuple unpacking (not supported in JIT)
  - FsmResult(load_weight, compute, done)
  - MmioWriteResult(start, reset_cube)

Testing
Known Limitations