PKU-SEC-Lab/LightMamba

LightMamba

0. Description

LightMamba is the official open-source implementation of the paper "LightMamba: Efficient Mamba Acceleration on FPGA with Quantization and Hardware Co-design". It is a Mamba model accelerator targeting FPGA hardware platforms.


The accelerator design process consists of three parts: the model process, the front-end process, and the back-end process. The model process is implemented in PyTorch, the front-end process uses Vitis HLS and SpinalHDL, and the back-end process uses Vivado and Pynq.

The front-end process, which is the accelerator design process itself, begins with a C++ framework design: the network operators are modeled in plain C++, without a machine learning framework. This model verifies the consistency of the neural network quantization and serves as the reference design for HLS. Next comes the high-level synthesis implementation. A hardware module is designed for each operator, and data is streamed between modules over the AxiStream interface. Pre-simulation (C++ functional simulation) and synthesis of each operator are completed within the HLS framework. The resulting Verilog files are then packaged and simulated in SpinalHDL.

After successful simulation, the back-end process begins: the Pynq framework is ported and the accelerator is deployed for testing.

1. Accelerator Features

The following results are for the VCK190 platform:

Precision   LUTs   DSPs   BRAMs   Frequency   Prefill
W4A4        107k   228    456     400 MHz     7.21 tok/s
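
W4A4 denotes 4-bit weights and 4-bit activations. As a rough illustration of what a signed 4-bit grid implies, the sketch below applies plain symmetric per-tensor quantization. This is an assumption made for illustration only: this README does not describe the quantization scheme used by LightMamba, and `quantize_int4` and `dequantize` are hypothetical helper names.

```python
def quantize_int4(values):
    """Symmetric per-tensor quantization to signed 4-bit codes in [-8, 7]."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to 7; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate real values from the 4-bit codes."""
    return [c * scale for c in codes]


weights = [0.7, -0.3, 0.1, 0.0]
codes, scale = quantize_int4(weights)
approx = dequantize(codes, scale)
print(codes, approx)
```

With only 16 representable levels, the rounding error per value is bounded by half the scale, which is why the choice of scale matters so much at this precision.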

2. Requirements

  • Vitis HLS 2023.2 or later
  • Python 3
  • IDEA (Community) + Scala (2.11.12) + SpinalHDL (1.7.1) + Verilator (5.x)

3. File Structure

This project includes several components:

  • HLS design files
  • Python scripts for running Vitis HLS
  • SpinalHDL code for accelerated simulation and packaged export
  • Jupyter notebook scripts for testing on the FPGA

LIGHT-MAMBA/
├── src/                    # HLS design files
├── case/                   # Contains HLS operator module components and component unit tests
│   ├── ref.zip             # The weights and nonlinear lookup tables (LUTs) for the Mamba network, along with the test stimuli
│   ├── ATTN.cpp.template
│   ├── MLP.cpp.template
│   ├── SOFTMAX_1X2.cpp
│   ├── GELU.cpp
│   └── ...
├── instances/              # Automatically generated folder, operator component IP generated by Vitis HLS
│   ├── proj_B_BUFFER
│   ├── proj_B_BUFFER
│   ├── proj_CONV
│   ├── ...
│   ├── proj_GEMM
│   ├── proj_GEMM_MUX
│   ├── ...
│   └── proj_SILU
├── mamba_cpp/              # The Mamba network modeled in C++
├── ips/                    # Automatically generated folder, operator component IP generated by Vitis HLS
├── SPINAl/                 # Code for fast simulation of the accelerator, packaged into the Vivado environment
│   └── ...
├── pynq/                   # Jupyter notebook script for testing the accelerator on the board
├── bin/                    # Weight and activation function lookup table values
├── constant.py             # Python file containing constant definitions
├── pre_syn_process.py      # Python script to create a Vitis HLS project
├── pst_syn_process.py      # Python script for collecting HLS synthesis data and supporting other processes
├── step0_~step5.py         # Python scripts for the complete flow
├── memba_bd.tcl            # TCL script for creating the VCK190 base Block Design
└── template.tcl            # Template file for generating HLS projects

ref.zip, which contains the weights and nonlinear lookup tables (LUTs) for the Mamba network along with the test stimuli, is available at: https://huggingface.co/PKU-SEC-Lab/LightMamba
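
Once the archive has been downloaded manually from the page above, it can be unpacked with the Python standard library. The following is a minimal sketch; `extract_reference_data` is a hypothetical helper, and both the `case/ref.zip` location (taken from the tree above) and the destination directory are assumptions to adapt to your checkout.

```python
import zipfile
from pathlib import Path


def extract_reference_data(archive, dest):
    """Unpack the reference archive into dest and return the member names."""
    archive, dest = Path(archive), Path(dest)
    if not archive.is_file():
        raise FileNotFoundError(f"{archive} not found; download it first")
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return zf.namelist()


# Example call (paths are assumptions; adjust to your checkout):
# extract_reference_data("case/ref.zip", "case/ref")
```

Returning the member names makes it easy to confirm that the expected weight and LUT files are present before pointing the scripts at them.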

4. Development Flow

4.1 HLS Simulation

Before simulating, update the weight and lookup table file directories to match your environment. Then, in the step1_hls_sim.py script, select the module to be simulated, such as the RESIDUAL module:

case_names = [
    ...
    ...
    # "EXP",
    # "GEMM_DEMUX",
    # "GEMM_MUX",
    # "GEMM",
    # "HT_ADD_QUANT",
    # "HT_ADD",
    # "HT_STATE",
    # "HTC_QUANT",
    # "HTC",
    # "M_AXI_FLOW",
    # "M_AXI_STATIC",
    # "QUANT_CONV",
    "RESIDUAL",
    ...
    ...
    ...
]

After selecting the operator module to be simulated, run in the terminal:

python step1_hls_sim.py

The output looks like the following (the simulation results are printed to the terminal, and you can also view the run log):

RESIDUAL is running
....
....
....
    
[X                   ][==================================================] 100%  Estimated total time: 0s, Estimated remaining time: 0s
[CONDENSED OD        ][==================================================] 100%  Estimated total time: 0s, Estimated remaining time: 0s
[Y                   ][==================================================] 100%  Estimated total time: 0s, Estimated remaining time: 0s
[RESIDUAL O          ][==================================================] 100%  Estimated total time: 0s, Estimated remaining time: 0s
[X                   ][==================================================] 100%  Estimated total time: 0s, Estimated remaining time: 0s
[CONDENSED OD        ][==================================================] 100%  Estimated total time: 0s, Estimated remaining time: 0s
[Y                   ][==================================================] 100%  Estimated total time: 0s, Estimated remaining time: 0s
[RESIDUAL O          ][==================================================] 100%  Estimated total time: 0s, Estimated remaining time: 0s
INFO [HLS SIM]: The maximum depth reached by any hls::stream() instance in the design is 320
INFO: [SIM 211-1] CSim done with 0 errors.
INFO: [SIM 211-3] *************** CSIM finish ***************
INFO: [HLS 200-2161] Finished Command csim_design Elapsed time: 00:00:03; Allocated memory: 0.000 MB.
INFO: [HLS 200-112] Total CPU user time: 6.45 seconds. Total CPU system time: 0.77 seconds. Total elapsed time: 7.36 seconds; peak allocated memory: 641.383 MB.
RESIDUAL is done, time: 9.075169324874878
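
When many modules are simulated in a row, it can help to check the log automatically instead of reading it by eye. The sketch below looks for the `CSim done with N errors` line shown in the output above. `csim_error_count` is a hypothetical helper, and only the line format visible in this sample log is assumed.

```python
import re

CSIM_DONE = re.compile(r"CSim done with (\d+) errors")


def csim_error_count(log_text):
    """Return the error count reported by CSim, or None if the line is absent."""
    match = CSIM_DONE.search(log_text)
    return int(match.group(1)) if match else None


sample_log = """\
INFO: [SIM 211-1] CSim done with 0 errors.
INFO: [SIM 211-3] *************** CSIM finish ***************
"""
errors = csim_error_count(sample_log)
print("CSim passed" if errors == 0 else f"CSim status: {errors}")
```

A return value of `None` means the summary line never appeared, which is worth treating as a failure in its own right.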

4.2 HLS Synthesis

When synthesizing an operator module, its synthesis properties can be configured (step2_hls_syn.py):

create_tcls(INSTANCE_DIR, case_names=case_names, do_csim=False, do_csynth=True, do_cosim=False, do_syn=True, do_impl=True, phys_opt="all", pipeline_style="frp")
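
To make the role of these flags more concrete, the sketch below shows one way such boolean options could map onto standard Vitis HLS Tcl commands (`csim_design`, `csynth_design`, `cosim_design`, `export_design -flow`, `config_compile -pipeline_style`). It is not the repository's `create_tcls` or `template.tcl`; `render_hls_steps` is a hypothetical function, and the interpretation of `do_syn` and `do_impl` as Vivado synthesis and implementation runs is an assumption.

```python
def render_hls_steps(do_csim, do_csynth, do_cosim, do_syn, do_impl,
                     pipeline_style="frp"):
    """Return Vitis HLS Tcl commands for the enabled stages (illustrative only)."""
    lines = [f"config_compile -pipeline_style {pipeline_style}"]
    if do_csim:
        lines.append("csim_design")
    if do_csynth:
        lines.append("csynth_design")
    if do_cosim:
        lines.append("cosim_design")
    # Assumption: do_impl requests the full Vivado implementation flow,
    # while do_syn alone stops after Vivado synthesis.
    if do_impl:
        lines.append("export_design -flow impl")
    elif do_syn:
        lines.append("export_design -flow syn")
    return lines


steps = render_hls_steps(do_csim=False, do_csynth=True, do_cosim=False,
                         do_syn=True, do_impl=True)
print("\n".join(steps))
```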

Select the module that needs to be synthesized, such as the RESIDUAL module (step2_hls_syn.py):

case_names = [
    ...
    ...
    ...
    # "M_AXI_STATIC",
    # "QUANT_CONV",
    "RESIDUAL",
    # "RMSNORM_1",
    ...
    ...
]

Run the synthesis script step2_hls_syn.py. After all modules are synthesized, run step5_print_resource.py to view resource usage:

python step2_hls_syn.py

python step5_print_resource.py

instance         SLICE     LUT       FF        DSP       BRAM      URAM      LATCH     SRL       CP        

proj_DTA         0         1643      1708      8         32        0         0         3         1.957     
proj_DBU         0         1341      937       8         3         0         0         4         2.236     
proj_DTB         0         1044      917       8         0         0         0         2         2.099 
...
proj_DTB_QUANT   0         3532      3427      8         0         0         0         416       2.099     
proj_C_BUFFER    0         388       518       0         0         0         0         0         1.469     
proj_DAH         0         1211      1019      8         0         0         0         74        2.027 
...
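
To get totals across modules, the report can be parsed as a whitespace-separated table. The sketch below is a minimal example using two rows copied from the output above; `parse_resource_table` is a hypothetical helper, and it assumes the column layout stays exactly as printed.

```python
def parse_resource_table(text):
    """Parse whitespace-separated rows into {instance: {column: value}}."""
    rows = [ln.split() for ln in text.splitlines()
            if ln.strip() and ln.strip() != "..."]
    header, body = rows[0], rows[1:]
    table = {}
    for fields in body:
        name, values = fields[0], fields[1:]
        table[name] = {col: float(v) for col, v in zip(header[1:], values)}
    return table


report = """\
instance         SLICE     LUT       FF        DSP       BRAM      URAM      LATCH     SRL       CP
proj_DTA         0         1643      1708      8         32        0         0         3         1.957
proj_DBU         0         1341      937       8         3         0         0         4         2.236
"""
table = parse_resource_table(report)
total_lut = sum(row["LUT"] for row in table.values())
total_dsp = sum(row["DSP"] for row in table.values())
print(f"LUT={total_lut:.0f} DSP={total_dsp:.0f}")
```

Elided rows (`...`) and blank lines are skipped, so the full terminal output can be pasted in unchanged.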

4.3 SpinalHDL Simulation and Packaging

We provide a SpinalHDL platform that uses Verilator to simulate the entire accelerator, which improves simulation speed. Run the step3_spinal_flow.py script to copy the generated Verilog code for each project into the SPINAL directory and run the simulation of all layers in parallel.

To use SpinalHDL, you must use JetBrains' IDEA environment and install the Scala plugin. Please follow the official SpinalHDL documentation to install a Verilator-compatible environment (https://spinalhdl.github.io/SpinalDoc-RTD/SpinalHDL/Getting%20Started/). Note: we use the OSS CAD Suite.

To enable SpinalSim, you must add the following line to the build.sbt file:

fork := true

Open the SPINAL directory in IDEA and load the build.sbt file to start the SpinalHDL simulation:

sbt run

Once sbt has started, it lists the available main classes and prompts you to select one:

[info] welcome to sbt 1.10.0 (Ubuntu Java 17.0.15)
[info] loading project definition from <***>/light-mamba-master/SPINAL/project
[info] loading settings for project root from build.sbt ...
[info] set current project to SPINAL (in build file:<***>light-mamba-master/SPINAL/)

Multiple main classes detected. Select one to run:
 [1] accelerator_save_conv_mem
 [2] accelerator_save_mem
 [3] generate_accelerator
 [4] generate_mamba
 [5] simulate_B_buffer
 [6] simulate_C_buffer
 [7] simulate_accelerator
 [8] simulate_conv
 [9] simulate_conv_state
 [10] simulate_dAh
 [11] simulate_dBu
 [12] simulate_dtA
 [13] simulate_dtB_quant
 [14] simulate_dtadapt
 [15] simulate_exp_quant
 [16] simulate_gemm
 [17] simulate_gemm_demux
 [18] simulate_gemm_mux
 [19] simulate_htC_quant
 [20] simulate_ht_add_quant
 [21] simulate_ht_state
 [22] simulate_m_axi
 [23] simulate_mamba
 [24] simulate_quant_conv
 [25] simulate_residual
 [26] simulate_rmsnorm_quant_1
 [27] simulate_rmsnorm_quant_2
 [28] simulate_silu_demux
 [29] simulate_silu_mux
 [30] simulate_silu_quant
 [31] simulate_simple_node
 [32] simulate_uD
 [33] simulate_yz
 [34] utils.genManager
 [35] utils.simulate_controller

Enter number: 25

  | => root / Compile / selectMainClass 13s

Taking the residual module as an example: after you select it and the simulation runs, a success message is displayed, and the corresponding folder and waveform file are generated:

  
  [info] running (fork) simulate_residual 
  [success] Total time: 54 s, completed Aug 28, 2025, 5:07:11 PM

Load wave.fst in the waveform viewer to view the waveform. Repeat the same steps for other modules.
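
To repeat the run for several modules without going through the interactive menu each time, sbt's `runMain` task can be given a main class name directly. The sketch below only builds and prints the commands; the module list is an example drawn from the menu above, and `sbt_run_main_command` is a hypothetical helper.

```python
import shlex


def sbt_run_main_command(main_class):
    """Build an sbt invocation that runs one main class without the menu."""
    return ["sbt", f"runMain {main_class}"]


modules = ["simulate_residual", "simulate_gemm"]
for name in modules:
    cmd = sbt_run_main_command(name)
    # Print the shell-quoted command. To launch it, pass cmd to
    # subprocess.run(cmd, cwd=...) from the SpinalHDL project directory.
    print(shlex.join(cmd))
```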


4.4 Vivado Flow: IP Packaging and Implementation

After executing step 4.3, two files are generated: ACCELERATOR.v, the top-level accelerator file, and ACCELERATOR_bb.v, the packaged HLS module. Next, run the to_vivado.py script in the SPINAl directory. This creates a "vivado" folder containing all the design files needed to package the IP, including Verilog files and initialization memory files.

Next, open Vivado, select Tools->Create and Package New IP, select "Create a new AXI4 peripheral," and add all the files in the "vivado" folder from the previous step to the source directory.

After packaging the IP, the next step is to create a block design in Vivado, build a complete SoC, and add the packaged IP. Use the ./mamba_bd.tcl script to create the following block design:

[Figure: block design created by the script]

After creating the Block Design, you need to assign addresses in the Address Editor. The address space is assigned as shown below:

[Figures: address assignments in the Vivado Address Editor]

Finally, generate the PDI file and use bootgen to generate a new BOOT.BIN file.

5. On-Board Testing

Our design does not use any vendor-specific IP, making it compatible with various FPGA platforms. We provide a Jupyter Notebook in the "notebooks" directory for testing the accelerator on the board. Please upload this notebook and the reference data files (refs) to the board and follow the instructions. Because the VCK190 platform does not support Pynq, this project implements a Pynq-like mechanism for controlling the hardware on that platform. Before running, please check the notebook contents to ensure the hardware addresses are correct (primarily the accelerator's hardware address).
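
For readers unfamiliar with how a Pynq-style layer reaches hardware, the sketch below shows the general idea of memory-mapped register access from Python using `mmap`. It is not the mechanism shipped in this repository: `RegisterWindow` is a hypothetical class, and the `/dev/mem` path, base address, and register offset in the commented example are placeholders that must be replaced with the values from your own address assignment.

```python
import mmap
import os
import struct


class RegisterWindow:
    """Read and write 32-bit little-endian registers through a memory-mapped file."""

    def __init__(self, path, base_addr, length):
        self._fd = os.open(path, os.O_RDWR | getattr(os, "O_SYNC", 0))
        # base_addr must be a multiple of mmap.ALLOCATIONGRANULARITY.
        self._mem = mmap.mmap(self._fd, length, offset=base_addr)

    def read32(self, offset):
        return struct.unpack_from("<I", self._mem, offset)[0]

    def write32(self, offset, value):
        struct.pack_into("<I", self._mem, offset, value & 0xFFFFFFFF)

    def close(self):
        self._mem.close()
        os.close(self._fd)


# On the board (hypothetical address and offset; check your Address Editor):
# regs = RegisterWindow("/dev/mem", 0xA4000000, 0x1000)
# regs.write32(0x00, 1)
# regs.close()
```

Keeping the base address as a constructor argument makes it straightforward to update when the address assignment changes.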


6. Citation

You are welcome to cite the paper "LightMamba: Efficient Mamba Acceleration on FPGA with Quantization and Hardware Co-design".

@inproceedings{wei2025lightmamba,
  title={Lightmamba: Efficient mamba acceleration on fpga with quantization and hardware co-design},
  author={Wei, Renjie and Xu, Songqiang and Zhong, Linfeng and Yang, Zebin and Guo, Qingyu and Wang, Yuan and Wang, Runsheng and Li, Meng},
  booktitle={2025 Design, Automation \& Test in Europe Conference (DATE)},
  pages={1--7},
  year={2025},
  organization={IEEE}
}


Code release for LightMamba accepted by DATE 2025
