This repository contains a high-performance, FPGA-accelerated Least Mean Squares (LMS) adaptive filter pipeline designed for real-time beamforming applications. The project implements both 1-Tap and 4-Tap complex-valued LMS designs deployed on a Xilinx Zynq-7020 SoC. It integrates bit-accurate Python simulations (software golden model), cycle-accurate Verilog/RTL hardware implementations, Vivado IP block designs, automated DMA streaming interfaces, and final physical FPGA artifacts (bitstreams .bit and hardware handoffs .hwh).
Adaptive beamforming is a signal processing technique used in sensor arrays (such as antenna arrays) to dynamically adjust the spatial reception or transmission pattern. By continuously updating complex weight factors applied to the signals from each sensor element, the beamformer steers the main lobe toward the signal of interest (SOI) while forming nulls toward interferers and sources of noise.
The Least Mean Squares (LMS) algorithm is an iterative stochastic gradient descent method widely used in adaptive filtering for its computational simplicity and robustness. It updates the filter weights according to:

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \mu \, e(n) \, \mathbf{x}^*(n)$$

where:
- $\mathbf{w}(n)$ is the vector of complex weights at time index $n$.
- $\mu$ is the adaptation step-size parameter.
- $e(n) = d(n) - y(n)$ is the instantaneous complex error between the desired training signal $d(n)$ and the filter output $y(n) = \mathbf{w}^T(n)\mathbf{x}(n)$.
- $\mathbf{x}(n)$ is the input signal vector, and $\mathbf{x}^*(n)$ is its complex conjugate.
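The update rule can be illustrated with a minimal floating-point sketch in NumPy. It is not the repository's reference model: the function name `complex_lms`, the step size, the test channel `h`, and the random input are all assumptions chosen for illustration. The filter output is formed as a plain (unconjugated) sum of weight-sample products, which is the convention that pairs with an $e(n)\,\mathbf{x}^*(n)$ update.

```python
import numpy as np

def complex_lms(x, d, num_taps=4, mu=0.05):
    """Floating-point complex LMS (illustrative sketch, not the project's model)."""
    x = np.asarray(x, dtype=complex)
    d = np.asarray(d, dtype=complex)
    w = np.zeros(num_taps, dtype=complex)      # adaptive weights w(n)
    taps = np.zeros(num_taps, dtype=complex)   # tapped delay line x(n), x(n-1), ...
    y = np.zeros(len(x), dtype=complex)
    e = np.zeros(len(x), dtype=complex)
    for n in range(len(x)):
        taps = np.roll(taps, 1)                # shift the delay line by one sample
        taps[0] = x[n]
        y[n] = np.dot(w, taps)                 # y(n) = sum_k w_k * x(n-k), no conjugation
        e[n] = d[n] - y[n]                     # e(n) = d(n) - y(n)
        w = w + mu * e[n] * np.conj(taps)      # w(n+1) = w(n) + mu * e(n) * x*(n)
    return w, y, e

# Hypothetical test: identify a known 4-tap channel from noise-free data.
rng = np.random.default_rng(0)
x = (rng.standard_normal(2000) + 1j * rng.standard_normal(2000)) / np.sqrt(2)
h = np.array([0.8 + 0.2j, -0.3 + 0.1j, 0.1 - 0.05j, 0.05j])
d = np.convolve(x, h)[: len(x)]                # desired signal d(n) = sum_k h_k * x(n-k)
w, y, e = complex_lms(x, d)
print(np.round(w, 3))
```

With a unit-power input and this small step size, the weights should settle close to `h`. A larger `mu` converges faster but can become unstable.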
Accelerating this computation on an FPGA fabric permits low-latency, deterministic execution of weight updates and FIR filtering, which is critical for real-time, high-bandwidth communication and radar systems.
- Hardware-Software Co-Design (PS/PL Integration): Uses Xilinx DMA to stream complex IQ data blocks between the ARM Cortex-A9 Processing System (PS) and the FPGA Programmable Logic (PL) fabric.
- Dual Architecture Implementation:
  - 1-Tap LMS: Lightweight design tailored for single-path phase/amplitude alignment.
  - 4-Tap LMS: Pipelined complex-valued design to mitigate Inter-Symbol Interference (ISI) in multipath channels.
- Strict Bit-Accuracy Verification: Validated by comparing hardware-in-the-loop (HIL) executions against equivalent fixed-point Python software simulations.
- PYNQ-Z2 Framework Compatibility: Real-time data streaming and control via Python using Jupyter notebooks and PYNQ DMA APIs.
- Robust Timing Closure: Deeply pipelined RTL arithmetic elements designed to meet clock constraints at high execution frequencies.
FPGA-Accelerated-LMS-Beamforming/
├── 1 tap/
│ ├── hardware/ # 1-Tap hardware simulation scripts and hardware photos
│ ├── software/ # 1-Tap software model and offline simulation results
│ └── verilog/ # 1-Tap LMS RTL implementation (lms_singletap_iq.v)
├── 4 tap/
│ ├── hardware/ # 4-Tap hardware verification scripts and board run photos
│ ├── software/ # 4-Tap software reference model and convergence verification
│ └── verilog/ # 4-Tap LMS RTL design (4_tap.v)
├── comparing/
│ └── comparing_single_tap_to_4_tap.md # Detailed analysis of software vs. hardware execution performance
├── docs/
│ └── images/ # Vivado block designs and report captures
│ ├── one_tap/
│ └── four_tap/
├── final .bit and .hwh/ # Bitstream and hardware handoff files for deployment
│ ├── 4 tap/
│ └── single tap/
├── USRP/
│ └── 1TAP/ # USRP SDR interfacing directory
└── README.md # Project documentation
The project utilizes a co-design methodology where signal generation, channel modeling, and performance evaluation are handled in software, while the intensive filtering and weight updates are accelerated on the FPGA.
graph TD
A[Python: Generate Input Signal x & Desired Signal d] --> B[Pack Data: 64-bit IQ Words]
B --> C[Transmit via DMA over AXI-Stream]
C --> D[FPGA: Receive Inputs at S1]
D --> E[FPGA: S2 Compute FIR y = w * x]
E --> F[FPGA: S4 Compute Error e = d - y]
F --> G[FPGA: S6-S8 Compute Gradient & Update weights w]
E --> H[FPGA: S4 Pack y & e into 64-bit Word]
H --> I[Transmit Output via DMA over AXI-Stream]
I --> J[Unpack Data in Python]
J --> K[Performance Analysis: SNR, RMS Error, Convergence Plots]
- Host Processing (PS): The Python interface generates complex source waveforms, applies multipath channel distortion and noise, and formats the data (`x` and `d`) into 64-bit streams (16-bit real/imaginary parts for each signal).
- DMA Streaming: The PYNQ DMA controller transfers the packed array of input samples to the FPGA fabric via high-speed AXI-Stream interfaces.
- Hardware Pipeline (PL):
  - Stage 1 (S1): Registers incoming data and updates the internal tapped delay-line.
  - Stage 2 (S2): Multiplies weights and delay-line samples (FIR output calculation).
  - Stage 3-4 (S3-S4): Sums the products and computes the complex error $e(n) = d(n) - y(n)$.
  - Stage 5-8 (S5-S8): Scales the error by step-size $\mu$, computes the complex-conjugate gradient ($\mu \cdot e(n) \cdot x^*(n-k)$), adds the gradient to the current weights with overflow saturation, and writes the updated weights back to the registers.
- DMA Readback: Simultaneously, the calculated filter output $y(n)$ and error $e(n)$ are streamed back to the DDR memory via DMA for evaluation in Python.
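The host-side packing step can be sketched as follows. The source only states that each 64-bit word carries 16-bit real and imaginary parts for each signal; the field order used here (x real in the lowest 16 bits, then x imaginary, d real, d imaginary), the Q1.15 scaling, and the helper names `to_q15`, `pack_iq`, and `unpack_iq` are assumptions for illustration and may not match the RTL.

```python
import numpy as np

def to_q15(values):
    """Quantize floats in [-1, 1) to signed 16-bit Q1.15 with saturation."""
    scaled = np.round(np.asarray(values, dtype=float) * 32768.0)
    return np.clip(scaled, -32768, 32767).astype(np.int16)

def pack_iq(x, d):
    """Pack complex x and d into 64-bit words (assumed field order, see lead-in)."""
    x = np.asarray(x, dtype=complex)
    d = np.asarray(d, dtype=complex)
    fields = [to_q15(x.real), to_q15(x.imag), to_q15(d.real), to_q15(d.imag)]
    words = np.zeros(len(x), dtype=np.uint64)
    for slot, field in enumerate(fields):
        # Reinterpret the int16 bits as unsigned so the sign bit is not extended.
        words |= field.view(np.uint16).astype(np.uint64) << np.uint64(16 * slot)
    return words

def unpack_iq(words):
    """Inverse of pack_iq: returns two complex arrays scaled back to floats."""
    words = np.asarray(words, dtype=np.uint64)
    parts = []
    for slot in range(4):
        raw = (words >> np.uint64(16 * slot)) & np.uint64(0xFFFF)
        parts.append(raw.astype(np.uint16).view(np.int16) / 32768.0)
    return parts[0] + 1j * parts[1], parts[2] + 1j * parts[3]

x = np.array([0.5 - 0.25j, -1.0 + 0.75j])
d = np.array([0.125 + 0.0j, -0.5 - 0.5j])
words = pack_iq(x, d)
x_back, d_back = unpack_iq(words)
print([hex(int(word)) for word in words])
```

The same unpacking pattern would apply to the returned words that carry $y(n)$ and $e(n)$, provided the output uses an equivalent four-field layout.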
All hardware designs target the Xilinx Zynq-7020 SoC (specifically device xc7z020clg484-1) on standard development boards such as the PYNQ-Z2.
1-Tap LMS design (Vivado implementation results):
- Timing Closure: Met
- Worst Negative Slack (WNS): 0.388 ns
- Total Negative Slack (TNS): 0.000 ns
- Failing Setup Endpoints: 0
- Worst Hold Slack (WHS): 0.045 ns
- Total Hold Slack (THS): 0.000 ns
- LUT Utilization: 3076 / 53200
- Register Utilization: 4149 / 106400
- BRAM Tiles: 3 / 140
- DSP Blocks: 10 / 220
- Bonded I/O Pads: 130 / 130
- Total On-Chip Power: 1.74 W
- Junction Temperature: 45.1°C
4-Tap LMS design (Vivado implementation results):
- Timing Closure: TBD
- Worst Negative Slack (WNS): TBD
- Total Negative Slack (TNS): TBD
- Failing Setup Endpoints: TBD
- Worst Hold Slack (WHS): TBD
- Total Hold Slack (THS): TBD
- LUT Utilization: TBD
- Register Utilization: TBD
- BRAM Tiles: TBD
- DSP Blocks: TBD
- Bonded I/O Pads: TBD
- Total On-Chip Power: TBD
- Junction Temperature: TBD
Note: The Vivado implementation parameters and reports for the 4-Tap design are currently TBD. This section should be updated after the 4-tap Vivado reports are captured.
The 1-tap LMS core is optimized for minimal resource utilization and functions as a simple phase and amplitude corrector.
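A single-tap corrector of this kind can be modeled in integer arithmetic. The sketch below is a hypothetical fixed-point model, not a transcription of `lms_singletap_iq.v`: the Q1.15 format, the truncating right shifts, the power-of-two step size (`mu_shift`), and the test signal are all assumptions.

```python
import math

def sat16(v):
    """Saturate an integer to the signed 16-bit range."""
    return max(-32768, min(32767, v))

def cmul_q15(a, b):
    """Complex multiply of two Q1.15 (re, im) pairs, truncated back to Q1.15."""
    re = (a[0] * b[0] - a[1] * b[1]) >> 15
    im = (a[0] * b[1] + a[1] * b[0]) >> 15
    return sat16(re), sat16(im)

def lms_1tap_q15(x, d, mu_shift=4):
    """1-tap complex LMS on Q1.15 integers; step size is 2**-mu_shift (assumed)."""
    w = (0, 0)
    for xn, dn in zip(x, d):
        y = cmul_q15(w, xn)                                  # y = w * x
        e = (sat16(dn[0] - y[0]), sat16(dn[1] - y[1]))       # e = d - y
        g = cmul_q15(e, (xn[0], -xn[1]))                     # e * conj(x)
        w = (sat16(w[0] + (g[0] >> mu_shift)),               # saturating weight update
             sat16(w[1] + (g[1] >> mu_shift)))
    return w

def q15(v):
    return sat16(round(v * 32768))

# Hypothetical test: recover a fixed gain of 0.5 at 45 degrees applied to a rotating tone.
gain = 0.5 * complex(math.cos(math.pi / 4), math.sin(math.pi / 4))
xs, ds = [], []
for n in range(2000):
    xc = 0.5 * complex(math.cos(2 * math.pi * n / 8), math.sin(2 * math.pi * n / 8))
    dc = gain * xc
    xs.append((q15(xc.real), q15(xc.imag)))
    ds.append((q15(dc.real), q15(dc.imag)))

w = lms_1tap_q15(xs, ds)
w_float = complex(w[0], w[1]) / 32768
print(w_float)
```

Because the gradient is truncated before it is added, the weight stops moving once the scaled gradient falls below one LSB, so the result lands near the target gain rather than exactly on it.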
The 4-tap LMS design incorporates a deeply pipelined 8-stage architecture. This design is engineered to model multipath propagation delays and mitigate Inter-Symbol Interference (ISI). It includes a critical alignment fix in Stage 2 to synchronize the weight update path with the correct delay-line snapshot, ensuring robust convergence in physical hardware.
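The alignment idea can be sketched in software as a delayed update: because the error for sample $n$ only becomes available several cycles later, the gradient must be formed with the delay-line snapshot that produced that error, not with the current one. The model below is an assumption-laden illustration, not the RTL: the `latency` value, the step size, and the function name `pipelined_lms` are hypothetical.

```python
from collections import deque
import numpy as np

def pipelined_lms(x, d, num_taps=4, mu=0.02, latency=6):
    """LMS whose weight update lags by `latency` samples (hypothetical pipeline model)."""
    w = np.zeros(num_taps, dtype=complex)
    taps = np.zeros(num_taps, dtype=complex)
    pending = deque()                          # (error, delay-line snapshot) pairs in flight
    e = np.zeros(len(x), dtype=complex)
    for n in range(len(x)):
        taps = np.roll(taps, 1)
        taps[0] = x[n]
        e[n] = d[n] - np.dot(w, taps)          # error computed with the current weights
        pending.append((e[n], taps.copy()))    # keep the snapshot that produced e(n)
        if len(pending) > latency:
            e_old, taps_old = pending.popleft()
            # The update pairs each error with its own snapshot, never a newer one.
            w = w + mu * e_old * np.conj(taps_old)
    return w, e

rng = np.random.default_rng(1)
x = (rng.standard_normal(4000) + 1j * rng.standard_normal(4000)) / np.sqrt(2)
h = np.array([0.7 - 0.1j, 0.25 + 0.2j, -0.1 + 0.05j, 0.05 - 0.02j])
d = np.convolve(x, h)[: len(x)]
w, e = pipelined_lms(x, d)
print(np.round(w, 3))
```

Storing the snapshot alongside the error is the software analogue of carrying the delay-line contents down the pipeline with the data they belong to.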
The table below compares the hardware resources and software vs. hardware execution performance. (Test case: 307,200 samples grouped in chunks of 1,024 samples).
| Metric | 1-Tap LMS Design | 4-Tap LMS Design |
|---|---|---|
| Logic Device | xc7z020clg484-1 | xc7z020clg484-1 |
| LUT Utilization | 3,076 / 53,200 (5.78%) | TBD |
| Register Utilization | 4,149 / 106,400 (3.90%) | TBD |
| BRAM Tiles | 3 / 140 (2.14%) | TBD |
| DSP Blocks | 10 / 220 (4.55%) | TBD |
| Bonded I/O Pads | 130 / 130 (100.0%) | TBD |
| Total On-Chip Power | 1.74 W | TBD |
| Software Execution Time (ARM) | 35.16 s | 57.43 s |
| Hardware Execution Time (FPGA) | 3.76 s | 4.89 s |
| System Throughput (Software) | 0.0087 MSps | 0.0053 MSps |
| System Throughput (Hardware) | 0.0816 MSps | 0.0628 MSps |
| Avg. Processing Time / Chunk | 12.20 ms | 1.63 ms |
| Hardware Acceleration Speedup | 9.35x | 11.74x |
| Average Error Reduction (%) | 48.9% | 84.3% |
| Final Signal-to-Noise Ratio (SNR) | 6.05 dB | 17.21 dB |
Note: The hardware utilization, power, and timing parameters for the 4-tap design will be updated once the Vivado implementation run is completed and reported.
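The throughput and speedup rows follow directly from the measured execution times and the sample count of the test case. A short sketch of that arithmetic, using only figures from the table (the dictionary layout and the function name `derive` are illustrative):

```python
SAMPLES = 307_200      # total samples in the test case
CHUNK = 1_024          # samples per DMA chunk

# Execution times in seconds, copied from the table above.
runs = {
    "1-tap": {"sw_s": 35.16, "hw_s": 3.76},
    "4-tap": {"sw_s": 57.43, "hw_s": 4.89},
}

def derive(sw_s, hw_s, n=SAMPLES):
    """Throughput in MSps and hardware speedup from raw execution times."""
    return {
        "sw_msps": n / sw_s / 1e6,
        "hw_msps": n / hw_s / 1e6,
        "speedup": sw_s / hw_s,
    }

results = {name: derive(**times) for name, times in runs.items()}
num_chunks = SAMPLES // CHUNK

for name, r in results.items():
    print(f"{name}: sw={r['sw_msps']:.4f} MSps, hw={r['hw_msps']:.4f} MSps, "
          f"speedup={r['speedup']:.2f}x")
print("chunks:", num_chunks)
```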
To verify the LMS convergence characteristics offline without hardware:
- Navigate to the desired design folder (e.g., `4 tap/software`).
- Run the software simulation script: `python 4tap_software.py`

This script executes the floating-point and fixed-point LMS reference models, evaluates convergence metrics, and saves the output plots.
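Convergence metrics of the kind reported in the comparison table could be computed along the following lines. The exact definitions used by the project's scripts are not given here, so both formulas are assumptions: error reduction is taken as the relative drop in mean error power between the first and last `window` samples, and final SNR as the ratio of desired-signal power to error power over the last `window` samples.

```python
import numpy as np

def error_reduction_pct(e, window=256):
    """Percent drop in mean error power from the first to the last window (assumed definition)."""
    power = np.abs(np.asarray(e)) ** 2
    start = power[:window].mean()
    end = power[-window:].mean()
    return 100.0 * (1.0 - end / start)

def final_snr_db(d, e, window=256):
    """Desired-signal power over error power in the last window, in dB (assumed definition)."""
    d_pow = np.mean(np.abs(np.asarray(d)[-window:]) ** 2)
    e_pow = np.mean(np.abs(np.asarray(e)[-window:]) ** 2)
    return 10.0 * np.log10(d_pow / e_pow)

# Synthetic check: unit-amplitude desired signal, error amplitude falling from 1.0 to 0.1.
d = np.ones(1000, dtype=complex)
e = np.concatenate([np.ones(500), 0.1 * np.ones(500)]).astype(complex)
reduction = error_reduction_pct(e)
snr = final_snr_db(d, e)
print(f"error reduction: {reduction:.1f}%, final SNR: {snr:.1f} dB")
```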
To verify the bit-accurate behavior of the HDL implementation:
- Compile the Verilog code located in the `verilog/` directory using an HDL simulator (e.g., Vivado Simulator, ModelSim, or Icarus Verilog).
- Check the output waveform alignment, specifically verifying that the pipeline stages keep weights and data samples properly synchronized.
To run the design on a PYNQ board:
- Copy the corresponding `.bit` and `.hwh` files from the `final .bit and .hwh/` directory to the target PYNQ board filesystem.
- Transfer the hardware driver Python script (e.g., `4 tap/hardware/4tap_hardware.py`) or the verification Jupyter notebook to the board.
- Execute the Python script or notebook on the board. The PYNQ overlay class will automatically load the bitstream, configure the DMA channels, transmit sample buffers, and read back the filtered outputs for verification.
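The board-side flow can be sketched with the PYNQ `Overlay`, `allocate`, and DMA channel APIs. This is not the repository's driver script: the bitstream name `lms_4tap.bit`, the DMA instance name `axi_dma_0`, the 64-bit output word type, and the helper names are assumptions, and the streaming function only runs on a PYNQ board with the matching `.hwh` file placed next to the bitstream.

```python
import numpy as np

def iter_chunks(words, chunk_len=1024):
    """Yield consecutive fixed-size chunks; the length must be a multiple of chunk_len."""
    if len(words) % chunk_len:
        raise ValueError("sample count must be a multiple of the chunk length")
    for start in range(0, len(words), chunk_len):
        yield words[start:start + chunk_len]

def stream_through_fpga(words, bitfile="lms_4tap.bit", dma_name="axi_dma_0",
                        chunk_len=1024):
    """Send packed 64-bit words through the PL and return the packed results."""
    from pynq import Overlay, allocate       # imported here: only available on the board

    overlay = Overlay(bitfile)               # loads the bitstream and parses the .hwh
    dma = getattr(overlay, dma_name)         # hypothetical DMA instance name
    in_buf = allocate(shape=(chunk_len,), dtype=np.uint64)
    out_buf = allocate(shape=(chunk_len,), dtype=np.uint64)
    results = []
    try:
        for chunk in iter_chunks(words, chunk_len):
            in_buf[:] = chunk
            dma.sendchannel.transfer(in_buf)     # PS -> PL over AXI-Stream
            dma.recvchannel.transfer(out_buf)    # PL -> PS
            dma.sendchannel.wait()
            dma.recvchannel.wait()
            results.append(np.array(out_buf))    # copy out before the buffer is reused
    finally:
        in_buf.freebuffer()
        out_buf.freebuffer()
    return np.concatenate(results)

# Off-board check of the chunking helper only.
demo = np.arange(2048, dtype=np.uint64)
chunks = list(iter_chunks(demo))
print(len(chunks), chunks[0].shape)
```

On the board, a call such as `stream_through_fpga(packed_words)` would return the packed output words for unpacking and analysis in Python.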
- Xilinx Vivado (v2020.2 or later): Used for RTL synthesis, implementation, timing closure analysis, power analysis, and bitstream generation.
- PYNQ Framework: Python-based library for interfacing Jupyter Notebook / Python scripts running on the Zynq ARM PS with DMA engines on the PL.
- Python (v3.8+): Used for input signal generation, fixed-point software modeling (NumPy, SciPy), and validation plotting (Matplotlib).
- Git: Version control.
This project is provided for educational and engineering portfolio review purposes. The hardware and software designs are licensed under the MIT License - see the LICENSE file (if available) or individual headers for details. Please attribute the author if using this architecture as a reference for academic or commercial designs.







