Real-Time FPGA Image Processing Accelerator using HLS

A hardware-accelerated Canny-style edge detection pipeline running on the DE1-SoC (Intel Cyclone V), built with Intel's High-Level Synthesis (HLS) compiler. The system processes a live 640x480 video stream at 60 Hz and outputs the edge-detected result over VGA in real time.

This project was completed as an MSc dissertation at the University of Liverpool (Telecommunications and Wireless Systems, 2024-2025).

Key Results

Metric	Value
Initiation Interval (II)	1 (one pixel per clock cycle)
Pixel clock	100 MHz
Sustained throughput	100 M pixels/s (~325 fps at 640x480)
Worst-case Fmax (Slow 1100mV 85C, slowest domain: PLL general[2])	138.87 MHz
ALM usage (full system)	7,398 / 32,070 (23%)
Block RAM bits	369,278 / 4,065,280 (9%)
DSP blocks	21 / 87 (24%)
Total registers	16,845
Total pins	204 / 457 (45%)
PLLs	1 / 6 (17%)
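
The throughput row follows directly from the pixel clock, the initiation interval, and the frame size. A minimal sketch of that arithmetic is shown below; the constant and function names are illustrative and do not come from the project source, and the figure ignores blanking intervals and any stream back-pressure.

```cpp
#include <cstdint>

// Figures taken from the Key Results table.
constexpr std::uint64_t kPixelClockHz = 100000000;  // 100 MHz pixel clock
constexpr std::uint64_t kInitiationInterval = 1;    // one pixel per clock cycle
constexpr std::uint64_t kFrameWidth = 640;
constexpr std::uint64_t kFrameHeight = 480;

// Pixels accepted per second when the loop starts a new iteration every II cycles.
constexpr std::uint64_t pixelsPerSecond() {
    return kPixelClockHz / kInitiationInterval;
}

// Whole frames per second at that pixel rate (integer division truncates).
constexpr std::uint64_t framesPerSecond() {
    return pixelsPerSecond() / (kFrameWidth * kFrameHeight);
}
```

At 307,200 pixels per frame this gives 325 whole frames per second, which is the headroom over the 60 Hz source rather than the rate the display actually runs at.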

HLS Component Estimated Resources (i++ Report)

Resource	Usage	Available	%
ALUTs	8,005	109,572	7%
Flip-Flops	13,235	219,144	6%
RAMs	30	514	6%
DSPs	3.5	112	4%

System Architecture

                        DE1-SoC VIP Pipeline (Platform Designer / Qsys)

  ┌──────────────┐    ┌──────────┐    ┌────────────┐    ┌───────────┐    ┌──────────┐
  │  Clocked     │    │  Colour  │    │Deinterlacer│    │  Chroma   │    │ Clipper  │
  │  Video Input ├───►│  Plane   ├───►│   II (4K   ├───►│ Resampler ├───►│ II (4K   │
  │  II (4K)     │    │ Seq. II  │    │HDR Passthru│    │  II (4K)  │    │  Ready)  │
  └──────────────┘    └──────────┘    └────────────┘    └───────────┘    └────┬─────┘
                                                                              │
                                                                              ▼
  ┌──────────────┐    ┌──────────┐    ┌────────────┐    ┌───────────┐    ┌──────────┐
  │  Clocked     │    │  Frame   │    │  RAW-to-   │    │  filters  │    │ VIP-to-  │
  │  Video Out   │◄───┤  Buffer  │◄───┤    VIP     │◄───┤  (HLS)   │◄───┤   RAW    │
  │  (VGA)       │    │  II (4K) │    │   Bridge   │    │          │    │  Bridge  │
  └──────────────┘    └──────────┘    └────────────┘    └───────────┘    └──────────┘
        │                  │
        ▼                  ▼
   VGA Monitor         SDRAM Controller

The HLS filters block sits between two VIP/RAW bridges. Video data enters as packetised Avalon-ST (8 bits/symbol, 3 symbols/beat for RGB 8:8:8, with SOP/EOP framing). The pipeline processes one pixel per clock cycle, maintaining II = 1 throughout.

Processing Pipeline Stages

Input (RGB 8:8:8)
      │
      ▼
  Greyscale Conversion ──► (R + G + B) / 3
      │
      ▼
  5×5 Gaussian Blur ──► Weighted kernel (sum = 159), buffered across 5 line stores
      │
      ▼
  3×3 Sobel Gradient ──► Gx and Gy computed over 3 buffered Gaussian output lines
      │                    Gradient magnitude = |Gx| + |Gy|
      │                    Direction quantised to 4 bins (0°, 45°, 90°, 135°)
      ▼
  Non-Maximum Suppression ──► 3×3 neighbourhood comparison along gradient direction
      │                        Thin edges to single-pixel width
      ▼
  Double Threshold ──► Low = 50, High = 130
      │                 Strong edges (> 130) → 255
      │                 Weak edges (50-130) → kept only if NMS marked the pixel as a local maximum
      │                 Below low → 0
      ▼
  Output (greyscale edge map, replicated to RGB for VGA)
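
The Sobel, non-maximum suppression, and double-threshold stages above can be sketched as a plain C++ software model. This is not the code in filters.cpp: the function names, the integer approximations of tan(22.5°) and tan(67.5°), the choice of which diagonal belongs to the 45° bin, and the decision to output 255 for a kept weak edge are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <cstdlib>

// Direction bins named in the stage diagram.
enum Direction : std::uint8_t { DIR_0 = 0, DIR_45 = 1, DIR_90 = 2, DIR_135 = 3 };

struct Gradient {
    int magnitude;        // |Gx| + |Gy|
    Direction direction;  // quantised gradient direction
};

// 3x3 Sobel on window w[row][col]; row 0 is the top line of the window.
inline Gradient sobel3x3(const std::uint8_t w[3][3]) {
    const int gx = (w[0][2] + 2 * w[1][2] + w[2][2]) - (w[0][0] + 2 * w[1][0] + w[2][0]);
    const int gy = (w[2][0] + 2 * w[2][1] + w[2][2]) - (w[0][0] + 2 * w[0][1] + w[0][2]);
    const int ax = std::abs(gx);
    const int ay = std::abs(gy);

    // Assumed thresholds: tan(22.5 deg) ~ 53/128 and tan(67.5 deg) ~ 309/128,
    // cross-multiplied so that no division or atan call is needed.
    Direction dir;
    if (128 * ay <= 53 * ax)        dir = DIR_0;
    else if (128 * ay >= 309 * ax)  dir = DIR_90;
    else if ((gx > 0) == (gy > 0))  dir = DIR_45;
    else                            dir = DIR_135;
    return {ax + ay, dir};
}

// Non-maximum suppression on a 3x3 window of magnitudes m[row][col].
// Assumed convention: rows grow downwards, so the 45-degree bin compares
// the top-left and bottom-right neighbours.
inline bool isLocalMax(const int m[3][3], Direction dir) {
    int a, b;
    switch (dir) {
        case DIR_0:  a = m[1][0]; b = m[1][2]; break;  // left / right
        case DIR_90: a = m[0][1]; b = m[2][1]; break;  // above / below
        case DIR_45: a = m[0][0]; b = m[2][2]; break;  // one diagonal
        default:     a = m[0][2]; b = m[2][0]; break;  // other diagonal (DIR_135)
    }
    return m[1][1] >= a && m[1][1] >= b;
}

// Double threshold with the documented limits (Low = 50, High = 130).
// Assumption: a weak edge that survives is written out as 255.
inline std::uint8_t doubleThreshold(int magnitude, bool localMax) {
    if (magnitude > 130) return 255;                 // strong edge
    if (magnitude >= 50) return localMax ? 255 : 0;  // weak edge
    return 0;                                        // below the low threshold
}
```

Cross-multiplying the thresholds keeps the direction test to comparisons and constant multiplies, which is the kind of logic that fits comfortably in a single pipeline stage.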

All line buffers are shift-register based. Each new pixel shifts into position 0, and the oldest pixel falls off the far end. This avoids any address-based memory access and lets the HLS compiler map the buffers to on-chip block RAM.
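
A minimal software model of one such line buffer is sketched below. The `LineBuffer` and `TwoLines` names are hypothetical, and chaining the buffers so that the pixel leaving one line enters the next is an assumption about how the rows are kept aligned, not a description of filters.cpp.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical model of one shift-register line buffer holding WIDTH pixels.
template <std::size_t WIDTH>
struct LineBuffer {
    std::uint8_t data[WIDTH] = {};

    // Move every entry one place towards the far end, insert the new pixel
    // at position 0, and return the pixel that falls off.
    std::uint8_t shift(std::uint8_t in) {
        const std::uint8_t oldest = data[WIDTH - 1];
        // In an HLS kernel this loop would be preceded by "#pragma unroll".
        for (std::size_t i = WIDTH - 1; i > 0; --i) {
            data[i] = data[i - 1];
        }
        data[0] = in;
        return oldest;
    }
};

// Two chained buffers: the pixel leaving row0 enters row1, so row0.data[c]
// and row1.data[c] are exactly WIDTH pixels (one image row) apart.
template <std::size_t WIDTH>
struct TwoLines {
    LineBuffer<WIDTH> row0;
    LineBuffer<WIDTH> row1;

    void push(std::uint8_t px) { row1.shift(row0.shift(px)); }
};
```

With WIDTH set to the 640-pixel line length, reading the same column index from each chained buffer yields a vertical column of the filter window without computing any addresses.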

Clock Domains

Clock	Frequency	Source	Role
CLOCK_50	50 MHz	Board oscillator	System reference
PLL general[0]	100 MHz	Derived	SDRAM clock (phase-shifted for the external SDRAM chip)
PLL general[1]	100 MHz	Derived	Pixel processing clock (VIP chain + HLS filters)
PLL general[2]	25 MHz	Derived	VGA pixel clock (640x480 @ 60 Hz)
PLL general[3]	~18.33 MHz	Derived	Audio clock
TD_CLK27	27 MHz	Board oscillator	TV decoder

The HLS block and VIP pipeline both run on the 100 MHz general[1] domain. VGA output uses the 25 MHz general[2] domain. A single fractional PLL (1100.11 MHz VCO) generates all derived clocks.

Fmax Summary (Slow 1100mV 85C, Worst-Case Corner)

Clock Domain	Achieved Fmax	Required	Margin
PLL general[2] (VGA path)	138.87 MHz	25 MHz	+113.87 MHz
PLL general[1] (pixel processing)	141.18 MHz	100 MHz	+41.18 MHz
CLOCK_50	193.99 MHz	50 MHz	+143.99 MHz
TD_CLK27	201.98 MHz	27 MHz	+174.98 MHz
PLL general[3] (audio)	499.50 MHz	18.33 MHz	+481.17 MHz

All clocks meet timing with positive slack. TNS = 0.000 across all domains and all timing corners (Slow 85C, Slow 0C, Fast 85C, Fast 0C).

HLS Loop Analysis

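The main processing loop (filters.B2 at filters.cpp:59) is pipelined, and the i++ loop analysis report lists its II as ~1. The tilde is the report's notation for a loop that is scheduled at II = 1 but contains stallable operations (here, the Avalon-ST stream reads and writes), so the one-pixel-per-cycle rate holds whenever the streams do not stall. All inner shift-register loops are fully unrolled via #pragma unroll, which removes the loop control logic and lets every element of a buffer shift in parallel within a single clock cycle.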

Repository Structure

FPGA-Image-Processing-HLS/
├── README.md                   ← You are here
├── .gitignore
├── LICENSE
│
├── hls/
│   └── filters.cpp             ← HLS C++ kernel source (Intel i++ compiler)
│
├── quartus/
│   └── README.md               ← Notes on the Quartus project structure
│
└── docs/
    ├── build-guide.md           ← Step-by-step build and programming instructions
    ├── screenshots/             ← Place Platform Designer, Fitter, STA screenshots here
    └── reports/                 ← Place .rpt and .summary files here

Note on Quartus project files: The full Quartus project (DE1_SoC_VIP_TV_640x480_W/) and generated IP files are not included in this repository due to their size and Intel's VIP IP licensing. The hls/ directory contains the original HLS source code, and the docs/build-guide.md explains how to recreate the full project from Intel's DE1-SoC VIP demonstration design.

Hardware and Software Requirements

Hardware

FPGA Board: Terasic DE1-SoC (Intel Cyclone V 5CSEMA5F31C6N)
Video Input: TV decoder on-board, or any Avalon-ST video source
Video Output: VGA monitor connected to the DE1-SoC VGA port
Programming: USB-Blaster II (on-board, via USB cable)

Software

Intel Quartus Prime 18.1 Standard Edition (Build 625)
Intel HLS Compiler (i++) version 18.1.0 Build 625, invoked from the Nios II Command Shell
Visual Studio 2010 Professional (required by the Intel HLS Compiler for Windows)
Platform Designer (Qsys) (included with Quartus Prime)
OS: Windows 10/11

Quick Start

Full instructions are in docs/build-guide.md. The short version:

1. Compile the HLS kernel

# From the Nios II Command Shell
cd C:/path/to/HLS/filters
i++ -march="Cyclone V" --simulator none -v -o filters filters.cpp

This generates filters.prj/components/filters/ containing the synthesisable RTL and Platform Designer IP files.

2. Integrate into Platform Designer

Open the DE1-SoC VIP Quartus project, launch Platform Designer, and add the HLS component's IP search path:

C:/path/to/HLS/filters/filters.prj/components

Insert the filters component between the VIP-to-RAW and RAW-to-VIP bridges. Connect clock, reset, and Avalon-ST interfaces. Generate HDL.

3. Compile and program

# Full Quartus compilation
quartus_sh --flow compile DE1_SoC_VIP_TV

# Program the FPGA via JTAG
quartus_pgm -m jtag -o "p;output_files/DE1_SoC_VIP_TV.sof"

HLS Source: `filters.cpp`

The kernel is a single hls_always_run_component that reads from an Avalon-ST input and writes to an Avalon-ST output. The full Canny-style pipeline runs inside a single while(!end_of_packet) loop, processing one pixel per iteration.
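
The control flow of that loop can be sketched as a plain C++ software model. This sketch deliberately avoids the Intel HLS stream types so that it compiles with an ordinary compiler; `RgbBeat`, `toGrey`, and `processPacket` are hypothetical names, and only the greyscale step is shown as the per-pixel work.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of one Avalon-ST beat: three 8-bit symbols plus framing.
struct RgbBeat {
    std::uint8_t r, g, b;
    bool sop;  // start of packet
    bool eop;  // end of packet
};

// Greyscale as the integer average of the three channels.
inline std::uint8_t toGrey(const RgbBeat& p) {
    return static_cast<std::uint8_t>((p.r + p.g + p.b) / 3);
}

// Consume beats until the end-of-packet flag is seen (or the input runs out),
// producing one output beat per input beat with the framing flags preserved.
inline std::vector<RgbBeat> processPacket(const std::vector<RgbBeat>& in) {
    std::vector<RgbBeat> out;
    bool end_of_packet = false;
    std::size_t i = 0;
    while (!end_of_packet && i < in.size()) {
        const RgbBeat px = in[i++];
        const std::uint8_t y = toGrey(px);
        out.push_back(RgbBeat{y, y, y, px.sop, px.eop});  // grey replicated to RGB
        end_of_packet = px.eop;
    }
    return out;
}
```

In the hardware kernel the vector accesses correspond to stream reads and writes, and each loop iteration maps to one clock cycle of the pipelined loop.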

Key implementation details:

Greyscale is computed as the average of R, G, B channels (integer division by 3).
Gaussian blur uses a 5x5 kernel with weights summing to 159. Five line buffers (line0 through line4, each 640 entries) store the pixel rows. The kernel coefficients match a standard 5x5 Gaussian approximation (centre weight 15).
Sobel computes horizontal and vertical gradients over a 3x3 window on the Gaussian output. Three additional line buffers (gaussian_line0 through gaussian_line2) hold the blurred rows. Gradient direction is quantised into four bins using atan approximation via threshold comparisons on the Gy/Gx ratio.
Non-maximum suppression compares each pixel's gradient magnitude against its two neighbours along the gradient direction. Three more line buffers (sobel_line0 through sobel_line2) store the Sobel results with direction metadata.
Double threshold classifies pixels as strong (> 130), weak (50-130), or suppressed (< 50). Weak pixels are kept only if the NMS pass flagged them as local maxima.
Output delay uses a small shift register (final_delay, depth 12) and a large RGB buffer (rgb_buffer, depth 3201) to align the processed output with the original stream timing.
All line buffer shifts use #pragma unroll to ensure single-cycle execution.
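
As an illustration of the Gaussian step, the sketch below applies a 5x5 integer kernel to one window. The coefficient table is the widely used approximation whose weights sum to 159 with a centre weight of 15; the text above gives only the sum and the centre weight, so the remaining coefficients are an assumption, as are the names `kGauss`, `kernelSum`, and `gaussian5x5`.

```cpp
#include <cstdint>

// Assumed coefficients: a common 5x5 Gaussian approximation
// (weights sum to 159, centre weight 15).
constexpr int kGauss[5][5] = {
    {2,  4,  5,  4, 2},
    {4,  9, 12,  9, 4},
    {5, 12, 15, 12, 5},
    {4,  9, 12,  9, 4},
    {2,  4,  5,  4, 2},
};

// Sum of all kernel weights, used as the normalisation divisor.
inline int kernelSum() {
    int sum = 0;
    for (int r = 0; r < 5; ++r)
        for (int c = 0; c < 5; ++c)
            sum += kGauss[r][c];
    return sum;
}

// Weighted average of a 5x5 greyscale window w[row][col].
// The accumulator peaks at 255 * 159 = 40545, which fits easily in an int.
inline std::uint8_t gaussian5x5(const std::uint8_t w[5][5]) {
    int acc = 0;
    for (int r = 0; r < 5; ++r)
        for (int c = 0; c < 5; ++c)
            acc += kGauss[r][c] * w[r][c];
    return static_cast<std::uint8_t>(acc / 159);
}
```

A flat window passes through unchanged, while a single bright pixel at the centre is attenuated to 15/159 of its value, which is the smoothing that suppresses noise ahead of the Sobel stage.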

Quartus Compilation Settings

The following non-default settings were applied for timing closure:

Setting	Value	Default
Optimisation Mode	Aggressive Performance	Balanced
Optimisation Technique	Speed	Balanced
Physical Synthesis (Combo Logic)	On	Off
Physical Synthesis (Register Retiming)	On	Off
Placement Effort Multiplier	4.0	1.0
Router Timing Optimisation Level	Maximum	Normal
Fitter Effort	Standard Fit	Auto Fit

Total compilation time: approximately 10 minutes 31 seconds (Analysis & Synthesis: 3:36, Fitter: 6:03, Assembler: 0:23, Timing Analyzer: 0:29).

STA Timing Summary (All Corners, All Clocks)

All setup, hold, recovery, removal, and minimum pulse width checks pass with positive slack. Zero TNS across every timing corner.

Worst-case setup slack on the pixel processing clock (general[1], 100 MHz):

Corner	Slack
Slow 1100mV 85C	+0.827 ns
Slow 1100mV 0C	+1.124 ns
Fast 1100mV 85C	+3.224 ns
Fast 1100mV 0C	+3.416 ns

Platform Designer Warnings (Benign)

edi_read_master / write_master "must be an export" on the Deinterlacer II. These are unused memory-mapped master interfaces in the VIP IP. They can be safely ignored or exported and left unconnected.
PLL "Actual settings differ from Requested settings." The PLL quantises to the nearest achievable frequency. If the VGA image is stable and timing passes, this is fine.
Deinterlacer "configured for YCbCr 4:2:2 colour space." This is an informational message; the colour pipeline is coherent (confirmed by correct VGA output).

Author

Varun Venkata Tej Molleti, MSc Telecommunications and Wireless Systems, University of Liverpool (2024-2025)

License

This project is provided for educational and portfolio purposes. The HLS source code (filters.cpp) is original work. Intel VIP IP cores and Quartus project templates are subject to Intel's licensing terms.
