SegFormer B0 CUDA Inference Engine

Pure CUDA/C++ implementation of SegFormer B0 for real-time semantic segmentation, built only on hand-written CUDA kernels and cuBLAS.

Uses weights from the HuggingFace reference model (nvidia/segformer-b0-finetuned-ade-512-512).

Features

Pure CUDA inference with custom kernels (im2col conv2d, depthwise conv, layernorm, GELU, softmax, bilinear resize, etc.)
CUDA Graph capture for reduced kernel launch overhead (--graph)
Real-time webcam segmentation via a Python/OpenCV script using a binary pipe protocol
ADE20K 150-class output
Workspace bump allocator for all intermediate tensors (single 896 MB allocation, no per-frame mallocs), which makes buffer reuse straightforward to set up
BatchNorm (BN) folding during weight export for a faster decode head
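
The workspace bump allocator listed above can be illustrated with a minimal sketch. This is a Python model of the offset arithmetic only, not the engine's CUDA code; the class name, the 256-byte alignment, and the reset method are assumptions made for illustration.

```python
class BumpAllocator:
    """Hands out offsets into one pre-allocated buffer (hypothetical model)."""

    def __init__(self, capacity_bytes, alignment=256):
        self.capacity = capacity_bytes
        self.alignment = alignment  # assumed value; the engine may align differently
        self.offset = 0

    def alloc(self, nbytes):
        # Round the current offset up to the next aligned position.
        start = (self.offset + self.alignment - 1) // self.alignment * self.alignment
        end = start + nbytes
        if end > self.capacity:
            raise MemoryError("workspace exhausted")
        self.offset = end
        return start

    def reset(self):
        # Reuse the whole workspace for the next frame; nothing is freed.
        self.offset = 0


ws = BumpAllocator(896 * 1024 * 1024)
a = ws.alloc(4 * 32 * 128 * 128)  # e.g. a float32 [1, 32, 128, 128] tensor
b = ws.alloc(100)
c = ws.alloc(1)                   # starts at the next 256-byte boundary after b
ws.reset()
d = ws.alloc(64)                  # lands at offset 0 again
```

Allocation is a single addition and there is no per-tensor free, which is why no mallocs are needed per frame: resetting the offset recycles the whole buffer.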

Requirements

NVIDIA GPU with CUDA support (tested on RTX 5060 Laptop, sm_120)
CUDA Toolkit 11.0+ (targeting sm_120 requires CUDA 12.8 or newer)
CMake 3.18+
Python 3.8+ with numpy, opencv-python (for webcam demo)
transformers and torch (only for weight export)

Build

cmake -B build
cmake --build build --config Release

To target a specific GPU architecture, set CMAKE_CUDA_ARCHITECTURES (for example, 86 for compute capability 8.6):

cmake -B build -DCMAKE_CUDA_ARCHITECTURES=86

Export Weights

Download and convert weights from HuggingFace (one-time step):

pip install transformers torch
python scripts/export_weights.py

This creates weights/segformer_b0.bin (~14 MB, 204 tensors with BatchNorm folded into the fuse conv).
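
For readers unfamiliar with BatchNorm folding, here is a minimal sketch of the arithmetic for a 1x1 conv followed by BatchNorm. It is not the code in scripts/export_weights.py; the function name and the list-based layout (one weight row per output channel) are assumptions, and eps defaults to PyTorch's BatchNorm2d default of 1e-5.

```python
import math


def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta into the conv.

    weight: list of per-output-channel rows; the rest are per-channel lists.
    Pass a list of zeros as bias when the conv has no bias term.
    """
    folded_w, folded_b = [], []
    for w_row, b, g, bt, m, v in zip(weight, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        folded_w.append([w * scale for w in w_row])  # w' = w * scale
        folded_b.append(bt + (b - m) * scale)        # b' = beta + (b - mean) * scale
    return folded_w, folded_b
```

After folding, the decode head runs one conv instead of a conv plus a separate normalization pass, with the same result up to floating-point rounding.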

Usage

Benchmark

./build/Release/segformer_infer weights/segformer_b0.bin
./build/Release/segformer_infer weights/segformer_b0.bin --graph

Real-time Webcam

python scripts/webcam_segmentation.py
python scripts/webcam_segmentation.py --graph

Controls:

q : quit
+/- : adjust overlay opacity
t/g : increase/decrease confidence threshold
s : save screenshot

Test Mode for Numerical Verification

./build/Release/segformer_infer weights/segformer_b0.bin --test --test-input input.bin --test-output output.bin
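
One way to check the test-mode output against a reference is to compare the two files element by element. The sketch below assumes output.bin is a raw array of float32 values in native byte order, which this README does not confirm; reference.bin is a hypothetical file holding the logits of the HuggingFace model for the same input.

```python
import array


def read_f32(path):
    """Read a file as a flat array of float32 values (assumed raw layout)."""
    values = array.array("f")
    with open(path, "rb") as f:
        values.frombytes(f.read())
    return values


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two sequences."""
    if len(a) != len(b):
        raise ValueError(f"element count mismatch: {len(a)} vs {len(b)}")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)
```

A call such as max_abs_diff(read_f32("output.bin"), read_f32("reference.bin")) then gives a single number to compare against a tolerance of your choosing.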

Architecture

Input [1, 3, 512, 512]
  |
  v
Stage 0: PatchEmbed(7x7, stride=4) -> 2x TransformerBlock(dim=32, heads=1, sr=8)  -> [1, 32, 128, 128]
Stage 1: PatchEmbed(3x3, stride=2) -> 2x TransformerBlock(dim=64, heads=2, sr=4)  -> [1, 64, 64, 64]
Stage 2: PatchEmbed(3x3, stride=2) -> 2x TransformerBlock(dim=160, heads=5, sr=2) -> [1, 160, 32, 32]
Stage 3: PatchEmbed(3x3, stride=2) -> 2x TransformerBlock(dim=256, heads=8, sr=1) -> [1, 256, 16, 16]
  |
  v
Decode Head:
  - Project each stage to 256 dims
  - Bilinear resize all to 128x128
  - Concatenate [stage3, stage2, stage1, stage0]
  - 1x1 fuse conv (1024->256) + ReLU
  - 1x1 classifier (256->150)
  |
  v
Output [1, 150, 128, 128]  (ADE20K class logits)
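
The stage shapes in the diagram follow from the standard convolution output-size formula. The sketch below recomputes them and also shows how many key/value tokens remain after the spatial reduction (sr) in each attention block. The padding of kernel // 2 is an assumption that matches the HuggingFace SegFormer configuration; the function names are illustrative.

```python
def conv_out(size, kernel, stride, padding):
    """Standard convolution output-size formula."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel, stride, dim, sr) per stage, taken from the diagram above.
STAGES = [(7, 4, 32, 8), (3, 2, 64, 4), (3, 2, 160, 2), (3, 2, 256, 1)]


def stage_shapes(input_size=512):
    size, rows = input_size, []
    for kernel, stride, dim, sr in STAGES:
        size = conv_out(size, kernel, stride, kernel // 2)  # assumed padding
        tokens = size * size           # query tokens in this stage
        kv_tokens = (size // sr) ** 2  # keys/values after spatial reduction
        rows.append((dim, size, tokens, kv_tokens))
    return rows


for dim, size, tokens, kv in stage_shapes():
    print(f"[1, {dim}, {size}, {size}]  tokens={tokens}  kv_tokens={kv}")
```

For a 512x512 input, every stage ends up attending over 256 key/value tokens (16x16), which keeps the attention matrices small even in stage 0 with its 16384 query tokens.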

Project Structure

include/
  segformer.h             model struct, config constants, weight definitions
  cuda_kernels.cuh        kernel declarations and error-check macros
src/
  cuda_kernels.cu         all CUDA kernel implementations
  segformer.cu            weight loading, forward pass, CUDA graph, cleanup
  main.cu                 CLI entry point (benchmark / stream / test modes)
scripts/
  export_weights.py       export HuggingFace weights to binary format
  webcam_segmentation.py  real-time webcam demo with OpenCV

Author

Athrva Pandhare (athrva98@gmail.com)
