athrva98/segformer-cuda

SegFormer B0 CUDA Inference Engine

Pure CUDA/C++ implementation of SegFormer B0 for real-time semantic segmentation. Just raw CUDA kernels and cuBLAS.

Uses weights from the HuggingFace reference model (nvidia/segformer-b0-finetuned-ade-512-512).

Features

  • Pure CUDA inference with custom kernels (im2col conv2d, depthwise conv, layernorm, GELU, softmax, bilinear resize, etc.)
  • CUDA Graph capture for reduced kernel launch overhead (--graph)
  • Real-time webcam segmentation via Python/OpenCV script with binary pipe protocol
  • ADE20K 150-class output
  • Workspace bump allocator for all intermediate tensors (single 896 MB allocation, no per-frame mallocs). Memory reuse is straightforward to add on top of it.
  • BatchNorm (BN) folding during weight export for a faster decode head
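The workspace bump allocator mentioned above can be illustrated with a short sketch. This is not the engine's code (which lives in CUDA/C++); the class name `BumpAllocator`, the 256-byte alignment, and the `reset()` method are assumptions made for illustration. It only models the offset arithmetic: one large buffer, a pointer that moves forward on each allocation, and a rewind between frames.

```python
class BumpAllocator:
    """Hands out offsets into one pre-allocated workspace (hypothetical sketch)."""

    def __init__(self, capacity_bytes, alignment=256):
        # The rounding trick in alloc() requires a power-of-two alignment.
        assert alignment > 0 and alignment & (alignment - 1) == 0
        self.capacity = capacity_bytes
        self.alignment = alignment
        self.offset = 0  # next free byte in the workspace

    def alloc(self, nbytes):
        # Round the current offset up to the next aligned boundary.
        start = (self.offset + self.alignment - 1) & ~(self.alignment - 1)
        end = start + nbytes
        if end > self.capacity:
            raise MemoryError("workspace exhausted")
        self.offset = end
        return start

    def reset(self):
        # Rewind to the start so the next frame reuses the same memory.
        self.offset = 0


workspace = BumpAllocator(896 * 1024 * 1024)
stage0 = workspace.alloc(32 * 128 * 128 * 4)  # float32 [1, 32, 128, 128]
stage1 = workspace.alloc(64 * 64 * 64 * 4)    # float32 [1, 64, 64, 64]
workspace.reset()                             # no frees, just rewind
```

Because allocation is a single addition and nothing is freed individually, per-frame cost is negligible and the layout is identical on every frame, which is also what makes such a scheme compatible with CUDA Graph replay.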

Requirements

  • NVIDIA GPU with CUDA support (tested on RTX 5060 Laptop, sm_120)
  • CUDA Toolkit (11.0+; building for sm_120 requires 12.8 or newer)
  • CMake 3.18+
  • Python 3.8+ with numpy, opencv-python (for webcam demo)
  • transformers and torch (only for weight export)

Build

cmake -B build
cmake --build build --config Release

To target a specific GPU architecture:

cmake -B build -DCMAKE_CUDA_ARCHITECTURES=86

Export Weights

Download and convert weights from HuggingFace (one-time step):

pip install transformers torch
python scripts/export_weights.py

This creates weights/segformer_b0.bin (~14 MB, 204 tensors, with BatchNorm already folded into the fuse conv).
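For readers unfamiliar with BatchNorm folding, the following is a minimal sketch of the arithmetic, not the contents of scripts/export_weights.py. It assumes the fuse conv has no bias of its own and that BatchNorm uses eps = 1e-5 (the PyTorch default); the function names are hypothetical. Each output channel's weights are scaled by gamma / sqrt(var + eps), and the remaining offset becomes a bias.

```python
import math


def fold_batchnorm(weight, gamma, beta, mean, var, eps=1e-5):
    """Fold inference-mode BatchNorm into a bias-free 1x1 conv.

    weight: one row per output channel, each row of length in_channels.
    gamma, beta, mean, var: per-output-channel BatchNorm parameters.
    Returns (folded_weight, folded_bias).
    """
    folded_w, folded_b = [], []
    for row, g, b, m, v in zip(weight, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        folded_w.append([w * scale for w in row])
        folded_b.append(b - m * scale)
    return folded_w, folded_b


def conv1x1(weight, bias, x):
    # At a single pixel, a 1x1 conv is a matrix-vector product over channels.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    # Reference inference-mode BatchNorm, used only to check the folding.
    return [(yi - m) / math.sqrt(v + eps) * g + b
            for yi, g, b, m, v in zip(y, gamma, beta, mean, var)]
```

After folding, the decode head runs one conv instead of a conv followed by a separate normalization pass, which is where the speedup comes from.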

Usage

Benchmark

./build/Release/segformer_infer weights/segformer_b0.bin
./build/Release/segformer_infer weights/segformer_b0.bin --graph

Real-time Webcam

python scripts/webcam_segmentation.py
python scripts/webcam_segmentation.py --graph

Controls:

  • q : quit
  • +/- : adjust overlay opacity
  • t/g : increase/decrease confidence threshold
  • s : save screenshot
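The opacity and confidence-threshold controls can be understood with a per-pixel sketch. This is one plausible way to combine them, not necessarily what scripts/webcam_segmentation.py does: the function names, the use of softmax probability as "confidence", and the palette layout are all assumptions.

```python
import math


def softmax(logits):
    # Subtract the maximum first for numerical stability.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def overlay_pixel(pixel, logits, palette, opacity, threshold):
    """Blend the predicted class color over one pixel (hypothetical helper).

    pixel: (b, g, r) tuple from the camera frame.
    logits: per-class scores for this pixel (150 values for ADE20K).
    palette: list mapping class index to a color tuple.
    """
    probs = softmax(logits)
    cls = max(range(len(probs)), key=probs.__getitem__)
    if probs[cls] < threshold:
        return pixel  # low confidence: leave the frame untouched
    color = palette[cls]
    return tuple(round((1 - opacity) * p + opacity * c)
                 for p, c in zip(pixel, color))
```

Under this reading, raising the threshold hides uncertain regions, while opacity only changes how strongly confident regions are tinted.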

Test Mode for Numerical Verification

./build/Release/segformer_infer weights/segformer_b0.bin --test --test-input input.bin --test-output output.bin
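To compare output.bin against a reference dump (for example, logits saved from the HuggingFace model), a small script along these lines may help. The README does not specify the file layout, so this sketch assumes both files are raw little-endian float32 values with no header; the helper names are hypothetical.

```python
import array
import sys


def load_f32(path):
    """Read a file of raw float32 values (assumed layout: little-endian, no header)."""
    values = array.array("f")
    with open(path, "rb") as f:
        values.frombytes(f.read())
    if sys.byteorder != "little":
        values.byteswap()
    return values


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)
```

Calling `max_abs_diff(load_f32("output.bin"), load_f32("reference.bin"))` then gives a single number to track; reference.bin here is a placeholder for whatever reference dump you produce.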

Architecture

Input [1, 3, 512, 512]
  |
  v
Stage 0: PatchEmbed(7x7, stride=4) -> 2x TransformerBlock(dim=32, heads=1, sr=8)  -> [1, 32, 128, 128]
Stage 1: PatchEmbed(3x3, stride=2) -> 2x TransformerBlock(dim=64, heads=2, sr=4)  -> [1, 64, 64, 64]
Stage 2: PatchEmbed(3x3, stride=2) -> 2x TransformerBlock(dim=160, heads=5, sr=2) -> [1, 160, 32, 32]
Stage 3: PatchEmbed(3x3, stride=2) -> 2x TransformerBlock(dim=256, heads=8, sr=1) -> [1, 256, 16, 16]
  |
  v
Decode Head:
  - Project each stage to 256 dims
  - Bilinear resize all to 128x128
  - Concatenate [stage3, stage2, stage1, stage0]
  - 1x1 fuse conv (1024->256) + ReLU
  - 1x1 classifier (256->150)
  |
  v
Output [1, 150, 128, 128]  (ADE20K class logits)
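The stage shapes in the diagram follow from the standard convolution output formula. The sketch below recomputes them, assuming each patch embedding pads by kernel // 2 (as the HuggingFace reference implementation does) and that sr is the spatial-reduction ratio applied to keys and values in attention. The names `STAGES` and `stage_shapes` are illustrative only.

```python
def conv_out(size, kernel, stride, padding):
    # Standard convolution output size along one spatial dimension.
    return (size + 2 * padding - kernel) // stride + 1


STAGES = [
    # (patch kernel, stride, dim, heads, sr)
    (7, 4, 32, 1, 8),
    (3, 2, 64, 2, 4),
    (3, 2, 160, 5, 2),
    (3, 2, 256, 8, 1),
]


def stage_shapes(input_size=512):
    """Return (dim, spatial size, query tokens, key/value tokens, head dim) per stage."""
    size = input_size
    rows = []
    for kernel, stride, dim, heads, sr in STAGES:
        size = conv_out(size, kernel, stride, padding=kernel // 2)
        tokens = size * size
        kv_tokens = (size // sr) ** 2  # keys/values are downsampled by sr per side
        rows.append((dim, size, tokens, kv_tokens, dim // heads))
    return rows


for dim, size, tokens, kv_tokens, head_dim in stage_shapes():
    print(f"dim={dim:3d}  {size}x{size}  q_tokens={tokens}  "
          f"kv_tokens={kv_tokens}  head_dim={head_dim}")
```

For a 512x512 input this reproduces the 128, 64, 32, 16 spatial sizes above, and shows that the key/value sequence is 256 tokens and the head dimension is 32 at every stage, so the attention cost is kept bounded even at the highest resolution.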

Project Structure

include/
  segformer.h             model struct, config constants, weight definitions
  cuda_kernels.cuh        kernel declarations and error-check macros
src/
  cuda_kernels.cu         all CUDA kernel implementations
  segformer.cu            weight loading, forward pass, CUDA graph, cleanup
  main.cu                 CLI entry point (benchmark / stream / test modes)
scripts/
  export_weights.py       export HuggingFace weights to binary format
  webcam_segmentation.py  real-time webcam demo with OpenCV

Author

Athrva Pandhare (athrva98@gmail.com)
