Pure CUDA/C++ implementation of SegFormer B0 for real-time semantic segmentation. Just raw CUDA kernels and cuBLAS.
Uses weights from HuggingFace reference model (nvidia/segformer-b0-finetuned-ade-512-512).
- Pure CUDA inference with custom kernels (im2col conv2d, depthwise conv, layernorm, GELU, softmax, bilinear resize, etc.)
- CUDA Graph capture for reduced kernel launch overhead (
--graph) - Real-time webcam segmentation via Python/OpenCV script with binary pipe protocol
- ADE20K 150-class output
- Workspace bump allocator for all intermediate tensors (single 896 MB allocation, no per-frame mallocs). You can set up memory re-use easily.
- BN folding during weight export for faster decode head
- NVIDIA GPU with CUDA support (tested on RTX 5060 Laptop, sm_120)
- CUDA Toolkit (11.0+)
- CMake 3.18+
- Python 3.8+ with
numpy,opencv-python(for webcam demo) transformersandtorch(only for weight export)
cmake -B build
cmake --build build --config ReleaseTo target a specific GPU architecture:
cmake -B build -DCMAKE_CUDA_ARCHITECTURES=86Download and convert weights from HuggingFace (one-time step):
pip install transformers torch
python scripts/export_weights.pyThis creates weights/segformer_b0.bin (~14 MB, 204 tensors with BatchNorm folded into the fuse conv).
./build/Release/segformer_infer weights/segformer_b0.bin
./build/Release/segformer_infer weights/segformer_b0.bin --graphpython scripts/webcam_segmentation.py
python scripts/webcam_segmentation.py --graphControls:
q: quit+/-: adjust overlay opacityt/g: increase/decrease confidence thresholds: save screenshot
./build/Release/segformer_infer weights/segformer_b0.bin --test --test-input input.bin --test-output output.binInput [1, 3, 512, 512]
|
v
Stage 0: PatchEmbed(7x7, stride=4) -> 2x TransformerBlock(dim=32, heads=1, sr=8) -> [1, 32, 128, 128]
Stage 1: PatchEmbed(3x3, stride=2) -> 2x TransformerBlock(dim=64, heads=2, sr=4) -> [1, 64, 64, 64]
Stage 2: PatchEmbed(3x3, stride=2) -> 2x TransformerBlock(dim=160, heads=5, sr=2) -> [1, 160, 32, 32]
Stage 3: PatchEmbed(3x3, stride=2) -> 2x TransformerBlock(dim=256, heads=8, sr=1) -> [1, 256, 16, 16]
|
v
Decode Head:
- Project each stage to 256 dims
- Bilinear resize all to 128x128
- Concatenate [stage3, stage2, stage1, stage0]
- 1x1 fuse conv (1024->256) + ReLU
- 1x1 classifier (256->150)
|
v
Output [1, 150, 128, 128] (ADE20K class logits)
include/
segformer.h model struct, config constants, weight definitions
cuda_kernels.cuh kernel declarations and error-check macros
src/
cuda_kernels.cu all CUDA kernel implementations
segformer.cu weight loading, forward pass, CUDA graph, cleanup
main.cu CLI entry point (benchmark / stream / test modes)
scripts/
export_weights.py export HuggingFace weights to binary format
webcam_segmentation.py real-time webcam demo with OpenCV
Athrva Pandhare (athrva98@gmail.com)
