Testing CUDA kernels for matrix multiplication with fused operations.

## Setup

```shell
pip install -e .
```

## Benchmarking

```shell
python -m mygemm.bench --device cuda
```

Kernel variants:

- Plain: naive baseline implementation
- Fused: bias + ReLU fused into a single kernel
- Optimized: tiled computation with shared memory, bank conflict resolution
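The fused variant computes `relu(A @ B + bias)` in one kernel launch instead of three (GEMM, bias add, ReLU), saving two round-trips through global memory. A minimal pure-Python reference of the intended semantics, useful for checking kernel output (the function name here is illustrative, not part of the package's API):

```python
def gemm_bias_relu(A, B, bias):
    """CPU reference for the fused kernel: relu(A @ B + bias).

    A is M x K, B is K x N, bias has length N; all plain nested lists.
    """
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = bias[j]                 # bias folded into the accumulator
            for k in range(K):
                acc += A[i][k] * B[k][j]
            C[i][j] = max(acc, 0.0)       # ReLU applied in the same pass
    return C
```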
## Development

```shell
# Rebuild after changing CUDA code
pip install -e . --force-reinstall --no-deps

# Debug mode: synchronous launches surface kernel errors at the call site
CUDA_LAUNCH_BLOCKING=1 python -m mygemm.bench
```

## Layout

```
csrc/
├── mygemm_kernels.cu   # Naive & fused kernels
└── bank_extra.cu       # Optimized kernel
mygemm/
├── functional.py       # Autograd functions
├── modules.py          # nn.Module wrappers
└── bench.py            # Benchmarking
```
The optimized kernel is based on siboehm/SGEMM_CUDA.
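The core idea behind the optimized kernel is tiling: each thread block stages a small sub-block of A and B in shared memory and accumulates partial products tile by tile over the K dimension, so each global-memory element is loaded once per tile rather than once per output element. A CPU sketch of that blocking scheme (tile size and function name are illustrative; the actual CUDA kernel adds shared-memory staging and bank-conflict-free indexing on top of this loop structure):

```python
TILE = 2  # illustrative; CUDA tiles are typically 32 x 32 to match warp/bank widths

def tiled_matmul(A, B):
    """Accumulate C tile-by-tile over K, mirroring the shared-memory loop
    of a tiled GEMM kernel. A is M x K, B is K x N, plain nested lists."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for k0 in range(0, K, TILE):          # one iteration = one shared-memory tile load
        for i in range(M):
            for j in range(N):
                acc = 0.0
                for k in range(k0, min(k0 + TILE, K)):
                    acc += A[i][k] * B[k][j]
                C[i][j] += acc            # partial product for this K-tile
    return C
```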