v05 (FP32) beats all Tensor Core kernels v07–v10

v05 Vectorized FP32:    3,304 GFLOP/s
v08 TC SMEM WMMA:       3,230 GFLOP/s   ← slower than FP32
v09 TC Async Pipeline:  3,105 GFLOP/s
v10 TC Vectorized:      2,984 GFLOP/s

FP16 matrices are half the size of FP32 matrices. For the same M=N=K=2048, v08 loads 16 MB of operands (A and B) versus 32 MB for v05. If both kernels were bandwidth-bound by the same 320 GB/s wall, v08 should be about 2× faster than v05, not slower. That v08 ≈ v05 means the TC kernels are loading at least ~2× more data than FP16 requires. This is consistent with the 662 MB vs. ~32 MB theoretical gap: a roughly 20× inefficiency that the README attributes to warp-level re-fetching but never quantifies per version.
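The arithmetic behind these figures can be reproduced with a short script. This is a back-of-envelope sketch, not part of the repository: it assumes "MB" means 2^20 bytes for the matrix sizes, assumes each operand is read from DRAM exactly once in the ideal case, and takes the 662 MB and ~32 MB figures as quoted above without knowing which kernel version they were measured on. All variable and function names are illustrative.

```python
# Back-of-envelope traffic model for M = N = K = 2048, using figures quoted in this issue.
M = N = K = 2048
MIB = 1 << 20  # assumption: "MB" in the issue text means 2**20 bytes


def operand_bytes(bytes_per_element):
    """Bytes needed to read A (M x K) and B (K x N) exactly once."""
    return (M * K + K * N) * bytes_per_element


fp32_operands = operand_bytes(4)  # v05 reads FP32 inputs
fp16_operands = operand_bytes(2)  # v08-v10 read FP16 inputs

# A GEMM performs one multiply and one add per (m, n, k) triple.
flops = 2 * M * N * K

# Throughput numbers copied from the benchmark list above (GFLOP/s).
gflops = {"v05": 3304, "v08": 3230, "v09": 3105, "v10": 2984}
runtime_ms = {name: flops / (rate * 1e9) * 1e3 for name, rate in gflops.items()}

# Traffic figures as quoted in the issue (same unit for both, so the ratio is unit-free).
measured_mb = 662
ideal_mb = 32
inefficiency = measured_mb / ideal_mb

print(f"FP32 operands: {fp32_operands / MIB:.0f} MiB")
print(f"FP16 operands: {fp16_operands / MIB:.0f} MiB")
print(f"Measured / ideal traffic: {inefficiency:.1f}x")
for name, ms in runtime_ms.items():
    print(f"{name}: {ms:.2f} ms per GEMM")
```

Extending the `gflops` table with per-version DRAM byte counts from a profiler would give the per-version breakdown that the README currently lacks.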
