v05 Vectorized FP32: 3,304 GFLOP/s
v08 TC SMEM WMMA: 3,230 GFLOP/s ← slower than FP32
v09 TC Async Pipeline: 3,105 GFLOP/s
v10 TC Vectorized: 2,984 GFLOP/s
FP16 matrices are half the size of FP32. For the same M=N=K=2048, v08 reads 16 MB of A and B operands versus v05's 32 MB. If both kernels were bound by the same 320 GB/s bandwidth wall, v08 should be about 2× faster than v05, not slower. Since v08 ≈ v05 instead, the TC kernels must be moving roughly 2× more data than they should for FP16. This is consistent with the 662 MB vs ~32 MB theoretical gap, a roughly 20× inefficiency that the README attributes to warp-level re-fetching but never quantifies per version.
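The arithmetic behind these figures can be checked with a short sketch. It assumes "MB" above means MiB (2^20 bytes), that the operand footprint counts only A and B read once each, and that 662 and 32 are the traffic figures quoted above. The helper name `operand_mib` is hypothetical and is not part of the benchmarked code.

```python
def operand_mib(m, n, k, bytes_per_elem):
    """Footprint of A (m x k) plus B (k x n) in MiB, each element read once."""
    return (m * k + k * n) * bytes_per_elem / 2**20

M = N = K = 2048

fp32_mib = operand_mib(M, N, K, 4)   # v05 operands (FP32)
fp16_mib = operand_mib(M, N, K, 2)   # v08 operands (FP16)

# Ideal speedup if both kernels were limited only by operand bytes.
ideal_speedup = fp32_mib / fp16_mib

# Measured throughput ratio from the listing above (GFLOP/s).
measured_ratio = 3230 / 3304

# Traffic figures quoted in the text; the ratio is the inefficiency factor.
traffic_gap = 662.0 / 32.0

print(f"FP32 operands: {fp32_mib:.0f} MiB, FP16 operands: {fp16_mib:.0f} MiB")
print(f"ideal v08/v05 speedup: {ideal_speedup:.1f}x, measured: {measured_ratio:.3f}x")
print(f"traffic gap: {traffic_gap:.1f}x")
```

The gap between the ideal 2× and the measured ratio just under 1× is what the paragraph above attributes to redundant loads. The sketch only restates that arithmetic and does not measure anything itself.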