Buddy-MLIR Gemmini performance benchmarks: kernel + ResNet50 validation #11

Open
ashvin-verma wants to merge 2 commits into main from ashvin/buddy-gemmini-benchmarks

ashvin-verma commented Feb 9, 2026

Summary

  • Add reproducible Buddy-MLIR performance benchmarks on Gemmini (Spike simulator)
  • Covers matmul workloads (MLP2, MLP1, softmax, iGELU), conv workloads (conv, conv+pool), and ResNet50 conv1 layer validation
  • All output checksums validated against Gemmini C reference (tiled_matmul_auto / tiled_conv_auto)
  • Documents the full lowering pipeline from Gemmini MLIR to bare-metal execution in WORKFLOW.md

Conv encoding bug fix

We found and fixed a bug in Buddy-MLIR's Gemmini conv lowering: the im2col encoding produced incorrect weight matrix layouts, which caused checksum mismatches against the Gemmini C reference. The fix was contributed upstream as buddy-compiler/buddy-mlir#689, and all conv benchmarks here require it. The conv1-bad-buddy test case (intentionally wrong stride) is included to verify that the validation methodology catches such errors.
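To illustrate why a checksum comparison catches this class of error, here is a minimal pure-Python sketch. It is not the Buddy-MLIR lowering or the Gemmini C reference: the function names, the single-channel unpadded convolution, the input pattern, and the plain-sum checksum are all assumptions made for illustration. It reuses the 17×17, k=3, stride=2 shape from the conv benchmark and compares a direct convolution against an im2col formulation, once with the right stride and once with a deliberately wrong one.

```python
def conv2d_direct(inp, ker, stride):
    """Reference: direct sliding-window 2D convolution (one channel, no padding)."""
    k = len(ker)
    out_dim = (len(inp) - k) // stride + 1
    return [
        [
            sum(inp[r * stride + i][c * stride + j] * ker[i][j]
                for i in range(k) for j in range(k))
            for c in range(out_dim)
        ]
        for r in range(out_dim)
    ]


def conv2d_im2col(inp, ker, stride):
    """im2col: flatten every input patch into a row, then take dot products
    with the kernel flattened in the same row-major order."""
    k = len(ker)
    out_dim = (len(inp) - k) // stride + 1
    weights = [ker[i][j] for i in range(k) for j in range(k)]
    patches = [
        [inp[r * stride + i][c * stride + j] for i in range(k) for j in range(k)]
        for r in range(out_dim) for c in range(out_dim)
    ]
    flat = [sum(p * w for p, w in zip(patch, weights)) for patch in patches]
    return [flat[r * out_dim:(r + 1) * out_dim] for r in range(out_dim)]


def checksum(mat):
    """Assumed checksum: plain sum of all output elements."""
    return sum(sum(row) for row in mat)


IN_DIM = 17
# Illustrative input (values 1 or 2) and an all-positive 3x3 kernel.
inp = [[(r + c) % 2 + 1 for c in range(IN_DIM)] for r in range(IN_DIM)]
ker = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]

reference = conv2d_direct(inp, ker, stride=2)
good = conv2d_im2col(inp, ker, stride=2)
bad = conv2d_im2col(inp, ker, stride=1)  # intentionally wrong stride

print(checksum(reference) == checksum(good))  # True
print(checksum(reference) == checksum(bad))   # False
```

A single scalar checksum can in principle collide, so a match is strong evidence rather than proof; including a known-bad case, as this PR does, shows that the check can fail when it should.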

Key Results

Matmul Workloads

| Workload | Dataflow | Gemmini C cycles | Buddy cycles | Checksum | Speedup |
|---|---|---|---|---|---|
| MLP2 (64×832) | WS | 2,528 | 409 | ✓ 252338 | 6.18× |
| MLP2 (64×832) | OS | 207,782 | 96,076 | ✓ 252338 | 2.16× |
| MLP1 (6-layer) | WS | 25,251 | 2,539 | ✓ 258664 | 9.95× |
| softmax matmul (31×30×66) | WS | 335 | 145 | ✓ 3860 | 2.31× |
| iGELU matmul (30×30×30) | WS | 133 | 133 | ✓ −23260 | 1.00× |

Conv Workloads (post conv-encoding fix)

| Workload | CPU cycles | Gemmini C cycles | Buddy cycles | Checksum | Buddy vs Gemmini C |
|---|---|---|---|---|---|
| conv (17×17, k=3, stride=2) | 7,559,913 | 1,027 | 149 | ✓ 950 | 6.89× |
| conv+pool (17×17, k=3, pool=3) | 7,714,291 | 1,605 | 172 | ✓ 30827 | 9.33× |

ResNet50 Conv1 Layer

| Layer | Gemmini C cycles | Buddy cycles | Checksum | Speedup |
|---|---|---|---|---|
| Conv1 (7×7, stride=2, 3×3 pool) | 225,146 | 7,313 | ✓ 10206332 | 30.8× |

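Note on cycle counts: rdcycle reads the host CPU's cycle counter, and Spike is a functional simulator rather than a cycle-accurate one, so these counts effectively track executed host instructions. Buddy's compile-time loop unrolling reduces host-side orchestration overhead, so the speedups reflect less host-side work, not necessarily faster Gemmini hardware throughput.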
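The speedup columns above are the ratio of Gemmini C cycles to Buddy cycles. The short Python sketch below recomputes each ratio from the reported cycle counts so the figures can be checked independently. The `RESULTS` list and the `speedup` helper are names introduced here for illustration; the numbers are copied from the tables in this description.

```python
# (workload, Gemmini C cycles, Buddy cycles, reported speedup, decimals reported)
RESULTS = [
    ("MLP2 WS",        2528,   409,   6.18, 2),
    ("MLP2 OS",        207782, 96076, 2.16, 2),
    ("MLP1 WS",        25251,  2539,  9.95, 2),
    ("softmax matmul", 335,    145,   2.31, 2),
    ("iGELU matmul",   133,    133,   1.00, 2),
    ("conv",           1027,   149,   6.89, 2),
    ("conv+pool",      1605,   172,   9.33, 2),
    ("ResNet50 conv1", 225146, 7313,  30.8, 1),
]


def speedup(baseline_cycles, buddy_cycles):
    """Speedup of Buddy over the Gemmini C baseline (higher is better)."""
    return baseline_cycles / buddy_cycles


for name, c_cycles, buddy_cycles, reported, decimals in RESULTS:
    computed = round(speedup(c_cycles, buddy_cycles), decimals)
    status = "ok" if computed == reported else "MISMATCH"
    print(f"{name:16s} computed {computed}x, reported {reported}x: {status}")
```

Every row should print `ok`. Because these are host-side counts on Spike, the ratios are best read as a measure of orchestration overhead, per the note above.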

What's included

  • experiments/buddy-benchmarks/kernels/ — 7 kernel benchmarks (.mlir + .c harnesses) with Makefile
  • experiments/buddy-benchmarks/resnet50/ — ResNet50 conv1 validation (Buddy vs Gemmini C + intentional bad case)
  • experiments/buddy-benchmarks/scripts/run_benchmark.sh — Single script to build and run everything
  • experiments/buddy-benchmarks/logs/ — Reference Spike output logs
  • experiments/buddy-benchmarks/README.md — Full results, methodology, reproduction instructions
  • experiments/buddy-benchmarks/WORKFLOW.md — Complete lowering pipeline documentation (MLIR → buddy-opt → buddy-translate → buddy-llc → gcc link → Spike)

Related

Test plan

  • Build kernel benchmarks: make -C experiments/buddy-benchmarks/kernels all
  • Run on Spike and verify checksums match reference logs
  • Run experiments/buddy-benchmarks/scripts/run_benchmark.sh end-to-end
  • Build ResNet50 validation: make -C experiments/buddy-benchmarks/resnet50 validate
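The "verify checksums match reference logs" step can be automated along these lines. This is a sketch under stated assumptions, not the logic of run_benchmark.sh: the log line format (`<name> checksum: <int>`), the workload names, and the helper functions are all hypothetical, and the real logs under experiments/buddy-benchmarks/logs/ may be laid out differently. The checksum values in the reference text are taken from the results tables; the run log contains one deliberately altered value to show a mismatch.

```python
import re

# Hypothetical log line format, e.g. "mlp2_ws checksum: 252338".
CHECKSUM_RE = re.compile(r"^(?P<name>\S+)\s+checksum:\s*(?P<value>-?\d+)\s*$")


def parse_checksums(log_text):
    """Extract {workload name: checksum} from log text, ignoring other lines."""
    checksums = {}
    for line in log_text.splitlines():
        match = CHECKSUM_RE.match(line.strip())
        if match:
            checksums[match.group("name")] = int(match.group("value"))
    return checksums


def compare_logs(reference_text, run_text):
    """Return the sorted names whose checksum is missing or differs in the run."""
    ref = parse_checksums(reference_text)
    run = parse_checksums(run_text)
    return sorted(name for name in ref if run.get(name) != ref[name])


reference_log = """\
mlp2_ws checksum: 252338
igelu checksum: -23260
conv checksum: 950
"""

run_log = """\
(unrelated simulator output)
mlp2_ws checksum: 252338
igelu checksum: -23260
conv checksum: 951
"""  # conv value deliberately altered to demonstrate a mismatch

print(compare_logs(reference_log, run_log))  # ['conv']
```

Returning the list of mismatching workloads, rather than a single pass/fail flag, makes it easier to see which kernel regressed when several are run in one batch.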

Add reproducible benchmarks comparing Buddy-MLIR's Gemmini dialect
backend against the Gemmini C reference on Spike simulator.

Kernel benchmarks: conv, conv+pool, MLP2 (WS/OS), MLP1, softmax
matmul, iGELU matmul. ResNet50 conv1 layer validation with
intentional bad case for test methodology verification.

Conv benchmarks require buddy-compiler/buddy-mlir#689 (conv encoding
fix) for correct im2col lowering.

Step-by-step guide from Gemmini dialect MLIR through buddy-opt,
buddy-translate, buddy-llc, bare-metal linking, to Spike execution.
Includes setup instructions for all prerequisites.