Summary
Add a batch size sweep mode that automatically runs benchmarks across a range of batch sizes and reports how throughput, latency, and efficiency scale with data volume.
Motivation
The current benchmark uses a fixed batch size of 1,048,576 elements. However, GPU performance characteristics change dramatically with batch size — small batches may not saturate the GPU's execution units, while very large batches may hit memory limits or cause diminishing returns. A sweep analysis reveals the optimal operating point for each kernel and helps users understand the performance curve for their specific hardware.
Acceptance Criteria
- Add a `--sweep` CLI flag that runs each task across a geometric range of batch sizes (e.g. 1K, 4K, 16K, 64K, 256K, 1M, 4M, 16M)
- For each batch size, report: average latency, P95 latency, throughput (elements/sec), and throughput per GPU core (if detectable)
- Identify and highlight the batch size that achieves peak throughput for each task
- Output should clearly show the scaling curve — ideally suitable for charting
- Handle edge cases: batch sizes that exceed available GPU memory should be skipped with a warning
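To make the geometric range and per-size metrics concrete, here is a minimal sketch of how the sweep could generate sizes and reduce per-run latencies to the reported numbers. The function names (`geometric_sizes`, `summarize`) and the exact P95 method are illustrative assumptions, not part of the existing benchmark:

```python
import statistics

def geometric_sizes(min_size: int, max_size: int, steps: int) -> list[int]:
    """Generate `steps` batch sizes spaced geometrically between min and max."""
    ratio = (max_size / min_size) ** (1 / (steps - 1))
    return [round(min_size * ratio ** i) for i in range(steps)]

def summarize(latencies_ms: list[float], batch_size: int) -> dict:
    """Reduce per-run latencies for one batch size to the sweep's metrics."""
    ordered = sorted(latencies_ms)
    p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
    avg = statistics.mean(latencies_ms)
    return {
        "batch_size": batch_size,
        "avg_ms": avg,
        "p95_ms": p95,
        "throughput": batch_size / (avg / 1000.0),  # elements/sec
    }
```

With the defaults from the issue (`1024` to `16777216` over 8 steps), `geometric_sizes` yields exactly the 1K–16M range in the example, since each step is a 4x jump.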
Technical Notes
- The sweep range should be configurable (e.g. `--sweep-min 1024 --sweep-max 16777216 --sweep-steps 8`)
- Results could be output in a tabular format that's easy to paste into a spreadsheet or plot
- This pairs well with CSV / Markdown export #10 (CSV/Markdown export) for automated analysis
- Consider logging GPU memory utilisation at each batch size
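One possible shape for the tabular output, sketched here as a Markdown table with the peak-throughput row flagged. The `format_table` helper and the row-dict layout are hypothetical; the real implementation would feed in whatever result records the sweep collects:

```python
def format_table(rows: list[dict]) -> str:
    """Render sweep results as a Markdown table, flagging peak throughput."""
    peak = max(r["throughput"] for r in rows)
    lines = [
        "| batch | avg ms | p95 ms | elems/sec | peak |",
        "|---|---|---|---|---|",
    ]
    for r in rows:
        mark = "*" if r["throughput"] == peak else ""
        lines.append(
            f"| {r['batch_size']} | {r['avg_ms']:.3f} | "
            f"{r['p95_ms']:.3f} | {r['throughput']:.0f} | {mark} |"
        )
    return "\n".join(lines)
```

A Markdown table pastes cleanly into spreadsheets and renders on GitHub, which would also dovetail with the CSV/Markdown export proposed in #10.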