Batch size sweep analysis #8

@SolidRegardless

Description

Summary

Add a batch size sweep mode that automatically runs benchmarks across a range of batch sizes and reports how throughput, latency, and efficiency scale with data volume.

Motivation

The current benchmark uses a fixed batch size of 1,048,576 elements. However, GPU performance characteristics change dramatically with batch size — small batches may not saturate the GPU's execution units, while very large batches may hit memory limits or cause diminishing returns. A sweep analysis reveals the optimal operating point for each kernel and helps users understand the performance curve for their specific hardware.

Acceptance Criteria

  • Add a --sweep CLI flag that runs each task across a geometric range of batch sizes (e.g. 1K, 4K, 16K, 64K, 256K, 1M, 4M, 16M)
  • For each batch size, report: average latency, P95 latency, throughput (elements/sec), and throughput per GPU core (if detectable)
  • Identify and highlight the batch size that achieves peak throughput for each task
  • Output should clearly show the scaling curve — ideally suitable for charting
  • Handle edge cases: batch sizes that exceed available GPU memory should be skipped with a warning
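The per-batch-size statistics above could be computed along these lines. This is a minimal sketch, not code from the benchmark: the function name `summarise` and the dict keys are illustrative, and it assumes latencies are collected as per-iteration wall-clock times in seconds.

```python
def summarise(batch_size, latencies_s):
    """Reduce one batch size's latency samples to the reported metrics."""
    n = len(latencies_s)
    avg = sum(latencies_s) / n
    # Nearest-rank P95: the sample 95% of the way through the sorted list.
    ordered = sorted(latencies_s)
    p95 = ordered[min(n - 1, int(0.95 * (n - 1)))]
    return {
        "avg_s": avg,
        "p95_s": p95,
        "elems_per_s": batch_size / avg,  # throughput in elements/sec
    }
```

Peak-throughput highlighting then reduces to `max(results, key=lambda r: r["elems_per_s"])` over the per-size summaries.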

Technical Notes

  • The sweep range should be configurable (e.g. --sweep-min 1024 --sweep-max 16777216 --sweep-steps 8)
  • Results could be output in a tabular format that's easy to paste into a spreadsheet or plot
  • This pairs well with #10 (CSV/Markdown export) for automated analysis
  • Consider logging GPU memory utilisation at each batch size
