Summary
Add a batch size sweep mode that automatically runs benchmarks across a range of batch sizes and reports how throughput, latency, and efficiency scale with data volume.
Motivation
The current benchmark uses a fixed batch size of 1,048,576 elements. However, GPU performance characteristics change dramatically with batch size — small batches may not saturate the GPU's execution units, while very large batches may hit memory limits or cause diminishing returns. A sweep analysis reveals the optimal operating point for each kernel and helps users understand the performance curve for their specific hardware.
Acceptance Criteria
- Add a `--sweep` CLI flag that runs each task across a geometric range of batch sizes (e.g. 1K, 4K, 16K, 64K, 256K, 1M, 4M, 16M)
- For each batch size, report: average latency, P95 latency, throughput (elements/sec), and throughput per GPU core (if detectable)
- Identify and highlight the batch size that achieves peak throughput for each task
- Output should clearly show the scaling curve — ideally suitable for charting
- Handle edge cases: batch sizes that exceed available GPU memory should be skipped with a warning
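To make the geometric range and per-size metrics concrete, here is a minimal sketch of how the sweep could generate sizes and reduce per-run latencies to the reported numbers. The function names (`geometric_sizes`, `summarize`) and the exact P95 method are illustrative assumptions, not part of the existing benchmark:

```python
import statistics

def geometric_sizes(min_size: int, max_size: int, steps: int) -> list[int]:
    """Generate `steps` batch sizes spaced geometrically between min and max."""
    ratio = (max_size / min_size) ** (1 / (steps - 1))
    return [round(min_size * ratio ** i) for i in range(steps)]

def summarize(latencies_ms: list[float], batch_size: int) -> dict:
    """Reduce per-run latencies for one batch size to the sweep's metrics."""
    ordered = sorted(latencies_ms)
    p95 = ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]
    avg = statistics.mean(latencies_ms)
    return {
        "batch_size": batch_size,
        "avg_ms": avg,
        "p95_ms": p95,
        "throughput": batch_size / (avg / 1000.0),  # elements/sec
    }
```

With the defaults from the issue (`1024` to `16777216` over 8 steps), `geometric_sizes` yields exactly the 1K–16M range in the example, since each step is a 4x jump.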
Technical Notes
- The sweep range should be configurable (e.g. `--sweep-min 1024 --sweep-max 16777216 --sweep-steps 8`)
- Results could be output in a tabular format that's easy to paste into a spreadsheet or plot
- This pairs well with CSV / Markdown export #10 (CSV/Markdown export) for automated analysis
- Consider logging GPU memory utilisation at each batch size
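One possible shape for the tabular output, sketched here as a Markdown table with the peak-throughput row flagged. The `format_table` helper and the row-dict layout are hypothetical; the real implementation would feed in whatever result records the sweep collects:

```python
def format_table(rows: list[dict]) -> str:
    """Render sweep results as a Markdown table, flagging peak throughput."""
    peak = max(r["throughput"] for r in rows)
    lines = [
        "| batch | avg ms | p95 ms | elems/sec | peak |",
        "|---|---|---|---|---|",
    ]
    for r in rows:
        mark = "*" if r["throughput"] == peak else ""
        lines.append(
            f"| {r['batch_size']} | {r['avg_ms']:.3f} | "
            f"{r['p95_ms']:.3f} | {r['throughput']:.0f} | {mark} |"
        )
    return "\n".join(lines)
```

A Markdown table pastes cleanly into spreadsheets and renders on GitHub, which would also dovetail with the CSV/Markdown export proposed in #10.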