pts/llama-cpp benchmarks one instance at a time and doesn't accurately reflect potential parallel performance. Add `llama-batched-bench`. #75

On rigs with dual GPUs and NVLink, the standard `llama-bench` does not reflect the hardware's performance potential. I noticed that a single instance put only about 50% load on each of my two GV100s (dual NVLink). When I manually started a second instance of `llama-bench`, I could run an entire extra benchmark in parallel without affecting the pts/llama-cpp benchmark, and both GPUs finally reached 100% load.

Georgi apparently noticed this too and uses `llama-batched-bench` to measure parallel performance on the DGX Spark (and probably on other DGX devices with multiple GPUs). That test is a much more accurate reflection of AI performance on larger rigs, which otherwise cruise through the single-instance benchmark while delivering a fraction of their potential.
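The manual experiment of starting a second instance by hand can be approximated with a small launcher. This is only a sketch: `run_in_parallel` is a hypothetical helper, not part of pts/llama-cpp or llama.cpp, and the command passed to it (for example `["llama-bench", "-m", "model.gguf"]`) is an assumption about how the benchmark is invoked locally.

```python
import subprocess
import time


def run_in_parallel(cmd, instances):
    """Start `instances` copies of `cmd` at once and wait for all of them.

    Returns (outputs, elapsed_seconds); outputs holds each copy's stdout.
    """
    start = time.perf_counter()
    # Launch every copy before waiting on any, so they overlap in time.
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
        for _ in range(instances)
    ]
    outputs = [p.communicate()[0] for p in procs]
    elapsed = time.perf_counter() - start
    for p in procs:
        if p.returncode != 0:
            raise RuntimeError(f"{cmd[0]} exited with code {p.returncode}")
    return outputs, elapsed


# Example (assumed binary and model path, adjust to the local setup):
#   outputs, elapsed = run_in_parallel(["llama-bench", "-m", "model.gguf"], 2)
```

If the per-instance numbers with two copies stay close to the single-instance result, the single run was leaving capacity unused, which is the effect described above.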

Consider adding `llama-batched-bench` as an option in the test?
https://github.com/ggml-org/llama.cpp/discussions/16578
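As a rough idea of what such an option might run, the sketch below builds a `llama-batched-bench` command line. The flags `-m`, `-c`, `-ngl`, `-npp`, `-ntg` and `-npl` come from the llama.cpp tool itself; the helper name `batched_bench_cmd`, the default values, and the model path are assumptions for illustration, not anything pts/llama-cpp currently does.

```python
def batched_bench_cmd(model_path, prompt_tokens, gen_tokens, parallel_levels,
                      ctx_size=16384, gpu_layers=99,
                      binary="llama-batched-bench"):
    """Build an argv list for llama-batched-bench (hypothetical helper).

    prompt_tokens, gen_tokens and parallel_levels are lists of integers;
    the tool benchmarks every combination of the three.
    """
    def csv(values):
        return ",".join(str(v) for v in values)

    return [
        binary,
        "-m", model_path,             # model file (assumed path)
        "-c", str(ctx_size),          # context must hold all parallel sequences
        "-ngl", str(gpu_layers),      # layers offloaded to the GPU(s)
        "-npp", csv(prompt_tokens),   # prompt lengths to test
        "-ntg", csv(gen_tokens),      # generated tokens per sequence
        "-npl", csv(parallel_levels), # numbers of parallel sequences
    ]


# Example: sweep 1 to 8 parallel sequences with an assumed model file.
#   batched_bench_cmd("model.gguf", [128, 512], [128], [1, 2, 4, 8])
```

Sweeping `-npl` is what exposes the parallel headroom: throughput at 1 sequence corresponds to the single-instance case, and higher values show how far the hardware scales.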