Benchmark of pts/llama-cpp tests one instance at a time and doesn't accurately reflect potential parallel performance. Add llama-batched-bench. #75

@jboero

On rigs with dual GPUs and NVLink, the standard llama-bench does not reflect the hardware's performance potential. I noticed that a single thread puts only 50% load on each of my two GV100s (dual NVLink). When I manually started a second llama-bench instance, it ran an entire extra benchmark in parallel without affecting the pts/llama-cpp benchmark, and both GPUs finally reached 100% load. Georgi noticed this too and apparently created llama-batched-bench to measure parallel performance on the DGX Spark (and probably on other DGX devices with multiple GPUs). That test is a much more accurate reflection of AI performance on larger rigs, which otherwise cruise through the single-threaded benchmark while delivering only a fraction of their potential.

Consider adding llama-batched-bench as an option in the test?
ggml-org/llama.cpp#16578
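
For anyone who wants to reproduce the manual experiment described above, here is a minimal sketch that starts several llama-bench instances at the same time and collects their output. It only imitates running extra instances by hand; it is not how pts/llama-cpp or llama-batched-bench work internally. The helper names `build_commands` and `run_concurrently` are hypothetical, the binary and model paths are placeholders, and the only llama-bench option assumed is `-m` for the model file.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_commands(binary, model_path, instances):
    """Return one identical command line per concurrent instance.

    Assumption: "-m" selects the model file; all other options stay at
    their defaults. Add further flags only after checking `--help`.
    """
    return [[binary, "-m", model_path] for _ in range(instances)]


def run_concurrently(commands):
    """Run all commands at once; return CompletedProcess objects in order."""
    # One worker thread per command, so every process starts immediately
    # instead of queuing behind the previous one.
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        futures = [
            pool.submit(subprocess.run, cmd, capture_output=True, text=True)
            for cmd in commands
        ]
        return [future.result() for future in futures]
```

A call such as `run_concurrently(build_commands("./llama-bench", "model.gguf", 2))` would mirror the two-instance run; watching GPU load while it executes shows whether both cards are saturated. Each result's `stdout` holds that instance's report.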
