On rigs with dual GPUs and NVLink, the standard `llama-bench` does not reflect the hardware's performance potential. I noticed a single thread causing roughly 50% load on each of my two GV100s (dual NVLink). When I manually started a second instance of `llama-bench`, I could run an entire extra benchmark in parallel without affecting the pts/llama-cpp benchmark, and both GPUs finally reached 100% load. Georgi noticed this too and apparently created `llama-batched-bench` to measure parallel performance on the DGX Spark (and probably other DGX devices with multiple GPUs). That test is a much more accurate reflection of AI performance on larger rigs, which otherwise cruise through the single-threaded benchmark while delivering a fraction of their potential.
Consider adding `llama-batched-bench` as an option in the test?
ggml-org/llama.cpp#16578
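
For anyone who wants to reproduce the observation, here is a rough sketch of what I did by hand: start several identical copies of a benchmark command at once and time the whole batch. The helper name `run_in_parallel` is made up for illustration, and the command you pass in (binary, model path, flags) is an assumption about your own setup, not something the test profile provides.

```python
import subprocess
import time


def run_in_parallel(cmd, copies):
    """Start `copies` identical processes at once and wait for all of them.

    `cmd` is an argument list such as ["llama-bench", "-m", "model.gguf"];
    the binary and its arguments are supplied by the caller (assumption).
    Returns (wall_seconds, outputs), where outputs holds each copy's stdout.
    """
    start = time.perf_counter()
    # Launch every copy before waiting on any of them, so they overlap.
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
        for _ in range(copies)
    ]
    outputs = []
    for proc in procs:
        # communicate() drains stdout and waits for the process to exit.
        out, _ = proc.communicate()
        if proc.returncode != 0:
            raise RuntimeError(f"benchmark copy exited with {proc.returncode}")
        outputs.append(out)
    return time.perf_counter() - start, outputs
```

Calling it once with `copies=1` and once with `copies=2` on the same command gives two wall times to compare. If the second run takes about as long as the first while each copy reports similar throughput, a single instance is not saturating the GPUs, which is what I saw on the GV100 pair.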