Empirical LLM inference benchmarks for the GPU you actually own.
Find the absolute limit of your local AI hardware without the enterprise budget.
AI-driven GPU and RAM prices got you down? Same here. Poor Paul's Benchmark exists to answer a simple question with real data: what is the most efficient hardware, model, and quantization setup for the budget you actually have?
Poor Paul's Benchmark is an automated evaluation framework for local LLM inference. It combines throughput, latency, VRAM headroom, long-context recall, tool-call accuracy, answer quality, and multi-turn memory into one open benchmarking workflow.
The goal is simple: build the most useful public benchmark for cost-effective local AI — for hobbyists, homelabbers, prosumers, and small teams who need honest numbers, not marketing claims.
No cloud credits. No black boxes. No benchmark theatre. Just your GPU, real models, and honest numbers.
PPB is the benchmark-runtime layer of a four-layer open platform. Every result you contribute flows through the full stack automatically.
┌─────────────────────────────────────────────────────────────────┐
│ Layer 4 — Human interface poorpaul.dev │
│ Analytics, leaderboard, blog, editorial │
├─────────────────────────────────────────────────────────────────┤
│ Layer 3 — LLM interface mcp.poorpaul.dev │
│ Any MCP client can query benchmark data directly │
├─────────────────────────────────────────────────────────────────┤
│ Layer 2 — Data huggingface.co/datasets/ │
│ paulplee/ppb-results (Parquet, versioned, public) │
├─────────────────────────────────────────────────────────────────┤
│ Layer 1 — Benchmark runtime poor-pauls-benchmark ← here │
│ Contributor-run; results pushed to Layer 2 │
└─────────────────────────────────────────────────────────────────┘
Once you push results, any MCP-compatible AI client (Claude Desktop, Cursor, Windsurf, etc.) connected to mcp.poorpaul.dev can answer questions like:
"What's the best quantization for an RTX 4090 running 4 concurrent users?"
using real community benchmark data — including yours.
PPB produces two complementary result types for each model and hardware configuration.
Performance benchmarks measure raw hardware capability: how fast the model runs, how much VRAM it uses, and how many concurrent users it can handle.
Quality benchmarks measure what the model actually does with that compute: whether it retrieves facts accurately from long contexts, calls tools correctly, produces reliable answers, and maintains coherence across a conversation.
| Benchmark type | What it covers | Key metrics |
|---|---|---|
| Performance | VRAM limits, throughput, latency, concurrency | tokens_per_sec, vram_used_mb, ttft_ms, itl_ms, max_concurrent_users |
| Quality | Context recall, tool accuracy, answer quality, multi-turn memory | context_rot_score, overall_tool_accuracy, knowledge_accuracy_mean, quality_composite_score, memory_accuracy |
Performance and quality runs are independent and composable: run a throughput sweep today, add quality scoring next week. Results are joined in the dataset by (gpu_name, model_name, quantization, run_type).
Requirements: Python 3.11+, uv, llama-bench and/or llama-server from llama.cpp on your $PATH, and a CUDA- or Metal-capable GPU.
# 1. Clone and install
git clone https://github.com/paulplee/poor-pauls-benchmark
cd poor-pauls-benchmark
uv sync
# 2. Copy the starter suite and edit for your GPU
cp suites/suite.example.toml suites/my_gpu.toml
# → set repo_id, filename, n_ctx, n_batch
# 3. Run
uv run ppb.py quantitative suites/my_gpu.tomlResults are written to results/results.jsonl. See Publishing Results to contribute them to the shared dataset.
llama.cpp version: Build
b8688or later is required. See docs/building-llama-cpp.md for step-by-step build instructions for CUDA, ROCm, Metal, and CPU.
uv syncAll quality evaluations run local models via llama-cpp-python, which requires a GPU-accelerated build. A plain pip install llama-cpp-python gives a CPU-only binary that is impractically slow for long-context evaluation.
Install the GPU-accelerated build once before running quality benchmarks:
# CUDA — x86_64 Linux / Windows: pre-built wheel (fastest path)
uv pip install llama-cpp-python \
--extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
# CUDA — x86_64 Linux / Windows: build from source
CMAKE_ARGS="-DGGML_CUDA=ON" uv pip install "llama-cpp-python>=0.3.0"
# CUDA — ARM64/aarch64 (Grace Blackwell GB10, Jetson, etc.)
# No CUDA-enabled wheels exist for aarch64; PyPI has a CPU-only aarch64 wheel.
# --no-binary forces a source build. CMAKE_CUDA_ARCHITECTURES=native auto-detects the GPU.
CMAKE_ARGS="-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native" \
uv pip install "llama-cpp-python>=0.3.0" --no-binary llama-cpp-python
# Metal (macOS Apple Silicon)
CMAKE_ARGS="-DGGML_METAL=on" uv pip install "llama-cpp-python>=0.3.0"
# Or compile via the pyproject extras group
uv pip install -e ".[qualitative]"Note — ARM64 / aarch64 machines (including Lenovo ThinkStation PGX GB10, Jetson AGX Orin, etc.): The pre-built wheel index at
abetlen.github.io/llama-cpp-python/whl/cu124only publisheslinux_x86_64andwin_amd64wheels. Passing--extra-index-urlon an aarch64 host will fail with a "no matching platform tag" error. You must build from source as shown above. Thenativearchitecture flag auto-detects your GPU's compute capability at compile time, so you do not need to look it up manually.
Pre-built wheels for common targets: github.com/abetlen/llama-cpp-python/releases
Run this one-liner after installing. No model file needed.
uv run python - <<'EOF'
from llama_cpp import llama_supports_gpu_offload, llama_backend_init, llama_backend_free
from llama_cpp.llama_cpp import llama_print_system_info
llama_backend_init()
print(llama_print_system_info().decode())
gpu = llama_supports_gpu_offload()
print("GPU offload:", "OK ✓" if gpu else "FAIL — CPU-only build (qualitative benchmarks will be unusably slow)")
llama_backend_free()
EOFHealthy CUDA output — look for the ggml_cuda_init header and a CUDA : line in the system info:
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 122502 MiB):
Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 122502 MiB
CUDA : ARCHS = 1210 | USE_GRAPHS = 1 | ...
GPU offload: OK ✓
If GPU offload: FAIL or no CUDA : line appears, the wheel was built without GPU support. Reinstall with the source-build command for your platform from the section above, making sure to pass --no-binary llama-cpp-python so uv doesn't silently reuse the CPU-only wheel from PyPI:
# Remove the CPU-only install first
uv pip uninstall llama-cpp-python
# Then reinstall from source (substitute the correct CMAKE_ARGS for your platform)
CMAKE_ARGS="-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=native" \
uv pip install "llama-cpp-python>=0.3.0" --no-binary llama-cpp-pythonIf the GPU isn't visible at all, check the OS level first:
nvidia-smi # GPU visible and driver loaded?
nvcc --version # CUDA toolkit installed?
echo $CUDA_HOME # CUDA_HOME set? (needed for source builds)# Performance benchmarks only
uv run ppb.py quantitative suites/my_gpu.toml
# Quality benchmarks only
# (auto-reads your VRAM limit from an existing result row to skip OOM cases)
uv run ppb.py qualitative suites/my_gpu.toml
# Everything together — most efficient; judge model loaded only once
uv run ppb.py all suites/my_gpu.tomlBefore running a sweep, PPB automatically discovers the maximum context window (n_ctx) your hardware supports before hitting an Out-of-Memory error, using a binary search. Run it once per model/quantization and the result is reused:
uv run ppb.py vram-cliff suites/my_gpu.tomlA VRAM pre-flight check also runs automatically before any sweep, reading GGUF metadata to estimate worst-case memory usage across your parameter matrix. If any combination would likely OOM, you are offered three options: auto-cap n_ctx, proceed anyway, or quit.
Runs the full models × n_ctx × n_batch × concurrent_users Cartesian product, automatically skipping any combination that exceeds the VRAM limit. Three runner backends are available:
| Runner | Measures | Best for |
|---|---|---|
llama-bench (default) |
Raw throughput (tok/s) | Peak hardware performance, quant comparisons |
llama-server |
TTFT and ITL latency | UX-relevant interactive / chat benchmarks |
llama-server-loadtest |
Max concurrent users | Capacity planning for multi-user deployments |
Set runner_type in your suite TOML to switch backends.
All quality evaluations require a GPU-accelerated llama-cpp-python build. The long-context recall and answer quality evaluations use ShareGPT conversations as their prompt source — real human exchanges that reflect how models are actually used.
Tests whether a model can retrieve specific facts from long contexts — and whether it gets confused by them. PPB plants 15 diverse factual needles (covering alphanumeric codes, dates, currency amounts, chemical formulae, place names, and more) into ShareGPT haystacks at 6 context lengths × 5 depth positions (30 cells total). Needle selection is deterministic per model while rotating across all 15 needles so no single fact type dominates the score.
Enable multi-needle mode (multi_needle_enabled = true) to measure context confusion specifically: whether the model can correctly reason across 2–3 simultaneously planted facts, not just retrieve a single phrase.
Dual-track evaluation using established benchmarks and PPB-native cases:
-
BFCL v3 (Berkeley Function Calling Leaderboard) — four single-turn splits (
simple,multiple,parallel,irrelevance) covering straightforward calls, multi-candidate tool selection, parallel/batched requests, and cases where the model must correctly decline to call any tool. The irrelevance split directly measures hallucination avoidance. -
PPB-native ground truth — 100 cases covering all four
ppb-mcptools (recommend_quantization,query_ppb_results,get_gpu_headroom,list_tested_configs) with realistic hobbyist-AI prompts.
Scores: tool_selection_accuracy, parameter_accuracy, parameter_hallucination_rate, no_call_accuracy (from BFCL irrelevance), parse_success_rate, and overall_tool_accuracy (geometric mean of selection and parameter accuracy).
A judge model scores 50 responses sampled from ShareGPT across three dimensions:
- Knowledge accuracy — fraction of factual claims the judge considers consistent with common knowledge. Note: this measures claim plausibility against the judge's parametric knowledge, not RAGAS-style reference-grounded faithfulness. See docs/qualitative-methodology.md for the distinction and rationale.
- Answer relevancy — how directly the response addresses the question
- Coherence — logical flow and internal consistency
A quality_composite_score combines all three. The 50-prompt evaluation set is frozen by a SHA-256 hash stored in every published row so results remain comparable across models.
Two modes, one selected per suite run:
- LongMemEval (default) — 50 questions from the LongMemEval benchmark (ICLR 2025) embedded in conversation histories up to 500K tokens. Tests recall, temporal reasoning, and knowledge-update handling across turns.
- MT-Bench — all 80 canonical MT-Bench questions (bundled at
ppb_datasets/data/mt_bench_questions.json, MIT-licensed). Fast signal for development cycles. Score is on the standard 1–10 MT-Bench scale.
Context-length cases that would exceed the model's VRAM limit are automatically skipped and reported in cases_skipped_context.
Answer quality and multi-turn evaluation both use a separate judge GGUF. The judge must be a different model from the one under test — PPB enforces this with a runtime path comparison. Self-grading inflates scores and defeats the purpose.
Recommended: a small, well-aligned 3–7B model such as Qwen3.5-4B-Q4_K_M or Llama-3-8B-Instruct-Q4_K_M. In ppb all mode the judge is loaded once and shared across both evaluations.
[qualitative]
judge_model_path = "/path/to/judge-model-Q4_K_M.gguf"All benchmark parameters live in a TOML file. Copy suites/suite.example.toml or suites/qualitative_example.toml as your starting point.
# suites/my_gpu.toml
repo_id = "unsloth/Qwen3.5-8B-GGUF"
filename = "Qwen3.5-8B-Q4_K_M.gguf"
[vram-cliff]
min_ctx = 2048
max_ctx = 131072
[sweep]
n_ctx = [8192, 32768, 65536]
n_batch = [512, 1024]
[publish]
hf_dataset = "paulplee/ppb-results"
# hf_token = "hf_..." # or export HF_TOKEN[qualitative]
context_rot_enabled = true
tool_accuracy_enabled = true
answer_quality_enabled = true
multiturn_enabled = true
multiturn_mode = "longmemeval" # or "mt_bench"
multi_needle_enabled = false # true to test context confusion
judge_model_path = "/path/to/judge-model-Q4_K_M.gguf"
# Optional
needle_seed = 20260426 # deterministic needle selection per model
sample_size = 50 # LongMemEval cases to evaluate| Key | Type | Default | Description |
|---|---|---|---|
repo_id |
string | — | HuggingFace repo, e.g. unsloth/Qwen3.5-8B-GGUF |
filename |
string | — | GGUF filename or glob ("*Q4*.gguf") |
models_dir |
string | "./models" |
Local model cache directory |
n_ctx |
int list | — | Context lengths to test |
n_batch |
int list | — | Batch sizes (llama-bench) |
concurrent_users |
int list | [1] |
Parallel users (llama-server runners) |
runner_type |
string | "llama-bench" |
llama-bench, llama-server, or llama-server-loadtest |
Full configuration reference: see suites/suite.example.toml.
Results are written to results/results.jsonl. Every row carries a nested qualitative JSON column; evaluations that did not run carry null for their keys.
| Key | Evaluation | Type |
|---|---|---|
context_rot_score |
Long-Context Recall | float |
context_rot_accuracy_by_length |
Long-Context Recall | object |
context_rot_accuracy_by_depth |
Long-Context Recall | object |
context_rot_accuracy_by_needle |
Long-Context Recall | object | null |
multi_needle_score |
Long-Context Recall (multi-needle mode) | float | null |
multi_needle_accuracy_by_length |
Long-Context Recall (multi-needle mode) | object | null |
tool_selection_accuracy |
Tool-Call Accuracy | float |
parameter_accuracy |
Tool-Call Accuracy | float |
parameter_hallucination_rate |
Tool-Call Accuracy | float |
no_call_accuracy |
Tool-Call Accuracy (irrelevance split) | float | null |
parse_success_rate |
Tool-Call Accuracy | float |
overall_tool_accuracy |
Tool-Call Accuracy | float |
knowledge_accuracy_mean |
Answer Quality | float |
knowledge_accuracy_std |
Answer Quality | float |
answer_relevancy_mean |
Answer Quality | float |
coherence_mean |
Answer Quality | float |
quality_composite_score |
Answer Quality | float |
memory_accuracy |
Multi-Turn Memory (LongMemEval) | float | null |
mt_bench_score |
Multi-Turn Memory (MT-Bench) | float 1–10 | null |
cases_evaluated |
Multi-Turn Memory | int |
cases_skipped_context |
Multi-Turn Memory | int |
multi_needle_scoreandmulti_needle_accuracy_by_lengtharenullunlessmulti_needle_enabled = trueis set.
mt_bench_scoreuses the 1–10 MT-Bench scale. All other float metrics are 0–1. To normalise:mt_bench_score_norm = (mt_bench_score − 1) / 9.
Full schema documentation: docs/schema.md
Every result you submit makes the shared dataset more useful — and more accurate for every LLM querying it via ppb-mcp.
huggingface.co → Settings → Access Tokens → create a token with Write access.
export HF_TOKEN=hf_your_token_here[publish]
hf_dataset = "paulplee/ppb-results"uv run ppb.py all suites/my_gpu.tomlResults are pushed to HuggingFace after each model completes — so a multi-model run is never lost to a late crash. They appear on poorpaul.dev/insights and become queryable via mcp.poorpaul.dev within minutes.
The benchmark is only as good as the hardware coverage. If your GPU isn't in the dataset, the community is flying blind for your tier.
git clone https://github.com/your-username/poor-pauls-benchmark
cd poor-pauls-benchmark
uv sync
cp suites/suite.example.toml suites/rtx-4070-ti.toml
# edit: repo_id, filename, n_ctx, n_batch, [publish] block
HF_TOKEN=hf_... uv run ppb.py all suites/rtx-4070-ti.tomlNo PR needed to contribute data — results go directly to the shared dataset.
The ground truth lives in ppb_mcp_ground_truth.json. Add cases for new tools, edge cases, or underrepresented prompt styles. Open a PR.
Implement BaseRunner from runners/base.py, register with @register_runner("your-backend"), open a PR. See runners/llama_bench.py for the reference implementation.
uv sync --group dev
pytest tests/ -v
# Quality unit tests only (no GPU required)
pytest tests/test_qualitative_fixes.py tests/test_context_rot.py \
tests/test_tool_accuracy.py tests/test_multiturn.py -vExpand file tree
ppb.py # CLI entry point (Typer app)
ppb_context_rot.py # Long-Context Recall (Semantic NIAH)
ppb_tool_accuracy.py # Tool-Call Accuracy (BFCL v3: simple/multiple/parallel/irrelevance + PPB-native)
ppb_answer_quality.py # Answer Quality (judge-model pipeline)
ppb_multiturn.py # Multi-Turn Memory (LongMemEval / MT-Bench)
ppb_mcp_ground_truth.json # 100 PPB-native tool-call evaluation cases
ppb_quality_prompts_cache.json # Frozen 50-prompt evaluation set (auto-generated)
runners/
base.py # BaseRunner ABC
llama_bench.py # llama-bench runner (throughput)
llama_server.py # llama-server runner (TTFT / ITL latency)
llama_server_loadtest.py # Load-test runner (max concurrency)
ppb_datasets/
sharegpt.py # ShareGPT download + prompt extraction
data/
mt_bench_questions.json # 80 canonical MT-Bench questions (MIT licence)
utils/
flattener.py # Normalise nested JSONL → flat Arrow-friendly dicts (schema v0.9.0)
gguf_metadata.py # Read GGUF headers for VRAM estimation
publisher.py # Push results to HuggingFace
scripts/
aggregate_results.py # Group Parquet rows by key dimensions; compute mean/CI95; optional HF upload
suites/
suite.example.toml # Starter performance suite
qualitative_example.toml # Starter quality suite
docs/
schema.md # Full results schema reference (v0.9.0)
qualitative-methodology.md # Knowledge accuracy vs. reference-grounded faithfulness
building-llama-cpp.md # Build llama.cpp for CUDA / Metal / ROCm
tests/ # pytest suite (no GPU required for unit tests)
results/ # Benchmark output — gitignored
models/ # Downloaded GGUF cache — gitignored
Radical honesty. Benchmarks run on real hardware under real conditions. Results are published unfiltered — no cherry-picking, no vendor-sponsored averages.
Practitioner empathy. PPB is designed by and for people running on limited GPU budgets. Every decision — from automatic VRAM limit discovery to pre-flight safety checks — exists because hardware is expensive and crashes waste time.
Open by default. Code, data, and methodology are all public. The brand earns trust through transparency, not access restrictions.
Compounding contribution. Every benchmark row, every ground-truth case, every runner plugin makes the resource more useful for the next person. The flywheel only spins if people contribute.
| Project | What it does |
|---|---|
| poor-pauls-benchmark (this repo) | Benchmark runner — contribute results here |
| paulplee/ppb-results | Public HuggingFace dataset — canonical source of truth |
| ppb-mcp | MCP server — lets LLMs query the dataset directly |
| poorpaul.dev | Visual analytics, leaderboard, and editorial |
MIT — code and tooling.
Benchmark results contributed to paulplee/ppb-results are published under CC BY 4.0. Contributions remain attributed to their authors.
BFCL v3 evaluation data © UC Berkeley, used under CC BY 4.0. MT-Bench questions © LMSYS, used under MIT licence. ShareGPT dataset used under its original licence.