An MCP server that exposes Poor Paul's Benchmark GPU inference data — quantization × throughput × VRAM × concurrent users — as queryable tools to any LLM client.
Hosted instance: https://mcp.poorpaul.dev/ (streamable-http transport, no auth)
Connect any MCP-aware client (Claude Desktop, Cline, Continue, etc.) to ask questions like:
- "What's the best quantization for a 32 GB GPU running Qwen3.5-9B with 8 concurrent users?"
- "Show me every model tested at Q4_K_M on the RTX 5090."
- "What GPU should I buy for running 27B models at 4 concurrent users on a $800 budget?"
- "Why is my RTX 5090 result at Q4_K_M slower than I expected?"
It exposes thirteen tools backed by 39,000+ real benchmark rows:
| Tool | What it does |
|---|---|
list_tested_configs |
Lists every tested GPU, model, and quantization (call this first) |
query_ppb_results |
Filters raw benchmark rows by GPU / VRAM / model / quant / users / backend / date |
recommend_quantization |
Three-tier empirical-first recommendation engine (high / medium / low confidence) |
recommend_hardware |
Budget-aware GPU recommendation ranked by speed, efficiency, or value-for-money |
explain_result |
Contextual explanation of a result: VRAM pressure, PCIe context, percentile rank |
get_gpu_headroom |
Sanity-checks a (gpu, model, quant, users) configuration for VRAM headroom |
compare_quants_quantitative |
Side-by-side throughput comparison across quantizations for a model + GPU |
get_combined_scores |
Quantitative + qualitative metrics in one call for a (gpu, model, quant) config |
rank_by_priority |
Rank quantizations by speed, efficiency (tok/W), or a balanced composite score |
| Tool | What it does |
|---|---|
get_qualitative_summary |
All available qualitative scores (context-rot, tool accuracy, quality, MT-Bench) |
query_qualitative_results |
Filter qualitative rows by phase, model, quant, GPU, or minimum score thresholds |
get_context_rot_breakdown |
Long-context recall scores by length, depth, and needle type |
get_tool_accuracy_breakdown |
Tool-call accuracy: selection, parameters, hallucination rate, parse success |
compare_quants_qualitative |
Side-by-side qualitative comparison across quantizations with deterministic insight |
Benchmark rows are mirrored into a local SQLite cache (./ppb_cache.db by
default; override with PPB_DB_PATH). On startup the server loads from
SQLite and only contacts HuggingFace when the dataset's git commit SHA
has changed — making subsequent restarts fast and offline-friendly.
Add to your MCP client config (Claude Desktop example, ~/Library/Application Support/Claude/claude_desktop_config.json):
{
"mcpServers": {
"ppb": {
"transport": { "type": "http", "url": "https://mcp.poorpaul.dev/mcp" }
}
}
}pip install ppb-mcp
MCP_TRANSPORT=stdio ppb-mcpClaude Desktop config:
{
"mcpServers": {
"ppb": {
"command": "ppb-mcp",
"env": { "MCP_TRANSPORT": "stdio" }
}
}
}docker run --rm -p 9933:9933 \
-e MCP_TRANSPORT=streamable-http \
-v ppb-hf-cache:/data/huggingface \
ghcr.io/paulplee/ppb-mcp:latestgit clone https://github.com/paulplee/ppb-mcp
cd ppb-mcp
pip install -e ".[dev]"
ppb-mcp # streamable-http on :9933All clients use the same hosted endpoint: https://mcp.poorpaul.dev/mcp
Edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS)
or %APPDATA%\Claude\claude_desktop_config.json (Windows):
{
"mcpServers": {
"ppb": {
"transport": { "type": "http", "url": "https://mcp.poorpaul.dev/mcp" }
}
}
}Restart Claude Desktop after saving.
Edit ~/.cursor/mcp.json (create if it doesn't exist):
{
"mcpServers": {
"ppb": {
"url": "https://mcp.poorpaul.dev/mcp",
"type": "http"
}
}
}Or via UI: Settings → Tools & Integrations → MCP → Add Server.
Edit ~/.codeium/windsurf/mcp_config.json:
{
"mcpServers": {
"ppb": {
"serverUrl": "https://mcp.poorpaul.dev/mcp",
"transport": "http"
}
}
}Add to your .vscode/mcp.json (workspace) or User settings.json:
{
"mcp": {
"servers": {
"ppb": {
"type": "http",
"url": "https://mcp.poorpaul.dev/mcp"
}
}
}
}Add to ~/.config/zed/settings.json under "context_servers":
{
"context_servers": {
"ppb": {
"command": {
"path": "env",
"args": ["MCP_TRANSPORT=stdio", "uvx", "ppb-mcp"]
}
}
}
}Open the Cline panel → MCP Servers tab → Add Server → select SSE/HTTP → paste https://mcp.poorpaul.dev/mcp.
Add to ~/.continue/config.yaml:
mcpServers:
- name: ppb
transport:
type: http
url: https://mcp.poorpaul.dev/mcpAdd to ~/.config/opencode/config.json:
{
"mcp": {
"ppb": {
"type": "remote",
"url": "https://mcp.poorpaul.dev/mcp"
}
}
}goose mcp add ppb --transport http --url https://mcp.poorpaul.dev/mcp# Zero-install (requires uv):
env MCP_TRANSPORT=stdio uvx ppb-mcp
# After pip install:
env MCP_TRANSPORT=stdio ppb-mcpNote on transport key names: MCP clients are not yet fully standardised on JSON key names for the HTTP transport. If your client doesn't connect with
"type": "http", try"transport": "http","type": "sse", or"transport": "streamable-http". The endpoint URL is the same regardless.
> list_tested_configs
{ "gpus": ["Apple M4 Pro", "NVIDIA GB10", "NVIDIA GeForce RTX 5090"],
"models": ["Qwen3.5-9B", ...], "quantizations": ["Q4_K_M", ...] }
> recommend_quantization(gpu_vram_gb=32, concurrent_users=8, model="Qwen3.5-9B", priority="balance")
{ "recommended_quantization": "Q5_K_M",
"estimated_vram_usage_gb": 27.8,
"estimated_tokens_per_second": 142.0,
"headroom_gb": 4.2,
"confidence": "high",
"reasoning": "Q5_K_M is recommended for your NVIDIA GeForce RTX 5090 (32 GB) ...",
"alternatives": ["Q4_K_M", "Q8_0"] }
> recommend_hardware(target_model="Qwen3.5-27B", target_quantization="Q4_K_M",
concurrent_users=4, budget_usd=1200, priority="value")
{ "top_picks": [
{ "gpu": "NVIDIA GeForce RTX 5090", "msrp_usd": 1999, "throughput_tok_s": 94.3,
"efficiency_tok_per_dollar": 0.047, "rank_reason": "best measured tok/$ in budget" },
...
],
"budget_usd": 1200 }
> explain_result(gpu_name="NVIDIA GeForce RTX 5090", model="Qwen3.5-9B",
quantization="Q4_K_M", concurrent_users=8, n_ctx=32768)
{ "throughput_tok_s": 142.0,
"vram_pressure": "medium",
"pcie_context": "PCIe Gen 5 x16 — full bandwidth, no bottleneck expected",
"percentile_rank": 0.91,
"insight": "Top 9% throughput among all Qwen3.5-9B Q4_K_M configurations measured." }
| Env var | Default | Notes |
|---|---|---|
HF_DATASET |
paulplee/ppb-results |
HuggingFace dataset ID |
REFRESH_INTERVAL_HOURS |
1 |
Background refresh cadence |
MCP_TRANSPORT |
streamable-http |
stdio or streamable-http |
HOST |
0.0.0.0 |
HTTP bind host |
PORT |
9933 |
HTTP bind port |
LOG_LEVEL |
INFO |
Python logging level |
git clone https://github.com/paulplee/ppb-mcp /tmp/ppb-mcp
cd /tmp/ppb-mcp
DOMAIN=mcp.example.com EMAIL=you@example.com ./deploy/deploy.shThis installs Docker, builds the image, registers a systemd unit, configures nginx, and runs certbot.
cd /opt/ppb-mcp
sudo git pull --ff-only
sudo systemctl restart ppb-mcpThe systemd unit runs docker compose pull before starting, so the new image is fetched automatically. Check that it came up cleanly with:
sudo systemctl status ppb-mcp
# or follow live logs:
journalctl -u ppb-mcp -fcd ~/ppb-mcp # or wherever you cloned the repo
git pull --ff-only
docker compose build
docker compose up -d
# verify:
docker compose logs -f ppb-mcppip install -e ".[dev]"
ruff check src tests
pytest -vIntegration tests against the live HuggingFace dataset are gated behind PPB_RUN_INTEGRATION=1 to keep CI offline-clean.
- Tier 1 — empirical exact match (high confidence). ≥3 measured runs on a GPU at-or-below your VRAM budget at the requested concurrency.
- Tier 2 — empirical-near (medium). Same
(model, quant)benchmarked on a different GPU at the same concurrency; throughput borrowed, VRAM scaled to your card. - Tier 3 — formula extrapolation (low).
vram_per_user ≈ (params_B × bits_per_weight / 8) × 1.15; viable iff total ≤ 90 % of your VRAM.
MIT — see LICENSE.