Skip to content

paulplee/ppb-mcp

Repository files navigation

ppb-mcp

An MCP server that exposes Poor Paul's Benchmark GPU inference data — quantization × throughput × VRAM × concurrent users — as queryable tools to any LLM client.

CI PyPI License: MIT

Hosted instance: https://mcp.poorpaul.dev/ (streamable-http transport, no auth)

What it does

Connect any MCP-aware client (Claude Desktop, Cline, Continue, etc.) to ask questions like:

  • "What's the best quantization for a 32 GB GPU running Qwen3.5-9B with 8 concurrent users?"
  • "Show me every model tested at Q4_K_M on the RTX 5090."
  • "What GPU should I buy for running 27B models at 4 concurrent users on a $800 budget?"
  • "Why is my RTX 5090 result at Q4_K_M slower than I expected?"

It exposes thirteen tools backed by 39,000+ real benchmark rows:

Quantitative tools

Tool What it does
list_tested_configs Lists every tested GPU, model, and quantization (call this first)
query_ppb_results Filters raw benchmark rows by GPU / VRAM / model / quant / users / backend / date
recommend_quantization Three-tier empirical-first recommendation engine (high / medium / low confidence)
recommend_hardware Budget-aware GPU recommendation ranked by speed, efficiency, or value-for-money
explain_result Contextual explanation of a result: VRAM pressure, PCIe context, percentile rank
get_gpu_headroom Sanity-checks a (gpu, model, quant, users) configuration for VRAM headroom
compare_quants_quantitative Side-by-side throughput comparison across quantizations for a model + GPU
get_combined_scores Quantitative + qualitative metrics in one call for a (gpu, model, quant) config
rank_by_priority Rank quantizations by speed, efficiency (tok/W), or a balanced composite score

Qualitative tools

Tool What it does
get_qualitative_summary All available qualitative scores (context-rot, tool accuracy, quality, MT-Bench)
query_qualitative_results Filter qualitative rows by phase, model, quant, GPU, or minimum score thresholds
get_context_rot_breakdown Long-context recall scores by length, depth, and needle type
get_tool_accuracy_breakdown Tool-call accuracy: selection, parameters, hallucination rate, parse success
compare_quants_qualitative Side-by-side qualitative comparison across quantizations with deterministic insight

Data & caching

Benchmark rows are mirrored into a local SQLite cache (./ppb_cache.db by default; override with PPB_DB_PATH). On startup the server loads from SQLite and only contacts HuggingFace when the dataset's git commit SHA has changed — making subsequent restarts fast and offline-friendly.

Install

1) Use the hosted instance (zero setup)

Add to your MCP client config (Claude Desktop example, ~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "ppb": {
      "transport": { "type": "http", "url": "https://mcp.poorpaul.dev/mcp" }
    }
  }
}

2) pip install and run locally (stdio)

pip install ppb-mcp
MCP_TRANSPORT=stdio ppb-mcp

Claude Desktop config:

{
  "mcpServers": {
    "ppb": {
      "command": "ppb-mcp",
      "env": { "MCP_TRANSPORT": "stdio" }
    }
  }
}

3) Docker

docker run --rm -p 9933:9933 \
  -e MCP_TRANSPORT=streamable-http \
  -v ppb-hf-cache:/data/huggingface \
  ghcr.io/paulplee/ppb-mcp:latest

4) From source

git clone https://github.com/paulplee/ppb-mcp
cd ppb-mcp
pip install -e ".[dev]"
ppb-mcp           # streamable-http on :9933

Connect Your LLM Client

All clients use the same hosted endpoint: https://mcp.poorpaul.dev/mcp

Claude Desktop

Edit ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):

{
  "mcpServers": {
    "ppb": {
      "transport": { "type": "http", "url": "https://mcp.poorpaul.dev/mcp" }
    }
  }
}

Restart Claude Desktop after saving.

Cursor

Edit ~/.cursor/mcp.json (create if it doesn't exist):

{
  "mcpServers": {
    "ppb": {
      "url": "https://mcp.poorpaul.dev/mcp",
      "type": "http"
    }
  }
}

Or via UI: Settings → Tools & Integrations → MCP → Add Server.

Windsurf

Edit ~/.codeium/windsurf/mcp_config.json:

{
  "mcpServers": {
    "ppb": {
      "serverUrl": "https://mcp.poorpaul.dev/mcp",
      "transport": "http"
    }
  }
}

VS Code (GitHub Copilot Agent Mode)

Add to your .vscode/mcp.json (workspace) or User settings.json:

{
  "mcp": {
    "servers": {
      "ppb": {
        "type": "http",
        "url": "https://mcp.poorpaul.dev/mcp"
      }
    }
  }
}

Zed

Add to ~/.config/zed/settings.json under "context_servers":

{
  "context_servers": {
    "ppb": {
      "command": {
        "path": "env",
        "args": ["MCP_TRANSPORT=stdio", "uvx", "ppb-mcp"]
      }
    }
  }
}

Cline (VS Code extension)

Open the Cline panel → MCP Servers tab → Add Server → select SSE/HTTP → paste https://mcp.poorpaul.dev/mcp.

Continue.dev

Add to ~/.continue/config.yaml:

mcpServers:
  - name: ppb
    transport:
      type: http
      url: https://mcp.poorpaul.dev/mcp

OpenCode

Add to ~/.config/opencode/config.json:

{
  "mcp": {
    "ppb": {
      "type": "remote",
      "url": "https://mcp.poorpaul.dev/mcp"
    }
  }
}

Goose (Block)

goose mcp add ppb --transport http --url https://mcp.poorpaul.dev/mcp

Any stdio-compatible client

# Zero-install (requires uv):
env MCP_TRANSPORT=stdio uvx ppb-mcp

# After pip install:
env MCP_TRANSPORT=stdio ppb-mcp

Note on transport key names: MCP clients are not yet fully standardised on JSON key names for the HTTP transport. If your client doesn't connect with "type": "http", try "transport": "http", "type": "sse", or "transport": "streamable-http". The endpoint URL is the same regardless.

Example session

> list_tested_configs
{ "gpus": ["Apple M4 Pro", "NVIDIA GB10", "NVIDIA GeForce RTX 5090"],
  "models": ["Qwen3.5-9B", ...], "quantizations": ["Q4_K_M", ...] }

> recommend_quantization(gpu_vram_gb=32, concurrent_users=8, model="Qwen3.5-9B", priority="balance")
{ "recommended_quantization": "Q5_K_M",
  "estimated_vram_usage_gb": 27.8,
  "estimated_tokens_per_second": 142.0,
  "headroom_gb": 4.2,
  "confidence": "high",
  "reasoning": "Q5_K_M is recommended for your NVIDIA GeForce RTX 5090 (32 GB) ...",
  "alternatives": ["Q4_K_M", "Q8_0"] }

> recommend_hardware(target_model="Qwen3.5-27B", target_quantization="Q4_K_M",
                     concurrent_users=4, budget_usd=1200, priority="value")
{ "top_picks": [
    { "gpu": "NVIDIA GeForce RTX 5090", "msrp_usd": 1999, "throughput_tok_s": 94.3,
      "efficiency_tok_per_dollar": 0.047, "rank_reason": "best measured tok/$ in budget" },
    ...
  ],
  "budget_usd": 1200 }

> explain_result(gpu_name="NVIDIA GeForce RTX 5090", model="Qwen3.5-9B",
                 quantization="Q4_K_M", concurrent_users=8, n_ctx=32768)
{ "throughput_tok_s": 142.0,
  "vram_pressure": "medium",
  "pcie_context": "PCIe Gen 5 x16 — full bandwidth, no bottleneck expected",
  "percentile_rank": 0.91,
  "insight": "Top 9% throughput among all Qwen3.5-9B Q4_K_M configurations measured." }

Configuration

Env var Default Notes
HF_DATASET paulplee/ppb-results HuggingFace dataset ID
REFRESH_INTERVAL_HOURS 1 Background refresh cadence
MCP_TRANSPORT streamable-http stdio or streamable-http
HOST 0.0.0.0 HTTP bind host
PORT 9933 HTTP bind port
LOG_LEVEL INFO Python logging level

Self-hosting (Lightsail / any Ubuntu VPS)

git clone https://github.com/paulplee/ppb-mcp /tmp/ppb-mcp
cd /tmp/ppb-mcp
DOMAIN=mcp.example.com EMAIL=you@example.com ./deploy/deploy.sh

This installs Docker, builds the image, registers a systemd unit, configures nginx, and runs certbot.

Updating a self-hosted instance

Deployed via deploy.sh (systemd manages the container)

cd /opt/ppb-mcp
sudo git pull --ff-only
sudo systemctl restart ppb-mcp

The systemd unit runs docker compose pull before starting, so the new image is fetched automatically. Check that it came up cleanly with:

sudo systemctl status ppb-mcp
# or follow live logs:
journalctl -u ppb-mcp -f

Running docker compose directly (no systemd unit)

cd ~/ppb-mcp          # or wherever you cloned the repo
git pull --ff-only
docker compose build
docker compose up -d
# verify:
docker compose logs -f ppb-mcp

Development

pip install -e ".[dev]"
ruff check src tests
pytest -v

Integration tests against the live HuggingFace dataset are gated behind PPB_RUN_INTEGRATION=1 to keep CI offline-clean.

How recommendations work

  1. Tier 1 — empirical exact match (high confidence). ≥3 measured runs on a GPU at-or-below your VRAM budget at the requested concurrency.
  2. Tier 2 — empirical-near (medium). Same (model, quant) benchmarked on a different GPU at the same concurrency; throughput borrowed, VRAM scaled to your card.
  3. Tier 3 — formula extrapolation (low). vram_per_user ≈ (params_B × bits_per_weight / 8) × 1.15; viable iff total ≤ 90 % of your VRAM.

License

MIT — see LICENSE.

About

Poor Paul's Benchmark as an MCP server. Queryable GPU inference data — quantization, throughput, VRAM, concurrent users — for Claude Desktop, Cursor, Windsurf, Cline, and any MCP client. Self-host or use the free hosted endpoint.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors