# Benchmark Results

Detailed benchmark data for graph-tool-call. The README carries a 3-row summary; this document holds the full pipeline, retrieval-only, competitive, large-scale, and LangChain agent results.

- **Model used (LLM benchmarks)**: `qwen3:4b` (4-bit, Ollama), unless noted
- **Pipelines compared**: `baseline` (all tools), `retrieve-k3 / k5 / k10`, plus `+ embedding`, `+ ontology`
- **Reproduce**: see [Reproduce](#reproduce) at the bottom

---

## What we measure

graph-tool-call verifies two things:

1. Can performance be **maintained or improved** by giving the LLM only a subset of retrieved tools?
2. Does the **retriever itself** rank the correct tools within the top K?

These are different questions. A retriever that achieves high `Gold Tool Recall@K` does not automatically translate to high end-to-end accuracy — the LLM still has to pick the right tool from the candidate set.

### Metrics

- **End-to-end Accuracy** — did the LLM ultimately select the correct tool or carry out the correct workflow?
- **Gold Tool Recall@K** — was the canonical gold tool included in the top K candidates at the retrieval stage?
- **Avg tokens** — average tokens passed to the LLM
- **Token reduction** — token savings vs. `baseline`

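To make the definitions concrete, here is a minimal sketch of how the two headline metrics can be computed. The `gold_tool` / `ranked_tools` field names are hypothetical, not the benchmark runner's actual schema:

```python
# Minimal sketch: Gold Tool Recall@K and token reduction.
# The result-record fields below are hypothetical, for illustration only.

def gold_tool_recall_at_k(results: list[dict], k: int) -> float:
    """Fraction of queries whose gold tool appears in the top-k candidates."""
    hits = sum(1 for r in results if r["gold_tool"] in r["ranked_tools"][:k])
    return hits / len(results)

def token_reduction(avg_tokens: float, baseline_tokens: float) -> float:
    """Token savings vs. baseline, as a percentage."""
    return 100.0 * (1.0 - avg_tokens / baseline_tokens)

# Petstore retrieve-k3 row from the table below: 305 vs. 1,239 baseline tokens.
print(f"{token_reduction(305, 1239):.1f}%")  # 75.4%
```
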
> The two accuracy metrics often diverge. Evaluations that accept **alternative tools** or **equivalent workflows** as correct may show End-to-end Accuracy that doesn't exactly match Gold Tool Recall@K. `baseline` has no retrieval stage, so Gold Tool Recall@K does not apply.

---

## 1. Full pipeline comparison

| Dataset | Tools | Pipeline | End-to-end Accuracy | Gold Tool Recall@K | Avg tokens | Token reduction |
|---|---:|---|---:|---:|---:|---:|
| Petstore | 19 | baseline | 100.0% | — | 1,239 | — |
| Petstore | 19 | retrieve-k3 | 90.0% | 93.3% | 305 | 75.4% |
| Petstore | 19 | retrieve-k5 | 95.0% | 98.3% | 440 | 64.4% |
| Petstore | 19 | retrieve-k10 | 100.0% | 98.3% | 720 | 41.9% |
| GitHub | 50 | baseline | 100.0% | — | 3,302 | — |
| GitHub | 50 | retrieve-k3 | 85.0% | 87.5% | 289 | 91.3% |
| GitHub | 50 | retrieve-k5 | 87.5% | 87.5% | 398 | 87.9% |
| GitHub | 50 | retrieve-k10 | 90.0% | 92.5% | 662 | 79.9% |
| Mixed MCP | 38 | baseline | 96.7% | — | 2,741 | — |
| Mixed MCP | 38 | retrieve-k3 | 86.7% | 93.3% | 328 | 88.0% |
| Mixed MCP | 38 | retrieve-k5 | 90.0% | 96.7% | 461 | 83.2% |
| Mixed MCP | 38 | retrieve-k10 | 96.7% | 100.0% | 826 | 69.9% |
| Kubernetes core/v1 | 248 | baseline | 12.0% | — | 8,192 | — |
| Kubernetes core/v1 | 248 | retrieve-k5 | 78.0% | 91.0% | 1,613 | 80.3% |
| Kubernetes core/v1 | 248 | retrieve-k5 + embedding | 80.0% | 94.0% | 1,728 | 78.9% |
| Kubernetes core/v1 | 248 | retrieve-k5 + ontology | **82.0%** | 96.0% | 1,699 | 79.3% |
| Kubernetes core/v1 | 248 | retrieve-k5 + embedding + ontology | **82.0%** | **98.0%** | 1,924 | 76.5% |

### Key insights

- **Small/medium APIs (19–50 tools)** — baseline is already strong. graph-tool-call's main value here is **64–91% token savings** with little accuracy loss.
- **Large APIs (248 tools)** — baseline collapses to **12%** due to context overload. graph-tool-call recovers performance to **78–82%** by narrowing candidates through retrieval. At this scale it's not an optimization — it's closer to a required retrieval layer.
- **`retrieve-k5` is the best default**, with a good token/accuracy tradeoff. On large datasets, adding embedding/ontology yields further gains.

---

## 2. Retrieval quality (BM25 + graph only)

The table below measures retrieval quality **before the LLM stage**: BM25 + graph traversal only, with no embedding or ontology.

| Dataset | Tools | Gold Tool Recall@3 | Gold Tool Recall@5 | Gold Tool Recall@10 |
|---|---:|---:|---:|---:|
| Petstore | 19 | 93.3% | **98.3%** | 98.3% |
| GitHub | 50 | 87.5% | **87.5%** | 92.5% |
| Mixed MCP | 38 | 93.3% | **96.7%** | 100.0% |
| Kubernetes core/v1 | 248 | 82.0% | **91.0%** | 92.0% |

### How to read

- **Gold Tool Recall@K** measures the retriever's ability to include the correct tool in the candidate set, **not** final LLM accuracy.
- On small datasets, `k=5` already achieves high recall.
- On large datasets, increasing `k` raises recall but also increases the tokens passed to the LLM — weigh both.

### Insights

- **Petstore / Mixed MCP** — `k=5` alone captures nearly all correct tools.
- **GitHub** — there's a recall gap between `k=5` and `k=10`; choose `k=10` if recall matters more than tokens.
- **Kubernetes core/v1** — even with 248 tools, `k=5` already achieves **91.0%** gold recall. The retrieval stage alone compresses the candidate set dramatically while retaining most correct tools. A sketch of the lexical half of this stage follows.

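As an illustration of the BM25 half of this stage, the sketch below ranks a toy tool corpus with the third-party `rank_bm25` package. The tool names and top-k cutoff are made up, and graph-tool-call's own retriever additionally layers graph traversal on top, which is omitted here:

```python
# Illustrative BM25 candidate retrieval over a toy tool corpus.
# graph-tool-call also adds graph traversal on top; omitted here.
from rank_bm25 import BM25Okapi

# Hypothetical tools: name -> description.
tools = {
    "getPetById": "Find pet by ID. Returns a single pet.",
    "addPet": "Add a new pet to the store.",
    "placeOrder": "Place an order for a pet.",
}
corpus = [f"{name} {desc}".lower().split() for name, desc in tools.items()]
bm25 = BM25Okapi(corpus)

query = "look up a pet using its id".lower().split()
scores = bm25.get_scores(query)

# The top-k tools become the candidate set handed to the LLM.
k = 2
ranked = sorted(zip(tools, scores), key=lambda pair: pair[1], reverse=True)
print([name for name, _ in ranked[:k]])  # ['getPetById', ...]
```
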
---

## 3. When do embedding and ontology help?

Comparison on the largest dataset (Kubernetes core/v1, 248 tools), all on top of `retrieve-k5`.

| Pipeline | End-to-end Accuracy | Gold Tool Recall@5 | Interpretation |
|---|---:|---:|---|
| retrieve-k5 | 78.0% | 91.0% | BM25 + graph alone is a strong baseline |
| + embedding | 80.0% | 94.0% | Recovers semantically similar but differently worded queries |
| + ontology | **82.0%** | 96.0% | LLM-generated keywords/example queries significantly improve retrieval |
| + embedding + ontology | **82.0%** | **98.0%** | Accuracy maintained, gold recall at its highest |

- **Embedding** compensates for the **semantic similarity** that BM25 misses.
- **Ontology** **expands the searchable representation itself** when descriptions are short or non-standard.
- Using both together yields limited extra end-to-end gains, but **gold recall reaches its highest**; see the fusion sketch below.

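One common way to combine the two signals is weighted score fusion. The sketch below is a generic illustration with an assumed 50/50 weighting and min-max normalization, not graph-tool-call's actual formula:

```python
# Generic BM25 + embedding score fusion (assumed weights, for illustration).
import numpy as np

def minmax(x: np.ndarray) -> np.ndarray:
    """Scale scores to [0, 1] so the two signals are comparable."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span else np.zeros_like(x)

def hybrid_scores(bm25_scores: np.ndarray, query_emb: np.ndarray,
                  tool_embs: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    # Cosine similarity between the query and each tool embedding.
    sims = tool_embs @ query_emb / (
        np.linalg.norm(tool_embs, axis=1) * np.linalg.norm(query_emb))
    return alpha * minmax(bm25_scores) + (1 - alpha) * minmax(sims)
```
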
---

## 4. Competitive benchmark (retrieval strategies)

Six retrieval strategies were compared across 9 datasets (19–1068 tools); four of them are shown here:

| Strategy | Recall@5 | MRR | Latency |
|---|:---:|:---:|:---:|
| Vector Only (≈bigtool) | 96.8% | 0.897 | 176ms |
| BM25 Only | 91.6% | 0.819 | 1.5ms |
| BM25 + Graph (default) | 91.6% | 0.819 | 14ms |
| Full Pipeline (with embedding) | 96.8% | 0.897 | 172ms |

**Key finding** — without embedding, BM25 + Graph achieves 91.6% Recall@5, close to vector search at roughly 12× lower latency (14ms vs. 176ms). With embedding enabled, performance matches pure vector search.

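MRR here is standard Mean Reciprocal Rank: the average of 1/rank of the gold tool, with a complete miss counted as 0. A minimal sketch, reusing the hypothetical result-record fields from the earlier metrics example:

```python
# Mean Reciprocal Rank over hypothetical result records (see earlier sketch).
def mean_reciprocal_rank(results: list[dict]) -> float:
    total = 0.0
    for r in results:
        ranked, gold = r["ranked_tools"], r["gold_tool"]
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)  # rank is 1-based
    return total / len(results)
```
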
---

## 5. Scale test: 1068 tools (GitHub full API)

| Strategy | Recall@5 | MRR | Miss% |
|---|:---:|:---:|:---:|
| Vector Only | 88.0% | 0.761 | 12.0% |
| BM25 + Graph | 78.0% | 0.643 | 22.0% |
| Full Pipeline | 88.0% | 0.761 | 12.0% |

At 1068 tools, the baseline of passing every tool definition is impractical due to context size, so a retrieval layer is required rather than optional. Vector-only and the full pipeline tie here (Miss% is the share of queries whose gold tool never enters the top 5), while BM25 + Graph falls behind at this scale.

---

## 6. LangChain agent benchmark (200 tools)

End-to-end accuracy when **200 simple tools** are registered and invoked through a LangChain agent.

- **Direct (D)** — all 200 tool definitions passed to the LLM at once
- **Graph (G)** — tools managed via the graph-tool-call gateway (search → call, 2 turns; see the sketch at the end of this section)

| Model | D-Acc | G-Acc | D-Turns | G-Turns | D-Tokens | G-Tokens | Savings | D-Time | G-Time |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| gpt-4.1 | 60.0% | 80.0% | 1.0 | 2.0 | 52,587 | 6,639 | 87.4% | 15.5s | 17.6s |
| gpt-5.2 | 60.0% | **100.0%** | 1.0 | 2.0 | 53,645 | 10,508 | 80.4% | 20.5s | 17.1s |
| gpt-5.4 | 60.0% | **100.0%** | 1.0 | 2.0 | 60,035 | 14,049 | 76.6% | 18.2s | 17.0s |
| claude-sonnet-4-20250514 | 100.0% | 100.0% | 1.0 | 2.0 | 196,183 | 17,349 | 91.2% | 58.2s | 49.4s |
| claude-sonnet-4-6 | 100.0% | 100.0% | 1.0 | 2.0 | 198,665 | 20,074 | 89.9% | 67.0s | 69.4s |
| claude-haiku-4-5 | 100.0% | 100.0% | 1.0 | 2.0 | 197,845 | 19,714 | 90.0% | 23.7s | 22.8s |

> Acc = accuracy, Turns = average agent turns, Tokens = total tokens, Savings = token reduction (D→G), Time = wall-clock.

### Key findings

- GPT-series models drop to **60% accuracy** when all 200 tools are passed directly; graph-tool-call recovers them to **80–100%**.
- Claude-series models maintain 100% accuracy either way, but graph-tool-call still delivers **89–91% token savings**.
- Graph mode adds one extra turn (search → call), yet total latency stays comparable or even decreases thanks to the smaller context.
- Across all models, token reduction ranges from **76.6% to 91.2%**.

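For reference, a minimal sketch of the search → call gateway pattern measured above, written with `langchain_core`'s `@tool` decorator. `REGISTRY` and the keyword scoring are stand-ins; graph-tool-call's real gateway API may differ:

```python
# Minimal search -> call gateway sketch (stand-in registry and scoring;
# not graph-tool-call's actual API).
from langchain_core.tools import tool

REGISTRY: dict[str, dict] = {}  # name -> {"description": str, "fn": callable}

@tool
def search_tools(query: str, k: int = 5) -> list[str]:
    """Return the k registered tools most relevant to the query."""
    def score(meta: dict) -> int:
        # Naive keyword overlap; the real retriever uses BM25 + graph.
        return sum(w in meta["description"].lower() for w in query.lower().split())
    ranked = sorted(REGISTRY.items(), key=lambda kv: score(kv[1]), reverse=True)
    return [f'{name}: {meta["description"]}' for name, meta in ranked[:k]]

@tool
def call_tool(name: str, arguments: dict) -> str:
    """Invoke a registered tool by name with the given arguments."""
    return str(REGISTRY[name]["fn"](**arguments))

# An agent bound to just these two tools sees a tiny, constant-size tool
# schema instead of all 200 definitions: the source of the token savings.
```

The agent then takes two turns per task (one search, one call), which matches the G-Turns column above.
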
---

## Reproduce

```bash
# Retrieval quality only (fast, no LLM needed)
python -m benchmarks.run_benchmark
python -m benchmarks.run_benchmark -d k8s -v

# Pipeline benchmark (LLM comparison)
python -m benchmarks.run_benchmark --mode pipeline -m qwen3:4b
python -m benchmarks.run_benchmark --mode pipeline \
  --pipelines baseline retrieve-k3 retrieve-k5 retrieve-k10

# Save baseline and compare across runs
python -m benchmarks.run_benchmark --mode pipeline --save-baseline
python -m benchmarks.run_benchmark --mode pipeline --diff
```

See [`benchmarks/`](../benchmarks/) for dataset definitions, ground truth, and the runner source.