LLM agents can't fit thousands of tool definitions into context.
Vector search finds similar tools, but misses the workflow they belong to.
graph-tool-call builds a tool graph and retrieves the right chain — not just one match.
| Without retrieval | graph-tool-call | |
|---|---|---|
| 248 tools (K8s API) | 12% accuracy | 82% accuracy |
| 1068 tools (GitHub full API) | context overflow | 78% Recall@5 |
| Token usage | 8,192 tok | 1,699 tok (79% ↓) |
Measured with qwen3:4b (4-bit) — full benchmark
Table of Contents
LLM agents need tools. But as tool count grows, two things break:
- Context overflow — 248 Kubernetes API endpoints = 8,192 tokens of tool definitions. The LLM chokes and accuracy drops to 12%.
- Vector search misses workflows — Searching "cancel my order" finds
cancelOrder, but the actual flow islistOrders → getOrder → cancelOrder → processRefund. Vector search returns one tool; you need the chain.
graph-tool-call solves both. It models tool relationships as a graph, retrieves multi-step workflows via hybrid search (BM25 + graph traversal + embedding + MCP annotations), and cuts token usage by 64–91% while maintaining or improving accuracy.
| Scenario | Vector-only | graph-tool-call |
|---|---|---|
| "cancel my order" | Returns cancelOrder |
listOrders → getOrder → cancelOrder → processRefund |
| "read and save file" | Returns read_file |
read_file + write_file (COMPLEMENTARY relation) |
| "delete old records" | Returns any tool matching "delete" | Destructive tools ranked first via MCP annotations |
| "now cancel it" (after listing orders) | No context from history | Demotes used tools, boosts next-step tools |
| Multiple Swagger specs with overlapping tools | Duplicate tools in results | Cross-source auto-deduplication |
| 1,200 API endpoints | Slow, noisy results | Categorized + graph traversal for precise retrieval |
OpenAPI / MCP / Python functions → Ingest → Build tool graph → Hybrid retrieve → Agent
Example — User says "cancel my order and process a refund"
Vector search finds cancelOrder. But the actual workflow is:
┌──────────┐
PRECEDES │listOrders│ PRECEDES
┌─────────┤ ├──────────┐
▼ └──────────┘ ▼
┌──────────┐ ┌───────────┐
│ getOrder │ │cancelOrder│
└──────────┘ └─────┬─────┘
│ COMPLEMENTARY
▼
┌──────────────┐
│processRefund │
└──────────────┘
graph-tool-call returns the entire chain, not just one tool. Retrieval combines four signals via weighted Reciprocal Rank Fusion (wRRF):
- BM25 — keyword matching
- Graph traversal — relation-based expansion (PRECEDES, REQUIRES, COMPLEMENTARY)
- Embedding similarity — semantic search (optional, any provider)
- MCP annotations — read-only / destructive / idempotent hints
The core package has zero dependencies — just Python standard library. Install only what you need:
pip install graph-tool-call # core (BM25 + graph) — no dependencies
pip install graph-tool-call[embedding] # + embedding, cross-encoder reranker
pip install graph-tool-call[openapi] # + YAML support for OpenAPI specs
pip install graph-tool-call[mcp] # + MCP server / proxy mode
pip install graph-tool-call[all] # everythingAll extras
| Extra | Installs | When to use |
|---|---|---|
openapi |
pyyaml | YAML OpenAPI specs |
embedding |
numpy | Semantic search (connect to Ollama/OpenAI/vLLM) |
embedding-local |
numpy, sentence-transformers | Local sentence-transformers models |
similarity |
rapidfuzz | Duplicate detection |
langchain |
langchain-core | LangChain integration |
visualization |
pyvis, networkx | HTML graph export, GraphML |
dashboard |
dash, dash-cytoscape | Interactive dashboard |
lint |
ai-api-lint | Auto-fix bad API specs |
mcp |
mcp | MCP server / proxy mode |
uvx graph-tool-call search "user authentication" \
--source https://petstore.swagger.io/v2/swagger.jsonQuery: "user authentication"
Source: https://petstore.swagger.io/v2/swagger.json (19 tools)
Results (5):
1. getUserByName — Get user by user name
2. deleteUser — Delete user
3. createUser — Create user
4. loginUser — Logs user into the system
5. updateUser — Updated user
from graph_tool_call import ToolGraph
# Build a tool graph from the official Petstore API
tg = ToolGraph.from_url(
"https://petstore3.swagger.io/api/v3/openapi.json",
cache="petstore.json",
)
print(tg)
# → ToolGraph(tools=19, nodes=22, edges=100)
# Search for tools
tools = tg.retrieve("create a new pet", top_k=5)
for t in tools:
print(f"{t.name}: {t.description}")
# Search with workflow guidance
results = tg.retrieve_with_scores("process an order", top_k=5)
for r in results:
print(f"{r.tool.name} [{r.confidence}]")
for rel in r.relations:
print(f" → {rel.hint}")
# Execute an OpenAPI tool directly
result = tg.execute(
"addPet", {"name": "Buddy", "status": "available"},
base_url="https://petstore3.swagger.io/api/v3",
)plan_workflow() returns ordered execution chains with prerequisites — reducing agent round-trips from 3-4 to 1.
plan = tg.plan_workflow("process a refund")
for step in plan.steps:
print(f"{step.order}. {step.tool.name} — {step.reason}")
# 1. getOrder — prerequisite for requestRefund
# 2. requestRefund — primary action
plan.save("refund_workflow.json")Edit, parameterize, and visualize workflows — see Direct API guide.
# From an MCP server (HTTP JSON-RPC tools/list)
tg.ingest_mcp_server("https://mcp.example.com/mcp")
# From an MCP tool list (annotations preserved)
tg.ingest_mcp_tools(mcp_tools, server_name="filesystem")
# From Python callables (type hints + docstrings)
tg.ingest_functions([read_file, write_file])MCP annotations (readOnlyHint, destructiveHint, idempotentHint, openWorldHint) are used as retrieval signals — query intent is automatically classified, and read queries prioritize read-only tools while delete queries prioritize destructive tools.
graph-tool-call ships several integration patterns. Pick the one that matches your stack:
| You're using... | Pattern | Token win | Guide |
|---|---|---|---|
| Claude Code / Cursor / Windsurf | MCP Proxy (aggregate N MCP servers → 3 meta-tools) | ~1,200 tok/turn | docs/integrations/mcp-proxy.md |
| Any MCP-compatible client | MCP Server (single source as MCP) | varies | docs/integrations/mcp-server.md |
| LangChain / LangGraph (50+ tools) | Gateway tools (N tools → 2 meta-tools) | 92% | docs/integrations/langchain.md |
| OpenAI / Anthropic SDK (existing code) | Middleware (1-line monkey-patch) | 76–91% | docs/integrations/middleware.md |
| Direct control over retrieval | Python API (retrieve() + format adapter) |
varies | docs/integrations/direct-api.md |
When you have many MCP servers, their tool names pile up in every LLM turn. Bundle them behind one server: 172 tools → 3 meta-tools.
# 1. Create ~/backends.json listing your MCP servers
# 2. Register the proxy with Claude Code
claude mcp add -s user tool-proxy -- \
uvx "graph-tool-call[mcp]" proxy --config ~/backends.jsonFull setup, passthrough mode, remote transport → MCP Proxy guide.
from graph_tool_call.langchain import create_gateway_tools
# 62 tools from Slack, GitHub, Jira, MS365...
gateway = create_gateway_tools(all_tools, top_k=10)
# → [search_tools, call_tool] — only 2 tools in context
agent = create_react_agent(model=llm, tools=gateway)92% token reduction vs binding all 62 tools. See LangChain guide for auto-filter and manual patterns.
from graph_tool_call.middleware import patch_openai
patch_openai(client, graph=tg, top_k=5) # ← add this one line
# Existing code unchanged — 248 tools go in, only 5 relevant ones are sent
response = client.chat.completions.create(
model="gpt-4o",
tools=all_248_tools,
messages=messages,
)Also works with Anthropic via patch_anthropic. See Middleware guide.
Two questions: (1) Does the LLM still pick the right tool when given only the retrieved subset? (2) Does the retriever itself rank correct tools in the top K?
| Dataset | Tools | Baseline acc | graph-tool-call | Token reduction |
|---|---|---|---|---|
| Petstore | 19 | 100% | 95% (k=5) | 64% |
| GitHub | 50 | 100% | 88% (k=5) | 88% |
| Mixed MCP | 38 | 97% | 90% (k=5) | 83% |
| Kubernetes core/v1 | 248 | 12% | 82% (k=5 + ontology) | 79% |
Key finding — at 248 tools, baseline collapses (context overflow) to 12% while graph-tool-call recovers to 82%. At smaller scales, baseline is already strong, so graph-tool-call's value is token savings without accuracy loss.
→ Full results (pipeline / retrieval-only / competitive / 1068-scale / 200-tool LangChain agent across GPT and Claude): docs/benchmarks.md
# Reproduce
python -m benchmarks.run_benchmark # retrieval only
python -m benchmarks.run_benchmark --mode pipeline -m qwen3:4b # full pipelineAdd semantic search on top of BM25 + graph. No heavy dependencies needed — connect to any external embedding server.
tg.enable_embedding("ollama/qwen3-embedding:0.6b") # Ollama (recommended)
tg.enable_embedding("openai/text-embedding-3-large") # OpenAI
tg.enable_embedding("vllm/Qwen/Qwen3-Embedding-0.6B") # vLLM
tg.enable_embedding("sentence-transformers/all-MiniLM-L6-v2") # local
tg.enable_embedding(lambda texts: my_embed_fn(texts)) # custom callableWeights are auto-rebalanced. See API reference for all provider forms.
tg.enable_reranker() # cross-encoder rerank
tg.enable_diversity(lambda_=0.7) # MMR diversity
tg.set_weights(keyword=0.2, graph=0.5, embedding=0.3, annotation=0.2)Pass previously called tools to demote them and boost next-step candidates.
tools = tg.retrieve("now cancel it", history=["listOrders", "getOrder"])
# → [cancelOrder, processRefund, ...]tg.save("my_graph.json")
tg = ToolGraph.load("my_graph.json")
# Or use cache= in from_url() for automatic save/load
tg = ToolGraph.from_url(url, cache="my_graph.json")tg.auto_organize(llm="ollama/qwen2.5:7b")
tg.auto_organize(llm="litellm/claude-sonnet-4-20250514")
tg.auto_organize(llm=openai.OpenAI())Builds richer categories, relations, and search keywords. Supports Ollama, OpenAI clients, litellm, and any callable. See API reference.
| Feature | API | Docs |
|---|---|---|
| Duplicate detection across specs | find_duplicates / merge_duplicates |
API ref |
| Conflict detection | apply_conflicts |
API ref |
| Operational analysis | analyze |
API ref |
| Interactive dashboard | dashboard() |
API ref |
| HTML / GraphML / Cypher export | export_html / export_graphml / export_cypher |
API ref |
| Auto-fix bad OpenAPI specs | from_url(url, lint=True) |
ai-api-lint |
| Doc | Description |
|---|---|
| CLI reference | All graph-tool-call CLI commands |
| Python API reference | ToolGraph methods, helpers, middleware, LangChain |
| Integrations | MCP server / proxy, LangChain, middleware, direct API |
| Benchmark results | Full pipeline / retrieval / competitive / scale tables |
| Architecture | System overview, pipeline layers, data model |
| Design notes | Algorithm design — normalization, dependency detection, ontology |
| Research | Competitive analysis, API scale data |
| Release checklist | Release process, changelog flow |
Contributions are welcome.
git clone https://github.com/SonAIengine/graph-tool-call.git
cd graph-tool-call
pip install poetry pre-commit
poetry install --with dev --all-extras
pre-commit install # auto-runs ruff on every commit
# Test, lint, benchmark
poetry run pytest -v
poetry run ruff check . && poetry run ruff format --check .
python -m benchmarks.run_benchmark -v