
Commit e58e622

SonAIengine and claude committed

docs: restructure README — 1262 → 406 lines, split out integration/CLI/API/benchmark docs

- README.md: compress the 9 parallel integration sections into a decision table + mini examples
- Fix the hero table overclaim ("impossible" → "context overflow")
- Unify the Quick Start example on Petstore; remove the duplicated Basic Usage / Feature Comparison sections
- Add a TOC; split the integration guides into 5 files under docs/integrations/
- Move the 5 benchmark tables to docs/benchmarks.md, the CLI to docs/cli.md, and the API to docs/api-reference.md
- Sync README-ko.md (ja/zh handled separately)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent feea82f commit e58e622

11 files changed

Lines changed: 1318 additions & 1937 deletions

File tree

README-ko.md

Lines changed: 191 additions & 946 deletions
Large diffs are not rendered by default.

README.md

Lines changed: 132 additions & 988 deletions
Large diffs are not rendered by default.

docs/README.md

Lines changed: 21 additions & 3 deletions
@@ -6,6 +6,17 @@
  docs/
  ├── README.md            ← this document (index)
+ ├── cli.md               # CLI reference (all commands)
+ ├── api-reference.md     # Python API reference (ToolGraph, helpers, middleware)
+ ├── benchmarks.md        # Benchmark results (pipeline / retrieval / competitive / scale)
+ ├── integrations/        # Integration guides (by usage pattern)
+ │   ├── mcp-server.md    # MCP server mode
+ │   ├── mcp-proxy.md     # MCP proxy (multi-backend aggregation)
+ │   ├── langchain.md     # LangChain Gateway / auto-filter / retriever
+ │   ├── middleware.md    # OpenAI/Anthropic SDK 1-line patch
+ │   └── direct-api.md    # Python API + workflow planning
  ├── architecture/        # Architecture & data model
  │   ├── overview.md      # Overall architecture (pipeline, layers)
  │   └── data-model.md    # ToolSchema, RelationType, NodeType

@@ -44,11 +55,18 @@ docs/

  ## Reading order

+ **Users (people using the library)**
+ 1. **Start**: the root [README.md](../README.md)
+ 2. **CLI commands**: [cli.md](cli.md)
+ 3. **Python API**: [api-reference.md](api-reference.md)
+ 4. **Integration guides**: [integrations/](integrations/) — pick the pattern that fits your stack
+ 5. **Benchmarks**: [benchmarks.md](benchmarks.md)
+
+ **Developers/contributors (people who want to understand the internals)**
  1. **Big picture**: [architecture/overview.md](architecture/overview.md)
  2. **Progress**: [wbs/README.md](wbs/README.md)
- 3. **Current phase detail**: [wbs/phase-1-ingest.md](wbs/phase-1-ingest.md)
- 4. **Design deep dive**: `design/` directory
- 5. **Research grounding**: `research/` directory
+ 3. **Design deep dive**: `design/` directory
+ 4. **Research grounding**: `research/` directory

  ## Recently added (v0.12)

docs/api-reference.md

Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
# Python API Reference

The primary entry point is `ToolGraph`. Most workflows are: ingest a spec → call `retrieve()`.

```python
from graph_tool_call import ToolGraph

tg = ToolGraph()
tg.ingest_openapi("api.json")
tools = tg.retrieve("create a pet", top_k=5)
```

---

## `ToolGraph` methods

### Construction

| Method | Description |
|---|---|
| `ToolGraph()` | Empty graph |
| `ToolGraph.from_url(url, cache=...)` | Build from Swagger UI or spec URL (auto-discovers spec groups) |
| `ToolGraph.load(path)` | Deserialize from JSON |

### Ingestion

| Method | Description |
|---|---|
| `add_tool(tool)` | Add a single tool (auto-detects format) |
| `add_tools(tools)` | Add multiple tools |
| `ingest_openapi(source)` | Ingest from an OpenAPI / Swagger spec (file path, URL, or dict) |
| `ingest_mcp_tools(tools)` | Ingest from an MCP tool list |
| `ingest_mcp_server(url)` | Fetch and ingest from an MCP HTTP server |
| `ingest_functions(fns)` | Ingest from Python callables (uses type hints + docstrings) |
| `ingest_arazzo(source)` | Ingest an Arazzo 1.0.0 workflow spec |
| `add_relation(src, tgt, type)` | Add a manual relation between two tools |

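The ingestion methods can be mixed on one graph. A minimal sketch, assuming a local Petstore spec file and an MCP-style tool dict (both are placeholders, not files or tools shipped with the project):

```python
from graph_tool_call import ToolGraph

tg = ToolGraph()

# OpenAPI / Swagger spec from a local file (placeholder path)
tg.ingest_openapi("petstore.json")

# MCP-style tool definitions (the name/description/inputSchema dict is an assumed example)
tg.ingest_mcp_tools([
    {
        "name": "send_email",
        "description": "Send an email to a recipient",
        "inputSchema": {"type": "object", "properties": {"to": {"type": "string"}}},
    },
])

# Plain Python callables: type hints and docstrings become the schema
def get_weather(city: str) -> str:
    """Return the current weather for a city."""
    ...

tg.ingest_functions([get_weather])
```
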
### Retrieval

| Method | Description |
|---|---|
| `retrieve(query, top_k=10)` | Search and return tool list |
| `retrieve_with_scores(query, top_k=10)` | Search and return tools with confidence scores and relation hints |
| `plan_workflow(query)` | Build an ordered execution plan |
| `suggest_next(tool, history=...)` | Suggest next tools based on graph relations |
| `validate_tool_call(call)` | Validate and auto-correct a tool call |
| `assess_tool_call(call)` | Return `allow` / `confirm` / `deny` decision based on annotations |

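A quick sketch of the retrieval calls above, continuing the `tg` from the intro example. The exact shapes of the scored results, plans, and tool-call dicts are not documented here, so the code only assumes they can be iterated and printed:

```python
# Top-k retrieval for a single query
tools = tg.retrieve("create a pet", top_k=5)

# Same search, but with confidence scores and relation hints attached
scored = tg.retrieve_with_scores("create a pet", top_k=5)
for item in scored:
    print(item)  # per-item structure is library-defined

# Ordered execution plan for a multi-step request
plan = tg.plan_workflow("create a pet, then upload a photo for it")

# Safety gate: the call dict below is a hypothetical shape, not a documented format
decision = tg.assess_tool_call({"name": "deletePet", "arguments": {"petId": 1}})
print(decision)  # expected to reflect allow / confirm / deny
```
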
### Configuration

| Method | Description |
|---|---|
| `enable_embedding(provider)` | Enable hybrid embedding search (Ollama, OpenAI, vLLM, sentence-transformers, callable) |
| `enable_reranker(model)` | Enable cross-encoder reranking |
| `enable_diversity(lambda_)` | Enable MMR diversity |
| `set_weights(keyword=, graph=, embedding=, annotation=)` | Tune wRRF fusion weights |
| `auto_organize(llm=...)` | Auto-categorize tools (rule-based or LLM-enhanced) |
| `build_ontology(llm=...)` | Build complete ontology |

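Configuration layers on top of the default BM25 + graph retrieval. A sketch using the provider strings documented later in this file; the fusion weights and the MMR lambda are illustrative values, not recommended defaults:

```python
# Hybrid embedding search via a local Ollama model (name taken from the provider table below)
tg.enable_embedding("ollama/qwen3-embedding:0.6b")

# MMR diversity; 0.5 is only an illustrative lambda
tg.enable_diversity(0.5)

# Tune wRRF fusion weights (illustrative values)
tg.set_weights(keyword=1.0, graph=0.8, embedding=1.2, annotation=0.5)

# Rule-based auto-categorization; pass llm=... for LLM-enhanced organization
tg.auto_organize()
```
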
### Analysis

| Method | Description |
|---|---|
| `find_duplicates(threshold)` | Find duplicate tools across sources |
| `merge_duplicates(pairs)` | Merge detected duplicates |
| `apply_conflicts()` | Detect and add `CONFLICTS_WITH` edges |
| `analyze()` | Build operational analysis summary |

### Persistence

| Method | Description |
|---|---|
| `save(path)` | Serialize to JSON (preserves embeddings + weights when set) |
| `ToolGraph.load(path)` | Deserialize and restore retrieval state |

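A save/load round-trip; since `save()` is documented to preserve embeddings and weights, a loaded graph should be able to retrieve without re-ingesting the spec:

```python
tg.save("toolgraph.json")                # serialize the graph and retrieval state
tg2 = ToolGraph.load("toolgraph.json")   # restore later, e.g. in another process
tools = tg2.retrieve("create a pet", top_k=5)
```
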
### Export & visualization
77+
78+
| Method | Description |
79+
|---|---|
80+
| `export_html(path, progressive=True)` | Interactive HTML (vis.js) |
81+
| `export_graphml(path)` | GraphML for Gephi / yEd |
82+
| `export_cypher(path)` | Neo4j Cypher statements |
83+
| `dashboard_app()` | Build Dash Cytoscape app object |
84+
| `dashboard(port=8050)` | Launch interactive dashboard |
85+
86+
### Execution
87+
88+
| Method | Description |
89+
|---|---|
90+
| `execute(name, params, base_url=...)` | Execute an OpenAPI tool directly |
91+
92+
---
93+
94+
## Top-level helpers
95+
96+
| Function | Description |
97+
|---|---|
98+
| `filter_tools(tools, query, top_k=5)` | One-shot filter on any tool list (LangChain, OpenAI, MCP, Anthropic, callables) |
99+
| `GraphToolkit(tools, top_k=5)` | Reusable toolkit — build graph once, filter per query |
100+
101+
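A sketch of the two helpers; `my_tools` stands in for any supported tool list (LangChain, OpenAI, MCP, Anthropic, or plain callables):

```python
from graph_tool_call import filter_tools, GraphToolkit

my_tools = [...]  # placeholder: any supported tool list

# One-shot: builds a graph internally and filters for a single query
relevant = filter_tools(my_tools, "refund an order", top_k=5)

# Reusable: construct once so the graph is not rebuilt for every query
toolkit = GraphToolkit(my_tools, top_k=5)
```
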
## Middleware

| Function | Description |
|---|---|
| `patch_openai(client, graph, top_k=5)` | Auto-filter tools on OpenAI client |
| `patch_anthropic(client, graph, top_k=5)` | Auto-filter tools on Anthropic client |

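A sketch of the 1-line patch against a standard OpenAI client, assuming an existing `ToolGraph` named `tg`. The behavior assumed here (pass the full tool list, the patch narrows it per request) is inferred from the table above; `all_tool_defs` is a placeholder:

```python
from openai import OpenAI
from graph_tool_call import patch_openai

client = OpenAI()
patch_openai(client, tg, top_k=5)

# all_tool_defs is a placeholder for your existing OpenAI-format tool definitions
resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "create a pet named Bella"}],
    tools=all_tool_defs,
)
```
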
## LangChain

| Function | Description |
|---|---|
| `create_gateway_tools(tools, top_k=10)` | Convert N tools → 2 gateway meta-tools |
| `create_agent(llm, tools, top_k=5)` | Auto-filtering LangGraph agent |
| `GraphToolRetriever(tool_graph, top_k=5)` | LangChain `BaseRetriever` returning `Document` objects |
| `tool_schema_to_openai_function(tool)` | Convert `ToolSchema` → OpenAI function dict |

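A sketch of the two LangChain entry points; `my_langchain_tools` is a placeholder and `ChatOpenAI` is just one example of a LangChain chat model:

```python
from langchain_openai import ChatOpenAI
from graph_tool_call import create_gateway_tools, create_agent

my_langchain_tools = [...]  # placeholder: a large list of LangChain tools

# Option A: collapse N tools into 2 gateway meta-tools (search, then call)
gateway_tools = create_gateway_tools(my_langchain_tools, top_k=10)

# Option B: an auto-filtering LangGraph agent built for you
llm = ChatOpenAI(model="gpt-4.1")
agent = create_agent(llm, my_langchain_tools, top_k=5)
```
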
---
## Embedding provider strings

`enable_embedding()` accepts:

| Form | Example |
|---|---|
| `"ollama/<model>"` | `"ollama/qwen3-embedding:0.6b"` |
| `"openai/<model>"` | `"openai/text-embedding-3-large"` |
| `"vllm/<model>"` | `"vllm/Qwen/Qwen3-Embedding-0.6B"` |
| `"vllm/<model>@<url>"` | `"vllm/model@http://gpu-box:8000/v1"` |
| `"llamacpp/<model>@<url>"` | `"llamacpp/model@http://192.168.1.10:8080/v1"` |
| `"<url>@<model>"` | `"http://localhost:8000/v1@my-model"` |
| `"sentence-transformers/<model>"` | `"sentence-transformers/all-MiniLM-L6-v2"` |
| `callable` | `lambda texts: my_embed_fn(texts)` |

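The callable form lets you plug in any embedding backend. A sketch using sentence-transformers directly; the assumption (not stated above) is that the callable takes a list of strings and returns one vector per string:

```python
from sentence_transformers import SentenceTransformer

# Equivalent provider-string form: tg.enable_embedding("sentence-transformers/all-MiniLM-L6-v2")
model = SentenceTransformer("all-MiniLM-L6-v2")
tg.enable_embedding(lambda texts: model.encode(texts).tolist())
```
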
## Ontology LLM inputs

`auto_organize(llm=...)` accepts:

| Input | Wrapped as |
|---|---|
| `OntologyLLM` instance | Pass-through |
| `callable(str) -> str` | `CallableOntologyLLM` |
| OpenAI client (has `chat.completions`) | `OpenAIClientOntologyLLM` |
| `"ollama/model"` | `OllamaOntologyLLM` |
| `"openai/model"` | `OpenAICompatibleOntologyLLM` |
| `"litellm/model"` | litellm.completion wrapper |

docs/benchmarks.md

Lines changed: 174 additions & 0 deletions
@@ -0,0 +1,174 @@
# Benchmark Results

Detailed benchmark data for graph-tool-call. The README contains a 3-row summary; this document contains the full pipeline, retrieval-only, competitive, large-scale, and LangChain agent results.

- **Model used (LLM benchmarks)**: `qwen3:4b` (4-bit, Ollama), unless noted
- **Pipelines compared**: `baseline` (all tools), `retrieve-k3 / k5 / k10`, plus `+ embedding`, `+ ontology`
- **Reproduce**: see [Reproduce](#reproduce) at the bottom

---

## What we measure

graph-tool-call verifies two things:

1. Can performance be **maintained or improved** by giving the LLM only a subset of retrieved tools?
2. Does the **retriever itself** rank the correct tools within the top K?

These are different questions. High `Gold Tool Recall@K` does not automatically translate into high end-to-end accuracy — the LLM still has to pick the right tool from the candidate set.

### Metrics

- **End-to-end Accuracy** — did the LLM ultimately succeed in selecting the correct tool / performing the correct workflow?
- **Gold Tool Recall@K** — was the canonical gold tool included in the top K at the retrieval stage?
- **Avg tokens** — average tokens passed to the LLM
- **Token reduction** — token savings vs. baseline

> The two accuracy metrics often diverge. Evaluations that accept **alternative tools** or **equivalent workflows** as correct may show End-to-end Accuracy that doesn't exactly match Gold Tool Recall@K. `baseline` has no retrieval stage, so Gold Tool Recall@K does not apply.

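To make the distinction concrete, here is a small sketch of how the two accuracy metrics and token reduction are computed. The per-query record format is hypothetical and uses exact-match only, not the benchmark runner's actual output format:

```python
# Hypothetical per-query records; the real harness also credits equivalent tools/workflows.
results = [
    {"gold": "addPet", "retrieved": ["addPet", "updatePet", "getPetById"], "llm_choice": "addPet"},
    {"gold": "deletePet", "retrieved": ["getPetById", "findPetsByStatus", "deletePet"], "llm_choice": "getPetById"},
]

recall_at_k = sum(r["gold"] in r["retrieved"] for r in results) / len(results)   # 1.0
end_to_end  = sum(r["llm_choice"] == r["gold"] for r in results) / len(results)  # 0.5
# Recall can be perfect while the LLM still picks the wrong tool from the candidates.

# Token reduction vs. baseline, e.g. Petstore retrieve-k3: 1 - 305/1239 ≈ 75.4%
token_reduction = 1 - 305 / 1239
```
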
---

## 1. Full pipeline comparison

| Dataset | Tools | Pipeline | End-to-end Accuracy | Gold Tool Recall@K | Avg tokens | Token reduction |
|---|---:|---|---:|---:|---:|---:|
| Petstore | 19 | baseline | 100.0% | — | 1,239 | — |
| Petstore | 19 | retrieve-k3 | 90.0% | 93.3% | 305 | 75.4% |
| Petstore | 19 | retrieve-k5 | 95.0% | 98.3% | 440 | 64.4% |
| Petstore | 19 | retrieve-k10 | 100.0% | 98.3% | 720 | 41.9% |
| GitHub | 50 | baseline | 100.0% | — | 3,302 | — |
| GitHub | 50 | retrieve-k3 | 85.0% | 87.5% | 289 | 91.3% |
| GitHub | 50 | retrieve-k5 | 87.5% | 87.5% | 398 | 87.9% |
| GitHub | 50 | retrieve-k10 | 90.0% | 92.5% | 662 | 79.9% |
| Mixed MCP | 38 | baseline | 96.7% | — | 2,741 | — |
| Mixed MCP | 38 | retrieve-k3 | 86.7% | 93.3% | 328 | 88.0% |
| Mixed MCP | 38 | retrieve-k5 | 90.0% | 96.7% | 461 | 83.2% |
| Mixed MCP | 38 | retrieve-k10 | 96.7% | 100.0% | 826 | 69.9% |
| Kubernetes core/v1 | 248 | baseline | 12.0% | — | 8,192 | — |
| Kubernetes core/v1 | 248 | retrieve-k5 | 78.0% | 91.0% | 1,613 | 80.3% |
| Kubernetes core/v1 | 248 | retrieve-k5 + embedding | 80.0% | 94.0% | 1,728 | 78.9% |
| Kubernetes core/v1 | 248 | retrieve-k5 + ontology | **82.0%** | 96.0% | 1,699 | 79.3% |
| Kubernetes core/v1 | 248 | retrieve-k5 + embedding + ontology | **82.0%** | **98.0%** | 1,924 | 76.5% |

### Key insights

- **Small/medium APIs (19–50 tools)** — baseline is already strong. graph-tool-call's main value here is **64–91% token savings** with little accuracy loss.
- **Large APIs (248 tools)** — baseline collapses to **12%** due to context overload. graph-tool-call recovers performance to **78–82%** by narrowing candidates through retrieval. At this scale it's not an optimization — it's closer to a required retrieval layer.
- **`retrieve-k5` is the best default.** It offers a good token/accuracy tradeoff, and on large datasets adding embedding/ontology yields further gains.

---

## 2. Retrieval quality (BM25 + graph only)

The table below measures retrieval quality **before the LLM stage**. Only BM25 + graph traversal — no embedding or ontology.

| Dataset | Tools | Gold Tool Recall@3 | Gold Tool Recall@5 | Gold Tool Recall@10 |
|---|---:|---:|---:|---:|
| Petstore | 19 | 93.3% | **98.3%** | 98.3% |
| GitHub | 50 | 87.5% | **87.5%** | 92.5% |
| Mixed MCP | 38 | 93.3% | **96.7%** | 100.0% |
| Kubernetes core/v1 | 248 | 82.0% | **91.0%** | 92.0% |

### How to read

- **Gold Tool Recall@K** measures the retriever's ability to include the correct tool in the candidate set, **not** final LLM accuracy.
- On small datasets, `k=5` already achieves high recall.
- On large datasets, increasing `k` raises recall but also increases tokens passed to the LLM — consider both.

### Insights

- **Petstore / Mixed MCP** — `k=5` alone includes nearly all correct tools.
- **GitHub** — there's a recall gap between `k=5` and `k=10`; choose `k=10` if recall matters more than tokens.
- **Kubernetes core/v1** — even with 248 tools, `k=5` already achieves **91.0%** gold recall. The retrieval stage alone compresses the candidate set dramatically while retaining most correct tools.

---

## 3. When do embedding and ontology help?

Comparison on the largest dataset (Kubernetes core/v1, 248 tools), all on top of `retrieve-k5`.

| Pipeline | End-to-end Accuracy | Gold Tool Recall@5 | Interpretation |
|---|---:|---:|---|
| retrieve-k5 | 78.0% | 91.0% | BM25 + graph alone is a strong baseline |
| + embedding | 80.0% | 94.0% | Recovers semantically-similar but differently-worded queries |
| + ontology | **82.0%** | 96.0% | LLM-generated keywords/example queries significantly improve retrieval |
| + embedding + ontology | **82.0%** | **98.0%** | Accuracy maintained, gold recall at its highest |

- **Embedding** picks up the **semantic matches** that BM25 misses.
- **Ontology** expands the **searchable representation itself** when descriptions are short or non-standard.
- Using both together yields limited extra end-to-end gains, but **gold recall reaches its highest**.

---

## 4. Competitive benchmark (retrieval strategies)

Compared 6 retrieval strategies across 9 datasets (19–1068 tools):

| Strategy | Recall@5 | MRR | Latency |
|---|:---:|:---:|:---:|
| Vector Only (≈bigtool) | 96.8% | 0.897 | 176ms |
| BM25 Only | 91.6% | 0.819 | 1.5ms |
| BM25 + Graph (default) | 91.6% | 0.819 | 14ms |
| Full Pipeline (with embedding) | 96.8% | 0.897 | 172ms |

**Key finding** — without embedding, BM25+Graph achieves 91.6% Recall, competitive with vector search at **65× faster speed**. With embedding enabled, performance matches pure vector search.

---

## 5. Scale test: 1068 tools (GitHub full API)

| Strategy | Recall@5 | MRR | Miss% |
|---|:---:|:---:|:---:|
| Vector Only | 88.0% | 0.761 | 12.0% |
| BM25 + Graph | 78.0% | 0.643 | 22.0% |
| Full Pipeline | 88.0% | 0.761 | 12.0% |

At 1068 tools, baseline (passing all definitions) is impractical due to context size — graph-tool-call provides a working retrieval layer where vector-only and full pipeline tie.

---

## 6. LangChain agent benchmark (200 tools)

End-to-end accuracy when **200 simple tools** are registered and invoked through a LangChain agent.

- **Direct (D)** — all 200 tool definitions passed to the LLM at once
- **Graph (G)** — tools managed via the graph-tool-call gateway (search → call, 2 turns)

| Model | D-Acc | G-Acc | D-Turns | G-Turns | D-Tokens | G-Tokens | Savings | D-Time | G-Time |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| gpt-4.1 | 60.0% | 80.0% | 1.0 | 2.0 | 52,587 | 6,639 | 87.4% | 15.5s | 17.6s |
| gpt-5.2 | 60.0% | **100.0%** | 1.0 | 2.0 | 53,645 | 10,508 | 80.4% | 20.5s | 17.1s |
| gpt-5.4 | 60.0% | **100.0%** | 1.0 | 2.0 | 60,035 | 14,049 | 76.6% | 18.2s | 17.0s |
| claude-sonnet-4-20250514 | 100.0% | 100.0% | 1.0 | 2.0 | 196,183 | 17,349 | 91.2% | 58.2s | 49.4s |
| claude-sonnet-4-6 | 100.0% | 100.0% | 1.0 | 2.0 | 198,665 | 20,074 | 89.9% | 67.0s | 69.4s |
| claude-haiku-4-5 | 100.0% | 100.0% | 1.0 | 2.0 | 197,845 | 19,714 | 90.0% | 23.7s | 22.8s |

> Acc = accuracy, Turns = average agent turns, Tokens = total tokens, Savings = token reduction (D→G), Time = wall-clock.

### Key findings

- GPT-series models drop to **60% accuracy** when all 200 tools are passed directly; graph-tool-call recovers them to **80–100%**.
- Claude-series models maintain 100% accuracy either way, but graph-tool-call delivers **89–91% token savings**.
- Graph mode adds 1 extra turn (search → call) but total latency stays comparable or decreases thanks to the smaller context.
- Across all models, token reduction ranges from **76.6% to 91.2%**.

---

## Reproduce

```bash
# Retrieval quality only (fast, no LLM needed)
python -m benchmarks.run_benchmark
python -m benchmarks.run_benchmark -d k8s -v

# Pipeline benchmark (LLM comparison)
python -m benchmarks.run_benchmark --mode pipeline -m qwen3:4b
python -m benchmarks.run_benchmark --mode pipeline \
    --pipelines baseline retrieve-k3 retrieve-k5 retrieve-k10

# Save baseline and compare across runs
python -m benchmarks.run_benchmark --mode pipeline --save-baseline
python -m benchmarks.run_benchmark --mode pipeline --diff
```

See [`benchmarks/`](../benchmarks/) for dataset definitions, ground truth, and the runner source.
