# Benchmark Results

Detailed benchmark data for graph-tool-call. The README carries a 3-row summary; this document holds the full pipeline, retrieval-only, competitive, large-scale, and LangChain agent results.

- **Model used (LLM benchmarks)**: `qwen3:4b` (4-bit, Ollama), unless noted
- **Pipelines compared**: `baseline` (all tools), `retrieve-k3 / k5 / k10`, plus `+ embedding`, `+ ontology`
- **Reproduce**: see [Reproduce](#reproduce) at the bottom

---

## What we measure

graph-tool-call verifies two things:

1. Can performance be **maintained or improved** by giving the LLM only a subset of retrieved tools?
2. Does the **retriever itself** rank the correct tools within the top K?

These are different questions. A retriever that achieves high `Gold Tool Recall@K` does not automatically translate to high end-to-end accuracy — the LLM still has to pick the right tool from the candidate set.

### Metrics

- **End-to-end Accuracy** — did the LLM ultimately select the correct tool or carry out the correct workflow?
- **Gold Tool Recall@K** — was the canonical gold tool included in the top K candidates at the retrieval stage?
- **Avg tokens** — average tokens passed to the LLM
- **Token reduction** — token savings vs. `baseline`

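To make the definitions concrete, here is a minimal sketch of how the two headline metrics can be computed. The `gold_tool` / `ranked_tools` field names are hypothetical, not the benchmark runner's actual schema:

```python
# Minimal sketch: Gold Tool Recall@K and token reduction.
# The result-record fields below are hypothetical, for illustration only.

def gold_tool_recall_at_k(results: list[dict], k: int) -> float:
    """Fraction of queries whose gold tool appears in the top-k candidates."""
    hits = sum(1 for r in results if r["gold_tool"] in r["ranked_tools"][:k])
    return hits / len(results)

def token_reduction(avg_tokens: float, baseline_tokens: float) -> float:
    """Token savings vs. baseline, as a percentage."""
    return 100.0 * (1.0 - avg_tokens / baseline_tokens)

# Petstore retrieve-k3 row from the table below: 305 vs. 1,239 baseline tokens.
print(f"{token_reduction(305, 1239):.1f}%")  # 75.4%
```
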
> The two accuracy metrics often diverge. Evaluations that accept **alternative tools** or **equivalent workflows** as correct may show End-to-end Accuracy that doesn't exactly match Gold Tool Recall@K. `baseline` has no retrieval stage, so Gold Tool Recall@K does not apply.

---

## 1. Full pipeline comparison

| Dataset | Tools | Pipeline | End-to-end Accuracy | Gold Tool Recall@K | Avg tokens | Token reduction |
|---|---:|---|---:|---:|---:|---:|
| Petstore | 19 | baseline | 100.0% | — | 1,239 | — |
| Petstore | 19 | retrieve-k3 | 90.0% | 93.3% | 305 | 75.4% |
| Petstore | 19 | retrieve-k5 | 95.0% | 98.3% | 440 | 64.4% |
| Petstore | 19 | retrieve-k10 | 100.0% | 98.3% | 720 | 41.9% |
| GitHub | 50 | baseline | 100.0% | — | 3,302 | — |
| GitHub | 50 | retrieve-k3 | 85.0% | 87.5% | 289 | 91.3% |
| GitHub | 50 | retrieve-k5 | 87.5% | 87.5% | 398 | 87.9% |
| GitHub | 50 | retrieve-k10 | 90.0% | 92.5% | 662 | 79.9% |
| Mixed MCP | 38 | baseline | 96.7% | — | 2,741 | — |
| Mixed MCP | 38 | retrieve-k3 | 86.7% | 93.3% | 328 | 88.0% |
| Mixed MCP | 38 | retrieve-k5 | 90.0% | 96.7% | 461 | 83.2% |
| Mixed MCP | 38 | retrieve-k10 | 96.7% | 100.0% | 826 | 69.9% |
| Kubernetes core/v1 | 248 | baseline | 12.0% | — | 8,192 | — |
| Kubernetes core/v1 | 248 | retrieve-k5 | 78.0% | 91.0% | 1,613 | 80.3% |
| Kubernetes core/v1 | 248 | retrieve-k5 + embedding | 80.0% | 94.0% | 1,728 | 78.9% |
| Kubernetes core/v1 | 248 | retrieve-k5 + ontology | **82.0%** | 96.0% | 1,699 | 79.3% |
| Kubernetes core/v1 | 248 | retrieve-k5 + embedding + ontology | **82.0%** | **98.0%** | 1,924 | 76.5% |

### Key insights

- **Small/medium APIs (19–50 tools)** — baseline is already strong. graph-tool-call's main value here is **64–91% token savings** with little accuracy loss.
- **Large APIs (248 tools)** — baseline collapses to **12%** due to context overload. graph-tool-call recovers performance to **78–82%** by narrowing candidates through retrieval. At this scale it's not an optimization — it's closer to a required retrieval layer.
- **`retrieve-k5` is the best default**, with a good token/accuracy tradeoff. On large datasets, adding embedding/ontology yields further gains.

---

## 2. Retrieval quality (BM25 + graph only)

The table below measures retrieval quality **before the LLM stage**: BM25 + graph traversal only, with no embedding or ontology.

| Dataset | Tools | Gold Tool Recall@3 | Gold Tool Recall@5 | Gold Tool Recall@10 |
|---|---:|---:|---:|---:|
| Petstore | 19 | 93.3% | **98.3%** | 98.3% |
| GitHub | 50 | 87.5% | **87.5%** | 92.5% |
| Mixed MCP | 38 | 93.3% | **96.7%** | 100.0% |
| Kubernetes core/v1 | 248 | 82.0% | **91.0%** | 92.0% |

### How to read

- **Gold Tool Recall@K** measures the retriever's ability to include the correct tool in the candidate set, **not** final LLM accuracy.
- On small datasets, `k=5` already achieves high recall.
- On large datasets, increasing `k` raises recall but also increases the tokens passed to the LLM — weigh both.

### Insights

- **Petstore / Mixed MCP** — `k=5` alone captures nearly all correct tools.
- **GitHub** — there's a recall gap between `k=5` and `k=10`; choose `k=10` if recall matters more than tokens.
- **Kubernetes core/v1** — even with 248 tools, `k=5` already achieves **91.0%** gold recall. The retrieval stage alone compresses the candidate set dramatically while retaining most correct tools. A sketch of the lexical half of this stage follows.

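As an illustration of the BM25 half of this stage, the sketch below ranks a toy tool corpus with the third-party `rank_bm25` package. The tool names and top-k cutoff are made up, and graph-tool-call's own retriever additionally layers graph traversal on top, which is omitted here:

```python
# Illustrative BM25 candidate retrieval over a toy tool corpus.
# graph-tool-call also adds graph traversal on top; omitted here.
from rank_bm25 import BM25Okapi

# Hypothetical tools: name -> description.
tools = {
    "getPetById": "Find pet by ID. Returns a single pet.",
    "addPet": "Add a new pet to the store.",
    "placeOrder": "Place an order for a pet.",
}
corpus = [f"{name} {desc}".lower().split() for name, desc in tools.items()]
bm25 = BM25Okapi(corpus)

query = "look up a pet using its id".lower().split()
scores = bm25.get_scores(query)

# The top-k tools become the candidate set handed to the LLM.
k = 2
ranked = sorted(zip(tools, scores), key=lambda pair: pair[1], reverse=True)
print([name for name, _ in ranked[:k]])  # ['getPetById', ...]
```
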
---

## 3. When do embedding and ontology help?

Comparison on the largest dataset (Kubernetes core/v1, 248 tools), all on top of `retrieve-k5`.

| Pipeline | End-to-end Accuracy | Gold Tool Recall@5 | Interpretation |
|---|---:|---:|---|
| retrieve-k5 | 78.0% | 91.0% | BM25 + graph alone is a strong baseline |
| + embedding | 80.0% | 94.0% | Recovers semantically similar but differently worded queries |
| + ontology | **82.0%** | 96.0% | LLM-generated keywords/example queries significantly improve retrieval |
| + embedding + ontology | **82.0%** | **98.0%** | Accuracy maintained, gold recall at its highest |

- **Embedding** compensates for the **semantic similarity** that BM25 misses.
- **Ontology** **expands the searchable representation itself** when descriptions are short or non-standard.
- Using both together yields limited extra end-to-end gains, but **gold recall reaches its highest**; see the fusion sketch below.

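One common way to combine the two signals is weighted score fusion. The sketch below is a generic illustration with an assumed 50/50 weighting and min-max normalization, not graph-tool-call's actual formula:

```python
# Generic BM25 + embedding score fusion (assumed weights, for illustration).
import numpy as np

def minmax(x: np.ndarray) -> np.ndarray:
    """Scale scores to [0, 1] so the two signals are comparable."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span else np.zeros_like(x)

def hybrid_scores(bm25_scores: np.ndarray, query_emb: np.ndarray,
                  tool_embs: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    # Cosine similarity between the query and each tool embedding.
    sims = tool_embs @ query_emb / (
        np.linalg.norm(tool_embs, axis=1) * np.linalg.norm(query_emb))
    return alpha * minmax(bm25_scores) + (1 - alpha) * minmax(sims)
```
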
---

## 4. Competitive benchmark (retrieval strategies)

Six retrieval strategies were compared across 9 datasets (19–1068 tools); four of them are shown here:

| Strategy | Recall@5 | MRR | Latency |
|---|:---:|:---:|:---:|
| Vector Only (≈bigtool) | 96.8% | 0.897 | 176ms |
| BM25 Only | 91.6% | 0.819 | 1.5ms |
| BM25 + Graph (default) | 91.6% | 0.819 | 14ms |
| Full Pipeline (with embedding) | 96.8% | 0.897 | 172ms |

**Key finding** — without embedding, BM25 + Graph achieves 91.6% Recall@5, close to vector search at roughly 12× lower latency (14ms vs. 176ms). With embedding enabled, performance matches pure vector search.

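MRR here is standard Mean Reciprocal Rank: the average of 1/rank of the gold tool, with a complete miss counted as 0. A minimal sketch, reusing the hypothetical result-record fields from the earlier metrics example:

```python
# Mean Reciprocal Rank over hypothetical result records (see earlier sketch).
def mean_reciprocal_rank(results: list[dict]) -> float:
    total = 0.0
    for r in results:
        ranked, gold = r["ranked_tools"], r["gold_tool"]
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)  # rank is 1-based
    return total / len(results)
```
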
---

## 5. Scale test: 1068 tools (GitHub full API)

| Strategy | Recall@5 | MRR | Miss% |
|---|:---:|:---:|:---:|
| Vector Only | 88.0% | 0.761 | 12.0% |
| BM25 + Graph | 78.0% | 0.643 | 22.0% |
| Full Pipeline | 88.0% | 0.761 | 12.0% |

At 1068 tools, the baseline of passing every tool definition is impractical due to context size, so a retrieval layer is required rather than optional. Vector-only and the full pipeline tie here (Miss% is the share of queries whose gold tool never enters the top 5), while BM25 + Graph falls behind at this scale.

---

## 6. LangChain agent benchmark (200 tools)

End-to-end accuracy when **200 simple tools** are registered and invoked through a LangChain agent.

- **Direct (D)** — all 200 tool definitions passed to the LLM at once
- **Graph (G)** — tools managed via the graph-tool-call gateway (search → call, 2 turns; see the sketch at the end of this section)

| Model | D-Acc | G-Acc | D-Turns | G-Turns | D-Tokens | G-Tokens | Savings | D-Time | G-Time |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| gpt-4.1 | 60.0% | 80.0% | 1.0 | 2.0 | 52,587 | 6,639 | 87.4% | 15.5s | 17.6s |
| gpt-5.2 | 60.0% | **100.0%** | 1.0 | 2.0 | 53,645 | 10,508 | 80.4% | 20.5s | 17.1s |
| gpt-5.4 | 60.0% | **100.0%** | 1.0 | 2.0 | 60,035 | 14,049 | 76.6% | 18.2s | 17.0s |
| claude-sonnet-4-20250514 | 100.0% | 100.0% | 1.0 | 2.0 | 196,183 | 17,349 | 91.2% | 58.2s | 49.4s |
| claude-sonnet-4-6 | 100.0% | 100.0% | 1.0 | 2.0 | 198,665 | 20,074 | 89.9% | 67.0s | 69.4s |
| claude-haiku-4-5 | 100.0% | 100.0% | 1.0 | 2.0 | 197,845 | 19,714 | 90.0% | 23.7s | 22.8s |

> Acc = accuracy, Turns = average agent turns, Tokens = total tokens, Savings = token reduction (D→G), Time = wall-clock.

### Key findings

- GPT-series models drop to **60% accuracy** when all 200 tools are passed directly; graph-tool-call recovers them to **80–100%**.
- Claude-series models maintain 100% accuracy either way, but graph-tool-call still delivers **89–91% token savings**.
- Graph mode adds one extra turn (search → call), yet total latency stays comparable or even decreases thanks to the smaller context.
- Across all models, token reduction ranges from **76.6% to 91.2%**.

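For reference, a minimal sketch of the search → call gateway pattern measured above, written with `langchain_core`'s `@tool` decorator. `REGISTRY` and the keyword scoring are stand-ins; graph-tool-call's real gateway API may differ:

```python
# Minimal search -> call gateway sketch (stand-in registry and scoring;
# not graph-tool-call's actual API).
from langchain_core.tools import tool

REGISTRY: dict[str, dict] = {}  # name -> {"description": str, "fn": callable}

@tool
def search_tools(query: str, k: int = 5) -> list[str]:
    """Return the k registered tools most relevant to the query."""
    def score(meta: dict) -> int:
        # Naive keyword overlap; the real retriever uses BM25 + graph.
        return sum(w in meta["description"].lower() for w in query.lower().split())
    ranked = sorted(REGISTRY.items(), key=lambda kv: score(kv[1]), reverse=True)
    return [f'{name}: {meta["description"]}' for name, meta in ranked[:k]]

@tool
def call_tool(name: str, arguments: dict) -> str:
    """Invoke a registered tool by name with the given arguments."""
    return str(REGISTRY[name]["fn"](**arguments))

# An agent bound to just these two tools sees a tiny, constant-size tool
# schema instead of all 200 definitions: the source of the token savings.
```

The agent then takes two turns per task (one search, one call), which matches the G-Turns column above.
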
---

## Reproduce

```bash
# Retrieval quality only (fast, no LLM needed)
python -m benchmarks.run_benchmark
python -m benchmarks.run_benchmark -d k8s -v

# Pipeline benchmark (LLM comparison)
python -m benchmarks.run_benchmark --mode pipeline -m qwen3:4b
python -m benchmarks.run_benchmark --mode pipeline \
  --pipelines baseline retrieve-k3 retrieve-k5 retrieve-k10

# Save baseline and compare across runs
python -m benchmarks.run_benchmark --mode pipeline --save-baseline
python -m benchmarks.run_benchmark --mode pipeline --diff
```

See [`benchmarks/`](../benchmarks/) for dataset definitions, ground truth, and the runner source.