diff --git a/.claude/skills/codesearch/SKILL.md b/.claude/skills/codesearch/SKILL.md index 9b66903..df587f1 100644 --- a/.claude/skills/codesearch/SKILL.md +++ b/.claude/skills/codesearch/SKILL.md @@ -3,7 +3,7 @@ name: codesearch description: Semantic code search using ML embeddings and AST analysis. Replaces built-in search tools for intent-based code exploration. Use when the user asks to find code by describing what it does, understand code relationships, or explore a codebase semantically. metadata: author: ArtemisMucaj - version: "0.6.0" + version: "0.7.0" compatibility: Requires the codesearch binary installed and a repository indexed with `codesearch index`. --- @@ -21,6 +21,9 @@ Invoke this skill **immediately** when: - You need to discover code related to a **concept** rather than an exact string - User asks about **blast radius** or **impact** of changing a function/symbol - User asks **who calls** a function or **what does a function call** (symbol context) +- User asks for an **explanation** of a symbol's full call flow or business purpose (`explain`) +- User asks about the **architectural structure** of a repo — modules, clusters, entry-point features +- User asks **which files one repository uses** from another (cross-repo dependencies) ## When to Use Built-in Tools Instead @@ -117,32 +120,105 @@ Codesearch uses Tree-sitter to extract and index these code constructs: ## Call Graph Analysis -Once a repository is indexed, the call graph is available for two complementary commands. +Once a repository is indexed, the call graph powers several complementary commands. ### Impact Analysis — blast radius of a change ```shell -# Who breaks if `authenticate` changes? (default depth: 5 hops) +# Who breaks if `authenticate` changes? 
(BFS over the call graph) codesearch impact authenticate -# Limit to 2 hops -codesearch impact authenticate --depth 2 - # Restrict to one repository; JSON output for scripts codesearch impact authenticate --repository my-api --format json ``` -### Symbol Context — 360-degree callers + callees +### Symbol Context — full caller/callee call-chain tree ```shell # Who calls `authenticate`, and what does it call? codesearch context authenticate -# Limit results per direction -codesearch context authenticate --limit 10 +# Restrict to one repository; JSON output +codesearch context authenticate --repository my-api --format json +``` + +### Matching symbols by regex + +`impact`, `context`, and `explain` resolve the symbol name with a substring match +by default (`load` matches any fully-qualified name containing `load`). Pass +`--regex` to supply your own POSIX pattern with explicit anchoring: + +```shell +codesearch impact "^MyNs/.*Service#get$" --regex +codesearch context ".*Repository.*" --regex +``` + +### Explain — LLM-generated call-flow explanation + +Produces a structured natural-language description of a symbol's purpose, data/control +flow, business feature, and key dependencies. Requires `ANTHROPIC_API_KEY` (default +backend) or an OpenAI-compatible endpoint. + +```shell +# Explain a symbol using the default Anthropic backend +codesearch explain authenticate + +# Use an OpenAI-compatible backend (e.g. LM Studio) +codesearch explain authenticate --llm open-ai + +# Also dump every analyzed symbol's source chunk +codesearch explain authenticate --dump-symbols +``` + +## Architecture & Dependency Analysis + +These commands operate on the file- and repository-level dependency graph built during +indexing. They help answer "how is this codebase structured?" rather than "where is X?". 
+ +### Execution Features — entry-point flows ranked by criticality + +```shell +# List the most critical entry-point features in a repository +codesearch features list my-repo + +# Limit the number of features and emit JSON +codesearch features list my-repo --limit 10 --format json + +# Inspect a single feature by its entry-point symbol +codesearch features get handle_request + +# Which features are impacted by changing one or more symbols? +codesearch features impacted authenticate hash_password +``` + +### Clusters — architectural modules (Leiden community detection) + +```shell +# List tightly-coupled file clusters (architectural modules) +codesearch clusters list my-repo + +# Which cluster does a given file belong to? +codesearch clusters get src/api/auth.rs my-repo -# JSON output -codesearch context authenticate --format json +# Print a high-level Markdown architecture overview table +codesearch clusters overview my-repo +``` + +### Uses — cross-repository file dependencies + +```shell +# List the files in repo `web` that use files from repo `core` +codesearch uses web core +``` + +## Interactive TUI + +For exploratory sessions a full-screen terminal UI bundles search, impact, and context: + +```shell +codesearch tui # open in search mode +codesearch tui --mode impact # open in impact mode +codesearch tui --query "auth flow" # pre-populate and dispatch a query ``` ## Repository Management @@ -200,4 +276,4 @@ codesearch --no-rerank search "query" ## Keywords -semantic search, hybrid search, code search, natural language search, find code, explore codebase, code understanding, intent search, AST analysis, embeddings, code discovery, code exploration, BM25, keyword search, RRF, reciprocal rank fusion, call graph, impact analysis, blast radius, symbol context, callers, callees, dependency analysis +semantic search, hybrid search, code search, natural language search, find code, explore codebase, code understanding, intent search, AST analysis, embeddings, code 
discovery, code exploration, BM25, keyword search, RRF, reciprocal rank fusion, call graph, impact analysis, blast radius, symbol context, callers, callees, dependency analysis, explain, call flow, execution features, criticality, clusters, architecture overview, Leiden, module detection, cross-repository dependencies, uses, regex symbol match diff --git a/AGENTS.md b/AGENTS.md index bc2ef17..732e56a 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -76,7 +76,12 @@ The codebase follows **Domain-Driven Design (DDD)** with a strict **Ports & Adap | RRF result fusion | `src/application/use_cases/rrf_fuse.rs` | | Impact (blast radius) analysis | `src/application/use_cases/impact_analysis.rs` | | Symbol context (callers/callees) | `src/application/use_cases/symbol_context.rs` | -| Call graph extraction | `src/application/use_cases/call_graph.rs` | +| Call graph extraction & relationship queries | `src/application/use_cases/call_graph.rs` | +| LLM explanation of a symbol's call flow | `src/application/use_cases/explain.rs` | +| Execution features + criticality scoring | `src/application/use_cases/execution_features.rs` | +| Architectural cluster detection (Leiden) | `src/application/use_cases/cluster_detection.rs` | +| File / cross-repo relationship graph (`uses`) | `src/application/use_cases/file_relationship.rs` | +| Source snippet lookup | `src/application/use_cases/snippet_lookup.rs` | | List / delete repositories | `src/application/use_cases/{list,delete}_repository.rs` | ### Dependency Injection diff --git a/README.md b/README.md index 8e29a41..59fb36d 100644 --- a/README.md +++ b/README.md @@ -62,6 +62,9 @@ codesearch search "function that handles authentication" # Show indexed repositories codesearch list +# Show indexing statistics +codesearch stats + # Delete a repository by name or path codesearch delete my-repo codesearch delete /path/to/repo @@ -69,12 +72,21 @@ codesearch delete /path/to/repo # Show the blast radius of a symbol change (BFS over call graph) 
codesearch impact authenticate -# Show 360-degree caller/callee context for a symbol +# Show full caller/callee call-chain context for a symbol codesearch context authenticate # LLM-powered explanation of a symbol's full call flow and business purpose codesearch explain authenticate +# Rank entry-point execution features by criticality +codesearch features list my-repo + +# Detect architectural clusters in the file dependency graph +codesearch clusters list my-repo + +# List the files one repository uses from another +codesearch uses web core + # Launch the interactive TUI (search, impact, and context in one terminal UI) codesearch tui @@ -149,7 +161,7 @@ CodeSearch builds a call graph during indexing and exposes two commands to query ### Impact Analysis -Shows every symbol that would be affected (transitively) if a given symbol changes. Uses BFS over the call graph up to a configurable depth. +Shows every symbol that would be affected (transitively) if a given symbol changes. Uses BFS over the call graph, grouping affected symbols by hop depth. ```bash # Show what breaks if `authenticate` changes @@ -213,6 +225,11 @@ process_request [call] src/router.rs:10 |------|---------|---------|-------------| | `-r, --repository` | both | (none) | Restrict to a specific repository | | `-F, --format` | both | `text` | Output format: `text`, `json`, or `vimgrep` | +| `--regex` | both | off | Treat the symbol as an explicit POSIX regex (no auto-wrapping) | + +> **Symbol matching:** By default the symbol argument is matched as a substring +> (`load` matches any fully-qualified name containing `load`). Pass `--regex` to +> supply your own anchored pattern, e.g. `codesearch impact "^MyNs/.*Service#get$" --regex`. > **Note:** Call graph data is populated during `codesearch index`. Re-index after code changes to keep the graph up to date. 
@@ -239,6 +256,57 @@ codesearch tui --query "authentication" See [Getting Started — Launch the Interactive TUI](docs/features/getting-started.md#launch-the-interactive-tui) for all options. +## Architecture & Dependency Analysis + +Beyond per-symbol call graphs, CodeSearch analyses the file- and repository-level +dependency graph built during indexing. + +### Execution Features (`features`) + +Discovers entry-point execution flows (forward call chains rooted at entry-point +symbols) and ranks them by a criticality score. + +```bash +# List the most critical features in a repository +codesearch features list my-repo --limit 20 + +# Show a single feature by its entry-point symbol +codesearch features get handle_request + +# Show which features are impacted by changing one or more symbols +codesearch features impacted authenticate hash_password +``` + +### Clusters (`clusters`) + +Runs the [Leiden](https://en.wikipedia.org/wiki/Leiden_algorithm) community-detection +algorithm over the file-level call graph to surface tightly-coupled groups of files +(architectural modules). + +```bash +# List detected clusters +codesearch clusters list my-repo + +# Find which cluster a file belongs to +codesearch clusters get src/api/auth.rs my-repo + +# Print a high-level Markdown architecture overview table +codesearch clusters overview my-repo +``` + +### Cross-repository Usage (`uses`) + +Lists every file in one repository that references symbols defined in another, +grouped by the target file they depend on. + +```bash +# Files in the `web` repo that use files from the `core` repo +codesearch uses web core +``` + +See [Architecture & Dependency Analysis](docs/features/architecture-analysis.md) for +output examples, flags, and JSON schemas. + ## Editor Integrations ### Neovim / Telescope @@ -312,7 +380,20 @@ codesearch mcp --http 8080 --public The HTTP server exposes the MCP endpoint at `/mcp`. 
-**Exposed tool**: `search_code` — accepts `query`, `limit`, `min_score`, `languages`, and `repositories` parameters. +**Exposed tools:** + +| Tool | Description | +|------|-------------| +| `search_code` | Hybrid/semantic search. Accepts `query`, `limit`, `min_score`, `languages`, `repositories`, and `text_search`. | +| `analyze_impact` | Blast-radius analysis for a symbol. Accepts `symbol`, `repository_id`, and `regex`. | +| `get_symbol_context` | 360° caller/callee context for a symbol. Accepts `symbol`, `repository_id`, and `regex`. | +| `query_graph` | Precise relationship queries over the call graph. Accepts `pattern`, `target`, `repository_id`, and `limit`. | + +The `query_graph` tool supports eight intention-named relationship `pattern`s, returning +only the requested edge type instead of every relationship at once: + +`callers_of`, `callees_of`, `imports_of`, `importers_of`, `inheritors_of`, +`children_of`, `tests_for`, and `file_summary`. ### Storage Backends diff --git a/docs/README.md b/docs/README.md index 935c0aa..cca49c2 100644 --- a/docs/README.md +++ b/docs/README.md @@ -14,4 +14,6 @@ This directory contains documentation for the CodeSearch project. - [Indexing Pipeline](./features/indexing.md) - [Search Features](./features/search.md) - [Call Graph Analysis](./features/call-graph.md) +- [Architecture & Dependency Analysis](./features/architecture-analysis.md) +- [Embedding Backends](./features/embedding-backends.md) - [Editor Integrations](./features/editor-integrations.md) diff --git a/docs/features/architecture-analysis.md b/docs/features/architecture-analysis.md new file mode 100644 index 0000000..a271936 --- /dev/null +++ b/docs/features/architecture-analysis.md @@ -0,0 +1,177 @@ +# Architecture & Dependency Analysis + +In addition to per-symbol call graph queries (`impact`, `context`, `explain` — see +[Call Graph Analysis](./call-graph.md)), CodeSearch analyses the **file- and +repository-level** dependency graph built during indexing. 
These commands answer +"how is this codebase structured?" rather than "where is X?". + +All three commands derive from the same `SymbolReference` edges populated during +`codesearch index`, so re-index after code changes to keep the analysis current. + +## Execution Features (`codesearch features`) + +An **execution feature** is a forward call chain rooted at an entry-point symbol — a +self-contained slice of behaviour that the codebase exposes. Each feature is assigned a +**criticality** score derived from how deep and how wide its call chain is, so the most +load-bearing flows surface first. + +### Subcommands + +```bash +# List entry-point features for a repository, sorted by descending criticality +codesearch features list my-repo + +# Cap the number of features and emit JSON +codesearch features list my-repo --limit 10 --format json + +# Show the execution feature for a single entry-point symbol (exact or substring) +codesearch features get handle_request + +# Show the features impacted by changing one or more symbols +codesearch features impacted authenticate hash_password +``` + +### Options + +| Flag | Subcommand | Default | Description | +|------|------------|---------|-------------| +| `-l, --limit` | `list` | `20` | Maximum number of features to return | +| `-r, --repository` | `get`, `impacted` | (none) | Restrict lookup to a specific repository ID | +| `-F, --format` | all | `text` | Output format: `text`, `json`, or `vimgrep` | + +### Example: `features list` + +```text +Execution Features (3 total) +───────────────────────────────────────── +login_flow criticality=0.91 depth=4 files=6 + entry: handle_login + +index_repository criticality=0.74 depth=5 files=9 + entry: run_index + +search_flow criticality=0.68 depth=3 files=4 + entry: handle_search +``` + +### Example: `features get` + +```text +Execution Feature: login_flow +───────────────────────────────────────── +Entry point : handle_login +Repository : my-api +Criticality : 0.91 +Depth : 4 +Files 
: 6 + +Call chain: +handle_login + └── authenticate [src/auth/mod.rs:10] + └── verify_password [src/crypto/hash.rs:22] + └── generate_token [src/crypto/token.rs:7] +``` + +## Clusters (`codesearch clusters`) + +The `clusters` command runs the [Leiden](https://en.wikipedia.org/wiki/Leiden_algorithm) +community-detection algorithm over the file-level call graph to identify groups of +tightly-coupled files — i.e. architectural modules — even when those groupings are not +reflected in the directory layout. + +### Subcommands + +```bash +# List all clusters detected in the repository +codesearch clusters list my-repo + +# JSON output +codesearch clusters list my-repo --format json + +# Show which cluster a specific file belongs to (path as indexed, repo-relative) +codesearch clusters get src/api/auth.rs my-repo + +# Print a high-level Markdown architecture overview table +codesearch clusters overview my-repo +``` + +### Options + +| Flag | Subcommand | Default | Description | +|------|------------|---------|-------------| +| `-F, --format` | `list`, `get` | `text` | Output format: `text` or `json` (vimgrep is not supported) | + +> The `overview` subcommand always emits a Markdown table and takes no `--format` flag. + +### Example: `clusters list` + +```text +Clusters for `my-repo` — 3 clusters, 42 files, 118 edges +──────────────────────────────────────────────────── + 1. auth (8 files, rust, cohesion 0.82) + src/auth/mod.rs + src/crypto/hash.rs + src/crypto/token.rs + src/db/users.rs + src/middleware/session.rs + … and 3 more + 2. indexing (12 files, rust, cohesion 0.77) + src/connector/adapter/duckdb/vector.rs + … +``` + +### Example: `clusters get` + +```text +File `src/api/auth.rs` belongs to cluster `auth` (8 files, rust, cohesion 0.82) +``` + +## Cross-repository Usage (`codesearch uses`) + +`codesearch uses` takes two repository arguments and lists every file in the first +repository that references symbols defined in the second repository, grouped by the +target file they depend on. 
Both arguments accept a repository name or ID. This is useful for auditing +the surface area one service consumes from a shared library or another service. + +```bash +# Files in the `web` repo that use files from the `core` repo +codesearch uses web core +``` + +### Example output + +```text +Files in 'web' that use files from 'core': + + core/src/db.rs + ← web/src/handlers/users.rs [query, execute] + ← web/src/handlers/auth.rs [query] + core/src/models.rs + ← web/src/handlers/users.rs [User, Session] + +2 file(s) in 'web' depend on 2 file(s) in 'core'. +``` + +Each `←` line names a consuming file; the bracketed list shows the referenced symbols. +If there are no cross-repository references, the command reports that no dependencies +were found. + +## Querying the Graph from AI Tools (`query_graph`) + +When running as an [MCP server](./editor-integrations.md#mcp-context-server-ai-assistant-integration), +CodeSearch exposes the `query_graph` tool for precise, single-relationship queries over +the call graph. Rather than returning every edge kind at once, it returns only the +intention you ask for: + +| Pattern | Returns | +|---------|---------| +| `callers_of` | Symbols that call the target | +| `callees_of` | Symbols the target calls | +| `imports_of` | What the target imports (import edges only) | +| `importers_of` | Who imports the target (import edges only) | +| `inheritors_of` | Symbols that inherit from / implement the target | +| `children_of` | Symbols the target inherits from / implements | +| `tests_for` | Test functions or files that exercise the target | +| `file_summary` | All symbols referenced within a file | + +See [Editor Integrations — MCP Server](./editor-integrations.md) for tool parameters. 
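As a rough illustration of the `query_graph` table above, the sketch below builds the kind of request an MCP client might send. The `tools/call` envelope follows the Model Context Protocol; the argument names (`pattern`, `target`, `limit`) are taken from the tool tables in this changeset, while the target symbol, the limit value, and the exact argument types are assumptions made for the example.

```shell
#!/bin/sh
# Assumed values: the pattern name comes from the table above; the target
# symbol and limit are placeholders chosen for illustration.
pattern='callers_of'
target='authenticate'
limit=10

# Standard MCP JSON-RPC 2.0 envelope for invoking a tool by name.
# Note: plain printf interpolation does no JSON escaping, so this only
# works for simple identifiers like the ones above.
payload=$(printf '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"query_graph","arguments":{"pattern":"%s","target":"%s","limit":%d}}}' \
  "$pattern" "$target" "$limit")

printf '%s\n' "$payload"
```

The sketch stops at building the payload. An MCP client would normally complete the protocol's initialization handshake before sending it over stdio or the HTTP endpoint.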
diff --git a/docs/features/call-graph.md b/docs/features/call-graph.md index 9102181..a4a1dd1 100644 --- a/docs/features/call-graph.md +++ b/docs/features/call-graph.md @@ -39,6 +39,9 @@ codesearch impact authenticate --format json # Vimgrep output (file:line:col:text) for Neovim quickfix codesearch impact authenticate --format vimgrep + +# Match the root symbol with an explicit regex +codesearch impact "^MyNs/.*Service#get$" --regex ``` ### Options @@ -47,6 +50,12 @@ codesearch impact authenticate --format vimgrep |------|---------|-------------| | `-r, --repository` | (none) | Restrict the graph traversal to one repository | | `-F, --format` | `text` | Output format: `text`, `json`, or `vimgrep` | +| `--regex` | off | Treat SYMBOL as an explicit POSIX regex (no auto-wrapping) | + +> **Symbol matching:** By default the symbol argument is matched as a substring — +> `load` resolves to any fully-qualified name containing `load`. Pass `--regex` to +> control anchoring yourself (e.g. `^MyNs/.*Service#get$`). The same applies to +> `context` and `explain`. 
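The difference between the default substring match and `--regex` can be illustrated with `grep`. This is only an analogy: it assumes codesearch's POSIX patterns behave like extended regular expressions, which the docs do not confirm, and the symbol names are hypothetical.

```shell
#!/bin/sh
# Hypothetical fully-qualified symbol names; real names depend on the indexed code.
symbols='MyNs/UserService#get
MyNs/UserService#getAll
Other/loader#load_config
MyNs/OrderService#get'

# Default behaviour described above: plain substring containment.
# grep -F treats the pattern as a fixed string, so `load` matches anywhere.
substring_hits=$(printf '%s\n' "$symbols" | grep -F 'load')

# --regex behaviour: the caller anchors the pattern explicitly.
# grep -E stands in for the POSIX matcher here (an assumption).
regex_hits=$(printf '%s\n' "$symbols" | grep -E '^MyNs/.*Service#get$')

printf 'substring:\n%s\n' "$substring_hits"
printf 'regex:\n%s\n' "$regex_hits"
```

With the trailing `$` anchor, `MyNs/UserService#getAll` is excluded from the regex result, whereas a substring search for `get` would have included it.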
### Example Text Output @@ -95,6 +104,9 @@ codesearch context authenticate --format json # Vimgrep output (file:line:col:text) for Neovim quickfix codesearch context authenticate --format vimgrep + +# Match the symbol with an explicit regex +codesearch context ".*Repository.*" --regex ``` ### Options @@ -103,6 +115,7 @@ codesearch context authenticate --format vimgrep |------|---------|-------------| | `-r, --repository` | (none) | Restrict lookup to one repository | | `-F, --format` | `text` | Output format: `text`, `json`, or `vimgrep` | +| `--regex` | off | Treat SYMBOL as an explicit POSIX regex (no auto-wrapping) | ### Example Text Output diff --git a/docs/features/editor-integrations.md b/docs/features/editor-integrations.md index cba7ca8..4275d26 100644 --- a/docs/features/editor-integrations.md +++ b/docs/features/editor-integrations.md @@ -4,13 +4,17 @@ CodeSearch provides output formats and plugins for integrating semantic search i ## Output Formats -The `--format` (`-F`) flag controls output for `search`, `context`, and `impact`: +The `--format` (`-F`) flag controls output for `search`, `context`, `impact`, +`explain`, `features`, and `clusters`: | Format | Description | Commands | |--------|-------------|----------| | `text` | Human-readable output with code previews (default) | all | | `json` | Structured JSON array for programmatic consumption | all | -| `vimgrep` | `file:line:col:text` for Neovim quickfix list and Telescope | all | +| `vimgrep` | `file:line:col:text` for Neovim quickfix list and Telescope | `search`, `context`, `impact`, `features` | + +> `clusters list` / `clusters get` support `text` and `json` only; `vimgrep` is not +> available for them. `clusters overview` always emits a Markdown table. ## Zed @@ -32,7 +36,22 @@ Add the following block to `~/.config/zed/settings.json` (see [`ide/zed/settings } ``` -Restart Zed and open the AI assistant — the server will be listed in the context-server panel. 
The assistant can then call `search_code`, `analyze_impact`, and `get_symbol_context` autonomously while you chat. +Restart Zed and open the AI assistant — the server will be listed in the context-server panel. The assistant can then call `search_code`, `analyze_impact`, `get_symbol_context`, and `query_graph` autonomously while you chat. + +#### Exposed MCP tools + +| Tool | Description | +|------|-------------| +| `search_code` | Hybrid/semantic search (`query`, `limit`, `min_score`, `languages`, `repositories`, `text_search`) | +| `analyze_impact` | Blast-radius analysis for a symbol (`symbol`, `repository_id`, `regex`) | +| `get_symbol_context` | 360° caller/callee context for a symbol (`symbol`, `repository_id`, `regex`) | +| `query_graph` | Single-relationship graph query (`pattern`, `target`, `repository_id`, `limit`) | + +`query_graph` accepts one of eight intention-named `pattern`s: `callers_of`, +`callees_of`, `imports_of`, `importers_of`, `inheritors_of`, `children_of`, +`tests_for`, and `file_summary`. See +[Architecture & Dependency Analysis](./architecture-analysis.md#querying-the-graph-from-ai-tools-query_graph) +for what each pattern returns. 
### Tasks (command palette integration) diff --git a/docs/features/getting-started.md b/docs/features/getting-started.md index f1b808b..c35a255 100644 --- a/docs/features/getting-started.md +++ b/docs/features/getting-started.md @@ -112,6 +112,24 @@ codesearch tui --query "authentication" codesearch tui --mode impact ``` +### Analyze Architecture & Dependencies + +Explore the repository-level dependency graph — entry-point features, architectural +clusters, and cross-repository usage: + +```bash +# Rank entry-point execution features by criticality +codesearch features list my-repo + +# Detect architectural clusters (Leiden community detection) +codesearch clusters list my-repo + +# List the files one repository uses from another +codesearch uses web core +``` + +See [Architecture & Dependency Analysis](./architecture-analysis.md) for full details. + ### Start the MCP Server Run CodeSearch as a [Model Context Protocol](https://modelcontextprotocol.io/) server for AI tool integration: @@ -147,13 +165,17 @@ codesearch -v search "my query" ## How Search Works -Codesearch uses **semantic vector search**: +Codesearch defaults to **hybrid search** — a semantic (vector) leg and a keyword +(BM25-style) leg, fused via Reciprocal Rank Fusion (RRF): 1. Your query is converted to a 384-dimensional embedding 2. The DuckDB VSS extension finds semantically similar code using HNSW indexes -3. A cross-encoder reranker (bge-reranker-base) rescores candidates for higher relevance (enabled by default, disable with `--no-rerank`) -4. Results are ranked by cosine similarity (0.0 to 1.0) or reranking score -5. Filters can be applied by language, node type, repository, or minimum score +3. In parallel, a keyword leg matches content and symbol names; the two ranked lists are fused with RRF (pass `--no-text-search` for semantic-only) +4. A cross-encoder reranker (bge-reranker-base) rescores candidates for higher relevance (enabled by default, disable with `--no-rerank`) +5. 
Results are ranked by fused RRF score (hybrid), cosine similarity (semantic-only), or reranking score +6. Filters can be applied by language, node type, repository, or minimum score + +See [Search Features](./search.md) for the full hybrid-vs-semantic breakdown. **Why VSS (Vector Similarity Search)?** - ✓ Finds conceptually similar code, not just keyword matches
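The fusion step in the "How Search Works" list above can be sketched with Reciprocal Rank Fusion in its usual form, where each ranked list contributes `1 / (k + rank)` to a document's score. The constant `k = 60` and the file names are assumptions for illustration; the docs do not state which constant codesearch uses.

```shell
#!/bin/sh
# Two hypothetical ranked result lists, best hit first.
semantic='auth.rs
session.rs
token.rs'
keyword='token.rs
auth.rs
users.rs'

# RRF: each list contributes 1 / (k + rank) for every document it contains.
# k=60 is a conventional default, assumed here rather than taken from codesearch.
fused=$(
  {
    printf '%s\n' "$semantic" | awk '{ print NR, $0 }'
    printf '%s\n' "$keyword"  | awk '{ print NR, $0 }'
  } | awk -v k=60 '
      { score[$2] += 1 / (k + $1) }
      END { for (d in score) printf "%.5f %s\n", score[d], d }
    ' | sort -k1,1nr -k2,2
)

printf '%s\n' "$fused"
```

Documents that appear near the top of both lists accumulate the highest fused score, which is why `auth.rs` ranks first here even though the keyword leg placed `token.rs` above it.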