diff --git a/.changeset/graph-algorithms.md b/.changeset/graph-algorithms.md new file mode 100644 index 0000000..84001fe --- /dev/null +++ b/.changeset/graph-algorithms.md @@ -0,0 +1,11 @@ +--- +'@prosdevlab/dev-agent': patch +--- + +Graph algorithms for dev_map and dev_refs + +- `dev_map` hot paths now use PageRank over the weighted dependency graph — files depended on by other important files rank higher +- `dev_map` shows connected subsystems ("Subsystems: packages/core (45 files), packages/cli (12 files)") +- `dev_refs` new `traceTo` parameter traces the dependency chain between files through the call graph +- All algorithms are hand-rolled pure functions (~230 lines), no new dependencies +- Inspired by aider's repo map (PageRank over dependency graphs) diff --git a/.claude/da-plans/README.md b/.claude/da-plans/README.md index 77b9ff1..1385aa7 100644 --- a/.claude/da-plans/README.md +++ b/.claude/da-plans/README.md @@ -9,9 +9,9 @@ Implementation deviations are logged at the bottom of each plan file. 
| Track | Description | Status | |-------|-------------|--------| -| [Core](core/) | Scanner, vector storage, services, indexer | Phase 1: Merged, Phase 2: Merged (indexing rethink) | +| [Core](core/) | Scanner, vector storage, services, indexer | Phase 1: Merged, Phase 2: Merged, Phase 3: Draft (graph cache) | | [CLI](cli/) | Command-line interface | Not started | -| [MCP Server](mcp/) | Model Context Protocol server + adapters | Phase 1: Draft (tools improvement) | +| [MCP Server](mcp/) | Model Context Protocol server + adapters | Phase 1: Merged (tools improvement) | | [Subagents](subagents/) | Coordinator, explorer, planner, GitHub agents | Not started | | [Integrations](integrations/) | Claude Code, VS Code, Cursor | Not started | | [Logger](logger/) | @prosdevlab/kero centralized logging | Not started | diff --git a/.claude/da-plans/core/phase-3-graph-cache/3.1-index-time-graph.md b/.claude/da-plans/core/phase-3-graph-cache/3.1-index-time-graph.md new file mode 100644 index 0000000..d8d8bd8 --- /dev/null +++ b/.claude/da-plans/core/phase-3-graph-cache/3.1-index-time-graph.md @@ -0,0 +1,106 @@ +# Part 3.1: Build and Save Dependency Graph at Index Time + +See [overview.md](overview.md) for architecture context. + +## Goal + +After `linearMerge` (full index) or `batchUpsertAndDelete` (incremental), build the +dependency graph from the scan results and save it as JSON. + +## What changes + +### `packages/core/src/storage/path.ts` + +Add `dependencyGraph` to `getStorageFilePaths`: + +```typescript +export function getStorageFilePaths(storagePath: string): { + vectors: string; + metadata: string; + watcherSnapshot: string; + dependencyGraph: string; // NEW + // ... deprecated paths +} { + return { + // ... 
existing + dependencyGraph: path.join(storagePath, 'dependency-graph.json'), + }; +} +``` + +### `packages/core/src/map/graph.ts` + +Add serialization/deserialization: + +```typescript +export interface CachedGraph { + version: 1; + generatedAt: string; + nodeCount: number; + edgeCount: number; + graph: Record<string, WeightedEdge[]>; +} + +export function serializeGraph(graph: Map<string, WeightedEdge[]>): string { + let edgeCount = 0; + const obj: Record<string, WeightedEdge[]> = {}; + for (const [key, edges] of graph) { + obj[key] = edges; + edgeCount += edges.length; + } + return JSON.stringify({ + version: 1, + generatedAt: new Date().toISOString(), + nodeCount: graph.size, + edgeCount, + graph: obj, + }); +} + +export function deserializeGraph(json: string): Map<string, WeightedEdge[]> | null { + try { + const data = JSON.parse(json); + if (data.version !== 1) return null; + const graph = new Map<string, WeightedEdge[]>(); + for (const [key, edges] of Object.entries(data.graph)) { + graph.set(key, edges as WeightedEdge[]); + } + return graph; + } catch { + return null; + } +} +``` + +### `packages/core/src/indexer/index.ts` + +After `linearMerge` completes in `index()`, build and save the graph: + +```typescript +// After linearMerge (line ~180) +const documents = prepareDocumentsForEmbedding(scanResult.documents); +// ... linearMerge call ... 
+ +// Build and cache dependency graph +const graph = buildDependencyGraph( + documents.map(d => ({ id: d.id, score: 0, metadata: d.metadata })) +); +const graphJson = serializeGraph(graph); +await fs.writeFile(filePaths.dependencyGraph, graphJson, 'utf-8'); +``` + +## Tests + +| Test | What it verifies | +|------|-----------------| +| `serializeGraph` round-trips correctly | Serialize → deserialize → same graph | +| `deserializeGraph` returns null for invalid JSON | Error handling | +| `deserializeGraph` returns null for wrong version | Schema evolution | +| `getStorageFilePaths` includes `dependencyGraph` | Path registration | +| After `index()`, graph file exists | Integration | + +## Commit + +``` +feat(core): build and save dependency graph at index time +``` diff --git a/.claude/da-plans/core/phase-3-graph-cache/3.2-load-on-demand.md b/.claude/da-plans/core/phase-3-graph-cache/3.2-load-on-demand.md new file mode 100644 index 0000000..d36795d --- /dev/null +++ b/.claude/da-plans/core/phase-3-graph-cache/3.2-load-on-demand.md @@ -0,0 +1,109 @@ +# Part 3.2: Load Cached Graph in dev_map and dev_refs + +See [overview.md](overview.md) for architecture context. + +## Goal + +Replace `getAll(limit: 10000)` → `buildDependencyGraph()` in `dev_map` and `dev_refs` +with loading the cached graph from disk. Falls back to current approach if graph file +is missing or corrupted. + +## What changes + +### `packages/core/src/map/graph.ts` + +Add a loader that reads from disk with fallback: + +```typescript +import * as fs from 'node:fs/promises'; + +/** + * Load dependency graph from cache, or build from docs as fallback. 
+ */ +export async function loadOrBuildGraph( + graphPath: string | undefined, + fallbackDocs: () => Promise<SearchResult[]> +): Promise<Map<string, WeightedEdge[]>> { + // Try cached graph first + if (graphPath) { + try { + const json = await fs.readFile(graphPath, 'utf-8'); + const graph = deserializeGraph(json); + if (graph) return graph; + } catch { + // File missing or unreadable — fall through to build + } + } + + // Fallback: build from docs (current approach) + const docs = await fallbackDocs(); + return buildDependencyGraph(docs); +} +``` + +### `packages/core/src/map/index.ts` + +Replace the graph build in `generateCodebaseMap`: + +```typescript +// Before (current): +const graph = buildDependencyGraph(allDocs); + +// After: +const graph = await loadOrBuildGraph( + context.graphPath, // new optional field on MapGenerationContext + async () => allDocs // fallback uses already-fetched docs +); +``` + +Add `graphPath` to `MapGenerationContext`: + +```typescript +export interface MapGenerationContext { + indexer: RepositoryIndexer; + gitExtractor?: LocalGitExtractor; + logger?: Logger; + graphPath?: string; // NEW — path to cached dependency-graph.json +} +``` + +### `packages/mcp-server/src/adapters/built-in/refs-adapter.ts` + +Replace the `getDependencyGraph` method: + +```typescript +private async getDependencyGraph() { + const CACHE_TTL_MS = 60_000; + if (this.cachedGraph && Date.now() - this.cachedGraphTime < CACHE_TTL_MS) { + return this.cachedGraph; + } + + // Try loading from disk first (no getAll needed) + this.cachedGraph = await loadOrBuildGraph( + this.graphPath, + async () => this.indexer!.getAll({ limit: 50000 }) // raised limit as fallback + ); + this.cachedGraphTime = Date.now(); + return this.cachedGraph; +} +``` + +### `packages/mcp-server/bin/dev-agent-mcp.ts` + +Pass `graphPath` to both MapAdapter and RefsAdapter from `getStorageFilePaths`. 
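The fallback contract described above (a valid cache wins; a missing, corrupted, or wrong-version cache falls through to a rebuild) can be exercised without touching the file system. The following is a minimal sketch, not the planned implementation: `loadOrBuildFromJson` is a hypothetical synchronous stand-in that receives the cached file's contents as a string (or `undefined` for a missing file), and the `WeightedEdge` shape is assumed from the graph JSON format in the overview.

```typescript
// Hypothetical stand-in for loadOrBuildGraph: file I/O is replaced by a string
// argument so that only the fallback logic is under test.
interface WeightedEdge {
  target: string;
  weight: number;
}

type Graph = Map<string, WeightedEdge[]>;

function deserializeGraph(json: string): Graph | null {
  try {
    const data = JSON.parse(json);
    if (data.version !== 1) return null; // unknown schema version: treat as a cache miss
    const graph: Graph = new Map();
    for (const [key, edges] of Object.entries(data.graph)) {
      graph.set(key, edges as WeightedEdge[]);
    }
    return graph;
  } catch {
    return null; // invalid JSON or unexpected shape
  }
}

function loadOrBuildFromJson(cachedJson: string | undefined, buildFallback: () => Graph): Graph {
  if (cachedJson !== undefined) {
    const cached = deserializeGraph(cachedJson);
    if (cached) return cached;
  }
  return buildFallback();
}

// Count how often the (expensive) rebuild path is taken.
let fallbackCalls = 0;
const rebuild = (): Graph => {
  fallbackCalls += 1;
  const graph: Graph = new Map();
  graph.set("src/a.ts", [{ target: "src/b.ts", weight: 1 }]);
  return graph;
};

const validJson = JSON.stringify({
  version: 1,
  graph: { "src/x.ts": [{ target: "src/y.ts", weight: 2 }] },
});
const futureJson = JSON.stringify({ version: 2, graph: {} });

const fromCache = loadOrBuildFromJson(validJson, rebuild);       // served from cache
const fromMissing = loadOrBuildFromJson(undefined, rebuild);     // missing file
const fromCorrupt = loadOrBuildFromJson("{not json", rebuild);   // corrupted file
const fromFuture = loadOrBuildFromJson(futureJson, rebuild);     // wrong version
```

Only the first call avoids the rebuild; the other three each invoke the fallback once, which mirrors the rows of the test table that follows.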
+ +## Tests + +| Test | What it verifies | +|------|-----------------| +| `loadOrBuildGraph` with valid cached file | Loads from disk, doesn't call fallback | +| `loadOrBuildGraph` with missing file | Calls fallback, builds from docs | +| `loadOrBuildGraph` with corrupted file | Calls fallback, doesn't crash | +| `generateCodebaseMap` uses cached graph when available | Integration | +| `dev_refs dependsOn` uses cached graph | Integration | + +## Commit + +``` +feat(core,mcp): load cached dependency graph in dev_map and dev_refs +``` diff --git a/.claude/da-plans/core/phase-3-graph-cache/3.3-incremental-graph.md b/.claude/da-plans/core/phase-3-graph-cache/3.3-incremental-graph.md new file mode 100644 index 0000000..02f6408 --- /dev/null +++ b/.claude/da-plans/core/phase-3-graph-cache/3.3-incremental-graph.md @@ -0,0 +1,89 @@ +# Part 3.3: Incremental Graph Updates via File Watcher + +See [overview.md](overview.md) for architecture context. + +## Goal + +When the file watcher detects changes and calls `applyIncremental`, update the +cached dependency graph without a full rebuild. This keeps the graph fresh as +files are edited. + +## What changes + +### `packages/core/src/map/graph.ts` + +Add an incremental update function: + +```typescript +/** + * Update a dependency graph incrementally. + * + * For changed/new files: remove old edges from those files, add new edges. + * For deleted files: remove all edges from those files. + * Pure function — returns a new graph. 
+ */ +export function updateGraphIncremental( + existing: Map<string, WeightedEdge[]>, + changedDocs: SearchResult[], + deletedFiles: string[] +): Map<string, WeightedEdge[]> { + const updated = new Map(existing); + + // Remove edges for deleted files + for (const file of deletedFiles) { + updated.delete(file); + } + + // Remove old edges for changed files, then add new ones + const changedGraph = buildDependencyGraph(changedDocs); + for (const file of changedGraph.keys()) { + // Remove old edges (the file was re-scanned) + updated.delete(file); + } + for (const [file, edges] of changedGraph) { + updated.set(file, edges); + } + + return updated; +} +``` + +### `packages/core/src/indexer/index.ts` + +In `applyIncremental`, update the cached graph: + +```typescript +async applyIncremental(upserts: EmbeddingDocument[], deleteIds: string[]): Promise<void> { + await this.vectorStorage.batchUpsertAndDelete(upserts, deleteIds); + + // Update cached dependency graph + const graphPath = getStorageFilePaths(this.config.vectorStorePath).dependencyGraph; + try { + const existing = await loadGraphFromDisk(graphPath); + if (existing) { + const deletedFiles = extractFilesFromDeleteIds(deleteIds); + const changedDocs = upserts.map(d => ({ id: d.id, score: 0, metadata: d.metadata })); + const updated = updateGraphIncremental(existing, changedDocs, deletedFiles); + await fs.writeFile(graphPath, serializeGraph(updated), 'utf-8'); + } + } catch { + // Graph update failed — next full index will fix it + } +} +``` + +## Tests + +| Test | What it verifies | +|------|-----------------| +| `updateGraphIncremental` adds edges for new files | New file → new edges appear | +| `updateGraphIncremental` removes edges for deleted files | Deleted file → edges gone | +| `updateGraphIncremental` replaces edges for changed files | Changed file → old edges removed, new edges added | +| `updateGraphIncremental` with empty existing graph | Handles first incremental gracefully | +| Incremental update failure doesn't crash indexer | Error resilience | + +## 
Commit + +``` +feat(core): incremental dependency graph updates via file watcher +``` diff --git a/.claude/da-plans/core/phase-3-graph-cache/overview.md b/.claude/da-plans/core/phase-3-graph-cache/overview.md new file mode 100644 index 0000000..e08ec5b --- /dev/null +++ b/.claude/da-plans/core/phase-3-graph-cache/overview.md @@ -0,0 +1,239 @@ +# Phase 3: Cached Dependency Graph for Scale + +**Status:** Draft + +## Context + +Phase 2 established the indexing pipeline (scan → Linear Merge → Antfly). MCP Phase 1 +added graph algorithms (PageRank, connected components, shortest path) that operate +over the dependency graph built from indexed `callees` metadata. + +The current approach rebuilds the dependency graph from scratch on every `dev_map` and +`dev_refs dependsOn` call by fetching all documents via `getAll(limit: 10000)`. This +works at our current scale (~2,200 docs) but breaks at medium-to-large repos: + +| Repo size | Docs | Current behavior | +|-----------|------|-----------------| +| Small (dev-agent) | ~2k | Works. Graph build <1ms, PageRank 4ms. | +| Medium (product monorepo) | 10-15k | **Silently truncated** at 10k. Graph is incomplete. | +| Large (platform monorepo) | 20-50k | Completely broken. Missing most of the graph. | + +### What breaks + +1. **`getAll(limit: 10000)` hard wall** — docs beyond 10k are silently dropped. + The graph is incomplete with no indication. PageRank scores are wrong. + +2. **Memory** — 50k docs × ~5KB each = ~250MB just for raw data. The graph itself + is much smaller (~50k nodes × ~5 edges × 16 bytes = ~4MB). + +3. **Latency per request** — `dev_refs dependsOn` fetches all docs and rebuilds the + graph on every call. For a 10k-doc repo, that's ~50ms fetch + ~5ms graph build + on every MCP request. The RefsAdapter has a 60s cache but it still rebuilds from + scratch after expiry. 
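The figures in the list above are simple arithmetic, so a short sketch makes them easy to re-run for other repo sizes. The constants are the per-item numbers quoted in this plan (assumptions, not measurements), and `estimate` is a hypothetical helper introduced only for illustration.

```typescript
// Back-of-envelope estimates using the figures quoted in this plan.
const GET_ALL_LIMIT = 10_000; // current getAll(limit) ceiling
const BYTES_PER_DOC = 5_000;  // ~5KB of raw data per indexed doc (assumed)
const EDGES_PER_NODE = 5;     // assumed average out-degree
const BYTES_PER_EDGE = 16;    // assumed in-memory cost per edge

interface Estimate {
  droppedDocs: number; // docs silently lost to the getAll limit
  rawDocBytes: number; // memory needed to hold every doc
  graphBytes: number;  // memory needed to hold only the graph
}

function estimate(docCount: number): Estimate {
  return {
    droppedDocs: Math.max(0, docCount - GET_ALL_LIMIT),
    rawDocBytes: docCount * BYTES_PER_DOC,
    graphBytes: docCount * EDGES_PER_NODE * BYTES_PER_EDGE,
  };
}

const small = estimate(2_200);
const large = estimate(50_000);
console.log({ small, large });
```

For 50k docs this yields 250 MB of raw documents against 4 MB of graph, with 40,000 docs beyond the limit; at the current ~2,200 docs nothing is dropped, which is why the problem has not surfaced yet.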
+ +### What we already have + +- `buildDependencyGraph(docs)` — pure function, returns `Map` +- `pageRank(graph)` — pure function, weighted with dangling nodes +- `connectedComponents(graph)` — pure function, BFS on undirected graph +- `shortestPath(graph, from, to)` — pure function, BFS on directed graph +- File watcher that detects changes and triggers incremental re-indexing +- Storage paths at `~/.dev-agent/indexes/{hash}/` with `metadata.json` and `watcher-snapshot` + +--- + +## Proposed architecture + +### Current flow (what we're fixing) + +``` +┌──────────────────────────────────────────────────────────┐ +│ dev_map / dev_refs │ +│ │ +│ getAll(limit: 10000) ──────────► Antfly │ +│ │ (fetch ALL docs) │ +│ │ ~250MB for 50k docs │ +│ ▼ │ +│ buildDependencyGraph() │ +│ │ rebuild from scratch every time │ +│ ▼ │ +│ pageRank() / shortestPath() │ +└──────────────────────────────────────────────────────────┘ + +Problem: fetches ALL docs (truncated at 10k), rebuilds graph every call +``` + +### Proposed flow + +``` +┌──────────────────────────────────────────────────────────┐ +│ Index time (dev index) │ +│ │ +│ scan ──► prepareDocuments ──► linearMerge ──► Antfly │ +│ │ │ +│ │ NEW: also build graph │ +│ ▼ │ +│ buildDependencyGraph() │ +│ │ │ +│ ▼ │ +│ dependency-graph.json (~1-5MB) │ +│ ~/.dev-agent/indexes/{hash}/ │ +└──────────────────────────────────────────────────────────┘ + +┌──────────────────────────────────────────────────────────┐ +│ dev_map / dev_refs (query time) │ +│ │ +│ Load dependency-graph.json ──► Map │ +│ │ ~50ms for 5MB │ +│ │ (no getAll, no Antfly fetch) │ +│ ▼ │ +│ pageRank() / shortestPath() / connectedComponents() │ +└──────────────────────────────────────────────────────────┘ + +Fix: graph built once at index time, loaded from disk at query time +``` + +### Incremental updates (file watcher) + +``` +┌──────────────────────────────────────────────────────────┐ +│ File change detected │ +│ │ +│ @parcel/watcher: files A, B changed; file C 
deleted │ +│ │ │ +│ ▼ │ +│ scan changed files ──► batchUpsertAndDelete ──► Antfly │ +│ │ │ +│ │ NEW: also update graph │ +│ ▼ │ +│ Load existing graph │ +│ Remove edges for changed/deleted files │ +│ Add edges from re-scanned callees │ +│ Save updated graph │ +│ │ +│ O(changed files), not O(all files) │ +└──────────────────────────────────────────────────────────┘ +``` + +### Storage layout + +``` +~/.dev-agent/indexes/{hash}/ + ├── metadata.json (existing — index config) + ├── watcher-snapshot (existing — @parcel/watcher state) + └── dependency-graph.json (NEW — ~1-5MB, serialized graph) +``` + +### Graph JSON format + +```json +{ + "version": 1, + "generatedAt": "2026-03-31T20:00:00Z", + "nodeCount": 2214, + "edgeCount": 8456, + "graph": { + "src/services/search.ts": [ + { "target": "src/vector/index.ts", "weight": 1.414 }, + { "target": "src/scanner/types.ts", "weight": 1.0 } + ] + } +} +``` + +### Consumer changes + +| Consumer | Before | After | +|----------|--------|-------| +| `dev_map` (generateCodebaseMap) | `getAll(10000)` → build graph → PageRank | Load cached graph → PageRank | +| `dev_refs dependsOn` | `getAll(10000)` → build graph → shortestPath | Load cached graph → shortestPath | +| `dev_map` (directory tree) | Still needs `getAll` for component counts + exports | Unchanged — separate concern | + +**Important:** `generateCodebaseMap` still needs `getAll` for the directory tree +(component counts, exports). But the graph algorithms no longer depend on it. +The directory tree already has its own limit handling. Only the graph operations +are decoupled. + +### Incremental updates + +When the file watcher detects changes and calls `applyIncremental`: +1. Load existing graph JSON +2. Remove edges from changed/deleted files +3. Add edges from newly scanned files' callees +4. Save updated graph JSON + +This is O(changed files), not O(all files). The graph stays up to date without +a full rebuild. 
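The four steps above can be traced on a toy graph. This is an illustration under stated assumptions: `applyChanges` is a hypothetical helper that takes already-built per-file edges rather than scanned documents, and the `WeightedEdge` shape follows the graph JSON format shown earlier.

```typescript
interface WeightedEdge {
  target: string;
  weight: number;
}

type Graph = Map<string, WeightedEdge[]>;

// Returns a new graph; the input map is left untouched.
function applyChanges(existing: Graph, rescanned: Graph, deletedFiles: string[]): Graph {
  const updated: Graph = new Map(existing);
  // Drop the entries of deleted files.
  for (const file of deletedFiles) updated.delete(file);
  // Overwrite the entries of changed or new files with their re-scanned edges.
  for (const [file, edges] of rescanned) updated.set(file, edges);
  return updated;
}

const before: Graph = new Map();
before.set("a.ts", [{ target: "b.ts", weight: 1 }]);
before.set("b.ts", [{ target: "c.ts", weight: 1 }]);
before.set("c.ts", []);

// a.ts was edited and now calls two things in c.ts; b.ts was deleted.
const rescanned: Graph = new Map();
rescanned.set("a.ts", [{ target: "c.ts", weight: Math.sqrt(2) }]);

const after = applyChanges(before, rescanned, ["b.ts"]);
console.log([...after.keys()]);
```

Only the entries for `a.ts` and `b.ts` are touched, which is what keeps the work proportional to the number of changed files. In this sketch, an unchanged file that still lists a deleted file as a target keeps that edge until it is re-scanned itself.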
+ +--- + +## Parts + +| Part | Description | Risk | +|------|-------------|------| +| [3.1](./3.1-index-time-graph.md) | Build and save dependency graph at index time | Low — additive | +| [3.2](./3.2-load-on-demand.md) | Load cached graph in dev_map + dev_refs, remove getAll dependency | Medium — changes data flow | +| [3.3](./3.3-incremental-graph.md) | Incremental graph updates via file watcher | Medium — new update path | + +--- + +## Decisions + +| Decision | Rationale | Alternatives | +|----------|-----------|-------------| +| JSON file, not DB | Graph is small (~1-5MB), read-only between updates, JSON is debuggable | SQLite: overkill. Antfly: no server-side graph API. | +| Build at index time | Amortizes cost. Graph only changes when index changes. | Build on demand: current approach, doesn't scale. | +| Incremental updates | Watcher already knows which files changed. Graph update is O(changed). | Full rebuild on every change: wasteful at scale. | +| Keep getAll for directory tree | Directory tree needs component counts and exports which aren't in the graph. | Index component counts separately: premature optimization. | +| Version field in JSON | Allows schema evolution without migration headaches. | No version: breaks silently on format change. | + +--- + +## Risk register + +| Risk | Likelihood | Impact | Mitigation | +|------|-----------|--------|------------| +| Graph JSON out of sync with index | Medium | Medium | Rebuild graph on `dev index --force`. Watcher keeps it updated incrementally. | +| Graph file corrupted or missing | Low | Low | Fallback to current approach (getAll + build). Never crash. | +| Graph file too large for huge repos | Low | Low | 50k nodes × 5 edges × ~50 bytes = ~12MB. Acceptable. | +| Incremental update misses edge cases | Medium | Medium | Full rebuild always available via `dev index --force`. Incremental is best-effort. | +| JSON parse performance | Low | Low | 5MB JSON parses in <50ms. Not a bottleneck. 
| + +--- + +## Test strategy + +| Test | Priority | What it verifies | +|------|----------|-----------------| +| Build graph from scan results and save JSON | P0 | Index time graph generation | +| Load graph JSON and run PageRank | P0 | Cached graph → algorithms work | +| Missing graph file → fallback to getAll | P0 | Graceful degradation | +| Corrupted graph file → fallback to getAll | P0 | Error handling | +| Incremental: add file → graph updated | P0 | Watcher integration | +| Incremental: delete file → edges removed | P0 | Watcher integration | +| Graph version mismatch → full rebuild | P1 | Schema evolution | +| 10k+ node graph serialization round-trip | P1 | Scale | +| dev_map uses cached graph (not getAll) | P1 | Integration | +| dev_refs dependsOn uses cached graph | P1 | Integration | + +--- + +## Verification checklist + +- [ ] `dev index` produces `dependency-graph.json` alongside `metadata.json` +- [ ] `dev_map` loads cached graph instead of calling `getAll` for PageRank +- [ ] `dev_refs dependsOn` loads cached graph +- [ ] Missing graph file → falls back to getAll (current behavior) +- [ ] `dev index --force` rebuilds graph from scratch +- [ ] File watcher change → graph incrementally updated +- [ ] Graph JSON < 15MB for 50k-node repo +- [ ] PageRank on cached 50k-node graph < 500ms +- [ ] `pnpm build && pnpm test` passes + +--- + +## Dependencies + +- Phase 2 (indexing rethink) — merged +- MCP Phase 1 Part 1.6 (graph algorithms) — merged (pending PR #19) +- `getStorageFilePaths` in `packages/core/src/storage/path.ts` — add `dependencyGraph` path diff --git a/.claude/da-plans/mcp/phase-1-mcp-tools-improvement/1.6-pagerank-map.md b/.claude/da-plans/mcp/phase-1-mcp-tools-improvement/1.6-pagerank-map.md index 005d038..4ea4875 100644 --- a/.claude/da-plans/mcp/phase-1-mcp-tools-improvement/1.6-pagerank-map.md +++ b/.claude/da-plans/mcp/phase-1-mcp-tools-improvement/1.6-pagerank-map.md @@ -1,18 +1,24 @@ -# Part 1.6: PageRank File Ranking for dev_map +# Part 
1.6: Graph Algorithms for dev_map and dev_refs > **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. -**Goal:** Replace simple reference counting in `dev_map` hot paths with PageRank over the call graph for more meaningful file ranking. +**Goal:** Replace simple reference counting in `dev_map` hot paths with weighted PageRank +over the call graph for more meaningful file ranking. -**User stories:** US-12 (meaningful file importance in codebase map) +Connected components and shortest path are implemented alongside PageRank in the +same `graph.ts` module but wired into consumers (formatCodebaseMap, dev_refs) in a +follow-up PR. + +**User stories:** US-12 (meaningful file importance) **Inspiration:** [aider's repo map](https://aider.chat/docs/repomap.html) — Apache 2.0. Uses PageRank over dependency graph to identify architecturally central files. We already have the call graph data from the scanner (callees metadata in Antfly). **Files:** -- Create: `packages/core/src/map/pagerank.ts` -- Create: `packages/core/src/map/__tests__/pagerank.test.ts` -- Modify: `packages/core/src/map/index.ts` -- Modify: `packages/core/src/map/types.ts` +- Create: `packages/core/src/map/graph.ts` (PageRank, graph builder, connected components, shortest path) +- Create: `packages/core/src/map/__tests__/graph.test.ts` +- Modify: `packages/core/src/map/index.ts` (replace computeHotPaths with PageRank) +- Modify: `packages/core/src/map/types.ts` (add `score` to HotPath) +- Modify: `packages/core/src/map/__tests__/map.test.ts` (rewrite callers→callees tests) --- @@ -30,6 +36,24 @@ This is a good proxy but misses graph structure. A file could have few direct re We already have the data: every indexed document has `callees: [{ name, file, line }]` metadata. This is the dependency graph. 
+**Important finding:** `callers` metadata is NOT stored in the index — the scanner +comment says "callers are computed at query time via reverse lookup." The current +`computeHotPaths` reads `callers` from metadata (line 455 of map/index.ts) but this +field is always empty for real indexed docs. Only the `callees` path (lines 469-480) +produces results in production. This means switching to callees-only `buildDependencyGraph` +is not a regression — it matches what actually works. + +--- + +## Review findings (addressed before implementation) + +| Finding | Fix | Risk mitigation | +|---------|-----|-----------------| +| Plan only uses `callees`, current code uses both | `callers` is dead code in production (not stored in index). Use `callees` only. | Update tests from callers to callees mock data. Add comment explaining why. | +| `incomingRefs` changes meaning | Keep `incomingRefs` as actual incoming edge count. Add `score` field for PageRank value. Sort by `score`. | Backward compatible — `incomingRefs` is still a real count. Display stays "refs". | +| Existing tests will break | Update to use callees data + relative ordering assertions. | PageRank unit tests have exact assertions; integration tests verify behavior. | +| Performance claim unverified | Add perf test: 2k-node graph, assert <50ms. | Generous threshold avoids CI flakiness. | + --- ## Task 1: Implement PageRank algorithm @@ -41,59 +65,102 @@ Pure function — takes a graph, returns ranked nodes. No I/O. 
```typescript // In packages/core/src/map/__tests__/pagerank.test.ts -import { pageRank } from '../pagerank'; +import { pageRank, buildDependencyGraph, type WeightedEdge } from '../pagerank'; + +function edge(target: string, weight = 1): WeightedEdge { + return { target, weight }; +} describe('pageRank', () => { it('should rank nodes by importance', () => { // A -> B -> C, A -> C // C should rank highest (most incoming from important nodes) - const graph = new Map<string, string[]>(); - graph.set('A', ['B', 'C']); - graph.set('B', ['C']); - graph.set('C', []); + const graph = new Map<string, WeightedEdge[]>(); + graph.set('A', [edge('B'), edge('C')]); + graph.set('B', [edge('C')]); const ranks = pageRank(graph); - expect(ranks.get('C')).toBeGreaterThan(ranks.get('A')!); - expect(ranks.get('C')).toBeGreaterThan(ranks.get('B')!); + expect(ranks.get('C')!).toBeGreaterThan(ranks.get('A')!); + expect(ranks.get('C')!).toBeGreaterThan(ranks.get('B')!); }); it('should handle cycles', () => { - // A -> B -> A (mutual dependency) - const graph = new Map<string, string[]>(); - graph.set('A', ['B']); - graph.set('B', ['A']); + const graph = new Map<string, WeightedEdge[]>(); + graph.set('A', [edge('B')]); + graph.set('B', [edge('A')]); const ranks = pageRank(graph); - // Both should have similar rank - expect(Math.abs(ranks.get('A')! - ranks.get('B')!)).toBeLessThan(0.1); + expect(Math.abs(ranks.get('A')! 
- ranks.get('B')!)).toBeLessThan(0.01); }); it('should handle disconnected nodes', () => { - const graph = new Map<string, string[]>(); - graph.set('A', ['B']); - graph.set('B', []); - graph.set('C', []); // No connections + const graph = new Map<string, WeightedEdge[]>(); + graph.set('A', [edge('B')]); + // B and C have no outgoing edges (dangling) + // B should rank higher — it has an incoming edge from A + + const ranks = pageRank(graph); + expect(ranks.get('B')!).toBeGreaterThan(ranks.get('C') || 0); + }); + + it('should handle dangling nodes (no outgoing edges)', () => { + // types.ts is imported by many but exports nothing callable + const graph = new Map<string, WeightedEdge[]>(); + graph.set('a.ts', [edge('types.ts'), edge('b.ts')]); + graph.set('b.ts', [edge('types.ts')]); + // types.ts has no outgoing edges — dangling node + + const ranks = pageRank(graph); + // types.ts should rank highest (most incoming) + expect(ranks.get('types.ts')!).toBeGreaterThan(ranks.get('a.ts')!); + // Dangling node's rank should be distributed, not lost + const totalRank = Array.from(ranks.values()).reduce((a, b) => a + b, 0); + expect(totalRank).toBeCloseTo(1.0, 2); + }); + + it('should respect edge weights', () => { + const graph = new Map<string, WeightedEdge[]>(); + // A depends heavily on B (weight 10), lightly on C (weight 1) + graph.set('A', [edge('B', 10), edge('C', 1)]); const ranks = pageRank(graph); - expect(ranks.get('A')).toBeDefined(); - expect(ranks.get('C')).toBeDefined(); - // Connected nodes should rank higher than isolated expect(ranks.get('B')!).toBeGreaterThan(ranks.get('C')!); }); it('should return empty map for empty graph', () => { - const ranks = pageRank(new Map()); - expect(ranks.size).toBe(0); + expect(pageRank(new Map()).size).toBe(0); }); - it('should converge within iterations', () => { - // Large-ish graph - const graph = new Map<string, string[]>(); + it('should converge for large ring graph', () => { + const graph = new Map<string, WeightedEdge[]>(); for (let i = 0; i < 100; i++) { - graph.set(`node${i}`, [`node${(i + 1) % 100}`]); + graph.set(`node${i}`, [edge(`node${(i + 
1) % 100}`)]); } const ranks = pageRank(graph); expect(ranks.size).toBe(100); + // All nodes in a ring should have equal rank + const values = Array.from(ranks.values()); + const avg = values.reduce((a, b) => a + b, 0) / values.length; + for (const v of values) { + expect(v).toBeCloseTo(avg, 4); + } + }); + + it('should complete 2k-node graph in under 50ms', () => { + const graph = new Map<string, WeightedEdge[]>(); + for (let i = 0; i < 2000; i++) { + const edges: WeightedEdge[] = []; + for (let j = 0; j < 5; j++) { + edges.push(edge(`node${(i + j + 1) % 2000}`, Math.random() * 5)); + } + graph.set(`node${i}`, edges); + } + const start = Date.now(); + const ranks = pageRank(graph); + const duration = Date.now() - start; + console.log(`PageRank: 2000 nodes, 10000 edges, ${duration}ms`); + expect(ranks.size).toBe(2000); + expect(duration).toBeLessThan(50); }); }); ``` @@ -103,71 +170,102 @@ describe('pageRank', () => { Run: `pnpm test -- packages/core/src/map/__tests__/pagerank.test.ts` Expected: FAIL — module not found -- [ ] **Step 3: Implement PageRank** +- [ ] **Step 3: Implement weighted PageRank with dangling node handling** + +Learnings from studying aider's implementation (Apache 2.0, uses NetworkX): +- Weighted edges (sqrt-dampened reference counts) +- Dangling node handling (files with no outgoing edges distribute rank to all) +- Convergence check (stop early if delta < 1e-6) +- Standard damping 0.85, max 100 iterations (matches NetworkX defaults) + +No external dependency — hand-rolled (~60 lines). If we ever need more +graph algorithms, graphology (MIT, TS types, 1.6k stars) is the upgrade path. ```typescript // packages/core/src/map/pagerank.ts +export interface WeightedEdge { + target: string; + weight: number; +} + /** - * PageRank algorithm for ranking nodes in a directed graph. + * Weighted PageRank with dangling node handling and convergence. * Pure function — no I/O. * - * Inspired by aider's repo map (https://github.com/Aider-AI/aider). 
- * - * @param graph - Map of node -> outgoing edges (dependencies) - * @param damping - Damping factor (default 0.85, standard for PageRank) - * @param iterations - Number of iterations (default 20, sufficient for convergence) - * @returns Map of node -> rank score (higher = more important) + * Inspired by aider's repo map (https://github.com/Aider-AI/aider) + * which uses NetworkX PageRank over a weighted dependency graph. */ export function pageRank( - graph: Map<string, string[]>, + graph: Map<string, WeightedEdge[]>, damping = 0.85, - iterations = 20 + maxIterations = 100, + tolerance = 1e-6 ): Map<string, number> { + // Collect all nodes (sources + targets) const nodes = new Set<string>(); - for (const [src, targets] of graph) { + for (const [src, edges] of graph) { nodes.add(src); - for (const t of targets) nodes.add(t); + for (const e of edges) nodes.add(e.target); } if (nodes.size === 0) return new Map(); const n = nodes.size; - const ranks = new Map<string, number>(); - const initial = 1 / n; + let ranks = new Map<string, number>(); // Initialize equal rank - for (const node of nodes) { - ranks.set(node, initial); - } + for (const node of nodes) ranks.set(node, 1 / n); - // Build reverse graph (who points to me?) 
- const inbound = new Map<string, string[]>(); + // Build inbound map: target → [{ source, weight }] + const inbound = new Map<string, Array<{ source: string; weight: number }>>(); for (const node of nodes) inbound.set(node, []); - for (const [src, targets] of graph) { - for (const t of targets) { - inbound.get(t)?.push(src); + + // Build outgoing weight sums for normalization + const outWeightSum = new Map<string, number>(); + for (const [src, edges] of graph) { + let sum = 0; + for (const e of edges) { + inbound.get(e.target)?.push({ source: src, weight: e.weight }); + sum += e.weight; } + outWeightSum.set(src, sum); } - // Iterate - for (let i = 0; i < iterations; i++) { + // Identify dangling nodes (no outgoing edges) + const danglingNodes: string[] = []; + for (const node of nodes) { + if (!outWeightSum.has(node) || outWeightSum.get(node) === 0) { + danglingNodes.push(node); + } + } + + // Iterate until convergence or max iterations + for (let iter = 0; iter < maxIterations; iter++) { const newRanks = new Map<string, number>(); + // Dangling rank: sum of dangling nodes' ranks, distributed to all + let danglingRank = 0; + for (const d of danglingNodes) danglingRank += ranks.get(d) || 0; + for (const node of nodes) { let sum = 0; - const sources = inbound.get(node) || []; - for (const src of sources) { - const outDegree = graph.get(src)?.length || 1; - sum += (ranks.get(src) || 0) / outDegree; + for (const { source, weight } of inbound.get(node) || []) { + const srcOutWeight = outWeightSum.get(source) || 1; + sum += ((ranks.get(source) || 0) * weight) / srcOutWeight; } - newRanks.set(node, (1 - damping) / n + damping * sum); + // Standard PageRank formula with dangling node contribution + newRanks.set(node, (1 - damping) / n + damping * (sum + danglingRank / n)); } - // Update ranks - for (const [node, rank] of newRanks) { - ranks.set(node, rank); + // Check convergence + let delta = 0; + for (const node of nodes) { + delta += Math.abs((newRanks.get(node) || 0) - (ranks.get(node) || 0)); } + + ranks = newRanks; + if (delta < tolerance) break; } return ranks; 
@@ -198,32 +296,40 @@ git commit -m "feat(core): add PageRank algorithm for file importance ranking" import type { SearchResult } from '../vector/types.js'; /** - * Build a file dependency graph from indexed documents. - * Uses callees metadata to create edges: file A calls something in file B → A depends on B. + * Build a weighted file dependency graph from indexed documents. + * Uses callees metadata: file A calls N things in file B → edge weight = sqrt(N). + * sqrt dampening (from aider) prevents high-frequency references from dominating. * Pure function. */ -export function buildDependencyGraph(docs: SearchResult[]): Map { - const graph = new Map(); +export function buildDependencyGraph(docs: SearchResult[]): Map { + // Count raw references per (source, target) pair + const rawCounts = new Map>(); for (const doc of docs) { const sourceFile = doc.metadata.path as string; if (!sourceFile) continue; - if (!graph.has(sourceFile)) graph.set(sourceFile, []); + if (!rawCounts.has(sourceFile)) rawCounts.set(sourceFile, new Map()); const callees = doc.metadata.callees as Array<{ file?: string }> | undefined; if (callees && Array.isArray(callees)) { for (const callee of callees) { if (callee.file && callee.file !== sourceFile) { - graph.get(sourceFile)!.push(callee.file); + const targets = rawCounts.get(sourceFile)!; + targets.set(callee.file, (targets.get(callee.file) || 0) + 1); } } } } - // Deduplicate edges - for (const [node, edges] of graph) { - graph.set(node, [...new Set(edges)]); + // Convert to weighted edges with sqrt dampening + const graph = new Map(); + for (const [source, targets] of rawCounts) { + const edges: WeightedEdge[] = []; + for (const [target, count] of targets) { + edges.push({ target, weight: Math.sqrt(count) }); + } + graph.set(source, edges); } return graph; @@ -234,8 +340,8 @@ export function buildDependencyGraph(docs: SearchResult[]): Map { - it('should build graph from callees metadata', () => { - const docs: SearchResult[] = [ + 
it('should build weighted graph from callees metadata', () => { + const docs = [ { id: '1', score: 0.9, metadata: { path: 'src/a.ts', callees: [{ name: 'foo', file: 'src/b.ts', line: 10 }], @@ -246,38 +352,45 @@ describe('buildDependencyGraph', () => { }}, ]; - const graph = buildDependencyGraph(docs); - expect(graph.get('src/a.ts')).toContain('src/b.ts'); - expect(graph.get('src/b.ts')).toContain('src/c.ts'); + const graph = buildDependencyGraph(docs as any); + const aEdges = graph.get('src/a.ts')!; + expect(aEdges.some(e => e.target === 'src/b.ts')).toBe(true); + expect(aEdges[0].weight).toBe(1); // sqrt(1) = 1 }); - it('should handle docs without callees metadata', () => { - const docs: SearchResult[] = [ - { id: '1', score: 0.9, metadata: { path: 'src/types.ts', type: 'interface' } }, - { id: '2', score: 0.9, metadata: { + it('should sqrt-dampen weights for multiple references', () => { + const docs = [ + { id: '1', score: 0.9, metadata: { path: 'src/a.ts', - callees: [{ name: 'MyType', file: 'src/types.ts', line: 1 }], + callees: [ + { name: 'foo', file: 'src/b.ts', line: 10 }, + { name: 'bar', file: 'src/b.ts', line: 20 }, + { name: 'baz', file: 'src/b.ts', line: 30 }, + { name: 'qux', file: 'src/b.ts', line: 40 }, + ], }}, ]; - const graph = buildDependencyGraph(docs); - expect(graph.get('src/a.ts')).toContain('src/types.ts'); - expect(graph.get('src/types.ts')).toEqual([]); + const graph = buildDependencyGraph(docs as any); + const aEdges = graph.get('src/a.ts')!; + expect(aEdges.length).toBe(1); // deduplicated to one edge + expect(aEdges[0].target).toBe('src/b.ts'); + expect(aEdges[0].weight).toBe(2); // sqrt(4) = 2 }); - it('should deduplicate edges', () => { - const docs: SearchResult[] = [ - { id: '1', score: 0.9, metadata: { + it('should handle docs without callees metadata', () => { + const docs = [ + { id: '1', score: 0.9, metadata: { path: 'src/types.ts', type: 'interface' } }, + { id: '2', score: 0.9, metadata: { path: 'src/a.ts', - callees: [ - { 
name: 'foo', file: 'src/b.ts', line: 10 }, - { name: 'bar', file: 'src/b.ts', line: 20 }, - ], + callees: [{ name: 'MyType', file: 'src/types.ts', line: 1 }], }}, ]; - const graph = buildDependencyGraph(docs); - expect(graph.get('src/a.ts')).toEqual(['src/b.ts']); + const graph = buildDependencyGraph(docs as any); + expect(graph.get('src/a.ts')!.some(e => e.target === 'src/types.ts')).toBe(true); + // types.ts has no outgoing edges (not even in the graph as a source) + expect(graph.has('src/types.ts')).toBe(false); }); }); ``` @@ -298,12 +411,21 @@ git commit -m "feat(core): add dependency graph builder from indexed callees" Replace the current `computeHotPaths` function (simple reference count) with PageRank-based ranking: ```typescript -import { buildDependencyGraph, pageRank } from './pagerank.js'; +import { buildDependencyGraph, pageRank } from './graph.js'; function computeHotPaths(docs: SearchResult[], maxPaths: number): HotPath[] { const graph = buildDependencyGraph(docs); const ranks = pageRank(graph); + // Count real incoming edges per file (distinct source files) + const incomingCounts = new Map>(); + for (const [src, edges] of graph) { + for (const e of edges) { + if (!incomingCounts.has(e.target)) incomingCounts.set(e.target, new Set()); + incomingCounts.get(e.target)!.add(src); + } + } + // Build a lookup for primary component name per file const componentByFile = new Map(); for (const doc of docs) { @@ -313,48 +435,327 @@ function computeHotPaths(docs: SearchResult[], maxPaths: number): HotPath[] { } } - // Sort by PageRank score and take top N + // Sort by PageRank score, display real incoming ref count return Array.from(ranks.entries()) .sort((a, b) => b[1] - a[1]) .slice(0, maxPaths) .map(([file, score]) => ({ file, - incomingRefs: Math.round(score * 1000), // Normalized PageRank for display + incomingRefs: incomingCounts.get(file)?.size ?? 
0, + score, primaryComponent: componentByFile.get(file), })); } ``` -- [ ] **Step 2: Update HotPath type if needed** +- [ ] **Step 2: Update HotPath type** -In `packages/core/src/map/types.ts`, consider adding a `pageRankScore` field: +In `packages/core/src/map/types.ts`, add `score` field: ```typescript export interface HotPath { file: string; - incomingRefs: number; + incomingRefs: number; // actual count of files that depend on this file + score: number; // PageRank score (used for sorting) primaryComponent?: string; - pageRankScore?: number; // Optional, for debugging/verbose output } ``` -- [ ] **Step 3: Run full test suite** +Sort by `score` (PageRank), display `incomingRefs` (real count) — keeps display meaningful. + +- [ ] **Step 3: Rewrite existing hot paths tests (callers → callees)** + +Three tests in `map.test.ts` use `callers` mock data which is dead code in production. +Rewrite them to use `callees` data and assert relative ordering (not exact counts): + +**Test 1: "should compute hot paths from callers data" (line 288)** +→ Rewrite as "should compute hot paths from callees data" + - Mock docs with `callees` pointing to a common target file + - Assert the target file ranks first (PageRank should rank it highest) + - Assert `incomingRefs` is the real incoming edge count + - Assert `score` is a positive number + +**Test 2: "should limit hot paths to maxHotPaths" (line 365)** +→ Keep structure, change mock data from `callers` to `callees` + - Assert `hotPaths.length <= maxHotPaths` + - Assert sorted by score descending + +**Test 3: "should format hot paths in output" (line 411)** +→ Keep structure, change mock data from `callers` to `callees` + - Assert output contains "refs" label + - Assert file names appear in output + +**Test 4 (existing, unchanged): "should not include hot paths when disabled" (line 390)** +→ No change needed — doesn't use callers data + +- [ ] **Step 4: Add test for empty callees array** + +```typescript +it('should handle docs 
with empty callees array', () => { + const docs = [ + { id: '1', score: 0.9, metadata: { path: 'src/types.ts', callees: [] } }, + ]; + const graph = buildDependencyGraph(docs as any); + expect(graph.get('src/types.ts')).toEqual([]); +}); +``` + +- [ ] **Step 5: Run full test suite** Run: `pnpm build && pnpm test` -Expected: ALL PASS (hot paths test may need updating for new ranking order) +Expected: ALL PASS - [ ] **Step 4: Commit** ```bash -git add packages/core/src/map/index.ts packages/core/src/map/types.ts packages/core/src/map/pagerank.ts packages/core/src/map/__tests__/ +git add packages/core/src/map/ git commit -m "feat(core): use PageRank for dev_map hot paths ranking" ``` --- +## Task 4 (deferred): Connected components for subsystem identification + +**Deferred to follow-up PR.** Implement the algorithm and tests in this PR +(it's in graph.ts alongside PageRank), but don't wire it into CodebaseMap or +formatCodebaseMap. Wire it when there's a consumer. + +Identifies clusters of files that form independent subsystems. Uses the +undirected version of the dependency graph (A depends on B = A and B are connected). + +**Value for agents:** "This codebase has 3 isolated subsystems: core (45 files), +CLI (12 files), and MCP server (18 files)." Helps agents scope their work. + +- [ ] **Step 1: Implement connected components (BFS)** + +```typescript +// In graph.ts + +/** + * Find connected components in the dependency graph (undirected). + * Returns groups of files sorted by size (largest first). + * Pure function — no I/O. 
+ */ +export function connectedComponents( + graph: Map +): string[][] { + // Build undirected adjacency list + const adj = new Map>(); + const allNodes = new Set(); + for (const [src, edges] of graph) { + allNodes.add(src); + if (!adj.has(src)) adj.set(src, new Set()); + for (const e of edges) { + allNodes.add(e.target); + if (!adj.has(e.target)) adj.set(e.target, new Set()); + adj.get(src)!.add(e.target); + adj.get(e.target)!.add(src); + } + } + + const visited = new Set(); + const components: string[][] = []; + + for (const node of allNodes) { + if (visited.has(node)) continue; + // BFS from this node + const component: string[] = []; + const queue = [node]; + visited.add(node); + while (queue.length > 0) { + const current = queue.shift()!; + component.push(current); + for (const neighbor of adj.get(current) || []) { + if (!visited.has(neighbor)) { + visited.add(neighbor); + queue.push(neighbor); + } + } + } + components.push(component); + } + + // Sort by size (largest first) + return components.sort((a, b) => b.length - a.length); +} +``` + +- [ ] **Step 2: Write tests** + +```typescript +describe('connectedComponents', () => { + it('should identify separate clusters', () => { + const graph = new Map(); + // Cluster 1: A -> B -> C + graph.set('A', [edge('B')]); + graph.set('B', [edge('C')]); + // Cluster 2: D -> E + graph.set('D', [edge('E')]); + + const components = connectedComponents(graph); + expect(components.length).toBe(2); + expect(components[0].length).toBe(3); // A, B, C + expect(components[1].length).toBe(2); // D, E + }); + + it('should treat the graph as undirected', () => { + const graph = new Map(); + // A -> B, C -> B (B connects A and C even though edges point inward) + graph.set('A', [edge('B')]); + graph.set('C', [edge('B')]); + + const components = connectedComponents(graph); + expect(components.length).toBe(1); // All connected + expect(components[0].length).toBe(3); + }); + + it('should handle single-node components', () => { + const 
graph = new Map(); + graph.set('A', [edge('B')]); + graph.set('lonely', []); // Isolated node + + const components = connectedComponents(graph); + expect(components.length).toBe(2); + }); + + it('should return empty for empty graph', () => { + expect(connectedComponents(new Map()).length).toBe(0); + }); +}); +``` + +- [ ] **Step 3: Commit (algorithm + tests only, no wiring)** + +```bash +git add packages/core/src/map/graph.ts packages/core/src/map/__tests__/graph.test.ts +git commit -m "feat(core): add connected components algorithm (consumer wired in follow-up)" +``` + +--- + +## Task 5 (deferred): Shortest path for call chain tracing + +**Deferred to follow-up PR.** Implement algorithm and tests in this PR. +Wire into dev_refs as "trace path" feature in a separate PR. + +Answers "how does file A reach file B?" — BFS for hop count on the +unweighted dependency graph. Not Dijkstra — agents care about hops, not weights. + +**Value for agents:** Instead of multiple `dev_refs` calls, one query shows: +"auth.ts → user-service.ts → repository.ts → database.ts (3 hops)" + +- [ ] **Step 1: Implement shortest path (BFS)** + +```typescript +// In graph.ts + +/** + * Find shortest path between two files in the dependency graph. + * Uses BFS (unweighted — hop count, not edge weight). + * Returns the path as an array of files, or null if unreachable. + * Pure function — no I/O. 
+ */ +export function shortestPath( + graph: Map, + from: string, + to: string +): string[] | null { + if (from === to) return [from]; + if (!graph.has(from)) return null; + + const visited = new Set([from]); + const parent = new Map(); + const queue = [from]; + + while (queue.length > 0) { + const current = queue.shift()!; + for (const { target } of graph.get(current) || []) { + if (visited.has(target)) continue; + visited.add(target); + parent.set(target, current); + if (target === to) { + // Reconstruct path + const path = [to]; + let node = to; + while (parent.has(node)) { + node = parent.get(node)!; + path.unshift(node); + } + return path; + } + queue.push(target); + } + } + + return null; // Unreachable +} +``` + +- [ ] **Step 2: Write tests** + +```typescript +describe('shortestPath', () => { + it('should find direct path', () => { + const graph = new Map(); + graph.set('A', [edge('B')]); + + expect(shortestPath(graph, 'A', 'B')).toEqual(['A', 'B']); + }); + + it('should find multi-hop path', () => { + const graph = new Map(); + graph.set('A', [edge('B')]); + graph.set('B', [edge('C')]); + graph.set('C', [edge('D')]); + + expect(shortestPath(graph, 'A', 'D')).toEqual(['A', 'B', 'C', 'D']); + }); + + it('should find shortest among multiple paths', () => { + const graph = new Map(); + graph.set('A', [edge('B'), edge('C')]); + graph.set('B', [edge('D')]); + graph.set('C', [edge('D')]); // A->C->D is same length as A->B->D + + const path = shortestPath(graph, 'A', 'D'); + expect(path?.length).toBe(3); // 3 nodes = 2 hops + }); + + it('should return null for unreachable target', () => { + const graph = new Map(); + graph.set('A', [edge('B')]); + graph.set('C', [edge('D')]); // Disconnected + + expect(shortestPath(graph, 'A', 'D')).toBeNull(); + }); + + it('should return single-node path for self', () => { + const graph = new Map(); + graph.set('A', [edge('B')]); + + expect(shortestPath(graph, 'A', 'A')).toEqual(['A']); + }); + + it('should return null for unknown 
source', () => { + expect(shortestPath(new Map(), 'X', 'Y')).toBeNull(); + }); +}); +``` + +- [ ] **Step 3: Commit (algorithm + tests only, no wiring)** + +```bash +git add packages/core/src/map/graph.ts packages/core/src/map/__tests__/graph.test.ts +git commit -m "feat(core): add shortest path algorithm (consumer wired in follow-up)" +``` + +--- + ## Notes -- **Existing map tests need updating:** Tests in `map.test.ts` mock `callers` data. After this change, `computeHotPaths` uses `callees` via `buildDependencyGraph`. Update mock data to use `callees` instead of `callers`, and adjust expected ranking order since PageRank differs from simple ref counting. -- PageRank is ~O(V + E) per iteration × 20 iterations. For a 2k-file repo with 10k edges, this is <10ms. Negligible. -- The `incomingRefs` field now shows normalized PageRank score, not raw count. Display label in `formatCodebaseMap` could change from "refs" to "importance" or keep "refs" for familiarity. -- Attribution: add to ARCHITECTURE.md: "File importance ranking inspired by [aider's repo map](https://github.com/Aider-AI/aider)" +- **Existing map tests:** 3 tests use `callers` mock data (dead code). Concrete rewrites specified in Task 3 Step 3. +- **Performance:** PageRank is ~O(V + E) per iteration × 100 max iterations. <10ms for 2k files. Perf test verifies. +- **Display:** `incomingRefs` = real incoming edge count. `score` = PageRank value for sorting. Label stays "refs". +- **Deferred consumers:** Connected components → `formatCodebaseMap`. Shortest path → `dev_refs`. Both in follow-up PRs. +- **All algorithms in one file:** `graph.ts` contains pageRank, buildDependencyGraph, connectedComponents, shortestPath. ~115 lines total. 
+- **Attribution:** "File importance ranking inspired by [aider's repo map](https://github.com/Aider-AI/aider)" diff --git a/.claude/da-plans/mcp/phase-1-mcp-tools-improvement/overview.md b/.claude/da-plans/mcp/phase-1-mcp-tools-improvement/overview.md index a482a85..ddc4ec2 100644 --- a/.claude/da-plans/mcp/phase-1-mcp-tools-improvement/overview.md +++ b/.claude/da-plans/mcp/phase-1-mcp-tools-improvement/overview.md @@ -1,6 +1,6 @@ # Phase 1: MCP Tools Improvement -**Status:** Draft +**Status:** Complete (all parts merged) ## Context @@ -67,8 +67,18 @@ This is an acceptable trade-off: line count is a cheap stat call. | [1.2](./1.2-index-based-analysis.md) | Add `getDocsByFilePath`, index analysis path, wire VectorStorage | Medium — new code path | | [1.3](./1.3-cleanup.md) | Consolidate reads, remove dead code, remove GitHub from health | Low — cleanup | | [1.4](./1.4-agent-usability.md) | Merge status/health, add error suggestions, rename params, JSON output | Medium — tool surface change | -| [1.5](./1.5-ast-pattern-analysis.md) | AST-based pattern analysis via ast-grep (optional dep) | Low — additive, regex fallback | -| [1.6](./1.6-pagerank-map.md) | PageRank file ranking for dev_map hot paths | Low — replaces simple counting | +| [1.5](./1.5-ast-pattern-analysis.md) | AST-based pattern analysis via tree-sitter queries | Low — additive, regex fallback | +| [1.6](./1.6-pagerank-map.md) | Graph algorithms: PageRank, connected components, shortest path | Low — replaces simple counting | + +### Part 1.6 Commit Plan + +| # | Commit | What changes | +|---|--------|-------------| +| 1 | `feat(core): add graph algorithms — PageRank, connected components, shortest path` | New `graph.ts` with pure functions + `graph.test.ts` (~20 tests). No wiring. | +| 2 | `feat(core): replace ref counting with PageRank in dev_map` | Wire PageRank into `computeHotPaths`. Add `score` to `HotPath`. Rewrite 3 callers→callees tests. 
| +| 3 | `feat(core): wire connected components into dev_map output` | Add `components` to `CodebaseMap` + `formatCodebaseMap`. | +| 4 | `feat(mcp): add path tracing to dev_refs` | New `trace` param on RefsAdapter. Schema + tests. | +| 5 | `docs: complete MCP Phase 1, attribution, plan status` | Plan updates, aider attribution, mark Phase 1 complete. | --- diff --git a/.claude/scratchpad.md b/.claude/scratchpad.md index b00ea8f..dbdea2d 100644 --- a/.claude/scratchpad.md +++ b/.claude/scratchpad.md @@ -3,6 +3,7 @@ ## Known Limitations - **`getDocsByFilePath` fetches all docs client-side (capped at 5k).** Uses `getAll(limit: 5000)` + exact path filter. Fine for single repos (dev-agent has ~2,200 docs). Won't scale to monorepos with 50k+ files. Future fix: server-side path filter in Antfly SDK. +- **Two clones of the same repo share one index.** Storage path is hashed from git remote URL (`prosdevlab/dev-agent` → `a1b2c3d4`). Two local clones on different branches share the same index, graph cache, and watcher snapshot. Stale data possible if branches diverge significantly. Pre-existing design — not introduced by graph cache. Fix would be to include branch or worktree path in the hash. ## Open Questions @@ -11,7 +12,12 @@ ## Future Work - Antfly SDK: server-side path filter for `getDocsByFilePath` (eliminates 5k cap) -- PageRank for `dev_map` hot paths (MCP Phase 1, Part 1.6) +- Wire `shortestPath` into `dev_refs` as a "trace path" feature (graph.ts is ready, adapter wiring is separate scope) +- Wire `connectedComponents` into `dev_map` verbose output (graph.ts is ready) +- Betweenness centrality — identifies bridge files between subsystems. Worth adding if agents need refactoring guidance. graphology (MIT, 1.6k stars) is the upgrade path if we need more than 3 hand-rolled algorithms. +- **Connected components hub filtering** — widely-shared utility files (e.g., logger.ts imported by 50+ files) merge separate subsystems into one component. 
Filter out hub nodes (high in-degree) before computing components for better subsystem identification. +- **PageRank at 10k+ nodes** — convergence tolerance 1e-6 may require all 100 iterations for large sparse graphs. Monitor performance. Consider reducing maxIterations or loosening tolerance for dev_map where approximate ranks are fine. +- **getAll(limit: 10000) truncation** — medium-large monorepos may exceed 10k docs. Warning is logged but results are silently incomplete. Long-term: paginate or make limit configurable. - E2E tests in CI — blocked on Antfly memory requirements vs GitHub runner limits (7GB) - **Python language support** — tree-sitter-python WASM is ~300KB, already in tree-sitter-wasms. Needs a Python scanner (document extraction) + Python-specific pattern rules. High demand — large ecosystem. Worth a standalone plan covering: scanner, pattern rules, test fixtures, indexer integration. The PatternMatcher interface from 1.5 is language-agnostic so pattern rules slot right in; the scanner is the real work. - Vue/Svelte SFC support — `.vue`/`.svelte` files have embedded `