|
1 | 1 | # Architecture |
2 | 2 |
|
3 | | -## Problem |
4 | | -AI agents receive poor codebase context because existing tools (repomix, etc.) are file |
5 | | -concatenators. They dump files in filesystem order with no ranking, no compression, and no |
6 | | -semantic structure. Agent output quality is bounded by signal-to-noise ratio in the context window. |
| 3 | +## Overview |
7 | 4 |
|
8 | | -## Solution |
9 | | -Treat context generation as compilation. Parse the codebase into a dependency graph, rank files |
10 | | -by importance signals, compress to a token budget, and emit structured output that orients an |
11 | | -agent immediately. |
| 5 | +codectx processes repositories through a structured analysis pipeline that ranks code by importance, compresses it to fit a token budget, and emits a structured markdown document optimized for AI systems. |
| 6 | + |
| 7 | +The pipeline consists of six stages: file discovery, parsing, graph construction, ranking, compression, and formatting. |
12 | 8 |
|
13 | 9 | ## Pipeline |
14 | 10 |
|
| 11 | +### Stage 1: Walker |
| 12 | + |
| 13 | +**Purpose:** Discover repository files while respecting ignore rules. |
| 14 | + |
| 15 | +The Walker recursively traverses the filesystem from the repository root and applies ignore rules in order: |
| 16 | + |
| 17 | +1. `ALWAYS_IGNORE` — built-in patterns (`.git`, `__pycache__`, `.venv`, etc.) |
| 18 | +2. `.gitignore` — Git standard ignore rules |
| 19 | +3. `.ctxignore` — codectx-specific ignore rules |
| 20 | + |
| 21 | +The Walker uses `pathspec` with `gitwildmatch` patterns so that ignore matching follows Git's wildmatch semantics. |
| 22 | + |
| 23 | +**Output:** `List[Path]` of files to analyze. |
| 24 | + |
| 25 | +### Stage 2: Parser |
| 26 | + |
| 27 | +**Purpose:** Extract imports, symbols, and metadata from source files. |
| 28 | + |
| 29 | +The Parser processes files in parallel, using `ProcessPoolExecutor` for CPU-bound parsing and `ThreadPoolExecutor` for I/O-bound file reads. For each file: |
| 30 | + |
| 31 | +1. Detect language from file extension |
| 32 | +2. Parse AST using tree-sitter |
| 33 | +3. Extract: |
| 34 | + - Import statements (list of import strings) |
| 35 | + - Top-level symbols (functions, classes, methods) |
| 36 | + - Docstrings per symbol |
| 37 | + - Code structure metadata |
| 38 | + |
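| 39 | +Tree-sitter provides a unified parsing interface across many languages, including Python, TypeScript, JavaScript, Go, Rust, Java, C, C++, and Ruby. The languages codectx currently resolves imports for are listed under Language Support. |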
| 40 | + |
| 41 | +**Output:** `Dict[Path, ParseResult]` where each `ParseResult` contains imports, symbols, and source text. |
| 42 | + |
| 43 | +### Stage 3: Dependency Graph |
| 44 | + |
| 45 | +**Purpose:** Build a directed graph representing module relationships. |
| 46 | + |
| 47 | +The Graph Builder processes parse results to construct a `rustworkx.DiGraph`: |
| 48 | + |
| 49 | +1. For each import statement, resolve the import string to a file path using per-language import resolvers |
| 50 | +2. Create nodes for files and edges for import relationships |
| 51 | +3. Compute graph metrics: |
| 52 | + - **Fan-in** — in-degree per node (how many files import this module) |
| 53 | + - **Fan-out** — out-degree per node (how many modules this file imports) |
| 54 | + - **Strongly connected components** — detect cyclic dependencies |
| 55 | + |
| 56 | +The graph enables ranking algorithms to identify important modules based on structural position. |
| 57 | + |
| 58 | +**Output:** `rustworkx.DiGraph` with computed metrics. |
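
The fan-in and fan-out metrics described above can be sketched with plain dictionaries. This is only an illustration of the counting, not the codectx implementation: the real builder uses `rustworkx`, and the `degree_metrics` helper and the sample edge list are hypothetical.

```python
from collections import defaultdict


def degree_metrics(edges):
    """Return (fan_in, fan_out) for a list of (importer, imported) edges."""
    fan_in = defaultdict(int)
    fan_out = defaultdict(int)
    for importer, imported in edges:
        fan_out[importer] += 1  # this file imports one more module
        fan_in[imported] += 1   # this module is imported by one more file
        # Touch the opposite counters so every node appears in both dicts.
        fan_in[importer] += 0
        fan_out[imported] += 0
    return dict(fan_in), dict(fan_out)


# Hypothetical import edges: cli.py and api.py both import core.py.
edges = [("cli.py", "core.py"), ("api.py", "core.py"), ("core.py", "utils.py")]
fan_in, fan_out = degree_metrics(edges)
```

Here `core.py` ends up with the highest fan-in, which is the property the Ranker later rewards.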
| 59 | + |
| 60 | +### Stage 4: Ranker |
| 61 | + |
| 62 | +**Purpose:** Score files by importance using multiple signals. |
| 63 | + |
| 64 | +The Ranker computes a composite importance score for each file: |
| 65 | + |
15 | 66 | ``` |
16 | | -Codebase |
17 | | - │ |
18 | | - ▼ |
19 | | -Walker |
20 | | - - Recursive file discovery from root |
21 | | - - Applies ALWAYS_IGNORE, .gitignore, .ctxignore in order |
22 | | - - Warns and confirms on sensitive file detection |
23 | | - - Returns: List[Path] |
24 | | - │ |
25 | | - ▼ |
26 | | -Parser (parallel, ProcessPoolExecutor) |
27 | | - - Detects language from file extension |
28 | | - - Extracts via tree-sitter AST: |
29 | | - - Import statements → List[str] |
30 | | - - Top-level symbols (functions, classes) → List[Symbol] |
31 | | - - Docstrings per symbol |
32 | | - - Returns: Dict[Path, ParseResult] |
33 | | - │ |
34 | | - ▼ |
35 | | -Graph Builder |
36 | | - - Resolves import strings → file paths (per-language resolver) |
37 | | - - Constructs rustworkx DiGraph: nodes=files, edges=imports |
38 | | - - Computes fan-in (in-degree) per node |
39 | | - - Returns: DepGraph |
40 | | - │ |
41 | | - ▼ |
42 | | -Ranker |
43 | | - - Scores each file 0.0–1.0 using weighted composite: |
44 | | - git_frequency : 0.35 (commit count touching file) |
45 | | - fan_in : 0.35 (how many files import this) |
46 | | - recency : 0.20 (days since last modification) |
47 | | - entry_proximity: 0.10 (graph distance from entry points) |
48 | | - - Returns: Dict[Path, float] |
49 | | - │ |
50 | | - ▼ |
51 | | -Compressor |
52 | | - - Enforces token budget (from config or CLI flag) |
53 | | - - Assigns tier per file by score: |
54 | | - Tier 1 (score > 0.7): full source |
55 | | - Tier 2 (score 0.3–0.7): signatures + docstrings |
56 | | - Tier 3 (score < 0.3): one-line summary |
57 | | - - If over budget: drop Tier 3 → truncate Tier 2 → truncate Tier 1 |
58 | | - - Returns: Dict[Path, CompressedFile] |
59 | | - │ |
60 | | - ▼ |
61 | | -Formatter |
62 | | - - Emits structured markdown with fixed section order |
63 | | - - Sections: ARCHITECTURE, DEPENDENCY_GRAPH, ENTRY_POINTS, |
64 | | - CORE_MODULES, PERIPHERY, RECENT_CHANGES |
65 | | - - Returns: str |
66 | | - │ |
67 | | - ▼ |
68 | | -Output file (default: context.md) |
| 67 | +score = (0.35 × git_frequency) |
| 68 | + + (0.35 × fan_in) |
| 69 | + + (0.20 × recency) |
| 70 | + + (0.10 × entry_proximity) |
69 | 71 | ``` |
70 | 72 |
|
71 | | -## Parallelism model |
72 | | -- File parsing: ProcessPoolExecutor (CPU-bound, tree-sitter C extension) |
73 | | -- File I/O: ThreadPoolExecutor (I/O-bound, reading source files) |
74 | | -- Graph construction: single-threaded (fast, rustworkx handles it) |
75 | | -- Ranking: single-threaded (fast after git metadata collected) |
| 73 | +**Git Frequency (0.35):** Commit count touching the file. Frequently-modified files are typically more important. |
| 74 | + |
| 75 | +**Fan-in (0.35):** Normalized in-degree. Files imported by many other modules are critical interfaces, so a higher in-degree yields a higher score. |
| 76 | + |
| 77 | +**Recency (0.20):** Derived from days since last modification. Fewer days yields a higher score, so recently active files are prioritized. |
| 78 | + |
| 79 | +**Entry Proximity (0.10):** Graph distance from identified entry points. Files close to main execution paths rank higher. |
| 80 | + |
| 81 | +Scores are normalized to the `[0.0, 1.0]` range so that the compression tier thresholds apply uniformly. |
| 82 | + |
| 83 | +**Output:** `Dict[Path, float]` mapping file paths to scores. |
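
The weighted sum above can be written out as a short sketch. The weights come from the formula; everything else is an assumption for illustration: the `min_max` normalization, the helper names, and the sample values are hypothetical, since the document does not say how codectx normalizes each signal.

```python
# Weights taken from the composite formula above.
WEIGHTS = {
    "git_frequency": 0.35,
    "fan_in": 0.35,
    "recency": 0.20,
    "entry_proximity": 0.10,
}


def min_max(values):
    """Scale a dict of raw values to [0.0, 1.0]; constant inputs map to 0.0."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in values.items()}


def composite_score(signals):
    """Weighted sum of signals that are already normalized to [0.0, 1.0]."""
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)


# Hypothetical commit counts per file, normalized before weighting.
commits = min_max({"core.py": 40, "cli.py": 10, "utils.py": 0})
score = composite_score({
    "git_frequency": commits["core.py"],
    "fan_in": 1.0,
    "recency": 0.5,
    "entry_proximity": 0.0,
})
```

Because the weights sum to 1.0 and each input lies in `[0.0, 1.0]`, the composite also stays in that range.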
| 84 | + |
| 85 | +### Stage 5: Compressor |
| 86 | + |
| 87 | +**Purpose:** Fit code content within a token budget. |
| 88 | + |
| 89 | +The Compressor assigns content tiers based on scores: |
| 90 | + |
| 91 | +- **Tier 1** (score ≥ 0.7) — Full source code |
| 92 | +- **Tier 2** (0.3 ≤ score < 0.7) — Function signatures and docstrings only |
| 93 | +- **Tier 3** (score < 0.3) — One-line summary |
| 94 | + |
| 95 | +Files are emitted in order: Tier 1 by score descending, then Tier 2, then Tier 3. |
| 96 | + |
| 97 | +If total token count exceeds the budget: |
| 98 | + |
| 99 | +1. Drop all Tier 3 files |
| 100 | +2. Truncate Tier 2 content (keep only signatures, remove docstrings) |
| 101 | +3. Truncate Tier 1 content (reduce line count progressively) |
| 102 | +4. If still over budget, drop lowest-scored Tier 1 files |
| 103 | + |
| 104 | +This is a hard constraint. The tool does not emit context that exceeds the token limit. |
| 105 | + |
| 106 | +**Output:** `Dict[Path, CompressedContent]` and usage statistics. |
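
Tier assignment by score threshold might look like the following. It is a minimal sketch under the thresholds listed above; `assign_tier`, `group_by_tier`, and the sample scores are hypothetical names and values, not the tool's API.

```python
def assign_tier(score):
    """Map a normalized score to a content tier using the documented thresholds."""
    if score >= 0.7:
        return 1  # full source
    if score >= 0.3:
        return 2  # signatures and docstrings
    return 3      # one-line summary


def group_by_tier(scores):
    """Group paths by tier, each tier ordered by score descending."""
    tiers = {1: [], 2: [], 3: []}
    for path in sorted(scores, key=scores.get, reverse=True):
        tiers[assign_tier(scores[path])].append(path)
    return tiers


# Hypothetical scores for four files.
tiers = group_by_tier({"core.py": 0.9, "cli.py": 0.7, "utils.py": 0.5, "setup.py": 0.1})
```

Emitting `tiers[1]`, then `tiers[2]`, then `tiers[3]` reproduces the ordering described above.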
| 107 | + |
| 108 | +### Stage 6: Formatter |
| 109 | + |
| 110 | +**Purpose:** Emit structured markdown optimized for AI agents. |
| 111 | + |
| 112 | +The Formatter writes sections in fixed order: |
| 113 | + |
| 114 | +1. **ARCHITECTURE** — High-level project structure |
| 115 | +2. **DEPENDENCY_GRAPH** — Mermaid diagram of module relationships |
| 116 | +3. **ENTRY_POINTS** — Main files and public interfaces with full source |
| 117 | +4. **CORE_MODULES** — High-scoring modules with full source |
| 118 | +5. **SUPPORTING_MODULES** — Mid-scoring modules with signatures and docstrings |
| 119 | +6. **PERIPHERY** — Low-scoring files with one-line summaries |
| 120 | +7. **RECENT_CHANGES** — Optional diff section (if `--since` flag provided) |
| 121 | + |
| 122 | +Each section is preceded by a Markdown heading and terminated with metadata (token count, file count). |
| 123 | + |
| 124 | +**Output:** Markdown string suitable for writing to disk as `CONTEXT.md`. |
| 125 | + |
| 126 | +## Data Flow Diagram |
| 127 | + |
| 128 | +``` |
| 129 | +File System |
| 130 | + │ |
| 131 | + ├─→ [Walker] |
| 132 | + │ ├ Respects .gitignore |
| 133 | + │ ├ Respects .ctxignore |
| 134 | + │ └ Output: List[Path] |
| 135 | + │ |
| 136 | + ├─→ [Parser] (Parallel) |
| 137 | + │ ├ Per-language extraction |
| 138 | + │ ├ tree-sitter AST processing |
| 139 | + │ └ Output: Dict[Path, ParseResult] |
| 140 | + │ |
| 141 | + ├─→ [Graph Builder] |
| 142 | + │ ├ Resolve imports |
| 143 | + │ ├ Construct DiGraph |
| 144 | + │ └ Output: rustworkx.DiGraph |
| 145 | + │ |
| 146 | + ├─→ [Git Metadata] (Parallel) |
| 147 | + │ ├ Commit frequency per file |
| 148 | + │ ├ Recency (last modification) |
| 149 | + │ └ Output: Dict[Path, GitMeta] |
| 150 | + │ |
| 151 | + ├─→ [Ranker] |
| 152 | + │ ├ Composite scoring |
| 153 | + │ ├ Normalize to [0.0, 1.0] |
| 154 | + │ └ Output: Dict[Path, float] |
| 155 | + │ |
| 156 | + ├─→ [Compressor] |
| 157 | + │ ├ Tier assignment |
| 158 | + │ ├ Token budget enforcement |
| 159 | + │ └ Output: Dict[Path, CompressedContent] |
| 160 | + │ |
| 161 | + └─→ [Formatter] |
| 162 | + ├ Section organization |
| 163 | + ├ Markdown generation |
| 164 | + └ Output: CONTEXT.md |
| 165 | +``` |
76 | 166 |
|
77 | 167 | ## Caching |
78 | | -- Cache key: (file_path, file_hash, git_commit_sha) |
79 | | -- Cache location: .codectx_cache/ at project root (gitignored) |
80 | | -- Cached: ParseResult per file, git metadata per file |
81 | | -- Invalidated: on file content change or new commit |
82 | | - |
83 | | -## Incremental mode (--watch) |
84 | | -- watchfiles monitors project root |
85 | | -- On change: reparse affected files only |
86 | | -- Rebuild graph for changed nodes and their dependents |
87 | | -- Re-rank affected subgraph |
88 | | -- Re-emit output |
89 | | - |
90 | | -## Token budget enforcement |
91 | | -Hard cap. Not a suggestion. Budget is consumed in this order: |
92 | | -1. ARCHITECTURE section (fixed, small) |
93 | | -2. DEPENDENCY_GRAPH section (fixed, small) |
94 | | -3. Tier 1 files by rank score descending |
95 | | -4. Tier 2 files by rank score descending |
96 | | -5. Tier 3 files by rank score descending |
97 | | - |
98 | | -Files that don't fit are omitted with a note in the output. |
99 | | - |
100 | | -## Language support |
101 | | -Pluggable resolver interface. Initial support: |
102 | | -- Python (.py) |
103 | | -- TypeScript (.ts, .tsx) |
104 | | -- JavaScript (.js, .jsx) |
105 | | -- Go (.go) |
106 | | -- Rust (.rs) |
107 | | -- Java (.java) |
108 | | - |
109 | | -Adding a language requires: tree-sitter grammar (via tree-sitter-languages) + import resolver. |
110 | | - |
111 | | -## Config precedence |
112 | | -CLI flags > .contextcraft.toml > defaults |
| 168 | + |
| 169 | +The tool caches expensive computations: |
| 170 | + |
| 171 | +**Cache key:** `(file_path, file_hash, git_commit_sha)` |
| 172 | + |
| 173 | +**Cached items:** |
| 174 | +- Parsed AST and extracted symbols per file |
| 175 | +- Git metadata (frequency, recency) |
| 176 | + |
| 177 | +**Cache location:** `.codectx_cache/` at repository root (gitignored) |
| 178 | + |
| 179 | +**Invalidation:** Cache entries are invalidated when file content changes or HEAD commit changes. |
| 180 | + |
| 181 | +This enables fast incremental updates in watch mode. |
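
One way to build the cache key tuple is sketched below. The choice of SHA-256 for `file_hash` and the `cache_key` helper are assumptions; the document only specifies the three components of the key.

```python
import hashlib


def cache_key(file_path, content, commit_sha):
    """Build the (file_path, file_hash, git_commit_sha) key for one file.

    content is the file's raw bytes; hashing it means any edit yields a new key.
    """
    file_hash = hashlib.sha256(content).hexdigest()
    return (file_path, file_hash, commit_sha)


# Hypothetical values: the same path and commit, but different file contents.
before = cache_key("src/core.py", b"def run(): pass\n", "abc123")
after = cache_key("src/core.py", b"def run(): return 1\n", "abc123")
```

A lookup with `after` misses the entry stored under `before`, which matches the invalidation rule for content changes. A new commit SHA changes the key in the same way.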
| 182 | + |
| 183 | +## Incremental Mode |
| 184 | + |
| 185 | +When running `codectx watch .`, the tool: |
| 186 | + |
| 187 | +1. Monitors filesystem with `watchfiles` |
| 188 | +2. On file change: |
| 189 | + - Reparse only affected files |
| 190 | + - Rebuild graph for changed nodes and dependents |
| 191 | + - Re-rank affected subgraph |
| 192 | + - Recompress to budget |
| 193 | + - Re-emit output |
| 194 | + |
| 195 | +This is significantly faster than full analysis on every change. |
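
The "changed nodes and dependents" step can be illustrated with a breadth-first walk over reversed import edges. This sketch assumes dependents are followed transitively, which the document does not state, and `affected_files` and the sample import map are hypothetical.

```python
from collections import deque


def affected_files(changed, imports):
    """Return changed files plus every file that imports them, transitively.

    imports maps each file to the set of files it imports.
    """
    # Invert the edges: imported file -> files that import it.
    dependents = {}
    for importer, targets in imports.items():
        for target in targets:
            dependents.setdefault(target, set()).add(importer)

    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


# Hypothetical import map: cli.py -> core.py -> utils.py.
imports = {"cli.py": {"core.py"}, "core.py": {"utils.py"}, "docs.py": set()}
touched = affected_files(["utils.py"], imports)
```

Only the files in `touched` would need reparsing or re-ranking; `docs.py` is left alone.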
| 196 | + |
| 197 | +## Token Budget Enforcement |
| 198 | + |
| 199 | +Token counting uses `tiktoken`, OpenAI's tokenizer library. Counts match the OpenAI encoding in use and are only an approximation for models with a different tokenizer, such as Anthropic's. |
| 200 | + |
| 201 | +Budget enforcement is hard: the tool does not emit context exceeding the specified limit. |
| 202 | + |
| 203 | +Consumption order: |
| 204 | + |
| 205 | +1. Fixed overhead (section headers, metadata) — typically 500–1000 tokens |
| 206 | +2. Tier 1 files by score descending (full source) |
| 207 | +3. Tier 2 files by score descending (signatures and docstrings) |
| 208 | +4. Tier 3 files by score descending (one-line summaries) |
| 209 | + |
| 210 | +Files omitted due to budget are logged with a note in the output. |
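
The consumption order can be sketched as a greedy fill over files that are already sorted by tier and score. Token counts are passed in rather than computed, and the behavior of skipping a file that does not fit while still trying later, smaller files is an assumption; `fill_budget` and the sample numbers are hypothetical.

```python
def fill_budget(candidates, budget, overhead=0):
    """Greedily include files until the token budget is exhausted.

    candidates is a list of (path, token_count) in consumption order.
    Returns (included, omitted, used_tokens).
    """
    used = overhead  # fixed cost of section headers and metadata
    included, omitted = [], []
    for path, tokens in candidates:
        if used + tokens <= budget:
            included.append(path)
            used += tokens
        else:
            omitted.append(path)  # reported in the output as omitted
    return included, omitted, used


# Hypothetical token counts with a 1000-token budget and 200 tokens of overhead.
included, omitted, used = fill_budget(
    [("core.py", 600), ("api.py", 500), ("utils.py", 100)],
    budget=1000,
    overhead=200,
)
```

The returned `used` never exceeds `budget` as long as the overhead itself fits, which is the hard-cap property described above.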
| 211 | + |
| 212 | +## Language Support |
| 213 | + |
| 214 | +The Parser uses tree-sitter for universal AST extraction. Each language requires: |
| 215 | + |
| 216 | +1. **tree-sitter grammar** — provided by `tree-sitter-LANGUAGE` package |
| 217 | +2. **Import resolver** — per-language logic to resolve import strings to file paths |
| 218 | + |
| 219 | +Currently supported: |
| 220 | + |
| 221 | +- **Python** — `import X`, `from X import Y` |
| 222 | +- **TypeScript/JavaScript** — `import * as X from "X"`, `require("X")` |
| 223 | +- **Go** — `import "X"` |
| 224 | +- **Rust** — `use X::{Y, Z}` |
| 225 | +- **Java** — `import X.Y;` |
| 226 | + |
| 227 | +Adding a language requires implementing a resolver in `src/codectx/graph/resolver.py` and adding the grammar dependency to `pyproject.toml`. |
| 228 | + |
| 229 | +## Configuration |
| 230 | + |
| 231 | +Configuration is applied in this precedence order: |
| 232 | + |
| 233 | +1. **CLI flags** (highest priority) |
| 234 | +2. **`.contextcraft.toml`** in repository root |
| 235 | +3. **Built-in defaults** (lowest priority) |
| 236 | + |
| 237 | +Example `.contextcraft.toml`: |
| 238 | + |
| 239 | +```toml |
| 240 | +[codectx] |
| 241 | +token_budget = 120000 |
| 242 | +output = "CONTEXT.md" |
| 243 | +include_patterns = ["src/**", "lib/**"] |
| 244 | +exclude_patterns = ["tests/**", "*.test.py"] |
| 245 | +``` |
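
The precedence order can be expressed with `collections.ChainMap`, where the first mapping that contains a key wins. This is a sketch only: the `DEFAULTS` values, the `resolve_config` helper, and the convention that unset CLI flags arrive as `None` are all assumptions.

```python
from collections import ChainMap

# Placeholder defaults for illustration; the real built-in values may differ.
DEFAULTS = {"token_budget": 100000, "output": "CONTEXT.md"}


def resolve_config(cli_flags, file_config):
    """Merge settings so that CLI flags > config file > built-in defaults."""
    # Drop flags the user did not pass so they do not mask lower layers.
    provided = {k: v for k, v in cli_flags.items() if v is not None}
    return dict(ChainMap(provided, file_config, DEFAULTS))


# Hypothetical inputs: the CLI sets only the budget, the file sets only the output.
config = resolve_config(
    {"token_budget": 50000, "output": None},
    {"output": "ctx.md"},
)
```

The resulting `config` takes `token_budget` from the CLI and `output` from the file, and falls back to `DEFAULTS` for anything neither layer sets.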
| 246 | + |
| 247 | +## Parallelism Strategy |
| 248 | + |
| 249 | +**CPU-bound tasks (Parser):** `ProcessPoolExecutor` — parsing and AST extraction run in the tree-sitter C extension. |
| 250 | + |
| 251 | +**I/O-bound tasks (Git metadata, file I/O):** `ThreadPoolExecutor` — reading git history and source files is I/O-bound. |
| 252 | + |
| 253 | +**Sync tasks:** Graph construction, ranking, and compression are single-threaded because they are fast and maintain simple state. |
| 254 | + |
| 255 | +This mixed-executor approach keeps CPU-bound parsing in worker processes while I/O waits are handled by threads, so neither kind of work stalls the other. |
| 256 | + |
| 257 | +## Performance Characteristics |
| 258 | + |
| 259 | +On a typical 10k-file repository: |
| 260 | + |
| 261 | +- **Walker:** ~500ms (filesystem traversal) |
| 262 | +- **Parser:** ~2-5s (parallel tree-sitter parsing) |
| 263 | +- **Graph Builder:** ~100ms (import resolution) |
| 264 | +- **Ranker:** ~200ms (scoring and normalization) |
| 265 | +- **Compressor:** ~50ms (tier assignment) |
| 266 | +- **Formatter:** ~100ms (markdown generation) |
| 267 | + |
| 268 | +**Total:** ~3-6 seconds for full analysis. |
| 269 | + |
| 270 | +Incremental mode (watch) is typically 5-10x faster because it processes only changed files. |