
Commit ac7a0f1

basic documentation update
1 parent 83372ad commit ac7a0f1

5 files changed: +798 additions, −520 deletions


ARCHITECTURE.md

Lines changed: 259 additions & 101 deletions
# Architecture

## Overview

codectx processes repositories through a structured analysis pipeline that ranks code by importance, compresses it intelligently, and emits a structured markdown document optimized for AI systems.

The pipeline consists of six stages: file discovery, parsing, graph construction, ranking, compression, and formatting.

## Pipeline

### Stage 1: Walker

**Purpose:** Discover repository files while respecting ignore rules.

The Walker recursively traverses the filesystem from the repository root and applies ignore rules in order:

1. `ALWAYS_IGNORE` — built-in patterns (`.git`, `__pycache__`, `.venv`, etc.)
2. `.gitignore` — Git standard ignore rules
3. `.ctxignore` — codectx-specific ignore rules

The tool uses `pathspec` with `gitwildmatch` semantics to ensure exact behavioral parity with Git's ignore processing.

**Output:** `List[Path]` of files to analyze.

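The ignore-rule layering can be sketched with `pathspec`. This is an illustrative sketch, not codectx's actual implementation: the `ALWAYS_IGNORE` contents and the `walk` helper are assumptions.

```python
from pathlib import Path

import pathspec  # third-party: implements gitwildmatch semantics


# Illustrative built-in patterns; codectx's actual ALWAYS_IGNORE list may differ.
ALWAYS_IGNORE = [".git/", "__pycache__/", ".venv/"]


def walk(root: Path) -> list[Path]:
    """Collect files under root, applying ALWAYS_IGNORE, .gitignore, .ctxignore in order."""
    patterns = list(ALWAYS_IGNORE)
    for name in (".gitignore", ".ctxignore"):
        ignore_file = root / name
        if ignore_file.is_file():
            patterns.extend(ignore_file.read_text().splitlines())
    spec = pathspec.PathSpec.from_lines("gitwildmatch", patterns)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and not spec.match_file(p.relative_to(root).as_posix())
    )
```

Because all three pattern sources feed one `PathSpec`, later sources layer on top of earlier ones exactly as Git layers `.gitignore` files.
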
### Stage 2: Parser

**Purpose:** Extract imports, symbols, and metadata from source files.

The Parser processes files in parallel using `ProcessPoolExecutor` (CPU-bound) and `ThreadPoolExecutor` (I/O-bound). For each file:

1. Detect language from file extension
2. Parse AST using tree-sitter
3. Extract:
   - Import statements (list of import strings)
   - Top-level symbols (functions, classes, methods)
   - Docstrings per symbol
   - Code structure metadata

Tree-sitter provides a unified interface across nine languages: Python, TypeScript, JavaScript, Go, Rust, Java, C, C++, and Ruby.

**Output:** `Dict[Path, ParseResult]` where each `ParseResult` contains imports, symbols, and source text.

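The extraction step can be sketched as follows. This sketch uses the stdlib `ast` module as a Python-only stand-in for tree-sitter, and a thread pool instead of a process pool, so it stays self-contained; the `ParseResult` fields shown are assumptions based on the description above.

```python
import ast
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ParseResult:
    imports: list[str] = field(default_factory=list)
    symbols: list[str] = field(default_factory=list)
    source: str = ""


def parse_python(source: str) -> ParseResult:
    """Extract import strings and top-level symbol names from Python source."""
    tree = ast.parse(source)
    result = ParseResult(source=source)
    for node in tree.body:
        if isinstance(node, ast.Import):
            result.imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            result.imports.append(node.module or "")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            result.symbols.append(node.name)
    return result


def parse_files(paths: list[Path]) -> dict[Path, ParseResult]:
    # codectx uses ProcessPoolExecutor for CPU-bound tree-sitter parsing;
    # a thread pool keeps this sketch simple and picklable-free.
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda p: parse_python(p.read_text()), paths)
        return dict(zip(paths, results))
```
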
### Stage 3: Dependency Graph

**Purpose:** Build a directed graph representing module relationships.

The Graph Builder processes parse results to construct a `rustworkx.PyDiGraph`:

1. For each import statement, resolve the import string to a file path using per-language import resolvers
2. Create nodes for files and edges for import relationships
3. Compute graph metrics:
   - **Fan-in** — in-degree per node (how many files import this module)
   - **Fan-out** — out-degree per node (how many modules this file imports)
   - **Strongly connected components** — detect cyclic dependencies

The graph enables ranking algorithms to identify important modules based on structural position.

**Output:** `rustworkx.PyDiGraph` with computed metrics.

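The fan-in and fan-out metrics reduce to degree counting over the import edges. A minimal stdlib sketch (illustrating the metrics only; codectx stores the graph in rustworkx):

```python
from collections import defaultdict


def graph_metrics(edges: list[tuple[str, str]]) -> tuple[dict, dict]:
    """Compute fan-in and fan-out from (importer, imported) file-path pairs."""
    fan_in: dict[str, int] = defaultdict(int)
    fan_out: dict[str, int] = defaultdict(int)
    for importer, imported in edges:
        fan_out[importer] += 1   # this file imports one more module
        fan_in[imported] += 1    # this module gains one more importer
    return dict(fan_in), dict(fan_out)
```
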
### Stage 4: Ranker

**Purpose:** Score files by importance using multiple signals.

The Ranker computes a composite importance score for each file:

```
score = (0.35 × git_frequency)
      + (0.35 × fan_in)
      + (0.20 × recency)
      + (0.10 × entry_proximity)
```

**Git Frequency (0.35):** Commit count touching the file. Frequently modified files are typically more important.

**Fan-in (0.35):** Normalized in-degree. Files imported by many other modules are critical interfaces.

**Recency (0.20):** Days since last modification, inverted so that recently active files score higher.

**Entry Proximity (0.10):** Graph distance from identified entry points. Files close to main execution paths rank higher.

Scores are normalized to the `[0.0, 1.0]` range for uniform compression tier assignment.

**Output:** `Dict[Path, float]` mapping file paths to scores.

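The weighted sum and the normalization step can be sketched directly from the formula above (the helper names are illustrative, not codectx's API):

```python
WEIGHTS = {
    "git_frequency": 0.35,
    "fan_in": 0.35,
    "recency": 0.20,
    "entry_proximity": 0.10,
}


def minmax(values: dict[str, float]) -> dict[str, float]:
    """Min-max normalize raw signal values into [0.0, 1.0]."""
    lo, hi = min(values.values()), max(values.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return {key: (v - lo) / span for key, v in values.items()}


def composite_score(signals: dict[str, float]) -> float:
    """Weighted sum of per-file signals, each already normalized to [0.0, 1.0]."""
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())
```

Since the weights sum to 1.0 and each signal is in `[0.0, 1.0]`, the composite score is guaranteed to stay in `[0.0, 1.0]` as well.
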
### Stage 5: Compressor

**Purpose:** Fit code content within a token budget.

The Compressor assigns content tiers based on scores:

- **Tier 1** (score ≥ 0.7) — Full source code
- **Tier 2** (0.3 ≤ score < 0.7) — Function signatures and docstrings only
- **Tier 3** (score < 0.3) — One-line summary

Files are emitted in order: Tier 1 by score descending, then Tier 2, then Tier 3.

If the total token count exceeds the budget:

1. Drop all Tier 3 files
2. Truncate Tier 2 content (keep only signatures, remove docstrings)
3. Truncate Tier 1 content (reduce line count progressively)
4. If still over budget, drop the lowest-scored Tier 1 files

This is a hard constraint: the tool does not emit context that exceeds the token limit.

**Output:** `Dict[Path, CompressedContent]` and usage statistics.

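The tier thresholds and emission order above can be sketched in a few lines (function names are illustrative):

```python
def assign_tier(score: float) -> int:
    """Map a normalized score to a compression tier."""
    if score >= 0.7:
        return 1  # full source
    if score >= 0.3:
        return 2  # signatures + docstrings
    return 3      # one-line summary


def emission_order(scores: dict[str, float]) -> list[str]:
    """Tier 1 first, then Tier 2, then Tier 3; within each tier, score descending."""
    return sorted(scores, key=lambda path: (assign_tier(scores[path]), -scores[path]))
```
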
### Stage 6: Formatter

**Purpose:** Emit structured markdown optimized for AI agents.

The Formatter writes sections in fixed order:

1. **ARCHITECTURE** — High-level project structure
2. **DEPENDENCY_GRAPH** — Mermaid diagram of module relationships
3. **ENTRY_POINTS** — Main files and public interfaces with full source
4. **CORE_MODULES** — High-scoring modules with full source
5. **SUPPORTING_MODULES** — Mid-scoring modules with signatures and docstrings
6. **PERIPHERY** — Low-scoring files with one-line summaries
7. **RECENT_CHANGES** — Optional diff section (emitted when the `--since` flag is provided)

Each section begins with a Markdown heading and ends with metadata (token count, file count).

**Output:** Markdown string suitable for writing to disk as `CONTEXT.md`.

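The fixed section ordering can be sketched as a simple ordered join (the `format_context` helper and its heading style are assumptions, not codectx's actual formatter):

```python
SECTION_ORDER = [
    "ARCHITECTURE",
    "DEPENDENCY_GRAPH",
    "ENTRY_POINTS",
    "CORE_MODULES",
    "SUPPORTING_MODULES",
    "PERIPHERY",
    "RECENT_CHANGES",
]


def format_context(sections: dict[str, str]) -> str:
    """Emit sections in fixed order, skipping any that are absent
    (e.g. RECENT_CHANGES when --since was not provided)."""
    parts = []
    for name in SECTION_ORDER:
        if name in sections:
            parts.append(f"## {name}\n\n{sections[name]}\n")
    return "\n".join(parts)
```
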
## Data Flow Diagram

```
File System
│
├─→ [Walker]
│     ├ Respects .gitignore
│     ├ Respects .ctxignore
│     └ Output: List[Path]
│
├─→ [Parser] (parallel)
│     ├ Per-language extraction
│     ├ tree-sitter AST processing
│     └ Output: Dict[Path, ParseResult]
│
├─→ [Graph Builder]
│     ├ Resolve imports
│     ├ Construct DiGraph
│     └ Output: rustworkx.PyDiGraph
│
├─→ [Git Metadata] (parallel)
│     ├ Commit frequency per file
│     ├ Recency (last modification)
│     └ Output: Dict[Path, GitMeta]
│
├─→ [Ranker]
│     ├ Composite scoring
│     ├ Normalize to [0.0, 1.0]
│     └ Output: Dict[Path, float]
│
├─→ [Compressor]
│     ├ Tier assignment
│     ├ Token budget enforcement
│     └ Output: Dict[Path, CompressedContent]
│
└─→ [Formatter]
      ├ Section organization
      ├ Markdown generation
      └ Output: CONTEXT.md
```

## Caching

The tool caches expensive computations:

**Cache key:** `(file_path, file_hash, git_commit_sha)`

**Cached items:**

- Parsed AST and extracted symbols per file
- Git metadata (frequency, recency)

**Cache location:** `.codectx_cache/` at repository root (gitignored)

**Invalidation:** Cache entries are invalidated when file content changes or the HEAD commit changes.

This enables fast incremental updates in watch mode.

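The cache key's invalidation behavior falls out of its construction: hashing the file content means any edit produces a new key, and including the commit SHA means a new HEAD does the same. A minimal sketch (the `cache_key` helper is illustrative):

```python
import hashlib
from pathlib import Path


def cache_key(path: Path, commit_sha: str) -> tuple[str, str, str]:
    """(file_path, file_hash, git_commit_sha) — changing either the file
    content or the commit yields a different key, so stale entries are
    simply never looked up."""
    file_hash = hashlib.sha256(path.read_bytes()).hexdigest()
    return (str(path), file_hash, commit_sha)
```
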
## Incremental Mode

When running `codectx watch .`, the tool:

1. Monitors the filesystem with `watchfiles`
2. On each file change:
   - Reparses only the affected files
   - Rebuilds the graph for changed nodes and their dependents
   - Re-ranks the affected subgraph
   - Recompresses to the budget
   - Re-emits the output

This is significantly faster than running a full analysis on every change.

## Token Budget Enforcement

Token counting uses `tiktoken`, which matches OpenAI model tokenization exactly and serves as a close approximation for Anthropic models.

Budget enforcement is hard: the tool does not emit context exceeding the specified limit.

Consumption order:

1. Fixed overhead (section headers, metadata) — typically 500–1000 tokens
2. Tier 1 files by score descending (full source)
3. Tier 2 files by score descending (signatures only)
4. Tier 3 files by score descending (one-line summaries)

Files omitted due to the budget are noted in the output.

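The greedy consumption order above can be sketched as a budget-fill loop. This is a sketch under assumptions: the `fit_to_budget` helper is illustrative, and the `len(text) // 4` counter is a rough stand-in for `tiktoken`.

```python
def fit_to_budget(files, budget, count=lambda text: len(text) // 4):
    """Greedy budget fill over files already sorted tier-first, score-descending.

    files: list of (path, text) pairs; count: token counter stand-in.
    Returns (kept_paths, omitted_paths) — omitted files get a note in the output.
    """
    kept, omitted, used = [], [], 0
    for path, text in files:
        tokens = count(text)
        if used + tokens <= budget:
            kept.append(path)
            used += tokens
        else:
            omitted.append(path)  # hard cap: never exceed the budget
    return kept, omitted
```
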
## Language Support

The Parser uses tree-sitter for universal AST extraction. Each language requires:

1. **tree-sitter grammar** — provided by the corresponding `tree-sitter-LANGUAGE` package
2. **Import resolver** — per-language logic to resolve import strings to file paths

Currently supported:

- **Python** — `import X`, `from X import Y`
- **TypeScript/JavaScript** — `import ... from "X"`, `require("X")`
- **Go** — `import "X"`
- **Rust** — `use X::{Y, Z}`
- **Java** — `import X.Y;`

Adding a language requires implementing a resolver in `src/codectx/graph/resolver.py` and adding the grammar dependency to `pyproject.toml`.

## Configuration

Configuration is applied in this precedence order:

1. **CLI flags** (highest priority)
2. **`.contextcraft.toml`** in the repository root
3. **Built-in defaults** (lowest priority)

Example `.contextcraft.toml`:

```toml
[codectx]
token_budget = 120000
output = "CONTEXT.md"
include_patterns = ["src/**", "lib/**"]
exclude_patterns = ["tests/**", "*.test.py"]
```

## Parallelism Strategy

**CPU-bound tasks (Parser):** `ProcessPoolExecutor` — parsing and AST extraction leverage the tree-sitter C extension.

**I/O-bound tasks (Git metadata, file I/O):** `ThreadPoolExecutor` — reading git history and source files is I/O-bound.

**Synchronous tasks:** Graph construction, ranking, and compression are single-threaded because they are fast and maintain simple state.

This mixed-executor approach balances CPU and I/O contention.

## Performance Characteristics

On a typical 10k-file repository:

- **Walker:** ~500 ms (filesystem traversal)
- **Parser:** ~2–5 s (parallel tree-sitter parsing)
- **Graph Builder:** ~100 ms (import resolution)
- **Ranker:** ~200 ms (scoring and normalization)
- **Compressor:** ~50 ms (tier assignment)
- **Formatter:** ~100 ms (markdown generation)

**Total:** ~3–6 seconds for a full analysis.

Incremental mode (watch) is typically 5–10× faster because it processes only changed files.