2 changes: 1 addition & 1 deletion .claude/da-plans/README.md
@@ -9,7 +9,7 @@ Implementation deviations are logged at the bottom of each plan file.

| Track | Description | Status |
|-------|-------------|--------|
-| [Core](core/) | Scanner, vector storage, services, indexer | Phase 1: Merged, Phase 2: Merged, Phase 3: Draft (graph cache) |
+| [Core](core/) | Scanner, vector storage, services, indexer | Phase 1-2: Merged, Phase 3: Draft (graph cache), Phase 4: Draft (Python) |
| [CLI](cli/) | Command-line interface | Not started |
| [MCP Server](mcp/) | Model Context Protocol server + adapters | Phase 1: Merged (tools improvement) |
| [Subagents](subagents/) | Coordinator, explorer, planner, GitHub agents | Not started |
@@ -0,0 +1,131 @@
# Part 4.1: Bundle Python WASM + Define Queries

See [overview.md](overview.md) for architecture context.

## Goal

Bundle `tree-sitter-python.wasm`, register the `'python'` language, and define
the S-expression queries that the PythonScanner will use.

## What changes

### `packages/dev-agent/scripts/copy-wasm.js`

Add `'python'` to `SUPPORTED_LANGUAGES`:

```javascript
const SUPPORTED_LANGUAGES = ['go', 'typescript', 'tsx', 'javascript', 'python'];
```

### `packages/core/src/scanner/tree-sitter.ts`

Add `'python'` to `TreeSitterLanguage`:

```typescript
export type TreeSitterLanguage = 'go' | 'typescript' | 'tsx' | 'javascript' | 'python';
```

### New file: `packages/core/src/scanner/python-queries.ts`

All queries validated against `tree-sitter-python` grammar via AST inspection.

```typescript
/**
 * Tree-sitter queries for Python code extraction.
 * Modeled after GO_QUERIES in go.ts.
 */
export const PYTHON_QUERIES = {
  // Top-level function definitions (not inside a class)
  functions: `
    (module
      (function_definition
        name: (identifier) @name) @definition)
  `,

  // Top-level decorated functions (e.g., @app.route, @pytest.fixture)
  decoratedFunctions: `
    (module
      (decorated_definition
        definition: (function_definition
          name: (identifier) @name)) @definition)
  `,

  // Class definitions
  classes: `
    (class_definition
      name: (identifier) @name) @definition
  `,

  // Method definitions (inside class body)
  methods: `
    (class_definition
      body: (block
        (function_definition
          name: (identifier) @name) @definition))
  `,

  // Decorated methods (inside class body)
  decoratedMethods: `
    (class_definition
      body: (block
        (decorated_definition
          definition: (function_definition
            name: (identifier) @name)) @definition))
  `,

  // Import statements
  imports: `
    (import_statement) @definition
  `,

  // From...import statements
  fromImports: `
    (import_from_statement) @definition
  `,

  // Module-level variable assignments (constants, config)
  moduleVariables: `
    (module
      (expression_statement
        (assignment
          left: (identifier) @name)) @definition)
  `,

  // Module-level type-annotated assignments (x: int = 3)
  annotatedVariables: `
    (module
      (expression_statement
        (assignment
          left: (identifier) @name
          type: (type) @type)) @definition)
  `,
};
```

### Step 1: Validate ALL queries against tree-sitter-python grammar

Before implementation, run each query against a real Python snippet (same approach
as the JS/TS query validation in Part 1.5). Specifically verify:

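- `bare-except` negation syntax: parse `except:` and `except ValueError:`, confirm
the field name used in the negation pattern `!name` is correct for tree-sitter-python
(no bare-except query is defined in `PYTHON_QUERIES` above, so this check applies only if one is added)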
- `annotatedVariables`: parse `x: int = 3`, confirm the field name is `type` and
the node structure matches the query
- All other queries: confirm node types match grammar (function_definition, class_definition, etc.)

Write a validation script at `/tmp/python-query-test.js`, run it, fix any broken queries.

### Tests

| Test | What it verifies |
|------|-----------------|
| `parseCode('def foo(): pass', 'python')` works | WASM loads |
| Each query matches expected Python source | Query correctness |
| Decorated function query matches `@app.route` pattern | Decorator handling |
| Method query matches method inside class | Class method detection |

### Commit

```
feat(core): bundle tree-sitter-python WASM and define extraction queries
```
197 changes: 197 additions & 0 deletions .claude/da-plans/core/phase-4-python-support/4.2-python-scanner.md
@@ -0,0 +1,197 @@
# Part 4.2: Implement PythonScanner

See [overview.md](overview.md) for architecture context.

## Goal

Implement `PythonScanner` that extracts functions, classes, methods, imports,
and module variables from `.py` files. Outputs `Document[]` matching the existing
scanner interface. Register in the scanner registry.

## What changes

### New file: `packages/core/src/scanner/python.ts`

Implements `Scanner` interface (same pattern as `go.ts`):

```typescript
export class PythonScanner implements Scanner {
  readonly language = 'python';
  readonly capabilities: ScannerCapabilities = {
    syntax: true,
    types: true, // type hints
    documentation: true, // docstrings
  };

  canHandle(filePath: string): boolean {
    return path.extname(filePath).toLowerCase() === '.py';
  }

  async scan(files, repoRoot, logger, onProgress): Promise<Document[]> {
    // For each .py file:
    // 1. Read file content
    // 2. Parse with tree-sitter (language: 'python')
    // 3. Run PYTHON_QUERIES
    // 4. For each match, create a Document with:
    //    - id: `${relativePath}:${name}:${startLine}`
    //    - type: 'function' | 'class' | 'method' | 'variable'
    //    - text: signature + docstring (for search quality)
    //    - metadata: name, signature, exported, docstring, callees, isAsync
  }
}
```

### Extraction logic per query type

**Functions:**
- Name from `@name` capture
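- Signature: node text up to the `:` that ends the header (not the first `:`, since type hints such as `x: int` contain colons, and headers can span several lines)
- Return type: check for `return_type` field on `function_definition`
- isAsync: check if the source text of the `function_definition` node starts with `async` (for decorated functions, the enclosing `decorated_definition` text starts with `@`)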
- Docstring: first `expression_statement > string` child of body block
- Exported: name doesn't start with `_`
- Callees: scan body for `call` nodes, extract function names
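
The signature rule is easy to get wrong when it relies on the first colon in the text. A minimal sketch of one way to find the header-ending colon follows, assuming it is the first `:` outside any brackets and string literals. `extractSignature` is a hypothetical helper name, not something defined in the plan, and the quote handling is deliberately simple (it does not cover every triple-quoted edge case).

```typescript
// Hypothetical helper: returns the header of a `def`/`class` node, up to
// (but not including) the colon that opens the body.
export function extractSignature(nodeText: string): string {
  let depth = 0; // nesting level of (), [], {}
  let quote = ''; // active string delimiter, or '' when outside a string

  for (let i = 0; i < nodeText.length; i++) {
    const ch = nodeText.charAt(i);

    if (quote !== '') {
      if (ch === '\\') {
        i++; // skip the escaped character
      } else if (ch === quote) {
        quote = '';
      }
      continue;
    }

    if (ch === '"' || ch === "'") {
      quote = ch;
    } else if (ch === '(' || ch === '[' || ch === '{') {
      depth++;
    } else if (ch === ')' || ch === ']' || ch === '}') {
      depth--;
    } else if (ch === ':' && depth === 0) {
      // Collapse newlines and indentation from multi-line headers.
      return nodeText.slice(0, i).replace(/\s+/g, ' ').trim();
    }
  }

  // No header colon found (e.g. a syntax error): fall back to the first line.
  return nodeText.split('\n', 1).join('').trim();
}
```

Colons inside parameter annotations, lambda defaults, and string defaults are skipped because they sit inside parentheses or quotes, so `def foo(x: int, *args, **kwargs) -> str:` yields the full header rather than `def foo(x`.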

**Classes:**
- Name from `@name` capture
- Signature: `class Name(bases):` from first line
- Superclasses: from `superclasses` field (argument_list)
- Docstring: first string in body block
- Exported: name doesn't start with `_`
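
Both the function and class rules take the docstring from the first string in the body. The raw text of that string node still carries its quotes and the body's indentation, so some cleanup is needed before it is stored. The sketch below is one possible approach, loosely following what Python's `inspect.cleandoc` does; `cleanDocstring` is a hypothetical name, and it assumes the input is the literal's source text.

```typescript
// Hypothetical helper: turns the source text of a string literal into
// docstring content, or returns null if it is not a plain string literal
// (for example an f-string).
export function cleanDocstring(literal: string): string | null {
  // Optional r/u prefix, then a triple- or single-quoted body.
  const match = /^[rRuU]?("""|'''|"|')([\s\S]*)\1$/.exec(literal.trim());
  if (match === null) return null;
  const body = match[2];
  if (body === undefined) return null;

  const lines = body.split('\n');
  // Common indentation of the non-blank continuation lines.
  const indents = lines
    .slice(1)
    .filter((line) => line.trim() !== '')
    .map((line) => line.search(/\S/));
  const indent = indents.length > 0 ? Math.min(...indents) : 0;

  return lines
    .map((line, i) => (i === 0 ? line : line.slice(indent)))
    .join('\n')
    .trim();
}
```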

**Methods:**
- Same as functions but type is `'method'`
- Parent class name prepended to signature: `ClassName.method_name`

**Imports:**
- `import_statement`: extract module name
- `import_from_statement`: extract module + imported names
- Stored in file-level `metadata.imports` array

**Module variables:**
- `UPPER_CASE` assignments at module level → type `'variable'`
- Name from left-hand identifier
- Exported: name doesn't start with `_`
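
The two naming checks in this list can be sketched as small predicates. Both names are hypothetical, and the exact definition of `UPPER_CASE` (here: optional leading underscores, then uppercase letters, digits, and underscores) is an assumption, since the plan does not spell it out.

```typescript
// Assumption: a module constant is written in UPPER_CASE, optionally with
// leading underscores for private constants such as `_CACHE_SIZE`.
export function isModuleConstant(name: string): boolean {
  return /^_*[A-Z][A-Z0-9_]*$/.test(name);
}

// The `_` convention: names starting with an underscore are not exported.
export function isExportedByConvention(name: string): boolean {
  return !name.startsWith('_');
}
```

With these definitions `_INTERNAL_LIMIT` is still indexed as a variable, but with `exported: false`.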

**Parameters (`*args`, `**kwargs`):**
- Extract `*args` via tree-sitter `list_splat_pattern` node
- Extract `**kwargs` via `dictionary_splat_pattern` node
- Include in signature: `def foo(x: int, *args, **kwargs) -> str`
- These are extremely common in Python; the approach is validated by stack-graphs' parameter handling

**Async function detection:**
- `async def` is NOT a separate node type in tree-sitter-python
- It's a regular `function_definition` with an `async` keyword token as a child
- Detect by checking if source text of the node starts with `async`
- Confirmed by both AST inspection and stack-graphs (which also lacks `async_function_definition`)

**Callees — extraction depth:**
- Walk ALL `call` nodes within the function body subtree (any depth)
- Matches TypeScript behavior: `getDescendantsOfKind(CallExpression)` walks recursively
- This means calls inside nested lambdas, comprehensions, and conditionals ARE included
- A function that uses `result = list(map(lambda x: db.query(x), items))` DOES
list `db.query` as a callee, which is correct for dependency analysis
- Deduplicate by name+line (same pattern as TypeScript scanner)
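
The recursive walk and the name+line deduplication can be sketched over a simplified node shape. `NodeLike`, `Callee`, and `collectCallees` are hypothetical names for illustration; the real tree-sitter binding exposes a richer node API. The sketch also assumes the callee expression is the first child of a `call` node, which is a simplification of how the grammar models it.

```typescript
// Hypothetical minimal node shape, NOT the real tree-sitter API.
interface NodeLike {
  type: string;
  text: string;
  startRow: number; // zero-based row, as tree-sitter reports positions
  children: NodeLike[];
}

interface Callee {
  name: string;
  line: number; // one-based line number
}

export function collectCallees(body: NodeLike): Callee[] {
  const seen = new Set<string>();
  const callees: Callee[] = [];

  const visit = (node: NodeLike): void => {
    if (node.type === 'call') {
      // Assumption: the first child is the callee expression.
      const target = node.children[0];
      if (target !== undefined) {
        const line = node.startRow + 1;
        const key = `${target.text}:${line}`;
        if (!seen.has(key)) {
          seen.add(key);
          callees.push({ name: target.text, line });
        }
      }
    }
    // Recurse into every child, so calls nested in lambdas,
    // comprehensions, and conditionals are reached at any depth.
    for (const child of node.children) visit(child);
  };

  visit(body);
  return callees;
}
```

Because the walk does not stop at `lambda` or comprehension nodes, `db.query` in the example above is reported alongside `list` and `map`.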

### `__all__` handling

If module contains `__all__ = [...]`:
1. Parse the list literal to extract names
2. Override exported flag: only names in `__all__` are `exported: true`
3. If `__all__` is computed (not a simple list), fall back to `_` convention
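
The three steps can be sketched as follows. This is a text-based stand-in for illustration only: the scanner would more likely inspect the parsed list node, and `parseDunderAll` and `isExported` are hypothetical names. Anything other than a plain list of string literals (including a list with inline comments) is treated as computed and returns `null`.

```typescript
// Returns the names in a simple `__all__ = [...]` literal, or null when
// `__all__` is absent or computed.
export function parseDunderAll(source: string): string[] | null {
  // The list must be the whole right-hand side, so `[...] + other` is rejected.
  const match = /^__all__\s*=\s*\[([^\]]*)\]\s*(?:#.*)?$/m.exec(source);
  if (match === null) return null;

  const names: string[] = [];
  for (const item of (match[1] ?? '').split(',')) {
    const trimmed = item.trim();
    if (trimmed === '') continue; // trailing comma or empty list
    const literal = /^(['"])([A-Za-z_][A-Za-z0-9_]*)\1$/.exec(trimmed);
    if (literal === null) return null; // not a plain string: treat as computed
    const name = literal[2];
    if (name === undefined) return null;
    names.push(name);
  }
  return names;
}

// Step 2 and step 3: `__all__` wins when available, else the `_` convention.
export function isExported(name: string, dunderAll: string[] | null): boolean {
  if (dunderAll !== null) return dunderAll.includes(name);
  return !name.startsWith('_');
}
```

Note that with an explicit `__all__`, a public-looking name that is not listed becomes `exported: false`, which is the override described in step 2.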

### Snippet extraction

Every Document must include `metadata.snippet`: truncated source text for search
result previews. Use the same pattern as GoScanner: extract node text, truncate at
50 lines. Without this, Python search results would lack the code previews that Go
and TypeScript results have.
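
A minimal sketch of the truncation step, assuming a plain line-count cut with no trailing marker. Whether GoScanner appends an ellipsis or similar is not stated here, so that detail is left out; `truncateSnippet` is a hypothetical name.

```typescript
// Keeps at most `maxLines` lines of the node's source text.
export function truncateSnippet(nodeText: string, maxLines = 50): string {
  const lines = nodeText.split('\n');
  if (lines.length <= maxLines) return nodeText;
  return lines.slice(0, maxLines).join('\n');
}
```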

### Generated file detection

Skip files matching common Python generated patterns:
- `_pb2.py`, `_pb2_grpc.py` (protobuf stubs)
- Files with `# Generated by` or `# DO NOT EDIT` in the first 3 lines
- Migration files: `*/migrations/*.py` (Django), `*/versions/*.py` (Alembic)
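
The three rules above can be combined into a single predicate. This is a sketch under assumptions: `isGeneratedPythonFile` is a hypothetical name, paths are normalized to forward slashes, and a `migrations/` or `versions/` directory at the repository root is also treated as a match.

```typescript
export function isGeneratedPythonFile(filePath: string, content: string): boolean {
  const normalized = filePath.replace(/\\/g, '/');

  // Protobuf stubs: *_pb2.py and *_pb2_grpc.py
  if (/_pb2(_grpc)?\.py$/.test(normalized)) return true;

  // Django and Alembic migrations: a .py file directly inside the directory
  if (/(^|\/)(migrations|versions)\/[^/]+\.py$/.test(normalized)) return true;

  // Marker comments in the first 3 lines
  return content
    .split('\n')
    .slice(0, 3)
    .some((line) => line.includes('# Generated by') || line.includes('# DO NOT EDIT'));
}
```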

### `packages/core/src/utils/test-utils.ts` — refactor to language-aware

Refactor both `isTestFile()` and `findTestFile()` from hardcoded JS/TS patterns
to a language-aware pattern map. This prevents if/else chain growth as we add
Rust, Java, C#, etc.

```typescript
import * as path from 'node:path';

// Keyed by file extension (without the leading dot)
const TEST_PATTERNS: Record<string, (filePath: string) => boolean> = {
  ts: (f) => f.includes('.test.') || f.includes('.spec.'),
  tsx: (f) => f.includes('.test.') || f.includes('.spec.'),
  js: (f) => f.includes('.test.') || f.includes('.spec.'),
  jsx: (f) => f.includes('.test.') || f.includes('.spec.'),
  go: (f) => f.endsWith('_test.go'),
  py: (f) => {
    const name = path.basename(f);
    return name.startsWith('test_') || name.endsWith('_test.py') || name === 'conftest.py';
  },
};

export function isTestFile(filePath: string): boolean {
  const ext = path.extname(filePath).slice(1);
  const check = TEST_PATTERNS[ext];
  // Fall back to legacy JS/TS check for unknown extensions
  return check ? check(filePath) : filePath.includes('.test.') || filePath.includes('.spec.');
}
```

Similarly update `findTestFile()` to generate Python test path patterns
(`test_{name}.py`, `{name}_test.py`) alongside the existing `.test.`/`.spec.` patterns.
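
A sketch of the Python half of that change, assuming candidates are generated in the same directory as the source file. `pythonTestCandidates` is a hypothetical helper; how `findTestFile()` would merge or probe these paths is not specified here.

```typescript
// Returns candidate test paths for a Python source file, or [] for other files.
export function pythonTestCandidates(sourcePath: string): string[] {
  const normalized = sourcePath.replace(/\\/g, '/');
  const slash = normalized.lastIndexOf('/');
  const dir = slash === -1 ? '' : normalized.slice(0, slash + 1);
  const base = normalized.slice(slash + 1);
  if (!base.endsWith('.py')) return [];

  const name = base.slice(0, -'.py'.length);
  return [`${dir}test_${name}.py`, `${dir}${name}_test.py`];
}
```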

### `packages/core/src/scanner/index.ts`

Register PythonScanner:

```typescript
import { PythonScanner } from './python';

export function createDefaultRegistry(): ScannerRegistry {
  const registry = new ScannerRegistry();
  registry.register(new TypeScriptScanner());
  registry.register(new MarkdownScanner());
  registry.register(new GoScanner());
  registry.register(new PythonScanner()); // NEW
  return registry;
}
```

### Tests

| Test | What it verifies |
|------|-----------------|
| Extract function with type hints | Signature includes types |
| Extract async function | isAsync = true |
| Extract class with methods | Class document and method documents emitted separately |
| Extract decorated function | Decorator preserved in context |
| Extract imports | Both `import` and `from...import` |
| Extract module-level constants | UPPER_CASE assignments |
| Docstring extraction | First string in function/class body |
| Public/private via `_` convention | exported flag correct |
| `__all__` overrides convention | Only listed names exported |
| Callees from function body | Call nodes extracted |
| Snippet field populated | Truncated source text on every Document |
| isTestFile recognizes test_*.py | Python test convention |
| isTestFile recognizes conftest.py | pytest fixture files |
| Skip _pb2.py generated files | Generated file detection |
| Callees inside nested lambda | Recursive depth extraction |
| isTestFile refactored to pattern map | Language-aware, extensible |
| findTestFile generates Python patterns | test_{name}.py, {name}_test.py |
| Scan multiple files | Progress callback, error handling |
| Empty file | No crash, empty results |
| Syntax error in file | Graceful handling, partial results |

### Commit

```
feat(core): implement PythonScanner with full extraction

Extracts functions, classes, methods, imports, decorators, and module
variables from Python files using tree-sitter. Handles type hints,
docstrings, async functions, and __all__ for export detection.
```