From 5bfcbe7808585c969c9de0cdf69d0c5f444c37c8 Mon Sep 17 00:00:00 2001 From: prosdev Date: Tue, 31 Mar 2026 15:15:58 -0700 Subject: [PATCH] =?UTF-8?q?docs:=20write=20Core=20Phase=204=20plan=20?= =?UTF-8?q?=E2=80=94=20Python=20language=20support?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit 4-part plan for adding Python to dev-agent: - 4.1: Bundle tree-sitter-python WASM (476KB) + define extraction queries - 4.2: Implement PythonScanner (functions, classes, methods, imports, decorators, type hints, docstrings, __all__ exports, callees) - 4.3: Add Python-specific pattern rules for dev_patterns (try/except, raise, imports, type coverage) - 4.4: Test fixtures (FastAPI, pytest, utils), integration tests, docs Key decisions: tree-sitter WASM (not Python subprocess), no cross-file import resolution (name-based callees only), no framework-specific logic (decorators extracted generically), __all__ overrides _ convention. Co-Authored-By: Claude Opus 4.6 (1M context) --- .claude/da-plans/README.md | 2 +- .../4.1-bundle-wasm-queries.md | 131 +++++++++ .../4.2-python-scanner.md | 197 ++++++++++++++ .../4.3-pattern-rules.md | 186 +++++++++++++ .../4.4-test-fixtures.md | 198 ++++++++++++++ .../core/phase-4-python-support/overview.md | 255 ++++++++++++++++++ .claude/scratchpad.md | 6 +- 7 files changed, 973 insertions(+), 2 deletions(-) create mode 100644 .claude/da-plans/core/phase-4-python-support/4.1-bundle-wasm-queries.md create mode 100644 .claude/da-plans/core/phase-4-python-support/4.2-python-scanner.md create mode 100644 .claude/da-plans/core/phase-4-python-support/4.3-pattern-rules.md create mode 100644 .claude/da-plans/core/phase-4-python-support/4.4-test-fixtures.md create mode 100644 .claude/da-plans/core/phase-4-python-support/overview.md diff --git a/.claude/da-plans/README.md b/.claude/da-plans/README.md index 1385aa7..bb3d579 100644 --- a/.claude/da-plans/README.md +++ b/.claude/da-plans/README.md @@ -9,7 +9,7 @@ 
Implementation deviations are logged at the bottom of each plan file. | Track | Description | Status | |-------|-------------|--------| -| [Core](core/) | Scanner, vector storage, services, indexer | Phase 1: Merged, Phase 2: Merged, Phase 3: Draft (graph cache) | +| [Core](core/) | Scanner, vector storage, services, indexer | Phase 1-2: Merged, Phase 3: Draft (graph cache), Phase 4: Draft (Python) | | [CLI](cli/) | Command-line interface | Not started | | [MCP Server](mcp/) | Model Context Protocol server + adapters | Phase 1: Merged (tools improvement) | | [Subagents](subagents/) | Coordinator, explorer, planner, GitHub agents | Not started | diff --git a/.claude/da-plans/core/phase-4-python-support/4.1-bundle-wasm-queries.md b/.claude/da-plans/core/phase-4-python-support/4.1-bundle-wasm-queries.md new file mode 100644 index 0000000..01ae92d --- /dev/null +++ b/.claude/da-plans/core/phase-4-python-support/4.1-bundle-wasm-queries.md @@ -0,0 +1,131 @@ +# Part 4.1: Bundle Python WASM + Define Queries + +See [overview.md](overview.md) for architecture context. + +## Goal + +Bundle `tree-sitter-python.wasm`, register the `'python'` language, and define +the S-expression queries that the PythonScanner will use. + +## What changes + +### `packages/dev-agent/scripts/copy-wasm.js` + +Add `'python'` to `SUPPORTED_LANGUAGES`: + +```javascript +const SUPPORTED_LANGUAGES = ['go', 'typescript', 'tsx', 'javascript', 'python']; +``` + +### `packages/core/src/scanner/tree-sitter.ts` + +Add `'python'` to `TreeSitterLanguage`: + +```typescript +export type TreeSitterLanguage = 'go' | 'typescript' | 'tsx' | 'javascript' | 'python'; +``` + +### New file: `packages/core/src/scanner/python-queries.ts` + +All queries validated against `tree-sitter-python` grammar via AST inspection. + +```typescript +/** + * Tree-sitter queries for Python code extraction. + * Modeled after GO_QUERIES in go.ts. 
+ */ +export const PYTHON_QUERIES = { + // Top-level function definitions (not inside a class) + functions: ` + (module + (function_definition + name: (identifier) @name) @definition) + `, + + // Top-level decorated functions (e.g., @app.route, @pytest.fixture) + decoratedFunctions: ` + (module + (decorated_definition + definition: (function_definition + name: (identifier) @name)) @definition) + `, + + // Class definitions + classes: ` + (class_definition + name: (identifier) @name) @definition + `, + + // Method definitions (inside class body) + methods: ` + (class_definition + body: (block + (function_definition + name: (identifier) @name) @definition)) + `, + + // Decorated methods (inside class body) + decoratedMethods: ` + (class_definition + body: (block + (decorated_definition + definition: (function_definition + name: (identifier) @name)) @definition)) + `, + + // Import statements + imports: ` + (import_statement) @definition + `, + + // From...import statements + fromImports: ` + (import_from_statement) @definition + `, + + // Module-level variable assignments (constants, config) + moduleVariables: ` + (module + (expression_statement + (assignment + left: (identifier) @name)) @definition) + `, + + // Module-level type-annotated assignments (x: int = 3) + annotatedVariables: ` + (module + (expression_statement + (assignment + left: (identifier) @name + type: (type) @type)) @definition) + `, +}; +``` + +### Step 1: Validate ALL queries against tree-sitter-python grammar + +Before implementation, run each query against a real Python snippet (same approach +as the JS/TS query validation in Part 1.5). 
Specifically verify: + +- `bare-except` negation syntax: parse `except:` and `except ValueError:`, confirm + the field name used in the negation pattern `!name` is correct for tree-sitter-python +- `annotatedVariables`: parse `x: int = 3`, confirm the field name is `type` and + the node structure matches the query +- All other queries: confirm node types match grammar (function_definition, class_definition, etc.) + +Write a validation script at `/tmp/python-query-test.js`, run it, fix any broken queries. + +### Tests + +| Test | What it verifies | +|------|-----------------| +| `parseCode('def foo(): pass', 'python')` works | WASM loads | +| Each query matches expected Python source | Query correctness | +| Decorated function query matches `@app.route` pattern | Decorator handling | +| Method query matches method inside class | Class method detection | + +### Commit + +``` +feat(core): bundle tree-sitter-python WASM and define extraction queries +``` diff --git a/.claude/da-plans/core/phase-4-python-support/4.2-python-scanner.md b/.claude/da-plans/core/phase-4-python-support/4.2-python-scanner.md new file mode 100644 index 0000000..408a601 --- /dev/null +++ b/.claude/da-plans/core/phase-4-python-support/4.2-python-scanner.md @@ -0,0 +1,197 @@ +# Part 4.2: Implement PythonScanner + +See [overview.md](overview.md) for architecture context. + +## Goal + +Implement `PythonScanner` that extracts functions, classes, methods, imports, +and module variables from `.py` files. Outputs `Document[]` matching the existing +scanner interface. Register in the scanner registry. 
+ +## What changes + +### New file: `packages/core/src/scanner/python.ts` + +Implements the `Scanner` interface (same pattern as `go.ts`): + +```typescript +export class PythonScanner implements Scanner { + readonly language = 'python'; + readonly capabilities: ScannerCapabilities = { + syntax: true, + types: true, // type hints + documentation: true, // docstrings + }; + + canHandle(filePath: string): boolean { + return path.extname(filePath).toLowerCase() === '.py'; + } + + async scan(files, repoRoot, logger, onProgress): Promise<Document[]> { + // For each .py file: + // 1. Read file content + // 2. Parse with tree-sitter (language: 'python') + // 3. Run PYTHON_QUERIES + // 4. For each match, create a Document with: + // - id: `${relativePath}:${name}:${startLine}` + // - type: 'function' | 'class' | 'method' | 'variable' + // - text: signature + docstring (for search quality) + // - metadata: name, signature, exported, docstring, callees, isAsync + } +} +``` + +### Extraction logic per query type + +**Functions:** +- Name from `@name` capture +- Signature: first line of node text (up to the trailing `:`) +- Return type: check for `return_type` field on `function_definition` +- isAsync: check if source text starts with `async` +- Docstring: first `expression_statement > string` child of body block +- Exported: name doesn't start with `_` +- Callees: scan body for `call` nodes, extract function names + +**Classes:** +- Name from `@name` capture +- Signature: `class Name(bases):` from first line +- Superclasses: from `superclasses` field (argument_list) +- Docstring: first string in body block +- Exported: name doesn't start with `_` + +**Methods:** +- Same as functions but type is `'method'` +- Parent class name prepended to signature: `ClassName.method_name` + +**Imports:** +- `import_statement`: extract module name +- `import_from_statement`: extract module + imported names +- Stored in file-level `metadata.imports` array + +**Module variables:** +- `UPPER_CASE` assignments at module level 
→ type `'variable'` +- Name from left-hand identifier +- Exported: name doesn't start with `_` + +**Parameters (`*args`, `**kwargs`):** +- Extract `*args` via tree-sitter `list_splat_pattern` node +- Extract `**kwargs` via `dictionary_splat_pattern` node +- Include in signature: `def foo(x: int, *args, **kwargs) -> str` +- These are extremely common in Python — validated by stack-graphs' parameter handling + +**Async function detection:** +- `async def` is NOT a separate node type in tree-sitter-python +- It's a regular `function_definition` with an `async` keyword token as a child +- Detect by checking if source text of the node starts with `async` +- Confirmed by both AST inspection and stack-graphs (which also lacks `async_function_definition`) + +**Callees — extraction depth:** +- Walk ALL `call` nodes within the function body subtree (any depth) +- Matches TypeScript behavior: `getDescendantsOfKind(CallExpression)` walks recursively +- This means calls inside nested lambdas, comprehensions, and conditionals ARE included +- A function that uses `result = list(map(lambda x: db.query(x), items))` DOES + list `db.query` as a callee — correct for dependency analysis +- Deduplicate by name+line (same pattern as TypeScript scanner) + +### `__all__` handling + +If module contains `__all__ = [...]`: +1. Parse the list literal to extract names +2. Override exported flag: only names in `__all__` are `exported: true` +3. If `__all__` is computed (not a simple list), fall back to `_` convention + +### Snippet extraction + +Every Document must include `metadata.snippet` — truncated source text for search +result previews. Use the same pattern as GoScanner: extract node text, truncate at +50 lines. Without this, Python search results would lack code previews that Go and +TypeScript results have. 
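The snippet truncation and callee deduplication rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the GoScanner's actual code: `truncateSnippet`, `dedupeCallees`, and the `Callee` shape are hypothetical names, and the 50-line limit is taken from the text above.

```typescript
// Hypothetical callee shape; the real scanner's type may carry more fields.
interface Callee {
  name: string;
  line: number;
}

// Truncate node text to at most `maxLines` lines for search result previews.
function truncateSnippet(nodeText: string, maxLines: number = 50): string {
  const lines = nodeText.split('\n');
  if (lines.length <= maxLines) return nodeText;
  return lines.slice(0, maxLines).join('\n');
}

// Deduplicate callees by name + line, keeping the first occurrence.
function dedupeCallees(callees: Callee[]): Callee[] {
  const seen: Record<string, boolean> = {};
  const result: Callee[] = [];
  for (const callee of callees) {
    const key = `${callee.name}:${callee.line}`;
    if (seen[key]) continue;
    seen[key] = true;
    result.push(callee);
  }
  return result;
}
```

Keying on name plus line means two calls to `db.query` on different lines both survive, while the same call node reported twice collapses to one entry.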
+ +### Generated file detection + +Skip files matching common Python generated patterns: +- `_pb2.py`, `_pb2_grpc.py` (protobuf stubs) +- Files with `# Generated by` or `# DO NOT EDIT` in the first 3 lines +- Migration files: `*/migrations/*.py` (Django), `*/versions/*.py` (Alembic) + +### `packages/core/src/utils/test-utils.ts` — refactor to language-aware + +Refactor both `isTestFile()` and `findTestFile()` from hardcoded JS/TS patterns +to a language-aware pattern map. This prevents if/else chain growth as we add +Rust, Java, C#, etc. + +```typescript +const TEST_PATTERNS: Record<string, (f: string) => boolean> = { + ts: (f) => f.includes('.test.') || f.includes('.spec.'), + tsx: (f) => f.includes('.test.') || f.includes('.spec.'), + js: (f) => f.includes('.test.') || f.includes('.spec.'), + jsx: (f) => f.includes('.test.') || f.includes('.spec.'), + go: (f) => f.endsWith('_test.go'), + py: (f) => { + const name = path.basename(f); + return name.startsWith('test_') || name.endsWith('_test.py') || name === 'conftest.py'; + }, +}; + +export function isTestFile(filePath: string): boolean { + const ext = path.extname(filePath).slice(1); + const check = TEST_PATTERNS[ext]; + // Fall back to legacy JS/TS check for unknown extensions + return check ? check(filePath) : filePath.includes('.test.') || filePath.includes('.spec.'); +} +``` + +Similarly update `findTestFile()` to generate Python test path patterns +(`test_{name}.py`, `{name}_test.py`) alongside the existing `.test.`/`.spec.` patterns. 
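The Python path patterns for `findTestFile()` could be generated roughly like this. The helper name `pythonTestCandidates` is hypothetical, and the sketch assumes POSIX-style separators and tests living next to the source file; a real implementation would probably also need to look in a separate `tests/` directory.

```typescript
// Hypothetical helper: candidate test file paths for a Python source file.
// Assumes '/' separators and that tests sit in the same directory.
function pythonTestCandidates(sourcePath: string): string[] {
  const slash = sourcePath.lastIndexOf('/');
  const dir = slash >= 0 ? sourcePath.slice(0, slash + 1) : '';
  const file = sourcePath.slice(slash + 1);
  // Only .py files get Python test candidates.
  if (!/\.py$/.test(file)) return [];
  const stem = file.replace(/\.py$/, '');
  return [`${dir}test_${stem}.py`, `${dir}${stem}_test.py`];
}
```

Returning an ordered list lets the caller probe each candidate on disk and stop at the first one that exists.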
+ +### `packages/core/src/scanner/index.ts` + +Register PythonScanner: + +```typescript +import { PythonScanner } from './python'; + +export function createDefaultRegistry(): ScannerRegistry { + const registry = new ScannerRegistry(); + registry.register(new TypeScriptScanner()); + registry.register(new MarkdownScanner()); + registry.register(new GoScanner()); + registry.register(new PythonScanner()); // NEW + return registry; +} +``` + +### Tests + +| Test | What it verifies | +|------|-----------------| +| Extract function with type hints | Signature includes types | +| Extract async function | isAsync = true | +| Extract class with methods | Class doc + method separate | +| Extract decorated function | Decorator preserved in context | +| Extract imports | Both `import` and `from...import` | +| Extract module-level constants | UPPER_CASE assignments | +| Docstring extraction | First string in function/class body | +| Public/private via `_` convention | exported flag correct | +| `__all__` overrides convention | Only listed names exported | +| Callees from function body | Call nodes extracted | +| Snippet field populated | Truncated source text on every Document | +| isTestFile recognizes test_*.py | Python test convention | +| isTestFile recognizes conftest.py | pytest fixture files | +| Skip _pb2.py generated files | Generated file detection | +| Callees inside nested lambda | Recursive depth extraction | +| isTestFile refactored to pattern map | Language-aware, extensible | +| findTestFile generates Python patterns | test_{name}.py, {name}_test.py | +| Scan multiple files | Progress callback, error handling | +| Empty file | No crash, empty results | +| Syntax error in file | Graceful handling, partial results | + +### Commit + +``` +feat(core): implement PythonScanner with full extraction + +Extracts functions, classes, methods, imports, decorators, and module +variables from Python files using tree-sitter. 
Handles type hints, +docstrings, async functions, and __all__ for export detection. +``` diff --git a/.claude/da-plans/core/phase-4-python-support/4.3-pattern-rules.md b/.claude/da-plans/core/phase-4-python-support/4.3-pattern-rules.md new file mode 100644 index 0000000..29e0f8e --- /dev/null +++ b/.claude/da-plans/core/phase-4-python-support/4.3-pattern-rules.md @@ -0,0 +1,186 @@ +# Part 4.3: Python Pattern Rules for dev_patterns + +See [overview.md](overview.md) for architecture context. + +## Goal + +Add Python-specific S-expression queries to the `PatternMatcher` so `dev_patterns` +can analyze error handling, import style, and type coverage in Python files. + +## What changes + +### `packages/core/src/pattern-matcher/rules.ts` + +Add Python-specific rules alongside the existing JS/TS rules: + +```typescript +// ============================================================================ +// Python Error Handling (4 rules) +// ============================================================================ + +export const PYTHON_ERROR_HANDLING_QUERIES: PatternMatchRule[] = [ + { + id: 'try-except', + category: 'error-handling', + query: '(try_statement) @match', + }, + { + id: 'raise', + category: 'error-handling', + query: '(raise_statement) @match', + }, + { + id: 'except-clause', + category: 'error-handling', + query: '(except_clause) @match', + }, + { + id: 'bare-except', + category: 'error-handling', + // except without specifying exception type — code smell + query: '(except_clause !name) @match', + }, +]; + +// ============================================================================ +// Python Import Style (3 rules) +// ============================================================================ + +export const PYTHON_IMPORT_QUERIES: PatternMatchRule[] = [ + { + id: 'import-module', + category: 'import-style', + query: '(import_statement) @match', + }, + { + id: 'from-import', + category: 'import-style', + query: '(import_from_statement) @match', + }, + { 
+ id: 'relative-import', + category: 'import-style', + query: '(import_from_statement module_name: (relative_import)) @match', + }, +]; + +// ============================================================================ +// Python Type Coverage (3 rules) +// ============================================================================ + +export const PYTHON_TYPE_QUERIES: PatternMatchRule[] = [ + { + id: 'typed-parameter', + category: 'type-coverage', + query: '(typed_parameter) @match', + }, + { + id: 'function-return-type', + category: 'type-coverage', + query: '(function_definition return_type: (type)) @match', + }, + { + id: 'function-total', + category: 'type-coverage', + query: '(function_definition) @match', + }, +]; + +export const ALL_PYTHON_QUERIES: PatternMatchRule[] = [ + ...PYTHON_ERROR_HANDLING_QUERIES, + ...PYTHON_IMPORT_QUERIES, + ...PYTHON_TYPE_QUERIES, +]; +``` + +### `packages/core/src/pattern-matcher/wasm-matcher.ts` + +**Two changes required:** + +1. Add `'.py'` to `EXTENSION_TO_LANGUAGE`: + +```typescript +const EXTENSION_TO_LANGUAGE: Record = { + '.ts': 'typescript', + '.tsx': 'tsx', + '.js': 'javascript', + '.jsx': 'javascript', + '.py': 'python', // NEW +}; +``` + +2. 
Add `'python'` to the hardcoded `supportedLanguages` set in `match()`: + +```typescript +const supportedLanguages = new Set(['typescript', 'tsx', 'javascript', 'go', 'python']); +``` + +**Without #2, all Python pattern queries will silently return empty results.** + +### `packages/core/src/services/pattern-analysis-service.ts` + +Refactor `runAllAstQueries` to use a map-based query selection (not if/else): + +```typescript +import { ALL_QUERIES } from '../pattern-matcher/rules'; +import { ALL_PYTHON_QUERIES } from '../pattern-matcher/rules'; + +const QUERIES_BY_LANGUAGE: Record = { + typescript: ALL_QUERIES, + tsx: ALL_QUERIES, + javascript: ALL_QUERIES, + python: ALL_PYTHON_QUERIES, +}; + +export async function runAllAstQueries( + content: string, + filePath: string | undefined, + matcher: PatternMatcher | undefined +): Promise> { + if (!matcher || !filePath) return new Map(); + const language = resolveLanguage(filePath); + if (!language) return new Map(); + + const queries = QUERIES_BY_LANGUAGE[language] ?? []; + if (queries.length === 0) return new Map(); + return matcher.match(content, language, queries); +} +``` + +This prevents if/else chain growth as we add Rust, Java, C#. Adding a new +language's pattern rules is one line in the map. + +### Query validation + +**IMPORTANT:** The `bare-except` query `(except_clause !name) @match` uses field +negation. The field name `name` must be verified against the actual `tree-sitter-python` +grammar definition. If the exception type uses a different field name (e.g., `type`), +this query will false-positive on ALL except clauses. Parse a bare `except:` and a +typed `except ValueError:` to confirm the AST structure before implementation. + +Similarly, validate `function-return-type` field name `return_type` and node type +`type` against the grammar. 
+ +### Tests + +All queries validated against real Python source via tree-sitter parsing: + +| Test | Source | Expected | +|------|--------|----------| +| try-except positive | `try:\n x()\nexcept:\n pass` | count === 1 | +| raise positive | `raise ValueError("bad")` | count === 1 | +| bare-except positive | `except:\n pass` | count === 1 | +| bare-except negative | `except ValueError:\n pass` | count === 0 | +| typed-parameter positive | `def f(x: int): pass` | count === 1 | +| function-return-type positive | `def f() -> int: pass` | count === 1 | +| relative-import positive | `from .models import User` | count === 1 | +| relative-import negative | `from os import path` | count === 0 | + +| WasmPatternMatcher accepts 'python' | Integration test: match() returns non-empty | count > 0 | +| All 10 Python queries return counts on real source | End-to-end validation | counts > 0 | + +### Commit + +``` +feat(core): add Python pattern rules for dev_patterns +``` diff --git a/.claude/da-plans/core/phase-4-python-support/4.4-test-fixtures.md b/.claude/da-plans/core/phase-4-python-support/4.4-test-fixtures.md new file mode 100644 index 0000000..bada844 --- /dev/null +++ b/.claude/da-plans/core/phase-4-python-support/4.4-test-fixtures.md @@ -0,0 +1,198 @@ +# Part 4.4: Test Fixtures, Integration Tests, Documentation + +See [overview.md](overview.md) for architecture context. + +## Goal + +Add realistic Python test fixtures, integration tests that verify end-to-end +behavior, and update documentation to advertise Python support. 
+ +## What changes + +### New fixtures: `packages/core/src/scanner/__fixtures__/` + +**`fastapi-app.py`** — Modern Python web app: +```python +"""FastAPI application for user management.""" + +from fastapi import FastAPI, HTTPException +from pydantic import BaseModel +from typing import Optional + +app = FastAPI() + +class User(BaseModel): + """User data model.""" + name: str + email: str + age: Optional[int] = None + +@app.get("/users/{user_id}") +async def get_user(user_id: int) -> User: + """Fetch a user by ID.""" + user = await db.get(user_id) + if not user: + raise HTTPException(status_code=404) + return user +``` + +**`pytest-tests.py`** — Test file: +```python +import pytest +from app import get_user + +@pytest.fixture +def mock_db(): + return MockDatabase() + +def test_get_user(mock_db): + result = get_user(1) + assert result.name == "Alice" +``` + +**`models.py`** — Dataclass and Pydantic models: +```python +"""Data models.""" + +from dataclasses import dataclass, field +from typing import Optional + +@dataclass +class Config: + """Application configuration.""" + host: str = "localhost" + port: int = 8080 + debug: bool = False + tags: list[str] = field(default_factory=list) + +@dataclass +class UserProfile: + name: str + email: str + age: Optional[int] = None +``` + +**`__init__.py`** — Package init with re-exports: +```python +"""User management package.""" + +from .models import Config, UserProfile +from .utils import parse_date + +__all__ = ["Config", "UserProfile", "parse_date"] +``` + +**`utils.py`** — Utility module with `__all__`: +```python +"""Utility functions.""" + +__all__ = ["parse_date", "format_currency"] + +MAX_RETRIES = 3 +_INTERNAL_CACHE = {} + +def parse_date(date_str: str) -> datetime: + """Parse a date string.""" + return datetime.strptime(date_str, "%Y-%m-%d") + +def format_currency(amount: float) -> str: + return f"${amount:.2f}" + +def _internal_helper(): + """Private helper — not in __all__.""" + pass +``` + +### Test file: 
`packages/core/src/scanner/__tests__/python.test.ts` + +| Test | What it verifies | +|------|-----------------| +| **Fixture: fastapi-app.py** | | +| Extracts `get_user` as async function | isAsync, type: function | +| Extracts `User` as class | type: class, superclass: BaseModel | +| Extracts `app` as module variable | type: variable | +| Extracts imports (fastapi, pydantic, typing) | metadata.imports | +| Decorator preserved in context | @app.get in docstring/snippet | +| Return type in signature | `-> User` in signature | +| **Fixture: pytest-tests.py** | | +| Test functions detected | test_get_user | +| Fixture decorator detected | mock_db | +| File identified as test file | isTestFile utility | +| **Fixture: utils.py** | | +| `__all__` controls exported flag | parse_date: exported, _internal_helper: not | +| Module constants extracted | MAX_RETRIES: type variable, exported | +| Private vars not exported | _INTERNAL_CACHE: exported = false | +| Docstrings extracted | "Parse a date string." on parse_date | +| **Fixture: models.py** | | +| Dataclass fields extracted | Config, UserProfile as classes | +| @dataclass decorator in context | Decorator preserved | +| **Fixture: __init__.py** | | +| Re-exports detected via imports | from .models import ... 
| +| `__all__` controls exported | Only listed names exported | +| **Scope boundaries** | | +| Nested functions NOT extracted | Intentional: only module + class level | +| @property methods extracted | Decorated method in class body | +| **Cross-language parity** | | +| Python Document has same fields as Go Document | Field presence parity check | +| snippet, signature, docstring, callees all populated | No missing fields vs Go/TS | + +### Integration test: `packages/core/src/scanner/__tests__/python-integration.test.ts` + +```typescript +describe('PythonScanner integration', () => { + it('should scan fixture directory and produce valid documents', async () => { + const scanner = new PythonScanner(); + const docs = await scanner.scan( + ['fastapi-app.py', 'utils.py'], + fixturesPath + ); + + // Verify document count + expect(docs.length).toBeGreaterThan(5); + + // Verify all required fields present + for (const doc of docs) { + expect(doc.id).toBeTruthy(); + expect(doc.language).toBe('python'); + expect(doc.metadata.file).toBeTruthy(); + expect(doc.metadata.startLine).toBeGreaterThan(0); + } + + // Verify specific extractions + const getUser = docs.find(d => d.metadata.name === 'get_user'); + expect(getUser).toBeDefined(); + expect(getUser!.type).toBe('function'); + expect(getUser!.metadata.isAsync).toBe(true); + expect(getUser!.metadata.exported).toBe(true); + }); +}); +``` + +### Documentation updates + +**README.md** — Add Python to supported languages table: + +```markdown +| Language | Scanner | Features | +|----------|---------|----------| +| TypeScript/JavaScript | ts-morph | Functions, classes, interfaces, types, arrow functions, hooks | +| Go | tree-sitter | Functions, methods, structs, interfaces, generics | +| Python | tree-sitter | Functions, classes, methods, decorators, type hints, docstrings | +| Markdown | remark | Documentation sections | +``` + +**CLAUDE.md** — Add Python to language list + +**website/content/index.mdx** — Add Python to features + 
+**website/content/updates/index.mdx** — Add release notes + +### Commit + +``` +feat(core): add Python test fixtures, integration tests, and documentation + +Includes FastAPI, pytest, and utility module fixtures. Verifies scanner +extracts functions, classes, methods, imports, decorators, type hints, +docstrings, and __all__ exports. +``` diff --git a/.claude/da-plans/core/phase-4-python-support/overview.md b/.claude/da-plans/core/phase-4-python-support/overview.md new file mode 100644 index 0000000..ce0bc72 --- /dev/null +++ b/.claude/da-plans/core/phase-4-python-support/overview.md @@ -0,0 +1,255 @@ +# Phase 4: Python Language Support + +**Status:** Draft + +## Context + +dev-agent currently supports TypeScript, JavaScript, Go, and Markdown. Python is the +#1 language for AI/ML engineers — the exact audience using MCP tools with Cursor and +Claude Code. A Python developer indexing their repo today gets only markdown and any +JS config files. Core `.py` files are invisible to search, refs, patterns, and map. + +The tree-sitter infrastructure is already in place from MCP Phase 1: +- `web-tree-sitter` WASM runtime (bundled) +- `tree-sitter-python.wasm` already in `tree-sitter-wasms@0.1.13` (476KB) +- `PatternMatcher` interface accepts any tree-sitter language +- Scanner registry pattern (`GoScanner` as reference implementation) + +### What Python developers use + +| Framework | What to extract | +|-----------|----------------| +| **FastAPI / Flask / Django** | Route decorators, view functions, middleware | +| **pytest** | Test functions (`test_*`), fixtures (`@pytest.fixture`) | +| **Pydantic / dataclasses** | Model classes, field definitions | +| **SQLAlchemy / Django ORM** | Model classes, relationships | +| **Click / Typer** | CLI commands | +| **General** | Functions, classes, methods, imports, type hints, docstrings | + +The scanner needs to handle all of these via the common Python AST — we don't +need framework-specific logic. 
Functions, classes, methods, decorators, and +imports cover everything. + +--- + +## What we're building + +``` +┌──────────────────────────────────────────────────────────┐ +│ PythonScanner │ +│ │ +│ Implements Scanner interface (same as GoScanner) │ +│ │ +│ tree-sitter-python.wasm │ +│ │ │ +│ ▼ │ +│ Parse .py files → AST │ +│ │ │ +│ ▼ │ +│ PYTHON_QUERIES (S-expression patterns) │ +│ ┌─────────────────────────────────────┐ │ +│ │ functions → function_definition │ │ +│ │ methods → function_definition │ │ +│ │ inside class body │ │ +│ │ classes → class_definition │ │ +│ │ imports → import_statement │ │ +│ │ + import_from_statement│ │ +│ │ decorators → decorated_definition │ │ +│ │ module_vars → assignment at top │ │ +│ └─────────────────────────────────────┘ │ +│ │ │ +│ ▼ │ +│ Document[] (same shape as Go/TS scanners) │ +│ - id, text, type, language: 'python' │ +│ - metadata: name, signature, exported, docstring, │ +│ callees, isAsync, imports │ +└──────────────────────────────────────────────────────────┘ +``` + +### Integration with existing tools + +``` +Scanner Registry + ├── TypeScriptScanner (.ts, .tsx, .js, .jsx) ← ts-morph + ├── GoScanner (.go) ← tree-sitter + ├── MarkdownScanner (.md) ← remark + └── PythonScanner (.py) ← tree-sitter (NEW) + +All MCP tools work automatically: + dev_search → Python code searchable by meaning + dev_refs → Python call graph (callees from AST) + dev_map → Python files in hot paths + components + dev_patterns → Python patterns via AST queries (error handling, imports, types) + dev_status → Python file count in stats +``` + +### What we DON'T need to build + +- **No new MCP tools** — existing tools work with any language +- **No new pattern-matching engine** — the 12 JS/TS rules don't apply to Python, so Part 4.3 + adds Python-specific rules to the existing `PatternMatcher` (which accepts any tree-sitter language) +- **No framework-specific logic** — decorators, dataclasses, etc. are extracted + as generic AST patterns. The AI agent interprets them. 
+ +--- + +## Python-specific considerations + +### Public vs private + +Python uses naming conventions, not keywords: +- `_private` — single underscore prefix = private by convention +- `__mangled` — double underscore = name-mangled (very private) +- No underscore = public +- `__all__` — explicit public API list (if present, overrides convention) + +The scanner should: +- Mark functions/classes without `_` prefix as `exported: true` +- If `__all__` is defined at module level, use that instead + +### Docstrings + +Python docstrings are the first expression statement in a function/class body: +```python +def foo(): + """This is the docstring.""" # expression_statement > string + pass +``` + +Tree-sitter node path: `function_definition > body > block > expression_statement > string` + +### Callees extraction + +Python function calls are `call` nodes with `function` field: +```python +result = db.query(User) # call > function: attribute (db.query) +foo() # call > function: identifier (foo) +``` + +For cross-file resolution, we need to map imports to file paths. This is harder +than Go (where the package system is explicit) but we can do basic resolution: +- `from .models import User` → `models.py` in same package +- `import os` → stdlib (skip) +- `from myproject.db import query` → `myproject/db.py` + +For Phase 4, we extract callees with names but **don't resolve file paths** for +cross-file references. This matches how the TypeScript scanner works (callees +have `name` but `file` is optional). `dev_refs` will show callers/callees by +name; cross-file resolution is a future enhancement. + +### Async functions + +Python `async def` maps to a `function_definition` with an `async` keyword token +as a child of that node. The scanner should set `metadata.isAsync = true`. 
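The public/private and async rules above could be reduced to two small predicates. This is a sketch only: `isExported` and `isAsyncDef` are hypothetical names, and `dunderAll` is assumed to hold the names already parsed from a simple `__all__ = [...]` literal, or `null` when `__all__` is absent or computed.

```typescript
// Decide the `exported` flag for a module-level name.
// When `__all__` was parsed, it overrides the underscore convention.
function isExported(name: string, dunderAll: string[] | null): boolean {
  if (dunderAll !== null) return dunderAll.indexOf(name) !== -1;
  return name.charAt(0) !== '_';
}

// Detect `async def` from the source text of a `function_definition` node.
// Assumes the text starts at the node itself, not at a wrapping decorator.
function isAsyncDef(nodeText: string): boolean {
  return /^\s*async\s+def\b/.test(nodeText);
}
```

Passing `null` for a computed `__all__` keeps the fallback to the `_` convention in one place, so callers never need to branch on how `__all__` was defined.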
+
+### Type hints
+
+Python 3 type annotations appear in the AST:
+- Parameters: `typed_parameter` nodes with `type` field
+- Return type: `function_definition` has `return_type` field
+- Variable annotations: `type` field on assignment
+
+The signature should include type hints for search quality:
+```
+def get_user(user_id: int) -> User
+```
+
+---
+
+## Parts
+
+| Part | Description | Risk |
+|------|-------------|------|
+| [4.1](./4.1-bundle-wasm-queries.md) | Bundle Python WASM, define queries, register language | Low — config + constants |
+| [4.2](./4.2-python-scanner.md) | Implement PythonScanner with full extraction | Medium — main implementation |
+| [4.3](./4.3-pattern-rules.md) | Add Python-specific pattern rules for dev_patterns | Low — S-expression constants |
+| [4.4](./4.4-test-fixtures.md) | Test fixtures, integration tests, documentation | Low — validation |
+
+---
+
+## Decisions
+
+| Decision | Rationale | Alternatives |
+|----------|-----------|-------------|
+| tree-sitter WASM, not Python's `ast` module | Matches Go scanner pattern. WASM already bundled. 476KB. | `ast` module via Python subprocess: slower, requires Python installed |
+| No cross-file callee resolution | Complex (import resolution varies by project). Name-based callees are useful enough. | Full resolution: needs import graph, virtual env analysis |
+| `__all__` overrides `_` convention | Explicit is better than implicit (Zen of Python). | Ignore `__all__`: simpler but less accurate |
+| No framework-specific extraction | Decorators and class patterns are generic. AI agent interprets. | Flask/Django extractors: high maintenance, low marginal value |
+| `exported: true` for non-underscore names | Matches Python community convention. | Always true: loses signal. Always false: wrong. |
+| Pattern rules in Phase 4.3 (not 4.2) | Scanner works without patterns. Patterns are additive. | All-in-one: larger PR, harder to review |
+
+---
+
+## Risk register
+
+| Risk | Likelihood | Impact | Mitigation |
+|------|-----------|--------|------------|
+| Python AST edge cases (walrus operator, match/case, PEP 695 type params) | Medium | Low | tree-sitter-python handles all modern syntax. Tests cover edge cases. |
+| Large Python repos (Django, Flask projects with thousands of files) | Medium | Medium | Scanner is file-at-a-time, same as Go. No global state. |
+| Import resolution too simplistic | High | Low | Phase 4 doesn't resolve imports to files. Just extracts names. Future work. |
+| `__all__` parsing complexity | Low | Low | Only check for simple list literal. Complex `__all__` (computed) → fall back to `_` convention. |
+| Python 2 syntax | Low | None | tree-sitter-python supports Python 2 syntax. We don't need to special-case. |
+| Decorator extraction too verbose | Medium | Low | Only extract decorator name, not arguments. Keeps documents focused. |
+| Generated files indexed (protobuf stubs, migrations) | Medium | Low | Skip `_pb2.py`, `_pb2_grpc.py`, files with `# Generated by` header. |
+| `isTestFile` doesn't recognize Python conventions | High | Medium | Update utility to handle `test_*.py`, `*_test.py`, `conftest.py`. |
+| `WasmPatternMatcher` rejects `'python'` language | High | High | Add `'python'` to both `EXTENSION_TO_LANGUAGE` and hardcoded `supportedLanguages` set. |
+
+---
+
+## Test strategy
+
+| Test | Priority | What it verifies |
+|------|----------|-----------------|
+| Extract functions with type hints | P0 | Core scanner functionality |
+| Extract classes with methods | P0 | Class + method detection |
+| Extract imports (import, from...import) | P0 | Import extraction |
+| Extract decorated functions | P0 | Decorator handling |
+| Extract async functions | P0 | isAsync flag |
+| Extract docstrings | P0 | First-expression docstring detection |
+| Public/private via `_` convention | P0 | exported flag |
+| `__all__` overrides convention | P1 | Explicit API |
+| Extract callees from function bodies | P1 | Call graph |
+| Scan real Python project (fixture) | P1 | Integration |
+| Pattern rules: try/except | P0 | Error handling detection |
+| Pattern rules: import style | P0 | Import analysis |
+| Pattern rules: type hint coverage | P0 | Type annotation detection |
+| Snippet field on every Document | P0 | Search result previews |
+| `isTestFile` recognizes `test_*.py`, `conftest.py` | P0 | Test detection |
+| Skip `_pb2.py` generated files | P1 | Noise reduction |
+| `WasmPatternMatcher` accepts `'python'` | P0 | Pattern analysis works |
+| Dataclass fixture extracted correctly | P1 | Common Python pattern |
+| `__init__.py` re-exports and `__all__` | P1 | Package API |
+| Nested functions intentionally excluded | P1 | Scope boundary |
+| dev_search finds Python code | P1 | End-to-end via Antfly |
+| dev_refs shows Python callers/callees | P1 | End-to-end |
+| dev_map includes Python in hot paths | P1 | End-to-end |
+
+---
+
+## Verification checklist
+
+- [ ] `tree-sitter-python.wasm` bundled in dist
+- [ ] `parseCode('def foo(): pass', 'python')` works
+- [ ] `PythonScanner.scan()` extracts functions, classes, methods, imports
+- [ ] Docstrings extracted from function/class bodies
+- [ ] `exported: true` for non-underscore names
+- [ ] `isAsync: true` for `async def` functions
+- [ ] Signatures include type hints
+- [ ] Callees extracted from function call nodes
+- [ ] Pattern rules detect try/except, import style, type hints
+- [ ] `WasmPatternMatcher` accepts `'python'` language (not silently rejected)
+- [ ] `isTestFile()` recognizes `test_*.py`, `*_test.py`, `conftest.py`
+- [ ] Generated files (`_pb2.py`) skipped
+- [ ] Snippet field populated on every Document
+- [ ] Test fixtures cover real Python patterns (FastAPI, pytest, dataclass, `__init__.py`)
+- [ ] `pnpm build && pnpm test` passes
+- [ ] `dev index` on a Python repo produces searchable documents
+- [ ] `dev_search "authentication"` finds Python code
+
+---
+
+## Dependencies
+
+- MCP Phase 1 (tree-sitter infrastructure) — merged
+- `tree-sitter-python.wasm` in `tree-sitter-wasms@0.1.13` — confirmed (476KB)
+- Scanner registry pattern — established by GoScanner
diff --git a/.claude/scratchpad.md b/.claude/scratchpad.md
index dbdea2d..c61085a 100644
--- a/.claude/scratchpad.md
+++ b/.claude/scratchpad.md
@@ -19,7 +19,7 @@
 - **PageRank at 10k+ nodes** — convergence tolerance 1e-6 may require all 100 iterations for large sparse graphs. Monitor performance. Consider reducing maxIterations or loosening tolerance for dev_map where approximate ranks are fine.
 - **getAll(limit: 10000) truncation** — medium-large monorepos may exceed 10k docs. Warning is logged but results are silently incomplete. Long-term: paginate or make limit configurable.
 - E2E tests in CI — blocked on Antfly memory requirements vs GitHub runner limits (7GB)
-- **Python language support** — tree-sitter-python WASM is ~300KB, already in tree-sitter-wasms. Needs a Python scanner (document extraction) + Python-specific pattern rules. High demand — large ecosystem. Worth a standalone plan covering: scanner, pattern rules, test fixtures, indexer integration. The PatternMatcher interface from 1.5 is language-agnostic so pattern rules slot right in; the scanner is the real work.
+- **Python language support** — plan written at `.claude/da-plans/core/phase-4-python-support/`. 4 parts: bundle WASM + queries, PythonScanner, pattern rules, test fixtures + docs.
 - Vue/Svelte SFC support — `.vue`/`.svelte` files have embedded `