diff --git a/README.md b/README.md index b536c6a..bfb4953 100644 --- a/README.md +++ b/README.md @@ -4,10 +4,12 @@ Local-first codebase intelligence for CLI workflows. Sia Code indexes your repo and lets you: -- search code fast (lexical, semantic, or hybrid) -- trace architecture with multi-hop research +- search code fast via ChunkHound CLI (lexical or semantic) +- trace architecture with ChunkHound research - store/retrieve project decisions and timeline context +Search and research are delegated to the ChunkHound CLI. Sia keeps index orchestration and memory storage local. + ## Why teams use it - Works directly on local code (`.sia-code/` index per repo/worktree) @@ -31,6 +33,9 @@ sia-code --version ## Quick Start (2 minutes) ```bash +# install ChunkHound CLI once +uv tool install chunkhound + # in your project sia-code init sia-code index . @@ -53,20 +58,20 @@ sia-code status | `sia-code index .` | Build index | | `sia-code index --update` | Incremental re-index | | `sia-code index --clean` | Rebuild index from scratch | -| `sia-code search "query"` | Hybrid search (default) | -| `sia-code search --regex "pattern"` | Lexical search | -| `sia-code research "question"` | Multi-hop relationship discovery | +| `sia-code search "query"` | ChunkHound-backed search (default mode from config) | +| `sia-code search --regex "pattern"` | ChunkHound lexical search | +| `sia-code research "question"` | ChunkHound research | | `sia-code memory sync-git` | Import timeline/changelog from git | | `sia-code memory search "topic"` | Search stored project memory | | `sia-code config show` | Print active configuration | ## Search Modes (important) -- Default command is hybrid: `sia-code search "query"` +- Default search mode comes from `chunkhound.default_search_mode` (default: `regex`) - Lexical mode: `sia-code search --regex "pattern"` -- Semantic-only mode: `sia-code search --semantic-only "query"` +- Semantic-only mode: `sia-code search --semantic-only "query"` (requires 
ChunkHound semantic setup) -Use `--no-deps` when you want only your project code. +Dependency visibility flags (`--no-deps`, `--deps-only`) are currently compatibility no-ops with ChunkHound-backed search. ## Git Sync Memory + Semantic Changelog @@ -74,8 +79,10 @@ Use `--no-deps` when you want only your project code. - Scans tags into changelog entries - Scans merge commits into timeline events +- For merge commits whose subject matches `Merge branch '...'`, also creates changelog entries - Stores `files_changed` and diff stats (`insertions`, `deletions`, `files`) - Optionally enhances sparse summaries using a local summarization model +- `memory sync-git --limit 0` processes all eligible events How semantic summary generation works: @@ -111,8 +118,8 @@ Useful commands: ```bash sia-code config show -sia-code config get search.vector_weight -sia-code config set search.vector_weight 0.0 +sia-code config get chunkhound.default_search_mode +sia-code config set chunkhound.default_search_mode semantic ``` Note: backend selection is auto by default (`sqlite-vec` for new indexes, legacy `usearch` supported). diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md index f803e89..17a3bf8 100644 --- a/docs/ARCHITECTURE.md +++ b/docs/ARCHITECTURE.md @@ -11,9 +11,9 @@ Sia Code has two core pipelines: - write lexical/vector indexes 2. 
**Query pipeline** - - preprocess query - - run lexical and/or semantic search - - rank + return chunk matches + - resolve mode and build ChunkHound CLI command + - execute ChunkHound search/research + - parse and render results in Sia CLI formats ## Storage Model @@ -34,17 +34,18 @@ Backend selection: - `cli.py`: command entry and orchestration - `indexer/coordinator.py`: full/incremental indexing lifecycle - `parser/*`: language detection, concept extraction, chunk building -- `storage/*`: search execution and persistence +- `search/chunkhound_cli.py`: ChunkHound command bridge and output parsing +- `storage/*`: memory persistence plus legacy/local search paths - `memory/*`: git-to-memory sync and timeline/changelog tooling - `embed_server/*`: optional shared embed daemon ## Search Architecture -- **Hybrid (default):** lexical + semantic -- **Lexical (`--regex`):** exact token/symbol heavy queries -- **Semantic (`--semantic-only`):** concept similarity only +- **Default (`search`)**: mode from `chunkhound.default_search_mode` (default `regex`) +- **Lexical (`--regex`)**: exact token/symbol heavy queries +- **Semantic (`--semantic-only`)**: ChunkHound semantic mode -Flags like `--no-deps` and `--deps-only` control dependency-code visibility. +Flags like `--no-deps` and `--deps-only` are accepted for compatibility but currently no-op with ChunkHound-backed search. ## Design Goals diff --git a/docs/BENCHMARK_METHODOLOGY.md b/docs/BENCHMARK_METHODOLOGY.md index ef41ea4..a78eb5a 100644 --- a/docs/BENCHMARK_METHODOLOGY.md +++ b/docs/BENCHMARK_METHODOLOGY.md @@ -2,10 +2,12 @@ This project uses RepoEval-style retrieval evaluation for search quality checks. +> Note: The benchmark harness in `tests/benchmarks/` targets legacy in-process retrievers. ChunkHound-backed CLI search/research should be benchmarked separately as end-to-end CLI runs. 
+ ## Scope - Evaluate retrieval quality (not answer generation) -- Compare lexical, hybrid, and semantic settings +- Compare lexical, hybrid, and semantic settings in the legacy retriever stack - Use consistent query set and top-k metrics ## Minimal Reproduction Flow @@ -23,7 +25,7 @@ pkgx python tests/benchmarks/run_full_repoeval_benchmark.py - Recall@k (especially Recall@5) - indexing time and query latency -- configuration used (`vector_weight`, embedding settings) +- configuration used (`chunkhound.default_search_mode` for CLI runs, `vector_weight` for legacy runs, embedding settings) ## Fairness Rules diff --git a/docs/BENCHMARK_RESULTS.md b/docs/BENCHMARK_RESULTS.md index 62c3f74..182283d 100644 --- a/docs/BENCHMARK_RESULTS.md +++ b/docs/BENCHMARK_RESULTS.md @@ -5,19 +5,21 @@ - RepoEval Recall@5: **89.9%** (reported) - Improvement over cAST baseline: **+12.9 points** (reported) +> Note: These numbers are historical baselines from legacy in-process retrievers. Current CLI `search` and `research` are ChunkHound-backed. + ## Practical Takeaways - Lexical-heavy search performs strongly for code identifiers. -- Hybrid can still be useful for natural-language style queries. +- For legacy retriever experiments, hybrid can still help natural-language style queries. - For daily debugging, `--regex` is often the fastest path. ## Recommended Starting Config ```bash -sia-code config set search.vector_weight 0.0 +sia-code config set chunkhound.default_search_mode regex ``` -Then adjust only if your query style is mostly conceptual. +For legacy benchmark experiments, `search.vector_weight` remains available in the in-process retriever stack. 
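The Recall@k numbers above can be read against a minimal sketch of the metric. This is one common definition (fraction of ground-truth chunks appearing in the top-k retrieved results), not the project's actual benchmark harness; the `recall_at_k` helper and the chunk IDs are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    # Fraction of ground-truth chunks that appear in the top-k retrieved list.
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / len(relevant) if relevant else 0.0

# Hypothetical example: 1 of 2 relevant chunks retrieved in the top 5.
print(recall_at_k(["a", "b", "c", "d", "e"], {"b", "z"}))
```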
## Where to find raw benchmark tooling diff --git a/docs/CLI_FEATURES.md b/docs/CLI_FEATURES.md index 8ed1921..d2e6dab 100644 --- a/docs/CLI_FEATURES.md +++ b/docs/CLI_FEATURES.md @@ -18,8 +18,8 @@ sia-code status | --- | --- | --- | | `init` | Create `.sia-code/` index workspace | `--path`, `--dry-run` | | `index [PATH]` | Build index | `--update`, `--clean`, `--parallel`, `--workers`, `--watch`, `--debounce`, `--no-git-sync` | -| `search QUERY` | Search code (default hybrid) | `--regex`, `--semantic-only`, `-k/--limit`, `--no-filter`, `--no-deps`, `--deps-only`, `--format`, `--output` | -| `research QUESTION` | Multi-hop architecture exploration | `--hops`, `--graph`, `-k/--limit`, `--no-filter` | +| `search QUERY` | Search code (ChunkHound-backed) | `--regex`, `--semantic-only`, `-k/--limit`, `--no-filter` (compat), `--no-deps` (compat), `--deps-only` (compat), `--format`, `--output` | +| `research QUESTION` | Architecture exploration (ChunkHound-backed) | `--hops` (compat), `--graph` (compat), `-k/--limit` (compat), `--no-filter` (compat) | | `status` | Index health and statistics | none | | `compact [PATH]` | Remove stale chunks | `--threshold`, `--force` | | `interactive` | Live query loop | `--regex`, `-k/--limit` | @@ -28,17 +28,17 @@ sia-code status | Command | Purpose | Key options | | --- | --- | --- | -| `memory sync-git` | Import timeline/changelog from git (with diff stats and optional local semantic summaries) | `--since`, `--limit`, `--dry-run`, `--tags-only`, `--merges-only`, `--min-importance` | +| `memory sync-git` | Import timeline/changelog from git (with diff stats and optional local semantic summaries) | `--since`, `--limit` (`0` means all), `--dry-run`, `--tags-only`, `--merges-only`, `--min-importance` | | `memory add-decision TITLE` | Add pending decision | `-d/--description` (required), `-r/--reasoning`, `-a/--alternatives` | -| `memory list` | List memory items | `--type`, `--status`, `--limit`, `--format` | +| `memory list` | List 
memory items | `--type`, `--status`, `--limit` (`0` means all), `--format` | | `memory approve ID` | Approve decision | `-c/--category` (required) | | `memory reject ID` | Reject decision | none | | `memory search QUERY` | Search memory | `--type`, `-k/--limit` | -| `memory timeline` | View timeline events | `--since`, `--event-type`, `--importance`, `--format` | -| `memory changelog [RANGE]` | Generate changelog | `--format`, `--output` | +| `memory timeline` | View timeline events | `--since`, `--event-type`, `--importance`, `--limit` (`0` means all), `--format` | +| `memory changelog [RANGE]` | Generate changelog | `--limit` (`0` means all), `--format`, `--output` | | `memory export` / `memory import` | Backup/restore memory | `-o/--output`, `-i/--input` | -`memory sync-git` is the entrypoint for semantic changelog generation: it extracts git context, then (if enabled) uses the local summarizer to enrich release and merge summaries stored in memory. +`memory sync-git` is the entrypoint for semantic changelog generation: it extracts git context, then (if enabled) uses the local summarizer to enrich tag releases and merge-derived changelog entries stored in memory. ## Embed Daemon @@ -48,15 +48,15 @@ sia-code status | `embed status` | Show daemon status | | `embed stop` | Stop daemon | -Use daemon when you rely heavily on hybrid/semantic search or memory embedding operations. +Use daemon when you rely heavily on memory embedding operations. 
## Config Commands ```bash sia-code config show sia-code config path -sia-code config get search.vector_weight -sia-code config set search.vector_weight 0.0 +sia-code config get chunkhound.default_search_mode +sia-code config set chunkhound.default_search_mode semantic ``` ## Output Formats @@ -71,8 +71,8 @@ sia-code config set search.vector_weight 0.0 - First index: `sia-code index .` - Ongoing work: `sia-code index --update` - Exact symbols: `sia-code search --regex "pattern"` -- Project-only focus: `--no-deps` -- Architecture questions: `sia-code research "..." --hops 3` +- If output is noisy: tighten regex terms or add path-like query terms +- Architecture questions: `sia-code research "..."` ## Related Docs diff --git a/docs/CODE_STRUCTURE.md b/docs/CODE_STRUCTURE.md index 699b9db..f9bc2b1 100644 --- a/docs/CODE_STRUCTURE.md +++ b/docs/CODE_STRUCTURE.md @@ -21,8 +21,8 @@ sia_code/ core/ # shared models and enums parser/ # AST concept extraction and chunking indexer/ # indexing orchestration, hash cache, metrics - search/ # query pre-processing and multi-hop logic - storage/ # sqlite-vec + legacy usearch backends + search/ # ChunkHound CLI bridge + query helpers + storage/ # memory persistence + legacy local search backends memory/ # git sync, timeline, changelog, decision flow embed_server/ # optional embedding daemon ``` @@ -35,7 +35,8 @@ sia_code/ | Change default behavior | `sia_code/config.py`, `sia_code/cli.py` | | Tune indexing | `sia_code/indexer/coordinator.py`, `sia_code/indexer/chunk_index.py` | | Tune chunking | `sia_code/parser/chunker.py`, `sia_code/parser/concepts.py` | -| Search ranking/filtering | `sia_code/storage/sqlite_vec_backend.py`, `sia_code/storage/usearch_backend.py` | +| ChunkHound search/research bridge | `sia_code/search/chunkhound_cli.py`, `sia_code/cli.py` | +| Legacy/local search ranking (interactive) | `sia_code/storage/sqlite_vec_backend.py`, `sia_code/storage/usearch_backend.py` | | Backend selection logic | 
`sia_code/storage/factory.py` | | Memory commands and sync | `sia_code/memory/git_sync.py`, `sia_code/memory/git_events.py`, `sia_code/cli.py` | diff --git a/docs/LLM_CLI_INTEGRATION.md b/docs/LLM_CLI_INTEGRATION.md index 1d999d8..ec51a7b 100644 --- a/docs/LLM_CLI_INTEGRATION.md +++ b/docs/LLM_CLI_INTEGRATION.md @@ -30,6 +30,7 @@ Load skill sia-code ## 3) Recommended agent workflow ```bash +uv tool install chunkhound uvx sia-code status uvx sia-code init uvx sia-code index . @@ -37,14 +38,21 @@ uvx sia-code search --regex "your symbol" uvx sia-code research "how does X work?" ``` +Notes: + +- `search` and `research` are ChunkHound-backed. +- Memory commands stay in Sia's local memory database. + ## 4) Optional memory workflow ```bash -uvx sia-code memory sync-git +uvx sia-code memory sync-git --limit 0 uvx sia-code memory search "topic" uvx sia-code memory add-decision "Decision title" -d "Context" -r "Reason" ``` +`memory sync-git` also derives changelog entries from merge commits whose subject matches `Merge branch '...'`. 
+ ## 5) Multiple worktrees / multiple Claude Code instances Use one of these index strategies per session: diff --git a/docs/MEMORY_FEATURES.md b/docs/MEMORY_FEATURES.md index f7aa19b..c81b807 100644 --- a/docs/MEMORY_FEATURES.md +++ b/docs/MEMORY_FEATURES.md @@ -31,6 +31,7 @@ sia-code memory search "Adopt X" --type decision - Tags become changelog memory entries - Merge commits become timeline memory events +- Merge commits whose subject matches `Merge branch '...'` also become changelog entries - Each event captures changed files and diff stats - Duplicate events are skipped automatically @@ -69,6 +70,11 @@ Notes: | `memory changelog` | render changelog text/json/markdown | | `memory export` / `memory import` | backup/restore memory data | +Limit behavior: + +- `memory sync-git --limit 0` processes all eligible events +- `memory list --limit 0`, `memory timeline --limit 0`, and `memory changelog --limit 0` return all rows + ## Good Practices - Add decisions with explicit `description` and `reasoning`. diff --git a/docs/PERFORMANCE_ANALYSIS.md b/docs/PERFORMANCE_ANALYSIS.md index ef32e5c..d6ac358 100644 --- a/docs/PERFORMANCE_ANALYSIS.md +++ b/docs/PERFORMANCE_ANALYSIS.md @@ -3,18 +3,18 @@ ## Typical Expectations - `search --regex`: usually lowest-latency mode -- hybrid `search`: additional semantic overhead +- `search --semantic-only`: usually higher latency than regex - `index --update`: much faster than full rebuild for small changes -Actual speed depends on repo size, hardware, and embedding configuration. +Actual speed depends on repo size, hardware, and ChunkHound semantic/provider setup. ## Quick Optimization Checklist 1. Use `sia-code index --update` for daily work 2. Use `--regex` for symbol/identifier lookup -3. Add `--no-deps` to reduce large dependency noise +3. Use tighter regex terms (or include path-like hints) to reduce noise 4. Use `--parallel` for large initial indexing runs -5. Start embed daemon when doing repeated semantic/hybrid queries +5. 
Start embed daemon when doing repeated memory embedding operations ## Useful Commands @@ -28,8 +28,8 @@ sia-code search --regex "pattern" ## Bottleneck Hints - Slow index build: reduce indexed scope or enable parallel workers -- Slow semantic/hybrid queries: ensure embed daemon is healthy -- Noisy result set: use dependency filters (`--no-deps` / `--deps-only`) +- Slow semantic queries: verify ChunkHound provider setup and model/network health +- Noisy result set: narrow regex terms and include path-like query hints ## Related Docs diff --git a/docs/QUERYING.md b/docs/QUERYING.md index 6c58264..f0f5e47 100644 --- a/docs/QUERYING.md +++ b/docs/QUERYING.md @@ -3,49 +3,50 @@ ## Search Commands ```bash -# default hybrid +# default mode from config (ChunkHound-backed; default is regex) sia-code search "authentication flow" # lexical / symbol-heavy sia-code search --regex "AuthService|token" -# semantic only +# semantic only (requires embedding setup) sia-code search --semantic-only "handle login failures" ``` ## Useful Flags - `-k, --limit `: number of results -- `--no-deps`: only project code -- `--deps-only`: only dependency code -- `--no-filter`: include stale chunks +- `--no-deps`: accepted for compatibility (currently no-op) +- `--deps-only`: accepted for compatibility (currently no-op) +- `--no-filter`: accepted for compatibility (currently no-op) - `--format text|json|table|csv` - `--output `: write results to file ## Multi-Hop Research ```bash -sia-code research "how does auth middleware work?" --hops 3 --graph +sia-code research "how does auth middleware work?" ``` Use this for architecture tracing, call-path discovery, and unfamiliar code. +Compatibility flags for `research` (`--hops`, `--graph`, `--limit`, `--no-filter`) are accepted by Sia and ignored by ChunkHound. 
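For scripts and agents, `--format json` output can be consumed along these lines. The `{"query", "mode", "results"}` shape and per-result `chunk` keys mirror the CLI's JSON/CSV rendering code; the `top_locations` helper and the sample payload here are illustrative, not real output.

```python
import json

def top_locations(payload: str, k: int = 3) -> list[str]:
    # Return "file:start-end" strings for the top-k search results.
    data = json.loads(payload)
    locations = []
    for item in data.get("results", [])[:k]:
        chunk = item.get("chunk", {})
        locations.append(
            f"{chunk.get('file_path')}:{chunk.get('start_line')}-{chunk.get('end_line')}"
        )
    return locations

# Hypothetical payload in the shape produced by `sia-code search --format json`.
sample = json.dumps({
    "query": "AuthService",
    "mode": "regex",
    "results": [
        {"chunk": {"file_path": "auth/service.py", "start_line": 10,
                   "end_line": 42, "symbol": "AuthService"}, "score": 0.91},
    ],
})
print(top_locations(sample))  # ['auth/service.py:10-42']
```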
+ ## Practical Tuning -- `search.vector_weight = 0.0` => lexical-heavy behavior -- `search.vector_weight = 1.0` => semantic-heavy behavior +- `chunkhound.default_search_mode = regex|semantic` - defaults come from `.sia-code/config.json` ```bash -sia-code config get search.vector_weight -sia-code config set search.vector_weight 0.0 +sia-code config get chunkhound.default_search_mode +sia-code config set chunkhound.default_search_mode semantic ``` ## Output Tips - Use `--format json` for scripts/agents. - Use `--format table` for quick terminal scanning. -- Use `--no-deps` in large repos to reduce noise. +- Use tighter regex terms or path-like query text when results are noisy. ## Related Docs diff --git a/sia_code/cli.py b/sia_code/cli.py index 528d9d9..0c64ed7 100644 --- a/sia_code/cli.py +++ b/sia_code/cli.py @@ -22,6 +22,16 @@ from . import __version__ from .config import Config from .indexer.coordinator import IndexingCoordinator +from .search.chunkhound_cli import ( + build_index_command, + build_research_command, + build_search_command, + chunkhound_db_path, + parse_search_output, + resolve_search_mode, + research_needs_llm_fallback, + run_chunkhound_command, +) console = Console() @@ -519,6 +529,29 @@ def update_progress(stage: str, current: int, total: int, desc: str): # Close backend to persist vectors to disk backend.close() + # Keep ChunkHound index in sync for search/research commands + chunkhound_command = build_index_command( + config=config, + project_path=directory, + db_path=chunkhound_db_path(sia_dir, config), + force_reindex=clean, + ) + chunkhound_result = run_chunkhound_command( + chunkhound_command, + cwd=Path("."), + capture_output=True, + ) + if chunkhound_result.returncode != 0: + console.print("[red]ChunkHound indexing failed[/red]") + if chunkhound_result.stdout: + print(chunkhound_result.stdout, end="") + if chunkhound_result.stderr: + print(chunkhound_result.stderr, end="", file=sys.stderr) + sys.exit(chunkhound_result.returncode) + 
console.print( + f"[dim]ChunkHound index synced at {chunkhound_db_path(sia_dir, config)}[/dim]" + ) + # Auto-sync git history (unless disabled or in watch mode) if not no_git_sync and not watch: try: @@ -624,6 +657,25 @@ def reindex(self): f"[green]✓[/green] Re-indexed {stats['files_indexed']} files, {stats['chunks_indexed']} chunks" ) + # Sync ChunkHound index for watch-mode updates + chunkhound_command = build_index_command( + config=config, + project_path=Path(path), + db_path=chunkhound_db_path(sia_dir, config), + force_reindex=False, + ) + chunkhound_result = run_chunkhound_command( + chunkhound_command, + cwd=Path("."), + capture_output=True, + ) + if chunkhound_result.returncode == 0: + console.print("[green]✓[/green] ChunkHound index synced") + else: + console.print("[yellow]ChunkHound sync failed during watch update[/yellow]") + if chunkhound_result.stderr: + console.print(f"[dim]{chunkhound_result.stderr.strip()}[/dim]") + except Exception as e: console.print(f"[red]Error during re-indexing: {e}[/red]") finally: @@ -677,181 +729,87 @@ def search( output_format: str, output: str | None, ): - """Search the codebase (default: hybrid BM25 + semantic).""" - from .indexer.chunk_index import ChunkIndex + """Search the codebase via ChunkHound CLI.""" + import csv + import io + import json sia_dir, config = require_initialized() - # Load chunk index for filtering (if available and not disabled) - valid_chunks = None - if not no_filter: - chunk_index_path = sia_dir / "chunk_index.json" - if chunk_index_path.exists(): - try: - chunk_index = ChunkIndex(chunk_index_path) - valid_chunks = chunk_index.get_valid_chunks() - except Exception: - pass # Silently fall back to no filtering - - # Handle mutually exclusive dependency flags - if no_deps and deps_only: - console.print("[red]Error: --no-deps and --deps-only are mutually exclusive[/red]") - sys.exit(1) - - backend = create_backend(sia_dir, config, valid_chunks=valid_chunks) - backend.open_index() - - # Determine 
dependency filtering - # Default: include deps (from config or True) - # --no-deps: exclude deps - # --deps-only: show only deps (include_deps=True, then filter results) - include_deps = not no_deps # Exclude deps if --no-deps is set - tier_boost = config.search.tier_boost if hasattr(config.search, "tier_boost") else None - - # Determine search mode (NEW: hybrid by default) - if regex: - mode = "lexical" - elif semantic_only: - mode = "semantic" - else: - mode = "hybrid" # NEW DEFAULT: BM25 + semantic - - filter_status = "" if no_filter or not valid_chunks else " [filtered]" - deps_status = " [no-deps]" if no_deps else " [deps-only]" if deps_only else "" - - # Suppress progress messages for structured output formats - if output_format not in ("json", "csv"): - console.print(f"[dim]Searching ({mode}{filter_status}{deps_status})...[/dim]") - - # Execute search based on mode - if regex: - results = backend.search_lexical( - query, k=limit, include_deps=include_deps, tier_boost=tier_boost - ) - elif semantic_only: - results = backend.search_semantic( - query, k=limit, include_deps=include_deps, tier_boost=tier_boost - ) - else: - # NEW: Hybrid search (BM25 + semantic) for best performance - results = backend.search_hybrid( - query, - k=limit, - vector_weight=config.search.vector_weight, - include_deps=include_deps, - tier_boost=tier_boost, - ) + if no_deps: + console.print("[yellow]Note:[/yellow] --no-deps is ignored by ChunkHound-backed search") + if deps_only: + console.print("[yellow]Note:[/yellow] --deps-only is ignored by ChunkHound-backed search") + if no_filter: + console.print("[dim]Note: --no-filter has no effect with ChunkHound-backed search[/dim]") + + mode = resolve_search_mode(config, regex=regex, semantic_only=semantic_only) + db_path = chunkhound_db_path(sia_dir, config) + + command = build_search_command( + config=config, + query=query, + project_path=Path("."), + db_path=db_path, + mode=mode, + limit=limit, + ) - # Filter for --deps-only after search - 
if deps_only and results: - results = [r for r in results if r.chunk.metadata.get("tier") == "dependency"] + result = run_chunkhound_command(command, cwd=Path("."), capture_output=True) + + # Graceful semantic->regex fallback for embedding-misconfigured repos + if result.returncode != 0 and mode == "semantic": + combined = f"{result.stdout}\n{result.stderr}".lower() + if "no embedding providers available" in combined: + console.print("[yellow]Semantic search unavailable; retrying with regex mode.[/yellow]") + mode = "regex" + command = build_search_command( + config=config, + query=query, + project_path=Path("."), + db_path=db_path, + mode=mode, + limit=limit, + ) + result = run_chunkhound_command(command, cwd=Path("."), capture_output=True) - if not results: - # Handle empty results based on output format - if output_format == "json": - import json + if result.returncode != 0: + if result.stdout: + print(result.stdout, end="") + if result.stderr: + print(result.stderr, end="", file=sys.stderr) + sys.exit(result.returncode) - empty_output = {"query": query, "mode": mode, "results": []} - print(json.dumps(empty_output, indent=2)) - elif output_format == "csv": - # CSV header only for empty results - print("File,Start Line,End Line,Symbol,Score,Preview") - else: - console.print("[yellow]No results found[/yellow]") - return + parsed = parse_search_output(result.stdout, query=query, mode=mode) - # Format results based on output_format if output_format == "json": - import json - - output_data = {"query": query, "mode": mode, "results": [r.to_dict() for r in results]} - formatted_output = json.dumps(output_data, indent=2) + rendered = json.dumps(parsed, indent=2) elif output_format == "csv": - import csv - import io - - csv_buffer = io.StringIO() - csv_writer = csv.writer(csv_buffer) - # Write header - csv_writer.writerow(["File", "Start Line", "End Line", "Symbol", "Score", "Preview"]) - # Write rows - for result in results: - chunk = result.chunk - preview = 
(result.snippet or chunk.code)[:100].replace("\n", " ").replace("\r", "") - csv_writer.writerow( + buffer = io.StringIO() + writer = csv.writer(buffer) + writer.writerow(["File", "Start Line", "End Line", "Symbol", "Score", "Preview"]) + for item in parsed["results"]: + chunk = item["chunk"] + snippet = (item.get("snippet") or chunk.get("code") or "").replace("\n", " ") + writer.writerow( [ - chunk.file_path, - chunk.start_line, - chunk.end_line, - chunk.symbol, - f"{result.score:.3f}", - preview, + chunk.get("file_path", ""), + chunk.get("start_line", ""), + chunk.get("end_line", ""), + chunk.get("symbol", ""), + f"{item.get('score', 0.0):.3f}", + snippet[:120], ] ) - formatted_output = csv_buffer.getvalue() - elif output_format == "table": - table = Table(title=f"Search Results: {query}") - table.add_column("File", style="cyan") - table.add_column("Line", style="dim") - table.add_column("Symbol", style="bold") - table.add_column("Score", justify="right") - table.add_column("Preview", style="dim") - - for result in results: - chunk = result.chunk - preview = (result.snippet or chunk.code)[:80].replace("\n", " ") - table.add_row( - str(chunk.file_path), - f"{chunk.start_line}-{chunk.end_line}", - chunk.symbol, - f"{result.score:.3f}", - preview + "..." if len(preview) == 80 else preview, - ) - formatted_output = table - else: # text format (default) - formatted_output = None - for i, result in enumerate(results, 1): - chunk = result.chunk - console.print(f"\n[bold cyan]{i}. 
{chunk.symbol}[/bold cyan]") - console.print(f"[dim]{chunk.file_path}:{chunk.start_line}-{chunk.end_line}[/dim]") - console.print(f"Score: {result.score:.3f}") - if result.snippet: - console.print(f"\n{result.snippet}\n") - - # Save to file or print to console + rendered = buffer.getvalue() + else: + rendered = result.stdout + if output: - try: - output_path = Path(output) - if output_format == "json" or output_format == "csv": - assert isinstance(formatted_output, str) - output_path.write_text(formatted_output) - elif output_format == "table": - from rich.console import Console as FileConsole - - with open(output_path, "w") as f: - file_console = FileConsole(file=f, width=120) - file_console.print(formatted_output) - else: # text format - # Re-format as plain text for file output - lines = [] - for i, result in enumerate(results, 1): - chunk = result.chunk - lines.append(f"{i}. {chunk.symbol}") - lines.append(f" {chunk.file_path}:{chunk.start_line}-{chunk.end_line}") - lines.append(f" Score: {result.score:.3f}") - if result.snippet: - lines.append(f"\n{result.snippet}\n") - output_path.write_text("\n".join(lines)) - console.print(f"[green]✓[/green] Results saved to {output}") - except Exception as e: - console.print(f"[red]Error saving to file: {e}[/red]") - sys.exit(1) - elif formatted_output is not None: - if output_format == "json" or output_format == "csv": - # Use print() for JSON/CSV to avoid rich console formatting - print(formatted_output) - else: # table - console.print(formatted_output) + Path(output).write_text(rendered) + console.print(f"[green]✓[/green] Results saved to {output}") + else: + print(rendered, end="" if rendered.endswith("\n") else "\n") @main.command() @@ -999,89 +957,51 @@ def interactive(regex: bool, limit: int): @click.option("-k", "--limit", type=int, default=5, help="Results per hop") @click.option("--no-filter", is_flag=True, help="Disable stale chunk filtering") def research(question: str, hops: int, graph: bool, limit: int, 
     no_filter: bool):
-    """Multi-hop code research for architectural questions.
-
-    Automatically discovers code relationships and builds a complete picture.
-
-    Examples:
-        sia-code research "How does authentication work?"
-        sia-code research "What calls the indexer?" --graph
-        sia-code research "How is configuration loaded?" --hops 3
-    """
-    from .indexer.chunk_index import ChunkIndex
-    from .search.multi_hop import MultiHopSearchStrategy
-
+    """Run architecture research via ChunkHound CLI."""
     sia_dir, config = require_initialized()

-    # Load chunk index for filtering (if available and not disabled)
-    valid_chunks = None
-    if not no_filter:
-        chunk_index_path = sia_dir / "chunk_index.json"
-        if chunk_index_path.exists():
-            try:
-                chunk_index = ChunkIndex(chunk_index_path)
-                valid_chunks = chunk_index.get_valid_chunks()
-            except Exception:
-                pass  # Silently fall back to no filtering
-
-    backend = create_backend(sia_dir, config, valid_chunks=valid_chunks)
-    backend.open_index()
-
-    strategy = MultiHopSearchStrategy(backend, max_hops=hops)
-
-    console.print(f"[dim]Researching: {question}[/dim]")
-    console.print(f"[dim]Max hops: {hops}, Results per hop: {limit}[/dim]\n")
-
-    with Progress(
-        SpinnerColumn(), TextColumn("[progress.description]{task.description}"), console=console
-    ) as progress:
-        task = progress.add_task("Analyzing code relationships...", total=None)
-        result = strategy.research(question, max_results_per_hop=limit)
-        progress.update(task, completed=True)
-
-    # Display results summary
-    console.print("\n[bold green]✓ Research Complete[/bold green]")
-    console.print(f"  Found: {len(result.chunks)} related code chunks")
-    console.print(f"  Relationships: {len(result.relationships)}")
-    console.print(f"  Entities discovered: {result.total_entities_found}")
-    console.print(f"  Hops executed: {result.hops_executed}/{hops}\n")
-
-    if not result.chunks:
-        console.print("[yellow]No relevant code found. Try rephrasing your question.[/yellow]")
-        return
-
-    # Display top chunks
-    console.print("[bold]Top Related Code:[/bold]\n")
-    for i, chunk in enumerate(result.chunks[:10], 1):
-        console.print(f"{i}. [cyan]{chunk.symbol}[/cyan]")
-        console.print(f"   {chunk.file_path}:{chunk.start_line}-{chunk.end_line}")
-        if i <= 3:  # Show code preview for top 3
-            preview = chunk.code[:200].replace("\n", "\n   ")
-            console.print(f"   [dim]{preview}...[/dim]")
-        console.print()
-
-    # Show call graph if requested
-    if graph and result.relationships:
-        call_graph = strategy.build_call_graph(result.relationships)
-        entry_points = strategy.get_entry_points(result.relationships)
-
-        console.print("\n[bold]Call Graph:[/bold]\n")
+    if hops != 2:
+        console.print("[dim]Note: --hops is accepted for compatibility but ignored.[/dim]")
+    if graph:
+        console.print("[dim]Note: --graph is accepted for compatibility but ignored.[/dim]")
+    if no_filter:
+        console.print("[dim]Note: --no-filter has no effect with ChunkHound-backed research.[/dim]")
+    if limit != 5:
+        console.print("[dim]Note: --limit is accepted for compatibility but ignored.[/dim]")
+
+    db_path = chunkhound_db_path(sia_dir, config)
+    command = build_research_command(
+        config=config,
+        question=question,
+        project_path=Path("."),
+        db_path=db_path,
+    )
+    result = run_chunkhound_command(command, cwd=Path("."), capture_output=True)

-        if entry_points:
-            console.print("[dim]Entry points:[/dim]")
-            for entry in entry_points[:5]:
-                console.print(f"  [green]→ {entry}[/green]")
-            console.print()
+    if result.returncode != 0:
+        combined = f"{result.stdout}\n{result.stderr}"
+        if config.chunkhound.research_fallback_to_regex and research_needs_llm_fallback(combined):
+            console.print(
+                "[yellow]ChunkHound research requires LLM config; falling back to regex search.[/yellow]"
+            )
+            fallback_command = build_search_command(
+                config=config,
+                query=question,
+                project_path=Path("."),
+                db_path=db_path,
+                mode="regex",
+                limit=limit,
+            )
+            result = run_chunkhound_command(fallback_command, cwd=Path("."), capture_output=True)

-        console.print("[dim]Relationships:[/dim]")
-        for entity, targets in list(call_graph.items())[:15]:
-            console.print(f"  {entity}")
-            for target in targets[:3]:
-                rel_type = target["type"].replace("_", " ")
-                console.print(f"    [dim]{rel_type}[/dim] → {target['target']}")
+    if result.returncode != 0:
+        if result.stdout:
+            print(result.stdout, end="")
+        if result.stderr:
+            print(result.stderr, end="", file=sys.stderr)
+        sys.exit(result.returncode)

-        if len(call_graph) > 15:
-            console.print(f"\n    [dim]... and {len(call_graph) - 15} more entities[/dim]")
+    print(result.stdout, end="" if result.stdout.endswith("\n") else "\n")


 @main.command()
@@ -1431,7 +1351,7 @@ def memory():
 @memory.command(name="sync-git")
 @click.option("--since", default="HEAD~100", help="Git ref to start from (e.g., v1.0.0, HEAD~50)")
-@click.option("--limit", type=int, default=50, help="Maximum events to process")
+@click.option("--limit", type=int, default=0, help="Maximum events to process (0 means all)")
 @click.option("--dry-run", is_flag=True, help="Preview without importing")
 @click.option("--tags-only", is_flag=True, help="Only scan tags, skip merge commits")
 @click.option("--merges-only", is_flag=True, help="Only scan merge commits, skip tags")
@@ -1464,9 +1384,10 @@ def memory_sync_git(since, limit, dry_run, tags_only, merges_only, min_importanc
     console.print(f"[cyan]Syncing git history from {since}...[/cyan]\n")

     sync_service = GitSyncService(backend, Path("."))
+    effective_limit = None if limit <= 0 else limit
     stats = sync_service.sync(
         since=since,
-        limit=limit,
+        limit=effective_limit,
         dry_run=dry_run,
         tags_only=tags_only,
         merges_only=merges_only,
@@ -1562,7 +1483,7 @@ def memory_add_decision(title, description, reasoning, alternatives):
     default="all",
     help="Filter decisions by status",
 )
-@click.option("--limit", type=int, default=20, help="Maximum items to show")
+@click.option("--limit", type=int, default=20,
help="Maximum items to show (0 means all)")
 @click.option(
     "--format",
     "output_format",
@@ -1578,24 +1499,26 @@ def memory_list(item_type, status, limit, output_format):
     try:
         results = {"decisions": [], "timeline": [], "changelogs": []}
+        effective_limit = None if limit <= 0 else limit

         # Fetch decisions
         if item_type in ("decision", "all"):
             if status == "pending":
-                results["decisions"] = backend.list_pending_decisions(limit=limit)
+                results["decisions"] = backend.list_pending_decisions(limit=effective_limit)
             else:
                 # Get all decisions (pending + approved)
-                results["decisions"] = backend.list_pending_decisions(limit=limit * 2)
+                expanded_limit = None if effective_limit is None else effective_limit * 2
+                results["decisions"] = backend.list_pending_decisions(limit=expanded_limit)
                 if status != "all":
                     results["decisions"] = [d for d in results["decisions"] if d.status == status]

         # Fetch timeline events
         if item_type in ("timeline", "all"):
-            results["timeline"] = backend.get_timeline_events(limit=limit)
+            results["timeline"] = backend.get_timeline_events(limit=effective_limit)

         # Fetch changelogs
         if item_type in ("changelog", "all"):
-            results["changelogs"] = backend.get_changelogs(limit=limit)
+            results["changelogs"] = backend.get_changelogs(limit=effective_limit)

         # Output
         if output_format == "json":
@@ -1783,7 +1706,8 @@ def memory_search(query, search_type, limit):
     default="text",
     help="Output format",
 )
-def memory_timeline(since, event_type, importance, output_format):
+@click.option("--limit", type=int, default=0, help="Maximum events to show (0 means all)")
+def memory_timeline(since, event_type, importance, output_format, limit):
     """Show project timeline events.

     Example: sia-code memory timeline --format markdown --importance high
@@ -1793,7 +1717,7 @@ def memory_timeline(since, event_type, importance, output_format):
     backend.open_index()

     try:
-        events = backend.get_timeline_events(limit=100)
+        events = backend.get_timeline_events(limit=None if limit <= 0 else limit)

         # Apply filters
         if event_type:
@@ -1864,8 +1788,9 @@ def memory_timeline(since, event_type, importance, output_format):
     default="markdown",
     help="Output format",
 )
+@click.option("--limit", type=int, default=0, help="Maximum changelog entries (0 means all)")
 @click.option("-o", "--output", type=click.Path(), help="Save to file")
-def memory_changelog(range, output_format, output):
+def memory_changelog(range, output_format, limit, output):
     """Generate changelog from memory.

     Example: sia-code memory changelog v1.0.0..v2.0.0 --format markdown -o CHANGELOG.md
@@ -1875,7 +1800,7 @@ def memory_changelog(range, output_format, output):
     backend.open_index()

     try:
-        changelogs = backend.get_changelogs(limit=100)
+        changelogs = backend.get_changelogs(limit=None if limit <= 0 else limit)

         # Filter by range if provided
         if range:
diff --git a/sia_code/config.py b/sia_code/config.py
index 67ff4ca..5aafae4 100644
--- a/sia_code/config.py
+++ b/sia_code/config.py
@@ -149,6 +149,18 @@ class SearchConfig(BaseModel):
     include_dependencies: bool = True  # Default: deps always included in search


+class ChunkHoundConfig(BaseModel):
+    """ChunkHound CLI integration settings."""
+
+    command: str = "uvx chunkhound"
+    db_filename: str = "chunkhound.db"
+    default_search_mode: Literal["regex", "semantic"] = "regex"
+    no_embeddings_for_index: bool = True
+    no_embeddings_for_regex_search: bool = True
+    research_prompt_prefix: str = ""
+    research_fallback_to_regex: bool = True
+
+
 class DependencyConfig(BaseModel):
     """Dependency indexing configuration."""

@@ -197,6 +209,7 @@ class Config(BaseModel):
     indexing: IndexingConfig = Field(default_factory=IndexingConfig)
     chunking: ChunkingConfig =
Field(default_factory=ChunkingConfig)
     search: SearchConfig = Field(default_factory=SearchConfig)
+    chunkhound: ChunkHoundConfig = Field(default_factory=ChunkHoundConfig)

     # New configuration sections
     dependencies: DependencyConfig = Field(default_factory=DependencyConfig)
     documentation: DocumentationConfig = Field(default_factory=DocumentationConfig)
diff --git a/sia_code/memory/git_events.py b/sia_code/memory/git_events.py
index b828a38..da82ad8 100644
--- a/sia_code/memory/git_events.py
+++ b/sia_code/memory/git_events.py
@@ -9,6 +9,13 @@
 from git.exc import GitCommandError, InvalidGitRepositoryError


+def _coerce_text(value: str | bytes | Any) -> str:
+    """Normalize git message-like values into text."""
+    if isinstance(value, bytes):
+        return value.decode("utf-8", errors="replace")
+    return str(value)
+
+
 class GitEventExtractor:
     """Extract timeline events and changelogs from git repository."""

@@ -74,12 +81,14 @@ def scan_git_tags(self) -> list[dict[str, Any]]:

         return changelogs

-    def scan_merge_events(self, since: str | None = None, limit: int = 50) -> list[dict[str, Any]]:
+    def scan_merge_events(
+        self, since: str | None = None, limit: int | None = 50
+    ) -> list[dict[str, Any]]:
         """Extract merge commits as timeline events.
         Args:
             since: Git ref to start from (e.g., 'HEAD~100' or 'v1.0.0')
-            limit: Maximum number of merge events to return
+            limit: Maximum number of merge events to return (None for all)

         Returns:
             List of timeline event dictionaries
@@ -93,21 +102,23 @@ def scan_merge_events(self, since: str | None = None, limit: int = 50) -> list[d
             commit_range = "HEAD"

         try:
-            commits = list(self.repo.iter_commits(commit_range, max_count=limit * 2))
+            max_count = limit * 2 if limit is not None and limit > 0 else None
+            commits = list(self.repo.iter_commits(commit_range, max_count=max_count))
         except GitCommandError:
             # If range is invalid, just get HEAD commits
-            commits = list(self.repo.iter_commits("HEAD", max_count=limit * 2))
+            max_count = limit * 2 if limit is not None and limit > 0 else None
+            commits = list(self.repo.iter_commits("HEAD", max_count=max_count))

         for commit in commits:
             # Check if it's a merge commit (has multiple parents)
             if len(commit.parents) > 1:
                 # Get branch names from commit message
-                from_branch, to_branch = self._extract_merge_branches(commit.message)
+                from_branch, to_branch = self._extract_merge_branches(_coerce_text(commit.message))

                 # Get files changed
                 files_changed = []
                 try:
-                    files_changed = [item.a_path for item in commit.stats.files.keys()]
+                    files_changed = [str(path) for path in commit.stats.files.keys()]
                 except Exception:
                     pass
@@ -122,7 +133,7 @@ def scan_merge_events(self, since: str | None = None, limit: int = 50) -> list[d
                     "event_type": "merge",
                     "from_ref": from_branch or commit.parents[1].hexsha[:7],
                     "to_ref": to_branch or commit.parents[0].hexsha[:7],
-                    "summary": commit.summary,
+                    "summary": _coerce_text(commit.summary),
                     "files_changed": files_changed[:20],  # Limit to avoid huge lists
                     "diff_stats": diff_stats,
                     "importance": self._determine_importance(diff_stats),
@@ -134,7 +145,7 @@ def scan_merge_events(self, since: str | None = None, limit: int = 50) -> list[d
                 events.append(event)

-                if len(events) >= limit:
+                if limit is not None and limit > 0 and len(events) >= limit:
                     break

         return events
@@ -264,6 +275,10 @@ def _extract_merge_branches(self, message: str) -> tuple[str | None, str | None]

         return (None, None)

+    def is_merge_branch_message(self, message: str) -> bool:
+        """Return True when commit message follows 'Merge branch ...' pattern."""
+        return bool(re.search(r"^Merge\s+branch\s+'[^']+'", (message or "").strip()))
+
     def _determine_importance(self, diff_stats: dict[str, Any]) -> str:
         """Determine importance based on diff statistics.
@@ -296,7 +311,7 @@ def get_commits_between_tags(self, from_tag: str, to_tag: str) -> list[str]:
         try:
             commits = list(self.repo.iter_commits(f"{from_tag}..{to_tag}"))
             # Return first line of each commit message
-            return [c.message.strip().split("\n")[0] for c in commits]
+            return [_coerce_text(c.message).strip().split("\n")[0] for c in commits]
         except Exception as e:
             logger = logging.getLogger(__name__)
             logger.debug(f"Could not get commits between {from_tag} and {to_tag}: {e}")
@@ -321,7 +336,7 @@ def get_commits_in_merge(self, merge_commit) -> list[str]:
             commits = list(
                 self.repo.iter_commits(f"{base[0].hexsha}..{merge_commit.parents[1].hexsha}")
             )
-            return [c.message.strip().split("\n")[0] for c in commits]
+            return [_coerce_text(c.message).strip().split("\n")[0] for c in commits]
         except Exception as e:
             logger = logging.getLogger(__name__)
             logger.debug(f"Could not get commits for merge {merge_commit.hexsha[:7]}: {e}")
@@ -347,14 +362,14 @@ def scan_git_tags(repo_path: str | Path) -> list[dict[str, Any]]:

 def scan_merge_events(
-    repo_path: str | Path, since: str | None = None, limit: int = 50
+    repo_path: str | Path, since: str | None = None, limit: int | None = 50
 ) -> list[dict[str, Any]]:
     """Extract merge commits as timeline events.
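The `is_merge_branch_message` check added in this hunk keys on git's default merge subject line. A minimal standalone sketch of the same regex (module-level function name chosen here for illustration; the patch defines it as a method):

```python
import re

def is_merge_branch_message(message: str) -> bool:
    """True when a commit subject follows git's default "Merge branch '...'" form."""
    # Anchored search on the stripped subject; single-quoted branch name required
    return bool(re.search(r"^Merge\s+branch\s+'[^']+'", (message or "").strip()))

print(is_merge_branch_message("Merge branch 'feature/login' into main"))  # → True
print(is_merge_branch_message("Merge pull request #42 from org/repo"))    # → False
```

Note that GitHub-style "Merge pull request" subjects intentionally do not match, so squash/PR merges are not turned into changelog entries by this path.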
     Args:
         repo_path: Path to git repository
         since: Git ref to start from
-        limit: Maximum number of events
+        limit: Maximum number of events (None for all)

     Returns:
         List of timeline event dictionaries
diff --git a/sia_code/memory/git_sync.py b/sia_code/memory/git_sync.py
index 599abca..0f1c9e8 100644
--- a/sia_code/memory/git_sync.py
+++ b/sia_code/memory/git_sync.py
@@ -63,7 +63,7 @@ def summarizer(self):
     def sync(
         self,
         since: str | None = None,
-        limit: int = 50,
+        limit: int | None = 50,
         dry_run: bool = False,
         tags_only: bool = False,
         merges_only: bool = False,
@@ -73,7 +73,7 @@ def sync(
         Args:
             since: Git ref to start from (e.g., 'v1.0.0', 'HEAD~50')
-            limit: Maximum number of events to process
+            limit: Maximum number of events to process (None/0 means no limit)
             dry_run: If True, don't write to backend
             tags_only: Only process tags, skip merges
             merges_only: Only process merges, skip tags
@@ -83,6 +83,7 @@ def sync(
             Dictionary with sync statistics
         """
         stats = GitSyncStats()
+        effective_limit = limit if limit is not None and limit > 0 else None

         # Process tags as changelogs (unless merges_only)
         if not merges_only:
@@ -127,7 +128,7 @@ def sync(
                     stats.changelogs_added += 1

                     # Early exit if hit limit
-                    if stats.changelogs_added >= limit:
+                    if effective_limit is not None and stats.changelogs_added >= effective_limit:
                         break
             except Exception as e:
                 stats.errors.append(f"Error processing tags: {e}")
@@ -135,7 +136,7 @@ def sync(
         # Process merge commits as timeline events (unless tags_only)
         if not tags_only:
             try:
-                merge_events = self.extractor.scan_merge_events(since=since, limit=limit)
+                merge_events = self.extractor.scan_merge_events(since=since, limit=effective_limit)
                 for event_data in merge_events:
                     # Filter by importance
                     event_importance = event_data.get("importance", "medium")
@@ -183,8 +184,56 @@ def sync(
                     )
                     stats.timeline_added += 1

+                    # Build changelog entries from merge commits with explicit
+                    # "Merge branch ..." subject lines.
+                    if self.extractor.is_merge_branch_message(event_data.get("summary", "")):
+                        if (
+                            effective_limit is not None
+                            and stats.changelogs_added >= effective_limit
+                        ):
+                            continue
+
+                        changelog_tag = self._merge_changelog_tag(event_data)
+                        if self._is_duplicate_changelog(changelog_tag):
+                            stats.changelogs_skipped += 1
+                            continue
+
+                        merged_commits: list[str] = []
+                        merge_commit = event_data.get("merge_commit")
+                        if merge_commit is not None:
+                            merged_commits = self.extractor.get_commits_in_merge(merge_commit)
+
+                        changelog_summary = event_data.get("summary", "")
+                        if self.summarizer and merged_commits:
+                            try:
+                                changelog_summary = self.summarizer.enhance_changelog(
+                                    changelog_tag,
+                                    changelog_summary,
+                                    merged_commits,
+                                )
+                            except Exception as e:
+                                logger.debug(f"Could not enhance merge changelog: {e}")
+
+                        commit_text = "\n".join(merged_commits)
+                        breaking_changes = self.extractor._extract_breaking_changes(commit_text)
+                        features = self.extractor._extract_features(commit_text)
+                        fixes = self.extractor._extract_fixes(commit_text)
+
+                        if not dry_run:
+                            self.backend.add_changelog(
+                                tag=changelog_tag,
+                                version=None,
+                                summary=changelog_summary,
+                                breaking_changes=breaking_changes,
+                                features=features,
+                                fixes=fixes,
+                                commit_hash=event_data.get("commit_hash"),
+                                commit_time=event_data.get("commit_time"),
+                            )
+                            stats.changelogs_added += 1
+
                     # Early exit if hit limit
-                    if stats.timeline_added >= limit:
+                    if effective_limit is not None and stats.timeline_added >= effective_limit:
                         break
             except Exception as e:
                 stats.errors.append(f"Error processing merges: {e}")
@@ -201,7 +250,7 @@ def _is_duplicate_changelog(self, tag: str) -> bool:
             True if changelog with this tag exists
         """
         try:
-            existing = self.backend.get_changelogs(limit=1000)
+            existing = self.backend.get_changelogs(limit=None)
             return any(c.tag == tag for c in existing)
         except Exception:
             # If check fails, assume not duplicate to avoid data loss
@@ -219,7 +268,7 @@ def _is_duplicate_event(self, event_type: str, from_ref: str, to_ref: str) -> bo
             True if event with these attributes exists
         """
         try:
-            existing = self.backend.get_timeline_events(limit=1000)
+            existing = self.backend.get_timeline_events(limit=None)
             return any(
                 e.event_type == event_type and e.from_ref == from_ref and e.to_ref == to_ref
                 for e in existing
@@ -242,3 +291,12 @@ def _meets_importance_threshold(self, event_importance: str, min_importance: str
         event_level = importance_order.get(event_importance, 0)
         min_level = importance_order.get(min_importance, 0)
         return event_level >= min_level
+
+    def _merge_changelog_tag(self, event_data: dict[str, Any]) -> str:
+        """Build stable synthetic changelog key for merge-derived entries."""
+        commit_hash = event_data.get("commit_hash")
+        if commit_hash:
+            return f"merge:{commit_hash}"
+        return (
+            f"merge:{event_data.get('from_ref', 'unknown')}->{event_data.get('to_ref', 'unknown')}"
+        )
diff --git a/sia_code/search/chunkhound_cli.py b/sia_code/search/chunkhound_cli.py
new file mode 100644
index 0000000..c288bfe
--- /dev/null
+++ b/sia_code/search/chunkhound_cli.py
@@ -0,0 +1,206 @@
+"""ChunkHound CLI bridge for Sia search/research commands."""
+
+from __future__ import annotations
+
+import re
+import shlex
+import subprocess
+from pathlib import Path
+from typing import Any, Literal
+
+from ..config import Config
+
+
+SearchMode = Literal["regex", "semantic"]
+
+
+def chunkhound_db_path(sia_dir: Path, config: Config) -> Path:
+    """Resolve ChunkHound database path from Sia config."""
+    return sia_dir / config.chunkhound.db_filename
+
+
+def split_chunkhound_command(command: str) -> list[str]:
+    """Split configured command string into executable argv."""
+    stripped = command.strip() if command else ""
+    if not stripped:
+        stripped = "uvx chunkhound"
+    return shlex.split(stripped)
+
+
+def resolve_search_mode(config: Config, regex: bool, semantic_only: bool) -> SearchMode:
+    """Resolve target search mode from CLI flags and config defaults."""
+    if regex:
+        return "regex"
+    if semantic_only:
+        return "semantic"
+    return config.chunkhound.default_search_mode
+
+
+def build_index_command(
+    config: Config,
+    project_path: Path,
+    db_path: Path,
+    force_reindex: bool = False,
+) -> list[str]:
+    """Build chunkhound indexing command."""
+    cmd = split_chunkhound_command(config.chunkhound.command)
+    cmd.extend(["index", str(project_path), "--db", str(db_path)])
+    if config.chunkhound.no_embeddings_for_index:
+        cmd.append("--no-embeddings")
+    if force_reindex:
+        cmd.append("--force-reindex")
+    return cmd
+
+
+def build_search_command(
+    config: Config,
+    query: str,
+    project_path: Path,
+    db_path: Path,
+    mode: SearchMode,
+    limit: int,
+) -> list[str]:
+    """Build chunkhound search command."""
+    cmd = split_chunkhound_command(config.chunkhound.command)
+    cmd.extend(
+        [
+            "search",
+            query,
+            str(project_path),
+            "--db",
+            str(db_path),
+            "--page-size",
+            str(limit),
+        ]
+    )
+
+    if mode == "regex":
+        cmd.append("--regex")
+        if config.chunkhound.no_embeddings_for_regex_search:
+            cmd.append("--no-embeddings")
+    elif mode != "semantic":
+        raise ValueError(f"Unsupported search mode: {mode}")
+
+    return cmd
+
+
+def build_research_command(
+    config: Config,
+    question: str,
+    project_path: Path,
+    db_path: Path,
+) -> list[str]:
+    """Build chunkhound research command."""
+    cmd = split_chunkhound_command(config.chunkhound.command)
+    cmd.extend(["research", build_research_query(config, question), str(project_path)])
+    cmd.extend(["--db", str(db_path)])
+    return cmd
+
+
+def build_research_query(config: Config, question: str) -> str:
+    """Apply optional prompt prefix before invoking chunkhound research."""
+    prefix = config.chunkhound.research_prompt_prefix.strip()
+    if not prefix:
+        return question
+    return f"{prefix}\n\n{question}"
+
+
+def run_chunkhound_command(
+    command: list[str],
+    cwd: Path,
+    capture_output: bool = False,
+) -> subprocess.CompletedProcess[str]:
+    """Run chunkhound command."""
+    return subprocess.run(
+        command,
+        cwd=cwd,
+        text=True,
+        capture_output=capture_output,
+    )
+
+
+def parse_search_output(output: str, query: str, mode: str) -> dict[str, Any]:
+    """Parse chunkhound text search output into Sia-compatible JSON structure."""
+    results: list[dict[str, Any]] = []
+    current: dict[str, Any] | None = None
+    in_code_block = False
+    code_lines: list[str] = []
+
+    def flush_current() -> None:
+        nonlocal current
+        if not current:
+            return
+
+        file_path = current.get("file_path") or "unknown"
+        start_line = int(current.get("start_line") or 1)
+        end_line = int(current.get("end_line") or start_line)
+        snippet = (current.get("snippet") or "").strip()
+        rank = int(current.get("rank") or (len(results) + 1))
+
+        results.append(
+            {
+                "chunk": {
+                    "symbol": Path(file_path).stem,
+                    "start_line": start_line,
+                    "end_line": end_line,
+                    "code": snippet,
+                    "chunk_type": "unknown",
+                    "language": "unknown",
+                    "file_path": file_path,
+                    "file_id": None,
+                    "id": None,
+                    "parent_header": None,
+                    "metadata": {"source": "chunkhound-cli"},
+                },
+                "score": max(0.0, 1.0 - (rank - 1) * 0.01),
+                "snippet": snippet or None,
+                "highlights": [],
+            }
+        )
+        current = None
+
+    for raw_line in output.splitlines():
+        line = raw_line.rstrip("\n")
+        stripped = line.strip()
+
+        if stripped.startswith("```"):
+            if in_code_block:
+                if current is not None:
+                    current["snippet"] = "\n".join(code_lines).strip()
+                code_lines = []
+                in_code_block = False
+            else:
+                in_code_block = True
+                code_lines = []
+            continue
+
+        if in_code_block:
+            code_lines.append(line)
+            continue
+
+        match = re.match(r"^\[(\d+)\]\s+(.+)$", stripped)
+        if match:
+            flush_current()
+            current = {
+                "rank": int(match.group(1)),
+                "file_path": match.group(2).strip(),
+                "start_line": None,
+                "end_line": None,
+                "snippet": "",
+            }
+            continue
+
+        if current is not None:
+            line_match = re.search(r"Lines\s+(\d+)(?:-(\d+))?", stripped)
+            if line_match:
+                current["start_line"] = int(line_match.group(1))
+                current["end_line"] = int(line_match.group(2) or line_match.group(1))
+
+    flush_current()
+    return {"query": query, "mode": mode, "results": results}
+
+
+def research_needs_llm_fallback(output_text: str) -> bool:
+    """Detect known chunkhound LLM setup errors for graceful fallback."""
+    lowered = output_text.lower()
+    return "configure an llm provider" in lowered or "llm provider setup failed" in lowered
diff --git a/sia_code/storage/base.py b/sia_code/storage/base.py
index b35f050..12e2eed 100644
--- a/sia_code/storage/base.py
+++ b/sia_code/storage/base.py
@@ -195,11 +195,11 @@ def reject_decision(self, decision_id: int) -> None:
         ...

     @abstractmethod
-    def list_pending_decisions(self, limit: int = 20) -> list[Decision]:
+    def list_pending_decisions(self, limit: int | None = 20) -> list[Decision]:
         """List oldest pending decisions for review.

         Args:
-            limit: Maximum number of decisions to return
+            limit: Maximum number of decisions to return (None for all)

         Returns:
             List of pending decisions, oldest first
@@ -284,14 +284,14 @@ def add_changelog(

     @abstractmethod
     def get_timeline_events(
-        self, from_ref: str | None = None, to_ref: str | None = None, limit: int = 20
+        self, from_ref: str | None = None, to_ref: str | None = None, limit: int | None = 20
     ) -> list[TimelineEvent]:
         """Get timeline events.

         Args:
             from_ref: Filter by starting ref
             to_ref: Filter by ending ref
-            limit: Maximum number of events to return
+            limit: Maximum number of events to return (None for all)

         Returns:
             List of timeline events
@@ -299,11 +299,11 @@ def get_timeline_events(
         ...

     @abstractmethod
-    def get_changelogs(self, limit: int = 20) -> list[ChangelogEntry]:
+    def get_changelogs(self, limit: int | None = 20) -> list[ChangelogEntry]:
         """Get changelog entries.
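The configured launcher string (`chunkhound.command`, default `uvx chunkhound`) is turned into an argv list with `shlex.split`, so a user can point it at any equivalent invocation. A standalone sketch of that splitting behavior (function name here is illustrative; the patch calls it `split_chunkhound_command`):

```python
import shlex

def split_command(command: str) -> list[str]:
    """Split a configured command string into argv, defaulting to "uvx chunkhound"."""
    stripped = command.strip() if command else ""
    # Empty/blank config falls back to the default launcher
    return shlex.split(stripped or "uvx chunkhound")

print(split_command("uvx chunkhound"))        # → ['uvx', 'chunkhound']
print(split_command(""))                      # → ['uvx', 'chunkhound']
print(split_command("python -m chunkhound"))  # → ['python', '-m', 'chunkhound']
```

Using `shlex.split` (rather than `str.split`) means quoted segments in the configured command survive as single argv entries.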
         Args:
-            limit: Maximum number of entries to return
+            limit: Maximum number of entries to return (None for all)

         Returns:
             List of changelog entries, newest first
diff --git a/sia_code/storage/sqlite_vec_backend.py b/sia_code/storage/sqlite_vec_backend.py
index fec85c9..0457282 100644
--- a/sia_code/storage/sqlite_vec_backend.py
+++ b/sia_code/storage/sqlite_vec_backend.py
@@ -1549,11 +1549,11 @@ def reject_decision(self, decision_id: int) -> None:
         )
         self.conn.commit()

-    def list_pending_decisions(self, limit: int = 20) -> list[Decision]:
+    def list_pending_decisions(self, limit: int | None = 20) -> list[Decision]:
         """List oldest pending decisions for review.

         Args:
-            limit: Maximum number to return
+            limit: Maximum number to return (None for all)

         Returns:
             List of pending decisions, oldest first
@@ -1562,17 +1562,18 @@ def list_pending_decisions(self, limit: int = 20) -> list[Decision]:
             raise RuntimeError("Index not initialized")

         cursor = self.conn.cursor()
-        cursor.execute(
-            """
-            SELECT id, session_id, title, description, reasoning, alternatives,
+        query = """
+            SELECT id, session_id, title, description, reasoning, alternatives,
                    status, category, commit_hash, commit_time, created_at, approved_at
             FROM decisions
             WHERE status = 'pending'
             ORDER BY created_at ASC
-            LIMIT ?
-            """,
-            (limit,),
-        )
+        """
+        params: list[Any] = []
+        if limit is not None and limit > 0:
+            query += " LIMIT ?"
+            params.append(limit)
+        cursor.execute(query, params)

         decisions = []
         for row in cursor.fetchall():
@@ -1766,14 +1767,14 @@ def add_changelog(
         return changelog_id

     def get_timeline_events(
-        self, from_ref: str | None = None, to_ref: str | None = None, limit: int = 20
+        self, from_ref: str | None = None, to_ref: str | None = None, limit: int | None = 20
     ) -> list[TimelineEvent]:
         """Get timeline events.

         Args:
             from_ref: Filter by starting ref
             to_ref: Filter by ending ref
-            limit: Maximum number to return
+            limit: Maximum number to return (None for all)

         Returns:
             List of timeline events
@@ -1795,19 +1796,18 @@ def get_timeline_events(
             params.append(to_ref)

         where_clause = f"WHERE {' AND '.join(conditions)}" if conditions else ""
-        params.append(limit)
-
-        cursor.execute(
-            f"""
+        query = f"""
             SELECT id, event_type, from_ref, to_ref, summary, files_changed,
                    diff_stats, importance, commit_hash, commit_time, created_at
             FROM timeline
             {where_clause}
             ORDER BY created_at DESC
-            LIMIT ?
-            """,
-            params,
-        )
+        """
+        if limit is not None and limit > 0:
+            query += " LIMIT ?"
+            params.append(limit)
+
+        cursor.execute(query, params)

         events = []
         for row in cursor.fetchall():
@@ -1833,11 +1833,11 @@ def get_timeline_events(
         return events

-    def get_changelogs(self, limit: int = 20) -> list[ChangelogEntry]:
+    def get_changelogs(self, limit: int | None = 20) -> list[ChangelogEntry]:
         """Get changelog entries.

         Args:
-            limit: Maximum number to return
+            limit: Maximum number to return (None for all)

         Returns:
             List of changelog entries, newest first
@@ -1846,16 +1846,17 @@ def get_changelogs(self, limit: int = 20) -> list[ChangelogEntry]:
             raise RuntimeError("Index not initialized")

         cursor = self.conn.cursor()
-        cursor.execute(
-            """
+        query = """
             SELECT id, tag, version, date, summary, breaking_changes,
                    features, fixes, commit_hash, commit_time, created_at
             FROM changelogs
             ORDER BY date DESC
-            LIMIT ?
-            """,
-            (limit,),
-        )
+        """
+        params: list[Any] = []
+        if limit is not None and limit > 0:
+            query += " LIMIT ?"
+            params.append(limit)
+        cursor.execute(query, params)

         changelogs = []
         for row in cursor.fetchall():
@@ -2087,12 +2088,12 @@ def export_memory(
         # Timeline events
         if include_timeline:
-            timeline = self.get_timeline_events(limit=100)
+            timeline = self.get_timeline_events(limit=None)
             memory["timeline"] = [t.to_dict() for t in timeline]

         # Changelogs
         if include_changelogs:
-            changelogs = self.get_changelogs(limit=100)
+            changelogs = self.get_changelogs(limit=None)
             memory["changelogs"] = [c.to_dict() for c in changelogs]

         # Approved decisions
@@ -2124,7 +2125,7 @@ def export_memory(
         # Pending decisions (optional)
         if include_pending:
-            pending = self.list_pending_decisions(limit=100)
+            pending = self.list_pending_decisions(limit=None)
             memory["pending_decisions"] = [
                 {
                     "id": f"decision:{d.id}",
diff --git a/sia_code/storage/usearch_backend.py b/sia_code/storage/usearch_backend.py
index 8114582..22d3757 100644
--- a/sia_code/storage/usearch_backend.py
+++ b/sia_code/storage/usearch_backend.py
@@ -1472,11 +1472,11 @@ def reject_decision(self, decision_id: int) -> None:
         )
         self.conn.commit()

-    def list_pending_decisions(self, limit: int = 20) -> list[Decision]:
+    def list_pending_decisions(self, limit: int | None = 20) -> list[Decision]:
         """List oldest pending decisions for review.

         Args:
-            limit: Maximum number to return
+            limit: Maximum number to return (None for all)

         Returns:
             List of pending decisions, oldest first
@@ -1485,17 +1485,18 @@ def list_pending_decisions(self, limit: int = 20) -> list[Decision]:
             raise RuntimeError("Index not initialized")

         cursor = self.conn.cursor()
-        cursor.execute(
-            """
+        query = """
             SELECT id, session_id, title, description, reasoning, alternatives,
                    status, category, commit_hash, commit_time, created_at, approved_at
             FROM decisions
             WHERE status = 'pending'
             ORDER BY created_at ASC
-            LIMIT ?
-            """,
-            (limit,),
-        )
+        """
+        params: list[Any] = []
+        if limit is not None and limit > 0:
+            query += " LIMIT ?"
+            params.append(limit)
+        cursor.execute(query, params)

         decisions = []
         for row in cursor.fetchall():
@@ -1705,14 +1706,14 @@ def add_changelog(
         return changelog_id

     def get_timeline_events(
-        self, from_ref: str | None = None, to_ref: str | None = None, limit: int = 20
+        self, from_ref: str | None = None, to_ref: str | None = None, limit: int | None = 20
     ) -> list[TimelineEvent]:
         """Get timeline events.

         Args:
             from_ref: Filter by starting ref
             to_ref: Filter by ending ref
-            limit: Maximum number to return
+            limit: Maximum number to return (None for all)

         Returns:
             List of timeline events
@@ -1734,19 +1735,18 @@ def get_timeline_events(
             params.append(to_ref)

         where_clause = f"WHERE {' AND '.join(conditions)}" if conditions else ""
-        params.append(limit)
-
-        cursor.execute(
-            f"""
+        query = f"""
             SELECT id, event_type, from_ref, to_ref, summary, files_changed,
                    diff_stats, importance, commit_hash, commit_time, created_at
             FROM timeline
             {where_clause}
             ORDER BY created_at DESC
-            LIMIT ?
-            """,
-            params,
-        )
+        """
+        if limit is not None and limit > 0:
+            query += " LIMIT ?"
+            params.append(limit)
+
+        cursor.execute(query, params)

         events = []
         for row in cursor.fetchall():
@@ -1772,11 +1772,11 @@ def get_timeline_events(
         return events

-    def get_changelogs(self, limit: int = 20) -> list[ChangelogEntry]:
+    def get_changelogs(self, limit: int | None = 20) -> list[ChangelogEntry]:
         """Get changelog entries.

         Args:
-            limit: Maximum number to return
+            limit: Maximum number to return (None for all)

         Returns:
             List of changelog entries, newest first
@@ -1785,16 +1785,17 @@ def get_changelogs(self, limit: int = 20) -> list[ChangelogEntry]:
             raise RuntimeError("Index not initialized")

         cursor = self.conn.cursor()
-        cursor.execute(
-            """
+        query = """
             SELECT id, tag, version, date, summary, breaking_changes,
                    features, fixes, commit_hash, commit_time, created_at
             FROM changelogs
             ORDER BY date DESC
-            LIMIT ?
-            """,
-            (limit,),
-        )
+        """
+        params: list[Any] = []
+        if limit is not None and limit > 0:
+            query += " LIMIT ?"
+            params.append(limit)
+        cursor.execute(query, params)

         changelogs = []
         for row in cursor.fetchall():
@@ -2026,12 +2027,12 @@ def export_memory(
         # Timeline events
         if include_timeline:
-            timeline = self.get_timeline_events(limit=100)
+            timeline = self.get_timeline_events(limit=None)
             memory["timeline"] = [t.to_dict() for t in timeline]

         # Changelogs
         if include_changelogs:
-            changelogs = self.get_changelogs(limit=100)
+            changelogs = self.get_changelogs(limit=None)
             memory["changelogs"] = [c.to_dict() for c in changelogs]

         # Approved decisions
@@ -2063,7 +2064,7 @@ def export_memory(
         # Pending decisions (optional)
         if include_pending:
-            pending = self.list_pending_decisions(limit=100)
+            pending = self.list_pending_decisions(limit=None)
             memory["pending_decisions"] = [
                 {
                     "id": f"decision:{d.id}",
diff --git a/skills/sia-code/SKILL.md b/skills/sia-code/SKILL.md
index bbf4c7c..129fa59 100644
--- a/skills/sia-code/SKILL.md
+++ b/skills/sia-code/SKILL.md
@@ -1,9 +1,9 @@
 ---
 name: sia-code
-description: Compact local-first code search skill for CLI agents using BM25, optional semantic search, multi-hop research, and project memory.
+description: Compact local-first code search skill for CLI agents using ChunkHound-backed search/research and Sia project memory.
 license: MIT
 compatibility: opencode
-version: 0.7.0
+version: 0.7.1
 ---

 # Sia-Code Skill (Compact)
@@ -19,42 +19,58 @@ This is a compact, repo-local variant intended for easy copy/paste into LLM CLI
 uvx sia-code init
 uvx sia-code index .

-# fast lexical search (great for identifiers)
+# fast lexical search (ChunkHound-backed)
 uvx sia-code search --regex "auth|login|token"

-# architecture exploration
+# architecture exploration (ChunkHound-backed)
 uvx sia-code research "how does authentication flow work?"
# health check uvx sia-code status ``` +## Search + Research Backend + +`sia-code search` and `sia-code research` are powered by ChunkHound CLI. +Sia's own memory/decision database remains unchanged. + +Install once: + +```bash +uv tool install chunkhound +``` + ## Search Modes -- `uvx sia-code search "query"`: default hybrid search (BM25 + semantic) -- `uvx sia-code search --regex "pattern"`: lexical search only (usually best for exact symbols) -- `uvx sia-code search --semantic-only "query"`: semantic-only search +- `uvx sia-code search "query"`: default mode from config (`chunkhound.default_search_mode`) +- `uvx sia-code search --regex "pattern"`: lexical search (recommended for exact symbols) +- `uvx sia-code search --semantic-only "query"`: semantic search (requires embedding setup) -Useful flags: +Supported flags: - `-k, --limit `: result count -- `--no-deps`: project code only -- `--deps-only`: dependency code only -- `--format json|table|csv`: structured output +- `--format json|table|csv`: output shaping in Sia wrapper + +Compatibility notes (currently no-op with ChunkHound): + +- `--no-deps` +- `--deps-only` +- `--no-filter` ## Multi-Hop Research ```bash -uvx sia-code research "how is config loaded?" --hops 3 --graph +uvx sia-code research "how is config loaded?" ``` - Use for dependency tracing, call flow mapping, and architecture questions. +- `--hops`, `--graph`, and `--limit` are accepted for compatibility in Sia but ignored by ChunkHound CLI. ## Memory Workflow ```bash # import timeline/changelogs from git -uvx sia-code memory sync-git +uvx sia-code memory sync-git --limit 0 # store a pending decision uvx sia-code memory add-decision "Adopt sqlite-vec by default" \ @@ -69,6 +85,11 @@ uvx sia-code memory approve 1 --category architecture uvx sia-code memory search "backend default" --type all ``` +Notes: + +- `memory sync-git` derives changelog entries from merge commits whose subject matches `Merge branch '...'`. 
+- Use `--limit 0` when you want to process all eligible git events.
+
 ## Agent-Friendly Session Pattern
 
 ```bash
@@ -90,12 +111,12 @@ uvx sia-code memory add-decision "..." -d "..." -r "..."
 
 ## Troubleshooting
 
 - If uninitialized: run `uvx sia-code init && uvx sia-code index .`
-- If results look stale: run `uvx sia-code index --update` (or `--clean` after major refactors)
+- If results look stale: run `uvx sia-code index --update` (this also syncs the ChunkHound index)
 - If memory add/search fails with embedding issues: run `uvx sia-code embed start`
-- If too much dependency noise: add `--no-deps`
+- If ChunkHound is missing: run `uv tool install chunkhound`
 
 ## Notes
 
 - Lexical search is often strong for code due to exact identifiers.
-- Hybrid/semantic search may require embedding setup depending on configuration.
+- Semantic research/search requires ChunkHound embedding/LLM provider setup.
 - Keep this file short and operational; move deep theory to project docs.
diff --git a/tests/e2e/test_cpp_e2e.py b/tests/e2e/test_cpp_e2e.py
index 921e4bf..8bf00f1 100644
--- a/tests/e2e/test_cpp_e2e.py
+++ b/tests/e2e/test_cpp_e2e.py
@@ -33,52 +33,6 @@ def test_init_creates_index_file(self, initialized_repo):
         index_path = initialized_repo / ".sia-code" / "index.db"
         assert index_path.exists()
 
-    # ===== INDEXING TESTS =====
-
-    def test_index_full_completes_successfully(self, indexed_repo):
-        """Test that full indexing completes without errors.
-
-        Note: Uses indexed_repo fixture which already performed full indexing.
-        This test verifies the index was created successfully rather than re-indexing.
- """ - # Verify index was created - index_path = indexed_repo / ".sia-code" / "index.db" - assert index_path.exists(), "Index database not created" - assert index_path.stat().st_size > 100000, "Index appears empty or incomplete" - - # Verify index contains data by checking status - result = self.run_cli(["status"], indexed_repo) - assert result.returncode == 0, f"Status check failed: {result.stderr}" - assert "index" in result.stdout.lower() - - def test_index_reports_file_and_chunk_counts(self, indexed_repo): - """Test that status shows index information after indexing.""" - result = self.run_cli(["status"], indexed_repo) - assert result.returncode == 0 - # Check for basic index info (chunk info only shown after --update) - assert "index" in result.stdout.lower() - - def test_index_skips_excluded_patterns(self, indexed_repo): - """Test that indexing skips excluded patterns.""" - results = self.search_json(".git", indexed_repo, regex=True, limit=10) - file_paths = self.get_result_file_paths(results) - git_files = [fp for fp in file_paths if ".git/" in fp] - assert len(git_files) == 0 - - def test_index_clean_rebuilds_from_scratch(self, indexed_repo): - """Test that --clean flag rebuilds index from scratch. - - Note: This test does a full rebuild with embeddings enabled. 
- """ - result = self.run_cli(["index", "--clean", "."], indexed_repo, timeout=600) - assert result.returncode == 0 - assert "clean" in result.stdout.lower() - - def test_index_update_only_processes_changes(self, indexed_repo): - """Test that --update flag only reindexes changed files.""" - result = self.run_cli(["index", "--update", "."], indexed_repo, timeout=600) - assert result.returncode == 0 - # ===== SEARCH - LEXICAL TESTS ===== def test_search_finds_language_keyword(self, indexed_repo): @@ -135,31 +89,6 @@ def test_search_csv_output_valid(self, indexed_repo): ) assert result.returncode == 0 - # ===== RESEARCH TESTS ===== - - def test_research_finds_related_code(self, indexed_repo): - """Test that research command finds related code chunks.""" - result = self.run_cli( - ["research", "How does JSON parsing work?", "--hops", "2"], indexed_repo, timeout=600 - ) - assert result.returncode == 0 - - def test_research_respects_hop_limit(self, indexed_repo): - """Test that research respects --hops parameter.""" - result = self.run_cli( - ["research", "How does this work?", "--hops", "1"], indexed_repo, timeout=600 - ) - assert result.returncode == 0 - - def test_research_graph_shows_relationships(self, indexed_repo): - """Test that --graph flag shows code relationships.""" - result = self.run_cli( - ["research", "How does JSON parsing work?", "--hops", "2", "--graph"], - indexed_repo, - timeout=600, - ) - assert result.returncode == 0 - # ===== STATUS & MAINTENANCE ===== def test_status_shows_index_info(self, indexed_repo): diff --git a/tests/e2e/test_csharp_e2e.py b/tests/e2e/test_csharp_e2e.py index 4e3f009..a7454ee 100644 --- a/tests/e2e/test_csharp_e2e.py +++ b/tests/e2e/test_csharp_e2e.py @@ -33,52 +33,6 @@ def test_init_creates_index_file(self, initialized_repo): index_path = initialized_repo / ".sia-code" / "index.db" assert index_path.exists() - # ===== INDEXING TESTS ===== - - def test_index_full_completes_successfully(self, indexed_repo): - """Test that 
full indexing completes without errors. - - Note: Uses indexed_repo fixture which already performed full indexing. - This test verifies the index was created successfully rather than re-indexing. - """ - # Verify index was created - index_path = indexed_repo / ".sia-code" / "index.db" - assert index_path.exists(), "Index database not created" - assert index_path.stat().st_size > 100000, "Index appears empty or incomplete" - - # Verify index contains data by checking status - result = self.run_cli(["status"], indexed_repo) - assert result.returncode == 0, f"Status check failed: {result.stderr}" - assert "index" in result.stdout.lower() - - def test_index_reports_file_and_chunk_counts(self, indexed_repo): - """Test that status shows index information after indexing.""" - result = self.run_cli(["status"], indexed_repo) - assert result.returncode == 0 - # Check for basic index info (chunk info only shown after --update) - assert "index" in result.stdout.lower() - - def test_index_skips_excluded_patterns(self, indexed_repo): - """Test that indexing skips excluded patterns.""" - results = self.search_json(".git", indexed_repo, regex=True, limit=10) - file_paths = self.get_result_file_paths(results) - git_files = [fp for fp in file_paths if ".git/" in fp] - assert len(git_files) == 0 - - def test_index_clean_rebuilds_from_scratch(self, indexed_repo): - """Test that --clean flag rebuilds index from scratch. - - Note: This test does a full rebuild with embeddings enabled. 
- """ - result = self.run_cli(["index", "--clean", "."], indexed_repo, timeout=600) - assert result.returncode == 0 - assert "clean" in result.stdout.lower() - - def test_index_update_only_processes_changes(self, indexed_repo): - """Test that --update flag only reindexes changed files.""" - result = self.run_cli(["index", "--update", "."], indexed_repo, timeout=600) - assert result.returncode == 0 - # ===== SEARCH - LEXICAL TESTS ===== def test_search_finds_language_keyword(self, indexed_repo): @@ -135,31 +89,6 @@ def test_search_csv_output_valid(self, indexed_repo): ) assert result.returncode == 0 - # ===== RESEARCH TESTS ===== - - def test_research_finds_related_code(self, indexed_repo): - """Test that research command finds related code chunks.""" - result = self.run_cli( - ["research", "How does HTTP context work?", "--hops", "2"], indexed_repo, timeout=600 - ) - assert result.returncode == 0 - - def test_research_respects_hop_limit(self, indexed_repo): - """Test that research respects --hops parameter.""" - result = self.run_cli( - ["research", "How does this work?", "--hops", "1"], indexed_repo, timeout=600 - ) - assert result.returncode == 0 - - def test_research_graph_shows_relationships(self, indexed_repo): - """Test that --graph flag shows code relationships.""" - result = self.run_cli( - ["research", "How does HTTP context work?", "--hops", "2", "--graph"], - indexed_repo, - timeout=600, - ) - assert result.returncode == 0 - # ===== STATUS & MAINTENANCE ===== def test_status_shows_index_info(self, indexed_repo): diff --git a/tests/e2e/test_go_e2e.py b/tests/e2e/test_go_e2e.py index 71636d2..35cd16c 100644 --- a/tests/e2e/test_go_e2e.py +++ b/tests/e2e/test_go_e2e.py @@ -33,52 +33,6 @@ def test_init_creates_index_file(self, initialized_repo): index_path = initialized_repo / ".sia-code" / "index.db" assert index_path.exists() - # ===== INDEXING TESTS ===== - - def test_index_full_completes_successfully(self, indexed_repo): - """Test that full indexing 
completes without errors. - - Note: Uses indexed_repo fixture which already performed full indexing. - This test verifies the index was created successfully rather than re-indexing. - """ - # Verify index was created - index_path = indexed_repo / ".sia-code" / "index.db" - assert index_path.exists(), "Index database not created" - assert index_path.stat().st_size > 100000, "Index appears empty or incomplete" - - # Verify index contains data by checking status - result = self.run_cli(["status"], indexed_repo) - assert result.returncode == 0, f"Status check failed: {result.stderr}" - assert "index" in result.stdout.lower() - - def test_index_reports_file_and_chunk_counts(self, indexed_repo): - """Test that status shows index information after indexing.""" - result = self.run_cli(["status"], indexed_repo) - assert result.returncode == 0 - # Check for basic index info (chunk info only shown after --update) - assert "index" in result.stdout.lower() - - def test_index_skips_excluded_patterns(self, indexed_repo): - """Test that indexing skips excluded patterns.""" - results = self.search_json(".git", indexed_repo, regex=True, limit=10) - file_paths = self.get_result_file_paths(results) - git_files = [fp for fp in file_paths if ".git/" in fp] - assert len(git_files) == 0 - - def test_index_clean_rebuilds_from_scratch(self, indexed_repo): - """Test that --clean flag rebuilds index from scratch. - - Note: This test does a full rebuild with embeddings enabled. 
- """ - result = self.run_cli(["index", "--clean", "."], indexed_repo, timeout=600) - assert result.returncode == 0 - assert "clean" in result.stdout.lower() - - def test_index_update_only_processes_changes(self, indexed_repo): - """Test that --update flag only reindexes changed files.""" - result = self.run_cli(["index", "--update", "."], indexed_repo, timeout=600) - assert result.returncode == 0 - # ===== SEARCH - LEXICAL TESTS ===== def test_search_finds_language_keyword(self, indexed_repo): @@ -132,31 +86,6 @@ def test_search_csv_output_valid(self, indexed_repo): ) assert result.returncode == 0 - # ===== RESEARCH TESTS ===== - - def test_research_finds_related_code(self, indexed_repo): - """Test that research command finds related code chunks.""" - result = self.run_cli( - ["research", "How does the HTTP engine work?", "--hops", "2"], indexed_repo, timeout=600 - ) - assert result.returncode == 0 - - def test_research_respects_hop_limit(self, indexed_repo): - """Test that research respects --hops parameter.""" - result = self.run_cli( - ["research", "How does this work?", "--hops", "1"], indexed_repo, timeout=600 - ) - assert result.returncode == 0 - - def test_research_graph_shows_relationships(self, indexed_repo): - """Test that --graph flag shows code relationships.""" - result = self.run_cli( - ["research", "How does the HTTP engine work?", "--hops", "2", "--graph"], - indexed_repo, - timeout=600, - ) - assert result.returncode == 0 - # ===== STATUS & MAINTENANCE ===== def test_status_shows_index_info(self, indexed_repo): diff --git a/tests/e2e/test_java_e2e.py b/tests/e2e/test_java_e2e.py index a8c5d8a..99650fe 100644 --- a/tests/e2e/test_java_e2e.py +++ b/tests/e2e/test_java_e2e.py @@ -43,61 +43,6 @@ def test_init_creates_index_file(self, initialized_repo): index_path = initialized_repo / ".sia-code" / "index.db" assert index_path.exists() - # ===== INDEXING TESTS ===== - - def test_index_full_completes_successfully(self, indexed_repo): - """Test that full 
indexing completes without errors. - - Note: Uses indexed_repo fixture which already performed full indexing. - This test verifies the index was created successfully rather than re-indexing. - """ - # Verify index was created - index_path = indexed_repo / ".sia-code" / "index.db" - assert index_path.exists(), "Index database not created" - assert index_path.stat().st_size > 100000, "Index appears empty or incomplete" - - # Verify index contains data by checking status - result = self.run_cli(["status"], indexed_repo) - assert result.returncode == 0, f"Status check failed: {result.stderr}" - assert "index" in result.stdout.lower() - - def test_index_reports_file_and_chunk_counts(self, indexed_repo): - """Test that status shows index information after indexing.""" - result = self.run_cli(["status"], indexed_repo) - assert result.returncode == 0 - # Check for basic index info (chunk info only shown after --update) - assert "index" in result.stdout.lower() - - def test_index_skips_excluded_patterns(self, indexed_repo): - """Test that indexing skips excluded patterns like .git, node_modules.""" - # Check that .git directory was not indexed by searching for git-specific files - results = self.search_json("HEAD", indexed_repo, regex=True, limit=20) - - # If any results found, ensure they're not from .git directory - file_paths = self.get_result_file_paths(results) - git_files = [fp for fp in file_paths if ".git/" in fp or "\\.git\\" in fp] - assert len(git_files) == 0, f"Indexed files from .git directory: {git_files}" - - def test_index_clean_rebuilds_from_scratch(self, indexed_repo): - """Test that --clean flag rebuilds index from scratch. - - Note: This test does a full rebuild with embeddings enabled. 
- """ - result = self.run_cli(["index", "--clean", "."], indexed_repo, timeout=600) - assert result.returncode == 0 - assert "clean" in result.stdout.lower() - - def test_index_update_only_processes_changes(self, indexed_repo): - """Test that --update flag only reindexes changed files.""" - result = self.run_cli(["index", "--update", "."], indexed_repo, timeout=600) - assert result.returncode == 0 - # Should mention incremental or update - assert ( - "incremental" in result.stdout.lower() - or "update" in result.stdout.lower() - or "unchanged" in result.stdout.lower() - ) - # ===== SEARCH - LEXICAL TESTS ===== def test_search_finds_language_keyword(self, indexed_repo): @@ -164,43 +109,6 @@ def test_search_csv_output_valid(self, indexed_repo): ) assert result.returncode == 0 - # ===== RESEARCH TESTS ===== - - def test_research_finds_related_code(self, indexed_repo): - """Test that research command finds related code chunks.""" - result = self.run_cli( - ["research", "How does mocking work?", "--hops", "2", "-k", "5"], - indexed_repo, - timeout=600, - ) - assert result.returncode == 0 - # Should report findings - assert ( - "found" in result.stdout.lower() - or "chunk" in result.stdout.lower() - or "complete" in result.stdout.lower() - ) - - def test_research_respects_hop_limit(self, indexed_repo): - """Test that research respects --hops parameter.""" - result = self.run_cli( - ["research", "What is verification?", "--hops", "1"], indexed_repo, timeout=600 - ) - assert result.returncode == 0 - # Should complete with specified hop limit - assert "hop" in result.stdout.lower() or "complete" in result.stdout.lower() - - def test_research_graph_shows_relationships(self, indexed_repo): - """Test that --graph flag shows code relationships.""" - result = self.run_cli( - ["research", "How are mocks created?", "--hops", "2", "--graph"], - indexed_repo, - timeout=600, - ) - assert result.returncode == 0 - # Graph output should mention relationships or call graph - # Even if no 
relationships found, command should succeed - # ===== STATUS & MAINTENANCE ===== def test_status_shows_index_info(self, indexed_repo): diff --git a/tests/e2e/test_javascript_e2e.py b/tests/e2e/test_javascript_e2e.py index 58780f6..1b0e281 100644 --- a/tests/e2e/test_javascript_e2e.py +++ b/tests/e2e/test_javascript_e2e.py @@ -33,52 +33,6 @@ def test_init_creates_index_file(self, initialized_repo): index_path = initialized_repo / ".sia-code" / "index.db" assert index_path.exists() - # ===== INDEXING TESTS ===== - - def test_index_full_completes_successfully(self, indexed_repo): - """Test that full indexing completes without errors. - - Note: Uses indexed_repo fixture which already performed full indexing. - This test verifies the index was created successfully rather than re-indexing. - """ - # Verify index was created - index_path = indexed_repo / ".sia-code" / "index.db" - assert index_path.exists(), "Index database not created" - assert index_path.stat().st_size > 100000, "Index appears empty or incomplete" - - # Verify index contains data by checking status - result = self.run_cli(["status"], indexed_repo) - assert result.returncode == 0, f"Status check failed: {result.stderr}" - assert "index" in result.stdout.lower() - - def test_index_reports_file_and_chunk_counts(self, indexed_repo): - """Test that status shows index information after indexing.""" - result = self.run_cli(["status"], indexed_repo) - assert result.returncode == 0 - # Check for basic index info (chunk info only shown after --update) - assert "index" in result.stdout.lower() - - def test_index_skips_excluded_patterns(self, indexed_repo): - """Test that indexing skips excluded patterns.""" - results = self.search_json(".git", indexed_repo, regex=True, limit=10) - file_paths = self.get_result_file_paths(results) - git_files = [fp for fp in file_paths if ".git/" in fp] - assert len(git_files) == 0 - - def test_index_clean_rebuilds_from_scratch(self, indexed_repo): - """Test that --clean flag 
rebuilds index from scratch. - - Note: This test does a full rebuild with embeddings enabled. - """ - result = self.run_cli(["index", "--clean", "."], indexed_repo, timeout=600) - assert result.returncode == 0 - assert "clean" in result.stdout.lower() - - def test_index_update_only_processes_changes(self, indexed_repo): - """Test that --update flag only reindexes changed files.""" - result = self.run_cli(["index", "--update", "."], indexed_repo, timeout=600) - assert result.returncode == 0 - # ===== SEARCH - LEXICAL TESTS ===== def test_search_finds_language_keyword(self, indexed_repo): @@ -135,31 +89,6 @@ def test_search_csv_output_valid(self, indexed_repo): ) assert result.returncode == 0 - # ===== RESEARCH TESTS ===== - - def test_research_finds_related_code(self, indexed_repo): - """Test that research command finds related code chunks.""" - result = self.run_cli( - ["research", "How does routing work?", "--hops", "2"], indexed_repo, timeout=600 - ) - assert result.returncode == 0 - - def test_research_respects_hop_limit(self, indexed_repo): - """Test that research respects --hops parameter.""" - result = self.run_cli( - ["research", "How does this work?", "--hops", "1"], indexed_repo, timeout=600 - ) - assert result.returncode == 0 - - def test_research_graph_shows_relationships(self, indexed_repo): - """Test that --graph flag shows code relationships.""" - result = self.run_cli( - ["research", "How does routing work?", "--hops", "2", "--graph"], - indexed_repo, - timeout=600, - ) - assert result.returncode == 0 - # ===== STATUS & MAINTENANCE ===== def test_status_shows_index_info(self, indexed_repo): diff --git a/tests/e2e/test_php_e2e.py b/tests/e2e/test_php_e2e.py index fadacac..c2a2d08 100644 --- a/tests/e2e/test_php_e2e.py +++ b/tests/e2e/test_php_e2e.py @@ -33,52 +33,6 @@ def test_init_creates_index_file(self, initialized_repo): index_path = initialized_repo / ".sia-code" / "index.db" assert index_path.exists() - # ===== INDEXING TESTS ===== - - def 
test_index_full_completes_successfully(self, indexed_repo): - """Test that full indexing completes without errors. - - Note: Uses indexed_repo fixture which already performed full indexing. - This test verifies the index was created successfully rather than re-indexing. - """ - # Verify index was created - index_path = indexed_repo / ".sia-code" / "index.db" - assert index_path.exists(), "Index database not created" - assert index_path.stat().st_size > 100000, "Index appears empty or incomplete" - - # Verify index contains data by checking status - result = self.run_cli(["status"], indexed_repo) - assert result.returncode == 0, f"Status check failed: {result.stderr}" - assert "index" in result.stdout.lower() - - def test_index_reports_file_and_chunk_counts(self, indexed_repo): - """Test that status shows index information after indexing.""" - result = self.run_cli(["status"], indexed_repo) - assert result.returncode == 0 - # Check for basic index info (chunk info only shown after --update) - assert "index" in result.stdout.lower() - - def test_index_skips_excluded_patterns(self, indexed_repo): - """Test that indexing skips excluded patterns.""" - results = self.search_json(".git", indexed_repo, regex=True, limit=10) - file_paths = self.get_result_file_paths(results) - git_files = [fp for fp in file_paths if ".git/" in fp] - assert len(git_files) == 0 - - def test_index_clean_rebuilds_from_scratch(self, indexed_repo): - """Test that --clean flag rebuilds index from scratch. - - Note: This test does a full rebuild with embeddings enabled. 
- """ - result = self.run_cli(["index", "--clean", "."], indexed_repo, timeout=600) - assert result.returncode == 0 - assert "clean" in result.stdout.lower() - - def test_index_update_only_processes_changes(self, indexed_repo): - """Test that --update flag only reindexes changed files.""" - result = self.run_cli(["index", "--update", "."], indexed_repo, timeout=600) - assert result.returncode == 0 - # ===== SEARCH - LEXICAL TESTS ===== def test_search_finds_language_keyword(self, indexed_repo): @@ -135,31 +89,6 @@ def test_search_csv_output_valid(self, indexed_repo): ) assert result.returncode == 0 - # ===== RESEARCH TESTS ===== - - def test_research_finds_related_code(self, indexed_repo): - """Test that research command finds related code chunks.""" - result = self.run_cli( - ["research", "How does the framework work?", "--hops", "2"], indexed_repo, timeout=600 - ) - assert result.returncode == 0 - - def test_research_respects_hop_limit(self, indexed_repo): - """Test that research respects --hops parameter.""" - result = self.run_cli( - ["research", "How does this work?", "--hops", "1"], indexed_repo, timeout=600 - ) - assert result.returncode == 0 - - def test_research_graph_shows_relationships(self, indexed_repo): - """Test that --graph flag shows code relationships.""" - result = self.run_cli( - ["research", "How does the framework work?", "--hops", "2", "--graph"], - indexed_repo, - timeout=600, - ) - assert result.returncode == 0 - # ===== STATUS & MAINTENANCE ===== def test_status_shows_index_info(self, indexed_repo): diff --git a/tests/e2e/test_python_e2e.py b/tests/e2e/test_python_e2e.py index fc39c52..d9e9a96 100644 --- a/tests/e2e/test_python_e2e.py +++ b/tests/e2e/test_python_e2e.py @@ -33,52 +33,6 @@ def test_init_creates_index_file(self, initialized_repo): index_path = initialized_repo / ".sia-code" / "index.db" assert index_path.exists() - # ===== INDEXING TESTS ===== - - def test_index_full_completes_successfully(self, indexed_repo): - """Test that 
full indexing completes without errors. - - Note: Uses indexed_repo fixture which already performed full indexing. - This test verifies the index was created successfully rather than re-indexing. - """ - # Verify index was created - index_path = indexed_repo / ".sia-code" / "index.db" - assert index_path.exists(), "Index database not created" - assert index_path.stat().st_size > 100000, "Index appears empty or incomplete" - - # Verify index contains data by checking status - result = self.run_cli(["status"], indexed_repo) - assert result.returncode == 0, f"Status check failed: {result.stderr}" - assert "index" in result.stdout.lower() - - def test_index_reports_file_and_chunk_counts(self, indexed_repo): - """Test that status shows index information after indexing.""" - result = self.run_cli(["status"], indexed_repo) - assert result.returncode == 0 - # Check for basic index info (chunk info only shown after --update) - assert "index" in result.stdout.lower() - - def test_index_skips_excluded_patterns(self, indexed_repo): - """Test that indexing skips excluded patterns.""" - results = self.search_json(".git", indexed_repo, regex=True, limit=10) - file_paths = self.get_result_file_paths(results) - git_files = [fp for fp in file_paths if ".git/" in fp] - assert len(git_files) == 0 - - def test_index_clean_rebuilds_from_scratch(self, indexed_repo): - """Test that --clean flag rebuilds index from scratch. - - Note: This test does a full rebuild with embeddings enabled. 
- """ - result = self.run_cli(["index", "--clean", "."], indexed_repo, timeout=600) - assert result.returncode == 0 - assert "clean" in result.stdout.lower() - - def test_index_update_only_processes_changes(self, indexed_repo): - """Test that --update flag only reindexes changed files.""" - result = self.run_cli(["index", "--update", "."], indexed_repo, timeout=600) - assert result.returncode == 0 - # ===== SEARCH - LEXICAL TESTS ===== def test_search_finds_language_keyword(self, indexed_repo): @@ -137,31 +91,6 @@ def test_search_csv_output_valid(self, indexed_repo): ) assert result.returncode == 0 - # ===== RESEARCH TESTS ===== - - def test_research_finds_related_code(self, indexed_repo): - """Test that research command finds related code chunks.""" - result = self.run_cli( - ["research", "How do HTTP requests work?", "--hops", "2"], indexed_repo, timeout=600 - ) - assert result.returncode == 0 - - def test_research_respects_hop_limit(self, indexed_repo): - """Test that research respects --hops parameter.""" - result = self.run_cli( - ["research", "What is a session?", "--hops", "1"], indexed_repo, timeout=600 - ) - assert result.returncode == 0 - - def test_research_graph_shows_relationships(self, indexed_repo): - """Test that --graph flag shows code relationships.""" - result = self.run_cli( - ["research", "How are requests sent?", "--hops", "2", "--graph"], - indexed_repo, - timeout=600, - ) - assert result.returncode == 0 - # ===== STATUS & MAINTENANCE ===== def test_status_shows_index_info(self, indexed_repo): diff --git a/tests/e2e/test_ruby_e2e.py b/tests/e2e/test_ruby_e2e.py index df06f76..492780d 100644 --- a/tests/e2e/test_ruby_e2e.py +++ b/tests/e2e/test_ruby_e2e.py @@ -33,52 +33,6 @@ def test_init_creates_index_file(self, initialized_repo): index_path = initialized_repo / ".sia-code" / "index.db" assert index_path.exists() - # ===== INDEXING TESTS ===== - - def test_index_full_completes_successfully(self, indexed_repo): - """Test that full indexing 
completes without errors. - - Note: Uses indexed_repo fixture which already performed full indexing. - This test verifies the index was created successfully rather than re-indexing. - """ - # Verify index was created - index_path = indexed_repo / ".sia-code" / "index.db" - assert index_path.exists(), "Index database not created" - assert index_path.stat().st_size > 100000, "Index appears empty or incomplete" - - # Verify index contains data by checking status - result = self.run_cli(["status"], indexed_repo) - assert result.returncode == 0, f"Status check failed: {result.stderr}" - assert "index" in result.stdout.lower() - - def test_index_reports_file_and_chunk_counts(self, indexed_repo): - """Test that status shows index information after indexing.""" - result = self.run_cli(["status"], indexed_repo) - assert result.returncode == 0 - # Check for basic index info (chunk info only shown after --update) - assert "index" in result.stdout.lower() - - def test_index_skips_excluded_patterns(self, indexed_repo): - """Test that indexing skips excluded patterns.""" - results = self.search_json(".git", indexed_repo, regex=True, limit=10) - file_paths = self.get_result_file_paths(results) - git_files = [fp for fp in file_paths if ".git/" in fp] - assert len(git_files) == 0 - - def test_index_clean_rebuilds_from_scratch(self, indexed_repo): - """Test that --clean flag rebuilds index from scratch. - - Note: This test does a full rebuild with embeddings enabled. 
- """ - result = self.run_cli(["index", "--clean", "."], indexed_repo, timeout=600) - assert result.returncode == 0 - assert "clean" in result.stdout.lower() - - def test_index_update_only_processes_changes(self, indexed_repo): - """Test that --update flag only reindexes changed files.""" - result = self.run_cli(["index", "--update", "."], indexed_repo, timeout=600) - assert result.returncode == 0 - # ===== SEARCH - LEXICAL TESTS ===== def test_search_finds_language_keyword(self, indexed_repo): @@ -132,31 +86,6 @@ def test_search_csv_output_valid(self, indexed_repo): ) assert result.returncode == 0 - # ===== RESEARCH TESTS ===== - - def test_research_finds_related_code(self, indexed_repo): - """Test that research command finds related code chunks.""" - result = self.run_cli( - ["research", "How does routing work?", "--hops", "2"], indexed_repo, timeout=600 - ) - assert result.returncode == 0 - - def test_research_respects_hop_limit(self, indexed_repo): - """Test that research respects --hops parameter.""" - result = self.run_cli( - ["research", "How does this work?", "--hops", "1"], indexed_repo, timeout=600 - ) - assert result.returncode == 0 - - def test_research_graph_shows_relationships(self, indexed_repo): - """Test that --graph flag shows code relationships.""" - result = self.run_cli( - ["research", "How does routing work?", "--hops", "2", "--graph"], - indexed_repo, - timeout=600, - ) - assert result.returncode == 0 - # ===== STATUS & MAINTENANCE ===== def test_status_shows_index_info(self, indexed_repo): diff --git a/tests/e2e/test_rust_e2e.py b/tests/e2e/test_rust_e2e.py index fd023c4..8d5ebad 100644 --- a/tests/e2e/test_rust_e2e.py +++ b/tests/e2e/test_rust_e2e.py @@ -33,52 +33,6 @@ def test_init_creates_index_file(self, initialized_repo): index_path = initialized_repo / ".sia-code" / "index.db" assert index_path.exists() - # ===== INDEXING TESTS ===== - - def test_index_full_completes_successfully(self, indexed_repo): - """Test that full indexing 
completes without errors. - - Note: Uses indexed_repo fixture which already performed full indexing. - This test verifies the index was created successfully rather than re-indexing. - """ - # Verify index was created - index_path = indexed_repo / ".sia-code" / "index.db" - assert index_path.exists(), "Index database not created" - assert index_path.stat().st_size > 100000, "Index appears empty or incomplete" - - # Verify index contains data by checking status - result = self.run_cli(["status"], indexed_repo) - assert result.returncode == 0, f"Status check failed: {result.stderr}" - assert "index" in result.stdout.lower() - - def test_index_reports_file_and_chunk_counts(self, indexed_repo): - """Test that status shows index information after indexing.""" - result = self.run_cli(["status"], indexed_repo) - assert result.returncode == 0 - # Check for basic index info (chunk info only shown after --update) - assert "index" in result.stdout.lower() - - def test_index_skips_excluded_patterns(self, indexed_repo): - """Test that indexing skips excluded patterns.""" - results = self.search_json(".git", indexed_repo, regex=True, limit=10) - file_paths = self.get_result_file_paths(results) - git_files = [fp for fp in file_paths if ".git/" in fp] - assert len(git_files) == 0 - - def test_index_clean_rebuilds_from_scratch(self, indexed_repo): - """Test that --clean flag rebuilds index from scratch. - - Note: This test does a full rebuild with embeddings enabled. 
- """ - result = self.run_cli(["index", "--clean", "."], indexed_repo, timeout=600) - assert result.returncode == 0 - assert "clean" in result.stdout.lower() - - def test_index_update_only_processes_changes(self, indexed_repo): - """Test that --update flag only reindexes changed files.""" - result = self.run_cli(["index", "--update", "."], indexed_repo, timeout=600) - assert result.returncode == 0 - # ===== SEARCH - LEXICAL TESTS ===== def test_search_finds_language_keyword(self, indexed_repo): @@ -132,31 +86,6 @@ def test_search_csv_output_valid(self, indexed_repo): ) assert result.returncode == 0 - # ===== RESEARCH TESTS ===== - - def test_research_finds_related_code(self, indexed_repo): - """Test that research command finds related code chunks.""" - result = self.run_cli( - ["research", "How does async runtime work?", "--hops", "2"], indexed_repo, timeout=600 - ) - assert result.returncode == 0 - - def test_research_respects_hop_limit(self, indexed_repo): - """Test that research respects --hops parameter.""" - result = self.run_cli( - ["research", "How does this work?", "--hops", "1"], indexed_repo, timeout=600 - ) - assert result.returncode == 0 - - def test_research_graph_shows_relationships(self, indexed_repo): - """Test that --graph flag shows code relationships.""" - result = self.run_cli( - ["research", "How does async runtime work?", "--hops", "2", "--graph"], - indexed_repo, - timeout=600, - ) - assert result.returncode == 0 - # ===== STATUS & MAINTENANCE ===== def test_status_shows_index_info(self, indexed_repo): diff --git a/tests/e2e/test_typescript_e2e.py b/tests/e2e/test_typescript_e2e.py index 90b9154..cd94fcf 100644 --- a/tests/e2e/test_typescript_e2e.py +++ b/tests/e2e/test_typescript_e2e.py @@ -33,52 +33,6 @@ def test_init_creates_index_file(self, initialized_repo): index_path = initialized_repo / ".sia-code" / "index.db" assert index_path.exists() - # ===== INDEXING TESTS ===== - - def test_index_full_completes_successfully(self, indexed_repo): 
- """Test that full indexing completes without errors. - - Note: Uses indexed_repo fixture which already performed full indexing. - This test verifies the index was created successfully rather than re-indexing. - """ - # Verify index was created - index_path = indexed_repo / ".sia-code" / "index.db" - assert index_path.exists(), "Index database not created" - assert index_path.stat().st_size > 100000, "Index appears empty or incomplete" - - # Verify index contains data by checking status - result = self.run_cli(["status"], indexed_repo) - assert result.returncode == 0, f"Status check failed: {result.stderr}" - assert "index" in result.stdout.lower() - - def test_index_reports_file_and_chunk_counts(self, indexed_repo): - """Test that status shows index information after indexing.""" - result = self.run_cli(["status"], indexed_repo) - assert result.returncode == 0 - # Check for basic index info (chunk info only shown after --update) - assert "index" in result.stdout.lower() - - def test_index_skips_excluded_patterns(self, indexed_repo): - """Test that indexing skips excluded patterns.""" - results = self.search_json(".git", indexed_repo, regex=True, limit=10) - file_paths = self.get_result_file_paths(results) - git_files = [fp for fp in file_paths if ".git/" in fp] - assert len(git_files) == 0 - - def test_index_clean_rebuilds_from_scratch(self, indexed_repo): - """Test that --clean flag rebuilds index from scratch. - - Note: This test does a full rebuild with embeddings enabled. 
- """ - result = self.run_cli(["index", "--clean", "."], indexed_repo, timeout=600) - assert result.returncode == 0 - assert "clean" in result.stdout.lower() - - def test_index_update_only_processes_changes(self, indexed_repo): - """Test that --update flag only reindexes changed files.""" - result = self.run_cli(["index", "--update", "."], indexed_repo, timeout=600) - assert result.returncode == 0 - # ===== SEARCH - LEXICAL TESTS ===== def test_search_finds_language_keyword(self, indexed_repo): @@ -135,31 +89,6 @@ def test_search_csv_output_valid(self, indexed_repo): ) assert result.returncode == 0 - # ===== RESEARCH TESTS ===== - - def test_research_finds_related_code(self, indexed_repo): - """Test that research command finds related code chunks.""" - result = self.run_cli( - ["research", "How does the runtime work?", "--hops", "2"], indexed_repo, timeout=600 - ) - assert result.returncode == 0 - - def test_research_respects_hop_limit(self, indexed_repo): - """Test that research respects --hops parameter.""" - result = self.run_cli( - ["research", "How does this work?", "--hops", "1"], indexed_repo, timeout=600 - ) - assert result.returncode == 0 - - def test_research_graph_shows_relationships(self, indexed_repo): - """Test that --graph flag shows code relationships.""" - result = self.run_cli( - ["research", "How does the runtime work?", "--hops", "2", "--graph"], - indexed_repo, - timeout=600, - ) - assert result.returncode == 0 - # ===== STATUS & MAINTENANCE ===== def test_status_shows_index_info(self, indexed_repo): diff --git a/tests/integration/test_batch_indexing_search.py b/tests/integration/test_batch_indexing_search.py deleted file mode 100644 index 9e68ecb..0000000 --- a/tests/integration/test_batch_indexing_search.py +++ /dev/null @@ -1,48 +0,0 @@ -"""Integration test for batched indexing and lexical search.""" - - -from sia_code.config import Config -from sia_code.indexer.coordinator import IndexingCoordinator -from sia_code.storage.usearch_backend 
import UsearchSqliteBackend - - -def test_batched_indexing_enables_search(tmp_path): - repo = tmp_path / "repo" - repo.mkdir() - - source = repo / "math_utils.py" - source.write_text( - "\n".join( - [ - "def add(a, b):", - " return a + b", - "", - "def multiply(a, b):", - " return a * b", - "", - ] - ) - ) - - config = Config() - config.indexing.chunk_batch_size = 2 - config.embedding.enabled = False - - backend = UsearchSqliteBackend( - path=tmp_path / ".sia-code", - embedding_enabled=False, - ndim=4, - dtype="f32", - ) - backend.create_index() - - coordinator = IndexingCoordinator(config, backend) - stats = coordinator.index_directory(repo) - - assert stats["total_chunks"] > 0 - - results = backend.search_lexical("multiply", k=1) - assert results - assert results[0].chunk.file_path.name == "math_utils.py" - - backend.close() diff --git a/tests/integration/test_v1_v2_equivalence.py b/tests/integration/test_v1_v2_equivalence.py deleted file mode 100644 index 4d706b0..0000000 --- a/tests/integration/test_v1_v2_equivalence.py +++ /dev/null @@ -1,328 +0,0 @@ -"""Test equivalence between v1 and v2 incremental indexing methods. - -NOTE: v1 has been REMOVED from the codebase after validation. -These tests remain as historical documentation that v2 was validated -to produce equivalent or better results than v1 before deletion. - -The tests now only execute against a mock v1 implementation. 
-""" - -import pytest -import time -from sia_code.indexer.coordinator import IndexingCoordinator -from sia_code.indexer.hash_cache import HashCache -from sia_code.indexer.chunk_index import ChunkIndex -from sia_code.storage.usearch_backend import UsearchSqliteBackend -from sia_code.config import Config, ChunkingConfig - - -@pytest.fixture -def test_workspace(tmp_path): - """Create a workspace with test files.""" - workspace = tmp_path / "workspace" - workspace.mkdir() - - # Create test files with different sizes - (workspace / "small.py").write_text(""" -def small_function(): - return "small" -""") - - (workspace / "medium.py").write_text(""" -def function_one(): - return 1 - -def function_two(): - return 2 - -class MediumClass: - def method(self): - return "method" -""") - - (workspace / "large.py").write_text(""" -class LargeClass: - def __init__(self): - self.data = [] - - def add(self, item): - self.data.append(item) - - def remove(self, item): - self.data.remove(item) - - def get_all(self): - return self.data - - def clear(self): - self.data.clear() -""") - - return workspace - - -@pytest.fixture -def backends(tmp_path): - """Create separate backends for v1 and v2.""" - backend_v1 = UsearchSqliteBackend(tmp_path / "v1.sia-code", embedding_enabled=False) - backend_v1.create_index() - - backend_v2 = UsearchSqliteBackend(tmp_path / "v2.sia-code", embedding_enabled=False) - backend_v2.create_index() - - yield {"v1": backend_v1, "v2": backend_v2} - - backend_v1.close() - backend_v2.close() - - -class TestV1V2Equivalence: - """Test that v2 produces equivalent results to v1. - - NOTE: v1 has been removed. These tests are skipped but kept for documentation. 
- """ - - @pytest.mark.skip(reason="v1 removed after validation - kept for historical documentation") - def test_initial_indexing_produces_same_chunk_count(self, test_workspace, backends, tmp_path): - """Test that v1 and v2 produce same chunk count on initial indexing.""" - # Setup for v1 - cache_v1 = HashCache(tmp_path / "cache_v1.json") - config = Config( - sia_dir=tmp_path / "v1_dir", - chunking=ChunkingConfig( - max_chunk_size=500, - min_chunk_size=50, - merge_threshold=100, - greedy_merge=True, - ), - ) - coordinator_v1 = IndexingCoordinator(backend=backends["v1"], config=config) - - # Setup for v2 - cache_v2 = HashCache(tmp_path / "cache_v2.json") - chunk_index = ChunkIndex(tmp_path / "chunk_index.json") - coordinator_v2 = IndexingCoordinator(backend=backends["v2"], config=config) - - # Run v1 - stats_v1 = coordinator_v1.index_directory_incremental(test_workspace, cache_v1) - - # Run v2 - stats_v2 = coordinator_v2.index_directory_incremental_v2( - test_workspace, cache_v2, chunk_index, progress_callback=None - ) - - # Compare results - # Both use same keys - assert stats_v1["changed_files"] == stats_v2["changed_files"] - assert stats_v1["total_chunks"] == stats_v2["total_chunks"] - - @pytest.mark.skip(reason="v1 removed after validation - kept for historical documentation") - def test_incremental_reindex_skips_same_files(self, test_workspace, backends, tmp_path): - """Test that both v1 and v2 skip unchanged files on re-index.""" - cache_v1 = HashCache(tmp_path / "cache_v1.json") - cache_v2 = HashCache(tmp_path / "cache_v2.json") - chunk_index = ChunkIndex(tmp_path / "chunk_index.json") - - config = Config( - sia_dir=tmp_path, - chunking=ChunkingConfig( - max_chunk_size=500, - min_chunk_size=50, - merge_threshold=100, - greedy_merge=True, - ), - ) - - coordinator_v1 = IndexingCoordinator(backend=backends["v1"], config=config) - coordinator_v2 = IndexingCoordinator(backend=backends["v2"], config=config) - - # Initial indexing - 
coordinator_v1.index_directory_incremental(test_workspace, cache_v1) - coordinator_v2.index_directory_incremental_v2( - test_workspace, cache_v2, chunk_index, progress_callback=None - ) - - # Save caches - cache_v1.save() - cache_v2.save() - chunk_index.save() - - # Re-index without changes - stats_v1_reindex = coordinator_v1.index_directory_incremental(test_workspace, cache_v1) - stats_v2_reindex = coordinator_v2.index_directory_incremental_v2( - test_workspace, cache_v2, chunk_index, progress_callback=None - ) - - # Both should skip all files - assert stats_v1_reindex["changed_files"] == 0 - assert stats_v2_reindex["changed_files"] == 0 - assert stats_v1_reindex["total_chunks"] == 0 - assert stats_v2_reindex["total_chunks"] == 0 - - @pytest.mark.skip(reason="v1 removed after validation - kept for historical documentation") - def test_file_change_detection_consistent(self, test_workspace, backends, tmp_path): - """Test that both v1 and v2 detect file changes consistently.""" - cache_v1 = HashCache(tmp_path / "cache_v1.json") - cache_v2 = HashCache(tmp_path / "cache_v2.json") - chunk_index = ChunkIndex(tmp_path / "chunk_index.json") - - config = Config( - sia_dir=tmp_path, - chunking=ChunkingConfig( - max_chunk_size=500, - min_chunk_size=50, - merge_threshold=100, - greedy_merge=True, - ), - ) - - coordinator_v1 = IndexingCoordinator(backend=backends["v1"], config=config) - coordinator_v2 = IndexingCoordinator(backend=backends["v2"], config=config) - - # Initial indexing - coordinator_v1.index_directory_incremental(test_workspace, cache_v1) - coordinator_v2.index_directory_incremental_v2( - test_workspace, cache_v2, chunk_index, progress_callback=None - ) - - cache_v1.save() - cache_v2.save() - chunk_index.save() - - # Modify one file - time.sleep(0.01) - (test_workspace / "small.py").write_text(""" -def small_function(): - return "modified" - -def new_function(): - return "new" -""") - - # Re-index - stats_v1 = 
coordinator_v1.index_directory_incremental(test_workspace, cache_v1) - stats_v2 = coordinator_v2.index_directory_incremental_v2( - test_workspace, cache_v2, chunk_index, progress_callback=None - ) - - # Both should detect 1 changed file - assert stats_v1["changed_files"] == 1 - assert stats_v2["changed_files"] == 1 - - # Both should have similar chunk counts (at least 2 functions) - assert stats_v1["total_chunks"] >= 2 - assert stats_v2["total_chunks"] >= 2 - - def test_v2_additional_features_work(self, test_workspace, backends, tmp_path): - """Test that v2's additional features (chunk tracking) work correctly.""" - cache = HashCache(tmp_path / "cache.json") - chunk_index = ChunkIndex(tmp_path / "chunk_index.json") - - config = Config( - sia_dir=tmp_path, - chunking=ChunkingConfig( - max_chunk_size=500, - min_chunk_size=50, - merge_threshold=100, - greedy_merge=True, - ), - ) - - coordinator = IndexingCoordinator(backend=backends["v2"], config=config) - - # Initial indexing - coordinator.index_directory_incremental_v2( - test_workspace, cache, chunk_index, progress_callback=None - ) - - # Chunk index should have valid chunks - valid_chunks = chunk_index.get_valid_chunks() - assert len(valid_chunks) > 0 - - # Modify a file - time.sleep(0.01) - (test_workspace / "medium.py").write_text("def new(): pass") - - # Re-index - coordinator.index_directory_incremental_v2( - test_workspace, cache, chunk_index, progress_callback=None - ) - - # Should now have stale chunks (from old medium.py) - stale_chunks = chunk_index.get_stale_chunks() - assert len(stale_chunks) > 0 - - -class TestV2Improvements: - """Test that v2 has improvements over v1.""" - - def test_v2_tracks_staleness(self, test_workspace, backends, tmp_path): - """Test that v2 tracks chunk staleness (v1 does not).""" - cache = HashCache(tmp_path / "cache.json") - chunk_index = ChunkIndex(tmp_path / "chunk_index.json") - - config = Config( - sia_dir=tmp_path, - chunking=ChunkingConfig( - max_chunk_size=500, - 
min_chunk_size=50, - merge_threshold=100, - greedy_merge=True, - ), - ) - - coordinator = IndexingCoordinator(backend=backends["v2"], config=config) - - # Index - coordinator.index_directory_incremental_v2( - test_workspace, cache, chunk_index, progress_callback=None - ) - - # Get summary - summary = chunk_index.get_staleness_summary() - - # Should have metrics - assert summary.total_chunks > 0 - assert summary.valid_chunks > 0 - assert summary.stale_chunks == 0 # No stale chunks yet - assert summary.staleness_ratio == 0.0 - - def test_v2_cleanup_deleted_files(self, test_workspace, backends, tmp_path): - """Test that v2 cleans up chunks from deleted files.""" - cache = HashCache(tmp_path / "cache.json") - chunk_index = ChunkIndex(tmp_path / "chunk_index.json") - - config = Config( - sia_dir=tmp_path, - chunking=ChunkingConfig( - max_chunk_size=500, - min_chunk_size=50, - merge_threshold=100, - greedy_merge=True, - ), - ) - - coordinator = IndexingCoordinator(backend=backends["v2"], config=config) - - # Initial index - stats1 = coordinator.index_directory_incremental_v2( - test_workspace, cache, chunk_index, progress_callback=None - ) - initial_file_count = stats1["changed_files"] - - # Delete a file - (test_workspace / "small.py").unlink() - - # Re-index - coordinator.index_directory_incremental_v2( - test_workspace, cache, chunk_index, progress_callback=None - ) - - # Chunk index should have cleaned up the deleted file - # (exact validation depends on internal state, but shouldn't crash) - summary = chunk_index.get_staleness_summary() - assert summary.total_files < initial_file_count - - -if __name__ == "__main__": - pytest.main([__file__, "-v"]) diff --git a/tests/integration/test_watch_mode.py b/tests/integration/test_watch_mode.py deleted file mode 100644 index 63e26b7..0000000 --- a/tests/integration/test_watch_mode.py +++ /dev/null @@ -1,261 +0,0 @@ -"""Integration tests for watch mode functionality.""" - -import pytest -import time -from 
sia_code.indexer.coordinator import IndexingCoordinator -from sia_code.indexer.hash_cache import HashCache -from sia_code.indexer.chunk_index import ChunkIndex -from sia_code.storage.usearch_backend import UsearchSqliteBackend -from sia_code.config import Config, ChunkingConfig - - -@pytest.fixture -def temp_workspace(tmp_path): - """Create a temporary workspace with test files.""" - workspace = tmp_path / "workspace" - workspace.mkdir() - - # Create initial test file - test_file = workspace / "test.py" - test_file.write_text(""" -def hello(): - return "Hello, World!" -""") - - return workspace - - -@pytest.fixture -def test_setup(tmp_path, temp_workspace): - """Set up test infrastructure (backend, cache, index).""" - # Create backend - backend_path = tmp_path / "test.sia-code" - backend = UsearchSqliteBackend(backend_path, embedding_enabled=False) - backend.create_index() - - # Create cache and chunk index - cache_path = tmp_path / "cache.json" - cache = HashCache(cache_path) - - chunk_index_path = tmp_path / "chunk_index.json" - chunk_index = ChunkIndex(chunk_index_path) - - # Create config - config = Config( - sia_dir=tmp_path, - chunking=ChunkingConfig( - max_chunk_size=500, - min_chunk_size=50, - merge_threshold=100, - greedy_merge=True, - ), - ) - - coordinator = IndexingCoordinator(backend=backend, config=config) - - yield { - "backend": backend, - "cache": cache, - "chunk_index": chunk_index, - "config": config, - "coordinator": coordinator, - "workspace": temp_workspace, - } - - backend.close() - - -class TestWatchModeIndexing: - """Test watch mode uses v2 incremental indexing correctly.""" - - def test_watch_uses_v2_method(self, test_setup): - """Test that watch mode reindex uses index_directory_incremental_v2.""" - setup = test_setup - - # Initial index - stats = setup["coordinator"].index_directory_incremental_v2( - setup["workspace"], - setup["cache"], - setup["chunk_index"], - progress_callback=None, - ) - - assert stats["changed_files"] >= 1 - assert 
stats["total_chunks"] >= 1 - - # Save state - setup["cache"].save() - setup["chunk_index"].save() - - # Verify chunk index was updated - valid_chunks = setup["chunk_index"].get_valid_chunks() - assert len(valid_chunks) >= 1 - - def test_watch_incremental_reuses_unchanged_chunks(self, test_setup): - """Test that incremental indexing reuses chunks from unchanged files.""" - setup = test_setup - - # Initial index - stats1 = setup["coordinator"].index_directory_incremental_v2( - setup["workspace"], - setup["cache"], - setup["chunk_index"], - progress_callback=None, - ) - - setup["cache"].save() - setup["chunk_index"].save() - initial_chunks = stats1["total_chunks"] - - # Re-index without changes (should skip unchanged files) - stats2 = setup["coordinator"].index_directory_incremental_v2( - setup["workspace"], - setup["cache"], - setup["chunk_index"], - progress_callback=None, - ) - - # Should index 0 new files (nothing changed) - assert stats2["changed_files"] == 0 - assert stats2["total_chunks"] == 0 - - # Chunk index should still have the original chunks - valid_chunks = setup["chunk_index"].get_valid_chunks() - assert len(valid_chunks) == initial_chunks - - def test_watch_detects_file_changes(self, test_setup): - """Test that watch mode detects and re-indexes changed files.""" - setup = test_setup - - # Initial index - setup["coordinator"].index_directory_incremental_v2( - setup["workspace"], - setup["cache"], - setup["chunk_index"], - progress_callback=None, - ) - - setup["cache"].save() - setup["chunk_index"].save() - - # Wait a moment to ensure mtime changes - time.sleep(0.01) - - # Modify the file - test_file = setup["workspace"] / "test.py" - test_file.write_text(""" -def hello(): - return "Hello, World!" - -def goodbye(): - return "Goodbye, World!" 
-""") - - # Re-index (should detect change) - stats2 = setup["coordinator"].index_directory_incremental_v2( - setup["workspace"], - setup["cache"], - setup["chunk_index"], - progress_callback=None, - ) - - # Should re-index the changed file - assert stats2["changed_files"] >= 1 - assert stats2["total_chunks"] >= 2 # Now has 2 functions - - def test_watch_does_not_reindex_whole_repo(self, test_setup): - """Test that watch mode doesn't re-index unchanged files.""" - setup = test_setup - - # Create multiple files - for i in range(5): - file_path = setup["workspace"] / f"module{i}.py" - file_path.write_text(f""" -def function_{i}(): - return {i} -""") - - # Initial index - stats1 = setup["coordinator"].index_directory_incremental_v2( - setup["workspace"], - setup["cache"], - setup["chunk_index"], - progress_callback=None, - ) - - setup["cache"].save() - setup["chunk_index"].save() - - # Should index 6 files (test.py + 5 new modules) - assert stats1["changed_files"] >= 6 - - # Wait and modify only one file - time.sleep(0.01) - changed_file = setup["workspace"] / "module2.py" - changed_file.write_text(""" -def function_2(): - return "modified" -""") - - # Re-index - stats2 = setup["coordinator"].index_directory_incremental_v2( - setup["workspace"], - setup["cache"], - setup["chunk_index"], - progress_callback=None, - ) - - # Should only re-index the 1 changed file, not all 6 - assert stats2["changed_files"] == 1 - assert stats2["skipped_files"] == 5 # Other 5 files skipped - - def test_chunk_index_tracks_stale_chunks(self, test_setup): - """Test that chunk index properly tracks stale chunks when files change.""" - setup = test_setup - - # Initial index - setup["coordinator"].index_directory_incremental_v2( - setup["workspace"], - setup["cache"], - setup["chunk_index"], - progress_callback=None, - ) - - setup["cache"].save() - setup["chunk_index"].save() - - initial_valid_chunks = list(setup["chunk_index"].get_valid_chunks()) - assert len(initial_valid_chunks) >= 1 - - # 
Modify file - time.sleep(0.01) - test_file = setup["workspace"] / "test.py" - test_file.write_text(""" -def modified_function(): - return "Modified" -""") - - # Re-index - setup["coordinator"].index_directory_incremental_v2( - setup["workspace"], - setup["cache"], - setup["chunk_index"], - progress_callback=None, - ) - - # Old chunks should be marked stale - stale_chunks = setup["chunk_index"].get_stale_chunks() - assert len(stale_chunks) >= 1 - - # Should have new valid chunks - new_valid_chunks = list(setup["chunk_index"].get_valid_chunks()) - assert len(new_valid_chunks) >= 1 - - # Upsert may preserve chunk IDs; ensure either IDs changed or previous IDs were stale-marked. - assert new_valid_chunks != initial_valid_chunks or any( - chunk_id in stale_chunks for chunk_id in initial_valid_chunks - ) - - -if __name__ == "__main__": - pytest.main([__file__, "-v"]) diff --git a/tests/test_cli_integration.py b/tests/test_cli_integration.py index 035c80c..54e5392 100644 --- a/tests/test_cli_integration.py +++ b/tests/test_cli_integration.py @@ -6,7 +6,6 @@ import subprocess import sys from pathlib import Path -import os import shutil @@ -122,48 +121,6 @@ def test_status_after_init(self, test_project): assert "index" in result.stdout.lower() -class TestCLIIndex: - """Test 'sia-code index' command.""" - - def test_index_not_initialized(self, test_project): - """Test index when not initialized.""" - result = run_cli(["index", "."], cwd=test_project) - - assert result.returncode != 0 - - def test_index_basic(self, test_project): - """Test basic indexing.""" - run_cli(["init"], cwd=test_project) - disable_embeddings(test_project) - result = run_cli(["index", "."], cwd=test_project) - - assert result.returncode == 0 - assert "indexing complete" in result.stdout.lower() - - def test_index_clean(self, test_project): - """Test clean indexing.""" - run_cli(["init"], cwd=test_project) - disable_embeddings(test_project) - run_cli(["index", "."], cwd=test_project) - - result = 
run_cli(["index", "--clean", "."], cwd=test_project) - - assert result.returncode == 0 - assert "clean" in result.stdout.lower() - - def test_index_clean_removes_legacy_usearch_file(self, test_project): - """Test clean indexing removes legacy vectors.usearch to allow sqlite-vec migration.""" - run_cli(["init"], cwd=test_project) - - legacy_vectors = test_project / ".sia-code" / "vectors.usearch" - legacy_vectors.write_text("legacy") - - result = run_cli(["index", "--clean", "."], cwd=test_project) - - assert result.returncode == 0 - assert not legacy_vectors.exists() - - class TestCLISearch: """Test 'sia-code search' command.""" diff --git a/tests/unit/test_chunkhound_cli.py b/tests/unit/test_chunkhound_cli.py new file mode 100644 index 0000000..ee8f8c4 --- /dev/null +++ b/tests/unit/test_chunkhound_cli.py @@ -0,0 +1,48 @@ +"""Unit tests for ChunkHound CLI bridge helpers.""" + +from pathlib import Path + +from sia_code.config import Config +from sia_code.search.chunkhound_cli import build_search_command, parse_search_output + + +def test_build_search_command_regex_uses_no_embeddings_by_default(): + config = Config() + + cmd = build_search_command( + config=config, + query="auth", + project_path=Path("."), + db_path=Path("/tmp/chunkhound.db"), + mode="regex", + limit=7, + ) + + assert cmd[:3] == ["uvx", "chunkhound", "search"] + assert "--regex" in cmd + assert "--no-embeddings" in cmd + assert "--page-size" in cmd + assert "7" in cmd + + +def test_parse_search_output_extracts_file_and_lines(): + output = """=== Regex Search Results === + +[1] src/auth/service.py +[INFO] [blue][INFO][/blue] Lines 12-18 +```python +def authenticate_user(token: str) -> bool: + return token != "" +``` +""" + + parsed = parse_search_output(output=output, query="authenticate", mode="regex") + + assert parsed["query"] == "authenticate" + assert parsed["mode"] == "regex" + assert len(parsed["results"]) == 1 + first = parsed["results"][0] + assert first["chunk"]["file_path"] == 
"src/auth/service.py" + assert first["chunk"]["start_line"] == 12 + assert first["chunk"]["end_line"] == 18 + assert "authenticate_user" in (first["snippet"] or "") diff --git a/tests/unit/test_git_sync.py b/tests/unit/test_git_sync.py index 35b73a8..fb6b6de 100644 --- a/tests/unit/test_git_sync.py +++ b/tests/unit/test_git_sync.py @@ -214,6 +214,117 @@ def test_meets_importance_threshold(self, sync_service): assert sync_service._meets_importance_threshold("medium", "medium") is True assert sync_service._meets_importance_threshold("low", "high") is False + def test_merge_branch_generates_commit_based_changelog(self, sync_service, mock_backend): + """Merge commits with 'Merge branch' message should create changelog entries.""" + merge_event = { + "event_type": "merge", + "from_ref": "feat/location-mailing-list", + "to_ref": "develop", + "summary": "Merge branch 'feat/location-mailing-list' into 'develop'", + "files_changed": ["src/a.ts"], + "diff_stats": {"files": 1, "insertions": 10, "deletions": 2}, + "importance": "medium", + "commit_hash": "abc123", + "commit_time": datetime(2026, 1, 1, 12, 0, 0), + "merge_commit": object(), + } + + with patch.object(sync_service.extractor, "scan_git_tags", return_value=[]): + with patch.object( + sync_service.extractor, "scan_merge_events", return_value=[merge_event] + ): + with patch.object( + sync_service.extractor, + "get_commits_in_merge", + return_value=[ + "feat: add mailing list support", + "fix: resolve location sorting", + "BREAKING CHANGE: rename location payload", + ], + ): + stats = sync_service.sync(merges_only=True) + + assert stats["changelogs_added"] == 1 + assert mock_backend.add_changelog.called + args = mock_backend.add_changelog.call_args.kwargs + assert args["tag"] == "merge:abc123" + assert "feat: add mailing list support" in args["features"] + assert "fix: resolve location sorting" in args["fixes"] + assert "BREAKING CHANGE: rename location payload" in args["breaking_changes"] + + def 
test_non_merge_branch_messages_do_not_generate_commit_changelog( + self, sync_service, mock_backend + ): + """Merge commits without 'Merge branch' message should skip commit changelogs.""" + merge_event = { + "event_type": "merge", + "from_ref": "feature-x", + "to_ref": "main", + "summary": "Merge pull request #123 from org/feature-x", + "files_changed": ["src/a.ts"], + "diff_stats": {"files": 1, "insertions": 10, "deletions": 2}, + "importance": "medium", + "commit_hash": "def456", + "commit_time": datetime(2026, 1, 1, 12, 0, 0), + "merge_commit": object(), + } + + with patch.object(sync_service.extractor, "scan_git_tags", return_value=[]): + with patch.object( + sync_service.extractor, "scan_merge_events", return_value=[merge_event] + ): + with patch.object( + sync_service.extractor, + "get_commits_in_merge", + return_value=["feat: should be ignored"], + ): + stats = sync_service.sync(merges_only=True) + + assert stats["changelogs_added"] == 0 + mock_backend.add_changelog.assert_not_called() + + def test_sync_limit_zero_means_unbounded(self, sync_service, mock_backend): + """A limit of 0 should process all available events.""" + merge_events = [ + { + "event_type": "merge", + "from_ref": "a", + "to_ref": "b", + "summary": "Merge branch 'a' into 'b'", + "files_changed": [], + "diff_stats": {}, + "importance": "medium", + "commit_hash": "aaa111", + "commit_time": datetime(2026, 1, 1, 12, 0, 0), + "merge_commit": object(), + }, + { + "event_type": "merge", + "from_ref": "c", + "to_ref": "d", + "summary": "Merge branch 'c' into 'd'", + "files_changed": [], + "diff_stats": {}, + "importance": "medium", + "commit_hash": "bbb222", + "commit_time": datetime(2026, 1, 1, 12, 0, 0), + "merge_commit": object(), + }, + ] + + with patch.object(sync_service.extractor, "scan_git_tags", return_value=[]): + with patch.object( + sync_service.extractor, "scan_merge_events", return_value=merge_events + ): + with patch.object( + sync_service.extractor, + "get_commits_in_merge", + 
return_value=["fix: keep all"], + ): + stats = sync_service.sync(limit=0, merges_only=True) + + assert stats["timeline_added"] == 2 + if __name__ == "__main__": pytest.main([__file__, "-v"]) diff --git a/tests/unit/test_multi_hop.py b/tests/unit/test_multi_hop.py deleted file mode 100644 index 70f8c73..0000000 --- a/tests/unit/test_multi_hop.py +++ /dev/null @@ -1,461 +0,0 @@ -"""Unit tests for multi-hop code research functionality.""" - -import pytest -from sia_code.core.models import Chunk -from sia_code.core.types import ChunkType, Language, FilePath, LineNumber, ChunkId -from sia_code.search.multi_hop import MultiHopSearchStrategy, CodeRelationship -from sia_code.storage.usearch_backend import UsearchSqliteBackend - - -@pytest.fixture -def backend(tmp_path): - """Create a temporary backend for testing.""" - test_path = tmp_path / ".sia-code" - backend = UsearchSqliteBackend(test_path, embedding_enabled=False) - backend.create_index() - yield backend - backend.close() - - -@pytest.fixture -def sample_chunks(): - """Create sample chunks with realistic code relationships.""" - return [ - # Main entry point - Chunk( - symbol="main", - start_line=LineNumber(1), - end_line=LineNumber(10), - code="""def main(): - config = load_config() - data = fetch_data() - result = process_data(data) - save_result(result) -""", - chunk_type=ChunkType.FUNCTION, - language=Language.PYTHON, - file_path=FilePath("app/main.py"), - ), - # Helper function 1 - Chunk( - symbol="load_config", - start_line=LineNumber(1), - end_line=LineNumber(5), - code="""def load_config(): - with open('config.json') as f: - return json.load(f) -""", - chunk_type=ChunkType.FUNCTION, - language=Language.PYTHON, - file_path=FilePath("app/config.py"), - ), - # Helper function 2 - Chunk( - symbol="fetch_data", - start_line=LineNumber(1), - end_line=LineNumber(5), - code="""def fetch_data(): - response = requests.get(API_URL) - return parse_response(response) -""", - chunk_type=ChunkType.FUNCTION, - 
-            language=Language.PYTHON,
-            file_path=FilePath("app/data.py"),
-        ),
-        # Helper function 3
-        Chunk(
-            symbol="process_data",
-            start_line=LineNumber(1),
-            end_line=LineNumber(5),
-            code="""def process_data(data):
-    cleaned = clean_data(data)
-    return transform_data(cleaned)
-""",
-            chunk_type=ChunkType.FUNCTION,
-            language=Language.PYTHON,
-            file_path=FilePath("app/processor.py"),
-        ),
-        # Deeply nested function
-        Chunk(
-            symbol="parse_response",
-            start_line=LineNumber(1),
-            end_line=LineNumber(3),
-            code="""def parse_response(response):
-    return response.json()
-""",
-            chunk_type=ChunkType.FUNCTION,
-            language=Language.PYTHON,
-            file_path=FilePath("app/parser.py"),
-        ),
-    ]
-
-
-class TestMultiHopResearch:
-    """Test multi-hop code research functionality."""
-
-    def test_research_returns_results(self, backend, sample_chunks):
-        """Test that research returns results for a valid query."""
-        # Store chunks
-        backend.store_chunks_batch(sample_chunks)
-
-        # Create multi-hop strategy
-        strategy = MultiHopSearchStrategy(backend, max_hops=1)
-
-        # Research for "main"
-        result = strategy.research("main", max_results_per_hop=5)
-
-        # Should find at least the main function
-        assert len(result.chunks) >= 1
-        assert result.question == "main"
-        assert result.hops_executed >= 0
-
-    def test_research_respects_max_hops(self, backend, sample_chunks):
-        """Test that research respects max_hops parameter."""
-        backend.store_chunks_batch(sample_chunks)
-
-        # Test with max_hops=0 (only initial search)
-        strategy_0 = MultiHopSearchStrategy(backend, max_hops=0)
-        result_0 = strategy_0.research("main", max_results_per_hop=5)
-        assert result_0.hops_executed == 0
-
-        # Test with max_hops=1 (one hop)
-        strategy_1 = MultiHopSearchStrategy(backend, max_hops=1)
-        result_1 = strategy_1.research("main", max_results_per_hop=5)
-        assert result_1.hops_executed <= 1
-
-        # Test with max_hops=2 (two hops)
-        strategy_2 = MultiHopSearchStrategy(backend, max_hops=2)
-        result_2 = strategy_2.research("main", max_results_per_hop=5)
-        assert result_2.hops_executed <= 2
-
-    def test_research_respects_max_total_chunks(self, backend, sample_chunks):
-        """Test that research respects max_total_chunks safety limit."""
-        backend.store_chunks_batch(sample_chunks)
-
-        strategy = MultiHopSearchStrategy(backend, max_hops=10)
-
-        # Set low limit
-        result = strategy.research("main", max_results_per_hop=5, max_total_chunks=3)
-
-        # Should not exceed the limit
-        assert len(result.chunks) <= 3
-
-    def test_research_discovers_relationships(self, backend, sample_chunks):
-        """Test that multi-hop research discovers code relationships."""
-        backend.store_chunks_batch(sample_chunks)
-
-        strategy = MultiHopSearchStrategy(backend, max_hops=2)
-        result = strategy.research("main", max_results_per_hop=5)
-
-        # Should discover some relationships
-        # (exact count depends on entity extraction success)
-        assert result.relationships is not None
-        assert isinstance(result.relationships, list)
-
-        # Each relationship should have valid structure
-        for rel in result.relationships:
-            assert rel.from_entity is not None
-            assert rel.to_entity is not None
-            assert rel.relationship_type is not None
-
-    def test_research_handles_empty_results(self, backend):
-        """Test that research handles queries with no results gracefully."""
-        strategy = MultiHopSearchStrategy(backend, max_hops=1)
-
-        # Search for something that doesn't exist
-        result = strategy.research("nonexistent_function_xyz")
-
-        # Should return empty result, not crash
-        assert result.question == "nonexistent_function_xyz"
-        assert len(result.chunks) == 0
-        assert len(result.relationships) == 0
-        assert result.hops_executed == 0
-
-    def test_research_tracks_entities_found(self, backend, sample_chunks):
-        """Test that research tracks total entities found."""
-        backend.store_chunks_batch(sample_chunks)
-
-        strategy = MultiHopSearchStrategy(backend, max_hops=1)
-        result = strategy.research("main", max_results_per_hop=5)
-
-        # Should track entities (even if 0 due to extraction limitations)
-        assert result.total_entities_found >= 0
-        assert isinstance(result.total_entities_found, int)
-
-
-class TestCallGraphBuilding:
-    """Test call graph construction from relationships."""
-
-    def test_build_call_graph(self, tmp_path):
-        """Test building call graph from relationships."""
-        relationships = [
-            CodeRelationship(
-                from_entity="main",
-                to_entity="load_config",
-                relationship_type="function_call",
-                from_chunk=ChunkId("chunk1"),
-                to_chunk=ChunkId("chunk2"),
-            ),
-            CodeRelationship(
-                from_entity="main",
-                to_entity="fetch_data",
-                relationship_type="function_call",
-                from_chunk=ChunkId("chunk1"),
-                to_chunk=ChunkId("chunk3"),
-            ),
-            CodeRelationship(
-                from_entity="fetch_data",
-                to_entity="parse_response",
-                relationship_type="function_call",
-                from_chunk=ChunkId("chunk3"),
-                to_chunk=ChunkId("chunk4"),
-            ),
-        ]
-
-        backend = UsearchSqliteBackend(tmp_path / ".sia-code", embedding_enabled=False)
-        strategy = MultiHopSearchStrategy(backend, max_hops=1)
-
-        graph = strategy.build_call_graph(relationships)
-
-        # Should have entries for calling entities
-        assert "main" in graph
-        assert "fetch_data" in graph
-
-        # main should call load_config and fetch_data
-        assert len(graph["main"]) == 2
-        targets = {edge["target"] for edge in graph["main"]}
-        assert "load_config" in targets
-        assert "fetch_data" in targets
-
-        # fetch_data should call parse_response
-        assert len(graph["fetch_data"]) == 1
-        assert graph["fetch_data"][0]["target"] == "parse_response"
-
-    def test_build_call_graph_empty(self, tmp_path):
-        """Test building call graph with no relationships."""
-        backend = UsearchSqliteBackend(tmp_path / ".sia-code", embedding_enabled=False)
-        strategy = MultiHopSearchStrategy(backend, max_hops=1)
-
-        graph = strategy.build_call_graph([])
-
-        # Should return empty graph
-        assert graph == {}
-
-    def test_build_call_graph_includes_metadata(self, tmp_path):
-        """Test that call graph includes relationship metadata."""
-        relationships = [
-            CodeRelationship(
-                from_entity="ClassA",
-                to_entity="ClassB",
-                relationship_type="inheritance",
-                from_chunk=ChunkId("chunk1"),
-                to_chunk=ChunkId("chunk2"),
-            ),
-        ]
-
-        backend = UsearchSqliteBackend(tmp_path / ".sia-code", embedding_enabled=False)
-        strategy = MultiHopSearchStrategy(backend, max_hops=1)
-
-        graph = strategy.build_call_graph(relationships)
-
-        # Should include relationship type
-        assert graph["ClassA"][0]["type"] == "inheritance"
-        assert graph["ClassA"][0]["chunk_id"] == ChunkId("chunk2")
-
-
-class TestEntryPointDetection:
-    """Test entry point identification in call graphs."""
-
-    def test_get_entry_points(self, tmp_path):
-        """Test identifying entry points (no incoming edges)."""
-        relationships = [
-            CodeRelationship("main", "load_config", "function_call"),
-            CodeRelationship("main", "fetch_data", "function_call"),
-            CodeRelationship("fetch_data", "parse_response", "function_call"),
-        ]
-
-        backend = UsearchSqliteBackend(tmp_path / ".sia-code", embedding_enabled=False)
-        strategy = MultiHopSearchStrategy(backend, max_hops=1)
-
-        entry_points = strategy.get_entry_points(relationships)
-
-        # Only "main" should be an entry point (never a target)
-        assert "main" in entry_points
-        assert "load_config" not in entry_points  # Called by main
-        assert "fetch_data" not in entry_points  # Called by main
-        assert "parse_response" not in entry_points  # Called by fetch_data
-
-    def test_get_entry_points_multiple(self, tmp_path):
-        """Test identifying multiple entry points."""
-        relationships = [
-            CodeRelationship("main", "helper", "function_call"),
-            CodeRelationship("test_main", "helper", "function_call"),
-            CodeRelationship("helper", "util", "function_call"),
-        ]
-
-        backend = UsearchSqliteBackend(tmp_path / ".sia-code", embedding_enabled=False)
-        strategy = MultiHopSearchStrategy(backend, max_hops=1)
-
-        entry_points = strategy.get_entry_points(relationships)
-
-        # Both main and test_main are entry points
-        assert len(entry_points) == 2
-        assert "main" in entry_points
-        assert "test_main" in entry_points
-
-    def test_get_entry_points_empty(self, tmp_path):
-        """Test entry point detection with no relationships."""
-        backend = UsearchSqliteBackend(tmp_path / ".sia-code", embedding_enabled=False)
-        strategy = MultiHopSearchStrategy(backend, max_hops=1)
-
-        entry_points = strategy.get_entry_points([])
-
-        # Should return empty list
-        assert entry_points == []
-
-    def test_get_entry_points_circular(self, tmp_path):
-        """Test entry point detection with circular relationships."""
-        relationships = [
-            CodeRelationship("A", "B", "calls"),
-            CodeRelationship("B", "C", "calls"),
-            CodeRelationship("C", "A", "calls"),  # Circular
-        ]
-
-        backend = UsearchSqliteBackend(tmp_path / ".sia-code", embedding_enabled=False)
-        strategy = MultiHopSearchStrategy(backend, max_hops=1)
-
-        entry_points = strategy.get_entry_points(relationships)
-
-        # In a circular graph, no entity is an entry point
-        assert len(entry_points) == 0
-
-
-class TestAdaptiveSearch:
-    """Test adaptive search strategy (semantic vs preprocessed lexical)."""
-
-    def test_uses_semantic_when_embeddings_enabled(self, backend, sample_chunks):
-        """Research should use semantic search when embeddings are available."""
-        backend.store_chunks_batch(sample_chunks)
-
-        # Enable embeddings
-        backend.embedding_enabled = True
-
-        # Mock search_semantic to track if it's called
-        original_search_semantic = backend.search_semantic
-        call_count = {"count": 0}
-
-        def mock_search_semantic(*args, **kwargs):
-            call_count["count"] += 1
-            return original_search_semantic(*args, **kwargs)
-
-        backend.search_semantic = mock_search_semantic
-
-        strategy = MultiHopSearchStrategy(backend, max_hops=1)
-        strategy.research("How does main work?", max_results_per_hop=5)
-
-        # Should have called semantic search
-        assert call_count["count"] >= 1
-
-    def test_uses_lexical_when_embeddings_disabled(self, backend, sample_chunks):
-        """Research should use preprocessed lexical search when embeddings disabled."""
-        backend.store_chunks_batch(sample_chunks)
-
-        # Disable embeddings
-        backend.embedding_enabled = False
-
-        # Mock search_lexical to track calls
-        original_search_lexical = backend.search_lexical
-        call_count = {"count": 0}
-        calls = []
-
-        def mock_search_lexical(query, *args, **kwargs):
-            call_count["count"] += 1
-            calls.append(query)
-            return original_search_lexical(query, *args, **kwargs)
-
-        backend.search_lexical = mock_search_lexical
-
-        strategy = MultiHopSearchStrategy(backend, max_hops=1)
-        strategy.research("How does main work?", max_results_per_hop=5)
-
-        # Should have called lexical search
-        assert call_count["count"] >= 1
-        # First call should be preprocessed (no "How", "does")
-        first_query = calls[0]
-        assert "how" not in first_query.lower() or "main" in first_query.lower()
-
-
-class TestNaturalLanguageQueries:
-    """Test that research handles natural language questions."""
-
-    def test_natural_language_question_with_embeddings(self, backend, sample_chunks):
-        """Natural language questions should attempt semantic search when enabled."""
-        backend.store_chunks_batch(sample_chunks)
-        backend.embedding_enabled = True
-
-        strategy = MultiHopSearchStrategy(backend, max_hops=1)
-        # This should not crash even if embeddings aren't available
-        result = strategy.research("How does the main function work?", max_results_per_hop=5)
-
-        # Should return a valid result object (may be empty if no API key)
-        assert isinstance(result.chunks, list)
-        assert result.question == "How does the main function work?"
-
-    def test_natural_language_question_without_embeddings(self, backend, sample_chunks):
-        """Natural language questions should work with preprocessing fallback."""
-        backend.store_chunks_batch(sample_chunks)
-        backend.embedding_enabled = False
-
-        strategy = MultiHopSearchStrategy(backend, max_hops=1)
-        result = strategy.research("How does main work", max_results_per_hop=5)
-
-        # With preprocessing, should find "main" after removing "How", "does"
-        # Result should be valid (may have results depending on lexical matching)
-        assert isinstance(result.chunks, list)
-        assert result.hops_executed >= 0
-
-    def test_question_with_code_identifiers(self, backend, sample_chunks):
-        """Questions with code identifiers should preserve them in preprocessing."""
-        backend.store_chunks_batch(sample_chunks)
-        backend.embedding_enabled = False
-
-        # Use simpler query that will match
-        strategy = MultiHopSearchStrategy(backend, max_hops=1)
-        result = strategy.research("load_config", max_results_per_hop=5)
-
-        # Should find the load_config function with keyword search
-        assert len(result.chunks) >= 1
-        symbols = [chunk.symbol for chunk in result.chunks]
-        assert "load_config" in symbols
-
-    def test_natural_language_preprocessing_removes_stop_words(self, backend, sample_chunks):
-        """Verify that preprocessing is applied for natural language questions."""
-        backend.store_chunks_batch(sample_chunks)
-        backend.embedding_enabled = False
-
-        # Track what query is actually used in lexical search
-        original_search_lexical = backend.search_lexical
-        actual_queries = []
-
-        def track_search_lexical(query, *args, **kwargs):
-            actual_queries.append(query)
-            return original_search_lexical(query, *args, **kwargs)
-
-        backend.search_lexical = track_search_lexical
-
-        strategy = MultiHopSearchStrategy(backend, max_hops=0)
-        strategy.research("How does the config work?", max_results_per_hop=5)
-
-        # Should have made at least one lexical search
-        assert len(actual_queries) >= 1
-
-        # First query should have stop words removed
-        first_query = actual_queries[0].lower()
-        # "how", "does", "the" should be removed, "config" should remain
-        assert "config" in first_query
-        # Stop words should ideally be removed (may not be perfect but should try)
-        # Just verify config is present - that's the key term
-
-
-if __name__ == "__main__":
-    pytest.main([__file__, "-v"])
diff --git a/uv.lock b/uv.lock
index 948915b..b4b6fd8 100644
--- a/uv.lock
+++ b/uv.lock
@@ -2720,7 +2720,7 @@ wheels = [
 
 [[package]]
 name = "sia-code"
-version = "0.6.0"
+version = "0.7.1"
 source = { editable = "." }
 dependencies = [
     { name = "click" },
@@ -3098,10 +3098,10 @@ dependencies = [
     { name = "typing-extensions" },
 ]
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/e3/ea/304cf7afb744aa626fa9855245526484ee55aba610d9973a0521c552a843/torch-2.10.0-1-cp310-none-macosx_11_0_arm64.whl", hash = "sha256:c37fc46eedd9175f9c81814cc47308f1b42cfe4987e532d4b423d23852f2bf63", size = 79411450, upload-time = "2026-02-06T17:37:35.75Z" },
-    { url = "https://files.pythonhosted.org/packages/25/d8/9e6b8e7df981a1e3ea3907fd5a74673e791da483e8c307f0b6ff012626d0/torch-2.10.0-1-cp311-none-macosx_11_0_arm64.whl", hash = "sha256:f699f31a236a677b3118bc0a3ef3d89c0c29b5ec0b20f4c4bf0b110378487464", size = 79423460, upload-time = "2026-02-06T17:37:39.657Z" },
-    { url = "https://files.pythonhosted.org/packages/c9/2f/0b295dd8d199ef71e6f176f576473d645d41357b7b8aa978cc6b042575df/torch-2.10.0-1-cp312-none-macosx_11_0_arm64.whl", hash = "sha256:6abb224c2b6e9e27b592a1c0015c33a504b00a0e0938f1499f7f514e9b7bfb5c", size = 79498197, upload-time = "2026-02-06T17:37:27.627Z" },
-    { url = "https://files.pythonhosted.org/packages/a4/1b/af5fccb50c341bd69dc016769503cb0857c1423fbe9343410dfeb65240f2/torch-2.10.0-1-cp313-none-macosx_11_0_arm64.whl", hash = "sha256:7350f6652dfd761f11f9ecb590bfe95b573e2961f7a242eccb3c8e78348d26fe", size = 79498248, upload-time = "2026-02-06T17:37:31.982Z" },
+    { url = "https://files.pythonhosted.org/packages/5b/30/bfebdd8ec77db9a79775121789992d6b3b75ee5494971294d7b4b7c999bc/torch-2.10.0-2-cp310-none-macosx_11_0_arm64.whl", hash = "sha256:2b980edd8d7c0a68c4e951ee1856334a43193f98730d97408fbd148c1a933313", size = 79411457, upload-time = "2026-02-10T21:44:59.189Z" },
+    { url = "https://files.pythonhosted.org/packages/0f/8b/4b61d6e13f7108f36910df9ab4b58fd389cc2520d54d81b88660804aad99/torch-2.10.0-2-cp311-none-macosx_11_0_arm64.whl", hash = "sha256:418997cb02d0a0f1497cf6a09f63166f9f5df9f3e16c8a716ab76a72127c714f", size = 79423467, upload-time = "2026-02-10T21:44:48.711Z" },
+    { url = "https://files.pythonhosted.org/packages/d3/54/a2ba279afcca44bbd320d4e73675b282fcee3d81400ea1b53934efca6462/torch-2.10.0-2-cp312-none-macosx_11_0_arm64.whl", hash = "sha256:13ec4add8c3faaed8d13e0574f5cd4a323c11655546f91fbe6afa77b57423574", size = 79498202, upload-time = "2026-02-10T21:44:52.603Z" },
+    { url = "https://files.pythonhosted.org/packages/ec/23/2c9fe0c9c27f7f6cb865abcea8a4568f29f00acaeadfc6a37f6801f84cb4/torch-2.10.0-2-cp313-none-macosx_11_0_arm64.whl", hash = "sha256:e521c9f030a3774ed770a9c011751fb47c4d12029a3d6522116e48431f2ff89e", size = 79498254, upload-time = "2026-02-10T21:44:44.095Z" },
     { url = "https://files.pythonhosted.org/packages/0c/1a/c61f36cfd446170ec27b3a4984f072fd06dab6b5d7ce27e11adb35d6c838/torch-2.10.0-cp310-cp310-manylinux_2_28_aarch64.whl", hash = "sha256:5276fa790a666ee8becaffff8acb711922252521b28fbce5db7db5cf9cb2026d", size = 145992962, upload-time = "2026-01-21T16:24:14.04Z" },
     { url = "https://files.pythonhosted.org/packages/b5/60/6662535354191e2d1555296045b63e4279e5a9dbad49acf55a5d38655a39/torch-2.10.0-cp310-cp310-manylinux_2_28_x86_64.whl", hash = "sha256:aaf663927bcd490ae971469a624c322202a2a1e68936eb952535ca4cd3b90444", size = 915599237, upload-time = "2026-01-21T16:23:25.497Z" },
     { url = "https://files.pythonhosted.org/packages/40/b8/66bbe96f0d79be2b5c697b2e0b187ed792a15c6c4b8904613454651db848/torch-2.10.0-cp310-cp310-win_amd64.whl", hash = "sha256:a4be6a2a190b32ff5c8002a0977a25ea60e64f7ba46b1be37093c141d9c49aeb", size = 113720931, upload-time = "2026-01-21T16:24:23.743Z" },