27 changes: 17 additions & 10 deletions README.md
@@ -4,10 +4,12 @@ Local-first codebase intelligence for CLI workflows.

Sia Code indexes your repo and lets you:

- search code fast (lexical, semantic, or hybrid)
- trace architecture with multi-hop research
- search code fast via ChunkHound CLI (lexical or semantic)
- trace architecture with ChunkHound research
- store/retrieve project decisions and timeline context

Search and research are hard-switched to ChunkHound CLI. Sia keeps index orchestration and memory storage local.

## Why teams use it

- Works directly on local code (`.sia-code/` index per repo/worktree)
@@ -31,6 +33,9 @@ sia-code --version
## Quick Start (2 minutes)

```bash
# install ChunkHound CLI once
uv tool install chunkhound

# in your project
sia-code init
sia-code index .
@@ -53,29 +58,31 @@ sia-code status
| `sia-code index .` | Build index |
| `sia-code index --update` | Incremental re-index |
| `sia-code index --clean` | Rebuild index from scratch |
| `sia-code search "query"` | Hybrid search (default) |
| `sia-code search --regex "pattern"` | Lexical search |
| `sia-code research "question"` | Multi-hop relationship discovery |
| `sia-code search "query"` | ChunkHound-backed search (default mode from config) |
| `sia-code search --regex "pattern"` | ChunkHound lexical search |
| `sia-code research "question"` | ChunkHound research |
| `sia-code memory sync-git` | Import timeline/changelog from git |
| `sia-code memory search "topic"` | Search stored project memory |
| `sia-code config show` | Print active configuration |

## Search Modes (important)

- Default command is hybrid: `sia-code search "query"`
- Default search mode comes from `chunkhound.default_search_mode` (default: `regex`)
- Lexical mode: `sia-code search --regex "pattern"`
- Semantic-only mode: `sia-code search --semantic-only "query"`
- Semantic-only mode: `sia-code search --semantic-only "query"` (requires ChunkHound semantic setup)

Use `--no-deps` when you want only your project code.
Dependency visibility flags (`--no-deps`, `--deps-only`) are currently compatibility no-ops with ChunkHound-backed search.

## Git Sync Memory + Semantic Changelog

`sia-code memory sync-git` is the fastest way to build project memory from git history.

- Scans tags into changelog entries
- Scans merge commits into timeline events
- For merge commits whose subject matches `Merge branch '...'`, also creates changelog entries
- Stores `files_changed` and diff stats (`insertions`, `deletions`, `files`)
- Optionally enhances sparse summaries using a local summarization model
- `memory sync-git --limit 0` processes all eligible events

How semantic summary generation works:

@@ -111,8 +118,8 @@ Useful commands:

```bash
sia-code config show
sia-code config get search.vector_weight
sia-code config set search.vector_weight 0.0
sia-code config get chunkhound.default_search_mode
sia-code config set chunkhound.default_search_mode semantic
```

Note: backend selection is auto by default (`sqlite-vec` for new indexes, legacy `usearch` supported).
17 changes: 9 additions & 8 deletions docs/ARCHITECTURE.md
@@ -11,9 +11,9 @@ Sia Code has two core pipelines:
- write lexical/vector indexes

2. **Query pipeline**
- preprocess query
- run lexical and/or semantic search
- rank + return chunk matches
- resolve mode and build ChunkHound CLI command
- execute ChunkHound search/research
- parse and render results in Sia CLI formats

## Storage Model

@@ -34,17 +34,18 @@ Backend selection:
- `cli.py`: command entry and orchestration
- `indexer/coordinator.py`: full/incremental indexing lifecycle
- `parser/*`: language detection, concept extraction, chunk building
- `storage/*`: search execution and persistence
- `search/chunkhound_cli.py`: ChunkHound command bridge and output parsing
- `storage/*`: memory persistence plus legacy/local search paths
- `memory/*`: git-to-memory sync and timeline/changelog tooling
- `embed_server/*`: optional shared embed daemon

## Search Architecture

- **Hybrid (default):** lexical + semantic
- **Lexical (`--regex`):** exact token/symbol heavy queries
- **Semantic (`--semantic-only`):** concept similarity only
- **Default (`search`)**: mode from `chunkhound.default_search_mode` (default `regex`)
- **Lexical (`--regex`)**: exact token/symbol heavy queries
- **Semantic (`--semantic-only`)**: ChunkHound semantic mode

Flags like `--no-deps` and `--deps-only` control dependency-code visibility.
Flags like `--no-deps` and `--deps-only` are accepted for compatibility but currently no-op with ChunkHound-backed search.

## Design Goals

6 changes: 4 additions & 2 deletions docs/BENCHMARK_METHODOLOGY.md
@@ -2,10 +2,12 @@

This project uses RepoEval-style retrieval evaluation for search quality checks.

> Note: The benchmark harness in `tests/benchmarks/` targets legacy in-process retrievers. ChunkHound-backed CLI search/research should be benchmarked separately as end-to-end CLI runs.

## Scope

- Evaluate retrieval quality (not answer generation)
- Compare lexical, hybrid, and semantic settings
- Compare lexical, hybrid, and semantic settings in the legacy retriever stack
- Use consistent query set and top-k metrics

## Minimal Reproduction Flow
@@ -23,7 +25,7 @@ pkgx python tests/benchmarks/run_full_repoeval_benchmark.py

- Recall@k (especially Recall@5)
- indexing time and query latency
- configuration used (`vector_weight`, embedding settings)
- configuration used (`chunkhound.default_search_mode` for CLI runs, `vector_weight` for legacy runs, embedding settings)

## Fairness Rules

8 changes: 5 additions & 3 deletions docs/BENCHMARK_RESULTS.md
@@ -5,19 +5,21 @@
- RepoEval Recall@5: **89.9%** (reported)
- Improvement over cAST baseline: **+12.9 points** (reported)

> Note: These numbers are historical baselines from legacy in-process retrievers. Current CLI `search` and `research` are ChunkHound-backed.

## Practical Takeaways

- Lexical-heavy search performs strongly for code identifiers.
- Hybrid can still be useful for natural-language style queries.
- For legacy retriever experiments, hybrid can still help natural-language style queries.
- For daily debugging, `--regex` is often the fastest path.

## Recommended Starting Config

```bash
sia-code config set search.vector_weight 0.0
sia-code config set chunkhound.default_search_mode regex
```

Then adjust only if your query style is mostly conceptual.
For legacy benchmark experiments, `search.vector_weight` remains available in the in-process retriever stack.

## Where to find raw benchmark tooling

24 changes: 12 additions & 12 deletions docs/CLI_FEATURES.md
@@ -18,8 +18,8 @@ sia-code status
| --- | --- | --- |
| `init` | Create `.sia-code/` index workspace | `--path`, `--dry-run` |
| `index [PATH]` | Build index | `--update`, `--clean`, `--parallel`, `--workers`, `--watch`, `--debounce`, `--no-git-sync` |
| `search QUERY` | Search code (default hybrid) | `--regex`, `--semantic-only`, `-k/--limit`, `--no-filter`, `--no-deps`, `--deps-only`, `--format`, `--output` |
| `research QUESTION` | Multi-hop architecture exploration | `--hops`, `--graph`, `-k/--limit`, `--no-filter` |
| `search QUERY` | Search code (ChunkHound-backed) | `--regex`, `--semantic-only`, `-k/--limit`, `--no-filter` (compat), `--no-deps` (compat), `--deps-only` (compat), `--format`, `--output` |
| `research QUESTION` | Architecture exploration (ChunkHound-backed) | `--hops` (compat), `--graph` (compat), `-k/--limit` (compat), `--no-filter` (compat) |
| `status` | Index health and statistics | none |
| `compact [PATH]` | Remove stale chunks | `--threshold`, `--force` |
| `interactive` | Live query loop | `--regex`, `-k/--limit` |
@@ -28,17 +28,17 @@ sia-code status

| Command | Purpose | Key options |
| --- | --- | --- |
| `memory sync-git` | Import timeline/changelog from git (with diff stats and optional local semantic summaries) | `--since`, `--limit`, `--dry-run`, `--tags-only`, `--merges-only`, `--min-importance` |
| `memory sync-git` | Import timeline/changelog from git (with diff stats and optional local semantic summaries) | `--since`, `--limit` (`0` means all), `--dry-run`, `--tags-only`, `--merges-only`, `--min-importance` |
| `memory add-decision TITLE` | Add pending decision | `-d/--description` (required), `-r/--reasoning`, `-a/--alternatives` |
| `memory list` | List memory items | `--type`, `--status`, `--limit`, `--format` |
| `memory list` | List memory items | `--type`, `--status`, `--limit` (`0` means all), `--format` |
| `memory approve ID` | Approve decision | `-c/--category` (required) |
| `memory reject ID` | Reject decision | none |
| `memory search QUERY` | Search memory | `--type`, `-k/--limit` |
| `memory timeline` | View timeline events | `--since`, `--event-type`, `--importance`, `--format` |
| `memory changelog [RANGE]` | Generate changelog | `--format`, `--output` |
| `memory timeline` | View timeline events | `--since`, `--event-type`, `--importance`, `--limit` (`0` means all), `--format` |
| `memory changelog [RANGE]` | Generate changelog | `--limit` (`0` means all), `--format`, `--output` |
| `memory export` / `memory import` | Backup/restore memory | `-o/--output`, `-i/--input` |

`memory sync-git` is the entrypoint for semantic changelog generation: it extracts git context, then (if enabled) uses the local summarizer to enrich release and merge summaries stored in memory.
`memory sync-git` is the entrypoint for semantic changelog generation: it extracts git context, then (if enabled) uses the local summarizer to enrich tag releases and merge-derived changelog entries stored in memory.

## Embed Daemon

@@ -48,15 +48,15 @@ sia-code status
| `embed status` | Show daemon status |
| `embed stop` | Stop daemon |

Use daemon when you rely heavily on hybrid/semantic search or memory embedding operations.
Use daemon when you rely heavily on memory embedding operations.

## Config Commands

```bash
sia-code config show
sia-code config path
sia-code config get search.vector_weight
sia-code config set search.vector_weight 0.0
sia-code config get chunkhound.default_search_mode
sia-code config set chunkhound.default_search_mode semantic
```

## Output Formats
@@ -71,8 +71,8 @@ sia-code config set search.vector_weight 0.0
- First index: `sia-code index .`
- Ongoing work: `sia-code index --update`
- Exact symbols: `sia-code search --regex "pattern"`
- Project-only focus: `--no-deps`
- Architecture questions: `sia-code research "..." --hops 3`
- If output is noisy: tighten regex terms or add path-like query terms
- Architecture questions: `sia-code research "..."`

## Related Docs

7 changes: 4 additions & 3 deletions docs/CODE_STRUCTURE.md
@@ -21,8 +21,8 @@ sia_code/
core/ # shared models and enums
parser/ # AST concept extraction and chunking
indexer/ # indexing orchestration, hash cache, metrics
search/ # query pre-processing and multi-hop logic
storage/ # sqlite-vec + legacy usearch backends
search/ # ChunkHound CLI bridge + query helpers
storage/ # memory persistence + legacy local search backends
memory/ # git sync, timeline, changelog, decision flow
embed_server/ # optional embedding daemon
```
@@ -35,7 +35,8 @@ sia_code/
| Change default behavior | `sia_code/config.py`, `sia_code/cli.py` |
| Tune indexing | `sia_code/indexer/coordinator.py`, `sia_code/indexer/chunk_index.py` |
| Tune chunking | `sia_code/parser/chunker.py`, `sia_code/parser/concepts.py` |
| Search ranking/filtering | `sia_code/storage/sqlite_vec_backend.py`, `sia_code/storage/usearch_backend.py` |
| ChunkHound search/research bridge | `sia_code/search/chunkhound_cli.py`, `sia_code/cli.py` |
| Legacy/local search ranking (interactive) | `sia_code/storage/sqlite_vec_backend.py`, `sia_code/storage/usearch_backend.py` |
| Backend selection logic | `sia_code/storage/factory.py` |
| Memory commands and sync | `sia_code/memory/git_sync.py`, `sia_code/memory/git_events.py`, `sia_code/cli.py` |

10 changes: 9 additions & 1 deletion docs/LLM_CLI_INTEGRATION.md
@@ -30,21 +30,29 @@ Load skill sia-code
## 3) Recommended agent workflow

```bash
uv tool install chunkhound
uvx sia-code status
uvx sia-code init
uvx sia-code index .
uvx sia-code search --regex "your symbol"
uvx sia-code research "how does X work?"
```

Notes:

- `search` and `research` are ChunkHound-backed.
- Memory commands stay in Sia's local memory database.
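
Because `search` and `research` now depend on an external tool, an agent wrapper may want to confirm that ChunkHound is installed before calling them. The following is a minimal sketch, assuming the tool installed by `uv tool install chunkhound` exposes an executable named `chunkhound` on `PATH`. The helper name `find_tool` is hypothetical and not part of Sia Code.

```python
import shutil
from typing import Optional


def find_tool(name: str = "chunkhound") -> Optional[str]:
    """Return the absolute path of an executable on PATH, or None if missing.

    Assumption: the ChunkHound CLI is installed as an executable called
    "chunkhound" (inferred from `uv tool install chunkhound`).
    """
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        print("chunkhound not found; run: uv tool install chunkhound")
    else:
        print(f"chunkhound found at {path}")
```

Running the check before `uvx sia-code search` lets an agent report a missing dependency up front.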

## 4) Optional memory workflow

```bash
uvx sia-code memory sync-git
uvx sia-code memory sync-git --limit 0
uvx sia-code memory search "topic"
uvx sia-code memory add-decision "Decision title" -d "Context" -r "Reason"
```

`memory sync-git` also derives changelog entries from merge commits whose subject matches `Merge branch '...'`.

## 5) Multiple worktrees / multiple Claude Code instances

Use one of these index strategies per session:
6 changes: 6 additions & 0 deletions docs/MEMORY_FEATURES.md
@@ -31,6 +31,7 @@ sia-code memory search "Adopt X" --type decision

- Tags become changelog memory entries
- Merge commits become timeline memory events
- Merge commits whose subject matches `Merge branch '...'` also become changelog entries
- Each event captures changed files and diff stats
- Duplicate events are skipped automatically
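
The routing rules above can be sketched as a small function. This is an illustration only: the regular expression is an assumed approximation of the `Merge branch '...'` subject check, and the names `MERGE_BRANCH_SUBJECT` and `memory_targets` and the `"tag"`/`"merge"` labels are hypothetical rather than taken from `sia_code/memory/`.

```python
import re
from typing import List

# Assumption: approximates the "Merge branch '...'" subject check;
# the real matcher in Sia Code may differ.
MERGE_BRANCH_SUBJECT = re.compile(r"^Merge branch '[^']+'")


def memory_targets(kind: str, subject: str) -> List[str]:
    """Return which memory stores a git event would feed.

    kind is "tag" or "merge" (hypothetical labels for illustration).
    """
    if kind == "tag":
        return ["changelog"]
    if kind == "merge":
        targets = ["timeline"]
        # Only "Merge branch '...'" subjects also produce a changelog entry.
        if MERGE_BRANCH_SUBJECT.match(subject):
            targets.append("changelog")
        return targets
    return []


if __name__ == "__main__":
    print(memory_targets("merge", "Merge branch 'feature/auth' into main"))
```

A merge whose subject does not match the pattern would still land in the timeline, just not in the changelog.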

@@ -69,6 +70,11 @@ Notes:
| `memory changelog` | render changelog text/json/markdown |
| `memory export` / `memory import` | backup/restore memory data |

Limit behavior:

- `memory sync-git --limit 0` processes all eligible events
- `memory list --limit 0`, `memory timeline --limit 0`, and `memory changelog --limit 0` return all rows
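
The "`0` means all" convention can be sketched as follows. The helper `apply_limit` is hypothetical, and rejecting negative values is an assumption made for this example, not documented behavior.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def apply_limit(rows: Sequence[T], limit: int) -> List[T]:
    """Return at most `limit` rows; a limit of 0 means "no limit"."""
    if limit < 0:
        # Assumption: negative limits are treated as invalid input.
        raise ValueError("limit must be >= 0")
    if limit == 0:
        return list(rows)
    return list(rows[:limit])


if __name__ == "__main__":
    events = ["v1.0", "v1.1", "v1.2"]
    print(apply_limit(events, 2))
    print(apply_limit(events, 0))
```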

## Good Practices

- Add decisions with explicit `description` and `reasoning`.
12 changes: 6 additions & 6 deletions docs/PERFORMANCE_ANALYSIS.md
@@ -3,18 +3,18 @@
## Typical Expectations

- `search --regex`: usually lowest-latency mode
- hybrid `search`: additional semantic overhead
- `search --semantic-only`: usually higher latency than regex
- `index --update`: much faster than full rebuild for small changes

Actual speed depends on repo size, hardware, and embedding configuration.
Actual speed depends on repo size, hardware, and ChunkHound semantic/provider setup.
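
Since latency varies by setup, it can help to measure it locally rather than rely on these expectations. The sketch below times any command over several runs and reports the median. The helper name `median_latency` is hypothetical, and the demo uses a placeholder Python command so it runs without an index.

```python
import statistics
import subprocess
import sys
import time
from typing import List


def median_latency(cmd: List[str], runs: int = 5) -> float:
    """Run `cmd` several times and return the median wall-clock seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded; only elapsed time matters here.
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    # Placeholder command so the sketch runs anywhere; inside an indexed
    # repo, swap in e.g. ["sia-code", "search", "--regex", "pattern"].
    print(f"{median_latency([sys.executable, '-c', 'pass']):.3f}s")
```

Comparing the median for `search --regex` against `search --semantic-only` on the same query set gives a like-for-like picture for a given machine.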

## Quick Optimization Checklist

1. Use `sia-code index --update` for daily work
2. Use `--regex` for symbol/identifier lookup
3. Add `--no-deps` to reduce large dependency noise
3. Use tighter regex terms (or include path-like hints) to reduce noise
4. Use `--parallel` for large initial indexing runs
5. Start embed daemon when doing repeated semantic/hybrid queries
5. Start embed daemon when doing repeated memory embedding operations

## Useful Commands

@@ -28,8 +28,8 @@ sia-code search --regex "pattern"
## Bottleneck Hints

- Slow index build: reduce indexed scope or enable parallel workers
- Slow semantic/hybrid queries: ensure embed daemon is healthy
- Noisy result set: use dependency filters (`--no-deps` / `--deps-only`)
- Slow semantic queries: verify ChunkHound provider setup and model/network health
- Noisy result set: narrow regex terms and include path-like query hints

## Related Docs

23 changes: 12 additions & 11 deletions docs/QUERYING.md
@@ -3,49 +3,50 @@
## Search Commands

```bash
# default hybrid
# default mode from config (ChunkHound-backed; default is regex)
sia-code search "authentication flow"

# lexical / symbol-heavy
sia-code search --regex "AuthService|token"

# semantic only
# semantic only (requires embedding setup)
sia-code search --semantic-only "handle login failures"
```

## Useful Flags

- `-k, --limit <N>`: number of results
- `--no-deps`: only project code
- `--deps-only`: only dependency code
- `--no-filter`: include stale chunks
- `--no-deps`: accepted for compatibility (currently no-op)
- `--deps-only`: accepted for compatibility (currently no-op)
- `--no-filter`: accepted for compatibility (currently no-op)
- `--format text|json|table|csv`
- `--output <path>`: write results to file

## Multi-Hop Research

```bash
sia-code research "how does auth middleware work?" --hops 3 --graph
sia-code research "how does auth middleware work?"
```

Use this for architecture tracing, call-path discovery, and unfamiliar code.

Compatibility flags for `research` (`--hops`, `--graph`, `--limit`, `--no-filter`) are accepted by Sia and ignored by ChunkHound.
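
One way to picture "accepted and ignored" is a parser that still recognizes the old flags but keeps them apart from what gets forwarded. This is a sketch under assumptions: `parse_research_args` is a hypothetical helper, and the real handling in `cli.py` may differ.

```python
import argparse
from typing import Any, Dict, List, Tuple


def parse_research_args(argv: List[str]) -> Tuple[str, Dict[str, Any]]:
    """Split a research invocation into the question and ignored compat flags."""
    parser = argparse.ArgumentParser(prog="research-sketch")
    parser.add_argument("question")
    # Compatibility flags: parsed so old scripts keep working, never forwarded.
    parser.add_argument("--hops", type=int, default=None)
    parser.add_argument("--graph", action="store_true")
    parser.add_argument("-k", "--limit", type=int, default=None)
    parser.add_argument("--no-filter", action="store_true")
    args = vars(parser.parse_args(argv))
    question = args.pop("question")
    # Keep only the flags the caller actually supplied.
    ignored = {k: v for k, v in args.items() if v is not None and v is not False}
    return question, ignored


if __name__ == "__main__":
    q, ignored = parse_research_args(
        ["how does auth middleware work?", "--hops", "3", "--graph"]
    )
    print(q, ignored)
```

Only the question would be passed on; the collected flags could be logged so users notice they have no effect.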

## Practical Tuning

- `search.vector_weight = 0.0` => lexical-heavy behavior
- `search.vector_weight = 1.0` => semantic-heavy behavior
- `chunkhound.default_search_mode = regex|semantic`
- defaults come from `.sia-code/config.json`

```bash
sia-code config get search.vector_weight
sia-code config set search.vector_weight 0.0
sia-code config get chunkhound.default_search_mode
sia-code config set chunkhound.default_search_mode semantic
```
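
How the effective mode might be resolved from flags and config can be sketched as below. This assumes the dotted key `chunkhound.default_search_mode` maps to nested JSON in `.sia-code/config.json` and that explicit flags take precedence over the configured default. The function name `resolve_search_mode` and the rejection of conflicting flags are hypothetical.

```python
from typing import Any, Dict

VALID_MODES = ("regex", "semantic")


def resolve_search_mode(
    config: Dict[str, Any],
    regex: bool = False,
    semantic_only: bool = False,
) -> str:
    """Pick the search mode: explicit flags first, then the config default."""
    if regex and semantic_only:
        # Assumption: the two flags are treated as mutually exclusive.
        raise ValueError("--regex and --semantic-only cannot be combined")
    if regex:
        return "regex"
    if semantic_only:
        return "semantic"
    # Falls back to "regex", the documented default.
    mode = config.get("chunkhound", {}).get("default_search_mode", "regex")
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported default_search_mode: {mode!r}")
    return mode


if __name__ == "__main__":
    cfg = {"chunkhound": {"default_search_mode": "semantic"}}
    print(resolve_search_mode(cfg))
    print(resolve_search_mode(cfg, regex=True))
```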

## Output Tips

- Use `--format json` for scripts/agents.
- Use `--format table` for quick terminal scanning.
- Use `--no-deps` in large repos to reduce noise.
- Use tighter regex terms or path-like query text when results are noisy.

## Related Docs
