A Python CLI tool that builds a structured code index through three phases: deterministic AST parsing, ripgrep-based dependency mapping, and LLM semantic enrichment. The index is stored in a local SQLite database and supports lexical, graph, and semantic queries — giving agents and developers fast, structured access to codebase knowledge without injecting raw source into context windows.
Requires Python ≥ 3.11 and ripgrep on PATH.
```sh
pip install -e .
```

```sh
# 1. Initialise the database
index init

# 2. Parse source files and map dependencies
index build

# 3. Enrich nodes with LLM-generated semantic metadata
#    Set one of: ANTHROPIC_API_KEY, OPENAI_API_KEY, OPENROUTER_API_KEY, or LITELLM_BASE_URL
export ANTHROPIC_API_KEY="sk-..."
index enrich                    # auto-detects provider from env
index enrich --provider openai  # or specify explicitly

# 4. Query the index
index query "validateCartState"
```

This walkthrough indexes a real project from scratch and shows how to use every major feature.
```sh
pip install -e .

# Install ripgrep and other external dependencies automatically
index install
```

Navigate to your project root and run the full pipeline:

```sh
cd /path/to/your/project

# Build the index (init is automatic)
index build
```

This creates a `.codeindex/` directory containing the SQLite database. The build runs two phases: AST parsing extracts every file, class, function, and method, then ripgrep maps all call-site and import relationships between them.
To exclude vendored or generated code:

```sh
index build --exclude "vendor/*" --exclude "generated/*"
```

Check the state of the index at any time:

```sh
index status
```

Example output:

```
Nodes: 142
Edges: 387
Unenriched: 142
Last build: 2026-03-25T10:15:00+00:00
Schema version: 3
DB path: .codeindex/codeindex.db
```

The `Unenriched: 142` line means no nodes have semantic metadata yet — that comes next.
This step calls the configured LLM provider (Anthropic Claude by default) to generate summaries, domain tags, and inferred responsibilities for each node. It requires an API key:
```sh
# Using Anthropic (default)
export ANTHROPIC_API_KEY="sk-..."
index enrich

# Using OpenAI
export OPENAI_API_KEY="sk-..."
index enrich --provider openai

# Using OpenRouter
export OPENROUTER_API_KEY="sk-..."
index enrich --provider openrouter --model anthropic/claude-sonnet-4-6

# Using LiteLLM proxy
export LITELLM_BASE_URL="http://localhost:4000/v1"
index enrich --provider litellm

# Preview what will be enriched (any provider)
index enrich --dry-run
```

When `--provider` is omitted, the provider is auto-detected from environment variables (checked in order: `ANTHROPIC_API_KEY` → `OPENAI_API_KEY` → `OPENROUTER_API_KEY` → `LITELLM_API_KEY`/`LITELLM_BASE_URL`).
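The detection order can be sketched in a few lines. This is an illustration only; the function and table below are hypothetical, not the tool's actual internals:

```python
import os

# Hypothetical sketch of provider auto-detection: check env vars in the
# documented priority order and return the first provider with a match.
PROVIDER_ENV_ORDER = [
    ("anthropic", ["ANTHROPIC_API_KEY"]),
    ("openai", ["OPENAI_API_KEY"]),
    ("openrouter", ["OPENROUTER_API_KEY"]),
    ("litellm", ["LITELLM_API_KEY", "LITELLM_BASE_URL"]),
]

def detect_provider(env=os.environ):
    for provider, variables in PROVIDER_ENV_ORDER:
        if any(env.get(v) for v in variables):
            return provider
    raise RuntimeError("No LLM provider configured; set one of the API key variables")
```

Because the check is ordered, setting `ANTHROPIC_API_KEY` wins even when other keys are also present, which is why `--provider` exists as an explicit override.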
Enrichment is hash-gated: re-running `index enrich` after code changes only processes nodes whose content actually changed.

**First-run cost (one-time per repository):** enriching a ~14,000-node codebase with Claude Sonnet costs approximately $42–67, depending on average node size. This is paid once when you first index a repository.

**Incremental cost (every subsequent run):** the indexer is hash-gated. Phase 1 clears `enriched_at` only on nodes whose `content_hash` changed; Phase 3 then processes only those nodes. On a normally evolving codebase where a sprint touches 1–2% of nodes, a rebuild enrichment run costs under $5 — often under $1.
What drives cost up:

- Large-scale refactors that invalidate many `content_hash` values in one go
- Onboarding many repositories (each pays the first-run cost once)
- Branch switches between long-lived divergent branches

If Phase 3 cost is a concern, run `index enrich --dry-run` first — it reports the number of unenriched nodes before making any API calls. You can also skip Phase 3 entirely; the structural index (Phases 1+2) still provides AST nodes and dependency-graph context at zero LLM cost.
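The gating mechanism can be sketched as follows. This is illustrative only; the column names match the ones mentioned in this README, but the real schema is internal and may differ:

```python
import hashlib
import sqlite3

# Illustrative sketch of hash-gating: Phase 1 clears enriched_at when a
# node's content hash changes; Phase 3 selects only unenriched nodes.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (id INTEGER PRIMARY KEY, qualified_name TEXT,"
    " content_hash TEXT, enriched_at TEXT)"
)

def upsert_node(name, source):
    new_hash = hashlib.sha256(source.encode()).hexdigest()
    row = conn.execute(
        "SELECT content_hash FROM nodes WHERE qualified_name = ?", (name,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO nodes (qualified_name, content_hash) VALUES (?, ?)",
            (name, new_hash),
        )
    elif row[0] != new_hash:
        # Content changed: store the new hash and clear enrichment so that
        # Phase 3 re-enriches this node (and only this node).
        conn.execute(
            "UPDATE nodes SET content_hash = ?, enriched_at = NULL"
            " WHERE qualified_name = ?",
            (new_hash, name),
        )

def unenriched():
    return [r[0] for r in conn.execute(
        "SELECT qualified_name FROM nodes WHERE enriched_at IS NULL"
    )]
```

Unchanged nodes keep their `enriched_at` timestamp, so a rebuild after a small change yields a short `unenriched()` list and a correspondingly small API bill.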
Find a symbol by name (lexical search):

```sh
index query "UserService"
```

Explore a node's dependency graph:

```sh
index query "UserService.validate" --type graph --depth 3
```

Ask a natural-language question (semantic search — requires enrichment):

```sh
index query "where is authentication handled" --type semantic
```

Get machine-readable output for scripts or agents:

```sh
index query "CartService" --format json --with-source
```

The query router automatically picks the best strategy (lexical, graph, or semantic) when `--type` is omitted, and falls back to an alternative strategy if the first returns no results.
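The routing heuristic might look something like this sketch. The rules below are hypothetical, not the tool's actual implementation:

```python
import re

def pick_strategy(query):
    """Hypothetical routing rule: bare identifiers go lexical, dotted
    paths go graph, natural-language queries go semantic."""
    if re.fullmatch(r"[\w.]+", query):
        return "graph" if "." in query else "lexical"
    return "semantic"

def run_query(query, strategies):
    """strategies maps name -> callable returning a result list. If the
    chosen strategy returns nothing, fall back to the remaining ones."""
    order = [pick_strategy(query)]
    order += [s for s in ("lexical", "graph", "semantic") if s not in order]
    for name in order:
        results = strategies[name](query)
        if results:
            return name, results
    return None, []
```

The fallback means a lexical miss on a renamed symbol can still be rescued by semantic search over enriched summaries, at the cost of a second lookup.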
To update the index after your code changes, re-run the build:

```sh
index build
```

The build is incremental at the enrichment layer — only changed nodes need re-enrichment. To start completely fresh:

```sh
index reset --yes
index build
```

The full cycle at a glance:

```sh
index build                       # parse + map dependencies
index enrich                      # add semantic metadata (optional)
index query "MyClass"             # find symbols
index query "how does auth work"  # semantic search
index status                      # check health
```

**`index install`** — Install external dependencies required by the indexer. Currently installs ripgrep using the system package manager (Homebrew on macOS, apt/dnf/pacman on Linux, Chocolatey/Scoop on Windows). No-op if all dependencies are already present.
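Package-manager detection presumably works along these lines. This sketch uses `shutil.which` for lookup; the actual install logic may differ:

```python
import platform
import shutil

# Illustrative sketch: pick the first available package manager for the
# current OS and build the ripgrep install command; no-op if rg exists.
MANAGERS = {
    "Darwin": [("brew", ["brew", "install", "ripgrep"])],
    "Linux": [
        ("apt-get", ["sudo", "apt-get", "install", "-y", "ripgrep"]),
        ("dnf", ["sudo", "dnf", "install", "-y", "ripgrep"]),
        ("pacman", ["sudo", "pacman", "-S", "--noconfirm", "ripgrep"]),
    ],
    "Windows": [
        ("choco", ["choco", "install", "ripgrep"]),
        ("scoop", ["scoop", "install", "ripgrep"]),
    ],
}

def ripgrep_install_command(system=None, which=shutil.which):
    if which("rg"):
        return None  # ripgrep already on PATH: nothing to do
    for manager, command in MANAGERS.get(system or platform.system(), []):
        if which(manager):
            return command
    raise RuntimeError("No supported package manager found; install ripgrep manually")
```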
```sh
index install
```

**`index init`** — Create the `.codeindex/` directory and initialise the database schema. No-op if the DB already exists and the schema version is current. Auto-invoked by `index build` if the DB does not yet exist.

| Option | Description |
|---|---|
| `--db PATH` | Path to the SQLite database file |
| `--no-gitignore-update` | Skip automatic `.gitignore` update |
**`index build`** — Run Phase 1 (AST parse) and Phase 2 (dependency mapping). Bootstraps the DB automatically if not yet initialised.

| Option | Description |
|---|---|
| `--db PATH` | Path to the SQLite database file |
| `--phase PREPARE\|DEPLOY` | Run only the specified build phase |
| `--token-limit N` | Max tokens per cAST chunk (default: 512) |
| `--exclude PATTERN` | Glob patterns to exclude from parsing (repeatable) |
| `--no-gitignore-update` | Skip automatic `.gitignore` update |
```sh
index build --phase PREPARE --exclude "vendor/*"
```

**`index enrich`** — Run Phase 3 (LLM enrichment) on unenriched nodes. Only re-enriches nodes whose `content_hash` has changed since the last run.

| Option | Description |
|---|---|
| `--db PATH` | Path to the SQLite database file |
| `--dry-run` | Show what would be enriched without making API calls |
| `--model MODEL` | Override the LLM model for enrichment |
| `--provider NAME` | LLM provider: `anthropic`, `openai`, `openrouter`, or `litellm` |

```sh
index enrich --dry-run
index enrich --provider openai --model gpt-4o
```

**`index query`** — Query the code index. The query router auto-selects a strategy (lexical, graph, or semantic) based on the input, with cross-strategy fallback when results are empty.
| Option | Description |
|---|---|
| `--db PATH` | Path to the SQLite database file |
| `--type lexical\|graph\|semantic` | Query strategy (auto-selected when omitted) |
| `--format text\|json\|jsonl` | Output format |
| `--with-source` | Include raw source in results |
| `--top-k N` | Maximum number of results (default: 10) |
| `--depth N` | Graph traversal depth (default: 2) |

```sh
# Human-readable lexical lookup
index query "CartService" --type lexical --with-source

# Structured output for agent consumption
index query "cart loses items after discount" --type semantic --format jsonl
```

**`index status`** — Show index health: node count, edge count, unenriched nodes, last build time, and schema version.
| Option | Description |
|---|---|
| `--db PATH` | Path to the SQLite database file |

**`index reset`** — Drop and recreate all database tables.

| Option | Description |
|---|---|
| `--db PATH` | Path to the SQLite database file |
| `--yes, -y` | Skip confirmation prompt (required for non-interactive use) |

```sh
index reset --yes && index build --phase PREPARE
```

The indexing pipeline runs in three phases:
- **AST Parse** — Extracts files, classes, functions, methods, signatures, docstrings, and line ranges using Python's `ast` module (for `.py` files) and `tree-sitter` grammars (for the other supported languages). Large nodes are split into chunks within a configurable token limit (cAST split-merge).
- **Dependency Map** — For each node, runs `ripgrep` to find all call sites and identifier references across the codebase, then resolves import statements to target nodes. Writes directed edges (`calls`, `imports`, `inherits`, `overrides`, `references`, `instantiates`) into the graph.
- **LLM Enrich** — Sends each node's signature, docstring, and immediate graph neighbours to a configurable LLM provider (Anthropic, OpenAI, OpenRouter, or LiteLLM). Receives back a `semantic_summary`, `domain_tags`, and `inferred_responsibility`. Only re-runs on nodes whose content hash has changed (hash-gated).
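A minimal sketch of what a nodes-and-edges store of this shape could look like, including a depth-limited graph walk. The SQL below is illustrative only; the tool's real schema is internal and versioned:

```python
import sqlite3

# Illustrative nodes/edges graph store (not the tool's actual schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (
    id INTEGER PRIMARY KEY,
    qualified_name TEXT UNIQUE,
    kind TEXT,             -- file | class | function | method
    content_hash TEXT,
    semantic_summary TEXT  -- filled in by Phase 3
);
CREATE TABLE edges (
    src INTEGER REFERENCES nodes(id),
    dst INTEGER REFERENCES nodes(id),
    kind TEXT              -- calls | imports | inherits | ...
);
""")

def callees(name, depth):
    """Walk outgoing 'calls' edges up to `depth` with a recursive CTE."""
    rows = conn.execute("""
        WITH RECURSIVE walk(id, lvl) AS (
            SELECT id, 0 FROM nodes WHERE qualified_name = ?
            UNION
            SELECT e.dst, w.lvl + 1 FROM edges e JOIN walk w ON e.src = w.id
            WHERE e.kind = 'calls' AND w.lvl < ?
        )
        SELECT qualified_name FROM nodes JOIN walk USING (id) WHERE lvl > 0
    """, (name, depth)).fetchall()
    return [r[0] for r in rows]
```

A recursive CTE like this is what makes `--depth N` cheap: the traversal stays inside SQLite rather than round-tripping per edge.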
The resulting SQLite database (`.codeindex/codeindex.db`) supports three query paths:
- Lexical — ripgrep identifier match with re-ranking
- Graph — SQLite edge traversal with configurable depth
- Semantic — FTS5 full-text search over enriched metadata
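The semantic path can be illustrated with a small FTS5 table over enrichment output. This is a sketch; the real table layout is internal:

```python
import sqlite3

# Sketch: FTS5 full-text search over LLM-generated enrichment metadata.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE enrichment USING fts5"
    "(qualified_name, semantic_summary, domain_tags)"
)
conn.executemany("INSERT INTO enrichment VALUES (?, ?, ?)", [
    ("AuthService.login", "Validates credentials and issues a session token", "auth security"),
    ("CartService.apply_discount", "Recomputes cart totals after a discount code", "cart pricing"),
])

def semantic_search(question, top_k=10):
    # FTS5's built-in `rank` orders results by BM25 relevance.
    rows = conn.execute(
        "SELECT qualified_name FROM enrichment WHERE enrichment MATCH ?"
        " ORDER BY rank LIMIT ?",
        (question, top_k),
    ).fetchall()
    return [r[0] for r in rows]
```

Because the searchable text is the LLM summary rather than raw source, a query like "credentials" can find `AuthService.login` even when that word never appears in the code.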
All progress and diagnostic output goes to stderr; only structured query results go to stdout.
| Language | Parser |
|---|---|
| Python | `ast` (stdlib) |
| Kotlin | `tree-sitter-kotlin` |
| TypeScript | `tree-sitter-typescript` |
| Java | `tree-sitter-java` |
| Ruby | `tree-sitter-ruby` |
| Variable | Required | Description |
|---|---|---|
| `ANTHROPIC_API_KEY` | For `enrich` (Anthropic) | Anthropic API key for LLM enrichment (default provider) |
| `OPENAI_API_KEY` | For `enrich` (OpenAI) | OpenAI API key |
| `OPENROUTER_API_KEY` | For `enrich` (OpenRouter) | OpenRouter API key |
| `LITELLM_API_KEY` | For `enrich` (LiteLLM) | LiteLLM API key (optional if `LITELLM_BASE_URL` is set) |
| `LITELLM_BASE_URL` | For `enrich` (LiteLLM) | LiteLLM proxy URL (default: `http://localhost:4000/v1`) |
| `CODEINDEX_DB` | No | Override default database path (`.codeindex/codeindex.db`) |
Database path resolution order: `--db` flag → `CODEINDEX_DB` env var → `.codeindex/codeindex.db` → exit 2.
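That resolution order can be sketched as follows (illustrative only; the real CLI's behaviour when the database is missing may differ from this exit-2 shortcut):

```python
import os
import sys

DEFAULT_DB = ".codeindex/codeindex.db"

# Sketch of the documented precedence: --db flag, then CODEINDEX_DB,
# then the default path; a missing database is treated as fatal (exit 2).
def resolve_db_path(cli_db=None, env=os.environ):
    path = cli_db or env.get("CODEINDEX_DB") or DEFAULT_DB
    if not os.path.exists(path):
        print(f"fatal: database not found at {path}", file=sys.stderr)
        sys.exit(2)
    return path
```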
| Code | Meaning |
|---|---|
| `0` | Success — all phases completed without warnings |
| `1` | Completed with warnings (e.g. parse errors, unenriched nodes) |
| `2` | Fatal error (e.g. ripgrep missing, DB locked, schema mismatch) |
**Broad queries return too few results from a specific module?** Lexical search ranks results by match density across the whole repo and returns 10 results (`--top-k 10`) by default. If you're looking for all nodes related to a common term, increase the limit:

```sh
index query "survey" --top-k 30
```

Prefer specific identifiers over broad terms. `index query "SurveyService"` is more precise than `index query "survey"` and will surface the exact class you need.
Use graph search to explore a node's neighbourhood. Once you find a node of interest, trace its callers and callees:

```sh
index query "SurveyService.createSurvey" --type graph --depth 3
```

Pipe structured output to other tools. Non-TTY output defaults to JSON, so you can chain with `jq`:

```sh
index query "CartService" --format jsonl | jq '.qualified_name'
```

To work on the indexer itself:

```sh
# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
python3 -m pytest tests/ -v
```

The code indexer's three-phase architecture is grounded in peer-reviewed research on repository-level code generation and token optimisation for LLM agents.
Traditional RAG pipelines split source code at fixed token counts, severing functions from their bodies and isolating return statements from surrounding logic. The cAST methodology (Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree, arXiv:2506.15655v1, EMNLP 2025) addresses this by parsing code into complete Abstract Syntax Trees and applying a recursive split-then-merge process — ensuring every chunk is a syntactically complete, semantically coherent unit. Empirical results: +4.3 points Recall@5 on RepoEval, +2.67 points Pass@1 on SWE-bench over fixed-size chunking baselines.
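A toy illustration of the split-then-merge idea on Python's `ast` module. This is greatly simplified relative to the paper's algorithm, and the whitespace token counter is a stand-in for a real tokenizer:

```python
import ast

def tokens(text):
    # Crude stand-in for a real tokenizer.
    return len(text.split())

def chunk(source, limit=512):
    """Toy cAST-style chunking: recursively split oversized AST nodes,
    then greedily merge adjacent units back up to the token budget."""
    def split(node):
        segment = ast.get_source_segment(source, node)
        if tokens(segment) <= limit or not getattr(node, "body", None):
            return [segment]  # fits the budget, or cannot be split further
        parts = []
        for child in node.body:  # recurse into the node's children
            parts.extend(split(child))
        return parts

    units = []
    for top_level in ast.parse(source).body:
        units.extend(split(top_level))
    if not units:
        return []
    merged = [units[0]]
    for unit in units[1:]:  # merge neighbours while they still fit
        if tokens(merged[-1]) + tokens(unit) <= limit:
            merged[-1] += "\n" + unit
        else:
            merged.append(unit)
    return merged
```

The key property the real algorithm preserves, and this toy only gestures at, is that every emitted chunk is a syntactically complete unit rather than an arbitrary window of tokens.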
Software logic depends on exact, deterministic identifiers. Semantic vector search struggles to locate custom entities like `auth_token_v2_middleware_factory`; lexical search finds them instantly with zero index overhead. GrepRAG (An Empirical Study and Optimization of Grep-Like Retrieval for Code Completion, ResearchGate/400340391) demonstrated a 7.04–15.58% relative improvement in exact code match over graph-based semantic baselines across CrossCodeEval and RepoEval-Updated.
When agents navigate repositories in response to natural language queries (bug reports, product requirements), a vocabulary mismatch blocks pure structural indexing. Hierarchical summarisation research (Repository-Level Code Understanding by LLMs via Hierarchical Summarization, ResearchGate/391739021) showed that LLM-generated summaries enable semantic navigation with Pass@10 of 0.89 on real-world Jira issue datasets. Critically, enrichment runs once at build time and is amortised across all subsequent queries — only changed nodes require re-enrichment.