lycha/code-indexer

Hybrid Code Indexing System

A Python CLI tool that builds a structured code index through three phases: deterministic AST parsing, ripgrep-based dependency mapping, and LLM semantic enrichment. The index is stored in a local SQLite database and supports lexical, graph, and semantic queries — giving agents and developers fast, structured access to codebase knowledge without injecting raw source into context windows.

Installation

Requires Python ≥ 3.11 and ripgrep on PATH.

pip install -e .

Quick Start

# 1. Initialise the database
index init

# 2. Parse source files and map dependencies
index build

# 3. Enrich nodes with LLM-generated semantic metadata
# Set one of: ANTHROPIC_API_KEY, OPENAI_API_KEY, OPENROUTER_API_KEY, or LITELLM_BASE_URL
export ANTHROPIC_API_KEY="sk-..."
index enrich                              # auto-detects provider from env
index enrich --provider openai            # or specify explicitly

# 4. Query the index
index query "validateCartState"

Tutorial

This walkthrough indexes a real project from scratch and shows how to use every major feature.

Step 1: Install and verify prerequisites

pip install -e .

# Install ripgrep and other external dependencies automatically
index install

Step 2: Index your project

Navigate to your project root and run the full pipeline:

cd /path/to/your/project

# Build the index (init is automatic)
index build

This creates a .codeindex/ directory containing the SQLite database. The build runs two phases: AST parsing extracts every file, class, function, and method, then ripgrep maps all call-site and import relationships between them.
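The AST phase can be sketched with the stdlib `ast` module. The `extract_nodes` helper and its record fields below are illustrative, not the indexer's actual API:

```python
import ast

def extract_nodes(source: str, path: str) -> list[dict]:
    """Collect file, class, and function nodes with line ranges (illustrative)."""
    tree = ast.parse(source)
    nodes = [{"kind": "file", "name": path, "lineno": 1}]
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            nodes.append({
                "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                "name": node.name,
                "lineno": node.lineno,          # start of the definition
                "end_lineno": node.end_lineno,  # inclusive end line
                "docstring": ast.get_docstring(node),
            })
    return nodes

nodes = extract_nodes("class A:\n    def f(self):\n        '''doc'''\n        return 1\n", "a.py")
```

The real Phase 1 additionally records signatures and chunks oversized nodes, but the core walk-and-record shape is the same.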

To exclude vendored or generated code:

index build --exclude "vendor/*" --exclude "generated/*"

Step 3: Check index health

index status

Example output:

Nodes:            142
Edges:            387
Unenriched:       142
Last build:       2026-03-25T10:15:00+00:00
Schema version:   3
DB path:          .codeindex/codeindex.db

The Unenriched: 142 line means no nodes have semantic metadata yet — that comes next.

Step 4: Enrich with LLM metadata (optional)

This step calls the Claude API to generate summaries, domain tags, and inferred responsibilities for each node. It requires an API key:

# Using Anthropic (default)
export ANTHROPIC_API_KEY="sk-..."
index enrich

# Using OpenAI
export OPENAI_API_KEY="sk-..."
index enrich --provider openai

# Using OpenRouter
export OPENROUTER_API_KEY="sk-..."
index enrich --provider openrouter --model anthropic/claude-sonnet-4-6

# Using LiteLLM proxy
export LITELLM_BASE_URL="http://localhost:4000/v1"
index enrich --provider litellm

# Preview what will be enriched (any provider)
index enrich --dry-run

The provider is auto-detected from environment variables when --provider is omitted (checks ANTHROPIC_API_KEY, OPENAI_API_KEY, OPENROUTER_API_KEY, then LITELLM_API_KEY/LITELLM_BASE_URL, in that order).

Enrichment is hash-gated: re-running index enrich after code changes only processes nodes whose content actually changed.
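A sketch of that gate, assuming a SHA-256 content hash (the actual hash function and row layout are assumptions):

```python
import hashlib

def content_hash(source: str) -> str:
    # Stable fingerprint of a node's source text.
    return hashlib.sha256(source.encode()).hexdigest()

def needs_enrichment(node_row: dict, current_source: str) -> bool:
    # A node is re-enriched only when its stored hash no longer
    # matches the hash of its current source.
    return node_row.get("content_hash") != content_hash(current_source)

row = {"content_hash": content_hash("def f(): return 1"), "enriched_at": "2026-03-25"}
```

Unchanged nodes short-circuit here, which is why repeat runs are cheap.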

Phase 3 Enrichment — Cost Model

First-run cost (one-time per repository): enriching a ~14,000-node codebase with Claude Sonnet costs approximately $42–67 depending on average node size. This is paid once when you first index a repository.

Incremental cost (every subsequent run): the indexer is hash-gated. Phase 1 clears enriched_at only on nodes whose content_hash changed, and Phase 3 then processes only those nodes. On a normally-evolving codebase where a sprint touches 1–2% of nodes, a rebuild enrichment run costs under $5, often under $1.

What drives cost up:

  • Large-scale refactors that invalidate many content_hash values in one go
  • Onboarding many repositories (each pays the first-run cost once)
  • Branch switches between long-lived divergent branches

If Phase 3 cost is a concern, run index enrich --dry-run first — it reports the number of unenriched nodes before making any API calls. You can also skip Phase 3 entirely; the structural index (Phase 1+2) still provides AST nodes and dependency graph context at zero LLM cost.
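The arithmetic behind those figures checks out directly; the per-node rate is derived from the $42–67 / ~14,000-node numbers above, everything else is multiplication:

```python
# Figures taken from the cost model above; the rest is arithmetic.
nodes_total = 14_000
first_run_cost = (42, 67)          # USD range for a full first enrichment
per_node = tuple(c / nodes_total for c in first_run_cost)  # roughly $0.003-0.005/node

changed_fraction = 0.02            # a sprint touching 2% of nodes
incremental = tuple(round(c * changed_fraction, 2) for c in first_run_cost)
# incremental cost lands well under the $5 bound quoted above
```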

Step 5: Query the index

Find a symbol by name (lexical search):

index query "UserService"

Explore a node's dependency graph:

index query "UserService.validate" --type graph --depth 3

Ask a natural-language question (semantic search — requires enrichment):

index query "where is authentication handled" --type semantic

Get machine-readable output for scripts or agents:

index query "CartService" --format json --with-source

The query router automatically picks the best strategy (lexical, graph, or semantic) when --type is omitted, and falls back to an alternative strategy if the first returns no results.
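That routing-plus-fallback behaviour can be sketched as below. The routing heuristics (identifier vs. dotted name vs. prose) are stand-ins; only the fall-through-on-empty-results behaviour is described above:

```python
import re

def route(query: str) -> str:
    # Heuristic stand-in for the router: dotted names go graph,
    # bare identifiers go lexical, prose goes semantic.
    if re.fullmatch(r"\w+\.\w+[\w.]*", query):
        return "graph"
    if re.fullmatch(r"\w+", query):
        return "lexical"
    return "semantic"

def run_query(query: str, strategies: dict) -> list:
    first = route(query)
    order = [first] + [s for s in ("lexical", "graph", "semantic") if s != first]
    for name in order:                 # cross-strategy fallback on empty results
        results = strategies[name](query)
        if results:
            return results
    return []
```

With stub strategies, a lexical miss falls through until some strategy returns hits.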

Step 6: Rebuild after code changes

index build

The build is incremental at the enrichment layer — only changed nodes need re-enrichment. To start completely fresh:

index reset --yes
index build

Typical workflow summary

index build                          # parse + map dependencies
index enrich                         # add semantic metadata (optional)
index query "MyClass"                # find symbols
index query "how does auth work"     # semantic search
index status                         # check health

Commands

index install

Install external dependencies required by the indexer. Currently installs ripgrep using the system package manager (Homebrew on macOS, apt/dnf/pacman on Linux, Chocolatey/Scoop on Windows). No-op if all dependencies are already present.

index install

index init

Create the .codeindex/ directory and initialise the database schema. No-op if the DB already exists and the schema version is current. Auto-invoked by index build if the DB does not yet exist.

| Option | Description |
| --- | --- |
| `--db PATH` | Path to the SQLite database file |
| `--no-gitignore-update` | Skip automatic `.gitignore` update |

index build

Run Phase 1 (AST parse) and Phase 2 (dependency mapping). Bootstraps the DB automatically if not yet initialised.

| Option | Description |
| --- | --- |
| `--db PATH` | Path to the SQLite database file |
| `--phase PREPARE\|DEPLOY` | Run only the specified phase |
| `--token-limit N` | Max tokens per cAST chunk (default: 512) |
| `--exclude PATTERN` | Glob patterns to exclude from parsing (repeatable) |
| `--no-gitignore-update` | Skip automatic `.gitignore` update |

index build --phase PREPARE --exclude "vendor/*"

index enrich

Run Phase 3 — LLM enrichment on unenriched nodes. Only re-enriches nodes whose content_hash has changed since the last run.

| Option | Description |
| --- | --- |
| `--db PATH` | Path to the SQLite database file |
| `--dry-run` | Show what would be enriched without making API calls |
| `--model MODEL` | Override the LLM model for enrichment |
| `--provider NAME` | LLM provider: anthropic, openai, openrouter, or litellm |

index enrich --dry-run
index enrich --provider openai --model gpt-4o

index query

Query the code index. The query router auto-selects a strategy (lexical, graph, or semantic) based on input, with cross-strategy fallback when results are empty.

| Option | Description |
| --- | --- |
| `--db PATH` | Path to the SQLite database file |
| `--type lexical\|graph\|semantic` | Query strategy (auto-selected when omitted) |
| `--format text\|json\|jsonl` | Output format |
| `--with-source` | Include raw source in results |
| `--top-k N` | Maximum number of results (default: 10) |
| `--depth N` | Graph traversal depth (default: 2) |

# Human-readable lexical lookup
index query "CartService" --type lexical --with-source

# Structured output for agent consumption
index query "cart loses items after discount" --type semantic --format jsonl

index status

Show index health: node count, edge count, unenriched nodes, last build time, and schema version.

| Option | Description |
| --- | --- |
| `--db PATH` | Path to the SQLite database file |

index reset

Drop and recreate all database tables.

| Option | Description |
| --- | --- |
| `--db PATH` | Path to the SQLite database file |
| `--yes, -y` | Skip confirmation prompt (required for non-interactive use) |

index reset --yes && index build --phase PREPARE

Architecture

The indexing pipeline runs in three phases:

  1. AST Parse — Extracts files, classes, functions, methods, signatures, docstrings, and line ranges using Python's ast module (for .py files) and tree-sitter (for Kotlin, TypeScript, Java, and Ruby). Large nodes are split into chunks within a configurable token limit (cAST split-merge).
  2. Dependency Map — For each node, runs ripgrep to find all call sites and identifier references across the codebase, then resolves import statements to target nodes. Writes directed edges (calls, imports, inherits, overrides, references, instantiates) into the graph.
  3. LLM Enrich — Sends each node's signature, docstring, and immediate graph neighbours to a configurable LLM provider (Anthropic, OpenAI, OpenRouter, or LiteLLM). Receives back a semantic_summary, domain_tags, and inferred_responsibility. Only re-runs on nodes whose content hash has changed (hash-gated).
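A minimal sketch of the kind of store this pipeline produces. Table and column names here are assumptions pieced together from the fields mentioned above (qualified_name, content_hash, the edge kinds), not the indexer's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (
    id               INTEGER PRIMARY KEY,
    qualified_name   TEXT NOT NULL,
    kind             TEXT,   -- file / class / function / method
    content_hash     TEXT,
    semantic_summary TEXT,   -- filled in by Phase 3
    enriched_at      TEXT
);
CREATE TABLE edges (
    src  INTEGER REFERENCES nodes(id),
    dst  INTEGER REFERENCES nodes(id),
    kind TEXT                -- calls / imports / inherits / overrides / references / instantiates
);
""")
conn.execute("INSERT INTO nodes (id, qualified_name, kind) VALUES (1, 'cart.CartService.validate', 'method')")
conn.execute("INSERT INTO nodes (id, qualified_name, kind) VALUES (2, 'api.checkout', 'function')")
conn.execute("INSERT INTO edges VALUES (2, 1, 'calls')")

# Graph queries are plain SQL joins: who calls node 1?
callers = conn.execute(
    "SELECT n.qualified_name FROM edges e JOIN nodes n ON n.id = e.src "
    "WHERE e.dst = 1 AND e.kind = 'calls'"
).fetchall()
```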

The resulting SQLite database (.codeindex/codeindex.db) supports three query paths:

  • Lexical — ripgrep identifier match with re-ranking
  • Graph — SQLite edge traversal with configurable depth
  • Semantic — FTS5 full-text search over enriched metadata
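The semantic path can be illustrated with SQLite's FTS5 extension. The `node_fts` table and its columns are hypothetical; only "FTS5 over enriched metadata" comes from the text above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE node_fts USING fts5(qualified_name, semantic_summary, domain_tags)")
conn.execute(
    "INSERT INTO node_fts VALUES (?, ?, ?)",
    ("auth.middleware.verify_token",
     "Validates the bearer token on every request and rejects expired sessions",
     "auth security"),
)
# A natural-language query matches tokens in the LLM-written summary and tags.
hits = conn.execute(
    "SELECT qualified_name FROM node_fts WHERE node_fts MATCH ? ORDER BY rank",
    ("token AND auth",),
).fetchall()
```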

All progress and diagnostic output goes to stderr; only structured query results go to stdout.

Supported Languages

| Language | Parser |
| --- | --- |
| Python | `ast` (stdlib) |
| Kotlin | tree-sitter-kotlin |
| TypeScript | tree-sitter-typescript |
| Java | tree-sitter-java |
| Ruby | tree-sitter-ruby |

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| `ANTHROPIC_API_KEY` | For enrich (Anthropic) | Anthropic API key for LLM enrichment (default provider) |
| `OPENAI_API_KEY` | For enrich (OpenAI) | OpenAI API key |
| `OPENROUTER_API_KEY` | For enrich (OpenRouter) | OpenRouter API key |
| `LITELLM_API_KEY` | For enrich (LiteLLM) | LiteLLM API key (optional if `LITELLM_BASE_URL` is set) |
| `LITELLM_BASE_URL` | For enrich (LiteLLM) | LiteLLM proxy URL (default: `http://localhost:4000/v1`) |
| `CODEINDEX_DB` | No | Override default database path (`.codeindex/codeindex.db`) |

Database path resolution order: --db flag → CODEINDEX_DB env var → .codeindex/codeindex.db → exit 2.

Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | Success: all phases completed without warnings |
| 1 | Completed with warnings (e.g. parse errors, unenriched nodes) |
| 2 | Fatal error (e.g. ripgrep missing, DB locked, schema mismatch) |

Tips & Tricks

Broad queries return too few results from a specific module? Lexical search ranks results by match density across the whole repo and returns at most --top-k results (10 by default). If you're looking for all nodes related to a common term, increase the limit:

index query "survey" --top-k 30

Prefer specific identifiers over broad terms. index query "SurveyService" is more precise than index query "survey" and will surface the exact class you need.

Use graph search to explore a node's neighbourhood. Once you find a node of interest, trace its callers and callees:

index query "SurveyService.createSurvey" --type graph --depth 3

Pipe structured output to other tools. Non-TTY output defaults to JSON, so you can chain with jq:

index query "CartService" --format jsonl | jq '.qualified_name'

Development

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
python3 -m pytest tests/ -v

Research Foundation

The code indexer's three-phase architecture is grounded in peer-reviewed research on repository-level code generation and token optimisation for LLM agents.

Phase 1 — AST-based chunking (cAST)

Traditional RAG pipelines split source code at fixed token counts, severing functions from their bodies and isolating return statements from surrounding logic. The cAST methodology (Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree, arXiv:2506.15655v1, EMNLP 2025) addresses this by parsing code into complete Abstract Syntax Trees and applying a recursive split-then-merge process — ensuring every chunk is a syntactically complete, semantically coherent unit. Empirical results: +4.3 points Recall@5 on RepoEval, +2.67 points Pass@1 on SWE-bench over fixed-size chunking baselines.
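The split-then-merge idea can be sketched on a toy tree: a node that fits the budget becomes one chunk, an oversized node is split into its children, and adjacent small chunks are merged back together. Whitespace token counting and the `Node` class are stand-ins for a real tokenizer and AST:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    children: list = field(default_factory=list)

def tokens(text: str) -> int:
    return len(text.split())          # toy tokenizer: whitespace words

def chunk(n: Node, limit: int) -> list[str]:
    # Split phase: a node that fits is one chunk; otherwise recurse.
    if tokens(n.text) <= limit:
        return [n.text]
    pieces = [c for child in n.children for c in chunk(child, limit)]
    # Merge phase: greedily pack adjacent sibling chunks while they fit,
    # so no chunk ends mid-unit and small fragments do not proliferate.
    merged: list[str] = []
    for p in pieces:
        if merged and tokens(merged[-1] + " " + p) <= limit:
            merged[-1] = merged[-1] + " " + p
        else:
            merged.append(p)
    return merged

big = Node("a b c d e f g h", [Node("a b"), Node("c d"), Node("e f"), Node("g h")])
```

With a budget of 4 tokens, the four 2-token children merge pairwise into two chunks; with a budget of 8, the whole node stays intact as one chunk.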

Phase 2 — Index-free lexical retrieval (GrepRAG)

Software logic depends on exact, deterministic identifiers. Semantic vector search struggles to locate custom entities like auth_token_v2_middleware_factory; lexical search finds them instantly with zero index overhead. GrepRAG (An Empirical Study and Optimization of Grep-Like Retrieval for Code Completion, ResearchGate/400340391) demonstrated a 7.04–15.58% relative improvement in exact code match over graph-based semantic baselines across CrossCodeEval and RepoEval-Updated.

Phase 3 — LLM semantic enrichment

When agents navigate repositories in response to natural language queries (bug reports, product requirements), a vocabulary mismatch blocks pure structural indexing. Hierarchical summarisation research (Repository-Level Code Understanding by LLMs via Hierarchical Summarization, ResearchGate/391739021) showed that LLM-generated summaries enable semantic navigation with Pass@10 of 0.89 on real-world Jira issue datasets. Critically, enrichment runs once at build time and is amortised across all subsequent queries — only changed nodes require re-enrichment.

About

Stop feeding your AI the wrong files. This hybrid code indexer (AST + dependency graph + LLM semantic enrichment) gives coding agents surgical context — not guesswork. Less token waste. Fewer hallucinations. Built for AI-first engineering workflows.
