lycha/code-indexer

Hybrid Code Indexing System

A Python CLI tool that builds a structured code index through three phases: deterministic AST parsing, ripgrep-based dependency mapping, and LLM semantic enrichment. The index is stored in a local SQLite database and supports lexical, graph, and semantic queries — giving agents and developers fast, structured access to codebase knowledge without injecting raw source into context windows.

Installation

Requires Python ≥ 3.11 and ripgrep on PATH.

pip install -e .

Quick Start

# 1. Initialise the database
index init

# 2. Parse source files and map dependencies
index build

# 3. Enrich nodes with LLM-generated semantic metadata
# Set one of: ANTHROPIC_API_KEY, OPENAI_API_KEY, OPENROUTER_API_KEY, or LITELLM_BASE_URL
export ANTHROPIC_API_KEY="sk-..."
index enrich                              # auto-detects provider from env
index enrich --provider openai            # or specify explicitly

# 4. Query the index
index query "validateCartState"

Tutorial

This walkthrough indexes a real project from scratch and shows how to use every major feature.

Step 1: Install and verify prerequisites

pip install -e .

# Install ripgrep and other external dependencies automatically
index install

Step 2: Index your project

Navigate to your project root and run the full pipeline:

cd /path/to/your/project

# Build the index (init is automatic)
index build

This creates a .codeindex/ directory containing the SQLite database. The build runs two phases: AST parsing extracts every file, class, function, and method, then ripgrep maps all call-site and import relationships between them.
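The AST phase can be sketched with the stdlib `ast` module. The `extract_nodes` helper and its record fields below are illustrative, not the indexer's actual API:

```python
import ast

def extract_nodes(source: str, path: str) -> list[dict]:
    """Collect file, class, and function nodes with line ranges (illustrative)."""
    tree = ast.parse(source)
    nodes = [{"kind": "file", "name": path, "lineno": 1}]
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            nodes.append({
                "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                "name": node.name,
                "lineno": node.lineno,          # start of the definition
                "end_lineno": node.end_lineno,  # inclusive end line
                "docstring": ast.get_docstring(node),
            })
    return nodes

nodes = extract_nodes("class A:\n    def f(self):\n        '''doc'''\n        return 1\n", "a.py")
```

The real Phase 1 additionally records signatures and chunks oversized nodes, but the core walk-and-record shape is the same.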

To exclude vendored or generated code:

index build --exclude "vendor/*" --exclude "generated/*"

Step 3: Check index health

index status

Example output:

Nodes:            142
Edges:            387
Unenriched:       142
Last build:       2026-03-25T10:15:00+00:00
Schema version:   3
DB path:          .codeindex/codeindex.db

The Unenriched: 142 line means no nodes have semantic metadata yet — that comes next.

Step 4: Enrich with LLM metadata (optional)

This step calls the Claude API to generate summaries, domain tags, and inferred responsibilities for each node. It requires an API key:

# Using Anthropic (default)
export ANTHROPIC_API_KEY="sk-..."
index enrich

# Using OpenAI
export OPENAI_API_KEY="sk-..."
index enrich --provider openai

# Using OpenRouter
export OPENROUTER_API_KEY="sk-..."
index enrich --provider openrouter --model anthropic/claude-sonnet-4-6

# Using LiteLLM proxy
export LITELLM_BASE_URL="http://localhost:4000/v1"
index enrich --provider litellm

# Preview what will be enriched (any provider)
index enrich --dry-run

The provider is auto-detected from environment variables when --provider is omitted (checks ANTHROPIC_API_KEY, OPENAI_API_KEY, OPENROUTER_API_KEY, then LITELLM_API_KEY/LITELLM_BASE_URL, in that order).

Enrichment is hash-gated: re-running index enrich after code changes only processes nodes whose content actually changed.
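A sketch of that gate, assuming a SHA-256 content hash (the actual hash function and row layout are assumptions):

```python
import hashlib

def content_hash(source: str) -> str:
    # Stable fingerprint of a node's source text.
    return hashlib.sha256(source.encode()).hexdigest()

def needs_enrichment(node_row: dict, current_source: str) -> bool:
    # A node is re-enriched only when its stored hash no longer
    # matches the hash of its current source.
    return node_row.get("content_hash") != content_hash(current_source)

row = {"content_hash": content_hash("def f(): return 1"), "enriched_at": "2026-03-25"}
```

Unchanged nodes short-circuit here, which is why repeat runs are cheap.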

Phase 3 Enrichment — Cost Model

First-run cost (one-time per repository): enriching a ~14,000-node codebase with Claude Sonnet costs approximately $42–67 depending on average node size. This is paid once when you first index a repository.

Incremental cost (every subsequent run): the indexer is hash-gated. Phase 1 clears enriched_at only on nodes whose content_hash changed, and Phase 3 then processes only those nodes. On a normally-evolving codebase where a sprint touches 1–2% of nodes, a rebuild enrichment run costs under $5, often under $1.

What drives cost up:

  • Large-scale refactors that invalidate many content_hash values in one go
  • Onboarding many repositories (each pays the first-run cost once)
  • Branch switches between long-lived divergent branches

If Phase 3 cost is a concern, run index enrich --dry-run first — it reports the number of unenriched nodes before making any API calls. You can also skip Phase 3 entirely; the structural index (Phase 1+2) still provides AST nodes and dependency graph context at zero LLM cost.
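The arithmetic behind those figures checks out directly; the per-node rate is derived from the $42–67 / ~14,000-node numbers above, everything else is multiplication:

```python
# Figures taken from the cost model above; the rest is arithmetic.
nodes_total = 14_000
first_run_cost = (42, 67)          # USD range for a full first enrichment
per_node = tuple(c / nodes_total for c in first_run_cost)  # roughly $0.003-0.005/node

changed_fraction = 0.02            # a sprint touching 2% of nodes
incremental = tuple(round(c * changed_fraction, 2) for c in first_run_cost)
# incremental cost lands well under the $5 bound quoted above
```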

Step 5: Query the index

Find a symbol by name (lexical search):

index query "UserService"

Explore a node's dependency graph:

index query "UserService.validate" --type graph --depth 3

Ask a natural-language question (semantic search — requires enrichment):

index query "where is authentication handled" --type semantic

Get machine-readable output for scripts or agents:

index query "CartService" --format json --with-source

The query router automatically picks the best strategy (lexical, graph, or semantic) when --type is omitted, and falls back to an alternative strategy if the first returns no results.
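That routing-plus-fallback behaviour can be sketched as below. The routing heuristics (identifier vs. dotted name vs. prose) are stand-ins; only the fall-through-on-empty-results behaviour is described above:

```python
import re

def route(query: str) -> str:
    # Heuristic stand-in for the router: dotted names go graph,
    # bare identifiers go lexical, prose goes semantic.
    if re.fullmatch(r"\w+\.\w+[\w.]*", query):
        return "graph"
    if re.fullmatch(r"\w+", query):
        return "lexical"
    return "semantic"

def run_query(query: str, strategies: dict) -> list:
    first = route(query)
    order = [first] + [s for s in ("lexical", "graph", "semantic") if s != first]
    for name in order:                 # cross-strategy fallback on empty results
        results = strategies[name](query)
        if results:
            return results
    return []
```

With stub strategies, a lexical miss falls through until some strategy returns hits.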

Step 6: Rebuild after code changes

index build

The build is incremental at the enrichment layer — only changed nodes need re-enrichment. To start completely fresh:

index reset --yes
index build

Typical workflow summary

index build                          # parse + map dependencies
index enrich                         # add semantic metadata (optional)
index query "MyClass"                # find symbols
index query "how does auth work"     # semantic search
index status                         # check health

Commands

index install

Install external dependencies required by the indexer. Currently installs ripgrep using the system package manager (Homebrew on macOS, apt/dnf/pacman on Linux, Chocolatey/Scoop on Windows). No-op if all dependencies are already present.

index install

index init

Create the .codeindex/ directory and initialise the database schema. No-op if the DB already exists and the schema version is current. Auto-invoked by index build if the DB does not yet exist.

| Option | Description |
| --- | --- |
| `--db PATH` | Path to the SQLite database file |
| `--no-gitignore-update` | Skip automatic `.gitignore` update |

index build

Run Phase 1 (AST parse) and Phase 2 (dependency mapping). Bootstraps the DB automatically if not yet initialised.

| Option | Description |
| --- | --- |
| `--db PATH` | Path to the SQLite database file |
| `--phase PREPARE\|DEPLOY` | Run only the specified phase |
| `--token-limit N` | Max tokens per cAST chunk (default: 512) |
| `--exclude PATTERN` | Glob patterns to exclude from parsing (repeatable) |
| `--no-gitignore-update` | Skip automatic `.gitignore` update |

index build --phase PREPARE --exclude "vendor/*"

index enrich

Run Phase 3 — LLM enrichment on unenriched nodes. Only re-enriches nodes whose content_hash has changed since the last run.

| Option | Description |
| --- | --- |
| `--db PATH` | Path to the SQLite database file |
| `--dry-run` | Show what would be enriched without making API calls |
| `--model MODEL` | Override the LLM model for enrichment |
| `--provider NAME` | LLM provider: anthropic, openai, openrouter, or litellm |

index enrich --dry-run
index enrich --provider openai --model gpt-4o

index query

Query the code index. The query router auto-selects a strategy (lexical, graph, or semantic) based on input, with cross-strategy fallback when results are empty.

| Option | Description |
| --- | --- |
| `--db PATH` | Path to the SQLite database file |
| `--type lexical\|graph\|semantic` | Query strategy (auto-selected when omitted) |
| `--format text\|json\|jsonl` | Output format |
| `--with-source` | Include raw source in results |
| `--top-k N` | Maximum number of results (default: 10) |
| `--depth N` | Graph traversal depth (default: 2) |

# Human-readable lexical lookup
index query "CartService" --type lexical --with-source

# Structured output for agent consumption
index query "cart loses items after discount" --type semantic --format jsonl

index status

Show index health: node count, edge count, unenriched nodes, last build time, and schema version.

| Option | Description |
| --- | --- |
| `--db PATH` | Path to the SQLite database file |

index reset

Drop and recreate all database tables.

| Option | Description |
| --- | --- |
| `--db PATH` | Path to the SQLite database file |
| `--yes, -y` | Skip confirmation prompt (required for non-interactive use) |

index reset --yes && index build --phase PREPARE

Architecture

The indexing pipeline runs in three phases:

  1. AST Parse — Extracts files, classes, functions, methods, signatures, docstrings, and line ranges using Python's ast module (for .py files) and tree-sitter (for Kotlin, TypeScript, Java, and Ruby). Large nodes are split into chunks within a configurable token limit (cAST split-merge).
  2. Dependency Map — For each node, runs ripgrep to find all call sites and identifier references across the codebase, then resolves import statements to target nodes. Writes directed edges (calls, imports, inherits, overrides, references, instantiates) into the graph.
  3. LLM Enrich — Sends each node's signature, docstring, and immediate graph neighbours to a configurable LLM provider (Anthropic, OpenAI, OpenRouter, or LiteLLM). Receives back a semantic_summary, domain_tags, and inferred_responsibility. Only re-runs on nodes whose content hash has changed (hash-gated).
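A minimal sketch of the kind of store this pipeline produces. Table and column names here are assumptions pieced together from the fields mentioned above (qualified_name, content_hash, the edge kinds), not the indexer's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nodes (
    id               INTEGER PRIMARY KEY,
    qualified_name   TEXT NOT NULL,
    kind             TEXT,   -- file / class / function / method
    content_hash     TEXT,
    semantic_summary TEXT,   -- filled in by Phase 3
    enriched_at      TEXT
);
CREATE TABLE edges (
    src  INTEGER REFERENCES nodes(id),
    dst  INTEGER REFERENCES nodes(id),
    kind TEXT                -- calls / imports / inherits / overrides / references / instantiates
);
""")
conn.execute("INSERT INTO nodes (id, qualified_name, kind) VALUES (1, 'cart.CartService.validate', 'method')")
conn.execute("INSERT INTO nodes (id, qualified_name, kind) VALUES (2, 'api.checkout', 'function')")
conn.execute("INSERT INTO edges VALUES (2, 1, 'calls')")

# Graph queries are plain SQL joins: who calls node 1?
callers = conn.execute(
    "SELECT n.qualified_name FROM edges e JOIN nodes n ON n.id = e.src "
    "WHERE e.dst = 1 AND e.kind = 'calls'"
).fetchall()
```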

The resulting SQLite database (.codeindex/codeindex.db) supports three query paths:

  • Lexical — ripgrep identifier match with re-ranking
  • Graph — SQLite edge traversal with configurable depth
  • Semantic — FTS5 full-text search over enriched metadata
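The semantic path can be illustrated with SQLite's FTS5 extension. The `node_fts` table and its columns are hypothetical; only "FTS5 over enriched metadata" comes from the text above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE node_fts USING fts5(qualified_name, semantic_summary, domain_tags)")
conn.execute(
    "INSERT INTO node_fts VALUES (?, ?, ?)",
    ("auth.middleware.verify_token",
     "Validates the bearer token on every request and rejects expired sessions",
     "auth security"),
)
# A natural-language query matches tokens in the LLM-written summary and tags.
hits = conn.execute(
    "SELECT qualified_name FROM node_fts WHERE node_fts MATCH ? ORDER BY rank",
    ("token AND auth",),
).fetchall()
```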

All progress and diagnostic output goes to stderr; only structured query results go to stdout.

Supported Languages

| Language | Parser |
| --- | --- |
| Python | `ast` (stdlib) |
| Kotlin | tree-sitter-kotlin |
| TypeScript | tree-sitter-typescript |
| Java | tree-sitter-java |
| Ruby | tree-sitter-ruby |

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| `ANTHROPIC_API_KEY` | For enrich (Anthropic) | Anthropic API key for LLM enrichment (default provider) |
| `OPENAI_API_KEY` | For enrich (OpenAI) | OpenAI API key |
| `OPENROUTER_API_KEY` | For enrich (OpenRouter) | OpenRouter API key |
| `LITELLM_API_KEY` | For enrich (LiteLLM) | LiteLLM API key (optional if `LITELLM_BASE_URL` is set) |
| `LITELLM_BASE_URL` | For enrich (LiteLLM) | LiteLLM proxy URL (default: `http://localhost:4000/v1`) |
| `CODEINDEX_DB` | No | Override default database path (`.codeindex/codeindex.db`) |

Database path resolution order: --db flag → CODEINDEX_DB env var → .codeindex/codeindex.db → exit 2.

Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | Success: all phases completed without warnings |
| 1 | Completed with warnings (e.g. parse errors, unenriched nodes) |
| 2 | Fatal error (e.g. ripgrep missing, DB locked, schema mismatch) |

Tips & Tricks

Broad queries return too few results from a specific module? Lexical search ranks results by match density across the whole repo and returns at most --top-k results (10 by default). If you're looking for all nodes related to a common term, increase the limit:

index query "survey" --top-k 30

Prefer specific identifiers over broad terms. index query "SurveyService" is more precise than index query "survey" and will surface the exact class you need.

Use graph search to explore a node's neighbourhood. Once you find a node of interest, trace its callers and callees:

index query "SurveyService.createSurvey" --type graph --depth 3

Pipe structured output to other tools. Non-TTY output defaults to JSON, so you can chain with jq:

index query "CartService" --format jsonl | jq '.qualified_name'

Development

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
python3 -m pytest tests/ -v

Research Foundation

The code indexer's three-phase architecture is grounded in peer-reviewed research on repository-level code generation and token optimisation for LLM agents.

Phase 1 — AST-based chunking (cAST)

Traditional RAG pipelines split source code at fixed token counts, severing functions from their bodies and isolating return statements from surrounding logic. The cAST methodology (Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree, arXiv:2506.15655v1, EMNLP 2025) addresses this by parsing code into complete Abstract Syntax Trees and applying a recursive split-then-merge process — ensuring every chunk is a syntactically complete, semantically coherent unit. Empirical results: +4.3 points Recall@5 on RepoEval, +2.67 points Pass@1 on SWE-bench over fixed-size chunking baselines.
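The split-then-merge idea can be sketched on a toy tree: a node that fits the budget becomes one chunk, an oversized node is split into its children, and adjacent small chunks are merged back together. Whitespace token counting and the `Node` class are stand-ins for a real tokenizer and AST:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    children: list = field(default_factory=list)

def tokens(text: str) -> int:
    return len(text.split())          # toy tokenizer: whitespace words

def chunk(n: Node, limit: int) -> list[str]:
    # Split phase: a node that fits is one chunk; otherwise recurse.
    if tokens(n.text) <= limit:
        return [n.text]
    pieces = [c for child in n.children for c in chunk(child, limit)]
    # Merge phase: greedily pack adjacent sibling chunks while they fit,
    # so no chunk ends mid-unit and small fragments do not proliferate.
    merged: list[str] = []
    for p in pieces:
        if merged and tokens(merged[-1] + " " + p) <= limit:
            merged[-1] = merged[-1] + " " + p
        else:
            merged.append(p)
    return merged

big = Node("a b c d e f g h", [Node("a b"), Node("c d"), Node("e f"), Node("g h")])
```

With a budget of 4 tokens, the four 2-token children merge pairwise into two chunks; with a budget of 8, the whole node stays intact as one chunk.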

Phase 2 — Index-free lexical retrieval (GrepRAG)

Software logic depends on exact, deterministic identifiers. Semantic vector search struggles to locate custom entities like auth_token_v2_middleware_factory; lexical search finds them instantly with zero index overhead. GrepRAG (An Empirical Study and Optimization of Grep-Like Retrieval for Code Completion, ResearchGate/400340391) demonstrated a 7.04–15.58% relative improvement in exact code match over graph-based semantic baselines across CrossCodeEval and RepoEval-Updated.

Phase 3 — LLM semantic enrichment

When agents navigate repositories in response to natural language queries (bug reports, product requirements), a vocabulary mismatch blocks pure structural indexing. Hierarchical summarisation research (Repository-Level Code Understanding by LLMs via Hierarchical Summarization, ResearchGate/391739021) showed that LLM-generated summaries enable semantic navigation with Pass@10 of 0.89 on real-world Jira issue datasets. Critically, enrichment runs once at build time and is amortised across all subsequent queries — only changed nodes require re-enrichment.

About

Stop feeding your AI the wrong files. This hybrid code indexer (AST + dependency graph + LLM semantic enrichment) gives coding agents surgical context — not guesswork. Less token waste. Fewer hallucinations. Built for AI-first engineering workflows.
