This project builds a local evidence database from text files and provides a CLI chat interface over that database.
The current shipping path is:
- Build the SQLite evidence database from a folder of source files.
- Query that database in retrieval-only mode with no LLM required.
- Optionally use hybrid retrieval when an embedding backend is available.
- Optionally synthesize a grounded answer from the retrieved evidence, with citations.
The code currently supports:
- ingestion from a broader set of local text and text-like files
- conservative text normalization
- structural segmentation into regions
- shaping those regions into
EvidenceUnitrecords - SQLite-backed lexical retrieval over
EvidenceUnits - hybrid lexical + semantic retrieval over the same
EvidenceUnits when embeddings are available - grounded answer synthesis over retrieved
EvidenceUnits when a local answer model is available
The project turns a folder of local notes and documents into a queryable evidence store.
Instead of indexing raw chunks directly, it builds provenance-preserving EvidenceUnits that carry:
- the source file path
- structural region metadata
- line and character ranges when available
- signals and context
- adjacency links to neighboring units
That lets the query layer retrieve structured evidence, expand nearby context, and produce answers that cite the source material.
Build the database:
python3 build_evidence_db.py /path/to/your/folderRun the chat interface in retrieval-only mode:
python3 chat.pyAdd hybrid retrieval when embeddings are available:
python3 chat.py --hybridAdd answer synthesis when a local answer model is available:
python3 chat.py --hybrid --answer --query "your question here"The project currently follows this data path:
local file -> ExtractedText -> StructuralRegion -> EvidenceUnit -> SQLite index -> retrieval -> grounded answer
graph TD
A[Local files]:::accent0 --> B[Ingest + normalize]:::accent1
B --> C[Segmentation contract]:::accent2
C --> D[Structural segmentation]:::accent3
D --> E[Structural regions]:::accent2
E --> F[Evidence shaping + type detection]:::accent4
F --> G[Evidence units]:::accent2
G --> H[SQLite evidence index]:::accent5
H --> I[Lexical retrieval]:::accent6
H -. optional .-> J[Hybrid retrieval]:::accent6
I --> K[Hits + neighbors]:::accent2
J --> K
K -. optional .-> L[Grounded answer synthesis]:::accent7
L --> M[Answer + citations]:::accent7
K --> N[Retrieval-only output]:::accent6
Implemented in mvp_pipeline/ingest.py and mvp_pipeline/normalize.py.
- The pipeline recursively discovers files for folder ingestion.
- Supported text and text-like files are processed.
- Files are read as UTF-8.
- Normalization is conservative:
- BOM removal
- newline normalization to LF
- Normalization produces an explicit segmentation contract with line-level source mapping.
Implemented in mvp_pipeline/segment.py.
Current structural segmentation supports:
- markdown ATX headings
- fenced code blocks
- paragraph-like blocks separated by blank lines
- plain text paragraph blocks
Segmentation is profile-based:
markdownplain_text
Implemented in mvp_pipeline/evidence.py.
Structural regions are turned into EvidenceUnits with:
- source file reference
- parent region reference
- line range
- char range when available
- previous/next adjacency
- signals
- confidence
The shaping code also includes type detection for several content patterns. Tests and code paths show units such as:
proseheading_sectioncodecommandsqljson_querytablediagrammixed
Implemented in canonical_data_model.py.
This file defines the canonical storage-agnostic object model used by the pipeline, including:
SourceDocumentExtractedTextStructuralRegionEvidenceUnitDerivedArtifact
Implemented in query_retriever.py.
SQLite is the canonical store for indexed EvidenceUnits.
The retriever:
- stores
EvidenceUnits inevidence_units - creates indexes on
source_fileandtype - uses SQLite FTS5 when available
- falls back to
LIKE-based scoring if FTS5 is unavailable - supports neighbor expansion through
prev_unit_idandnext_unit_id
Implemented in hybrid_retriever.py.
Hybrid retrieval:
- keeps SQLite as the source of truth
- adds an auxiliary semantic index keyed by
unit_id - embeds the same
EvidenceUnits already stored in SQLite - merges lexical and semantic hits into a single ranked list
- hydrates final results back into canonical
EvidenceUnitobjects from SQLite
If semantic retrieval fails, chat.py falls back to lexical retrieval.
Implemented in grounded_answer_client.py.
Grounded answers:
- use retrieved
EvidenceUnits as the grounding context - call a local OpenAI-compatible LiteLLM proxy
- model is configurable via env vars (defaults to
llama3:8b) - return:
- an answer
- source citations
- concise support metadata
If answer synthesis fails, chat.py falls back to showing retrieval results.
-
build_evidence_db.pyBuilds the SQLite evidence database from a folder. -
chat.pyQueries the evidence database in lexical or hybrid mode and can synthesize grounded answers.
-
canonical_data_model.pyCanonical object model and validation layer. -
query_retriever.pyStep 12 SQLite-backed indexing and lexical retrieval. -
hybrid_retriever.pyStep 14 hybrid lexical + semantic retrieval. -
grounded_answer_client.pyStep 13 grounded answer synthesis over retrieved evidence. -
mvp_pipeline/Core ingestion and evidence-building pipeline.
-
ingest.pyLocal file ingestion across supported text-like formats. -
normalize.pyConservative normalization and segmentation handoff contract. -
segment.pyStructural region segmentation. -
evidence.pyEvidence-unit shaping and classification. -
pipeline.pyEnd-to-end orchestration and SQLite write-through during ingestion. -
derived_artifacts.pyBuilds derived indexes such as command, query, and link indexes. -
export_markdown.pyMarkdown audit export helpers.
-
docs/Design and architecture notes. These are documentation artifacts, not the runtime system. -
tests/unittest-based tests for the pipeline, derived artifacts, and export layer. -
examples/Small example scripts for inspecting pipeline stages.
This tool is designed to run locally over your files, but you are responsible for what you ingest and where you send data.
- Do not commit your index: the SQLite DB (default
evidence_units.db) can contain sensitive or proprietary text from the files you ingested. - Model boundary:
--hybridand--answersend text to an OpenAI-compatible endpoint (the code assumes a local LiteLLM proxy athttp://localhost:4000). Treat that endpoint as a trust boundary. - Secrets & PII: avoid ingesting secrets (API keys, tokens) and PII unless you have an explicit need; consider adding a redaction step if you plan to use this on broad folders.
The core retrieval-only workflow is Python standard library only. Optional features (PDF extraction, hybrid semantic retrieval, grounded answers / rewrites / reranking) may require additional dependencies and/or a local OpenAI-compatible proxy.
Optional capabilities:
- PDF extraction works only if you have either
pypdf/PyPDF2installed or apdftotextbinary available. - Hybrid retrieval / grounded answers require a local OpenAI-compatible proxy (e.g. LiteLLM).
The main scripts are intended to be run with python3.
Dependencies:
- Retrieval-only mode: stdlib-only (no
pip installrequired). - Optional features: install optional deps via:
python3 -m pip install -r requirements.txtThe project works in retrieval-only mode without any model stack running.
This path requires:
- a built SQLite evidence database
- Python 3
It does not require:
- embeddings
- a local answer model
- a running LiteLLM proxy
Hybrid retrieval and grounded answer synthesis require a local OpenAI-compatible LiteLLM proxy.
The code currently expects:
- API base:
http://localhost:4000 - chat completions endpoint:
/v1/chat/completions
- embeddings endpoint:
/v1/embeddings
The code currently uses:
- answer model: configurable via
LLA_ANSWER_MODEL(defaults tollama3:8b) - rewrite model: configurable via
LLA_REWRITE_MODEL(defaults tollama3:8b) - rerank model: configurable via
LLA_RERANK_MODEL(defaults tollama3:8b) - embedding model default:
text-embedding-ada-002(configurable viaSTEP14_EMBED_MODEL)
Relevant environment variables used by the code:
-
LITELLM_PROXY_API_KEYAPI key used for your local OpenAI-compatible proxy (LiteLLM). Defaults tolocal-dev-keyif unset. -
STEP14_EMBED_MODELOverrides the embedding model used byhybrid_retriever.py.
If you do not have a working local proxy:
- lexical retrieval still works
- hybrid retrieval falls back to lexical retrieval
- answer mode falls back to retrieval-only output
The current ingestion path supports these file types directly:
- plain text and markup:
.txt.md.markdown.rst.log
- office and document formats:
.docx.pdf
- data, config, and query files:
.sql.json.yaml.yml.toml.env.ini.conf.properties.xml.html.css.scss
- code and shell files:
.py.js.ts.tsx.jsx.java.kt.go.rs.sh.bash.zsh.c.cpp.h.hpp
Current extraction behavior from code:
- text-like files are decoded as UTF-8
- DOCX files are extracted in-process from
word/document.xml - PDF files require an available extraction backend
- the code tries
pypdf - then
PyPDF2 - then a local
pdftotextbinary - if none are available, the PDF file is skipped and the rest of the corpus continues indexing
- the code tries
Main command:
python3 build_evidence_db.py <input_folder>Optional custom database path:
python3 build_evidence_db.py <input_folder> --db-path /path/to/evidence_units.dbbuild_evidence_db.py calls run_folder_pipeline(...) from mvp_pipeline/pipeline.py.
That pipeline:
- recursively scans the input folder
- processes supported text and text-like files
- skips unsupported files
- builds
EvidenceUnits for each supported file - writes all indexed units into SQLite using replace mode
The script prints:
- input folder
- database path
- files processed
- evidence units indexed
Main command:
python3 chat.pyThis starts an interactive REPL.
Startup output includes:
- database path
- indexed evidence unit count
- current mode
Interactive commands supported by the code:
:help:stats:quitexitquit
python3 chat.py --query "how do I build evidence units from a file"python3 chat.py --query "docker"python3 chat.py --hybrid --query "docker"python3 chat.py --hybrid --answer --query "docker"python3 chat.py --hybrid --answer --query "docker" --jsonFrom the current code:
-
input_folderRequired positional argument. Folder containing source files to ingest. -
--db-path DB_PATHSQLite database path to build. Defaults toevidence_units.db.
From the current code:
-
--db-path DB_PATHSQLite database path for the evidence index. -
--query QUERYOne-shot query. If omitted, interactive mode starts. -
--top-k TOP_KMaximum number of hits to return. Default:5. -
--neighbors NEIGHBORSNumber of previous/next neighbors to expand per hit. Default:1. -
--jsonPrint machine-readable JSON instead of formatted text. -
--answerSynthesize a grounded answer from retrievedEvidenceUnits. -
--hybridUse hybrid lexical + semantic retrieval before answer synthesis.
Based on the actual code, the intended shipping workflow is:
- Put supported local text and text-like files into a folder.
- Build the SQLite database with
build_evidence_db.py. - Query that database with
chat.pyin retrieval-only mode. - Add
--hybridif embeddings are available. - Add
--answerif a local answer model is available.
Lexical or hybrid retrieval returns structured evidence hits.
Text output includes:
- query
- hits
- score
- unit id
- type
- source file
- region metadata
- signals
- original text
- optional neighbors
Answer mode prints:
AnswerSources- optional
Support
Sources are formatted as:
- filename + line range when line numbers exist
- otherwise a fallback region summary from the indexed region metadata
These files exist in the codebase, but they are not part of the current shipping CLI path:
-
export_folder_markdown_audit.pyExports a corpus-level markdown audit view from pipeline output. -
mvp_pipeline/export_markdown.pyMarkdown export helpers used by the audit export flow. -
query_rewrite.pyExperimental extension for rewrite-aware retrieval. It can generate query variants (heuristic + optional LLM rewrites), run retrieval for each variant, then merge/rerank the combined hit list. Not wired intochat.pyby default; you’d integrate it by callingretrieve_with_rewrites(...)(orQueryRewriteClient) in the query path before printing results. -
candidate_reranker.pyExperimental extension for second-stage LLM reranking. Given the top-N retrieved candidates, it asks a local OpenAI-compatible endpoint to return a best-first ordering ofunit_ids. Not wired intochat.pyby default; you’d integrate it after retrieval (lexical or hybrid) and before neighbor expansion / answer synthesis. -
examples/Development and inspection scripts for intermediate pipeline stages.
chat.py prints startup status from the active database path. If Indexed EvidenceUnits: 0 appears, build the DB first:
python3 build_evidence_db.py <folder>Also make sure chat.py is reading the same --db-path you built.
This is expected if semantic retrieval is unavailable.
From the code:
chat.pycatches hybrid retrieval failures- it prints a fallback status message
- it reruns lexical retrieval instead
Common causes based on the code path:
- local LiteLLM proxy not running at
http://localhost:4000 - embeddings endpoint unavailable
- invalid API key for the local proxy
This is also expected behavior in the current code.
If grounded answer synthesis fails:
chat.pyprints a status message to stderr- retrieval results are shown instead of an answer
This is expected when no PDF extraction backend is available.
From the current code, PDF extraction is optional and tries:
pypdfPyPDF2pdftotext
If none of those are available, the PDF file is skipped and indexing continues for the rest of the folder.
If the DB is populated but a query returns nothing, chat.py prints:
Status: no grounded evidence found for that query.
Try:
- broader search terms
- lexical mode first, to inspect the raw retrieved evidence
- hybrid mode if the local embedding path is available
Unsupported files are skipped during folder ingestion. Supported files that cannot be read are also skipped, and indexing continues for the rest of the folder.
The ingestion code decodes files as UTF-8. Files that cannot be decoded as UTF-8 are skipped or fail ingestion.
These are clear from the current code:
- Ingestion for text-like files is UTF-8 only.
- PDF support depends on an optional extraction backend and may not be available on every machine.
- DOCX extraction is text-only and paragraph-oriented.
- Hybrid retrieval depends on a local embeddings endpoint and model.
- Grounded answer synthesis depends on a local chat-completions endpoint and model.
- The answer layer requires JSON-shaped model output internally and may fall back to retrieval-only output when synthesis fails.
- Experimental rewrite/rerank modules exist in the repository but are not part of the shipping CLI.
- The semantic index is auxiliary; SQLite remains the canonical store of record.
Run unit tests:
PYTHONPATH=. python3 -m unittest discover -s tests -p "test_*.py"Run Ruff lint + formatting checks:
python3 -m pip install ruff
ruff format --check .
ruff check .The repository contains unittest test files under tests/, including:
tests/test_mvp_pipeline.pytests/test_derived_artifacts.pytests/test_export_markdown.pytests/test_canonical_data_model.py
These tests cover major pieces of the pipeline and supporting utilities, but the README does not assume any specific test runner command beyond the presence of these unittest modules.