Keep your Retrieval-Augmented Generation (RAG) index hot while you work. Point the watcher at a folder, drop PDFs/DOCX/TXT/MD files, and the vector store refreshes automatically—no restarts, no manual re-indexing.
- Watches directories for add/modify/delete and reacts in near real time
- Extracts text from PDF, DOCX, TXT, and Markdown
- Cleans and chunks text with overlap for better retrieval quality
- Persists embeddings to Chroma so the index survives restarts
- Uses SentenceTransformers locally (no remote embedding calls)
- Emits observability events, counters, gauges, and span timings
- Python 3.10+
- Enough disk space to persist Chroma under rag_index/
- One-time network access to download the SentenceTransformers model the first time it runs
python -m venv .venv
. .venv/Scripts/activate # Windows
pip install -r requirements.txt
python run_demo.py

What happens:
- The watcher ensures ./sop_docs exists and starts monitoring it.
- Drop supported files into ./sop_docs; you will see ingest events.
- Type questions into the REPL; retrieved context and sources are printed.
from doc_watch_rag import RagStore, DocWatcher, Observability
obs = Observability()
store = RagStore(
    persist_dir="./rag_index",
    collection_name="sop_knowledge",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    observability=obs,
)
watcher = DocWatcher(
    watch_paths=["./sop_docs"],
    rag_store=store,
    observability=obs,
    # max_chars=1400, overlap=150, include_ext={".pdf", ".docx", ".txt", ".md"}
)
watcher.start_in_background()
# Later, answer a question
ctx, sources = store.build_context("How do I deploy?", top_k=5)
print(ctx)
print(sources)

- doc_watch_rag/watcher.py: async watcher using watchfiles.awatch; debounces churn, ingests on create/modify, deletes on remove.
- doc_watch_rag/extractors.py: pulls text from PDF (PyPDF2), DOCX (python-docx), TXT/MD (direct read).
- doc_watch_rag/chunker.py: cleans text (whitespace, hyphenated line breaks) and chunks with configurable size/overlap.
- doc_watch_rag/rag_store.py: wraps Chroma PersistentClient and SentenceTransformers; supports upsert/delete by document and retrieval with scores.
- doc_watch_rag/observability.py: event bus plus counters/gauges and span timers; never raises to callers.
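
The watcher behavior described above (debounce churn, ingest on create/modify, delete on remove) can be pictured as collapsing a burst of file events into one action per path. The following is a minimal sketch of that idea, not the code in watcher.py: the Change enum, the coalesce function, and the "ingest"/"delete" action names are hypothetical and exist only for illustration.

```python
from enum import Enum


class Change(Enum):
    # Hypothetical event kinds, named after the add/modify/delete cases above.
    ADDED = "added"
    MODIFIED = "modified"
    DELETED = "deleted"


def coalesce(events, include_ext=(".pdf", ".docx", ".txt", ".md")):
    """Collapse a burst of (Change, path) events into one action per path.

    The last event seen for a path wins: a final DELETED maps to "delete",
    anything else maps to "ingest". Unsupported extensions are skipped.
    """
    actions = {}
    for change, path in events:
        if not path.lower().endswith(tuple(include_ext)):
            continue
        actions[path] = "delete" if change is Change.DELETED else "ingest"
    return actions


burst = [
    (Change.ADDED, "sop_docs/deploy.md"),
    (Change.MODIFIED, "sop_docs/deploy.md"),  # editor saved twice
    (Change.MODIFIED, "sop_docs/old.pdf"),
    (Change.DELETED, "sop_docs/old.pdf"),     # then removed
    (Change.ADDED, "sop_docs/notes.tmp"),     # unsupported extension
]
print(coalesce(burst))
```

Collapsing events this way means a file saved several times in quick succession is re-embedded once rather than once per save.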
- Watch paths: pass one or many to DocWatcher(watch_paths=[...]).
- File types: override include_ext to allow more or fewer extensions.
- Chunking: tune max_chars and overlap to fit your model’s context window.
- Persistence: set persist_dir and collection_name on RagStore.
- Embeddings: swap embedding_model for a different SentenceTransformers model.
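
To get a feel for how max_chars and overlap interact, here is a minimal sliding-window sketch. It assumes plain character windows plus the cleanup steps listed for chunker.py (whitespace and hyphenated line breaks); the names clean_text and chunk_text are hypothetical, and the library's actual chunker may split differently.

```python
import re


def clean_text(text):
    """Rejoin words hyphenated across line breaks and collapse whitespace."""
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    return re.sub(r"\s+", " ", text).strip()


def chunk_text(text, max_chars=1400, overlap=150):
    """Split text into windows of at most max_chars that share `overlap` characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    step = max_chars - overlap  # each new chunk starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = clean_text("Deploy-\nment   steps:\n1. Build\n2. Ship")
chunks = chunk_text(sample, max_chars=20, overlap=5)
for chunk in chunks:
    print(repr(chunk))
```

A larger overlap repeats more text between neighboring chunks, which helps retrieval when an answer straddles a boundary but increases the number of embeddings stored.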
- run_demo.py: interactive REPL wiring watcher + store + observability
- doc_watch_rag/: library code (watcher, store, chunking, extraction, observability)
- sop_docs/: drop your source documents here (auto-created)
- rag_index/: Chroma persistence directory (auto-created)
- requirements.txt: Python dependencies
- Subscribe to events via Observability.subscribe(fn); events include watcher lifecycle, ingest/delete, spans, and RAG operations.
- obs.snapshot() returns counters, gauges, last event, and recent events, which is handy for health endpoints or debug logging.
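
The subscribe/snapshot pattern, including the rule that observability never raises to callers, can be sketched with a small stand-in class. MiniObservability below is hypothetical and is not the library's Observability class; the emit method, the event dictionary shape, and the subscriber_errors counter are assumptions made for illustration.

```python
from collections import Counter, deque


class MiniObservability:
    """Hypothetical stand-in illustrating the subscribe/snapshot pattern."""

    def __init__(self, history=50):
        self._subscribers = []
        self._counters = Counter()
        self._recent = deque(maxlen=history)  # bounded list of recent events

    def subscribe(self, fn):
        self._subscribers.append(fn)

    def emit(self, name, **fields):
        event = {"name": name, **fields}
        self._counters[name] += 1
        self._recent.append(event)
        for fn in self._subscribers:
            try:
                fn(event)
            except Exception:
                # A failing subscriber must not break the caller.
                self._counters["subscriber_errors"] += 1

    def snapshot(self):
        return {
            "counters": dict(self._counters),
            "last_event": self._recent[-1] if self._recent else None,
            "recent_events": list(self._recent),
        }


seen = []
bus = MiniObservability()
bus.subscribe(seen.append)
bus.subscribe(lambda event: 1 / 0)  # a buggy subscriber
bus.emit("ingest", path="sop_docs/deploy.md", chunks=3)
print(bus.snapshot()["counters"])
```

Swallowing subscriber exceptions keeps a broken logger or metrics hook from interrupting ingestion, at the cost of needing a counter to notice that a subscriber is failing.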
- No results returned: ensure documents are in sop_docs/, supported extensions are used, and ingest events appear. Wait for the first embed/model download to finish.
- Model download blocked: pre-download or vendor the SentenceTransformers model; point to it via embedding_model.
- Index missing after restart: confirm persist_dir is stable and writable; default is ./rag_index.
See LICENSE.
doc-watch-rag is an async Python module that automatically watches document folders and keeps a RAG (Retrieval-Augmented Generation) index hot and up to date.
Start your chatbot once.
Drop new documents anytime.
Your RAG system always queries the latest knowledge.
RAG systems often go stale in production because documents change but indexes don’t:
- SOPs are updated frequently
- New PDFs and Word files arrive continuously
- Chatbots silently answer from stale knowledge
doc-watch-rag solves this by acting as an always-on ingestion layer that keeps your vector index synchronized with your document folders.
- Watches folders for document changes (add / modify / delete)
- Automatically ingests documents into a persistent vector index
- Supports true RAG using embeddings + vector search
- Runs asynchronously in the background
- Provides observability via events, counters, and timings
- Keeps chatbot / API / agent code unchanged
- ❌ A chatbot UI
- ❌ An LLM framework
- ❌ A vector database
- ❌ A one-off PDF Q&A script
It is infrastructure for RAG systems.
pip install git+https://github.com/RAK0152/doc-watch-rag.git