
RAK0152/doc-watch-rag


doc-watch-rag

Keep your Retrieval-Augmented Generation (RAG) index hot while you work. Point the watcher at a folder, drop PDFs/DOCX/TXT/MD files, and the vector store refreshes automatically—no restarts, no manual re-indexing.

What this does

  • Watches directories for add/modify/delete and reacts in near real time
  • Extracts text from PDF, DOCX, TXT, and Markdown
  • Cleans and chunks text with overlap for better retrieval quality
  • Persists embeddings to Chroma so the index survives restarts
  • Uses SentenceTransformers locally (no remote embedding calls)
  • Emits observability events, counters, gauges, and span timings

Prerequisites

  • Python 3.10+
  • Enough disk space to persist Chroma under rag_index/
  • Network access on first run to download the SentenceTransformers model

Install

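python -m venv .venv
. .venv/Scripts/activate      # Windows (Git Bash)
# .venv\Scripts\activate      # Windows (cmd / PowerShell)
# source .venv/bin/activate   # macOS / Linux
pip install -r requirements.txt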

Quick demo

python run_demo.py

What happens:

  • The watcher ensures ./sop_docs exists and starts monitoring it.
  • Drop supported files into ./sop_docs; you will see ingest events.
  • Type questions into the REPL; retrieved context and sources are printed.

Use in your own code

from doc_watch_rag import RagStore, DocWatcher, Observability

obs = Observability()

store = RagStore(
    persist_dir="./rag_index",
    collection_name="sop_knowledge",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    observability=obs,
)

watcher = DocWatcher(
    watch_paths=["./sop_docs"],
    rag_store=store,
    observability=obs,
    # max_chars=1400, overlap=150, include_ext={".pdf", ".docx", ".txt", ".md"}
)
watcher.start_in_background()

# Later, answer a question
ctx, sources = store.build_context("How do I deploy?", top_k=5)
print(ctx)
print(sources)
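
To hand the retrieved context to a chat model, something like the following could work. This is a sketch only: it assumes `ctx` is a plain string and `sources` is a list of file paths, which this README does not specify, and `build_prompt` is a hypothetical helper, not part of doc_watch_rag.

```python
def build_prompt(question: str, ctx: str, sources: list[str]) -> str:
    """Assemble a grounded prompt from retrieved context (hypothetical helper)."""
    if not ctx.strip():
        # Nothing was retrieved: say so instead of letting the model guess.
        return f"No indexed context was found.\n\nQuestion: {question}"
    # dict.fromkeys drops duplicate sources while keeping first-seen order.
    source_list = "\n".join(f"- {s}" for s in dict.fromkeys(sources))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{ctx}\n\n"
        f"Sources:\n{source_list}\n\n"
        f"Question: {question}"
    )
```

Because the watcher updates the index in the background, the same call picks up newly ingested documents without any change to this code.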

How it works (modules)

  • doc_watch_rag/watcher.py — async watcher using watchfiles.awatch; debounces churn, ingests on create/modify, deletes on remove.
  • doc_watch_rag/extractors.py — pulls text from PDF (PyPDF2), DOCX (python-docx), TXT/MD (direct read).
  • doc_watch_rag/chunker.py — cleans text (whitespace, hyphenated line breaks) and chunks with configurable size/overlap.
  • doc_watch_rag/rag_store.py — wraps Chroma PersistentClient and SentenceTransformers; supports upsert/delete by document and retrieval with scores.
  • doc_watch_rag/observability.py — event bus plus counters/gauges and span timers; never raises to callers.
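
The cleaning and chunking step can be sketched as below. This is an illustration of the general technique, not the code in chunker.py: the function names are hypothetical, the window is a plain character-based slide, and the default values are copied from the commented parameters in the usage example.

```python
import re


def clean_text(text: str) -> str:
    """Rejoin words hyphenated across line breaks, then collapse whitespace."""
    # "deploy-\nment" becomes "deployment"
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    return re.sub(r"\s+", " ", text).strip()


def chunk_text(text: str, max_chars: int = 1400, overlap: int = 150) -> list[str]:
    """Split text into windows of at most max_chars that share `overlap` characters."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")
    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

Note that the naive hyphen rule also joins genuine compounds that happen to break at a line end ("well-\nknown" becomes "wellknown"); a real cleaner may need a dictionary check or a more conservative pattern.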

Configuration knobs

  • Watch paths: pass one or many to DocWatcher(watch_paths=[...]).
  • File types: override include_ext to allow more or fewer extensions.
  • Chunking: tune max_chars and overlap to fit your model’s context window.
  • Persistence: set persist_dir and collection_name on RagStore.
  • Embeddings: swap embedding_model for a different SentenceTransformers model.
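
When tuning max_chars and overlap, a little arithmetic helps. The sketch below assumes a simple sliding-window chunker that advances by `max_chars - overlap` characters per chunk; the library's actual behavior may differ, and both function names are hypothetical.

```python
import math


def estimate_chunks(doc_chars: int, max_chars: int = 1400, overlap: int = 150) -> int:
    """Estimate how many sliding-window chunks a document of doc_chars produces."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")
    if doc_chars <= 0:
        return 0
    if doc_chars <= max_chars:
        return 1
    step = max_chars - overlap  # characters of new text per additional chunk
    return 1 + math.ceil((doc_chars - max_chars) / step)


def context_budget_chars(top_k: int = 5, max_chars: int = 1400) -> int:
    """Upper bound on characters of retrieved context sent with one query."""
    return top_k * max_chars
```

With the values shown in the usage example (top_k=5, max_chars=1400), a query carries at most 7000 characters of context. A common rule of thumb is roughly 4 characters per token for English text, but the ratio varies by tokenizer, so measure with your model's tokenizer before relying on it.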

Repository layout

  • run_demo.py — interactive REPL wiring watcher + store + observability
  • doc_watch_rag/ — library code (watcher, store, chunking, extraction, observability)
  • sop_docs/ — drop your source documents here (auto-created)
  • rag_index/ — Chroma persistence directory (auto-created)
  • requirements.txt — Python dependencies

Observability

  • Subscribe to events via Observability.subscribe(fn); events include watcher lifecycle, ingest/delete, spans, and RAG operations.
  • obs.snapshot() returns counters, gauges, last event, and recent events—handy for health endpoints or debug logging.
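
The subscribe / snapshot pattern, including the "never raises to callers" guarantee, can be pictured with a small stand-in. `MiniObservability` is a hypothetical class written for this illustration; the event shape (a dict with a `name` key) and the `subscriber_errors` counter are assumptions, not the library's actual format.

```python
from collections import Counter, deque


class MiniObservability:
    """Illustrative event bus; not the library's Observability class."""

    def __init__(self, keep_last=50):
        self._subscribers = []
        self._counters = Counter()
        self._recent = deque(maxlen=keep_last)  # bounded history of events

    def subscribe(self, fn):
        self._subscribers.append(fn)

    def emit(self, name, **fields):
        event = {"name": name, **fields}
        self._counters[name] += 1
        self._recent.append(event)
        for fn in self._subscribers:
            try:
                fn(event)
            except Exception:
                # A broken subscriber must not take down ingestion.
                self._counters["subscriber_errors"] += 1

    def snapshot(self):
        return {
            "counters": dict(self._counters),
            "last_event": self._recent[-1] if self._recent else None,
            "recent_events": list(self._recent),
        }
```

Swallowing subscriber exceptions and counting them keeps a faulty logger or metrics hook from interrupting the watcher, while still leaving a trace in the snapshot.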

Troubleshooting

  • No results returned: ensure documents are in sop_docs/, use supported extensions, and check that ingest events appear. On first run, wait for the embedding model download to finish before querying.
  • Model download blocked: pre-download or vendor the SentenceTransformers model; point to it via embedding_model.
  • Index missing after restart: confirm persist_dir is stable and writable; default is ./rag_index.
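
To check the persistence directory before starting, a small probe along these lines may help. `is_writable_dir` is a hypothetical helper, not part of doc_watch_rag; the default path mirrors the ./rag_index default mentioned above.

```python
import tempfile
from pathlib import Path


def is_writable_dir(persist_dir: str = "./rag_index") -> bool:
    """Return True if persist_dir exists (or can be created) and accepts writes."""
    path = Path(persist_dir).resolve()
    try:
        path.mkdir(parents=True, exist_ok=True)
        # Creating and removing a temp file is a direct test of write access.
        with tempfile.TemporaryFile(dir=path):
            pass
        return True
    except OSError:
        return False
```

A relative path such as ./rag_index resolves against the current working directory, so launching the process from a different directory silently creates a fresh, empty index. Passing an absolute persist_dir avoids that.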

License

See LICENSE.

doc-watch-rag

doc-watch-rag is an async Python module that automatically watches document folders and keeps a RAG (Retrieval-Augmented Generation) index hot and up to date.

Start your chatbot once.
Drop new documents anytime.
Your RAG system always queries the latest knowledge.


Why this exists

Many RAG systems degrade in production because documents change but indexes don’t.

  • SOPs are updated frequently
  • New PDFs and Word files arrive continuously
  • Chatbots silently answer from stale knowledge

doc-watch-rag solves this by acting as an always-on ingestion layer that keeps your vector index synchronized with your document folders.


What this package does

  • Watches folders for document changes (add / modify / delete)
  • Automatically ingests documents into a persistent vector index
  • Supports true RAG using embeddings + vector search
  • Runs asynchronously in the background
  • Provides observability via events, counters, and timings
  • Keeps chatbot / API / agent code unchanged
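
The add / modify / delete handling can be pictured as a small planning step that turns a batch of file events into one action per file. This is a sketch under assumptions: `Change` is a local stand-in enum rather than the watcher library's type, `plan_actions` is a hypothetical function, and events are assumed to arrive as an ordered list.

```python
from enum import Enum
from pathlib import Path


class Change(Enum):
    """Local stand-in for the change kinds a file watcher reports."""
    ADDED = "added"
    MODIFIED = "modified"
    DELETED = "deleted"


SUPPORTED = {".pdf", ".docx", ".txt", ".md"}


def plan_actions(events, include_ext=SUPPORTED):
    """Collapse a batch of (Change, path) events into one action per file."""
    actions = {}
    for change, path in events:  # later events for the same path win
        if Path(path).suffix.lower() not in include_ext:
            continue  # skip editor temp files and other unsupported types
        actions[path] = "delete" if change is Change.DELETED else "ingest"
    return actions
```

Collapsing the batch this way means a file saved several times in quick succession is embedded once, and a file created and then removed never reaches the index.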

What this package is NOT

  • ❌ A chatbot UI
  • ❌ An LLM framework
  • ❌ A vector database
  • ❌ A one-off PDF Q&A script

It is infrastructure for RAG systems.


Installation

pip install git+https://github.com/RAK0152/doc-watch-rag.git
