doc-watch-rag

Keep your Retrieval-Augmented Generation (RAG) index hot while you work. Point the watcher at a folder, drop PDFs/DOCX/TXT/MD files, and the vector store refreshes automatically—no restarts, no manual re-indexing.

What this does

Watches directories for add/modify/delete and reacts in near real time
Extracts text from PDF, DOCX, TXT, and Markdown
Cleans and chunks text with overlap for better retrieval quality
Persists embeddings to Chroma so the index survives restarts
Uses SentenceTransformers locally (no remote embedding calls)
Emits observability events, counters, gauges, and span timings

Prerequisites

Python 3.10+
Enough disk space to persist Chroma under rag_index/
Network access on first run, to download the SentenceTransformers model (one-time)

Install

python -m venv .venv
source .venv/bin/activate      # macOS/Linux
# .venv\Scripts\activate       # Windows (cmd or PowerShell)
pip install -r requirements.txt

Quick demo

python run_demo.py

What happens:

The watcher ensures ./sop_docs exists and starts monitoring it.
Drop supported files into ./sop_docs; you will see ingest events.
Type questions into the REPL; retrieved context and sources are printed.

Use in your own code

from doc_watch_rag import RagStore, DocWatcher, Observability

obs = Observability()

store = RagStore(
    persist_dir="./rag_index",
    collection_name="sop_knowledge",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",
    observability=obs,
)

watcher = DocWatcher(
    watch_paths=["./sop_docs"],
    rag_store=store,
    observability=obs,
    # max_chars=1400, overlap=150, include_ext={".pdf", ".docx", ".txt", ".md"}
)
watcher.start_in_background()

# Later, answer a question
ctx, sources = store.build_context("How do I deploy?", top_k=5)
print(ctx)
print(sources)
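
The retrieved context is usually passed on to an LLM. The following is a minimal sketch of that step, assuming ctx is a plain string and sources is a list of strings (the README prints both but does not state their types). make_prompt is a hypothetical helper, not part of doc_watch_rag.

```python
def make_prompt(question: str, ctx: str, sources: list[str]) -> str:
    """Combine retrieved context with the user's question for an LLM call."""
    source_lines = "\n".join(f"- {s}" for s in sources) or "- (none)"
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{ctx}\n\n"
        f"Sources:\n{source_lines}\n\n"
        f"Question: {question}"
    )


# Stand-in values; in real use these would come from store.build_context(...)
ctx = "Deploy with `make release` after tests pass."
sources = ["sop_docs/deploy.md"]
prompt = make_prompt("How do I deploy?", ctx, sources)
print(prompt)
```

Listing the sources in the prompt makes it easier to ask the model to cite the file it drew an answer from.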

How it works (modules)

doc_watch_rag/watcher.py — async watcher using watchfiles.awatch; debounces churn, ingests on create/modify, deletes on remove.
doc_watch_rag/extractors.py — pulls text from PDF (PyPDF2), DOCX (python-docx), TXT/MD (direct read).
doc_watch_rag/chunker.py — cleans text (whitespace, hyphenated line breaks) and chunks with configurable size/overlap.
doc_watch_rag/rag_store.py — wraps Chroma PersistentClient and SentenceTransformers; supports upsert/delete by document and retrieval with scores.
doc_watch_rag/observability.py — event bus plus counters/gauges and span timers; never raises to callers.
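
To illustrate what "debounces churn" can mean in practice, here is a minimal sketch that filters a burst of file events by extension and reduces it to one action per path. It is an assumption about the general technique, not the code in watcher.py; the coalesce function and the string labels "added", "modified", and "deleted" are hypothetical stand-ins for the change types a file watcher reports.

```python
from pathlib import Path

INCLUDE_EXT = {".pdf", ".docx", ".txt", ".md"}


def coalesce(events, include_ext=INCLUDE_EXT):
    """Reduce a burst of (change, path) events to one action per path.

    change is "added", "modified", or "deleted" (illustrative labels).
    Returns {path: "ingest" | "delete"}; the last event for a path wins.
    """
    actions = {}
    for change, path in events:
        if Path(path).suffix.lower() not in include_ext:
            continue  # ignore unsupported file types
        actions[path] = "delete" if change == "deleted" else "ingest"
    return actions


burst = [
    ("added", "sop_docs/deploy.md"),
    ("modified", "sop_docs/deploy.md"),  # editor saved twice
    ("modified", "sop_docs/deploy.md"),
    ("added", "sop_docs/notes.tmp"),     # unsupported extension
    ("deleted", "sop_docs/old.PDF"),
]
print(coalesce(burst))
```

Collapsing the burst means a file saved several times in quick succession is re-embedded once rather than once per save.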

Configuration knobs

Watch paths: pass one or many to DocWatcher(watch_paths=[...]).
File types: override include_ext to allow more or fewer extensions.
Chunking: tune max_chars and overlap to fit your model’s context window.
Persistence: set persist_dir and collection_name on RagStore.
Embeddings: swap embedding_model for a different SentenceTransformers model.
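
To get a feel for how max_chars and overlap interact, the sketch below cleans a text and splits it into overlapping character windows. It is an illustration of the general technique, not the code in chunker.py; clean_text and chunk_text are hypothetical names, and the values 1400 and 150 are taken from the commented hint in the usage example (whether they are the library defaults is an assumption).

```python
import re


def clean_text(text: str) -> str:
    """Rejoin words hyphenated across line breaks and collapse whitespace."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    return re.sub(r"\s+", " ", text).strip()


def chunk_text(text: str, max_chars: int = 1400, overlap: int = 150) -> list[str]:
    """Split text into windows of at most max_chars sharing overlap chars."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be >= 0 and smaller than max_chars")
    step = max_chars - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
    return chunks


doc = clean_text("Deploy-\nment   steps:\n" + "x" * 3000)
large = chunk_text(doc)                             # 1400-char windows
small = chunk_text(doc, max_chars=500, overlap=50)  # finer-grained windows
print(len(doc), len(large), len(small))
```

Smaller chunks give more precise matches but more vectors to store and search; a larger overlap reduces the chance that a sentence is split across a chunk boundary with no chunk containing it whole.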

Repository layout

run_demo.py — interactive REPL wiring watcher + store + observability
doc_watch_rag/ — library code (watcher, store, chunking, extraction, observability)
sop_docs/ — drop your source documents here (auto-created)
rag_index/ — Chroma persistence directory (auto-created)
requirements.txt — Python dependencies

Observability

Subscribe to events via Observability.subscribe(fn); events include watcher lifecycle, ingest/delete, spans, and RAG operations.
obs.snapshot() returns counters, gauges, last event, and recent events—handy for health endpoints or debug logging.
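
The subscribe/snapshot pattern and the "never raises to callers" guarantee can be sketched as follows. MiniObservability is an illustrative stand-in written for this README, not the library's Observability class; the emit method and the snapshot keys are assumptions, and gauges and span timers are left out for brevity.

```python
from collections import Counter, deque


class MiniObservability:
    """Illustrative stand-in, not the library's Observability class."""

    def __init__(self, keep: int = 50):
        self._subscribers = []
        self._counters = Counter()
        self._recent = deque(maxlen=keep)  # bounded history of events

    def subscribe(self, fn):
        self._subscribers.append(fn)

    def emit(self, name: str, **fields):
        event = {"name": name, **fields}
        self._counters[name] += 1
        self._recent.append(event)
        for fn in self._subscribers:
            try:
                fn(event)
            except Exception:
                # A faulty subscriber must not break the caller
                self._counters["subscriber_errors"] += 1

    def snapshot(self):
        return {
            "counters": dict(self._counters),
            "last_event": self._recent[-1] if self._recent else None,
            "recent_events": list(self._recent),
        }


obs = MiniObservability()
seen = []
obs.subscribe(seen.append)
obs.subscribe(lambda event: 1 / 0)  # deliberately broken subscriber
obs.emit("ingest", path="sop_docs/deploy.md", chunks=3)
print(obs.snapshot()["counters"])
```

Swallowing subscriber errors keeps a broken logger or metrics exporter from interrupting ingestion, at the cost of having to watch an error counter instead of a traceback.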

Troubleshooting

No results returned: ensure documents are in sop_docs/, use supported extensions, and check that ingest events appear. On first run, wait for the embedding model download to finish.
Model download blocked: pre-download or vendor the SentenceTransformers model; point to it via embedding_model.
Index missing after restart: confirm persist_dir is stable and writable; default is ./rag_index.
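
A relative persist_dir such as ./rag_index resolves against the current working directory, so starting the process from a different folder points at a different (possibly empty) index. The sketch below is one way to check this at startup; check_persist_dir is a hypothetical helper, not part of doc_watch_rag.

```python
import os
import tempfile
from pathlib import Path


def check_persist_dir(persist_dir: str) -> Path:
    """Return the absolute persist path after confirming it is writable."""
    path = Path(persist_dir).resolve()
    path.mkdir(parents=True, exist_ok=True)
    # Create and remove a probe file; raises OSError if the directory
    # cannot be written to
    fd, probe = tempfile.mkstemp(dir=path)
    os.close(fd)
    os.remove(probe)
    return path


resolved = check_persist_dir("./rag_index")
print(resolved)  # log this at startup to confirm which index is in use
```

Passing the resolved absolute path to RagStore(persist_dir=...) removes the dependence on where the process was launched.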

License

See LICENSE.

doc-watch-rag

doc-watch-rag is an async Python module that automatically watches document folders and keeps a RAG (Retrieval-Augmented Generation) index hot and up to date.

Start your chatbot once.
Drop new documents anytime.
Your RAG system always queries the latest knowledge.

Why this exists

Many RAG systems degrade in production because documents change but indexes do not.

SOPs are updated frequently
New PDFs and Word files arrive continuously
Chatbots silently answer from stale knowledge

doc-watch-rag solves this by acting as an always-on ingestion layer that keeps your vector index synchronized with your document folders.

What this package does

Watches folders for document changes (add / modify / delete)
Automatically ingests documents into a persistent vector index
Supports retrieval over embeddings with vector search
Runs asynchronously in the background
Provides observability via events, counters, and timings
Keeps chatbot / API / agent code unchanged

What this package is NOT

❌ A chatbot UI
❌ An LLM framework
❌ A vector database
❌ A one-off PDF Q&A script

It is infrastructure for RAG systems.

Installation

pip install git+https://github.com/RAK0152/doc-watch-rag.git
