DocBuddy

DocBuddy is a full-stack RAG (Retrieval-Augmented Generation) application designed to mirror the Google NotebookLM experience. It allows users to upload documents and have grounded, citation-rich conversations with their data.

✨ Key Features

Advanced RAG Pipeline: Hybrid retrieval, parent-document context expansion, optional cross-encoder reranking, self-correcting retrieval (CRAG), and conversational follow-ups (see below).
Self-Correcting Retrieval (CRAG): Before answering, the system grades whether the retrieved context is actually relevant and re-retrieves with an alternative query when it isn't — reducing answers based on irrelevant chunks.
Document Summaries & Smart Fallback: Every upload is auto-summarized. Ask for an overview of a document (or of all of them) and DocBuddy answers from the summaries, naming the relevant document; ask something the documents don't cover and it says so, then points you to what they do cover instead of guessing.
Interactive Citations: Hover over AI-generated citations (e.g., [1], p. 3) to see the exact text snippet retrieved from your document.
Multi-Format Support: Seamlessly parses and indexes PDF, TXT, and CSV files.
Conversational Memory: Follow-up questions ("what about its limits?") are resolved against the chat history, so retrieval understands references and pronouns.
Live Activity Panel: A toggleable session feed (streamed over WebSocket) where every upload and query is a collapsible event showing per-step timings — parse/split/embed for ingestion; rewrite → retrieval → rerank → CRAG → generation for queries — including steps that were skipped. A backend restart is detected automatically and starts a fresh session.
Premium UI/UX: A responsive, dark-mode dual-pane interface with auto-focusing chat and floating toast notifications.
Persistent by Default: Uploaded documents survive restarts. Optionally wipe everything on boot with RESET_ON_STARTUP=true for a clean slate.
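
As an illustration of the Live Activity Panel described above, the per-step timings could be modeled roughly as follows. This is a minimal sketch, not DocBuddy's actual event schema: the names `ActivityEvent`, `StepRecord`, and `runStep` are hypothetical, and steps are kept synchronous for brevity.

```typescript
// Hypothetical event shape; the real WebSocket payload may differ.
type StepStatus = "done" | "skipped";

interface StepRecord {
  name: string;
  status: StepStatus;
  durationMs: number;
}

interface ActivityEvent {
  kind: "upload" | "query";
  steps: StepRecord[];
}

// Runs one pipeline step and records how long it took.
// Passing null records the step as skipped, so it still shows up in the feed.
function runStep<T>(event: ActivityEvent, name: string, fn: (() => T) | null): T | undefined {
  if (fn === null) {
    event.steps.push({ name, status: "skipped", durationMs: 0 });
    return undefined;
  }
  const start = Date.now();
  const result = fn();
  event.steps.push({ name, status: "done", durationMs: Date.now() - start });
  return result;
}
```

An event built this way can be serialized with `JSON.stringify(event)` before being pushed to the client.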

🧠 The RAG Pipeline

1. Ingestion & Parent/Child Chunking

DocBuddy uses parent-document splitting to balance retrieval precision with answer context:

Parent windows: 2,000 characters / 200 overlap — the large, context-rich passages fed to the LLM.
Child slices: 400 characters / 80 overlap — small, focused units that are embedded and indexed.
Strategy: Recursive character splitting keeps paragraphs and sentences together. Small children sharpen retrieval precision; the surrounding parent gives the LLM enough context to answer.
Auto-summary: After indexing, a single LLM call summarizes each document; the summary is stored and later used as a "document overview" at answer time.
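
The parent/child relationship above can be sketched as follows. This is a simplified illustration, assuming fixed-size windows with overlap; it does not reproduce the recursive character splitting that keeps paragraphs and sentences together. The names `Chunk`, `splitWithOverlap`, and `buildParentChildChunks` are hypothetical.

```typescript
interface Chunk {
  id: string;
  text: string;
  parentId?: string; // set on child slices only
}

// Fixed-size sliding window. Unlike recursive splitting, this ignores
// paragraph and sentence boundaries.
function splitWithOverlap(text: string, size: number, overlap: number): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const pieces: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    pieces.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this window reached the end
  }
  return pieces;
}

// Parents: 2,000 chars / 200 overlap. Children: 400 chars / 80 overlap.
// Each child remembers its parent so a match can be expanded later.
function buildParentChildChunks(text: string): { parents: Chunk[]; children: Chunk[] } {
  const parents: Chunk[] = [];
  const children: Chunk[] = [];
  splitWithOverlap(text, 2000, 200).forEach((parentText, p) => {
    const parentId = `parent-${p}`;
    parents.push({ id: parentId, text: parentText });
    splitWithOverlap(parentText, 400, 80).forEach((childText, c) => {
      children.push({ id: `${parentId}-child-${c}`, text: childText, parentId });
    });
  });
  return { parents, children };
}
```

Only the child slices would be embedded and indexed; the `parentId` link is what lets retrieval hand the larger window to the LLM.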

2. Embedding & Storage

Embeddings: sentence-transformers/all-MiniLM-L6-v2 (via Hugging Face Inference API), 384-dimensional.
Vector Store: Qdrant Cloud with named vectors — a dense semantic vector (Cosine) plus a sparse BM25 keyword vector (Qdrant computes IDF server-side).
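
For orientation, a Qdrant collection with one dense and one sparse named vector might be declared with a request body shaped like the one below. This is a sketch, not the project's actual configuration: the vector names `dense` and `bm25` are assumptions, since the README does not state them.

```typescript
// Body for Qdrant's "create collection" request.
// The names "dense" and "bm25" are hypothetical placeholders.
const collectionConfig = {
  vectors: {
    // 384 dimensions matches all-MiniLM-L6-v2; Cosine is the distance named above.
    dense: { size: 384, distance: "Cosine" },
  },
  sparse_vectors: {
    // The "idf" modifier asks Qdrant to apply IDF weighting server-side.
    bm25: { modifier: "idf" },
  },
};
```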

3. Hybrid Retrieval

Every query runs two retrieval arms that are fused with Reciprocal Rank Fusion (RRF):

Dense (semantic): embedding similarity — great for paraphrases and concepts.
Sparse (BM25 keyword): exact-term matching — great for names, IDs, and acronyms.

Because matches are child slices, results are deduped back to distinct parent windows before they reach the LLM.
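
The fusion and dedup steps can be sketched as follows. This is an illustration under assumptions, not the project's code: Qdrant can perform the fusion server-side, and the constant `k = 60` is a commonly used RRF default rather than a value taken from this README. The names `Hit` and `fuseToParents` are hypothetical.

```typescript
interface Hit {
  childId: string;
  parentId: string;
}

// Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per hit,
// with rank starting at 1. Scores are summed across lists.
function fuseToParents(dense: Hit[], sparse: Hit[], k = 60): string[] {
  const scores = new Map<string, number>();
  const parentOf = new Map<string, string>();
  for (const list of [dense, sparse]) {
    list.forEach((hit, index) => {
      const rank = index + 1;
      scores.set(hit.childId, (scores.get(hit.childId) ?? 0) + 1 / (k + rank));
      parentOf.set(hit.childId, hit.parentId);
    });
  }

  const rankedChildren = Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([childId]) => childId);

  // Collapse child hits to distinct parents, keeping the best-ranked occurrence.
  const parents: string[] = [];
  const seen = new Set<string>();
  for (const childId of rankedChildren) {
    const parentId = parentOf.get(childId);
    if (parentId !== undefined && !seen.has(parentId)) {
      seen.add(parentId);
      parents.push(parentId);
    }
  }
  return parents;
}
```

A child that appears in both arms accumulates score from each, which is why it tends to rise above hits found by only one arm.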

4. Reranking (optional)

When a COHERE_API_KEY is configured, the top hybrid candidates are re-scored by a Cohere cross-encoder reranker, which jointly evaluates each (question, passage) pair for higher precision. If no key is set, retrieval gracefully falls back to the hybrid ordering.
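
The fallback behaviour can be sketched like this. It is a minimal illustration that assumes the reranker is injected as a plain scoring function; a real Cohere call would be asynchronous and go over the network. The names `Candidate`, `Reranker`, and `orderCandidates` are hypothetical.

```typescript
interface Candidate {
  id: string;
  text: string;
}

// Hypothetical reranker signature: one relevance score per passage.
type Reranker = (question: string, passages: string[]) => number[];

function orderCandidates(
  question: string,
  candidates: Candidate[],
  reranker?: Reranker,
): Candidate[] {
  // No reranker configured: keep the hybrid ordering untouched.
  if (!reranker) return candidates;

  const scores = reranker(question, candidates.map((c) => c.text));
  return candidates
    .map((candidate, i) => ({ candidate, score: scores[i] ?? 0 }))
    .sort((a, b) => b.score - a.score)
    .map((entry) => entry.candidate);
}
```

Passing `undefined` for `reranker` corresponds to running without a `COHERE_API_KEY`.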

5. Corrective RAG (CRAG)

Before generating, a single Groq call grades whether the retrieved sources can actually answer the question:

Correct → answer from the retrieved context as-is.
Ambiguous → keep only the sources graded relevant.
Incorrect → run one bounded corrective retry with an alternative query, then answer (or abstain).

CRAG is enabled by default and powered by the existing Groq key (no extra service). It fails open — any grading hiccup is treated as "correct" so an answer is never blocked — and the corrective retry is capped at one (no loops). Set ENABLE_CRAG=false to disable.
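
The control flow above can be sketched as follows. This is an illustration under several assumptions: the grader, retriever, and rewriter are injected as synchronous functions (the real ones would be async API calls), the retried sources are graded once more, and an empty result stands for abstaining. None of these names come from the DocBuddy codebase.

```typescript
type Verdict = "correct" | "ambiguous" | "incorrect";

interface Source {
  id: string;
  text: string;
}

interface Grade {
  verdict: Verdict;
  relevantIds: string[];
}

// Hypothetical injected dependencies.
type Grader = (question: string, sources: Source[]) => Grade;
type Retriever = (query: string) => Source[];
type Rewriter = (question: string) => string;

// Fail open: any grading error is treated as "correct" so an answer is never blocked.
function gradeFailOpen(grade: Grader, question: string, sources: Source[]): Grade {
  try {
    return grade(question, sources);
  } catch {
    return { verdict: "correct", relevantIds: sources.map((s) => s.id) };
  }
}

function keepRelevant(sources: Source[], grade: Grade): Source[] {
  const relevant = new Set(grade.relevantIds);
  return sources.filter((s) => relevant.has(s.id));
}

function correctiveRetrieve(
  question: string,
  initial: Source[],
  grade: Grader,
  retrieve: Retriever,
  rewrite: Rewriter,
): Source[] {
  const first = gradeFailOpen(grade, question, initial);
  if (first.verdict === "correct") return initial;
  if (first.verdict === "ambiguous") return keepRelevant(initial, first);

  // "incorrect": exactly one retry with an alternative query, never a loop.
  const retried = retrieve(rewrite(question));
  const second = gradeFailOpen(grade, question, retried);
  if (second.verdict === "incorrect") return []; // assumed meaning: abstain
  return second.verdict === "ambiguous" ? keepRelevant(retried, second) : retried;
}
```

Because the retry is a single straight-line call rather than a loop, the worst case is two retrievals and two grading calls per question.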

6. Query Rewriting & Generation

Conversational query rewriting: the raw question (plus recent chat turns) is rewritten into a standalone, retrieval-optimized search query.
LLM: Llama 3, served through the Groq API for low-latency generation.
Groundedness: A strict system prompt instructs the model to answer only from the provided context and to cite its sources using bracketed markers.
Document overview: The stored per-document summaries are supplied to the model so it can answer summary/overview questions (one document or all), and — when retrieval doesn't cover the question — tell the user what the documents do cover instead of guessing.
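
How the context, citations, and document overview might come together in a single request can be sketched as follows. This is an assumption-laden illustration: the prompt wording, the message shape, and the names `CitedSource`, `DocSummary`, and `buildGroundedMessages` are all hypothetical and not taken from the project.

```typescript
interface CitedSource {
  label: number; // the number shown in the bracketed citation, e.g. [1]
  text: string;
  page?: number;
}

interface DocSummary {
  name: string;
  summary: string;
}

interface ChatMessage {
  role: "system" | "user";
  content: string;
}

function buildGroundedMessages(
  question: string,
  sources: CitedSource[],
  summaries: DocSummary[],
): ChatMessage[] {
  // Number each passage so the model can cite it as [n].
  const context = sources
    .map((s) => `[${s.label}]${s.page !== undefined ? ` (p. ${s.page})` : ""} ${s.text}`)
    .join("\n\n");
  const overview = summaries.map((d) => `- ${d.name}: ${d.summary}`).join("\n");

  // Hypothetical prompt wording, not the project's actual system prompt.
  const system = [
    "Answer only from the numbered context below and cite sources with bracketed markers such as [1].",
    "If the context does not cover the question, say so and mention what the documents do cover.",
    "",
    "Document overview:",
    overview,
    "",
    "Context:",
    context,
  ].join("\n");

  return [
    { role: "system", content: system },
    { role: "user", content: question },
  ];
}
```

The `question` passed in here would be the user's original wording; the rewritten standalone query is only needed for retrieval.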

🛠 Tech Stack

Frontend: React, TypeScript, Vite, Tailwind CSS, Lucide React.
Backend: Node.js, Express, TypeScript, LangChain.js.
Database: Qdrant Cloud (Vector), Local JSON (Metadata).
AI Services: Groq (LLM), Hugging Face (embeddings), Cohere (reranking, optional).

⚙️ Setup & Installation

Backend

```
cd backend
npm install
```

Create a .env file with:

```
GROQ_API_KEY=your_key
HUGGINGFACEHUB_API_KEY=your_token
QDRANT_URL=your_qdrant_url
QDRANT_API_KEY=your_qdrant_key
PORT=5000

# Optional: enables cross-encoder reranking
COHERE_API_KEY=your_cohere_key

# Optional: Corrective RAG is on by default; set false to disable
# ENABLE_CRAG=false

# Optional: data persists by default; set true to wipe on every restart
# RESET_ON_STARTUP=true
```

Then build and start the server:

```
npm run build && npm start
```
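
For reference, the defaults described above could be read on the backend along these lines. This is a sketch under assumptions, not the project's actual config loader: the `AppConfig` shape, the `loadConfig` name, the fallback port of 5000, and the exact string comparisons are all hypothetical.

```typescript
interface AppConfig {
  groqApiKey: string;
  huggingFaceApiKey: string;
  qdrantUrl: string;
  qdrantApiKey: string;
  port: number;
  cohereApiKey?: string; // reranking is skipped when absent
  enableCrag: boolean;
  resetOnStartup: boolean;
}

function loadConfig(env: Record<string, string | undefined>): AppConfig {
  const required = (name: string): string => {
    const value = env[name];
    if (!value) throw new Error(`Missing required environment variable: ${name}`);
    return value;
  };

  return {
    groqApiKey: required("GROQ_API_KEY"),
    huggingFaceApiKey: required("HUGGINGFACEHUB_API_KEY"),
    qdrantUrl: required("QDRANT_URL"),
    qdrantApiKey: required("QDRANT_API_KEY"),
    port: Number(env["PORT"] ?? 5000), // assumed fallback
    cohereApiKey: env["COHERE_API_KEY"] || undefined,
    enableCrag: env["ENABLE_CRAG"] !== "false", // on unless explicitly disabled
    resetOnStartup: env["RESET_ON_STARTUP"] === "true", // off unless explicitly enabled
  };
}
```

In a Node.js process this would typically be called once at startup as `loadConfig(process.env)`.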

Frontend

```
cd frontend
npm install
```

Create a .env file with:

```
VITE_API_URL=http://localhost:5000
```

Then start the dev server:

```
npm run dev
```
