📄 RAG PDF Research Assistant

A fully local, privacy-first RAG (Retrieval-Augmented Generation) system that lets you chat with your PDF documents using open-source LLMs. No API keys. No cloud. Everything runs on your machine.

What It Does

You give it a PDF → it chunks, embeds, and indexes the content → you ask questions in natural language → it retrieves relevant sections and generates accurate, sourced answers using a local LLM.

Comes with both a CLI and a Streamlit web UI with real-time streaming.

Example:

❓ You: What is the maternity leave policy?

💡 Answer: According to the Workplace Policies document, employees are entitled
   to 26 weeks of paid maternity leave after completing 80 days of service...

📚 Sources:
   [1] Page 12 [MATERNITY LEAVE POLICY]: "Maternity Leave Policy — Purpose..."
   [2] Page 13 [ELIGIBILITY] 📊: "All female employees who have..."

⏱️  Response time: 4.82s

Features

Streaming responses — tokens are shown as they are generated; a full answer still takes around 10-15s on CPU. (This is where I am looking for optimization using open-source LLMs under 4GB.)
Table extraction — reads tables from PDFs using pdfplumber, alongside plain text and other unstructured content
Section-aware chunking — respects document structure (policy headers, sections, eligibility blocks)
Response caching — repeated questions return instantly (the current implementation works but needs improvement)
Dual interface — CLI for power users, Streamlit web UI for everyone else
Fully local — runs on CPU, no GPU required, no data leaves your machine
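
As a rough illustration of the table extraction feature, the sketch below turns a pdfplumber table into pipe-separated text under a `[TABLE DATA]` marker, which is the format this README describes. The function names `table_to_text` and `load_pages`, the dictionary keys, and the sample rows are assumptions for illustration; the actual code in `rag_assistant.py` may differ.

```python
from typing import Optional


def table_to_text(rows: list[list[Optional[str]]]) -> str:
    """Render one extracted table as pipe-separated lines under a marker."""
    lines = ["[TABLE DATA]"]
    for row in rows:
        # pdfplumber reports empty cells as None; render them as empty strings.
        cells = [(cell or "").replace("\n", " ").strip() for cell in row]
        lines.append(" | ".join(cells))
    return "\n".join(lines)


def load_pages(pdf_path: str) -> list[dict]:
    """Extract text and tables page by page (requires pdfplumber)."""
    import pdfplumber  # imported lazily so table_to_text has no dependencies

    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            tables = page.extract_tables()
            parts = [text] + [table_to_text(t) for t in tables]
            pages.append({
                "page": number,
                "has_table": bool(tables),  # drives a table indicator on sources
                "text": "\n\n".join(p for p in parts if p),
            })
    return pages


# Made-up rows, shaped like the output of page.extract_tables()
print(table_to_text([["Leave type", "Weeks"], ["Maternity", "26"], ["Adoption", None]]))
```

Keeping the table rows as delimited text means the same embedding and retrieval path handles prose and tables alike.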

Tech Stack

Layer	Technology	Why This Choice
LLM	Ollama + Mistral 7B	Runs locally, no API costs, good quality
Embeddings	`all-MiniLM-L6-v2` (HuggingFace)	Fast, 384-dim vectors, great for semantic search
Vector Store	FAISS (Facebook AI Similarity Search)	Sub-millisecond search, persists to disk
PDF Parsing	`pdfplumber`	Extracts both text and table structures
Chunking	LangChain `RecursiveCharacterTextSplitter`	Section-aware separators, configurable overlap
Orchestration	LangChain `RetrievalQA` chain	Handles retrieval → prompt → generation pipeline
Web UI	Streamlit	Streaming chat interface with source display
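
To make the embedding and vector-store rows more concrete, here is a dependency-free sketch of what a flat L2 index computes conceptually: the query is compared against every stored vector and the closest k are returned. It does not use the FAISS API. The helper names `l2_normalize` and `flat_l2_search` and the 2-dimensional toy vectors are assumptions for illustration. Because this README states that embeddings are normalized, the sketch also checks the identity that links squared L2 distance to cosine similarity for unit vectors.

```python
import heapq
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def flat_l2_search(query: list[float], vectors: list[list[float]], k: int = 3) -> list[tuple[float, int]]:
    """Exact search: squared L2 distance from the query to every stored vector."""
    scored = (
        (sum((q - v) ** 2 for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    )
    return heapq.nsmallest(k, scored)  # (distance, index) pairs, closest first


# Toy 2-dimensional "embeddings"; the real model produces 384 dimensions.
stored = [l2_normalize(v) for v in ([0.0, 2.0], [3.0, 0.0], [1.0, 1.0], [9.0, 1.0])]
query = l2_normalize([5.0, 0.0])

for dist, idx in flat_l2_search(query, stored, k=3):
    cos = sum(q * v for q, v in zip(query, stored[idx]))  # dot product of unit vectors
    # For unit vectors, squared L2 distance equals 2 - 2 * cosine similarity.
    print(idx, round(dist, 4), round(2 - 2 * cos, 4))
```

With normalized vectors, ranking by L2 distance and ranking by cosine similarity give the same order, which is why a plain L2 index works for semantic search here.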

Architecture

┌─────────────┐     ┌──────────────┐     ┌──────────────────┐
│   PDF File   │────▶│  pdfplumber  │────▶│  Pages + Tables   │
└─────────────┘     └──────────────┘     └────────┬─────────┘
                                                   │
                                                   ▼
                                         ┌───────────────────┐
                                         │ Section-Aware      │
                                         │ RecursiveCharText   │
                                         │ Splitter (1000c,    │
                                         │ 200 overlap)        │
                                         └────────┬──────────┘
                                                   │
                                                   ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────────┐
│  FAISS Index │◀────│ all-MiniLM   │◀────│  Enriched Chunks  │
│  (on disk)   │     │ L6-v2 (384d) │     │  + section meta   │
└──────┬───────┘     └──────────────┘     └──────────────────┘
       │
       │  similarity search (top-k=3)
       ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────────┐
│  Retrieved   │────▶│  HR-Specific │────▶│  Ollama/Mistral  │
│  Chunks      │     │  Prompt      │     │  (local, stream)  │
└──────────────┘     └──────────────┘     └────────┬─────────┘
                                                    │
                                                    ▼
                                          ┌──────────────────┐
                                          │  Streamed Answer  │
                                          │  + Sources + Cache │
                                          └──────────────────┘
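
The "1000c, 200 overlap" setting in the diagram means consecutive chunks share 200 characters, so a sentence cut at one boundary still appears whole in a neighbouring chunk. The sketch below shows only that fixed-size windowing idea. The function `sliding_chunks` is hypothetical, and the real `RecursiveCharacterTextSplitter` also prefers to break at separators, so its chunk boundaries will differ.

```python
def sliding_chunks(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap  # with the defaults, each chunk starts 800 characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


page = "x" * 2600
print([len(chunk) for chunk in sliding_chunks(page)])
```

A larger overlap improves the odds that a relevant passage lands intact in some chunk, at the cost of more vectors to embed and store.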

Quick Start

Prerequisites

Python 3.9+
Ollama installed and running

1. Clone & Install

git clone https://github.com/YOUR_USERNAME/rag-pdf-assistant.git
cd rag-pdf-assistant

python -m venv venv
source venv/bin/activate        # Windows: venv\Scripts\activate

pip install -r requirements.txt

2. Pull an LLM

ollama pull mistral              # Default — 4.1GB, best quality
# OR for faster responses on CPU:
ollama pull phi3:mini            # 2.3GB, 2-3x faster
ollama pull qwen2.5:3b           # 2GB, good speed/quality balance

3. Add Your PDF

mkdir -p documents
cp /path/to/your/file.pdf documents/

4. Run

Web UI (recommended):

streamlit run app.py

CLI:

python rag_assistant.py --pdf documents/your_file.pdf

First run builds the FAISS index (~10-30s depending on PDF size). Subsequent runs skip this step and load the saved index from disk.

Usage

Web UI

Launch with `streamlit run app.py`. Features:

Streaming chat — responses appear token-by-token
Source display — expandable sources with page numbers and section tags
Sidebar settings — switch models, adjust k, rebuild index
Response caching — repeated questions return instantly (⚡ badge)
Table indicators — 📊 marks sources that contain table data
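
The streaming behaviour listed above can be sketched with a plain Python generator. `fake_llm_stream` is a stand-in for the model's token stream and is purely hypothetical; how `app.py` actually wires the model to the UI is an assumption. The Streamlit call is shown in comments so the snippet runs without Streamlit installed.

```python
import time
from typing import Iterator


def fake_llm_stream(answer: str, delay: float = 0.0) -> Iterator[str]:
    """Stand-in for a model stream: yields the answer one word at a time."""
    for word in answer.split(" "):
        time.sleep(delay)  # simulates per-token generation time
        yield word + " "


# Outside Streamlit, consume the generator directly:
for token in fake_llm_stream("Employees are entitled to paid leave."):
    print(token, end="", flush=True)
print()

# Inside a Streamlit app, the same generator could be handed to st.write_stream,
# which renders chunks as they arrive and returns the concatenated text:
#
#   import streamlit as st
#   with st.chat_message("assistant"):
#       full_text = st.write_stream(fake_llm_stream(answer, delay=0.05))
```

Streaming does not shorten total generation time; it only reduces the wait before the first visible output.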

CLI Commands

# Basic usage
python rag_assistant.py

# Use a specific PDF
python rag_assistant.py --pdf documents/handbook.pdf

# Force re-index (after updating the PDF or changing chunking)
python rag_assistant.py --rebuild

# Use a different/faster model
python rag_assistant.py --model phi3:mini

# Retrieve fewer chunks (faster responses)
python rag_assistant.py --k 2

Interactive CLI Commands

Command	Description
Type any question	Ask about your document
`/search <query>`	Raw FAISS search — no LLM, just retrieval (great for debugging)
`/stats`	Show index stats (vector count, dimensions, model, cache size)
`quit`	Exit
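
One way to route the interactive commands in the table above is a small pure function that classifies each input line before any retrieval or generation happens. The name `parse_command` and the returned labels are assumptions for illustration, not the project's actual code.

```python
def parse_command(line: str) -> tuple[str, str]:
    """Classify one line of CLI input as (kind, argument)."""
    text = line.strip()
    if text.lower() == "quit":
        return ("quit", "")
    if text == "/stats":
        return ("stats", "")
    head, _, rest = text.partition(" ")
    if head == "/search":
        # Everything after the command word is the raw retrieval query.
        return ("search", rest.strip())
    return ("ask", text)  # anything else is treated as a question


for line in ["/search maternity leave", "/stats", "What is the notice period?", "quit"]:
    print(parse_command(line))
```

Keeping the parsing separate from the main loop makes the command handling easy to unit test without loading a model.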

Configuration

CLI Arguments

Flag	Default	Description
`--pdf`	`documents/WORKPLACE_POLICIES.pdf`	Path to PDF file
`--rebuild`	`false`	Force rebuild the FAISS index
`--model`	`mistral`	Ollama model name
`--k`	`3`	Number of chunks to retrieve per query
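
The flags and defaults in this table map directly onto `argparse`. The following is a minimal sketch of such a parser; the function name `build_parser` and the description string are assumptions, and the real argument handling in `rag_assistant.py` may be organised differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Chat with a PDF using a local LLM")
    parser.add_argument("--pdf", default="documents/WORKPLACE_POLICIES.pdf",
                        help="Path to PDF file")
    parser.add_argument("--rebuild", action="store_true",
                        help="Force rebuild the FAISS index")
    parser.add_argument("--model", default="mistral", help="Ollama model name")
    parser.add_argument("--k", type=int, default=3,
                        help="Number of chunks to retrieve per query")
    return parser


# Passing an explicit list keeps the example independent of sys.argv.
args = build_parser().parse_args(["--model", "phi3:mini", "--k", "2"])
print(args.pdf, args.rebuild, args.model, args.k)
```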

Performance Tuning

These settings are already tuned in setup_llm() for CPU-only systems:

from langchain_ollama import OllamaLLM

model = "mistral"            # Default --model value

llm = OllamaLLM(
    model=model,
    temperature=0.1,         # Low randomness for factual Q&A
    num_predict=256,         # Caps answer length, prevents rambling
    num_ctx=2048,            # Smaller context window; ~40% faster than 4096
    # num_thread=8,          # Set to your CPU core count
)

Optimizations Applied

Optimization	Impact	Details
pdfplumber for PDF parsing	Reads tables properly	Tables extracted as pipe-separated text, tagged in metadata
Section-aware chunking	Better retrieval	Splits on `Purpose:`, `Policy:`, `Eligibility:`, etc.
Section metadata	Traceable sources	Each chunk tagged with detected section title
`num_ctx=2048`	~40% faster	Halved context window, sufficient for policy Q&A
`num_predict=256`	Faster, focused	Prevents runaway generation
`temperature=0.1`	More accurate	Near-deterministic outputs for factual content
Domain-specific prompt	Much better answers	Lists ALL items, cites numbers/deadlines, reads tables
Response caching	Instant repeats	MD5-based cache, last 50 questions
Streaming (UI)	~1s perceived latency	Tokens appear in real-time via `st.write_stream`
`k=3` (down from 4)	Less noise, faster	3 focused chunks > 4 noisy ones
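
The section-aware chunking and section metadata rows can be illustrated as follows. This is a sketch under assumptions: the separator list, the `GENERAL` fallback, and the names `SECTION_SEPARATORS`, `detect_section`, and `split_with_sections` are hypothetical, and only the three labels named in the table are handled. The splitter is imported lazily so the section detection can be tried without LangChain installed.

```python
import re

# Hypothetical separator list: section labels first, then generic fallbacks.
SECTION_SEPARATORS = ["\nPurpose:", "\nPolicy:", "\nEligibility:", "\n\n", "\n", " ", ""]
SECTION_LABEL = re.compile(r"^(Purpose|Policy|Eligibility):", re.MULTILINE)


def detect_section(chunk: str) -> str:
    """Return the first section label found in the chunk, or 'GENERAL'."""
    match = SECTION_LABEL.search(chunk)
    return match.group(1).upper() if match else "GENERAL"


def split_with_sections(text: str) -> list[dict]:
    """Split text on section-aware separators and tag each chunk."""
    # Imported lazily so detect_section works without LangChain installed.
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(
        separators=SECTION_SEPARATORS,
        chunk_size=1000,
        chunk_overlap=200,
    )
    return [{"text": c, "section": detect_section(c)} for c in splitter.split_text(text)]


print(detect_section("Eligibility: All employees who have completed..."))
```

Listing the section labels before the generic separators makes the splitter try section boundaries first and fall back to paragraphs, lines, and words only when a section is still too long.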

Performance Benchmarks

Tested on AMD Ryzen 7 5700U (CPU only, no discrete GPU):

Step	Time	Notes
Embedding model load	~2-5s	One-time per session
PDF load + table extraction	~1-3s	One-time, cached to disk
FAISS index build	~5-15s	One-time, saved to `faiss_index/`
FAISS similarity search	<0.01s	Effectively instant
LLM generation (Mistral, CPU)	~8-12s	Optimized with num_ctx=2048
LLM generation (cached)	<0.01s	Instant on repeated questions
Streaming first token	~1s	Perceived latency with streaming

Project Structure

rag-pdf-assistant/
├── README.md                  # This file
├── requirements.txt           # Python dependencies
├── .env.example               # Environment variable template
├── .gitignore                 # Ignores faiss_index/, .env, __pycache__/
├── rag_assistant.py           # Core RAG logic + CLI interface
├── app.py                     # Streamlit web UI (streaming)
├── documents/                 # Your PDF files go here
│   └── sample.pdf
├── faiss_index/               # Auto-generated, git-ignored
│   ├── index.faiss
│   └── index.pkl
└── tests/
    └── test_rag.py

Dependencies

langchain-community
langchain-huggingface
langchain-ollama
langchain-classic
langchain-text-splitters
langchain-core
faiss-cpu
sentence-transformers
pdfplumber
streamlit
python-dotenv

Roadmap

Completed

Section-aware chunking with metadata enrichment
Domain-specific HR prompt template
pdfplumber for table extraction
Response caching (MD5-based, in-memory)
Streamlit web UI with streaming
Performance tuning (num_ctx, num_predict, k)

Next Up

Hybrid search (BM25 keyword + FAISS vector)
Conversation memory (follow-up questions)
Multi-PDF support (scan a directory)
Cross-encoder re-ranking (fetch 10 → re-rank → keep best 3)
Multi-query retrieval (LLM rewrites your question 3 ways)
FastAPI REST API wrapper
RAGAS evaluation framework

How It Works

1. Ingestion — The PDF is loaded using pdfplumber, which extracts both regular text and table structures page-by-page. Tables are converted to pipe-separated format and tagged with [TABLE DATA] markers. Pages are then split into overlapping chunks using section-aware separators that respect document structure (policy headers, eligibility blocks, etc.). Each chunk is enriched with metadata including page number, detected section title, and whether it contains table data.

2. Embedding — Each chunk is converted into a 384-dimensional vector using the all-MiniLM-L6-v2 sentence transformer. Embeddings are normalized for consistent similarity scores.

3. Indexing — Vectors are stored in a FAISS flat index (IndexFlatL2), persisted to disk so you only pay the embedding cost once.

4. Retrieval — Your question is embedded with the same model, then FAISS finds the top-3 most similar chunks. This takes <1ms.

5. Generation — Retrieved chunks are injected into a domain-specific prompt template and streamed through the local LLM. Responses are cached for instant retrieval on repeated questions.
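
The response caching mentioned in step 5 is described elsewhere in this README as MD5-based and limited to the last 50 questions. A minimal in-memory sketch of that idea follows. The class name `ResponseCache`, the lowercase-and-strip normalization, and the oldest-first eviction are assumptions; the project's cache may behave differently.

```python
import hashlib
from collections import OrderedDict
from typing import Optional


class ResponseCache:
    """In-memory answer cache keyed by the MD5 of the normalized question."""

    def __init__(self, max_entries: int = 50) -> None:
        self.max_entries = max_entries
        self._store: "OrderedDict[str, str]" = OrderedDict()

    @staticmethod
    def _key(question: str) -> str:
        # Assumption: case and surrounding whitespace should not affect the key.
        return hashlib.md5(question.strip().lower().encode("utf-8")).hexdigest()

    def get(self, question: str) -> Optional[str]:
        return self._store.get(self._key(question))

    def put(self, question: str, answer: str) -> None:
        key = self._key(question)
        self._store[key] = answer
        self._store.move_to_end(key)
        while len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # drop the oldest entry


cache = ResponseCache()
cache.put("What is the maternity leave policy?", "26 weeks of paid leave...")
print(cache.get("  what is the MATERNITY leave policy?  "))
```

Since an answer also depends on the model and on `k`, one possible improvement is to include both in the hashed key so that switching models does not return a stale answer.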

Troubleshooting

Problem	Fix
`Connection error` to Ollama	Make sure Ollama is running: `ollama serve`
Slow responses (>15s)	Try `--model phi3:mini` or retrieve fewer chunks with `--k 2`
Wrong/hallucinated answers	Try `--k 5` for more context, or check `/search` results
Tables not being read	Run with `--rebuild` to re-index with pdfplumber
`ModuleNotFoundError`	Run `pip install -r requirements.txt` inside your venv
Index seems stale	Rebuild with `--rebuild` flag
Streamlit won't start	Ensure `rag_assistant.py` is in the same directory as `app.py`
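
For the first row of the table, a quick way to confirm that Ollama is reachable before starting the assistant is to probe its local HTTP endpoint. The sketch below assumes Ollama's default address `http://localhost:11434`; the function name `ollama_is_running` is hypothetical, and a custom host or port would need a different `base_url`.

```python
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, ValueError):
        # Covers connection refused, timeouts, DNS failures and malformed URLs.
        return False


if not ollama_is_running():
    print("Ollama does not appear to be running; try: ollama serve")
```

Running a check like this at startup turns a mid-conversation connection error into an immediate, actionable message.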

Contributing

Contributions welcome! Feel free to open issues or submit PRs for any roadmap items.

1. Fork the repo
2. Create your feature branch (`git checkout -b feature/hybrid-search`)
3. Commit your changes
4. Push to the branch
5. Open a Pull Request

License

MIT — use it however you want.

Acknowledgments

Built with LangChain, Ollama, FAISS, pdfplumber, Streamlit, and HuggingFace Sentence Transformers.
