A fully local, privacy-first RAG (Retrieval-Augmented Generation) system that lets you chat with your PDF documents using open-source LLMs. No API keys. No cloud. Everything runs on your machine.
You give it a PDF → it chunks, embeds, and indexes the content → you ask questions in natural language → it retrieves relevant sections and generates accurate, sourced answers using a local LLM.
Comes with both a CLI and a Streamlit web UI with real-time streaming.
Example:
❓ You: What is the maternity leave policy?
💡 Answer: According to the Workplace Policies document, employees are entitled
to 26 weeks of paid maternity leave after completing 80 days of service...
📚 Sources:
[1] Page 12 [MATERNITY LEAVE POLICY]: "Maternity Leave Policy — Purpose..."
[2] Page 13 [ELIGIBILITY] 📊: "All female employees who have..."
⏱️ Response time: 4.82s
- Streaming responses: answers stream token by token. A full response currently takes around 10-15 s to generate on CPU, which is the main optimization target (looking at open-source LLMs under 4 GB).
- Table extraction: reads tables from PDFs using pdfplumber, so table data is captured alongside plain text.
- Section-aware chunking: respects document structure (policy headers, sections, eligibility blocks).
- Response caching: repeated questions return instantly (the current cache works but needs improvement).
- Dual interface: CLI for power users, Streamlit web UI for everyone else.
- Fully local: runs on CPU, no GPU required, no data leaves your machine.
| Layer | Technology | Why This Choice |
|---|---|---|
| LLM | Ollama + Mistral 7B | Runs locally, no API costs, good quality |
| Embeddings | all-MiniLM-L6-v2 (HuggingFace) | Fast, 384-dim vectors, great for semantic search |
| Vector Store | FAISS (Facebook AI Similarity Search) | Sub-millisecond search, persists to disk |
| PDF Parsing | pdfplumber | Extracts both text and table structures |
| Chunking | LangChain RecursiveCharacterTextSplitter | Section-aware separators, configurable overlap |
| Orchestration | LangChain RetrievalQA chain | Handles retrieval → prompt → generation pipeline |
| Web UI | Streamlit | Streaming chat interface with source display |
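The PDF parsing row above mentions table structures. As a rough sketch of how extracted rows could be flattened into pipe-separated text with a `[TABLE DATA]` marker, the helper below works on plain Python lists. The name `table_to_text` and the exact marker placement are assumptions for illustration, not the project's actual code.

```python
from typing import List, Optional, Sequence

TABLE_MARKER = "[TABLE DATA]"  # marker named in this README; placement is assumed


def table_to_text(table: Sequence[Sequence[Optional[str]]]) -> str:
    """Flatten one table (rows of cells) into pipe-separated lines.

    pdfplumber's page.extract_tables() returns tables in this shape:
    a list of rows, each a list of cell strings, with None for empty cells.
    """
    lines: List[str] = []
    for row in table:
        # Replace missing cells with "" and collapse line breaks inside cells.
        cells = [" ".join((cell or "").split()) for cell in row]
        if any(cells):  # skip rows that are entirely empty
            lines.append(" | ".join(cells))
    return TABLE_MARKER + "\n" + "\n".join(lines)
```

With pdfplumber installed, each table returned by `page.extract_tables()` could be passed through a helper like this and appended to the page text before chunking.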
┌─────────────┐     ┌──────────────┐     ┌──────────────────┐
│ PDF File    │────▶│ pdfplumber   │────▶│ Pages + Tables   │
└─────────────┘     └──────────────┘     └────────┬─────────┘
                                                  │
                                                  ▼
                                         ┌───────────────────┐
                                         │ Section-Aware     │
                                         │ RecursiveCharText │
                                         │ Splitter (1000c,  │
                                         │ 200 overlap)      │
                                         └────────┬──────────┘
                                                  │
                                                  ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────────┐
│ FAISS Index  │◀────│ all-MiniLM   │◀────│ Enriched Chunks  │
│ (on disk)    │     │ L6-v2 (384d) │     │ + section meta   │
└──────┬───────┘     └──────────────┘     └──────────────────┘
       │
       │ similarity search (top-k=3)
       ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────────┐
│ Retrieved    │────▶│ HR-Specific  │────▶│ Ollama/Mistral   │
│ Chunks       │     │ Prompt       │     │ (local, stream)  │
└──────────────┘     └──────────────┘     └────────┬─────────┘
                                                   │
                                                   ▼
                                          ┌───────────────────┐
                                          │ Streamed Answer   │
                                          │ + Sources + Cache │
                                          └───────────────────┘
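The splitter stage in the diagram (1000 characters, 200 overlap) can be approximated without LangChain to see what section-aware, overlapping chunking does. This is a simplified stand-in, not the project's `RecursiveCharacterTextSplitter` configuration: the function name `split_with_overlap`, the header list, and the break-point heuristic are all assumptions for illustration.

```python
from typing import List

# Hypothetical section markers; the project's real separator list may differ.
SECTION_HEADERS = ("Purpose:", "Policy:", "Eligibility:")


def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    A chunk ends at the last section header or paragraph break found in
    the current window, or at chunk_size if there is none. Consecutive
    chunks share `overlap` characters.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    chunks: List[str] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            window = text[start:end]
            candidates = [window.rfind(header) for header in SECTION_HEADERS]
            candidates.append(window.rfind("\n\n"))
            best = max(candidates)
            # Only accept break points past the overlap so each step moves forward.
            if best > overlap:
                end = start + best
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks
```

Ending chunks at section boundaries keeps a policy header together with the text that follows it in the next chunk, while the overlap guards against an answer being cut in half at a boundary.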
- Python 3.9+
- Ollama installed and running
git clone https://github.com/YOUR_USERNAME/rag-pdf-assistant.git
cd rag-pdf-assistant
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
ollama pull mistral # Default: 4.1GB, best quality
# OR for faster responses on CPU:
ollama pull phi3:mini # 2.3GB, 2-3x faster
ollama pull qwen2.5:3b # 2GB, good speed/quality balance
mkdir -p documents
cp /path/to/your/file.pdf documents/
Web UI (recommended):
streamlit run app.py
CLI:
python rag_assistant.py --pdf documents/your_file.pdf
First run builds the FAISS index (~10-30s depending on PDF size). Subsequent runs load instantly from disk.
Launch with streamlit run app.py. Features:
- Streaming chat — responses appear token-by-token
- Source display — expandable sources with page numbers and section tags
- Sidebar settings — switch models, adjust k, rebuild index
- Response caching — repeated questions return instantly (⚡ badge)
- Table indicators — 📊 marks sources that contain table data
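The streaming and caching features listed above interact: a streamed answer only exists in full once the last token has arrived, so it has to be collected before it can be cached. A minimal sketch of one way to do that, assuming a hypothetical helper `stream_and_collect` (not taken from app.py):

```python
from typing import Callable, Iterable, Iterator, List


def stream_and_collect(tokens: Iterable[str], on_complete: Callable[[str], None]) -> Iterator[str]:
    """Yield tokens as they arrive, then hand the full answer to on_complete."""
    parts: List[str] = []
    for token in tokens:
        parts.append(token)
        yield token  # the UI can render this piece immediately
    # Reached only after the token source is exhausted.
    on_complete("".join(parts))
```

In a Streamlit app, a generator like this could be handed to `st.write_stream`, with the token source being the LLM's streaming output and `on_complete` storing the answer in the cache. Both wiring details are assumptions about how app.py is put together.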
# Basic usage
python rag_assistant.py
# Use a specific PDF
python rag_assistant.py --pdf documents/handbook.pdf
# Force re-index (after updating the PDF or changing chunking)
python rag_assistant.py --rebuild
# Use a different/faster model
python rag_assistant.py --model phi3:mini
# Retrieve fewer chunks (faster responses)
python rag_assistant.py --k 2

| Command | Description |
|---|---|
| Type any question | Ask about your document |
| /search <query> | Raw FAISS search: no LLM, just retrieval (great for debugging) |
| /stats | Show index stats (vector count, dimensions, model, cache size) |
| quit | Exit |
| Flag | Default | Description |
|---|---|---|
| --pdf | documents/WORKPLACE_POLICIES.pdf | Path to PDF file |
| --rebuild | false | Force rebuild the FAISS index |
| --model | mistral | Ollama model name |
| --k | 3 | Number of chunks to retrieve per query |
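For reference, the flags and defaults in the table map naturally onto `argparse`. The sketch below mirrors the table only; the function names `build_parser` and `parse_cli` are hypothetical, and the real rag_assistant.py may parse its arguments differently.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the flags and defaults in the table above."""
    parser = argparse.ArgumentParser(description="Chat with a PDF using a local LLM")
    parser.add_argument("--pdf", default="documents/WORKPLACE_POLICIES.pdf",
                        help="Path to PDF file")
    parser.add_argument("--rebuild", action="store_true",
                        help="Force rebuild the FAISS index")
    parser.add_argument("--model", default="mistral", help="Ollama model name")
    parser.add_argument("--k", type=int, default=3,
                        help="Number of chunks to retrieve per query")
    return parser


def parse_cli(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv; argparse falls back to sys.argv[1:] when argv is None."""
    return build_parser().parse_args(argv)
```

Using `action="store_true"` for `--rebuild` gives the `false` default shown in the table without requiring a value on the command line.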
These are already optimized in setup_llm() for CPU-only systems:
from langchain_ollama import OllamaLLM

llm = OllamaLLM(
    model=model,        # Ollama model name, e.g. the value of --model
    temperature=0.1,    # Near-deterministic for factual Q&A
    num_predict=256,    # Caps answer length, prevents rambling
    num_ctx=2048,       # ~40% faster than a 4096-token context
    # num_thread=8,     # Set to your CPU core count
)

| Optimization | Impact | Details |
|---|---|---|
| pdfplumber for PDF parsing | Reads tables properly | Tables extracted as pipe-separated text, tagged in metadata |
| Section-aware chunking | Better retrieval | Splits on Purpose:, Policy:, Eligibility:, etc. |
| Section metadata | Traceable sources | Each chunk tagged with detected section title |
| num_ctx=2048 | ~40% faster | Halved context window, sufficient for policy Q&A |
| num_predict=256 | Faster, focused | Prevents runaway generation |
| temperature=0.1 | More accurate | Near-deterministic outputs for factual content |
| Domain-specific prompt | Much better answers | Lists ALL items, cites numbers/deadlines, reads tables |
| Response caching | Instant repeats | MD5-based cache, last 50 questions |
| Streaming (UI) | ~1s perceived latency | Tokens appear in real-time via st.write_stream |
| k=3 (down from 4) | Less noise, faster | 3 focused chunks > 4 noisy ones |
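The response-caching row describes an MD5-based cache holding the last 50 questions. One way such a cache could look is sketched below. The class name `ResponseCache`, the question normalization, and the least-recently-used eviction order are assumptions; the project's cache may behave differently.

```python
import hashlib
from collections import OrderedDict
from typing import Optional


class ResponseCache:
    """In-memory cache keyed by the MD5 of the normalized question."""

    def __init__(self, max_size: int = 50) -> None:
        self.max_size = max_size
        self._store: "OrderedDict[str, str]" = OrderedDict()

    @staticmethod
    def _key(question: str) -> str:
        # Assumption: lowercase and collapse whitespace so trivially
        # different spellings of one question share an entry.
        normalized = " ".join(question.lower().split())
        # MD5 is used only as a cache key here, not for security.
        return hashlib.md5(normalized.encode("utf-8"), usedforsecurity=False).hexdigest()

    def get(self, question: str) -> Optional[str]:
        key = self._key(question)
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        return None

    def put(self, question: str, answer: str) -> None:
        key = self._key(question)
        self._store[key] = answer
        self._store.move_to_end(key)
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict the least recently used entry
```

A cache keyed on the question alone goes stale when the PDF, the model, or `k` changes, so clearing it on `--rebuild` or including those values in the key would be a reasonable refinement.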
Tested on AMD Ryzen 7 5700U (CPU only, no discrete GPU):
| Step | Time | Notes |
|---|---|---|
| Embedding model load | ~2-5s | One-time per session |
| PDF load + table extraction | ~1-3s | One-time, cached to disk |
| FAISS index build | ~5-15s | One-time, saved to faiss_index/ |
| FAISS similarity search | <0.01s | Effectively instant |
| LLM generation (Mistral, CPU) | ~8-12s | Optimized with num_ctx=2048 |
| LLM generation (cached) | <0.01s | Instant on repeated questions |
| Streaming first token | ~1s | Perceived latency with streaming |
rag-pdf-assistant/
├── README.md           # This file
├── requirements.txt    # Python dependencies
├── .env.example        # Environment variable template
├── .gitignore          # Ignores faiss_index/, .env, __pycache__/
├── rag_assistant.py    # Core RAG logic + CLI interface
├── app.py              # Streamlit web UI (streaming)
├── documents/          # Your PDF files go here
│   └── sample.pdf
├── faiss_index/        # Auto-generated, git-ignored
│   ├── index.faiss
│   └── index.pkl
└── tests/
    └── test_rag.py
langchain-community
langchain-huggingface
langchain-ollama
langchain-classic
langchain-text-splitters
langchain-core
faiss-cpu
sentence-transformers
pdfplumber
streamlit
python-dotenv
- Section-aware chunking with metadata enrichment
- Domain-specific HR prompt template
- pdfplumber for table extraction
- Response caching (MD5-based, in-memory)
- Streamlit web UI with streaming
- Performance tuning (num_ctx, num_predict, k)
- Hybrid search (BM25 keyword + FAISS vector)
- Conversation memory (follow-up questions)
- Multi-PDF support (scan a directory)
- Cross-encoder re-ranking (fetch 10 → re-rank → keep best 3)
- Multi-query retrieval (LLM rewrites your question 3 ways)
- FastAPI REST API wrapper
- RAGAS evaluation framework
1. Ingestion — The PDF is loaded using pdfplumber, which extracts both regular text and table structures page-by-page. Tables are converted to pipe-separated format and tagged with [TABLE DATA] markers. Pages are then split into overlapping chunks using section-aware separators that respect document structure (policy headers, eligibility blocks, etc.). Each chunk is enriched with metadata including page number, detected section title, and whether it contains table data.
2. Embedding — Each chunk is converted into a 384-dimensional vector using the all-MiniLM-L6-v2 sentence transformer. Embeddings are normalized for consistent similarity scores.
3. Indexing — Vectors are stored in a FAISS flat index (IndexFlatL2), persisted to disk so you only pay the embedding cost once.
4. Retrieval — Your question is embedded with the same model, then FAISS finds the top-3 most similar chunks. This takes <1ms.
5. Generation — Retrieved chunks are injected into a domain-specific prompt template and streamed through the local LLM. Responses are cached for instant retrieval on repeated questions.
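Steps 2 to 4 can be illustrated without FAISS. For unit-length vectors, squared L2 distance equals 2 minus twice the cosine similarity, so ranking normalized embeddings by L2 distance (an exhaustive comparison, which is what a flat index does) gives the same order as ranking by cosine similarity. The pure-Python sketch below uses tiny 2-dimensional vectors in place of the 384-dimensional embeddings, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query: Sequence[float], vectors: Sequence[Sequence[float]],
          k: int = 3) -> List[Tuple[int, float]]:
    """Exhaustive search: (index, squared distance) pairs, closest first."""
    q = normalize(query)
    scored = [(i, squared_l2(q, normalize(v))) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: pair[1])[:k]
```

The returned indices play the role of chunk IDs: in the real pipeline they would be mapped back to chunk text and metadata before being inserted into the prompt.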
| Problem | Fix |
|---|---|
| Connection error to Ollama | Make sure Ollama is running: ollama serve |
| Slow responses (>15s) | Try --model phi3:mini or lower --k 2 |
| Wrong/hallucinated answers | Try --k 5 for more context, or check /search results |
| Tables not being read | Run --rebuild to re-index with pdfplumber |
| ModuleNotFoundError | Run pip install -r requirements.txt inside your venv |
| Index seems stale | Rebuild with --rebuild flag |
| Streamlit won't start | Ensure rag_assistant.py is in the same directory as app.py |
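For the first row of the table, a quick reachability check can tell a stopped Ollama server apart from other failures before any model call is made. The sketch below assumes Ollama is listening on its default address, `http://localhost:11434`; the function name `ollama_is_running` is hypothetical and not part of this project.

```python
import http.client
import urllib.request


def ollama_is_running(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url with status 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, HTTP error, or a malformed reply.
        return False
```

If this returns False, starting the server with `ollama serve` and retrying is the first thing to try.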
Contributions welcome! Feel free to open issues or submit PRs for any roadmap items.
- Fork the repo
- Create your feature branch (git checkout -b feature/hybrid-search)
- Commit your changes
- Push to the branch
- Open a Pull Request
MIT — use it however you want.
Built with LangChain, Ollama, FAISS, pdfplumber, Streamlit, and HuggingFace Sentence Transformers.