bharghavaram/rag-knowledge-assistant

📅 Period: Oct 2023 – Nov 2023  |  Author: Bharghava Ram Vemuri

📚 RAG Knowledge Assistant

Retrieval-Augmented Generation · PDF Q&A · ChromaDB + Transformers · Zero API Cost


🎯 Problem Statement

Traditional document search returns keyword matches, not answers. A researcher with 50 PDFs cannot ask "What did all the papers say about transformer attention mechanisms?" and get a synthesised response. This lightweight RAG system ingests a directory of PDFs, builds a ChromaDB semantic index with HuggingFace embeddings (no API key needed), and answers questions with source citations. By default it runs entirely locally at zero API cost; OpenAI is an optional backend for answer generation.


🏗️ Architecture

PDF Documents
     │
PyPDF2 Text Extraction
     │
Text Chunking (500 tokens, 50 overlap)
     │
HuggingFace Embeddings (all-MiniLM-L6-v2)
     │
ChromaDB Vector Store (local, persistent)
     │
Query → Embed → Similarity Search (top-5)
     │
Context Assembly → Answer Generation
(HuggingFace QA model or OpenAI optional)
     │
Answer + Source Citations
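
The chunking step in the diagram (500 tokens, 50 overlap) can be illustrated with a small sliding-window splitter. This is only a sketch: it treats whitespace-separated words as tokens, which is an assumption, and the function name `chunk_tokens` is hypothetical. The project itself lists RecursiveCharacterTextSplitter for this step, which may count length differently.

```python
def chunk_tokens(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of whitespace-delimited tokens.

    Assumption: a "token" here is a whitespace-separated word, not a
    model tokenizer token.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    tokens = text.split()
    step = chunk_size - overlap  # each window starts this many tokens after the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With the default settings, consecutive chunks share 50 tokens, so a sentence that straddles a boundary is still fully present in at least one chunk.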

📁 Project Structure

rag-knowledge-assistant/
├── src/
│   ├── ingest.py              # PDF loading + chunking + embedding
│   ├── query.py               # Query processing + retrieval
│   ├── rag_pipeline.py        # End-to-end RAG pipeline
│   └── utils.py               # Helpers + text preprocessing
├── rag-knowledge-assistant/   # ChromaDB persistent storage
├── requirements.txt
├── .gitignore
├── LICENSE
└── README.md

🚀 Quick Start

git clone https://github.com/bharghavaram/rag-knowledge-assistant.git
cd rag-knowledge-assistant
pip install -r requirements.txt

# Ingest your PDFs
python src/ingest.py --pdf_dir ./your_pdfs/

# Ask questions
python src/query.py --question "What are the main contributions of the papers?"

No API key required — runs entirely on local HuggingFace models.


🤖 Model & Algorithm Details

| Component | Model | Details |
|---|---|---|
| PDF Parsing | PyPDF2 | Extracts text from all pages |
| Text Chunking | RecursiveCharacterTextSplitter | 500 tokens, 50 token overlap |
| Embeddings | all-MiniLM-L6-v2 (HuggingFace) | 384-dim, 22M params, runs on CPU |
| Vector Store | ChromaDB (local) | Cosine similarity, persistent |
| Retrieval | Top-5 semantic search | Similarity threshold: 0.7 |
| Answer Generation | DistilBERT-QA (local) or OpenAI (optional) | Extractive or generative |
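
To make the retrieval row concrete, the following plain-Python sketch ranks stored vectors by cosine similarity, drops anything under the 0.7 threshold, and keeps the top 5. It is an illustration under stated assumptions, not the project's code or ChromaDB's API: the names `cosine_similarity` and `top_k` are hypothetical, the index is a simple dict, and it assumes the threshold is applied to cosine similarity rather than to a distance.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed, k=5, threshold=0.7):
    """Return up to k (chunk_id, score) pairs with score >= threshold, best first.

    `indexed` is a hypothetical in-memory index: {chunk_id: embedding vector}.
    """
    scored = [(cosine_similarity(query_vec, vec), chunk_id)
              for chunk_id, vec in indexed.items()]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [(chunk_id, round(score, 4)) for score, chunk_id in kept[:k]]
```

Filtering before truncating means a query with few good matches returns fewer than 5 chunks instead of padding the context with weak ones.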

💡 Sample Input → Output

from src.rag_pipeline import RAGPipeline

rag = RAGPipeline()
rag.ingest(pdf_dir="./research_papers/")

result = rag.query("What evaluation metrics were used for RAG systems?")
print(result["answer"])
# → "The papers evaluated RAG systems using RAGAS metrics including Faithfulness, 
#    Answer Relevancy, and Context Recall. BLEU and ROUGE-L were also used for 
#    abstractive tasks across 3 benchmark datasets."
print(result["sources"])
# → [{"file": "selfrag_paper.pdf", "page": 5, "relevance": 0.91}, ...]
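
The `sources` list can be turned into readable citations with a few lines of Python. This sketch assumes each entry has the `file`, `page`, and `relevance` keys shown in the sample output above; the helper name `format_citations` is hypothetical and not part of the project.

```python
def format_citations(sources: list[dict]) -> str:
    """Render source dicts as numbered citation lines.

    Assumes each dict carries "file", "page", and "relevance" keys,
    matching the sample output shape.
    """
    lines = []
    for number, src in enumerate(sources, start=1):
        lines.append(
            f"[{number}] {src['file']}, p. {src['page']} "
            f"(relevance {src['relevance']:.2f})"
        )
    return "\n".join(lines)
```

Calling `print(format_citations(result["sources"]))` after a query would list one citation per retrieved chunk.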

📊 Performance

| Metric | Value |
|---|---|
| Embedding speed | 50 pages/minute (CPU) |
| Query latency | <2 seconds (local) |
| Answer relevance | 0.83 (human evaluation) |
| Supported PDF types | Text, scanned (with OCR flag) |
| Max documents | Unlimited (ChromaDB persistent) |

⚙️ Environment Variables

# Optional — enables OpenAI for better generation
OPENAI_API_KEY=sk-...
CHROMA_PERSIST_DIR=./chroma_db
EMBEDDING_MODEL=all-MiniLM-L6-v2
TOP_K_RESULTS=5
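
One way these variables could be read in Python is sketched below. The `Settings` class and `load_settings` function are hypothetical, and treating the values listed above as fallback defaults is an assumption; the project's own loading code may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openai_api_key: Optional[str]  # None means: stay on local models
    chroma_persist_dir: str
    embedding_model: str
    top_k_results: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables.

    Assumption: the example values from the README act as defaults
    when a variable is unset.
    """
    env = os.environ if env is None else env
    return Settings(
        openai_api_key=env.get("OPENAI_API_KEY") or None,
        chroma_persist_dir=env.get("CHROMA_PERSIST_DIR", "./chroma_db"),
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        top_k_results=int(env.get("TOP_K_RESULTS", "5")),
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise in tests without touching the real environment.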

🧪 Testing · 🗺️ Roadmap · 📄 License

pip install pytest
pytest tests/ -v

Roadmap: REST API wrapper · Web UI (Streamlit) · Multi-format support (DOCX, HTML, MD) · GPU acceleration · Reranking with cross-encoder

MIT License — see LICENSE. Contributions welcome — see CONTRIBUTING.md.
