📅 Period: Oct 2023 – Nov 2023 | Author: Bharghava Ram Vemuri
--- 🎯 Problem Statement
Traditional document search returns keyword matches — not answers. A researcher with 50 PDFs cannot ask "What did all papers say about transformer attention mechanisms?" and get a synthesised response. This lightweight RAG system lets you upload any PDFs, builds a ChromaDB semantic index with HuggingFace embeddings (no API key needed), and answers questions with source citations — running entirely locally at zero API cost.
PDF Documents
│
PyPDF2 Text Extraction
│
Text Chunking (500 tokens, 50 overlap)
│
HuggingFace Embeddings (all-MiniLM-L6-v2)
│
ChromaDB Vector Store (local, persistent)
│
Query → Embed → Similarity Search (top-5)
│
Context Assembly → Answer Generation
(HuggingFace QA model or OpenAI optional)
│
Answer + Source Citations
rag-knowledge-assistant/
├── src/
│ ├── ingest.py # PDF loading + chunking + embedding
│ ├── query.py # Query processing + retrieval
│ ├── rag_pipeline.py # End-to-end RAG pipeline
│ └── utils.py # Helpers + text preprocessing
├── rag-knowledge-assistant/ # ChromaDB persistent storage
├── requirements.txt
├── .gitignore
├── LICENSE
└── README.md
git clone https://github.com/bharghavaram/rag-knowledge-assistant.git
cd rag-knowledge-assistant
pip install -r requirements.txt
# Ingest your PDFs
python src/ingest.py --pdf_dir ./your_pdfs/
# Ask questions
python src/query.py --question "What are the main contributions of the papers?"No API key required — runs entirely on local HuggingFace models.
| Component | Model | Details |
|---|---|---|
| PDF Parsing | PyPDF2 | Extracts text from all pages |
| Text Chunking | RecursiveCharacterTextSplitter | 500 tokens, 50 token overlap |
| Embeddings | all-MiniLM-L6-v2 (HuggingFace) | 384-dim, 22M params, runs on CPU |
| Vector Store | ChromaDB (local) | Cosine similarity, persistent |
| Retrieval | Top-5 semantic search | Similarity threshold: 0.7 |
| Answer Generation | DistilBERT-QA (local) or OpenAI (optional) | Extractive or generative |
from src.rag_pipeline import RAGPipeline
rag = RAGPipeline()
rag.ingest(pdf_dir="./research_papers/")
result = rag.query("What evaluation metrics were used for RAG systems?")
print(result["answer"])
# → "The papers evaluated RAG systems using RAGAS metrics including Faithfulness,
# Answer Relevancy, and Context Recall. BLEU and ROUGE-L were also used for
# abstractive tasks across 3 benchmark datasets."
print(result["sources"])
# → [{"file": "selfrag_paper.pdf", "page": 5, "relevance": 0.91}, ...]| Metric | Value |
|---|---|
| Embedding speed | 50 pages/minute (CPU) |
| Query latency | <2 seconds (local) |
| Answer relevance | 0.83 (human evaluation) |
| Supported PDF types | Text, scanned (with OCR flag) |
| Max documents | Unlimited (ChromaDB persistent) |
# Optional — enables OpenAI for better generation
OPENAI_API_KEY=sk-...
CHROMA_PERSIST_DIR=./chroma_db
EMBEDDING_MODEL=all-MiniLM-L6-v2
TOP_K_RESULTS=5pip install pytest
pytest tests/ -vRoadmap: REST API wrapper · Web UI (Streamlit) · Multi-format support (DOCX, HTML, MD) · GPU acceleration · Reranking with cross-encoder
MIT License — see LICENSE. Contributions welcome — see CONTRIBUTING.md.