A Retrieval-Augmented Generation (RAG) system that processes and indexes documentation from the Krypt blockchain project, enabling intelligent document retrieval and question-answering capabilities.
This project implements a RAG pipeline that:
- Ingests multiple document formats (PDF, Markdown, CSV, plain text)
- Processes documents using LangChain and sentence transformers
- Stores embeddings in a ChromaDB vector store
- Enables semantic search and retrieval-augmented generation
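The chunk, embed, store, and search steps above can be sketched with the standard library alone. This is an illustrative toy, not the project's implementation: `embed` uses plain word counts as a stand-in for a sentence-transformers model, `InMemoryStore` stands in for ChromaDB, and the file names and chunk sizes are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative)."""
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + chunk_size]) for i in range(0, last_start, step)]


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryStore:
    """Stand-in for a vector store; keeps (vector, chunk, source) tuples in a list."""

    def __init__(self):
        self.items = []

    def add(self, chunks, source):
        for chunk in chunks:
            self.items.append((embed(chunk), chunk, source))

    def search(self, query, k=2):
        """Return the k chunks most similar to the query, with their sources."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [(chunk, source) for _, chunk, source in ranked[:k]]


# Hypothetical documents and file names, for illustration only.
store = InMemoryStore()
store.add(chunk_text("The smart contract records each Ethereum transaction on chain."), "report.pdf")
store.add(chunk_text("The React frontend connects to a wallet and sends transactions."), "frontend.md")
top = store.search("How does the frontend connect to a wallet?", k=1)
print(top)
```

In a retrieval-augmented setup, the chunks returned by `search` would be placed into the prompt sent to the LLM; a real embedding model captures meaning beyond the exact word overlap this toy relies on.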
Key features:
- Multi-format Document Support: Processes PDF, Markdown, CSV, and text files
- Vector Embeddings: Uses sentence-transformers for semantic understanding
- Persistent Vector Store: ChromaDB for efficient similarity searches
- LLM Integration: Groq API integration for generative tasks
- Modular Architecture: Clean separation of concerns for extensibility
Technology stack:
- LangChain: Document processing and RAG orchestration
- ChromaDB: Vector database for embeddings storage
- Sentence Transformers: Semantic embedding generation
- Groq: LLM API for generative responses
- PyMuPDF & PyPDF: PDF document parsing
- FAISS: CPU-based similarity search
RAG_model/
├── data/
│ ├── notion/ # Notion exports (.md, .csv)
│ ├── pdf/ # PDF documents
│ ├── text_files/ # Text files
│ └── vector_store/ # ChromaDB persisted data
├── notebook/
│ ├── document.ipynb # Main processing notebook
│ └── pdf_loader.ipynb # PDF loading utilities
├── src/ # Source code modules
├── main.py # Entry point
├── requirements.txt # Project dependencies
└── README.md
- Clone the repository
    git clone <repository-url>
    cd RAG_model
- Create a virtual environment (Python 3.13+)
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies
    pip install -r requirements.txt
- Set up environment variables
    cp .env.example .env
    # Add your Groq API key and other configuration
Run the entry point:
    python main.py
Notebooks in notebook/:
- document.ipynb: Main data processing pipeline
- pdf_loader.ipynb: PDF extraction and loading utilities
The system processes documents from the data/ directories:
- PDFs from data/pdf/
- Notion exports from data/notion/
- Text files from data/text_files/
Vector embeddings are stored in data/vector_store/ using ChromaDB.
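As a rough sketch of how the input directories above could be scanned before loading, the snippet below collects files per subdirectory using only `pathlib`. The `SOURCES` mapping and the `discover_documents` helper are hypothetical names, and the `.txt` suffix for `text_files/` is an assumption; `vector_store/` is skipped because it holds output rather than input.

```python
from pathlib import Path

# Assumed mapping of data/ subdirectories to file suffixes (not taken from the project code).
SOURCES = {
    "pdf": {".pdf"},
    "notion": {".md", ".csv"},
    "text_files": {".txt"},
}


def discover_documents(data_dir):
    """Return {subdirectory: sorted list of matching file paths} under data_dir."""
    data_dir = Path(data_dir)
    found = {}
    for subdir, suffixes in SOURCES.items():
        root = data_dir / subdir
        if root.is_dir():
            files = [p for p in root.rglob("*") if p.is_file() and p.suffix.lower() in suffixes]
        else:
            files = []  # A missing directory is treated as empty rather than an error.
        found[subdir] = sorted(files)
    return found


if __name__ == "__main__":
    for subdir, files in discover_documents("data").items():
        print(f"{subdir}: {len(files)} file(s)")
```

Each discovered path could then be handed to whichever loader suits its format.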
Update the .env file with:
GROQ_API_KEY=your_api_key_here
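One possible way to read this setting at startup is sketched below with the standard library only. `load_env_file` and `get_groq_api_key` are hypothetical helpers, not functions from this project; a library such as python-dotenv is a common alternative, though whether the project uses it is not stated here.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines into os.environ without overriding existing values."""
    env_path = Path(path)
    if not env_path.is_file():
        return
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not a KEY=VALUE pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


def get_groq_api_key():
    """Return GROQ_API_KEY, or raise if it is missing or still the placeholder value."""
    key = os.environ.get("GROQ_API_KEY", "")
    if not key or key == "your_api_key_here":
        raise RuntimeError("GROQ_API_KEY is not set; add it to .env")
    return key
```

Failing early with a clear message avoids a less obvious authentication error later, when the first LLM request is made.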
See requirements.txt for the full list. The main dependencies are:
- langchain and related LangChain ecosystem packages
- chromadb (vector database)
- sentence-transformers (embeddings)
- pymupdf & pypdf (PDF processing)
- faiss-cpu (similarity search)
This project indexes the Krypt blockchain project documentation:
- React + Solidity decentralized application for Ethereum transactions
- Smart contract code and frontend implementation details
- Project report and technical specifications
Planned enhancements:
- Support for additional document formats (DOCX, HTML)
- Real-time document updates
- Multi-language support
- Advanced query expansion techniques
- Hybrid search (semantic + keyword)
- Web UI for document exploration
MIT License
Tejomai V