A Retrieval-Augmented Generation (RAG) system built over a Dune universe corpus collected from the Dune Wiki (https://dune.fandom.com/wiki/Dune_Wiki).
This project implements an end-to-end pipeline for domain-specific question answering by combining semantic search with language model generation.
Instead of relying purely on an LLM’s internal knowledge, this system retrieves relevant context from a curated Dune corpus before generating answers.
This enables:
- grounded responses
- domain-aware answers
- reduced hallucination
User Query → Query Embedding → FAISS Vector Search → Relevant Context Retrieval → Prompt Augmentation → LLM Generation → Final Answer
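
The query-time flow above can be sketched as a chain of small functions. This is only an illustration under loud assumptions: the bag-of-words `embed` stands in for a Sentence Transformers model, the brute-force `retrieve` stands in for the FAISS lookup, `generate` is a placeholder for the LLM call, and the three sample chunks and all function names are hypothetical rather than taken from `main.py`.

```python
import math
import re
from collections import Counter

# Toy corpus standing in for the chunked Dune articles (illustrative text only).
CHUNKS = [
    "Arrakis is a desert planet and the only source of the spice melange.",
    "The Fremen are the native people of Arrakis.",
    "House Atreides is a Great House led by Duke Leto.",
]


def embed(text):
    """Bag-of-words counts: an assumed stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, k=2):
    """Brute-force top-k search: an assumed stand-in for the FAISS lookup."""
    q = embed(query)
    ranked = sorted(CHUNKS, key=lambda chunk: cosine(q, embed(chunk)), reverse=True)
    return ranked[:k]


def build_prompt(query, context):
    """Prompt augmentation: place the retrieved chunks ahead of the question."""
    joined = "\n".join(f"- {chunk}" for chunk in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


def generate(prompt):
    """Placeholder for the LLM call; it only reports the prompt size."""
    return f"[LLM answer based on a {len(prompt)}-character prompt]"


def answer(query):
    """Query -> retrieve -> augment -> generate."""
    return generate(build_prompt(query, retrieve(query)))
```

Calling `retrieve("Who are the Fremen?", k=1)` ranks the Fremen chunk first, because it shares the most tokens with the query. A real embedding model would match on meaning rather than on exact words.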
| File | Purpose |
|---|---|
| getallarticles.py | Collects raw Dune text corpus |
| extract.py | Cleans and prepares text |
| embed.py | Generates vector embeddings |
| main.py | Query → Retrieve → Generate pipeline |
| faiss_index.index | Vector similarity index |
| faiss_metadata.pkl | Chunk metadata |
| requirements.txt | Dependencies |
Articles from the Dune universe are collected and preprocessed.
Text chunks are converted into semantic vector representations.
Embeddings are stored using FAISS for efficient similarity search.
User queries are matched against the vector store.
Relevant context is injected into the LLM prompt to produce grounded responses.
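
The indexing side of these steps can be sketched as two position-aligned stores: one list of vectors and one list of chunk metadata, where row `i` of each describes the same chunk. This layout is an assumption about how a vector index and a pickled metadata file might relate, not a description of the project's actual files. The letter-frequency `toy_vector` is a deliberately crude stand-in for a real embedding, and the list-based `search` stands in for FAISS. All names are hypothetical.

```python
import math
import pickle
import string


def toy_vector(text):
    """26-dim letter-frequency vector, L2-normalised (stand-in for a real embedding)."""
    lowered = text.lower()
    counts = [lowered.count(ch) for ch in string.ascii_lowercase]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]


def build_store(chunks):
    """Return (vectors, metadata); row i of each list refers to the same chunk."""
    vectors, metadata = [], []
    for source, text in chunks:
        vectors.append(toy_vector(text))
        metadata.append({"source": source, "text": text})
    return vectors, metadata


def search(vectors, query, k=1):
    """Return the row ids of the k vectors with the highest inner product."""
    q = toy_vector(query)
    scores = [sum(x * y for x, y in zip(q, v)) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]


vectors, metadata = build_store([
    ("Arrakis", "Arrakis is a desert planet."),
    ("Fremen", "The Fremen are the native people of Arrakis."),
])

# Round-trip the metadata through pickle, as it would be when saved and reloaded.
restored = pickle.loads(pickle.dumps(metadata))

# The search returns row ids; the metadata list turns a row id back into text.
row = search(vectors, "Arrakis is a desert planet.")[0]
print(restored[row]["source"], "->", restored[row]["text"])
```

Keeping the two stores aligned by position means the index only needs to hold vectors, while titles, URLs, or chunk text can live in the metadata and be looked up by row id after the search.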
- Built an end-to-end pipeline combining semantic search with LLM-based generation for grounded responses.
- Implemented document retrieval using FAISS for efficient similarity search over high-dimensional vectors.
- Used Sentence Transformers to map text into a semantic vector space.
- Injected retrieved context into prompts to improve answer quality.
- Collected and cleaned a domain-specific corpus using BeautifulSoup and structured parsing.
- Split documents into chunks with LangChain text splitters so that each chunk is a suitable unit for retrieval.
- Handled external HTTP calls with requests / urllib.
- Used python-dotenv to keep configuration and secrets out of the source code.
- Provided a Streamlit-based interface for real-time querying.
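
The splitting step mentioned above can be illustrated with a plain sliding window. This is a minimal sketch, not the LangChain splitter the project uses: it cuts on character positions only, and the function name `split_text` and its `chunk_size` and `overlap` parameters are hypothetical.

```python
def split_text(text, chunk_size=200, overlap=40):
    """Split text into fixed-size character chunks that overlap their neighbours."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_text("abcdefghij", chunk_size=4, overlap=1))
# → ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last `overlap` characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is short enough. Splitters that respect paragraph and sentence boundaries usually produce cleaner chunks than this fixed window.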
Install dependencies:
pip install -r requirements.txt