Dune RAG Chatbot

A Retrieval-Augmented Generation (RAG) system built over a corpus of Dune universe articles collected from the Dune Wiki (https://dune.fandom.com/wiki/Dune_Wiki).

This project implements an end-to-end pipeline for domain-specific question answering by combining semantic search with language model generation.

Overview

Instead of relying purely on an LLM’s internal knowledge, this system retrieves relevant context from a curated Dune corpus before generating answers.

This enables:

grounded responses
domain-aware answers
reduced hallucination

Architecture

User Query → Query Embedding → FAISS Vector Search → Relevant Context Retrieval → Prompt Augmentation → LLM Generation → Final Answer

Project Structure

File	Purpose
`getallarticles.py`	Collects raw Dune text corpus
`extract.py`	Cleans and prepares text
`embed.py`	Generates vector embeddings
`main.py`	Query → Retrieve → Generate pipeline
`faiss_index.index`	Vector similarity index
`faiss_metadata.pkl`	Chunk metadata
`requirements.txt`	Dependencies

Pipeline

1. Corpus Creation

Articles from the Dune universe are collected and preprocessed.
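The cleaning step can be illustrated with a small sketch. The project itself lists BeautifulSoup for parsing; the version below swaps in Python's standard-library `html.parser` so it runs without extra dependencies. The class name `ParagraphExtractor`, the helper `html_to_paragraphs`, and the sample HTML are all hypothetical and are not taken from `getallarticles.py` or `extract.py`.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text inside <p> tags and ignores <script>/<style> content."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished, whitespace-normalized paragraphs
        self._buffer = []      # text pieces of the paragraph currently being read
        self._in_p = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "p" and self._in_p:
            # Collapse runs of whitespace and drop empty paragraphs.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_p = False

    def handle_data(self, data):
        if self._in_p and self._skip_depth == 0:
            self._buffer.append(data)


def html_to_paragraphs(html):
    """Returns the cleaned paragraph texts found in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical sample page, not real wiki markup.
sample = (
    "<html><body><p>Arrakis is a <b>desert</b> planet.</p>"
    "<script>var x = 1;</script><p>  </p>"
    "<p>Spice is\n harvested there.</p></body></html>"
)
print(html_to_paragraphs(sample))
# ['Arrakis is a desert planet.', 'Spice is harvested there.']
```

Real wiki pages also carry navigation boxes, infoboxes, and edit links, so an actual cleaner would likely need extra filtering beyond paragraph tags.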

2. Embedding

Text chunks are converted into semantic vector representations.
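To make the idea of "text in, fixed-length vector out" concrete, here is a toy stand-in. It is not a semantic embedding: it hashes words into buckets and normalizes the counts, so it only captures word overlap, whereas a sentence-embedding model captures meaning. The function name `toy_embed` and the dimension of 64 are assumptions chosen for illustration and do not reflect what `embed.py` does.

```python
import hashlib
import math
import re

DIM = 64  # illustrative only; sentence-embedding models typically use hundreds of dimensions


def toy_embed(text, dim=DIM):
    """Hashes each lowercase word into one of `dim` buckets and L2-normalizes the counts."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    # An empty text has no tokens, so its vector stays all zeros.
    return [v / norm for v in vec] if norm else vec


chunks = ["Arrakis is a desert planet.", "The Fremen live in the deep desert."]
vectors = [toy_embed(c) for c in chunks]
print(len(vectors), len(vectors[0]))  # 2 64
```

Whatever model produces the vectors, the property the later steps depend on is the same: every chunk and every query maps to a vector of the same length, so they can be compared directly.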

3. Storage

Embeddings are stored using FAISS for efficient similarity search.
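The repository keeps two artifacts for this step, `faiss_index.index` and `faiss_metadata.pkl`, which suggests that vectors and chunk metadata are stored side by side and linked by row position. The sketch below illustrates that pairing in plain Python rather than with FAISS; the class `FlatVectorStore` and its methods are hypothetical and only show the bookkeeping, not an efficient index.

```python
import pickle


class FlatVectorStore:
    """Keeps vectors and their chunk metadata in two row-aligned lists."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = []   # row i holds the embedding of chunk i
        self.metadata = []  # row i holds the metadata of chunk i

    def add(self, vector, meta):
        """Appends one vector with its metadata and returns the new row id."""
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
        self.vectors.append(list(vector))
        self.metadata.append(meta)
        return len(self.vectors) - 1

    def to_bytes(self):
        """Serializes the whole store so it can be written to disk."""
        state = {"dim": self.dim, "vectors": self.vectors, "metadata": self.metadata}
        return pickle.dumps(state)

    @classmethod
    def from_bytes(cls, blob):
        # Only unpickle data you created yourself; pickle can run code on load.
        state = pickle.loads(blob)
        store = cls(state["dim"])
        store.vectors = state["vectors"]
        store.metadata = state["metadata"]
        return store


store = FlatVectorStore(dim=3)
store.add([1.0, 0.0, 0.0], {"title": "Arrakis", "text": "Arrakis is a desert planet."})
store.add([0.0, 1.0, 0.0], {"title": "Fremen", "text": "The Fremen live in the deep desert."})

restored = FlatVectorStore.from_bytes(store.to_bytes())
print(len(restored.vectors), restored.metadata[1]["title"])  # 2 Fremen
```

Keeping the row order identical in both structures is what lets a search result, which is just a row number, be turned back into readable chunk text.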

4. Retrieval

User queries are matched against the vector store.
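Matching a query against the store amounts to a nearest-neighbour search. As a minimal sketch, the code below does an exhaustive cosine-similarity search in plain Python; it illustrates the idea and is not the FAISS call used in `main.py`. The function names, the three-dimensional vectors, and the chunk labels are made up for the example.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, vectors, k=2):
    """Returns (row_index, score) pairs for the k most similar vectors, best first."""
    scored = ((i, cosine(query_vec, v)) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical stored vectors and the chunks they belong to (same row order).
vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
chunks = ["chunk about Arrakis", "chunk about the Fremen", "chunk about both"]

hits = top_k([1.0, 0.1, 0.0], vectors, k=2)
print([chunks[i] for i, _ in hits])
# ['chunk about Arrakis', 'chunk about both']
```

An exhaustive scan like this is linear in the number of chunks, which is the cost a dedicated similarity-search library is meant to reduce on larger corpora.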

5. Generation

Relevant context is injected into the LLM prompt to produce grounded responses.
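The injection step is plain string assembly, sketched below. The prompt wording, the `max_chars` budget, and the function name `build_prompt` are assumptions; the actual template in `main.py` may differ, and the call to the language model is left out because the README does not say which provider is used.

```python
def build_prompt(question, chunks, max_chars=2000):
    """Formats retrieved chunks and the user question into a single prompt string."""
    context_parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk.strip()}"
        if used + len(entry) > max_chars:
            break  # stay within a rough context-size budget
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = ["Arrakis is a desert planet.", "Spice is harvested there."]
prompt = build_prompt("Where is spice harvested?", retrieved)
print(prompt)
```

Numbering the chunks makes it possible to ask the model to cite which passage supports its answer, and the instruction to admit missing context is one common way to discourage unsupported answers.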

Skills Demonstrated

Retrieval-Augmented Generation (RAG)

Built an end-to-end pipeline combining semantic search with LLM-based generation for grounded responses.

Vector Databases & Similarity Search

Implemented document retrieval using FAISS for efficient high-dimensional search.

Embedding-Based NLP

Used Sentence Transformers to map text into a semantic vector space.

LLM-Oriented System Design

Integrated retrieval context into prompts for improved answer quality.

Data Ingestion & Processing

Collected and cleaned a domain-specific corpus using BeautifulSoup and structured parsing.

Text Chunking & Preprocessing

Applied LangChain text splitters to break articles into chunks sized for retrieval.
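Chunking with overlap can be sketched as a sliding window over the text. This is a simplified fixed-size character splitter, not the LangChain splitter the project uses; the function name `chunk_text` and the default sizes are assumptions for illustration.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Splits text into windows of `chunk_size` characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice. Splitting on characters can cut words in half, which is one reason splitters that prefer paragraph and sentence boundaries are often used instead.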

API Integration

Made external HTTP calls using requests and urllib.

Environment & Config Management

Used python-dotenv to load configuration such as API keys from the environment instead of hard-coding it.
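A minimal sketch of this pattern follows. `load_dotenv` is the python-dotenv entry point; the helper `get_setting` and the variable names `DUNE_RAG_DEMO_KEY` and `DUNE_RAG_DEMO_MISSING` are hypothetical, since the README does not list the settings the project actually reads. The import is wrapped in `try` so the example also runs where python-dotenv is not installed.

```python
import os

try:
    from dotenv import load_dotenv  # provided by the python-dotenv package
    load_dotenv()  # copies KEY=VALUE pairs from a .env file into os.environ, if one is found
except ImportError:
    pass  # fall back to variables that are already set in the environment


def get_setting(name, default=None, required=False):
    """Reads one setting from the environment, optionally failing if it is absent."""
    value = os.environ.get(name, default)
    if required and not value:
        raise RuntimeError(f"Missing required setting: {name}")
    return value


# Hypothetical variable, set here only so the example is self-contained.
os.environ["DUNE_RAG_DEMO_KEY"] = "not-a-real-key"
print(get_setting("DUNE_RAG_DEMO_KEY", required=True))  # not-a-real-key
print(get_setting("DUNE_RAG_DEMO_MISSING", default="fallback"))  # fallback
```

Keeping the `.env` file out of version control is what makes this approach safer than writing keys into the source files.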

Interactive Deployment

Provided a Streamlit-based interface for interactive querying.

How to Run

Install dependencies:

pip install -r requirements.txt
