A Retrieval-Augmented Generation (RAG) pipeline built on top of Wikipedia. This project fetches and parses Wikipedia articles, builds a FAISS vector index from their content, and enables semantic question answering powered by LLMs via Groq — all from the command line.
MyWikiRAG/
├── wiki.py # Wikipedia API wrapper (search, fetch, summarize)
├── wiki_parser.py # Parses & chunks raw Wikipedia content into docs
├── rag_corpus_builder.py # Builds & deduplicates the full RAG corpus → saves to JSON
├── rag_pipeline_nlp.py # Embeds corpus, builds FAISS index, runs interactive Q&A
├── main.py # Demo script to test Wikipedia API functions
├── rag_corpus_nlp.json # Pre-built corpus of Wikipedia article chunks
├── faiss_wikipedia_index/ # Saved FAISS vector index (auto-generated)
└── requirements.txt # All Python dependencies
- Python 3.9+
- A Groq API key for LLM-powered answers
pip install -r requirements.txt
⚠️ Always activate your virtual environment before installing or running anything.
# Create the venv
python -m venv venv
# Activate — Windows (PowerShell)
venv\Scripts\activate
# Activate — Linux / macOS
source venv/bin/activate
# Install all dependencies
pip install -r requirements.txt
📌 Never move the venv/ folder after creating it. It uses absolute paths and will break if relocated. If you move the project, delete venv/ and recreate it.
Create a .env file in the project root:
GROQ_API_KEY=your_groq_api_key_here
Or export it directly in your terminal:
# Windows (PowerShell)
$env:GROQ_API_KEY = "your_api_key_here"
# Windows (Command Prompt)
set GROQ_API_KEY=your_api_key_here
# Linux / macOS
export GROQ_API_KEY=your_api_key_here
Run the files in this exact order:
The Wikipedia API wrapper. Used as a module by other scripts. Provides:
| Function | Description |
|---|---|
| search_articles(query) | Search Wikipedia by keyword |
| get_article_summary(title) | Fetch a short article summary |
| get_article_content(title) | Fetch full or intro-only article text |
| get_article_section(title) | Fetch section headers of an article |
| get_todays_featured_article() | Fetch today's Wikipedia featured article |
| get_random_article() | Fetch a random article summary |
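As a rough illustration of what a wrapper function such as get_article_summary might do internally, here is a minimal sketch. It assumes the public Wikimedia REST summary endpoint and uses only the standard library so it runs without extra installs, whereas the project itself lists requests for its API calls. The helper names summary_url, parse_summary, and fetch_summary, and the User-Agent string, are hypothetical and not taken from wiki.py.

```python
import json
import urllib.parse
import urllib.request

# Public Wikimedia REST endpoint for page summaries.
SUMMARY_ENDPOINT = "https://en.wikipedia.org/api/rest_v1/page/summary/"


def summary_url(title: str) -> str:
    """Build the summary URL: spaces become underscores, the rest is percent-encoded."""
    return SUMMARY_ENDPOINT + urllib.parse.quote(title.replace(" ", "_"), safe="")


def parse_summary(payload: dict) -> dict:
    """Keep only the fields a RAG corpus is likely to need."""
    return {
        "title": payload.get("title", ""),
        "summary": payload.get("extract", ""),
    }


def fetch_summary(title: str, timeout: float = 10.0) -> dict:
    """Fetch and parse one article summary (requires network access)."""
    request = urllib.request.Request(
        summary_url(title),
        headers={"User-Agent": "MyWikiRAG-example/0.1"},  # hypothetical UA string
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_summary(json.load(response))
```

Keeping URL building and response parsing separate from the network call makes both easy to check offline; only fetch_summary touches the network.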
Parses and cleans raw Wikipedia API responses into structured document chunks. Used as a module by rag_corpus_builder.py.
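The chunking idea can be sketched as follows. This is not the code in wiki_parser.py: the function name chunk_text, the character-based sliding window, the default sizes, and the dictionary keys are all assumptions chosen for illustration.

```python
def chunk_text(text: str, title: str, source: str,
               max_chars: int = 500, overlap: int = 50) -> list[dict]:
    """Split cleaned text into overlapping character windows with metadata."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must be non-negative and smaller than max_chars")

    # Collapse runs of whitespace (newlines, tabs, double spaces) into single spaces.
    cleaned = " ".join(text.split())
    step = max_chars - overlap

    chunks = []
    for start in range(0, len(cleaned), step):
        chunks.append({
            "title": title,
            "source": source,
            "chunk_id": len(chunks),
            "text": cleaned[start:start + max_chars],
        })
        # Stop once the current window has reached the end of the text.
        if start + max_chars >= len(cleaned):
            break
    return chunks
```

The overlap means a sentence cut at a chunk boundary still appears intact in at least one neighbouring chunk, which tends to help retrieval. A real parser might split on sentence or section boundaries instead.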
A demo script to verify that the Wikipedia API wrapper is working correctly.
python main.py
Prints search results, summaries, random articles, and section headers to the console.
Loops through a predefined list of AI/ML topics, fetches Wikipedia content for each, deduplicates the chunks, and saves everything to rag_corpus_nlp.json.
python rag_corpus_builder.py
What it does:
- Fetches search results, article content, summaries, and sections for 20+ topics
- Deduplicates documents (drops chunks shorter than 50 characters and chunks that have already been seen)
- Appends today's featured article
- Saves the final corpus to rag_corpus_nlp.json
⏭️ Skip this step if rag_corpus_nlp.json already exists in the repo, since it ships pre-built.
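The deduplication rule described above (drop chunks under 50 characters, drop anything already seen) could look roughly like this. It is a sketch under assumptions: the "text" key, the hash-based fingerprint, and the case and whitespace normalization are illustrative choices, not necessarily what rag_corpus_builder.py does.

```python
import hashlib

MIN_CHARS = 50  # minimum chunk length stated in this README


def deduplicate(docs: list[dict]) -> list[dict]:
    """Drop chunks that are too short or whose normalized text was already seen."""
    seen = set()
    unique = []
    for doc in docs:
        text = doc.get("text", "").strip()  # "text" key is an assumed field name
        if len(text) < MIN_CHARS:
            continue
        # Fingerprint the lowercased, whitespace-collapsed text so trivial
        # formatting differences do not produce duplicate chunks.
        normalized = " ".join(text.lower().split())
        fingerprint = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        unique.append(doc)
    return unique
```

Storing a fixed-size hash rather than the full text keeps the seen set small even for a large corpus.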
The full end-to-end RAG pipeline. Loads the corpus, builds or loads the FAISS vector index, and starts an interactive Q&A assistant.
python rag_pipeline_nlp.py
What it does:
- Loads rag_corpus_nlp.json and converts it to LangChain Document objects
- Embeds documents using sentence-transformers/all-MiniLM-L6-v2
- Builds a FAISS index on first run (or loads from faiss_wikipedia_index/ if it exists)
- Runs a RetrievalQA chain using Groq's LLM
- Starts an interactive terminal loop where you can ask any AI/ML question
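The build-or-load step can be sketched as below. This is an assumption-laden illustration rather than the contents of rag_pipeline_nlp.py: the helper names index_exists and load_or_build_index are hypothetical, and the sketch relies on LangChain's FAISS wrapper writing index.faiss and index.pkl (its default file names) into the target folder.

```python
from pathlib import Path

INDEX_DIR = "faiss_wikipedia_index"
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"


def index_exists(index_dir: str = INDEX_DIR) -> bool:
    """True when both files written by FAISS.save_local are present."""
    folder = Path(index_dir)
    return (folder / "index.faiss").is_file() and (folder / "index.pkl").is_file()


def load_or_build_index(docs, index_dir: str = INDEX_DIR):
    """Load a saved index if present; otherwise embed `docs` and save a new one."""
    # Imported lazily so index_exists() works without the heavy dependencies.
    from langchain_community.vectorstores import FAISS
    from langchain_huggingface import HuggingFaceEmbeddings

    embeddings = HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL)
    if index_exists(index_dir):
        # Recent langchain-community releases require this flag because the
        # docstore is pickled; only load index folders you created yourself.
        return FAISS.load_local(
            index_dir, embeddings, allow_dangerous_deserialization=True
        )
    store = FAISS.from_documents(docs, embeddings)
    store.save_local(index_dir)
    return store
```

Checking for both files, rather than just the folder, avoids trying to load a half-written index after an interrupted first run.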
Example session:
AI/ML knowledge Assistant - type 'exit' to quit
Your question: What is a Transformer model?
Answer: A Transformer is a deep learning architecture based on self-attention...
📚 Context Sources:
[1] Transformer Models | wikipedia_summary
[2] Large Language Model | wikipedia_content
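The numbered source list in the session above could be produced by a small formatter along these lines. The function name format_sources and the metadata keys "title" and "source" are assumptions for illustration; the actual keys depend on how the corpus documents are built.

```python
def format_sources(metadata_list: list[dict]) -> str:
    """Render retrieved-chunk metadata as numbered 'title | source' lines."""
    lines = []
    for number, meta in enumerate(metadata_list, start=1):
        title = meta.get("title", "Unknown")    # assumed metadata key
        source = meta.get("source", "unknown")  # assumed metadata key
        lines.append(f"[{number}] {title} | {source}")
    return "\n".join(lines)
```

With a LangChain retriever, the input would typically be the metadata dictionaries of the returned documents.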
| Package | Purpose |
|---|---|
| requests | Wikipedia REST API calls |
| sentence-transformers | Text embeddings (all-MiniLM-L6-v2) |
| faiss-cpu | Vector similarity search & indexing |
| langchain | RAG chain orchestration |
| langchain-groq | Groq LLM integration |
| langchain-huggingface | HuggingFace embeddings in LangChain |
| langchain-community | FAISS vector store integration |
| groq | Groq API client |
| torch + transformers | Underlying model support |
| nltk | Text preprocessing |
| python-dotenv | Load API keys from .env file |
- The first run of rag_pipeline_nlp.py embeds all documents and builds the FAISS index, which may take a few minutes depending on your hardware.
- Subsequent runs skip embedding and load directly from faiss_wikipedia_index/, making startup much faster.
- The pre-built rag_corpus_nlp.json and faiss_wikipedia_index/ are included in the repo, so you can jump straight to Step 5 (rag_pipeline_nlp.py) for a quick test.
- The assistant is specialized in AI and ML topics, based on the topic list defined in rag_corpus_builder.py. Add more topics to that list to expand its knowledge.
This project is open source. Feel free to fork and build on it.