🕷️ Web RAG 🕷️

This project helps you build a chatbot from any website, with a focus on accuracy and speed. Enter the website link, follow the steps in this README, and you end up with a ready-to-use API for your chatbot.

⇝ Working ⇜

Crawls every page and subpage
Automatically extracts PDF, DOC, DOCX, XLS, XLSX, PPT, PPTX, ODT, TXT files
Continues interrupted crawls from exact stopping point
Multi-document RAG:
- Text documents → Vector embeddings + docstore
- Tables → Summarized for embeddings, original stored in docstore
- Smart retrieval system fetches relevant content for accurate responses
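
The table handling above (embed a summary, keep the original in a docstore) can be sketched in plain Python. This is only an illustration of the pattern, not the project's actual code: `toy_embed` is a bag-of-words stand-in for the OpenAI embeddings, and `MultiVectorStore` is a hypothetical name.

```python
import math
import uuid
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MultiVectorStore:
    """Hypothetical store: search over embeddings, return originals."""

    def __init__(self):
        self.vectors = []   # list of (embedding, doc_id)
        self.docstore = {}  # doc_id -> original content

    def add(self, original, text_to_embed=None):
        # Text documents embed themselves; tables embed their summary.
        doc_id = str(uuid.uuid4())
        self.docstore[doc_id] = original
        self.vectors.append((toy_embed(text_to_embed or original), doc_id))
        return doc_id

    def retrieve(self, query, k=1):
        q = toy_embed(query)
        ranked = sorted(self.vectors, key=lambda pair: cosine(q, pair[0]), reverse=True)
        return [self.docstore[doc_id] for _, doc_id in ranked[:k]]


store = MultiVectorStore()
store.add("Admissions open in June for all undergraduate programs.")
table = "program | fee\nB.Tech | 100\nMBA | 200"
store.add(table, text_to_embed="Table of tuition fee per program")
```

A query such as `store.retrieve("tuition fee per program")` matches the summary but hands back the original table, so the chat model sees the full rows rather than the summary.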

⇝ Setup ⇜

Prerequisites

Python 3.10+ (tested on 3.13)
OpenAI API key (for embeddings and chat)
LangChain API key (for tracing)
HuggingFace access token (to upload and retrieve scraped content and database files)

Installation

# Clone the repository
git clone https://github.com/ArshCypherZ/web-rag
cd web-rag

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
# Note: Tested on Arch Linux with Python 3.13; untested on Windows

# Install dependencies
pip install -r requirements.txt

Configuration File

Check .env.example for required environment variables.
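
Before running the scripts, it can help to confirm that the variables are actually set. The sketch below is a minimal check; the variable names in `REQUIRED_VARS` are assumptions for illustration, and the authoritative list is whatever `.env.example` contains.

```python
import os

# Hypothetical names; replace with the keys listed in .env.example.
REQUIRED_VARS = ["OPENAI_API_KEY", "LANGCHAIN_API_KEY", "HF_TOKEN"]


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
print(missing_env_vars({"OPENAI_API_KEY": "sk-test"}))
```

Calling `missing_env_vars()` with no arguments checks the real environment of the current process.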

⇝ Complete Workflow ⇜

Step 1: Crawl a Website

python3 crawl.py https://example.com

This downloads all documents and content from the website.
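
As a rough idea of what one crawl step involves, the sketch below splits the links on a page into same-site pages to visit and documents to download. It is an assumption about the general approach, not a copy of `crawl.py`; `classify_links` and `LinkParser` are hypothetical names, and the extension list mirrors the file types named earlier in this README.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse

DOC_EXTENSIONS = (".pdf", ".doc", ".docx", ".xls", ".xlsx",
                  ".ppt", ".pptx", ".odt", ".txt")


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def classify_links(page_url, html):
    """Return (pages, documents) found on one page, same site only."""
    parser = LinkParser()
    parser.feed(html)
    site = urlparse(page_url).netloc
    pages, documents = set(), set()
    for href in parser.hrefs:
        # Resolve relative links and drop #fragments.
        url, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or parsed.netloc != site:
            continue  # skip mailto:, javascript:, and external sites
        if parsed.path.lower().endswith(DOC_EXTENSIONS):
            documents.add(url)
        else:
            pages.add(url)
    return pages, documents
```

A full crawler would repeat this for every URL in `pages` that has not been visited yet, and persist the visited set so an interrupted run can continue.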

Step 2: Upload Content to HuggingFace

python3 upload.py

This uploads the scraped content to your HuggingFace repository.

Step 3: Process Documents

Run the meow.ipynb notebook. This is the core processing step, which:

Creates vector embeddings of all documents
Generates the database files needed for RAG
Uploads database files to HuggingFace (as username/repo-name-dbfiles)

Recommended: Run on Google Colab for better GPU access and processing power.
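
The final upload in this step could look roughly like the sketch below, which uses `HfApi.create_repo` and `HfApi.upload_folder` from the `huggingface_hub` package. The notebook may do this differently: the function names are hypothetical, `repo_type="dataset"` is an assumption, and the `-dbfiles` suffix is inferred from the naming shown above.

```python
from pathlib import Path


def dbfiles_repo_id(content_repo_id):
    """Derive the database repo name, e.g. username/repo-name-dbfiles."""
    return f"{content_repo_id}-dbfiles"


def upload_db_files(folder, content_repo_id, token, repo_type="dataset"):
    """Upload a local folder of database files to HuggingFace."""
    # Imported lazily so dbfiles_repo_id works without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi(token=token)
    repo_id = dbfiles_repo_id(content_repo_id)
    # exist_ok=True makes re-runs safe when the repo already exists.
    api.create_repo(repo_id=repo_id, repo_type=repo_type, exist_ok=True)
    api.upload_folder(folder_path=str(Path(folder)), repo_id=repo_id, repo_type=repo_type)
    return repo_id
```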

Step 4: Use RAG System

First, download the database files from HuggingFace (see the comment at the top of rag.py for details).

# Local RAG interface
python3 rag.py

# OR start API server
python3 web_rag.py
# Access at: http://localhost:8000/query?question=your-question
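
Once the API server is running, the endpoint can also be called from a script. The client below is a small sketch using only the standard library; `build_query_url` and `ask` are hypothetical helpers, and since the response format of `web_rag.py` is not documented here, the code returns parsed JSON when possible and raw text otherwise.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(question, base_url="http://localhost:8000"):
    """URL-encode the question into the /query endpoint."""
    return f"{base_url}/query?{urlencode({'question': question})}"


def ask(question, base_url="http://localhost:8000", timeout=60):
    """Send one question to the running server and return its answer."""
    with urlopen(build_query_url(question, base_url), timeout=timeout) as response:
        body = response.read().decode("utf-8")
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return body  # plain-text response
```

Encoding the question with `urlencode` matters because spaces and punctuation such as `?` are not valid inside a raw query string.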

Additional Information

If something is not working, let us know at Spiral Tech Division or open an issue. Contributions are welcome, and we are happy to review pull requests.

Running the notebook on Colab is recommended for faster processing. The notebook can be stopped and resumed because it tracks progress with SQLite.
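
Stop-and-resume with SQLite usually comes down to recording each finished item and skipping it on the next run. The sketch below shows that idea under assumed names (`processed` table, `process_pending` function); the notebook's real schema may differ.

```python
import sqlite3


def open_progress_db(path=":memory:"):
    """Open (or create) the progress database."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS processed (doc_id TEXT PRIMARY KEY)")
    return conn


def process_pending(conn, doc_ids, handler):
    """Run handler on every document not yet recorded as processed."""
    done = {row[0] for row in conn.execute("SELECT doc_id FROM processed")}
    newly_done = []
    for doc_id in doc_ids:
        if doc_id in done:
            continue  # finished in an earlier run
        handler(doc_id)
        # Commit after each document so a crash loses at most one item.
        conn.execute("INSERT INTO processed (doc_id) VALUES (?)", (doc_id,))
        conn.commit()
        newly_done.append(doc_id)
    return newly_done
```

Passing a file path instead of `:memory:` keeps the progress table across notebook restarts.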

I have already built several chatbots with this project, including a documentation chatbot for Google's Gemini, a documentation chatbot for Pipecat, and a chatbot for queries about Graphic Era Deemed to be University. The same approach applies to any website: crawl it, build the RAG database, and serve answers grounded in the site's own content.

Further applications could include AI-assisted voice-to-voice apps for sales, admission queries, or call centres, which I am currently working on. If you would like to be part of it, email arsh0javed@gmail.com with your projects or your resume. Either is fine, and I will be happy to take a look :")

Thank You.
