This project helps you turn any website into a chatbot with high accuracy and speed. Just enter the website link, follow the README steps, and at the end you get a ready-to-use API for your chatbot.
- Crawls every page and subpage
- Automatically extracts PDF, DOC, DOCX, XLS, XLSX, PPT, PPTX, ODT, TXT files
- Continues interrupted crawls from exact stopping point
- Multi-document RAG:
  - Text documents → vector embeddings + docstore
  - Tables → summarized for embeddings; originals stored in the docstore
- Smart retrieval system fetches relevant content for accurate responses
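The table-vs-text routing described above can be sketched in plain Python. This is an illustrative sketch only, not the project's actual API: `fake_embed`, `summarize_table`, and `index_document` are hypothetical stand-ins for the real embedding model and LLM summarizer.

```python
# Sketch of multi-document routing: text is embedded directly, tables are
# summarized before embedding, and originals always land in the docstore.
# All names here are illustrative assumptions, not the project's real code.
import hashlib

def fake_embed(text: str) -> list[float]:
    # Stand-in for a real embedding model: deterministic pseudo-vector.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:8]]

def summarize_table(table: str) -> str:
    # Stand-in for an LLM-generated table summary.
    rows = table.strip().splitlines()
    return f"Table with {len(rows)} rows; header: {rows[0]}"

def index_document(doc: dict, vector_index: list, docstore: dict) -> None:
    """Route a document: tables get a summary embedded, text is embedded
    as-is; the original content is always kept in the docstore."""
    docstore[doc["id"]] = doc["content"]  # original always preserved
    if doc["kind"] == "table":
        embed_target = summarize_table(doc["content"])
    else:
        embed_target = doc["content"]
    vector_index.append((doc["id"], fake_embed(embed_target)))

vector_index, docstore = [], {}
index_document({"id": "t1", "kind": "table",
                "content": "name,price\nfoo,1\nbar,2"}, vector_index, docstore)
index_document({"id": "d1", "kind": "text",
                "content": "Refund policy: 30 days."}, vector_index, docstore)
```

At query time, a hit on the table's summary vector lets the retriever return the full original table from the docstore, which is why the originals are stored separately.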
- Python 3.10+ (tested on 3.13)
- OpenAI API key (for embeddings and chat)
- LangChain API key (for tracing)
- HuggingFace access token (to upload and retrieve scraped content and database files)
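These keys typically go in a local `.env` file. A hedged sketch of what it might contain — the variable names below are assumptions, and `.env.example` in the repo is the authoritative reference:

```shell
OPENAI_API_KEY=...
LANGCHAIN_API_KEY=...
HF_TOKEN=...
```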
```shell
# Clone the repository
git clone https://github.com/ArshCypherZ/web-rag
cd web-rag

# Create a virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
# Note: tested on Arch Linux with Python 3.13; Windows is untested

# Install dependencies
pip install -r requirements.txt
```

Check .env.example for required environment variables.
```shell
python3 crawl.py https://example.com
```

This downloads all documents and content from the website.
```shell
python3 upload.py
```

This uploads the scraped content to your HuggingFace repository.
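A sketch of what an upload step like this might look like. The folder layout, file extensions, and repo id below are assumptions for illustration; `upload.py` itself is the source of truth.

```python
# Hypothetical sketch of the upload step: gather scraped artifacts, then
# push them to the Hub. Extensions and paths here are assumptions.
from pathlib import Path

def collect_upload_files(root: str,
                         exts: tuple = (".txt", ".pdf", ".json")) -> list[str]:
    # Recursively gather scraped files worth pushing to the Hub.
    return sorted(str(p) for p in Path(root).rglob("*") if p.suffix in exts)

# The actual push would use huggingface_hub, roughly:
# from huggingface_hub import HfApi
# HfApi().upload_folder(folder_path="scraped/", repo_id="username/repo-name",
#                       repo_type="dataset", token=os.environ["HF_TOKEN"])
```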
Run the meow.ipynb notebook. This is the core processing step that:
- Creates vector embeddings of all documents
- Generates the database files needed for RAG
- Uploads database files to HuggingFace (as `username/repo-name-dbfiles`)
Recommended: Run on Google Colab for better GPU access and processing power.
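Before embedding, long documents are usually split into overlapping chunks. A minimal sketch of a fixed-size character splitter, assuming a simple sliding window — the notebook may well use LangChain's text splitters instead, so `chunk_text` here is illustrative only:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Slide a window of `size` characters, stepping size - overlap each
    # time, so adjacent chunks share `overlap` characters of context.
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    return [text[i:i + size]
            for i in range(0, max(len(text) - overlap, 1), step)]
```

Overlap matters for RAG quality: a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks.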
First download database files from HuggingFace (check rag.py top comment for details).
```shell
# Local RAG interface
python3 rag.py

# OR start the API server
python3 web_rag.py
# Access at: http://localhost:8000/query?question=your-question
```

If anything isn't working, let us know at Spiral Tech Division or simply open an issue. If you want to contribute, we are happy to review your pull request.
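A small client helper for the query endpoint. The endpoint shape (GET `/query?question=...`) comes from the README; the server must already be running via `web_rag.py`, and `build_query_url` is a hypothetical helper name:

```python
# Build a properly encoded URL for the /query endpoint, then fetch it.
from urllib.parse import urlencode
from urllib.request import urlopen

def build_query_url(question: str,
                    base: str = "http://localhost:8000/query") -> str:
    # URL-encode the question so spaces and punctuation survive transport.
    return f"{base}?{urlencode({'question': question})}"

# With the server running:
# answer = urlopen(build_query_url("What is the refund policy?")).read()
```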
Run it on Google Colab for better processing. The notebook can be stopped and resumed; it tracks progress with SQLite.
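The stop/resume behaviour implies a pattern like the following: record each finished item in SQLite, and skip anything already recorded on restart. This is a minimal sketch — the table and function names are illustrative, not the project's actual schema:

```python
# SQLite-backed progress tracking: survives interruptions because every
# completed item is committed immediately.
import sqlite3

def open_progress(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS done (item TEXT PRIMARY KEY)")
    return conn

def mark_done(conn: sqlite3.Connection, item: str) -> None:
    # PRIMARY KEY + INSERT OR IGNORE makes this safe to call twice.
    conn.execute("INSERT OR IGNORE INTO done VALUES (?)", (item,))
    conn.commit()

def is_done(conn: sqlite3.Connection, item: str) -> bool:
    row = conn.execute("SELECT 1 FROM done WHERE item = ?", (item,)).fetchone()
    return row is not None
```

On resume, the processing loop simply checks `is_done` before each item, so an interrupted run picks up exactly where it stopped.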
I have already built several chatbots with this, including one for Google's Gemini documentation, one for Pipecat's documentation, and one for queries about Graphic Era Deemed to be University. The approach generalizes: for any website you can create a RAG chatbot that is accurate and fast.
Further applications include AI-assisted voice-to-voice apps, e.g. for sales, admission queries, or call centres, which I am currently working on. If you would like to be a part of it, please email arsh0javed@gmail.com with your projects or your resume — doesn't really matter, just flex on my email, will be happy to look :")
Thank You.
