DanielWay17/Light-RAG-POC

πŸ” LightRAG Web Crawling System

A modern Retrieval-Augmented Generation (RAG) system that combines web crawling capabilities with intelligent question answering using LightRAG, Gemini LLM, and Nomic embeddings.


🎯 Features

  • πŸ•·οΈ Web Crawling: Extract content from any webpage using Firecrawl API
  • πŸ€– AI-Powered Q&A: Ask questions about crawled content in Vietnamese with HoΓ ng HΓ  Mobile customer service style
  • πŸ“š Knowledge Management: Automatic document indexing and deduplication
  • πŸ’Ύ Persistent Storage: Auto-save crawled content and prevent duplicate processing
  • πŸ”„ Real-time Processing: Streamlined crawl-to-query workflow
  • 🎨 User-Friendly UI: Clean Streamlit interface for easy interaction

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Streamlit UI  │────│  FastAPI Backend │────│   LightRAG Core β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                       β”‚                       β”‚
         β”‚              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”‚
         β”‚              β”‚ Firecrawl Clientβ”‚              β”‚
         β”‚              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β”‚
         β”‚                       β”‚                       β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚ User    β”‚          β”‚ Document Storageβ”‚    β”‚ Vector Database β”‚
    β”‚ Input   β”‚          β”‚ (SAVED_DATA/)   β”‚    β”‚ + Graph Store   β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ› οΈ Technology Stack

Core Components

  • LightRAG: Advanced RAG framework with graph-based knowledge representation
  • Gemini 2.0 Flash: Google's latest LLM for text generation
  • Nomic Embeddings: 768-dimensional embeddings (nomic-embed-text-v1.5)
  • Firecrawl: Professional web crawling and content extraction

Backend & Frontend

  • FastAPI: High-performance API framework with automatic OpenAPI documentation
  • Streamlit: Interactive web interface for user interactions
  • Pydantic: Data validation and serialization

πŸ“¦ Installation

Prerequisites

  • Python 3.x with venv support
  • API keys for Gemini, Nomic, and Firecrawl (see Environment Variables below)

Setup Steps

  1. Clone the repository

    git clone <repository-url>
    cd Test-LightRAG
  2. Create virtual environment

    python -m venv venv
    # Windows
    venv\Scripts\activate
    # Linux/Mac
    source venv/bin/activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Environment configuration: create a .env file in the backend/ directory, copy the contents of .env.example into it, and fill in your actual API keys.
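
Based on the Environment Variables table below, the resulting backend/.env might look like this (placeholder values shown):

```
GEMINI_API_KEY=your-gemini-api-key
NOMIC_API_KEY=your-nomic-api-key
FIRECRAWL_API_KEY=your-firecrawl-api-key
```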

Quick Start

1. Start the Backend Server

cd backend
uvicorn main:app --reload --host 0.0.0.0 --port 8000

2. Launch the Web Interface

cd ui
streamlit run app.py --server.port 8581

3. Access the Application

  • Streamlit UI: http://localhost:8581
  • API docs (OpenAPI): http://localhost:8000/docs

πŸ“š Usage Guide

Web Crawling & Knowledge Building

  1. Enter any webpage URL in the crawling section
  2. Click "πŸ•·οΈ Crawl & Auto-Insert"
  3. The system will:
    • Extract content using Firecrawl
    • Save markdown files to SAVED_DATA/
    • Automatically insert into the RAG knowledge base
    • Index documents to prevent duplicates

Intelligent Q&A

  1. Type your question in Vietnamese in the Q&A section
  2. Click "πŸ€” Ask"
  3. Get AI-powered answers in HoΓ ng HΓ  Mobile customer service style (You can customize the persona in backend/rag_pipeline/llm.py)
  4. Responses include:
    • Product recommendations
    • Pricing information
    • Technical specifications
    • Polite, professional Vietnamese communication

πŸ—οΈ Project Structure

Test-LightRAG/
β”œβ”€β”€ πŸ“ backend/                    # FastAPI backend application
β”‚   β”œβ”€β”€ πŸ“ rag_pipeline/           # Modular RAG components
β”‚   β”‚   β”œβ”€β”€ __init__.py           # Module exports
β”‚   β”‚   β”œβ”€β”€ config.py             # API keys & configuration
β”‚   β”‚   β”œβ”€β”€ embeddings.py         # Nomic embedding functions
β”‚   β”‚   β”œβ”€β”€ llm.py               # Gemini LLM integration
β”‚   β”‚   β”œβ”€β”€ storage.py           # File operations & indexing
β”‚   β”‚   └── rag_pipeline.py      # Main RAG pipeline class
β”‚   β”œβ”€β”€ πŸ“ models/                # Pydantic data models
β”‚   β”‚   └── schemas.py           # API request/response schemas
β”‚   β”œβ”€β”€ main.py                  # FastAPI application entry point
β”‚   β”œβ”€β”€ firecrawl_client.py      # Web crawling client
β”‚   β”œβ”€β”€ test_rag_pipeline.py     # Pipeline testing script
β”‚   └── .env                     # Environment variables
β”œβ”€β”€ πŸ“ ui/                        # Streamlit web interface
β”‚   └── app.py                   # Main UI application
β”œβ”€β”€ πŸ“ SAVED_DATA/               # Auto-saved crawled content
β”‚   └── *.md                     # Markdown documents
β”œβ”€β”€ requirements.txt             # Python dependencies
└── README.md                   # This documentation

πŸ”Œ API Reference

Endpoints

POST /crawl

Crawl a webpage and automatically insert content into RAG.

Request:

{
  "url": "https://hoanghamobile.com/dien-thoai-di-dong"
}

Response:

{
  "success": true,
  "data": {
    "crawled_docs": 1,
    "inserted_count": 1,
    "docs": [...]
  }
}

POST /query

Query the RAG system with a question.

Request:

{
  "question": "CΓ³ laptop nΓ o tαΊ§m 15 triệu khΓ΄ng?"
}

Response:

{
  "success": true,
  "data": {
    "answer": "DαΊ‘ cΓ³ αΊ‘, shop cΓ³ mα»™t sα»‘ laptop trong tαΊ§m giΓ‘ 15 triệu..."
  }
}

GET /health

Check system health status.

Response:

{
  "success": true,
  "data": {
    "status": "healthy",
    "services": {
      "firecrawl_client": true,
      "rag_pipeline": true
    }
  }
}

βš™οΈ Configuration

Environment Variables

Variable            Description                       Required
GEMINI_API_KEY      Google Gemini API key             βœ…
NOMIC_API_KEY       Nomic embeddings API key          βœ…
FIRECRAWL_API_KEY   Firecrawl web scraping API key    βœ…
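
A fail-fast loader for these three variables might look like the following. This is a sketch only; the actual backend/rag_pipeline/config.py may be structured differently:

```python
import os

REQUIRED_KEYS = ("GEMINI_API_KEY", "NOMIC_API_KEY", "FIRECRAWL_API_KEY")

def load_config() -> dict:
    """Read the required API keys from the environment, failing fast if any is missing."""
    missing = [k for k in REQUIRED_KEYS if not os.getenv(k)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {k: os.environ[k] for k in REQUIRED_KEYS}
```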

Customization Options

Modify AI Persona

Edit backend/rag_pipeline/llm.py to customize:

  • System prompts
  • Response style
  • Business context
  • Language preferences

Adjust RAG Parameters

Modify backend/rag_pipeline/rag_pipeline.py:

  • Embedding dimensions
  • Search parameters (top_k, enable_rerank)
  • Query modes (naive, local, global, hybrid)
