DanielWay17/Light-RAG-POC

πŸ” LightRAG Web Crawling System

A modern Retrieval-Augmented Generation (RAG) system that combines web crawling capabilities with intelligent question answering using LightRAG, Gemini LLM, and Nomic embeddings.


🎯 Features

  • πŸ•·οΈ Web Crawling: Extract content from any webpage using Firecrawl API
  • πŸ€– AI-Powered Q&A: Ask questions about crawled content in Vietnamese with HoΓ ng HΓ  Mobile customer service style
  • πŸ“š Knowledge Management: Automatic document indexing and deduplication
  • πŸ’Ύ Persistent Storage: Auto-save crawled content and prevent duplicate processing
  • πŸ”„ Real-time Processing: Streamlined crawl-to-query workflow
  • 🎨 User-Friendly UI: Clean Streamlit interface for easy interaction

πŸ—οΈ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Streamlit UI  │────│  FastAPI Backend │────│   LightRAG Core β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                       β”‚                       β”‚
         β”‚              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”              β”‚
         β”‚              β”‚ Firecrawl Clientβ”‚              β”‚
         β”‚              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β”‚
         β”‚                       β”‚                       β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚ User    β”‚          β”‚ Document Storageβ”‚    β”‚ Vector Database β”‚
    β”‚ Input   β”‚          β”‚ (SAVED_DATA/)   β”‚    β”‚ + Graph Store   β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ› οΈ Technology Stack

Core Components

  • LightRAG: Advanced RAG framework with graph-based knowledge representation
  • Gemini 2.0 Flash: Google's latest LLM for text generation
  • Nomic Embeddings: 768-dimensional embeddings (nomic-embed-text-v1.5)
  • Firecrawl: Professional web crawling and content extraction

Backend & Frontend

  • FastAPI: High-performance API framework with automatic OpenAPI documentation
  • Streamlit: Interactive web interface for user interactions
  • Pydantic: Data validation and serialization

πŸ“¦ Installation

Prerequisites

  • Python 3.x with venv support
  • API keys for Gemini, Nomic, and Firecrawl (see Environment Variables below)

Setup Steps

  1. Clone the repository

    git clone <repository-url>
    cd Test-LightRAG
  2. Create virtual environment

    python -m venv venv
    # Windows
    venv\Scripts\activate
    # Linux/Mac
    source venv/bin/activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Environment configuration: create a .env file in the backend/ directory, copy the contents of .env.example into it, and fill in your actual API keys.
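
Based on the Environment Variables table below, the resulting backend/.env might look like this (placeholder values shown):

```
GEMINI_API_KEY=your-gemini-api-key
NOMIC_API_KEY=your-nomic-api-key
FIRECRAWL_API_KEY=your-firecrawl-api-key
```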

Quick Start

1. Start the Backend Server

cd backend
uvicorn main:app --reload --host 0.0.0.0 --port 8000

2. Launch the Web Interface

cd ui
streamlit run app.py --server.port 8581

3. Access the Application

  • Streamlit UI: http://localhost:8581
  • API docs (OpenAPI): http://localhost:8000/docs

πŸ“š Usage Guide

Web Crawling & Knowledge Building

  1. Enter any webpage URL in the crawling section
  2. Click "πŸ•·οΈ Crawl & Auto-Insert"
  3. The system will:
    • Extract content using Firecrawl
    • Save markdown files to SAVED_DATA/
    • Automatically insert into the RAG knowledge base
    • Index documents to prevent duplicates

Intelligent Q&A

  1. Type your question in Vietnamese in the Q&A section
  2. Click "πŸ€” Ask"
  3. Get AI-powered answers in HoΓ ng HΓ  Mobile customer service style (You can customize the persona in backend/rag_pipeline/llm.py)
  4. Responses include:
    • Product recommendations
    • Pricing information
    • Technical specifications
    • Polite, professional Vietnamese communication

πŸ—οΈ Project Structure

Test-LightRAG/
β”œβ”€β”€ πŸ“ backend/                    # FastAPI backend application
β”‚   β”œβ”€β”€ πŸ“ rag_pipeline/           # Modular RAG components
β”‚   β”‚   β”œβ”€β”€ __init__.py           # Module exports
β”‚   β”‚   β”œβ”€β”€ config.py             # API keys & configuration
β”‚   β”‚   β”œβ”€β”€ embeddings.py         # Nomic embedding functions
β”‚   β”‚   β”œβ”€β”€ llm.py               # Gemini LLM integration
β”‚   β”‚   β”œβ”€β”€ storage.py           # File operations & indexing
β”‚   β”‚   └── rag_pipeline.py      # Main RAG pipeline class
β”‚   β”œβ”€β”€ πŸ“ models/                # Pydantic data models
β”‚   β”‚   └── schemas.py           # API request/response schemas
β”‚   β”œβ”€β”€ main.py                  # FastAPI application entry point
β”‚   β”œβ”€β”€ firecrawl_client.py      # Web crawling client
β”‚   β”œβ”€β”€ test_rag_pipeline.py     # Pipeline testing script
β”‚   └── .env                     # Environment variables
β”œβ”€β”€ πŸ“ ui/                        # Streamlit web interface
β”‚   └── app.py                   # Main UI application
β”œβ”€β”€ πŸ“ SAVED_DATA/               # Auto-saved crawled content
β”‚   └── *.md                     # Markdown documents
β”œβ”€β”€ requirements.txt             # Python dependencies
└── README.md                   # This documentation

πŸ”Œ API Reference

Endpoints

POST /crawl

Crawl a webpage and automatically insert content into RAG.

Request:

{
  "url": "https://hoanghamobile.com/dien-thoai-di-dong"
}

Response:

{
  "success": true,
  "data": {
    "crawled_docs": 1,
    "inserted_count": 1,
    "docs": [...]
  }
}

POST /query

Query the RAG system with a question.

Request:

{
  "question": "CΓ³ laptop nΓ o tαΊ§m 15 triệu khΓ΄ng?"
}

Response:

{
  "success": true,
  "data": {
    "answer": "DαΊ‘ cΓ³ αΊ‘, shop cΓ³ mα»™t sα»‘ laptop trong tαΊ§m giΓ‘ 15 triệu..."
  }
}

GET /health

Check system health status.

Response:

{
  "success": true,
  "data": {
    "status": "healthy",
    "services": {
      "firecrawl_client": true,
      "rag_pipeline": true
    }
  }
}

βš™οΈ Configuration

Environment Variables

Variable            Description                       Required
GEMINI_API_KEY      Google Gemini API key             βœ…
NOMIC_API_KEY       Nomic embeddings API key          βœ…
FIRECRAWL_API_KEY   Firecrawl web scraping API key    βœ…
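
A fail-fast loader for these three variables might look like the following. This is a sketch only; the actual backend/rag_pipeline/config.py may be structured differently:

```python
import os

REQUIRED_KEYS = ("GEMINI_API_KEY", "NOMIC_API_KEY", "FIRECRAWL_API_KEY")

def load_config() -> dict:
    """Read the required API keys from the environment, failing fast if any is missing."""
    missing = [k for k in REQUIRED_KEYS if not os.getenv(k)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {k: os.environ[k] for k in REQUIRED_KEYS}
```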

Customization Options

Modify AI Persona

Edit backend/rag_pipeline/llm.py to customize:

  • System prompts
  • Response style
  • Business context
  • Language preferences

Adjust RAG Parameters

Modify backend/rag_pipeline/rag_pipeline.py:

  • Embedding dimensions
  • Search parameters (top_k, enable_rerank)
  • Query modes (naive, local, global, hybrid)
