- 🕷️ Web Crawling: Extract content from any webpage using the Firecrawl API
- 🤖 AI-Powered Q&A: Ask questions about crawled content in Vietnamese and get answers in the style of Hoàng Hà Mobile customer service
- Knowledge Management: Automatic document indexing and deduplication
- 💾 Persistent Storage: Auto-save crawled content and prevent duplicate processing
- Real-time Processing: Streamlined crawl-to-query workflow
- 🎨 User-Friendly UI: Clean Streamlit interface for easy interaction

    ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
    │  Streamlit UI   │────│ FastAPI Backend │────│  LightRAG Core  │
    └─────────────────┘    └─────────────────┘    └─────────────────┘
             │                      │                      │
             │             ┌─────────────────┐             │
             │             │ Firecrawl Client│             │
             │             └─────────────────┘             │
             │                      │                      │
        ┌─────────┐        ┌─────────────────┐    ┌─────────────────┐
        │  User   │        │ Document Storage│    │ Vector Database │
        │  Input  │        │  (SAVED_DATA/)  │    │  + Graph Store  │
        └─────────┘        └─────────────────┘    └─────────────────┘

- LightRAG: Advanced RAG framework with graph-based knowledge representation
- Gemini 2.0 Flash: Google LLM used for text generation
- Nomic Embeddings: 768-dimensional embeddings (nomic-embed-text-v1.5)
- Firecrawl: Professional web crawling and content extraction
- FastAPI: High-performance API framework with automatic OpenAPI documentation
- Streamlit: Interactive web interface for user interactions
- Pydantic: Data validation and serialization
- Python 3.12
- API keys for Google Gemini, Nomic, and Firecrawl

1. Clone the repository

       git clone <repository-url>
       cd Test-LightRAG

2. Create a virtual environment

       python -m venv venv
       # Windows
       venv\Scripts\activate
       # Linux/Mac
       source venv/bin/activate

3. Install dependencies

       pip install -r requirements.txt

4. Configure the environment: create a `.env` file in the `backend/` directory, copy the contents of `.env.example` into it, and fill in your actual API keys.
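
Before starting the backend, it can help to confirm that the keys are actually set. The snippet below is a minimal sketch, not part of the project: `REQUIRED_KEYS` and `missing_keys` are hypothetical names, and the variable names are taken from the configuration table in this README. It only inspects a mapping (by default the process environment), so it assumes the values from `.env` have already been loaded by whatever mechanism the backend uses.

```python
import os

# Variable names as listed in this README's configuration table.
REQUIRED_KEYS = ("GEMINI_API_KEY", "NOMIC_API_KEY", "FIRECRAWL_API_KEY")


def missing_keys(env=None):
    """Return the required variable names that are unset or blank in `env`.

    `env` defaults to the process environment; pass a dict to check other sources.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not str(env.get(name, "")).strip()]
```

Calling `missing_keys()` returns an empty list when all three keys are present, and otherwise the names that still need to be filled in.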
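Start the FastAPI backend:

    cd backend
    uvicorn main:app --reload --host 0.0.0.0 --port 8000

In a second terminal, from the repository root, start the Streamlit UI:

    cd ui
    streamlit run app.py --server.port 8581

- Web UI: http://localhost:8581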
- API Documentation: http://localhost:8000/docs
- Health Check: http://localhost:8000/health
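
The health check URL above can also be polled from a script, for example before running a batch of crawls. The following is a rough sketch using only the standard library. It assumes the response shape shown in this README's health-check example (`success`, `data.status`, `data.services`); the function names are hypothetical, and treating an empty `services` map as unhealthy is an assumption of this sketch.

```python
import json
from urllib.request import urlopen

HEALTH_URL = "http://localhost:8000/health"  # health check URL from this README


def is_healthy(payload):
    """True only if the payload reports success, a healthy status, and every service up."""
    if not isinstance(payload, dict):
        return False
    data = payload.get("data") or {}
    services = data.get("services") or {}
    return (
        payload.get("success") is True
        and data.get("status") == "healthy"
        and bool(services)  # assumption: no reported services counts as unhealthy
        and all(services.values())
    )


def check_backend(url=HEALTH_URL, timeout=5.0):
    """Fetch the health endpoint and interpret it; returns False if unreachable."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return is_healthy(json.load(response))
    except (OSError, ValueError):  # connection errors or a non-JSON body
        return False
```

`check_backend()` needs the backend to be running; `is_healthy` can be exercised on its own with a decoded JSON payload.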
- Enter any webpage URL in the crawling section
- Click "🕷️ Crawl & Auto-Insert"
- The system will:
  - Extract content using Firecrawl
  - Save markdown files to `SAVED_DATA/`
  - Automatically insert into the RAG knowledge base
  - Index documents to prevent duplicates
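
The duplicate-prevention step can be pictured as a small content-hash index. The code below is an illustrative sketch only: the project's real logic lives in `backend/rag_pipeline/storage.py` and is not shown here, so `DocumentIndex`, `content_hash`, and the JSON index file are hypothetical. The idea is to fingerprint each crawled markdown document and skip insertion when the fingerprint has been seen before.

```python
import hashlib
import json
from pathlib import Path


def content_hash(markdown):
    """Stable fingerprint of a document's text (leading/trailing whitespace ignored)."""
    return hashlib.sha256(markdown.strip().encode("utf-8")).hexdigest()


class DocumentIndex:
    """Hypothetical JSON-backed index mapping content hashes to source URLs."""

    def __init__(self, index_path):
        self.index_path = Path(index_path)
        if self.index_path.exists():
            self.entries = json.loads(self.index_path.read_text(encoding="utf-8"))
        else:
            self.entries = {}

    def add_if_new(self, url, markdown):
        """Record the document and return True, or return False if it was already indexed."""
        digest = content_hash(markdown)
        if digest in self.entries:
            return False
        self.entries[digest] = url
        # Persist after every insert so a restart does not reprocess old documents.
        self.index_path.parent.mkdir(parents=True, exist_ok=True)
        self.index_path.write_text(json.dumps(self.entries, indent=2), encoding="utf-8")
        return True
```

A crawl handler would call `add_if_new(url, markdown)` and only pass the document on to the RAG insert step when it returns `True`. Hashing the content rather than the URL means the same page reached through two different URLs is still treated as a duplicate.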
- Type your question in Vietnamese in the Q&A section
- Click "🤖 Ask"
- Get AI-powered answers in Hoàng Hà Mobile customer service style (you can customize the persona in `backend/rag_pipeline/llm.py`)
- Responses include:
  - Product recommendations
  - Pricing information
  - Technical specifications
  - Polite, professional Vietnamese communication
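
The same question flow can be driven without the UI. The sketch below builds the JSON request by hand with the standard library, assuming the request and response shapes from this README's API examples (`{"question": ...}` in, `data.answer` out). The route path is not stated in this section, so the full endpoint URL is left as a parameter; `build_question_request` and `ask` are hypothetical helper names.

```python
import json
from urllib.request import Request, urlopen


def build_question_request(endpoint_url, question):
    """Build a JSON POST request carrying {"question": ...}."""
    # ensure_ascii=False keeps Vietnamese text readable in the request body.
    body = json.dumps({"question": question}, ensure_ascii=False).encode("utf-8")
    return Request(
        endpoint_url,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def ask(endpoint_url, question, timeout=60.0):
    """Send the question and return the answer text, or None if the call did not succeed."""
    request = build_question_request(endpoint_url, question)
    with urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    if payload.get("success"):
        return payload.get("data", {}).get("answer")
    return None
```

Pass the full URL of the query route as `endpoint_url`; the exact path can be looked up in the API documentation at http://localhost:8000/docs.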

    Test-LightRAG/
    ├── backend/                     # FastAPI backend application
    │   ├── rag_pipeline/            # Modular RAG components
    │   │   ├── __init__.py          # Module exports
    │   │   ├── config.py            # API keys & configuration
    │   │   ├── embeddings.py        # Nomic embedding functions
    │   │   ├── llm.py               # Gemini LLM integration
    │   │   ├── storage.py           # File operations & indexing
    │   │   └── rag_pipeline.py      # Main RAG pipeline class
    │   ├── models/                  # Pydantic data models
    │   │   └── schemas.py           # API request/response schemas
    │   ├── main.py                  # FastAPI application entry point
    │   ├── firecrawl_client.py      # Web crawling client
    │   ├── test_rag_pipeline.py     # Pipeline testing script
    │   └── .env                     # Environment variables
    ├── ui/                          # Streamlit web interface
    │   └── app.py                   # Main UI application
    ├── SAVED_DATA/                  # Auto-saved crawled content
    │   └── *.md                     # Markdown documents
    ├── requirements.txt             # Python dependencies
    └── README.md                    # This documentation

Crawl a webpage and automatically insert content into RAG.

Request:

    {
      "url": "https://hoanghamobile.com/dien-thoai-di-dong"
    }

Response:

    {
      "success": true,
      "data": {
        "crawled_docs": 1,
        "inserted_count": 1,
        "docs": [...]
      }
    }

Query the RAG system with a question.

Request:

    {
      "question": "Có laptop nào tầm 15 triệu không?"
    }

Response:

    {
      "success": true,
      "data": {
        "answer": "Dạ có ạ, shop có một số laptop trong tầm giá 15 triệu..."
      }
    }

Check system health status.

Response:

    {
      "success": true,
      "data": {
        "status": "healthy",
        "services": {
          "firecrawl_client": true,
          "rag_pipeline": true
        }
      }
    }

| Variable | Description | Required |
|---|---|---|
| `GEMINI_API_KEY` | Google Gemini API key | Yes |
| `NOMIC_API_KEY` | Nomic embeddings API key | Yes |
| `FIRECRAWL_API_KEY` | Firecrawl web scraping API key | Yes |

Edit backend/rag_pipeline/llm.py to customize:
- System prompts
- Response style
- Business context
- Language preferences
Modify backend/rag_pipeline/rag_pipeline.py to adjust:
- Embedding dimensions
- Search parameters (top_k, enable_rerank)
- Query modes (naive, local, global, hybrid)
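
When experimenting with these settings, it may help to keep them in one validated place rather than scattered through the pipeline. The dataclass below is a hypothetical sketch, not the project's code and not LightRAG's own API: `QuerySettings` is an invented name, the default values are placeholders, and the allowed modes are simply the four listed above.

```python
from dataclasses import dataclass

# Query modes as listed in this README.
QUERY_MODES = ("naive", "local", "global", "hybrid")


@dataclass(frozen=True)
class QuerySettings:
    """Hypothetical container for the tunable search parameters."""

    mode: str = "hybrid"         # placeholder default, not the project's value
    top_k: int = 10              # placeholder default, not the project's value
    enable_rerank: bool = False  # placeholder default, not the project's value

    def __post_init__(self):
        # Fail early on typos instead of sending an invalid mode to the pipeline.
        if self.mode not in QUERY_MODES:
            raise ValueError(f"mode must be one of {QUERY_MODES}, got {self.mode!r}")
        if self.top_k < 1:
            raise ValueError("top_k must be a positive integer")
```

For example, `QuerySettings(mode="local", top_k=5)` yields an immutable settings object whose fields can then be handed to the query call in `rag_pipeline.py`, while `QuerySettings(mode="fast")` raises `ValueError`.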