A practice-oriented multi-user Agentic RAG document QA system. The current main entry point is a FastAPI web UI with document upload, hybrid retrieval, reranking, LangGraph workflows, multi-turn memory, and conversation isolation after document-library changes.
This project is more than a simple "vector search + LLM answer" demo. It already has a product-like shape and focuses on:
- Private document QA
- Context understanding across multi-turn conversations
- Preventing stale context after document-library changes
- Multi-user document and chat isolation
- More explainable retrieval behavior
Current capabilities:
- User registration and login
- PDF / Markdown upload
- PDF parsing with LlamaParse / PyMuPDF4LLM
- Parent-child chunking
- Dense + sparse hybrid retrieval
- Cross-encoder reranker
- LangGraph-based Agentic RAG workflow
- CRAG-style retrieval-quality grading
- Persistent chat history and summary memory
- Document-version-aware thread refresh strategy
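To make the "dense + sparse hybrid retrieval" item concrete, here is a minimal sketch of one common way to merge two ranked result lists: reciprocal rank fusion (RRF). This README does not state which fusion method the project uses, so RRF itself, the helper name `rrf_fuse`, and the constant `k=60` are illustrative assumptions rather than a description of the actual code.

```python
from collections import defaultdict


def rrf_fuse(dense_ids, sparse_ids, k=60):
    """Merge two ranked lists of chunk ids with reciprocal rank fusion.

    Hypothetical helper for illustration; not taken from the project code.
    """
    scores = defaultdict(float)
    for ranking in (dense_ids, sparse_ids):
        for rank, chunk_id in enumerate(ranking, start=1):
            # Items near the top of either list contribute a larger score.
            scores[chunk_id] += 1.0 / (k + rank)
    # Best fused score first; ties are broken by id to keep output stable.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense = ["c2", "c1", "c4"]   # ids ranked by a dense (embedding) search
sparse = ["c1", "c3", "c2"]  # ids ranked by a sparse (keyword) search
fused = rrf_fuse(dense, sparse)
print(fused)
```

Chunks that appear in both lists rise to the top, which is the behavior a reranker would then refine.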
This project is based on and extended from:
- Original repository: GiovanniPasq/agentic-rag-for-dummies
Major changes in this version include:
- FastAPI UI and multi-user login system
- Document-version-aware chat thread refresh
- Answer-basis display and intermediate status cards
- CRAG-style retrieval grading
- Better upload workflow and document management
- Improved Docker and environment-variable loading
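As a rough illustration of what "CRAG-style retrieval grading" can mean, the sketch below classifies a retrieval result from per-chunk relevance scores and returns a label that a workflow could route on. How the project actually scores chunks is not described in this README, so the function name `grade_retrieval`, the thresholds, and the labels are assumptions.

```python
def grade_retrieval(scores, upper=0.7, lower=0.3):
    """Classify a retrieval result from relevance scores in [0, 1].

    Hypothetical helper; thresholds are arbitrary illustrative values.
    """
    if not scores:
        return "incorrect"
    best = max(scores)
    if best >= upper:
        return "correct"    # at least one chunk looks clearly relevant
    if best <= lower:
        return "incorrect"  # every chunk looks irrelevant
    return "ambiguous"      # uncertain, e.g. rewrite the query and retry


print(grade_retrieval([0.9, 0.2]))
print(grade_retrieval([0.5, 0.4]))
print(grade_retrieval([0.1]))
```

A graph node could use the returned label to decide between answering, retrying retrieval, or reporting that the library has no relevant content.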
The system can be viewed in 5 layers:
- Web: FastAPI pages and APIs
- Core: chat, document management, user state, and RAG assembly
- Agent: LangGraph graph, nodes, edges, tools, and prompts
- Storage: Qdrant, parent store, and local chat state
- Model: external OpenAI-compatible LLM plus Ollama embeddings
Typical flow:
Upload Document
-> PDF/Markdown preprocessing
-> parent/child chunking
-> child chunks into Qdrant
-> parent chunks into parent_store
-> documents_version bump
-> user asks a question
-> LLM Router
-> LangGraph document QA flow
-> retrieval / rerank / grading / answer
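The ingestion half of the flow above can be sketched as follows. This is a simplified, character-based split under stated assumptions: the helper name `split_parent_child` and both chunk sizes are hypothetical, and the project's own logic in `document_chunker.py` may differ.

```python
import uuid


def split_parent_child(text, parent_size=1200, child_size=300):
    """Split text into large parent chunks and smaller child chunks.

    Each child keeps the id of its parent, so a search hit on a child
    can be expanded to the full parent passage at answer time.
    Sizes are in characters and are illustrative values only.
    """
    parents, children = {}, []
    for p_start in range(0, len(text), parent_size):
        parent_text = text[p_start:p_start + parent_size]
        parent_id = str(uuid.uuid4())
        parents[parent_id] = parent_text
        for c_start in range(0, len(parent_text), child_size):
            children.append({
                "parent_id": parent_id,
                "text": parent_text[c_start:c_start + child_size],
            })
    return parents, children


parents, children = split_parent_child("x" * 2500)
print(len(parents), len(children))
```

In terms of the flow, the `children` list corresponds to what would be embedded and written to Qdrant, while the `parents` mapping corresponds to the parent store.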
.
├─ project/
│ ├─ app.py
│ ├─ config.py
│ ├─ document_chunker.py
│ ├─ core/
│ ├─ db/
│ ├─ rag_agent/
│ ├─ ui/
│ ├─ Dockerfile
│ ├─ README.md
│ └─ README_CN.md
├─ data/ # local runtime user data
├─ qdrant_db/ # local Qdrant storage
├─ requirements.txt
├─ README.md
└─ README_CN.md
project/README.md focuses more on implementation and development details, while this README focuses on setup and usage.
- Recommended Python 3.12
- Compatible with Python 3.11+
- uv is recommended for environment management
- An accessible OpenAI-compatible LLM API
- A local or remote Ollama service
- The embedding model nomic-embed-text
Recommended preparation:
ollama serve
ollama pull nomic-embed-text
Configuration priority:
- Process environment variables
- project/.env
- Default values in project/config.py
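The lookup order can be sketched in a few lines. This assumes the list is ordered from highest to lowest priority; the helper names `parse_dotenv` and `resolve_setting` and the values in `DEFAULTS` are placeholders for illustration, not the contents of `project/config.py`.

```python
import os

# Placeholder defaults for illustration; the real ones live in project/config.py.
DEFAULTS = {"APP_HOST": "127.0.0.1", "APP_PORT": "7860"}


def parse_dotenv(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_setting(name, dotenv_values, environ=os.environ):
    """Return the first value found: process env, then .env, then default."""
    if name in environ:
        return environ[name]
    if name in dotenv_values:
        return dotenv_values[name]
    return DEFAULTS.get(name)


dotenv_values = parse_dotenv("APP_PORT=8000\n# comment\nAPP_HOST=0.0.0.0\n")
fake_env = {"APP_PORT": "9000"}  # stands in for the real process environment
print(resolve_setting("APP_PORT", dotenv_values, fake_env))  # process env wins
print(resolve_setting("APP_HOST", dotenv_values, fake_env))  # falls back to .env
```

Under this ordering, a value exported in the shell or passed to a container overrides the same key in `project/.env`.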
Key settings:
- LLM_MODEL
- LLM_BASE_URL
- LLM_API_KEY
- DENSE_MODEL
- DENSE_VECTOR_SIZE
- SPARSE_MODEL
- OLLAMA_HOST
- APP_HOST
- APP_PORT
- APP_AUTO_RELOAD
- CROSS_ENCODER_LOCAL_FILES_ONLY
First, copy project/.env.example to project/.env:
Copy-Item project\.env.example project\.env
Then fill in your own LLM_API_KEY and other local settings.
uv python install 3.12
uv venv --python 3.12 .venv
Activate the virtual environment:
.\.venv\Scripts\Activate.ps1
Install dependencies:
uv pip install torch==2.4.1+cpu --extra-index-url https://download.pytorch.org/whl/cpu
uv pip install -r requirements.txt
Notes:
- Install torch first, because torch==2.4.1+cpu comes from the PyTorch CPU wheel index and can conflict with multi-index dependency resolution.
- If you are not using Python 3.12, choose a Torch version compatible with your interpreter first.
- When in doubt, use the official PyTorch install selector for your Python version and CPU / CUDA environment.
Prepare environment variables:
Copy-Item project\.env.example project\.env
Start the app:
cd project
python app.py
Alternatively, install and run with plain pip:
pip install -r requirements.txt
cd project
python app.py
Default URL:
http://127.0.0.1:7860
- Register and log in
- Upload PDF or Markdown files on the Documents page
- Choose one of the upload modes:
  - Supplement Current Topic
  - Start New Topic
- Return to the chat page and ask questions
The UI can show answer-basis hints such as:
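- 回答依据:模型直接生成 (Answer basis: generated directly by the model)
- 回答依据:当前文档列表 (Answer basis: current document list)
- 回答依据:文档库概览 (Answer basis: document library overview)
- 回答依据:当前文档库检索 (Answer basis: retrieval from the current document library)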
The project uses a "document version + chat thread version" mechanism:
- Effective document changes bump documents_version
- Each chat stores its own document_context_version
- If a chat version is stale, a new thread_id is used on the next real message
- Old chat history remains visible but no longer participates in reasoning
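The mechanism above can be sketched as a small state check. The identifiers `documents_version`, `document_context_version`, and `thread_id` come from this README; the `Chat` class and the helper `thread_for_next_message` are hypothetical names used only for illustration, and the real implementation may store this state differently.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Chat:
    """Per-chat state; field names mirror the identifiers in this README."""
    document_context_version: int
    thread_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def thread_for_next_message(chat, documents_version):
    """Return the thread id to use, rotating it if the chat is stale."""
    if chat.document_context_version != documents_version:
        # Stale chat: start a fresh thread so old context is not reused.
        chat.thread_id = str(uuid.uuid4())
        chat.document_context_version = documents_version
    return chat.thread_id


documents_version = 1
chat = Chat(document_context_version=documents_version)
first = thread_for_next_message(chat, documents_version)

documents_version += 1  # e.g. a new document was uploaded
second = thread_for_next_message(chat, documents_version)
third = thread_for_next_message(chat, documents_version)
print(first != second, second == third)
```

Because the check runs only when a message arrives, chats that are never used again are never touched, which matches the lazy refresh behavior described in this README.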
Recommended deployment model:
- Run only the FastAPI app inside the container
- Use Ollama as an external embedding service
- Use an external OpenAI-compatible API for the main LLM
Build:
docker build -f project/Dockerfile -t agentic-rag-fastapi .
Run:
docker run --rm -p 7860:7860 --env-file project/.env agentic-rag-fastapi
The current main UI is FastAPI.
- Use cloudflared for temporary public exposure
- Use a cloud server for longer-term deployment
Example:
cloudflared tunnel --url http://127.0.0.1:7860
Known limitations:
- LangGraph checkpointing is still memory-based
- Document overwrite/update strategy is not implemented separately
- Page-level citation is not implemented yet
- Clearing the document library uses lazy chat refresh, not automatic global reset
This repository is licensed under the MIT License.
This project is based on and modified from the original work by Giovanni Pasqualino. The original copyright notice is preserved in the LICENSE file.

