A full-stack Retrieval-Augmented Generation (RAG) application for uploading PDFs, indexing document content, chatting with source-grounded answers, and evaluating retrieval quality across chunking strategies.
The project is built to demonstrate more than a basic PDF chatbot. It shows how document parsing, chunking, embedding, retrieval, reranking, citation, and evaluation decisions affect answer quality.
- Upload one or more PDF documents simultaneously.
- Parse structured PDF content with Docling.
- Generate embeddings with
BAAI/bge-m3. - Store and retrieve chunks with Qdrant vector database.
- Chat with uploaded documents through a React interface.
- Return answers with document name, page number, section title, chunk text, similarity score, and reranking metadata.
- Evaluate retrieval quality using hard-coded benchmark questions and expected answers.
- Compare chunking strategies and top-k settings using measurable scores.
- Use a document-aware retrieval policy for default strategy selection.
- Optionally rerank retrieved chunks before the LLM call.
- Run the full stack with Docker Compose.
| Layer | Technology |
|---|---|
| Frontend | React, TypeScript, Vite |
| Backend | FastAPI, Python |
| PDF parsing | Docling |
| Embeddings | BGE-M3 |
| Vector database | Qdrant |
| Retrieval | Dense vector search with metadata filtering |
| Reranking | SentenceTransformers CrossEncoder |
| LLM provider | OpenAI-compatible API, Groq-compatible configuration, extractive fallback |
| Evaluation | Custom retrieval and RAG evaluation scripts |
| Runtime | Docker Compose |
flowchart LR
U[User] --> FE[React Frontend]
FE --> API[FastAPI Backend]
API --> UP[PDF Upload]
UP --> PARSE[Docling Parser]
PARSE --> PROFILE[Document Profiler]
PARSE --> CHUNK[Chunking Strategies]
CHUNK --> EMB[BGE-M3 Embeddings]
EMB --> QDRANT[Qdrant Vector DB]
FE --> CHAT[Chat Request]
CHAT --> API
API --> RETRIEVE[Vector Retrieval]
RETRIEVE --> QDRANT
RETRIEVE --> RERANK[Optional CrossEncoder Reranking]
RERANK --> LLM[LLM / Extractive Provider]
LLM --> ANSWER[Answer + Sources]
ANSWER --> FE
API --> EVAL[Evaluation Workflow]
EVAL --> SCORES[Strategy Metrics]
SCORES --> FE
Upload PDF
→ Parse document
→ Profile document structure
→ Generate multiple chunking variants
→ Embed chunks
→ Store chunks in Qdrant
→ Select retrieval policy
→ Retrieve relevant chunks for each question
→ Optionally rerank retrieved chunks
→ Generate source-grounded answer
→ Display answer, sources, and evaluation metrics
Use this when running the application locally on Windows.
.\start-local.ps1The script starts the Docker Compose stack, waits for services, and opens the frontend automatically.
Stop everything with:
.\stop-local.ps1docker compose up --buildOpen:
| Service | URL |
|---|---|
| Frontend | http://localhost:5173 |
| Backend API docs | http://localhost:8000/docs |
| Qdrant dashboard | http://localhost:6333/dashboard |
Stop the stack:
docker compose downRun in background:
docker compose up -d --buildView logs:
docker compose logs -fCreate the backend environment file:
copy backend\.env.example backend\.envExample backend configuration:
APP_NAME=Document Intelligence RAG Platform
QDRANT_URL=http://qdrant:6333
QDRANT_COLLECTION_NAME=document_chunks
EMBEDDING_MODEL_NAME=BAAI/bge-m3
EMBEDDING_BATCH_SIZE=8
NORMALIZE_EMBEDDINGS=true
OPENAI_COMPATIBLE_BASE_URL=https://api.groq.com/openai/v1
OPENAI_COMPATIBLE_API_KEY=your_api_key_here
OPENAI_COMPATIBLE_MODEL=llama-3.1-8b-instant
ENABLE_RERANKING=true
RERANKER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2
RERANKER_CANDIDATE_LIMIT=12
RERANKER_FINAL_TOP_K=3| Endpoint | Purpose |
|---|---|
GET /health |
Backend health check |
POST /api/documents/upload |
Upload, parse, chunk, embed, and index PDFs |
GET /api/documents |
List indexed documents |
POST /api/search |
Run retrieval against indexed chunks |
POST /api/chat |
Generate an answer from retrieved context |
POST /api/evaluation/run |
Run retrieval evaluation across strategies |
Example chat request:
{
"question": "What does Mean Opinion Score measure in TTS evaluation?",
"document_id": "<document_id>",
"strategy": "auto",
"provider": "compatible",
"limit": 3,
"rerank": true,
"rerank_candidate_limit": 12
}The platform does not use a single fixed retrieval setup for every PDF. During ingestion, each document is parsed, profiled, and indexed using multiple chunking strategies.
| Strategy | Purpose | Best suited for |
|---|---|---|
section_aware |
Preserves heading-based sections | Structured reports, papers, documentation |
recursive |
Splits text by size and overlap | General prose and weakly structured PDFs |
table_preserving |
Keeps table-like content together | Metric-heavy or tabular documents |
parent_child |
Retrieves focused child chunks with broader parent context | Dense technical documents and long explanations |
The default retrieval policy is selected from document structure signals such as page count, heading density, table density, and chunk statistics. Full evaluation is available as an explicit diagnostic workflow rather than blocking upload or chat.
The application includes optional second-stage reranking.
Qdrant retrieves candidate chunks
→ CrossEncoder scores question-chunk pairs
→ chunks are sorted by rerank score
→ final top-k chunks are sent to the LLM
This improves precision when dense vector search returns semantically related chunks that are not the best direct evidence for the question. The UI shows both retrieval similarity and reranking metadata:
- vector similarity score
- rerank score
- original retrieval rank
- final source order
Rerank scores are raw model relevance scores. They are not normalized percentages; higher score means stronger relevance.
The project includes a hard-coded evaluation set with expected answers and source expectations. Evaluation is used to compare retrieval configurations and justify strategy selection.
| Component | Configuration |
|---|---|
| Evaluation mode | Full RAG evaluation |
| Evaluated strategies | section_aware, recursive, parent_child, table_preserving |
| Top-k values | 3, 5 |
| Benchmark questions | 10 |
| Total evaluation calls | 80 /api/chat calls |
| Metrics | Overall, retrieval, groundedness, citation accuracy, latency |
| Best current configuration | section_aware, top_k=3 |
| Strategy | Top-k | Overall | Retrieval | Groundedness | Citation | Latency ms |
|---|---|---|---|---|---|---|
section_aware |
3 | 0.7265 | 0.80 | 0.85 | 0.80 | 706.47 |
section_aware |
5 | 0.7145 | 0.80 | 0.90 | 0.80 | 1351.36 |
parent_child |
5 | 0.7080 | 0.80 | 0.95 | 0.80 | 8534.32 |
parent_child |
3 | 0.6544 | 0.60 | 1.00 | 0.60 | 6139.30 |
recursive |
3 | 0.6192 | 0.80 | 0.6215 | 0.80 | 9478.93 |
table_preserving |
3 | 0.6007 | 0.80 | 0.5733 | 0.80 | 9471.08 |
table_preserving |
5 | 0.5998 | 0.80 | 0.6422 | 0.80 | 14004.92 |
recursive |
5 | 0.5833 | 0.80 | 0.5667 | 0.80 | 13795.27 |
| Decision | Selected value | Reason |
|---|---|---|
| Default strategy | section_aware |
Best overall benchmark result |
| Default top-k | 3 |
Best quality-latency trade-off |
| Table-heavy fallback | table_preserving |
Preserves tabular evidence |
| Weak-structure fallback | recursive |
Robust when headings are limited |
| Long-document fallback | parent_child |
Provides broader context when needed |
| Full evaluation | On demand | Avoids slowing normal upload and chat |
The result shows that the highest groundedness score is not always the best production choice. parent_child improves context coverage, but its latency is much higher. section_aware + top_k=3 gives the best balance between accuracy, citations, and responsiveness.
Chunking directly controls what evidence the retriever can find. Poor chunking can split relevant context, lose section boundaries, or separate table rows from their headers. This project evaluates chunking strategies instead of assuming one strategy is universally correct.
The selected default, section_aware, works well for structured PDFs because it preserves topical boundaries and improves source-level citations. The evaluation results show that increasing top_k from 3 to 5 does not automatically improve the final system; it can increase latency and introduce less focused context.
Each answer includes retrieved evidence with:
- document name
- page number
- section title
- chunk index
- retrieval strategy
- similarity score
- rerank score when reranking is enabled
- original retrieval rank
- source text preview
This makes each answer inspectable and helps users verify whether the response is grounded in the uploaded document.
document-intelligence-rag-platform/
├── backend/
│ ├── app/
│ │ ├── api/
│ │ ├── core/
│ │ ├── schemas/
│ │ ├── services/
│ │ └── main.py
│ ├── evaluation/
│ ├── data/
│ ├── Dockerfile
│ ├── requirements.txt
│ └── .env.example
├── frontend/
│ ├── src/
│ ├── Dockerfile
│ ├── package.json
│ └── .env.example
├── docker-compose.yml
├── start-local.ps1
├── stop-local.ps1
└── README.md
Backend syntax check:
cd backend
python -m py_compile app\main.py
python -m py_compile app\api\documents.py
python -m py_compile app\api\chat.py
python -m py_compile app\api\evaluation.pyFrontend build:
cd frontend
npm run buildDocker validation:
docker compose up --build
docker compose ps
docker compose logs -fEvaluation run:
cd backend
python -m evaluation.run_eval- Evaluation uses a fixed benchmark set and custom scoring.
- Chat history is kept in the frontend session and is not persisted.
- Follow-up question rewriting is not yet implemented.
- Authentication and user-level document isolation are not included.
- Very large PDFs may require background processing in a production deployment.
- Streaming responses are not implemented yet.
- Add streaming answer generation.
- Add hybrid dense and keyword retrieval.
- Add LLM-as-judge evaluation for groundedness and answer correctness.
- Add automatic evaluation-question generation per uploaded document.
- Add persistent chat sessions.
- Add user authentication and document ownership.
- Add deployment configuration for a public URL.
- Add CI checks for backend, frontend, and Docker builds.
- Start the platform with
./start-local.ps1ordocker compose up --build. - Upload one or more PDFs.
- Show the detected document profile and selected retrieval policy.
- Ask a document question.
- Expand the source cards and show page/section citations.
- Enable reranking and compare source ordering.
- Run retrieval evaluation.
- Explain why
section_aware + top_k=3is the default configuration. - Open FastAPI docs and Qdrant dashboard to show the backend infrastructure.