# Docs2Synth: A Synthetic-Data-Tuned Retriever Framework for Visually Rich Document Understanding
```
Documents → Preprocess → QA Generation → Verification →
Human Annotation → Retriever Training → RAG Deployment
```
Run the complete end-to-end pipeline with a single command:
```bash
docs2synth run
```

This automatically chains: preprocessing → QA generation → verification → retriever training → validation → RAG deployment, skipping the manual annotation UI.
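If you want to drive the same chain from your own Python script rather than the shell, the steps can be invoked in order with `subprocess` (a minimal sketch: the subcommands mirror the CLI steps in this README, the annotation UI and `rag app` are skipped, and `check=True` stops on the first failure):

```python
import subprocess

# Ordered CLI steps mirroring `docs2synth run` (annotation UI skipped).
PIPELINE = [
    ["docs2synth", "preprocess", "data/raw/my_documents/"],
    ["docs2synth", "qa", "batch"],
    ["docs2synth", "verify", "batch"],
    ["docs2synth", "retriever", "preprocess"],
    ["docs2synth", "retriever", "train", "--mode", "standard"],
    ["docs2synth", "rag", "ingest"],
]


def run_pipeline(steps=PIPELINE, dry_run=False):
    """Run each CLI step in order; return the commands that were issued."""
    executed = []
    for cmd in steps:
        executed.append(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return executed


if __name__ == "__main__":
    # Dry run: just list the commands without executing anything.
    for line in run_pipeline(dry_run=True):
        print(line)
```

The `dry_run` flag is only there so the sequence can be inspected (or tested) without a working installation.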
For more control, run each step individually:

```bash
# 1. Preprocess documents
docs2synth preprocess data/raw/my_documents/

# 2. Generate QA pairs
docs2synth qa batch

# 3. Verify quality
docs2synth verify batch

# 4. Annotate (opens UI)
docs2synth annotate

# 5. Train retriever
docs2synth retriever preprocess
docs2synth retriever train --mode standard --lr 1e-5 --epochs 10

# 6. Deploy RAG
docs2synth rag ingest
docs2synth rag app
```

## Installation

CPU Version (includes all features + MCP server):
```bash
pip install docs2synth[cpu]
```

GPU Version (includes all features + MCP server):
```bash
# Standard GPU installation (no vLLM)
pip install docs2synth[gpu]

# With vLLM for local LLM inference (requires a CUDA GPU)
# 1. Install PyTorch with CUDA first:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# 2. Install docs2synth with vLLM:
pip install docs2synth[gpu,vllm]

# 3. Uninstall paddlex to avoid conflicts with vLLM:
pip uninstall -y paddlex
```

Note: PaddleX conflicts with vLLM. If you need vLLM support, you must uninstall paddlex after installation.
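Because the paddlex/vLLM conflict only surfaces at import time, it can be worth checking the environment up front. A minimal sketch using only the standard library (the package names are the ones above; the `installed` parameter exists purely so the logic is testable without either package present):

```python
from importlib.util import find_spec


def conflicting_packages(installed=None):
    """Return the conflicting pair if both paddlex and vllm are importable.

    `installed` may be a set of package names (useful for testing); by
    default the current environment is probed with importlib.
    """
    if installed is None:
        installed = {name for name in ("paddlex", "vllm") if find_spec(name)}
    if {"paddlex", "vllm"} <= set(installed):
        return ("paddlex", "vllm")
    return None


if __name__ == "__main__":
    if conflicting_packages():
        print("Conflict detected: run `pip uninstall -y paddlex`.")
    else:
        print("No paddlex/vLLM conflict detected.")
```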
Minimal Install (CLI only, no ML/MCP features):

```bash
pip install docs2synth
```

## Development Setup

Use the setup script (installs uv + dependencies automatically):
```bash
# Clone
git clone https://github.com/AI4WA/Docs2Synth.git
cd Docs2Synth

# Run setup script
./setup.sh    # Unix/macOS/WSL
# setup.bat   # Windows
```

The script:
- Installs uv (fast package manager)
- Creates virtual environment
- Installs dependencies (CPU or GPU)
- Sets up config
Manual development setup:

```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh                # Unix/macOS
# powershell -c "irm https://astral.sh/uv/install.ps1 | iex"   # Windows

# Clone and set up
git clone https://github.com/AI4WA/Docs2Synth.git
cd Docs2Synth
uv venv
source .venv/bin/activate    # .venv\Scripts\activate on Windows

# Install for development
uv pip install -e ".[cpu,dev]"    # or ".[gpu,dev]" for GPU

# Set up config
cp config.example.yml config.yml
# Edit config.yml and add your API keys
```

## Features

- Document Processing: Extract text/layout with Docling, PaddleOCR, PDFPlumber
- QA Generation: Automatic question-answer pair generation with LLMs
- Verification: Built-in verifiers for meaningfulness and correctness
- Human Annotation: Streamlit UI for manual review
- Retriever Training: Train LayoutLMv3-based retrievers
- RAG Deployment: Deploy with naive or iterative strategies
- MCP Integration: Expose as Model Context Protocol server
## Configuration

Create config.yml from config.example.yml:
```yaml
# API keys (config.yml is in .gitignore)
agent:
  keys:
    openai_api_key: "sk-..."
    anthropic_api_key: "sk-ant-..."

# Document processing
preprocess:
  processor: docling
  input_dir: ./data/raw/
  output_dir: ./data/processed/

# QA generation
qa:
  strategies:
    - strategy: semantic
      provider: openai
      model: gpt-4o-mini

# Retriever training
retriever:
  learning_rate: 1e-5
  epochs: 10

# RAG
rag:
  embedding:
    model: sentence-transformers/all-MiniLM-L6-v2
```
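Before a long run it can be worth sanity-checking that config.yml defines the sections the pipeline steps rely on. A minimal sketch, standard library only: it operates on the parsed config as a plain dict (pair it with any YAML loader), and the required key paths simply mirror the example above rather than an authoritative schema:

```python
def missing_config_keys(cfg):
    """Return dotted key paths that are absent from the config dict."""
    required = [
        "agent.keys",
        "preprocess.processor",
        "preprocess.input_dir",
        "preprocess.output_dir",
        "qa.strategies",
        "retriever.learning_rate",
        "retriever.epochs",
        "rag.embedding.model",
    ]
    missing = []
    for path in required:
        node = cfg
        for part in path.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(path)
                break
            node = node[part]
    return missing


# Example: a dict mirroring the YAML above passes the check.
cfg = {
    "agent": {"keys": {"openai_api_key": "sk-..."}},
    "preprocess": {
        "processor": "docling",
        "input_dir": "./data/raw/",
        "output_dir": "./data/processed/",
    },
    "qa": {"strategies": [{"strategy": "semantic", "provider": "openai", "model": "gpt-4o-mini"}]},
    "retriever": {"learning_rate": 1e-5, "epochs": 10},
    "rag": {"embedding": {"model": "sentence-transformers/all-MiniLM-L6-v2"}},
}
print(missing_config_keys(cfg))  # → []
```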
## Docker

```bash
# CPU
./scripts/build-docker.sh cpu

# GPU
./scripts/build-docker.sh gpu
```

See Docker Builds.
## Documentation

Full documentation: https://ai4wa.github.io/Docs2Synth/
- Complete Workflow Guide
- CLI Reference
- Document Processing
- QA Generation
- Retriever Training
- RAG Deployment
## Contributing

We welcome contributions! Please:

- Fork the repository
- Create a feature branch
- Make your changes
- Run tests: `pytest tests/ -v`
- Run code quality checks: `./scripts/check.sh`
- Submit a pull request
See Dependency Management for dev setup details.
## License

MIT License - see LICENSE file for details.
## Citation

If you use Docs2Synth in your research, please cite:

```bibtex
@software{docs2synth2025,
  title  = {Docs2Synth: A Synthetic Data Tuned Retriever Framework for Visually Rich Documents Understanding},
  author = {AI4WA Team},
  year   = {2025},
  url    = {https://github.com/AI4WA/Docs2Synth}
}
```

## Links

- Documentation: https://ai4wa.github.io/Docs2Synth/
- Issues: https://github.com/AI4WA/Docs2Synth/issues
- Discussions: https://github.com/AI4WA/Docs2Synth/discussions
