Docs2Synth

Documentation · Video Tutorial · License: MIT · Python 3.11+

A Synthetic Data Tuned Retriever Framework for Visually Rich Documents Understanding

(Docs2Synth framework overview diagram)

Workflow

Documents → Preprocess → QA Generation → Verification →
Human Annotation → Retriever Training → RAG Deployment

🚀 Quick Start: Automated Pipeline

Run the complete end-to-end pipeline with a single command:

docs2synth run

This automatically chains: preprocessing → QA generation → verification → retriever training → validation → RAG deployment, skipping the manual annotation UI.
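The chaining above can be sketched as a small driver script. This is an illustrative sketch, not part of the docs2synth CLI: it simply invokes the commands documented in this README in order via Python's subprocess module, stopping at the first failure (flags from the step-by-step section below can be appended to any step).

```python
import subprocess

# Pipeline steps as documented in this README (annotation UI skipped,
# as in `docs2synth run`).
PIPELINE = [
    ["docs2synth", "preprocess", "data/raw/my_documents/"],
    ["docs2synth", "qa", "batch"],
    ["docs2synth", "verify", "batch"],
    ["docs2synth", "retriever", "preprocess"],
    ["docs2synth", "retriever", "train"],
    ["docs2synth", "rag", "ingest"],
]

def run_pipeline(steps, runner=subprocess.run):
    """Run each step in order; raise on the first non-zero exit code."""
    for cmd in steps:
        result = runner(cmd)
        if result.returncode != 0:
            raise RuntimeError(f"step failed: {' '.join(cmd)}")
```

The `runner` parameter is only there so the chaining logic can be exercised without the tool installed; in practice `subprocess.run` is used directly.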

Manual Step-by-Step Workflow

For more control, run each step individually:

# 1. Preprocess documents
docs2synth preprocess data/raw/my_documents/

# 2. Generate QA pairs
docs2synth qa batch

# 3. Verify quality
docs2synth verify batch

# 4. Annotate (opens UI)
docs2synth annotate

# 5. Train retriever
docs2synth retriever preprocess
docs2synth retriever train --mode standard --lr 1e-5 --epochs 10

# 6. Deploy RAG
docs2synth rag ingest
docs2synth rag app

Complete Workflow Guide →


Installation

PyPI Installation (Recommended)

CPU Version (includes all features + MCP server):

pip install docs2synth[cpu]

GPU Version (includes all features + MCP server):

# Standard GPU installation (no vLLM)
pip install docs2synth[gpu]

# With vLLM for local LLM inference (requires CUDA GPU)
# 1. Install PyTorch with CUDA first:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# 2. Install docs2synth with vLLM:
pip install docs2synth[gpu,vllm]

# 3. Uninstall paddlex to avoid conflicts with vLLM:
pip uninstall -y paddlex

Note: PaddleX conflicts with vLLM. If you need vLLM support, you must uninstall paddlex after installation.

Minimal Install (CLI only, no ML/MCP features):

pip install docs2synth

Development Setup

Use the setup script (installs uv + dependencies automatically):

# Clone
git clone https://github.com/AI4WA/Docs2Synth.git
cd Docs2Synth

# Run setup script
./setup.sh         # Unix/macOS/WSL
# setup.bat        # Windows

The script:

  • Installs uv (fast package manager)
  • Creates virtual environment
  • Installs dependencies (CPU or GPU)
  • Sets up config

Manual development setup:

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh  # Unix/macOS
# powershell -c "irm https://astral.sh/uv/install.ps1 | iex"  # Windows

# Clone and setup
git clone https://github.com/AI4WA/Docs2Synth.git
cd Docs2Synth
uv venv
source .venv/bin/activate  # .venv\Scripts\activate on Windows

# Install for development
uv pip install -e ".[cpu,dev]"  # or [gpu,dev] for GPU

# Setup config
cp config.example.yml config.yml
# Edit config.yml and add your API keys

Features

  • Document Processing: Extract text/layout with Docling, PaddleOCR, PDFPlumber
  • QA Generation: Automatic question-answer pair generation with LLMs
  • Verification: Built-in meaningful and correctness verifiers
  • Human Annotation: Streamlit UI for manual review
  • Retriever Training: Train LayoutLMv3-based retrievers
  • RAG Deployment: Deploy with naive or iterative strategies
  • MCP Integration: Expose as Model Context Protocol server
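To make the "naive or iterative" RAG strategies concrete, here is a toy, framework-independent sketch: naive retrieval ranks documents against the original query in a single pass, while iterative retrieval folds each retrieved passage back into the query before the next round. All function names and the word-overlap scorer are illustrative stand-ins, not the Docs2Synth API or its trained retriever.

```python
def score(query_terms, doc):
    """Toy relevance: shared-word count (stand-in for an embedding model)."""
    return len(query_terms & set(doc.split()))

def naive_retrieve(query, docs, k=2):
    """Single pass: rank all docs against the original query."""
    terms = set(query.split())
    return sorted(docs, key=lambda d: score(terms, d), reverse=True)[:k]

def iterative_retrieve(query, docs, rounds=2):
    """Each round, pick the best doc, then expand the query with its words."""
    terms = set(query.split())
    picked, pool = [], list(docs)
    for _ in range(rounds):
        if not pool:
            break
        best = max(pool, key=lambda d: score(terms, d))
        picked.append(best)
        pool.remove(best)
        terms |= set(best.split())  # query expansion with retrieved evidence
    return picked
```

The iterative variant can surface documents that share no words with the original query but overlap with earlier hits, which is the intuition behind multi-hop retrieval.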

Configuration

Create config.yml from config.example.yml:

# API keys (config.yml is in .gitignore)
agent:
  keys:
    openai_api_key: "sk-..."
    anthropic_api_key: "sk-ant-..."

# Document processing
preprocess:
  processor: docling
  input_dir: ./data/raw/
  output_dir: ./data/processed/

# QA generation
qa:
  strategies:
    - strategy: semantic
      provider: openai
      model: gpt-4o-mini

# Retriever training
retriever:
  learning_rate: 1e-5
  epochs: 10

# RAG
rag:
  embedding:
    model: sentence-transformers/all-MiniLM-L6-v2

Docker

# CPU
./scripts/build-docker.sh cpu

# GPU
./scripts/build-docker.sh gpu

See Docker Builds


Documentation

Full documentation: https://ai4wa.github.io/Docs2Synth/


Contributing

We welcome contributions! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests: pytest tests/ -v
  5. Run code quality checks: ./scripts/check.sh
  6. Submit a pull request

See Dependency Management for dev setup details.


License

MIT License - see LICENSE file for details.


Citation

If you use Docs2Synth in your research, please cite:

@software{docs2synth2025,
  title = {Docs2Synth: A Synthetic Data Tuned Retriever Framework for Visually Rich Documents Understanding},
  author = {AI4WA Team},
  year = {2025},
  url = {https://github.com/AI4WA/Docs2Synth}
}
