Docs2Synth

Documentation · Video Tutorial · License: MIT · Python 3.11+

A Synthetic Data Tuned Retriever Framework for Visually Rich Documents Understanding

(Docs2Synth framework overview diagram)

Workflow

Documents → Preprocess → QA Generation → Verification →
Human Annotation → Retriever Training → RAG Deployment

🚀 Quick Start: Automated Pipeline

Run the complete end-to-end pipeline with a single command:

docs2synth run

This automatically chains: preprocessing → QA generation → verification → retriever training → validation → RAG deployment, skipping the manual annotation UI.
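The chaining above can be sketched as a small driver script. This is an illustrative sketch, not part of the docs2synth CLI: it simply invokes the commands documented in this README in order via Python's subprocess module, stopping at the first failure (flags from the step-by-step section below can be appended to any step).

```python
import subprocess

# Pipeline steps as documented in this README (annotation UI skipped,
# as in `docs2synth run`).
PIPELINE = [
    ["docs2synth", "preprocess", "data/raw/my_documents/"],
    ["docs2synth", "qa", "batch"],
    ["docs2synth", "verify", "batch"],
    ["docs2synth", "retriever", "preprocess"],
    ["docs2synth", "retriever", "train"],
    ["docs2synth", "rag", "ingest"],
]

def run_pipeline(steps, runner=subprocess.run):
    """Run each step in order; raise on the first non-zero exit code."""
    for cmd in steps:
        result = runner(cmd)
        if result.returncode != 0:
            raise RuntimeError(f"step failed: {' '.join(cmd)}")
```

The `runner` parameter is only there so the chaining logic can be exercised without the tool installed; in practice `subprocess.run` is used directly.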

Manual Step-by-Step Workflow

For more control, run each step individually:

# 1. Preprocess documents
docs2synth preprocess data/raw/my_documents/

# 2. Generate QA pairs
docs2synth qa batch

# 3. Verify quality
docs2synth verify batch

# 4. Annotate (opens UI)
docs2synth annotate

# 5. Train retriever
docs2synth retriever preprocess
docs2synth retriever train --mode standard --lr 1e-5 --epochs 10

# 6. Deploy RAG
docs2synth rag ingest
docs2synth rag app

Complete Workflow Guide →


Installation

PyPI Installation (Recommended)

CPU Version (includes all features + MCP server):

pip install docs2synth[cpu]

GPU Version (includes all features + MCP server):

# Standard GPU installation (no vLLM)
pip install docs2synth[gpu]

# With vLLM for local LLM inference (requires CUDA GPU)
# 1. Install PyTorch with CUDA first:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# 2. Install docs2synth with vLLM:
pip install docs2synth[gpu,vllm]

# 3. Uninstall paddlex to avoid conflicts with vLLM:
pip uninstall -y paddlex

Note: PaddleX conflicts with vLLM. If you need vLLM support, you must uninstall paddlex after installation.

Minimal Install (CLI only, no ML/MCP features):

pip install docs2synth

Development Setup

Use the setup script (installs uv + dependencies automatically):

# Clone
git clone https://github.com/AI4WA/Docs2Synth.git
cd Docs2Synth

# Run setup script
./setup.sh         # Unix/macOS/WSL
# setup.bat        # Windows

The script:

  • Installs uv (fast package manager)
  • Creates virtual environment
  • Installs dependencies (CPU or GPU)
  • Sets up config

Manual development setup:

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh  # Unix/macOS
# powershell -c "irm https://astral.sh/uv/install.ps1 | iex"  # Windows

# Clone and setup
git clone https://github.com/AI4WA/Docs2Synth.git
cd Docs2Synth
uv venv
source .venv/bin/activate  # .venv\Scripts\activate on Windows

# Install for development
uv pip install -e ".[cpu,dev]"  # or [gpu,dev] for GPU

# Setup config
cp config.example.yml config.yml
# Edit config.yml and add your API keys

Features

  • Document Processing: Extract text/layout with Docling, PaddleOCR, PDFPlumber
  • QA Generation: Automatic question-answer pair generation with LLMs
  • Verification: Built-in meaningful and correctness verifiers
  • Human Annotation: Streamlit UI for manual review
  • Retriever Training: Train LayoutLMv3-based retrievers
  • RAG Deployment: Deploy with naive or iterative strategies
  • MCP Integration: Expose as Model Context Protocol server
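To make the "naive or iterative" RAG strategies concrete, here is a toy, framework-independent sketch: naive retrieval ranks documents against the original query in a single pass, while iterative retrieval folds each retrieved passage back into the query before the next round. All function names and the word-overlap scorer are illustrative stand-ins, not the Docs2Synth API or its trained retriever.

```python
def score(query_terms, doc):
    """Toy relevance: shared-word count (stand-in for an embedding model)."""
    return len(query_terms & set(doc.split()))

def naive_retrieve(query, docs, k=2):
    """Single pass: rank all docs against the original query."""
    terms = set(query.split())
    return sorted(docs, key=lambda d: score(terms, d), reverse=True)[:k]

def iterative_retrieve(query, docs, rounds=2):
    """Each round, pick the best doc, then expand the query with its words."""
    terms = set(query.split())
    picked, pool = [], list(docs)
    for _ in range(rounds):
        if not pool:
            break
        best = max(pool, key=lambda d: score(terms, d))
        picked.append(best)
        pool.remove(best)
        terms |= set(best.split())  # query expansion with retrieved evidence
    return picked
```

The iterative variant can surface documents that share no words with the original query but overlap with earlier hits, which is the intuition behind multi-hop retrieval.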

Configuration

Create config.yml from config.example.yml:

# API keys (config.yml is in .gitignore)
agent:
  keys:
    openai_api_key: "sk-..."
    anthropic_api_key: "sk-ant-..."

# Document processing
preprocess:
  processor: docling
  input_dir: ./data/raw/
  output_dir: ./data/processed/

# QA generation
qa:
  strategies:
    - strategy: semantic
      provider: openai
      model: gpt-4o-mini

# Retriever training
retriever:
  learning_rate: 1e-5
  epochs: 10

# RAG
rag:
  embedding:
    model: sentence-transformers/all-MiniLM-L6-v2

Docker

# CPU
./scripts/build-docker.sh cpu

# GPU
./scripts/build-docker.sh gpu

See Docker Builds


Documentation

Full documentation: https://ai4wa.github.io/Docs2Synth/


Contributing

We welcome contributions! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests: pytest tests/ -v
  5. Run code quality checks: ./scripts/check.sh
  6. Submit a pull request

See Dependency Management for dev setup details.


License

MIT License - see LICENSE file for details.


Citation

If you use Docs2Synth in your research, please cite:

@software{docs2synth2025,
  title = {Docs2Synth: A Synthetic Data Tuned Retriever Framework for Visually Rich Documents Understanding},
  author = {AI4WA Team},
  year = {2025},
  url = {https://github.com/AI4WA/Docs2Synth}
}
