Convert academic papers from PDF to Markdown or EPUB format with AI-powered OCR and LaTeX support.
- AI-Powered PDF Conversion: Transform research papers and academic PDFs into clean Markdown or EPUB formats
- LaTeX Equation Support: Preserve mathematical equations with multiple rendering options (MathML, SVG, or clipping)
- Smart Table Extraction: Convert complex tables to HTML or image clipping
- Multi-Language OCR: Support for English and Chinese documents
- Configurable OCR Models: Choose from tiny to gundam-sized models based on your accuracy needs
- Batch Processing: Automatically process multiple PDFs in your directory
- GPU Acceleration: Leverages PyTorch and CUDA for fast processing
- Python 3.13 or higher
- CUDA-compatible GPU (recommended for faster processing)
- UV package manager
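
The requirements above can be checked before installing. The following is a minimal sketch, not part of this repository: `check_prerequisites` and `MIN_PYTHON` are hypothetical names, and the check covers only the Python version and whether a `uv` executable is on the `PATH`.

```python
import shutil
import sys

# Minimum interpreter version, taken from the requirements list above.
MIN_PYTHON = (3, 13)


def check_prerequisites(version=None, which=shutil.which):
    """Return a mapping of prerequisite name to whether it is satisfied.

    `version` and `which` are injectable so the check can be exercised
    without depending on the machine it runs on.
    """
    version = tuple(version or sys.version_info[:2])
    return {
        "python": version >= MIN_PYTHON,
        "uv": which("uv") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

The GPU is left out on purpose, since detecting it requires PyTorch to be installed first. Once dependencies are synced, `torch.cuda.is_available()` reports whether CUDA can be used.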
- Clone this repository:

      git clone https://github.com/ahnafnafee/papers-to-markdown.git
      cd papers-to-markdown

- Install dependencies using UV:

      uv sync

- Activate the virtual environment:

      # Windows
      .venv\Scripts\activate
      # Linux/Mac
      source .venv/bin/activate

Convert PDFs to Markdown format:

    uv run main.py --format markdown --output-dir markdown

Convert PDFs to EPUB format:

    uv run main.py --format epub --output-dir epubs

Convert with every option set explicitly:

    uv run main.py \
      --root . \
      --format markdown \
      --output-dir output \
      --ocr-size base \
      --lang en \
      --latex-render svg \
      --table-render html

| Option | Values | Default | Description |
|---|---|---|---|
| `--root` | Path | `.` | Directory containing PDF files |
| `--output-dir` | Path | `epubs` | Output directory for converted files |
| `--format` | `epub`, `markdown` | `epub` | Output format |
| `--ocr-size` | `tiny`, `small`, `base`, `large`, `gundam` | `small` | OCR model size (larger = more accurate) |
| `--lang` | `en`, `zh` | `en` | Document language |
| `--latex-render` | `mathml`, `svg`, `clipping` | `svg` | LaTeX equation rendering mode (EPUB only) |
| `--table-render` | `html`, `clipping` | `html` | Table rendering mode (EPUB only) |
| `--models-cache-path` | Path | `.pdf_craft_models` | Model cache directory |
| `--analysis-root` | Path | `.pdf_craft_work` | Work directory for analysis artifacts |
| `--dry-run` | Flag | `False` | Preview files without converting |
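
To show how the options and defaults in the table fit together, here is a sketch of an equivalent `argparse` parser. It is an illustration built from the table alone and is not taken from `main.py`; the function name `build_parser` and the help strings are assumptions, and the real script may define its options differently.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose options and defaults mirror the table above."""
    parser = argparse.ArgumentParser(
        description="Convert academic PDFs to Markdown or EPUB."
    )
    parser.add_argument("--root", type=Path, default=Path("."),
                        help="Directory containing PDF files")
    parser.add_argument("--output-dir", type=Path, default=Path("epubs"),
                        help="Output directory for converted files")
    parser.add_argument("--format", choices=["epub", "markdown"], default="epub",
                        help="Output format")
    parser.add_argument("--ocr-size",
                        choices=["tiny", "small", "base", "large", "gundam"],
                        default="small", help="OCR model size")
    parser.add_argument("--lang", choices=["en", "zh"], default="en",
                        help="Document language")
    parser.add_argument("--latex-render", choices=["mathml", "svg", "clipping"],
                        default="svg", help="LaTeX rendering mode (EPUB only)")
    parser.add_argument("--table-render", choices=["html", "clipping"],
                        default="html", help="Table rendering mode (EPUB only)")
    parser.add_argument("--models-cache-path", type=Path,
                        default=Path(".pdf_craft_models"),
                        help="Model cache directory")
    parser.add_argument("--analysis-root", type=Path,
                        default=Path(".pdf_craft_work"),
                        help="Work directory for analysis artifacts")
    # A flag: absent means False, present means True.
    parser.add_argument("--dry-run", action="store_true",
                        help="Preview files without converting")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Note that the default output directory is `epubs` even when `--format markdown` is selected, so Markdown runs usually pass `--output-dir` explicitly, as the usage commands do.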
    papers-to-markdown/
    ├── main.py              # Main conversion script
    ├── pyproject.toml       # Project dependencies (UV)
    ├── README.md            # This file
    ├── .gitignore           # Git ignore rules
    ├── .venv/               # Virtual environment
    ├── .cache/              # HuggingFace and matplotlib cache
    ├── .pdf_craft_models/   # Cached OCR models
    ├── .pdf_craft_work/     # Temporary analysis files
    └── markdown/            # Output directory for Markdown files
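
Given this layout, batch processing amounts to pairing each PDF in the root directory with a file in the output directory. The sketch below illustrates that pairing and is roughly what a `--dry-run` preview could list. It is not the logic in `main.py`: `plan_conversions` and `EXTENSIONS` are hypothetical names, the scan is assumed to be non-recursive, and output files are assumed to reuse the PDF's base name.

```python
from pathlib import Path

# Assumed mapping from the --format value to an output file extension.
EXTENSIONS = {"markdown": ".md", "epub": ".epub"}


def plan_conversions(root: Path, output_dir: Path, fmt: str) -> list[tuple[Path, Path]]:
    """Pair every PDF directly under `root` with its planned output path."""
    suffix = EXTENSIONS[fmt]
    # Match the extension case-insensitively and sort for a stable order.
    pdfs = sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    return [(pdf, output_dir / (pdf.stem + suffix)) for pdf in pdfs]


if __name__ == "__main__":
    # Preview only: nothing is read from the PDFs or written to disk.
    for src, dst in plan_conversions(Path("."), Path("markdown"), "markdown"):
        print(f"{src} -> {dst}")
```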
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
This project uses pdf-craft for PDF processing.
academic paper converter, PDF to Markdown, PDF to EPUB, research paper OCR, LaTeX equation preservation, academic document processing, AI PDF converter, scientific paper conversion, Python PDF tool, batch PDF conversion
For issues and questions, please open an issue on GitHub.
Made with ❤️ for researchers and academics