PDF2MD - PDF to Markdown Converter

A command-line tool that converts PDF documents to structured Markdown, using PDFKit for text extraction and Apple's Vision framework for OCR.

Features

Intelligent PDF Processing: Automatically detects PDF type and applies appropriate processing method
OCR Support: Uses Apple Vision framework for scanned documents
Text Extraction: Leverages PDFKit for structured PDFs with selectable text
AI Enhancement: Local LLM integration for improved Markdown generation
Batch Processing: Convert multiple PDFs efficiently with progress reporting
Configurable: YAML-based configuration with priority-based loading
Pattern Detection: Advanced header and list detection using regex patterns
Page Number Filtering: Removes page numbers and similar OCR artifacts from the extracted text

Architecture

The project is split into a thin command-line target and a core framework:

Sources/
├── pdf2md/                    # Command-line interface
│   └── main.swift            # Main entry point
└── PDF2MDFramework/          # Core framework
    ├── Configuration/         # Configuration management
    │   ├── PDF2MDConfig.swift
    │   └── ConfigurationManager.swift
    ├── Core/                 # Core processing components
    │   ├── PDFProcessor.swift
    │   ├── OCRProcessor.swift
    │   ├── MarkdownGenerator.swift
    │   ├── PageNumberDetector.swift
    │   ├── LocalLLMProcessor.swift
    │   ├── PatternManager.swift
    │   └── BatchProcessor.swift
    ├── Models/               # Data models
    │   ├── HeaderPattern.swift
    │   └── ListPattern.swift
    └── Utils/                # Utility classes
        └── Logger.swift

Installation

Prerequisites

macOS 10.15 or later
Xcode 15.0 or later (the first release that bundles Swift 5.9)
Swift 5.9 or later

Building from Source

Clone the repository:

git clone https://github.com/yourusername/pdf2md.git
cd pdf2md

Open the project in Xcode:

open pdf2md.xcodeproj

Add the Yams dependency to your Xcode project:

- In Xcode, go to File → Add Package Dependencies
- Enter: https://github.com/jpsim/Yams.git
- Click "Add Package"

Build the project (⌘+B) or run it (⌘+R). The executable will be available in the build products directory.

Usage

Basic Usage

# Convert a single PDF
./pdf2md document.pdf

# Convert multiple PDFs
./pdf2md doc1.pdf doc2.pdf doc3.pdf

# Specify output file
./pdf2md document.pdf -o output.md

# Specify output directory for batch processing
./pdf2md doc1.pdf doc2.pdf -o ./markdown_output/

# Use custom configuration
./pdf2md document.pdf -c ./config.yaml

# Enable OCR processing
./pdf2md document.pdf --enable-ocr

# Enable LLM processing
./pdf2md document.pdf --enable-llm

# Enable verbose logging
./pdf2md document.pdf --verbose

Configuration

The tool supports multiple configuration sources with priority-based loading:

Command-line parameter (highest priority): -c /path/to/config.yaml
Environment variable (second priority): PDF2MD_CONFIG=/path/to/config.yaml
Default files (third priority): Automatically looks for pdf2md.yaml or pdf2md.yml
Swift defaults (fallback): Built-in configuration
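
The priority order above can be sketched as a single lookup function. This is a minimal illustration, not the project's ConfigurationManager: the `ConfigSource` enum, the `resolveConfigSource` function, and the choice of the current working directory as the place to look for default files are all assumptions made for the example.

```swift
import Foundation

/// Where the effective configuration came from (hypothetical type).
enum ConfigSource: Equatable {
    case commandLine(String)
    case environment(String)
    case defaultFile(String)
    case builtInDefaults
}

/// Resolves the configuration source in the documented priority order.
/// `fileExists` is injectable so the lookup can be exercised without touching disk.
func resolveConfigSource(
    cliPath: String?,
    environment: [String: String] = ProcessInfo.processInfo.environment,
    searchDirectory: String = FileManager.default.currentDirectoryPath,
    fileExists: (String) -> Bool = { FileManager.default.fileExists(atPath: $0) }
) -> ConfigSource {
    // 1. An explicit -c argument wins.
    if let cliPath = cliPath, !cliPath.isEmpty {
        return .commandLine(cliPath)
    }
    // 2. Then the PDF2MD_CONFIG environment variable.
    if let envPath = environment["PDF2MD_CONFIG"], !envPath.isEmpty {
        return .environment(envPath)
    }
    // 3. Then the default file names (the search directory is an assumption).
    for name in ["pdf2md.yaml", "pdf2md.yml"] {
        let candidate = URL(fileURLWithPath: searchDirectory)
            .appendingPathComponent(name).path
        if fileExists(candidate) {
            return .defaultFile(candidate)
        }
    }
    // 4. Otherwise fall back to the built-in Swift defaults.
    return .builtInDefaults
}
```

Returning the source rather than the parsed configuration keeps the priority logic separate from YAML parsing, which makes it easy to log which file was actually used.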

Configuration File Example

# pdf2md.yaml
llm:
  enabled: true
  model:
    id: "ggml-org/Meta-Llama-3.1-8B-Instruct-Q4_0-GGUF"
    model_name: "meta-llama-3.1-8b-instruct-q4_0.gguf"

ocr:
  enabled: true
  language: "en-US"
  recognition_level: "accurate"
  filter_page_numbers: true

markdown:
  preserve_page_breaks: false
  extract_images: true
  header_detection:
    enabled: true
    patterns:
      # Single-quoted so YAML leaves the regex backslashes untouched
      - pattern: '^\d+\.\s+\w+'
        level: 1
      - pattern: '^\d+\.\d+\.\s+\w+'
        level: 2

logging:
  enabled: true
  level: "info"

Configuration Commands

# Create a new configuration file
./pdf2md config --create ./my-config.yaml

# Show configuration schema
./pdf2md config --schema ./schema.txt

Core Components

PDFProcessor

Handles PDF loading and text extraction using PDFKit. Automatically detects whether a PDF has selectable text or requires OCR processing.
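
One way such detection could work is to read each page's text layer through PDFKit and fall back to OCR when too few pages carry meaningful text. The sketch below is an assumption about the approach, not the project's PDFProcessor: `needsOCR`, `extractPageTexts`, and both thresholds are hypothetical. Only `PDFDocument(url:)`, `pageCount`, `page(at:)`, and `PDFPage.string` are PDFKit API.

```swift
import Foundation
#if canImport(PDFKit)
import PDFKit
#endif

/// Decides whether a document needs OCR, given the text layer found on each page.
/// Both thresholds are illustrative assumptions, not values taken from the project.
func needsOCR(
    pageTexts: [String?],
    minCharactersPerPage: Int = 20,
    minTextPageRatio: Double = 0.5
) -> Bool {
    // An empty document has nothing to recognize.
    guard !pageTexts.isEmpty else { return false }
    let pagesWithText = pageTexts.filter { text in
        let trimmed = (text ?? "").trimmingCharacters(in: .whitespacesAndNewlines)
        return trimmed.count >= minCharactersPerPage
    }.count
    // OCR is needed when too small a share of pages has a usable text layer.
    return Double(pagesWithText) / Double(pageTexts.count) < minTextPageRatio
}

#if canImport(PDFKit)
/// Collects the text layer of every page; returns nil if the file cannot be opened.
func extractPageTexts(from url: URL) -> [String?]? {
    guard let document = PDFDocument(url: url) else { return nil }
    return (0..<document.pageCount).map { document.page(at: $0)?.string }
}
#endif
```

Keeping the decision in a pure function means the thresholds can be tuned and tested without any PDF files on disk.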

OCRProcessor

Uses Apple's Vision framework for optical character recognition on scanned documents. Includes intelligent page number filtering and content enhancement.
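
For orientation, a text-recognition call with Vision might look like the sketch below. It assumes the page has already been rendered to a `CGImage`. `OCRSettings`, `joinRecognizedLines`, and `recognizeText` are hypothetical names that mirror the `ocr` keys of the configuration example; the Vision calls themselves (`VNRecognizeTextRequest`, `VNImageRequestHandler`, `topCandidates`) are the framework's API.

```swift
import Foundation
#if canImport(Vision)
import Vision
import CoreGraphics
#endif

/// Mirrors the `ocr` section of the example configuration (hypothetical type).
struct OCRSettings {
    var language = "en-US"
    var recognitionLevel = "accurate"   // "accurate" or "fast"
}

/// Joins recognized lines into page text, dropping blank candidates.
func joinRecognizedLines(_ lines: [String]) -> String {
    return lines
        .map { $0.trimmingCharacters(in: .whitespaces) }
        .filter { !$0.isEmpty }
        .joined(separator: "\n")
}

#if canImport(Vision)
/// Runs Vision text recognition on one rendered page image.
func recognizeText(in image: CGImage, settings: OCRSettings = OCRSettings()) throws -> String {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = settings.recognitionLevel == "fast" ? .fast : .accurate
    request.recognitionLanguages = [settings.language]
    request.usesLanguageCorrection = true

    // perform(_:) is synchronous; results are available once it returns.
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])

    let observations = (request.results as? [VNRecognizedTextObservation]) ?? []
    // Keep only the best candidate for each detected line of text.
    let lines = observations.compactMap { $0.topCandidates(1).first?.string }
    return joinRecognizedLines(lines)
}
#endif
```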

MarkdownGenerator

Converts extracted text to well-structured Markdown. Supports configurable header detection, list formatting, and table preservation.
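
As an illustration of configurable header detection, the sketch below applies the two patterns from the configuration example to plain text and prefixes matching lines with `#` characters. `HeaderRule` and `applyHeaderRules` are hypothetical names, and the first-match-wins rule is an assumption. The actual MarkdownGenerator may work differently.

```swift
import Foundation

/// A header rule as it appears under `markdown.header_detection.patterns`.
struct HeaderRule {
    let pattern: String
    let level: Int
}

/// Prefixes each line that matches a rule with `level` hash characters.
/// Rules are tried in order and the first match wins (an assumption).
func applyHeaderRules(to text: String, rules: [HeaderRule]) -> String {
    // Compile each pattern once; patterns that fail to compile are skipped.
    let compiled: [(regex: NSRegularExpression, level: Int)] = rules.compactMap { rule in
        guard let regex = try? NSRegularExpression(pattern: rule.pattern) else { return nil }
        return (regex: regex, level: rule.level)
    }
    var output: [String] = []
    for line in text.components(separatedBy: "\n") {
        let range = NSRange(line.startIndex..., in: line)
        let match = compiled.first {
            $0.regex.firstMatch(in: line, options: [], range: range) != nil
        }
        if let match = match {
            output.append(String(repeating: "#", count: match.level) + " " + line)
        } else {
            output.append(line)
        }
    }
    return output.joined(separator: "\n")
}
```

With the example rules, a line such as `1. Introduction` becomes a level-1 header and `1.1. Scope` a level-2 header, while lines that match no pattern pass through unchanged.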

PageNumberDetector

Intelligently filters common OCR artifacts like page numbers using pattern matching and heuristics.
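
A simple version of this idea combines a few regular expressions with a position heuristic: a line is dropped only if it looks like a page number and sits near the top or bottom of the page. The patterns, the `edgeLines` default, and the function names below are assumptions chosen for the example, not the project's actual rules.

```swift
import Foundation

/// Returns true when a line looks like a standalone page number.
/// The patterns are illustrative guesses at common formats.
func looksLikePageNumber(_ line: String) -> Bool {
    let trimmed = line.trimmingCharacters(in: .whitespaces)
    let patterns = [
        "^\\d{1,4}$",                                  // "12"
        "^-\\s*\\d{1,4}\\s*-$",                        // "- 12 -"
        "^page\\s+\\d{1,4}(\\s+of\\s+\\d{1,4})?$"      // "Page 3 of 10"
    ]
    return patterns.contains { pattern in
        trimmed.range(of: pattern, options: [.regularExpression, .caseInsensitive]) != nil
    }
}

/// Drops page-number lines, but only within `edgeLines` of the top or bottom
/// of the page, so bare numbers in the body text are left alone.
func filterPageNumbers(fromPage lines: [String], edgeLines: Int = 2) -> [String] {
    var kept: [String] = []
    for (index, line) in lines.enumerated() {
        let nearEdge = index < edgeLines || index >= lines.count - edgeLines
        if nearEdge && looksLikePageNumber(line) { continue }
        kept.append(line)
    }
    return kept
}
```

The edge check is the heuristic part: without it, a table cell or list item that happens to contain only a number would be removed as well.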

LocalLLMProcessor

Integrates with a local LLM for text enhancement and structure detection.

PatternManager

Manages regex patterns for header and list detection. Supports custom pattern creation and validation.
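
Pattern validation could be as small as compiling the expression and checking the header level before the pattern is accepted. The sketch below is hypothetical: `PatternValidation`, `validatePattern`, and the 1 to 6 level range (taken from Markdown's six header levels) are assumptions, not the PatternManager API.

```swift
import Foundation

/// Outcome of checking one user-supplied pattern (hypothetical type).
enum PatternValidation: Equatable {
    case valid
    case invalid(reason: String)
}

/// Checks that a header pattern compiles and that its level is usable in Markdown.
func validatePattern(_ pattern: String, level: Int) -> PatternValidation {
    // Markdown defines header levels 1 through 6.
    guard (1...6).contains(level) else {
        return .invalid(reason: "header level must be between 1 and 6")
    }
    guard !pattern.isEmpty else {
        return .invalid(reason: "pattern is empty")
    }
    do {
        // Compiling is the cheapest way to surface syntax errors early.
        _ = try NSRegularExpression(pattern: pattern)
        return .valid
    } catch {
        return .invalid(reason: "pattern does not compile: \(error.localizedDescription)")
    }
}
```

Validating at load time means a typo in the configuration file is reported once, up front, rather than silently matching nothing during conversion.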

BatchProcessor

Handles multiple PDF processing with progress reporting, error handling, and statistics.
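
The combination of progress reporting, error handling, and statistics can be sketched as a loop that records each failure and keeps going. `BatchStatistics`, `processBatch`, and the closure signatures are hypothetical; the conversion step is passed in as a closure so the example stays self-contained.

```swift
import Foundation

/// Summary of one batch run (hypothetical type).
struct BatchStatistics: Equatable {
    var succeeded = 0
    var failed = 0
    var failures: [String: String] = [:]   // input path -> error description
}

/// Converts each input in turn, reporting progress after every file.
/// A failure is recorded and the batch continues with the next file.
func processBatch(
    inputs: [String],
    convert: (String) throws -> Void,
    progress: (_ completed: Int, _ total: Int, _ path: String) -> Void = { _, _, _ in }
) -> BatchStatistics {
    var stats = BatchStatistics()
    for (index, path) in inputs.enumerated() {
        do {
            try convert(path)
            stats.succeeded += 1
        } catch {
            // One bad file should not abort the rest of the batch.
            stats.failed += 1
            stats.failures[path] = String(describing: error)
        }
        progress(index + 1, inputs.count, path)
    }
    return stats
}
```

A caller would pass the real single-file conversion as `convert` and print a line such as `[2/3] doc2.pdf` from the `progress` closure.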

ConfigurationManager

Implements priority-based configuration loading with YAML support and validation.

Development

Project Structure

The project follows a modular architecture with clear separation of concerns:

Configuration Layer: Manages all configuration aspects
Core Processing Layer: Handles PDF processing, OCR, and Markdown generation
Pattern Management Layer: Manages document structure detection patterns
Utility Layer: Provides logging, error handling, and common utilities

Adding New Features

New Processors: Extend the Core layer with new processing capabilities
New Patterns: Add to PatternManager for enhanced structure detection
New Configuration: Extend PDF2MDConfig and ConfigurationManager
New CLI Commands: Add to main.swift as new subcommands

Testing

# Run tests in Xcode
# Use ⌘+U to run all tests, or select specific test targets in the test navigator

# Or run from command line (if you have xcodebuild available)
xcodebuild test -project pdf2md.xcodeproj -scheme pdf2md

Dependencies

Yams: YAML parsing and generation (add via Xcode Package Dependencies)
PDFKit: PDF processing (built-in Apple framework)
Vision: OCR processing (built-in Apple framework)
ArgumentParser: Command-line argument parsing (Apple's open-source swift-argument-parser package; add via Xcode Package Dependencies)

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests for new functionality
Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Roadmap

Enhanced table detection and formatting
Image extraction and referencing
Support for additional output formats (HTML, LaTeX)
Performance optimization for large documents
Web interface for configuration management
Integration with cloud OCR services
Support for additional languages

Support

For issues and questions:

Create an issue on GitHub
Check the documentation in the Resources/Documentation/ folder
Review the implementation plan in Resources/Documentation/implementation-plan.md
