A powerful command-line tool for converting PDF documents to well-structured Markdown format using Apple's Vision framework for OCR and PDFKit for text extraction.
- Intelligent PDF Processing: Automatically detects PDF type and applies appropriate processing method
- OCR Support: Uses Apple Vision framework for scanned documents
- Text Extraction: Leverages PDFKit for structured PDFs with selectable text
- AI Enhancement: Local LLM integration for improved markdown generation
- Batch Processing: Convert multiple PDFs efficiently with progress reporting
- Configurable: YAML-based configuration with priority-based loading
- Pattern Detection: Advanced header and list detection using regex patterns
- Page Number Filtering: Intelligent filtering of common OCR artifacts
The project has been refactored into a clean, modular architecture:
Sources/
├── pdf2md/                          # Command-line interface
│   └── main.swift                   # Main entry point
└── PDF2MDFramework/                 # Core framework
    ├── Configuration/               # Configuration management
    │   ├── PDF2MDConfig.swift
    │   └── ConfigurationManager.swift
    ├── Core/                        # Core processing components
    │   ├── PDFProcessor.swift
    │   ├── OCRProcessor.swift
    │   ├── MarkdownGenerator.swift
    │   ├── PageNumberDetector.swift
    │   ├── LocalLLMProcessor.swift
    │   ├── PatternManager.swift
    │   └── BatchProcessor.swift
    ├── Models/                      # Data models
    │   ├── HeaderPattern.swift
    │   └── ListPattern.swift
    └── Utils/                       # Utility classes
        └── Logger.swift
- macOS 10.15 or later
- Xcode 15.0 or later (the first Xcode release that includes Swift 5.9)
- Swift 5.9 or later
- Clone the repository:
git clone https://github.com/yourusername/pdf2md.git
cd pdf2md
- Open the project in Xcode:
open pdf2md.xcodeproj
- Add the Yams dependency to your Xcode project:
  - In Xcode, go to File → Add Package Dependencies
  - Enter: https://github.com/jpsim/Yams.git
  - Click "Add Package"
- Build the project (⌘+B) or run (⌘+R)
- The executable will be available in the build products directory
# Convert a single PDF
./pdf2md document.pdf
# Convert multiple PDFs
./pdf2md doc1.pdf doc2.pdf doc3.pdf
# Specify output file
./pdf2md document.pdf -o output.md
# Specify output directory for batch processing
./pdf2md doc1.pdf doc2.pdf -o ./markdown_output/
# Use custom configuration
./pdf2md document.pdf -c ./config.yaml
# Enable OCR processing
./pdf2md document.pdf --enable-ocr
# Enable LLM processing
./pdf2md document.pdf --enable-llm
# Enable verbose logging
./pdf2md document.pdf --verbose

The tool supports multiple configuration sources with priority-based loading:
- Command-line parameter (highest priority): -c /path/to/config.yaml
- Environment variable (second priority): PDF2MD_CONFIG=/path/to/config.yaml
- Default files (third priority): Automatically looks for pdf2md.yaml or pdf2md.yml
- Swift defaults (fallback): Built-in configuration
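
The priority order above can be sketched in a few lines of Swift. This is only an illustration and not the project's actual ConfigurationManager code: the `ConfigSource` enum and `resolveConfigSource` function are hypothetical names, and looking for the default files in the current working directory is an assumption, since the search location is not stated here.

```swift
import Foundation

/// Where the configuration was found, listed from highest to lowest priority.
/// (Hypothetical type, used only for this sketch.)
enum ConfigSource: Equatable {
    case commandLine(String)
    case environment(String)
    case defaultFile(String)
    case builtIn
}

/// Picks a configuration source following the documented priority order.
/// `fileExists` is injectable so the logic can be exercised without touching disk.
func resolveConfigSource(
    cliPath: String?,
    environment: [String: String] = ProcessInfo.processInfo.environment,
    fileExists: (String) -> Bool = { FileManager.default.fileExists(atPath: $0) }
) -> ConfigSource {
    // 1. An explicit -c path always wins.
    if let path = cliPath { return .commandLine(path) }
    // 2. Otherwise fall back to the PDF2MD_CONFIG environment variable.
    if let path = environment["PDF2MD_CONFIG"], !path.isEmpty { return .environment(path) }
    // 3. Then look for a default file (assumed: relative to the working directory).
    for name in ["pdf2md.yaml", "pdf2md.yml"] where fileExists(name) {
        return .defaultFile(name)
    }
    // 4. Nothing found: use the built-in Swift defaults.
    return .builtIn
}
```

Keeping the lookup a pure function of its inputs makes each priority level easy to check in a unit test.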
# pdf2md.yaml
llm:
  enabled: true
  model:
    id: "ggml-org/Meta-Llama-3.1-8B-Instruct-Q4_0-GGUF"
    model_name: "meta-llama-3.1-8b-instruct-q4_0.gguf"
ocr:
  enabled: true
  language: "en-US"
  recognition_level: "accurate"
  filter_page_numbers: true
markdown:
  preserve_page_breaks: false
  extract_images: true
  header_detection:
    enabled: true
    patterns:
      # Single quotes keep the regex backslashes literal; \d is not a valid escape in a double-quoted YAML string
      - pattern: '^\d+\.\s+\w+'
        level: 1
      - pattern: '^\d+\.\d+\.\s+\w+'
        level: 2
logging:
  enabled: true
  level: "info"

# Create a new configuration file
./pdf2md config --create ./my-config.yaml
# Show configuration schema
./pdf2md config --schema ./schema.txt

PDFProcessor handles PDF loading and text extraction using PDFKit. It automatically detects whether a PDF has selectable text or requires OCR processing.
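
One plausible way to make that decision is to count how much text PDFKit can extract from each page. The sketch below is an assumption about the approach, not the project's actual PDFProcessor logic: `needsOCR`, `extractedCharacterCounts`, and both thresholds are hypothetical.

```swift
import Foundation
#if canImport(PDFKit)
import PDFKit
#endif

/// Decides whether a document needs OCR from the number of characters
/// extracted on each page. Both thresholds are illustrative guesses.
func needsOCR(
    charactersPerPage: [Int],
    minCharacters: Int = 20,
    minTextPageRatio: Double = 0.5
) -> Bool {
    // A document with no readable pages is treated as scanned.
    guard !charactersPerPage.isEmpty else { return true }
    let textPages = charactersPerPage.filter { $0 >= minCharacters }.count
    return Double(textPages) / Double(charactersPerPage.count) < minTextPageRatio
}

#if canImport(PDFKit)
/// Counts the non-whitespace characters PDFKit can extract from each page.
/// Returns nil if the file cannot be opened as a PDF.
func extractedCharacterCounts(of url: URL) -> [Int]? {
    guard let document = PDFDocument(url: url) else { return nil }
    return (0..<document.pageCount).map { index -> Int in
        let text = document.page(at: index)?.string ?? ""
        return text.filter { !$0.isWhitespace }.count
    }
}
#endif
```

A caller would pass the result of `extractedCharacterCounts(of:)` to `needsOCR(charactersPerPage:)` and route the document to text extraction or OCR accordingly.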
OCRProcessor uses Apple's Vision framework for optical character recognition on scanned documents. It includes intelligent page number filtering and content enhancement.
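
For orientation, a minimal Vision text-recognition call looks roughly like the sketch below. It uses the `accurate` recognition level and `en-US` language from the sample configuration. The `RecognizedLine` type, the function names, and the 0.3 confidence cutoff are assumptions made for this example and are not taken from OCRProcessor.swift.

```swift
import Foundation
#if canImport(Vision)
import Vision
#endif

/// One recognized line of text with Vision's confidence score (0...1).
/// (Hypothetical type for this sketch.)
struct RecognizedLine {
    let text: String
    let confidence: Float
}

/// Joins recognized lines into page text, dropping low-confidence results.
/// The 0.3 cutoff is an illustrative assumption.
func pageText(from lines: [RecognizedLine], minConfidence: Float = 0.3) -> String {
    lines.filter { $0.confidence >= minConfidence }
        .map { $0.text }
        .joined(separator: "\n")
}

#if canImport(Vision)
/// Runs text recognition on a rendered page image and returns the best
/// candidate string for each detected text region.
func recognizeLines(in image: CGImage) throws -> [RecognizedLine] {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate       // recognition_level: "accurate"
    request.recognitionLanguages = ["en-US"]   // language: "en-US"
    request.usesLanguageCorrection = true

    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])

    let observations = request.results ?? []
    return observations.compactMap { observation -> RecognizedLine? in
        guard let best = observation.topCandidates(1).first else { return nil }
        return RecognizedLine(text: best.string, confidence: best.confidence)
    }
}
#endif
```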
MarkdownGenerator converts extracted text to well-structured Markdown. It supports configurable header detection, list formatting, and table preservation.
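
To show how configurable header detection could work, the sketch below applies the two header patterns from the sample configuration to individual lines. `HeaderRule` and `markdownLine(for:rules:)` are hypothetical names. Keeping the section number in the heading text is also an assumption, because the generator's real output format is not documented here.

```swift
import Foundation

/// A compiled header pattern and the Markdown heading level it maps to.
/// (Hypothetical type for this sketch.)
struct HeaderRule {
    let regex: NSRegularExpression
    let level: Int
}

/// Returns the line prefixed with `#` characters if a rule matches,
/// otherwise returns the line unchanged. The first matching rule wins.
func markdownLine(for line: String, rules: [HeaderRule]) -> String {
    let range = NSRange(line.startIndex..., in: line)
    for rule in rules where rule.regex.firstMatch(in: line, range: range) != nil {
        return String(repeating: "#", count: rule.level) + " " + line
    }
    return line
}

// The two patterns from the sample configuration. `try!` is acceptable here
// only because the patterns are fixed literals known to compile.
let sampleHeaderRules = [
    HeaderRule(regex: try! NSRegularExpression(pattern: #"^\d+\.\s+\w+"#), level: 1),
    HeaderRule(regex: try! NSRegularExpression(pattern: #"^\d+\.\d+\.\s+\w+"#), level: 2),
]
```

With these rules, a line such as `1. Introduction` becomes a level-1 heading and `2.3. Scope` becomes a level-2 heading, while ordinary sentences pass through untouched.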
PageNumberDetector filters common OCR artifacts such as page numbers using pattern matching and heuristics.
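
A simplified version of such a filter is sketched below. The three patterns are illustrative guesses at common page-number layouts, not the detector's actual rules, and the function names are hypothetical. A real implementation would likely also consider where the line sits on the page, since a bare number in body text (a year, for example) would be removed by this sketch.

```swift
import Foundation

/// Returns true if a line looks like a standalone page number.
/// The patterns are illustrative, not the project's real rule set.
func isLikelyPageNumber(_ line: String) -> Bool {
    let trimmed = line.trimmingCharacters(in: .whitespaces)
    let patterns = [
        #"^\d{1,4}$"#,                          // "12"
        #"^-\s*\d{1,4}\s*-$"#,                  // "- 12 -"
        #"^page\s+\d{1,4}(\s+of\s+\d{1,4})?$"#, // "Page 12", "page 12 of 40"
    ]
    return patterns.contains { pattern in
        trimmed.range(of: pattern, options: [.regularExpression, .caseInsensitive]) != nil
    }
}

/// Drops every line that looks like a page number and keeps the rest in order.
func removingPageNumbers(from text: String) -> String {
    text.components(separatedBy: "\n")
        .filter { !isLikelyPageNumber($0) }
        .joined(separator: "\n")
}
```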
LocalLLMProcessor integrates with local LLMs for AI-powered text enhancement and structure detection.
PatternManager manages regex patterns for header and list detection. It supports custom pattern creation and validation.
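
Validation of custom patterns might look like the following sketch, which checks that each pattern compiles and that its level is a valid Markdown heading level (1 to 6). `PatternSpec`, `PatternIssue`, and `validate(_:)` are hypothetical names. The project's HeaderPattern model may carry different fields.

```swift
import Foundation

/// A user-supplied pattern as it might appear in the YAML configuration.
/// (Hypothetical type for this sketch.)
struct PatternSpec {
    let pattern: String
    let level: Int
}

/// A problem found while validating a pattern.
struct PatternIssue: Equatable {
    let pattern: String
    let reason: String
}

/// Checks every pattern and collects all problems instead of stopping at the first.
func validate(_ specs: [PatternSpec]) -> [PatternIssue] {
    var issues: [PatternIssue] = []
    for spec in specs {
        // Markdown defines heading levels 1 through 6.
        if !(1...6).contains(spec.level) {
            issues.append(PatternIssue(pattern: spec.pattern, reason: "level must be between 1 and 6"))
        }
        // NSRegularExpression throws if the pattern does not compile.
        do {
            _ = try NSRegularExpression(pattern: spec.pattern)
        } catch {
            issues.append(PatternIssue(pattern: spec.pattern, reason: "invalid regular expression"))
        }
    }
    return issues
}
```

Reporting every issue at once lets a user fix a configuration file in a single pass.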
BatchProcessor converts multiple PDFs in one run, with progress reporting, error handling, and statistics.
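
The general shape of such a batch loop is sketched below, assuming files are processed sequentially and that one failure should not stop the remaining conversions. Neither behavior is confirmed by the project, and `BatchStatistics` and `processBatch` are hypothetical names.

```swift
import Foundation

/// Running totals for a batch run. (Hypothetical type for this sketch.)
struct BatchStatistics: Equatable {
    var succeeded = 0
    var failed = 0
}

/// Converts each input in order, reporting progress after every file and
/// recording errors per path instead of aborting the batch.
func processBatch(
    _ inputs: [String],
    convert: (String) throws -> Void,
    progress: (_ completed: Int, _ total: Int, _ path: String) -> Void = { _, _, _ in }
) -> (stats: BatchStatistics, errors: [String: Error]) {
    var stats = BatchStatistics()
    var errors: [String: Error] = [:]
    for (index, path) in inputs.enumerated() {
        do {
            try convert(path)
            stats.succeeded += 1
        } catch {
            // Remember the failure and continue with the next file.
            stats.failed += 1
            errors[path] = error
        }
        progress(index + 1, inputs.count, path)
    }
    return (stats, errors)
}
```

Passing the conversion step in as a closure keeps the loop independent of PDFKit and Vision, so it can be tested with stub converters.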
ConfigurationManager implements priority-based configuration loading with YAML support and validation.
The project follows a modular architecture with clear separation of concerns:
- Configuration Layer: Manages all configuration aspects
- Core Processing Layer: Handles PDF processing, OCR, and markdown generation
- Pattern Management Layer: Manages document structure detection patterns
- Utility Layer: Provides logging, error handling, and common utilities
- New Processors: Extend the Core layer with new processing capabilities
- New Patterns: Add to PatternManager for enhanced structure detection
- New Configuration: Extend PDF2MDConfig and ConfigurationManager
- New CLI Commands: Add to main.swift as new subcommands
# Run tests in Xcode
# Use ⌘+U to run all tests, or select specific test targets in the test navigator
# Or run from command line (if you have xcodebuild available)
xcodebuild test -project pdf2md.xcodeproj -scheme pdf2md

- Yams: YAML parsing and generation (add via Xcode Package Dependencies)
- PDFKit: PDF processing (built-in Apple framework)
- Vision: OCR processing (built-in Apple framework)
- ArgumentParser: Command-line argument parsing (Apple's open-source swift-argument-parser package, added as a Swift package dependency rather than bundled with the OS)
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Enhanced table detection and formatting
- Image extraction and referencing
- Support for additional output formats (HTML, LaTeX)
- Performance optimization for large documents
- Web interface for configuration management
- Integration with cloud OCR services
- Support for additional languages
For issues and questions:
- Create an issue on GitHub
- Check the documentation in the Resources/Documentation/ folder
- Review the implementation plan in Resources/Documentation/implementation-plan.md