alan-zhang-22/pdf2md


PDF2MD - PDF to Markdown Converter

A command-line tool that converts PDF documents to structured Markdown, using PDFKit for text extraction and Apple's Vision framework for OCR.

Features

  • Intelligent PDF Processing: Automatically detects PDF type and applies appropriate processing method
  • OCR Support: Uses Apple Vision framework for scanned documents
  • Text Extraction: Leverages PDFKit for structured PDFs with selectable text
  • AI Enhancement: Local LLM integration for improved markdown generation
  • Batch Processing: Convert multiple PDFs efficiently with progress reporting
  • Configurable: YAML-based configuration with priority-based loading
  • Pattern Detection: Advanced header and list detection using regex patterns
  • Page Number Filtering: Intelligent filtering of common OCR artifacts

Architecture

The project is split into a thin command-line target and a core framework:

Sources/
├── pdf2md/                    # Command-line interface
│   └── main.swift            # Main entry point
└── PDF2MDFramework/          # Core framework
    ├── Configuration/         # Configuration management
    │   ├── PDF2MDConfig.swift
    │   └── ConfigurationManager.swift
    ├── Core/                 # Core processing components
    │   ├── PDFProcessor.swift
    │   ├── OCRProcessor.swift
    │   ├── MarkdownGenerator.swift
    │   ├── PageNumberDetector.swift
    │   ├── LocalLLMProcessor.swift
    │   ├── PatternManager.swift
    │   └── BatchProcessor.swift
    ├── Models/               # Data models
    │   ├── HeaderPattern.swift
    │   └── ListPattern.swift
    └── Utils/                # Utility classes
        └── Logger.swift

Installation

Prerequisites

  • macOS 10.15 or later
  • Xcode 15.0 or later (the first Xcode release that bundles Swift 5.9)
  • Swift 5.9 or later

Building from Source

  1. Clone the repository:
git clone https://github.com/alan-zhang-22/pdf2md.git
cd pdf2md
  2. Open the project in Xcode:
open pdf2md.xcodeproj
  3. Add the Yams dependency to your Xcode project:

    • In Xcode, go to File → Add Package Dependencies
    • Enter: https://github.com/jpsim/Yams.git
    • Click "Add Package"
  4. Build the project (⌘+B) or run (⌘+R)

  5. The executable will be available in the build products directory

Usage

Basic Usage

# Convert a single PDF
./pdf2md document.pdf

# Convert multiple PDFs
./pdf2md doc1.pdf doc2.pdf doc3.pdf

# Specify output file
./pdf2md document.pdf -o output.md

# Specify output directory for batch processing
./pdf2md doc1.pdf doc2.pdf -o ./markdown_output/

# Use custom configuration
./pdf2md document.pdf -c ./config.yaml

# Enable OCR processing
./pdf2md document.pdf --enable-ocr

# Enable LLM processing
./pdf2md document.pdf --enable-llm

# Enable verbose logging
./pdf2md document.pdf --verbose

Configuration

The tool supports multiple configuration sources with priority-based loading:

  1. Command-line parameter (highest priority): -c /path/to/config.yaml
  2. Environment variable (second priority): PDF2MD_CONFIG=/path/to/config.yaml
  3. Default files (third priority): Automatically looks for pdf2md.yaml or pdf2md.yml
  4. Swift defaults (fallback): Built-in configuration
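
The lookup order above can be sketched in a few lines of Swift. This is a minimal illustration, not the project's ConfigurationManager: the names ConfigSource and resolveConfigSource are hypothetical, and searching the current working directory for the default files is an assumption, since the README does not say where they are looked up.

```swift
import Foundation

/// Where the configuration came from (hypothetical type, for illustration only).
enum ConfigSource: Equatable {
    case commandLine(String)
    case environment(String)
    case defaultFile(String)
    case builtInDefaults
}

/// Resolves the configuration source in the documented priority order.
/// `fileExists` is injected so the logic can be exercised without touching the disk.
func resolveConfigSource(
    cliPath: String?,
    environment: [String: String] = ProcessInfo.processInfo.environment,
    searchDirectory: String = FileManager.default.currentDirectoryPath, // assumption
    fileExists: (String) -> Bool = { FileManager.default.fileExists(atPath: $0) }
) -> ConfigSource {
    // 1. Path passed with -c wins.
    if let cliPath = cliPath { return .commandLine(cliPath) }
    // 2. Then the PDF2MD_CONFIG environment variable.
    if let envPath = environment["PDF2MD_CONFIG"], !envPath.isEmpty {
        return .environment(envPath)
    }
    // 3. Then the default file names, in the order the README lists them.
    for name in ["pdf2md.yaml", "pdf2md.yml"] {
        let candidate = URL(fileURLWithPath: searchDirectory)
            .appendingPathComponent(name).path
        if fileExists(candidate) { return .defaultFile(candidate) }
    }
    // 4. Otherwise fall back to the built-in Swift defaults.
    return .builtInDefaults
}
```

Returning the source rather than the loaded configuration keeps the priority logic separate from YAML parsing, which makes it easy to log which file was actually used.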

Configuration File Example

# pdf2md.yaml
llm:
  enabled: true
  model:
    id: "ggml-org/Meta-Llama-3.1-8B-Instruct-Q4_0-GGUF"
    model_name: "meta-llama-3.1-8b-instruct-q4_0.gguf"

ocr:
  enabled: true
  language: "en-US"
  recognition_level: "accurate"
  filter_page_numbers: true

markdown:
  preserve_page_breaks: false
  extract_images: true
  header_detection:
    enabled: true
    patterns:
      - pattern: '^\d+\.\s+\w+'
        level: 1
      - pattern: '^\d+\.\d+\.\s+\w+'
        level: 2

logging:
  enabled: true
  level: "info"

Configuration Commands

# Create a new configuration file
./pdf2md config --create ./my-config.yaml

# Show configuration schema
./pdf2md config --schema ./schema.txt

Core Components

PDFProcessor

Handles PDF loading and text extraction using PDFKit. Automatically detects whether a PDF has selectable text or requires OCR processing.
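
One way to decide between text extraction and OCR is to check how many pages carry a usable text layer. The sketch below is an assumption about how such a check could look, not the actual PDFProcessor logic: the function names, the 20-character threshold, and the 50% ratio are all illustrative. The PDFKit part uses only PDFDocument(url:), pageCount, page(at:), and PDFPage.string.

```swift
import Foundation
#if canImport(PDFKit)
import PDFKit
#endif

/// Decides whether a document should go through OCR, given the text PDFKit
/// returned for each page. Thresholds are illustrative assumptions.
func needsOCR(
    pageTexts: [String?],
    minCharactersPerPage: Int = 20,
    minTextPageRatio: Double = 0.5
) -> Bool {
    // A document without pages has nothing to recognize.
    guard !pageTexts.isEmpty else { return false }
    let pagesWithText = pageTexts.filter { text in
        let trimmed = (text ?? "").trimmingCharacters(in: .whitespacesAndNewlines)
        return trimmed.count >= minCharactersPerPage
    }.count
    // OCR is needed when too few pages have a meaningful text layer.
    return Double(pagesWithText) / Double(pageTexts.count) < minTextPageRatio
}

#if canImport(PDFKit)
/// Collects the embedded text of every page; returns nil if the file cannot be opened.
func embeddedPageTexts(at url: URL) -> [String?]? {
    guard let document = PDFDocument(url: url) else { return nil }
    return (0..<document.pageCount).map { document.page(at: $0)?.string }
}
#endif
```

Keeping the decision in a pure function means the thresholds can be tuned and tested without any PDF fixtures.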

OCRProcessor

Uses Apple's Vision framework for optical character recognition on scanned documents. Includes intelligent page number filtering and content enhancement.
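
For orientation, text recognition with Vision might look roughly like the sketch below. It is not the project's OCRProcessor: RecognizedLine, readingOrder, and recognizeLines are hypothetical names, and sorting lines by their bounding boxes is an assumed post-processing step. The recognition level and language mirror the "accurate" and "en-US" values from the configuration example. Vision reports normalized bounding boxes with the origin in the lower-left corner, so a larger minY means higher on the page.

```swift
import Foundation
#if canImport(Vision)
import Vision
#endif

/// One recognized line with the origin of its normalized bounding box.
struct RecognizedLine: Equatable {
    let text: String
    let minX: Double
    let minY: Double
}

/// Orders lines top to bottom, then left to right. Lines are grouped into
/// rows of height `rowHeight` (in normalized page units) before comparing.
func readingOrder(_ lines: [RecognizedLine], rowHeight: Double = 0.01) -> [RecognizedLine] {
    func row(_ line: RecognizedLine) -> Int { Int((line.minY / rowHeight).rounded()) }
    return lines.sorted { a, b in
        let (rowA, rowB) = (row(a), row(b))
        return rowA != rowB ? rowA > rowB : a.minX < b.minX
    }
}

#if canImport(Vision)
/// Runs Vision text recognition on one rendered page image.
@available(macOS 10.15, iOS 13.0, *)
func recognizeLines(in image: CGImage, language: String = "en-US") throws -> [RecognizedLine] {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate
    request.recognitionLanguages = [language]
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])
    // The cast keeps this compiling with SDKs where `results` is untyped.
    let observations = (request.results ?? []).compactMap { $0 as? VNRecognizedTextObservation }
    let lines = observations.compactMap { observation -> RecognizedLine? in
        guard let best = observation.topCandidates(1).first else { return nil }
        return RecognizedLine(
            text: best.string,
            minX: Double(observation.boundingBox.minX),
            minY: Double(observation.boundingBox.minY)
        )
    }
    return readingOrder(lines)
}
#endif
```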

MarkdownGenerator

Converts extracted text to well-structured Markdown. Supports configurable header detection, list formatting, and table preservation.
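
To make the header detection concrete, here is a minimal sketch of how pattern/level pairs like those in the configuration example could be applied line by line. HeaderRule, renderMarkdown, and the bullet-marker regex are assumptions made for illustration; the real MarkdownGenerator may work differently, and table handling is left out.

```swift
import Foundation

/// A header rule: lines matching `regex` become a heading of the given level.
struct HeaderRule {
    let regex: NSRegularExpression
    let level: Int
}

/// Returns true if the regex matches anywhere in the line.
func lineMatches(_ regex: NSRegularExpression, _ line: String) -> Bool {
    let range = NSRange(line.startIndex..., in: line)
    return regex.firstMatch(in: line, options: [], range: range) != nil
}

/// Converts plain lines to Markdown: headers first, then bullets, else unchanged.
func renderMarkdown(lines: [String], headerRules: [HeaderRule]) -> String {
    // Leading bullet markers that should be normalized to "- " (assumed set).
    let bullet = try! NSRegularExpression(pattern: #"^[•*-]\s+"#)
    var output: [String] = []
    for rawLine in lines {
        let line = rawLine.trimmingCharacters(in: .whitespaces)
        let fullRange = NSRange(line.startIndex..., in: line)
        if line.isEmpty {
            output.append("")
        } else if let rule = headerRules.first(where: { lineMatches($0.regex, line) }) {
            output.append(String(repeating: "#", count: rule.level) + " " + line)
        } else if let match = bullet.firstMatch(in: line, options: [], range: fullRange),
                  let marker = Range(match.range, in: line) {
            output.append("- " + String(line[marker.upperBound...]))
        } else {
            output.append(line)
        }
    }
    return output.joined(separator: "\n")
}

// Rules equivalent to the two header patterns in the configuration example.
let exampleRules = [
    HeaderRule(regex: try! NSRegularExpression(pattern: #"^\d+\.\s+\w+"#), level: 1),
    HeaderRule(regex: try! NSRegularExpression(pattern: #"^\d+\.\d+\.\s+\w+"#), level: 2),
]
```

Note that with these two patterns a numbered list item such as "1. Buy milk" is also promoted to a level-1 heading, so real configurations may need stricter patterns.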

PageNumberDetector

Intelligently filters common OCR artifacts like page numbers using pattern matching and heuristics.
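
A sketch of what "pattern matching and heuristics" could mean in practice: match a few typical page-number shapes, but only drop a line when it sits at the top or bottom edge of a page. The pattern set, the function names, and the edge depth of two lines are assumptions, not the project's actual PageNumberDetector rules.

```swift
import Foundation

/// Shapes that usually indicate a page-number line (illustrative set).
let pageNumberPatterns: [NSRegularExpression] = [
    #"^\d{1,4}$"#,                              // "12"
    #"^[-\u2013]\s*\d{1,4}\s*[-\u2013]$"#,      // "- 12 -"
    #"^page\s+\d{1,4}(\s+of\s+\d{1,4})?$"#,     // "Page 3", "page 3 of 10"
    #"^\d{1,4}\s*/\s*\d{1,4}$"#,                // "3 / 10"
].map { try! NSRegularExpression(pattern: $0, options: [.caseInsensitive]) }

/// Returns true if the whole (trimmed) line looks like a page number.
func isPageNumberLine(_ line: String) -> Bool {
    let trimmed = line.trimmingCharacters(in: .whitespaces)
    let range = NSRange(trimmed.startIndex..., in: trimmed)
    return pageNumberPatterns.contains {
        $0.firstMatch(in: trimmed, options: [], range: range) != nil
    }
}

/// Removes page-number lines, but only among the first or last `edgeDepth`
/// non-empty lines of a page, so bare numbers in the body text survive.
func stripPageNumbers(fromPage lines: [String], edgeDepth: Int = 2) -> [String] {
    let contentIndices = lines.indices.filter {
        !lines[$0].trimmingCharacters(in: .whitespaces).isEmpty
    }
    let edges = Set(contentIndices.prefix(edgeDepth))
        .union(contentIndices.suffix(edgeDepth))
    return lines.enumerated()
        .filter { index, line in !(edges.contains(index) && isPageNumberLine(line)) }
        .map { $0.element }
}
```

The position check is the heuristic part: a lone "7" in the middle of a page is more likely a table cell or list value than a page number.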

LocalLLMProcessor

Integrates with local LLM models for AI-powered text enhancement and structure detection.

PatternManager

Manages regex patterns for header and list detection. Supports custom pattern creation and validation.
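
Validation of user-supplied patterns can be as simple as trying to compile each one and collecting the failures. The function below is a hypothetical sketch of that idea, not PatternManager's real interface; it relies only on NSRegularExpression throwing for an invalid pattern.

```swift
import Foundation

/// Compiles each pattern and separates usable ones from rejected ones.
/// Empty patterns are rejected explicitly because they would match every line.
func compilePatterns(_ patterns: [String]) -> (compiled: [NSRegularExpression], rejected: [String]) {
    var compiled: [NSRegularExpression] = []
    var rejected: [String] = []
    for pattern in patterns {
        guard !pattern.isEmpty else {
            rejected.append(pattern)
            continue
        }
        do {
            compiled.append(try NSRegularExpression(pattern: pattern))
        } catch {
            // The initializer throws when the pattern is not a valid regex.
            rejected.append(pattern)
        }
    }
    return (compiled, rejected)
}
```

Reporting rejected patterns at load time, instead of failing on the first document, makes configuration mistakes much easier to spot.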

BatchProcessor

Handles multiple PDF processing with progress reporting, error handling, and statistics.
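
The batch behavior described here (keep going after a failure, report progress, summarize at the end) can be sketched as a plain loop. BatchStatistics and processBatch are hypothetical names, and the conversion step is passed in as a closure so the sketch stays independent of the real processing pipeline.

```swift
import Foundation

/// Summary of one batch run (hypothetical type, for illustration only).
struct BatchStatistics: Equatable {
    var succeeded = 0
    var failed = 0
    var failures: [String: String] = [:]   // input path -> error description
}

/// Converts every input, continuing past failures and reporting progress
/// after each file. `convert` stands in for the actual PDF-to-Markdown step.
func processBatch(
    inputs: [String],
    convert: (String) throws -> String,
    progress: (_ completed: Int, _ total: Int, _ path: String) -> Void = { _, _, _ in }
) -> (outputs: [String: String], stats: BatchStatistics) {
    var outputs: [String: String] = [:]
    var stats = BatchStatistics()
    for (index, path) in inputs.enumerated() {
        do {
            outputs[path] = try convert(path)
            stats.succeeded += 1
        } catch {
            // One bad file should not abort the whole batch.
            stats.failed += 1
            stats.failures[path] = String(describing: error)
        }
        progress(index + 1, inputs.count, path)
    }
    return (outputs, stats)
}
```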

ConfigurationManager

Implements priority-based configuration loading with YAML support and validation.

Development

Project Structure

The project follows a modular architecture with clear separation of concerns:

  • Configuration Layer: Manages all configuration aspects
  • Core Processing Layer: Handles PDF processing, OCR, and markdown generation
  • Pattern Management Layer: Manages document structure detection patterns
  • Utility Layer: Provides logging, error handling, and common utilities

Adding New Features

  1. New Processors: Extend the Core layer with new processing capabilities
  2. New Patterns: Add to PatternManager for enhanced structure detection
  3. New Configuration: Extend PDF2MDConfig and ConfigurationManager
  4. New CLI Commands: Add to main.swift as new subcommands

Testing

# Run tests in Xcode
# Use ⌘+U to run all tests, or select specific test targets in the test navigator

# Or run from command line (if you have xcodebuild available)
xcodebuild test -project pdf2md.xcodeproj -scheme pdf2md

Dependencies

  • Yams: YAML parsing and generation (add via Xcode Package Dependencies)
  • PDFKit: PDF processing (built-in Apple framework)
  • Vision: OCR processing (built-in Apple framework)
  • ArgumentParser: Command-line argument parsing (Apple's open-source swift-argument-parser package, not part of the SDK; add via Xcode Package Dependencies)

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Roadmap

  • Enhanced table detection and formatting
  • Image extraction and referencing
  • Support for additional output formats (HTML, LaTeX)
  • Performance optimization for large documents
  • Web interface for configuration management
  • Integration with cloud OCR services
  • Support for additional languages

Support

For issues and questions:

  • Create an issue on GitHub
  • Check the documentation in the Resources/Documentation/ folder
  • Review the implementation plan in Resources/Documentation/implementation-plan.md
