
alan-zhang-22/mdkit


mdkit

License: MIT Swift Platform

Intelligent PDF to Markdown conversion tool using Apple Vision Framework and local LLMs

mdkit is a powerful, intelligent PDF to Markdown conversion tool that leverages Apple's Vision framework for advanced document analysis and local Large Language Models (LLMs) for markdown optimization. It's designed specifically for technical documents, academic papers, and structured content that requires high-quality conversion.

✨ Features

🧠 Intelligent Document Analysis

  • Apple Vision Framework Integration: Advanced OCR with document structure detection
  • Position-Based Processing: Maintains logical document flow from top to bottom
  • Duplicate Detection: Automatically identifies and resolves overlapping content
  • Smart Element Recognition: Detects titles, headers, paragraphs, tables, lists, and barcodes
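The position-based ordering and duplicate detection described above can be sketched in plain Swift. This is a minimal illustration, not mdkit's implementation. The `Box` and `OCRElement` types, the intersection-over-union (IoU) overlap measure and the 0.5 threshold are all assumptions. The sketch also assumes page coordinates with a top-left origin (y grows downward). Vision's own observations instead use normalized coordinates with a bottom-left origin.

```swift
// Hypothetical axis-aligned box in page points, top-left origin (y grows downward).
struct Box {
    var x: Double, y: Double, width: Double, height: Double
    var area: Double { width * height }

    // Intersection-over-union: 0 for disjoint boxes, 1 for identical ones.
    func iou(with other: Box) -> Double {
        let left = max(x, other.x)
        let right = min(x + width, other.x + other.width)
        let top = max(y, other.y)
        let bottom = min(y + height, other.y + other.height)
        let intersection = max(0, right - left) * max(0, bottom - top)
        let union = area + other.area - intersection
        return union > 0 ? intersection / union : 0
    }
}

// Hypothetical recognized element: OCR text plus its position and confidence.
struct OCRElement {
    var text: String
    var box: Box
    var confidence: Double
}

/// Drops any element whose box overlaps an already-kept one by more than
/// `threshold` IoU (higher confidence wins), then returns the survivors in
/// reading order: top to bottom, ties broken left to right.
func orderAndDeduplicate(_ elements: [OCRElement], threshold: Double = 0.5) -> [OCRElement] {
    var kept: [OCRElement] = []
    // Visit high-confidence elements first so they win overlap conflicts.
    for candidate in elements.sorted(by: { $0.confidence > $1.confidence }) {
        let isDuplicate = kept.contains { $0.box.iou(with: candidate.box) > threshold }
        if !isDuplicate { kept.append(candidate) }
    }
    return kept.sorted { a, b in
        a.box.y != b.box.y ? a.box.y < b.box.y : a.box.x < b.box.x
    }
}
```

Consider two nearly identical "Title" boxes, for example one from a title observation and one from a text-line observation. Only the more confident box survives, and it is placed before any body text lower on the page.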

📋 Header & Footer Management

  • Region-Based Detection: Precise header/footer detection using absolute coordinates
  • Frequency Analysis: Identifies repetitive page elements across documents
  • Configurable Thresholds: Adjustable detection parameters for different document types
  • Multi-Region Support: Handles complex layouts with multiple header/footer areas
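Region-based detection and frequency analysis can be combined roughly as follows. The sketch assumes the top-left-origin page coordinates implied by the `headerRegionY` (72.0) and `footerRegionY` (720.0) values in the sample configuration. The `PageLine` type, the `minPageFraction` parameter, the digit normalization and the rule are illustrative assumptions, not mdkit's actual logic. Under that rule, a line is dropped only if it sits in a margin region and also repeats on enough pages.

```swift
// Hypothetical OCR line: page index, text, and vertical extent in points (y grows downward).
struct PageLine {
    var page: Int
    var text: String
    var minY: Double
    var maxY: Double
}

// Field names mirror the sample configuration's "regionBasedDetection" block.
struct RegionConfig {
    var headerRegionY = 72.0
    var footerRegionY = 720.0
    var regionTolerance = 5.0
}

enum Region { case header, footer, body }

func region(of line: PageLine, config: RegionConfig) -> Region {
    if line.maxY <= config.headerRegionY + config.regionTolerance { return .header }
    if line.minY >= config.footerRegionY - config.regionTolerance { return .footer }
    return .body
}

// Replace digits so "Page 3" and "Page 4" count as the same repeated footer.
func normalized(_ text: String) -> String {
    String(text.map { $0.isNumber ? Character("#") : $0 })
}

/// Drops margin lines whose normalized text appears on at least
/// `minPageFraction` of all pages; body lines are always kept.
func removeHeadersAndFooters(_ lines: [PageLine], pageCount: Int,
                             config: RegionConfig = RegionConfig(),
                             minPageFraction: Double = 0.5) -> [PageLine] {
    // Frequency analysis: on how many distinct pages does each margin text occur?
    var pagesByText: [String: Set<Int>] = [:]
    for line in lines where region(of: line, config: config) != .body {
        pagesByText[normalized(line.text), default: []].insert(line.page)
    }
    return lines.filter { line in
        guard region(of: line, config: config) != .body else { return true }
        let pages = pagesByText[normalized(line.text)]?.count ?? 0
        return Double(pages) / Double(max(pageCount, 1)) < minPageFraction
    }
}
```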

🔗 Header & List Detection

  • Pattern Recognition: Automatically detects numbered, lettered, and named headers
  • Smart Merging: Combines split headers and list items using OCR position data
  • Level Calculation: Automatic header level detection and markdown hierarchy
  • Nested List Support: Handles complex nested list structures with indentation
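Pattern recognition and level calculation can be approximated with a few string checks. This simplified sketch makes two assumptions:

- Numbered headers look like `3.2.1 Title`, and the number of dot-separated components gives the level.
- List items start with `-`, `•`, `*`, or a marker such as `a)` or `3)`.

mdkit's real patterns are not shown here, including named headers and the position-based merging of split lines.

```swift
import Foundation

enum LineKind: Equatable {
    case header(level: Int, title: String)
    case listItem(text: String)
    case paragraph
}

/// Classifies one line of OCR text with simple, hypothetical patterns.
func classify(_ line: String) -> LineKind {
    let trimmed = line.trimmingCharacters(in: .whitespaces)
    // Split off the first token, e.g. "2.1" from "2.1 Scope" or "-" from "- item".
    guard let space = trimmed.firstIndex(of: " ") else { return .paragraph }
    let token = trimmed[..<space]
    let rest = trimmed[trimmed.index(after: space)...].trimmingCharacters(in: .whitespaces)
    guard !rest.isEmpty else { return .paragraph }

    // Bullet markers.
    if token == "-" || token == "•" || token == "*" {
        return .listItem(text: rest)
    }
    // Enumerated list markers such as "a)" or "3)".
    if token.count >= 2, token.last == ")",
       token.dropLast().allSatisfy({ $0.isLetter || $0.isNumber }) {
        return .listItem(text: rest)
    }
    // Numbered headers: "2", "2.1", "3.2.1" (a trailing dot is ignored).
    // The number of dot-separated components becomes the header level.
    let parts = token.split(separator: ".")
    if !parts.isEmpty, parts.allSatisfy({ $0.allSatisfy(\.isNumber) }) {
        return .header(level: parts.count, title: trimmed)
    }
    return .paragraph
}

/// Maps a classified line to markdown; header levels are clamped to markdown's six levels.
func markdown(for kind: LineKind, original: String) -> String {
    switch kind {
    case .header(let level, let title):
        return String(repeating: "#", count: min(level, 6)) + " " + title
    case .listItem(let text):
        return "- " + text
    case .paragraph:
        return original
    }
}
```

These checks are deliberately naive. For example, a sentence starting with a year ("2024 was…") would be misread as a level-1 header, so real documents need extra context such as font size or position.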

🤖 Local LLM Integration

  • llama.cpp Backend: Local processing with LocalLLMClientLlama
  • Language Detection: Automatic document language detection using Apple's Natural Language framework
  • Multi-Language Prompts: Support for English, Chinese, and other languages
  • Configurable Prompts: Customizable system and user prompts with template placeholders
  • Markdown Optimization: AI-powered structure improvement and formatting enhancement
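Configurable prompts with template placeholders could work along the lines below. The following are assumptions for illustration:

- the `{{name}}` placeholder syntax;
- the `documentType` placeholder;
- the per-language template table.

Check mdkit's configuration for the placeholders it actually supports. In mdkit the language code would come from Apple's Natural Language framework (for example `NLLanguageRecognizer.dominantLanguage(for:)`). That call is omitted here so the sketch stays platform-independent.

```swift
import Foundation

/// Replaces every `{{key}}` occurrence in `template` with its value.
/// Unknown placeholders are left untouched so missing values are easy to spot.
func render(_ template: String, values: [String: String]) -> String {
    var result = template
    for (key, value) in values {
        result = result.replacingOccurrences(of: "{{\(key)}}", with: value)
    }
    return result
}

// Hypothetical per-language system prompts keyed by ISO 639-1 code.
let systemPrompts: [String: String] = [
    "en": "You improve the structure of {{documentType}} markdown. Reply in English.",
    "zh": "你负责优化{{documentType}}的 Markdown 结构。请用中文回答。",
]

/// Picks the prompt for the detected language, falling back to English.
func systemPrompt(forLanguage code: String, documentType: String) -> String {
    let template = systemPrompts[code] ?? systemPrompts["en"]!
    return render(template, values: ["documentType": documentType])
}
```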

⚙️ Flexible Configuration

  • JSON Configuration: Comprehensive configuration system with no hardcoded values
  • Environment Support: Development, production, and testing configurations
  • Configuration Inheritance: Base configs with environment-specific overrides
  • Validation: JSON schema validation and error checking
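Configuration inheritance, where an environment-specific file overrides a base configuration, is commonly implemented as a recursive merge of the parsed JSON objects. The sketch below assumes that behavior: nested objects merge key by key, while scalars and arrays in the override replace the base value. This README does not document whether mdkit merges exactly this way. The function names are hypothetical.

```swift
import Foundation

struct NotAJSONObject: Error {}

/// Recursively merges `override` into `base`: nested objects are merged key by key,
/// any other value in `override` (number, string, array, null) replaces the base value.
func deepMerge(_ base: [String: Any], _ override: [String: Any]) -> [String: Any] {
    var merged = base
    for (key, overrideValue) in override {
        if let baseObject = merged[key] as? [String: Any],
           let overrideObject = overrideValue as? [String: Any] {
            merged[key] = deepMerge(baseObject, overrideObject)
        } else {
            merged[key] = overrideValue
        }
    }
    return merged
}

/// Parses a base and an environment-specific JSON document and merges the second over the first.
func mergedConfig(base: String, override: String) throws -> [String: Any] {
    func object(_ json: String) throws -> [String: Any] {
        let value = try JSONSerialization.jsonObject(with: Data(json.utf8))
        guard let dict = value as? [String: Any] else { throw NotAJSONObject() }
        return dict
    }
    return deepMerge(try object(base), try object(override))
}
```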

📁 Centralized File Management

  • Consistent Naming: Timestamped files with document hashes
  • Organized Output: Separate directories for markdown, logs, and temporary files
  • Comprehensive Logging: Detailed logs for every processing step
  • Traceability: Link generated markdown to source OCR elements and LLM prompts
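A naming scheme based on timestamped files with document hashes might look like the following. The details are assumptions:

- the exact format `<name>_<yyyyMMdd-HHmmss>_<hash>.md`;
- the UTC timestamp;
- the 64-bit FNV-1a hash.

FNV-1a is used only to keep the sketch dependency-free. A real tool would more likely use a cryptographic digest such as SHA-256.

```swift
import Foundation

/// 64-bit FNV-1a hash of raw bytes (non-cryptographic; stand-in for a real digest).
func fnv1a64(_ bytes: [UInt8]) -> UInt64 {
    var hash: UInt64 = 0xcbf29ce484222325  // FNV offset basis
    for byte in bytes {
        hash ^= UInt64(byte)
        hash = hash &* 0x100000001b3       // FNV prime, wrapping multiply
    }
    return hash
}

/// Builds a hypothetical output name such as "report_20240102-030405_1a2b3c4d.md".
func outputFileName(for sourcePath: String, contents: [UInt8], date: Date,
                    fileExtension: String = "md") -> String {
    let formatter = DateFormatter()
    formatter.locale = Locale(identifier: "en_US_POSIX")
    formatter.timeZone = TimeZone(identifier: "UTC")
    formatter.dateFormat = "yyyyMMdd-HHmmss"
    // Zero-pad the hash to 16 hex digits, then keep the first 8 as a short document ID.
    let hex = String(fnv1a64(contents), radix: 16)
    let padded = String(repeating: "0", count: 16 - hex.count) + hex
    let stem = URL(fileURLWithPath: sourcePath).deletingPathExtension().lastPathComponent
    return "\(stem)_\(formatter.string(from: date))_\(padded.prefix(8)).\(fileExtension)"
}
```

Because the hash is computed from the PDF's bytes, re-running on the same input produces the same ID. Markdown files, logs and temporary files for one document can then share a common prefix, which helps traceability.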

🧪 Testing & Quality

  • Lightweight Dependency Injection: Easy testing with protocol-based interfaces
  • Comprehensive Testing: Unit tests, integration tests, and performance benchmarks
  • Mock Implementations: Simple mocking for external dependencies
  • Quality Assurance: >90% test coverage target

🚀 Quick Start

Prerequisites

  • macOS 13.0+ (Ventura)
  • Xcode 15.0+
  • Swift 5.9+
  • Local LLM model (optional, for markdown optimization)

Installation

  1. Clone the repository

    git clone --recursive https://github.com/alan-zhang-22/mdkit.git
    cd mdkit
  2. Open in Xcode

    open mdkit.xcodeproj
  3. Build from the command line

    swift build

Basic Usage

# Convert a PDF using default configuration
mdkit input.pdf

# Use custom configuration
mdkit --config my-config.json input.pdf

# Generate configuration template
mdkit --generate-config > template.json

# Validate configuration
mdkit --validate-config my-config.json

# Dry run (test without processing)
mdkit --dry-run input.pdf

📖 Configuration

mdkit uses a comprehensive JSON configuration system. Here's a basic example:

{
  "version": "1.0",
  "description": "mdkit PDF to Markdown conversion configuration",
  
  "headerFooterDetection": {
    "enabled": true,
    "regionBasedDetection": {
      "enabled": true,
      "headerRegionY": 72.0,
      "footerRegionY": 720.0,
      "regionTolerance": 5.0
    }
  },
  
  "llm": {
    "enabled": true,
    "backend": "LocalLLMClientLlama",
    "model": {
      "id": "llama-3.1-8b-instruct-q4_0",
      "localPath": "~/models/llama-3.1-8b-instruct-q4_0.gguf"
    },
    "parameters": {
      "temperature": 0.1,
      "context": 4096,
      "threads": 8
    }
  }
}
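The sample configuration maps naturally onto `Codable` types. The struct names below are hypothetical and cover only the keys shown in the example. mdkit's own configuration types may differ and include many more settings.

```swift
import Foundation

// Hypothetical Codable mirror of the sample configuration above.
struct MDKitConfig: Codable {
    var version: String
    var description: String?          // optional: not every config needs one
    var headerFooterDetection: HeaderFooterDetection
    var llm: LLMConfig

    struct HeaderFooterDetection: Codable {
        var enabled: Bool
        var regionBasedDetection: RegionBasedDetection
    }
    struct RegionBasedDetection: Codable {
        var enabled: Bool
        var headerRegionY: Double
        var footerRegionY: Double
        var regionTolerance: Double
    }
    struct LLMConfig: Codable {
        var enabled: Bool
        var backend: String
        var model: Model
        var parameters: Parameters
    }
    struct Model: Codable {
        var id: String
        var localPath: String
    }
    struct Parameters: Codable {
        var temperature: Double
        var context: Int
        var threads: Int
    }
}

/// Decodes a configuration; missing required keys or wrong types throw a DecodingError.
func loadConfig(from data: Data) throws -> MDKitConfig {
    try JSONDecoder().decode(MDKitConfig.self, from: data)
}
```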

Configuration Locations

  1. Command-line specified path (--config)
  2. Project-specific config (./mdkit-config.json)
  3. User config (~/.config/mdkit/config.json)
  4. Built-in defaults
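Assuming the list above is in order of precedence, with the first location that exists winning, resolution could be sketched like this. The function name and the injected `fileExists` closure are hypothetical. The closure makes the logic testable without touching the file system.

```swift
import Foundation

/// Where the active configuration came from.
enum ConfigSource: Equatable {
    case file(String)
    case builtInDefaults
}

/// Returns the first configuration source that exists, in precedence order:
/// --config path, ./mdkit-config.json, ~/.config/mdkit/config.json, built-in defaults.
func resolveConfigSource(
    cliPath: String?,
    homeDirectory: String = NSHomeDirectory(),
    fileExists: (String) -> Bool = { FileManager.default.fileExists(atPath: $0) }
) -> ConfigSource {
    let candidates = [
        cliPath,
        "./mdkit-config.json",
        homeDirectory + "/.config/mdkit/config.json",
    ].compactMap { $0 }
    for path in candidates where fileExists(path) {
        return .file(path)
    }
    return .builtInDefaults
}
```

For brevity, this sketch falls through to the next location when an explicit `--config` path is missing. A CLI would more likely report that as an error.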

๐Ÿ—๏ธ Architecture

Core Components

  • DocumentElement: Unified representation of all document elements
  • UnifiedDocumentProcessor: Collects and processes Vision framework output
  • HeaderFooterDetector: Intelligent header/footer detection and filtering
  • HeaderAndListDetector: Pattern-based header and list item detection
  • MarkdownGenerator: Generates properly structured markdown output
  • LLMProcessor: Local LLM integration for markdown optimization
  • FileManager: Centralized file management and logging

Dependency Injection

mdkit uses lightweight dependency injection for improved testability:

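protocol LLMClient {
    // Streams the model's output incrementally as it is generated.
    func textStream(from input: LLMInput) async throws -> AsyncThrowingStream<String, Error>
    // Returns the complete generated text in one call.
    func generateText(from input: LLMInput) async throws -> String
}

// LLMInput and LanguageDetecting are defined elsewhere (not shown).
class LLMProcessor {
    // Dependencies are injected through the initializer, so tests can pass mocks.
    let client: any LLMClient
    let languageDetector: any LanguageDetecting
    
    init(client: any LLMClient, languageDetector: any LanguageDetecting) {
        self.client = client
        self.languageDetector = languageDetector
    }
}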
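With the protocol in place, a test can inject a mock instead of a real llama.cpp-backed client. The sketch below redeclares minimal stand-ins so it compiles on its own. These are hypothetical placeholders, not the actual definitions in mdkit or LocalLLMClient:

- the `LLMInput` struct;
- the `LanguageDetecting` protocol with its `detectLanguage(in:)` method;
- the `optimize(_:)` method on `LLMProcessor`.

```swift
// Minimal stand-ins (hypothetical) so the example compiles in isolation.
struct LLMInput {
    var prompt: String
}

protocol LLMClient {
    func textStream(from input: LLMInput) async throws -> AsyncThrowingStream<String, Error>
    func generateText(from input: LLMInput) async throws -> String
}

protocol LanguageDetecting {
    func detectLanguage(in text: String) -> String
}

final class LLMProcessor {
    let client: any LLMClient
    let languageDetector: any LanguageDetecting

    init(client: any LLMClient, languageDetector: any LanguageDetecting) {
        self.client = client
        self.languageDetector = languageDetector
    }

    /// Hypothetical entry point: tags the prompt with the detected language and asks the model.
    func optimize(_ markdown: String) async throws -> String {
        let language = languageDetector.detectLanguage(in: markdown)
        return try await client.generateText(from: LLMInput(prompt: "[\(language)] \(markdown)"))
    }
}

// Test double: records every prompt it receives and returns a canned response.
final class MockLLMClient: LLMClient {
    var receivedPrompts: [String] = []
    let response: String
    init(response: String) { self.response = response }

    func textStream(from input: LLMInput) async throws -> AsyncThrowingStream<String, Error> {
        receivedPrompts.append(input.prompt)
        let response = self.response
        return AsyncThrowingStream { continuation in
            continuation.yield(response)
            continuation.finish()
        }
    }

    func generateText(from input: LLMInput) async throws -> String {
        receivedPrompts.append(input.prompt)
        return response
    }
}

// Test double: always reports the same language.
struct FixedLanguageDetector: LanguageDetecting {
    let language: String
    func detectLanguage(in text: String) -> String { language }
}
```

The mock lets a test check exactly which prompt reached the model without loading a GGUF file.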

🧪 Testing

Running Tests

# Run all tests
swift test

# Run specific test suite
swift test --filter CoreTests

# Run with verbose output
swift test --verbose

Test Coverage

  • Unit Tests: >90% code coverage target
  • Integration Tests: End-to-end workflow validation
  • Performance Tests: Memory usage and processing speed benchmarks
  • Mock Implementations: Easy testing of external dependencies

📚 Documentation

🤝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Ensure all tests pass
  6. Submit a pull request

Code Style

  • Follow Swift style guidelines
  • Use SwiftLint for code formatting
  • Write comprehensive tests
  • Document public APIs

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Apple Vision Framework: Advanced document analysis and OCR
  • LocalLLMClient: Local LLM integration capabilities
  • llama.cpp: Efficient local language model inference
  • Apple Natural Language Framework: Language detection and analysis

📞 Support

🔮 Roadmap

  • Phase 1: Foundation & Core Infrastructure
  • Phase 2: Document Processing Core
  • Phase 3: Header & Footer Detection
  • Phase 4: File Management & Logging
  • Phase 5: LLM Integration
  • Phase 6: Integration & Testing
  • Phase 7: Optimization & Polish

See our Implementation Plan for detailed progress and timeline.


Made with ❤️ for the open source community
