Intelligent PDF to Markdown conversion tool using Apple Vision Framework and local LLMs
mdkit converts PDFs to Markdown. It uses Apple's Vision framework for OCR and document structure analysis, and can optionally pass the result through a local Large Language Model (LLM) to improve the Markdown. It is designed for technical documents, academic papers, and other structured content where conversion quality matters.
- Apple Vision Framework Integration: Advanced OCR with document structure detection
- Position-Based Processing: Maintains logical document flow from top to bottom
- Duplicate Detection: Automatically identifies and resolves overlapping content
- Smart Element Recognition: Detects titles, headers, paragraphs, tables, lists, and barcodes
- Region-Based Detection: Precise header/footer detection using absolute coordinates
- Frequency Analysis: Identifies repetitive page elements across documents
- Configurable Thresholds: Adjustable detection parameters for different document types
- Multi-Region Support: Handles complex layouts with multiple header/footer areas
- Pattern Recognition: Automatically detects numbered, lettered, and named headers
- Smart Merging: Combines split headers and list items using OCR position data
- Level Calculation: Automatic header level detection and markdown hierarchy
- Nested List Support: Handles complex nested list structures with indentation
- llama.cpp Backend: Local processing with LocalLLMClientLlama
- Language Detection: Automatic document language detection using Apple's Natural Language framework
- Multi-Language Prompts: Support for English, Chinese, and other languages
- Configurable Prompts: Customizable system and user prompts with template placeholders
- Markdown Optimization: AI-powered structure improvement and formatting enhancement
- JSON Configuration: Comprehensive configuration system with no hardcoded values
- Environment Support: Development, production, and testing configurations
- Configuration Inheritance: Base configs with environment-specific overrides
- Validation: JSON schema validation and error checking
- Consistent Naming: Timestamped files with document hashes
- Organized Output: Separate directories for markdown, logs, and temporary files
- Comprehensive Logging: Detailed logs for every processing step
- Traceability: Link generated markdown to source OCR elements and LLM prompts
- Lightweight Dependency Injection: Easy testing with protocol-based interfaces
- Comprehensive Testing: Unit tests, integration tests, and performance benchmarks
- Mock Implementations: Simple mocking for external dependencies
- Quality Assurance: >90% test coverage target
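
As an illustration of the region-based header/footer detection listed above, here is a minimal sketch. It borrows the `headerRegionY`, `footerRegionY`, and `regionTolerance` keys from the configuration example in this README, but the types `TextElement`, `RegionConfig`, and `PageRegion`, the function `classify`, and the coordinate convention (points measured from the top of the page) are assumptions made for the example, not mdkit's actual API.

```swift
import Foundation

/// Hypothetical stand-in for an OCR text element; only the vertical position matters here.
struct TextElement {
    let text: String
    let y: Double  // assumed: distance from the top of the page, in points
}

/// Hypothetical mirror of the `regionBasedDetection` settings.
struct RegionConfig {
    var headerRegionY = 72.0
    var footerRegionY = 720.0
    var regionTolerance = 5.0
}

enum PageRegion { case header, body, footer }

/// Classifies an element by comparing its y position with the configured boundaries.
/// The tolerance widens both regions slightly to absorb OCR jitter.
func classify(_ element: TextElement, using config: RegionConfig) -> PageRegion {
    if element.y <= config.headerRegionY + config.regionTolerance { return .header }
    if element.y >= config.footerRegionY - config.regionTolerance { return .footer }
    return .body
}
```

Under these assumptions, an element at `y = 40` falls in the header region, one at `y = 400` in the body, and one at `y = 750` in the footer. Note that Vision reports normalized bounding boxes with a lower-left origin, so a real implementation would need to convert coordinates first.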
- macOS 13.0+ (Ventura)
- Xcode 15.0+
- Swift 5.9+
- Local LLM model (optional, for markdown optimization)
- Clone the repository
git clone --recursive https://github.com/alan-zhang-22/mdkit.git
cd mdkit
- Open in Xcode
open mdkit.xcodeproj
- Build and run
swift build
# Convert a PDF using default configuration
mdkit input.pdf
# Use custom configuration
mdkit --config my-config.json input.pdf
# Generate configuration template
mdkit --generate-config > template.json
# Validate configuration
mdkit --validate-config my-config.json
# Dry run (test without processing)
mdkit --dry-run input.pdf

mdkit uses a comprehensive JSON configuration system. Here's a basic example:
{
  "version": "1.0",
  "description": "mdkit PDF to Markdown conversion configuration",
  "headerFooterDetection": {
    "enabled": true,
    "regionBasedDetection": {
      "enabled": true,
      "headerRegionY": 72.0,
      "footerRegionY": 720.0,
      "regionTolerance": 5.0
    }
  },
  "llm": {
    "enabled": true,
    "backend": "LocalLLMClientLlama",
    "model": {
      "id": "llama-3.1-8b-instruct-q4_0",
      "localPath": "~/models/llama-3.1-8b-instruct-q4_0.gguf"
    },
    "parameters": {
      "temperature": 0.1,
      "context": 4096,
      "threads": 8
    }
  }
}

- Command-line specified path (--config)
- Project-specific config (./mdkit-config.json)
- User config (~/.config/mdkit/config.json)
- Built-in defaults
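
Assuming the list above gives the lookup order from highest to lowest precedence, the resolution could be sketched like this. The function name `resolveConfigPath` and the injectable `fileExists` closure are hypothetical; a `nil` result stands for "fall back to built-in defaults".

```swift
import Foundation

/// Returns the first configuration path that exists, or nil to signal
/// that the built-in defaults should be used. Hypothetical helper, not mdkit's API.
func resolveConfigPath(
    cliPath: String?,
    fileExists: (String) -> Bool = { FileManager.default.fileExists(atPath: $0) }
) -> String? {
    // Expand "~" so the user-level path can be checked on disk.
    let userConfig = NSString(string: "~/.config/mdkit/config.json").expandingTildeInPath
    // Candidates in assumed precedence order; a missing --config value is skipped.
    let candidates = [cliPath, "./mdkit-config.json", userConfig].compactMap { $0 }
    return candidates.first(where: fileExists)
}
```

Passing the existence check as a closure keeps the function testable without touching the file system. A real tool would probably report an error when an explicit `--config` path does not exist instead of silently falling through, as this sketch does.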
- DocumentElement: Unified representation of all document elements
- UnifiedDocumentProcessor: Collects and processes Vision framework output
- HeaderFooterDetector: Intelligent header/footer detection and filtering
- HeaderAndListDetector: Pattern-based header and list item detection
- MarkdownGenerator: Generates properly structured markdown output
- LLMProcessor: Local LLM integration for markdown optimization
- FileManager: Centralized file management and logging
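
To make the pattern-based header detection more concrete, here is a rough sketch of how a header level might be derived from a numbered header such as "2.3.1 Results". The functions `headerLevel` and `markdownHeader` are hypothetical, and the sketch covers only numbered headers, not the lettered or named ones mdkit also recognizes.

```swift
import Foundation

/// Returns a markdown header level for a line that starts with dotted numbering
/// (e.g. "1.2.3 Methods" gives 3), or nil if the line has no such prefix.
func headerLevel(for line: String, maxLevel: Int = 6) -> Int? {
    let trimmed = line.trimmingCharacters(in: .whitespaces)
    guard let token = trimmed.split(separator: " ").first else { return nil }
    // "1.2.3" -> ["1", "2", "3"]; a trailing dot as in "1." is tolerated.
    let parts = token.split(separator: ".")
    let allNumeric = parts.allSatisfy { part in
        part.allSatisfy { ch in ch.isASCII && ch.isNumber }
    }
    guard !parts.isEmpty, allNumeric else { return nil }
    return min(parts.count, maxLevel)
}

/// Prefixes a numbered header with the matching number of "#" characters.
func markdownHeader(for line: String) -> String {
    guard let level = headerLevel(for: line) else { return line }
    return String(repeating: "#", count: level) + " " + line.trimmingCharacters(in: .whitespaces)
}
```

This heuristic treats any line that begins with a number as a header, so a sentence like "2024 was a good year" would be misclassified. A real detector would combine the pattern with position and font-size cues from the OCR output.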
mdkit uses lightweight dependency injection for improved testability:
protocol LLMClient {
    func textStream(from input: LLMInput) async throws -> AsyncThrowingStream<String, Error>
    func generateText(from input: LLMInput) async throws -> String
}

class LLMProcessor {
    let client: any LLMClient
    let languageDetector: any LanguageDetecting

    init(client: any LLMClient, languageDetector: any LanguageDetecting) {
        self.client = client
        self.languageDetector = languageDetector
    }
}

# Run all tests
swift test
# Run specific test suite
swift test --filter CoreTests
# Run with verbose output
swift test --verbose

- Unit Tests: >90% code coverage target
- Integration Tests: End-to-end workflow validation
- Performance Tests: Memory usage and processing speed benchmarks
- Mock Implementations: Easy testing of external dependencies
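
A mock for the `LLMClient` protocol shown earlier might look like the sketch below. The protocol is repeated so the example is self-contained. `MockLLMClient` is a hypothetical name, and the `LLMInput` struct is a placeholder for the real input type from the LLM client library, reduced to a single `prompt` field for illustration.

```swift
// Placeholder for the real input type; assumed to carry at least a prompt string.
struct LLMInput {
    let prompt: String
}

protocol LLMClient {
    func textStream(from input: LLMInput) async throws -> AsyncThrowingStream<String, Error>
    func generateText(from input: LLMInput) async throws -> String
}

/// Test double that returns a canned response and records the prompts it received.
final class MockLLMClient: LLMClient {
    let cannedResponse: String
    private(set) var receivedPrompts: [String] = []

    init(cannedResponse: String) {
        self.cannedResponse = cannedResponse
    }

    func generateText(from input: LLMInput) async throws -> String {
        receivedPrompts.append(input.prompt)
        return cannedResponse
    }

    func textStream(from input: LLMInput) async throws -> AsyncThrowingStream<String, Error> {
        receivedPrompts.append(input.prompt)
        let response = cannedResponse
        // Emit the whole response as a single chunk, then finish the stream.
        return AsyncThrowingStream { continuation in
            continuation.yield(response)
            continuation.finish()
        }
    }
}
```

A test can then inject `MockLLMClient(cannedResponse: "# Title")` wherever an `LLMClient` is expected and assert on `receivedPrompts`, without loading a model.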
- Implementation Plan: Detailed development roadmap
- Complete Implementation Guide: Comprehensive implementation status, architecture, and roadmap
We welcome contributions! Please see our Contributing Guidelines for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
- Follow Swift style guidelines
- Use SwiftLint for code formatting
- Write comprehensive tests
- Document public APIs
This project is licensed under the MIT License - see the LICENSE file for details.
- Apple Vision Framework: Advanced document analysis and OCR
- LocalLLMClient: Local LLM integration capabilities
- llama.cpp: Efficient local language model inference
- Apple Natural Language Framework: Language detection and analysis
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Wiki: Project Wiki
- Phase 1: Foundation & Core Infrastructure
- Phase 2: Document Processing Core
- Phase 3: Header & Footer Detection
- Phase 4: File Management & Logging
- Phase 5: LLM Integration
- Phase 6: Integration & Testing
- Phase 7: Optimization & Polish
See our Implementation Plan for detailed progress and timeline.
Made with ❤️ for the open source community