Medical records reconciliation tool for comparing indexed documents and identifying discrepancies.
Auditor helps you compare two indexed medical record PDFs (yours vs vendor's) to identify:
- Records present in only one document
- Records with matching dates but different page counts
- Content similarity scores via OCR
Requirements:
- Ruby 3.x
- Python 3.x with PyMuPDF (pip install PyMuPDF)
- Tesseract OCR (brew install tesseract)
- mutool (brew install mupdf-tools)
- pandoc (brew install pandoc)
For PDF conversion (optional):
- Microsoft Word for Mac, OR
- docx2pdf (pip install docx2pdf)
Run the pipeline on both documents:

    bin/run_pipeline.rb /path/to/your_indexed.pdf /path/to/their_indexed.pdf

The case name is detected automatically from the filename.
To set the case name explicitly:

    bin/run_pipeline.rb --case "Reyes_Isidro" /path/to/your_indexed.pdf /path/to/their_indexed.pdf

Each case is stored in cases/<case_name>/:
cases/Reyes_Isidro/
├── mappings/ # Logical → physical page mappings
│ ├── your_doc_hyperlink_mapping.json
│ └── their_doc_hyperlink_mapping.json
├── reports/ # Output reports
│ ├── reconciliation_data.json
│ └── discrepancy_report.csv
└── ocr_cache/ # Cached OCR results
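The layout above can be created in a few lines. This is a minimal sketch, not the pipeline's own code: the function name `ensure_case_dirs` and the `root` argument are hypothetical, and only the three subdirectory names come from the tree shown.

```python
from pathlib import Path

# Subdirectory names taken from the case layout shown above.
CASE_SUBDIRS = ("mappings", "reports", "ocr_cache")


def ensure_case_dirs(root: Path, case_name: str) -> Path:
    """Create root/cases/<case_name>/ and its subdirectories if missing.

    Hypothetical helper for illustration; the real pipeline may set up
    case directories differently.
    """
    case_dir = root / "cases" / case_name
    for sub in CASE_SUBDIRS:
        # parents=True also creates cases/ and the case directory itself.
        (case_dir / sub).mkdir(parents=True, exist_ok=True)
    return case_dir
```

Calling `ensure_case_dirs(Path("."), "Reyes_Isidro")` twice is safe, since existing directories are left alone.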
Pipeline phases:
- Phase 0: Build page mappings from TOC hyperlinks
- Phase 1: (Skipped) Compare document content
- Phase 2: Reconcile table of contents entries
- Phase 3: Convert DOCX to PDF if needed
- Phase 4: Match pages via OCR content similarity
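To illustrate what Phase 4 might look like, here is a rough sketch of scoring two pages by the similarity of their OCR text. It assumes the text has already been extracted, and it uses Python's standard `difflib` ratio as a stand-in metric; the actual page_matcher.rb may use a different measure. The function names and the 0.6 threshold are assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so OCR layout noise matters less."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(page_text: str, candidates: dict, threshold: float = 0.6):
    """Find the candidate page whose OCR text is most similar.

    candidates maps page number -> OCR text. Returns (page, score), or
    (None, best_score) when nothing reaches the (assumed) threshold.
    """
    best_page, best_score = None, 0.0
    for page, text in candidates.items():
        score = similarity(page_text, text)
        if score > best_score:
            best_page, best_score = page, score
    if best_score < threshold:
        return None, best_score
    return best_page, best_score
```

A character-level ratio is simple but slow on long pages; a token-based measure may be a better fit for full-page OCR output.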
reconciliation_data.json holds intermediate data:
- yours_only: Dates/pages in your TOC only
- theirs_only: Dates/pages in their TOC only
- same_dates: Dates appearing in both TOCs
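The three groups above amount to set operations on the dates in each TOC. The sketch below assumes each TOC has been reduced to a mapping from date string to page count; that input shape, the function name `reconcile`, and the per-entry fields under `same_dates` are assumptions for illustration, not the format simple_reconcile.rb actually writes.

```python
def reconcile(yours: dict, theirs: dict) -> dict:
    """Split TOC dates into yours_only, theirs_only and same_dates.

    yours / theirs map a date string to a page count (assumed shape).
    """
    your_dates, their_dates = set(yours), set(theirs)
    shared = sorted(your_dates & their_dates)
    return {
        "yours_only": sorted(your_dates - their_dates),
        "theirs_only": sorted(their_dates - your_dates),
        "same_dates": [
            {
                "date": d,
                "your_pages": yours[d],
                "their_pages": theirs[d],
                # Flags matching dates whose page counts disagree.
                "page_count_differs": yours[d] != theirs[d],
            }
            for d in shared
        ],
    }
```

Using ISO-style date strings (YYYY-MM-DD) keeps the sorted output in chronological order.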
discrepancy_report.csv is the final report, with:
- Status (YOURS ONLY, THEIRS ONLY, SAME DATE)
- Page numbers (logical pages from TOCs)
- Match confidence scores
- Recommended actions
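As a sketch of how rows with these fields could be written out as CSV, the snippet below uses Python's standard `csv` module. The column names and the function `write_report` are assumptions chosen to mirror the list above; the real discrepancy_report.csv may use different headers.

```python
import csv

# Hypothetical column names mirroring the report fields listed above.
FIELDS = ["status", "date", "your_page", "their_page", "confidence", "action"]


def write_report(rows, out) -> None:
    """Write report rows (dicts) to an open text file or buffer.

    Keys missing from a row are written as empty cells, which suits
    YOURS ONLY / THEIRS ONLY rows that have a page on one side only.
    """
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
```

When writing to a real file, open it with `newline=""` as the `csv` documentation recommends, so line endings are not doubled on some platforms.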
auditor/
├── bin/ # Executable scripts
│ ├── run_pipeline.rb # Main orchestrator
│ ├── simple_reconcile.rb # TOC comparison
│ ├── page_matcher.rb # Page content matching
│ ├── build_complete_mappings.py
│ └── extract_hyperlinks.py
└── cases/ # Case data (gitignored)
- Make changes in bin/ scripts
- Test with a real case
- Commit only code changes (cases/ is gitignored)
Internal use only.