feat: replace mtime-based conversion with page-hashing change detection #43

Merged
jo-minjun merged 1 commit into main from feat/page-hashing
Feb 13, 2026

@jo-minjun jo-minjun commented Feb 13, 2026

Owner

Summary

  • Replace mtime-based conversion skipping with a two-stage hash comparison (fileHash → pageHashes)
  • Introduce incremental updates that re-run OCR only on changed pages
  • Extract duplicated code (mergeOcrResults, parsePageMarkers, sha1Hex)
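
A minimal sketch of the two-stage check described above, assuming hypothetical names (`SavedMetadata`, `CurrentState`, `decide`) that are not taken from the PR's code. The file hash is compared first, and page hashes are only computed when it differs. Where the parser check sits relative to the file-hash check is also an assumption.

```typescript
// Hypothetical shapes: field names follow the PR summary, the real types may differ.
interface PageHash {
  id: string;
  hash: string;
}

interface SavedMetadata {
  parser: string;
  fileHash: string;
  pageHashes: PageHash[];
}

interface CurrentState {
  parser: string;
  fileHash: string;
  // Lazy: page hashes require parsing the file, so they are only computed on demand.
  pageHashes: () => PageHash[];
}

type Decision =
  | { kind: "skip" }
  | { kind: "full" }
  | { kind: "incremental"; changedPageIds: string[] };

function decide(saved: SavedMetadata | undefined, current: CurrentState): Decision {
  // No previous conversion, or a different parser: regenerate everything.
  if (!saved || saved.parser !== current.parser) return { kind: "full" };

  // Stage 1: identical file bytes mean nothing to do.
  if (saved.fileHash === current.fileHash) return { kind: "skip" };

  // Stage 2: compare per-page hashes to find pages that need OCR again.
  const savedById = new Map(saved.pageHashes.map((p) => [p.id, p.hash] as const));
  const changedPageIds = current
    .pageHashes()
    .filter((p) => savedById.get(p.id) !== p.hash)
    .map((p) => p.id);
  return { kind: "incremental", changedPageIds };
}
```

In this sketch, pages that are new count as changed because they have no saved hash, and removed pages simply do not appear in the current list.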

Related issue

N/A

Type of change

  • New feature
  • Refactoring (no functional changes)

Checklist

  • pnpm check passes (Biome lint and format)
  • pnpm typecheck passes
  • pnpm test passes
  • pnpm build succeeds
  • Updated versions.json and manifest.json (if version bump)
  • Updated CHANGELOG.md under [Unreleased]

🤖 Generated with Claude Code

@gemini-code-assist

Contributor

Summary of Changes

Hello @jo-minjun, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the change detection mechanism from simple modification time checks to a sophisticated content-based hashing approach. This change enables efficient incremental updates, particularly for OCR processing, by identifying and re-processing only the parts of a file that have genuinely changed. The refactoring of core utilities and the updated metadata structure lay the groundwork for more reliable and performant file conversions within the system.

Highlights

  • Enhanced Change Detection: Replaced the previous modification time (mtime) based conversion skipping logic with a more robust 2-stage hash comparison system. This now uses a file-level hash and per-page hashes to accurately detect changes.
  • Incremental OCR Updates: Introduced incremental OCR processing, where only pages that have changed or been newly added will trigger OCR re-processing. Unchanged pages will reuse their existing OCR text, significantly improving efficiency.
  • Refactored Core Logic: Extracted common code for merging OCR results (mergeOcrResults), parsing page markers (parsePageMarkers), and SHA-1 hashing (sha1Hex) into the @petrify/core package for better reusability and maintainability.
  • Updated Metadata Structure: The ConversionMetadataPort and frontmatter format now store the parser ID, file hash, and an array of per-page hashes, providing more detailed and accurate metadata for change tracking.
  • Standardized OCR Page Markers: OCR markers in generated files now use a stable page ID format (<!-- page: {id} -->) to ensure consistent identification across potential page reordering or structural changes.
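
The marker format and OCR reuse described in the highlights above can be illustrated roughly as follows. This is a hypothetical re-implementation for illustration only: `parsePageMarkersSketch` and `mergeOcrSketch` are assumed names and signatures, not the functions exported from @petrify/core.

```typescript
// Matches a line such as "<!-- page: abc123 -->" and captures the page ID.
const PAGE_MARKER = /^<!-- page: (.+?) -->$/;

// Split an OCR section into a map of page ID -> text that follows its marker.
function parsePageMarkersSketch(section: string): Map<string, string> {
  const byId = new Map<string, string>();
  let currentId: string | undefined;
  let buffer: string[] = [];

  const flush = (): void => {
    if (currentId !== undefined) byId.set(currentId, buffer.join("\n").trim());
  };

  for (const line of section.split("\n")) {
    const match = PAGE_MARKER.exec(line.trim());
    if (match) {
      flush();
      currentId = match[1];
      buffer = [];
    } else if (currentId !== undefined) {
      buffer.push(line);
    }
  }
  flush();
  return byId;
}

// Build the OCR list for the current page order: fresh text wins, otherwise
// the existing text is reused. Pages absent from pageOrder are dropped.
function mergeOcrSketch(
  pageOrder: string[],
  existing: Map<string, string>,
  updated: Map<string, string>,
): { pageId: string; text: string }[] {
  return pageOrder.map((pageId) => ({
    pageId,
    text: updated.get(pageId) ?? existing.get(pageId) ?? "",
  }));
}
```

Keying on a stable page ID rather than a page index is what lets existing text survive reordering in this sketch: a lookup by ID does not care where the page now sits.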

Changelog

  • CHANGELOG.md
    • Added page-level hash-based change detection, replacing mtime-based skipping.
    • Added incrementalUpdate method to FileGeneratorPort for incremental output generation.
    • Updated ConversionMetadataPort to store parser, fileHash, and per-page hashes.
    • Updated OCR markers to use page ID format (<!-- page: {id} -->) for stable identification.
    • Changed PetrifyService.handleFileChange to use 2-stage hash comparison.
    • Modified OCR re-processing to only affect changed/added pages.
    • Removed mtime from FileChangeEvent.
    • Updated frontmatter format to store new metadata fields.
    • Removed mtime-based conversion skip logic from PetrifyService.
    • Removed duplicate sha1.ts from @petrify/generator-excalidraw.
  • CONTRIBUTING.md
    • Added PDF parser to the list of ParserPort implementations.
    • Updated the PetrifyService diagram to reflect the new hash-based change detection flow.
    • Added pdf/ directory to the packages/parser/ structure.
  • README.md
    • Updated the description of 'Duplicate prevention' to mention file and page content hashes instead of modification time.
  • packages/core/src/index.ts
    • Exported sha1Hex from hash.js.
    • Exported mergeOcrResults from ocr/merge.js.
    • Exported parsePageMarkers from ocr/page-marker-parser.js.
    • Exported diffPages and related types from page-diff.js.
    • Exported PageHash from ports/conversion-metadata.js.
    • Exported IncrementalInput and PageUpdate from ports/file-generator.js.
  • packages/core/src/ocr/merge.ts
    • Added mergeOcrResults function to combine existing and updated OCR texts for incremental updates.
  • packages/core/src/ocr/page-marker-parser.ts
    • Added parsePageMarkers function to extract OCR text based on page ID markers.
  • packages/core/src/page-diff.ts
    • Added diffPages function to compare current and saved page hashes and determine change types.
  • packages/core/src/petrify-service.ts
    • Imported sha1Hex and diffPages.
    • Updated handleFileChange to use fileHash and pageHashes for change detection instead of mtime.
    • Modified handleFileChange to perform full regeneration if the parser changes.
    • Updated handleFileChange to compute page hashes and pass them to metadata.
    • Refactored convertDroppedFile to compute file and page hashes for metadata.
    • Removed shouldSkipConversion method.
    • Added computePageHashes private method.
    • Refactored convertData to accept a Note object and an optional targetPageIds set for selective OCR.
    • Added runOcr private method to encapsulate OCR processing, allowing it to run only on specified pages.
  • packages/core/src/ports/conversion-metadata.ts
    • Added PageHash interface to define page ID and hash.
    • Updated ConversionMetadata interface to include parser, fileHash, and pageHashes instead of mtime.
  • packages/core/src/ports/file-generator.ts
    • Added pageId to OcrTextResult interface.
    • Added PageUpdate interface for incremental updates.
    • Added IncrementalInput interface for incremental update data.
    • Added incrementalUpdate method to FileGeneratorPort interface.
  • packages/core/src/ports/index.ts
    • Exported PageHash from conversion-metadata.js.
    • Exported IncrementalInput and PageUpdate from file-generator.js.
  • packages/core/src/ports/parser.ts
    • Added id property to ParserPort interface.
  • packages/core/src/ports/watcher.ts
    • Removed mtime property from FileChangeEvent interface.
  • packages/core/tests/hash.test.ts
    • Added tests for the sha1Hex utility function.
  • packages/core/tests/ocr-marker-parser.test.ts
    • Added tests for the parsePageMarkers function, ensuring correct parsing of OCR page markers.
  • packages/core/tests/ocr-merge.test.ts
    • Added tests for the mergeOcrResults function, covering various merging scenarios.
  • packages/core/tests/page-diff.test.ts
    • Added tests for the diffPages function, verifying different change detection types.
  • packages/core/tests/petrify-service.test.ts
    • Imported sha1Hex.
    • Added id to mock parser port.
    • Added incrementalUpdate to mock generator port.
    • Added helper functions createTestPage, createTestNote, and createMockOcrPort.
    • Updated handleFileChange tests to reflect new hash-based change detection and metadata structure.
    • Added tests for new file conversion, parser change leading to full regeneration, and incremental OCR updates.
    • Updated convertDroppedFile tests to verify new metadata fields.
    • Removed mtime from FileChangeEvent mocks.
    • Updated OCR result expectations to include pageId.
  • packages/generator/excalidraw/src/excalidraw-file-generator.ts
    • Imported mergeOcrResults and sha1Hex from @petrify/core.
    • Imported extractOcrByPageId.
    • Removed local sha1Hex import.
    • Implemented incrementalUpdate method using extractOcrByPageId and mergeOcrResults.
  • packages/generator/excalidraw/src/excalidraw-generator.ts
    • Imported sha1Hex from @petrify/core.
    • Removed local sha1Hex import.
  • packages/generator/excalidraw/src/index.ts
    • Exported extractOcrByPageId.
    • Removed export of local sha1Hex.
  • packages/generator/excalidraw/src/md-generator.ts
    • Updated OCR page marker format from <!-- Page {index} --> to <!-- page: {id} -->.
  • packages/generator/excalidraw/src/ocr-extractor.ts
    • Added extractOcrByPageId function to parse OCR text from Excalidraw Markdown content.
  • packages/generator/excalidraw/src/sha1.ts
    • Renamed to packages/core/src/hash.ts and moved SHA-1 hashing utility to the core package.
  • packages/generator/excalidraw/tests/excalidraw-file-generator.test.ts
    • Updated import path for sha1Hex to @petrify/core.
  • packages/generator/excalidraw/tests/incremental-update.test.ts
    • Added tests for ExcalidrawFileGenerator.incrementalUpdate, covering OCR reuse, page addition, and removal.
  • packages/generator/excalidraw/tests/md-generator.test.ts
    • Updated OCR text section tests to reflect the new <!-- page: {id} --> marker format.
  • packages/generator/excalidraw/tests/ocr-extractor.test.ts
    • Added tests for extractOcrByPageId function.
  • packages/generator/excalidraw/tests/sha1.test.ts
    • Removed SHA-1 tests as the utility was moved to @petrify/core.
  • packages/generator/markdown/src/index.ts
    • Exported extractOcrByPageId.
  • packages/generator/markdown/src/markdown-file-generator.ts
    • Imported mergeOcrResults from @petrify/core.
    • Imported extractOcrByPageId.
    • Updated OCR page marker format from <!-- Page {index} --> to <!-- page: {id} -->.
    • Implemented incrementalUpdate method using extractOcrByPageId and mergeOcrResults.
  • packages/generator/markdown/src/ocr-extractor.ts
    • Added extractOcrByPageId function to parse OCR text from Markdown content.
  • packages/generator/markdown/tests/incremental-update.test.ts
    • Added tests for MarkdownFileGenerator.incrementalUpdate, covering OCR reuse, page addition, and removal.
  • packages/generator/markdown/tests/markdown-file-generator.test.ts
    • Updated OCR text section tests to reflect the new <!-- page: {id} --> marker format and placement.
  • packages/generator/markdown/tests/ocr-extractor.test.ts
    • Added tests for extractOcrByPageId function.
  • packages/obsidian-plugin/src/frontmatter-metadata-adapter.ts
    • Updated ConversionMetadata interface to include parser, fileHash, and pageHashes.
    • Modified formatMetadata and getMetadata to handle the new metadata fields.
  • packages/obsidian-plugin/src/main.ts
    • Updated frontmatter handling to store parser and fileHash instead of mtime.
  • packages/obsidian-plugin/src/parser-registry.ts
    • Modified createParserMap to pass the ParserId as the id argument to parser constructors.
  • packages/obsidian-plugin/src/sync-orchestrator.ts
    • Removed mtime from FileChangeEvent creation, simplifying file event data.
  • packages/obsidian-plugin/src/utils/frontmatter.ts
    • Imported PageHash type.
    • Updated PetrifyFrontmatter interface to include parser, fileHash, and pageHashes.
    • Modified createFrontmatter to format and include page hashes.
    • Updated parseFrontmatter to extract new metadata fields and parse page hashes.
    • Added helper functions formatPageHashes, hasField, extractField, and parsePageHashes.
  • packages/obsidian-plugin/tests/conversion-saver.test.ts
    • Updated mock ConversionResult metadata to reflect the new parser, fileHash, and pageHashes fields.
  • packages/obsidian-plugin/tests/drop-handler.test.ts
    • Added id to mock parser creation.
    • Updated mock ConversionResult metadata to reflect the new parser, fileHash, and pageHashes fields.
  • packages/obsidian-plugin/tests/petrify-service.integration.test.ts
    • Imported IncrementalInput.
    • Updated FakeMetadata to use fileHash instead of mtime.
    • Added id to FakeParser.
    • Added incrementalUpdate method to FakeGenerator.
    • Removed mtime from createFileChangeEvent.
    • Updated integration tests to reflect hash-based skipping and metadata changes.
  • packages/obsidian-plugin/tests/process-file.test.ts
    • Removed mtime from createFileChangeEvent.
    • Added id to mock parser creation.
    • Updated mock ConversionResult metadata to reflect the new parser, fileHash, and pageHashes fields.
  • packages/obsidian-plugin/tests/sync-orchestrator.test.ts
    • Added incrementalUpdate to mock generator.
    • Added id to mock parser.
    • Removed mockFs.stat calls and mtime from FileChangeEvent in tests.
    • Updated mock ConversionResult metadata to reflect the new parser, fileHash, and pageHashes fields.
  • packages/obsidian-plugin/tests/utils/frontmatter.test.ts
    • Updated createFrontmatter tests to verify new metadata fields and page hash formatting.
    • Updated parseFrontmatter tests to verify parsing of new metadata fields and page hashes.
    • Updated updateKeepInContent tests to ensure preservation of new metadata fields.
  • packages/parser/pdf/src/pdf-parser.ts
    • Added id property to the constructor and class definition.
  • packages/parser/pdf/tests/pdf-parser.test.ts
    • Updated PdfParser instantiation in tests to include the id argument.
  • packages/parser/supernote-x/src/index.ts
    • Added id property to the constructor and class definition.
  • packages/parser/viwoods/src/index.ts
    • Added id property to the constructor and class definition.
  • packages/watcher/chokidar/src/chokidar-watcher.ts
    • Removed stats parameter and mtime property from handleFileEvent and FileChangeEvent creation.
  • packages/watcher/chokidar/tests/chokidar-watcher.test.ts
    • Removed mtime expectations and stats parameter from test emit calls.
  • packages/watcher/google-drive/src/google-drive-watcher.ts
    • Removed mtime property from FileChangeEvent creation.
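
The changelog mentions a diffPages function in packages/core/src/page-diff.ts that compares current and saved page hashes. Its actual return type is not shown here, so the following is only a sketch under assumptions: `diffPagesSketch`, `PageDiff`, and the four category names are hypothetical.

```typescript
interface PageHash {
  id: string;
  hash: string;
}

// Hypothetical result shape: page IDs grouped by what happened to them.
interface PageDiff {
  added: string[];
  changed: string[];
  removed: string[];
  unchanged: string[];
}

function diffPagesSketch(saved: PageHash[], current: PageHash[]): PageDiff {
  const savedById = new Map(saved.map((p) => [p.id, p.hash] as const));
  const currentIds = new Set(current.map((p) => p.id));
  const diff: PageDiff = { added: [], changed: [], removed: [], unchanged: [] };

  // Classify every page that exists now against its saved hash, if any.
  for (const page of current) {
    const previous = savedById.get(page.id);
    if (previous === undefined) diff.added.push(page.id);
    else if (previous !== page.hash) diff.changed.push(page.id);
    else diff.unchanged.push(page.id);
  }

  // Anything saved but no longer present was removed.
  for (const page of saved) {
    if (!currentIds.has(page.id)) diff.removed.push(page.id);
  }
  return diff;
}
```

Under this classification, only the `added` and `changed` IDs would need OCR, while `unchanged` pages could reuse their stored text.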

Activity

  • The pull request was generated using Claude Code, indicating AI assistance in its creation.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This PR is a significant and well-executed refactoring that replaces the mtime-based change detection logic with a more robust two-stage content hashing mechanism using file hashes and per-page hashes. This change enables efficient incremental updates, where only changed pages are re-processed by OCR, significantly improving performance. The introduction of new abstractions like diffPages, mergeOcrResults, parsePageMarkers is clean, and necessary updates to ports, adapters, and metadata formats have been handled consistently throughout the codebase. The test suite has also been thoroughly updated. However, medium-severity YAML injection vulnerabilities were identified in the frontmatter generation utility due to lack of proper escaping of user-influenced data (file paths and page IDs). Additionally, a performance optimization for crypto.subtle.digest was suggested to avoid unnecessary Uint8Array copies.
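
To illustrate the YAML injection concern, here is one possible mitigation, sketched under assumptions: `yamlQuote` and `frontmatterLine` are hypothetical helpers, not code from this PR, and the key name in the usage note is made up. The sketch relies on the fact that a JSON string literal is also a valid YAML double-quoted scalar, so `JSON.stringify` can serve as a simple escaper for values such as file paths and page IDs.

```typescript
// Emit the value as a double-quoted scalar. Quotes, backslashes, and control
// characters (including newlines) are escaped, so the value cannot end the
// current line and start a new key.
function yamlQuote(value: string): string {
  return JSON.stringify(value);
}

// Hypothetical helper that writes a single "key: value" frontmatter line.
function frontmatterLine(key: string, value: string): string {
  return `${key}: ${yamlQuote(value)}`;
}
```

For example, a path containing a line break followed by `keep: true` stays on one line as an escaped string, instead of becoming a second frontmatter entry.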

Comment thread packages/obsidian-plugin/src/utils/frontmatter.ts
Comment thread packages/obsidian-plugin/src/utils/frontmatter.ts
Comment thread packages/core/src/hash.ts
@jo-minjun jo-minjun merged commit d3c805a into main Feb 13, 2026
1 check passed
@jo-minjun jo-minjun deleted the feat/page-hashing branch February 13, 2026 07:27