feat: replace mtime-based conversion with page-hashing change detection #43
Summary of Changes (from Gemini Code Assist)
This pull request replaces modification-time checks with content-based hashing as the change detection mechanism. The new approach enables incremental updates, particularly for OCR processing, by identifying and re-processing only the parts of a file that have actually changed. The refactored core utilities and the updated metadata structure lay the groundwork for more reliable and faster file conversions within the system.
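To make the incremental idea concrete, here is a minimal sketch of hash-based change detection: compare a whole-file hash first, and only when it differs compare per-page hashes to find the pages that need re-processing. The names `PageDiff`, `diffPagesSketch`, and `detectChanges` are hypothetical and chosen for illustration; the PR's actual `diffPages` signature and metadata layout are not shown on this page and may differ.

```typescript
// Hypothetical result shape, for illustration only.
interface PageDiff {
  changed: number[];   // page indices that are new or whose hash differs
  unchanged: number[]; // page indices whose cached output could be reused
  removed: number[];   // page indices that existed before but no longer do
}

// Compare per-page hashes position by position (assumes pages are ordered).
function diffPagesSketch(
  previous: readonly string[],
  current: readonly string[],
): PageDiff {
  const diff: PageDiff = { changed: [], unchanged: [], removed: [] };
  current.forEach((hash, index) => {
    if (index < previous.length && previous[index] === hash) {
      diff.unchanged.push(index);
    } else {
      diff.changed.push(index);
    }
  });
  for (let index = current.length; index < previous.length; index++) {
    diff.removed.push(index);
  }
  return diff;
}

// Stage 1: an identical file hash means nothing changed, so skip all work.
// Stage 2: otherwise fall back to the per-page comparison.
function detectChanges(
  previousFileHash: string,
  currentFileHash: string,
  previousPages: readonly string[],
  currentPages: readonly string[],
): PageDiff | null {
  if (previousFileHash === currentFileHash) {
    return null;
  }
  return diffPagesSketch(previousPages, currentPages);
}
```

In this sketch only the indices in `changed` would be sent to OCR again, while `unchanged` pages keep their stored results. Matching by position is a simplifying assumption: inserting a page near the start would mark every later page as changed.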
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
d465f1e to 08f0592
Code Review
This PR is a significant and well-executed refactoring that replaces the mtime-based change detection logic with a more robust two-stage content hashing mechanism using file hashes and per-page hashes. This change enables efficient incremental updates, where only changed pages are re-processed by OCR, significantly improving performance. The introduction of new abstractions like diffPages, mergeOcrResults, parsePageMarkers is clean, and necessary updates to ports, adapters, and metadata formats have been handled consistently throughout the codebase. The test suite has also been thoroughly updated. However, medium-severity YAML injection vulnerabilities were identified in the frontmatter generation utility due to lack of proper escaping of user-influenced data (file paths and page IDs). Additionally, a performance optimization for crypto.subtle.digest was suggested to avoid unnecessary Uint8Array copies.
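The YAML injection concern can be illustrated with a small sketch. It assumes frontmatter is assembled by string concatenation and that the values (file paths, page IDs) may contain quotes, colons, or newlines. The helper names `yamlQuote` and `buildFrontmatter` are hypothetical; the PR's actual frontmatter utility is not shown on this page. The sketch relies on the fact that a JSON string literal is also a valid YAML 1.2 double-quoted scalar, so `JSON.stringify` can serve as the escaping step.

```typescript
// Hypothetical helper: emit a value as a YAML double-quoted scalar.
// JSON.stringify escapes quotes, backslashes, and control characters such as
// newlines, so the value cannot break out of its line.
function yamlQuote(value: string): string {
  return JSON.stringify(value);
}

// Hypothetical helper: keys are assumed to be fixed identifiers chosen by the
// code, so only the values are treated as untrusted and escaped.
function buildFrontmatter(fields: Record<string, string>): string {
  const lines = Object.keys(fields).map(
    (key) => `${key}: ${yamlQuote(fields[key])}`,
  );
  return ["---", ...lines, "---"].join("\n");
}
```

With this approach, a path such as `notes/a.pdf` followed by a newline and `injected: true` stays inside a single quoted value instead of producing an extra frontmatter key. A dedicated YAML serializer would be an alternative design if the project already depends on one.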
Summary
mergeOcrResults, parsePageMarkers, sha1Hex)
Related issue
N/A
Type of change
Checklist
pnpm check passes (Biome lint and format)
pnpm typecheck passes
pnpm test passes
pnpm build succeeds
versions.json and manifest.json (if version bump)
CHANGELOG.md under [Unreleased]
🤖 Generated with Claude Code