fix(mobile): pass PDF format into vectorization extractor #451
Open
codedogQBY wants to merge 1 commit into
Analysis
Issue #301 reports that small EPUB files can be vectorized, but searchable/text-layer PDFs cannot. The mobile vectorization flow extracts chapters through the hidden reader WebView before handing text to the core vectorization pipeline.
The manual mobile vectorization queue was always calling the extractor as if the file were an EPUB:
- useVectorizationQueue passed application/epub+zip for every book format.
- ExtractorWebView always sent fileName: "book.epub" to the reader asset.
That means a PDF selected for vectorization could enter the hidden reader/extractor with EPUB identity instead of PDF identity. The reader can rely on both MIME and filename extension for format-specific loading paths, so PDF chapter extraction could fail before vectorization ever receives text.
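The format routing described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names BookFormat, mimeTypeForFormat, and extractionFileName are hypothetical, and the fallback to an EPUB file name for unknown MIME types is an assumption made here to mirror the previous behavior.

```typescript
// Hypothetical sketch: these names are illustrative, not the PR's actual code.
type BookFormat = "epub" | "pdf";

// Standard MIME types for each supported format.
const MIME_BY_FORMAT: Record<BookFormat, string> = {
  epub: "application/epub+zip",
  pdf: "application/pdf",
};

// File extensions the hidden reader could use to pick a loading path.
const EXTENSION_BY_MIME = new Map<string, string>([
  ["application/epub+zip", "epub"],
  ["application/pdf", "pdf"],
]);

// Queue side: choose the MIME type from the book's format
// instead of hard-coding the EPUB MIME type.
function mimeTypeForFormat(format: BookFormat): string {
  return MIME_BY_FORMAT[format];
}

// Extractor side: derive the file name sent to the reader from the MIME type.
// Unknown MIME types fall back to "epub" (an assumption of this sketch).
function extractionFileName(mimeType: string): string {
  const extension = EXTENSION_BY_MIME.get(mimeType) ?? "epub";
  return `book.${extension}`;
}

const pdfMimeType = mimeTypeForFormat("pdf");
console.log(pdfMimeType, extractionFileName(pdfMimeType)); // application/pdf book.pdf
```

Keeping both lookups as small tables means the MIME type and the file name cannot drift apart for a given format, which is the mismatch the analysis identifies.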
Changes
- useVectorizationQueue: pass application/pdf for PDF books instead of the hard-coded EPUB MIME.
- ExtractorWebView: derive the file name sent to the reader from the MIME type, e.g. book.pdf for application/pdf.
Scope Notes
Desktop already has an explicit PDF text extraction path in packages/app/src/lib/rag/book-extractor.ts; this PR targets the mobile hidden-reader extraction path used by manual vectorization and fallback content extraction. The broader large-file/mobile memory constraints are covered by the separate large-file PR work, while this PR fixes the format-routing bug that blocks text-layer PDFs from reaching the vectorization pipeline.
Verification
- pnpm --filter @readany/app-expo exec tsc --noEmit
- pnpm exec biome check packages/app-expo/src/components/rag/ExtractorWebView.tsx packages/app-expo/src/screens/library/useVectorizationQueue.ts
- git diff --check
Fixes #301