fix(mobile): pass PDF format into vectorization extractor #451

Open: codedogQBY wants to merge 1 commit into main from codex/fix-mobile-pdf-vectorization-extraction

Analysis

Issue #301 reports that small EPUB files can be vectorized, but searchable/text-layer PDFs cannot. The mobile vectorization flow extracts chapters through the hidden reader WebView before handing text to the core vectorization pipeline.

The manual mobile vectorization queue was always calling the extractor as if the file were an EPUB:

  • useVectorizationQueue passed application/epub+zip for every book format.
  • ExtractorWebView always sent fileName: "book.epub" to the reader asset.

That means a PDF selected for vectorization could enter the hidden reader/extractor with EPUB identity instead of PDF identity. The reader can rely on both MIME and filename extension for format-specific loading paths, so PDF chapter extraction could fail before vectorization ever receives text.

Changes

  • Add mobile vectorization MIME mapping by book format in useVectorizationQueue.
  • Pass application/pdf for PDF books instead of hard-coded EPUB MIME.
  • Derive the hidden extractor filename from MIME in ExtractorWebView, e.g. book.pdf for application/pdf.
  • Keep existing EPUB behavior as the fallback for unknown formats.
  • Remove an existing non-null assertion on the queue while touching the file so the targeted Biome check stays clean.
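
The format routing described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `mimeForFormat`, `extractorFileName`, and the string-typed `format` argument are hypothetical names, and only the two MIME types and the `book.pdf` / `book.epub` filenames come from the description.

```typescript
// Hypothetical helpers; the real names and types in
// useVectorizationQueue / ExtractorWebView may differ.
const EPUB_MIME = "application/epub+zip";
const PDF_MIME = "application/pdf";

/** Map a book format to the MIME type handed to the hidden extractor. */
function mimeForFormat(format: string | undefined): string {
  switch (format?.toLowerCase()) {
    case "pdf":
      return PDF_MIME;
    default:
      // Unknown or missing formats keep the previous EPUB behavior.
      return EPUB_MIME;
  }
}

/** Derive the placeholder filename sent to the reader asset from the MIME type. */
function extractorFileName(mimeType: string): string {
  return mimeType === PDF_MIME ? "book.pdf" : "book.epub";
}
```

With helpers like these, a PDF book would reach the reader as `application/pdf` with `book.pdf`, so both signals agree, while anything unrecognized still falls back to the EPUB pair.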

Scope Notes

Desktop already has an explicit PDF text extraction path in packages/app/src/lib/rag/book-extractor.ts; this PR targets the mobile hidden-reader extraction path used by manual vectorization and fallback content extraction. The broader large-file/mobile memory constraints are covered by the separate large-file PR work, while this PR fixes the format-routing bug that blocks text-layer PDFs from reaching the vectorization pipeline.

Verification

  • pnpm --filter @readany/app-expo exec tsc --noEmit
  • pnpm exec biome check packages/app-expo/src/components/rag/ExtractorWebView.tsx packages/app-expo/src/screens/library/useVectorizationQueue.ts
  • git diff --check

Fixes #301

codedogQBY added the labels bug, priority:p1, area:ai, area:mobile, area:pdf on Jun 13, 2026

Development

Successfully merging this pull request may close these issues.

Bugs encountered during vectorization
