
Add lightweight PDF text extraction #99

Open
ttoyekk1a wants to merge 2 commits into nexu-io:main from ttoyekk1a:pdf-text-extraction

@ttoyekk1a

Contributor

Summary

  • Add local text-layer PDF extraction for uploaded PDF files
  • Convert extracted pages into markdown-style sections for the existing generation flow
  • Surface PDF support in upload hints and the formats gallery
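
The page-to-section conversion described above could look roughly like the sketch below. This is not the code from this PR: `pagesToMarkdown`, the `## Page N` heading format, and the empty-page placeholder are all hypothetical names and choices made for illustration.

```typescript
// Hypothetical helper: the name, heading format, and placeholder text are
// illustrative assumptions, not taken from next/src/lib/parsers/file.ts.
function pagesToMarkdown(pages: string[], title = "PDF document"): string {
  const sections = pages.map((raw, index) => {
    const body = raw
      .split(/\r?\n/)
      // Collapse runs of spaces/tabs and trim each line.
      .map((line) => line.replace(/[ \t]+/g, " ").trim())
      // Keep a blank line only when the previous line was not blank.
      .filter((line, i, all) => line !== "" || (i > 0 && all[i - 1] !== ""))
      .join("\n")
      .trim();
    const content = body === "" ? "_No extractable text on this page._" : body;
    return `## Page ${index + 1}\n\n${content}`;
  });
  // One top-level heading, then one second-level section per page.
  return [`# ${title}`, ...sections].join("\n\n");
}
```

Keeping one section per page means the downstream generation flow can still refer to page boundaries, and empty pages stay visible instead of silently disappearing.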

Notes

  • This does not add OCR or any external API dependency
  • Scanned or image-heavy PDFs are marked as limited text extraction
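
One way to decide when a PDF should be marked as limited text extraction is to count pages with almost no text. The sketch below is only an assumption about how such a heuristic might work; `assessTextCoverage`, `MIN_CHARS_PER_PAGE`, and `MAX_SPARSE_RATIO` are hypothetical, and the thresholds are not the values used in this PR.

```typescript
interface TextCoverage {
  totalPages: number;
  sparsePages: number;
  limited: boolean;
}

// Hypothetical thresholds, chosen for illustration only.
const MIN_CHARS_PER_PAGE = 40;
const MAX_SPARSE_RATIO = 0.5;

// Flags a document as "limited" when it has no pages, or when more than
// half of its pages carry fewer than MIN_CHARS_PER_PAGE non-whitespace
// characters (typical for scanned or image-heavy PDFs without a text layer).
function assessTextCoverage(pages: string[]): TextCoverage {
  const sparsePages = pages.filter(
    (page) => page.replace(/\s+/g, "").length < MIN_CHARS_PER_PAGE,
  ).length;
  const limited =
    pages.length === 0 || sparsePages / pages.length > MAX_SPARSE_RATIO;
  return { totalPages: pages.length, sparsePages, limited };
}
```

A ratio-based check avoids warning on a text-heavy report that merely contains a few full-page figures, while still catching fully scanned documents.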

Test Plan

  • corepack pnpm -F @html-anything/next test
  • corepack pnpm -F @html-anything/next typecheck
  • corepack pnpm -F @html-anything/next build

@lefarcen lefarcen requested a review from mrcfps May 31, 2026 08:26
@lefarcen lefarcen added the size/M (Medium change: 100-299 lines), risk/high (High-risk PR: dependencies, infra, security-sensitive, or broad runtime impact), and type/feature (Feature or new user-facing capability) labels May 31, 2026

@mrcfps mrcfps left a comment

@ttoyekk1a Thanks for adding local PDF extraction and the parser coverage — I re-ran the app test/typecheck/build flow on this head and left two follow-ups around client bundle loading and the scanned-PDF warning heuristic.

🔁 Powered by Looper · runner=reviewer · agent=opencode · An autonomous AI dev team for your GitHub repos.

Comment thread next/src/lib/parsers/file.ts Outdated
Comment thread next/src/lib/parsers/file.ts Outdated

@mrcfps mrcfps left a comment

@ttoyekk1a I re-reviewed the updated PDF extraction flow on c7ba812, including the lazy unpdf loading path, the limited-text warning heuristic, and the new parser coverage. corepack pnpm -F @html-anything/next test, corepack pnpm -F @html-anything/next typecheck, and corepack pnpm -F @html-anything/next build all passed for me on this head. Thanks for tightening up both follow-ups and landing the PDF support cleanly.
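
For readers unfamiliar with the lazy-loading pattern mentioned above, here is a minimal sketch of deferring a heavy parser module until the first PDF is actually uploaded. It is not the implementation in c7ba812: `lazyOnce` is a hypothetical helper, and the loader is passed in as a parameter so the sketch runs without any dependency installed.

```typescript
// Hypothetical helper: memoizes an async loader so the underlying module
// is requested at most once, and only when first needed.
function lazyOnce<T>(load: () => Promise<T>): () => Promise<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (cached === undefined) {
      cached = load().catch((error) => {
        // Clear the cache so a later call can retry after a failed load.
        cached = undefined;
        throw error;
      });
    }
    return cached;
  };
}

// Assumed usage in the app (not executed here, requires the dependency):
//   const loadUnpdf = lazyOnce(() => import("unpdf"));
//   const unpdf = await loadUnpdf();
```

Because the dynamic `import()` only runs inside the loader, a bundler can split the PDF library into its own chunk instead of shipping it in the initial client bundle.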

@lefarcen lefarcen requested a review from elihahah666 May 31, 2026 08:58
@ttoyekk1a

Contributor Author

Since we call this project 'HTML Anything', we should definitely consider PDF files. Here is the comparison.

Image

@lefarcen

lefarcen commented Jun 2, 2026

Hey @ttoyekk1a — this comparison is really helpful. It makes the user-facing scope much clearer: the change is not just the parser path in next/src/lib/parsers/file.ts, it also expands the attach surface in next/src/components/ai-prompt-bar.tsx and adds PDF to next/src/components/formats-gallery.tsx, so treating PDF as a first-class input here makes sense.

Thanks for adding the visual proof — that’s exactly the kind of context that helps judge the product-facing impact. ✨

@ttoyekk1a

Contributor Author

> Hey @ttoyekk1a — this comparison is really helpful. It makes the user-facing scope much clearer: the change is not just the parser path in next/src/lib/parsers/file.ts, it also expands the attach surface in next/src/components/ai-prompt-bar.tsx and adds PDF to next/src/components/formats-gallery.tsx, so treating PDF as a first-class input here makes sense.
>
> Thanks for adding the visual proof — that’s exactly the kind of context that helps judge the product-facing impact. ✨

Please consider merging this feature ASAP 🫪. Alternatively, we could look at other ways to handle PDFs, such as OCR or an external API (though those approaches might conflict with the zero-API-key policy).

@lefarcen

lefarcen commented Jun 7, 2026

Hey @ttoyekk1a — thanks for spelling the merge ask out so directly. I re-checked the current head, and the PDF slice still looks like the right first step here: local text-layer extraction keeps this aligned with the zero-API-key constraint, while OCR / external-API handling feels like a separate follow-up if we want to cover scanned PDFs differently.

From the current PR state, @mrcfps has already approved c7ba812, so the remaining maintainer-side step is the product sign-off on this user-facing input-surface addition rather than another parser change from your side. Once that lands, the PR should be in a good position to move forward. ❤️

3 participants