Add lightweight PDF text extraction by ttoyekk1a · Pull Request #99 · nexu-io/html-anything

ttoyekk1a · 2026-05-31T08:25:01Z

Summary

Add local text-layer PDF extraction for uploaded PDF files
Convert extracted pages into markdown-style sections for the existing generation flow
Surface PDF support in upload hints and the formats gallery
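
The second summary item, converting extracted pages into markdown-style sections, could look roughly like the sketch below. This is an illustration only: `PdfPage`, `pagesToMarkdown`, `normalizeWhitespace`, and the `## Page N` heading format are hypothetical names and choices, not taken from the PR's actual code in next/src/lib/parsers/file.ts.

```typescript
// Hypothetical shape for one extracted page; not the PR's actual type.
interface PdfPage {
  pageNumber: number; // 1-based page index
  text: string; // raw text-layer content for that page
}

// Collapse runs of spaces/tabs and excess blank lines, then trim.
function normalizeWhitespace(text: string): string {
  return text
    .replace(/[ \t]+/g, " ")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}

// Build one markdown-style document: a title line followed by one
// "## Page N" section per page that has any text left after cleanup.
// The heading format is an assumption made for this sketch.
function pagesToMarkdown(fileName: string, pages: PdfPage[]): string {
  const sections = pages
    .map((page) => ({ ...page, text: normalizeWhitespace(page.text) }))
    .filter((page) => page.text.length > 0)
    .map((page) => `## Page ${page.pageNumber}\n\n${page.text}`);
  return [`# ${fileName}`, ...sections].join("\n\n");
}
```

Pages with no text are skipped rather than emitted as empty sections, so the downstream generation flow only sees headings that carry content.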

Notes

This does not add OCR or any external API dependency
Scanned or image-heavy PDFs are marked as limited text extraction
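
One way to flag scanned or image-heavy PDFs without OCR is to look at how little text each page yields. The sketch below is a guess at such a heuristic, not the one in the PR: `assessTextLayer`, the 40-character per-page threshold, and the "more than half the pages are sparse" rule are all assumptions chosen for illustration.

```typescript
// Hypothetical result type for this sketch.
interface TextLayerAssessment {
  totalChars: number; // non-whitespace characters across all pages
  sparsePages: number; // pages below the per-page threshold
  limited: boolean; // true when extraction is probably incomplete
}

// Flag a PDF as "limited text extraction" when it has no pages or when
// more than half of its pages carry fewer than `minCharsPerPage`
// non-whitespace characters. Both thresholds are assumed values.
function assessTextLayer(
  pageTexts: string[],
  minCharsPerPage = 40,
): TextLayerAssessment {
  const charCounts = pageTexts.map((t) => t.replace(/\s+/g, "").length);
  const totalChars = charCounts.reduce((sum, n) => sum + n, 0);
  const sparsePages = charCounts.filter((n) => n < minCharsPerPage).length;
  const limited =
    pageTexts.length === 0 || sparsePages / pageTexts.length > 0.5;
  return { totalChars, sparsePages, limited };
}
```

A ratio-based rule like this tolerates a few blank or image-only pages in an otherwise text-rich document, while still warning on files that are mostly scans.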

Test Plan

corepack pnpm -F @html-anything/next test
corepack pnpm -F @html-anything/next typecheck
corepack pnpm -F @html-anything/next build

mrcfps

@ttoyekk1a Thanks for adding local PDF extraction and the parser coverage — I re-ran the app test/typecheck/build flow on this head and left two follow-ups around client bundle loading and the scanned-PDF warning heuristic.

_{🔁 Powered by Looper · runner=reviewer · agent=opencode · An autonomous AI dev team for your GitHub repos.}

mrcfps

@ttoyekk1a I re-reviewed the updated PDF extraction flow on c7ba812, including the lazy unpdf loading path, the limited-text warning heuristic, and the new parser coverage. corepack pnpm -F @html-anything/next test, corepack pnpm -F @html-anything/next typecheck, and corepack pnpm -F @html-anything/next build all passed for me on this head. Thanks for tightening up both follow-ups and landing the PDF support cleanly.

ttoyekk1a · 2026-06-02T15:37:07Z

Since we call this project 'HTML Anything', we should definitely consider PDF files. Here is the comparison.

lefarcen · 2026-06-02T17:26:25Z

Hey @ttoyekk1a — this comparison is really helpful. It makes the user-facing scope much clearer: the change is not just the parser path in next/src/lib/parsers/file.ts, it also expands the attach surface in next/src/components/ai-prompt-bar.tsx and adds PDF to next/src/components/formats-gallery.tsx, so treating PDF as a first-class input here makes sense.

Thanks for adding the visual proof — that’s exactly the kind of context that helps judge the product-facing impact. ✨

ttoyekk1a · 2026-06-07T07:23:16Z

Please consider merging this feature ASAP 🫪. Alternatively, we could consider other ways to handle PDFs, such as OCR or an external API (though these might conflict with the zero-API-key policy).

lefarcen · 2026-06-07T07:29:58Z

Hey @ttoyekk1a — thanks for spelling the merge ask out so directly. I re-checked the current head, and the PDF slice still looks like the right first step here: local text-layer extraction keeps this aligned with the zero-API-key constraint, while OCR / external-API handling feels like a separate follow-up if we want to cover scanned PDFs differently.

From the current PR state, @mrcfps has already approved c7ba812, so the remaining maintainer-side step is the product sign-off on this user-facing input-surface addition rather than another parser change from your side. Once that lands, the PR should be in a good position to move forward. ❤️

add lightweight PDF text extraction

9c625c7

lefarcen requested a review from mrcfps May 31, 2026 08:26

lefarcen added the size/M (Medium change: 100-299 lines), risk/high (High-risk PR: dependencies, infra, security-sensitive, or broad runtime impact), and type/feature (Feature or new user-facing capability) labels May 31, 2026

mrcfps reviewed May 31, 2026

Comment thread next/src/lib/parsers/file.ts Outdated

Comment thread next/src/lib/parsers/file.ts Outdated

refine PDF parser loading and warnings

c7ba812

mrcfps approved these changes May 31, 2026

lefarcen requested a review from elihahah666 May 31, 2026 08:58

ttoyekk1a wants to merge 2 commits into nexu-io:main from ttoyekk1a:pdf-text-extraction

3 participants
