Add lightweight PDF text extraction#99
mrcfps
left a comment
@ttoyekk1a Thanks for adding local PDF extraction and the parser coverage — I re-ran the app test/typecheck/build flow on this head and left two follow-ups around client bundle loading and the scanned-PDF warning heuristic.
mrcfps
left a comment
@ttoyekk1a I re-reviewed the updated PDF extraction flow on c7ba812, including the lazy unpdf loading path, the limited-text warning heuristic, and the new parser coverage. corepack pnpm -F @html-anything/next test, corepack pnpm -F @html-anything/next typecheck, and corepack pnpm -F @html-anything/next build all passed for me on this head. Thanks for tightening up both follow-ups and landing the PDF support cleanly.
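For readers following along, the limited-text warning heuristic mentioned above could look roughly like the sketch below. This is only an illustration under stated assumptions: the `ExtractionResult` shape, the `hasLimitedText` name, and the 40-characters-per-page threshold are hypothetical and are not taken from this PR.

```typescript
// Hypothetical sketch: the names and the threshold are assumptions,
// not the implementation that landed in this PR.
interface ExtractionResult {
  text: string;       // text-layer content returned by the extractor
  totalPages: number; // number of pages in the PDF
}

// Assumed cutoff: fewer visible characters per page than this suggests
// a scanned or image-only PDF with little or no text layer.
const MIN_CHARS_PER_PAGE = 40;

function hasLimitedText(
  result: ExtractionResult,
  minCharsPerPage: number = MIN_CHARS_PER_PAGE,
): boolean {
  // A document with no pages has nothing to extract, so warn.
  if (result.totalPages <= 0) return true;
  // Count only non-whitespace characters so blank padding does not mask an empty text layer.
  const visibleChars = result.text.replace(/\s+/g, "").length;
  return visibleChars / result.totalPages < minCharsPerPage;
}
```

A per-page average is used here so that a long document with a few image-only pages is not flagged, while a fully scanned PDF is; the real heuristic may weigh this differently.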
Hey @ttoyekk1a, this comparison is really helpful. It makes the user-facing scope much clearer: the change is not just the parser path in [...] Thanks for adding the visual proof; that's exactly the kind of context that helps judge the product-facing impact. ✨
Please consider merging this feature ASAP. Alternatively, we could look at other ways to handle PDFs, such as OCR or an external API (though those might conflict with the zero-API-key policy).
Hey @ttoyekk1a — thanks for spelling the merge ask out so directly. I re-checked the current head, and the PDF slice still looks like the right first step here: local text-layer extraction keeps this aligned with the zero-API-key constraint, while OCR / external-API handling feels like a separate follow-up if we want to cover scanned PDFs differently. From the current PR state, @mrcfps has already approved |
