feat: add PDF processing support and enhance document handling #29
harshpreet931 wants to merge 3 commits into
- Updated allowed document MIME types to include 'application/pdf'.
- Implemented PDF content extraction in a new module (pdf-parser.ts).
- Integrated PDF processing into the document extraction workflow.
- Enhanced error handling for PDF processing, including password protection.
- Added functions for normalizing and cleaning text extracted from PDFs.
- Implemented chunking of text for better handling of large documents.
- Introduced image extraction markers and descriptions for images in PDFs.
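For readers who want a concrete picture of the normalizing and chunking steps listed above, here is a minimal sketch. The function names, the whitespace rules, and the default chunk size are assumptions made for illustration; they are not taken from pdf-parser.ts.

```typescript
// Hypothetical helpers: names, rules, and defaults are illustrative only.

/** Clean up whitespace noise that PDF text extraction commonly produces. */
function normalizeText(raw: string): string {
  return raw
    .replace(/\r\n?/g, "\n") // unify line endings
    .replace(/[ \t]+/g, " ") // collapse runs of spaces and tabs
    .replace(/ ?\n ?/g, "\n") // drop spaces around line breaks
    .replace(/\n{3,}/g, "\n\n") // keep at most one blank line
    .trim();
}

/** Split text into chunks of at most maxChars, preferring paragraph boundaries. */
function chunkText(text: string, maxChars = 1000): string[] {
  if (maxChars <= 0) throw new RangeError("maxChars must be positive");
  const chunks: string[] = [];
  let current = "";
  const flush = (): void => {
    if (current) {
      chunks.push(current);
      current = "";
    }
  };
  for (const paragraph of text.split("\n\n")) {
    if (paragraph.length === 0) continue;
    if (paragraph.length > maxChars) {
      // A single oversized paragraph is hard-split at the size limit.
      flush();
      for (let i = 0; i < paragraph.length; i += maxChars) {
        chunks.push(paragraph.slice(i, i + maxChars));
      }
      continue;
    }
    const candidate = current ? current + "\n\n" + paragraph : paragraph;
    if (candidate.length > maxChars) {
      flush();
      current = paragraph;
    } else {
      current = candidate;
    }
  }
  flush();
  return chunks;
}
```

Splitting on paragraph boundaries first keeps related sentences together; the hard split is only a fallback for paragraphs that exceed the limit on their own.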
Pull Request Overview
This PR introduces PDF processing support to the document handling system, expanding the supported file formats from text, Office documents, CSV, JSON, and ZIP files to include PDFs.
- Adds comprehensive PDF text extraction and image processing capabilities using PDF.js
- Integrates PDF support into existing document processor and attachment validation
- Updates examples to demonstrate PDF processing functionality
Reviewed Changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/utils/pdf-parser.ts | Complete PDF processing implementation with text extraction, image processing, and chunking logic |
| src/utils/document-processor.ts | Integrates PDF extraction into main document processor and updates supported types |
| src/utils/attachments.ts | Adds PDF MIME type to allowed document types for attachment validation |
| package.json | Adds required dependencies for PDF processing (canvas, pdfjs-dist) |
| examples/attachment-demo-server.ts | Updates examples and documentation to showcase PDF processing capabilities |
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
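As a rough illustration of the attachment validation change in src/utils/attachments.ts, an allow-list check might look like the sketch below. The constant name, the function name, and every entry other than 'application/pdf' are assumptions; the real list in the repository may differ.

```typescript
// Hypothetical allow-list: only "application/pdf" is confirmed by this PR.
const ALLOWED_DOCUMENT_MIME_TYPES: readonly string[] = [
  "text/plain",
  "text/csv",
  "application/json",
  "application/zip",
  "application/pdf", // the type this PR adds
];

/** Return true when the MIME type (ignoring parameters and case) is allowed. */
function isAllowedDocumentType(mimeType: string): boolean {
  // Drop parameters such as "; charset=utf-8" before comparing.
  const essence = mimeType.split(";")[0].trim().toLowerCase();
  return ALLOWED_DOCUMENT_MIME_TYPES.indexOf(essence) !== -1;
}
```

Comparing only the type/subtype part avoids rejecting a valid PDF upload because a client appended parameters to the Content-Type value.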
- Add proper documentation for WASM path configuration
- Fix pages metadata to return actual page count from PDF document
- Improve placeholder image description function documentation
- Address Copilot code review suggestions for better clarity
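The page-count fix can be pictured with the sketch below. It is written against a small structural interface that mirrors the parts of the PDF.js document API used here (numPages, getPage with 1-based page numbers, and getTextContent returning items with a str field), so it does not import pdfjs-dist directly. The ExtractedPdf shape and the extractPdfText name are assumptions, not the PR's actual code.

```typescript
// Structural subset of the PDF.js API; a loaded PDF.js document satisfies it.
interface TextItemLike {
  str?: string; // marked-content items carry no text
}
interface PageLike {
  getTextContent(): Promise<{ items: TextItemLike[] }>;
}
interface PdfDocumentLike {
  numPages: number;
  getPage(pageNumber: number): Promise<PageLike>;
}

// Hypothetical result shape for this sketch.
interface ExtractedPdf {
  text: string;
  pages: number;
}

/** Extract text page by page and report the document's actual page count. */
async function extractPdfText(doc: PdfDocumentLike): Promise<ExtractedPdf> {
  const pageTexts: string[] = [];
  for (let n = 1; n <= doc.numPages; n++) {
    const page = await doc.getPage(n); // PDF.js pages are 1-indexed
    const content = await page.getTextContent();
    pageTexts.push(content.items.map((item) => item.str || "").join(" "));
  }
  // Take the count from the document itself, not from the number of
  // non-empty pages or a hard-coded value.
  return { text: pageTexts.join("\n\n"), pages: doc.numPages };
}
```

Typing the function against an interface instead of the library keeps the sketch testable with a fake document.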
Pull Request Overview
Copilot reviewed 7 out of 9 changed files in this pull request and generated 3 comments.
Files not reviewed (1)
- pnpm-lock.yaml: Language not supported
7fbbb18 to e05de49
This pull request adds support for PDF document attachments in the attachment demo server and updates the documentation and example commands accordingly. It also introduces new dependencies required for PDF processing.
PDF Support Enhancements:
- Updated the documentation in examples/attachment-demo-server.ts.
- Added curl commands demonstrating how to send PDF documents (both via URL and base64 data) to the server for analysis in examples/attachment-demo-server.ts.
Dependency Updates for PDF Handling:
- Added pdfjs-dist and canvas as new dependencies in package.json to enable PDF parsing and rendering.
- Updated pnpm-lock.yaml to include the new dependencies and their transitive packages, such as pdfjs-dist, canvas, and related native modules.
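The two ways of sending a PDF mentioned above (by URL and as base64 data) could be modeled on the client side roughly as follows. The payload shape and field names are hypothetical, since the demo server's request format is not shown in this summary; btoa is assumed to be available as a global, as it is in browsers and recent Node.js versions.

```typescript
// Hypothetical payload shapes: the demo server's real field names may differ.
type PdfAttachment =
  | { kind: "url"; mimeType: "application/pdf"; url: string }
  | { kind: "base64"; mimeType: "application/pdf"; data: string };

/** Reference a PDF that the server should fetch itself. */
function pdfFromUrl(url: string): PdfAttachment {
  return { kind: "url", mimeType: "application/pdf", url };
}

/** Inline a PDF's raw bytes as base64 text. */
function pdfFromBytes(bytes: Uint8Array): PdfAttachment {
  // Build a binary string one byte at a time, then base64-encode it.
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return { kind: "base64", mimeType: "application/pdf", data: btoa(binary) };
}
```

Every PDF file begins with the bytes "%PDF-", so base64-encoded PDF data starts with "JVBERi0", which is a handy sanity check when debugging requests.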