fix: improve error handling and status updates for document processing and add OCR requirement check for image files #1881

Open
ricofurtado wants to merge 2 commits into release-cpd from cpd-files-with-unsupported-extensions-are-incorrectly-displayed-as-Active-Completed-during-connector-ingestion


ricofurtado (Collaborator) commented Jun 15, 2026

Supported File Types by Ingestion Path

There are currently two different supported file type lists depending on how content is ingested into OpenRAG.

Direct Upload (Frontend File Picker)

Defined in knowledge-dropdown.tsx (lines 60–76).

| Category | Extensions |
| --- | --- |
| Documents | .txt, .md, .pdf, .docx, .html, .htm, .adoc, .asciidoc, .asc |
| Spreadsheets | .csv, .xlsx |
| Presentations | .pptx |

Connector Ingestion (Backend Validator)

Defined in processors.py (lines 645–668).

Includes all file types supported by Direct Upload, plus additional Office formats, XHTML, and image formats.

| Category | Extensions |
| --- | --- |
| Documents | .txt, .md, .pdf, .docx, .html, .htm, .adoc, .asciidoc, .asc |
| Additional Office Documents | .dotx, .dotm, .docm |
| Spreadsheets | .csv, .xlsx, .xls |
| Presentations | .pptx, .potx, .ppsx, .pptm, .potm, .ppsm |
| Web Documents | .xhtml |
| Images | .jpg, .jpeg, .png, .tiff, .bmp, .webp |
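To make the difference between the two lists concrete, here is a minimal sketch that encodes both tables as sets and computes the extensions accepted only by connector ingestion. The constant and function names (`DIRECT_UPLOAD_EXTENSIONS`, `CONNECTOR_EXTENSIONS`, `connector_only_extensions`) are hypothetical and chosen for illustration; they are not taken from knowledge-dropdown.tsx or processors.py.

```python
# Extension lists copied from the two tables in this description.
# All names here are illustrative, not the identifiers used in the codebase.
DIRECT_UPLOAD_EXTENSIONS = frozenset({
    ".txt", ".md", ".pdf", ".docx", ".html", ".htm",
    ".adoc", ".asciidoc", ".asc",          # documents
    ".csv", ".xlsx",                       # spreadsheets
    ".pptx",                               # presentations
})

CONNECTOR_EXTENSIONS = DIRECT_UPLOAD_EXTENSIONS | frozenset({
    ".dotx", ".dotm", ".docm",                      # additional Office documents
    ".xls",                                         # spreadsheets
    ".potx", ".ppsx", ".pptm", ".potm", ".ppsm",    # presentations
    ".xhtml",                                       # web documents
    ".jpg", ".jpeg", ".png", ".tiff", ".bmp", ".webp",  # images
})


def connector_only_extensions() -> list[str]:
    """Return extensions the connector path accepts but the file picker does not."""
    return sorted(CONNECTOR_EXTENSIONS - DIRECT_UPLOAD_EXTENSIONS)


if __name__ == "__main__":
    print(connector_only_extensions())
```

Every image extension appears in the result, which is the set of file types that can reach the pipeline only through a connector.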

Root Cause of the Reported Defect

The image formats (.jpg, .jpeg, .png, .tiff, .bmp, .webp) are the source of the reported issue.

What Happens

  1. The connector ingestion validator accepts image files based on extension.
  2. The files are sent through the ingestion pipeline.
  3. Images only produce extractable text when Docling OCR is enabled.
  4. When OCR is disabled, no text content is extracted.
  5. No chunks are generated for indexing.
  6. Before this fix, such files were still displayed as Active (Completed); the updated ingestion logic now marks them as Failed.
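The steps above can be sketched as a small decision function. This is an illustration of the intended behavior under stated assumptions, not the code in this PR: `IngestionOutcome`, `resolve_outcome`, the status strings, and the message wording are all hypothetical, and the real pipeline may expose chunk counts and the OCR setting differently.

```python
from dataclasses import dataclass
from pathlib import PurePath
from typing import Optional

# Image extensions accepted by the connector validator (from the table above).
IMAGE_EXTENSIONS = frozenset({".jpg", ".jpeg", ".png", ".tiff", ".bmp", ".webp"})


@dataclass(frozen=True)
class IngestionOutcome:
    """Hypothetical result record: a status plus an optional error message."""
    status: str
    error: Optional[str] = None


def resolve_outcome(filename: str, chunk_count: int, ocr_enabled: bool) -> IngestionOutcome:
    """Decide the final status from the number of chunks that were indexed."""
    if chunk_count > 0:
        # At least one chunk was indexed, so the file is searchable.
        return IngestionOutcome(status="active")

    extension = PurePath(filename).suffix.lower()
    if extension in IMAGE_EXTENSIONS and not ocr_enabled:
        # Images yield text only through OCR, so give specific guidance.
        return IngestionOutcome(
            status="failed",
            error=f"No text could be extracted from '{filename}'. "
                  "Image files require OCR to be enabled.",
        )

    # Any other file that produced no chunks is also a failure.
    return IngestionOutcome(
        status="failed",
        error=f"No content was indexed for '{filename}'.",
    )
```

For example, `resolve_outcome("scan.PNG", 0, ocr_enabled=False)` returns a failed outcome whose message mentions OCR, while `resolve_outcome("report.pdf", 3, ocr_enabled=False)` returns an active outcome with no error.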

Existing Frontend Limitation

The frontend intentionally excludes several file types that are accepted by connector ingestion.

A TODO comment in knowledge-dropdown.tsx (line 57) states:

> Re-add other MIME/extension groups (images, xlsx/xls/ppt, rtf/odt, etc.) once ingestion is verified end-to-end in Langflow.

This indicates that support for images and additional document types was intentionally withheld from the file picker until end-to-end ingestion behavior was fully validated.


Gap Between Frontend and Connector Validation

| Area | Images Allowed? |
| --- | --- |
| Frontend File Picker | ❌ No |
| Connector Ingestion Validator | ✅ Yes |

This inconsistency allows image files to be ingested through connectors while bypassing the frontend restrictions, creating the gap that exposed the bug.

Summary by CodeRabbit

Bug Fixes

  • Improved validation of document processing results to ensure failures are correctly detected and reported when content is not properly indexed
  • Enhanced error messages for image files to provide specific guidance when OCR processing is required for content extraction

coderabbitai Bot (Contributor) commented Jun 15, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.



@github-actions github-actions Bot added the backend 🔷 Issues related to backend services (OpenSearch, Langflow, APIs) label Jun 15, 2026
@ricofurtado ricofurtado requested a review from lucaseduoli June 15, 2026 19:32
@github-actions github-actions Bot added bug 🔴 Something isn't working. tests and removed bug 🔴 Something isn't working. labels Jun 15, 2026
@ricofurtado ricofurtado force-pushed the cpd-files-with-unsupported-extensions-are-incorrectly-displayed-as-Active-Completed-during-connector-ingestion branch from 5b36bb9 to 795337d Compare June 15, 2026 22:19
@github-actions github-actions Bot added bug 🔴 Something isn't working. and removed bug 🔴 Something isn't working. labels Jun 15, 2026

Labels

backend 🔷 Issues related to backend services (OpenSearch, Langflow, APIs) bug 🔴 Something isn't working. tests
