fix: error in skip duplicates functionality for folder ingest for connectors 75616(#1842) by ricofurtado · Pull Request #1941 · langflow-ai/openrag

ricofurtado · 2026-06-22T21:19:01Z

This pull request implements a more robust and accurate duplicate file detection and handling flow for connector-based uploads, with improvements across both the frontend and backend. The main changes include a new backend utility to classify duplicate and non-duplicate files with normalized metadata, enhanced frontend handling of duplicate counts and file lists, and improved metadata management during sync and ingestion.

Backend improvements to duplicate detection and file handling:

- Introduced `_classify_connector_duplicates` and related helpers in `src/api/connectors.py` to expand connector file selections, normalize metadata, and efficiently classify files as duplicates or non-duplicates, returning both file lists and counts for frontend use. This replaces the previous, less robust duplicate check logic.
- Updated the `/connector_sync` endpoint to use the new duplicate classification logic, skipping all duplicates when not overwriting and returning detailed status and counts to the frontend.
- Improved filename cleaning and aliasing in sync and metadata update logic to ensure consistent duplicate detection and metadata updates, including using the correct filename in all relevant places.
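
For readers who have not opened the diff, the classification step described above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/api/connectors.py`: the function names, the normalization rule (trim plus case-insensitive comparison), and every response key other than `duplicate_count` and `non_duplicate_files` are assumptions made for the example.

```python
from typing import Any


def normalize_filename(name: str) -> str:
    """Hypothetical normalization: trim whitespace, compare case-insensitively."""
    return name.strip().casefold()


def classify_duplicates(
    selected: list[dict[str, Any]], indexed_filenames: set[str]
) -> dict[str, Any]:
    """Split a selection into duplicates and non-duplicates, with counts."""
    # Normalize the already-indexed names once so each lookup is O(1).
    indexed = {normalize_filename(n) for n in indexed_filenames}
    duplicates: list[dict[str, Any]] = []
    non_duplicates: list[dict[str, Any]] = []
    for f in selected:
        if normalize_filename(f["name"]) in indexed:
            duplicates.append(f)
        else:
            non_duplicates.append(f)
    return {
        "duplicate_files": duplicates,  # assumed key name
        "non_duplicate_files": non_duplicates,
        "duplicate_count": len(duplicates),
        "non_duplicate_count": len(non_duplicates),  # assumed key name
    }
```

Returning both the lists and the counts in one response lets the caller show a total and decide what to sync without a second round trip.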

Frontend improvements to duplicate dialog and sync flow:

- Updated the duplicate check response handling in `frontend/app/upload/[provider]/page.tsx` to use the new backend response shape, accurately track and display duplicate counts and non-duplicate files, and pass these to the duplicate handling dialog.
- Enhanced the duplicate handling dialog component to reliably display the correct duplicate count and file names, using a more robust key for list rendering and supporting the new response shape.

Processor and ingestion pipeline fixes:

- Standardized the use of the cleaned filename throughout the ingestion pipeline, ensuring correct duplicate checks, file deletion, and metadata updates, and fixed the field used when deleting indexed chunks for removed files.

These changes together provide more accurate, user-friendly, and maintainable duplicate handling for connector uploads.

Summary by CodeRabbit

New Features
- Enhanced duplicate file detection when syncing specific files from connectors.
- Improved duplicate count reporting and handling during file sync operations.
Bug Fixes
- Fixed duplicate file classification to ensure accurate filtering during sync workflows.

coderabbitai · 2026-06-22T21:20:01Z

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 47.62% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title references issue `#1842` and mentions 'skip duplicates functionality for folder ingest for connectors', which aligns with the core changes across the codebase improving duplicate detection and handling for connector uploads.


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

frontend/app/upload/[provider]/page.tsx (1)
536-543: 🎯 Functional Correctness | 🔵 Trivial

Consider enriching nonDuplicateFiles with cached download URLs to avoid unnecessary Graph API calls during sync.

The backend's _connector_file_response() function (src/api/connectors.py:353) only includes downloadUrl and size if they exist in the source data. When nonDuplicateFiles is returned from the check-duplicates endpoint, these fields may be missing depending on what the connector's list_files() returns.

The frontend maps these files with missing fields (lines 461–462), and when passed back to the sync endpoint, the connector falls back to Graph API calls for SharePoint/OneDrive instead of using cached download URLs. This increases latency and API quota usage unnecessarily.

To optimize: either enrich non_duplicate_files on the backend with the original selection's cached URLs, or filter the original selectedFiles by non-duplicate IDs client-side before syncing.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/app/upload/[provider]/page.tsx` around lines 536 - 543, The
nonDuplicateFiles returned from the check-duplicates endpoint lacks the cached
downloadUrl and size fields, causing unnecessary Graph API calls during sync.
Instead of passing nonDuplicateFiles directly to setPendingSync, filter the
original selectedFiles array by matching file IDs against the non-duplicate file
IDs to preserve the cached download URLs. This way, the sync endpoint receives
complete file objects with all cached metadata, avoiding redundant API calls to
SharePoint/OneDrive.
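
The backend alternative mentioned in this comment (enriching `non_duplicate_files` with the original selection's cached fields) might look roughly like the sketch below. It assumes each file is a dict with an `id` key and that `downloadUrl` and `size` are the only fields worth carrying over; the helper name `enrich_with_cached_fields` is hypothetical and does not appear in the PR.

```python
from typing import Any

# Fields the review comment says may be missing from the classified files.
CACHED_FIELDS = ("downloadUrl", "size")


def enrich_with_cached_fields(
    non_duplicate_files: list[dict[str, Any]],
    selected_files: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Copy cached fields from the original selection onto matching files."""
    by_id = {f["id"]: f for f in selected_files if "id" in f}
    enriched = []
    for f in non_duplicate_files:
        merged = dict(f)  # shallow copy so the input list is not mutated
        original = by_id.get(f.get("id"), {})
        for field in CACHED_FIELDS:
            # Only fill gaps; never overwrite a value the backend already set.
            if field not in merged and field in original:
                merged[field] = original[field]
        enriched.append(merged)
    return enriched
```

Files with no match in the original selection pass through unchanged, so the connector can still fall back to its own lookup for them.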

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/api/connectors.py`:
- Around line 656-680: In the duplicate file checking logic within the connector
sync endpoint, the condition that checks `if not selected_files` is incorrectly
treating zero expanded files as if all files were duplicates. Instead of
returning the "no_files" status whenever selected_files is empty, first verify
that duplicate_check contains actual duplicates by checking if
duplicate_check["duplicate_count"] is greater than zero. Only return the "all
duplicates" response when there are confirmed duplicates; otherwise, allow the
sync to proceed with normal behavior rather than aborting with an incorrect
"already exist" message.
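
The guard this comment asks for can be illustrated with a small decision function. This is only a sketch of the suggested condition, not the endpoint code: the function name and the returned labels (`"sync"`, `"all_duplicates"`, `"proceed"`) are placeholders chosen for the example, and `duplicate_check` is assumed to be a dict carrying `duplicate_count`.

```python
from typing import Any


def resolve_sync_action(
    selected_files: list[dict[str, Any]], duplicate_check: dict[str, Any]
) -> str:
    """Decide how to respond once duplicates have been filtered out."""
    if selected_files:
        # Some non-duplicate files remain, so sync them.
        return "sync"
    if duplicate_check.get("duplicate_count", 0) > 0:
        # Nothing left because every file was a confirmed duplicate.
        return "all_duplicates"
    # Nothing left but no duplicates were found: do not report
    # "already exist"; let the normal sync path handle it.
    return "proceed"
```

The point of the second branch is that an empty selection alone is not evidence of duplicates; only a positive `duplicate_count` is.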




📥 Commits

Reviewing files that changed from the base of the PR and between d2a3403 and 00e9b76.

📒 Files selected for processing (7)

frontend/app/upload/[provider]/page.tsx
frontend/components/duplicate-handling-dialog.tsx
src/api/connectors.py
src/connectors/service.py
src/models/processors.py
tests/unit/connectors/test_connector_file_type_validation.py
tests/unit/test_connector_processor_filename_dedupe.py

mfortman11

LGTM

fix: enhance duplicate handling and indexing for connector file uploads

2cefbaf

github-actions bot added the frontend, backend, tests, and bug labels · Jun 22, 2026

style: ruff autofix (auto)

00e9b76