fix: error in skip duplicates functionality for folder ingest for connectors 75616(#1842) #1941

Merged
ricofurtado merged 4 commits into release-cpd-0.1 from fix-error-in-skip-duplicates-functionality-for-folder-ingest-for-connectors-75616 on Jun 23, 2026

Conversation

@ricofurtado ricofurtado commented Jun 22, 2026

Collaborator

This pull request implements a more robust and accurate duplicate file detection and handling flow for connector-based uploads, with improvements across both the frontend and backend. The main changes include a new backend utility to classify duplicate and non-duplicate files with normalized metadata, enhanced frontend handling of duplicate counts and file lists, and improved metadata management during sync and ingestion.

Backend improvements to duplicate detection and file handling:

  • Introduced _classify_connector_duplicates and related helpers in src/api/connectors.py to expand connector file selections, normalize metadata, and efficiently classify files as duplicates or non-duplicates, returning both file lists and counts for frontend use. This replaces the previous, less robust duplicate check logic.
  • Updated the /connector_sync endpoint to use the new duplicate classification logic, skipping all duplicates when not overwriting and returning detailed status and counts to the frontend.
  • Improved filename cleaning and aliasing in sync and metadata update logic to ensure consistent duplicate detection and metadata updates, including using the correct filename in all relevant places.
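
The classification step described in the bullets above can be pictured roughly as follows. This is a minimal sketch, not the code in src/api/connectors.py: the function name `classify_duplicates`, the `normalize_name` helper, its normalization rule (trim whitespace, case-fold), the `name` key, and the shape of the returned dict are all assumptions made for illustration.

```python
from typing import Any


def normalize_name(name: str) -> str:
    """Hypothetical normalization rule: trim whitespace and case-fold."""
    return name.strip().casefold()


def classify_duplicates(
    selected: list[dict[str, Any]], indexed_names: set[str]
) -> dict[str, Any]:
    """Split selected files into duplicates and non-duplicates.

    `indexed_names` stands in for whatever lookup the real backend runs
    against the index; here it is just a set of already-known filenames.
    """
    known = {normalize_name(n) for n in indexed_names}
    duplicates: list[dict[str, Any]] = []
    non_duplicates: list[dict[str, Any]] = []
    for file in selected:
        # Compare on the normalized name so cosmetic differences do not
        # cause a known file to be treated as new.
        if normalize_name(file["name"]) in known:
            duplicates.append(file)
        else:
            non_duplicates.append(file)
    # Return both the lists and the counts, as the description says the
    # frontend consumes both.
    return {
        "duplicate_files": duplicates,
        "non_duplicate_files": non_duplicates,
        "duplicate_count": len(duplicates),
        "non_duplicate_count": len(non_duplicates),
    }
```

Returning the counts alongside the lists lets a caller decide how to report the result without re-deriving the numbers.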

Frontend improvements to duplicate dialog and sync flow:

Processor and ingestion pipeline fixes:

  • Standardized the use of the cleaned filename throughout the ingestion pipeline, ensuring correct duplicate checks, file deletion, and metadata updates, and fixed the field used when deleting indexed chunks for removed files.

These changes together provide more accurate, user-friendly, and maintainable duplicate handling for connector uploads.

Summary by CodeRabbit

  • New Features

    • Enhanced duplicate file detection when syncing specific files from connectors.
    • Improved duplicate count reporting and handling during file sync operations.
  • Bug Fixes

    • Fixed duplicate file classification to ensure accurate filtering during sync workflows.

@github-actions github-actions Bot added frontend 🟨 Issues related to the UI/UX backend 🔷 Issues related to backend services (OpenSearch, Langflow, APIs) tests bug 🔴 Something isn't working. labels Jun 22, 2026
coderabbitai Bot commented Jun 22, 2026

Contributor
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 47.62% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title references issue #1842 and mentions 'skip duplicates functionality for folder ingest for connectors', which aligns with the core changes across the codebase improving duplicate detection and handling for connector uploads. |


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
frontend/app/upload/[provider]/page.tsx (1)

536-543: 🎯 Functional Correctness | 🔵 Trivial

Consider enriching nonDuplicateFiles with cached download URLs to avoid unnecessary Graph API calls during sync.

The backend's _connector_file_response() function (src/api/connectors.py:353) only includes downloadUrl and size if they exist in the source data. When nonDuplicateFiles is returned from the check-duplicates endpoint, these fields may be missing depending on what the connector's list_files() returns.

The frontend maps these files with missing fields (lines 461–462), and when passed back to the sync endpoint, the connector falls back to Graph API calls for SharePoint/OneDrive instead of using cached download URLs. This increases latency and API quota usage unnecessarily.

To optimize: either enrich non_duplicate_files on the backend with the original selection's cached URLs, or filter the original selectedFiles by non-duplicate IDs client-side before syncing.
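
The first option, enriching the non-duplicate list on the backend, might look something like the sketch below. It assumes each file dict carries an `id` key and that the cached fields are named `downloadUrl` and `size` as in the comment above; the helper name `enrich_with_cached_fields` is hypothetical and does not come from the PR.

```python
from typing import Any

# Field names taken from the review comment; treat them as assumptions
# about the payload shape rather than a confirmed schema.
CACHED_FIELDS = ("downloadUrl", "size")


def enrich_with_cached_fields(
    non_duplicate_files: list[dict[str, Any]],
    original_selection: list[dict[str, Any]],
) -> list[dict[str, Any]]:
    """Copy cached fields from the original selection onto matching files."""
    originals_by_id = {f["id"]: f for f in original_selection if "id" in f}
    enriched: list[dict[str, Any]] = []
    for file in non_duplicate_files:
        merged = dict(file)  # copy so the caller's input is not mutated
        original = originals_by_id.get(file.get("id"), {})
        for field in CACHED_FIELDS:
            # Only fill gaps; never overwrite a value already present.
            if field not in merged and field in original:
                merged[field] = original[field]
        enriched.append(merged)
    return enriched
```

Files with no match in the original selection pass through unchanged, so in this sketch the connector would still fall back to its own lookup for them.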

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@frontend/app/upload/[provider]/page.tsx` around lines 536 - 543, The
nonDuplicateFiles returned from the check-duplicates endpoint lacks the cached
downloadUrl and size fields, causing unnecessary Graph API calls during sync.
Instead of passing nonDuplicateFiles directly to setPendingSync, filter the
original selectedFiles array by matching file IDs against the non-duplicate file
IDs to preserve the cached download URLs. This way, the sync endpoint receives
complete file objects with all cached metadata, avoiding redundant API calls to
SharePoint/OneDrive.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/api/connectors.py`:
- Around line 656-680: In the duplicate file checking logic within the connector
sync endpoint, the condition that checks `if not selected_files` is incorrectly
treating zero expanded files as if all files were duplicates. Instead of
returning the "no_files" status whenever selected_files is empty, first verify
that duplicate_check contains actual duplicates by checking if
duplicate_check["duplicate_count"] is greater than zero. Only return the "all
duplicates" response when there are confirmed duplicates; otherwise, allow the
sync to proceed with normal behavior rather than aborting with an incorrect
"already exist" message.
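
The guard described in this inline comment could be sketched as below. It is only an illustration of the suggested condition: the helper name `early_sync_response` and the message text are hypothetical, while the `"no_files"` status and the `duplicate_count` key are taken from the comment itself.

```python
from typing import Any, Optional


def early_sync_response(
    selected_files: list[dict[str, Any]], duplicate_check: dict[str, Any]
) -> Optional[dict[str, Any]]:
    """Return an early response only when duplicates are confirmed.

    Returning None means "no early exit; continue with the normal sync".
    """
    if selected_files:
        return None  # there is still something to sync
    duplicate_count = duplicate_check.get("duplicate_count", 0)
    if duplicate_count > 0:
        # Nothing left to sync AND duplicates were actually found, so the
        # "already exist" answer is accurate.
        return {
            "status": "no_files",
            "message": "All selected files already exist",  # hypothetical text
            "duplicate_count": duplicate_count,
        }
    # Empty selection with zero duplicates: not an "all duplicates" case,
    # so do not abort with a misleading message.
    return None
```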

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d1681a31-9ce0-4ed4-b19e-66f7c90a1331

📥 Commits

Reviewing files that changed from the base of the PR and between d2a3403 and 00e9b76.

📒 Files selected for processing (7)
  • frontend/app/upload/[provider]/page.tsx
  • frontend/components/duplicate-handling-dialog.tsx
  • src/api/connectors.py
  • src/connectors/service.py
  • src/models/processors.py
  • tests/unit/connectors/test_connector_file_type_validation.py
  • tests/unit/test_connector_processor_filename_dedupe.py

Comment thread src/api/connectors.py Outdated
@ricofurtado ricofurtado changed the base branch from release-cpd to release-cpd-0.1 June 22, 2026 21:54
@ricofurtado ricofurtado enabled auto-merge (squash) June 22, 2026 22:03
@ricofurtado ricofurtado disabled auto-merge June 22, 2026 22:03
@ricofurtado ricofurtado changed the title from "fix: put correct file type in langflow-less ingestion (#1842)" to "fix: error in skip duplicates functionality for folder ingest for connectors 75616(#1842)" Jun 22, 2026

@mfortman11 mfortman11 left a comment

LGTM

@github-actions github-actions Bot added the lgtm label Jun 23, 2026
@ricofurtado ricofurtado merged commit effaf02 into release-cpd-0.1 Jun 23, 2026
16 of 17 checks passed
@github-actions github-actions Bot deleted the fix-error-in-skip-duplicates-functionality-for-folder-ingest-for-connectors-75616 branch June 23, 2026 12:56
@mpawlow mpawlow removed their request for review June 23, 2026 13:23
Labels

backend 🔷 Issues related to backend services (OpenSearch, Langflow, APIs) bug 🔴 Something isn't working. frontend 🟨 Issues related to the UI/UX lgtm tests

2 participants