fix: error in skip duplicates functionality for folder ingest for connectors 75616(#1842)#1941
Conversation
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
frontend/app/upload/[provider]/page.tsx (1)
536-543: 🎯 Functional Correctness | 🔵 TrivialConsider enriching
nonDuplicateFileswith cached download URLs to avoid unnecessary Graph API calls during sync.The backend's
_connector_file_response()function (src/api/connectors.py:353) only includesdownloadUrlandsizeif they exist in the source data. WhennonDuplicateFilesis returned from the check-duplicates endpoint, these fields may be missing depending on what the connector'slist_files()returns.The frontend maps these files with missing fields (lines 461–462), and when passed back to the sync endpoint, the connector falls back to Graph API calls for SharePoint/OneDrive instead of using cached download URLs. This increases latency and API quota usage unnecessarily.
To optimize: either enrich
non_duplicate_fileson the backend with the original selection's cached URLs, or filter the originalselectedFilesby non-duplicate IDs client-side before syncing.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@frontend/app/upload/`[provider]/page.tsx around lines 536 - 543, The nonDuplicateFiles returned from the check-duplicates endpoint lacks the cached downloadUrl and size fields, causing unnecessary Graph API calls during sync. Instead of passing nonDuplicateFiles directly to setPendingSync, filter the original selectedFiles array by matching file IDs against the non-duplicate file IDs to preserve the cached download URLs. This way, the sync endpoint receives complete file objects with all cached metadata, avoiding redundant API calls to SharePoint/OneDrive.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/api/connectors.py`:
- Around line 656-680: In the duplicate file checking logic within the connector
sync endpoint, the condition that checks `if not selected_files` is incorrectly
treating zero expanded files as if all files were duplicates. Instead of
returning the "no_files" status whenever selected_files is empty, first verify
that duplicate_check contains actual duplicates by checking if
duplicate_check["duplicate_count"] is greater than zero. Only return the "all
duplicates" response when there are confirmed duplicates; otherwise, allow the
sync to proceed with normal behavior rather than aborting with an incorrect
"already exist" message.
---
Nitpick comments:
In `@frontend/app/upload/`[provider]/page.tsx:
- Around line 536-543: The nonDuplicateFiles returned from the check-duplicates
endpoint lacks the cached downloadUrl and size fields, causing unnecessary Graph
API calls during sync. Instead of passing nonDuplicateFiles directly to
setPendingSync, filter the original selectedFiles array by matching file IDs
against the non-duplicate file IDs to preserve the cached download URLs. This
way, the sync endpoint receives complete file objects with all cached metadata,
avoiding redundant API calls to SharePoint/OneDrive.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d1681a31-9ce0-4ed4-b19e-66f7c90a1331
📒 Files selected for processing (7)
frontend/app/upload/[provider]/page.tsxfrontend/components/duplicate-handling-dialog.tsxsrc/api/connectors.pysrc/connectors/service.pysrc/models/processors.pytests/unit/connectors/test_connector_file_type_validation.pytests/unit/test_connector_processor_filename_dedupe.py
This pull request implements a more robust and accurate duplicate file detection and handling flow for connector-based uploads, with improvements across both the frontend and backend. The main changes include a new backend utility to classify duplicate and non-duplicate files with normalized metadata, enhanced frontend handling of duplicate counts and file lists, and improved metadata management during sync and ingestion.
Backend improvements to duplicate detection and file handling:
_classify_connector_duplicatesand related helpers insrc/api/connectors.pyto expand connector file selections, normalize metadata, and efficiently classify files as duplicates or non-duplicates, returning both file lists and counts for frontend use. This replaces the previous, less robust duplicate check logic. [1] [2]/connector_syncendpoint to use the new duplicate classification logic, skipping all duplicates when not overwriting and returning detailed status and counts to the frontend.Frontend improvements to duplicate dialog and sync flow:
frontend/app/upload/[provider]/page.tsxto use the new backend response shape, accurately track and display duplicate counts and non-duplicate files, and pass these to the duplicate handling dialog. (frontend/app/upload/[provider]/page.tsxL32-R41, frontend/app/upload/[provider]/page.tsxR427, frontend/app/upload/[provider]/page.tsxL517-R543, frontend/app/upload/[provider]/page.tsxL562-R576, frontend/app/upload/[provider]/page.tsxR800)Processor and ingestion pipeline fixes:
These changes together provide more accurate, user-friendly, and maintainable duplicate handling for connector uploads.
Summary by CodeRabbit
New Features
Bug Fixes