Conversation
- Add Example 9 to problem doc: demonstrates why blobFilesToDelete must be in the plan (writer-cleaner conflict resolution for external blobs) - Use engine-context HoodieData instead of HoodieListData.eager() for Stage 2 MDT lookups to distribute large candidate sets across executors - Update memory budget to reflect distributed Stage 2 processing
|
Warning Rate limit exceeded
Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 9 minutes and 20 seconds. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Greptile SummaryThis PR adds two new RFC documents for RFC-100 Part 2, describing the design for cleaning orphaned external blob files when Hudi expires old file slices. The two files are:
The design is thorough and well-structured. The two-stage pipeline (per-FG set-difference → cross-FG MDT index verification) is a sound approach to the cross-file-group sharing problem. Several issues worth addressing:
Confidence Score: 4/5Safe to merge as an RFC draft after addressing the two P1 design concerns; no production code is changed. Both files are documentation-only RFC additions with no code changes. The design is thorough and well-reasoned. Two P1 issues exist in the design itself: the circuit-breaker path can create permanent orphan blobs (R2 violation) and the managed→unmanaged flag transition handling relies on an implicit assumption about secondary index scope that is never stated. These should be resolved before implementation begins, but they don't block merging the RFC draft for community review. The missing Example 8 and the solution-content scope violation in the problem doc are P2 polish items. rfc/rfc-100/rfc-100-blob-cleaner-design.md — circuit breaker and managed-flag sections need clarification; rfc/rfc-100/rfc-100-blob-cleaner-problem.md — Example 8 gap and Example 9 scope issue. Important Files Changed
Sequence DiagramsequenceDiagram
participant CP as CleanPlanActionExecutor
participant S1 as Stage 1 (per-FG)
participant S2 as Stage 2 (cross-FG)
participant MDT as MDT Secondary Index
participant AUX as Sidecar (.hoodie/.aux/clean/)
participant TL as Timeline
participant CE as CleanActionExecutor
participant W as Writer (preCommit)
Note over CP: requestClean()
CP->>CP: hasBlobColumns()?
alt No blob columns
CP-->>TL: Plan (no blob work)
else Has blob columns
CP->>S1: For each expired/retained FG
S1-->>CP: all_local_orphans (set-difference)
alt local orphans exist
alt MDT secondary index available
CP->>S2: Verify candidates
S2->>MDT: Batch prefix scan (N paths)
MDT-->>S2: Map<path, List<recordKey>>
S2->>MDT: readRecordIndexLocations(all keys)
MDT-->>S2: Map<recordKey, (partition, fileId)>
S2-->>CP: blob_files_to_delete
else Fallback scan, few candidates
CP->>S2: Table scan
S2-->>CP: blob_files_to_delete
else Circuit breaker fires
Note over CP,S2: Stage 2 skipped — blobs deferred but file slices still deleted this cycle
CP-->>TL: Plan (no blob deletes recorded)
end
CP->>AUX: Write sidecar Parquet
CP->>TL: Persist plan (extraMetadata[blobDeletesPath])
end
end
Note over TL: REQUESTED
W->>TL: preCommit() — scan REQUESTED/INFLIGHT/COMPLETED clean actions
W->>AUX: Read sidecar Parquet
alt Writer blob paths intersect sidecar paths
W-->>W: HoodieWriteConflictException — retry
else No overlap
W->>TL: Commit succeeds
end
CE->>TL: Transition to INFLIGHT
CE->>AUX: Read sidecar Parquet
par Parallel deletion
CE->>CE: Delete file slices (existing)
and
CE->>CE: Delete blob files (from sidecar)
end
CE->>TL: Transition to COMPLETED (blobCleanStats + sidecarPath)
Note over AUX: Sidecar retained until archival
|
|
|
||
| #### Decision matrix | ||
|
|
||
| | Condition | Path used | Cost | Suitable for | | ||
| |-----------------------------|---------------|-----------------------|---------------------------| | ||
| | No local orphan candidates | Skip Stage 2 | Zero | No blob work this cycle | |
There was a problem hiding this comment.
Circuit breaker can produce permanent orphans (R2 violation)
When the circuit breaker fires (candidates > hoodie.cleaner.blob.external.scan.max.candidates), Stage 2 is skipped and no blob deletions are recorded — but the file slice deletions from the existing cleaner still execute within the same clean cycle. Once those expired file slices are deleted, the data that Stage 1 used to derive the orphan candidates is gone. In the next cleaner invocation, Stage 1 iterates expired_slices, but those slices no longer exist on storage. The orphan blob candidates can never be re-derived, making them permanent orphans — a direct violation of R2.
The description says cleanup is "deferred" but that framing is misleading: there is no durable queue of deferred candidates that the next cycle would pick up, and the source-of-truth file slices are deleted.
Possible mitigations to consider and explicitly document:
- Write deferred blob candidates to a durable artifact (separate from the sidecar, since no plan is written for this case) so the next cycle can pick them up without needing the expired slices.
- Alternatively, gate file-slice deletion on successful blob planning (i.e., refuse to delete file slices if Stage 2 was circuit-broken), though this changes existing cleaner semantics.
- Or, explicitly document this as a known limitation: blobs in tables that permanently lack an MDT secondary index and exceed the threshold are permanently orphaned, and the operator must either enable the index or accept the leak.
| for slice in retained_slices: | ||
| for ref in extractBlobRefs(slice.baseFile): // columnar projection only | ||
| if ref.type == OUT_OF_LINE and ref.managed == true: | ||
| retained_refs.add(ref.external_path) | ||
|
|
||
| // Compute local orphans by set difference | ||
| local_orphans = expired_refs - retained_refs | ||
|
|
||
| // All local orphans proceed to Stage 2 for cross-FG verification | ||
| all_local_orphans.addAll(local_orphans) |
There was a problem hiding this comment.
managed flag scope in Stage 2 is unspecified — managed→unmanaged transition may not be handled correctly
Stage 1 filters retained refs with ref.managed == true, so a retained slice whose managed flag transitioned to false for a given external_path contributes nothing to retained_refs. That path therefore becomes a local orphan candidate and proceeds to Stage 2.
The problem statement (rfc-100-blob-cleaner-problem.md, Section 2 — Managed flag discussion) explicitly requires: "If any retained reference to the same external_path exists (regardless of the managed flag), the blob must not be deleted." Stage 2 correctness therefore depends on the MDT secondary index indexing all external_path values regardless of the managed flag. If the secondary index writer applies the same managed == true filter, Stage 2 would miss the unmanaged retained reference, classify the blob as globally orphaned, and delete a blob that is still referenced — a violation of R1.
The design doc should explicitly state:
- Whether the secondary index on
<blob_col>.reference.external_pathindexes entries formanaged=falserecords, and - If it does not, how managed→unmanaged transitions are handled to satisfy R1 (e.g., the fallback table scan as a safety net, or a separate retained-refs enrichment pass).
|
|
||
| --- | ||
|
|
||
| ## Abstract |
Mirror of apache#18359 for automated bot review.
Original author: @voonhous
Base branch: master