Skip to content

[OSS PR #18359] feat(blob): Supplementing RFC-100 with blob cleaner design#44

Open
yihua wants to merge 6 commits into
masterfrom
oss-18359
Open

[OSS PR #18359] feat(blob): Supplementing RFC-100 with blob cleaner design#44
yihua wants to merge 6 commits into
masterfrom
oss-18359

Conversation

@yihua
Copy link
Copy Markdown
Owner

@yihua yihua commented Apr 14, 2026

Mirror of apache#18359 for automated bot review.

Original author: @voonhous
Base branch: master

- Add Example 9 to problem doc: demonstrates why blobFilesToDelete must be in the plan (writer-cleaner conflict resolution for external blobs)
- Use engine-context HoodieData instead of HoodieListData.eager() for Stage 2 MDT lookups to distribute large candidate sets across executors
- Update memory budget to reflect distributed Stage 2 processing
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 14, 2026

Warning

Rate limit exceeded

@yihua has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 9 minutes and 20 seconds before requesting another review.

Your organization is not enrolled in usage-based pricing. Contact your admin to enable usage-based pricing to continue reviews beyond the rate limit, or try again in 9 minutes and 20 seconds.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 691094be-c72f-49cb-8764-2fb92756d777

📥 Commits

Reviewing files that changed from the base of the PR and between 35e2bbf and 6dc90a8.

📒 Files selected for processing (2)
  • rfc/rfc-100/rfc-100-blob-cleaner-design.md
  • rfc/rfc-100/rfc-100-blob-cleaner-problem.md
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch oss-18359

Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps
Copy link
Copy Markdown

greptile-apps Bot commented Apr 14, 2026

Greptile Summary

This PR adds two new RFC documents for RFC-100 Part 2, describing the design for cleaning orphaned external blob files when Hudi expires old file slices. The two files are:

  • rfc-100-blob-cleaner-design.md: A detailed solution design covering the two-stage cleanup pipeline, MDT secondary index integration, sidecar Parquet lifecycle, writer-cleaner conflict check, crash recovery, configuration, and a test plan.
  • rfc-100-blob-cleaner-problem.md: A companion problem statement defining the 11 design constraints (C1–C11), 9 requirements (R1–R9), 7 illustrative failure modes, and 9 open questions.

The design is thorough and well-structured. The two-stage pipeline (per-FG set-difference → cross-FG MDT index verification) is a sound approach to the cross-file-group sharing problem. Several issues worth addressing:

  • Missing Example 8 in the problem document — examples jump from 7 to 9, suggesting a numbering error or omitted scenario.
  • Example 9 introduces solution content directly contradicting the problem doc's own stated scope ("It contains no solution content"). It compares two implementation approaches and advocates for one.
  • Circuit breaker in the fallback scan path may create permanent orphans — when candidates exceed the threshold, Stage 2 is skipped but file slices are still deleted that same cycle. In the next cycle those expired file slices are already gone, so the orphan blob candidates can never be re-derived (R2 violation under sustained absence of MDT secondary index).
  • managed flag coverage in Stage 2 is implicit, not stated — Stage 1 filters retained refs to managed == true only, but the problem doc requires that any retained reference (regardless of the managed flag) must prevent deletion. Stage 2's MDT index lookup must cover unmanaged references for correctness during managed→unmanaged transitions, but the design does not explicitly state the secondary index scope.
  • The placeholder Issue: <Link to GH feature issue> remains unfilled in the design doc header.

Confidence Score: 4/5

Safe to merge as an RFC draft after addressing the two P1 design concerns; no production code is changed.

Both files are documentation-only RFC additions with no code changes. The design is thorough and well-reasoned. Two P1 issues exist in the design itself: the circuit-breaker path can create permanent orphan blobs (R2 violation) and the managed→unmanaged flag transition handling relies on an implicit assumption about secondary index scope that is never stated. These should be resolved before implementation begins, but they don't block merging the RFC draft for community review. The missing Example 8 and the solution-content scope violation in the problem doc are P2 polish items.

rfc/rfc-100/rfc-100-blob-cleaner-design.md — circuit breaker and managed-flag sections need clarification; rfc/rfc-100/rfc-100-blob-cleaner-problem.md — Example 8 gap and Example 9 scope issue.

Important Files Changed

Filename Overview
rfc/rfc-100/rfc-100-blob-cleaner-design.md New design document for the blob cleaner. Well-structured two-stage pipeline, good crash-recovery and concurrency coverage. Two substantive issues: the circuit-breaker deferred cleanup path can create permanent orphans (R2 violation), and the managed-flag secondary index coverage is unspecified.
rfc/rfc-100/rfc-100-blob-cleaner-problem.md New problem statement document. Thorough constraint and requirement coverage. Two issues: Example 8 is missing (numbering jumps 7→9), and Example 9 includes solution advocacy that contradicts the doc's stated "no solution content" scope.

Sequence Diagram

sequenceDiagram
    participant CP as CleanPlanActionExecutor
    participant S1 as Stage 1 (per-FG)
    participant S2 as Stage 2 (cross-FG)
    participant MDT as MDT Secondary Index
    participant AUX as Sidecar (.hoodie/.aux/clean/)
    participant TL as Timeline
    participant CE as CleanActionExecutor
    participant W as Writer (preCommit)

    Note over CP: requestClean()
    CP->>CP: hasBlobColumns()?
    alt No blob columns
        CP-->>TL: Plan (no blob work)
    else Has blob columns
        CP->>S1: For each expired/retained FG
        S1-->>CP: all_local_orphans (set-difference)

        alt local orphans exist
            alt MDT secondary index available
                CP->>S2: Verify candidates
                S2->>MDT: Batch prefix scan (N paths)
                MDT-->>S2: Map<path, List<recordKey>>
                S2->>MDT: readRecordIndexLocations(all keys)
                MDT-->>S2: Map<recordKey, (partition, fileId)>
                S2-->>CP: blob_files_to_delete
            else Fallback scan, few candidates
                CP->>S2: Table scan
                S2-->>CP: blob_files_to_delete
            else Circuit breaker fires
                Note over CP,S2: Stage 2 skipped — blobs deferred but file slices still deleted this cycle
                CP-->>TL: Plan (no blob deletes recorded)
            end
            CP->>AUX: Write sidecar Parquet
            CP->>TL: Persist plan (extraMetadata[blobDeletesPath])
        end
    end

    Note over TL: REQUESTED

    W->>TL: preCommit() — scan REQUESTED/INFLIGHT/COMPLETED clean actions
    W->>AUX: Read sidecar Parquet
    alt Writer blob paths intersect sidecar paths
        W-->>W: HoodieWriteConflictException — retry
    else No overlap
        W->>TL: Commit succeeds
    end

    CE->>TL: Transition to INFLIGHT
    CE->>AUX: Read sidecar Parquet
    par Parallel deletion
        CE->>CE: Delete file slices (existing)
    and
        CE->>CE: Delete blob files (from sidecar)
    end
    CE->>TL: Transition to COMPLETED (blobCleanStats + sidecarPath)
    Note over AUX: Sidecar retained until archival
Loading

Comments Outside Diff (2)

  1. rfc/rfc-100/rfc-100-blob-cleaner-problem.md, line 1445-1511 (link)

    P1 Example 9 introduces solution content, contradicting the document's stated scope

    The document's opening paragraph explicitly states: "It contains no solution content."

    Example 9 compares two implementation approaches (Approach A vs. Approach B), provides pseudocode for both, and advocates for "blobFilesToDelete IN the plan" as the correct choice. This is unambiguously solution content. It also conflicts with the actual design in rfc-100-blob-cleaner-design.md, which uses a sidecar Parquet file rather than inline blobFilesToDelete in the plan — so even the "recommended" approach in Example 9 does not match the adopted design.

    Consider either:

    • Moving Example 9 (or a revised version of it) into the design doc as a rationale section for why blob delete information must be a durable artifact committed before or with the plan.
    • Replacing Example 9 with a problem-only framing: describe the failure mode of not having any durable artifact, without prescribing which form that artifact takes.
  2. rfc/rfc-100/rfc-100-blob-cleaner-problem.md, line 1400-1444 (link)

    P2 Missing Example 8 — numbering jumps from 7 to 9

    The illustrative examples section goes directly from Example 7 (line 1400) to Example 9 (line 1445) with no Example 8. This is either a numbering error (Example 9 should be renumbered to Example 8) or a placeholder for a missing scenario. Please either add the missing example or renumber to close the gap.

Reviews (1): Last reviewed commit: "Update RFC to include sidecar parquet fi..." | Re-trigger Greptile

Comment on lines +318 to +323

#### Decision matrix

| Condition | Path used | Cost | Suitable for |
|-----------------------------|---------------|-----------------------|---------------------------|
| No local orphan candidates | Skip Stage 2 | Zero | No blob work this cycle |
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Circuit breaker can produce permanent orphans (R2 violation)

When the circuit breaker fires (candidates > hoodie.cleaner.blob.external.scan.max.candidates), Stage 2 is skipped and no blob deletions are recorded — but the file slice deletions from the existing cleaner still execute within the same clean cycle. Once those expired file slices are deleted, the data that Stage 1 used to derive the orphan candidates is gone. In the next cleaner invocation, Stage 1 iterates expired_slices, but those slices no longer exist on storage. The orphan blob candidates can never be re-derived, making them permanent orphans — a direct violation of R2.

The description says cleanup is "deferred" but that framing is misleading: there is no durable queue of deferred candidates that the next cycle would pick up, and the source-of-truth file slices are deleted.

Possible mitigations to consider and explicitly document:

  • Write deferred blob candidates to a durable artifact (separate from the sidecar, since no plan is written for this case) so the next cycle can pick them up without needing the expired slices.
  • Alternatively, gate file-slice deletion on successful blob planning (i.e., refuse to delete file slices if Stage 2 was circuit-broken), though this changes existing cleaner semantics.
  • Or, explicitly document this as a known limitation: blobs in tables that permanently lack an MDT secondary index and exceed the threshold are permanently orphaned, and the operator must either enable the index or accept the leak.

Comment on lines +214 to +223
for slice in retained_slices:
for ref in extractBlobRefs(slice.baseFile): // columnar projection only
if ref.type == OUT_OF_LINE and ref.managed == true:
retained_refs.add(ref.external_path)

// Compute local orphans by set difference
local_orphans = expired_refs - retained_refs

// All local orphans proceed to Stage 2 for cross-FG verification
all_local_orphans.addAll(local_orphans)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 managed flag scope in Stage 2 is unspecified — managed→unmanaged transition may not be handled correctly

Stage 1 filters retained refs with ref.managed == true, so a retained slice whose managed flag transitioned to false for a given external_path contributes nothing to retained_refs. That path therefore becomes a local orphan candidate and proceeds to Stage 2.

The problem statement (rfc-100-blob-cleaner-problem.md, Section 2 — Managed flag discussion) explicitly requires: "If any retained reference to the same external_path exists (regardless of the managed flag), the blob must not be deleted." Stage 2 correctness therefore depends on the MDT secondary index indexing all external_path values regardless of the managed flag. If the secondary index writer applies the same managed == true filter, Stage 2 would miss the unmanaged retained reference, classify the blob as globally orphaned, and delete a blob that is still referenced — a violation of R1.

The design doc should explicitly state:

  1. Whether the secondary index on <blob_col>.reference.external_path indexes entries for managed=false records, and
  2. If it does not, how managed→unmanaged transitions are handled to satisfy R1 (e.g., the fallback table scan as a safety net, or a separate retained-refs enrichment pass).


---

## Abstract
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Unfilled issue link placeholder

The Issue field still contains the template placeholder text. This should be filled in with the corresponding GitHub issue link before the RFC is finalized.

Suggested change
## Abstract
Issue: <!-- Link to GH feature issue -->

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants