feat(blob): Supplementing RFC-100 with blob cleaner design by voonhous · Pull Request #18359 · apache/hudi

voonhous · 2026-03-20T10:15:32Z

Describe the issue this Pull Request addresses

Adding blob-cleaner design for RFC-100. The design does not introduce any new architecture. Instead, it leverages on already existing design and extends it.

For example, out-of-line/external blob cleaning is implemented via extending existing cleaner and uses secondary + record index for identifying blob references that needs to be cleaned. This avoids full table scans, and scalability of O(C * R_avg) where C is the number of candidate files to scan and R_avg average number of references per candidate path.

External blob cleaning has a complexity of:

worse case: O(C * R_avg) (All recordkeys belong in cleaned_fg)
best case: O(C) (First recordkey belongs in a non-cleaned FG)

Summary and Changelog

Added concrete design and considerations for external blob cleaner
- Considers 2 design specification as dictated by RFC-100 (managed blob paths and blob containers)
- Adds an additional handling for external blobs not managed by Hudi

Impact

None - RFC design.

Risk Level

None - RFC design.

Documentation Update

None, RFC design.

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

rahil-c · 2026-03-20T13:49:50Z

+|-------------|----------------------|----------------------------------------------------------------------------------|----------------------------------------|
+| **Stage 1** | Per-file-group       | Collect expired/retained blob refs, compute set difference, dispatch by category | Always (for blob tables)               |
+| **Stage 2** | Cross-file-group     | Verify external blob candidates against MDT secondary index or fallback scan     | Only when external candidates exist    |
+| **Stage 3** | Container resolution | Determine delete vs. flag-for-compaction at the container level                  | Only when container blobs are involved |


@voonhous forgive me if this is a beginner question but what does "container" blob mean?

Basically a set of blobs.

{Blob1, blob2, blob3, ...}, it's addressable by offsets

It's basically what is described in the rfc-100. I'm just extending the concepts there. If it's not required, we can nuke it.

It's referenced here:
https://github.com/apache/hudi/blob/master/rfc/rfc-100/rfc-100.md#milestone-3-dynamic-inlineout-of-line-storage

Again, recording for the Nth time that we don't have any alignment on this :)

As discussed, will delete this term

rahil-c · 2026-03-25T18:37:22Z

+they are co-equal paths with different properties, different volumes, and different cleanup costs.
+
+**Flow 1: Path-dispatched (Hudi-created blobs).** Blobs created by Hudi's write path and stored
+under `{table}/.hoodie/blobs/{partition}/{col}/{instant}/{blob_id}`. The path structure guarantees


Do we really need a /{instant} or would it suffice to have this the blob_id also have an attached instance time, something like {blob_id_instant}?

My understanding is that having a folder = 1 less list file operation, where we need to list, then filter.

Haven't really verified this. My main concern is trying to keep everything as organised as possible and if blob file count can be in the order of millions, I felt that having it under an instant folder made sense to keep everything easier to find.

To me, this is neither an out-of-line blob that is managed somewhere else externally by the user, nor is it an internal inline blob where it's stored right within the base or log files, right?

Are we assuming we are going to store blobs separately within the Hudi table path like this? I don't think we had alignment on this. To define anything like this, we should also not talk about other table services concretely. I'm going to simply review this with a narrow scope of external blobs and reference cleaning.

This is an extension of the original RFC-100 here:
https://github.com/apache/hudi/blob/master/rfc/rfc-100/rfc-100.md?plain=1#L168-L171

We've established that we had no alignment on things, let's remove this from scope.

rahil-c · 2026-03-25T19:01:46Z

+
+### C4: Container files
+
+Multiple blobs can be packed into a single container file, distinguished by `(offset, length)` within


Can you share with me an example to help better understand this?

IIUC, our focus now is external blobs.

This might be out of scope for this RFC since we're only focusing on external blobs. I added this because of this:

https://github.com/apache/hudi/blob/master/rfc/rfc-100/rfc-100.md?plain=1#L207-L208

Yes, this is all again not very well flushed out. Let's ignore all this, or even honestly remove it from this doc. Can this MD file simply focus on the external blob cleaner problem?

the-other-tim-brown · 2026-03-26T02:13:46Z

+        FP["filePathsToBeDeleted<br/>(existing)"]
+        BP["blobFilesToDelete<br/>(new)"]


We decided in another PR that this is too large to write to the plan. Has that changed?

Yes...

In this design, the plan IS the isolation boundary. All blob delete decisions must be made at plan time against the same snapshot used for file slice decisions.

Keeping blob deletes out of the plan makes it impossible to add writer-cleaner conflict resolution for external blobs (Scenario B from your problem doc). That's a correctness hole I don't think I can close later without putting the information on the timeline.

blobFilesToDelete being too long and large is a real concern. If this is a concern, maybe we can introduce a config to cap the number of blobs to delete in each iteration?

A config like hoodie.cleaner.blob.max.deletes.per.cycle that bounds how many blob deletes go into the plan per iteration, deferring the rest to the next cycle, but this might mean a clean is now broken up into multiple clean operations... Which might fundamentally change timeline assumptions.

Will add an example into rfc-100-*-problems under Example 9 to explain this.

We cannot write this into the plan. That hasn't changed.

It's a hard blocker. We've written stuff like this before, like tracking file stats in the commit data itself, and it had very bad consequences.

Understand. Doing this is akin to manifest and loading or serializing everything into memory might take a very long time or OOM.

The alternative for this is we store these in a separate MDT partition using a storage layout that is optimized for storing large volumes of data.

TODO: Figure out an alternative to this while ensuring C11.

Updated design doc such that blob delete list is no longer stored in the HoodieCleanerPlan.

Instead, it's written to a sidecar Parquet file at .hoodie/.aux/clean/<instant>.blob_deletes.parquet, referenced from the plan via the existing extraMetadata map (extraMetadata["blobDeletesPath"]) so that no new Avro fields are added to the plan schema.

This way, we address the antipattern issue, there will also be no bloat regardless of blob candidate count, while keeping the delete list as a durable artifact on storage for writer conflict resolution (as illustrated in Example 9 in the problem doc for why this is required).

The full design is in the "Sidecar Parquet File: Blob Delete List" section under "Integration with Existing Cleaner", covering:

Write sequence (sidecar written before plan, atomicity guarantee)

Lifecycle (lives until archival, bounded count)

Cleanup (orphan cleanup at startup, rollback cleanup, archival cleanup)

🤖 I think the sidecar approach in this RFC is actually a good compromise here — the blob delete list is effectively "in the plan" for conflict resolution purposes (the plan's extraMetadata points to the sidecar, so writers can find it), but the actual data lives in a separate Parquet file to avoid plan bloat. Example 9 in the problem doc makes a compelling case that without a durable artifact on the timeline, writer-side conflict checks have no way to detect blob-level conflicts. The sidecar gives you both: durability for correctness, and separation for size.

the-other-tim-brown · 2026-03-26T02:16:56Z

+// Key format: escaped(external_path)$escaped(record_key)
+// Returns ALL record keys that reference each candidate path
+path_to_record_keys = mdtMetadata.readSecondaryIndexDataTableRecordKeysWithKeys(
+    HoodieListData.eager(candidate_paths), indexPartitionName)


The HoodieData here is always referencing HoodieListData but can we use the RDD backed HoodieData when running on spark?

Yeap, it's better to to that.
We should be using engine-distributed HoodieData (RDD on Spark, etc.) to keep this data partitioned across executors rather than materializing it all on the driver, esp if there are many blobs to delete.

readSecondaryIndexDataTableRecordKeysWithKeys() accepts generic HoodieData<String>, but V1 has a hard instanceof HoodieListData check that throws if you pass anything else.

V2 has no such restriction.

We have to use V2 interface/API then.

Good catch, updating the RFC!

- Add Example 9 to problem doc: demonstrates why blobFilesToDelete must be in the plan (writer-cleaner conflict resolution for external blobs) - Use engine-context HoodieData instead of HoodieListData.eager() for Stage 2 MDT lookups to distribute large candidate sets across executors - Update memory budget to reflect distributed Stage 2 processing

vinothchandar · 2026-04-01T18:49:19Z

+expected majority flow for Phase 3 workloads.
+
+**Flow 2: Non-path-dispatched (user-provided external blobs).** Users have existing blob files in
+external storage (e.g., `s3://media-bucket/videos/`, a shared NFS mount, or any user-controlled


is this AI generated? does NFS mount even work with our stack :)

Used AI to help concretise a few things. I think it interpreted HDFS mount as NFS here. I might have a typo here causing the interpretation error. Let me fix it.

vinothchandar · 2026-04-01T18:50:08Z

+external storage (e.g., `s3://media-bucket/videos/`, a shared NFS mount, or any user-controlled
+path). Records reference these blobs directly by path. The user does **not** want to bootstrap --
+they do not want Hudi to copy, move, or reorganize the blob files into `.hoodie/blobs/`. Hudi
+manages the *references*, not the *storage layout*. This is the expected primary flow for Phase 1


This clarifies my earlier comment. I think we are aligned. Let's just focus on the phase one for now.

vinothchandar

@voonhous thanks for the detailed write-up. I left some comments, mostly on the problem section.

Can we add a simple high-level overview of the design section and talk about how it meets all the different problems that you laid out in the other doc? Also, please scope this down to simply external blobs. Let's deal with the container files later on.

I can take another pass at it, once this is simplified.

vinothchandar · 2026-04-01T18:50:34Z

+manages the *references*, not the *storage layout*. This is the expected primary flow for Phase 1
+workloads and remains a supported flow in Phase 3.
+
+The non-path-dispatched flow has fundamentally different properties:


Let us call these external blobs. Not add new names like non-dispatched flow.

Done, named it external out-of-line blobs. When we refer to these 2 terms, they will mean the same thing and they are interchangeable.

vinothchandar · 2026-04-01T18:51:30Z

+|---------------------------|-----------------------------------|--------------------------------------|
+| Path uniqueness           | Guaranteed (instant in path, C11) | Not guaranteed (user controls)       |
+| Cross-FG sharing          | Does not occur (FG-scoped)        | Common (multiple records, same blob) |
+| Writer/cleaner race       | Cannot occur (D2)                 | Can occur (D3)                       |


Sorry, what are D2, D3, these things? What do they refer to?

D2: Writer-cleaner race cannot occur for Hudi-created blobs (structural proof)

D3: Writer-cleaner race can occur for external blobs (requires conflict check)

Too many things ongoing when writing this doc, forgot to add them in.

vinothchandar · 2026-04-01T18:52:45Z

+
+- **Inline blobs.** Inline blob data lives inside the base/log file and is deleted when the file
+  slice is cleaned. No additional cleanup needed.
+- **Blob compaction internals.** Blob compaction (repacking partially-live container files) is a


For the same reason, I do not want to do any investments or anything around the first flow yet. path-dispatched.

As discussed, no blob compaction as this is container files specific, it will be removed from the RFC. We will be fully focusing on external out-of-line-blobs.

vinothchandar · 2026-04-01T18:53:50Z

+### Stance on the `managed` flag
+
+The BlobReference schema includes a `managed` boolean field
+(`HoodieSchema.Blob.EXTERNAL_REFERENCE_IS_MANAGED`). The RFC states that only managed blobs are


In docs, let's refrain from referencing Java classes directly. It gives a false impression that we are designing everything for Java. Instead, let's please refer to the Avro schemas for these logical types.

Removed and rephrased to:

The `BlobReference` schema includes a `managed` boolean field (`reference.managed`).

vinothchandar · 2026-04-01T18:54:58Z

+
+The BlobReference schema includes a `managed` boolean field
+(`HoodieSchema.Blob.EXTERNAL_REFERENCE_IS_MANAGED`). The RFC states that only managed blobs are
+cleaned. This document acknowledges the flag and treats it as a **filter** -- unmanaged blobs are


An interesting corner case is whether external blobs are allowed to go from managed to unmanaged across multiple writes. I think we should clarify the behavior on this.

Added.
Managed -> unmanaged with the same external_path, we will leave the blob untouched.
Unmanaged -> Managed with the same external_path, we will start tracking the blob, if it expires after t2, we will delete it.

vinothchandar · 2026-04-02T01:57:42Z

+
+## Approvers
+
+- (TBD)


Please add me, Ethan, and Rahil.

vinothchandar · 2026-04-02T01:59:07Z

+
+Blob cleanup extends the existing `CleanPlanner` / `CleanActionExecutor` pipeline -- same timeline
+instant, same plan-execute-complete lifecycle, same crash recovery and OCC integration. A
+`hasBlobColumns()` check gates all blob logic so non-blob tables pay zero cost.


Technically, there is cost to resolve the schema, which the cleaner does not have to do today, but it's small.

vinothchandar · 2026-04-02T01:59:43Z

+|-------------|----------------------|----------------------------------------------------------------------------------|----------------------------------------|
+| **Stage 1** | Per-file-group       | Collect expired/retained blob refs, compute set difference, dispatch by category | Always (for blob tables)               |
+| **Stage 2** | Cross-file-group     | Verify external blob candidates against MDT secondary index or fallback scan     | Only when external candidates exist    |
+| **Stage 3** | Container resolution | Determine delete vs. flag-for-compaction at the container level                  | Only when container blobs are involved |


Again, recording for the Nth time that we don't have any alignment on this :)

vinothchandar · 2026-04-02T02:05:57Z

+        FP["filePathsToBeDeleted<br/>(existing)"]
+        BP["blobFilesToDelete<br/>(new)"]


We cannot write this into the plan. That hasn't changed.

vinothchandar · 2026-04-02T02:06:33Z

+        FP["filePathsToBeDeleted<br/>(existing)"]
+        BP["blobFilesToDelete<br/>(new)"]


It's a hard blocker. We've written stuff like this before, like tracking file stats in the commit data itself, and it had very bad consequences.

hudi-bot · 2026-04-06T10:03:03Z

CI report:

6dc90a8 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

yihua

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the RFC — the problem statement is thorough and the two-stage pipeline design is well-motivated. The constraint catalog, failure mode examples, and cost model are particularly strong; they show real rigor in thinking through edge cases. A few areas could use more detail: the writer-side conflict check integration with ConcurrentOperation/TransactionUtils (which currently throws on CLEAN_ACTION), the sidecar lifecycle hook into archival, and the I/O cost of reading log files in Stage 1. See inline comments for specifics.

yihua · 2026-04-14T22:51:12Z

+   ├── Transition to INFLIGHT
+   ├── Read sidecar Parquet (blob delete list)
+   ├── Delete file slices (existing, parallelized)
+   ├── Delete blob files (new, parallelized)          // parallel with file slice deletes


🤖 The writer-side conflict check requires extending ConcurrentOperation.init() and TransactionUtils.getInflightAndRequestedInstants() to handle CLEAN_ACTION, which currently throws IllegalArgumentException for clean actions. This is a non-trivial change to a sensitive code path. Could you sketch out what ConcurrentOperation would look like for a clean action? For instance, what mutatedFileIds would it report — none? And does ConflictResolutionStrategy.hasConflict() need a new code path, or can it reuse the existing partition/fileId overlap logic?

_{- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.}

yihua · 2026-04-14T22:51:12Z

+
+    rect rgb(255, 245, 230)
+        Note right of TL: Crash here → re-read sidecar,<br/>re-delete (FileNotFound = success)
+    end


🤖 In the writer-side conflict check, the writer reads sidecar Parquet files for REQUESTED, INFLIGHT, and COMPLETED clean instants. For COMPLETED instants, it reads blobDeletesSidecarPath from HoodieCleanMetadata. How many COMPLETED clean instants could accumulate before archival? If the table runs frequent clean cycles, the writer could end up reading many sidecars on every commit. Is there a bound on this, or should the design discuss amortization?

_{- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.}

yihua · 2026-04-14T22:51:12Z

+                Note right of C: Globally orphaned → DELETE
+            end
+        end
+    end


🤖 The design mentions HoodieWriteStat.externalBlobPaths as an in-memory collection with "no additional I/O." How does the write path populate this? Does the writer need to scan the blob reference column of every record it writes, or is this extracted during the normal write flow? For large writes, even in-memory collection could be significant.

_{- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.}

yihua · 2026-04-14T22:51:12Z

+
+### Schema Changes: HoodieCleanMetadata
+
+A new nullable field `blobCleanStats` of type `HoodieBlobCleanStats`:


🤖 The sidecar lifecycle says it lives until the clean instant is archived, and archival cleanup deletes it. But archival today doesn't know about sidecar files — it archives timeline instants without touching .aux/. Could you clarify where this cleanup hook would be added? Is it in the archival action itself, or a separate periodic cleanup?

_{- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.}

yihua · 2026-04-14T22:51:12Z

+
+Stage 2 verifies each local orphan candidate against the global state to determine if the blob is
+still referenced by any active file slice outside the cleaned file groups. This is necessary because
+external blobs can be shared across file groups (C3, C11).


🤖 Stage 1 reads log files for expired slices ("full record read") to catch transient blob refs. For large log files, this could be expensive. Is there an estimate of the I/O cost for this? The back-of-envelope section estimates "~6 reads per FG" but log file reads are significantly more expensive than columnar projection on base files.

_{- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.}

yihua · 2026-04-14T22:51:12Z

+
+| Crash point                                   | Recovery                                                                                               |
+|-----------------------------------------------|--------------------------------------------------------------------------------------------------------|
+| After sidecar written, before plan persisted  | No REQUESTED instant on timeline. Cleaner starts fresh. Orphaned sidecar cleaned at next startup.      |


🤖 The memory budget section estimates ~100MB peak per FG, but then says FGs are processed sequentially within each partition batch while cleanerParallelism batches run in parallel. This means peak memory is 100MB * cleanerParallelism. With default parallelism, this might be fine, but it's worth noting that this is a new memory pressure the cleaner didn't have before. Have you considered a streaming/spill-to-disk approach for very large file groups?

_{- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.}

yihua · 2026-04-14T22:51:12Z

+### Memory Budget
+
+Per-FG blob ref sets: ~100MB peak (500K records * 100 bytes/ref for expired + retained). FGs are
+processed sequentially within each partition batch -- per-FG sets are computed and discarded, not


🤖 The rollout plan mentions "Stage 1" and "Stage 2" both as pipeline stages in the algorithm AND as rollout phases. This overloaded terminology could cause confusion. Consider calling the rollout phases something like "Phase A: Foundation", "Phase B: Per-FG cleanup", etc. to distinguish them from the algorithm's Stage 1/Stage 2.

_{- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.}

yihua · 2026-04-14T22:51:12Z

+
+**Writer-side conflict check.** `preCommit()` conflict check reads sidecar for concurrency safety.
+Requires extending `ConcurrentOperation` / `TransactionUtils` to include clean actions when blob
+columns exist.


🤖 The RFC doesn't discuss what happens if the MDT secondary index becomes stale or inconsistent (e.g., after a failed MDT update). If Stage 2 relies on the secondary index for correctness, a stale index could lead to premature deletion. Is the safety check (verifying index is fully built) sufficient, or should the design also handle partial staleness?

_{- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.}

yihua · 2026-04-14T22:51:12Z

+  by reading the expired log files. Without log reads on the expired side,
+  every such transient blob becomes a permanent orphan that accumulates
+  storage indefinitely.
+```


🤖 The problem statement has excellent failure mode examples and constraints, but Q6 (recovery from premature deletion) is left fully open. Given that premature deletion is the single most critical failure, it might be worth the design doc addressing this — even if the answer is "no recovery; rely on correctness guarantees." Does the design doc intentionally leave this unanswered?

_{- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.}

yihua · 2026-04-14T22:51:14Z

+        FP["filePathsToBeDeleted<br/>(existing)"]
+        BP["blobFilesToDelete<br/>(new)"]


🤖 I think the sidecar approach in this RFC is actually a good compromise here — the blob delete list is effectively "in the plan" for conflict resolution purposes (the plan's extraMetadata points to the sidecar, so writers can find it), but the actual data lives in a separate Parquet file to avoid plan bloat. Example 9 in the problem doc makes a compelling case that without a durable artifact on the timeline, writer-side conflict checks have no way to detect blob-level conflicts. The sidecar gives you both: durability for correctness, and separation for size.

yihua

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Greptile Summary: This PR adds two new RFC documents for RFC-100 Part 2, describing the design for cleaning orphaned external blob files when Hudi expires old file slices. The two files are:

rfc-100-blob-cleaner-design.md: A detailed solution design covering the two-stage cleanup pipeline, MDT secondary index integration, sidecar Parquet lifecycle, writer-cleaner conflict check, crash recovery, configuration, and a test plan.
rfc-100-blob-cleaner-problem.md: A companion problem statement defining the 11 design constraints (C1–C11), 9 requirements (R1–R9), 7 illustrative failure modes, and 9 open questions.

The design is thorough and well-structured. The two-stage pipeline (per-FG set-difference → cross-FG MDT index verification) is a sound approach to the cross-file-group sharing problem. Several issues worth addressing:

Missing Example 8 in the problem document — examples jump from 7 to 9, suggesting a numbering error or omitted scenario.
Example 9 introduces solution content directly contradicting the problem doc's own stated scope ("It contains no solution content"). It compares two implementation approaches and advocates for one.
Circuit breaker in the fallback scan path may create permanent orphans — when candidates exceed the threshold, Stage 2 is skipped but file slices are still deleted that same cycle. In the next cycle those expired file slices are already gone, so the orphan blob candidates can never be re-derived (R2 violation under sustained absence of MDT secondary index).
managed flag coverage in Stage 2 is implicit, not stated — Stage 1 filters retained refs to managed == true only, but the problem doc requires that any retained reference (regardless of the managed flag) must prevent deletion. Stage 2's MDT index lookup must cover unmanaged references for correctness during managed→unmanaged transitions, but the design does not explicitly state the secondary index scope.
The placeholder Issue: <Link to GH feature issue> remains unfilled in the design doc header.

Greptile Confidence Score: 4/5
Safe to merge as an RFC draft after addressing the two P1 design concerns; no production code is changed.

Both files are documentation-only RFC additions with no code changes. The design is thorough and well-reasoned. Two P1 issues exist in the design itself: the circuit-breaker path can create permanent orphan blobs (R2 violation) and the managed→unmanaged flag transition handling relies on an implicit assumption about secondary index scope that is never stated. These should be resolved before implementation begins, but they don't block merging the RFC draft for community review. The missing Example 8 and the solution-content scope violation in the problem doc are P2 polish items.

rfc/rfc-100/rfc-100-blob-cleaner-design.md — circuit breaker and managed-flag sections need clarification; rfc/rfc-100/rfc-100-blob-cleaner-problem.md — Example 8 gap and Example 9 scope issue.

Sequence Diagram (Greptile):

sequenceDiagram
    participant CP as CleanPlanActionExecutor
    participant S1 as Stage 1 (per-FG)
    participant S2 as Stage 2 (cross-FG)
    participant MDT as MDT Secondary Index
    participant AUX as Sidecar (.hoodie/.aux/clean/)
    participant TL as Timeline
    participant CE as CleanActionExecutor
    participant W as Writer (preCommit)

    Note over CP: requestClean()
    CP->>CP: hasBlobColumns()?
    alt No blob columns
        CP-->>TL: Plan (no blob work)
    else Has blob columns
        CP->>S1: For each expired/retained FG
        S1-->>CP: all_local_orphans (set-difference)

        alt local orphans exist
            alt MDT secondary index available
                CP->>S2: Verify candidates
                S2->>MDT: Batch prefix scan (N paths)
                MDT-->>S2: Map<path, List<recordKey>>
                S2->>MDT: readRecordIndexLocations(all keys)
                MDT-->>S2: Map<recordKey, (partition, fileId)>
                S2-->>CP: blob_files_to_delete
            else Fallback scan, few candidates
                CP->>S2: Table scan
                S2-->>CP: blob_files_to_delete
            else Circuit breaker fires
                Note over CP,S2: Stage 2 skipped — blobs deferred but file slices still deleted this cycle
                CP-->>TL: Plan (no blob deletes recorded)
            end
            CP->>AUX: Write sidecar Parquet
            CP->>TL: Persist plan (extraMetadata[blobDeletesPath])
        end
    end

    Note over TL: REQUESTED

    W->>TL: preCommit() — scan REQUESTED/INFLIGHT/COMPLETED clean actions
    W->>AUX: Read sidecar Parquet
    alt Writer blob paths intersect sidecar paths
        W-->>W: HoodieWriteConflictException — retry
    else No overlap
        W->>TL: Commit succeeds
    end

    CE->>TL: Transition to INFLIGHT
    CE->>AUX: Read sidecar Parquet
    par Parallel deletion
        CE->>CE: Delete file slices (existing)
    and
        CE->>CE: Delete blob files (from sidecar)
    end
    CE->>TL: Transition to COMPLETED (blobCleanStats + sidecarPath)
    Note over AUX: Sidecar retained until archival

Greptile: yihua#44 (review)

yihua · 2026-04-14T23:21:58Z

+
+| Condition                   | Path used     | Cost                  | Suitable for              |
+|-----------------------------|---------------|-----------------------|---------------------------|
+| No local orphan candidates  | Skip Stage 2  | Zero                  | No blob work this cycle   |


Circuit breaker can produce permanent orphans (R2 violation)

When the circuit breaker fires (candidates > hoodie.cleaner.blob.external.scan.max.candidates), Stage 2 is skipped and no blob deletions are recorded — but the file slice deletions from the existing cleaner still execute within the same clean cycle. Once those expired file slices are deleted, the data that Stage 1 used to derive the orphan candidates is gone. In the next cleaner invocation, Stage 1 iterates expired_slices, but those slices no longer exist on storage. The orphan blob candidates can never be re-derived, making them permanent orphans — a direct violation of R2.

The description says cleanup is "deferred" but that framing is misleading: there is no durable queue of deferred candidates that the next cycle would pick up, and the source-of-truth file slices are deleted.

Possible mitigations to consider and explicitly document:

Write deferred blob candidates to a durable artifact (separate from the sidecar, since no plan is written for this case) so the next cycle can pick them up without needing the expired slices.

Alternatively, gate file-slice deletion on successful blob planning (i.e., refuse to delete file slices if Stage 2 was circuit-broken), though this changes existing cleaner semantics.

Or, explicitly document this as a known limitation: blobs in tables that permanently lack an MDT secondary index and exceed the threshold are permanently orphaned, and the operator must either enable the index or accept the leak.

— Greptile (original) (source:comment#3082984560)

yihua · 2026-04-14T23:21:59Z

+    local_orphans = expired_refs - retained_refs
+
+    // All local orphans proceed to Stage 2 for cross-FG verification
+    all_local_orphans.addAll(local_orphans)


managed flag scope in Stage 2 is unspecified — managed→unmanaged transition may not be handled correctly

Stage 1 filters retained refs with ref.managed == true, so a retained slice whose managed flag transitioned to false for a given external_path contributes nothing to retained_refs. That path therefore becomes a local orphan candidate and proceeds to Stage 2.

The problem statement (rfc-100-blob-cleaner-problem.md, Section 2 — Managed flag discussion) explicitly requires: "If any retained reference to the same external_path exists (regardless of the managed flag), the blob must not be deleted." Stage 2 correctness therefore depends on the MDT secondary index indexing all external_path values regardless of the managed flag. If the secondary index writer applies the same managed == true filter, Stage 2 would miss the unmanaged retained reference, classify the blob as globally orphaned, and delete a blob that is still referenced — a violation of R1.

The design doc should explicitly state:

Whether the secondary index on <blob_col>.reference.external_path indexes entries for managed=false records, and

If it does not, how managed→unmanaged transitions are handled to satisfy R1 (e.g., the fallback table scan as a safety net, or a separate retained-refs enrichment pass).

— Greptile (original) (source:comment#3082984636)

yihua · 2026-04-14T23:21:59Z

+
+---
+
+## Abstract


Unfilled issue link placeholder

The Issue field still contains the template placeholder text. This should be filled in with the corresponding GitHub issue link before the RFC is finalized.

Suggested change

## Abstract

Issue: 

— Greptile (original) (source:comment#3082985257)

yihua

The blob cleaning design and implementation are not part of Hudi 1.2.0 release. Removing this PR from release-1.2.0 milestone.

feat(blob): Add RFC-100 external blob cleaner design

3dd15e0

voonhous force-pushed the rfc-blob-cleaner branch from 0882eee to 3dd15e0 Compare March 20, 2026 10:16

voonhous requested review from bvaradar, rahil-c, the-other-tim-brown, vinothchandar and yihua March 20, 2026 10:18

github-actions Bot added the size:XL PR with lines of changes > 1000 label Mar 20, 2026

voonhous assigned vinothchandar, yihua, bvaradar and rahil-c Mar 20, 2026

rahil-c reviewed Mar 20, 2026

View reviewed changes

apache deleted a comment from hudi-bot Mar 20, 2026

voonhous added 2 commits March 22, 2026 14:15

Update cost

b1e3599

Add SI -> RI consideration to appendix

712b146

rahil-c reviewed Mar 25, 2026

View reviewed changes

the-other-tim-brown reviewed Mar 26, 2026

View reviewed changes

vinothchandar reviewed Apr 1, 2026

View reviewed changes

vinothchandar requested changes Apr 2, 2026

View reviewed changes

voonhous added 2 commits April 6, 2026 13:32

Address comments.

ca20424

Update RFC to include sidecar parquet file to contain blob delete files

6dc90a8

yihua reviewed Apr 14, 2026

View reviewed changes

yihua mentioned this pull request Apr 14, 2026

[OSS PR #18359] feat(blob): Supplementing RFC-100 with blob cleaner design yihua/hudi#44

Open

yihua reviewed Apr 14, 2026

View reviewed changes

voonhous added this to the release-1.2.0 milestone May 13, 2026

yihua reviewed May 14, 2026

View reviewed changes

yihua removed this from the release-1.2.0 milestone May 14, 2026


		### C4: Container files

		Multiple blobs can be packed into a single container file, distinguished by `(offset, length)` within

		FP["filePathsToBeDeleted<br/>(existing)"]
		BP["blobFilesToDelete<br/>(new)"]


		### Schema Changes: HoodieCleanMetadata

		A new nullable field `blobCleanStats` of type `HoodieBlobCleanStats`:


		---

		## Abstract

Conversation

voonhous commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe the issue this Pull Request addresses

Summary and Changelog

Impact

Risk Level

Documentation Update

Contributor's checklist

Uh oh!

rahil-c Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

voonhous Mar 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

voonhous Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vinothchandar left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

voonhous Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

voonhous commented Mar 20, 2026 •

edited

Loading

rahil-c Mar 20, 2026 •

edited

Loading

voonhous Mar 23, 2026 •

edited

Loading

voonhous Mar 30, 2026 •

edited

Loading

voonhous Apr 2, 2026 •

edited

Loading