Skip to content

feat(blob): Supplementing RFC-100 with blob cleaner design#18359

Open
voonhous wants to merge 6 commits into
apache:masterfrom
voonhous:rfc-blob-cleaner
Open

feat(blob): Supplementing RFC-100 with blob cleaner design#18359
voonhous wants to merge 6 commits into
apache:masterfrom
voonhous:rfc-blob-cleaner

Conversation

@voonhous
Copy link
Copy Markdown
Member

@voonhous voonhous commented Mar 20, 2026

Describe the issue this Pull Request addresses

Adding blob-cleaner design for RFC-100. The design does not introduce any new architecture. Instead, it leverages on already existing design and extends it.

For example, out-of-line/external blob cleaning is implemented via extending existing cleaner and uses secondary + record index for identifying blob references that needs to be cleaned. This avoids full table scans, and scalability of O(C * R_avg) where C is the number of candidate files to scan and R_avg average number of references per candidate path.

External blob cleaning has a complexity of:

  • worse case: O(C * R_avg) (All recordkeys belong in cleaned_fg)
  • best case: O(C) (First recordkey belongs in a non-cleaned FG)

Summary and Changelog

  • Added concrete design and considerations for external blob cleaner
    • Considers 2 design specification as dictated by RFC-100 (managed blob paths and blob containers)
    • Adds an additional handling for external blobs not managed by Hudi

Impact

None - RFC design.

Risk Level

None - RFC design.

Documentation Update

None, RFC design.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

|-------------|----------------------|----------------------------------------------------------------------------------|----------------------------------------|
| **Stage 1** | Per-file-group | Collect expired/retained blob refs, compute set difference, dispatch by category | Always (for blob tables) |
| **Stage 2** | Cross-file-group | Verify external blob candidates against MDT secondary index or fallback scan | Only when external candidates exist |
| **Stage 3** | Container resolution | Determine delete vs. flag-for-compaction at the container level | Only when container blobs are involved |
Copy link
Copy Markdown
Collaborator

@rahil-c rahil-c Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@voonhous forgive me if this is a beginner question but what does "container" blob mean?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically a set of blobs.

{Blob1, blob2, blob3, ...}, it's addressable by offsets

Copy link
Copy Markdown
Member Author

@voonhous voonhous Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's basically what is described in the rfc-100. I'm just extending the concepts there. If it's not required, we can nuke it.

It's referenced here:
https://github.com/apache/hudi/blob/master/rfc/rfc-100/rfc-100.md#milestone-3-dynamic-inlineout-of-line-storage

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, recording for the Nth time that we don't have any alignment on this :)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, will delete this term

@apache apache deleted a comment from hudi-bot Mar 20, 2026
they are co-equal paths with different properties, different volumes, and different cleanup costs.

**Flow 1: Path-dispatched (Hudi-created blobs).** Blobs created by Hudi's write path and stored
under `{table}/.hoodie/blobs/{partition}/{col}/{instant}/{blob_id}`. The path structure guarantees
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need a /{instant} or would it suffice to have this the blob_id also have an attached instance time, something like {blob_id_instant}?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that having a folder = 1 less list file operation, where we need to list, then filter.

Haven't really verified this. My main concern is trying to keep everything as organised as possible and if blob file count can be in the order of millions, I felt that having it under an instant folder made sense to keep everything easier to find.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, this is neither an out-of-line blob that is managed somewhere else externally by the user, nor is it an internal inline blob where it's stored right within the base or log files, right?

Are we assuming we are going to store blobs separately within the Hudi table path like this? I don't think we had alignment on this. To define anything like this, we should also not talk about other table services concretely. I'm going to simply review this with a narrow scope of external blobs and reference cleaning.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an extension of the original RFC-100 here:
https://github.com/apache/hudi/blob/master/rfc/rfc-100/rfc-100.md?plain=1#L168-L171

We've established that we had no alignment on things, let's remove this from scope.


### C4: Container files

Multiple blobs can be packed into a single container file, distinguished by `(offset, length)` within
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you share with me an example to help better understand this?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, our focus now is external blobs.

This might be out of scope for this RFC since we're only focusing on external blobs. I added this because of this:

https://github.com/apache/hudi/blob/master/rfc/rfc-100/rfc-100.md?plain=1#L207-L208

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is all again not very well flushed out. Let's ignore all this, or even honestly remove it from this doc. Can this MD file simply focus on the external blob cleaner problem?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed!

Comment on lines +185 to +186
FP["filePathsToBeDeleted<br/>(existing)"]
BP["blobFilesToDelete<br/>(new)"]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We decided in another PR that this is too large to write to the plan. Has that changed?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes...

In this design, the plan IS the isolation boundary. All blob delete decisions must be made at plan time against the same snapshot used for file slice decisions.

Keeping blob deletes out of the plan makes it impossible to add writer-cleaner conflict resolution for external blobs (Scenario B from your problem doc). That's a correctness hole I don't think I can close later without putting the information on the timeline.

blobFilesToDelete being too long and large is a real concern. If this is a concern, maybe we can introduce a config to cap the number of blobs to delete in each iteration?

A config like hoodie.cleaner.blob.max.deletes.per.cycle that bounds how many blob deletes go into the plan per iteration, deferring the rest to the next cycle, but this might mean a clean is now broken up into multiple clean operations... Which might fundamentally change timeline assumptions.

Will add an example into rfc-100-*-problems under Example 9 to explain this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot write this into the plan. That hasn't changed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a hard blocker. We've written stuff like this before, like tracking file stats in the commit data itself, and it had very bad consequences.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understand. Doing this is akin to manifest and loading or serializing everything into memory might take a very long time or OOM.

The alternative for this is we store these in a separate MDT partition using a storage layout that is optimized for storing large volumes of data.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: Figure out an alternative to this while ensuring C11.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated design doc such that blob delete list is no longer stored in the HoodieCleanerPlan.

Instead, it's written to a sidecar Parquet file at .hoodie/.aux/clean/<instant>.blob_deletes.parquet, referenced from the plan via the existing extraMetadata map (extraMetadata["blobDeletesPath"]) so that no new Avro fields are added to the plan schema.

This way, we address the antipattern issue, there will also be no bloat regardless of blob candidate count, while keeping the delete list as a durable artifact on storage for writer conflict resolution (as illustrated in Example 9 in the problem doc for why this is required).

The full design is in the "Sidecar Parquet File: Blob Delete List" section under "Integration with Existing Cleaner", covering:

  • Write sequence (sidecar written before plan, atomicity guarantee)
  • Lifecycle (lives until archival, bounded count)
  • Cleanup (orphan cleanup at startup, rollback cleanup, archival cleanup)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 I think the sidecar approach in this RFC is actually a good compromise here — the blob delete list is effectively "in the plan" for conflict resolution purposes (the plan's extraMetadata points to the sidecar, so writers can find it), but the actual data lives in a separate Parquet file to avoid plan bloat. Example 9 in the problem doc makes a compelling case that without a durable artifact on the timeline, writer-side conflict checks have no way to detect blob-level conflicts. The sidecar gives you both: durability for correctness, and separation for size.

// Key format: escaped(external_path)$escaped(record_key)
// Returns ALL record keys that reference each candidate path
path_to_record_keys = mdtMetadata.readSecondaryIndexDataTableRecordKeysWithKeys(
HoodieListData.eager(candidate_paths), indexPartitionName)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The HoodieData here is always referencing HoodieListData but can we use the RDD backed HoodieData when running on spark?

Copy link
Copy Markdown
Member Author

@voonhous voonhous Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeap, it's better to to that.
We should be using engine-distributed HoodieData (RDD on Spark, etc.) to keep this data partitioned across executors rather than materializing it all on the driver, esp if there are many blobs to delete.

readSecondaryIndexDataTableRecordKeysWithKeys() accepts generic HoodieData<String>, but V1 has a hard instanceof HoodieListData check that throws if you pass anything else.

V2 has no such restriction.

We have to use V2 interface/API then.

Good catch, updating the RFC!

- Add Example 9 to problem doc: demonstrates why blobFilesToDelete must be in the plan (writer-cleaner conflict resolution for external blobs)
- Use engine-context HoodieData instead of HoodieListData.eager() for Stage 2 MDT lookups to distribute large candidate sets across executors
- Update memory budget to reflect distributed Stage 2 processing
expected majority flow for Phase 3 workloads.

**Flow 2: Non-path-dispatched (user-provided external blobs).** Users have existing blob files in
external storage (e.g., `s3://media-bucket/videos/`, a shared NFS mount, or any user-controlled
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this AI generated? does NFS mount even work with our stack :)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used AI to help concretise a few things. I think it interpreted HDFS mount as NFS here. I might have a typo here causing the interpretation error. Let me fix it.

external storage (e.g., `s3://media-bucket/videos/`, a shared NFS mount, or any user-controlled
path). Records reference these blobs directly by path. The user does **not** want to bootstrap --
they do not want Hudi to copy, move, or reorganize the blob files into `.hoodie/blobs/`. Hudi
manages the *references*, not the *storage layout*. This is the expected primary flow for Phase 1
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This clarifies my earlier comment. I think we are aligned. Let's just focus on the phase one for now.

Copy link
Copy Markdown
Member

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@voonhous thanks for the detailed write-up. I left some comments, mostly on the problem section.

Can we add a simple high-level overview of the design section and talk about how it meets all the different problems that you laid out in the other doc? Also, please scope this down to simply external blobs. Let's deal with the container files later on.

I can take another pass at it, once this is simplified.

manages the *references*, not the *storage layout*. This is the expected primary flow for Phase 1
workloads and remains a supported flow in Phase 3.

The non-path-dispatched flow has fundamentally different properties:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let us call these external blobs. Not add new names like non-dispatched flow.

Copy link
Copy Markdown
Member Author

@voonhous voonhous Apr 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, named it external out-of-line blobs. When we refer to these 2 terms, they will mean the same thing and they are interchangeable.

|---------------------------|-----------------------------------|--------------------------------------|
| Path uniqueness | Guaranteed (instant in path, C11) | Not guaranteed (user controls) |
| Cross-FG sharing | Does not occur (FG-scoped) | Common (multiple records, same blob) |
| Writer/cleaner race | Cannot occur (D2) | Can occur (D3) |
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, what are D2, D3, these things? What do they refer to?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • D2: Writer-cleaner race cannot occur for Hudi-created blobs (structural proof)
  • D3: Writer-cleaner race can occur for external blobs (requires conflict check)

Too many things ongoing when writing this doc, forgot to add them in.


- **Inline blobs.** Inline blob data lives inside the base/log file and is deleted when the file
slice is cleaned. No additional cleanup needed.
- **Blob compaction internals.** Blob compaction (repacking partially-live container files) is a
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the same reason, I do not want to do any investments or anything around the first flow yet. path-dispatched.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, no blob compaction as this is container files specific, it will be removed from the RFC. We will be fully focusing on external out-of-line-blobs.

### Stance on the `managed` flag

The BlobReference schema includes a `managed` boolean field
(`HoodieSchema.Blob.EXTERNAL_REFERENCE_IS_MANAGED`). The RFC states that only managed blobs are
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In docs, let's refrain from referencing Java classes directly. It gives a false impression that we are designing everything for Java. Instead, let's please refer to the Avro schemas for these logical types.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed and rephrased to:

The `BlobReference` schema includes a `managed` boolean field (`reference.managed`).


The BlobReference schema includes a `managed` boolean field
(`HoodieSchema.Blob.EXTERNAL_REFERENCE_IS_MANAGED`). The RFC states that only managed blobs are
cleaned. This document acknowledges the flag and treats it as a **filter** -- unmanaged blobs are
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An interesting corner case is whether external blobs are allowed to go from managed to unmanaged across multiple writes. I think we should clarify the behavior on this.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.
Managed -> unmanaged with the same external_path, we will leave the blob untouched.
Unmanaged -> Managed with the same external_path, we will start tracking the blob, if it expires after t2, we will delete it.


## Approvers

- (TBD)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add me, Ethan, and Rahil.


Blob cleanup extends the existing `CleanPlanner` / `CleanActionExecutor` pipeline -- same timeline
instant, same plan-execute-complete lifecycle, same crash recovery and OCC integration. A
`hasBlobColumns()` check gates all blob logic so non-blob tables pay zero cost.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, there is cost to resolve the schema, which the cleaner does not have to do today, but it's small.

|-------------|----------------------|----------------------------------------------------------------------------------|----------------------------------------|
| **Stage 1** | Per-file-group | Collect expired/retained blob refs, compute set difference, dispatch by category | Always (for blob tables) |
| **Stage 2** | Cross-file-group | Verify external blob candidates against MDT secondary index or fallback scan | Only when external candidates exist |
| **Stage 3** | Container resolution | Determine delete vs. flag-for-compaction at the container level | Only when container blobs are involved |
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, recording for the Nth time that we don't have any alignment on this :)

Comment on lines +185 to +186
FP["filePathsToBeDeleted<br/>(existing)"]
BP["blobFilesToDelete<br/>(new)"]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot write this into the plan. That hasn't changed.

Comment on lines +185 to +186
FP["filePathsToBeDeleted<br/>(existing)"]
BP["blobFilesToDelete<br/>(new)"]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a hard blocker. We've written stuff like this before, like tracking file stats in the commit data itself, and it had very bad consequences.

@hudi-bot
Copy link
Copy Markdown
Collaborator

hudi-bot commented Apr 6, 2026

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the RFC — the problem statement is thorough and the two-stage pipeline design is well-motivated. The constraint catalog, failure mode examples, and cost model are particularly strong; they show real rigor in thinking through edge cases. A few areas could use more detail: the writer-side conflict check integration with ConcurrentOperation/TransactionUtils (which currently throws on CLEAN_ACTION), the sidecar lifecycle hook into archival, and the I/O cost of reading log files in Stage 1. See inline comments for specifics.

├── Transition to INFLIGHT
├── Read sidecar Parquet (blob delete list)
├── Delete file slices (existing, parallelized)
├── Delete blob files (new, parallelized) // parallel with file slice deletes
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 The writer-side conflict check requires extending ConcurrentOperation.init() and TransactionUtils.getInflightAndRequestedInstants() to handle CLEAN_ACTION, which currently throws IllegalArgumentException for clean actions. This is a non-trivial change to a sensitive code path. Could you sketch out what ConcurrentOperation would look like for a clean action? For instance, what mutatedFileIds would it report — none? And does ConflictResolutionStrategy.hasConflict() need a new code path, or can it reuse the existing partition/fileId overlap logic?

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.


rect rgb(255, 245, 230)
Note right of TL: Crash here → re-read sidecar,<br/>re-delete (FileNotFound = success)
end
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 In the writer-side conflict check, the writer reads sidecar Parquet files for REQUESTED, INFLIGHT, and COMPLETED clean instants. For COMPLETED instants, it reads blobDeletesSidecarPath from HoodieCleanMetadata. How many COMPLETED clean instants could accumulate before archival? If the table runs frequent clean cycles, the writer could end up reading many sidecars on every commit. Is there a bound on this, or should the design discuss amortization?

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Note right of C: Globally orphaned → DELETE
end
end
end
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 The design mentions HoodieWriteStat.externalBlobPaths as an in-memory collection with "no additional I/O." How does the write path populate this? Does the writer need to scan the blob reference column of every record it writes, or is this extracted during the normal write flow? For large writes, even in-memory collection could be significant.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.


### Schema Changes: HoodieCleanMetadata

A new nullable field `blobCleanStats` of type `HoodieBlobCleanStats`:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 The sidecar lifecycle says it lives until the clean instant is archived, and archival cleanup deletes it. But archival today doesn't know about sidecar files — it archives timeline instants without touching .aux/. Could you clarify where this cleanup hook would be added? Is it in the archival action itself, or a separate periodic cleanup?

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.


Stage 2 verifies each local orphan candidate against the global state to determine if the blob is
still referenced by any active file slice outside the cleaned file groups. This is necessary because
external blobs can be shared across file groups (C3, C11).
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Stage 1 reads log files for expired slices ("full record read") to catch transient blob refs. For large log files, this could be expensive. Is there an estimate of the I/O cost for this? The back-of-envelope section estimates "~6 reads per FG" but log file reads are significantly more expensive than columnar projection on base files.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.


| Crash point | Recovery |
|-----------------------------------------------|--------------------------------------------------------------------------------------------------------|
| After sidecar written, before plan persisted | No REQUESTED instant on timeline. Cleaner starts fresh. Orphaned sidecar cleaned at next startup. |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 The memory budget section estimates ~100MB peak per FG, but then says FGs are processed sequentially within each partition batch while cleanerParallelism batches run in parallel. This means peak memory is 100MB * cleanerParallelism. With default parallelism, this might be fine, but it's worth noting that this is a new memory pressure the cleaner didn't have before. Have you considered a streaming/spill-to-disk approach for very large file groups?

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

### Memory Budget

Per-FG blob ref sets: ~100MB peak (500K records * 100 bytes/ref for expired + retained). FGs are
processed sequentially within each partition batch -- per-FG sets are computed and discarded, not
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 The rollout plan mentions "Stage 1" and "Stage 2" both as pipeline stages in the algorithm AND as rollout phases. This overloaded terminology could cause confusion. Consider calling the rollout phases something like "Phase A: Foundation", "Phase B: Per-FG cleanup", etc. to distinguish them from the algorithm's Stage 1/Stage 2.

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.


**Writer-side conflict check.** `preCommit()` conflict check reads sidecar for concurrency safety.
Requires extending `ConcurrentOperation` / `TransactionUtils` to include clean actions when blob
columns exist.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 The RFC doesn't discuss what happens if the MDT secondary index becomes stale or inconsistent (e.g., after a failed MDT update). If Stage 2 relies on the secondary index for correctness, a stale index could lead to premature deletion. Is the safety check (verifying index is fully built) sufficient, or should the design also handle partial staleness?

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

by reading the expired log files. Without log reads on the expired side,
every such transient blob becomes a permanent orphan that accumulates
storage indefinitely.
```
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 The problem statement has excellent failure mode examples and constraints, but Q6 (recovery from premature deletion) is left fully open. Given that premature deletion is the single most critical failure, it might be worth the design doc addressing this — even if the answer is "no recovery; rely on correctness guarantees." Does the design doc intentionally leave this unanswered?

- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Comment on lines +185 to +186
FP["filePathsToBeDeleted<br/>(existing)"]
BP["blobFilesToDelete<br/>(new)"]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 I think the sidecar approach in this RFC is actually a good compromise here — the blob delete list is effectively "in the plan" for conflict resolution purposes (the plan's extraMetadata points to the sidecar, so writers can find it), but the actual data lives in a separate Parquet file to avoid plan bloat. Example 9 in the problem doc makes a compelling case that without a durable artifact on the timeline, writer-side conflict checks have no way to detect blob-level conflicts. The sidecar gives you both: durability for correctness, and separation for size.

Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Greptile Summary: This PR adds two new RFC documents for RFC-100 Part 2, describing the design for cleaning orphaned external blob files when Hudi expires old file slices. The two files are:

  • rfc-100-blob-cleaner-design.md: A detailed solution design covering the two-stage cleanup pipeline, MDT secondary index integration, sidecar Parquet lifecycle, writer-cleaner conflict check, crash recovery, configuration, and a test plan.
  • rfc-100-blob-cleaner-problem.md: A companion problem statement defining the 11 design constraints (C1–C11), 9 requirements (R1–R9), 7 illustrative failure modes, and 9 open questions.

The design is thorough and well-structured. The two-stage pipeline (per-FG set-difference → cross-FG MDT index verification) is a sound approach to the cross-file-group sharing problem. Several issues worth addressing:

  • Missing Example 8 in the problem document — examples jump from 7 to 9, suggesting a numbering error or omitted scenario.
  • Example 9 introduces solution content directly contradicting the problem doc's own stated scope ("It contains no solution content"). It compares two implementation approaches and advocates for one.
  • Circuit breaker in the fallback scan path may create permanent orphans — when candidates exceed the threshold, Stage 2 is skipped but file slices are still deleted that same cycle. In the next cycle those expired file slices are already gone, so the orphan blob candidates can never be re-derived (R2 violation under sustained absence of MDT secondary index).
  • managed flag coverage in Stage 2 is implicit, not stated — Stage 1 filters retained refs to managed == true only, but the problem doc requires that any retained reference (regardless of the managed flag) must prevent deletion. Stage 2's MDT index lookup must cover unmanaged references for correctness during managed→unmanaged transitions, but the design does not explicitly state the secondary index scope.
  • The placeholder Issue: <Link to GH feature issue> remains unfilled in the design doc header.

Greptile Confidence Score: 4/5
Safe to merge as an RFC draft after addressing the two P1 design concerns; no production code is changed.

Both files are documentation-only RFC additions with no code changes. The design is thorough and well-reasoned. Two P1 issues exist in the design itself: the circuit-breaker path can create permanent orphan blobs (R2 violation) and the managed→unmanaged flag transition handling relies on an implicit assumption about secondary index scope that is never stated. These should be resolved before implementation begins, but they don't block merging the RFC draft for community review. The missing Example 8 and the solution-content scope violation in the problem doc are P2 polish items.

rfc/rfc-100/rfc-100-blob-cleaner-design.md — circuit breaker and managed-flag sections need clarification; rfc/rfc-100/rfc-100-blob-cleaner-problem.md — Example 8 gap and Example 9 scope issue.

Sequence Diagram (Greptile):

sequenceDiagram
    participant CP as CleanPlanActionExecutor
    participant S1 as Stage 1 (per-FG)
    participant S2 as Stage 2 (cross-FG)
    participant MDT as MDT Secondary Index
    participant AUX as Sidecar (.hoodie/.aux/clean/)
    participant TL as Timeline
    participant CE as CleanActionExecutor
    participant W as Writer (preCommit)

    Note over CP: requestClean()
    CP->>CP: hasBlobColumns()?
    alt No blob columns
        CP-->>TL: Plan (no blob work)
    else Has blob columns
        CP->>S1: For each expired/retained FG
        S1-->>CP: all_local_orphans (set-difference)

        alt local orphans exist
            alt MDT secondary index available
                CP->>S2: Verify candidates
                S2->>MDT: Batch prefix scan (N paths)
                MDT-->>S2: Map<path, List<recordKey>>
                S2->>MDT: readRecordIndexLocations(all keys)
                MDT-->>S2: Map<recordKey, (partition, fileId)>
                S2-->>CP: blob_files_to_delete
            else Fallback scan, few candidates
                CP->>S2: Table scan
                S2-->>CP: blob_files_to_delete
            else Circuit breaker fires
                Note over CP,S2: Stage 2 skipped — blobs deferred but file slices still deleted this cycle
                CP-->>TL: Plan (no blob deletes recorded)
            end
            CP->>AUX: Write sidecar Parquet
            CP->>TL: Persist plan (extraMetadata[blobDeletesPath])
        end
    end

    Note over TL: REQUESTED

    W->>TL: preCommit() — scan REQUESTED/INFLIGHT/COMPLETED clean actions
    W->>AUX: Read sidecar Parquet
    alt Writer blob paths intersect sidecar paths
        W-->>W: HoodieWriteConflictException — retry
    else No overlap
        W->>TL: Commit succeeds
    end

    CE->>TL: Transition to INFLIGHT
    CE->>AUX: Read sidecar Parquet
    par Parallel deletion
        CE->>CE: Delete file slices (existing)
    and
        CE->>CE: Delete blob files (from sidecar)
    end
    CE->>TL: Transition to COMPLETED (blobCleanStats + sidecarPath)
    Note over AUX: Sidecar retained until archival
Loading

Greptile: yihua#44 (review)


| Condition | Path used | Cost | Suitable for |
|-----------------------------|---------------|-----------------------|---------------------------|
| No local orphan candidates | Skip Stage 2 | Zero | No blob work this cycle |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Circuit breaker can produce permanent orphans (R2 violation)

When the circuit breaker fires (candidates > hoodie.cleaner.blob.external.scan.max.candidates), Stage 2 is skipped and no blob deletions are recorded — but the file slice deletions from the existing cleaner still execute within the same clean cycle. Once those expired file slices are deleted, the data that Stage 1 used to derive the orphan candidates is gone. In the next cleaner invocation, Stage 1 iterates expired_slices, but those slices no longer exist on storage. The orphan blob candidates can never be re-derived, making them permanent orphans — a direct violation of R2.

The description says cleanup is "deferred" but that framing is misleading: there is no durable queue of deferred candidates that the next cycle would pick up, and the source-of-truth file slices are deleted.

Possible mitigations to consider and explicitly document:

  • Write deferred blob candidates to a durable artifact (separate from the sidecar, since no plan is written for this case) so the next cycle can pick them up without needing the expired slices.
  • Alternatively, gate file-slice deletion on successful blob planning (i.e., refuse to delete file slices if Stage 2 was circuit-broken), though this changes existing cleaner semantics.
  • Or, explicitly document this as a known limitation: blobs in tables that permanently lack an MDT secondary index and exceed the threshold are permanently orphaned, and the operator must either enable the index or accept the leak.

Greptile (original) (source:comment#3082984560)

local_orphans = expired_refs - retained_refs

// All local orphans proceed to Stage 2 for cross-FG verification
all_local_orphans.addAll(local_orphans)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 managed flag scope in Stage 2 is unspecified — managed→unmanaged transition may not be handled correctly

Stage 1 filters retained refs with ref.managed == true, so a retained slice whose managed flag transitioned to false for a given external_path contributes nothing to retained_refs. That path therefore becomes a local orphan candidate and proceeds to Stage 2.

The problem statement (rfc-100-blob-cleaner-problem.md, Section 2 — Managed flag discussion) explicitly requires: "If any retained reference to the same external_path exists (regardless of the managed flag), the blob must not be deleted." Stage 2 correctness therefore depends on the MDT secondary index indexing all external_path values regardless of the managed flag. If the secondary index writer applies the same managed == true filter, Stage 2 would miss the unmanaged retained reference, classify the blob as globally orphaned, and delete a blob that is still referenced — a violation of R1.

The design doc should explicitly state:

  1. Whether the secondary index on <blob_col>.reference.external_path indexes entries for managed=false records, and
  2. If it does not, how managed→unmanaged transitions are handled to satisfy R1 (e.g., the fallback table scan as a safety net, or a separate retained-refs enrichment pass).

Greptile (original) (source:comment#3082984636)


---

## Abstract
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Unfilled issue link placeholder

The Issue field still contains the template placeholder text. This should be filled in with the corresponding GitHub issue link before the RFC is finalized.

Suggested change
## Abstract
Issue: <!-- Link to GH feature issue -->

Greptile (original) (source:comment#3082985257)

@voonhous voonhous added this to the release-1.2.0 milestone May 13, 2026
Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The blob cleaning design and implementation are not part of Hudi 1.2.0 release. Removing this PR from release-1.2.0 milestone.

@yihua yihua removed this from the release-1.2.0 milestone May 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:XL PR with lines of changes > 1000

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants