feat(blob): Supplementing RFC-100 with blob cleaner design#18359
feat(blob): Supplementing RFC-100 with blob cleaner design#18359voonhous wants to merge 6 commits into
Conversation
0882eee to
3dd15e0
Compare
| |-------------|----------------------|----------------------------------------------------------------------------------|----------------------------------------| | ||
| | **Stage 1** | Per-file-group | Collect expired/retained blob refs, compute set difference, dispatch by category | Always (for blob tables) | | ||
| | **Stage 2** | Cross-file-group | Verify external blob candidates against MDT secondary index or fallback scan | Only when external candidates exist | | ||
| | **Stage 3** | Container resolution | Determine delete vs. flag-for-compaction at the container level | Only when container blobs are involved | |
There was a problem hiding this comment.
@voonhous forgive me if this is a beginner question but what does "container" blob mean?
There was a problem hiding this comment.
Basically a set of blobs.
{Blob1, blob2, blob3, ...}, it's addressable by offsets
There was a problem hiding this comment.
It's basically what is described in the rfc-100. I'm just extending the concepts there. If it's not required, we can nuke it.
It's referenced here:
https://github.com/apache/hudi/blob/master/rfc/rfc-100/rfc-100.md#milestone-3-dynamic-inlineout-of-line-storage
There was a problem hiding this comment.
Again, recording for the Nth time that we don't have any alignment on this :)
There was a problem hiding this comment.
As discussed, will delete this term
| they are co-equal paths with different properties, different volumes, and different cleanup costs. | ||
|
|
||
| **Flow 1: Path-dispatched (Hudi-created blobs).** Blobs created by Hudi's write path and stored | ||
| under `{table}/.hoodie/blobs/{partition}/{col}/{instant}/{blob_id}`. The path structure guarantees |
There was a problem hiding this comment.
Do we really need a /{instant} or would it suffice to have this the blob_id also have an attached instance time, something like {blob_id_instant}?
There was a problem hiding this comment.
My understanding is that having a folder = 1 less list file operation, where we need to list, then filter.
Haven't really verified this. My main concern is trying to keep everything as organised as possible and if blob file count can be in the order of millions, I felt that having it under an instant folder made sense to keep everything easier to find.
There was a problem hiding this comment.
To me, this is neither an out-of-line blob that is managed somewhere else externally by the user, nor is it an internal inline blob where it's stored right within the base or log files, right?
Are we assuming we are going to store blobs separately within the Hudi table path like this? I don't think we had alignment on this. To define anything like this, we should also not talk about other table services concretely. I'm going to simply review this with a narrow scope of external blobs and reference cleaning.
There was a problem hiding this comment.
This is an extension of the original RFC-100 here:
https://github.com/apache/hudi/blob/master/rfc/rfc-100/rfc-100.md?plain=1#L168-L171
We've established that we had no alignment on things, let's remove this from scope.
|
|
||
| ### C4: Container files | ||
|
|
||
| Multiple blobs can be packed into a single container file, distinguished by `(offset, length)` within |
There was a problem hiding this comment.
Can you share with me an example to help better understand this?
There was a problem hiding this comment.
IIUC, our focus now is external blobs.
This might be out of scope for this RFC since we're only focusing on external blobs. I added this because of this:
https://github.com/apache/hudi/blob/master/rfc/rfc-100/rfc-100.md?plain=1#L207-L208
There was a problem hiding this comment.
Yes, this is all again not very well flushed out. Let's ignore all this, or even honestly remove it from this doc. Can this MD file simply focus on the external blob cleaner problem?
| FP["filePathsToBeDeleted<br/>(existing)"] | ||
| BP["blobFilesToDelete<br/>(new)"] |
There was a problem hiding this comment.
We decided in another PR that this is too large to write to the plan. Has that changed?
There was a problem hiding this comment.
Yes...
In this design, the plan IS the isolation boundary. All blob delete decisions must be made at plan time against the same snapshot used for file slice decisions.
Keeping blob deletes out of the plan makes it impossible to add writer-cleaner conflict resolution for external blobs (Scenario B from your problem doc). That's a correctness hole I don't think I can close later without putting the information on the timeline.
blobFilesToDelete being too long and large is a real concern. If this is a concern, maybe we can introduce a config to cap the number of blobs to delete in each iteration?
A config like hoodie.cleaner.blob.max.deletes.per.cycle that bounds how many blob deletes go into the plan per iteration, deferring the rest to the next cycle, but this might mean a clean is now broken up into multiple clean operations... Which might fundamentally change timeline assumptions.
Will add an example into rfc-100-*-problems under Example 9 to explain this.
There was a problem hiding this comment.
We cannot write this into the plan. That hasn't changed.
There was a problem hiding this comment.
It's a hard blocker. We've written stuff like this before, like tracking file stats in the commit data itself, and it had very bad consequences.
There was a problem hiding this comment.
Understand. Doing this is akin to manifest and loading or serializing everything into memory might take a very long time or OOM.
The alternative for this is we store these in a separate MDT partition using a storage layout that is optimized for storing large volumes of data.
There was a problem hiding this comment.
TODO: Figure out an alternative to this while ensuring C11.
There was a problem hiding this comment.
Updated design doc such that blob delete list is no longer stored in the HoodieCleanerPlan.
Instead, it's written to a sidecar Parquet file at .hoodie/.aux/clean/<instant>.blob_deletes.parquet, referenced from the plan via the existing extraMetadata map (extraMetadata["blobDeletesPath"]) so that no new Avro fields are added to the plan schema.
This way, we address the antipattern issue, there will also be no bloat regardless of blob candidate count, while keeping the delete list as a durable artifact on storage for writer conflict resolution (as illustrated in Example 9 in the problem doc for why this is required).
The full design is in the "Sidecar Parquet File: Blob Delete List" section under "Integration with Existing Cleaner", covering:
- Write sequence (sidecar written before plan, atomicity guarantee)
- Lifecycle (lives until archival, bounded count)
- Cleanup (orphan cleanup at startup, rollback cleanup, archival cleanup)
There was a problem hiding this comment.
🤖 I think the sidecar approach in this RFC is actually a good compromise here — the blob delete list is effectively "in the plan" for conflict resolution purposes (the plan's extraMetadata points to the sidecar, so writers can find it), but the actual data lives in a separate Parquet file to avoid plan bloat. Example 9 in the problem doc makes a compelling case that without a durable artifact on the timeline, writer-side conflict checks have no way to detect blob-level conflicts. The sidecar gives you both: durability for correctness, and separation for size.
| // Key format: escaped(external_path)$escaped(record_key) | ||
| // Returns ALL record keys that reference each candidate path | ||
| path_to_record_keys = mdtMetadata.readSecondaryIndexDataTableRecordKeysWithKeys( | ||
| HoodieListData.eager(candidate_paths), indexPartitionName) |
There was a problem hiding this comment.
The HoodieData here is always referencing HoodieListData but can we use the RDD backed HoodieData when running on spark?
There was a problem hiding this comment.
Yeap, it's better to to that.
We should be using engine-distributed HoodieData (RDD on Spark, etc.) to keep this data partitioned across executors rather than materializing it all on the driver, esp if there are many blobs to delete.
readSecondaryIndexDataTableRecordKeysWithKeys() accepts generic HoodieData<String>, but V1 has a hard instanceof HoodieListData check that throws if you pass anything else.
V2 has no such restriction.
We have to use V2 interface/API then.
Good catch, updating the RFC!
- Add Example 9 to problem doc: demonstrates why blobFilesToDelete must be in the plan (writer-cleaner conflict resolution for external blobs) - Use engine-context HoodieData instead of HoodieListData.eager() for Stage 2 MDT lookups to distribute large candidate sets across executors - Update memory budget to reflect distributed Stage 2 processing
| expected majority flow for Phase 3 workloads. | ||
|
|
||
| **Flow 2: Non-path-dispatched (user-provided external blobs).** Users have existing blob files in | ||
| external storage (e.g., `s3://media-bucket/videos/`, a shared NFS mount, or any user-controlled |
There was a problem hiding this comment.
is this AI generated? does NFS mount even work with our stack :)
There was a problem hiding this comment.
Used AI to help concretise a few things. I think it interpreted HDFS mount as NFS here. I might have a typo here causing the interpretation error. Let me fix it.
| external storage (e.g., `s3://media-bucket/videos/`, a shared NFS mount, or any user-controlled | ||
| path). Records reference these blobs directly by path. The user does **not** want to bootstrap -- | ||
| they do not want Hudi to copy, move, or reorganize the blob files into `.hoodie/blobs/`. Hudi | ||
| manages the *references*, not the *storage layout*. This is the expected primary flow for Phase 1 |
There was a problem hiding this comment.
This clarifies my earlier comment. I think we are aligned. Let's just focus on the phase one for now.
vinothchandar
left a comment
There was a problem hiding this comment.
@voonhous thanks for the detailed write-up. I left some comments, mostly on the problem section.
Can we add a simple high-level overview of the design section and talk about how it meets all the different problems that you laid out in the other doc? Also, please scope this down to simply external blobs. Let's deal with the container files later on.
I can take another pass at it, once this is simplified.
| manages the *references*, not the *storage layout*. This is the expected primary flow for Phase 1 | ||
| workloads and remains a supported flow in Phase 3. | ||
|
|
||
| The non-path-dispatched flow has fundamentally different properties: |
There was a problem hiding this comment.
Let us call these external blobs. Not add new names like non-dispatched flow.
There was a problem hiding this comment.
Done, named it external out-of-line blobs. When we refer to these 2 terms, they will mean the same thing and they are interchangeable.
| |---------------------------|-----------------------------------|--------------------------------------| | ||
| | Path uniqueness | Guaranteed (instant in path, C11) | Not guaranteed (user controls) | | ||
| | Cross-FG sharing | Does not occur (FG-scoped) | Common (multiple records, same blob) | | ||
| | Writer/cleaner race | Cannot occur (D2) | Can occur (D3) | |
There was a problem hiding this comment.
Sorry, what are D2, D3, these things? What do they refer to?
There was a problem hiding this comment.
- D2: Writer-cleaner race cannot occur for Hudi-created blobs (structural proof)
- D3: Writer-cleaner race can occur for external blobs (requires conflict check)
Too many things ongoing when writing this doc, forgot to add them in.
|
|
||
| - **Inline blobs.** Inline blob data lives inside the base/log file and is deleted when the file | ||
| slice is cleaned. No additional cleanup needed. | ||
| - **Blob compaction internals.** Blob compaction (repacking partially-live container files) is a |
There was a problem hiding this comment.
For the same reason, I do not want to do any investments or anything around the first flow yet. path-dispatched.
There was a problem hiding this comment.
As discussed, no blob compaction as this is container files specific, it will be removed from the RFC. We will be fully focusing on external out-of-line-blobs.
| ### Stance on the `managed` flag | ||
|
|
||
| The BlobReference schema includes a `managed` boolean field | ||
| (`HoodieSchema.Blob.EXTERNAL_REFERENCE_IS_MANAGED`). The RFC states that only managed blobs are |
There was a problem hiding this comment.
In docs, let's refrain from referencing Java classes directly. It gives a false impression that we are designing everything for Java. Instead, let's please refer to the Avro schemas for these logical types.
There was a problem hiding this comment.
Removed and rephrased to:
The `BlobReference` schema includes a `managed` boolean field (`reference.managed`).
|
|
||
| The BlobReference schema includes a `managed` boolean field | ||
| (`HoodieSchema.Blob.EXTERNAL_REFERENCE_IS_MANAGED`). The RFC states that only managed blobs are | ||
| cleaned. This document acknowledges the flag and treats it as a **filter** -- unmanaged blobs are |
There was a problem hiding this comment.
An interesting corner case is whether external blobs are allowed to go from managed to unmanaged across multiple writes. I think we should clarify the behavior on this.
There was a problem hiding this comment.
Added.
Managed -> unmanaged with the same external_path, we will leave the blob untouched.
Unmanaged -> Managed with the same external_path, we will start tracking the blob, if it expires after t2, we will delete it.
|
|
||
| ## Approvers | ||
|
|
||
| - (TBD) |
There was a problem hiding this comment.
Please add me, Ethan, and Rahil.
|
|
||
| Blob cleanup extends the existing `CleanPlanner` / `CleanActionExecutor` pipeline -- same timeline | ||
| instant, same plan-execute-complete lifecycle, same crash recovery and OCC integration. A | ||
| `hasBlobColumns()` check gates all blob logic so non-blob tables pay zero cost. |
There was a problem hiding this comment.
Technically, there is cost to resolve the schema, which the cleaner does not have to do today, but it's small.
| |-------------|----------------------|----------------------------------------------------------------------------------|----------------------------------------| | ||
| | **Stage 1** | Per-file-group | Collect expired/retained blob refs, compute set difference, dispatch by category | Always (for blob tables) | | ||
| | **Stage 2** | Cross-file-group | Verify external blob candidates against MDT secondary index or fallback scan | Only when external candidates exist | | ||
| | **Stage 3** | Container resolution | Determine delete vs. flag-for-compaction at the container level | Only when container blobs are involved | |
There was a problem hiding this comment.
Again, recording for the Nth time that we don't have any alignment on this :)
| FP["filePathsToBeDeleted<br/>(existing)"] | ||
| BP["blobFilesToDelete<br/>(new)"] |
There was a problem hiding this comment.
We cannot write this into the plan. That hasn't changed.
| FP["filePathsToBeDeleted<br/>(existing)"] | ||
| BP["blobFilesToDelete<br/>(new)"] |
There was a problem hiding this comment.
It's a hard blocker. We've written stuff like this before, like tracking file stats in the commit data itself, and it had very bad consequences.
yihua
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the RFC — the problem statement is thorough and the two-stage pipeline design is well-motivated. The constraint catalog, failure mode examples, and cost model are particularly strong; they show real rigor in thinking through edge cases. A few areas could use more detail: the writer-side conflict check integration with ConcurrentOperation/TransactionUtils (which currently throws on CLEAN_ACTION), the sidecar lifecycle hook into archival, and the I/O cost of reading log files in Stage 1. See inline comments for specifics.
| ├── Transition to INFLIGHT | ||
| ├── Read sidecar Parquet (blob delete list) | ||
| ├── Delete file slices (existing, parallelized) | ||
| ├── Delete blob files (new, parallelized) // parallel with file slice deletes |
There was a problem hiding this comment.
🤖 The writer-side conflict check requires extending ConcurrentOperation.init() and TransactionUtils.getInflightAndRequestedInstants() to handle CLEAN_ACTION, which currently throws IllegalArgumentException for clean actions. This is a non-trivial change to a sensitive code path. Could you sketch out what ConcurrentOperation would look like for a clean action? For instance, what mutatedFileIds would it report — none? And does ConflictResolutionStrategy.hasConflict() need a new code path, or can it reuse the existing partition/fileId overlap logic?
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
|
|
||
| rect rgb(255, 245, 230) | ||
| Note right of TL: Crash here → re-read sidecar,<br/>re-delete (FileNotFound = success) | ||
| end |
There was a problem hiding this comment.
🤖 In the writer-side conflict check, the writer reads sidecar Parquet files for REQUESTED, INFLIGHT, and COMPLETED clean instants. For COMPLETED instants, it reads blobDeletesSidecarPath from HoodieCleanMetadata. How many COMPLETED clean instants could accumulate before archival? If the table runs frequent clean cycles, the writer could end up reading many sidecars on every commit. Is there a bound on this, or should the design discuss amortization?
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
| Note right of C: Globally orphaned → DELETE | ||
| end | ||
| end | ||
| end |
There was a problem hiding this comment.
🤖 The design mentions HoodieWriteStat.externalBlobPaths as an in-memory collection with "no additional I/O." How does the write path populate this? Does the writer need to scan the blob reference column of every record it writes, or is this extracted during the normal write flow? For large writes, even in-memory collection could be significant.
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
|
|
||
| ### Schema Changes: HoodieCleanMetadata | ||
|
|
||
| A new nullable field `blobCleanStats` of type `HoodieBlobCleanStats`: |
There was a problem hiding this comment.
🤖 The sidecar lifecycle says it lives until the clean instant is archived, and archival cleanup deletes it. But archival today doesn't know about sidecar files — it archives timeline instants without touching .aux/. Could you clarify where this cleanup hook would be added? Is it in the archival action itself, or a separate periodic cleanup?
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
|
|
||
| Stage 2 verifies each local orphan candidate against the global state to determine if the blob is | ||
| still referenced by any active file slice outside the cleaned file groups. This is necessary because | ||
| external blobs can be shared across file groups (C3, C11). |
There was a problem hiding this comment.
🤖 Stage 1 reads log files for expired slices ("full record read") to catch transient blob refs. For large log files, this could be expensive. Is there an estimate of the I/O cost for this? The back-of-envelope section estimates "~6 reads per FG" but log file reads are significantly more expensive than columnar projection on base files.
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
|
|
||
| | Crash point | Recovery | | ||
| |-----------------------------------------------|--------------------------------------------------------------------------------------------------------| | ||
| | After sidecar written, before plan persisted | No REQUESTED instant on timeline. Cleaner starts fresh. Orphaned sidecar cleaned at next startup. | |
There was a problem hiding this comment.
🤖 The memory budget section estimates ~100MB peak per FG, but then says FGs are processed sequentially within each partition batch while cleanerParallelism batches run in parallel. This means peak memory is 100MB * cleanerParallelism. With default parallelism, this might be fine, but it's worth noting that this is a new memory pressure the cleaner didn't have before. Have you considered a streaming/spill-to-disk approach for very large file groups?
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
| ### Memory Budget | ||
|
|
||
| Per-FG blob ref sets: ~100MB peak (500K records * 100 bytes/ref for expired + retained). FGs are | ||
| processed sequentially within each partition batch -- per-FG sets are computed and discarded, not |
There was a problem hiding this comment.
🤖 The rollout plan mentions "Stage 1" and "Stage 2" both as pipeline stages in the algorithm AND as rollout phases. This overloaded terminology could cause confusion. Consider calling the rollout phases something like "Phase A: Foundation", "Phase B: Per-FG cleanup", etc. to distinguish them from the algorithm's Stage 1/Stage 2.
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
|
|
||
| **Writer-side conflict check.** `preCommit()` conflict check reads sidecar for concurrency safety. | ||
| Requires extending `ConcurrentOperation` / `TransactionUtils` to include clean actions when blob | ||
| columns exist. |
There was a problem hiding this comment.
🤖 The RFC doesn't discuss what happens if the MDT secondary index becomes stale or inconsistent (e.g., after a failed MDT update). If Stage 2 relies on the secondary index for correctness, a stale index could lead to premature deletion. Is the safety check (verifying index is fully built) sufficient, or should the design also handle partial staleness?
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
| by reading the expired log files. Without log reads on the expired side, | ||
| every such transient blob becomes a permanent orphan that accumulates | ||
| storage indefinitely. | ||
| ``` |
There was a problem hiding this comment.
🤖 The problem statement has excellent failure mode examples and constraints, but Q6 (recovery from premature deletion) is left fully open. Given that premature deletion is the single most critical failure, it might be worth the design doc addressing this — even if the answer is "no recovery; rely on correctness guarantees." Does the design doc intentionally leave this unanswered?
- Generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
| FP["filePathsToBeDeleted<br/>(existing)"] | ||
| BP["blobFilesToDelete<br/>(new)"] |
There was a problem hiding this comment.
🤖 I think the sidecar approach in this RFC is actually a good compromise here — the blob delete list is effectively "in the plan" for conflict resolution purposes (the plan's extraMetadata points to the sidecar, so writers can find it), but the actual data lives in a separate Parquet file to avoid plan bloat. Example 9 in the problem doc makes a compelling case that without a durable artifact on the timeline, writer-side conflict checks have no way to detect blob-level conflicts. The sidecar gives you both: durability for correctness, and separation for size.
yihua
left a comment
There was a problem hiding this comment.
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Greptile Summary: This PR adds two new RFC documents for RFC-100 Part 2, describing the design for cleaning orphaned external blob files when Hudi expires old file slices. The two files are:
rfc-100-blob-cleaner-design.md: A detailed solution design covering the two-stage cleanup pipeline, MDT secondary index integration, sidecar Parquet lifecycle, writer-cleaner conflict check, crash recovery, configuration, and a test plan.rfc-100-blob-cleaner-problem.md: A companion problem statement defining the 11 design constraints (C1–C11), 9 requirements (R1–R9), 7 illustrative failure modes, and 9 open questions.
The design is thorough and well-structured. The two-stage pipeline (per-FG set-difference → cross-FG MDT index verification) is a sound approach to the cross-file-group sharing problem. Several issues worth addressing:
- Missing Example 8 in the problem document — examples jump from 7 to 9, suggesting a numbering error or omitted scenario.
- Example 9 introduces solution content directly contradicting the problem doc's own stated scope ("It contains no solution content"). It compares two implementation approaches and advocates for one.
- Circuit breaker in the fallback scan path may create permanent orphans — when candidates exceed the threshold, Stage 2 is skipped but file slices are still deleted that same cycle. In the next cycle those expired file slices are already gone, so the orphan blob candidates can never be re-derived (R2 violation under sustained absence of MDT secondary index).
managedflag coverage in Stage 2 is implicit, not stated — Stage 1 filters retained refs tomanaged == trueonly, but the problem doc requires that any retained reference (regardless of themanagedflag) must prevent deletion. Stage 2's MDT index lookup must cover unmanaged references for correctness during managed→unmanaged transitions, but the design does not explicitly state the secondary index scope.- The placeholder
Issue: <Link to GH feature issue>remains unfilled in the design doc header.
Greptile Confidence Score: 4/5
Safe to merge as an RFC draft after addressing the two P1 design concerns; no production code is changed.
Both files are documentation-only RFC additions with no code changes. The design is thorough and well-reasoned. Two P1 issues exist in the design itself: the circuit-breaker path can create permanent orphan blobs (R2 violation) and the managed→unmanaged flag transition handling relies on an implicit assumption about secondary index scope that is never stated. These should be resolved before implementation begins, but they don't block merging the RFC draft for community review. The missing Example 8 and the solution-content scope violation in the problem doc are P2 polish items.
rfc/rfc-100/rfc-100-blob-cleaner-design.md — circuit breaker and managed-flag sections need clarification; rfc/rfc-100/rfc-100-blob-cleaner-problem.md — Example 8 gap and Example 9 scope issue.
Sequence Diagram (Greptile):
sequenceDiagram
participant CP as CleanPlanActionExecutor
participant S1 as Stage 1 (per-FG)
participant S2 as Stage 2 (cross-FG)
participant MDT as MDT Secondary Index
participant AUX as Sidecar (.hoodie/.aux/clean/)
participant TL as Timeline
participant CE as CleanActionExecutor
participant W as Writer (preCommit)
Note over CP: requestClean()
CP->>CP: hasBlobColumns()?
alt No blob columns
CP-->>TL: Plan (no blob work)
else Has blob columns
CP->>S1: For each expired/retained FG
S1-->>CP: all_local_orphans (set-difference)
alt local orphans exist
alt MDT secondary index available
CP->>S2: Verify candidates
S2->>MDT: Batch prefix scan (N paths)
MDT-->>S2: Map<path, List<recordKey>>
S2->>MDT: readRecordIndexLocations(all keys)
MDT-->>S2: Map<recordKey, (partition, fileId)>
S2-->>CP: blob_files_to_delete
else Fallback scan, few candidates
CP->>S2: Table scan
S2-->>CP: blob_files_to_delete
else Circuit breaker fires
Note over CP,S2: Stage 2 skipped — blobs deferred but file slices still deleted this cycle
CP-->>TL: Plan (no blob deletes recorded)
end
CP->>AUX: Write sidecar Parquet
CP->>TL: Persist plan (extraMetadata[blobDeletesPath])
end
end
Note over TL: REQUESTED
W->>TL: preCommit() — scan REQUESTED/INFLIGHT/COMPLETED clean actions
W->>AUX: Read sidecar Parquet
alt Writer blob paths intersect sidecar paths
W-->>W: HoodieWriteConflictException — retry
else No overlap
W->>TL: Commit succeeds
end
CE->>TL: Transition to INFLIGHT
CE->>AUX: Read sidecar Parquet
par Parallel deletion
CE->>CE: Delete file slices (existing)
and
CE->>CE: Delete blob files (from sidecar)
end
CE->>TL: Transition to COMPLETED (blobCleanStats + sidecarPath)
Note over AUX: Sidecar retained until archival
Greptile: yihua#44 (review)
|
|
||
| | Condition | Path used | Cost | Suitable for | | ||
| |-----------------------------|---------------|-----------------------|---------------------------| | ||
| | No local orphan candidates | Skip Stage 2 | Zero | No blob work this cycle | |
There was a problem hiding this comment.
Circuit breaker can produce permanent orphans (R2 violation)
When the circuit breaker fires (candidates > hoodie.cleaner.blob.external.scan.max.candidates), Stage 2 is skipped and no blob deletions are recorded — but the file slice deletions from the existing cleaner still execute within the same clean cycle. Once those expired file slices are deleted, the data that Stage 1 used to derive the orphan candidates is gone. In the next cleaner invocation, Stage 1 iterates expired_slices, but those slices no longer exist on storage. The orphan blob candidates can never be re-derived, making them permanent orphans — a direct violation of R2.
The description says cleanup is "deferred" but that framing is misleading: there is no durable queue of deferred candidates that the next cycle would pick up, and the source-of-truth file slices are deleted.
Possible mitigations to consider and explicitly document:
- Write deferred blob candidates to a durable artifact (separate from the sidecar, since no plan is written for this case) so the next cycle can pick them up without needing the expired slices.
- Alternatively, gate file-slice deletion on successful blob planning (i.e., refuse to delete file slices if Stage 2 was circuit-broken), though this changes existing cleaner semantics.
- Or, explicitly document this as a known limitation: blobs in tables that permanently lack an MDT secondary index and exceed the threshold are permanently orphaned, and the operator must either enable the index or accept the leak.
— Greptile (original) (source:comment#3082984560)
| local_orphans = expired_refs - retained_refs | ||
|
|
||
| // All local orphans proceed to Stage 2 for cross-FG verification | ||
| all_local_orphans.addAll(local_orphans) |
There was a problem hiding this comment.
managed flag scope in Stage 2 is unspecified — managed→unmanaged transition may not be handled correctly
Stage 1 filters retained refs with ref.managed == true, so a retained slice whose managed flag transitioned to false for a given external_path contributes nothing to retained_refs. That path therefore becomes a local orphan candidate and proceeds to Stage 2.
The problem statement (rfc-100-blob-cleaner-problem.md, Section 2 — Managed flag discussion) explicitly requires: "If any retained reference to the same external_path exists (regardless of the managed flag), the blob must not be deleted." Stage 2 correctness therefore depends on the MDT secondary index indexing all external_path values regardless of the managed flag. If the secondary index writer applies the same managed == true filter, Stage 2 would miss the unmanaged retained reference, classify the blob as globally orphaned, and delete a blob that is still referenced — a violation of R1.
The design doc should explicitly state:
- Whether the secondary index on
<blob_col>.reference.external_pathindexes entries formanaged=falserecords, and - If it does not, how managed→unmanaged transitions are handled to satisfy R1 (e.g., the fallback table scan as a safety net, or a separate retained-refs enrichment pass).
— Greptile (original) (source:comment#3082984636)
|
|
||
| --- | ||
|
|
||
| ## Abstract |
There was a problem hiding this comment.
Unfilled issue link placeholder
The Issue field still contains the template placeholder text. This should be filled in with the corresponding GitHub issue link before the RFC is finalized.
| ## Abstract | |
| Issue: <!-- Link to GH feature issue --> |
— Greptile (original) (source:comment#3082985257)
yihua
left a comment
There was a problem hiding this comment.
The blob cleaning design and implementation are not part of Hudi 1.2.0 release. Removing this PR from release-1.2.0 milestone.
Describe the issue this Pull Request addresses
Adding blob-cleaner design for RFC-100. The design does not introduce any new architecture. Instead, it leverages on already existing design and extends it.
For example, out-of-line/external blob cleaning is implemented via extending existing cleaner and uses secondary + record index for identifying blob references that needs to be cleaned. This avoids full table scans, and scalability of O(C * R_avg) where C is the number of candidate files to scan and R_avg average number of references per candidate path.
External blob cleaning has a complexity of:
Summary and Changelog
Impact
None - RFC design.
Risk Level
None - RFC design.
Documentation Update
None, RFC design.
Contributor's checklist