diff --git a/rfc/rfc-100/rfc-100-blob-cleaner-design.md b/rfc/rfc-100/rfc-100-blob-cleaner-design.md
new file mode 100644
index 0000000000000..c8f1ea0fc3ee7
--- /dev/null
+++ b/rfc/rfc-100/rfc-100-blob-cleaner-design.md
@@ -0,0 +1,777 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+# RFC-100 Part 2: External Blob Cleanup for Unstructured Data
+
+## Proposers
+
+- @voon
+
+## Approvers
+
+- @rahil-c
+- @vinothchandar
+- @yihua
+
+## Status
+
+Issue: <Link to GH feature issue>
+
+> Please keep the status updated in `rfc/README.md`.
+
+---
+
+## Abstract
+
+When Hudi cleans expired file slices, external out-of-line blob files they reference may become
+orphaned -- still consuming storage but unreachable by any query. This RFC extends the existing file
+slice cleaner to identify and delete these orphaned blob files safely and efficiently. The design
+uses a two-stage pipeline: (1) per-file-group set-difference to find locally-orphaned blobs, and
+(2) cross-file-group verification via MDT secondary index lookup. Targeted index lookups scale with
+the number of candidates, not the table size. Tables without blob columns pay zero cost.
+
+This design focuses on **external blobs** -- the Phase 1 use case of RFC-100 where users have
+existing blob files in external storage (e.g., `s3://media-bucket/videos/`) and Hudi manages the
+*references* via the `BlobReference` schema, not the *storage layout*.
+
+---
+
+## Background
+
+### Why Blob Cleanup Is Needed
+
+RFC-100 introduces out-of-line blob storage for unstructured data (images, video, documents). A
+record's `BlobReference` field points to an external blob file by `reference.external_path`. When
+the cleaner expires old file slices, the blob files they reference may no longer be needed -- but the
+existing cleaner has no concept of transitive references. It deletes file slices without considering
+the blob files they point to. Without blob cleanup, orphaned blobs accumulate indefinitely.
+
+### External Blobs
+
+Users have existing blob files in external storage (e.g., `s3://media-bucket/videos/`). Records
+reference these blobs directly by path. Hudi manages the *references*, not the *storage layout*.
+Cross-file-group sharing is common -- multiple records across different file groups can point to the
+same blob. Key properties:
+
+| Property                  | External blobs                               |
+|---------------------------|----------------------------------------------|
+| Path uniqueness           | Not guaranteed (user controls)               |
+| Cross-FG sharing          | Common (multiple records, same blob)         |
+| Writer/cleaner race       | Can occur (external paths outside MVCC)      |
+| Per-FG cleanup sufficient | No -- cross-FG verification needed           |
+
+### Constraints and Requirements Reference
+
+Full descriptions and failure modes are in [Problem Statement](rfc-100-blob-cleaner-problem.md).
+
+| ID  | Constraint                                          | Remarks                          |
+|-----|-----------------------------------------------------|----------------------------------|
+| C1  | Blob immutability (append-once, read-many)          |                                  |
+| C2  | Delete-and-re-add same path                         | Real concern for external blobs  |
+| C3  | Cross-file-group blob sharing                       | Common for external blobs        |
+| C4  | MOR log updates shadow base file blob refs          |                                  |
+| C5  | Existing cleaner is per-file-group scoped           |                                  |
+| C6  | OCC is per-file-group                               | No global contention allowed     |
+| C7  | Replace commits move blob refs between file groups  | Clustering, insert_overwrite     |
+| C8  | Savepoints freeze file slices and blob refs         |                                  |
+| C9  | Rollback/restore can invalidate or resurrect refs   |                                  |
+| C10 | Archival removes commit metadata                    |                                  |
+| C11 | Cross-FG verification needed at scale               |                                  |
+
+| ID  | Requirement                                                      |
+|-----|------------------------------------------------------------------|
+| R1  | No premature deletion (hard invariant)                           |
+| R2  | No permanent orphans (bounded cleanup)                           |
+| R3  | MOR correctness (over-retention acceptable, under-retention not) |
+| R4  | Concurrency safety (no global serialization)                     |
+| R5  | Scale proportional to work, not table size                       |
+| R6  | No cost for non-blob tables                                      |
+| R7  | All cleaning policies supported                                  |
+| R8  | Crash safety and idempotency                                     |
+| R9  | Observability (metrics for deleted, retained, reclaimed)         |
+
+---
+
+## Design Overview
+
+### Design Philosophy
+
+Blob cleanup extends the existing `CleanPlanner` / `CleanActionExecutor` pipeline -- same timeline
+instant, same plan-execute-complete lifecycle, same crash recovery and OCC integration. A
+`hasBlobColumns()` check gates all blob logic, so non-blob tables pay near-zero cost (a single
+in-memory schema check).
+
+External blobs require cross-file-group verification because the same blob can be referenced from
+multiple file groups (C3, C11). The design uses targeted MDT secondary index lookups that scale
+with the number of candidates, not the table size.
+
+### Two-Stage Pipeline
+
+| Stage       | Scope            | Purpose                                                              | When it runs                 |
+|-------------|------------------|----------------------------------------------------------------------|------------------------------|
+| **Stage 1** | Per-file-group   | Collect expired/retained blob refs, compute set difference           | Always (for blob tables)     |
+| **Stage 2** | Cross-file-group | Verify candidates against MDT secondary index or fallback scan       | When local orphans exist     |
+
+### Key Decisions
+
+| Decision            | Choice                                                  | Rationale                                                      |
+|---------------------|---------------------------------------------------------|----------------------------------------------------------------|
+| Blob identity       | `reference.external_path`                               | Path-based identity for external blobs                         |
+| Cleanup scope       | Per-FG candidate identification + cross-FG verification | Aligns with OCC (C6) and existing cleaner (C5); scales for C11 |
+| Cross-FG mechanism  | MDT secondary index on `reference.external_path`        | Short-circuits on first non-cleaned FG ref                     |
+| Blob delete storage | Sidecar Parquet file (`.hoodie/.aux/clean/`)             | Avoids plan bloat; durable artifact for writer conflict checks |
+| MOR strategy        | Over-retain (union of base + log refs)                  | Safe (C4, R3); cleaned after compaction                        |
+
+```mermaid
+flowchart LR
+    subgraph Planning["CleanPlanActionExecutor.requestClean()"]
+        direction TB
+        Gate{"hasBlobColumns()?"}
+        Gate -- No --> Skip["Skip blob cleanup<br/>(zero cost)"]
+        Gate -- Yes --> CP
+
+        subgraph CP["CleanPlanner (per-partition, per-FG)"]
+            direction TB
+            Policy["Policy method<br/>→ FileGroupCleanResult<br/>(expired + retained slices)"]
+            S1["<b>Stage 1</b><br/>Per-FG blob ref<br/>set difference"]
+            Policy --> S1
+        end
+
+        S1 --> S2["<b>Stage 2</b><br/>Cross-FG verification<br/>(MDT secondary index)"]
+        S2 --> SC["Write sidecar Parquet<br/>.hoodie/.aux/clean/&lt;instant&gt;<br/>.blob_deletes.parquet"]
+    end
+
+    subgraph Plan["HoodieCleanerPlan"]
+        FP["filePathsToBeDeleted<br/>(existing)"]
+        EM["extraMetadata[blobDeletesPath]<br/>(pointer to sidecar)"]
+    end
+
+    SC --> EM
+    CP --> FP
+
+    subgraph Execution["CleanActionExecutor.runClean()"]
+        direction TB
+        RS["Read sidecar Parquet"]
+        DF["Delete file slices<br/>(existing, parallel)"]
+        DB["Delete blob files<br/>(new, parallel)"]
+        RS --> DB
+    end
+
+    FP --> DF
+    EM --> RS
+```
+
+---
+
+## Algorithm
+
+### Stage 1: Per-File-Group Local Cleanup
+
+Stage 1 runs after the existing policy logic determines which file slices are expired and retained
+for a given file group. It collects blob refs from both sets and computes locally-orphaned blobs by
+set difference. All local orphans proceed to Stage 2 for cross-FG verification.
+
+```
+Input:  For each file group FG, its expired_slices and retained_slices (from policy)
+Output: all_local_orphans -- external blobs needing cross-FG verification
+
+all_local_orphans = Set<external_path>()
+for each file_group being cleaned:
+
+    // Collect expired blob refs (base files + log files)
+    // Must read log files: blob refs introduced and superseded within the log
+    // chain before compaction would otherwise become permanent orphans.
+    expired_refs = Set<external_path>()
+    for slice in expired_slices:
+        for ref in extractBlobRefs(slice.baseFile):   // columnar projection
+            if ref.type == OUT_OF_LINE and ref.managed == true:
+                expired_refs.add(ref.external_path)
+        for ref in extractBlobRefs(slice.logFiles):   // full record read
+            if ref.type == OUT_OF_LINE and ref.managed == true:
+                expired_refs.add(ref.external_path)
+
+    if expired_refs is empty:
+        continue                                       // no blob work for this FG
+
+    // Collect retained blob refs (base files only)
+    // Cleaning is fenced on compaction: retained base files contain the merged
+    // state. Log reads are unnecessary -- any shadowed base ref causes safe
+    // over-retention, cleaned after the next compaction cycle.
+    retained_refs = Set<external_path>()
+    for slice in retained_slices:
+        for ref in extractBlobRefs(slice.baseFile):   // columnar projection only
+            if ref.type == OUT_OF_LINE and ref.managed == true:
+                retained_refs.add(ref.external_path)
+
+    // Compute local orphans by set difference
+    local_orphans = expired_refs - retained_refs
+
+    // All local orphans proceed to Stage 2 for cross-FG verification
+    all_local_orphans.addAll(local_orphans)
+```
+
+**Correctness notes:**
+
+- **MOR -- expired side reads base + logs:** Blob refs can be introduced and superseded entirely
+  within the log chain (e.g., `log@t2: row1->blob_B`, then `log@t3: row1->blob_C`). After
+  compaction, `blob_B` exists only in the expired log. Skipping logs would orphan it permanently.
+- **MOR -- retained side reads base only:** Cleaning is fenced on compaction, so retained base
+  files contain the merged state. Shadowed base refs cause over-retention (safe), cleaned after
+  the next compaction.
+- **Savepoints:** Inherited from existing cleaner -- savepointed slices stay in the retained set.
+- **Replaced FGs (replace commits):** `retained_slices` is empty, so all blob refs become
+  candidates. For external blobs, clustering copies the pointer to the target FG, so Stage 2
+  finds the reference in the target FG and retains the blob.
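+
+The set difference at the core of Stage 1 can be sketched in plain Java. This is a minimal
+illustration, not the proposed implementation: `LocalOrphanSketch` and `localOrphans` are
+hypothetical names, and the inputs are assumed to be the managed `external_path` values already
+extracted from the expired and retained slices of one file group.
+
+```java
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+// Hypothetical helper (not a Hudi class): Stage 1 set difference for one file group.
+final class LocalOrphanSketch {
+  private LocalOrphanSketch() {}
+
+  /**
+   * @param expiredRefs  managed external paths read from expired slices (base + log files)
+   * @param retainedRefs managed external paths read from retained slices (base files only)
+   * @return paths referenced only by expired slices, i.e. candidates for Stage 2
+   */
+  static Set<String> localOrphans(List<String> expiredRefs, List<String> retainedRefs) {
+    Set<String> orphans = new HashSet<>(expiredRefs);   // de-duplicates repeated refs
+    orphans.removeAll(new HashSet<>(retainedRefs));     // expired_refs - retained_refs
+    return orphans;
+  }
+}
+```
+
+A path that appears in both inputs is dropped from the result, which is the over-retention
+behavior described in the MOR notes above.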
+
+### Stage 2: Cross-File-Group Verification
+
+Stage 2 verifies each local orphan candidate against the global state to determine if the blob is
+still referenced by any active file slice outside the cleaned file groups. This is necessary because
+external blobs can be shared across file groups (C3, C11).
+
+#### Primary path: MDT secondary index
+
+When the MDT secondary index on `reference.external_path` is available and fully built:
+
+```
+Input:  all_local_orphans, cleaned_fg_ids
+Output: blob_files_to_delete (confirmed globally orphaned)
+
+candidate_paths = all_local_orphans.distinct()
+
+// Step 1: Batched prefix scan on secondary index
+// Key format: escaped(external_path)$escaped(record_key)
+// Returns ALL record keys that reference each candidate path
+// Uses engine-context HoodieData (e.g., RDD on Spark) to distribute work
+// across executors -- candidate sets can be large (row-level blob refs).
+candidate_paths_data = engineContext.parallelize(candidate_paths)
+path_to_record_keys = mdtMetadata.readSecondaryIndexDataTableRecordKeysWithKeys(
+    candidate_paths_data, indexPartitionName)
+    .groupBy(pair -> pair.getKey())
+
+// Step 2: Batch record index lookup -- ONE call for ALL record keys
+// Sorts keys internally, single sequential forward-scan through HFile.
+all_record_keys = path_to_record_keys.values().flatten()
+all_locations = mdtMetadata.readRecordIndexLocations(
+    all_record_keys)                                    // -> Map<recordKey, (partition, fileId)>
+
+// Step 3: In-memory resolution with short-circuit per candidate
+for path in candidate_paths:
+    record_keys = path_to_record_keys.getOrDefault(path, [])
+
+    if record_keys is empty:
+        blob_files_to_delete.add(path)                  // globally orphaned
+        continue
+
+    found_live_reference = false
+    for rk in record_keys:
+        location = all_locations.get(rk)
+        if location != null and location.fileId NOT in cleaned_fg_ids:
+            found_live_reference = true
+            break                                       // short-circuit (in-memory)
+
+    if not found_live_reference:
+        blob_files_to_delete.add(path)                  // all refs in cleaned FGs
+```
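+
+Step 3 is pure in-memory work, so it can be sketched without any MDT dependency. In the
+illustration below, plain `Map` and `Set` values stand in for the results of Steps 1 and 2;
+`CrossFgResolutionSketch` and its parameter names are hypothetical, and file IDs are modeled as
+strings.
+
+```java
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+
+// Hypothetical sketch of Step 3; plain collections stand in for the MDT lookup results.
+final class CrossFgResolutionSketch {
+  private CrossFgResolutionSketch() {}
+
+  static List<String> blobsToDelete(List<String> candidatePaths,
+                                    Map<String, List<String>> pathToRecordKeys,
+                                    Map<String, String> recordKeyToFileId,
+                                    Set<String> cleanedFgIds) {
+    List<String> toDelete = new ArrayList<>();
+    for (String path : candidatePaths) {
+      boolean foundLiveReference = false;
+      for (String recordKey : pathToRecordKeys.getOrDefault(path, List.of())) {
+        String fileId = recordKeyToFileId.get(recordKey);
+        // A location outside the cleaned file groups counts as a live reference.
+        if (fileId != null && !cleanedFgIds.contains(fileId)) {
+          foundLiveReference = true;
+          break; // short-circuit on the first live reference
+        }
+      }
+      if (!foundLiveReference) {
+        toDelete.add(path); // no record keys, or every reference sits in a cleaned FG
+      }
+    }
+    return toDelete;
+  }
+}
+```
+
+A candidate with no record keys and a candidate whose references all sit in cleaned file groups
+take the same exit, matching the two `blob_files_to_delete.add(path)` branches above.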
+
+**Cost model.** Three steps: (1) batched prefix scan on secondary index, (2) batched record index
+lookup in a single sorted HFile scan, (3) in-memory resolution with short-circuit. Steps 1 and 2
+are each a single I/O pass; step 3 is pure hash set lookups.
+
+| Step                      | I/O                                           | Estimated cost (2K candidates) |
+|---------------------------|-----------------------------------------------|--------------------------------|
+| 1. Prefix scan (batched)  | 1 HFile open + forward scan of N prefix keys  | ~2-5s                          |
+| 2. Record index (batched) | 1 HFile open + forward scan of 6K sorted keys | ~1-2s                          |
+| 3. In-memory resolution   | Hash set checks (cleaned_fg_ids)              | ~0ms                           |
+
+*Estimates assume cloud object storage (S3/GCS/ADLS), ~10-100ms per-read latency, ~50-200 MB/s
+sequential throughput, 64-256KB HFile blocks. Pending benchmarking.*
+
+**Index definition.** Uses the existing `HoodieIndexDefinition` mechanism with
+`sourceFields = ["<blob_col>", "reference", "external_path"]`. The nested field path is supported
+by `HoodieSchemaUtils.projectSchema()` and `SecondaryIndexRecordGenerationUtils`. No new index
+infrastructure is needed.
+
+**Safety check.** Before using the index, the cleaner verifies that it is fully built by checking
+`getMetadataPartitions()` and `getMetadataPartitionsInflight()`. A partially-built index falls
+back to the table scan path.
+
+#### Fallback path: table scan with circuit breaker
+
+When the MDT secondary index is unavailable, Stage 2 falls back to a parallelized table scan
+across all partitions. A circuit breaker (`hoodie.cleaner.blob.external.scan.max.candidates`,
+default 1000) defers cleanup if candidates exceed the threshold, preventing the scan from becoming
+a bottleneck on large tables. The operator is warned to enable the MDT secondary index.
+
+#### Decision matrix
+
+| Condition                   | Path used     | Cost                  | Suitable for              |
+|-----------------------------|---------------|-----------------------|---------------------------|
+| No local orphan candidates  | Skip Stage 2  | Zero                  | No blob work this cycle   |
+| Secondary index fully built | Index lookup  | O(candidates)         | Any scale                 |
+| No index, few candidates    | Table scan    | O(candidates * table) | Small tables              |
+| No index, many candidates   | Circuit break | Zero (deferred)       | Large tables need index   |
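+
+The matrix reduces to a small decision function. The sketch below is illustrative only:
+`Stage2PathSketch`, the `Path` enum, and the parameter names are hypothetical, and
+`maxScanCandidates` stands for the value of `hoodie.cleaner.blob.external.scan.max.candidates`.
+
+```java
+// Hypothetical sketch of the Stage 2 path selection; all names are illustrative.
+final class Stage2PathSketch {
+  enum Path { SKIP, INDEX_LOOKUP, TABLE_SCAN, CIRCUIT_BREAK }
+
+  private Stage2PathSketch() {}
+
+  static Path choose(int candidateCount, boolean indexFullyBuilt, int maxScanCandidates) {
+    if (candidateCount == 0) {
+      return Path.SKIP;                 // no local orphans: Stage 2 does not run
+    }
+    if (indexFullyBuilt) {
+      return Path.INDEX_LOOKUP;         // cost scales with the number of candidates
+    }
+    // No usable index: scan only while the candidate count stays within the threshold.
+    return candidateCount > maxScanCandidates ? Path.CIRCUIT_BREAK : Path.TABLE_SCAN;
+  }
+}
+```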
+
+```mermaid
+sequenceDiagram
+    participant C as Cleaner (Stage 2)
+    participant SI as MDT Secondary Index
+    participant RI as MDT Record Index
+
+    Note over C: Step 1: Batch prefix scan
+    C->>SI: All candidate paths (N paths, single call)
+    SI-->>C: Map<path, List<recordKey>>
+
+    Note over C: Step 2: Batch record index lookup
+    C->>C: Collect all record keys from all candidates
+    C->>RI: readRecordIndexLocations(all record keys)
+    Note over RI: Sort keys → single sequential<br/>forward-scan through HFile
+    RI-->>C: Map<recordKey, (partition, fileId)>
+
+    Note over C: Step 3: In-memory resolution
+    loop For each candidate path
+        alt No record keys for this path
+            Note right of C: Globally orphaned → DELETE
+        else Has record keys
+            C->>C: Check each location.fileId<br/>against cleaned_fg_ids (in-memory)
+            alt Any fileId NOT in cleaned_fg_ids
+                Note right of C: Live reference → RETAIN
+            else All in cleaned FGs
+                Note right of C: Globally orphaned → DELETE
+            end
+        end
+    end
+```
+
+### Execution Flow
+
+```
+1. CleanPlanActionExecutor.requestClean()
+   ├── hasBlobColumns(table)?                         // R6: zero-cost gate
+   ├── CleanPlanner: for each partition, for each file group:
+   │     ├── Refactored policy method -> FileGroupCleanResult
+   │     └── If hasBlobColumns: Stage 1 per FG
+   ├── CleanPlanner: replaced file groups -> Stage 1
+   ├── If local orphan candidates non-empty: Stage 2
+   ├── Write sidecar Parquet to .hoodie/.aux/clean/<instant>.blob_deletes.parquet
+   ├── Build HoodieCleanerPlan with extraMetadata["blobDeletesPath"]
+   └── Persist plan to timeline (REQUESTED state)
+
+2. CleanActionExecutor.runClean()
+   ├── Transition to INFLIGHT
+   ├── Read sidecar Parquet (blob delete list)
+   ├── Delete file slices (existing, parallelized)
+   ├── Delete blob files (new, parallelized)          // parallel with file slice deletes
+   ├── Build HoodieCleanMetadata with blobCleanStats + blobDeletesSidecarPath
+   └── Transition to COMPLETED
+```
+
+```mermaid
+sequenceDiagram
+    participant P as CleanPlanActionExecutor
+    participant TL as Timeline
+    participant AUX as .hoodie/.aux/clean/
+    participant E as CleanActionExecutor
+    participant S as Storage
+
+    Note over P: requestClean()
+
+    P->>P: Stage 1: per-FG blob ref set difference
+    P->>P: Stage 2: MDT index lookup (if candidates exist)
+    P->>AUX: Write sidecar Parquet (blob delete list)
+    P->>TL: Persist HoodieCleanerPlan<br/>(extraMetadata["blobDeletesPath"])
+
+    Note over TL: REQUESTED
+
+    rect rgb(255, 245, 230)
+        Note right of TL: Crash before plan → orphaned sidecar<br/>(harmless, cleaned at next startup)
+    end
+
+    E->>TL: Transition plan state
+    Note over TL: INFLIGHT
+
+    E->>AUX: Read sidecar Parquet
+
+    rect rgb(255, 245, 230)
+        Note right of TL: Crash here → re-read sidecar,<br/>re-delete (FileNotFound = success)
+    end
+
+    par Parallel deletion
+        E->>S: Delete file slices (existing)
+    and
+        E->>S: Delete blob files (from sidecar)
+    end
+
+    E->>E: Build HoodieCleanMetadata<br/>(blobCleanStats + sidecarPath)
+    E->>TL: Transition plan state
+
+    rect rgb(255, 245, 230)
+        Note right of TL: Crash here → re-execute<br/>(all deletes are no-ops)
+    end
+
+    Note over TL: COMPLETED
+    Note over AUX: Sidecar lives until archival<br/>(for writer conflict checks)
+```
+
+---
+
+## Integration with Existing Cleaner
+
+### CleanPlanner Refactoring
+
+The existing `CleanPlanner` policy methods produce `CleanFileInfo` objects (file paths to delete)
+without exposing the expired/retained slice partition that blob cleanup needs. We introduce a new
+return type:
+
+```java
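+public class FileGroupCleanResult {
+  private final List<CleanFileInfo> filePathsToDelete;
+  private final List<FileSlice> expiredSlices;
+  private final List<FileSlice> retainedSlices;
+  public FileGroupCleanResult(List<CleanFileInfo> deletes, List<FileSlice> expired, List<FileSlice> retained) {
+    this.filePathsToDelete = deletes;
+    this.expiredSlices = expired;
+    this.retainedSlices = retained;
+  }
+}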
+```
+
+The three policy methods (`getFilesToCleanKeepingLatestVersions`,
+`getFilesToCleanKeepingLatestCommits`, `getFilesToCleanKeepingLatestHours`) are refactored to
+collect both expired and retained slices alongside the existing `CleanFileInfo` production. The
+existing behavior is unchanged -- the refactoring adds output without modifying the
+expired/retained classification logic.
+
+### Replaced File Group Handling
+
+Replaced file groups (from clustering, insert_overwrite, insert_overwrite_table) are cleaned via
+`getReplacedFilesEligibleToClean()`. A parallel method `getReplacedFileGroupBlobCleanResults()`
+produces `FileGroupCleanResult` objects with `retainedSlices = empty` and
+`expiredSlices = all slices`. This feeds into Stage 1 identically to normal file groups.
+
+### Schema Changes: HoodieCleanerPlan
+
+**No new fields.** The blob delete list is stored in a sidecar Parquet file, not in the plan. The
+plan references the sidecar via the existing `extraMetadata` map:
+
+- `extraMetadata["blobDeletesPath"]` -- path to the sidecar Parquet file
+  (e.g., `.hoodie/.aux/clean/20240101120000.blob_deletes.parquet`)
+
+This avoids plan bloat regardless of the number of blob candidates. The `extraMetadata` map is
+already part of the `HoodieCleanerPlan` Avro schema -- no schema change is needed.
+
+### Sidecar Parquet File: Blob Delete List
+
+The blob delete list is stored as a sidecar Parquet file at
+`.hoodie/.aux/clean/<instant>.blob_deletes.parquet`. This file is the **single source of truth**
+for blob delete candidates across all clean states (REQUESTED, INFLIGHT, COMPLETED).
+
+**Schema:**
+```
+message BlobDeleteList {
+  required binary external_path (STRING);
+}
+```
+
+**Write sequence:** The sidecar is written **before** the plan is persisted to the timeline.
+By the time the plan is visible, the sidecar is already durable on storage. This ordering
+guarantees that any writer that sees the plan can also read the sidecar.
+
+**Lifecycle:** The sidecar lives until the clean instant is **archived**. This covers writer
+conflict checks against REQUESTED, INFLIGHT, and COMPLETED clean actions. When the instant is
+archived, the sidecar is deleted as part of archival cleanup.
+
+**Orphan cleanup:** If a crash occurs after writing the sidecar but before persisting the plan,
+the sidecar is orphaned. At cleaner startup, a lightweight cleanup routine lists
+`.hoodie/.aux/clean/` and deletes any sidecar whose instant has no corresponding plan on the
+timeline.
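+
+A minimal sketch of that cleanup routine follows, assuming sidecar file names have the form
+`<instant>.blob_deletes.parquet` as described above. `OrphanSidecarSketch` is a hypothetical name,
+and the directory listing and the set of clean instants on the timeline are passed in as plain
+collections rather than read through Hudi APIs.
+
+```java
+import java.util.ArrayList;
+import java.util.List;
+import java.util.Set;
+
+// Hypothetical sketch: find sidecars whose instant has no plan on the timeline.
+final class OrphanSidecarSketch {
+  static final String SUFFIX = ".blob_deletes.parquet";
+
+  private OrphanSidecarSketch() {}
+
+  static List<String> orphanedSidecars(List<String> fileNames, Set<String> cleanInstants) {
+    List<String> orphans = new ArrayList<>();
+    for (String name : fileNames) {
+      if (!name.endsWith(SUFFIX)) {
+        continue; // not a blob-delete sidecar; leave it alone
+      }
+      String instant = name.substring(0, name.length() - SUFFIX.length());
+      if (!cleanInstants.contains(instant)) {
+        orphans.add(name); // no matching plan on the timeline
+      }
+    }
+    return orphans;
+  }
+}
+```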
+
+**Rollback:** When a clean plan is rolled back, rollback logic reads
+`extraMetadata["blobDeletesPath"]` and deletes the sidecar.
+
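+**Size:** Blob paths share long prefixes (e.g., `s3://media-bucket/videos/...`), so page-level
+compression or a prefix-aware encoding such as Parquet's `DELTA_BYTE_ARRAY` keeps the file small;
+dictionary encoding alone does not help when every path is distinct. 10K blob paths ≈ a few
+hundred KB.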
+
+### Schema Changes: HoodieCleanMetadata
+
+A new nullable field `blobCleanStats` of type `HoodieBlobCleanStats`:
+
+- `totalBlobFilesDeleted`, `totalBlobFilesRetained`
+- `totalBlobStorageReclaimed`
+- `blobDeletesSidecarPath` -- pointer to the sidecar (for COMPLETED-state conflict checks)
+- `failedBlobFilePaths` -- only failures (expected to be empty or near-empty)
+
+### hasBlobColumns() Gate
+
+An in-memory schema check (`TableSchemaResolver.getTableSchema().containsBlobType()`) gates all
+blob cleanup logic. Requires making `containsBlobType()` public (one-line visibility change).
+
+---
+
+## Concurrency & Safety
+
+### Writer-Cleaner Race: Conflict Check
+
+Under Hudi's MVCC design, the cleaner and writers operate on non-overlapping file slices -- the
+cleaner never conflicts with writers on file slice operations. However, external blob files are
+**not** covered by MVCC: a writer's new file slice may reference an external blob that the cleaner
+is simultaneously evaluating for deletion.
+
+**Writer-side conflict check in `preCommit()`.** The gap between the cleaner's planning-time
+snapshot and its actual file deletion is closed by a commit-time conflict check:
+
+1. Writers track external managed blob paths in `HoodieWriteStat.externalBlobPaths` (in-memory
+   collection, no additional I/O).
+2. At commit time (in `preCommit()`, under the existing transaction lock), the writer checks all
+   three clean states -- COMPLETED, INFLIGHT, and REQUESTED -- because a REQUESTED plan can begin
+   executing at any moment (the REQUESTED->INFLIGHT transition doesn't acquire the transaction
+   lock).
+3. For each clean instant, the writer locates the sidecar Parquet file:
+   - REQUESTED/INFLIGHT: reads `extraMetadata["blobDeletesPath"]` from the plan
+   - COMPLETED: reads `blobDeletesSidecarPath` from `HoodieCleanMetadata`
+4. The writer reads the sidecar and checks intersection with its external blob paths.
+5. If any overlap is found, the commit is rejected with `HoodieWriteConflictException` and the
+   writer retries.
+
+The sidecar is the single source of truth across all three states -- no blob paths are stored
+inline in the plan or completed metadata.
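+
+The intersection in steps 4 and 5 can be sketched as follows. This is an illustration under
+stated assumptions: `BlobConflictSketch` is a hypothetical name, each inner list stands for the
+paths read from one sidecar, and the caller would raise `HoodieWriteConflictException` when the
+result is non-empty.
+
+```java
+import java.util.List;
+import java.util.Set;
+import java.util.TreeSet;
+
+// Hypothetical sketch of the preCommit() overlap check; sidecar contents are plain lists here.
+final class BlobConflictSketch {
+  private BlobConflictSketch() {}
+
+  /** Returns the writer's blob paths that appear in any clean instant's delete list. */
+  static Set<String> conflictingPaths(Set<String> writerBlobPaths,
+                                      List<List<String>> sidecarDeleteLists) {
+    Set<String> conflicts = new TreeSet<>(); // sorted for stable error messages
+    for (List<String> deleteList : sidecarDeleteLists) {
+      for (String path : deleteList) {
+        if (writerBlobPaths.contains(path)) {
+          conflicts.add(path);
+        }
+      }
+    }
+    return conflicts;
+  }
+}
+```
+
+Iterating the sidecar rows and probing the writer's set keeps the check linear in the sidecar
+size, with one hash lookup per row.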
+
+**Note:** `ConcurrentOperation` and `TransactionUtils.getInflightAndRequestedInstants()` currently
+exclude `CLEAN_ACTION`. Adding external blob cleanup requires extending conflict resolution to
+include clean actions when blob columns exist.
+
+Cost is zero for non-blob tables. For external blobs: one timeline scan + 1-3 sidecar Parquet
+reads (~50-200ms each on cloud storage), gated on the writer having external blob paths.
+
+```mermaid
+sequenceDiagram
+    participant W as Writer
+    participant TL as Timeline
+    participant AUX as Sidecar (.aux)
+    participant CL as Cleaner
+
+    Note over W,CL: Scenario A: Writer commits BEFORE cleaner plans
+
+    W->>TL: Commit (references blob_X)
+    Note right of CL: Planning sees blob_X referenced<br/>by a live slice → not a delete candidate
+    CL->>AUX: Write sidecar (blob_X NOT in delete list)
+    CL->>TL: Plan cleanup (extraMetadata -> sidecar path)
+    Note over W,CL: ✓ Safe
+
+    Note over W,CL: Scenario B: Writer commits AFTER cleaner plans, BEFORE delete
+
+    CL->>AUX: Write sidecar (blob_X in delete list)
+    CL->>TL: Plan cleanup
+    Note over TL: REQUESTED / INFLIGHT
+    W->>TL: preCommit() -- reads plan, finds sidecar path
+    W->>AUX: Read sidecar Parquet
+    Note left of W: blob_X in sidecar<br/>→ HoodieWriteConflictException<br/>→ Writer retries
+    Note over W,CL: ✓ Safe -- conflict detected
+
+    Note over W,CL: Scenario C: Writer commits AFTER cleaner deletes, BEFORE COMPLETED
+
+    CL->>CL: Delete blob_X from storage
+    Note over TL: Still INFLIGHT
+    W->>TL: preCommit() -- reads INFLIGHT plan
+    W->>AUX: Read sidecar Parquet
+    Note left of W: blob_X in sidecar → Rejection
+    Note over W,CL: ✓ Safe -- same as B
+
+    Note over W,CL: Scenario D: Cleaner completes, THEN writer acquires lock
+
+    CL->>TL: Transition to COMPLETED (metadata has sidecar path)
+    W->>TL: preCommit() -- reads COMPLETED metadata
+    W->>AUX: Read sidecar Parquet
+    Note left of W: blob_X in sidecar → Rejection
+    Note over W,CL: ✓ Safe
+```
+
+### Concurrency Matrix
+
+| Operation                      | Concurrent with Blob Cleaner | Safety Mechanism                                          |
+|--------------------------------|------------------------------|-----------------------------------------------------------|
+| Regular write (INSERT/UPSERT)  | Safe                         | Writer-side conflict check in preCommit()                 |
+| Compaction                     | Safe                         | `isFileSliceNeededForPendingMajorOrMinorCompaction`       |
+| Clustering / insert_overwrite  | Safe                         | Replaced FG lifecycle; Stage 2 finds refs in target FG    |
+| Rollback                       | Safe                         | MOR over-retention; clean operates on post-rollback state |
+| Restore                        | Safe                         | Clean operates on post-restore state                      |
+| Savepoint create/delete        | Safe                         | Savepointed slices excluded from cleaning                 |
+| Archival                       | No interaction               | Blob cleaner reads file slices, not commit metadata       |
+| Another cleaner instance       | Safe                         | `TransactionManager`; `checkIfOtherWriterCommitted`       |
+| MDT writes (index maintenance) | Safe                         | MDT commit atomicity                                      |
+
+### Crash Recovery
+
+Crash recovery is idempotent by construction, using the same mechanisms as existing file slice
+cleaning. The sidecar Parquet file ensures the blob delete list survives crashes:
+
+| Crash point                                   | Recovery                                                                                               |
+|-----------------------------------------------|--------------------------------------------------------------------------------------------------------|
+| After sidecar written, before plan persisted  | No REQUESTED instant on timeline. Cleaner starts fresh. Orphaned sidecar cleaned at next startup.      |
+| After plan persisted, before execution        | REQUESTED instant found; plan re-read, sidecar re-read, executed.                                      |
+| During execution (partial deletes)            | INFLIGHT instant re-executed. Sidecar re-read. Missing (already-deleted) files count as success.       |
+| After execution, before COMPLETED             | INFLIGHT re-executed. All deletes are no-ops. Metadata written, instant transitions to COMPLETED.      |
+| Plan rolled back                              | Rollback reads `extraMetadata["blobDeletesPath"]` and deletes the sidecar.                             |
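+
+The idempotency argument rests on treating "already gone" as success. The sketch below shows the
+pattern with `java.nio.file` on a local filesystem as a stand-in; the cleaner itself would go
+through Hudi's storage abstraction, and `IdempotentDeleteSketch` is a hypothetical name.
+
+```java
+import java.io.IOException;
+import java.nio.file.Files;
+import java.nio.file.Path;
+import java.util.ArrayList;
+import java.util.List;
+
+// Hypothetical sketch: re-running the delete list after a crash is a no-op for missing files.
+final class IdempotentDeleteSketch {
+  private IdempotentDeleteSketch() {}
+
+  /** Deletes every path; returns only the paths that failed for a reason other than absence. */
+  static List<Path> deleteAll(List<Path> blobPaths) {
+    List<Path> failed = new ArrayList<>();
+    for (Path path : blobPaths) {
+      try {
+        Files.deleteIfExists(path); // returns false, without throwing, if the file is absent
+      } catch (IOException e) {
+        failed.add(path);           // genuine failure: would be reported in failedBlobFilePaths
+      }
+    }
+    return failed;
+  }
+}
+```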
+
+---
+
+## Performance
+
+### Cost Summary
+
+| Workload                     | Stage 1 cost                    | Stage 2 cost               | Total per cleanup cycle        |
+|------------------------------|---------------------------------|----------------------------|--------------------------------|
+| Non-blob table               | Zero (`hasBlobColumns` gate)    | N/A                        | **Zero**                       |
+| External blobs (index)       | ~6 Parquet reads per cleaned FG | O(C * R_avg)               | O(cleaned_FGs + C * R_avg)     |
+| External blobs (scan)        | ~6 Parquet reads per cleaned FG | O(candidates * table_size) | Circuit breaker limits this    |
+
+*`C` = number of Stage 2 candidates; `R_avg` = average number of referencing records per candidate.*
+
+### Back-of-Envelope: Example 6 (50K FGs, 2K External Candidates)
+
+| Parameter                           | Value     | Notes                                                |
+|-------------------------------------|-----------|------------------------------------------------------|
+| FGs cleaned this cycle              | 500       | 1% of table                                          |
+| Stage 1: reads per FG               | ~6        | 3 retained + 3 expired slices                        |
+| Stage 1: total reads                | 3,000     | Parallelized across executors, ~20s                  |
+| External blob candidates            | 2,000     | Locally orphaned in cleaned FGs                      |
+| Avg refs per candidate              | 3         | Illustrative assumption                              |
+| Total record keys                   | 6,000     | 2,000 * 3                                            |
+| **Stage 2 cost (estimated)**        |           |                                                      |
+| Step 1: batched prefix scan         | 1 call    | Returns 6K record keys, ~2-5s estimated              |
+| Step 2: batched record index lookup | 1 call    | 6K keys sorted, single HFile scan, ~1-2s estimated   |
+| Step 3: in-memory resolution        | 6K checks | Hash set lookups against cleaned_fg_ids, ~0ms        |
+| **Total Stage 2**                   | **~3-7s** | Estimated; see I/O assumptions in Stage 2 cost model |
+| Comparison: naive full-table scan   | 12.5TB    | 50K FGs * 5 slices * 50MB = prohibitive              |
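+
+The figures in the table can be reproduced with a few lines of arithmetic. The sketch below is
+illustrative only: the variable names are made up here, and the inputs are the table's assumptions
+rather than measurements.
+
+```python
+# Inputs taken from the table above (assumptions, not measurements).
+total_fgs = 50_000
+cleaned_fgs = total_fgs // 100          # 1% of the table is cleaned this cycle
+reads_per_fg = 6                        # 3 retained + 3 expired slices
+candidates = 2_000                      # locally orphaned external blobs
+avg_refs_per_candidate = 3
+
+stage1_reads = cleaned_fgs * reads_per_fg
+record_keys = candidates * avg_refs_per_candidate
+
+# Naive alternative: read every slice of every file group.
+slices_per_fg = 5
+slice_size_mb = 50
+naive_scan_tb = total_fgs * slices_per_fg * slice_size_mb / 1_000_000
+
+print(stage1_reads, record_keys, naive_scan_tb)
+```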
+
+### Memory Budget
+
+Per-FG blob ref sets: ~100MB peak (500K records * 100 bytes/ref = ~50MB per set, for both the
+expired and retained sets). FGs are processed sequentially within each partition batch -- per-FG
+sets are computed and discarded, not accumulated. Only the output list (`all_local_orphans`) grows,
+and it contains only orphaned refs (much smaller). Peak heap for Stage 1: ~100MB *
+`cleanerParallelism` = 400MB-1.6GB for a parallelism of 4-16.
+
+Stage 2 output lists (`candidate_paths`, `all_record_keys`) can be large -- each cleaned FG may
+contribute row-level blob refs as candidates. These are backed by engine-context `HoodieData`
+(e.g., Spark RDD) and distributed across executors, avoiding driver memory pressure.
+
+---
+
+## Configuration
+
+| Property                                           | Default | Description                                                                       |
+|----------------------------------------------------|---------|-----------------------------------------------------------------------------------|
+| `hoodie.cleaner.blob.enabled`                      | `true`  | Enable blob cleanup during clean action                                           |
+| `hoodie.cleaner.blob.dry.run`                      | `false` | Compute blob cleanup plan and log results but do not execute                      |
+| `hoodie.cleaner.blob.external.scan.parallelism`    | `10`    | Parallelism for Stage 2 fallback table scan                                       |
+| `hoodie.cleaner.blob.external.scan.max.candidates` | `1000`  | Circuit breaker for Stage 2 fallback scan; exceeding defers blob cleanup          |
+| `hoodie.metadata.index.secondary.column`           | (none)  | Set to `<blob_col>.reference.external_path` for cross-FG verification             |
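+
+As a rough illustration, a first rollout on a table with an external blob column might enable the
+cleaner in dry-run mode and point the secondary index at the blob path. The property names are the
+ones proposed in the table above; the column name `video` is hypothetical.
+
+```properties
+# Blob cleanup is planned and logged, but nothing is deleted yet.
+hoodie.cleaner.blob.enabled=true
+hoodie.cleaner.blob.dry.run=true
+
+# Fallback scan settings (defaults shown).
+hoodie.cleaner.blob.external.scan.parallelism=10
+hoodie.cleaner.blob.external.scan.max.candidates=1000
+
+# Secondary index used for cross-FG verification; `video` is a hypothetical blob column.
+hoodie.metadata.index.secondary.column=video.reference.external_path
+```
+
+Once the logged plan looks right, `hoodie.cleaner.blob.dry.run` would be set back to `false`.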
+
+---
+
+## Rollout / Adoption Plan
+
+**Foundation (shared prerequisite).** `CleanPlanner` refactoring (policy methods return
+`FileGroupCleanResult`), sidecar Parquet write/read infrastructure, and the `hasBlobColumns`
+zero-cost gate.
+
+**Stage 1 (per-FG cleanup).** Set-difference logic. Produces local orphan candidates for Stage 2.
+
+**Stage 2 (cross-FG verification) -- priority.** External blobs are the primary initial use case --
+cross-FG verification prevents premature deletion of shared blobs. Requires MDT + record index +
+secondary index on `reference.external_path`. Includes fallback table scan with circuit breaker.
+
+**Sidecar lifecycle.** Write sidecar at planning time, read at execution time, clean up at archival.
+Orphan cleanup at cleaner startup. Rollback cleanup via `extraMetadata`.
+
+**Writer-side conflict check.** `preCommit()` conflict check reads sidecar for concurrency safety.
+Requires extending `ConcurrentOperation` / `TransactionUtils` to include clean actions when blob
+columns exist.
+
+### Backward Compatibility
+
+- All schema changes use nullable fields with null defaults. Existing clean plans and metadata
+  are unaffected.
+- `hasBlobColumns()` gate ensures zero behavioral change for non-blob tables.
+- One prerequisite code change: widen `HoodieSchema.containsBlobType()` visibility from
+  package-private to public (one-line change, no behavioral impact).
+
+---
+
+## Test Plan
+
+### Unit Tests
+
+- **Stage 1 set-difference:** Verify correct orphan identification for COW and MOR file groups,
+  including MOR over-retention (shadowed base refs kept until post-compaction).
+- **Stage 2 index lookup:** Verify short-circuit behavior (stop after first live reference), empty
+  results (globally orphaned), and batched prefix scans.
+- **Stage 2 fallback:** Verify table scan correctness and circuit breaker activation.
+- **Sidecar Parquet write/read:** Verify round-trip of blob delete list through sidecar file.
+- **Writer-side conflict check:** Verify detection of conflicts via sidecar reads for COMPLETED,
+  INFLIGHT, and REQUESTED clean actions.
+
+### Integration Tests
+
+- End-to-end clean cycle with external blob table and MDT secondary index (COW and MOR).
+- Clean cycle with replaced file groups (post-clustering, post-insert_overwrite).
+- Sidecar lifecycle: verify sidecar created at planning, read at execution, deleted at archival.
+
+### Concurrency Tests
+
+- Writer-cleaner race scenarios A-D (from concurrency analysis) with external blobs and sidecar.
+- Concurrent clean + compaction with blob tables.
+
+### Sidecar Lifecycle Tests
+
+- Orphan cleanup: sidecar exists without plan → cleaned at next startup.
+- Rollback cleanup: plan rolled back → sidecar deleted.
+- Archival cleanup: clean instant archived → sidecar deleted.
+- Missing sidecar at execution: graceful handling (skip blob cleanup, log warning).
+
+### Backward Compatibility
+
+- Non-blob table clean cycle produces identical behavior (no sidecar, no `blobCleanStats`).
+- Clean plan deserialization with and without `extraMetadata["blobDeletesPath"]`.
+
+---
+
+## Appendix
+
+- **[Problem Statement, Constraints & Requirements](rfc-100-blob-cleaner-problem.md)**
+  -- Complete problem scope, all 11 constraints (C1-C11), all 9 requirements (R1-R9), 7
+  illustrative failure mode examples, and open questions.
+
+### Why the MDT Secondary Index Maps to Record Keys (Not File Groups)
+
+Stage 2 uses a two-hop lookup: secondary index → record keys → record index → file group locations.
+This is not an artifact of this RFC -- it is the fundamental design of Hudi's secondary index
+([RFC-77](../rfc-77/rfc-77.md)). The rationale:
+
+1. **Secondary keys are non-unique.** Unlike the record index (which maps unique record keys),
+   a secondary index is on arbitrary user columns (e.g., `city`, `status`) where many records
+   share the same value. The composite key format `{secondaryKey}${recordKey}` flattens this
+   non-unique mapping into unique tuples that fit the existing spillable/merge map infrastructure.
+
+2. **Record locations change independently of secondary key values.** Compaction, clustering,
+   and updates move records between file groups. The record index already maintains this mapping
+   correctly. A denormalized `secondary_key → file_group` mapping would duplicate that
+   maintenance burden and risk staleness.
+
+3. **Update handling requires tombstones on old values.** When a record's secondary key changes,
+   the old value may reside in a different file group in the SI partition than the new value.
+   The normalized design handles this with `old-secondary-key → (record-key, deleted)` tombstones,
+   which is simpler than tracking file group transitions directly.
+
+4. **Alternatives were evaluated and rejected.** RFC-77 considered direct `secondary_key →
+   file_group` mapping, Guava MultiMap, Chronicle Map, and separate spillable structures -- all
+   rejected due to complexity, external dependencies, or maintenance cost.
+
+For this RFC, the two-hop cost is negligible: Step 1 (prefix scan) and Step 2 (record index lookup)
+are each a single batched HFile forward-scan, adding ~3-7s total for 2K candidates.
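+
+The two hops can be sketched with in-memory stand-ins. This is a simplified illustration, not
+Hudi's API: `prefix_scan`, `has_live_reference`, the sorted key list, and the dict-based record
+index are all hypothetical, and tombstones and key escaping are ignored.
+
+```python
+from bisect import bisect_left
+from typing import Dict, List, Set
+
+
+def prefix_scan(sorted_keys: List[str], secondary_key: str) -> List[str]:
+    """Hop 1: return record keys whose composite key starts with '{secondaryKey}$'."""
+    prefix = secondary_key + "$"
+    i = bisect_left(sorted_keys, prefix)
+    record_keys = []
+    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
+        record_keys.append(sorted_keys[i][len(prefix):])
+        i += 1
+    return record_keys
+
+
+def has_live_reference(path: str, sorted_keys: List[str],
+                       record_index: Dict[str, str], cleaned_fg_ids: Set[str]) -> bool:
+    """Hop 2: resolve record keys to file groups; stop at the first live reference."""
+    for record_key in prefix_scan(sorted_keys, path):
+        fg_id = record_index.get(record_key)
+        if fg_id is not None and fg_id not in cleaned_fg_ids:
+            return True
+    return False
+
+
+# Toy data: a.mp4 is shared by fg-1 and fg-2, b.mp4 lives only in fg-1.
+si_keys = sorted(["s3://a.mp4$k1", "s3://a.mp4$k2", "s3://b.mp4$k3"])
+record_index = {"k1": "fg-1", "k2": "fg-2", "k3": "fg-1"}
+```
+
+With `fg-1` as the only cleaned file group, `s3://a.mp4` still has a live reference in `fg-2`,
+while `s3://b.mp4` has none and would be treated as globally orphaned.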
diff --git a/rfc/rfc-100/rfc-100-blob-cleaner-problem.md b/rfc/rfc-100/rfc-100-blob-cleaner-problem.md
new file mode 100644
index 0000000000000..ed8c5070b1b84
--- /dev/null
+++ b/rfc/rfc-100/rfc-100-blob-cleaner-problem.md
@@ -0,0 +1,781 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+
+# External Blob Cleaner: Problem Statement
+
+## 1. Goal
+
+When old file slices are cleaned, external out-of-line blob files they reference may become orphaned
+-- still consuming storage but unreachable by any query. The blob cleaner must identify and delete
+these unreferenced blob files without premature deletion (deleting a blob that is still referenced by
+a live record). This document defines the problem scope, design constraints, requirements, and
+illustrative failure modes. It contains no solution content.
+
+This document focuses on **external blobs** -- the Phase 1 use case of RFC-100 where users have
+existing blob files in external storage (e.g., `s3://media-bucket/videos/`) and Hudi manages the
+*references* via the `BlobReference` schema, not the *storage layout*.
+
+---
+
+## 2. Scope
+
+### In scope
+
+- Cleanup of **external out-of-line blob files** (referenced via `BlobReference.external_path`) when
+  references to them exist only in expired (cleaned) file slices.
+- All table types: **COW** and **MOR**.
+- All cleaning policies: `KEEP_LATEST_COMMITS`, `KEEP_LATEST_FILE_VERSIONS`,
+  `KEEP_LATEST_BY_HOURS`.
+- Interaction with table services: **compaction**, **clustering**.
+- Interaction with replace commits: **clustering**, **insert_overwrite**,
+  **insert_overwrite_table**.
+- Interaction with timeline operations: **savepoints**, **rollback**, **restore**, **archival**.
+- Single-writer and multi-writer (OCC) concurrency modes.
+
+| Property                  | External blobs                               |
+|---------------------------|----------------------------------------------|
+| Path uniqueness           | Not guaranteed (user controls)               |
+| Cross-FG sharing          | Common (multiple records, same blob)         |
+| Writer/cleaner race       | Can occur (external paths outside MVCC)      |
+| Delete-and-re-add         | Real concern (user controls paths)           |
+| Per-FG cleanup sufficient | No -- cross-FG verification needed           |
+
+### Out of scope
+
+- **Inline blobs.** Inline blob data lives inside the base/log file and is deleted when the file
+  slice is cleaned. No additional cleanup needed.
+- **Storing blob references in commit metadata.** Persisting blob reference sets within commit-level
+  metadata is an anti-pattern that does not scale and is not considered in this problem statement.
+- **Schema evolution.** Adding or removing blob columns does not change the cleanup problem.
+
+### Stance on the `managed` flag
+
+The `BlobReference` schema includes a `managed` boolean field (`reference.managed`). The RFC states
+that only managed blobs are cleaned. This document acknowledges the flag and treats it as a
+**filter** -- unmanaged blobs are excluded from cleanup consideration. However, the cleanup design
+must be **correct regardless of the flag's value**. The flag selects *which* blobs enter the cleanup
+pipeline; it must not be used as a correctness lever within the pipeline itself. The flag may later
+serve as an optimization (skip cleanup work for unmanaged blobs), but the problem statement and any
+solution must not depend on it for safety.
+
+**Managed-to-unmanaged transitions.** An external blob's `reference.managed` flag can change across
+writes. At time t1, a record references `s3://ext/video.mp4` with `managed=true`. At time t2, an
+update to the same record changes `managed` to `false` for the same path. The blob was managed in
+the expired slice but is unmanaged in the retained slice. The cleaner must decide: does the t1
+managed reference make the blob eligible for cleanup when t1 is expired, even though the t2
+reference says unmanaged?
+
+To handle this safely, if *any* retained reference to the same `external_path` exists (regardless
+of the `managed` flag), the blob must not be deleted. The `managed` flag filters which blob references
+*enter* the cleanup pipeline, but the liveness check must consider *all* references to a path, not
+just managed ones. A transition from managed to unmanaged is effectively the user saying "Hudi
+should stop managing this blob" -- the blob must survive the transition.
+
+The inverse case (unmanaged at t1, managed at t2) is straightforward: the blob is now managed and
+subject to cleanup when t2's file slice eventually expires.
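+
+The rule above can be stated compactly in code. The following is a minimal sketch under assumed
+names (`BlobRef` and `deletable_paths` are hypothetical and not part of Hudi): the `managed` flag
+filters only the candidate side, while liveness considers every retained reference.
+
+```python
+from typing import Iterable, NamedTuple, Set
+
+
+class BlobRef(NamedTuple):
+    external_path: str
+    managed: bool
+
+
+def deletable_paths(expired: Iterable[BlobRef], retained: Iterable[BlobRef]) -> Set[str]:
+    # Only managed references may enter the cleanup pipeline.
+    candidates = {ref.external_path for ref in expired if ref.managed}
+    # Liveness ignores the flag: any retained reference keeps the blob alive.
+    live = {ref.external_path for ref in retained}
+    return candidates - live
+
+
+expired = [
+    BlobRef("s3://ext/video.mp4", True),   # managed at t1
+    BlobRef("s3://ext/old.mp4", True),     # managed, no longer referenced
+    BlobRef("s3://ext/user.mp4", False),   # never managed
+]
+retained = [BlobRef("s3://ext/video.mp4", False)]  # flipped to unmanaged at t2
+```
+
+Here only `s3://ext/old.mp4` is deletable: `video.mp4` survives the managed-to-unmanaged
+transition, and `user.mp4` never enters the pipeline.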
+
+---
+
+## 3. Background: Existing Cleaner
+
+The existing Hudi cleaner provides the execution framework that blob cleanup must integrate with.
+
+### Plan-execute model
+
+Cleaning is a two-phase operation:
+
+1. **Plan** (`CleanPlanner`): For each partition and file group, determine which file slices are
+   expired based on the cleaning policy. Produce a `HoodieCleanerPlan` listing file paths to delete.
+2. **Execute** (`CleanActionExecutor`): Delete the files listed in the plan. Record results in
+   `HoodieCleanMetadata` on the timeline.
+
+### Per-partition, per-file-group iteration
+
+The cleaner iterates file groups within a partition. For each file group, it compares file slices
+against the retention policy and produces a list of file paths to delete. The cleaner has no concept
+of cross-file-group dependencies.
+
+### Savepoint awareness
+
+The cleaner collects all savepointed timestamps and their associated data files. File slices that
+overlap with savepointed files are excluded from cleaning. This preserves the savepoint invariant: a
+savepoint freezes a consistent snapshot including all data files it references.
+
+### OCC conflict resolution
+
+Concurrent writer conflict resolution operates at the `(partition, fileId)` granularity. There is
+no global serialization point. Concurrent writers to different file groups proceed without
+contention.
+
+### MVCC: Cleaner does not conflict with writes
+
+Under Hudi's MVCC design, the cleaner and writers operate on non-overlapping sets of file slices.
+Writers create new file slices; the cleaner deletes old (expired) ones. The cleaner only targets
+file slices that are older than the retention boundary, which by definition are not being written to.
+Consequently, the cleaner never conflicts with concurrent writers in the existing design -- there is
+no write-clean contention on the same file slice.
+
+Blob cleanup must preserve this property: it must not introduce scenarios where the cleaner and a
+writer contend over the same blob file. However, external blob files are **not** covered by MVCC --
+a writer's new file slice may reference an external blob that the cleaner is simultaneously
+evaluating for deletion. This gap is a central problem for external blob cleanup.
+
+### Rollback vs. Restore
+
+Hudi has two distinct undo operations that affect blob liveness:
+
+1. **Rollback of an inflight commit:** Removes the effects of a single uncommitted write and reverts
+   the affected file groups to their previous state. Scope is limited to one commit.
+2. **Table restore to a savepoint:** Reverts the table to a prior consistent snapshot, potentially
+   undoing multiple committed writes. Scope can affect many file groups across many partitions.
+
+Blob cleanup must handle both: rollback may remove the sole reference to a blob (making it orphaned)
+or resurrect a previously-shadowed reference, while restore may revert the table to a state where
+different blob references are live. These are distinct failure modes with different scopes and are
+addressed separately in the constraints.
+
+### Critical gap
+
+The existing cleaner operates on file paths (base files + log files) within a single file group. It
+has **no concept of transitive references** -- it does not know that a file slice contains pointers
+to external blob files that may need separate cleanup. Blob cleanup requires extending the cleaner
+to follow these references and determine blob-level liveness.
+
+---
+
+## 4. Design Constraints
+
+Each constraint is a fact about the Hudi system that any blob cleanup solution must respect. Violating
+any constraint leads to data corruption, premature deletion, or permanent orphans.
+
+### C1: Blob immutability
+
+Once a blob file is written, its content never changes. Blob files are write-once, read-many. This
+means a blob file's identity is stable for its entire lifetime. The path itself, however, can later
+be reused for a different blob (see C2).
+
+*Source: RFC-100, general storage semantics.*
+
+### C2: Delete-and-re-add same path
+
+A blob file can be deleted from external storage and a new file created at the same path with
+different content. Since the user controls external blob paths, path reuse is a real concern. The
+cleanup algorithm must not assume that two references to the same `reference.external_path` at
+different times refer to the same physical content.
+
+**Concurrent writer caveat.** Consider two concurrent writers, A and B, both referencing external
+blob X at the same path. Writer A commits first. Later, the cleaner evaluates A's expired slice and
+marks blob X as a deletion candidate. Meanwhile, Writer B is still inflight and also references
+blob X. If the cleaner deletes blob X before Writer B commits, Writer B's commit creates a dangling
+reference. This race exists because external blob paths are not under Hudi's control -- unlike file
+slices, the cleaner cannot rely on MVCC guarantees for external blob files. Any cleanup solution
+must account for inflight writers that may reference the same external blob path being considered
+for deletion.
+
+*Source: RFC-100; external blob storage semantics.*
+
+### C3: Cross-file-group blob sharing
+
+An external blob can be referenced by records in multiple file groups and multiple partitions. This
+is explicitly supported: two records in different file groups can point to the same
+`reference.external_path`. Cross-file-group sharing is the **common case** for external blobs
+(e.g., a shared media library where a popular video appears in multiple user playlists across
+partitions). Any cleanup algorithm that assumes blobs are scoped to a single file group will produce
+premature deletions.
+
+*Source: RFC-100 lines 196-198 (Option 1 scans all active file slices).*
+
+### C4: MOR log updates shadow base file blob refs
+
+In MOR tables, a log file update to a record's blob reference supersedes the base file's blob
+reference for that record. The base file's blob ref appears live (it exists in an active file slice)
+but is actually dead (the log update replaced it). Reading only the base file produces a **superset**
+of live references. Over-retention (keeping the shadowed blob longer) is safe. Under-retention
+(treating the log-shadowed base ref as already cleaned) would cause premature deletion if the log
+update is later rolled back.
+
+*Source: RFC-100 line 122 (merge mode determines which blob reference is returned); MOR semantics.*
+
+### C5: Existing cleaner is per-file-group scoped
+
+The cleaner iterates per file group within each partition. It determines expired file slices within
+a single file group. There is no existing mechanism to evaluate cross-file-group dependencies during
+cleaning.
+
+*Source: `CleanPlanner.getDeletePaths()`, per-file-group iteration in
+`getFilesToCleanKeepingLatestCommits()`.*
+
+### C6: OCC is per-file-group (no global contention allowed)
+
+Concurrent writer conflict resolution operates at `(partition, fileId)` granularity. Any solution
+that introduces a global contention point (global counter, global lock, global bitmap) violates this
+constraint and degrades write throughput under concurrency.
+
+*Source: Concurrent writer conflict resolution strategy; per-file-group OCC semantics.*
+
+### C7: Replace commits move blob refs between file groups
+
+Several operations produce `replacecommit` actions that replace one set of file groups with another:
+
+- **Clustering:** Reads records from source file groups and rewrites them to target file groups. For
+  external blobs, clustering copies the pointer (same `reference.external_path`) to the target file
+  group. After clustering, the source file group's slices still reference the original external
+  blobs until those slices are cleaned. The target file group's slices reference the same external
+  blobs.
+
+- **insert_overwrite / insert_overwrite_table:** Replaces an entire partition (or table) with new
+  data. The replacement records may reference entirely different external blobs, or the same ones,
+  or a mix. The replaced file groups become eligible for cleaning after retention expires.
+
+In all replace commit scenarios, the key property is: after the replace, both the old (replaced) and
+new (replacement) file groups may reference the same external blob. The cleaner must not delete an
+external blob referenced by a replaced file group if the replacement file group (or any other active
+file group) still references it.
+
+For external blobs, the append-vs-replace distinction does not affect blob identity. The cleaner's
+question is always: "Is this `external_path` referenced by any active file slice anywhere in the
+table?" Whether the reference arrived via an append, an upsert, a clustering copy, or an
+insert_overwrite is irrelevant -- if any live reference exists, the blob must not be deleted.
+
+*Source: Hudi replace commit semantics; clustering and insert_overwrite operations.*
+
+### C8: Savepoints freeze file slices and their blob refs
+
+A savepoint preserves a consistent snapshot. File slices covered by a savepoint are excluded from
+cleaning. This means any blob referenced by a savepointed file slice must also be preserved, even if
+the blob would otherwise be considered orphaned. The cleaner already handles savepoint exclusion for
+file slices; blob cleanup must extend this guarantee to the blobs they reference.
+
+*Source: Savepoint handling in CleanPlanner.*
+
+### C9: Rollback and restore can invalidate or resurrect references
+
+Two distinct undo operations affect blob liveness:
+
+**Rollback of an inflight commit:** Can remove file slices that were the sole reference to a blob
+(the blob becomes orphaned). Conversely, rolling back a commit that updated a record's blob
+reference can resurrect the previous reference (an older blob that appeared orphaned is now live
+again).
+
+**Table restore to a savepoint:** Can undo multiple committed writes simultaneously. All blob
+references introduced after the restore point become orphaned. All blob references that were live at
+the restore point are resurrected. The scope is broader than single-commit rollback: it may affect
+many file groups across many partitions simultaneously.
+
+Any blob cleanup solution must account for both directions (orphaning and resurrection) under both
+operations.
+
+*Source: Hudi rollback and restore semantics; timeline management.*
+
+### C10: Archival removes commit metadata from active timeline
+
+Hudi's archival process moves completed commits from the active timeline to the archived timeline.
+If blob cleanup depends on information in commit metadata (e.g., which blobs were written by a
+commit), that information becomes unavailable after archival unless it is persisted elsewhere. The
+cleaner must either complete blob reference resolution before archival, or ensure the necessary
+information survives archival.
+
+Note: Storing blob reference sets within commit metadata would compound this problem -- commit
+metadata grows with the number of blob references, and archival would either lose or have to
+specially preserve this information. This is an additional reason not to use commit metadata as the
+source of truth for blob reference liveness.
+
+*Source: Hudi archival semantics.*
+
+### C11: External blobs require cross-file-group verification at scale
+
+For external blobs, cross-file-group blob sharing (C3) is the **common case**, not an edge case.
+Users referencing external blobs (e.g., a shared media library) will frequently have multiple
+records across different file groups and partitions pointing to the same blob file. Any cleanup
+algorithm that treats cross-FG verification as a rare fallback will impose disproportionate cost on
+external blob workloads. The cross-FG verification path must be designed for volume, not just
+correctness.
+
+*Source: C3 (cross-FG sharing is common for external blobs).*
+
+---
+
+## 5. Requirements
+
+### R1: No premature deletion (hard invariant)
+
+A blob file must not be deleted while any live record still references it. This is the single most
+critical requirement. A premature deletion causes silent data corruption: queries return null or error
+for the affected records, and the data is unrecoverable.
+
+### R2: No permanent orphans (bounded cleanup)
+
+Every orphaned blob must eventually be cleaned. The number of cleanup cycles required to reclaim an
+orphan must be bounded (e.g., cleaned within N cleaner invocations after the last referencing file
+slice is expired). Unbounded accumulation of orphaned blobs wastes storage indefinitely.
+
+### R3: MOR correctness
+
+For MOR tables, blob cleanup must be safe in the presence of log updates that shadow base file blob
+references. Over-retention (keeping a shadowed blob until post-compaction) is acceptable.
+Under-retention (prematurely deleting a blob whose reference appears shadowed but could be resurrected
+by rollback) is not.
+
+### R4: Concurrency safety (no global serialization)
+
+Blob cleanup must not introduce global contention points. Write throughput for tables with blobs must
+not degrade compared to tables without blobs under concurrent writers. Per-file-group scoping (C6)
+must be preserved.
+
+### R5: Scale proportional to work, not table size
+
+Cross-FG verification is required for external blobs (C11), but the cost must be proportional to the
+number of **candidate blobs requiring verification**, not the total number of active file slices in
+the table. A table with 100K file groups where 50 external blob candidates need cross-FG
+verification must not scan all 100K file groups -- it must use targeted lookups or indexes to resolve
+those 50 candidates efficiently.
+
+### R6: No cost for non-blob tables
+
+Tables without blob columns must pay zero additional cost. The blob cleanup path must not be entered
+if no blob columns exist. This includes no additional metadata, no additional timeline entries, and no
+additional I/O.
+
+### R7: All cleaning policies supported
+
+Blob cleanup must work correctly under all three cleaning policies: `KEEP_LATEST_COMMITS`,
+`KEEP_LATEST_FILE_VERSIONS`, and `KEEP_LATEST_BY_HOURS`. The blob cleanup logic should be
+policy-agnostic -- it operates on the set of expired vs. retained file slices determined by the
+policy, not on the policy itself.
+
+### R8: Crash safety and idempotency
+
+If the cleaner crashes after planning but before completing all deletions, restarting must be safe.
+Blob deletions must be idempotent (deleting an already-deleted file is a no-op, not an error).
+The cleaner plan must include enough information to resume blob cleanup without re-reading expired
+file slices (which may no longer exist after a partial execution).
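+
+As an illustration of the idempotency requirement, the sketch below treats "file already gone" as
+success. It uses the local filesystem as a stand-in for external storage, and
+`delete_blob_idempotent` is a hypothetical helper, not an existing Hudi method.
+
+```python
+import os
+
+
+def delete_blob_idempotent(path: str) -> bool:
+    """Delete a blob; return True if it was removed now, False if it was already gone."""
+    try:
+        os.remove(path)
+        return True
+    except FileNotFoundError:
+        # A previous (crashed) attempt already deleted it: a no-op, not an error.
+        return False
+```
+
+Re-executing a partially completed plan then reduces to calling the helper for every path in the
+plan, regardless of how far the previous attempt got.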
+
+### R9: Observability
+
+Blob cleanup must report metrics: number of blob files deleted, number of blob files retained
+(over-retained due to MOR), and total storage reclaimed. These metrics enable operators to understand
+blob storage growth and cleanup effectiveness.
+
+---
+
+## 6. Illustrative Examples
+
+Each example demonstrates a specific failure mode. These are not exhaustive -- they are designed to
+make the constraints and requirements concrete.
+
+### Example 1: Cross-file-group sharing -- per-FG cleanup deletes shared blob
+
+**Demonstrates:** C3, C5, R1
+
+```
+Setup:
+  Partition P1, File Group FG-1:
+    Slice @t1: row1.blob_ref = {external_path: "s3://shared/video.mp4", managed: true}
+
+  Partition P2, File Group FG-2:
+    Slice @t1: row2.blob_ref = {external_path: "s3://shared/video.mp4", managed: true}
+
+Action:
+  Cleaner expires FG-1's slice @t1 (no retained slices in FG-1).
+
+Per-FG cleanup (incorrect):
+  FG-1 expired refs = {"s3://shared/video.mp4"}
+  FG-1 retained refs = {}
+  Orphaned within FG-1 = {"s3://shared/video.mp4"}
+  -> DELETE s3://shared/video.mp4
+
+Result:
+  FG-2 still has an active slice @t1 referencing video.mp4.
+  Query on row2 -> FILE NOT FOUND. Data corruption.
+
+Correct behavior:
+  Before deleting, verify that no other active file slice in any file group
+  references s3://shared/video.mp4. FG-2's active slice references it, so
+  the blob must be retained.
+```
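+
+Example 1 can be replayed with plain sets. This is a sketch only: the function names are
+hypothetical, and the global check below naively unions all active references, which is exactly
+the cost that R5 asks a real solution to avoid.
+
+```python
+from typing import Dict, Set
+
+
+def local_orphans(expired: Set[str], retained: Set[str]) -> Set[str]:
+    """Per-FG set difference: refs seen only in expired slices of one file group."""
+    return expired - retained
+
+
+def globally_orphaned(candidates: Set[str], active_refs_by_fg: Dict[str, Set[str]]) -> Set[str]:
+    """Drop any candidate still referenced by an active slice in any file group."""
+    live = set().union(*active_refs_by_fg.values())
+    return candidates - live
+
+
+# FG-1's only slice is expired; FG-2 still references the same video.
+fg1_candidates = local_orphans({"s3://shared/video.mp4"}, set())
+active_refs = {"FG-2": {"s3://shared/video.mp4"}}
+```
+
+The per-FG step alone would mark `video.mp4` for deletion; the cross-FG step removes it from the
+candidate set because FG-2 still references it.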
+
+### Example 2: Delete-and-re-add -- path reuse causes identity confusion
+
+**Demonstrates:** C2, R1
+
+```
+Setup:
+  At time t1: User writes row1 with
+    blob_ref = {external_path: "s3://user/photo.jpg", managed: true}
+  At time t2: User deletes the file at s3://user/photo.jpg externally
+  At time t3: User writes row2 with
+    blob_ref = {external_path: "s3://user/photo.jpg", managed: true}
+    (new file at same path, different content)
+
+Cleanup scenario:
+  Cleaner expires slice @t1. Slice @t3 is retained.
+  Expired refs = {"s3://user/photo.jpg"}
+  Retained refs = {"s3://user/photo.jpg"}
+  Same path in both sets -> retain. (Correct by coincidence.)
+
+  But consider: if the cleaner had cached blob identity by path and assumed
+  "same path = same blob," it would not detect that the t1 and t3 references
+  point to different physical content.
+
+  Edge case: if slice @t3 is also expired and no retained slice references the
+  path, the cleaner would correctly delete. But if a new writer at t4 references
+  the same path AGAIN (third incarnation), the cleaner's identity model must not
+  confuse the three incarnations.
+
+Concurrent writer scenario (C2 caveat):
+  Writer A commits at t1, referencing s3://user/photo.jpg (managed=true).
+  Writer B starts at t0, also referencing s3://user/photo.jpg (managed=true).
+  Cleaner expires A's slice at t1. B is still inflight.
+  Cleaner deletes s3://user/photo.jpg.
+  Writer B commits -> dangling reference. Data corruption.
+
+  This race is specific to external blobs: the cleaner cannot rely on MVCC
+  guarantees for external blob files that are outside Hudi's control.
+```
+
+### Example 3: MOR log shadow -- base file ref appears live when superseded
+
+**Demonstrates:** C4, R3
+
+```
+Setup (MOR table):
+  File Group FG-1:
+    Base file @t1: row1.blob_ref = {external_path: "s3://ext/blob_A.bin", managed: true}
+    Log file @t2: row1.blob_ref = {external_path: "s3://ext/blob_B.bin", managed: true}
+
+  After merge: row1's effective blob_ref points to blob_B.bin.
+  blob_A.bin is no longer referenced by any live record.
+
+Cleanup scenario (pre-compaction):
+  Cleaner does not expire slice @t1 (it's retained).
+  Reading blob refs from the retained slice:
+    Base @t1: {"s3://ext/blob_A.bin"}
+    Log @t2: {"s3://ext/blob_B.bin"}
+    Union: {"s3://ext/blob_A.bin", "s3://ext/blob_B.bin"}
+
+  blob_A.bin appears live (it's in the retained set) even though it's been
+  superseded by the log update. This is over-retention -- safe but wasteful.
+
+After compaction:
+  Compacted base @t3: row1.blob_ref = {external_path: "s3://ext/blob_B.bin", managed: true}
+  Now the only retained ref is {"s3://ext/blob_B.bin"}.
+  blob_A.bin is no longer in any retained set -> eligible for deletion.
+
+Why over-retention is the correct default:
+  If the log file @t2 is rolled back, row1 reverts to blob_A.bin from the base
+  file. If blob_A.bin had been prematurely deleted, the rollback produces a
+  dangling reference. Over-retention prevents this.
+```
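
The over-retention in Example 3 can be reproduced with a small sketch. It assumes a simplified model in which each base or log file is a mapping from record key to blob path; the function names are hypothetical and are not Hudi APIs.

```python
from typing import Dict, List, Set

# Hypothetical model: one file = {record key -> external blob path}.
Refs = Dict[str, str]


def conservative_retained_refs(base: Refs, logs: List[Refs]) -> Set[str]:
    """Union of every blob path seen in the base file and all log files."""
    paths = set(base.values())
    for log in logs:
        paths.update(log.values())
    return paths


def merged_effective_refs(base: Refs, logs: List[Refs]) -> Set[str]:
    """Blob paths after applying logs in order (latest write per key wins)."""
    merged = dict(base)
    for log in logs:
        merged.update(log)
    return set(merged.values())


base_t1 = {"row1": "s3://ext/blob_A.bin"}
log_t2 = {"row1": "s3://ext/blob_B.bin"}

conservative = conservative_retained_refs(base_t1, [log_t2])
effective = merged_effective_refs(base_t1, [log_t2])

# Paths kept alive only because the retained set is a union, not a merge.
over_retained = conservative - effective
```

Here `over_retained` contains only `blob_A.bin`: the union keeps it alive until compaction, which is what makes a rollback of the log at t2 safe.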
+
+### Example 4: Writer-cleaner race -- three scenarios
+
+**Demonstrates:** C6, R1, R4
+
+```
+A writer and cleaner operate concurrently on the same table.
+
+Note: Under Hudi's MVCC design, the cleaner and writers operate on non-overlapping
+file slices -- the cleaner never conflicts with writers on file slice operations.
+However, external blob files are NOT covered by MVCC: a writer's new file slice may
+reference an external blob that the cleaner is simultaneously evaluating for deletion.
+The following scenarios illustrate this gap.
+
+Scenario A: Writer commits BEFORE cleaner's timeline fence
+  t1: Writer starts, references blob_X (external)
+  t2: Writer commits (blob_X is now in a retained slice)
+  t3: Cleaner plans cleanup
+  t4: Cleaner checks timeline fence -- sees writer's commit at t2
+  t5: Cleaner removes blob_X from orphan candidates
+  -> Safe. Timeline fence catches the new reference.
+
+Scenario B: Writer commits AFTER cleaner's timeline fence, BEFORE delete
+  t1: Cleaner plans cleanup, blob_X is a candidate for deletion
+  t2: Cleaner checks timeline fence -- no new commits
+  t3: Writer commits, referencing blob_X (external)
+  t4: Cleaner deletes blob_X
+  -> UNSAFE. The timeline fence did not see the writer's commit.
+     blob_X is deleted, but the writer's new slice references it.
+
+Scenario C: Writer commits AFTER cleaner deletes
+  t1: Cleaner plans and executes, deletes blob_X
+  t2: Writer commits, referencing blob_X (e.g., user-provided external path)
+  -> UNSAFE. The blob is already gone. The writer's commit creates a dangling
+     reference.
+
+Scenarios B and C are real concerns for external blobs where the user can
+reference any path. Any solution must close this gap -- for example, via a
+writer-side conflict check at commit time.
+```
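
As an illustration of why the timeline fence covers Scenario A but not Scenario B, the following sketch models the fence as a simple visibility cutoff. The `Commit` type, the integer times, and `surviving_candidates` are assumptions made for this example, not part of the design.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Commit:
    time: int                  # hypothetical logical commit time
    blob_refs: FrozenSet[str]  # external blob paths the commit references


def surviving_candidates(candidates: Set[str],
                         commits: List[Commit],
                         fence_time: int) -> Set[str]:
    """Drop candidates referenced by commits visible at the fence."""
    visible: Set[str] = set()
    for commit in commits:
        if commit.time <= fence_time:
            visible |= commit.blob_refs
    return candidates - visible


candidates = {"blob_X"}

# Scenario A: writer commits at t2, cleaner checks the fence at t4.
scenario_a = surviving_candidates(
    candidates, [Commit(2, frozenset({"blob_X"}))], fence_time=4)

# Scenario B: cleaner checks the fence at t2, writer commits at t3.
scenario_b = surviving_candidates(
    candidates, [Commit(3, frozenset({"blob_X"}))], fence_time=2)
```

In Scenario A the candidate set is empty, so nothing is deleted. In Scenario B `blob_X` survives the fence check and would be deleted even though the writer's commit references it.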
+
+### Example 5: Replace commits move refs -- replaced FG appears to have no retained slices
+
+**Demonstrates:** C7, C3, R1
+
+```
+Setup:
+  File Group FG-1:
+    Slice @t1: row1.blob_ref = {external_path: "s3://ext/video.mp4", managed: true}
+
+  Clustering at t2 rewrites FG-1's records to FG-2:
+    File Group FG-2:
+      Slice @t2: row1.blob_ref = {external_path: "s3://ext/video.mp4", managed: true}
+                 (external: pointer copied, same blob)
+
+  FG-1 is now a replaced file group. Its slice @t1 is eligible for cleaning
+  after the retention policy expires.
+
+Cleanup (incorrect per-FG only):
+  Cleaner cleans FG-1's slice @t1.
+  Expired refs = {"s3://ext/video.mp4"}
+  Retained refs within FG-1 = {} (FG-1 has no retained slices -- it's replaced)
+  s3://ext/video.mp4 appears orphaned within FG-1 -> DELETE
+
+  FG-2's live reference is destroyed. Data corruption.
+
+Alternative scenario -- insert_overwrite:
+  At t2, an insert_overwrite replaces partition P1 with new data in FG-3.
+  FG-3's records also reference s3://ext/video.mp4 (same external blob).
+  FG-1 is replaced. Same per-FG cleanup error: FG-1 concludes the blob is
+  orphaned, but FG-3 still references it.
+
+Correct behavior:
+  For external blobs, a cross-FG check is required before deletion, regardless
+  of which replace commit type (clustering, insert_overwrite, insert_overwrite_table)
+  created the replacement. Per-FG cleanup alone is never sufficient for external blobs.
+```
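
The difference between per-FG cleanup and cross-FG verification in Example 5 can be sketched as two set operations. This is a brute-force illustration under assumed data structures (a dict from file group ID to retained blob paths); it scans every file group and is not meant to satisfy R5.

```python
from typing import Dict, Set


def local_orphans(expired: Set[str], retained: Set[str]) -> Set[str]:
    """Per-FG set difference: expired refs not present in retained refs."""
    return expired - retained


def globally_orphaned(candidates: Set[str],
                      retained_by_fg: Dict[str, Set[str]],
                      cleaned_fg: str) -> Set[str]:
    """Keep only candidates that no other file group still references."""
    still_live: Set[str] = set()
    for fg, refs in retained_by_fg.items():
        if fg != cleaned_fg:
            still_live |= refs
    return candidates - still_live


VIDEO = "s3://ext/video.mp4"

# FG-1 was replaced by clustering, so it has no retained slices.
retained_by_fg = {"FG-1": set(), "FG-2": {VIDEO}}

candidates = local_orphans({VIDEO}, retained_by_fg["FG-1"])
to_delete = globally_orphaned(candidates, retained_by_fg, cleaned_fg="FG-1")
```

The per-FG step alone yields `video.mp4` as a candidate; the cross-FG step removes it because FG-2 still references it.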
+
+### Example 6: External blobs at scale -- cross-FG verification is the common path
+
+**Demonstrates:** C3, C11, R5
+
+```
+Setup:
+  A media company stores 10M video files in s3://media-library/.
+  They create a Hudi table with a blob column referencing these videos.
+  Hudi manages refs, not storage layout.
+
+  The table has 50K file groups across 1K partitions.
+  Many videos are referenced by multiple records (e.g., a popular video
+  appears in multiple user playlists across different partitions).
+
+  Partition users/alice, FG-101:
+    Slice @t1: row1.blob_ref = {external_path: "s3://media-library/video_X.mp4", managed: true}
+
+  Partition users/bob, FG-202:
+    Slice @t1: row2.blob_ref = {external_path: "s3://media-library/video_X.mp4", managed: true}
+
+  Partition users/carol, FG-303:
+    Slice @t1: row3.blob_ref = {external_path: "s3://media-library/video_X.mp4", managed: true}
+
+Action:
+  Cleaner expires FG-101's slice @t1 (alice deleted her playlist entry).
+
+Naive per-FG cleanup (incorrect):
+  FG-101 expired refs = {"s3://media-library/video_X.mp4"}
+  FG-101 retained refs = {}
+  Orphaned within FG-101 -> DELETE video_X.mp4
+  Bob and Carol lose their video. Data corruption.
+
+Naive full-table scan (correct but expensive):
+  To verify video_X.mp4 is safe to delete, scan ALL 50K file groups
+  for references. This is correct but violates R5 -- the cost is
+  proportional to table size, not to the number of candidates.
+
+Scale concern:
+  If the cleaner expires 500 file groups and produces 2,000 external
+  blob candidates, and each candidate requires a full-table scan,
+  the cleanup cost is 2,000 * 50K = 100M file group checks.
+  This is prohibitive.
+
+Correct behavior:
+  Cross-FG verification for external blobs must use a targeted
+  mechanism (e.g., index lookup, partitioned scan with predicate
+  pushdown) that scales with the number of candidates, not with
+  the total table size. The mechanism must be a first-class design
+  element, not a fallback path.
+```
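
One way to picture a targeted mechanism is an inverted index from blob path to referencing file groups, so that verification costs one lookup per candidate. The sketch below is an in-memory stand-in under that assumption; `build_path_index`, `verify_candidates`, and `video_Y.mp4` are hypothetical and do not describe how an MDT index would be built or maintained.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


def build_path_index(retained_by_fg: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Invert retained refs: blob path -> file groups still referencing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for fg, paths in retained_by_fg.items():
        for path in paths:
            index[path].add(fg)
    return index


def verify_candidates(candidates: Set[str],
                      index: Dict[str, Set[str]]) -> Tuple[Set[str], int]:
    """One lookup per candidate; returns (orphaned paths, lookups made)."""
    orphaned: Set[str] = set()
    lookups = 0
    for path in candidates:
        lookups += 1
        if not index.get(path):  # .get() does not create missing entries
            orphaned.add(path)
    return orphaned, lookups


VIDEO_X = "s3://media-library/video_X.mp4"
VIDEO_Y = "s3://media-library/video_Y.mp4"  # hypothetical: only FG-101 referenced it

# Retained refs after FG-101's slice @t1 expires.
retained_by_fg = {"FG-101": set(), "FG-202": {VIDEO_X}, "FG-303": {VIDEO_X}}
index = build_path_index(retained_by_fg)
orphaned, lookups = verify_candidates({VIDEO_X, VIDEO_Y}, index)

# Cost comparison using the numbers from the scale concern above.
full_scan_checks = 2_000 * 50_000  # candidates * file groups
index_lookups = 2_000              # one lookup per candidate
```

With the index, `video_X.mp4` is retained because FG-202 and FG-303 still reference it, and the work done is proportional to the two candidates rather than to the number of file groups.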
+
+### Example 7: MOR log-chain transient blob -- introduced and superseded within logs
+
+**Demonstrates:** C4, R2
+
+```
+Setup (MOR table):
+  File Group FG-1:
+    Base file @t1: row1.blob_ref = {external_path: "s3://ext/blob_A.bin", managed: true}
+    Log file @t2: row1.blob_ref = {external_path: "s3://ext/blob_B.bin", managed: true}
+    Log file @t3: row1.blob_ref = {external_path: "s3://ext/blob_C.bin", managed: true}
+
+  After merge: row1's effective blob_ref points to blob_C.bin.
+  blob_B.bin was introduced at t2 and superseded at t3 -- it exists ONLY in log @t2.
+
+After compaction @t4:
+  Compacted base @t4: row1.blob_ref = {external_path: "s3://ext/blob_C.bin", managed: true}
+  The pre-compaction slice (base @t1 + logs @t2, @t3) is now expired.
+
+Cleanup scenario:
+  Cleaner expires the pre-compaction slice.
+  Retained slice = compacted base @t4, refs = {"s3://ext/blob_C.bin"}.
+
+  If expired slice reads only the base file:
+    expired_refs = {"s3://ext/blob_A.bin"}  (from base @t1)
+    local_orphans = {"s3://ext/blob_A.bin"} - {"s3://ext/blob_C.bin"}
+                  = {"s3://ext/blob_A.bin"}
+    blob_A.bin is correctly identified as orphaned.
+    blob_B.bin is MISSED -- it exists only in expired log @t2.
+    blob_B.bin becomes a permanent orphan (R2 violation).
+
+  If expired slice reads base + log files:
+    expired_refs = {"s3://ext/blob_A.bin", "s3://ext/blob_B.bin", "s3://ext/blob_C.bin"}
+    local_orphans = {"s3://ext/blob_A.bin", "s3://ext/blob_B.bin", "s3://ext/blob_C.bin"}
+                    - {"s3://ext/blob_C.bin"}
+                  = {"s3://ext/blob_A.bin", "s3://ext/blob_B.bin"}
+    Both orphaned blobs are correctly identified and deleted.
+
+Why this matters:
+  Transient blob refs that are introduced and superseded entirely within
+  the log chain never appear in any base file. They can only be discovered
+  by reading the expired log files. Without log reads on the expired side,
+  every such transient blob becomes a permanent orphan that accumulates
+  storage indefinitely.
+```
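
The two branches of Example 7 reduce to whether log files contribute to the expired ref set. A minimal sketch, again assuming each file is a mapping from record key to blob path (the helper name `slice_refs` is hypothetical):

```python
from typing import Dict, List, Set

Refs = Dict[str, str]  # hypothetical model: record key -> blob path

A = "s3://ext/blob_A.bin"
B = "s3://ext/blob_B.bin"
C = "s3://ext/blob_C.bin"


def slice_refs(base: Refs, logs: List[Refs], include_logs: bool) -> Set[str]:
    """Collect blob paths from a slice, optionally reading its log files."""
    refs = set(base.values())
    if include_logs:
        for log in logs:
            refs.update(log.values())
    return refs


expired_base = {"row1": A}                 # base @t1
expired_logs = [{"row1": B}, {"row1": C}]  # logs @t2, @t3
retained_refs = {C}                        # compacted base @t4

orphans_base_only = slice_refs(expired_base, expired_logs, False) - retained_refs
orphans_with_logs = slice_refs(expired_base, expired_logs, True) - retained_refs

# Blobs that leak permanently if expired log files are not read.
missed = orphans_with_logs - orphans_base_only
```

Reading only the base file finds `blob_A.bin`; reading base plus logs also finds `blob_B.bin`, which is the transient blob that would otherwise leak.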
+
+### Example 9: Why blobFilesToDelete must be in the plan -- writer-cleaner conflict resolution
+
+**Demonstrates:** C7, R1, R5 (extends Example 4, Scenario B)
+
+```
+Setup:
+  File Group FG-1 (partition users/alice):
+    Slice @t1 (expired): row1.blob_ref = (s3://ext/video.mp4, managed=true)
+    Slice @t3 (retained): row1.blob_ref = (s3://ext/photo.png, managed=true)
+    video.mp4 is locally orphaned in FG-1 (updated to photo.png at t3).
+    No other FG references video.mp4 at plan time.
+
+  File Group FG-2 (partition users/bob):
+    (exists, but does not reference video.mp4 yet)
+
+Approach A: blobFilesToDelete NOT in plan (execution-time computation)
+  t1: Cleaner plans at timeline fence T.
+      Plan = {filePathsToBeDeleted: [FG-1/@t1]}
+      No blob info written to plan. Plan goes to timeline as REQUESTED.
+
+  t2: Writer commits to FG-2, adds row2.blob_ref = (s3://ext/video.mp4).
+      Writer checks for conflicts: the clean plan on the timeline has no
+      blob info -- only filePathsToBeDeleted for FG-1. Writer is on FG-2.
+      No conflict detected. Writer succeeds.
+
+  t3: Cleaner transitions to INFLIGHT. Executor computes blob deletes:
+      Reads FG-1/@t1 -> expired_refs = {video.mp4}
+      Reads FG-1/@t3 -> retained_refs = {photo.png}
+      video.mp4 locally orphaned -> Stage 2 cross-FG check at fence T
+      -> does NOT see writer's commit at t2 -> globally orphaned -> DELETE
+
+  Result: FG-2 row2 now has a dangling reference to video.mp4.
+  Bob queries his data and gets a missing blob error. Data corruption.
+
+  Why it cannot be fixed: the blob delete decision existed only in the
+  executor's memory. There was no artifact on the timeline for the writer's
+  conflict resolution to check against. The cross-FG conflict was invisible.
+
+Approach B: blobFilesToDelete IN the plan (plan-time computation)
+  t1: Cleaner plans at timeline fence T.
+      Stage 1: video.mp4 locally orphaned in FG-1.
+      Stage 2: cross-FG check -> no other FG references video.mp4.
+      Plan = {filePathsToBeDeleted: [FG-1/@t1],
+              blobFilesToDelete: [s3://ext/video.mp4]}
+      Plan goes to timeline as REQUESTED.
+
+  t2: Writer commits to FG-2, adds row2.blob_ref = (s3://ext/video.mp4).
+      Writer's conflict resolution checks inflight/requested clean plan.
+      Sees blobFilesToDelete contains video.mp4 -- the same blob the
+      writer is referencing. CONFLICT DETECTED. Writer aborts and retries
+      after the clean cycle completes (or clean plan is rolled back).
+      -> Safe. The conflict is caught before corruption can occur.
+
+  Alternative: if the writer commits first (wins the race), the clean
+  plan's conflict resolution at INFLIGHT transition detects that a new
+  commit references a blob in blobFilesToDelete. Clean plan is invalidated
+  and re-planned in the next cycle, where it will see FG-2's reference
+  and retain video.mp4.
+
+Key insight:
+  Today, clean actions are not part of OCC conflict resolution
+  (TransactionUtils.getInflightAndRequestedInstants excludes CLEAN_ACTION,
+  and ConcurrentOperation.init throws for clean actions). Adding external
+  blob cleanup requires extending conflict resolution to check
+  blobFilesToDelete. This is only possible if the blob delete list is a
+  durable artifact on the timeline -- i.e., part of the plan.
+```
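
The writer-side check that Approach B enables might look like the sketch below. It is illustrative only: `CleanPlan` here is a simplified stand-in with assumed field names, and `conflicting_blobs` is a hypothetical helper, not an existing Hudi conflict-resolution API.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set

PENDING_STATES = {"REQUESTED", "INFLIGHT"}


@dataclass
class CleanPlan:
    state: str                           # "REQUESTED", "INFLIGHT", or "COMPLETED"
    file_paths_to_be_deleted: List[str]
    blob_files_to_delete: Set[str] = field(default_factory=set)


def conflicting_blobs(writer_blob_refs: Set[str],
                      plans: Iterable[CleanPlan]) -> Set[str]:
    """Blobs the writer references that a pending clean plan will delete."""
    conflicts: Set[str] = set()
    for plan in plans:
        if plan.state in PENDING_STATES:
            conflicts |= writer_blob_refs & plan.blob_files_to_delete
    return conflicts


writer_refs = {"s3://ext/video.mp4"}

# Approach A: the plan carries no blob info, so nothing can be detected.
plan_a = CleanPlan("REQUESTED", ["FG-1/@t1"])
# Approach B: the plan lists the blob it intends to delete.
plan_b = CleanPlan("REQUESTED", ["FG-1/@t1"], {"s3://ext/video.mp4"})

conflicts_a = conflicting_blobs(writer_refs, [plan_a])
conflicts_b = conflicting_blobs(writer_refs, [plan_b])
```

Under Approach A the check returns an empty set and the writer proceeds; under Approach B it returns `video.mp4`, which gives the writer a reason to abort and retry.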
+
+---
+
+## 7. Open Questions
+
+These questions must be answered by any solution design. They are not prescriptive -- multiple valid
+answers exist for each.
+
+**Q1: What is blob identity?**
+How does the cleanup algorithm identify a specific blob? By `reference.external_path` alone? For
+external blobs where container files are out of scope, path-based identity may be sufficient. The
+identity model determines how deduplication and delete-and-re-add (C2) are handled.
+
+**Q2: Where is liveness computed?**
+Is the set of live blob references computed at write time (incremental), at clean time (batch), or
+some combination? Write-time computation amortizes cost but requires additional metadata storage.
+Clean-time computation avoids write overhead but may be expensive at scale. Note: storing liveness
+data within commit metadata is not a viable option -- it does not scale (see C10).
+
+**Q3: What is the unit of cleanup planning?**
+Does blob cleanup plan per-file-group (aligned with the existing cleaner), per-partition, or globally?
+Per-FG is naturally aligned with OCC (C6) but cannot handle cross-FG sharing (C3) without extension.
+Global planning handles cross-FG sharing but risks violating C6.
+
+**Q4: How does blob cleanup interact with archival?**
+If the cleanup algorithm depends on commit metadata to determine which blobs were written, what
+happens when those commits are archived (C10)? Must blob cleanup complete before archival? Must the
+relevant metadata be persisted outside the active timeline?
+
+**Q5: Extension or separate service?**
+Should blob cleanup be an extension of the existing file slice cleaner (same plan, same execution
+phase) or a separate service (independent schedule, independent timeline action)? Extension aligns
+lifecycle but increases cleaner complexity. Separation simplifies each component but introduces
+coordination challenges.
+
+**Q6: What are the failure mode and recovery path if premature deletion occurs?**
+Despite best efforts, what happens if a blob is prematurely deleted? Is there a detection mechanism
+(query-time error surfacing)? Is there a recovery path (rebuild from an external source)? How does
+the system distinguish "blob correctly not present" from "blob incorrectly deleted"?
+
+**Q7: How does cross-FG verification scale for external blobs?**
+For external blob workloads where cross-FG sharing is common, what mechanism makes cross-FG
+verification efficient? Options include: an MDT index mapping blob paths to referencing file groups,
+predicate pushdown on the blob ref column during targeted scans, a reference count maintained at
+write time, or a bloom filter index. The chosen mechanism must satisfy R5 (cost proportional to
+candidates, not table size) and C6 (no global serialization). How does this mechanism interact with
+writes, and what is its maintenance cost?
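
To make one of the Q7 options concrete, the sketch below shows a reference count maintained as slices are written and released as they are cleaned. It is an in-memory illustration with hypothetical names (`BlobRefCounter`, `add_refs`, `release_refs`); it ignores concurrency, rollback, and persistence, and is not a recommendation among the options listed.

```python
from collections import Counter
from typing import Iterable


class BlobRefCounter:
    """Counts how many retained slices reference each external blob path."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def add_refs(self, paths: Iterable[str]) -> None:
        """Called when a slice referencing these paths is written."""
        for path in paths:
            self._counts[path] += 1

    def release_refs(self, paths: Iterable[str]) -> None:
        """Called when a slice referencing these paths is cleaned."""
        for path in paths:
            self._counts[path] -= 1
            if self._counts[path] <= 0:
                del self._counts[path]

    def is_orphaned(self, path: str) -> bool:
        # Counter returns 0 for missing keys without inserting them.
        return self._counts[path] == 0


VIDEO_X = "s3://media-library/video_X.mp4"

counter = BlobRefCounter()
for _ in ("FG-101", "FG-202", "FG-303"):
    counter.add_refs([VIDEO_X])      # three slices reference video_X

counter.release_refs([VIDEO_X])      # FG-101's slice is cleaned
after_first_release = counter.is_orphaned(VIDEO_X)

counter.release_refs([VIDEO_X])      # FG-202's slice is cleaned
counter.release_refs([VIDEO_X])      # FG-303's slice is cleaned
after_all_released = counter.is_orphaned(VIDEO_X)
```

The liveness check is a single lookup per candidate, but every write and every clean must update the counts, which is the maintenance cost the question asks about.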
+
+**Q8: How should the cleaner handle managed-to-unmanaged transitions?**
+If a blob reference transitions from `reference.managed = true` to `reference.managed = false`
+across writes (or vice versa), what is the cleaner's behavior? Should the `managed` flag be
+evaluated at the time of the expired reference, the retained reference, or both? See the managed
+flag discussion in Section 2.
+
+**Q9: How does blob cleanup interact with table restore?**
+Table restore (as distinct from single-commit rollback) can undo multiple committed writes,
+potentially orphaning many blobs at once and resurrecting others. Does the cleaner need special
+handling for post-restore cleanup, or does the standard cleanup algorithm handle it naturally? What
+if a restore occurs after a cleaner run has already deleted blobs that the restored state references?