The CDC replication sink uses Iceberg v2 equality deletes (merge-on-read) for UPDATE and DELETE operations. This avoids rewriting data files on every mutation — deletes are written as small delete files that are applied at read time by the scanner.
The trade-off: writes are fast, but reads degrade as delete files accumulate.
This document measures the degradation and discusses mitigation strategies.
- Table: 10,000 rows, 3 columns (`id BIGINT PK, value INT, label VARCHAR`)
- Workload: 10 rounds, each round = 500 UPDATEs + 500 DELETEs (with re-INSERTs to keep the row count stable)
- Commit interval: 2 seconds (each round produces multiple commits)
- Measurement: Full table scan via `Scan().ToArrowRecords()`, equivalent to `SELECT COUNT(*)` with merge-on-read
- Environment: Apple M1 Pro, Docker/Colima, local MinIO + REST catalog
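For context, the per-round measurement can be sketched as a small Go harness. This is a simplified stand-in, not the actual test code: `scanAll` is a placeholder for the real `Scan().ToArrowRecords()` call.

```go
package main

import (
	"fmt"
	"time"
)

// measureScan times one full-table scan and returns the row count and latency.
// scanAll stands in for the real merge-on-read scan (Scan().ToArrowRecords()).
func measureScan(scanAll func() int) (rows int, elapsed time.Duration) {
	start := time.Now()
	rows = scanAll()
	return rows, time.Since(start)
}

func main() {
	// Simulated table: the workload keeps the row count stable at 10K.
	scanAll := func() int { return 10000 }
	for round := 0; round <= 10; round++ {
		rows, d := measureScan(scanAll)
		fmt.Printf("round %d: %d rows in %v\n", round, rows, d)
	}
}
```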
| Round | Cumulative DML | Scan Time | Slowdown vs Baseline |
|---|---|---|---|
| 0 | 0 | 41ms | 1.0x |
| 1 | 1,000 | 230ms | 5.6x |
| 2 | 2,000 | 171ms | 4.2x |
| 3 | 3,000 | 147ms | 3.6x |
| 4 | 4,000 | 125ms | 3.0x |
| 5 | 5,000 | 165ms | 4.0x |
| 6 | 6,000 | 207ms | 5.0x |
| 7 | 7,000 | 1,157ms | 28.2x |
| 8 | 8,000 | 190ms | 4.6x |
| 9 | 9,000 | 988ms | 24.1x |
| 10 | 10,000 | 5,087ms | 124x |
The scan overhead is not strictly proportional to the number of delete files. The spikes at rounds 7, 9, and 10 correlate with:

- Number of equality delete files: Each commit with deletes produces a separate delete file. After 10 rounds at a 2s commit interval, there are ~50+ delete files.
- Cross-file matching: Each equality delete file must be matched against every data file in the table. With N data files and M delete files, the merge cost is O(N × M × rows_per_file).
- No file-level pruning for equality deletes: Unlike position deletes (which reference specific data files), equality deletes require checking every data file, because any file might contain the deleted key.
For each data file:
1. Read all rows from the Parquet file
2. For each equality delete file with overlapping partition:
a. Read all delete keys
b. For each row in the data file:
- Check if PK matches any delete key
- If yes, filter out the row
3. Yield remaining rows
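The loop above can be sketched in Go. This is a simplified model with integer primary keys, not the library's scanner: real readers typically hash the delete keys instead of comparing linearly, but the cross-file structure is the same.

```go
package main

import "fmt"

// DataFile holds the primary keys of the rows in one Parquet data file.
type DataFile struct{ PKs []int64 }

// EqDeleteFile holds the keys written by one equality delete file.
type EqDeleteFile struct{ Keys []int64 }

// scanWithEqualityDeletes mirrors the merge-on-read loop: every delete file
// is matched against every data file, dropping rows whose PK appears in any
// delete file. Returns the surviving PKs.
func scanWithEqualityDeletes(data []DataFile, deletes []EqDeleteFile) []int64 {
	var live []int64
	for _, df := range data {
		for _, pk := range df.PKs {
			deleted := false
			for _, del := range deletes { // O(M) delete files per row
				for _, k := range del.Keys { // naive linear key match
					if pk == k {
						deleted = true
						break
					}
				}
				if deleted {
					break
				}
			}
			if !deleted {
				live = append(live, pk)
			}
		}
	}
	return live
}

func main() {
	data := []DataFile{{PKs: []int64{1, 2, 3}}, {PKs: []int64{4, 5, 6}}}
	deletes := []EqDeleteFile{{Keys: []int64{2, 5}}}
	fmt.Println(scanWithEqualityDeletes(data, deletes)) // [1 3 4 6]
}
```

Replacing the inner linear scan with a hash set over all delete keys drops the per-row cost from O(total delete keys) to O(1), which is why delete-file count, not just total deleted rows, drives the slowdown.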
At 10K DML ops, this means:
- ~10 data files (from commits, ≈1K rows each) × ~50 delete files ≈ 500K row-vs-delete-file checks, each scanning that delete file's keys — millions of key comparisons in total
| Operation | Without equality deletes | With equality deletes |
|---|---|---|
| INSERT 10K rows | WriteRecords → 1 Parquet file | Same |
| UPDATE 1K rows | N/A | WriteRecords + WriteEqualityDeletes → 2 files per commit |
| DELETE 1K rows | N/A | WriteEqualityDeletes → 1 delete file per commit |
| COUNT(*) | Scan ~10 data files: 41ms | Scan ~10 data + ~50 delete files: 5,087ms |
The standard solution is compaction — periodically rewriting data files to physically remove rows matched by equality deletes.
```
Before compaction:
  10 data files + 50 equality delete files
  Scan cost: O(data_files × delete_files × rows)

After compaction:
  1 data file + 0 delete files
  Scan cost: O(rows)
```
- Time-based: Compact every N minutes, or when the delete file count exceeds a threshold
- File-count-based: Trigger when `num_delete_files / num_data_files > ratio`
- Read-time-based: Compact when scan latency exceeds a target SLA
- External: Run Spark/Trino `CALL rewrite_data_files()` as a scheduled job
As of the current iceberg-go version, compaction is not yet implemented in the Go library. Options:

- Spark: `CALL iceberg.system.rewrite_data_files(table => 'public.my_table')`
- Trino: `ALTER TABLE my_table EXECUTE optimize`
- Custom Go: Read all files, merge/filter, write new data files, commit via `ReplaceDataFilesWithDataFiles`
- Short-term: Run Spark/Trino compaction as a scheduled job (every 15-30 min for CDC tables)
- Medium-term: Add a `CompactTable` method to the sink that runs after N commits
- Long-term: Implement native Go compaction using `tx.ReplaceDataFilesWithDataFiles()`
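The core of a native Go compaction is the merge/filter step. A minimal sketch under simplifying assumptions (rows keyed by integer PK, values as strings); reading Parquet and committing via `tx.ReplaceDataFilesWithDataFiles()` are outside this snippet:

```go
package main

import "fmt"

// compactRows performs the merge/filter step of compaction: given all live
// rows keyed by PK and the union of equality-delete keys across all delete
// files, it returns the rows that would be written to the single replacement
// data file. After this, the table needs no delete files at all.
func compactRows(rows map[int64]string, deleteKeys []int64) map[int64]string {
	deleted := make(map[int64]struct{}, len(deleteKeys))
	for _, k := range deleteKeys {
		deleted[k] = struct{}{} // hash set: O(1) membership per row
	}
	out := make(map[int64]string, len(rows))
	for pk, v := range rows {
		if _, ok := deleted[pk]; !ok {
			out[pk] = v
		}
	}
	return out
}

func main() {
	rows := map[int64]string{1: "a", 2: "b", 3: "c"}
	out := compactRows(rows, []int64{2})
	fmt.Println(len(out)) // 2
}
```

Unlike the per-scan merge, this cost is paid once per compaction rather than on every read, which is the whole point of the rewrite.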
Make recipe:

```bash
bash tests/run_repl_test.sh TestEqualityDeleteReadPerf ./tests/bench/ 30m
```

The test is in tests/bench/equality_delete_perf_test.go.