transferia
diff --git a/‎.github/workflows/build_and_test.yml‎
Lines changed: 31 additions & 6 deletions b/‎.github/workflows/build_and_test.yml‎
Lines changed: 31 additions & 6 deletions
diff --git a/‎README.md‎
Lines changed: 24 additions & 0 deletions b/‎README.md‎
Lines changed: 24 additions & 0 deletions
diff --git a/‎arrow_conversion.go‎
Lines changed: 33 additions & 30 deletions b/‎arrow_conversion.go‎
Lines changed: 33 additions & 30 deletions
diff --git a/‎cmd/trcli/main.go‎
Lines changed: 1 addition & 1 deletion b/‎cmd/trcli/main.go‎
Lines changed: 1 addition & 1 deletion
diff --git a/‎demo/README.md‎
Lines changed: 171 additions & 0 deletions b/‎demo/README.md‎
Lines changed: 171 additions & 0 deletions
@@ -18,7 +18,7 @@ jobs:
       - name: Setup Go
         uses: actions/setup-go@v5
         with:
-          go-version: "1.23.6"
+          go-version: "1.25.5"
       - shell: bash
         run: |
           make build
@@ -33,7 +33,7 @@ jobs:
       - name: Setup Go
         uses: actions/setup-go@v5
         with:
-          go-version: "1.23.6"
+          go-version: "1.25.5"
       - shell: bash
         run: |
           go install gotest.tools/gotestsum@latest
@@ -45,14 +45,39 @@ jobs:
         run: |
           pg_dump --version
       - shell: bash
-        name: prepare local Spark
+        name: prepare local infra
         run: |
           docker compose -f recipe/docker-compose.yml up -d
-          sleep 5
-          docker compose -f recipe/docker-compose.yml exec -T spark-iceberg ipython ./provision.py
-          sleep 5
+          # Wait for MinIO to be ready and bucket to be created by mc container
+          for i in $(seq 1 30); do
+            if docker exec minio mc ls local/warehouse 2>/dev/null; then
+              echo "MinIO warehouse bucket ready"
+              break
+            fi
+            echo "Waiting for MinIO bucket... ($i/30)"
+            sleep 2
+          done
+          # Ensure bucket exists (in case mc container failed)
+          docker exec minio mc alias set local http://localhost:9000 admin password || true
+          docker exec minio mc mb local/warehouse --ignore-existing || true
+          docker exec minio mc anonymous set public local/warehouse || true
+          # Wait for Iceberg REST catalog
+          for i in $(seq 1 15); do
+            if curl -sf http://localhost:8181/v1/config > /dev/null 2>&1; then
+              echo "Iceberg REST catalog ready"
+              break
+            fi
+            echo "Waiting for REST catalog... ($i/15)"
+            sleep 2
+          done
+          # Run Spark provisioning (optional — creates test tables for non-CDC tests)
+          docker compose -f recipe/docker-compose.yml exec -T spark-iceberg ipython ./provision.py || echo "Spark provision failed (non-fatal for CDC tests)"
+          # Export container IPs for test env
           echo "AWS_S3_ENDPOINT=http://$(docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' minio):9000" >> $GITHUB_ENV
           echo "CATALOG_ENDPOINT=http://$(docker inspect -f '{{range.NetworkSettings.Networks}}{{.IPAddress}}{{end}}' iceberg-rest):8181" >> $GITHUB_ENV
+          echo "AWS_ACCESS_KEY_ID=admin" >> $GITHUB_ENV
+          echo "AWS_SECRET_ACCESS_KEY=password" >> $GITHUB_ENV
+          echo "AWS_REGION=us-east-1" >> $GITHUB_ENV
       - shell: bash
         run: |
           make run-tests
 
@@ -95,6 +95,30 @@ The Iceberg Provider also implements a Streaming Sink mechanism that:
 
 **Note**: It's for append-only sources, not for CDC
 
+### CDC Replication Sink (NEW)
+
+Full **Change Data Capture** replication from PostgreSQL to Iceberg v2 tables using [iceberg-go](https://github.com/apache/iceberg-go) — entirely in Go, no JVM required.
+
+**What works:**
+- INSERT, UPDATE, DELETE replication via Iceberg v2 equality deletes (merge-on-read)
+- Snapshot + incremental replication (WAL-based CDC)
+- Automatic table creation with schema inference from source
+- PK-based row deduplication within commit batches
+- Time-based flush with configurable commit interval
+
+**Benchmark results** (Apple M1 Pro, local MinIO + REST catalog):
+
+| Profile | Duration | Rows | Avg Rate | Lag | Data Loss |
+|---------|----------|------|----------|-----|-----------|
+| InsertOnly (1K→10K ramp) | 5 min | 1.35M | 4,500 rows/s | ~3s | 0 |
+
+**Key numbers:**
+- **6,400 rows/sec** peak write throughput
+- **3 second** steady-state replication lag
+- **Zero data loss** — Iceberg row count matches PG row count exactly after drain
+
+For details, see [benchmark README](tests/bench/README.md) and [equality delete performance analysis](doc/equality-delete-performance.md).
+
 ## Contributing
 
 This project is part of the Transferia ecosystem and follows its contribution guidelines. Please refer to the main [Transferia repository](https://github.com/transferia/transferia) for more information. 
@@ -156,6 +156,38 @@ func ToTimestamp(v interface{}) int64 {
 	return 0
 }
 
+// colTypeToIcebergType maps a transferia ColSchema to an Iceberg type.
+func colTypeToIcebergType(col abstract.ColSchema) iceberg.Type {
+	switch col.DataType {
+	case yt_schema.TypeInt64.String():
+		return iceberg.PrimitiveTypes.Int64
+	case yt_schema.TypeInt32.String():
+		return iceberg.PrimitiveTypes.Int32
+	case yt_schema.TypeInt16.String(), yt_schema.TypeInt8.String():
+		return iceberg.PrimitiveTypes.Int32
+	case yt_schema.TypeUint64.String(), yt_schema.TypeUint32.String():
+		return iceberg.PrimitiveTypes.Int64
+	case yt_schema.TypeUint16.String(), yt_schema.TypeUint8.String():
+		return iceberg.PrimitiveTypes.Int32
+	case yt_schema.TypeFloat32.String():
+		return iceberg.PrimitiveTypes.Float32
+	case yt_schema.TypeFloat64.String():
+		return iceberg.PrimitiveTypes.Float64
+	case yt_schema.TypeBytes.String():
+		return iceberg.PrimitiveTypes.Binary
+	case yt_schema.TypeString.String():
+		return iceberg.PrimitiveTypes.String
+	case yt_schema.TypeBoolean.String():
+		return iceberg.PrimitiveTypes.Bool
+	case yt_schema.TypeDate.String():
+		return iceberg.PrimitiveTypes.Date
+	case yt_schema.TypeDatetime.String(), yt_schema.TypeTimestamp.String():
+		return iceberg.PrimitiveTypes.TimestampTz
+	default:
+		return iceberg.PrimitiveTypes.String
+	}
+}
+
 // ConvertToIcebergSchema converts abstract.TableSchema to iceberg.Schema
 func ConvertToIcebergSchema(schema *abstract.TableSchema) (*iceberg.Schema, error) {
 	if schema == nil {
@@ -168,36 +200,7 @@ func ConvertToIcebergSchema(schema *abstract.TableSchema) (*iceberg.Schema, erro
 	nextID := 1 // probably shall use schema registry
 
 	for _, col := range schema.Columns() {
-		var fieldType iceberg.Type
-		switch col.DataType {
-		case yt_schema.TypeInt64.String():
-			fieldType = iceberg.PrimitiveTypes.Int64
-		case yt_schema.TypeInt32.String():
-			fieldType = iceberg.PrimitiveTypes.Int32
-		case yt_schema.TypeInt16.String(), yt_schema.TypeInt8.String():
-			fieldType = iceberg.PrimitiveTypes.Int32
-		case yt_schema.TypeUint64.String(), yt_schema.TypeUint32.String():
-			fieldType = iceberg.PrimitiveTypes.Int64
-		case yt_schema.TypeUint16.String(), yt_schema.TypeUint8.String():
-			fieldType = iceberg.PrimitiveTypes.Int32
-		case yt_schema.TypeFloat32.String():
-			fieldType = iceberg.PrimitiveTypes.Float32
-		case yt_schema.TypeFloat64.String():
-			fieldType = iceberg.PrimitiveTypes.Float64
-		case yt_schema.TypeBytes.String():
-			fieldType = iceberg.PrimitiveTypes.Binary
-		case yt_schema.TypeString.String():
-			fieldType = iceberg.PrimitiveTypes.String
-		case yt_schema.TypeBoolean.String():
-			fieldType = iceberg.PrimitiveTypes.Bool
-		case yt_schema.TypeDate.String():
-			fieldType = iceberg.PrimitiveTypes.Date
-		case yt_schema.TypeDatetime.String(), yt_schema.TypeTimestamp.String():
-			fieldType = iceberg.PrimitiveTypes.TimestampTz
-		default:
-			// JSON-based string
-			fieldType = iceberg.PrimitiveTypes.String
-		}
+		fieldType := colTypeToIcebergType(col)
 
 		field := iceberg.NestedField{
 			ID:       nextID,
 
@@ -57,7 +57,7 @@ func main() {
 				return nil
 			}
 			if runProfiler {
-				go serverutil.RunPprof()
+				go serverutil.RunPprof(8080)
 			}
 
 			switch strings.ToLower(logConfig) {
 
@@ -0,0 +1,171 @@
+# Demo: PostgreSQL CDC to Iceberg v2
+
+Real-time Change Data Capture replication from PostgreSQL to Apache Iceberg v2 tables — entirely in Go, no JVM.
+
+INSERT, UPDATE, and DELETE operations are captured from the PostgreSQL WAL and written to Iceberg using v2 equality deletes (merge-on-read).
+
+## Prerequisites
+
+- Docker / Docker Compose
+- Go 1.23+
+
+## Quick Start
+
+### 1. Start infrastructure
+
+```bash
+cd demo
+docker compose up -d
+```
+
+This starts:
+- **PostgreSQL** (port 5432) — source database with WAL-level replication
+- **MinIO** (ports 9000/9001) — S3-compatible storage for Iceberg data files
+- **Iceberg REST Catalog** (port 8181) — Iceberg table metadata
+
+### 2. Seed the source database
+
+```bash
+psql "host=localhost port=5432 user=postgres password=postgres dbname=demo" -f seed.sql
+```
+
+### 3. Build and start replication
+
+```bash
+# From the repo root
+make build
+./binaries/trcli activate --transfer demo/transfer.yaml --log-level info
+```
+
+The transfer will:
+1. **Snapshot** the existing `orders` table into Iceberg
+2. **Switch to CDC** — streaming WAL changes in real-time
+
+### 4. Generate changes
+
+**Option A: Load generator** (recommended) — runs a mix of INSERT/UPDATE/DELETE at a steady rate:
+
+```bash
+# Default: 10 ops/sec for 60s (60% insert, 30% update, 10% delete)
+./demo/loadgen.sh
+
+# Crank it up
+./demo/loadgen.sh --rate 50 --duration 120
+
+# Heavy update/delete workload (watch equality deletes accumulate in MinIO)
+./demo/loadgen.sh --rate 20 --insert 30 --update 40
+```
+
+Output:
+```
+=== CDC Load Generator ===
+Rate:     10 ops/sec
+Duration: 60s (~600 ops)
+Mix:      60% insert / 30% update / 10% delete
+==========================
+[10s] ops: 100/600 (I:62 U:28 D:10 E:0) PG rows: 153
+[20s] ops: 200/600 (I:121 U:57 D:22 E:0) PG rows: 200
+...
+```
+
+**Option B: Ad-hoc SQL** — run individual statements:
+
+```bash
+psql "host=localhost port=5432 user=postgres password=postgres dbname=demo"
+
+INSERT INTO orders (customer, product, quantity, price) VALUES ('zara', 'widget-z', 1, 9.99);
+UPDATE orders SET status = 'delivered' WHERE customer = 'zara';
+DELETE FROM orders WHERE customer = 'frank';
+```
+
+**Option C: Scripted workload** — a fixed set of INSERT/UPDATE/DELETE:
+
+```bash
+psql "host=localhost port=5432 user=postgres password=postgres dbname=demo" -f demo/workload.sql
+```
+
+### 5. Observe
+
+**MinIO Console** — browse the Iceberg data and delete files:
+```
+http://localhost:9001
+Login: admin / password
+Bucket: warehouse → public/orders/
+```
+
+You'll see:
+- `data/` — Parquet data files (from INSERTs and snapshot)
+- `data/` — Equality delete files (from UPDATEs and DELETEs, also Parquet, containing just the PK values)
+- `metadata/` — Iceberg table metadata, manifests, and snapshots
+
+**REST Catalog** — check table exists:
+```bash
+curl -s http://localhost:8181/v1/namespaces/public/tables | jq .
+```
+
+### 6. Cleanup
+
+```bash
+docker compose down -v
+```
+
+## How It Works
+
+```
+PostgreSQL WAL
+    │
+    ▼
+┌─────────────┐     ┌──────────────────┐     ┌─────────────────┐
+│  PG Source   │────▶│  SinkReplication  │────▶│  Iceberg v2     │
+│  (CDC)       │     │  (Go)            │     │  (Parquet + S3) │
+└─────────────┘     └──────────────────┘     └─────────────────┘
+                           │
+                    ┌──────┴──────┐
+                    │             │
+               INSERT/UPDATE   DELETE
+                    │             │
+                    ▼             ▼
+              WriteRecords   WriteEqualityDeletes
+              (data file)    (delete file with PK)
+                    │             │
+                    └──────┬──────┘
+                           ▼
+                      RowDelta Commit
+                    (atomic transaction)
+```
+
+- **INSERT** → appended as a new data row
+- **UPDATE** → equality delete (old PK) + new data row (new values), committed atomically via RowDelta
+- **DELETE** → equality delete file containing the deleted PK
+
+All mutations within a commit interval (default 5s) are batched and deduplicated by PK before writing.
+
+## Performance
+
+| Metric | Value |
+|--------|-------|
+| Peak throughput | 6,400 rows/sec |
+| Avg throughput | 4,500 rows/sec |
+| Replication lag | ~3 seconds |
+| Data loss | 0 (exact row count match) |
+
+Measured with 1.35M rows over 5 minutes. See [benchmark results](../tests/bench/README.md).
+
+## Known Limitations
+
+1. **Equality delete read overhead** — scan performance degrades as delete files accumulate (124x slower after 10K DML ops). Requires periodic compaction. See [performance analysis](../doc/equality-delete-performance.md) and [compaction research](../doc/compaction-research.md).
+
+2. **No built-in compaction** — iceberg-go doesn't have an `Optimize` API yet. Workaround: use Spark `CALL system.rewrite_data_files()` or the full-table scan+rewrite approach described in [compaction research](../doc/compaction-research.md).
+
+3. **Single-writer** — concurrent replication workers writing to the same table will hit `CommitFailedException`. The sink invalidates the table cache on failure and retries, but throughput drops.
+
+4. **No schema evolution** — if the source schema changes, the Iceberg table schema is not updated automatically. The table must be recreated.
+
+5. **Append-only without PK** — tables without a primary key are treated as append-only. UPDATEs and DELETEs are dropped since equality deletes require a PK.
+
+## Further Reading
+
+- [Equality delete performance analysis](../doc/equality-delete-performance.md) — why reads degrade and the numbers
+- [Compaction research](../doc/compaction-research.md) — strategies to keep read cost constant
+- [Optimize API proposal for iceberg-go](../doc/iceberg-go-optimize-proposal.md) — feature request draft
+- [Benchmark README](../tests/bench/README.md) — full benchmark methodology and results
Original file line number	Diff line number	Diff line change
`@@ -57,7 +57,7 @@ func main() {`
`57`	`57`	`return nil`
`58`	`58`	`}`
`59`	`59`	`if runProfiler {`
`60`		`- go serverutil.RunPprof()`
	`60`	`+ go serverutil.RunPprof(8080)`
`61`	`61`	`}`
`62`	`62`
`63`	`63`	`switch strings.ToLower(logConfig) {`