PostgreSQL → Debezium → Redpanda → ClickHouse change-data-capture pipeline.

| Service | Container | Port |
|---|---|---|
| PostgreSQL | postgres | 5432 |
| Redpanda | redpanda | 19092 (external), 9092 (internal) |
| Debezium Connect | debezium-connect | 8083 |
| ClickHouse | clickhouse | 8123 (HTTP), 9000 (native) |

Start the stack:

    docker compose up -d

Wait until Debezium Connect is healthy before proceeding:

    docker compose ps
    # debezium-connect should show (healthy)

Open psql in the Postgres container, then create the source table and insert two rows:

    docker exec -it postgres psql -U user -d postgres

    CREATE TABLE public.report (
        id TEXT PRIMARY KEY,
        username TEXT NOT NULL,
        product TEXT NOT NULL,
        response_message TEXT NOT NULL DEFAULT ''
    );

    INSERT INTO public.report (id, username, product, response_message) VALUES
        ('1', 'alice', 'product-a', 'success'),
        ('2', 'bob', 'product-b', 'success');

Exit psql: `\q`

Run this from the project directory (where `postgres-cdc-connector.json` lives):

    curl -X POST http://localhost:8083/connectors \
      -H "Content-Type: application/json" \
      -d @postgres-cdc-connector.json

PowerShell users: use `curl.exe` and backtick (`) for line continuation:

    curl.exe -X POST http://localhost:8083/connectors `
      -H "Content-Type: application/json" `
      -d '@postgres-cdc-connector.json'

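The contents of `postgres-cdc-connector.json` are not shown in this guide. As a rough sketch only, the script below prints one connector definition that would be consistent with the other steps. Every value is an assumption: the `pgserver` prefix is inferred from the topic name `pgserver.public.report`, the `ExtractNewRecordState` transform with `add.fields` set to `op` is one way to get flat rows carrying an `__op` field, the host and credentials mirror the `psql` command, and the password is a placeholder.

```python
import json


def build_connector_config(password: str = "CHANGE_ME") -> dict:
    """Return a hypothetical Debezium PostgreSQL connector definition.

    All values are illustrative guesses, not the project's actual file.
    """
    return {
        "name": "postgres-cdc-connector",
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",
            "database.hostname": "postgres",
            "database.port": "5432",
            "database.user": "user",
            "database.password": password,  # placeholder, not a real secret
            "database.dbname": "postgres",
            # Debezium 2.x property; 1.x releases used database.server.name
            "topic.prefix": "pgserver",
            "table.include.list": "public.report",
            # Plain JSON without the schema envelope
            "key.converter": "org.apache.kafka.connect.json.JsonConverter",
            "key.converter.schemas.enable": "false",
            "value.converter": "org.apache.kafka.connect.json.JsonConverter",
            "value.converter.schemas.enable": "false",
            # Flatten the change event and add the operation as "__op"
            "transforms": "unwrap",
            "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
            "transforms.unwrap.add.fields": "op",
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_connector_config(), indent=2))
```

If your own file differs, trust the file; the sketch is only meant to show which settings the ClickHouse column list depends on.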
Verify it is running:

    curl -s http://localhost:8083/connectors/postgres-cdc-connector/status | python -m json.tool

Expected:

    {
      "connector": { "state": "RUNNING" },
      "tasks": [{ "id": 0, "state": "RUNNING" }]
    }

Debezium performs an initial snapshot of the existing rows and publishes them to the Redpanda topic `pgserver.public.report`. Wait ~5 seconds for the snapshot to complete before moving on.

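If you would rather script the check than eyeball the JSON, the following is a minimal sketch that polls the same status endpoint until the connector and all of its tasks report `RUNNING`. The function names, the 60-second timeout, and the 2-second interval are assumptions made for illustration.

```python
import json
import time
import urllib.request

STATUS_URL = "http://localhost:8083/connectors/postgres-cdc-connector/status"


def is_running(status: dict) -> bool:
    """True when the connector and at least one task exist and all are RUNNING."""
    connector_ok = status.get("connector", {}).get("state") == "RUNNING"
    tasks = status.get("tasks", [])
    return connector_ok and bool(tasks) and all(
        task.get("state") == "RUNNING" for task in tasks
    )


def fetch_status(url: str = STATUS_URL) -> dict:
    """Fetch and decode the connector status from the Connect REST API."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def wait_until_running(fetch=fetch_status, timeout_s: float = 60.0,
                       interval_s: float = 2.0) -> bool:
    """Poll until the connector is running or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if is_running(fetch()):
                return True
        except OSError:
            # Connect is not reachable yet, or the connector is not registered
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_running()` returns `True` once everything is running and `False` on timeout. Note that it says nothing about whether the snapshot has finished, so the short wait above still applies.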
Open the ClickHouse client:

    docker exec -it clickhouse clickhouse-client

Then run the three statements below in order.

1. Kafka queue table (reads raw messages from Redpanda; do not query this directly):

       CREATE TABLE report_queue (
           id String,
           username String,
           product String,
           response_message String,
           __op String
       ) ENGINE = Kafka
       SETTINGS
           kafka_broker_list = 'redpanda:9092',
           kafka_topic_list = 'pgserver.public.report',
           kafka_group_name = 'clickhouse-consumer-group',
           kafka_format = 'JSONEachRow',
           kafka_skip_broken_messages = 1;

2. Storage table (where deduplicated data lives):

       CREATE TABLE report (
           id String,
           username String,
           product String,
           response_message String,
           op String,
           _ingested_at DateTime DEFAULT now()
       ) ENGINE = ReplacingMergeTree(_ingested_at)
       ORDER BY id;

3. Materialized view (pipes rows from the queue into the storage table automatically):

       CREATE MATERIALIZED VIEW report_mv
       TO report AS
       SELECT
           id,
           username,
           product,
           response_message,
           __op AS op
       FROM report_queue;

Once the MV is created, ClickHouse starts consuming immediately. Wait a few seconds for the initial snapshot rows to land.

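To make the column mapping concrete, here is a small Python sketch of what the queue table and materialized view do to a single message: read the five queue columns from one JSON line and rename `__op` to `op`. The sample message is a hypothetical payload shaped after the queue table, not captured output, and filling missing fields with an empty string is an assumption of this sketch.

```python
import json

# Columns declared on the report_queue table
QUEUE_COLUMNS = ("id", "username", "product", "response_message", "__op")


def to_storage_row(message: str) -> dict:
    """Map one JSON message to the columns the view selects.

    Fields that are absent from the message become empty strings here.
    """
    event = json.loads(message)
    row = {column: str(event.get(column, "")) for column in QUEUE_COLUMNS}
    row["op"] = row.pop("__op")  # same rename as "__op AS op" in the view
    return row


if __name__ == "__main__":
    # Hypothetical snapshot message for the first inserted row
    sample = ('{"id": "1", "username": "alice", "product": "product-a", '
              '"response_message": "success", "__op": "r"}')
    print(to_storage_row(sample))
```

The `_ingested_at` column is absent from the mapping on purpose: the view does not select it, so the storage table fills it from its `DEFAULT now()` expression at insert time.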
Query the storage table:

    SELECT * FROM report FINAL;

You should see the two rows inserted into Postgres earlier, each with `op = 'r'` (initial snapshot read).

`ReplacingMergeTree` deduplicates rows with the same `id`, keeping the row with the highest `_ingested_at`. Use `FINAL` to get deduplicated results immediately at query time rather than waiting for background merges.

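The deduplication rule can be illustrated with a few lines of Python. This is a toy model of the outcome, not of how ClickHouse implements merges: for each `id` it keeps the row with the highest `_ingested_at`, and on a tie it keeps the row that was inserted last. The integer timestamps in the demo are made up.

```python
from typing import Dict, Iterable, List


def replacing_merge(rows: Iterable[dict], key: str = "id",
                    version: str = "_ingested_at") -> List[dict]:
    """Keep one row per key: the one with the highest version value."""
    latest: Dict[str, dict] = {}
    for row in rows:
        current = latest.get(row[key])
        # ">=" means that on a version tie the later-inserted row wins
        if current is None or row[version] >= current[version]:
            latest[row[key]] = row
    # Return rows sorted by key, like ORDER BY id on the storage table
    return [latest[k] for k in sorted(latest)]


if __name__ == "__main__":
    rows = [
        {"id": "1", "response_message": "success", "op": "r", "_ingested_at": 100},
        {"id": "2", "response_message": "success", "op": "r", "_ingested_at": 100},
        {"id": "1", "response_message": "updated response", "op": "u", "_ingested_at": 160},
    ]
    for row in replacing_merge(rows):
        print(row)
```

Because `_ingested_at` is a `DateTime` with one-second resolution, two events for the same `id` that arrive within the same second fall into the tie case.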
If you want duplicates cleaned up right away without waiting for the background merge:

    OPTIMIZE TABLE report FINAL;

To test an update, open psql again and change one row:

    docker exec -it postgres psql -U user -d postgres

    UPDATE public.report
    SET response_message = 'updated response'
    WHERE id = '1';

Debezium captures the change and publishes it to Redpanda as an `op=u` event. ClickHouse inserts the updated row with a new `_ingested_at` timestamp. In the ClickHouse client:

    SELECT * FROM report FINAL WHERE id = '1';

You should see `response_message = 'updated response'` and `op = 'u'`.

Without `FINAL` you may temporarily see both the old and new row until the background merge runs.

ClickHouse only receives rows through CDC events. It has no direct connection to Postgres and cannot backfill on its own. If you ever truncate the ClickHouse report table, reset the connector, or delete the Redpanda topic, any rows that don't get a new CDC event will be missing.
Symptom: Postgres has N rows but ClickHouse has fewer.
Fix: Touch every row in Postgres to force Debezium to re-publish them:

    docker exec -it postgres psql -U user -d postgres

    UPDATE public.report SET response_message = response_message;

This no-op update causes Debezium to emit an `op=u` event for every row, which ClickHouse then consumes and stores.
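
On a large table you may prefer to touch only the rows that are actually missing. The sketch below assumes you have already collected the `id` values from both sides (for example with `SELECT id FROM public.report` in Postgres and `SELECT id FROM report FINAL` in ClickHouse). It computes the difference and builds a parameterized variant of the no-op update. The helper names and the `%s` placeholder style (as used by psycopg-like drivers) are assumptions.

```python
from typing import Iterable, List, Tuple


def missing_ids(postgres_ids: Iterable[str],
                clickhouse_ids: Iterable[str]) -> List[str]:
    """Return ids present in Postgres but absent from ClickHouse, sorted."""
    return sorted(set(postgres_ids) - set(clickhouse_ids))


def touch_statement(ids: List[str]) -> Tuple[str, Tuple[List[str]]]:
    """Build a no-op UPDATE limited to the given ids.

    Returns the SQL text and its parameters separately so the driver can
    bind the list; the ids are never spliced into the SQL string.
    """
    sql = ("UPDATE public.report "
           "SET response_message = response_message "
           "WHERE id = ANY(%s)")
    return sql, (ids,)


if __name__ == "__main__":
    gaps = missing_ids(["1", "2", "3"], ["1", "3"])
    print(gaps)
    print(touch_statement(gaps))
```

The returned pair can be passed to a driver's `execute(sql, params)` call. The effect is the same as the full-table update above, restricted to the rows ClickHouse has not seen.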