Skip to content

Latest commit

 

History

History
211 lines (155 loc) · 6.65 KB

File metadata and controls

211 lines (155 loc) · 6.65 KB

Change Data Capture (CDC) Techniques

CDC techniques are generally recommended in the following order for production systems:

Log-based CDC → Native Change Feeds → Application-level Outbox → Trigger-based CDC → Query/Polling CDC


How It Works

  • Reads write-ahead logs / redo logs / binlogs generated by the DB engine.
  • CDC connector maintains offsets / LSNs to ensure exactly-once or at-least-once delivery.
  • Events are reconstructed with before/after images depending on DB and configuration.
  • Typically streams into Kafka, Kinesis, Pulsar, or directly to sinks.

Database Examples

  • PostgreSQL – WAL via logical decoding (pgoutput)
  • MySQL / MariaDB – Row-based binary logs
  • Oracle – Redo logs (GoldenGate, LogMiner)
  • SQL Server – Transaction log
  • MongoDB – Oplog
  • DB2 – Log streams

Additional Strengths

  • Full transaction context (commit boundaries, ordering)
  • Supports hard deletes
  • Handles multi-row and bulk operations efficiently
  • Low write amplification on the source DB
  • Decouples producers and consumers
  • Enables replayability from log offsets

Additional Challenges

  • Requires log retention tuning to avoid data loss
  • Needs DDL handling strategy (schema registry, evolution rules)
  • Can produce high event volumes that require downstream scaling
  • Security concerns if logs contain sensitive data (PII masking needed)

Failure & Recovery

  • Offset-based recovery enables safe restarts
  • Requires monitoring of lag, log growth, and connector health

Recommendation

Primary choice for enterprise-grade CDC
Use for high-volume OLTP systems, real-time analytics, replication, and event streaming.


How It Works

  • Database exposes a managed stream or subscription
  • Built-in checkpointing and retry
  • Often integrated with cloud messaging services

Database Examples

  • MongoDB – Change Streams
  • DynamoDB – Streams → Kinesis
  • Cosmos DB – Change Feed
  • Cassandra – Native CDC (CommitLog)
  • Managed cloud RDBMS CDC services

Additional Strengths

  • No need to manage log offsets manually
  • Highly available by design
  • Tightly integrated with cloud ecosystem
  • Simplified security and IAM integration

Additional Challenges

  • Retention limits can cause irreversible data loss if consumers lag
  • Limited historical replay
  • Often less granular control over events
  • Pricing can increase with throughput

Failure & Recovery

  • Recovery limited to retention window
  • Requires consumer lag monitoring to avoid expiration

Recommendation

Best option for NoSQL and fully managed cloud databases
Choose when operational simplicity outweighs flexibility.


3. Trigger-Based CDC — Last-Resort / Legacy Option

How It Works

  • Row-level triggers fire synchronously during DML
  • Writes deltas into audit / shadow tables
  • Downstream process polls or streams these tables

Database Examples

  • Oracle, SQL Server, PostgreSQL, MySQL

Additional Strengths

  • Fine-grained filtering logic possible inside triggers
  • Can enforce custom business rules
  • Works even on very old DB versions

Additional Challenges

  • Transaction coupling (trigger failure blocks writes)
  • Increased deadlock risk
  • Hard to test and version-control
  • Complicates bulk loads and migrations

Failure & Recovery

  • Trigger failures can halt application writes
  • Recovery often requires manual intervention

Recommendation

⚠️ Fallback only
Use only when log-based and native CDC are unavailable and write volume is modest.


4. Query-Based CDC (Polling / Timestamp / High-Water-Mark)

How It Works

  • Periodic SQL queries select rows where a column (e.g., updated_at) is greater than the last processed value (the high-water mark).
SELECT * 
FROM orders
WHERE updated_at > :last_processed_timestamp;

Database Examples

  • Any SQL database
  • Some NoSQL stores with timestamps or sequences

Additional Strengths

  • Lowest implementation cost
  • Simple to debug and reason about
  • Easy to integrate with legacy ETL tools

Additional Challenges

  • Misses intermediate updates
  • Deletes require schema changes or tombstone tables
  • Scaling requires partitioned polling
  • Can cause hot indexes

Failure & Recovery

  • Simple checkpoint recovery
  • Risk of duplicate or missing rows if timestamps are inconsistent

Recommendation

⚠️ Batch-oriented, low-volume use only Not suitable for streaming or strict consistency requirements.

5. Application-Level CDC — Outbox Pattern

How It Works

  • Application writes domain change + event in same transaction
  • Outbox table read by:
    • Polling job, or
    • Log-based CDC connector
  • Events published to message broker
    Application → DB (Data + Outbox) → CDC → Kafka / Event Bus 
    

Database Examples

Any RDBMS backing microservices

Additional Strengths

  • Eliminates dual-write problem
  • Produces business-semantic events
  • Supports exactly-once semantics with idempotent consumers
  • Decouples internal schema from external consumers

Additional Challenges

  • Requires application refactoring
  • Event schema versioning required
  • Ordering across aggregates can be complex
  • Backfills are harder than log-based CDC

Failure & Recovery

  • Transactional consistency guarantees no lost events
  • Requires idempotent publishing and consumer design

Recommendation

✅ Highly recommended for microservices & DDD architectures Often combined with log-based CDC for analytics and auditing.

CDC Technique Selection Matrix

Technique Latency Scalability Operational Complexity Best For
Log-Based Very Low Very High High Enterprise OLTP, analytics
Native Feeds Low High Medium Cloud & NoSQL systems
Outbox Low High Medium Microservices events
Triggers Low Low Medium Legacy DBs
Polling High Low Low Batch ETL

Final Architectural Guidance

  • Enterprise Data Platforms: Log-based CDC + Kafka
  • Cloud-Native NoSQL: Native change feeds
  • Microservices: Outbox + messaging
  • Legacy Systems: Triggers (carefully)
  • Reporting ETL: Polling / high-water-mark