CDC and Streaming
Change Data Capture (CDC) is the technique of reading a source database's transaction log (PostgreSQL WAL, MySQL binlog, Oracle redo log, SQL Server CDC tables) to capture every insert, update, and delete as a stream of change events, typically published into Kafka, Kinesis, or directly into a destination warehouse. CDC + streaming replaces the traditional 'batch ETL every 4 hours' pattern with continuous, low-latency replication: change events flow within seconds of the source commit. The architecture pairs a CDC tool (Debezium is the dominant open-source implementation; Fivetran, Airbyte, Striim, and Estuary offer managed alternatives) with a streaming backbone (Confluent Kafka, AWS Kinesis, Redpanda) and a destination (warehouse, lakehouse, downstream microservice, search index). The honest test: does your business actually need sub-minute data freshness for the use cases you would build? If yes, CDC pays for itself; if no, you're paying a streaming infrastructure tax for batch data.
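To make the ingestion side concrete, the sketch below registers a hypothetical Debezium PostgreSQL connector through the Kafka Connect REST API. The connector name, hostnames, credentials, and table list are placeholders, and the exact config keys depend on your Debezium version (topic.prefix is the 2.x name; 1.x releases used database.server.name).

```python
import requests  # assumes the 'requests' package is installed

# Placeholder Kafka Connect endpoint -- replace with your cluster's URL.
CONNECT_URL = "http://localhost:8083/connectors"

connector = {
    "name": "orders-cdc",  # illustrative connector name
    "config": {
        # Debezium's PostgreSQL connector reads the WAL via a logical replication slot.
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "orders-db.internal",   # placeholder host
        "database.port": "5432",
        "database.user": "cdc_reader",                # placeholder credentials
        "database.password": "***",
        "database.dbname": "orders",
        "slot.name": "orders_cdc_slot",
        # Debezium 2.x key; older releases used 'database.server.name'.
        "topic.prefix": "prod.orders",
        "table.include.list": "public.orders,public.payments",
    },
}

resp = requests.post(CONNECT_URL, json=connector, timeout=10)
resp.raise_for_status()
print(resp.json())
```

Once registered, the connector emits one Kafka topic per captured table (prod.orders.public.orders in this example), which downstream consumers or the warehouse loader subscribe to.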
The Trap
The trap is adopting CDC + streaming because real-time sounds modern, then discovering that 95% of your dashboards are reviewed daily and the streaming infrastructure cost (Kafka cluster, ops overhead, schema registry, monitoring) is 5-10x what an hourly batch job would have cost. The other trap is the operational maturity gap: CDC pipelines fail in subtle ways (replication slot exhaustion in Postgres, binlog rotation in MySQL, schema drift cascades, exactly-once semantics edge cases) that batch ETL simply does not have. Without 24/7 on-call and strong observability, CDC will create incidents you didn't have before. KnowMBA POV: most companies that adopt CDC + streaming would be better served by Fivetran + 15-minute batch loads + dbt for 90% of use cases, adding CDC selectively for the genuinely real-time use cases (fraud detection, operational dashboards, search indexing, customer-facing personalization). Treating CDC as the universal pipeline pattern is overengineering.
What to Do
Adopt CDC selectively, not universally.
Step 1: List your actual use cases by required freshness: 'CFO dashboard reviewed Monday morning' (daily is fine), 'fraud detection model' (sub-minute), 'product analytics' (15-min is usually fine), 'customer-facing search index' (sub-minute).
Step 2: Use batch ETL (Fivetran, Airbyte) for the 80% that is daily/hourly.
Step 3: Use CDC + streaming only for the 20% that genuinely needs sub-minute freshness.
Step 4: Choose tooling: Debezium + Kafka (open source, high control, high ops burden), Fivetran HVR (managed, lower control, lower burden), or your warehouse vendor's native CDC (Snowflake Snowpipe Streaming, Databricks Auto Loader).
Step 5: Invest in observability: replication lag dashboards, schema-drift alerts, dead-letter queue monitoring, exactly-once verification.
Step 6: Write runbooks for the failure modes (replication slot full, schema break, lag spike) and rehearse them; a minimal slot-health check is sketched below.
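For Step 6, one failure mode worth scripting into a runbook is replication-slot bloat on a PostgreSQL source: if the Debezium consumer stalls, the slot pins WAL and the primary's disk eventually fills. A minimal check, assuming psycopg2, PostgreSQL 10+, and a placeholder connection string for the source database, might look like this:

```python
import psycopg2  # assumes psycopg2 is installed; run against the CDC source primary

# Placeholder DSN -- point at the source database Debezium reads from.
conn = psycopg2.connect("dbname=orders host=orders-db.internal user=cdc_monitor")

SLOT_RETENTION_SQL = """
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
"""

with conn, conn.cursor() as cur:
    cur.execute(SLOT_RETENTION_SQL)
    for slot_name, active, retained_wal in cur.fetchall():
        # An inactive slot that keeps retaining WAL is the classic 'slot exhaustion' precursor:
        # PostgreSQL cannot recycle WAL segments until the slot advances, and disk usage grows.
        print(f"{slot_name}: active={active}, retained WAL={retained_wal}")
```

Wiring the retained-WAL figure into an alert (with a threshold sized to your disk headroom) turns the runbook from a reaction into an early warning.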
Formula
In Practice
Confluent (the company commercializing Apache Kafka) and Debezium (the open-source CDC framework now under the Red Hat umbrella) together define the modern CDC + streaming reference architecture. Public case studies: Wise (formerly TransferWise) runs CDC from MySQL into Kafka into multiple downstream services for cross-border payment processing; sub-second freshness on transaction state is a regulatory and customer experience requirement. Netflix uses CDC for replication between their Cassandra-based services. Uber's Marmaray and Apache Hudi work was driven by CDC ingestion needs for their massive operational data volumes. The recurring pattern: CDC + streaming wins decisively when sub-minute freshness has clear business value (payments, fraud, real-time inventory, customer-facing search/personalization) and loses to simpler batch when the data is consumed in dashboards reviewed once a day.
Pro Tips
- 01
Replication lag is your #1 operational metric. A CDC pipeline with growing lag is a failing pipeline: within hours it becomes hours-stale, defeating the entire point. Alert at 30-second lag; page at 5-minute lag for any pipeline marketed as real-time.
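One way to measure end-to-end lag is to compare the source timestamp Debezium embeds in each event (source.ts_ms) with the wall clock at the consumer. The sketch below is a monitoring loop, not a production alerter: it assumes the confluent-kafka client, placeholder broker and topic names, and Debezium's JSON envelope (with or without the schema wrapper).

```python
import json
import time
from confluent_kafka import Consumer  # assumes the confluent-kafka package is installed

# Placeholder broker and topic -- adjust for your pipeline.
consumer = Consumer({
    "bootstrap.servers": "kafka.internal:9092",
    "group.id": "cdc-lag-monitor",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["prod.orders.public.orders"])

ALERT_S, PAGE_S = 30, 300  # alert at 30 seconds of lag, page at 5 minutes

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error() or msg.value() is None:
        continue  # errors and tombstone records are skipped in this sketch
    event = json.loads(msg.value())
    payload = event.get("payload", event)  # handles schema-wrapped and plain envelopes
    source_ts_ms = (payload.get("source") or {}).get("ts_ms")
    if source_ts_ms is None:
        continue
    lag_s = time.time() - source_ts_ms / 1000.0  # wall clock minus source commit time
    if lag_s > PAGE_S:
        print(f"PAGE: replication lag {lag_s:.0f}s")   # wire to your pager here
    elif lag_s > ALERT_S:
        print(f"ALERT: replication lag {lag_s:.0f}s")
```

Clock skew between source and consumer hosts will bias this measurement, so treat it as a trend signal rather than a precise SLA meter.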
- 02
Schema evolution is the second-hardest CDC problem (after exactly-once semantics). Use a schema registry (Confluent Schema Registry, AWS Glue Schema Registry) and enforce backward-compatible changes only at the source. An upstream column rename can cascade into 12 broken downstream consumers within minutes.
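Confluent Schema Registry exposes a compatibility-check endpoint that can run in CI before a producer-side schema change ships. The sketch below assumes a placeholder registry URL, a placeholder subject name, and an illustrative Avro value schema; it fails the build if the proposed schema is not compatible with the latest registered version under the subject's configured compatibility mode.

```python
import json
import requests  # assumes the 'requests' package is installed

# Placeholder registry URL and subject name.
REGISTRY = "http://schema-registry.internal:8081"
SUBJECT = "prod.orders.public.orders-value"

# Illustrative proposed value schema, e.g. after adding a nullable column upstream.
proposed_schema = {
    "type": "record",
    "name": "orders_value",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "status", "type": "string"},
        {"name": "amount_cents", "type": ["null", "long"], "default": None},
    ],
}

resp = requests.post(
    f"{REGISTRY}/compatibility/subjects/{SUBJECT}/versions/latest",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(proposed_schema)}),
    timeout=10,
)
resp.raise_for_status()
if not resp.json().get("is_compatible", False):
    raise SystemExit("Schema change is not backward compatible -- block the deploy.")
```

Gating deploys on this check catches the 'column rename cascades into broken consumers' failure before it reaches the topic.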
- 03
Exactly-once semantics are easier to claim than to deliver. Most CDC pipelines provide 'at-least-once' with deduplication on the consumer side. For financial use cases where double-counting matters, build idempotency into your consumers explicitly; don't trust the streaming framework's exactly-once promises without testing them under failure conditions.
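A minimal sketch of consumer-side idempotency follows, assuming an illustrative event shape (a primary key, a monotonically increasing source position such as an LSN, and the post-image row); the real Debezium envelope differs, but the guard is the same: ignore any event whose source position is not newer than the one already applied for that key.

```python
# Consumer-side idempotency for at-least-once CDC delivery.
# State lives in a dict here for illustration; in production it would be a keyed
# table in the destination, updated with an upsert.

def apply_change(state: dict, event: dict) -> None:
    """Apply a CDC change event at most once per source position."""
    key = event["pk"]                # the source row's primary key (illustrative field name)
    position = event["source_lsn"]   # monotonically increasing per source row (illustrative)
    current = state.get(key)
    # Replays and duplicates carry a position we have already applied; skip them.
    if current is not None and position <= current["source_lsn"]:
        return
    state[key] = {"source_lsn": position, "row": event["after"]}

state: dict = {}
apply_change(state, {"pk": 42, "source_lsn": 100, "after": {"status": "pending"}})
apply_change(state, {"pk": 42, "source_lsn": 100, "after": {"status": "pending"}})  # duplicate: no-op
apply_change(state, {"pk": 42, "source_lsn": 101, "after": {"status": "settled"}})
assert state[42]["row"]["status"] == "settled"
```

Because the guard compares source positions rather than counting deliveries, the same logic survives both broker redeliveries and a full consumer-group replay from an earlier offset.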
Myth vs Reality
Myth
"Real-time streaming is the modern default; batch is legacy"
Reality
Batch is the right answer for the vast majority of analytics use cases. Daily and hourly dashboards don't benefit from sub-second freshness. Streaming infrastructure costs 5-10x more in operational overhead than equivalent batch pipelines, and most companies underestimate that delta until the on-call burden hits engineering morale. Use streaming where it matters; use batch where it's good enough.
Myth
"CDC eliminates the need for batch transformations"
Reality
CDC handles ingestion. You still need transformations (joins, aggregations, business logic) on the destination side, and most of those are still better expressed as batch dbt models running every 5-15 minutes than as continuous stream processing. Stream processing is hard to debug, hard to backfill, and overkill for most aggregation logic. The dominant modern pattern is CDC ingestion + micro-batch dbt transformations, not full end-to-end streaming.
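The destination-side half of that pattern can be illustrated with a latest-record-wins micro-batch model. The sketch below uses DuckDB as a stand-in for the warehouse and invented table and column names; in practice this logic would typically live in an incremental dbt model scheduled every 5-15 minutes.

```python
import duckdb  # assumes the duckdb package is installed; stands in for the warehouse

con = duckdb.connect()

# Raw CDC change events as landed by the ingestion layer (illustrative table and columns).
con.execute(
    "CREATE TABLE raw_orders_changes("
    "order_id BIGINT, status TEXT, op TEXT, source_ts TIMESTAMP)"
)
con.execute("""
    INSERT INTO raw_orders_changes VALUES
      (1, 'pending', 'c', '2024-01-01 10:00:00'),
      (1, 'settled', 'u', '2024-01-01 10:05:00'),
      (2, 'pending', 'c', '2024-01-01 10:01:00'),
      (2, NULL,      'd', '2024-01-01 10:06:00')
""")

# The micro-batch 'model': latest change per key wins; deleted rows drop out of current state.
con.execute("""
    CREATE OR REPLACE TABLE orders_current AS
    SELECT order_id, status, source_ts
    FROM (
        SELECT *, row_number() OVER (PARTITION BY order_id ORDER BY source_ts DESC) AS rn
        FROM raw_orders_changes
    ) AS ranked
    WHERE rn = 1 AND op <> 'd'
""")

print(con.execute("SELECT * FROM orders_current ORDER BY order_id").fetchall())
# Only order 1 survives, in its latest 'settled' state; order 2's latest change was a delete.
```

Re-running the model is harmless because it rebuilds current state from the change log, which is exactly why micro-batch transformations are easier to debug and backfill than continuous stream processing.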
Try it
Run the numbers.
Pressure-test the concept against your own knowledge: answer the challenge or try the live scenario.
Knowledge Check
A 300-person company is debating whether to migrate all data pipelines from Fivetran (15-min batch) to Debezium + Kafka (CDC streaming). Their use cases: 60 dashboards reviewed daily, 8 dashboards reviewed hourly, 2 fraud detection models requiring sub-minute data, and 1 customer-facing personalization service requiring sub-second data. What is the right architecture?
Industry benchmarks
Is your number good?
Calibrate against real-world tiers. Use these ranges as targets, not absolutes.
CDC Pipeline Replication Lag (production benchmarks)
Debezium + Kafka pipelines in production at mid-to-large enterprises:
Excellent: < 5 seconds end-to-end
Good: 5-30 seconds
Acceptable: 30 seconds - 2 minutes
Degraded (failing the SLA): > 2 minutes
Source: https://debezium.io/documentation/reference/stable/architecture.html
Real-world cases
Companies that lived this.
Verified narratives with the numbers that prove (or break) the concept.
Wise (formerly TransferWise)
2018-present
Wise runs CDC pipelines from MySQL into Kafka into multiple downstream services for cross-border payment processing. Sub-second freshness on transaction state is required by both regulators (real-time fraud and AML monitoring) and customers (instant balance updates). Their published architecture uses Debezium for change capture, Kafka as the streaming backbone, and downstream consumers ranging from fraud-detection ML models to customer-facing balance services to compliance reporting pipelines. The CDC + streaming architecture is foundational to the product, not a layer added later; for Wise, sub-second data is the product.
Source: MySQL via Debezium
Streaming Backbone: Apache Kafka
Latency Requirement: Sub-second end-to-end
Business Driver: Regulatory (AML/fraud) + UX (balance freshness)
Streaming wins decisively when sub-second freshness is part of the product. The cost is justified by the experience and regulatory outcomes the architecture enables.
Confluent + Debezium
2014-present
Confluent (commercializing Apache Kafka) and Debezium (now under Red Hat) together define the open-source reference architecture for CDC + streaming. Confluent Cloud handles managed Kafka, Schema Registry, and ksqlDB; Debezium provides connectors for MySQL, PostgreSQL, MongoDB, SQL Server, Oracle, and others. Customer adoption is concentrated in financial services, fraud detection, real-time inventory, and customer-facing personalization. Public case studies (Wise, Robinhood, Trivago, Lyft) all share a common pattern: sub-second freshness is required by either a regulator or a customer-facing experience.
Reference Stack: Debezium + Kafka + Schema Registry
Confluent Cloud Customers: 5,000+
Sweet Spot Industries: Finance, fraud, real-time commerce
Common Pattern: CDC ingestion + downstream micro-services
The Debezium + Kafka stack is the industry default for serious CDC + streaming. The question to ask is not 'can we do CDC' but 'do we need CDC for this specific use case'.
Hypothetical: Mid-Market SaaS
2021-2023
A 350-person SaaS company decided to standardize all data pipelines on self-managed Kafka + Debezium because the CTO wanted a 'real-time data platform'. They migrated 35 pipelines over 18 months at a fully-loaded cost of ~$1.4M (infrastructure + 2 dedicated streaming engineers + the opportunity cost of slower analytics delivery). The actual freshness benefit: 4 of the 35 pipelines had a use case for sub-minute freshness; the other 31 dashboards were reviewed daily. After the new CFO did the math, the company hybridized (Fivetran for the 31 batch pipelines, Kafka for the 4 real-time ones), saving ~$700K/year in ongoing operational cost. The lesson written up internally: 'real-time should be a feature for the use cases that need it, not a default for everything.'
Migration Investment: ~$1.4M over 18 months
Pipelines Genuinely Needing Real-Time: 4 of 35
Annual Operational Cost Reduction (after hybrid): ~$700K
Hindsight Architecture: Hybrid (batch + selective streaming)
'Real-time everywhere' is the most expensive architectural choice you can make for the wrong reasons. Reserve streaming for the use cases that actually need sub-minute data.
Decision scenario
The CDC Adoption Decision
You're VP of Data at a 600-person ecommerce company. Currently using Fivetran ($150K/year) for 50 source-to-warehouse pipelines, dbt for transformations, mostly batch use cases. The product team wants real-time inventory updates for the storefront (sub-second freshness on stock levels) and the fraud team wants sub-minute transaction streaming. Your CTO suggests 'while we're at it, let's migrate everything to Kafka and standardize'. Engineering capacity is tight. You have 6 months and need to deliver the real-time use cases without blowing the data team's roadmap.
Total Pipelines: 50
Pipelines Needing Sub-Minute Freshness: 2 (real-time inventory, fraud)
Current Annual Pipeline Spend: $150K (Fivetran)
CTO Proposal: Migrate all 50 to Kafka
Engineering Headroom: Tight
Decision 1
You can either accept the CTO's universal-streaming proposal, or push for hybrid (batch for 48 pipelines, streaming for 2). The CTO is a respected technical voice and the proposal sounds modern.
Universal streaming: migrate all 50 pipelines to Confluent Cloud + Debezium over 12 months; hire 2 streaming engineers.
Hybrid (optimal): keep Fivetran for the 48 batch pipelines, deploy Confluent Cloud + Debezium for the 2 real-time use cases (inventory + fraud); add 1 streaming engineer (not 2).
Beyond the concept
Turn CDC and Streaming into a live operating decision.
Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.