Data Strategy · Intermediate · 7 min read

Batch vs Streaming Architecture

Batch processing collects data over a window (an hour, a day) and processes it in scheduled runs — high throughput, cheap, simple. Streaming processing handles each event as it arrives — low latency, expensive, complex. Modern data stacks usually combine both: batch for analytics, finance, ML training; streaming for fraud detection, alerting, personalization. Apache Kafka is the dominant streaming substrate; Apache Flink, Spark Streaming, and ksqlDB are the leading processors. The architecture decision is not 'which is better' — it is 'which problems genuinely need streaming, and which are batch problems people are dressing up as streaming because it sounds modern.'

Also known as: Batch Processing · Stream Processing · Real-Time vs Batch · Streaming Data Architecture

The Trap

The trap is streaming-by-default in 2026 data stacks. Teams reach for Kafka + Flink because it's the prestigious architecture, then spend years wrestling with exactly-once semantics, watermarks, late-arriving events, and operational on-call burden — for use cases where a cron job running dbt every 30 minutes would have been adequate. Streaming pipelines are roughly 5x more expensive to operate than batch (more infrastructure, more on-call time, scarcer skills on the team). The business value rarely justifies it. The other trap is the opposite: using batch for genuinely time-sensitive use cases like fraud detection or operational alerting, then wondering why the business is unhappy with delayed signals.

What to Do

Run the latency-vs-cost decision: write down the business consumer of each pipeline and the maximum acceptable end-to-end delay. If the consumer is a daily dashboard, that's 24 hours. If it's an analyst running ad-hoc queries, that's hours. If it's a fraud alert, that's seconds. Map each pipeline to the cheapest tier that meets its SLA. Default to batch unless you can articulate why seconds matter. For genuine streaming needs, isolate them — don't streamify the entire stack.
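The tier-mapping step above can be sketched in a few lines. The thresholds (hours for batch, minutes for micro-batch, seconds for streaming) and the example pipelines are illustrative assumptions, not fixed industry cutoffs:

```python
def choose_tier(max_delay_seconds: float) -> str:
    """Return the cheapest processing tier that meets the consumer's SLA.

    Thresholds are illustrative: hours or more -> scheduled batch,
    minutes -> micro-batch, seconds -> genuine streaming.
    """
    if max_delay_seconds >= 3600:
        return "batch"
    if max_delay_seconds >= 60:
        return "micro-batch"
    return "streaming"

# Hypothetical pipeline inventory: consumer -> max acceptable delay (seconds)
pipelines = {
    "daily_dashboard": 24 * 3600,  # daily dashboard: 24 hours is fine
    "adhoc_analytics": 4 * 3600,   # analysts: a few hours is fine
    "fraud_alerts": 5,             # fraud: seconds genuinely matter
}

for name, sla in pipelines.items():
    print(f"{name}: {choose_tier(sla)}")
```

Writing the SLA down per pipeline, rather than per stack, is the point: most inventories come back overwhelmingly batch.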

Formula

Pipeline Cost ≈ (Compute $/hr × Uptime Hours) + (Storage $/GB × Volume) + (On-Call Burden × Engineer Loaded Cost); Streaming typically 3-8x batch cost for equivalent workload.
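The formula translates directly into code. All dollar figures below are made-up inputs for the sketch (not benchmark data); the point is that an always-on pipeline pays for uptime and on-call even when event volume is identical:

```python
def pipeline_cost(compute_per_hr: float, uptime_hrs: float,
                  storage_per_gb: float, volume_gb: float,
                  oncall_fraction: float, engineer_loaded_cost: float) -> float:
    """Pipeline Cost ≈ (Compute $/hr × Uptime Hours)
                     + (Storage $/GB × Volume)
                     + (On-Call Burden × Engineer Loaded Cost)."""
    return (compute_per_hr * uptime_hrs
            + storage_per_gb * volume_gb
            + oncall_fraction * engineer_loaded_cost)

# Same monthly workload, two architectures (assumed numbers):
batch = pipeline_cost(2.0, 60, 0.02, 500, 0.02, 20_000)       # runs ~2 hrs/day
streaming = pipeline_cost(2.0, 720, 0.02, 500, 0.10, 20_000)  # always-on, 24/7

print(f"streaming/batch multiplier: {streaming / batch:.1f}x")  # 6.5x
```

With these assumed inputs the multiplier lands at 6.5x — inside the 3-8x range the formula cites, and driven mostly by uptime and on-call, not storage.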

In Practice

Apache Kafka was open-sourced from LinkedIn in 2011 to solve the 'every team builds a custom queue' problem. By 2023 it powered the streaming substrate at Netflix, Uber, Airbnb, Pinterest, and most Fortune 500 data stacks. But Confluent's own customer surveys consistently show that the majority of Kafka topics power what are effectively batch use cases — events flowing through Kafka but consumed by hourly batch jobs into a warehouse. The lesson: Kafka as transport is broadly useful; full-streaming compute is narrow.

Pro Tips

  1. The 'micro-batch' middle ground (Spark Structured Streaming, dbt every 5 minutes) gives you near-real-time freshness at near-batch cost. For most 'real-time' business asks, micro-batch is the right answer.

  2. Confluent's Jay Kreps (Kafka co-creator) has publicly written that 'streaming is not a replacement for batch' — they coexist. Read 'Questioning the Lambda Architecture' for the canonical thinking.

  3. Streaming on-call is materially more painful than batch on-call. Consider that cost when evaluating: a streaming pipeline that pages your team three times a quarter consumes engineering capacity invisibly.
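The micro-batch pattern from the first tip can be sketched without any framework: buffer incoming events and flush them on a fixed interval instead of processing each one individually. The event shape and flush handler here are hypothetical stand-ins for a bulk warehouse load:

```python
import time
from typing import Callable

class MicroBatcher:
    """Buffer events and flush on a fixed interval (the micro-batch pattern)."""

    def __init__(self, flush: Callable[[list], None], interval_s: float = 300):
        self.flush = flush              # e.g. a bulk COPY into the warehouse
        self.interval_s = interval_s    # 300s ~= dbt every 5 minutes
        self.buffer: list = []
        self._last_flush = time.monotonic()

    def add(self, event) -> None:
        self.buffer.append(event)
        if time.monotonic() - self._last_flush >= self.interval_s:
            self.flush_now()

    def flush_now(self) -> None:
        if self.buffer:
            self.flush(self.buffer)
            self.buffer = []
        self._last_flush = time.monotonic()

# Demo: with interval 0, every add triggers a flush immediately.
batches = []
b = MicroBatcher(flush=batches.append, interval_s=0)
for e in range(5):
    b.add(e)
print(len(batches))  # 5
```

At a realistic interval the buffer absorbs bursts and the downstream system sees cheap bulk writes — near-real-time freshness without per-event processing cost.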

Myth vs Reality

Myth: Streaming is the modern way; batch is legacy.

Reality: Batch underpins most analytics, ML training, and finance reporting at every modern data company. Snowflake, BigQuery, and Databricks all primarily run batch workloads. Streaming is a specialized tool for use cases that genuinely need it — not a general replacement for batch.

Myth: Streaming is faster than batch for everything.

Reality: Streaming is lower-latency per event but often lower-throughput per dollar than batch for the same total volume. Batch can use cheaper spot compute, larger parallelism, and fewer correctness guarantees. For weekly reports, batch finishes faster and cheaper.

Try it

Run the numbers.

Pressure-test the concept against your own knowledge — answer the challenge.


Knowledge Check

Marketing wants 'real-time' attribution data so campaign managers can pause underperforming ads. Currently the data lands in the warehouse every 4 hours via batch. They claim they need streaming. What should you ask?

Industry benchmarks

Is your number good?

Calibrate against real-world tiers. Use these ranges as targets — not absolutes.

Cost Multiplier (Streaming vs Batch, same workload)

Includes infrastructure + on-call + engineering overhead

  • Best Case (well-tuned): 2-3x

  • Typical: 4-6x

  • Common: 6-10x

  • Poorly Designed: > 10x

Source: Hypothetical synthesis of Confluent, Databricks, and Snowflake customer reports

Real-world cases

Companies that lived this.

Verified narratives with the numbers that prove (or break) the concept.


LinkedIn (Apache Kafka origin) · 2010-2011 · Outcome: success

LinkedIn built Kafka because every internal team was building one-off pipelines for activity events, metrics, and logs. The original Kafka design was a unified streaming substrate that any system could publish to and any system could consume from. But LinkedIn explicitly designed Kafka to support BOTH streaming consumers (real-time alerting) AND batch consumers (Hadoop pulled from Kafka in chunks every hour). The 'log as a unified substrate' insight, not 'streaming everywhere,' is what made Kafka transformative.

  • Original Use Case: Activity stream + metrics

  • Architecture: Streaming substrate, mixed consumers

  • Year Open-Sourced: 2011

Even Kafka — the canonical streaming technology — was designed to serve batch consumers as a first-class use case. The architecture that won was 'streaming transport, mixed compute,' not 'streaming everything.'

