Data & System Architecture, from the ground up Lesson 41 / 80

Why streaming: bounded vs unbounded data

The conceptual shift from batch to streaming. Why 'stream' is just 'batch with very small batches' in the limit, and why that limit changes the design.

Module 5 ended with Netflix running batch on petabytes. The data sat still on S3, the jobs read it, the jobs wrote it back, and the next morning the dashboards were updated. Module 6 starts where that picture breaks down. The data does not sit still. The job does not finish. The dashboard wants to be right now, not tomorrow morning. The shift from batch to streaming is the most consequential data-architecture decision most teams make after the choice of database, and the conceptual move underneath it is small enough to fit in one sentence: the data is no longer bounded.

This lesson is the conceptual opener. No Kafka yet, no Flink, no watermarks. Just the reframe and the trade-off, because the rest of the module makes more sense once the reframe is in place.

Bounded versus unbounded data

The cleanest framing comes from Tyler Akidau, Slava Chernyak, and Reuven Lax in “Streaming Systems” (O’Reilly, 2018). They argue that the industry’s vocabulary had been confused for years because it conflated two orthogonal questions: what shape is the data, and what processing approach do you apply to it.

Bounded data is a fixed dataset with a known beginning and a known end. Last week’s transactions. Last month’s logs. The five-million-row CSV your finance team just exported. You can name its size, count its rows, list its files. Bounded data has a natural processing approach (read it, process it, write a result, you are done) and that approach is batch.

Unbounded data is a feed that does not end. Page views. Sensor readings. User clicks. Credit-card swipes. Bank-account events. Kafka topics. The dataset has a beginning (the day you started recording) and no end (you keep recording). You cannot name its size in any final way: by the time you finish counting, the count is already wrong. Unbounded data has a natural processing approach (process records as they arrive and never stop) and that approach is streaming.

The reframe matters because the historical conflation made teams think batch and streaming were two different worlds. They are not. Bounded versus unbounded is a property of the data. Batch versus streaming is a property of the processing. You can apply either approach to either kind of data, and the four combinations all exist in production.

Batch over bounded data is the default for analytical work: nightly ETL, weekly reports, training-data preparation. Streaming over unbounded data is the default for real-time work: dashboards, alerts, fraud detection. Batch over unbounded data is what you do when you take a snapshot of a stream and process it with batch tools (every “load yesterday’s events into the warehouse and run SQL” pipeline does this). Streaming over bounded data is rarer but exists: replaying a finite log of historical events through a streaming engine, which is what backfilling in Kappa architecture looks like.
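The four combinations can be sketched in a few lines of Python. This is a toy illustration, not how a production engine works: the record shape, the function names (`bounded_source`, `unbounded_source`, `batch_total`, `streaming_totals`), and the use of an infinite generator to stand in for a stream are all assumptions made for the example.

```python
import itertools
from typing import Iterable, Iterator


def bounded_source() -> list:
    """A fixed dataset: known size, known end."""
    return [{"amount": 10}, {"amount": 5}, {"amount": 7}]


def unbounded_source() -> Iterator[dict]:
    """A feed that never ends, simulated with an infinite generator."""
    for i in itertools.count():
        yield {"amount": i}


def batch_total(records: Iterable[dict]) -> int:
    """Batch: consume the whole input, then return one final answer."""
    return sum(r["amount"] for r in records)


def streaming_totals(records: Iterable[dict]) -> Iterator[int]:
    """Streaming: emit an updated answer after every record."""
    total = 0
    for r in records:
        total += r["amount"]
        yield total


# 1. Batch over bounded data: the default analytical shape.
batch_over_bounded = batch_total(bounded_source())

# 2. Streaming over bounded data: replay a finite log through streaming logic.
streaming_over_bounded = list(streaming_totals(bounded_source()))

# 3. Batch over unbounded data: snapshot a slice of the feed, then batch it.
snapshot = list(itertools.islice(unbounded_source(), 4))
batch_over_unbounded = batch_total(snapshot)

# 4. Streaming over unbounded data: results never "finish", so we only
#    look at the first four emissions here.
streaming_over_unbounded = list(
    itertools.islice(streaming_totals(unbounded_source()), 4)
)
```

Calling `batch_total(unbounded_source())` directly would never return, which is the practical meaning of "unbounded": batch logic needs someone to draw a boundary first.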

Once you separate the two questions, the architecture decisions get cleaner. The data shape is a fact. The processing approach is a choice, made based on latency requirements, tooling fit, and what your team can operate.

Why streaming matters in 2026

The technology has been viable for a decade; the demand keeps growing. Four pressures push teams toward streaming.

Lower latency. A daily batch refreshes the dashboard at 6am. An hourly batch refreshes it twenty-four times a day. A streaming pipeline refreshes it every few seconds. For executives, “good enough” used to mean tomorrow morning. For operations teams, “good enough” used to mean an hour ago. For fraud-detection systems, “good enough” means right now, before the transaction settles. The threshold of acceptable latency has been falling for fifteen years, and streaming is what you reach for when batch can no longer hit it.

Real-time decisions. Some decisions cannot wait. A credit-card transaction has to be approved or declined in under a second. A recommendation has to render with the page. A trading signal has to fire before the price moves. These workloads cannot be expressed as batch over yesterday’s data; they require processing the event as it arrives.

Operational metrics. Knowing the state of your own system is itself a streaming problem. API latency histograms, error rate per endpoint, queue depth, database connection saturation. Modern observability stacks (Prometheus, Datadog, OpenTelemetry) are streaming systems even when nobody on the team thinks of them that way.

Event-driven architecture. A microservices system communicates through events. The order service writes an OrderPlaced event; the inventory service reads it and decrements stock; the email service sends a confirmation; the analytics service updates the dashboard. Each consumer is its own streaming pipeline over the same source, and the integration spine that ties them together is a streaming system. Module 7 covers event-driven patterns in depth.
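A minimal sketch of that shape, assuming an in-memory list stands in for the event log and each consumer tracks its own read position. The `Consumer` class, the event fields, and the handler names are hypothetical and exist only for illustration; a real system would use a durable log rather than a Python list.

```python
# The shared, append-only log of events (an in-memory stand-in).
log = [
    {"type": "OrderPlaced", "order_id": 1, "sku": "widget", "qty": 2},
    {"type": "OrderPlaced", "order_id": 2, "sku": "widget", "qty": 1},
]


class Consumer:
    """Reads the shared log from its own offset, independently of others."""

    def __init__(self, handler):
        self.offset = 0          # position of the next unread event
        self.handler = handler

    def poll(self, events):
        while self.offset < len(events):
            self.handler(events[self.offset])
            self.offset += 1


stock = {"widget": 10}
emails = []


def decrement_stock(event):
    stock[event["sku"]] -= event["qty"]


def send_confirmation(event):
    emails.append(f"Order {event['order_id']} confirmed")


inventory = Consumer(decrement_stock)
email = Consumer(send_confirmation)

inventory.poll(log)   # inventory catches up; the email consumer lags behind

log.append({"type": "OrderPlaced", "order_id": 3, "sku": "widget", "qty": 4})

inventory.poll(log)   # processes only the new event
email.poll(log)       # processes all three events from its own offset
```

The producer never calls the consumers directly. Each consumer can fall behind, restart, or be added later without the others noticing, which is the property that makes the log the integration spine.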

The cost: streaming is harder

The reframe makes streaming sound like a free upgrade. It is not. Three structural problems get harder when you move from bounded to unbounded data.

State management is harder. A batch job reads its input, computes, writes output, and exits. The state lives in memory for the duration of the job, then disappears. A streaming job runs forever, and any aggregation it computes (a per-user counter, a windowed average, a running join) has to be remembered across restarts, across machine failures, across cluster rebalances. Lesson 43 covers how the major engines store this state; lesson 45 covers what “exactly-once” means in a system whose state has to survive failure.
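To make the problem concrete, here is a toy sketch of a per-user counter that survives a crash by checkpointing its state together with its input offset. Everything in it is an assumption for illustration: the `CheckpointedCounter` class, the JSON file as the state store, and the list standing in for the input log. Real engines use far more sophisticated mechanisms.

```python
import json
import os
import tempfile


class CheckpointedCounter:
    """Per-user counter whose state and input offset are saved together."""

    def __init__(self, path):
        self.path = path
        self.counts = {}
        self.offset = 0  # index of the next record to process
        if os.path.exists(path):
            # Restore the last checkpoint after a restart.
            with open(path) as f:
                saved = json.load(f)
            self.counts = saved["counts"]
            self.offset = saved["offset"]

    def checkpoint(self):
        # Write to a temp file, then rename, so a crash mid-write cannot
        # leave a half-written checkpoint behind.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"counts": self.counts, "offset": self.offset}, f)
        os.replace(tmp, self.path)

    def process(self, log, stop_at=None, checkpoint_every=2):
        end = len(log) if stop_at is None else stop_at
        while self.offset < end:
            user = log[self.offset]
            self.counts[user] = self.counts.get(user, 0) + 1
            self.offset += 1
            if self.offset % checkpoint_every == 0:
                self.checkpoint()


log = ["a", "b", "a", "c", "a", "b"]
path = os.path.join(tempfile.mkdtemp(), "counter.json")

first = CheckpointedCounter(path)
first.process(log, stop_at=5)   # last checkpoint was taken at offset 4
del first                       # simulated crash: in-memory state is lost

second = CheckpointedCounter(path)
restored_offset = second.offset  # resumes from the checkpoint, not from 5
second.process(log)
final_counts = second.counts
```

The detail that matters is that counts and offset are written in the same checkpoint. If they were saved separately, a crash between the two writes would either drop a record or count it twice.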

Time semantics are harder. A batch job partitions by date because the date is in the data. A streaming job has to deal with two different times at once: the time the event happened (event time) and the time the system processed it (processing time). They are not equal, because events arrive late, out of order, or after the network swallowed them for fifteen seconds. Lesson 44 covers watermarks, the mechanism streaming engines use to reason about event time without waiting forever for the laggards.
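A small worked example shows how the two clocks disagree. The events and the 60-second tumbling window below are made up for illustration; each event carries the time it happened and the time the system saw it.

```python
from collections import Counter

WINDOW_SECONDS = 60

# (event_time, processing_time) in seconds; values are invented.
events = [
    (5, 6),      # arrives almost immediately
    (50, 52),
    (58, 75),    # happened in the first minute, arrived in the second
    (70, 71),
    (65, 130),   # very late: happened in minute two, arrived in minute three
]


def window_start(t, size=WINDOW_SECONDS):
    """Start of the tumbling window that contains timestamp t."""
    return t - t % size


# Count per window using the time each event actually happened.
by_event_time = Counter(window_start(e) for e, _ in events)

# Count per window using the time the system processed each event.
by_processing_time = Counter(window_start(p) for _, p in events)
```

Here `by_event_time` is `{0: 3, 60: 2}` while `by_processing_time` is `{0: 2, 60: 2, 120: 1}`. Same five events, different answers. The event-time answer is the one the business usually wants, but the first window could not have been finalized at second 60, because one of its events had not arrived yet.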

Failure recovery is harder. A batch job that fails halfway through can be re-run from the start. A streaming job that fails halfway through has been emitting output for hours; “re-run from the start” means re-emitting, which is wrong unless the downstream is idempotent. Lesson 38 covered idempotent batch; the streaming version is structurally similar but more relentless because the failure surface is continuous.
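The difference between an idempotent and a non-idempotent downstream can be sketched with two toy sinks. The event IDs, the `emit` function, and the two sinks are hypothetical; the point is only what happens to each when the same events are emitted twice.

```python
events = [
    {"id": "e1", "amount": 10},
    {"id": "e2", "amount": 5},
    {"id": "e3", "amount": 7},
]

append_sink = []   # non-idempotent: every emit adds a new row
keyed_sink = {}    # idempotent: emits are upserts keyed by event id


def emit(event):
    append_sink.append(event["amount"])
    keyed_sink[event["id"]] = event["amount"]


# First run: the job emits two events and then fails.
for event in events[:2]:
    emit(event)

# Recovery: the job restarts and replays the input from the beginning.
for event in events:
    emit(event)

append_total = sum(append_sink)          # e1 and e2 are counted twice
keyed_total = sum(keyed_sink.values())   # each event is counted once
```

After the replay, `append_total` is 37 and `keyed_total` is 22, the correct figure. The keyed sink tolerates the replay because writing the same event twice leaves it in the same state.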

The honest reading: streaming is the right answer when latency or volume justifies it, and the wrong answer when neither does. Reaching for streaming because it sounds modern is a way to triple the operational complexity of a workload that ran fine as a cron job.

The continuum view

The split between batch and streaming feels categorical when you describe it in slides. In practice it is a continuum, and the position on the continuum is set by the window size you process.

A daily batch is a batch with a one-day window. The job runs once at 2am, reads the last day of data, and produces a result. An hourly batch is a batch with a one-hour window. A microbatch every minute is the same shape with a sixty-second window. A microbatch every second is the same shape again, just faster. At some point you stop calling it batch and start calling it streaming, but the line is fuzzy. Spark Structured Streaming, which lesson 43 covers, is a microbatch system by default: it processes records in small batches, typically every few hundred milliseconds, and presents a streaming API on top of a batch engine. Flink, on the other hand, processes records one at a time as they arrive, with no batching, and is described as a “true streaming” engine. The distinction matters for latency at the millisecond scale and for the shape of the state model. It does not matter for the conceptual question of whether you are processing bounded or unbounded data.

The continuum view explains why teams can migrate gradually. A nightly batch becomes an hourly batch when the business asks for fresher data. The hourly batch becomes a fifteen-minute microbatch when an hour is too slow. The microbatch becomes a streaming pipeline when fifteen minutes is too slow. Each step shortens the window; the architecture shifts shape gradually rather than in one jump.
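One way to see the continuum is to write the aggregation once and treat the window size as a parameter. The sketch below is illustrative only: `windowed_counts` and the timestamps (seconds since the start of a day) are invented for the example.

```python
from collections import defaultdict


def windowed_counts(timestamps, window_seconds):
    """Count events per tumbling window of the given size."""
    counts = defaultdict(int)
    for ts in timestamps:
        counts[ts - ts % window_seconds] += 1
    return dict(sorted(counts.items()))


# Event timestamps in seconds since midnight (invented values).
timestamps = [0, 30, 59, 61, 3599, 3600, 86399]

daily = windowed_counts(timestamps, 86_400)    # one-day window
hourly = windowed_counts(timestamps, 3_600)    # one-hour window
minutely = windowed_counts(timestamps, 60)     # sixty-second window
```

The logic is identical in all three calls; only the window shrinks. What changes in practice is everything around the function: how often it runs, how long results wait before being emitted, and how much state has to be held between runs.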

flowchart LR
    subgraph Bounded[Bounded data]
      B1[Last week of transactions]
      B2[Monthly export]
      B3[Historical archive]
    end
    subgraph Unbounded[Unbounded data]
      U1[Page views]
      U2[Sensor readings]
      U3[Kafka topic]
      U4[CDC log]
    end
    Bounded -->|natural fit| Batch[Batch processing]
    Unbounded -->|natural fit| Stream[Stream processing]
    Bounded -.->|possible| Stream
    Unbounded -.->|via snapshot| Batch

Diagram to create: a polished version showing two columns of data sources (bounded on the left, unbounded on the right) with arrows leading down to the two processing approaches. Solid arrows for natural fit, dashed for the cross-combinations. The point is that the shape of the data and the choice of processing are independent, and the cross-combinations both exist in production.

The two architecture styles

When teams build streaming pipelines, two recurring architectures show up in the literature. They are worth naming now because the rest of Module 6 will keep referring to them.

Lambda architecture, named by Nathan Marz around 2011 and dominant from 2014 to 2018, runs two pipelines in parallel: a batch pipeline that produces accurate results from the full historical dataset, and a streaming pipeline that produces approximate results with low latency. Query results are merged at read time: the batch view is correct but stale, the streaming view is fresh but approximate, and the application combines them. Lambda was a pragmatic answer to the limitations of streaming engines in the early 2010s, when “exactly-once” was a research topic and “stateful streaming at scale” was painful. The cost was running and maintaining two parallel pipelines, with two implementations of every transformation, two sets of bugs, two sets of operational concerns.
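The read-time merge can be sketched as follows. This is a toy model under stated assumptions: the events, the `BATCH_CUTOFF` timestamp, and the two view builders are hypothetical, and the speed layer is shown as exact here even though in practice it is often approximate.

```python
from collections import Counter

# (timestamp, key) events; values are invented for illustration.
events = [(1, "a"), (2, "b"), (3, "a"), (11, "a"), (12, "c")]

BATCH_CUTOFF = 10  # the last timestamp covered by the most recent batch run


def build_batch_view(all_events, cutoff):
    """Batch layer: recomputed from full history, correct but stale."""
    return Counter(key for ts, key in all_events if ts <= cutoff)


def build_speed_view(all_events, cutoff):
    """Speed layer: covers only what arrived after the last batch run."""
    return Counter(key for ts, key in all_events if ts > cutoff)


batch_view = build_batch_view(events, BATCH_CUTOFF)
speed_view = build_speed_view(events, BATCH_CUTOFF)


def query(key):
    """Serving layer: merge the two views at read time."""
    return batch_view[key] + speed_view[key]


counts = {key: query(key) for key in ["a", "b", "c", "d"]}
```

Even in this toy version the counting logic appears twice, once per layer. That duplication is the maintenance cost described above: any change to the transformation has to be made, tested, and deployed in both pipelines.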

Kappa architecture, named by Jay Kreps in 2014, runs a single streaming pipeline. Batch is not a separate system; batch is a streaming pipeline replaying a long bounded segment of the same log. If you need to recompute history, you point a fresh consumer at offset zero of the Kafka topic and let it run. The streaming engine handles both the live tail and the historical backfill with the same code. Kappa became viable as the streaming engines (Flink in particular) developed strong exactly-once semantics, durable state, and savepoint mechanisms. By 2020 it was the architecture most new streaming systems were starting from, and Lambda was a legacy choice rather than a default. Lesson 46 covers the trade-offs between the two in detail.
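The core idea, one piece of code serving both the live tail and the backfill, can be sketched with a list standing in for the replayable log. The `consume` function and the event names are assumptions for illustration, not any real client API.

```python
# An append-only, replayable log (an in-memory stand-in).
log = ["view", "click", "view", "view", "click"]


def consume(events, start_offset, state=None):
    """Process events from start_offset to the current end of the log.

    Returns the updated state and the offset to resume from next time.
    """
    state = dict(state or {})
    for offset in range(start_offset, len(events)):
        kind = events[offset]
        state[kind] = state.get(kind, 0) + 1
    return state, len(events)


# Live consumer: starts at the beginning and follows the tail.
live_state, live_offset = consume(log, 0)

log.extend(["click", "view"])  # new events arrive

live_state, live_offset = consume(log, live_offset, live_state)

# Backfill: a fresh consumer replays from offset zero with the same code.
rebuilt_state, _ = consume(log, 0)
```

`rebuilt_state` and `live_state` both end up as `{"view": 4, "click": 3}`. There is no second implementation to keep in sync: recomputing history is the same function pointed at an earlier offset, which only works if the log retains enough history to replay.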

The reason to introduce them now is that the choice affects what tooling you reach for. Lambda needs both a batch engine and a streaming engine, glued together by a serving layer. Kappa needs a strong streaming engine and a durable, replayable log. Module 6 spends most of its time on the Kappa shape because that is what most 2026 teams are building, but the Lambda lineage is everywhere in legacy code.

What Module 6 covers

The next seven lessons walk the modern streaming stack. Lesson 42 covers Kafka, the dominant log and the substrate every other component reads from and writes to. Lesson 43 compares the three stream processing engines that matter in 2026: Flink, Kafka Streams, Spark Structured Streaming. Lesson 44 covers event time, watermarks, and the windowing primitives, the most conceptually dense lesson in the module because time semantics are where streaming stops looking like fast batch. Lesson 45 covers exactly-once processing: what the term actually means and what guarantees you can buy. Lesson 46 covers Lambda versus Kappa in depth. Lesson 47 covers Change Data Capture, the pattern that turns a database transaction log into a Kafka topic. Lesson 48 closes the module with the Uber streaming case study.

The thread through all of them is the reframe in this lesson. Bounded data has one natural shape, unbounded data has another, and the next seven lessons are the toolkit for the unbounded case. The toolkit is real; the cost is real; the leverage is enormous when you need it.

Citations and further reading

  • Tyler Akidau, Slava Chernyak, Reuven Lax, “Streaming Systems” (O’Reilly, 2018). The reference book for the conceptual model used in this module. The bounded/unbounded reframe, the event-time/processing-time distinction, and the windowing model all come from this book and the earlier Google papers behind it.
  • Tyler Akidau, “The world beyond batch: Streaming 101” and “Streaming 102”, O’Reilly Radar, 2015 and 2016. The two essays that introduced the framing to the broader community before the book came out. Still worth reading; the conceptual moves are the same.
  • Nathan Marz and James Warren, “Big Data: Principles and Best Practices of Scalable Real-Time Data Systems” (Manning, 2015). The original Lambda architecture book.
  • Jay Kreps, “Questioning the Lambda Architecture”, O’Reilly Radar, 2014, https://www.oreilly.com/radar/questioning-the-lambda-architecture/ (retrieved 2026-05-01). The essay that named Kappa and argued against running two pipelines.
  • Martin Kleppmann, “Designing Data-Intensive Applications” (O’Reilly, 2017), chapter 11. The standard reference for stream processing in a systems context, with a more conservative framing than the Akidau book.