Data & System Architecture, from the ground up Lesson 47 / 80

Lambda vs kappa architecture

The historical context: why Lambda existed, why Kappa replaced it, and when Lambda still has a place in 2026.

The previous lessons in Module 6 introduced the streaming building blocks. Kafka as the durable log. Flink and Spark Structured Streaming as the engines that compute over moving data. Watermarks and event time as the model for “when can I commit a result”. This lesson zooms back out and asks the architectural question that Module 5 and Module 6 together force on every data team: when you have both batch and streaming pipelines, how do they fit together?

There are two canonical answers, both with names that sound much fancier than they are. Lambda architecture runs a batch pipeline and a streaming pipeline side by side and merges them at query time. Kappa architecture runs only a streaming pipeline and replays the historical log when it needs the batch view. The names came from the people who proposed them. The decision is older than the names.

By 2026 the practical answer is mostly Kappa, with caveats. The interesting work is understanding why Lambda existed, why Kappa replaced it, and the narrow set of cases where Lambda still earns its keep.

The world that produced Lambda

Nathan Marz proposed Lambda architecture in a series of blog posts and talks starting in 2011, formalised in his 2015 book “Big Data” with James Warren. The context matters. In 2011 the batch tool of the day was Hadoop with MapReduce. The streaming tools were Storm (which Marz had built at BackType and Twitter) and a handful of less mature competitors. Kafka existed but was new. Flink was a research project. Spark was an AMPLab paper.

In that world, batch and streaming were different beasts in two senses, both pushing toward running them separately.

The first sense was technical. Batch tools processed bounded historical datasets, optimised for throughput, and produced exact results. Streaming tools processed unbounded data, optimised for latency, and produced approximate results: rough counts, approximate uniques, sketches that traded accuracy for speed. The two engines had different APIs, different fault-tolerance models, and different performance profiles. Asking a 2011 streaming engine to also do the historical batch case was asking it to do something it had not been built for.

The second sense was operational. Streaming jobs were considered fragile. They could lose data on failure, suffer from late-arriving events with no obvious recovery path, and drift away from the truth over long uptimes. The batch pipeline was the source of truth; the streaming pipeline was an approximation that gave you something to look at while you waited for the batch run to finish. If the streaming pipeline broke, the batch run would correct it the next morning.

Lambda formalised that separation as an architecture.

What Lambda looks like

Three layers, all reading the same input log of immutable events.

Batch layer. Stores the full historical dataset and reprocesses it periodically (typically nightly) to produce accurate, complete views. The output is called the batch view. If the run starts nightly and takes six hours, the batch view is between six and thirty hours stale at any given moment: six when a run has just landed, thirty just before the next one does. That is fine, because the batch view is the source of truth: stale but correct.

Speed layer. Reads the same input log in real time and produces a real-time view covering only the events that arrived since the last batch run. The real-time view is approximate (it may use sketches, sampling, or other speed-for-accuracy trades) and it is short-lived: as soon as the next batch run completes, the corresponding portion of the real-time view is discarded.

Serving layer. Receives queries, reads from both views, and merges the results. A query that asks for “total events in the last seven days” reads the batch view for the days the batch has covered and the real-time view for everything since the last batch run. The user sees a single answer that is mostly accurate (the batch part) and partially approximate (the real-time tail).

flowchart LR
    LOG[(Event log)]
    BATCH[Batch layer<br/>full history<br/>exact, slow]
    SPEED[Speed layer<br/>recent only<br/>approximate, fast]
    BV[(Batch view)]
    RTV[(Real-time view)]
    SERVE[Serving layer]
    Q[Query]
    LOG --> BATCH --> BV
    LOG --> SPEED --> RTV
    BV --> SERVE
    RTV --> SERVE
    Q --> SERVE

In the 2011 to 2015 world this was a defensible design. You got low-latency answers from the speed layer, exact answers from the batch layer, and the merge at query time gave you the best of both. The architecture survived engine failures gracefully: the batch layer would catch up the next time it ran, and the speed layer’s approximations were cheap to throw away.
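The query-time merge described above can be sketched in a few lines. This is a minimal illustration, not code from Marz's book: the class name `ServingLayer`, the per-day count layout, and the `batch_high_water` marker are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Optional


@dataclass
class ServingLayer:
    """Merges an exact batch view with an approximate real-time view (hypothetical layout)."""
    # Daily event counts produced by the batch layer (exact).
    batch_view: Dict[date, int] = field(default_factory=dict)
    # Daily event counts produced by the speed layer (possibly approximate).
    realtime_view: Dict[date, int] = field(default_factory=dict)
    # Last day fully covered by a completed batch run.
    batch_high_water: Optional[date] = None

    def _covered_by_batch(self, day: date) -> bool:
        return self.batch_high_water is not None and day <= self.batch_high_water

    def total_events(self, start: date, end: date) -> int:
        """Total events for days in [start, end], inclusive."""
        total = 0
        # Days the batch has covered come from the batch view only.
        for day, count in self.batch_view.items():
            if start <= day <= end and self._covered_by_batch(day):
                total += count
        # Everything since the last batch run comes from the real-time view.
        for day, count in self.realtime_view.items():
            if start <= day <= end and not self._covered_by_batch(day):
                total += count
        return total

    def complete_batch_run(self, new_view: Dict[date, int], high_water: date) -> None:
        """Install a fresh batch view and discard the real-time days it now covers."""
        self.batch_view = dict(new_view)
        self.batch_high_water = high_water
        self.realtime_view = {
            d: c for d, c in self.realtime_view.items() if d > high_water
        }
```

The point of the high-water mark is that a day is answered by exactly one view. Without it, a day present in both views would be double-counted or answered inconsistently from one query to the next.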

The complaint about Lambda

The complaint, voiced loudly by Jay Kreps in 2014 and by many others since, is that Lambda makes you maintain two codebases that compute the same thing.

The batch pipeline has a Spark or MapReduce job that aggregates seven days of events into a count. The speed pipeline has a Storm or Flink job that does the same aggregation in real time. The logic is the same; the implementation is duplicated, in different languages, against different APIs, with different testing harnesses, deployed to different infrastructure, often owned by different teams.

Three operational pains follow.

Bug fixes have to land twice. Fix a bug in the batch job, then port the fix to the streaming job. The two codebases drift under deadline pressure when only the urgent path gets patched. Eventually the batch view and the real-time view start disagreeing in subtle ways, and figuring out which one is right becomes a forensic exercise.

Schemas drift. A new field is added to the input events. The batch job picks it up; the streaming job does not, because the Storm topology was last touched eight months ago and nobody remembered to update it. The merged view starts showing inconsistent values for the new field depending on whether the data point falls in the batch part or the speed part of the window.

Operational overhead doubles. Two pipelines means two clusters, two deployment paths, two on-call rotations, two sets of dashboards, two sets of alerts. For a small team this is a real fraction of total engineering capacity.
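To make the first pain concrete, here is a deliberately tiny, hypothetical example of a fix that landed in only one codebase. The function names and the event shape are invented for illustration: the batch job was patched to deduplicate retried deliveries by `event_id`, and the speed-layer job was not.

```python
def batch_count(events):
    """Batch job after the fix: each event_id is counted once."""
    return len({e["event_id"] for e in events})


def streaming_count(events):
    """Speed-layer job, never patched: every delivery is counted, retries included."""
    count = 0
    for _ in events:
        count += 1
    return count


# Event 2 was delivered twice because of a producer retry (hypothetical data).
events = [{"event_id": 1}, {"event_id": 2}, {"event_id": 2}, {"event_id": 3}]

batch_result = batch_count(events)       # deduplicated
speed_result = streaming_count(events)   # not deduplicated
```

Both functions claim to compute "the event count", yet they disagree on the same input. In a merged view the disagreement only shows up for the part of the window served by the speed layer, which is what makes it hard to diagnose.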

The cost is paid every day. The benefit (low latency plus exact accuracy) is real, but the cost was high enough that people started asking whether it was strictly necessary.

Kappa: one pipeline

Jay Kreps, then at LinkedIn and one of the original authors of Kafka, proposed Kappa architecture in a 2014 post titled “Questioning the Lambda Architecture”. The argument was simple: if your stream processor is good enough to handle the historical case via replay, you do not need the batch pipeline at all.

The Kappa design is one pipeline.

flowchart LR
    LOG[(Event log<br/>Kafka, durable, replayable)]
    STREAM[Stream processor<br/>Flink or Spark Structured Streaming]
    VIEW[(Materialised view)]
    Q[Query]
    LOG --> STREAM --> VIEW --> Q

The stream processor reads from a durable, replayable log (Kafka, Kinesis, Pulsar). For real-time work it reads the tail of the log and updates the materialised view as new events arrive. To rebuild the view from scratch (because the logic changed, or the view got corrupted, or you discovered a bug), you start a new instance of the stream processor reading from the start of the log, let it write to a new view while the old instance keeps serving, and switch traffic to the new view once it has caught up with the tail. The old instance and its view can then be retired. The same code processes both cases. There is one codebase, one set of guarantees, one schema, one on-call rotation.
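The replay-and-cut-over procedure can be sketched with an in-memory list standing in for the log. Everything here is illustrative: `Job`, `process`, and the poll sizes are hypothetical, and a real deployment would use a stream processor and consumer offsets rather than list slicing.

```python
from collections import defaultdict


def process(view, event):
    """The one piece of business logic: count events per user."""
    view[event["user"]] += 1


class Job:
    """A stream-processor instance: an offset into the log plus the view it maintains."""

    def __init__(self, log, start_offset=0):
        self.log = log
        self.offset = start_offset
        self.view = defaultdict(int)

    def poll(self, max_events):
        """Process up to max_events starting at the current offset."""
        batch = self.log[self.offset:self.offset + max_events]
        for event in batch:
            process(self.view, event)
        self.offset += len(batch)

    def caught_up(self):
        return self.offset >= len(self.log)


# A stand-in for the durable log (hypothetical data).
log = [{"user": u} for u in "abacab"]

live = Job(log)
live.poll(100)          # the live job has consumed everything so far
live.view["a"] = 999    # simulate a corrupted view

rebuild = Job(log, start_offset=0)   # same code, fresh state, start of the log
incoming = [{"user": "c"}, {"user": "a"}]
while incoming or not rebuild.caught_up():
    if incoming:
        log.append(incoming.pop(0))  # live traffic keeps arriving
    live.poll(10)                    # the old job keeps serving meanwhile
    rebuild.poll(4)                  # replay runs faster than arrival

serving_view = dict(rebuild.view)    # cut over once the new job has caught up
```

Note that `process` is the only place the aggregation logic lives. The live job and the rebuild differ only in their starting offset, which is the whole point of the design.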

The technical preconditions for Kappa to work were not satisfied in 2011. They were satisfied by 2018 and very much so by 2026.

The log has to be durable enough. Kafka with sensible retention (lesson 42) holds events long enough to replay the whole history. For data that does not fit in Kafka’s retention, the events also live in a lakehouse table (lesson 37) that the stream processor can replay from.

The stream processor has to handle bulk replay. Flink and Spark Structured Streaming, in 2026, can chew through historical data at batch-comparable throughput when given the resources. The same job that handles a modest rate on the live tail can be scaled out to read history orders of magnitude faster when replaying, limited by the parallelism the source partitions and the cluster allow.
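How long a replay takes follows from simple arithmetic: the replay has to clear the backlog while new events keep arriving, so the effective rate is the replay throughput minus the live ingest rate. The helper below is a back-of-the-envelope sketch, and the figures in the example are made-up numbers, not measurements of any engine.

```python
def catch_up_seconds(backlog_events, replay_rate, ingest_rate):
    """Seconds until a replay reaches the tail of the log.

    backlog_events: events already in the log when the replay starts
    replay_rate:    events/second the replaying job can process
    ingest_rate:    events/second still arriving at the tail
    """
    if replay_rate <= ingest_rate:
        raise ValueError("replay never catches up: replay_rate must exceed ingest_rate")
    return backlog_events / (replay_rate - ingest_rate)


# Hypothetical example: 90 days of history at 10,000 events/s,
# replayed at 1,000,000 events/s while ingest continues.
backlog = 90 * 86_400 * 10_000
hours = catch_up_seconds(backlog, 1_000_000, 10_000) / 3600
print(f"replay catches up after about {hours:.1f} hours")
```

If the replay throughput is not comfortably above the ingest rate, the new view never converges, so it is worth running this estimate before starting a rebuild.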

The state stores have to be rebuildable. Stateful streaming jobs (lesson 43) keep their state in RocksDB or similar embedded stores backed by checkpoints. A replay from scratch builds the state from scratch as it goes. There is no separate “warm up the cache” step.

In a Kappa setup, the same Flink job that maintains a real-time aggregate is also the job that, if you point a fresh instance at the start of the log, will produce the historical aggregate from scratch. One codebase. One set of guarantees. One thing to operate.

Why Kappa won, mostly

By 2026 most new streaming-and-batch architectures default to Kappa. Three reasons.

Stream processors got powerful enough. Flink with RocksDB state, exactly-once sinks, and event-time watermarks does what Storm could not do in 2011. Spark Structured Streaming gives you the same engine and the same DataFrame API for the streaming and the batch case, so most batch logic carries over to the streaming job with few changes (a small set of operations is not supported on streaming DataFrames). The technical objection that motivated Lambda no longer holds.

Lakehouses provide the backing storage. A Kappa setup needs a durable, replayable history. For high-volume firehoses that exceed Kafka retention, that history lives in Iceberg or Delta or Hudi tables (lesson 37), which stream processors can read as a source. The “where do I keep the long history” question has a clean answer.

The dual-codebase cost was always too high. Even teams that loved Lambda’s separation of concerns ended up paying the synchronisation tax every release. Once Kappa was technically viable, the operational simplification was hard to argue against.

The case study at the end of this module (lesson 48, Uber) is a Kappa-shaped story in practice: streaming-first ingestion, lakehouse storage, batch and real-time queries served from the same physical layer. The pattern recurs across the industry’s biggest data platforms.

When Lambda still has a place

The 2026 default is Kappa. Lambda is not extinct, but the cases where it earns its complexity are narrower than they were.

The batch and stream answers are genuinely different. Approximate distinct counts using HyperLogLog in the streaming layer are extremely cheap, with a small relative error set by the size of the sketch. Exact distinct counts in the batch layer cost real money but give you the truth. If your product wants both (a fast approximate dashboard plus a slow audited report) and the streaming sketch is not just a less precise version of the batch aggregate but a different computation, Lambda lets you keep them as separate pipelines producing different things rather than the same thing twice.
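To show why the sketch and the exact count are different computations, here is a minimal HyperLogLog in plain Python. It is a teaching sketch under stated assumptions, not a production implementation: SHA-1 truncated to 64 bits as the hash and `p=12` (4,096 registers) are illustrative choices, and only the small-range correction is included.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog: fixed memory, approximate distinct count."""

    def __init__(self, p=12):
        self.p = p                    # number of bits used to pick a register
        self.m = 1 << p               # number of registers
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        idx = h >> (64 - self.p)                     # first p bits pick the register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)        # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
users = [f"user-{i}" for i in range(50_000)]
for u in users:
    hll.add(u)

approx = hll.count()          # fixed memory: 4,096 small registers
exact = len(set(users))       # memory grows with the number of distinct users
```

The sketch only ever updates a fixed array of registers, and re-adding an item it has already seen changes nothing, which is why it suits a speed layer. The exact count has to remember every distinct value, which is why it belongs in the batch layer.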

Latency and accuracy live in different products. A trading-floor dashboard wants thirty-millisecond freshness on volumes and prices. The end-of-day reconciliation wants exact, audited, slow numbers signed off by compliance. The two consumers do not share a query path, and running them as two pipelines that happen to read the same input log is sometimes cleaner than running one pipeline that has to satisfy both contracts.

Legacy systems where the batch layer predates streaming. A team has run a nightly Spark batch for ten years against the same input log and adds a streaming pipeline for the real-time cases that have appeared since. Replacing the batch pipeline is a multi-quarter migration with real risk. The pragmatic choice is to leave it in place and run streaming alongside. That is Lambda by accumulation rather than by design, and it is fine, as long as someone keeps the two consistent.

Teams large enough to staff both pipelines. At sufficient scale, dedicated streaming and batch teams with different products and different latency contracts can operate two codebases without the synchronisation pain dominating, because the two codebases are not actually computing the same thing. They are computing different things that happen to share a name.

Outside these cases, Lambda’s overhead is rarely justified.

The 2026 advice

Default to Kappa. Use one stream processor for both the live tail and the historical replay. Back it with Kafka for short-term replay and a lakehouse table for long-term replay. Express your aggregations once. Run them on one codebase, deployed once, owned by one team.

If you find yourself reaching for Lambda, articulate why. Is the streaming answer genuinely a different computation from the batch answer, not just a faster, less precise version? Are the two consumers different enough that one codebase serving both is a worse architectural fit than two codebases serving each? Do you have the team to operate two pipelines, or are you adding a second pipeline to a team that is already stretched on the first?

If the answer to those questions is yes, Lambda is fine. If the answer is “we always did it that way” or “the streaming pipeline does not feel trustworthy yet”, invest in the streaming pipeline until it does, rather than building a second pipeline whose maintenance cost will outlive the trust deficit.

The Module 5 framing of “the same patterns at every scale” applies here too. A small team running a single Flink job against a single Kafka topic and a five-table lakehouse is doing Kappa. A petabyte-scale platform running hundreds of Flink jobs against thousands of topics and tens of thousands of lakehouse tables is doing the same thing, just bigger. The architectural shape is identical. Lesson 48 walks through one such shape at Uber’s scale and shows the same primitives doing the same job.

Citations and further reading

  • Nathan Marz and James Warren, “Big Data: Principles and Best Practices of Scalable Real-Time Data Systems” (Manning, 2015). The book that formalised Lambda architecture, with the original motivation and the layer-by-layer design.
  • Jay Kreps, “Questioning the Lambda Architecture”, O’Reilly Radar, 2014, https://www.oreilly.com/radar/questioning-the-lambda-architecture/ (retrieved 2026-05-01). The post that introduced Kappa as the alternative and made the case for one pipeline.
  • Martin Kleppmann, “Designing Data-Intensive Applications” (O’Reilly, 2017), chapters 11 and 12. The standard reference for stream processing and the relationship between batch and streaming.
  • Apache Flink documentation, https://flink.apache.org/ (retrieved 2026-05-01). The streaming engine that makes Kappa practical at scale, with state management and exactly-once semantics that match the architectural needs.