Why streaming, and what 'streaming' even means in Spark

We’ve spent the last forty-eight lessons on what Spark calls batch processing: you have some data, it sits in a file or a database table, you run a job, the job finishes, you get an output. The data was there before you started; the data is there after you finish. Done.

Module 9 is about the other half of the story. The data hasn’t all arrived yet. It’s still arriving. It will keep arriving tomorrow, next week, next year, until somebody turns the system off. Your job is to compute over it as it shows up — not in a one-shot pass that ends, but in a process that runs continuously and emits results as they become available.

This is streaming. And before we touch any code, I want to slow down and pin down what that word actually means in Spark, because it’s one of those terms where the marketing has run ahead of the engineering and the engineering has changed twice in the last decade.

Bounded vs unbounded data

The cleanest way to think about it isn’t “batch” vs “streaming” — it’s the shape of the dataset.

Bounded data is a fixed dataset with a known size. Yesterday’s orders. Last quarter’s transactions. The contents of a Parquet file on S3. The current state of a Postgres table at the moment you read it. There’s a MIN, a MAX, a row count. You can compute over the whole thing. You can finish.

Unbounded data is a feed that doesn’t end. Kafka events flowing in from a webserver. Application log lines being written every second. IoT sensor readings, click streams, CDC events from an OLTP database, GPS pings from a fleet. There’s no MAX because the data is still being produced. The rows you see today are a small slice of what will eventually exist. You can compute over what’s arrived so far, but you can’t ever finish.

Both kinds of data are real, both are common, and Spark can process both. The distinction matters because the assumptions you can make are very different. With bounded data, you can sort the whole thing. You can compute exact aggregations. You can iterate twice. With unbounded data, none of those work — at least not the same way. You can’t sort an infinite list. You can’t compute “the average of all rows” because more rows are coming. You have to redefine what “result” means.

Streaming, in Spark, is the API for computing over unbounded data.

Batch and streaming are a continuum

Here’s the thing the docs don’t always make obvious: there isn’t a clean line between batch and streaming. They’re two ends of a spectrum.

A nightly ETL that processes the last 24 hours of data is “batch” — but only because somebody decided once a day was often enough. If you ran the same job every hour, then every minute, then every 10 seconds, you’d be doing micro-batch streaming without changing the logic. At the limit, every-row-as-it-arrives, you’d be doing record-at-a-time streaming. The difference is latency, not the operation.

This is the central insight behind Spark’s modern streaming API. Instead of having one set of operators for batch and a totally different set for streaming, Spark unified them: the same select, filter, groupBy, join you’ve been writing for forty-five lessons works on streaming DataFrames too. You learn the API once, you reuse it.

The mental model that makes this click is the infinite table.

The infinite table

Imagine a table that’s never finished being written. Every time a new event arrives at the source, a new row gets appended to the bottom. The table grows monotonically. You never see a row disappear; you only see new ones added.

That’s how Spark Structured Streaming models a stream: as a logical table that grows over time. The Kafka topic with three new messages this second? Those are three new rows in the table. The directory of log files where two new files appeared? Those new files contribute new rows. The table is conceptually infinite, but at any given instant it has a specific number of rows — the ones that have arrived so far.

When you write a streaming query, you’re describing a transformation on this infinite table. Spark’s job is to keep that transformation up to date as new rows arrive. If your query is SELECT count(*) FROM events WHERE country = 'IT', Spark maintains a running count and emits an updated value every time the input table grows. If your query is SELECT user_id, COUNT(*) FROM clicks GROUP BY user_id, Spark maintains a running per-user count and emits the deltas.

The query you write looks like batch SQL. The execution underneath is incremental: instead of recomputing from scratch each trigger, Spark figures out what’s new and updates the result.

The micro-batch model

How does Spark actually run this? By default, it cheats slightly. It doesn’t process events one at a time — it processes them in micro-batches.

A micro-batch is exactly what it sounds like: a small batch. Every trigger interval (default: as fast as possible, typically every few hundred milliseconds), Spark does the following:

Asks each source what new data has appeared since the last micro-batch (new files in the directory, new offsets in Kafka, new rows in the rate source).
Reads that new data into a regular DataFrame.
Runs your query against it, possibly combined with state from previous micro-batches (for things like running aggregations).
Writes the output to the sink.
Records the consumed offsets in the checkpoint so the next micro-batch knows where to pick up.

The result: end-to-end latencies of roughly 100 ms to a few seconds, depending on how heavy the work is and how big the micro-batches are. That’s not “real” sub-millisecond streaming — but it’s plenty fast for almost every analytical use case (fraud monitoring, dashboard updates, ETL into a warehouse, anomaly detection at human time scales).

Micro-batch processing inherits all the strengths of batch Spark: the Catalyst optimizer plans each micro-batch, Tungsten generates code for it, fault tolerance falls out of the existing batch retry mechanism, and you can use every operator you already know. The price is latency: each micro-batch has overhead, so you can’t realistically push intervals below ~100 ms.

Continuous mode: the experimental other path

For workloads that genuinely need single-row, sub-millisecond latency — high-frequency trading-adjacent, real-time alerting on individual events, low-latency feature serving — Spark also offers continuous processing mode, triggered with .trigger(continuous="100 ms").

In continuous mode, Spark doesn’t run micro-batches. Long-running tasks sit on the executors and process records as they arrive, one at a time. End-to-end latency drops to the millisecond range.

The catch: continuous mode supports a much smaller set of operations. As of Spark 4.x it still supports only map-like operations (select, filter, cast, withColumn) and a handful of sources/sinks. No aggregations, no joins, no windowing. It’s been marked experimental for years and the supported surface has barely grown. For 99% of jobs, micro-batch is what you want; continuous mode is a tool for a specific niche where micro-batch latency genuinely isn’t enough.

I’m going to spend the rest of this module on the micro-batch path, because that’s what production Spark streaming actually is.

A short history: DStreams and why they’re gone

If you read older books, blog posts, or Stack Overflow answers, you’ll see references to DStreams — short for “discretized streams.” DStreams were Spark’s first streaming API, introduced in Spark 1.x.

DStreams were RDD-based. A DStream was conceptually a sequence of RDDs, one per micro-batch interval, and you transformed it with RDD-style operations (map, flatMap, reduceByKey, etc.). It worked, it ran in production at lots of companies, and it taught the team a lot — but it had problems:

It was separate from DataFrames and the Catalyst optimizer. You couldn’t reuse your batch DataFrame logic; you had to rewrite it in RDD ops.
It was time-based on processing time only, not event time. Computing windows over when events actually happened (rather than when Spark processed them) was painful.
The fault-tolerance story was per-RDD. Late or out-of-order events were difficult.
The API was lower-level than the DataFrame API, so two parallel codebases grew up: batch in DataFrames, streaming in DStreams. Painful for a team to maintain.

Structured Streaming, introduced in Spark 2.x, replaces DStreams with the table-shaped API I described above. It uses DataFrames, runs through Catalyst and Tungsten, and supports event-time processing properly (lesson 52 — watermarks).

DStreams has been deprecated since Spark 3.4 and the pyspark.streaming module — including the old StreamingContext and DStream classes — has been removed in Spark 4. If you find a tutorial that imports from pyspark.streaming import StreamingContext, close the tab. That code won’t run on a current Spark and there’s no reason to write new code in that style.

For the rest of this module, “streaming” means Structured Streaming. The DStreams chapter is closed.

A taste: a five-line streaming job

To anchor the abstraction, here’s a complete streaming job. It watches a directory, reads any new CSV file that appears, and prints a per-second count of rows to the console:

from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, window

spark = SparkSession.builder.appName("HelloStream").getOrCreate()

events = (spark.readStream
            .schema("user_id STRING, action STRING")
            .csv("/tmp/incoming/"))

counts = (events
            .withColumn("ts", current_timestamp())
            .groupBy(window("ts", "1 second"))
            .count())

query = (counts.writeStream
            .outputMode("complete")
            .format("console")
            .start())

query.awaitTermination()

Five things to notice:

spark.readStream instead of spark.read. That’s the whole API delta on the input side. Everything else — schema, format, options — looks identical to a batch read.
The result of readStream is a DataFrame. Specifically, events.isStreaming returns True, but it’s still a regular DataFrame as far as your transformations are concerned. The withColumn, groupBy, window calls are exactly the batch API.
writeStream instead of write. Same delta on the output side.
outputMode("complete") says “emit the entire current result table on every micro-batch” — appropriate for an aggregation. We’ll go through the three output modes (append, complete, update) in lesson 50.
awaitTermination() keeps the process alive while the streaming query runs. Without it, your script exits and Spark stops the query.

Drop a CSV into /tmp/incoming/, watch the console show counts, drop another, watch the counts update. You’re streaming.

Where this module is going

The next three lessons cover the foundations:

Lesson 50 — readStream, writeStream, the trigger types, and the all-important checkpoint. The actual mechanics of running a streaming job.
Lesson 51 — the Kafka source, which is what 90% of production streaming jobs read from. Offsets, deserialization, exactly-once semantics, the gotchas.
Lesson 52 — event time and watermarks. The piece that makes “what time did this event actually happen?” work correctly even when events arrive out of order.

After that we’ll cover output modes (53), stateful operations (54), and stream-stream joins (55) before wrapping up the module with the operations side: production deployment, monitoring, the failure modes (56-58).

For now, the takeaway: streaming isn’t a magic separate world. It’s the same DataFrame API you already know, applied to data that hasn’t finished arriving, run incrementally on micro-batches, with a checkpoint to track progress. Once that picture is in your head, the rest is mechanics.

References: Apache Spark Structured Streaming Programming Guide (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html) and the Spark 4.0 migration notes on the removal of pyspark.streaming (https://spark.apache.org/docs/latest/streaming/). Retrieved 2026-05-01.