Last lesson we got data into Spark. Today we look at it. The three commands you’ll type in the first ten seconds of every notebook are .show(), .count(), and .collect(). They look harmless. Two of them are. The third one is responsible for more out-of-memory crashes than any other PySpark call I’ve seen.
This is also the lesson where we sneak in the most important mental model in Spark: transformations are lazy, actions are eager. You can chain a hundred .filter() and .select() calls and Spark processes no data at all. The moment you call an action (.show(), .count(), .collect(), or a write such as .write.parquet(...)), Spark wakes up, plans the whole pipeline, and executes it. Get that distinction wrong and you’ll spend an afternoon wondering why your “fast” code is slow.
Setup
Same starter SparkSession. Reusing the orders/customers data we generated in lesson 9:
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("ShowCountCollect")
         .master("local[*]")
         .config("spark.sql.shuffle.partitions", "8")
         .getOrCreate())
spark.sparkContext.setLogLevel("WARN")
orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("./data/orders.csv"))
If you don’t have ./data/orders.csv lying around from last lesson, scroll back to lesson 9 and re-run the data-generation snippet. We’ll use the same six-row file all the way through.
.show() — the one you’ll use a thousand times
.show() prints rows to your console as a formatted ASCII table. That’s it. No return value, no DataFrame back, no Pandas object. It’s a side effect.
orders.show()
+-------+----------+-----+-------+----------+
|OrderId|CustomerId|Total|Country| OrderDate|
+-------+----------+-----+-------+----------+
|   1001|         1| 59.0|     NL|2026-03-05|
|   1002|         1| 29.0|     NL|2026-03-18|
|   1003|         2|149.0|     IT|2026-02-15|
|   1004|         2| 89.5|     IT|2026-03-22|
|   1005|         3|199.0|     DE|2026-03-10|
|   1006|         4|42.42|     RO|2026-03-28|
+-------+----------+-----+-------+----------+
The full signature is show(n=20, truncate=True, vertical=False). Three knobs, all useful.
n is how many rows to print. Default 20. Pass any positive integer:
orders.show(3) # First 3 rows
orders.show(1000) # First 1000 — but Spark won't fetch more rows than exist
truncate controls long string columns. Default True, which clips to 20 characters and shows ... for the rest. Pass False to see everything, or pass an integer for a custom width:
orders.show(truncate=False) # Full strings
orders.show(truncate=50) # Clip at 50 chars
If you’ve ever called .show() on a DataFrame full of 5KB JSON blobs and watched your terminal cry, that’s why truncate=True is the default.
vertical flips the layout from columns-as-columns to columns-as-rows, which is wonderful for wide tables:
orders.show(2, vertical=True)
-RECORD 0--------------
OrderId | 1001
CustomerId | 1
Total | 59.0
Country | NL
OrderDate | 2026-03-05
-RECORD 1--------------
OrderId | 1002
...
I use vertical mode any time a DataFrame has more than 8–10 columns. It’s the difference between reading and squinting.
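What .show() actually does under the hood: Spark plans your pipeline, runs just enough of it to produce n rows, then collects those rows to the driver and prints them. It’s an action, but a small one. Asking for 20 rows means Spark might only need to read one file partition before it has enough. The shortcut disappears when the pipeline contains a sort or an aggregation, because those need all of their input before they can emit the first row.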
.count() — the one that runs your whole pipeline
.count() returns the number of rows, as a Python int:
n = orders.count()
print(n) # 6
Looks innocent. It’s not. To count rows, Spark has to execute the entire pipeline up to this point. Every read, every filter, every join. There’s no shortcut — you can’t know how many rows survive a join without doing the join.
So this:
# customers is assumed to be the customers DataFrame from lesson 9, loaded the same way as orders
big = (spark.read.parquet("./data/huge_orders.parquet")
       .filter("OrderDate >= '2026-01-01'")
       .join(customers, "CustomerId")
       .filter("Country = 'IT'"))
print(big.count())
…reads the whole Parquet dataset, applies both filters, runs the join, and counts what’s left. On a 200GB dataset, that’s a multi-minute job — even though you only wanted a number.
The mistake everyone makes (myself included, repeatedly) is sprinkling .count() calls through a notebook for “sanity”:
df = read_orders()
print("After read:", df.count())    # Job 1
df = df.filter(...)
print("After filter:", df.count())  # Job 2: re-reads everything
df = df.join(customers, "CustomerId")
print("After join:", df.count())    # Job 3: re-reads + re-filters
df.write.parquet("...")             # Job 4: re-reads + re-filters + re-joins
Each of those .count() calls runs the whole pipeline up to that point. From scratch. Spark doesn’t cache results between actions unless you tell it to. So a “quick sanity check” can quietly quadruple your job’s runtime.
There’s a fix — .cache() / .persist() — but it’s a lesson 23 topic. For now, the discipline is: don’t .count() in the middle of a pipeline unless you mean to pay for it.
.collect() — the one you should almost never call
.collect() returns every row of the DataFrame, as a Python list of Row objects, on the driver:
rows = orders.collect()
print(type(rows)) # <class 'list'>
print(len(rows)) # 6
print(rows[0]) # Row(OrderId=1001, CustomerId=1, Total=59.0, ...)
print(rows[0].Total) # 59.0
print(rows[0]["Total"]) # 59.0, both attribute and key access work
For a six-row DataFrame, this is fine. It’s a list of six tuples. Maybe a kilobyte of memory.
For a 100GB DataFrame, this is a disaster. .collect() pulls every row to a single JVM — the driver — and tries to fit them all in driver memory. Your driver has, what, 4GB? 16GB? 64GB if you’re lucky? It will OOM. Loudly. Often after running for ten minutes, so you also get to wait before getting the crash.
The rule I write on every whiteboard:
Never .collect() a DataFrame whose size you don’t already know is small.
If you came from Pandas, this is the single biggest behavioral difference. pandas.DataFrame is always in memory by definition. PySpark DataFrames are usually not in memory and may not even fit in the memory of any single machine. .collect() is the one operation that pretends Spark is Pandas, and Spark punishes you for it.
When .collect() is genuinely fine
- Tiny lookup tables (countries, currency codes, configuration maps).
- Aggregations where you’ve already reduced the row count, like df.groupBy("Country").count().collect() on 30 countries.
- Tests, where the inputs are deliberately small.
For everything else, use one of the safer alternatives.
The safe alternatives: .take(), .first(), .head()
Three actions that pull some rows to the driver without pretending it can hold everything:
first_three = orders.take(3) # list of 3 Row objects
print(first_three)
one_row = orders.first() # single Row, or None if empty
print(one_row)
head_three = orders.head(3) # list of 3 Rows (alias of .take())
print(head_three)
.take(n) is .collect() with a leash. It pulls n rows and stops. Spark is smart enough to read only as many partitions as it takes to get n rows, so on a 100GB dataset, .take(5) reads roughly one partition and returns in seconds.
.first() is .take(1)[0], except it returns None instead of crashing on an empty DataFrame. Use it when you genuinely want one row — for example, peeking at a schema-y example record.
.head(n) and .take(n) are aliases. Some people prefer .head() because it’s what Pandas uses; some prefer .take() because it’s what RDDs always called it. Pick one.
If you find yourself reaching for .collect(), ask “do I actually need every row?” The honest answer is almost always no, and .take(100) will give you more than enough to debug with.
The companions: .printSchema(), .describe(), .summary()
Three more “look at the data” methods you’ll use constantly. None of them are dangerous.
.printSchema() prints the schema. Cheap, instantaneous — Spark already knows the schema, no data is read:
orders.printSchema()
root
|-- OrderId: integer (nullable = true)
|-- CustomerId: integer (nullable = true)
|-- Total: double (nullable = true)
|-- Country: string (nullable = true)
|-- OrderDate: date (nullable = true)
I print the schema after every read. It catches type-inference mistakes immediately — the moment you see OrderDate: string you know inferSchema didn’t work and you need an explicit schema.
.describe() computes count / mean / stddev / min / max for each numeric and string column (for strings, mean and stddev come back null, and min / max are alphabetical):
orders.describe().show()
+-------+-------+----------+-------+-------+
|summary|OrderId|CustomerId|  Total|Country|
+-------+-------+----------+-------+-------+
|  count|      6|         6|      6|      6|
|   mean| 1003.5|    2.1667|94.6533|   null|
| stddev| 1.8708|    1.1690|66.6006|   null|
|    min|   1001|         1|   29.0|     DE|
|    max|   1006|         4|  199.0|     RO|
+-------+-------+----------+-------+-------+
Note: it returns a DataFrame, not a printed table. So you still need .show() to see it. (One of those small PySpark quirks that trips everyone up once.) The mean and stddev cells are trimmed to four decimals here; Spark prints the full double.
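If you want to sanity-check what the mean and stddev rows should say, the six-row file is small enough to verify by hand. The stddev that .describe() reports is the sample standard deviation (dividing by n - 1), which is what Python’s statistics.stdev computes. A plain-Python check, with the values copied from the orders table:

```python
import statistics

# Values copied from the six-row orders table.
totals = [59.0, 29.0, 149.0, 89.5, 199.0, 42.42]
customer_ids = [1, 1, 2, 2, 3, 4]

mean_total = statistics.mean(totals)
std_total = statistics.stdev(totals)          # sample stddev (n - 1)
mean_customer = statistics.mean(customer_ids)
std_customer = statistics.stdev(customer_ids)

print(round(mean_total, 4), round(std_total, 4))        # 94.6533 66.6006
print(round(mean_customer, 4), round(std_customer, 4))  # 2.1667 1.169
```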
.summary() is the more powerful sibling. By default it adds the quartiles:
orders.summary().show()
You get count, mean, stddev, min, 25%, 50%, 75%, max. You can also pass custom percentiles:
orders.summary("count", "min", "25%", "50%", "75%", "99%", "max").show()
For an exploratory pass over a new dataset, .summary() is the single most useful one-liner in PySpark. It’s the equivalent of Pandas’ describe(include='all', percentiles=[...]).
Both .describe() and .summary() are full-pass operations — they read the whole DataFrame. On a tiny CSV that’s free; on a 200GB lake, that’s a real job. Run them sparingly on production-sized data, or run them on a .sample(0.01) first.
Action vs transformation — the eager/lazy preview
We’ll do this properly in lesson 19, but here’s the headline now so the rest of the course makes sense.
A transformation describes a new DataFrame in terms of an existing one. select, filter, withColumn, join, groupBy, orderBy, union: all transformations. They return a new DataFrame (or, for groupBy, a GroupedData that becomes a DataFrame once you aggregate) without processing any data. Spark just records the recipe.
An action asks Spark for a concrete result — a printed table, a number, a list of Rows, or a write to disk. Actions trigger execution. Spark looks at the recipe, plans an optimal physical execution, and runs it.
In other words:
# These four lines do absolutely nothing yet.
filtered = orders.filter("Country = 'IT'")
projected = filtered.select("OrderId", "Total")
ordered = projected.orderBy("Total")
big_only = ordered.filter("Total > 100")
# This line runs all of the above, plus the show.
big_only.show()
Spark optimises the whole chain before executing. It might push the filters into the file scan (“don’t even read non-IT rows”), reorder operations, prune columns. The query plan you get is often radically different from what you wrote.
This is also why a .show() halfway through a notebook can feel suspiciously fast — Spark only computed enough rows to fill the display — while .count() on the same DataFrame takes minutes. They ran different amounts of work.
A tour script
Pulling the lesson together, here’s a script that exercises every action we covered, in roughly the order you’d use them on a fresh dataset:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = (SparkSession.builder
         .appName("ActionsTour")
         .master("local[*]")
         .getOrCreate())
spark.sparkContext.setLogLevel("WARN")
orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("./data/orders.csv"))
# 1. What does it look like?
orders.printSchema()
orders.show(5)
# 2. Quick stats
orders.summary("count", "min", "25%", "50%", "75%", "max").show()
# 3. Cheap row count (small dataset, fine)
print("Total rows:", orders.count())
# 4. Peek at filtered data without collecting everything
italian = orders.filter(col("Country") == "IT")
italian.show() # safe
print("IT rows:", italian.count()) # safe on small data
sample_rows = italian.take(2) # always safe
print(sample_rows)
# 5. Collect — only because we know it's tiny
country_totals = (orders.groupBy("Country")
                  .sum("Total")
                  .collect())
for row in country_totals:
    print(f"{row['Country']}: {row['sum(Total)']:.2f}")
spark.stop()
Six rows of input. Six lines of meaningful output. No OOMs. That’s the workflow.
Next lesson we go the other direction: writing data back out, save modes, partitioned writes, and the file-count problem that bites every team eventually.