.cache() is not free — when to use it, when it's a trap

Spark's cache and persist sound like magic performance buttons. They're not. Here's when caching actually helps, when it makes things worse, and how to tell the difference.

Someone on your team told you to “just cache it” and your job got faster. Great. Then you cached everything and your job started crashing with OOM errors. Also great, in a learning-experience kind of way.

.cache() and .persist() are powerful tools, but they’re not performance buttons you can mash indiscriminately. Understanding when to use them (and when not to) is the difference between a 10× speedup and a 10× headache.

What caching actually does

Spark is lazy. When you write a chain of transformations — filter, join, groupBy, select — nothing happens until you call an action (.count(), .show(), .write.save()). At that point, Spark builds an execution plan and runs the whole chain.

If you call two actions on the same DataFrame, Spark runs the entire chain twice. Every filter, every join, from scratch. It doesn’t remember intermediate results.

.cache() tells Spark: “After computing this DataFrame, keep the result in memory so the next action can reuse it instead of recomputing from scratch.”

from pyspark.sql.functions import col, sum as spark_sum

expensive_df = (
    raw_data
    .join(lookup, "key")
    .filter(col("status") == "active")
    .groupBy("region").agg(spark_sum("revenue").alias("total"))
)

expensive_df.cache()

# First action: computes and stores in memory
expensive_df.count()

# Second action: reads from cache, skips the whole chain
expensive_df.show()

When caching helps

Rule: cache when you reuse a DataFrame in multiple downstream actions.

Common scenarios:

  1. Branching pipelines. One DataFrame feeds two or more aggregations, writes, or reports.
  2. Iterative algorithms. ML training loops, graph algorithms, anything that reads the same data repeatedly.
  3. Interactive exploration. You’re in a notebook, poking at the same dataset with different queries.

If a DataFrame is used once and thrown away, caching it wastes memory for zero benefit.

When caching hurts

1. You don’t have enough memory.

Cached data lives in executor memory. If your DataFrame is 50 GB and your cluster has 40 GB of executor memory, something gets evicted — either your cache (making it pointless) or working memory for other operations (making everything slower).

2. You cache too early.

# Bad: caching the raw data before any filtering
raw = spark.read.parquet("s3://huge-dataset/")
raw.cache()  # 500 GB in memory? Good luck.

filtered = raw.filter(col("year") == 2025)  # only 5 GB after filter

Cache after filtering, not before. The smaller the cached DataFrame, the better.

3. You forget to unpersist.

Cached DataFrames stay in memory until the SparkSession ends or you explicitly release them:

expensive_df.unpersist()

In long-running jobs with multiple stages, forgetting to unpersist means earlier caches eat memory that later stages need. This is the most common caching mistake in production code.

4. You cache something that’s fast to recompute.

Reading a small Parquet file is already fast. Caching it saves milliseconds and wastes megabytes of executor memory. Only cache things where the recomputation cost is genuinely painful.

.cache() vs .persist()

.cache() is shorthand for calling .persist() with the default storage level (MEMORY_AND_DISK for DataFrames; MEMORY_ONLY for raw RDDs). You can customize:

from pyspark import StorageLevel

# Memory only (fastest, but evicted if memory is tight)
df.persist(StorageLevel.MEMORY_ONLY)

# Memory + disk (spills to disk if memory runs out)
df.persist(StorageLevel.MEMORY_AND_DISK)

# Disk only (slower but saves memory for other operations)
df.persist(StorageLevel.DISK_ONLY)

# Serialized in-memory storage (MEMORY_ONLY_SER) exists only in the
# Scala/Java API; PySpark always stores cached data serialized, so in
# Python MEMORY_ONLY already behaves that way

In practice, the default (MEMORY_AND_DISK) is fine 90% of the time. Use DISK_ONLY when memory is tight and recomputation is expensive. Use MEMORY_ONLY when speed matters and you’re confident it fits.

How to check if caching is working

Spark UI → Storage tab. Shows every cached DataFrame, how much is in memory vs disk, and what fraction was successfully cached. If you see a DataFrame that’s 20% cached and 80% evicted, the cache is doing almost nothing — you’d be better off not caching and freeing that memory.

Also check stage timings. The first action on a cached DataFrame runs slower than normal, because Spark is materializing the cache as it goes; later actions have to be fast enough to make up for that. Caching has an upfront cost, so confirm the total is still a net win.

The decision checklist

Before adding .cache(), ask:

  1. Is this DataFrame used more than once? No → don’t cache.
  2. Is it expensive to recompute? No → don’t cache.
  3. Does it fit in memory? No → use DISK_ONLY or don’t cache.
  4. Am I caching after filters/selects? No → move the cache later.
  5. Will I remember to .unpersist() when I’m done? No → you’ll create a memory leak.

If you answered “yes” to all five: cache it. Otherwise, let Spark recompute. Recomputation is Spark’s default for a reason — it’s usually efficient enough, and it doesn’t waste memory.
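
If checklist item 5 is the one you keep failing, a small context manager can pair every cache with a guaranteed unpersist. This is not part of Spark's API, just a convenience sketch:

```python
from contextlib import contextmanager

@contextmanager
def cached(df):
    """Cache df on entry; unpersist on exit, even if the block raises."""
    df.cache()
    try:
        yield df
    finally:
        df.unpersist()

# Usage:
# with cached(expensive_df) as df:
#     df.count()
#     df.show()
```

The with-block makes the lifetime of the cache visible in the code, which is exactly what a stray .cache() call hides.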