PySpark, from the ground up Lesson 58 / 60

Debugging slow Spark jobs: the 30-minute checklist

The systematic loop for figuring out what's wrong with a slow job — read the UI, find the slow stage, look at task skew, GC, shuffle volume, in that order.

A job that used to finish in 12 minutes is now taking 90. Your manager wants to know why before standup. You have one Spark UI tab open and a vague feeling.

This is the moment where everyone wants to start tuning. They go straight for spark.sql.shuffle.partitions, they bump executor memory, they sprinkle repartition() calls, they cache things at random. None of it works because none of it was diagnosed. They burn an hour and the job is still slow.

The discipline is: never tune blind, always read the UI first. Below is the 30-minute loop I run on any slow job. Six steps, in order, every time. On most jobs you’ll find the answer at step 2 or 3. The remaining steps are there for the harder cases.

Step 1 — Open the UI and find the slow stage (3 minutes)

Slow jobs are usually slow because of one stage, not all of them.

Open the Spark UI (history server if the job already finished, live UI if it’s still running). Click Jobs. The slow run is the one with the long horizontal bar.

Click into it. You’re now on the detail page for that job, which lists its stages. Sort by Duration descending. The top row is your culprit.

Stage Id | Description                | Duration | Tasks  | Input    | Shuffle Read
   18    | aggregate at App.scala:142 | 47 min   | 200    | 1.2 GB   | 89 GB
   12    | scan parquet ...           | 4.1 min  | 1200   | 380 GB   | 0
   ...

Stage 18 is where the 47 minutes went. Everything else is noise. From now on, every step is about stage 18.

A note on reading durations: the stage bar is wall-clock from start to finish. Inside it you have many tasks. The wall-clock time can be long because tasks are slow, or because the stage has many waves of tasks, or because one task is much slower than the others. You can’t tell yet. Move on.
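
The “many waves” case is plain arithmetic: a stage can only run as many tasks at once as the cluster has task slots. Here is a minimal sketch, assuming one core per task (spark.task.cpus left at its default of 1). The executor and core counts are made up for illustration; plug in your own from the Executors tab.

```python
import math


def task_waves(num_tasks: int, num_executors: int, cores_per_executor: int) -> int:
    """Sequential rounds needed to run every task of a stage.

    Assumes one core per task (spark.task.cpus left at its default of 1).
    """
    slots = num_executors * cores_per_executor
    if slots <= 0:
        raise ValueError("the cluster needs at least one task slot")
    return math.ceil(num_tasks / slots)


# Hypothetical cluster: 10 executors x 5 cores = 50 task slots.
print(task_waves(200, 10, 5))   # stage 18 has 200 tasks  -> 4
print(task_waves(1200, 10, 5))  # stage 12 has 1200 tasks -> 24
```

Four waves of short tasks cannot add up to 47 minutes, so on this hypothetical cluster the wave count would not explain stage 18. Something inside the tasks is slow.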

Step 2 — Task duration distribution (5 minutes)

Click into the slow stage. Scroll to Summary Metrics for Completed Tasks. This table is gold. It gives you the min / 25th / median / 75th / max for each per-task metric: duration, GC time, input size, shuffle read, shuffle write, peak memory.

The column to look at first is Duration.

Healthy distribution:

Min: 4s  | 25th: 6s | Median: 7s | 75th: 8s | Max: 12s

Tasks finish in a tight band. The stage is just doing work. Move to step 3.

Skewed distribution:

Min: 0.4s | 25th: 1.1s | Median: 1.4s | 75th: 2.3s | Max: 41 min

That max is nearly the entire stage. One task is doing almost all of the work. You have skew. This is the single most common cause of “my job got slow” on a real workload.
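
If you want a number instead of a gut feeling, divide max by median. This is a rough sketch using the two example rows above. The threshold of 10 is an illustrative assumption, not a Spark default.

```python
def skew_ratio(median_s: float, max_s: float) -> float:
    """How many times slower the slowest task is than the median task."""
    if median_s <= 0:
        raise ValueError("median duration must be positive")
    return max_s / median_s


def looks_skewed(median_s: float, max_s: float, threshold: float = 10.0) -> bool:
    """True when the slowest task is at least `threshold` times the median.

    The default threshold is an arbitrary illustration, not a Spark setting.
    """
    return skew_ratio(median_s, max_s) >= threshold


healthy = skew_ratio(median_s=7, max_s=12)        # healthy row above
skewed = skew_ratio(median_s=1.4, max_s=41 * 60)  # skewed row above, 41 min in seconds

print(f"{healthy:.1f}")  # -> 1.7
print(f"{skewed:.0f}")   # -> 1757
```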

The fix path is lessons 28 and 29: figure out which key is hot (groupBy and count, look for the giant outliers), then either filter it out, broadcast the small side, or salt the join key. AQE handles many cases automatically (lesson 59), so check first whether AQE is enabled (it is on by default from Spark 3.2). If it isn’t, turning it on is a one-line config change: spark.sql.adaptive.enabled=true.
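
The “groupBy and count” step can be sketched as below. The DataFrame name events and the column customer_id in the usage comment are hypothetical; substitute the key your slow stage joins or groups on. key_share is plain arithmetic for judging how dominant the top key is.

```python
def top_keys(df, key_col: str, n: int = 10):
    """Return the n most frequent values of key_col as (value, row_count) pairs.

    `df` is expected to be a pyspark.sql.DataFrame. The import is local so this
    module can be loaded on a machine without Spark installed.
    """
    from pyspark.sql import functions as F

    rows = (
        df.groupBy(key_col)
        .count()
        .orderBy(F.desc("count"))
        .limit(n)
        .collect()
    )
    return [(row[key_col], row["count"]) for row in rows]


def key_share(key_rows: int, total_rows: int) -> float:
    """Fraction of all rows that belong to a single key."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return key_rows / total_rows


# Usage (hypothetical names):
#   total = events.count()
#   for value, count in top_keys(events, "customer_id"):
#       print(value, count, f"{key_share(count, total):.1%}")
```

Look at the first row or two. A single key that owns a large share of the rows is your hot key, and it is worth checking whether that key is null.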

Step 3 — GC time fraction (3 minutes)

In the same Summary Metrics table, look at GC Time. Healthy is GC time below 10% of task duration. If you see GC time at 30% of duration on the median task, your executors are thrashing the garbage collector and that’s why the work is crawling.

You can also see this on the Executors tab: a Task Time (GC Time) column shows total seconds spent and total seconds in GC, per executor. If GC is more than ~20% of task time, the JVM is starving.

Possible causes:

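  • A cached DataFrame too large for executor memory. Check the Storage tab: if you cached 80 GB on executors with 12 GB usable each, evictions are constant. Switch to a serialized storage level (MEMORY_AND_DISK_SER in the Scala API; in PySpark, StorageLevel.MEMORY_AND_DISK already stores data serialized), or don’t cache, or cache a smaller projection.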
  • A shuffle that pulls too many rows into a single task (revisit step 2 — could be skew driving GC pressure).
  • Memory configured too tightly. Lesson 57 covers spark.executor.memory, memoryOverhead, and the on-heap vs off-heap split.

Tuning memory blindly without checking GC time is the most common waste of a Friday afternoon. Check the number first.
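
The thresholds in this step reduce to one division. Here is a small sketch, taking the two numbers from the Task Time (GC Time) column. The “watch” label for the band between 10% and 20% is my own naming, and the example values are invented.

```python
def gc_fraction(gc_time_s: float, task_time_s: float) -> float:
    """Share of task time spent in garbage collection."""
    if task_time_s <= 0:
        raise ValueError("task time must be positive")
    return gc_time_s / task_time_s


def gc_verdict(gc_time_s: float, task_time_s: float) -> str:
    """Classify GC overhead using the 10% and 20% rules of thumb from this step."""
    frac = gc_fraction(gc_time_s, task_time_s)
    if frac < 0.10:
        return "healthy"
    if frac <= 0.20:
        return "watch"  # in-between band; the label is an assumption
    return "memory pressure"


# Invented example: an executor with 3600 s of task time, 1080 s of it in GC.
print(f"{gc_fraction(1080, 3600):.0%}", gc_verdict(1080, 3600))  # -> 30% memory pressure
```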

Step 4 — Shuffle read and shuffle write (5 minutes)

Look at the Shuffle Read and Shuffle Write columns at the stage level. Compare them to your output size.

If the stage outputs 200 MB and shuffles 89 GB, you’re moving roughly 450x more data than you need to. The fix is almost always to filter and project earlier, before the shuffle, not after. Spark’s optimizer does some of this for you, but it can’t push filters through UDFs, through non-deterministic expressions, or through user code that imperatively branches on data.
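
The ratio itself is a one-liner. This sketch uses the figures from the paragraph above, with binary units (1 GB = 1024 MB) as an assumption about how the UI rounds.

```python
GB = 1024 ** 3
MB = 1024 ** 2


def shuffle_amplification(shuffle_bytes: int, output_bytes: int) -> float:
    """How many bytes are shuffled for every byte the stage finally outputs."""
    if output_bytes <= 0:
        raise ValueError("output size must be positive")
    return shuffle_bytes / output_bytes


ratio = shuffle_amplification(89 * GB, 200 * MB)
print(f"{ratio:.0f}x")  # -> 456x
```

The exact figure depends on the unit convention. Anything in the hundreds says the same thing: most of what you shuffled was thrown away later.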

Quick wins:

  • Move .filter() calls before joins.
  • Replace select("*") with the columns you actually need.
  • Drop nested struct fields you’re not using (select("id", "amount", "ts") instead of carrying a 200-field record through five stages).
  • If two stages both shuffle by the same key, see if you can .repartition(key) once and reuse via cache, or use bucketed tables (lesson 36).

Also check how shuffle volume scales with input. A 4 GB input that produces 100 GB of shuffle implies row explosion — usually a many-to-many join. Lessons 25-30 cover the join mechanics in detail.
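
To see why a many-to-many join explodes, count rows per key on each side and multiply. The sketch below does that on plain Python lists with invented keys. On real data you would get the per-key counts from a groupBy on each DataFrame.

```python
from collections import Counter
from typing import Hashable, Iterable


def joined_row_count(left_keys: Iterable[Hashable], right_keys: Iterable[Hashable]) -> int:
    """Rows an inner equi-join produces, given the join-key value of every input row."""
    left = Counter(left_keys)
    right = Counter(right_keys)
    # Each key contributes (rows on the left) x (rows on the right).
    return sum(n * right[key] for key, n in left.items())


# Invented example: key "a" appears 1000 times on both sides, key "b" once.
left_side = ["a"] * 1000 + ["b"]
right_side = ["a"] * 1000 + ["b"]
print(joined_row_count(left_side, right_side))  # -> 1000001
```

Two inputs of 1,001 rows each produce just over a million joined rows, all because of one duplicated key.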

Step 5 — Read the SQL plan (5 minutes)

Click the SQL / DataFrame tab in the UI. Find the query for your slow job (timestamps and durations help match them up). Click into it. You get the physical plan, with per-operator metrics.

What to look for, in order of “this is bad”:

  • BroadcastNestedLoopJoin. Almost always wrong. It means Spark couldn’t find an equality join condition and is comparing every row of one side against every row of the other. Check your join: did you forget the on= clause? Are you joining on a range or inequality like df1.ts >= df2.start_ts, on an OR of conditions, or on a UDF that takes columns from both sides? This operator is a five-alarm fire on any non-tiny input.
  • SortMergeJoin where you expected a broadcast. The small side wasn’t small enough (default broadcast threshold is 10 MB, raise with spark.sql.autoBroadcastJoinThreshold). Or its size estimate is wrong because it came from a DataFrame Spark didn’t have stats for. Force it with broadcast(df) (lesson 27).
  • Filters that never reach the scan. For Parquet/ORC sources, predicate pushdown should list them under PushedFilters on the scan node (a Filter operator usually stays above the scan as well, which is normal). If PushedFilters is empty, your filter is on something the source can’t evaluate (a UDF result, a computed expression, or, on older Spark versions, a nested struct field).
  • Repeated subexpressions — the same filter or projection appearing in multiple branches of the plan. A cache() of the shared DataFrame can pay for itself fast.

The plan tells you what Spark is actually doing, not what your Python code suggests. They diverge.
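
When the plan is long, a text search beats scrolling. Below is a rough sketch that scans plan text (copied from the SQL / DataFrame tab or from the output of df.explain()) for a few markers. The sample plan is hand-written and abbreviated for illustration, not real Spark output, and the hint strings are mine.

```python
from typing import List

# Markers as they appear in Spark's physical plan text, each with a short hint.
RED_FLAGS = {
    "BroadcastNestedLoopJoin": "join without an equality condition",
    "CartesianProduct": "cross join, check for a missing join condition",
    "PushedFilters: []": "scan received no predicates, worth a look if you filter this table",
}


def plan_red_flags(plan_text: str) -> List[str]:
    """Return one 'marker: hint' line for every red-flag marker found in the plan."""
    return [f"{marker}: {hint}" for marker, hint in RED_FLAGS.items() if marker in plan_text]


# Hand-written, abbreviated plan for illustration only (not real Spark output).
sample_plan = """
BroadcastNestedLoopJoin BuildRight, Inner, (ts#12 >= start_ts#40)
:- FileScan parquet [id#10,ts#12] PushedFilters: [], ReadSchema: struct<id:bigint,ts:timestamp>
+- BroadcastExchange IdentityBroadcastMode
"""

for line in plan_red_flags(sample_plan):
    print(line)
```

A hit is a prompt to read that part of the plan, not a verdict. An empty PushedFilters list is fine on a table you never filter.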

Step 6 — Input size and spill (4 minutes)

Two more numbers, both on the Stages tab.

Input size — should match what you expect. If you think you’re scanning a 4 GB partitioned table and the stage shows Input: 380 GB, your partition pruning failed. Common cause: a filter on a column that’s not the partition column, or a filter expression that defeats pruning (partition_dt = string_col where string_col is a non-partitioned varchar Spark can’t prove constant).

Spill (Memory) and Spill (Disk) — these columns appear when tasks ran out of in-memory space and had to spill sort/aggregation buffers. Non-zero spill means a task tried to hold more than fit. Two paths: bump executor memory (lesson 57) or partition smaller (more partitions = less per task = less spill).

If you see large disk spill, your job is bottlenecked on local SSD throughput. On cloud, that’s the executor’s ephemeral disk. Some shapes (heavy sort + low memory) will always spill — your job is what it is and the question is just whether the spill is acceptable.
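
For the “partition smaller” path, you can estimate a partition count from the stage’s shuffle read. This sketch assumes data is spread evenly across tasks (so fix skew first) and uses a 128 MB per-task target, which is a common rule of thumb rather than a Spark default.

```python
import math

GB = 1024 ** 3
MB = 1024 ** 2


def partitions_for(shuffle_bytes: int, target_bytes_per_task: int = 128 * MB) -> int:
    """Partition count that keeps each task near the target size.

    The 128 MB default is a rule-of-thumb assumption, not a Spark setting.
    """
    return max(1, math.ceil(shuffle_bytes / target_bytes_per_task))


# Stage 18 from step 1: 89 GB of shuffle read across 200 tasks.
per_task_now = 89 * GB / 200
suggested = partitions_for(89 * GB)

print(f"{per_task_now / MB:.0f} MB per task now")  # -> 456 MB per task now
print(suggested)                                   # -> 712
```

You would then try that value for spark.sql.shuffle.partitions and check whether the spill columns drop on the next run.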

Putting it together: the printable checklist

Save this somewhere. Tape it to your monitor. The next time someone says “the job is slow”:

  1. Open Spark UI. Find the slowest stage by duration.
  2. Look at task duration distribution. Max vs median. Big gap = skew → lessons 28-29.
  3. Look at GC time. >20% of task time = memory pressure → lesson 57 or fix caching.
  4. Look at shuffle read/write. Hundreds of GB shuffled for a small output = filter earlier → lessons 25-30.
  5. Read the SQL plan. BroadcastNestedLoopJoin, missing broadcast, filter above scan → fix the query.
  6. Look at input size and spill. Mismatched scan size = pruning failed; non-zero spill = memory or partitions wrong.

The discipline is the order, which runs roughly from most to least common cause: skew, then memory pressure, then shuffle volume, then plan problems, then scan problems. Expect to resolve most slow jobs at step 2 or 3.

What this checklist won’t catch

  • Cluster issues outside the job. A noisy neighbour on shared infra, a degraded node, a network partition. The UI shows you a Spark-shaped view of the world; if the underlying VM is sick, executors will look slow without an obvious cause. Check executor logs and platform metrics.
  • Code that’s just slow. A Python UDF doing a regex over a 200 GB column will be slow no matter how you tune Spark. The SQL plan will show the time going into a Python UDF operator (BatchEvalPython, or ArrowEvalPython for pandas UDFs), but the fix is “stop using a Python UDF” (lesson 40), not “tune executor memory.”
  • Data-quality regressions. Yesterday the job ran on 4 GB. Today it’s running on 40 GB because upstream started double-writing. The job isn’t slow, the input is bigger. Always check input size first when “the job got slow overnight.”

The 30-minute promise

The whole loop above takes 30 minutes the first few times you do it. Then 15. Then 5, because you’ll learn to glance at the right tabs and skip past the healthy parts. By the time you’ve debugged twenty slow jobs you’ll have an intuition for where to look first based on the symptoms.

But always go back to the discipline. The day you start tuning before reading the UI is the day you’ll spend three hours fixing the wrong thing.

Next lesson: Adaptive Query Execution. The Spark feature that handles many of the above issues automatically — when it can. We’ll see when it can, when it can’t, and how to read the plans it rewrites at runtime.
