PySpark, from the ground up Lesson 60 / 60

A 30-minute health check on a Spark cluster you've never seen

The capstone checklist: hand over your laptop, you have until 5pm to figure out what's broken.

You are handed the keys to a Spark cluster you’ve never seen before. “Tell us what’s broken by 5pm.”

This lesson is the 30-minute checklist I run. Every step, in order, with what you’re looking for and what to do if you find it. It’s the closing lesson of the course because it pulls together everything we’ve covered — partitioning, shuffles, joins, memory, AQE, streaming, the SQL tab, the executors tab — and folds it into one ordered sweep. You can print it. You can paste it into your toolbox. The day a new client hands you a Spark cluster, you pull this out and you’re productive in half an hour.

Thanks for sticking through 60 lessons. Let’s finish with one scripted run through the entire toolkit.

Step 1 — The cluster overview (2 minutes)

Open the Spark UI (or the workspace UI if you’re on Databricks/EMR/Dataproc). Find the running application. The top-right gives you the version. The Environment tab gives you everything else.

spark.version                              # 3.5.1, 3.4.2, etc.
spark.sparkContext.master                  # yarn, k8s://..., local[*], spark://...
spark.sparkContext.defaultParallelism      # total cores
sc = spark.sparkContext
print(sc.statusTracker().getExecutorInfos())

Looking for:

  • Spark version — 3.2+ ideally, for AQE on by default. 3.0 / 3.1 means manual config audit.
  • Cluster manager — YARN, Kubernetes, Standalone, or vendor-managed. Different operational tools.
  • Executor count and size — how many, how much memory each, how many cores each. Compare to advertised cluster capacity.
  • Driver size — undersized drivers crash large collect() results.

Step 2 — Executors tab (3 minutes)

This is your sp_Blitz for Spark. Click Executors.

Executor ID | Address       | Status   | Cores | Memory Used | Task Time | GC Time | Failed | Active Tasks
driver      | 10.0.1.4      | Active   | 0     | 0 B / 4 GB  | 0 ms      | 0 ms    | 0      | 0
0           | 10.0.2.10     | Active   | 4     | 8.2 GB / 12 GB | 2.1 h | 18 min  | 0      | 4
1           | 10.0.2.11     | DEAD     | 4     | -           | 1.4 h     | 22 min  | 17     | 0
2           | 10.0.2.12     | Active   | 4     | 11.8 GB / 12 GB | 2.4 h | 1.1 h | 12     | 4

Looking for:

  • Dead executors. If you see them, look at logs (stderr link) — OOM, lost node, or driver killed them on idle timeout. Recurring deaths = your job is unstable.
  • GC time as a fraction of task time. Executor 2 above has GC at ~45% of task time. That’s catastrophic; lesson 57 (memory tuning) is the fix.
  • Failed task counts. Non-zero across many executors = a real problem (probably skew or OOM); concentrated on one executor = a sick node.
  • Memory usage near the limit. Executors at 11.8 / 12 GB are one allocation away from spill or OOM.

Note who looks bad and move on.

Step 3 — Running jobs and stuck queries (2 minutes)

Click Jobs. Sort by Duration. Anything running > 1 hour deserves a question.

# In a notebook:
[(j.jobId, j.name, j.status, j.numTasks, j.numActiveTasks)
 for j in spark.sparkContext.statusTracker().getActiveJobIds()]

Looking for:

  • Long-running jobs that should be quick (a daily aggregate that ran 6 hours instead of 30 minutes — kill candidate).
  • Jobs stuck with one active task while the rest are done — classic skew tail (lesson 28).
  • Driver-side stuck states — lots of completed stages but the job won’t finish; usually a collect() or toPandas() that’s pulling too much.

To kill a job: Spark UI has a (kill) link next to running jobs if you’re admin. Or spark.sparkContext.cancelJobGroup(...) if your code set a job group.

Step 4 — Cumulative job stats and the slowest 10 (3 minutes)

Click Jobs again, scroll down to the Completed Jobs list. Sort by Duration descending. Look at the top 10.

What do they have in common?

  • All write to the same table? → that table’s storage layout might be the bottleneck (small files, missing compaction, bad partitioning).
  • All join the same dimension? → that dim might need broadcast (lesson 27) or a better key.
  • All read from the same source? → check that source’s stats and partitioning.
  • All run at the same time of day? → resource contention with another job.

Pattern recognition is faster than tuning each one.

Step 5 — Disk usage on workers (3 minutes)

Spark’s local dirs (spark.local.dir, default /tmp) hold shuffle files, spill files, broadcast files, and cached blocks. They fill up.

# On each worker (via SSH or kubectl exec):
df -h /tmp
du -sh /tmp/spark-* 2>/dev/null | sort -h | tail

On managed platforms, check the worker disk metric in the platform UI. On Databricks, this is the cluster’s storage panel. On EMR, the Ganglia or CloudWatch disk metrics.

Looking for:

  • /tmp more than 80% full → shuffle and spill will fail. Add disk or restart the cluster.
  • Old spark-*-shuffle directories from past failed jobs that didn’t get cleaned. They get cleaned on graceful shutdown but linger after crashes. Safe to delete after confirming no application is running.
  • Cache files that didn’t get cleaned (blockmgr-* directories) — same story.

Step 6 — Top expensive queries in the SQL tab (3 minutes)

Click SQL / DataFrame. Sort by Duration. The top 10 queries are where your cluster’s time is going.

For each of the worst:

  • Click in. Look at the plan.
  • Find the worst operator (highest reported time).
  • Apply the lesson 58 checklist: skew? Bad join strategy? Massive shuffle? Filter not pushing down?

Common offenders on a fresh cluster:

  • A BroadcastNestedLoopJoin somewhere — missing join condition.
  • A SortMergeJoin between a 50 GB table and a 5 MB lookup table because someone disabled autoBroadcastJoinThreshold years ago and forgot.
  • A groupBy with 50 partitions of 8 GB each because AQE is off.
  • Python UDFs (BatchEvalPython) doing work that could be a SQL function (lesson 40).

These are the queries to file as follow-ups, not necessarily to fix today.

Step 7 — Recent failures and error class (2 minutes)

For Databricks clusters: the Event Log and Failed Jobs tab. For YARN: yarn application -list -appStates FAILED and the application logs. For Kubernetes: kubectl get pods and the executor pods that exited non-zero.

Group recent failures by error class:

  • Driver OOMjava.lang.OutOfMemoryError: Java heap space on the driver. Usually a collect(), toPandas(), or huge broadcast. Lesson 57.
  • Executor OOM — same exception on an executor. Skew (lesson 28), bad caching (lesson 23), or memory undersized.
  • Lost executorExecutorLostFailure followed by stage retry. Usually OOM-killed by the OS / container manager. Bump memoryOverhead.
  • Shuffle fetch failedFetchFailedException. An executor died while another was reading from it. Symptom of either of the above.
  • Task failed N times — your spark.task.maxFailures (default 4) was hit. Real bug or persistent skew.

The breakdown tells you the dominant class, which tells you which fix to prioritize.

Step 8 — Streaming queries and checkpoints (3 minutes)

If anyone runs Structured Streaming on this cluster:

for q in spark.streams.active:
    last = q.lastProgress
    print(q.name, q.status, last.get("inputRowsPerSecond"), last.get("processedRowsPerSecond"),
          last.get("batchDuration"), last.get("triggerExecution"))

Looking for:

  • Queries where input rate > processed rate sustainedly → falling behind, unbounded backlog.
  • Queries with batchDuration > trigger interval → can’t keep up; bigger cluster or smaller workload.
  • Queries where numInputRows is zero for hours → upstream is dead, query is idle.

Then check checkpoint storage:

# S3 or DBFS or whatever your checkpoint root is
aws s3 ls s3://my-checkpoints/ --recursive --summarize | tail -3

Looking for:

  • Checkpoint directories of dead queries that no one cleaned up — wasting storage.
  • Single checkpoint trees in the GBs because retention policy isn’t pruning state. Stateful queries (lesson 53) without watermarks accumulate forever.

Step 9 — Scheduled jobs and recent failures (2 minutes)

Cluster managers all have scheduled-job views.

  • Databricks: Workflows tab → Jobs → sort by last run status.
  • EMR: Step status; Airflow DAG runs if you orchestrate that way.
  • Dataproc: Workflow templates and recent job runs.
  • Plain Airflow / Dagster / etc.: failed task instances in the orchestrator.

Looking for:

  • Failed runs in the last 7 days. Patterns: same job failing daily, same time of day, same error?
  • Jobs that haven’t run in months. Schedule disabled? Should they be deleted?
  • Runs that succeeded but took 4x longer than usual → silent regressions, file as follow-up.

Step 10 — Cache audit (2 minutes)

Click the Storage tab.

RDD Name                | Storage Level    | Cached Partitions | Memory   | Disk
df_users (id 12)        | MEMORY_AND_DISK  | 200 / 200         | 18 GB    | 0
df_huge_facts (id 14)   | MEMORY_ONLY      | 80 / 1200         | 12 GB    | 0
df_unused_2024 (id 5)   | MEMORY_AND_DISK  | 200 / 200         | 6 GB     | 0

Looking for:

  • Cached datasets larger than fits in memory (the second row above — only 80 of 1200 partitions cached). The cache is doing nothing; either bump memory or stop caching.
  • Cached datasets that haven’t been touched in hours — leaked from a notebook that’s still attached. Unpersist them.
  • Multiple very similar cached DataFrames — someone cached the same data three times under different variable names.

spark.sparkContext._jsc.getPersistentRDDs() lists everything currently cached if you want to script the audit.

Step 11 — Delta / Iceberg table maintenance (2 minutes)

If the cluster reads/writes Delta or Iceberg tables, check that maintenance is running:

DESCRIBE HISTORY my.table LIMIT 10;     -- last operations
DESCRIBE DETAIL my.table;                -- numFiles, sizeInBytes

Looking for:

  • numFiles > 10,000 on tables under a few TB → small file problem, needs OPTIMIZE.
  • No OPTIMIZE in history for months → schedule weekly compaction.
  • No VACUUM either → old versioned files accumulating in object storage; costs money.
  • Iceberg equivalents: expire_snapshots and rewrite_data_files, also on a schedule.

Without these, your tables degrade quietly.

Step 12 — Memory configuration sanity (2 minutes)

sc.getConf().getAll()

Or just Environment tab in the UI, scroll to Spark Properties.

Looking for:

  • spark.executor.memory and spark.executor.memoryOverhead — overhead should be roughly 10% of executor memory or 1 GB, whichever larger. Tighter than that and YARN/K8s will kill containers (lesson 57).
  • spark.driver.memory — if anyone calls collect() on this cluster, this matters. Default 1 GB is too small for anything serious.
  • spark.executor.cores — typically 4-5. Higher means more contention for the JVM, lower means more JVMs (overhead).
  • spark.sql.shuffle.partitions — if it’s 200 (the default), and you’re processing terabytes, you have a problem. AQE helps but doesn’t fix all of it.

If containers are being killed (check the executor list in step 2 plus YARN/K8s events), memoryOverhead is the first thing to bump.

Step 13 — Configuration drift (2 minutes)

Same Environment tab. Filter for spark.sql.* and spark.shuffle.*. Scan for non-defaults.

Looking for:

  • spark.sql.adaptive.enabled = false — turn it back on (lesson 59) unless someone has a documented reason.
  • spark.sql.autoBroadcastJoinThreshold = -1 — broadcasts disabled. Almost never a good idea on a real workload.
  • spark.sql.shuffle.partitions = 2000 — set high once for a one-off job, never reset. Now affects every job on the cluster.
  • Custom serializers, codecs, evict policies — every non-default needs a comment in your team’s docs explaining why. If no one knows why, suspect.

Document each non-default with its justification. Anything you can’t justify, revert.

Step 14 — Write the report (3 minutes)

Take your notes and turn them into a short report. Mirror the SQL Server template — keep it short, prioritized, actionable.

Customer X Spark Cluster Health Check — 2026-06-18

  • Overall: Yellow. Two P1 findings.
  • Critical:
    • Streaming query clickstream-aggs falling behind 12 hours; input rate 4x processed rate. Either bigger cluster or downscale workload. Decide today.
    • Cluster running Spark 3.0 → AQE off by default, multiple jobs would benefit immediately. Plan an upgrade or backport AQE configs.
  • High:
    • Executor OOMs daily on daily-rollup job. Signal: skew on country_code. Salt or AQE.
    • 4 of 12 executors showing GC time > 30% of task time. Bump executor memory or shrink cached datasets.
    • No OPTIMIZE ever run on events_delta; 47,000 small files on a 800 GB table. Schedule weekly OPTIMIZE.
  • Medium:
    • BroadcastNestedLoopJoin in top expensive query (weekly_finance). Missing join condition; query is 40 minutes per run.
    • spark.sql.shuffle.partitions = 4000 set for a one-off job in March, never reverted; hurts small jobs.
    • Cached df_unused_2024 (6 GB, MEMORY_AND_DISK) attached to an idle notebook. Unpersist.
  • Low:
    • 12 dead executor logs accumulating in /tmp on workers. Add a cleanup cron.
    • Job etl-archive hasn’t run successfully since January. Disable or fix.
  • Follow-up:
    • Audit all Spark configs against defaults; document the keepers.
    • Migrate cluster to Spark 3.5 LTS.
    • Set up Delta OPTIMIZE + VACUUM schedule across all production tables.
    • Add cluster-level Prometheus / DataDog dashboards for executor GC, shuffle volume, task failure rate.

That’s your deliverable. Short, actionable, prioritized. Email it to whoever handed you the cluster keys.

Step 15 — What to do after this course

You now know the equivalent of a couple of years of on-the-job Spark experience, squeezed into 60 lessons. What to read and follow next:

  1. The Spark documentationPerformance Tuning and the AQE section. The source of truth for every config you’ll touch.
  2. Jacek Laskowski’s gitbookThe Internals of Apache Spark and The Internals of Spark SQL. Free, deeply technical, the place to go when you need to know what an operator actually does.
  3. Holden Karau’s books and talksHigh Performance Spark and Learning Spark. Production-focused, written by someone who’s been debugging real Spark clusters for years.
  4. The Databricks blog — biased toward their platform but packed with deep dives on Catalyst, AQE, Delta, Photon. Read with the bias filter on.
  5. Apache Spark JIRA — when you hit a weird bug, search it. There’s a 50/50 chance the fix is in the next minor version.
  6. Practice on a real workload — a sandbox with TPC-DS or your own data. Run the health check, fix findings, repeat monthly. That’s how the muscle memory forms.

Congratulations

You made it through 60 lessons of PySpark. You now know:

  • The fundamentals — what Spark is, the architecture, the RDD/DataFrame/Dataset hierarchy (Modules 1-2).
  • The DataFrame API — read, write, transform, aggregate, window, pivot, UDF (Modules 3-4).
  • Execution mechanics — lazy evaluation, narrow vs wide, the DAG, caching, shuffles, joins, broadcast, skew, salting (Modules 5-6).
  • Partitioning, bucketing, file layout, on-disk vs in-memory (Module 7).
  • The optimizer — Catalyst, Tungsten, the SQL tab, file formats and JDBC and cloud storage (Module 8).
  • Streaming — sources, watermarks, stateful operations, output modes, sinks (Module 9).
  • Production — debugging slow jobs, AQE, the cluster health check (Module 10).

If this course has done its job, you’re ready to open a Spark cluster you’ve never seen, diagnose its problems in 30 minutes, fix the top issues, and not panic when the on-call pager goes off at 3am. That’s the bar. That’s what “knowing PySpark” means at this level.

You’ll still learn new things every week — that’s the job. But you have the framework now. The next blog post or release-note item slots into a structure that makes sense, instead of being a wall of unfamiliar terms.

Go take care of your pipelines. They’re depending on you.

— Thanks for reading. If you spotted errors, have suggestions, or want to argue about whether repartition() or coalesce() is the right call, say hi.

Search