This is the closing lesson of Module 2. Everything you’ve written so far ran on master("local[*]"). That’s been deliberate — local mode is genuinely good for learning Spark, and most of what you’ve typed will run unchanged on a 100-node cluster. But “most” is doing a lot of work in that sentence. There’s a category of bugs that only exist when Spark is actually distributed, and discovering them in production is a bad time.
Today’s goal: understand exactly what local means, what it lies to you about, and what a sane dev → staging → prod workflow looks like in 2026.
What local mode actually is
When you write:
spark = SparkSession.builder.master("local[*]").getOrCreate()
…Spark starts a single JVM process on your laptop. That process is the driver, the executor, and the cluster manager all at once. There’s no network. No serialization between machines. No separate executor JVMs. Just one process pretending to be a cluster.
The [*] in local[*] controls parallelism — it tells Spark to use all available CPU cores as worker threads. Other forms:
.master("local") # 1 thread. Useful for tests where you want determinism.
.master("local[4]") # 4 threads.
.master("local[*]") # All cores.
.master("local[*, 3]") # All cores, retry failed tasks 3 times. Test fault tolerance.
Each “thread” runs one Spark task at a time. With local[8], you can run 8 partitions in parallel, on 8 cores, in one JVM. It looks like a cluster from the API’s perspective. The DataFrame code you write is byte-for-byte identical to cluster code. That’s the genuinely useful part.
Quick demo — same script, run twice with different parallelism:
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id, count
def run(master):
spark = (SparkSession.builder
.appName("LocalDemo")
.master(master)
.getOrCreate())
spark.sparkContext.setLogLevel("WARN")
df = spark.range(0, 1_000_000).repartition(16)
print(f"\nMaster={master}, default parallelism={spark.sparkContext.defaultParallelism}")
df.groupBy(spark_partition_id().alias("partition")).agg(count("*").alias("rows")).show()
spark.stop()
run("local[1]")
run("local[*]")
Both runs produce the same final aggregation. The first uses 1 worker thread, the second uses every core you have. Throughput differs by ~Nx where N is your core count. Output is identical.
Where local mode is genuinely fine
Be unapologetic about local for these:
- Dev iteration. Writing a transformation, running it, fixing the type error, running it again. The feedback loop on a real cluster is 30 seconds of submission overhead per attempt. Local is instant. You will iterate ten times faster.
- Unit tests. A pytest suite that builds a tiny DataFrame in code, runs your function, and asserts on the output is exactly what local mode is for. CI runners have 2–4 cores and that’s plenty.
- Small ETL. If your input is under, say, 5GB and your laptop has 16GB of RAM, local mode will out-perform a 4-node cluster (no network, no shuffle over the wire, no executor spin-up). I have a personal-finance pipeline that runs nightly on
local[*]and probably always will. - Learning. You’re not going to learn window functions any faster on EMR than on your laptop.
A laptop with local[*] and 16GB of memory is a real, serviceable Spark environment. Don’t apologize for using it.
Where local mode lies to you
Three categories of bug that local mode hides. All three will eventually find you in production.
1. Skew is invisible
In real Spark, partitions are distributed across executors. If one partition has 90% of the data — say, your groupBy("country") produced one US partition with 50GB and forty other partitions with 100MB each — that one US partition runs on one executor while the other 40 executors sit idle. The job’s wall-clock time becomes “however long that one task takes.” This is data skew, and it’s the single most common cause of “my Spark job is slow” tickets.
Local mode doesn’t show you this. With one process, all partitions go through the same JVM anyway, on the same threads, sharing the same memory. A skewed partition still takes longer than the others, but it can spill, swap, and otherwise work around the problem in-process. On a real cluster, that one task gets exactly one executor’s worth of memory and dies.
Concretely:
from pyspark.sql.functions import lit, col
import random
# Build a heavily-skewed dataset: one key with 1M rows, others with 1.
big = spark.range(0, 1_000_000).withColumn("key", lit("HOT"))
small = spark.range(0, 99).withColumn("key", col("id").cast("string"))
skewed = big.unionByName(small)
# Group by key — one task does almost all the work.
skewed.groupBy("key").count().show()
On local[*] this completes happily. On a small cluster with default settings, the HOT key task can OOM the executor. Lesson 41 is dedicated to skew — salting, broadcast joins, AQE skew handling — and you can’t really practice it locally.
2. Shuffles look free
A shuffle in Spark is when partitions get redistributed across the network — the engine behind every groupBy, join, distinct, and orderBy. Shuffles serialize data, write it to local disk, and pull it across the wire from other executors. They’re expensive. They’re often the bottleneck. The whole “tune Spark” art form is mostly about reducing shuffle volume.
Locally, a shuffle is an in-process memory transfer. No network. Often no disk. It’s nearly free. So locally a 5-shuffle pipeline runs in 8 seconds; on a cluster, the same pipeline takes 12 minutes. The shuffles dominate, and you didn’t see it coming.
You also won’t catch:
- Network timeouts and lost executors during shuffle reads.
- Disk-full failures from shuffle spill.
- Skew in shuffle output (one shuffle reducer with 50GB of data).
Pipelines that look great on a laptop have an embarrassing tendency to fall apart on a cluster, and the cause is almost always “the shuffles you stopped thinking about because they were free locally.”
3. Out-of-memory at executor scale doesn’t reproduce
In a real cluster, each executor has bounded memory — say, 16GB per executor. If a partition spills 20GB into a single task, that task dies. The executor might die. The job fails.
Locally, your “executor” is your whole JVM, which can use most of your laptop’s RAM (often 8–32GB). You can process partitions that wouldn’t fit in any reasonable cluster executor and never realise. Then you ship to a cluster and the same code OOMs every other run.
The diagnostic version of this: a single-record from_json() parse that allocates 200MB of nested structs, run on 100M records — locally it’s 20GB total memory churn, JVM handles it; on a cluster it’s 200MB times whatever shoves into one executor’s memory pressure model and goodbye.
There are subtler cousins too. Broadcast joins that “work” because everything’s in one process actually need broadcasting on a cluster. UDFs that work locally but serialize a 50MB closure to every executor. PyArrow type coercions that round-trip slightly differently when crossing process boundaries. None of this is ever a problem locally.
The dev workflow that catches them
The shape that survives contact with reality:
1. Develop on local[*]. Tight iteration loop. Write transformations, run unit tests, scaffold the pipeline.
2. Run on a small staging cluster, on a representative slice of data. Two or three executors, real network between them, a sampled or anonymized version of production data. This is where skew shows up, where shuffles get expensive, where memory-per-executor becomes a real constraint.
A 10% sample on a real cluster is enormously more useful than a 100% run on local. The point isn’t volume, it’s distributedness. If a small cluster doesn’t choke, a big one usually won’t either.
3. Promote to production. Same code, different --master and bigger data. By this point you’ve debugged 95% of the cluster-only bugs.
Skipping step 2 is the most common workflow mistake in the field. People go straight from local to prod, hit a skew bug in production, and spend an evening rolling back. Pay the staging tax. It’s cheap.
The cluster-manager landscape, 2026 edition
When you submit to a real cluster, the cluster manager is the thing that allocates executors for your job. Spark supports four:
YARN is the original Hadoop cluster manager. If your company runs an on-prem Cloudera/Hortonworks/legacy-Hadoop installation, you’re on YARN. It’s mature, battle-tested, and uniformly hated. It’s also slowly losing market share — basically nobody chooses YARN for a greenfield deployment in 2026 — but the install base is enormous and it isn’t going anywhere fast.
spark-submit \
--master yarn \
--deploy-mode cluster \
--num-executors 50 \
--executor-cores 4 \
--executor-memory 16g \
--driver-memory 8g \
my_pipeline.py
Kubernetes is the modern default. If you’re standing up Spark from scratch in 2026, it’s probably on Kubernetes. The Spark Operator (originally from Google, now an Apache project) gives you SparkApplication CRDs, native pod scheduling, and clean integration with the rest of your cloud-native stack. Almost every cloud Spark vendor — Databricks, EMR Serverless, Dataproc Serverless, OpenShift — is now Kubernetes underneath, even if they hide it.
spark-submit \
--master k8s://https://my-cluster.example.com:6443 \
--deploy-mode cluster \
--conf spark.kubernetes.container.image=my-registry/spark:3.5.1 \
--conf spark.executor.instances=50 \
--conf spark.kubernetes.namespace=data \
local:///opt/jobs/my_pipeline.py
Standalone mode is Spark’s own bundled cluster manager. It’s a single binary, no Hadoop or Kubernetes required. Genuinely useful when you have a small, dedicated team running Spark on a fixed set of VMs and you don’t want to learn Kubernetes for it. The trade-off: no resource isolation between jobs (it’s first-come-first-served), no workload sharing with non-Spark jobs, and limited HA story. For a 5-machine analytics box, fine. For a multi-team platform, no.
# On the master node
./sbin/start-master.sh
# On each worker
./sbin/start-worker.sh spark://master-host:7077
# Submit
spark-submit \
--master spark://master-host:7077 \
--total-executor-cores 32 \
--executor-memory 8g \
my_pipeline.py
Mesos existed and you should know it existed. It was deprecated in Spark 3.2 (2021) and removed entirely in Spark 4.0 (2024). If anyone tries to convince you to deploy Mesos in 2026, gently steer them away.
The managed wrappers — Databricks, AWS EMR, Google Dataproc, Azure Synapse — are all built on top of one of these. Databricks runs on a managed Kubernetes-flavored substrate. EMR can run on YARN (classic EMR) or Kubernetes (EMR on EKS) or serverless. Dataproc has the same split. The submit script you write barely changes; what changes is who manages the underlying nodes, who patches them, who pays for them.
If you’re early in your Spark career, the practical answer for 2026 is: learn local[*] for development, learn enough Kubernetes that you can read a SparkApplication YAML, and understand that managed services like Databricks are great for not having to think about the rest. Pick up YARN if and when a job demands it.
A minimal spark-submit you can actually use
Here’s a portable starter. The same script works on local, standalone, YARN, and Kubernetes — only --master and a few --conf lines change:
spark-submit \
--master "local[*]" \
--name "my_pipeline" \
--conf "spark.sql.shuffle.partitions=200" \
--conf "spark.sql.adaptive.enabled=true" \
--py-files dependencies.zip \
--files config.yaml \
pipelines/my_pipeline.py \
--input s3a://my-bucket/raw/2026-05-01 \
--output s3a://my-bucket/curated/2026-05-01
The pieces:
--master— the cluster manager.local[*],yarn,k8s://..., orspark://....--name— what shows up in the Spark UI. Name your jobs. Future-you debugging at 3am will thank present-you.--conf— Spark configuration overrides. The two above (shuffle.partitionsand AQE) are the most common.--py-files— extra Python modules zipped up and shipped to executors. Anything beyond a single-file script needs this.--files— non-code files (configs, lookup data) shipped to the working directory of every executor.- The script path and any args you want to forward to your
argparse.
For a Python entry point, your script’s if __name__ == "__main__": block parses args, builds a SparkSession, runs the pipeline, calls spark.stop(), and exits. Same file you’d run locally with python my_pipeline.py, with the same argparse arguments. The portability comes from the script not caring how it was launched.
A practice exercise that pays off: take any of the multi-step PySpark snippets from earlier in this module and turn them into a single submittable pipelines/orders_etl.py. Read paths from argparse. spark = SparkSession.builder.appName("orders_etl").getOrCreate() — no master in the code, so it picks up whatever --master you submit with. Run it with spark-submit --master "local[*]" pipelines/orders_etl.py --input ./data/orders.csv --output ./data/orders.parquet. Once that works, the same script will run on a cluster with one flag changed.
What this module covered
That’s Module 2 done. You can install Spark, build a SparkSession, read CSV / JSON / Parquet, inspect with show / count / collect (carefully), write with appropriate save modes and partitioning, and submit a job to a real cluster manager. Everything from here is about doing more interesting things with the DataFrames — selecting, filtering, joining, aggregating, windowing, and eventually the production-debugging stuff in Module 10.
Next module: actually working with DataFrames. The Column API, select versus selectExpr, withColumn and its quirks, and the small habits that separate Spark code that scales from Spark code that doesn’t.