Partitioning: the thing that quietly kills your Spark job

How data gets split across executors, why the default is almost always wrong, and the repartition/coalesce dance that every Spark job eventually needs.

Your Spark job takes 3 hours. You double the cluster size. Now it takes 2 hours and 50 minutes. What went wrong? Almost certainly, partitioning.

What partitions actually are

A Spark DataFrame isn’t one big table. It’s sliced into partitions — chunks of data distributed across executors. Each partition is processed independently in parallel. If you have 200 partitions and 20 executors, each executor handles ~10 partitions, one at a time.

The entire promise of Spark — “process data in parallel” — boils down to how well your data is spread across partitions. If one partition has 10 million rows and the others have 1,000 each, congratulations: one executor does all the work while 19 sit idle. This is called data skew, and it’s the single most common performance problem in Spark.
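To see why skew hurts so much, a back-of-the-envelope calculation using the made-up partition counts above: a stage's wall-clock time is bounded by its largest partition, not the average.

```python
# Hypothetical stage: 20 partitions, one with 10M rows, the rest 1,000 each
partition_rows = [10_000_000] + [1_000] * 19
total = sum(partition_rows)

# With 20 executors running one partition each, the stage finishes when the
# largest partition does -- the other 19 executors go idle almost immediately
actual_time = max(partition_rows)              # ~10M row-units of work
balanced_time = total / len(partition_rows)    # ~500K row-units if evenly spread

print(f"skewed stage is ~{actual_time / balanced_time:.0f}x slower than balanced")
```

Same data, same cluster, roughly 20× slower, purely because of how the rows are distributed.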

The default is almost always wrong

Spark’s default partition count comes from one of these:

  • spark.sql.shuffle.partitions — defaults to 200 for any operation that shuffles data (joins, group by, etc.)
  • The number of files or blocks in your source data (for reads)

200 partitions was a reasonable default in 2014 when clusters had 50 cores. Today, with thousands of cores and terabytes of data, 200 is almost always wrong — usually too few, leading to partitions that are too large and causing memory pressure, or occasionally too many, leading to tiny partitions with high scheduling overhead.
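Two common first moves: raise the shuffle partition count explicitly, and on Spark 3.x, let Adaptive Query Execution merge small post-shuffle partitions back together at runtime. A minimal sketch, assuming an existing SparkSession named `spark` (the target of 2000 is illustrative, not a recommendation):

```python
# Raise the partition count for shuffling operations (joins, groupBy, etc.)
spark.conf.set("spark.sql.shuffle.partitions", 2000)

# Spark 3.x: let AQE coalesce small post-shuffle partitions automatically
spark.conf.set("spark.sql.adaptive.enabled", True)
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", True)
```

With AQE coalescing enabled, the shuffle partition setting acts more like an upper bound than a fixed count, which takes some of the sting out of guessing wrong.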

How to check your partitions

df.rdd.getNumPartitions()

More useful — check partition sizes:

from pyspark.sql.functions import spark_partition_id, count

df.groupBy(spark_partition_id().alias("partition_id")) \
  .agg(count("*").alias("row_count")) \
  .orderBy("row_count", ascending=False) \
  .show(10)

If the max partition is 100× larger than the min, you have a skew problem. If every partition has 50 rows, you have too many partitions.
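The "max is 100× the min" check is easy to automate once you have the per-partition counts in hand. A pure-Python sketch (the row counts here are made up; in practice you'd collect them from the groupBy above):

```python
# Per-partition row counts, e.g. collected from the spark_partition_id groupBy
row_counts = [1_200_000, 9_800, 10_100, 9_950, 10_200]

# Guard against empty partitions in the denominator
skew_ratio = max(row_counts) / max(min(row_counts), 1)

if skew_ratio > 100:
    print(f"skew ratio {skew_ratio:.0f}x -- consider rebalancing or salting")
```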

repartition() vs coalesce()

These are your two tools for fixing partition counts:

repartition(n) — full shuffle. Every row is reassigned to one of n new partitions. Expensive (moves data across the network) but gives you evenly sized partitions.

df = df.repartition(500)  # 500 roughly equal partitions

coalesce(n) — merges partitions without a full shuffle. Can only reduce the count, never increase it. Cheap but can create uneven partitions.

df = df.coalesce(50)  # merge down from 200 to 50

The rule of thumb

  • Going up in partition count? → repartition()
  • Going down (e.g., before writing output files)? → coalesce()
  • Skewed data after a join? → repartition(n) (round-robin) to rebalance, or use salting — note that repartitioning on the skewed column itself won't help, since all rows for the hot key still hash to one partition

Partition sizing guidelines

There’s no universal right answer, but these rules work in most real jobs:

  • Target 128–256 MB per partition. This balances parallelism with overhead.
  • Target 2–4 partitions per core. With 100 cores, aim for 200–400 partitions.
  • After a wide transformation (join, groupBy), check partitions. The default 200 might now be wrong.
  • Before writing to disk, coalesce. Writing 200 partitions to Parquet gives you 200 tiny files. Your downstream readers will hate you. Coalesce to a number that produces files of 128–512 MB.
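The first two guidelines can be turned into a quick calculation. A hypothetical helper (not a Spark API) that applies the 128–256 MB and 2–4-per-core targets above:

```python
def target_partitions(total_bytes, cores, target_partition_mb=256, per_core=3):
    """Rough partition count: keep every core 2-4x oversubscribed,
    but keep each partition near the size target."""
    by_size = total_bytes // (target_partition_mb * 1024 * 1024)
    by_cores = cores * per_core
    return max(by_size, by_cores, 1)

# 1 TB of data on a 100-core cluster: the size target dominates
print(target_partitions(1 * 1024**4, cores=100))   # 4096

# 10 GB on the same cluster: the per-core target dominates
print(target_partitions(10 * 1024**3, cores=100))  # 300
```

Taking the max of the two targets means small datasets still get enough partitions to use the cluster, while large datasets don't blow past the per-partition size budget.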

Partition by column (for writes)

df.write.partitionBy("year", "month").parquet("s3://bucket/data/")

This creates a directory structure like year=2025/month=03/part-00001.parquet. Downstream queries that filter by year and month can skip irrelevant folders entirely — partition pruning. The improvement is often 10–100× on large datasets.

Choose partition columns that:

  • Are used in WHERE clauses frequently
  • Have low to medium cardinality (year, country, status — not user_id)
  • Don’t create millions of tiny directories
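The cardinality warning is worth doing the arithmetic on before you write: the number of directories is the product of the distinct values of every partition column. With illustrative cardinalities:

```python
from math import prod

# Distinct values per candidate partition column (made-up figures)
cardinality = {"year": 5, "month": 12, "day": 31, "country": 200}

# Every combination becomes a directory (each holding at least one file)
print(prod(cardinality.values()))                        # 372000 -- far too many

# Dropping day and country gives a sane layout
print(cardinality["year"] * cardinality["month"])        # 60
```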

The skew escape hatch: salting

When one join key dominates (say, 80% of your orders come from one customer), that key’s data lands in one partition and everything stalls. The fix: add a random “salt” column, join on salt + key, and merge the results.

from pyspark.sql.functions import lit, rand, floor, explode, array

salt_buckets = 10

# Add a random salt (0..salt_buckets-1) to each row of the skewed (large) table
large = large.withColumn("salt", floor(rand() * salt_buckets).cast("int"))

# Replicate each row of the small table once per salt value, so every
# (join_key, salt) pair in the large table finds a match
small = small.withColumn("salt", explode(array([lit(i) for i in range(salt_buckets)])))

# Join on original key + salt, then drop the helper column
result = large.join(small, ["join_key", "salt"]).drop("salt")

Ugly? Yes. Effective? Dramatically. If you’re regularly hitting skew, this pattern is worth memorizing.
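The same idea in plain Python, to see why it works. Without salt, every row carrying the hot key hashes to one partition; with salt, the (key, salt) pair fans out across up to salt_buckets partitions. (Hash partitioning is simulated here with Python's built-in hash — an illustration of the principle, not Spark's actual partitioner.)

```python
num_partitions = 200
salt_buckets = 10
hot_key = "big_customer"

# Without salt: every row with the hot key lands in the same partition
unsalted = {hash(hot_key) % num_partitions}

# With salt: rows spread across up to `salt_buckets` distinct partitions
salted = {hash((hot_key, salt)) % num_partitions for salt in range(salt_buckets)}

print(len(unsalted))  # 1 partition does all the work
print(len(salted))    # up to 10 partitions share it
```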

The one-sentence version

Most Spark performance problems are partition problems. Before adding more hardware, check partition counts and sizes — the fix is usually a single line of code.