Partitioning: the thing that quietly kills your Spark job

How data gets split across executors, why the default is almost always wrong, and the repartition/coalesce dance that every Spark job eventually needs.

Your Spark job takes 3 hours. You double the cluster size. Now it takes 2 hours and 50 minutes. What went wrong? Almost certainly, partitioning.

What partitions actually are

A Spark DataFrame isn’t one big table. It’s sliced into partitions — chunks of data distributed across executors. Each partition is processed independently in parallel. If you have 200 partitions and 20 executors, each executor handles ~10 partitions, one at a time.

The entire promise of Spark — “process data in parallel” — boils down to how well your data is spread across partitions. If one partition has 10 million rows and the others have 1,000 each, congratulations: one executor does all the work while 19 sit idle. This is called data skew, and it’s the single most common performance problem in Spark.
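To see why skew hurts so much, a back-of-the-envelope calculation using the made-up partition counts above: a stage's wall-clock time is bounded by its largest partition, not the average.

```python
# Hypothetical stage: 20 partitions, one with 10M rows, the rest 1,000 each
partition_rows = [10_000_000] + [1_000] * 19
total = sum(partition_rows)

# With 20 executors running one partition each, the stage finishes when the
# largest partition does -- the other 19 executors go idle almost immediately
actual_time = max(partition_rows)              # ~10M row-units of work
balanced_time = total / len(partition_rows)    # ~500K row-units if evenly spread

print(f"skewed stage is ~{actual_time / balanced_time:.0f}x slower than balanced")
```

Same data, same cluster, roughly 20× slower, purely because of how the rows are distributed.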

The default is almost always wrong

Spark’s default partition count comes from one of these:

  • spark.sql.shuffle.partitions — defaults to 200 for any operation that shuffles data (joins, group by, etc.)
  • The number of files or blocks in your source data (for reads)

200 partitions was a reasonable default in 2014 when clusters had 50 cores. Today, with thousands of cores and terabytes of data, 200 is almost always wrong — usually too few, leading to partitions that are too large and causing memory pressure, or occasionally too many, leading to tiny partitions with high scheduling overhead.
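Two common first moves: raise the shuffle partition count explicitly, and on Spark 3.x, let Adaptive Query Execution merge small post-shuffle partitions back together at runtime. A minimal sketch, assuming an existing SparkSession named `spark` (the target of 2000 is illustrative, not a recommendation):

```python
# Raise the partition count for shuffling operations (joins, groupBy, etc.)
spark.conf.set("spark.sql.shuffle.partitions", 2000)

# Spark 3.x: let AQE coalesce small post-shuffle partitions automatically
spark.conf.set("spark.sql.adaptive.enabled", True)
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", True)
```

With AQE coalescing enabled, the shuffle partition setting acts more like an upper bound than a fixed count, which takes some of the sting out of guessing wrong.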

How to check your partitions

df.rdd.getNumPartitions()

More useful — check partition sizes:

from pyspark.sql.functions import spark_partition_id, count

df.groupBy(spark_partition_id().alias("partition_id")) \
  .agg(count("*").alias("row_count")) \
  .orderBy("row_count", ascending=False) \
  .show(10)

If the max partition is 100× larger than the min, you have a skew problem. If every partition has 50 rows, you have too many partitions.
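The "max is 100× the min" check is easy to automate once you have the per-partition counts in hand. A pure-Python sketch (the row counts here are made up; in practice you'd collect them from the groupBy above):

```python
# Per-partition row counts, e.g. collected from the spark_partition_id groupBy
row_counts = [1_200_000, 9_800, 10_100, 9_950, 10_200]

# Guard against empty partitions in the denominator
skew_ratio = max(row_counts) / max(min(row_counts), 1)

if skew_ratio > 100:
    print(f"skew ratio {skew_ratio:.0f}x -- consider rebalancing or salting")
```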

repartition() vs coalesce()

These are your two tools for fixing partition counts:

repartition(n) — full shuffle. Every row is reassigned to one of n new partitions. Expensive (moves data across the network) but gives you evenly sized partitions.

df = df.repartition(500)  # 500 roughly equal partitions

coalesce(n) — merges partitions without a full shuffle. Can only reduce the count, never increase it. Cheap but can create uneven partitions.

df = df.coalesce(50)  # merge down from 200 to 50

The rule of thumb

  • Going up in partition count? → repartition()
  • Going down (e.g., before writing output files)? → coalesce()
  • Skewed data after a join? → repartition(n) (round-robin) to rebalance, or use salting — note that repartitioning on the skewed column itself won't help, since all rows for the hot key still hash to one partition

Partition sizing guidelines

There’s no universal right answer, but these rules work in most real jobs:

  • Target 128–256 MB per partition. This balances parallelism with overhead.
  • Target 2–4 partitions per core. With 100 cores, aim for 200–400 partitions.
  • After a wide transformation (join, groupBy), check partitions. The default 200 might now be wrong.
  • Before writing to disk, coalesce. Writing 200 partitions to Parquet gives you 200 tiny files. Your downstream readers will hate you. Coalesce to a number that produces files of 128–512 MB.
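The first two guidelines can be turned into a quick calculation. A hypothetical helper (not a Spark API) that applies the 128–256 MB and 2–4-per-core targets above:

```python
def target_partitions(total_bytes, cores, target_partition_mb=256, per_core=3):
    """Rough partition count: keep every core 2-4x oversubscribed,
    but keep each partition near the size target."""
    by_size = total_bytes // (target_partition_mb * 1024 * 1024)
    by_cores = cores * per_core
    return max(by_size, by_cores, 1)

# 1 TB of data on a 100-core cluster: the size target dominates
print(target_partitions(1 * 1024**4, cores=100))   # 4096

# 10 GB on the same cluster: the per-core target dominates
print(target_partitions(10 * 1024**3, cores=100))  # 300
```

Taking the max of the two targets means small datasets still get enough partitions to use the cluster, while large datasets don't blow past the per-partition size budget.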

Partition by column (for writes)

df.write.partitionBy("year", "month").parquet("s3://bucket/data/")

This creates a directory structure like year=2025/month=03/part-00001.parquet. Downstream queries that filter by year and month can skip irrelevant folders entirely — partition pruning. The improvement is often 10–100× on large datasets.

Choose partition columns that:

  • Are used in WHERE clauses frequently
  • Have low to medium cardinality (year, country, status — not user_id)
  • Don’t create millions of tiny directories
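The cardinality warning is worth doing the arithmetic on before you write: the number of directories is the product of the distinct values of every partition column. With illustrative cardinalities:

```python
from math import prod

# Distinct values per candidate partition column (made-up figures)
cardinality = {"year": 5, "month": 12, "day": 31, "country": 200}

# Every combination becomes a directory (each holding at least one file)
print(prod(cardinality.values()))                        # 372000 -- far too many

# Dropping day and country gives a sane layout
print(cardinality["year"] * cardinality["month"])        # 60
```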

The skew escape hatch: salting

When one join key dominates (say, 80% of your orders come from one customer), that key’s data lands in one partition and everything stalls. The fix: add a random “salt” column, join on salt + key, and merge the results.

from pyspark.sql.functions import lit, rand, floor, explode, array

salt_buckets = 10

# Add a random salt (0..salt_buckets-1) to each row of the skewed (large) table
large = large.withColumn("salt", floor(rand() * salt_buckets).cast("int"))

# Replicate each row of the small table once per salt value, so every
# (join_key, salt) pair in the large table finds a match
small = small.withColumn("salt", explode(array([lit(i) for i in range(salt_buckets)])))

# Join on original key + salt, then drop the helper column
result = large.join(small, ["join_key", "salt"]).drop("salt")

Ugly? Yes. Effective? Dramatically. If you’re regularly hitting skew, this pattern is worth memorizing.
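The same idea in plain Python, to see why it works. Without salt, every row carrying the hot key hashes to one partition; with salt, the (key, salt) pair fans out across up to salt_buckets partitions. (Hash partitioning is simulated here with Python's built-in hash — an illustration of the principle, not Spark's actual partitioner.)

```python
num_partitions = 200
salt_buckets = 10
hot_key = "big_customer"

# Without salt: every row with the hot key lands in the same partition
unsalted = {hash(hot_key) % num_partitions}

# With salt: rows spread across up to `salt_buckets` distinct partitions
salted = {hash((hot_key, salt)) % num_partitions for salt in range(salt_buckets)}

print(len(unsalted))  # 1 partition does all the work
print(len(salted))    # up to 10 partitions share it
```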

The one-sentence version

Most Spark performance problems are partition problems. Before adding more hardware, check partition counts and sizes — the fix is usually a single line of code.