PySpark
Distributed dataframes, joins that don't blow up the cluster, and the parts of Spark that bite.
-
.cache() is not free — when to use it, when it's a trap
Spark's cache() and persist() sound like magic performance buttons. They're not. Here's when caching actually helps, when it makes things worse, and how to tell the difference.
-
Partitioning: the thing that quietly kills your Spark job
How data gets split across executors, why the default is almost always wrong, and the repartition/coalesce dance that every Spark job eventually needs.
-
PySpark joins that don't blow up the cluster
Why joins are the number-one source of Spark pain, what the shuffle actually does, and the broadcast and salting tricks that can turn a 40-minute job into a 4-minute one.