PySpark
Distributed dataframes, joins that don't blow up the cluster, and the parts of Spark that bite.
-
.cache() is not free — when to use it, when it's a trap
Spark's cache() and persist() sound like magic performance buttons. They're not. Here's when caching actually helps, when it makes things worse, and how to tell the difference.
-
Partitioning: the thing that quietly kills your Spark job
How data gets split across executors, why the default is almost always wrong, and the repartition/coalesce dance that every Spark job eventually needs.
-
PySpark joins that don't blow up the cluster
Why joins are the number-one source of Spark pain, what the shuffle actually does, and the broadcast and salting tricks that can turn a 40-minute job into a 4-minute one.