Narcis Miclaus

PySpark

Distributed DataFrames, joins that don't blow up the cluster, and the parts of Spark that bite.

  • .cache() is not free — when to use it, when it's a trap

    Published on 11 April 2026

    Spark's cache and persist sound like magic performance buttons. They're not. Here's when caching actually helps, when it makes things worse, and how to tell the difference.

    • #pyspark
    • #spark
    • #caching
    • #performance
  • Partitioning: the thing that quietly kills your Spark job

    Published on 11 April 2026

    How data gets split across executors, why the default is almost always wrong, and the repartition/coalesce dance that every Spark job eventually needs.

    • #pyspark
    • #spark
    • #partitioning
    • #performance
  • PySpark joins that don't blow up the cluster

    Published on 10 April 2026

    Why joins are the number-one source of Spark pain, what the shuffle actually does, and the broadcast and salting tricks that turn a 40-minute job into a 4-minute one.

    • #pyspark
    • #spark
    • #performance
    • #joins

Built with Astro — zero tracking, just words and numbers.

© 2026 Narcis Miclaus