If you came to PySpark from SQL, you already know joins. The mental model carries over almost perfectly: INNER, LEFT, RIGHT, FULL OUTER, semi, anti, cross — same seven flavors, same combinatoric meaning, same “this is how you turn two tables into one.” What changes in Spark is the cost. A join in PostgreSQL on a million-row table is a few hundred milliseconds. A join in Spark on a billion-row table can be three minutes or three hours depending on how Spark decides to execute it. That distinction is what Module 5 is about.
This lesson covers the syntax and semantics — the seven join types, the three styles of join condition, and the duplicate-column trap that bites everyone exactly once. Lesson 27 covers broadcast joins (the cheap path). Lesson 28 covers data skew (the slow path). Lesson 29 covers salting (the way out of the slow path).
The canonical form
Every Spark join boils down to:
result = df1.join(df2, on=<condition>, how=<type>)
on is what counts as a match. how is what to do with rows that have no match. That’s it. The defaults are how="inner" and no on, which gives you a cross join with a warning if you forget the condition.
A small running example
I’ll use two tiny DataFrames so the output of every join is small enough to read:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = (SparkSession.builder
.appName("JoinsInPySpark")
.master("local[*]")
.getOrCreate())
customers = spark.createDataFrame(
[
(1, "Anna", "IT"),
(2, "Bjorn", "NL"),
(3, "Chiara","IT"),
(4, "Diego", "ES"), # never ordered
],
"customer_id INT, name STRING, country STRING",
)
orders = spark.createDataFrame(
[
(1001, 1, 59.00),
(1002, 1, 29.00),
(1003, 2, 149.00),
(1004, 3, 89.50),
(1005, 99, 12.00), # ghost customer — id 99 doesn't exist
],
"order_id INT, customer_id INT, total DOUBLE",
)
Four customers (one with no orders), five orders (one for a customer who doesn’t exist). All seven joins below run on these.
Inner join: the matches and only the matches
The default. how="inner" is what you get if you don’t pass how at all.
customers.join(orders, on="customer_id", how="inner").show()
# +-----------+-------+-------+--------+------+
# |customer_id| name|country|order_id| total|
# +-----------+-------+-------+--------+------+
# | 1| Anna| IT| 1001| 59.00|
# | 1| Anna| IT| 1002| 29.00|
# | 2| Bjorn| NL| 1003|149.00|
# | 3| Chiara| IT| 1004| 89.50|
# +-----------+-------+-------+--------+------+
Diego (no orders) is gone. Order 1005 (no matching customer) is gone. Inner join is “rows that match on both sides, period.”
Note Anna shows up twice — she has two orders. Joins multiply. One row on the left can produce N rows on the right if there are N matches. This is the row-combinatoric thing again. Easy to forget when you’re writing fast.
Left (left_outer) join: keep everything from the left
customers.join(orders, on="customer_id", how="left").show()
# +-----------+-------+-------+--------+------+
# |customer_id| name|country|order_id| total|
# +-----------+-------+-------+--------+------+
# | 1| Anna| IT| 1001| 59.00|
# | 1| Anna| IT| 1002| 29.00|
# | 2| Bjorn| NL| 1003|149.00|
# | 3| Chiara| IT| 1004| 89.50|
# | 4| Diego| ES| null| null|
# +-----------+-------+-------+--------+------+
Diego is back, with null columns for everything from orders. how="left" and how="left_outer" are aliases — both work, both mean the same thing.
The “customers with no orders” question is a trivial filter on top of this:
(customers
.join(orders, on="customer_id", how="left")
.where(F.col("order_id").isNull())
.select("customer_id", "name", "country")
.show())
Right (right_outer) join: keep everything from the right
customers.join(orders, on="customer_id", how="right").show()
# +-----------+-------+-------+--------+------+
# |customer_id| name|country|order_id| total|
# +-----------+-------+-------+--------+------+
# | 1| Anna| IT| 1001| 59.00|
# | 1| Anna| IT| 1002| 29.00|
# | 2| Bjorn| NL| 1003|149.00|
# | 3| Chiara| IT| 1004| 89.50|
# | 99| null| null| 1005| 12.00|
# +-----------+-------+-------+--------+------+
Order 1005’s ghost customer comes through with nulls on the customer side. Right joins exist for completeness. In practice, nobody writes them — flip the operands and write a left join. A code reviewer will read customers.join(orders, ..., how="left") faster than the right-join version every time.
Full outer join: everyone shows up
customers.join(orders, on="customer_id", how="full_outer").show()
# Diego appears (no orders), order 1005 appears (no customer),
# everyone else matches.
how="full", how="outer", and how="full_outer" are all the same. Useful for reconciliation reports — “what’s in system A that’s not in system B, what’s in B that’s not in A, and what matches” all in one query. I use it about as often as FULL OUTER JOIN in SQL: rarely, but when you need it nothing else fits.
Left semi: a filter, not a join
left_semi keeps rows from the left side that have at least one match on the right. It does not bring in any columns from the right side. It does not multiply rows.
# Customers who have placed at least one order
customers.join(orders, on="customer_id", how="left_semi").show()
# +-----------+-------+-------+
# |customer_id| name|country|
# +-----------+-------+-------+
# | 1| Anna| IT|
# | 2| Bjorn| NL|
# | 3| Chiara| IT|
# +-----------+-------+-------+
Anna shows up once even though she has two orders. The semi join is membership-only — one row per left match, no duplication. This is the Spark equivalent of WHERE EXISTS (...) in SQL, and it’s almost always what you want when you’re using a join purely as a filter.
If you’ve ever written df.join(other, "key", "inner").select(df["*"]).distinct() in production code: stop. left_semi is faster, cleaner, and won’t accidentally explode rows when the right side has duplicates.
Left anti: the inverse
left_anti keeps rows from the left side that have no match on the right.
# Customers with no orders
customers.join(orders, on="customer_id", how="left_anti").show()
# +-----------+-----+-------+
# |customer_id| name|country|
# +-----------+-----+-------+
# | 4|Diego| ES|
# +-----------+-----+-------+
This is WHERE NOT EXISTS (...) from SQL. It’s the cleanest way to answer every “things on the left without things on the right” question — customers without orders, products that never sold, users who never logged in. Cleaner than the left join + where col is null pattern, and Spark’s optimizer treats it more directly.
Cross join: the cartesian product
Every row on the left paired with every row on the right. 4 customers times 5 orders is 20 rows, none of which mean anything.
# Spark refuses to run this unless you explicitly ask:
customers.crossJoin(orders).count() # 20
The reason crossJoin is its own method (instead of how="cross") is that an accidental cartesian product on real data can be catastrophic — a million rows on each side becomes a trillion. Spark wants you to type crossJoin so you can’t trip into it.
When is a cross join useful? Building dimension grids (“every product × every day for the last 90 days, so the LEFT JOIN to sales doesn’t have date gaps”), generating combinatorial test data, and rare reporting cases. In day-to-day work, almost never.
Three ways to write the join condition
The on parameter accepts three styles, and you’ll see all three in production code:
# 1. String — works when both sides have the column with the same name
customers.join(orders, on="customer_id")
# 2. List of strings — same idea, multiple equality keys
events.join(sessions, on=["user_id", "session_date"])
# 3. Explicit column expression — most flexible, required if names differ
customers.join(orders, on=customers["customer_id"] == orders["customer_id"])
The first two styles have a hidden benefit: Spark drops the duplicate join column from the result. The expression style does not. Watch what happens:
result = customers.join(
orders,
on=customers["customer_id"] == orders["customer_id"],
how="inner",
)
result.columns
# ['customer_id', 'name', 'country', 'order_id', 'customer_id', 'total']
# ^^ two columns named customer_id ^^
Two columns called customer_id, both legal, both ambiguous to refer to. Try result.select("customer_id") and Spark throws AMBIGUOUS_REFERENCE. Welcome to the duplicate-column trap.
The duplicate-column trap and how to dodge it
Three fixes, in order of preference.
Fix 1: use the string-or-list form when you can. It’s the cleanest and Spark deduplicates for you:
customers.join(orders, on="customer_id").select("customer_id", "name", "total")
Fix 2: alias the DataFrames before the join. Now there’s no ambiguity because the columns are qualified:
c = customers.alias("c")
o = orders.alias("o")
(c.join(o, F.col("c.customer_id") == F.col("o.customer_id"))
.select("c.customer_id", "c.name", "o.total")
.show())
Fix 3: rename or drop one side before joining. Less elegant, sometimes pragmatic:
orders_renamed = orders.withColumnRenamed("customer_id", "cust_id")
customers.join(orders_renamed, customers["customer_id"] == orders_renamed["cust_id"])
Pick a convention and stick to it across the codebase. Mixing styles is what creates the bugs.
Conditions beyond simple equality
The on expression is just a boolean — anything that returns true/false on a row pair is valid. Range joins, multi-condition joins, inequality joins all work:
# Range join: events that happened during a user's session
sessions.join(
events,
on=(
(sessions["user_id"] == events["user_id"]) &
(events["timestamp"] >= sessions["start"]) &
(events["timestamp"] < sessions["end"])
),
how="inner",
)
Useful, but watch out: only equality predicates can use Spark’s fast hash-join path. Add an inequality and Spark falls back to a sort-merge or, worse, a broadcast nested loop. The job still works, it’s just slower. We’ll see this in .explain() output in lesson 27.
A few practical patterns
Three patterns worth knowing because they show up constantly:
Enrichment. A wide event table joined to a thin dimension table — every event needs to know the country name, not just the country code. Inner join on the dim’s primary key, project the columns you actually need:
events.join(
dim_country.select("country_code", "country_name"),
on="country_code",
how="left",
)
Note the select on the right side. It limits the columns Spark has to carry through the shuffle. Habits like this matter at scale — every column you bring along is bytes on the wire.
De-duplication via anti-join. You have a “things processed yesterday” table and “things to process today.” You want today minus yesterday:
today.join(yesterday.select("id"), on="id", how="left_anti")
Cleaner than the WHERE NOT IN equivalent, and it’s the right tool. NOT IN with nulls in the right-hand side gives surprising results (any null returns no rows); left_anti doesn’t have that footgun.
Existence check via semi. A user is “active” if they appear in both logins_30d and purchases_30d:
active = (logins_30d.select("user_id").distinct()
.join(purchases_30d.select("user_id"), on="user_id", how="left_semi"))
Two distincts, one semi join. No accidental row multiplication, no nulls to worry about.
What’s next
These are the seven joins. The semantics are SQL-shaped and uncontroversial. The interesting question — the one that decides whether your job runs in 30 seconds or 30 minutes — is how Spark physically executes the join. There are three main strategies (broadcast hash join, sort-merge join, shuffle hash join), and Spark picks one based on the size of each side, the join type, and the configuration.
Lesson 27 dives into broadcast joins: when one side is small enough to send everywhere, you skip the shuffle entirely and joins get cheap. Lesson 28 covers what happens when one side has a key with 100x the rows of the others — the dreaded skew problem. Lesson 29 covers salting, the standard fix.
For now, the right mental model is: write the join the way you’d write SQL, pick the right how, and avoid the duplicate-column trap. Performance comes next.
References: Apache Spark SQL documentation on join types and DataFrame join API; Databricks engineering blog posts on join strategies. Retrieved 2026-05-01.