Sort-Merge Vs Shuffle Hash Join Explained
📝 Introduction:
In my earlier days, I often struggled with choosing which join to use in PySpark —
Sort-Merge Join vs Shuffle Hash Join. But once I understood their internals,
everything became clearer, and I was able to optimize code based on the size and
characteristics of my data.
💡
If you’ve ever been stuck choosing between these joins in PySpark, this post is for
you!
Here’s a complete breakdown of the differences between these two joins, including
how they work internally and when to use each one.
🔥 Step 1: The Shuffle 🔥
Shuffling is the process where Spark redistributes data across different nodes so that
rows with the same join key from both datasets end up on the same node.
● ➡️ Purpose: Ensure that all rows with the same join key end up on the same
executor (node).
● ➡️ Process:
1. Each row's join key is hashed.
2. Based on this hash, the row is sent to a specific executor.
● ➡️ Result: All data with the same join key is co-located, enabling efficient
joining.
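The hash-based redistribution above can be sketched in plain Python (a simplified model with lists standing in for executors; real Spark moves data over the network, and the names here are illustrative, not Spark APIs):

```python
# Sketch of how a shuffle assigns rows to executors by hashing the join key.
# "Executors" are modeled as plain lists; NUM_EXECUTORS is a hypothetical cluster size.

NUM_EXECUTORS = 4

def target_executor(join_key, num_executors=NUM_EXECUTORS):
    """Every row with the same key hashes to the same executor index."""
    return hash(join_key) % num_executors

rows = [("C1", "order-1"), ("C2", "order-2"), ("C1", "order-3")]
executors = [[] for _ in range(NUM_EXECUTORS)]
for key, value in rows:
    executors[target_executor(key)].append((key, value))

# Both rows with key "C1" are now guaranteed to sit in the same list,
# which is what makes a local join possible on each executor.
```

Because the executor index depends only on the key's hash, matching rows from both datasets land together regardless of which node they started on.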
Why is this important? Without shuffling, matching rows from different nodes can’t be
compared and joined. The network overhead in this step is significant for large datasets.
🔥 Sort-Merge Join: Sorting and Merging🔥
➡️ Step 1: Once the data is shuffled, the Sort-Merge Join sorts both datasets on the
join key within each partition.
➡️ Step 2: After sorting, Spark can efficiently merge the two datasets. It starts
with the first row of each dataset and compares their join keys. If the keys match,
the rows are joined. If one key is smaller, Spark advances the pointer in that
dataset until it finds a match.
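The merge step above is the classic two-pointer walk over two sorted lists. A minimal sketch (simplified to assume each key appears at most once per side; real Spark handles duplicate keys as well):

```python
def merge_join(left, right):
    """Merge two lists of (key, value) pairs, each already sorted by key.
    Sketch only: assumes each key appears at most once per side."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] == right[j][0]:
            # Keys match: join the two rows and advance both pointers.
            out.append((left[i][0], left[i][1], right[j][1]))
            i += 1
            j += 1
        elif left[i][0] < right[j][0]:
            i += 1  # left key is smaller: advance the left pointer
        else:
            j += 1  # right key is smaller: advance the right pointer
    return out

customers = [(1, "Alice"), (2, "Bob"), (4, "Dana")]
orders = [(1, "book"), (3, "pen"), (4, "mug")]
# merge_join(customers, orders) -> [(1, 'Alice', 'book'), (4, 'Dana', 'mug')]
```

Because both inputs are sorted, each list is scanned exactly once, which is why the merge itself is cheap once the (expensive) sort is done.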
💡Example💡
Let's say we're joining 'Customers' and 'Orders' on 'customer_id':
1. ✨ Shuffle ✨: Both datasets are distributed so that all data for each
'customer_id' is on the same executor.
2. ✨ Sort ✨: Each executor sorts its portion of 'Customers' and 'Orders' by
'customer_id'.
3. ✨ Merge ✨: Spark walks both sorted lists in step, joining rows whose
'customer_id' values match.
🚀When to Use Sort-Merge Join🚀
● Both datasets are large.
● Join key has high cardinality (many unique values).
● Available memory is limited.
🔥 Shuffle Hash Join: Building and Probing the Hash Table 🔥
Step-by-Step Process
1. ➡️ Shuffle Phase: Both datasets are shuffled by the join key, just as in
Sort-Merge Join, so that matching keys are co-located.
2. ➡️ Build Phase:
○ Spark creates a hash table using the join key as the hash key and the
entire row of the smaller dataset as the value.
○ Hash Key: The join key.
○ Value: The rest of the columns in the smaller dataset. For example, if
you're joining on CustomerID, the hash key is CustomerID, and the
value would be the entire row containing customer details (Name,
Country, etc.).
3. ➡️ Probe Phase:
For each partition of the larger dataset, Spark computes the hash of the join
key and checks if a matching key exists in the hash table. If a match is found,
the rows are joined.
💡Example💡
Continuing with 'Customers' and 'Orders':
1. ✨ Shuffle ✨: Both datasets are shuffled by 'customer_id'.
2. ✨ Build ✨: A hash table is built from the smaller 'Customers' dataset,
keyed by 'customer_id'.
3. ✨ Probe ✨:
○ 'Orders' are divided into partitions (e.g., 100 partitions of 100,000
records each).
○ For each partition:
■ Load into memory.
■ Process each order, looking up the customer in the hash table.
■ Output joined results.
■ Move to the next partition.
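The build and probe phases can be sketched in plain Python, with a dict standing in for Spark's hash table (the function name and row layout are illustrative, not Spark APIs):

```python
def shuffle_hash_join(small, large):
    """small, large: iterables of (join_key, row) pairs.
    Builds a hash table over the small side, then probes it with the large side."""
    # Build phase: join key -> rows of the smaller dataset.
    table = {}
    for key, row in small:
        table.setdefault(key, []).append(row)
    # Probe phase: one hash lookup per row of the larger dataset.
    out = []
    for key, row in large:
        for match in table.get(key, []):
            out.append((key, match, row))
    return out

# Hypothetical data: 'Customers' is the smaller (build) side.
customers = [(1, {"name": "Alice"}), (2, {"name": "Bob"})]
orders = [(1, "order-10"), (2, "order-11"), (1, "order-12")]
```

Note that only the smaller dataset has to fit in memory; the larger one is streamed through, one lookup per row, which is the source of both SHJ's speed and its memory risk.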
🚀 Key Differences 🚀
● Hash Table: Sort-Merge Join builds no hash table; Shuffle Hash Join builds a
hash table for the smaller dataset.
● Handling Large Datasets: Sort-Merge Join is best for large datasets on both
sides; Shuffle Hash Join is best when one dataset is significantly smaller.
● Handling Skewed Data: Sort-Merge Join can handle skewed join keys relatively
well; Shuffle Hash Join may face memory issues if the hash table becomes too
large.
🚀When to Use Shuffle Hash Join🚀
● One small dataset and one large dataset: SHJ is more efficient when one of
the datasets is much smaller and can easily fit into memory.
● Faster join operation: If sorting is too expensive or unnecessary, SHJ can be
faster by avoiding the sorting step altogether.
Both Sort-Merge Join and Shuffle Hash Join are powerful techniques for joining
datasets in PySpark, but they have different use cases and trade-offs.
● Sort-Merge Join is better for large datasets on both sides and is more robust
when dealing with skewed data or when memory is a constraint.
● Shuffle Hash Join shines when one dataset is much smaller and can be held
in memory, allowing for faster joins without sorting.