Sort-Merge Vs Shuffle Hash Join Explained
📝 Introduction:
In my earlier days, I often struggled with choosing which join to use in PySpark —
Sort-Merge Join vs Shuffle Hash Join. But once I understood their internals,
everything became clearer, and I was able to optimize code based on the size and
characteristics of my data.
💡
If you’ve ever been stuck choosing between these joins in PySpark, this post is for
you!
Here’s a complete breakdown of the differences between these two joins, including
how they work internally and when to use each one.
🔥 Step 1: The Shuffle 🔥
Shuffling is the process where Spark redistributes data across different nodes so that
rows with the same join key from both datasets end up on the same node.
● ➡️ Purpose: Ensure that all rows with the same join key end up on the same
executor (node).
● ➡️ Process:
1. Each row's join key is hashed.
2. Based on this hash, the row is sent to a specific executor.
● ➡️ Result: All data with the same join key is co-located, enabling efficient
joining.
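The hash-based redistribution above can be sketched in plain Python (a simplified model with lists standing in for executors; real Spark moves data over the network, and the names here are illustrative, not Spark APIs):

```python
# Sketch of how a shuffle assigns rows to executors by hashing the join key.
# "Executors" are modeled as plain lists; NUM_EXECUTORS is a hypothetical cluster size.

NUM_EXECUTORS = 4

def target_executor(join_key, num_executors=NUM_EXECUTORS):
    """Every row with the same key hashes to the same executor index."""
    return hash(join_key) % num_executors

rows = [("C1", "order-1"), ("C2", "order-2"), ("C1", "order-3")]
executors = [[] for _ in range(NUM_EXECUTORS)]
for key, value in rows:
    executors[target_executor(key)].append((key, value))

# Both rows with key "C1" are now guaranteed to sit in the same list,
# which is what makes a local join possible on each executor.
```

Because the executor index depends only on the key's hash, matching rows from both datasets land together regardless of which node they started on.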
Why is this important? Without shuffling, matching rows from different nodes can’t be
compared and joined. The network overhead in this step is significant for large datasets.
🔥 Sort-Merge Join: Sorting and Merging🔥
➡️ Step 1: Once the data is shuffled, the Sort-Merge Join sorts both datasets on the
join key within each partition.
➡️ Step 2: After sorting, Spark can efficiently merge the two datasets. It starts
with the first row of each dataset and compares their join keys. If the keys match,
the rows are joined. If one key is smaller, Spark advances the pointer in that
dataset until it finds a match.
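The merge step above is the classic two-pointer walk over two sorted lists. A minimal sketch (simplified to assume each key appears at most once per side; real Spark handles duplicate keys as well):

```python
def merge_join(left, right):
    """Merge two lists of (key, value) pairs, each already sorted by key.
    Sketch only: assumes each key appears at most once per side."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] == right[j][0]:
            # Keys match: join the two rows and advance both pointers.
            out.append((left[i][0], left[i][1], right[j][1]))
            i += 1
            j += 1
        elif left[i][0] < right[j][0]:
            i += 1  # left key is smaller: advance the left pointer
        else:
            j += 1  # right key is smaller: advance the right pointer
    return out

customers = [(1, "Alice"), (2, "Bob"), (4, "Dana")]
orders = [(1, "book"), (3, "pen"), (4, "mug")]
# merge_join(customers, orders) -> [(1, 'Alice', 'book'), (4, 'Dana', 'mug')]
```

Because both inputs are sorted, each list is scanned exactly once, which is why the merge itself is cheap once the (expensive) sort is done.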
💡Example💡
Let's say we're joining 'Customers' and 'Orders' on 'customer_id':
1. ✨ Shuffle ✨: Both datasets are distributed so that all data for each
'customer_id' is on the same executor.
2. ✨ Sort ✨: Each executor sorts its portion of 'Customers' and 'Orders' by
'customer_id'.
3. ✨ Merge ✨: Spark walks both sorted lists in step, joining rows whose
'customer_id' values match.
🚀When to Use Sort-Merge Join🚀
● Both datasets are large.
● Join key has high cardinality (many unique values).
● Available memory is limited.
🔥 Shuffle Hash Join: Building and Probing the Hash Table 🔥
Step-by-Step Process
1. ➡️ Shuffle Phase: Both datasets are shuffled by the join key, just as in
Sort-Merge Join, so that matching keys are co-located.
2. ➡️ Build Phase:
○ Spark creates a hash table using the join key as the hash key and the
entire row of the smaller dataset as the value.
○ Hash Key: The join key.
○ Value: The rest of the columns in the smaller dataset. For example, if
you're joining on CustomerID, the hash key is CustomerID, and the
value would be the entire row containing customer details (Name,
Country, etc.).
3. ➡️ Probe Phase:
For each partition of the larger dataset, Spark computes the hash of the join
key and checks if a matching key exists in the hash table. If a match is found,
the rows are joined.
💡Example💡
Continuing with 'Customers' and 'Orders':
1. ✨ Shuffle ✨: Both datasets are shuffled by 'customer_id'.
2. ✨ Build ✨: A hash table is built from the smaller 'Customers' dataset,
keyed by 'customer_id'.
3. ✨ Probe ✨:
○ 'Orders' are divided into partitions (e.g., 100 partitions of 100,000
records each).
○ For each partition:
■ Load into memory.
■ Process each order, looking up the customer in the hash table.
■ Output joined results.
■ Move to the next partition.
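The build and probe phases can be sketched in plain Python, with a dict standing in for Spark's hash table (the function name and row layout are illustrative, not Spark APIs):

```python
def shuffle_hash_join(small, large):
    """small, large: iterables of (join_key, row) pairs.
    Builds a hash table over the small side, then probes it with the large side."""
    # Build phase: join key -> rows of the smaller dataset.
    table = {}
    for key, row in small:
        table.setdefault(key, []).append(row)
    # Probe phase: one hash lookup per row of the larger dataset.
    out = []
    for key, row in large:
        for match in table.get(key, []):
            out.append((key, match, row))
    return out

# Hypothetical data: 'Customers' is the smaller (build) side.
customers = [(1, {"name": "Alice"}), (2, {"name": "Bob"})]
orders = [(1, "order-10"), (2, "order-11"), (1, "order-12")]
```

Note that only the smaller dataset has to fit in memory; the larger one is streamed through, one lookup per row, which is the source of both SHJ's speed and its memory risk.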
🚀 Key Differences 🚀
● Hash Table: Sort-Merge Join builds no hash table; Shuffle Hash Join builds a
hash table for the smaller dataset.
● Handling Large Datasets: Sort-Merge Join is best for large datasets on both
sides; Shuffle Hash Join is best when one dataset is significantly smaller.
● Handling Skewed Data: Sort-Merge Join can handle skewed join keys relatively
well; Shuffle Hash Join may face memory issues if the hash table becomes too
large.
🚀When to Use Shuffle Hash Join🚀
● One small dataset and one large dataset: SHJ is more efficient when one of
the datasets is much smaller and can easily fit into memory.
● Faster join operation: If sorting is too expensive or unnecessary, SHJ can be
faster by avoiding the sorting step altogether.
Both Sort-Merge Join and Shuffle Hash Join are powerful techniques for joining
datasets in PySpark, but they have different use cases and trade-offs.
● Sort-Merge Join is better for large datasets on both sides and is more robust
when dealing with skewed data or when memory is a constraint.
● Shuffle Hash Join shines when one dataset is much smaller and can be held
in memory, allowing for faster joins without sorting.