PySpark reduceByKey() usage with example
The PySpark reduceByKey() transformation merges the values of each key using an associative and commutative reduce function on a PySpark RDD. It is a wider transformation, as it shuffles data across multiple partitions, and it operates on a pair RDD (an RDD of key/value pairs).
When reduceByKey() runs, the output is partitioned by either numPartitions or the default parallelism level. The default partitioner is the hash partitioner.
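For instance, here is a minimal sketch of passing numPartitions explicitly; it assumes a pair RDD named rdd like the one created below, and the value 4 is an arbitrary choice:

# Sketch: control the number of output partitions explicitly.
# Assumes 'rdd' is a pair RDD such as the one built below; 4 is arbitrary.
rdd2 = rdd.reduceByKey(lambda a, b: a + b, numPartitions=4)
print(rdd2.getNumPartitions())   # 4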
First, let’s create a pair RDD from a list of key/value tuples.
# Assumes an existing SparkSession named 'spark' (created in the complete example below).
data = [('Project', 1),
        ('Gutenberg’s', 1),
        ('Alice’s', 1),
        ('Adventures', 1),
        ('in', 1),
        ('Wonderland', 1),
        ('Project', 1),
        ('Gutenberg’s', 1),
        ('Adventures', 1),
        ('in', 1),
        ('Wonderland', 1),
        ('Project', 1),
        ('Gutenberg’s', 1)]
rdd = spark.sparkContext.parallelize(data)
reduceByKey() Example
In our example, we use PySpark reduceByKey() to reduce the values for each word key by applying the sum function to them. The resulting RDD contains unique words and their counts.
rdd2 = rdd.reduceByKey(lambda a, b: a + b)
for element in rdd2.collect():
    print(element)
This yields the output below.
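With the sample data above, the collected pairs are the following (the ordering may differ between runs, since the result is hash-partitioned):

('Project', 3)
('Gutenberg’s', 3)
('Alice’s', 1)
('Adventures', 2)
('in', 2)
('Wonderland', 2)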
Complete PySpark reduceByKey() example
Below is a complete example of the PySpark reduceByKey() transformation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

data = [('Project', 1),
        ('Gutenberg’s', 1),
        ('Alice’s', 1),
        ('Adventures', 1),
        ('in', 1),
        ('Wonderland', 1),
        ('Project', 1),
        ('Gutenberg’s', 1),
        ('Adventures', 1),
        ('in', 1),
        ('Wonderland', 1),
        ('Project', 1),
        ('Gutenberg’s', 1)]
rdd = spark.sparkContext.parallelize(data)
rdd2 = rdd.reduceByKey(lambda a, b: a + b)
for element in rdd2.collect():
    print(element)
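As a usage note, the lambda can be replaced with the built-in operator.add, which expresses the same sum a little more idiomatically:

from operator import add

# Equivalent to rdd.reduceByKey(lambda a, b: a + b)
rdd2 = rdd.reduceByKey(add)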
In conclusion, the PySpark reduceByKey() transformation merges the values of each key using an associative reduce function, and we learned that it is a wider transformation that shuffles the data across RDD partitions.
Pictorial Representation of ReduceByKey