PySpark reduceByKey() usage with example
The PySpark reduceByKey() transformation merges the values of each key using an associative and commutative reduce function on a PySpark RDD. It is a wider transformation, as it shuffles data across multiple partitions, and it operates on a pair RDD (an RDD of key/value pairs).
When reduceByKey() runs, the output is partitioned by either numPartitions or the default parallelism level. The default partitioner is the hash partitioner.
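For instance, here is a minimal sketch of passing numPartitions explicitly; it assumes a pair RDD named rdd like the one created below, and the value 4 is an arbitrary choice:

# Sketch: control the number of output partitions explicitly.
# Assumes 'rdd' is a pair RDD such as the one built below; 4 is arbitrary.
rdd2 = rdd.reduceByKey(lambda a, b: a + b, numPartitions=4)
print(rdd2.getNumPartitions())   # 4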
First, let’s create a pair RDD from a list of key/value tuples.
# Assumes an existing SparkSession named 'spark' (created in the complete example below).
data = [('Project', 1),
        ('Gutenberg’s', 1),
        ('Alice’s', 1),
        ('Adventures', 1),
        ('in', 1),
        ('Wonderland', 1),
        ('Project', 1),
        ('Gutenberg’s', 1),
        ('Adventures', 1),
        ('in', 1),
        ('Wonderland', 1),
        ('Project', 1),
        ('Gutenberg’s', 1)]
rdd = spark.sparkContext.parallelize(data)
reduceByKey() Example
In our example, we use PySpark reduceByKey() to reduce the values for each word key by applying the sum function to them. The resulting RDD contains unique words and their counts.
rdd2 = rdd.reduceByKey(lambda a, b: a + b)
for element in rdd2.collect():
    print(element)
This yields the output below.
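With the sample data above, the collected pairs are the following (the ordering may differ between runs, since the result is hash-partitioned):

('Project', 3)
('Gutenberg’s', 3)
('Alice’s', 1)
('Adventures', 2)
('in', 2)
('Wonderland', 2)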
Complete PySpark reduceByKey() example
Below is a complete example of the PySpark reduceByKey() transformation.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

data = [('Project', 1),
        ('Gutenberg’s', 1),
        ('Alice’s', 1),
        ('Adventures', 1),
        ('in', 1),
        ('Wonderland', 1),
        ('Project', 1),
        ('Gutenberg’s', 1),
        ('Adventures', 1),
        ('in', 1),
        ('Wonderland', 1),
        ('Project', 1),
        ('Gutenberg’s', 1)]
rdd = spark.sparkContext.parallelize(data)
rdd2 = rdd.reduceByKey(lambda a, b: a + b)
for element in rdd2.collect():
    print(element)
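As a usage note, the lambda can be replaced with the built-in operator.add, which expresses the same sum a little more idiomatically:

from operator import add

# Equivalent to rdd.reduceByKey(lambda a, b: a + b)
rdd2 = rdd.reduceByKey(add)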
In conclusion, the PySpark reduceByKey() transformation merges the values of each key using an associative reduce function, and we learned that it is a wider transformation that shuffles the data across RDD partitions.
Pictorial Representation of ReduceByKey