MapReduce:
Simple Programming for
Big Results
• Explain how MapReduce simplifies
creating parallel programs
• Design a WordCount application using the
MapReduce programming model
MapReduce = Programming
Model for Hadoop Ecosystem
Hive Pig
Giraph
Spark
Storm
Flink
MapReduce
HBase
Cassandra
MongoDB
Zookeeper
YARN
HDFS
Based on Functional Programming
Map = apply operation f (x) = y
to all elements
Reduce = summarize
operation on elements
Example MapReduce Application: WordCount
File 1
Result
File 2 WordCount
File
File N
Shuffle
Map Reduce
and Sort
Represents a large
number of applications.
Sort and Shuffle (You, http://you1.fake)
(apple, http://apple1.fake)
(apple, http://apple2.fake)
(is, http://apple2.fake)
(is, http://apple2.fake)
(rose, http://apple2.fake)
(red, http://apple2.fake)
Reduce Results for “apple”
(apple -> http://apple1.fake,
http://apple2.fake)
Reduce Results for “apple”
Key Value
(apple -> http://apple1.fake,
http://apple2.fake)
apple
Shuffle
Map Reduce
and Sort
Shuffle
Map Reduce
and Sort
Parallelization Parallelization
Parallelization over
over the input intermediate data over data groups
MapReduce is bad for:
Frequently changing data
Dependent tasks
Interactive analysis
MapReduce
Simplified parallel Applications with
programming independent data-
parallel tasks
Who The framework woke:
• User defines:
a. <key, value>
The framework:
• defines:
a. <key, value>
b. mapper & reducer functions
The framework:
• defines:
a. <key, value>
b. mapper & reducer functions
• Hadoop handles the logistics
Map/Reduce flow
• map() reads data and outputs <key,value>
Dn map() <key,value>
Map/Reduce flow
• Hadoop distributes map() to data
D1 map()
D2 map()
Dn
map()
Map/Reduce flow
• Hadoop groups <key,value> data
D1 map()
D2 map()
Dn
map()
Map/Reduce flow
• Hadoop distributes groups to reducers()
D1 map()
map() reduce() O1
D2
reduce() Om
Dn
map()
The paradigmatic example:
• Count word frequencies
Wordcount task:
• How would you count all the
words in Star Wars?
Wordcount serial code:
• In a nutshell:
1. Get word
2. Look up word in table
3. Add 1 to count
Wordcount serial code:
• Result Table: Word Count
a 1000
far 2000
Jedi 5000
Luke 9000
…
Wordcount task:
• How would you count all the words
in all the Star Wars scripts and …
Wordcount Map/Reduce:
The Mapper:
Loop Get word
Until
Emit <word> < 1>
Done
What One Mapper Does
line = A long time ago in a galaxy far far …
keys = A long time ago in a galaxy far far
Emit <key, value> ...
A 1 ago 1 far 1 in 1
time 1
to the reducers
long 1 galaxy 1
far 1
a 1
Wordcount Map/Reduce:
The Reducer:
Loop Get next <word><value>
Over If <word> is same as previous word
key- add <value> to count
else
values
emit <word> < count>
set count to 0
map() output
A1
long 1
A long time ago
map() time 1
ago 1
in 1
in a galaxy far map() a1
galaxy 1
far 1
far away
map() far 1
away 1
Hadoop shuffles, groups,
and distributes
A1 A1
long 1 a1
A long time ago
map() time 1 far 1
ago 1 far 1
ago 1
in 1
in a galaxy far map() a1
galaxy 1 galaxy 1
far 1 in 1
long 1
time 1
far away
map() far 1 away 1
away 1
…
reduce() aggregates
A1 A1
long 1 a1
time 1 far 1 A1
ago 1 far 1 a1
ago 1 reduce()
far 2
in 1 …
a1
galaxy 1 galaxy 1
far 1 in 1 reduce() galaxy 1
long 1 …
time 1
far 1 away 1
away 1
…
Example:
Joining Data
Joining Data
• Task: combine datasets by key
– A standard data management function
Joining Data
• Task: combine datasets by key
– A standard data management function
– In pseudo SQL
Select * from table A, table B, where
A.key=B.key
Joining Data
• Task: combine datasets by key
– A standard data management function
– In pseudo SQL
Select * from table A, table B, where
A.key=B.key
– Joins can be inner, left or right outer
Joining Data
• Task: given two wordcount datasets …
Joining Data
• Task: given two wordcount datasets …
File A: <word, total-count>
able , 5
actor , 18
burger , 25
.
.
.
Joining Data
• Task: given two wordcount datasets …
File A: <word, total-count> File B: <date word, day-count>
able , 5 Jan-16 able , 2
actor , 18 Feb-22 actor , 15
burger , 25 May-03 actor , 3
. Jul-4 burger, 20
. .
. .
.
Joining Data
• Task: combine by word
File A: <word, total-count> File B: <date word, day-count>
able , 5 Jan-16 able , 2
actor , 18 Feb-22 actor , 15
burger , 25 May-03 actor , 3
. Jul-04 burger, 20
. .
. .
.
Joining Data
• Result wanted:
File AjoinB: <word date, day-count total-count >
able Jan-16, 2 5
actor Feb-22, 15 18
actor May-03, 1 18
burger Jul-04, 20 25
.
.
.
Example:
Vector Multiplication
Vector Multiplication
• Task: multiply 2 arrays of N numbers
– A basic mathematical operation
– Let’s assume N is very large
Vector Multiplication
• Task: multiply 2 arrays of N numbers
A X B =
5 2.7 (𝟓 x 𝟐. 𝟕) # 1st of A & B
4 1.9 + (𝟒 x 1.9) # 2nd of A & B
-3.2 -1.3 + (– 𝟑. 𝟐 x –1.3) # 3rd …
. . .
. . .
. . .
-2 1 + (– 𝟐 x 𝟏) # Nth of A & B
Vector Multiplication
A B
5 2.7 • Recall:
4
1.9 – data partitioned in HDFS
-1.3
-3.2
.
.
... ...
. .
-2 1
Vector Multiplication
• Main design consideration:
need elements with same index together
Let <key, value> =
<index, number>
Vector Multiplication
A B
• Problem: array partitions
5 2.7
4 don’t have an index
1.9
-1.3
-3.2
.
.
... ...
. .
-2 1
Vector Multiplication
A B
5 2.7
4
1.9
Environment
-1.3
-3.2 Information
.
.
... ...
. .
-2 1
Vector Multiplication
A B info outside map/reduce
5 2.7
<key, value>
4
1.9 map()
Environment os.getenv
-1.3
-3.2 Information ('map_input_file')
.
.
... ...
. . map() can
-2 1 access info
Vector Multiplication
A B • Let’s assume:
1, 5 1, 2.7 – each line already has
2, 4
2, 1.9 <index, number>
3, -3.2 3,-1.3
. .
... ...
. .
N, -2 N, 1
Vector Multiplication
A B • Let’s assume:
1, 5 1, 2.7 – each line already has
2, 4
2, 1.9 <index, number>
3, -3.2 3,-1.3
. .
Note: mapper only needs to
pass data (identity function)
... ...
. .
N, -2 N, 1
Vector Multiplication
A B <index, num>
<index, num> <index, num>
shuffle & 1, 5
1, 5 1, 2.7
1, 2.7
2, 4 group indices 3, -1.3
2, 1.9
3, -3.2 3,-1.3 3, -3.2
. .
2, 1.9
... ... 2, 4
. . …
N, -2 N, 1
Vector Multiplication
A,B grouped
<index, num>
1, 5 What should
1, 2.7 reducers do?
3, -1.3
3, -3.2
2, 1.9
2, 4
...
Vector Multiplication
A,B grouped
<index, num>
1, 5 Reducer:
1, 2.7 -get pairs of
3, -1.3 <index, number>
3, -3.2
2, 1.9
2, 4
...
Vector Multiplication
A,B grouped
<index, num>
1, 5 subtotals Reducer:
1, 2.7 -get pairs of
3, -1.3 + 17.66
<index, number>
3, -3.2
-multiply & add
2, 1.9
7.6
2, 4
...
Vector Multiplication
A,B grouped
<index, num>
subtotals Reducer:
1, 5
1, 2.7 -get pairs of
3, -1.3 + 17.66 <index, number>
3, -3.2 -multiply & add
2, 1.9
7.6
2, 4 (Still need get total
...
sum, but should be
largely reduced)