[go: up one dir, main page]

0% found this document useful (0 votes)
14 views55 pages

Mapreduce Final

MapReduce is a programming model that simplifies the creation of parallel programs by allowing users to define key-value pairs and mapper/reducer functions while Hadoop manages the logistics. The document illustrates the WordCount application as a primary example, detailing the mapping and reducing processes to count word frequencies across multiple files. Additionally, it discusses the limitations of MapReduce for frequently changing data and dependent tasks, along with examples of data joining and vector multiplication tasks.

Uploaded by

Muneeba Kaleem
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
14 views55 pages

Mapreduce Final

MapReduce is a programming model that simplifies the creation of parallel programs by allowing users to define key-value pairs and mapper/reducer functions while Hadoop manages the logistics. The document illustrates the WordCount application as a primary example, detailing the mapping and reducing processes to count word frequencies across multiple files. Additionally, it discusses the limitations of MapReduce for frequently changing data and dependent tasks, along with examples of data joining and vector multiplication tasks.

Uploaded by

Muneeba Kaleem
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 55

MapReduce:

Simple Programming for


Big Results
• Explain how MapReduce simplifies
creating parallel programs

• Design a WordCount application using the


MapReduce programming model
MapReduce = Programming
Model for Hadoop Ecosystem

Hive Pig
Giraph

Spark
Storm

Flink
MapReduce

HBase

Cassandra

MongoDB
Zookeeper

YARN

HDFS
Based on Functional Programming

Map = apply operation f (x) = y


to all elements

Reduce = summarize
operation on elements
Example MapReduce Application: WordCount

File 1
Result
File 2 WordCount
File

File N
Shuffle
Map Reduce
and Sort

Represents a large
number of applications.
Sort and Shuffle (You, http://you1.fake)
(apple, http://apple1.fake)
(apple, http://apple2.fake)

(is, http://apple2.fake)
(is, http://apple2.fake)

(rose, http://apple2.fake)
(red, http://apple2.fake)
Reduce Results for “apple”

(apple -> http://apple1.fake,


http://apple2.fake)
Reduce Results for “apple”

Key Value
(apple -> http://apple1.fake,
http://apple2.fake)

apple
Shuffle
Map Reduce
and Sort
Shuffle
Map Reduce
and Sort

Parallelization Parallelization
Parallelization over
over the input intermediate data over data groups
MapReduce is bad for:

Frequently changing data


Dependent tasks
Interactive analysis
MapReduce

Simplified parallel Applications with


programming independent data-
parallel tasks
Who The framework woke:
• User defines:
a. <key, value>
The framework:
• defines:
a. <key, value>
b. mapper & reducer functions
The framework:
• defines:
a. <key, value>
b. mapper & reducer functions
• Hadoop handles the logistics
Map/Reduce flow
• map() reads data and outputs <key,value>
Dn map() <key,value>
Map/Reduce flow
• Hadoop distributes map() to data
D1 map()

D2 map()

Dn
map()
Map/Reduce flow
• Hadoop groups <key,value> data

D1 map()

D2 map()

Dn
map()
Map/Reduce flow
• Hadoop distributes groups to reducers()

D1 map()
map() reduce() O1
D2

reduce() Om

Dn
map()
The paradigmatic example:
• Count word frequencies
Wordcount task:
• How would you count all the
words in Star Wars?
Wordcount serial code:
• In a nutshell:
1. Get word
2. Look up word in table
3. Add 1 to count
Wordcount serial code:
• Result Table: Word Count

a 1000

far 2000

Jedi 5000

Luke 9000


Wordcount task:
• How would you count all the words
in all the Star Wars scripts and …
Wordcount Map/Reduce:
The Mapper:

Loop Get word


Until
Emit <word> < 1>
Done
What One Mapper Does
line = A long time ago in a galaxy far far …

keys = A long time ago in a galaxy far far

Emit <key, value> ...

A 1 ago 1 far 1 in 1
time 1
to the reducers
long 1 galaxy 1
far 1
a 1
Wordcount Map/Reduce:
The Reducer:

Loop Get next <word><value>


Over If <word> is same as previous word
key- add <value> to count
else
values
emit <word> < count>
set count to 0
map() output
A1
long 1
A long time ago
map() time 1
ago 1

in 1
in a galaxy far map() a1
galaxy 1
far 1

far away
map() far 1
away 1
Hadoop shuffles, groups,
and distributes
A1 A1
long 1 a1
A long time ago
map() time 1 far 1
ago 1 far 1
ago 1

in 1
in a galaxy far map() a1
galaxy 1 galaxy 1
far 1 in 1
long 1
time 1
far away
map() far 1 away 1
away 1

reduce() aggregates
A1 A1
long 1 a1
time 1 far 1 A1
ago 1 far 1 a1
ago 1 reduce()
far 2
in 1 …
a1
galaxy 1 galaxy 1
far 1 in 1 reduce() galaxy 1
long 1 …

time 1
far 1 away 1
away 1

Example:
Joining Data
Joining Data
• Task: combine datasets by key
– A standard data management function
Joining Data
• Task: combine datasets by key
– A standard data management function
– In pseudo SQL
Select * from table A, table B, where
A.key=B.key
Joining Data
• Task: combine datasets by key
– A standard data management function
– In pseudo SQL
Select * from table A, table B, where
A.key=B.key
– Joins can be inner, left or right outer
Joining Data
• Task: given two wordcount datasets …
Joining Data
• Task: given two wordcount datasets …
File A: <word, total-count>
able , 5
actor , 18
burger , 25
.
.
.
Joining Data
• Task: given two wordcount datasets …
File A: <word, total-count> File B: <date word, day-count>
able , 5 Jan-16 able , 2
actor , 18 Feb-22 actor , 15
burger , 25 May-03 actor , 3
. Jul-4 burger, 20
. .
. .
.
Joining Data
• Task: combine by word
File A: <word, total-count> File B: <date word, day-count>
able , 5 Jan-16 able , 2
actor , 18 Feb-22 actor , 15
burger , 25 May-03 actor , 3
. Jul-04 burger, 20
. .
. .
.
Joining Data
• Result wanted:
File AjoinB: <word date, day-count total-count >
able Jan-16, 2 5
actor Feb-22, 15 18
actor May-03, 1 18
burger Jul-04, 20 25
.
.
.
Example:
Vector Multiplication
Vector Multiplication
• Task: multiply 2 arrays of N numbers
– A basic mathematical operation
– Let’s assume N is very large
Vector Multiplication
• Task: multiply 2 arrays of N numbers
A X B =
5 2.7 (𝟓 x 𝟐. 𝟕) # 1st of A & B
4 1.9 + (𝟒 x 1.9) # 2nd of A & B
-3.2 -1.3 + (– 𝟑. 𝟐 x –1.3) # 3rd …
. . .
. . .
. . .
-2 1 + (– 𝟐 x 𝟏) # Nth of A & B
Vector Multiplication
A B
5 2.7 • Recall:
4
1.9 – data partitioned in HDFS
-1.3
-3.2
.
.
... ...
. .
-2 1
Vector Multiplication
• Main design consideration:
need elements with same index together

Let <key, value> =


<index, number>
Vector Multiplication
A B
• Problem: array partitions
5 2.7
4 don’t have an index
1.9
-1.3
-3.2
.
.
... ...
. .
-2 1
Vector Multiplication
A B
5 2.7
4
1.9
Environment
-1.3
-3.2 Information
.
.
... ...
. .
-2 1
Vector Multiplication
A B info outside map/reduce
5 2.7
<key, value>
4
1.9 map()
Environment os.getenv
-1.3
-3.2 Information ('map_input_file')
.
.
... ...
. . map() can
-2 1 access info
Vector Multiplication
A B • Let’s assume:
1, 5 1, 2.7 – each line already has
2, 4
2, 1.9 <index, number>
3, -3.2 3,-1.3
. .
... ...
. .
N, -2 N, 1
Vector Multiplication
A B • Let’s assume:
1, 5 1, 2.7 – each line already has
2, 4
2, 1.9 <index, number>
3, -3.2 3,-1.3
. .
Note: mapper only needs to
pass data (identity function)
... ...
. .
N, -2 N, 1
Vector Multiplication
A B <index, num>
<index, num> <index, num>
shuffle & 1, 5
1, 5 1, 2.7
1, 2.7
2, 4 group indices 3, -1.3
2, 1.9
3, -3.2 3,-1.3 3, -3.2
. .
2, 1.9
... ... 2, 4
. . …
N, -2 N, 1
Vector Multiplication
A,B grouped
<index, num>
1, 5 What should
1, 2.7 reducers do?
3, -1.3
3, -3.2

2, 1.9
2, 4
...
Vector Multiplication
A,B grouped
<index, num>
1, 5 Reducer:
1, 2.7 -get pairs of
3, -1.3 <index, number>
3, -3.2

2, 1.9
2, 4
...
Vector Multiplication
A,B grouped
<index, num>
1, 5 subtotals Reducer:
1, 2.7 -get pairs of
3, -1.3 + 17.66
<index, number>
3, -3.2
-multiply & add
2, 1.9
7.6
2, 4
...
Vector Multiplication
A,B grouped
<index, num>
subtotals Reducer:
1, 5
1, 2.7 -get pairs of
3, -1.3 + 17.66 <index, number>
3, -3.2 -multiply & add
2, 1.9
7.6
2, 4 (Still need get total
...
sum, but should be
largely reduced)

You might also like