Map-Reduce 1

The document explains the Map Reduce computing paradigm, illustrating its application through examples such as calculating maximum temperature from weather data and counting words in a document. It details the roles of the Map and Reduce functions, the process of shuffling and sorting, and the use of combiners to optimize data processing. Additionally, it provides various examples of Map Reduce applications, including distributed grep and inverted index computation.

“Map Reduce” Computing Paradigm

pm jat @ daiict
MR Programming – Example First [3]
Here is an example from the book "Hadoop: The Definitive Guide" [3]
Weather Dataset (raw NCDC data)
• Consider a huge log file containing weather records of a city.
• Each row has many values, but we will use only two – month and temperature.
• Computing goal: find the maximum temperature for each month.
• In SQL terms, we are basically attempting to compute the following on a plain text data file (not a DB table):
select month, max(temperature)
from "temp_data.csv" group by month

7-Aug-25 map-reduce computing paradigm 2


Example #1: Max Temperature
• Input: a temperature data file (without delimiters).

• Output: month-wise maximum temperature.

• In SQL terms, we compute the following on a plain text data file (not a DB table):
select month, max(temperature)
from "temp_data.csv" group by month



Map Task
• Data File
– Partitioned into chunks, and replicated
• Multiple instances of the "map" function run on machines, typically those that hold the data chunks
• They run in parallel on multiple machines
• Their outputs are stored "locally"



Shuffle [and Sort]
• The Map-Reduce system does this itself
• Keys are sorted



Reduce Task
• "Reducers" pull data from all "mappers" as (key, value-list) pairs and "aggregate" them as specified in the "reduce" function
• A "reducer" takes the values for a certain set of keys only
• Multiple reducers run in parallel, working on different sets of keys
• They produce their outputs on the DFS



Example #1: Max Temperature
• Here is a pair of Map and Reduce functions to process the said file and compute the monthly maximum temperature!
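The slide's code image is not reproduced here; as a stand-in, here is a minimal plain-Python sketch that simulates the map, shuffle/group, and reduce steps in one process. The record layout (a month token and an integer temperature per line) is an assumption for illustration.

```python
from collections import defaultdict

# Hypothetical records: "<month> <temperature>" per line
records = ["Jan 12", "Feb 9", "Jan 17", "Feb 21", "Jan 5"]

def map_fn(rec_no, record):
    # emit (month, temperature) for each record
    month, temp = record.split()
    yield month, int(temp)

def reduce_fn(month, temps):
    # emit (month, max temperature)
    yield month, max(temps)

# Simulate the MR runtime: map, shuffle/group by key, reduce
groups = defaultdict(list)
for rec_no, record in enumerate(records):
    for k, v in map_fn(rec_no, record):
        groups[k].append(v)

result = dict(kv for k in sorted(groups) for kv in reduce_fn(k, groups[k]))
print(result)  # {'Feb': 21, 'Jan': 17}
```

On a real cluster the `groups` dictionary is replaced by the framework's shuffle, but the two functions keep exactly this shape.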



…. take away
• Map as a Task
– Input: a list of <K1, V1> pairs; say a list of (RecNo, Record)
– Output: a list of <K2, V2> pairs; say a list of (Month, Temp)
– Execution happens for each file split in parallel; these executions are called mappers
– Typically the MR master node invokes M map tasks, one for each split
• Map Function
– Input: a <K1, V1> pair; say <RecNo, Record>
– Output: a <K2, V2> pair; say <Month, Temp>
– The map function is executed for all records from all input file splits
• Note: execution of the map function should not require any movement of data. Execution happens on the relevant chunk servers only.



…. take away
• Reduce as a Task
– Input: a list of <K2, List(V2)> pairs; say a list of (Month, List(Temps))
all values of a key (from the outputs of all mappers) are taken by one reduce task
– Output: a list of <K3, V3> pairs; say a list of (Month, Max-Temp)
Mostly K2 and K3 are the same, and V3 is some aggregation of List(V2)
– Execution happens on a number of machines in parallel, called reducers; say R in number
• Reduce Function
– Runs on every reducer for every distinct key of the mapper outputs
– Input: a <K2, List(V2)> pair; say <Month, List(Temps)>
– Output: a <K3, V3> pair; say <Month, Max-Temp>
– The reduce function is iteratively called for every distinct key of the map output
"Shuffle and Sort" the Map output
• "Shuffling" is the process of making mapper outputs available to the reducers
• This is done as follows:
– Partition (determine which reducer an output should go to) – "shuffling"
Done by applying a hash function on K2, say "encode(K2) MOD R"
– Sort by key (K2)
– Group by key (K2) ==> <K2, List(V2)>
• As a result, the input for each "reducer" is prepared in the form of <K2, List(V2)> for all distinct keys of the map output
• Data actually moves here for computation, from "mappers" to "reducers", in "pull" mode!
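The partition-sort-group steps above can be sketched in plain Python; md5 stands in for the slide's encode() hash, and R = 2 reducers is an arbitrary choice:

```python
import hashlib

R = 2  # number of reducers (illustrative)

def partition(key):
    # deterministic stand-in for "encode(K2) MOD R"
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % R

# mapper output: (K2, V2) pairs
map_output = [("Jan", 12), ("Feb", 9), ("Jan", 17), ("Feb", 21)]

# route each pair to its reducer, then sort and group by key
reducers = {r: {} for r in range(R)}
for k, v in sorted(map_output):
    reducers[partition(k)].setdefault(k, []).append(v)

for r in range(R):
    print(r, reducers[r])
```

All values of one key land on exactly one reducer, already grouped as <K2, List(V2)>.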



MR data flow [2]

Figure Source: [2]



Data Flow in a Map Reduce Computation

Shuffle and Sort

http://infolab.stanford.edu/~ullman/mmds/ch2.pdf
Hope this makes sense now!



Example: “Lab01 code”


“Map Reduce” Computing Paradigm
• In many cases, a single pair of map-reduce functions can perform a meaningful computation.
• However, a "pipeline" of map-reduce pairs can be used for performing complex tasks!
– The output of one MR job becomes the input to another MR job, and so forth!



MR Examples* #: Word Count
• Word count over a huge document corpus
– A huge set of documents stored on HDFS
• We want to compute word frequency, aggregated over all documents



MR Examples* #: Word Count
• In pseudo code!

MAP function
input: <key, value>
<DocID, Document>
output: <key, value>
<word, 1>

Reduce function
input: <key, value-list>
<word, <1,1,1,1,1,1>>
output: <key, value>
<word, sum(1’s)>
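The pseudo code above can be sketched as runnable plain Python, with a dictionary simulating the shuffle/group step:

```python
from collections import defaultdict

# made-up corpus: DocID -> Document
docs = {"d1": "to be or not to be", "d2": "to do is to be"}

def map_fn(doc_id, document):
    for word in document.split():
        yield word, 1            # <word, 1>

def reduce_fn(word, ones):
    yield word, sum(ones)        # <word, sum(1's)>

groups = defaultdict(list)       # simulated shuffle and sort
for doc_id, doc in docs.items():
    for k, v in map_fn(doc_id, doc):
        groups[k].append(v)

counts = dict(kv for k in groups for kv in reduce_fn(k, groups[k]))
print(counts["to"])  # 4
```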
Do it Yourself!

• MR Colab:
https://colab.research.google.com/drive/1ma8h2dZMzUUobXsJ-_gTNvyaj25n6boY



3 Mappers and 3 Reducers
Aggregate Operations using MR
• Mapper output – Key: grouping attribute, Value: aggregating attribute value
• Reducer output – Key: grouping attribute, Value: aggregated value
• For example, if we want to compute the state-wise number of employees, the output key of the map function would be "State"
• For state- and gender-wise counts, the key shall be composite: (State, Gender)
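A composite key can simply be a tuple. A minimal sketch of state- and gender-wise employee counts (record fields are made up for illustration):

```python
from collections import defaultdict

employees = [
    {"name": "A", "state": "GJ", "gender": "F"},
    {"name": "B", "state": "GJ", "gender": "M"},
    {"name": "C", "state": "MH", "gender": "F"},
    {"name": "D", "state": "GJ", "gender": "F"},
]

def map_fn(rec_no, emp):
    # composite key: (state, gender); value: 1
    yield (emp["state"], emp["gender"]), 1

groups = defaultdict(list)
for i, e in enumerate(employees):
    for k, v in map_fn(i, e):
        groups[k].append(v)

counts = {k: sum(vs) for k, vs in groups.items()}
print(counts[("GJ", "F")])  # 2
```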



Confirm
• Appreciate the importance of keys and values as inputs and outputs of the map and reduce functions!
• For queries like select DNO, AVG(Salary) from "employee.csv", can you draw a rule to decide the key?



“Selection and Projection” using Map Reduce

• Have the selection condition applied in the Map function

• It can be a map-only job.
• The value here is a tuple.
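A sketch of selection and projection as a map-only job, assuming a hypothetical CSV layout (emp-no, name, dno, salary) and the selection condition salary > 40000:

```python
# Hypothetical CSV rows: emp_no, name, dno, salary
rows = [
    "1,Asha,10,52000",
    "2,Ravi,20,31000",
    "3,Meena,10,47000",
]

def map_fn(rec_no, line):
    emp_no, name, dno, salary = line.split(",")
    if int(salary) > 40000:          # selection condition
        yield emp_no, (name, dno)    # projection: keep only some columns

# map-only job: no shuffle, no reduce; mapper output is the result
output = [kv for i, line in enumerate(rows) for kv in map_fn(i, line)]
print(output)
```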



MR Example #: lines, words, and characters count

• Suppose we need to count the number of lines, words, and characters in a text file.
• Try defining the “Map” and “Reduce” functions for this task!



MR Example #: “line count”
• Suppose we need to count the number of lines, words, and characters in a file.
• The following Map-Reduce shall do the job!
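The slide's code image is not reproduced here; a plain-Python sketch of such a Map-Reduce pair could look like this:

```python
from collections import defaultdict

lines = ["hello world", "map reduce is fun"]

def map_fn(line_no, line):
    # one mapper output triple per input line
    yield "#lines", 1
    yield "#words", len(line.split())
    yield "#chars", len(line)

def reduce_fn(key, values):
    yield key, sum(values)

groups = defaultdict(list)       # simulated shuffle and sort
for i, line in enumerate(lines):
    for k, v in map_fn(i, line):
        groups[k].append(v)

totals = dict(kv for k in groups for kv in reduce_fn(k, groups[k]))
print(totals)  # {'#lines': 2, '#words': 6, '#chars': 28}
```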



MR Example #3: “line count”
• Here is sample output of map function
#chars, 80
#words, 12
#lines, 1
#chars, 44
#words, 5
#lines, 1



MR Example #: “line count”
• Here is sample shuffled input to reduce function and its output

#chars, [80,44,67,108,..]
#words, [12, 5, 8, 9, … ]
#lines, [1, 1, 1, 1, …]

#chars, 23456
#words, 8653
#lines, 563



Exercise ##
• Suppose we have an inventory data file with attribute values (Item ID, Description, Cost, Price, Stock, Category). Compute:
– Total cost of inventory
– Category-wise cost of inventory
– Category-wise count of items costlier than 1000
– List of items having the "Cleaner" keyword in their description



Combine Function - Motivation
• On shuffle, the number of values for a key could be large, or too large.
• That causes problems:
– Large volumes of data to shuffle
– More values to transfer over the network
– Memory concerns at the reducers
– Increased load on the reducers (reducers are much fewer in number than mappers in an MR job)



Combine Function - Motivation
• The solution is to do all possible aggregation at the mapper itself!
• For "aggregation" at the mapper level, we define a "combine" function, also referred to as a "combiner".
• This comes as the third function in Map-Reduce programming.
• The function is expressed as: combine(k, List(v1)) → (k, v2)
• v2 is some aggregated value over List(v1)



Word Count with combiner



Combiner in SUM
• Combiner for SUM: select dno, sum(salary) from employee group by dno;
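A plain-Python sketch of SUM with a combiner: each mapper's output is partially summed before the shuffle, so only one pair per key per mapper crosses the network. The splits and salaries are made-up data.

```python
from collections import defaultdict

# two mappers, each over its own split of (dno, salary) pairs
splits = [
    [(10, 5000), (20, 3000), (10, 4000)],
    [(10, 2000), (20, 1000)],
]

def combine(pairs):
    # partial SUM at the mapper: same shape as the reducer output
    partial = defaultdict(int)
    for dno, sal in pairs:
        partial[dno] += sal
    return list(partial.items())

# each mapper now ships one pair per key instead of one per record
shuffled = defaultdict(list)
for split in splits:
    for dno, partial_sum in combine(split):
        shuffled[dno].append(partial_sum)

# reducer: sum the partial sums
totals = {dno: sum(vs) for dno, vs in shuffled.items()}
print(totals)  # {10: 11000, 20: 4000}
```

SUM works as its own combiner because it is commutative and associative, as the next slides note.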



Map-Reduce Combiners
• In many cases, the combiner is the same as the reduce function.
• This, however, works only when the reduce function is commutative and associative. SUM is, whereas AVERAGE is not.
• However, a trick can be used in the combiner for AVERAGE: output sum and count at the combiner!
• Example next:



Combiner in Average
• Combiner for AVG: select dno, avg(salary) from employee group by dno;
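A plain-Python sketch of the (sum, count) trick for AVERAGE: the combiner emits partial (sum, count) pairs, and the reducer divides the total sum by the total count. The data is made up.

```python
from collections import defaultdict

# two mappers, each over its own split of (dno, salary) pairs
splits = [
    [(10, 5000), (20, 3000), (10, 4000)],
    [(10, 2000), (20, 1000)],
]

def combine(pairs):
    # AVG is not associative, so the combiner emits (sum, count) instead
    partial = defaultdict(lambda: [0, 0])
    for dno, sal in pairs:
        partial[dno][0] += sal   # running sum
        partial[dno][1] += 1     # running count
    return [(dno, tuple(sc)) for dno, sc in partial.items()]

shuffled = defaultdict(list)
for split in splits:
    for dno, sum_count in combine(split):
        shuffled[dno].append(sum_count)

# reducer: total-sum / total-count per department
avgs = {dno: sum(s for s, c in scs) / sum(c for s, c in scs)
        for dno, scs in shuffled.items()}
print(avgs[20])  # 2000.0
```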



Examples from Original Article[1]
• Word Count
• Distributed GREP
• Count of URL Access Frequency
• Reverse Web-Link Graph from HTML pages
• Term Vector per Host
– Vector of (word, frequency) pairs for each host
• Inverted Index
– List of Document IDs for each “word”
• Distributed Sort



MR Examples* #2: Distributed "Grep"
• "grep" is a powerful Unix command with a rich set of options
• Primarily, it finds lines in files on the file system that match the specified "expressions"
• The map function primarily does the "necessary matching"
• There is nothing for the reducer to do

map(line-no, line)
    if regex.search(pattern, line)
        write(line-no, line)

reduce(line-no, line)
    write(line-no, line)
    -- Nothing requires to be done at the reduce end.
    -- Can be defined as a map-only job.
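The grep pseudo code can be sketched as runnable Python using the standard re module (the pattern and input lines are illustrative):

```python
import re

lines = ["error: disk full", "all good", "error: timeout"]
pattern = re.compile(r"error")

def map_fn(line_no, line):
    # emit only matching lines; nothing is left for a reducer to do
    if pattern.search(line):
        yield line_no, line

# map-only job: mapper output is the final result
matches = [kv for i, line in enumerate(lines) for kv in map_fn(i, line)]
print(matches)  # [(0, 'error: disk full'), (2, 'error: timeout')]
```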
MR Examples #2: Count of "URL Access Frequency"
• Input: "Web Access Log" generated by a web server

map(recno, record)
    url = extract_url(record)
    write(url, 1)

reduce(url, values)
    count = 0
    for each v in values
        count += v
    write(url, count)
Sample Web Access Log: https://drive.google.com/file/d/1ZT6IpAS1ephI_GapXEOFk7Sqa3fIL8qK
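A runnable sketch of the URL-access-frequency job; the log format and extract_url() here are simplified assumptions, not the actual web-server log layout:

```python
from collections import defaultdict

# hypothetical minimal access-log records: "<method> <url> <status>"
log = [
    "GET /index.html 200",
    "GET /about.html 200",
    "GET /index.html 304",
]

def extract_url(record):
    return record.split()[1]

def map_fn(recno, record):
    yield extract_url(record), 1

def reduce_fn(url, values):
    count = 0
    for v in values:
        count += v
    yield url, count

groups = defaultdict(list)       # simulated shuffle and sort
for i, rec in enumerate(log):
    for k, v in map_fn(i, rec):
        groups[k].append(v)

freq = dict(kv for k in groups for kv in reduce_fn(k, groups[k]))
print(freq)  # {'/index.html': 2, '/about.html': 1}
```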
MR Examples* #3: Reverse Web-Link Graph
• A Reverse Web-Link Graph (also known as a reverse link graph or backlink graph) is a
graph structure that shows, for each web page, which other pages link to it.
• Input: “Web Page URL”, Web Page itself ( .html )
• Required Output:
URL, List of URLs referring to it



MR Examples* #3: Reverse Web-Link Graph
• The map function outputs (target, source) pairs for each given page URL
• The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair

Input: page-URL, page-itself (say .html)

map(page-url, page)
    expr = xpath-expr-for-hrefs
    targets = page.XPath(expr)
    source = page-url
    for each target in targets
        write(target, source)

reduce(target, source-list)
    write(target, source-list)
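A runnable sketch of the reverse web-link graph; a regular expression stands in for the slide's XPath step, and the pages are made up:

```python
import re
from collections import defaultdict

# page-url -> page content
pages = {
    "a.html": '<a href="b.html"></a> <a href="c.html"></a>',
    "b.html": '<a href="c.html"></a>',
}

def map_fn(page_url, page):
    # regex stands in for the XPath href extraction
    for target in re.findall(r'href="([^"]+)"', page):
        yield target, page_url          # (target, source)

def reduce_fn(target, sources):
    yield target, sources               # target -> list of referrers

groups = defaultdict(list)              # simulated shuffle and sort
for url, page in pages.items():
    for k, v in map_fn(url, page):
        groups[k].append(v)

backlinks = dict(kv for k in groups for kv in reduce_fn(k, groups[k]))
print(backlinks["c.html"])  # ['a.html', 'b.html']
```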



MR Examples* #4: Term-Vector per Host
• A "term" refers to a "word" in a document.
• "Term vectors" are often used to "summarize" or "represent" a document in the NLP, IR, and Text Mining areas.
• A term vector is a "vector" of ("term", "frequency measure") pairs/tuples.
• The "frequency measure" is often TF (Term Frequency) or TF-IDF (Term Frequency, Inverse Document Frequency).
• Google used Map-Reduce for doing this and related tasks!
• Input: "Web Page URL", the Web Page itself (.html)
• Required Output: Host-Name, Term-Vector



MR Examples* #4: Term-Vector per Host
• The map function emits a (hostname, term vector) pair for each input document (the hostname is extracted from the URL of the document).
• The reduce function receives (host-name, term vectors of all documents under that host). It combines the term vectors together, discarding infrequent terms, and then emits a final (host-name, term vector) pair.

Input: document-URL, web-document (say .html)

map(url, document)
    //compute term-vector locally
    host = host_name(url)
    write(host, term-vector)

reduce(host-name, list-term-vectors)
    combined-term-vector (ctv) = []
    for each vector in list-term-vectors
        merge(ctv, vector)
    ctv = trim(sort(ctv)) //keep only frequent terms
    write(host-name, ctv)



MR Examples* #5: “Compute Inverted Index”
• What is an "Inverted Index"?
• A term popular in "Information Retrieval": it is a data structure that contains a map from a "term" to "document(s)".
• It is used for finding the documents that contain a given term!
• Input: "Web Page URL", the Web Page itself (.html)
• Required Output: term, list of document-URLs



MR Examples* #5: "Compute Inverted Index"
• The map function parses each document and emits a sequence of (word, document ID) pairs.
• The reduce function outputs the document list (maybe sorted) for each "term"

map(doc-url, document)
    words = {} //set of words
    for each word in document
        words.add(word)
    for each word in words
        write(word, doc-url)

reduce(word, list-doc-urls)
    write(word, list-doc-urls)
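A runnable sketch of the inverted-index job on made-up documents:

```python
from collections import defaultdict

docs = {"d1.html": "big data map reduce", "d2.html": "map reduce basics"}

def map_fn(doc_url, document):
    for word in set(document.split()):   # distinct words per document
        yield word, doc_url

def reduce_fn(word, doc_urls):
    yield word, sorted(doc_urls)         # sorted posting list per term

groups = defaultdict(list)               # simulated shuffle and sort
for url, doc in docs.items():
    for k, v in map_fn(url, doc):
        groups[k].append(v)

index = dict(kv for k in groups for kv in reduce_fn(k, groups[k]))
print(index["map"])  # ['d1.html', 'd2.html']
```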



MR Examples* #6: Distributed Sort

• Suppose a data file needs to be sorted on emp-no

map(rec-no, row)
    write(emp-no, row)

• The reduce function shall receive the pairs sorted on emp-no. It can just output them as such!
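A sketch of distributed sort: the map emits (emp-no, row), and sorted() here stands in for the sorting that the MR shuffle phase performs between map and reduce:

```python
# rows keyed by emp-no (made-up data)
rows = [(3, "Meena"), (1, "Asha"), (2, "Ravi")]

def map_fn(_, row):
    emp_no = row[0]
    yield emp_no, row           # sort key becomes the MR key

map_output = [kv for r in rows for kv in map_fn(None, r)]

# the shuffle phase delivers keys to the reducer in sorted order;
# the reducer just emits the rows as they arrive
sorted_rows = [row for _, row in sorted(map_output)]
print(sorted_rows)  # [(1, 'Asha'), (2, 'Ravi'), (3, 'Meena')]
```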
Programming Map-Reduce in Python
• A library is available: https://mrjob.readthedocs.io/en/latest/
• Guide/Documentation is available at
https://mrjob.readthedocs.io/en/latest/guides.html
• How do you create MR programs here?
– Create a class that extends the MRJob class from the library
– Override at least the "mapper" and "reducer" methods
– Here is an example!
– Do you get what it is doing?



MR Job (mrjob python library) functions
• Other methods that the MRJob class provides and you can override:
mapper_init(self) # executed once per mapper, before the map calls
mapper(self, key, value) # map function
mapper_final(self) # executed once per mapper, after the map calls
reducer(self, key, values) # reduce function
combiner(self, key, values) # combine function
reducer_init, reducer_final, combiner_init, combiner_final

https://mrjob.readthedocs.io/en/latest/job.html
“Google Colab” for practice
• Here is a sample MR program:
https://colab.research.google.com/drive/1ma8h2dZMzUUobXsJ-_gTNvyaj25n6boY
• You can copy this to your account and experiment with it!

• Here are some datasets that are used in my exercises! I may be adding more.
https://drive.google.com/drive/folders/1Q0sy0NlD2nkjmzxuYURQoFt5XRZpcScs



Further reading
• Chapter 2 at http://mmds.org
• Articles "The Google file system" [1] and "MapReduce: Simplified data processing on large clusters" [2]
• (Book) Radtka, Zachary, and Donald Miner. Hadoop with Python. O'Reilly Media, 2015.



References
[1] Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google file system." Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles. 2003.
[2] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified data processing on large clusters." OSDI 2004. https://www.usenix.org/legacy/event/osdi04/tech/full_papers/dean/dean_html/
[3] White, Tom. "Hadoop: The Definitive Guide", 4th ed., O'Reilly Media, Inc., 2015.

