“Map Reduce” Computing Paradigm
pm jat @ daiict
MR Programming – Example First[3]
Here is an example from the book “Hadoop: The Definitive Guide” [3]
Weather Dataset (raw NCDC data)
• Consider a huge log file that contains weather records of a city.
• Each row has many values, but we will use two of them – month and temperature.
• Computing Goal: find out the maximum temperature for each month.
• In SQL terms, we are basically attempting to compute the following on a plain text data file (not a DB table)
select month, max(temperature)
from “temp_data.csv” group by month
7-Aug-25 map-reduce computing paradigm 2
Example #1: Max Temperature
• Input: a temperature data file (without delimiters).
• Output: month-wise maximum temperature.
• In SQL terms, we compute the following on a plain text data file (not a DB table)
select month, max(temperature)
from “temp_data.csv” group by month
Map Task
• Data File
– Partitioned into chunks, and replicated
• Multiple instances of the “map” function run on machines, typically those that hold the data chunks
• They run in parallel on multiple machines
• Their outputs are stored “locally”
Shuffle [and Sort ]
The Map-Reduce system does this, and keys are sorted
Reduce Task
• “Reducers” pull data from all “mappers” as (key, value-list) pairs, and “aggregate” them as specified in the “reduce” function
• A “reducer” takes values for a certain set of keys only
• Multiple reducers run in parallel, working on different sets of keys
• They produce their outputs on the DFS
Example #1: Max Temperature
• Here is a pair of Map and Reduce functions that processes the said file and computes the monthly maximum temperature!
…. take away
• Map as a Task
– Input: a List of <K1, V1> pairs; say a List of (RecNo, Record)
– Output: a List of <K2, V2> pairs; say a List of (Month, Temp)
– Execution happens for each file split in parallel; these executions are called mappers
– The MR master node invokes M map tasks, typically one for each split
• Map Function
– Input: a <K1, V1> pair; say <RecNo, Record>
– Output: a <K2, V2> pair; say <Month, Temp>
– The map function is executed for all records from all input file splits
• Note: execution of the map function should not require any movement of data.
Execution happens on the relevant chunk servers only
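The <K1, V1> → <K2, V2> flow above can be sketched as a toy, single-process simulation (the record format and values below are invented for illustration):

```python
from collections import defaultdict

# Hypothetical (RecNo, Record) pairs; each record holds a month and a temperature
records = [(1, "2024-01 32"), (2, "2024-01 38"), (3, "2024-02 30"), (4, "2024-02 35")]

def map_fn(rec_no, record):           # <K1,V1> -> <K2,V2>
    month, temp = record.split()
    yield month, int(temp)

# run all "mappers"
map_out = [kv for rec in records for kv in map_fn(*rec)]

# shuffle: group values by key -> <K2, List(V2)>
groups = defaultdict(list)
for k, v in map_out:
    groups[k].append(v)

def reduce_fn(month, temps):          # <K2,List(V2)> -> <K3,V3>
    yield month, max(temps)

result = dict(kv for k, vs in sorted(groups.items()) for kv in reduce_fn(k, vs))
print(result)                         # {'2024-01': 38, '2024-02': 35}
```

In a real MR job the three phases run on different machines; here they are just three loops in one process.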
…. take away
• Reduce as a Task
– Input: a List of <K2, List(V2)> pairs; say a List of (Month, List(Temps))
all values (from the outputs of all mappers) for a key are taken by one reduce task
– Output: a List of <K3, V3> pairs; say a List of (Month, Max-Temp)
Mostly K2 and K3 are the same, and V3 is some aggregation of List(V2)
– Execution happens on a number of machines in parallel; these are called reducers
Say R of them
• Reduce Function
– Runs on every reducer, for every distinct key of the mapper outputs
– Input: a <K2, List(V2)> pair; say <Month, List(Temps)>
– Output: a <K3, V3> pair; say <Month, Max-Temp>
– The reduce function is called iteratively for all distinct keys of the map output
“Shuffle and Sort” the Map output
• “Shuffling” is the process of making mapper outputs available to the reducers
• This is done as follows
– Partition (determine which reducer an output should go to) – “shuffling”
Done by applying a hash function on K2, say as “encode(K2) MOD R”
– Sort by Key (K2)
– Group by Key (K2) ==> <K2, List(V2)>
• As a result, the input for the “reducer” is prepared in the form <K2, List(V2)> for all distinct keys of the map output
• Data actually moves here for computation, from “mappers” to “reducers”, in “pull” mode!
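A minimal sketch of the partition-and-group step, assuming string keys; `zlib.crc32` stands in for the slide's “encode(K2)” (Python's built-in `hash()` is not stable across runs, so a stable hash is used instead):

```python
import zlib
from collections import defaultdict

R = 3                                     # number of reducers
map_out = [("2024-01", 32), ("2024-02", 30), ("2024-01", 38)]  # toy mapper output

def partition(key):
    # "encode(K2) MOD R": a stable hash decides which reducer gets a key
    return zlib.crc32(key.encode()) % R

# each reducer's input: values grouped per key -> <K2, List(V2)>
reducer_input = [defaultdict(list) for _ in range(R)]
for k, v in map_out:
    reducer_input[partition(k)][k].append(v)

# all occurrences of the same key land on the same reducer, sorted by key
for r in range(R):
    for k in sorted(reducer_input[r]):
        print(r, k, reducer_input[r][k])
```

Because the hash depends only on the key, every mapper routes a given key to the same reducer without any coordination.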
MR data flow [2]
Figure Source: [2]
Data Flow in a Map Reduce Computation
Shuffle and Sort
http://infolab.stanford.edu/~ullman/mmds/ch2.pdf
Hope this makes sense now!
Example: “Lab01 code”
“Map Reduce” Computing Paradigm
• In many cases, a single pair of map-reduce functions can perform a meaningful computation.
• However, a “pipeline” of map-reduce pairs can be used for performing complex
tasks!
– Output of one MR pair becomes input to another MR job, and so forth!
MR Examples* #: Word Count
• Word count from a huge document corpus
– A huge set of documents stored on HDFS
• Want to compute word frequency, aggregated over all documents
MR Examples* #: Word Count
• In pseudocode!
MAP function
input: <key, value> = <DocID, Document>
output: <key, value> = <word, 1>
Reduce function
input: <key, value-list> = <word, <1,1,1,1,1,1>>
output: <key, value> = <word, sum(1’s)>
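The pseudocode above can be simulated in plain Python (toy corpus invented for illustration):

```python
from collections import defaultdict

docs = {"d1": "the cat sat on the mat", "d2": "the dog sat"}  # toy corpus

def map_fn(doc_id, document):       # <DocID, Document> -> <word, 1>
    for word in document.split():
        yield word, 1

shuffled = defaultdict(list)        # shuffle: <word, [1,1,...]>
for doc in docs.items():
    for k, v in map_fn(*doc):
        shuffled[k].append(v)

def reduce_fn(word, ones):          # <word, [1,1,...]> -> <word, count>
    yield word, sum(ones)

counts = dict(kv for k, vs in shuffled.items() for kv in reduce_fn(k, vs))
print(counts["the"], counts["sat"])   # 3 2
```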
Do it Yourself!
• MR Colab:
https://colab.research.google.com/drive/1ma8h2dZMzUUobXsJ-_gTNvyaj25n6boY
3 Mappers and 3 Reducers
Aggregate Operations using MR
• Mapper output – Key: the grouping attribute; Value: the aggregating attribute’s value
• Reducer output – Key: the grouping attribute; Value: the aggregated value
• For example, if we want to compute the state-wise number of employees, the output key of the map function would be “State”
• If we want it state- and gender-wise, then the key shall be composite: (State, Gender)
Confirm
• Appreciate the importance of keys and values as inputs and outputs of the map and reduce functions!
• In queries like select DNO, AVG(Salary) from “employee.csv” group by DNO
can you draw a rule to decide the key?
“Selection and Projection” using Map Reduce
• Have the selection condition applied in the Map function
• It can be a map-only job.
• The Value here is a tuple.
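A sketch of such a map-only job (the employee tuples and the predicate are invented for illustration): the map applies both the selection and the projection, and no reduce function is needed:

```python
# Hypothetical (rec_no, row) pairs; row = (emp_id, name, dno, salary)
rows = [(1, ("e1", "Asha", 4, 60000)),
        (2, ("e2", "Ravi", 5, 45000)),
        (3, ("e3", "Meera", 4, 72000))]

def map_fn(rec_no, row):
    # selection: keep rows with salary > 50000; projection: keep (name, salary)
    name, salary = row[1], row[3]
    if salary > 50000:
        yield rec_no, (name, salary)

selected = [kv for r in rows for kv in map_fn(*r)]
print(selected)   # [(1, ('Asha', 60000)), (3, ('Meera', 72000))]
```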
MR Example #: lines, words, and characters count
• Suppose we need to count the number of lines, words, and characters in a text file.
• Try defining the “Map” and “Reduce” functions for this task!
MR Example #: “line count”
• Suppose we need to count the number of lines, words, and characters in a file.
• The following Map-Reduce shall do the job!
MR Example #3: “line count”
• Here is sample output of the map function
#chars, 80
#words, 12
#lines, 1
#chars, 44
#words, 5
#lines, 1
…
MR Example #: “line count”
• Here is sample shuffled input to the reduce function, and its output
Input:
#chars, [80,44,67,108,..]
#words, [12, 5, 8, 9, … ]
#lines, [1, 1, 1, 1, …]
Output:
#chars, 23456
#words, 8653
#lines, 563
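The whole pipeline for this example can be sketched as a toy simulation (the three input lines are invented): the map emits three keyed counts per line, and the reduce sums each key:

```python
from collections import defaultdict

lines = ["hello world", "map reduce is fun", ""]   # toy "file"

def map_fn(line_no, line):
    yield "#lines", 1
    yield "#words", len(line.split())
    yield "#chars", len(line)

shuffled = defaultdict(list)
for i, line in enumerate(lines):
    for k, v in map_fn(i, line):
        shuffled[k].append(v)

def reduce_fn(key, values):
    yield key, sum(values)

totals = dict(kv for k, vs in shuffled.items() for kv in reduce_fn(k, vs))
print(totals)   # {'#lines': 3, '#words': 6, '#chars': 28}
```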
Exercise ##
• Suppose we have an Inventory data file with attribute values (Item ID, Description, Cost, Price, Stock, Category), and we want to compute:
– Total cost of inventory
– Category-wise cost of inventory
– Category-wise count of items that are costlier than 1000
– List of items having the “Cleaner” keyword in their description
Combine Function - Motivation
• On shuffle, the number of values for a key could be large, or too large.
• That causes problems
– A large number of values to shuffle
– More data to transfer over the network
– Memory concerns at reducers
– Increased load on reducers (reducers are much fewer in number than mappers in an MR job)
Combine Function - Motivation
• On shuffle, the number of values for a key could be large, or too large.
• That causes problems
…
• The solution is to do all possible aggregation at the mapper itself!
• For “aggregation” at the mapper level, we define a “combine” function; also referred to as a “combiner”.
• This comes as the third function in Map-Reduce programming
• The function is expressed as: combine(k, list(v1)) → (k, v2)
• v2 is some aggregated value over list(v1)
Word Count with combiner
Combiner in SUM
• Combiner for SUM: select dno, sum(salary) from employee group by dno;
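A toy simulation of this SUM combiner (department numbers and salaries invented): since SUM is associative and commutative, the combiner can simply reuse the reduce logic on each mapper's local output:

```python
from collections import defaultdict

# hypothetical per-mapper output: (dno, salary) pairs, one list per mapper
mapper_outputs = [
    [(4, 60000), (5, 45000), (4, 72000)],   # mapper 1
    [(4, 50000), (5, 40000)],               # mapper 2
]

def combine(dno, salaries):                 # runs at each mapper; same as reduce for SUM
    yield dno, sum(salaries)

combined = []                               # what actually gets shuffled
for out in mapper_outputs:
    local = defaultdict(list)
    for k, v in out:
        local[k].append(v)
    for k, vs in local.items():
        combined.extend(combine(k, vs))

# shuffle + reduce: summing the partial sums gives the same answer
shuffled = defaultdict(list)
for k, v in combined:
    shuffled[k].append(v)
totals = {k: sum(vs) for k, vs in shuffled.items()}
print(totals)   # {4: 182000, 5: 85000}
```

Note that dno 4 now ships only two partial sums across the network instead of three raw salaries.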
Map-Reduce Combiners
• In many cases, the combiner is the same as the reduce function.
• This, however, works only when the reduce function is commutative and associative. SUM is, whereas AVERAGE is not.
• However, we can use a trick in the combiner: we output sum and count at the combiner!
• Example next:
Combiner in Average
• Combiner for AVG: select dno, avg(salary) from employee group by dno;
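The AVG trick above can be sketched the same way (toy data invented): the combiner emits partial (sum, count) pairs, and the reducer merges them before dividing:

```python
from collections import defaultdict

mapper_outputs = [
    [(4, 60000), (4, 72000), (5, 45000)],   # mapper 1
    [(4, 50000), (5, 40000)],               # mapper 2
]

def combine(dno, salaries):
    # AVG is not associative, so the combiner emits (sum, count) instead of an average
    yield dno, (sum(salaries), len(salaries))

combined = []
for out in mapper_outputs:
    local = defaultdict(list)
    for k, v in out:
        local[k].append(v)
    for k, vs in local.items():
        combined.extend(combine(k, vs))

def reduce_fn(dno, pairs):                  # merge partial (sum, count) pairs
    s = sum(p[0] for p in pairs)
    c = sum(p[1] for p in pairs)
    yield dno, s / c

shuffled = defaultdict(list)
for k, v in combined:
    shuffled[k].append(v)
avgs = dict(kv for k, vs in shuffled.items() for kv in reduce_fn(k, vs))
print({k: round(v, 2) for k, v in avgs.items()})   # {4: 60666.67, 5: 42500.0}
```

Averaging the per-mapper averages would be wrong whenever the mappers hold different numbers of values; carrying (sum, count) keeps the result exact.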
Examples from Original Article[1]
• Word Count
• Distributed GREP
• Count of URL Access Frequency
• Reverse Web-Link Graph from HTML pages
• Term Vector per Host
– Vector of (word, frequency) pairs for each host
• Inverted Index
– List of Document IDs for each “word”
• Distributed Sort
MR Examples* #2: Distributed “Grep”
• "grep" is a powerful Unix command with a rich set of options
• Primarily, it finds lines in files on the file system that match the specified “expressions”
• The map function primarily does the “necessary matching”
• There is nothing for the reducer to do

map(line-no, line)
  if regex.search(pattern, line)
    write(line-no, line)

reduce(line-no, line)
  write(line-no, line)
  -- Nothing needs to be done at the reduce end.
  -- Can be defined as a map-only job.
MR Examples #2: Count of “URL Access Frequency”
Input: “Web Access Log” generated by a web server

map(recno, record)
  url = extract_url(record)
  write(url, 1)

reduce(url, values)
  count = 0
  for each v in values
    count += v
  write(url, count)
Sample Web Access Log: https://drive.google.com/file/d/1ZT6IpAS1ephI_GapXEOFk7Sqa3fIL8qK
MR Examples* #3: Reverse Web-Link Graph
• A Reverse Web-Link Graph (also known as a reverse link graph or backlink graph) is a
graph structure that shows, for each web page, which other pages link to it.
• Input: “Web Page URL”, Web Page itself ( .html )
• Required Output:
URL, List of URLs referring to it
MR Examples* #3: Reverse Web-Link Graph
• The map function outputs (target, source) pairs for each given page-url
• The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair

Input: page-URL, the page itself (say .html)

map(page-url, page)
  expr = xpath-expr-for-hrefs
  targets = page.XPath(expr)
  source = page-url
  for each target in targets
    write(target, source)

reduce(target, source-list)
  write(target, source-list)
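A toy simulation of the pseudocode above; the dict of outgoing links stands in for the XPath extraction, and the page names are invented:

```python
from collections import defaultdict

# hypothetical pages: url -> list of outgoing hrefs (stand-in for XPath extraction)
pages = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
}

def map_fn(page_url, hrefs):
    for target in hrefs:                 # emit (target, source) for each link
        yield target, page_url

shuffled = defaultdict(list)
for page in pages.items():
    for k, v in map_fn(*page):
        shuffled[k].append(v)

def reduce_fn(target, sources):          # concatenate the sources for each target
    yield target, sorted(sources)

backlinks = dict(kv for k, vs in shuffled.items() for kv in reduce_fn(k, vs))
print(backlinks)   # {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}
```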
MR Examples* #4: Term-Vector per Host
• A “term” refers to a "word" in a document.
• “Term vectors” are often used to "summarize" or "represent" a document in the NLP, IR, and Text Mining areas.
• A term vector is a "vector" of ("term", "frequency measure") pairs/tuples.
• The “frequency measure” is often TF (Term Frequency) or TF-IDF (Term Frequency-Inverse Document Frequency)
• Google used Map-Reduce for doing this and related tasks!
• Input: “Web Page URL”, Web Page itself ( .html )
• Required Output: Host-Name, Term-Vector
MR Examples* #4: Term-Vector per Host
• The map function emits a hostname, term vector pair for each input document
(where the hostname is extracted from the URL of the document).
• The reduce function receives (host-name, term vectors of all documents under that host); it combines all the term vectors together, discarding infrequent terms, and then emits a final (host-name, term vector) pair
Input: document-URL, web-document (say .html)

map(url, document)
  //compute term-vector locally
  host = host_name(url)
  write(host, term-vector)

reduce(host-name, list-term-vectors)
  combined-term-vector (ctv) = []
  for each vector in list-term-vectors
    merge(ctv, vector)
  ctv = trim(sort(ctv))  //keep only frequent terms
  write(host-name, ctv)
MR Examples* #5: “Compute Inverted Index”
• What is “Inverted Index”?
• A term popular in “Information Retrieval”; it is a data structure that maps a “term” to the document(s) containing it.
• It is used for finding the documents that contain a given term!
• Input: “Web Page URL”, Web Page itself ( .html )
• Required Output: term, list of document-URLs
MR Examples* #5: “Compute Inverted Index”
• The map function parses each document and emits a sequence of (word, document ID) pairs.
• The reduce function outputs the (possibly sorted) document list for each "term"
map(doc-url, document)
  words = {}  //set of words
  for each word in document
    words.add(word)
  for each word in words
    write(word, doc-url)

reduce(word, list-doc-urls)
  write(word, list-doc-urls)
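A toy simulation of the pseudocode above (document names and contents invented); a Python `set` plays the role of the per-document word set:

```python
from collections import defaultdict

docs = {"d1.html": "big data with map reduce", "d2.html": "map reduce basics"}

def map_fn(doc_url, document):
    for word in set(document.split()):   # emit each word once per document
        yield word, doc_url

shuffled = defaultdict(list)
for doc in docs.items():
    for k, v in map_fn(*doc):
        shuffled[k].append(v)

def reduce_fn(word, doc_urls):
    yield word, sorted(doc_urls)         # the (sorted) posting list

index = dict(kv for k, vs in shuffled.items() for kv in reduce_fn(k, vs))
print(index["map"])      # ['d1.html', 'd2.html']
print(index["basics"])   # ['d2.html']
```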
MR Examples* #6: Distributed Sort
Suppose a data file needs to be sorted on emp-no

Map:
  output (emp-no, row)

The reduce function shall receive the pairs sorted on emp-no. It can just output them as such!
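A toy simulation of this identity-map, identity-reduce job (rows invented); Python's `sorted` stands in for the framework's shuffle-and-sort phase:

```python
# hypothetical rows keyed by emp-no
rows = [(42, "row-42"), (7, "row-7"), (19, "row-19")]

def map_fn(emp_no, row):
    yield emp_no, row                    # identity map: key = sort attribute

map_out = [kv for r in rows for kv in map_fn(*r)]
sorted_pairs = sorted(map_out)           # stands in for shuffle-and-sort

def reduce_fn(emp_no, rows_):
    for row in rows_:                    # identity reduce: output as-is, now in key order
        yield emp_no, row

result = [kv for k, v in sorted_pairs for kv in reduce_fn(k, [v])]
print(result)   # [(7, 'row-7'), (19, 'row-19'), (42, 'row-42')]
```

In a real job with R reducers, range-partitioning the keys makes each reducer's output a sorted run, and the runs concatenate into a globally sorted result.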
Programming Map-Reduce in Python
• A library is available: https://mrjob.readthedocs.io/en/latest/
• Guide/Documentation is available at
https://mrjob.readthedocs.io/en/latest/guides.html
• How do you create MR programs here?
– Create a class that extends the MRJob class from the library
– Override at least the “mapper” and “reducer” methods
– Here is an example!
– Do you get what it is doing?
MR Job (mrjob python library) functions
• Other methods that the MRJob class provides, and that you can override
mapper_init(self) #executed once for every mapper, and before mapper
mapper(self, key, value) #map function
mapper_final(self) #executed once for every mapper, and after mapper
reducer(self, key, values) #reduce function
combiner(self, key, values) #combine function
reducer_init, reducer_final, combiner_init, combiner_final
https://mrjob.readthedocs.io/en/latest/job.html
“Google Colab” for practice
• Here is a sample MR program:
https://colab.research.google.com/drive/1ma8h2dZMzUUobXsJ-_gTNvyaj25n6boY
• You can copy this into your account and experiment with it!
• Here are some datasets that are used in my exercises! I may add more.
https://drive.google.com/drive/folders/1Q0sy0NlD2nkjmzxuYURQoFt5XRZpcScs
Further reading
• Chapter 2 at http://mmds.org
• Articles “The Google file system” [1] and “MapReduce: Simplified data processing on large clusters” [2]
• (Book) Radtka, Zachary, and Donald Miner. Hadoop with Python. O'Reilly Media, 2015.
References
[1] Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. “The Google file system.” Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, 2003.
[2] Dean, Jeffrey, and Sanjay Ghemawat. “MapReduce: Simplified data processing on large clusters.” OSDI, 2004.
https://www.usenix.org/legacy/event/osdi04/tech/full_papers/dean/dean_html/
[3] White, Tom. “Hadoop: The Definitive Guide”, 4th ed., O’Reilly Media, 2015.