
The MapReduce framework

Dr. Garrett Dancik


Map Reduce background
• Based on a Google paper:
  https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
• MapReduce provides the original processing engine for Hadoop, allowing for parallel processing of data in a cluster.
• Hadoop is built on Java, but you do not need to write Java programs to use MapReduce, though Java programs will be faster.
Map Reduce Overview and Word Count Example

Image source: https://www.edupristine.com/blog/hadoop-mapreduce-framework


Steps of a MapReduce Job
1. Hadoop divides the data into input splits and creates one map task for each split.
2. Each mapper reads each record (each line) of its input split and outputs a key-value pair.
3. Output from the mappers is transferred to the reducers as input, such that the input to each reducer is sorted by key.
4. The reducer processes the data for each key and outputs the result. The reducer is optional.
MapReduce data flow with a single reduce task

Here, mappers process data on 3 nodes in parallel. Results are sent across the cluster to one or more reducers.

An optional combiner function can be specified to process the output from each map task before it is sent to the reducer.

Image from White, T. (2015). Hadoop: The definitive guide. Sebastopol, CA: O'Reilly.
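For word count, the combiner can reuse the reducer logic, since summing counts is associative and commutative. A minimal sketch of the idea (emit is a stand-in for handing a key-value pair to the framework, not a real API call):

  def combiner(key, values):
      # Runs on the map node: pre-sum the local counts for each word,
      # e.g. ("that", (1, 1)) -> ("that", 2), before data crosses the network
      emit(key, sum(values))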
Pseudocode for word count mapper and reducer

Example input line (value): "well that is that"

Mapper output (key, value):
  well, 1
  that, 1
  is, 1
  that, 1

Reducer input (key, values) → reducer output (key, value):
  is, (1)      → is, 1
  that, (1,1)  → that, 2
  well, (1)    → well, 1
Dean, J., Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters, (2004).
OSDI'04: Sixth Symposium on Operating System Design and Implementation, pgs 137-150.
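The mapper and reducer behind this example can be sketched in Python-style pseudocode (a minimal sketch; as above, emit stands in for writing a key-value pair to the framework and is not a real API call):

  def mapper(key, value):
      # value is one line of the input split, e.g. "well that is that"
      for word in value.split():
          emit(word, 1)            # e.g. ("well", 1), ("that", 1), ...

  def reducer(key, values):
      # values holds all counts for one word, e.g. ("that", (1, 1))
      emit(key, sum(values))       # e.g. ("that", 2)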
Some Additional MapReduce Applications

Dean, J., Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters, (2004).
OSDI'04: Sixth Symposium on Operating System Design and Implementation, pgs 137-150.
MapReduce Streaming
• MapReduce jobs were originally written in Java
• MapReduce in Java is the most efficient
• For the word count example, see:
  https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
• We will use Hadoop MapReduce Streaming
• Uses Unix/Linux input and output streams as the interface with Hadoop
• Map input data is passed over standard input to the map function
• Map output is a tab-separated key-value pair…
• …which is passed to the reducer over standard input
• The reduce function reads lines from standard input, sorted by key
• Any language can be used (we will use Python; see the sketch below)
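As an illustration, a streaming word-count mapper and reducer in Python might look like the following (a sketch; the file names mapper.py and reducer.py match the command on the next slide):

mapper.py:
  import sys

  # Read lines from standard input; emit one tab-separated
  # "word<TAB>1" pair per word on standard output
  for line in sys.stdin:
      for word in line.strip().split():
          print(word + "\t1")

reducer.py:
  import sys

  # Input arrives sorted by key, so all counts for the same word
  # are on adjacent lines; sum them and emit "word<TAB>count"
  current_word, current_count = None, 0
  for line in sys.stdin:
      word, count = line.rstrip("\n").split("\t", 1)
      if word == current_word:
          current_count += int(count)
      else:
          if current_word is not None:
              print(current_word + "\t" + str(current_count))
          current_word, current_count = word, int(count)
  if current_word is not None:
      print(current_word + "\t" + str(current_count))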
Running a Map Reduce Streaming Job
hadoop jar /usr/jars/hadoop-streaming-2.6.0-cdh5.7.0.jar \
-mapper "python3.4 $PWD/mapper.py" \
-reducer "python3.4 $PWD/reducer.py" \
-input "inputFiles" \
-output "outputDirectory"
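Because streaming uses ordinary standard input and output, the same job can be tested locally with a Unix pipeline that mimics the map → sort → reduce flow (assuming the mapper.py and reducer.py sketched above):

  cat inputFiles/* | python3.4 mapper.py | sort | python3.4 reducer.py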

More background and additional options:
https://hadoop.apache.org/docs/r1.2.1/streaming.html

• You may monitor jobs from localhost:8088 (for some links, you will need to replace quickstart.cloudera with localhost). Some links also require mapping of additional ports.

• The mapred terminal command may also be useful:
  • mapred job -list (list jobs)
  • mapred job -kill jobID (kill the job with the given jobID)
