Big Data Management
Submitted to:
Dr. Ankush Maind
Assistant Professor
LMTSM
We use the MapReduce function to calculate the total number of flights cancelled at each
airport over June 2003-2004, representing the data as key/value pairs.
MapReduce is a processing technique and programming model for distributed computing,
commonly implemented in Java. The MapReduce algorithm contains two important tasks, namely
Map and Reduce. Map takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key/value pairs). The Reduce task then takes
the output from a map as its input and combines those data tuples into a smaller set of tuples.
As the name MapReduce implies, the reduce task is always performed after the map job.
The key represents the airport name/code, and the value represents the number of flights
cancelled in that month.
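For instance, with three hypothetical cancelled-flight records, the map step might emit
("ATL", 1), ("ATL", 1) and ("ORD", 1); the shuffle step groups these into ("ATL", [1, 1]) and
("ORD", [1]); and the reduce step sums each group to give ("ATL", 2) and ("ORD", 1).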
The input data is taken from the Excel/CSV file attached to this mail, and a snapshot of the
output produced after applying the MapReduce function is shown below.
The Algorithm:
The MapReduce paradigm is generally based on sending the computation to where the data
resides. A MapReduce program executes in three stages, namely the map stage, the shuffle
stage, and the reduce stage.
Map stage − The map or mapper's job is to process the input data. Generally, the input data is in
the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The
input file is passed to the mapper function line by line. The mapper processes the data and
creates several small chunks of data.
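As a concrete illustration, here is a minimal Mapper sketch for our cancellation count. The
column layout assumed below (airport code in the first field, a 0/1 cancellation flag in the
second) is hypothetical; the indices would need to be adjusted to match the actual attached file.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Minimal Mapper sketch: emits (airportCode, cancelledFlag) for every input line.
// Assumed CSV layout: airport code in column 0, 0/1 cancellation flag in column 1.
public class CancellationMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text airport = new Text();
    private final IntWritable cancelled = new IntWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 2) {
            return; // skip malformed lines
        }
        try {
            airport.set(fields[0].trim());                      // e.g. "ATL"
            cancelled.set(Integer.parseInt(fields[1].trim()));  // 1 if cancelled, else 0
            context.write(airport, cancelled);
        } catch (NumberFormatException e) {
            // header row or bad record: ignore
        }
    }
}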
Reduce stage − This stage is the combination of the shuffle stage and the reduce stage proper.
The Reducer's job is to process the data that comes from the mapper. After processing, it
produces a new set of output, which is stored in HDFS.
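A matching Reducer sketch simply sums the 0/1 flags emitted by the mapper, producing the total
number of cancelled flights per airport:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Minimal Reducer sketch: sums the flags grouped under each airport code.
public class CancellationReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text airport, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(airport, total); // (airport code, total cancellations)
    }
}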
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate servers
in the cluster.
The framework manages all the details of data-passing such as issuing tasks, verifying task
completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks, which reduces the network
traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
MapReduce Algorithm
Inputs and Outputs (Java Perspective)
The MapReduce framework operates on <key, value> pairs, that is, the framework views the
input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the
output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to
implement the Writable interface. Additionally, the key classes have to implement the
WritableComparable interface to facilitate sorting by the framework. The input and output types
of a MapReduce job are: (Input) <k1, v1> → map → <k2, v2> → reduce → <k3, v3> (Output).
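A minimal driver sketch, assuming the two classes sketched earlier, shows how these types line
up: <k1, v1> is <LongWritable, Text> (line offset and line), while <k2, v2> and <k3, v3> are both
<Text, IntWritable> (airport code and count).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal driver sketch wiring the mapper and reducer together.
public class CancellationCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "flight cancellation count");
        job.setJarByClass(CancellationCount.class);
        job.setMapperClass(CancellationMapper.class);
        job.setCombinerClass(CancellationReducer.class); // pre-aggregate on each node
        job.setReducerClass(CancellationReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input CSV in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Reusing the reducer as a combiner is safe here because summing counts is associative and
commutative, so partial sums computed on each node yield the same final totals.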
Terminology
PayLoad − Applications implement the Map and the Reduce functions, and form the core of the
job.
Mapper − Mapper maps the input key/value pairs to a set of intermediate key/value pairs.
NameNode − Node that manages the Hadoop Distributed File System (HDFS).
DataNode − Node where the data resides in advance, before any processing takes place.
MasterNode − Node where the JobTracker runs and which accepts job requests from clients.
JobTracker − Schedules jobs and tracks the assigned jobs on the TaskTrackers.
★ OUTPUT -
As we can see, Atlanta (ATL) airport has the maximum number of flights cancelled
in a year.
Thus, the authorities can use this information, together with other factors, to reduce the
inefficiencies that lead to flight cancellations and to offer proper reconsideration or
compensation/refunds, so that in the future passengers are not subjected to these problems.