2. MapReduce
- A stand-alone (single) system works continuously using its local file system.
- To increase storage capacity and processing capacity, organizations started creating clusters.
- A cluster is a group of stand-alone machines (c1, c2, c3, c4) that may be distributed across a geographical area.
- Multiple clusters can be interconnected with each other; because of this, multiple file systems can be used together, and this arrangement is called a Distributed File System.
Servers
• All the nodes can communicate with each other over the Internet or through networking devices.
Distributed File System
• A Distributed File System (DFS) is a file system whose files are distributed across multiple file servers and locations.
• It permits programs to access and store remote data in the same way as local files.
• It also permits users to access files from any system.
• It allows network users to share information and files in a regulated and permitted manner; the servers, however, retain complete control over the data and provide access control to the users.
DFS has three services, and these are as follows:
Storage Service
True File Service
Name Service
DFS has two components in its services, and these are as follows:
1. Location Transparency
2. Redundancy
Location Transparency: The system hides the location of file storage. If location transparency is maintained, you do not need to know or care whether a file is stored on your machine or on another server. It is achieved via the namespace component.
Redundancy: Storing multiple copies of the same data (files or blocks) across different nodes in the system to ensure reliability and fault tolerance. It is achieved via the file replication component.
Uses of DFS
1. Remote information sharing
2. User mobility
3. Diskless workstations
4. Availability
Physical Organization of Computer Nodes
• Compute nodes are stored on racks, perhaps 8–64 on a rack.
• The nodes on a single rack are connected by a network, typically gigabit Ethernet (a networking technology that allows them to send data to each other quickly).
• There can be many racks of compute nodes, and racks are connected by another level of network
or a switch (like a traffic controller for data).
• There are two levels of connection:
1. Intra rack connection
• This is the network connection between compute nodes within the same rack.
• Nodes are typically connected through a local Ethernet inside the rack.
• Uses Gigabit Ethernet or 10-Gigabit Ethernet.
• Low latency, because the physical distance is short and the network is local.
• Used for frequent, high-speed communication among nodes working closely together on a task.
• Example: If 32 nodes are on one rack and need to exchange data frequently during a parallel
computation, they'll use the intra-rack connection.
2. Inter rack connection
Connects nodes on different racks to each other.
Racks are connected via a higher-level switch or a high-speed backbone network.
Link speed is generally higher (e.g., 40-Gigabit or 100-Gigabit Ethernet, or InfiniBand), but the links have to handle much more traffic.
Higher latency compared to intra-rack because data may pass through more switches and longer
cables.
Used when nodes on different racks need to communicate or exchange results, especially in
distributed computations.
Example: A job is split across nodes on 5 racks, and they need to synchronize results — this traffic
will go over the inter-rack connection.
A machine learning job uses nodes on Rack 1 and Rack 2. When nodes need to combine their
results, they use inter-rack connections.
The figure shows the architecture of a large-scale computing system. There can be many racks, and each rack can have many compute nodes (computers).
It is a fact of life that components fail. The more components, such as compute nodes and interconnection networks, a system has, the more often something in the system will not be working at any given time.
For systems such as the one in the figure, two common types of failures are:
• A single node failing (for example, its disk crashes).
• An entire rack failing (for example, the network that connects its
nodes stops working).
Some important calculations take minutes or even hours on thousands
of compute nodes.
If we had to abort and restart the computation every time one
component failed, then the computation might never complete
successfully.
The solution to this problem takes two forms:
1. Files must be stored redundantly. If we did not
duplicate the file at several compute nodes, then if
one node failed, all its files would be unavailable
until the node is replaced. If we did not back up the
files at all, and the disk crashes, the files would be
lost forever.
2. Computations must be divided into tasks, such
that if any one task fails to execute to completion,
it can be restarted without affecting other tasks.
This strategy is followed by the MapReduce programming system.
Large Scale File System Organization
To exploit cluster computing, files must look and behave somewhat differently from
the conventional file systems found on single computers. This new file system, often
called a distributed file system or DFS (although this term has had other meanings in
the past), is typically used as follows.
• Files can be enormous, possibly a terabyte in size. If you have only small files,
there is no point using a DFS for them.
• Files are rarely updated. Rather, they are read as data for some calculation, and
possibly additional data is appended to files from time to time. For example, an
airline reservation system would not be suitable for a DFS, even if the data were
very large, because the data is changed so frequently.
• Files are divided into chunks, which are typically 64 megabytes in size.
• Chunks are replicated, perhaps three times, at three different compute nodes.
• Moreover, the nodes holding copies of one chunk should be located on different racks, so
we don’t lose all copies due to a rack failure.
• Normally, both the chunk size and the degree of replication can be decided by the user. To
find the chunks of a file, there is another small file called the master node or name node for
that file.
• The master node is itself replicated, and a directory for the file system as a whole knows
where to find its copies.
• The directory itself can be replicated, and all participants using the DFS know where the
directory copies are.
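
To make the chunking and rack-aware replica placement concrete, here is a minimal Python sketch. The 64 MB chunk size and three-way replication come from the text above; the rack layout, node names, and round-robin placement policy are illustrative assumptions, not the actual GFS/HDFS placement algorithm.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical DFS chunk size
REPLICATION = 3                # typical degree of replication

# Hypothetical cluster layout: rack name -> nodes on that rack.
RACKS = {
    "rack1": ["n1", "n2", "n3"],
    "rack2": ["n4", "n5", "n6"],
    "rack3": ["n7", "n8", "n9"],
}

def split_into_chunks(file_size):
    """Number of fixed-size chunks needed for a file of file_size bytes."""
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE

def place_replicas(chunk_id):
    """Place each replica of a chunk on a different rack, so that a
    whole-rack failure cannot destroy every copy of the chunk."""
    rack_names = list(RACKS)
    placement = []
    for i in range(REPLICATION):
        rack = rack_names[(chunk_id + i) % len(rack_names)]
        node = RACKS[rack][chunk_id % len(RACKS[rack])]
        placement.append((rack, node))
    return placement

if __name__ == "__main__":
    print("1 TB file ->", split_into_chunks(10**12), "chunks")
    print("chunk 0 replicas:", place_replicas(0))
```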
Distributed File Systems used:
1. Google File System(GFS)
2. Hadoop Distributed File System
3. CloudStore
Map Process
Suppose wants to find out the word “Refund” occur how
many times in the feedback.txt
1. Client sends this request to the JobTracker.
2. The JobTracker asks the NameNode to return the DataNodes that contain the Feedback.txt file.
3. The JobTracker gives this data to TaskTracker for
processing.
4. The TaskTracker starts a Map task and monitors the task's progress.
5. The TaskTracker provides heartbeats and task status back
to the JobTracker.
6. As each Map task completes, each node stores the result of
its local computation i.e. “intermediate data” in temporary
local storage.
7. This intermediate data is sent as input to the reduce task
over the network.
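
The Map side of this example can be sketched in Python as follows. This is a simplified single-machine illustration rather than real Hadoop code; the function name map_refund and the sample feedback lines are assumptions.

```python
def map_refund(line):
    """Map task: emit an intermediate ("refund", 1) pair for every
    occurrence of the word in one line of Feedback.txt."""
    for word in line.split():
        if word.strip('.,!?').lower() == "refund":
            yield ("refund", 1)

# Each TaskTracker would run this over its local split of the file and
# keep the intermediate pairs in temporary local storage.
split = ["I want a refund now", "No refund was issued", "Great service"]
intermediate = [pair for line in split for pair in map_refund(line)]
print(intermediate)  # [('refund', 1), ('refund', 1)]
```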
The Reduce Process
Following are the steps:
1. The Reduce task collects all of the intermediate data from the Map tasks.
2. It then starts the final computation step. In this example, the final step is adding up the occurrences of the word "Refund".
3. This output is written to the Results.txt file.
4. The file is saved on HDFS.
5. The client can then read from that file.
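
A matching sketch of the Reduce side, again on a single machine. The local sort stands in for the framework's shuffle/sort, and printing the final line stands in for writing Results.txt to HDFS.

```python
from itertools import groupby
from operator import itemgetter

def reduce_refund(intermediate):
    """Reduce task: sum the counts collected from all Map tasks."""
    intermediate.sort(key=itemgetter(0))   # stand-in for shuffle/sort
    for key, group in groupby(intermediate, key=itemgetter(0)):
        yield (key, sum(count for _, count in group))

intermediate = [("refund", 1), ("refund", 1)]
for key, total in reduce_refund(intermediate):
    print(f"{key}\t{total}")   # this line would be written to Results.txt
```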
MapReduce
• MapReduce works on divide-and-conquer principle. Huge input data are split
into smaller chunks of 64 MB that are processed by mappers in parallel.
Execution of map is co-located with data chunk.
• The framework then shuffles/sorts the results (intermediate data) of maps and
sends them as input to reducers. Programmers have to implement mappers
and reducers by extending the base classes provided by Hadoop to solve a
specific problem.
• Each of the Map tasks is given one or more chunks from a DFS. These Map
tasks turn the chunk into a sequence of key–value pairs. The way key–value
pairs are produced from the input data is determined by the code written by
the user for the Map function.
• Let us assume mapper takes (k1, v1) as input in the form of (key, value) pair.
Let (k2, v2) be the transformed key–value pair by mapper.
• (k1, v1) → Map → (k2, v2)→ Sort→ (k2,(v2, v2, …, v2)) → Reduce → (k3, v3)
• The key–value pairs from each Map task are collected by a master
controller and sorted by key. It combines each unique key with all its
values, that is (k2, (v2, v2, …, v2)). The key–value combinations are delivered to the Reduce tasks, so that all key–value pairs with the same key wind up at the same Reduce task.
• The way the values are combined is determined by the code written by the programmer for the Reduce function. Reduce tasks work on one key at a time. The Reducer again translates them into another key–value pair (k3, v3), which is the result.
• Mapper is the mandatory part of a Hadoop job and can produce zero
or more key–value pairs (k2, v2). Reducer is the optional part of a
Hadoop job and can produce zero or more key–value pairs (k3, v3).
The driver program for initializing and controlling the MapReduce execution is also written by the user.
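
The whole (k1, v1) → Map → (k2, v2) → Sort → Reduce → (k3, v3) pipeline can be simulated in a few lines of Python. This sketch runs everything on one machine and uses word count as the mapper/reducer pair; the function names are illustrative, not part of any Hadoop API.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Simulate (k1,v1) -> Map -> (k2,v2) -> Sort -> Reduce -> (k3,v3)."""
    groups = defaultdict(list)
    for k1, v1 in records:                 # Map: zero or more (k2, v2) each
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    # Sort: group each k2 with its value list (v2, v2, ..., v2),
    # then Reduce each key independently into (k3, v3) pairs.
    return [kv for k2 in sorted(groups) for kv in reducer(k2, groups[k2])]

def wc_map(_, line):                       # mapper: words to (word, 1)
    for word in line.split():
        yield (word, 1)

def wc_reduce(word, counts):               # reducer: sum the counts
    yield (word, sum(counts))

print(run_mapreduce([(0, "a b a"), (1, "b c")], wc_map, wc_reduce))
# [('a', 2), ('b', 2), ('c', 1)]
```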
2 Responsibilities of MapReduce Framework
The framework takes care of scheduling, monitoring and rescheduling
of failed tasks.
1. Provides overall coordination of execution.
2. Selects nodes for running mappers.
3. Starts and monitors mapper’s execution.
4. Sorts and shuffles output of mappers.
5. Chooses locations for reducer’s execution.
6. Delivers the output of mapper to reducer node.
7. Starts and monitors reducer’s execution
Details of MapReduce Execution
1 Run-Time Coordination in MapReduce
MapReduce handles the distributed code execution on the cluster transparently once the user submits the "jar" file.
MapReduce takes care of both scheduling and synchronization.
MapReduce has to ensure that all the jobs submitted by all the users get a fairly equal share of the cluster's execution.
MapReduce implements scheduling optimization by speculative
execution explained as follows: If a machine executes the task very
slowly, the JobTracker assigns the additional instance of the same task
to another node using a different TaskTracker.
MapReduce Execution Pipeline
Main components of MapReduce execution pipeline are as follows:
1. Driver: Driver is the main program that initializes a MapReduce job
and gets back the status of job execution. For each job, it defines
the configuration and specification of all its components (Mapper,
Reducer, Combiner and Custom partitioner) including the
input−output formats.
2. Input data: Input data can reside in HDFS or any other storage such
as HBase. InputFormat defines how to read the input and define the
split. Based on the split, InputFormat defines the number of map
tasks in the mapping phase. The job Driver invokes the InputFormat directly to decide the number (InputSplits) and the locations of the map task executions.
3. Mapper: For each map task, a new instance of mapper is
instantiated. As said earlier, individual mappers do not communicate
with each other. The partition of the key space produced by the mapper, that is, the mapper's intermediate data, is given as input to the reducers.
4. Shuffle and sort: Shuffling is the process of moving map outputs to
reducers. Shuffle/sort is triggered when a mapper completes its job. As soon as all the map tasks are completed, the sort process groups the key–value pairs to form the lists of values. The grouping is performed
regardless of what Map and Reduce tasks do.
5. Reducer: The reducer executes the user-defined code. The reducer's reduce() method receives a key along with an iterator over all the values associated with that key, and produces the output key–value pairs. RecordWriter is used for storing data in a location specified by OutputFormat. Output can come from the reducer, or from the mapper if no reducer is present.
6. Distributed cache: Distributed cache is a resource used for sharing
data globally by all nodes in the cluster. This can be a shared library that
each task can access. The user’s code for driver, map and reduce along
with the configuring parameters can be packaged into a single jar file
and placed in this cache.
How MapReduce Copes with Node Failures
1. Task re-execution
2. Speculative Execution (duplicate task)
3. Heartbeat Mechanism
4. Data Replication in HDFS
Types of Node Failures
1. Mapper node fails
2. Reducer node fails
3. Data node fails
4. TaskTracker fails
Algorithms using MapReduce
1 Matrix Multiplication by MapReduce
Let A and B be the two matrices to be multiplied and let the result be matrix C.
Matrix A has dimensions L × M and matrix B has dimensions M × N.
In the Map phase:
1. For each element (i,j) of A, emit ((i,k), A[i,j]) for each k in 1, …, N.
2. For each element (j,k) of B, emit ((i,k), B[j,k]) for each i in 1, …, L.
In the Reduce phase, emit:
• key = (i,k)
• value = Σ_j (A[i,j] * B[j,k])
• One reducer is used per output cell.
• Each reducer computes Σ_j (A[i,j] * B[j,k]).
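
A single-machine sketch of this algorithm follows. One practical detail the pseudocode above glosses over: each emitted value must be tagged with its source matrix and the shared index j, or the reducer for cell (i,k) cannot pair A[i,j] with the matching B[j,k]. All names here are illustrative.

```python
from collections import defaultdict

def matmul_mapreduce(A, B, L, M, N):
    """One-pass MapReduce matrix multiplication C = A x B, simulated
    locally. Values carry a (matrix tag, j) label for pairing."""
    groups = defaultdict(list)
    for i in range(L):                     # Map over A: A[i][j] is needed
        for j in range(M):                 # by every output cell (i, k)
            for k in range(N):
                groups[(i, k)].append(("A", j, A[i][j]))
    for j in range(M):                     # Map over B: B[j][k] is needed
        for k in range(N):                 # by every output cell (i, k)
            for i in range(L):
                groups[(i, k)].append(("B", j, B[j][k]))
    C = [[0] * N for _ in range(L)]
    for (i, k), values in groups.items():  # Reduce: one per output cell
        a = {j: v for tag, j, v in values if tag == "A"}
        b = {j: v for tag, j, v in values if tag == "B"}
        C[i][k] = sum(a[j] * b[j] for j in a)
    return C

print(matmul_mapreduce([[1, 2]], [[3], [4]], L=1, M=2, N=1))  # [[11]]
```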
2 MapReduce and Relational Operators
MapReduce algorithms can be used for processing relational data:
Shuffle/Sort automatically handles group by sorting and partitioning in MapReduce.
The following operations are performed either in mapper or in reducer:
• Selection
• Projection
• Union, intersection and difference
• Natural join
• Grouping and aggregation
Multiple strategies such as Reduce-side join, Map-side join and In-memory join
(Striped variant, Memcached variant) are used for relational joins.
Multiple MapReduce jobs are required for complex operations.
For example: Top 10 URLs in terms of average time spent.
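
As an illustration of the reduce-side join mentioned above, here is a minimal single-machine sketch of a natural join on a shared key: mappers tag each tuple with its relation of origin, and the reducer for each key pairs the R tuples with the S tuples. The data and function names are assumptions.

```python
from collections import defaultdict

def reduce_side_join(R, S):
    """Natural join of R and S on their first component."""
    groups = defaultdict(lambda: {"R": [], "S": []})
    for key, rest in R:                    # map side: tag tuples from R
        groups[key]["R"].append(rest)
    for key, rest in S:                    # map side: tag tuples from S
        groups[key]["S"].append(rest)
    out = []
    for key, g in groups.items():          # reduce side: pair per key
        out += [(key, r, s) for r in g["R"] for s in g["S"]]
    return out

R = [(1, "alice"), (2, "bob")]
S = [(1, "pune"), (1, "mumbai")]
print(reduce_side_join(R, S))
# [(1, 'alice', 'pune'), (1, 'alice', 'mumbai')]
```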
3 Computing Selections by MapReduce
Selections really may not need both the Map and Reduce tasks. They
can be done mostly in the map portion alone.
1. The Map Function: For each tuple t in R, test if it satisfies condition
C. If so, produce the key– value pair (t, t). That is, both the key and
value are t.
2. The Reduce Function: The Reduce function is the identity. It simply
passes each key–value pair to the output.
Note that the output is not exactly a relation, because it has key–value
pairs. However, a relation can be obtained by using only the value
components (or only the key components) of the output.
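
A minimal Python sketch of this selection pattern on a single machine; the relation R and the condition C are illustrative.

```python
def selection_map(t, condition):
    """Map: produce (t, t) only if tuple t satisfies condition C."""
    if condition(t):
        yield (t, t)

def selection_reduce(key, values):
    """Reduce: the identity; pass each key-value pair to the output."""
    for v in values:
        yield (key, v)

R = [("a", 5), ("b", 12), ("c", 7)]
pairs = [kv for t in R for kv in selection_map(t, lambda t: t[1] > 6)]
out = [kv for k, v in pairs for kv in selection_reduce(k, [v])]
print(out)  # [(('b', 12), ('b', 12)), (('c', 7), ('c', 7))]
```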
4 Computing Projections by MapReduce
Projection is also a simple operation in MapReduce. Here, the Reduce
function is used to eliminate duplicates since projection may cause the same
tuple to appear several times.
1. The Map Function: For each tuple t in R, construct a tuple ts by
eliminating from t those components whose attributes are not in S.
Output the key–value pair (ts, ts).
2. The Reduce Function: For each key ts produced by any of the Map tasks, there will be one or more key–value pairs (ts, ts). The Reduce function turns (ts, [ts, ts, ..., ts]) into (ts, ts), so it produces exactly one pair (ts, ts) for this key ts.
The Reduce operation is duplicate elimination. This operation is associative
and commutative, so a combiner associated with each Map task can
eliminate whatever duplicates are produced locally.
However, the Reduce tasks are still needed to eliminate two identical tuples
coming from different Map tasks.
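
A single-machine sketch of projection with duplicate elimination; here S is given as a tuple of column positions, which is an assumption of this illustration.

```python
from collections import defaultdict

def projection_map(t, S):
    """Map: keep only the components of t whose positions are in S."""
    ts = tuple(t[i] for i in S)
    yield (ts, ts)

def projection_reduce(ts, values):
    """Reduce: turn (ts, [ts, ..., ts]) into a single (ts, ts) pair."""
    yield (ts, ts)

R = [(1, "x", 10), (1, "y", 10), (2, "z", 20)]
groups = defaultdict(list)                 # stand-in for shuffle/sort
for t in R:
    for k, v in projection_map(t, S=(0, 2)):
        groups[k].append(v)
print([kv for k in groups for kv in projection_reduce(k, groups[k])])
# [((1, 10), (1, 10)), ((2, 20), (2, 20))]
```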
5 Union, Intersection and Difference by MapReduce
1 Union
For union operation, both the relations R and S need to have the same
schema.
Map tasks will be assigned chunks from either relation R or S.
The Map tasks do not really do anything except pass their input tuples as key–value pairs to the Reduce tasks; the mappers are fed all tuples of the two sets to be united.
The Reducer is used to eliminate duplicates.
1. The Map Function: Turn each input tuple t into a key–value pair (t, t).
2. The Reduce Function: Associated with each key t there will be either one
or two values. Produce output (t, t) in either case.
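
A compact local sketch of the union; feeding the concatenation of R and S to the mapper stands in for assigning chunks of both relations to Map tasks.

```python
from collections import defaultdict

def union_map(t):
    yield (t, t)                           # pass each tuple through as (t, t)

def union_reduce(t, values):
    yield (t, t)                           # one or two values; emit once

R, S = [("a",), ("b",)], [("b",), ("c",)]
groups = defaultdict(list)                 # stand-in for shuffle/sort
for t in R + S:
    for k, v in union_map(t):
        groups[k].append(v)
print([kv for k in groups for kv in union_reduce(k, groups[k])])
# [(('a',), ('a',)), (('b',), ('b',)), (('c',), ('c',))]
```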
2 Intersection
Mappers are fed all tuples of both relations R and S.
The Reducer emits only tuples that occur twice. This is possible only if both sets contain the tuple, because each tuple includes a primary key and can therefore occur at most once in each relation.
To compute the intersection, we can use the same Map function. However, the Reduce function must produce a tuple only if both relations have the tuple.
If the key t has a list of two values [t, t] associated with it, then the Reduce task
for t should produce (t, t).
However, if the value-list associated with key t is just [t], then one of R and S is
missing t, so we do not want to produce a tuple for the intersection.
1. The Map function: Turn each tuple t into a key–value pair (t, t).
2. The Reduce function: If key t has value list [t, t], then produce (t, t).
Otherwise, produce nothing.
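
The same pattern with the intersection's Reduce rule, simulated locally:

```python
from collections import defaultdict

def intersection_reduce(t, values):
    """Reduce: emit (t, t) only when the value list is [t, t]."""
    if len(values) == 2:                   # t appeared in both R and S
        yield (t, t)

R, S = [("a",), ("b",)], [("b",), ("c",)]
groups = defaultdict(list)
for t in R + S:                            # map: each tuple becomes (t, t)
    groups[t].append(t)
print([kv for k in groups for kv in intersection_reduce(k, groups[k])])
# [(('b',), ('b',))]
```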
3 Difference
In the difference R − S, a tuple t can appear in the output only if it is in relation R but not in relation S.
The Map function can pass tuples from R and S through, but must
inform the Reduce function whether the tuple came from R or S.
Using the relation as the value associated with the key t, the two
functions are specified as follows:
1. The Map function: For a tuple t in R, produce key–value pair (t, R),
and for a tuple t in S, produce key–value pair (t, S). Note that the
intent is that the value is the name of R or S and not the entire
relation.
2. The Reduce function: For each key t, if the associated value list is
[R], then produce (t, t). Otherwise, produce nothing.
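
A minimal local sketch of the difference, where, as described above, the value passed through is the name of the relation rather than the tuple itself:

```python
from collections import defaultdict

def difference_reduce(t, relations):
    """Reduce: emit (t, t) only when the value list is exactly [R]."""
    if relations == ["R"]:                 # t is in R but not in S
        yield (t, t)

R, S = [("a",), ("b",)], [("b",), ("c",)]
groups = defaultdict(list)
for t in R:                                # map: tag tuples with "R"
    groups[t].append("R")
for t in S:                                # map: tag tuples with "S"
    groups[t].append("S")
print([kv for k in groups for kv in difference_reduce(k, groups[k])])
# [(('a',), ('a',))]
```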