2. MapReduce
- A stand-alone (single) system works continuously using its local file system.
- To increase storage capacity and processing capacity, organizations started creating clusters.
- A cluster is a group of stand-alone machines (c1, c2, c3, c4) that may be distributed across a geographical area.
- Multiple clusters can be interconnected with each other; because of this, multiple file systems can be used together, and this arrangement is called a Distributed File System.
Servers
• All the nodes can communicate with each other over the Internet or through networking devices.
Distributed File System
• A Distributed File System (DFS) is a file system whose files are distributed across multiple file servers and locations.
• It permits programs to access and store remote data in the same way as local files.
• It also permits users to access files from any system.
• It allows network users to share information and files in a regulated and permitted manner; the servers, however, retain complete control over the data and provide access control to the users.
DFS has three services, and these are as follows:
Storage Service
True File Service
Name Service
DFS has two components in its services, and these are as follows:
1. Location Transparency
2. Redundancy
Location Transparency: The system hides the location of file storage. If location transparency is maintained, you do not need to know or care whether a file is stored on your machine or on another server. It is achieved via the namespace component.
Redundancy: Storing multiple copies of the same data (files or blocks) across different nodes in the system to ensure reliability and fault tolerance. It is achieved via the file replication component.
Uses of DFS
1. Remote information sharing
2. User mobility
3. Diskless workstations
4. Availability
Physical Organization of Computer Nodes
• Compute nodes are stored on racks, perhaps 8–64 on a rack.
• The nodes on a single rack are connected by a network, typically gigabit Ethernet (a networking technology that allows them to send data to each other quickly).
• There can be many racks of compute nodes, and racks are connected by another level of network
or a switch (like a traffic controller for data).
• There are two levels of connection:
1. Intra rack connection
• This is the network connection between compute nodes within the same rack.
• Nodes are typically connected through a local Ethernet inside the rack.
• Uses Gigabit Ethernet or 10-Gigabit Ethernet.
• Low latency, because the physical distance is short and the network is local.
• Used for frequent, high-speed communication among nodes working closely together on a task.
• Example: If 32 nodes are on one rack and need to exchange data frequently during a parallel
computation, they'll use the intra-rack connection.
2. Inter rack connection
Connects nodes on different racks to each other.
Racks are connected via a higher-level switch or a high-speed backbone network.
Link speed is generally higher (e.g., 40-Gigabit or 100-Gigabit Ethernet, or InfiniBand), but the links have to handle much more traffic.
Higher latency compared to intra-rack because data may pass through more switches and longer
cables.
Used when nodes on different racks need to communicate or exchange results, especially in
distributed computations.
Example: A job is split across nodes on 5 racks, and they need to synchronize results — this traffic
will go over the inter-rack connection.
A machine learning job uses nodes on Rack 1 and Rack 2. When nodes need to combine their
results, they use inter-rack connections.
The figure shows the architecture of a large-scale computing system. There can be many racks, and each rack can have many compute nodes (computers).
It is a fact of life that components fail. The more components, such as compute nodes and interconnection networks, a system has, the more often something in the system will not be working at any given time.
For systems such as the one in the figure, two common types of failures are:
• A single node failing (for example, its disk crashes).
• An entire rack failing (for example, the network that connects its
nodes stops working).
Some important calculations take minutes or even hours on thousands
of compute nodes.
If we had to abort and restart the computation every time one
component failed, then the computation might never complete
successfully.
The solution to this problem takes two forms:
1. Files must be stored redundantly. If we did not
duplicate the file at several compute nodes, then if
one node failed, all its files would be unavailable
until the node is replaced. If we did not back up the
files at all, and the disk crashes, the files would be
lost forever.
2. Computations must be divided into tasks, such
that if any one task fails to execute to completion,
it can be restarted without affecting other tasks.
This strategy is followed by the MapReduce programming system.
Large Scale File System Organization
To exploit cluster computing, files must look and behave somewhat differently from
the conventional file systems found on single computers. This new file system, often
called a distributed file system or DFS (although this term has had other meanings in
the past), is typically used as follows.
• Files can be enormous, possibly a terabyte in size. If you have only small files,
there is no point using a DFS for them.
• Files are rarely updated. Rather, they are read as data for some calculation, and
possibly additional data is appended to files from time to time. For example, an
airline reservation system would not be suitable for a DFS, even if the data were
very large, because the data is changed so frequently.
• Files are divided into chunks, which are typically 64 megabytes in size.
• Chunks are replicated, perhaps three times, at three different compute nodes.
• Moreover, the nodes holding copies of one chunk should be located on different racks, so
we don’t lose all copies due to a rack failure.
• Normally, both the chunk size and the degree of replication can be decided by the user. To
find the chunks of a file, there is another small file called the master node or name node for
that file.
• The master node is itself replicated, and a directory for the file system as a whole knows
where to find its copies.
• The directory itself can be replicated, and all participants using the DFS know where the
directory copies are.
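
To make the chunking and rack-aware replica placement concrete, here is a minimal Python sketch. The 64 MB chunk size and three-way replication come from the text above; the rack layout, node names, and round-robin placement policy are illustrative assumptions, not the actual GFS/HDFS placement algorithm.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical DFS chunk size
REPLICATION = 3                # typical degree of replication

# Hypothetical cluster layout: rack name -> nodes on that rack.
RACKS = {
    "rack1": ["n1", "n2", "n3"],
    "rack2": ["n4", "n5", "n6"],
    "rack3": ["n7", "n8", "n9"],
}

def split_into_chunks(file_size):
    """Number of fixed-size chunks needed for a file of file_size bytes."""
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE

def place_replicas(chunk_id):
    """Place each replica of a chunk on a different rack, so that a
    whole-rack failure cannot destroy every copy of the chunk."""
    rack_names = list(RACKS)
    placement = []
    for i in range(REPLICATION):
        rack = rack_names[(chunk_id + i) % len(rack_names)]
        node = RACKS[rack][chunk_id % len(RACKS[rack])]
        placement.append((rack, node))
    return placement

if __name__ == "__main__":
    print("1 TB file ->", split_into_chunks(10**12), "chunks")
    print("chunk 0 replicas:", place_replicas(0))
```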
Distributed File Systems used:
1. Google File System(GFS)
2. Hadoop Distributed File System
3. CloudStore
Map Process
Suppose wants to find out the word “Refund” occur how
many times in the feedback.txt
1. Client sends this request to the JobTracker.
2. The JobTracker asks the NameNode to return the DataNodes that contain the Feedback.txt file.
3. The JobTracker gives this data to TaskTracker for
processing.
4. The TaskTracker starts a Map task and monitors the task's progress.
5. The TaskTracker provides heartbeats and task status back
to the JobTracker.
6. As each Map task completes, each node stores the result of
its local computation i.e. “intermediate data” in temporary
local storage.
7. This intermediate data is sent as input to the reduce task
over the network.
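
The Map side of this example can be sketched in Python as follows. This is a simplified single-machine illustration rather than real Hadoop code; the function name map_refund and the sample feedback lines are assumptions.

```python
def map_refund(line):
    """Map task: emit an intermediate ("refund", 1) pair for every
    occurrence of the word in one line of Feedback.txt."""
    for word in line.split():
        if word.strip('.,!?').lower() == "refund":
            yield ("refund", 1)

# Each TaskTracker would run this over its local split of the file and
# keep the intermediate pairs in temporary local storage.
split = ["I want a refund now", "No refund was issued", "Great service"]
intermediate = [pair for line in split for pair in map_refund(line)]
print(intermediate)  # [('refund', 1), ('refund', 1)]
```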
The Reduce Process
Following are the steps:
1. The Reduce task collects all of the intermediate data from the Map tasks.
2. It then starts the final computation step. In this example, the final step is adding up the occurrences of the word "Refund".
3. This output is written to the Results.txt file.
4. The file is saved on HDFS.
5. The client can then read from that file.
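
A matching sketch of the Reduce side, again on a single machine. The local sort stands in for the framework's shuffle/sort, and printing the final line stands in for writing Results.txt to HDFS.

```python
from itertools import groupby
from operator import itemgetter

def reduce_refund(intermediate):
    """Reduce task: sum the counts collected from all Map tasks."""
    intermediate.sort(key=itemgetter(0))   # stand-in for shuffle/sort
    for key, group in groupby(intermediate, key=itemgetter(0)):
        yield (key, sum(count for _, count in group))

intermediate = [("refund", 1), ("refund", 1)]
for key, total in reduce_refund(intermediate):
    print(f"{key}\t{total}")   # this line would be written to Results.txt
```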
MapReduce
• MapReduce works on divide-and-conquer principle. Huge input data are split
into smaller chunks of 64 MB that are processed by mappers in parallel.
Execution of map is co-located with data chunk.
• The framework then shuffles/sorts the results (intermediate data) of maps and
sends them as input to reducers. Programmers have to implement mappers
and reducers by extending the base classes provided by Hadoop to solve a
specific problem.
• Each of the Map tasks is given one or more chunks from a DFS. These Map
tasks turn the chunk into a sequence of key–value pairs. The way key–value
pairs are produced from the input data is determined by the code written by
the user for the Map function.
• Let us assume mapper takes (k1, v1) as input in the form of (key, value) pair.
Let (k2, v2) be the transformed key–value pair by mapper.
• (k1, v1) → Map → (k2, v2)→ Sort→ (k2,(v2, v2, …, v2)) → Reduce → (k3, v3)
• The key–value pairs from each Map task are collected by a master
controller and sorted by key. It combines each unique key with all its
values, that is (k2, (v2, v2, …, v2)). The key–value combinations are delivered to the Reduce tasks, so that all key–value pairs with the same key wind up at the same Reduce task.
• The way the values are combined is determined by the code written by the programmer for the Reduce function. Reduce tasks work on one key at a time. The Reducer again translates them into another key–value pair (k3, v3), which is the result.
• Mapper is the mandatory part of a Hadoop job and can produce zero
or more key–value pairs (k2, v2). Reducer is the optional part of a
Hadoop job and can produce zero or more key–value pairs (k3, v3).
The driver program for initializing and controlling the MapReduce execution is also written by the user.
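
The whole (k1, v1) → Map → (k2, v2) → Sort → Reduce → (k3, v3) pipeline can be simulated in a few lines of Python. This sketch runs everything on one machine and uses word count as the mapper/reducer pair; the function names are illustrative, not part of any Hadoop API.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Simulate (k1,v1) -> Map -> (k2,v2) -> Sort -> Reduce -> (k3,v3)."""
    groups = defaultdict(list)
    for k1, v1 in records:                 # Map: zero or more (k2, v2) each
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    # Sort: group each k2 with its value list (v2, v2, ..., v2),
    # then Reduce each key independently into (k3, v3) pairs.
    return [kv for k2 in sorted(groups) for kv in reducer(k2, groups[k2])]

def wc_map(_, line):                       # mapper: words to (word, 1)
    for word in line.split():
        yield (word, 1)

def wc_reduce(word, counts):               # reducer: sum the counts
    yield (word, sum(counts))

print(run_mapreduce([(0, "a b a"), (1, "b c")], wc_map, wc_reduce))
# [('a', 2), ('b', 2), ('c', 1)]
```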
2 Responsibilities of MapReduce Framework
The framework takes care of scheduling, monitoring and rescheduling
of failed tasks.
1. Provides overall coordination of execution.
2. Selects nodes for running mappers.
3. Starts and monitors mapper’s execution.
4. Sorts and shuffles output of mappers.
5. Chooses locations for reducer’s execution.
6. Delivers the output of mapper to reducer node.
7. Starts and monitors reducer’s execution
Details of MapReduce Execution
1 Run-Time Coordination in MapReduce
MapReduce handles the distributed code execution on the cluster transparently once the user submits the "jar" file.
MapReduce takes care of both scheduling and synchronization.
MapReduce has to ensure that all the jobs submitted by all the users get a fairly equal share of the cluster's execution.
MapReduce implements scheduling optimization by speculative
execution explained as follows: If a machine executes the task very
slowly, the JobTracker assigns the additional instance of the same task
to another node using a different TaskTracker.
MapReduce Execution Pipeline
Main components of MapReduce execution pipeline are as follows:
1. Driver: Driver is the main program that initializes a MapReduce job
and gets back the status of job execution. For each job, it defines
the configuration and specification of all its components (Mapper,
Reducer, Combiner and Custom partitioner) including the
input−output formats.
2. Input data: Input data can reside in HDFS or any other storage such
as HBase. InputFormat defines how to read the input and define the
split. Based on the split, InputFormat defines the number of map
tasks in the mapping phase. The job Driver invokes the InputFormat directly to decide the number (InputSplits) and the locations of the map task executions.
3. Mapper: For each map task, a new instance of mapper is
instantiated. As said earlier, individual mappers do not communicate
with each other. The partition of the key space produced by the mapper, that is, the mapper's intermediate data, is given as input to the reducers.
4. Shuffle and sort: Shuffling is the process of moving map outputs to
reducers. Shuffle/sort is triggered when a mapper completes its job. As soon as all the map tasks are completed, the sort process groups the key–value pairs to form the lists of values. The grouping is performed
regardless of what Map and Reduce tasks do.
5. Reducer: The reducer executes the user-defined code. The reducer's reduce() method receives a key along with an iterator over all the values associated with that key, and produces the output key–value pairs. RecordWriter is used for storing data in a location specified by OutputFormat. Output can come from the reducer, or from the mapper if no reducer is present.
6. Distributed cache: Distributed cache is a resource used for sharing
data globally by all nodes in the cluster. This can be a shared library that
each task can access. The user’s code for driver, map and reduce along
with the configuring parameters can be packaged into a single jar file
and placed in this cache.
How MapReduce Copes with Node Failures
1. Task re-execution
2. Speculative Execution (duplicate task)
3. Heartbeat Mechanism
4. Data Replication in HDFS
Types of Node Failures
1. Mapper node fails
2. Reducer node fails
3. Data node fails
4. TaskTracker fails
Algorithms using MapReduce
1 Matrix Multiplication by MapReduce
Let A and B be the two matrices to be multiplied and let the result be matrix C.
Matrix A has dimensions L × M and matrix B has dimensions M × N.
In the Map phase:
1. For each element (i,j) of A, emit ((i,k), A[i,j]) for each k in 1, …, N.
2. For each element (j,k) of B, emit ((i,k), B[j,k]) for each i in 1, …, L.
In the Reduce phase, emit:
• key = (i,k)
• value = Σ_j (A[i,j] * B[j,k])
• One reducer is used per output cell.
• Each reducer computes Σ_j (A[i,j] * B[j,k]).
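
A single-machine sketch of this algorithm follows. One practical detail the pseudocode above glosses over: each emitted value must be tagged with its source matrix and the shared index j, or the reducer for cell (i,k) cannot pair A[i,j] with the matching B[j,k]. All names here are illustrative.

```python
from collections import defaultdict

def matmul_mapreduce(A, B, L, M, N):
    """One-pass MapReduce matrix multiplication C = A x B, simulated
    locally. Values carry a (matrix tag, j) label for pairing."""
    groups = defaultdict(list)
    for i in range(L):                     # Map over A: A[i][j] is needed
        for j in range(M):                 # by every output cell (i, k)
            for k in range(N):
                groups[(i, k)].append(("A", j, A[i][j]))
    for j in range(M):                     # Map over B: B[j][k] is needed
        for k in range(N):                 # by every output cell (i, k)
            for i in range(L):
                groups[(i, k)].append(("B", j, B[j][k]))
    C = [[0] * N for _ in range(L)]
    for (i, k), values in groups.items():  # Reduce: one per output cell
        a = {j: v for tag, j, v in values if tag == "A"}
        b = {j: v for tag, j, v in values if tag == "B"}
        C[i][k] = sum(a[j] * b[j] for j in a)
    return C

print(matmul_mapreduce([[1, 2]], [[3], [4]], L=1, M=2, N=1))  # [[11]]
```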
2 MapReduce and Relational Operators
MapReduce algorithms can be used for processing relational data:
Shuffle/Sort automatically handles group by sorting and partitioning in MapReduce.
The following operations are performed either in mapper or in reducer:
• Selection
• Projection
• Union, intersection and difference
• Natural join
• Grouping and aggregation
Multiple strategies such as Reduce-side join, Map-side join and In-memory join
(Striped variant, Memcached variant) are used for relational joins.
Multiple MapReduce jobs are required for complex operations.
For example: Top 10 URLs in terms of average time spent.
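
As an illustration of the reduce-side join mentioned above, here is a minimal single-machine sketch of a natural join on a shared key: mappers tag each tuple with its relation of origin, and the reducer for each key pairs the R tuples with the S tuples. The data and function names are assumptions.

```python
from collections import defaultdict

def reduce_side_join(R, S):
    """Natural join of R and S on their first component."""
    groups = defaultdict(lambda: {"R": [], "S": []})
    for key, rest in R:                    # map side: tag tuples from R
        groups[key]["R"].append(rest)
    for key, rest in S:                    # map side: tag tuples from S
        groups[key]["S"].append(rest)
    out = []
    for key, g in groups.items():          # reduce side: pair per key
        out += [(key, r, s) for r in g["R"] for s in g["S"]]
    return out

R = [(1, "alice"), (2, "bob")]
S = [(1, "pune"), (1, "mumbai")]
print(reduce_side_join(R, S))
# [(1, 'alice', 'pune'), (1, 'alice', 'mumbai')]
```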
3 Computing Selections by MapReduce
Selections really may not need both the Map and Reduce tasks. They
can be done mostly in the map portion alone.
1. The Map Function: For each tuple t in R, test if it satisfies condition
C. If so, produce the key– value pair (t, t). That is, both the key and
value are t.
2. The Reduce Function: The Reduce function is the identity. It simply
passes each key–value pair to the output.
Note that the output is not exactly a relation, because it has key–value
pairs. However, a relation can be obtained by using only the value
components (or only the key components) of the output.
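
A minimal Python sketch of this selection pattern on a single machine; the relation R and the condition C are illustrative.

```python
def selection_map(t, condition):
    """Map: produce (t, t) only if tuple t satisfies condition C."""
    if condition(t):
        yield (t, t)

def selection_reduce(key, values):
    """Reduce: the identity; pass each key-value pair to the output."""
    for v in values:
        yield (key, v)

R = [("a", 5), ("b", 12), ("c", 7)]
pairs = [kv for t in R for kv in selection_map(t, lambda t: t[1] > 6)]
out = [kv for k, v in pairs for kv in selection_reduce(k, [v])]
print(out)  # [(('b', 12), ('b', 12)), (('c', 7), ('c', 7))]
```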
4 Computing Projections by MapReduce
Projection is also a simple operation in MapReduce. Here, the Reduce
function is used to eliminate duplicates since projection may cause the same
tuple to appear several times.
1. The Map Function: For each tuple t in R, construct a tuple ts by
eliminating from t those components whose attributes are not in S.
Output the key–value pair (ts, ts).
2. The Reduce Function: For each key ts produced by any of the Map tasks, there will be one or more key–value pairs (ts, ts). The Reduce function turns (ts, [ts, ts, ..., ts]) into (ts, ts), so it produces exactly one pair (ts, ts) for this key ts.
The Reduce operation is duplicate elimination. This operation is associative
and commutative, so a combiner associated with each Map task can
eliminate whatever duplicates are produced locally.
However, the Reduce tasks are still needed to eliminate two identical tuples
coming from different Map tasks.
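
A single-machine sketch of projection with duplicate elimination; here S is given as a tuple of column positions, which is an assumption of this illustration.

```python
from collections import defaultdict

def projection_map(t, S):
    """Map: keep only the components of t whose positions are in S."""
    ts = tuple(t[i] for i in S)
    yield (ts, ts)

def projection_reduce(ts, values):
    """Reduce: turn (ts, [ts, ..., ts]) into a single (ts, ts) pair."""
    yield (ts, ts)

R = [(1, "x", 10), (1, "y", 10), (2, "z", 20)]
groups = defaultdict(list)                 # stand-in for shuffle/sort
for t in R:
    for k, v in projection_map(t, S=(0, 2)):
        groups[k].append(v)
print([kv for k in groups for kv in projection_reduce(k, groups[k])])
# [((1, 10), (1, 10)), ((2, 20), (2, 20))]
```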
5 Union, Intersection and Difference by MapReduce
1 Union
For union operation, both the relations R and S need to have the same
schema.
Map tasks will be assigned chunks from either relation R or S.
The Map tasks do not really do anything except pass their input tuples as key–value pairs to the Reduce tasks; the mappers are fed all tuples of the two sets to be united.
The Reducer is used to eliminate duplicates.
1. The Map Function: Turn each input tuple t into a key–value pair (t, t).
2. The Reduce Function: Associated with each key t there will be either one
or two values. Produce output (t, t) in either case.
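
A compact local sketch of the union; feeding the concatenation of R and S to the mapper stands in for assigning chunks of both relations to Map tasks.

```python
from collections import defaultdict

def union_map(t):
    yield (t, t)                           # pass each tuple through as (t, t)

def union_reduce(t, values):
    yield (t, t)                           # one or two values; emit once

R, S = [("a",), ("b",)], [("b",), ("c",)]
groups = defaultdict(list)                 # stand-in for shuffle/sort
for t in R + S:
    for k, v in union_map(t):
        groups[k].append(v)
print([kv for k in groups for kv in union_reduce(k, groups[k])])
# [(('a',), ('a',)), (('b',), ('b',)), (('c',), ('c',))]
```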
2 Intersection
Mappers are fed all tuples of both relations R and S.
The Reducer emits only tuples that occur twice. This is possible only if both sets contain the tuple, because each tuple includes a primary key and can therefore occur at most once in each relation.
To compute the intersection, we can use the same Map function. However, the Reduce function must produce a tuple only if both relations have the tuple.
If the key t has a list of two values [t, t] associated with it, then the Reduce task
for t should produce (t, t).
However, if the value-list associated with key t is just [t], then one of R and S is
missing t, so we do not want to produce a tuple for the intersection.
1. The Map function: Turn each tuple t into a key–value pair (t, t).
2. The Reduce function: If key t has value list [t, t], then produce (t, t).
Otherwise, produce nothing.
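
The same pattern with the intersection's Reduce rule, simulated locally:

```python
from collections import defaultdict

def intersection_reduce(t, values):
    """Reduce: emit (t, t) only when the value list is [t, t]."""
    if len(values) == 2:                   # t appeared in both R and S
        yield (t, t)

R, S = [("a",), ("b",)], [("b",), ("c",)]
groups = defaultdict(list)
for t in R + S:                            # map: each tuple becomes (t, t)
    groups[t].append(t)
print([kv for k in groups for kv in intersection_reduce(k, groups[k])])
# [(('b',), ('b',))]
```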
3 Difference
In the difference R − S, a tuple t can appear in the output only if it is in relation R but not in relation S.
The Map function can pass tuples from R and S through, but must
inform the Reduce function whether the tuple came from R or S.
Using the relation as the value associated with the key t, the two
functions are specified as follows:
1. The Map function: For a tuple t in R, produce key–value pair (t, R),
and for a tuple t in S, produce key–value pair (t, S). Note that the
intent is that the value is the name of R or S and not the entire
relation.
2. The Reduce function: For each key t, if the associated value list is
[R], then produce (t, t). Otherwise, produce nothing.
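
A minimal local sketch of the difference, where, as described above, the value passed through is the name of the relation rather than the tuple itself:

```python
from collections import defaultdict

def difference_reduce(t, relations):
    """Reduce: emit (t, t) only when the value list is exactly [R]."""
    if relations == ["R"]:                 # t is in R but not in S
        yield (t, t)

R, S = [("a",), ("b",)], [("b",), ("c",)]
groups = defaultdict(list)
for t in R:                                # map: tag tuples with "R"
    groups[t].append("R")
for t in S:                                # map: tag tuples with "S"
    groups[t].append("S")
print([kv for k in groups for kv in difference_reduce(k, groups[k])])
# [(('a',), ('a',))]
```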