Large Scale Data Processing
MapReduce intro
Dr. Wenceslao PALMA
wenceslao.palma@ucv.cl
MapReduce
What is MapReduce?
MapReduce is a programming model for data processing introduced by Google
(2004) to support parallel and fault-tolerant computations over large data sets
on clusters of computers. It provides an abstraction that hides many
system-level details from the programmer.
Big ideas behind MapReduce
1 Scale “out”, not “up”.
2 Assume failures are common.
3 Move processing to the data.
4 Process data sequentially and avoid random access.
5 Hide system-level details from the application developer.
6 Seamless scalability.
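To make the programming model concrete, here is the canonical word-count
example written as a pair of Python functions. This is an illustrative
sketch, not code from the slides; the names map_fn and reduce_fn are
hypothetical stand-ins for the user-supplied map and reduce functions.

    # Word count in the MapReduce programming model (illustrative sketch).
    # The framework calls map_fn once per input record, and reduce_fn once
    # per intermediate key with the list of all values emitted for it.

    def map_fn(line):
        """Map: emit an intermediate (word, 1) pair for every word."""
        for word in line.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        """Reduce: sum the partial counts collected for one word."""
        yield (word, sum(counts))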
MapReduce
(1) Scale “out”, not “up”. There is evidence that a cluster of low-end
servers can approach the performance of an equivalent cluster of high-end
servers; the remaining performance gap is too small to justify the price
premium of high-end machines.
MapReduce is a scale-out approach built on an equitable distribution of
work across many independent machines.
MapReduce
(2) Assume failures are common. Large-scale services distributed across a
large cluster must cope with failures as an intrinsic aspect of their operation.
MapReduce implementations cope with failures through automatic restarts of
failed tasks and data replication.
MapReduce
(3) Move processing to the data. In high-performance computing, processing
nodes and storage nodes are linked together by a high-capacity interconnect.
However, when data-intensive workloads are not very processor-demanding, the
network becomes a bottleneck.
MapReduce takes advantage of data locality by running code on the node where
the block of data it needs resides.
MapReduce
(4) Process data sequentially and avoid random access. In data-intensive
processing it is desirable to avoid random data access and instead to
organize computations so that data is processed sequentially.
In MapReduce, all computations are organized into long streaming operations
that take advantage of the aggregate bandwidth of the many disks in the
cluster. MapReduce trades latency for throughput.
MapReduce
(5) Hide system-level details from the application developer.
Programming distributed applications forces the application developer to
deal with multiple threads, processes, and machines.
MapReduce addresses the challenges of distributed programming by providing
an abstraction that isolates the developer from system-level details. It
maintains a separation between what computations are to be performed and
how those computations are actually carried out on a cluster of machines.
MapReduce
(6) Seamless scalability. If running an algorithm on a particular dataset
takes 100 machine-hours, then we should be able to finish it in an hour on a
cluster of 100 machines, or complete the same task in ten hours on a cluster
of 10 machines.
With MapReduce, this isn’t so far from the truth, at least for some
applications.
MapReduce
MapReduce is not the first model of parallel computation. However:
it has changed the way we organize computations at a massive scale.
it makes certain large-data problems easier, but suffers from limitations
as well.
MapReduce: logical view
The input to a MapReduce job is divided into fixed-size pieces called splits.
A recommended split size is the size of a GFS/HDFS block (64 MB by
default); however, this can be changed when each file is created.
Splits are processed in parallel by different machines.
The output ends up in R files on the distributed file system, where R is the
number of reducers.
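As an illustration of this logical view, the following single-machine Python
sketch simulates the dataflow: map over the splits, partition the
intermediate pairs into R regions, and reduce each region into one output
"file". The helper run_mapreduce is hypothetical; a real implementation
distributes these steps across a cluster.

    from collections import defaultdict

    def run_mapreduce(splits, map_fn, reduce_fn, R):
        # One region per reducer; each maps an intermediate key to its values.
        regions = [defaultdict(list) for _ in range(R)]
        for split in splits:            # executed in parallel on a real cluster
            for record in split:
                for key, value in map_fn(record):
                    regions[hash(key) % R][key].append(value)
        # Each element of 'outputs' stands in for one of the R output files.
        outputs = []
        for region in regions:
            result = {}
            for key in sorted(region):
                for out_key, out_value in reduce_fn(key, region[key]):
                    result[out_key] = out_value
            outputs.append(result)
        return outputs

    # Word count over two splits with R = 2 reducers, using the earlier sketch:
    # run_mapreduce([["a b", "b c"], ["c a"]], map_fn, reduce_fn, R=2)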
MapReduce: execution overview
MapReduce splits the input files into M pieces.
Many copies of the user program are started on the cluster.
MapReduce: execution overview
The master node assigns map or reduce tasks to idle workers.
MapReduce
Workers assigned map tasks read their corresponding splits.
Intermediate results are buffered in memory.
MapReduce
Periodically, intermediate results are written to local disk.
These results are partitioned into R regions.
The locations of these partitions are published to the master node.
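The default partitioning function in the original MapReduce design hashes the
intermediate key modulo R, so all pairs with the same key land in the same
region and therefore reach the same reducer. A one-line Python rendering,
matching the rule used in the sketch above:

    def partition(key, R):
        # hash(key) mod R: pairs that share a key always fall in the same
        # region, so a single reducer sees every value for that key.
        return hash(key) % R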
MapReduce
Reducers read all of their input data.
When a reducer has read all of its input data, it sorts the data by
intermediate key.
MapReduce
Each reducer iterates over the sorted intermediate data.
The output of the reduce function is appended to a final output file.
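A sketch of these two reduce-side steps in Python: sort the fetched
intermediate pairs by key, group them, and append each reduce result to the
task's output file. The function name and the tab-separated output format
are illustrative assumptions, not part of the original design.

    import itertools

    def run_reduce_task(pairs, reduce_fn, output_path):
        # Sort the intermediate (key, value) pairs by key, then pass each
        # key and its list of values to reduce_fn, appending results to
        # this reduce task's output file.
        pairs.sort(key=lambda kv: kv[0])
        with open(output_path, "a") as out:
            for key, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
                values = [value for _, value in group]
                for out_key, out_value in reduce_fn(key, values):
                    out.write(f"{out_key}\t{out_value}\n")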
MapReduce
The output is available in R output files.
Typically these files are not combined; they can be kept as input for
another application that is able to handle partitioned data.
References
Jimmy Lin and Chris Dyer. Data-Intensive Text Processing with MapReduce.
Pre-production manuscript, April 2011.