Distributed Systems
Monsoon 2024
Lecture 9
International Institute of Information Technology
Hyderabad, India
Slides credit to Dr. Yogesh Simmhan, IISc,
and other sources mentioned.
Logistics Week 1
• Extended the deadline for HW 3 to 12 midnight,
Monday, September 02, 2024.
Distributed Computing Platforms
• We will study yet another distributed computing platform
that has gained recent popularity.
• We will study what this platform and its recent avatars
offer.
Distributed Computing
⚫Distributed Computing involves
• Clusters of machines connected over network
• Distributed Storage
• Disks attached to clusters of machines
• Network Attached Storage
How can we make effective use of multiple machines?
⚫Commodity clusters
• Commodity: Available off the shelf at large volumes
• Lower Cost of Acquisition
• Cost vs. Performance
• Low disk bandwidth, and high network latency
• CPUs typically comparable
How can we make use of many such machines of modest capability?
However,
⚫ Commodity clusters have lower reliability
⚫ Mass-produced
⚫ Cheaper materials
⚫ Shorter lifetime (~3 years)
⚫ How can applications easily deal with failures?
⚫ How can we ensure availability in the presence of faults?
⚫ While at the same time…
While at the same time,
⚫Realize that programming distributed systems is difficult
⚫Divide a job into multiple tasks
⚫Understand dependencies between tasks: Control, Data
⚫Coordinate and synchronize execution of tasks
⚫Pass information between tasks
⚫Avoid race conditions, deadlocks
⚫Parallel and distributed programming models/languages/
abstractions/platforms try to make these easy
⚫E.g. Assembly programming vs. C++ programming
⚫E.g. C++ programming vs. Matlab programming
Distributed Programming
⚫ Recall that distributed systems are characterized as a
collection of autonomous computers communicating
over a network of links.
⚫ Some typical features of such a system are:
⚫ No common physical clock
⚫ Clocks can drift, and there is no notion of a common clock.
⚫ Systems can be asynchronous too.
⚫ No shared memory
⚫ Communication is via messages, along with their associated
semantics.
⚫ Geographical separation
⚫ Can allow for wider scope of operations, e.g., SETI
⚫ Autonomy
⚫ Processors are loosely coupled but cooperate with each
other
⚫ Heterogeneity
⚫ All processors need not be alike.
Challenges of Distributed Programming
⚫ Distributed programming is inherently challenging.
⚫ Need to ensure that several geographically
separate computers collaborate
– efficiently,
– reliably,
– transparently, and
– scalably
⚫ One can also talk of inherent complexity and
accidental complexity [Check D. Schmidt, VU]
Challenges of Distributed Programming
⚫ One can also talk of inherent complexity and
accidental complexity [Check D. Schmidt, VU]
⚫ Inherent complexity due to
– Latency: Computers can be far apart
– Reliability: Recovering from failures of individual or
collection of computers
– Partitioning: Sometimes, failures can partition the
system.
– Ordering: Lack of global clock means message/event
ordering is difficult
– Security: Program as well as computational aspects
Challenges of Distributed Programming
⚫ Distributed programming is inherently challenging.
⚫ Need to ensure that several geographically separate
computers collaborate
– Efficiently, reliably, transparently, and scalably
⚫ One can also talk of inherent complexity and
accidental complexity [Check D. Schmidt, VU]
⚫ Inherent complexity due to
– Latency, Reliability, Partitioning, Ordering, Security
⚫ Accidental complexity due to
– Low-level APIs: Not enough abstraction
– Poor debugging tools: Poor fault location
– Algorithmic decomposition: Choice of algorithmic
techniques
– Continuous re-invention: Key components can change!
Enter Map-Reduce
⚫ MapReduce is a distributed data-parallel programming
model from Google
⚫ Introduced close to 20 years ago.
⚫ MapReduce works best with a distributed file system, called
Google File System (GFS)
⚫ Hadoop is the open source framework implementation from
Apache that can execute the MapReduce programming
model
⚫ Hadoop Distributed File System (HDFS) is the open source
implementation of the GFS design
⚫ Amazon’s Elastic MapReduce (EMR) PaaS offering is yet another
implementation of Map-Reduce
Map-Reduce
“A simple and powerful interface that enables
automatic parallelization and distribution of
large-scale computations, combined with an
implementation of this interface that achieves
high performance on large clusters of
commodity PCs.”
Dean and Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”,
OSDI, 2004
A High Level View
⚫ Map Reduce provides
⚫ Clean abstraction for programmers
⚫ Automatic parallelization & distribution
⚫ Fault-tolerance
⚫ A batch data processing system
⚫ Status and monitoring tools
Map-Reduce Vs. RDBMS
• Why not RDBMS – the old data workhorse?
⚫ An RDBMS suffers from huge seek times, resulting in huge
access times when large portions of the data must be read.
• Map-Reduce comes with the implicit assumption that nearly the
entire dataset is relevant for each query.
⚫ MapReduce is therefore akin to a batch query processor.
⚫ MapReduce is a good fit for problems that need to
analyze the whole dataset, in a batch fashion,
particularly for ad hoc analysis.
Map-Reduce Vs. RDBMS
⚫ An RDBMS is good for point queries or updates,
where the dataset has been indexed to deliver low-
latency retrieval and update times of a relatively small
amount of data.
⚫ MapReduce suits applications where the data is
written once, and read many times, whereas a
relational database is good for datasets that are
continually updated.
⚫ MapReduce works well on unstructured or
semistructured data
Map-Reduce Vs. RDBMS
⚫ Relational data is often normalized to retain its
integrity and remove redundancy.
⚫ Normalization poses problems for MapReduce:
⚫ reading a record becomes a nonlocal operation,
⚫ whereas one of the central assumptions that MapReduce
makes is that it is possible to perform (high-speed)
streaming reads and writes.
⚫ A web server log is a good example of a set of
records that is not normalized
⚫ the client hostnames are specified in full each time,
even though the same client may appear many times
⚫ Hence, logfiles are particularly well-suited to analysis
with MapReduce
Map-Reduce Vs. RDBMS
⚫ Over time, however, the differences between
relational databases and MapReduce systems are
reducing.
⚫ Relational databases are starting to incorporate some
of the ideas from MapReduce
⚫ Aster Data’s and Greenplum’s databases
⚫ Higher-level query languages built on MapReduce
(such as Pig and Hive) make MapReduce
systems more approachable to traditional
database programmers.
Map-Reduce
⚫ MapReduce might sound like quite a restrictive
programming model
⚫ Mappers and reducers run with very limited
coordination between one another.
⚫ Typical problems that fit the Map-Reduce programming
model range from image analysis, to graph-based
problems, to machine learning algorithms.
⚫ It can’t solve every problem, of course, but it is a
general data-processing tool
MapReduce: Data-parallel Programming Model
Copyright © 2011 Tom White, Hadoop Definitive Guide
▪ Process data using map & reduce functions
▪ map(ki, vi) → List<km, vm>[]
‣ map is called on every input item
‣ Emits a series of intermediate key/value pairs
▪ All values with a given key are grouped together
▪ reduce(km, List<vm>[]) → List<kr, vr>[]
‣ reduce is called on every unique key & all its values
‣ Emits a value that is added to the output
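To make the model concrete, here is a minimal in-memory sketch in Python of the map → group-by-key → reduce flow. This is illustrative only, not Hadoop's API; the helper name run_mapreduce is our own.

    from collections import defaultdict

    def run_mapreduce(inputs, map_fn, reduce_fn):
        # Map phase: map_fn is called on every (key, value) input pair
        # and returns a list of intermediate (key, value) pairs
        intermediate = []
        for k, v in inputs:
            intermediate.extend(map_fn(k, v))
        # Shuffle: all values with a given key are grouped together
        groups = defaultdict(list)
        for k, v in intermediate:
            groups[k].append(v)
        # Reduce phase: reduce_fn is called on every unique key and
        # all its values; its outputs are appended to the final output
        output = []
        for k in sorted(groups):   # keys sorted, as the framework does
            output.extend(reduce_fn(k, groups[k]))
        return output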
MapReduce: Word Count
Map(k1,v1) → list(k2,v2)
Reduce(k2, list(v2)) → list(k2,v2)
Input:   "How now / Brown cow"        "How does / It work now"

Map:     <how,1> <now,1>              <how,1> <does,1> <it,1>
         <brown,1> <cow,1>            <work,1> <now,1>

Shuffle: <brown,1> <cow,1> <does,1> <how,1 1>
         <it,1> <now,1 1> <work,1>

Reduce:  brown 1, cow 1, does 1, how 2, it 1, now 2, work 1
MapReduce Framework
[Figure: Input → Map → (shuffle) → Reduce → Output, illustrated on the
distributed word count example]
Map
▪ Input records from the data source
‣ lines out of files, rows of a database, etc.
▪ Passed to map function as key-value pairs
‣ Line number, line value
▪ map() produces zero or more intermediate values, each
associated with an output key.
Map
▪ Example: Upper-case Mapper
map(k, v) { emit(k.toUpper(), v.toUpper()); }
(“foo”, “bar”) → (“FOO”, “BAR”)
(“Foo”, “other”) → (“FOO”, “OTHER”)
(“key2”, “data”) → (“KEY2”, “DATA”)
▪ Example: Filter Mapper
map(k, v) { if (isPrime(v)) then emit(k, v); }
(“foo”, 7) → (“foo”, 7)
(“test”, 10) → () //nothing emitted
Reduce
▪ All the intermediate values from map for a given output key
are combined together into a list
▪ reduce() combines these intermediate values into one or
more final values for that same output key … Usually one
final value per key
▪ One output “file” per reducer
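As a counterpart to the mapper examples above, a sketch of a sum reducer in the same illustrative Python style (Hadoop's real API is Java; the names here are ours):

    def sum_reduce(key, values):
        # Combines all intermediate values for one key into one final value
        return [(key, sum(values))]

    # ("how", [1, 1]) → [("how", 2)]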
MapReduce: Word Count Drilldown
Input splits:  "How now / Brown cow"      "How does / It work now"

Mapper (one per split):
    for each w in value do
        emit(w, 1)

Intermediate:  <How,1> <now,1>            <How,1> <does,1> <it,1>
               <brown,1> <cow,1>          <work,1> <now,1>

After shuffle & sort:
    <brown,1> <cow,1> <does,1> <How,1 1> <it,1> <now,1 1> <work,1>

Reducer (one call per key):
    sum = 0
    for each value do sum = sum + value
    emit(key, sum)

Output:  brown 1, cow 1, does 1, How 2, it 1, now 2, work 1
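The same drilldown as a runnable Python sketch, using the illustrative run_mapreduce helper defined earlier (again, not Hadoop's actual API; words are lowercased here):

    def wc_map(line_no, line):
        # Tokenize the line (the value) and emit <word, 1> per word
        return [(w.lower(), 1) for w in line.split()]

    def wc_reduce(word, counts):
        # All counts for one word arrive together; emit the total
        return [(word, sum(counts))]

    lines = [(1, "How now"), (2, "Brown cow"),
             (3, "How does"), (4, "It work now")]
    print(run_mapreduce(lines, wc_map, wc_reduce))
    # [('brown', 1), ('cow', 1), ('does', 1), ('how', 2),
    #  ('it', 1), ('now', 2), ('work', 1)]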
Mapper/Reducer Tasks vs. Map/Reduce Methods
▪ Number of Mapper and Reducer tasks is specified by
user
▪ Each Mapper/Reducer task can make multiple calls to
Map/Reduce method, sequentially
▪ Mapper and Reducer tasks may run on different
machines
▪ Implementation framework decides
‣ Placement of Mapper and Reducer tasks on machines
‣ Keys assigned to mapper and reducer tasks
‣ But can be controlled by user…
Shuffle & Sort: The Magic happens here!
▪ Shuffle does a “group by” of keys from all mappers
‣ Similar to SQL groupBy operation
▪ A sort of the keys local to each Reducer task is performed
‣ Keys arriving at each reducer are sorted
‣ No sorting guarantee of keys across reducer tasks
▪ No ordering guarantees of values for a key
‣ Implementation dependent
▪ Shuffle and Sort implemented efficiently by framework
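A small sketch of the shuffle behavior described above, assuming hash partitioning across reducers (illustrative Python; hash() stands in for the key's hashCode, and real frameworks stream and spill to disk):

    from collections import defaultdict

    def shuffle(intermediate, n_reducers):
        # "Group by" key, assigning each key to one reducer by hash
        per_reducer = [defaultdict(list) for _ in range(n_reducers)]
        for k, v in intermediate:
            per_reducer[hash(k) % n_reducers][k].append(v)
        # Keys are sorted within each reducer's input; there is no
        # ordering guarantee across reducers, nor on a key's values
        return [sorted(g.items()) for g in per_reducer]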
Map-Shuffle-Sort-Reduce
[Figure: Map tasks on Hosts A, B, and C emit intermediate key-value
pairs, which are shuffled across the network to Reduce tasks on
Hosts A, C, and D]
Data-Intensive Text Processing with MapReduce, Jimmy Lin, 2010
Optimization: Combiner
▪ Logic runs on output of Map tasks, on the map machines
‣ “Mini-Reduce,” only on local Map output
▪ Output of Combiner sent to shuffle
‣ Saves bandwidth before sending data to Reducers
▪ Same input and output types as Map’s output type
‣ Map(k,v) → (k’,v’)
‣ Combine(k’,v’[]) → (k’,v’)
‣ Reduce(k’,v’[]) → (k’’,v’’)
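For word count, the combiner can reuse the reducer's logic, since its input and output types match Map's output type and summation is associative and commutative. A sketch in the same illustrative Python style (names are ours):

    def wc_combine(word, local_counts):
        # Runs on the map machine over that mapper's local output only:
        # collapses many <word, 1> pairs into one <word, partial_sum>
        return [(word, sum(local_counts))]

A combiner is only safe when such partial aggregation does not change the final result, as it does not here.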
Optimization: Partitioner
▪ Decides assignment of intermediate keys grouped to
specific Reducer tasks
‣ Affects the load on each reducer task
▪ Sorting of local keys for Reducer task done after
partitioning
▪ Default is hash partitioning
‣ HashPartitioner(key, nParts) → part
‣ Number of Reducer (nParts) tasks known in advance
‣ Returns a partition number [0, nParts)
‣ Default partitioner balances number of keys per Reducer …
assuming uniform key distribution
‣ May not balance the number of values processed by a Reducer
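A sketch of the default hash partitioning described above (HashPartitioner is the Hadoop name; this Python analogue is illustrative only):

    def hash_partitioner(key, n_parts):
        # Returns a partition number in [0, nParts); balances the number
        # of keys per Reducer only if the key distribution is uniform
        return hash(key) % n_parts

Note that a skewed key (one with very many values) still lands on a single reducer, which is why the number of values per reducer can remain unbalanced.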
Map-MiniShuffle-Combine-Partition-Shuffle-
Sort-Reduce
[Figure: Map → mini-shuffle → Combine → Partition → Shuffle & Sort →
Reduce, across hosts]
Combine & Partition phases could be interchanged, based
on implementation
Combiner & Partitioner are powerful constructs. Use them
wisely!
Data-Intensive Text Processing with MapReduce, Jimmy Lin, 2010
MapReduce for Histogram
int bucketWidth = 4 // input

Input values (28 items, split across two Map tasks):
    7  2 11  2
    2  1 11  4
    9 10  6  6
    6  3  2  8
    0  5  1 10
    2  4  8 11
    5  0  1  0

Map(k, v) {
    emit(floor(v/bucketWidth), 1) // <bucketID, 1>
}

// Data transfer & shuffle between Map & Reduce:
// all 28 <bucketID, 1> items move across the network

// one reduce per bucketID
Reduce(k, v[]) {
    sum = 0;
    foreach(n in v[]) sum++;
    emit(k, sum) // <bucketID, frequency>
}
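Under the illustrative run_mapreduce helper from earlier, the histogram job might look as follows in Python (bucketWidth = 4, as on the slide):

    bucket_width = 4

    def hist_map(k, v):
        # Emit <bucketID, 1> for each input value
        return [(v // bucket_width, 1)]

    def hist_reduce(bucket, ones):
        # One reduce call per bucketID: count the items in that bucket
        return [(bucket, sum(ones))]

    values = [7, 2, 11, 2, 2, 1, 11, 4, 9, 10, 6, 6, 6, 3, 2, 8,
              0, 5, 1, 10, 2, 4, 8, 11, 5, 0, 1, 0]
    print(run_mapreduce(list(enumerate(values)), hist_map, hist_reduce))
    # [(0, 12), (1, 8), (2, 8)]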
Combiner Advantage
▪ Mini-Shuffle lowers the overall cost for Shuffle
▪ E.g. n total items emitted from m mappers
▪ Reduces network transfer and Disk IO costs
‣ In ideal case, m items vs. n items written and read from disk,
transferred over network (m<<n)
▪ On the Shuffle & Sort itself, the combiner has less of an impact
‣ If more mapper tasks are present than reducers, higher
parallelism for doing groupby and mapper-side partial sort.
‣ Local Sort on reducer is based on number of unique keys,
which does not change due to combiner.
MapReduce: Recap
▪ Programmers must specify:
map (k, v) → <k’, v’>*
reduce (k’, v’[]) → <k’’, v’’>*
All values with the same key are reduced together
▪ Optionally, also:
partition (k’, number of partitions) → partition for k’
Often a simple hash of the key, e.g., hash(k’) mod n
Divides up key space for parallel reduce operations
combine (k’, v’[]) → <k’, v’>*
‣ Mini-reducers that run in memory after the map phase
‣ Used as an optimization to reduce network traffic
▪ The execution framework handles everything else…
What Does it Solve?
⚫ Scalable data analytics
⚫ Data is not viewed or constrained in terms of disk
reads/writes; instead, meaning is found in the data via
computation over sets of keys and values.
⚫ Abstract out lots of programming details.
⚫ Forgo removing data redundancy via normalization
of tables.
⚫ The penalty of normalization is that reading a record
requires reading from multiple sources/tables/files.
“Everything Else”
▪ The execution framework handles everything else…
‣ Scheduling: assigns workers to map and reduce tasks
‣ “Data distribution”: moves processes to data
‣ Synchronization: gathers, sorts, and shuffles intermediate
data
‣ Errors and faults: detects worker failures and restarts
▪ Limited control over data and execution flow
‣ All algorithms must be expressed in m, r, c, p
▪ You don’t know:
‣ Where mappers and reducers run
‣ When a mapper or reducer begins or finishes
‣ Which input a particular mapper is processing
‣ Which intermediate key a particular reducer is processing
Map-Reduce
⚫ Failures are handled via the jobtracker and the
tasktracker.
⚫ The tasktracker notices failures of tasks including
⚫ Runtime exceptions
⚫ Sudden exit of the code executing the task
⚫ Tasks hanging – observed via timeouts
⚫ If the jobtracker itself fails, Map-Reduce has no
mechanisms to save and restore computations.
Other Variants of Map-Reduce
⚫ YARN – Yet Another Resource Negotiator
⚫ Developed to address the scalability concerns of Map-Reduce
once the number of nodes is of the order of 10^4 or more.
⚫ YARN splits the responsibilities of the jobtracker into two
separate entities
⚫ a resource manager to manage the use of resources across the
cluster, and
⚫ an application master to manage the lifecycle of applications running
on the cluster.
⚫ YARN has more sophisticated mechanisms to save state of the
jobtracker on failure.
⚫ YARN was developed to be more general than MapReduce,
⚫ In fact, MapReduce can be seen as just one type of YARN application
Further Reading and Action Items
⚫ Several piecemeal tutorials are available online.
⚫ A single comprehensive source is the book
Hadoop: The Definitive Guide, by Tom White.
⚫ Will post this PDF in our course moodle.
⚫ Read chapters 1, 2, 3, and 6.
⚫ Homework 4, which gives you practice with Map-
Reduce, will be posted soon.