Distributed Systems
Monsoon 2024
Lecture 9
International Institute of Information Technology
Hyderabad, India
Slides credit to Dr. Yogesh Simmhan, IISc,
and other sources mentioned.
Logistics Week 1
• Extended the deadline for HW 3 to 12 midnight,
Monday, September 02, 2024.
Distributed Computing Platforms
• We will study yet another distributed computing platform
that has gained recent popularity.
• We will study what this platform and its recent avatars
offer.
Distributed Computing
⚫Distributed Computing involves
• Clusters of machines connected over network
• Distributed Storage
• Disks attached to clusters of machines
• Network Attached Storage
How can we make effective use of multiple machines?
⚫Commodity clusters
• Commodity: Available off the shelf at large volumes
• Lower Cost of Acquisition
• Cost vs. Performance
• Low disk bandwidth, and high network latency
• CPUs typically comparable
How can we make use of many such machines of modest capability?
However,
⚫ Commodity clusters have lower reliability
⚫ Mass-produced
⚫ Cheaper materials
⚫ Shorter lifetime (~3 years)
⚫ How can applications easily deal with failures?
⚫ How can we ensure availability in the presence of faults?
⚫ While at the same time…
While at the same time,
⚫Realize that programming distributed systems is difficult
⚫Divide a job into multiple tasks
⚫Understand dependencies between tasks: Control, Data
⚫Coordinate and synchronize execution of tasks
⚫Pass information between tasks
⚫Avoid race conditions, deadlocks
⚫Parallel and distributed programming models/languages/
abstractions/platforms try to make these easy
⚫E.g. Assembly programming vs. C++ programming
⚫E.g. C++ programming vs. Matlab programming
Distributed Programming
⚫ Recall that distributed systems are characterized as a
collection of autonomous computers communicating
over a network of links.
⚫ Some typical features of such a system are:
⚫ No common physical clock
⚫ Clocks can drift, and there is no notion of a common clock.
⚫ Systems can be asynchronous too.
⚫ No shared memory
⚫ Communication is via messages, along with their associated
semantics.
⚫ Geographical separation
⚫ Can allow for wider scope of operations, e.g., SETI
⚫ Autonomy
⚫ Processors are loosely coupled but cooperate with each
other
⚫ Heterogeneity
⚫ All processors need not be alike.
Challenges of Distributed Programming
⚫ Distributed programming is inherently challenging.
⚫ Need to ensure that several geographically
separate computers collaborate
– efficiently,
– reliably,
– transparently, and
– scalably
⚫ One can also talk of inherent complexity and
accidental complexity [Check D. Schmidt, VU]
Challenges of Distributed Programming
⚫ One can also talk of inherent complexity and
accidental complexity [Check D. Schmidt, VU]
⚫ Inherent complexity due to
– Latency: Computers can be far apart
– Reliability: Recovering from failures of individual or
collection of computers
– Partitioning: Sometimes, failures can partition the
system.
– Ordering: Lack of global clock means message/event
ordering is difficult
– Security: Program as well as computational aspects
Challenges of Distributed Programming
⚫ Distributed programming is inherently challenging.
⚫ Need to ensure that several geographically separate
computers collaborate
– Efficiently, reliably, transparently, and scalably
⚫ One can also talk of inherent complexity and
accidental complexity [Check D. Schmidt, VU]
⚫ Inherent complexity due to
– Latency, Reliability, Partitioning, Ordering, Security
⚫ Accidental complexity due to
– Low-level APIs: Not enough abstraction
– Poor debugging tools: Poor fault location
– Algorithmic decomposition: Choice of algorithmic
techniques
– Continuous re-invention: Key components can change!
Enter Map-Reduce
⚫ MapReduce is a distributed data-parallel programming
model from Google
⚫ Introduced close to 20 years ago.
⚫ MapReduce works best with a distributed file system, called
Google File System (GFS)
⚫ Hadoop is the open source framework implementation from
Apache that can execute the MapReduce programming
model
⚫ Hadoop Distributed File System (HDFS) is the open source
implementation of the GFS design
⚫ Amazon’s Elastic MapReduce (EMR) PaaS offering is yet another
implementation of Map-Reduce
Map-Reduce
“A simple and powerful interface that enables
automatic parallelization and distribution of
large-scale computations, combined with an
implementation of this interface that achieves
high performance on large clusters of
commodity PCs.”
Dean and Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”,
OSDI, 2004
A High Level View
⚫ Map Reduce provides
⚫ Clean abstraction for programmers
⚫ Automatic parallelization & distribution
⚫ Fault-tolerance
⚫ A batch data processing system
⚫ Status and monitoring tools
Map-Reduce Vs. RDBMS
• Why not RDBMS – the old data workhorse?
⚫ An RDBMS suffers from huge seek times, resulting in huge
access times when large portions of the data must be read.
• Map-Reduce comes with the implicit assumption that nearly the
entire dataset is relevant for each query.
⚫ MapReduce is therefore akin to a batch query processor.
⚫ MapReduce is a good fit for problems that need to
analyze the whole dataset, in a batch fashion,
particularly for ad hoc analysis.
Map-Reduce Vs. RDBMS
⚫ An RDBMS is good for point queries or updates,
where the dataset has been indexed to deliver low-
latency retrieval and update times of a relatively small
amount of data.
⚫ MapReduce suits applications where the data is
written once, and read many times, whereas a
relational database is good for datasets that are
continually updated.
⚫ MapReduce works well on unstructured or
semistructured data
Map-Reduce Vs. RDBMS
⚫ Relational data is often normalized to retain its
integrity and remove redundancy.
⚫ Normalization poses problems for MapReduce:
⚫ reading a record becomes a nonlocal operation,
⚫ whereas one of the central assumptions that MapReduce
makes is that it is possible to perform (high-speed)
streaming reads and writes.
⚫ A web server log is a good example of a set of
records that is not normalized
⚫ the client hostnames are specified in full each time,
even though the same client may appear many times
⚫ Hence, logfiles are particularly well-suited to analysis
with MapReduce
Map-Reduce Vs. RDBMS
⚫ Over time, however, the differences between
relational databases and MapReduce systems are
reducing.
⚫ Relational databases are starting to incorporate some
of the ideas from MapReduce
⚫ Aster Data’s and Greenplum’s databases
⚫ Higher-level query languages built on MapReduce
(such as Pig and Hive) make MapReduce
systems more approachable to traditional
database programmers.
Map-Reduce
⚫ MapReduce might sound like quite a restrictive
programming model
⚫ Mappers and reducers run with very limited
coordination between one another.
⚫ Typical problems that fit the Map-Reduce programming
model range from image analysis, to graph-based
problems, to machine learning algorithms.
⚫ It can’t solve every problem, of course, but it is a
general data-processing tool
MapReduce: Data-parallel Programming Model
Copyright © 2011 Tom White, Hadoop Definitive Guide
▪ Process data using map & reduce functions
▪ map(ki, vi) → List<km, vm>[]
‣ map is called on every input item
‣ Emits a series of intermediate key/value pairs
▪ All values with a given key are grouped together
▪ reduce(km, List<vm>[]) → List<kr, vr>[]
‣ reduce is called on every unique key & all its values
‣ Emits a value that is added to the output
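To make the model concrete, here is a minimal in-memory sketch in Python of the map → group-by-key → reduce flow. This is illustrative only, not Hadoop's API; the helper name run_mapreduce is our own.

    from collections import defaultdict

    def run_mapreduce(inputs, map_fn, reduce_fn):
        # Map phase: map_fn is called on every (key, value) input pair
        # and returns a list of intermediate (key, value) pairs
        intermediate = []
        for k, v in inputs:
            intermediate.extend(map_fn(k, v))
        # Shuffle: all values with a given key are grouped together
        groups = defaultdict(list)
        for k, v in intermediate:
            groups[k].append(v)
        # Reduce phase: reduce_fn is called on every unique key and
        # all its values; its outputs are appended to the final output
        output = []
        for k in sorted(groups):   # keys sorted, as the framework does
            output.extend(reduce_fn(k, groups[k]))
        return output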
MapReduce: Word Count
Map(k1,v1) → list(k2,v2)
Reduce(k2, list(v2)) → list(k2,v2)
Input:   "How now / Brown cow"        "How does / It work now"

Map:     <how,1> <now,1>              <how,1> <does,1> <it,1>
         <brown,1> <cow,1>            <work,1> <now,1>

Shuffle: <brown,1> <cow,1> <does,1> <how,1 1>
         <it,1> <now,1 1> <work,1>

Reduce:  brown 1, cow 1, does 1, how 2, it 1, now 2, work 1
MapReduce Framework
[Figure: Input → Map → (shuffle) → Reduce → Output, illustrated on the
distributed word count example]
Map
▪ Input records from the data source
‣ lines out of files, rows of a database, etc.
▪ Passed to map function as key-value pairs
‣ Line number, line value
▪ map() produces zero or more intermediate values, each
associated with an output key.
Map
▪ Example: Upper-case Mapper
map(k, v) { emit(k.toUpper(), v.toUpper()); }
(“foo”, “bar”) → (“FOO”, “BAR”)
(“Foo”, “other”) → (“FOO”, “OTHER”)
(“key2”, “data”) → (“KEY2”, “DATA”)
▪ Example: Filter Mapper
map(k, v) { if (isPrime(v)) then emit(k, v); }
(“foo”, 7) → (“foo”, 7)
(“test”, 10) → () //nothing emitted
Reduce
▪ All the intermediate values from map for a given output key
are combined together into a list
▪ reduce() combines these intermediate values into one or
more final values for that same output key … Usually one
final value per key
▪ One output “file” per reducer
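As a counterpart to the mapper examples above, a sketch of a sum reducer in the same illustrative Python style (Hadoop's real API is Java; the names here are ours):

    def sum_reduce(key, values):
        # Combines all intermediate values for one key into one final value
        return [(key, sum(values))]

    # ("how", [1, 1]) → [("how", 2)]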
MapReduce: Word Count Drilldown
Input splits:  "How now / Brown cow"      "How does / It work now"

Mapper (one per split):
    for each w in value do
        emit(w, 1)

Intermediate:  <How,1> <now,1>            <How,1> <does,1> <it,1>
               <brown,1> <cow,1>          <work,1> <now,1>

After shuffle & sort:
    <brown,1> <cow,1> <does,1> <How,1 1> <it,1> <now,1 1> <work,1>

Reducer (one call per key):
    sum = 0
    for each value do sum = sum + value
    emit(key, sum)

Output:  brown 1, cow 1, does 1, How 2, it 1, now 2, work 1
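The same drilldown as a runnable Python sketch, using the illustrative run_mapreduce helper defined earlier (again, not Hadoop's actual API; words are lowercased here):

    def wc_map(line_no, line):
        # Tokenize the line (the value) and emit <word, 1> per word
        return [(w.lower(), 1) for w in line.split()]

    def wc_reduce(word, counts):
        # All counts for one word arrive together; emit the total
        return [(word, sum(counts))]

    lines = [(1, "How now"), (2, "Brown cow"),
             (3, "How does"), (4, "It work now")]
    print(run_mapreduce(lines, wc_map, wc_reduce))
    # [('brown', 1), ('cow', 1), ('does', 1), ('how', 2),
    #  ('it', 1), ('now', 2), ('work', 1)]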
Mapper/Reducer Tasks vs. Map/Reduce Methods
▪ Number of Mapper and Reducer tasks is specified by
user
▪ Each Mapper/Reducer task can make multiple calls to
Map/Reduce method, sequentially
▪ Mapper and Reducer tasks may run on different
machines
▪ Implementation framework decides
‣ Placement of Mapper and Reducer tasks on machines
‣ Keys assigned to mapper and reducer tasks
‣ But can be controlled by user…
Shuffle & Sort: The Magic happens here!
▪ Shuffle does a “group by” of keys from all mappers
‣ Similar to SQL groupBy operation
▪ A sort of the keys local to each Reducer task is performed
‣ Keys arriving at each reducer are sorted
‣ No sorting guarantee of keys across reducer tasks
▪ No ordering guarantees of values for a key
‣ Implementation dependent
▪ Shuffle and Sort implemented efficiently by framework
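A small sketch of the shuffle behavior described above, assuming hash partitioning across reducers (illustrative Python; hash() stands in for the key's hashCode, and real frameworks stream and spill to disk):

    from collections import defaultdict

    def shuffle(intermediate, n_reducers):
        # "Group by" key, assigning each key to one reducer by hash
        per_reducer = [defaultdict(list) for _ in range(n_reducers)]
        for k, v in intermediate:
            per_reducer[hash(k) % n_reducers][k].append(v)
        # Keys are sorted within each reducer's input; there is no
        # ordering guarantee across reducers, nor on a key's values
        return [sorted(g.items()) for g in per_reducer]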
Map-Shuffle-Sort-Reduce
[Figure: Map tasks on Hosts A, B, and C emit intermediate key-value
pairs, which are shuffled across the network to Reduce tasks on
Hosts A, C, and D]
Data-Intensive Text Processing with MapReduce, Jimmy Lin, 2010
Optimization: Combiner
▪ Logic runs on output of Map tasks, on the map machines
‣ “Mini-Reduce,” only on local Map output
▪ Output of Combiner sent to shuffle
‣ Saves bandwidth before sending data to Reducers
▪ Same input and output types as Map’s output type
‣ Map(k,v) → (k’,v’)
‣ Combine(k’,v’[]) → (k’,v’)
‣ Reduce(k’,v’[]) → (k’’,v’’)
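For word count, the combiner can reuse the reducer's logic, since its input and output types match Map's output type and summation is associative and commutative. A sketch in the same illustrative Python style (names are ours):

    def wc_combine(word, local_counts):
        # Runs on the map machine over that mapper's local output only:
        # collapses many <word, 1> pairs into one <word, partial_sum>
        return [(word, sum(local_counts))]

A combiner is only safe when such partial aggregation does not change the final result, as it does not here.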
Optimization: Partitioner
▪ Decides assignment of intermediate keys grouped to
specific Reducer tasks
‣ Affects the load on each reducer task
▪ Sorting of local keys for Reducer task done after
partitioning
▪ Default is hash partitioning
‣ HashPartitioner(key, nParts) → part
‣ Number of Reducer (nParts) tasks known in advance
‣ Returns a partition number [0, nParts)
‣ Default partitioner balances number of keys per Reducer …
assuming uniform key distribution
‣ May not balance the number of values processed by a Reducer
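A sketch of the default hash partitioning described above (HashPartitioner is the Hadoop name; this Python analogue is illustrative only):

    def hash_partitioner(key, n_parts):
        # Returns a partition number in [0, nParts); balances the number
        # of keys per Reducer only if the key distribution is uniform
        return hash(key) % n_parts

Note that a skewed key (one with very many values) still lands on a single reducer, which is why the number of values per reducer can remain unbalanced.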
Map-MiniShuffle-Combine-Partition-Shuffle-
Sort-Reduce
[Figure: Map → mini-shuffle → Combine → Partition → Shuffle & Sort →
Reduce, across hosts]
Combine & Partition phases could be interchanged, based
on implementation
Combiner & Partitioner are powerful constructs. Use them
wisely!
Data-Intensive Text Processing with MapReduce, Jimmy Lin, 2010
MapReduce for Histogram
int bucketWidth = 4 // input

Input values (28 items, split across two Map tasks):
    7  2 11  2
    2  1 11  4
    9 10  6  6
    6  3  2  8
    0  5  1 10
    2  4  8 11
    5  0  1  0

Map(k, v) {
    emit(floor(v/bucketWidth), 1) // <bucketID, 1>
}

// Data transfer & shuffle between Map & Reduce:
// all 28 <bucketID, 1> items move across the network

// one reduce per bucketID
Reduce(k, v[]) {
    sum = 0;
    foreach(n in v[]) sum++;
    emit(k, sum) // <bucketID, frequency>
}
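Under the illustrative run_mapreduce helper from earlier, the histogram job might look as follows in Python (bucketWidth = 4, as on the slide):

    bucket_width = 4

    def hist_map(k, v):
        # Emit <bucketID, 1> for each input value
        return [(v // bucket_width, 1)]

    def hist_reduce(bucket, ones):
        # One reduce call per bucketID: count the items in that bucket
        return [(bucket, sum(ones))]

    values = [7, 2, 11, 2, 2, 1, 11, 4, 9, 10, 6, 6, 6, 3, 2, 8,
              0, 5, 1, 10, 2, 4, 8, 11, 5, 0, 1, 0]
    print(run_mapreduce(list(enumerate(values)), hist_map, hist_reduce))
    # [(0, 12), (1, 8), (2, 8)]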
Combiner Advantage
▪ Mini-Shuffle lowers the overall cost for Shuffle
▪ E.g. n total items emitted from m mappers
▪ Reduces network transfer and Disk IO costs
‣ In ideal case, m items vs. n items written and read from disk,
transferred over network (m<<n)
▪ On the Shuffle & Sort itself, the combiner has less of an impact
‣ If more mapper tasks are present than reducers, higher
parallelism for doing groupby and mapper-side partial sort.
‣ Local Sort on reducer is based on number of unique keys,
which does not change due to combiner.
MapReduce: Recap
▪ Programmers must specify:
map (k, v) → <k’, v’>*
reduce (k’, v’[]) → <k’’, v’’>*
All values with the same key are reduced together
▪ Optionally, also:
partition (k’, number of partitions) → partition for k’
Often a simple hash of the key, e.g., hash(k’) mod n
Divides up key space for parallel reduce operations
combine (k’, v’[]) → <k’, v’>*
‣ Mini-reducers that run in memory after the map phase
‣ Used as an optimization to reduce network traffic
▪ The execution framework handles everything else…
What Does it Solve?
⚫ Scalable data analytics
⚫ Data is not viewed or constrained in terms of disk
reads/writes; instead, meaning is found in the data via
computation over sets of keys and values.
⚫ Abstract out lots of programming details.
⚫ Forgo removing data redundancy via normalization
of tables.
⚫ The penalty of normalization is that reading a record
requires reading from multiple sources/tables/files.
“Everything Else”
▪ The execution framework handles everything else…
‣ Scheduling: assigns workers to map and reduce tasks
‣ “Data distribution”: moves processes to data
‣ Synchronization: gathers, sorts, and shuffles intermediate
data
‣ Errors and faults: detects worker failures and restarts
▪ Limited control over data and execution flow
‣ All algorithms must be expressed in m, r, c, p
▪ You don’t know:
‣ Where mappers and reducers run
‣ When a mapper or reducer begins or finishes
‣ Which input a particular mapper is processing
‣ Which intermediate key a particular reducer is processing
Map-Reduce
⚫ Failures are handled via the jobtracker and the
tasktracker.
⚫ The tasktracker notices failures of tasks including
⚫ Runtime exceptions
⚫ Sudden exit of the code executing the task
⚫ Tasks hanging – observed via timeouts
⚫ If the jobtracker itself fails, Map-Reduce has no
mechanisms to save and restore computations.
Other Variants of Map-Reduce
⚫ YARN – Yet Another Resource Negotiator
⚫ Developed to address the scalability concerns of Map-Reduce
once the number of nodes is of the order of 10^4 or more.
⚫ YARN splits the responsibilities of the jobtracker into two
separate entities
⚫ a resource manager to manage the use of resources across the
cluster, and
⚫ an application master to manage the lifecycle of applications running
on the cluster.
⚫ YARN has more sophisticated mechanisms to save state of the
jobtracker on failure.
⚫ YARN was developed to be more general than MapReduce,
⚫ In fact, MapReduce can be seen as just one type of YARN application
Further Reading and Action Items
⚫ Several piecemeal tutorials are available online.
⚫ A single comprehensive source is the book
Hadoop: The Definitive Guide, by Tom White.
⚫ Will post this PDF in our course moodle.
⚫ Read chapters 1, 2, 3, and 6.
⚫ Homework 4, which gives you practice with Map-
Reduce, will be posted soon.