W6 Lecture Notes

Randomized Distributed Algorithm

Rajiv Misra
Dept. of Computer Science & Engineering
Indian Institute of Technology Patna
rajivm@iitp.ac.in
Introduction
• This lecture focuses on a specific type of distributed algorithms, namely those that employ randomization.
• Randomization has proved to be a very powerful tool for designing distributed algorithms (as in many other areas).
• It often simplifies algorithms and, more importantly, allows us to solve problems in situations where they cannot be solved by deterministic algorithms, or with fewer resources than the best deterministic algorithm.
Contd…
• This lecture extends the formal model to include randomization and describes a randomized algorithm for the Leader Election Problem.
• Randomization allows us to overcome impossibility results and lower bounds by relaxing the termination conditions or the individual liveness properties (in the case of mutual exclusion).
Distributed Leader Election Algorithm: A Case Study
Weakening the Problem Definition
• A randomized algorithm is an algorithm that has access to some source of random information, such as that provided by flipping a coin or rolling dice.
More formally:
• Extend the transition function of a processor to take as an additional input a random number, drawn from a bounded range under some fixed distribution.
Why is it important?
The Bad News:
1. The addition of random information alone typically will not affect the existence of impossibility results or worst-case bounds.
2. For instance, even if processors have access to random numbers, they will not be able to elect a leader in an anonymous ring or solve consensus in fewer than f+1 rounds in all (admissible) executions.
Contd…
The Good News:
• Randomization + weakening of the problem statement is a powerful tool for overcoming these limitations.
• Usually the weakening involves the termination condition (for instance, a leader must be elected with a certain probability) while the other conditions are not changed (for instance, it should never be the case that two leaders are elected).
Differs from an average-case analysis of a deterministic algorithm
• In average-case analysis, there are several choices as to what is being averaged over. One natural choice is the inputs. There are two difficulties with this approach:
(i) Determining an accurate probability distribution on the inputs is often not practical.
(ii) Even if such distributions can be chosen with some degree of confidence, very little is guaranteed about the behavior of the algorithm on a particular input. For instance, even if the average running time over all inputs is determined to be small, there could still be some inputs for which the running time is enormous.
Contd…
• In the randomized approach, more stringent guarantees can be made. Because the random numbers introduce another dimension of variability, even for the same input there are many different executions.
• A good randomized algorithm will guarantee good performance with some probability for each individual input.
• Typically, the performance of a randomized algorithm is defined to be the worst-case probability over all inputs.
Randomized Leader Election
Randomized Leader Election
• The simplest use of randomization is to create initial asymmetry in situations that are inherently symmetric.
• One such situation is anonymous rings, where processors do not have distinct identifiers and it is impossible to elect a unique leader. (From the theorem: there is no nonuniform anonymous algorithm for leader election in synchronous rings.)
• This impossibility result holds even for randomized algorithms. However, a randomized formulation of leader election requires only that a leader is elected with some probability.
• We therefore consider a variant of the leader election problem that relaxes the condition that eventually a leader must be elected in every admissible execution.
The relaxed version of the leader election problem requires:
• Safety: In every configuration of every admissible execution, at most one processor is in an elected state.
• Liveness: At least one processor is elected with some non-zero probability.
Contd…
• The safety property has to hold with certainty: that is, the algorithm should never elect two leaders.
• The liveness condition is relaxed, and the algorithm need not always terminate with a leader; rather, it is required to do so with nonzero probability.
• An algorithm that satisfies this weakened liveness condition can fail to elect a leader either by not terminating at all or by terminating without a leader.
1. Synchronous One-Shot Algorithm
• First, let us consider synchronous rings.
• There is only one admissible execution on an anonymous ring for a deterministic algorithm.
• For a randomized algorithm, however, there can be many different executions, depending on the random choices.
• The approach used to devise a randomized leader election algorithm is to use randomization to create asymmetry, by having processors choose random pseudo-identifiers, drawn from some range, and then execute a deterministic leader election algorithm.
Properties
• A simple deterministic leader election algorithm with these properties is the following:
1. Each processor sends a message around the ring to collect all pseudo-identifiers.
2. When the message returns (after collecting n pseudo-identifiers), a processor knows whether it is the unique maximum or not.
Algorithm: Overview
• In this algorithm, the pseudo-identifier is chosen to be 2 with probability 1/n and 1 with probability (1 - 1/n), where n is the number of processors on the ring.
• Thus each processor makes use of its source of randomness exactly once, and the random numbers are drawn from the range [1..2].
• The set of all possible admissible executions of this algorithm for a fixed ring size n contains exactly one execution for each element of the set Ω = {1,2}^n.
• That is, by specifying which random number, 1 or 2, is obtained by each of the n processors in its first step, we have completely determined the execution. Given an element R of Ω, the corresponding execution will be denoted by exec(R).
Algorithm 1
(Figure: pseudocode of Algorithm 1, the synchronous one-shot randomized leader election algorithm.)
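Since the pseudocode itself appears only as a figure, the following Python sketch (an illustration, not the lecture's exact Algorithm 1) simulates one execution: each processor draws pseudo-identifier 2 with probability 1/n and 1 otherwise, the pseudo-identifiers travel once around the ring, and a processor is elected only if it holds the unique maximum.

import random

def one_shot_leader_election(n):
    """One simulated execution of the one-shot randomized leader election
    on an anonymous synchronous ring of n processors. Returns the ring
    position of the leader, or None if no leader is elected."""
    # Each processor draws pseudo-identifier 2 with probability 1/n, else 1.
    pseudo_ids = [2 if random.random() < 1.0 / n else 1 for _ in range(n)]
    # After one full trip around the ring, every processor knows all n
    # pseudo-identifiers; a processor wins only if it is the unique maximum.
    winners = [i for i, p in enumerate(pseudo_ids)
               if p == max(pseudo_ids) and pseudo_ids.count(p) == 1]
    return winners[0] if winners else None

# Rough empirical check of the success probability for n = 16.
trials = 10_000
hits = sum(one_shot_leader_election(16) is not None for _ in range(trials))
print(hits / trials)   # close to (1 - 1/16)**15 ≈ 0.38 > 1/e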
Definition
• Let P be some predicate on executions, for example, "at least one leader is elected".
• Then Pr[P] denotes the probability of the event {R ∈ Ω : exec(R) satisfies P}.
Analysis
• What is the probability that the algorithm terminates with a leader?
• This happens when a single processor has the maximum pseudo-identifier, 2.
• The probability that a single processor draws 2 is the probability that n-1 processors draw 1 and one processor draws 2, times the number of possible choices for the processor drawing 2, that is,

    C(n, 1) · (1/n) · (1 − 1/n)^(n−1) = (1 − 1/n)^(n−1) = c > 1/e
Contd…
• It is simple to show that every processor terminates after sending exactly n messages; moreover, at most one processor terminates in an elected state.
• In some executions, for example when two processors choose the pseudo-identifier 2, no processor terminates as a leader. However, the above analysis shows that this happens with probability less than 1 - 1/e.
Message Complexity
• Theorem 1: There is a randomized algorithm that, with probability c > 1/e, elects a leader in a synchronous ring; the algorithm sends O(n^2) messages.
2. Synchronous Iterated Algorithm and Expectation
• It is pleasing that the probability of termination in the previous algorithm does not decrease with the ring size n.
• However, we may wish to increase the probability of termination, at the expense of more time and messages.
• The algorithm can be modified so that each processor receiving a message with n pseudo-identifiers checks whether a unique leader exists (in Lines 5-7 of Algorithm 1).
• If not, the processor chooses a new pseudo-identifier and iterates the algorithm. We will show that this approach amplifies the probability of success.
Contd…
• Let us consider the second option in more detail. In order to repeat Algorithm 1, each processor will need to access the random number source multiple times, in fact, potentially infinitely often.
• To completely specify an execution of the algorithm, we will need to specify, for each processor, the sequence of random numbers that it obtains.
• Thus Ω now becomes the set of all n-tuples, each element of which is a possibly infinite sequence over {1,2}.
Analysis
• For the iterated algorithm, the probability that the algorithm terminates at the end of the kth iteration is equal to the probability that the algorithm fails to terminate in the first k-1 iterations and succeeds in terminating in the kth iteration.
• The analysis of Algorithm 1 shows that the probability of success in a single iteration is c > 1/e.
• Because the outcomes of the iterations are independent, the desired probability is (1 − c)^(k−1) · c.
• Consequently, the probability of not terminating within the first k iterations is (1 − c)^k, which tends to 0 as k tends to ∞; thus the probability that the algorithm eventually terminates with an elected leader is 1.
Time Complexity of Iterated Algorithm
• Worst-case number of iterations: ∞
• Expected number of iterations: 1/c < e
• Let T be a random variable that, for a given execution, is the value of the complexity measure of interest for that run (for instance, the number of iterations until termination).
• Let E[T] be the expected value of T, taken over all R ∈ Ω. That is,

    E[T] = Σ_{x is a value of T} x · Pr[T = x]
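For the iterated algorithm, T (the number of iterations) is geometrically distributed with success probability c, so plugging into this definition gives the expected value quoted above; a short derivation in LaTeX:

E[T] \;=\; \sum_{k \ge 1} k\,(1-c)^{k-1}\,c \;=\; \frac{1}{c} \;<\; e,
\qquad \text{since } c = \Bigl(1-\tfrac{1}{n}\Bigr)^{n-1} > \tfrac{1}{e}.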
Contd…
• Note that by the definition of Pr[P] above, this is ultimately taking probabilities over Ω.
• With this definition, we have the following theorem:
• Theorem 2: There is a randomized algorithm that elects a leader in a synchronous ring with probability 1 in (1/c)·n < e·n expected rounds; the algorithm sends O(n^2) expected messages.
Conclusion
• Randomization has proved to be a very powerful tool for designing distributed algorithms.
• It allows us to solve problems in situations where they cannot be solved by deterministic algorithms.
• This lecture extended the formal model to include randomization and described a randomized algorithm for the Leader Election Problem.
• We have presented two randomized algorithms: (i) the Synchronous One-Shot Algorithm with O(n^2) messages, and (ii) the Synchronous Iterated Algorithm with O(n^2) expected messages.
Peer-to-Peer Computing and Structured Overlay Network

Rajiv Misra
Dept. of Computer Science & Engineering
Indian Institute of Technology Patna
rajivm@iitp.ac.in
Introduction
• Peer-to-peer (P2P) network: an application-level organization of the network overlay for flexibly sharing resources (e.g., files and multimedia documents) stored across network-wide computers.
• All nodes (called peers) are equal, i.e., each acts as client as well as server; communication is directly between peers (no client-server structure).
• Allows finding the location of arbitrary objects; no DNS servers required.
• Sharing of large combined storage, CPU power, and other resources, without scalability costs.
• Dynamic insertion and deletion of nodes (called churn), as well as of resources, at low cost.
• Overlay networks: networks that are constructed on top of another network (e.g., IP).
• P2P overlay network: any overlay network that is constructed by Internet peers in the application layer on top of the IP network.
Desirable Characteristics of P2P

Features                            | Performance
Self-organizing                     | Large combined storage, CPU power, and resources
Distributed control                 | Fast search for machines and data objects
Role symmetry for nodes             | Scalable
Anonymity                           | Efficient management of churn
Naming mechanism                    | Selection of geographically close servers
Security, authentication, trust     | Redundancy in storage and paths

Table I: Desirable characteristics and performance features of P2P systems.
Application Layer Overlays
• A core mechanism in P2P networks is searching for data, and this mechanism depends on how (i) the data, and (ii) the network, are organized.
• Search algorithms for P2P networks tend to be data-centric, as opposed to the host-centric algorithms for traditional networks.
• P2P search uses the P2P overlay, which is a logical graph among the peers that is used for the object search and the object storage and management algorithms.
• Note that above the P2P overlay is the application layer overlay, where communication between peers is point-to-point (representing a logical all-to-all connectivity) once a connection is established.
Classification of P2P Overlay Network
P2P Overlay Network
• Structured
  - DHT-based: CAN, Chord, Tapestry, Pastry, Kademlia
  - Non-DHT: Mercury
• Unstructured
  - Deterministic: Napster, BitTorrent, JXTA
  - Non-deterministic: Gnutella, Kazaa

Figure: Classification of P2P Overlay Network
Classification of P2P Overlay Network
• The P2P overlay can be (i) structured (e.g., hypercubes, meshes, butterfly networks, de Bruijn graphs) or (ii) unstructured, i.e., with no particular graph structure.
(i) Structured overlays use some rigid organizational principles, based on the properties of the P2P overlay graph structure, for the object storage algorithms and the object search algorithms.
(ii) Unstructured overlays use very loose guidelines for object storage. As there is no definite structure to the overlay graph, the search mechanisms are more "ad hoc," and typically use some form of flooding or random-walk strategies.
• Thus, object storage and search strategies are intricately linked to the overlay structure as well as to the data organization mechanisms.
Structured vs. Unstructured Overlays

Structured Overlays:
• Structure ⇒ placement of files is highly deterministic; file insertions and deletions have some overhead
• Fast lookup
• Hash mapping based on a single characteristic (e.g., file name)
• Range queries, keyword queries, attribute queries difficult to support
• Examples: Chord, Content Addressable Network (CAN), Pastry

Unstructured Overlays:
• No structure for the overlay ⇒ no structure for data/file placement
• Node joins/departures are easy; the local overlay is simply adjusted
• Only local indexing used
• File search entails high message overhead and high delays
• Complex, keyword, range, and attribute queries supported
• Examples: FreeNet, Gnutella, KaZaA, BitTorrent
Data Indexing
• Data is identified by indexing, which allows physical data independence from applications.
• Centralized indexing: e.g., versions of Napster, DNS.
• Distributed indexing: indexes to data are scattered across peers; data is accessed through mechanisms such as Distributed Hash Tables (DHTs). These differ in the hash mapping, search algorithms, diameter for lookup, fault tolerance, and churn resilience. A typical DHT uses a flat key space.
• Local indexing: each peer indexes only the local objects; remote objects need to be searched for. Used commonly in unstructured overlays (e.g., Gnutella) along with flooding search or random-walk search.
Another classification:
• Semantic indexing: human readable, e.g., file name, keyword, database key. Supports keyword searches, range searches, approximate searches.
• Semantic-free indexing: not human readable; corresponds to the index obtained by use of a hash function.
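As a small illustration of semantic-free indexing (a sketch only; SHA-1 and the 2^7 key space are assumptions chosen here to match the Chord example later in this lecture), a human-readable file name can be hashed onto a flat key space:

import hashlib

M = 7                                        # identifier space of 2**7 keys (assumption)

def semantic_free_key(name: str) -> int:
    """Map a human-readable name onto a flat, semantic-free key space."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** M)

print(semantic_free_key("w6-lecture-notes.pdf"))   # some key in [0, 127]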
Structured Overlays: Distributed Hash Tables
Simple Distributed Hash Table Scheme
• Mappings from the node address space and the object space in a simple DHT. Highly deterministic placement of files/data allows fast lookup, but file insertions/deletions under churn incur some cost.
• Attribute search, range search, keyword search, etc. are not possible.
Chord Distributed Hash Table: Overview
• The Chord protocol, proposed by Stoica et al. (2003), uses a flat key space to associate the mapping between network nodes and data objects/files/values.
• The node address as well as the object value is mapped to a logical identifier in a common flat key space using a consistent hash function.
• When a node joins or leaves a network of n nodes, only an O(1/n) fraction of the keys has to be moved from one location to another.
• Two steps are involved:
1. Map the object value to its key.
2. Map the key to the node in the native address space using lookup.
• The common address space is an m-bit identifier space (2^m addresses), arranged on a logical ring mod 2^m.
• A key k gets assigned to the first node whose identifier equals or is greater than the key identifier of k in the common identifier space. This node is the successor of k, denoted succ(k).
Example
A Chord ring for m = 7 is depicted in the figure. Nodes N5, N18, N23, N28, N63, N73, N99, N104, N115, and N119 are shown. Six keys, K8, K15, K28, K53, K87, and K121, are stored among these nodes using succ(k) as follows:
succ(8) = 18, succ(15) = 18, succ(28) = 28, succ(53) = 63, succ(87) = 99, and succ(121) = 5.
A key k gets assigned to the first node whose identifier equals or is greater than the key identifier of k in the common identifier space; this node is the successor of k, denoted succ(k).
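A minimal Python sketch of the succ(k) rule, assuming the node identifiers of the figure and m = 7; it reproduces the key assignments listed above:

M = 7
NODES = sorted([5, 18, 23, 28, 63, 73, 99, 104, 115, 119])

def succ(key: int) -> int:
    """First node whose identifier equals or exceeds the key on the ring
    mod 2**M, wrapping around to the smallest node if necessary."""
    key %= 2 ** M
    return next((n for n in NODES if n >= key), NODES[0])

for k in (8, 15, 28, 53, 87, 121):
    print(f"succ({k}) = {succ(k)}")          # 18, 18, 28, 63, 99, 5 as in the figure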
Chord: Simple Lookup
• Each node tracks its successor on the ring.
• A query for key x is forwarded on the ring until it reaches the first node whose identifier y satisfies y ≥ x mod 2^m.
• The result, the node with identifier y, is returned to the querying node (along the reverse of the path that was followed by the query).
• This mechanism requires O(1) local space but O(n) hops.

(variables)
integer: successor ← initial value;

(1) i.Locate_Successor(key), where key ≠ i:
(1a) if key ∈ (i, successor] then
(1b)    return(successor)
(1c) else return successor.Locate_Successor(key).
Chord: Scalable Lookup
• Each node i maintains a routing table, called the finger table, with O(log n) entries, such that the x-th entry (1 ≤ x ≤ m) is the node identifier of the node succ(i + 2^(x-1)), denoted i.finger[x] = succ(i + 2^(x-1)).
• This is the first node whose key is greater than the key of node i by at least 2^(x-1) mod 2^m.
• Complexity: O(log n) message hops, at the cost of O(log n) space in the local routing tables.
• Due to the logarithmic structure of the finger table, there is more information about nodes close by than about nodes further away.
Contd…

Consider a query on key key at node i ,


1 . if key lies between node i and its successor, the key would
reside at the successor and its address is returned.

EL
2. If key lies beyond the successor, then node i searches
through the m entries in its finger table to identify the

PT
node j such that j most immediately precedes key , among
all the entries in the finger table.
N
3. As node j is the closest known node that precedes key, j is
most likely to have the most information on locating key ,
i.e., locating the immediate successor node to which key
has been mapped.
Chord: Scalable Lookup Procedure
(variables)
integer: successor ← initial value;
integer: predecessor ← initial value;
array of integer finger[1 . . . log n];

(1) i.Locate_Successor(key), where key ≠ i:
(1a) if key ∈ (i, successor] then
(1b)    return(successor)
(1c) else
(1d)    j ← Closest_Preceding_Node(key);
(1e)    return j.Locate_Successor(key).

(2) i.Closest_Preceding_Node(key), where key ≠ i:
(2a) for count = m down to 1 do
(2b)    if finger[count] ∈ (i, key] then
(2c)       break();
(2d) return(finger[count]).

Algorithm 2: A scalable object location algorithm in Chord at node i.
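The following Python sketch (a centralized simulation over the example ring with a global view, not a message-passing implementation) mirrors Algorithm 2: finger[x] = succ(i + 2^(x-1)), and Locate_Successor either answers from the node's successor or forwards to the closest preceding finger. On the example ring it reproduces the lookup(K8) query from N28 shown on the next slide.

M = 7
RING = 2 ** M
NODES = sorted([5, 18, 23, 28, 63, 73, 99, 104, 115, 119])

def succ(key):
    """First node whose identifier is >= key on the ring (with wrap-around)."""
    key %= RING
    return next((n for n in NODES if n >= key), NODES[0])

def in_interval(x, a, b, inclusive_right=False):
    """True if x lies in the ring interval (a, b), or (a, b] if inclusive_right."""
    x, a, b = x % RING, a % RING, b % RING
    if a < b:
        return (a < x < b) or (inclusive_right and x == b)
    return (x > a) or (x < b) or (inclusive_right and x == b)

def finger(i, x):
    """x-th finger table entry of node i: succ(i + 2**(x - 1))."""
    return succ(i + 2 ** (x - 1))

def closest_preceding_node(i, key):
    """Scan the finger table from the top down (lines 2a-2d of Algorithm 2)."""
    for count in range(M, 0, -1):
        if in_interval(finger(i, count), i, key):
            return finger(i, count)
    return i

def locate_successor(i, key, hops=0):
    """Lines 1a-1e of Algorithm 2, run as a simulation."""
    successor = succ(i + 1)                  # node i's successor on the ring
    if in_interval(key, i, successor, inclusive_right=True):
        return successor, hops
    j = closest_preceding_node(i, key)
    return locate_successor(j, key, hops + 1)

print(locate_successor(28, 8))               # (18, 2): N28 -> N99 -> N5 -> answer N18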
Chord: Scalable Lookup Example
Example: The use of the finger tables in answering the query lookup(K8) at node N28 is illustrated in the figure. The finger tables of N28, N99, and N5 that are used are shown.
Managing Churn: Node Joins
The code to manage dynamic node joins, departures, and failures is given in Algorithm 3.
Node joins:
• To create a new ring, a node i executes Create_New_Ring, which creates a ring with the singleton node. To join a ring that contains some node j, node i invokes Join_Ring(j).
• Node j locates i's successor on the logical ring and informs i of its successor. Before i can participate in the P2P exchanges, several actions need to happen: i's successor needs to update its predecessor entry to i, i's predecessor needs to revise its successor field to i, i needs to identify its predecessor, the finger table at i needs to be built, and the finger tables of all nodes need to be updated to account for i's presence.
• This is achieved by the procedures Stabilize(), Fix_Fingers(), and Check_Predecessor(), which are periodically invoked by each node.
Contd…
The figure illustrates the main steps of the joining process: a recent joiner node i that has executed Join_Ring(·) gets integrated into the ring by the sequence of steps shown.
Figure: Steps in the integration of node i in the ring, where j > i > k.
Contd…
• Once all the successor variables and finger tables have stabilized, a call by any node to Locate_Successor(·) will reflect the new joiner i.
• Until then, a call to Locate_Successor(·) may result in the call performing a conservative scan.
• The loop in Closest_Preceding_Node that scans the finger table will result in a search traversal using smaller hops rather than truly logarithmic hops, resulting in some inefficiency.
• Still, the node i will be located, although via more hops.
Managing Churn: Node Failures and Departures
Node failures and departures:
• When a node j fails abruptly, its successor i on the ring will discover the failure when i executes Check_Predecessor() periodically.
• Process i gets a chance to update its predecessor field when another node k causes i to execute Notify(k). But that can happen only if k's successor variable is i.
• This requires the predecessor of the failed node to recognize that its successor has failed, and to get a new functioning successor.
• In fact, the successor pointers are required for object search; the predecessor variables are required only to accommodate new joiners.
• Note from Algorithm 3 that knowing that the successor is functional, and that the nodes pointed to by the finger pointers are functional, is essential.
Algorithm 3: Managing churn in Chord. Code shown is for node i.
(variables)
integer: successor ← initial value;
integer: predecessor ← initial value;
array of integer finger[1 . . . log n];
integer: next_finger ← 1;

(1) i.Create_New_Ring():
(1a) predecessor ← ⊥;
(1b) successor ← i.

(2) i.Join_Ring(j), where j is any node on the ring to be joined:
(2a) predecessor ← ⊥;
(2b) successor ← j.Locate_Successor(i).

(3) i.Stabilize(): // executed periodically to verify and inform successor
(3a) x ← successor.predecessor;
(3b) if x ∈ (i, successor) then
(3c)    successor ← x;
(3d) successor.Notify(i).
Contd…
(4) i.Notify(j): // j believes it is predecessor of i
(4a) if predecessor = ⊥ or j ∈ (predecessor, i) then
(4b)    transfer keys in the range [j, i] to j;
(4c)    predecessor ← j.

(5) i.Fix_Fingers(): // executed periodically to update the finger table
(5a) next_finger ← next_finger + 1;
(5b) if next_finger > m then
(5c)    next_finger ← 1;
(5d) finger[next_finger] ← Locate_Successor(i + 2^(next_finger − 1)).

(6) i.Check_Predecessor(): // executed periodically to verify whether predecessor still exists
(6a) if predecessor has failed then
(6b)    predecessor ← ⊥.
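A compact Python simulation (an illustrative sketch using in-memory objects, not the lecture's message-passing code) of how the periodic Stabilize() and Notify() rules above splice a joiner into the successor/predecessor chain; finger-table repair, key transfer, and failure handling are omitted.

def between(x, a, b):
    """x lies in the open ring interval (a, b)."""
    return (a < x < b) if a < b else (x > a or x < b)

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self          # as in Create_New_Ring: a singleton ring
        self.predecessor = None

    def stabilize(self):               # executed periodically (rule 3)
        x = self.successor.predecessor
        if x is not None and between(x.id, self.id, self.successor.id):
            self.successor = x
        self.successor.notify(self)

    def notify(self, j):               # j believes it is our predecessor (rule 4)
        if self.predecessor is None or between(j.id, self.predecessor.id, self.id):
            self.predecessor = j       # (key transfer omitted in this sketch)

# A two-node ring {18, 63}; node 23 joins via Join_Ring(18):
n18, n63 = Node(18), Node(63)
n18.successor, n18.predecessor = n63, n63
n63.successor, n63.predecessor = n18, n18
n23 = Node(23)
n23.successor = n63                    # j.Locate_Successor(23) would return node 63
for _ in range(2):                     # a couple of periodic stabilization rounds
    for node in (n23, n18, n63):
        node.stabilize()
print(n18.successor.id, n23.predecessor.id, n63.predecessor.id)   # 23 18 23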
Complexity
• For a Chord network with n nodes, each node is responsible for at most (1 + ϵ)K/n keys, with high probability, where K is the total number of keys.
• Using consistent hashing, ϵ can be shown to be bounded by O(log n).
• The search for a successor (Locate_Successor) in a Chord network with n nodes requires time complexity O(log n) with high probability.
• The size of the finger table is log(n) ≤ m.
• The average lookup time is (1/2) log(n).
Comparison of Structured P2P Overlay Networks (1)

Routing Algorithm | Overlay Network Topology | Distance from i to j | Path length
Chord    | One-dimensional ring | (j − i) mod 2^m | O(log N)
CAN      | d-dimensional cube   | Euclidean distance | (d/4) N^(1/d)
GISP     | Structureless        | objects: the difference of the two IDs; nodes: (i, j)/(2^(si−1) · 2^(sj−1)), where si, sj are "peer strength" | uncertain
Kademlia | XOR-tree             | i ⊕ j | O(log_(2^b) N)
Pastry   | Tree + ring          | assessed digit by digit | O(log_(2^b) N)
Tapestry | Tree                 | same as Pastry | O(log_(2^b) N)
Viceroy  | Butterfly            | (j − i) mod 1 | O(log N)

Source: "A Survey on Distributed Hash Table (DHT): Theory, Platforms, and Applications", Hao Zhang, Yonggang Wen, Haiyong Xie, and Nenghai Yu, 2013.
Comparison of Structured P2P Overlay Networks (2)

Routing Algorithm | # of maintained peers | Node's joining | Node's leaving
Chord    | O(log2 N) | Find successor; construct finger table | Run stabilization protocol periodically
CAN      | 2d | Generate neighbor list | Update neighbor lists
GISP     | as many as possible | Generate routing list | Delete failing nodes from routing list
Kademlia (greedy path selection) | at most 160 × k (O(log2 N)) | Construct k-buckets | Detect the target node before routing
Pastry   | O(log_(2^b) N) | Generate routing table, neighborhood set and a namespace set | Detect the target node before routing
Tapestry | O(log_(2^b) N) | Construct the routing table | Heartbeat message
Viceroy  | at most 7 | Construct the three kinds of links | Repaired by the LOOKUP operation
Conclusion
• Peer-to-peer (P2P) networks allow equal participation and resource sharing among the users.
• This lecture first gave the basic fundamentals and underlying concepts of P2P networks, i.e., (i) structured P2P networks and (ii) unstructured P2P networks.
• Then we discussed the concept of the Distributed Hash Table (DHT) and a classical DHT-based structured P2P network, namely Chord.
The Google File System (GFS)

Rajiv Misra
Dept. of Computer Science & Engineering
Indian Institute of Technology Patna
rajivm@iitp.ac.in
Introduction: File System
• File systems determine how data is stored and retrieved.
• Distributed file systems manage the storage across a network of machines.
• There is added complexity due to the network.
• GFS and HDFS are distributed file systems.
Introduction: Motivating Facts
• Google processes its data in computing clusters.
• Google makes its own servers, using commodity-class CPUs running customized versions of Linux.
• The goal is to maximize performance per dollar, not absolute performance.
• How this is specifically measured is not public information; it includes system running costs and power consumption.
• Novel switching power supply designs reduce costs and improve power efficiency.
Introduction: GFS
• GFS is a scalable distributed file system for large data-intensive applications.
• It shares many of the same goals as previous distributed file systems, such as performance, scalability, reliability, and availability.
• The design of GFS is driven by four key observations:
• Component failures, huge files, mutation of files, and the benefits of co-designing the applications and the file system API.
GFS Assumptions
• Hardware failures are common
  • The system is built from many inexpensive commodity components
• Modest number of huge files
  • A few million files, each typically 100 MB or larger (multi-GB files are common)
  • No need to optimize for small files
• Workloads: two kinds of reads, and writes
  • Large streaming reads (1 MB or more)
  • Small random reads (a few KB)
  • Sequential appends to files by hundreds of data producers
• High sustained throughput is more important than latency
  • Response time for individual reads and writes is not critical
GFS Design Overview
• Single master
  • Centralized management
• Files stored as chunks
  • With a fixed size of 64 MB each
• Reliability through replication
  • Each chunk is replicated across 3 or more chunk servers
• No caching of file data
  • Due to the large size of the data sets (only metadata is cached at clients)
• Interface
  • Suitable to Google apps
  • Create, delete, open, close, read, write, snapshot, record append
GFS Architecture
(Figure: several clients, a single master holding the metadata, and several data (chunk) servers; data does not flow across the GFS master.)
GFS: Master
The Role of the Master
Metadata server: maintains all file system metadata
• File namespace
• File-to-chunk mapping
• Chunk location information
• Keeps all the metadata in the master's memory (fast)
Location of chunks and their replicas:
• The master does not keep a persistent record (no disk I/O)
• An operations log is kept for persistence and recovery
• Chunk locations are polled from the chunkservers (master as monitor, no extra cost)
Master as Metadata Server
Monitor
• Heartbeat messages to detect the state of each chunkserver
• Communicates with chunkservers periodically
Centralized controller
• System-wide activities
• Manages chunk replication in the whole system
One Master
• Single master
  • Simplifies the design of the system
  • Controls chunk placement using global knowledge
• Potential bottleneck?
• Avoiding the bottleneck:
  • Clients do not read/write data through the master
  • Clients cache metadata (e.g., chunk handles, locations)
  • A client typically asks for multiple chunks in the same request
  • The master can also include the information for chunks immediately following those requested
Caching
• Metadata
  • Cached on the client after being retrieved from the master
  • Only kept for a period of time to prevent stale data
• File caching
  • Clients never cache file data
  • Chunkservers never cache file data (Linux's buffer cache will, for local files)
  • File working sets are too large
  • Simplifies cache coherence
GFS: Chunks
Chunks
• 1 chunk = 64 MB or 128 MB (can be changed); a chunk is stored as a plain Linux file on a chunk server
• Advantages of a large (but not too large) chunk size:
  • Reduced need for client/master interaction
  • 1 request per chunk suits the target workloads
  • A client can cache all the chunk locations for a multi-TB working set
  • Reduced size of metadata on the master (kept in memory)
• Disadvantage: a chunkserver can become a hotspot for popular file(s)
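For example, translating an application byte offset into the chunk index sent to the master is a single integer division by the chunk size (assuming the 64 MB default):

CHUNK_SIZE = 64 * 1024 * 1024                # default 64 MB chunks

def chunk_index(byte_offset: int) -> int:
    """Chunk index within a file for a given application byte offset."""
    return byte_offset // CHUNK_SIZE

print(chunk_index(200 * 1024 * 1024))        # offset 200 MB lies in chunk 3 (the fourth chunk)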
Chunk Locations
• The master does not keep a persistent record of chunk replica locations
• The master polls chunkservers about their chunks at startup
• The master keeps up to date through periodic HeartBeat messages
• Master and chunkservers are easily kept in sync when chunk servers leave/join/fail/restart (a regular event)
• The chunkserver has the final word over what chunks it has
Operation Log
• Persistent record of critical metadata changes
• Critical to the recovery of the system
• Changes to metadata are only made visible to clients after they have been written to the operation log
• The operation log is replicated on multiple remote machines
• Before responding to a client operation, the log record must have been flushed locally and remotely
• The master recovers its file system state from a checkpoint plus the operation log
Consistency Model
• Atomicity and correctness of the file namespace are ensured by namespace locking.
• After a successful data mutation (write or record append), changes are applied to a chunk in the same order on all replicas.
• In case of a chunk server failure at the time of a mutation, the resulting stale replica is garbage collected at the soonest opportunity.
• Regular handshakes between the master and chunk servers help in identifying failed chunk servers, and data corruption is detected by checksumming.
System Interactions
Leases & Mutation Order
• The master grants a chunk lease to one of the replicas (the primary).
• All replicas follow a serial order picked by the primary.
• Leases time out at 60 seconds (it is also possible to extend the timeout).
• Leases are revocable.
Figure 2: Write Control and Data Flow (client, master, primary replica, secondary replicas A and B; legend: control flow vs. data flow)
Step-1. The client asks the master which chunk server holds the current lease of the chunk and the locations of the other replicas.
Step-2. The master replies with the identity of the primary and the locations of the secondary replicas.
Step-3. The client pushes data to all replicas.
Step-4. Once all replicas have acknowledged receiving the data, the client sends a write request to the primary. The primary assigns consecutive serial numbers to all the mutations it receives, providing serialization. It applies the mutations in serial number order.
Step-5. The primary forwards the write request to all secondary replicas. They apply the mutations in the same serial number order.
Step-6. The secondary replicas reply to the primary, indicating that they have completed the operation.
Step-7. The primary replies to the client with a success or error message.
System Interactions
• Data flow
  • Data is pipelined over TCP connections
  • A chain of chunk servers forms a pipeline
  • Each machine forwards data to the closest machine
• Atomic record appends
  • GFS provides an atomic append operation called "record append"
• Snapshot
  • Makes a copy of a file or a directory tree almost instantaneously
Read Algorithm (1)
1. The application originates the read request.
2. The GFS client translates the request from (filename, byte range) to (filename, chunk index), and sends it to the master.
3. The master responds with the chunk handle and replica locations (i.e., the chunkservers where the replicas are stored).
4. The client picks a location and sends the (chunk handle, byte range) request to that location.
5. The chunkserver sends the requested data to the client.
6. The client forwards the data to the application.
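A hedged Python sketch of the read path above (steps 2-6); `master`, `chunkservers`, and their `lookup`/`read` methods are hypothetical stand-ins for illustration, not the GFS client API.

CHUNK_SIZE = 64 * 1024 * 1024

def gfs_read(master, chunkservers, filename, offset, length):
    """Client-side read path following steps 1-6 of the lecture."""
    # Step 2: translate (filename, byte range) -> (filename, chunk index).
    index = offset // CHUNK_SIZE
    # Step 3: master returns the chunk handle and replica locations.
    handle, locations = master.lookup(filename, index)
    # Step 4: pick one replica (e.g., the closest) and request the byte range.
    replica = locations[0]
    # Step 5: the chunkserver returns the requested data.
    data = chunkservers[replica].read(handle, offset % CHUNK_SIZE, length)
    # Step 6: hand the data back to the application.
    return data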
Read Algorithm (2)
(Figure: steps 1-3. The application passes (file name, byte range) to the GFS client; the client sends (file name, chunk index) to the master; the master returns (chunk handle, replica locations).)
Read Algorithm (3)
(Figure: steps 4-6. The GFS client sends (chunk handle, byte range) to one of the chunk servers; that chunk server returns the data from the file, which the client passes on to the application.)
Write Algorithm (1)
1. The application originates the write request.
2. The GFS client translates the request from (filename, data) to (filename, chunk index), and sends it to the master.
3. The master responds with the chunk handle and the (primary + secondary) replica locations.
4. The client pushes the write data to all locations. The data is stored in the chunkservers' internal buffers.
5. The client sends the write command to the primary.
Write Algorithm (2)
6. The primary determines the serial order for the data instances stored in its buffer and writes the instances in that order to the chunk.
7. The primary sends the serial order to the secondaries and tells them to perform the write.
8. The secondaries respond to the primary.
9. The primary responds back to the client.
Note: If the write fails at one of the chunkservers, the client is informed and retries the write.
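A sketch of the client-side write path, again with hypothetical stand-ins (`master.lookup_for_write`, `push_data`, `commit_write` are illustrative names, not real GFS calls); the point is the separation between the data flow (step 4) and the control flow in which the primary assigns the serial order (steps 5-9).

CHUNK_SIZE = 64 * 1024 * 1024

def gfs_write(master, servers, filename, offset, data):
    """Client-side write path following steps 1-9 of the lecture."""
    index = offset // CHUNK_SIZE                                              # steps 1-2
    handle, primary, secondaries = master.lookup_for_write(filename, index)   # step 3
    for s in [primary] + secondaries:
        servers[s].push_data(handle, data)                                    # step 4: data flow
    # Step 5: the write command goes to the primary only; it picks the serial
    # order (steps 6-7), collects the secondaries' replies (step 8), and
    # answers the client (step 9).
    return servers[primary].commit_write(handle, offset, secondaries)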
Write Algorithm (3)
(Figure: steps 1-3. The application passes (file name, byte range) to the GFS client; the client sends (file name, chunk index) to the master; the master returns (chunk handle, primary and secondary replica locations).)
Write Algorithm (4)
(Figure: step 4. The GFS client pushes the data to the primary and the two secondary replicas; each chunkserver holds the data in an internal buffer next to its chunk.)
Write Algorithm (5)
(Figure: steps 5-7. The client sends the write command to the primary (step 5); the primary writes the buffered data instances D1-D4 to its chunk in serial order (step 6) and forwards the write command with the serial order to the secondaries, which apply the same order (step 7).)
Write Algorithm (6)
(Figure: steps 8-9. The secondaries respond to the primary (step 8), the buffers are emptied, and the primary responds to the client (step 9).)
Record Append Algorithm (1)
1. The application originates the record append request.
2. The GFS client translates the request and sends it to the master.
3. The master responds with the chunk handle and the (primary + secondary) replica locations.
4. The client pushes the write data to all locations.
5. The primary checks if the record fits in the specified chunk.
6. If the record does not fit, then the primary:
   • pads the chunk,
   • tells the secondaries to do the same,
   • and informs the client.
   • The client then retries the append with the next chunk.
Record Append Algorithm (2)
7. If the record fits, then the primary:
   • appends the record,
   • tells the secondaries to do the same,
   • receives responses from the secondaries,
   • and sends the final response to the client.
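The primary's decision in steps 5-7 can be sketched as follows (`chunk` and the secondary objects with `pad_to_end`/`append_at` methods are hypothetical helpers, not the GFS API); the padding-and-retry case is what lets GFS keep record append atomic while still choosing the offset itself.

CHUNK_SIZE = 64 * 1024 * 1024

def primary_record_append(chunk, secondaries, record):
    """Primary-side decision for record append (steps 5-7), as a sketch."""
    if chunk.used + len(record) > CHUNK_SIZE:
        chunk.pad_to_end()                    # record does not fit: pad this chunk,
        for s in secondaries:
            s.pad_to_end(chunk.handle)        # tell the secondaries to do the same,
        return "RETRY_NEXT_CHUNK"             # and make the client retry on the next chunk
    offset = chunk.append(record)             # record fits: the primary picks the offset
    for s in secondaries:
        s.append_at(chunk.handle, offset, record)
    return offset                             # returned to the client in the final response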
Record Append Algorithm (3)
(Figure: steps 1-3, as in the write algorithm. The application contacts the GFS client, which sends (file name, chunk index) to the master; the master returns the chunk handle and the primary and secondary replica locations.)
Record Append Algorithm (4)
(Figure: step 4. The client pushes the record data to the primary and secondary replicas, which buffer it next to their chunks.)
Record Append Algorithm (5)
(Figure: steps 5-7. The client sends the append command to the primary (step 5); if the record fits, the primary appends it to its chunk (step 6) and tells the secondaries to append the record as well (step 7).)
Record Append Algorithm (6)
(Figure: steps 8-9. The secondaries respond to the primary (step 8), the buffers are emptied, and the primary sends the final response to the client (step 9).)
Master Operation
Namespace Management & Locking
• Locks are used over namespaces to ensure proper serialization
• Read/write locks
• GFS simply uses directory-like file names: /foo/bar
• GFS logically represents its namespace as a lookup table mapping full pathnames to metadata
• If a master operation involves /d1/d2/.../dn/leaf, read locks are acquired on /d1, /d1/d2, ..., /d1/d2/.../dn, and either a read or a write lock on the full pathname /d1/d2/.../dn/leaf
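A minimal sketch of this locking rule (an illustration only, not GFS code): read locks on every proper prefix of the pathname, plus a read or write lock on the full pathname.

def locks_for(pathname: str, write: bool):
    """Locks acquired for an operation on `pathname`: read locks on every
    proper prefix, and a read or write lock on the full path itself."""
    parts = pathname.strip("/").split("/")
    locks = [("/" + "/".join(parts[:k]), "read") for k in range(1, len(parts))]
    locks.append((pathname, "write" if write else "read"))
    return locks

print(locks_for("/d1/d2/leaf", write=True))
# [('/d1', 'read'), ('/d1/d2', 'read'), ('/d1/d2/leaf', 'write')]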
Chunk Replica Placement
• Creation of (initially empty) chunks
  • Use under-utilized chunk servers; spread across racks
  • Limit the number of recent creations on each chunk server
• Re-replication
  • Started once the number of available replicas falls below a setting
  • The master instructs a chunkserver to copy chunk data directly from an existing valid replica
  • The number of active clone operations and the bandwidth used are limited
• Rebalancing
  • The master rebalances the replicas periodically (examines the replica distribution and moves replicas for better disk space usage and load balancing)
Garbage Collection
• Deletion is logged by the master
• The file is renamed to a hidden file, and the deletion timestamp is kept
• Periodic scan of the master's file system namespace
  • Hidden files older than 3 days are deleted from the master's memory (no further connection between the file and its chunks)
• Periodic scan of the master's chunk namespace
  • Orphaned chunks (not reachable from any file) are identified and their metadata deleted
• HeartBeat messages are used to synchronize deletion between master and chunkservers
Stale Replica Detection
• Scenario: a chunkserver misses a change ("mutation") applied to a chunk, e.g., the chunk was appended to
• The master maintains a chunk version number to distinguish up-to-date and stale replicas
• Before an operation on a chunk, the master ensures that the version number is advanced
• Stale replicas are removed in the regular garbage collection cycle
Fault Tolerance
• Fast recovery: the master and chunkservers are designed to restart and restore their state in a few seconds
• Chunk replication: across multiple machines, across multiple racks
• Master mechanisms:
  – Keep a log of all changes made to metadata
  – Periodic checkpoints of the log
  – Log and checkpoints replicated on multiple machines
  – Master state is replicated on multiple machines
  – Shadow master for reading data if the real master is down
Reading
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System", http://labs.google.com/papers/gfs.html
Conclusion
• GFS is a distributed file system that supports large-scale data processing workloads on commodity hardware.
• GFS occupies a different point in the design space:
  • Component failures are treated as the norm
  • Optimized for huge files
• GFS provides fault tolerance by:
  • Replicating data
  • Fast and automatic recovery
  • Chunk replication
• GFS has a simple, centralized master that does not become a bottleneck.
