W6 Lecture Notes
Randomized Distributed Algorithms
Rajiv Misra
Dept. of Computer Science & Engineering
Indian Institute of Technology Patna
rajivm@iitp.ac.in
Introduction
• This lecture focuses on a specific type of distributed algorithms, namely those that employ randomization.
• Randomization has proved to be a very powerful tool for designing distributed algorithms (as in many other areas).
• It often simplifies algorithms and, more importantly, allows us to solve problems in situations where they cannot be solved by deterministic algorithms, or with fewer resources than the best deterministic algorithm.
Contd…
• This lecture extends the formal model to include randomization and describes a randomized algorithm for the Leader Election Problem.
• Randomization allows us to overcome impossibility results and lower bounds by relaxing the termination conditions or the individual liveness properties (in the case of mutual exclusion).
Distributed Leader Election Algorithm: A Case Study
Weakening the Problem Definition
• A randomized algorithm is an algorithm that has access to some source of random information, such as that provided by flipping a coin or rolling dice.
More formally:
• Extend the transition function of a processor to take as an additional input a random number, drawn from a bounded range under some fixed distribution.
Why is it important?
The Bad News:
1. The addition of random information alone typically will not affect the existence of impossibility results or worst-case bounds.
2. For instance, even if processors have access to random numbers, they will not be able to elect a leader in an anonymous ring or solve consensus in fewer than f+1 rounds in all (admissible) executions.
Contd…
The Good News:
• Randomization, combined with a weakening of the problem statement, is a powerful tool for overcoming these limitations.
• Usually the weakening involves the termination condition (for instance, a leader must be elected with a certain probability) while the other conditions are not changed (for instance, it should never be the case that two leaders are elected).
Differs from an average-case analysis of a deterministic algorithm
• In average-case analysis, there are several choices as to what is being averaged over. One natural choice is the inputs. There are two difficulties with this approach:
(i) Determining an accurate probability distribution on the inputs is often not practical.
(ii) Even if such distributions can be chosen with some degree of confidence, very little is guaranteed about the behavior of the algorithm on a particular input. For instance, even if the average running time over all inputs is determined to be small, there still could be some inputs for which the running time is enormous.
Contd…
• In the randomized approach, more stringent guarantees can be made. Because the random numbers introduce another dimension of variability, even for the same input there are many different executions.
• A good randomized algorithm will typically guarantee good performance with some probability for each individual input.
Leader Election
Randomized Leader Election
• The simplest use of randomization is to create initial asymmetry in situations that are inherently symmetric.
• One such situation is anonymous rings: when processors do not have distinct identifiers, it is impossible to elect a unique leader.
(From Theorem: There is no nonuniform anonymous algorithm for leader election in synchronous rings.)
• This impossibility result holds even for randomized algorithms. However, a randomized formulation of leader election only requires that "a leader is elected with some probability."
• This is a variant of the leader election problem that relaxes the condition that eventually a leader must be elected in every admissible execution.
The relaxed version of the leader election problem requires:
Safety: In every configuration of every admissible execution, at most one processor is in an elected state.
Liveness: At least one processor is elected with some non-zero probability.
Contd…
• The safety property has to hold with certainty: that is, the algorithm should never elect two leaders.
• The liveness condition is relaxed, and the algorithm need not always terminate with a leader; rather, it is required to do so with nonzero probability.
• An algorithm that satisfies this weakened liveness condition can fail to elect a leader either by not terminating at all or by terminating without a leader.
1. Synchronous One-Shot Algorithm
• First, let us consider synchronous rings. For a deterministic algorithm, there is only a single admissible execution on a given anonymous ring.
• For a randomized algorithm, however, there can be many different executions, depending on the random choices.
• The approach used to devise a randomized leader election algorithm is to use randomization to create asymmetry, by having processors choose random pseudo-identifiers drawn from some range, and then execute a deterministic leader election algorithm.
Properties
• A simple deterministic leader election algorithm with the required properties is the following:
1. Each processor sends a message around the ring to collect all pseudo-identifiers.
2. When the message returns (after collecting n pseudo-identifiers), a processor knows whether it is the unique maximum or not.
Algorithm: Overview
• In this algorithm, the pseudo-identifier is chosen to be 2 with probability 1/n and 1 with probability (1 - 1/n), where n is the number of processors on the ring.
• Thus each processor makes use of its source of randomness exactly once, and the random numbers are drawn from the range [1..2].
• The set of all possible admissible executions of this algorithm for a fixed ring size n contains exactly one execution for each element of the set ℛ = {1,2}^n.
• That is, by specifying which random number, 1 or 2, is obtained by each of the n processors in its first step, we have completely determined the execution. Given an element R of ℛ, the corresponding execution is denoted exec(R).
Algorithm 1: Randomized leader election in an anonymous synchronous ring (code for each processor): choose a pseudo-identifier (2 with probability 1/n, 1 otherwise), run the deterministic algorithm to collect all n pseudo-identifiers around the ring, and enter the elected state only if the processor holds the unique maximum.
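As a concrete illustration (not part of the lecture), the following Python sketch simulates one run of this one-shot algorithm: each processor draws a pseudo-identifier, the identifiers are collected around the ring, and a processor is elected only if it holds the unique maximum. The function name one_shot_leader_election is invented for this example.

import random

def one_shot_leader_election(n, rng=random.random):
    """Simulate one execution of the synchronous one-shot algorithm on an
    anonymous ring of n processors. Returns the index of the elected
    processor, or None if no leader is elected in this execution."""
    # Each processor chooses pseudo-identifier 2 with probability 1/n, else 1.
    pseudo_ids = [2 if rng() < 1.0 / n else 1 for _ in range(n)]
    # After one full trip around the ring, every processor knows all n
    # pseudo-identifiers and can check whether it holds the unique maximum.
    maximum = max(pseudo_ids)
    if pseudo_ids.count(maximum) == 1:
        return pseudo_ids.index(maximum)
    return None  # zero or several processors drew 2: no leader in this execution

# Example: estimate the success probability for n = 10 by simulation.
if __name__ == "__main__":
    n, trials = 10, 100_000
    successes = sum(one_shot_leader_election(n) is not None for _ in range(trials))
    print(f"empirical success probability for n={n}: {successes / trials:.3f}")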
Definition
• Let P be some predicate on executions, for example, "at least one leader is elected."
• Then Pr[P] denotes the probability of the event {R : exec(R) satisfies P}.
Analysis
• What is the probability that the algorithm terminates with a leader?
• This happens exactly when a single processor draws the maximum pseudo-identifier, 2.
• The probability that a single processor draws 2 is the probability that n-1 processors draw 1 and one processor draws 2, times the number of possible choices for the processor drawing 2, that is,

    c = n · (1/n) · (1 − 1/n)^(n−1) = (1 − 1/n)^(n−1) > 1/e
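The bound c = (1 − 1/n)^(n−1) > 1/e can be checked numerically; the small snippet below (added for illustration, not from the lecture) evaluates it for several ring sizes.

import math

# Probability that exactly one processor draws pseudo-identifier 2:
# c = n * (1/n) * (1 - 1/n)**(n-1) = (1 - 1/n)**(n-1), which stays above 1/e.
for n in (2, 5, 10, 100, 1000):
    c = (1 - 1 / n) ** (n - 1)
    print(f"n={n:5d}  c={c:.4f}  1/e={1 / math.e:.4f}")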
Contd…
• It is simple to show that every processor terminates after sending exactly n messages; moreover, at most one processor terminates in an elected state.
• In some executions, for example, when two processors choose the pseudo-identifier 2, no processor terminates as a leader. However, the above analysis shows that this happens with probability less than 1 - 1/e.
Message Complexity
• Theorem 1: There is a randomized algorithm that, with probability c > 1/e, elects a leader in a synchronous ring; the algorithm sends O(n²) messages.
2. Synchronous Iterated Algorithm and Expectation
• It is pleasing that the probability of termination in the previous algorithm does not decrease with the ring size n.
• However, we may wish to increase the probability of termination, at the expense of more time and messages.
• The algorithm can be modified so that each processor receiving a message with n pseudo-identifiers checks whether a unique leader exists (as in Lines 5-7 of Algorithm 1).
• If not, the processor chooses a new pseudo-identifier and iterates the algorithm. We will show that this approach amplifies the probability of success.
Contd…
• Let us consider the second option in more detail. In order to repeat Algorithm 1, each processor will need to access the random number source multiple times, in fact, potentially infinitely often.
• To completely specify an execution of the algorithm, we will need to specify, for each processor, the sequence of random numbers that it obtains.
• Thus ℛ now becomes the set of all n-tuples, each element of which is a possibly infinite sequence over {1,2}.
Analysis
• For the iterated algorithm, the probability that the algorithm terminates at the end of the k-th iteration is equal to the probability that the algorithm fails to terminate in the first k-1 iterations and succeeds in terminating in the k-th iteration.
• The analysis of Algorithm 1 shows that the probability of success in a single iteration is c > 1/e.
• Because success or failure in each iteration is independent of the other iterations, the desired probability is

    (1 - c)^(k-1) · c

• The probability of failing in all of the first k iterations is (1 - c)^k, which tends to 0 as k tends to infinity; thus the probability that the algorithm eventually terminates with an elected leader is 1.
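A short calculation (illustrative, not from the lecture) shows how quickly the failure probability (1 − c)^k shrinks with the number of iterations, and that the expected number of iterations is 1/c < e.

import math

n = 10
c = (1 - 1 / n) ** (n - 1)        # success probability of a single iteration, c > 1/e
for k in (1, 2, 5, 10, 20):
    fail_k = (1 - c) ** k         # probability that no leader is elected in k iterations
    print(f"k={k:2d}  P[no leader after k iterations] = {fail_k:.6f}")
print(f"expected number of iterations 1/c = {1 / c:.3f}  (< e = {math.e:.3f})")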
Time Complexity of Iterated Algorithm
• Worst-case number of iterations: unbounded.
• Expected number of iterations: 1/c < e.
• Let T be a random variable that assigns to each execution exec(R) the value of the complexity measure of interest for that run (for instance, the number of iterations until termination).
• Let E[T] be the expected value of T, taken over all R.
• That is,

    E[T] = Σ (over all values x of T) x · Pr[T = x]
Contd…
• Note that, by the definition of Pr[P] above, this is ultimately taking probabilities over the set ℛ of random inputs.
• With this definition, we have the following theorem:
• Theorem 2: There is a randomized algorithm that elects a leader in a synchronous ring with probability 1 in (1/c) · n < e · n expected rounds; the algorithm sends O(n²) expected messages.
Conclusion
• Randomization has proved to be a very powerful tool for designing distributed algorithms.
• It allows us to solve problems in situations where they cannot be solved by deterministic algorithms.
• This lecture extended the formal model to include randomization and described randomized algorithms for the Leader Election Problem.
• We have presented two randomized algorithms: (i) the Synchronous One-Shot Algorithm with O(n²) messages and (ii) the Synchronous Iterated Algorithm with O(n²) expected messages.
Peer-to-Peer Computing and Structured Overlay Network
Rajiv Misra
Dept. of Computer Science & Engineering
Indian Institute of Technology Patna
rajivm@iitp.ac.in
Introduction
• Peer-to-peer (P2P) network: an application-level organization of the network overlay for flexibly sharing resources (e.g., files and multimedia documents) stored across network-wide computers.
• All nodes (called peers) are equal, i.e., each acts as client as well as server; communication is directly between peers (no client-server hierarchy).
• Allows the location of arbitrary objects to be found; no DNS servers required.
• Sharing of large combined storage, CPU power, and other resources, without scalability costs.
• Dynamic insertion and deletion of nodes (called churn), as well as of resources, at low cost.
• Overlay networks: networks that are constructed on top of another network (e.g., IP).
• P2P overlay network: any overlay network that is constructed by the Internet peers in the application layer on top of the IP network.
Desirable Characteristics of P2P
Features:
• Self-organizing
• Distributed control
• Anonymity
• Role symmetry for nodes
• Naming mechanism
• Security, authentication, trust
Performance:
• Large combined storage, CPU power, and resources
• Fast search for machines and data objects
• Scalable
• Efficient management of churn
• Selection of geographically close servers
• Redundancy in storage and paths
Table I: Desirable characteristics and performance features of P2P systems.
Application Layer Overlays
• A core mechanism in P2P networks is searching for data, and this mechanism depends on how (i) the data and (ii) the network are organized.
• Search algorithms for P2P networks tend to be data-centric, as opposed to the host-centric algorithms for traditional networks.
• P2P search uses the P2P overlay, which is a logical graph among the peers that is used for the object search and the object storage and management algorithms.
• Note that above the P2P overlay is the application layer overlay, where communication between peers is point-to-point (representing a logical all-to-all connectivity) once a connection is established.
Classification of P2P Overlay Networks
P2P Overlay Network
• Structured
   – DHT-based: CAN, Chord, Tapestry, Pastry, Kademlia
   – Non-DHT: Mercury
• Unstructured
   – Deterministic: Napster, BitTorrent, JXTA
   – Non-deterministic: Gnutella, Kazaa
• (i) Structured overlays use tight, deterministic guidelines, based on the properties of the P2P overlay graph structure, for the object storage algorithms and the object search algorithms.
• (ii) Unstructured overlays use very loose guidelines for object storage. As there is no definite structure to the overlay graph, the search mechanisms are more "ad-hoc," and typically use some form of flooding or random walk strategies.
• Thus, object storage and search strategies are intricately linked to the overlay structure as well as to the data organization mechanisms.
Structured vs. Unstructured Overlays
Structured Overlays:
• Structure ⇒ placement of files is highly deterministic; file insertions and deletions have some overhead
• Fast lookup
• Hash mapping based on a single characteristic (e.g., file name)
• Range queries, keyword queries, attribute queries difficult to support
• Examples: Chord, Content Addressable Network (CAN), Pastry
Unstructured Overlays:
• No structure for the overlay ⇒ no structure for data/file placement
• Node joins/departures are easy; the local overlay is simply adjusted
• Only local indexing used
• File search entails high message overhead and high delays
• Complex, keyword, range, and attribute queries supported
• Examples: FreeNet, Gnutella, KaZaA, BitTorrent
Data Indexing
• Data is identified by indexing, which allows physical data independence from applications.
• Centralized indexing: e.g., versions of Napster, DNS.
• Distributed indexing: indexes to data are scattered across peers. Data is accessed through mechanisms such as Distributed Hash Tables (DHTs). These differ in hash mapping, search algorithms, diameter for lookup, fault tolerance, and churn resilience.
• Local indexing: each peer indexes only the local objects; remote objects need to be searched for. (A typical DHT uses a flat key space.) Used commonly in unstructured overlays (e.g., Gnutella) along with flooding search or random walk search.
• Another classification:
• Semantic indexing: human readable, e.g., filename, keyword, database key. Supports keyword searches, range searches, approximate searches.
• Semantic-free indexing: not human readable; corresponds to an index obtained by use of a hash function.
Structured Overlays: Distributed Hash Tables
Simple Distributed Hash Table Scheme
(Figure: mappings from the node address space and the object space in a simple DHT.)
• Highly deterministic placement of files/data allows fast lookup, but file insertions/deletions under churn incur some cost.
• Attribute search, range search, keyword search, etc. are not possible.
Chord Distributed Hash Table: Overview
• The Chord protocol, proposed by Stoica et al. (2003), uses a flat key space to associate the mapping between network nodes and data objects/files/values.
• The node address as well as the object value is mapped to a logical identifier in a common flat key space using a consistent hash function.
• When a node joins or leaves a network of n nodes, only an O(1/n) fraction of the keys has to be moved from one location to another.
• Two steps are involved:
1. Map the object value to its key.
2. Map the key to the node in the native address space using lookup.
• The common address space consists of m-bit identifiers (2^m addresses), and this space is arranged on a logical ring mod 2^m.
• A key k gets assigned to the first node such that the node identifier equals or is greater than the key identifier of k in the common identifier space. That node is the successor of k, denoted succ(k).
Example
A Chord ring for m = 7 is depicted in the figure. Nodes N5, N18, N23, N28, N63, N73, N99, N104, N115, and N119 are shown. Six keys, K8, K15, K28, K53, K87, and K121, are stored among these nodes using succ(k) as follows:
succ(8) = 18, succ(15) = 18, succ(28) = 28, succ(53) = 63, succ(87) = 99, and succ(121) = 5.
A key k gets assigned to the first node whose identifier equals or is greater than the key identifier of k in the common identifier space; that node is the successor of k, denoted succ(k).
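To make the succ(k) rule concrete, here is a small illustrative sketch (not from the lecture) of consistent hashing onto an m-bit ring; the helper names chord_id and successor are invented for this example.

import hashlib
from bisect import bisect_left

M = 7                                   # identifier space of 2^M points, as in the ring above

def chord_id(name: str, m: int = M) -> int:
    """Map a node address or object name to an m-bit identifier."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

def successor(key_id: int, node_ids) -> int:
    """succ(k): the first node whose identifier equals or follows key_id on the ring."""
    ring = sorted(node_ids)
    i = bisect_left(ring, key_id)
    return ring[i % len(ring)]          # wrap around past the largest identifier

# Reproduce the assignments from the figure (m = 7):
nodes = [5, 18, 23, 28, 63, 73, 99, 104, 115, 119]
for k in (8, 15, 28, 53, 87, 121):
    print(f"succ({k}) = {successor(k, nodes)}")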
Chord: Simple Lookup
• Each node tracks its successor on the ring.
• A query for key x is forwarded around the ring until it reaches the first node whose identifier y ≥ x (mod 2^m).
• The result, the node with identifier y, is returned to the querying node (along the reverse of the path that was followed by the query).
• This mechanism requires O(1) local space but O(n) hops.

(variables)
integer: successor ← initial value;

(1) i.Locate_Successor(key), where key ≠ i:
(1a) if key ∈ (i, successor] then
(1b)     return(successor)
(1c) else return successor.Locate_Successor(key).
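The following Python sketch (illustrative only; the class name SimpleChordNode is invented) mirrors the pseudocode above: each node knows only its successor, and a query is forwarded around the ring one hop at a time.

class SimpleChordNode:
    """Minimal sketch of the O(n)-hop lookup: each node knows only its successor."""

    def __init__(self, node_id: int, m: int = 7):
        self.id = node_id
        self.m = m
        self.successor = self                         # set properly once the ring is built

    def _in_interval(self, key: int) -> bool:
        """True if key lies in the half-open ring interval (self.id, successor.id]."""
        lo, hi = self.id, self.successor.id
        if lo < hi:
            return lo < key <= hi
        return key > lo or key <= hi                  # interval wraps around 2^m

    def locate_successor(self, key: int):
        key %= 2 ** self.m
        if self._in_interval(key):                    # lines (1a)/(1b)
            return self.successor
        return self.successor.locate_successor(key)   # line (1c): forward one hop

# Build a ring from the identifiers in the example above and look up a key.
ids = sorted([5, 18, 23, 28, 63, 73, 99, 104, 115, 119])
nodes = [SimpleChordNode(i) for i in ids]
for a, b in zip(nodes, nodes[1:] + nodes[:1]):
    a.successor = b
print(nodes[0].locate_successor(53).id)               # expected: 63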
Chord: Scalable Lookup
• Each node i maintains a routing table, called the finger table, with O(log n) entries, such that the x-th entry (1 ≤ x ≤ m) is the node identifier of the node succ(i + 2^(x−1)), denoted i.finger[x] = succ(i + 2^(x−1)).
• This is the first node whose key is greater than the key of node i by at least 2^(x−1) mod 2^m.
• Complexity: O(log n) message hops, at the cost of O(log n) space in the local routing tables.
• Due to the log structure of the finger table, a node has more information about nodes close by than about nodes further away.
Contd…
1. If the key lies between node i and its successor, the key resides at the successor and the successor is returned.
2. If the key lies beyond the successor, then node i searches through the m entries in its finger table to identify the node j such that j most immediately precedes key, among all the entries in the finger table.
3. As node j is the closest known node that precedes key, j is most likely to have the most information on locating key, i.e., on locating the immediate successor node to which key has been mapped.
Chord: Scalable Lookup Procedure
(variables)
integer: successor ← initial value;
integer: predecessor ← initial value;
array of integer finger[1 . . . log n];

(1) i.Locate_Successor(key), where key ≠ i:
(1a) if key ∈ (i, successor] then
(1b)     return(successor)
(1c) else
(1d)     j ← Closest_Preceding_Node(key);
(1e)     return j.Locate_Successor(key).

(2) i.Closest_Preceding_Node(key), where key ≠ i:
(2a) for count = m down to 1 do
(2b)     if finger[count] ∈ (i, key] then
(2c)         break();
(2d) return(finger[count]).

Algorithm 2: A scalable object location algorithm in Chord at node i.
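A corresponding sketch of the finger-table lookup (illustrative only; FingerChordNode and build_ring are invented names) follows Algorithm 2: the query either stops at the successor or jumps to the closest preceding finger.

class FingerChordNode:
    """Sketch of Chord's O(log n) lookup using a finger table."""

    def __init__(self, node_id: int, m: int = 7):
        self.id = node_id
        self.m = m
        self.successor = self
        self.finger = []                              # finger[x] = succ(id + 2^x), 0-based here

    @staticmethod
    def _between(x: int, lo: int, hi: int) -> bool:
        """True if x lies in the ring interval (lo, hi]."""
        if lo < hi:
            return lo < x <= hi
        return x > lo or x <= hi

    def closest_preceding_node(self, key: int):
        # Scan the finger table from the farthest entry down, as in clause (2) of Algorithm 2.
        for node in reversed(self.finger):
            if self._between(node.id, self.id, key) and node.id != key:
                return node
        return self.successor

    def locate_successor(self, key: int):
        key %= 2 ** self.m
        if self._between(key, self.id, self.successor.id):   # clauses (1a)/(1b)
            return self.successor
        j = self.closest_preceding_node(key)                  # clause (1d)
        return self.successor if j is self else j.locate_successor(key)

def build_ring(ids, m=7):
    ids = sorted(ids)
    nodes = {i: FingerChordNode(i, m) for i in ids}
    for a, b in zip(ids, ids[1:] + ids[:1]):
        nodes[a].successor = nodes[b]
    for n in nodes.values():                          # finger[x] = first node >= id + 2^x (mod 2^m)
        n.finger = [nodes[min((i for i in ids if i >= (n.id + 2 ** x) % (2 ** m)),
                              default=ids[0])] for x in range(m)]
    return nodes

ring = build_ring([5, 18, 23, 28, 63, 73, 99, 104, 115, 119])
print(ring[28].locate_successor(8).id)                # expected: 18 (cf. the lookup(K8) example that follows)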
Chord: Scalable Lookup Example
Example: The use of the finger tables in answering the query lookup(K8) at node N28 is illustrated in the figure. The finger tables of N28, N99, and N5 that are used are shown.
Managing Churn: Node Joins
The code to manage dynamic node joins, departures, and failures is given in Algorithm 3.
Node joins:
• To create a new ring, a node i executes Create_New_Ring, which creates a ring with the singleton node. To join a ring that contains some node j, node i invokes Join_Ring(j).
• Node j locates i's successor on the logical ring and informs i of its successor. Before i can participate in the P2P exchanges, several actions need to happen: i's successor needs to update its predecessor entry to i, i's predecessor needs to revise its successor field to i, i needs to identify its predecessor, the finger table at i needs to be built, and the finger tables of all nodes need to be updated to account for i's presence.
Figure: Steps in the integration of node i in the ring, where j > i > k.
Contd…
• These updates are not performed atomically at join time; they happen gradually, via the periodic procedures of Algorithm 3 (e.g., Fix_Fingers and Check_Predecessor).
• Until then, a call to Locate_Successor(·) may result in the Locate_Successor(·) call performing a conservative scan.
• The loop in Closest_Preceding_Node that scans the finger table will result in a search traversal using smaller hops rather than truly logarithmic hops, resulting in some inefficiency.
• Still, the node i will be located, although via more hops.
Managing Churn: Node Failures and Departures
• When a node j fails abruptly, its successor i on the ring will discover the failure when i executes Check_Predecessor() periodically.
• Process i gets a chance to update its predecessor field when another node k causes i to execute Notify(k). But that can happen only if k's successor variable is i.
• This requires the predecessor of the failed node to recognize that its successor has failed, and to get a new functioning successor.
• In fact, the successor pointers are required for object search; the predecessor variables are required only to accommodate new joiners.
• Note from Algorithm 3 that knowing that the successor is functional, and that the nodes pointed to by the finger pointers are functional, is essential.
Algorithm 3: Managing churn in Chord. Code shown is for node i.
(variables)
integer: successor ← initial value;
integer: predecessor ← initial value;
array of integer finger[1 . . . log n];
integer: next_finger ← 1;

(1) i.Create_New_Ring():
(1a) predecessor ← ⊥;
(1b) successor ← i.

(2) i.Join_Ring(j), where j is any node on the ring to be joined:
(2a) predecessor ← ⊥;
(2b) successor ← j.Locate_Successor(i).

(5) i.Fix_Fingers(): // executed periodically to update the finger table
(5a) next_finger ← next_finger + 1;
(5b) if next_finger > m then
(5c)     next_finger ← 1;
(5d) finger[next_finger] ← Locate_Successor(i + 2^(next_finger − 1)).

(6) i.Check_Predecessor(): // executed periodically to verify whether predecessor still exists
(6a) if predecessor has failed then
(6b)     predecessor ← ⊥.
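The periodic maintenance in Algorithm 3 can be sketched as follows (illustrative only). It assumes a node object that already provides id, m, finger, and locate_successor, as in the earlier sketches; the class name ChurnMaintenance is invented.

class ChurnMaintenance:
    """Sketch of the join and periodic-repair procedures of Algorithm 3."""

    def __init__(self):
        self.next_finger = 0              # 0-based counterpart of next_finger in Algorithm 3
        self.predecessor = None

    def join_ring(self, j) -> None:
        # (2) join via any known node j: ask j for our successor on the ring
        self.predecessor = None
        self.successor = j.locate_successor(self.id)

    def fix_fingers(self) -> None:
        # (5) executed periodically: refresh one finger entry per call,
        #     finger[x] = succ(id + 2^x) for x = 0..m-1 (0-based indexing here)
        self.next_finger = (self.next_finger + 1) % self.m
        target = (self.id + 2 ** self.next_finger) % (2 ** self.m)
        self.finger[self.next_finger] = self.locate_successor(target)

    def check_predecessor(self) -> None:
        # (6) executed periodically: drop a failed predecessor so that a later
        #     Notify from a live node can install a new one
        if self.predecessor is not None and not getattr(self.predecessor, "alive", True):
            self.predecessor = None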
Complexity
• With consistent hashing, each node is responsible for at most (1 + ϵ)K/n keys (for K keys and n nodes), where ϵ can be shown to be bounded by O(log n).
• The search for a successor (Locate_Successor) in a Chord network with n nodes requires time complexity O(log n) with high probability.
• The size of the finger table is log(n) ≤ m.
Comparison of structured P2P overlays (topology, distance metric, lookup cost):
• Chord: one-dimensional ring; distance (j − i) mod 2^m; O(log N).
• CAN: d-dimensional cube; Euclidean distance; (d/4) N^(1/d).
• GISP: structureless; for objects, the difference of the two IDs; for nodes, (i − j)/(2^(s_i−1) · 2^(s_j−1)), where s_i, s_j are "peer strengths"; lookup cost uncertain.
• Kademlia: XOR tree; distance i ⊕ j; O(log_{2^b} N).
• Pastry: tree + ring; distance assessed digit by digit; O(log_{2^b} N).
• Tapestry: tree; same distance as Pastry; O(log_{2^b} N).
• Viceroy: butterfly; distance (j − i) mod 1; O(log N).
Comparison of routing table size, routing table construction, and maintenance:
• CAN: size 2d; generates a neighbor list; maintenance by updating the neighbor lists, possible from the routing list.
• Kademlia: at most 160 × k entries (O(log₂ N)); constructs k-buckets, routed by a greedy algorithm; detects the target node before routing.
• Pastry: O(log_{2^b} N); generates a routing table, a neighborhood set, and a namespace set; detects the target node before routing.
• Tapestry: O(log_{2^b} N); constructs the routing table; maintenance via heartbeat messages.
• Viceroy: at most 7; constructs the three kinds of links; repaired by the LOOKUP operation.
Conclusion
• This lecture first gave the basic fundamentals and underlying concepts of P2P networks, i.e., (i) structured P2P networks and (ii) unstructured P2P networks.
• Then we discussed the concept of the Distributed Hash Table (DHT) and a DHT-based classical structured P2P network, i.e., Chord.
The Google File System (GFS)
Rajiv Misra
Dept. of Computer Science & Engineering
Indian Institute of Technology Patna
rajivm@iitp.ac.in
Introduction: File System
• A distributed file system stores files across a network of machines.
• Added complexity due to the network.
• GFS/HDFS are distributed file systems.
Introduction: Motivating Facts
• Google processes its data in computing clusters of commodity-class CPUs running customized versions of Linux.
• Google seeks to maximize performance per dollar, not absolute performance.
• How this is specifically measured is not public information; it includes system running costs and power consumption.
• Novel switching power supply designs reduce costs and improve power efficiency.
Introduction: GFS
• GFS is a scalable distributed file system for large, data-intensive applications, developed at Google.
• It shares many of the same goals as previous distributed file systems, such as performance, scalability, reliability, and availability.
• The design of GFS is driven by four key observations: component failures, huge files, mutation of files, and the benefits of co-designing the applications and the file system API.
GFS Assumptions
• Hardware failures are common
   • The system is built from many inexpensive commodity components
• Modest number of huge files
   • A few million files, each typically 100 MB or larger (multi-GB files are common)
   • No need to optimize for small files
• Workloads: two kinds of reads, and writes
   • Large streaming reads (1 MB or more) and small random reads (a few KBs)
   • Sequential appends to files by hundreds of data producers
• High sustained throughput is more important than latency
   • Response time for individual reads and writes is not critical
GFS Design Overview
• Single master
   • Centralized management
• Files stored as chunks
   • With a fixed size of 64 MB each
• Reliability through replication
   • Each chunk is replicated across 3 or more chunkservers
• No data caching
   • Due to the large size of the data sets
• Interface
   • Suitable to Google apps
   • Create, delete, open, close, read, write, snapshot, record append
GFS Architecture
(Figure: several clients, a single master holding metadata, and several data servers (chunkservers); data does not flow across the GFS master.)
GFS: Master
The Role of the Master
Metadata server: maintains all file system metadata
• File namespace
• File-to-chunk mapping
• Chunk location information
• Keeps all the metadata in the master's memory (fast)
Location of chunks and their replicas:
• Master does not keep a persistent record (no disk I/O)
• Operation log for persistence and recovery
• Polls the chunkservers for chunk locations (master as monitor, no extra cost)
Master as Metadata Server
Monitor:
• Communicates with chunkservers periodically
Centralized controller:
• System-wide activities
• Manages chunk replication in the whole system
One Master
• Single master
   • Simplifies the design of the system
   • Controls chunk placement using global knowledge
• Potential bottleneck?
• Avoiding the bottleneck:
   • Clients do not read/write data through the master
   • Clients cache metadata (e.g., chunk handles, locations)
   • A client typically asks for multiple chunks in the same request
   • The master can also include the information for chunks immediately following those requested
Caching
• Metadata
   • Cached on the client after being retrieved from the master
   • Only kept for a period of time, to prevent stale data
• File caching
   • Clients never cache file data
   • Chunkservers never cache file data (Linux's buffer cache will, for local files)
   • File working sets are too large
   • Simplifies cache coherence
GFS: Chunks
Chunks
• Files are stored as fixed-size chunks (64 MB). Advantages of the large chunk size:
   • Reduced need for client/master interaction
   • 1 request per chunk suits the target workloads
   • A client can cache all the chunk locations for a multi-TB working set
   • Reduced size of metadata on the master (kept in memory)
• Disadvantage: a chunkserver can become a hotspot for popular file(s)
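Because the chunk size is a fixed 64 MB, a client can translate a (file name, byte range) into the chunk indices it touches with simple local arithmetic, as in step 2 of the read and write algorithms below. The helper below is an invented illustration.

CHUNK_SIZE = 64 * 1024 * 1024            # fixed 64 MB chunk size

def byte_range_to_chunks(offset: int, length: int):
    """Return the chunk indices covered by the byte range, plus the offset
    within the first chunk (purely client-side arithmetic)."""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return list(range(first, last + 1)), offset % CHUNK_SIZE

# Example: a 10 MB read starting 130 MB into a file touches chunk index 2 only.
print(byte_range_to_chunks(130 * 1024 ** 2, 10 * 1024 ** 2))   # ([2], 2097152)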
Chunk Locations
• The master does not keep a persistent record of chunk locations; it polls the chunkservers at startup
• The master keeps up to date through periodic HeartBeat messages
• Master/chunkservers are easily kept in sync when chunkservers leave/join/fail/restart [a regular event]
• The chunkserver has the final word over what chunks it has
Operation Log
• Persistent record of critical metadata changes
• Critical to the recovery of the system
• Changes to metadata are only made visible to clients after they have been written to the operation log
• The operation log is replicated on multiple remote machines
• Before responding to a client operation, the log record must have been flushed locally and remotely
• The master recovers its file system state from a checkpoint plus the operation log
Consistency Model
• Atomicity and correctness of the file namespace are ensured by namespace locking.
• After a successful data mutation (writes or record appends), changes are applied to a chunk in the same order on all replicas.
• In case of a chunkserver failure at the time of mutation (a stale replica), it is garbage-collected at the soonest opportunity.
• Regular handshakes between the master and chunkservers help in identifying failed chunkservers and detect data corruption by checksumming.
System Interactions
Leases and Mutation Order
• The master grants a chunk lease to one of the replicas (the primary).
• All replicas follow a serial order picked by the primary.
• Leases time out at 60 seconds (it is also possible to extend the timeout).
• Leases are revocable.
Figure 2: Write control and data flow (client, master, primary replica, secondary replicas A and B; control messages vs. data flow).
• Step-1. The client asks the master which chunkserver holds the current lease of the chunk and the locations of the other replicas.
• The client sends the write request to the primary. The primary assigns consecutive serial numbers to all the mutations it receives, providing serialization; it applies the mutations in serial number order.
• Step-6. The secondary replicas reply to the primary indicating that they have completed the operation, having applied the mutations in the same serial number order.
• Step-7. The primary replies to the client with a success or error message.
System Interactions
• Data flow
   • Data is pipelined over TCP connections
   • A chain of chunkservers forms a pipeline
   • Each machine forwards data to the closest machine
• Atomic record appends
   • GFS provides an atomic append operation called "record append"
• Snapshot
   • Makes a copy of a file or a directory tree almost instantaneously
Read Algorithm (1)
1. Application originates the read request.
2. GFS client translates the request from (file name, byte range) to (file name, chunk index), and sends it to the master.
3. Master responds with the chunk handle and replica locations (i.e., the chunkservers where the replicas are stored).
4. Client picks a location and sends the (chunk handle, byte range) request to that location.
5. Chunkserver sends the requested data to the client.
6. Client forwards the data to the application.
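The read path can be condensed into a short client-side routine. The sketch below is purely illustrative: the master and chunkserver objects and their lookup/read methods are invented placeholders, not the actual GFS API.

CHUNK_SIZE = 64 * 1024 * 1024

def gfs_read(master, chunkservers, filename: str, offset: int, length: int) -> bytes:
    """Illustrative sketch of steps 2-6 of the read algorithm."""
    chunk_index = offset // CHUNK_SIZE                         # step 2: translate the byte range
    handle, locations = master.lookup(filename, chunk_index)   # step 3: chunk handle + replica locations
    replica = chunkservers[locations[0]]                       # step 4: pick one replica location
    data = replica.read(handle, offset % CHUNK_SIZE, length)   # step 5: chunkserver returns the data
    return data                                                # step 6: hand the data to the application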
Read Algorithm (2)
(Figure: the application sends (file name, byte range) to the GFS client (step 1); the client sends (file name, chunk index) to the master (step 2); the master returns (chunk handle, replica locations) to the client (step 3).)
Read Algorithm (3)
(Figure: the GFS client sends (chunk handle, byte range) to a chunkserver (step 4), which returns the requested data; the client forwards the data to the application.)

Write Algorithm (1)
1. Application originates the write request.
2. GFS client translates the request from (file name, byte range) to (file name, chunk index), and sends it to the master.
3. Master responds with the chunk handle and the (primary + secondary) replica locations.
4. Client pushes the write data to all locations. Data is stored in the chunkservers' internal buffers.
5. Client sends the write command to the primary.
Write Algorithm (2)
6. Primary determines the serial order for the data instances stored in its buffer and writes the instances in that order to the chunk.
7. Primary sends the serial order to the secondaries and tells them to perform the write.
8. Secondaries respond to the primary.
9. Primary responds back to the client.
Note: If the write fails at one of the chunkservers, the client is informed and retries the write.
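Steps 6-9 can be sketched as the primary's serialization loop. Everything below is an invented illustration (the class, its fields, and the apply method are placeholders); it only models how the primary assigns serial numbers and forwards the order to the secondaries.

from dataclasses import dataclass, field
from typing import List

@dataclass
class PrimaryReplica:
    """Sketch of steps 6-9: order buffered mutations and drive the secondaries."""
    secondaries: List["PrimaryReplica"] = field(default_factory=list)
    buffer: List[bytes] = field(default_factory=list)
    chunk: bytes = b""
    next_serial: int = 0

    def push_data(self, data: bytes) -> None:
        self.buffer.append(data)              # step 4: data sits in the replica's internal buffer

    def apply(self, serial: int, data: bytes) -> bool:
        self.chunk += data                    # mutations are applied in serial number order
        return True

    def write(self) -> str:
        for data in self.buffer:              # step 6: assign consecutive serial numbers
            serial = self.next_serial
            self.next_serial += 1
            self.apply(serial, data)          # primary applies the mutation locally
            acks = [s.apply(serial, data) for s in self.secondaries]   # step 7: forward the order
            if not all(acks):                 # step 8: collect the secondaries' responses
                return "error"                # step 9 / note: the client is informed and retries
        self.buffer.clear()
        return "success"                      # step 9: success reported back to the client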
Write Algorithm (3)
(Figure: the application sends (file name, byte range) to the GFS client (step 1); the client sends (file name, chunk index) to the master (step 2); the master returns (chunk handle, primary and secondary replica locations) (step 3).)
Write Algorithm (4)
(Figure: step 4 - the GFS client pushes the data to the primary and secondary replicas, where it is held in each chunkserver's buffer.)
Write Algorithm (5)
(Figure: step 5 - the client sends the write command to the primary; step 6 - the primary writes the buffered data D1 D2 D3 D4 to its chunk; step 7 - the primary forwards the write command and serial order to the secondaries, which write the same data.)
Write Algorithm (6)
(Figure: step 8 - the secondaries respond to the primary after writing, and their buffers are emptied; step 9 - the primary responds to the GFS client.)
Record Append Algorithm (1)
1. Application originates the record append request.
2. GFS client translates the request and sends it to the master.
3. Master responds with the chunk handle and the (primary + secondary) replica locations.
4. Client pushes the write data to all locations.
5. Primary checks if the record fits in the specified chunk.
6. If the record does not fit, then the primary:
   • pads the chunk,
   • tells the secondaries to do the same,
   • and informs the client.
   • The client then retries the append with the next chunk.
Record Append Algorithm (2)
7. If the record fits, then the primary:
   • appends the record,
   • tells the secondaries to do the same,
   • receives responses from the secondaries,
   • and sends the final response to the client.
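The primary's fit-or-pad decision (steps 5-7) can be sketched as below. The primary, secondary, and client objects and their methods are invented placeholders used only to illustrate the control flow.

CHUNK_SIZE = 64 * 1024 * 1024

def record_append(primary, secondaries, client, record: bytes):
    """Illustrative sketch of the primary's record-append decision."""
    used = primary.chunk_used_bytes()              # step 5: does the record fit in this chunk?
    if used + len(record) > CHUNK_SIZE:
        padding = CHUNK_SIZE - used
        primary.pad(padding)                       # step 6: pad the current chunk,
        for s in secondaries:
            s.pad(padding)                         #         tell the secondaries to do the same,
        client.retry_with_next_chunk()             #         and let the client retry on the next chunk
        return None
    offset = primary.append(record)                # step 7: append the record,
    for s in secondaries:
        s.append_at(offset, record)                #         tell the secondaries to do the same
    return offset                                  #         and send the final response to the client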
Record Append Algorithm (3)
(Figure: the application sends (file name, byte range) to the GFS client (step 1); the client sends (file name, chunk index) to the master (step 2); the master returns (chunk handle, primary and secondary replica locations) (step 3).)
Record Append Algorithm (4)
(Figure: step 4 - the GFS client pushes the record data to the primary and secondary replicas, where it is buffered.)
Record Append Algorithm (5)
(Figure: step 5 - the client sends the append command to the primary; step 7 - the primary appends the buffered record D1 D2 D3 D4 to its chunk and the secondaries do the same.)
Record Append Algorithm (6)
(Figure: step 8 - the secondaries respond to the primary and their buffers are emptied; step 9 - the primary responds to the GFS client.)
Master Operation
Namespace Management and Locking
• Each master operation acquires a set of locks before it runs
• Read/write locks
• GFS logically represents its namespace as a lookup table mapping full pathnames to metadata
• If a master operation involves /d1/d2/.../dn/leaf, read locks are acquired on /d1, /d1/d2, ..., /d1/d2/.../dn, and either a read or a write lock on the full pathname /d1/d2/.../dn/leaf
Chunk Replica Placement
• Creation of (initially empty) chunks
   • Use under-utilized chunkservers; spread across racks
   • Limit the number of recent creations on each chunkserver
• Re-replication
   • Started once the number of available replicas falls below a setting
   • The master instructs a chunkserver to copy chunk data directly from an existing valid replica
   • The number of active clone operations/bandwidth is limited
• Rebalancing
   • The master rebalances the replicas periodically (examines the replica distribution and moves replicas for better disk space and load balancing)
Garbage Collection
• Deletion is logged by the master
   • The file is renamed to a hidden file, and the deletion timestamp is kept
• Periodic scan of the master's file system namespace
   • Hidden files older than 3 days are deleted from the master's memory (no further connection between the file and its chunks)
• Periodic scan of the master's chunk namespace
   • Orphaned chunks (not reachable from any file) are identified and their metadata deleted
• HeartBeat messages are used to synchronize deletion between the master and chunkservers
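A minimal sketch of this lazy deletion scheme (the names and in-memory structures are invented for illustration):

import time

HIDE_PREFIX = ".deleted."
GRACE_PERIOD = 3 * 24 * 3600          # hidden files older than 3 days are reclaimed

namespace = {}                        # path -> list of chunk handles (master's in-memory metadata)
chunk_refs = {}                       # chunk handle -> owning path (for orphan detection)

def delete_file(path: str) -> None:
    """Deletion is logged by renaming to a hidden name that carries the deletion timestamp."""
    hidden = f"{HIDE_PREFIX}{time.time():.0f}.{path}"
    namespace[hidden] = namespace.pop(path)

def scan_namespace(now: float) -> None:
    """Periodic scan: drop hidden files older than the grace period from master memory."""
    for name in list(namespace):
        if name.startswith(HIDE_PREFIX):
            stamp = float(name.split(".")[2])
            if now - stamp > GRACE_PERIOD:
                del namespace[name]   # its chunks become orphaned and are reclaimed below

def scan_chunks() -> list:
    """Periodic scan of the chunk namespace: identify orphaned chunks to reclaim."""
    live = {h for handles in namespace.values() for h in handles}
    return [h for h in chunk_refs if h not in live]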
Stale Replica Detection
• The master maintains a chunk version number to distinguish up-to-date and stale replicas
• Before an operation on a chunk, the master ensures that the version number is advanced
• Stale replicas are removed in the regular garbage collection cycle
Fault Tolerance
• Fast recovery: the master and chunkservers are designed to restart and restore state in a few seconds
• Chunk replication: across multiple machines, across multiple racks
• Master mechanisms:
   – Keep a log of all changes made to metadata
   – Periodic checkpoints of the log
   – Log and checkpoints replicated on multiple machines
   – Master state is replicated on multiple machines
   – Shadow master for reading data if the real master is down
Reading
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System",
http://labs.google.com/papers/gfs.html
Conclusion
• GFS is a distributed file system that supports large-scale data processing workloads on commodity hardware
• GFS occupies a different point in the design space
   • Component failures as the norm
   • Optimized for huge files
• GFS provides fault tolerance through
   • Replicating data
   • Fast and automatic recovery
   • Chunk replication