

CITI Technical Report 04-01

Replication Control in Distributed File Systems

Jiaying Zhang
jiayingz@eecs.umich.edu

Peter Honeyman
honey@citi.umich.edu

ABSTRACT
We present a replication control protocol for distributed file systems that can guarantee strict consistency or sequential consistency while imposing no performance overhead for normal reads. The protocol uses a primary-copy scheme with server redirection when concurrent writes occur. It tolerates any number of component omission and performance failures, even when these lead to network partition. Failure detection and recovery are driven by client accesses. No heartbeat messages or expensive group communication services are required. We have implemented the protocol in NFSv4, the emerging Internet standard for distributed filing.

April 1, 2004

Center for Information Technology Integration


University of Michigan
535 W. William St., Suite 3100
Ann Arbor, MI 48103-4978
Replication Control in Distributed File Systems

Jiaying Zhang and Peter Honeyman


jiayingz@eecs.umich.edu honey@citi.umich.edu

1. Introduction

In modern distributed systems, replication receives particular attention for improving performance and availability: failure can be hidden from users and applications if they can obtain data services from an identical replica, and replication can improve performance by scaling the number of replicas with demand and by offering nearby copies to services distributed over a wide area.

A fundamental challenge with replication is to maintain data consistency among replicas. In a replication system, the value of each logical item is stored in one or more physical data items, referred to as its copies. Each read or write operation on a logical data item must be mapped to corresponding operations on physical copies. Exactly when and how these mappings are carried out determines the consistency guarantees provided by the system and the cost of replication.

An ideal distributed file system provides applications strict consistency, i.e., a guarantee that all I/O operations yield identical results at all nodes at all times [2]. To enhance data availability, a replication system must tolerate common component failures. To reduce overhead due to replication, a replication control protocol should minimize the number of physical accesses required to implement a logical access. In practice, these three goals conflict with each other, and the trade-offs among them need to be considered.

In this paper, we present a replication control protocol for distributed file systems that tolerates a large class of failures, guarantees strict consistency, and imposes little overhead on normal reads. We observe that not all applications need strict consistency; often, ordered writes suffice, i.e., although applications do not necessarily see updates simultaneously, they are guaranteed to see them in the same order. Therefore, our system also provides support for this consistency model in case of failures; as we shall see, it allows applications to trade consistency for availability. In the following discussion, we follow Lamport [17] and refer to ordered writes as sequential consistency.

There are many examples of replication schemes in distributed file and database systems. The work described here is novel in the following ways.

First, our protocol uses a primary-copy scheme with server redirection that offers strict or sequential consistency without imposing any overhead on normal reads.

Second, our design takes into account file system workload characteristics to improve replication performance. Distributed file systems typically encounter workloads in which write sharing is rare. Furthermore, when a file is opened for writing, there is usually a burst of updates. Unlike the traditional primary-copy [6] or two-phase locking [13] approaches, our system dynamically binds a primary server when a client opens a file for writing. This offers two benefits: first, it provides superior read performance by allowing a client to access data from a nearby replication server if the referenced file is not under modification; second, when a client modifies a file, only file open and close operations require additional concurrency control messages for replication support.

Third, most recoverable systems detect failures with periodic heartbeat or probing messages, adding cost to normal operations. In our system, failure detection and recovery are driven by client access requests. No heartbeat messages or expensive group communication services are required.

Fourth, we have implemented the system as an extension to a standard Internet filing protocol, NFSv4 [9], which promises widespread acceptance and deployment.

In the remainder of this paper, we describe the types of failure that we anticipate in a distributed system. Then we introduce our replication control protocol and the recovery procedures in the case of failures. We use a result from database theory to prove that our protocol guarantees sequential consistency. After that, we describe ways to support strict consistency and discuss related work.
2. Failure Models

A system fails if it does not adequately provide the services for which it is designed. There is a well-studied hierarchy of failures in distributed systems: omission failure, performance failure, and Byzantine failure [9].

Omission failure occurs when a component fails to respond to a service request. Typical omission failures include server crash or communication link outage. Performance failure occurs when a system component fails to respond to a service request within the time limit specified for the delivery of that service. Occasional message delays caused by overloaded servers or network congestion are examples of performance faults. In Byzantine failure, components act in arbitrary, even malicious ways. Compromised security can lead to Byzantine failure.

An important subclass of omission and performance failures is network partition. A partition is a collection of connected servers and clients isolated from the rest of the system.

Although security breaches are increasingly common in the Internet, Byzantine failure is beyond the scope of our work. By implication, we rely on the distributed file system to provide authorized communication. This narrows our focus to two kinds of failure: crashed nodes and partitioned networks. Furthermore, we assume that even these failures are rare, although this assumption does not affect the security or correctness of our protocol. Rather, our goal is to develop a system that performs well in the face of typical Internet conditions and application requirements. In the next section, we present the design of the replication control protocol.

3. Protocol Design

In a distributed file system, some nodes are servers and some are clients. Clients send messages to servers to request service; servers accept the messages, carry out the requests, and return responses to the client. In this paper, we focus on the replication control protocol that guarantees consistency in the face of node and network failure. The mechanisms for locating and managing replicas, as well as implementation and performance details, can be found in a companion paper [7].

Our goal is to provide strict or sequential consistency at little cost to exclusive or shared reads. To realize this goal, we use a primary-copy method with server redirection when concurrent writes occur. The strategy differs from the usual primary-copy scheme in that it allows late and dynamic binding of the primary server, chosen at the granularity of a single file or directory. We present details in the following subsections.

3.1 File Updates

When a client opens a file for writing, the server it selects temporarily becomes the primary for that file by instructing all other replication servers to redirect further client accesses for that file to it. When the file is closed, the primary server withdraws from its role by re-enabling replication on the other replication servers.

Two or more servers may try to become the primary for a file at the same time. When these servers are in the same partition, contention is always apparent to the conflicting servers. We resolve the conflict by having the conflicting servers cooperate: the server that has disabled more replicas is allowed to continue; the server that has disabled fewer replicas quits the process; when a tie happens, the server with the bigger IP address is allowed to proceed. If the conflicting servers are in different partitions, and neither of them can collect acknowledgements from a majority of the replicas, no server can become the primary. We discuss this case further in Section 4.5.

While replication is disabled, the primary server is responsible for distributing updates to other replication servers. Updates must be delivered in order, either by including a serial number with the update or through a reliable transport protocol such as TCP. In addition to distributing the file data written by clients, each update message from the primary server to other replication servers also includes the metadata related to the update, such as the modification time. Every replication server stores the metadata accordingly after updating the file data. As we show in Section 4.4, the stored metadata help to identify the most recent copy of the file during failure recovery. File modification with two replication servers is illustrated in Figure 1.

Figure 1: File modification. 1. A client issues an open request to a server. 2. The server instructs other replication servers to redirect requests, making it the primary server for the file. 3. Replication servers acknowledge the request. 4. The primary server acknowledges the open request. 5. The client sends writes to the primary server. 6. The primary server distributes the writes to other replicas. 7. Other servers update the written file. 8. Other servers acknowledge the update. 9. The primary server updates the written file. 10. The primary server acknowledges the client write request. (Steps 5 through 10 may be repeated.) 11. The client issues a close request. 12. The close request is acknowledged. 13. The primary server instructs the redirected servers to re-enable replication. 14. The redirected servers disable redirection and acknowledge the request to re-enable replication.
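To make the contention rule of Section 3.1 concrete, here is a minimal sketch, in Python rather than the protocol's own pseudo code, of the decision a contending server might make once it learns how many replicas its competitor has disabled. The helper name wins_contention and the integer identifiers are illustrative assumptions, not part of the protocol; the paper breaks ties on the IP address.

# Sketch only: resolving contention between two would-be primaries (Section 3.1).
# Identifiers are assumed to be comparable integers.
def wins_contention(my_disabled, my_id, peer_disabled, peer_id):
    """Return True if this server should continue toward becoming primary."""
    if my_disabled != peer_disabled:
        return my_disabled > peer_disabled   # more disabled replicas wins
    return my_id > peer_id                   # tie: the bigger identifier proceeds

# A tie on disabled replicas is broken by the larger identifier.
assert wins_contention(3, 20, 3, 7)
assert not wins_contention(2, 20, 3, 7)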
3.2 Directory Updates

Directory modifications include creation, deletion, and modification of entries in a directory. Unlike file writes, little time elapses between the start and the finish of a directory update, which reduces the likelihood of concurrent accesses to the directory while it is being updated. So instead of redirecting access requests for the directory while an update is in progress, replication servers simply block these accesses until the primary server distributes the update and re-enables replication. Directory modification with two replication servers is illustrated in Figure 2.

Figure 2: Directory modification. 1. A client issues a directory update request to a server. 2. The server instructs other replication servers to block any access to this directory. 3. Replication servers acknowledge the request. 4. The primary server distributes the update request. 5. Other servers update the directory. 6. Other servers acknowledge the update. 7. The primary server processes the directory update request. 8. The directory update request is acknowledged. 9. The primary server instructs the other servers to re-enable access. 10. The redirected servers restore access to the directory and acknowledge the request to re-enable replication.

4. Failure Recovery

The principal challenge to data replication is maintaining consistency in the face of failure. In this section, we enumerate the failures that can occur and the actions to take to maintain consistency. (Failure recovery for directories is not discussed in detail, as it can be viewed as a special case of file write.) Using the view change properties and rules proposed by El Abbadi et al. [1], we are able to prove that our protocol guarantees sequential consistency. The proof and the pseudo-code are given in the next section.

In our design, we assume an asynchronous communication network: there is no bound on the message transmission delays between nodes. In such a network, it is impossible for a node p to distinguish between a failure of node q (crash failure) and a failure of the communication network connecting p and q (partition failure). Yet these two kinds of failure can have different effects on file system state: while a failed node cannot perform any operations, in a partitioned network, if no control mechanisms are used, nodes in different partitions may operate on the same logical object unwittingly, which leads to inconsistent system states.

To prevent data inconsistency, our protocol maintains an active view among replication servers and allows updates only among the servers contained in the active view. We refer to the server group covered by the active view as the active group. To ensure the uniqueness of the active group, we follow Cristian and Mishra [8] and define it to be one that contains a majority of the replication servers. In the remainder of this section, we consider the different types of failure and explain the failure recovery procedure for each case.

4.1 Client Crash

Following the specification of NFSv4, each open file is associated with a lease, subject to renewal by the client. In the event of client failure, the primary server receives no further renewals, so the lease expires. Once the primary server decides that the client has failed, it closes each file opened on behalf of the failed client. If the client is the only writer for a file, the primary server re-enables replication for the file. Unsurprisingly, the file content reflects all writes acknowledged by the primary server prior to the failure.
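As an illustration of the lease-driven cleanup just described, the following sketch (ours, not the paper's NFSv4 implementation) shows how a primary might expire client leases and withdraw from its role once the last writer of a file is gone. The lease length and the reenable_replication callback are assumptions.

# Sketch only: lease expiry on the primary (Section 4.1).
import time

LEASE_SECONDS = 90                      # illustrative; the real value is server policy

class PrimaryState:
    def __init__(self):
        self.lease_renewed_at = {}      # client_id -> time of last renewal
        self.open_writers = {}          # path -> set of client_ids with write opens

    def renew(self, client_id):
        self.lease_renewed_at[client_id] = time.monotonic()

    def expire_leases(self, reenable_replication):
        now = time.monotonic()
        for client_id, t in list(self.lease_renewed_at.items()):
            if now - t < LEASE_SECONDS:
                continue                          # lease still valid
            del self.lease_renewed_at[client_id]  # client presumed failed
            for path, writers in self.open_writers.items():
                writers.discard(client_id)        # close its opens
                if not writers:
                    reenable_replication(path)    # sole writer gone: withdraw as primary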
4.2 Network Partition

To detect partition failures, every server keeps a table that records the liveness of other replication servers from its point of view. The set of live servers is called the active view.

Network partition changes the primary server's active view. The partition is detected when the primary server disables replication, when it distributes updates to other replication servers, or when it re-enables replication. We consider each case in turn.

Figure 3: Possible situations after the primary server is partitioned in a minority partition. The figure shows the different situations (Figures 3.1 through 3.4) that may occur after the primary server P is separated in a minority partition while processing a client's write request. S2 through S7 represent other replication servers. The value in parentheses denotes the data version of the file copy at the corresponding replication server.

4.2.1 Detect Partition while Disabling Replication

The primary server constructs its active view for the file as it receives acknowledgements for its replication disabling requests. As soon as the active view constitutes a majority, the primary server acknowledges the client open request. A replication server cannot become the primary if it fails to collect acknowledgements from a majority of the replication servers. However, the client should impose a timeout on its open request to avoid waiting forever.

4.2.2 Detect Partition during Update Distribution

The primary server forwards each client write request to the replication servers in its active view. The primary server updates its local copy and acknowledges the client request after it receives update acknowledgements from a majority of the replication servers. Any replica that fails to acknowledge an update is removed from the active view.
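The following sketch restates that update path in Python. It is an illustration under assumed RPC stubs (send_update, wait_for_ack), not the implementation; the pseudo code in Section 5.2 (procedure ObjectWrite) is the authoritative version.

# Sketch only: distribute a write and acknowledge it after a majority of all
# copies have applied it; non-responders drop out of the active view (Section 4.2.2).
def distribute_write(data, mtime, all_copies, active_peers, send_update, wait_for_ack):
    """Return True if the client write may be acknowledged."""
    for server in list(active_peers):
        send_update(server, data, mtime)        # file data plus its metadata
    acks = 1                                    # the primary's own copy counts
    for server in list(active_peers):
        if wait_for_ack(server):
            acks += 1
        else:
            active_peers.discard(server)        # remove from the active view
    return acks > len(all_copies) // 2          # strict majority of all replicas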
If the active view shrinks to less than a majority while processing a client write request, different situations may occur, as illustrated in Figures 3.1 through 3.3. Because the primary server cannot determine the data version in the majority partition, it fails the client write request. Some replication servers may have applied the update; consequently, the file content is inconsistent among replication servers. However, because replication is still disabled at these servers and, as we show in Section 4.4, our failure recovery procedure guarantees that the system converges in the majority partition before re-enabling replication, inconsistent data accesses are prevented.

Figure 3.4 shows a special situation, in which network failure places the primary server in a minority partition after it assembles a majority view for the current update. As described previously, the primary server updates its own copy and acknowledges the write. At this time, there must be at least one fresh copy of the file in the majority partition. In Section 4.4, we describe how that fresh copy is found and used to update other replication servers to assure data consistency after failure recovery.

4.2.3 Detect Partition while Re-enabling Replication

A partition that is detected when the primary server processes a client close request does not affect data consistency, as all replication servers in the active view should have the same file contents. Therefore, the primary server acknowledges a client close request immediately. At the same time, it sends its active view to the other replication servers in the view and re-enables replication on them. Any server outside the active view may have stale data, so the servers whose replication is re-enabled refuse any later requests coming from a server not in the active view. A failed replica can be re-added to the active view only after it synchronizes with the up-to-date copy.

4.3 Replication Server Crash

Replication server crash can be regarded as a special case of network partition if we consider the crashed server as a single partition, so the same handling procedure applies. When the primary server detects a replication server failure while disabling replication or distributing updates, it removes the failed server from its active view. At the time of re-enabling replication, the primary server sends its active view to the other live replication servers. These servers refuse any later requests from a server outside the active view. A failed replica has to synchronize itself with a live replication server before being re-added to the active view.

4.4 Primary Server Failure Recovery

A client detects primary server failure (a crash failure, or a network partition that separates the primary server in a minority partition) when an access request times out. In this case, the client selects a live replication server and informs it of the failure. After verifying the primary server failure, the selected replacement works with the other replication servers to become the new primary, as follows.

The replacement first asks all other replication servers for permission to become the new primary and for the modification time of the file copy at each replica. The failure may have occurred during the replication disabling stage; consequently, some replication servers may not have received the replication disabling requests from the failed primary. In this case, these replicas simply grant the replacement the authority to become the primary. If the request comes to a replica whose replication has been disabled, that server verifies the primary server failure, switches the primary, and acknowledges the request. We note that upon receiving a recovery request, a replication server should first check the state of the original primary server to avoid switching the primary blindly under unusual network conditions; e.g., network connections may not be transitive.

The replacement becomes the new primary if it receives acknowledgements from a majority of the replication servers. It then determines which replicas have stale data by comparing the received modification times and synchronizes these servers with the freshest file copy, which it may first have to retrieve for itself. Following that, the replacement primary constructs a new active view, distributes it to the other replication servers in the view, and re-enables their replication.

The replacement responds to the client after finishing the recovery. The client then engages in the client-side recovery mechanism for server migration, part of the NFSv4 standard. In brief, this entails cleaning up state associated with the old server, reopening the file on a new server, and reissuing the failed request. If the failed I/O is a write, it is indeterminate whether the state of the recovered file reflects that write operation. However, this does not compromise data consistency, as repeating a write request is safe in NFSv4.

Other clients may also detect the primary server failure and initiate server recovery on different replication servers. If the recovery is not complete, the conflict is resolved through the same procedure as described in Section 3.1. When replication is later re-enabled, all such clients receive the acknowledgements and engage in client-side recovery. If a client detects the failure after the recovery is complete, it receives an immediate response and starts client-side recovery.

4.5 No Majority Partition

As described above, the failure recovery procedure starts when the primary server is isolated in a minority partition. However, if there are multiple partitions and no partition includes a majority of the replication servers, a new primary cannot be elected until the partitions heal sufficiently to allow a quorum to assemble. In this case, read requests can be satisfied, but any write requests are refused.
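To make the recovery decision of Section 4.4 concrete, here is a small sketch of how a replacement might pick the freshest copy from the modification times returned with the permission grants. The data shapes are assumptions, and the identifier tie-break used in the pseudo code of Section 5.2 is omitted for brevity.

# Sketch only: choose the freshest replica and the stale ones, given a majority.
def plan_recovery(my_id, my_mtime, grants, total_copies):
    """grants maps server_id -> reported mtime for replicas that granted permission.
    Returns (freshest_server, stale_servers), or None if no majority responded."""
    if len(grants) + 1 <= total_copies // 2:    # +1 counts the replacement itself
        return None                             # cannot form a majority view
    mtimes = dict(grants)
    mtimes[my_id] = my_mtime
    freshest = max(mtimes, key=mtimes.get)      # newest modification time wins
    stale = [s for s, t in mtimes.items() if t < mtimes[freshest]]
    return freshest, stale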
5. Proof of Correctness

In databases, one-copy serializability requires that the concurrent execution of transactions on replicated data be equivalent to a serial execution on non-replicated data [4]. In replicated file systems, sequential consistency is comparable to one-copy serializability by viewing each file or directory operation as a transaction. A file operation executes on a single object, the accessed file, but a directory operation may involve more than one object.

In this section, we prove that our replication control protocol guarantees sequential consistency by demonstrating that it obeys the view change properties and rules of El Abbadi et al. [1], which they show are sufficient conditions for a replicated database to guarantee one-copy serializability.

We introduce essential definitions, properties and rules in Section 5.1. We present the pseudo code for our protocol in Section 5.2 and prove correctness in Section 5.3.

5.1 Background of View Change Protocol

In this subsection, we introduce the definitions and the theorem proposed by El Abbadi et al. [1] for later discussion.

Definitions

In a distributed system that consists of a finite set of processors, a processor's view is defined as its estimate of the set of processors with which communication is possible.

Let P be the set of processors, p a member of P, and 2^P the power set of P. The function view: P → 2^P gives the view of each processor p in P.

A virtual partition is a set of communicating processors that share a common view and a test for membership in the partition. We assume that at any time, a processor is in at most one virtual partition.

Let V denote the set of all possible virtual partitions. The instantaneous assignment of processors to virtual partitions is given by the partial function vp: P → V. vp is undefined for p if p is not assigned to any virtual partition.

The total function defview: P → {true, false} characterizes the domain of vp, i.e., defview(p) is true if p is currently assigned to some virtual partition, and is false otherwise.

A function join(p, v) denotes the event where p changes its local state to indicate that it is currently assigned to v. Similarly, a function depart(p, v) denotes p changing its local state to indicate that it is no longer assigned to v.

The function members: V → 2^P yields for each virtual partition v the set of processors that were at some point in their past assigned to v.

El Abbadi et al. decompose a replication management algorithm into two parts: a replication control protocol that translates each logical operation into one or more physical operations, and a concurrency control protocol that synchronizes the execution of physical operations. Based on this decomposition, they present three properties and five rules, and prove that any replication control protocol that satisfies these properties and rules guarantees one-copy serializability when combined with a concurrency control protocol that ensures conflict-preserving serializability. Below we describe them in turn.

Three Properties

(S1) View consistency
If defview(p) and defview(q) and vp(p)=vp(q), then view(p)=view(q).

(S2) Reflexivity
If defview(p), then p ∈ view(p).

(S3) Serializability of virtual partitions
For any execution E produced by the replicated data management protocol, the set of virtual partition identifiers occurring in E can be totally ordered by a relation << that satisfies the condition: if v << w and p ∈ (members(v) ∩ view(w)), then depart(p, v) happens before join(q, w) for any q ∈ members(w). This property states that p's departure from v must be visible to all the members of its new cohort in w.

Five Rules

(R1) Majority rule
A logical object L is accessible from a processor p assigned to a virtual partition only if a majority of copies of L reside on processors in view(p).

(R2) Read rule
Processor p implements the logical read of L by checking if L is accessible from it and, if so, sending a physical read request to any processor q ∈ (view(p) ∩ copies(L)). (If q does not respond, then the physical read can be retried at another processor or the logical read can be aborted.)

(R3) Write rule
Processor p implements the logical write of L by checking if L is accessible from it and, if so, sending physical write requests to all processors q ∈ (view(p) ∩ copies(L)) that are accessible and have copies of L. (If any physical write request cannot be honored, the logical write is aborted.)
(R4) Execution rule
All physical operations carried out on behalf of a transaction t must be executed by processors of the same virtual partition v. In this case we say that t executes in v.

(R5) Partition initialization rule
Let p be a processor that has joined a new virtual partition v, and let Lp be a copy of a logical object L that is accessible in v. The first operation on Lp must be either a write of Lp performed on behalf of a transaction executing in v, or a recover(Lp) operation that writes into Lp the most recent value of L written by a transaction executing in some virtual partition u such that u << v for any legal creation order <<.

CP-Serializability

Two physical operations conflict if they operate on the same physical object and at least one of them is a write.

An execution E of a set of transactions T is conflict-preserving (CP) serializable if there exists an equivalent serial execution ES of T that preserves the order of execution of conflicting operations.

A concurrency control protocol ensures CP-serializability if it guarantees that the execution of any set of transactions is conflict-preserving.

THEOREM

Let R be a replication control protocol obeying properties S1–S3 and rules R1–R5, and let C be a concurrency control protocol that ensures CP-serializability of physical operations. Any execution of transactions produced by R and C is one-copy serializable.

5.2 Protocol Pseudo Code

To demonstrate that our replication control protocol satisfies the properties and rules described above, we give an abstract implementation of the protocol below. For simplicity, we assume a reliable underlying transport protocol, such as TCP, among replication servers. (The assumption can easily be relaxed by including serial numbers in exchanged messages.)

We specify that the view of a replica p for an object L is undefined if the replication for L is disabled on p. We assume each replication server has a unique identifier assigned at its creation time, e.g., its DNS name. A replication server denotes its unique identifier as myid. When a replication server starts, its view for each object L is initialized as the set of all the replicas holding a copy of L, denoted L.copies.

The Boolean function enabled: L → {true, false} is true on a replica p if the replication of object L on p is enabled. On a replica p, the variable L.primary records the primary server p admits for L, and the variable L.view records p's view for L when L is accessible on p. On a replication server, the function disable(L) disables the replication of L, and the function enable(L) enables the replication of L.

When a server receives a write-open request from a client, it calls the procedure ObjectDisable to disable the replication of the file on the other replicas. The server becomes the primary after receiving acknowledgements from a majority of the replication servers. The routine can fail either because a network partition separates the server from the majority of the replicas, or because a competing server starts the ObjectDisable procedure for L simultaneously. In the first case, the server simply keeps retrying and leaves the client to detect the failure when the request times out. In the second case, we resolve the conflict with the strategy described in Section 3.1: the server that disables a majority of the replicas continues; the server that disables fewer replicas quits the procedure; when a tie happens, the server with the bigger identifier wins.

In the procedure's pseudo code, the parameter δ in line 6 is an upper bound on the message transmission delay between any two replicas. For a subset A ⊆ P, the function count(A) in line 11 returns the number of replicas in A. In line 18, the parameter ε is a random value in [0, 2δ]. We have the server wait for a duration of 2δ+ε if it fails to become the primary, allowing the competitor to disable the replication during this period. The purpose of waiting an additional ε is to resolve extraordinary situations. For example, in a conflict that involves more than two competing servers, having these servers restart the replication disabling process at different times reduces the likelihood that they conflict again.
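The randomized back-off is the only probabilistic element in the procedure; the following one-line sketch (ours, purely illustrative) computes the waiting time it prescribes.

# Sketch only: a losing contender waits 2δ plus a random ε drawn from [0, 2δ].
import random

def backoff_seconds(delta):
    return 2 * delta + random.uniform(0.0, 2 * delta)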
procedure ObjectDisable(in L: L);
1   var A: set of P; T: Timer; p, q: P;
2   while [enabled(L)] do {
3     disable(L); L.contender←myid;
4     for each p ∈ L.view−{myid} do
5       send(p, "disable", L, myid);
6     T.set(δ×2); A←{myid};
7     while [1] do {
8       select from
9         receive("ok", L, q) ⇒
10          A←A∪{q};
11          if [count(A)>count(L.copies)/2] or [count(A)=count(L.copies)/2 and myid>L.contender] then
12            L.primary←myid;
13            return true;
14          fi;
15        T.timeout ⇒
16          for each p ∈ A−{myid} do
17            send(p, "enable", L, myid, ∅);
18          enable(L); wait(δ×2+ε);
19          break;
20      endselect;
21    };
22  };
23  return false;

The primary server calls the procedure ObjectWrite to update the copies of L on other replication servers when it receives a write request for L. In the pseudo code, the variable mtime records the modification time of L specified by the primary server. It is sent to the other replication servers along with the update data. Each replica stores this time with its physical copy of L.

During ObjectWrite, the primary server is the only replica whose view is defined. Therefore it can acknowledge a client write request after a majority of the replicas reply, instead of waiting for acknowledgements from all the replication servers in its view. This is equivalent to saying that after receiving acknowledgements from a majority of the replicas, the primary server's view becomes the set of all the replicas that replied; the view extends as acknowledgements from more replication servers arrive.

procedure ObjectWrite(in L: L, value);
1   var A: set of P; p, q: P;
2   mtime: µs (time in microseconds);
3   mtime←current-time; A←{myid};
4   for each p ∈ L.view−{myid} do
5     send(p, "update", L, value, mtime, myid);
6   while [count(A)≤count(L.copies)/2] do {
7     select from
8       receive("ok", L, q) ⇒ A←A∪{q};
9     endselect;
10  };
11  update(L, value, mtime);
12  return;

When the client closes L after modification, the primary server calls the procedure ObjectEnable to enable the replication for L on the other replication servers. According to NFSv4, a client should not have any pending writes before issuing a close request. This implies that at least a majority of the replication servers have the fresh copy of L at this point. However, before re-enabling the replication, the primary server should wait for a period of time, allowing slow replication servers to catch up, as well as constructing a complete active view for L. For simplicity, we implement this process with the primary server sending an empty update to the other replication servers.

The primary server distributes its view to the other active replicas when re-enabling replication. Each of these replication servers stores the view with L. Any requests from a replica outside the view are not allowed until that replica obtains the fresh copy of L and is re-added into the view.

procedure ObjectEnable(in L: L);
1   var A: set of P; T: Timer; p, q: P;
2   for each p ∈ L.view−{myid} do
3     send(p, "update", L, ∅, L.mtime, myid);
4   T.set(δ×4); A←{myid};
5   while [A≠L.view] do {
6     select from
7       receive("ok", L, q) ⇒ A←A∪{q};
8       T.timeout ⇒
9         if count(A)>count(L.copies)/2 then
10          break;
11        fi;
12    endselect;
13  };
14  L.primary←∅; enable(L); L.view←A;
15  for each p ∈ L.view−{myid} do
16    send(p, "enable", L, myid, L.view);
17  return;
Based on the above procedures, we give the abstract implementation for processing logical read, write-open, write and write-close requests below. In the procedures LogicalWriteOpen and LogicalWriteClose, we use the variable L.count to record the number of concurrent writers for L on the primary server. The symbols "<" and ">" delimit critical sections that are protected with a mutual exclusion lock. We note that the wait function used in all the procedures automatically releases the lock.

procedure LogicalRead(in L: L);
1   if enabled(L) or L.primary=myid then
2     serve client request;
3   else
4     redirect client to L.primary;
5   fi;

procedure LogicalWriteOpen(in L: L);
1   <if ∼enabled(L) and L.primary=myid then
2     L.count++; acknowledge client request;
3   else if ObjectDisable(L)=true then
4     L.count←1; acknowledge client request;
5   else if L.primary≠∅ then
6     redirect client to L.primary;
7   else
8     deny client request;
9   fi;
10  fi;
11  fi;>

procedure LogicalWrite(in L: L);
1   if enabled(L) or L.primary≠myid then
2     deny client request;
3   else
4     ObjectWrite(L);
5     acknowledge client request;
6   fi;

procedure LogicalWriteClose(in L: L);
1   <if enabled(L) or L.primary≠myid then
2     deny client request;
3   else
4     L.count−−; acknowledge client request;
5     if L.count=0 then
6       ObjectEnable(L);
7       if L.view≠L.copies then
8         schedule(Probe(L));
9       fi;
10    fi;
11  fi;>

Correctness does not require a failed server to rejoin the active views from which it is excluded, but common sense argues that a repaired server should return to service. Because a rejoining server does not know what it has missed during a partition, we charge the primary server with the responsibility of probing for the return of any replication servers that are not in the active view. The primary server does this by scheduling the task Probe after re-enabling the replication (line 8 in the procedure LogicalWriteClose).

When it detects a reconnection, the task Probe calls the procedure AddMember to synchronize the returning replica with the up-to-date copy and re-add that replica to the active view. In the procedure, the function sync(L, source, target) represents the process that synchronizes the copy of L on target with the copy of L on source.

task Probe(in L: L);
1   var R: set of P; T: Timer; r, q: P;
2   R←L.copies−L.view;
3   while [R≠∅] do {
4     for each r ∈ R do
5       send(r, "probe", myid);
6     T.set(δ×2);
7     while [1] do {
8       select from
9         receive("ok", L, q) ⇒
10          AddMember(L, q); reset(T);
11        T.timeout ⇒ break;
12      endselect;
13    };
14    R←L.copies−L.view;
15  };
16  exit task;

procedure AddMember(in L: L, add-id: P);
1   <if ObjectDisable(L)=true then
2     sync(L, myid, add-id);
3     L.view←L.view∪{add-id};
4     ObjectEnable(L);
5   fi;>

If the primary server fails, a connected client can detect the failure when its request times out. At this time, the client appeals to another replication server to recover from the failure, which triggers the procedure Recover on the selected replacement. If that server succeeds in collecting acknowledgements from a majority of the replication servers, it brings all accessible copies up to date, forms a new view, and distributes it to the acknowledged replicas. After that, it replies to the client request and starts the task Probe to detect the return of the failed replicas.
procedure Recover(in L: L, primary: P);
1   var A: set of P; T: Timer; p, q, sync-id: P;
2   newest: µs (time in microseconds);
3   if Alive(primary) then
4     deny client request; return;
5   fi;
6   if primary∉L.view then
7     acknowledge client request; return;
8   fi;
9   <if enabled(L) then
10    disable(L); L.primary←primary;
11  fi;>
12  while [∼enabled(L)] do {
13    <if L.primary≠primary then
14      wait(δ×2); continue;
15    else
16      L.primary←myid; L.contender←myid;
17    fi;>
18    for each p ∈ L.view−{myid} do
19      send(p, "recover", L, myid);
20    T.set(δ×2); A←{myid};
21    newest←L.mtime; sync-id←myid;
22    while [1] do {
23      select from
24        receive("ok", L, mtimeq, q) ⇒
25          A←A∪{q};
26          if mtimeq>newest then
27            newest←mtimeq; sync-id←q;
28          fi;
29        T.timeout ⇒ break;
30      endselect;
31    };
32    if count(A)>count(L.copies)/2 or [count(A)=count(L.copies)/2 and myid>L.contender] then
33      if sync-id≠myid then
34        sync(L, sync-id, myid);
35      fi;
36      for each p ∈ A−{myid} do
37        sync(L, myid, p);
38      L.view←A; enable(L);
39      for each p ∈ A−{myid} do
40        send(p, "enable", L, myid, L.view);
41      schedule(Probe(L));
42    else
43      for each p ∈ A−{myid} do
44        send(p, "enable", L, myid, ∅);
45      L.primary←primary; wait(δ×2+ε);
46    fi;
47  };
48  acknowledge client request; return;

The task Monitor runs on every replication server and is responsible for generating responses to the requests from other replicas. In line 29, the replication server calls the procedure Alive to check the state of the original primary server upon receiving a recovery request, so that it does not switch the primary blindly under unusual network conditions; e.g., network connections may not be transitive.

1   task Monitor;
2   var view: set of P; T: Timer; p: P; L: L;
3   mtime: µs (time in microseconds);
4   while [1] do {
5     select from
6       receive("disable", L, p) ⇒
7         if p∉L.view then continue; fi;
8         <if enabled(L) then
9           disable(L); L.primary←p;
10          send(p, "ok", myid, L);
11        else if p<L.contender then
12          L.contender←p;
13        fi;
14        fi;>
15      receive("update", L, val, mtime, p) ⇒
16        if ∼enabled(L) and L.primary=p then
17          update(L, val, mtime);
18          send(p, "ok", myid, L);
19        fi;
20      receive("enable", L, p, view) ⇒
21        <if ∼enabled(L) and L.primary=p then
22          enable(L); L.primary←∅;
23          if view≠∅ then L.view←view; fi;
24          send(p, "ok", myid, L);
25        fi;>
26      receive("probe", p) ⇒ send(p, "ok", myid);
27      receive("recover", L, p) ⇒
28        if p∉L.view then continue; fi;
29        <if ∼Alive(L.primary) then
30          if enabled(L) then disable(L); fi;
31          L.primary←p;
32          send(p, "ok", myid, L, L.mtime);
33        else if p<L.contender then
34          L.contender←p;
35        fi;
36        fi;>
37      receive("remove", D, L, mtime, p) ⇒
38        <if ∼enabled(D) and D.primary=p and ∼enabled(L) and L.primary=p then
39          remove(D, L, mtime);
40          send(p, "ok", myid, D);
41        fi;>
42      endselect;
43  };
procedure Alive(check-id: P);
1   var T: Timer;
2   if check-id=myid then return true; fi;
3   if check-id=∅ then return false; fi;
4   send(check-id, "probe", myid);
5   T.set(δ×2);
6   while [1] do {
7     select from
8       receive("ok", check-id) ⇒ return true;
9       T.timeout ⇒ return false;
10    endselect;
11  };

Next we give the entry remove procedure as an example of directory modifications. Unlike file writes, a directory modification can involve more than one object. We disable the replication for all the involved objects before performing a directory update. In the procedure, D represents the parent directory of L, the entry that is to be removed. We show the replication disabling for the two objects as separate operations; in the real implementation, these requests are sent in one message.

procedure LogicalRemove(in D, L: L);
1   var A: set of P; T: Timer; p, q: P;
2   mtime: µs (time in microseconds);
3   while [∼ObjectDisable(D) or ∼ObjectDisable(L)] do
4     {};
5   mtime←current-time;
6   for each p ∈ D.view−{myid} do
7     send(p, "remove", D, L, mtime, myid);
8   T.set(δ×2); A←{myid};
9   while [A≠D.view] do {
10    select from
11      receive("ok", D, q) ⇒
12        A←A∪{q};
13        if count(A)>count(D.copies)/2 then
14          remove(D, L, mtime);
15          acknowledge client;
16        fi;
17      T.timeout ⇒
18        if count(A)>count(D.copies)/2 then
19          break;
20        fi;
21    endselect;
22  };
23  D.primary←∅; enable(D); D.view←A;
24  for each p ∈ D.view−{myid} do
25    send(p, "enable", D, myid, D.view);
26  if D.view≠D.copies then schedule(Probe(D)); fi;
27  return;

5.3 Correctness Proof

To prove that our protocol guarantees sequential consistency, we show that it satisfies properties S1–S3, rules R1–R5, and CP-serializability.

First, the use of a single replica, the primary server, to determine the view of a majority partition ensures S1. Second, the primary server distributes its view only to the replicas included in the active view, so S2 is guaranteed. Third, every replica departs from its old virtual partition (disables replication) before forming a new partition (obtains a new view and enables replication), which ensures S3.

To see that R1 is satisfied, we observe that an object L can be accessed on a server p only if the replication of L is enabled on p or if p is the primary server for L. In both cases, the view of p contains a majority of the replication servers.

In our system, a logical read of object L is performed either by reading the physical copy at the connected server p when L is accessible on p, or by reading the copy at the primary server if replication is disabled on p. Property S2 then gives us rule R2.

Rule R3, the write rule, follows by similar reasoning. L is writable at p only if p holds a copy of L and p is the primary server for L. p then sends the update to all the replication servers in view(p).

Rule R4 is automatically satisfied, since a file or directory operation is executed atomically in file systems.

Rule R5 says that before a copy Lp can be read in a partition v, it must contain the most recent value assigned to L. The rule is satisfied in our protocol by requiring each replica to synchronize with the up-to-date copy before being added to the active view (procedure AddMember) or forming a new view (procedure Recover).

Finally, in our protocol, CP-serializability is ensured by requiring all access requests to be served from the primary server when concurrent writes occur.

6. Discussion

As described previously, each replication server stores an active view for every replicated object, which incurs substantial storage overhead. Given our assumption about the relative frequency of failure, it is more efficient to store the complement of an active view on each replication server. In our implementation, we record the complement of an active view in NFSv4 extended attributes, if it is nonempty.
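As an illustration of that storage choice, the sketch below records the complement of the active view in an extended attribute and treats an absent attribute as meaning that every replica is active. It uses Linux user xattrs as a stand-in for NFSv4 named attributes, and the attribute name is a hypothetical choice, not necessarily the one used in the actual implementation.

# Sketch only: persist the complement of the active view (Section 6).
import os

ATTR = b"user.replication.excluded"             # hypothetical attribute name

def save_excluded(path, all_copies, active_view):
    excluded = sorted(set(all_copies) - set(active_view))
    if excluded:
        os.setxattr(path, ATTR, ",".join(excluded).encode())
    else:                                       # empty complement: store nothing
        try:
            os.removexattr(path, ATTR)
        except OSError:
            pass

def load_active_view(path, all_copies):
    try:
        excluded = os.getxattr(path, ATTR).decode().split(",")
    except OSError:                             # attribute absent: every replica active
        return set(all_copies)
    return set(all_copies) - set(excluded)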
So far, the presented algorithm is sufficient to guarantee sequential consistency. However, when a partition occurs, a read operation in a minority partition may return stale data. Such a situation does not compromise the system state, but it can lead to a loss of external consistency [9], i.e., the ordering of operations inside the system does not agree with the order that an application expects. If sequential consistency is too weak a guarantee for an application, strict consistency, which guarantees both one-copy serializability and external consistency, is required. Our protocol supports this requirement by disabling data updates everywhere when a failure occurs. We are aware that other methods exist to support strict consistency. For example, replication servers can exchange periodic heartbeat messages to detect partitions and bound the staleness of data, or a reader can check a specified number of replication servers to ensure that it accesses the fresh copy of the data. Compared with these methods, our strategy allows less availability for write operations in the case of failure. However, it has two dominant advantages: first, it does not affect read performance, which we believe is critical for a file system to be really useful; second, it adds no overhead or network traffic to normal operations.

7. Related Work

A lot of research has been undertaken on replication control in distributed systems. Our replication control protocol shares many features with previous work in this area. Specifically, our protocol can be regarded as an extension of the primary copy method [6] and a special implementation of the view change protocol in distributed file systems.

Echo [10] and Harp [5] are file systems that use the primary copy scheme to support mutable replication. In these systems, replication is used only to increase data availability; potential performance benefits from replication are not targeted. Both of these systems use a pre-determined primary server for a collection of disks, a potential bottleneck if those disks contain hot spots or if the primary server is located remotely from clients. In our system, we avoid this problem by allowing dynamic determination of a primary server, chosen at the granularity of a single file or directory. We use replication to improve performance as well as availability: a client can choose a nearby or lightly loaded replication server to access data, and switch to a working replication server if the originally selected server fails.

El Abbadi et al. first proposed a view change protocol in the context of transactional replication systems [1]. Our failure detection and recovery scheme can be regarded as a special implementation of a view change protocol in distributed file systems, with two novelties. First, in our protocol, failure detection and recovery are driven by client accesses. This eliminates the need for periodic heartbeat messages or special group communication services. Second, by taking advantage of the features provided by our primary-copy scheme, when the system is free of failure, our view change protocol is entirely embedded in the concurrency control messages (replication enabling and replication disabling messages). This helps to reduce network traffic in normal operations.

Recent years have seen a lot of work on peer-to-peer (P2P) file systems, including OceanStore [15], Ivy [16], Pangaea [19] and Farsite [20]. These systems address the design of systems in untrusted, highly dynamic environments. Consequently, reliability and continuous data availability are usually critical goals in these systems, but performance or data consistency are often sacrificed. Compared to these systems, our system addresses data replication among file system servers, which are more reliable but have more stringent requirements on average I/O performance. This leads to different design strategies in our approach.

8. Conclusion

This paper presents a replication control protocol for distributed file systems that supports strict consistency or sequential consistency, even in partition failures. In the protocol, failure detection and recovery are driven by client accesses. No heartbeat messages or expensive group communication services are required. The protocol imposes a small performance penalty on writes, and no overhead on reads. It is well suited for enterprise computing environments in which reads outnumber writes and failures are rare.

9. References

[1] A. El Abbadi, D. Skeen, and F. Cristian, "An Efficient Fault-tolerant Protocol for Replicated Data Management", Proc. of 5th ACM SIGACT-SIGMOD, pp. 215-229, (1985).

[2] P. Bernstein and N. Goodman, "The failure and recovery problem for replicated distributed databases", ACM TODS, (Dec. 1984).

[3] F. Cristian, H. Aghili, R. Strong, and D. Dolev, "Atomic Broadcast: From Simple Message Diffusion to Byzantine Agreement", Proc. of 15th FTCS, pp. 200-206, (June 1985).

[4] S.B. Davidson, H. Garcia-Molina, and D. Skeen, "Consistency in Partitioned Networks", ACM Computing Surveys 17(3), (1985).

[5] B. Liskov, S. Ghemawat, R. Gruber, P. Johnson, L. Shrira, and M. Williams, "Replication in the Harp File System", Proc. of 13th SOSP, Pacific Grove, (Oct. 1991).
[6] P. Alsberg and J. Day, "A Principle for Resilient Sharing of Distributed Resources", Proc. of 2nd International Conference on Software Engineering, pp. 627-644, (Oct. 1976).

[7] J. Zhang and P. Honeyman, "Naming, Migration, and Replication in NFSv4", Tech. Report, CITI, University of Michigan, (2003).

[8] F. Cristian and S. Mishra, "Automatic service availability management in asynchronous distributed systems", Proc. of 2nd International Workshop on Configurable Distributed Systems, Pittsburgh, PA, (Mar. 1994).

[9] Sun Microsystems, Inc., "NFS Version 4 Protocol", RFC 3010, (Dec. 2000).

[10] A. Hisgen, A. Birrell, T. Mann, M. Schroeder, and G. Swart, "Granularity and Semantic Level of Replication in the Echo Distributed File System", Proc. of Workshop on Management of Replicated Data, Houston, (Nov. 1990).

[11] M. Pease, R. Shostak, and L. Lamport, "Reaching agreement in the presence of faults", Journal of the ACM, 27, (April 1980).

[12] L. Lamport, R. Shostak, and M. Pease, "The Byzantine Generals Problem", ACM Trans. on Prog. Lang. and Systems 4(3), (July 1982).

[13] P.A. Bernstein and N. Goodman, "Concurrency control in distributed database systems", ACM Computing Surveys 13(2), (1981).

[14] J.N. Gray, "The Transaction Concept: Virtues and Limitations", Proc. of 7th VLDB, pp. 144-154, (1981).

[15] S. Rhea, P. Eaton, D. Geels, H. Weatherspoon, B. Zhao, and J. Kubiatowicz, "Pond: the OceanStore Prototype", Proc. of 2nd USENIX FAST, (Mar. 2003).

[16] A. Muthitacharoen, R. Morris, T.M. Gil, and B. Chen, "Ivy: A Read/Write Peer-to-peer File System", Proc. of 5th OSDI, Boston, (Dec. 2002).

[17] L. Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs", IEEE Trans. on Computers, C-28(9):690-691, (Sep. 1979).

[18] D.K. Gifford, "Information Storage in a Decentralized Computer System", Tech. Report CSL-81-8, Xerox Corporation, (Mar. 1983).

[19] Y. Saito, C. Karamanolis, M. Karlsson, and M. Mahalingam, "Taming aggressive replication in the Pangaea wide-area file system", Proc. of 5th OSDI, (Dec. 2002).

[20] A. Adya, W.J. Bolosky, M. Castro, R. Chaiken, G. Cermak, J.R. Douceur, J. Howell, J.R. Lorch, M. Theimer, and R.P. Wattenhofer, "FARSITE: Federated, Available, and Reliable Storage for an Incompletely Trusted Environment", Proc. of 5th OSDI, (Dec. 2002).
