
CS3551- DISTRIBUTED SYSTEMS

UNIT II LOGICAL TIME AND GLOBAL STATE

Logical Time: Physical Clock Synchronization: NTP – A Framework for a System of Logical Clocks – Scalar Time – Vector Time; Message Ordering and Group Communication: Message Ordering Paradigms – Asynchronous Execution with Synchronous Communication – Synchronous Program Order on Asynchronous System – Group Communication – Causal Order – Total Order; Global State and Snapshot Recording Algorithms: Introduction – System Model and Definitions – Snapshot Algorithms for FIFO Channels.

Logical Time

Physical Clock Synchronization: NTP


Motivation:
• In centralized systems, there is only a single clock. A process gets the time by simply
issuing a system call to the kernel.
• In distributed systems, there is no global clock or common memory. Each processor has
its own internal clock and its own notion of time.
• These clocks can easily drift seconds per day, accumulating significant errors over time.
• Also, because different clocks tick at different rates, they may not always remain
synchronized even if they were synchronized when they started.
• This clearly poses serious problems for applications that depend on a synchronized notion
of time.
• For most applications and algorithms that run in a distributed system, we need to know
time in one or more of the following contexts:

o The time of the day at which an event happened on a specific machine in the
network.
o The time interval between two events that happened on different machines in the
network.
o The relative ordering of events that happened on different machines in the
network.
• Unless the clocks in each machine have a common notion of time , time-based queries
cannot be answered.
• Clock synchronization has a significant effect on many problems like secure systems,
fault diagnosis and recovery, scheduled operations, database systems, and real-world
clock values.
• Clock synchronization is the process of ensuring that physically distributed processors
have a common notion of time.


• Due to differing clock rates, the clocks at various sites may diverge with time, and
clock synchronization must be performed periodically to correct this clock skew in
distributed systems.
• Clocks are synchronized to an accurate real-time standard like UTC (Universal
Coordinated Time).
• Clocks that must not only be synchronized with each other but also have to adhere to
physical time are termed physical clocks.

Definitions and Terminology


Let Ca and Cb be any two clocks.
Time:
The time of a clock in a machine p is given by the function Cp(t), where Cp(t) = t for a
perfect clock.
Frequency:
Frequency is the rate at which a clock progresses. The frequency at time t of clock Ca is
C′a(t).
Offset:
Clock offset is the difference between the time reported by a clock and the real time. The
offset of clock Ca is given by Ca(t) − t. The offset of clock Ca relative to Cb at time t ≥ 0
is given by Ca(t) − Cb(t).
Skew:
The skew of a clock is the difference in the frequencies of the clock and the perfect clock.
The skew of clock Ca relative to clock Cb at time t is C′a(t) − C′b(t). If the skew is
bounded by ρ, then clock values are allowed to diverge at a rate in the range 1 − ρ to 1 + ρ.
Drift (rate):
The drift of clock Ca is the second derivative of the clock value with respect to time,
namely, C′′a(t). The drift of clock Ca relative to clock Cb at time t is C′′a(t) − C′′b(t).

Clock Inaccuracies
• Physical clocks are synchronized to an accurate real-time standard like UTC(Universal
Coordinated Time).
• However, due to the clock inaccuracies discussed above, a timer (clock) is said to be
working within its specification if 1 − ρ ≤ dC/dt ≤ 1 + ρ, where the constant ρ is the
maximum skew rate specified by the manufacturer.


Figure 3.5 illustrates the behavior of fast, slow, and perfect clocks with respect to UTC.

Figure 3.5: The behavior of fast, slow, and perfect clocks with respect to UTC.

Offset delay estimation method


• The Network Time Protocol (NTP), which is widely used for clock synchronization on
the Internet, uses the offset delay estimation method.
• The design of NTP involves a hierarchical tree of time servers.

1. The primary server at the root synchronizes with the UTC.


2. The next level contains secondary servers, which act as a backup to the primary
server.
3. At the lowest level is the synchronization subnet which has the clients.

Clock offset and delay estimation:


In practice, a source node cannot accurately estimate the local time on the target node due to
varying message or network delays between the nodes.
• This protocol employs a common practice of performing several trials and chooses the
trial with the minimum delay.
• Figure 3.6 shows how NTP timestamps are numbered and exchanged between peers A
and B.
• Let T1, T2, T3, T4 be the values of the four most recent timestamps, as shown.
• Assume clocks A and B are stable and running at the same speed.


Figure 3.6: Offset and delay estimation.

• Let a=T1−T3 and b=T2−T4.


• If the network delay difference from A to B and from B to A, called the differential delay, is
small, the clock offset θ and round-trip delay δ of B relative to A at time T4 are
approximately given by θ = (a + b)/2 and δ = a − b.

• Each NTP message includes the latest three timestamps T1, T2 and T3, while T4 is
determined upon arrival.
• Thus, both peers A and B can independently calculate delay and offset using a single
bidirectional message stream, as shown in Figure 3.7.

Figure 3.7: Timing diagram for the two servers.


The Network Time Protocol synchronization protocol


• A pair of servers in symmetric mode exchange pairs of timing messages.
• A store of data is then built up about the relationship between the two servers (pairs of
offset and delay).
Specifically, assume that each peer maintains pairs (Oi, Di), where Oi is a measure of the
offset (θ) and Di is the transmission delay of the two messages (δ).
• The offset corresponding to the minimum delay is chosen. Specifically, the delay and
offset are calculated as follows. Assume that message m takes time t to transfer and m′
takes t′ to transfer.
• The offset between A’s clock and B’s clock is O. If A’s local clock time is A(t) and B’s
local clock time is B(t), we have A(t) = B(t) + O. Then Ti−2 = Ti−3 + t + O and
Ti = Ti−1 − O + t′; assuming t ≈ t′, the offset and delay for this pair of messages are
estimated as Oi = [(Ti−2 − Ti−3) + (Ti−1 − Ti)]/2 and Di = (Ti − Ti−3) − (Ti−1 − Ti−2).
• The eight most recent pairs of (Oi,Di) are retained.


• The value of Oi that corresponds to minimum Di is chosen to estimate O.
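To make the computation concrete, here is a minimal Python sketch of the offset-delay estimation described above. This is an illustration, not NTP itself; the function names and the representation of a trial as a tuple (T1, T2, T3, T4) are assumptions.

# Sketch: estimate clock offset and round-trip delay from the four timestamps
# of one exchange (a = T1 - T3, b = T2 - T4), then choose the offset that
# corresponds to the minimum delay over the retained trials.
def offset_and_delay(t1, t2, t3, t4):
    a = t1 - t3
    b = t2 - t4
    theta = (a + b) / 2.0      # estimated clock offset of B relative to A
    delta = a - b              # estimated round-trip delay
    return theta, delta

def best_offset(trials):
    # trials: list of (T1, T2, T3, T4) tuples; NTP retains the eight most recent.
    estimates = [offset_and_delay(*t) for t in trials]
    theta, _delta = min(estimates, key=lambda e: e[1])
    return theta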
--------------------------------------------------------------------------------------------------------------------
A Framework for a System of Logical Clocks

Definition

• A system of logical clocks consists of a time domain T and a logical clock C. Elements of
T form a partially ordered set over a relation <.
• Relation < is called the happened before or causal precedence. Intuitively, this relation
is analogous to the earlier than relation provided by the physical time.
• The logical clock C is a function that maps an event e in a distributed system to an
element in the time domain T, denoted as C(e) and called the timestamp of e, and is
defined as follows:

C : H → T (where H denotes the set of events of the distributed execution)

such that the following property is satisfied:


for two events ei and ej , ei → ej =⇒ C(ei ) < C(ej ).

This monotonicity property is called the clock consistency condition.

When T and C satisfy the following condition,

for two events ei and ej , ei → ej⇔ C(ei ) < C(ej )

the system of clocks is said to be strongly consistent.

Implementing Logical Clocks

Implementation of logical clocks requires addressing two issues: data structures local to every
process to represent logical time, and a protocol to update the data structures to ensure the
consistency condition.

Each process pi maintains data structures that allow it the following two capabilities:

• A local logical clock, denoted by lci , that helps process pi measure its own
progress.
• A logical global clock, denoted by gci , that is a representation of process pi ’s local view
of the logical global time. Typically, lci is a part of gci .

The protocol ensures that a process’s logical clock, and thus its view of the global time, is
managed consistently. The protocol consists of the following two rules:

• R1: This rule governs how the local logical clock is updated by a process when it
executes an event.
• R2: This rule governs how a process updates its global logical clock to update its view of
the global time and global progress.
• Systems of logical clocks differ in their representation of logical time and also in the
protocol to update the logical clocks.

--------------------------------------------------------------------------------------------------------------------

Scalar Time

• Proposed by Lamport in 1978 as an attempt to totally order events in a distributed system.


• Time domain is the set of non-negative integers.
• The logical local clock of a process pi and its local view of the global time are squashed
into one integer variable Ci .
• Rules R1 and R2 to update the clocks are as follows:
o R1: Before executing an event (send, receive, or internal), process pi executes the
following:


Ci :=Ci + d (d > 0)

• In general, every time R1 is executed, d can have a different value; however, typically
d is kept at 1.
o R2: Each message piggybacks the clock value of its sender at sending time.
When a process pi receives a message with timestamp Cmsg , it executes the following
actions:
o Ci := max(Ci , Cmsg )
o Execute R1.
o Deliver the message.
Figure 3.1 shows evolution of scalar time.

Evolution of scalar time:

Figure 3.1: The space-time diagram of a distributed execution.
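To make rules R1 and R2 concrete, the following is a minimal Python sketch of a scalar (Lamport) clock; the class and method names are illustrative assumptions, not part of the original notes. A process calls send_event() before sending (piggybacking the returned value on the message) and receive_event(Cmsg) when a message with timestamp Cmsg arrives.

# Minimal sketch of a Lamport scalar clock implementing rules R1 and R2.
class ScalarClock:
    def __init__(self, d=1):
        self.c = 0          # Ci: single integer holding local + global view
        self.d = d          # increment value (typically d = 1)

    def local_event(self):            # R1: before executing any event
        self.c += self.d
        return self.c

    def send_event(self):             # R1, then piggyback Ci on the message
        return self.local_event()

    def receive_event(self, c_msg):   # R2: Ci := max(Ci, Cmsg), then R1, then deliver
        self.c = max(self.c, c_msg)
        return self.local_event()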

Basic Properties

Consistency Property

Scalar clocks satisfy the monotonicity and hence the consistency property:

for two events ei and ej , ei → ej =⇒ C(ei ) < C(ej ).

Total Ordering

• Scalar clocks can be used to totally order events in a distributed system.


• The main problem in totally ordering events is that two or more events at different
processes may have identical timestamps.
• For example, in Figure 3.1, the third event of process P1 and the second event of process
P2 have identical scalar timestamps.


A tie-breaking mechanism is needed to order such events. A tie is broken as follows:

• Process identifiers are linearly ordered and tie among events with identical scalar
timestamp is broken on the basis of their process identifiers.
• The lower the process identifier in the ranking, the higher the priority.
• The timestamp of an event is denoted by a tuple (t, i ) where t is its time of occurrence
and i is the identity of the process where it occurred.
• The total order relation ≺ on two events x and y with timestamps (h, i) and (k, j),
respectively, is defined as follows:
x ≺ y ⇔ (h < k or (h = k and i < j))
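A small Python sketch of this tie-breaking comparison on timestamps (t, i) (the function name is illustrative):

# Total order: compare scalar time first, then break ties by process identifier.
def totally_ordered_before(ts_x, ts_y):
    (h, i), (k, j) = ts_x, ts_y
    return h < k or (h == k and i < j)

# Example: (3, 1) precedes (3, 2) because, for equal times, process id 1 < 2.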

Event counting

• If the increment value d is always 1, the scalar time has the following interesting
property: if event e has a timestamp h, then h-1 represents the minimum logical duration,
counted in units of events, required before producing the event e;
• We call it the height of the event e.
• In other words, h-1 events have been produced sequentially before the event e regardless
of the processes that produced these events.
• For example, in Figure 3.1, five events precede event b on the longest causal path ending
at b.

No Strong Consistency

• The system of scalar clocks is not strongly consistent; that is, for two events

ei and ej, C(ei) < C(ej) ⇏ ei → ej.

• For example, in Figure 3.1, the third event of process P1 has a smaller scalar timestamp
than the third event of process P2; however, the former did not happen before the latter.
• The reason that scalar clocks are not strongly consistent is that the logical local clock and
logical global clock of a process are squashed into one, resulting in the loss of causal
dependency information among events at different processes.
• For example, in Figure 3.1, when process P2 receives the first message from process P1,
it updates its clock to 3, forgetting that the timestamp of the latest event at P1 on which it
depends is 2.

-------------------------------------------------------------------------------------------------------------------

Vector Time

• The system of vector clocks was developed independently by Fidge, Mattern and
Schmuck.


• In the system of vector clocks, the time domain is represented by a set of n-dimensional
non-negative integer vectors.
• Each process pi maintains a vector vti [1..n], where vti [i ] is the local logical clock of pi
and describes the logical time progress at process pi .
• vti[j] represents process pi's latest knowledge of process pj's local time.
• If vti[j] = x, then process pi knows that the local time at process pj has progressed up to x.
• The entire vector vti constitutes pi ’s view of the global logical time and is used to
timestamp events.

Process pi uses the following two rules R1 and R2 to update its clock:
• R1: Before executing an event, process pi updates its local logical time as follows:
vti [i ] := vti [i ] + d (d > 0)
• R2: Each message m is piggybacked with the vector clock vt of the sender process at
sending time. On the receipt of such a message (m,vt), process pi executes the following
sequence of actions:
• Update its global logical time as follows: vti[k] := max(vti[k], vt[k]), for 1 ≤ k ≤ n.
o Execute R1.
o Deliver the message m.

• The timestamp of an event is the value of the vector clock of its process when the event is
executed.
• Figure 3.2 shows an example of vector clocks progress with the increment
value d=1.
Initially, a vector clock is [0, 0, 0, ...., 0].

An Example of Vector Clocks

Figure 3.2: Evolution of vector time.
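As a concrete illustration of rules R1 and R2 for vector time, here is a minimal Python sketch; the class and method names are illustrative assumptions.

# Minimal sketch of a vector clock for n processes implementing rules R1 and R2.
class VectorClock:
    def __init__(self, n, pid, d=1):
        self.vt = [0] * n    # vti[1..n], initially all zeros
        self.pid = pid       # index i of this process
        self.d = d

    def local_event(self):               # R1: vti[i] := vti[i] + d
        self.vt[self.pid] += self.d
        return list(self.vt)

    def send_event(self):                # R1, then piggyback a copy of vt on the message
        return self.local_event()

    def receive_event(self, vt_msg):     # R2: component-wise max, then R1, then deliver
        self.vt = [max(a, b) for a, b in zip(self.vt, vt_msg)]
        return self.local_event()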


Comparing Vector Timestamps

The following relations are defined to compare two vector timestamps, vh and vk:

vh = vk ⇔ ∀x : vh[x] = vk[x]
vh ≤ vk ⇔ ∀x : vh[x] ≤ vk[x]
vh < vk ⇔ vh ≤ vk and ∃x : vh[x] < vk[x]
vh ∥ vk ⇔ ¬(vh < vk) ∧ ¬(vk < vh)

• If the process at which an event occurred is known, the test to compare two timestamps
can be simplified as follows: if events x and y respectively occurred at processes pi and pj
and are assigned timestamps vh and vk, respectively, then

x → y ⇔ vh[i] ≤ vk[i]
x ∥ y ⇔ vh[i] > vk[i] ∧ vh[j] < vk[j]
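A minimal Python sketch of these comparisons, assuming timestamps are equal-length lists of integers (function names are illustrative):

# Sketch: comparing two vector timestamps vh and vk.
def leq(vh, vk):
    return all(a <= b for a, b in zip(vh, vk))

def happened_before(vh, vk):          # vh < vk
    return leq(vh, vk) and any(a < b for a, b in zip(vh, vk))

def concurrent(vh, vk):               # vh || vk
    return not happened_before(vh, vk) and not happened_before(vk, vh)

# Simplified test when the originating process i of event x is known:
def happened_before_at(vh, i, vk):    # x (at pi, timestamp vh) -> y (timestamp vk)
    return vh[i] <= vk[i]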

Properties of Vector Time

Isomorphism

If events in a distributed system are timestamped using a system of vector clocks, we have the
following property.

If two events x and y have timestamps vh and vk, respectively, then

x → y ⇔ vh < vk and x ∥ y ⇔ vh ∥ vk.
Thus, there is an isomorphism between the set of partially ordered events produced by a
distributed computation and their vector timestamps.

Strong Consistency

• The system of vector clocks is strongly consistent; thus, by examining the vector
timestamp of two events, we can determine if the events are causally related.


• However, Charron-Bost showed that the dimension of vector clocks cannot be less than
n, the total number of processes in the distributed computation, for this property to hold.

Event Counting

• If d = 1 (in rule R1), then the i-th component of the vector clock at process pi,
vti[i], denotes the number of events that have occurred at pi until that instant.
• So, if an event e has timestamp vh, then vh[j] denotes the number of events executed by
process pj that causally precede e. Clearly, Σj vh[j] − 1 represents the total number of
events that causally precede e in the distributed computation.

Applications:

Distributed debugging, implementations of causal ordering communication and causal


distributed shared memory, establishment of global breakpoints, and in determining the
consistency of checkpoints in optimistic recovery.

--------------------------------------------------------------------------------------------------------------------

MESSAGE ORDERING AND GROUP COMMUNICATION

Outline and Notations

Outline

• Message orders: non-FIFO, FIFO, causal order, synchronous order


• Group communication with multicast: causal order, total order
• Expected behaviour semantics when failures occur
• Multicasts: application layer on overlays; also at network layer

Notations
• Network (N, L); event set (E, ≺)
• Message mi has send event si and receive event ri; generic send and receive events are denoted s and r
• For a message M, send(M) and receive(M) denote its send and receive events
• a ∼ b denotes that the corresponding events a and b occur at the same process
• Send-receive pairs: T = {(s, r) ∈ Ei × Ej | s corresponds to r}

MESSAGE ORDERING PARADIGMS


The order of delivery of messages in a distributed system is an important aspect of system


executions because it determines the messaging behavior that can be expected by the distributed
program.

Several orderings on messages have been defined: (i) non-FIFO, (ii) FIFO, (iii) causal order, and
(iv) synchronous order. There is a natural hierarchy among these orderings.

Asynchronous and FIFO Executions

Figure 2.1: (a) An A-execution that is non-FIFO. (b) An A-execution that is FIFO.

Asynchronous executions

An A-execution is an execution (E, ≺) for which the causality relation ≺ is a partial order.

There are no causality cycles; delivery on a logical link is not necessarily FIFO (e.g., the
connectionless service of the IPv4 network layer). Messages may be delivered in any order, not
necessarily first-in first-out; such executions are also known as non-FIFO executions.

All physical links obey FIFO, due to the physical properties of the medium.

FIFO executions

An A -execution in which:

for all (s ,r) and (s′,r′) ∈T,(s∼s′ and r∼r′ and s≺s′) =⇒ r≺r′

Although a logical link may be inherently non-FIFO, in a FIFO execution messages on a logical
channel are necessarily delivered in the order in which they are sent. Therefore, FIFO logical
channels can be realistically assumed when designing distributed algorithms.

Can assume connection-oriented service at transport layer, e.g., TCP

To implement FIFO over a non-FIFO link:

The sender assigns and appends a (sequence_num, connection_id) tuple to each message.
The receiver uses a buffer to order the incoming messages.
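A minimal Python sketch of the receiver-side buffering just described (class and field names are assumptions):

# Sketch: realizing a FIFO logical channel over a non-FIFO link.  The sender
# tags each message with (connection_id, sequence_num); the receiver buffers
# out-of-order arrivals and delivers in sequence order.
from collections import defaultdict

class FifoReceiver:
    def __init__(self):
        self.next_seq = defaultdict(int)      # next expected seq per connection
        self.buffer = defaultdict(dict)       # connection_id -> {seq: message}

    def on_arrival(self, connection_id, seq, message):
        """Returns the list of messages that can now be delivered in order."""
        self.buffer[connection_id][seq] = message
        delivered = []
        while self.next_seq[connection_id] in self.buffer[connection_id]:
            seq_to_deliver = self.next_seq[connection_id]
            delivered.append(self.buffer[connection_id].pop(seq_to_deliver))
            self.next_seq[connection_id] += 1
        return delivered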


Causal Order: Definition

Causal order (CO)

A CO execution is an A-execution in which, for all (s, r) and (s′, r′) ∈ T, (r ∼ r′ and s ≺ s′) ⇒ r ≺ r′.

• If send events s and s′ are related by causality ordering (not physical time ordering), their
corresponding receive events r and r′ occur in the same order at all common dests.
• If s and s′ are not related by causality, then CO is vacuously satisfied.
• Figure 2.2(a) shows an execution that violates CO because s1 ≺ s3 and, at the common
destination P1, we have r3 ≺ r1.
• Figure 2.2(b) shows an execution that satisfies CO. Only s1 and s2 are related by
causality but the destinations of the corresponding messages are different.

Figure 2.2: (a) Violates CO as s1≺s3; r3≺r1

(b) Satisfies CO. (c) Satisfies CO. No send events related by causality. (d) Satisfies CO.

• Figure 2.2(c) shows an execution that satisfies CO. No send events are related by
causality.
• Figure 2.2(d) shows an execution that satisfies CO. s2 and s1 are related by causality but
the destinations of the corresponding messages are different. Similarly for s2 and s3.

Causal order is useful for applications requiring updates to shared data, implementing distributed
shared memory, and fair resource allocation such as granting of requests for distributed mutual
exclusion.

Figure 2.2(a) shows an execution that violates CO. To enforce CO, message m3 should be kept
pending in the local buffer after it arrives at P1, until m1 arrives and is delivered.

Causal Order: Definition from Implementation Perspective

CO alternate definition

If send(m1) ≺ send(m2), then for each common destination d of messages m1 and m2, deliverd(m1) ≺ deliverd(m2) must be satisfied.

Message arrival vs. delivery:


• Message m that arrives in OS buffer at Pi may have to be delayed until the messages
that were sent to Pi causally before m was sent (the “overtaken” messages) have
arrived!
• The event of an application processing an arrived message is referred to as a delivery
event (instead of as a receive event).

• CO implies that no message is overtaken by a chain of messages between the same (sender, receiver)
pair. In Fig. 2.1(a), m1 is overtaken by the chain (m2, m3).
• CO degenerates to FIFO when m1 and m2 are sent by the same process.
• Uses: updates to shared data, implementing distributed shared memory, fair resource
allocation; collaborative applications, event notification systems, distributed virtual
environments

Message Order (MO)

An A-execution in which, for all (s, r) and (s′, r′) ∈ T, s ≺ s′ ⇒ ¬(r′ ≺ r).

Fig. 2.3(a): s1 ≺ s3 but ¬(r3 ≺ r1) is false ⇒ MO is not satisfied. Equivalently, MO states that
a message cannot be overtaken by a chain of messages.

Figure 2.3: (a) Violates CO as s1 ≺ s3; r3 ≺ r1. (b) Satisfies CO. (c) Satisfies CO. No send
events related by causality. (d) Satisfies CO.

Consider any message pair, say m1 and m3 in Figure 2.3(a). Here s1 ≺ s3 but ¬(r3 ≺ r1) is false. Hence,
the execution does not satisfy MO.

Causal Order: Other Characterizations


Figure 2.4: (a) Violates CO as s1 ≺ s3; r3 ≺ r1. (b) Satisfies CO. (c) Satisfies CO. No send
events related by causality. (d) Satisfies CO.

Empty-Interval (EI) property

An execution(E, ≺) is an EI execution if for each (s, r ) ∈ T , the open interval {x ∈ E | s ≺ x ≺ r } in the


partial order is empty.

• Fig. 2.4(b): consider M2; there is no event x such that s2 ≺ x ≺ r2. This holds for all messages in
the execution, so it is an EI execution.
• For an EI pair (s, r), there exists some linear extension < such that the corresponding interval
{x ∈ E | s < x < r} is also empty.
• An empty (s, r) interval in a linear extension implies that s and r may be arbitrarily close; this is
shown by a vertical arrow in a timing diagram.
• An execution E is CO iff for each message M, there exists some space-time diagram in which that
message can be drawn as a vertical arrow.
• A linear extension of a partial order (E, ≺) is any total order (E, <) in which each ordering relation
of the partial order is preserved.
• CO ⇏ all messages can be drawn as vertical arrows in the same space-time diagram (that would
require all (s, r) intervals to be empty in the same linear extension, i.e., a synchronous execution).

Common Past and Future


An execution (E, ≺) is CO iff for each pair (s, r) ∈ T and each event e ∈ E,
• Weak common past: e ≺ r ⇒ ¬(s ≺ e)
• Weak common future: s ≺ e ⇒ ¬(e ≺ r)
If the past of both s and r are identical (and analogously for the future), viz., e ≺ r ⇒ e ≺ s and
s ≺ e ⇒ r ≺ e, we get a subclass of CO executions, called synchronous executions.


Synchronous Executions (SYNC)


When all the communication between pairs of processes uses synchronous send and receive
primitives, the resulting order is the synchronous order.

Figure 2.5: (a) Execution in an async system (b) Equivalent sync execution.
• Handshake between sender and receiver.
• In a timing diagram, the “instantaneous” message communication can be shown by
bidirectional vertical message lines.

Synchronous Executions: Definition


Causality in a synchronous execution.
The synchronous causality relation << on E is the smallest transitive relation that satisfies the
following.

Synchronous execution (or S-execution).

An execution (E, <<) for which the causality relation << is a partial order.

Timestamping a synchronous execution.

An execution (E, ≺) is synchronous iff there exists a mapping from E to T (scalar timestamps) such that
• for any message M, T(s(M)) = T(r(M));
• for each process Pi, if ei ≺ ei′ then T(ei) < T(ei′).

---------------------------------------------------------------------------------------------------------------------

ASYNCHRONOUS EXECUTION WITH SYNCHRONOUS COMMUNICATION


When all the communication between pairs of processes is by using synchronous send and
receive primitives, the resulting order is synchronous order.

Will a program written for an asynchronous system (A-execution) run correctly if run with
synchronous primitives?

• There is a possibility that the program may deadlock.

Figure 2.6: A-execution deadlocks when using synchronous primitives.

An A-execution that is realizable under synchronous communication is a realizable with


synchronous communication (RSC) execution.

Examples The asynchronous execution of Figure 2.6, illustrated in Figure 2.7(a) using a timing
diagram, will deadlock if run with synchronous primitives. The executions in Figure 2.7(b)–(c)
will also deadlock when run on a synchronous system.

Figure 2.7: Illustration of non-RSC A-executions.

RSC Executions

In a non-separated linear extension (defined below), each send event is immediately followed by its
corresponding receive event. An A-execution that admits such a linear extension can be realized under
synchronous communication and is called a realizable with synchronous communication (RSC) execution.


A non-separated linear extension of (E, ≺) is a linear extension of (E, ≺) such that for each pair (s,
r) ∈ T, the interval {x ∈ E | s ≺ x ≺ r} is empty.

Examples

• Figure 2.4(d): one linear extension of the events (shown in the figure) is non-separated, while
another linear extension of the same events is separated.
• Figure 2.5(b): likewise, one linear extension is non-separated and another is separated.

RSC execution

An A-execution (E, ≺) is an RSC execution iff there exists a non-separated linear extension of
the partial order (E, ≺).

Checking all linear extensions has exponential cost, so a practical test uses the crown
characterization.

Crown: Definition

Crown

Let E be an execution. A crown of size k in E is a sequence ((si, ri), i ∈ {0, ..., k−1}) of pairs of
corresponding send and receive events such that

s0 ≺ r1, s1 ≺ r2, ..., sk−2 ≺ rk−1, sk−1 ≺ r0.


Figure 2.8: Illustration of non-RSC A-executions and crowns.

Fig. 2.8(a): the crown is ((s1, r1), (s2, r2)) as we have s1 ≺ r2 and s2 ≺ r1.
Fig. 2.8(b): the crown is ((s1, r1), (s2, r2)) as we have s1 ≺ r2 and s2 ≺ r1.
Fig. 2.8(c): the crown is ((s1, r1), (s3, r3), (s2, r2)) as we have s1 ≺ r3, s3 ≺ r2, and s2 ≺ r1.
Fig. 2.4(a): the crown is ((s1, r1), (s2, r2), (s3, r3)) as we have s1 ≺ r2, s2 ≺ r3, and s3 ≺ r1.

Some observations
• In a crown, si and ri+1 may or may not be on the same process.
• A non-CO execution must have a crown.
• CO executions that are not synchronous also have a crown (see Fig. 2.4(b)).
• The cyclic dependencies of a crown ⇒ the messages cannot be scheduled serially ⇒ not RSC.

Crown Test for RSC executions

• Define the relation ↪ on T × T on the messages of the execution (E, ≺) as follows:
[s, r] ↪ [s′, r′] iff s ≺ r′. Observe that the condition s ≺ r′ (which has the form
used in the definition of a crown) is implied by each of the four conditions: (i) s ≺ s′,
(ii) s ≺ r′, (iii) r ≺ s′, and (iv) r ≺ r′.
• Now define a directed graph G↪ = (T, ↪), where the vertex set is the set of messages
T and the edge set is defined by ↪.
Observe that ↪ is a partial order iff G↪ has no cycle, i.e., there must not be
a cycle with respect to ↪ on the set of corresponding (s, r) events.
• Observe from the definition of a crown that G↪ has a directed cycle iff (E, ≺) has a
crown.

Crown criterion

An A-computation is RSC, i.e., it can be realized on a system with synchronous communication, iff it
contains no crown.

Crown test complexity: O(|E|) (actually, # communication events)
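A minimal Python sketch of the crown test under simple assumptions: each message is given as a pair of event identifiers (s, r), and precedes(a, b) decides the causality relation ≺ (for example, via vector timestamps). Build the graph over messages and look for a directed cycle; a cycle means a crown, hence not RSC. Note that this naive edge construction is quadratic in the number of messages, so it is slower than the linear-time bound quoted above.

# Sketch of the crown test: the execution is RSC iff the graph over messages,
# with an edge (s,r) -> (s',r') whenever s precedes r', has no directed cycle.
def is_rsc(messages, precedes):
    """messages: list of (s, r) pairs; precedes(a, b): True iff a precedes b."""
    n = len(messages)
    edges = {i: [j for j in range(n) if j != i
                 and precedes(messages[i][0], messages[j][1])]
             for i in range(n)}

    WHITE, GREY, BLACK = 0, 1, 2
    color = [WHITE] * n

    def has_cycle(u):                       # DFS looking for a back edge
        color[u] = GREY
        for v in edges[u]:
            if color[v] == GREY or (color[v] == WHITE and has_cycle(v)):
                return True
        color[u] = BLACK
        return False

    return not any(color[i] == WHITE and has_cycle(i) for i in range(n))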

Timestamps for a RSC execution

Execution (E, ≺) is RSC iff there exists a mapping from E to T (scalar timestamps) such that:
• for any message M, T(s(M)) = T(r(M));
• for each (a, b) in (E × E) \ T, a ≺ b ⇒ T(a) < T(b).


Hierarchy of Message Ordering Paradigms


Figure 2.9: Hierarchy of message ordering paradigms. (a) Venn diagram (b) Example
executions.
• An A-execution is RSC iff A is an S-execution.
• RSC ⊂ CO ⊂ FIFO ⊂ A.
• More restrictions on the possible message orderings in the smaller classes. The degree of
concurrency is most in A, least in SYNC.
• A program using synchronous communication is easiest to develop and verify. A program using
non-FIFO communication, resulting in an A-execution, is hardest to design and verify.

Simulations: Async Programs on Sync Systems

RSC execution: schedule events as per a non-separated linear extension


• adjacent (s, r ) events sequentially
• partial order of original A-execution unchanged
If A-execution is not RSC:
• partial order has to be changed; or
• model each Ci,j by control process Pi,j and use sync communication (see Fig 2.10)


Figure 2.10: Modeling channels as processes to simulate an execution using asynchronous
primitives on a synchronous system.
The communication events at the application processes Pi and Pj are encircled.
• Enables decoupling of sender from receiver.
• This implementation is expensive.

Simulations: Synch Programs on Async Systems

A (valid) S-execution can be trivially realized on an asynchronous system by scheduling the
messages in the order in which they appear in the S-execution. The partial order of the S-execution
remains unchanged, but the communication occurs on an asynchronous system that uses
asynchronous communication primitives. Once a message send event is scheduled,
the middleware layer waits for an acknowledgment; after the ack is received, the synchronous
send primitive completes.

--------------------------------------------------------------------------------------------------------------------

SYNC PROGRAM ORDER ON ASYNC SYSTEMS

Non-determinism:

A distributed program is deterministic if repeated runs of the same program produce
the same partial order. In many cases, however, programs are non-deterministic, in the following senses.

A receive call can receive a message from any sender who has sent a message, if the expected
sender is not specified. The receive calls in most of the algorithms are non-deterministic in this
sense – the receiver is willing to perform a rendezvous with any willing and ready sender.

Multiple send and receive calls which are enabled at a process can be executed in an
interchangeable order.


If process i sends to j and j sends to i concurrently using blocking synchronous calls, a
deadlock results.

If, however, the receive call at one of the processes can be scheduled before its send call, then there is no
deadlock.

Rendezvous:

One form of group communication is called multiway rendezvous, which is a synchronous


communication among an arbitrary number of asynchronous processes. All the processes
involved “meet with each other,” i.e., communicate “synchronously” with each other at one time.

Rendezvous between a pair of processes at a time, which is called binary rendezvous.

Support for binary rendezvous communication was first provided by programming languages
such as CSP and Ada. In these languages, the repetitive command (the ∗ operator) over the
alternative command on multiple guarded commands (each having the form Gi →
CLi) is used.

A guard Gi is a boolean expression. If a guard Gi evaluates to true then CLi is said to be enabled,
otherwise CLi is said to be disabled.

Some typical observations about synchronous communication under binary rendezvous are as
follows

1. For the receive command, the sender must be specified.


2. Send and receive commands may be individually disabled or enabled.
3. Synchronous communication is implemented by scheduling messages under the covers
using asynchronous communication.

Algorithm for binary rendezvous:

These algorithms typically share the following features. At each process, there is a set of tokens
representing the current interactions that are enabled locally. If multiple interactions are enabled,
a process chooses one of them and tries to “synchronize” with the partner process. The problem
reduces to one of scheduling messages satisfying the following constraints:

• Schedule on-line, atomically, and in a distributed manner, i.e., the scheduling code at any
process does not know the application code of other processes.
• Schedule in a deadlock-free manner (i.e., crown-free), such that both the sender and
receiver are enabled for a message when it is scheduled.


• Schedule to satisfy the progress property (i.e., find a schedule within a bounded number
of steps) in addition to the safety (i.e., correctness) property.

Bagrodia’s Algorithm for Binary Rendezvous (1)

Assumptions
1. Receive commands are forever enabled from all processes.
2. A send command, once enabled, remains enabled until it completes, i.e., it is not possible
that a send command gets disabled (by its guard getting falsified) before the send is
executed.
3. To prevent deadlock, process identifiers are used to introduce asymmetry to break
potential crowns that arise.
4. Each process attempts to schedule only one send event at any time.
Message types: M, ack(M), request(M), permission(M).
A process blocks when it knows it can successfully synchronize the current message with the partner
process.

Fig 2.11: Rules to prevent message cycles. (a) High-priority process blocks. (b) Low-priority process
does not block.

• To send to a lower priority process, messages M and ack(M) are involved in that order.
The sender issues send(M) and blocks until ack(M) arrives. Thus, when sending to a
lower priority process, the sender blocks waiting for the partner process to synchronize
and send an acknowledgement.
• To send to a higher priority process, messages request(M), permission(M), and M are
involved, in that order. The sender issues send(request(M)), does not block, and awaits
permission. When permission(M) arrives, the sender issues send(M).

Bagrodia’s Algorithm for Binary Rendezvous: Code


Thus, when sending to a higher priority process, the sender asks the higher priority process via
the request(M) to give permission to send. When the higher priority process gives permission to
send, the higher priority process, which is the intended receiver, blocks.
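The priority rules just described can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the algorithm listing from the text: send_msg, wait_for, and deliver are assumed transport helpers passed in by the caller, and a smaller process id is assumed to mean higher priority.

# Sketch of Bagrodia's priority rules for binary rendezvous.
# Assumptions: send_msg(dest, kind, m) sends a control/application message;
# wait_for(kind, m) blocks until the matching message arrives (other arrivals
# are still processed); deliver(m) hands m to the application.

def sync_send(m, dest, my_id, send_msg, wait_for):
    if my_id < dest:                       # destination has lower priority
        send_msg(dest, "M", m)             # send M and block until ack(M)
        wait_for("ack", m)
    else:                                  # destination has higher priority
        send_msg(dest, "request", m)       # ask permission; do not block the process
        wait_for("permission", m)          # requests arriving meanwhile are still handled
        send_msg(dest, "M", m)

def on_receive(kind, m, sender, send_msg, wait_for, deliver):
    if kind == "M":                        # receives are assumed always enabled
        deliver(m)
        send_msg(sender, "ack", m)
    elif kind == "request":                # a lower-priority process asks to send
        send_msg(sender, "permission", m)
        wait_for("M", m)                   # the higher-priority receiver now blocks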


Figure 2.12: Scheduling messages with sync communication

In either case, a higher priority process blocks on a lower priority process. So cyclic waits are
avoided.

In more detail, a cyclic wait is prevented because before sending a message M to a higher
priority process, a lower priority process requests the higher priority process for permission to
synchronize on M, in a non-blocking manner. While waiting for this permission, there are two
possibilities:

1. If a message M′ from a higher priority process arrives, it is processed by a receive
(assuming receives are always enabled) and ack(M′) is returned. Thus, a cyclic wait is
prevented.
2. Also, while waiting for this permission, if a request(M′) from a lower priority process
arrives, a permission(M′) is returned and the process blocks until M′ actually arrives.
--------------------------------------------------------------------------------------------------------------------

GROUP COMMUNICATION

Processes across a distributed system cooperate to solve a joint task. Often, they need to
communicate with each other as a group, and therefore there needs to be support for group
communication. A message broadcast is the sending of a message to all members in the
distributed system.

In multicasting, a message is sent to a certain subset of the processes in the system, identified as a
group. At the other extreme is unicasting, which is the familiar point-to-point message
communication.

If a multicast algorithm requires the sender to be a part of the destination group, the multicast
algorithm is said to be a closed group algorithm. If the sender of the multicast can be outside the
destination group, the multicast algorithm is said to be an open group algorithm.

Closed group algorithms cannot be used in several scenarios such as in a large system (e.g., on-
line reservation or Internet banking systems)

Causal order (CO):

Causal order has many applications such as updating replicated data, allocating requests in a fair
manner, and synchronizing multimedia streams.


Figure 2.13: (a) Updates to 3 replicas. (b) Causal order (CO) and total order violated. (c) Causal
order violated.

Consider Figure 2.13(a), which shows two processes P1 and P2 that issue updates to the three
replicas R1(d), R2(d), and R3(d) of data item d.

Message m creates a causality between send(m1) and send(m2). If P2 issues its update causally
after P1 issued its update, then P2’s update should be seen by the replicas after they see P1’s
update, in order to preserve the semantics of the application.

Figure 2.13(b) shows that R1 sees P2’s update first, while R2 and R3 see P1’s update first. Here,
CO is violated. Figure 2.13(c) shows that all replicas see P2’s update first. However, CO is still
violated. If message m did not exist as shown, then the executions shown in Figure 2.13(b) and
(c) would satisfy CO.

Safety

In order to prevent causal order from being violated, a message M that arrives at a process may
need to be buffered until all system wide messages sent in the causal past of the send(M) event to
that same destination have already arrived.

Liveness

A message that arrives at a process must eventually be delivered to the process.

THE RAYNAL–SCHIPER–TOUEG ALGORITHM

Each message M should carry a log of all other messages, or their identifiers, sent
causally before M’s send event, and sent to the same destination destM. This log can then be
examined to ensure whether it is safe to deliver a message. All algorithms aim to reduce this log
overhead, and the space and time overhead of maintaining the log information at the processes.

In order to implement safety, the messages piggyback the control information.
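The algorithm listing itself is not reproduced above, so the following is a minimal Python sketch of the usual formulation of the Raynal-Schiper-Toueg scheme: each process keeps an array DELIV (number of messages delivered from each process) and a matrix SENT (SENT[k][j] = number of messages this process knows Pk has sent to Pj), piggybacks SENT on every message, and delays delivery of an arrived message until every message sent to this destination in its causal past has been delivered. The data-structure and method names are assumptions.

# Sketch of the Raynal-Schiper-Toueg causal-order algorithm.
class RSTProcess:
    def __init__(self, n, pid):
        self.n, self.pid = n, pid
        self.SENT = [[0] * n for _ in range(n)]   # SENT[k][j]: msgs k sent to j (as known here)
        self.DELIV = [0] * n                      # DELIV[k]: msgs from k delivered here
        self.pending = []                         # arrived but not yet deliverable

    def send(self, j, payload, transport):
        st = [row[:] for row in self.SENT]        # piggyback a copy of SENT
        transport(j, (self.pid, payload, st))
        self.SENT[self.pid][j] += 1

    def _deliverable(self, msg):                  # delivery condition
        _i, _payload, st = msg
        return all(self.DELIV[k] >= st[k][self.pid] for k in range(self.n))

    def on_arrival(self, msg):
        self.pending.append(msg)
        delivered, progress = [], True
        while progress:
            progress = False
            for m in list(self.pending):
                if self._deliverable(m):
                    i, payload, st = m
                    self.DELIV[i] += 1
                    for k in range(self.n):       # merge the piggybacked SENT matrix
                        for l in range(self.n):
                            self.SENT[k][l] = max(self.SENT[k][l], st[k][l])
                    self.pending.remove(m)
                    delivered.append(payload)
                    progress = True
        return delivered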


How does the algorithm simplify if all messages are broadcast?

---------------------------------------------------------------------------------------------------------------------

Causal order (CO)

OPTIMAL KS ALGORITHM FOR CO: PRINCIPLES

• Mi,a: the a-th multicast message sent by Pi

Delivery Condition for correctness:

A message M∗ that carries the information “d ∈ M.Dests”, where message M was sent to d in the causal
past of Send(M∗), is not delivered to d if M has not yet been delivered to d.

Necessary and Sufficient Conditions for Optimality:

• For how long should the information “ d∈ Mi, a.Dests” be stored in the log at a process,
and piggybacked on messages?
• An optimal CO algorithm stores in local message logs and propagates on messages,
information of the form “d is a destination of M” about a message M sent in the causal
past, as long as and only as long as:
o (Propagation Constraint I: ) it is not known that the message Mi ,a is delivered
to d , and


o (Propagation Constraint II: ) it is not known that a message has been sent to d
in the causal future of Send (Mi,a), and hence it is not guaranteed using a
reasoning based on transitivity that the message Mi,a will be delivered to d in CO.

⇒ if either (I) or (II) is false, “d ∈ M.Dests” must not be stored or propagated, even to remember
that (I) or (II) has been falsified.

“d ∈ Mi,a.Dests” must be available in the causal future of event ei,a, but not in the causal future
of Deliverd(Mi,a), and not in the causal future of ek,c, where d ∈ Mk,c.Dests and there is no
other message sent causally between Mi,a and Mk,c to the same destination d.

In the causal future of Deliverd(Mi,a) and Send(Mk,c), the information is redundant; elsewhere,
it is necessary.

Information about what messages have been delivered (or are guaranteed to be delivered without
violating CO) is necessary for the Delivery Condition.

• For optimality, this cannot be stored. Algorithm infers this using set-operation logic.

“d ∈ M.Dests”

• must exist at e1 and e2 because (I) and (II) are true.


• must not exist at e3 because (I) is false
• must not exist at e4, e5, e6 because (II) is false


• must not exist at e7, e8 because (I) and (II) are false

• Info about messages (i) not known to be delivered and (ii) not guaranteed to be delivered
in CO, is explicitly tracked using (source, ts, dest).

• Must be deleted as soon as either (i) or (ii) becomes false

• Info about messages already delivered and messages guaranteed to be delivered in CO is


implicitly tracked without storing or propagating it:
• derived from the explicit information.
• used for determining when (i) or (ii) becomes false for the explicit information
being stored/piggybacked.


Explicit tracking : Tracking of (source, timestamp, destination) information for messages (i) not
known to be delivered and (ii) not guaranteed to be delivered in CO, is done explicitly using the
l.Dests field of entries in local logs at nodes and o.Dests field of entries in messages.

Implicit tracking Tracking of messages that are either (i) already delivered, or (ii) guaranteed to
be delivered in CO, is performed implicitly. The information about messages (i) already
delivered or (ii) guaranteed to be delivered in CO is deleted and not propagated because it is
redundant as far as enforcing CO is concerned. However, it is useful in determining what
information that is being carried in other messages and is being stored in logs at other nodes has
become redundant and thus can be purged.
---------------------------------------------------------------------------------------------------------------------

TOTAL MESSAGE ORDER

Total order

For each pair of processes Pi and Pj and for each pair of messages Mx and My that are delivered
to both processes, Mx is delivered before My at Pi if and only if Mx is delivered before My at Pj.


The execution in Figure 2.15(b) does not satisfy total order. Even if the message m did not exist,
total order would not be satisfied. The execution in Figure 2.15(c) satisfies total order.

CENTRALIZED ALGORITHM FOR TOTAL ORDER:


Assuming all processes broadcast messages, the centralized solution shown in Algorithm enforces
total order in a system with FIFO channels. Each process sends the message it wants to broadcast to a
centralized process, which simply relays all the messages it receives to every other process over FIFO
channels. It is straightforward to see that total order is satisfied. Furthermore, this algorithm also
satisfies causal message order.

Fig 2.15: (a) Updates to 3 replicas. (b) Total order violated. (c) Total order not violated

Time Complexity: 2 hops/ transmission

Message complexity: n
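A minimal Python sketch of this centralized approach: every process forwards its broadcast to a fixed coordinator, which relays each message to all members over FIFO channels; since everyone receives messages in the single order chosen by the coordinator, total (and causal) order follows. The Coordinator class and the fifo_send callback are assumptions.

# Sketch of centralized total-order broadcast over FIFO channels.
class Coordinator:
    def __init__(self, members, fifo_send):
        self.members = members          # list of destination process ids
        self.fifo_send = fifo_send      # fifo_send(dest, msg): FIFO channel send

    def on_broadcast_request(self, msg):
        for p in self.members:          # relay in one fixed order => total order
            self.fifo_send(p, msg)

def broadcast(msg, coordinator_id, fifo_send):
    # A process that wants to broadcast simply forwards its message to the coordinator.
    fifo_send(coordinator_id, msg)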

THREE-PHASE DISTRIBUTED ALGORITHM:

The three phases of the algorithm are first described from the viewpoint of the sender, and then
from the viewpoint of the receiver.

Sender

Phase 1 : In the first phase, a process multicasts (line 1b) the message M with a locally
unique tag and the local timestamp to the group members.


Phase 2: In the second phase, the sender process awaits a reply from all the group
members who respond with a tentative proposal for a revised timestamp for that message
M. The await call in line 1d is non-blocking i.e., any other messages received in the
meanwhile are processed. Once all expected replies are received, the process computes
the maximum of the proposed timestamps for M, and uses the maximum as the final
timestamp.

Phase 3: In the third phase, the process multicasts the final timestamp to the group in
line (1f).

Receivers

Phase 1 In the first phase, the receiver receives the message with a tentative/proposed
timestamp. It updates the variable priority that tracks the highest proposed timestamp
(line 2a), then revises the proposed timestamp to the priority, and places the message with
its tag and the revised timestamp at the tail of the queue temp_Q (line 2b). In the queue,
the entry is marked as undeliverable.

Phase 2 In the second phase, the receiver sends the revised timestamp (and the tag) back
to the sender (line 2c). The receiver then waits in a non-blocking manner for the final
timestamp (correlated by the message tag).

Phase 3 In the third phase, the final timestamp is received from the multicaster (line 3).
The corresponding message entry in temp_Q is identified using the tag (line 3a), and is
marked as deliverable (line 3b) after the revised timestamp is overwritten by the final
timestamp (line 3c). The queue is then resorted using the timestamp field of the entries as
the key (line 3c). As the queue is already sorted except for the modified entry for the
message under consideration, that message entry has to be placed in its sorted position in
the queue. If the message entry is at the head of the temp_Q, that entry, and all
consecutive subsequent entries that are also marked as deliverable, are dequeued from
temp_Q, and enqueued in deliver_Q in that order (the loop in lines 3d–3g).


Total Message Order: 3-phase Algorithm Code
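The algorithm listing referenced by this heading is not reproduced in these notes; the following is a minimal Python sketch of the sender and receiver phases just described. Message kinds follow the naming used in the example of Figure 2.16 (REVISE_TS for the phase-1 multicast, PROPOSED_TS for the receiver's reply, FINAL_TS for the phase-3 announcement); the class, the send(dest, message) transport, and on_all_proposals (standing in for the non-blocking await of line 1d, invoked once all replies have been collected) are assumptions.

# Sketch of the three-phase distributed total-ordering algorithm.
class TotalOrderProcess:
    def __init__(self, pid, send):
        self.pid = pid
        self.send = send        # send(dest, message_dict): assumed transport
        self.clock = 0          # source of local timestamps
        self.counter = 0        # source of locally unique tags
        self.priority = 0       # highest proposed timestamp seen so far
        self.temp_Q = {}        # tag -> [timestamp, payload, deliverable?]
        self.deliver_Q = []

    # ---- sender side ----
    def multicast(self, payload, group):
        self.clock += 1
        self.counter += 1
        tag = (self.pid, self.counter)
        for p in group:                         # phase 1: multicast with tag + timestamp
            self.send(p, {"kind": "REVISE_TS", "tag": tag,
                          "ts": self.clock, "payload": payload})
        return tag

    def on_all_proposals(self, tag, proposals, group):
        final_ts = max(proposals)               # phase 2: take the maximum proposal
        for p in group:                         # phase 3: multicast the final timestamp
            self.send(p, {"kind": "FINAL_TS", "tag": tag, "ts": final_ts})

    # ---- receiver side ----
    def on_revise_ts(self, sender, tag, ts, payload):
        self.priority = max(self.priority + 1, ts)           # revised (proposed) timestamp
        self.temp_Q[tag] = [self.priority, payload, False]    # marked undeliverable
        self.send(sender, {"kind": "PROPOSED_TS", "tag": tag, "ts": self.priority})

    def on_final_ts(self, tag, final_ts):
        self.temp_Q[tag][0] = final_ts                        # overwrite with final timestamp
        self.temp_Q[tag][2] = True                            # mark deliverable
        for t, (ts, payload, ok) in sorted(self.temp_Q.items(),
                                           key=lambda kv: kv[1][0]):
            if not ok:                                        # stop at first undeliverable head
                break
            self.deliver_Q.append(payload)
            del self.temp_Q[t]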


Total Order: Distributed Algorithm: Example and Complexity

Figure 2.16: (a) A snapshot for PROPOSED TS and REVISE TS messages. The dashed lines
show the further execution after the snapshot. (b) The FINAL TS messages.

Example An example execution to illustrate the algorithm is given in Figure 2.16. Here, A and B
multicast to a set of destinations and C and D are the common destinations for both multicasts.

Figure 2.16 (a) The main sequence of steps is as follows:


1. A sends a REVISE_TS(7) message, having timestamp 7. B sends a REVISE_TS(9) message,
having timestamp 9.
2. C receives A’s REVISE_TS(7), enters the corresponding message in temp_Q, and marks it as
undeliverable; priority = 7. C then sends PROPOSED_TS(7) message to A.
3. D receives B’s REVISE_TS(9), enters the corresponding message in temp_Q, and marks it as
undeliverable; priority = 9. D then sends PROPOSED_TS(9) message to B.
4. C receives B’s REVISE_TS(9), enters the corresponding message in temp_Q, and marks it as
undeliverable; priority = 9. C then sends PROPOSED_TS(9) message to B.
5. D receives A’s REVISE_TS(7), enters the corresponding message in temp_Q, and marks it as
undeliverable; priority = 10. D assigns a tentative timestamp value of 10, which is greater than
all of the timestamps on REVISE_TSs seen so far, and then sends PROPOSED_TS(10) message
to A.
6. When A receives PROPOSED_TS(7) from C and PROPOSED_TS(10) from D, it computes
the final timestamp as max(7, 10) = 10, and sends FINAL_TS(10) to C and D.
7. When B receives PROPOSED_TS(9) from C and PROPOSED_TS(9) from D, it computes the
final timestamp as max(9, 9) = 9, and sends FINAL_TS(9) to C and D.
8. C receives FINAL_TS(10) from A, updates the corresponding entry in temp_Q with the
timestamp, resorts the queue, and marks the message as deliverable. As the message is not at the
head of the queue, and some entry ahead of it is still undeliverable, the message is not moved to
delivery_Q.


9. D receives FINAL_TS(9) from B, updates the corresponding entry in temp_Q by marking the
corresponding message as deliverable, and resorts the queue. As the message is at the head of the
queue, it is moved to delivery_Q.
10. When C receives FINAL_TS(9) from B, it will update the corresponding entry in temp_Q by
marking the corresponding message as deliverable. As the message is at the head of the queue, it
is moved to the delivery_Q, and the next message (of A), which is also deliverable, is also moved
to the delivery_Q.
11. When D receives FINAL_TS(10) from A, it will update the corresponding entry in temp_Q
by marking the corresponding message as deliverable. As the message is at the head of the
queue, it is moved to the delivery_Q.

Complexity:

• Three phases
• 3(n − 1) messages for n − 1 dests
• Delay: 3 message hops
• Also implements causal order

---------------------------------------------------------------------------------------------------------------------

GLOBAL STATE AND SNAPSHOT RECORDING ALGORITHMS

Introduction

• Recording the global state of a distributed system on-the-fly is an important


paradigm.
• The lack of globally shared memory, global clock and unpredictable message delays
in a distributed system make this problem non-trivial.
• This chapter first defines consistent global states and discusses issues to be
addressed to compute consistent distributed snapshots.

Then several algorithms to determine such snapshots on-the-fly are presented for
several types of networks.
System model

• The system consists of a collection of n processes p1, p2, ..., pn that are connected by
channels.
• There is no globally shared memory and no physical global clock; processes
communicate by passing messages through communication channels.
• Cij denotes the channel from process pi to process pj and its state is denoted by SCij .


• The actions performed by a process are modeled as three types of events: Internal
events, the message send event and the message receive event.
• For a message mij that is sent by process pi to process pj , let send (mij ) and rec (mij )
denote its send and receive events.

• At any instant, the state of process pi , denoted by LSi , is a result of the sequence of all the
events executed by pi till that instant.
• For an event e and a process state LSi , e ∈LSi iff e belongs to the sequence of events that
have taken process pi to state LSi .
• For an event e and a process state LSi, e ∉ LSi iff e does not belong to the sequence of events
that have taken process pi to state LSi.
• For a channel Cij, the following set of messages can be defined based on the local states of the
processes pi and pj:
• Transit: transit(LSi, LSj) = {mij | send(mij) ∈ LSi ∧ rec(mij) ∉ LSj}
• Thus, if a snapshot recording algorithm records the state of processes p i
and pj as LSi and LSj , respectively, then it must record the state of
channel Cij as transit(LSi,LSj).

Models of communication

Recall, there are three models of communication: FIFO, non-FIFO, and CO.
• In FIFO model, each channel acts as a first-in first-out message queue and thus, message
ordering is preserved by a channel.
• In non-FIFO model, a channel acts like a set in which the sender process adds messages and
the receiver process removes messages from it in a random order.
• A system that supports causal delivery of messages satisfies the following property: “For
any two messages mij and mkj , if send (mij ) −→ send (mkj ), then rec (mij ) −→ rec (mkj
)”.

Consistent global state

• The global state of a distributed system is a collection of the local states of the processes
and the channels.
• Notationally, global state GS is defined as,

GS = {∪i LSi , ∪i ,j SCij }


• A global state GS is a consistent global state iff it satisfies the following two conditions :
• C1: send(mij )∈LSi ⇒ mij ∈SCij ⊕ rec(mij )∈LSj . (⊕ is Ex-OR operator.)
• C2: send(mij) ∉ LSi ⇒ mij ∉ SCij ∧ rec(mij) ∉ LSj.

Condition C1 states the law of conservation of messages. Every message m ij that is recorded as sent in the
local state of a process pi must be captured in the state of the channel C ij or in the collected local state of
the receiver process pj.

Condition C2 states that in the collected global state, for every effect, its cause must be present. If a
message mij is not recorded as sent in the local state of process pi, then it must neither be present in the
state of the channel C ij nor in the collected local state of the receiver process pj.
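Conditions C1 and C2 can be checked mechanically on a recorded snapshot. A minimal Python sketch, assuming the local and channel states are represented as sets of message identifiers (these data-structure names are assumptions):

# Sketch: checking conditions C1 and C2 for a recorded global state.
# local_sent[(i, j)] : message ids recorded as sent on channel Cij in LSi
# local_recd[(i, j)] : message ids recorded as received on Cij in LSj
# channel[(i, j)]    : message ids recorded in the channel state SCij
# all_msgs[(i, j)]   : every message id ever sent on Cij (used to check C2)

def is_consistent(local_sent, local_recd, channel, all_msgs):
    for ch, msgs in all_msgs.items():
        for m in msgs:
            sent = m in local_sent.get(ch, set())
            recd = m in local_recd.get(ch, set())
            in_ch = m in channel.get(ch, set())
            if sent and not (in_ch ^ recd):      # C1: sent => in channel XOR received
                return False
            if not sent and (in_ch or recd):     # C2: not sent => neither in channel nor received
                return False
    return True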

Interpretation in terms of cuts

• A cut in a space-time diagram is a line joining an arbitrary point on each process


line that slices the space-time diagram into a PAST and a FUTURE.
• A consistent global state corresponds to a cut in which every message received in
the PAST of the cut was sent in the PAST of that cut.
• Such a cut is known as a consistent cut.
• For example, consider the space-time diagram for the computation illustrated in
Figure 2.17.
• Cut C1 is inconsistent because message m1 is flowing from the FUTURE to the
PAST.
• Cut C2 is consistent and message m4 must be captured in the state of channel C21.

Figure 2.17: An Interpretation in Terms of a Cut.

Issues in recording a global state


The following two issues need to be addressed:
I1: How to distinguish between the messages to be recorded in the snapshot from those not to be
recorded.
-Any message that is sent by a process before recording its snapshot, must be recorded in
the global snapshot (from C1).
-Any message that is sent by a process after recording its snapshot, must not be recorded
in the global snapshot (from C2).

I2: How to determine the instant when a process takes its snapshot.
-A process pj must record its snapshot before processing a message mij that was sent by
process pi after recording its snapshot.

Example: Let S1 and S2 be two distinct sites of a distributed system which maintain bank
accounts A and B, respectively. A site refers to a process in this example. Let the communication
channels from site S1 to site S2 and from site S2 to site S1 be denoted by C12 and C21,
respectively. Consider the following sequence of actions, which are also illustrated in the timing
diagram of Figure

• Time t0: Initially, Account A=$600, Account B=$200, C12=$0, C21 =$0.
• Time t1: Site S1 initiates a transfer of $50 from Account A to Account B. Account A is
decremented by $50 to $550 and a request for $50 credit to Account B is sent on Channel
C12 to site S2. Account A=$550,Account B=$200, C12 =$50, C21 =$0.


• Time t2: Site S2 initiates a transfer of $80 from Account B to Account A. Account B is
decremented by $80 to $120 and a request for $80 credit to Account A is sent on Channel
C21 to site S1. Account A=$550, Account B=$120, C12=$50, C21 =$80.
• Time t3: Site S1 receives the message for a $80 credit to Account A and updates Account
A. Account A=$630, Account B=$120, C12 =$50, C21 =$0.
• Time t4: Site S2 receives the message for a $50 credit to Account B and updates Account
B. Account A=$630, Account B=$170, C12 =$0, C21 =$0.

Suppose the local state of Account A is recorded at time t0 to show $600, and the local state of
Account B and channels C12 and C21 are recorded at time t2 to show $120, $50, and $80,
respectively. Then the recorded global state shows $850 in the system. An extra $50 appears in
the system. The reason for the inconsistency is that Account A’s state was recorded before the
$50 transfer to Account B using channel C12 was initiated, whereas channel C12’s state was
recorded after the $50 transfer was initiated.
---------------------------------------------------------------------------------------------------------------------

SNAPSHOT ALGORITHMS FOR FIFO CHANNELS

CHANDY-LAMPORT ALGORITHM

• The Chandy-Lamport algorithm uses a control message, called a marker whose role in a
FIFO system is to separate messages in the channels.
• After a site has recorded its snapshot, it sends a marker, along all of its outgoing
channels before sending out any more messages.
• A marker separates the messages in the channel into those to be included in the snapshot
from those not to be recorded in the snapshot. This addresses issue I1.
• The role of markers in a FIFO system is to act as delimiters for the messages in the
channels so that the channel state recorded by the process at the receiving end of the
channel satisfies the condition C2.
• Since all messages that follow a marker on channel Cij have been sent by process pi after
pi has taken its snapshot, process pj must record its snapshot no later than when it
receives a marker on channel Cij . This addresses issue I2.
• A process must record its snapshot no later than when it receives a marker on any of its
incoming channels.
• The algorithm can be initiated by any process by executing the “Marker Sending Rule”
by which it records its local state and sends a marker on each outgoing channel.
• A process executes the “Marker Receiving Rule” on receiving a marker. If the process
has not yet recorded its local state, it records the state of the channel on which the marker
is received as empty and executes the “Marker Sending Rule” to record its local state.
• The algorithm terminates after each process has received a marker on all of its incoming
channels.


• If a process has already recorded its local state when a marker is received on an incoming
channel, the state of that channel is recorded as the set of computation messages received on
the channel after the local state was recorded and before the marker was received on that
channel. (A minimal code sketch of the marker rules is given after this list.)
• The recorded local snapshots can be put together to create the global snapshot in several
ways. One policy is to have each process send its local snapshot to the initiator of the
algorithm. Another policy is to have each process send the information it records along
all outgoing channels, and to have each process receiving such information for the first
time propagate it along its outgoing channels. All the local snapshots get disseminated to
all other processes and all the processes can determine the global state.
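A minimal Python sketch of the marker sending and receiving rules described above, assuming FIFO channels and an assumed transport function send(dest, msg); the class and method names are illustrative.

# Sketch of the Chandy-Lamport marker rules for FIFO channels.
MARKER = "MARKER"

class SnapshotProcess:
    def __init__(self, pid, in_channels, out_channels, send):
        self.pid = pid
        self.in_channels = in_channels      # processes with channels into this one
        self.out_channels = out_channels    # processes this one has channels to
        self.send = send                    # send(dest, msg): FIFO channel send
        self.recorded_state = None          # local state LSi once recorded
        self.channel_state = {}             # c -> messages recorded for channel C(c, pid)
        self.recording = set()              # incoming channels still being recorded

    def _record_local_state(self, local_state):
        # Marker sending rule: record LSi, then send a marker on every outgoing
        # channel before sending any further computation message.
        self.recorded_state = local_state
        self.recording = set(self.in_channels)
        for c in self.out_channels:
            self.send(c, MARKER)

    def initiate(self, local_state):
        self._record_local_state(local_state)

    def on_message(self, sender, msg, local_state):
        if msg == MARKER:
            # Marker receiving rule.
            if self.recorded_state is None:
                self.channel_state[sender] = []          # channel recorded as empty
                self._record_local_state(local_state)
            self.recording.discard(sender)               # stop recording this channel
            return not self.recording                    # True => this process is done
        else:
            if self.recorded_state is not None and sender in self.recording:
                # computation message received after recording LSi but before the
                # marker on this channel: it belongs to the channel state
                self.channel_state.setdefault(sender, []).append(msg)
            return False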

Correctness and Complexity

Correctness
• A process stops recording the state of an incoming channel when a marker is
received on that channel. Due to FIFO property of channels, it follows that no
message sent after the marker on that channel is recorded in the channel state. Thus,
condition C2 is satisfied.
• When a process pj receives a message mij that precedes the marker on channel Cij, it
acts as follows: if process pj has not taken its snapshot yet, then it includes mij in
its recorded snapshot; otherwise, it records mij in the state of the channel Cij. Thus,
condition C1 is satisfied.
Complexity
The recording part of a single instance of the algorithm requires O(e) messages and O(d)
time, where e is the number of edges in the network and d is the diameter of the network.

Properties of the recorded global state:

Consider two possible executions of the snapshot algorithm for the money transfer example

Figure: Timing diagram of two possible executions of the banking example.

The recorded global state may not correspond to any of the global states that occurred during the
computation.

1. (Markers shown using dashed-and-dotted arrows.) Let site S1 initiate the algorithm just
after t1. Site S1 records its local state (account A=$550) and sends a marker to site S2.
The marker is received by site S2 after t4. When site S2 receives the marker, it records its
local state (account B=$170), the state of channel C12 as $0, and sends a marker along
channel C21. When site S1 receives this marker, it records the state of channel C21 as
$80. The $800 amount in the system is conserved in the recorded global state,
A = $550,B = $170,C12= $0,C21= $80.


2. (Markers shown using dotted arrows.) Let site S1 initiate the algorithm just after t0 and
before sending the $50 for S2. Site S1 records its local state (account A = $600) and
sends a marker to site S2. The marker is received by site S2 between t2 and t3. When site
S2 receives the marker, it records its local state (account B = $120), the state of channel
C12 as $0, and sends a marker along channel C21. When site S1 receives this marker, it
records the state of channel C21 as $80. The $800 amount in the system is conserved in
the recorded global state,
A = $600,B = $120,C12 = $0,C21 = $80.
