Unit II Message Ordering - Updated
Logical Time
• In a distributed system, processes often need to answer time-based queries such as:
o The time of day at which an event happened on a specific machine in the network.
o The time interval between two events that happened on different machines in the network.
o The relative ordering of events that happened on different machines in the network.
• Unless the clocks in each machine have a common notion of time, such time-based queries
cannot be answered.
• Clock synchronization has a significant effect on many problems like secure systems,
fault diagnosis and recovery, scheduled operations, database systems, and real-world
clock values.
• Clock synchronization is the process of ensuring that physically distributed processors
have a common notion of time.
• Due to different clock rates, the clocks at various sites may diverge with time, and
periodically a clock synchronization must be performed to correct this clock skew in
distributed systems.
• Clocks are synchronized to an accurate real-time standard like UTC (Universal
Coordinated Time).
• Clocks that must not only be synchronized with each other but also have to adhere to
physical time are termed physical clocks.
Clock Inaccuracies
• Physical clocks are synchronized to an accurate real-time standard like UTC(Universal
Coordinated Time).
• However, due to the clock inaccuracy discussed above, a timer (clock) is said to be
working within its specification if
1 − ρ ≤ dC/dt ≤ 1 + ρ
where the constant ρ is the maximum skew rate specified by the manufacturer.
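As a quick worked example (a sketch, assuming the drift model above, with illustrative values for ρ and the skew bound): two clocks that each stay within specification can drift apart at a rate of up to 2ρ, so to keep their mutual skew below a bound δ they must be resynchronized at least every δ/(2ρ) seconds.

    rho = 1e-6            # example maximum skew rate: 1 microsecond per second
    delta = 0.1           # required bound on mutual clock skew, in seconds

    # Two in-spec clocks can drift apart at a rate of up to 2*rho, so
    # resynchronize at least once every delta / (2*rho) seconds.
    resync_interval = delta / (2 * rho)
    print(resync_interval)        # 50000.0 s, i.e., roughly every 14 hours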
Figure 3.5 illustrates the behavior of fast, slow, and perfect clocks with respect to UTC.
Figure 3.5: The behavior of fast, slow, and perfect clocks with respect to UTC.
• Each NTP message includes the latest three timestamps T1, T2, and T3, while T4 is
determined upon arrival.
• Thus, both peers A and B can independently calculate delay and offset using a single
bidirectional message stream, as shown in Figure 3.7.
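The offset and delay calculation can be sketched as follows (illustrative Python; the function name and example values are assumptions, while the two formulas are the standard NTP expressions computed from T1 to T4).

    # T1: request sent by A, T2: request received by B,
    # T3: reply sent by B,   T4: reply received by A.
    def ntp_offset_delay(t1, t2, t3, t4):
        # Round-trip delay seen by A, excluding B's processing time.
        delay = (t4 - t1) - (t3 - t2)
        # Estimated offset of B's clock relative to A's clock.
        offset = ((t2 - t1) + (t3 - t4)) / 2
        return offset, delay

    # Example: B's clock is ahead of A's by roughly 5 time units.
    offset, delay = ntp_offset_delay(t1=100, t2=107, t3=108, t4=104)
    print(offset, delay)          # 5.5 3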
Definition
• A system of logical clocks consists of a time domain T and a logical clock C. Elements of
T form a partially ordered set over a relation <.
• Relation < is called the happened before or causal precedence. Intuitively, this relation
is analogous to the earlier than relation provided by the physical time.
• The logical clock C is a function that maps an event e in a distributed system to an
element in the time domain T, denoted as C(e) and called the timestamp of e, and is
defined as follows:
C : H → T
Implementation of logical clocks requires addressing two issues: data structures local to every
process to represent logical time, and a protocol to update the data structures to ensure the
consistency condition.
Each process pi maintains data structures that allow it the following two capabilities:
• A local logical clock, denoted by lci , that helps process pi measure its own
progress.
• A logical global clock, denoted by gci , that is a representation of process pi ’s local view
of the logical global time. Typically, lci is a part of gci .
The protocol ensures that a process’s logical clock, and thus its view of the global time, is
managed consistently. The protocol consists of the following two rules:
• R1: This rule governs how the local logical clock is updated by a process when it
executes an event.
• R2: This rule governs how a process updates its global logical clock to update its view of
the global time and global progress.
• Systems of logical clocks differ in their representation of logical time and also in the
protocol to update the logical clocks.
--------------------------------------------------------------------------------------------------------------------
Scalar Time
• In the system of scalar clocks, the time domain is the set of non-negative integers. The
logical local clock of a process pi and its local view of the global time are squashed into
one integer variable Ci.
• R1: Before executing an event (send, internal, or receive), process pi executes the following:
Ci := Ci + d (d > 0)
• In general, every time R1 is executed, d can have a different value; however, typically
d is kept at 1.
• R2: Each message piggybacks the clock value of its sender at sending time.
When a process pi receives a message with timestamp Cmsg, it executes the following
actions:
o Ci := max(Ci, Cmsg)
o Execute R1.
o Deliver the message.
Figure 3.1 shows evolution of scalar time.
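The two rules can be captured in a short sketch (illustrative Python with d = 1; the class and method names are assumptions, not part of the original description).

    class ScalarClock:
        def __init__(self):
            self.c = 0

        def tick(self):               # R1: executed before any event
            self.c += 1
            return self.c

        def send(self):               # timestamp piggybacked on the message
            return self.tick()

        def receive(self, c_msg):     # R2: take the max, then execute R1
            self.c = max(self.c, c_msg)
            return self.tick()

    # Example: p2 receives a message timestamped 5 while its clock reads 2.
    p2 = ScalarClock()
    p2.c = 2
    print(p2.receive(5))              # 6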
Basic Properties
Consistency Property
Scalar clocks satisfy the monotonicity and hence the consistency property: for two events ei and
ej, ei → ej ⇒ C(ei) < C(ej).
Total Ordering
• Process identifiers are linearly ordered and tie among events with identical scalar
timestamp is broken on the basis of their process identifiers.
• The lower the process identifier in the ranking, the higher the priority.
• The timestamp of an event is denoted by a tuple (t, i ) where t is its time of occurrence
and i is the identity of the process where it occurred.
• The total order relation ≺ on two events x and y with timestamps (h,i) and (k,j),
respectively, is defined as follows:
x ≺ y ⇔ (h < k or (h = k and i < j))
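A minimal sketch of this tie-breaking comparison (illustrative Python; tuple comparison directly encodes the relation above).

    def precedes(x, y):
        """x and y are (t, i) tuples: scalar timestamp and process id."""
        # Python tuple comparison is exactly: (h < k) or (h == k and i < j)
        return x < y

    print(precedes((3, 2), (3, 5)))   # True: equal timestamps, lower id wins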
Event counting
• If the increment value d is always 1, the scalar time has the following interesting
property: if event e has a timestamp h, then h-1 represents the minimum logical duration,
counted in units of events, required before producing the event e;
• We call it the height of the event e.
• In other words, h-1 events have been produced sequentially before the event e regardless
of the processes that produced these events.
• For example, in Figure 3.1, five events precede event b on the longest causal path ending
at b.
No Strong Consistency
• The system of scalar clocks is not strongly consistent; that is, for two events ei and ej,
C(ei) < C(ej) does not imply ei → ej.
• For example, in Figure 3.1, the third event of process P1 has a smaller scalar timestamp
than the third event of process P2. However, the former did not happen before the latter.
• The reason that scalar clocks are not strongly consistent is that the logical local clock and
logical global clock of a process are squashed into one, resulting in the loss of causal
dependency information among events at different processes.
• For example, in Figure 3.1, when process P2 receives the first message from process P1,
it updates its clock to 3, forgetting that the timestamp of the latest event at P1 on which it
depends is 2.
-------------------------------------------------------------------------------------------------------------------
Vector Time
• The system of vector clocks was developed independently by Fidge, Mattern and
Schmuck.
• In the system of vector clocks, the time domain is represented by a set of n-dimensional
non-negative integer vectors.
• Each process pi maintains a vector vti [1..n], where vti [i ] is the local logical clock of pi
and describes the logical time progress at process pi .
• vti [j ] represents process pi ’s latest knowledge of process pj ’s local time.
• If vti [j ]=x, then process pi knows that local time at process pj has progressed till x.
• The entire vector vti constitutes pi ’s view of the global logical time and is used to
timestamp events.
Process pi uses the following two rules R1 and R2 to update its clock:
• R1: Before executing an event, process pi updates its local logical time as follows:
vti [i ] := vti [i ] + d (d > 0)
• R2: Each message m is piggybacked with the vector clock vt of the sender process at
sending time. On the receipt of such a message (m, vt), process pi executes the following
sequence of actions:
o Update its global logical time as follows: for 1 ≤ k ≤ n, vti [k] := max(vti [k], vt[k])
o Execute R1.
o Deliver the message m.
• The timestamp of an event is the value of the vector clock of its process when the event is
executed.
• Figure 3.2 shows an example of vector clocks progress with the increment
value d=1.
Initially, a vector clock is [0, 0, 0, ...., 0].
The following relations are defined to compare two vector timestamps, vh and vk:
vh ≤ vk ⇔ ∀x : vh[x] ≤ vk[x]
vh < vk ⇔ vh ≤ vk and ∃x : vh[x] < vk[x]
vh || vk ⇔ ¬(vh < vk) and ¬(vk < vh)
• If the process at which an event occurred is known, the test to compare two timestamps
can be simplified as follows: if events x and y respectively occurred at processes pi and pj
and are assigned timestamps vh and vk, respectively, then
x → y ⇔ vh[i] ≤ vk[i]
x || y ⇔ vh[i] > vk[i] and vh[j] < vk[j]
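The update rules R1/R2 and the causality tests above can be sketched as follows (illustrative Python with d = 1; class and function names are assumptions).

    class VectorClock:
        def __init__(self, n, i):
            self.i = i                      # index of the local process
            self.vt = [0] * n               # vt[i] is the local logical clock

        def tick(self):                     # R1
            self.vt[self.i] += 1

        def send(self):                     # timestamp piggybacked on a message
            self.tick()
            return list(self.vt)

        def receive(self, vt_msg):          # R2: componentwise max, then R1
            self.vt = [max(a, b) for a, b in zip(self.vt, vt_msg)]
            self.tick()

    def happened_before(vh, vk):
        # vh < vk iff vh <= vk componentwise and vh != vk
        return all(a <= b for a, b in zip(vh, vk)) and vh != vk

    def concurrent(vh, vk):
        return not happened_before(vh, vk) and not happened_before(vk, vh)

    # Example: p1 (index 1 of 3) receives a message timestamped [2, 0, 0].
    p1 = VectorClock(3, 1)
    p1.receive([2, 0, 0])
    print(p1.vt)                            # [2, 1, 0]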
Isomorphism
If events in a distributed system are timestamped using a system of vector clocks, we have the
following property: if two events x and y have timestamps vh and vk, respectively, then
x → y ⇔ vh < vk
x || y ⇔ vh || vk
Thus, there is an isomorphism between the set of partially ordered events produced by a
distributed computation and their vector timestamps.
Strong Consistency
• The system of vector clocks is strongly consistent; thus, by examining the vector
timestamp of two events, we can determine if the events are causally related.
• However, Charron-Bost showed that the dimension of vector clocks cannot be less than
n, the total number of processes in the distributed computation, for this property to hold.
Event Counting
• If d = 1 (in rule R1), then the ith component of the vector clock at process pi,
vti [i], denotes the number of events that have occurred at pi until that instant.
• So, if an event e has timestamp vh, vh[j] denotes the number of events executed by
process pj that causally precede e. Clearly, Σj vh[j] − 1 represents the total number of
events that causally precede e in the distributed computation.
Applications: vector clocks are used in distributed debugging, in implementations of causal
ordering communication and causal distributed shared memory, in establishing global
breakpoints, and in determining the consistency of checkpoints in optimistic recovery.
--------------------------------------------------------------------------------------------------------------------
Outline
Notations
• Network (N, L); event set (E, ≺)
• Message mi: send and receive events si and ri
• Send and receive events: s and r
• M, send(M), and receive(M) denote a message and its corresponding send and receive events
• a ∼ b denotes that a and b occur at the same process
• Send-receive pairs T = {(s, r) ∈ Ei × Ej | s corresponds to r}
Several orderings on messages have been defined: (i) non-FIFO, (ii) FIFO, (iii) causal order, and
(iv) synchronous order. There is a natural hierarchy among these orderings.
Figure 2.1: (a) An A-execution that is non-FIFO (b) An A-execution that is FIFO
Asynchronous executions
An asynchronous execution (or A-execution) is an execution (E, ≺) for which the causality
relation is a partial order: there cannot exist any causality cycles on any logical link. Delivery is
not necessarily FIFO (e.g., the network-layer IPv4 connectionless service); messages may be
delivered in any order, not necessarily first-in first-out. Such executions are also known as
non-FIFO executions.
Note that each physical link delivers messages in FIFO order due to the physical properties of the
medium; it is a logical link, formed of one or more paths over physical links, that may be non-FIFO.
FIFO executions
An A-execution in which:
for all (s, r) and (s′, r′) ∈ T, (s ∼ s′ and r ∼ r′ and s ≺ s′) ⟹ r ≺ r′
On any logical link in a FIFO execution, messages are necessarily delivered in the order in which
they are sent. Although a logical link is inherently non-FIFO, a FIFO abstraction can be provided
over it; therefore, FIFO logical channels can be realistically assumed when designing distributed
algorithms.
To realize a FIFO channel over a non-FIFO link, the sender assigns and appends a
(sequence_num, connection_id) tuple to each message, and the receiver uses a buffer to order the
incoming messages; a sketch follows.
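A minimal sketch of such a FIFO channel built with sequence numbers (illustrative Python; the class names and packet layout are assumptions).

    class FifoSender:
        def __init__(self, conn_id):
            self.conn_id, self.seq = conn_id, 0

        def wrap(self, msg):
            self.seq += 1
            return (self.conn_id, self.seq, msg)

    class FifoReceiver:
        def __init__(self):
            self.expected, self.buffer = 1, {}

        def on_arrival(self, packet):
            conn_id, seq, msg = packet
            self.buffer[seq] = msg
            delivered = []
            while self.expected in self.buffer:   # deliver in send order
                delivered.append(self.buffer.pop(self.expected))
                self.expected += 1
            return delivered

    # Example: packets arrive in the order 2, 1 but are delivered as 1, 2.
    s, r = FifoSender("c1"), FifoReceiver()
    p1, p2 = s.wrap("m1"), s.wrap("m2")
    print(r.on_arrival(p2))   # []  (buffered, waiting for seq 1)
    print(r.on_arrival(p1))   # ['m1', 'm2']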
Causally ordered (CO) executions
A CO execution is an A-execution in which, for all (s, r) and (s′, r′) ∈ T, (r ∼ r′ and s ≺ s′) ⟹ r ≺ r′.
• If send events s and s′ are related by causality ordering (not physical time ordering), then
their corresponding receive events r and r′ occur in the same order at all common destinations.
• If s and s′ are not related by causality, then CO is vacuously satisfied.
• Figure 2.2(a) shows an execution that violates CO because s1 ≺ s3 and, at the common
destination P1, we have r3 ≺ r1.
• Figure 2.2(b) shows an execution that satisfies CO. Only s1 and s2 are related by
causality but the destinations of the corresponding messages are different.
• Figure 2.2(c) shows an execution that satisfies CO. No send events are related by
causality.
• Figure 2.2(d) shows an execution that satisfies CO. s2 and s1 are related by causality but
the destinations of the corresponding messages are different. Similarly for s2 and s3.
Causal order is useful for applications requiring updates to shared data, implementing distributed
shared memory, and fair resource allocation such as granting of requests for distributed mutual
exclusion.
Figure 2.2(a) shows an execution that violates CO. To enforce CO, message m3 should be kept
pending in the local buffer after it arrives at P1, until m1 arrives and m1 is delivered.
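One common way to realize this buffering is a vector-count delivery condition for broadcast messages, in the style of the Birman-Schiper-Stephenson protocol; the sketch below is illustrative and is not the optimal CO algorithm discussed later in this unit.

    class CausalReceiver:
        def __init__(self, n):
            self.delivered = [0] * n        # delivered[k] = msgs delivered from pk
            self.pending = []               # buffered (sender, vt, msg) triples

        def _deliverable(self, sender, vt):
            # vt is the sender's vector at send time: vt[sender] numbers its own
            # broadcasts; vt[k] counts messages from pk delivered before sending.
            return (vt[sender] == self.delivered[sender] + 1 and
                    all(vt[k] <= self.delivered[k]
                        for k in range(len(vt)) if k != sender))

        def on_arrival(self, sender, vt, msg):
            self.pending.append((sender, vt, msg))
            delivered, progress = [], True
            while progress:                 # deliver everything that became safe
                progress = False
                for entry in list(self.pending):
                    s, v, m = entry
                    if self._deliverable(s, v):
                        self.pending.remove(entry)
                        self.delivered[s] += 1
                        delivered.append(m)
                        progress = True
            return delivered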
CO alternate definition
If send(m1) ≺ send(m2), then for each common destination d of messages m1 and m2,
deliverd(m1) ≺ deliverd(m2) must be satisfied.
Figure 2.3: (a) Violates CO as s1 ≺ s3 and r3 ≺ r1. (b) Satisfies CO. (c) Satisfies CO, no send
events related by causality. (d) Satisfies CO.
Message order (MO): an A-execution in which, for all (s, r) and (s′, r′) ∈ T, s ≺ s′ ⟹ ¬(r′ ≺ r).
Consider any message pair, say m1 and m3 in Figure 2.3(a): s1 ≺ s3 but ¬(r3 ≺ r1) is false. Hence,
the execution does not satisfy MO.
Figure 2.4: (a) Violates CO as s1 ≺ s3 and r3 ≺ r1. (b) Satisfies CO. (c) Satisfies CO, no send
events related by causality. (d) Satisfies CO.
Empty-interval (EI) executions
• An execution (E, ≺) is an empty-interval (EI) execution if, for each pair (s, r) ∈ T, the
open interval set {x ∈ E | s ≺ x ≺ r} in the partial order is empty.
• For example, in Figure 2.4(b), consider M2: there is no event x such that s2 ≺ x ≺ r2.
This holds for all messages in the execution, so it is an EI execution.
• For an EI pair (s, r), there exists some linear extension < such that the corresponding
interval {x ∈ E | s < x < r} is also empty.
• An empty (s, r) interval in a linear extension implies that s and r may be arbitrarily close;
this is shown by a vertical arrow in a timing diagram.
• An execution E is CO iff for each message M, there exists some space-time diagram in which
that message can be drawn as a vertical arrow.
• A linear extension of a partial order (E, ≺) is any total order (E, <) such that each ordering
relation of the partial order is preserved.
• However, CO does not imply that all messages can be drawn as vertical arrows in the same
space-time diagram (that would require all (s, r) intervals to be empty in the same linear
extension, i.e., a synchronous execution).
Figure 2.5: (a) Execution in an async system (b) Equivalent sync execution.
• There is a handshake between the sender and the receiver.
• In a timing diagram, the “instantaneous” message communication can be shown by
bidirectional vertical message lines.
An execution (E, ≺) is synchronous iff there exists a mapping from E to T (scalar timestamps) such that:
• for any message M, T(s(M)) = T(r(M));
• for each process Pi, if ei ≺ e′i then T(ei) < T(e′i).
---------------------------------------------------------------------------------------------------------------------
When all the communication between pairs of processes is by using synchronous send and
receive primitives, the resulting order is synchronous order.
Will a program written for an asynchronous system (A-execution) run correctly if run with
synchronous primitives?
Examples The asynchronous execution of Figure 2.6, illustrated in Figure 2.7(a) using a timing
diagram, will deadlock if run with synchronous primitives. The executions in Figure 2.7(b)–(c)
will also deadlock when run on a synchronous system.
RSC Executions
A non-separated linear extension of (E, ≺) is a linear extension of (E, ≺) such that, for each pair
(s, r) ∈ T, the interval {x ∈ E | s ≺ x ≺ r} is empty. In such a linear extension, each send event is
immediately followed by its corresponding receive event. An A-execution that has such a linear
extension can be realized under synchronous communication and is called a realizable with
synchronous communication (RSC) execution.
Examples
• Figure 2.4(d):
Figure 2.5(b)
RSC execution
An A-execution (E, ≺) is an RSC execution iff there exists a non-separated linear extension of
the partial order (E, ≺).
Checking all linear extensions has exponential cost, so a practical test uses the crown
characterization instead.
Crown: Definition
Let E be an execution. A crown of size k in E is a sequence ⟨(si, ri), i ∈ {0, ..., k−1}⟩ of pairs of
corresponding send and receive events such that: s0 ≺ r1, s1 ≺ r2, ..., sk−2 ≺ rk−1, sk−1 ≺ r0.
Some observations
• In a crown, si and ri+1 may or may not be on the same process
• Non-CO execution must have a crown
• CO executions (that are not synchronous) have a crown (see Fig 2.4(b))
• Cyclic dependencies of crown ⇒ cannot schedule messages serially ⇒ not RSC
• Define the ↝ : T × T relation on messages in the execution (E, ≺) as follows. Let
↝([s, r], [s′, r′]) iff s ≺ r′. Observe that the condition s ≺ r′ (which has the form
used in the definition of a crown) is implied by each of the four conditions: (i) s ≺ s′,
(ii) s ≺ r′, (iii) r ≺ s′, and (iv) r ≺ r′.
• Now define a directed graph G↝ = (T, ↝), where the vertex set is the set of messages
T and the edge set is defined by ↝.
Observe that ↝ : T × T is a partial order iff G↝ has no cycle, i.e., there must not be
a cycle with respect to ↝ on the set of corresponding (s, r) events.
• Observe from the definition of a crown that G↝ has a directed cycle iff (E, ≺) has a
crown.
Crown criterion
An A-computation is RSC, i.e., it can be realized on a system with synchronous communication, iff it
contains no crown.
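The crown criterion suggests a direct test: build the graph on messages with an edge from (s, r) to (s′, r′) whenever s ≺ r′, and check it for a cycle. The sketch below is illustrative; happened_before is an assumed helper (e.g., computed from vector timestamps).

    def is_rsc(messages, happened_before):
        # messages: list of (send_event, receive_event) pairs
        edges = {m: [m2 for m2 in messages
                     if m2 is not m and happened_before(m[0], m2[1])]
                 for m in messages}

        WHITE, GREY, BLACK = 0, 1, 2         # standard DFS cycle detection
        color = {m: WHITE for m in messages}

        def has_cycle(m):
            color[m] = GREY
            for m2 in edges[m]:
                if color[m2] == GREY or (color[m2] == WHITE and has_cycle(m2)):
                    return True
            color[m] = BLACK
            return False

        # RSC iff the graph has no directed cycle (no crown).
        return not any(color[m] == WHITE and has_cycle(m) for m in messages)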
Execution (E, ≺) is RSC iff there exists a mapping from E to T (scalar timestamps) such that:
• for any message M, T(s(M)) = T(r(M));
• for each (a, b) in (E × E) \ T, a ≺ b ⟹ T(a) < T(b).
Figure 2.9: Hierarchy of message ordering paradigms. (a) Venn diagram (b) Example
executions.
• An A-execution is RSC iff A is an S-execution.
• RSC ⊂ CO ⊂ FIFO ⊂ A.
• There are more restrictions on the possible message orderings in the smaller classes. The
degree of concurrency is highest in A and lowest in SYNC.
• A program using synchronous communication is easiest to develop and verify. A program using
non-FIFO communication, resulting in an A-execution, is hardest to design and verify.
--------------------------------------------------------------------------------------------------------------------
Non-determinism:
The distributed programs considered so far have been deterministic, i.e., repeated runs of the same
program will produce the same partial order. In many cases, however, programs are non-deterministic.
A receive call can receive a message from any sender who has sent a message, if the expected
sender is not specified. The receive calls in most of the algorithms are non-deterministic in this
sense – the receiver is willing to perform a rendezvous with any willing and ready sender.
Multiple send and receive calls which are enabled at a process can be executed in an
interchangeable order.
If i sends to j, and j sends to i concurrently using blocking synchronous calls, there results a
deadlock.
If the receive call at one of the processes can be scheduled before the send call, then there is no
deadlock.
Rendezvous:
Support for binary rendezvous communication was first provided by programming languages
such as CSP and Ada. In these languages, the repetitive command (the ∗ operator) over the
alternative command on multiple guarded commands (each having the form Gi → CLi) is used.
A guard Gi is a boolean expression. If a guard Gi evaluates to true, then CLi is said to be enabled,
otherwise CLi is said to be disabled.
Some typical observations about synchronous communication under binary rendezvous are as
follows
These algorithms typically share the following features. At each process, there is a set of tokens
representing the current interactions that are enabled locally. If multiple interactions are enabled,
a process chooses one of them and tries to “synchronize” with the partner process. The problem
reduces to one of scheduling messages satisfying the following constraints:
• Schedule on-line, atomically, and in a distributed manner, i.e., the scheduling code at any
process does not know the application code of other processes.
• Schedule in a deadlock-free manner (i.e., crown-free), such that both the sender and
receiver are enabled for a message when it is scheduled.
• Schedule to satisfy the progress property (i.e., find a schedule within a bounded number
of steps) in addition to the safety (i.e., correctness) property.
Assumptions
1. Receive commands are forever enabled from all processes.
2. A send command, once enabled, remains enabled until it completes, i.e., it is not possible
that a send command gets disabled (by its guard getting falsified) before the send is
executed.
3. To prevent deadlock, process identifiers are used to introduce asymmetry to break
potential crowns that arise.
4. Each process attempts to schedule only one send event at any time.
Message types: M, ack(M), request(M), permission(M).
A process blocks when it knows that it can successfully synchronize the current message with the
partner process.
Fig 2.11: Rules to prevent message cycles. (a) High priority process blocks. (b) Low priority process
does not block.
• To send to a lower priority process, messages M and ack(M) are involved in that order.
The sender issues send(M) and blocks until ack(M) arrives. Thus, when sending to a
lower priority process, the sender blocks waiting for the partner process to synchronize
and send an acknowledgement.
• To send to a higher priority process, messages request(M), permission(M), and M are
involved, in that order. The sender issues send(request(M)), does not block, and awaits
permission. When permission(M) arrives, the sender issues send(M).
Thus, when sending to a higher priority process, the sender asks the higher priority process via
the request(M) to give permission to send. When the higher priority process gives permission to
send, the higher priority process, which is the intended receiver, blocks.
In either case, a higher priority process blocks on a lower priority process. So cyclic waits are
avoided.
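A condensed sketch of the two sending rules above (illustrative Python; send, wait_for, and higher_priority are assumed helpers, and the wait for permission(M), which is non-blocking in the actual scheme so that other arriving messages can be processed meanwhile, is shown as a plain wait here for brevity).

    def rendezvous_send(me, partner, M, send, wait_for, higher_priority):
        if higher_priority(me, partner):
            # Sending to a lower priority process: send M, block until ack(M).
            send(partner, ("M", M))
            wait_for(partner, "ack")
        else:
            # Sending to a higher priority process: request permission first and
            # send M only after permission(M) arrives; it is the higher priority
            # receiver that blocks in this case.
            send(partner, ("request", M))
            wait_for(partner, "permission")
            send(partner, ("M", M))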
In more detail, a cyclic wait is prevented because before sending a message M to a higher
priority process, a lower priority process requests the higher priority process for permission to
synchronize on M, in a non-blocking manner. While waiting for this permission, there are two
possibilities:
GROUP COMMUNICATION
Processes across a distributed system cooperate to solve a joint task. Often, they need to
communicate with each other as a group, and therefore there needs to be support for group
communication. A message broadcast is the sending of a message to all members in the
distributed system.
Multicasting is the sending of a message to a certain subset of the processes in the system,
identified as a group. At the other extreme is unicasting, which is the familiar point-to-point
message communication.
If a multicast algorithm requires the sender to be a part of the destination group, the multicast
algorithm is said to be a closed group algorithm. If the sender of the multicast can be outside the
destination group, the multicast algorithm is said to be an open group algorithm.
Closed group algorithms cannot be used in several scenarios, such as in a large system (e.g., on-
line reservation or Internet banking systems) in which client processes are short-lived and too
numerous to be members of the destination group.
Causal order has many applications such as updating replicated data, allocating requests in a fair
manner, and synchronizing multimedia streams.
Figure 2.13: (a) Updates to 3 replicas. (b) Causal order (CO) and total order violated. (c) Causal
order violated.
Consider Figure 2.13(a), which shows two processes P1 and P2 that issue updates to the three
replicas R1(d), R2(d), and R3(d) of data item d.
Message m creates a causality between send(m1) and send(m2). If P2 issues its update causally
after P1 issued its update, then P2’s update should be seen by the replicas after they see P1’s
update, in order to preserve the semantics of the application.
Figure 2.13(b) shows that R1 sees P2’s update first, while R2 and R3 see P1’s update first. Here,
CO is violated. Figure 2.13(c) shows that all replicas see P2’s update first. However, CO is still
violated. If message m did not exist as shown, then the executions shown in Figure 2.13(b) and
(c) would satisfy CO.
Safety
In order to prevent causal order from being violated, a message M that arrives at a process may
need to be buffered until all system wide messages sent in the causal past of the send(M) event to
that same destination have already arrived.
Liveness
Each message M should carry a log of all other messages, or their identifiers, sent
causally before M’s send event, and sent to the same destination destM. This log can then be
examined to ensure whether it is safe to deliver a message. All algorithms aim to reduce this log
overhead, and the space and time overhead of maintaining the log information at the processes.
---------------------------------------------------------------------------------------------------------------------
Delivery Condition: a message M∗ that carries the information “d ∈ M.Dests”, where message M
was sent to d in the causal past of Send(M∗), is not delivered to d if M has not yet been delivered to d.
• For how long should the information “d ∈ Mi,a.Dests” be stored in the log at a process,
and piggybacked on messages?
• An optimal CO algorithm stores in local message logs and propagates on messages,
information of the form “d is a destination of M” about a message M sent in the causal
past, as long as and only as long as:
o (Propagation Constraint I:) it is not known that the message Mi,a is delivered
to d, and
o (Propagation Constraint II:) it is not known that a message has been sent to d
in the causal future of Send(Mi,a), and hence it is not guaranteed using a
reasoning based on transitivity that the message Mi,a will be delivered to d in CO.
⇒ if either (I) or (II) is false, “d ∈ M.Dests” must not be stored or propagated, even to remember
that (I) or (II) has been falsified.
“d ∈ Mi,a.Dests” must be available in the causal future of event ei,a, but not in the causal future
of Deliverd(Mi,a), and not in the causal future of ek,c, where d ∈ Mk,c.Dests and there is no
other message sent causally between Mi,a and Mk,c to the same destination d.
In the causal future of Deliverd(Mi,a) and Send(Mk,c), the information is redundant; elsewhere,
it is necessary.
Information about what messages have been delivered (or are guaranteed to be delivered without
violating CO) is necessary for the Delivery Condition.
• For optimality, this cannot be stored. Algorithm infers this using set-operation logic.
“d ∈ M.Dests”
• must not exist at e7, e8 because (I) and (II) are false
• Info about messages (i) not known to be delivered and (ii) not guaranteed to be delivered
in CO, is explicitly tracked using (source, ts, dest).
Explicit tracking : Tracking of (source, timestamp, destination) information for messages (i) not
known to be delivered and (ii) not guaranteed to be delivered in CO, is done explicitly using the
l.Dests field of entries in local logs at nodes and o.Dests field of entries in messages.
Implicit tracking Tracking of messages that are either (i) already delivered, or (ii) guaranteed to
be delivered in CO, is performed implicitly. The information about messages (i) already
delivered or (ii) guaranteed to be delivered in CO is deleted and not propagated because it is
redundant as far as enforcing CO is concerned. However, it is useful in determining what
information that is being carried in other messages and is being stored in logs at other nodes has
become redundant and thus can be purged.
---------------------------------------------------------------------------------------------------------------------
Total order
For each pair of processes Pi and Pj and for each pair of messages Mx and My that are delivered
to both the processes, Mx is delivered to Pi before My if and only if Mx is delivered to Pj before My.
The execution in Figure 2.15(b) does not satisfy total order. Even if the message m did not exist,
total order would not be satisfied. The execution in Figure 2.15(c) satisfies total order.
Fig 2.15: (a) Updates to 3 replicas. (b) Total order violated. (c) Total order not violated
Message complexity: n
The three phases of the algorithm are first described from the viewpoint of the sender, and then
from the viewpoint of the receiver.
Sender
Phase 1 : In the first phase, a process multicasts (line 1b) the message M with a locally
unique tag and the local timestamp to the group members.
Phase 2: In the second phase, the sender process awaits a reply from all the group
members who respond with a tentative proposal for a revised timestamp for that message
M. The await call in line 1d is non-blocking i.e., any other messages received in the
meanwhile are processed. Once all expected replies are received, the process computes
the maximum of the proposed timestamps for M, and uses the maximum as the final
timestamp.
Phase 3: In the third phase, the process multicasts the final timestamp to the group in
line (1f).
Receivers
Phase 1 In the first phase, the receiver receives the message with a tentative/proposed
timestamp. It updates the variable priority that tracks the highest proposed timestamp
(line 2a), then revises the proposed timestamp to the priority, and places the message with
its tag and the revised timestamp at the tail of the queue temp_Q (line 2b). In the queue,
the entry is marked as undeliverable.
Phase 2 In the second phase, the receiver sends the revised timestamp (and the tag) back
to the sender (line 2c). The receiver then waits in a non-blocking manner for the final
timestamp (correlated by the message tag).
Phase 3 In the third phase, the final timestamp is received from the multicaster (line 3).
The corresponding message entry in temp_Q is identified using the tag (line 3a), and is
marked as deliverable (line 3b) after the revised timestamp is overwritten by the final
timestamp (line 3c). The queue is then resorted using the timestamp field of the entries as
the key (line 3c). As the queue is already sorted except for the modified entry for the
message under consideration, that message entry has to be placed in its sorted position in
the queue. If the message entry is at the head of the temp_Q, that entry, and all
consecutive subsequent entries that are also marked as deliverable, are dequeued from
temp_Q, and enqueued in deliver_Q in that order (the loop in lines 3d–3g).
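A condensed sketch of the receiver side of this three-phase algorithm (illustrative Python; the transport, the message tags, and tie-breaking between equal final timestamps are simplified away, while the names temp_Q, deliver_Q, and priority follow the description above).

    class TotalOrderReceiver:
        def __init__(self):
            self.priority = 0               # highest timestamp proposed so far
            self.temp_Q = []                # entries: [tag, timestamp, deliverable]
            self.deliver_Q = []

        def on_proposed_msg(self, tag, sender_ts):
            # Phases 1-2 (receiver): propose a revised timestamp, buffer the
            # message as undeliverable, and return the proposal to the sender.
            self.priority = max(self.priority, sender_ts) + 1
            self.temp_Q.append([tag, self.priority, False])
            return self.priority            # sent back as REVISE_TS(tag, priority)

        def on_final_ts(self, tag, final_ts):
            # Phase 3 (receiver): overwrite with the final timestamp, mark the
            # entry deliverable, resort, then dequeue deliverable head entries.
            for entry in self.temp_Q:
                if entry[0] == tag:
                    entry[1], entry[2] = final_ts, True
            self.temp_Q.sort(key=lambda e: e[1])
            while self.temp_Q and self.temp_Q[0][2]:
                self.deliver_Q.append(self.temp_Q.pop(0)[0])

    # Sender side (sketch): multicast (M, tag, local_ts), collect one proposed
    # timestamp from every group member, then multicast max(proposals) as the
    # final timestamp for that tag.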
Figure 2.16: (a) A snapshot for PROPOSED TS and REVISE TS messages. The dashed lines
show the further execution after the snapshot. (b) The FINAL TS messages.
Example An example execution to illustrate the algorithm is given in Figure 2.16. Here, A and B
multicast to a set of destinations and C and D are the common destinations for both multicasts.
9. D receives FINAL_TS(9) from B, updates the corresponding entry in temp_Q by marking the
corresponding message as deliverable, and resorts the queue. As the message is at the head of the
queue, it is moved to delivery_Q.
10. When C receives FINAL_TS(9) from B, it will update the corresponding entry in temp_Q by
marking the corresponding message as deliverable. As the message is at the head of the queue, it
is moved to the
delivery_Q, and the next message (of A), which is also deliverable, is also moved to the
delivery_Q.
11. When D receives FINAL_TS(10) from A, it will update the corresponding entry in temp_Q
by marking the corresponding message as deliverable. As the message is at the head of the
queue, it is moved to
the delivery_Q.
Complexity:
• Three phases
• 3(n − 1) messages for n − 1 dests
• Delay: 3 message hops
• Also implements causal order
---------------------------------------------------------------------------------------------------------------------
Introduction
• The system consists of a collection of n processes p1, p2, ..., pn that are connected by
channels.
• There is no globally shared memory and no physical global clock; processes
communicate by passing messages through communication channels.
• Cij denotes the channel from process pi to process pj and its state is denoted by SCij .
• The actions performed by a process are modeled as three types of events: Internal
events, the message send event and the message receive event.
• For a message mij that is sent by process pi to process pj , let send (mij ) and rec (mij )
denote its send and receive events.
• At any instant, the state of process pi , denoted by LSi , is a result of the sequence of all the
events executed by pi till that instant.
• For an event e and a process state LSi , e ∈ LSi iff e belongs to the sequence of events that
have taken process pi to state LSi .
• For an event e and a process state LSi , e ∉ LSi iff e does not belong to the sequence of events
that have taken process pi to state LSi .
• For a channel Cij , the following set of messages can be defined based on the local states of the
processes pi and pj
• Transit: transit(LSi , LSj ) = {mij | send(mij ) ∈ LSi ∧ rec(mij ) ∉ LSj }
• Thus, if a snapshot recording algorithm records the state of processes p i
and pj as LSi and LSj , respectively, then it must record the state of
channel Cij as transit(LSi,LSj).
Models of communication
Recall, there are three models of communication: FIFO, non-FIFO, and CO.
• In the FIFO model, each channel acts as a first-in first-out message queue and thus message
ordering is preserved by a channel.
• In the non-FIFO model, a channel acts like a set in which the sender process adds messages and
the receiver process removes messages from it in a random order.
• A system that supports causal delivery of messages satisfies the following property: “For
any two messages mij and mkj , if send (mij ) −→ send (mkj ), then rec (mij ) −→ rec (mkj
)”.
• The global state of a distributed system is a collection of the local states of the processes
and the channels.
• Notationally, global state GS is defined as GS = {∪i LSi , ∪i,j SCij }.
• A global state GS is a consistent global state iff it satisfies the following two conditions :
• C1: send(mij ) ∈ LSi ⇒ mij ∈ SCij ⊕ rec(mij ) ∈ LSj . (⊕ is the Ex-OR operator.)
• C2: send(mij ) ∉ LSi ⇒ mij ∉ SCij ∧ rec(mij ) ∉ LSj .
Condition C1 states the law of conservation of messages. Every message mij that is recorded as sent in the
local state of a process pi must be captured in the state of the channel Cij or in the collected local state of
the receiver process pj.
Condition C2 states that in the collected global state, for every effect, its cause must be present. If a
message mij is not recorded as sent in the local state of process pi, then it must neither be present in the
state of the channel Cij nor in the collected local state of the receiver process pj.
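Conditions C1 and C2 can be checked mechanically on a recorded global state; the sketch below is illustrative, with sets of message identifiers standing in for the recorded LSi, SCij, and LSj.

    def is_consistent(sent_i, channel_ij, received_j, all_msgs_ij):
        for m in all_msgs_ij:
            if m in sent_i:
                # C1: a recorded send must appear in the channel XOR at the receiver.
                if (m in channel_ij) == (m in received_j):
                    return False
            else:
                # C2: no effect without a cause - a message not recorded as sent
                # may appear neither in the channel nor in the receiver's state.
                if m in channel_ij or m in received_j:
                    return False
        return True

    # Bank example below: A's state recorded at t0 (the $50 send not yet recorded)
    # while C12 is recorded as carrying the $50 message violates C2.
    print(is_consistent(sent_i=set(), channel_ij={"$50"}, received_j=set(),
                        all_msgs_ij={"$50"}))    # False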
Two issues arise in recording a consistent global state:
I1: How to distinguish between the messages to be recorded in the snapshot and those not to be
recorded.
-Any message that is sent by a process before recording its snapshot must be recorded in
the global snapshot (from C1).
-Any message that is sent by a process after recording its snapshot must not be recorded
in the global snapshot (from C2).
I2: How to determine the instant when a process takes its snapshot.
-A process pj must record its snapshot before processing a message mij that was sent by
process pi after recording its snapshot.
Example: Let S1 and S2 be two distinct sites of a distributed system which maintain bank
accounts A and B, respectively. A site refers to a process in this example. Let the communication
channels from site S1 to site S2 and from site S2 to site S1 be denoted by C12 and C21,
respectively. Consider the following sequence of actions, which are also illustrated in the timing
diagram of Figure
• Time t0: Initially, Account A=$600, Account B=$200, C12=$0, C21 =$0.
• Time t1: Site S1 initiates a transfer of $50 from Account A to Account B. Account A is
decremented by $50 to $550 and a request for $50 credit to Account B is sent on Channel
C12 to site S2. Account A=$550,Account B=$200, C12 =$50, C21 =$0.
• Time t2: Site S2 initiates a transfer of $80 from Account B to Account A. Account B is
decremented by $80 to $120 and a request for $80 credit to Account A is sent on Channel
C21 to site S1. Account A=$550, Account B=$120, C12=$50, C21 =$80.
• Time t3: Site S1 receives the message for a $80 credit to Account A and updates Account
A. Account A=$630, Account B=$120, C12 =$50, C21 =$0.
• Time t4: Site S2 receives the message for a $50 credit to Account B and updates Account
B. Account A=$630, Account B=$170, C12 =$0, C21 =$0.
Suppose the local state of Account A is recorded at time t0 to show $600, and the local states of
Account B and channels C12 and C21 are recorded at time t2 to show $120, $50, and $80,
respectively. Then the recorded global state shows $850 in the system. An extra $50 appears in
the system. The reason for the inconsistency is that Account A’s state was recorded before the
$50 transfer to Account B using channel C12 was initiated, whereas channel C12’s state was
recorded after the $50 transfer was initiated.
---------------------------------------------------------------------------------------------------------------------
CHANDY-LAMPORT ALGORITHM
• The Chandy-Lamport algorithm uses a control message, called a marker whose role in a
FIFO system is to separate messages in the channels.
• After a site has recorded its snapshot, it sends a marker, along all of its outgoing
channels before sending out any more messages.
• A marker separates the messages in the channel into those to be included in the snapshot
from those not to be recorded in the snapshot. This addresses issue I1.
• The role of markers in a FIFO system is to act as delimiters for the messages in the
channels so that the channel state recorded by the process at the receiving end of the
channel satisfies the condition C2.
• Since all messages that follow a marker on channel Cij have been sent by process pi after
pi has taken its snapshot, process pj must record its snapshot no later than when it
receives a marker on channel Cij . This addresses issue I2.
• A process must record its snapshot no later than when it receives a marker on any of its
incoming channels.
• The algorithm can be initiated by any process by executing the “Marker Sending Rule”
by which it records its local state and sends a marker on each outgoing channel.
• A process executes the “Marker Receiving Rule” on receiving a marker. If the process
has not yet recorded its local state, it records the state of the channel on which the marker
is received as empty and executes the “Marker Sending Rule” to record its local state.
• The algorithm terminates after each process has received a marker on all of its incoming
channels.
• All the local snapshots get disseminated to all other processes and all the processes can
determine the global state.
• If the process has not yet recorded its local state, it records the state of the channel on
which the marker is received as empty and executes the marker sending rule to record its
local state. Otherwise, the state of the incoming channel on which the marker is received
is recorded as the set of computation messages received on that channel after recording
the local state but before receiving the marker on that channel. The algorithm can be
initiated by any process by executing the marker sending rule. The algorithm terminates
after each process has received a marker on all of its incoming channels.
• The recorded local snapshots can be put together to create the global snapshot in several
ways. One policy is to have each process send its local snapshot to the initiator of the
algorithm. Another policy is to have each process send the information it records along
all outgoing channels, and to have each process receiving such information for the first
time propagate it along its outgoing channels. All the local snapshots get disseminated to
all other processes and all the processes can determine the global state.
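A condensed sketch of the marker sending and receiving rules at a single process (illustrative Python; send_on and record_local_state are assumed helpers supplied by the environment, and termination detection is omitted).

    class SnapshotProcess:
        def __init__(self, in_channels, out_channels, send_on, record_local_state):
            self.in_channels, self.out_channels = in_channels, out_channels
            self.send_on, self.record_local_state = send_on, record_local_state
            self.recorded = False
            self.channel_state = {}          # channel -> recorded messages
            self.marker_seen = set()

        def marker_sending_rule(self):
            # Record the local state, then send a marker on every outgoing
            # channel before any further application messages.
            self.local_state = self.record_local_state()
            self.recorded = True
            for c in self.out_channels:
                self.send_on(c, "MARKER")

        def on_marker(self, channel):
            self.marker_seen.add(channel)
            if not self.recorded:
                # First marker: record the local state; the channel the marker
                # arrived on is recorded as empty.
                self.channel_state[channel] = []
                self.marker_sending_rule()
            # Otherwise the channel state is whatever on_message accumulated
            # since the local snapshot; recording on this channel now stops.

        def on_message(self, channel, msg):
            if self.recorded and channel not in self.marker_seen:
                self.channel_state.setdefault(channel, []).append(msg)
            # ... normal application processing of msg ...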
Correctness
• A process stops recording the state of an incoming channel when a marker is
received on that channel. Due to FIFO property of channels, it follows that no
message sent after the marker on that channel is recorded in the channel state. Thus,
condition C2 is satisfied.
• When a process pj receives message mij that precedes the marker on channel Cij , it
acts as follows: If process pj has not taken its snapshot yet, then it includes mij in
its recorded snapshot. Otherwise, it records mij in the state of the channel Cij . Thus,
condition C1 is satisfied.
Complexity
The recording part of a single instance of the algorithm requires O(e) messages and O(d) time,
where e is the number of edges in the network and d is the diameter of the network.
Consider two possible executions of the snapshot algorithm for the money transfer example.
The recorded global state may not correspond to any of the global states that occurred during the
computation.
1. (Markers shown using dashed-and-dotted arrows.) Let site S1 initiate the algorithm just
after t1. Site S1 records its local state (account A=$550) and sends a marker to site S2.
The marker is received by site S2 after t4. When site S2 receives the marker, it records its
local state (account B=$170), the state of channel C12 as $0, and sends a marker along
channel C21. When site S1 receives this marker, it records the state of channel C21 as
$80. The $800 amount in the system is conserved in the recorded global state,
A = $550,B = $170,C12= $0,C21= $80.
2. (Markers shown using dotted arrows.) Let site S1 initiate the algorithm just after t0 and
before sending the $50 for S2. Site S1 records its local state (account A = $600) and
sends a marker to site S2. The marker is received by site S2 between t2 and t3. When site
S2 receives the marker, it records its local state (account B = $120), the state of channel
C12 as $0, and sends a marker along channel C21. When site S1 receives this marker, it
records the state of channel C21 as $80. The $800 amount in the system is conserved in
the recorded global state,
A = $600,B = $120,C12 = $0,C21 = $80.