Distributed Systems
Alternative approach
Two views on realizing distributed systems
• Integrative view: connecting existing networked computer systems into a
larger system.
• Expansive view: an existing networked computer system is extended
with additional computers.
Two definitions
• A decentralized system is a networked computer system in which
processes and resources are necessarily spread across multiple
computers.
• A distributed system is a networked computer system in which processes
and resources are sufficiently spread across multiple computers.
Distribution transparency
What is transparency?
The phenomenon by which a distributed system attempts to hide the fact that
its processes and resources are physically distributed across multiple
computers, possibly separated by large distances.
Observation
Distribution transparency is handled through many different techniques in a
layer between applications and operating systems: a middleware layer.
Types of distribution transparency
Transparency Description
Access Hide differences in data representation and how an object is accessed
Location Hide where an object is located
Relocation Hide that an object may be moved to another location while in use
Migration Hide that an object may move to another location
Replication Hide that an object is replicated
Concurrency Hide that an object may be shared by several independent users
Failure Hide the failure and recovery of an object
Degree of transparency
Aiming at full distribution transparency may be too much
• There are communication latencies that cannot be hidden
• Completely hiding failures of networks and nodes is (theoretically and
practically) impossible
• You cannot distinguish a slow computer from a failing one
• You can never be sure that a server actually performed an operation
before a crash
• Full transparency will cost performance, exposing distribution of the
system
• Keeping replicas exactly up-to-date with the master takes time
• Immediately flushing write operations to disk for fault tolerance
Degree of transparency
Exposing distribution may be good, i.e., making clear to users that the system is distributed:
• Making use of location-based services (finding your nearby friends)
• When dealing with users in different time zones
• When it makes it easier for a user to understand what’s going on (when e.g.,
a server does not respond for a long time, report it as failing).
Conclusion
Distribution transparency is a nice goal, but achieving it is a different story, and it
should often not even be aimed at.
Openness
Openness of distributed systems: a system should offer components that can
easily be used by, or integrated into, other systems. An important aspect is
separating policy from mechanism. For example:
In the case of Web caching, for example, a browser should
ideally provide facilities for only storing documents (i.e.,
a mechanism) and at the same time allow users to decide
which documents are stored and for how long (i.e., a
policy)
Dependability
Basics:
1. Dependability refers to the degree to which a computer
system can be relied upon to operate as expected.
2. Dependability in distributed systems may be harder to
achieve because of partial failures.
Requirements related to dependability
Requirement Description
Availability Readiness for usage
Reliability Continuity of service delivery
Safety Very low probability of catastrophes
Maintainability How easily a failed system can be repaired
Terminology
Failure, error, fault
Term Description
Failure A component is not living up to its specifications
Error Part of a component's state that may lead to a failure
Fault The cause of an error
Handling faults
Strategy Description
Fault prevention Prevent the occurrence of a fault
Fault tolerance Build a component such that it can mask the presence of faults
Fault removal Reduce the presence, number, and seriousness of faults
Fault forecasting Estimate current presence, future incidence, and consequences of faults
On security
Observation
A distributed system that is not secure is not dependable.
What we need
• Confidentiality: information is disclosed only to authorized parties
• Integrity: Ensure that alterations to assets of a system can be made only
in an authorized way
Security mechanisms
Keeping it simple
It’s all about encrypting and decrypting data using security keys.
Notation
K(data) denotes that we use key K to encrypt/decrypt data.
Symmetric cryptosystem
With encryption key EK and decryption key DK: if data = DK(EK(data)),
then DK = EK. Note: the encryption and decryption keys are the same and
should be kept secret.
Asymmetric cryptosystem
Distinguish a public key PK and a private (secret) key SK.
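To make the notation concrete, here is a toy sketch in Python: a repeated-XOR "cipher" that exhibits the symmetric property data = DK(EK(data)) with DK = EK. It is for illustration only and offers no real security; production systems use algorithms such as AES.

```python
def xor_crypt(key: bytes, data: bytes) -> bytes:
    # XOR each byte with the (repeated) key; applying the same function
    # twice restores the original data, so E_K and D_K coincide.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

key = b"shared-secret-key"
data = b"hello, distributed world"
ciphertext = xor_crypt(key, data)          # E_K(data)
assert xor_crypt(key, ciphertext) == data  # data = D_K(E_K(data))
```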
Secure hashing
In practice, we use secure hash functions: H(data) returns a fixed-length
string.
• Any change from data to data∗ will lead to a completely different string
H(data∗).
• Given a hash value h, it is computationally infeasible to find data such that
h = H(data).
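A quick illustration with Python's standard hashlib: a one-character change in the input yields a completely different fixed-length digest.

```python
import hashlib

# H(data) and H(data*): a minimal change produces an unrelated digest.
print(hashlib.sha256(b"data").hexdigest())
print(hashlib.sha256(b"data*").hexdigest())
```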
Lecture 2
Scalability
Observation
A solution: multiple powerful servers operating independently in parallel. Today,
the challenge still lies in geographical and administrative scalability.
Size scalability
Scalability problems with centralized solutions:
• The computational capacity, limited by the CPUs
• The storage capacity, including the transfer rate between CPUs and disks
• The network between the user and the centralized service
Formal analysis
A centralized service can be modeled as a simple queuing system
Formal analysis
Let requests arrive at rate λ and let the service have capacity μ (service
time S = 1/μ). The utilization U of the service is the fraction of time that it
is busy: U = λ/μ.
Average throughput: X = U · μ = λ.
Response time R: the total time taken to process a request after its
submission. For this queuing system, R = S/(1 − U), so the
response-to-service-time ratio is R/S = 1/(1 − U).
Observations
• If U is small, the response-to-service-time ratio is close to 1: a request is
immediately processed
• If U goes up to 1, the system comes to a grinding halt.
Solution: decrease S.
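A minimal sketch of this relationship, assuming the simple queuing model above (U = λ/μ, R/S = 1/(1 − U)):

```python
def response_ratio(arrival_rate: float, capacity: float) -> float:
    # U = lambda/mu; response-to-service-time ratio R/S = 1/(1 - U).
    u = arrival_rate / capacity
    if u >= 1.0:
        raise ValueError("saturated service (U >= 1)")
    return 1.0 / (1.0 - u)

for lam in (10, 50, 90, 99):
    print(f"U = {lam / 100:.2f} -> R/S = {response_ratio(lam, 100):.1f}")
# As U approaches 1, R/S explodes: the system grinds to a halt.
```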
Examples
• Computational grids: share expensive resources between different
domains.
• Shared equipment: how to control, manage, and use a shared radio
telescope constructed as large-scale shared sensor network?
Observation
If we can tolerate inconsistencies, we may reduce the need for global
synchronization, but tolerating inconsistencies is application dependent.
A simple classification of distributed systems
Parallel computing
Observation
High-performance distributed computing started with parallel computing
Problem
Performance of distributed shared memory could never compete with that of
multiprocessors, and failed to meet the expectations of programmers. It has
been widely abandoned by now.
Cluster computing
Essentially a group of high-end systems connected through a LAN
• Homogeneous: same OS, near-identical hardware
• Single, or tightly coupled managing node(s)
Grid computing
Plenty of nodes from everywhere:
• Heterogeneous
• Dispersed across several organizations
• Can easily span a wide-area network
Note
To allow for collaborations, grids generally use virtual organizations. In
essence, this is a grouping of users (or better: their IDs) that allows for
authorization on resource allocation.
Integrating applications
Situation
Organizations challenged with many networked applications, but
achieving interoperability was hard
Basic approach
A networked application is one that runs on a server making its
services available to remote clients. Simple integration: clients
combine requests for (different) applications; send that off; collect
responses, and present a coherent result to the user.
Issue: all-or-nothing (the ACID properties of transactions)
• Atomic: happens indivisibly (seemingly)
• Consistent: does not violate system invariants
• Isolated: no mutual interference
• Durable: once committed, changes are permanent
Architectural styles
Basic idea
A style is formulated in terms of
• (replaceable) components with well-defined interfaces
• the way that components are connected to each other
• the data exchanged between components
• how these components and connectors are jointly configured into
a system.
Connector
A mechanism that mediates communication, coordination, or cooperation
among components. Example: facilities for (remote) procedure call,
messaging, or streaming.
Layered architecture
Different layered organizations
Application Layering
Traditional three-layered view
• Application-interface layer contains units for interfacing to users or
external applications
• Processing layer contains the functions of an application, i.e., without
specific data
• Data layer contains the data that a client wants to manipulate through the
application components
Observation
This layering is found in many distributed information systems, using traditional
database technology and accompanying applications.
Application Layering
Example: a simple search engine
Object-based style
Components are objects, connected to each other through procedure
calls. Objects may be placed on different machines; calls can thus execute
across a network.
Encapsulation
Objects are said to encapsulate data and offer methods on that data without
revealing the internal implementation.
RESTful architectures
Essence
View a distributed system as a collection of resources, individually managed by
components. Resources may be added, removed, retrieved, and modified by
(remote) applications.
1. Resources are identified through a single naming scheme
2. All services offer the same interface
3. Messages sent to or from a service are fully self-described
4. After executing an operation at a service, that component
forgets everything about the caller
Basic operations
Operation Description
PUT Create a new resource
GET Retrieve the state of a resource in some representation
DELETE Delete a resource
POST Modify a resource by transferring a new state
Example: Amazon S3. Objects (i.e., files) are placed into buckets (i.e.,
directories). Buckets cannot be placed into buckets. Operations on
ObjectName in bucket BucketName require an identifier combining both names.
Typical operations
All operations are carried out by sending HTTP requests:
• Create a bucket/object: PUT, along with the URI
• Listing objects: GET on a bucket name
• Reading an object: GET on a full URI
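A sketch of these operations using the third-party requests package and a hypothetical bucket URL; real S3 calls additionally require signed authentication headers, which are omitted here.

```python
import requests

base = "https://examplebucket.s3.amazonaws.com"   # hypothetical bucket URI

requests.put(base)                                 # create the bucket
requests.put(f"{base}/notes.txt", data=b"hello")   # create an object
listing = requests.get(base)                       # list objects in the bucket
obj = requests.get(f"{base}/notes.txt")            # read an object
requests.delete(f"{base}/notes.txt")               # delete the object
```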
On interfaces
Issue
Many people like RESTful approaches because the interface to a service is
so simple. The catch is that much needs to be done in the parameter space.
Coordination
A taxonomy based on whether processes are coupled in time and in reference:
• Referentially coupled, temporally coupled: direct coordination
• Referentially coupled, temporally decoupled: mailbox coordination
• Referentially decoupled, temporally coupled: event-based coordination
• Referentially decoupled, temporally decoupled: shared data space
Publish-subscribe architectures combine event-based coordination with a
shared data space.
Middleware organization: wrappers
Problem: a client application expects an interface that differs from what an
existing component offers.
Solution
A wrapper or adapter offers an interface acceptable to the client application. Its
functions are transformed into those available at the component.
Multitiered Architectures
Alternative organizations
Vertical distribution
Comes from dividing distributed applications into three logical layers and running
the components from each layer on a different server (machine).
Horizontal distribution
A client or server may be physically split up into logically equivalent parts, but each part
is operating on its own share of the complete data set.
Peer-to-peer architectures
Processes are all equal: the functions that need to be carried out are represented by every
process ⇒ each process will act as a client and a server at the same time (i.e., acting as a
servant).
Lecture 4
Symmetrically distributed system architectures
Structured P2P
Essence
Make use of a semantic-free index: each data item is uniquely associated with
a key, in turn used as an index. Common practice: use a hash function
key(data item) = hash(data item’s value).
P2P system now responsible for storing (key,value) pairs.
Example: Chord
Principle
• Nodes are logically organized in a ring. Each node has an m-bit identifier.
• Each data item is hashed to an m-bit key.
• Data item with key k is stored at the node with the smallest identifier id ≥ k,
called the successor of key k (see the sketch below).
• The ring is extended with various shortcut links to other nodes.
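A minimal sketch of Chord-style key placement (successor lookup on a ring); the shortcut links and node joins/leaves are not modeled:

```python
import hashlib

M = 5  # m-bit identifier space: identifiers in [0, 2**M)

def ident(name: str) -> int:
    # Hash a name to an m-bit identifier.
    return int.from_bytes(hashlib.sha256(name.encode()).digest(), "big") % (2 ** M)

def successor(key: int, node_ids: list[int]) -> int:
    # The node responsible for key k is the first node with id >= k,
    # wrapping around the ring.
    ring = sorted(node_ids)
    for nid in ring:
        if nid >= key:
            return nid
    return ring[0]

nodes = [1, 4, 9, 11, 14, 18, 20, 21, 28]
k = ident("some data item")
print(f"key {k} is stored at node {successor(k, nodes)}")
```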
Unstructured P2P
Essence
Each node maintains an ad hoc list of neighbors. The resulting overlay
resembles a random graph: an edge ⟨u, v⟩ exists only with a certain probability
P[⟨u, v⟩].
Searching
• Flooding: issuing node u passes the request for d to all its neighbors. The
request is ignored when the receiving node v has seen it before; otherwise,
v searches locally for d.
• Random walk: issuing node u passes the request for d to a randomly
chosen neighbor, which repeats the process if it does not have d.
Super-peer networks
Essence
It is sometimes sensible to break the symmetry in pure peer-to-peer networks:
• When searching in unstructured P2P systems, having index servers
improves performance
• Deciding where to store data can often be done more efficiently through
brokers.
Example: BitTorrent
Hybrid system architectures
Cloud computing
Make a distinction between four layers
• Hardware: Processors, routers, power and cooling systems. Customers
normally never get to see these.
• Infrastructure: Deploys virtualization techniques. Evolves around
allocating and managing virtual storage devices and virtual servers.
• Platform: Provides higher-level abstractions for storage and such.
Example: Amazon S3 storage system offers an API for (locally created)
files to be organized and stored in so-called buckets.
• Application: Actual applications, such as office suites (text processors,
spreadsheet applications, presentation applications). Comparable to the
suite of apps shipped with OSes.
Edge-server architecture
Essence
Systems deployed on the Internet where servers are placed at the edge of the
network: the boundary between enterprise networks and the actual Internet.
Edge orchestration
Managing resources at the edge may be trickier than in the cloud
• Resource allocation: we need to guarantee the availability of the
resources required to perform a service.
• Service placement: we need to decide when and where to place a
service. This is notably relevant for mobile applications.
• Edge selection: we need to decide which edge infrastructure should be
used when a service needs to be offered. The closest one may not be the
best one.
Blockchains
Principle working of a blockchain system
Observations
• Blocks are organized into an unforgeable append-only chain
• Each block in the blockchain is immutable ⇒ massive replication
• The real snag lies in who is allowed to append a block to a chain
Observation (centralized solution)
A single entity decides which validator can go ahead and append a block.
This does not fit the design goals of blockchains.
Observation (permissioned blockchain)
• A selected, relatively small group of servers jointly reach consensus on
which validator can go ahead.
• None of these servers needs to be trusted, as long as roughly two-thirds
behave according to their specifications.
• In practice, only a few tens of servers can be accommodated.
Observation (permissionless blockchain)
• Participants collectively engage in a leader election. Only the elected
leader is allowed to append a block of validated transactions.
• Large-scale, decentralized leader election that is fair, robust, secure, and
so on, is far from trivial.
Distributed Systems
Chapter: Communication
Foundations: layered protocols
The OSI reference model organizes communication into seven layers.
Drawbacks for distributed systems:
• Focus on message passing only (it just moves data)
• Often unneeded or unwanted functionality (the full seven-layer stack)
• Violates access transparency (applications must deal with communication
explicitly)
Low-level layers
Summary:
• Physical layer: contains the specification and implementation of bits, and
their transmission between sender and receiver
• Data link layer: prescribes the transmission of a series of bits into a frame to
allow for error and flow control
• Network layer: describes how packets in a network of computers are to
be routed.
Observation
For many distributed systems: the lowest-level interface is just the network
layer.
Transport Layer
The transport layer provides the actual communication facilities for most
distributed systems.
Middleware layer
Middleware provides common services and protocols that can be used by
many different applications, such as (un)marshaling of data, naming, and
security protocols.
Types of communication
Transient versus persistent
• Transient communication: a message is stored by the communication
system only as long as the sending and receiving applications are executing.
• Persistent communication: a message is stored by the communication
system until it can be delivered to the receiver.
Places for synchronization
• At request submission
• At request delivery
• After request processing
Client/Server
Client/Server computing is generally based on a model of transient
synchronous communication:
• Client and server have to be active at the time of communication
• Client issues request and blocks until it receives a reply
• Server essentially waits only for incoming requests, and subsequently
processes them
Messaging
Message-oriented middleware
Aims at high-level persistent asynchronous communication:
• Processes send each other messages, which are queued
• Sender need not wait for an immediate reply but can do other things
• Middleware often ensures fault tolerance
Remote procedure call
Conclusion
Communication between caller and callee can be hidden by using the
procedure-call mechanism.
1. Client procedure calls client stub.
2. Stub builds message; calls local OS.
3. OS sends the message to the remote OS.
4. Remote OS gives the message to the server stub.
5. Stub unpacks parameters; calls the server.
6. Server does local call; returns result to the stub.
7. Stub builds message; calls its OS.
8. OS sends the message to the client's OS.
9. Client's OS gives the message to the client stub.
10. Client stub unpacks the result; returns to the client.
Conclusion
The client and server must properly interpret messages, transforming
machine-dependent representations into a machine- and network-independent
format and back.
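A minimal sketch of the mechanism using Python's standard xmlrpc modules, where the proxy plays the role of the client stub (host and port chosen arbitrarily):

```python
from xmlrpc.server import SimpleXMLRPCServer

def add(a: int, b: int) -> int:
    # Runs at the server; the stubs marshal parameters and the result.
    return a + b

server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(add)
# server.serve_forever()  # run this in the server process

# In a separate client process, the call looks local but executes remotely:
# from xmlrpc.client import ServerProxy
# print(ServerProxy("http://localhost:8000").add(1, 2))  # -> 3
```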
Asynchronous RPCs
Try to eliminate the strict request-reply behavior, and let the client continue
without waiting for an answer from the server.
Message-oriented communication
Queue-based messaging
Four possible combinations
Message-oriented middleware
Essence
Asynchronous persistent communication through support of middleware-level
queues. Queues correspond to buffers at communication servers.
Operations
Operation Description
PUT Append a message to a specified queue
GET Block until the specified queue is nonempty, and remove the first message
POLL Check a specified queue for messages, and remove the first; never block
NOTIFY Install a handler to be called when a message is put into the specified queue
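A small sketch of this interface on top of Python's standard thread-safe queue; NOTIFY is emulated with a daemon watcher thread:

```python
import queue
import threading
import time

class MessageQueue:
    def __init__(self):
        self._q = queue.Queue()

    def put(self, msg):            # PUT: append a message to the queue
        self._q.put(msg)

    def get(self):                 # GET: block until nonempty, remove first
        return self._q.get()

    def poll(self):                # POLL: remove first message, never block
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None

    def notify(self, handler):     # NOTIFY: install a handler for new messages
        def watch():
            while True:
                handler(self._q.get())
        threading.Thread(target=watch, daemon=True).start()

mq = MessageQueue()
mq.notify(lambda m: print("received:", m))
mq.put("hello")
time.sleep(0.1)  # give the watcher thread a chance to run
```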
General model
Queue managers
Queue managers manage queues. An application can only put messages
into a local queue. Getting a message is possible by extracting it from a local
queue only ⇒ Queue managers need to route messages.
Routing
Message broker
Observation
Message queuing systems assume a common messaging protocol: all
applications agree on the message format (i.e., structure and data
representation). A message broker centralizes the handling of heterogeneity
by transforming incoming messages into the format of the destination
application.
Introduction to threads
Basic idea
We build virtual processors in software, on top of physical processors:
Processor: Provides a set of instructions along with the capability of
automatically executing a series of those instructions.
Thread: A minimal software processor in whose context a series of
instructions can be executed. Saving a thread context implies
stopping the current execution and saving all the data needed
to continue the execution at a later stage.
Process: A software processor in whose context one or more threads
may be executed. Executing a thread means executing a
series of instructions in the context of that thread.
Context switching
Contexts
• Processor context: The minimal collection of values stored in the
registers of a processor used for the execution of a series of instructions
(e.g., stack pointer, addressing registers, program counter).
• Thread context: The minimal collection of values stored in registers and
memory, used for the execution of a series of instructions (i.e., processor
context, state).
• Process context: The minimal collection of values stored in registers and
memory, used for the execution of a thread (i.e., thread context, but now
also at least MMU register values).
Observations
1. Threads share the same address space.
2. Thread context switching can be done entirely independent of the
operating system.
3. Process switching is generally (somewhat) more expensive as it involves
getting the OS in the loop, i.e., trapping to the kernel.
4. Creating and destroying threads is much cheaper than doing so for
processes.
Trade-offs
• Threads use the same address space: more prone to errors
• No support from OS/HW to protect threads against using each other's memory
• Thread context switching may be faster than process context switching
User-space solution
• All operations can be completely handled within a single process ⇒
implementations can be extremely efficient.
• All services provided by the kernel are done on behalf of the process in
which a thread resides ⇒ if the kernel decides to block a thread, the
entire process will be blocked.
• Threads are used when there are many external events: threads block on
a per-event basis ⇒ if the kernel can’t distinguish threads, how can it
support signaling events to them?
Better structure
• Most servers have high I/O demands. Using simple, well-understood
blocking calls simplifies the structure.
• Multithreaded programs tend to be smaller and easier to understand due
to simplified flow of control.
Overview
Model Characteristics
Multithreading Parallelism, blocking system calls
Single-threaded process No parallelism, blocking system calls
Finite-state machine Parallelism, nonblocking system calls
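As a sketch of the multithreading model: a thread-per-connection echo server that combines parallelism with simple, well-understood blocking calls (port chosen arbitrarily):

```python
import socket
import threading

def handle(conn: socket.socket) -> None:
    # One thread per connection; blocking recv/sendall keep the logic simple.
    with conn:
        while data := conn.recv(1024):
            conn.sendall(data)

def serve(port: int = 9000) -> None:
    with socket.socket() as srv:
        srv.bind(("localhost", port))
        srv.listen()
        while True:
            conn, _ = srv.accept()  # dispatcher waits for incoming requests
            threading.Thread(target=handle, args=(conn,), daemon=True).start()

# serve()  # blocks; run in a dedicated process or thread
```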
Virtualization
Observation
Virtualization is important:
• Hardware changes faster than software
• Ease of portability and code migration
• Isolation of failing or attacked components
Mimicking interfaces
Four types of interfaces at three different levels
1. Instruction set architecture: the set of machine instructions, with two
subsets:
• Privileged instructions: allowed to be executed only by the operating
system.
• General instructions: can be executed by any program.
2. System calls as offered by an operating system.
3. Library calls, known as an application programming interface (API)
Ways of virtualization
Differences
(a) Process VM: a separate set of instructions, an interpreter/emulator, running atop an OS.
(b) Native VMM: low-level instructions, along with a bare-bones minimal operating system.
(c) Hosted VMM: low-level instructions, but delegating most work to a full-fledged OS.
Special instructions
• Control-sensitive instruction: may affect configuration of a machine (e.g.,
one affecting relocation register or interrupt table).
• Behavior-sensitive instruction: effect is partially determined by context
(e.g., POPF sets an interrupt-enabled flag, but only in system mode).
Solutions
• Emulate all instructions
• Wrap nonprivileged sensitive instructions to divert control to VMM
• Paravirtualization: modify guest OS, either by preventing nonprivileged
sensitive instructions, or making them nonsensitive (i.e., changing the
context).
Containers
Example: PlanetLab
Essence
Different organizations contribute machines, which they subsequently share for
various experiments.
Problem
We need to ensure that different distributed applications do not get into each
other’s way ⇒ virtualization
Vserver
Independent and protected environment with its own libraries, server versions,
and so on. Distributed applications are assigned a collection of vservers
distributed across multiple machines
IaaS
Instead of renting out a physical machine, a cloud provider will rent out a VM
(or VMM) that may be sharing a physical machine with other customers ⇒
almost complete isolation between customers (although performance isolation
may not be reached).
Client-server interaction
Distinguish application-level and middleware-level solutions
Improving X
Practical observations
• There is often no clear separation between application logic and
user-interface commands
• Applications tend to operate in a tightly synchronous manner with an X
kernel
Alternative approaches
• Let applications control the display completely, up to the pixel level (e.g.,
VNC)
• Provide only a few high-level display operations (dependent on local
video drivers), allowing more efficient display operations.
Client-side software
Generally tailored for distribution transparency
• Access transparency: client-side stubs for RPCs
• Location/migration transparency: let client-side software keep track of
actual location
• Replication transparency: multiple invocations are handled by the client stub
Observation
Concurrent servers are the norm: they can easily handle multiple requests,
notably in the presence of blocking operations (to disks or other servers).
Contacting a server
Observation: most services are tied to a specific port
Out-of-band communication
Issue
Is it possible to interrupt a server once it has accepted (or is in the process of
accepting) a service request?
Solutions
• Use a separate port for urgent data: the server has a separate thread (or
process) waiting for urgent messages, which are handled with priority.
• Use the out-of-band communication facilities of the transport layer (e.g.,
TCP's urgent data).
Stateless servers never keep accurate information about the status of a client
after having handled a request.
Consequences
• Clients and servers are completely independent
• State inconsistencies due to client or server crashes are reduced
• Possible loss of performance because, e.g., a server cannot anticipate
client behavior (think of prefetching file blocks)
Observation
The performance of stateful servers can be extremely high, provided clients
are allowed to keep local copies. As it turns out, reliability is often not a major
problem.
Object servers
• Activation policy: which actions to take when an invocation request comes in:
  – Where are code and data of the object?
  – Which threading model to use?
  – Keep modified state of the object, if any?
• Object adapter: implements a specific activation policy
Crucial element
The first tier is generally responsible for passing requests to an appropriate
server: request dispatching
Request Handling
Observation
Having the first tier handle all communication from/to the cluster may lead to a
bottleneck.
Client transparency
To keep client unaware of distribution, let DNS resolver act on behalf of client.
Problem is that the resolver may actually be far from local to the actual client.
Important note
The cache is often sophisticated enough to hold more than just passive data.
Much of the application code of the origin server can be moved to the cache as
well.
Code migration
Weak mobility: move only the code and data segment (and restart execution)
• Relatively simple, especially if the code is portable
• Distinguish code shipping (push) from code fetching (pull)
Strong mobility: also move the execution segment
• Migration: move the entire object from one machine to the other
• Cloning: start a clone and set it in the same execution state
Observation
As containers are directly dependent on the underlying operating system, their
migration in heterogeneous environments is far from trivial, if not simply
impractical, just as process migration is.
Dependability
Definition: A component provides services to clients. To provide services,
the component may require services from other components ⇒ a
component may depend on some other component.
Requirement Description
Availability Readiness for usage
Reliability Continuity of service delivery
Safety Very low probability of catastrophes
Maintainability How easily a failed system can be repaired
Traditional metrics
• Mean Time To Failure (MTTF): The average time until a component fails.
• Mean Time To Repair (MTTR): The average time needed to repair a
component.
• Mean Time Between Failures (MTBF): Simply MTTF + MTTR.
Observation
Reliability and availability make sense only if we have an accurate idea of
what a failure actually is.
Failure models
Types of failures
Type of failure Description
Crash failure Halts, but was working correctly until it halted
Omission failure Fails to respond to incoming requests (fails to receive or to send messages)
Timing failure Response lies outside a specified time interval
Response failure Response is incorrect (wrong value, or wrong state transition)
Arbitrary failure May produce arbitrary responses at arbitrary times
Halting failures
Asynchronous versus synchronous systems
• Asynchronous system: no assumptions about process execution speeds
or message delivery times → cannot reliably detect crash failures.
• Synchronous system: process execution speeds and message delivery
times are bounded → we can reliably detect omission and timing failures.
• In practice, we have partially synchronous systems: most of the time, we
can assume the system to be synchronous, yet there is no bound on the
time that a system is asynchronous → can normally reliably detect crash
failures.
Halting failures
Assumptions we can make
Assumption Description
Fail-stop Crash failures that can be reliably detected
Fail-noisy Crash failures that can eventually be reliably detected
Fail-silent Omission or crash failures: clients cannot tell what went wrong
Fail-safe Arbitrary, yet benign failures (they cannot do any harm)
Fail-arbitrary Arbitrary, possibly malicious failures
Process resilience
Basic idea
Protect against faulty processes through process replication, organizing
multiple processes into a process group. Distinguish between flat groups and
hierarchical groups.
Consensus
Prerequisite
In a fault-tolerant process group, each nonfaulty process executes the same
commands, and in the same order, as every other nonfaulty process.
Reformulation
Nonfaulty group members need to reach consensus on which command to
execute next.
Flooding-based consensus
System model
• A process group P = {P1,..., Pn}
• Fail-stop failure semantics, i.e., with reliable failure detection
• A client contacts a Pi requesting it to execute a command
• Every Pi maintains a list of proposed commands
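A minimal sketch of one round under these assumptions: each process merges the proposals it received and, once it has heard from every nonfaulty process, deterministically picks the next command:

```python
def merge(own: set[str], received: list[set[str]]) -> set[str]:
    # Flooding: union one's own proposals with everything received.
    merged = set(own)
    for proposal in received:
        merged |= proposal
    return merged

def decide(proposals: set[str]) -> str:
    # Deterministic rule (e.g., smallest command id), so every nonfaulty
    # process that sees the same set decides on the same command.
    return min(proposals)

view = merge({"cmd-b"}, [{"cmd-a"}, {"cmd-c"}])
print(decide(view))  # all processes with this view decide "cmd-a"
```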
Types of failure in more detail
1. Method failure:
In this type of failure, the distributed system is generally halted and unable to
perform the execution; sometimes the execution ends and produces an
incorrect outcome. A method failure causes the system state to deviate from
its specification, and the method may also fail to make progress.
• Behavior: an incorrect computation (e.g., a protection violation, deadlock,
timeout, or bad user input) causes the method to stop its execution.
• Recovery: method failure can be handled by aborting the method or
restarting it from its prior state.
2. System failure:
In a system failure, the processor of the distributed system fails to perform
the execution. This is caused by software errors and hardware issues, such
as a CPU, memory, or bus failure. It is assumed that whenever the system
stops its execution due to such a fault, its internal state is lost.
• Behavior: concerns the physical and logical units of the processor; the
system may freeze or reboot, and may perform no function at all, going
into an idle state.
• Recovery: reboot the system as soon as possible, and repair the failure
point and incorrect state.
3. Secondary storage device failure:
A storage device failure occurs when the stored information cannot be
accessed. This failure is usually caused by a parity error, a head crash, or
dust particles settled on the medium.
• Behavior: stored information cannot be accessed.
• Recovery/design strategies: reconstruct the content from an archive
together with a log of activities, and design a mirrored disk system.
(A system failure can additionally be classified as a disruption failure or a
halting failure.)
4. Communication medium failure:
A communication medium failure occurs when one site cannot communicate
with another operational site in the network. It is usually caused by the failure
of switching nodes and/or communication links.
• Behavior: a site cannot communicate with another operational site.
• Errors/faults: failure of switching nodes or communication links.
• Recovery/design strategies: rerouting, and error-resistant communication
protocols.
Report
Form groups of five students and improve the following slides.
Raft
Developed for understandability
• Uses a fairly straightforward leader-election algorithm (see Chp. 5). The
current leader operates during the current term.
• Every server (typically, five) keeps a log of operations, some of which
have been committed. A backup will not vote for a new leader if its own
log is more up to date.
• All committed operations have the same position in the log of each
respective server.
• The leader decides which pending operation is to be committed next ⇒ a
primary-backup approach.
When submitting an operation
• A client submits a request for operation o.
• The leader appends the request ⟨o, t, ℓ⟩ to its own log, registering the
current term t and the current length ℓ of its log.
• The log is (conceptually) broadcast to the other servers.
• The others (conceptually) copy the log and acknowledge its receipt.
• When a majority of acknowledgments has arrived, the leader commits o.
Note
In practice, only updates are broadcast. At the end, every server has the same
view and knows about the c committed operations. Note that effectively, any
information at the backups is overwritten.
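A minimal sketch of the leader-side commit rule, assuming five servers and counting the leader itself toward the majority:

```python
log = []            # leader log of (operation, term, index) entries
term, committed = 3, 0

def majority_ack(acks: int, n_servers: int = 5) -> bool:
    # acks counts followers that copied the log; +1 for the leader itself.
    return acks + 1 > n_servers // 2

def submit(op: str, acks: int) -> None:
    global committed
    log.append((op, term, len(log)))   # append <o, t, l> to the leader log
    if majority_ack(acks):
        committed = len(log)           # leader commits the operation

submit("x=1", acks=2)  # two followers + the leader = 3 of 5: committed
print(log, "committed:", committed)
```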
Crucial observations
• The new leader has the most committed operations in its log.
• Any missing commits will eventually be sent to the other backups.
Consensus in faulty systems with crash failures
Understanding Paxos
We will build up Paxos from scratch to understand where many consensus
algorithms actually come from.
Paxos essentials
Starting point
• We assume a client-server configuration, with initially one primary server.
• To make the server more robust, we start with adding a backup server.
• To ensure that all commands are executed in the same order at both
servers, the primary assigns unique sequence numbers to all commands.
In Paxos, the primary is called the leader.
• Assume that actual commands can always be restored (either from
clients or servers) ⇒ we consider only control messages.
Two-server situation
• When the leader notices that operation o has not yet been learned, it
retransmits ACCEPT(o, t) with the original timestamp.
Problem
Primary crashes after executing an operation, but the backup never received
the accept message.
Solution
Never execute an operation before it is clear that it has been learned.
Scenario
What happens when LEARN(o1) as sent by S2 to S1 is lost?
Solution
S2 will also have to wait until it knows that S3 has learned o1.
Failure detection
Practice
Reliable failure detection is practically impossible. A solution is to set timeouts,
but take into account that a detected failure may be false.
Observation
If either one of the backups (S2 or S3) crashes, Paxos will behave correctly:
operations at nonfaulty servers are executed in the same order.
Byzantine agreement requirements
• BA1: Every nonfaulty backup process stores the same value.
• BA2: If the primary is not faulty, every nonfaulty backup process stores
exactly what the primary has sent.
Observation
• Primary faulty ⇒ BA1 allows backups to store the same, but a different
(and thus wrong) value than the one originally sent by the client.
• Primary not faulty ⇒ satisfying BA2 implies that BA1 is satisfied.
Assumptions
• A server may exhibit arbitrary failures
• Messages may be lost, delayed, and received out of order
• Messages have an identifiable sender (i.e., they are signed)
• Partially synchronous execution model
Essence
A primary-backup approach with 3k + 1 replica servers.
• C is the client
• P is the primary
• B1, B2 , B3 are backups
• Assume B2 is faulty
Procedure (view change)
• The next primary P∗ is known deterministically.
• A backup server broadcasts VIEW-CHANGE(v + 1, P), where P is the set of
prepares it had sent out.
• P∗ waits for 2k + 1 view-change messages, with X = ⋃P containing all
previously sent prepares.
• P∗ sends out NEW-VIEW(v + 1, X, O), with O a new set of pre-prepare
messages.
• Essence: this allows the nonfaulty backups to replay what has gone on in
the previous view, if necessary, and bring operation o into the new view v + 1.
Question
Are there limitations to what can be readily achieved?
• What is needed to enable reaching consensus?
• What happens when groups are partitioned?
Conclusion
In a network subject to communication failures, it is impossible to realize an
atomic read/write shared memory that guarantees a response to every
request.
Fundamental question
What are the practical ramifications of the CAP theorem?
Failure detection
Issue
How can we reliably detect that a process has actually crashed?
General model
• Each process is equipped with a failure-detection module
• A process P probes another process Q for a reaction
• If Q reacts: Q is considered to be alive (by P)
• If Q does not react within t time units: Q is suspected to have crashed
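A sketch of such a probe with an explicit timeout; note that a timeout only makes Q suspected, since Q may merely be slow (the host/port and the ping/pong convention are assumptions):

```python
import socket

def probe(host: str, port: int, t: float = 1.0) -> bool:
    # P probes Q; no reaction within t time units => Q is suspected.
    try:
        with socket.create_connection((host, port), timeout=t) as s:
            s.sendall(b"ping")
            return s.recv(4) == b"pong"   # Q reacted: considered alive
    except OSError:
        return False                      # timeout or refusal: suspected

if not probe("localhost", 9999):
    print("Q is suspected to have crashed (the detection may be false)")
```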
Reliable client-server communication
Problem: a client crash may leave orphan computations at the server.
Solution
• Orphan is killed (or rolled back) by the client when it recovers
• Client broadcasts new epoch number when recovering ⇒ server kills
client’s orphans
• Require computations to complete within T time units. Old ones are simply
removed.
Reliable group communication
Tricky part
Agreement is needed on what the group actually looks like before a received
message can be delivered.
Distributed commit
Two-phase commit (2PC)
• Phase 1a: The coordinator sends VOTE-REQUEST to all participants.
• Phase 1b: Upon receiving VOTE-REQUEST, a participant returns
VOTE-COMMIT or VOTE-ABORT.
• Phase 2a: The coordinator collects all votes; if all vote to commit, it sends
GLOBAL-COMMIT, otherwise GLOBAL-ABORT.
• Phase 2b: Each participant waits for GLOBAL-COMMIT or GLOBAL-ABORT
and handles it accordingly.
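A minimal sketch of the coordinator's decision rule: commit globally only if every participant voted to commit:

```python
def coordinator_decide(votes: list[str]) -> str:
    # One vote per participant: "VOTE-COMMIT" or "VOTE-ABORT".
    if all(v == "VOTE-COMMIT" for v in votes):
        return "GLOBAL-COMMIT"
    return "GLOBAL-ABORT"

print(coordinator_decide(["VOTE-COMMIT", "VOTE-COMMIT"]))  # GLOBAL-COMMIT
print(coordinator_decide(["VOTE-COMMIT", "VOTE-ABORT"]))   # GLOBAL-ABORT
```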
Recovery: Background
Essence
When a failure occurs, we need to bring the system into an error-free state:
• Forward error recovery: Find a new state from which the system can
continue operation
• Backward error recovery: Bring the system back into a previous error-free
state
Practice
Use backward error recovery, requiring that we establish recovery points
Observation
Recovery in distributed systems is complicated by the fact that processes need
to cooperate in identifying a consistent state from where to recover
Recovery line
Assuming processes regularly checkpoint their state, the recovery line is the
most recent consistent global checkpoint.
Coordinated checkpointing
Essence
Each process takes a checkpoint after a globally coordinated action.
Simple solution
Use a two-phase blocking protocol:
• A coordinator multicasts a checkpoint request message
• When a participant receives such a message, it takes a checkpoint, stops
sending (application) messages, and reports back that it has taken a
checkpoint
• When all checkpoints have been confirmed at the coordinator, the latter
broadcasts a checkpoint done message to allow all processes to continue
Observation
It is possible to consider only those processes that depend on the recovery of
the coordinator, and ignore the rest
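A minimal in-memory sketch of the two-phase blocking protocol; a real system would exchange the request, confirmations, and done messages over the network:

```python
class Participant:
    def __init__(self, name: str):
        self.name = name
        self.blocked = False
        self.checkpoint = None

    def on_request(self, state: str) -> str:
        self.checkpoint = state      # take a local checkpoint
        self.blocked = True          # stop sending application messages
        return f"{self.name}: checkpoint taken"

    def on_done(self) -> None:
        self.blocked = False         # resume normal operation

def coordinate(participants: list[Participant]) -> None:
    # Phase 1: multicast checkpoint-request and collect confirmations.
    acks = [p.on_request(state="application-state") for p in participants]
    assert len(acks) == len(participants)
    # Phase 2: all confirmed, multicast checkpoint-done.
    for p in participants:
        p.on_done()

coordinate([Participant("P1"), Participant("P2"), Participant("P3")])
```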
Cascaded rollback
Observation
If checkpointing is done at the “wrong” instants, the recovery line may lie at
system startup time. We have a so-called cascaded rollback.
Message logging
Alternative
Instead of taking an (expensive) checkpoint, try to replay your (communication)
behavior from the most recent checkpoint ⇒ store messages in a log.
Assumption
We assume a piecewise deterministic execution model:
• The execution of each process can be considered as a sequence of state
intervals
• Each state interval starts with a nondeterministic event (e.g., message
receipt)
• Execution in a state interval is deterministic
Conclusion
If we record nondeterministic events (to replay them later), we obtain a
deterministic execution model that will allow us to do a complete replay.