
Distributed Systems

A distributed system is a networked computer system in which processes and
resources are sufficiently spread across multiple computers.

First feature: a DS is a collection of computing elements (nodes), each able to
behave independently of the others.
Second feature: users believe they are dealing with a single system.
Distributed systems range from small devices to high-performance mainframes.
Introduction From networked systems to distributed systems

Distributed versus Decentralized


What many people state (figure): centralized vs. decentralized vs. distributed
organizations.

When does a decentralized system become distributed?


• Adding 1 link between two nodes in a decentralized system?
• Adding 2 links between two other nodes?
• In general: adding k > 0 links?

Distributed versus decentralized systems


Introduction From networked systems to distributed systems

Alternative approach
Two views on realizing distributed systems
• Integrative view: connecting existing networked computer systems into a
larger system.
• Expansive view: an existing networked computer system is extended
with additional computers

Two definitions
• A decentralized system is a networked computer system in which
processes and resources are necessarily spread across multiple
computers.
• A distributed system is a networked computer system in which processes
and resources are sufficiently spread across multiple computers.

Distributed versus decentralized systems


Introduction From networked systems to distributed systems

Some common misconceptions


Centralized solutions do not scale
Make a distinction between logically and physically centralized. The root of the
Domain Name System is:
• logically centralized
• physically (massively) distributed
• decentralized across several organizations

Centralized solutions have a single point of failure


Generally, not true (e.g., the root of DNS). A single point of failure is often:
• easier to manage
• easier to make more robust

Distributed versus decentralized systems


Introduction From networked systems to distributed systems

Perspectives on distributed systems


Distributed systems are complex: study them from different perspectives
• Architecture: common organizations
• Process: what kind of processes, and their relationships
• Communication: facilities for exchanging data
• Coordination: application-independent algorithms
• Naming: how do you identify resources?
• Consistency and replication: performance requires replication of data, and
replicas need to be kept the same
• Fault tolerance: keep running in the presence of partial failures
• Security: ensure authorized access to resources

Studying distributed systems


Introduction Design goals

What do we want to achieve?


Overall design goals
• Support sharing of resources
• Distribution transparency
• Openness
• Scalability
Introduction Design goals

Distribution transparency

What is transparency?
The phenomenon by which a distributed system attempts to hide the fact that
its processes and resources are physically distributed across multiple
computers, possibly separated by large distances.

Observation
Distribution transparency is realized through many different techniques in a
layer between applications and operating systems: a middleware layer

Distribution transparency
Introduction Design goals

Distribution transparency
Types

Transparency  Description
Access        Hide differences in data representation and in how an object is
              accessed. Example: files stored on different operating systems.
Location      Hide where an object is located. Example: a system file can be
              moved from one server to another without users noticing.
Relocation    Hide that an object may be moved to another location while in use.
Migration     Hide that an object may move to another location. Example:
              communication between mobile phones.
Replication   Hide that an object is replicated.
Concurrency   Hide that an object may be shared by several independent users.
Failure       Hide the failure and recovery of an object.


Introduction Design goals

Degree of transparency
Aiming at full distribution transparency may be too much
• There are communication latencies that cannot be hidden
• Completely hiding failures of networks and nodes is (theoretically and
practically) impossible
• You cannot distinguish a slow computer from a failing one
• You can never be sure that a server actually performed an operation
before a crash
• Full transparency will cost performance, exposing distribution of the
system
• Keeping replicas exactly up-to-date with the master takes time
• Immediately flushing write operations to disk for fault tolerance

Distribution transparency
Introduction Design goals

Degree of transparency
Exposing distribution may be good (make it explicit that the system is distributed)
• Making use of location-based services (finding your nearby friends)
• When dealing with users in different time zones
• When it makes it easier for a user to understand what’s going on (when e.g.,
a server does not respond for a long time, report it as failing).

Conclusion
Distribution transparency is a nice goal, but achieving it is a different story, and it
should often not even be aimed at.

Distribution transparency
Introduction Design goals

Openness of distributed systems


Open distributed system
A system that offers components that can easily be used by or integrated into
other systems. An open distributed system itself will often consist of
components that originate from elsewhere.

What are we talking about?


Be able to interact with services from other open systems, irrespective of the
underlying environment:

• Systems should conform to well-defined interfaces


1. A general approach is to define services through interfaces using an
Interface Definition Language (IDL)
2. Such interfaces specify the inputs and outputs of the service

• Systems should easily interact

1. They must adhere to a common standard

Openness
Openness of distributed systems

• Systems should support portability of applications

• It should be easy to move an application to a different
distributed system without affecting other components
• Systems should be easily extensible
• It should be easy to add new components or
replace existing ones without affecting those
components that stay in place.
Introduction Design goals

Policies versus mechanisms

For example:
In the case of Web caching, for example, a browser should
ideally provide facilities for only storing documents (i.e.,
a mechanism) and at the same time allow users to decide
which documents are stored and for how long (i.e., a
policy)
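To make the separation concrete, here is a minimal Python sketch (all names invented for illustration): the cache supplies the mechanism for storing documents, while a caller-supplied policy function decides which documents are stored and for how long.

```python
import time

class DocumentCache:
    """Mechanism: stores documents, but delegates all decisions to a policy."""
    def __init__(self, policy):
        self.policy = policy      # policy(url, doc) -> lifetime in seconds, or None
        self.store = {}           # url -> (doc, expiry_time)

    def put(self, url, doc):
        lifetime = self.policy(url, doc)
        if lifetime is not None:  # the policy decides what is stored, for how long
            self.store[url] = (doc, time.time() + lifetime)

    def get(self, url):
        entry = self.store.get(url)
        if entry and entry[1] > time.time():
            return entry[0]
        self.store.pop(url, None) # expired or absent
        return None

# Example policy: only cache HTML pages, for one hour
cache = DocumentCache(lambda url, doc: 3600 if url.endswith(".html") else None)
```

Swapping the policy (never cache, cache everything for a day, ...) requires no change to the mechanism.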

Openness
Introduction Design goals

Dependability
Basics:
1. Dependability refers to the degree to which a computer
system can be relied upon to operate as expected.
2. Dependability in distributed systems can be harder to
achieve because of partial failures

A component provides services to clients. To provide


services, the component may require the services from
other components ⇒ a component may depend on some
other component.
More precisely:
A component C depends on C∗ if the correctness of C's
behavior depends on the correctness of C∗'s behavior.
(Components are processes or channels.)

Dependability
Introduction Design goals

Dependability
Requirements related to dependability

Requirement Description
Availability Readiness for usage
Reliability Continuity of service delivery
Safety Very low probability of catastrophes
Maintainability How easily a failed system can be repaired

Dependability
Introduction Design goals

Reliability versus availability


Reliability R(t ) of component C
Conditional probability that C has been functioning correctly during [0, t) given
that C was functioning correctly at time T = 0.

Traditional metrics for fault tolerance :


• Mean Time To Failure (MTTF): The average time until a component fails.
• Mean Time To Repair (MTTR): The average time needed to repair a
component.
• Mean Time Between Failures (MTBF): Simply MTTF + MTTR.
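A small worked example (figures invented): with MTTF = 1000 hours and MTTR = 2 hours, MTBF = 1002 hours, and the long-run availability is MTTF/MTBF ≈ 99.8%.

```python
MTTF = 1000.0               # mean time to failure (hours); example value
MTTR = 2.0                  # mean time to repair (hours); example value
MTBF = MTTF + MTTR          # mean time between failures
availability = MTTF / MTBF  # long-run fraction of time the component is up
print(f"MTBF = {MTBF} h, availability = {availability:.4f}")  # 0.9980
```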

Dependability
Introduction Design goals

Terminology
Failure, error, fault

Term Description
Failure A component is not living up to its specifications

Error Part of a component that can lead to a failure

Fault Cause of an error

Dependability
Introduction Design goals

Terminology
Handling faults

Term Description Example


Fault prevention   Prevent the occurrence of a fault      Don't hire sloppy programmers
Fault tolerance    Build a component such that it can     Build each component by two
                   mask the occurrence of a fault         independent programmers
Fault removal      Reduce the presence, number, or        Get rid of sloppy programmers
                   seriousness of a fault
Fault forecasting  Estimate current presence, future      Estimate how a recruiter is doing
                   incidence, and consequences of faults  when it comes to hiring sloppy
                                                          programmers

Dependability
Introduction Design goals

On security
Observation
A distributed system that is not secure, is not dependable

What we need
• Confidentiality: information is disclosed only to authorized parties
• Integrity: Ensure that alterations to assets of a system can be made only
in an authorized way

Authorization, Authentication, Trust


• Authentication: verifying the correctness of a claimed identity
• Authorization: does an identified entity have the proper access rights?
• Trust: one entity can be assured that another will perform particular
actions according to a specific expectation

Security
Introduction Design goals

Security mechanisms
Keeping it simple
It’s all about encrypting and decrypting data using security keys.

Notation
K (data) denotes that we use key K to encrypt/decrypt data.

Security
Introduction Design goals

Security mechanisms
Symmetric cryptosystem
With encryption key EK (data) and decryption key DK (data):
if data = DK (EK (data)) then DK = EK . Note: the encryption and decryption
keys are the same and should be kept secret.

Asymmetric cryptosystem
Distinguish a public key PK (data) and a private (secret) key SK (data).

Security
Introduction Design goals

Security mechanisms
Secure hashing
In practice, we use secure hash functions: H(data) returns a fixed-length
string.
• Any change from data to data∗ will lead to a completely different string
H(data∗).
• Given a hash value h, it is computationally infeasible to find data with
h = H(data)
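A quick sketch with Python's standard hashlib module: a one-character change in the input yields a completely different fixed-length hash.

```python
import hashlib

h1 = hashlib.sha256(b"transfer 100 to Bob").hexdigest()
h2 = hashlib.sha256(b"transfer 900 to Bob").hexdigest()
print(h1)        # 64 hex characters, regardless of input length
print(h2)        # a completely different string for a one-character change
print(h1 == h2)  # False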


Practical digital signatures


Sign message for Bob by Alice:
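The scheme (Alice signs a message for Bob) can be shown in code. A sketch using the third-party pyca/cryptography package (assumed to be installed): Alice signs the message with her private key; Bob verifies the signature with Alice's public key.

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# Alice's key pair; Bob is assumed to already know the public key
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

message = b"Hello Bob, this is Alice"
pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                  salt_length=padding.PSS.MAX_LENGTH)

# Alice: sign a secure hash of the message with her private key
signature = private_key.sign(message, pss, hashes.SHA256())

# Bob: verify() raises InvalidSignature if message or signature was tampered with
public_key.verify(signature, message, pss, hashes.SHA256())
print("signature valid")
```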

Security
Lecture 2
Introduction Design goals

Scale in distributed systems


Observation
Many developers of modern distributed systems easily use the adjective
“scalable” without making clear why their system actually scales.

At least three components


• Number of users or processes (size scalability)
• Maximum distance between nodes (geographical scalability)
• Number of administrative domains (administrative scalability)


Observation
A solution: multiple powerful servers operating independently in parallel. Today,
the challenge still lies in geographical and administrative scalability.

Scalability
Introduction Design goals

Size scalability
Scalability problems with centralized solutions:

• The computational capacity, limited by the CPUs


• The storage capacity, including the transfer rate between CPUs and
disks
• The network between the user and the centralized service

Scalability
Introduction Design goals

Formal analysis
A centralized service can be modeled as a simple queuing system

Assumptions and notations


• The queue has infinite capacity ⇒ the arrival rate of requests is not
influenced by the current queue length or by what is being processed.
• Arrival rate of requests: λ
• Processing capacity of the service: µ requests per second

Fraction of time having k requests in the system: p_k = (1 − λ/µ)(λ/µ)^k

Scalability
Introduction Design goals

Formal analysis
Utilization U of a service is the fraction of time that it is busy: U = λ/µ,
so p_k = (1 − U)U^k

Average number of requests in the system: N = Σ_k k·p_k = U/(1 − U)

Average throughput: X = U·µ = λ

Scalability
Introduction Design goals

Formal analysis
Response time R: total time taken to process a request after submission:
R = S/(1 − U), with S = 1/µ being the service time.

Observations
• If U is small, the response-to-service time ratio R/S is close to 1: a request is
immediately processed
• If U goes up to 1, the system comes to a grinding halt.
Solution: decrease S.
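These formulas are easy to explore numerically; a small sketch with invented example values:

```python
lam = 8.0    # arrival rate λ (requests/s); example value
mu = 10.0    # processing capacity µ (requests/s); example value

S = 1 / mu        # service time
U = lam / mu      # utilization
N = U / (1 - U)   # average number of requests in the system
R = S / (1 - U)   # response time
print(f"U={U:.2f}  N={N:.2f}  R={R*1000:.0f} ms  R/S={R/S:.1f}")
# U=0.80  N=4.00  R=500 ms  R/S=5.0
```

Pushing lam toward mu (U → 1) makes N and R blow up, which is the "grinding halt" above.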

Scalability
Introduction Design goals

Problems with geographical scalability

• Cannot simply go from LAN to WAN: many


distributed systems assume synchronous client-server
interactions: client sends request and waits for an
answer. Latency may easily prohibit this scheme.
• WAN links are often inherently unreliable: simply
moving streaming video from LAN to WAN is bound to
fail.
• Lack of multipoint communication, so that a simple
search broadcast cannot be deployed. Solution is to
develop separate naming and directory services
(having their own scalability problems).
Scalability
Introduction Design goals

Problems with administrative scalability


Definition:
Conflicting policies concerning usage (and thus payment), management,
and security

Examples
• Computational grids: share expensive resources between different
domains.
• Shared equipment: how to control, manage, and use a shared radio
telescope constructed as a large-scale shared sensor network?

Exception: several peer-to-peer networks


• File-sharing systems (based, e.g., on BitTorrent)
• Peer-to-peer telephony (early versions of Skype)
• Peer-assisted audio streaming (Spotify)
Note: end users collaborate and not administrative entities.

Scalability
Introduction Design goals

Techniques for scaling


Hide communication latencies
• Make use of asynchronous communication
• Have a separate handler for incoming responses
• Problem: not every application fits this model
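A minimal sketch of the idea in Python (do_request is a stand-in for a real remote call): the request runs asynchronously and a separate handler picks up the response, so the caller is never blocked waiting.

```python
import threading, time

def do_request(x):
    time.sleep(1.0)       # stand-in for a high-latency remote call
    return x * x

def async_call(x, on_response):
    """Issue the request asynchronously; on_response handles the reply."""
    threading.Thread(target=lambda: on_response(do_request(x)),
                     daemon=True).start()

async_call(7, lambda result: print("response:", result))
print("caller continues immediately")  # printed before the response arrives
time.sleep(1.5)                        # keep the demo process alive
```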

Scalability
Introduction Design goals

Techniques for scaling


Facilitate solution by moving computations to client

Scalability
Introduction Design goals

Techniques for scaling Examples:


Partition data and computations across multiple machines
• Move computations to clients (Java applets and scripts)
• Decentralized naming services (DNS)
• Decentralized information systems (WWW)

Scalability
Introduction Design goals

Techniques for scaling


Replication and caching: Make copies of data available at different
machines
• Replicated file servers and databases
• Mirrored Websites
• Web caches (in browsers and proxies)
• File caching (at server and client)

Scalability
Introduction Design goals

Scaling: The problem with replication


Applying replication is easy, except for one thing
• Having multiple copies (cached or replicated), leads to
inconsistencies: modifying one copy makes that copy different from the
rest.
• Always keeping copies consistent and in a general way requires global
synchronization on each modification.
• Global synchronization precludes large-scale solutions.

Observation
If we can tolerate inconsistencies, we may reduce the need for global
synchronization, but tolerating inconsistencies is application dependent.

Scalability
Introduction A simple classification of distributed systems

Parallel computing
Observation
High-performance distributed computing started with parallel computing

Multiprocessor and multicore versus multicomputer

High-performance distributed computing


Introduction A simple classification of distributed systems

Distributed shared memory systems


Observation
Multiprocessors are relatively easy to program in comparison to
multicomputers, yet have problems when increasing the number of processors
(or cores). Solution: Try to implement a shared-memory model on top of a
multicomputer.

Example through virtual-memory techniques


Map all main-memory pages (from different processors) into one single
virtual address space. If a process at processor A addresses a page P
located at processor B, the OS at A traps and fetches P from B, just as it
would if P had been located on local disk.

Problem
Performance of distributed shared memory could never compete with that of
multiprocessors, and failed to meet the expectations of programmers. It has
been widely abandoned by now.

High-performance distributed computing


Introduction A simple classification of distributed systems

Cluster computing
Essentially a group of high-end systems connected through a LAN
• Homogeneous: same OS, near-identical hardware
• Single, or tightly coupled managing node(s)

High-performance distributed computing


Introduction A simple classification of distributed systems

Grid computing
plenty of nodes from everywhere
• Heterogeneous
• Dispersed across several organizations
• Can easily span a wide-area network

Note
To allow for collaborations, grids generally use virtual organizations. In
essence, this is a grouping of users (or better: their IDs) that allows for
authorization on resource allocation.

High-performance distributed computing


Introduction A simple classification of distributed systems

Architecture for grid computing


The layers
• Fabric: Provides interfaces to local
resources (for querying state and
capabilities, locking, etc.)
• Connectivity: Communication/transaction
protocols, e.g., for moving data between
resources. Also various authentication
protocols.
• Resource: Manages a single resource,
such as creating processes or reading
data.
• Collective: Handles access to multiple
resources: discovery, scheduling,
replication.
• Application: Contains actual grid
applications in a single organization.

High-performance distributed computing


Introduction A simple classification of distributed systems

Integrating applications

Situation
Organizations were confronted with many networked applications, but
achieving interoperability was hard
Basic approach
A networked application is one that runs on a server making its
services available to remote clients. Simple integration: clients
combine requests for (different) applications; send that off; collect
responses, and present a coherent result to the user.

Distributed information systems


Introduction A simple classification of distributed systems

Example EAI: (nested) transactions


Transaction
Primitive Description
BEGIN TRANSACTION Mark the start of a transaction
END TRANSACTION Terminate the transaction and try to commit
ABORT TRANSACTION Kill the transaction and restore the old values
READ Read data from a file, a table, or otherwise
WRITE Write data to a file, a table, or otherwise

Issue: all-or-nothing
• Atomic: happens indivisibly (seemingly)
• Consistent: does not violate system invariants
• Isolated: no mutual interference
• Durable: commit means changes are permanent
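These primitives map directly onto what most databases offer. A sketch with Python's built-in sqlite3 (table and amounts invented): either both updates are committed, or the transaction aborts and the old values are restored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:                                              # BEGIN TRANSACTION
    conn.execute("UPDATE accounts SET balance = balance - 30 "
                 "WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 "
                 "WHERE name = 'bob'")
    conn.commit()                                 # END TRANSACTION (commit)
except Exception:
    conn.rollback()                               # ABORT: restore old values

print(conn.execute("SELECT * FROM accounts").fetchall())
# [('alice', 70), ('bob', 80)]
```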

Distributed information systems


Chapter 02: Architectures
Architectures Architectural styles

Architectural styles
Basic idea
A style is formulated in terms of
• (replaceable) components with well-defined interfaces
• the way that components are connected to each other
• the data exchanged between components
• how these components and connectors are jointly configured into
a system.

Connector
A mechanism that mediates communication, coordination, or cooperation
among components. Example: facilities for (remote) procedure call,
messaging, or streaming.
Architectures Architectural styles

Layered architecture
Different layered organizations


Layered architectures
Architectures Architectural styles

Example: communication protocols


Protocol, service, interface (TCP-UDP)

Layered architectures
Architectures Architectural styles

Application Layering
Traditional three-layered view
• Application-interface layer contains units for interfacing to users or
external applications
• Processing layer contains the functions of an application, i.e., without
specific data
• Data layer contains the data that a client wants to manipulate through the
application components

Observation
This layering is found in many distributed information systems, using traditional
database technology and accompanying applications.

Layered architectures
Architectures Architectural styles

Application Layering
Example: a simple search engine

Layered architectures
Architectures Architectural styles

Object-based style
Components are objects, connected to each other through procedure
calls. Objects may be placed on different machines; calls can thus execute
across a network.

Encapsulation
Objects are said to encapsulate data and offer methods on that data without
revealing the internal implementation.

Service-oriented architectures
Architectures Architectural styles

RESTful architectures
Essence
View a distributed system as a collection of resources, individually managed by
components. Resources may be added, removed, retrieved, and modified by
(remote) applications.
1. Resources are identified through a single naming scheme
2. All services offer the same interface
3. Messages sent to or from a service are fully self-described
4. After executing an operation at a service, that component
forgets everything about the caller

Basic operations
Operation Description
PUT Create a new resource
GET Retrieve the state of a resource in some representation
DELETE Delete a resource
POST Modify a resource by transferring a new state
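In HTTP terms, the table corresponds one-to-one to request methods. A sketch using the third-party requests package (the service URL and resource are hypothetical):

```python
import requests

base = "https://api.example.com/notes"    # hypothetical RESTful service

requests.put(f"{base}/42", json={"text": "buy milk"})    # create resource 42
r = requests.get(f"{base}/42")                           # retrieve its state
print(r.json())
requests.post(f"{base}/42", json={"text": "buy tea"})    # transfer a new state
requests.delete(f"{base}/42")                            # delete the resource
```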

Service-oriented architectures
Architectures Architectural styles

Example: Amazon’s Simple Storage Service

Objects (i.e., files) are placed into buckets (i.e., directories). Buckets cannot be
placed into buckets. Operations on ObjectName in bucket BucketName
require an identifier of the form https://s3.amazonaws.com/BucketName/ObjectName.

Typical operations
All operations are carried out by sending HTTP requests:
• Create a bucket/object: PUT, along with the URI
• Listing objects: GET on a bucket name
• Reading an object: GET on a full URI

Service-oriented architectures
Architectures Architectural styles

On interfaces

Issue
Many people like RESTful approaches because the interface to a service is
so simple. The catch is that much needs to be done in the parameter space.

Amazon S3 SOAP interface

Service-oriented architectures
Architectures Architectural styles

Coordination

Event-based and Shared data space

Publish-subscribe architectures
Architectures Architectural styles

Publish and subscribe


Issue: how to match events?
• Assume events are described by (attribute, value) pairs
• Topic-based subscription: specify a series of "attribute = value" constraints
• Content-based subscription: specify a series of "attribute ∈ range" constraints
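The difference is easy to show in Python (event and subscriptions invented): a topic-based subscription tests attribute equality, a content-based one tests range membership.

```python
event = {"symbol": "AAPL", "price": 187.5}

# Topic-based: a series of "attribute = value" constraints
topic_sub = {"symbol": "AAPL"}
topic_match = all(event.get(a) == v for a, v in topic_sub.items())

# Content-based: a series of "attribute in range" constraints
content_sub = {"price": (150.0, 200.0)}
content_match = all(lo <= event.get(a, float("nan")) <= hi
                    for a, (lo, hi) in content_sub.items())

print(topic_match, content_match)  # True True
```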

Publish-subscribe architectures
Architectures Middleware and distributed systems

Middleware: the OS of distributed systems

What does it contain?


Commonly used components and functions that need not be implemented by
applications separately.
Architectures Middleware and distributed systems

Using legacy to build middleware


Problem
The interfaces offered by a legacy component are most likely not suitable for
all applications.

Solution
A wrapper or adapter offers an interface acceptable to a client application. Its
functions are transformed into those available at the component.

Developing adaptable middleware


Problem
Middleware contains solutions that are good for most applications ⇒ you may
want to adapt its behavior for specific applications.

Middleware organization
Architectures Layered-system architectures

Centralized system architectures


Basic Client–Server Model
Characteristics:
• There are processes offering services (servers)
• There are processes that use services (clients)
• Clients and servers can be on different machines
• Clients follow a request/reply model when using services

Simple client-server architecture


Architectures Layered-system architectures

Multi-tiered centralized system architectures


Some traditional organizations
• Single-tiered: dumb terminal/mainframe configuration
• Two-tiered: client/single server configuration
• Three-tiered: each layer on separate machine

Traditional two-tiered configurations


Multitiered Architectures
Architectures Layered-system architectures

Being client and server at the same time


Three-tiered architecture

Multitiered Architectures
Architectures Layered-system architectures

Example: The Network File System


Foundations
Each NFS server provides a standardized view of its local file system: each
server supports the same model, regardless of the implementation of the file
system.

The NFS remote access model

The remote access model versus the upload/download model


Note
FTP is a typical upload/download model. The same can be said for systems
like Dropbox.

Example: The Network File System


Architectures Layered-system architectures

Example: Simple Web servers


Back in the old days...

...life was simple:


• A website consisted of a collection of HTML files
• HTML files could refer to each other through hyperlinks
• A Web server essentially needed only a hyperlink to fetch a file
• A browser took care of properly rendering the content of a file

Example: The Web


Architectures Layered-system architectures

Example (cnt’d): Less simple Web servers


Still back in the old days...

...life became a bit more complicated:


• A website was built around a database with content
• A Webpage could still be referred to by a hyperlink
• A Web server essentially needed only a hyperlink to fetch a file
• A separate program (Common Gateway Interface) composed a page
• A browser took care of properly rendering the content of a file

Example: The Web


Architectures Symmetrically distributed system architectures

Alternative organizations
Vertical distribution
Comes from dividing distributed applications into three logical layers and running
the components from each layer on a different server (machine).

Horizontal distribution
A client or server may be physically split up into logically equivalent parts, but each part
is operating on its own share of the complete data set.

Peer-to-peer architectures
Processes are all equal: the functions that need to be carried out are represented by every
process ⇒ each process will act as a client and a server at the same time (i.e., acting as a
servent).
Lecture 4
Architectures Symmetrically distributed system architectures

Structured P2P
Essence
Make use of a semantic-free index: each data item is uniquely associated with
a key, in turn used as an index. Common practice: use a hash function
key(data item) = hash(data item’s value).
P2P system now responsible for storing (key,value) pairs.

Simple example: hypercube

Looking up d with key k ∈ {0, 1, 2, ..., 2^4 − 1} means routing the request to the
node with identifier k.

Structured peer-to-peer systems


Architectures Symmetrically distributed system architectures

Example: Chord
Principle
• Nodes are logically organized in a ring. Each node has an m-bit identifier.
• Each data item is hashed to an m-bit key.
• Data item with key k is stored at node with smallest identifier id ≥ k ,
called the successor of key k .
• The ring is extended with various shortcut links to other nodes.
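A sketch of the successor rule (node identifiers invented, m = 5, so keys range over 0..31): a key is stored at the first node on the ring whose identifier is greater than or equal to the key, wrapping around at the end.

```python
def successor(nodes, k):
    """Return the node responsible for key k on a Chord-style ring."""
    for n in sorted(nodes):
        if n >= k:          # smallest identifier id >= k
            return n
    return min(nodes)       # wrap around the ring

ring = [1, 4, 9, 11, 14, 18, 20, 21, 28]  # example node identifiers (m = 5)
print(successor(ring, 12))  # -> 14
print(successor(ring, 30))  # -> 1 (wrap-around)
```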

Structured peer-to-peer systems


Architectures Symmetrically distributed system architectures

Unstructured P2P
Essence
Each node maintains an ad hoc list of neighbors. The resulting overlay
resembles a random graph: an edge ⟨u, v ⟩ exists only with a certain probability
P[⟨u, v ⟩].

Searching
• Flooding: issuing node u passes the request for d to all its neighbors. A
request is ignored when the receiving node has seen it before. Otherwise,
v searches locally for d. May be limited by a Time-To-Live: a maximum
number of hops.
• Random walk: issuing node u passes the request for d to a randomly chosen
neighbor, v. If v does not have d, it forwards the request to one of its
randomly chosen neighbors, and so on.

Unstructured peer-to-peer systems


Architectures Symmetrically distributed system architectures

Super-peer networks
Essence
It is sometimes sensible to break the symmetry in pure peer-to-peer networks:
• When searching in unstructured P2P systems, having index servers
improves performance
• Deciding where to store data can often be done more efficiently through
brokers.

Hierarchically organized peer-to-peer networks


Architectures Symmetrically distributed system architectures

Collaboration: The BitTorrent case


Principle: search for a file F
• Look up the file at a global directory ⇒ returns a torrent file
• The torrent file contains a reference to a tracker: a server keeping an accurate
account of active nodes that have (chunks of) F.
• A peer P can join the swarm, get a chunk for free, and then trade a copy of that
chunk for another one with a peer Q also in the swarm.

Example: BitTorrent
Architectures Hybrid system architectures

Cloud computing

Cloud computing
Architectures Hybrid system architectures

Cloud computing
Make a distinction between four layers
• Hardware: Processors, routers, power and cooling systems. Customers
normally never get to see these.
• Infrastructure: Deploys virtualization techniques. Evolves around
allocating and managing virtual storage devices and virtual servers.
• Platform: Provides higher-level abstractions for storage and such.
Example: Amazon S3 storage system offers an API for (locally created)
files to be organized and stored in so-called buckets.
• Application: Actual applications, such as office suites (text processors,
spreadsheet applications, presentation applications). Comparable to the
suite of apps shipped with OSes.

Cloud computing
Architectures Hybrid system architectures

Edge-server architecture
Essence
Systems deployed on the Internet where servers are placed at the edge of the
network: the boundary between enterprise networks and the actual Internet.

The edge-cloud architecture


Architectures Hybrid system architectures

Reasons for having an edge infrastructure


• Latency and bandwidth: Especially important for certain real-time
applications, such as augmented/virtual reality applications. Many people
underestimate the latency and bandwidth to the cloud.
• Reliability: The connection to the cloud is often assumed to be unreliable,
which is often a false assumption. There may be critical situations in
which extremely high connectivity guarantees are needed.
• Security and privacy: The implicit assumption is often that when assets
are nearby, they can be made better protected. Practice shows that this
assumption is generally false. However, securely handling data
operations in the cloud may be trickier than within your own organization.

The edge-cloud architecture


Architectures Hybrid system architectures

Edge orchestration
Managing resources at the edge may be trickier than in the cloud
• Resource allocation: we need to guarantee the availability of the
resources required to perform a service.
• Service placement: we need to decide when and where to place a
service. This is notably relevant for mobile applications.
• Edge selection: we need to decide which edge infrastructure should be
used when a service needs to be offered. The closest one may not be the
best one.

The edge-cloud architecture


Architectures Hybrid system architectures

Blockchains
Principle working of a blockchain system


Observations
• Blocks are organized into an unforgeable append-only chain
• Each block in the blockchain is immutable ⇒ massive replication
• The real snag lies in who is allowed to append a block to a chain
Blockchain architectures
Architectures Hybrid system architectures

Appending a block: distributed consensus


Centralized solution

Observation
A single entity decides on which validator can go ahead and append a block.
Does not fit the design goals of blockchains.

Blockchain architectures
Architectures Hybrid system architectures

Appending a block: distributed consensus


Distributed solution (permissioned)

Observation
• A selected, relatively small group of servers jointly reach consensus on
which validator can go ahead.
• None of these servers needs to be trusted, as long as roughly two-thirds
behave according to their specifications.
• In practice, only a few tens of servers can be accommodated.

Blockchain architectures
Architectures Hybrid system architectures

Appending a block: distributed consensus


Decentralized solution (permissionless)

Observation
• Participants collectively engage in a leader election. Only the elected
leader is allowed to append a block of validated transactions.
• Large-scale, decentralized leader election that is fair, robust, secure, and
so on, is far from trivial.

Blockchain architectures
Distributed Systems

Chapter : Communication
Communication Foundations

Basic networking model

Disadvantages:
• Focus on message passing only (just sending data)
• Often unneeded or unwanted functionality (many layers: 7 in the OSI model)
• Violates access transparency (all layers are visible to one another)
Layered Protocols
Communication Foundations

Low-level layers
Summary:
• Physical layer: contains the specification and implementation of bits, and
their transmission between sender and receiver
• Data link layer: used in the transmission of a series of bits into a frame to
allow for error and flow control
• Network layer: describes how packets in a network of computers are to
be routed.

Observation
For many distributed systems: the lowest-level interface is just the network
layer.

Layered Protocols
Communication Foundations

Transport Layer
The transport layer provides the actual communication facilities for most
distributed systems.

Standard Internet protocols


• TCP: connection-oriented, reliable, stream-oriented communication
• UDP: unreliable (best-effort) datagram communication

Layered Protocols
Communication Foundations

Middleware layer

Middleware is developed to provide common services and protocols that many
different applications can use:
• A rich set of communication protocols
• (Un)marshaling of data, necessary for integrated systems
• Naming protocols, to allow easy sharing of resources
• Security protocols for secure communication
• Scaling mechanisms, such as for replication and caching

Layered Protocols
Communication Foundations

An adapted layering scheme

Layered Protocols
Communication Foundations

Types of communication

• Transient versus persistent communication


• Asynchronous versus synchronous communication

Types of Communication
Communication Foundations

Types of communication
Transient versus persistent

• Transient communication: a communication server discards a message when
it cannot be delivered to the next server or to the receiver.
• Persistent communication: a message is stored at a communication
server for as long as it takes to deliver it.

Types of Communication
Communication Foundations

Types of communication
Places for synchronization

• At request submission
• At request delivery
• After request processing

Types of Communication
Communication Foundations

Client/Server
Client/Server computing is generally based on a model of transient
synchronous communication:
• Client and server have to be active at the time of communication
• Client issues request and blocks until it receives a reply
• Server essentially waits only for incoming requests, and subsequently
processes them

Drawbacks synchronous communication


• Client cannot do any other work while waiting for a reply
• Failures have to be handled immediately: the client is waiting

Types of Communication
Communication Foundations

Messaging
Message-oriented middleware
Aims at high-level persistent asynchronous communication:
• Processes send each other messages, which are queued
• Sender need not wait for an immediate reply but can do other things
• Middleware often ensures fault tolerance

Types of Communication
Communication Remote procedure call

Basic RPC (REMOTE PROCEDURE CALL) operation


Observations
• Application developers are familiar with simple procedure model
• Well-engineered procedures operate in isolation (black box)
• There is no fundamental reason not to execute procedures on a
separate machine

Conclusion
Communication between caller &
callee can be hidden by using the
procedure-call mechanism.

Basic RPC operation


Communication Remote procedure call

Basic RPC operation

1. Client procedure calls client stub.
2. Stub builds message; calls local OS.
3. OS sends the message to remote OS.
4. Remote OS gives the message to the server stub.
5. Stub unpacks parameters; calls server.
6. Server does local call; returns result to stub.
7. Stub builds message; calls OS.
8. OS sends message to client's OS.
9. Client's OS gives message to stub.
10. Client stub unpacks result; returns to client.
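Python's standard library contains a simple RPC implementation that follows exactly these steps, generating the stubs behind the scenes. A minimal sketch (port number invented):

```python
from xmlrpc.server import SimpleXMLRPCServer
import threading
import xmlrpc.client

def add(a, b):              # an ordinary local procedure
    return a + b

# Server side: register the procedure and serve requests in the background
server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(add)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the proxy acts as the client stub
proxy = xmlrpc.client.ServerProxy("http://localhost:8000")
print(proxy.add(3, 4))      # looks like a local call; executes on the server
```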

Basic RPC operation


Communication Remote procedure call

RPC: Parameter passing


There is more to parameter passing than just wrapping parameters into a message
• Client and server machines may have different data representations (think
of byte ordering)
• Wrapping (marshaling) a parameter means transforming a value into a
sequence of bytes
• Client and server have to agree on the same encoding:

• How are basic data values represented (integers, floats, characters)


• How are complex data values represented (arrays, unions)

Conclusion
The client and server must properly interpret messages, transforming them
into machine-dependent representations.
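Python's struct module makes the byte-ordering problem visible: the same integer marshaled in big-endian and little-endian order gives different byte sequences, so sender and receiver must agree on one encoding.

```python
import struct

value = 1_000_000
big = struct.pack(">i", value)     # big-endian (network byte order)
little = struct.pack("<i", value)  # little-endian
print(big.hex(), little.hex())     # 000f4240 40420f00

# Decoding with the wrong byte order silently yields a different value
print(struct.unpack(">i", little)[0])  # not 1000000
```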

Parameter passing
Communication Remote procedure call

Asynchronous RPCs

Try to eliminate the strict request-reply behavior, but let the client continue
without waiting for an answer from the server.

Variations on RPC
Communication Remote procedure call

Sending out multiple RPCs

Sending an RPC request to a group of servers.

Variations on RPC
Communication Message-oriented communication

Transient messaging: sockets


Berkeley socket interface
Operation Description
socket Create a new communication end point
bind Attach a local address to a socket
listen Tell operating system what the maximum number of pending
connection requests should be
accept Block caller until a connection request arrives
connect Actively attempt to establish a connection
send Send some data over the connection
receive Receive some data over the connection
close Release the connection

Simple transient messaging with sockets
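The table maps directly onto Python's socket module. A minimal sketch (port invented): the server must call bind/listen/accept before the client connects.

```python
import socket, threading, time

def server():
    s = socket.socket()                 # socket
    s.bind(("localhost", 9090))         # bind
    s.listen(1)                         # listen
    conn, _ = s.accept()                # accept: block until a client arrives
    data = conn.recv(1024)              # receive
    conn.send(b"echo: " + data)         # send
    conn.close(); s.close()             # close

threading.Thread(target=server, daemon=True).start()
time.sleep(0.2)                         # crude: give the server time to start

c = socket.socket()
c.connect(("localhost", 9090))          # connect
c.send(b"hello")
print(c.recv(1024))                     # b'echo: hello'
c.close()
```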


Communication Message-oriented communication

Making sockets easier to work with


Sockets are rather low-level, and programming mistakes are easily made.
Three patterns
• Request-reply
• Publish-subscribe
• Pipeline

Advanced transient messaging


Communication Message-oriented communication

Queue-based messaging
Four possible combinations

Message-oriented persistent communication


Communication Message-oriented communication

Message-oriented middleware
Essence
Asynchronous persistent communication through support of middleware-level
queues. Queues correspond to buffers at communication servers.

Operations

Operation Description
PUT Append a message to a specified queue
GET Block until the specified queue is nonempty, and
remove the first message
POLL Check a specified queue for messages, and remove
the first. Never block
NOTIFY Install a handler to be called when a message is put
into the specified queue
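A local approximation of the four operations using Python's standard queue module (NOTIFY is simplified to a handler thread; in a real MOM system the queues live at communication servers):

```python
import queue, threading, time

q = queue.Queue()

def put(msg):                   # PUT: append a message to the queue
    q.put(msg)

def get():                      # GET: block until the queue is nonempty
    return q.get()

def poll():                     # POLL: check for a message; never block
    try:
        return q.get_nowait()
    except queue.Empty:
        return None

def notify(handler):            # NOTIFY: call handler for each new message
    def loop():
        while True:
            handler(q.get())
    threading.Thread(target=loop, daemon=True).start()

notify(lambda m: print("received:", m))
put("hello"); put("world")
time.sleep(0.5)                 # keep the demo process alive
```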

Message-oriented persistent communication


Communication Message-oriented communication

General model
Queue managers
Queue managers manage queues. An application can only put messages
into a local queue. Getting a message is possible by extracting it from a local
queue only ⇒ Queue managers need to route messages.

Routing

Message-oriented persistent communication


Communication Message-oriented communication

Message broker
Observation
Message queuing systems assume a common messaging protocol: all
applications agree on message format (i.e., structure and data representation)

Broker handles application heterogeneity in an MQ system


• Transforms incoming messages to target format
• Very often acts as an application gateway
• May provide subject-based routing capabilities (i.e., publish-subscribe
capabilities)

Message-oriented persistent communication


Distributed Systems
(4th edition, version 01)

Chapter 03: Processes


Processes Threads

Introduction to threads
Basic idea
We build virtual processors in software, on top of physical processors:
Processor: Provides a set of instructions along with the capability of
automatically executing a series of those instructions.
Thread: A minimal software processor in whose context a series of
instructions can be executed. Saving a thread context implies
stopping the current execution and saving all the data needed
to continue the execution at a later stage.
Process: A software processor in whose context one or more threads
may be executed. Executing a thread means executing a
series of instructions in the context of that thread.

Introduction to threads
Processes Threads

Context switching
Contexts
• Processor context: The minimal collection of values stored in the
registers of a processor used for the execution of a series of instructions
(e.g., stack pointer, addressing registers, program counter).
• Thread context: The minimal collection of values stored in registers and
memory, used for the execution of a series of instructions (i.e., processor
context, state).
• Process context: The minimal collection of values stored in registers and
memory, used for the execution of a thread (i.e., thread context, but now
also at least MMU register values).

Introduction to threads
Processes Threads

Context switching
Observations
1. Threads share the same address space.
2. Thread context switching can be done entirely independent of the
operating system.
3. Process switching is generally (somewhat) more expensive as it involves
getting the OS in the loop, i.e., trapping to the kernel.
4. Creating and destroying threads is much cheaper than doing so for
processes.

Introduction to threads
Processes Threads

Why use threads


Some simple reasons
• Avoid needless blocking: a single-threaded process will block when doing
I/O; in a multithreaded process, the operating system can switch the CPU
to another thread in that process.
• Exploit parallelism: the threads in a multithreaded process can be
scheduled to run in parallel on a multiprocessor or multicore processor.
• Avoid process switching: structure large applications not as a collection of
processes, but through multiple threads.

Introduction to threads
Processes Threads

Avoid process switching


Avoid expensive context switching

Trade-offs
• Threads use the same address space: more prone to errors
• There is no OS/HW support to protect threads from using each other's memory
• Thread context switching may be faster than process context switching

Introduction to threads
Processes Threads

The cost of a context switch


Consider a simple clock-interrupt handler
• direct costs: actual switch and executing code of the handler
• indirect costs: other costs, notably caused by messing up the cache

What a context switch may cause: indirect costs
Figure: cache state (a) before the context switch, (b) after the context switch,
(c) after accessing block D.

Introduction to threads
Processes Threads

Threads and operating systems


Main issue
Should an OS kernel provide threads, or should they be implemented as
user-level packages?

User-space solution
• All operations can be completely handled within a single process ⇒
implementations can be extremely efficient.
• All services provided by the kernel are done on behalf of the process in
which a thread resides ⇒ if the kernel decides to block a thread, the
entire process will be blocked.
• Threads are used when there are many external events: threads block on
a per-event basis ⇒ if the kernel can’t distinguish threads, how can it
support signaling events to them?

Introduction to threads
Processes Threads

Threads and operating systems


Kernel solution
The whole idea is to have the kernel contain the implementation of a thread
package. This means that all thread operations are implemented as system calls:
• Operations that block a thread are no longer a problem: the kernel
schedules another available thread within the same process.
• Handling external events is simple: the kernel (which catches all events)
schedules the thread associated with the event.
• The problem is (or used to be) the loss of efficiency because each thread
operation requires a trap to the kernel.

Introduction to threads
Processes Threads

Combining user-level and kernel-level threads


Basic idea
Introduce a two-level threading approach: kernel threads that can execute
user-level threads.

Introduction to threads
Processes Threads

User and kernel threads combined


Principle operation
• User thread does system call ⇒ the kernel thread that is executing that
user thread, blocks. The user thread remains bound to the kernel thread.
• The kernel can schedule another kernel thread having a runnable user
thread bound to it. Note: this user thread can switch to any other
runnable user thread currently in user space.
• A user thread calls a blocking user-level operation ⇒ do context switch to
a runnable user thread, (then bound to the same kernel thread).
• When there are no user threads to schedule, a kernel thread may remain
idle, and may even be removed (destroyed) by the kernel.

Introduction to threads
Processes Threads

Using threads at the client side


Multithreaded web client
Hiding network latencies:
• Web browser scans an incoming HTML page, and finds that more files
need to be fetched.
• Each file is fetched by a separate thread, each doing a (blocking) HTTP
request.
• As files come in, the browser displays them.
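A sketch of this pattern with a Python thread pool (URLs hypothetical): each fetch blocks in its own thread, so the downloads overlap and the network latency is largely hidden.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

urls = ["https://www.example.com/a.png",   # hypothetical embedded files
        "https://www.example.com/b.css",
        "https://www.example.com/c.js"]

def fetch(url):
    with urllib.request.urlopen(url) as r:  # blocking HTTP request per thread
        return url, len(r.read())

with ThreadPoolExecutor(max_workers=3) as pool:
    for url, size in pool.map(fetch, urls): # the files are fetched in parallel
        print(url, size, "bytes")
```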

Multiple request-response calls to other machines (RPC)


• A client does several calls at the same time, each one by a different
thread.
• It then waits until all results have been returned.
• Note: if calls are to different servers, we may have a linear speed-up.

Threads in distributed systems


Processes Threads

Using threads at the server side


Improve performance
• Starting a thread is cheaper than starting a new process.
• Having a single-threaded server prohibits simple scale-up to a
multiprocessor system.
• As with clients: hide network latency by reacting to next request while
previous one is being replied.

Better structure
• Most servers have high I/O demands. Using simple, well-understood
blocking calls simplifies the structure.
• Multithreaded programs tend to be smaller and easier to understand due
to simplified flow of control.

Threads in distributed systems


Processes Threads

Why multithreading is popular: organization


Dispatcher/worker model

Overview
Model Characteristics
Multithreading Parallelism, blocking system calls
Single-threaded process No parallelism, blocking system calls
Finite-state machine Parallelism, nonblocking system calls

Threads in distributed systems


Processes Virtualization

Virtualization
Observation
Virtualization is important:
• Hardware changes faster than software
• Ease of portability and code migration
• Isolation of failing or attacked components

Principle: mimicking interfaces

Principle of virtualization
Processes Virtualization

Mimicking interfaces
Four types of interfaces at three different levels
1. Instruction set architecture: the set of machine instructions, with two
subsets:
• Privileged instructions: allowed to be executed only by the operating
system.
• General instructions: can be executed by any program.
2. System calls as offered by an operating system.
3. Library calls, known as an application programming interface (API)

Principle of virtualization
Processes Virtualization

Ways of virtualization

(a) Process VM (b) Native VMM (c) Hosted VMM

Differences
(a) Separate set of instructions, an interpreter/emulator, running atop an OS.
(b) Low-level instructions, along with bare-bones minimal operating system
(c) Low-level instructions, but delegating most work to a full-fledged OS.

Principle of virtualization
Processes Virtualization

Zooming into VMs: performance


Refining the organization
• Privileged instruction: if
and only if executed in
user mode, it causes a
trap to the operating
system
• Nonprivileged instruction:
the rest

Special instructions
• Control-sensitive instruction: may affect configuration of a machine (e.g.,
one affecting relocation register or interrupt table).
• Behavior-sensitive instruction: effect is partially determined by context
(e.g., POPF sets an interrupt-enabled flag, but only in system mode).

Principle of virtualization
Processes Virtualization

Condition for virtualization


Necessary condition
For any conventional computer, a virtual machine monitor may be constructed
if the set of sensitive instructions for that computer is a subset of the set of
privileged instructions.

Problem: condition is not always satisfied


There may be sensitive instructions that are executed in user mode without
causing a trap to the operating system.

Solutions
• Emulate all instructions
• Wrap nonprivileged sensitive instructions to divert control to VMM
• Paravirtualization: modify guest OS, either by preventing nonprivileged
sensitive instructions, or making them nonsensitive (i.e., changing the
context).

Principle of virtualization
Processes Virtualization

Containers

• Namespaces: a collection of processes in a container is given their own


view of identifiers
• Union file system: combine several file systems into a layered fashion
with only the highest layer allowing for write operations (that layer
being part of the container).
• Control groups: resource restrictions can be imposed upon a collection of
processes.
Containers
Processes Virtualization

Example: PlanetLab
Essence
Different organizations contribute machines, which they subsequently share for
various experiments.

Problem
We need to ensure that different distributed applications do not get into each
other’s way ⇒ virtualization

Containers
Processes Virtualization

PlanetLab basic organization

Vserver
Independent and protected environment with its own libraries, server versions,
and so on. Distributed applications are assigned a collection of vservers
distributed across multiple machines

Containers
Processes Virtualization

PlanetLab Vservers and slices


Essence
• Each Vserver operates in its own environment (cf. chroot ).
• Linux enhancements include proper adjustment of process IDs (e.g.,
init having ID 0).
• Two processes in different Vservers may have the same user ID, but this
does not imply the same user.

Separation leads to slices

Containers
Processes Virtualization

VMs and cloud computing


Three types of cloud services
• Infrastructure-as-a-Service covering the basic infrastructure
• Platform-as-a-Service covering system-level services
• Software-as-a-Service containing actual applications

IaaS
Instead of renting out a physical machine, a cloud provider will rent out a VM
(or VMM) that may be sharing a physical machine with other customers ⇒
almost complete isolation between customers (although performance isolation
may not be reached).

Application of virtual machines to distributed systems


Processes Clients

Client-server interaction
Distinguish application-level and middleware-level solutions

Networked user interfaces


Processes Clients

Example: The X Window system


Basic organization


X client and server


The application acts as a client to the X-kernel, the latter running as a server
on the client’s machine.

Networked user interfaces


Processes Clients

Improving X
Practical observations
• There is often no clear separation between application logic and
user-interface commands
• Applications tend to operate in a tightly synchronous manner with an X
kernel

Alternative approaches
• Let applications control the display completely, up to the pixel level (e.g.,
VNC)
• Provide only a few high-level display operations (dependent on local
video drivers), allowing more efficient display operations.

Networked user interfaces


Processes Clients

Virtual desktop environment


Logical development
With an increasing number of cloud-based applications, the question is how to
use those applications from a user's premises.
• Issue: develop the ultimate networked user interface
• Answer: use a Web browser to establish a seamless experience

The Google Chromebook


Virtual desktop environment
Processes Clients

The anatomy of a Web browser

Virtual desktop environment


Processes Clients

Client-side software
Generally tailored for distribution transparency
• Access transparency: client-side stubs for RPCs
• Location/migration transparency: let client-side software keep track of
actual location
• Replication transparency: multiple invocations handled by client stub
• Failure transparency: can often be placed only at client (we're trying to
mask server and communication failures)

Client-side software for distribution transparency


Processes Servers

Servers: General organization


Basic model
A process implementing a specific service on behalf of a collection of clients. It
waits for an incoming request from a client and subsequently ensures that the
request is taken care of, after which it waits for the next incoming request.


Two basic types


• Iterative server: Server handles the request before attending a next
request.
• Concurrent server: Uses a dispatcher, which picks up an incoming
request that is then passed on to a separate thread/process.

Observation
Concurrent servers are the norm: they can easily handle multiple requests,
notably in the presence of blocking operations (to disks or other servers).
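A concurrent server in a few lines using Python's standard socketserver module (port invented): the dispatcher hands each incoming connection to a separate worker thread.

```python
import socketserver

class EchoHandler(socketserver.StreamRequestHandler):
    def handle(self):                   # runs in its own thread per connection
        data = self.rfile.readline()
        self.wfile.write(b"echo: " + data)

# ThreadingTCPServer = dispatcher plus one worker thread per connection
with socketserver.ThreadingTCPServer(("localhost", 9091), EchoHandler) as srv:
    srv.serve_forever()                 # an iterative server would use TCPServer
```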

General design issues


Processes Servers

Contacting a server
Observation: most services are tied to a specific port

ftp-data  20  File Transfer [Default Data]
ftp       21  File Transfer [Control]
telnet    23  Telnet
smtp      25  Simple Mail Transfer
www       80  Web (HTTP)

Dynamically assigning an end point: two approaches

General design issues


Processes Servers

Out-of-band communication
Issue
Is it possible to interrupt a server once it has accepted (or is in the process of
accepting) a service request?


Solution 1: Use a separate port for urgent data


• Server has a separate thread/process for urgent messages
• Urgent message comes in ⇒ associated request is put on hold
• Note: we require OS supports priority-based scheduling

Solution 2: Use facilities of the transport layer


• Example: TCP allows for urgent messages in same connection
• Urgent messages can be caught using OS signaling techniques

General design issues


Processes Servers

Servers and state


Stateless servers
Never keep accurate information about the status of a client after having
handled a request:
• Don’t record whether a file has been opened (simply close it again after
access)
• Don’t promise to invalidate a client’s cache
• Don’t keep track of your clients


Consequences
• Clients and servers are completely independent
• State inconsistencies due to client or server crashes are reduced
• Possible loss of performance because, e.g., a server cannot anticipate
client behavior (think of prefetching file blocks)

General design issues


Processes Servers

Servers and state


Stateful servers
Keeps track of the status of its clients:
• Record that a file has been opened, so that prefetching can be done
• Knows which data a client has cached, and allows clients to keep local
copies of shared data

Observation
The performance of stateful servers can be extremely high, provided clients
are allowed to keep local copies. As it turns out, reliability is often not a major
problem.

General design issues


Processes Servers

Object servers
• Activation policy: which actions to take when an invocation request comes in:
  • Where are code and data of the object?
  • Which threading model to use?
  • Keep modified state of the object, if any?
• Object adapter: implements a specific activation policy

Object servers
Processes Servers

Example: the Apache Web server

Example: The Apache Web server


Processes Servers

Three different tiers


Common organization

Crucial element
The first tier is generally responsible for passing requests to an appropriate
server: request dispatching

Server clusters
Processes Servers

Request Handling
Observation
Having the first tier handle all communication from/to the cluster may lead to a
bottleneck.

A solution: TCP handoff

Server clusters
Processes Servers

When servers are spread across the Internet


Observation
Spreading servers across the Internet may introduce administrative problems.
These can be largely circumvented by using data centers from a single cloud
provider.

Request dispatching: if locality is important


Common approach: use DNS:
1. Client looks up specific service through DNS - client’s IP address is part
of request
2. DNS server keeps track of replica servers for the requested service, and
returns address of most local server.
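
As an illustration of this approach, the toy redirector below maps the client’s
IP address to a nearby replica. The networks and replica addresses are invented
for the example:

import ipaddress

# Hypothetical mapping from client networks to their closest replica.
REPLICAS = {
    ipaddress.ip_network("192.0.2.0/24"): "198.51.100.10",
    ipaddress.ip_network("203.0.113.0/24"): "198.51.100.20",
}
DEFAULT = "198.51.100.30"

def resolve(client_ip: str) -> str:
    addr = ipaddress.ip_address(client_ip)
    for net, replica in REPLICAS.items():
        if addr in net:          # client's IP address is part of the request
            return replica       # return address of most local server
    return DEFAULT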

Client transparency
To keep the client unaware of distribution, let the DNS resolver act on behalf of
the client. The problem is that the resolver may not be anywhere near the actual
client.

Server clusters
Processes Servers

A simplified version of the Akamai CDN

Important note
The cache is often sophisticated enough to hold more than just passive data.
Much of the application code of the origin server can be moved to the cache as
well.

Server clusters
Processes Code migration

Reasons to migrate code


Load distribution
• Ensuring that servers in a data center are sufficiently loaded (e.g., to
prevent waste of energy)
• Minimizing communication by ensuring that computations are close to
where the data is (think of mobile computing).

Flexibility: moving code to a client when needed

Avoids having to pre-install software and allows for dynamic configuration.


Reasons for migrating code
Processes Code migration

Reasons to migrate code


Privacy and security
In many cases, one cannot move data to another location, for whatever reason
(often legal ones). Solution: move the code to the data.

Example: federated machine learning

Reasons for migrating code


Processes Code migration

Paradigms for code mobility

Models for code migration


Processes Code migration

Strong and weak mobility


Object components
• Code segment: contains the actual code
• Data segment: contains the state
• Execution state: contains context of thread executing the object’s code

Weak mobility: Move only code and data segment (and reboot
execution)
• Relatively simple, especially if code is portable
• Distinguish code shipping (push) from code fetching (pull)

Strong mobility: Move component, including execution state


• Migration: move entire object from one machine to the other
• Cloning: start a clone, and set it in the same execution state.
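
A minimal illustration of weak mobility: only the code (and data) segment moves,
and the receiver restarts execution from scratch. Shipping Python source as a
string and exec()-ing it stands in for a real migration protocol, which would add
naming, security, and transport:

CODE = '''
def task(data):
    return sum(data)
'''

def receive_and_run(code: str, data):
    env = {}
    exec(code, env)              # install the shipped code segment
    return env["task"](data)     # (re)start execution; no execution state moves

print(receive_and_run(CODE, [1, 2, 3]))   # -> 6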

Models for code migration


Processes Code migration

Migration in heterogeneous systems


Main problem
• The target machine may not be suitable to execute the migrated code
• The definition of process/thread/processor context is highly dependent on
local hardware, operating system and runtime system

Only solution: abstract machine implemented on different platforms


• Interpreted languages, effectively having their own VM
• Virtual machine monitors

Observation
As containers are directly dependent on the underlying operating system, their
migration in heterogeneous environments is far from trivial, if not simply
impractical, just as process migration is.

Migration in heterogeneous systems


Processes Code migration

Migrating a virtual machine


Migrating images: three alternatives
1. Pushing memory pages to the new machine and resending the ones that
are later modified during the migration process.
2. Stopping the current virtual machine; migrate memory, and start the new
virtual machine.
3. Letting the new virtual machine pull in new pages as needed: processes
start on the new virtual machine immediately and copy memory pages on
demand.
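
A sketch of the first alternative (pre-copy): push all pages, keep resending the
pages dirtied in the meantime, and only stop the virtual machine for the small
residual set. All functions are hypothetical hooks into the hypervisor:

def precopy_migrate(pages, get_dirty_pages, send, stop_vm, threshold=8):
    send(pages)                          # first full pass over all memory pages
    dirty = get_dirty_pages()
    while len(dirty) > threshold:        # iterate while too many pages dirty
        send(dirty)                      # resend pages modified meanwhile
        dirty = get_dirty_pages()
    stop_vm()                            # short downtime window
    send(get_dirty_pages())              # copy the final residual set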

Migration in heterogeneous systems


Processes Code migration

Performance of migrating virtual machines


Problem
A complete migration may actually take tens of seconds. We also need to
realize that during the migration, a service will be completely unavailable for
multiple seconds.

Measurements regarding response times during VM migration

Migration in heterogeneous systems


Distributed Systems
(4th edition, version 01)

Chapter 08: Fault Tolerance


Fault tolerance Introduction to fault tolerance

Dependability
Definition: A service provider provides services to clients. To provide
services, the component may require the services from other
components ⇒ a component may depend on some other component.

Requirements related to dependability

Requirement Description
Availability Readiness for usage
Reliability Continuity of service delivery
Safety Very low probability of catastrophes
Maintainability How easy can a failed system be repaired

Basic concepts
Fault tolerance Introduction to fault tolerance

Reliability versus availability


Reliability R(t ) of component C
Conditional probability that C has been functioning correctly during [0, t ) given
C was functioning correctly at time T = 0.

Traditional metrics
• Mean Time To Failure (MTTF): The average time until a component fails.
• Mean Time To Repair (MTTR): The average time needed to repair a
component.
• Mean Time Between Failures (MTBF): Simply MTTF + MTTR.

Basic concepts
Fault tolerance Introduction to fault tolerance

Reliability versus availability


Availability A(t ) of component C
Average fraction of time that C has been up-and-running in the interval [0, t ).
• Long-term availability A: A(∞)
• Note: A = MTTF/MTBF = MTTF/(MTTF + MTTR)

Observation
Reliability and availability make sense only if we have an accurate idea of
what a failure actually is.
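
A small worked example of the formula above, with invented numbers:

MTTF = 1000.0                # mean time to failure (hours)
MTTR = 2.0                   # mean time to repair (hours)
MTBF = MTTF + MTTR
A = MTTF / MTBF
print(f"A = {A:.4f}")        # about 0.9980, i.e., 99.8% long-term availability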

Basic concepts
Fault tolerance Introduction to fault tolerance

Terminology
Failure, error, fault

Failure: A component is not living up to its specifications (example: a crashed program)
Error: The part of a component that can lead to a failure (example: a programming bug)
Fault: The cause of an error (example: a sloppy programmer)

Basic concepts
Fault tolerance Introduction to fault tolerance

Terminology
Handling faults

Fault prevention: Prevent the occurrence of a fault (example: don’t hire sloppy programmers)
Fault tolerance: Build a service provider such that it can mask the occurrence of a fault (example: have each component built by two independent programmers)
Fault removal: Reduce the presence, number, or seriousness of a fault (example: get rid of sloppy programmers)
Fault forecasting: Estimate the current presence, future incidence, and consequences of faults (example: estimate how a recruiter is doing when it comes to hiring sloppy programmers)

Basic concepts
Fault tolerance Introduction to fault tolerance

Failure models
Types of failures

Crash failure: Halts, but is working correctly until it halts
Omission failure: Fails to respond to incoming requests
  Receive omission: Fails to receive incoming messages
  Send omission: Fails to send messages
Timing failure: The response lies outside a specified time interval
Response failure: The response is incorrect
  Value failure: The value of the response is wrong
  State-transition failure: Deviates from the correct flow of control

Failure models
Fault tolerance Introduction to fault tolerance

Dependability versus security


Omission versus commission
Arbitrary failures are sometimes qualified as malicious. It is better to make the
following distinction:
• Omission failures: a service provider fails to take any action that it
should have taken
• Commission failure: a component takes an action that it should not have
taken

Failure models
Fault tolerance Introduction to fault tolerance

Halting failures
Asynchronous versus synchronous systems
• Asynchronous system: no assumptions about process execution speeds
or message delivery times → cannot reliably detect crash failures.
• Synchronous system: process execution speeds and message delivery
times are bounded → we can reliably detect omission and timing failures.
• In practice, we have partially synchronous systems: most of the time, we
can assume the system to be synchronous, yet there is no bound on the
time that a system is asynchronous → can normally reliably detect crash
failures.

Failure models
Fault tolerance Introduction to fault tolerance

Halting failures
Assumptions we can make

Fail-stop: Crash failures, but reliably detected
Fail-noisy: Crash failures, eventually reliably detected
Fail-silent: Omission or crash failures; clients cannot tell what went wrong
Fail-safe: Arbitrary, yet benign failures (they cannot do any harm)
Fail-arbitrary: Arbitrary, possibly malicious failures

Failure models
Fault tolerance Introduction to fault tolerance

Redundancy for failure masking


Types of redundancy
• Information redundancy: Add extra bits to data units so that errors can be
recovered when bits are changed.
• Time redundancy: Design a system such that an action can be performed
again if anything went wrong. Typically used when faults are transient or
intermittent.
• Physical redundancy: add equipment or processes in order to allow one
or more components to fail. This type is extensively used in distributed
systems.

Failure masking by redundancy


Fault tolerance Process resilience

Process resilience
Basic idea
Protect against faulty processes through process replication, organizing
multiple processes into a process group. Distinguish between flat groups and
hierarchical groups.

Resilience by process groups


Fault tolerance Process resilience

Consensus
Prerequisite
In a fault-tolerant process group, each nonfaulty process executes the same
commands, and in the same order, as every other nonfaulty process.

Reformulation
Nonfaulty group members need to reach consensus on which command to
execute next.

Consensus in faulty systems with crash failures


Fault tolerance Process resilience

Flooding-based consensus
System model
• A process group P = {P1,..., Pn}
• Fail-stop failure semantics, i.e., with reliable failure detection
• A client contacts a Pi requesting it to execute a command
• Every Pi maintains a list of proposed commands
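
A heavily simplified sketch of one round under these assumptions: every process
floods its set of proposed commands, and decides once it has heard from all
processes it (reliably) knows to be nonfaulty. Re-flooding after a detected crash
is omitted, and taking the smallest command is just one deterministic choice:

def flood_round(my_id, proposals, inbox, alive, send):
    for q in alive:                          # flood own proposal set
        send(q, (my_id, set(proposals)))
    heard, merged = set(), set(proposals)
    for sender, cmds in inbox:               # merge everything received
        heard.add(sender)
        merged |= cmds
    if alive <= heard | {my_id}:             # heard from all nonfaulty processes
        return min(merged)                   # deterministic pick, e.g. lowest id
    return None                              # otherwise wait for the next round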

Consensus in faulty systems with crash failures


Various Failures in Distributed Systems

1. Method failure:
In this type of failure, the distributed system is generally halted and unable to
continue execution. Sometimes the execution ends with an incorrect outcome. A
method failure causes the system state to deviate from its specification, and the
method may also fail to make progress.

• Behavior –
The method stops its execution when an incorrect computation occurs, such as a
protection violation, deadlock, timeout, or invalid user input.

• Recovery –
A method failure can be handled by aborting the method or restarting it from
its prior state.

2. System failure:
In a system failure, the processor associated with the distributed system fails
to perform the execution. This is caused by software errors and hardware issues,
such as CPU, memory, or bus failures. It is assumed that whenever the system
stops its execution due to such a fault, the internal state is lost.

• Behavior –
It concerns the physical and logical units of the processor. The system may
freeze or reboot, and it may stop functioning altogether, going into an idle
state.

• Recovery –
This can be cured by rebooting the system as soon as possible and identifying
the failure point and the erroneous state.

3. Secondary storage device failure:
A storage device failure is said to have occurred when the stored information
cannot be accessed. This failure is usually caused by a parity error, a head
crash, or dust particles settled on the medium.

• Behavior –
Stored information cannot be accessed.

• Errors causing failure –
Parity error, head crash, etc.

• Recovery/Design strategies –
Reconstruct the content from an archive together with a log of activities, and
design a mirrored disk system. A storage failure can additionally be classified
as follows:

• An amnesia failure
• A partial amnesia failure
• A pause failure
• A halting failure

4. Communication medium failure:
A communication medium failure happens when one site cannot communicate with
another operational site in the network. It is typically caused by the failure
of switching nodes and/or the links of the communication system.

• Behavior –
A site cannot communicate with another operational site.

• Errors/Faults –
Failure of switching nodes or communication links.

• Recovery/Design strategies –
Rerouting; error-resistant communication protocols.

Report
Form groups of 5 students and improve the following slides.
Fault tolerance Process resilience

Raft
Developed for understandability
• Uses a fairly straightforward leader-election algorithm (see Chp. 5). The
current leader operates during the current term.
• Every server (typically, five) keeps a log of operations, some of which
have been committed. A backup will not vote for a new leader if its own
log is more up to date.
• All committed operations have the same position in the log of each
respective server.
• The leader decides which pending operation is to be committed next ⇒ a
primary-backup approach.

Consensus in faulty systems with crash failures


Fault tolerance Process resilience

Raft
When submitting an operation
• A client submits a request for operation o.
• The leader appends the request ⟨o, t, ℓ⟩ to its own log (registering the
current term t and the current log length ℓ).
• The log is (conceptually) broadcast to the other servers.
• The others (conceptually) copy the log and acknowledge the receipt.
• When a majority of acks arrives, the leader commits o.


Note
In practice, only updates are broadcast. In the end, every server has the same
view and knows which operations have been committed. Note that, effectively, any
conflicting information at the backups is overwritten.
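
A sketch of the leader-side commit rule implied above; the replication and
transport machinery is stubbed out, and the class is hypothetical:

class RaftLeaderSketch:
    def __init__(self, term, n_servers=5):
        self.term = term
        self.n = n_servers
        self.log = []                        # entries of the form (op, term)

    def submit(self, op, backups):
        self.log.append((op, self.term))     # conceptually: append <o, t, l>
        acks = 1                             # the leader counts itself
        for b in backups:
            if b.replicate(self.log):        # backup copies the log and acks
                acks += 1
        if acks > self.n // 2:               # a majority of acks has arrived
            return "committed"
        return "pending"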

Consensus in faulty systems with crash failures


Fault tolerance Process resilience

Raft: when a leader crashes

Crucial observations
• The new leader has the most committed operations in its log.
• Any missing commits will eventually be sent to the other backups.
Consensus in faulty systems with crash failures
Fault tolerance Process resilience

Realistic consensus: Paxos


Assumptions (rather weak ones, and realistic)
• A partially synchronous system (in fact, it may even be asynchronous).
• Communication between processes may be unreliable: messages may
be lost, duplicated, or reordered.
• Corrupted message can be detected (and thus subsequently ignored).
• All operations are deterministic: once an execution is started, it is known
exactly what it will do.
• Processes may exhibit crash failures, but not arbitrary failures.
• Processes do not collude.

Understanding Paxos
We will build up Paxos from scratch to understand where many consensus
algorithms actually come from.

Example: Paxos
Fault tolerance Process resilience

Paxos essentials
Starting point
• We assume a client-server configuration, with initially one primary server.
• To make the server more robust, we start with adding a backup server.
• To ensure that all commands are executed in the same order at both
servers, the primary assigns unique sequence numbers to all commands.
In Paxos, the primary is called the leader.
• Assume that actual commands can always be restored (either from
clients or servers) ⇒ we consider only control messages.

Example: Paxos
Fault tolerance Process resilience

Two-server situation

Example: Paxos
Fault tolerance Process resilience

Handling lost messages


Some Paxos terminology
• The leader sends an accept message ACCEPT(o, t ) to backups when
assigning a timestamp t to command o.
• A backup responds by sending a learn message: LEARN(o, t )

• When the leader notices that operation o has not yet been learned, it
retransmits ACCEPT(o, t ) with the original timestamp.
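
A sketch of this retransmission rule; send_accept and the learned set are
placeholders for the real message layer, and a real leader would wait between
retries:

def leader_submit(o, t, send_accept, learned, retries=5):
    for _ in range(retries):
        send_accept(("ACCEPT", o, t))    # same timestamp t on every retry
        if (o, t) in learned:            # LEARN(o, t) received from the backup
            return True
    return False                         # start suspecting the backup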

Example: Paxos
Fault tolerance Process resilience

Two servers and one crash: problem

Problem
Primary crashes after executing an operation, but the backup never received
the accept message.

Example: Paxos
Fault tolerance Process resilience

Two servers and one crash: solution

Solution
Never execute an operation before it is clear that it has been learned.

Example: Paxos
Fault tolerance Process resilience

Three servers and two crashes: still a problem?

Scenario
What happens when LEARN(o1) as sent by S2 to S1 is lost?


Solution
S2 will also have to wait until it knows that S3 has learned o1.

Example: Paxos
Fault tolerance Process resilience

Paxos: fundamental rule


General rule
In Paxos, a server S cannot execute an operation o until it has received a
LEARN (o) from all other nonfaulty servers.

Example: Paxos
Fault tolerance Process resilience

Failure detection
Practice
Reliable failure detection is practically impossible. A solution is to set timeouts,
but take into account that a detected failure may be false.


Example: Paxos
Fault tolerance Process resilience

Required number of servers


Observation
Paxos needs at least three servers


Adapted fundamental rule


In Paxos with three servers, a server S cannot execute an operation o until it
has received at least one (other) LEARN (o) message, so that it knows that a
majority of servers will execute o.

Example: Paxos
Fault tolerance Process resilience

Required number of servers


Assumptions before taking the next steps
• Initially, S1 is the leader.
• A server can reliably detect it has missed a message, and recover from
that miss.
• When a new leader needs to be elected, the remaining servers follow a
strictly deterministic algorithm, such as S1 → S2 → S3 .
• A client cannot be asked to help the servers to resolve a situation.


Observation
If either one of the backups (S2 or S3 ) crashes, Paxos will behave correctly:
operations at nonfaulty servers are executed in the same order.

Example: Paxos
Fault tolerance Process resilience

But what about progress?

Example: Paxos
Fault tolerance Process resilience

Consensus under arbitrary failure semantics


Essence
We consider process groups in which communication between processes is
inconsistent.

Two cases: improper forwarding of messages, and sending different messages to different processes.

Consensus in faulty systems with arbitrary failures


Fault tolerance Process resilience

Consensus under arbitrary failure semantics


System model
• We consider a primary P and n − 1 backups B1, ..., Bn−1.
• A client sends v ∈ {T, F} to P
• Messages may be lost, but this can be detected.
• Messages cannot be corrupted beyond detection.
• A receiver of a message can reliably detect its sender.

Byzantine agreement: requirements


BA1: Every nonfaulty backup process stores the same value.
BA2: If the primary is nonfaulty then every nonfaulty backup process stores
exactly what the primary had sent.

Observation
• Primary faulty ⇒ BA1 says that backups may store the same, but different
(and thus wrong) value than originally sent by the client.
• Primary not faulty ⇒ satisfying BA2 implies that BA1 is satisfied.

Consensus in faulty systems with arbitrary failures


Fault tolerance Process resilience

Why having 3k processes is not enough

Consensus in faulty systems with arbitrary failures


Fault tolerance Process resilience

Why having 3k + 1 processes is enough

Consensus in faulty systems with arbitrary failures


Fault tolerance Process resilience

Practical Byzantine Fault Tolerance (PBFT)


Background
One of the first solutions that managed to provide Byzantine fault tolerance
while keeping performance acceptable. Its popularity has increased with the
introduction of permissioned blockchains.

Assumptions
• A server may exhibit arbitrary failures
• Messages may be lost, delayed, and received out of order
• Messages have an identifiable sender (i.e., they are signed)
• Partially synchronous execution model

Essence
A primary-backup approach with 3k + 1 replica servers.

Consensus in faulty systems with arbitrary failures


Fault tolerance Process resilience

PBFT: four phases

• C is the client
• P is the primary
• B1, B2 , B3 are backups
• Assume B2 is faulty

Consensus in faulty systems with arbitrary failures


Fault tolerance Process resilience

PBFT: four phases

• All servers are assumed to be working in the current view v.


• C requests operation o to be executed
• P timestamps o and sends PRE-PREPARE(t, v, o)
• Backup Bi accepts the pre-prepare message if it is also in view v and has
not accepted an operation with timestamp t before.

Consensus in faulty systems with arbitrary failures


Fault tolerance Process resilience

PBFT: four phases

• Bi broadcasts PREPARE(t, v, o) to all (including the primary)


• Note: a nonfaulty server will eventually log 2k messages PREPARE(t, v, o)
(including its own) ⇒ consensus on the ordering of o.
• Note: it doesn’t matter what faulty B2 sends, it cannot affect joint
decisions by P, B1, B3 .

Consensus in faulty systems with arbitrary failures


Fault tolerance Process resilience

PBFT: four phases

• All servers broadcast COMMIT(t, v, o)


• The commit is needed to also make sure that o can be executed now,
that is, in the current view v .
• When 2k such messages have been collected, excluding its own, the server
can safely execute o and reply to the client.
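
A sketch of the quorum test a replica could apply in the prepare and commit
phases; the message representation is invented for the example:

def quorum_reached(phase_msgs, t, v, o, k):
    # phase_msgs: PREPARE or COMMIT messages received from the other replicas
    matching = [m for m in phase_msgs if m == (t, v, o)]
    return len(matching) >= 2 * k        # 2k matching messages suffice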

Consensus in faulty systems with arbitrary failures


Fault tolerance Process resilience

PBFT: when the primary fails


Issue
When a backup detects the primary failed, it will broadcast a view change to
view v + 1. We need to ensure that any outstanding request is executed once
and only once by all nonfaulty servers. The operation needs to be handed over
to the new view.

Procedure
• The next primary P∗ is known deterministically
• A backup server broadcasts VIEW-CHANGE(v + 1, P): P is the set of
prepares it had sent out.
• P∗ waits for 2k + 1 view-change messages, with X = ⋃ P containing all
previously sent prepares.
• P∗ sends out NEW-VIEW(v + 1, X, O) with O a new set of pre-prepare
messages.
• Essence: this allows the nonfaulty backups to replay what has gone on in
the previous view, if necessary, and bring o into the new view v + 1.

Consensus in faulty systems with arbitrary failures


Fault tolerance Process resilience

Realizing fault tolerance


Observation
Considering that the members in a fault-tolerant process group are so tightly
coupled, we may bump into considerable performance problems, but perhaps
even situations in which realizing fault tolerance is impossible.

Question
Are there limitations to what can be readily achieved?
• What is needed to enable reaching consensus?
• What happens when groups are partitioned?

Some limitations on realizing fault tolerance


Fault tolerance Process resilience

Distributed consensus: when can it be reached

Formal requirements for consensus


• Processes produce the same output value
• Every output value must be valid
• Every process must eventually provide output

Some limitations on realizing fault tolerance


Fault tolerance Process resilience

Consistency, availability, and partitioning


CAP theorem
Any networked system providing shared data can provide only two of the
following three properties:
C: consistency, by which a shared and replicated data item appears as a
single, up-to-date copy
A: availability, by which updates will always be eventually executed
P: tolerance to the partitioning of the process group.

Conclusion
In a network subject to communication failures, it is impossible to realize an
atomic read/write shared memory that guarantees a response to every
request.

Some limitations on realizing fault tolerance


Fault tolerance Process resilience

CAP theorem intuition


Simple situation: two interacting processes
• P and Q can no longer communicate:
• Allow P and Q to go ahead ⇒ no consistency
• Allow only one of P, Q to go ahead ⇒ no availability
• P and Q have to be assumed to continue communication ⇒ no
partitioning allowed.


Fundamental question
What are the practical ramifications of the CAP theorem?

Some limitations on realizing fault tolerance


Fault tolerance Process resilience

Failure detection
Issue
How can we reliably detect that a process has actually crashed?

General model
• Each process is equipped with a failure detection module
• A process P probes another process Q for a reaction
• If Q reacts: Q is considered to be alive (by P)
• If Q does not react within t time units: Q is suspected to have crashed

Observation for a synchronous system

a suspected crash ≡ a known crash

Failure detection
Fault tolerance Process resilience

Practical failure detection


Implementation
• If P did not receive a heartbeat from Q within time t: P suspects Q.
• If Q later sends a message (which is received by P):
• P stops suspecting Q
• P increases the timeout value t
• Note: if Q did crash, P will keep suspecting Q.
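
A sketch of this adaptive failure detector; the initial timeout is arbitrary:

import time

class FailureDetector:
    def __init__(self, timeout=2.0):
        self.timeout = timeout
        self.last_heard = time.monotonic()

    def on_message(self):
        # Any message from Q counts: stop suspecting and become more lenient.
        if self.suspects():
            self.timeout *= 2            # increase the timeout value t
        self.last_heard = time.monotonic()

    def suspects(self):
        return time.monotonic() - self.last_heard > self.timeout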

Failure detection
Fault tolerance Reliable client-server communication

Reliable remote procedure calls


What can go wrong?
1. The client is unable to locate the server.
2. The request message from the client to the server is lost.
3. The server crashes after receiving a request.
4. The reply message from the server to the client is lost.
5. The client crashes after sending a request.


Two “easy” solutions


1: (cannot locate server): just report back to client
2: (request was lost): just resend message
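
A sketch of solution 2 as a retry wrapper. Note that blind resending gives
at-least-once semantics, which is only safe for idempotent operations; do_rpc
and its timeout behavior are placeholders:

def reliable_call(do_rpc, request, retries=3):
    for _ in range(retries):
        try:
            return do_rpc(request)       # may raise TimeoutError on loss
        except TimeoutError:
            continue                     # request (or reply) was lost: resend
    raise ConnectionError("cannot locate server: report back to the client")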

RPC semantics in the presence of failures


Fault tolerance Reliable client-server communication

Reliable RPC: client crash


Problem
The server is doing work and holding resources for nothing (this is called an
orphan computation).

Solution
• Orphan is killed (or rolled back) by the client when it recovers
• Client broadcasts new epoch number when recovering ⇒ server kills
client’s orphans
• Require computations to complete within T time units. Old ones are simply
removed.

RPC semantics in the presence of failures


Fault tolerance Reliable group communication

Simple reliable group communication


Intuition
A message sent to a process group G should be delivered to each member of
G. Important: make distinction between receiving and delivering messages.

Introduction
Fault tolerance Reliable group communication

Less simple reliable group communication


Reliable communication in the presence of faulty processes
Group communication is reliable when it can be guaranteed that a message is
received and subsequently delivered by all nonfaulty group members.

Tricky part
Agreement is needed on what the group actually looks like before a received
message can be delivered.

Introduction
Fault tolerance Reliable group communication

Simple reliable group communication


Reliable communication, assuming nonfaulty processes
Reliable group communication now boils down to reliable multicasting: a
message is received and delivered by each recipient, as intended by the sender.

Introduction
Fault tolerance Distributed commit

Distributed commit protocols


Problem
Have an operation being performed by each member of a process group, or
none at all.
• Reliable multicasting: a message is to be delivered to all recipients.
• Distributed transaction: each local transaction must succeed.
Fault tolerance Distributed commit

Two-phase commit protocol (2PC)


Essence
The client who initiated the computation acts as coordinator; processes
required to commit are the participants.
• Phase 1a: Coordinator sends VOTE-REQUEST to participants (also called
a pre-write)
• Phase 1b: When participant receives VOTE-REQUEST it returns either
VOTE-COMMIT or VOTE-ABORT to coordinator. If it sends VOTE-ABORT, it
aborts its local computation
• Phase 2a: Coordinator collects all votes; if all are VOTE-COMMIT, it sends
GLOBAL - COMMIT to all participants, otherwise it sends GLOBAL - ABORT

• Phase 2b: Each participant waits for GLOBAL - COMMIT or GLOBAL - ABORT
and handles accordingly.
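
A coordinator-side sketch of 2PC; the participant interface (vote, finish) is an
assumption of this illustration, and timeouts/failures are omitted:

def two_phase_commit(participants):
    # Phase 1: send VOTE-REQUEST and collect VOTE-COMMIT / VOTE-ABORT votes.
    votes = [p.vote() for p in participants]
    decision = ("GLOBAL-COMMIT"
                if all(v == "VOTE-COMMIT" for v in votes)
                else "GLOBAL-ABORT")
    # Phase 2: broadcast the decision; each participant handles it accordingly.
    for p in participants:
        p.finish(decision)
    return decision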
Fault tolerance Recovery

Recovery: Background
Essence
When a failure occurs, we need to bring the system into an error-free state:
• Forward error recovery: Find a new state from which the system can
continue operation
• Backward error recovery: Bring the system back into a previous error-free
state

Practice
Use backward error recovery, requiring that we establish recovery points

Observation
Recovery in distributed systems is complicated by the fact that processes need
to cooperate in identifying a consistent state from where to recover

Introduction
Fault tolerance Recovery

Consistent recovery state


Requirement
Every message that has been received is also shown to have been sent in the
state of the sender.

Recovery line
Assuming processes regularly checkpoint their state, the most recent
consistent global checkpoint.

Checkpointing
Fault tolerance Recovery

Coordinated checkpointing
Essence
Each process takes a checkpoint after a globally coordinated action.

Simple solution
Use a two-phase blocking protocol:

• A coordinator multicasts a checkpoint request message
• When a participant receives such a message, it takes a checkpoint, stops
sending (application) messages, and reports back that it has taken a
checkpoint
• When all checkpoints have been confirmed at the coordinator, the latter
broadcasts a checkpoint done message to allow all processes to continue
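
A coordinator-side sketch of this two-phase blocking protocol; the multicast and
the participant interface are assumptions of the illustration:

def coordinated_checkpoint(participants):
    # Phase 1: request checkpoints; each participant takes one, stops sending
    # application messages, and reports back.
    for p in participants:
        p.send("CHECKPOINT_REQUEST")
    acks = [p.wait_ack() for p in participants]  # block until all confirm
    # Phase 2: once all checkpoints are confirmed, everyone may continue.
    if all(acks):
        for p in participants:
            p.send("CHECKPOINT_DONE")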

Observation
It is possible to consider only those processes that depend on the recovery of
the coordinator, and ignore the rest

Checkpointing
Fault tolerance Recovery

Cascaded rollback
Observation
If checkpointing is done at the “wrong” instants, the recovery line may lie at
system startup time. We have a so-called cascaded rollback.

Checkpointing
Fault tolerance Recovery

Message logging
Alternative
Instead of taking an (expensive) checkpoint, try to replay your (communication)
behavior from the most recent checkpoint ⇒ store messages in a log.

Assumption
We assume a piecewise deterministic execution model:
• The execution of each process can be considered as a sequence of state
intervals
• Each state interval starts with a nondeterministic event (e.g., message
receipt)
• Execution in a state interval is deterministic

Conclusion
If we record nondeterministic events (to replay them later), we obtain a
deterministic execution model that will allow us to do a complete replay.

Message logging
Fault tolerance Recovery

Message logging and consistency


When should we actually log messages?
Avoid orphan processes:
• Process Q has just received and delivered messages m1 and m2
• Assume that m2 is never logged.
• After delivering m1 and m2 , Q sends message m3 to process R
• Process R receives and subsequently delivers m3 : it is an orphan.

Message logging
