
B.Sc. (H) Computer Science, Semester VI, Cloud Computing (as per guidelines)

Jan – May 2025

1.1, 1.2, 1.2.1, 1.2.2, 1.2.4, 1.2.5: Done in class

1.4 – Computing platforms and technologies

1.4.1 – Amazon Web Services (AWS)

• AWS offers comprehensive cloud IaaS services ranging from virtual compute,
storage, and networking to complete computing stacks.
• AWS is mostly known for its compute and storage-on-demand services, namely
Elastic Compute Cloud (EC2) and Simple Storage Service (S3).
• EC2 provides users with
o customizable virtual hardware that can be used as the base infrastructure for
deploying computing systems
o a large variety of virtual hardware configurations, including GPU and cluster
instances
o EC2 instances are deployed either by using the AWS console, which is a
comprehensive Web portal for accessing AWS services, or by using the Web
services API available for several programming languages (a minimal sketch
follows this list).
o EC2 also provides the capability to save a specific running instance as an
image, thus allowing users to create their own templates for deploying
systems.
• S3:
o EC2 instance templates are stored in S3, which delivers persistent storage on
demand.
o S3 is organized into buckets; these are containers of objects that are stored in
binary form and can be enriched with attributes.
o Users can store objects of any size, from simple files to entire disk images,
and have them accessible from everywhere.
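Both EC2 and S3 can be driven programmatically through the Web services API. The snippet below is a minimal sketch using the boto3 Python SDK; the AMI ID, bucket name, and object key are placeholders, and valid AWS credentials are assumed to be configured in the environment.

import boto3  # AWS SDK for Python

# Launch a single EC2 instance from a (placeholder) machine image.
ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # hypothetical AMI ID
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
)
instance_id = response["Instances"][0]["InstanceId"]
print("Launched EC2 instance:", instance_id)

# Store an object in an S3 bucket (the bucket is assumed to exist already).
s3 = boto3.client("s3")
s3.put_object(Bucket="my-example-bucket", Key="notes/hello.txt",
              Body=b"Hello from S3")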

1.4.2 – Google AppEngine


• Google AppEngine is a scalable runtime environment mostly devoted to executing
Web applications.
• AppEngine provides both a secure execution environment and a collection of services
that simplify the development of scalable and high-performance Web applications.
• These services include
o in-memory caching: if a frequently run query returns a set of results that does
not need to appear in your app immediately, you can cache the results (see
the sketch after this list)
o scalable datastore: designed to automatically scale to very large data sets,
allowing applications to maintain high performance as they receive more
traffic
o job queues, messaging
o cron tasks: the App Engine Cron Service allows you to configure regularly
scheduled tasks that operate at defined times or regular intervals. These tasks
are commonly known as cron jobs and are automatically triggered by the App
Engine Cron Service. For instance, you might use this to send out a report
email on a daily basis.
• Developers can build and test applications on their own machines using the
AppEngine software development kit (SDK), which replicates the production runtime
environment and helps test and profile applications.
• Once development is complete, developers can easily migrate their application to
AppEngine, set quotas to contain the costs generated, and make the application
available to the world.
• The languages currently supported are Python, Java, and Go.
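As an illustration of the in-memory caching service, the sketch below uses the legacy Python App Engine memcache API to cache the result of an expensive query; it assumes it runs inside the App Engine runtime, and run_expensive_query() is a hypothetical stand-in for real datastore work.

from google.appengine.api import memcache  # App Engine in-memory cache service

def run_expensive_query():
    # Placeholder for a real datastore query.
    return {"total_orders": 128}

def get_report_data():
    # Try the cache first; fall back to the (hypothetical) expensive query.
    data = memcache.get("report_data")
    if data is None:
        data = run_expensive_query()
        memcache.add("report_data", data, time=600)  # cache for 10 minutes
    return data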

1.4.3 – Microsoft Azure


• Microsoft Azure is a cloud operating system and a platform for developing
applications in the cloud.
• Applications in Azure are organized around the concept of roles, which identify a
distribution unit for applications and embody the application's logic.
• Currently, there are three types of role: Web role, worker role, and virtual machine
role.
• The Web role is designed to host a Web application.
• The worker role is a more generic container of applications and can be used to
perform workload processing.
• The virtual machine role provides a virtual environment in which the computing stack
can be fully customized, including the operating system.
• Besides roles, Azure provides a set of additional services that complement
application execution, such as support for storage (relational data and blobs),
networking, caching, content delivery, and others.

1.4.4 – Hadoop

• Apache Hadoop is an open-source framework that is suited for processing large data
sets on commodity hardware.
• Hadoop is an implementation of MapReduce, an application programming model
developed by Google, which provides two fundamental operations for data
processing: map and reduce.
• MAP: transforms and synthesizes the input data provided by the user.
• REDUCE: aggregates the output obtained by the map operations (a word-count
sketch in this style follows the list).
• Hadoop provides the runtime environment, and developers need only provide the
input data and specify the map and reduce functions that need to be executed.
• Hadoop is an integral part of the Yahoo! cloud infrastructure and supports several
business processes of the company.
• Currently, Yahoo! manages the largest Hadoop cluster in the world, which is also
available to academic institutions.
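To make the map and reduce operations concrete, here is a minimal, self-contained word-count sketch in plain Python that mimics the MapReduce flow (map, shuffle/group by key, reduce); it does not use the Hadoop runtime itself.

from collections import defaultdict

def map_phase(line):
    # MAP: emit a (word, 1) pair for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(word, counts):
    # REDUCE: aggregate all the counts emitted for the same word.
    return word, sum(counts)

lines = ["the cloud runs on commodity hardware",
         "the cloud scales on demand"]

# Shuffle step: group intermediate pairs by key (the word).
grouped = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        grouped[word].append(count)

word_counts = dict(reduce_phase(w, c) for w, c in grouped.items())
print(word_counts)  # e.g. {'the': 2, 'cloud': 2, 'on': 2, ...}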

1.4.5 – Force.com and Salesforce.com


• Force.com is a cloud computing platform for developing social enterprise
applications.
• The platform is the basis for Salesforce.com, a Software-as-a-Service solution for
customer relationship management.
• Force.com allows developers to create applications by composing ready-to-use
blocks; a complete set of components supporting all the activities of an enterprise
is available.
• It is also possible to develop your own components or integrate those available in
AppExchange into your applications.
• The platform provides complete support for developing applications, from the design
of the data layout to the definition of business rules and workflows and the definition
of the user interface.
• The Force.com platform is completely hosted on the cloud and provides complete
access to its functionalities, and those implemented in the hosted applications,
through Web services technologies.

Chapter 2: Principles of Parallel and Distributed Computing

2.2 – Parallel vs Distributed Computing

• These terms are often used interchangeably.


• The term parallel
o implies a tightly coupled system
o a model in which the computation is divided among several processors sharing
the same memory
o The architecture is often characterized by the homogeneity of components:
each processor is of the same type and has the same capability as the
others.
o The shared memory has a single address space, which is accessible to all the
processors.
o Parallel programs are then broken down into several units of execution that
can be allocated to different processors and can communicate with each
other by means of the shared memory.
o Parallel systems now include all architectures that are based on the concept
of shared memory, whether this is physically present or created with the
support of libraries, specific hardware, and a highly efficient networking
infrastructure.
o For example, a cluster whose nodes are connected through an InfiniBand
network and configured with a distributed shared memory system can be
considered a parallel system.
• The term distributed
o refers to a wider class of systems, including those that are tightly coupled
o encompasses any architecture or system that allows the computation to be
broken down into units executed concurrently on different computing
elements, whether these are processors on different nodes, processors on
the same computer, or cores within the same processor
o The term distributed often implies that the locations of the computing
elements are not the same and that such elements might be heterogeneous in
terms of hardware and software features.
o Classic examples of distributed computing systems are computing grids and
Internet computing systems.

2.3 – Elements of parallel computing

• The driving idea is to connect multiple processors working in coordination with each
other to solve "Grand Challenge" problems.
• The first steps in this direction led to the development of parallel computing, which
encompasses techniques, architectures, and systems for performing multiple
activities in parallel.
• Parallelism was introduced within a single computer by coordinating the activity of
multiple processors together.

2.3.1 – What is parallel processing?

• Processing of multiple tasks simultaneously on multiple processors is called parallel
processing.
• A given task is divided into multiple subtasks using a divide-and-conquer technique,
and each subtask is processed on a different central processing unit (CPU); a minimal
sketch follows this list.
• Programming on a multiprocessor system using the divide-and-conquer technique is
called parallel programming.
• Parallel processing provides a cost-effective solution by increasing the number of
CPUs in a computer and by adding an efficient communication system
between them.
• The development of parallel processing is being influenced by many factors, the most
prominent among them being the following:
o Computational requirements are ever increasing in the areas of both scientific
and business computing.
o Sequential architectures are reaching physical limitations as they approach
the saturation point (no more vertical growth), and hence an alternative way to
get high computational speed is to connect multiple CPUs (an opportunity for
horizontal growth).
o Hardware improvements in pipelining, superscalar execution, and the like are
non-scalable and require sophisticated compiler technology. Developing such
compiler technology is a difficult task.
o Vector processing works well for certain kinds of problems. It is suitable
mostly for scientific problems (involving lots of matrix operations) and
graphics processing. It is not suitable for other areas such as database
processing.
o Significant developments in networking technology are paving the way for
heterogeneous computing.
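A minimal sketch of the divide-and-conquer idea using Python's standard multiprocessing module: a large summation is split into chunks, each chunk is processed by a separate worker process (ideally on a separate CPU), and the partial results are combined.

from multiprocessing import Pool

def partial_sum(chunk):
    # Subtask: each worker sums its own slice of the data.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    chunk_size = len(data) // n_workers
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with Pool(processes=n_workers) as pool:
        partials = pool.map(partial_sum, chunks)   # subtasks run in parallel

    total = sum(partials)                          # combine the partial results
    print(total)  # same result as sum(data), computed in parallel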

2.4 – Elements of distributed computing

2.4.1 General concepts and definitions


• As a general definition of the term distributed system, we use the one proposed by
Tanenbaum et al. [1]:

"A distributed system is a collection of independent computers that appears to its
users as a single coherent system."

• This definition includes various types of distributed computing systems that are
especially focused on unified usage and aggregation of distributed resources.
• Communication is another fundamental aspect of distributed computing.
• Since distributed systems are composed of more than one computer that collaborate,
it is necessary to provide some sort of data and information exchange
between them, which generally occurs through the network (Coulouris et al. [2]):

"A distributed system is one in which components located at networked computers
communicate and coordinate their actions only by passing messages."

• The components of a distributed system communicate through some sort of message
passing. This is a term that encompasses several communication models (a minimal
sketch follows).
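A minimal sketch of message passing between two independent processes, using a pipe from Python's standard multiprocessing module as the connector; a real distributed system would pass messages over a network rather than a local pipe.

from multiprocessing import Process, Pipe

def worker(conn):
    # The worker coordinates with the parent only by exchanging messages.
    request = conn.recv()                 # wait for a request message
    conn.send({"reply_to": request, "status": "done"})
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    p = Process(target=worker, args=(child_conn,))
    p.start()
    parent_conn.send("compute-task-1")    # send a message to the other process
    print(parent_conn.recv())             # receive the reply message
    p.join()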

2.4.2 – Components of a distributed system

• A distributed system emerges from the collaboration of several elements that, by
working together, give users the illusion of a single coherent system.

Figure 2.10 provides an overview of the different layers that are involved in providing the
services of a distributed system.

• Bottom layer: computer and network hardware constitute the physical infrastructure;
these components are directly managed by the operating system, which provides the
basic services for interprocess communication (IPC), process scheduling and
management, and resource management in terms of the file system and local devices.
• Taken together, these two layers become the platform on top of which specialized
software is deployed to turn a set of networked computers into a distributed system.
• The use of well-known standards at the operating system level, and even more at the
hardware and network levels, allows easy harnessing of heterogeneous components.
• For example, network connectivity between different devices is controlled by
standards, which allow them to interact seamlessly. At the operating system level,
IPC services are implemented on top of standardized communication protocols such
as Transmission Control Protocol/Internet Protocol (TCP/IP), User Datagram Protocol
(UDP), or others.
• The middleware layer leverages such services to build a uniform environment for the
development and deployment of distributed applications. This layer supports the
programming paradigms for distributed systems.
• The middleware develops its own protocols, data formats, and programming
languages or frameworks for the development of distributed applications.
• This layer is completely independent from the underlying operating system and hides
all the heterogeneities of the bottom layers.
• The top of the distributed system stack is represented by the applications and
services designed and developed to use the middleware.
• These can serve several purposes and often expose their features in the form of
graphical user interfaces (GUIs) accessible locally or through the Internet via a Web
browser.
• For example, in the case of a cloud computing system, the use of Web technologies is
strongly preferred, not only to interface distributed applications with the end user
but also to provide platform services aimed at building distributed systems. An
example is Amazon Web Services (AWS), which provides facilities for creating virtual
machines, organizing them together into a cluster, and deploying applications and
systems on top of them.
• Figure 2.11 shows an example of how the general reference architecture of a
distributed system is contextualized in the case of a cloud computing system.
• The core logic is then implemented in the middleware that manages the
virtualization layer, which is deployed on the physical infrastructure in order to
maximize its utilization and provide a customizable runtime environment for
applications.
2.4.3 – Architectural styles for distributed computing

Architectural styles based on independent components

This class of architectural styles models systems in terms of independent components that
have their own life cycles and interact with each other to perform their activities. There
are two major categories within this class, communicating processes and event systems,
which differ in the way the interaction among components is managed.

Communicating Processes. In this architectural style, components are represented by
independent processes that leverage IPC facilities for coordination management. This is an
abstraction that is quite suitable for modeling distributed systems that, being distributed over
a network of computing nodes, are necessarily composed of several concurrent processes.
Connectors are identified by the IPC facilities used by these processes to communicate.

Event Systems. In this architectural style, the components of the system are loosely coupled
and connected. In addition to exposing operations for data and state manipulation, each
component also publishes (or announces) a collection of events with which other
components can register. In general, other components provide a callback that will be
executed when the event is activated.

During the activity of a component, a specific runtime condition can activate one of the
exposed events, thus triggering the execution of the callbacks registered with it.
The main advantage of such an architectural style is that it fosters the development of open
systems: new modules can be added and easily integrated into the system as long as they
have compliant interfaces for registering to the events (a minimal sketch follows).

This architectural style solves some of the limitations observed for the top-down and object-
oriented styles. First, the invocation pattern is implicit, and the connection between the
caller and the callee is not hard-coded; this gives a lot of flexibility, since the addition
or removal of a handler for events can be done without changes in the source code of
applications.
Second, the event source does not need to know the identity of the event handler in order
to invoke the callback.
The disadvantage of such a style is that it relinquishes control over system computation.
When a component triggers an event, it does not know how many event handlers will be
invoked and whether there are any registered handlers.
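A minimal sketch of the event-system style in Python: components register callbacks for named events, and the event source triggers them without knowing who, or how many, will respond.

class EventBus:
    """Very small publish/subscribe helper used only to illustrate the style."""

    def __init__(self):
        self._handlers = {}          # event name -> list of registered callbacks

    def register(self, event, callback):
        # Components register a callback for the events they care about.
        self._handlers.setdefault(event, []).append(callback)

    def trigger(self, event, payload):
        # The event source does not know the identity (or number) of handlers.
        for callback in self._handlers.get(event, []):
            callback(payload)

bus = EventBus()
bus.register("order_placed", lambda data: print("billing component:", data))
bus.register("order_placed", lambda data: print("shipping component:", data))
bus.trigger("order_placed", {"order_id": 42})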

2.4.3.3 System architectural styles

System architectural styles cover the physical organization of components and processes
over a distributed infrastructure. They provide a set of reference models for the deployment
of such systems and help engineers not only have a common vocabulary for describing the
physical layout of systems but also quickly identify the major advantages and drawbacks of a
given deployment and whether it is applicable for a specific class of applications.
There are two fundamental reference styles: client/server and peer-to-peer.

Client/server

This architecture is very popular in distributed computing.

As depicted in Figure 2.12, the client/server model features two major components: a
server and a client. These two components interact with each other through a network
connection using a given protocol. The communication is unidirectional: the client issues a
request to the server, and after processing the request the server returns a response. There
could be multiple client components issuing requests to a server that is passively waiting for
them. Hence, the important operations in the client/server paradigm are request and accept
(client side), and listen and response (server side). A minimal request/response sketch
follows.
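The request/response interaction can be illustrated with Python's standard socket module; this minimal sketch runs the server in a background thread of the same program purely for demonstration, whereas a real deployment would place server and client on different networked machines.

import socket
import threading
import time

HOST, PORT = "127.0.0.1", 5050      # loopback address, used only for this demo

def server():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((HOST, PORT))
        srv.listen()                         # server passively waits (listen)
        conn, _ = srv.accept()               # accept an incoming connection
        with conn:
            request = conn.recv(1024)        # read the client's request
            conn.sendall(b"RESPONSE to " + request)   # return a response

threading.Thread(target=server, daemon=True).start()
time.sleep(0.5)                              # give the server time to start listening

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
    cli.connect((HOST, PORT))                # client initiates the interaction
    cli.sendall(b"REQUEST: get catalog")     # issue a request
    print(cli.recv(1024).decode())           # read the server's response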

The client/server model is suitable for many-to-one scenarios, where the information and
the services of interest can be centralized and accessed through a single access point: the
server. In general, multiple clients are interested in such services, and the server must be
appropriately designed to efficiently serve requests coming from different clients. This
consideration has implications for both client design and server design.

For client design, we identify two major models:

• Thin-client model. In this model, the load of data processing and transformation is put on
the server side, and the client has a light implementation that is mostly concerned with
retrieving and returning the data it is being asked for, with no considerable further
processing.
• Fat-client model. In this model, the client component is also responsible for processing
and transforming the data before returning it to the user, whereas the server features a
relatively light implementation that is mostly concerned with the management of access to
the data.

The three major components in the client/server model are presentation, application logic, and
data storage.
In the thin-client model, the client embodies only the presentation component, while the
server absorbs the other two. In the fat-client model, the client encapsulates presentation
and most of the application logic, and the server is principally responsible for data
storage and maintenance.

Presentation, application logic, and data maintenance can be seen as conceptual layers,
which are more appropriately called tiers. The mapping between the conceptual layers and
their physical implementation in modules and components allows differentiating among
several types of architectures, which go under the name of multitiered architectures.

Two major classes exist:

• Two-tier architecture. This architecture partitions the system into two tiers, one
located in the client component and the other on the server. The client is responsible for
the presentation tier by providing a user interface; the server concentrates the application
logic and the data store into a single tier.

The server component is generally deployed on a powerful machine that is capable of
processing user requests, accessing data, and executing the application logic to provide a
client with a response.

This architecture is suitable for systems of limited size and suffers from scalability issues. In
particular, as the number of users increases, the performance of the server might
dramatically decrease.

Another limitation is caused by the dimension of the data to maintain, manage, and access,
which might be prohibitive for a single computation node or too large for serving the clients
with satisfactory performance.

• Three-tier architecture / N-tier architecture. The three-tier architecture separates the
presentation of data, the application logic, and the data storage into three tiers. This
architecture is generalized into an N-tier model when it is necessary to further divide the
stages composing the application logic and storage tiers. This model is generally more
scalable than the two-tier one because it is possible to distribute the tiers over several
computing nodes, thus isolating the performance bottlenecks. At the same time, these
systems are also more complex to understand and manage.

A classic example of three-tier architecture is constituted by a medium-size Web application
that relies on a relational database management system for storing its data:
• the client component is represented by a Web browser that embodies the presentation tier
• an application server encapsulates the business-logic tier
• a database server machine (possibly replicated for high availability) maintains the data
storage (a minimal sketch of the three tiers follows)
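A minimal sketch of the three conceptual tiers in a single Python script, kept deliberately tiny: the data tier is an in-memory SQLite database, the application-logic tier is a plain function, and the presentation tier just renders a string; in a real deployment each tier would run on a separate node (browser, application server, database server), and the table and discount rule are invented for illustration.

import sqlite3

# Data-storage tier: an in-memory relational database (stands in for a DB server).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (name TEXT, price REAL)")
db.executemany("INSERT INTO products VALUES (?, ?)",
               [("notebook", 3.5), ("pen", 1.2)])

# Application-logic tier: a business rule, e.g. a 10% discount on items over $2.
def priced_catalog():
    rows = db.execute("SELECT name, price FROM products").fetchall()
    return [(name, price * 0.9 if price > 2 else price) for name, price in rows]

# Presentation tier: render the result for the user (a browser in a real system).
def render(catalog):
    return "\n".join(f"{name}: ${price:.2f}" for name, price in catalog)

print(render(priced_catalog()))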

Nowadays, the client/server model is an important building block of more complex systems,
which implement some of their features by identifying a server and a client process
interacting through the network. This model is generally suitable in the case of a many-to-
one scenario, where the interaction is unidirectional and started by the clients; it suffers
from scalability issues, and therefore it is not appropriate for very large systems.

Peer-to-peer

The peer-to-peer model, depicted in Figure 2.13, introduces a symmetric architecture in
which all the components, called peers, play the same role and incorporate both the client and
server capabilities of the client/server model.

More precisely, each peer acts as a server when it processes requests from other peers and
as a client when it issues requests to other peers. With respect to the client/server model,
which partitions the responsibilities of the IPC between server and clients, the peer-to-peer
model attributes the same responsibilities to each component.

Therefore, this model is quite suitable for highly decentralized architectures, which can scale
better along the dimension of the number of peers.

The disadvantage of this approach is that the management of the implementation of
algorithms is more complex than in the client/server model.

The most relevant examples of peer-to-peer systems are file-sharing
applications such as Gnutella, BitTorrent, and Kazaa. Despite the differences among these
networks in coordinating nodes and sharing information about the files and their locations, all
of them provide a user client that is at the same time a server providing files to other peers
and a client downloading files from other peers.

To address an incredibly large number of peers, different architectures have been designed
that diverge slightly from the peer-to-peer model. For example, in Kazaa not all the peers have
the same role, and some of them are used to group the accessibility information of a group
of peers. Another interesting example of peer-to-peer architecture is represented by the
Skype network.

The client/server architecture, which originally included only two types of components, has
been further extended and enriched by the development of multitier architectures as the
complexity of systems increased. Currently, this model is still the predominant reference
architecture for distributed systems and applications. The server and client abstraction can
be used in some cases to model the macro scale or the micro scale of the systems. For
peer-to-peer systems, pure implementations are very hard to find.
Chapter 3: Virtualization

• Virtualization allows the creation of a secure, customizable, and isolated execution
environment for running applications.
• The basis of this technology is the ability of a computer program, or a combination
of software and hardware, to emulate an executing environment separate from the
one that hosts such programs.
• Virtualization provides a great opportunity to build elastically scalable systems that
can provision additional capability with minimum costs.

3.1 Introduction

• Virtualization is a large umbrella of technologies and concepts that are meant to
provide an abstract environment, whether virtual hardware or an operating
system, to run applications.
• The term virtualization is often synonymous with hardware virtualization, which plays
a fundamental role in efficiently delivering Infrastructure-as-a-Service (IaaS) solutions
for cloud computing.
• Virtualization technologies provide a virtual environment not only for executing
applications but also for storage, memory, and networking.
• Virtualization technologies have gained renewed interest recently due to the
confluence of several phenomena:
o Increased performance and computing capacity. Nowadays, almost all PCs
have enough resources to host a virtual machine manager and execute a
virtual machine with acceptable performance.
o Underutilized hardware and software resources. Computers today are so
powerful that in most cases only a fraction of their capacity is used by an
application or the system. Using these resources for other purposes after
hours could improve the efficiency of the IT infrastructure. To transparently
provide such a service, it would be necessary to deploy a completely separate
environment, which can be achieved through virtualization.
o Lack of space. The continuous need for additional capacity, whether storage
or compute power, makes data centers grow quickly. Companies such as
Google and Microsoft expand their infrastructures, but in most cases
enterprises cannot afford to build another data center to accommodate
additional resource capacity. This condition, along with hardware
underutilization, has led to the diffusion of a technique called server
consolidation, for which virtualization technologies are fundamental. Server
consolidation is a technique for aggregating multiple services and applications
originally deployed on different servers onto one physical server. Server
consolidation allows us to reduce the power consumption of a data center
and resolve hardware underutilization.
o Greening initiatives. Companies are increasingly looking for ways to
reduce the amount of energy they consume and to reduce their carbon
footprint. Data centers are among the major power consumers and have a
significant impact on an organization's carbon footprint. Hence, reducing the
number of servers through server consolidation will definitely reduce the
impact of cooling and power consumption of a data center. Virtualization
technologies provide an efficient way of consolidating servers.
o Rise of administrative costs. Power consumption and cooling costs lead to a
significant increase in administrative costs, which also include the cost of
hardware monitoring, defective hardware replacement, server setup and
updates, server resource monitoring, and backups. Virtualization can help
reduce the number of servers required for a given workload, thus reducing
the administrative cost.

3.2 Characteristics of virtualized environments

• Virtualization is a broad concept that refers to the creation of a virtual version of
something, whether hardware, a software environment, storage, or a network.
• In a virtualized environment there are three major components: guest, host, and
virtualization layer.
• The guest represents the system component that interacts with the virtualization
layer rather than with the host, as would normally happen.
• The host represents the original environment where the guest is supposed to be
managed.
• The virtualization layer is responsible for recreating the same or a different
environment where the guest will operate.

• Such a general abstraction finds different applications and, in turn, implementations of
the virtualization technology. The most intuitive and popular is
hardware virtualization.
• In the case of hardware virtualization, the guest is represented by a system image
comprising an operating system and installed applications. These are installed on top
of virtual hardware that is controlled and managed by the virtualization layer, also
called the virtual machine manager. The host is instead represented by the physical
hardware, and in some cases the operating system, that defines the environment
where the virtual machine manager is running.
• In the case of virtual storage, the guest might be client applications or users that
interact with the virtual storage management software deployed on top of the real
storage system.
• The case of virtual networking is similar: the guest (applications and users)
interacts with a virtual network, such as a virtual private network (VPN), which is
managed by specific software (a VPN client) using the physical network available on
the node. VPNs are useful for creating the illusion of being within a different physical
network and thus accessing the resources in it, which would otherwise not be
available.
• The main common characteristic of all these different implementations is the fact
that the virtual environment is created by means of a software program. The ability
to use software to emulate such a wide variety of environments creates a lot of
opportunities that were previously less attractive because of the excessive overhead
introduced by the virtualization layer.

3.2.1 Increased security

• The ability to control the execution of a guest in a completely transparent manner
opens new possibilities for delivering a secure, controlled execution environment.
• The virtual machine represents an emulated environment in which the guest is
executed.
• All the operations of the guest are generally performed against the virtual machine,
which then translates and applies them to the host. This level of indirection allows
the virtual machine manager to control and filter the activity of the guest, thus
preventing some harmful operations from being performed.
• Resources exposed by the host can then be hidden or simply protected from the
guest.
• Moreover, sensitive information contained in the host can be naturally hidden
without the need to install complex security policies.
• Increased security is a requirement when dealing with untrusted code. For example,
applets downloaded from the Internet run in a sandboxed version of the Java Virtual
Machine (JVM), which provides them with limited access to the hosting operating
system resources.
• Hardware virtualization solutions such as VMware Desktop, VirtualBox, and Parallels
provide the ability to create a virtual computer with customized virtual hardware on
top of which a new operating system can be installed.
• By default, the file system exposed by the virtual computer is completely separated
from the one of the host machine. This becomes the perfect environment for running
applications without affecting other users in the environment.

The term sandbox identifies an isolated execution environment where instructions can be
filtered and blocked before being translated and executed in the real execution
environment. The expression "sandboxed version of the Java Virtual Machine (JVM)" refers to
a particular configuration of the JVM where, by means of a security policy, instructions that
are considered potentially harmful can be blocked.

3.2.2 Managed execution

• Virtualization of the execution environment not only allows increased security but
also allows a wider range of features to be implemented. In particular, sharing,
aggregation, emulation, and isolation are the most relevant features (see Figure 3.2).

• Sharing. Virtualization allows the creation of separate computing environments
within the same host. Sharing is a particularly important feature in virtualized data
centers, where this basic feature is used to reduce the number of active servers and
limit power consumption.
• Aggregation. A group of separate hosts can be tied together and represented to
guests as a single virtual host. This function is naturally implemented in middleware
for distributed computing, which harnesses the physical resources of a homogeneous
group of machines and represents them as a single resource.
• Emulation. (1) Guest programs are executed within an environment that is controlled
by the virtualization layer, which ultimately is a program. For instance, a completely
different environment with respect to the host can be emulated, thus allowing the
execution of guest programs requiring specific characteristics that are not present in
the physical host. (2) This feature becomes very useful for testing purposes, where a
specific guest has to be validated against different platforms or architectures and the
wide range of options is not easily accessible during development. (3) Again,
hardware virtualization solutions are able to provide virtual hardware and emulate a
particular kind of device, such as Small Computer System Interface (SCSI) devices for
file I/O, without the hosting machine having such hardware installed. (4) Old and
legacy software that does not meet the requirements of current systems can be run
on emulated hardware without any need to change the code.
• Isolation. Virtualization allows providing guests, whether they are operating
systems, applications, or other entities, with a completely separate environment in
which they are executed. The guest program performs its activity by interacting with
an abstraction layer, which provides access to the underlying resources. Isolation
brings several benefits; first, it allows multiple guests to run on the same host
without interfering with each other. Second, it provides a separation between the
host and the guest. The virtual machine can filter the activity of the guest and
prevent harmful operations against the host.
• Another important capability enabled by virtualization is performance tuning. This
feature makes it easier to control the performance of the guest by finely tuning the
properties of the resources exposed through the virtual environment. For instance,
software implementing hardware virtualization solutions can expose to a guest
operating system only a fraction of the memory of the host machine or set the
maximum frequency of the processor of the virtual machine.
• Another advantage of managed execution is that it sometimes allows easy capturing
of the state of the guest program, persisting it, and resuming its execution. This, for
example, allows virtual machine managers such as the Xen hypervisor to stop the
execution of a guest operating system, move its virtual image to another machine,
and resume its execution in a completely transparent manner. This technique is
called virtual machine migration and constitutes an important feature in virtualized
data centers for optimizing their efficiency in serving application demands.

3.2.3 Portability

• The concept of portability applies in different ways according to the specific type of
virtualization considered.
• In the case of a hardware virtualization solution, the guest is packaged into a virtual
image that, in most cases, can be safely moved and executed on top of different
virtual machines. Virtual images are generally proprietary formats that require a
specific virtual machine manager to be executed.
• In the case of programming-level virtualization, as implemented by the JVM or the
.NET runtime, the binary code representing application components (jars or
assemblies) can be run without any recompilation on any implementation of the
corresponding virtual machine. This makes the application development cycle more
flexible and application deployment very straightforward.
• Finally, portability allows having your own system always with you and ready to use
as long as the required virtual machine manager is available. This requirement is, in
general, less stringent than having all the applications and services you need
available to you anywhere you go.

3.3 Taxonomy of virtualization techniques

• Virtualization covers a wide range of emulation techniques that are applied to
different areas of computing. A classification of these techniques helps us better
understand their characteristics and use (see Figure 3.3).
• Virtualization is mainly used to emulate execution environments, storage, and
networks.
• Among these categories, execution virtualization constitutes the oldest, most
popular, and most developed area. In particular, we can divide these execution
virtualization techniques into two major categories by considering the type of host
they require: process-level and system-level.
• Process-level techniques are implemented on top of an existing operating system,
which has full control of the hardware.
• System-level techniques are implemented directly on hardware and do not require, or
require a minimum of support from, an existing operating system.
• Within these two categories we can list various techniques that offer the guest a
different type of virtual computation environment: bare hardware, operating system
resources, low-level programming language, and application libraries.

3.3.1 Execution virtualization

• Execution virtualization includes all techniques that aim to emulate an execution
environment that is separate from the one hosting the virtualization layer. All these
techniques concentrate their interest on providing support for the execution of
programs, whether these are the operating system, a binary specification of a
program compiled against an abstract machine model, or an application.
• Execution virtualization can be implemented directly on top of the hardware by the
operating system, an application, or libraries dynamically or statically linked to an
application image.
3.3.1.1 Machine reference model

• The reference model defines the interfaces between the levels of abstraction, which
hide implementation details.
• From this perspective, virtualization techniques actually replace one of the layers and
intercept the calls that are directed toward it.

Modern computing systems can be expressed in terms of the reference model described in
Figure 3.4.

• At the bottom layer, the model for the hardware is expressed in terms of the
Instruction Set Architecture (ISA), which defines the instruction set for the
processor, registers, memory, and interrupt management.
• The ISA is the interface between hardware and software, and it is important to the
operating system (OS) developer (System ISA) and to developers of applications that
directly manage the underlying hardware (User ISA).
• The application binary interface (ABI) separates the operating system layer from the
applications and libraries, which are managed by the OS.
• The ABI covers details such as low-level data types, alignment, and call conventions and
defines a format for executable programs. System calls are defined at this level.
• The ABI allows portability of applications and libraries across operating systems
that implement the same ABI.
• The highest level of abstraction is represented by the application programming
interface (API), which interfaces applications to libraries and/or the underlying
operating system.
• For any operation to be performed at the application (API) level, the ABI and ISA are
responsible for making it happen.
• The high-level abstraction is converted into machine-level instructions to perform the
actual operations supported by the processor.
• The machine-level resources, such as processor registers and main memory
capacities, are used to perform the operation at the hardware level of the central
processing unit (CPU).
• This layered approach
o simplifies the development and implementation of computing systems
o simplifies the implementation of multitasking and the coexistence of multiple
executing environments
o requires only limited knowledge of the entire computing stack
o provides ways to implement a minimal security model for managing and
accessing shared resources.
• For this purpose, the instruction set exposed by the hardware has been divided into
different security classes that define who can operate with them.
• The first distinction can be made between privileged and nonprivileged instructions.
o Nonprivileged instructions are those instructions that can be used without
interfering with other tasks because they do not access shared resources. This
category contains, for example, all the floating-point, fixed-point, and arithmetic
instructions.
o Privileged instructions are those that are executed under specific restrictions
and are mostly used for sensitive operations, which expose (behavior-
sensitive) or modify (control-sensitive) the privileged state. For instance,
behavior-sensitive instructions are those that operate on the I/O, whereas
control-sensitive instructions alter the state of the CPU registers.
o Some types of architecture feature more than one class of privileged
instructions and implement a finer control of how these instructions can be
accessed. For instance, a possible implementation features a hierarchy of
privileges (see Figure 3.5) in the form of ring-based security: Ring 0, Ring 1,
Ring 2, and Ring 3; Ring 0 is the most privileged level and Ring 3 the least
privileged level. Ring 0 is used by the kernel of the OS, Rings 1 and 2 are used
by OS-level services, and Ring 3 is used by the user. Recent systems
support only two levels, with Ring 0 for supervisor mode and Ring 3 for user
mode.
• All current systems support at least two different execution modes: supervisor
mode and user mode.
o Supervisor mode denotes an execution mode in which all the instructions
(privileged and nonprivileged) can be executed without any restriction. This
mode, also called master mode or kernel mode, is generally used by the
operating system (or the hypervisor) to perform sensitive operations on
hardware-level resources.
o In user mode, there are restrictions on controlling the machine-level resources. If
code running in user mode invokes privileged instructions, a hardware
interrupt occurs and traps the potentially harmful execution of the instruction.
o There might be some instructions that can be invoked as privileged
instructions under some conditions and as nonprivileged instructions under
other conditions.
o The distinction between user and supervisor mode allows us to understand
the role of the hypervisor and why it is called that.
• Conceptually, the hypervisor runs above the supervisor mode. In reality, hypervisors
are run in supervisor mode, and the division between privileged and nonprivileged
instructions has posed challenges in designing virtual machine managers.
• It is expected that all the sensitive instructions will be executed in privileged mode,
which requires supervisor mode in order to avoid traps. Without this assumption it is
impossible to fully emulate and manage the status of the CPU for guest operating
systems.
• Unfortunately, this is not true for the original x86 ISA, which allows 17 sensitive
instructions to be called in user mode. This prevents multiple operating systems
managed by a single hypervisor from being isolated from each other, since they are
able to access the privileged state of the processor and change it.
• It is expected that in a hypervisor-managed environment, all the guest operating
system code will be run in user mode in order to prevent it from directly accessing
the status of the CPU. If there are sensitive instructions that can be called in user
mode (that is, implemented as nonprivileged instructions), it is no longer possible to
completely isolate the guest OS.
• More recent implementations of the ISA (Intel VT and AMD Pacifica) have solved this
problem by redesigning such instructions as privileged ones.

3.3.1.2 Hardware-level virtualization

• Hardware-level virtualization is a virtualization technique that provides an abstract
execution environment in terms of computer hardware on top of which a guest
operating system can be run.
• In this model, the guest is represented by the operating system, the host by the
physical computer hardware, the virtual machine by its emulation, and the virtual
machine manager by the hypervisor (see Figure 3.6).
• The hypervisor is generally a program or a combination of software and hardware
that allows the abstraction of the underlying physical hardware.
• Hardware-level virtualization is also called system virtualization, since it provides an
ISA to virtual machines, which is the representation of the hardware interface of a
system.
Hypervisors

• A fundamental element of hardware virtualization is the hypervisor, or virtual
machine manager (VMM). It recreates a hardware environment in which guest
operating systems are installed.
• There are two major types of hypervisors: Type I and Type II (see Figure 3.7).
• Type I hypervisors run directly on top of the hardware. Therefore, they take the
place of the operating system and interact directly with the ISA interface exposed by
the underlying hardware, and they emulate this interface in order to allow the
management of guest operating systems. This type of hypervisor is also called a
native virtual machine since it runs natively on hardware.
• Type II hypervisors require the support of an operating system to provide
virtualization services. This means that they are programs managed by the operating
system, which interact with it through the ABI and emulate the ISA of virtual
hardware for guest operating systems. This type of hypervisor is also called a hosted
virtual machine since it is hosted within an operating system.
• Conceptually, a virtual machine manager is internally organized as described in Figure
3.8.

• Three main modules, the dispatcher, the allocator, and the interpreter, coordinate their
activity in order to emulate the underlying hardware.
• The dispatcher constitutes the entry point of the monitor and reroutes the
instructions issued by the virtual machine instance to one of the two other modules.
• The allocator is responsible for deciding the system resources to be provided to the
VM: whenever a virtual machine tries to execute an instruction that results in
changing the machine resources associated with that VM, the allocator is invoked by
the dispatcher.
• The interpreter module consists of interpreter routines. These are executed
whenever a virtual machine executes a privileged instruction: a trap is triggered and
the corresponding routine is executed.
• The criteria that need to be met by a virtual machine manager to efficiently support
virtualization were established by Goldberg and Popek in 1974. Three properties
have to be satisfied:
o Equivalence. A guest running under the control of a virtual machine manager
should exhibit the same behavior as when it is executed directly on the
physical host.
o Resource control. The virtual machine manager should be in complete
control of virtualized resources.
o Efficiency. A statistically dominant fraction of the machine instructions should
be executed without intervention from the virtual machine manager.
• Popek and Goldberg provided a classification of the instruction set and proposed
three theorems that define the properties that hardware instructions need to satisfy
in order to efficiently support virtualization.
THEOREM 3.1
For any conventional third-generation computer, a VMM may be constructed if the set of
sensitive instructions for that computer is a subset of the set of privileged instructions.

• This theorem establishes that all the instructions that change the configuration of the
system resources should generate a trap in user mode and be executed under the
control of the virtual machine manager (the subset condition is illustrated below).
• This allows hypervisors to efficiently control only those instructions that would reveal
the presence of an abstraction layer, while executing all the rest of the instructions
without considerable performance loss.
• The theorem always guarantees the resource control property when the hypervisor is
in the most privileged mode (Ring 0).
• The nonprivileged instructions must be executed without the intervention of the
hypervisor.
• The equivalence property also holds, since the output of the code is the same in
both cases because the code is not changed.
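The condition of Theorem 3.1 is a simple set relation, which can be illustrated with two small Python sets; the instruction names are made up for illustration and do not correspond to a real ISA.

# Hypothetical instruction classification for an imaginary ISA.
privileged = {"LOAD_CR3", "HALT", "SET_TIMER", "IO_OUT"}
sensitive  = {"LOAD_CR3", "IO_OUT"}   # instructions touching the privileged state

# Theorem 3.1: a VMM can be constructed (trap-and-emulate) if every
# sensitive instruction is also privileged, i.e. it traps in user mode.
print("Efficiently virtualizable:", sensitive.issubset(privileged))   # True

# The x86 situation before Intel VT / AMD-V: some sensitive instructions
# (the hypothetical "READ_FLAGS" here) do not trap in user mode.
sensitive_x86_like = sensitive | {"READ_FLAGS"}
print("x86-like ISA virtualizable:", sensitive_x86_like.issubset(privileged))  # False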

THEOREM 3.2
A conventional third-generation computer is recursively virtualizable if:
• it is virtualizable, and
• a VMM without any timing dependencies can be constructed for it.

• Recursive virtualization is the ability to run a virtual machine manager on top of
another virtual machine manager. This allows nesting hypervisors as long as the
capacity of the underlying resources can accommodate that.
• Virtualizable hardware is a prerequisite to recursive virtualization.

THEOREM 3.3
A hybrid VMM may be constructed for any conventional third-generation machine in
which the set of user-sensitive instructions is a subset of the set of privileged instructions.

• There is another term, hybrid virtual machine (HVM), which is less efficient than the
virtual machine system. In the case of an HVM, more instructions are interpreted
rather than being executed directly.
• All instructions in virtual supervisor mode are interpreted. Whenever there is an
attempt to execute a behavior-sensitive or control-sensitive instruction, the HVM
controls the execution directly or gains control via a trap. Here all sensitive
instructions are caught by the HVM and simulated.
• This reference model represents what we generally consider classic virtualization,
that is, the ability to execute a guest operating system in complete isolation.
• To a greater extent, hardware-level virtualization includes several strategies that
differentiate from each other in terms of which kind of support is expected from the
underlying hardware, what is actually abstracted from the host, and whether the
guest should be modified or not.
Hardware virtualization techniques

Hardware-assisted virtualization.

• This term refers to a scenario in which the hardware provides architectural support
for building a virtual machine manager able to run a guest operating system in
complete isolation. This technique was originally introduced in the IBM System/370.
• At present, examples of hardware-assisted virtualization are the extensions to the
x86-64 architecture introduced with Intel VT (formerly known as Vanderpool) and
AMD-V (formerly known as Pacifica).
• Before the introduction of hardware-assisted virtualization, software emulation of
x86 hardware was significantly costly from the performance point of view. The
reason for this is that by design the x86 architecture did not meet the formal
requirements introduced by Popek and Goldberg, and early products used
binary translation to trap some sensitive instructions and provide an emulated
version.
• Products such as VMware Virtual Platform, introduced in 1999 by VMware, which
pioneered the field of x86 virtualization, were based on this technique.
• After 2006, Intel and AMD introduced processor extensions, and a wide range of
virtualization solutions took advantage of them: Kernel-based Virtual Machine
(KVM), VirtualBox, Xen, VMware, Hyper-V, Sun xVM, Parallels, and others.

Full virtualization.

• Full virtualization refers to the ability to run a program, most likely an operating
system, directly on top of a virtual machine and without any modification, as though
it were run on the raw hardware.
• To make this possible, virtual machine managers are required to provide a complete
emulation of the entire underlying hardware.
• The principal advantage of full virtualization is complete isolation, which leads to
enhanced security, ease of emulation of different architectures, and coexistence of
different systems on the same platform.
• Full virtualization poses important concerns related to performance and technical
implementation. A key challenge is the interception of privileged instructions such as
I/O instructions: since they change the state of the resources exposed by the host,
they have to be contained within the virtual machine manager.
• A simple solution to achieve full virtualization is to provide a virtual environment for
all the instructions, thus posing some limits on performance.
• A successful and efficient implementation of full virtualization is obtained with a
combination of hardware and software. This is what is accomplished through
hardware-assisted virtualization.

Paravirtualization.

• This is a not-transparent virtualization solution that allows implementing thin virtual
machine managers.
• Paravirtualization techniques expose a software interface to the virtual machine that
is slightly modified from the host and, as a consequence, guests need to be
modified.
• The aim of paravirtualization is to provide the capability to demand the execution of
performance-critical operations directly on the host, thus preventing performance
losses that would otherwise be experienced in managed execution.
• This allows a simpler implementation of virtual machine managers that have to
simply transfer the execution of these operations.
• To take advantage of such an opportunity, guest operating systems need to be
modified and explicitly ported by remapping the performance-critical operations
through the virtual machine software interface. This is possible when the source
code of the operating system is available, and this is the reason that
paravirtualization was mostly explored in open-source and academic
environments.
• Whereas this technique was initially applied in the IBM VM operating system
families, the term paravirtualization was introduced in the literature in the Denali project
[24] at the University of Washington. This technique has been successfully used by
Xen for providing virtualization solutions for Linux-based operating systems
specifically ported to run on Xen hypervisors.
• Operating systems that cannot be ported can still take advantage of
paravirtualization by using ad hoc device drivers that remap the execution of critical
instructions to the paravirtualization APIs exposed by the hypervisor. Xen provides
this solution for running Windows-based operating systems on x86 architectures.

Partial virtualization.

• Partial virtualization provides a partial emulation of the underlying hardware, thus
not allowing the complete execution of the guest operating system in complete
isolation.
• Partial virtualization allows many applications to run transparently, but not all the
features of the operating system can be supported, as happens with full
virtualization.
• An example of partial virtualization is address space virtualization used in time-
sharing systems; this allows multiple applications and users to run concurrently in
separate memory spaces, but they still share the same hardware resources (disk,
processor, and network).
• Historically, partial virtualization has been an important milestone toward achieving
full virtualization, and it was implemented on the experimental IBM M44/44X.

Operating system-level virtualization

Operating system-level virtualization offers the opportunity to create different and separate
execution environments for applications that are managed concurrently. Differently from
hardware virtualization, there is no virtual machine manager or hypervisor, and the
virtualization is done within a single operating system, where the OS kernel allows for
multiple isolated user space instances. The kernel is also responsible for sharing the system
resources among instances and for limiting the impact of instances on each other. A user
space instance in general contains its own view of the file system, which is completely
isolated, and separate IP addresses, software configurations, and access to devices.
Operating systems supporting this type of virtualization are general-purpose, time-
shared operating systems with the capability to provide stronger namespace and resource
isolation. This virtualization technique can be considered an evolution of the chroot
mechanism in Unix systems. The chroot operation changes the file system root directory for
a process and its children to a specific directory (a minimal sketch follows).
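A minimal sketch of the chroot idea using Python's standard os module; it must be run as root on a Unix system, and the directory path is a placeholder that would need to be populated with a usable file-system tree.

import os

# Hypothetical directory prepared with a minimal file-system tree.
NEW_ROOT = "/srv/jail"

pid = os.fork()
if pid == 0:
    # Child process: confine its view of the file system to NEW_ROOT.
    os.chroot(NEW_ROOT)      # requires root privileges
    os.chdir("/")            # "/" now refers to NEW_ROOT for this process
    print("Child sees root as:", os.listdir("/"))
    os._exit(0)
else:
    os.waitpid(pid, 0)       # the parent keeps the original root directory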

Compared to hardware virtualization, this strategy imposes little or no overhead because
applications directly use OS system calls and there is no need for emulation. There is no
need to modify applications to run them, nor to modify any specific hardware, as in the case
of hardware-assisted virtualization. On the other hand, operating system-level virtualization
does not offer the same flexibility as hardware virtualization, since all the user space
instances must share the same operating system.

3.3.1.3 Programming language-level virtualization

Programming language-level virtualization is mostly used to achieve ease of deployment of applications, managed execution, and portability across different platforms and operating systems. It consists of a virtual machine executing the byte code of a program, which is the result of the compilation process. Compilers implemented and used this technology to produce a binary format representing the machine code for an abstract architecture.

Generally, these virtual machines constitute a simplification of the underlying hardware instruction set and provide some high-level instructions that map some of the features of the languages compiled for them.
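The following toy Python sketch illustrates the idea of a process virtual machine executing byte code for an abstract architecture; the instruction set is invented for illustration and is far simpler than real virtual machines such as the JVM or CLR.

# Toy stack-based virtual machine executing "byte code" produced by a
# compiler for an abstract architecture. Instruction set is illustrative.

def run(bytecode):
    stack = []
    for op, *args in bytecode:
        if op == "PUSH":            # push a constant onto the stack
            stack.append(args[0])
        elif op == "ADD":           # pop two values, push their sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "PRINT":         # output the top of the stack
            print(stack[-1])
    return stack

# The same bytecode runs unchanged on any platform that provides this VM.
program = [("PUSH", 2), ("PUSH", 40), ("ADD",), ("PRINT",)]
run(program)  # prints 42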

3.4 Virtualization and cloud computing

• Virtualization technologies are primarily used to offer configurable computing environments and storage.
• Network virtualization is less popular and, in most cases, is a complementary feature, which is naturally needed to build virtual computing systems.
• Particularly important is the role of virtual computing environments and execution virtualization techniques.
• Among these, hardware and programming language virtualization are the techniques adopted in cloud computing systems.
• Hardware virtualization is an enabling factor for solutions in the Infrastructure-as-a-Service (IaaS) market segment, while programming language virtualization is a technology leveraged in Platform-as-a-Service (PaaS) offerings.
• Moreover, virtualization also allows isolation and finer control, thus simplifying the leasing of services and their accountability on the vendor side.
• Virtualization also gives the opportunity to design more efficient computing systems by means of consolidation, which is performed transparently to cloud computing service users.
• Since virtualization allows us to create isolated and controllable environments, it is possible to serve these environments with the same resource without them interfering with each other.
• This opportunity is particularly attractive when resources are underutilized, because it allows reducing the number of active resources by aggregating virtual machines over a smaller number of resources that become fully utilized. This practice is also known as server consolidation, while the movement of virtual machine instances is called virtual machine migration (see Figure 3.10 and the placement sketch after this list).
• Because virtual machine instances are controllable environments, consolidation can be applied with minimum impact, either by temporarily stopping a virtual machine's execution and moving its data to the new resources or by performing a finer control and moving the instance while it is running. This second technique is known as live migration and is in general more complex to implement but more efficient, since there is no disruption of the activity of the virtual machine instance.
• Server consolidation and virtual machine migration are principally used in the case of hardware virtualization, even though they are also technically possible in the case of programming language virtualization (see Figure 3.9).
• In storage virtualization, vendors backed by large computing infrastructures featuring huge storage facilities can harness these facilities into a virtual storage service, easily partitionable into slices. These slices can be dynamic and offered as a service.
• Cloud computing revamps the concept of desktop virtualization, initially introduced in the mainframe era: the ability to recreate the entire computing stack, from infrastructure to application services, on demand.
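As referenced above, a minimal sketch of the placement decision behind server consolidation could look like the following first-fit packing of virtual machine loads onto hosts. It is only an illustration: real consolidation engines also weigh memory, I/O, SLAs, and the cost of (live) migration.

# First-fit decreasing placement: pack VMs onto as few hosts as possible
# so that unused hosts can be powered down. Loads are fractions of one
# host's capacity; all values are illustrative.

def consolidate(vm_loads, host_capacity):
    hosts = []        # residual capacity of each active host
    placement = {}
    for vm, load in sorted(vm_loads.items(), key=lambda kv: -kv[1]):
        for i, free in enumerate(hosts):
            if load <= free:          # fits on an already active host
                hosts[i] -= load
                placement[vm] = i
                break
        else:
            hosts.append(host_capacity - load)   # power on a new host
            placement[vm] = len(hosts) - 1
    return placement, len(hosts)

placement, active = consolidate(
    {"vm1": 0.30, "vm2": 0.25, "vm3": 0.50, "vm4": 0.20}, host_capacity=1.0)
print(placement, "active hosts:", active)   # four VMs fit on two hosts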

3.5 Pros and cons of virtualization

3.5.1 Advantages of virtualization

• Managed execution and isolation are perhaps the most important advantages of virtualization. These allow building secure and controllable computing environments.
• A virtual execution environment can be configured as a sandbox, thus preventing any harmful operation from crossing the borders of the virtual host.
• Moreover, allocation of resources and their partitioning among different guests is simplified, since the virtual host is controlled by a program. This enables fine-tuning of resources.
• Portability is another advantage of virtualization, especially for execution virtualization techniques.
• Virtual machine instances are normally represented by one or more files that can be easily transported, in contrast to physical systems. Moreover, they also tend to be self-contained.
• It is in fact possible to build our own operating environment within a virtual machine instance and bring it with us wherever we go, as though we had our own laptop. This concept is also an enabler for migration techniques in a server consolidation scenario.
• Portability also contributes to reducing the costs of maintenance, since the number of hosts is expected to be lower than the number of virtual machine instances.
• Since the guest program is executed in a virtual environment, there is very limited opportunity for the guest program to damage the underlying hardware.
• It is possible to achieve a more efficient use of resources.
• Multiple systems can securely coexist and share the resources of the underlying host, without interfering with each other. This is a prerequisite for server consolidation, which allows adjusting the number of active physical resources dynamically according to the current load of the system, thus creating the opportunity to save in terms of energy consumption and to have less impact on the environment.

3.5.2 The other side of the coin: disadvantages

3.5.2.1 Performance degradation

• Performance is definitely one of the major concerns in using virtualization technology. Since virtualization interposes an abstraction layer between the guest and the host, the guest can experience increased latencies.
• For instance, in the case of hardware virtualization, where the intermediate layer emulates a bare machine on top of which an entire system can be installed, the causes of performance degradation can be traced back to the overhead introduced by the following activities:
• Maintaining the status of virtual processors
• Support of privileged instructions (trap and simulate privileged instructions)
• Support of paging within the VM
• Console functions
• Similar considerations can be made in the case of virtualization technologies at higher levels, such as programming language virtual machines (Java, .NET, and others). Binary translation and interpretation can slow down the execution of managed applications.
• Moreover, because their execution is filtered by the runtime environment, access to memory and other physical resources can represent sources of performance degradation.
• These concerns are becoming less and less important thanks to technology advancements and the ever-increasing computational power available today.
• For example, specific techniques for hardware virtualization such as paravirtualization can increase the performance of the guest program by offloading most of its execution to the host without any change. In programming-level virtual machines such as the JVM or .NET, compilation to native code is offered as an option when performance is a serious concern.

3.5.2.2 Inefficiency and degraded user experience

• Virtualization can sometimes lead to an inefficient use of the host. In particular, some of the specific features of the host cannot be exposed by the abstraction layer and then become inaccessible.
• In the case of hardware virtualization, this could happen for device drivers: the virtual machine can sometimes simply provide a default graphic card that maps only a subset of the features available in the host.
• In the case of programming-level virtual machines, some of the features of the underlying operating systems may become inaccessible unless specific libraries are used.
• For example, in the first version of Java the support for graphic programming was very limited and the look and feel of applications was very poor compared to native applications. These issues have been resolved by providing a new framework called Swing for designing the user interface, and further improvements have been made by integrating support for the OpenGL libraries in the software development kit.

3.5.2.3 Security holes and new threats

• Virtualization opens the door to a new and unexpected form of phishing.
• The capability of emulating a host in a completely transparent manner led the way to malicious programs that are designed to extract sensitive information from the guest.
• In the case of hardware virtualization, malicious programs can preload themselves before the operating system and act as a thin virtual machine manager toward it. The operating system is then controlled and can be manipulated to extract sensitive information of interest to third parties.
• Examples of these kinds of malware are BluePill and SubVirt.
• BluePill, malware targeting the AMD processor family, moves the execution of the installed OS within a virtual machine.
• The original version of SubVirt was developed as a prototype by Microsoft in collaboration with Michigan University. SubVirt infects the guest OS, and when the virtual machine is rebooted, it gains control of the host.
• The spread of such kinds of malware is facilitated by the fact that originally, hardware and CPUs were not manufactured with virtualization in mind.
• In particular, the existing instruction sets cannot simply be changed or updated to suit the needs of virtualization. Recently, both Intel and AMD have introduced hardware support for virtualization with Intel VT and AMD Pacifica, respectively.
• The same considerations can be made for programming-level virtual machines: modified versions of the runtime environment can access sensitive information or monitor the memory locations utilized by guest applications while these are executed.
• To make this possible, the original version of the runtime environment needs to be replaced by the modified one, which can generally happen if the malware is run within an administrative context or a security hole of the host operating system is exploited.
Unit 3 – Chapter 5

5.2 Data Center Technology

Grouping IT resources in close proximity with one another, rather than having them
geographically dispersed, allows for power sharing, higher efficiency in shared IT resource
usage, and improved accessibility for IT personnel. These are the advantages that naturally
popularized the data center concept.

Data centers are typically comprised of the following technologies and components:

Virtualization

Data centers consist of both physical and virtualized IT resources. The physical IT resource layer refers to the facility infrastructure that houses computing/networking systems and equipment, together with hardware systems and their operating systems (Figure 5.7).

The resource abstraction and control of the virtualization layer is comprised of operational and management tools that are often based on virtualization platforms that abstract the physical computing and networking IT resources as virtualized components.
Standardization and Modularity

Data centers are built upon standardized commodity hardware and designed with modular architectures, aggregating multiple identical building blocks of facility infrastructure and equipment to support scalability, growth, and speedy hardware replacements.

Modularity and standardization are key requirements for reducing investment and operational costs as they enable economies of scale for the procurement, acquisition, deployment, operation, and maintenance processes.

Common virtualization strategies and the constantly improving capacity and performance of physical devices both favor IT resource consolidation, since fewer physical components are needed to support complex configurations. Consolidated IT resources can serve different systems and be shared among different cloud consumers.

Automation

Data centers have specialized platforms that automate tasks like provisioning, configuration, patching, and monitoring without supervision. Advances in data center management platforms and tools leverage autonomic computing technologies to enable self-configuration and self-recovery.

Remote Operation and Management

Most of the operational and administrative tasks of IT resources in data centers are commanded through the network's remote consoles and management systems. Technical personnel are not required to visit the dedicated rooms that house servers, except to perform highly specific tasks.

High Availability

Since any form of data center outage significantly impacts business continuity for the organizations that use their services, data centers are designed to operate with increasingly higher levels of redundancy to sustain availability. Data centers usually have redundant, uninterruptable power supplies, cabling, and environmental control subsystems in anticipation of system failure, along with communication links and clustered hardware for load balancing.

Security-Aware Design, Operation, and Management

Requirements for security, such as physical and logical access controls and data recovery strategies, need to be thorough and comprehensive for data centers, since they are centralized structures that store and process business data.

Facilities

Data center facilities are custom-designed locations that are outfitted with specialized computing, storage, and network equipment. These facilities have several functional layout areas, as well as various power supplies, cabling, and environmental control stations that regulate heating, ventilation, air conditioning, fire protection, and other related subsystems.

Computing Hardware

Much of the heavy processing in data centers is often executed by standardized commodity servers that have substantial computing power and storage capacity. Several computing hardware technologies are integrated into these modular servers, such as:
• rackmount form factor server design composed of standardized racks with interconnects for power, network, and internal cooling
• support for different hardware processing architectures, such as x86-32 bit, x86-64, and RISC
• a power-efficient multi-core CPU architecture that houses hundreds of processing cores in a space as small as a single unit of standardized racks
• redundant and hot-swappable components, such as hard disks, power supplies, network interfaces, and storage controller cards

Storage Hardware

Data centers have specialized storage systems that maintain enormous amounts of digital information in order to fulfill considerable storage capacity needs. These storage systems are containers housing numerous hard disks that are organized into arrays.
Storage systems usually involve the following technologies:
• Hard Disk Arrays – These arrays inherently divide and replicate data among multiple physical drives, and increase performance and redundancy by including spare disks. This technology is often implemented using redundant arrays of independent disks (RAID) schemes, which are typically realized through hardware disk array controllers (a minimal parity sketch follows this list).

• I/O Caching – This is generally performed through hard disk array controllers, which enhance disk access times and performance by data caching.

• Hot-Swappable Hard Disks – These can be safely removed from arrays without requiring prior powering down.

• Storage Virtualization – This is realized through the use of virtualized hard disks and storage sharing.

• Fast Data Replication Mechanisms – These include snapshotting, which is saving a virtual machine's memory into a hypervisor-readable file for future reloading, and volume cloning, which is copying virtual or physical hard disk volumes and partitions.
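As referenced in the Hard Disk Arrays item, the redundancy idea behind parity-based RAID schemes can be sketched in a few lines. This is a simplified illustration only: real controllers implement parity in hardware and stripe data and parity across many disks.

# A parity block is the XOR of the data blocks, so any single failed
# block can be reconstructed from the survivors plus the parity.
from functools import reduce

def xor_blocks(blocks):
    # Column-wise XOR of equal-sized byte strings.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

data = [b"disk0data", b"disk1data", b"disk2data"]   # equal-sized stripes
parity = xor_blocks(data)                           # stored on a parity disk

# Simulate losing disk 1 and rebuilding it from the other disks + parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
print("reconstructed:", rebuilt)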

Networked storage devices usually fall into one of the following categories:
• Storage Area Network (SAN) – Physical data storage media are connected through a dedicated network and provide block-level data storage access using industry standard protocols, such as the Small Computer System Interface (SCSI).
• Network-Attached Storage (NAS) – Hard drive arrays are contained and managed by this dedicated device, which connects through a network and facilitates access to data using file-centric data access protocols like the Network File System (NFS) or Server Message Block (SMB).

Network Hardware

Data centers require extensive network hardware in order to enable multiple levels of connectivity. For a simplified version of the networking infrastructure, the data center is broken down into five network subsystems, followed by a summary of the most common elements used for their implementation.

Carrier and External Networks Interconnection

A subsystem related to the internetworking infrastructure, this interconnection is usually comprised of backbone routers that provide routing between external WAN connections and the data center's LAN, as well as perimeter network security devices such as firewalls and VPN gateways.

Web-Tier Load Balancing and Acceleration

This subsystem comprises Web acceleration devices, such as XML pre-processors, encryption/decryption appliances, and layer 7 switching devices that perform content-aware routing.

LAN Fabric

The LAN fabric constitutes the internal LAN and provides high-performance and redundant connectivity for all of the data center's network-enabled IT resources. It is often implemented with multiple network switches that facilitate network communications and operate at speeds of up to ten gigabits per second.

SAN Fabric

Related to the implementation of storage area networks (SANs) that provide connectivity between servers and storage systems, the SAN fabric is usually implemented with Fibre Channel (FC), Fibre Channel over Ethernet (FCoE), and InfiniBand network switches.

NAS Gateways

This subsystem supplies attachment points for NAS-based storage devices and implements protocol conversion hardware that facilitates data transmission between SAN and NAS devices.

Data center network technologies have operational requirements for scalability and high availability that are fulfilled by employing redundant and/or fault-tolerant configurations. These five network subsystems improve data center redundancy and reliability to ensure that they have enough IT resources to maintain a certain level of service even in the face of multiple failures.

Other Considerations

IT hardware is subject to rapid technological obsolescence, with lifecycles that typically last between five and seven years. The ongoing need to replace equipment frequently results in a mix of hardware whose heterogeneity can complicate the entire data center's operations and management (although this can be partially mitigated through virtualization).

Security is another major issue when considering the role of the data center and the vast quantities of data contained within its doors. Even with extensive security precautions in place, housing data exclusively at one data center facility means much more can be compromised by a successful security incursion than if the data were distributed across individual unlinked components.

5.6. Containerization

Containerization is an operating system-level virtualization technology used to deploy and run applications and cloud services without the need to deploy a virtual server for each solution. Instead, they are deployed within containers. Using containers enables multiple isolated cloud services to run on a single physical server or virtual server while accessing the same operating system kernel.

The operating system kernel allows for the existence of multiple isolated user-space instances or multiple isolated runtimes known as containers, partitions, virtual engines, jails or chroot jails. Regardless of which runtime is used, when a cloud service executes within a container, it is running on a real computer from its point of view.

A cloud service running on a physical or virtual server operating system can see all of the provided resources, such as connected devices, ports, files, folders, network shares, CPUs, as well as the physically addressable memory. However, a cloud service running inside a container can only see the container's contents and devices attached to the container.
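A hedged sketch of this isolation is shown below, assuming the Docker Engine and the Docker SDK for Python are installed (other container engines expose similar operations); the image name is just an example.

# Two containers created from the same image share the host's kernel,
# but each sees only its own file system, processes and network namespace.
import docker

client = docker.from_env()

# Each run() call starts an isolated user-space instance.
out_a = client.containers.run("alpine", "hostname", remove=True)
out_b = client.containers.run("alpine", "hostname", remove=True)

# Different hostnames confirm the two services run in separate containers,
# even though no guest operating system or hypervisor is involved.
print(out_a.strip(), out_b.strip())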

Containerization Vs. Virtualization

As explained earlier, virtualization refers to the act of creating a virtual, rather than an actual, version of something. This includes virtual computer hardware platforms, storage devices, and computer network resources. Virtual servers are an abstraction of physical hardware via server virtualization and the use of hypervisors for abstracting a given physical server into multiple virtual servers.

The hypervisor allows multiple virtual servers to run on a single physical host. Virtual servers see the emulated hardware presented to them by the hypervisor as real hardware, and each virtual server has its own operating system, also known as a guest operating system, that needs to be deployed inside the virtual server and managed and maintained as if it were deployed on a physical server.
In contrast, containers are an abstraction at the application or service layer that packages code and dependencies together. Multiple containers can be deployed on the same machine and share an operating system kernel with other containers. Each container runs as an isolated process in the user space. Containers do not require the guest operating system that is needed for virtual servers and can run directly on a physical server's operating system.

Containers also consume less storage space than virtual servers. Figure 5.12 depicts the difference between virtual servers and containers.

Containers can be deployed in virtual servers, in which case nested virtualization is required to allow the container engine to be installed and operated. Nested virtualization refers to a deployment where one virtualized system is deployed on another.

Benefits of Containers

Portability is one of the key benefits of containers, allowing cloud resource administrators to move containers to any environment that shares the same host operating system and container engine that the container is hosted on, without the need to change the application or software.

Efficient resource utilization is achieved by significantly reducing the CPU, memory and storage usage footprint compared to virtual servers. It is possible to support several containers on the same infrastructure required by a single virtual server, resulting in performance improvements.

Containers can be created and deployed much faster than virtual servers, which supports a more agile process and facilitates continuous integration.

Furthermore, containers allow the versions of software code and its dependencies to be tracked.

Container Hosting and Pods

A single process of one cloud service is normally deployed in each container, though more than one cloud service or process can be deployed in each, if required. In some cases, one core process and its side processes or extra functions are deployed in the same container. The amount of resources each container consumes can be restricted.

Multiple containers can be deployed in a logical construct called a pod. A pod is a group of one or more containers that have shared storage and/or network resources, and also share the same configuration that determines how the containers are to be run. A pod is typically employed when there are different cloud services that are part of the same application or namespace and that need to run under a single IP address.

Fundamental Container Architecture Elements

Container Engine

The key component of container architecture is the container engine, also referred to as the containerization engine. The container engine is specialized software that is deployed in an operating system to abstract the required resources and enable the definition and deployment of containers. Container engine software can be deployed on physical machines or virtual machines.

Container Build File

A container build file is a descriptor (created by the user or service) that represents the requirements of the applications and services that run inside the container, as well as the configuration parameters required by the container engine in order to create and deploy the container.

Container Image

The container engine uses a container image to deploy containers based on pre-defined requirements. For example, if an application requires a database component or Web server service to operate, these requirements are defined by the user in the container build file. Based on the defined descriptions, the container engine customizes the operating system image and the required commands or services for the application.

Container

The container is an executable instance of a pre-defined or customized container image that contains one or more software programs, most commonly an application or service. While containers are isolated from each other, they may be required to access a shared resource over the network, such as a file system or remote IT resource.
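A hedged sketch tying these elements together is shown below, again assuming the Docker SDK for Python as the client for the container engine; the directory, image tag, container name, and resource limit are hypothetical, and the directory is expected to contain a container build file (e.g., a Dockerfile).

# Build file + engine -> image -> running container.
import docker

engine = docker.from_env()                 # client for the container engine

# The engine reads the build file in ./demo-app and produces an image.
image, _build_logs = engine.images.build(path="./demo-app", tag="demo-app:1.0")

# The image is instantiated as a container: an isolated process whose
# resource consumption can be restricted.
container = engine.containers.run(
    "demo-app:1.0",
    detach=True,
    mem_limit="256m",        # cap the container's memory footprint
    name="demo-app-instance",
)
print(container.status)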

Networking Address
Each container has its own network address (such as an IP address) used to communicate with other containers and external components. A container can be connected to more than one network by allocating additional network addresses to the container.

Containers use the physical or virtual network card of the system that the container engine is deployed on to communicate with other containers and IT resources.

Storage Device

Similar to the networking address, a container may connect to one or more storage devices that are made available to the containers over the network. Each container has its own level of access to the storage devices, as defined by the system or its administrators.
Chapter 4: Cloud Computing Architecture
(REFERENCE 1 as per guidelines)

4.1 Introduction

• Utility-oriented data centers serve as the infrastructure through which the services are implemented and delivered.
• Any cloud service, whether virtual hardware, development platform, or application software, relies on a distributed infrastructure owned by the provider or rented from a third party.

A broad definition of the phenomenon could be as follows:

Cloud computing is a utility-oriented and Internet-centric way of delivering IT services on demand. These services cover the entire computing stack: from the hardware infrastructure packaged as a set of virtual machines to software services such as development platforms and distributed applications.

4.2 The cloud reference model

Cloud computing supports any IT service that can be consumed as a utility and delivered through a network, most likely the Internet. Such a characterization includes quite different aspects: infrastructure, development platforms, applications, and services.

4.2.1 Architecture

It is possible to organize all the concrete realizations of cloud computing into a layered view covering the entire stack (see Figure 4.1), from hardware appliances to software systems.
• Cloud resources are harnessed to offer the "computing horsepower" required for providing services. Often, this layer is implemented using a datacenter in which hundreds or thousands of nodes are stacked together.
• Cloud infrastructure can be heterogeneous in nature, and database systems and other storage services can also be part of the infrastructure.
• The physical infrastructure is managed by the core middleware, the objectives of which are to provide an appropriate runtime environment for applications and to best utilize resources.
• At the bottom of the stack, virtualization technologies are used to guarantee runtime environment customization, application isolation, sandboxing, and quality of service. Hardware virtualization is most commonly used at this level.
• Hypervisors manage the pool of resources and expose the distributed infrastructure as a collection of virtual machines.
• By using virtual machine technology, it is possible to finely partition the hardware resources such as CPU and memory and to virtualize specific devices, thus meeting the requirements of users and applications.
• This solution is generally paired with storage and network virtualization strategies, which allow the infrastructure to be completely virtualized and controlled.
• For example, programming-level virtualization helps in creating a portable runtime environment where applications can be run and controlled. This scenario generally implies that applications hosted in the cloud be developed with a specific technology or programming language, such as Java, .NET, or Python. In this case, the user does not have to build their system from bare metal.
• Infrastructure management is the key function of core middleware, which supports capabilities such as negotiation of the quality of service, admission control, execution management and monitoring, accounting, and billing.
• The combination of cloud hosting platforms and resources is generally classified as an Infrastructure-as-a-Service (IaaS) solution.
• We can organize the different examples of IaaS into two categories: some of them provide both the management layer and the physical infrastructure; others provide only the management layer (IaaS (M)).
• In this second case, the management layer is often integrated with other IaaS solutions that provide physical infrastructure and adds value to them.
• IaaS solutions are suitable for designing the system infrastructure but provide limited services to build applications. Such services are provided by cloud programming environments and tools.
• The range of tools includes Web-based interfaces, command-line tools, and frameworks for concurrent and distributed programming.
• In this scenario, users develop their applications specifically for the cloud by using the API exposed at the user-level middleware.
• For this reason, this approach is also known as Platform-as-a-Service (PaaS) because the service offered to the user is a development platform rather than an infrastructure.
• PaaS solutions generally include the infrastructure as well, which is bundled as part of the service provided to users.
• In the case of Pure PaaS, only the user-level middleware is offered, and it has to be complemented with a virtual or physical infrastructure.
• The top layer of the reference model depicted in Figure 4.1 contains services delivered at the application level. These are mostly referred to as Software-as-a-Service (SaaS).
• As a reference model, it is then expected to have an adaptive management layer in charge of elastically scaling on demand.
• SaaS implementations should feature such behavior automatically, whereas PaaS and IaaS generally provide this functionality as a part of the API exposed to users.
• The reference model described in Figure 4.1 also introduces the concept of Everything-as-a-Service (XaaS). This is one of the most important elements of cloud computing: cloud services from different providers can be combined to provide a completely integrated solution covering the whole computing stack of a system.
• IaaS providers can offer the bare metal in terms of virtual machines where PaaS solutions are deployed. When there is no need for a PaaS layer, it is possible to directly customize the virtual infrastructure with the software stack needed to run applications.

Table 4.1 summarizes the characteristics of the three major categories used to classify cloud computing solutions.

4.2.2 Infrastructure- and hardware-as-a-service

• Infrastructure- and Hardware-as-a-Service (IaaS/HaaS) solutions are the most popular and developed market segment of cloud computing.
• They deliver customizable infrastructure on demand.
• The available options within the IaaS offering umbrella range from single servers to entire infrastructures, including network devices, load balancers, and database and Web servers.
• The main technology used to deliver and implement these solutions is hardware virtualization: one or more virtual machines opportunely configured and interconnected define the distributed system on top of which applications are installed and deployed.
• IaaS/HaaS solutions bring all the benefits of hardware virtualization: workload partitioning, application isolation, sandboxing, and hardware tuning.
• From the perspective of the service provider, IaaS/HaaS allows better exploiting the IT infrastructure and provides a more secure environment in which to execute third-party applications.
• From the perspective of the customer, it reduces the administration and maintenance cost as well as the capital costs allocated to purchase hardware.
For example, customers can use the full customization offered by virtualization to deploy their infrastructure in the cloud: virtual machines with only the selected operating system installed, or prepackaged system images that already contain the required software stack.
Additional services can be provided, generally including the following: SLA resource-based allocation, workload management, support for infrastructure design through advanced Web interfaces, and the ability to integrate third-party IaaS solutions.

Figure 4.2 provides an overall view of the components forming an Infrastructure-as-a-Service solution. It is possible to distinguish three principal layers: the physical infrastructure, the software management infrastructure, and the user interface.

• At the top layer the user interface provides access to the services exposed by the software management infrastructure.
• Such an interface is generally based on Web 2.0 technologies: Web services, RESTful APIs, and mash-ups. These technologies allow either applications or final users to access the services exposed by the underlying infrastructure.
• Web 2.0 applications allow developing full-featured management consoles completely hosted in a browser or a Web page.
• Web services and RESTful APIs allow programs to interact with the service without human intervention, thus providing complete integration within a software system.
• Management of the virtual machines is the most important function performed by this layer. A central role is played by the scheduler, which is in charge of allocating the execution of virtual machine instances (a minimal sketch of this role follows this list).
• The scheduler interacts with the other components that perform a variety of tasks:
o The pricing and billing component takes care of the cost of executing each virtual machine instance and maintains data that will be used to charge the user.
o The monitoring component tracks the execution of each virtual machine instance and maintains data required for reporting and analyzing the performance of the system.
o The reservation component stores the information of all the virtual machine instances that have been executed or that will be executed in the future.
o If support for QoS-based execution is provided, a QoS/SLA management component maintains a repository of all the SLAs made with the users; it is used to ensure that a given virtual machine instance is executed with the desired quality of service.
o The VM repository component provides a catalog of virtual machine images that users can use to create virtual instances.
o A VM pool manager component is responsible for keeping track of all the live instances.
• The bottom layer is composed of the physical infrastructure, on top of which the management layer operates. A service provider will most likely use a massive datacenter containing hundreds or thousands of nodes. A cloud infrastructure developed in house, in a small or medium-sized enterprise or within a university department, will most likely rely on a cluster.
• From an architectural point of view, the physical layer also includes the virtual resources that are rented from external IaaS providers.
• In the case of complete IaaS solutions, all three levels are offered as a service. This is generally the case with public cloud vendors such as Amazon, GoGrid, Joyent, Rightscale, Terremark, Rackspace, ElasticHosts, and Flexiscale.
• Other solutions instead cover only the user interface and the infrastructure software management layers. Their users need to provide credentials to access third-party IaaS providers or to own a private infrastructure in which the management software is installed.
• This is the case with Enomaly, Elastra, Eucalyptus, OpenNebula, and specific IaaS (M) solutions from VMware, IBM, and Microsoft.
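As referenced in the list above, the scheduler's allocation role can be sketched as follows. The components, names, and placement policy are purely illustrative and far simpler than those of a production IaaS scheduler.

# Toy scheduler: places VM requests on physical nodes and hands records
# to the pricing/billing component. All names and values are illustrative.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    free_cores: int

@dataclass
class Scheduler:
    nodes: list
    billing_log: list = field(default_factory=list)

    def allocate(self, user, vm_cores, hourly_rate):
        # Naive policy: pick the node with the most spare capacity.
        candidates = [n for n in self.nodes if n.free_cores >= vm_cores]
        if not candidates:
            raise RuntimeError("no capacity: request queued or rejected")
        node = max(candidates, key=lambda n: n.free_cores)
        node.free_cores -= vm_cores
        # The pricing/billing component would consume records like this one.
        self.billing_log.append({"user": user, "node": node.name,
                                 "cores": vm_cores, "rate": hourly_rate})
        return node.name

sched = Scheduler([Node("node-1", 16), Node("node-2", 8)])
print(sched.allocate("alice", vm_cores=4, hourly_rate=0.12))   # node-1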

4.2.3 Platform as a service

• Platform-as-a-Service (PaaS) solutions provide a development and deployment platform for running applications in the cloud. They constitute the middleware on top of which applications are built.

A general overview of the features characterizing the PaaS approach is given in Figure 4.3.

• Application management is the core functionality of the middleware.
• PaaS implementations provide applications with a runtime environment and do not expose any service for managing the underlying infrastructure.
• They automate the process of deploying applications to the infrastructure, configuring application components, provisioning and configuring supporting technologies such as load balancers and databases, and managing system change based on policies set by the user.
• Developers design their systems in terms of applications and are not concerned with hardware (physical or virtual), operating systems, and other low-level services.
• The core middleware is in charge of managing the resources and scaling applications on demand or automatically, according to the commitments made with users.
• PaaS solutions can offer middleware for developing applications together with the infrastructure or simply provide users with the software that is installed on the user premises.
• In the first case, the PaaS provider also owns large datacenters where applications are executed; in the second case, referred to in this book as Pure PaaS, the middleware constitutes the core value of the offering.
• It is also possible to have vendors that deliver both middleware and infrastructure and ship only the middleware for private installations.
Table 4.2 provides a classification of the most popular PaaS implementations.

It is possible to organize the various solutions into three wide categories: PaaS-I, PaaS-II, and PaaS-III.
• The first category identifies PaaS implementations that completely follow the cloud computing style for application development and deployment. They offer an integrated development environment hosted within the Web browser where applications are designed, developed, composed, and deployed. This is the case of Force.com and Longjump. Both deliver as platforms the combination of middleware and infrastructure.
• In the second class we can list all those solutions that are focused on providing a scalable infrastructure for Web applications, mostly websites. In this case, developers generally use the providers' APIs, which are built on top of industrial runtimes, to develop applications. Google AppEngine is the most popular product in this category. It provides a scalable runtime based on the Java and Python programming languages, which have been modified to provide a secure runtime environment and enriched with additional APIs and components to support scalability.
• The third category consists of all those solutions that provide a cloud programming platform for any kind of application, not only Web applications. Among these, the most popular is Microsoft Windows Azure, which provides a comprehensive framework for building service-oriented cloud applications on top of the .NET technology, hosted on Microsoft's datacenters. Other solutions in the same category, such as Manjrasoft Aneka and Apprenda, provide only middleware, with different services.

The PaaS umbrella encompasses a variety of solutions for developing and hosting applications in the cloud. Despite this heterogeneity, it is possible to identify some criteria that are expected to be found in any implementation. As noted by Sam Charrington, product manager at Appistry.com, there are some essential characteristics that identify a PaaS solution:
• Runtime framework. This framework represents the "software stack" of the PaaS model; the runtime framework executes end-user code according to the policies set by the user and the provider.
• Abstraction. PaaS solutions are distinguished by the higher level of abstraction that they provide. In the case of PaaS the focus is on the applications the cloud must support. This means that PaaS solutions offer a way to deploy and manage applications on the cloud.
• Automation. PaaS environments automate the process of deploying applications to the infrastructure, scaling them by provisioning additional resources when needed. This process is performed automatically and according to the SLA made between the customers and the provider.
• Cloud services. PaaS offerings provide developers and architects with services and APIs, helping them to simplify the creation and delivery of elastic and highly available cloud applications.

• Another essential component for a PaaS-based approach is the ability to integrate third-party cloud services offered by other vendors by leveraging service-oriented architecture. Such integration should happen through standard interfaces and protocols.
• One of the major concerns of leveraging PaaS solutions for implementing applications is vendor lock-in. PaaS environments deliver a platform for developing applications, which exposes a well-defined set of APIs and, in most cases, binds the application to the specific runtime of the PaaS provider. This poses the risk of making these applications completely dependent on the provider. Such dependency can become a significant obstacle in retargeting the application to another environment and runtime if the commitments made with the provider cease.
• Finally, from a financial standpoint, PaaS solutions can cut the cost across development, deployment, and management of applications. They help management reduce the risk of ever-changing technologies by offloading the cost of upgrading the technology to the PaaS provider.

4.2.4 Software as a service

• Software-as-a-Service (SaaS) is a software delivery model that provides access to applications through the Internet as a Web-based service.
• In this scenario, customers neither need to install anything on their premises nor have to pay considerable up-front costs to purchase the software and the required licenses.
• They simply access the application website, enter their credentials and billing details, and can instantly use the application, which, in most cases, can be further customized for their needs.
• On the provider side, the specific details and features of each customer's application are maintained in the infrastructure and made available on demand.
• The SaaS model is appealing for applications that serve a wide range of users and that can be adapted to specific needs with little further customization. This requirement characterizes SaaS as a "one-to-many" software delivery model, whereby an application is shared across multiple users.
• This scenario facilitates the development of software platforms that provide a general set of features and support specialization and ease of integration of new components.
• As a result, SaaS applications are naturally multitenant. Multitenancy, which is a distinguishing feature of SaaS compared to traditional packaged software, allows providers to centralize and sustain the effort of managing large hardware infrastructures, maintaining and upgrading applications transparently to the users, and optimizing resources by sharing the costs among the large user base.
• On the customer side, such costs constitute a minimal fraction of the usage fee paid for the software.

The acronym SaaS was then coined in 2001 by the Software & Information Industry Association (SIIA) with the following connotation:

In the software as a service model, the application, or service, is deployed from a centralized datacenter across a network—Internet, Intranet, LAN, or VPN—providing access and use on a recurring fee basis. Users "rent," "subscribe to," "are assigned," or "are granted access to" the applications from a central provider. Business models vary according to the level to which the software is streamlined, to lower price and increase efficiency, or value-added through customization to further improve digitized business processes.

• The analysis carried out by SIIA was mainly oriented to cover application service providers (ASPs) and all their variations, which capture the concept of software applications consumed as a service in a broader sense.
• ASPs already had some of the core characteristics of SaaS:
o The product sold to the customer is application access.
o The application is centrally managed.
o The service delivered is one-to-many.
o The service delivered is an integrated solution delivered on the contract, which means provided as promised.
• The SaaS approach introduces a more flexible way of delivering application services that are fully customizable by the user by integrating new services, injecting their own components, and designing the application and information workflows. Such a new approach has also been made possible with the support of Web 2.0 technologies, which allowed turning the Web browser into a full-featured interface, able even to support application composition and development.
• Initially the SaaS model was of interest only for lead users and early adopters. The benefits delivered at that stage were the following:
o Software cost reduction and total cost of ownership (TCO) were paramount
o Service-level improvements
o Rapid implementation
o Standalone and configurable applications
o Rudimentary application and data integration
o Subscription and pay-as-you-go (PAYG) pricing
• SaaS 2.0 does not introduce a new technology but transforms the way in which SaaS is used. In particular, SaaS 2.0 is focused on providing a more robust infrastructure and application platforms driven by SLAs. SaaS 2.0 will focus on the rapid achievement of business objectives.
• The existing SaaS infrastructures not only allow the development and customization of applications, but they also facilitate the integration of services that are exposed by other parties.
• This approach dramatically changes the software ecosystem of the SaaS market, which is no longer monopolized by a few vendors but is now a fully interconnected network of service providers.
• Software-as-a-Service applications can serve different needs. CRM, ERP, and social networking applications are definitely the most popular ones. SalesForce.com is probably the most successful and popular example of a CRM service. It provides a wide range of services for applications: customer relationship and human resource management, enterprise resource planning, and many other features.
• SalesForce.com builds on top of the Force.com platform, which provides a fully featured environment for building applications. It offers either a programming language or a visual environment to arrange components together for building applications.
• Similar solutions are offered by NetSuite and RightNow. NetSuite is an integrated software business suite featuring financials, CRM, inventory, and ecommerce functionalities integrated all together.
• RightNow is a customer experience-centered SaaS application that integrates different features, from chat to Web communities, to support the common activities of an enterprise.
Another important class of popular SaaS applications comprises social networking applications such as Facebook and professional networking sites such as LinkedIn. Other than providing the basic features of networking, they allow incorporating and extending their capabilities by integrating third-party applications.
Office automation applications are also an important representative of SaaS applications: Google Documents and Zoho Office are examples of Web-based applications that aim to address all user needs for documents, spreadsheets, and presentation management.

4.3 Types of clouds

• Clouds constitute the primary outcome of cloud computing.
• A more useful classification is given according to the administrative domain of a cloud: it identifies the boundaries within which cloud computing services are implemented, provides hints on the underlying infrastructure adopted to support such services, and qualifies them.
• It is then possible to differentiate four different types of cloud:
o Public clouds. The cloud is open to the wider public.
o Private clouds. The cloud is implemented within the private premises of an institution and generally made accessible to the members of the institution or a subset of them.
o Hybrid or heterogeneous clouds. The cloud is a combination of the two previous solutions and most likely identifies a private cloud that has been augmented with resources or services hosted in a public cloud.
o Community clouds. The cloud is characterized by a multi-administrative domain involving different deployment models (public, private, and hybrid), and it is specifically designed to address the needs of a specific industry.

4.3.1 Public clouds

• Public clouds constitute the first expression of cloud computing.
• The services offered are made available to anyone, from anywhere, and at any time through the Internet.
• From a structural point of view they are a distributed system, most likely composed of one or more datacenters connected together, on top of which the specific services offered by the cloud are implemented.
• They offer solutions for minimizing IT infrastructure costs and serve as a viable option for handling peak loads on the local infrastructure.
• They have become an interesting option for small enterprises, which are able to start their businesses without large up-front investments by completely relying on public infrastructure for their IT needs.
• They have the ability to grow or shrink according to the needs of the related business.
• Currently, public clouds are used both to completely replace the IT infrastructure of enterprises and to extend it when it is required.
• A fundamental characteristic of public clouds is multitenancy. A public cloud is meant to serve a multitude of users, not a single customer.
• Any customer requires a virtual computing environment that is separated, and most likely isolated, from other users. This is a fundamental requirement to provide effective monitoring of user activities and guarantee the desired performance and the other QoS attributes negotiated with users.
• Hence, a significant portion of the software infrastructure is devoted to monitoring the cloud resources, billing them according to the contract made with the user, and keeping a complete history of cloud usage for each customer. These features are fundamental to public clouds because they help providers offer services to users with full accountability.
• A public cloud can offer any kind of service: infrastructure, platform, or applications.
• For example, Amazon EC2 is a public cloud that provides infrastructure as a service; Google AppEngine is a public cloud that provides an application development platform as a service; and SalesForce.com is a public cloud that provides software as a service.
• From an architectural point of view there is no restriction concerning the type of distributed system implemented to support public clouds. Most likely, one or more datacenters constitute the physical infrastructure on top of which the services are implemented and delivered.
• Public clouds can be composed of geographically dispersed datacenters to share the load of users and better serve them according to their locations.
• For example, Amazon Web Services has datacenters installed in the United States, Europe, Singapore, and Australia; they allow their customers to choose between different regions, such as us-west-1, us-east-1, or eu-west-1. Such regions are priced differently and are further divided into availability zones, which map to specific datacenters (see the sketch after this list).
• According to the specific class of services delivered by the cloud, a different software stack is installed to manage the infrastructure: virtual machine managers, distributed middleware, or distributed applications.
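As referenced above, a hedged example of region selection using the AWS SDK for Python (boto3) is shown below; it assumes the SDK is installed and valid AWS credentials are configured, and it performs only a read-only query.

# Selecting a region determines which group of datacenters (and which
# pricing) is used; each region exposes several availability zones.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")   # region chosen by the user

zones = ec2.describe_availability_zones()["AvailabilityZones"]
for z in zones:
    print(z["RegionName"], z["ZoneName"], z["State"])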

4.3.2 Private clouds

• Public clouds are not applicable in all scenarios. For example, a very common critique of the use of cloud computing in its canonical implementation is the loss of control. In the case of public clouds, the provider is in control of the infrastructure and, eventually, of the customers' core logic and sensitive data. Even though there could be regulatory procedures in place that guarantee fair management and respect of the customer's privacy, this condition can still be perceived as a threat or as an unacceptable risk that some organizations are not willing to take.
• In particular, institutions such as government and military agencies will not consider public clouds as an option for processing or storing their sensitive data.
• More precisely, the geographical location of a datacenter generally determines the regulations that are applied to the management of digital information. As a result, according to the specific location of data, some sensitive information can be made accessible to government agencies or even considered outside the law if processed with specific cryptographic techniques.
• For example, the USA PATRIOT Act provides the U.S. government and other agencies with virtually limitless powers to access information, including that belonging to any company that stores information in U.S. territory.
• More specifically, having an infrastructure able to deliver IT services on demand can still be a winning solution, even when implemented within the private premises of an institution. This idea led to the diffusion of private clouds, which are similar to public clouds, but their resource-provisioning model is limited within the boundaries of an organization.
• Private clouds are virtual distributed systems that rely on a private infrastructure and provide internal users with dynamic provisioning of computing resources.
• Instead of a pay-as-you-go model as in public clouds, there could be other schemes in place, taking into account the usage of the cloud and proportionally billing the different departments or sections of an enterprise.
• Private clouds have the advantage of keeping the core business operations in-house by relying on the existing IT infrastructure and reducing the burden of maintaining it once the cloud has been set up.
• In this scenario, security concerns are less critical, since sensitive information does not flow out of the private infrastructure.
• Moreover, existing IT resources can be better utilized because the private cloud can provide services to a different range of users.
• Another interesting opportunity that comes with private clouds is the possibility of testing applications and systems at a comparatively lower price than on public clouds before deploying them on the public virtual infrastructure.
• A Forrester report on the benefits of delivering in-house cloud computing solutions for enterprises highlighted some of the key advantages of using a private cloud computing infrastructure:
o Customer information protection. In-house security is easier to maintain and rely on.
o Infrastructure ensuring SLAs. Quality of service implies specific operations such as appropriate clustering and failover, data replication, system monitoring and maintenance, disaster recovery, and other uptime services that can be commensurate to the application needs.
o Compliance with standard procedures and operations. If organizations are subject to third-party compliance standards, specific procedures have to be put in place when deploying and executing applications.
• From an architectural point of view, private clouds can be implemented on more heterogeneous hardware: they generally rely on the existing IT infrastructure already deployed on the private premises. This could be a datacenter, a cluster, an enterprise desktop grid, or a combination of them.
• The physical layer is complemented with infrastructure management software (i.e., IaaS (M)) or a PaaS solution, according to the service delivered to the users of the cloud.
Different options can be adopted to implement private clouds. Figure 4.4 provides a comprehensive view of the solutions, together with some references to the most popular software used to deploy private clouds.

• At the bottom layer of the software stack, virtual machine technologies such as Xen, KVM, and VMware serve as the foundations of the cloud.
• Virtual machine management technologies such as VMware vCloud, Eucalyptus, and OpenNebula can be used to control the virtual infrastructure and provide an IaaS solution. VMware vCloud is a proprietary solution, but Eucalyptus provides full compatibility with Amazon Web Services interfaces and supports different virtual machine technologies such as Xen, KVM, and VMware.
• Like Eucalyptus, OpenNebula is an open-source solution for virtual infrastructure management that supports KVM, Xen, and VMware and has been designed to easily integrate third-party IaaS providers. Its modular architecture allows extending the software with additional features such as the capability of reserving virtual machine instances.
• Solutions that rely on the previous virtual machine managers and provide added value are OpenPEX and InterGrid.
• PaaS solutions can provide an additional layer and deliver a high-level service for private clouds. Among the options available for private deployment of clouds we can consider DataSynapse, Zimory Pools, Elastra, and Aneka. DataSynapse is a global provider of application virtualization software.
• Zimory provides a software infrastructure layer that allows creating an internal cloud composed of sparse private and public resources and provides facilities for migrating applications within the existing infrastructure.
• Aneka is a software development platform that can be used to deploy a cloud infrastructure on top of heterogeneous hardware: datacenters, clusters, and desktop grids.
• Private clouds can provide in-house solutions for cloud computing, but compared to public clouds they exhibit a more limited capability to scale elastically on demand.

4.3.3 Hybrid clouds

• Public clouds are large enough to serve the needs of multiple users, but they suffer
from security threats and administrative pitfalls. Private clouds suffer from the
inability to scale on demand and to efficiently address peak loads.
• In this case, it is important to leverage the capabilities of public clouds as needed.
Hence, a hybrid solution could be an interesting opportunity for taking advantage of
the best of the private and public worlds.
• This led to the development and diffusion of hybrid clouds.
• Hybrid clouds allow enterprises to exploit existing IT infrastructures, maintain
sensitive information within the premises, and naturally grow and shrink by
provisioning external resources and releasing them when they're no longer needed.
• Security concerns are then limited to the public portion of the cloud, which can be
used to perform operations with less stringent constraints but that are still part of
the system workload.
• Figure 4.5 provides a general overview of a hybrid cloud: it is a heterogeneous
distributed system resulting from a private cloud that integrates additional services
or resources from one or more public clouds.
• For this reason hybrid clouds are also called heterogeneous clouds.
• As depicted in the diagram, dynamic provisioning is a fundamental component in this
scenario. Hybrid clouds address scalability issues by leveraging external resources for
capacity demand that exceeds the private infrastructure.
• These resources or services are temporarily leased for the time required and then
released. This practice is also known as cloudbursting.
• According to the Cloud Computing Wiki, the term cloudburst has a double meaning;
it also refers to the "failure of a cloud computing environment due to the inability to
handle a spike in demand". In this book, we always refer to the dynamic provisioning
of resources from public clouds when mentioning this term.
• Whereas the concept of a hybrid cloud is general, it mostly applies to IT infrastructure
rather than software services.
• In an IaaS scenario, dynamic provisioning refers to the ability to acquire on demand
virtual machines in order to increase the capability of the resulting distributed
system and then release them.
• Infrastructure management software and PaaS solutions are the building blocks for
deploying and managing hybrid clouds.
• What is missing is an advanced scheduling engine that is able to differentiate
these resources and provide smart allocations by taking into account the budget
available to extend the existing infrastructure.
• In the case of OpenNebula, advanced schedulers such as Haizea can be integrated to
provide cost-based scheduling.
• A different approach is taken by InterGrid. This is essentially a distributed scheduling
engine that manages the allocation of virtual machines in a collection of peer
networks.
• Such networks can be represented by a local cluster, a gateway to a public cloud, or a
combination of the two.
• Dynamic provisioning is most commonly implemented in PaaS solutions that support
hybrid clouds.
• In this scenario, the role of dynamic provisioning becomes fundamental to ensuring
the execution of applications under the QoS agreed on with the user.
• For example, Aneka provides a provisioning service that leverages different IaaS
providers for scaling the existing cloud infrastructure. In particular, each user
application has a budget attached, and the scheduler uses that budget to optimize
the execution of the application by renting virtual nodes if needed (a simplified
sketch of this kind of budget-driven decision follows).
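
The following is a minimal, hypothetical sketch of such a budget-aware provisioning
(cloudbursting) decision. The node price, budget, and capacity figures are illustrative
assumptions and do not describe any specific platform such as Aneka.

# Hypothetical sketch of a budget-aware cloudbursting decision.
# All names and figures (price_per_node_hour, budget, etc.) are illustrative.

def nodes_to_rent(pending_tasks, tasks_per_node, private_capacity,
                  price_per_node_hour, budget):
    """Return how many public-cloud nodes to lease for the next hour."""
    # Work that the private cloud cannot absorb on its own.
    overflow = max(0, pending_tasks - private_capacity * tasks_per_node)
    if overflow == 0:
        return 0
    needed = -(-overflow // tasks_per_node)          # ceiling division
    affordable = int(budget // price_per_node_hour)  # what the budget allows
    return min(needed, affordable)

if __name__ == "__main__":
    # 900 queued tasks, each node handles 50 tasks/hour, 10 private nodes,
    # $0.20 per node-hour, $1.50 left in the budget.
    print(nodes_to_rent(900, 50, 10, 0.20, 1.50))    # -> 7 (budget-limited)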

4.3.4 Community clouds

• Community clouds are distributed systems created by integrating the services of
different clouds to address the specific needs of an industry, a community, or a
business sector.
• The National Institute of Standards and Technology (NIST) characterizes community
clouds as follows:

The infrastructure is shared by several organizations and supports a specific community
that has shared concerns (e.g., mission, security requirements, policy, and compliance
considerations). It may be managed by the organizations or a third party and may exist on
premise or off premise.

Figure 4.6 provides a general view of the usage scenario of community clouds, together with
a reference architecture.
• The users of a specific community cloud fall into a well-identified community, sharing
the same concerns or needs; they can be government bodies, industries, or even
simple users, but all of them focus on the same issues for their interaction with the
cloud.
• From an architectural point of view, a community cloud is most likely implemented
over multiple administrative domains. This means that different organizations such as
government bodies, private enterprises, research organizations, and even public
virtual infrastructure providers contribute with their resources to build the cloud
infrastructure.

Candidate sectors for community clouds are as follows:

Media industry.

• In the media industry, companies are looking for low-cost, agile, and simple
solutions to improve the efficiency of content production.
• Most media productions involve an extended ecosystem of partners. In particular,
the creation of digital content is the outcome of a collaborative process that includes
movement of large data, massive compute-intensive rendering tasks, and complex
workflow executions.
• Community clouds can provide a shared environment where services can facilitate
business-to-business collaboration and offer the horsepower in terms of aggregate
bandwidth, CPU, and storage required to efficiently support media production.

Healthcare industry.

• In the healthcare industry, there are different scenarios in which community clouds
could be of use.
• In particular, community clouds can provide a global platform on which to share
information and knowledge without revealing sensitive data maintained within the
private infrastructure.
• The naturally hybrid deployment model of community clouds can easily support the
storing of patient-related data in a private cloud while using the shared infrastructure
for noncritical services and automating processes within hospitals.

Energy and other core industries.

• In these sectors, community clouds can bundle a comprehensive set of solutions
that together vertically address management, deployment, and orchestration of
services and operations.
• Since these industries involve different providers, vendors, and organizations, a
community cloud can provide the right type of infrastructure to create an open and
fair market.

Public sector.
• Legal and political restrictions in the public sector can limit the adoption of public
cloud offerings.
• Moreover, governmental processes involve several institutions and agencies and are
aimed at providing strategic solutions at local, national, and international
administrative levels.
• They involve business-to-administration, citizen-to-administration, and possibly
business-to-business processes.
• Some examples include invoice approval, infrastructure planning, and public
hearings.
• A community cloud can constitute the optimal venue to provide a distributed
environment in which to create a communication platform for performing such
operations.

Scientific research.

• Science clouds are an interesting example of community clouds. In this case, the
common interest driving different organizations sharing a large distributed
infrastructure is scientific computing.
• The term community cloud can also identify a more specific type of cloud that arises
from concern over the control exercised by vendors in cloud computing and that
aspires to combine the principles of digital ecosystems with the case study of cloud
computing.
• Such a community cloud is formed by harnessing the underutilized resources of user
machines and providing an infrastructure in which each user can be at the same time
a consumer, a producer, or a coordinator of the services offered by the cloud.

The benefits of these community clouds are the following:

• Openness. By removing the dependency on cloud vendors, community clouds are open
systems in which fair competition between different solutions can happen.

• Community. Being based on a collective that provides resources and services, the
infrastructure turns out to be more scalable because the system can grow simply by
expanding its user base.

• Graceful failures. Since there is no single provider or vendor in control of the
infrastructure, there is no single point of failure.

• Convenience and control. Within a community cloud there is no conflict between
convenience and control because the cloud is shared and owned by the community, which
makes all the decisions through a collective democratic process.

• Environmental sustainability. The community cloud is supposed to have a smaller carbon
footprint because it harnesses underutilized resources. Moreover, these clouds tend to be
more organic, growing and shrinking in a symbiotic relationship to support the demand of
the community, which in turn sustains them.
Chapter 8: Data-Intensive Computing, MapReduce Programming
(REFERENCE 1 as per guidelines)

• Data-intensive computing focuses on a class of applications that deal with a large
amount of data: large volumes of data that need to be efficiently stored, made
accessible, indexed, and analyzed.
• Distributed computing helps in addressing these challenges by providing more
scalable and efficient storage architectures; however, several challenges in the form
of data representation, efficient algorithms, and scalable infrastructures still need to
be faced.
• MapReduce is a popular programming model for creating data-intensive
applications and deploying them on clouds.

8.1 What is data-intensive computing?

• Data-intensive computing is concerned with the production, manipulation, and
analysis of large-scale data in the range of hundreds of megabytes (MB) to petabytes
(PB) and beyond.
• The term dataset is commonly used to identify a collection of information elements
that is relevant to one or more applications. Datasets are often maintained in
repositories, which are infrastructures supporting the storage, retrieval, and indexing
of large amounts of information.
• To facilitate classification and search, relevant bits of information, called
metadata, are attached to datasets.
• Data-intensive computations occur in many application domains: computational
science, where people conduct scientific simulations and experiments; telescopes
mapping the sky; bioinformatics applications mining databases; earthquake
simulators; and so on.
• Customer data for any telecom company: this volume of information is not only
processed to generate billing statements, but it is also mined to identify scenarios,
trends, and patterns that help these companies provide better service.
• IT giants such as Google are reported to process about 24 petabytes of
information per day and to sort petabytes of data in hours.
• Social networking and gaming are two other sectors in which data-intensive
computing is now a reality.
• Facebook inbox search operations involve crawling about 150 terabytes of data, and
the whole uncompressed data stored by the distributed infrastructure reaches 36
petabytes.
• Zynga, a social gaming platform, moves 1 petabyte of data daily and has been
reported to add 1,000 servers every week to store the data generated by games like
FarmVille and FrontierVille.

8.1.1 Characterizing data-intensive computations

• Data-intensive applications deal with huge volumes of data that also exhibit
compute-intensive properties.
• Figure 8.1 identifies the domain of data-intensive computing in the two upper
quadrants of the graph.
• Datasets are commonly persisted in several formats and distributed across different
locations. Such applications process data in multistep analytical pipelines, including
transformation and fusion stages.
• The processing requirements scale almost linearly with the data size, and the data
can be easily processed in parallel.
• Such applications also need efficient mechanisms for data management, filtering and
fusion, and efficient querying and distribution.

8.1.2 Challenges ahead

• For example, the location of data is crucial, since the need for moving terabytes of
data becomes an obstacle for high-performing computations.
• Data partitioning as well as content replication and scalable algorithms help in
improving the performance of data-intensive applications.
• Open challenges in data-intensive computing identified by Ian Gorton et al. are:
o Scalable algorithms that can search and process massive datasets
o New metadata management technologies that can scale to handle complex,
heterogeneous, and distributed data sources
o Advances in high-performance computing platforms aimed at providing
better support for accessing in-memory multiterabyte data structures
o High-performance, highly reliable, petascale distributed file systems
o Data signature-generation techniques for data reduction and rapid processing
o New approaches to software mobility for delivering algorithms that are able
to move the computation to where the data are located
o Specialized hybrid interconnection architectures that provide better support
for filtering multigigabyte data streams coming from high-speed networks and
scientific instruments
o Flexible and high-performance software integration techniques that facilitate
the combination of software modules running on different platforms to
quickly form analytical pipelines

8.1.3 Historical perspective

Support for data-intensive computations is provided by harnessing storage, networking
technologies, algorithms, and infrastructure software all together.

8.1.3.3 Data clouds and "Big Data"

• Large datasets and massive amounts of data are being produced, mined, and
crunched by companies that provide Internet services such as searching, online
advertising, and social media.
• It is critical for such companies to efficiently analyze these huge datasets because
they constitute a precious source of information about their customers.
• Log analysis is an example of a data-intensive operation that is commonly performed
in this context; companies such as Google have a massive amount of data in the form
of logs that are processed daily using their distributed infrastructure.
• As a result, they settled upon an analytic infrastructure, which differs from the grid-
based infrastructure used by the scientific community.
• Together with the diffusion of cloud computing technologies that support data-
intensive computations, the term Big Data has become popular.
• This term characterizes the nature of data-intensive computations today and
currently identifies datasets that grow so large that they become complex to work
with using on-hand database management tools.
• Big Data problems are found in nonscientific application domains such as weblogs,
radio frequency identification (RFID), sensor networks, social networks, Internet text
and documents, Internet search indexing, call detail records, military surveillance,
medical records, photography archives, video archives, and large-scale e-commerce.
• Other than their massive size, what characterizes all these examples is that new data
are accumulated with time rather than replacing the old data.
• In general, the term Big Data applies to datasets whose size is beyond the
ability of commonly used software tools to capture, manage, and process within a
tolerable elapsed time.
• Therefore, Big Data sizes are a constantly moving target, currently ranging from a few
dozen terabytes to many petabytes of data in a single dataset.
• Cloud technologies support data-intensive computing in several ways:
o By providing a large number of compute instances on demand, which can be
used to process and analyze large datasets in parallel.
o By providing a storage system optimized for keeping large blobs of data and
other distributed data store architectures.
o By providing frameworks and programming APIs optimized for the processing
and management of large amounts of data. These APIs are mostly coupled
with a specific storage infrastructure to optimize the overall performance of
the system.

• A data cloud is a combination of these components.
• An example is the MapReduce framework, which provides the best performance for
leveraging the Google File System on top of Google's large computing infrastructure.
• Another example is the Hadoop system, the most mature, large, and open-source
data cloud. It consists of the Hadoop Distributed File System (HDFS) and Hadoop's
implementation of MapReduce.
• A similar approach is proposed by Sector, which consists of the Sector Distributed
File System (SDFS) and a compute service called Sphere that allows users to execute
arbitrary user-defined functions (UDFs) over the data managed by SDFS.
• Greenplum uses a shared-nothing massively parallel processing (MPP) architecture
based on commodity hardware. The architecture also integrates MapReduce-like
functionality into its platform.
• A similar architecture has been deployed by Aster, which uses an MPP-based data-
warehousing appliance that supports MapReduce and targets 1 PB of data.

8.2 Technologies for data-intensive computing

• Data-intensive computing is concerned with processing large quantities of data.
• Storage systems and programming models are the two technologies that define this
area.

8.2.1 Storage systems

• Traditionally, database management systems constituted the de facto storage
support for several types of applications. Due to the explosion of unstructured data,
the relational model in its original formulation no longer seems to be the preferred
solution.

Some factors contributing to the change from conventional databases are:

• Growing popularity of Big Data. The management of large quantities of data is
common in several fields: scientific computing, enterprise applications, media
entertainment, natural language processing, and social network analysis. The large
volume of data imposes new and more efficient techniques for data management.
• Growing importance of data analytics in the business chain. The management of
data is considered a key element of business profit. This situation arises in popular
social networks such as Facebook, which concentrate their focus on the management
of user profiles, interests, and connections among people. This massive amount of
data, which is constantly mined, requires new technologies and strategies to support
data analytics.
• Presence of data in several forms, not only structured. Data today exhibit a
heterogeneous nature and appear in several forms and formats. Structured data are
constantly growing as a result of the continuous use of traditional enterprise
applications and systems, but at the same time the advances in technology and the
democratization of the Internet as a platform where everyone can pull information
have created a massive amount of information that is unstructured and does not
naturally fit into the relational model.
• New approaches and technologies for computing. Cloud computing promises access
to a massive amount of computing capacity on demand. This allows engineers to
design software systems that incrementally scale to arbitrary degrees of parallelism.
Classical database infrastructures are not designed to provide support to such a
volatile environment.

• In particular, advances in distributed file systems for the management of raw data in
the form of files, distributed object stores, and the spread of the NoSQL movement
constitute the major directions toward support for data-intensive computing.

8.2.1.1 High-performance distributed file systems and storage clouds

• Distributed file systems constitute the primary support for data management.
• They provide an interface whereby to store information in the form of files and later
access them for read and write operations.
• Among the several implementations of file systems, only a few specifically address
the management of huge quantities of data on a large number of nodes. Mostly,
these file systems constitute the data storage support for large computing clusters,
supercomputers, massively parallel architectures, and lately, storage/computing
clouds.

Lustre.

• The Lustre file system is a massively parallel distributed file system that covers the
needs of a small workgroup of clusters up to a large-scale computing cluster.
• The file system is used by several of the Top 500 supercomputing systems, including
the one rated the most powerful supercomputer in the June 2012 list, Sequoia.
• Lustre is designed to provide access to petabytes (PB) of storage and to serve
thousands of clients with an I/O throughput of hundreds of gigabytes per second
(GB/s).
• The system is composed of a metadata server, which contains the metadata about
the file system, and a collection of object storage servers, which are in charge of
providing storage.
• Users access the file system via a POSIX-compliant client, which can be either
mounted as a module in the kernel or used through a library.
• The file system implements a robust failover strategy and recovery mechanism,
making server failures and recoveries transparent to clients.
IBM General Parallel File System (GPFS).

• GPFS is the high-performance distributed file system developed by IBM that provides
support for the RS/6000 supercomputer and Linux computing clusters.
• GPFS is a multiplatform distributed file system built over several years of academic
research and provides advanced recovery mechanisms.
• GPFS is built on the concept of shared disks, in which a collection of disks is attached
to the file system nodes by means of some switching fabric.
• The file system makes this infrastructure transparent to users and stripes large files
over the disk array, replicating portions of each file to ensure high availability.
• By means of this infrastructure, the system is able to support petabytes of storage,
which is accessed at high throughput and without losing consistency of data.
• Compared to other implementations, GPFS distributes the metadata of the entire file
system and provides transparent access to it, thus eliminating a single point of
failure.

Google File System (GFS).

• GFS is the storage infrastructure that supports the execution of distributed
applications in Google's computing cloud.
• The system has been designed to be a fault-tolerant, highly available, distributed file
system built on commodity hardware and standard Linux operating systems.
• GFS specifically addresses Google's needs in terms of distributed storage for
applications, and it has been designed with the following assumptions:
o The system is built on top of commodity hardware that often fails.
o The system stores a modest number of large files; multi-GB files are common
and should be treated efficiently. Small files must be supported, but there
is no need to optimize for them.
o The workloads primarily consist of two kinds of reads: large streaming reads
and small random reads.
o The workloads also have many large, sequential writes that append data to
files.
o High sustained bandwidth is more important than low latency.
• The architecture of the file system is organized into a single master, which contains
the metadata of the entire file system, and a collection of chunk servers, which
provide storage space.
• From a logical point of view the system is composed of a collection of software
daemons, which implement either the master server or the chunk server.
• A file is a collection of chunks, whose size can be configured at the file system
level. Chunks are replicated on multiple nodes in order to tolerate failures.
• Clients look up the master server to identify the specific chunk of a file they want to
access. Once the chunk is identified, the interaction happens directly between the
client and the chunk server (a minimal sketch of this lookup flow is given after this
list).
• Applications interact with the file system through a specific interface supporting the
usual operations for file creation, deletion, read, and write.
• GFS has been conceived by considering that failures in a large distributed
infrastructure are common rather than a rarity; therefore, specific attention has been
given to implementing a highly available, lightweight, and fault-tolerant
infrastructure.
• The potential single point of failure of the single-master architecture has been
addressed by giving the possibility of replicating the master node on any other node
belonging to the infrastructure.
• Moreover, a stateless daemon and extensive logging capabilities facilitate the
system's recovery from failures.
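
The following toy sketch illustrates the master/chunk-server split described above. It is
not Google's API; all class names, method names, and data are invented purely for
illustration.

# Illustrative (not Google's API): a toy master/chunk-server lookup flow.

CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses 64 MB chunks

class ToyMaster:
    """Holds only metadata: file name -> list of (chunk_id, replica servers)."""
    def __init__(self, chunk_table):
        self.chunk_table = chunk_table

    def lookup(self, path, offset):
        chunk_index = offset // CHUNK_SIZE
        return self.chunk_table[path][chunk_index]   # (chunk_id, [servers])

class ToyChunkServer:
    """Holds the actual chunk bytes."""
    def __init__(self, chunks):
        self.chunks = chunks

    def read(self, chunk_id, offset_in_chunk, size):
        data = self.chunks[chunk_id]
        return data[offset_in_chunk:offset_in_chunk + size]

# Client side: ask the master where the chunk is, then talk to a chunk server.
master = ToyMaster({"/logs/access.log": [("c-001", ["cs-a", "cs-b", "cs-c"])]})
servers = {"cs-a": ToyChunkServer({"c-001": b"GET /index.html 200\n" * 100})}

chunk_id, replicas = master.lookup("/logs/access.log", offset=0)
print(servers[replicas[0]].read(chunk_id, 0, 20))   # b'GET /index.html 200\n'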

Sector.

• Sector is the storage cloud that supports the execution of data-intensive applications
defined according to the Sphere framework. It is a user-space file system that can be
deployed on commodity hardware across a wide-area network.
• Compared to other file systems, Sector does not partition a file into blocks but
replicates entire files on multiple nodes, allowing users to customize the
replication strategy for better performance.
• The system's architecture is composed of four kinds of nodes: a security server, one
or more master nodes, slave nodes, and client machines.
• The security server maintains all the information about access control policies for
users and files, whereas master servers coordinate and serve the I/O requests of
clients, which ultimately interact with slave nodes to access files.
• The protocol used to exchange data with slave nodes is UDT, which is a lightweight
connection-oriented protocol optimized for wide-area networks.

Amazon Simple Storage Service (S3).

• Amazon S3 is the online storage service provided by Amazon. Even though its
internal details are not revealed, the system is claimed to support high availability,
reliability, scalability, infinite storage, and low latency at commodity cost.
• The system offers a flat storage space organized into buckets, which are attached to
an Amazon Web Services (AWS) account.
• Each bucket can store multiple objects, each identified by a unique key.
• Objects are identified by unique URLs and exposed through HTTP, thus allowing very
simple get-put semantics (see the sketch after this list).
• Because of the use of HTTP, there is no need for any specific library for accessing the
storage system, and objects can also be retrieved through the BitTorrent protocol.
• Despite its simple semantics, a POSIX-like client library has been developed to mount
S3 buckets as part of the local file system.
• Besides the minimal semantics, security is another limitation of S3.
• The visibility and accessibility of objects are linked to AWS accounts, and the owner
of a bucket can decide to make it visible to other accounts or to the public.
• It is also possible to define authenticated URLs, which provide public access to
anyone for a limited (and configurable) period of time.
• Except for the S3 service, it is possible to sketch a general reference architecture for
all the systems presented that identifies two major roles into which all the nodes can
be classified. Metadata or master nodes contain the information about the location
of files or file chunks, whereas slave nodes are used to provide direct access to the
storage space.
• The architecture is completed by client libraries, which provide a simple interface for
accessing the file system and are, to some extent or completely, compliant with the
POSIX specification.
• Variations of the reference architecture can include the ability to support multiple
masters, to distribute the metadata over multiple nodes, or to easily interchange the
roles of nodes. The most important aspect common to all these different
implementations is the ability to provide fault-tolerant and highly available storage
systems.
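
As a concrete illustration of the bucket/key organization and get-put semantics described
for S3, here is a minimal sketch using the boto3 SDK. The bucket and key names are
placeholders, and credentials are assumed to be already configured in the environment.

# Minimal S3 get/put sketch using boto3 (bucket/key names are placeholders).
import boto3

s3 = boto3.client("s3")

# Put an object: the bucket is a flat namespace, the key identifies the object.
s3.put_object(Bucket="example-bucket", Key="datasets/sample.txt",
              Body=b"hello, storage cloud")

# Get the object back using the same bucket/key pair.
response = s3.get_object(Bucket="example-bucket", Key="datasets/sample.txt")
print(response["Body"].read())  # b'hello, storage cloud'

# An authenticated (pre-signed) URL grants public access for a limited time.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "example-bucket", "Key": "datasets/sample.txt"},
    ExpiresIn=3600,  # seconds
)
print(url)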

8.2.1.2 NoSQL systems

• The term Not Only SQL (NoSQL) was originally coined in 1998 to identify a relational
database that did not expose a SQL interface to manipulate and query data but relied
on a set of UNIX shell scripts and commands to operate on text files containing the
actual data.
• In a very strict sense, NoSQL cannot be considered a relational database, since it is
not a monolithic piece of software organizing information according to the relational
model, but rather a collection of scripts that allow users to manage most of the
simplest and most common database tasks by using text files as information stores.
• Later, in 2009, the term NoSQL was reintroduced with the intent of labeling all those
database management systems that did not use a relational model but provided
simpler and faster alternatives for data manipulation.
• Nowadays, the term NoSQL is a big umbrella encompassing all the storage and
database management systems that differ in some way from the relational model.
Their general philosophy is to overcome the restrictions imposed by the relational
model and to provide more efficient systems.
• This often implies the use of tables without fixed schemas to accommodate a larger
range of data types, or avoiding joins to increase performance and scale horizontally.
• Two main factors have determined the growth of the NoSQL movement: in many
cases simple data models are enough to represent the information used by
applications, and the quantity of information contained in unstructured formats has
grown considerably in the last decade.
• These two factors made software engineers look for alternatives more suitable to the
specific application domains they were working on.
• As a result, several different initiatives have explored the use of nonrelational storage
systems, which differ considerably from each other. A broad classification is reported
by Wikipedia, which distinguishes NoSQL implementations into:
o Document stores (Apache Jackrabbit, Apache CouchDB, SimpleDB,
Terrastore).
o Graph databases (AllegroGraph, Neo4j, FlockDB, Cerebrum).
o Key-value stores. This is a macro classification that is further categorized into
key-value stores on disk, key-value caches in RAM, hierarchical key-value
stores, eventually consistent key-value stores, and ordered key-value stores.
o Multivalue databases (OpenQM, Rocket U2, OpenInsight).
o Object databases (ObjectStore, JADE, ZODB).
o Tabular stores (Google BigTable, Hadoop HBase, Hypertable).
o Tuple stores (Apache River).

Some prominent implementations that support data-intensive applications are described
below.

Apache CouchDB and MongoDB.

• Apache CouchDB and MongoDB are two examples of document stores.
• Both provide a schema-less store in which the primary objects are documents
organized into a collection of key-value fields.
• The value of each field can be of type string, integer, float, date, or an array of values.
• The databases expose a RESTful interface and represent data in JSON format.
• Both allow querying and indexing data by using the MapReduce programming model,
expose JavaScript as a base language for data querying and manipulation rather than
SQL, and support large files as documents.
• From an infrastructure point of view, the two systems support data replication and
high availability.
• CouchDB ensures ACID properties on data.
• MongoDB supports sharding, which is the ability to distribute the content of a
collection among different nodes.
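
As a small illustration of the document-store model (schema-less documents made of
key-value fields), here is a sketch using the PyMongo driver for MongoDB. The database,
collection, and field names are invented for the example, and a local MongoDB server is
assumed to be running.

# Document-store sketch with PyMongo (database/collection/field names invented).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
courses = client["university"]["courses"]   # schema-less collection

# Documents are just collections of key-value fields; no fixed schema.
courses.insert_one({
    "code": "CS-601",
    "title": "Cloud Computing",
    "credits": 4,
    "topics": ["IaaS", "PaaS", "MapReduce"],   # array-valued field
})

# Query by field value; indexes can speed this up.
doc = courses.find_one({"code": "CS-601"})
print(doc["title"], doc["topics"])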

Amazon Dynamo.

• Dynamo is the distributed key-value store that supports the management of
information for several of the business services offered by Amazon Inc.
• The main goal of Dynamo is to provide an incrementally scalable and highly available
storage system. This goal helps in achieving reliability at massive scale, where
thousands of servers and network components build an infrastructure serving 10
million requests per day. Dynamo provides a simplified interface based on get/put
semantics, where objects are stored and retrieved with a unique identifier (key).
• The main goal of achieving an extremely reliable infrastructure has imposed some
constraints on the properties of the system. For example, ACID properties on data
have been sacrificed in favor of a more reliable and efficient infrastructure. This
creates what is called an eventually consistent model, which means that in the
long term all the users will see the same data.
• The architecture of the Dynamo system, shown in Figure 8.3, is composed of a
collection of storage peers organized in a ring that share the key space for a given
application.
• The key space is partitioned among the storage peers, and the keys are replicated
across the ring, avoiding adjacent peers.
• Each peer is configured with access to a local storage facility where original objects
and replicas are stored.
• Furthermore, each node provides facilities for distributing updates among the
ring and for detecting failures and unreachable nodes.
• With some relaxation of the consistency model applied to replicas and the use of
object versioning, Dynamo implements the capability of being an always-writable
store, where consistency of data is resolved in the background.
• The downside of such an approach is the simplicity of the storage model, which
requires applications to build their own data models on top of the simple building
blocks provided by the store. For example, there are no referential integrity
constraints, relationships are not embedded in the storage model, and therefore join
operations are not supported.
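
The ring-based partitioning of the key space can be illustrated with a toy consistent-hashing
sketch in Python. This is not Dynamo's implementation (which adds virtual nodes, vector
clocks, and hinted handoff); the node names and key are made up, and it only shows how a
key maps to a coordinator peer and N replicas on the ring.

# Toy illustration of a Dynamo-style ring: not Amazon's implementation.
import hashlib
from bisect import bisect_right

def ring_position(value: str) -> int:
    """Hash a node name or key onto a circular key space."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

peers = ["node-a", "node-b", "node-c", "node-d"]
ring = sorted((ring_position(p), p) for p in peers)

def preference_list(key: str, n_replicas: int = 3):
    """Coordinator = first peer clockwise from the key; replicas = next peers."""
    pos = ring_position(key)
    start = bisect_right(ring, (pos, chr(0x10FFFF))) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(n_replicas)]

print(preference_list("user:4711"))   # e.g. ['node-c', 'node-d', 'node-a']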

Google Bigtable.

• Bigtable is the distributed storage system designed to scale up to petabytes of data
across thousands of servers.
• Bigtable provides storage support for several Google applications that expose
different types of workload: from throughput-oriented batch-processing jobs to
latency-sensitive serving of data to end users.
• Bigtable's key design goals are wide applicability, scalability, high performance, and
high availability. To achieve these goals, Bigtable organizes data storage in tables
whose rows are distributed over the distributed file system supporting the
middleware, which is the Google File System.
• From a logical point of view, a table is a multidimensional sorted map indexed by a
key that is represented by a string of arbitrary length (a toy illustration of this data
model is given at the end of this section).
• A table is organized into rows and columns; columns can be grouped into column
families, which allow for specific optimizations for better access control, storage, and
indexing of data. A simple data access model constitutes the interface for client
applications, which can address data at the granularity level of a single column of a
row.
• Moreover, each column value is stored in multiple versions that can be automatically
time-stamped by Bigtable or by the client applications.
• Besides the basic data access, Bigtable APIs also allow more complex operations such
as single-row transactions and advanced data manipulation by means of the Sawzall
scripting language or the MapReduce APIs.
• Sawzall is an interpreted procedural programming language developed at Google for
the manipulation of large quantities of tabular data. It includes specific capabilities
for supporting statistical aggregation of values read or computed from the input and
other features that simplify the parallel processing of petabytes of data.
• Figure 8.4 gives an overview of the infrastructure that enables Bigtable.
• Bigtable identifies two kinds of processes: master processes and tablet server
processes.
• A tablet server is responsible for serving the requests for a given tablet, which is a
contiguous partition of rows of a table.
• Each server can manage multiple tablets (commonly from 10 to 1,000).
• The master server is responsible for keeping track of the status of the tablet servers
and of the allocation of tablets to tablet servers.
• The master constantly monitors the tablet servers to check whether they are alive,
and in case they are not reachable, the allocated tablets are reassigned and
eventually repartitioned to other servers.
• Chubby, a distributed, highly available, and persistent lock service, supports the
activity of the master and tablet servers.
• System monitoring and data access are filtered through Chubby, which is also
responsible for managing replicas and providing consistency among them.
• At the very bottom layer, the data are stored in the Google File System in the form of
files, and all the update operations are logged into files for easy recovery of
data in case of failures or when tablets need to be reassigned to other servers.
• Bigtable uses a specific file format for storing the data of a tablet, which can be
compressed to optimize the access and storage of data.
• Bigtable serves as a storage back-end for 60 applications (such as Google Personalized
Search, Google Analytics, Google Finance, and Google Earth) and manages petabytes
of data.
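
To make the "multidimensional sorted map" description concrete, here is a toy, in-memory
illustration of the Bigtable logical data model (row key, "family:qualifier" column,
timestamp). It is a pedagogical sketch only, not the Bigtable API, and the example rows and
columns are invented.

# Toy, in-memory version of Bigtable's logical model: not the real API.
# The map is indexed by (row key, "family:qualifier", timestamp) -> value;
# the real system additionally keeps rows sorted by key and shards them
# into tablets stored on GFS.

table = {}

def put(row, column, value, timestamp):
    table.setdefault(row, {}).setdefault(column, {})[timestamp] = value

def get_latest(row, column):
    versions = table[row][column]        # {timestamp: value}
    return versions[max(versions)]       # newest version wins

put("com.example.www", "contents:html", "<html>v1</html>", timestamp=100)
put("com.example.www", "contents:html", "<html>v2</html>", timestamp=200)
put("com.example.www", "anchor:cnn.com", "Example", timestamp=150)

print(get_latest("com.example.www", "contents:html"))   # -> <html>v2</html>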

Apache Cassandra.

• Cassandra is a distributed object store for managing large amounts of structured
data spread across many commodity servers.
• The system is designed to avoid a single point of failure and to offer a highly reliable
service. Cassandra was initially developed by Facebook; it then became part of the
Apache incubator initiative. Currently, it provides storage support for several very
large Web applications such as Facebook itself, Digg, and Twitter.
• Cassandra is defined as a second-generation distributed database that builds on the
concepts of Amazon Dynamo, from which it takes the fully distributed design, and
Google Bigtable, from which it inherits the "column family" concept.
• The data model exposed by Cassandra is based on the concept of a table that is
implemented as a distributed multidimensional map indexed by a key.
• The value corresponding to a key is a highly structured object and constitutes the
row of a table. Cassandra organizes the row of a table into columns, and sets of
columns can be grouped into column families.
• The APIs provided by the system to access and manipulate the data are very simple:
insertion, retrieval, and deletion.
• Insertion is performed at the row level; retrieval and deletion can operate at the
column level.
• In terms of infrastructure, Cassandra is very similar to Dynamo. It has been
designed for incremental scaling, and it organizes the collection of nodes sharing a
key space into a ring. Each node manages multiple, discontinuous portions of the
key space and replicates its data on up to N other nodes.
• Replication uses different strategies: it can be rack aware, datacenter aware, or rack
unaware, meaning that the policies can take into account whether the replication
needs to be made within the same cluster or datacenter, or whether the geo-location
of nodes should not be considered.
• The local file system of each node is used for data persistence, and Cassandra makes
extensive use of commit logs, which make the system able to recover from transient
failures (this write path is sketched after this list).
• Each write operation is applied in memory only after it has been logged on disk so
that it can easily be reproduced in case of failures.
• When the data in memory exceed a specified size, they are dumped to disk.
• Read operations are performed in memory first and then on disk.
• To speed up the process, each file includes a summary of the keys it contains so that
it is possible to avoid unnecessary file scanning when searching for a key.
• The largest Cassandra deployment known at the time manages 100 TB of data
distributed over a cluster of 150 machines.
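
The commit-log and in-memory behavior described above (log the write on disk, apply it in
memory, flush to disk when the in-memory table grows too large) can be sketched as
follows. This is a generic illustration of a log-structured write path, not Cassandra's actual
code; the file names and threshold are invented.

# Generic sketch of a log-structured write path (commit log + memtable flush).
# Not Cassandra's actual implementation; names and threshold are invented.
import json, os

COMMIT_LOG = "commit.log"
FLUSH_THRESHOLD = 3            # flush after this many in-memory rows
memtable = {}                  # in-memory column values, keyed by row

def write(row_key, column, value):
    # 1. Durably log the mutation before acknowledging it.
    with open(COMMIT_LOG, "a") as log:
        log.write(json.dumps({"row": row_key, "col": column, "val": value}) + "\n")
    # 2. Apply it in memory.
    memtable.setdefault(row_key, {})[column] = value
    # 3. Dump to disk once the memtable grows past the threshold.
    if len(memtable) >= FLUSH_THRESHOLD:
        flush()

def flush():
    sstable_name = f"sstable-{len(os.listdir('.'))}.json"
    with open(sstable_name, "w") as out:
        json.dump(memtable, out)   # an immutable on-disk table
    memtable.clear()

write("user:1", "name", "Ada")
write("user:2", "name", "Alan")
write("user:3", "name", "Grace")   # triggers a flush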

Hadoop HBase.

• HBase is the distributed database that supports the storage needs of the Hadoop
distributed programming platform.
• HBase is designed by taking inspiration from Google Bigtable; its main goal is to offer
real-time read/write operations for tables with billions of rows and millions of
columns by leveraging clusters of commodity hardware.
• The internal architecture and logical model of HBase are very similar to those of
Google Bigtable, and the entire system is backed by the Hadoop Distributed File
System (HDFS), which mimics the structure and services of GFS.

8.2.2 Programming platforms

• Platforms for programming data-intensive applications provide abstractions that help
express the computation over large quantities of information and runtime systems
able to efficiently manage huge volumes of data.
• Distributed workflows have often been used to analyze and process large amounts of
data. This approach introduced a plethora of frameworks for workflow management
systems, which eventually incorporated capabilities to leverage the elastic features
offered by cloud computing. These systems are fundamentally based on the
abstraction of a task, which puts a big burden on the developer, who needs to deal
with data management and, often, data transfer issues.
• Programming platforms for data-intensive computing provide higher-level
abstractions, which focus on the processing of data and move the management of
transfers into the runtime system, thus making the data always available where
needed. This is the approach followed by the MapReduce programming platform,
which expresses the computation in the form of two simple functions, map and
reduce, and hides the complexities of managing large and numerous data files in
the distributed file system supporting the platform.

8.2.2.1 The MapReduce programming model

• MapReduce is a programming platform Google introduced for processing large
quantities of data. It expresses the computational logic of an application in two
simple functions: map and reduce.
• Data transfer and management are completely handled by the distributed storage
infrastructure (i.e., the Google File System), which is in charge of providing access to
data, replicating files, and eventually moving them where needed.
• Therefore, developers are provided with an interface that presents data at a higher
level: as a collection of key-value pairs.
• The computation of MapReduce applications is then organized into a workflow of
map and reduce operations that is entirely controlled by the runtime system;
developers need only specify how the map and reduce functions operate on the key-
value pairs.
• More precisely, the MapReduce model is expressed in the form of two functions,
which are defined as follows:

map(k1, v1) -> list(k2, v2)
reduce(k2, list(v2)) -> list(v2)

• The map function reads a key-value pair and produces a list of key-value pairs of
a different type.
• The reduce function reads a pair composed of a key and a list of values and produces
a list of values of the same type.
• The types (k1, v1, k2, v2) used in the expression of the two functions provide hints as
to how the functions are connected and executed to carry out the computation of a
MapReduce job: the output of map tasks is aggregated together by grouping the
values according to their corresponding keys and constitutes the input of reduce
tasks which, for each of the keys found, reduce the list of attached values to a single
value.
• Therefore, the input of a MapReduce computation is expressed as a collection of key-
value pairs <k1, v1>, and the final output is represented by a list of values: list(v2).
• Figure 8.5 depicts a reference workflow characterizing MapReduce computations. As
shown, the user submits a collection of files that are expressed in the form of a list of
<k1, v1> pairs and specifies the map and reduce functions.
• These files are entered into the distributed file system that supports MapReduce and,
if necessary, partitioned in order to be the input of map tasks.
• Map tasks generate intermediate files that store collections of <k2, list(v2)> pairs,
and these files are saved into the distributed file system.
• The MapReduce runtime might eventually aggregate the values corresponding to the
same keys. These files constitute the input of reduce tasks, which finally produce
output files in the form of list(v2).
• The operation performed by reduce tasks is generally expressed as an aggregation of
all the values that are mapped by a specific key. The number of map and reduce tasks
to create, the way files are partitioned with respect to these tasks, and the number
of map tasks connected to a single reduce task are the responsibility of the
MapReduce runtime.
• In addition, the way files are stored and moved is the responsibility of the distributed
file system that supports MapReduce.
• The computation model expressed by MapReduce is very straightforward and has
proven successful in the case of Google, where the majority of the information that
needs to be processed is stored in textual form and is represented by Web pages or
log files. A small, self-contained Python simulation of this map/group/reduce flow is
given below.
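
The following is a minimal, single-process Python simulation of the model (not Hadoop or
Google's runtime): a map function, a shuffle step that groups intermediate values by key,
and a reduce function, demonstrated on word counting.

# Minimal single-process simulation of the MapReduce model (word count).
# This only illustrates the map -> group-by-key -> reduce flow; a real
# runtime distributes tasks, partitions files, and handles failures.
from collections import defaultdict

def map_fn(key, value):
    """(document name, document text) -> list of (word, 1) pairs."""
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    """(word, list of counts) -> total count for that word."""
    return [sum(values)]

def run_mapreduce(records, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for k1, v1 in records:                     # map phase
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)        # shuffle: group by key
    return {k2: reduce_fn(k2, vs) for k2, vs in intermediate.items()}

docs = [("doc1", "the cloud stores the data"), ("doc2", "the data grows")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'the': [3], 'cloud': [1], 'stores': [1], 'data': [2], 'grows': [1]}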

Some examples that show the flexibility of MapReduce are the following:

Distributed grep. The grep operation, which performs the recognition of patterns within text
streams, is performed across a wide set of files. MapReduce is leveraged to provide a
parallel and faster execution of this operation. In this case, the input file is a plain-text file,
and the map function emits a line into the output each time it recognizes the given pattern.
The reduce task aggregates all the lines emitted by the map tasks into a single file.

Count of URL-access frequency. MapReduce is used to distribute the execution of Web
server log parsing. In this case, the map function takes as input the log of a Web server and
emits into the output file a key-value pair <URL, 1> for each page access recorded in the log.
The reduce function aggregates all these lines by the corresponding URL, thus summing the
single accesses, and outputs a <URL, total-count> pair.
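
Expressed as map and reduce functions, the URL-access-frequency example might look like
the sketch below; the log line format and field positions are assumptions made purely for
illustration.

# Map/reduce pair for URL-access frequency; the log line format is assumed
# to be "<client> <url> <status>", which is an illustrative simplification.
from collections import defaultdict

def map_fn(filename, log_text):
    return [(line.split()[1], 1) for line in log_text.splitlines() if line.strip()]

def reduce_fn(url, counts):
    return [(url, sum(counts))]

# Tiny driver standing in for the MapReduce runtime.
log = "10.0.0.1 /index.html 200\n10.0.0.2 /index.html 200\n10.0.0.1 /about.html 200"
grouped = defaultdict(list)
for url, one in map_fn("access.log", log):
    grouped[url].append(one)
for url, counts in grouped.items():
    print(reduce_fn(url, counts))   # [('/index.html', 2)] then [('/about.html', 1)]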

Reverse Web-link graph. The reverse Web-link graph keeps track of all the possible Web
pages that might lead to a given link. In this case the input files are simple HTML pages that
are scanned by map tasks, emitting <target, source> pairs for each of the links found in the
Web page source. The reduce task will collate all the pairs that have the same target into a
<target, list(source)> pair. The final result is one or more files containing these mappings.

Term vector per host. A term vector recaps the most important words occurring in a set of
documents in the form of a list(<word, frequency>), where the number of occurrences of a
word is taken as a measure of its importance. MapReduce is used to provide a mapping
between the origin of a set of documents, obtained as the host component of the URL of a
document, and the corresponding term vector. In this case, the map task creates a pair
<host, term-vector> for each text document retrieved, and the reduce task aggregates the
term vectors corresponding to documents retrieved from the same host.

Inverted index. The inverted index contains information about the presence of words in
documents. This information is useful for fast full-text searches compared to direct
document scans. In this case, the map task takes as input a document, and for each
document it emits a collection of <word, document-id> pairs. The reduce function
aggregates the occurrences of the same word, producing a pair <word, list(document-id)>.

Distributed sort. In this case, MapReduce is used to parallelize the execution of a sort
operation over a large number of records. This application mostly relies on the properties of
the MapReduce runtime, which sorts and creates partitions of the intermediate files, rather
than on the operations performed in the map and reduce tasks. Indeed, these are very
simple: the map task extracts the key from a record and emits a <key, record> pair for each
record; the reduce task simply copies through all the pairs. The actual sorting process is
performed by the MapReduce runtime, which emits and partitions the key-value pairs by
ordering them according to the key.

• An interesting example is the application of MapReduce in the field of machine
learning, where statistical algorithms such as Support Vector Machines (SVM), Linear
Regression (LR), Naive Bayes (NB), and Neural Networks (NN) are expressed in the
form of map and reduce functions.
• Other interesting applications can be found in the field of compute-intensive
applications, such as the computation of Pi with a high degree of precision. It has
been reported that the Yahoo! Hadoop cluster has been used to compute the
(10^15 + 1)st bit of Pi. Hadoop is an open-source implementation of the MapReduce
platform.

In general, any computation that can be expressed in the form of two major stages can be
represented in terms of a MapReduce computation. These stages are:

Analysis. This phase operates directly on the data input file and corresponds to the
operation performed by the map task. The computation at this stage is expected to be
embarrassingly parallel, since map tasks are executed without any sequencing or ordering.

Aggregation. This phase operates on the intermediate results and is characterized by
operations that are aimed at aggregating, summing, and/or elaborating the data obtained at
the previous stage to present the data in their final form. This is the task performed by the
reduce function.

• Adaptations to this model are mostly concerned with identifying the appropriate
keys, creating reasonable keys when the original problem does not have such a
model, and finding ways to partition the computation between the map and reduce
functions.
• Moreover, more complex algorithms can be decomposed into multiple MapReduce
programs, where the output of one program constitutes the input of the following
program.
• Figure 8.6 gives a more complete overview of a MapReduce infrastructure, according
to the implementation proposed by Google.
• As depicted, the user submits the execution of MapReduce jobs by using the client
libraries, which are in charge of submitting the input data files, registering the map
and reduce functions, and returning control to the user once the job is completed.
• A generic distributed infrastructure (i.e., a cluster) equipped with job-scheduling
capabilities and distributed storage can be used to run MapReduce applications.
• Two different kinds of processes are run on the distributed infrastructure: a master
process and worker processes.
• The master process is in charge of controlling the execution of map and reduce tasks
and of partitioning and reorganizing the intermediate output produced by the map
tasks in order to feed the reduce tasks. The worker processes are used to host the
execution of map and reduce tasks and provide basic I/O facilities that are used to
interface the map and reduce tasks with input and output files.
• In a MapReduce computation, input files are initially divided into splits (generally 16
to 64 MB) and stored in the distributed file system. The master process generates the
map tasks and assigns input splits to each of them by balancing the load.
• Worker processes have input and output buffers that are used to optimize the
performance of map and reduce tasks.
• In particular, output buffers for map tasks are periodically dumped to disk to create
intermediate files. Intermediate files are partitioned using a user-defined function to
evenly split the output of map tasks (a minimal sketch of such a partition function is
given after this list).
• The locations of these pairs are then notified to the master process, which forwards
this information to the reduce tasks, which are able to collect the required input via a
remote procedure call in order to read from the map tasks' local storage.
• The key range is then sorted and all the identical keys are grouped together. Finally,
the reduce task is executed to produce the final output, which is stored in the global
file system. This process is completely automatic; users may control it through
configuration parameters that allow specifying (besides the map and reduce
functions) the number of map tasks, the number of partitions into which to separate
the final output, and the partition function for the intermediate key range.
• The MapReduce runtime ensures reliable execution of applications by providing a
fault-tolerant infrastructure.
• Failures of both master and worker processes are handled, as are machine failures
that make intermediate outputs inaccessible. Worker failures are handled by
rescheduling map tasks somewhere else. This is also the technique used to address
machine failures, since in that case the valid intermediate output of map tasks has
become inaccessible.
• Master process failure is instead addressed using checkpointing, which allows
restarting the MapReduce job with a minimum loss of data and computation.
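
A typical default partition function simply hashes the intermediate key modulo the number
of reduce tasks, as in this short sketch; the hypothetical host-based variant illustrates why a
user-defined partition function can be useful (e.g., sending all URLs of the same host to the
same reducer). Key names and the value of R are illustrative.

# Default-style partitioning of intermediate keys across R reduce tasks,
# plus a hypothetical user-defined variant (hostname-based) for comparison.
from urllib.parse import urlparse
import zlib

R = 4  # number of reduce tasks / output partitions

def default_partition(key: str) -> int:
    # stable hash (crc32) so the same key always reaches the same reducer
    return zlib.crc32(key.encode()) % R

def host_partition(url_key: str) -> int:
    # user-defined: group intermediate keys by host instead of full URL
    return zlib.crc32(urlparse(url_key).netloc.encode()) % R

for k in ["http://a.org/x", "http://a.org/y", "http://b.org/x"]:
    print(k, default_partition(k), host_partition(k))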

8.2.2.2 Variations and extensions of MapReduce

• The MapReduce model exhibits limitations, mostly due to the fact that the
abstractions provided to process data are very simple, and complex problems might
require considerable effort to be represented in terms of map and reduce functions
only.
• Therefore, a series of extensions to and variations of the original MapReduce model
have been proposed.
• They aim at extending the MapReduce application space and providing developers
with an easier interface for designing distributed algorithms.

Hadoop. Apache Hadoop is a collection of software projects for reliable and scalable
distributed computing. Taken together, the entire collection is an open-source
implementation of the MapReduce framework supported by a GFS-like distributed file
system. The initiative consists mostly of two projects: the Hadoop Distributed File System
(HDFS) and Hadoop MapReduce. The former is an implementation of the Google File
System; the latter provides the same features and abstractions as Google MapReduce.
Initially developed and supported by Yahoo!, Hadoop now constitutes the most mature and
large data cloud application and has a very robust community of developers and users
supporting it. Yahoo! now runs the world's largest Hadoop cluster, composed of 40,000
machines and more than 300,000 cores, made available to academic institutions all over
the world. Besides the core projects of Hadoop, a collection of other related projects
provides services for distributed computing.

Pig. Pig is a platform that allows the analysis of large datasets. Developed as an Apache
project, Pig consists of a high-level language for expressing data analysis programs, coupled
with infrastructure for evaluating these programs. The Pig infrastructure layer consists of a
compiler for the high-level language that produces a sequence of MapReduce jobs that can
be run on top of distributed infrastructures such as Hadoop. Developers can express their
data analysis programs in a textual language called Pig Latin, which exposes a SQL-like
interface and is characterized by major expressiveness, reduced programming effort, and a
familiar interface with respect to MapReduce.
Chapter 2: Assessing the Value Proposition
(REFERENCE 2 as per guidelines)

• The various attributes of cloud computing that make it a unique service are
scalability, elasticity, low barrier to entry, and a utility type of delivery.
• Cloud computing is particularly valuable because it shifts capital expenditures into
operating expenditures. This has the benefit of decoupling growth from cash on hand
or from requiring access to capital. It also shifts risk away from an organization and
onto the cloud provider.
• Service Level Agreements (SLAs) are an important aspect of cloud computing. They
are essentially your working contract with any provider.

Measuring the Cloud's Value

On the Cloud

• Any application or process that benefits from economies of scale, commoditization of
assets, and conformance to programming standards benefits from the application of
cloud computing.
• Applications that work well with cloud computing are the ones referred to as "low
touch" applications; they tend to be applications that have low margins and usually
low risk.

On Premises

• Any application or process that requires a completely customized solution, imposes a
high degree of specialization, and requires access to proprietary technology is going
to expose the limits of cloud computing rather quickly.
• The "high touch" applications that come with high margins require committed
resources and pose more of a risk; those applications are best done on premises.

Classifica9on based on Service model

• A cloud is defined as the combinaBon of the infrastructure of a datacenter with the


ability to provision hardware and soYware. A service that concentrates on hardware
follows the Infrastructure as a Service (IaaS) model.
• When you add a soYware stack, such as an operaBng system and applicaBons to the
service, the model shiYs to the SoYware as a Service (SaaS) model.
• When the service requires the client to use a complete
hardware/soYware/applicaBon stack, it is using the most refined and restricBve
service model, called the Pla^orm as a Service (PaaS) model.

Classification based on Deployment model


• A cloud is an infrastructure that can be partitioned and provisioned, and resources are pooled and virtualized.
• If the cloud is available to the public on a pay-as-you-go basis, then the cloud is a public cloud, and the service is described as a utility.
• If the cloud is captive in an organization's infrastructure (network), it is referred to as a private cloud.
• When you mix public and private clouds together, you have a hybrid cloud.

Unique characteristics of an ideal cloud computing model

• Scalability: You have access to unlimited compute resources as needed. This feature obviates the need for capacity planning and provisioning. It also enables batch processing, which greatly speeds up processing-intensive applications.
• Elasticity: You have the ability to right-size resources as required. This feature allows you to optimize your system and capture all possible transactions.
• Low barrier to entry: You can gain access to systems for a small investment. This feature offers global resources to small ventures and provides the ability to experiment with little risk.
• Utility: A pay-as-you-go model matches resources to need on an ongoing basis. This eliminates waste and has the added benefit of shifting risk away from the client.

It is the construction of large datacenters running commodity hardware that has enabled cloud computing to gain traction. These datacenters gain access to low-cost electricity, high-bandwidth network pipes, and low-cost commodity hardware and software, which, taken together, represent an economy of scale that allows cloud providers to amortize their investment and retain a profit.

The virtualization of pooled resources (processors or compute engines, storage, and network connectivity) optimizes these investments and allows the cloud provider to pass along these economies to customers. Pooling also blurs the difference between a small deployment and a large one, because scale becomes tied only to demand.

Companies become cloud computing providers for several reasons:

• Profit: The economies of scale can make this a profitable business.
• Optimization: The infrastructure already exists and isn't fully utilized. This was certainly the case for Amazon Web Services.
• Strategic: A cloud computing platform extends the company's products and defends its franchise. This is the case for Microsoft's Windows Azure Platform.
• Extension: A branded cloud computing platform can extend customer relationships by offering additional service options. This is the case with IBM Global Services and the various IBM cloud services.
• Presence: Establish a presence in a market before a large competitor can emerge. Google App Engine allows a developer to scale an application immediately. For Google, its office applications can be rolled out quickly and to large audiences.
• Platform: A cloud computing provider can become a hub at the center of many ISVs' (Independent Software Vendor) offerings. The customer relationship management provider Salesforce.com has a development platform called Force.com that is a PaaS offering.

Cloud computing is still in its infancy, but trends in adoption are already evident. In his white paper "Realizing the Value Proposition of Cloud Computing: CIO's Enterprise IT Strategy for the Cloud," Jitendra Pal Thethi, a Principal Architect in Infosys' Microsoft Technology Group, lists the following business types as the top 10 adopters of cloud computing. As a group, early adopters are characterized by their need for ubiquity and access to large data sets:
1. Messaging and team collaboration applications
2. Cross-enterprise integration projects
3. Infrastructure consolidation, server, and desktop virtualization efforts
4. Web 2.0 and social strategy companies
5. Web content delivery services
6. Data analytics and computation
7. Mobility applications for the enterprise
8. CRM applications
9. Experimental deployments, test bed labs, and development efforts
10. Backup and archival storage

The nature of cloud computing should provide us with new classes of applications, some of which are already emerging. Because Wide Area Network (WAN) bandwidth is one of the current bottlenecks for distributed computing, one of the major areas of interest in cloud computing is establishing content delivery networks (CDNs). These solutions are also called edge networks because they cache content geographically.

Due to its scalability, cloud computing provides a means to do high-performance parallel batch processing that wasn't available to many organizations before. If a company must perform a complex data analysis that might take a single server a month, then with cloud computing it might launch 100 virtual machine instances and complete the analysis in around 8 hours for roughly the same cost.

Processor-intensive applications that users currently run on their desktops, such as mathematical simulations in Mathematica and MATLAB, graphic rendering in RenderMan, and long encoding/decoding tasks, are other examples of applications that could benefit from parallel batch processing launched directly from the desktop. The economics must work out, but this approach is completely new for most people and is a game changer.

The relative ubiquity of cloud computing systems also enables emerging classes of interactive mobile applications. The large array of sensors, diagnostic, and mobile devices, all of which both generate and consume data, requires the use of large data sets and on-demand processing, which are a good fit for the cloud computing model.

Cloud computing can also provide access to multiple data sets that support layered forms of information, the kind of information you see in a mashup, such as the Panoramio layer of information in Google Earth.

The laws of Cloudonomics

A summary of Weinman's "10 Laws of Cloudonomics" and his interpretation follows:


1. Utility services cost less even though they cost more.
Utilities charge a premium for their services, but customers save money by not paying for services they aren't using.
2. On-demand trumps forecasting.
The ability to provision and tear down (de-provision) resources captures revenue and lowers costs.
3. The peak of the sum is never greater than the sum of the peaks.
A cloud can deploy less capacity because the peaks of individual tenants in a shared system are averaged over time across the group of tenants.
4. Aggregate demand is smoother than individual demand.
Multi-tenancy also tends to average out the variability intrinsic in individual demand, because the coefficient of variation of the sum of random variables is less than or equal to that of the individual variables. With more predictable demand and less variation, clouds can run at higher utilization rates than captive systems, which allows cloud systems to operate at higher efficiencies and lower costs.
5. Average unit costs are reduced by distributing fixed costs over more units of output.
Cloud vendors have a size that allows them to purchase resources at significantly reduced prices. (This feature was described in the previous section.)
6. Superiority in numbers is the most important factor in the result of a combat (Clausewitz).
Weinman argues that a large cloud's size gives it the ability to repel botnets and DDoS attacks better than smaller systems can.
7. Space-time is a continuum (Einstein/Minkowski).
The ability to accomplish a task in the cloud using parallel processing allows real-time businesses to respond more quickly to business conditions and accelerates decision making, providing a measurable advantage.
8. Dispersion is the inverse square of latency.
Reducing latency, the delay in getting a response to a request, requires the large-scale, multi-site deployments that are characteristic of cloud providers. Cutting latency in half requires four times the number of nodes in a system.
9. Don't put all your eggs in one basket.
The reliability of a system with n redundant components, each of reliability r, is 1 - (1 - r)^n. Therefore, when a single datacenter achieves a reliability of 99 percent, two redundant datacenters have a reliability of 99.99 percent (four nines) and three redundant datacenters achieve 99.9999 percent (six nines); see the short calculation sketched after this list. Large cloud providers with geographically dispersed sites worldwide therefore achieve reliability rates that are hard for private systems to match.
10. An object at rest tends to stay at rest (Newton).
Private datacenters tend to be located in places where the company or unit was founded or acquired. Cloud providers can site their datacenters in what are called "greenfield sites." A greenfield site is one that is environmentally friendly: a location on a network backbone, with cheap access to power and cooling, where land is inexpensive and the environmental impact is low. A network backbone is a very high-capacity network connection. On the Internet, a backbone consists of the high-capacity routes and routers that are typically operated by an individual service provider such as a government or commercial entity.
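
A minimal sketch of the Law 9 arithmetic in Python (illustrative only; the 99 percent figure is the example value used above):

def combined_reliability(r, n):
    # n redundant datacenters, each with reliability r
    return 1 - (1 - r) ** n

for n in (1, 2, 3):
    print(n, "datacenter(s):", f"{combined_reliability(0.99, n):.6f}")
# 1 datacenter(s): 0.990000
# 2 datacenter(s): 0.999900   (four nines)
# 3 datacenter(s): 0.999999   (six nines)
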
Cloud computing obstacles

Resource limits are exposed at peak conditions of the utility itself. As we all know, power utilities suffer brownouts and outages when the temperature soars, and cloud computing providers are no different. You see these outages on peak computing days such as Cyber Monday, the Monday after Thanksgiving in the United States when online Christmas sales traditionally start.

The illusion of a low barrier to entry may be pierced by an inconsistent pricing scheme that makes scaling more expensive than it should be. You can see this limit in the nonlinearity of pricing associated with "extra large" machine instances versus their "standard" size counterparts. Additionally, the low barrier to entry can be accompanied by a low barrier to provisioning: a provisioning error can lead to vast costs.

Cloud computing vendors run very reliable networks. Often, cloud data is load-balanced between virtual systems and replicated between sites. However, even cloud providers experience outages, and in the cloud it is common for individual resources, such as machine instances, to fail. Except for tightly managed PaaS cloud providers, the burden of resource management is still in the hands of the user, but the user is often given only limited or immature management tools to address these issues.

Table 2.1 summarizes the various obstacles and challenges that cloud computing faces.
Behavioral factors relating to cloud adoption

The “10 Laws of Behavioral Cloudonomics” are summarized below:

1. People are risk averse and loss averse.
Ariely argues that losses are more painful than gains are pleasurable. Cloud initiatives may therefore cause the concerns of adoption to be weighed more heavily than the benefits of improved total costs and greater agility.
2. People have a flat-rate bias.
Loss aversion expresses itself as a preference for flat-rate plans, where risk is psychologically minimized, over usage-based plans where costs are actually lower.
3. People have the need to control their environment and remain anonymous.
The need for environmental control is a primal directive. Loss of control leads to "learned helplessness" and shorter life spans.
4. People fear change.
Uncertainty leads to fear, and fear leads to inertia. This is as true for cloud computing as it is for investing in the stock market.
5. People value what they own more than what they are given.
This is called the endowment effect: a predilection for existing assets that is out of line with their value to others. The cognitive science behind this principle is referred to as the choice-supportive bias.
6. People favor the status quo and invest accordingly.
There is a bias toward the way things have been and a willingness to invest in the status quo that is out of line with its current value. In cognitive science, the former attribute is referred to as the status quo bias and the latter as an escalation of commitment.
7. People discount future risk and favor instant gratification.
Weinman argues that because cloud computing is an on-demand service, the instant-gratification factor should favor cloud computing.
8. People favor things that are free.
When offered an item that is free or another that costs money but offers a greater gain, people opt for the free item. Weinman argues that this factor also favors the cloud computing model because upfront costs are eliminated.
9. People have the need for status.
A large IT organization with substantial assets is a visible display of status; a cloud deployment is not. This is expressed as pride of ownership.
10. People are incapacitated by choice.
The Internet enables commerce to shift to a large inventory where profit can be maintained by many sales of a few items each, the so-called long tail. When this model is applied to cloud computing, people tend to be overwhelmed by the choice and delay adoption.

Measuring cloud computing costs

Usually, a commodity is cheaper than a specialized item, but not always. Depending upon your situation, you may pay more for public cloud computing than you would for owning and managing a private cloud, or for owning and using the software yourself. That's why it's important to analyze the costs and benefits of your own cloud computing scenario carefully and quantitatively, comparing the costs of cloud computing to those of private systems.

The cost of a cloud computing deployment is roughly estimated to be
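(a plausible reconstruction; the notation below is inferred from the surrounding description rather than quoted from the source)

CostCLOUD = Σ over metered resources (UnitCostCLOUD × units consumed)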

where the unit cost is usually defined as the cost of a machine instance per hour or of another resource.

Depending upon the deployment type, other resources add additional unit costs: storage quantity consumed, number of transactions, incoming or outgoing amounts of data, and so forth. Different cloud providers charge different amounts for these resources, some resources are free with one provider and charged for by another, and there are almost always variable charges based on resource sizing. Cloud resource pricing doesn't always scale linearly with performance.

To compare your cost benefit with a private cloud, you will want to compare the value you determine in the equation above with the same calculation:
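(a plausible reconstruction; the key point, per the next paragraph, is the Utilization divisor applied to the datacenter cost)

CostDATACENTER = Σ over resources (UnitCostDATACENTER × units consumed) / Utilization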

Notice the additional term for Utilization added as a divisor to the term for CostDATACENTER. This term appears because it is assumed that a private cloud has capacity that can't be captured, and it is further assumed that a private cloud doesn't employ the same level of virtualization or pooling of resources that a cloud computing provider can achieve. Indeed, no system can work at 100 percent utilization, because queuing theory states that as the system approaches 100 percent utilization, latency and response times grow toward infinity. Typical efficiencies in datacenters are between 60 and 85 percent. It is further assumed that the datacenter operates under averaged loads (not at peak capacity) and that the capacity of the datacenter is fixed by the assets it has.

The costs associated with resources in the cloud computing model (CostCLOUD) can be unbundled to a greater extent than the costs associated with CostDATACENTER. CostDATACENTER consists of the summation of the costs of each of the individual systems with all their associated resources, as follows:
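(a plausible reconstruction, with notation assumed from the surrounding description)

CostDATACENTER = Σ over n of CostSYSTEM n, each term including that system's associated resources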

where the sum includes terms for System 1, System 2, System 3, and so on.

The costs of a system in a datacenter must also include the overhead associated with power, cooling, and the physical plant. Estimates of these additional overheads indicate that, over the lifetime of a system, overhead roughly doubles the cost of the system. For a server with a four-year lifetime, you would therefore need to include an overhead roughly equal to 25 percent of the system's acquisition cost per year.

The overhead associated with IT staff is also a major cost, but it is highly variable from organization to organization. It is not uncommon for the burdened cost of a system in a datacenter to be 150 percent of the cost of the system itself.

The costs associated with the cloud model are calculated rather differently. Each resource has its own specific cost, and many resources can be provisioned independently of one another. In theory, therefore, CostCLOUD is better represented by the equation:
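(a plausible unbundled form, reconstructed here; the resource categories are those listed earlier in this section)

CostCLOUD = Σ (UnitCostINSTANCES × instance-hours) + Σ (UnitCostSTORAGE × storage consumed) + Σ (UnitCostNETWORK × data transferred) + Σ (UnitCostTRANSACTIONS × transactions) + ...
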
Avoiding Capital Expenditures

A major part of cloud computing's value proposition, and much of its appeal, is its ability to convert capital expenditures (CapEx) into operating expenditures (OpEx) through a usage-based pricing scheme that is elastic and can be right-sized. The conversion of real assets to virtual ones provides a measure of protection against having too much or too little infrastructure. Essentially, moving expenses onto the OpEx side of a budget allows an organization to transfer risk to its cloud computing provider.

A company wishing to grow would normally be faced with the following options:
• Buy the new equipment and deploy it in-house
• Lease the equipment for a set period of time
• Outsource the operation to a managed-services organization

Cloud computing is also a good option when the cost of infrastructure and management is high. This is often the case with legacy applications and systems, where maintaining the system's capabilities is a significant cost.

Right-sizing

Consider an accounting firm with a variable demand load, as shown in Figure 2.2. For each of the four quarters of the tax year, clients file their quarterly taxes on the service's Web site. Demand for three of those quarters rises broadly as the quarterly filing deadline arrives. The fourth quarter, which represents the year-end tax filing on April 15, shows a much larger and more pronounced spike for the two weeks approaching and just following that quarter's end. Clearly, this accounting business can't ignore the demand spike for its year-end accounting, because this is the single most important portion of the firm's business, but it needs to match demand to resources to maximize its profits.

Buying or leasing infrastructure to accommodate the peak demand (or load) shown in the figure as DMAX means that nearly half of that infrastructure remains idle most of the time. Fitting the infrastructure to the average demand, DAVG, means that half of the transactions in the Q2 spike are not captured, and that spike is the mission-critical portion of this enterprise. Worse, running at DAVG means that during maximum demand the service slows to a crawl and the system may not be responsive enough to satisfy any of the users.

These limits can be a serious constraint on profit and revenue. Outsourcing the demand may seem to solve the problem, but outsourcing essentially shifts the burden of capital expenditures onto the service provider, and a service contract that doesn't match infrastructure to demand suffers from the same inefficiencies that captive infrastructure does.

The cloud computing model addresses this problem by allowing you to right-size your infrastructure. In Figure 2.2, the demand is satisfied by an infrastructure that is sized in terms of a CU, or "Compute Unit." The rule for this particular cloud provider is that infrastructure may be modified at the beginning of any month. For the low-demand Q1/Q4 time period, a 1 CU infrastructure is applied. On February 1st, the size is changed to a 4 CU infrastructure, which captures the entire spike of Q2 demand. Finally, on June 1st, a 2 CU size is applied to accommodate the typical demand, DAVG, experienced from the last half of Q2 through the middle of Q4. This curve-fitting exercise captures the demand nearly all the time, with little idle capacity left unused.

If this deployment represented a single server, then 1 CU might represent a single dual-core processor, 2 CU a quad-core processor, and 4 CU a dual quad-core processor virtual system. Most cloud providers size their systems as small, medium, and large in just this manner; a rough cost comparison of right-sizing against provisioning for the peak is sketched below.
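
A rough sketch of the cost arithmetic behind this curve fitting, in Python. The CU schedule follows the example above, while the per-CU monthly price and the month-by-month breakdown are illustrative assumptions, since Figure 2.2 is not reproduced here.

CU_PRICE_PER_MONTH = 100                        # hypothetical price per Compute Unit per month

# One reading of the schedule described above: 1 CU in January,
# 4 CU from February through May, 2 CU from June onward.
right_sized = [1] + [4] * 4 + [2] * 7           # CUs provisioned in each of the 12 months
peak_provisioned = [4] * 12                     # always sized for D_MAX

cost = lambda plan: sum(plan) * CU_PRICE_PER_MONTH
print("provisioned for peak:", cost(peak_provisioned))   # 4800
print("right-sized:         ", cost(right_sized))        # 3100
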

Right-sizing is possible when the system load is cyclical or, in some cases, when there are predictable bursts or spikes in the load. You encounter cyclical loads in many public-facing commercial ventures with seasonal demands, when the load is affected by time zones, and at times when new products launch. Burst loads are less predictable. You can encounter bursts in systems that are gateways or hubs for traffic. In situations where demand is unpredictable and change can be rapid, right-sizing a cloud computing solution demands automated solutions.

Computing the Total Cost of Ownership

The Total Cost of Ownership (TCO) is a financial estimate of the cost of using a product or service over its lifetime, often broken out on a yearly or quarterly basis. In pitching cloud computing projects, it is common to create spreadsheets that predict the costs of using the cloud computing model versus performing the same functions in-house or on-premises.

To be really useful, a TCO must account for the real costs of items, but frequently it does not. For example, the cost of a system deployed in-house is not only the cost of acquisition and the cost of maintenance. A system consumes resources such as space in a datacenter or a portion of your site, power, cooling, and management. All these resources represent an overhead that is often overlooked, either in error or for political reasons. When you account for monitoring and management of systems, you must also account for the burdened cost of an IT employee, the cost of the hardware and software used for management, and other hidden costs; a rough sketch of this arithmetic follows.
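
A minimal sketch of that burdened-cost arithmetic in Python. All figures are hypothetical placeholders; the 100 percent facility overhead and the 150 percent staff burden are the rules of thumb quoted earlier in this chapter, applied here over the system's lifetime as a simplifying assumption.

acquisition    = 10_000                  # hypothetical purchase price of one server
lifetime_years = 4

facility_overhead = 1.00 * acquisition   # power, cooling, and space roughly double lifetime cost
staff_burden      = 1.50 * acquisition   # burdened IT-staff cost, per the 150 percent figure

tco = acquisition + facility_overhead + staff_burden
print("4-year TCO:", tco, "| per year:", tco / lifetime_years)   # 35000 | 8750.0
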

Any discussion of Total Cost of Ownership provides an operational look at infrastructure deployment. A better metric for enterprises is captured by a Return on Investment (ROI) calculation. To accurately measure ROI, you need to capture the opportunities that a business has been able to avail itself of (or not), something that can be judged accurately only in hindsight. The flexibility and agility of cloud computing allow a company to focus on its core business and create more opportunities.

Specifying Service Level Agreements

A Service Level Agreement (SLA) is the contract for performance negotiated between you and a service provider. In the early days of cloud computing, all SLAs were negotiated between a client and the provider. Today, with the advent of large utility-like cloud computing providers, most SLAs are standardized until a client becomes a large consumer of services.

SLAs usually specify these parameters:

• Availability of the service (uptime)
• Response times or latency
• Reliability of the service components
• Responsibilities of each party
• Warranties

If a vendor fails to meet the stated targets or minimums, it is penalized by having to offer the client a service credit or pay a penalty.

Microsoft publishes the SLAs associated with the Windows Azure Platform components at http://www.microsoft.com/windowsazure/sla/, which is illustrative of industry practice for cloud providers. Each individual component has its own SLA. The summary versions of these SLAs from Microsoft are reproduced here:

Windows Azure SLA: "Windows Azure has separate SLAs for compute and storage. For compute, we guarantee that when you deploy two or more role instances in different fault and upgrade domains, your Internet-facing roles will have external connectivity at least 99.95% of the time. Additionally, we will monitor all of your individual role instances and guarantee that 99.9% of the time we will detect when a role instance's process is not running and initiate corrective action."

SQL Azure SLA: "SQL Azure customers will have connectivity between the database and our Internet gateway. SQL Azure will maintain a "Monthly Availability" of 99.9% during a calendar month. The "Monthly Availability Percentage" for a specific customer database is the ratio of the time the database was available to customers to the total time in a month. Time is measured in 5-minute intervals in a 30-day monthly cycle. Availability is always calculated for a full month. An interval is marked as unavailable if the customer's attempts to connect to a database are rejected by the SQL Azure gateway."

AppFabric SLA: "Uptime percentage commitments and SLA credits for Service Bus and Access Control are similar to those specified above in the Windows Azure SLA. Due to inherent differences between the technologies, underlying SLA definitions and terms differ for the Service Bus and Access Control services. Using the Service Bus, customers will have connectivity between a customer's service endpoint and our Internet gateway; when our service fails to establish a connection from the gateway to a customer's service endpoint, then the service is assumed to be unavailable. Using Access Control, customers will have connectivity between the Access Control endpoints and our Internet gateway. In addition, for both Service Bus and Access Control, the service will process correctly formatted requests for the handling of messages and tokens; when our service fails to process a request properly, then the service is assumed to be unavailable. SLA calculations will be based on an average over a 30-day monthly cycle, with 5-minute time intervals. Failures seen by a customer in the form of service unavailability will be counted for the purpose of availability calculations for that customer."
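
To put such monthly percentages in perspective, a small Python sketch converts an availability commitment into allowed downtime, using the 30-day month and 5-minute measurement intervals described above (illustrative only):

def allowed_downtime_minutes(availability, days=30):
    # minutes of unavailability permitted in a 'days'-day monthly cycle
    return (1 - availability) * days * 24 * 60

for sla in (0.999, 0.9995, 0.9999):
    minutes = allowed_downtime_minutes(sla)
    print(f"{sla:.2%} -> {minutes:.1f} minutes (~{minutes / 5:.0f} five-minute intervals)")
# 99.90% -> 43.2 minutes (~9 intervals)
# 99.95% -> 21.6 minutes (~4 intervals)
# 99.99% -> 4.3 minutes (~1 interval)
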

Some cloud providers allow for service credits based on their ability to meet their contractual levels of uptime. For example, Amazon applies a service credit of 10 percent of the charge for Amazon S3 if the monthly uptime is equal to or greater than 99 percent but less than 99.9 percent. When the uptime drops below 99 percent, the service credit percentage rises to 25 percent, and the credit is applied to usage in the next billing period.

Amazon Web Services calculates uptime from an error rate that is measured for each 5-minute interval during a billing period:

Error Rate = error responses / total requests, and monthly uptime = 100% minus the average error rate

The error rate is based on internal server counters such as "InternalError" or "ServiceUnavailable." There are exclusions that limit Amazon's exposure. A small sketch of the resulting credit arithmetic follows.
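
A small Python sketch of that credit arithmetic; the credit tiers are the S3 figures quoted above, while the interval data and the simplified uptime formula are illustrative assumptions.

def monthly_uptime(interval_error_rates):
    # 100% minus the average per-interval error rate
    return 1.0 - sum(interval_error_rates) / len(interval_error_rates)

def s3_service_credit(uptime):
    if uptime < 0.99:
        return 0.25          # credit when uptime drops below 99%
    if uptime < 0.999:
        return 0.10          # credit for 99% <= uptime < 99.9%
    return 0.0

# 30 days of 5-minute intervals (8,640 in total); assume 100 bad intervals at a 50% error rate.
intervals = [0.0] * 8_540 + [0.5] * 100
u = monthly_uptime(intervals)
print(f"monthly uptime {u:.4%} -> credit {s3_service_credit(u):.0%} of that month's S3 charge")
# monthly uptime 99.4213% -> credit 10% of that month's S3 charge
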

Service Level Agreements are based on the usage model. Most cloud providers price their pay-as-you-go resources at a premium and issue standard SLAs only for that purpose. You can also purchase subscriptions at various levels that guarantee access to a certain amount of purchased resources, and the SLAs attached to a subscription often offer different terms. If your organization requires guaranteed access to a certain level of resources, you need a subscription to a service; a pure usage model may not provide that level of access under peak load conditions.

Defining Licensing Models

When you purchase shrink-wrapped software, you use that software under a licensing agreement called a EULA, or End User License Agreement. The EULA may specify that the software meets the following criteria:
• It is yours to own.
• It can be installed on a single machine or on multiple machines.
• It allows for one or more connections.
• It has whatever other limits the ISV has placed on its software.

In most instances, the purchase price of the software is directly tied to the EULA.

For a long time now, the computer industry has known that the use of distributed applications over the Internet would change the way companies license their software, and indeed it has. The problem is that there is no uniform description of how applications accessed over cloud networks will be priced. There are several different licensing models in play at the moment, and no clear winners.

Cloud-based providers tend to license their applications or services based on user or machine accounts, but they do so in ways that differ from what you might expect based on your experience with physical hardware and software. Many applications and services use a subscription or usage model (or both) and tie it to a user account. A lot of experimentation is going on in the publishing industry on how to price Internet offerings, and the same is true across all kinds of computer applications at the moment. Some services tie their licenses to a machine account when that makes sense.

An example is the backup service Carbonite, where the service is for backing up a licensed computer. However, cloud computing applications rarely use machine licenses when the application is meant to be ubiquitous. If you need to access an application, service, or Web site from any location, then a machine license isn't going to be practical.

The impact of cloud computing on bulk software purchases and enterprise licensing schemes is even harder to gauge. Several analysts have remarked that the advent of cloud computing could lead to the end of enterprise licensing and could cause difficulties for software vendors going forward. It isn't clear what the impact on licensing will be in the future, but it is certainly an area to watch.
Chapter 12: Understanding Cloud Security
(REFERENCE 2 as per guidelines)

Many of the tools and techniques that you would use to protect your data, comply with regulations, and maintain the integrity of your systems are complicated by the fact that you are sharing your systems with others and are often outsourcing their operation as well. Different cloud computing service models provide different levels of security services: you get the least amount of built-in security with an Infrastructure as a Service provider and the most with a Software as a Service provider.

Adapting your on-premises systems to a cloud model requires determining what security mechanisms are required and mapping them to the controls that exist in your chosen cloud service provider. When you identify missing security elements in the cloud, you can use that mapping to work to close the gap.

Storing data in the cloud is of particular concern. Data should be transferred and stored in an encrypted format. You can use proxy and brokerage services to separate clients from direct access to shared cloud storage.

Logging, auditing, and regulatory compliance are all features that require planning in cloud computing systems. They are among the services that need to be negotiated in Service Level Agreements.

Securing the Cloud

The Internet was designed primarily to be resilient; it was not designed to be secure. Any distributed application has a much greater attack surface than an application that is closely held on a Local Area Network. Cloud computing has all the vulnerabilities associated with Internet applications, and additional vulnerabilities arise from pooled, virtualized, and outsourced resources.

In the report "Assessing the Security Risks of Cloud Computing," Jay Heiser and Mark Nicolett of the Gartner Group highlighted the following areas of cloud computing that they felt were uniquely troublesome:
• Auditing
• Data integrity
• e-Discovery for legal compliance
• Privacy
• Recovery
• Regulatory compliance

Your risks in any cloud deployment depend upon the particular cloud service model chosen and the type of cloud on which you deploy your applications. In order to evaluate your risks, you need to perform the following analysis:

1. Determine which resources (data, services, or applications) you are planning to move to the cloud.

2. Determine the sensitivity of each resource to risk. Risks that need to be evaluated are loss of privacy, unauthorized access by others, loss of data, and interruptions in availability.

3. Determine the risk associated with the particular cloud type for a resource. Cloud types include public, private (both external and internal), hybrid, and shared community types. With each type, you need to consider where data and functionality will be maintained.

4. Take into account the particular cloud service model that you will be using. Different models such as IaaS, SaaS, and PaaS require their customers to be responsible for security at different levels of the service stack.

5. If you have selected a particular cloud service provider, evaluate its system to understand how data is transferred, where it is stored, and how to move data both into and out of the cloud. You may want to build a flowchart that shows the overall mechanism of the system you intend to use or are currently using.

One technique for maintaining security is to have "golden" system image references that you can return to when needed. The ability to take a system image offline and analyze it for vulnerabilities or compromise is invaluable; a compromised image is a primary forensics tool.

Many cloud providers offer a snapshot feature that can create a copy of the client's entire environment: not only machine images, but also applications and data, network interfaces, firewalls, and switch access. If you feel that a system has been compromised, you can replace that image with a known good version and contain the problem.

The security boundary

In order to discuss security in cloud computing concisely, you need to define the particular model of cloud computing that applies. This nomenclature provides a framework for understanding what security is already built into the system, who has responsibility for a particular security mechanism, and where the boundary lies between the responsibilities of the service provider and those of the customer.

The most commonly used model, based on the U.S. National Institute of Standards and Technology (NIST) definition, separates deployment models from service models and assigns those models a set of service attributes. Deployment models are cloud types: community, hybrid, private, and public clouds. Service models follow the SPI model for three forms of service delivery: Software, Platform, and Infrastructure as a Service. The NIST model did not require that a cloud use virtualization to pool resources, nor did it require that a cloud support multi-tenancy. It is just these factors that make security such a complicated proposition in cloud computing.

The Cloud Security Alliance (CSA) cloud computing stack model shows how different functional units in a network stack relate to one another, and it can be used to separate the different service models from one another. The CSA is an industry working group that studies security issues in cloud computing and offers recommendations to its members. The work of the group is open and available.

The CSA partitions its guidance into a set of operational domains:

• Governance and enterprise risk management
• Legal and electronic discovery
• Compliance and audit
• Information lifecycle management
• Portability and interoperability
• Traditional security, business continuity, and disaster recovery
• Datacenter operations
• Incident response, notification, and remediation
• Application security
• Encryption and key management
• Identity and access management
• Virtualization

One key difference between the NIST model and the CSA model is that the CSA considers multi-tenancy to be an essential element of cloud computing. Multi-tenancy adds a number of additional security concerns that need to be accounted for: different customers must be isolated, their data segmented, and their service usage accounted for. To provide these features, the cloud service provider must supply a policy-based environment that is capable of supporting different levels and qualities of service, usually using different pricing models. Multi-tenancy expresses itself in different ways in the different cloud deployment models and imposes security concerns in different places.

Security service boundary

The CSA's functional cloud computing hardware/software stack is the Cloud Reference Model, reproduced in Figure 12.3. IaaS is the lowest-level service, with PaaS and SaaS the next two services above it. As you move upward in the stack, each service model inherits the capabilities of the model beneath it, as well as all its inherent security concerns and risk factors. IaaS supplies the infrastructure; PaaS adds application development frameworks, transactions, and control structures; and SaaS is an operating environment with applications, management, and the user interface. Ascending the stack, IaaS has the least integrated functionality and the lowest level of integrated security, and SaaS has the most.

The most important lesson from this discussion of architecture is that each type of cloud service delivery model creates a security boundary at which the cloud service provider's responsibilities end and the customer's responsibilities begin. Any security mechanism below the security boundary must be built into the system, and any security mechanism above it must be maintained by the customer. As you move up the stack, it becomes more important to make sure that the type and level of security is part of your Service Level Agreement.

In the SaaS model, the vendor provides security as part of the Service Level Agreement, with the compliance, governance, and liability levels stipulated under the contract for the entire stack. In the PaaS model, the security boundary may be defined so that the vendor covers the software framework and middleware layer, while the customer is responsible for the security of the application and UI at the top of the stack. The model with the least built-in security is IaaS, where everything that involves software of any kind is the customer's problem.

In thinking about the Cloud Security Reference Model in relation to security needs, a fundamental distinction may be made between the nature of how services are provided and where those services are located. A private cloud may be internal or external to an organization, and although a public cloud is most often external, there is no requirement that this mapping hold.

This makes the location of trust boundaries in cloud computing rather ill-defined, dynamic, and subject to change depending upon a number of factors. Establishing trust boundaries and creating a new perimeter defense that is consistent with your cloud computing network is an important consideration. The key to understanding where to place security mechanisms is to understand where cloud resources are physically deployed and consumed, what those resources are, who manages them, and what mechanisms are used to control them.

Table 12.1 lists the different service models and the parties responsible for security in each instance.
