Cloud Computing Notes (As Per Guidelines)
1.4.4 Hadoop
• Apache Hadoop is an open-source framework that is suited for processing large data
sets on commodity hardware.
• Hadoop is an implementation of MapReduce, an application programming model
developed by Google, which provides two fundamental operations for data
processing: map and reduce.
• MAP: transforms and synthesizes the input data provided by the user.
• REDUCE: aggregates the output obtained by the map operations. (A minimal sketch of
the model follows this list.)
• Hadoop provides the runtime environment, and developers need only provide the
input data and specify the map and reduce functions that need to be executed.
• Hadoop is an integral part of the Yahoo! cloud infrastructure and supports several
business processes of the company.
• Currently, Yahoo! manages the largest Hadoop cluster in the world, which is also
available to academic institutions.
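To make the map and reduce operations above concrete, here is a minimal, self-contained Python sketch of the MapReduce model applied to word counting. It is purely illustrative: Hadoop's real API is Java-based and distributes the work across a cluster, whereas here the shuffle/grouping step is simulated in memory and all function names are invented for the example.

```python
from collections import defaultdict

def map_fn(document):
    """Map: transform the input into intermediate (key, value) pairs."""
    for word in document.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: aggregate all values emitted for the same key."""
    return word, sum(counts)

def run_job(documents):
    # The framework (Hadoop) normally performs this shuffle/group step
    # across many nodes; here it is simulated in memory for illustration.
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return dict(reduce_fn(k, v) for k, v in grouped.items())

print(run_job(["the cloud scales", "the cloud computes"]))
# {'the': 2, 'cloud': 2, 'scales': 1, 'computes': 1}
```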
• This definition includes various types of distributed computing systems that are
especially focused on unified usage and aggregation of distributed resources.
• Communication is another fundamental aspect of distributed computing.
• Since distributed systems are composed of multiple computers that collaborate, it is
necessary to provide some sort of data and information exchange between them,
which generally occurs through the network (Coulouris et al. [2]).
Figure 2.10 provides an overview of the different layers that are involved in providing the
services of a distributed system.
• Bottom layer: computer and network hardware constitute the physical infrastructure;
these components are directly managed by the operating system, which provides the
basic services for interprocess communication (IPC), process scheduling and
management, and resource management in terms of file system and local devices.
• Taken together, these two layers (hardware and operating system) become the platform
on top of which specialized software is deployed to turn a set of networked computers
into a distributed system.
• The use of well-known standards at the operating system level, and even more so at the
hardware and network levels, allows easy harnessing of heterogeneous components.
• For example, network connectivity between different devices is controlled by
standards, which allow them to interact seamlessly. At the operating system level,
IPC services are implemented on top of standardized communication protocols such
as Transmission Control Protocol/Internet Protocol (TCP/IP), User Datagram Protocol
(UDP), or others.
• The middleware layer leverages such services to build a uniform environment for the
development and deployment of distributed applications. This layer supports the
programming paradigms for distributed systems.
• The middleware develops its own protocols, data formats, and programming
languages or frameworks for the development of distributed applications.
• This layer is completely independent from the underlying operating system and hides
all the heterogeneities of the bottom layers.
• The top of the distributed system stack is represented by the applications and
services designed and developed to use the middleware.
• These can serve several purposes and often expose their features in the form of
graphical user interfaces (GUIs) accessible locally or through the Internet via a Web
browser.
• For example, in the case of a cloud computing system, the use of Web technologies is
strongly preferred, not only to interface distributed applications with the end user
but also to provide platform services aimed at building distributed systems. An example
is Amazon Web Services (AWS), which provides facilities for creating virtual machines,
organizing them together into a cluster, and deploying applications and systems on top.
• Figure 2.11 shows an example of how the general reference architecture of a
distributed system is contextualized in the case of a cloud computing system.
• The core logic is then implemented in the middleware that manages the
virtualization layer, which is deployed on the physical infrastructure in order to
maximize its utilization and provide a customizable runtime environment for
applications.
2.4.3 – Architectural styles for distributed computing
This class of architectural styles models systems in terms of independent components that
have their own life cycles and interact with each other to perform their activities. There
are two major categories within this class, communicating processes and event systems,
which differ in the way the interaction among components is managed.
Event Systems. In this architectural style, the components of the system are loosely coupled
and connected. In addition to exposing operations for data and state manipulation, each
component also publishes (or announces) a collection of events with which other
components can register. In general, other components provide a callback that will be
executed when the event is activated.
During the activity of a component, a specific runtime condition can activate one of the
exposed events, thus triggering the execution of the callbacks registered with it.
The main advantage of such an architectural style is that it fosters the development of open
systems: new modules can be added and easily integrated into the system as long as they
have compliant interfaces for registering to the events.
This architectural style solves some of the limitations observed for the top-down and object-
oriented styles. First, the invocation pattern is implicit, and the connection between the
caller and the callee is not hard-coded; this gives a lot of flexibility, since the addition
or removal of a handler to events can be done without changes in the source code of
applications.
Second, the event source does not need to know the identity of the event handler in order
to invoke the callback.
The disadvantage of such a style is that it relinquishes control over system computation.
When a component triggers an event, it does not know how many event handlers will be
invoked and whether there are any registered handlers.
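As a rough illustration of the event-based style described above, the sketch below shows components registering callbacks with an event bus and an event source triggering them without knowing its handlers. The class, event, and handler names are invented for the example; real event systems add further concerns such as threading and delivery guarantees.

```python
class EventBus:
    """Minimal publish/register/trigger mechanism (illustrative only)."""
    def __init__(self):
        self._handlers = {}

    def register(self, event, callback):
        # Any component with a compliant callback can be plugged in later.
        self._handlers.setdefault(event, []).append(callback)

    def trigger(self, event, payload):
        # The source does not know how many handlers exist, or whether any do.
        for callback in self._handlers.get(event, []):
            callback(payload)

bus = EventBus()
bus.register("data_ready", lambda d: print("logger saw:", d))
bus.register("data_ready", lambda d: print("indexer saw:", d))
bus.trigger("data_ready", {"rows": 42})
```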
System architectural styles cover the physical organization of components and processes
over a distributed infrastructure. They provide a set of reference models for the deployment
of such systems and help engineers not only have a common vocabulary for describing the
physical layout of systems but also quickly identify the major advantages and drawbacks of a
given deployment and whether it is applicable to a specific class of applications.
Two fundamental reference styles: client/server and peer-to-peer.
Client/server
As depicted in Figure 2.12, the client/server model features two major components: a
server and a client. These two components interact with each other through a network
connection using a given protocol. The communication is unidirectional: the client issues a
request to the server, and after processing the request the server returns a response. There
could be multiple client components issuing requests to a server that is passively waiting for
them. Hence, the important operations in the client-server paradigm are request and accept
(client side), and listen and response (server side).
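A minimal sketch of this request/response interaction is shown below, assuming Python's standard socket module (socket.create_server requires Python 3.8+); the address, port, and echo "protocol" are arbitrary choices for the example, not part of the model itself.

```python
import socket
import threading

HOST, PORT = "127.0.0.1", 9090            # illustrative address and port
srv = socket.create_server((HOST, PORT))  # server side: listen for requests

def handle_one_request():
    conn, _ = srv.accept()                # accept a client connection
    with conn:
        request = conn.recv(1024)         # read the request
        conn.sendall(b"echo: " + request) # return the response

threading.Thread(target=handle_one_request, daemon=True).start()

# Client side: issue a request and wait for the response.
with socket.create_connection((HOST, PORT)) as client:
    client.sendall(b"hello server")
    print(client.recv(1024).decode())     # echo: hello server
srv.close()
```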
The client/server model is suitable for many-to-one scenarios, where the information and
the services of interest can be centralized and accessed through a single access point: the
server. In general, multiple clients are interested in such services, and the server must be
appropriately designed to efficiently serve requests coming from different clients. This
consideration has implications for both client design and server design.
• Thin-client model. In this model, the load of data processing and transformation is put on
the server side, and the client has a light implementation that is mostly concerned with
retrieving and returning the data it is being asked for, with no considerable further
processing.
• Fat-client model. In this model, the client component is also responsible for processing
and transforming the data before returning it to the user, whereas the server features a
relatively light implementation that is mostly concerned with the management of access to
the data.
The three major components in the client-server model are presentation, application logic, and
data storage.
In the thin-client model, the client embodies only the presentation component, while the
server absorbs the other two. In the fat-client model, the client encapsulates presentation
and most of the application logic, and the server is principally responsible for data
storage and maintenance.
Presentation, application logic, and data maintenance can be seen as conceptual layers,
which are more appropriately called tiers. The mapping between the conceptual layers and
their physical implementation in modules and components allows differentiating among
several types of architectures, which go under the name of multitiered architectures.
• Two-tier architecture. This architecture partitions the system into two tiers, one located
in the client component and the other on the server. The client is responsible for the
presentation tier by providing a user interface; the server concentrates the application
logic and the data store into a single tier.
This architecture is suitable for systems of limited size and suffers from scalability issues. In
particular, as the number of users increases, the performance of the server might
dramatically decrease.
Another limitation is caused by the size of the data to maintain, manage, and access,
which might be prohibitive for a single computation node or too large for serving the clients
with satisfactory performance.
Nowadays, the client/server model is an important building block of more complex systems,
which implement some of their features by identifying a server and a client process
interacting through the network. This model is generally suitable for many-to-one scenarios,
where the interaction is unidirectional and started by the clients; because it suffers from
scalability issues, it is not appropriate for very large systems.
Peer-to-peer
More precisely, each peer acts as a server when it processes requests from other peers and
as a client when it issues requests to other peers. With respect to the client/server model,
which partitions the responsibilities of the IPC between server and clients, the peer-to-peer
model attributes the same responsibilities to each component.
Therefore, this model is quite suitable for highly decentralized architectures, which can scale
better along the dimension of the number of peers.
To address an incredibly large number of peers, different architectures have been designed
that diverge slightly from the peer-to-peer model. For example, in Kazaa not all the peers have
the same role, and some of them are used to group the accessibility information of a group
of peers. Another interesting example of peer-to-peer architecture is represented by the
Skype network.
The client/server architecture, which originally included only two types of components, has
been further extended and enriched by the development of multitier architectures as the
complexity of systems increased. Currently, this model is still the predominant reference
architecture for distributed systems and applications. The server and client abstraction can
be used in some cases to model the macro scale or the micro scale of the systems. For
peer-to-peer systems, pure implementations are very hard to find.
Chapter 3 Virtualization
3.1 Introduction
• Virtualization of the execution environment allows a wider range of features to be
implemented. In particular, sharing, aggregation, emulation, and isolation are the
most relevant features (see Figure 3.2).
3.2.3 Portability
• The concept of portability applies in different ways according to the specific type of
virtualization considered.
• In the case of a hardware virtualization solution, the guest is packaged into a virtual
image that, in most cases, can be safely moved and executed on top of different
virtual machines. Virtual images are generally proprietary formats that require a
specific virtual machine manager to be executed.
• In the case of programming-level virtualization, as implemented by the JVM or the
.NET runtime, the binary code representing application components (jars or
assemblies) can be run without any recompilation on any implementation of the
corresponding virtual machine. This makes the application development cycle more
flexible and application deployment very straightforward.
• Finally, portability allows having your own system always with you and ready to use
as long as the required virtual machine manager is available. This requirement is, in
general, less stringent than having all the applications and services you need
available to you anywhere you go.
• The reference model defines the interfaces between the levels of abstraction, which
hide implementation details.
• From this perspective, virtualization techniques actually replace one of the layers and
intercept the calls that are directed toward it.
Modern computing systems can be expressed in terms of the reference model described in
Figure 3.4.
• At the bottom layer, the model for the hardware is expressed in terms of the
Instruction Set Architecture (ISA), which defines the instruction set for the
processor, registers, memory, and interrupt management.
• The ISA is the interface between hardware and software, and it is important to the
operating system (OS) developer (System ISA) and to developers of applications that
directly manage the underlying hardware (User ISA).
• The application binary interface (ABI) separates the operating system layer from the
applications and libraries, which are managed by the OS.
• The ABI covers details such as low-level data types, alignment, and call conventions and
defines a format for executable programs. System calls are defined at this level.
• The ABI allows portability of applications and libraries across operating systems
that implement the same ABI.
• The highest level of abstraction is represented by the application programming
interface (API), which interfaces applications to libraries and/or the underlying
operating system.
• Any operation performed at the application-level API ultimately relies on the ABI and
the ISA to be carried out.
• The high-level abstraction is converted into machine-level instructions to perform the
actual operations supported by the processor.
• The machine-level resources, such as processor registers and main memory
capacities, are used to perform the operation at the hardware level of the central
processing unit (CPU).
• This layered approach simplifies the development and implementation of computing
systems, and it simplifies the implementation of multitasking and the coexistence of
multiple executing environments.
• Such a model not only requires limited knowledge of the entire computing stack but
also provides ways to implement a minimal security model for managing and
accessing shared resources.
• For this purpose, the instruction set exposed by the hardware has been divided into
different security classes that define who can operate with them.
• The first distinction can be made between privileged and nonprivileged instructions.
o Nonprivileged instructions are those instructions that can be used without
interfering with other tasks because they do not access shared resources. This
category contains, for example, all the floating-point, fixed-point, and arithmetic
instructions.
o Privileged instructions are those that are executed under specific restrictions
and are mostly used for sensitive operations, which expose (behavior-
sensitive) or modify (control-sensitive) the privileged state. For instance,
behavior-sensitive instructions are those that operate on the I/O, whereas
control-sensitive instructions alter the state of the CPU registers.
o Some types of architecture feature more than one class of privileged
instructions and implement a finer control of how these instructions can be
accessed. For instance, a possible implementation features a hierarchy of
privileges (see Figure 3.5) in the form of ring-based security: Ring 0, Ring 1,
Ring 2, and Ring 3; Ring 0 is the most privileged level and Ring 3 the least
privileged level. Ring 0 is used by the kernel of the OS, Rings 1 and 2 are used
by OS-level services, and Ring 3 is used by the user. Recent systems
support only two levels, with Ring 0 for supervisor mode and Ring 3 for user
mode.
• All current systems support at least two different execution modes: supervisor
mode and user mode.
o Supervisor mode denotes an execution mode in which all the instructions
(privileged and nonprivileged) can be executed without any restriction. This
mode, also called master mode or kernel mode, is generally used by the
operating system (or the hypervisor) to perform sensitive operations on
hardware-level resources.
o In user mode, there are restrictions on controlling the machine-level resources.
If code running in user mode invokes a privileged instruction, a hardware
interrupt occurs and traps the potentially harmful execution of the instruction.
o There might be some instructions that can be invoked as privileged
instructions under some conditions and as nonprivileged instructions under
other conditions.
o The distinction between user and supervisor mode allows us to understand
the role of the hypervisor and why it is called that.
• Conceptually, the hypervisor runs above the supervisor mode. In reality, hypervisors
are run in supervisor mode, and the division between privileged and nonprivileged
instructions has posed challenges in designing virtual machine managers.
• It is expected that all the sensitive instructions will be executed in privileged mode,
which requires supervisor mode in order to avoid traps. Without this assumption it is
impossible to fully emulate and manage the status of the CPU for guest operating
systems.
• Unfortunately, this is not true for the original x86 ISA, which allows 17 sensitive
instructions to be called in user mode. This prevents multiple operating systems
managed by a single hypervisor from being isolated from each other, since they are
able to access the privileged state of the processor and change it.
• It is expected that in a hypervisor-managed environment, all the guest operating
system code will be run in user mode in order to prevent it from directly accessing
the status of the CPU. If there are sensitive instructions that can be called in user
mode (that is, implemented as nonprivileged instructions), it is no longer possible to
completely isolate the guest OS.
• More recent implementations of the x86 ISA (Intel VT and AMD Pacifica) have solved
this problem by redesigning such instructions as privileged ones.
• Three main modules, the dispatcher, the allocator, and the interpreter, coordinate their
activity in order to emulate the underlying hardware (a conceptual sketch follows this
list).
• The dispatcher constitutes the entry point of the monitor and reroutes the
instructions issued by the virtual machine instance to one of the two other modules.
• The allocator is responsible for deciding the system resources to be provided to the
VM: whenever a virtual machine tries to execute an instruction that results in
changing the machine resources associated with that VM, the allocator is invoked by
the dispatcher.
• The interpreter module consists of interpreter routines. These are executed
whenever a virtual machine executes a privileged instruction: a trap is triggered and
the corresponding routine is executed.
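The following conceptual Python sketch mirrors the three modules just described. It is not a real hypervisor (actual monitors operate on trapped hardware instructions, not strings); the instruction names and class structure are illustrative only.

```python
class Allocator:
    """Decides which system resources are provided to a VM."""
    def allocate(self, vm, resource):
        print(f"allocator: granting {resource} to {vm}")

class Interpreter:
    """Runs an emulation routine when a privileged instruction traps."""
    def emulate(self, vm, instruction):
        print(f"interpreter: emulating '{instruction}' for {vm}")

class Dispatcher:
    """Entry point of the monitor: reroutes trapped instructions."""
    def __init__(self):
        self.allocator = Allocator()
        self.interpreter = Interpreter()

    def on_trap(self, vm, instruction):
        if instruction.startswith("alloc"):
            # The instruction would change the resources assigned to the VM.
            self.allocator.allocate(vm, instruction.split()[1])
        else:
            # Any other privileged instruction is handled by an interpreter routine.
            self.interpreter.emulate(vm, instruction)

vmm = Dispatcher()
vmm.on_trap("guest-1", "alloc memory_page")
vmm.on_trap("guest-1", "write_control_register")
```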
• The criteria that need to be met by a virtual machine manager to efficiently support
virtualization were established by Goldberg and Popek in 1974. Three properties
have to be satisfied:
o Equivalence. A guest running under the control of a virtual machine manager
should exhibit the same behavior as when it is executed directly on the
physical host.
o Resource control. The virtual machine manager should be in complete
control of virtualized resources.
o Efficiency. A statistically dominant fraction of the machine instructions should
be executed without intervention from the virtual machine manager.
• Popek and Goldberg provided a classification of the instruction set and proposed
three theorems that define the properties that hardware instructions need to satisfy
in order to efficiently support virtualization.
THEOREM 3.1
For any conventional third-generation computer, a VMM may be constructed if the set of
sensitive instructions for that computer is a subset of the set of privileged instructions.
• This theorem establishes that all the instructions that change the configuration of the
system resources should generate a trap in user mode and be executed under the
control of the virtual machine manager.
• This allows hypervisors to efficiently control only those instructions that would reveal
the presence of an abstraction layer while executing all the rest of the instructions
without considerable performance loss.
• The theorem always guarantees the resource control property when the hypervisor is
in the most privileged mode (Ring 0).
• The nonprivileged instructions must be executed without the intervention of the
hypervisor.
• The equivalence property also holds, since the output of the code is the same in
both cases because the code is not changed.
THEOREM 3.2
A conventional third-generation computer is recursively virtualizable if:
• it is virtualizable and
• a VMM without any timing dependencies can be constructed for it.
THEOREM 3.3
A hybrid VMM may be constructed for any conventional third-generation machine in
which the set of user-sensitive instructions is a subset of the set of privileged instructions.
• There is another term, hybrid virtual machine (HVM), which is less efficient than the
virtual machine system. In the case of an HVM, more instructions are interpreted
rather than being executed directly.
• All instructions in virtual supervisor mode are interpreted. Whenever there is an
attempt to execute a behavior-sensitive or control-sensitive instruction, the HVM
controls the execution directly or gains control via a trap. Here, all sensitive
instructions are caught by the HVM and simulated.
• This reference model represents what we generally consider classic virtualization,
that is, the ability to execute a guest operating system in complete isolation.
• To a greater extent, hardware-level virtualization includes several strategies that
differentiate from each other in terms of which kind of support is expected from the
underlying hardware, what is actually abstracted from the host, and whether the
guest should be modified or not.
Hardware virtualization techniques
Hardware-assisted virtualization.
• This term refers to a scenario in which the hardware provides architectural support
for building a virtual machine manager able to run a guest operating system in
complete isolation. This technique was originally introduced in the IBM System/370.
• At present, examples of hardware-assisted virtualization are the extensions to the
x86-64 bit architecture introduced with Intel VT (formerly known as Vanderpool) and
AMD V (formerly known as Pacifica).
• Before the introduction of hardware-assisted virtualization, software emulation of
x86 hardware was significantly costly from the performance point of view. The
reason for this is that by design the x86 architecture did not meet the formal
requirements introduced by Popek and Goldberg, and early products were using
binary translation to trap some sensitive instructions and provide an emulated
version.
• Products such as VMware Virtual Platform, introduced in 1999 by VMware, which
pioneered the field of x86 virtualization, were based on this technique.
• After 2006, Intel and AMD introduced processor extensions, and a wide range of
virtualization solutions took advantage of them: Kernel-based Virtual Machine
(KVM), VirtualBox, Xen, VMware, Hyper-V, Sun xVM, Parallels, and others.
Full virtualization.
• Full virtualization refers to the ability to run a program, most likely an operating
system, directly on top of a virtual machine and without any modification, as though
it were run on the raw hardware.
• To make this possible, virtual machine managers are required to provide a complete
emulation of the entire underlying hardware.
• The principal advantage of full virtualization is complete isolation, which leads to
enhanced security, ease of emulation of different architectures, and coexistence of
different systems on the same platform.
• Full virtualization poses important concerns related to performance and technical
implementation. A key challenge is the interception of privileged instructions such as
I/O instructions: since they change the state of the resources exposed by the host,
they have to be contained within the virtual machine manager.
• A simple solution to achieve full virtualization is to provide a virtual environment for
all the instructions, thus posing some limits on performance.
• A successful and efficient implementation of full virtualization is obtained with a
combination of hardware and software. This is what is accomplished through
hardware-assisted virtualization.
Paravirtualization.
Partial virtualization.
Operating system-level virtualization offers the opportunity to create different and separated
execution environments for applications that are managed concurrently. Differently from
hardware virtualization, there is no virtual machine manager or hypervisor, and the
virtualization is done within a single operating system, where the OS kernel allows for
multiple isolated user space instances. The kernel is also responsible for sharing the system
resources among instances and for limiting the impact of instances on each other. A user
space instance in general contains a proper view of the file system, which is completely
isolated, and separate IP addresses, software configurations, and access to devices.
Operating systems supporting this type of virtualization are general-purpose, time-
shared operating systems with the capability to provide stronger namespace and resource
isolation. This virtualization technique can be considered an evolution of the chroot
mechanism in Unix systems. The chroot operation changes the file system root directory for
a process and its children to a specific directory.
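As a small illustration of the chroot mechanism, the snippet below uses Python's os.chroot, which is Unix-only and requires root privileges; the jail directory is a hypothetical path that would have to be prepared in advance with whatever files the process needs.

```python
import os

# Illustrative use of the chroot mechanism (Unix-only, requires root).
# After the call, "/" for this process and its children refers to the jail
# directory, so the rest of the host file system is no longer visible.
jail = "/srv/jail"        # hypothetical directory prepared beforehand
os.chroot(jail)
os.chdir("/")             # move inside the new root
print(os.listdir("/"))    # shows only the contents of /srv/jail
```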
• Managed execution and isolation are perhaps the most important advantages of
virtualization. These allow building secure and controllable computing environments.
• A virtual execution environment can be configured as a sandbox, thus preventing any
harmful operation from crossing the borders of the virtual host.
• Moreover, allocation of resources and their partitioning among different guests is
simplified, since the virtual host is controlled by a program. This enables fine-tuning of
resources.
• Portability is another advantage of virtualization, especially for execution
virtualization techniques.
• Virtual machine instances are normally represented by one or more files that can be
transported more easily than physical systems. Moreover, they also tend to be
self-contained.
• It is in fact possible to build our own operating environment within a virtual machine
instance and bring it with us wherever we go, as though we had our own laptop. This
concept is also an enabler for migration techniques in a server consolidation
scenario.
• Portability and self-containment also contribute to reducing the costs of maintenance,
since the number of hosts is expected to be lower than the number of virtual machine
instances.
• Since the guest program is executed in a virtual environment, there is very limited
opportunity for the guest program to damage the underlying hardware.
• It is possible to achieve a more efficient use of resources.
• Multiple systems can securely coexist and share the resources of the underlying host,
without interfering with each other. This is a prerequisite for server consolidation,
which allows adjusting the number of active physical resources dynamically
according to the current load of the system, thus creating the opportunity to save in
terms of energy consumption and to reduce the impact on the environment.
• Virtualization can sometimes lead to an inefficient use of the host. In particular, some
of the specific features of the host cannot be exposed by the abstraction layer and
thus become inaccessible.
• In the case of hardware virtualization, this could happen with device drivers: the
virtual machine may simply provide a default graphics card that maps only a
subset of the features available on the host.
• In the case of programming-level virtual machines, some of the features of the
underlying operating systems may become inaccessible unless specific libraries are
used.
• For example, in the first version of Java the support for graphics programming was
very limited and the look and feel of applications was very poor compared to native
applications. These issues have been resolved by providing a new framework called
Swing for designing the user interface, and further improvements have been made by
integrating support for the OpenGL libraries in the software development kit.
Grouping IT resources in close proximity with one another, rather than having them
geographically dispersed, allows for power sharing, higher efficiency in shared IT resource
usage, and improved accessibility for IT personnel. These are the advantages that naturally
popularized the data center concept.
Data centers typically consist of the following technologies and components:
Virtualization
Data centers consist of both physical and virtualized IT resources. The physical IT resource
layer refers to the facility infrastructure that houses computing/networking systems and
equipment, together with hardware systems and their operating systems (Figure 5.7).
The resource abstraction and control of the virtualization layer consists of operational
and management tools that are often based on virtualization platforms that abstract the
physical computing and networking IT resources as virtualized components.
Standardization and Modularity
Data centers are built upon standardized commodity hardware and designed with modular
architectures, aggregating multiple identical building blocks of facility infrastructure and
equipment to support scalability, growth, and speedy hardware replacements.
Modularity and standardization are key requirements for reducing investment and
operational costs, as they enable economies of scale for the procurement, acquisition,
deployment, operation, and maintenance processes.
Common virtualization strategies and the constantly improving capacity and performance of
physical devices both favor IT resource consolidation, since fewer physical components are
needed to support complex configurations. Consolidated IT resources can serve different
systems and be shared among different cloud consumers.
Automation
Data centers have specialized platforms that automate tasks like provisioning, configuration,
patching, and monitoring without supervision. Advances in data center management
platforms and tools leverage autonomic computing technologies to enable self-configuration
and self-recovery.
Most of the operational and administrative tasks of IT resources in data centers are
commanded through the network's remote consoles and management systems. Technical
personnel are not required to visit the dedicated rooms that house servers, except to
perform highly specific tasks.
High Availability
Since any form of data center outage significantly impacts business continuity for the
organizations that use their services, data centers are designed to operate with increasingly
higher levels of redundancy to sustain availability. Data centers usually have redundant,
uninterruptible power supplies, cabling, and environmental control subsystems in
anticipation of system failure, along with communication links and clustered hardware
for load balancing.
Requirements for security, such as physical and logical access controls and data recovery
strategies, need to be thorough and comprehensive for data centers, since they are
centralized structures that store and process business data.
Facilities
Data center facilities are custom-designed locations that are outfitted with specialized
computing, storage, and network equipment. These facilities have several functional layout
areas, as well as various power supplies, cabling, and environmental control stations that
regulate heating, ventilation, air conditioning, fire protection, and other related subsystems.
Computing Hardware
Much of the heavy processing in data centers is often executed by standardized commodity
servers that have substantial computing power and storage capacity. Several computing
hardware technologies are integrated into these modular servers, such as:
• rackmount form factor server design composed of standardized racks with interconnects
for power, network, and internal cooling
• support for different hardware processing architectures, such as x86-32, x86-64, and
RISC
• a power-efficient multi-core CPU architecture that houses hundreds of processing cores in
a space as small as a single unit of standardized racks
• redundant and hot-swappable components, such as hard disks, power supplies, network
interfaces, and storage controller cards
Storage Hardware
Data centers have specialized storage systems that maintain enormous amounts of digital
information in order to fulfill considerable storage capacity needs. These storage systems are
containers housing numerous hard disks that are organized into arrays.
Storage systems usually involve the following technologies:
• Hard Disk Arrays – These arrays inherently divide and replicate data among multiple
physical drives, and increase performance and redundancy by including spare disks. This
technology is often implemented using redundant arrays of independent disks (RAID)
schemes, which are typically realized through hardware disk array controllers.
• I/O Caching – This is generally performed through hard disk array controllers, which
enhance disk access times and performance by data caching.
• Hot-Swappable Hard Disks – These can be safely removed from arrays without requiring
prior powering down.
• Storage Virtualization – This is realized through the use of virtualized hard disks and
storage sharing.
• Fast Data Replication Mechanisms – These include snapshotting, which is saving a virtual
machine's memory into a hypervisor-readable file for future reloading, and volume cloning,
which is copying virtual or physical hard disk volumes and partitions.
Networked storage devices usually fall into one of the following categories:
• Storage Area Network (SAN) – Physical data storage media are connected through a
dedicated network and provide block-level data storage access using industry-standard
protocols, such as the Small Computer System Interface (SCSI).
• Network-Attached Storage (NAS) – Hard drive arrays are contained and managed by this
dedicated device, which connects through a network and facilitates access to data using
file-centric data access protocols like the Network File System (NFS) or Server Message
Block (SMB).
Network Hardware
Data centers require extensive network hardware in order to enable multiple levels of
connectivity. In a simplified view of the networking infrastructure, the data center is broken
down into five network subsystems; a summary of the most common elements used for
their implementation follows.
LAN Fabric
The LAN fabric constitutes the internal LAN and provides high-performance and redundant
connectivity for all of the data center's network-enabled IT resources. It is often
implemented with multiple network switches that facilitate network communications and
operate at speeds of up to ten gigabits per second.
SAN Fabric
Related to the implementation of storage area networks (SANs) that provide connectivity
between servers and storage systems, the SAN fabric is usually implemented with Fibre
Channel (FC), Fibre Channel over Ethernet (FCoE), and InfiniBand network switches.
NAS Gateways
This subsystem supplies attachment points for NAS-based storage devices and implements
protocol conversion hardware that facilitates data transmission between SAN and NAS
devices.
Data center network technologies have operational requirements for scalability and high
availability that are fulfilled by employing redundant and/or fault-tolerant configurations.
These five network subsystems improve data center redundancy and reliability to ensure
that they have enough IT resources to maintain a certain level of service even in the face of
multiple failures.
Other Considerations
IT hardware is subject to rapid technological obsolescence, with lifecycles that typically last
between five and seven years. The ongoing need to replace equipment frequently results in
a mix of hardware whose heterogeneity can complicate the entire data center's operations
and management (although this can be partially mitigated through virtualization).
Security is another major issue when considering the role of the data center and the vast
quantities of data contained within its doors. Even with extensive security precautions in
place, housing data exclusively at one data center facility means much more can be
compromised by a successful security incursion than if the data were distributed across
individual unlinked components.
5.6. Containerization
The operating system kernel allows for the existence of multiple isolated user-space
instances or multiple isolated runtimes known as containers, partitions, virtual engines, jails,
or chroot jails. Regardless of which runtime is used, when a cloud service executes within a
container, it is running on a real computer from its point of view.
A cloud service running on a physical or virtual server operating system can see all of the
provided resources, such as connected devices, ports, files, folders, network shares, and
CPUs, as well as the physical addressable memory. However, a cloud service running inside a
container can only see the container's contents and the devices attached to the container.
As explained earlier, virtualization refers to the act of creating a virtual, rather than an actual,
version of something. This includes virtual computer hardware platforms, storage devices,
and computer network resources. Virtual servers are an abstraction of physical hardware via
server virtualization and the use of hypervisors for abstracting a given physical server into
multiple virtual servers.
The hypervisor allows multiple virtual servers to run on a single physical host. Virtual servers
see the emulated hardware presented to them by the hypervisor as real hardware, and each
virtual server has its own operating system, also known as a guest operating system, that
needs to be deployed inside the virtual server and managed and maintained as if it were
deployed on a physical server.
In contrast, containers are an abstraction at the application or service layer that packages
code and dependencies together. Multiple containers can be deployed on the same machine
and share an operating system kernel with other containers. Each container runs as an
isolated process in user space. Containers do not require the guest operating system that
is needed for virtual servers and can run directly on a physical server's operating system.
Containers also consume less storage space than virtual servers. Figure 5.12 depicts the
difference between virtual servers and containers.
Containers can be deployed in virtual servers, in which case nested virtualization is required
to allow the container engine to be installed and operated. Nested virtualization refers to a
deployment where one virtualized system is deployed on another.
Benefits of Containers
Portability is one of the key benefits of containers, allowing cloud resource administrators to
move containers to any environment that shares the same host operating system and
container engine that the container is hosted on, without the need to change the
application or software.
Efficient resource utilization is achieved by significantly reducing the CPU, memory, and
storage usage footprint compared to virtual servers. It is possible to support several
containers on the same infrastructure required by a single virtual server, resulting in
performance improvements.
Containers can be created and deployed much faster than virtual servers, which supports a
more agile process and facilitates continuous integration.
Multiple containers can be deployed in a logical construct called a pod. A pod is a group of
one or more containers that have shared storage and/or network resources, and also share
the same configuration that determines how the containers are to be run. A pod is typically
employed when there are different cloud services that are part of the same application or
namespace and that need to run under a single IP address.
Container Engine
The key component of container architecture is the container engine, also referred to as the
containerization engine. The container engine is specialized software that is deployed in an
operating system to abstract the required resources and enable the definition and
deployment of containers. Container engine software can be deployed on physical machines
or virtual machines.
A container build file is a descriptor (created by the user or a service) that represents the
requirements of the application and services that run inside the container, as well as the
configuration parameters required by the container engine in order to create and deploy the
container.
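As a hedged sketch of how a container engine can be driven in practice, the snippet below talks to the Docker Engine through the Docker SDK for Python; this assumes both the engine and the `docker` Python package are installed, and the image name and command are placeholders rather than anything prescribed by the text. A real deployment would normally start from a build file/descriptor as described above.

```python
import docker  # assumes Docker Engine and the `docker` Python SDK are available

# Ask the container engine to create and run an isolated container
# from an existing image (illustrative image and command).
client = docker.from_env()                  # connect to the local container engine
container = client.containers.run(
    image="python:3.11-slim",               # pre-built image, pulled if absent
    command=["python", "-c", "print('hello from a container')"],
    detach=True,                            # return immediately; runs as an isolated process
)
container.wait()                            # wait for the containerized process to finish
print(container.logs().decode())            # hello from a container
container.remove()
```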
Container Image
The container engine uses a container image to deploy a container based on pre-defined
requirements. For example, if an application requires a database component or a Web server
service to operate, these requirements are defined by the user in the container build file.
Based on the defined descriptions, the container engine customizes the operating system
image and the required commands or services for the application.
Container Networking Address
Each container has its own network address (such as an IP address) used to communicate
with other containers and external components. A container can be connected to more than
one network by allocating additional network addresses to the container.
Containers use the physical or virtual network card of the system that the container engine
is deployed on to communicate with other containers and IT resources.
Storage Device
Similar to the networking address, a container may connect to one or more storage devices
that are made available to the containers over the network. Each container has its own level
of access to the storage, as defined by the system or administrators.
Chapter 4: Cloud Computing Architecture
(REFERENCE 1 as per guidelines)
4.1 Introduction
• Utility-oriented data centers serve as the infrastructure through which the services
are implemented and delivered.
• Any cloud service, whether virtual hardware, development platform, or application
software, relies on a distributed infrastructure owned by the provider or rented from
a third party.
Cloud computing supports any IT service that can be consumed as a utility and delivered
through a network, most likely the Internet. Such a characterization includes quite different
aspects: infrastructure, development platforms, applications, and services.
4.2.1 Architecture
It is possible to organize all the concrete realizations of cloud computing into a layered view
covering the entire stack (see Figure 4.1), from hardware appliances to software systems.
• Cloud resources are harnessed to offer "computing horsepower" required for
providing services. Often, this layer is implemented using a datacenter in which
hundreds or thousands of nodes are stacked together.
• Cloud infrastructure can be heterogeneous in nature, and database systems and other
storage services can also be part of the infrastructure.
• The physical infrastructure is managed by the core middleware, the objectives of
which are to provide an appropriate runtime environment for applications and to
best utilize resources.
• At the bottom of the stack, virtualization technologies are used to guarantee runtime
environment customization, application isolation, sandboxing, and quality of service.
Hardware virtualization is most commonly used at this level.
• Hypervisors manage the pool of resources and expose the distributed infrastructure
as a collection of virtual machines.
• By using virtual machine technology, it is possible to finely partition the hardware
resources such as CPU and memory and to virtualize specific devices, thus meeting
the requirements of users and applications.
• This solution is generally paired with storage and network virtualization strategies,
which allow the infrastructure to be completely virtualized and controlled.
• For example, programming-level virtualization helps in creating a portable runtime
environment where applications can be run and controlled. This scenario generally
implies that applications hosted in the cloud be developed with a specific technology
or a programming language, such as Java, .NET, or Python. In this case, the user does
not have to build their system from bare metal.
• Infrastructure management is the key function of core middleware, which supports
capabilities such as negotiation of the quality of service, admission control, execution
management and monitoring, accounting, and billing.
• The combination of cloud hosting platforms and resources is generally classified as an
Infrastructure-as-a-Service (IaaS) solution.
• We can organize the different examples of IaaS into two categories: some of them
provide both the management layer and the physical infrastructure; others provide
only the management layer (IaaS (M)).
• In this second case, the management layer is often integrated with other IaaS
solutions that provide physical infrastructure and adds value to them.
• IaaS solutions are suitable for designing the system infrastructure but provide limited
services to build applications. Such services are provided by cloud programming
environments and tools.
• The range of tools includes Web-based interfaces, command-line tools, and
frameworks for concurrent and distributed programming.
• In this scenario, users develop their applications specifically for the cloud by using
the API exposed at the user-level middleware.
• For this reason, this approach is also known as Platform-as-a-Service (PaaS) because
the service offered to the user is a development platform rather than an
infrastructure.
• PaaS solutions generally include the infrastructure as well, which is bundled as part
of the service provided to users.
• In the case of Pure PaaS, only the user-level middleware is offered, and it has to be
complemented with a virtual or physical infrastructure.
• The top layer of the reference model depicted in Figure 4.1 contains services
delivered at the application level. These are mostly referred to as Software-as-a-
Service (SaaS).
• As a reference model, it is then expected to have an adaptive management layer in
charge of elastically scaling on demand.
• SaaS implementations should feature such behavior automatically, whereas PaaS and
IaaS generally provide this functionality as a part of the API exposed to users.
• The reference model described in Figure 4.1 also introduces the concept of
Everything-as-a-Service (XaaS). This is one of the most important elements of cloud
computing: cloud services from different providers can be combined to provide a
completely integrated solution covering the whole computing stack of a system.
• IaaS providers can offer the bare metal in terms of virtual machines on which PaaS
solutions are deployed. When there is no need for a PaaS layer, it is possible to
directly customize the virtual infrastructure with the software stack needed to run
applications.
Table 4.1 summarizes the characteristics of the three major categories used to classify cloud
computing solutions.
• At the top layer, the user interface provides access to the services exposed by the
software management infrastructure.
• Such an interface is generally based on Web 2.0 technologies: Web services, RESTful
APIs, and mash-ups. These technologies allow either applications or final users to
access the services exposed by the underlying infrastructure.
• Web 2.0 applications allow developing full-featured management consoles
completely hosted in a browser or a Web page.
• Web services and RESTful APIs allow programs to interact with the service without
human intervention, thus providing complete integration within a software system.
• Management of the virtual machines is the most important function performed by
this layer. A central role is played by the scheduler, which is in charge of allocating
the execution of virtual machine instances.
• The scheduler interacts with the other components, which perform a variety of tasks
(a conceptual sketch follows this list):
o The pricing and billing component takes care of the cost of executing each
virtual machine instance and maintains data that will be used to charge the
user.
o The monitoring component tracks the execution of each virtual machine
instance and maintains data required for reporting and analyzing the
performance of the system.
o The reservation component stores the information about all the virtual machine
instances that have been executed or that will be executed in the future.
o If support for QoS-based execution is provided, a QoS/SLA management
component maintains a repository of all the SLAs made with the users; it is
used to ensure that a given virtual machine instance is executed with the
desired quality of service.
o The VM repository component provides a catalog of virtual machine images
that users can use to create virtual instances.
o A VM pool manager component is responsible for keeping track of all the live
instances.
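A conceptual Python sketch of how the scheduler might coordinate some of these components is shown below. All class, method, and image names are invented for illustration; production cloud managers are far more elaborate.

```python
class VMRepository:
    """Catalog of virtual machine images (illustrative)."""
    images = {"ubuntu-22.04": "image-descriptor"}

    def lookup(self, name):
        return self.images[name]

class VMPoolManager:
    """Keeps track of all the live virtual machine instances."""
    def __init__(self):
        self.live = []

    def start(self, image):
        instance = f"vm-{len(self.live) + 1}"
        self.live.append(instance)
        return instance

class Scheduler:
    """Allocates the execution of virtual machine instances."""
    def __init__(self):
        self.repo, self.pool = VMRepository(), VMPoolManager()
        self.reservations, self.billing = [], {}

    def launch(self, user, image_name):
        image = self.repo.lookup(image_name)        # VM repository: pick an image
        instance = self.pool.start(image)           # VM pool manager: new live instance
        self.reservations.append((user, instance))  # reservation component
        self.billing[instance] = 0.0                # pricing/billing starts charging
        return instance

print(Scheduler().launch("alice", "ubuntu-22.04"))  # vm-1
```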
• The bottom layer is composed of the physical infrastructure, on top of which the
management layer operates. A service provider will most likely use a massive
datacenter containing hundreds or thousands of nodes. A cloud infrastructure
developed in house, in a small or medium-sized enterprise or within a university
department, will most likely rely on a cluster.
• From an architectural point of view, the physical layer also includes the virtual
resources that are rented from external IaaS providers.
• In the case of complete IaaS solutions, all three levels are offered as services. This is
generally the case with public cloud vendors such as Amazon, GoGrid, Joyent,
Rightscale, Terremark, Rackspace, ElasticHosts, and Flexiscale.
• Other solutions instead cover only the user interface and the infrastructure software
management layers. They need to provide credentials to access third-party IaaS
providers or to own a private infrastructure in which the management software is
installed.
• This is the case with Enomaly, Elastra, Eucalyptus, OpenNebula, and specific IaaS (M)
solutions from VMware, IBM, and Microsoft.
A general overview of the features characterizing the PaaS approach is given in Figure 4.3.
It is possible to organize the various solutions into three wide categories: PaaS-I, PaaS-II, and
PaaS-III.
• The first category identifies PaaS implementations that completely follow the cloud
computing style for application development and deployment. They offer an
integrated development environment hosted within the Web browser where
applications are designed, developed, composed, and deployed. This is the case with
Force.com and Longjump. Both deliver as platforms the combination of middleware
and infrastructure.
• In the second class we can list all those solutions that are focused on providing a
scalable infrastructure for Web applications, mostly websites. In this case, developers
generally use the providers' APIs, which are built on top of industrial runtimes, to
develop applications. Google AppEngine is the most popular product in this category.
It provides a scalable runtime based on the Java and Python programming languages,
which have been modified to provide a secure runtime environment and enriched
with additional APIs and components to support scalability.
• The third category consists of all those solutions that provide a cloud programming
platform for any kind of application, not only Web applications. Among these, the
most popular is Microsoft Windows Azure, which provides a comprehensive
framework for building service-oriented cloud applications on top of the .NET
technology, hosted on Microsoft's datacenters. Other solutions in the same category,
such as Manjrasoft Aneka and Apprenda, provide only the middleware, with different
services.
The PaaS umbrella encompasses a variety of solutions for developing and hosting
applications in the cloud. Despite this heterogeneity, it is possible to identify some criteria
that are expected to be found in any implementation. As noted by Sam Charrington,
product manager at Appistry.com, there are some essential characteristics that identify a
PaaS solution:
• Runtime framework. This framework represents the "software stack" of the PaaS
model; the runtime framework executes end-user code according to the policies
set by the user and the provider.
• Abstraction. PaaS solutions are distinguished by the higher level of abstraction that
they provide. In the case of PaaS, the focus is on the applications the cloud must
support. This means that PaaS solutions offer a way to deploy and manage
applications on the cloud.
• Automation. PaaS environments automate the process of deploying applications to
the infrastructure, scaling them by provisioning additional resources when needed.
This process is performed automatically and according to the SLA made between the
customers and the provider.
• Cloud services. PaaS offerings provide developers and architects with services and
APIs, helping them to simplify the creation and delivery of elastic and highly available
cloud applications.
The acronym SaaS was then coined in 2001 by the Software & Information Industry
Association (SIIA) with the following connotation:
• The analysis carried out by SIIA was mainly oriented to cover application service
providers (ASPs) and all their variations, which capture the concept of software
applications consumed as a service in a broader sense.
• ASPs already had some of the core characteristics of SaaS:
o The product sold to the customer is application access.
o The application is centrally managed.
o The service delivered is one-to-many.
o The service delivered is an integrated solution delivered on the contract,
which means provided as promised.
• The SaaS approach introduces a more flexible way of delivering application services
that are fully customizable by the user by integrating new services, injecting their
own components, and designing the application and information workflows. Such a
new approach has also been made possible with the support of Web 2.0 technologies,
which allowed turning the Web browser into a full-featured interface, able even to
support application composition and development.
• Initially the SaaS model was of interest only to lead users and early adopters. The
benefits delivered at that stage were the following:
o Software cost reduction and total cost of ownership (TCO) were paramount
o Service-level improvements
o Rapid implementation
o Standalone and configurable applications
o Rudimentary application and data integration
o Subscription and pay-as-you-go (PAYG) pricing
• SaaS 2.0 does not introduce a new technology but transforms the way in which SaaS
is used. In particular, SaaS 2.0 is focused on providing a more robust infrastructure
and application platforms driven by SLAs. SaaS 2.0 will focus on the rapid achievement
of business objectives.
• The existing SaaS infrastructures not only allow the development and customization
of applications, but they also facilitate the integration of services that are exposed by
other parties.
• This approach dramatically changes the software ecosystem of the SaaS market,
which is no longer monopolized by a few vendors but is now a fully interconnected
network of service providers.
• Software-as-a-Service applications can serve different needs. CRM, ERP, and social
networking applications are definitely the most popular ones. SalesForce.com is
probably the most successful and popular example of a CRM service. It provides a
wide range of services for applications: customer relationship and human resource
management, enterprise resource planning, and many other features.
• SalesForce.com builds on top of the Force.com platform, which provides a fully
featured environment for building applications. It offers either a programming
language or a visual environment to arrange components together for building
applications.
• Similar solutions are offered by NetSuite and RightNow. NetSuite is an integrated
software business suite featuring financials, CRM, inventory, and ecommerce
functionalities integrated all together.
• RightNow is a customer experience-centered SaaS application that integrates
different features, from chat to Web communities, to support the common activities
of an enterprise.
Another important class of popular SaaS applications comprises social networking
applications such as Facebook and professional networking sites such as LinkedIn.
Other than providing the basic features of networking, they allow incorporating and
extending their capabilities by integrating third-party applications.
Office automation applications are also an important representative of SaaS applications:
Google Documents and Zoho Office are examples of Web-based applications that aim to
address all user needs for documents, spreadsheets, and presentation management.
• Public clouds are not applicable in all scenarios. For example, a very common critique of the use of cloud computing in its canonical implementation is the loss of control. In the case of public clouds, the provider is in control of the infrastructure and, eventually, of the customers' core logic and sensitive data. Even though there could be regulatory procedures in place that guarantee fair management and respect for the customer's privacy, this condition can still be perceived as a threat or as an unacceptable risk that some organizations are not willing to take.
• In particular, institutions such as government and military agencies will not consider public clouds as an option for processing or storing their sensitive data.
• More precisely, the geographical location of a datacenter generally determines the regulations that are applied to the management of digital information. As a result, according to the specific location of data, some sensitive information can be made accessible to government agencies or even considered outside the law if processed with specific cryptographic techniques.
• For example, the USA PATRIOT Act provides the U.S. government and other agencies with virtually limitless powers to access information, including information belonging to any company that stores it in U.S. territory.
• Nonetheless, having an infrastructure able to deliver IT services on demand can still be a winning solution, even when implemented within the private premises of an institution. This idea led to the diffusion of private clouds, which are similar to public clouds, but their resource-provisioning model is limited within the boundaries of an organization.
• Private clouds are virtual distributed systems that rely on a private infrastructure and provide internal users with dynamic provisioning of computing resources.
• Instead of a pay-as-you-go model as in public clouds, there could be other schemes in place, taking into account the usage of the cloud and proportionally billing the different departments or sections of an enterprise.
• Private clouds have the advantage of keeping the core business operations in-house by relying on the existing IT infrastructure and reducing the burden of maintaining it once the cloud has been set up.
• In this scenario, security concerns are less critical, since sensitive information does not flow out of the private infrastructure.
• Moreover, existing IT resources can be better utilized because the private cloud can provide services to a different range of users.
• Another interesting opportunity that comes with private clouds is the possibility of testing applications and systems at a comparatively lower cost than in public clouds before deploying them on the public virtual infrastructure.
• A Forrester report on the benefits of delivering in-house cloud computing solutions for enterprises highlighted some of the key advantages of using a private cloud computing infrastructure:
o Customer information protection. In-house security is easier to maintain and rely on.
o Infrastructure ensuring SLAs. Quality of service implies specific operations such as appropriate clustering and failover, data replication, system monitoring and maintenance, and disaster recovery; these and other uptime services can be commensurate to the application needs.
o Compliance with standard procedures and operations. If organizations are subject to third-party compliance standards, specific procedures have to be put in place when deploying and executing applications.
• From an architectural point of view, private clouds can be implemented on more heterogeneous hardware: They generally rely on the existing IT infrastructure already deployed on the private premises. This could be a datacenter, a cluster, an enterprise desktop grid, or a combination of them.
• The physical layer is complemented with infrastructure management software (i.e., IaaS (M)) or a PaaS solution, according to the service delivered to the users of the cloud.
Different options can be adopted to implement private clouds. Figure 4.4 provides a comprehensive view of the solutions, together with some reference to the most popular software used to deploy private clouds.
• At the bottom layer of the software stack, virtual machine technologies such as Xen, KVM, and VMware serve as the foundations of the cloud.
• Virtual machine management technologies such as VMware vCloud, Eucalyptus, and OpenNebula can be used to control the virtual infrastructure and provide an IaaS solution. VMware vCloud is a proprietary solution, whereas Eucalyptus provides full compatibility with Amazon Web Services interfaces and supports different virtual machine technologies such as Xen, KVM, and VMware.
• Like Eucalyptus, OpenNebula is an open-source solution for virtual infrastructure management that supports KVM, Xen, and VMware and has been designed to easily integrate third-party IaaS providers. Its modular architecture allows extending the software with additional features such as the capability of reserving virtual machine instances.
• Solutions that rely on the previous virtual machine managers and provide added value are OpenPEX and InterGrid.
• PaaS solutions can provide an additional layer and deliver a high-level service for private clouds. Among the options available for private deployment of clouds we can consider DataSynapse, Zimory Pools, Elastra, and Aneka. DataSynapse is a global provider of application virtualization software.
• Zimory provides a software infrastructure layer that allows creating an internal cloud composed of sparse private and public resources and provides facilities for migrating applications within the existing infrastructure.
• Aneka is a software development platform that can be used to deploy a cloud infrastructure on top of heterogeneous hardware: datacenters, clusters, and desktop grids.
• Private clouds can provide in-house solutions for cloud computing, but when compared to public clouds they exhibit a more limited capability to scale elastically on demand.
• Public clouds are large enough to serve the needs of multiple users, but they suffer from security threats and administrative pitfalls. Private clouds suffer from the inability to scale on demand and to efficiently address peak loads.
• In this case, it is important to leverage the capabilities of public clouds as needed. Hence, a hybrid solution could be an interesting opportunity for taking advantage of the best of the private and public worlds.
• This led to the development and diffusion of hybrid clouds.
• Hybrid clouds allow enterprises to exploit existing IT infrastructures, maintain sensitive information within the premises, and naturally grow and shrink by provisioning external resources and releasing them when they're no longer needed.
• Security concerns are then limited only to the public portion of the cloud, which can be used to perform operations with less stringent constraints but that are still part of the system workload.
• Figure 4.5 provides a general overview of a hybrid cloud: It is a heterogeneous distributed system resulting from a private cloud that integrates additional services or resources from one or more public clouds.
• For this reason they are also called heterogeneous clouds.
• As depicted in the diagram, dynamic provisioning is a fundamental component in this scenario. Hybrid clouds address scalability issues by leveraging external resources for exceeding capacity demand.
• These resources or services are temporarily leased for the time required and then released. This practice is also known as cloudbursting.
• According to the Cloud Computing Wiki, the term cloudburst has a double meaning; it also refers to the "failure of a cloud computing environment due to the inability to handle a spike in demand". In this book, we always refer to the dynamic provisioning of resources from public clouds when mentioning this term.
• Whereas the concept of a hybrid cloud is general, it mostly applies to IT infrastructure rather than software services.
• In an IaaS scenario, dynamic provisioning refers to the ability to acquire virtual machines on demand in order to increase the capability of the resulting distributed system and then release them.
• Infrastructure management software and PaaS solutions are the building blocks for deploying and managing hybrid clouds.
• What is then missing is an advanced scheduling engine that is able to differentiate these resources and provide smart allocations by taking into account the budget available to extend the existing infrastructure.
• In the case of OpenNebula, advanced schedulers such as Haizea can be integrated to provide cost-based scheduling.
• A different approach is taken by InterGrid. This is essentially a distributed scheduling engine that manages the allocation of virtual machines in a collection of peer networks.
• Such networks can be represented by a local cluster, a gateway to a public cloud, or a combination of the two.
• Dynamic provisioning is most commonly implemented in PaaS solutions that support hybrid clouds.
• In this scenario, the role of dynamic provisioning becomes fundamental to ensuring the execution of applications under the QoS agreed on with the user.
• For example, Aneka provides a provisioning service that leverages different IaaS providers for scaling the existing cloud infrastructure. In particular, each user application has a budget attached to it, and the scheduler uses that budget to optimize the execution of the application by renting virtual nodes if needed, as sketched below.
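The scheduling policy itself is not reproduced in these notes. As a rough illustration only, the Python sketch below implements a generic budget-capped cloudbursting decision; the function name, parameters, and policy are our own simplification and not Aneka's actual API.

import math

# Decide how many public-cloud VMs to rent for a workload, given a budget.
def vms_to_rent(pending_tasks, tasks_per_vm_hour, deadline_hours,
                private_vms, budget, vm_price_per_hour):
    needed_rate = pending_tasks / deadline_hours               # tasks/hour required to meet the deadline
    shortfall = needed_rate - private_vms * tasks_per_vm_hour  # rate the private cloud cannot cover
    wanted = max(0, math.ceil(shortfall / tasks_per_vm_hour))
    affordable = int(budget // (vm_price_per_hour * deadline_hours))
    return min(wanted, affordable)                             # never exceed the attached budget

print(vms_to_rent(10_000, tasks_per_vm_hour=100, deadline_hours=5,
                  private_vms=10, budget=50.0, vm_price_per_hour=0.50))
# needed 2000 tasks/h, private covers 1000/h -> 10 VMs wanted, 20 affordable -> rent 10

The general idea is the same regardless of provider: private capacity is consumed first, and public resources are rented only up to what the application's budget allows.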
Figure 4.6 provides a general view of the usage scenario of community clouds, together with a reference architecture.
• The users of a specific community cloud fall into a well-identified community, sharing the same concerns or needs; they can be government bodies, industries, or even simple users, but all of them focus on the same issues for their interaction with the cloud.
• From an architectural point of view, a community cloud is most likely implemented over multiple administrative domains. This means that different organizations such as government bodies, private enterprises, research organizations, and even public virtual infrastructure providers contribute with their resources to build the cloud infrastructure.
Media industry.
• In the media industry, companies are looking for low-cost, agile, and simple solutions to improve the efficiency of content production.
• Most media productions involve an extended ecosystem of partners. In particular, the creation of digital content is the outcome of a collaborative process that includes movement of large data, massive compute-intensive rendering tasks, and complex workflow executions.
• Community clouds can provide a shared environment where services can facilitate business-to-business collaboration and offer the horsepower in terms of aggregate bandwidth, CPU, and storage required to efficiently support media production.
Healthcare industry.
• In the healthcare industry, there are different scenarios in which community clouds could be of use.
• In particular, community clouds can provide a global platform on which to share information and knowledge without revealing sensitive data maintained within the private infrastructure.
• The naturally hybrid deployment model of community clouds can easily support the storing of patient-related data in a private cloud while using the shared infrastructure for noncritical services and automating processes within hospitals.
• In these sectors, community clouds can bundle the comprehensive set of solutions that together vertically address management, deployment, and orchestration of services and operations.
• Since these industries involve different providers, vendors, and organizations, a community cloud can provide the right type of infrastructure to create an open and fair market.
Public sector.
• Legal and political restrictions in the public sector can limit the adoption of public cloud offerings.
• Moreover, governmental processes involve several institutions and agencies and are aimed at providing strategic solutions at local, national, and international administrative levels.
• They involve business-to-administration, citizen-to-administration, and possibly business-to-business processes.
• Some examples include invoice approval, infrastructure planning, and public hearings.
• A community cloud can constitute the optimal venue to provide a distributed environment in which to create a communication platform for performing such operations.
Scientific research.
• Science clouds are an interesting example of community clouds. In this case, the common interest driving different organizations sharing a large distributed infrastructure is scientific computing.
• The term community cloud can also identify a more specific type of cloud that arises from concern over the control exercised by vendors in cloud computing and that aspires to combine the principles of digital ecosystems with the case study of cloud computing.
• A community cloud is formed by harnessing the underutilized resources of user machines and providing an infrastructure in which each user can be, at the same time, a consumer, a producer, or a coordinator of the services offered by the cloud.
• Openness. By removing the dependency on cloud vendors, community clouds are open systems in which fair competition between different solutions can happen.
• Community. Being based on a collective that provides resources and services, the infrastructure turns out to be more scalable because the system can grow simply by expanding its user base.
• For example, the location of data is crucial, as the need for moving terabytes of data becomes an obstacle for high-performing computations.
• Data partitioning as well as content replication and scalable algorithms help in improving the performance of data-intensive applications.
• Open challenges in data-intensive computing identified by Ian Gorton et al. are:
o Scalable algorithms that can search and process massive datasets
o New metadata management technologies that can scale to handle complex, heterogeneous, and distributed data sources
o Advances in high-performance computing platforms aimed at providing better support for accessing in-memory multiterabyte data structures
o High-performance, highly reliable, petascale distributed file systems
o Data signature-generation techniques for data reduction and rapid processing
o New approaches to software mobility for delivering algorithms that are able to move the computation to where the data are located
o Specialized hybrid interconnection architectures that provide better support for filtering multigigabyte datastreams coming from high-speed networks and scientific instruments
o Flexible and high-performance software integration techniques that facilitate the combination of software modules running on different platforms to quickly form analytical pipelines
• Large datasets and massive amounts of data are being produced, mined, and crunched by companies that provide Internet services such as searching, online advertising, and social media.
• It is critical for such companies to efficiently analyze these huge datasets because they constitute a precious source of information about their customers.
• Log analysis is an example of a data-intensive operation that is commonly performed in this context; companies such as Google have a massive amount of data in the form of logs that are processed daily using their distributed infrastructure.
• As a result, they settled upon an analytic infrastructure, which differs from the grid-based infrastructure used by the scientific community.
• Together with the diffusion of cloud computing technologies that support data-intensive computations, the term Big Data has become popular.
• This term characterizes the nature of data-intensive computations today and currently identifies datasets that grow so large that they become complex to work with using on-hand database management tools.
• Big Data problems are found in nonscientific application domains such as weblogs, radio frequency identification (RFID), sensor networks, social networks, Internet text and documents, Internet search indexing, call detail records, military surveillance, medical records, photography archives, video archives, and large-scale ecommerce.
• Other than the massive size, what characterizes all these examples is that new data are accumulated with time rather than replacing the old data.
• In general, the term Big Data applies to datasets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time.
• Therefore, Big Data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes of data in a single dataset.
• Cloud technologies support data-intensive computing in several ways:
o By providing a large number of compute instances on demand, which can be used to process and analyze large datasets in parallel.
o By providing a storage system optimized for keeping large blobs of data and other distributed data store architectures.
o By providing frameworks and programming APIs optimized for the processing and management of large amounts of data. These APIs are mostly coupled with a specific storage infrastructure to optimize the overall performance of the system.
• In particular, advances in distributed file systems for the management of raw data in the form of files, distributed object stores, and the spread of the NoSQL movement constitute the major directions toward support for data-intensive computing.
• Distributed file systems constitute the primary support for data management.
• They provide an interface whereby to store information in the form of files and later access them for read and write operations.
• Among the several implementations of file systems, few specifically address the management of huge quantities of data on a large number of nodes. Mostly, these file systems constitute the data storage support for large computing clusters, supercomputers, massively parallel architectures, and, lately, storage/computing clouds.
Lustre.
• The Lustre file system is a massively parallel distributed file system that covers the needs of a small workgroup of clusters up to a large-scale computing cluster.
• The file system is used by several of the Top 500 supercomputing systems, including the one rated the most powerful supercomputer in the June 2012 list, Sequoia.
• Lustre is designed to provide access to petabytes (PB) of storage and to serve thousands of clients with an I/O throughput of hundreds of gigabytes per second (GB/s).
• The system is composed of a metadata server that contains the metadata about the file system and a collection of object storage servers that are in charge of providing storage.
• Users access the file system via a POSIX-compliant client, which can be either mounted as a module in the kernel or used through a library.
• The file system implements a robust failover strategy and recovery mechanism, making server failures and recoveries transparent to clients.
IBM General Parallel File System (GPFS).
• GPFS is the high-performance distributed file system developed by IBM that provides support for the RS/6000 supercomputer and Linux computing clusters.
• GPFS is a multiplatform distributed file system built over several years of academic research and provides advanced recovery mechanisms.
• GPFS is built on the concept of shared disks, in which a collection of disks is attached to the file system nodes by means of some switching fabric.
• The file system makes this infrastructure transparent to users and stripes large files over the disk array, replicating portions of each file to ensure high availability.
• By means of this infrastructure, the system is able to support petabytes of storage, which is accessed at high throughput and without losing consistency of data.
• Compared to other implementations, GPFS distributes the metadata of the entire file system and provides transparent access to it, thus eliminating a single point of failure.
Sector.
• Sector is the storage cloud that supports the execution of data-intensive applications defined according to the Sphere framework. It is a user-space file system that can be deployed on commodity hardware across a wide-area network.
• Compared to other file systems, Sector does not partition a file into blocks but replicates entire files on multiple nodes, allowing users to customize the replication strategy for better performance.
• The system's architecture is composed of four types of nodes: a security server, one or more master nodes, slave nodes, and client machines.
• The security server maintains all the information about access control policies for users and files, whereas master servers coordinate and serve the I/O requests of clients, which ultimately interact with slave nodes to access files.
• The protocol used to exchange data with slave nodes is UDT, which is a lightweight connection-oriented protocol optimized for wide-area networks.
• Amazon S3 is the online storage service provided by Amazon. Even though its internal details are not revealed, the system is claimed to support high availability, reliability, scalability, virtually infinite storage, and low latency at commodity cost.
• The system offers a flat storage space organized into buckets, which are attached to an Amazon Web Services (AWS) account.
• Each bucket can store multiple objects, each identified by a unique key.
• Objects are identified by unique URLs and exposed through HTTP, thus allowing very simple get-put semantics, as sketched below.
• Because of the use of HTTP, there is no need for any specific library to access the storage system, and objects can also be retrieved through the BitTorrent protocol.
• Despite its simple semantics, a POSIX-like client library has been developed to mount S3 buckets as part of the local file system.
• Besides the minimal semantics, security is another limitation of S3.
• The visibility and accessibility of objects are linked to AWS accounts, and the owner of a bucket can decide to make it visible to other accounts or to the public.
• It is also possible to define authenticated URLs, which provide public access to anyone for a limited (and configurable) period of time.
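As a small illustration of the bucket/key model and the get-put semantics described above, the following Python sketch uses the boto3 SDK; the bucket name and key are hypothetical, and valid AWS credentials are assumed.

import boto3

s3 = boto3.client("s3")
# Store an object under a key inside a bucket attached to the AWS account.
s3.put_object(Bucket="example-notes-bucket",
              Key="reports/2012/summary.txt",
              Body=b"hello cloud storage")
# Retrieve the same object; each object is also addressable by a unique URL over HTTP.
response = s3.get_object(Bucket="example-notes-bucket", Key="reports/2012/summary.txt")
print(response["Body"].read())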
• Except for the S3 service, it is possible to sketch a general reference architecture across all the systems presented that identifies two major roles into which all the nodes can be classified. Metadata or master nodes contain the information about the location of files or file chunks, whereas slave nodes are used to provide direct access to the storage space.
• The architecture is completed by client libraries, which provide a simple interface for accessing the file system and which are to some extent or completely compliant with the POSIX specification.
• Variations of the reference architecture can include the ability to support multiple masters, to distribute the metadata over multiple nodes, or to easily interchange the roles of nodes. The most important aspect common to all these different implementations is the ability to provide fault-tolerant and highly available storage systems.
• The term Not Only SQL (NoSQL) was originally coined in 1998 to identify a relational database that did not expose a SQL interface to manipulate and query data but relied on a set of UNIX shell scripts and commands to operate on text files containing the actual data.
• In a very strict sense, NoSQL cannot be considered a relational database, since it is not a monolithic piece of software organizing information according to the relational model, but rather a collection of scripts that allow users to manage most of the simplest and most common database tasks by using text files as information stores.
• Later, in 2009, the term NoSQL was reintroduced with the intent of labeling all those database management systems that did not use a relational model but provided simpler and faster alternatives for data manipulation.
• Nowadays, the term NoSQL is a big umbrella encompassing all the storage and database management systems that differ in some way from the relational model. Their general philosophy is to overcome the restrictions imposed by the relational model and to provide more efficient systems.
• This often implies the use of tables without fixed schemas to accommodate a larger range of data types, or the avoidance of joins to increase performance and scale horizontally.
• Two main factors have determined the growth of the NoSQL movement: in many cases simple data models are enough to represent the information used by applications, and the quantity of information contained in unstructured formats has grown considerably in the last decade.
• These two factors made software engineers look for alternatives more suitable to the specific application domains they were working on.
• As a result, several different initiatives explored the use of nonrelational storage systems, which differ considerably from each other. A broad classification is reported by Wikipedia, which distinguishes NoSQL implementations into:
o Document stores (Apache Jackrabbit, Apache CouchDB, SimpleDB, Terrastore).
o Graphs (AllegroGraph, Neo4j, FlockDB, Cerebrum).
o Key-value stores. This is a macro classification that is further categorized into key-value stores on disk, key-value caches in RAM, hierarchical key-value stores, eventually consistent key-value stores, and ordered key-value stores.
o Multivalue databases (OpenQM, Rocket U2, OpenInsight).
o Object databases (ObjectStore, JADE, ZODB).
o Tabular stores (Google BigTable, Hadoop HBase, Hypertable).
o Tuple stores (Apache River).
Amazon Dynamo.
Google Bigtable.
• Bigtable identifies two kinds of processes: master processes and tablet server processes.
• A tablet server is responsible for serving the requests for a given tablet, that is, a contiguous partition of rows of a table.
• Each server can manage multiple tablets (commonly from 10 to 1,000).
• The master server is responsible for keeping track of the status of the tablet servers and of the allocation of tablets to tablet servers.
• The master constantly monitors the tablet servers to check whether they are alive, and when they are not reachable, the allocated tablets are reassigned and, if necessary, partitioned across other servers.
• Chubby, a distributed, highly available, and persistent lock service, supports the activity of the master and tablet servers.
• System monitoring and data access are filtered through Chubby, which is also responsible for managing replicas and providing consistency among them.
• At the very bottom layer, the data are stored in the Google File System in the form of files, and all the update operations are logged into the file for easy recovery of data in case of failures or when tablets need to be reassigned to other servers.
• Bigtable uses a specific file format for storing the data of a tablet, which can be compressed to optimize the access and storage of data.
• It serves as a storage back-end for 60 applications (such as Google Personalized Search, Google Analytics, Google Finance, and Google Earth) and manages petabytes of data.
Apache Cassandra.
Hadoop HBase.
• HBase is the distributed database that supports the storage needs of the Hadoop distributed programming platform.
• HBase is designed by taking inspiration from Google Bigtable; its main goal is to offer real-time read/write operations for tables with billions of rows and millions of columns by leveraging clusters of commodity hardware.
• The internal architecture and logical model of HBase is very similar to Google Bigtable, and the entire system is backed by the Hadoop Distributed File System (HDFS), which mimics the structure and services of GFS.
• The map function reads a key-value pair and produces a list of key-value pairs of a different type.
• The reduce function reads a pair composed of a key and a list of values and produces a list of values of the same type.
• The types (k1, v1, k2, v2) used in the expression of the two functions provide hints as to how these two functions are connected and executed to carry out the computation of a MapReduce job: The output of the map tasks is aggregated by grouping the values according to their corresponding keys, and it constitutes the input of the reduce tasks, which, for each of the keys found, reduce the list of attached values to a single value.
• Therefore, the input of a MapReduce computation is expressed as a collection of key-value pairs <k1, v1>, and the final output is represented by a list of values: list(v2). A minimal worked example appears after this list of points.
• Figure 8.5 depicts a reference workflow characterizing MapReduce computations. As shown, the user submits a collection of files that are expressed in the form of a list of <k1, v1> pairs and specifies the map and reduce functions.
• These files are entered into the distributed file system that supports MapReduce and, if necessary, partitioned in order to be the input of map tasks.
• Map tasks generate intermediate files that store collections of <k2, list(v2)> pairs, and these files are saved into the distributed file system.
• The MapReduce runtime might eventually aggregate the values corresponding to the same keys. These files constitute the input of reduce tasks, which finally produce output files in the form of list(v2).
• The operation performed by reduce tasks is generally expressed as an aggregation of all the values that are mapped by a specific key. The number of map and reduce tasks to create, the way files are partitioned with respect to these tasks, and the number of map tasks connected to a single reduce task are the responsibility of the MapReduce runtime.
• In addition, the way files are stored and moved is the responsibility of the distributed file system that supports MapReduce.
• The computation model expressed by MapReduce is very straightforward and has proven successful in the case of Google, where the majority of the information that needs to be processed is stored in textual form and is represented by Web pages or log files.
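To make the <k1, v1> to list(<k2, v2>) to list(v2) flow concrete, here is a minimal single-process word-count sketch in Python. It is illustrative only: the toy run_job function stands in for the shuffle and grouping performed by a real MapReduce runtime such as Hadoop.

from collections import defaultdict

# map: (k1, v1) -> list of (k2, v2); here k1 is a file name and v1 its text content.
def map_fn(_filename, text):
    return [(word, 1) for word in text.split()]

# reduce: (k2, list(v2)) -> list(v2); here it sums all the counts emitted for one word.
def reduce_fn(_word, counts):
    return [sum(counts)]

def run_job(inputs, map_fn, reduce_fn):
    # Group intermediate <k2, v2> pairs by key, as the runtime's shuffle phase would.
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}

print(run_job([("a.txt", "to be or not to be")], map_fn, reduce_fn))
# {'to': [2], 'be': [2], 'or': [1], 'not': [1]}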
Some examples that show the flexibility of MapReduce are the following:
Distributed grep. The grep operation, which performs the recognition of patterns within text streams, is performed across a wide set of files. MapReduce is leveraged to provide a parallel and faster execution of this operation. In this case, the input file is a plain text file, and the map function emits a line into the output each time it recognizes the given pattern. The reduce task aggregates all the lines emitted by the map tasks into a single file.
Reverse Web-link graph. The reverse Web-link graph keeps track of all the possible Web pages that might lead to a given link. In this case the input files are simple HTML pages that are scanned by map tasks, emitting <target, source> pairs for each of the links found in the Web page source. The reduce task will collate all the pairs that have the same target into a <target, list(source)> pair. The final result is given by one or more files containing these mappings.
Term vector per host. A term vector recaps the most important words occurring in a set of documents in the form of list(<word, frequency>), where the number of occurrences of a word is taken as a measure of its importance. MapReduce is used to provide a mapping between the origin of a set of documents, obtained as the host component of the URL of a document, and the corresponding term vector. In this case, the map task creates a pair <host, term-vector> for each text document retrieved, and the reduce task aggregates the term vectors corresponding to documents retrieved from the same host.
Inverted index. The inverted index contains information about the presence of words in documents. This information is useful for fast full-text searches compared to direct document scans. In this case, the map task takes a document as input and, for each document, emits a collection of <word, document-id> pairs. The reduce function aggregates the occurrences of the same word, producing a pair <word, list(document-id)>.
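Using the same toy runner shown earlier, the inverted index example can be expressed with map and reduce functions like the following (again a sketch, not production code):

def map_fn(doc_id, text):
    # Emit a <word, document-id> pair for every distinct word in the document.
    return [(word, doc_id) for word in set(text.lower().split())]

def reduce_fn(_word, doc_ids):
    # Aggregate into <word, list(document-id)>, removing duplicates.
    return sorted(set(doc_ids))

Plugging these into run_job with a list of (document-id, text) pairs produces the word-to-documents mapping described above.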
Distributed sort. In this case, MapReduce is used to parallelize the execution of a sort operation over a large number of records. This application mostly relies on the properties of the MapReduce runtime, which sorts and creates partitions of the intermediate files, rather than on the operations performed in the map and reduce tasks. Indeed, these are very simple: The map task extracts the key from a record and emits a <key, record> pair for each record; the reduce task simply copies through all the pairs. The actual sorting process is performed by the MapReduce runtime, which emits and partitions the key-value pairs by ordering them according to their keys.
In general, any computation that can be expressed in the form of two major stages can be represented in terms of a MapReduce computation. These stages are:
Analysis. This phase operates directly on the data input file and corresponds to the operation performed by the map task. The computation at this stage is expected to be embarrassingly parallel, since map tasks are executed without any sequencing or ordering.
Aggregation. This phase operates on the intermediate results produced by the analysis stage and corresponds to the operation performed by the reduce task: the values associated with each key are aggregated, summed, or otherwise elaborated to produce the final output.
• Adaptations to this model are mostly concerned with identifying the appropriate keys, creating reasonable keys when the original problem does not have such a model, and finding ways to partition the computation between the map and reduce functions.
• Moreover, more complex algorithms can be decomposed into multiple MapReduce programs, where the output of one program constitutes the input of the following program.
• Figure 8.6 gives a more complete overview of a MapReduce infrastructure, according to the implementation proposed by Google.
• As depicted, the user submits the execution of MapReduce jobs by using the client libraries, which are in charge of submitting the input data files, registering the map and reduce functions, and returning control to the user once the job is completed.
• A generic distributed infrastructure (i.e., a cluster) equipped with job-scheduling capabilities and distributed storage can be used to run MapReduce applications.
• Two different kinds of processes are run on the distributed infrastructure: a master process and worker processes.
• The master process is in charge of controlling the execution of map and reduce tasks and of partitioning and reorganizing the intermediate output produced by the map tasks in order to feed the reduce tasks. The worker processes are used to host the execution of map and reduce tasks and provide basic I/O facilities that are used to interface the map and reduce tasks with input and output files.
• In a MapReduce computation, input files are initially divided into splits (generally 16 to 64 MB) and stored in the distributed file system. The master process generates the map tasks and assigns input splits to each of them by balancing the load.
• Worker processes have input and output buffers that are used to optimize the performance of map and reduce tasks.
• In particular, output buffers for map tasks are periodically dumped to disk to create intermediate files. Intermediate files are partitioned using a user-defined function to evenly split the output of the map tasks.
• The locations of these pairs are then notified to the master process, which forwards this information to the reduce tasks, which collect the required input via remote procedure calls that read from the map tasks' local storage.
• The key range is then sorted and all identical keys are grouped together. Finally, the reduce task is executed to produce the final output, which is stored in the global file system. This process is completely automatic; users may control it through configuration parameters that allow specifying (besides the map and reduce functions) the number of map tasks, the number of partitions into which to separate the final output, and the partition function for the intermediate key range. A simple example of such a partition function is sketched below.
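A hash-based partition function, similar in spirit to Hadoop's default HashPartitioner, is the simplest choice; the sketch below is an assumption about a typical implementation, not Google's actual code.

def partition(intermediate_key, num_reduce_tasks):
    # Map every intermediate key to one of the reduce tasks; identical keys always
    # land in the same partition, which is what the grouping step relies on.
    return hash(intermediate_key) % num_reduce_tasks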
• The MapReduce runtime ensures reliable execution of applications by providing a fault-tolerant infrastructure.
• Failures of both master and worker processes are handled, as are machine failures that make intermediate outputs inaccessible. Worker failures are handled by rescheduling map tasks somewhere else. This is also the technique used to address machine failures, since the valid intermediate output of map tasks has become inaccessible.
• Master process failure is instead addressed using checkpointing, which allows restarting the MapReduce job with a minimum loss of data and computation.
• The MapReduce model exhibits limitations, mostly due to the fact that the abstractions provided to process data are very simple, and complex problems might require considerable effort to be represented in terms of map and reduce functions only.
• Therefore, a series of extensions to and variations of the original MapReduce model have been proposed.
• They aim at extending the MapReduce application space and providing developers with an easier interface for designing distributed algorithms.
Hadoop. Apache Hadoop is a collection of software projects for reliable and scalable distributed computing. Taken together, the entire collection is an open-source implementation of the MapReduce framework supported by a GFS-like distributed file system. The initiative consists mostly of two projects: Hadoop Distributed File System (HDFS) and Hadoop MapReduce. The former is an implementation of the Google File System; the latter provides the same features and abstractions as Google MapReduce. Initially developed and supported by Yahoo!, Hadoop now constitutes the most mature and large data cloud application and has a very robust community of developers and users supporting it. Yahoo! now runs the world's largest Hadoop cluster, composed of 40,000 machines and more than 300,000 cores, made available to academic institutions all over the world. Besides the core projects of Hadoop, a collection of other related projects provides services for distributed computing.
Pig. Pig is a platform that allows the analysis of large datasets. Developed as an Apache project, Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The Pig infrastructure layer consists of a compiler for a high-level language that produces a sequence of MapReduce jobs that can be run on top of distributed infrastructures such as Hadoop. Developers can express their data analysis programs in a textual language called Pig Latin, which exposes a SQL-like interface and is characterized by greater expressiveness, reduced programming effort, and a familiar interface with respect to MapReduce.
Chapter 2: Assessing the Value Proposition
(REFERENCE 2 as per guidelines)
• The various attributes of cloud computing that make it a unique service are scalability, elasticity, low barrier to entry, and a utility type of delivery.
• Cloud computing is particularly valuable because it shifts capital expenditures into operating expenditures. This has the benefit of decoupling growth from cash on hand or from requiring access to capital. It also shifts risk away from an organization and onto the cloud provider.
• Service Level Agreements (SLAs) are an important aspect of cloud computing. They are essentially your working contract with any provider.
[Figure: "On the Cloud" vs. "On Premises" comparison]
It is the construction of large datacenters running commodity hardware that has enabled cloud computing to gain traction. These datacenters gain access to low-cost electricity, high-bandwidth network pipes, and low-cost commodity hardware and software, which, taken together, represent an economy of scale that allows cloud providers to amortize their investment and retain a profit.
Cloud computing is still in its infancy, but trends in adoption are already evident. In his white paper "Realizing the Value Proposition of Cloud Computing: CIO's Enterprise IT Strategy for the Cloud," Jitendra Pal Thethi, a Principal Architect for Infosys' Microsoft Technology Group, lists the following business types as the top 10 adopters of cloud computing. As a group, early adopters are characterized by their need for ubiquity and access to large data sets:
1. Messaging and team collaboration applications
2. Cross-enterprise integration projects
3. Infrastructure consolidation, server, and desktop virtualization efforts
4. Web 2.0 and social strategy companies
5. Web content delivery services
6. Data analytics and computation
7. Mobility applications for the enterprise
8. CRM applications
9. Experimental deployments, test bed labs, and development efforts
10. Backup and archival storage
The nature of cloud computing should provide us with new classes of applications, some of which are currently emerging. Because Wide Area Network (WAN) bandwidth provides one of the current bottlenecks for distributed computing, one of the major areas of interest in cloud computing is in establishing content delivery networks (CDNs). These solutions are also called edge networks because they cache content geographically.
Resource limits are exposed at peak conditions of the utility itself. As we all know, power utilities suffer brownouts and outages when the temperature soars, and cloud computing providers are no different. You see these outages on peak computing days such as Cyber Monday, the Monday after Thanksgiving in the United States when online Christmas sales traditionally start.
The illusion of a low barrier to entry may be pierced by an inconsistent pricing scheme that makes scaling more expensive than it should be. You can see this limit in the nonlinearity of pricing associated with "extra large" machine instances versus their "standard" size counterparts. Additionally, the low barrier to entry can also be accompanied by a low barrier to provisioning: a provisioning error can lead to substantial, unexpected costs.
Cloud computing vendors run very reliable networks. Often, cloud data is load-balanced between virtual systems and replicated between sites. However, even cloud providers experience outages. In the cloud, it is common to have various resources, such as machine instances, fail. Except for tightly managed PaaS cloud providers, the burden of resource management is still in the hands of the user, but the user is often provided with limited or immature management tools to address these issues.
Table 2.1 summarizes the various obstacles and challenges that cloud computing faces.
Behavioral factors relating to cloud adoption
Usually, a commodity is cheaper than a specialized item, but not always. Depending upon your situation, you can pay more for public cloud computing than you would for owning and managing your private cloud, or for owning and using software as well. That's why it's important to analyze the costs and benefits of your own cloud computing scenario carefully and quantitatively. You will want to compare the costs of cloud computing to private systems.
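The equation referred to here does not survive in these notes; a plausible reconstruction, assuming the usual formulation that this passage appears to paraphrase, is:
$$\mathrm{Cost}_{\mathrm{CLOUD}} = \sum \big(\mathrm{UnitCost}_{\mathrm{CLOUD}} \times (\mathrm{Revenue} - \mathrm{Cost}_{\mathrm{CLOUD}})\big)$$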
where the unit cost is usually defined as the cost of a machine instance per hour or of another resource.
Depending upon the deployment type, other resources add additional unit costs: storage quantity consumed, number of transactions, incoming or outgoing amounts of data, and so forth. Different cloud providers charge different amounts for these resources, some resources are free with one provider and charged for by another, and there are almost always variable charges based on resource sizing. Cloud resource pricing doesn't always scale linearly based on performance.
To compare your cost benefit with a private cloud, you will want to compare the value you determine in the equation above with the same calculation applied to the datacenter:
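Again, the equation itself is missing from these notes; a plausible reconstruction, consistent with the Utilization divisor discussed in the next sentence, is:
$$\mathrm{Cost}_{\mathrm{DATACENTER}} = \sum \Big(\mathrm{UnitCost}_{\mathrm{DATACENTER}} \times \big(\mathrm{Revenue} - \tfrac{\mathrm{Cost}_{\mathrm{DATACENTER}}}{\mathrm{Utilization}}\big)\Big)$$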
Notice the additional term for Utilization added as a divisor to the term for Cost_DATACENTER. This term appears because it is assumed that a private cloud has capacity that can't be captured, and it is further assumed that a private cloud doesn't employ the same level of virtualization or pooling of resources that a cloud computing provider can achieve. Indeed, no system can work at 100 percent utilization, because queuing theory states that as the system approaches 100 percent utilization, latency and response times go to infinity. Typical efficiencies in datacenters are between 60 and 85 percent. It is also further assumed that the datacenter is operating under averaged loads (not at peak capacity) and that the capacity of the datacenter is fixed by the assets it has.
The costs associated with resources in the cloud computing model, Cost_CLOUD, can be unbundled to a greater extent than the costs associated with Cost_DATACENTER. Cost_DATACENTER consists of the summation of the costs of each of the individual systems with all the associated resources, as follows:
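The summation is not reproduced in these notes; it presumably takes the per-system form below (a reconstruction, not the original figure):
$$\mathrm{Cost}_{\mathrm{DATACENTER}} = \sum_{n=1}^{N} \Big(\mathrm{UnitCost}_{\mathrm{DATACENTER}} \times \big(\mathrm{Revenue} - \tfrac{\mathrm{Cost}_{\mathrm{DATACENTER}}}{\mathrm{Utilization}}\big)\Big)_{\mathrm{System}\ n}$$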
where the sum includes terms for System 1, System 2, System 3, and so on.
The costs of a system in a datacenter must also include the overhead associated with power, cooling, and the physical plant. Estimates of these additional overheads indicate that, over the lifetime of a system, overhead roughly doubles the cost of any system. For a server with a four-year lifetime, you would therefore need to include an overhead roughly equal to 25 percent of the system's acquisition cost per year.
The overhead associated with IT staff is also a major cost, but it is highly variable from organization to organization. It is not uncommon for the burdened cost of a system in a datacenter to be 150 percent of the cost of the system itself.
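As a toy illustration of these rules of thumb (the figures are our own, purely hypothetical), the burdened lifetime cost of a single server can be estimated as follows:

acquisition = 10_000                                       # purchase price in dollars
lifetime_years = 4
facility_overhead = 0.25 * acquisition * lifetime_years    # power/cooling/plant: ~25% per year, roughly doubling the cost
it_burden = 1.50 * acquisition                             # staff/management burden (highly variable)
total_lifetime_cost = acquisition + facility_overhead + it_burden
print(total_lifetime_cost)                                 # 35000.0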
The costs associated with the cloud model are calculated rather differently. Each resource has its own specific cost, and many resources can be provisioned independently of one another. In theory, therefore, Cost_CLOUD is better represented by the equation:
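The unbundled form is also missing from these notes; it plausibly sums independently priced resources, along the lines of (a reconstruction):
$$\mathrm{Cost}_{\mathrm{CLOUD}} = \sum \big(\mathrm{UnitCost}_{\mathrm{INSTANCES}} + \mathrm{UnitCost}_{\mathrm{STORAGE}} + \mathrm{UnitCost}_{\mathrm{NETWORK}} + \dots\big)$$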
Avoiding Capital Expenditures
A major part of cloud computing's value proposition and its appeal is its ability to convert capital expenses (CapEx) to operating expenses (OpEx) through a usage pricing scheme that is elastic and can be right-sized. The conversion of real assets to virtual ones provides a measure of protection against too much or too little infrastructure. Essentially, moving expenses onto the OpEx side of a budget allows an organization to transfer risk to its cloud computing provider.
A company wishing to grow would normally be faced with the following options:
• Buy the new equipment and deploy it in-house
• Lease the equipment for a set period of time
• Outsource the operation to a managed-services organization
Cloud computing is also a good option when the cost of infrastructure and management is high. This is often the case with legacy applications and systems where maintaining the system's capabilities is a significant cost.
Right-sizing
Consider an accounting firm with a variable demand load, as shown in Figure 2.2. For each of the four quarters of the tax year, clients file their quarterly taxes on the service's Web site. Demand for three of those quarters rises broadly as the quarterly filing deadline arrives. The fourth quarter, which represents the year-end tax filing on April 15, shows a much larger and more pronounced spike for the two weeks approaching and just following that quarter's end. Clearly, this accounting business can't ignore the demand spike for its year-end accounting, because this is the single most important portion of the firm's business, but it needs to match demand to resources to maximize its profits.
Buying and leasing infrastructure to accommodate the peak demand (or, alternatively, load) shown in the figure as D_MAX means that nearly half of that infrastructure remains idle for most of the time. Fitting the infrastructure to meet the average demand, D_AVG, means that half of the transactions in the Q2 spike are not captured, and this is the mission-critical portion of this enterprise. More accurately, using D_AVG means that during maximum demand the service is slowed to a crawl and the system may not be responsive enough to satisfy any of the users.
These limits can be a serious constraint on profit and revenue. Outsourcing the demand may provide a solution to the problem. But outsourcing essentially shifts the burden of capital expenditures onto the service provider. A service contract that doesn't match infrastructure to demand suffers from the same inefficiencies that captive infrastructure does.
The cloud computing model addresses this problem by allowing you to right-size your infrastructure. In Figure 2.2, the demand is satisfied by an infrastructure that is labeled in terms of a CU, or "Compute Unit." The rule for this particular cloud provider is that infrastructure may be modified at the beginning of any month. For the low-demand Q1/Q4 time period, a 1 CU infrastructure is applied. On February 1st, the size is changed to a 4 CU infrastructure, which captures the entire spike of Q2 demand. Finally, on June 1st, a 2 CU size is applied to accommodate the typical demand D_AVG that is experienced in the last half of Q2 through the middle of Q4. This curve-fitting exercise captures the demand nearly all the time, with little idle capacity left unused.
If this deployment represented a single server, then 1 CU might represent a single dual-core processor, 2 CU might represent a quad-core processor, and 4 CU might represent a dual quad-core processor virtual system. Most cloud providers size their systems as small, medium, and large in just this manner.
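A toy comparison of the two strategies (the CU price and hours are our own hypothetical numbers, not the book's) shows why the curve-fitting schedule is attractive:

CU_HOUR_PRICE = 0.10                                          # hypothetical price per CU-hour
HOURS_PER_MONTH = 730

fixed = {month: 4 for month in range(1, 13)}                  # provision for D_MAX all year
right_sized = {**{m: 1 for m in (1, 11, 12)},                 # low-demand Q1/Q4 months
               **{m: 4 for m in (2, 3, 4, 5)},                # Feb 1st: capture the Q2 spike
               **{m: 2 for m in (6, 7, 8, 9, 10)}}            # Jun 1st: settle back toward D_AVG

def yearly_cost(schedule):
    return sum(cu * HOURS_PER_MONTH * CU_HOUR_PRICE for cu in schedule.values())

print(yearly_cost(fixed), yearly_cost(right_sized))           # 3504.0 vs. 2117.0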
Right-sizing is possible when the system load is cyclical or, in some cases, when there are predictable bursts or spikes in the load. You encounter cyclical loads in many public-facing commercial ventures with seasonal demands, when the load is affected by time zones, and at times when new products launch. Burst loads are less predictable. You can encounter bursts in systems that are gateways or hubs for traffic. In situations where demand is unpredictable and change can be rapid, right-sizing a cloud computing solution demands automated solutions.
The Total Cost of Ownership, or TCO, is a financial estimate of the costs of the use of a product or service over its lifetime, often broken out on a yearly or quarterly basis. In pitching cloud computing projects, it is common to create spreadsheets that predict the costs of using the cloud computing model versus performing the same functions in-house or on-premises.
To be really useful, a TCO must account for the real costs of items, but frequently it does not. For example, the cost of a system deployed in-house is not only the cost of acquisition and the cost of maintenance. A system consumes resources such as space in a datacenter or a portion of your site, power, cooling, and management. All these resources represent an overhead that is often overlooked, either in error or for political reasons. When you account for monitoring and management of systems, you must account for the burdened cost of an IT employee, the cost of the hardware and software that is used for management, and other hidden costs.
A Service Level Agreement (SLA) is the contract for performance negotiated between you and a service provider. In the early days of cloud computing, all SLAs were negotiated between a client and the provider. Today, with the advent of large utility-like cloud computing providers, most SLAs are standardized until a client becomes a large consumer of services.
If a vendor fails to meet the stated targets or minimums, it is punished by having to offer the client a credit or pay a penalty.
Microsoft publishes the SLAs associated with the Windows Azure Platform components at http://www.microsoft.com/windowsazure/sla/, which is illustrative of industry practice for cloud providers. Each individual component has its own SLA. The summary versions of these SLAs from Microsoft are reproduced here:
Windows Azure SLA: "Windows Azure has separate SLA's for compute and storage. For compute, we guarantee that when you deploy two or more role instances in different fault and upgrade domains, your Internet facing roles will have external connectivity at least 99.95% of the time. Additionally, we will monitor all of your individual role instances and guarantee that 99.9% of the time we will detect when a role instance's process is not running and initiate corrective action."
SQL Azure SLA: "SQL Azure customers will have connectivity between the database and our Internet gateway. SQL Azure will maintain a 'Monthly Availability' of 99.9% during a calendar month. 'Monthly Availability Percentage' for a specific customer database is the ratio of the time the database was available to customers to the total time in a month. Time is measured in 5-minute intervals in a 30-day monthly cycle. Availability is always calculated for a full month. An interval is marked as unavailable if the customer's attempts to connect to a database are rejected by the SQL Azure gateway."
AppFabric SLA: "Uptime percentage commitments and SLA credits for Service Bus and Access Control are similar to those specified above in the Windows Azure SLA. Due to inherent differences between the technologies, underlying SLA definitions and terms differ for the Service Bus and Access Control services. Using the Service Bus, customers will have connectivity between a customer's service endpoint and our Internet gateway; when our service fails to establish a connection from the gateway to a customer's service endpoint, then the service is assumed to be unavailable. Using Access Control, customers will have connectivity between the Access Control endpoints and our Internet gateway. In addition, for both Service Bus and Access Control, the service will process correctly formatted requests for the handling of messages and tokens; when our service fails to process a request properly, then the service is assumed to be unavailable. SLA calculations will be based on an average over a 30-day monthly cycle, with 5-minute time intervals. Failures seen by a customer in the form of service unavailability will be counted for the purpose of availability calculations for that customer."
Some cloud providers allow for service credits based on their ability to meet their contractual levels of uptime. For example, Amazon applies a service credit of 10 percent of the charge for Amazon S3 if the monthly uptime is equal to or greater than 99 percent but less than 99.9 percent. When the uptime drops below 99 percent, the service credit percentage rises to 25 percent, and this credit is applied to usage in the next billing period.
Amazon Web Services calculates uptime from an error rate, where
Error Rate = Errors / Requests
as measured for each 5-minute interval during a billing period; the monthly uptime percentage is then derived from the average of these error rates. The errors are based on internal server counters such as "InternalError" or "ServiceUnavailable." There are exclusions that limit Amazon's exposure.
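The credit schedule described above can be written down directly; the function below is a sketch of that schedule, not Amazon's actual billing code.

def s3_service_credit(monthly_uptime_percent):
    # Returns the fraction of the S3 charge credited to the next billing period.
    if monthly_uptime_percent < 99.0:
        return 0.25      # uptime below 99%: 25% credit
    if monthly_uptime_percent < 99.9:
        return 0.10      # uptime between 99% and 99.9%: 10% credit
    return 0.0           # SLA met: no credit

print(s3_service_credit(98.7))   # 0.25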
Service Level Agreements are based on the usage model. Most cloud providers price their pay-as-you-go resources at a premium and issue standard SLAs only for that purpose. You can also purchase subscriptions at various levels that guarantee you access to a certain amount of purchased resources. The SLAs attached to a subscription often offer different terms. If your organization requires access to a certain level of resources, then you need a subscription to a service. A usage model may not provide that level of access under peak load conditions.
When you purchase shrink-wrapped software, you are using that software based on a licensing agreement called a EULA, or End User License Agreement. The EULA may specify that the software meets the following criteria:
• It is yours to own.
• It can be installed on a single machine or on multiple machines.
• It allows for one or more connections.
• It has whatever limits the ISV has placed on its software.
In most instances, the purchase price of the software is directly tied to the EULA.
For a long time now, the computer industry has known that the use of distributed applications over the Internet was going to impact the way in which companies license their software, and indeed it has. The problem is that there is no uniform description of how applications accessed over cloud networks will be priced. There are several different licensing models in play at the moment, and no clear winners.
An example is the backup service Carbonite, where the service is for backing up a licensed computer. However, cloud computing applications rarely use machine licenses when the application is meant to be ubiquitous. If you need to access an application, service, or Web site from any location, then a machine license isn't going to be practical.
The impact of cloud computing on bulk software purchases and enterprise licensing schemes is even harder to gauge. Several analysts have remarked that the advent of cloud computing could lead to the end of enterprise licensing and could cause difficulties for software vendors going forward. It isn't clear what the impact on licensing will be in the future, but it is certainly an area to keep your eyes on.
Chapter 12: Understanding Cloud Security
(REFERENCE 2 as per guidelines)
Many of the tools and techniques that you would use to protect your data, comply with regulations, and maintain the integrity of your systems are complicated by the fact that you are sharing your systems with others and many times outsourcing their operations as well.
Different types of cloud computing service models provide different levels of security services. You get the least amount of built-in security with an Infrastructure as a Service provider, and the most with a Software as a Service provider.
Adapting your on-premises systems to a cloud model requires that you determine which security mechanisms you need and map them to the controls that exist in your chosen cloud service provider. When you identify missing security elements in the cloud, you can use that mapping to close the gap.
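As a rough illustration of this kind of gap analysis, the sketch below compares a hypothetical list of required security mechanisms against the controls a provider advertises and reports what remains uncovered. All mechanism and control names are invented placeholders, not a real provider's catalog.

# Hypothetical sketch: map required security mechanisms to advertised provider
# controls and list what is left for you to cover yourself or via the SLA.

required_mechanisms = {
    "encryption_at_rest",
    "encryption_in_transit",
    "audit_logging",
    "role_based_access_control",
    "intrusion_detection",
}

provider_controls = {
    "encryption_at_rest": "server-side encryption option",
    "encryption_in_transit": "TLS endpoints",
    "audit_logging": "API access logs",
}

covered = sorted(required_mechanisms & provider_controls.keys())
gaps = sorted(required_mechanisms - provider_controls.keys())

print("Covered by provider:")
for mechanism in covered:
    print(f"  {mechanism}: {provider_controls[mechanism]}")
print("Gaps to close yourself or negotiate in the SLA:", gaps)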
Storing data in the cloud is of particular concern. Data should be transferred and stored in an encrypted format. You can use proxy and brokerage services to separate clients from direct access to shared cloud storage.
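As one possible illustration of encrypting data on the client before it ever reaches shared storage, the sketch below uses the third-party Python cryptography package (Fernet). The upload call is only a placeholder, since the actual storage API depends on the provider.

# Minimal sketch of client-side encryption before upload, using the third-party
# "cryptography" package (pip install cryptography). The upload step is a
# placeholder because the storage API depends on your provider.
from cryptography.fernet import Fernet

key = Fernet.generate_key()        # keep this key outside the cloud store
cipher = Fernet(key)

plaintext = b"customer records to be archived"
ciphertext = cipher.encrypt(plaintext)

# upload_to_cloud_storage(ciphertext)   # hypothetical provider-specific call

# Later, after downloading the object, only the key holder can recover it.
assert cipher.decrypt(ciphertext) == plaintext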
Logging, auditing, and regulatory compliance are all features that require planning in cloud computing systems. They are among the services that need to be negotiated in Service Level Agreements.
The Internet was designed primarily to be resilient; it was not designed to be secure. Any distributed application has a much greater attack surface than an application that is closely held on a Local Area Network. Cloud computing has all the vulnerabilities associated with Internet applications, and additional vulnerabilities arise from pooled, virtualized, and outsourced resources.
In the report “Assessing the Security Risks of Cloud Computing,” Jay Heiser and Mark Nicolett of the Gartner Group highlighted the following areas of cloud computing that they felt were uniquely troublesome:
• Auditing
• Data integrity
• e-Discovery for legal compliance
• Privacy
• Recovery
• Regulatory compliance
Your risks in any cloud deployment depend on the particular cloud service model chosen and the type of cloud on which you deploy your applications. In order to evaluate your risks, you need to perform the following analysis:
1. Determine which resources (data, services, or applications) you are planning to move to the cloud.
2. Determine the sensitivity of each resource to risk.
Risks that need to be evaluated are loss of privacy, unauthorized access by others, loss of data, and interruptions in availability.
3. Determine the risk associated with the particular cloud type for a resource.
Cloud types include public, private (both external and internal), hybrid, and shared community types. With each type, you need to consider where data and functionality will be maintained.
4. Take into account the particular cloud service model that you will be using.
Different models such as IaaS, SaaS, and PaaS require their customers to be responsible for security at different levels of the service stack.
5. If you have selected a particular cloud service provider, you need to evaluate its system to understand how data is transferred, where it is stored, and how to move data both in and out of the cloud.
You may want to consider building a flowchart that shows the overall mechanism of the
system you are intending to use or are currently using.
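To make these steps concrete, the short sketch below records a couple of hypothetical resources in a simple risk register and totals their scores for the risks listed in step 2. The resources, scores, cloud types, and service models are invented for illustration only.

# Illustrative risk register following the steps above. Scores use 1 = low,
# 3 = high; every entry here is a made-up example, not real data.
resources = [
    {"name": "customer database", "cloud": "public", "model": "IaaS",
     "risks": {"privacy": 3, "unauthorized_access": 3, "data_loss": 2, "availability": 2}},
    {"name": "marketing site", "cloud": "public", "model": "PaaS",
     "risks": {"privacy": 1, "unauthorized_access": 1, "data_loss": 1, "availability": 2}},
]

for r in resources:
    total = sum(r["risks"].values())
    print(f'{r["name"]} ({r["cloud"]}, {r["model"]}): total risk score {total}')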
One technique for maintaining security is to have “golden” system image references that you
can return to when needed. The ability to take a system image off-line and analyze the
image for vulnerabilities or compromise is invaluable. The compromised image is a primary forensics tool.
Many cloud providers offer a snapshot feature that can create a copy of the client’s entire environment; this includes not only machine images, but applications and data, network
interfaces, firewalls, and switch access. If you feel that a system has been compromised, you
can replace that image with a known good version and contain the problem.
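The sketch below shows one way such a golden-image workflow might look, using boto3 (the AWS SDK for Python) purely as an example; other providers expose comparable snapshot APIs. All instance and image identifiers are hypothetical, and configured AWS credentials are assumed.

# Sketch of a "golden image" workflow with boto3, offered only as an example.
# Identifiers are hypothetical; credentials and region configuration assumed.
import boto3

ec2 = boto3.client("ec2")

# Capture a golden image of a known-good instance.
golden = ec2.create_image(
    InstanceId="i-0123456789abcdef0",      # hypothetical instance ID
    Name="golden-web-tier-2024-01",
    Description="Known-good reference image",
)

# If a running system is suspected of compromise, keep the bad image for
# forensics and launch a replacement from the golden reference.
# (In practice you would wait for the new image to become available first.)
ec2.run_instances(
    ImageId=golden["ImageId"],
    InstanceType="t3.medium",
    MinCount=1,
    MaxCount=1,
)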
In order to discuss security in cloud computing concisely, you need to define the particular model of cloud computing that applies. This nomenclature provides a framework for understanding what security is already built into the system, who has responsibility for a particular security mechanism, and where the boundary between the service provider’s responsibility and the customer’s responsibility lies.
The most commonly used model, based on the U.S. National Institute of Standards and Technology (NIST), separates deployment models from service models and assigns those models a set of service attributes. Deployment models are cloud types: community, hybrid, private, and public clouds. Service models follow the SPI Model for three forms of service delivery: Software, Platform, and Infrastructure as a Service. In the NIST model, it was not required that a cloud use virtualization to pool resources, nor did that model require that a cloud support multi-tenancy. It is just these factors that make security such a complicated proposition in cloud computing.
The Cloud Security Alliance (CSA) has defined a cloud computing stack model, which shows how different functional units in a network stack relate to one another. This model can be used to separate the different service models from one another. The CSA is an industry working group that studies security issues in cloud computing and offers recommendations to its members. The work of the group is open and available.
One key difference between the NIST model and the CSA model is that the CSA considers multi-tenancy to be an essential element in cloud computing. Multi-tenancy adds a number of additional security concerns to cloud computing that need to be accounted for. In multi-tenancy, different customers must be isolated, their data segmented, and their service accounted for. To provide these features, the cloud service provider must provide a policy-based environment that is capable of supporting different levels and quality of service, usually using different pricing models. Multi-tenancy expresses itself in different ways in the different cloud deployment models and imposes security concerns in different places.
The CSA functional cloud computing hardware/software stack is the Cloud Reference Model. This model is reproduced in Figure 12.3. IaaS is the lowest-level service, with PaaS and SaaS the next two services above. As you move upward in the stack, each service model inherits the capabilities of the model beneath it, as well as all the inherent security concerns and risk factors. IaaS supplies the infrastructure; PaaS adds application development frameworks, transactions, and control structures; and SaaS is an operating environment with applications, management, and the user interface.
Moving up the stack, IaaS has the least integrated functionality and the lowest level of integrated security, while SaaS has the most.
The most important lesson from this discussion of architecture is that each different type of cloud service delivery model creates a security boundary at which the cloud service provider’s responsibilities end and the customer’s responsibilities begin. Any security mechanism below the security boundary must be built into the system, and any security mechanism above it must be maintained by the customer. As you move up the stack, it becomes more important to make sure that the type and level of security is part of your Service Level Agreement.
In the SaaS model, the vendor provides security as part of the Service Level Agreement, with the compliance, governance, and liability levels stipulated under the contract for the entire stack. For the PaaS model, the security boundary may be defined for the vendor to include the software framework and middleware layer. In the PaaS model, the customer would be responsible for the security of the application and UI at the top of the stack. The model with the least built-in security is IaaS, where everything that involves software of any kind is the customer’s problem.
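A compact way to keep track of this boundary is to record, for each service model, which layers the provider secures and which layers remain with the customer. The sketch below is only an approximation of the split described above; the layer names are simplified, and the exact boundary for any real deployment should come from the provider's SLA.

# Approximate record of the responsibility split described above. Layer names
# are simplified placeholders; confirm the real boundary against your SLA.
responsibility = {
    "IaaS": {"provider": ["facilities", "hardware", "virtualization"],
             "customer": ["OS", "middleware", "runtime", "application", "data"]},
    "PaaS": {"provider": ["facilities", "hardware", "virtualization", "OS",
                          "middleware", "runtime"],
             "customer": ["application", "data"]},
    "SaaS": {"provider": ["facilities", "hardware", "virtualization", "OS",
                          "middleware", "runtime", "application"],
             "customer": ["data", "user access"]},
}

for model, split in responsibility.items():
    print(f"{model}: customer secures {', '.join(split['customer'])}")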
In thinking about the Cloud Security Reference Model in relation to security needs, a fundamental distinction can be made between how services are provided and where those services are located. A private cloud may be internal or external to an organization, and although a public cloud is most often external, there is no requirement that this be the case.
This makes the location of trust boundaries in cloud computing rather ill defined, dynamic, and subject to change depending on a number of factors. Establishing trust boundaries and creating a new perimeter defense that is consistent with your cloud computing network is an important consideration. The key to understanding where to place security mechanisms is to understand where in the cloud resources are physically deployed and consumed, what those resources are, who manages the resources, and what mechanisms are used to control them.
Table 12.1 lists some of the different service models and the parties responsible for security in each case.