14 Distributed DBMSs
14 Distributed DBMSs
14 Distributed DBMSs
12.1 Objectives
12.2 Introduction
The DDB technology emerged as a merger of two technologies: (1) database technology, and (2)
network and data communication technology. The latter has made tremendous strides in terms of
wired and wireless technologies-from satellite and cellular communications and Metropolitan Area
Networks (MANs) to the standardization of protocols like Ethernet, TCP/IP, and the
Asynchronous Transfer Mode (ATM) as well as the explosion of the Internet.
Distributed databases bring the advantages of distributed computing to the database management
domain. A distributed computing system consists of a number of processing elements, not
necessarily homogeneous, that are interconnected by a computer network, and that cooperate in
performing certain assigned tasks.
As a general goal, distributed computing systems partition a big, unmanageable problem into
smaller pieces and solve it efficiently in a coordinated manner. The economic viability of this
approach stems from two reasons:
• more computer power is harnessed to solve a complex task, and
• each autonomous processing element can be managed independently and develop its own
applications.
A distributed database (DDB) is a collection of multiple logically interrelated databases
distributed over a computer network, and a distributed database management system (DDBMS)
as a software system that manages a distributed database. It consists of a single logical database
that is split into a number of fragments. Each fragment is stored on one or more computers
(replicas) under the control of a separate DBMS, with the computers connected by a
communications network. Each site is capable of independently processing user requests that
require access to local data (that is, each site has some degree of local autonomy) and is also
capable of processing data stored on other computers in the network.
Users access the distributed database via applications. Applications are classified as those that do
not require data from other sites (local applications) and those that do require data from other
sites (global applications). We require a DDBMS to have at least one global application.
Advantages
Matches distributed organizational model
Many organizations are naturally distributed over several locations. For example, a bank
has many offices in different cities. It is natural for databases used in such an application
to be distributed over these locations. A bank may keep a database at each branch office
containing details such things as the staff that work at that location, the account information
of customers etc.
The staff at a branch office will make local inquiries of the database. The company
headquarters may wish to make global inquiries involving the access of data at all or a
number of branches.
Improved sharability and local autonomy
The geographical distribution of an organization can be reflected in the distribution of the
data; users at one site can access data stored at other sites. Data can be placed at the site
close to the users who normally use that data. In this way, users have local control of the
data, and they can consequently establish and enforce local policies regarding the use of
this data. A global database administrator (DBA) is responsible for the entire system.
Generally, part of this responsibility is assigned the local level, so that the local DBA can
manage the local DBMS.
Improved availability
In a centralized DBMS, a computer failure terminates the applications of the DBMS.
However, a failure at one site of a DDBMS, or a failure of a communication link making\
some sites inaccessible, does not make the entire system in opera bite. Distributed DBMSs
are designed to continue to function despite such failures. If a single node fails, the system
may be able to reroute the failed node's requests to another site.
Improved reliability
As data may be replicated so that it exists at more than one site, the failure of a node or a
communication link does not necessarily make the data inaccessible.
Improved performance
As the data is located near the site of 'greatest demand', and given the inherent parallelism
of distributed DBMSs, speed of database access may be better than that achievable from a
remote centralized database. Furthermore, since each site handles only a part of the entire
database, there may not be the same contention for CPU and I/O services as characterized
by a centralized DBMS.
Economics
It is now generally accepted that it costs much less to create a system of smaller computers
with the equivalent power of a single large computer. This makes it more cost effective for
corporate divisions and departments to obtain separate computers. It is also much more
cost-effective to add workstations· to a network than to update a mainframe system.
The second potential cost saving occurs where database are geographically remote and the
applications require access to distributed data. In such cases, owing to the relative expense
of data being transmitted across the network as opposed to the cost of local access, it may
be much more economical to partition the application and perform the processing locally
at each site.
Modular growth
In a distributed environment, it is much easier to handle expansion. New sites can be added
to the network without affecting the operations of other sites. This flexibility allows an
organization to expand relatively easily. Adding processing and storage power to the
network can usually handle the increase in database size. In a centralized DBMS, growth
may entail changes to both hardware (the procurement of a more powerful system) and
software (the procurement of a more powerful or more configurable DBMS).
Disadvantages
Complexity
A distributed DBMS that hides the distributed nature from the user and provides an acceptable
level of performance, reliability, availability is inherently more complex then a centralized DBMS.
The fact that data can be replicated also adds an extra level of complexity to the distributed DBMS.
If the software does not handle data replication adequately, there wi1l be degradation in
availability, reliability and performance compared with the centralized system, and the advantages
we cites above will become disadvantages.
Cost
Increased complexity means that we can expect the procurement and maintenance costs for a
DDBMS to be higher than those for a centralized DBMS. Furthermore, a distributed
DBMS requires additional hardware to establish a network between sites. There are ongoing
communication costs incurred with the use of this network. There are also additional labor costs
to manage and maintain the local DBMSs and the underlying network.
Security
In a centralized system, access to the data can be easily controlled. However, in a distributed
DBMS not only does access to replicated data have to be controlled in multiple locations but also
the network itself has to be made secure. In the past, networks were regarded as an insecure
communication medium. Although this is still partially true, significant developments have been
made to make networks more secure.
Integrity control more difficult
Database integrity refers to the validity and consistency of stored data. Integrity is usually
expressed in terms of constraints, which are consistency rules that the database is not permitted to
violate. Enforcing integrity constraints generally requires access to a large amount of data that
defines the constraints. In a distributed DBMS, the communication and processing costs that are
required to enforce integrity constraints are high as compared to centralized system.
Lack of Standards
Although distributed DBMSs depend on effective communication, we are only now starting to see
the appearance of standard communication and data access protocols. This lack of standards has
significantly limited the potential of distributed DBMSs. There are also no tools or methodologies
to help users convert a centralized DBMS into a distributed DBMS
Lack of experience
General-purpose distributed DBMSs have not been widely accepted, although many of the
protocols and problems are well understood. Consequently, we do not yet have the same level of
experience in industry as we have with centralized DBMSs. For a prospective adopter of this
technology, this may be a significant deterrent.
A DBMS running across multiple processors and disks that is designed to execute operations in
parallel, whenever possible, in order to improve performance.
Parallel DBMSs are again based on the premise that single-processor systems can no longer meet
the growing requirements for cost-effective scalability, reliability, and performance. A powerful
and financially attractive alternative to a single-processor driven DBMS is a parallel DBMS driven
by multiple processors. Parallel DBMSs link multiple, smaller machines to achieve the same
throughput as a single, larger machine, often with greater scalability and reliability than single-
processor DBMSs.
To provide multiple processors with common access to a single database, a parallel DBMS must
provide for shared resource management. Which resources are shared and how those shared
resources are implemented directly affects the performance and scalability of the system.
The three main architectures for parallel DBMSs are:
• shared memory;
• shared disk;
• shared nothing.
Shared memory is a tightly coupled architecture in which multiple processors within a single
system share system memory.
Shared disk is a loosely coupled architecture optimized for applications that are inherently
centralized and require high availability and performance. Each processor can access all disks
directly, but each has its own private memory.
The shared disk architecture eliminates the shared memory performance bottleneck.
Shared nothing, often known as massively parallel processing (MPP), is a multiple-processor
architecture in which each processor is part of a complete system, with its own memory and disk
storage. The database is partitioned among all the disks on each system associated with the
database, and data is transparently available to users on all systems. This architecture is more
scalable than shared memory and can easily support a large number of processors. However,
performance is optimal only when requested data is stored locally.
Although the shared nothing definition sometimes includes distributed DBMSs, the distribution of
data in a parallel DBMS is based solely on performance considerations.
In addition, the nodes of a DDBMS are typically geographically distributed, separately
administered, and have a slower interconnection network, whereas the nodes of a parallel DBMS
are typically within the same computer or within the same site.
Techniques that are used to break up the database into logical units, called fragments, which may
be assigned for storage at the various sites. We also discuss the use of data replication, which
permits certain data to be stored in more than one site, and the process of allocating fragments-or
replicas of fragments-for storage at the various sites. These techniques are used during the process
of distributed database design. The information concerning data fragmentation, allocation, and
replication is stored in a global directory that is accessed by the DDBS applications as needed.
12.6.1 Data Fragmentation
In a DDB, decisions must be made regarding which site should be used to store which portions of
the database. For now, we will assume that there is no replication; that is, each relation-or portion
of a relation-is to be stored at only one site. We discuss replication and its effects later in this
section. We also use the terminology of relational databases similar concepts apply to other data
models. We assume that we are starting with a relational database schema and must decide on how
to distribute the relations over the various sites.
Before we decide on how to distribute the data, we must determine the logical units of the database
that are to be distributed. The simplest logical units are the relations themselves; that is, each whole
relation is to be stored at a particular site. In our example, we must decide on a site to store each
of the relations EMPLOYEE, DEPARTMENT, PROJECT, WORKS_ON, and DEPENDENT. In
many cases, however, a relation can be divided into smaller logical units for distribution. We
assume there are three computer sites-one for each department in the company," We may want to
store the database information relating to each department at the computer site for that department.
A technique called horizontal fragmentation can be used to partition each relation by department.
Horizontal Fragmentation.
A horizontal fragment of a relation is a subset of the tuples in that relation. The tuples that belong
to the horizontal fragment are specified by a condition on one or more attributes of the relation.
Often, only a single attribute is involved. For example, we may define three horizontal fragments
on the EMPLOYEE relation with the following conditions: (Empnum <= 3), Empnum >3),---each
fragment contains the EMPLOYEE tuples.
Replication is useful in improving the availability of data. The most extreme case is replication of
the whole database at every site in the distributed system, thus creating a fully replicated
distributed database. This can improve availability remarkably because the system can continue to
operate as long as at least one site is up. It also improves performance of retrieval for global
queries, because the result of such a query can be obtained locally from anyone site; hence, a
retrieval query can be processed at the local site where it is submitted, if that site includes a server
module.
The disadvantage of full replication is that it can slow down update operations drastically, since a
single logical update must be performed on every copy of the database to keep the copies
consistent. This is especially true if many copies of the database exist. Full replication makes the
concurrency control and recovery techniques more expensive than they would be if there were no
replication,
Four alternative strategies regarding placement of data:
Centralized
Partitioned (or Fragmented)
Complete Replication
Selective Replication
Centralized
Consists of single database and DBMS stored at one site with users distributed across the network.
(not a true distribution)!
Partitioned
Database partitioned into disjoint fragments, each fragment assigned to one site.
Complete Replication
Consists of maintaining complete copy of database at each site.
Selective Replication
Combination of partitioning, replication, and centralization
REPLICATION
Functionality of DDBMS is attractive but protocols & algorithms are complex and can cause
problems that may outweigh advantages.
Alternative and more simplify approach to data distribution is DB Replication
Replication server:
Every major database vendor has replication solution.
Database Replication:
the process of copying and maintaining database objects, such as relations, in multiple databases
that make up a distributed database system.
Benefits of Database Replication
Availability
Reliability
Performance
Load Reduction
Disconnected Computing
Support Multiple Users
Support Advanced Applications
Date’s Twelve (12) Rules for a DDBMS
We list Date’s twelve rules (or objectives) for DDBMSs (Date, 1987b). The basis for these rules
is that a distributed DBMS should feel like a non-distributed DBMS to the user. These rules are
akin to Codd’s twelve rules for relational systems.
1. Local autonomy
The sites in a distributed system should be autonomous. In this context, autonomy means that:
local data is locally owned and managed; local operations remain purely local; all operations at a
given site are controlled by that site.
2. No reliance on a central site
There should be no one site without which the system cannot operate. This implies that there
should be no central servers for services such as transaction management, deadlock detection,
query optimization, and management of the global system catalog.
3. Continuous operation
Ideally, there should never be a need for a planned system shutdown, for operations such as: n
adding or removing a site from the system; and the dynamic creation and deletion of fragments at
one or more sites.
4. Location independence
Location independence is equivalent to location transparency. The user should be able to access
the database from any site. Furthermore, the user should be able to access all data as if it were
stored at the user’s site, no matter where it is physically stored.
5. Fragmentation independence
The user should be able to access the data, no matter how it is fragmented.
6. Replication independence
The user should be unaware that data has been replicated. Thus, the user should not be able to
access a particular copy of a data item directly, nor should the user have to specifically update all
copies of a data item.
7. Distributed query processing
The system should be capable of processing queries that reference data at more than one site.
8. Distributed transaction processing
The system should support the transaction as the unit of recovery. The system should ensure that
both global and local transactions conform to the ACID rules for transactions, namely: atomicity,
consistency, isolation, and durability.
9. Hardware independence
It should be possible to run the DDBMS on a variety of hardware platforms.
10. Operating system independence
As a corollary to the previous rule, it should be possible to run the DDBMS on a variety of
operating systems.
11. Network independence
Again, it should be possible to run the DDBMS on a variety of disparate communication networks.
12. Database independence
It should be possible to have a DDBMS made up of different local DBMSs, perhaps supporting
different underlying data models. In other words, the system should support heterogeneity. The
last four rules are ideals. As the rules are so general, and as there is a lack of standards in computer
and network architectures, we can expect only partial compliance from vendors in the foreseeable
future.
The term distributed database management system can describe various systems that differ from
one another in many respects. The main thing that all such systems have in common is the fact that
data and software are distributed over multiple sites connected by some form of communication
network. In this section we discuss a number of types of DDBMSs and the criteria and factors that
make some of these systems different.
Multidatabase systems
A distributed DBMS in which each site maintains complete autonomy.
In recent years, there has been considerable interest in MDBSs, which attempt to logically
integrate a number of independent DDBMSs while allowing the local DBMSs to maintain
complete control of their operations. One consequence of complete autonomy is that there can be
no software modifications to the local DBMSs. Thus, an MDBS requires an additional software
layer on top of the local systems to provide the necessary functionality.
In simple terms, an MDBS is a DBMS that resides transparently on top of existing database and
file systems, and presents a single database to its users. An MDBS maintains only the global
schema against which users issue queries and updates and the local DBMSs themselves maintain
all user data. The global schema is constructed by integrating the schemas of the local databases.