Advanced Database System

PART ONE

FUNDAMENTALS OF DATABASE SYSTEM


Chapter ONE
1. Introduction
1.1. Data, Information, Information System
Data: The term 'DATA' can be defined as the value of an attribute of an entity. It is a collection of raw facts,
figures and concepts. Data cannot be used for decision making unless it is processed or organized into a
useful form.
Information: Processed data that is meaningful to the user and useful for decision-making.
An information system (IS) is a set of interrelated elements or components that collect (input), manipulate
(process), and disseminate (output) data and information and provide a feedback mechanism to meet an
objective.
In information systems:
 Input is the activity of gathering and capturing raw data
 Processing involves converting or transforming data into useful outputs.
 Output involves producing useful information, usually in the form of documents and reports.
 Feedback is output that is used to make changes to input or processing activities.
1.2. Information System and Organization, Database System
An information system does not exist without organization; that is, organization of data is necessary when data is
voluminous. An information system is a support system for organizational activity aimed at achieving a certain goal.
A database system is basically a computerized record keeping system. Users of the database can perform a
variety of operations (a brief SQL illustration follows), such as:
 Adding new data to an empty file
 Adding new data to an existing file
 Retrieving data from an existing file
 Modifying data in an existing file
 Deleting data from an existing file
 Searching for target information
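In SQL terms, these operations map onto a handful of statements. The sketch below is purely illustrative; the student table and its columns are hypothetical, not taken from this text.

```sql
-- Hypothetical table used to illustrate the basic operations.
CREATE TABLE student (
    student_id INT PRIMARY KEY,
    name       VARCHAR(30),
    dept       VARCHAR(20)
);

INSERT INTO student VALUES (1, 'Abebe', 'CS');         -- adding new data
SELECT * FROM student WHERE dept = 'CS';               -- retrieving/searching
UPDATE student SET dept = 'IS' WHERE student_id = 1;   -- modifying data
DELETE FROM student WHERE student_id = 1;              -- deleting data
```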

1.3. Data Management Approaches

Data management has passed through different levels of development along with developments in technology
and services. These levels are best described as three types of development or approaches. Even though each
new data handling approach brings advantages and overcomes problems of its predecessor, all methods or
approaches of data management remain in use to some extent. The three major approaches are discussed as
follows:
1. Manual Approach
In the manual data handling approach, data storage and retrieval follow the primitive and traditional way of
data/information handling, where cards and paper are used for the purpose: data is typed on paper and put in
a file cabinet, and storage and retrieval are performed using human labour. This approach works well if the
number of items to be stored is small.

Limitations of the Manual approach

o Prone to error
o Data loss: due to damaged papers or unable to locate it.
o Redundancy: multiple copies of the same data within the organization.
o Inconsistency: Modifications are not reflected on all multiple copies
o Difficult to update, retrieve, integrate
o You have the data but it is difficult to compile the information
o Limited to small size information
o Cross referencing is difficult
An alternative approach to data handling is a computerized way of dealing with the information. The
computerized approach can be either decentralized or centralized, based on where the data resides in the
system.
2. File Based Approach
After the introduction of the computer for data processing to the business community, the need to use the
device for data storage and processing increased. File based data handling approaches were an early attempt to
computerize the manual filing system. There were, and still are, several computer applications that use file
based processing for data handling. Such a system is a collection of application programs that perform
services for the end users, where every application program defines and manages its own data. These systems
have a number of programs for each of the different applications in the organization. This approach is the
decentralized computerized data handling method.
Limitations of the File Based approach
As business applications became more complex, demanding more flexible and reliable data handling
methods, the shortcomings of the file-based system became evident. These shortcomings include, but are not
limited to:
Separation/Isolation of data
When data is isolated in separate files, it is difficult to access data that should be available, because there is
no concept of a relationship between files. Therefore, we need to create a temporary file for the participating
files.
Duplication of data (Redundancy)
This concerns the storage of similar information in multiple files.
The following are some of the disadvantages of redundancy:
 It costs time and money to enter the data
 It takes up additional storage space (memory space)
 Inconsistency: this is loss of data integrity; for instance, when a modification in the child table is not
reflected in the parent table.
Data Dependence
Changes to an existing structure are difficult to make. Example: a change in the size of Student Name (from 20
characters to 30 characters) requires a new program to convert the student file to a new format. The new program
opens the original student file, opens a temporary file, reads records from the original student file and writes them
to the temporary file, deletes the original student file and finally renames the temporary file as the student file.
This is time consuming and prone to error.
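Under the database approach, by contrast, the same change is a single schema statement and no conversion program is required. A minimal sketch, assuming a hypothetical student table (the exact ALTER syntax varies by DBMS; the form below is SQL Server style, while MySQL and Oracle use MODIFY):

```sql
-- Widen the name column from 20 to 30 characters in place;
-- application programs that read the column are unaffected.
ALTER TABLE student ALTER COLUMN name VARCHAR(30);
```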

Incompatible file formats
The structure of a file is dependent on the application programs. Incompatibility of files makes them difficult to
process jointly. Example: consider two files within the same enterprise but in different departments or
branches. If the first file is constructed using COBOL and the second file is written using C++, then there
will be a compatibility problem when processing them together.

3. Database Approach
What is a Database?

A database is a collection of related data in an organized way. Most of the time, organization is in tabular
form. E.g. book database

The organization of the database becomes necessary when the data is voluminous; otherwise, managing the data
will be very difficult. An organization must have accurate and reliable data for effective decision-
making. To this end, the organization maintains records on its various facets, maintaining
relationships among them. Such related data are called a database. Generally, a database is an organized
collection of related information. The organized information or database serves as a base from which desired
information can be retrieved or decisions made by further reorganizing or processing the data. People use
several databases in their day-to-day life: a dictionary, a telephone directory, a library catalog, etc. are examples
of databases where the entries are arranged according to alphabetical or classified order.
E.g. A manufacturing company with product data
A bank with account data
A hospital with patient data
A university with student data
A government with planning data

What is a database system?
It is a computerized record keeping system which stores related data in an organized way. The overall
purpose of a database system is to store information and to allow users to add, delete, retrieve, search, query
and update that information upon request. The information concerned can be anything that is deemed to be of
significance to the individual or organization the system is intended to serve, that is, anything needed to assist
in the general process of running the business of that individual or organization. A database system is an
integrated collection of related files, along with details of the interpretation of the data contained therein. In
short, a database system is a computer-based record keeping system, i.e. a system whose overall purpose is
to record and maintain information/data.
Thus in database approach:

 Database is just a computerized record keeping system or a kind of electronic filing cabinet.
 Database is a repository for collection of computerized data files.
 Database is a shared collection of logically related data designed to meet the information needs of an
organization. Since it is a shared corporate resource, the database is integrated with a minimal amount of
duplication, or none at all.
 Database is a collection of logically related data where these logically related data comprises: entities,
attributes, relationships, and business rules of an organization's information.
 In addition to containing data required by an organization, a database also contains a description of the data,
called "metadata", the "data dictionary", the "systems catalogue", or "data about data".
 Since a database contains information about the data (metadata), it is called a self-descriptive collection
of integrated records.
 The purpose of a database is to store information and to allow users to retrieve and update that
information on demand.
 Database is designed once and used simultaneously by many users.
 Unlike the traditional file-based approach, the database approach provides program-data independence:
the separation of the data definition from the application. Thus the application is not affected by
changes made in the data structure and file organization.
 Each database application will perform some combination of creating the database and reading, updating
and deleting data.

The advantages of a database approach over the traditional and paper-based methods of record keeping will
include the following:

Compactness: no need for possibly voluminous paper files.

Speed: the machine can retrieve and change data faster than a human can. In particular, ad hoc, spur-of-the-
moment queries ("Do we have more red screws than blue ones?") can be answered quickly without any need
for time-consuming manual or visual searches.
Accuracy: timely, accurate and up-to-date information is available on demand at any time.
The foregoing benefits apply with even more force in a multi-user environment where the database is likely to
be much larger and much more complex than in the single user case. In a multi-user environment the database
system provides the enterprise with centralized control of its data. The centralized approach has the following
advantages:
Data can be shared: two or more users can access and use same data instead of storing data in redundant
manner for each user.
Redundancy can be reduced: in non-database or non-centralized systems, each application or department
keeps its own private files. The files may hold common data elements that exist as part of the enterprise's data.
This will lead to considerable redundancy in stored data, with a resultant waste in storage space. For example, a
personnel application and an education records application might both own a file that includes department
information for employees. Note that this is not to say we should eliminate all redundancies; sometimes
there are sound reasons for maintaining several copies of the same data.
Inconsistency can (to some extent) be avoided: if a number of files store similar data elements, then a change
made to a particular piece of data needs to be carried out everywhere that data is stored. Often this is not the
case: some of the copies are updated and others are left as they are, which results in inconsistent information
about the same phenomena.
Standards can be enforced: Standardizing data representation is particularly desirable as an aid to data
interchange or migration of data between systems. Likewise, data naming and documentation standards are
also very desirable as they facilitate data sharing and understandability.
Security restrictions can be applied: since the data is stored in one place/area, all accesses to the data can
be regulated by the system through defined rules built into the system. The system ensures that the only
means of access to the database is through proper channels. Different rules can be established for each type of
access (retrieve, insert, delete, etc.) to each item of information in the database.
Integrity can be maintained: the problem of integrity is the problem of ensuring the data in the database is
accurate. Inconsistency between two entries that represent the same "fact" is an example of lack of integrity.
It is more serious in a multi-user environment, where one user may enter bad data and other users may go on
working with that data as if it were correct.
Conflicting requirements can be balanced: knowing the overall requirements of the enterprise the Database
Administrator (DBA) can structure the system so as to provide an overall service that is best for the
enterprise. For example, a representation can be chosen for the data in storage that gives fast access for the
most important applications (possibly at the cost of poorer performance for certain other applications).
Transaction support can be provided: the basic demands of any transaction support system are implemented
in a full scale DBMS.
Improved decision support: the database will provide information useful for decision making.
Less labor: unlike the other data handling methods, data maintenance will not demand much resource
Centralized information control: since relevant data in the organization will be stored at one repository, it
can be controlled and managed at the central level
Limitations and risks of the Database Approach
o Introduction of new professional and specialized personnel.
o Complexity in designing and managing data
o The cost and risk during conversion from the old to the new system
o High cost to be incurred to develop and maintain the system
o Complex backup and recovery services from the users' perspective
o Reduced performance due to centralization and data independence
o High impact on the system when failure occurs to the central system.
Note: Database System (DBS) contains:

The Database + The DBMS + Application Programs (what users interact with)
1.4. Components of a Database System
A database system involves four major components, namely, data, hardware, software and users and
designers of database. A brief discussion will follow on each of these components.
Data: The actual data stored in the database system may be stored as a single database or distributed in many
distinct files and treated as one. Is the system a single-user or multi-user one? How are we going to achieve
the utmost possible performance concerning data storage and maintenance? What other benefits or
drawbacks do we expect as a result of the placement or structure of the database? These and similar issues
are concerned with the way the data is stored in the system.
Hardware: This portion of the system consists of secondary storage media (disks, tapes and optical media)
that are used to hold the stored data and associated device controllers (hard disk controller, etc.); and the
processor(s) and associated main memory that are used to support the execution of the database system
software.

Software: This is the Database Management System (DBMS), which is responsible for the overall
management of communications between the user and the database. It sits between the data and the
users, which means the data is entirely covered or shielded by the DBMS software. The
DBMS provides facilities for operating on the database. This is the most important software component in the
overall system, since it allows the user to interact with the data.

Users and Designers of Database: As people are one of the components in the DBS environment, there is a
group of roles played by different stakeholders in the design and operation of a database system.
Database Administrator (DBA)
 Responsible to oversee, control and manage the database resources (the database itself, the DBMS and
other related software)
 Authorizing access to the database
 Coordinating and monitoring the use of the database
 Responsible for determining and acquiring hardware and software resources
 Accountable for problems like poor security, poor performance of the system
 Involves in all steps of database development

We can have further classifications of this role in big organizations having huge amounts of data and user
requirements.
 Data Administrator (DA): is responsible for the management of data resources. This involves
database planning, and the development and maintenance of standards, policies and procedures at the
conceptual and logical design phases.
 Database Administrator (DBA): is a more technically oriented role, responsible for the
physical realization of the database and involved in physical design, implementation, security
and integrity control of the database.
The Database Administrator is the person responsible for all technical operations and details of the
database system; the DBA is the user that controls the enterprise's data resource. The functions of
the DBA include the following.
o Defining the conceptual schema: the DBA will directly participate in or assist the process of
identifying the content of the database, i.e., what information is to be held in the
database, and create the corresponding conceptual schema using the conceptual DDL.
o Defining the internal schema: The DBA must also decide how the data is to be
represented in the stored database and then create the corresponding storage structure
definition (the internal schema) using the internal DDL (including associated mapping
between the internal and conceptual schema).
o Liaising with users: by communicating with users, the DBA ensures that the data
they require is available, and writes (or helps users write) the necessary external
schemas using the applicable external DDL. Other functions include consulting on
application design, providing technical education, assisting with problem determination
and resolution, and similar system related professional services.
o Defining security and integrity rules: Since security and integrity rules are part of the
conceptual schema, the conceptual DDL should include facilities for specifying such
rules.
o Defining backup and recovery procedures: In the event of damage to any portion of
a database, caused by human error or failure in the hardware or operating system, it is
essential to be able to repair the data concerned with the minimum of delay and with as
little effect as possible on the rest of the system. The DBA should define and
implement appropriate backup and recovery scheme.
o Monitoring performance and responding to changing requirements: periodic
performance analysis should be done by the DBA and, based on the results obtained,
the DBA should propose improved systems and/or modify the existing data
definitions.
1. Database Designer (DBD)
 Identifies the data to be stored and chooses the appropriate structures to represent and store the data.
 Should understand the user requirements and choose how the user views the database.
 Involved in the design phase, before the implementation of the database system.

We have two kinds of database designers: one involved in the logical and conceptual design and
another involved in the physical design.
 Logical and Conceptual DBD
 Identifies data (entity, attributes and relationship) relevant to the organization
 Identifies constraints on each data
 Understand data and business rules in the organization
 Sees the database independent of any data model at conceptual level and consider one
specific data model at logical design phase.
 Physical DBD
 Takes the logical design specification as input and decides how it should be physically realized.
 Maps the logical data model onto the specified DBMS with respect to tables and integrity
constraints. (DBMS dependent designing)
 Selects specific storage structures and access paths to the database
 Designs the security measures required on the database
2. Application Programmer and Systems Analyst
 System analyst determines the user requirement and how the user wants to view the database.
 The application programmer implements these specifications as programs; code, test, debug, document
and maintain the application program.
 Determines the interface on how to retrieve, insert, update and delete data in the database.
 The application could use any high level programming language according to the availability, the
facility and the required service.
Application programmers are responsible for writing application programs that use the database,
using some programming language such as COBOL, Pascal, or a programming language built into
the DBMS.
3. End-users: These are the people engaged in performing different types of operations on the
database system. Users are workers whose job requires accessing the database frequently for various
purposes. There are different groups of users in this category.
 Naïve Users:
 Sizable proportion of users
 Unaware of the DBMS
 Only access the database based on their access level and demand
 Use standard and pre-specified types of queries.
 Sophisticated Users
 Are users familiar with the structure of the Database and facilities of the DBMS.
 Have complex requirements
 Have higher level queries
 Are most of the time engineers, scientists, business analysts, etc
 Casual Users
 Users who access the database occasionally.
 Need different information from the database each time.
 Use sophisticated database queries to satisfy their needs.
 Are most of the time middle to high level managers.

Generally, End users are those that interact with the system from online workstations or terminals that use an
application program developed by application programmers or those that query the system through an
interface provided by the DBMS.
These users can be again classified as “Actors on the Scene” and “Workers Behind the Scene”.
Actors On the Scene:
 Data Administrator
 Database Administrator
 Database Designer
 End Users
Workers Behind the Scene
 DBMS designers and implementers: who design and implement different DBMS software.
 Tool Developers are experts who develop software packages that facilitate database system
design and use. Prototype, simulation, and code generator developers are examples;
independent software vendors can also be categorized in this group.
 Operators and Maintenance Personnel are system administrators who are responsible for actually
running and maintaining the hardware and software of the database system and the information
technology facilities.
1.5. Database Management System (DBMS)
'Database management technology' and the resulting software are known as a 'database management
system' (DBMS), which manages a computer-stored database or collection of data.
A database management system (DBMS) is a software system that allows access to data contained in a
database. The objective of the DBMS is to provide a convenient and effective method of defining, storing and
retrieving the information contained in the database. The DBMS interfaces with the application programs, so
that multiple applications and users can use the data contained in the database. In addition, the DBMS exerts
centralized control of the database, prevents fraudulent or unauthorized users from accessing the data, and
ensures the privacy of the data.
A Database Management System (DBMS) is the tool for creating and managing large amounts of data
efficiently and allowing it to persist for long periods of time. Hence a DBMS is general-purpose software
that facilitates the processes of defining, constructing, manipulating, and sharing databases.
 Defining: involves specifying data types, structures and constraints.
 Constructing: is the process of storing the data on a storage medium.
 Manipulating: is retrieving and updating data from and into the storage.
 Sharing: allows multiple users to access the data.
A DBMS is software that enables users to define, create, maintain and control access to the database.
Example: Ms Access, FoxPro, SQL Server, MySQL, Oracle.
The phrase “Database System” is used to colloquially refer to database and database management system
(DBMS).
 Database Management System (DBMS) is a Software package used for providing EFFICIENT,
CONVENIENT and SAFE MULTI-USER (many people/programs accessing same database, or even
same data, simultaneously) storage of and access to MASSIVE amounts of PERSISTENT (data outlives
programs that operate on it) data.
 A DBMS also provides a systematic method for creating, updating, storing, retrieving data in a database.
 DBMS also provides the service of controlling data access, enforcing data integrity, managing
concurrency control, and recovery. Having this in mind, a full scale DBMS should at least have the
following services to provide to the user.
1. Data storage, retrieval and update in the database
2. A user accessible catalogue
3. Transaction support service: ALL or NONE execution of transactions, which minimizes data inconsistency (see the sketch after this list).
4. Concurrency Control Services: access and update on the database by different users
simultaneously should be implemented correctly.
5. Recovery Services: a mechanism for recovering the database after a failure must be available.
6. Authorization Services (Security): must support the implementation of access and authorization
service to database administrator and users.
7. Support for Data Communication: should provide the facility to integrate with data transfer
software or data communication managers.
8. Integrity Services: rules about data and the change that took place on the data, correctness and
consistency of stored data, and quality of data based on business constraints.
9. Services to promote data independence between the data and the application
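As a sketch of service 3, the classic all-or-none example is a transfer between two accounts: either both updates persist or neither does. The account table, its columns and the account numbers are hypothetical.

```sql
BEGIN TRANSACTION;

UPDATE account SET balance = balance - 100 WHERE acct_no = 'A-101';
UPDATE account SET balance = balance + 100 WHERE acct_no = 'A-102';

COMMIT;   -- on any error, issue ROLLBACK instead and neither update survives
```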
1.5.1. Components of DBMS Environment
A DBMS is a software package used to design, manage, and maintain databases. Each DBMS should have
facilities to define the database, manipulate the content of the database and control the database. These
facilities help the designer, the user and the database administrator to discharge their responsibilities
in designing, using and managing the database. It provides the following facilities (a combined SQL sketch
follows the list):
 Data Definition Language (DDL):
o Language used to define each data element required by the organization.
o Commands for setting up schema or the intension of database
o These commands are used to setup a database, create, delete and alter table with the facility of
handling constraints
o Allows the DBA or user to describe and name the entities, attributes and relationships required for the
application.
o Specification notation for defining the database schema
 Data Manipulation Language (DML):
o The core commands used by end-users and programmers to store, delete, and update the data in the
database, e.g. SQL
o Provides basic data manipulation operations on data held in the database.
o Language for manipulating the data organized by the appropriate data model
 Data Query Language (DQL):
o Language for accessing or retrieving the data organized by the appropriate data model
o Since the required data or Query by the user will be extracted using this type of language, it is also
called "Query Language"
o Procedural DQL: user specifies what data is required and how to get the data.
o Non-Procedural DQL: user specifies what data is required but not how it is to be retrieved
 Data Dictionary (DD):
o Due to the fact that a database is a self describing system, this tool, Data Dictionary, is used to store
and organize information about the data stored in the database.
 Data Control Language (DCL):
o Database is a shared resource that demands control of data access and usage. The database
administrator should have the facility to control the overall operation of the system.
o Data Control Languages are commands that will help the Database Administrator to control the
database.
o The commands include grant or revoke privileges to access the database or particular object within the
database and to store or remove database transactions
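The following sketch gathers one statement from each facility. It is illustrative only: the employee table, the clerk user and all values are hypothetical, not taken from this text.

```sql
-- DDL: define the schema (intension) of a table
CREATE TABLE employee (
    emp_id   INT PRIMARY KEY,
    emp_name VARCHAR(30) NOT NULL,
    salary   DECIMAL(8,2)
);

-- DML: store and update data held in the database
INSERT INTO employee VALUES (1, 'Sara', 5000.00);
UPDATE employee SET salary = 5500.00 WHERE emp_id = 1;

-- DQL (non-procedural): state what data is required, not how to get it
SELECT emp_name, salary FROM employee WHERE salary > 5000;

-- DCL: grant and revoke privileges on a database object
GRANT SELECT ON employee TO clerk;
REVOKE SELECT ON employee FROM clerk;
```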
1.6. Database Development Life Cycle
As it is one component in most information system development tasks, there are several steps in developing a
database system. Here more emphasis is given to the design phases of the database system development life
cycle. The major steps in database system development are;
1. Planning: that is identifying information gap in an organization and propose a database solution to
solve the problem.
2. Analysis: that concentrates more on fact finding about the problem or the opportunity. Feasibility
analysis, requirement determination and structuring, and selection of best design method are also
performed at this phase.
3. Design: in database system development more emphasis is given to this phase. The phase is further
divided into three sub-phases.
A. Conceptual Design: concise description of the data, data type, relationship between data and
constraints on the data.
 There is no implementation or physical detail consideration.
 Used to elicit and structure all information requirements.
B. Logical Design: maps the conceptual design onto a selected specific data model that will
implement the data structure.
 It is independent of any particular DBMS and involves no other physical considerations.
C. Physical Design: physical implementation of the upper level design of the database with
respect to internal storage and file structure of the database for the selected DBMS.
 To develop all technology and organizational specification.
4. Implementation: the testing and deployment of the designed database for use.
5. Operation and Support: administering and maintaining the operation of the database system and
providing support to users.
Basic Concepts

 Database Design: The activity of specifying the schema of a database in a given data model
 Database Schema: The structure of a database that:
o Captures data types, relationships and constraints in data
o Is independent of any application program
o Changes infrequently
 Data Model:
o A set of primitives for defining the structure of a database.
o A set of operations for specifying retrieval and updates on a database
o Examples: Relational, Hierarchical, Networked, Object-Oriented
 Database Instance or State: The actual data contained in a database at a given time.

1.7. Database Systems Architecture


There may be several types of database system architectures. However, the following architecture
(ANSI/SPARC) is applicable to most modern database systems; it comprises three levels: the External level,
the Conceptual level and the Internal level.
 All users should be able to access the same data. This is important since the database has a
shared data feature, where all the data is stored in one location and all users have their own
customized way of interacting with the data.
 A user's view is unaffected or immune to changes made in other views. Since the requirement of
one user is independent of the other, a change made in one user’s view should not affect other
users.
 Users should not need to know physical database storage details. As there are naïve users of the
system, hardware level or physical details should be a black-box for such users.
 DBA should be able to change database storage structures without affecting the users' views. A
change in file organization, access method should not affect the structure of the data which in turn
will have no effect on the users.
 Internal structure of database should be unaffected by changes to physical aspects of storage.
 DBA should be able to change conceptual structure of database without affecting all users. In any
database system, the DBA will have the privilege to change the structure of the database, like
adding tables, adding and deleting an attribute, changing the specification of the objects in the
database.
All of the above and much other functionality are possible due to the three-level ANSI-SPARC database system
architecture.

Table 1 Three-level ANSI-SPARC Architecture of a Database System


ANSI-SPARC Architecture and Database Design Phases
Table 2 ANSI-SPARC Architecture and Database Design Phases
The database system architecture consists of three levels: the External level, the Conceptual level and the
Internal level.
External Level:
The external level is the one closest to the users, i.e., it is the level concerned with the way the data is viewed
by individual users. An external view is the content of the database as seen by some particular user (i.e., to
that user, the database is the view he or she is working with or accessing).
Each external view is defined by means of an external schema, which consists basically of definitions of
each of the various external record types in that external view. The external schema is written using the
external DDL portion of the user's data sublanguage.
The external level is thus the users' view of the database: it describes the part of the database that is relevant
to a particular user. Different users have their own customized views of the database, independent of other users.
Conceptual Level:
o The conceptual level is found in between the other two. It is a representation of the entire information
content of the database including the relations with one another and security and integrity rules, etc.
o It is the view of the data as it really is or by its entirety rather than as users are forced to see it by the
constraints of (for example) the particular language or hardware they might be using.
o The conceptual view is defined by means of the conceptual schema, which is written using another DDL,
the conceptual DDL of the data sublanguage in use. If data independence is to be achieved, then the
conceptual DDL definitions must not involve any considerations of storage structure or access technique.
Thus there must be no reference in the conceptual schema to stored field representations, stored record
sequence, indexing, hash addressing, pointers or any other storage and access details.
o The conceptual schema also includes a great many additional features, such as the security and integrity rules.
The conceptual level is thus the community view of the database: it describes what data is
stored in the database and the relationships among the data.
Internal Level:
The internal level is the one closest to the physical storage, i.e., it is concerned with the way the data is
physically stored.
o It is a low-level representation of the entire database.
The internal view is described by means of the internal schema, which not only defines the various stored
record types but also specifies what indexes exist, how stored fields are represented, what physical sequence
the stored records are in, and so on. The internal schema is written using yet another DDL, the internal DDL.
There will be many distinct external views, each consisting of a more or less abstract representation of some
portion of the total database, and there will be precisely one conceptual view, consisting of a similarly
abstract representation of the database in its entirety. Note that most users will not be interested in the total
database, but only in some restricted portion of it. Likewise, there will be precisely one internal view,
representing the total database as physically stored. The following example will clarify the levels to some
extent.
At the conceptual level, the database contains information concerning an entity type called employee. Each
individual employee occurrence has an employee_number (six characters), a department_number (four
characters), and a salary (five decimal digits).
At the internal level, employees are represented by a stored record type called stored_emp, twenty bytes long.
Stored_emp contains four stored fields: a six byte prefix (presumably containing control information such as
flags or pointers), and three data fields corresponding to the three properties of employees. In addition,
stored_emp records are indexed on the empno field by an index called empx, whose definition is not shown.
The Pascal user has an external view of the database in which each employee is represented by a Pascal record
containing two fields (department numbers are of no interest to this user and have therefore been omitted from
the view). The record type is defined according to the syntax and declaration rules of Pascal.
Similarly, the COBOL user has an external view in which each employee is represented by a COBOL record
containing two fields (this time salary is not needed by this user and omitted). The record type is defined
according to COBOL rules.
Notice that the corresponding objects can have different names at each level: the employee number is
referred to as empno in the Pascal view, as emp# in the internal view, and as employee_number in the
conceptual view. In general, to define the correspondence between the conceptual view and the internal view,
and between the conceptual view and the external view, we need an operation called mapping. The mappings are
important because, for example, fields can have different data types, field and record names can be changed, and
several conceptual fields can be combined into a single external field, and so on.
Internal level is the physical representation of the database on the computer. Describes how the data is stored
in the database.
The following example illustrates the difference between the three levels in the ANSI-
SPARC database system architecture, where:
 The first level is concerned with each group of users and their respective data requirements,
independent of the others.
 The second level describes the whole content of the database, where each piece of information is
represented once.
 The third level describes how the whole database is physically stored.

Table 3 Differences between Three Levels of ANSI-SPARC Architecture

1.8. DBS Schemas at the Three Levels

The three-schema architecture defines database schemas at three levels:
Internal schema: at the internal level to describe physical storage structures and access paths. Typically
uses a physical data model.
Conceptual schema: at the conceptual level to describe the structure and constraints for the whole
database for a community of users. Uses a conceptual or an implementation data model.
External schema: at the external level to describe the various user views. Usually uses the same data
model as the conceptual level.
Data Independence
Data independence is defined as the ability (immunity) of applications to tolerate changes in storage structure
and access technique without the application itself having to be modified.
In older systems, the way in which the data is organized in secondary storage, and the technique for accessing
it, are both dictated by the requirements of the application under consideration; moreover, knowledge
of that data organization and access technique is built into the application logic and code. In such
systems it is impossible to change the storage structures (how the data is physically stored) or access
technique (how it is accessed) without affecting the application. The applications mentioned are simply
programs designed for specific tasks, where every piece of knowledge about the data structure and the access
mechanism is defined within the application itself.
In database systems, it would be extremely undesirable to allow applications to be data dependent. Major
reasons are:
Different applications will need different views of the same data. Suppose we have employee data stored
with the data items (employee_id, employee_name, employee_salary, employee_address, etc.); one user may
need only the employee_name and employee_salary data items, whereas another user requires only the
employee_name and employee_address data items. For data dependent applications, such needs entail changing
the main application and creating two different copies of the same application, one for each user.
The Database Administrator (DBA) must have the freedom to change the storage structure or access
technique in response to changing requirements, without having to modify existing applications. For example,
new kinds of data might be added to the database, new standards might be adopted; new types of storage
devices might become available, and so on.
Logical Data Independence:
 Refers to immunity of external schemas to changes in conceptual schema.
 Conceptual schema changes e.g. addition or removal of entities should not require changes to external
schema or rewrites of application programs.
 The capacity to change the conceptual schema without having to change the external schemas and
their application programs (illustrated in the sketch below).
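A minimal SQL sketch of logical data independence, assuming a hypothetical employee table: applications read through an external view, so a later change to the conceptual schema does not touch them.

```sql
-- External schema: the application queries this view, not the base table.
CREATE VIEW emp_public AS
SELECT emp_id, emp_name FROM employee;

-- Conceptual schema changes: a new attribute is added ...
ALTER TABLE employee ADD birth_date DATE;

-- ... yet the application's query is unchanged.
SELECT emp_name FROM emp_public;
```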
Physical Data Independence
 The ability to modify the physical schema without changing the logical schema.
 Applications depend on the logical schema.
 In general, the interfaces between the various levels and components should be well defined so that
changes in some parts do not seriously influence others.
 The capacity to change the internal schema without having to change the conceptual schema
 Refers to immunity of conceptual schema to changes in the internal schema
 Internal schema changes e.g. using different file organizations, storage structures/devices should not
require change to conceptual or external schemas.
UNIT TWO
2. Database Model
A database model is a conceptual description of how the database works. It describes how the data elements
are stored in the database and how the data is presented to the user and programmer for access; and the
relationship between different items in the database.
A specific DBS has its own specific Data Definition Language, but this type of language is too low level to
describe the data requirements of an organization in a way that is readily understandable by a variety of users.
We need a higher-level language; such a higher-level description is called a database model.
Database Model is a set of concepts to describe the structure of a database, and certain constraints that the
database should obey.
A database model is a description of the way that data is stored in a database. Database model helps to
understand the relationship between entities and to create the most effective structure to hold data.
Database Model is a collection of tools or concepts for describing:
 Data
 Data relationships
 Data semantics
 Data constraints
The main purpose of database model is to represent the data in an understandable way.
Categories of database models include:
 Object-based
 Record-based
 Physical
2.1. Record-based Data Models
Record-based data models consist of a number of fixed-format records. Each record type defines a fixed number
of fields, and each field is typically of a fixed length. The following are examples of this database model category.
 Hierarchical Database Model
 Network Database Model
 Relational Database Model
1. Hierarchical Model
In this model, the data is organized in a tree structure that originates from a root, and each class of data
resides at different levels along a particular branch of the root. The data structure at each class level is called a
node. There is always a single root node which is usually owned by the system or DBMS. Each of the
pointers in the root then will point to (child) nodes there by depicting a parent-child sort of relationship.
Searches are done by traversing the tree up and down with known search algorithms and modules supplied by
the DBMS or may, for special cases, be designed by the application programmer. The initial structure of the
database must be defined by the application programmer when the database is created. From this point on, the
parent-children structure can’t be changed without redesigning the whole structure.
Generally, Hierarchical database model is:
 The simplest database model
 Record type is referred to as node or segment
 The top node is the root node
 Nodes are arranged in a hierarchical structure as sort of upside-down tree
 A parent node can have more than one child node
 A child node can only have one parent node
 The relationship between parent and child is one-to-many or one-to-one
 Relation is established by creating physical link between stored records (each is stored with a
predefined access path to other records)
 To add new record type or relationship, the database must be redefined and then stored in a new form.

[Diagram omitted: a hierarchy relating the record types Department, Employee, Job, Time Card and Activity.]

Table 4 Hierarchical data model

Advantages of Hierarchical Database Model:
 Hierarchical Model is simple to construct and operate on.
 Corresponds to a number of natural hierarchically organized domains, e.g., assemblies in
manufacturing, personnel organization in companies.
 Language is simple; uses constructs like GET, GET UNIQUE, GET NEXT, GET NEXT WITHIN
PARENT etc.
Disadvantages of Hierarchical Database Model:
 Navigational and procedural nature of processing.
 Database is visualized as a linear arrangement of records.
 Little scope for "query optimization".
2. Network Model

The network model is a conceptual description of databases in which many-to-many (multiple parent-children)
relationships exist. To make this model easier to understand, the relationships between the different data
items are commonly referred to as sets, to distinguish them from the strictly parent-child relationships defined
by the hierarchical model.

The network model uses pointers to map the relationships between the different data items. The flexibility of
the network model in representing many-to-many relationships is its greatest strength, though this flexibility
comes at a price: the interrelationships between the different data sets become extremely complex and
difficult to map.

Like hierarchical databases, network databases can be searched very quickly, especially through the use of
index pointers that lead directly to the first item in the set being searched. The network model suffers from the
same structural problem as the hierarchical model: the initial design of the database is arbitrary, and once it is
set up, any change to the different sets requires the programmer to create an entirely new structure. The dual
problems of duplicated data and inflexible structure led to the development of a database model that minimizes
both problems by making the relationships between the different data items the foundation for how the
database is structured.

Generally, the network database model:
 Allows record types to have more than one parent, unlike the hierarchical model
 Sees records as set members
 Each set has an owner and one or more members
 Allows/supports many-to-many relationships between entities
 Like the hierarchical model, the network model is a collection of physically linked records.
 Allows member records to have more than one owner
[Diagram omitted: a network relating the record types Department, Job, Employee, Activity and Time Card.]

Table 5 Network data model

Advantages of Network Data Model:
 Network Model is able to model complex relationships and represents the semantics of add or delete
on the relationships.
 Can handle most situations for modeling using record types and relationship types.
 Language is navigational; uses constructs like FIND, FIND member, FIND owner, FIND NEXT
within set, GET etc. Programmers can do optimal navigation through the database.
Disadvantages of Network Data Model:
 Navigational and procedural nature of processing.
 Database contains a complex array of pointers that thread through a set of records.
 Little scope for automated "query optimization".
3. Relational Database Model

The relational database model is a way of looking at data; that is, it is a prescription for a way of representing
data (namely, by means of tables) and a prescription for a way of manipulating such data (by means of
operators). More precisely, the relational database model is concerned with three aspects of data: data
structure (objects), data integrity, and data manipulation (operators).

The primary purpose behind the relational database model is the preservation of data integrity. To be
considered truly relational, a DBMS must completely prevent access to the data by any means other than
queries handled by the DBMS itself. While the relational model does not specify how the data is stored on the
disk, the preservation of data integrity implies that the data must be stored in a format that prevents it from
being accessed from outside the DBMS that created it.
The relational model also requires that the data be accessed through programs that do not rely on the position
of the data in the database. This is in direct contrast to the other database models, where the program has to
follow a series of pointers to the data it wants. A program querying a relational database simply asks for the
data it wants, and it is up to the DBMS to do the necessary searches and provide the answer. Searches can be
sped up by creating an index on one or more columns in a table; however, the DBMS controls and uses the
index. The user has only to ask the DBMS to create the index, and it will be maintained and used
automatically from that point on.
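A small sketch of this point, with a hypothetical employee table: the user declares the index once, and queries never mention it; the DBMS decides when to use it.

```sql
-- Create an index to speed up searches on emp_name.
CREATE INDEX idx_emp_name ON employee (emp_name);

-- The query asks only for the data; the DBMS chooses whether to use the index.
SELECT * FROM employee WHERE emp_name = 'Sara';
```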
The relational database model has a number of advantages over the other models. The most important is its
complete flexibility in describing the relationships between the various data items. Once the tables are created
and relationships defined then users can query the database on any of the individual columns in a table or on
the relationships between the different tables.

Changing the structure of the database objects is as simple as adding or deleting columns in a table. Creating
new tables, deleting old tables, etc. are also very simple. The major decisions that the designers of a relational
database have to make concern the definitions of the tables and their relationships in the database.

Generally, Relational database model is

 Developed by Dr. Edgar Frank Codd in 1970 (in the famous paper 'A Relational Model of Data for
Large Shared Data Banks').
 Terminologies originates from the branch of mathematics called set theory and relation.
 Can define more flexible and complex relationship.
 Viewed as a collection of tables called “Relations” equivalent to collection of record types.
 Relation: Two dimensional table.
 Stores information or data in the form of tables, i.e., rows and columns.
 A row of the table is called tuple equivalent to record.
 A column of a table is called attribute equivalent to fields.
 Data value is the value of the Attribute.
 Records are related by the data stored jointly in the fields of records in two tables or files. The
related tables contain information that creates the relation.
 The tables seem to be independent but are related somehow.
 No physical consideration of the storage is required by the user.
 Many tables are merged together to come up with a new virtual view of the relationship.
Table 6 Alternative terminologies in databases

Relation    Table     File
Tuple       Row       Record
Attribute   Column    Field

 The rows represent records (collections of information about separate items).
 The columns represent fields (particular attributes of a record).
 Conducts searches by using data in specified columns of one table to find additional data in another
table.
 In conducting searches, a relational database model matches information from a field in one table with
information in a corresponding field of another table to produce a third table that combines requested
data from both tables, as the sketch below shows.
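A minimal sketch of such a search, assuming hypothetical employee and department tables that share a dept_id field: matching the corresponding fields produces a derived third table combining data from both.

```sql
SELECT e.emp_name, d.dept_name
FROM   employee e
JOIN   department d ON e.dept_id = d.dept_id;  -- match on the common field
```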

UNIT THREE
3. Database Modelling Using the Entity-Relationship (ER)
Database Model
3.1. Properties of Relational Databases - Basic Concepts in Relational Database
 Each row of a table is uniquely identified by a primary key (which can be composed of one or more
columns).
 Each tuple in a relation must be unique.
 A group of columns that uniquely identifies a row in a table is called a candidate key.
 The entity integrity rule of the model states that no component of the primary key may contain a
NULL value.
 A column or combination of columns that matches the primary key of another table is called a
foreign key. This key is used to cross-reference tables.
 The referential integrity rule of the model states that, for every foreign key value in a table, there
must be a corresponding primary key value in another table in the database, or it should be NULL.
 All tables are logical entities.
 A table is either a base table (named relation) or a view (unnamed relation).
 Only base tables are physically stored.
 Views are derived from base tables with SQL instructions like:
[select .. from .. where .. order by].
 Relational database is the collection of tables.
 Each entity in one table.
 Attributes are fields (columns) in table.
 Order of rows and columns is immaterial or irrelevant.
 Entries with repeating groups are said to be un-normalized.
 Entries are single-valued.
 Each column (field or attribute) has a distinct name.
All values in a column represent the same attribute and have the same data format. (A short SQL sketch of
these rules follows.)
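The sketch below illustrates entity integrity, referential integrity and base tables versus views, using hypothetical department and employee tables:

```sql
CREATE TABLE department (
    dept_id   INT PRIMARY KEY,    -- entity integrity: the key is never NULL
    dept_name VARCHAR(30)
);

CREATE TABLE employee (
    emp_id   INT PRIMARY KEY,
    emp_name VARCHAR(30),
    dept_id  INT REFERENCES department(dept_id)  -- foreign key:
                                                 -- referential integrity
);

-- A view (unnamed relation) derived from a base table (named relation):
CREATE VIEW dept1_staff AS
SELECT emp_name FROM employee WHERE dept_id = 1;
```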

3.2. Building Blocks of the Relational Database Model


The building blocks of the relational database model are:
 Entities: Real world physical or logical objects.
 Attributes: Properties used to describe each entity or real world object.
 Relationships: The associations between the real world objects (i.e., entities).
 Constraints: Rules that should be obeyed or followed while manipulating the data.
1. ENTITIES: The entities (persons, places, things etc.) which the organization has to deal with. Relations
can also describe relationships. The name given to an entity should always be a singular noun descriptive
of each item to be stored in it, e.g. student, NOT students. Every relation has a schema, which describes
the columns, or fields; the relation itself corresponds to our familiar notion of a table: a relation is a
collection of tuples, each of which contains values for a fixed number of attributes.
Existence Dependency: The dependence of an entity on the existence of one or more entities.
Weak entity: An entity that cannot exist without the entity with which it has a relationship; it is
indicated by a double rectangle.
2. ATTRIBUTES - The items of information which characterize and describe these entities. Attributes are
pieces of information about entities. The analysis must of course identify those which are actually
relevant to the proposed application. Attributes will give rise to recorded items of data in the database. At
this level we need to know such things as:
 Attribute name: Should be explanatory words or phrases.
 The domain: from which attribute values are taken (A domain is a set of values from which
attribute values may be taken.) Each attribute has values taken from a domain. For example,
the domain of Name is string and that for salary is real.
 Whether the attribute is part of the entity identifier (attributes which just describe an entity and
those which help to identify it uniquely).
 Whether it is permanent or time-varying (which attributes may change their values over time).
 Whether it is required or optional for the entity (whose values will sometimes be unknown or
irrelevant).

3.2.1. Types of Attributes


(1) Simple (atomic) Vs Composite attributes
 Simple: Contains a single value (not divided into sub parts)
E.g. Age, gender, etc.
 Composite: Divided into sub parts (composed of other attributes).
E.g. Name, address, etc.
(2) Single-valued Vs multi-valued attributes
 Single-valued: Have only a single value (the value may change, but there is only one
value at any one time).
E.g. Name, Sex, ID No., color_of_eyes, etc.
 Multi-valued: Have more than one value.
E.g. Address, dependent-name; a person may have several college
degrees, etc.
(3) Stored vs. Derived Attributes
 Stored: not possible to derive or compute.
E.g. Name, Address, etc.
 Derived: The value may be derived (computed) from the values of other attributes.
E.g. Age (current year - year of birth).
Length of employment (current date - start date).
Profit (earnings - cost).
GPA (grade points / credit hours).
(4) Null Values
 NULL applies to attributes which are not applicable or which do not have values.
 You may enter the value NA (meaning not applicable).
 The value of a key attribute cannot be null.
Default value: the value assumed if no explicit value is given. (A SQL sketch of these attribute notions follows.)
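In the sketch below, which uses a hypothetical person table, the composite Name is split into atomic columns, a derived attribute is computed rather than stored, an optional attribute may be NULL, and a default value is supplied (the EXTRACT syntax for the current year varies by DBMS):

```sql
CREATE TABLE person (
    person_id  INT PRIMARY KEY,                 -- key attribute: cannot be NULL
    first_name VARCHAR(20),                     -- composite Name stored as
    last_name  VARCHAR(20),                     --   simple, atomic parts
    birth_year INT,                             -- stored attribute
    phone      VARCHAR(15) NULL,                -- optional: NULL when unknown
    country    VARCHAR(20) DEFAULT 'Ethiopia'   -- default value
);

-- Derived attribute: age is computed from birth_year at query time, not stored.
SELECT person_id,
       EXTRACT(YEAR FROM CURRENT_DATE) - birth_year AS age
FROM   person;
```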
3.2.2. Entity versus Attributes
When designing the conceptual specification of the database, one should pay attention to the distinction
between an Entity and an Attribute.
@ Consider designing a database of employees for an organization:

@ Should address be an attribute of Employees or an entity (connected to Employees by a relationship)?

 If we have several addresses per employee, address must be an entity (attributes cannot be
set-valued/multi-valued).

 If the structure (city, Woreda, Kebele, etc.) is important, e.g. if we want to retrieve employees in a
given city, address must be modeled as an entity (attribute values are atomic).
3. RELATIONSHIPS: The relationships between entities which exist and must be taken into account when
processing information. In any business processing, one object may be associated with another object due
to some event; such an association is what we call a relationship between entity objects.
 One external event or process may affect several related entities.
 Related entities require setting of links from one part of the database to another.
 A relationship should be named by a word or phrase which explains its function.
 Role names are different from the names of entities forming the relationship: one entity may
take on many roles, the same role may be played by different entities.
 For each relationship, one can talk about the number of entities and the number of tuples
participating in the association. These two concepts are called degree and cardinality of a
relationship respectively.
3.2.3. Degree of a Relationship
An important point about a relationship is how many entities participate in it. The number of entities
participating in a relationship is called the degree of the relationship.
The basic degrees of relationship are:
 Unary/recursive relationship: tuples/records of a single entity are related with each other.
 Binary relationship: tuples/records of two entities are associated in a relationship.
 Ternary relationship: tuples/records of three different entities are associated.
 And the generalized one, the n-ary relationship: tuples from an arbitrary number of entity sets
participate in a relationship.
3.2.4. Cardinality of a Relationship
Another important concept about relationship is the number of instances/tuples that can be associated with a
single instance from one entity in a single relationship. The number of instances participating or associated
with a single instance from an entity in a relationship is called the cardinality of the relationship. The major
cardinalities of a relationship are:
 One-to-one: one tuple is associated with only one other tuple.
o E.g. Building-to-Location: a single building is located at a single location, and a single
location accommodates only a single building.
 One-to-many: one tuple can be associated with many other tuples, but not the reverse.
o E.g. Department-to-Student: one department can have multiple students.
 Many-to-one: many tuples are associated with one tuple, but not the reverse.
o E.g. Employee-to-Department: many employees belong to a single department.
 Many-to-many: one tuple is associated with many other tuples and, from the other side (with a different
role name), one tuple is associated with many tuples.
o E.g. Student-to-Course: a student can take many courses, and a single course can be attended
by many students.
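In a relational implementation, a many-to-many relationship such as Student-to-Course is typically realized with a third, junction relation that holds one tuple per association. A minimal SQL sketch (the table and column names here are illustrative assumptions, not taken from the text):

CREATE TABLE STUDENT (
  StudId INT PRIMARY KEY,
  Name VARCHAR(30) NOT NULL);

CREATE TABLE COURSE (
  CourseId INT PRIMARY KEY,
  Title VARCHAR(30) NOT NULL);

-- Junction relation: each row links one student to one course.
CREATE TABLE ENROLLMENT (
  StudId INT REFERENCES STUDENT (StudId),
  CourseId INT REFERENCES COURSE (CourseId),
  PRIMARY KEY (StudId, CourseId)); -- one row per student-course pair

The composite primary key prevents the same student-course pair from being recorded twice, while still allowing many tuples on either side.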
4. Relational Constraints/Integrity Rules
Relational Integrity:
 Domain integrity: No value of the attribute should be beyond the allowable limits.
 Entity integrity: In a base relation, no attribute of a Primary Key can assume a value of NULL.
 Referential integrity: If a Foreign Key exists in a relation, either the Foreign Key value must
match a Candidate Key value in its home relation or the Foreign Key value must be NULL.
 Enterprise integrity: Additional rules specified by the users or database administrators of a
database are incorporated.
Keys and constraints: if tuples need to be unique in the database, then we must make each tuple
distinct. To do this we use relational keys, which uniquely identify each tuple.

A super key: a super key is a set of one or more attributes that, taken collectively, can identify an
entity uniquely within the entity set.

Example: Consider the “EMPLOYEES” entity set; then

- “EmpId”, “EmpId, Name”, “NationalId”, “NationalId, BDate”, … are super keys

- “Name”, “BDate” are NOT super keys

Super key: an attribute or set of attributes that uniquely identifies a tuple within a relation.

Note: if K is a super key, then any set of attributes containing K is also a super key.

The more interesting super key is the minimal one, which is referred to as the candidate key.
The candidate key is the sufficient and necessary set of attributes to distinguish tuples in an entity set.

Example: In the “EMPLOYEES” entity set

- “EmpId”, “NationalId”, “Name, BDate” (assuming there is no coincidence in which two employees with the
same name are born on the same day) … are candidate keys.

The designer of the database is the one that makes the choice of the candidate keys for implementation, but
the choice has to be made carefully. Primary key is a term used to refer to the candidate key that is selected
by the designer for implementation.

Candidate Key: an attribute or set of attributes that uniquely identifies individual occurrences of an entity
type or tuple within a relation.

A candidate key has two properties:

1. Uniqueness
2. Irreducibility
Candidate Key: a super key such that no proper subset of that collection is a Super Key within the relation.

Composite key: A candidate key that consists of two or more attributes.

Primary key: the candidate key that is selected to identify tuples uniquely within the relation. In the
worst case, the entire set of attributes in a relation may have to serve as the primary key.

In another way, an entity type may have one or more possible candidate keys, one of which is selected to be a
primary key.
Foreign key: an attribute, or set of attributes, within one relation that matches the candidate key of some
(possibly the same) relation. A foreign key is a link between different relations that can be used to
create views or unnamed relations.
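These key notions map directly onto SQL table definitions. A minimal sketch, assuming simplified EMPLOYEES and DEPARTMENT tables (the column names and types are illustrative): the primary key is the candidate key chosen by the designer, a UNIQUE constraint declares the remaining (alternate) candidate key, and a foreign key links the relation to another one.

CREATE TABLE DEPARTMENT (
  DeptId INT PRIMARY KEY,
  DName VARCHAR(30));

CREATE TABLE EMPLOYEES (
  EmpId INT PRIMARY KEY,              -- candidate key chosen as primary key
  NationalId CHAR(9) NOT NULL UNIQUE, -- the other candidate (alternate) key
  Name VARCHAR(30) NOT NULL,
  BDate DATE,
  DeptId INT,
  FOREIGN KEY (DeptId) REFERENCES DEPARTMENT (DeptId)); -- link between relations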

3.3. Relational Views


Relations are perceived as tables from the users' perspective. Actually, there are two kinds of relations in a
relational database. The two categories or types of relations are Base (Named) and View (Unnamed)
relations. The basic difference is in how the relation is created, used, and updated:
1. Base Relation: It is a named relation corresponding to an entity in the conceptual schema, whose
tuples are physically stored in the database.
2. View (Unnamed Relation): A view is the dynamic result of one or more relational operations
operating on the base relations to produce another, virtual relation that does not actually exist as
presented. So a view is a virtually derived relation that does not necessarily exist in the database but can
be produced upon request by a particular user at the time of the request. The virtual table or relation can
be created from one or several relations by extracting some attributes and records, with or without
conditions.
Purpose of a view

 Hides unnecessary information from users: only part of the base relation (some collection of
attributes, not necessarily all) is included in the virtual table.
 Provides powerful flexibility and security: since unnecessary information is hidden from the user,
there is some measure of data security.
 Provides a customized view of the database for users: each user is presented with their
own preferred data set and format by making use of views.
 A view over a single base relation can be updated.
 Updates on views derived from several relations are not allowed, since they may violate the integrity of the
database.
 Updates on views with aggregation and summary are not allowed, since aggregate and summary results
are computed from a base relation and do not actually exist.
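In SQL, a view is defined with the CREATE VIEW statement over one or more base relations. A minimal sketch, reusing the hypothetical EMPLOYEES and DEPARTMENT tables sketched earlier:

-- Expose only non-sensitive attributes of the join of two base relations.
CREATE VIEW EMP_PUBLIC AS
  SELECT E.Name, D.DName
  FROM EMPLOYEES E, DEPARTMENT D
  WHERE E.DeptId = D.DeptId;

-- The view is queried like a base table, but its tuples are derived on request:
SELECT * FROM EMP_PUBLIC;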
Schemas and Instances

When a database is designed using a relational data model, all the data is represented in a form of a table. In
such definitions and representation, there are two basic components of the database. The two components are
the definition of the relation or the table and the actual data stored in each table. The data definition is what
we call the Schema or the skeleton of the database and the relations with some information at some point in
time is the Instance or the flesh of the database.
Schemas

A schema describes how data is to be structured; it is defined at setup/design time (and is also called
"metadata"). Since it is fixed during the database development phase, the schema is rarely changed, unless
there is a need for system maintenance that demands a change to the definition of a relation.
 Database Schema (Intension): specifies name of relation and the collection of the attributes
(specifically the Name of attributes).
 refer to a description of database (or intention)
 specified during database design
 should not be changed unless during maintenance
 Schema Diagrams: convention to display some aspect of a schema visually.
 Schema Construct: refers to each object in the schema (e.g. STUDENT)
E.g.: STUDENT (FName, LName, Id, Year, Dept, Sex)
Instances
Instance: is the collection of data in the database at a particular point of time (snap-shot).
 Also called State or Snap Shot or Extension of the database.
 Refers to the actual data in the database at a specific point in time.
 State of database is changed any time we add, delete or update an item.
 Valid state: the state that satisfies the structure and constraints specified in the schema and is
enforced by DBMS.
Since an instance is the actual data of the database at some point in time, it changes rapidly. To define a new
database, we specify its database schema to the DBMS (at which point the database is empty); the database is
initialized when we first load it with data.
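In SQL terms, the schema (intension) is what a CREATE TABLE statement defines, while every INSERT, UPDATE, or DELETE changes the instance (extension). A minimal sketch using a simplified STUDENT table (an assumption for illustration):

-- Schema: defined once, at design time.
CREATE TABLE STUDENT (
  Id INT PRIMARY KEY,
  FName VARCHAR(20),
  Dept VARCHAR(10));

-- Instance: the current set of tuples; it changes with every update.
INSERT INTO STUDENT VALUES (12, 'Aster', 'CS');
INSERT INTO STUDENT VALUES (13, 'Maru', 'CS');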

3.4. ENTITY - RELATIONSHIP DIAGRAMS


As one important aspect of E-R modeling, database designers represent their data model by E-R diagrams.
These diagrams enable designers and users to express their understanding of what the planned database is
intended to do and how it might work, and to communicate about the database through a common language.
Each organization that uses E-R diagrams must adopt a specific style for representing the various
components.
Graphical Representations in ER Diagramming
 Entity: represented by a rectangle containing the name of the entity; a strong entity is drawn with a
single rectangle and a weak entity with a double rectangle. Connected entities are called relationship
participants.
 Attributes: represented by ovals connected to the entity by a line; a multi-valued attribute is drawn as
a double oval, and a composite attribute as an oval connected to the ovals of its component attributes.
 A derived attribute is indicated by a dotted oval.
 Primary keys are underlined.
 Relationships are represented by diamond-shaped symbols.
 A weak relationship is a relationship between a weak and a strong entity; a strong relationship is a
relationship between two strong entities.

An entity-relationship model (ERM) provides a high-level description of a conceptual data model, together
with a graphical notation for representing such data models in the form of entity-relationship diagrams
(ERDs).

The whole purpose of ER modeling is to create an accurate reflection of the real world in a database. The ER
model doesn’t actually give us a database description. It gives us an intermediate step from which it is easy to
define a database.

The E-R data model is based on a perception of a real world that consists of a set of basic objects called
entities, and of relationships among these objects. It was developed to facilitate database design by allowing
the specification of an enterprise schema, which represents the overall logical structure of a database.

The E-R data model is one of several semantic data models; the semantic aspect of the model lies in the
attempt to represent the meaning of the data. The E-R model is extremely useful in mapping the meanings
and interactions of real-world enterprises onto a conceptual scheme. Because of this utility, many database
design tools draw on concepts from the E-R model.

It is a data model in which information stored in the database is viewed as sets of entities and sets of
relationships among entities. There are three basic notions that the ER model employs: entity sets,
relationships, and attributes.
UNIT FOUR
4. Database Design
Database design is the process of coming up with different kinds of specifications for the data to be stored in
the database. The design part is one of the middle phases of information systems development when the
system uses a database approach; it is the part in which we describe how the data should be perceived at
different levels and, finally, how it is going to be stored in a computer system.
The development life cycle of an information system with a database application consists of several tasks,
which include:
 Planning of Information systems Design
 Requirements Analysis
 Design (Conceptual, Logical and Physical Design)
 Tuning
 Implementation
 Operation and Support
From these different phases, the prime interest of database system development is the design part, which is
again subdivided into three sub-phases. These sub-phases are:
1. Conceptual Database Design
2. Logical Database Design, and
3. Physical Database Design
In general, one has to go back and forth between these tasks to refine a database design, and decisions in one
task can influence the choices in another task. In developing a good design, one should answer such questions
as:
 What are the relevant Entities for the Organization
 What are the important features of each Entity
 What are the important Relationships
 What are the important queries from the user
 What are the other requirements of the Organization and the Users
4.1. The Three levels of Database Design
Conceptual Design

Logical Design

Physical Design
Table 7: Levels of database design
1. Conceptual Database Design
Conceptual design is the process of constructing a model of the information used in an enterprise,
independent of any physical considerations.

 It is the source of information for the logical design phase.


 Mostly uses an Entity Relationship Model to describe the data at this level.
 After the completion of Conceptual Design one has to go for refinement of the schema, which is
verification of Entities, Attributes, and Relationships.
2. Logical Database Design
Logical design is the process of constructing a model of the information used in an enterprise based on a
specific database model (e.g. Relational, Hierarchical or Network or Object), but independent of a particular
DBMS and other physical considerations.
Normalization process
– Collection of Rules to be maintained.
– Discover new entities in the process.
– Revise attributes based on the rules and the discovered Entities.
3. Physical Database Design
Physical design is the process of producing a description of the implementation of the database on secondary
storage; it defines the specific storage structures and access methods used by the database.

o Describes the storage structures and access methods used to achieve efficient access to the data.
o Tailored to a specific DBMS: its characteristics are a function of the DBMS and the operating system.
o Includes an estimate of storage space.

NOTE:
In the conceptual data model/design:
o Identify the entities/entity types.
o Identify the attributes: the information about entities and relationships that we should store in the
database.
o Identify the relationship types.
o Identify the constraints/business rules that hold.
o Draw the entity-relationship diagram: represent the database in the ER model using the pictorial
representation called an ER diagram.
o Review the conceptual data model with the user.
In the logical data model/design:
o Map the conceptual model to a logical model.
o Map the entities and relationships in the ER diagram into tables:
- translate the ER diagram with its constraints.
o Derive relations from the logical data model.
o Validate the model using normalization.
o Validate the model against user transactions.
o Draw the entity-relationship diagram.
o Define integrity constraints.
o Check for future growth.

NB: Starting from here, we are going to design the database using the relational database model.

4.2. Conceptual Database Design


Conceptual design revolves around discovering and analyzing organizational and user data requirements. The
important activities are to identify

o Entities
o Attributes
o Relationships
o Constraints
And based on these components develop the ER model using

 ER diagrams
4.2.1. The Entity Relationship (E-R) Model
An entity-relationship (E-R) data model is a high-level conceptual model that describes data as entities,
attributes, and relationships. The E-R model is represented by E-R diagrams that show how data will be
represented and organized in the various components of the final database. However, the model diagrams do
not specify the actual data, or even exactly how it is stored: the users and applications will create the data
content, and the database management system will create the database to store the content.
Entity-Relationship modeling is used to represent conceptual view of the database. The main components of
ER Modeling are:
 Entities
o Corresponds to entire table, not row
o Represented by Rectangle
 Attributes
o Represents the property used to describe an entity or a relationship
o Represented by Oval
 Relationships
o Represents the association that exist between entities
o Represented by Diamond
 Constraints
o Represent the constraint in the data
Before working on the conceptual design of the database, one has to know and answer the following basic
questions.
What are the entities and relationships in the enterprise?
What information about these entities and relationships should we store in the database?
 What are the integrity constraints that hold? (Constraints on the data with respect to update,
retrieval, and storage.)
 Represent this information pictorially in ER diagrams, then map the ER diagram into a relational
schema.
4.2.2. Developing an E-R Diagram
Designing the conceptual model for the database is not one linear process but an iterative activity in which the
design is refined again and again. To identify the entities, attributes, relationships, and constraints on the
data, different methods are used during the analysis phase. These include information gathered
by:
 Interviewing end users individually and in a group
 Questionnaire survey
 Direct observation
 Examining different documents
The basic E-R model is graphically depicted and presented for review. The process is repeated until the end
users and designers agree that the E-R diagram is a fair representation of the organization's activities and
functions. One should also check for redundant relationships in the ER diagram: relationships between
entities indicate access from one entity to another, so it is possible to reach one entity occurrence from
another entity occurrence even if other entities and relationships separate them; this is often referred to as
'navigation' of the ER diagram. The last phase in ER modeling is validating the ER model against the
requirements of the user.

Example 1: Build an E-R Diagram for the following information:

A student record management system will have the following two basic data object categories with their own
features or properties: Students will have an Id, Name, Dept, Age, GPA and Course will have an Id, Name,
Credit Hours. Whenever a student enroll in a course in a specific Academic Year and Semester, the Student
will have a grade for the course.

[ER diagram for Example 1: the entity STUDENTS, with attributes Id, Name, Dept, DoB, Age, and GPA, is
connected through the relationship Enrolled_In (with attributes Academic Year, Semester, and Grade) to the
entity COURSE, with attributes Id, Name, and Credit.]

Example 2: Build an ER Diagram for the following information:

A Personnel record management system will have the following two basic data object categories with their
own features or properties: Employee will have an Id, Name, DoB, Age, Tel and Department will have an Id,
Name, Location. Whenever an Employee is assigned in one Department, the duration of his stay in the
respective department should be registered.

Structural Constraints on Relationship

1. Constraints on Relationships (Multiplicity/Cardinality Constraints): a multiplicity constraint is the number or
range of possible occurrences of an entity type/relation that may relate to a single occurrence/tuple of another
entity type/relation through a particular relationship. It is mostly used to ensure appropriate enterprise constraints.
One-to-one relationship:

 A customer is associated with at most one loan via the relationship Borrower, and a loan is associated with at most
one customer via Borrower.

E.g.: the relationship Manages between STAFF and BRANCH.

The multiplicity of the relationship is:

 One branch can only have one manager.

 One employee could manage either one branch or none.

Employee (1..1) -- Manages -- (0..1) Branch

One-to-Many Relationships

 In a one-to-many relationship, a loan is associated with at most one customer via Borrower, while a customer is
associated with several (including zero) loans via Borrower.
E.g.: the relationship Leads between STAFF and PROJECT.

The multiplicity of the relationship is:

 One staff member may lead one or more project(s).

 One project is led by one staff member.

Employee (1..1) -- Leads -- (0..*) Project

Many-to-Many Relationship

 A customer is associated with several (possibly zero) loans via Borrower, and a loan is associated with several
(possibly zero) customers via Borrower.

E.g.: the relationship Teaches between INSTRUCTOR and COURSE.

The multiplicity of the relationship is:

 One instructor teaches one or more course(s).

 One course is taught by zero or more instructor(s).

Instructor (0..*) -- Teaches -- (1..*) Course

2. Participation of an Entity Set in a Relationship Set (Participation Constraints)

The participation constraint of a relationship identifies whether it is mandatory or optional for an entity
occurrence to take a role in the relationship. There are two distinct participation constraints
in this respect, namely: total participation and partial participation.
1. Total participation: every tuple in the entity or relation participates in at least one relationship by
taking a role. This means, every tuple in a relation will be attached with at least one other tuple. The
entity with total participation in a relationship will be connected to the relationship using a double
line.
2. Partial participation: some tuple in the entity or relation may not participate in the relationship. This
means, there is at least one tuple from that Relation not taking any role in that specific relationship.
The entity with partial participation in a relationship will be connected to the relationship using a
single line.
E.g. 1: The participation of EMPLOYEE in the "belongs to" relationship with DEPARTMENT is total,
since every employee should belong to a department. The participation of DEPARTMENT in the "belongs
to" relationship with EMPLOYEE is also total, since every department should have at least one
employee.

Employee -- Belongs To -- Department

E.g. 2: The participation of EMPLOYEE in the "manages" relationship with DEPARTMENT is partial,
since not all employees are managers. The participation of DEPARTMENT in the "manages" relationship
with EMPLOYEE is total, since every department should have a manager.

Employee -- Manages -- Department

Problems in ER Modeling

The Entity-Relationship Model is a conceptual data model that views the real world as consisting of entities
and relationships. The model visually represents these concepts by the Entity-Relationship diagram. The basic
constructs of the ER model are entities, relationships, and attributes. Entities are concepts, real or abstract,
about which information is collected. Relationships are associations between the entities. Attributes are
properties which describe the entities.

While designing the ER model, one could face a problem in the design which is called a connection trap.
Connection traps are problems arising from misinterpreting certain relationships.

There are two types of connection traps:

1. Fan trap:
Occurs where a model represents a relationship between entity types, but the pathway between certain
entity occurrences is ambiguous.

It may exist where two or more one-to-many (1:M) relationships fan out from the same entity. The problem
can be avoided by restructuring the model so that there are no 1:M relationships fanning out
from a single entity while all the semantics of the relationships are preserved.

Example:

EMPLOYEE (1..*) -- Works -- (1..1) BRANCH (1..1) -- IsAssigned -- (1..*) CAR

Semantic description of the problem:

[Occurrence diagram: employees Emp1 to Emp5 are linked to branches Bra1 to Bra3, and cars Car1 to Car5
are linked to the same branches; the pathway from a particular employee to a particular car through
BRANCH is ambiguous.]

Problem: Which car (Car1, Car3, or Car5) is used by an employee (say Emp1) working in Branch 1 (Bra1)?
From this ER model one cannot tell which car is used by which staff member, since a branch can have more
than one car and a branch is also populated by more than one employee. Thus we need to restructure the
model to avoid the connection trap.

To avoid the fan trap problem we can restructure the E-R model. This results in the following
E-R model.

BRANCH (1..1) -- Has -- (1..*) CAR (1..*) -- UsedBy -- (1..*) EMPLOYEE

Semantic description of the solution:

[Occurrence diagram: each car is now linked both to its branch and directly to the employee(s) using it, so
the employee-to-car pathway is no longer ambiguous.]

2. Chasm Trap:
Occurs where a model suggests the existence of a relationship between entity types, but the pathway
does not exist between certain entity occurrences.
It may exist when one or more relationships with a minimum multiplicity (cardinality) of
zero form part of the pathway between related entities.
Example:

BRANCH (1..1) -- Has -- (1..*) EMPLOYEE (0..1) -- Manages -- (0..*) PROJECT

If we have a set of projects that are not currently active, then we cannot assign a project manager to these
projects. So there are projects with no project manager, making the participation have a minimum value of
zero.
Problem:
How can we identify which BRANCH is responsible for which PROJECT? We know that, whether a
PROJECT is active or not, there is a responsible BRANCH. But which branch it is cannot be answered:
since we have a minimum participation of zero between EMPLOYEE and PROJECT, we cannot identify the
BRANCH responsible for each PROJECT.
The solution to this chasm trap problem is to add another relationship between the extreme entities (BRANCH
and PROJECT).

BRANCH (1..1) -- Has -- (1..*) EMPLOYEE (0..1) -- Manages -- (0..*) PROJECT
BRANCH (1..1) -- Responsible for -- (1..*) PROJECT

Enhanced E-R (EER) Model

The EER model adds object-oriented extensions to the E-R model. EER is important when we have a
relationship between two entities and the participation between entity occurrences is partial; in such cases
EER is used to reduce the complexity of participation and relationship modeling. ER diagrams consider
entity types to be primitive objects; EER diagrams allow refinements within the structures of entity types.

EER Concepts: In this part we will discuss the following basic EER concepts.

Generalization
Specialization
Sub classes
Super classes
Attribute Inheritance
Constraints on specialization and generalization
Generalization

Generalization occurs when two or more entities represent categories of the same real-world object.
Generalization is the process of defining a more general entity type from a set of more specialized entity
types. A generalization hierarchy is a form of abstraction that specifies that two or more entities that share
common attributes can be generalized into a higher-level entity type. Generalization is considered a
bottom-up definition of entities. A generalization hierarchy depicts the relationship between a higher-level
superclass and lower-level subclasses.

Generalization hierarchies can be nested. That is, a subtype of one hierarchy can be a supertype of
another. The level of nesting is limited only by the constraint of simplicity.

Example: Account is a generalized form for Saving and Current Accounts.


Specialization

Specialization is the result of taking a subset of a higher-level entity set to form a lower-level entity set. The
specialized entities will have an additional set of attributes (distinguishing characteristics) that distinguish them
from the generalized entity. It is considered a top-down definition of entities. The specialization process is
the inverse of the generalization process: identify the distinguishing features of some entity occurrences, and
specialize them into different subclasses.

Reasons for Specialization are:

 Attributes only partially applying to superclasses.


 Relationship types only partially applicable to the superclass.
In many cases, an entity type has numerous sub-groupings of its entities that are meaningful and need to be
represented explicitly. This need requires the representation of each subgroup in the ER model. The
generalized entity is a superclass and the set of specialized entities will be subclasses for that specific
Superclass.
Example: Saving Accounts and Current Accounts are Specialized entities for the generalized entity
Accounts. Manager, Sales, Secretary: are specialized employees.

Subclass/Subtype
An entity type whose tuples have attributes that distinguish its members from tuples of the generalized or
Super class entities. When one generalized Super class has various subgroups with distinguishing features and
these subgroups are represented by specialized form, the groups are called subclasses. Subclasses can be
either mutually exclusive (disjoint) or overlapping (inclusive). A single subclass may inherit attributes from
two distinct super classes. A mutually exclusive category/subclass is when an entity instance can be in only
one of the subclasses. E.g.: An EMPLOYEE can either be SALARIED or PART-TIMER but not both.
An overlapping category/subclass is when an entity instance may be in two or more subclasses. E.g.: A
person who works for a university can be both employee and a student at the same time.
Superclass/Supertype
An entity type whose tuples share common attributes. Attributes that are shared by all entity occurrences
(including the identifier) are associated with the supertype. The superclass/supertype is the generalized entity.
Relationship Between Superclass and Subclass

The relationship between a superclass and any of its subclasses is called a superclass/subclass or
class/subclass relationship. An instance cannot be a member of a subclass only: every instance of a
subclass is also an instance of the superclass. A member of a subclass is represented as a distinct database
object, a distinct record that is related via the key attribute to its superclass entity. An entity cannot exist in
the database merely by being a member of a subclass; it must also be a member of the superclass. Conversely,
an entity occurrence of the superclass does not necessarily belong to any of the subclasses, unless there is full
participation in the specialization. The relationship between a subclass
and a superclass is an “IS A” or “IS PART OF” type.

 Subclass IS PART OF Superclass


 Manager IS AN Employee

All subclasses or specialized entity sets should be connected with the superclass using a line to a circle, where
a subset symbol indicates the direction of the subclass/superclass relationship.

We can also have subclasses of a subclass, forming a hierarchy of specializations. Superclass attributes are
shared by all subclasses of that superclass; subclass attributes are unique to the subclass.

Attribute Inheritance
An entity that is a member of a subclass inherits all the attributes of the entity as a member of the superclass;
it also inherits all the relationships in which the superclass participates. An entity may have more
than one subclass category. All entities/subclasses of a generalized entity or superclass share a common
unique identifier attribute (primary key); i.e. the primary keys of the superclass and its subclasses are always
identical.

Consider the EMPLOYEE supertype entity. This entity can have several different subtype
entities (for example, HOURLY and SALARIED), each with distinct properties not shared by the other subtypes.
But whether an employee is Hourly or Salaried, the same attributes (EmployeeId, Name, and DateHired) are
shared. The supertype EMPLOYEE stores all the properties that the subclasses have in common, while HOURLY
employees have the unique attribute Wage (hourly wage rate) and SALARIED employees have two unique
attributes, StockOption and Salary.
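One common way to realize such a supertype/subtype hierarchy in a relational schema is to give each subtype its own table that shares the supertype's primary key. A minimal SQL sketch of the EMPLOYEE example (column types are assumptions):

CREATE TABLE EMPLOYEE (
  EmployeeId INT PRIMARY KEY,
  Name VARCHAR(30),
  DateHired DATE);

-- Each subtype table reuses the supertype's key: attribute inheritance.
CREATE TABLE HOURLY (
  EmployeeId INT PRIMARY KEY REFERENCES EMPLOYEE (EmployeeId),
  Wage DECIMAL(8,2));

CREATE TABLE SALARIED (
  EmployeeId INT PRIMARY KEY REFERENCES EMPLOYEE (EmployeeId),
  StockOption INT,
  Salary DECIMAL(10,2));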

Constraints on specialization and generalization

Completeness Constraint

The completeness constraint addresses the issue of whether or not an occurrence of a superclass must also
have a corresponding subclass occurrence; it requires that all instances of the subtypes be represented in the
supertype. The Total Specialization Rule specifies that every entity occurrence should be a member of at
least one of the subclasses. Total participation of superclass instances in subclasses is diagrammed with a
double line from the supertype to the circle. E.g.: if we have Extension and Regular as subclasses of a
superclass Student, then it is mandatory that each student be either an Extension or a Regular student; thus
the participation of instances of Student in the Extension and Regular subclasses is total.
The Partial Specialization Rule specifies that it is not necessary for all entity occurrences in the superclass to
be a member of one of the subclasses; here we have optional participation in the specialization. Partial
participation of superclass instances in subclasses is diagrammed with a single line from the supertype to the
circle. E.g.: if we have Manager and Secretary as subclasses of a superclass Employee, then it is not the
case that all employees are either managers or secretaries; thus the participation of instances of Employee in
the Manager and Secretary subclasses is partial.
Disjointness Constraints
Specifies the rule for whether one entity occurrence can be a member of more than one subclass; i.e. it is a type
of business rule that deals with the situation where an entity occurrence of a superclass may also have more
than one subclass occurrence. The Disjoint Rule restricts one entity occurrence of a superclass to be a
member of only one of the subclasses. Example: an employee can either be salaried or a part-timer, but not
both at the same time. The Overlap Rule allows one entity occurrence to be a member of more than one
subclass. Example: an employee working at the university can be both a student and an employee at the same
time. This is diagrammed by placing either the letter "d" for disjoint or "o" for overlapping inside the circle
on the generalization hierarchy portion of the E-R diagram.

The two types of constraints on generalization and specialization (disjointness and completeness constraints)
are independent of one another. That is, being disjoint does not determine whether the tuples in the superclass
should have total or partial participation in that specific specialization.

From the two types of constraints we can have four possible constraints

@ Disjoint AND Total


@ Disjoint AND Partial
@ Overlapping AND Total
@ Overlapping AND Partial
Chapter Five
Relational Database design
Databases and database systems have become an essential component of everyday life in modern society. In
the course of a day, most of us encounter several activities that involve some interaction with a database. For
example, if we go to the bank to deposit or withdraw funds, if we make a hotel or airline reservation, if we
access a computerized library catalog to search for a bibliographic item, or if we buy some item-such as a
book, toy, or computer-from an Internet vendor through its Web page, chances are that our activities will
involve someone or some computer program accessing a database. Even purchasing items from a supermarket
nowadays in many cases involves an automatic update of the database that keeps the inventory of
supermarket items.
These interactions are examples of what we may call traditional database applications, in which most of the
information that is stored and accessed is either textual or numeric. In the past few years, advances in
technology have been leading to exciting new applications of database systems. Multimedia databases can
now store pictures, video clips, and sound messages. Geographic information systems (GIS) can store and
analyze maps, weather data, and satellite images. Data warehouses and online analytical processing (OLAP)
systems are used in many companies to extract and analyze useful information from very large databases for
decision making. Real-time and active database technology is used in controlling industrial and
manufacturing processes. And database search techniques are being applied to the World Wide Web to
improve the search for information that is needed by users browsing the Internet.
The relational model represents the database as a collection of relations. Informally, each relation resembles a
table of values or, to some extent, a "flat" file of records. However, there are important differences between
relations and files. When a relation is thought of as a table of values, each row in the table represents a
collection of related data values. In the relational model, each row in the table represents a fact that typically
corresponds to a real-world entity or relationship. The table name and column names are used to help in
corresponds to a real-world entity or relationship. The table name and column names are used to help in
interpreting the meaning of the values in each row. All values in a column are of the same data type. In the
formal relational model terminology, a row is called a tuple, a column header is called an attribute, and the
table is called a relation. The data type describing the types of values that can appear in each column is
represented by a domain of possible values.

5.1. Domains, Attributes, Tuples, and Relations


A domain D is a set of atomic values. By atomic we mean that each value in the domain is indivisible as far as the
relational model is concerned. A common method of specifying a domain is to specify a data type from which the
data values forming the domain are drawn. It is also useful to specify a name for the domain, to help in
interpreting its values. Some examples of domains:
 USA_phone_numbers: The set of ten-digit phone numbers valid in the United States.
 Social_security_numbers: The set of valid nine-digit social security numbers.
 Names: The set of character strings that represent names of persons.
The preceding is called logical definitions of domains. A data type or format is also specified for each domain. For
example, the data type for the domain USA_phone_numbers can be declared as a character string of the form
(ddd)ddd-dddd, where each d is a numeric (decimal) digit and the first three digits form a valid telephone area
code.
A domain is thus given a name, data type, and format. Additional information for interpreting the values of a
domain can also be given; for example, a numeric domain such as Person weights should have the units of
measurement, such as pounds or kilograms. A relation schema R, denoted by R(A1, A2, ..., An), is made up of a
relation name R and a list of attributes A1, A2, ..., An. Each attribute Ai is the name of a role played by some domain
D in the relation schema R; D is called the domain of Ai and is denoted by dom(Ai). A relation schema is used to
describe a relation; R is called the name of this relation. The degree (or arity) of a relation is the number of
attributes n of its relation schema.
An example of a relation schema for a relation of degree seven, which describes university students, is the
following:
STUDENT(Name, SSN, HomePhone, Address, OfficePhone, Age, GPA)
Using the data type of each attribute, the definition is sometimes written as:
STUDENT(Name: string, SSN: string, HomePhone: string, Address: string, OfficePhone:
string, Age: integer, GPA: real)
Conceptual data models use concepts such as entities, attributes, and relationships. An entity represents a
real-world object or concept, such as an employee or a project that is described in the database. An attribute
represents some property of interest that further describes an entity, such as the employee's name or salary. A
relationship among two or more entities represents an association among two or more entities, for example, a
works-on relationship between an employee and a project.
Cardinality ratio: specifies the maximum number of relationship instances that an entity can participate in.
e.g. Works-For relationship between Department and Employee has 1:N cardinality ratio
Works-On relationship between Employee and Project has M:N cardinality ratio
Manages relationship between Employee and Department has 1:1 cardinality ratio
Example, from a University database:

STUDENT
Name    StudentNumber   Class   Major
Aster   12              A       CS
Maru    13              B       CS

COURSE
CourseName                          CourseNumber   CreditHour   Department
Introduction to Computer Science    Cosc123        3            CS
Data Structures                     Cosc344        4            CS
Discrete Mathematics                Math101        4            Math
Database                            Cosc354        3            CS
A relation (or relation state) r of the relation schema R(A1, A2, ..., An), also denoted by r(R), is a set of
n-tuples r = {t1, t2, ..., tm}. Each n-tuple t is an ordered list of n values t = <v1, v2, ..., vn>, where each value
vi, 1 <= i <= n, is an element of dom(Ai) or is a special NULL value.

5.2. Functional Dependency


A functional dependency is a constraint between two sets of attributes from the database. Suppose that our
relational database schema has n attributes A1, A2, ..., An; let us think of the whole database as being
described by a single universal relation schema R = {A1, A2, ..., An}. (We use this concept only in developing
the formal theory of data dependencies.) A functional dependency, denoted by X → Y, between two sets of
attributes X and Y that are subsets of R specifies a constraint on the possible tuples that can form a relation
state r of R. The constraint is that, for any two tuples t1 and t2 in r that have t1[X] = t2[X], they must also
have t1[Y] = t2[Y].
This means that the values of the Y component of a tuple in r depend on, or are determined by, the values of
the X component; alternatively, the values of the X component of a tuple uniquely (or functionally) determine
the values of the Y component. We also say that there is a functional dependency from X to Y, or that Y is
functionally dependent on X. The abbreviation for functional dependency is FD or f.d. The set of attributes X
is called the left-hand side of the FD, and Y is called the right-hand side. Thus, X functionally determines Y
in a relation schema R if, and only if, whenever two tuples of r(R) agree on their X-value, they must
necessarily agree on their Y-value.
Note the following:
A functional dependency is a property of the semantics or meaning of the attributes. The database designers
will use their understanding of the semantics of the attributes of R-that is, how they relate to one another-to
specify the functional dependencies that should hold on all relation states (extensions) r of R. the main use of
functional dependencies is to describe further a relation schema R by specifying constraints on its attributes
that must hold at all times.
Consider the relation schema EMP_PROJ; from the semantics of the attributes, we know that the following
functional dependencies should hold:
a. SSN → ENAME
b. PNUMBER → {PNAME, PLOCATION}
c. {SSN, PNUMBER} → HOURS

These functional dependencies specify that (a) the value of an employee's social security number (SSN)
uniquely determines the employee name (ENAME);
(b) the value of a project's number (PNUMBER) uniquely determines the project name (PNAME) and
location (PLOCATION); and
(c) a combination of SSN and PNUMBER values uniquely determines the number of hours the employee
currently works on the project per week (HOURS). Alternatively, we say that ENAME is functionally
determined by (or functionally dependent on) SSN, or "given a value of SSN, we know the value of ENAME,"
and so on. A functional dependency is a property of the relation schema R, not of a particular legal relation
state r of R. Hence, an FD cannot be inferred automatically from a given relation extension r, but must be
defined explicitly by someone who knows the semantics of the attributes of R. We denote by F the set of
functional dependencies that are specified on relation schema R.
The set of all functional dependencies that include F, as well as all dependencies that can be inferred from F, is
called the closure of F; it is denoted by F+. For example, given
F = {SSN → {Ename, Bdate, Address, Dnumber},
     Dnumber → {Dname, DmgrSSN}}
the dependencies we can additionally infer from F include:
 SSN → {Dname, DmgrSSN}
 SSN → SSN
 Dnumber → Dname
The most important type of constraint used to improve designs systematically is a unique-value constraint
called the functional dependency.

Title       Year   Length   FilmType   StudioName   StarName
Star Wars   1997   124      color      Fox          Carrie Fisher
Star Wars   1997   124      color      Fox          Mark Hamill
Star Wars   1997   124      color      Fox          Harrison Ford

For example, consider the above relation; we can derive the following functional dependency:
Title, Year → Length, FilmType, StudioName
This FD says that if two tuples have the same values in their Title and Year components, then these two tuples must
have the same values in their Length, FilmType, and StudioName components.
Title, Year → StarName does not hold and is not a functional dependency; i.e. given a movie, it is entirely
possible that more than one star of the movie is listed in our database.
Keys of Relations
We say a set of one or more attributes {A1, A2, ..., An} is a key for a relation R if:
1. Those attributes functionally determine all other attributes of the relation (because relations are sets, it
is impossible for two distinct tuples of R to agree on all of A1, A2, ..., An).

2. No proper subset of {A1, A2, ..., An} functionally determines all other attributes of R; i.e. a key must be
minimal.
An FD is an assertion about the schema of a relation, not about a particular instance.
E.g. Title → FilmType does not hold in the above relation (even if, for every tuple in this particular instance of
the Movies relation, it happens that any two tuples agreeing on Title also agree on FilmType).
5.2.1. Functional Dependency analysis
Functional dependencies can be used to determine keys of a relation. If a particular attribute or set of
attributes determines all other attributes of a relation, we say that attribute or set of attributes is a Key.
Closure of Attributes
Given a relation, its functional dependencies, and a set of attributes A, the closure of A is the set of all
attributes B such that A → B; it is denoted by A+ or {A1, A2, ..., An}+.
 It is finding the entire set of attributes determined by the attributes A1, ..., An.
 To find the closure {A1, A2, ..., An}+ of a set of attributes:
 start with the attribute set {A1, ..., An};
 then add attributes to the set whenever there is a functional dependency whose left-hand side is
already contained in the set, e.g. if A → CD holds, insert C and D into the closure set;
 repeat until there is no change.
Closure example
Given a relation with the schema
Student(SSN, Sname, Address, Hscode, Hsname, Hscity, GPA, Priority)
and suppose the following functional dependencies hold:
SSN → Sname, Address, GPA
GPA → Priority
Hscode → Hsname, Hscity
Then {SSN, Hscode}+ = {SSN, Hscode, Sname, Address, GPA, Priority, Hsname, Hscity}.
If we compute {SSN}+ alone, it does not determine all the attributes, and hence SSN by itself cannot be a key.
But together, SSN and Hscode can be a key, because they determine all the attributes. In general, compute A+,
and if it equals the set of all attributes of the relation, then A is a key.
Whether an attribute appears on the left-hand side, the right-hand side, or both sides (the middle) of the
functional dependencies determines whether that attribute can be a key or part of a key; i.e.
- An attribute that appears only on the left-hand side (LHS) of the functional dependencies must be part of
every key.
- An attribute that appears only on the right-hand side (RHS) of the functional dependencies cannot be a key
or part of a key.
- An attribute that appears on both sides (in the middle) of the functional dependencies may or may not be a
key or part of a key.
Examples
1. If a relation R(A, B, C) and the functional dependencies F = {A→B, B→C} are given, then we can classify
the attributes into left, middle, and right-hand-side attributes as follows:
L: A    M: B    R: C
We can find A+ = {A, B, C}; therefore A is a key for this relation.
2. If a relation R(A, B, C, D) and the functional dependencies F = {AB→C, C→B, C→D} are given, then to
determine the keys of this relation we can classify the attributes into the left, middle, and right groups and
find the closure of each candidate set. If the closure of a particular set of attributes contains all the attributes,
then that set of attributes is a key.
L: A    M: B, C    R: D
Find the closures:
A+ = {A};  {A, B}+ = {A, B, C, D};  {A, C}+ = {A, C, B, D}.
Therefore the keys for the given relation are {A, B} and {A, C}.
3. If a relation R(A, B, C) and the functional dependencies F = {A→B, B→C, C→A} are given, then to
determine the keys of this relation we again classify the attributes and find the closure of each. If the closure
of a particular attribute contains all the attributes, then that attribute or set of attributes is a key.
M: A, B, C (every attribute appears on both sides)
Find the closures:
A+ = {A, B, C};  B+ = {B, C, A};  C+ = {C, A, B}.
Therefore each of {A}, {B}, and {C} is a candidate key for the given relation.
5.2.2. Minimal Covers
A minimal cover is used to eliminate redundant functional dependencies and extraneous left-hand-side
attributes; it is a way of revising FDs so that they are more efficient, and it is worth doing in very complicated
databases. Its aim is to find a minimal set of dependencies: a set of dependencies in a standard or canonical
form, with no redundancies.
Steps to find a minimal cover:
1. First, make every right-hand side a singleton.
2. Remove any extraneous (undesired) left-hand-side attributes.
3. Remove any redundant functional dependencies.
Example: if a relation R(A, B, C, D, E) and the functional dependencies F = {A→D, BC→AD, C→B, E→A, E→D}
are given, find the minimal cover of F.
- Make every RHS a singleton:
F = {A→D, BC→A, BC→D, C→B, E→A, E→D}
- Find and remove any extraneous attributes from the LHS.
First let's try BC→A: to eliminate either B or C from the LHS we should find B+ and C+; if, for example,
B+ contains C, then we can eliminate C from the LHS, and otherwise we cannot. The same test applies for
eliminating B from the LHS.
Now find B+ = {B}, so we cannot eliminate C from the LHS; then find C+ = {C, B, A, D}.
- Since the closure of C (C+) contains B, we can eliminate B.
Therefore the FD BC→A becomes C→A; make this change immediately in the set of functional
dependencies.
Next, consider the FD BC→D: find B+ = {B}, so we cannot eliminate C, and C+ = {C, A, B, D}.
- Since C+ contains B, we can eliminate B; therefore the FD BC→D becomes C→D.
- After the extraneous attributes are removed, the given set of FDs becomes
F = {A→D, C→A, C→D, C→B, E→A, E→D}
- Now find whether there are any redundant FDs; to do this, check each functional
dependency in turn, computing the closure without using the FD being checked.
- Check A→D first: to determine whether this FD is redundant, find A+; if A+ contains D, then we can get
rid of A→D.
Find A+ = {A} – there is no other way to derive D, so keep this FD.
- Check C→A: find C+ = {C, D, B} – it does not include A, so keep this FD.
- Check C→D: find C+ = {C, A, D, B} – since C+ contains D, we can eliminate the FD C→D.
- Check C→B: find C+ = {C, A, D} – it does not include B, so keep this FD.
- Check E→A: find E+ = {E, D} – it does not include A, so keep this FD.
- Check E→D: find E+ = {E, A, D} – since E+ contains D, we can eliminate the FD E→D.
Therefore the minimal cover of F = {A→D, BC→AD, C→B, E→A, E→D} is
F = {A→D, C→A, C→B, E→A}

5.3. Normalization and Normal Forms


Many problems that arise from redundancy can be addressed by replacing a relation with a collection of smaller
relations. The normalization process takes a relation schema through a series of tests to certify whether it satisfies a
certain normal form.
Normalization is the process of analyzing given relation schemas based on their FDs and primary keys to achieve
the desirable properties of minimizing redundancy and minimizing insertion, deletion, and update anomalies.
The normal form of a relation refers to the highest normal form condition that it meets.

5.3.1 First Normal Form (1NF)


It is now considered to be part of the formal definition of a relation in the basic relational model. It states that the
domain of an attribute must include only atomic (simple, indivisible) values, and that the value of any attribute in a
tuple must be a single value from the domain of that attribute.
- It was defined to disallow multi-valued and composite attributes.
Example: consider a relational schema Department(Dno, Dname, DmgrSSN, Dlocation); the attribute Dlocation may
have more than one value, therefore this relation is not in 1NF.

Therefore it should be normalized to 1NF; there are different techniques to do this:

1. Expand the key so that there is a separate tuple in the original Department relation for each location
of a department, with the primary key being the combination {Dno, Dlocation}; but this introduces redundancy.

(The relation is now in 1NF, but with redundancy.)

2. Remove the attribute Dlocation that violates 1NF and place it in a separate relation Dept_location along
with the primary key Dno of the department; the primary key of this new relation is the combination {Dno, Dlocation}.
(The relation is now in 1NF, and this is the best method to normalize a relation to 1NF; see the sketch below.)
3. If a maximum number of values is known for the attribute – for example, if our department locations are
at most three – replace the Dlocation attribute with three atomic attributes Dlocation1, Dlocation2, and
Dlocation3.

(This is also in 1NF, but it introduces null values.)
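A minimal SQL sketch of the second (preferred) technique, with assumed column types:

-- The multi-valued Dlocation is removed from DEPARTMENT...
CREATE TABLE DEPARTMENT (
  Dno INT PRIMARY KEY,
  Dname VARCHAR(30),
  DmgrSSN CHAR(9));

-- ...and placed in its own relation: one row per (department, location) pair.
CREATE TABLE DEPT_LOCATION (
  Dno INT REFERENCES DEPARTMENT (Dno),
  Dlocation VARCHAR(30),
  PRIMARY KEY (Dno, Dlocation));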

5.3.2. Second Normal Form (2NF)


It is based on full functional dependency. A functional dependency X → Y is a full FD if removal of any
attribute A from X means that the dependency does not hold any more. For example, consider the FD
{SSN, Pnumber} → Hours: it is a full dependency (neither SSN → Hours nor Pnumber → Hours holds). A
functional dependency X → Y is a partial dependency if some attribute A ∈ X can be removed from X and
the dependency still holds. For example, {SSN, Pnumber} → Ename is a partial
dependency, because SSN → Ename holds. A relation schema R is in second normal form if every
nonprime attribute A in R is fully functionally dependent on the primary key of R. For example, the schema
Emp_Proj(SSN, Pnumber, Hours, Ename, Pname, Plocation) shown below is not in second normal form.

Emp_Proj
It is not in second normal form and can be normalized into a number of 2NF relations in which nonprime
attributes are associated only with the part of the primary key on which they are fully functionally dependent.
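Following the FDs given earlier (SSN → Ename; Pnumber → Pname, Plocation; {SSN, Pnumber} → Hours), a minimal sketch of the resulting 2NF decomposition (column types are assumptions):

-- EP1 keeps the attribute that depends on the whole key.
CREATE TABLE EP1 (
  SSN CHAR(9),
  Pnumber INT,
  Hours DECIMAL(4,1),
  PRIMARY KEY (SSN, Pnumber));

-- EP2 keeps the attribute fully dependent on SSN alone.
CREATE TABLE EP2 (
  SSN CHAR(9) PRIMARY KEY,
  Ename VARCHAR(30));

-- EP3 keeps the attributes fully dependent on Pnumber alone.
CREATE TABLE EP3 (
  Pnumber INT PRIMARY KEY,
  Pname VARCHAR(30),
  Plocation VARCHAR(30));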

5.3.3. Third Normal Form (3NF)


It is based on the concept of transitive dependency. A functional dependency X → Y in a relation schema R is a
transitive dependency if there is a set of attributes Z that is neither a candidate key nor a subset of any key of
R, and both X → Z and Z → Y hold. A relation schema R is in 3NF if it satisfies 2NF and no nonprime attribute
of R is transitively dependent on the primary key.

Consider the schema Emp_Dept(Ename, SSN, Bdate, Address, Dnumber, Dname, DmgrSSN): it is in 2NF, since no
partial dependency on a key exists, but it is not in 3NF because of the transitive
dependency of DmgrSSN and Dname on SSN via Dnumber. SSN → DmgrSSN is transitive because both
SSN → Dnumber and Dnumber → DmgrSSN hold, and Dnumber is neither a key nor a subset of the key.

Generally, we want to design our relation schemas so that they have neither partial nor transitive dependencies,
because these cause the update anomaly problems.
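A minimal sketch of the corresponding 3NF decomposition, moving the transitively dependent department attributes into their own relation (column types are assumptions):

-- ED2 holds the facts determined by Dnumber.
CREATE TABLE ED2 (
  Dnumber INT PRIMARY KEY,
  Dname VARCHAR(30),
  DmgrSSN CHAR(9));

-- ED1 holds the facts determined directly by SSN.
CREATE TABLE ED1 (
  SSN CHAR(9) PRIMARY KEY,
  Ename VARCHAR(30),
  Bdate DATE,
  Address VARCHAR(50),
  Dnumber INT REFERENCES ED2 (Dnumber));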
5.3.4. Boyce–Codd Normal Form (BCNF)
- Every relation in BCNF is also in 3NF.

- Boyce–Codd normal form states that a relation schema R is in BCNF if, whenever a functional dependency
X → A holds in R, then X is a super key of R.
PART TWO
ADVANCED DATABASE SYSTEM
Chapter six
Advanced SQL and Query optimization
6. Introduction
SQL (Structured Query Language) is both a data definition language (DDL) and a data manipulation language
(DML). It has facilities for defining views on the database, for specifying security and authorization, for
defining integrity constraints, and for specifying transaction controls.
Data Definition Language – the main SQL command for data definition is the CREATE statement, which can be
used to create schemas, tables, and domains, as well as other constructs such as views, assertions, and triggers.
CREATE TABLE Employee (
  Fname VARCHAR(15) NOT NULL,
  Lname VARCHAR(15) NOT NULL,
  SSN CHAR(9) NOT NULL,
  Bdate DATE,
  Sex CHAR,
  Dno INT NOT NULL,
  PRIMARY KEY (SSN),
  FOREIGN KEY (Dno) REFERENCES Department (Dnumber));
CREATE DOMAIN SSN_TYPE AS CHAR(9); after this you can use SSN_TYPE in place of CHAR(9).
Specifying Constraints
We can specify different constraints that will be applied on the attributes to have or not to have a certain
value, the following are some examples of attribute constraints:-
 NOT NULL – this specifies the attribute value should not be null.
 Default – this enables to give default value for that attribute if an explicit value is not provided.
 Check – used for restricting domain values
e.g. Dnumber INT NOT NULL CHECK (Dnumber>0 and Dnumber<21); this restricts Dnumber to integer
numbers between 1 and 20.
Check constraint can be used with the create statement to put restrictions on that particular domain, for
example
CREATE DOMAIN D_NUM AS INTEGER CHECK (D_NUM>0 AND D_NUM<21); this means that wherever you use D_NUM it will be an integer type that always contains numbers between 1 and 20.
- Primary Key – used to identify each record uniquely.
- Foreign Key – used to create a relationship between two or more relations. While specifying this constraint you can use commands that preserve referential integrity when a tuple in the referenced relation is updated or deleted. These are ON DELETE and ON UPDATE:
ON DELETE SET DEFAULT – if the referenced value is deleted in the primary relation, set the default value.
ON UPDATE CASCADE – if the referenced value in the primary relation is updated or changed, propagate the same change to all relations that reference it.
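For example, a minimal sketch of both actions in one table definition (this assumes a Department table with key Dnumber already exists, and assumes department 1 as the default):
CREATE TABLE Employee (
    SSN CHAR(9) NOT NULL,
    Dno INT NOT NULL DEFAULT 1,
    PRIMARY KEY (SSN),
    FOREIGN KEY (Dno) REFERENCES Department (Dnumber)
        ON DELETE SET DEFAULT
        ON UPDATE CASCADE);
With this definition, deleting a department resets the Dno of its employees to the default department 1, while renumbering a department automatically propagates the new Dnumber to all of its employees.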
The SELECT – FROM- WHERE – SQL Statement
The general format for this statement is:
SELECT <Attribute List>
FROM <table List>
WHERE <Condition>
For example,
SELECT Bdate, Address
FROM Employee
WHERE Fname = 'John' AND Lname = 'Smith'
This requires only Employee relation listed in the From clause and selects the Employee tuples that satisfy the
condition, then projects the results on Bdate and address attributes.
In relational Algebra this can be rewritten as
π<Bdate, Address>(σ<Fname='John' ∧ Lname='Smith'>(Employee))
- A query that uses a single relation is similar to a SELECT-PROJECT pair of relational algebra operations.
SELECT Fname, Lname, Address
From Employee, Department
Where Dname = 'Research' and Dnumber = Dno
- This is similar to a SELECT-PROJECT-JOIN sequence of relational algebra.
The condition Dname='Research' is a selection condition and corresponds to a select operation in relational algebra.
The condition Dnumber=Dno is a join condition
Write a SELECT query to retrieve all employees in department 5 whose salary is between $15000 and
$20000.
SELECT * FROM Employee
WHERE (Salary BETWEEN 15000 AND 20000) AND Dno = 5;
The BETWEEN condition is equivalent to writing Salary >= 15000 AND Salary <= 20000.
Nested Queries
Some queries require that existing values in the database be fetched and then used in a comparison condition;
this can be done using nested queries.
For example, to retrieve the name of each employee who has a dependent with the same first name and same
sex as the employee.
SELECT E.Fname, E.Lname
FROM Employee AS E
WHERE E.SSN IN (SELECT ESSN
                FROM Dependent
                WHERE E.Fname = Dependent_Name
                AND E.Sex = Sex);
6.1. Data Manipulation Language
SQL statements used for data manipulation are the INSERT, UPDATE, and DELETE statements, and so on.
To delete an employee record whose last name is brown.
DELETE FROM Employee
WHERE Lname = 'Brown'
To change Project location and Dnum whose project number is 10.
UPDATE Project SET Plocation = 'Bellaire', Dnum = 5 WHERE Pnumber = 10;
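INSERT is used in a similar way; for example, to add a new tuple to the Employee table created earlier (the values shown are illustrative):
INSERT INTO Employee (Fname, Lname, SSN, Bdate, Sex, Dno)
VALUES ('Richard', 'Marini', '653298653', '1962-12-30', 'M', 4);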

6.2. Assertion
An assertion is used to specify general constraints using the create assertion statement. For example, to
specify the constraint that “the salary of an employee must not be greater than the salary of the manager of
the department that the employee works for”
CREATE assertion Salary_constraint
CHECK (NOT Exists
( SELECT *
From Employee E, Employee M, Department D
Where E.salary>M.salary And
E.Dno=D.Dnumber And
D.MgrSSN=M.SSN));

6.3. Views
It is a virtual table that is derived from other tables. A view is obtained from previously defined tables called
base tables whose tuples are physically stored in the database. This may limit the possible update operations
that can be applied on views, but not on querying a view. It is a way of specifying a table that we need to
reference frequently. It is specified using CREATE view statement
CREATE VIEW Works_ON1
AS SELECT Fname, Lname, Pname, Hours
FROM Employee, Project, Works_On
WHERE SSN = ESSN AND Pno = Pnumber
We can create a view by specifying new attribute names
CREATE view Dept-Info(DeptName, NumofEmp, Total_Sal)
As SELECT Dname, Count(*),Sum(Salary)
From Department, Employee
Where Dnumber=Dno
Group by Dname
We can specify SQL queries on a view in the same way we specify queries involving base tables.
SELECT Fname, Lname
FROM Works_ON1
WHERE Pname = 'ProjectX'
This query retrieves the first and last names of all employees who work on ProjectX.
The main advantage of views is to simplify the specification of certain queries. A view is kept up to date: when we modify tuples in the base tables, the view reflects those changes. A view can also be dropped when it is no longer needed, using the DROP statement.
DROP view Works_ON1
Views can also be used as a security and authorization mechanism by:
 restricting a user to specific rows (e.g., to see only his/her own record) or columns (e.g., not to see the salary column) of a table;
 presenting aggregate information instead of supplying details (e.g., the sum of a particular column).
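For instance, a minimal sketch of a view used this way (Emp_Public is a hypothetical name; it exposes everything except salary, assuming Employee carries a Salary attribute as in the earlier queries):
CREATE VIEW Emp_Public
AS SELECT Fname, Lname, Bdate, Dno
FROM Employee
A user who is granted access to Emp_Public but not to Employee can query employee details without ever seeing salaries.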

6.4. Stored Procedures


It is prepared SQL code that can be saved and reused over and over again: a group of SQL statements compiled into a single execution plan. The SQL statements needed to do a commonly performed task can be designed, coded, and tested once in a stored procedure; each application needing to perform that task can then simply execute the stored procedure.
Many tasks can be implemented as a series of SQL statements. If these SQL statements are written into a stored procedure, they become part of a single execution plan. Stored procedures can also shield users from needing to know the details of the tables in the database. SQL Server includes a set of system stored procedures whose names usually start with SP_.
Example: to change name of the database you can write
Execute SP_renameDB ‘database name’, ‘new name’
To create user defined stored procedures you can use the create procedure statement
CREATE PROCEDURE <owner>.<procedure name> <param> <data type>
AS
<body>
Example: a procedure that filters the Users table by an optional login value (the CREATE PROCEDURE header with its @login parameter is implied by the body and reconstructed here):
CREATE PROCEDURE Get_User @login VARCHAR(30) = NULL
AS
SELECT * FROM [Users]
WHERE ISNULL(@login, login) = login
You can write a stored procedure to select all information from the Employee table, for example, and use it whenever you want, as follows.
CREATE procedure Employee_details
AS
Select * from Employee
To run this just write EXECUTE Employee_details
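Stored procedures can also take parameters, following the <param> <data type> pattern shown above. A small sketch (Employees_By_Dept is a hypothetical name):
CREATE PROCEDURE Employees_By_Dept @dno INT
AS
SELECT * FROM Employee
WHERE Dno = @dno
To run it for department 5, just write EXECUTE Employees_By_Dept 5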

6.5. SQL queries and Relational Algebra


6.5.1. Relational Algebra
It enables a user to specify basic retrieval requests. A sequence of relational algebra operations forms a
relational algebra expression. It provides a formal foundation for relational model operations. It is used as a
basis for implementing and optimizing queries in relational database management system. Its operations can
be divided into two groups:
1. Set operations: UNION, INTERSECTION, SET DIFFERENCE, and CARTESIAN PRODUCT.
2. Operations developed specifically for relational databases: SELECT, PROJECT, and JOIN.
1. SELECT operation
A SELECT operation can be visualized as a horizontal partitioning of the relation into two sets of tuples: those that satisfy the condition and those that do not.

Example: σ<Dno=4>(Employee) – to select employees whose department is 4.

The general form is σ<Selection condition>(R)

σ (sigma) is used to denote the select operator; the selection condition is a Boolean expression.
Select operation is commutative

σ<cond1>( σ<cond2>(R)) = σ<cond2>( σ<cond1>(R))


2 Project operation
If we are interested only in certain attributes, we use the project operation to project the relation over these attributes only. It can be visualized as a vertical partitioning of the relation.
Example, to list each employee’s first and last name and salary, we can use project operation
πLname, Fname, Salary (Employee)
The general form of the project operation is π<Attribute list> (R)
Where π (pi) is used to represent the project operation. If the attribute list includes only non-key attributes of R, duplicate tuples are likely to occur; the project operation removes any duplicate tuples.
Project operation is not commutative.
- We can apply several relational algebra operations one after the other.
πLname,Fname,Salary (σDno=5 (Employee))
3 JOIN
The JOIN operation, denoted by ⋈, is used to combine related tuples from two relations into single tuples. It allows us to process relationships among relations. For example, to get the name of the manager of each department, we need to combine each department tuple with the employee tuple whose SSN value matches the MgrSSN value in the department tuple.
The general form is R ⋈<join condition> S – it can be combined with other operations:
π<Dname, Lname, Fname>(Department ⋈<MgrSSN=SSN> Employee)
There are different kinds of joins these are:
INNER JOIN
Used to combine data from multiple relations so that related information can be presented in a single table,
only matching records are kept in the result
OUTER JOIN
Can be used if we want to keep all the tuples in R, or all those in S if we have two relations R and S, or all
those in both relations regardless of whether or not they have matching tuples in the other relation. It can be
Left outer join, Right outer join, or full outer join
LEFT OUTER JOIN – to retrieve a list of all employee names and the name of the departments they
manage, if they do not manage it will be indicated by null. It keeps every tuple in the first or left relation. If
no matching tuple is found in S, these attributes are filled with Null values in the result.
RIGHT OUTER JOIN- keeps every tuple in the second or right relation. If no matching tuples found these
attribute values will be filled with Null values.
FULL OUTER JOIN – keeps all tuples in both the left and right relation. If no matching tuples are found it
will put Null values.
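In SQL, the LEFT OUTER JOIN example above can be sketched as follows (using the Employee and Department schemas from earlier; employees who manage no department get NULL for Dname):
SELECT E.Fname, E.Lname, D.Dname
FROM Employee E LEFT OUTER JOIN Department D
     ON E.SSN = D.MgrSSN;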

6.5.2. Translating SQL queries to Relational Algebra


Since relational algebra is somewhat lower-level than SQL, it gives us a mathematical foundation for analyzing and optimizing SQL queries.
1. For example, consider the following subset of the engineering database schema

EMP(ENO, ENAME, TITLE)


ASG(ENO, PNO, RESP, DUR)
“Find the names of employees who are managing a project”
SELECT ENAME
FROM EMP,ASG
WHERE EMP.ENO = ASG.ENO
AND RESP = 'Manager'
Now, to translate this SQL query to relational algebra (intermediate results may be stored in temporary relations), two equivalent expressions are:

π<ENAME>(σ<RESP='Manager' ∧ EMP.ENO=ASG.ENO>(EMP × ASG))

and

π<ENAME>(EMP ⋈<ENO> (σ<RESP='Manager'>(ASG)))

6.6. Query Processing and Optimization


6.6.1. Query Processing
The aim of query processing is to find information in one or more databases and deliver it to the user quickly
and efficiently. Traditional techniques work well for databases with standard, single-site relational structures,
but databases containing more complex and diverse types of data demand new query processing and
optimization techniques.
Query Processing can be divided into four main phases:
1. Decomposition:
2. Optimization
3. Code generation, and
4. Execution

Query Decomposition
Query decomposition is the process of transforming a high-level query into a relational algebra query and checking that the query is syntactically and semantically correct. Query decomposition consists of parsing and validation.
Typical stages in query decomposition are:
1. Analysis: lexical and syntactic analysis of the query (correctness). A query tree is built for the query, containing leaf nodes for the base relations, one or more non-leaf nodes for relations produced by relational algebra operations, and a root node for the result of the query. The sequence of operations runs from the leaves to the root.
2. Normalization: convert the query into a normalized form. The WHERE predicate is converted to Conjunctive (∧) or Disjunctive (∨) Normal Form.
3. Semantic Analysis: to reject normalized queries that are not correctly formulated or are contradictory. A query is incorrect if components do not contribute to generating the result; it is contradictory if its predicate cannot be satisfied by any tuple. Algorithms: relation connection graph and normalized attribute connection graph.
4. Simplification: to detect redundant qualifications, eliminate common sub-expressions, and transform the query into a semantically equivalent but more easily and effectively computed form.
5. Query Restructuring: more than one translation is possible; transformation rules are used to restructure the query.
Most real-world data is not well structured. Today's databases typically contain much non-structured data
such as text, images, video, and audio, often distributed across computer networks. In this complex
environment, efficient and accurate query processing becomes quite challenging. There could be tons of tricks (not only in storage and query processing, but also in concurrency control, recovery, etc.); different tricks may work better in different usage scenarios, and the same tricks get used over and over again in different applications.
Query processing executes transactions on behalf of the query and returns the result. Steps in query processing:
Example: SELECT Customer_Name
FROM Customer, Invoice
WHERE Region = 'Kansas City' AND Amount > 1000
Table 8 query processing
Query Optimization
What is wrong with the ordinary query? Everyone wants the performance of their database to be optimal; in particular, there is often a requirement for a specific query, or a query-based object, to run faster.
The problem of query optimization is to find the sequence of steps that produces the answer to the user's request in the most efficient manner, given the database structure.
The performance of a query is affected by the tables or queries that underlie the query and by the complexity of the query.
When data/workload characteristics change:
 the best navigation strategy changes,
 the best way of organizing the data changes.
Query optimizers are one of the main means by which modern database systems achieve their performance
advantages. Given a request for data manipulation or retrieval, an optimizer will choose an optimal plan for
evaluating the request from among the manifold alternative strategies. i.e. there are many ways (access
paths) for accessing desired file/record. The optimizer tries to select the most efficient (cheapest) access path
for accessing the data. DBMS is responsible to pick the best execution strategy based on various
considerations.
Query optimizers are already among the largest and most complex modules of database systems. The most efficient processing uses the least amount of I/O and CPU resources.
Selection of the best method: In a non-procedural language, the system does the optimization at the time of
execution. On the other hand, in a procedural language, programmers have some flexibility in selecting the
best method. For optimizing the execution of a query the programmer must know:
File organization,
Record access mechanism and primary or secondary key,
Data location on disk,
Data access limitations
To write correct code, application programmers need to know how data is organized physically (e.g., which
indexes exist), to write efficient code, application programmers also need to worry about data/workload
characteristics
One has to cope with change!
Example: Consider relations r(AB) and s(CD). We require r X s.
Method 1
a. Load the next record of r into RAM.
b. Load all records of s, one at a time, and concatenate each with r.
c. All records of r concatenated?
NO: go to a.
YES: exit (the result is in RAM or on disk).
Performance: Too many accesses.
Method 2: Improvement
a. Load as many blocks of r as possible, leaving room for one block of s.
b. Run through the s file completely, one block at a time.
Performance: reduces the number of times s blocks are loaded by a factor equal to the number of r records that can fit in main memory.
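A worked example with assumed sizes: suppose r has 10,000 records stored in 1,000 blocks (10 records per block), s occupies 500 blocks, and main memory can hold 100 blocks of r at a time. Method 1 scans s once per r record: 10,000 × 500 = 5,000,000 loads of s blocks. Method 2 scans s once per memory-load of r: ceil(1,000/100) = 10 scans, i.e. 10 × 500 = 5,000 loads of s blocks – a reduction by a factor of 1,000, the number of r records (100 blocks × 10 records each) that fit in main memory.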
Considerations during query Optimization:
Narrow down intermediate result sets quickly. SELECT before JOIN
Use access structures (indexes).

6.6.2. Approaches to Query Optimization


Heuristics Approach
 The heuristic approach uses the knowledge of the characteristics of the relational algebra operations and
the relationship between the operators to optimize the query.
Thus the heuristic approach of optimization will make use of:
 Properties of individual operators
 Association between operators
Query Tree: a graphical representation of the operators, relations, attributes and predicates and processing
sequence during query processing.
Query tree is composed of three main parts:
i. The Leaves: the base relations used for processing the query/extracting the required information
ii. The Root: the final result/relation as an output based on the operation on the relations used for query
processing
iii. Nodes: intermediate results or relations before reaching the final result.
Sequence of execution of operation in a query tree will start from the leaves and continues to the intermediate
nodes and ends at the root.
The properties of each operation and the associations between operators are analyzed using a set of rules called TRANSFORMATION RULES. Use of the transformation rules transforms the query into a relatively good execution strategy.

6.6.3. Transformation Rules for Relational Algebra


1. Cascade of SELECTION: conjunctive SELECTION operations can cascade into individual SELECTION operations and vice versa:

σ<c1 ∧ c2 ∧ c3>(R) = σ<c1>(σ<c2>(σ<c3>(R))), where each ci is a predicate.

2. Commutativity of SELECTION operations:

σ<c1>(σ<c2>(R)) = σ<c2>(σ<c1>(R)), where each ci is a predicate.

3. Cascade of PROJECTION: in a sequence of PROJECTION operations, only the last in the sequence is required:

π<L1>(π<L2>(π<L3>(π<L4>(R)))) = π<L1>(R)

4. Commutativity of SELECTION with PROJECTION and vice versa:

a. If the predicate c1 involves only the attributes in the projection list L1, then the selection and projection operations commute:

π<L1>(σ<c1>(R)) = σ<c1>(π<L1>(R))

5. Commutativity of THETA JOIN/Cartesian product:

R × S is equivalent to S × R. This also holds for Equi-Join and Natural-Join:

R ⋈<c> S = S ⋈<c> R

6. Commutativity of SELECTION with THETA JOIN:

a. If the predicate c1 involves only attributes of one of the relations (R) being joined, then the selection and join operations commute:

σ<c1>(R ⋈<c> S) = (σ<c1>(R)) ⋈<c> S

b. If the predicate is of the form c1 ∧ c2, where c1 involves only attributes of R and c2 involves only attributes of S, then the selection and theta join operations commute:

σ<c1 ∧ c2>(R ⋈<c> S) = (σ<c1>(R)) ⋈<c> (σ<c2>(S))

7. Commutativity of PROJECTION with THETA JOIN:

If the projection list is of the form L1, L2, where L1 involves only attributes of R and L2 involves only attributes of S being joined, and the join predicate c involves only attributes in the projection list, then the PROJECTION and JOIN operations commute:

π<L1, L2>(R ⋈<c> S) = (π<L1>(R)) ⋈<c> (π<L2>(S))

8. Commutativity of the set operations UNION and INTERSECTION, but not SET DIFFERENCE:

R ∪ S = S ∪ R and R ∩ S = S ∩ R

9. Associativity of THETA JOIN, CARTESIAN PRODUCT, UNION and INTERSECTION:

(R θ S) θ T = R θ (S θ T), where θ is one of these four operations.

10. Commuting SELECTION with SET OPERATIONS:

σ<c>(R θ S) = (σ<c>(R)) θ (σ<c>(S)), where θ is one of the set operations (∪, ∩, −).

11. Commuting PROJECTION with UNION:

π<L1>(R ∪ S) = (π<L1>(R)) ∪ (π<L1>(S))


The heuristic approach is implemented by applying the above transformation rules in the following sequence of steps.

6.6.4. Sequence for Applying Transformation Rules


1. Use
Rule-1: Cascade of SELECTION.
2. Use
Rule-2: Commutativity of SELECTION
Rule-4: Commuting SELECTION with PROJECTION
Rule-6: Commuting SELECTION with JOIN and CARTESIAN PRODUCT
Rule-10: Commuting SELECTION with SET OPERATIONS
3. Use
Rule-9: Associativity of the binary operations (JOIN, CARTESIAN PRODUCT, UNION and INTERSECTION). Rearrange nodes by making the most restrictive operations execute first (moving them as far down the tree as possible).
4. Combine each CARTESIAN PRODUCT with its subsequent SELECTION operation (forming a JOIN).
5. Use
Rule-3: Cascade of PROJECTION
Rule-4: Commuting PROJECTION with SELECTION
Rule-7: Commuting PROJECTION with JOIN and CARTESIAN PRODUCT
Rule-11: Commuting PROJECTION with UNION
Main Heuristic
The main heuristic is to first apply operations that reduce the size (the cardinality and/or the degree) of the
intermediate relation. That is:
a. Perform SELECTION as early as possible: that will reduce the cardinality (number of tuples) of the
relation.
b. Perform PROJECTION as early as possible: that will reduce the degree (number of attributes) of the
relation.
Both a and b will be accomplished by placing the SELECT and PROJECT operations as far down the tree
as possible.
c. SELECT and JOIN operations with the most restrictive conditions, i.e., those resulting in relations with the smallest absolute size, should be executed before other similar operations. This is achieved by reordering the JOIN nodes in the tree.
Example: consider the following schemas and the query, where the EMPLOYEE and the PROJECT relations
are related by the WORKS_ON relation.
EMPLOYEE (EEmpID, FName, LName, Salary, Dept, Sex, DoB)
PROJECT (PProjID, PName, PLocation, PFund, PManagerID)
WORKS_ON (WEmpID, WProjID)
WEmpID and WProjID are foreign keys in the WORKS_ON relation referencing the EMPLOYEE and PROJECT relations respectively.
Query: the manager of the company working on road construction would like to view the names of employees born before January 1, 1965 who are working on the project named Ring Road.
Relational Algebra representation of the query will be:

π<FName, LName>(σ<DoB < Jan 1 1965 ∧ WEmpID=EEmpID ∧ PProjID=WProjID ∧ PName='Ring Road'>(EMPLOYEE × WORKS_ON × PROJECT))
The SQL equivalence for the above query will be:
SELECT FName, LName
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE DoB < 'Jan 1 1965' AND EEmpID = WEmpID AND WProjID = PProjID AND PName = 'Ring Road'
The initial query tree will be:
By applying the first step (cascading the selection) we will come up with the following structure.

π<FName, LName>(σ<DoB < Jan 1 1965>(σ<WEmpID=EEmpID>(σ<PProjID=WProjID>(σ<PName='Ring Road'>(EMPLOYEE × WORKS_ON × PROJECT)))))
By applying the second step, it can be seen that some conditions involve attributes that belong to a single relation (DoB belongs to EMPLOYEE and PName belongs to PROJECT); thus those selection operations can be commuted with the Cartesian product. Then, since the condition WEmpID=EEmpID involves attributes of the EMPLOYEE and WORKS_ON relations only, the selection with this condition can also be pushed down:

π<FName, LName>(σ<PProjID=WProjID>((σ<PName='Ring Road'>(PROJECT)) × (σ<WEmpID=EEmpID>(WORKS_ON × (σ<DoB < Jan 1 1965>(EMPLOYEE))))))


The query tree after this modification will be:
Using the third step, perform the most restrictive operations first. From the query given, we can see that the selection on PROJECT is more restrictive than the selection on EMPLOYEE. Thus, it is better to perform the selection on PROJECT before the selection on EMPLOYEE; rearrange the nodes to achieve this.
Using the fourth step, combine each Cartesian product with its subsequent selection operation into a join.

Using the fifth step, perform the projections as early as possible.


Cost Estimation Approach to Query Optimization
The main idea is to minimize the cost of processing a query. The cost function comprises:
I/O cost + CPU processing cost + communication cost + storage cost
These components may have different weights in different processing environments. The DBMS uses information stored in the system catalogue for the purpose of estimating cost. The main target of query optimization is to minimize the size of the intermediate relations, since that size affects the cost of:
 disk access,
 data transfer,
 storage space in primary memory,
 writing on disk.
The statistics in the system catalogue used for cost estimation purpose are:
Cardinality of a relation: the number of tuples contained in a relation currently (r)
Degree of a relation: number of attributes of a relation
Number of tuples on a relation that can be stored in one block of memory
Total number of blocks used by a relation
Number of distinct values of an attribute (d)
Selection cardinality of an attribute (S): the average number of records that will satisfy an equality condition on the attribute, S = r/d
By using the above information, one can calculate the cost of executing a query under each candidate strategy and select the best one, i.e., the strategy with the minimum processing cost.
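A small worked example (statistics assumed): if an EMPLOYEE relation currently holds r = 10,000 tuples and its Dept attribute has d = 50 distinct values, then the selection cardinality is S = r/d = 10,000/50 = 200; that is, an equality condition such as Dept = 'Sales' is estimated to return about 200 tuples.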
Cost Components for Query Optimization
The costs of query execution can be calculated for the following major process we have during
processing.
1. Access Cost of Secondary Storage
Data is accessed from secondary storage, as a query will need some part of the data stored in the database. The disk access cost can again be analyzed in terms of:
Searching
Reading, and
Writing, data blocks used to store some portion of a relation.
The disk access cost will vary depending on the file organization used and the access method
implemented for the file organization. In addition to the file organization, the data allocation
scheme, whether the data is stored contiguously or in scattered manner, will affect the disk
access cost.
2. Storage Cost
While processing a query, as any query would be composed of many database operations, there
could be one or more intermediate results before reaching the final output. These intermediate
results should be stored in primary memory for further processing. The bigger the intermediate
relation, the larger the memory requirement, which will have impact on the limited available
space. This will be considered as a cost of storage.
3. Computation Cost
A query is composed of many operations. The operations could be database operations like reading from and writing to disk, or mathematical and other operations like searching, sorting, merging, and computation on field values.
4. Communication Cost
In most database systems the database resides in one station and various queries originate from
different terminals. This has an impact on the performance of the system, adding cost to query processing. Thus, the cost of transporting data between the database site and the terminal from which the query originated should be analyzed.
Pipelining
Pipelining is another method used for query optimization. It is sometimes referred to as on-the-fly processing of queries. While query optimization tries to reduce the size of the intermediate result, pipelining reduces the number of intermediate relations by applying several conditions to a single intermediate result in one continuous pass. Pipelining performs multiple operations on a single relation in a pipeline.
Ex:
Let’s say we have a relation on employee with the following schema
Employee (ID, FName, LName, DoB, Salary, Position, Dept)
If a query would like to extract supervisors with salary greater than 2000, the relational algebra
representation of the query will be

σ<Salary > 2000 ∧ Position='Supervisor'>(Employee)


After reading the relation from memory, the system could perform the operation by cascading the SELECT operation.
1. Approach One

σ<Salary > 2000>(σ<Position='Supervisor'>(Employee))

Using this approach we will have the following relations:
 the base relation Employee;
 the relation created by the inner operation: R1 = σ<Position='Supervisor'>(Employee);
 the resulting relation of the outer operation: R2 = σ<Salary > 2000>(R1).
2. Approach Two
One can select a single tuple from the relation Employee, perform both tests on it in a pipeline, and build the final relation at once, without materializing R1. This is what is called PIPELINING.
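In SQL terms, both approaches compute the same result as the single query below; pipelining corresponds to evaluating both tests on each tuple in one pass rather than materializing R1 first:
SELECT * FROM Employee
WHERE Salary > 2000 AND Position = 'Supervisor';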
UNIT - Seven
Concurrency Control and Transactions Processing
What is a transaction?
A transaction is a mechanism for applying desired modifications/operations to a database. It is evident in real life that the final database instance after a successful manipulation of the content of the database is the most up-to-date copy of the database. A transaction is an action, or series of actions, carried out by a single user or application program, which accesses or changes the contents of the database (i.e., a logical unit of work on the database). A transaction could be a whole program, a part/module of a program, or a single command. Changes made in real time to a database are called transactions. Examples include ATM transactions, credit card approvals, flight reservations, hotel check-in, phone calls, supermarket scanning, academic registration and billing. A transaction could be composed of one or more database and non-database operations. A transaction transforms the database from one consistent state to another, although consistency may be violated during the transaction.
A database transaction is a unit of interaction with database management system or similar
system that is treated in a coherent and reliable way independent of other transactions.

7.1. Transaction processing system


A system that manages transactions and controls their access to a DBMS is called a TP monitor.
A transaction processing system (TPS) generally consists of a TP monitor, one or more DBMSs,
and a set of application programs containing transaction.
In database field, a transaction is a group of logical operations that must all succeed or fail as a
group. Systems dedicated to supporting such operations are known as transaction processing
systems.
In comparison with a database transaction, an application program is a series of transactions with non-database processing in between.
Distinction between business transactions and online transactions:
A business transaction is an interaction in the real world, usually between an enterprise and a
person, where something is exchanged. An online transaction is the execution of a program that
performs an administrative or real-time function, often by accessing shared data sources, usually
on behalf of an online user (although some transactions are run offline in batch).
What we are interested about is the online transaction, which is the interaction between the users
of a database system and the shared data stored in the database. This transaction program
contains the steps involved in the business transaction.
Transactions can be started, attempted, then committed or aborted via data manipulation
commands of SQL.
A transaction can have one of two outcomes:
Success – the transaction commits and the database reaches a new consistent state. A committed transaction cannot be aborted or rolled back (a committed transaction cannot be discarded; its effects can only be reversed by running a new, compensating transaction).
Failure – the transaction aborts, and the database must be restored to the consistent state it was in before the transaction started. Such a transaction is rolled back or undone. An aborted transaction that has been rolled back can be restarted later. In many data manipulation languages, there are keywords used to indicate the different states of a transaction.
A single transaction might require several queries, each reading and/or writing information in the
database. When this happens it is usually important to be sure that the database is not left with
only some of the queries carried out. For example, when doing a money transfer, if the money
was debited from one account, it is important that it also be credited to the depositing account.
Also, transactions should not interfere with each other.
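For example, a sketch of such a transfer as one transaction (the Account table and account numbers are illustrative; the syntax follows the SQL Server style used earlier):
BEGIN TRANSACTION
UPDATE Account SET Balance = Balance - 100 WHERE AccNo = 'A-101'
UPDATE Account SET Balance = Balance + 100 WHERE AccNo = 'B-202'
COMMIT TRANSACTION
If either UPDATE fails, a ROLLBACK undoes the partial work so that neither account is left half-updated.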
A transaction is expected to exhibit some basic features or properties to be considered as a valid
transaction. These features are:
A: Atomicity
C: Consistency
I: Isolation
D: Durability
It is referred as ACID property of transaction. Without the ACID property, the integrity of the
database cannot be guaranteed.
Atomicity
It is the all-or-none property.
Every transaction should be considered as an atomic process, which cannot be subdivided into smaller tasks. Due to this property, just like an atom, which exists or does not exist, a transaction has only two states: Done or Never Started.
Done - a transaction must complete successfully and its effect should be visible in the database.
Never Started - If a transaction fails during execution then all its modifications must be undone
to bring back the database to the last consistent state, i.e., remove the effect of failed transaction.
No state between Done and Never Started
Consistency
If the transaction code is correct then a transaction, at the end of its execution, must leave the
database consistent. A transaction should transform a database from one previous consistent state
to another consistent state.
Isolation
A transaction must execute without interference from other concurrent transactions and its
intermediate or partial modifications to data must not be visible to other transactions.
Durability
The effect of a completed transaction must persist in the database, i.e., its updates must be available to other transactions immediately after the end of its execution, and they should not be affected by failures occurring after the completion of the transaction.
In practice, these properties are often relaxed somewhat to provide better performance.

7.1.1. State of a Transaction


A transaction is an atomic operation from the users' perspective, but it consists of a collection of operations and can pass through a number of states during its execution.
A transaction can end in three possible ways.
1. Successful Termination: when a transaction completes the execution of all operations in it and
reaches the COMMIT command.
2. Suicidal Termination: when the transaction detects an error during its processing and decides to abort itself before the end of the transaction, performing a ROLLBACK.
3. Murderous Termination: When the DBMS or the system force the execution to abort for any
reason.
Table 9 states of transaction
Most SQL statements seem very short and easy to execute, but the reverse is true if you consider each as a one-command transaction. Actually, a database system interprets a transaction not as an application program but as a logical sequence of low-level read and write operations (referred to as primitives).

7.1.2. Ways of Transaction Execution


In a database system, many transactions are executed. Basically there are two ways of executing
a set of transactions:
(a) Serially (Serial Execution): in a serial execution, transactions are executed strictly one after another. Transaction Ti completes and writes its results to the database, and only then is the next transaction Tj scheduled for execution. This means that at any one time there is only one transaction being executed in the system, and data is not shared between transactions at any specific time.
In serial transaction execution, one transaction does not interfere with the execution of any other transaction.
Good things about serial execution: correct execution (if the input is correct then the output will be correct) and fast execution of the active transaction, since all resources are available to it.
The worst thing about serial execution is very inefficient resource utilization.
Example: consider two transactions T1 and T2 (N is some increment):
T1: read(X); X := X+N; write(X); read(Y); Y := Y+N; write(Y)
T2: read(X); X := X+N; write(X)
We execute these transactions serially as follows:
Time   T1                       T2
       read(X)   {X = 10}
       X := X+N  {X = 11}
       write(X)  {X = 11}
       read(Y)   {Y = 6}
       Y := Y+N  {Y = 7}
       write(Y)  {Y = 7}
                                read(X)   {X = 11}
                                X := X+N  {X = 12}
                                write(X)  {X = 12}
Final values of X and Y at the end of T1 and T2: X = 12 and Y = 7.
Thus we can witness that in serial execution of transaction, if we have two transactions Ti and
Ti+1, then Ti+1 will only be executed after the completion of Ti.
(b) Concurrently: the reverse of serial execution; in this scheme the individual operations of the transactions, i.e., reads and writes, are interleaved in some order.

Time   T1                       T2
       read(X)   {X = 10}
                                read(X)   {X = 10}
       X := X+N  {X = 11}
                                X := X+N  {X = 11}
       write(X)  {X = 11}
                                write(X)  {X = 11}
       read(Y)   {Y = 6}
       Y := Y+N  {Y = 7}
       write(Y)  {Y = 7}
Final values at the end of T1 and T2: X = 11 and Y = 7. This improves resource utilization but unfortunately gives an incorrect result.
The correct value of X is 12, but in concurrent execution X = 11, which is incorrect. The reason for this error is the incorrect sharing of X by T1 and T2.
In serial execution, T2 read the value of X written by T1 (i.e., 11), but in concurrent execution T2 read the same value of X (i.e., 10) as T1 did, and the update made by T1 was overwritten by T2's update.
This is the reason the final value of X is one less than what is produced by serial execution.

7.2. Problems Associated with Concurrent Transaction Processing


Although two transactions may be correct in themselves, interleaving of operations may produce
an incorrect result, which needs control over access.
With concurrent transaction processing, one can enhance the throughput of the system. As reading and writing are performed on secondary storage, the system will not be idle during these operations if there is concurrent processing.
Every transaction may be correct by itself, but this does not guarantee that the interleaving of these transactions will produce a correct result.
The three potential problems caused by concurrency are:
 Lost Update Problem
 Uncommitted Dependency Problem
 Inconsistent Analysis Problem
1. Lost Update Problem
A successfully completed update on a data set by one transaction is overridden by another transaction/user.
E.g. an account with balance A=100:
 T1 reads the account A
 T1 withdraws 10 from A
 T1 makes the update in the database
 T2 reads the account A
 T2 adds 100 to A
 T2 makes the update in the database
In the above case, if the transactions run one after the other (serially) then we have no problem:
 if the execution is T1 followed by T2 then A=190,
 if the execution is T2 followed by T1 then A=190.
But if they start at the same time, in the following sequence:
 T1 reads the account A=100
 T1 withdraws 10, making the balance A=90
 T2 reads the account A=100
 T2 adds 100, making A=200
 T1 makes the update in the database: A=90
 T2 makes the update in the database: A=200

After the successful completion of the operations in this schedule, the final value of A will be 200, which overrides the update made by the first transaction that changed the value from 100 to 90: T1's update is lost.
2. Uncommitted Dependency Problem
It occurs when one transaction can see intermediate results of another transaction before it is
committed.
E.g.
T2 increases the balance by 100, making it 200, but then aborts before the change is committed. T1 reads 200, subtracts 10, and makes it 190; but the actual balance should be 90.
3. Inconsistent Analysis Problem
It occurs when transaction reads several values but second transaction updates some of them
during execution and before the completion of the first.
E.g.
T2 would like to add the values A=10, B=20 and C=30. After the values are read by T2 and before its completion, T1 updates the value of B to 50. At the end of the execution of the two transactions, T2 comes up with the sum 60, while it should be 90 since B was updated to 50.
As discussed above, the objective of Concurrency Control Protocol is to schedule transactions in
such a way as to avoid any interference between them. This demands a new principle in
transaction processing, which is serializability of the schedule of execution of multiple
transactions.

7.2.1. Concurrency Control Techniques


Concurrency Control is the process of managing simultaneous operations on the database
without having them interfere with one another. Prevents interference when two or more users
are accessing database simultaneously and at least one is updating data. Although two
transactions may be correct in themselves, interleaving of operations may produce an incorrect
result.
Three basic concurrency control techniques:
1. Locking methods
2. Time stamping
3. Optimistic
Locking and time stamping are pessimistic (conservative) approaches, since they delay transactions in case they conflict with other transactions. The optimistic approach assumes that conflict is rare, lets transactions proceed without delay, and only checks for conflicts at commit time.
1. Locking Method
A LOCK is a mechanism for enforcing limits on access to a resource in an environment where
there are many threads of execution. Locks are one way of enforcing concurrency control
policies. Transaction uses locks to deny access to other transactions and so prevent incorrect
updates.
Lock prevents another transaction from modifying item or even reading it, in the case of a write
lock.
Lock (X): If a transaction T1 applies Lock on data item X, then X is locked and it is not
available to any other transaction.
Unlock (X): T1 Unlocks X. X is available to other transactions.
Types of a Lock
Shared lock: A Read operation does not change the value of a data item. Hence a data item can
be read by two different transactions simultaneously under share lock mode. So only to read a
data item T1 will do: Share lock (X), then Read (X), and finally Unlock (X).
Exclusive lock: A write operation changes the value of the data item. Hence two write
operations from two different transactions or a write from T1 and a read from T2 are not
allowed. A data item can be modified only under Exclusive lock. To modify a data item T1 will
do: Exclusive lock (X), then Write (X) and finally Unlock (X).
When these locks are applied, a transaction must behave in a special way. This special behavior of a transaction is referred to as well-formed: a transaction is well-formed if it does not lock an already locked data item and does not try to unlock an unlocked data item.
Locking - Basic Rules
 If transaction has shared lock on item, can read but not update item.
 If transaction has exclusive lock on item, can both read and update item.
 Reads cannot conflict, so more than one transaction can hold shared locks simultaneously
on same item.
 Exclusive lock gives transaction exclusive access to that item.
 Some systems allow transaction to upgrade a shared lock to an exclusive lock, or vice-
versa.
Examples: T1 and T2 are two transactions executed under locking as follows. T1 locks A in exclusive mode. When T2 wants to lock A, it finds it locked by T1, so T2 waits for the Unlock on A by T1. When A is released, T2 locks A and begins execution.
Suppose a lock on a data item is applied, and the data item is processed and unlocked immediately after the reading/writing is completed, as follows. Initial values: A = 10 and B = 20.
Table 10: Locking technique

Serial execution of T1 and then T2:
T1: Lock(A); read(A) {A = 10}; A := A + 100; write(A) {A = 110}; Unlock(A);
    Lock(B); read(B) {B = 20}; B := B + 10; write(B) {B = 30}; Unlock(B)
T2: Lock(B); read(B) {B = 30}; B := B * 5; write(B) {B = 150}; Unlock(B);
    Lock(A); read(A) {A = 110}; A := A + 20; write(A) {A = 130}; Unlock(A)
Final result: A = 130, B = 150

Concurrent execution of T1 and T2:
T1: Lock(A); read(A) {A = 10}; A := A + 100; write(A) {A = 110}; Unlock(A)
T2: Lock(B); read(B) {B = 20}; B := B * 5; write(B) {B = 100}; Unlock(B)
T1: Lock(B); read(B) {B = 100}; B := B + 10; write(B) {B = 110}; Unlock(B)
T2: Lock(A); read(A) {A = 110}; A := A + 20; write(A) {A = 130}; Unlock(A)
Final result: A = 130, B = 110

The final result of the two transactions under the two types of transaction execution (serial and concurrent) is not the same. This indicates that the above method of locking and unlocking is not correct. Thus, to preserve consistency we have to use another approach to locking: the two-phase locking (2PL) scheme.
Two-Phase Locking (2PL): a transaction follows the 2PL protocol if all locking operations precede the first unlock operation in the transaction. The 2PL protocol demands that the locking and unlocking of a transaction proceed in two phases (see the sketch below):
 Growing phase - acquires all locks but cannot release any locks.
 Shrinking phase - releases locks but cannot acquire any new locks.
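For example, T1 from the table above can be rewritten to follow 2PL (a sketch; the operations are unchanged, only the placement of the lock and unlock calls moves):
T1 under 2PL:
Lock(A); Lock(B)                (growing phase: acquire all locks first)
read(A); A := A + 100; write(A)
read(B); B := B + 10; write(B)
Unlock(A); Unlock(B)            (shrinking phase: release only at the end)
Under 2PL, the interleaving shown in the concurrent column above is no longer possible: T2 would have to wait for T1 to release its locks, and every schedule 2PL permits is serializable.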
Locking methods: problems.
Deadlock: a deadlock may result when two (or more) transactions are each waiting for locks held by the other to be released. There is only one way to break a deadlock: abort one or more of the transactions involved. Deadlock should be transparent to the user, so the DBMS should restart the aborted transaction(s). Two general techniques for handling deadlock:
 Deadlock prevention
 Deadlock detection and recovery
Timeout: deadlock detection can be done using the TIMEOUT technique. Every transaction is given a period of time to wait. If a transaction waits idle for the predefined period, the DBMS assumes that a deadlock has occurred and aborts and restarts the transaction.
2. Time-stamping Method
Timestamp: a unique identifier created by DBMS that indicates relative starting time of a
transaction. Can be generated by:

 using system clock at time transaction started, or


 Incrementing a logical counter every time a new transaction starts.
Time-stamping: a concurrency control protocol that orders transactions globally based on their timestamps, so that older transactions, i.e. transactions with smaller (earlier) timestamps, get priority in the event of conflict.
 Conflict is resolved by rolling back and restarting the transaction.
 Since there is no need to use locks, there will be no deadlock.
In timestamp ordering, the schedule is equivalent to the particular serial order that corresponds to the order of the transaction timestamps. To implement this scheme, every transaction is given a timestamp, which is a unique identifier of the transaction. If Ti entered processing prior to Tj, then TS(Ti) < TS(Tj). In addition, each data item carries a Read and a Write timestamp:
 WTS(A) which denotes the largest timestamp of any transaction that successfully
executed Write(A)
 RTS(A) which denotes the largest timestamp of any transaction that successfully
executed Read(A)
These timestamps are updated whenever a new Read(A) or Write(A) instruction is executed.
Read/write proceeds only if last update on that data item was carried out by an older transaction.
Otherwise, transaction requesting read/write is restarted and given a new timestamp.
The timestamp ordering protocol ensures that any conflicting read and write operations are
executed in the timestamp order. The protocol manages concurrent execution such that the time-
stamps determine the serializability order. Rules for permitting execution of operations in
Time-stamping Method Suppose that Transaction Ti issues Read(A)
 If TS(Ti) < WTS(A): this implies that Ti needs to read a value of A which was already
overwritten. Hence the read operation must be rejected and Ti is rolled back.
 If TS(Ti) >= WTS(A): then the read is executed and RTS(A) is set to the maximum of
RTS(A) and TS(Ti).
Suppose that transaction Ti issues Write(A):
 If TS(Ti) < RTS(A): this implies that the value of A that Ti is producing was previously needed, and it was assumed that it would never be produced. Hence, the Write operation must be rejected and Ti is rolled back.
 If TS(Ti) < WTS(A): this implies that Ti is attempting to write an obsolete value of A; hence, this Write operation can be ignored.
 Otherwise, the Write operation is executed and WTS(A) is set to the maximum of WTS(A) and TS(Ti).
A transaction that is rolled back due to conflict will be restarted and be given a new timestamp.
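A small worked example (timestamps assumed): let TS(T1) = 5 and TS(T2) = 10, with RTS(A) = WTS(A) = 0 initially. T2 issues Write(A): since TS(T2) = 10 is greater than both RTS(A) and WTS(A), the write executes and WTS(A) becomes 10. T1 now issues Read(A): TS(T1) = 5 < WTS(A) = 10, so T1 would be reading a value overwritten by a younger transaction; T1 is therefore rolled back and restarted with a new timestamp.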
Problem with timestamp-ordering protocol:
 Suppose Ti aborts, but Tj has read a data item written by Ti
 Then Tj must abort; if Tj had been allowed to commit earlier, the schedule is not
recoverable.
 Further, any transaction that has read a data item written by Tj must also abort
 This can lead to cascading rollback…
Cascading Rollback: whenever some transaction T tries to issue a Read_Item(X) or a Write_Item(X) operation, the basic timestamp ordering algorithm compares the timestamp of T with the read timestamp and the write timestamp of X to ensure that the timestamp order of execution of the transactions is not violated. If the timestamp order is violated by the operation, then transaction T would violate the equivalent serial schedule, so T is aborted and then resubmitted to the system as a new transaction with a new timestamp. If T is aborted and rolled back, any transaction Ti that may have used a value written by T must also be rolled back. Similarly, any transaction Tj that may have used a value written by Ti must also be rolled back, and so on. This effect is known as cascading rollback.
3. Optimistic Technique
 Locking, and assigning and checking timestamp values, may be unnecessary for some transactions.
 The optimistic approach is based on the assumption that conflict is rare, and that it is more efficient to let transactions proceed without delays and ensure serializability only at the end.
 When a transaction reaches commit, a check is performed to determine whether a conflict has occurred. If there is a conflict, the transaction is rolled back and restarted.
 Potentially allows greater concurrency than traditional protocols.
 Three phases:
1. Read
2. Validation
3. Write
1. Optimistic Techniques - Read Phase
 Extends from start until immediately before commit.
 Transaction reads values from database and stores them in local variables. Updates are
applied to a local copy of the data.
2. Optimistic Techniques - Validation Phase
 Follows the read phase.
 For a read-only transaction, checks that the data read are still the current values; if there is no interference, the transaction is committed, otherwise it is aborted and restarted.
 For an update transaction, checks that the transaction leaves the database in a consistent state, with serializability maintained.
3. Optimistic Techniques - Write Phase
 Follows successful validation phase for update transactions.
 Updates made to local copy are applied to the database.
7.3. Granularity of data items
Granularity is the size of the data items chosen as the unit of protection by a concurrency
control protocol. It could be:
 The entire database
 A file
 A page (a section of physical disk in which relations are stored)
 A record
 A field value of a record
The granularity has an effect on the performance of the system. Since locking prevents access to the data, the amount of data locked determines how many other transactions are blocked. If the entire database is locked, consistency is highly maintained but the system performs poorly. If a single data item is locked, consistency may be at risk but concurrent processing and performance are enhanced. Thus, as one goes from the entire database down to a single value, performance and concurrency improve, but consistency is put at risk and needs a good concurrency control mechanism and strategy.

UNIT - Eight
Database Recovery
Database recovery is the process of restoring the database to a correct state in the event of a failure; that is, it eliminates the effects of the failure from the database. Recovery, in database systems terminology, means restoring the last consistent state of the data items.
Types of failures: a failure is a state where data inconsistency is visible to transactions if they are scheduled for execution. The kinds of failure include:
 System crashes, resulting in loss of main memory,
 Media failures, resulting in loss of parts of secondary storage,
 Application software errors,
 Natural physical disasters,
 Carelessness or unintentional destruction of data or facilities,
 Sabotage
In databases usually a failure can generally be categorized as one of the following major groups:
1. Transaction failure: a transaction cannot continue with its execution, therefore, it is aborted
and if desired it may be restarted at some other time. Reasons: Deadlock, timeout, protection
violation, or system error.
2. System failure: the database system is unable to process any transactions. Some of the
common reasons of system failure are: wrong data input, register overflow, addressing error,
power failure, memory failure, etc.
3. Media failure: failure of non-volatile storage media (mainly disk). Some of the common
reasons are: head crash, dust on the recording surfaces, fire, etc.
To keep the database secure, one should formulate a recovery plan in advance. The plan will be used in case of database damage, which may range from minor inconsistency to total loss of the data due to hazardous events.
The basic steps in performing a recovery are
1. Isolating the database from other users. Occasionally, you may need to drop and re-create the
database to continue the recovery.
2. Restoring the database from the most recent useable dump.
3. Applying transaction log dumps, in the correct sequence, to the database to make the data as
current as possible.
It is a good idea to test your backup and recovery plans periodically by loading the backups and
transaction logs into a test database and verifying that your procedure really works. One can
recover databases after three basic types of problems: user error, software failure, and hardware
failure. Each type of failure requires a recovery mechanism. In a transaction recovery, the effect of the failed transaction, if any, is removed from the database. In a system failure, the effects of failed transactions have to be removed from the database and the effects of completed transactions have to be installed in the database. The database recovery manager is responsible for guaranteeing the atomicity and durability properties of the ACID property. The execution of a transaction T is correct if, given a consistent database, T leaves the database in the next consistent state. Let S1 be the initial consistent state of the database; T changes the consistent state from S1 to S2. During this transition the database may go through an inconsistent state, which is visible only to T. If the system or T fails, the database will not be able to reach S2, and consequently the visible state would be an inconsistent state. From this state we can go either forward (to state S2) or backward (to state S1).
Example: the initial values are A=100, B=200 and C=300. The required state after the execution of T1 is A=500, B=800 and C=700. Thus S1 = (100, 200, 300) and S2 = (500, 800, 700).
Transaction (T1)
Force writing is the explicit writing of the buffers to secondary storage.

8.1. Transactions and Recovery


 Transactions represent basic unit of recovery.
 Recovery manager is responsible for atomicity and durability.
 If failure occurs between commit and database buffers being flushed to secondary
storage then, to ensure durability, recovery manager has to redo (roll forward)
transaction's updates.
 If transaction had not committed at failure time, recovery manager has to undo (rollback)
any effects of that transaction for atomicity.
 Partial undo - only one transaction has to be undone.
 Global undo - all transactions have to be undone.
Transaction Log: the execution history of concurrent transactions.
Suppose the DBMS starts at time t0 but fails at time tf, and assume the data for transactions T2 and T3 has been written to secondary storage.
 T1 and T6 have to be undone. In the absence of any other information, the recovery manager has to redo T2, T3, T4, and T5.
 tc is the checkpoint time of the DBMS.
Recovery Facilities
A DBMS should provide the following facilities to assist with recovery:
 Backup mechanism: makes periodic backup copies of the database.
 Logging facility: keeps track of the current state of transactions and of database changes.
 Checkpoint facility: enables updates to the database that are in progress to be made permanent.
 Recovery manager: allows the DBMS to restore the database to a consistent state following a failure.
Restoring the database means transforming the state of the database to the immediate good state before the failure. To do this, the changes made on the database should be preserved; such information is stored in a system log or transaction log file.
Log File
Contains information about all updates to database:
 Transaction records
 Checkpoint records
 Often used for other purposes (for example, auditing).
Transaction records contain:
 Transaction identifier.
 Type of log record, (transaction start, insert, update, delete, abort, commit)
 Identifier of data item affected by database action (insert, delete, and update operations)
 Before-image of data item
 After-image of data item.
 Log management information.
 Log file sometimes split into two separate random-access files.
 Potential bottleneck; critical in determining overall performance.
Table 11 Transaction records

Check pointing: Checkpoint is a point of synchronization between database and log file. All
buffers are force-written to secondary storage.
 Checkpoint record is created containing identifiers of all active transactions.
 When failure occurs, redo all transactions that committed since the checkpoint and undo
all transactions active at time of crash.
 In previous example, with checkpoint at time tc, changes made by T2 and T3 have been
written to secondary storage. Thus:
 only redo T4 and T5,
 Undo transactions T1 and T6.
8.2. Database Recovery Techniques
Damage to the database could be either physical, which will result in the loss of the stored data,
or logical, leaving the database state inconsistent after the failure. For each case we have a
recovery mechanism:
1. If database has been damaged:
 Need to restore last backup copy of database and reapply updates of committed
transactions using log file.
 Extensive damage/catastrophic failure (physical media failure): the database is restored by
using the backup copy and by re-executing the committed transactions from the log up to the
time of failure.
2. If database is only inconsistent:
 No physical damage/only inconsistent: the restoring is done by reversing the changes made
on the database by consulting the transaction log.
 Need to undo changes that caused inconsistency. May also need to redo some transactions to
ensure updates reach secondary storage.
 Do not need backup, but can restore database using before- and after-images in the log file.

8.2.1. Recovery Techniques for Inconsistent Database State


Recovery is required only if the database has been updated; the kind of recovery also depends on
the kind of update made on the database. Database update: a transaction's updates to the database
can be applied in two ways, deferred or immediate. Three main recovery techniques:
1. Deferred Update
2. Immediate Update
3. Shadow Paging
Deferred Update
Updates are not written to the database until after a transaction has reached its commit point. If
the transaction fails before commit, it will not have modified the database, so no undoing of
changes is required. It may be necessary to redo the updates of committed transactions, as their
effects may not yet have reached the database. A transaction first modifies all its data items and
then writes all its updates to the final copy of the database. No change is recorded on the database
before commit; the changes are made only in the local transaction workspace. The update on the
actual database is made after commit, and after the change is recorded on the log. Since there is
no need to perform an undo operation, this is also called the NO-UNDO/REDO algorithm.
Immediate Update/ Update-In-Place
 Updates are applied to database as they occur.
 Need to redo updates of committed transactions following a failure.
 May need to undo effects of transactions that had not committed at time of failure.
 Essential that log records are written before write to database. Write-ahead log protocol.
 If no "transaction commit" record in log, then that transaction was active at failure and
must be undone.
 Undo operations are performed in reverse order in which they were written to log.
As soon as a transaction updates a data item, it updates the final copy of the database on the
database disk. While making the update, the change is recorded on the transaction log to permit a
rollback operation in case of failure. UNDO and REDO may both be required to make the
database consistent: UNDO for transactions that had not committed at failure time, and REDO
for committed transactions whose updates may not yet have reached the database disk. Thus, it is
called the UNDO/REDO algorithm. If committed updates are guaranteed to be on disk before
commit completes, the redo step becomes unnecessary, and the variation of this algorithm is
called the UNDO/NO-REDO algorithm.
Shadow Paging
It maintains two page tables during the life of a transaction: a current page table and a shadow
page table. When the transaction starts, the two page tables are the same. The shadow page table
is never changed thereafter and is used to restore the database in the event of failure. During the
transaction, the current page table records all updates to the database. When the transaction
completes, the current page table becomes the shadow page table.
UNIT – Nine
Distributed Database Systems
Before the development of database technology, every application used to have its own data in
the application logic.
Database development facilitates the integration of data available in an organization from a
number of applications and enforces security on data access on a single local site. However, it is
not always the case that organizational data reside in one central site. This demands that
databases at different sites be integrated and synchronized with all the facilities of the database
approach. This is made possible by computer networks and data communication, optimized by
internet, mobile and wireless computing, and intelligent devices.
This leads to Distributed Database Systems.
The decentralization approach to databases mirrors the natural organizational structure of
companies, which are logically distributed into divisions, departments, projects and so on, and
physically distributed into offices, plants and factories, each of which maintains its own
operational data. A distributed database is not a centralized database.

Table 12 Centralized and distributed database

9.1. Data Distribution Strategies


Distributed DB stores logically related shared data and metadata at several physically
independent sites connected via network. Distributed DBMS is the software system that permits
the management of a Distributed DB and makes the distribution transparent to the user. Data
allocation is the process of deciding where to allocate/store particular data item.
There are 3 data allocation strategies:
1. Centralized: the entire DB is located at a single site, and other computers access it through
the network. This is known as distributed processing.
2. Partitioned: the DB is split into several disjoint parts (called partitions, segments or
fragments) and stored at several sites
3. Replicated: copies of one or more partitions are stored at several sites
Selective: combines fragmentation (locality of reference for data that are rarely updated),
replication and centralization, as appropriate for the data.
Complete: a copy of the whole database is made available at each site. Snapshot is one method used here.
In a distributed database system, the database is logically stored as single database but
physically fragmented on several computers. The computers in a distributed system
communicate with each other through various communication media, such as high speed buses
or telephone line. A distributed database system has the following components.
 Local DBMS
 Distributed DBMS (DDBMS)
 Global System Catalog (GSC)
 Data communication (DC)
A distributed database system consists of a collection of sites, each of which maintains a local
database system (Local DBMS) but each local DBMS also participates in at least one global
transaction where different databases are integrated together.
 Local Transaction: transactions that access data only in that single site
 Global Transaction: transactions that access data in several sites.
 Parallel DBMS: a DBMS running across multiple processors and disks that is designed
to execute operations in parallel, whenever possible, in order to improve performance.
Three architectures for parallel DBMS:
 Shared Memory- for fast data access for a limited number of processors.
 Shared Disk- for application inherently centralized
 Shared nothing.- massively parallel
 What makes DDBMS different is that
 The various sites are aware of each other
 Each site provides a facility for executing both local and global transactions.
The different sites can be connected physically in different topologies.
 Fully Connected Network,
 Partially Connected Network,
 Tree Network,
 Star Network, and
 Ring Network
The differences among these topologies are based on:
Installation Cost: cost of linking sites physically.
Communication Cost: cost to send and receive messages and data
Reliability: resistance to failure
Availability: degree to which data can be accessed despite the failure.
The distribution of the database sites could be:
1. Large Geographical Area: Long-Haul Network
 relatively slow
 less reliable
 uses telephone line, microwave, satellite
2. Small Geographical Area: Local Area Network
 higher speed
 lower rate of error
 use twisted pair, base band coaxial, broadband coaxial, fiber optics
Even though integration of data implies centralized storage and control, in distributed database
systems the intention is different: data is stored in different database systems in a decentralized
manner but acts as if centralized, through the use of computer networks. A distributed database
system consists of loosely coupled sites that share no physical components, and the database
systems that run on each site are independent of each other.
 Systems that do share physical components are known as parallel DBMSs.
 Transactions may access data at one or more sites
 Organization may implement their database system on a number of separate computer
system rather than a single, centralized mainframe. Computer Systems may be located at
each local branch office.

9.2. Functions of a DDBMS


DDBMS have the following functionality.
 Extended Communication Services to provide access to remote sites
 Extended Data Dictionary: to store data distribution details; hence the need for a global
system catalog
 Distributed Query Processing - optimization of query remote data access
 Extended security- access control to a distributed data
 Extended Concurrency Control –maintain consistency of replicated data
 Extended Recovery Services- failures of individual sites and the communication line.

9.3. Issues in DDBMS
How is data stored in a DDBMS?


There are several ways of storing a single relation in distributed database systems.
1. Replication:
 System maintains multiple copies of similar data (identical data)
 Stored in different sites, for faster retrieval and fault tolerance.
 Duplicate copies of the tables can be kept on each system (replicated). With this option,
propagating updates to all the copies becomes involved (of course, the copies of the tables
can be read-only).
 Advantage: Availability, Increased parallelism (if only reading)
 Disadvantage: increased overhead of update
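As a hedged illustration of read-only replication, many DBMSs support snapshots of a remote
table. The following Oracle-style sketch replicates a remote dept table locally; the view name and
refresh policy are assumptions:

CREATE MATERIALIZED VIEW dept_copy
REFRESH COMPLETE ON DEMAND
AS SELECT * FROM dept@sales.midroc.telecom.et;
-- dept_copy is a local, read-only replica that is refreshed only when requested

Because the copy is read-only, the update overhead noted above is avoided, at the cost of the
replica lagging behind the master table between refreshes.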
2. Fragmentation:
Relation is partitioned into several fragments stored at distinct sites. The partitioning could be
vertical, horizontal or both.
I. Horizontal Fragmentation
Systems can share the responsibility of storing information from a single table with individual
systems storing groups of rows. It is performed by the Selection Operation. The whole content
of the relation is reconstructed using the UNION operation
Table 13 horizontal fragmentation
II. Vertical Fragmentation
Systems can share the responsibility of storing particular attributes of a table. It needs an
attribute with the tuple number (the primary key value is repeated), and it is performed by the
Projection Operation. The whole content of the relation is reconstructed using the Natural JOIN
operation on the attribute with the tuple number (the primary key values).
Table 14 Vertical fragmentation
III. Hybrid Fragmentation
Both (hybrid fragmentation): the systems share the responsibility of storing particular attributes
of a subset of records in a given relation. It is performed by projection then selection, or
selection then projection, relational algebra operators. Reconstruction is made by the combined
effect of the Union and Natural Join operators.
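A hedged SQL sketch of these fragmentations, assuming a hypothetical relation
employee(empId, fname, lname, branchNo, salary):

-- Horizontal fragments (selection), e.g., split by branch:
CREATE TABLE emp_addis AS SELECT * FROM employee WHERE branchNo = 'B005';
CREATE TABLE emp_other AS SELECT * FROM employee WHERE branchNo <> 'B005';
-- Reconstruction by UNION:
SELECT * FROM emp_addis UNION SELECT * FROM emp_other;

-- Vertical fragments (projection), repeating the primary key empId:
CREATE TABLE emp_pay AS SELECT empId, salary FROM employee;
CREATE TABLE emp_name AS SELECT empId, fname, lname, branchNo FROM employee;
-- Reconstruction by NATURAL JOIN on the repeated key empId:
SELECT * FROM emp_pay NATURAL JOIN emp_name;

A hybrid fragment would simply combine a WHERE clause and a column list in the same
CREATE TABLE ... AS SELECT statement.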
Fragmentation is correct if it fulfills the following
1. Complete: - a data item must appear in at least one fragment of a given relation R (R1,
R2…Rn).
2. Reconstruction:- it must be possible to reconstruct a relation from the fragments
3. Disjointness: - a data item should only be found in a single fragment except for vertical
fragmentation (the primary key is repeated for reconstruction).
Data transparency: the degree to which system users may remain unaware of the details of how
and where the data items are stored in a distributed system.
1. Distribution transparency: even though there are many systems, they appear as one; the
database is seen as a single, logical entity.
2. Replication transparency: copies of data spread across sites still appear as just one copy to
developers and users.
3. Fragmentation transparency: a table that is actually stored in parts across sites appears as a
single table in a single location.
4. Location transparency: the user does not need to know where a data item is physically
located.
How does it work? Distributed computing can be difficult to implement, particularly for
replicated data that can be updated from many systems. In order to operate a distributed database
system has to take care of
 Distributed Query Processing
 Distributed Transaction Management
 Replication Data Management: if you are going to have copies of data on many
machines, how often does the data get updated when it is changed in another system?
Who is in charge of propagating the update to the data?
 Distributed Database Recovery: if one machine goes down, how does that affect the
others?
 Security: just like any computer network, a distributed system needs a common way to
validate users entering from any computer in the network of servers.
 Common Data Dictionary: your schema now has to be distributed and work in
connection with schemas created on many systems.
9.4. Homogeneous and Heterogeneous Distributed Databases
1. In a homogeneous distributed database
 All sites have identical software (DBMS)
 Are aware of each other and agree to cooperate in processing user requests.
 Each site surrenders part of its autonomy in terms of right to change schemas or software
 Appears to the user as a single system
2. In a heterogeneous distributed database
 Different sites may use different schemas and software (DBMS)
 Difference in schema is a major problem for query processing
 Difference in software is a major problem for transaction processing
 Sites may not be aware of each other and may provide only limited facilities for
cooperation in transaction processing.
 May need gateways to interface one another.
9.5. Advantages and disadvantages of DDBMS
9.5.1. Why DDBMS/Advantages
1. Many existing systems
 Maybe you have no choice.
 Possibly there are many different existing systems, of different kinds (Oracle, Informix,
others), that need to be used together.
2. Data sharing and distributed control:
 User at one site may be able to access data that is available at another site.
 Each site can retain some degree of control over local data
 We will have local as well as global database administrator
3. Reliability and availability of data: if one site fails, the rest can continue operation as long as
a transaction does not demand data from the failed site or the data it needs is replicated at other
sites.
4. Speedup of query processing: if a query involves data from several sites, it may be possible
to split the query into sub-queries that can be executed at several sites, which is parallel
processing. A query can also be sent to the least heavily loaded site.
5. Expansion (Scalability): in a distributed environment you can easily expand by adding more
machines to the network.
9.5.2. Disadvantages of DDBMS
1. Software Development Cost
A DDBMS is difficult to install and implement, and is thus costly.
2. Greater Potential for Bugs
Parallel processing may endanger the correctness of algorithms.
3. Increased Processing Overhead: the exchange of messages between sites leads to high
communication latency, due to the communication protocols involved.
4. Communication problems
5. Increased Complexity and Data Inconsistency Problems: Since clients can read and modify
closely related data stored in different database instances concurrently.
6. Security Problems: network and replicated data security.
9.6. Query Processing and Transaction Management in DDBMS
Query Processing
There are different strategies to process a specific query, which in turn increase the performance
of the system by minimizing processing time and cost. In addition to the cost estimates we have
for a centralized database (disk access, relation size, etc), we have to consider the following in
distributed query processing:
 Cost of data transmission over the huge network
 Gain of parallel processing of a single query
For the case of Replicated data allocation, even though parallel processing is used to increase
performance, update will have a great impact since all the sites containing the data item should
be updated. For the case of fragmentation, update works more like the centralized database but
reconstruction of the whole relation will require accessing data from all sites containing part of
the relation. Let the distributed database have three sites (S1, S2, and S3), and two relations,
EMPLOYEE and DEPARTMENT, located at S1 and S2 respectively without any fragmentation.
A query is initiated from S3 to retrieve, for each employee, the first name (15 bytes), last name
(15 bytes) and the name of the department they are working in (10 bytes), a total of 40 bytes per
result record.
Let: for EMPLOYEE we have the following information:
 10,000 records
 each record is 100 bytes long
For DEPARTMENT we have the following information:
 100 records
 each record is 35 bytes long
There are three ways of executing this query:
1. Transfer DEPARTMENT and EMPLOYEE to S3 and perform the join there: needs a transfer
of 10,000*100 + 100*35 = 1,003,500 bytes.
2. Transfer EMPLOYEE to S2 and perform the join there, which will produce 40*10,000 =
400,000 bytes, then transfer the result to S3: we need 1,000,000 + 400,000 = 1,400,000 bytes to
be transferred.
3. Transfer DEPARTMENT to S1 and perform the join there, which will produce 40*10,000 =
400,000 bytes, then transfer the result to S3: we need 3,500 + 400,000 = 403,500 bytes to be
transferred.
Then one can select the strategy that will reduce the data transfer cost for this specific query.
Other optimization steps may also be included to make the processing more efficient, for
example reducing the size of the transferred relations using projection.
Transaction Management
A transaction is a logical unit of work constituted by one or more operations executed by a single
user. A transaction begins with the user's first executable query statement and ends when it is
committed or rolled back. A distributed transaction is a transaction that includes one or more
statements that, individually or as a group, update data on two or more distinct nodes of a
distributed database.
Representation of a Query in a Distributed Database
SQL Statement: SELECT * FROM dept@sales.midroc.telecom.et;
Object: dept
Database Domain: sales.midroc.telecom.et
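The remote-object notation above relies on a database link between the sites. A hedged,
Oracle-style sketch of defining such a link is shown below; the credentials and connect string are
assumptions:

CREATE DATABASE LINK sales.midroc.telecom.et
CONNECT TO remote_user IDENTIFIED BY remote_password
USING 'sales';
-- afterwards, dept@sales.midroc.telecom.et resolves to the dept table at the sales site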
There are two types of transaction in DDBMS to access data from other sites:
1. Remote Transaction: contains only statements that access a single remote node. Thus,
Remote Query statement is a query that selects information from one or more remote tables, all
of which reside at the same remote node or site.
For example, the following query accesses data from the dept table in the Addis schema (the site)
of the remote sales database:
SELECT * FROM Addis.dept@sales.midroc.telecom.et;
A remote update statement is an update that modifies data in one or more tables, all of which
are collocated at the same remote node. For example, the following statement updates the dept
table in the Addis schema of the remote sales database:
UPDATE Addis.dept@sales.midroc.telecom.et
SET loc = 'Arada'
WHERE BranchNo = 5;
2. Distributed Transaction: contains statements that access more than one node.
A distributed query statement retrieves information from two or more nodes. If all statements of
a transaction reference only a single remote node, the transaction is remote, not distributed.
A database must guarantee that all statements in a transaction, distributed or non-distributed,
either commit or roll back as a unit. The effects of an ongoing transaction should be invisible to
all other transactions at all nodes; this transparency should be true for transactions that include
any type of operation, including queries, updates, or remote procedure calls.
For example, the following query accesses data from the local database as well as the remote
sales database:
SELECT ename, dname
FROM Dessie.emp DS, Addis.dept@sales.midroc.telecom.et AD
WHERE DS.deptno = AD.deptno;
{Employee data is stored in Dessie and sales data is stored in Addis; there is an employee
responsible for each sale}
Different kinds of SQL statements:
Remote query
SELECT client_nm FROM clients@accounts.motorola.com;
Distributed query
SELECT project_name, student_nm
FROM intership@accounts.motorola.com i, student s
WHERE s.stu_id = i.stu_id;
Remote update
UPDATE intership@accounts.motorola.com SET stu_id = '242' WHERE stu_id = '200';
Distributed update
UPDATE intership@accounts.motorola.com SET stu_id = '242' WHERE stu_id = '200';
UPDATE student SET stu_id = '242' WHERE stu_id = '200';
COMMIT;
Concurrency Control
There are various techniques used for concurrency control in centralized database systems. The
techniques in distributed database system are similar with the centralized approach with
additional implementation requirements or modifications.
The main difference or the change that should be incorporated is the way the lock manager is
implemented and how it functions. There are different schemes for concurrency control in DDBS
1. Non-Replicated Scheme
 No data is replicated in the system.
 All sites maintain a local lock manager (local lock and unlock).
 If site Si needs a lock on data at site Sj, it sends a message to the lock manager of site Sj,
and the locking will be handled by site Sj.
 All the locking and unlocking principles are handled by the local lock manager at which
the data object resides.
 Is simple to implement.
 Needs three message transfers:
 To request a lock
 To notify grant of lock
 To request unlock
2. Single Coordinate Approach
 The system choose one single lock manager that resides in one of the sites (Si)
 All locks and unlocks requests are made at site Si where the lock manager resides
 Is simple to implement
 Needs two message transfers
 To request a lock
 To request unlock
 Simple deadlock handling
 Could be a bottleneck since all processes are handled at one site
 Is vulnerable/at risk if the site with the lock manager fails
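For intuition only, the following hedged SQL sketch shows a transaction explicitly requesting a
lock before updating; in a DDBMS the request is routed either to the lock manager at the site
holding the data (non-replicated scheme) or to the single coordinator site, depending on the
scheme. The table and column names are assumptions:

LOCK TABLE account IN EXCLUSIVE MODE;  -- lock request sent to the responsible lock manager
UPDATE account SET balance = balance - 100 WHERE accNo = 'A1';
COMMIT;  -- commit releases the lock (the unlock request)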

UNIT – Ten
Introduction to Object-Oriented Database Systems
An object-oriented database (OOD) is a database system that can work with complex data
objects, i.e., objects that mirror those used in object-oriented programming languages.

In object-oriented programming, everything is an object, and many objects are quite complex,
having different properties and methods. An object-oriented database management system works
in concert with an object-oriented programming language to facilitate the storage and retrieval of
object-oriented data.
You might be thinking, “Wait, I use objects in my programming all the time. In addition, I use a
database. So, does that mean the database I use is an OOD?” Probably not, and the reason has to
do with one of the main features of OOD: object data persistence.

When your program is running, you might have an object, for example, an instance of a task.
That object has properties like a name and status. It might also have some methods
like update_task() or get_task_history(). Somewhere early in your program, you initialized
that task object, and now you have access to it because it is stored in memory.

What happens when your program terminates execution? Your object… is gone. That data object
was transient, not persistent. The next time your program starts up, you’ll need to retrieve those
data values (like name and status) from your database all over again in order to initialize
a new task object instance.

With an OOD, data objects are stored with all of their properties in the database. When your
program terminates, the objects continue to persist, stored in the OOD. When your program
starts up again, it can retrieve an object with the properties from the database. The process of
storing and retrieving a complex data object with an OOD is transparent to the user of the
database.

This is quite different from relational databases (like MySQL or SQLite) but not significantly
from document databases (like MongoDB). In relational databases, the developer needs to
compose an object from the results of a set of queries, while in document databases the mapping
of the document fields to the class properties should be almost transparent.

OODs have been around for decades. MongoDB Realm is one of the new and promising pieces
of software in that field.

10.1. Object Orientation


It is a set of design and development principles that works based on autonomous computer
structures known as objects.

10.2. OO Contribution areas


The following are OO contribution areas

 Programming Languages
 Graphical User Interfaces
 Databases
 Design
 Operating Systems
10.3. Evolution of OO Concepts
Concepts stem from object-oriented programming languages (OOPLs)
 Ada, ALGOL, LISP, SIMULA
 OOPLs goals
 Easy-to-use development environment
 Powerful modeling tools for development
 Decrease in development time
 Make reusable code
 OO Attributes
 The data set is not passive
 Data and procedures are bound together
 An object can act on itself
Advanced Database Applications
The application of databases is highly practiced in complex applications demanding different
ways of modeling and designing. Some of these applications are:
 Computer Aided Design (CAD)
 Computer Aided Manufacturing
 Computer Aided Software Engineering
 Network Management System
 Multimedia System
 Digital Publication
 Geographic Information System
 Interactive and Dynamic Web Sites

Drawbacks/Weaknesses of Relational DBMS
In addition to the emergence of many advanced database application areas, there were some
drawbacks in the relational database management system.
1. Poor representation of 'real world' entities: relations do not correspond directly to real-world
objects.
2. Semantic overloading (semantically overloaded): representation of complex relationships is
difficult.
E.g., an M:N relationship is represented by adding one additional relation (making the
relationship an entity).
 One cannot distinguish a relation from a relationship.
 It is difficult to distinguish different types of relationship.
3. Poor support for integrity and enterprise constraints: many commercial DBMSs do not support
all integrity constraints, and the relational model does not support enterprise constraints, which
therefore have to be enforced in the DBMS or the application.
4. Homogeneous data structure
 Has vertical and horizontal homogeneity
 Horizontal Homogeneity: Each tuple of a relation have same number and type of
attributes
 Vertical Homogeneity: Values of an attribute should come from same domain.
 Field value should be atomic
5. Limited operations
 Relational model has fixed operations on the data
 Does not allow additional/new operations
6. Difficulty in handling recursive queries: direct and indirect recursive queries on a single
relation cannot easily be represented. A query to extract information from a recursive relationship
between tuples of the same entity is difficult to express.
7. Impedance mismatch: mixing of different programming paradigms, i.e., a mismatch between
the languages used.
8. Poor navigational access: access is based on navigating individual records from different
relations.

10.4. Object oriented Concepts


An object is a uniquely identifiable entity that contains both the attributes that describe the state
of the 'real world' object and the actions that are associated with it. An OODBMS can manage:

 complex, highly interrelated information.


 Abstract representation of a real-world entity
 Unique identity
 Embedded properties
 Ability to interact with other objects and self

OID (Object Identity): each object is unique in OO systems. An OID:
 is unique to the object;
 is not a primary key (a PK is unique only within a relation, and since a PK is selected from
attributes it depends on the state);
 is invariant (will not change);
 is independent of the values of attributes (two objects can have the same state but will have
different OIDs);
 is invisible to users.
• Entity integrity is enforced in OO systems.
• Relationship: embedding the OID of one object into the other (e.g., embedding the OID of a
branch into an employee object).
Advantage of using OID:
 Are efficient
 Are fast
 Cannot be modified by users (system generated)
 Independent of content
Table 15 representation of objects
Attributes and Methods: An object has state and behaviors. An object has properties (which might also
be called attributes) like name, status, and create_date. The set of properties taken together represents
its state. Along with this, an object has behaviors (also known as methods, actions, or functions) that
modify or operate on its properties, like update_task() or get_task_history().
Object state: the values of an object's attributes at any given point in time.
Table 16 representation of object and method
Messages
 The means by which objects communicate
 A request from one object to another to activate one of its methods
 Invokes/calls the method to be applied
 Sent to an object from the real world or from another object
 Notation: object.method
E.g.: staffObject.updateSalary(salary)

Table 17 Messaging process in real-world events


Classes: the grouping of all objects with the same properties and behaviors forms a class. In our
example above, we talked about task objects; together, these objects all belong to the Task class.
 A blueprint for defining similar objects
 Objects with similar attributes that respond to the same messages are grouped together
 Defined only once and used by many objects
 A collection of similar objects
 Shares attributes and structure
o Objects in a class are called instances of the class
Eg: Class Definition: defining the class BRANCH
BRANCH
Attributes
branchNo
street
city
postcode
Methods
print()
getPostCode()
numberOfStaff()

Objects of the BRANCH class:
branchNo=B005, street=st1, city=Addis, postcode=1425
branchNo=B007, street=st2, city=Dire, postcode=452
branchNo=B003, street=st2, city=Bahirdar, postcode=85
Table 18 class assigning techniques
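As a hedged illustration, the BRANCH class could be declared in an object-relational SQL
dialect roughly as follows (Oracle-style syntax; the type name is an assumption and the method
body is omitted):

CREATE TYPE branch_t AS OBJECT (
branchNo VARCHAR2(4),
street VARCHAR2(25),
city VARCHAR2(15),
postcode VARCHAR2(8),
MEMBER FUNCTION getPostCode RETURN VARCHAR2
);
-- a table whose rows are instances (objects) of the class:
CREATE TABLE branch OF branch_t;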
I. Object Characteristics
1. Class Hierarchy
• Superclass
• Subclass
2. Inheritance
It is the ability of an object to inherit the data structure and behavior of classes above it.
• Single inheritance – class has one immediate superclass
• Multiple – class has more than one immediate superclass
II. Method Overriding
• Method redefined at subclass
III. Polymorphism
• Allows different objects to respond to same message in different ways

10.4.1 Object Classification


Objects in OO systems can be classified as follows:

A) Simple
They have only single-valued attributes and no attributes that refer to other objects.
B) Composite
They have at least one multi-valued attribute but no attributes that refer to other objects.
C) Compound
They have at least one attribute that references another object.
D) Hybrid
They have a repeating group of attributes, at least one of which refers to another object.

10.5. Characteristics of OO Data Model


OO Data Model has the following characteristics
 Supports complex objects
 Must be extensible
 Supports encapsulation
 Exhibit inheritance
 Supports object identity
Table 19 OO Data Model
Table 20 OO vs. E-R Model Components
OO Data Model          E-R Model
Type                   Entity definition
Object                 Entity instance
Class                  Entity set
Instance variable      Attribute
N/A                    Primary key
OID                    N/A
Class hierarchy        E-R diagram

10.6. OODBMS
Object-oriented database technology is a marriage of object-oriented programming and database
technologies. An object-oriented database management system integrates the benefits of typical
database systems with OODM characteristics:
 Handles a mix of data types (since OODBMS permits new data definition)
 Follows OO rules
 Follows DBMS rules

10.6.1 OO and Database Design


It provides data identification and the procedures for data manipulation
 Data and procedures self-contained entity
 Iterative and incremental
 DBA does more programming
 Lack of standards

10.6.2 OODBMS Advantages


 More semantic information
 Support for complex objects
 Extensibility of data types (user defined data types)
 May improve performance with efficient caching
 Versioning
 Polymorphism: one operation shared by many objects and each acting differently
 Reusability
 Inheritance speeds development and application: defining new objects in terms of
previously defined objects (incremental definition)
 Potential to integrate DBMSs into single environment
 Relationship between objects is represented explicitly supporting both navigational and
associative access to information.
10.6.3. OODBMS Disadvantages
 Strong opposition from the established RDBMSs
 Lack of theoretical foundation
 No standard
 No single data model
 Throwback to old pointer systems
 Lack of standard ad hoc query language
 Lack of business data design and management tools
 Steep learning curve
 Low market presence
 Lack of compatibility between different OODBMSs
UNIT - Eleven
Database Administration and Security
In today's society, information is a critical resource not only in industry, commerce, education
and medicine, but also in the military, diplomacy and government. Some information is so
important that it has to be protected. For example, data corruption or fabrication in a hospital
database could result in patients' receiving the wrong medication.
Disclosure or modification of military information could endanger national security.
Privacy – Ethical and legal rights that individuals have with regard to control over the
dissemination and use of their personal information
Database security – Protection of information contained in the database against unauthorized
access, modification or destruction
Database integrity – Mechanism that is applied to ensure that the data in the database is correct
and consistent
A good database security management system has not only the following characteristics: data
independence, shared access, minimal redundancy, data consistency, and data integrity but also
the following characteristics: privacy, integrity, and availability.
 Privacy signifies that an unauthorized user cannot disclose data
 Integrity ensures that an unauthorized user cannot modify data
 Availability ensures that data be made available to the authorized user unfailingly
 Copyright ensures the native rights of individuals as a creator of information.
 Validity ensures activities to be accountable by law.
With strong enforcement and management of these, it is said that the database system can
effectively prevent accidental security and integrity threats arising from system errors, improper
authorization and concurrent-usage anomalies. In addition, to have an efficient system, it should
prevent malicious or intentional security and integrity threats, where computer system operators
as well as programmers acting as hackers can bypass security. There are certain security policy
issues that we should recognize: we should consider administrative control policies, decide
which security features offered by the DBMS are used to implement the system, decide whether
the focus of security administration is left with the DBA and whether it is centralized or
decentralized, and decide on the ownership of shared data as well. The levels of security
protection start from organizational and administrative security, then physical and personnel
security, communication security, and information systems security.
Database security and integrity is about protecting the database from being inconsistent and
being disrupted. We can also call this database misuse. Database misuse could be intentional or
accidental, where accidental misuse is easier to cope with than intentional misuse. Accidental
inconsistency could occur due to:
 System crash during transaction processing
 Anomalies due to concurrent access
 Anomalies due to redundancy
 Logical errors
Likewise, even though there are various threats that could be categorized in this group,
Intentional misuse could be:
 Unauthorized reading of data
 Unauthorized modification of data or
 Unauthorized destruction of data
Most systems implement good database integrity to protect the system from accidental misuse,
while there are many computer-based measures to protect the system from intentional misuse,
which are termed database security measures.
Levels of Security Measures
Security measures can be implemented at several levels and for different components of the
system. These levels are:
1. Physical Level: concerned with securing the site containing the computer system. The backup
systems should also be physically protected from access except for authorized users. In other
words, the site or sites containing the computer systems must be physically secured against
armed or sneaky entry by intruders.
2. Human Level: concerned with the authorization of database users to access the content at
different levels and with different privileges.
3. Operating System: concerned with the weakness and strength of the operating system
security on data files. Weakness may serve as a means of unauthorized access to the database.
No matter how secure the database system is, weakness in operating system security may serve
as a means of unauthorized access to the database. This also includes protection of data in
primary and secondary memory from unauthorized access.
4. Database System: concerned with data access limit enforced by the database system. Access
limit like password, isolated transaction and etc. Some database system users may be authorized
to access only a limited portion of the database. Other users may be allowed to issues queries,
but may be forbidden to modify the data. It is the responsibility of the database system to ensure
that these authorization restrictions are not violated.
5. Application Level: Since almost all database systems allow remote access through terminals
or networks, software-level security with the network software is as important as physical
security, both on the Internet and networks private to an enterprise.
Even though we can have different levels of security and authorization on data objects and users,
who accesses which data is a policy matter rather than a technical one. These policies:
 should be known by the system: encoded in the system
 should be remembered: saved somewhere (the catalogue)
Database Integrity constraints contribute to maintaining a secure database system by preventing
data from becoming invalid and hence giving misleading or incorrect results
Domain Integrity means that each column in any table will have set of allowed values and
cannot assume any value other than the one specified in the domain.
Entity Integrity means that in each table the primary key (which may be composite) satisfies
both of two conditions:
1. That the primary key is unique within the table and
2. That the primary key column(s) contains no null values.
Referential Integrity means that in the database as a whole, things are set up in such a way that if
a column exists in two or more tables in the database (typically as a primary key in one table and
as a foreign key in one or more other tables), then any change to a value in that column in any
one table will be reflected in corresponding changes to that value where it occurs in other
tables. This means that the RDBMS must be set up so as to take appropriate actions to spread a
change in one table from that table to the other tables where the change must also occur.
The effect of the existence and maintenance of referential integrity is, in short, that if a column
exists in two or more tables in the database, every occurrence of the column will contain only
values that are consistent across the database.
Key constraints: in a relational database, there should be some collection of attributes with a
special feature used to maintain the integrity of the database. These attributes are named Primary
Key, Candidate Key, Foreign Key, etc. These key(s) should obey the rules set by the relational
data model.
Enterprise Constraint means some business rules set by the enterprise on how to use, manage
and control the database
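A hedged SQL sketch of how these constraints are typically declared; the table and column
names are assumptions:

CREATE TABLE employee (
empId CHAR(6) PRIMARY KEY,                     -- entity integrity / key constraint
fname VARCHAR(15) NOT NULL,
salary DECIMAL(8,2) CHECK (salary > 0),        -- domain integrity
branchNo CHAR(4) REFERENCES branch(branchNo)   -- referential integrity
);
-- an enterprise constraint (e.g., at most 20 staff per branch) would typically be
-- enforced by a trigger or in application code, since it goes beyond these declarations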
Database Security - the mechanisms that protect the database against intentional or accidental
threats. Database security encompasses hardware, software, people and data.
Database Management Systems supporting multi-user database system must provide a database
security and authorization subsystem to enforce limits on individual and group access rights and
privileges.
11.1 Security Issues and general considerations
Legal, ethical and social issues regarding the right to access information
Physical control issues regarding how to keep the database physically secured.
Policy issues regarding privacy of individual level at enterprise and national level
Operational consideration on the techniques used (password, etc) to access and manipulate the
database
System level security including operating system and hardware control
Security levels and security policies in enterprise level
The designer and the administrator of a database should first identify the possible threat that
might be faced by the system in order to take counter measures.
Threat may be any situation or event, whether intentional or accidental, that may adversely
affect a system and consequently the organization
A threat may be caused by a situation or event involving a person, action, or circumstance that is
likely to bring harm to an organization
The harm to an organization may be tangible or intangible
Tangible – loss of hardware, software, or data
Intangible – loss of credibility or client confidence
Examples of threats:
Using another person's means of access
Unauthorized amendment/modification or copying of data
Program alteration
Inadequate policies and procedures that allow a mix of confidential and normal output
Wire-tapping
Illegal entry by hackers
 Blackmail
 Theft of data, programs, and equipment
Failure of security mechanisms, giving greater access than normal
 Staff shortages or strikes
 Inadequate staff training
Viewing and disclosing unauthorized data
 Electronic interference and radiation
 Data corruption owing to power loss or surge
 Fire (electrical fault, lightning strike, arson), flood, bomb
 Physical damage to equipment
 Breaking cables or disconnection of cables
 Introduction of viruses
An organization deploying a database system needs to identify the types of threat it may be
subjected to and initiate appropriate plans and countermeasures, bearing in mind the costs of
implementing each.
Countermeasures: Computer Based Controls
The types of countermeasure to threats on computer systems range from physical controls to
administrative procedures
Despite the range of computer-based controls that are available, it is worth noting that, generally,
the security of a DBMS is only as good as that of the operating system, owing to their close
association
The following are computer-based security controls for a multi-user environment:
Authorization: The granting of a right or privilege that enables a subject to have legitimate
access to a system or a system‘s object
Authorization controls can be built into the software, and govern not only what system or object
a specified user can access, but also what the user may do with it
Authorization controls are sometimes referred to as access controls
The process of authorization involves authentication of subjects (i.e. a user or program)
requesting access to objects (i.e. a database table, view, procedure, trigger, or any other object
that can be created within the system)
Views: a view is the dynamic result of one or more relational operations operating on the base
relations to produce another relation.
A view is a virtual relation that does not actually exist in the database, but is produced upon
request by a particular user.
The view mechanism provides a powerful and flexible security mechanism by hiding parts of the
database from certain users.
Using a view is more restrictive than simply having certain privileges granted to a user on the
base relation(s)
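A hedged sketch of the view mechanism as a security device (the table, column and user names
are assumptions): a clerk is shown only the non-sensitive columns of a single branch and never
touches the base relation.

CREATE VIEW staff_public AS
SELECT empId, fname, lname, branchNo
FROM employee
WHERE branchNo = 'B005';  -- hides salary and all other branches
GRANT SELECT ON staff_public TO clerk;  -- the clerk queries the view, not the base table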
Backup and recovery: Backup is the process of periodically taking a copy of the database and
log file (and possibly programs) on to offline storage media. A DBMS should provide backup
facilities to assist with the recovery of a database following failure. Database recovery is the
process of restoring the database to a correct state in the event of a failure.
Journaling is the process of keeping and maintaining a log file (or journal) of all changes made to
the database to enable recovery to be undertaken effectively in the event of a failure.
The advantage of journaling is that, in the event of a failure, the database can be recovered to its
last known consistent state using a backup copy of the database and the information contained in
the log file. If no journaling is enabled on a failed system, the only means of recovery is to
restore the database using the latest backup version. However, without a log file, any changes
made to the database after the last backup will be lost.
Integrity: Integrity constraints contribute to maintaining a secure database system by preventing
data from becoming invalid and hence giving misleading or incorrect results
 Domain Integrity: setting the allowed set of values
 Entity integrity: demanding Primary key values not to assume a NULL value
 Referential integrity: enforcing Foreign Key values to have a value that already exist in
the corresponding Candidate Key attribute(s) or be NULL.
 Key constraints: the rules the Relational Data Model has on different kinds of Key.
Encryption Authorization may not be sufficient to protect data in database systems, especially
when there is a situation where data should be moved from one location to the other using
network facilities. Encryption is used to protect information stored at a particular site or
transmitted between sites from being accessed by unauthorized users. Encryption is the encoding
of the data by a special algorithm that renders the data unreadable by any program without the
decryption key. It is not possible for encrypted data to be read unless the reader knows how to
decipher/decrypt the encrypted data. If a database system holds particularly sensitive data, it
may be deemed necessary to encode it as a precaution against possible external threats or
attempts to access it. The DBMS can access data after decoding it, although there is degradation
in performance because of the time taken to decode it. Encryption also protects data transmitted
over communication lines. To transmit data securely over insecure networks requires the use of a
Cryptosystem, which includes:
1. An encryption key to encrypt the data (plaintext)
2. An encryption algorithm that, with the encryption key, transforms the plaintext into cipher
text
3. A decryption key to decrypt the cipher text
4. A decryption algorithm that, with the decryption key, transforms the cipher text back into
plaintext.
The Data Encryption Standard (DES) is an approach which does both a substitution of characters
and a rearrangement of their order based on an encryption key.
Types of Cryptosystems
Cryptosystems can be categorized into two
1. Symmetric encryption – uses the same key for both encryption and decryption and relies on
safe communication lines for exchanging the key.
2. Asymmetric encryption – uses different keys for encryption and decryption e.g. RSA
Generally, symmetric algorithms are much faster to execute on a computer than asymmetric
ones. On the contrary, asymmetric algorithms are more secure than symmetric algorithms.

11.2. RAID technology (Redundant Array of Independent Disks)


The hardware that the DBMS is running on must be fault-tolerant, meaning that the DBMS
should continue to operate even if one of the hardware components fails. This suggests having
redundant components that can be seamlessly integrated into the working system whenever there
is one or more component failures. The main hardware components that should be fault-tolerant
include disk drives, disk controllers, CPU, power supplies, and cooling fans. Disk drives are the
most vulnerable components with the shortest times between failures of any of the hardware
components.
RAID works by having a large disk array comprising an arrangement of several independent
disks that are organized to improve reliability and at the same time increase performance.
 Performance is increased through data striping.
Data striping – the data is segmented into equal-size partitions (the striping unit) which are
transparently distributed across multiple disks.

11.3. Security at different Levels of Data


Almost all RDBMSs provide security at different levels and formats of data. This includes:
1. Relation Level: permission to have access to a specific relation.
2. View Level: permission to data included in the view and not in the named relations
3. Hybrid (Relation/View): the case where only part of a single relation is made available to
users through View.
Any database access request will have the following three major components:
1. Requested Operation: what kind of operation is requested by a specific query?
2. Requested Object: on which resource or data of the database is the operation sought to be
applied?
3. Requesting User: who is the user requesting the operation on the specified object?
The database should be able to check for all the three components before processing any request.
The checking is performed by the security subsystem of the DBMS.
Authentication
All users of the database have different access levels and permissions for different data objects;
authentication is the process of checking whether the user is the one with the privilege for the
requested access level, i.e., the process of checking that users are who they say they are. Each
user is given a unique identifier, which is used by the operating system to determine who they
are; the system thus checks whether the user with a specific username and password is entitled to
use the resource. Associated with each identifier is a password, chosen by the user and known to
the operating system, which must be supplied to enable the operating system to authenticate who
the user claims to be.
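A hedged, Oracle-style sketch of creating an identifier/password pair for a new user (the names
are assumptions, and syntax varies across DBMSs):

CREATE USER abebe IDENTIFIED BY Str0ngPassw0rd;  -- unique identifier plus password
GRANT CREATE SESSION TO abebe;                   -- permission to log in at all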
Authorization/Privilege
Authorization refers to the process that determines the mode in which a particular (previously
authenticated) client is allowed to access a specific resource controlled by a server. Most of the
time, authorization is implemented by using Views.
Views are virtual relations containing part of one or more base relations, creating a
customized/personalized view for different users.
Views are used to hide data that a user need not see.
Forms of user authorization
There are different forms of user authorization on the resources of the database. These forms are
privileges on what operations are allowed on a specific data object.
User authorization on the data/extension:
1. Read Authorization: the user with this privilege is allowed only to read the content of the
data object.
2. Insert Authorization: the user with this privilege is allowed only to insert new records or
items to the data object.
3. Update Authorization: users with this privilege are allowed to modify content of attributes
but are not authorized to delete the records.
4. Delete Authorization: users with this privilege are only allowed to delete a record and not
anything else.
Different users, depending on the power of the user, can have one or the combination of the
above forms of authorization on different data objects.
User authorization on the database schema
1. Index Authorization: deals with permission to create as well as delete an index on a relation.
2. Resource Authorization: deals with permission to add/create a new relation in the database.
3. Alteration Authorization: deals with permission to add as well as delete attributes.
4. Drop Authorization: deals with permission to delete an existing relation.
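A hedged SQL sketch of both groups of privileges; the GRANT statements on the data follow
standard SQL, while the schema-level privilege names follow Oracle and vary across DBMSs
(the user and table names are assumptions):

-- authorization on the data (extension):
GRANT SELECT ON employee TO abebe;   -- read authorization
GRANT INSERT ON employee TO abebe;   -- insert authorization
GRANT UPDATE ON employee TO abebe;   -- update authorization
GRANT DELETE ON employee TO abebe;   -- delete authorization
-- authorization on the database schema:
GRANT INDEX ON employee TO abebe;    -- index authorization
GRANT CREATE TABLE TO abebe;         -- resource authorization
GRANT ALTER ON employee TO abebe;    -- alteration authorization
GRANT DROP ANY TABLE TO abebe;       -- drop authorization (a system-wide privilege in Oracle)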
11.4. Role of DBA in Database Security
The database administrator is responsible for making the database as secure as possible. For this,
the DBA should have more powerful privileges than every other user. The DBA provides
capabilities for database users while they access the content of the database. The major
responsibilities of the DBA in relation to the authorization of users are:
1. Account Creation: creating different accounts for different USERS as well as USER
GROUPS.
2. Security Level Assignment: assigning different users to different categories of access levels.
3. Privilege Grant: giving different levels of privileges to different users and user groups.
4. Privilege Revocation: denying or canceling previously granted privileges of users for various
reasons.
5. Account Deletion: deleting an existing account of a user or user group; this is similar to
denying all privileges of the user on the database.
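These responsibilities map naturally onto SQL statements. A hedged, Oracle-style sketch
follows; the user, table, role and password names are assumptions, and security level assignment
(2) is typically achieved by granting a pre-defined role:

CREATE USER kebede IDENTIFIED BY InitPass1;   -- 1. account creation
GRANT clerk TO kebede;                        -- 2. security level assignment via a role (assumed to exist)
GRANT SELECT, INSERT ON employee TO kebede;   -- 3. privilege grant
REVOKE INSERT ON employee FROM kebede;        -- 4. privilege revocation
DROP USER kebede;                             -- 5. account deletion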

11.5. Approaches to Database Security


There are two broader approaches to security. The two types of database security mechanisms
are:
1. Discretionary security mechanisms
Grant different privileges to different users and user groups on various data objects
The privilege is to access different data objects
The mode of the privilege could be
 Read,
 Insert,
 Delete,
 Update files, records or fields.
 Is more flexible
 One user can have A but not B and another user can have B but not A
2. Mandatory security mechanisms
 Enforce multilevel security
 Classifying data and users into various security classes (or levels) and implementing the
appropriate security policy of the organization.
 Each data object will have certain classification level
 Each user is given certain clearance level
 Only users who can pass the clearance level can access the data object
 Is comparatively not-flexible/rigid
If one user can have A but not B, then B is accessed only by users with a higher privilege, and we
cannot have a user with B but not A.
The ability to classify users into a hierarchy of groups provides a powerful tool for administering
large systems with thousands of users and objects. A database system can support one or both of
the security mechanisms to protect the data. In most systems it is better to specify what is
allowed rather than to identify what is not allowed, since anything not explicitly authorized is
then constrained.
