BCA 202 Database Management System

CONTENTS

Unit 1: Introduction of Database Management System
    1.0 Introduction
    1.1 Difference between Data, Information, Data Processing and Data Management
    1.2 File Oriented Approach, Database Oriented Approach to Data Management
    1.3 Database Schema and Instances
    1.4 Database Users
    1.5 3-Tier Architecture of DBMS and Data Independence
    1.6 Types of Database System
    1.7 Database Languages
    1.8 DBMS Interfaces

Unit 2: Data Modelling
    2.0 Introduction
    2.1 Unit Objectives
    2.2 Data Modeling: Data Models
    2.3 Logical Data Modeling
        2.3.1 Hierarchical Data Model
        2.3.2 Network Data Model
        2.3.3 Relational Data Model
        Advantages and Disadvantages of Logical Data Modeling
    2.4 Conceptual Data Modeling
        2.4.1 Entity Relationship Model
        2.4.2 Entities
        2.4.3 Attributes
        2.4.4 Types of Attributes
        2.4.5 Relationships
        2.4.6 Degree of Relationship Set
    2.5 Mapping Cardinalities, Keys
        2.5.1 ER Diagram Notations
    2.6 Case Studies on ERD

Unit 3: Normalization
    3.0 Introduction
    3.1 Unit Objectives
    3.2 Normalization: Keys: Composite, Candidate, Primary, Secondary, Foreign, Super Key
    3.3 CODD's Rules, Mapping Conceptual Model into Relational Model
    3.4 Functional Dependencies
    3.5 Decomposition
    3.6 Lossy and Lossless Decomposition, Dependency Preserving Decomposition
    3.7 Advantages and Disadvantages of Normalization
    3.8 Normal Forms (1NF, 2NF, 3NF), Case Studies on Normalization

Unit 4: File Structures and Data Administration
    4.0 Introduction
    4.1 Unit Objectives
    4.2 File Structures and Data Administration: File Organization
    4.3 Overview of Physical Storage Media, Magnetic Disk, RAID, Tertiary Storage
        4.3.1 Storage Access
    4.4 Data Dictionary Storage
    4.5 Organization of File (Sequential, Clustering)
    4.6 Indexing and Hashing
    4.7 Basic Concepts, Indices
    4.8 B+ Tree Index File
    4.9 B-Tree Index File
    4.10 Static Hashing
    4.11 Dynamic Hashing
    4.12 Data Administration, Role and Responsibility of DBA

Unit 5: Transaction and Concurrency Control
    5.0 Introduction
    5.1 Unit Objectives
    5.2 Transaction and Concurrency Control, Multiprogramming and Multiprocessing
    5.3 Concept of Transaction
        5.3.1 Transaction State
    5.4 ACID Properties
    5.5 Schedules
    5.6 Serializability of Schedules
    5.7 Concurrency Control
        5.7.1 Lock-Based Protocols
        5.7.2 Timestamp-Based Protocols
    5.8 Deadlock and Its Handling
        5.8.1 Wait-Die and Wound-Wait, Deadlock Prevention without Using Timestamps
    5.9 Deadlock Detection and Timeouts

Unit 6: Database Recovery and Security Management
    6.0 Introduction
    6.1 Unit Objectives
    6.2 Database Recovery and Security Management: Database Recovery
    6.3 Types of Failures, and Data Access
    6.4 Recovery and Atomicity
    6.5 Recovery Techniques Algorithms: Log Based Recovery, Check Points

UNIT - 1
INTRODUCTION OF DATABASE MANAGEMENT SYSTEM

Structure
1.0 Introduction
1.1 Difference between Data, information, Data processing and Data
Management.
1.2 File Oriented Approach, Database oriented approach to Data
Management
1.3 Database schema and instances
1.4 Database users
1.5 3 tier architecture of DBMS and Data Independence
1.6 Types of Database System
1.7 Database Languages
1.8 DBMS interfaces.

Unit Objectives
After going through this unit, you will be able to:
- Define the concepts of data, data management and information
- Describe different approaches to data management
- Understand database schemas and instances
- Comprehend the 3-tier architecture of DBMS and data independence
- Identify database users
- Distinguish different types of database systems
- Understand the different database languages
- Explain DBMS interfaces

1.0 INTRODUCTION
In this unit, you will learn about data, information, data processing and data management. Organizations cannot make good and effective decisions at the proper moments if data is invalid or incorrect. You will also learn about different database users and get an overview of database systems.

1.1 DATA

Data is an indispensable resource for any organization and is generated for and as a result of its operations. Data provides correct information for better utilization of other organizational resources. An organization cannot make good and effective decisions if the required data is not available in time and in the correct and desired format. The data that an organization necessarily maintains for ensuring its smooth operation is termed operational data. Sometimes, data and information are used interchangeably. Data means known facts that can be recorded.
Data is a collection of facts in raw form that becomes information after processing.
Information: Information refers to processed, organized or summarized data.
The activity of processing data using a computer is called data processing. Data processing consists of three sub-activities: capturing input data, manipulating the data, and managing output results. As used in data processing, information is data arranged in an order and form that is useful to the people receiving it. Data is the raw material used as input to data processing, and information is the processed data obtained as the output of data processing.

For example:
- If we write 200, it is data, but if we write Rs. 200, it is definitely information.
- At home, a mother keeps a record of how much milk is consumed every day; sometimes there are guests and she requires more milk. At the end of the month she collects the data, processes it and prepares a bill to be given to the milkman, i.e. information.
- Suppose a vacancy has been created in a college for the post of professor, and the eligibility requirement is that the candidate should have a Ph.D. degree. Several applicants apply for this post. All these applications are initially considered data, because some applications may not fulfil the eligibility criteria. Only the applications short-listed as per the requirement are considered information.
Data is unprocessed material while information is processed material.

Data Processing and Data Management

Data Management is the process of developing data architectures in which practices and procedures deal with data execution on a regular basis. It basically aims at defining the way data is to be stored in a structured manner and the process or procedure that should be followed on a regular basis. The terms related to data management are data modeling, data warehousing, data movement, database administration and data mining.

1.2 APPROACHES TO DATA MANAGEMENT

File-based Systems
Conventionally, before database systems were introduced, data in software systems was maintained using files. In this approach, the needed data is stored locally and programs are developed for each type of application. The following terms are related to a file processing system:
- Data is provided to the computer in terms of bytes.
- A data item is the smallest named unit of data that has meaning in the real world, e.g. employee name, address and employee code. Traditionally, the data item is called a field in data processing and is the smallest unit of data that has meaning for its users. A record is a set of related data items (fields) organized in a well-defined structure and treated as a unit by an application program. The structure of the record depends on:
- Items to be included, their characteristics, variability and prevalence in the file
- The intended uses of the information, time requirements, and overlapping or conflicting requirements
- Update characteristics, e.g. frequency, growth, deletion and modification, time constraints, etc.
- Links to other records/files/systems
- Computing environment, such as operating systems, languages, hardware, etc.
- A file is a collection of all occurrences of records of a similar type, i.e. it is a group of records of a single type.
In a file-based system, an organization stores information as groups of records in separate files. These file-based systems consist of a few data files and many application programs. Data files are organized to facilitate access to records and ensure their efficient storage. In a file-based system, there are two types of records: logical records and physical records. A logical record is a collection of related data items treated as a conceptual unit independent of how or where the information is stored. A physical record is a contiguous area of storage space defined by the characteristics of the storage devices and operating system, and includes all the bytes which will be read/written by a single command to the I/O device. A file processing system relies on the basic access methods of the host operating system for data management.
In a file processing system, the application program can only make a request for data from the operating system by demanding a specific logical record. The relationship between physical and logical records, and the location of the physical records in a specific file, is kept track of by the operating system. Locating fields within the logical records is an activity taken care of by the application program.
In a file processing system, a program takes less time to execute than an equivalent program written in a high-level language, as algorithms for sort, merge and report generation are already built into the file management software. Besides, the cost of software development is less. Such file-based approaches provide improved automated data processing compared with earlier manual, paper-record-based systems. However, in view of the demand for increased efficiency and speed, a file processing system suffers from some significant disadvantages, which are as follows:
- Data redundancy and inconsistency: A major drawback of the traditional file system environment is the non-sharing of data. If different systems of an organization use some common data, then instead of storing it once and sharing it, each system stores the data in separate files. Often, the same information is stored in more than one file. This redundancy of storing the same data in multiple places leads to several problems. Firstly, it leads to wastage of storage space and poses a serious problem if the file processing system has to manage a large amount of data. Secondly, it leads to duplication of effort, as one needs to perform several updates. Thirdly, data inconsistency leads to a number of problems, including loss of information and incorrect results.
- Difficulty in accessing data: A file-based system does not provide users with ad hoc information requests, as most of the information retrieval possibilities are implemented based on pre-determined requests for data. In today's competitive and fast business environment, responding to unexpected queries, in addition to such regularly scheduled requests, is also indispensable. A new application program often requires an entirely new set of file definitions. Even though an existing file may contain some of the needed data items, the application often requires a number of other data items as well as definitions of new data items, thereby requiring excessive programming effort. Therefore, conventional file processing systems do not let needed data be retrieved conveniently and efficiently.
- Data isolation: Data is scattered in various files, in different formats. In a file processing system, it is difficult to determine relationships between isolated data in order to meet user requirements. To tackle such situations, the files have to be evaluated by analysts and programmers to determine the specific data requirements from each file and the relationships between the data. Then applications have to be written in a third-generation language for processing and extracting the needed data. Imagine the work involved if data from several files was needed.
- Program-data interdependency: In a file-based system, data is organized in the form of files and records. Files and records are described by specific formats and access strategies, which are coded into application programs by the programmer. Such application programs are data dependent. Consequently, this type of environment requires that appropriate changes be made in all the concerned application programs if there is any change in the data structure. This makes a change very painful or problematic for the designers or developers of the system.
- Atomicity problem: In many applications which have already been implemented, it is not easy to ensure that the data is restored to the last consistent state following the detection of a failure.
- Integrity problem: Data integrity means correctness or accuracy of data. The data values stored in a file system may have to satisfy certain types of constraints. Programmers enforce these constraints by adding appropriate code to the various application programs. However, when new constraints are to be added, it is difficult to change the programs to enforce them with little effort and time.
- Security problems: Conventional systems develop applications in an ad hoc manner. At times, different components of the operational data are accessed by different parts of a firm. In an environment of this type, it can be quite difficult to ensure/enforce security.
- Concurrent access anomalies: If the system allows multiple users to update the data simultaneously, the interaction of concurrent updates may result in inconsistent data. In the traditional file system, such concurrent access anomalies cannot be avoided without huge programming effort.
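To make the anomaly concrete, here is a minimal sketch (not from the original text) of a lost update, written as two interleaved SQL sessions against a hypothetical account table; the table and column names are assumptions for illustration only.

    -- Hypothetical table: account(acc_no, balance); account 101 starts with balance 1000.

    -- Session 1 reads the balance and sees 1000
    SELECT balance FROM account WHERE acc_no = 101;
    -- Session 2 reads the same balance and also sees 1000
    SELECT balance FROM account WHERE acc_no = 101;

    -- Session 1 writes back 1000 + 500
    UPDATE account SET balance = 1500 WHERE acc_no = 101;
    -- Session 2 writes back 1000 + 200, overwriting Session 1's update
    UPDATE account SET balance = 1200 WHERE acc_no = 101;

    -- The final balance is 1200 instead of the correct 1700: Session 1's update is lost.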

All the above limitations can be attributed to the following main factors:
- The definition of data is embedded in application programs rather than stored separately. Thus, it is mandatory that appropriate changes are made to the application programs if there is a change in either the format or the structure of the data.
- There is no control over the access and manipulation of data beyond
that imposed by application programs.

Database Management System

With a Database Management System, or DBMS as it is referred to in short, programs do not deal with stored data by its location; instead, the DBMS provides software that allows application programs to deal with data field names irrespective of the location of the fields within the records, the location of the records within a file, and the file within a device. The DBMS integrates all files into one system, making data management more efficient by providing centralized control over the operational data. Database management systems are used not only in commercial applications but also in many scientific/engineering applications.

Database
A database implies a place where data can be stored in a structured manner. It is a shared collection or batch of data that is logically related, along with descriptions, designed to meet the information requirements of an organization.
The files contain information such as:
1. The set of data available to the user, the so-called "end user data". This is the real data, which can be read, connected and modified by the user.
2. The so-called "metadata". Here, the properties and the relative relations of the end user data are described.
Database Management System (DBMS)
A DBMS is a software system that allows users not only to define and create a database but also to maintain it and control access to it. A Database Management System can be called a collection of interrelated data and a collection or set of programs that helps in accessing, updating and managing that data.
Management of the data means specifying how data will be stored, structured and accessed in the database. This includes the following:
- Defining: Specifying data types, structures and constraints for the data to be stored.
- Constructing: Storing data in a storage medium.
- Manipulating: Querying, updating and generating reports.
- Sharing: Allowing multiple users and programs to access data simultaneously.
Management of database users means managing the users in such a way that they are able to perform any desired operation on the database.
DBMSs available in the market include Access, dBase, FileMaker Pro, ORACLE, DB2, Ingres, Informix, Sybase, etc. A database system is the collection of application programs that interact with the database, together with the DBMS and the database itself. Database systems are designed in a manner that facilitates the management of huge bodies of information.

1.3 SCHEMAS AND INSTANCES

Schemas, Subschemas and Instances (State of a Database)

The overall description of a database is called the database schema. It is specified at the time the database is designed and should not change very frequently. The values of a data item can be fitted into this framework. The following types of information are included in a database schema:
- Characteristics of data items
- Logical structure and relationships among those data items
- Format for storage representation
- Integrity parameters, authorization and backup policies
A subschema is a proper subset of the schema designed to support 'views' belonging to different classes of users in order to hide or protect information. It refers to the same view as the schema but only for the data types and record types which are used in a particular application or by a particular user.
The database changes with time as information is inserted, deleted or updated. The collection of data or information stored in the database at a particular moment of time is called an instance, state or snapshot of the database. The database schema and the database state are two different things. While a new database is being defined, only the database schema is specified to the DBMS. The existing state of the database, with no data, is the empty state. We get the initial state of the database when data is first inserted. The DBMS is responsible for ensuring that every state is a valid state satisfying the structure and constraints specified in the schema. Sometimes, the schema is referred to as the intension, and a database state as an extension of that schema.
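As a small illustration (not part of the original text), the SQL sketch below separates the two ideas: the CREATE TABLE statement contributes to the schema (the intension), while the rows inserted afterwards form a particular state or instance (the extension). The table and column names are assumptions chosen for the example.

    -- Schema (intension): defined once, when the database is designed
    CREATE TABLE student (
        stud_id   INT PRIMARY KEY,
        stud_name VARCHAR(50),
        stud_age  INT
    );

    -- Instance (extension): the data stored at a particular moment; it changes over time
    INSERT INTO student VALUES (101, 'Prem', 20);
    INSERT INTO student VALUES (102, 'Asha', 21);
    -- After these inserts the database has moved from the empty state to its initial state.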

Logical and Physical View of Data

Clearly separating the logical and physical structure of data is one of the main features of the database approach. The term "logical structure" indicates the manner in which programmers view the data, whereas the physical structure refers to the manner in which data is actually stored on the storage medium.
A logical view of data expresses the way a user thinks about data. Usually, it is expressed in terms of a data model.
A physical view of data is the way data is handled at a low level, i.e. its retrieval and storage. It is expressed in terms of specific locations on storage devices plus the techniques used to access it.
A set of logical constructs that can help describe the structure of a database, its data types, constraints and relationships is referred to as a data model. It also includes a set of basic operations that specify updates and retrievals on the database.
The set of principles that defines a data model may be divided into the following three major parts:
Data definition: a set of principles concerned with how data is structured.
Data manipulation: a set of principles concerned with how data is operated upon.
Data integrity: a set of principles concerned with determining which states are valid for a database.

Data Dictionary
A data dictionary can be treated as a special file which stores information about the organization and usage of the data contained in the database. This information is called metadata (data about data). It is sometimes termed the system catalog that documents the data in the database. In a DBMS, the detailed structure and organization of each file are stored in the system catalog. The terms system catalog and data dictionary are often used interchangeably. A system catalog is a repository that integrates metadata; a data dictionary is a repository that manages metadata and is the part of the system catalog that is generated for each database. A data dictionary can function in either of the following ways:
Active (Integrated): This is always consistent with the current database structure and definition, and is maintained automatically by the system itself.
Passive (Non-Integrated): This is used only for documentation purposes and is not used by the DBMS software. It is simply a self-contained application and a set of files used for documenting the data processing environment. It is not automatically kept consistent; it is managed by the users of the system and modified whenever the structure of the database is changed.
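As a hedged illustration (not from the original), many relational DBMSs expose their catalog through the SQL-standard INFORMATION_SCHEMA views; the query below reads metadata about an assumed student table. The exact catalog views available vary from product to product.

    -- Reading metadata (data about data) from the system catalog
    SELECT column_name, data_type, is_nullable
    FROM   information_schema.columns
    WHERE  table_name = 'student';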

Need for DBMS


Consider part of a savings bank enterprise that keeps information about all
customers and savings accounts in permanent system files at the bank. In
addition, the system has a number of application programs that allow the user to
manipulate the files, including:
- A program to debit or credit an account.
- A program to add a new account.
- A program to find the balance of an account
- A program to generate monthly statements.
These application programs have been written by system programmers in response to the needs of the bank organization.
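Purely for illustration (not part of the original example), the statements below sketch what the debit/credit and balance-enquiry programs might issue against a hypothetical account table; the names and amounts are assumptions.

    -- Hypothetical table: account(acc_no, cust_name, balance)

    -- Credit (deposit) Rs. 500 into account 101
    UPDATE account SET balance = balance + 500 WHERE acc_no = 101;

    -- Debit (withdraw) Rs. 200 from account 101
    UPDATE account SET balance = balance - 200 WHERE acc_no = 101;

    -- Find the balance of account 101
    SELECT balance FROM account WHERE acc_no = 101;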

Characteristics of a Database
There are characteristics that distinguish the database approach from the traditional approach of programming with files. In traditional file processing, each user defines and implements the files needed for a specific software application as part of programming the application. For example, programs written to print a student's transcript and to enter new grades into the file are part of one application, while another user may keep track of students' fees and their payments. Both users are interested in data about students, but each maintains separate files and programs to manipulate these files, because each requires some data not available in the other user's files.
This redundancy in defining and storing data results in wasted storage space and in redundant efforts to keep common data up to date. In the database approach, a single database is maintained and then accessed by various users. The labels or names of data are defined once and used many times by retrieval queries. The database approach differs from the file processing approach as follows:

1. Self-describing nature of a database system

The database system contains not only the database but also the definition or description of the database structure and constraints. This definition is stored in the DBMS catalog, which contains information such as the structure of each file, the type and storage format of each data item, and constraints on the data. In short, the information stored in the catalog is called metadata, that is, data about data. The catalog is used by the DBMS software and by database users who need information about the database structure. In traditional file processing, data definition is part of the application programs. The programs are constrained to work with only one specific database, whose structure is declared in the application programs; a COBOL program, for example, has DATA DIVISION statements to define its files. Whereas file processing software can access only specific databases, DBMS software can access a variety of databases by extracting the database definitions from the catalog and then using these definitions.

1.4 DATABASE USERS

Database users are as follows:

- Database Designer: Before implementing a database and populating it with data, it is required to identify the data to be stored in the database and the appropriate structure to represent and store that data. A database designer does these tasks before the database is actually implemented and populated with data. Designers analyze and integrate all external views from different groups of users along with their data and processing requirements. They are also responsible for identifying the integrity constraints during database design.
- Data Administrator: A non-technical person who is responsible for managing the data resource of an organization and decides on policies is called a data administrator.
- Application Programmer: A computer professional who writes application programs through which a user can interact with the system is called an application programmer. Application programmers use tools such as RAD tools to construct forms and reports. They interact with the system through DML calls, which are embedded in programs written in host languages. These users have a clear idea of the structure of the database and know clearly about the needs of the organization.
- Database Administrator (DBA): The Database Administrator (DBA) is a person or a group of people responsible for the overall control of the database system. Once the database is installed and is functioning properly in the production environment of an organization, the database administrator takes charge and performs specific DBA-related activities, including:
    - Database maintenance
    - Database backup
    - Granting of rights to database users
    - Managing print jobs
    - Ensuring quality of service to all users
Routine work includes acting as a liaison with users to ensure that the data they require is available, and writing the necessary external schemas and conceptual/external mappings; and responding to modifications in requirements, monitoring performance and modifying details related to storage and access, thereby organizing the system in a manner that will provide the best performance for the organization.
End users: People who use the DBMS for querying, updating and generating reports are called end users. They are classified into:
1. Naïve users, who interact with the system by invoking one of the permanent application programs that have been written previously by an application programmer. Example: tellers in a bank.
2. Sophisticated users, who interact with the system without writing programs. Instead, they form their requests using a database language, especially DML. The tools they use are OLAP and data mining tools, which help them summarize data in different patterns; they occasionally access the database using a database query language.
3. Casual users, who access the database occasionally and have different needs each time; they use a sophisticated query language.

1.5 DATABASE ARCHITECTURE AND DATA INDEPENDENCE

DBMSs are very complex systems. To study the database concepts and the structure and capabilities of a DBMS, you have to study the architecture of the database management system. There are two ways of looking at this architecture:

1. The logical DBMS architecture deals with the storage and presentation of data, that is, the specific way data is stored in the database and presented to users.
2. The physical DBMS architecture deals with the software components that make up a DBMS.

Logical DBMS Architecture


This architecture was suggested by ANSI/SPARC (American National Standards Institute / Standards Planning and Requirements Committee). This logical architecture is known as the three-level architecture.
The logical architecture describes how users perceive data in a database. Users can access, manipulate and update data without knowing the details of where it is located, stored and maintained. A DBMS provides the user with an abstract view of data. This is done by defining levels of abstraction.

Levels of Abstraction

The three levels of abstraction are:
1. The physical or internal level is the lowest level of abstraction. It provides a low-level description of the physical database. It describes how data is actually stored on the physical media.
2. The logical or conceptual level is the next level of abstraction. It tells us what data is stored and how that data is interrelated.
3. The external or view level is the highest level of abstraction. It provides a window on the conceptual view. This window enables a user to view only the data in which he is interested.
The three-level database architecture allows a clear separation of the data representation as the users see it from the physical data structure layout. This separation of different views of data is flexible and adaptable. This adaptability and flexibility is known as data independence.
Since a schema defines each view, there are several schemas in the database
partitioned according to the levels of abstraction. The internal schema expresses
the internal view. It contains the definition of the stored record, the method of
representing the data fields and the access aids used. The conceptual schema
defines this conceptual view. There is only one conceptual schema per database.
Each external view is described by means of a schema called an external schema
or a subschema.
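As an illustrative sketch (not in the original text), an external view can be defined in SQL with CREATE VIEW over a conceptual-level table, exposing only the data a particular class of users is interested in; the table, view and column names are assumed for the example.

    -- Conceptual level: the full staff table
    CREATE TABLE staff (
        staff_id   INT PRIMARY KEY,
        staff_name VARCHAR(50),
        dept       VARCHAR(30),
        salary     DECIMAL(10,2)
    );

    -- External level: a view for users who may see names and departments but not salaries
    CREATE VIEW staff_public AS
        SELECT staff_id, staff_name, dept
        FROM   staff;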

Mapping between the Levels


Mapping is the transformation of requests and results between different levels of abstraction. Mapping can be conceptual/internal or conceptual/external.
The conceptual/internal mapping exists between the conceptual and internal levels. It defines the correspondence between the data structures and files of the internal view and the fields and records of the conceptual view.
If a modification is made to the structure of the stored database, a change must be made in the conceptual/internal mapping to ensure that the view from the conceptual level remains the same. If the physical structure of the database gets modified, the DBMS has knowledge of these modifications and continues to provide the same logical view as before the changes. This is called physical data independence.
The conceptual/external mapping exists between the external and conceptual levels. It defines the correspondence between the conceptual view and a particular external view.
1.6 TYPES OF DATABASE SYSTEM
1. Hierarchical databases
In a hierarchical database, records contain information about their groups of parent/child relationships, just like a tree structure. The structure implies that a record can also have repeating information. In this structure, data follows a series of records, each with a set of field values attached to it. All records of the same kind are collected together as a record type. These record types are the equivalent of tables in the relational model, with the individual records being the equivalent of rows. To create links between these record types, the hierarchical model uses parent/child relationships.

Advantage
A hierarchical database can be accessed and updated easily because in this model the structure is like a tree and the relationships between records are defined in advance.

Disadvantage
The drawback of this type of database structure is that each child in the tree may have only one parent, and relationships or linkages between children are not permitted, even if they make sense from a logical standpoint. Hierarchical databases are rigid in their design: adding a new field or record requires that the entire database be restructured.

2. Network Database
Network databases are mainly used on large digital computers. Since more connections can be made between different types of data, network databases are considered more efficient.
Data in the network model is represented by collections of records, and relationships among data are represented by links, which can be viewed as pointers. The records in the database are organized as collections of arbitrary graphs.
However, instead of using a single-parent tree hierarchy, the network model uses set theory to provide a tree-like hierarchy, with the exception that child tables are allowed to have more than one parent. It supports many-to-many relationships.

3. Relational Database
In relational databases, the relationship between data files is relational. Hierarchical and network databases require the user to traverse a hierarchy in order to access needed data. Relational databases connect the data in different files by using common data values or a key field. Data in relational databases is stored in different tables, each having a key field that uniquely identifies each row. Relational databases are more reliable than either the hierarchical or network database structures. In relational databases, data and relationships among data are represented by a collection of tables, each of which has a number of columns with unique names. Tables filled with data are called relations; a tuple designates a row or record, and columns are referred to as attributes or fields.
Relational databases work on the principle that each table has a key field that uniquely identifies each row, and that these key fields can be used to connect one table of data to another.
Relational databases are popular for two major reasons:
1. Relational databases can be used with little or no training.
2. Database entries can be modified without specifying the entire body of data.
Object-Oriented Model
In this model we have the functionality of object-oriented programming; it offers more than the storage of programming language objects. Object DBMSs add database semantics to languages such as C++ and Java. They provide full-featured database programming capability while retaining native language compatibility; they add database functionality to object programming languages. This approach unifies application and database development into a consistent data model and language environment. Applications require less code, use more natural data modeling, and code bases are easier to maintain. Object developers can write complete database applications with a modest amount of additional effort.
The object-oriented database is derived from the integration of object-oriented programming language systems and persistent systems. The power of object-oriented databases comes from the seamless treatment of both persistent data, as found in databases, and transient data, as found in executing programs.

1.7 DATABASE LANGUAGES

The Data Definition Language is used to specify the schema (design) of a database, while the Data Manipulation Language is used to fire queries (commands) on the database in order to manipulate it. Both are the main pillars of a database language like SQL.
The database languages are categorized as DDL, DML, DCL and TCL.

Data Definition Language (DDL)

- The DDL specifies the schema of a database by a set of definitions.
- This language allows users to define data and their relationships to other types of data. It is used to create data tables, dictionaries and files within databases.
- The DDL is also used to specify the structure of each table, the set of values associated with each attribute, integrity constraints, security and authorization information for all the tables, and the physical storage structure of all the tables on the disk.
- The data values stored in the database should satisfy some constraints for consistency purposes.
Example: The salary of an employee should not be negative, or the employee id should not start with 'E', etc. The DDL provides functionality to specify such constraints.
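To make this concrete, here is a small sketch (not from the original text) of DDL statements that create a table together with the kinds of constraints mentioned above; the table and column names are assumptions.

    -- DDL: define the table structure together with its constraints
    CREATE TABLE employee (
        emp_id   VARCHAR(10) PRIMARY KEY
                 CHECK (emp_id NOT LIKE 'E%'),       -- employee id should not start with 'E'
        emp_name VARCHAR(50) NOT NULL,
        salary   DECIMAL(10,2) CHECK (salary >= 0)   -- salary should not be negative
    );

    -- DDL can also change structures later
    ALTER TABLE employee ADD dept VARCHAR(30);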
Data Manipulation Language (DML)
The Data Manipulation Language (DML) is used for accessing and manipulating data in a database. DML provides a set of functionalities to support the basic data manipulation operations on the data stored in the database.
- DML is basically of two types:
1) Procedural DML: The user is required to specify which data is needed and how to get that data.
2) Declarative DML: The user is required to specify which data is needed without specifying how to get that data. This is easier to understand and use than procedural DML.
- The concept of a query is used for processing. A query is a statement or command which requests the retrieval of information from the database. The part of DML involved in information retrieval is known as the query language. The three levels of abstraction are used in defining the structure of data as well as in manipulating the data.
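As a short illustration (not part of the original text), the following declarative DML statements operate on the employee table assumed in the DDL sketch above.

    -- Insert a row
    INSERT INTO employee (emp_id, emp_name, salary) VALUES ('A101', 'Prem', 25000);

    -- Query: a statement that requests the retrieval of information
    SELECT emp_name, salary FROM employee WHERE salary > 20000;

    -- Update and delete existing data
    UPDATE employee SET salary = salary * 1.10 WHERE emp_id = 'A101';
    DELETE FROM employee WHERE emp_id = 'A101';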

The three-tier Architecture

A software application is created using programming languages (called the frontend) and a database (called the backend). In every software application we have to implement the following three layers:
1. User Layer (presentation layer): It is also called the client layer and contains the user interface of the application. This layer is used for design purposes. In this layer, data is presented to the user and input can also be accepted from the user.
For example, in banking software, the registration form of an account holder can be considered part of the user layer.
2. Business Layer: In this layer we write all the business logic, such as validation of data, calculations, data insertion, etc. It acts as an interface between the client layer and the data access layer. This layer, also called the intermediary layer, helps to make communication between the client and data layers faster.
3. Data Layer: In this layer the actual database is in focus. The data access layer contains methods to connect with the database and to insert, update, delete and get data from the database based on the input data.
The types of database architecture depend upon how these three layers are implemented.

The three-tier architecture is the most widely used architecture in today's world.
- In this architecture the user layer, business layer and data layer are implemented independently by three different applications.
- The data required by the business logic exists in the database server.
- In the three-tier architecture, all layers interact with each other independently.

Figure : Three tiered architecture

Advantages of three-tier architecture

1. In the three-tier architecture we can manage the data independently.
2. We can make changes in the presentation layer without affecting the other two tiers.
3. As each tier is independent, it is possible to use different groups of developers.
4. It is more secure since the client does not have direct access to the database layer.
5. When one tier fails there is no data loss, because you can always continue by accessing another tier.
6. Due to distributed deployment of the application server, scalability is increased.
7. Similar logic can be used in various applications; it is reusable.
Disadvantages of three-tier architecture
1. It is a more complex structure.
2. It is more difficult to set up and maintain.
3. The physical separation of the tiers may affect performance.

Example
Explain the 3-tier web architecture for an online shopping database system.
In this online shopping system, a .NET environment is used as the frontend, while MS SQL Server 2008 is used as the backend database.

The three layers

1. User Layer (presentation layer): This layer consists of the user interface designed for interaction with the end user. This layer is created in ASP.NET. It includes the screens which will be used by the end user for shopping. These screens show the products with details as per their categories. The user can select products to purchase and add them to the cart. This design is created with the advanced controls available in ASP.NET.
2. Business Layer: This layer consists of the validation-checking code related to the user's product selection. Accidentally, a user may select the wrong number of products to purchase; for example, if a user places an order for 10000 TV sets, then the order should be validated. This logical code is implemented using C#.NET.
3. Data Layer: This layer contains the code interacting with the database on the server, for example, accessing product details from the database and inserting the transaction details of a user's order into the database. The database handling is implemented using ADO.NET. Here all three layers work independently and efficiently.
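Purely as an illustration of the kind of statements the data layer would send to the database server, here is a hedged sketch assuming hypothetical product and order_detail tables; it is not the actual code of the system described above.

    -- Fetch product details for the presentation layer, by category
    SELECT product_id, product_name, price
    FROM   product
    WHERE  category = 'Television';

    -- Record a validated order placed through the business layer
    INSERT INTO order_detail (order_id, product_id, quantity, order_date)
    VALUES (5001, 42, 2, CURRENT_DATE);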

Diagrammatic representation

1.8 DATABASE INTERFACES
1. Menu-Based Interfaces for Web Clients or Browsing
In this interface, the user is presented with lists of options, known as menus, which lead the user through the formulation of a request. Pull-down menus are extremely popular techniques in Web-based user interfaces.
They are also used within browsing interfaces that permit a user to look through the contents of a database in an exploratory and unstructured manner.

2. Forms-Based Interfaces
A forms-based interface is a kind of user interface in which the user interacts with the application by selecting one of a number of possible values and by entering text into the fields that accept it. In a word processor used to write documents, for example, we can have settings for the font size, the font to use, and the alignment of the paragraph on the page. Many databases support forms-based querying; example: a STUDENT ADMISSION ENROLLMENT form.

3. Graphical User Interfaces


A GUI is an interface that uses icons, i.e. small images, or other visual indicators to interact with electronic devices, rather than only text via a command line. For example, Microsoft Windows is a GUI, whereas MS-DOS is a command line interface.

4. Natural Language Interfaces


Natural Language Interface provides natural, human-like interaction
with any application. This makes the work effective, as it eliminates
the necessity to study special syntax of queries (e.g., Boolean operators
used in Google) and allows for detailed and precise description of the
requested information.

5. Interfaces for Parametric Users

Parametric users, such as bank tellers, often have a small set of operations that they must perform repeatedly. Systems analysts and programmers design and implement a special interface for each known class of naive users. Usually, a small set of abbreviated commands is included, with the goal of minimizing the number of keystrokes required for each request.

6. Interfaces for the DBA
Most database systems contain privileged commands that can be used only by the DBA's staff. These include commands for creating accounts, setting system parameters, granting account authorization, changing a schema, and reorganizing the storage structures of a database.

Summary
A database management system (DBMS) consists of a collection of
interrelated data and a collection of programs to access that data. The data
contains information about one particular enterprise. The primary goal of a
DBMS is to provide an environment which is both convenient and efficient to
use in retrieving and storing information.
Database systems are designed to manage large bodies of information. The
management of data involves both the definition of structures for the storage of
information and the provision of mechanisms for the manipulation of
information. The database system must provide for the safety of the information
stored, despite system crashes or attempts at unauthorized access. If data is to
be shared among several users, the system must avoid possible anomalous results.
A major purpose of a database system is to provide users with an abstract
view of the data. The system hides certain details of how the data is stored and
maintained. This is accomplished by defining three levels of abstraction at which
the database may be viewed: the physical level, the conceptual level and the
view level.
Databases change over time as information is inserted and deleted. The
collection of information stored in the database at a particular moment in time is
called an instance of the database. The ability to modify a scheme definition in
one level without affecting a scheme definition in the next higher level is called
data independence. There are two levels of data independence: physical data
independence and logical data independence.
A database scheme is specified by a set of definitions which are expressed
by a data definition language (DDL). DDL statements are compiled into a set of
tables which are stored in a special file called the data dictionary which contains
metadata.
A data manipulation language (DML) is a language that enables users to
access or manipulate data.
There are basically two types: procedural DMLs, which require a user to specify what data is needed and how to get it, and declarative (non-procedural) DMLs, which require a user to specify what data is needed without specifying how to get it.
Exercises
1. Explain data, information, data processing and data management.
2. Explain the difference between physical and logical data independence.
3. Why do we need a DBMS?
4. Explain different database languages.
5. Define database schema and instances.
6. Illustrate database interfaces with examples.

*****


UNIT - 2
DATA MODELLING

Structure
2.0 Introduction
2.1 Unit Objectives
2.2 Data Modeling: Data Models,
2.3 Logical Data Modeling:
2.3.1 Hierarchical Data Model
2.3.2 Network Data Model
2.3.3 Relational Data Model
Advantages and Disadvantages of Logical Data Modeling.
2.4 Conceptual Data Modeling
2.4.1 Entity Relationship Model
2.4.2 Entities
2.4.3 Attributes
2.4.4 Types of Attributes
2.4.5 Relationships
2.4.6 Degree of relationship Set
2.5 Mapping Cardinalities, Keys
2.5.1 ER Diagram Notations
2.6 Case studies on ERD.

2.0 INTRODUCTION

In this unit, you will learn about the basic concepts of data models, which are mainly concerned with logical database design and the rules for it.
This unit will also discuss the Entity-Relationship model, which facilitates database design by enabling the designer to express the logical properties of the database in an enterprise schema. The various data models supported by a DBMS are the hierarchical, network, relational and entity-relationship models.
2.1 UNIT OBJECTIVES
After going through this unit, you will be able to :
- Define the concept of data models
- Understand the Entity-Relationship model
- Describe and compare the hierarchical, network and relational data models
- Understand mapping cardinalities

2.2 DATA MODELS : BASIC CONCEPT OF DATA MODELS

The process of analyzing data objects and their relationships to other data objects is known as data modeling. It is the conceptual representation of data in the database and is the first step in database design. Data models define how data items are connected to each other and how they are processed and stored inside the system. A data model provides a way to describe the design of a database at the physical, logical and view levels.

2.3 TYPES OF DATA MODELS

There are different types of data models :

2.3.1 Hierarchical Model


A data model in which the data is organized into a tree structure is known
as Hierarchical Data Model.
- The hierarchical data model structure contains parent-child relationships, where the root of the tree is a parent which then branches into its children. It is another type of record-based data model.
- The data is stored in the form of records. These records are connected to one another.
A record is a collection of fields, each field containing only one value. Hierarchical data models represent data according to its hierarchy.


Fig 2.1 Hierarchical Model


- In the hierarchical model there is a business rule: one parent node can have many child nodes, but one child node can have only one parent node. This type of model is not in common use currently.

Advantages of Hierarchical Model


a) Simple to understand: Due to its hierarchical structure it is easy to understand. Much of the time data has a hierarchical relationship; therefore, it is easy to arrange the data in that manner.
b) Database integrity: In the hierarchical data model there is always a parent-child association between records on different levels. Due to this inherent structure, integrity is maintained.
c) Efficient: The performance of this model is very efficient, due to its hierarchical structure, when the database contains a large amount of data with many related records.

2.3.2 Network Database Model


- It is an extended type of hierarchical data model. This data model is also represented as a hierarchy, but any child in the tree can have multiple parents.
- In the network data model there is no need for a strict parent-child association. There is no downward tree structure.

Fig 2.2 Network Model


- It is a flexible way of representing objects and their relationships. A network data model allows multiple records to be linked in the same file.
- Basically, the network data model forms a network-like structure between the entities.

Advantages of Network Model

a) Design is simple: The network model is simple and easy to design and understand. There is no complex structure in the network model.
b) It can handle various relationships: The network model can handle one-to-many and many-to-many relationships, which is useful when developing the database.
c) Easy to access: Data access is easier and more flexible than in the hierarchical data model. In the network model there is no strict hierarchy between the objects and their relations; therefore it is very easy to access the data.

2.3.3 Relational Model


- The relational model was developed by E. F. Codd. It is a type of record-based model.
- A relational database is an attempt to simplify the data structure by making use of tables. Tables are used to represent the data and their relationships. A table is a collection of rows and columns. Tables are also known as relations.
- Records are known as tuples and fields are known as attributes.
- The relational model is called a record-based model because the database is structured in fixed-format records of different types. A record consists of fields or attributes.
- In the relational model, every record must have a unique identification or key based on the data.
- For example, in a student table, STUD_ID is the key through which we can uniquely identify a record in the relation. The relational data model is the most widely used record-based data model.

Advantages of Relational Data Model
a) Supports SQL
- For accessing the data in the relational data model we have a language known as Structured Query Language (SQL).
- This language is used to access, insert, update or delete the data in tables. Using the relational data model we can execute complex queries. (A small SQL sketch is given after this list.)

b) Flexible
- We can easily manipulate information which is linked across various tables.
- We can extract information from different tables simultaneously by using this model.
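As a short hedged illustration (not from the original), the SQL below shows how key fields connect one table to another in the relational model, using assumed student and course tables.

    -- Two relations connected through a key field
    CREATE TABLE course (
        course_id   INT PRIMARY KEY,
        course_name VARCHAR(50)
    );

    CREATE TABLE student (
        stud_id   INT PRIMARY KEY,                     -- key that uniquely identifies each row
        stud_name VARCHAR(50),
        course_id INT REFERENCES course(course_id)     -- key field connecting the two tables
    );

    -- Extract information from both tables simultaneously
    SELECT s.stud_name, c.course_name
    FROM   student s JOIN course c ON s.course_id = c.course_id;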

2.4. CONCEPTUAL DATA MODELING

2.4.1 Entity Relationship (E-R) Model


- This model describes inter-related things of interest in a specific domain of knowledge. An ER model is composed of entity types (which classify the things of interest) and specifies the relationships that can exist between instances of those entity types.
- The Entity Relationship (E-R) Model allows specification of an enterprise schema and represents the overall logical structure of the database.
- Nowadays, the databases related to real-world applications have become very vast and complex, and representing relations between the different elements of the database becomes difficult.
- The ER model simplifies this task. It is a design technique for databases: a graphical technique which helps to understand and organize complex data independently of the actual database implementation.
- Real-world objects can be easily mapped to entities of the E-R model.
- Diagrams known as entity relationship diagrams are drawn; in the ER model a graphical representation of the database system is generated.
- Basic concepts of ER Model are as follows:
Entity, attribute, Relationship, constraints and keys.

Advantages of Entity Relationship Model
a) Simple design: The ER model is simple and easy to design. It shows the logical view of the data and its relationships. This model is easy to understand.
b) Effective representation: The presentation of the Entity Relationship Model is very simple and effective. The programmer and designer can easily understand the flow of the system by referring to the ER model.
c) Connected with the relational model: The Entity Relationship Model maps naturally onto the relational model. Due to this advantage we can develop a well-structured design.

2.4.2 Entity and Entity Set


- The E-R model consists of entities and relationships between those entities. An entity is nothing but a thing having its own properties. These properties help to differentiate the object (entity) from other objects.
- An entity is a thing that exists either physically or logically. An entity
may be a physical object such as a house or a car, or an entity may be
a logical concept like an event such as a house sale or a car service, or
a concept such as a customer transaction or order.
- An entity set is a set of entities which share the same properties. In a company, employee is an entity set whose members have similar properties like Employee_ID, emp_name, salary, etc.
- There is a difference between an entity and an entity type. An entity type is a category; an entity, strictly speaking, is an instance of a given entity type. There are usually many instances of an entity type.
- There are two types of entities in a database management system.
1. Strong Entity or Regular Entity
If an entity has its own key attribute, then it is a strong entity. The key attribute is used to identify that entity uniquely among the set of entities in the entity set.
Example: In a parent/child relationship, a parent is considered a strong entity.
A strong entity is denoted by a single rectangle.
The relationship between two strong entities is denoted by a single diamond, simply called a relationship.

2. Weak Entity
An entity which does not have any key attribute is known as a weak entity. A weak entity has a partial discriminator key. A weak entity depends on a strong entity for its existence and is denoted by a double rectangle.
Example: In a parent/child relationship, a child is considered a weak entity, which is completely dependent upon the strong entity 'parent'.
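As an illustrative sketch (not in the original), a weak entity is commonly mapped to a table whose primary key combines the owner's key with the partial discriminator; the parent and child table definitions below are assumptions.

    -- Strong (owner) entity with its own key
    CREATE TABLE parent (
        parent_id INT PRIMARY KEY,
        name      VARCHAR(50)
    );

    -- Weak entity: identified only together with its owning parent
    CREATE TABLE child (
        parent_id  INT REFERENCES parent(parent_id),
        child_name VARCHAR(50),                     -- partial discriminator
        birth_date DATE,
        PRIMARY KEY (parent_id, child_name)         -- no complete key of its own
    );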

2.4.3 Attribute
- An attribute is a characteristic of an entity. Entities are represented by
means of their attributes. All attributes have their own specific values.
For example, an employee entity may have Employee_ID, emp_name,
salary as attributes.
- In a database management system, an attribute is a database component such as a field or column of a table.
Example:
The entity student has attributes like student_id and student_name, and every attribute has a value. Here, 101 is the value of the attribute student_id and Prem is the value of the attribute student_name.

2.4.4 Types of Attributes


There are five different types of attributes in Database Management System
1. Single –value Attribute
A single-value attribute is the attribute which can hold a single value
for the single entity.
Example : In the entity student, student_name is a single-value attribute, since a student has a single value for the name attribute.

2. Multi-valued Attribute
- A Multi-valued Attribute is the attribute which can hold multiple
values for the single entity.
Example : In the entity student, the attribute student_contact_number would be considered a multi-valued attribute, since a student could have multiple contact numbers.

3. Simple Attribute
An attribute whose value cannot be further divided is known as Simple
Attribute. That means it is atomic in nature.
Example : In the entity student, student_age cannot be divided, so it
is simple attribute of student entity.

4. Composite Attribute
The composite attributes are the attributes which can be further divided into sub-parts. These sub-parts represent basic attributes with their own independent meaning.
Example : In the entity student, student_name is a composite attribute; we can divide this attribute into three different sub-parts : First_name, Middle_name and Last_name.

5. Derived Attribute
The attribute which is not physically exist in database, but its value
can be calculated from the other present attributes is known as derived
attribute.
Example : In the entity student, we can calculate the average age of
students. This average age is not physically present in the database
but it can be derived from the attribute student_age.
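As a small illustration, a derived attribute is computed on demand rather than stored. The Python sketch below is only an assumption-laden example: the student rows and the column names student_age and student_id are sample data, not part of any particular schema in this unit.

# A minimal sketch: a derived attribute is computed, not stored.
# The student records below are hypothetical sample data.
students = [
    {"student_id": 101, "student_name": "Prem",  "student_age": 20},
    {"student_id": 102, "student_name": "Anuja", "student_age": 22},
    {"student_id": 103, "student_name": "Sita",  "student_age": 21},
]

# average_age is a derived attribute: it does not exist in the stored data,
# but it can always be calculated from the stored student_age values.
average_age = sum(s["student_age"] for s in students) / len(students)
print("Derived average age:", average_age)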

2.4.5 Relationships
The association between two different entities is called as relationship. In
the real world application, what does one entity do with the other, how do they
connect to each other?
Example : An employee works at a department, a student enrolls for a
course. Works at and Enrolls are called relationships.

2.4.6 The Degree of relationships.


The degree of a relationship refers to the number of entities participating in the relationship.
1. Unary Relationship
A Unary relationship exists when there is relation between single
entity. A Unary relationship is also known as recursive relationship in
which an entity relates with itself.
Example : A person can be in the relationship with another person,
such as –
A person that is someone’s child.
A woman who can be someone’s mother

2. Binary Relationship
A binary relationship exists when there is a relationship between exactly two entities. In this case the degree of the relationship is two.
Example : A teacher teaches student. In this teacher and student are
two different entities which are connected with each other via relation
Teaches.

3. Ternary Relationship
A ternary relationship exists when there are relations between three
entities. In ternary relation the degree of relation is six.
Example : A person can be a student and a person also can be teacher.
Teacher, student and person are three entities which are related to each
other.

4. Quaternary Relationship
A Quaternary Relationship exists when there are relations between
four entities. In Quaternary Relation the degree of relation is eight.
Example : The four entities Employee, Management Faculty,
Teaching Faculty and Non-Teaching Faculty are connected with each
other via is a relationship.

2.5 KEYS
It is important to be able to distinguish the entities in a given entity set. Entities are differentiated in terms of their attributes, so the attribute values are in focus: these values should differ so that each entity can be identified uniquely. In an entity set, no two entities should have the same values for all the attributes.
In the database schema, the notion of a key applies directly to entity sets. The key for an entity in an entity set is an attribute or set of attributes which is used to distinguish entities from each other.
Keys are also used to identify relationships uniquely and differentiate these
relationships from each other.

The following types of keys are available in a DBMS :


1. Primary Key
A primary key uniquely identifies each entity in the entity set. It must have unique values and cannot hold null values. Let R be a relationship set involving entity sets E1, E2, ..., En, and let primary-key(Ei) denote the set of attributes that forms the primary key for entity set Ei. The set of attributes formed from the primary keys of the participating entity sets is used to compose the primary key for that relationship set.
Example : In a bank database, the account_number attribute should be the primary key, because this field cannot be kept NULL and no account_number should be repeated.

2. Super Key
- This key is formed by combining more than one attributes for the
purpose of uniquely identifying entities.
Example : In student database having attributes Student_reg_id,
Student_roll_no, Student_name, Address, Contact_no.
The Super keys are :
{Student_reg_id},
{Student_roll_no}
{Student_reg_id, Student_roll_no}
{Student_reg_id, Student_name}
{Student_roll_no, Student_name}
{Student_reg_id, Student_roll_no,Student_name}
- It means a super key can be any combination of attributes that makes identifying the record easier.

3. Candidate Key


- A candidate key is formed by a collection of attributes which hold unique values. Candidate keys are selected from the set of super keys: a super key without redundant attributes is known as a candidate key.
- Candidate keys are also known as minimal super keys having the uniqueness property. An attribute which does not contain duplicate values may be a candidate key.
Example : In a student database with attributes Student_reg_id, Student_roll_no, Student_name, Address, Contact_no, where Student_reg_id and Student_roll_no are each unique, the candidate keys are :
{Student_reg_id}
{Student_roll_no}
- A candidate key is thus a minimal combination of key attributes that makes identifying a record in the table easy. (A short sketch at the end of this list of keys shows how super keys and candidate keys can be identified from sample data.)

4. Alternate Keys
- From all candidate keys, only one key gets selected as primary
key, remaining keys are known as alternative or secondary keys.
Example : In student table Student_address, Contact_no,
Date_Of_Birth are the alternative keys.

5. Composite Key
A key which consists of more than one attributes to uniquely identify
rows in a table is called composite key. It is also known as compound
key.
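As mentioned above, the following Python sketch shows how the attribute combinations that act as super keys can be enumerated and how the minimal ones among them are the candidate keys. The table rows and attribute names are assumed sample data; on a real schema, keys are decided from the meaning of the data, not from one sample instance, so this is only an illustration.

from itertools import combinations

# Hypothetical student rows used only to illustrate the idea.
rows = [
    {"Student_reg_id": "R1", "Student_roll_no": 1, "Student_name": "Asha"},
    {"Student_reg_id": "R2", "Student_roll_no": 2, "Student_name": "Prem"},
    {"Student_reg_id": "R3", "Student_roll_no": 3, "Student_name": "Asha"},
]
attributes = ["Student_reg_id", "Student_roll_no", "Student_name"]

def is_super_key(attrs):
    # A set of attributes is a super key if no two rows share the same values on it.
    seen = {tuple(r[a] for a in attrs) for r in rows}
    return len(seen) == len(rows)

super_keys = [set(c) for n in range(1, len(attributes) + 1)
              for c in combinations(attributes, n) if is_super_key(c)]

# A candidate key is a super key with no proper subset that is also a super key.
candidate_keys = [k for k in super_keys
                  if not any(other < k for other in super_keys)]

print("Super keys:", super_keys)
print("Candidate keys:", candidate_keys)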

Constraints
There are certain constraints in E-R enterprise schema to which the contents
of a database must conform.
Two types of constraints are
1. Mapping cardinalities
2. Participation constraints

2.5 MAPPING CARDINALITIES

Cardinality in DBMS
- In Database Management System the term ‘Cardinality’ refers to the
uniqueness of data values contained in a column.
- If the column contains a large percentage of totally unique values then it is considered high cardinality, while if the column contains a lot of "repeats" in its data range, it is known as low cardinality.

- Sometimes cardinality also refers to the relationships between tables. Cardinality between tables can be one-to-one, one-to-many, many-to-one or many-to-many.
- A relationship where two entities are participating is called a binary
relationship.

Mapping cardinalities
- Mapping cardinalities expresses the number of entities to which
another entity can be associated with in a relationship-set.
- It defines the relationship between two entities via a relationship set.
Consider a binary relationship set R between entity sets A and B. For this relationship the mapping cardinalities are as follows :

One to One
- An entity of entity-set A can be associated with at most one entity of
entity set B and vice versa that means an entity in entity-set B can be
associated with at most one entity of entity-set A.

Example : One state has only one capital.

One to Many
- In this type an entity in set A is associated with many other entities in
set B.
- But an entity in entity set B can be associated with maximum of one
entity in entity set A.

Example : One school enrolls many students.

Many to One
- In this type an entity in set A is associated with at most one entity in set B, while an entity in set B can be associated with any number of entities in set A.

Example : Many students can enroll in one school.

Many to Many
- In this type any entity in entity set A is associated with number of
entities in entity set B.

- An entity in entity set B is associated with number of entities in set


entity A.
Example : In following example many students learn many subjects.

The appropriate mapping cardinality for a particular relationship set depends upon the real-world situation that the relationship set is modeling.

2. Participation Constraints
There are two types of participation constraints.
i. Total participation
ii. Partial participation
i. Total participation : The participation of an entity set E in a
relationship set R is said to be total if every entity in E participates in
at least one relationship in R.
ii. Partial participation : The participation of an entity set E in a relationship set R is said to be partial if only some entities in E participate in relationships in R.
- Example : In a college database system, consider teachers are
assigned as project guides to every student. In this case every
Student entity is related with teacher entity through the
relationship “Guide”.
- Hence participation of student in the relationship guidance is
total. But it is not compulsory that every teacher should guide
the students.
- It is possible that not all the teacher entities are related with the
student through the relationship “Guide”. The participation of
teacher in the “guide” relationship set is partial.

2.5.1 ER Diagram Notations


The pictorial representation of data, using standard conventions to show how the data items relate to each other, is known as an Entity Relationship Diagram. An E-R diagram expresses the logical structure of a database in a graphical manner.

Representations
1. Entity
An Entity is any object, place, person or class. In ER diagram, an
entity is represented using rectangles. Consider an example of an
organization. Employee, Manager, Company, Product etc. are
considered as entities.

2. Weak Entity
Weak Entity is an entity which depends upon another entity. A weak entity is represented by a double rectangle. For example, Subject is a weak entity because it depends on Course.

3. Attribute
Attributes are nothing but the properties of entity. Here Emp_id, Name
and address are attributes or entity student.

Summary
The entity-relationship data model is based on a perception of a real world
which consists of a set of basic objects called entities and relationships among
these objects. The model is intended primarily for the database design process.
It was developed in order to facilitate database design by allowing the
specification of an enterprise scheme. Such a scheme represents the overall
logical structure of the database.
An entity is an object that exists and is distinguishable from other objects.
The distinction is accomplished by associating with each entity a set of attributes
that describe the object. A relationship is an association among several entities.
The collection of all entities of the same type is termed an entity set, and the collection of all relationships of the same type is a relationship set.

Mapping cardinalities express the number of entities to which another entity can be associated via a relationship set. Another form of constraint is existence dependency, which specifies that the existence of entity x depends on the existence of entity y.
An important task in database modeling is to specify how entities and
relationships are distinguished. Individual entities and relationships are distinct,
but from a database perspective, their difference must be expressed in terms of
their attributes. To make such distinctions a primary key is assigned to each
entity set. The primary key is a set of one or more attributes that, taken
collectively, allows us to identify uniquely an entity in the entity set and a
relationship in a relationship set. An entity set that does not have sufficient
attributes to form a primary key is termed a weak entity set. An entity set that
has a primary key is termed a strong entity set.

Exercises
1. Explain the difference between a weak and a strong entity set.
2. Draw and explain the ER Diagram Notations.
3. Explain types of attributes.
4. Every weak entity set can be converted to a strong entity set by simply
adding appropriate attributes. Why, then, do we have weak entity sets?

2.6 CASE STUDIES ON ERD

Case 1.
Draw an ER-diagram for an online book shop, showing entity sets, attributes, relationships, mapping cardinalities and keys. It will maintain information about all customers, books, book authors, publishers, billing etc.

Case 2.
Following requirements are given for a database of the National Hockey
League. The NHL has many teams, each team has a name, a city, a coach, a
captain and a set of players. Each player belongs to only one team. Each player
has a name, a position(such as left wing or goalie), a skill level, a set of injury
records. A team captain is also a player. A game is played between two teams
(referred to as host_team and guest_team) and has a date and score.
Construct an ER diagram for the NHL database.

Case 3
Draw an ER-diagram for hospital management, where a patient takes treatment from a physician and can also claim medical insurance.
Case 4 :
Consider a university database for the scheduling of classrooms for final
exams. This database could be modeled as the single entity set exam, with
attributes course_name, section_no, room_no and time. Alternatively, one or
more additional entity sets could be defined, along with relationship sets to
replace some of the attributes of the exam entity set as:
1) Course with attributes name, dept and course_no.
2) Section with attributes s_no and enrollment, and dependent as a weak
entity set on course.
3) Room with attributes r_no, capacity and building.
Draw an E-R diagram illustrating the use of all three additional entity sets
listed.

*****

UNIT - 3
NORMALIZATION

Structure
3.0 Introduction
3.1 Unit Objectives
3.2 Normalization: Keys: Composite, Candidate, Primary, Secondary,
Foreign, Super key
3.3 CODD’s Rules, Mapping conceptual model into Relational Model.
3.4 Functional Dependencies
3.5 Decomposition
3.6 Lossy and Lossless Decomposition, Dependency Preserving
Decomposition
3.7 Advantages and Disadvantages of Normalization
3.8 Normal Forms (1NF, 2NF, 3NF,) Case Studies on Normalization.

3.0 INTRODUCTION

In this unit, you will learn that database designing is a process in which you
create a logical data model for a database to store the data of a company. It is
performed after the initial database study phase in the database life cycle. You
can use the normalization technique to create the logical data model for a
database and eliminate data redundancy. Normalization also allows you to
organize data efficiently and reduce anomalies during data operations. Various
normalization forms, such as first, second and third, can be applied to create a
logical data model for a database. The second and third normal forms are based
on partial dependency and transitive dependency. Partial dependency occurs
when a row of a table is uniquely identified by one column that is part of a
primary key. Transitive dependency occurs when a non-key column is uniquely
identified by values in another non-key column of a table.

3.1 UNIT OBJECTIVES
After going through this unit, you will be able to :
- Discuss the process of normalization and its advantages and
disadvantages
- Understand the first, second and third normal forms
- Understand Decomposition
- Discuss Lossy and Lossless Decomposition, Dependency Preserving
Decomposition

3.2 NORMALIZATION

Normalization is a process of eliminating redundancy of data in a database.


A relational table in a database is said to be in a normal form if it satisfies certain
constraints. The normalization process involves various levels of normal forms
that allow you to separate the data into multiple related tables. The various normal
forms are first normal form (1NF), second normal form (2NF), third normal form
(3NF), fourth normal form (4NF) and fifth normal form (5NF).

The goals of normalization are


• Removing the redundant data
• Ensuring that only related data is stored in a table
Therefore, normalization helps you to remove data redundancy and
update inconsistencies when data is inserted, deleted or modified in a
database. The benefits of normalization are as follows:
• It provides better overall database organization and data consistency
within a database.
• It allows you to create tables that can be easily joined with other tables
with related information.
• It helps to reduce redundant data across the tables.
* It prevents data loss by assigning primary and foreign keys in a
table.
* It helps to reduce modification anomalies, such as deletion,
insertion and update anomalies.
* It defines relation constraints that are a logical consequence of
keys.
Keys
Primary Key
- The primary key is the central element of the entity integrity constraint.
- A primary key uniquely identifies each record in a table. It must have unique values and cannot hold null values, i.e. a primary key is the combination of the NOT NULL and UNIQUE constraints.

Example : The example below shows how the entity integrity constraint works.
- Here we set the primary key on the Stud_ID column. If we add any repeated value or null value in this column it will show an error, because a primary key never contains null or repeated values.

Foreign Key
- The Foreign Key constraint is also known as Referential Integrity
Constraint. In this constraint one field is common in between two
tables.
- Foreign Key represent relationships between tables. There is parent
child relationship between two tables having common column.
- The master table can be referenced as parent while the transaction table
is considered as child. The common field will work as primary key in
parent table while Foreign Key in child table.
Example : Consider a training institute database having two tables, Course_details and Student. There is a condition that students may register only for courses which are currently available in the institute, and not for courses which are not offered at the moment. To enforce this rule while inserting values into the database, the foreign key constraint is used, as follows :

Table 3.1 Course _details (Master/Parent)

Table 3.2 Student(Transaction/Child) Table

- In both the tables, the field Course_Code is common. In Course_details, Course_Code is referred to as the primary key, while in the Student table it is referred to as a foreign key.
- So, after assigning the foreign key constraint to the Student table, a record entry for a new student will not accept a Course_Code which is not available in the master table Course_details.
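The behaviour just described can be tried out directly. The following sketch uses Python's standard sqlite3 module; the table and column names follow the example above, while the particular sample rows are assumed for illustration only.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces foreign keys only when enabled

conn.execute("""CREATE TABLE Course_details (
                    Course_Code TEXT PRIMARY KEY,
                    Course_Name TEXT)""")
conn.execute("""CREATE TABLE Student (
                    Stud_ID INTEGER PRIMARY KEY,
                    Stud_Name TEXT,
                    Course_Code TEXT REFERENCES Course_details(Course_Code))""")

conn.execute("INSERT INTO Course_details VALUES ('C101', 'DBMS')")
conn.execute("INSERT INTO Student VALUES (1, 'Prem', 'C101')")     # accepted

try:
    # Rejected: 'C999' is not present in the master (parent) table.
    conn.execute("INSERT INTO Student VALUES (2, 'Anuja', 'C999')")
except sqlite3.IntegrityError as e:
    print("Foreign key violation:", e)

try:
    # Rejected: primary key values must be unique.
    conn.execute("INSERT INTO Student VALUES (1, 'Sita', 'C101')")
except sqlite3.IntegrityError as e:
    print("Primary key violation:", e)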
Candidate Key : The super key consisting of a minimum number of
columns required to identify a unique row in a table.

3.3 CODD’S RULES

Codd designed these rules to define what is required from a relational database management system.
All thirteen rules are followed by very few commercial products; even Oracle follows only about eight and a half of them.
Codd's rules are :

Rule 0 : Foundation Rule


The foundation rule states that the system must be capable of managing its databases entirely through its relational capabilities. All other twelve rules are derived from this foundation rule.
The remaining twelve rules are as follows :

Rule 1 : Information Representation


- All the information in the database must be stored in the standard form of tables.
- A table is considered the best format to store and manage data.

Rule 2 : Systematic treatment of null values


- In a database, NULL values must be given a systematic and uniform treatment. In a number of cases we may have to store a null value in place of data, for example when data is missing, not known, or not applicable.
- Null Value is different from empty character or string, string of a blank
character, and it is also different from zero value or any number.

Example :


In the relation, PH_NO of Sita is NULL.

Rule 3 : The guaranteed access rule


- In a database every single data element must be accessible logically
with a combination of table-name, primary-key (row value), and
attribute –name (column value).
- There should be no requirement of any other means to access the data.

Rule 4 : Active online catalog


- The description of the structure of entire database must be stored in
an online catalog, known as data dictionary.
- Data dictionary can be accessed by authorized users. Metadata should
be maintained for all the data in the database.
- The system must support an online, inline, relational catalog that is
accessible to authorized users by means of their regular query
language.
- That is, the database description is represented at the logical level just like ordinary data, so that authorized users can apply the same relational language to its interrogation that they apply to regular data.

Rule 5 : The comprehensive data sub language rule


Database should not be directly accessible. It should always be accessible
by using some strong query language. This rule illustrates that the system should
support at least one relational language. The language –
a) Has a linear syntax
b) Can be used both interactively and within application programs
c) Supports data definition operations (including view definitions), data
manipulation operations (update as well as retrieval), security and
integrity constraints, and transaction management operations (begin,
commit and rollback)

Rule 6 : View updating rule


- Views are virtual tables created using queries to show a partial or complete view of a table.
- All views that are theoretically updatable must also be updatable by the system.
- The rule states that we should be able to make changes in views.
Rule 7 : High level Insert, Update and Delete Rule
- High level means multiple rows from multiple columns are affected
by the single query.
- This rule states that a database must support insertion, updation and
deletion.
- This must not be limited for a single row, that is , it must also support
union, minus and intersection operations to yield sets of data records.

Example
- Suppose employees get 5% hike in a year. Then their salary has to be
updated to reflect the new salary. Since this is the annual hike given
to the employees, this increment is applicable for all the employees.
- Hence the query should not be written for updating the salary one by one
for thousands of employee. A single query should be strong enough
to update the entire employee’s salary at a time.

Rule 8 : Physical data independence


- Changes in the physical level(i.e format of data storage or container
of data like array or linked list) must not require any change to an
application.

Rule 9 : Logical data independence


- The user’s view (application) should not be dependent upon logical
data in a database.
- That means changes in logical data must not affect the applications
which uses it. For example, if two tables are merged or one is split
into two different tables, there should be no impact or change on the
user application.

Rule 10 : Integrity independence


- The integrity constraints must be specified independently from
application programs and stored in the catalog.
- We should be able to make changes in integrity constraints
independently without the need of any change in the application.

Rule 11 : Distribution independence


- The distribution of database at various locations should not be visible
to end user.
- Users should always get the impression that the data is located at one site only.

- That means even though the data is distributed, it should not affect the speed of access and performance of data compared to a centralized database.

Rule 12 : The non-subversion rule


- If a relational system has a low-level(single record at a time) language,
that low level cannot be used to subvert or bypass the integrity rules
and constraints expressed in the higher level relational language.
- It means those integrity rules and constraints must always be followed; we cannot violate them by using any back-door option.

3.4 FUNCTIONAL DEPENDENCY

The attributes of a relation are said to be dependent on each other when an attribute of a table uniquely identifies another attribute of the same table. This is called a functional dependency.
If attribute A of a relation uniquely identifies attribute B of the same relation, then this can be represented as A → B, which means attribute B is functionally dependent on attribute A.
Formal definition of functional dependency
Let R be a relation schema with attributes A and B (i.e. R includes A and B). Given the relation R, a set of attributes B in R is said to be functionally dependent on another set of attributes A in R, written A → B. The left-hand side of the FD (A) is the determinant set of attributes and the right-hand side (B) is the dependent or determined set of attributes. Usually the primary key of a relation is the determinant in an FD. Thus, given a row and the values of the attributes in set A, one can determine the corresponding values of the B attributes.

Table 3.3 STUDENT

From the STUDENT_ID attribute it is possible to identify a particular student's name. The name of the student is functionally dependent on STUDENT_ID and not vice versa, because two or more students may have the same name; using STUDENT_ID we can determine the name of the student, but using STUDENT_NAME we cannot determine the appropriate STUDENT_ID.
So we can say that,
STUDENT_ID → STUDENT_NAME

i.e STUDENT_NAME is functionally dependent on STUDENT_ID


Similarly,
STUDENT_ID → all the remaining attributes (i.e. ADDRESS, COURSE)
So we can say that all the remaining attributes are functionally dependent
on STUDENT_ID attribute.
It is possible that an attribute of a relation can be identified by a set of other attributes of the same relation. This can be expressed as follows :
{STUDENT_ID, STUDENT_NAME} → ADDRESS
{STUDENT_ID, STUDENT_NAME} → COURSE
In the above situation, a composite primary key of the relation is usually the determinant in the FD.

Types of Functional Dependencies


a) Single-valued Functional Dependency
b) Multi-valued Functional Dependency
c) Fully Functional Dependency
d) Partial Functional Dependency
e) Transitive Functional Dependency
f) Trivial Functional Dependency
g) Non Trivial Functional Dependency

a) Single-valued Functional Dependency


- Database is a collection of related information in which information
depends on another information.
- The information is either single-valued or multi-valued. For example, the name of a person or his/her date of birth are single-valued facts, but the contact number of a person can be multi-valued.
- A single-valued functional dependency arises when A is the primary key of a relation (e.g. STUDENT_ID) and B is some single-valued attribute of the relation (e.g. STUDENT_NAME). Then A → B must always hold in the relation.
STUDENT_ID STUDENT_NAME
1002 Anuja
For every STUDENT_ID there should be a unique name (A → B).

b) Multi-valued Functional Dependency


- A multi-valued dependency exists when a relation R has at least 3 attributes (like A, B and C) and for each value of A there is a well-defined set of values of B and a well-defined set of values of C. However, the set of values of B is independent of the set of values of C and vice versa.
For example :
Table 3.4 Published_Book

As shown in Table 3.4, the name of a particular author is determined by the Book_Id attribute.
{Book_Id} ->> {Author_Name}
- Also it is possible to determine the Price of book by Book_Id.
{ Book_Id} ->> { Price}
But it is difficult to determine the price of book using author name or vice
versa. So this dependency is called as multi-valued functional dependency or
simply multi-valued dependency.

c) Fully Functional Dependency


Fully Functional Dependency occurs only in a relation with composite
primary key. Fully Functional Dependency occurs when one or more non key
attributes are depending on all parts of the composite primary key.

Table 3.5 Stud_Courses


Here the composite primary key is Stud_Id + Course_Id.


- The attribute Grade is fully functionally dependent on the primary key (Stud_Id + Course_Id) because both parts of the primary key are needed to determine Grade.
- On the other hand both Stud_Name and Contact_No attributes are not
fully functionally dependent on the primary key, because only a part
of the primary key namely Stud_Id is needed to determine both
Stud_Name and Contact_No.
- Also attributes Credit-Hours and Course_Name are not fully
functionally dependent on the primary key because only Course_Id is
needed to determine their values.

D) Partial Functional Dependency


- Partial functional dependency occurs only in a relation with a composite primary key. It occurs when one or more non-key attributes depend on only a part of the primary key.
- In a partial functional dependency the determinant (left-hand side of the FD) consists of key attributes, but not the entire primary key, and the determined (right-hand side of the FD) consists of non-key attributes.

For example,
Here the composite primary key is Stud_Id + Course_Id


- To determine the name or contact number of the student we can use only the Stud_Id attribute, which is part of the primary key.
{Stud_Id} → {Stud_Name, Contact_No}

E) Transitive Functional Dependency


- A functional dependency is said to be transitive if it is formed indirectly from two other functional dependencies. For example, in relation R(A, B, C), suppose the set of functional dependencies F includes A → B and B → C, and B → A does not hold.
- Then we can say that A → C also holds in the relation.
- This type of dependency is called a transitive dependency.
Example
If the Student_Courses relation holds the following FDs :
{Course_Id} → {Stud_Name}
{Stud_Name} → {Student_Age}
then we can say that the relation also holds the following dependency :
{Course_Id} → {Student_Age}

F) Trivial Functional Dependency
- The dependency of an attribute on a set of attributes is known as a trivial functional dependency if the set of attributes contains that attribute itself.
- In this dependency the right-hand side of the FD is a subset of the left-hand side. A functional dependency of the form A → B is trivial if B is a subset of A.
- The functional dependency {A, B} → A is trivial.
- In a relation, if attribute A depends on a set of attributes AB, and attribute A is a member of AB, then this functional dependency is said to be trivial.
For example,

- {Stud_Id, Stud_Name} → {Stud_Id} is a trivial functional dependency, as {Stud_Id} is a subset of {Stud_Id, Stud_Name}. By knowing the values of the id and name of a student it is possible to find the value of Stud_Id uniquely.
- The functional dependencies Stud_Id → Stud_Id and Stud_Name → Stud_Name are also referred to as trivial dependencies.

G) Non Trivial Functional Dependency


- It is the inverse of a trivial dependency. A functional dependency is said to be non-trivial if the right-hand side attributes of the FD are not all part of the left-hand side attributes.
- A functional dependency of the form A → B is non-trivial if the values in B are not all present in A (i.e. B is not a subset of A).
- Non-trivial functional dependencies can be categorized into :

(i) Complete Non Trivial Functional Dependency


Functional Dependency is Completely Non Trivial if none of the Right-
Hand side attributes of FD are part of the Left-Hand side attributes of the FD.
For example,
{Stud_Id} → {Stud_Name}
Values of the Stud_Name attribute are totally different from values of Stud_Id in a relation.

(ii) Semi Non Trivial Functional Dependency


A Functional Dependency is Semi Non Trivial if at least one of the Right-
Hand side attributes of FD is not part of the Left-Hand side attributes of FD.
For example,
{Stud_Id, Student_Name} → {Stud_Name, Contact_No}
Values of Contact_No attribute are totally different from values of {Stud_Id,
Student_Name} in a relation.

Closure of Functional Dependency


1. The set of all Functional Dependencies logically implied by F are
known as the closure of set F.
2. The closure of F is denoted by F+.
3. To compute F+ , we can use some rules of inference like Armstrong’s
Axioms.

Inference Rules of Functional Dependencies


- The inference rule is an assertion which is used to derive new FDs
from previously defined basic FDs. A set of new Functional
Dependencies is derived by applying assertions on previously defined
basic Functional Dependency set.
- It is problematic to represent sets of all FDs for a relation initially.
- To make it easy, first define only basic FDs. Then apply set of
inference rules on those FDs to derive new set of FDs.
- The most basic inference rule is Armstrongs’s Axioms. There are other
rules such as union, decomposition and pseudo transitivity which are
derived from Axioms.

Armstrongs’s Axioms
Armstrongs’s Axioms were developed by William W. Armstrong. These
are a set of axioms (or inference rules) which are used to infer (conclude) all the
FDs on a relational database.
Consider a relation R having three sets of attributes X, Y, Z. Then the
Armstrong’s axioms are :
- Reflexivity Rule : If Y is a subset of X, then the FD X → Y holds in relation R. This type of dependency is also called a trivial dependency.

- Augmentation Rule : If the FD X → Y holds in relation R, then the FD XZ → YZ also holds in R.
- Transitivity Rule : If the functional dependencies X → Y and Y → Z hold in relation R, then the FD X → Z also holds in R. This type of dependency is also called a transitive dependency.

Properties of Armstrong’s axioms


The Armstrong’s axioms are sound :
- When basic functional dependencies F on a relation R are given, then
the FDs generated by using Armstrong’s axioms will hold by R.
- In other words applying Armstrong’s axioms on set of FDs which are
previously defined will not generate invalid FDs.
- The Armstrong’s axioms are complete : When basic functional
dependencies F on a relation R are given, then each and every valid
functional dependency on relation R can be found by applying only
Armstrong’s axioms on set of FDs which are previously defined.
- Union or Additivity
If relation R has three sets of attributes X, Y, Z and the functional dependencies X → Y and X → Z hold in R, then the functional dependency X → YZ also holds in R. This rule is called union.
- Decomposition or Projectivity
If relation R has three sets of attributes X, Y, Z and the functional dependency X → YZ holds in R, then the functional dependencies X → Y and X → Z also hold in R. This inference rule is called the decomposition rule.
- Pseudo transitivity
If relation R has four sets of attributes W, X, Y, Z and the functional dependencies X → Y and YW → Z hold in R, then the relation also holds the functional dependency XW → Z. This inference rule is known as pseudo transitivity.

3.5 DECOMPOSITION

One solution to eliminate the redundancy and also the update and deletion
anomalies in database is breaking a relation into two or more relations. This
process is called as decomposition.

Let R be a relation schema. A set of relation schemas {R1, R2, ..., Rn} is a decomposition of R if
R = R1 ∪ R2 ∪ ... ∪ Rn
Example,
Let us know how to break the relation into more than one relation.
Consider CLASSINFO relation having COURSE_ID, COURSE_NAME,
STUD_NAME, SUB_ID AND SUB_NAME.

Table 3.1 CLASSINFO

This relation can be decomposed into two relations as follows:


i) COURSE_STUDENT (COURSE_ID, COURSE_NAME,
SUB_NAME)
ii) SUBJECT_STUDENT(SUB_ID, SUB_NAME, STUD_NAME)

Table 3.2 COURSE_STUDENT

Table 3.3 SUBJECT_STUDENT

Now try to join these two relations COURSE_STUDENT and SUBJECT_STUDENT as follows :

Table 3.2 COURSE_STUDENT and SUBJECT_STUDENT

Here two additional rows are displayed; this shows that a badly designed decomposition may lead to information loss. So, to ensure that the database has a good design after decomposing, the decomposition must satisfy certain criteria, defined by the desirable properties of decomposition.
While decomposing a relation, certain properties must hold so that the database remains in a consistent state. The two properties of decomposition are :
i) Lossless decomposition
ii) Dependency Preservation decomposition

3.6 LOSSLESS DECOMPOSITION

- When a relation is decomposed into several relations, it is essential that the decomposition does not lead to any loss of information.
- If there is the loss of the information, then it is called lossy
decomposition or lossy-join decomposition. Decomposition is called
as lossless decomposition if there is no loss of information in the
decomposition of the relation into the desirable smaller relations.
- A decomposition {R1, R2, ..., Rn} of a relation R is called a lossless decomposition for R if the natural join of R1, R2, ..., Rn produces exactly the relation R.
- A decomposition is lossless if we can recover the original relation by combining the smaller decomposed relations; thus, R' = R.
Lossless decomposition property
The lossless-join decomposition is defined as follows. Let R be a relation schema, F a set of functional dependencies on R, and let R1 and R2 form a decomposition of R. If at least one of the following functional dependencies is in F+, then the decomposition is a lossless-join decomposition :
- R1 ∩ R2 → R1, that is, all attributes common to both R1 and R2 functionally determine all the attributes in R1, OR
- R1 ∩ R2 → R2, that is, all attributes common to both R1 and R2 functionally determine all the attributes in R2.
In other words, if R1 ∩ R2 forms a super key of either R1 or R2, the decomposition of R is a lossless decomposition.
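This condition can be checked mechanically with the attribute-closure computation shown earlier. The self-contained Python sketch below does so for a binary decomposition; the example schema and dependencies are assumed for illustration.

def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_lossless(r1, r2, fds):
    common = r1 & r2
    c = closure(common, fds)
    # Lossless if R1 ∩ R2 → R1 or R1 ∩ R2 → R2 is implied by F.
    return r1 <= c or r2 <= c

# Assumed example: R(Stud_Id, Stud_Name, Course_Id, Course_Name)
fds = [({"Stud_Id"}, {"Stud_Name", "Course_Id"}),
       ({"Course_Id"}, {"Course_Name"})]

r1 = {"Stud_Id", "Stud_Name", "Course_Id"}
r2 = {"Course_Id", "Course_Name"}
print(is_lossless(r1, r2, fds))    # True: Course_Id determines all of R2

r3 = {"Stud_Id", "Course_Name"}
r4 = {"Stud_Name", "Course_Id", "Course_Name"}
print(is_lossless(r3, r4, fds))    # False: Course_Name determines neither side fully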

ii) Dependency Preservation decomposition


- To ensure that the functional dependencies that hold on relation R are also preserved by the smaller relations (R1, R2, ..., Rn), the dependency preservation property of decomposition is used.
- A decomposition of the relation R into smaller relations R1, R2, …Rn
is dependency preserving if all the Functional dependencies within R
are followed in relations R1,R2,…,Rn.

Dependency preservation decomposition property


- Dependency-preserving decomposition is defined as follows. Let R be a relation schema, F a set of functional dependencies on R, and let R1, R2, ..., Rn form a decomposition of R. The decomposition is dependency preserving if
(F1 ∪ F2 ∪ ... ∪ Fn)+ = F+
where F1, F2, ..., Fn are the sets of functional dependencies of relations R1, R2, ..., Rn (the restrictions of F to each Ri), F is the set of functional dependencies of relation R, (F1 ∪ F2 ∪ ... ∪ Fn)+ is the closure of the union of all the sets of functional dependencies of R1, R2, ..., Rn, and F+ is the closure of the set of functional dependencies of relation R.
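Comparing the two closures directly is expensive; a common shortcut is to project F onto each Ri and then check that every original FD still follows from the union of those projections. The Python sketch below follows that idea for a small, assumed example schema (it is a brute-force sketch, exponential in the size of each Ri, so suitable only for small relations).

from itertools import combinations

def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def project_fds(fds, ri):
    # Restriction of F to Ri: every X -> (X+ ∩ Ri) for X ⊆ Ri.
    projected = []
    attrs = sorted(ri)
    for n in range(1, len(attrs) + 1):
        for x in combinations(attrs, n):
            x = set(x)
            projected.append((x, closure(x, fds) & ri))
    return projected

def is_dependency_preserving(fds, decomposition):
    union = [fd for ri in decomposition for fd in project_fds(fds, ri)]
    # Every original FD must follow from the union of the projected FDs.
    return all(rhs <= closure(lhs, union) for lhs, rhs in fds)

# Assumed example: R(A, B, C) with A -> B and B -> C.
fds = [({"A"}, {"B"}), ({"B"}, {"C"})]
print(is_dependency_preserving(fds, [{"A", "B"}, {"B", "C"}]))   # True
print(is_dependency_preserving(fds, [{"A", "B"}, {"A", "C"}]))   # False: B -> C is lost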

3.7 ADVANTAGES AND DISADVANTAGES OF NORMALIZATION

The advantages of normalization are as follows:


* It serves as the basis for physical database design, ensuring that customer requirements are properly satisfied and that new requirements are easier to accommodate.
* It makes the design of a modular system easier.
* It makes the database easier to maintain.
* It ensures structural stability of data.
* It prevents various updating anomalies which can occur in non-
normalized record structures.
* It enables record processing by a set of simple operators.
* Each table in a normalized database contains information about a single
subject, so each table tends to have fewer indices. This makes insert,
update and delete operations more efficient.
* Normalization breaks the database down into smaller tables making it
much easier for database users and administrators to save disk space and
simplify their data structures. Throughout the process of normalization,
security also becomes easier to control. The database administrator can
determine which users have access to certain tables.
* Another advantage of normalization is that it provides indexing. Indexing speeds up data access and increases delete, update and insert performance. Normalization also minimizes modification anomalies.
Modification anomalies can occur when the data is deleted, inserted or
updated and the data is lost in other ways such as hardware being damaged or
stolen. In addition, normalization helps with the size reduction of tables and rows.
Creating primary and foreign key constraints will reduce the number of empty
or null values in columns and the overall size of the database.

The disadvantages of normalization are


* It is easier to modify a properly normalized database than an improperly normalized one, but modification of a database structure is still not simple. In fact it is problematic, as the structure serves as the foundation of a larger application; changes to the database risk breaking the application.
* A table can only have a single clustered index. Clustered indices tend to
produce faster data retrievals. Normalized databases typically have more
tables than non-normalized databases, so normalized databases typically
have more clustered indices.
* Queries on a normalized database tend to be more complex, since data often has to be retrieved by joining several tables.

3.8 NORMAL FORMS(1NF, 2NF, 3NF)

Dr. Edgar F. Codd proposed normalization as an integral part of the relational model. Normalization is a database design technique used to organize tables in such a manner that redundancy and dependency of data are reduced.

- It divides larger tables to smaller tables and links these smaller tables
using their relationships.
- Normalization is implemented by some formal rules either by a process
of synthesis or decomposition. Synthesis is used to create database
design in normalized form based on a known set of dependencies.
Decomposition generally is done on existing database which is not
normalized. The relation of this database is further divided into
multiple relations to improve the design.
- Normalization is a multi-step process. It creates smaller tables from a larger table.

Types of Normalization
1) First normal form (1NF) : Having unique values, no repeating groups.
2) Second normal form (2NF) :Having unique values, no repeating
groups, no partial dependency.
3) Third normal form (3NF) : Same as second normal form, with no transitive dependency.
4) Boyce-Codd Normal Form (BCNF) : A stricter version of 3NF.
5) Fourth normal form (4NF) : No multi-valued dependency.

First normal form (1NF)


- A domain is the set of all unique values permitted for an attribute. Atomic means that a value cannot be divided further.
- First normal form defines that all the attributes in a relation must have
atomic (not further divisible) and single values.
- Consider the table Course_details

Table 3.1 Course_details

Table 3.1 Course_details is not in 1NF. The rule says "each attribute of a table must have atomic (single) values", but the attribute 'Languages' contains multiple values, which violates the rule of 1NF. To convert this data into first normal form, we have to rearrange it in the table.


- Now each attribute contains single values. Hence the database design is in first normal form.
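The rearrangement into 1NF can be expressed as a small transformation that gives each language its own row. The Python sketch below uses assumed sample rows shaped like Table 3.1; the column names are illustrative only.

# Assumed un-normalized rows: 'Languages' holds several values in one field.
course_details = [
    {"Course_Id": "C1", "Student": "Prem",  "Languages": "C, C++"},
    {"Course_Id": "C2", "Student": "Anuja", "Languages": "Java, Python"},
]

# 1NF: every attribute must hold a single atomic value,
# so each language becomes a separate row.
first_normal_form = [
    {"Course_Id": r["Course_Id"], "Student": r["Student"], "Language": lang.strip()}
    for r in course_details
    for lang in r["Languages"].split(",")
]

for row in first_normal_form:
    print(row)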

Second normal form (2NF)


To understand the second normal form first we should get concepts of Prime
and Non-prime attributes.
- Prime attribute : An attribute which is a part of a candidate key is known as a prime attribute.
- Non-prime attribute : An attribute which is not a part of any candidate key is said to be a non-prime attribute.
- As per the rule of second normal form, every non-prime attribute must be fully functionally dependent on the whole primary key.
- Consider table

Table 3.2 STUDENT_PROJECT

- In the above example the prime key attributes are STUD_ID and
PROJ_ID. Here the non key attributes like STUD_NAME and
PROJ_NAME are dependent upon one of the prime key attributes.
- This means STUD_NAME depends upon STUD_ID while PROJ_NAME depends upon PROJ_ID. This is called partial dependency.
- As per the rule of second normal form, the non-key attributes should be dependent upon all the key attributes, which is not the case here.

- To convert the data into second normal form we have to split the table into two different tables as follows.

STUDENT

- In the table STUDENT all the non-prime attributes, like STUD_NAME and PROJ_ID, depend completely upon the prime attribute STUD_ID.

PROJECT

- In the table PROJECT the non-prime attribute PROJ_NAME depends completely upon the prime attribute PROJ_ID.
Third normal form (3NF)
A database design is said to be in 3NF if the following conditions are satisfied by it :
- Initially the design must be in 2NF.
- No non-prime attribute should be transitively dependent on the primary key.
- For any non-trivial functional dependency X → A, either X is a super key or A is a prime attribute.

Table 3.3 STUDENT_DETAILS

Solved question explaining the concept of normalization
First Normal Form (1NF) : A relation is said to be in first normal form if it has no repeating groups.
Let us consider the Invoice Report of Alpha Book House as shown below.

Alpha Book House


Pune - 413001
Customer No. : 1052
Customer Name : Gamma school of computer studies
Address : University Road, Pune - 01

Fig 3.4 Invoice of a book company

Normalization of invoice report upto 3NF


The invoice report shown in Fig 3.4 is represented in a form of relation.
The relation is named as invoice. This is unnormalized form.
1) Invoice (Cust_no, Cust_Name, Cust_Add, (ISBN, Title,
Author_Name, Author’s country, Qty, Unit_price))
The amount in the last column of the invoice for each book can be
computed by multiplying the number of copies (Qty) with respective
unit price. Similarly, the grand total at the bottom of the invoice
representing the invoice amount can be calculated by summing the
entries in the last column of the invoice. Similarly, the data at the top
right corner of the invoice can be printed using the system clock.
Hence, these three entries are not included in the invoice relation.
1NF : First Normal Form
- The relation is in 1NF if it has no repeating groups.
- In the invoice relation, the fields in the inner most of parentheses
put together is known as a repeating group. This will result in
redundancy of data for the first three fields. This redundancy will
lead to inconsistency of data. Hence, it is divided into two sub-
relations shown below to remove the repeating group.
2) Customer (Cust_No, Cust_Name, Cust_Add)
3) Customer_Book (Cust_No, ISBN, Title, Author_Name, Author’s
country, Qty, Unit_price)
Now each of the above relations (i.e relation 2 and relation 3) is in 1NF.

Second Normal Form


2NF removes partial dependency among attributes of relation. In Relation
2, the number of key fields is only one and hence there is no scope for partial
dependency. The absence of partial dependency in Relation 2 takes it into 2NF
without any modification.
In Relation 3, the number of key fields is two. The dependency diagram of Relation 3 is shown in Fig. 3.5 (Dependency diagram of Relation 3).

In Fig. 3.5, Qty depends on Cust_No and ISBN, but the remaining non-key fields (Title, Author_Name, Author's country, Unit_price) depend only on ISBN. Thus, there exists partial dependency.
The existence of partial dependency will result in insertion, update and deletion anomalies.

Insertion anomaly
In Relation 3, if we want to insert data of a new book (ISBN, Title,
Author_Name, Author’s country, Unit_price) there must be at least one customer.
This means that the data of a new book can be entered into Relation 3 only when
the first customer buys the book.

Update anomaly
In Relation 3, if we want to change any of the nonkey fields, like Title,
Author_Name, it will result into inconsistency because the same is available in
more than one record.

Deletion anomaly
In Relation 3, if book is purchased by only one customer, then the book data
will be lost when we delete that record after fully satisfying that customer order.
Hence, Relation 3 is divided into two relations as shown below :
4) Sales(Cust_No, ISBN, Qty)
5) Book_Author(ISBN, Title, Author_Name, Author’s country,
Unit_price)
Above two relations(i.e Relation 4 and Relation 5) are now in 2NF.

Summary
Database designing involves normalizing a database to ensure that the
relational tables in the database contain only related information and store data
only at a single location in the Database. The higher normal forms, such as the
fourth and fifth normal forms, in normalization are based on the concept of multi-
valued dependency.

Exercise
1. What are the steps to normalize a relational table into the first normal
form?
2. What are the steps to normalize a relational table into the second
normal form?
3. What are the steps to normalize a relational table into the third normal
form?
4. List the advantages of normalization.
5. Explain different types of keys.
6. What is a primary key in a relational table?
7. Explain Lossless decomposition, dependency preserving
decomposition.
8. Explain what is meant by :
a. Repetition of information.
b. Inability to represent information.
c. Loss of information.
Explain why any one of these properties may indicate a bad relational
database design.

*****

UNIT - 4
FILE STRUCTURES AND
DATA ADMINISTRATION

Structure
4.0 Introduction
4.1 Unit Objectives
4.2 File Structures and Data Administration: File Organization
4.3 Overview of Physical Storage Media, Magnetic Disk, RAID, Tertiary
Storage
4.3.1 Storage Access
4.4 Data Dictionary Storage
4.5 Organization of File (Sequential,Clustering)
4.6 Indexing and Hashing
4.7 Basic Concepts, indices
4.8 B+ Tree index file
4.9 B- tree index file
4.10 Static hashing
4.11 Dynamic Hashing
4.12 Data administration, Role and Responsibility of DBA

4.0 INTRODUCTION

The logical model of the database is the correct level for database users to
focus on. The goal of a database system is to simplify and facilitate access to
data. Users of the system should not be burdened unnecessarily with the physical
details of the implementation of the system. We shall define various data
structures that will allow fast access to data. We shall consider several alternative
structures, each best suited to a different kind of access to data.

4.1 UNIT OBJECTIVES
Some of the functions of the database system may be provided by the
computer’s operating system. The operating system provides only the most basic
services, and the database system must build on that base. The design of the database system must therefore include consideration of the interface between the database system and the operating system. This unit will cover data dictionary storage, organization of files, indexing and hashing, B+ tree index files, and the important role and responsibilities of the DBA.

4.2 FILE ORGANIZATION

A file is organized logically as a sequence of records. These records are


mapped onto disk blocks. Files are provided as a basic construct in operating
systems, so we shall assume the existence of an underlying file system. We need
to consider ways of representing logical data models in terms of files.
Although blocks are of a fixed size determined by the physical properties
of the disk and by the operating system, record sizes vary. In a relational
database, tuples of distinct relations are generally of different sizes. In a network
database, it is likely that the owner record type is of a different size than the
member record type.
One approach to mapping the database to files is to use several files and
store records of only one fixed length in any given file. An alternative is to
structure our files in such a way that we can accommodate multiple lengths for
records. Files of fixed-length records are easier to implement than files of
variable-length records. Many of the techniques used for them can be applied to
the variable-length case. Thus, we begin by considering a file of fixed-length
records.

Fixed Length Records


As an example, let us consider a file of deposit records for our bank
database. Each record of this file is defined as follows:
type deposit = record
branch-name : char(20);
account-number : integer;
customer-name : char(20);
balance : real;
end
Figure 4.1: File containing deposit records.

If we assume that each character occupies a byte, an integer occupies 4 bytes, and a real 8 bytes, our deposit record is 52 bytes long. A simple approach is to use the first 52 bytes for the first record, the next 52 bytes for the second record, and so on (a minimal sketch of this layout in code appears after the following list). There are two problems with this simple approach :
- It is difficult to delete a record from this structure. The space occupied
by the record to be deleted must be filled with some other record of
the file, or we must have a way of marking deleted records so that they
can be ignored.
- Unless the block size happens to be a multiple of 52, some records
will cross block boundaries. That is, part of the record will be stored
in one block and part in another. It would thus require two block
accesses to read or write such a record.
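As promised above, the 52-byte layout can be reproduced with Python's struct module: a 20-byte branch name, a 4-byte integer, a 20-byte customer name and an 8-byte real pack into 52 bytes, so record i begins at byte offset 52 * i. The field values below are assumed sample data.

import struct

# 20-byte string, 4-byte int, 20-byte string, 8-byte double = 52 bytes (no padding).
RECORD = struct.Struct("<20si20sd")
print(RECORD.size)                      # 52

def pack(branch, account_no, customer, balance):
    return RECORD.pack(branch.encode().ljust(20), account_no,
                       customer.encode().ljust(20), balance)

# A tiny in-memory "file" of fixed-length records (assumed sample data).
data = b"".join([
    pack("Perryridge", 102, "Hayes",  400.0),
    pack("Round Hill", 305, "Turner", 350.0),
    pack("Mianus",     215, "Smith",  700.0),
])

# Record i always starts at byte 52 * i, which is what makes access simple.
i = 1
branch, acct, cust, bal = RECORD.unpack_from(data, i * RECORD.size)
print(branch.rstrip(b" "), acct, cust.rstrip(b" "), bal)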

Figure 4.2 File of Figure 4.1 with record 2 deleted


When a record is deleted, we could move the record that came after it into
the space formerly occupied by the deleted record, and so on, until every record
following the deleted record has been moved ahead. Such an approach requires
moving a large number of records. It might be better simply to move the last
record of the file into the space occupied by the deleted record.
It is undesirable to move records in order to occupy the space freed by a
deleted record, since this requires additional block accesses. Since insertions
tend to be more frequent than deletions, it is acceptable to leave the space
occupied by the deleted record open ,and wait for a subsequent insertion before
reusing the space. A simple marker on a deleted record is not sufficient, since it
is hard to find this available space when an insertion is being done. Thus, we
need to introduce additional structure.
At the beginning of the file, we allocate a certain number of bytes as a file
header. The header will contain a variety of information about the file. For now,
all we need to store there is the address of the first record whose contents are
deleted. We use this first record to store the address of the second available
record, and so on. We may think of these stored addresses as pointers since they
“point” to the location of a record.
Upon insertion of a new record, we use the record pointed to by the header.
We change the header pointer to point to the next available record. If no space
is available, we add the record to the end of the file.
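This free-list scheme can be illustrated with a short Python sketch. It is only an illustration: the 52-byte layout follows the deposit record above, while the class, its in-memory list of slots, and the method names are assumptions made here for clarity.

import struct

# 20-byte branch-name, 4-byte account-number, 20-byte customer-name, 8-byte balance = 52 bytes
RECORD = struct.Struct("=20si20sd")

class FixedLengthFile:
    def __init__(self):
        self.header = -1          # file header: slot number of the first deleted record (-1 = none)
        self.slots = []           # each slot holds one packed 52-byte record or a free-list pointer

    def insert(self, branch, account, customer, balance):
        rec = RECORD.pack(branch.encode(), account, customer.encode(), balance)
        if self.header == -1:                 # no deleted record available: append at end of file
            self.slots.append(rec)
            return len(self.slots) - 1
        slot = self.header                    # reuse the record pointed to by the header
        self.header = self.slots[slot][1]     # the next available record becomes the new head
        self.slots[slot] = rec
        return slot

    def delete(self, slot):
        # Store the current head of the free list in the deleted slot,
        # then make the header point to this slot.
        self.slots[slot] = ("FREE", self.header)
        self.header = slot

Deleted slots are reused in last-in, first-out order; when the free list is empty, new records go at the end of the file, exactly as described above.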
The use of pointers requires careful programming. If we move or delete a record to which another record contains a pointer, that pointer becomes incorrect in the sense that it no longer points to the desired record. Such pointers are called dangling pointers and, in effect, point to garbage. In order to avoid the dangling pointer problem, we must avoid moving or deleting records that are pointed to by other records. These records are pinned.

Figure 4.3 File of Figure 4.1 with record 2 deleted

Insertion and deletion for files of fixed-length records are quite simple to
implement because the space made available by a deleted record is exactly the
space needed to insert a record. If we allow records of variable length in a file,
this is no longer the case. An inserted record may not fit in the space left free by
a deleted record or it may fill only part of that space.

Variable-Length Records

Variable-length records arise in database systems in several ways :
- Storage of multiple record types in a file.
- Record types that allow variable lengths for one or more fields.
- Record types that allow repeating fields.
A number of different techniques for implementing variable-length records exist. For purposes of illustration, we shall use one example to demonstrate the various implementation techniques. We consider a different representation of the deposit information stored in the file in which we use one variable-length record for each branch name and all of the account information for that branch. The format of the record is :
type deposit-list = record
    branch-name : char(20);
    account-info : array [1..∞] of record
        account-number : integer;
        customer-name : char(20);
        balance : real;
    end
end

Fig 4.4: Byte string representation of variable length records.

We define account-info as an array with an arbitrary number of elements so that there is no limit to how large a record can be.

Byte String Representation


A simple method for implementing variable-length records is to attach a special end-of-record (⊥) symbol to the end of each record. We can then store
each record as a string of consecutive bytes.
The byte string representation has several disadvantages.
- It is not easy to reuse space occupied formerly by a deleted record.
Although techniques exist to manage insertion and deletion, they lead
to a large number of small fragments of disk storage that are wasted.
- There is no space, in general, for records to grow longer. If a variable-
length record becomes longer, it must be moved, and movement is
costly if the record is pinned.
Thus, the byte string representation is not usually used for implementing
variable-length records.

Fixed-Length Representation
In order to implement variable-length records efficiently in a file system,
we use one or more fixed-length records to represent one variable-length record.
There are two techniques for implementing files of variable-length records using fixed-length records.
- Reserved space. If there is a maximum record length that is never exceeded, we may use fixed-length records of that length. Unused space is filled with a special null, or "end-of-record," symbol.
- Pointers. The variable-length record is represented by a list of fixed-length records, chained together via pointers.

Fig 4.5: File of Figure 4.1 using the reserved space method
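The reserved-space method can be sketched as follows in Python. The layout mirrors the deposit-list record above; the assumed maximum of three accounts per branch and the null filler value are choices made only for this illustration.

import struct

ACCOUNT = struct.Struct("=i20sd")        # account-number, customer-name, balance
MAX_ACCOUNTS = 3                         # assumed maximum record length that is never exceeded
EMPTY = ACCOUNT.pack(-1, b"", 0.0)       # special null "end-of-record" filler for unused space

def pack_branch(branch, accounts):
    # One fixed-length record per branch: real entries first, then null filler entries.
    if len(accounts) > MAX_ACCOUNTS:
        raise ValueError("record exceeds the reserved maximum length")
    body = b"".join(ACCOUNT.pack(no, name.encode(), bal) for no, name, bal in accounts)
    body += EMPTY * (MAX_ACCOUNTS - len(accounts))
    return struct.pack("=20s", branch.encode()) + body

rec = pack_branch("Perry", [(102, "Hayes", 400.0), (201, "Williams", 900.0)])
print(len(rec))    # every branch record has the same length (20 + 3 * 32 = 116 bytes)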

4.3 OVERVIEW OF PHYSICAL STORAGE MEDIA

Several types of data storage exist in most computer systems. These storage media are classified by the speed with which data can be accessed, by the cost per unit of data to buy the memory, and by how "reliable" they are. Among the media typically available are:
- Cache : This is the fastest and most costly form of storage. The size
of cache memory is very small and the use of cache is managed by the
operating system. We shall not need to be concerned about managing
cache storage in the database system.
- Main memory : This is the storage media used for data that is
available to be operated on. The general-purpose machine instructions
operate on main memory. Although main memory may contain several
megabytes of data, main memory is generally too small to store the
entire database. Main memory is sometimes referred to as core
memory, a reference to a technology used in the 1960s and early 1970s
to implement main memory. The contents of main memory are usually
lost if a power failure or system crash occurs.
- Disk storage : This is the primary medium for the long term storage
of data. The entire database is stored on disk. Data must be moved
from disk to main memory in order for the data to be operated on.
After operations are performed, the data must be returned to disk. Disk
storage is referred to as direct-access storage because it is possible to read data on disk in any order. Disk storage usually survives power failures and system crashes. Disk storage devices themselves may fail and destroy data, but such failures are significantly less frequent than system crashes.
- Tape storage : This is storage used primarily for backup and archival
data. Although tape is much cheaper than disk, access to data is much
slower, since the tape must be read sequentially from the beginning.
For this reason, tape storage is referred to as sequential access storage
and is used primarily for recovery from disk failures. Tape devices
are less complex than disks, thus they are more reliable.

Storage Access
1. Block – A database file is partitioned into fixed length storage units
called blocks.
2. Buffer – Portion of main memory available to store copies of disk
blocks.
3. Buffer management – A major goal of the file structures is to
minimize the number of blocks that must be accessed. Another way
to reduce the number of disk accesses is to keep as many blocks as
possible in main memory. The goal is to maximize the chance that
when a block is accessed, it is already in main memory and, thus, no
disk access is required.
Since it is not possible to keep all blocks in main memory, we need to
manage the allocation of the space available in main memory for the storage of
blocks. The buffer is that part of main memory available for storage of copies
of disk blocks. There is always a copy kept on disk of every block, but the copy
on disk may be an older version of the block than the version in the buffer. The
subsystem responsible for the allocation of buffer space is called the buffer
manager. The buffer manager intercepts all requests made by the rest of the
system for blocks of the database. If the block is already in the buffer, the
requester is passed the address of the block in main memory. If the block is not
in the buffer, the buffer manager reads the block in from disk into the buffer, and
passes the address of the block in main memory to the requester. Thus, the buffer
manager is transparent to those system programs that issue disk block requests.
In order to serve the database system well, the buffer manager must use more
sophisticated techniques than typical virtual memory management schemes:
- Replacement Strategy. When there is no room left in the buffer, a
block must be removed from the buffer before a new one can be read
in. Typical operating systems use a least recently used scheme, in
which the block that was referenced least recently is written back to
disk and removed from the buffer. This simple approach can be
improved upon for database application.

- Pinned blocks. In order for the database system to be able to recover from crashes, it is necessary to restrict those times when a block may be written back to disk. A block that is not allowed to be written back to disk is said to be pinned. Although many operating systems do not provide support for pinned blocks, such a feature is essential for the implementation of a database system that is resilient to crashes.
- Forced output of blocks. There are situations in which it is necessary
to write the block back to disk even though the buffer space it occupies
is not needed. This is called the forced output of a block. The
requirement is due to the fact that main memory contents and thus
buffer contents are lost in a crash while data on disk usually survives
a crash.
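The buffer manager just described can be sketched roughly as follows. The sketch is simplified: it writes a block back whenever it is evicted (a real manager would track which blocks are dirty), and the class and method names are illustrative only.

from collections import OrderedDict

class BufferManager:
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk                   # dict of block-id -> contents, standing in for disk
        self.buffer = OrderedDict()        # buffered blocks in least-recently-used order
        self.pinned = set()                # blocks that must not be written back right now

    def request(self, block_id):
        if block_id in self.buffer:                  # already in the buffer: return its address
            self.buffer.move_to_end(block_id)
            return self.buffer[block_id]
        if len(self.buffer) >= self.capacity:        # no room: apply the replacement strategy
            for victim in self.buffer:               # least recently used first, skipping pinned
                if victim not in self.pinned:
                    self.disk[victim] = self.buffer.pop(victim)
                    break
            else:
                raise RuntimeError("every buffered block is pinned")
        self.buffer[block_id] = self.disk[block_id]  # read the block in from disk
        return self.buffer[block_id]

    def pin(self, block_id):
        self.pinned.add(block_id)

    def force_output(self, block_id):
        # Forced output: write the block back even though its buffer space is not needed.
        self.disk[block_id] = self.buffer[block_id]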

4.4 DATA DICTIONARY STORAGE

A relational database system needs to maintain data about the relations. This
information is called the data dictionary, or system catalog.
Among the types of information the system must store are:
- Names of the relations.
- Names of the attributes of each relation.
- Domains of attributes.
- Names of views defined on the database, and the definition of those views.
- Integrity constraints for each relation (for example, key constraints).
In addition to the above items, many systems keep the following data on
users of the system.
- Name of authorized users.
- Accounting information about users.
In systems that use highly sophisticated structures to store relations,
statistical and descriptive data about relations may be kept on the:
- Number of tuples in each relation
- Method of storage used for each relation
Some database systems store this information using special-purpose data
structures and code. It is generally preferable to store the data about the database
in the database itself. By using the database to store system data, we simplify
the overall structure of the system and allow the full power of the database to be
used to permit fast access to system data.
The exact choice of how to represent system data using relations must be
made by the system designer. One possible representation is :
System-catalog-scheme = (relation-name, number-of-attributes)
Attribute-scheme = (attribute-name, relation-name, domain-type, position)
User-scheme = (user-name, encrypted-password, group)
Index-scheme = (index-name, relation-name, index-type, index-attributes)
View-scheme = (view-name, definition)
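One way to picture this is to hold the catalog itself as ordinary relations and answer catalog queries against them, as in the following sketch; the tuples shown are illustrative only.

# The data dictionary stored as relations, following the schemes above (sample tuples only).
system_catalog = [
    ("deposit", 4),                                   # (relation-name, number-of-attributes)
]
attribute_scheme = [
    ("branch-name",    "deposit", "char(20)", 1),     # (attribute-name, relation-name, domain-type, position)
    ("account-number", "deposit", "integer",  2),
    ("customer-name",  "deposit", "char(20)", 3),
    ("balance",        "deposit", "real",     4),
]

def attributes_of(relation):
    # A catalog query answered using the catalog relations themselves.
    rows = [t for t in attribute_scheme if t[1] == relation]
    return [name for name, _rel, _dom, _pos in sorted(rows, key=lambda t: t[3])]

print(attributes_of("deposit"))   # ['branch-name', 'account-number', 'customer-name', 'balance']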

4.5 ORGANIZATION OF FILE

Sequential Files
A sequential file is designed for efficient processing of records in sorted
order based on some search key. To permit fast retrieval of records in search-
key order, records are chained together by pointers. The pointer in each record
points to the next record in search key order. In order to minimize the number
of block accesses in sequential file processing, records are stored physically in
search-key order, or as close to search-key order as possible.

Figure 4.6 Sequential file for deposit records.


Figure 4.6 shows a sequential file for deposit records taken from our banking example. The records are stored in search-key order, using branch-name
as the search key. It is difficult to maintain physical sequential order as records
are inserted and deleted, since it is costly to move many records as a result of a
single insertion or deletion. Deletion can be managed using pointer chains. For
insertion, we apply the following rules :
1. Locate the record in the file that comes before the record to be inserted
in search-key order.
2. If there is a free record within the same block as this record, insert the new record there. Otherwise, insert the new record in an overflow block. In either case, adjust the pointers so as to chain the records together in search-key order.


Figure 4.7 Sequential file after an insertion


Figure 4.7 shows the file of figure 4.6 after the insertion of the record
(North, 888, Adams, 800) . The structure in Figure 4.7 allows for fast insertion
of new records, but it forces sequential file-processing applications to process
records in an order that does not match the physical order of the records.
If relatively few records need to be stored in overflow blocks, this approach
works well. Eventually, however, the correspondence between search-key order
and physical order may be totally lost, and sequential processing becomes
significantly less efficient. At this point, the file should be reorganized so that it
is once again physically in sequential order. Such reorganizations are costly and
are done during times when the system load is low. The frequency with which
reorganizations are needed depends on the frequency of insertion of new records.
In the extreme case in which insertions rarely occur, it is possible always to keep
the file in physically sorted order.
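The two insertion rules above can be sketched as follows. The sketch keeps a pointer chain of keys in search-key order and a fixed number of records per block; the block capacity, the simplified block-selection test, and the omission of block allocation and deletion are assumptions made only for this illustration.

import bisect

class SequentialFile:
    def __init__(self, block_capacity=2):
        self.block_capacity = block_capacity   # assumed number of records per block
        self.blocks = [[]]                     # main area, kept as close to search-key order as possible
        self.overflow = []                     # overflow blocks
        self.chain = []                        # pointer chain: keys in search-key order

    def insert(self, key, record):
        bisect.insort(self.chain, key)         # rule 1: locate the position in search-key order
        target = self.blocks[0]
        for block in self.blocks:              # the block whose records come just before the key
            if block and block[0][0] <= key:
                target = block
        if len(target) < self.block_capacity:  # rule 2: a free record exists in that block
            bisect.insort(target, (key, record))
            return "main"
        self.overflow.append((key, record))    # otherwise the record goes to an overflow block
        return "overflow"

    def scan(self):
        return list(self.chain)                # sequential processing follows the chain, not physical order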

4.6 INDEXING AND HASHING

Many queries reference only a small proportion of the records in a file. For example, the query “Find all accounts at the Perry branch” references only a fraction of the account records. It is inefficient for the system to have to read every record and check the branch-name field for the name “Perry”. Ideally, the system should be able to locate these records directly. In order to allow these forms of access, we design additional structures that we associate with files. We
shall consider two general approaches to this problem : the construction of indices
and the construction of hash functions.

4.7 BASIC CONCEPTS

An index for a file works in much the same way as a catalog in library. If
we are looking for a book by a particular author, we look in the author catalog
and a card in the catalog tells us where to find the book. To assist us in searching
the catalog, the cards are kept in alphabetic order, so we do not have to check
every card to find the one we want.
In real-world databases, indices of the type described above may be too large to be handled efficiently. Instead, more sophisticated indexing techniques may be used. We shall consider techniques for both indexing and hashing. Each technique must be evaluated on the basis of :
- Access time. The time it takes to find a particular data item, using the
technique in question.
- Insertion time. The time it takes to insert a new data item. This
includes the time it takes to find the correct place to insert the new
data item as well as the time it takes to update the index structure.
- Deletion time. The time it takes to delete a data item. This includes
the time it takes to find the item to be deleted as well as the time it
takes to update the index structure.
- Space overhead. The additional space occupied by an index structure.
Provided that the amount of additional space is moderate, it is usually
worthwhile to sacrifice the space to achieve improved performance.
It is often the case that we desire to have more than one index or hash function for a file. In the library example, we note that most libraries maintain several card catalogs : for author, for subject, and for title. The attribute or set of attributes used to look up records in a file is called a search key. This definition of key differs from that of primary key, candidate key, and superkey.

Indexing
In order to allow fast random access to records in a file, an index structure is used. Each index structure is associated with a particular search key. If the file is sequentially ordered, and if we choose to include several indices on different search keys, the index whose search key specifies the sequential order of the file is the primary index. The other indices are called secondary indices. The search key of a primary index is usually the primary key.

We assume that all files are sequentially ordered and thus have a primary
search key. Such files, together with a primary index are called index-sequential NOTES
files. They are one of the oldest index schemes used in database systems. They
are designed for applications that require both sequential processing of the entire
file and random access to individual records.
There are two types of indices that may be used :
- Dense index. An index record appears for every search-key value in
the file. The record contains the search-key value and a pointer to the
record.
- Sparse index. Index records are created for only some of the records.
To locate a record, we find the index record with the largest search key
value that is less than or equal to the search key value for which we
are looking. We start at the record pointed to by that index record and
follow the pointers in the file until we find the desired record.

Figure 4.8 Sequential file for deposit records

Figure 4.9 Dense index
Primary Index
Figure 4.8 shows a sequential file of deposit records taken from our banking example. The records are stored in search-key order, using branch-name as the search key.
Figures 4.9 and 4.10 show dense and sparse indices, respectively, for the
deposit file. Suppose we are looking up records for the Perry branch. Using the
dense index of Figure 4.9, we follow the pointer directly to the first Perry record.
We process this record, and follow the pointer in that record to locate the next
record in search-key (branch-name) order. We continue processing records until
we encounter a record for a branch other than Perry.

Figure 4.10 : Sparse index


If we are using the sparse index (figure 4.10) we do not find an index entry
for “Perry”. Since the last entry before “Perry” is “Mianus,” we follow that
pointer. We then read the deposit file in sequential order, until we find the first
Perry record, and begin processing at that point.
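The two lookups just described can be sketched as follows; the small deposit list and the index entries are illustrative, and list positions stand in for disk pointers.

import bisect

# Deposit records in branch-name order; positions in the list stand in for record pointers.
deposit = [("Brighton", 217), ("Downtown", 101), ("Mianus", 215),
           ("Perry", 102), ("Perry", 201), ("Redwood", 222)]

dense = [("Brighton", 0), ("Downtown", 1), ("Mianus", 2),
         ("Perry", 3), ("Redwood", 5)]                      # one entry per search-key value
sparse = [("Brighton", 0), ("Mianus", 2), ("Redwood", 5)]   # entries for only some values

def lookup_dense(key):
    for k, ptr in dense:
        if k == key:
            return ptr                          # the pointer leads straight to the first record

def lookup_sparse(key):
    keys = [k for k, _ in sparse]
    i = bisect.bisect_right(keys, key) - 1      # largest index entry <= the key we want
    ptr = sparse[i][1]
    while deposit[ptr][0] != key:               # then read the file sequentially
        ptr += 1                                # (the key is assumed to be present)
    return ptr

print(lookup_dense("Perry"), lookup_sparse("Perry"))   # both locate record 3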
It is generally faster to locate a record if we have a dense index rather than
a sparse index. Sparse index have an advantage over dense indices in that they
require less space and they impose less maintenance overhead for insertions and
deletions. There is a trade-off that the system designer must make between access
time and space overhead. Although the decision regarding this trade-off is
dependent on the specific application, a good compromise is to have a sparse
index with one entry per block. The reason this design is a good trade-off is that
the dominant cost in processing a database request is the time it takes to bring a
block from disk into main memory. Once we have brought the block in, the
time to scan the entire block is negligible. Using this sparse index, we locate the
block containing the record we are seeking. Thus, unless the record is on an
overflow block we minimize block accesses while keeping the size of the index
as small as possible.
For the above technique to be fully general, we must consider the case where records for one search-key value occupy several blocks. It is easy to modify our scheme to handle this.
Even if we use a sparse index, the index itself may become too large for
efficient processing. It is not unreasonable, in practice, to have a file with
100,000 records, with 10 records stored in each block. If we have one index
record per block, the index has 10,000 records. Index records are smaller than
data records, so assume 100 index records fit on a block. Thus, our index
occupies 100 blocks.
If an index is sufficiently small to be kept in main memory, search time is
low. If the index is so large that it must be kept on disk, a search results in several
disk block reads. If the index occupies b blocks, and binary search is used, we
may read as many as 1 + log2(b) blocks. For our 100 block index, this means 7
block reads. Note that if overflow blocks have been used, binary search will not
be possible. In that case, a sequential search is typically used which requires b
block reads. Thus, the process of searching the index may be costly.
To deal with this problem, we treat the index just as we would treat any
other sequential file, and we construct a sparse index on the primary index. To
locate a record, we first use binary search on the outer index to find the record
for the largest search key value less than or equal to the one we desire. The
pointer point to a block of the inner index. We scan this block until we find the
record which has the largest search key values less than or equal to the one we
desire. The pointer in this record points to the block of the file that contains the
record for which we are looking.
Using the two levels of indexing, we have read only one index block rather than 7, if we assume that the outer index is already in main memory. If our file is extremely large, even the outer index may grow too large to fit in main memory. In this case, we can create yet another level of index. Indeed, we can repeat this process as many times as necessary. It is generally the case that two levels are sufficient and situations requiring more than three levels are extremely rare. Each level of index corresponds to a unit of physical storage. Thus, we
have indices at the track, cylinder and disk levels.
Regardless of what form of index is used, every index must be updated
whenever a record is either inserted into or deleted from the file. We describe
algorithms for updating single-level indices below.
- Deletion. In order to delete a record, it is necessary to look up the
record to be deleted. If the deleted record was the last record with its
particular search-key value, then we delete the search-key value from
the index. For dense indices, we delete a search-key value similarly
to deleting in a file. For sparse indices, we delete a key value by
replacing its entry in the index with the next search key value. If the
next search-key value already has an index entry, we delete the entry.
- Insertion. Perform a lookup using the search-key value appearing in the record to be inserted. If the index is dense, and the search-key value does not appear in the index, insert it. If the index is sparse, no change needs to be made to the index unless a new block is created.
In this case, the first search-key value appearing in the new block is inserted into the index.

Secondary Indices
Secondary Indices may be structured differently from primary indices. The
pointers in the Secondary Indices do not point directly to the file. Instead, each
of these pointers points to a bucket which contains pointers to the file.
This approach allows all the pointers for one secondary search-key value
to be stored together. Such an approach is useful in certain types of queries for
which we may do considerable processing using only the pointers. For primary
keys, we can obtain all the pointers for one primary search key value using a
sequential scan.
A sequential scan in primary key order is efficient because records are stored
physically in an order that approximates primary key order. We cannot store a
file physically ordered both by the primary key and a secondary key. Because
secondary key order and physical key order differ, if we attempt to scan the file
sequentially in secondary key order, the reading of each record is likely to require
the reading of a new block from disk.
The secondary index may be either dense or sparse. If it is dense, then the
pointer in each individual bucket points to records with the appropriate search-
key value. If the secondary index is sparse, then the pointer in each individual
bucket points to records with search key value in the appropriate range. In this
case each bucket entry is either a single pointer or a record consisting of two
fields : a search key value and a pointer to some file record.
By associating a search-key value with each pointer in the bucket, we
eliminate the need to read records with a secondary search-key value other than
the one on which we are performing a lookup.
The bucket structure can be eliminated if the secondary index is dense and
the search-key values form a primary key.
The procedure for deletion and insertion can be applied to a file with
multiple indices. Whenever a file is modified, every index must be updated.
Secondary indices improve the performance of queries that use keys other than
the primary one. However, they impose a serious overhead on modification of
the database. The designer of a database decides which secondary indices are
desirable on the basis of an estimate of the relative frequency of queries and
modifications.

4.8 B+ -TREES
The primary disadvantage of the index-sequential file organization is that performance degrades as the file grows. Although this degradation can be remedied by reorganization of the file, it is undesirable to perform such reorganizations frequently. The B+ -tree file structure is the most widely used of several file structures that maintain their efficiency despite insertion and deletion of data. A B+ -tree index takes the form of a balanced tree in which every path from the root of the tree to a leaf of the tree is of the same length. Each node in the tree has between [n/2] and n children, where n is fixed for a particular tree ([x] denotes the greatest integer not less than x; that is, we round upward).
We shall see that the B+ -tree structure imposes some overhead on insertion and deletion as well as some added space overhead. Nevertheless, this overhead is acceptable for files with a high frequency of modification since the cost of file reorganization is avoided.
A B+ -tree index is a multilevel index, but it has a structure that differs from that of a multilevel index-sequential file. A typical node of a B+ -tree contains up to n-1 search-key values K1, K2, ..., Kn-1 and n pointers P1, P2, ..., Pn. The search-key values within a node are kept in sorted order; thus if i < j, then Ki < Kj.
We consider first the structure of the leaf nodes. For 1 ≤ i < n, pointer Pi points to either a file record with search-key value Ki or to a bucket of pointers, each of which points to a file record with search-key value Ki. Below is a typical leaf node of a B+ -tree.

Figure 4.11 A leaf node for deposit B+ -tree index ( n = 3).


The bucket structure is used only if the search key does not form a primary key, and the file is not sorted in the search-key value order. Pointer Pn has a special purpose which we shall discuss shortly.
Figure 4.11 shows one leaf node of a B+-tree for the deposit file, in which we have chosen n to be 3, and the search key is branch-name. Note that since the deposit file is ordered by branch-name, the pointers in the leaf node point directly to the file.
Now that we have seen the structure of a leaf node, let us consider how
search-key values are assigned to particular nodes. Each leaf can hold
up to n - 1 values. We allow leaf nodes to contain as few as [(n - 1 )/2]
values. The ranges of values in each leaf do not overlap. Thus, if Li and Lj
are leaf nodes and i < j, then every search-key value in Li is less than every
search-key value in Lj. The set of leaf nodes in a B+-tree must form a
dense index so that every search-key value appears in some leaf node.
Now we can explain the use of the pointer Pn. Since there is a linear order
on the leaves based upon the search-key values they contain, we use Pn to chain
the leaf nodes together in search-key order. This allows for efficient sequential
processing of the file.
The nonleaf nodes of the B+-tree form a multilevel (sparse) index on the leaf nodes. The structure of nonleaf nodes is the same as for leaf nodes except that all pointers are pointers to tree nodes. A node may hold up to n pointers, but must hold at least [n/2] pointers. Let us consider a node containing m pointers. If 1 < i < m, pointer Pi points to the subtree containing search-key values less than Ki and greater than or equal to Ki-1. Pointer Pm points to the part of the subtree containing those key values greater than or equal to Km-1, and pointer P1 points to the part of the subtree containing those search-key values less than K1.
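The node layout just described can be written down as a small Python class. This is only a sketch of the structure, with illustrative names, not a full B+-tree implementation.

class BPlusNode:
    # Up to n-1 search-key values K1 < K2 < ... and up to n pointers.
    # In a leaf, pointers[i] leads to the record or bucket for keys[i] and
    # the extra Pn pointer chains to the next leaf in search-key order.
    # In a nonleaf node, the pointers lead to children and the keys separate them.
    def __init__(self, n, leaf=True):
        self.n = n
        self.leaf = leaf
        self.keys = []
        self.pointers = []
        self.next = None            # the Pn pointer of a leaf node

    def is_full(self):
        return len(self.keys) == self.n - 1

    def min_keys(self):
        # A nonroot leaf holds at least [(n-1)/2] values; a nonroot nonleaf
        # node holds at least [n/2] pointers, i.e. [n/2] - 1 keys.
        return self.n // 2 if self.leaf else -(-self.n // 2) - 1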

Figure 4.12 B+-tree for deposit file with n = 3.


The requirement that each node hold at least [n/2] pointers is imposed at all levels of the tree except for the root. Figure 4.12 shows a complete B+-tree for the deposit file (with n = 3). For simplicity, we have omitted both the pointers to the file itself and the null pointers. As an example of a B+-tree for which the root must have less than [n/2] values, we show a B+-tree for the deposit file in Figure 4.13 with n = 5. It is always possible to construct a B+-tree, for any n, in which all nonroot nodes contain at least [n/2] pointers.
The examples we have given of B+-trees have all been balanced. That is, the length of every path from the root to a leaf node is the same. This property is a requirement for a B+-tree. Indeed, the "B" in B+-tree stands for "balanced." It is the balance property of B+-trees that ensures good performance for lookup, insertion, and deletion.
Let us consider how queries are processed using a B+-tree. Suppose we wish to find all records with a search-key value of k. First, we examine the root node and look for the smallest search-key value greater than k. Assume this search-key value is Ki. We follow pointer Pi to another node. If k < K1, then we follow P1 to another node. If we have m pointers in the node and k ≥ Km-1, then we follow Pm to another node. Once again, we look for the smallest search-key value greater than k and follow the corresponding pointer. Eventually, we reach a leaf node, at which point the pointer directs us to the desired record or bucket.
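A sketch of this lookup, written against the BPlusNode class sketched earlier, follows; it assumes the keys in each node are kept sorted, as described above.

import bisect

def bplus_lookup(root, k):
    node = root
    while not node.leaf:
        # The number of keys <= k gives the child to follow: P1 when k < K1,
        # Pm when k >= Km-1, and Pi for a key in the range [Ki-1, Ki).
        i = bisect.bisect_right(node.keys, k)
        node = node.pointers[i]
    i = bisect.bisect_left(node.keys, k)         # at the leaf, look for k itself
    if i < len(node.keys) and node.keys[i] == k:
        return node.pointers[i]                  # record pointer or bucket for k
    return None                                  # k does not appear in the file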

Figure: 4.13 B+-tree for deposit file with n = 5.


Thus, in processing a query, a path is traversed in the tree from the root to some leaf node. If there are K search-key values in the file, the path is no longer than log[n/2](K). In practice, this means that only a few nodes need to be accessed even if the file is extremely large. Typically, a node is made to be the same size as a disk block. Thus, it is reasonable for n to be between 10 and 100 (or even larger). Even if we have one million search-key values in the file, a lookup requires that only between 3 and 9 nodes be accessed.
Insertion and deletion are more complicated than lookup since it may be
necessary to split a node that becomes too large as the result of an insertion or to
combine nodes if a node becomes too small (fewer than [n/2] pointers).
Furthermore, when a node is split or a pair of nodes is combined, we must ensure
that balance is preserved. To introduce the idea behind insertion and deletion in
a B+-tree, let us assume temporarily that nodes never become too large or too
small. Under this assumption, insertion and deletion are as defined below.
* Insertion. Using the same technique as for lookup, we find the leaf node in which the search-key value would appear. If the search-key value already appears in the leaf node, we add the new record to the file and, if necessary, a pointer to the bucket. If the search-key value does not appear, we insert the value in the leaf node, and position it so that the search keys are still in order. We then insert the new record in the file and, if necessary, create a new bucket with the appropriate pointer.
* Deletion. Using the same technique as for lookup, we find the record to be deleted, and remove it from the file. The search-key value is removed from the leaf node if there is no bucket associated with that search-key value or if the bucket becomes empty as a result of the deletion.
We now consider an example in which a node must be split. Assume that
we wish to insert a record with a branch-name value of "Clearview" into the B+-
tree of Figure 4.12. Using the algorithm for lookup, we find that "Clearview"
should appear in the node containing "Brighton" and "Downtown." There is no room to insert the search-key value "Clearview." Therefore, the node is split into two nodes.
Figure 4.14 shows the two leaf nodes that result from inserting "Clearview" and splitting the node containing "Brighton" and "Downtown." In general, we take the n search-key values (the n - 1 values in the leaf node plus the value being inserted) and put the first [(n - 1)/2] in the existing node and the remaining values in a new node.
Figure 4.14 Split of leaf node on insertion of "Clearview."

Having split a leaf node, we must insert the new leaf node into the B+-tree structure. In our example, the new node has "Downtown" as its smallest search-key value. We need to insert this search-key value into the
parent of the leaf node that was split. The search-key value "Downtown" was
inserted into the parent. It was possible to perform this insertion because there
was room for an added search-key value. If this were not the case, the parent
would have had to be split. In the worst case, all nodes along the path to the root
must be split. If the root itself is split, the entire tree becomes deeper.
The general technique for insertion into a B+-tree is to determine the leaf
node l into which insertion must occur. If a split results, insert the new node into
the parent of node l. If this insertion causes a split, proceed recursively until either
an insertion does not cause a split or a new root is created.
We now consider deletions that cause tree nodes to contain too few pointers.
First, let us delete "Downtown" from the B+-tree. We locate the entry for
"Downtown" using our lookup algorithm. When we delete the entry for
"Downtown" from its leaf node, the leaf becomes empty. Since, in our example
n = 3 and 0 < [(n - 1)/2 ], this node must be eliminated from the B+-tree. To delete
a leaf node, we must delete the pointer to it from its parent. In our example, this
leaves the parent node, which formerly contained three pointers, with only two
pointers. Since 2 >[n/2 ], the node is still sufficiently large and the deletion
operation is complete.
When a deletion is made to a parent of a leaf node, the parent node itself may become too small. This is exactly what happens if we delete "Perryridge" from the B+-tree. Deletion of the Perryridge entry causes a leaf node to become empty. When we delete the pointer to this node in its parent, the parent is left with only one pointer. Since n = 3, [n/2] = 2, and thus only one pointer is too
few. However, since the node contains useful information, we cannot simply
delete it. Instead, we look at the sibling node (containing the one search key,
Mianus). This sibling node has room to accommodate the information contained
in our now-too-small node, so we coalesce these nodes, so that the sibling node
now contains the keys "Mianus" and "Redwood." The other node (the node
containing only the search key "Redwood") now contains redundant information
and can be deleted from its parent (which happens to be the root in our example).
It is not always possible to coalesce nodes. To illustrate this, let us delete
"Perryridge" from the B+-tree In this example, the "Downtown" entry is still part
of the tree. Once again, the leaf node containing "Perryridge" becomes empty.
The parent of the leaf node becomes too small (only one pointer) . However, in
this example, the sibling node already contains the maximum number of
pointers, three.
Thus, it cannot accommodate an additional pointer. The solution in this case is to redistribute the pointers so that each sibling has two pointers.
Note that the redistribution of values necessitates a change of a search-key
value in the parent of the two siblings. In general, to delete a value in a B+-tree,
we perform a lookup on the value and delete it. If the node is too small, we delete
it from its parent. This results in recursive application of the deletion algorithm
until the root is reached, a parent remains adequately full after deletion, or
coalescence is applied.
Although insertion and deletion operations on B+-trees are complicated,
they require relatively few operations. It can be shown that the number of
operations needed for a worst-case insertion or deletion is proportional to the
logarithm of the number of search keys. It is the speed of operation on B+-trees that makes them a frequently used index structure in database implementations.

4.9 B-TREE INDEX FILES

B-tree indices are similar to B+-tree indices. The primary distinction between the two approaches is that a B-tree eliminates the redundant storage of search-key values.
A B-tree allows search-key values to appear only once. Since search keys are not repeated in the B-tree, we are able to store the index using fewer tree nodes than in the corresponding B+-tree index. However, since search keys that appear in nonleaf nodes appear nowhere else in the B-tree, we are forced to include an additional pointer field for each search key in a nonleaf node. These additional pointers point to either file records or buckets for the associated search key.
B-trees offer an additional advantage over B+-trees besides the lack of redundant storage of search keys. In a lookup on a B+-tree, it is always necessary
to traverse a path from the root of the tree to some leaf node. However, in a B-
tree, it is sometimes possible to find the desired value before reading a leaf node.
Thus, lookup is slightly faster in a B-tree, though, in general, lookup time is still
proportional to the logarithm of the number of search keys.
These advantages of B-trees over B+-trees are offset by several disadvantages:
- Leaf and nonleaf nodes are of the same size in a B+-tree. In a B-tree, the nonleaf nodes are larger. This complicates storage management for the index.

Figure 4.15 Typical nodes of a B-tree : (a) leaf node and (b) nonleaf node
- Deletion in a B-tree is more complicated. In a B+-tree, the deleted entry always appears in a leaf. In a B-tree, the deleted entry may appear in a nonleaf node. The proper value must be selected as a replacement from the subtree of the node containing the deleted entry. Specifically, if search key Ki is deleted, the smallest search key appearing in the subtree of pointer Pi+1 must be moved to the field formerly occupied by Ki.
The advantages of B-trees are marginal for large indices. Thus, the structural
simplicity of a B+-tree is preferred by many database system implementors.
Details of the insertion and deletion algorithms for B-trees are explored in the
exercises.

4.10 STATIC HASH FUNCTIONS

One disadvantage of index schemes is that we must access an index structure in order to locate data. The technique of hashing allows us to avoid accessing an index structure. We assume that the dense index is partitioned among a number of different buckets. The address of the bucket containing a pointer to the desired data item is obtained directly by computing a function on the search-key value of the desired record. Formally, let K denote the set of all search-key values, and B the set of all bucket addresses. A hash function h is a function from K to B.
The principle behind hashing is that, although the set K of all possible search-key values is large (perhaps infinite), the set {K1, K2, . . . , Kn} of search-key values actually stored in the database is much smaller than K.
We do not know at design time which search-key values will be stored in
the database, but we know that there are too many possible values to justify
allocating one bucket for every possible value. We do know, however, at design
time approximately how many search-key values will be stored in the database.
We choose the number of buckets to correspond to the number of search-key
values we expect to have stored in the database. It is the hash function that defines
the assignment of search-key values to particular buckets.
Hash functions require careful design. A bad hash function may result in
lookup taking time proportional to the number of search keys in the file. A well-
designed function gives an average-case lookup time that is a (small) constant,
independent of the number of search keys in the file. This is accomplished by
ensuring that, on average, records are distributed uniformly among the buckets.
Let h denote a hash function. To perform a lookup on a search-key value Ki, we simply compute h(Ki) and search the bucket with that address. Suppose that two search keys, K5 and K7, have the same hash value, that is, h(K5) = h(K7). If we perform a lookup on K5, the bucket h(K5) contains records with search-key values K5 and records with search-key values K7. Thus, we have to check the search-key value of every record in the bucket to verify that the record is one that we want.
The worst possible hash function maps all search-key values to the same
bucket. This is undesirable because the entire dense index is kept in the same
bucket, and thus lookup requires scanning the entire index. An ideal hash function
maps every search-key value to a distinct bucket. Such a function is ideal because
every record in the bucket searched as a result of a lookup has the desired search-
key value.
Since we do not know at design time precisely which search-key values will
be stored in the file, we want to choose a hash function that assigns search-key
values to buckets such that:
* The distribution is uniform. That is, each bucket is assigned the same
number of search-key values from the set of all possible search-key
values.
* The distribution is random. That is, in the average case, each bucket will
have nearly the same number of values assigned to it.
To illustrate these principles, let us attempt to choose a hash function for
the deposit file using the search key branch-name . The hash function we choose
must have desirable properties not only on the example deposit file we have been
using, but also on a deposit file of realistic size for a large bank with many
branches.
Assume that we decide to have 26 buckets and define a hash function that maps names beginning with the ith letter of the alphabet to the ith bucket. This hash function has the virtue of simplicity, but it fails to provide a uniform distribution, since we expect more branch names to begin with such letters as "B" and "R" than "Q" and "X," for example.
Typical hash functions perform some computation on the internal binary machine representation of characters in the search key. A simple hash function of this type is to compute the sum, modulo the number of buckets allocated, of the binary representations of the characters of a key. Such a scheme can be applied, using 10 buckets, to the deposit file, under the assumption that the ith letter in the alphabet is represented by the integer i.
Insertion is almost as simple as lookup. If the search-key value of the record to be inserted is Ki, we compute h(Ki) to locate the bucket for that record.
Deletion is equally straightforward. If the search-key value of the record to be deleted is Ki, we compute h(Ki) and search the corresponding bucket for that record.
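A sketch of this scheme, with 10 buckets and the sum of the character codes taken modulo the number of buckets, is shown below; the sample records are illustrative.

NBUCKETS = 10

def h(branch_name):
    # Sum of the character codes of the key, modulo the number of buckets allocated.
    return sum(ord(c) for c in branch_name) % NBUCKETS

buckets = [[] for _ in range(NBUCKETS)]          # one bucket of records per bucket address

def insert(record):                              # record = (branch-name, account-number, customer, balance)
    buckets[h(record[0])].append(record)

def lookup(branch_name):
    # Every record in the bucket must be checked, since different keys may hash to the same address.
    return [r for r in buckets[h(branch_name)] if r[0] == branch_name]

def delete(record):
    buckets[h(record[0])].remove(record)

insert(("Perry", 102, "Hayes", 400.0))
insert(("Round Hill", 305, "Turner", 350.0))
print(lookup("Perry"))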
The form of hash structure we have described above is sometimes referred
to as open hashing. Under an alternative approach, called closed hashing, all
records are stored in one bucket and the hash function computes addresses within
the bucket. Closed hashing is used frequently in the construction of symbol tables
for compilers and assemblers, but open hashing is preferred for database systems.
The reason is that deletion under closed hashing is troublesome. Typically,
compilers and assemblers perform only lookup and insertion operations on their
symbol tables. However, in a database system, it is important to be able to handle
deletion as well as insertion. Thus, closed hashing is of only minor importance
in database implementation.
An important drawback to the form of hashing we have described above is
that the hash function must be chosen when we implement the system and cannot
be changed easily thereafter. Since the function h maps search-key values to a
fixed set B of bucket addresses, we waste space if B is excessively large. If B is
too small, our buckets contain records of many different search-key values, and
performance suffers. Typically, choosing the size of B to be twice the number of
search-key values in the file gives a good space/performance trade-off.

4.11 DYNAMIC HASH FUNCTIONS

As we have seen, the need to fix the set B of bucket addresses is a serious problem with the static hashing technique of the previous section. Most databases grow larger over time. If we are to use static hashing for such a database, we face three classes of options:
• Choose a hash function based on the current file size. This will result in performance degradation as the database grows.

• Choose a hash function based on the anticipated size of the file at some
point in the future. Although performance degradation is avoided, a NOTES
significant amount of space is wasted initially.
• Periodically reorganize the hash structure in response to file growth. Such
reorganization involves choosing a new hash function, recomputing the
hash function on every record in the file, and generating new bucket
assignments. This is a massive time-consuming reorganization.
Furthermore, it is necessary to forbid access to the file during
reorganization.
Several hashing techniques allow the hash function to be modified
dynamically in order to accommodate the growth or shrinkage of the database.
These techniques are called dynamic hash functions. Below, we describe one
form of dynamic hashing called extendable hashing. Extendable hashing copes
with changes in database size by splitting and coalescing buckets as the database
grows and shrinks. As a result, space efficiency is retained. Moreover, since the
reorganization is performed on only one bucket at a time, the resulting
performance overhead is acceptably low.
With extendable hashing, we choose a hash function h with the desirable
properties of uniformity and randomness. However, this hash function generates
values over a relatively large range, namely b-bit binary integers. A typical value
for b is 32.
We do not create a bucket for each hash value. Indeed, 2^32 is over 4 billion, and that many buckets is unreasonable for all but the largest databases. Instead, we create buckets on demand, as records are inserted into the file. We do not use the entire b bits of the hash initially. At any point, we use i bits, where 0 ≤ i ≤ b. These i bits are used as an offset into an additional table of bucket addresses. The value of i grows and shrinks with the size of the database.
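A rough sketch of this idea follows. The 32-bit hash (here a truncated SHA-256, an illustrative choice), the small bucket size, and the in-memory dictionaries are all assumptions made for the sketch; a real implementation would also handle deletion and the coalescing of buckets.

import hashlib

B = 32                                            # number of bits produced by the hash function

def h(key):
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")

class ExtendableHash:
    def __init__(self, bucket_size=2):
        self.i = 0                                # number of hash bits currently in use
        self.bucket_size = bucket_size
        self.directory = [{"depth": 0, "items": {}}]   # table of 2**i bucket addresses

    def _index(self, key):
        return h(key) >> (B - self.i)             # the first i bits are an offset into the table

    def insert(self, key, value):
        bucket = self.directory[self._index(key)]
        bucket["items"][key] = value
        if len(bucket["items"]) > self.bucket_size:
            self._split(bucket)

    def _split(self, bucket):
        if bucket["depth"] == self.i:             # no spare bit: double the bucket-address table
            self.directory = [b for b in self.directory for _ in (0, 1)]
            self.i += 1
        bucket["depth"] += 1
        new = {"depth": bucket["depth"], "items": {}}
        old_items, bucket["items"] = bucket["items"], {}
        for j, b in enumerate(self.directory):    # half of the entries now point to the new bucket
            if b is bucket and (j >> (self.i - bucket["depth"])) & 1:
                self.directory[j] = new
        for k, v in old_items.items():            # redistribute the records of the split bucket
            self.insert(k, v)

    def lookup(self, key):
        return self.directory[self._index(key)]["items"].get(key)

Only one bucket is reorganized at a time when it overflows, which is why the performance overhead of extendable hashing remains acceptably low.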

4.12 ROLE AND RESPONSIBILITY OF DBA

One of the main reasons for having database management systems is to have central control of both data and programs accessing that data. The person having such central control over the system is called the database administrator (DBA). The functions of the database administrator include :
- Scheme definition. The original database scheme is created by writing
a set of definitions which are translated by the DDL compiler to set of
tables that are permanently stored in the data dictionary.

- Storage structure and access method definition. Appropriate storage structures and access methods are created by writing a set of definitions which are translated by the data storage and definition language compiler.
- Scheme and physical organization modification. Modifications to
either the database scheme or the description of the physical storage
organization, although relatively rare, are accomplished by writing a
set of definitions which are used by either the DDL compiler or the
data storage and definition language compiler to generate
modifications to the appropriate internal system tables.
- Granting of authorization for data access. The granting of different
types of authorization allows the database administrator to regulate
which parts of the database various users can access.
- Integrity constraint specification. Integrity constraints are kept in a special system structure that is consulted by the database manager whenever an update takes place in the system.

Summary
Many queries reference only a small proportion of the records in a file. In
order to reduce the overhead in searching for these records, we can construct
indices for the files that store the database.
Index-sequential files are one of the oldest index schemes used in database
systems. They are designed for applications that require both sequential
processing of the entire file and random access to individual records. To permit
fast retrieval of records in search-key order, records are chained together by
pointers. In order to allow fast random access, an index structure is used. There
are two types of indices that may be used; dense index and sparse index.
In a standard index-sequential file, only one index is maintained. If several
indices on different search keys are used, the index whose search key specifies
the sequential order of the file is the primary index. The other indices are called
secondary indices. Secondary indices improve the performance of queries that
use search keys other than the primary one. However, they impose a serious
overhead on modification of the database.
The primary disadvantage of the index-sequential file organization is that
performance degrades as the file grows. To overcome this deficiency, a B+ -tree
index can be used. A B+ -tree index takes the form of a balanced tree in which
every path from the root of the tree to a leaf of the tree is of the same length.
Lookup in this scheme is quite straightforward and efficient. However, insertion
and deletion are more complicated. Nevertheless, the number of operations
required for insertion and deletion on B+ -tree is proportional to the logarithm
of the size of the database.
B-tree indices are similar to B+ -tree indices. The primary advantage of a B-tree is that it eliminates the redundant storage of search-key values. The major disadvantage is that leaf and nonleaf nodes are not of the same size; thus, storage management for the index is complicated.
Index schemes require that we access an index structure to locate data. The
technique of hashing, by contrast, allows us to find the address of a data item
directly by computing a function on the search key value of the desired record.
Since we do not know at design time precisely which search-key values will be
stored in the file, a good hash function to choose is one that assigns search-key
values to buckets such that the distribution is both uniform and random.
A static hash function is one in which the set of bucket addresses is fixed.
Such a function cannot easily accommodate databases that grow significantly
larger over time. There are several hashing techniques that allow the hash
function to be modified. These are called dynamic hash functions.

Exercises
1. List the role and responsibilities of Database Administrator.
2. What is the difference between primary index and a secondary index?
3. Write note on B+ -tree indices.
4. Explain Indexing and Hashing.
5. Explain the concept of File organization.
6. Elaborate the concept of Data Dictionary Storage.

*****


UNIT - 5
TRANSACTION AND CONCURRENCY CONTROL

Structure
5.0 Introduction
5.1 Unit Objectives
5.2 Transaction and Concurrency Control , Multiprogramming and
Multiprocessing
5.3 Concept of transaction
5.3.1 Transaction state
5.4 ACID properties
5.5 Schedules
5.6 Serializability of schedules
5.7 Concurrency Control
5.7.1 lock based protocols
5.7.2 timestamp based protocols
5.8 Deadlock and its handling
5.8.1 Wait-Die and Wound-Wait, Deadlock prevention without using
timestamps
5.9 Deadlock detection and time outs

5.1 UNIT OBJECTIVES

In this chapter, we address the construction of the transaction-processing component of the database manager. We will consider the issues of concurrency control and recovery together.

5.2 MULTIPROGRAMMING AND MULTIPROCESSING
Both manual loading and batch processing systems load and process one
job at a time. After such a system loads a job, the job remains in main memory
until its execution is over, and the system loads the next job only after completion
of the current job.
The job that is currently loaded in the system is the sole occupant of the user’s area of main memory (the operating system resides in a separate part of main memory) and has the CPU available exclusively for its execution. Such a system is called a uniprogramming system: it processes only one job at a time, and all system resources are available exclusively for the job until its completion.
A job does not need CPU for entire duration of its processing because in
addition to computing (for which CPU is needed) it often needs to perform I/O
operations (such as reading/writing of data from/to tape/disk, waiting for data
input from keyboard/mouse, and printing of results) during the course of its
processing. In fact, depending on CPU utilization during the course of
processing, jobs are of two types :
1. CPU bound jobs. These jobs mostly perform computations with little
I/O operations. Hence, their CPU utilization is very high. Programs
used for scientific and engineering computations usually fall in this
category of jobs.
2. I/O bound jobs. These jobs mostly perform I/O operations with little
computation. Hence, their CPU utilization is very low. Programs used
for commercial data processing applications usually fall in this
category of jobs.
In a uniprogramming system, the CPU is idle whenever the currently executing job performs I/O operations. CPU idle time may not be significant for CPU bound jobs, but it may be of the order of 80-90% for I/O bound jobs. Moreover,
since I/O devices are slower than CPU by 20 to 100 times, CPU idle time is
significant even for CPU bound jobs.
To minimize the idle time of CPU, researchers introduced the concept of
multiprogramming, which uses the idea of organizing multiple jobs in a system
so that its CPU always has something to execute.
Multiprogramming is interleaved execution of two or more different and
independent programs by a computer. Multiprogramming systems have better
throughput than uniprogramming systems because Multiprogramming reduces
CPU idle time drastically. However, they are more sophisticated because they
require the following hardware and software features :
1. Large memory. Multiprogramming requires large main memory to accommodate a good number of user programs along with the operating system.
2. Memory protection. Multiprogramming requires a memory
protection mechanism to prevent a job (in one memory partition) from
changing another job’s program/data (in another memory partition).
Operating systems use a combination of hardware and software
protection mechanisms for this. It prevents one job from addressing
beyond the limits of its own allocated memory area.
3. Job status preservation. In multiprogramming, when the operating
system blocks a running job because the job needs to do I/O
processing, it takes CPU away from the job and gives it to another job
that is ready for execution. The blocked job resumes its execution
sometime later. To enable this, the operating system must preserve the
job’s complete status information when it takes away CPU form it, and
must restore this information back before it gives the CPU to the job
again. For this, operating system maintains a process control block
(PCB) for each process, which is loaded in memory partition. Figure
5.1 shows a typical process control block. With this arrangement,
before taking away CPU from a running process, the operating system
preserves its status in its PCB. The operating system restores back the
status of this process just before it gives back the CPU to it again later
to resume its execution. Hence, the process can continue execution
without any problem.
4. Proper job mix. A multiprogramming system requires a proper mix
of I/O –bound and CPU bound jobs to overlap the operations of CPU
and I/O devices effectively. If all loaded jobs need I/O at the same
time, CPU will again be idle. Hence, jobs resident simultaneously in
main memory should contain a good mix of CPU bound and I/O bound jobs
so that at least one job is always ready to utilize CPU.
5. CPU scheduling. In a multiprogramming system, often there are
multiple jobs in ready state (waiting for the operating system to
allocate CPU to it). Hence, when CPU becomes free, operating system
must have a mechanism to decide to which one of these ready jobs it
should allocate the CPU. The part of the operating system that takes this
decision is called CPU scheduler, and the algorithm it uses for this is
called CPU scheduling algorithm.


Figure 5.1 A typical process control block (PCB)

Multiprocessing
The use of I/O processors improves the efficiency of a computer system by
making concurrent input processing, and output operations possible. CPU
performs arithmetic and logical operations, while I/O processors carry out I/O
operations concurrently.

Figure 5.2 Architecture of a computer system with its CPU, memory and I/O processors
Figure 5.2 shows the architecture of a computer system with its CPU,
memory and I/O processors.
Designers carried the idea of using I/O processors further to improve system
performance by designing systems with multiple CPUs. Such systems are called
multiprocessing systems because they use multiple processors (CPUs) and can
execute multiple processes concurrently. These systems use multiple processors
(CPUs) to process either instructions from different and independent programs
or different instructions from the same program simultaneously. Figure 5.3 shows
the basic organization of a typical multiprocessing system.


Figure 5.3 Basic organization of a typical multiprocessing system.


Multiprocessing systems are of two types – tightly coupled systems and
loosely coupled systems. In tightly coupled systems, all processors share a single
system-wide primary memory. On the other hand, in loosely coupled systems,
the processors do not share memory, and each processor has its own local
memory.

Difference between Multiprogramming and Multiprocessing


Multiprocessing is the simultaneous execution of two or more processes by a
computer system having more than one CPU. Multiprogramming is the interleaved
execution of two or more processes by a single-CPU system. Hence, while
multiprogramming involves execution of a portion of one program and then a
portion of another, multiprocessing involves truly simultaneous execution of
several programs.

Advantages and Limitations of Multiprocessing


Multiprocessing systems have the following advantages :
1. Better performance
They have better performance (shorter response times and higher
throughput) than single processor systems. For example, if we have
to run two different programs, a two processor system is evidently
more powerful than a single processor system because the two
processor system can run both programs on different processors.
Similarly, if we can partition a large computation job into multiple sub-
computations(sub-processes) that can run concurrently, a
Multiprocessor system with sufficient number of processors can run
all sub-processes simultaneously, with each one on a different
processor known as parallel processing. The speed up ratio with n
processors is not n but less than n. This is because when multiple
processors cooperate to execute the processes or sub processes, the
operating system incurs certain amount of overhead in keeping
everything working correctly. This overhead, plus contention for shared
resources, lowers the expected gains from additional processors.
2. Better reliability
They also have better reliability than single processor systems. In a
properly designed multiprocessor system, if one of the processors
breaks down, the system can continue to function with the remaining
processors. The system can tolerate partial failure of processors and
can avoid a complete breakdown. For example, if a system has 4 processors
and one fails, then the system can utilize the remaining 3 processors
to process the jobs. Thus, the entire system runs only 25% slower,
rather than failing altogether. This ability of a system to continue
providing service proportional to the level of non-failed hardware is
called graceful degradation feature.
Multiprocessing systems, however, require a very sophisticated operating
system to schedule, balance, and coordinate the input, output and processing
activities of multiple processors. Designing such an operating system is complex
and time consuming. Multiprocessing systems have a higher initial cost, and their
regular operation and maintenance is costlier than that of single-processor systems.

5.3 CONCEPT OF TRANSACTION

- A transaction is a series of operations performed as a single logical
unit of work on the Database Management System. A transaction leads
to modifications in the database contents.
- A transaction is initiated by a user program written in a high-level
data manipulation language like SQL or a programming language like
Java. The transaction consists of all the operations executed between
start transaction and end transaction.
- Transactions in a database environment serve two main purposes :
1. To provide reliable units of work that allow correct recovery from
failures and keep a database consistent even in cases of system failure,
when execution stops and many operations upon a database remain
uncompleted, with unclear status.
Example : In a banking system, transferring money from one account
to another leads to change in balance values of both accounts. To keep
the database in consistent state both changes must succeed or fail
together.
2. To provide isolation between programs accessing a database
concurrently. If this isolation is not provided, the programs' outcomes
may be incorrect.
Process of Transaction
The execution of a transaction involves a series of read and write
operations on database objects, which are as follows :
(A) Read Operation
To read a data item that is present in main memory, the Read()
operation is used.
Read (X)
This statement states that the value of X is read from main memory (i.e. from
the local buffer).
If it is not present in the main memory then :
1. It is first brought into main memory from disk, using Input() operation.
Like Input (X);
2. Then its value is copied into a program variable as shown in Fig 5.4

(B) Write Operation


To write a database object, an in-memory copy of the object is first modified
and then written to disk.
Write (X);
This statement states that value of X is written to local buffer which is in
main memory. And then this updated value is copied into appropriate disk block,
so that the changes are done permanently as shown in fig 5.5
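The following short Python sketch is illustrative only; the names disk, buffer, read_item and write_item are assumptions made for this example and are not part of any real DBMS. It shows how the Read(X) and Write(X) operations described above move a data item between the disk and the local buffer in main memory.

# Minimal sketch of buffered Read/Write: plain dictionaries stand in for
# disk blocks and the local buffer in main memory.
disk = {"X": 100}      # persistent copy of data items
buffer = {}            # in-memory (local buffer) copies

def read_item(name):
    """Read(name): bring the item into the buffer if absent, then return it."""
    if name not in buffer:          # Input() operation: disk -> main memory
        buffer[name] = disk[name]
    return buffer[name]

def write_item(name, value):
    """Write(name): update the buffer copy, then copy it to disk."""
    buffer[name] = value            # modify the in-memory copy first
    disk[name] = value              # Output() operation: main memory -> disk

x = read_item("X")
write_item("X", x + 50)
print(disk["X"])                    # 150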

5.3.1 Transaction States
There are five states of transaction :


Fig 5.6 Transaction States


During execution, transaction will be in one of the following state.

Active
- This is the first state of transaction. In this state the transaction is being
executed. This state is entry point to every transaction. Transaction
which are in active state indicate that their execution has been started.

Partially Committed
When a transaction completes all operations, it will enter into partially
committed state. Even though the last operation has been performed, the data
is still not saved to the database.

Failed
- If execution of transaction cannot proceed due to failure of the system
or failure in database, then the transaction is said to be in failed state.

Abort
- In case of the failed transaction, the modification done in database
during transaction processing must be rolled back to ensure the
atomicity and consistency of database.
- The transaction enters into the “Abort” state after the rollback operation
is performed. It is the end of the transaction when any fault occurs.

Committed
- If the transaction completes successfully without any error, it will come
into the committed state, which allows its changes to be made permanent in
the database. This is the last step of a transaction, if it executes
without fail.
- Every transaction goes through at least three states among the five.
A transaction may go through active – partially committed –
committed, or through active – failed – aborted, or through active –
partially committed – failed – aborted.
5.4 PROPERTIES OF TRANSACTION

The set of properties of database transactions is ACID (Atomicity,
Consistency, Isolation, Durability). A sequence of database operations that
satisfies the ACID properties, and thus can be perceived as a single logical
operation on the data, is called a transaction.

ACID properties are:-


(A) Atomicity
1. Atomicity is based on the concept that each transaction is “all or
nothing” : if any one part of the transaction fails, then
the entire transaction fails, and there will be no change in the database
state.
2. An atomic system must guarantee atomicity in every situation,
including power failures, errors and crashes. To the outside world, a
committed transaction appears to take effect completely through its effects on
the database, while an aborted transaction appears not to have happened at all.
That is, a transaction should be indivisible.
3. This property states that, either all operations contained by a
transaction are done successfully or none of them complete at all. To
maintain the consistency of data this property is very useful.

(B) Consistency
1. The consistency property ensures that the transaction executed on the
database system will bring the database from one valid state to another.
2. The data which is written by the transaction must be valid according
to all standard rules and regulations regarding constraints, cascades,
triggers etc. Consistency does not guarantee accuracy of the
transaction as per the expectations of programmer.

(C) Isolation
1. When transactions are performed in a sequence, the state of system is
always valid without any problem. But sometimes we may have to
perform multiple transactions concurrently.
2. In case of concurrent transactions, the isolation property ensures that
the system state should be same that would be obtained if transactions
were executed sequentially, i.e. one after the other. The effect of any
incomplete transaction should be invisible to other transactions by
means of isolation.

(D) Durability


1. The durability property assures that after transaction committed
successfully the updates made should remain permanent in the
database even in the event of power loss, crashes or errors.
2. For example, in a database management system, when a series of transactions
is executed, the modifications done should be stored permanently even
if the database crashes just after the transaction. To avoid problems
because of power loss, the state of the database after every transaction
must be recorded in non-volatile memory.
A complete example of transaction
- Consider a banking system having number of accounts stored in
the database.
- To access the database following operations are used in
transaction:
- Read(X), which transfers data X from the database to the local buffer, &
Write(X), which transfers the data X from the local buffer to the database.
- Suppose, before the transaction M has $500 and N has $1000
balance in their accounts.
- Let Ti be a transaction which will transfer $150 from account M
to N, written as :
Ti : Read(M);
M:= M-150;
Write(M);
Read(N);
N:=N+150;
Write(N);
- Atomicity : Suppose a failure occurs in the system while transaction Ti
has proceeded up to Write(M); and the further operations are not
executed due to the failure. This will leave the banking database system
in an inconsistent state if the atomicity property is not provided.
- But if the atomicity property is provided, then if all operations are
not executed, the operations which were executed before the
failure get cancelled.
- Consistency : If consistency is provided, then in the above example
the sum of the balances in accounts M and N will be the same before
transaction Ti performs and after Ti completes. In the initial situation
M + N gives $1500. If the transaction commits successfully then
M holds $350 and N holds $1150. Addition of the amounts will give
$1500. If the transaction does not commit successfully then M
holds $500 and N holds $1000. Addition of the amounts will give
$1500.
- Isolation : This property ensures that no other transaction
interferes with transaction Ti during its execution.
- Durability : It ensures that the data modifications done by
transaction Ti will persist in the database even if any failure occurs
in the system after the transaction commits.
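As a hedged illustration of how these ACID guarantees look in practice, the following Python sketch uses the standard sqlite3 module to perform the same $150 transfer inside one transaction; the table name account and its columns are assumptions made for this example, not part of the text above. If either UPDATE fails, the rollback undoes the partial transfer (atomicity), and after a successful commit the changes persist (durability).

import sqlite3

# Hypothetical schema: account(name TEXT PRIMARY KEY, balance INTEGER).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("M", 500), ("N", 1000)])
conn.commit()

try:
    # Both updates form one transaction: either both persist or neither does.
    conn.execute("UPDATE account SET balance = balance - 150 WHERE name = 'M'")
    conn.execute("UPDATE account SET balance = balance + 150 WHERE name = 'N'")
    conn.commit()          # durability: changes survive after commit
except sqlite3.Error:
    conn.rollback()        # atomicity: undo the partial transfer on failure

print(dict(conn.execute("SELECT name, balance FROM account")))
# {'M': 350, 'N': 1150} -- the sum stays $1500 (consistency)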

5.5 SCHEDULES

- A schedule is a collection of several transactions that can be executed
in a predefined order as a single unit.
- A schedule may combine the operations of one transaction with those of
another transaction. These operations must be in the same sequence as they
occur in their individual transactions.
- A schedule is necessary in a situation when a system has several
transactions which are going to be performed on the same database object.
If those transactions are executed in parallel, they may affect the result
of the transaction.
Example : Consider transaction T1 and T2 are present in the system,
and both use the values of account A commonly. If one transaction is
updating the values which the other transaction is accessing, then the
order of these two transactions will change the result of second
transaction. Hence a schedule is created to execute the transactions.
- Depending on the arrangement of these transactions, schedule can be
divided into two types : Serial schedule and concurrent schedule.

5.6 TYPES OF SCHEDULE

(A) Serial schedule


- In a serial schedule all the transactions in the schedule are executed one
after another.
- Example : account M has $500 and account N has $1000 initially.
Transaction T1 transfers $200 from account M to N and transaction
T2 transfers $100 from account M to N. T1 and T2 can be scheduled
in two ways for execution.
Schedule 1 : Serial schedule for T1 followed by T2
T1                        T2
Read(M);
M=M-200;
Write(M);
Read(N);
N=N+200;
Write(N);
                          Read(M);
                          M=M-100;
                          Write(M);
                          Read(N);
                          N=N+100;
                          Write(N);

After execution of both transactions M holds $200 and N holds $1300.


Schedule 2 : Serial schedule for T2 followed by T1
T1                        T2
                          Read(M);
                          M=M-100;
                          Write(M);
                          Read(N);
                          N=N+100;
                          Write(N);
Read(M);
M=M-200;
Write(M);
Read(N);
N=N+200;
Write(N);
After execution of both transactions M holds $200 and N holds $1300.
(B) Concurrent Schedule
A concurrent schedule is a schedule where several transactions are executed
simultaneously. The operations of one transaction can be mixed with operations
of another transaction.
Schedule 3 : Concurrent schedule for T1 and T2
T1                        T2
Read(M);
M=M-200;
Write(M);
                          Read(M);
                          M=M-100;
                          Write(M);
Read(N);
N=N+200;
Write(N);
                          Read(N);
                          N=N+100;
                          Write(N);
After execution of both transactions M holds $200 and N holds $1300.
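The following small Python sketch is illustrative only; the function and variable names are assumptions made for this example. It replays the three schedules above, treating each read-modify-write as one indivisible step, and confirms that all of them leave M with $200 and N with $1300.

# A tiny replay of the schedules: each step is (account, delta).
def run(schedule, accounts):
    state = dict(accounts)
    for account, delta in schedule:        # read, modify, write in one step
        state[account] = state[account] + delta
    return state

t1 = [("M", -200), ("N", +200)]            # T1 transfers $200 from M to N
t2 = [("M", -100), ("N", +100)]            # T2 transfers $100 from M to N

serial_1 = t1 + t2                         # Schedule 1: T1 then T2
serial_2 = t2 + t1                         # Schedule 2: T2 then T1
concurrent = [t1[0], t2[0], t1[1], t2[1]]  # Schedule 3: interleaved as shown above

initial = {"M": 500, "N": 1000}
for s in (serial_1, serial_2, concurrent):
    print(run(s, initial))                 # each prints {'M': 200, 'N': 1300}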

5.7 CONCURRENCY CONTROL

Concurrency control mechanisms help ensure the serializability
of schedules by using concurrency control protocols. The three concurrency
control mechanisms are locking, timestamp based and optimistic. Different
computer systems may use different concurrency control mechanisms.

5.7.1 Locks
A lock is a variable associated with a data item that defines the status of the
data item corresponding to the database operation, which can be performed on
it. Locks help synchronise the access of data items when concurrent transactions
are executed on the items.
There are two different types of locks : binary and shared/exclusive.

Binary locks
Binary locks are the locks that have two states, locked and unlocked. If the
value of lock on a data item is 1, it means that the data item is in locked state and
cannot be accessed by any database access operation. The 0 value of lock on the NOTES
data item specifies that the data item is in unlocked state and can be accessed by
a database access operation.
When a transaction requests access to a data item, it issues the
lock_item(X) operation. A lock value of 1 specifies that the
transaction is required to wait for the data item, whereas a lock value of 0
specifies that the transaction can lock and access the data item. When the
transaction stops accessing a data item, it issues an unlock_item(X) operation so
that other transactions can use the data item.
The database can easily implement binary lock by maintaining a record with
the following three fields :
- Data item name
- The LOCK variable
- Locking transaction
The system maintains the records of locks in the lock table. The items that
are not included in the lock table are called unlocked items. A transaction follows
the following rules if binary lock is implemented :
- A transaction should issue the lock_item(X) operation before performing
read and write operations on a data item.
- A transaction should issue the unlock_item(X) operation after
performing all read and write operations on a data item.
- A transaction cannot issue the lock_item(X) operation if it already holds
a lock on a data item.
- A transaction cannot issue the unlock_item(X) operation if it does not
hold a lock on a data item.
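A minimal Python sketch of a binary lock table is given below; it assumes a simple dictionary that maps each data item name to its LOCK value and locking transaction, and it only illustrates the rules above, it is not a production lock manager.

# Sketch of a binary lock table: item -> (LOCK value, locking transaction).
lock_table = {}

def lock_item(x, txn):
    """Return True if txn acquires the lock on x, False if it must wait."""
    if lock_table.get(x, (0, None))[0] == 1:    # LOCK = 1: item already locked
        return False                             # txn must wait
    lock_table[x] = (1, txn)                     # LOCK = 0: grant the lock
    return True

def unlock_item(x, txn):
    """Release the lock; only the holding transaction may unlock."""
    if lock_table.get(x) == (1, txn):
        del lock_table[x]                        # item becomes unlocked

print(lock_item("A", "T1"))   # True  -- T1 locks A
print(lock_item("A", "T2"))   # False -- T2 must wait
unlock_item("A", "T1")
print(lock_item("A", "T2"))   # True  -- now T2 can lock A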

Shared/exclusive locks
Shared/exclusive locks are the locks, which allow multiple transactions to
read the same data item concurrently but only one transaction to write on a data
item at a time. Shared/exclusive locks are also called read/write locks. In
Shared/exclusive locks, a transaction can issue three operations : read_lock(X),
write_lock(X) and unlock(X). A read locked item is called shared locked and
a write locked item is called exclusive locked. When the database system
repeatedly denies the request of a transaction for the exclusive lock of a data
item, it is called livelock. Livelock occurs when various transactions request
shared lock on the same data item on which a transaction is trying to have an
exclusive lock.
To implement the shared/exclusive lock, the lock table includes the following
fields :
- Data item name
- The LOCK variable
- Locking transaction
- No_of_reads
A transaction follows the following rules if Shared/exclusive lock is
implemented:
- A transaction should issue the read_lock(X) or write_lock(X) operation
before performing a read_item(X) operation on a data item.
- A transaction should issue the write_lock(X) operation before any
write_item(X) operation is performed on a data item.
- A transaction cannot issue the read_lock(X) operation if it already holds
a shared or exclusive lock on a data item.
- A transaction should issue the unlock(X) operation after performing all
read_item(X) and write_item(X) operations on a data item.
- A transaction cannot issue the write_lock(X) operation if it already holds
a shared or exclusive lock on a data item.
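The following Python sketch is an illustration only, with assumed field names; it extends the idea to shared/exclusive locks. Several transactions may hold a read lock on the same item, tracked by a No_of_reads count, while a write lock is granted only when the item is unlocked.

# Sketch of a shared/exclusive lock table entry:
# item -> {"mode": "read" or "write", "holders": set of txns, "no_of_reads": int}
locks = {}

def read_lock(x, txn):
    entry = locks.get(x)
    if entry is None:                         # unlocked: grant a shared lock
        locks[x] = {"mode": "read", "holders": {txn}, "no_of_reads": 1}
        return True
    if entry["mode"] == "read":               # already shared: add a reader
        entry["holders"].add(txn)
        entry["no_of_reads"] += 1
        return True
    return False                              # write-locked: must wait

def write_lock(x, txn):
    if x in locks:                            # any existing lock blocks a writer
        return False
    locks[x] = {"mode": "write", "holders": {txn}, "no_of_reads": 0}
    return True

print(read_lock("A", "T1"), read_lock("A", "T2"))   # True True -- concurrent readers
print(write_lock("A", "T3"))                        # False -- the writer must wait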

Problems associated with locking


The type of locking implemented in a system can cause different problems
in execution of transactions. The various problems associated with locking are:
- Deadlock
- Starvation
Deadlock : The problem of deadlock occurs when each transaction included
in a set of two or more transactions is waiting for the data item that is locked by
any other transaction of the set.

Deadlock prevention protocol


There are various deadlock prevention protocols, which provide some rules
that you can use to avoid the problem of deadlock. One of the deadlock
prevention protocols states that a transaction should check the availability of all
the required data items before starting the execution. If the data items are not
available, the transaction should wait for a specific period of time and then check
the availability of data items again.
The other deadlock prevention protocols decide what to do when a deadlock
occurs in transactions. According to these protocols, the transactions are either
aborted or made to wait for the data items. These protocols use the timestamp
concurrency control technique and provide the following schemes :
- Wait-die : It specifies that if the timestamp of a T transaction is lower than
the timestamp of the T’ transaction, then T is allowed to wait, otherwise T is System
aborted and restarted.
-Wound-wait : It specifies that if the timestamp of a T transaction is lower NOTES
than the timestamp of T’ transaction, then abort T’ and restart it. If the timestamp
of the T transaction is greater than the timestamp of the T’ transaction, then T’ is
allowed to wait.
The other deadlock prevention protocols include the No Waiting (NW) and
Cautious Waiting(CW) algorithms. In the NW algorithm, if a transaction is unable
to access a data item, the transaction is immediately aborted and restarted after
a specific period of time, without checking whether a deadlock will occur or not.
In the CW algorithm, if there is a transaction Ti waiting for a data item,
which is locked by the Tj transaction, the deadlock prevention protocol checks
whether or not Tj is waiting for any other data item. If Tj is not waiting for any
other data item, Ti is allowed to wait, and if Tj is waiting for a data item the Ti
transaction is aborted.

Dead lock detection


In deadlock detection, the system checks whether or not the problem of
deadlock exists. To detect a deadlock, the system constructs the wait-for-graph.
In this graph, one node is created for each transaction, which is currently
executing. When a transaction Ti is issuing an operation to lock a data item, which
is already locked by transaction Tj, then a directed edge is created from Ti to Tj.
If the data item requested by Ti is released, then the edge is deleted from the
graph. The figure below shows the wait-for-graph for transactions Ti and Tj.

Figure : wait-for-graph

Starvation
Starvation occurs when a transaction is not executed for a long period of
time while other transactions are executing. This problem occurs if some
transactions are assigned higher priority than others. To avoid starvation, a system
can use the first-come-first-serve waiting method, according to which the
transactions can lock the data item in the order in which they request the data
item. The wait-die and wound-wait schemes also help prevent starvation among
transactions. Another way to avoid starvation is to define the priority of
transactions and increase the priority of a transaction which is waiting for a
data item.

The Two-phase locking protocol


A transaction is said to follow the two-phase locking protocol, if the
read_lock and write_lock locking operations precede the first unlock operation
in the transaction. The two-phase locking protocol involves two phases,
expanding and shrinking. During the expanding phase, which is also called the
growing phase, a transaction can acquire new locks on data items without
releasing previously locked data items. In the shrinking phase, a transaction
releases previously locked data items but does not acquire new locks. A
read lock can be upgraded to a write lock during the expanding phase, and a
write lock can be downgraded to a read lock during the shrinking phase.
For example, the transactions T and T' that follow the two-phase locking
protocol appear as shown below.

The two-phase locking protocol is of the following types:

• Conservative: A transaction specifies read_set and write_set to lock all


the data items required by the transaction.
• Strict: A transaction does not release any of its exclusive locks on the
data items until the transaction has committed or aborted.
• Rigorous: A transaction does not release any of its exclusive or shared
locks on the data items until the transaction has committed or aborted.
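A small Python sketch of the basic two-phase rule is shown below; the class name TwoPhaseTxn is an assumption for this example. Once the transaction issues its first unlock (entering the shrinking phase), any further lock request is rejected.

# Sketch of a transaction object that enforces the two-phase rule:
# all lock requests (growing phase) must precede the first unlock (shrinking phase).
class TwoPhaseTxn:
    def __init__(self, name):
        self.name = name
        self.held = set()
        self.shrinking = False        # becomes True after the first unlock

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violated: lock requested after an unlock")
        self.held.add(item)

    def unlock(self, item):
        self.shrinking = True         # entering the shrinking phase
        self.held.discard(item)

t = TwoPhaseTxn("T")
t.lock("A"); t.lock("B")              # growing phase
t.unlock("A")                         # shrinking phase begins
try:
    t.lock("C")                       # not allowed under 2PL
except RuntimeError as e:
    print(e)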

5.7.2 Timestamps
A time stamp is a unique identifier that database system creates to identify
transactions, when transactions are submitted. It is represented as
TS(<transactionname>). Timestamps support the Timestamp Ordering (TO)
scheme in which a schedule has transactions in order of their timestamp values
and the schedule is serializable. The TO scheme associates the following two
timestamp values with each data item:
• Read_TS(X): It refers to the read timestamp of the data item X. This is
the highest timestamp of transactions that have successfully read the data
item X.
• Write_TS(X): It refers to the write timestamp of the data item X. This is
the highest timestamp of transactions that have successfully written the
data item X.

There are two types of TO schemes:


• Basic
• Strict

Basic TO
In the basic TO scheme, when a transaction T issues the read_item(X) or
write_item(X) operation, the basic TO algorithm compares the timestamp of
transaction T with the read_TS(X) and write_TS(X) values to ensure that the
transactions follow the TO scheme. If the transaction does not follow the TO
scheme, then transaction T is aborted and assigned a new timestamp value. If
transaction T is aborted and rolled back, then any transaction T1 which has used
the value changed by transaction T is also rolled back. Similarly, the transaction
T2, which has used the value changed by the transaction T1, is also rolled back.
This process is called cascading rollback. The basic TO algorithm checks the
following conditions:
• Transaction T issues the write_item(X) operation:
- If read_TS(X)>TS(T) or write_TS(X)>TS(T), then abort and
rollback T and reject the operation of T.
- If read_TS(X)≤TS(T) and write_TS(X)≤TS(T), then execute the
write_item(X) operation of T and set write_TS(X) to TS(T).
- Transaction T issues the read_item(X) operation :
- If write_TS(X)>TS(T), then abort and roll back T and reject the
operations of T.
- If write_TS(X)≤TS(T), then execute the read_item(X) operation
of T and set read_TS(X) to the larger of read_TS(X) and TS(T).
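The following Python sketch is illustrative only; the dictionaries read_ts and write_ts are assumptions for this example. It encodes the basic TO checks above: a write is rejected if a younger transaction has already read or written the item, and a read is rejected if a younger transaction has already written it.

# Sketch of the basic timestamp-ordering checks; timestamps are plain integers.
read_ts = {"X": 0}
write_ts = {"X": 0}

def to_write(txn_ts, item):
    """Return True if the write is allowed, False if the transaction must abort."""
    if read_ts[item] > txn_ts or write_ts[item] > txn_ts:
        return False                       # a younger transaction already used the item
    write_ts[item] = txn_ts                # accept the write and advance write_TS
    return True

def to_read(txn_ts, item):
    """Return True if the read is allowed, False if the transaction must abort."""
    if write_ts[item] > txn_ts:
        return False                       # the value was written by a younger transaction
    read_ts[item] = max(read_ts[item], txn_ts)
    return True

print(to_write(5, "X"))   # True  -- T with TS=5 writes X
print(to_read(3, "X"))    # False -- an older T with TS=3 must abort and restart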

Strict TO
Strict TO ensures that the schedules are strict as well as serializable. In
strict TO, the read_item(X) or write_item(X) operations of a transaction T are
postponed until the transaction T’ that has written the value of X has committed.
The transaction T waits for transaction T’ only if TS(T)>TS(T’).

5.8 DEADLOCK

A system is said to be in a state of deadlock if every transaction in the
schedule is waiting for another transaction in the schedule to release the lock on
some data item. Consider transactions from T0 to Tn. In this case transaction
T1 is waiting for T2, T2 is waiting for T3, and so on, and finally
transaction Tn is waiting for transaction T0, in a cyclic manner.
The transactions present in a deadlock are either rolled back or restarted. A
system with a deadlock indicates bad behavior of the system. The rollback of
a transaction may be partial. That is, a transaction may be rolled back to the point
where it obtained a lock whose release resolves the deadlock.
It is not good to have any transaction that causes deadlock. To handle a
deadlock situation, we have two options : one is to prevent the system from
entering a deadlock by a deadlock prevention method, and the other is to recover
from a deadlock, once it occurs, by deadlock detection and deadlock recovery methods.
So to handle deadlocks following methods are used :
1. Deadlock Prevention
2. Deadlock Detection
3. Deadlock Recovery

Deadlock Prevention
- DBMS regularly checks all the transaction operations which are going
to execute, to prevent any deadlock situation in the system. The
checking of all the operations is done to analyze whether the execution
of them leads the system into deadlock or not. If any operation of a
transaction leads the system into a deadlock state, then that transaction is
not allowed to execute.
- There are different methods to prevent deadlock. All methods use the
timestamp of transaction. The timestamp of any transaction Ti is
denoted as TS(Ti).

5.8.1 Wait-Die Scheme


- In the wait-die scheme, if the older transaction requests resources held by
the younger one, then the DBMS allows Ti to wait for the needed resources;
but if the younger transaction requests resources held by the older one, then
the DBMS kills Tj and restarts it later.

- Suppose a transaction Ti requests resources which are already
locked by another transaction Tj. Then in this scheme the DBMS checks
the timestamps of both the transactions Ti and Tj and performs the
following action.
- If TS(Ti) < TS(Tj) (means Ti is the older transaction than Tj) and Ti
requests resources which are hold by Tj, then DBMS allows Ti to wait
until resource is available for execution. That means if a younger
transaction has locked some resource and older transaction needs those
resources so older transaction is waiting for younger transaction to
complete, then the older transaction is allowed to wait till it is available.
- If TS(Ti) < TS(Tj) (means Ti is the older transaction than Tj) and Tj
requests resources which are held by Ti, then the DBMS kills Tj and
restarts it later with a random delay but with the same timestamp. That
means if the older transaction has held some resource and younger
transaction waits for the resource held by older one, then younger
transaction is killed and restarted with same timestamp after some
time.

Wound-Wait Scheme
- In this method, if an older transaction requests a resource held by a
younger transaction, then the older transaction forces the younger
transaction to be killed and to release that resource. The younger
transaction is restarted later with the same timestamp.
- But if the younger transaction is requesting a resource which is held
by older one, then DBMS allows younger transaction to wait until the
older releases that resource.
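The two decisions can be summarised in a short Python sketch, purely illustrative, in which a smaller timestamp means an older transaction; it shows what happens when a requesting transaction meets a holder under wait-die and under wound-wait.

# Sketch of the two timestamp-based decisions when transaction `req`
# requests an item held by transaction `hold` (smaller timestamp = older).
def wait_die(req_ts, hold_ts):
    return "wait" if req_ts < hold_ts else "die (abort requester, restart later)"

def wound_wait(req_ts, hold_ts):
    return "wound (abort holder)" if req_ts < hold_ts else "wait"

print(wait_die(1, 2))     # older requester waits
print(wait_die(2, 1))     # younger requester dies
print(wound_wait(1, 2))   # older requester wounds the younger holder
print(wound_wait(2, 1))   # younger requester waits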

5.9 DEADLOCK DETECTION

Detecting a deadlock situation in advance is always better than aborting
a transaction. A deadlock avoidance mechanism is used to detect a deadlock
situation in advance. An algorithm that examines the state of the system is invoked
periodically to check for the occurrence of deadlock.
In deadlock avoidance “wait-for graph” method is used.
Wait- for Graph
The “wait-for graph” scheme uses a graph (a set of edges and vertices (nodes)) to
detect deadlock in the system.
- A node is created when a transaction enters the system.
- When a transaction Ti requests for a lock on resource X which is held
by some other transaction Tj, a directed edge is created from Ti to Tj.
- This edge is removed when Tj releases resource X, and Ti locks that
resource.
The wait-for graph indicates which transaction is waiting for another
transaction to complete.

A graph with a cycle indicates the occurrence of deadlock in the system.
The wait-for graph in Fig. 5.7 has the following situation :
- Transaction T1 is waiting for transaction T2 and T3
- Transaction T3 is waiting for transaction T2
- Transaction T2 is waiting for transaction T4
Here, the graph has no cycle; hence it is not in deadlock state.
Consider that transaction T4 requests an item held by T3; then a directed
edge from T4 to T3 will be drawn.

Now the graph has a cycle : T2 → T4 → T3 → T2.

Here the transactions T2, T3 and T4 are in a deadlock state.
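A small Python sketch of deadlock detection on such a wait-for graph is given below; it is illustrative only, with the graph kept as a dictionary of adjacency sets and cycle detection done by a depth-first search.

# Deadlock detection on a wait-for graph: an edge Ti -> Tj means Ti is
# waiting for a lock held by Tj; a cycle means the system is deadlocked.
def has_cycle(graph):
    visited, on_stack = set(), set()

    def dfs(node):
        visited.add(node)
        on_stack.add(node)
        for nxt in graph.get(node, ()):
            if nxt in on_stack or (nxt not in visited and dfs(nxt)):
                return True
        on_stack.discard(node)
        return False

    return any(dfs(n) for n in graph if n not in visited)

wait_for = {"T1": {"T2", "T3"}, "T3": {"T2"}, "T2": {"T4"}, "T4": set()}
print(has_cycle(wait_for))          # False -- no deadlock yet

wait_for["T4"] = {"T3"}             # T4 now requests an item held by T3
print(has_cycle(wait_for))          # True  -- T2 -> T4 -> T3 -> T2 is a deadlock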

Deadlock Recovery
When a deadlock is detected by the system, the only solution is recovery
from the deadlock. The best way of recovery is the rollback operation.
In the recovery process, three actions are taken

1. Selection of victim
The first step is to decide which transaction should be rolled back to
recover the system. It is necessary to roll back the transaction which has
the minimum cost. The minimum cost depends upon various factors :
- The data items used by the transaction.
- Time duration for the transaction to complete its assigned task.
- Number of total transactions need to be rolled back.

2. Rollback
After finalizing the transaction to be rolled back, we have to decide up to
which previous point it should be rolled back.
Rollback operation has two options :
- Total rollback : Abort the whole transaction and restart it.
- Partial rollback : Roll back the transaction only as far as necessary. But this
approach requires the system to store the state of all the running transactions.
The order of lock requests/grants and updates performed by the
transaction should be recorded. The deadlock detection mechanism
should decide which locks the selected transaction needs to release to
resolve the deadlock. The selected transaction should be rolled back up
to the point where the first of these locks was obtained. All the operations
after that point should be undone.
The transaction must be capable of resuming execution after the partial
rollback.

3. Starvation
In the deadlock handling mechanism, the selection of the transaction to
roll back (the victim) depends upon the cost factor. Sometimes it is possible that
the same transaction may be selected multiple times in multiple deadlock situations.
In such a case it will be difficult for the victim transaction to complete its execution.
This is called starvation.
Hence it is necessary that a transaction should not be selected as a victim
too many times. To achieve this, the number of rollbacks of every
transaction should be included in the cost factor.

Summary
Transaction processing is based upon a storage model in which main
memory holds a log buffer, a database buffer, and a system buffer. The system
buffer holds pages of system object code and local workspaces of transactions.
Secondary storage includes the complete object code for the system, swap space
for virtual memory, and the database itself. Stable storage that must be accessible
online is simulated using mirrored disks which provide redundant data storage.
Offline, or archival, stable storage may consist of multiple tape copies of data
stored in a physically secure location.
Transaction failure may lead to cascading rollback of transactions that have
read data written by a failed transaction. If cascading rollback results in the abort
of an already committed transaction, then the schedule is said to be non
recoverable. Recoverable schedules can be ensured, and cascading rollback
avoided, under two-phase locking by requiring that locks be held by a transaction
until it either commits or aborts.
Various locking protocols, including the multiple granularity locking
scheme, do not guard against deadlocks. One way to prevent deadlock is to use
preemption and transaction rollbacks. To control the preemption, we assign a
unique timestamp to each transaction. These timestamps are used to decide
whether a transaction should wait or roll back. If transaction is rolled back, it
retains its old timestamp when restarted. The wait-die and wound-wait schemes
are tow preemptive schemes. Another method for dealing with deadlock is to
use a deadlock detection an d recovery scheme. To do so, a wait for graph is
constructed. A system is in a deadlock state if and only if the wait for graph
contains a cycle. When a detection algorithm determines that a deadlock exists,
the system must recover from the deadlock. This is accomplished by rolling back
one or more transactions in order to break the deadlock.

Exercises
1. Suppose that the system crashes during the time it is recovering from
a prior crash. When the system again recovers, what action must be
taken?
2. If deadlock is avoided, is starvation still possible? Explain.
3. Explain cascading rollback.
4. When the system recovers from a crash, in what order must
transactions be undone and redone? Why is this order important?

*****


UNIT - 6
DATABASE RECOVERY AND
SECURITY MANAGEMENT

Structure
6.0 Introduction
6.1 Unit Objectives
6.2 Database Recovery and security Management: Database Recovery,
6.3 Types of Failures, and Data access.
6.4 Recovery and atomicity
6.5 Recovery Techniques Algorithms: Log Based Recovery, Check points,

6.0 INTRODUCTION

This unit introduces you to recovery mechanisms. Recovery mechanisms
perform the recovery process, which helps restore data that is damaged
due to failures caused during the execution of a transaction. Recovery control
techniques help recover a database from catastrophic failures as well as non-
catastrophic failures.

6.1 UNIT OBJECTIVES

- Learn in detail about the different recovery mechanisms.


- Explain the utility of system logs and the shadow paging technique.

6.2 RECOVERY METHODS

Failure is a routine situation for any system. There are various causes of
failure.
1. Transaction failure : A transaction may fail mainly because of two
reasons.
a. Logical error : There are some internal conditions like invalid
input, data not found, overflow or resource limit exceeded,
because of which the transaction fails and is unable to continue its
execution.
b. System error : Because of cyclic dependency of transactions on
each other, the system may enter an undesirable state like
deadlock, which breaks the execution of the transaction.
2. System crash : Because of a hardware malfunction or a bug in the
database software or the operating system, the system may halt. The
loss of the contents of volatile storage is possible in this crash.
3. Disk failure : Failure during data transfer or a head crash may cause
the loss of data on a disk.
Whatever is the reason of failure, it leads to data loss and inconsistency in
the database system, hence recovery is very important.
For DBMS recovery there are two types of techniques which maintain the
atomicity of transactions, as follows :
1. Log based Recovery
2. Shadow paging

Log Based Recovery


Log
- Log is a data structure which is used to store all modifications
performed on the database.
- Log contains sequence of log records which keeps track of update
activities performed on the database.
- Log is an essential part of recovery system to handle the failures.
- So it is necessary to store the logs in storage where data will not be
lost due to any reason. Hence the log records are stored on stable storage.

Log-based Recovery
If any failure occurs in the system, with the help of logs, system can be
recovered. This type of recovery is called as Log-based Recovery.
- The Log-based Recovery works in following manner
The log file is placed on a stable storage media.
When a transaction enters the system and starts execution, it writes a log
about it.
<Tn, Start>
When the transaction updates an item X, it writes logs as follows
<Tn, X, V1, V2>
This records that the value of X is changed by Tn from V1 to V2.
When the transaction finishes, it logs
<Tn, commit>
- There are two techniques for using log to achieve recovery and ensure
atomicity in case of failures.
- Deferred database modification
- This technique delays the database modifications until the transaction
commits.
- It stores all the modifications done by a transaction in the
database only when that transaction has committed successfully.
- Immediate database modification
- The immediate database modification technique allows storing each
modification done by a transaction into the database as soon as it is done.
- In this technique database modification operation doesn’t wait for
commit operation to be performed by transaction.
- The system allows modifying the database when transaction
processing is performed and not committed yet. Database
modification performed by active transaction is known as
uncommitted modification.
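The following Python sketch is illustrative only; log records are kept as simple tuples in a list. It shows how the log records described above support undoing an uncommitted transaction under immediate database modification by restoring the old values in reverse order.

# Sketch of log-based recovery records kept as tuples in an append-only list.
log = []
db = {"X": 100}

def start(txn):
    log.append(("start", txn))

def write(txn, item, new_value):
    log.append(("update", txn, item, db[item], new_value))  # <Tn, X, V1, V2>
    db[item] = new_value        # immediate modification: write before commit

def commit(txn):
    log.append(("commit", txn))

def undo(txn):
    """Roll an uncommitted transaction back by restoring old values, newest first."""
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] == txn:
            db[rec[2]] = rec[3]             # restore the old value V1

start("T1")
write("T1", "X", 150)
undo("T1")           # T1 failed before <T1, commit>, so its update is undone
print(db["X"])       # 100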

Checkpoints
- In a DBMS, storing a large number of logs may fill up all the memory
space available in the system. The size of the log may increase
tremendously as time goes on, which makes it difficult to handle. With the
checkpoint mechanism we can remove all previous logs from the
system and store them on a storage disc permanently. Using
Checkpoint a point can be declared before which the DBMS was in
consistent state and all the transactions were committed.
- In the immediate database modification recovery technique, some
transactions are required to be redone and some are required to be undone
by searching the entire log. Searching the entire log is a time
consuming process. To overcome this problem checkpoints are used.
- When a system with concurrent transactions crashes and recovers,
it behaves in the following manner.


- During execution, along with keeping the log using the immediate database
modification or deferred database modification technique,
checkpoints are periodically performed by the system, which perform the
following sequence of steps :
- All the log records currently residing in main memory are stored
permanently on the disk.
- All buffer blocks which are modified currently are written to the disks
permanently.
- A log record <checkpoint> also written onto the disk permanently.
- While the checkpoint is being processed, transactions are not allowed to execute.
- One advantage of maintaining a checkpoint is that, if in a set of
transactions <Ti,…,Tn> the most recent checkpoint was taken at
transaction Tj, where i<j and j<n, i.e. Tj is in between Ti and Tn,
- then for recovery purposes keeping log records only of the transactions
performed between Tj and Tn is sufficient. Deleting the unnecessary
log records written before the checkpoint at Tj is allowed.
- This means that the transaction recovery can start from the checkpoint
log record and not from the first record in the log. So it reduces the recovery
time (discussed in detail in the last part of the unit).

Shadow Paging
- The shadow paging does not require the use of a log in a single user
environment. In this scheme database is considered to be made up of
a number of fixed size disk pages or blocks for recovery purpose.
- A directory with n entries is created. In it the ith entry points to the ith
database page on disk. This directory is kept in main memory. It
records references of all reads or writes to database pages. When
execution of a transaction starts, we copy the current directory, whose
entries point to the most recent database pages on disk, into a shadow
directory.
- The current directory is used by the transaction and the shadow
directory is saved on disk.

- While executing the transaction, no changes are made in the shadow
directory. If a ‘write’ operation is performed, then a new copy of the
modified database page is created. Both new and old copies are kept.
That means for pages which are updated during the transaction, old
and new, both versions are kept. The shadow directory refers to the old
version while the current directory refers to the new version.
- In case of failure, for recovery purpose the modified database pages
are freed and current directory is discarded. The shadow directory
provides the state of the database before transaction execution. The
state is recovered by restoring the shadow directory. In this way the
database returned to its state prior to the transaction that was executing
when the crash occurred and any modified pages are discarded.
- This technique can be categorized as a NO-UNDO/NO-REDO
technique for recovery of database because for recovery, neither
undoing nor redoing data items is performed.
- In case of multiuser system where transactions are performed
concurrently, logs and checkpoints are used in the shadow paging
technique.
- One disadvantage of this technique is that the database pages which
are updated change their location on disk. Because of this, it is difficult to
keep related database pages close together on disk. It requires complex
storage management strategies to manage those pages.

6.3 TYPES OF FAILURES AND DATA ACCESS

A database management system is susceptible to a number of failures. In
this unit we will study the failure types and commit protocols. In a distributed
database system, failures can be broadly categorized into soft failures, hard
failures and network failures.

Soft Failure
Soft failure is the type of failure that causes the loss in volatile memory of
the computer and not in the persistent storage. Here, the information stored in
the non-persistent storage like main memory, buffers, caches or registers, is lost.
They are also known as system crash. The various types of soft failures are as
follows −
• Operating system failure.
• Main memory crash.
• Transaction failure or abortion.
• System generated error like integer overflow or divide-by-zero error.
• Failure of supporting software.
• Power failure.

Hard Failure
A hard failure is the type of failure that causes loss of data in the persistent
or non-volatile storage like disk. Disk failure may cause corruption of data in
some disk blocks or failure of the total disk. The causes of a hard failure are −
• Power failure.
• Faults in media.
• Read-write malfunction.
• Corruption of information on the disk.
• Read/write head crash of disk.
Recovery from disk failures can be short, if there is a new, formatted, and
ready-to-use disk on reserve. Otherwise, duration includes the time it takes to
get a purchase order, buy the disk, and prepare it.

Network Failure
Network failures are prevalent in distributed or network databases. These
comprise the errors induced in the database system due to the distributed
nature of the data and the transfer of data over the network. The causes of network
failure are as follows −
• Communication link failure.
• Network congestion.
• Information corruption during transfer.
• Site failures.
• Network partitioning.

Commit Protocols
Any database system should guarantee that the desirable properties of a
transaction are maintained even after failures. If a failure occurs during the
execution of a transaction, it may happen that all the changes brought about by
the transaction are not committed. This makes the database inconsistent. Commit
protocols prevent this scenario using either transaction undo (rollback) or
transaction redo (roll forward).
Commit Point
The point of time at which the decision is made whether to commit or abort
a transaction, is known as commit point. Following are the properties of a
commit point.
• It is a point of time when the database is consistent.
• At this point, the modifications brought about by the transaction can be seen
by the other transactions. All transactions can have a consistent view of
the database.
• At this point, all the operations of transaction have been successfully
executed and their effects have been recorded in transaction log.
• At this point, a transaction can be safely undone, if required.
• At this point, a transaction releases all the locks held by it.

Transaction Undo
The process of undoing all the changes made to a database by a transaction
is called transaction undo or transaction rollback. This is mostly applied in case
of soft failure.

Transaction Redo
The process of reapplying the changes made to a database by a transaction
is called transaction redo or transaction roll forward. This is mostly applied for
recovery from a hard failure.

Transaction Log
A transaction log is a sequential file that keeps track of transaction
operations on database items. As the log is sequential in nature, it is processed
sequentially either from the beginning or from the end.
Purposes of a transaction log −
• To support commit protocols to commit or support transactions.
• To aid database recovery after failure.
A transaction log is usually kept on the disk, so that it is not affected by soft
failures. Additionally, the log is periodically backed up to an archival storage like
magnetic tape to protect it from disk failures as well.

Lists in Transaction Logs


The transaction log maintains five types of lists depending upon the status
of the transaction. This list aids the recovery manager to ascertain the status of a
transaction. The status and the corresponding lists are as follows −
• A transaction that has a transaction start record and a transaction commit
record is a committed transaction – maintained in the commit list.
• A transaction that has a transaction start record and a transaction failed
record but not a transaction abort record is a failed transaction –
maintained in the failed list.
• A transaction that has a transaction start record and a transaction abort
record is an aborted transaction – maintained in the abort list.
• A transaction that has a transaction start record and a transaction before-
commit record is a before-commit transaction, i.e. a transaction where
all the operations have been executed but not committed – maintained in
before-commit list.
• A transaction that has a transaction start record but no records of before-
commit, commit, abort or failed, is an active transaction – maintained in
active list.

Immediate Update and Deferred Update


Immediate Update and Deferred Update are two methods for maintaining
transaction logs.
In immediate update mode, when a transaction executes, the updates made
by the transaction are written directly onto the disk. The old values and the
updated values are written onto the log before writing to the database on disk. On
commit, the changes made to the disk are made permanent. On rollback, changes
made by the transaction in the database are discarded and the old values are
restored into the database from the old values stored in the log.
In deferred update mode, when a transaction executes, the updates made to
the database by the transaction are recorded in the log file. On commit, the
changes in the log are written onto the disk. On rollback, the changes in the log
are discarded and no changes are applied to the database.
In order to recover from database failure, database management systems
resort to a number of recovery management techniques. In this unit, we will
study the different approaches for database recovery.

The typical strategies for database recovery are


• In case of soft failures that result in inconsistency of database, recovery
strategy includes transaction undo or rollback. However, sometimes,
transaction redo may also be adopted to recover to a consistent state of
the transaction.
• In case of hard failures resulting in extensive damage to database,
recovery strategies encompass restoring a past copy of the database from
archival backup. A more current state of the database is obtained through
redoing operations of committed transactions from transaction log.

6.4 RECOVERY FROM TRANSACTION FAILURES

Recovery from Power Failure


Power failure causes loss of information in the non-persistent memory.
When power is restored, the operating system and the database management
system restart. Recovery manager initiates recovery from the transaction logs.
In case of immediate update mode, the recovery manager takes the
following actions −
• Transactions which are in active list and failed list are undone and written
on the abort list.
• Transactions which are in before-commit list are redone.
• No action is taken for transactions in commit or abort lists.
In case of deferred update mode, the recovery manager takes the following
actions −
• Transactions which are in the active list and failed list are written onto
the abort list. No undo operations are required since the changes have
not been written to the disk yet.
• Transactions which are in before-commit list are redone.
• No action is taken for transactions in commit or abort lists.

Recovery from Disk Failure


A disk failure or hard crash causes a total database loss. To recover from
this hard crash, a new disk is prepared, then the operating system is restored, and
finally the database is recovered using the database backup and transaction log.
The recovery method is same for both immediate and deferred update modes.

The recovery manager takes the following actions


• The transactions in the commit list and before-commit list are redone and
written onto the commit list in the transaction log.
• The transactions in the active list and failed list are undone and written
onto the abort list in the transaction log.
If a transaction T fails, for whatever reason, we may use the techniques to
undo the changes made to the database by T. In a system that allows concurrent
execution, it is necessary also to ensure that any concurrent transaction T’ that is
dependent on T (that is, T’ reads data written by T) is also aborted.

Cascading Rollback
In order to recover correctly from the failure of a transaction T, it may be
necessary to roll back several transactions. Such situations occur if transactions
have read data written by T.
To illustrate this, consider the partial schedule of figure 6.1. Transaction T1
writes the value of A that is read by transaction T2. Transaction T2 writes the
value of A that is read by transaction T3. Suppose that, at this point, T1 fails.
T1 must be rolled back. Since T2 is dependent on T1, T2 must be rolled back.
Since T3 is dependent on T2, T3 must be rolled back.
Figure 6.1 Partial schedule

This phenomenon, in which a single transaction failure leads to a series of
transaction rollbacks, is called cascading rollback.
Cascading rollback may occur under two phase locking. Figure 6.2 shows
the partial schedule of figure 6.1 with lock steps inserted. Notice that each
transaction observes the two phase locking protocol, but the failure of T1 after
the read(A) step of T3 leads to cascading rollback of T2 and T3.
The timestamp-based protocol may result in cascading rollback. To
illustrate, consider again the partial schedule of figure 6.1. If the timestamps of
transactions T1, T2 and T3 are such that TS(T1)<TS(T2)<TS(T3), then this partial
schedule is possible under the timestamp protocol. However, the validation
protocol automatically guards against cascading rollbacks since the actual write
of a data value by a transaction takes place only after the transaction issuing the
write has committed.
Cascading rollback is undesirable since it leads to the undoing of a
significant amount of work. It is useful to design protocols for transaction
processing that do not have the potential of cascading rollback.

Figure 6.2 Partial schedule under two-phase locking.


Recoverable Schedules
Consider the schedule of Figure 6.3 in which T5 is a transaction that
performs only one instruction, read(A). Suppose that the system allows T5 to
commit immediately after executing the read(A) instruction. Thus T5 commits
before T4. Now suppose that T4 fails before it commits. Since T5 has read the
value of data item A written by T4, it is necessary to abort T5 to ensure transaction
atomicity. However, T5 has already committed and cannot be aborted.
Figure 6.3 Sample schedule

Thus, we have a situation where it is impossible to recover correctly from
the failure of T4. Clearly, transaction processing systems must ensure that it is
possible to recover from the failure of any active transaction. Thus, non
recoverable scenarios, such as the one described above, must be avoided.
The validation protocol ensures recoverability and avoids cascading
rollbacks by ensuring that uncommitted writes cannot be read.
The timestamp protocol can be modified to prevent non-recoverable executions and to avoid cascading rollbacks as follows: Associate a commit bit
bi with each transaction Ti. Initially bi is false. When Ti commits, bi is set to
true. A transaction Tj that attempts to read a data item Q must satisfy the
requirements of the timestamp protocol. If it does, then the commit bit of the
transaction that was the last to write Q is checked. If this bit is true, the read by
Tj is allowed. Otherwise Tj must wait until the bit is set to true.
This modification to the timestamp protocol introduces waiting, but deadlock is not possible. Observe that a transaction may wait only for a transaction with an earlier timestamp. Thus, a cycle of waits is not possible.
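As an illustration, here is a minimal sketch in Python of the commit-bit bookkeeping described above. The class name and its methods are assumptions made only for this example: a reader that has already satisfied the usual timestamp checks simply blocks until the last writer of the item has set its commit bit.

    import threading

    class CommitBitTable:
        """Tracks the commit bit bi of each transaction and the last writer of each item."""

        def __init__(self):
            self._commit_bit = {}      # transaction id -> bi (False until commit)
            self._last_writer = {}     # data item -> id of the transaction that last wrote it
            self._cv = threading.Condition()

        def start(self, tid):
            with self._cv:
                self._commit_bit[tid] = False          # bi is initially false

        def record_write(self, tid, item):
            with self._cv:
                self._last_writer[item] = tid

        def commit(self, tid):
            with self._cv:
                self._commit_bit[tid] = True           # bi is set to true when Ti commits
                self._cv.notify_all()                  # wake any readers waiting on this writer

        def wait_until_readable(self, item):
            # Called after the timestamp checks have succeeded: block until the
            # commit bit of the last writer of `item` is true.
            with self._cv:
                while True:
                    writer = self._last_writer.get(item)
                    if writer is None or self._commit_bit.get(writer, True):
                        return
                    self._cv.wait()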
Under two-phase locking, cascading rollback is avoided by imposing an additional requirement that all exclusive locks taken by a transaction T must be held until T commits. This ensures that any data written by an uncommitted transaction is locked in exclusive mode, preventing any other transaction from reading the data.
This requirement is more stringent than the commit bit scheme we used for
the timestamp protocol since it restricts write instructions as well as read
instructions.

6.5 RECOVERY TECHNIQUES ALGORITHMS

Log Scanning
We use checkpoints to reduce the number of log records that must be scanned when the system recovers from a crash. It is necessary to consider only the following transactions during recovery:
1. Those transactions that started after the last checkpoint.
2. The one transaction, if any, that was active at the time of the last checkpoint.
The situation is more complex when transactions may execute concurrently, since several transactions may have been active at the time of the last checkpoint. In a concurrent transaction processing system, we require that the checkpoint log record be of the form <checkpoint L>, where L is a list of the transactions active at the time of the checkpoint. When the system recovers from a crash, it constructs two lists: the undo-list consists of transactions to be undone, and the redo-list consists of transactions to be redone.
These two lists are constructed on recovery as follows. Initially they are
both empty. We scan the log backward, examining each record, until the first
<checkpoint> record is found:
- For each record found of the form <Ti commits>, we add Ti to the redo-list.
- For each record found of the form <Ti starts>, if Ti is not in the redo-list, then we add Ti to the undo-list.
When all the appropriate log records have been examined, we check the list L. For each transaction Ti in L, if Ti is not in the redo-list, then we add Ti to the undo-list.
Once the redo-list and undo-list have been constructed, the recovery proceeds as follows:
1. Re-scan the log from the most recent record backward and perform undo(Ti) for each Ti on the undo-list.
2. Continue to scan the log backward until the <Ti starts> records for all Ti on the redo-list have been located.
3. Scan the log forward and perform redo(Ti) for each Ti on the redo-list.
It is important in step 1 to process the log backward to ensure that the
resulting state of the database is correct. To illustrate, consider the pair of log
records
<Ti, A, 10, 20>
<Tj, A, 20, 30>
which represent a modification of data item A by Ti followed by a modification of A by Tj. If Ti and Tj are both on the undo-list, then A should be restored to the value 10. This can be achieved if the log is processed backward, since A is set first to 20 and then to 10. If the log were processed in the forward direction, the result would be incorrect.
After all transactions on the undo-list have been undone, those transactions
on the redo-list are redone. It is important in this case to process the log forward.
When the recovery process has completed, transaction processing resumes.
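As an illustration, the whole procedure can be sketched in a few lines of Python. The log-record layout used here (tuples such as ("start", T), ("commit", T), ("update", T, item, old, new) and ("checkpoint", L)) and the dictionary db standing in for the database are assumptions made only for this example; they are not the record format of any particular system.

    def recover(log, db):
        redo_list, undo_list = set(), set()

        # Scan the log backward until the most recent <checkpoint L> record.
        active_at_checkpoint = []
        for rec in reversed(log):
            if rec[0] == "commit":
                redo_list.add(rec[1])
            elif rec[0] == "start" and rec[1] not in redo_list:
                undo_list.add(rec[1])
            elif rec[0] == "checkpoint":
                active_at_checkpoint = rec[1]      # the list L
                break

        # Any transaction in L that has not committed must also be undone.
        for t in active_at_checkpoint:
            if t not in redo_list:
                undo_list.add(t)

        # Step 1: process the log backward, restoring old values for the undo-list.
        for rec in reversed(log):
            if rec[0] == "update" and rec[1] in undo_list:
                db[rec[2]] = rec[3]                # before-image

        # Steps 2 and 3: process the log forward, re-applying new values for the
        # redo-list (a real system would begin the forward scan at the earliest
        # <Ti starts> record rather than at the beginning of the log).
        for rec in log:
            if rec[0] == "update" and rec[1] in redo_list:
                db[rec[2]] = rec[4]                # after-image

With the pair of records <Ti, A, 10, 20> and <Tj, A, 20, 30> above, the backward pass restores A first to 20 and then to 10, which is exactly why the direction of the scan matters.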

Checkpointing
A checkpoint is a point in time at which the modified data held in the main-memory buffers is written onto the database on disk and a checkpoint record is written to the log. As a consequence, in case of a system crash, the recovery manager does not have to redo the transactions that were committed before the checkpoint. Periodic checkpointing shortens the recovery process.
The two types of checkpointing techniques are −
• Consistent checkpointing
• Fuzzy checkpointing

Consistent Checkpointing
Consistent checkpointing creates a consistent image of the database at the checkpoint. During recovery, only those transactions which lie after (to the right of) the last checkpoint on the timeline are undone or redone. The transactions before (to the left of) the last consistent checkpoint are already committed and need not be processed again. The actions taken for checkpointing are −
• The active transactions are suspended temporarily.
• All changes in main-memory buffers are written onto the disk.
• A “checkpoint” record is written in the transaction log.
• The transaction log is written to the disk.
• The suspended transactions are resumed.
If, in step 4, the transaction log is archived as well, then this checkpointing aids in recovery from disk failures and power failures; otherwise it aids recovery from power failures only. A small sketch of these actions follows.
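A minimal sketch of the five actions listed above, assuming txn_manager, buffer_pool and log objects with the obvious methods (none of these belong to a specific DBMS API):

    def consistent_checkpoint(txn_manager, buffer_pool, log):
        txn_manager.suspend_active()     # 1. temporarily suspend the active transactions
        buffer_pool.flush_all()          # 2. write all main-memory buffer changes to disk
        log.append("<checkpoint>")       # 3. write a "checkpoint" record in the transaction log
        log.flush()                      # 4. write (and optionally archive) the log to disk
        txn_manager.resume_active()      # 5. resume the suspended transactions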
Fuzzy Checkpointing
In fuzzy checkpointing, at the time of the checkpoint, all the active transactions are written in the log. In case of a power failure, the recovery manager processes only those transactions that were active during the checkpoint and later. The transactions that were committed before the checkpoint are already written to the disk and hence need not be redone.

Example of Checkpointing
Let us consider a system in which the time of checkpointing is tcheck and the time of the system crash is tfail. Let there be four transactions Ta, Tb, Tc and Td such that −
• Ta commits before checkpoint.
• Tb starts before checkpoint and commits before system crash.
• Tc starts after checkpoint and commits before system crash.
• Td starts after checkpoint and was active at the time of system crash.
The situation is depicted in the following diagram −

The actions that are taken by the recovery manager (sketched in code after the list) are −


• Nothing is done with Ta.
• Transaction redo is performed for Tb and Tc.
• Transaction undo is performed for Td.
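A minimal sketch of this classification in Python, using assumed numeric commit times relative to tcheck and tfail:

    def recovery_action(commit_time, t_check, t_fail):
        if commit_time is not None and commit_time <= t_check:
            return "nothing"    # like Ta: committed before the checkpoint
        if commit_time is not None and commit_time <= t_fail:
            return "redo"       # like Tb and Tc: committed before the crash
        return "undo"           # like Td: still active at the time of the crash

    # Assumed times: t_check = 50, t_fail = 100
    print(recovery_action(40, 50, 100))    # Ta -> nothing
    print(recovery_action(80, 50, 100))    # Tb -> redo
    print(recovery_action(90, 50, 100))    # Tc -> redo
    print(recovery_action(None, 50, 100))  # Td -> undo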

Transaction Recovery Using UNDO / REDO


Transaction recovery is done to eliminate the adverse effects of faulty
transactions rather than to recover from a failure. Faulty transactions include all
transactions that have changed the database into an undesired state and the transactions that have used values written by the faulty transactions.
Transaction recovery in these cases is a two-step process (sketched in code after the two lists below) −
• UNDO all faulty transactions and transactions that may be affected by the faulty transactions.
• REDO all transactions that are not faulty but have been undone due to the faulty transactions.

Steps for the UNDO operation are


• If the faulty transaction has done INSERT, the recovery manager deletes
the data item(s) inserted.
• If the faulty transaction has done DELETE, the recovery manager inserts
the deleted data item(s) from the log.
• If the faulty transaction has done UPDATE, the recovery manager eliminates the updated value by writing the before-update value from the log.

Steps for the REDO operation are


• If the transaction has done INSERT, the recovery manager generates an
insert from the log.
• If the transaction has done DELETE, the recovery manager generates a
delete from the log.
• If the transaction has done UPDATE, the recovery manager generates an
update from the log.
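A minimal sketch of both operations, assuming a log-record layout of (operation, item, before-image, after-image) and a dictionary db standing in for the database:

    def undo(record, db):
        op, item, before, after = record
        if op == "INSERT":
            db.pop(item, None)       # delete the item that the faulty transaction inserted
        elif op == "DELETE":
            db[item] = before        # re-insert the deleted item from the log
        elif op == "UPDATE":
            db[item] = before        # write back the before-update value from the log

    def redo(record, db):
        op, item, before, after = record
        if op == "INSERT":
            db[item] = after         # regenerate the insert from the log
        elif op == "DELETE":
            db.pop(item, None)       # regenerate the delete from the log
        elif op == "UPDATE":
            db[item] = after         # regenerate the update from the log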
In a local database system, for committing a transaction, the transaction manager only has to convey the decision to commit to the recovery manager.
However, in a distributed system, the transaction manager should convey the
decision to commit to all the servers in the various sites where the transaction is
being executed and uniformly enforce the decision. When processing is complete
at each site, it reaches the partially committed transaction state and waits for all
other transactions to reach their partially committed states. When it receives the
message that all the sites are ready to commit, it starts to commit. In a distributed
system, either all sites commit or none of them does.

The different distributed commit protocols are


• One-phase commit
• Two-phase commit
• Three-phase commit

Distributed One-phase Commit


Distributed one-phase commit is the simplest commit protocol. Let us consider that there is a controlling site and a number of slave sites where the transaction is being executed. The steps in distributed commit (sketched in code after the list) are −
• After each slave has locally completed its transaction, it sends a “DONE” message to the controlling site.
• The slaves wait for a “Commit” or “Abort” message from the controlling site. This waiting time is called the window of vulnerability.
• When the controlling site receives a “DONE” message from each slave, it makes a decision to commit or abort. This is called the commit point. Then it sends this message to all the slaves.
• On receiving this message, a slave either commits or aborts and then
sends an acknowledgement message to the controlling site.
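A minimal sketch of the controlling site’s side of this protocol; the slave objects and their wait_for/send methods are placeholders assumed for the example, not a real messaging API.

    def one_phase_commit(slaves, decide):
        for s in slaves:
            s.wait_for("DONE")        # every slave has completed locally
        decision = decide()           # the commit point: "Commit" or "Abort"
        for s in slaves:
            s.send(decision)          # broadcast the decision to all slaves
        for s in slaves:
            s.wait_for("ACK")         # each slave acknowledges the decision
        return decision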

Distributed Two-phase Commit


Distributed two-phase commit reduces the vulnerability of one-phase commit protocols. The steps performed in the two phases are as follows (a sketch of the controlling site’s logic appears after the two phases) −

Phase 1: Prepare Phase


• After each slave has locally completed its transaction, it sends a “DONE” message to the controlling site. When the controlling site has received a “DONE” message from all slaves, it sends a “Prepare” message to the slaves.
• The slaves vote on whether they still want to commit or not. If a slave
wants to commit, it sends a “Ready” message.
• A slave that does not want to commit sends a “Not Ready” message. This
may happen when the slave has conflicting concurrent transactions or
there is a timeout.

Phase 2: Commit/Abort Phase


• After the controlling site has received a “Ready” message from all the slaves −
o The controlling site sends a “Global Commit” message to the slaves.
o The slaves apply the transaction and send a “Commit ACK” message to the controlling site.
o When the controlling site receives a “Commit ACK” message from all the slaves, it considers the transaction as committed.
• After the controlling site has received the first “Not Ready” message from
any slave −
o The controlling site sends a “Global Abort” message to the slaves.
o The slaves abort the transaction and send an “Abort ACK” message to the controlling site.
o When the controlling site receives an “Abort ACK” message from all the slaves, it considers the transaction as aborted.
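A minimal sketch of the controlling site’s side of two-phase commit, again with assumed slave objects and message-passing helpers (wait_for, request, send):

    def two_phase_commit(slaves):
        # Phase 1: prepare
        for s in slaves:
            s.wait_for("DONE")                          # each slave finished locally
        votes = [s.request("Prepare") for s in slaves]  # each slave answers "Ready" or "Not Ready"

        # Phase 2: commit / abort
        if all(v == "Ready" for v in votes):
            for s in slaves:
                s.send("Global Commit")
            for s in slaves:
                s.wait_for("Commit ACK")
            return "committed"
        for s in slaves:
            s.send("Global Abort")
        for s in slaves:
            s.wait_for("Abort ACK")
        return "aborted"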

Distributed Three-phase Commit NOTES


The steps in distributed three-phase commit are as follows −

Phase 1: Prepare Phase


The steps are the same as in distributed two-phase commit.

Phase 2: Prepare to Commit Phase


• The controlling site issues an “Enter Prepared State” broadcast message.
• The slave sites vote “OK” in response.

Phase 3: Commit / Abort Phase


The steps are the same as in two-phase commit, except that the “Commit ACK”/“Abort ACK” message is not required.

Summary
A computer system, like any other mechanical or electrical device, is
subject to failure. There are a variety of causes of such failure, including disk
crash, power failure, and software errors. In each of these cases, information
concerning the database system is lost. An integral part of a database system is
a recovery scheme which is responsible for the detection of failure and the
restoration of the database to a state that existed prior to the occurrence of the
failure.
The various types of failures that may occur in a system must each be dealt with in a different manner. The simplest type of failure to deal with is one which does
not result in the loss of information in the system. The ones that are more difficult
to deal with are those that do result in loss of information.
In case of failure, the state of the database system may no longer be
consistent; that is, it may not reflect a state of the world that the database is supposed to capture. To preserve consistency, we require that each transaction be atomic; that is, either all the instructions associated with it are executed to
completion, or none are performed. It is the responsibility of the recovery scheme
to ensure the atomicity property.

There are three different schemes for ensuring atomicity.


- Log with deferred modifications. During the execution of a transaction
all the write operations are deferred until the transaction partially
commits. All updates are recorded on the log, which must be kept in
stable storage. When a transaction partially commits, the information
on the log associated with the transaction is used in executing the deferred writes. If the system crashes before the transaction completes its execution, or if the transaction aborts, then the information on the log is simply ignored. Using the log, the system can handle any failure which does not result in the loss of information on nonvolatile storage. After a failure has occurred, the recovery scheme consults the log and
each committed transaction is redone using the redo operation. To
reduce the overhead of searching the log and redoing transactions, the
checkpointing technique can be used.
- Log with immediate modifications. All updates are applied directly
to the database and a log of all the changes to the system state is kept
in stable storage. If a crash occurs, the information in the log is used
in restoring the state of the system to a previous consistent state. This
is accomplished by using the undo and redo operations. A log can be
used to handle any failure which does not result in the loss of
information on nonvolatile storage, and the checkpointing technique
can be used to reduce the overhead of searching the log and performing
undo and redo operations.
- Shadow paging. Two page tables are maintained during the life of a
transaction: the current page table and the shadow page table. When
the transaction starts, both page tables are identical. The shadow page table is never changed during the life of the transaction; only the current page table is changed when the transaction performs a write operation. All input and output operations use the current page table to locate database pages on disk. When the transaction partially commits, the shadow page table is discarded and the current page table becomes the new page table. If the transaction aborts, the current page table is simply discarded.
Efficient implementation of a recovery scheme requires that the
number of writes to the database and to stable storage be minimized.
Log records may be kept in volatile storage initially, but must be
written to stable storage when one of the following conditions occurs:
- Before the <Ti commit> log record may be output to stable storage,
all log records pertaining to transaction Ti must have been output to
stable storage.
- Before a block of data in main memory is output to the database (in nonvolatile storage), all log records pertaining to data in that block must have been output to stable storage.
In order to recover from failures that result in the loss of nonvolatile storage,
the entire contents of the database need to be dumped onto stable storage
periodically, say once a day. If a failure occurs that results in the loss of physical
database blocks, the most recent dump is used in restoring the database to a
previous consistent state. Once this has been accomplished, the log is used to
bring the database system to the most recent consistent state.
Exercise
1. Explain the purpose of the checkpoint mechanism. How often should
a database management system do a checkpoint?
2. Explain the recovery procedure that needs to take place after a disk
crash.
3. How often should checkpoints be performed? How does the frequency
of checkpoints affect:
- system performance when no failure occurs?
- the time it takes to recover from a system crash?
- the time it takes to recover from a disk crash?
4. Compare the shadow paging recovery scheme with the log-based
recovery schemes in terms of ease of implementation and
overhead cost.

*****

