BCA 202 Database Management System
Data Modelling
2.0 Introduction
2.1 Unit Objectives
2.2 Data Modeling: Data Models,
2.3 Logical Data Modeling:
2.3.1 Hierarchical Data Model
2.3.2 Network Data Model
2.3.3 Relational Data Model
Advantages and Disadvantages of Logical Data Modeling.
2.4 Conceptual Data Modeling
2.4.1 Entity Relationship Model
2.4.2 Entities
2.4.3 Attributes
2.4.4 Types of Attributes
2.4.5 Relationships
2.4.6 Degree of relationship Set
2.5 Mapping Cardinalities, Keys
2.5.1 ER Diagram Notations
2.6 Case studies on ERD.
Normalization
3.0 Introduction
3.1 Unit Objectives
3.2 Normalization: Keys: Composite, Candidate, Primary, Secondary,
Foreign, Super key
3.3 CODD’s Rules, Mapping conceptual model into Relational Model.
3.4 Functional Dependencies
3.5 Decomposition
3.6 Lossy and Lossless Decomposition, Dependency Preserving
Decomposition
3.7 Advantages and Disadvantages of Normalization
3.8 Normal Forms (1NF, 2NF, 3NF,) Case Studies on Normalization.
UNIT - 1
INTRODUCTION OF DATABASE MANAGEMENT SYSTEM
Structure
1.0 Introduction
1.1 Difference between Data, information, Data processing and Data
Management.
1.2 File Oriented Approach, Database oriented approach to Data
Management
1.3 Database schema and instances
1.4 Database users
1.5 3 tier architecture of DBMS and Data Independence
1.6 Types of Database System
1.7 Database Languages
1.8 DBMS interfaces.
Unit objectives
After going through this unit, you will be able to :
- Define the concepts of data, information and data management
- Describe the different approaches to data management
- Understand database schema and instances
- Comprehend the 3-tier architecture of DBMS and data independence
- Identify the different database users
- Describe the different types of database systems
- Understand the different database languages
- Explain DBMS interfaces.
1.0 INTRODUCTION
In this unit, you will learn about data, information, data processing and data management.
Organizations cannot make good and effective decisions at the proper moments if data is invalid or incorrect. You will also learn about the different database users and get an overview of database systems.
1.1 DATA
Data is a collection of raw facts and figures, while information is data that has been processed into a meaningful form. For example :
- If we write 200, it is data, but if we write Rs. 200, it conveys information.
- At home, a mother keeps a record of how much milk is consumed every day; sometimes there are guests and she requires more milk. At the end of the month she collects the data, processes it and prepares the bill to be given to the milkman, i.e., information.
- Suppose a vacancy has been created in a college for the post of professor, and the eligibility condition is that the candidate should have a Ph.D. degree. Several applicants apply for this post. All these applications will initially be considered data, because there may be some applications whose eligibility is not fulfilled. Only after shortlisting as per the requirement will the selected applications be considered information.
Data is unprocessed material while information is processed material.
Data Management
Data management is the process of developing data architectures together with the practices and procedures for dealing with data on a regular basis. It basically specifies the way data is to be stored in a structured manner and the procedures to be followed for handling it regularly. The terms related to data management are data modeling, data warehousing, data movement, database administration and data mining.
File-based Systems
Conventionally, before the database systems were introduced, data in
software systems was maintained using files. In this approach, the needed data
is stored locally and programs are developed for each type of application. The
following terms are related to a file processing system:
- Data is provided to the computer in terms of bytes.
- Data item is the smallest named unit of data that has meaning in the real world, e.g., employee name, address and employee code. Traditionally, the data item is called a field in data processing and is the smallest unit of data that has meaning for its users. A record is a set of related data items (fields) organized in a well-defined structure and treated as a unit by an application program. The structure of the record depends on:
- Items to be included, their characteristics, variability, prevalence in
file
- The intended uses of the information, time requirements, overlapping
or conflicting requirements.
- Update characteristics, e.g., frequency, growth, deletion and modification, time constraints, etc.
- Links to the other records/files/systems
- Computing environment, like operating systems, languages, hardware, etc.
- File is a collection of all occurrences of similar types of records, i.e., it is a group of records of a single type.
In a file-based system, an organization stores information as groups of records in separate files. These file-based systems consist of a few data files and many application programs. Data files are organized to facilitate access to records and ensure their efficient storage. In a file-based system, there are two types of records: logical records and physical records. A logical record is a collection of related data items treated as a conceptual unit, independent of how or where the information is stored. A physical record is a contiguous area of storage space defined by the characteristics of the storage devices and operating systems, and includes all the bytes which will be read/written by a single command to the I/O device. A file processing system relies on the basic access methods of the host operating system for data management.
In a file processing system, the application program can only make a request for data from the operating system by demanding a specific logical record. The relationship between physical and logical records, and the location of the physical records in a specific file, are kept track of by the operating system. Locating fields within the logical records is an activity taken care of by the application program.
In a file processing system, a program takes less time to execute than an equivalent program written in a high-level language, as algorithms for sort, merge and report generation are already built into the file management software. Besides, the cost of software development is less. Such file-based approaches provide improved automated data processing compared with the earlier manual, paper-based record systems. However, in view of the demand for increased efficiency and speed, a file processing system suffers from some significant disadvantages, which are as follows :
- Data redundancy and inconsistency : A major drawback in the traditional file system environment is the non-sharing of data. This means that if different systems of an organization use some common data, then instead of storing it once and sharing it, each system stores the data in separate files. Often, the same information is stored in more than one file. This redundancy of storing the same data in multiple places leads to several problems. Firstly, it leads to wastage of storage space and poses a serious problem if the file processing system has to manage a large amount of data. Secondly, it leads to duplication of effort, as one needs to perform several updates. Thirdly, data inconsistency leads to a number of problems, including loss of information and incorrect results.
- Difficulty in accessing data : A file-based system does not provide users with ad hoc information requests, as most of the information retrieval possibilities are implemented based on pre-determined requests for data. In today’s competitive and fast business environment, other than such regularly scheduled requests, responding to unexpected queries is also indispensable. A new application program often requires an entirely new set of file definitions. Even though an existing file may contain some of the needed data items, the application often requires a number of other data items from the existing file as well as definitions of all new data items, thereby requiring excessive programming effort. Therefore, conventional file processing systems do not let needed data be retrieved conveniently and efficiently.
- Data isolation : Data is scattered in various files, in different formats. In a file processing system, it is difficult to determine relationships between isolated data in order to meet user requirements. To tackle such situations, the files have to be evaluated by analysts and programmers to determine the specific data requirements from each file and the relationships between the data. Then applications have to be written in a third-generation language for processing and extracting the needed data. Imagine the work involved if data from several files were needed.
- Program-data interdependence : In a file-based system, data is organized in the form of files and records. Files and records are described by specific formats and access strategies, which are coded into the application programs by the programmer. Such application programs are data dependent. Consequently, this type of environment requires that appropriate changes be made in all the concerned application programs if there is any change in the data structure. This makes any change very painful for the designers or developers of the system.
- Atomicity problem : In many applications which have already been implemented, it is not easy to ensure that the data is restored to the last consistent state following the detection of a failure.
- Integrity problem : Data integrity means the correctness or accuracy of data. The data values stored in a file system may have to satisfy certain types of constraints. Programmers enforce these constraints in the system by adding appropriate code in the various application programs. However, when new constraints are to be added, it is difficult to change the programs to enforce them without significant effort and time.
- Security problems : Conventional systems develop applications in an ad hoc manner. At times, different components of the operational data are accessed by different parts of a firm. In an environment of this type, it can be quite difficult to ensure or enforce security.
- Concurrent access anomalies : If the system allows multiple users to update the data simultaneously, the interaction of concurrent updates may result in inconsistent data. In the traditional file system, such concurrent access anomalies cannot be avoided without huge programming effort.
Database
Database implies a place where data can be stored in a structured manner. It is a shared collection of logically related data, along with descriptions of that data, designed to meet the information requirements of an organization.
The files contain information such as :
1. The set of data available to the user, the so-called “end user data”. This is the real data, which can be read, connected and modified by the user.
2. The so-called “metadata”. Here, the properties and the relative relations of the end user data are described.
Database Management System (DBMS)
It is a software system that allows users not only to define and create a database but also to maintain it and control access to it. A Database Management System can be called a collection of interrelated data and a set of programs that helps in accessing, updating and managing that data.
Management of the data means specifying how data will be stored,
structured and accessed in the database. This includes the following :
- Defining : Specifying data types and structures, and constraints for
data to be stored.
- Constructing : Storing data in a storage medium.
- Manipulating : Querying , updating and generating reports.
- Sharing : Allowing multiple users and programs to access data
simultaneously.
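As a hedged illustration of these four activities, here is a minimal SQL sketch; the Student table, its columns and the user name are assumed for the example and are not prescribed by the text.

-- Defining : specify data types, structure and constraints
CREATE TABLE Student (
    Student_id   INT PRIMARY KEY,
    Student_name VARCHAR(50) NOT NULL,
    Course       VARCHAR(30)
);

-- Constructing : store data in the database
INSERT INTO Student (Student_id, Student_name, Course)
VALUES (101, 'Prem', 'BCA');

-- Manipulating : query and update the stored data
SELECT Student_name FROM Student WHERE Course = 'BCA';
UPDATE Student SET Course = 'MCA' WHERE Student_id = 101;

-- Sharing : allow another user to read the data
GRANT SELECT ON Student TO clerk_user;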
Management of database users means managing the users in such a way
that they are able to perform any desired operation on the database.
DBMSs available in the market include Access, dBase, FileMaker Pro, ORACLE, DB2, Ingres, Informix, Sybase, etc. A database system is a collection of application programs that interact with the database, along with the DBMS and the database itself. Database systems are designed in a manner that facilitates the management of huge bodies of information.
Data Dictionary
A Data Dictionary can be treated as a special file, which stores the
information about the organization and usage of data contained in the database.
This information is called metadata (Data about Data). It is sometimes termed
as system catalog that documents the data in the database. In a DBMS, the
detailed structure and organization of each file are stored in the system catalog.
System catalog and data dictionary are used interchangeably. A system catalog
is a repository that integrates metadata. A data dictionary is a repository that
manages metadata. It is a part of the system catalog that is generated for each
database. A Data Dictionary can function in a variety of ways, which are as follows :
Active (Integrated) : This is always consistent with the current structure and definition of the database, since it is maintained automatically by the system itself.
Passive (Non-integrated) : This is maintained separately from the DBMS and has to be updated manually, so it may not always be consistent with the current state of the database.
Characteristics of Database
There are several characteristics that distinguish the database approach from the traditional approach of programming with files. In traditional file processing, each user defines and implements the files needed for a specific software application as part of programming the application. Programs written to print a student’s transcript and to enter new grades into the file are part of that application. Another user may keep track of students’ fees and payments. Although both users are interested in data about students, each user maintains separate files and programs to manipulate these files, because each requires some data not available from the other user’s files.
This redundancy in defining and storing data results in wasted storage space and in redundant effort to keep common data up to date. In the database approach, a single database is maintained and then accessed by various users. The labels or names of data items are defined once and used many times by retrieval queries.
DBMSs are very complex systems. To study database concepts and the structure and capabilities of a DBMS, you have to study the architecture of the database management system.
1. Hierarchical Database
Advantage
A hierarchical database can be accessed and updated easily, because in this model the structure is like a tree and the relationships between records are defined in advance.
Disadvantage
The drawback of this type of database structure is that each child in the tree may have only one parent, and relationships or linkages between children are not permitted, even if they make sense from a logical standpoint. Hierarchical databases are also rigid in their design: adding a new field or record requires that the entire database be restructured.
2. Network Database
Network databases are mainly used on large digital computers. Since more connections can be made between different types of data, network databases are considered more efficient.
Data in the network model are represented by collections of records, and relationships among data are represented by links, which can be viewed as pointers. The records in the database are organized as collections of arbitrary graphs.
However, instead of using a single-parent tree hierarchy, the network model uses set theory to provide a tree-like hierarchy, with the exception that child tables are allowed to have more than one parent. It supports many-to-many relationships.
3. Relational Database
In relational databases, the relationship between data files is relational, not hierarchical. Hierarchical and network databases require the user to pass through a hierarchy in order to access the needed data. Relational databases connect the data in different files by using common data values or a key field. Data in relational databases is stored in different tables, each having a key field that uniquely identifies each row. Relational databases are more reliable than either the hierarchical or the network database structures. In relational databases, data and the relationships among data are represented by a collection of tables, each of which has a number of columns with unique names. Files filled with data are called relations; a tuple designates a row or record, and columns are referred to as attributes or fields.
Relational databases work on the principle that each table has a key field that uniquely identifies each row, and that these key fields can be used to connect one table of data to another.
The relational database is popular for two major reasons :
1. Relational databases can be used with little or no training.
2. Database entries can be modified without specifying the entire body.
Object-Oriented Model
In this model we use the functionality of object-oriented programming. It offers more than the storage of programming-language objects. Object DBMSs extend the semantics of languages such as C++ and Java to provide full-featured database programming capability, while retaining native language compatibility. They add database functionality to object programming languages. This approach unifies application development and database development into a consistent data model and language environment. Applications require less code, use more natural data modeling, and code bases are easier to maintain. Object developers can write complete database applications with a modest amount of additional effort.
The object-oriented database is a combination of object-oriented programming language systems and persistent (database) systems. The power of object-oriented databases comes from the seamless treatment of both persistent data, as found in databases, and transient data, as found in executing programs.
The three-tier architecture is the most widely used architecture in today’s world.
- In this architecture the user layer, business layer and data layer are implemented independently by three different applications.
- The data required by the business logic exists in the database server.
- In the three-tier architecture, all layers interact with each other independently.
Example
Explain the 3-tier web architecture for an online shopping database system.
In an online shopping system, a .NET environment is used as the front end, while MS SQL Server 2008 is used as the backend database.
Diagrammatic representation
1.8 DATABASE INTERFACE
1. Menu-Based Interfaces for Web Clients or Browsing
These interfaces present the user with lists of options, called menus, that lead the user through the formulation of a request. Pull-down menus are an extremely popular technique in Web-based user interfaces.
They are also used in browsing interfaces that permit a user to look through the contents of a database in an exploratory and unstructured manner.
2. Forms-Based Interfaces
A form-based interface is a kind of user interface in which the user interacts with the application by selecting one of a number of possible values and by entering text into fields that accept it. For example, in a word processor used to write documents, we can have settings for the font size, the font to use, and the alignment of the paragraph on the page. Many databases support forms-based query facilities, for example a STUDENT ADMISSION ENROLLMENT form.
Summary
A database management system(DBMS) consists of a collection of
interrelated data and a collection of programs to access that data. The data
contains information about one particular enterprise. The primary goal of a
DBMS is to provide an environment which is both convenient and efficient to
use in retrieving and storing information.
Database systems are designed to manage large bodies of information. The
management of data involves both the definition of structures for the storage of
information and the provision of mechanisms for the manipulation of
information. The database system must provide for the safety of the information
stored, despite system crashes or attempts at unauthorized access. If data is to
be shared among several users, the system must avoid possible anomalous results.
A major purpose of a database system is to provide users with an abstract
view of the data. The system hides certain details of how the data is stored and
maintained. This is accomplished by defining three levels of abstraction at which
the database may be viewed: the physical level, the conceptual level and the
view level.
Databases change over time as information is inserted and deleted. The
collection of information stored in the database at a particular moment in time is
called an instance of the database. The ability to modify a scheme definition in
one level without affecting a scheme definition in the next higher level is called
data independence. There are two levels of data independence: physical data
independence and logical data independence.
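As a hedged illustration of these levels (the Account table, its columns and the 'Perry' branch are example names only, echoing the query used later in Unit 4), a view defined in SQL sits at the view level and shields its users from changes made at the lower levels:

-- Conceptual (logical) level : a base relation
CREATE TABLE Account (
    Account_no  INT PRIMARY KEY,
    Branch_name VARCHAR(30),
    Balance     DECIMAL(10, 2)
);

-- View level : users of this view never see Branch_name.
-- If Account is reorganized later, only the view definition
-- has to change, which is the essence of data independence.
CREATE VIEW Perry_accounts AS
    SELECT Account_no, Balance
    FROM Account
    WHERE Branch_name = 'Perry';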
A database scheme is specified by a set of definitions which are expressed
by a data definition language (DDL). DDL statements are compiled into a set of
tables which are stored in a special file called the data dictionary which contains
metadata.
A data manipulation language (DML) is a language that enables users to
access or manipulate data.
There are basically two types : procedural DMLs, which require a user to specify what data is needed and how to get it, and nonprocedural DMLs, which require a user to specify what data is needed without specifying how to get it.
Exercises
1. Explain data, information, data processing and data management.
2. Explain the difference between physical and logical data independence.
3. Why do we need a DBMS?
4. Explain different Database languages.
5. Define Database Schema and instances.
6. Illustrate with examples database interfaces.
*****
UNIT - 2
DATA MODELLING
Structure
2.0 Introduction
2.1 Unit Objectives
2.2 Data Modeling: Data Models,
2.3 Logical Data Modeling:
2.3.1 Hierarchical Data Model
2.3.2 Network Data Model
2.3.3 Relational Data Model
Advantages and Disadvantages of Logical Data Modeling.
2.4 Conceptual Data Modeling
2.4.1 Entity Relationship Model
2.4.2 Entities
2.4.3 Attributes
2.4.4 Types of Attributes
2.4.5 Relationships
2.4.6 Degree of relationship Set
2.5 Mapping Cardinalities, Keys
2.5.1 ER Diagram Notations
2.6 Case studies on ERD.
2.0 INTRODUCTION
In this unit, you will learn about the basic concepts of data models, which
are mainly concerned with logical database design and the rules for it.
In this unit we will also discuss the Entity-Relationship model, which facilitates database design by enabling the designer to express the logical properties of the database in an enterprise schema. The various data models supported by DBMSs are the Hierarchical, Network, Relational and Entity-Relationship models.
2.1 UNIT OBJECTIVES
After going through this unit, you will be able to :
- Define the concept of data models
- Understand the Entity- Relationship model
- Describe and compare the Hierarchical, Network and relational data
models.
- Understand the Mapping Cardinalities
2.2 DATA MODELING
The process of analysing data objects and their relationships to other data objects is known as data modeling. It is the conceptual representation of data in a database and the first step in database design. Data models define how data elements are connected to each other and how they are processed and stored inside the system. A data model provides a way to describe the design of a database at the physical, logical and view levels.
Advantages of Relational Data Model
a) Supports SQL
- For accessing the data in the relational data model we have a language known as Structured Query Language (SQL).
- This language is used to access, insert, update or delete data from the tables. Using the relational data model we can execute complex queries.
b) Flexible
- We can easily manipulate information which is linked across various tables.
- We can extract information from different tables simultaneously by using this model, as the example below illustrates.
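For instance, a single query can combine data held in two different tables. The sketch below is only illustrative; it assumes Student and Course_details tables sharing a Course_Code column, as in the foreign key example of Unit 3.

-- Pull related information from two tables at once
SELECT s.Student_name, c.Course_name
FROM Student s
JOIN Course_details c ON s.Course_Code = c.Course_Code
WHERE c.Course_name = 'DBMS';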
Advantages of Entity Relationship Model
a) Simple Design : The ER model is simple and easy to design. It shows
the logical view of the data and its relationships. This model is easy
to understand.
b) Effective representation : The representation of the Entity Relationship Model is very simple and effective. The programmer and designer can easily understand the flow of the system by referring to the ER Model.
c) Connected with Relational Model : The Entity Relationship Model
is connected with the relational model. Due to this advantage we can
develop a well-structured design.
2. Weak Entity
The entity which does not have any key attribute is known as a weak entity. A weak entity has a partial discriminator key, and it depends on a strong entity for its existence. A weak entity is denoted by a double rectangle.
Example : In a parent/child relationship, a child is considered a weak entity which completely depends upon the strong entity ‘parent’.
2.4.3 Attribute
- An attribute is a characteristic of an entity. Entities are represented by
means of their attributes. All attributes have their own specific values.
For example, an employee entity may have Employee_ID, emp_name,
salary as attributes.
- In a database management system an attribute is a database
component, such as field or column of a table.
Example :
The entity student has attributes like student_id and student_name, and every attribute has a value. Here 101 is the value of the attribute student_id and Prem is the value of the attribute student_name.
2. Multi-valued Attribute
- A multi-valued attribute is an attribute which can hold multiple values for a single entity.
Example : In the entity student, the attribute student_contact_number would be considered a multi-valued attribute, since a student could have multiple contact numbers.
3. Simple Attribute
An attribute whose value cannot be further divided is known as Simple
Attribute. That means it is atomic in nature.
Example : In the entity student, student_age cannot be divided, so it
is simple attribute of student entity.
4. Composite Attribute
Composite attributes are attributes which can be further divided into sub-parts. These sub-parts represent more basic attributes with their own independent meanings.
Example : In the entity student, student_name is the composite
attribute, we can divide this attribute in three different sub parts :
First_name, Middle_name and Last_name.
5. Derived Attribute
An attribute which does not physically exist in the database, but whose value can be calculated from other attributes that are present, is known as a derived attribute.
Example : In the entity student, we can calculate the average age of
students. This average age is not physically present in the database
but it can be derived from the attribute student_age.
2.4.5 Relationships
The association between two different entities is called a relationship. In a real-world application, it expresses what one entity does with the other and how the entities are connected to each other.
Example : An employee works at a department, a student enrolls for a
course. Works at and Enrolls are called relationships.
2. Binary Relationship
A Binary relationship exists only when there is relation between only
two entities. In this case the degree of relation is two.
Example : A teacher teaches a student. Here teacher and student are two different entities which are connected with each other via the relation Teaches.
3. Ternary Relationship
A ternary relationship exists when there are relations among three entities. In a ternary relationship the degree of the relation is three.
Example : A person can be a student and a person can also be a teacher. Teacher, student and person are three entities which are related to each other.
4. Quaternary Relationship
A quaternary relationship exists when there are relations among four entities. In a quaternary relationship the degree of the relation is four.
Example : The four entities Employee, Management Faculty, Teaching Faculty and Non-Teaching Faculty are connected with each other via an ‘is a’ relationship.
2.5 KEYS
Being able to distinguish the entities in a given entity set is very important. The entities can be differentiated in terms of their attributes; here the values of the attributes are in focus. These values should be different so that the entities can be identified uniquely. In an entity set, no two entities should have the same values for all the attributes.
In the database schema, the notion of a key applies directly to entity sets. The key for an entity in an entity set is an attribute or set of attributes which is used to distinguish entities from each other.
Keys are also used to identify relationships uniquely and differentiate these
relationships from each other.
2. Super Key
- A super key is an attribute, or a combination of attributes, that uniquely identifies entities in an entity set.
Example : In student database having attributes Student_reg_id,
Student_roll_no, Student_name, Address, Contact_no.
The Super keys are :
{Student_reg_id},
{Student_roll_no}
{Student_reg_id, Student_roll_no}
{Student_reg_id, Student_name}
{Student_roll_no, Student_name}
{Student_reg_id, Student_roll_no,Student_name}
- It means a super key can be any combination of attributes that makes identifying a record easier.
4. Alternate Keys
- From all candidate keys, only one key gets selected as primary
key, remaining keys are known as alternative or secondary keys.
Example : In student table Student_address, Contact_no,
Date_Of_Birth are the alternative keys.
5. Composite Key
A key which consists of more than one attribute to uniquely identify rows in a table is called a composite key. It is also known as a compound key.
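A short SQL sketch showing how these kinds of keys are declared; the column definitions are assumed for illustration, and Stud_Courses echoes the table used later in Unit 3.

CREATE TABLE Student (
    Student_reg_id  INT PRIMARY KEY,      -- candidate key chosen as primary key
    Student_roll_no INT UNIQUE,           -- alternate (secondary) key
    Student_name    VARCHAR(50),
    Contact_no      VARCHAR(15)
);

-- Composite (compound) key : Stud_Id and Course_Id together identify a row
CREATE TABLE Stud_Courses (
    Stud_Id   INT,
    Course_Id INT,
    PRIMARY KEY (Stud_Id, Course_Id)
);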
Constraints
There are certain constraints in E-R enterprise schema to which the contents
of a database must conform.
Two types of constraints are
1. Mapping cardinalities
2. Participation constraints
2.5 MAPPING CARDINALITIES
Cardinality in DBMS
- In Database Management System the term ‘Cardinality’ refers to the
uniqueness of data values contained in a column.
- If the column contains a large percentage of totally unique values then
it is considered as high cardinality while if the column contains a lot
of “repeats” in its data range, it is known as Low cardinality
Mapping cardinalities
- Mapping cardinalities expresses the number of entities to which
another entity can be associated with in a relationship-set.
- It defines the relationship between two entities via a relationship set. Consider a binary relationship set R between entity set A and entity set B. For this relationship the mapping cardinalities are as follows:
One to One
- An entity of entity-set A can be associated with at most one entity of
entity set B and vice versa that means an entity in entity-set B can be
associated with at most one entity of entity-set A.
One to Many
- In this type an entity in set A is associated with many other entities in
set B.
- But an entity in entity set B can be associated with maximum of one
entity in entity set A.
Many to One
- In this type an entity in set A is associated with at most one entity in set B, while an entity in set B can be associated with any number of entities in set A.
Many to Many
- In this type an entity in entity set A is associated with any number of entities in entity set B, and an entity in entity set B is associated with any number of entities in entity set A (see the schema sketch below).
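In a relational schema these cardinalities are commonly realized with foreign keys. The sketch below is only one possible illustration; the Department, Employee, Project and Works_On tables and their columns are assumed for the example.

-- One to many : one department has many employees, so the
-- foreign key is placed on the "many" side (Employee).
CREATE TABLE Department (
    Dept_id   INT PRIMARY KEY,
    Dept_name VARCHAR(30)
);

CREATE TABLE Employee (
    Emp_id  INT PRIMARY KEY,
    Name    VARCHAR(50),
    Dept_id INT REFERENCES Department(Dept_id)
);

-- Many to many : employees work on many projects and each project
-- has many employees, so a separate linking table holds key pairs.
CREATE TABLE Project (
    Proj_id   INT PRIMARY KEY,
    Proj_name VARCHAR(50)
);

CREATE TABLE Works_On (
    Emp_id  INT REFERENCES Employee(Emp_id),
    Proj_id INT REFERENCES Project(Proj_id),
    PRIMARY KEY (Emp_id, Proj_id)
);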
2. Participation Constraints
There are two types of participation constraints.
i. Total participation
ii. Partial participation
i. Total participation : The participation of an entity set E in a
relationship set R is said to be total if every entity in E participates in
at least one relationship in R.
ii. Partial participation : The participation of an entity set E in a relationship set R is said to be partial if only some entities in E participate in relationships in R.
- Example : In a college database system, consider teachers are
assigned as project guides to every student. In this case every
Student entity is related with teacher entity through the
relationship “Guide”.
- Hence participation of student in the relationship guidance is
total. But it is not compulsory that every teacher should guide
the students.
- It is possible that not all the teacher entities are related with the
student through the relationship “Guide”. The participation of
teacher in the “guide” relationship set is partial.
Representations
1. Entity
An Entity is any object, place, person or class. In ER diagram, an
entity is represented using rectangles. Consider an example of an
organization. Employee, Manager, Company, Product etc. are
considered as entities.
2. Weak Entity
Weak Entity is an entity which depends upon another entity. Weak
entity is represented by double rectangle. Subject is the weak entity
because Subject depends on course.
3. Attribute
Attributes are nothing but the properties of an entity. Here Emp_id, Name and Address are attributes of the entity Employee.
Summary
The entity-relationship data model is based on a perception of a real world
which consists of a set of basic objects called entities and relationships among
these objects. The model is intended primarily for the database design process.
It was developed in order to facilitate database design by allowing the
specification of an enterprise scheme. Such a scheme represents the overall
logical structure of the database.
An entity is an object that exists and is distinguishable from other objects.
The distinction is accomplished by associating with each entity a set of attributes
that describe the object. A relationship is an association among several entities.
The collection of all entities of the same type is termed an entity set, and the collection of all relationships of the same type is a relationship set.
Exercises
1. Explain the difference between a weak and a strong entity set.
2. Draw and explain the ER Diagram Notations.
3. Explain types of attributes.
4. Every weak entity set can be converted to a strong entity set by simply
adding appropriate attributes. Why, then, do we have weak entity sets?
Case 1.
Draw an ER diagram for an online book shop, which should consist of entity sets, attributes, relationships, mapping cardinalities and keys. It will maintain information about all customers, books, book authors, publishers, billing, etc.
Case 2.
Following requirements are given for a database of the National Hockey
League. The NHL has many teams, each team has a name, a city, a coach, a
captain and a set of players. Each player belongs to only one team. Each player
has a name, a position(such as left wing or goalie), a skill level, a set of injury
records. A team captain is also a player. A game is played between two teams
(referred to as host_team and guest_team) and has a date and score.
Construct an ER diagram for the NHL database.
Case 3
Draw an ER diagram for hospital management, where patients take treatment from physicians and can also claim medical insurance.
Case 4 :
Consider a university database for the scheduling of classrooms for final
exams. This database could be modeled as the single entity set exam, with
attributes course_name, section_no, room_no and time. Alternatively, one or
more additional entity sets could be defined, along with relationship sets to
replace some of the attributes of the exam entity set as:
1) Course with attributes name, dept and course_no.
2) Section with attributes s_no and enrollment, and dependent as a weak
entity set on course.
3) Room with attributes r_no, capacity and building.
Draw an E-R diagram illustrating the use of all three additional entity sets
listed.
*****
UNIT - 3
NORMALIZATION
Structure
3.0 Introduction
3.1 Unit Objectives
3.2 Normalization: Keys: Composite, Candidate, Primary, Secondary,
Foreign, Super key
3.3 CODD’s Rules, Mapping conceptual model into Relational Model.
3.4 Functional Dependencies
3.5 Decomposition
3.6 Lossy and Lossless Decomposition, Dependency Preserving
Decomposition
3.7 Advantages and Disadvantages of Normalization
3.8 Normal Forms (1NF, 2NF, 3NF,) Case Studies on Normalization.
3.0 INTRODUCTION
In this unit, you will learn that database designing is a process in which you
create a logical data model for a database to store the data of a company. It is
performed after the initial database study phase in the database life cycle. You
can use the normalization technique to create the logical data model for a
database and eliminate data redundancy. Normalization also allows you to
organize data efficiently and reduce anomalies during data operations. Various
normalization forms, such as first, second and third, can be applied to create a
logical data model for a database. The second and third normal forms are based
on partial dependency and transitive dependency. Partial dependency occurs when a non-key column is uniquely identified by only a part of a composite primary key. Transitive dependency occurs when a non-key column is uniquely identified by values in another non-key column of the table.
3.1 UNIT OBJECTIVES
After going through this unit, you will be able to :
- Discuss the process of normalization and its advantages and
disadvantages
- Understand the first, second and third normal forms
- Understand Decomposition
- Discuss Lossy and Lossless Decomposition, Dependency Preserving
Decomposition
3.2 NORMALIZATION
Foreign Key
- The foreign key constraint is also known as the referential integrity constraint. In this constraint, one field is common between two tables.
- Foreign keys represent relationships between tables. There is a parent-child relationship between two tables having a common column.
- The master table can be referenced as the parent, while the transaction table is considered the child. The common field works as the primary key in the parent table and as the foreign key in the child table.
Example : Consider a training institute database having two tables, Course_details and Student. There is a condition that students may register only for courses which are currently available in the institute, and not for courses which are not offered at the moment. To specify this rule while inserting values into the database, the foreign key constraint is used, as follows :
- In both tables, the field Course_Code is common. In Course_details, Course_Code is referred to as the primary key, and in the Student table it is referred to as the foreign key.
- So, after assigning the foreign key constraint to the Student table, a record entry for a new student will not accept a Course_Code which is not available in the master table Course_details.
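One possible SQL formulation of this rule is sketched below; the column names other than Course_Code are assumed for illustration.

CREATE TABLE Course_details (
    Course_Code  VARCHAR(10) PRIMARY KEY,   -- primary key in the parent table
    Course_name  VARCHAR(50)
);

CREATE TABLE Student (
    Student_id   INT PRIMARY KEY,
    Student_name VARCHAR(50),
    Course_Code  VARCHAR(10),
    FOREIGN KEY (Course_Code) REFERENCES Course_details(Course_Code)
);

-- This insert succeeds only if 'C101' already exists in Course_details;
-- otherwise the referential integrity constraint rejects the row.
INSERT INTO Student (Student_id, Student_name, Course_Code)
VALUES (1, 'Anuja', 'C101');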
Candidate Key : A super key consisting of the minimum number of columns required to identify a unique row in a table.
Codd designed these rules to define what is required of a relational database management system.
All these thirteen rules are followed by very few commercial products; even Oracle follows only eight and a half rules.
Codd’s rules are :-
Example
- Suppose employees get 5% hike in a year. Then their salary has to be
updated to reflect the new salary. Since this is the annual hike given
to the employees, this increment is applicable for all the employees.
- Hence the query should not be written for updating the salary one by one
for thousands of employee. A single query should be strong enough
to update the entire employee’s salary at a time.
- That means even though the data is distributed, it should not affect the speed of access and the performance of data compared to a centralized database.
STUDENT_ID    STUDENT_NAME
1002          Anuja
For every STUDENT_ID there should be a unique name (A → B).
Table 3.5 Stud_Courses
For example,
Here the composite primary key is Stud_Id + Course_Id.
F) Trivial Functional Dependency
- The dependency of an attribute on a set of attributes is known as a trivial functional dependency if the set of attributes contains the attribute itself.
- In this dependency the right-hand side of the FD is a subset of the left-hand side of the FD. A functional dependency of the form A → B is trivial if B is a subset of A.
- The functional dependency {A, B} → A is trivial.
- In a relation, if attribute A depends on a set of attributes AB, and attribute A is also a subset of AB, then this functional dependency is said to be trivial.
For example,
Armstrong’s Axioms
Armstrong’s axioms were developed by William W. Armstrong. They are a set of axioms (or inference rules) used to infer all the functional dependencies on a relational database.
Consider a relation R having three sets of attributes X, Y and Z. Then Armstrong’s axioms are :
- Reflexivity Rule : If Y is a subset of X, then the FD X → Y holds for the relation R. This type of dependency is also called a trivial dependency.
- Augmentation Rule : If the FD X → Y holds for relation R, then the FD XZ → YZ also holds for the relation.
- Transitivity Rule : If the functional dependencies X → Y and Y → Z hold for relation R, then the FD X → Z also holds for R. This type of dependency is also called a transitive dependency.
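As a small worked example (the relation and the given dependencies are assumed purely for illustration), suppose a relation R(A, B, C) holds the functional dependencies A → B and B → C. The axioms above then let us infer, among others:
A → C      (transitivity rule, from A → B and B → C)
AC → BC    (augmentation rule, augmenting A → B with C)
AB → A     (reflexivity rule, since A is a subset of AB)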
3.5 DECOMPOSITION
One solution to eliminate redundancy, and also the update and deletion anomalies in a database, is to break a relation into two or more relations. This process is called decomposition.
Let R be a relation schema.
A set of relation schemas {R1, R2, ..., Rn} is a decomposition of R if
R = R1 ∪ R2 ∪ ... ∪ Rn
Example,
Let us see how to break a relation into more than one relation. Consider the CLASSINFO relation having the attributes COURSE_ID, COURSE_NAME, STUD_NAME, SUB_ID and SUB_NAME.
Now try to join these two relations, COURSE_STUDENT and SUBJECT_STUDENT, as follows.
Here two additional rows are displayed. This shows that a badly designed decomposition may lead to information loss. So, to ensure that the database has a good design after decomposition, the decomposition process is carried out according to criteria defined by the desirable properties of decomposition.
While decomposing a relation, certain properties must hold so that the database remains in a consistent state. The two desirable properties of decomposition are:
i) Lossless decomposition
ii) Dependency preserving decomposition
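A hedged SQL sketch of the CLASSINFO decomposition discussed above (the data itself is not shown and the syntax is only illustrative): the two projections are created from the original relation and then joined back. If the join returns rows that were not in CLASSINFO, the decomposition is lossy; if it reproduces CLASSINFO exactly, it is lossless.

-- Project CLASSINFO onto two smaller relations
CREATE TABLE COURSE_STUDENT AS
    SELECT DISTINCT COURSE_ID, COURSE_NAME, STUD_NAME FROM CLASSINFO;

CREATE TABLE SUBJECT_STUDENT AS
    SELECT DISTINCT STUD_NAME, SUB_ID, SUB_NAME FROM CLASSINFO;

-- Join the projections back on the common attribute STUD_NAME and
-- compare the result with the original CLASSINFO relation.
SELECT c.COURSE_ID, c.COURSE_NAME, c.STUD_NAME, s.SUB_ID, s.SUB_NAME
FROM COURSE_STUDENT c
JOIN SUBJECT_STUDENT s ON c.STUD_NAME = s.STUD_NAME;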
- It divides larger tables to smaller tables and links these smaller tables
using their relationships.
- Normalization is implemented by some formal rules either by a process
of synthesis or decomposition. Synthesis is used to create database
design in normalized form based on a known set of dependencies.
Decomposition generally is done on existing database which is not
normalized. The relation of this database is further divided into
multiple relations to improve the design.
- Normalization is a multi-step process. It creates smaller tables from a larger table.
Types of Normalization
1) First normal form (1NF) : Has unique values and no repeating groups.
2) Second normal form (2NF) : Has unique values, no repeating groups and no partial dependency.
3) Third normal form (3NF) : Same as second normal form, and has no transitive dependency.
4) Boyce-Codd normal form (BCNF) : A more advanced version of 3NF.
5) Fourth normal form (4NF) : No multi-valued dependency.
Table 3.1 Course_details is not in 1NF, as the rule says “each attribute of a table must have atomic (single) values” and the attribute ‘Languages’ contains multiple values, which violates the rule of 1NF. To convert this data into first normal form, we have to rearrange it in the table.
- Now each attribute contain single values. Hence the database design
is in First normal form.
- In the above example the prime key attributes are STUD_ID and
PROJ_ID. Here the non key attributes like STUD_NAME and
PROJ_NAME are dependent upon one of the prime key attributes.
- That means STUD_NAME depends upon STUD_ID, while PROJ_NAME depends upon PROJ_ID. This is called partial dependency.
- As per the rule of second normal form, the non-key attributes should be dependent upon all the key attributes, which is not the case here.
- To convert the data into second normal form, we have to split the table into two different tables, as follows.
STUDENT
PROJECT
Solved question explaining the concept of normalization
First Normal Form (1NF) : A relation is said to be in first normal form if it has no repeating groups.
Let us consider the Invoice Report of Alpha Book House as shown below.
In Fig. 3.5, Qty depends on Cust_No and ISBN, but the remaining non-key fields (Title, Author_Name, Author’s country, Unit_price) depend only on ISBN. Thus, there exists a partial dependency.
The existence of partial dependency will result in insertion, update and deletion anomalies.
Insertion anomaly
In Relation 3, if we want to insert data of a new book (ISBN, Title,
Author_Name, Author’s country, Unit_price) there must be at least one customer.
This means that the data of a new book can be entered into Relation 3 only when
the first customer buys the book.
Update anomaly
In Relation 3, if we want to change any of the non-key fields, like Title or Author_Name, it will result in inconsistency because the same value is available in more than one record.
Deletion anomaly
In Relation 3, if a book is purchased by only one customer, then the book data will be lost when we delete that record after fully satisfying that customer’s order.
Hence, Relation 3 is divided into two relations as shown below :
4) Sales(Cust_No, ISBN, Qty)
5) Book_Author(ISBN, Title, Author_Name, Author’s country,
Unit_price)
The above two relations (i.e., Relation 4 and Relation 5) are now in 2NF.
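A brief SQL illustration of how the split removes the insertion anomaly (the column types and the sample values are assumed): because book data now lives in its own relation, a new book can be recorded before any customer buys it.

-- Relation 5 : book data, independent of any sale
CREATE TABLE Book_Author (
    ISBN            VARCHAR(13) PRIMARY KEY,
    Title           VARCHAR(100),
    Author_Name     VARCHAR(50),
    Authors_Country VARCHAR(30),
    Unit_Price      DECIMAL(8, 2)
);

-- Relation 4 : one row per customer-book purchase
CREATE TABLE Sales (
    Cust_No INT,
    ISBN    VARCHAR(13) REFERENCES Book_Author(ISBN),
    Qty     INT,
    PRIMARY KEY (Cust_No, ISBN)
);

-- The new book is inserted even though no customer has bought it yet,
-- which was not possible in the unnormalized Relation 3.
INSERT INTO Book_Author
VALUES ('9780000000001', 'Sample Book Title', 'A. Author', 'India', 450.00);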
Summary
Database designing involves normalizing a database to ensure that the
relational tables in the database contain only related information and store data
only at a single location in the Database. The higher normal forms, such as the
fourth and fifth normal forms, in normalization are based on the concept of multi-
valued dependency.
Exercise
1. What are the steps to normalize a relational table into the first normal
form?
2. What are the steps to normalize a relational table into the second
normal form?
3. What are the steps to normalize a relational table into the third normal
form?
4. List the advantages of normalization.
5. Explain different types of keys.
6. What is a primary key in a relational table?
7. Explain Lossless decomposition, dependency preserving
decomposition.
8. Explain what is meant by :
a. Repetition of information.
b. Inability to represent information.
c. Loss of information.
Explain why any one of these properties may indicate a bad relational
database design.
*****
UNIT - 4
FILE STRUCTURES AND
DATA ADMINISTRATION
Structure
4.0 Introduction
4.1 Unit Objectives
4.2 File Structures and Data Administration: File Organization
4.3 Overview of Physical Storage Media, Magnetic Disk, RAID, Tertiary
Storage
4.3.1 Storage Access
4.4 Data Dictionary Storage
4.5 Organization of File (Sequential,Clustering)
4.6 Indexing and Hashing
4.7 Basic Concepts, indices
4.8 B+ Tree index file
4.9 B- tree index file
4.10 Static hashing
4.11 Dynamic Hashing
4.12 Data administration, Role and Responsibility of DBA
4.0 INTRODUCTION
The logical model of the database is the correct level for database users to
focus on. The goal of a database system is to simplify and facilitate access to
data. Users of the system should not be burdened unnecessarily with the physical
details of the implementation of the system. We shall define various data
structures that will allow fast access to data. We shall consider several alternative
structures, each best suited to a different kind of access to data.
When a record is deleted, we could move the record that came after it into
the space formerly occupied by the deleted record, and so on, until every record
following the deleted record has been moved ahead. Such an approach requires
moving a large number of records. It might be better simply to move the last
record of the file into the space occupied by the deleted record.
It is undesirable to move records in order to occupy the space freed by a
deleted record, since this requires additional block accesses. Since insertions
tend to be more frequent than deletions, it is acceptable to leave the space
occupied by the deleted record open, and wait for a subsequent insertion before
reusing the space. A simple marker on a deleted record is not sufficient, since it
is hard to find this available space when an insertion is being done. Thus, we
need to introduce additional structure.
At the beginning of the file, we allocate a certain number of bytes as a file
header. The header will contain a variety of information about the file. For now,
all we need to store there is the address of the first record whose contents are
deleted. We use this first record to store the address of the second available
record, and so on. We may think of these stored addresses as pointers since they
“point” to the location of a record.
Upon insertion of a new record, we use the record pointed to by the header.
We change the header pointer to point to the next available record. If no space
is available, we add the record to the end of the file.
The use of pointers requires careful programming. If we move or delete a
record to which another record contains a pointer, that pointer becomes incorrect
in the sense that it no longer points to the desired record. Such pointers are called dangling pointers and, in effect, point to garbage. In order to avoid the dangling pointer problem, we must avoid moving or deleting records that are pointed to by other records. These records are pinned.
Insertion and deletion for files of fixed-length records are quite simple to
implement because the space made available by a deleted record is exactly the
space needed to insert a record. If we allow records of variable length in a file,
this is no longer the case. An inserted record may not fit in the space left free by
a deleted record or it may fill only part of that space.
Fixed-Length Representation
In order to implement variable-length records efficiently in a file system,
we use one or more fixed-length records to represent one variable-length record.
There are two techniques for implementing files of variable-length records using fixed-length records.
Several types of data storage exist in most computer systems. These storage media are classified by the speed with which data can be accessed, by the cost per unit of data to buy the memory, and by how reliable they are. Among the media typically available are:
- Cache : This is the fastest and most costly form of storage. The size
of cache memory is very small and the use of cache is managed by the
operating system. We shall not need to be concerned about managing
cache storage in the database system.
- Main memory : This is the storage media used for data that is
available to be operated on. The general-purpose machine instructions
operate on main memory. Although main memory may contain several
megabytes of data, main memory is generally too small to store the
entire database. Main memory is sometimes referred to as core
memory, a reference to a technology used in the 1960s and early 1970s
to implement main memory. The contents of main memory are usually
lost if a power failure or system crash occurs.
- Disk storage : This is the primary medium for the long term storage
of data. The entire database is stored on disk. Data must be moved
from disk to main memory in order for the data to be operated on.
After operations are performed, the data must be returned to disk. Disk
storage is referred to as direct-access storage because it is possible to read data on disk in any order. Disk storage usually survives power failures and system crashes. Disk storage devices themselves may fail and destroy data, but such failures are significantly less frequent than system crashes.
- Tape storage : This is storage used primarily for backup and archival
data. Although tape is much cheaper than disk, access to data is much
slower, since the tape must be read sequentially from the beginning.
For this reason, tape storage is referred to as sequential access storage
and is used primarily for recovery from disk failures. Tape devices
are less complex than disks, thus they are more reliable.
Storage Access
1. Block – A database file is partitioned into fixed length storage units
called blocks.
2. Buffer – Portion of main memory available to store copies of disk
blocks.
3. Buffer management – A major goal of the file structures is to
minimize the number of blocks that must be accessed. Another way
to reduce the number of disk accesses is to keep as many blocks as
possible in main memory. The goal is to maximize the chance that
when a block is accessed, it is already in main memory and, thus, no
disk access is required.
Since it is not possible to keep all blocks in main memory, we need to
manage the allocation of the space available in main memory for the storage of
blocks. The buffer is that part of main memory available for storage of copies
of disk blocks. There is always a copy kept on disk of every block, but the copy
on disk may be an older version of the block than the version in the buffer. The
subsystem responsible for the allocation of buffer space is called the buffer
manager. The buffer manager intercepts all requests made by the rest of the
system for blocks of the database. If the block is already in the buffer, the
requester is passed the address of the block in main memory. If the block is not
in the buffer, the buffer manager reads the block in from disk into the buffer, and
passes the address of the block in main memory to the requester. Thus, the buffer
manager is transparent to those system programs that issue disk block requests.
In order to serve the database system well, the buffer manager must use more
sophisticated techniques than typical virtual memory management schemes:
- Replacement Strategy. When there is no room left in the buffer, a
block must be removed from the buffer before a new one can be read
in. Typical operating systems use a least recently used scheme, in
which the block that was referenced least recently is written back to
disk and removed from the buffer. This simple approach can be
improved upon for database application.
- Pinned blocks. In order for the database system to be able to recover
from crashes, it is necessary to restrict those times when a block may
be written back to disk. A block that is not allowed to be written back
to disk is said to be pinned. Although many operating systems do not
provide support for pinned blocks, such a feature is essential for the
implementation of a database system that is resilient to crashes.
- Forced output of blocks. There are situations in which it is necessary
to write the block back to disk even though the buffer space it occupies
is not needed. This is called the forced output of a block. The
requirement is due to the fact that main memory contents and thus
buffer contents are lost in a crash while data on disk usually survives
a crash.
A relational database system needs to maintain data about the relations. This
information is called the data dictionary, or system catalog.
Among the types of information the system must store are:
- Names of the relations.
- Names of the attributes of each relation.
- Domains of attributes.
- Names of views defined on the database, and the definition of those views.
- Integrity constraints for each relation (for example, key constraints).
In addition to the above items, many systems keep the following data on
users of the system.
- Names of authorized users.
- Accounting information about users.
In systems that use highly sophisticated structures to store relations,
statistical and descriptive data about relations may also be kept, such as:
- Number of tuples in each relation
- Method of storage used for each relation
Some database systems store this information using special-purpose data
structures and code. It is generally preferable to store the data about the database
in the database itself. By using the database to store system data, we simplify
the overall structure of the system and allow the full power of the database to be
used to permit fast access to system data.
The exact choice of how to represent system data using relations must be
made by the system designer. One possible representation is :
System-catalog-scheme = (relation-name, number-of-attributes)
Attribute-scheme = (attribute-name, relation-name, domain-type, position)
User-scheme = (user-name, encrypted-password, group)
Index-scheme = (index-name, relation-name, index-type, index-attributes)
View-scheme = (view-name, definition)
Sequential Files
A sequential file is designed for efficient processing of records in sorted
order based on some search key. To permit fast retrieval of records in search-
key order, records are chained together by pointers. The pointer in each record
points to the next record in search key order. In order to minimize the number
of block accesses in sequential file processing, records are stored physically in
search-key order, or as close to search-key order as possible.
Many queries reference only a small proportion of the records in a file. For
example, the query “Find all accounts at the Perry branch” references only a
fraction of the account records. It is inefficient for the system to have to read
every record and check the branch-name field for the name “Perry”. Ideally, the
system should be able to locate these records directly. In order to allow these
forms of access, we design additional structures that we associate with files. We
shall consider two general approaches to this problem: the construction of indices
and the construction of hash functions.
An index for a file works in much the same way as a catalog in a library. If
we are looking for a book by a particular author, we look in the author catalog
and a card in the catalog tells us where to find the book. To assist us in searching
the catalog, the cards are kept in alphabetic order, so we do not have to check
every card to find the one we want.
In real-world databases, indices of the type described above may be too
large to be handled efficiently. Instead, more sophisticated indexing techniques
may be used. As an alternative to indexing, hashing may be used. Each of these
techniques must be evaluated on the basis of:
- Access time. The time it takes to find a particular data item, using the
technique in question.
- Insertion time. The time it takes to insert a new data item. This
includes the time it takes to find the correct place to insert the new
data item as well as the time it takes to update the index structure.
- Deletion time. The time it takes to delete a data item. This includes
the time it takes to find the item to be deleted as well as the time it
takes to update the index structure.
- Space overhead. The additional space occupied by an index structure.
Provided that the amount of additional space is moderate, it is usually
worthwhile to sacrifice the space to achieve improved performance.
It is often the case that we desire to have more than one index or hash
function for a file. In the library example, we note that most libraries maintain
several card catalogs: one for author, one for subject and one for title. The attribute or set of
attributes used to look up records in a file is called a search key. This definition
of key differs from that of primary key, candidate key and superkey.
Indexing
In order to allow fast random access to records in a file, an index structure
is used. Each index structure is associated with a particular search key. If the file
is sequentially ordered, and if we choose to include several indices on different
search keys, the index whose search key specifies the sequential order of the file
is the primary index. The other indices are called secondary indices. The search
key of a primary index is usually the primary key.
We assume that all files are sequentially ordered and thus have a primary
search key. Such files, together with a primary index, are called index-sequential
files. They are one of the oldest index schemes used in database systems. They
are designed for applications that require both sequential processing of the entire
file and random access to individual records.
There are two types of indices that may be used :
- Dense index. An index record appears for every search-key value in
the file. The record contains the search-key value and a pointer to the
record.
- Sparse index. Index records are created for only some of the records.
To locate a record, we find the index record with the largest search key
value that is less than or equal to the search key value for which we
are looking. We start at the record pointed to by that index record and
follow the pointers in the file until we find the desired record.
Secondary Indices
Secondary Indices may be structured differently from primary indices. The
pointers in the Secondary Indices do not point directly to the file. Instead, each
of these pointers points to a bucket which contains pointers to the file.
This approach allows all the pointers for one secondary search-key value
to be stored together. Such an approach is useful in certain types of queries for
which we may do considerable processing using only the pointers. For primary
keys, we can obtain all the pointers for one primary search key value using a
sequential scan.
A sequential scan in primary key order is efficient because records are stored
physically in an order that approximates primary key order. We cannot store a
file physically ordered both by the primary key and a secondary key. Because
secondary key order and physical key order differ, if we attempt to scan the file
sequentially in secondary key order, the reading of each record is likely to require
the reading of a new block from disk.
The secondary index may be either dense or sparse. If it is dense, then the
pointer in each individual bucket points to records with the appropriate search-
key value. If the secondary index is sparse, then the pointer in each individual
bucket points to records with search key value in the appropriate range. In this
case each bucket entry is either a single pointer or a record consisting of two
fields : a search key value and a pointer to some file record.
By associating a search-key value with each pointer in the bucket, we
eliminate the need to read records with a secondary search-key value other than
the one on which we are performing a lookup.
The bucket structure can be eliminated if the secondary index is dense and
the search-key values form a primary key.
The procedure for deletion and insertion can be applied to a file with
multiple indices. Whenever a file is modified, every index must be updated.
Secondary indices improve the performance of queries that use keys other than
the primary one. However, they impose a serious overhead on modification of
the database. The designer of a database decides which secondary indices are
desirable on the basis of an estimate of the relative frequency of queries and
modifications.
Figure 4.15 Typical nodes of a B-tree : (a) leaf node and (b) nonleaf node
- Deletion in a B-tree is more complicated. In a B+-tree, the deleted entry
always appears in a leaf. In a B-tree the deleted entry may appear in a nonleaf
node. The proper value must be selected as a replacement from the subtree of
the node containing the deleted entry. Specifically, if search key Ki is deleted,
the smallest search key appearing in the subtree of pointer Pi + 1 must be moved
to the field formerly occupied by Ki.
The advantages of B-trees are marginal for large indices. Thus, the structural
simplicity of a B+-tree is preferred by many database system implementors.
Details of the insertion and deletion algorithms for B-trees are explored in the
exercises.
As we have seen, the need to fix the set B of bucket addresses is a serious
problem with the static hashing technique of the previous section. Most databases
grow larger over time. If we are to use static hashing for such a database, we face
three classes of options:
• Choose a hash function based on the current file size. This will result in
performance degradation as the database grows.
• Choose a hash function based on the anticipated size of the file at some
point in the future. Although performance degradation is avoided, a
significant amount of space is wasted initially.
• Periodically reorganize the hash structure in response to file growth. Such
reorganization involves choosing a new hash function, recomputing the
hash function on every record in the file, and generating new bucket
assignments. This is a massive time-consuming reorganization.
Furthermore, it is necessary to forbid access to the file during
reorganization.
Several hashing techniques allow the hash function to be modified
dynamically in order to accommodate the growth or shrinkage of the database.
These techniques are called dynamic hash functions. Below, we describe one
form of dynamic hashing called extendable hashing. Extendable hashing copes
with changes in database size by splitting and coalescing buckets as the database
grows and shrinks. As a result, space efficiency is retained. Moreover, since the
reorganization is performed on only one bucket at a time, the resulting
performance overhead is acceptably low.
With extendable hashing, we choose a hash function h with the desirable
properties of uniformity and randomness. However, this hash function generates
values over a relatively large range, namely b-bit binary integers. A typical value
for b is 32.
We do not create a bucket for each hash value. Indeed, 2^32 is over 4 billion,
and that many buckets is unreasonable for all but the largest databases. Instead,
we create buckets on demand, as records are inserted into the file. We do not use
the entire b bits of the hash initially. At any point, we use i bits, where 0 ≤ i ≤ b.
These i bits are used as an offset into an additional table of bucket addresses. The
value of i grows and shrinks with the size of the database.
One of the main reasons for having database management systems is to have
central control of both data and programs accessing that data. The person having
such central control over the system is called the database administrator (DBA).
The functions of the database administrator include:
- Scheme definition. The original database scheme is created by writing
a set of definitions which are translated by the DDL compiler to a set of
tables that are permanently stored in the data dictionary.
Summary
Many queries reference only a small proportion of the records in a file. In
order to reduce the overhead in searching for these records, we can construct
indices for the files that store the database.
Index-sequential files are one of the oldest index schemes used in database
systems. They are designed for applications that require both sequential
processing of the entire file and random access to individual records. To permit
fast retrieval of records in search-key order, records are chained together by
pointers. In order to allow fast random access, an index structure is used. There
are two types of indices that may be used; dense index and sparse index.
In a standard index-sequential file, only one index is maintained. If several
indices on different search keys are used, the index whose search key specifies
the sequential order of the file is the primary index. The other indices are called
secondary indices. Secondary indices improve the performance of queries that
use search keys other than the primary one. However, they impose a serious
overhead on modification of the database.
The primary disadvantage of the index-sequential file organization is that
performance degrades as the file grows. To overcome this deficiency, a B+ -tree
index can be used. A B+ -tree index takes the form of a balanced tree in which
every path from the root of the tree to a leaf of the tree is of the same length.
Lookup in this scheme is quite straightforward and efficient. However, insertion
and deletion are more complicated. Nevertheless, the number of operations
required for insertion and deletion on B+ -tree is proportional to the logarithm
of the size of the database.
B-tree indices are similar to B+ -tree indices. The primary advantage of a
B-tree is that it eliminates the redundant storage of search-key values. The major
disadvantage is that leaf and nonleaf nodes are not of the same size; thus, storage
management for the index is complicated.
Index schemes require that we access an index structure to locate data. The
technique of hashing, by contrast, allows us to find the address of a data item
directly by computing a function on the search key value of the desired record.
Since we do not know at design time precisely which search-key values will be
stored in the file, a good hash function to choose is one that assigns search-key
values to buckets such that the distribution is both uniform and random.
A static hash function is one in which the set of bucket addresses is fixed.
Such a function cannot easily accommodate databases that grow significantly
larger over time. There are several hashing techniques that allow the hash
function to be modified. These are called dynamic hash functions.
Exercises
1. List the role and responsibilities of Database Administrator.
2. What is the difference between primary index and a secondary index?
3. Write note on B+ -tree indices.
4. Explain Indexing and Hashing.
5. Explain the concept of File organization.
6. Elaborate the concept of Data Dictionary Storage.
*****
UNIT - 5
TRANSACTION AND
CONCURRENCY CONTROL
Structure
5.0 Introduction
5.1 Unit Objectives
5.2 Transaction and Concurrency Control , Multiprogramming and
Multiprocessing
5.3 Concept of transaction
5.3.1 Transaction state
5.4 ACID properties
5.5 Schedules
5.6 Serializability of schedules
5.7 Concurrency Control
5.7.1 lock based protocols
5.7.2 timestamp based protocols
5.8 Deadlock and its handling
5.8.1 Wait-Die and Wound-Wait, Deadlock prevention without using
timestamps
5.9 Deadlock detection and time outs
5.2 MULTIPROGRAMMING AND MULTIPROCESSING
Both manual loading and batch processing systems load and process one
job at a time. After such a system loads a job, the job remains in main memory
until its execution is over, and the system loads the next job only after completion
of the current job.
The job that is currently loaded in the system is the sole occupant of the user’s
area of main memory (the operating system resides in a separate part of main
memory) and has the CPU available exclusively for its execution. Such a system is
called a uniprogramming system: it processes only one job at a time, and all system
resources are available exclusively for that job until its completion.
A job does not need CPU for entire duration of its processing because in
addition to computing (for which CPU is needed) it often needs to perform I/O
operations (such as reading/writing of data from/to tape/disk, waiting for data
input from keyboard/mouse, and printing of results) during the course of its
processing. In fact, depending on CPU utilization during the course of
processing, jobs are of two types :
1. CPU bound jobs. These jobs mostly perform computations with little
I/O operations. Hence, their CPU utilization is very high. Programs
used for scientific and engineering computations usually fall in this
category of jobs.
2. I/O bound jobs. These jobs mostly perform I/O operations with little
computation. Hence, their CPU utilization is very low. Programs used
for commercial data processing applications usually fall in this
category of jobs.
In a uniprogramming system, the CPU is idle whenever the currently executing job
currently performs I/O operations. CPU idle time may not be significant for CPU
bound jobs, but it may be of the order of 80-90% for I/O bound jobs. Moreover,
since I/O devices are slower than CPU by 20 to 100 times, CPU idle time is
significant even for CPU bound jobs.
To minimize the idle time of CPU, researchers introduced the concept of
multiprogramming, which uses the idea of organizing multiple jobs in a system
so that its CPU always has something to execute.
Multiprogramming is interleaved execution of two or more different and
independent programs by a computer. Multiprogramming systems have better
throughput than uniprogramming systems because Multiprogramming reduces
CPU idle time drastically. However, they are more sophisticated because they
require the following hardware and software features:
1. Large memory. Multiprogramming requires large main memory to
accommodate a good number of user programs along with the operating
system.
2. Memory protection. Multiprogramming requires a memory
protection mechanism to prevent a job (in one memory partition) from
changing another job’s program/data (in another memory partition).
Operating systems use a combination of hardware and software
protection mechanisms for this. It prevents one job from addressing
beyond the limits of its own allocated memory area.
3. Job status preservation. In multiprogramming, when the operating
system blocks a running job because the job needs to do I/O
processing, it takes CPU away from the job and gives it to another job
that is ready for execution. The blocked job resumes its execution
sometime later. To enable this, the operating system must preserve the
job’s complete status information when it takes away the CPU from it, and
must restore this information back before it gives the CPU to the job
again. For this, operating system maintains a process control block
(PCB) for each process, which is loaded in memory partition. Figure
5.1 shows a typical process control block. With this arrangement,
before taking away CPU from a running process, the operating system
preserves its status in its PCB. The operating system restores back the
status of this process just before it gives back the CPU to it again later
to resume its execution. Hence, the process can continue execution
without any problem.
4. Proper job mix. A multiprogramming system requires a proper mix
of I/O –bound and CPU bound jobs to overlap the operations of CPU
and I/O devices effectively. If all loaded jobs need I/O at the same
time, CPU will again be idle. Hence, jobs resident simultaneously in
main memory should contain a good mix of CPU bound and I/O bound jobs
so that at least one job is always ready to utilize CPU.
5. CPU scheduling. In a multiprogramming system, often there are
multiple jobs in ready state (waiting for the operating system to
allocate the CPU to them). Hence, when the CPU becomes free, the operating system
must have a mechanism to decide to which one of these ready jobs it
should allocate the CPU. Part of the operating system that takes this
decision is called CPU scheduler, and the algorithm it uses for this is
called CPU scheduling algorithm.
Multiprocessing
The use of I/O processors improves the efficiency of a computer system by
making concurrent input, processing, and output operations possible. The CPU
performs arithmetic and logical operations, while I/O processors carry out I/O
operations concurrently.
5.3.1 Transaction States
There are five states of a transaction:
Active
- This is the first state of a transaction. In this state the transaction is being
executed. This state is the entry point of every transaction. Transactions
which are in the active state have started their execution.
Partially Committed
When a transaction completes all its operations, it enters the partially
committed state. Even though the last operation has been performed, the data is
still not saved permanently to the database.
Failed
- If the execution of a transaction cannot proceed due to a failure of the system
or a failure in the database, then the transaction is said to be in the failed state.
Abort
- In case of a failed transaction, the modifications done in the database
during transaction processing must be rolled back to ensure the
atomicity and consistency of the database.
- The transaction enters the “aborted” state after the rollback operation
is performed. It is the end of the transaction when any fault occurs.
Committed
- If the transaction completes successfully without any error, it enters the
committed state, which makes its changes permanent in the
database. This is the last state of a transaction that executes
without failure.
- Every transaction goes through at least three of the five states. A
transaction may go through active – partially committed – committed,
through active – failed – aborted, or through active – partially
committed – failed – aborted.
5.4 PROPERTIES OF TRANSACTION
(B) Consistency
1. The consistency property ensures that the transaction executed on the
database system will bring the database from one valid state to another.
2. The data which is written by the transaction must be valid according
to all standard rules and regulations regarding constraints, cascades,
triggers etc. Consistency does not guarantee accuracy of the
transaction as per the expectations of programmer.
(C) Isolation
1. When transactions are performed in a sequence, the state of system is
always valid without any problem. But sometimes we may have to
perform multiple transactions concurrently.
2. In case of concurrent transactions, the isolation property ensures that
the system state should be the same as that which would be obtained if
the transactions were executed sequentially, i.e. one after the other. The
effect of any incomplete transaction should be invisible to other
transactions by means of isolation.
5.5 SCHEDULES
5.7.1 Locks
A lock is a variable associated with a data item that defines the status of the
data item corresponding to the database operation, which can be performed on
it. Locks help synchronise the access of data items when concurrent transactions
are executed on the items.
There are two different types of locks: binary and shared/exclusive.
Binary locks
Binary locks are the locks that have two states, locked and unlocked. If the
value of lock on a data item is 1, it means that the data item is in locked state and
cannot be accessed by any database access operation. The 0 value of lock on the
data item specifies that the data item is in unlocked state and can be accessed by
a database access operation.
When a transaction requests access on a data item, it issues the
lock_item(X) operation. The 1 value of the operation specifies that the
transaction is required to wait for the data item, whereas the 0 value of the operation
specifies that the transaction can lock and access the data item. When the
transaction stops accessing a data item, it issues an unlock_item(X) operation so
that other transactions can use the data item.
The database can easily implement binary lock by maintaining a record with
the following three fields :
- Data item name
- The LOCK variable
- Locking transaction
The system maintains the records of locks in the lock table. The items that
are not included in the lock table are called unlocked items. A transaction follows
the following rules if binary lock is implemented :
- A transaction should issue the lock_item(X) operation before performing
read and write operations on a data item.
- A transaction should issue the unlock_item(X) operation after
performing all read and write operations on a data item.
- A transaction cannot issue the lock_item(X) operation if it already holds
a lock on the data item.
- A transaction cannot issue the unlock_item(X) operation if it does not
hold a lock on a data item.
Shared/exclusive locks
Shared/exclusive locks are the locks, which allow multiple transactions to
read the same data item concurrently but only one transaction to write on a data
item at a time. Shared/exclusive locks are also called read/write locks. In
Shared/exclusive locks, a transaction can issue three operations : read_lock(X),
write_lock(X) and unlock(X). A read-locked item is called shared locked and
a write-locked item is called exclusive locked. When the database system
repeatedly denies the request of a transaction for the exclusive lock of a data
item, it is called livelock. Livelock occurs when various transactions request
shared lock on the same data item on which a transaction is trying to have an
exclusive lock.
To implement the shared/exclusive lock, the lock table includes the following
fields:
- Data item name
- The LOCK variable
- Locking transaction
- No_of_reads
A transaction follows the following rules if Shared/exclusive lock is
implemented:
- A transaction should issue the read_lock(X) or write_lock(X) operation
before performing a read_item(X) operation on a data item.
- A transaction should issue the write_lock(X) operation before any
write_item(X) operation is performed on a data item.
- A transaction cannot issue the read_lock(X) operation if it already holds
a shared or exclusive lock on a data item.
- A transaction should issue the unlock(X) operation after performing all
read_item(X) and write_item(X) operations on a data item.
- A transaction cannot issue the write_lock(X) operation if it already holds
a shared or exclusive lock on a data item.
Figure : wait-for-graph
Starvation
Starvation occurs when a transaction is not executed for a long period of
time while other transactions are executing. This problem occurs if some
transactions are assigned higher priority than others. To avoid starvation, a system
can use the first-come-first-serve waiting method, according to which
transactions can lock a data item in the order in which they request the data
item. The wait-die and wound-wait schemes also help prevent starvation among
transactions. Another way to avoid starvation is to define the priority of
transactions and increase the priority of a transaction which is waiting for a
data item.
5.7.2 Timestamps
A timestamp is a unique identifier that the database system creates to identify
a transaction when it is submitted. It is represented as
TS(<transaction-name>). Timestamps support the Timestamp Ordering (TO)
scheme, in which a schedule orders transactions by their timestamp values
and the schedule is serializable. The TO scheme associates the following two
timestamp values with each data item:
• Read_TS(X): It refers to the read timestamp of the data item X. This is
the highest timestamp of transactions that have successfully read the data
item X.
• Write_TS(X): It refers to the write timestamp of the data item X. This is
the highest timestamp of transactions that have successfully written the
data item X.
Basic TO
In the basic TO scheme, when a transaction T issues the read_item(X) or
write_item(X) operation, the basic TO algorithm compares the timestamp of
transaction T with the read_TS(X) and write_TS(X) values to ensure that the
transactions follow the TO scheme. If the transaction does not follow the TO
scheme, then transaction T is aborted and assigned a new timestamp value. If
transaction T is aborted and rolled back, then any transaction T1 that has used
a value changed by transaction T is also rolled back. Similarly, a transaction
T2 that has used a value changed by transaction T1 is also rolled back.
This process is called cascading rollback. The basic TO algorithm checks the
following conditions:
• Transaction T issues the write_item(X) operation:
- If read_TS(X) > TS(T) or write_TS(X) > TS(T), then abort and
roll back T and reject the operation of T.
- If read_TS(X) ≤ TS(T) and write_TS(X) ≤ TS(T), then execute the
write_item(X) operation of T and set write_TS(X) to TS(T).
• Transaction T issues the read_item(X) operation:
- If write_TS(X) > TS(T), then abort and roll back T and reject the
operation of T.
- If write_TS(X) ≤ TS(T), then execute the read_item(X) operation
of T and set read_TS(X) to the larger of read_TS(X) and TS(T).
Strict TO
Strict TO ensures that the schedules are strict as well as serializable. In
Strict TO, the read_item(X) or write_item(X) operations of a transaction T are
postponed until the transaction T’ that has written the value of X is committed.
The transaction T waits for transaction T’ only if TS(T) > TS(T’).
5.8 DEADLOCK
Deadlock Prevention
- DBMS regularly checks all the transaction operations which are going
to execute, to prevent any deadlock situation in the system. The
checking of all the operations is done to analyze whether the execution
of them leads the system into deadlock or not. If any operation of a
transaction leads the system into a deadlock state, then that transaction is
not allowed to execute.
- There are different methods to prevent deadlock. All methods use the
timestamp of transaction. The timestamp of any transaction Ti is
denoted as TS(Ti).
Wound-Wait Scheme
- In this method, if an older transaction requests a resource held by a
younger transaction, then the older transaction forces the younger
transaction to be aborted (wounded) and release that resource. The younger
transaction is restarted later with the same timestamp.
- But if the younger transaction is requesting a resource which is held
by the older one, then the DBMS allows the younger transaction to wait
until the older one releases that resource.
Deadlock Recovery
When a deadlock is detected by the system, the only solution is recovery
from the deadlock. The usual way of recovery is a rollback operation.
In the recovery process, three actions are taken:
1. Selection of victim
The first step is to determine which transactions should be rolled back to
recover the system. We should roll back the transaction whose rollback has
minimum cost. The cost depends upon various factors:
- The data items used by the transaction.
- Time duration for the transaction to complete its assigned task.
- The total number of transactions that need to be rolled back.
2. Rollback
After finalizing the transaction to be rolled back, we have to decide up to
which previous point it should be rolled back.
Rollback operation has two options :
- Total rollback : Abort the whole transaction and restart it.
- Partial rollback : Roll back the transaction only as far as necessary. This
approach requires the system to store the state of all running transactions.
The order of lock requests/grants and updates performed by the
transaction should be recorded. The deadlock detection mechanism
should decide which locks the selected transaction needs to release to
resolve the deadlock. The selected transaction should be rolled back up
to the point where it obtained the first of these locks. All the operations
after that point should be undone.
The transaction must be capable of resuming execution after the partial
rollback.
3. Starvation
In the deadlock handling mechanism, the selection of the transaction to
roll back (the victim) depends upon the cost factor. Sometimes the same
transaction may be selected as victim in multiple deadlock situations. In
such a case it becomes difficult for the victim transaction to complete its execution.
This is called starvation.
Hence it is necessary that a transaction should not be selected as victim
too many times. To achieve this, the number of rollbacks of every
transaction should be included in the cost factor.
Summary
Transaction processing is based upon a storage model in which main
memory holds a log buffer, a database buffer, and a system buffer. The system
buffer holds pages of system object code and local workspaces of transactions.
Secondary storage includes the complete object code for the system, swap space
for virtual memory, and the database itself. Stable storage that must be accessible
online is simulated using mirrored disks which provide redundant data storage.
Offline, or archival, stable storage may consist of multiple tape copies of data
stored in a physically secure location.
Transaction failure may lead to cascading rollback of transactions that have
read data written by a failed transaction. If cascading rollback results in the abort
of an already committed transaction, then the schedule is said to be non
recoverable. Recoverable schedules can be ensured, and cascading rollback
avoided, under two-phase locking by requiring that locks be held by a transaction
until it either commits or aborts.
Various locking protocols, including the multiple granularity locking
scheme, do not guard against deadlocks. One way to prevent deadlock is to use
preemption and transaction rollbacks. To control the preemption, we assign a
unique timestamp to each transaction. These timestamps are used to decide
whether a transaction should wait or roll back. If transaction is rolled back, it
retains its old timestamp when restarted. The wait-die and wound-wait schemes
are two preemptive schemes. Another method for dealing with deadlock is to
use a deadlock detection and recovery scheme. To do so, a wait-for graph is
constructed. A system is in a deadlock state if and only if the wait for graph
contains a cycle. When a detection algorithm determines that a deadlock exists,
the system must recover from the deadlock. This is accomplished by rolling back
one or more transactions in order to break the deadlock.
Exercises
1. Suppose that the system crashes during the time it is recovering from
a prior crash. When the system again recovers, what action must be
taken?
2. If deadlock is avoided, is starvation still possible? Explain.
3. Explain cascading rollback.
4. When the system recovers from a crash, in what order must
transactions be undone and redone. Why is this order important?
*****
UNIT - 6
DATABASE RECOVERY AND
SECURITY MANAGEMENT
Structure
6.0 Introduction
6.1 Unit Objectives
6.2 Database Recovery and security Management: Database Recovery,
6.3 Types of Failures, and Data access.
6.4 Recovery and atomicity
6.5 Recovery Techniques Algorithms: Log Based Recovery, Check points,
6.0 INTRODUCTION
Failure is a routine situation for any system. There are various causes of
failure.
1. Transaction failure : A transaction may fail mainly because of two
reasons.
a. Logical error : There are some internal conditions, like invalid
input, data not found, overflow or resource limit exceeded,
because of which the transaction fails and is unable to continue its
execution.
b. System error : Because of cyclic dependency of transactions on
each other, the system may enter in undesirable state like
deadlock which leads to break the execution of transaction.
2. System crash : Because of hardware malfunction or a bug in the
database software or operating system, the system may halt. The
loss of contents of volatile storage is possible in this crash.
3. Disk failure : Failure during data transfer or a head crash may cause
the loss of data on a disk.
Whatever the reason for the failure, it leads to data loss and inconsistency in
the database system; hence recovery is very important.
For DBMS recovery, there are two types of techniques which maintain the
atomicity of a transaction, as follows:
1. Log based Recovery
2. Shadow paging
Log-based Recovery
If any failure occurs in the system, the system can be recovered with the help
of logs. This type of recovery is called log-based recovery.
- The Log-based Recovery works in the following manner:
The log file is placed on stable storage media.
When a transaction enters the system and starts execution, it writes a log
record about it:
<Tn, Start>
When the transaction updates an item X, it writes a log record as follows:
<Tn, X, V1, V2>
It records that the value of X is changed by Tn from V1 to V2.
When the transaction finishes, it logs:
<Tn, Commit>
- There are two techniques for using log to achieve recovery and ensure
atomicity in case of failures.
- Deferred database modification
- This technique delays the database modifications until the transaction
is committed.
- All the modifications done by the transaction are written to the
database only when that transaction has committed successfully.
- Immediate database modification
- The immediate database modification technique allows storing each
modification done by a transaction into the database as soon as it is done.
- In this technique the database modification operation does not wait for
the commit operation to be performed by the transaction.
- The system allows the database to be modified while the transaction is
still being processed and has not yet committed. A database
modification performed by an active transaction is known as an
uncommitted modification.
Checkpoints
- In a DBMS, storing a large number of log records may fill up all the memory
space available in the system. The size of the log may increase
tremendously as time goes on, which makes it difficult to handle. With the
checkpoint mechanism we can remove all previous logs from the
system and store them on a storage disk permanently. Using a
checkpoint, a point can be declared before which the DBMS was in a
consistent state and all the transactions were committed.
- In the immediate database modification recovery technique, some
transactions are required to be redone and some are required to be undone
by searching the entire log. Searching the entire log is a time-consuming
process. To overcome this problem, checkpoints are used.
- Consider a system with concurrent transactions that crashes and recovers;
it behaves in the manner described in the recovery discussion later in this unit.
Shadow Paging
- The shadow paging does not require the use of a log in a single user
environment. In this scheme database is considered to be made up of
a number of fixed size disk pages or blocks for recovery purpose.
- A directory with n entries is created, in which the ith entry points to the ith
database page on disk. This directory is kept in main memory, and all
references (reads or writes) to database pages go through it. When
execution of a transaction starts, the current directory, whose entries
point to the most recent database pages on disk, is copied into a shadow
directory.
- The current directory is used by the transaction and the shadow
directory is saved on disk.
Soft Failure
Soft failure is the type of failure that causes the loss in volatile memory of
the computer and not in the persistent storage. Here, the information stored in
the non-persistent storage like main memory, buffers, caches or registers, is lost.
They are also known as system crash. The various types of soft failures are as
follows −
• Operating system failure.
• Main memory crash.
• Transaction failure or abortion.
• System generated error like integer overflow or divide-by-zero error.
• Failure of supporting software.
• Power failure.
Hard Failure
A hard failure is the type of failure that causes loss of data in the persistent
or non-volatile storage like disk. Disk failure may cause corruption of data in
some disk blocks or failure of the total disk. The causes of a hard failure are −
• Power failure.
• Faults in media.
• Read-write malfunction.
• Corruption of information on the disk.
• Read/write head crash of disk.
Recovery from disk failures can be short, if there is a new, formatted, and
ready-to-use disk on reserve. Otherwise, duration includes the time it takes to
get a purchase order, buy the disk, and prepare it.
Network Failure
Network failures are prevalent in distributed or network databases. These
comprise the errors induced in the database system due to the distributed
nature of the data and the transfer of data over the network. The causes of network
failure are as follows −
• Communication link failure.
• Network congestion.
• Information corruption during transfer.
• Site failures.
• Network partitioning.
Commit Protocols
Any database system should guarantee that the desirable properties of a
transaction are maintained even after failures. If a failure occurs during the
execution of a transaction, it may happen that all the changes brought about by
the transaction are not committed. This makes the database inconsistent. Commit
protocols prevent this scenario using either transaction undo (rollback) or
transaction redo (roll forward).
Commit Point
The point of time at which the decision is made whether to commit or abort
a transaction is known as the commit point. Following are the properties of a
commit point.
• It is a point of time when the database is consistent.
• At this point, the modifications brought about by the transaction can be seen
by the other transactions. All transactions can have a consistent view of
the database.
• At this point, all the operations of transaction have been successfully
executed and their effects have been recorded in transaction log.
• At this point, a transaction can be safely undone, if required.
• At this point, a transaction releases all the locks held by it.
Transaction Undo
The process of undoing all the changes made to a database by a transaction
is called transaction undo or transaction rollback. This is mostly applied in case
of soft failure.
Transaction Redo
The process of reapplying the changes made to a database by a transaction
is called transaction redo or transaction roll forward. This is mostly applied for
recovery from a hard failure.
Transaction Log
A transaction log is a sequential file that keeps track of transaction
operations on database items. As the log is sequential in nature, it is processed
sequentially either from the beginning or from the end.
Purposes of a transaction log −
• To support commit protocols to commit or roll back transactions.
• To aid database recovery after failure.
A transaction log is usually kept on the disk, so that it is not affected by soft
failures. Additionally, the log is periodically backed up to an archival storage like
magnetic tape to protect it from disk failures as well.
6.4 RECOVERY FROM TRANSACTION FAILURE
Cascading Rollback
In order to recover correctly from the failure of a transaction T, it may be
necessary to roll back several transactions. Such situations occur if transactions
have read data written by T.
To illustrate this, consider the partial schedule of Figure 6.1. Transaction T1
writes the value of A that is read by transaction T2. Transaction T2 writes the
value of A that is read by transaction T3. Suppose that, at this point, T1 fails.
T1 must be rolled back. Since T2 is dependent on T1, T2 must be rolled back.
Since T3 is dependent on T2, T3 must be rolled back.
Figure 6.1 Partial schedule
Figure 6.2 Partial schedule under two-phase locking
Recoverable Schedules
Consider the schedule of Figure 6.3 in which T5 is a transaction that
performs only one instruction, read(A). Suppose that the system allows T5 to
commit immediately after executing the read(A) instruction. Thus T5 commits
before T4. Now suppose that T4 fails before it commits. Since T5 has read the
value of data item A written by T4, it is necessary to abort T5 to ensure transaction
atomicity. However, T5 has already committed and cannot be aborted.
Figure 6.3 Sample schedule
Log Scanning
We use checkpoints to reduce the number of log records that must be
scanned when the system recovers from a crash. It is necessary to consider only
the following transactions during recovery:
1. Those transactions that started after the last checkpoint.
2. The one transaction, if any, that was active at the time of the last
checkpoint.
The situation is more complex when transactions may execute concurrently
since several transactions may have been active at the time of the last checkpoint.
In a concurrent transaction processing system, we require that the
checkpoint log record be of the form <checkpoint L>, where L is a list of
transactions active at the time of the checkpoint. When the system recovers
from a crash, it constructs two lists: the undo-list consists of transactions to be
undone and the redo-list consists of transactions to be redone.
These two lists are constructed on recovery as follows. Initially they are
both empty. We scan the log backward, examining each record, until the first
<checkpoint> record is found:
- For each record found of the form <Ti commits> we add Ti to redo-list.
- For each record found of the form <Ti starts>, if Ti is not in redo-list,
then we add Ti to undo-list.
When all the appropriate log records have been examined, we check the list
L. For each transaction Ti in L, if Ti is not in redo-list then we add Ti to the
undo-list.
Once the redo-list and undo-list have been constructed, the recovery
proceeds as follows :
1. Re-scan the log from the most recent record backward and perform
undo(Ti) for each Ti on the undo-list.
2. Continue to scan the log backward until the <Ti starts> records for all
Ti on the redo-list have been located.
3. Scan the log forward and perform redo(Ti) for each Ti on the redo-list.
It is important in step 1 to process the log backward to ensure that the
resulting state of the database is correct. To illustrate, consider the pair of log
records
<Ti, A, 10, 20>
<Tj, A, 20, 30>
Which represent a modification of data item A by Ti followed by a
modification of A by Tj. If Ti and Tj are both on the undo-list, then A should be
restored to the value 10. This can be achieved if the log is processed backward
since A is set first to 20 and then to 10. If the log were processed in the forward
direction the result would be incorrect.
After all transactions on the undo-list have been undone, those transactions
on the redo-list are redone. It is important in this case to process the log forward.
When the recovery process has completed, transaction processing resumes.
Checkpointing
Checkpoint is a point of time at which a record is written onto the database
from the buffers. As a consequence, in case of a system crash, the recovery
manager does not have to redo the transactions that have been committed before
checkpoint. Periodical checkpointing shortens the recovery process.
The two types of checkpointing techniques are −
• Consistent checkpointing
• Fuzzy checkpointing
Consistent Checkpointing
Consistent checkpointing creates a consistent image of the database at
checkpoint. During recovery, only those transactions which are on the right side
of the last checkpoint are undone or redone. The transactions to the left side of
the last consistent checkpoint are already committed and needn’t be processed
again. The actions taken for checkpointing are −
• The active transactions are suspended temporarily.
• All changes in main-memory buffers are written onto the disk.
• A “checkpoint” record is written in the transaction log.
• The transaction log is written to the disk.
• The suspended transactions are resumed.
If in step 4 the transaction log is archived as well, then this checkpointing
aids in recovery from disk failures and power failures; otherwise it aids recovery
from power failures only.
Fuzzy Checkpointing
In fuzzy checkpointing, at the time of checkpoint, all the active transactions
are written in the log. In case of power failure, the recovery manager processes
only those transactions that were active during checkpoint and later. The
transactions that have been committed before checkpoint are written to the disk
and hence need not be redone.
Example of Checkpointing
Let us consider that in a system the time of checkpointing is tcheck and the
time of system crash is tfail. Let there be four transactions Ta, Tb, Tc and Td such
that:
• Ta commits before checkpoint.
• Tb starts before checkpoint and commits before system crash.
• Tc starts after checkpoint and commits before system crash.
• Td starts after checkpoint and was active at the time of system crash.
During recovery after the crash, Ta need not be processed again (it committed
before the checkpoint), Tb and Tc are redone (they committed before the crash),
and Td is undone (it was still active at the time of the crash).
Summary
A computer system, like any other mechanical or electrical device, is
subject to failure. There are a variety of causes of such failure, including disk
crash, power failure, and software errors. In each of these cases, information
concerning the database system is lost. An integral part of a database system is
a recovery scheme which is responsible for the detection of failure and the
restoration of the database to a state that existed prior to the occurrence of the
failure.
The various types of failures that may occur in a system must be dealt with
in a different manner. The simplest type of failure to deal with is one which does
not result in the loss of information in the system. The ones that are more difficult
to deal with are those that do result in loss of information.
In case of failure, the state of the database system may no longer be
consistent; that is, it may not reflect a state of the world that the database is
supposed to capture. To preserve consistency, we require that each transaction
be atomic; that is, either all the instructions associated with it are executed to
completion, or none are performed. It is the responsibility of the recovery scheme
to ensure the atomicity property.
*****