DBMS
DATABASE MANAGEMENT SYSTEM : COURSE NOTES
References :
• Fundamentals of Database Systems by Navathe
• Database System Concepts by Henry F. Korth, 7th edition
Database : A collection of related data representing facts that have meaning in the real world.
Database management system : a software package designed to store and manage databases.
It is a computerized system whose overall purpose is to maintain the information and to make that
information available on demand.
Q : Do you use a DBMS ? – Not necessarily; on a computer, the OS provides you a file system. DBMSs are required by
large organizations. The file system's main disadvantage is redundancy. We will study more in an
upcoming chapter.
Data model : a set of rules and conventions to represent/store/manage the data. For example, the relational
model, the ER model.
The relational model is a table/relation-based model proposed by E. F. Codd in a research paper. It covers how to store the
data, all terminology and definitions, and two query languages for the relational model.
Students(
roll: string,
name: string,
age: integer,
gpa: real
)
Atomic means that if a row contains the string “ravi kumar”, we cannot get “kumar” out of
this string, because the value is defined to be atomic, i.e. indivisible.
Degree (or arity) of a relation : the number of attributes n of its relation schema.
Q : Why is this model called “relational” ? – It is the same as a relation in discrete math. Each
attribute's domain can be considered as a set, and a relation is a subset of the Cartesian product of these sets:
𝑅 ⊆ 𝐴1 × 𝐴2 × … × 𝐴𝑛
𝐴1 × 𝐴2 × … × 𝐴𝑛 = {(𝑎1 , 𝑎2 , … , 𝑎𝑛 )|𝑎𝑗 ∈ 𝐴𝑗 }
Attribute values are atomic (as explained earlier), have a known domain, and can sometimes be NULL;
the meaning of NULL depends on context.
A set of attributes used to uniquely identify tuples is called a key. For example, the contact no. of a
student in a school register.
Superkey : A set of attributes (possibly a singleton) which can identify any record uniquely. In the
relational model, at least one superkey is always present: since two records can never be
entirely the same, at worst all attributes combined work as a superkey.
Candidate key (CK) : a minimal superkey, i.e. a superkey from which no attribute can be removed without losing the superkey property.
Primary key (PK) : One of the candidate keys is chosen (selected) to be the primary key (which must
not be NULL). This choice is made by the database designer/owner. Candidate keys other than the primary key
are called alternate keys.
Every relation has exactly one primary key; therefore, no tuple can be fully NULL.
Q : Given R(a1, a2, a3, …, an), what is the max no. of superkeys ? – To get the max no. of superkeys, each ai must
be a candidate key; then every non-empty subset of attributes is a superkey, so there are at most 2^n − 1 superkeys.
Q : Given R(a1, a2, …, an), what is the max no. of candidate keys ? – Making every single attribute a CK does not
give the maximum. For example, in R(a, b, c, d) the max number of CKs is the 6 keys {ab, ac, ad, bc, bd, cd},
not the 4 keys {a, b, c, d}. We know that no candidate key can be a subset of another candidate key, so we want the
largest family of attribute sets in which no set contains another; taking all subsets of size ⌊n/2⌋ gives the maximum,
C(n, ⌊n/2⌋).
Q : Given R(a1, a2, …, a8), what is the max no. of candidate keys if it is known that a1a2a3 is a CK (the other CKs
are unknown) ? – We know that the max no. of candidate keys with 8 attributes would be C(8, 4) = 70 (from the
previous question), but no CK can contain the CK a1a2a3. There are 5 four-attribute sets containing a1a2a3 (one for
each of the remaining 5 attributes), and these cannot be CKs; a1a2a3 itself, which was never counted among the
size-4 sets, is added back. So the answer is C(8, 4) − 5 + 1 = 66.
//Lecture 4
Note that the relational model is a theoretical/mathematical model, and SQL is a practical language loosely
based on the relational model.
The relational model considers a table as a set of rows, so no two tuples can be the same. SQL, on the other
hand, considers a table as a multiset of rows. We will explore more of this in the SQL chapter.
Let’s say we want to create database for student studying in specific school. We should have some
restriction/ constraints for example,
a) Sid (student id) must be an integer : this type of restriction is often known as a domain integrity
constraint.
b) Sid (the primary key) must never be NULL : because if sid is NULL then we cannot
distinguish two students. This is called the entity integrity constraint.
c) Sid must be different for any two students : this type of restriction is known as the key integrity
(uniqueness) constraint.
d) Consider one example: Kanpur (district) must go with the state UP, so we cannot allow Rajasthan
with Kanpur. State is dependent on district. This is called a functional dependency IC.
e) There should be some restriction on false information. For example, consider the following
database. Each student must either take a course provided by the institute or take no course at all, in which
case we consider that student as registered but not enrolled (this is okay, we don't have any
problem); but we have a problem when a student's selected course is not present in the database. For
example,
Therefore, a foreign key (FK) need not be a CK. An FK may or may not be NULL; by default an FK is allowed to be NULL.
An FK need not be a single attribute; in other words, an FK can consist of more than one attribute.
A foreign key can also point to a CK within the same table. For example,
A CHECK constraint is used in SQL for limiting the range of values allowed in a column.
The referential IC then says that R.X can either be NULL or a value already present in S.Y.
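A minimal SQL sketch of constraints (a) to (e) (the table and column names here are illustrative assumptions, not taken from the lecture figures):

CREATE TABLE Course (
    cid   INTEGER PRIMARY KEY,                       -- entity + key integrity for courses
    cname VARCHAR(50)
);

CREATE TABLE Student (
    sid   INTEGER PRIMARY KEY,                       -- (a) domain IC; (b)+(c) entity/key IC: unique, never NULL
    sname VARCHAR(50),
    age   INTEGER CHECK (age BETWEEN 5 AND 100),     -- CHECK constraint limiting the value range
    cid   INTEGER REFERENCES Course(cid)             -- (e) referential IC; NULL allowed by default
);

The FD-style constraint in (d), district → state, has no direct column-level syntax; it is enforced by normalization (keeping district and state in their own table) or by triggers/assertions.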
Q : When a referential integrity violation happens, what do we do ? – In the referencing table:
deletion of a row : no violation; insertion of a row : violation possible; modification of a row : violation possible.
Thus, if a violation happens when we insert/modify a row, then that action/operation is rejected/not
allowed.
But in the referenced table, insertion : no violation possible. On delete/modify a violation is possible, and
if a violation happens we can 1) cascade the delete (delete all matching rows from the referencing table), 2)
cascade the modification (modify all matching rows of the referencing table), 3) set the FK to NULL (if we still want the
referencing rows to exist), or 4) reject/ignore the operation.
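A hedged DDL sketch of these four options (standard SQL referential-action syntax; the Enroll table is an illustrative assumption):

CREATE TABLE Enroll (
    sid INTEGER,
    cid INTEGER,
    FOREIGN KEY (cid) REFERENCES Course(cid)
        ON DELETE CASCADE          -- option 1 : delete the matching rows of the referencing table
        ON UPDATE CASCADE          -- option 2 : modify the matching rows of the referencing table
    -- writing SET NULL instead gives option 3, and the default NO ACTION rejects the operation (option 4)
);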
//Lecture 5
Definition : Let A, B be two (sets of) attributes in a relation R. Then A → B holds iff for any two tuples t1, t2,
t1.A = t2.A implies t1.B = t2.B.
Example,
Functional dependencies are defined on the schema, not on individual instances.
We can combine FDs on the LHS and RHS. We can split on the RHS but we cannot split on the LHS.
X is closed iff X+ = X
NOTE : If every set of attributes is closed then there are no non-trivial FDs, and hence the relation is always in BCNF.
We wish to test whether ad → b; since b is a member of the closure of ad, we conclude that ad → b
follows from the given set of FDs.
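A small worked instance (the FD set here is hypothetical, not the one from the lecture figure): let R(a, b, c, d) with F = {a → b, b → c}. To compute (ad)+, start with {a, d}; a → b adds b; b → c adds c; so (ad)+ = {a, b, c, d} = R, hence ad is a superkey. Since b ∈ (ad)+, the FD ad → b follows from F.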
Any attribute y which does not appear in the closure of any set of attributes not containing y must be present in every
key (candidate key = key).
Cover : Let F, G be two FD sets. We say F covers G iff G+ ⊆ F+. F is a minimal cover of G if F is the
smallest set of functional dependencies that covers G.
//lecture 6
From now on we will consider only the nontrivial FDs with RHS being a single attribute.
//Lecture 7
A relation (relation schema/scheme) can be in the following normal forms like 1NF, 2NF, 3NF and
BCNF,…
Domain is atomic if its elements are considered to be indivisible units. A relational schema R is in first
normal form if the domains of all attributes of R are atomic. Which means every relation is in 1NF.
A relation is in 2NF if every non-prime attribute is fully dependent on every candidate key OR a relation
is in 2NF if no non-prime attribute is partially dependent on some candidate key.
NOTE : the definition refers to non-prime attributes; so if a relation has no non-prime attribute at all,
it is trivially in 2NF.
The most common mistake when checking whether some part of a candidate key determines a non-prime
attribute is to check only the FDs in the given FD set; we actually have to check the FDs in the FD closure.
Steps : 1) Find CKs and find non-prime attributes. 2) For every proper subset of CK, find closure and
see if it contains non-prime attribute.
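A worked check (hypothetical relation, not from the notes) : R(A, B, C, D) with F = {AB → C, B → D}. Step 1 : the only CK is AB, so C and D are non-prime. Step 2 : for the proper subset B of the CK, B+ = {B, D} contains the non-prime attribute D, so D is partially dependent on the CK and R is not in 2NF.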
1.2.2) 3NF :
Say α is not a superkey (so it cannot determine all attributes), and say A is a non-prime
attribute. Then if α → A, we have CK → α → A, which is the condition of a transitive FD.
Which means another form of the violation is : α (which is not a superkey) → A (a non-prime attribute).
Q : In 3NF, can we have a transitive dependency ? – Surprisingly yes, because the definition says no non-
prime attribute is transitively dependent upon a candidate key, but prime attributes can be transitively
dependent upon each other. For example, AB ⟷ CD and A → C. Here there is no non-prime attribute,
and therefore AB → C is a transitive dependency.
Now, voilà, this is also a violation condition of 3NF, so we can say that a 2NF violation → a 3NF violation.
The contrapositive is 3NF → 2NF.
Definition : A relation R is in BCNF iff for all non-trivial FDs 𝑋 → 𝑌, X should be superkey.
3NF is saying this should not happen : NOT SK → Non-prime. Which means for every non-trivial FD X
→ A this should happen : X is either SK or A is prime.
Q : What makes a relation to be in 3NF but not in BCNF ? – When for some non-trivial FD, X → A where
X is not a superkey and A is prime.
If a relation is in 3NF and not in BCNF, then there must be a non-trivial FD A→B, where B is a prime
attribute and A is not a super key (assume A is the minimal attribute set which determines B). This
means B is not a super key as otherwise A must also be a super key. But since, B is a prime attribute,
for some non-empty set of attributes Y, BY is a candidate key. Since A→B, and A is assumed to be
minimal, this means AY is also a candidate key. Thus, we got two candidate keys AY and BY and Y is
common. This proves the statement below.
“ A necessary (but not sufficient) condition for a relation to be in 3NF but not in BCNF is that it
should have overlapping candidate keys – two candidate keys of two or more attributes and at
least one common attribute ”
Note: Q is a necessary condition for P == P → Q == if (3NF but not in BCNF) then (overlapping candidate keys).
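A standard illustration (not taken from the notes) : R(A, B, C) with F = {AB → C, C → B}. The candidate keys are AB and AC, which overlap on A. The FD C → B has a non-superkey LHS, but B is prime, so R is in 3NF; yet C → B violates BCNF. This matches the necessary condition above: two overlapping candidate keys of two attributes each.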
//Lecture 9
Suppose you have two relations, Student and Pay-grade, and you want to answer the following
questions or queries.
Sometimes, we must join two relations to get some information and build one relation/table, based
on the following :
R(A, B) : (X, Y), (X, Z), (Y, Z), (Z, V)
S(B, C) : (Z, U), (V, W), (Z, V)
R ⋈ S (A, B, C) : (X, Z, U), (X, Z, V), (Y, Z, U), (Y, Z, V), (Z, V, W)
Q : What if there is no common attribute with the same name in the two relations… how does natural join work ? – Combine
every row of one relation with every row of the other relation. This is the same as the Cartesian product.
1) Update Anomaly : after modification of some row we have to change other rows also
because of an FD; or some other table may hold attributes of that row, and it also becomes
invalid.
2) Insertion Anomaly : suppose we want to add a new course to a department and we don't
have any student entry yet; we cannot put NULL in the primary key column.
3) Deletion Anomaly : suppose there is only one student in some course and he passes; now we need
to delete that entry, but the course must remain, as we don't want to delete the course. After
deleting the student we can't put NULL in the primary key column either.
4) Redundant storage : some information is stored repeatedly.
1.3.1) Decomposition :
Suppose that relation R contains attributes A1,…, An. A decomposition of R consists of replacing R by
two or more relations such that :
Consider,
Example,
But the decomposition R1(a, b) and R2(a, c) is lossless. Now, we can conclude that a binary decomposition
of R (this condition is only for binary decompositions) into (R1, R2) is
Lossless : iff R1 ∩ R2 → R1 or R1 ∩ R2 → R2
Lossy : iff the common attributes of R1, R2 are not a superkey of either R1 or R2, meaning
R1 ∩ R2 ↛ R1 and R1 ∩ R2 ↛ R2
//Lecture 10
Therefore, to check whether a decomposition is lossless or lossy, first find the common attributes, then check whether
they form a key of one of the relations: take the closure of the common attributes and see if the closure contains
all the attributes of one of the relations.
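For instance (hypothetical relation) : R(A, B, C) with F = {A → B}, decomposed into R1(A, B) and R2(A, C). The common attribute is A, and A+ = {A, B} contains all attributes of R1, i.e. A → R1, so the decomposition is lossless. If F were {B → C} instead, then A+ = {A} covers neither R1 nor R2, and the decomposition would be lossy.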
2) Dependencies preservation :
Suppose some dependencies F are associated with relation R and you decompose R into R1 and R2. Then
some dependencies hold within R1 (say F1) and some within R2 (say F2). If you then
combine both relations back into one, can it happen that we get some new dependencies beyond F, i.e.
F+ ⊂ (F1 ∪ F2)+ ? – The answer is no, that can never happen.
Example,
First, It is lossless.
So the FD b → d cannot be enforced without joining the tables. That is why we say the decomposition is not dependency preserving.
Here F1 is the set of FDs preserved by R1, and similarly for F2. This also explains preservation of FDs
after decomposition. After finding F1, F2, F3, …, if (F1 ∪ F2 ∪ F3)+ = F+ then the decomposition is DP.
Q : How to check lossy/lossless for a non-binary decomposition ? – You must have seen one method last year:
the successive (pairwise) combining method. It is not fully reliable: if this method answers lossless, then the
decomposition is definitely lossless; but if it answers lossy, the decomposition may or may not be lossy.
Consider 𝑅 (𝐴, 𝐵, 𝐶, 𝐷, 𝐸 ) and non-binary decomposition 𝑅1 (𝐴, 𝐵), 𝑅3 (𝐵, 𝐶, 𝐷), 𝑅2 (𝐴, 𝐶 ), 𝑅4 (𝐴, 𝐷, 𝐸)
and some FDs. We make a table as follows, initially putting a * under each attribute that appears in the sub-relation:

     A  B  C  D  E
R1   *  *
R2   *     *
R3      *  *  *
R4   *        *  *

Now, suppose we have the FD A → C; then for all rows that have a * under A, we also fill a * under C (rows agreeing
on A must agree on C). Suppose we also have BC → E; we apply the same procedure. Similarly, for the rest of the FDs
we repeat this until nothing changes, and at the end we check: if there exists a row with all * entries then the
decomposition is lossless, else lossy.
NOTE : Even before applying all FDs, if at some point there exists a row with all * entries, you can already conclude that the decomposition is lossless.
Q : Consider the relation R(A, B, C, D) with the FD : B → AD and the proposed decomposition {A, B}, {B,
C}, and {C, D}. –
Initially R1 gets * under A, B; R2 under B, C; R3 under C, D. Applying B → AD: R1 and R2 agree on B, so they must
agree on A and D; R1's * under A is copied to R2, and under D both get the same non-* symbol #. The final table is:

     A  B  C  D
R1   *  *     #
R2   *  *  *  #
R3         *  *

So, there is no row containing all *'s. Thus, this is a lossy decomposition.
//Lecture 12
Thus, the superkeys of a particular instance r(R) are only potential candidate keys for the relation R;
a key observed in one instance need not be a key of the schema.
Minimal cover of an FD set :
• Split the right-hand sides, producing FDs with only one attribute on the right, and also remove
trivial FDs.
• For every FD, remove redundant (extraneous) attributes from the LHS.
• Eliminate all redundant FDs.
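A short worked run (hypothetical FD set) : G = {A → BC, B → C, A → B, AB → C}. Splitting the RHS gives {A → B, A → C, B → C, AB → C}. In AB → C the attribute A is extraneous because B → C already holds, leaving B → C (a duplicate). Finally, A → C is redundant since A → B and B → C imply it. Minimal cover : {A → B, B → C}.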
//Lecture 13
We know that all the data shouldn’t be visible to all the types of users.
The DBMS can enforce access controls that govern which data is visible to different classes of users.
There are two types of data independence, namely physical data independence and logical data
independence.
1.4.3) Redundancies :
Trivial FDs never create redundancies in the database because those are trivial, and they always hold
on all relations.
Non-trivial FDs in which LHS is a superkey never create redundancies in the database
This means you cannot have redundancy caused by FDs in BCNF. But is BCNF free from all redundancies ?
– No, there can still be redundancy, because we may put more than one many-to-many relationship in a
single table. For example, in Students(sid, cid, movie), sid to cid is many-to-many (many students may have
taken many courses and each course may be taken by many students), and similarly sid to movie.
Such situation gives birth to a new concept of MVDs (multi valued dependencies)
Redundancy
4NF : A relation is in Fourth Normal Form (4NF) if and only if for every nontrivial multi-valued
dependency, A↠B, A is a super key of the relation.
//Lecture 16
NOTE : A database {𝑹𝟏 , … , 𝑹𝒏 } is in 3NF if each relation schema 𝑹𝒊 is in 3NF and similarly for BCNF.
Q : If R is not in BCNF, then how to make it into BCNF ? – decompose R into sub-relations such that
Each sub-relation is in BCNF.
To do a BCNF decomposition : you must give a lossless decomposition; if possible, also make it DP.
To check whether a given decomposition is a BCNF decomposition : just check whether each individual sub-relation is in BCNF.
• If R is not in BCNF : take a violation of BCNF (i.e. take X → α where X is not a SK).
Compute R1 = X+ and R2 = (R − X+) ∪ X.
You may get different BCNF decompositions depending on the order of violations that we take.
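A small worked run (hypothetical relation) : R(A, B, C) with F = {A → B, B → C}. The only CK is A, so B → C is a BCNF violation (B is not a SK). Take R1 = B+ = (B, C) and R2 = (R − B+) ∪ B = (A, B). Both sub-relations are now in BCNF, and the decomposition is lossless because the common attribute B is a key of R1. Here it also happens to be dependency preserving, though in general that is not guaranteed.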
NOTE : for ALL relations, lossless BCNF decomposition is ALWAYS possible. But for some relations,
lossless dependency preserving BCNF decomposition does NOT exist.
NOTE : For ALL relations, lossless 3NF decomposition is ALWAYS possible. For ALL relations, lossless
dependency preserving 3NF decomposition is always possible.
2. Queries
//mod2-Lecture 1
Query languages are specialized languages for asking questions, or queries, that involve the data in a
database.
Along with the relational model, Codd gave two query languages, namely relational algebra and relational
calculus. These are theoretical query languages.
In a relational algebra query, we need to specify everything, from what we want to how to get it (a step-by-step
procedure, in what order); we can say it is a procedural query language.
Whereas in relational calculus query, we need to specify just what you want. We don’t care how it
will get it. It is more abstract. It is also known as declarative query language. SQL is one such example.
From programmer’s point of view : relational calculus is easier. But from QL compiler point of view,
relational algebra is easier and more useful.
An algebra whose operands are relations, or variables that represent relations, and whose operators are
designed to do the most common things that we need to do with relations in a database.
1) Selection Operator :
𝑆 = 𝜎𝑝𝑜𝑝>0.5 𝑈𝑠𝑒𝑟
We can observe that it is a unary operator; the schema does not change, but the cardinality may change. It is
unary because we check every tuple independently and individually. For example,
Student :
2) Project operation :
If the projection list is a superkey of R that is, it includes some key of R – the resulting relation has the
same number of tuples as R.
NOTE :
When you apply any of these operations, the output is a new table with no name.
The result of the PROJECT operation has only the attributes specified in <attribute list>, in the same
order as they appear in the list. For example, for π gpa,name (S), the output relation contains the attributes
in the order gpa, name.
• Selection operation is commutative. 𝜎𝑝1 (𝜎𝑝2 (𝑅)) = 𝜎𝑝2 (𝜎𝑝1 (𝑅)) this is true because we can
write this as 𝜎𝑝1∧𝑝2 = 𝜎𝑝2∧𝑝1 .
• But the project operation is not commutative, because there is a chance that after applying one projection, the
attributes needed by the other projection are no longer present.
//Lecture 2
Union, intersection, set difference : to define these operations, the relations must have compatible
schemas (union-compatible schemas).
4) Assignment operator :
5) Renaming :
Notation :
//Lecture 3
Sometimes we need to combine two or more tables (same or different) to retrieve some information.
For example, find father of B1 : to do this we do cartesian product between parents table and kids
table.
For this reason, we can create one separate operation. In general, for a condition c on the tables T1 and T2, the
conditional join is defined as T1 ⋈c T2 = σc(T1 × T2).
1) Join Operation :
Two types of join operation : inner join (by default) and outer join operation.
• Equi-join : less important. This is the same as a condition join, except that θ contains only
= conditions, not <, ≤, >, ≥.
• Natural join : 𝑅 ⋈ 𝑆. Equality of all common attributes.
//Lecture 4
2) Division operator :
In math, for integer A and B, A/B is the largest integer Q such that Q*B <= A.
Similarly, in Queries, for relation R and S, R/S = Q, here Q is also relation. This means Q is the largest
relation such that 𝑄 × 𝑆 ⊆ 𝑅.
Now, say
To find R/S, we check which values of “a” are related to all the values of “b” in r2. By doing so we are
guaranteeing that Q(a) × r2 ⊆ r1.
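A tiny worked instance (hypothetical data) : r1(a, b) = {(x, 1), (x, 2), (y, 1)} and r2(b) = {1, 2}. Only x is related to every b-value of r2, so r1 / r2 = {x}; indeed {x} × r2 = {(x, 1), (x, 2)} ⊆ r1, and no larger relation satisfies this.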
//Lecture 5
R/S ? = kids who play all the S Games = All kids – kids who do not play some S games.
Kids who don’t play some games = 𝜋𝑘𝑖𝑑𝑠 (all possible pair of students – all existing pair)
Another implementation,
Q : R(kids, Game); S(Game); for every S.Game gi, we find the set of kids Ki who play that S.Game gi…
What can we say about intersection of all Ki ? – This is same as division operation i.e. R(Kids,
Game)/S(Game)
K1 = all the kids who play g1, K2 = all the kids who play g2, …, Kn = all the kids who play gn.
K1 ∩ K2 ∩ … ∩ Kn = all the kids who play g1, g2, g3, …, gn = kids who play all S games.
Max. value of |R/S| = ⌊|R| / |S|⌋ and min. value of |R/S| is 0.
Suppose we join 𝑅 ⋈ 𝑆. A tuple of R which doesn’t join with any tuples of S is said to be dangling of
R. and Similarly, A tuples of S which doesn’t join with any tuples of R is said to be dangling of S.
If R and S are two relations then 𝑅 ⋈ 𝑆 may contains some dangling tuples in R, S. Outer join includes
dangling tuples in the final answer with help of NULL.
Left outer join (⟕) : Dangling tuples of R are included in the final result of 𝑅 ⋈ 𝑆 with the help of
NULL.
Right outer join (⟖) : Dangling tuples of S are included in the final result of 𝑅 ⋈ 𝑆 with the help of
NULL.
FULL outer join (⟗) : Dangling tuples of both R and S are included in the final result of R ⋈ S with the help of
NULL. By default, outer join ≡ full outer join.
For example,
//Lecture 6
𝑹 ⟕ 𝑺 = 𝑹 ⋈ 𝑺 + 𝑹𝑫𝒂𝒏𝒈𝒍𝒊𝒏𝒈 ; 𝑹 ⟖ 𝑺 = 𝑹 ⋈ 𝑺 + 𝑺𝒅𝒂𝒏𝒈𝒍𝒊𝒏𝒈
𝐑 ⟗ 𝑺 = 𝑹 ⋈ 𝑺 + 𝑹𝑫𝒂𝒏𝒈𝒍𝒊𝒏𝒈 + 𝑺𝒅𝒂𝒏𝒈𝒍𝒊𝒏𝒈
Whenever you find “all” in a query (in some form), we use the division operator.
But this gives us many useless tuples, so to reduce that we can first select just 103 from S…
Q : find the names of sailors who have reserved at least one boat. –
Q : find the names of sailors who have reserved a red and a green boat. –
Taking (snames of red-boat people) ∩ (snames of green-boat people) will not work on some instances;
consider the following counterexample…
But we know sid is unique, so we can first take the intersection on sid and then find the names; that
works correctly.
Q : Find the names of sailors who have reserved at least two boats. –
//Lecture 8
If we do not write R1.bid ≠ R2.bid, then the query returns the names of sailors who have reserved at least one boat
(not at least two). This happens because a tuple of S can match the same boat in both R1 and R2, so we
need the not-equal condition between the boat ids.
When all the attributes of R are different from those of S, we have max tuples = |R × S|.
And when all the attributes of R and S are the same (same domains) but no tuples match, then the result has
0 tuples.
//Lecture 9
In relational algebra, we have to provide what we want and how it will be computed… hence,
procedural query language…
Here order of operations given by user. And its output is dependent upon order of operation.
In TRC (tuple relational calculus), we just have to specify what we want… Hence, declarative language.
R (A, B)
Cross product :
//Lecture 11
Way 2 :
⋆ Unsafe query (only for TRC) : If a TRC query provides any data which is not in the database then that
TRC query is unsafe. For example, most popular
{𝑡|𝑡 ∉ 𝑆}
SQL has more power than relational algebra, and the main source of this additional power is
its aggregation and grouping constructs, together with arithmetic operations on numerical
attributes.
//Lecture 13
//Lecture 14
⋆ SQL sublanguages :
1) The data definition language (DDL) : This subset of SQL supports the creation, deletion, and
modification of definitions for tables and views.
2) The data manipulation language (DML) : This subset of SQL allows users to pose queries and
to insert, delete, and modify rows.
SELECT A, D
FROM R, S
WHERE A > 10 AND C <= 5
NOTE : In SQL also, if AB is the primary key then neither A nor B can be NULL & there are no duplicate
tuples in AB (hence none in the relation), so if a primary key is specified it is useless to write [DISTINCT].
Q : find the names of sailors who have reserved boat number 103 –
SELECT sname FROM Sailors S, Reserves AS R WHERE (bid = 103) AND (S.sid = R.sid)
But renaming the tables is useless here; it is useful when we are using the same table twice, for
example,
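For instance, a hedged sketch of using the same table twice (this matches the “at least two boats” query discussed later; the usual Sailors/Reserves schema is assumed):

SELECT S.sname
FROM Sailors S, Reserves R1, Reserves R2
WHERE S.sid = R1.sid AND S.sid = R2.sid AND R1.bid <> R2.bid

Because Reserves is used twice, the two copies must be renamed R1 and R2 so they can be referred to separately.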
Note that these three operations are same as relational algebra because they are union compatible
and will not return multiset (i.e. no duplicates).
REALITY : SQL first eliminates duplicates from input tables then it will produce result and then again
eliminates duplicates.
AND is trickier than OR. If we want the names of kids who wear a red and a green T-shirt, we
cannot just write (colors = 'red' AND colors = 'green'); this will always give an empty result, because the colors
value cannot be red and green at the same time in one row. We have to apply the INTERSECT
operator.
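A sketch of the INTERSECT version (the Kids table and its colors column are assumptions made only for illustration):

SELECT K.name FROM Kids K WHERE K.colors = 'red'
INTERSECT
SELECT K.name FROM Kids K WHERE K.colors = 'green'

Each branch finds one color separately, and INTERSECT keeps only the names that appear in both results.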
//Lecture 17
Maximum : max
COUNT (DISTINCT A)
COUNT(*) : counts the number of rows in the table. COUNT(DISTINCT *) : makes no sense, it's an error.
//Lecture 19
Aggregate functions always remove NULLs and then calculate. But note that COUNT(*) is different,
as it counts rows, so it does not ignore NULL entries.
But,
In SQL, we have a 3-valued truth system, namely True, False and Unknown (Unknown arises when a comparison involves NULL).
Which means WHERE clause passes a tuple only when “TRUE” and fails when “FALSE” or “UNKNOWN”.
//Lecture 20
It is easy to find the avg marks of all students, but how about the average marks of all students of Gujrat ?
Now, we can find that by SELECT avg(marks) FROM students WHERE state = 'Gujrat'
SELECT avg(marks) FROM students GROUP BY state – invalid query (as per SQL – 92)
Now, consider same schema, Output the average marks of students, state wise, but only for those
states where average marks > 5.
SELECT state, avg(marks) FROM students GROUP BY state HAVING avg(marks) > 5
Something interesting : If the GROUP BY clause is omitted, the entire table is regarded as a single
group. That’s why, aggregation without GROUP BY clause gives SINGLE tuple in the output.
If NULLs exist in the grouping attribute, then a separate group is created for all tuples with a NULL
value in the grouping attribute.
But in case of ORDER BY A DESC, B ASC – first A is sorted in descending order, and in case of a tie the rows are
sorted by B in ascending order.
Order of evaluation : FROM > WHERE > GROUP BY > HAVING > SELECT > ORDER BY
//Lecture 22
CORRELATED SUBQUERY :
3)
4)
5)
6)
⋆ SOME SQL KEYWORDS : ALL, EXISTS, ANY, UNIQUE, IN, NOT IN, NOT ALL, NOT EXISTS, NOT ANY, NOT
UNIQUE
SELECT S.sname
FROM Sailors S
WHERE S.sid IN (SELECT R.sid
FROM Reserves R)
• EXISTS – True iff subquery result is non-empty. NOT EXISTS – True iff subquery result is empty.
SELECT S.sname
FROM Sailors S
WHERE EXISTS (SELECT *
              FROM Reserves R
              WHERE R.sid = S.sid)
• UNIQUE – True iff there are no duplicates in the result of the subquery. NOT UNIQUE – True iff there are some
duplicate tuples in the result of the subquery.
• ALL, ANY/SOME – In SQL, Any ≡ Some.
ALL this is same as ∀ meaning if subquery returns empty set then value will be true.
ANY/SOME is same as ∃ meaning if subquery returns empty set then value will be false.
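A small sketch using ALL (it assumes the classic Sailors schema also has a rating column):

SELECT S.sname
FROM Sailors S
WHERE S.rating >= ALL (SELECT S2.rating FROM Sailors S2)

This returns the sailor(s) with the highest rating; if the subquery returned an empty set, the ALL condition would be true for every sailor.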
NOTE : SQL, RA and RC do not permit an attribute name to be repeated in the same relation; if repeated,
it is an ambiguity.
3. ER MODEL
//Lecture 1
We need to convert the ER (Entity-Relationship) model to the relational model because there is no database
software which uses the ER model directly.
Entity : which is a thing or object in the real world with an independent existence. Each entity has set
of attributes—the particular properties that describe it. For example, an EMPLOYEE entity may be
described by the employee’s name, age, address, salary, and job.
Relationship : an association between entities. A relationship set is the set of all similar relationships between
the same entity sets.
Degree of a relationship set : the number of entity sets participating in the relationship.
SIMPLE (ATOMIC) Vs COMPOSITE : Attributes that are not divisible are called simple or atomic
attributes. Composite attributes can form a hierarchy; for example, Street_address can be further
subdivided into three simple component attributes: Number, Street, and Apartment_number, as
shown in Figure.
SINGLE-VALUED Vs MULTIVALUED ATTRIBUTE : Most attributes have a single value for a particular
entity; such attributes are called single-valued. For example, Age is a single-valued attribute of a
person.
One person may not have any college degrees, another person may have one, and a third person may
have two or more degrees; therefore, different people can have different numbers of values for the
College_degrees attribute. Such attributes are called multivalued.
STORED Vs DERIVED ATTRIBUTES : For a particular person entity, the value of Age can be determined
from the current (today’s) date and the value of that person’s Birth_date. The Age attribute is hence
called a derived attribute and is said to be derivable from the Birth_date attribute, which is called a
stored attribute.
⋆ Relationship strength :
A weak or non-identifying relationship exists between two entities when the primary key of one of
the related entities does not contain a primary key component of the other related entities.
A strong or identifying relationship is when the primary key of the related entity contains the primary
key of the “parent”.
//Lecture 2
NOTE :
1-1 RELATIONSHIP :
M-N RELATIONSHIP :
1-M RELATIONSHIP : R(c1, r1), R(c1, r2), R(c2, r3) thus e2 must be primary key of R as it is not
repeating.
Strong entity sets are those entity sets which have their own key.
Entity set without any key is called weak entity set. What is the need of this entity set ? -
Weak entity set does not have any key and its existence depends on existence of owner entity
And the relation here is called identifying/Strong relation as without this relation it is not possible to
identify weak entity.
Partial ≡ Discriminator.
Here, primary key of identifying relationship : (PK of owner ES, Partial key of WES)
//Lecture 3
Note that every entity must be uniquely identifiable. Reason being by the definition of entity (a
distinguishable things or object). There are two ways
NOTE : In ER model, the following concepts not present : Functional dependencies and foreign key.
Relationship set (represented as diamond) is not an entity set; it is association between entity sets.
Thus, it cannot contain attributes of other entity set. But relationship set can have its own attributes
which are describing its property.
While converting the ER model to the relational model we should keep a few things in mind :
Q : How to handle a multivalued attribute ? – For each MVA, we create a separate relation in the relational
model.
For example,
If the owner entity (i.e. user) is gone, then the weak entity (hobbies) is gone too.
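A hedged relational-model sketch of this mapping (the Users/Hobbies names and columns are assumptions based on the figure description):

CREATE TABLE Users (
    uid   INTEGER PRIMARY KEY,
    uname VARCHAR(50)
);

CREATE TABLE Hobbies (                                        -- separate relation for the multivalued attribute
    uid   INTEGER REFERENCES Users(uid) ON DELETE CASCADE,    -- the hobby rows disappear with their owner
    hobby VARCHAR(50),
    PRIMARY KEY (uid, hobby)                                  -- owner's PK + the MVA value
);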
⋆ Specialization is the process of classifying a class of objects into more specialized subclasses.
Generalization is the inverse process of generalizing several classes into a higher-level abstract class
that includes the objects in all these classes. We call the relationship between a subclass and its
superclass an IS-A-SUBCLASS-OF relationship, or simply an IS-A relationship. Denoted by triangle.
1-1 MAPPING :
If total participation is there on both sides, then we need only one table because there is no problem of
NULLs appearing on either side: ARB(a1, a2, b1, b2), with either a1 or b1 usable as the key.
1-M RELATIONSHIP :
Even in case of total participation on both sides we cannot merge everything together into one table ARB(a1, a2, b1, b2):
if we merge, then b1 alone is the key of the whole relation and a1 → a2 can hold, which
violates 3NF.
M-N RELATIONSHIP :
Whatever the mapping (whether total or partial), one table per entity set and one table for the relationship are required.
4. MAGNETIC DISK
//Mod4-Module1
Although a database system provides a high-level view of data, ultimately data have to be stored as
bits on a storage device.
Block : a collection of sectors. Note that it is not a property of the disk but a property of the DBMS
software; the DBMS decides how many consecutive sectors are to be considered as one block.
All R/W head move together and vertically in same position. And only one R/W head is active at any
time. Usually, head movement is slower than disk rotation.
To Read data,
Step 1 : Put R/W head on desired track (this is also called seek time).
Step 2 : Rotate starting point of desired sector under R/W head (this is called rotational latency).
Step 3 : Rotate complete desired sector under R/W head (this is called transfer time).
A particular sector :
• Access time : It is the time from when a read or write request is issued to when data transfer
begins.
• Data-transfer rate : How much data(Bytes) per second can we read/write.
• Reliability.
Average rotational latency = ((0 + 1)/2) × 10 ms = 5 ms (half a rotation on average, if one full rotation takes 10 ms).
Transfer time (one sector) = (1/10) × 10 ms = 1 ms (if there are 10 sectors per track).
The closest that two records can be on a disk is on the same sector (because then we need only one
seek and one rotation).
Failing that, they could be on the same block >> the same track >> the same cylinder >> an
adjacent cylinder.
As we have natural notion of “closeness” for sectors, we can extend to a notion of next and previous
sector.
• It is first stored on the first sector of the first surface of the first cylinder. Then in the next
sector, and next until all the sectors on the first track are exhausted.
• Then it moves on to the first sector of the second surface (remains at the same cylinder), then
next sector and so on. It exhausts all available surfaces for the first cylinder in this way.
• After that, it moves on to repeat the process for the next cylinder.
There are two types of addressing : Cylinder-head-sector (CHS) and Logical block addressing (LBA)
NOTE : by default, track number start at 0, and track 0 is the outermost track of the disk. The highest
numbered track is next to the spindle
Sectors are the smallest unit that we can access so, sectors are given addresses, meaning each sector
has its own address.
In <C, H, S> each number represents how many cylinders, surfaces, sectors are gone respectively
(because indexing starts from 0).
With Constant linear velocity (CLV), the density of bits is uniform from cylinder to cylinder, because
there are more sectors in outer cylinders, the disk spins slower when reading those cylinders, causing
the rate of bits passing under the read write head to remain constant. This is the approach used by
modern CDs and DVDs. Here outer track data capacity = k x inner data capacity where k is
proportional to radius of track.
With Constant Angular Velocity (CAV), the disk rotates at a constant angular speed, with the bit
density decreasing on outer cylinders (these disks would have a constant number of sectors per track
on all cylinders). Here inner track data capacity = outer track data capacity
//Lecture 3
We now describe disk-scheduling algorithms, which schedule the order of disk I/O to maximize
performance.
Now, we know nowadays we have multiprogramming OS meaning there can be many processes in
ready queue. And some process request to perform some I/O operation on some I/O devices. And it
is responsibility of OS to manage resources (this includes I/O) efficiently.
Q : What matters most to improve performance (fast performance) ? – total seek time and total
Rotational latency (this also includes actual data transfer time as data transfer requires rotation to
read or write).
We can see that seek time >>> rotational latency (in most cases) thus, aim of disk scheduling is to
minimize the overall seek time. Seek time is affected by the number of cylinders crossed; thus all disk scheduling
algorithms only care about the cylinder numbers of the disk requests.
Basic idea :
Whenever a certain portion of the data is needed, it must be located on disk, copied to main memory
for processing, and then rewritten to the disk if the data is changed.
NOTE : block is also called page in DBMS. This is not same as the page in paging. But we use block.
BLOCK : Each file is also logically partitioned into fixed-length storage units called blocks, which are
the units of both storage allocation and data transfer. A block may contain several records. No record
is larger than a block. No block size is larger than a track size.
Before answering, we explore record properties (time stamp of a record, size of a record, field values, etc.)
and how and where these properties are stored. The time stamp and all properties related to a single record are
stored in the metadata of the record, which is within the record itself.
Properties of the whole file (such as the field names and fixed field sizes) are stored in the block/file header.
If a field is of variable size, its size is stored in the metadata of the record; if the records are of fixed
length, the sizes are stored in the block header.
Blocking factor (Bfr) : the number of records that fit per block. If B is the block size and R is the record size,
then for fixed-length records Bfr = ⌊B / R⌋. For variable-length records, Bfr represents the average number of records
per block for the file; thus Bfr = B / R.
NOTE : Unspanned Organization is generally used with fixed-length records because it makes each
record start at known location in the block, simplifying record processing. For variable-length
records, either a spanned or an unspanned organization can be used.
Thus, Unspanned organization is by default taken. And fixed-length records are by default
considered.
Q : For a variable-length records, different records have different sizes. Let the average record size in
a file be R, and number of records in the file be r. Assume that block size is B bytes, and K bytes are
used for block header. Using spanned organization, find the number of blocks needed to store this file
? – Since the records are of variable length and spanned organization is used, one record's data may be split across two
blocks. Total no. of records = r; if we somehow find the (average) number of records per block, then we can take
⌈r / (records per block)⌉ = # of blocks.
Note that bfr for variable-length records gives the average no. of records per block.
Thus, bfr = (B − K) / R ; # of blocks needed to store the file = ⌈r / bfr⌉.
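A quick worked instance (all numbers assumed for illustration) : B = 512 bytes, K = 12 bytes of block header, average R = 100 bytes, r = 1000 records. Then bfr = (512 − 12) / 100 = 5 records per block, and the file needs ⌈1000 / 5⌉ = 200 blocks.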
Determine how the file records are physically placed on the disk.
1) Heap files (files of unordered records) : records are placed in the file in the order in which
they are inserted, so new records are inserted at the end of the file.
2) Sequential or sorted files : records are stored ordered on some field (some attribute like sid or pid).
This field is called the ordering field.
Block address : Address of the first sector of that block. <C, H, S> address OR LBA address.
Record address : a record can be identified by giving its block address and the offset.
//Lecture 3
Having a large block size means that even a 1-byte file, ties up an entire cylinder. On the other hand,
a small block size means that most files will span multiple blocks and thus need multiple seeks and
rotational delay.
To keep track of free disk space, the system maintains a free-space list. The free-space list records all
free disk blocks-those not allocated to any file.
For example, consider a disk where blocks 2, 3, 4, 5, 8, 9, 10, 11, 12, 13, 17, 18, 25, 26, and 27 are free
and the rest of the blocks are allocated. The free-space bitmap would be
001111001111110001100000011100000…
Way 1 : within each free block we maintain one pointer pointing to the next free block, and so on. And we keep
one pointer to the first free block in RAM. At first sight this looks wasteful, as we are storing a pointer in each free
block, as if they were not free space anymore (but see the NOTE below).
Way 2 : Store addresses of n free blocks in the first free block. The first n – 1 of these blocks are actually
free. The last block contains the addresses of another n free blocks, and so on.
Q : When does the linked-list scheme require fewer blocks than the bitmap ? – When the no. of free blocks
is small, or in other words when the disk is nearly full, the linked-list scheme requires fewer blocks than the
bitmap.
NOTE : one thing to realize is that the linked list does not actually waste space. Free blocks are used
to hold the free list (addresses), so the storage is essentially free. Thus, the linked free-list method doesn't
eat up our useful space… (if there is a little space, give it; if not, no problem).
5.2) INDEXING :
An index is overhead. The entire book is the data file, and the pages consisting only of the index form the index
file. So both are different things.
Index file = index Vs Data file = main file = indexed file or file
To gain fast random access to records in a file, we can use an index structure. Each index structure is
associated with a particular search key.
Note that ordered indexes means index entries are ordered by some key but file may or may not be
ordered.
• PRIMARY INDEX :
If data file (original) is sorted on a candidate key attribute A then index on A is called primary index.
Q : From DBMS point of view, index at the end of Navathe book, is primary index ? – Answer is NO,
search key is topic name meaning they are sorted by (Ordered by) topic name but data file (book) is
not sorted on Topic name. So, topic name is index but it is not primary index.
It has two implementations : (i) For each data record one index record ………… this is called Dense
(ii) For each block one index record ………… this is called sparse
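A rough worked comparison (all numbers assumed) : a file with r = 30,000 records, block size B = 1000 bytes and record size R = 200 bytes gives bfr = 5 and b = 6,000 data blocks; if an index entry (key + pointer) takes 10 bytes, one index block holds 100 entries. A dense primary index then needs ⌈30,000 / 100⌉ = 300 index blocks, while a sparse index (one entry per data block) needs only ⌈6,000 / 100⌉ = 60 index blocks.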
• CLUSTERING INDEX :
Q : From DBMS point of view, index at the end of Navathe book, is clustering index ? – No, again same
reason because data file is not sorted on search key (i.e. on topic name)
NOTE : To create clustering index on a file R : R must be physically ordered by a NON-KEY FIELD (say
F), then an index on F is called clustering index.
//Lecture 7
• SECONDARY INDEX : indexing on non-ordering field (say field B). Now, if B is key then we say
“secondary index on key” and if B is not key then we say “secondary index on non-key”.
Meaning the file is not ordered, but the index is always ordered. In every case, whether the index is primary,
secondary, or clustering, the index file is always ordered.
We know that heap file is unordered thus we can do indexing on heap using secondary index.
As you can see there are 700 entries, so in such cases we create an extra level of blocks. That extra level consists
of only block addresses or block pointers, which point to the addresses of the blocks containing the same
non-key value…
We found that indexing done in “Navathe” was secondary indexing on topic name !!
In short,
NOTE : SI is useful when we want particular records which exist in only a few data blocks.
Example, people with age = 50 (SI on age is very good). SI is worst when we have to access many
data blocks, e.g. range queries. The worst case happens when all records satisfy the query but
you are still accessing data blocks + index blocks.
//Lecture 9
Till now we saw single level indexing now we will see multilevel indices and more.
To reduce I/O cost, we can put the index in main memory and do the search in main memory only.
But this idea of putting the index in RAM is good only when the index file is very small or sparse.
Once we decide to go for multilevel index then by default outer most level has one block.
ISAM (Index sequential access method) : Sequential file ordered by key + Multilevel PI
//Lecture 11
These are dynamic multilevel tree-based indexes. Modification is less costly and modification friendly.
Q : But why B and B+ trees and not other data structures like BST, AVL tree or heap ? – Because those
are in-memory data structures, not suitable as disk data structures. In the case of an AVL tree/BST we would have
only one key (and two pointers) per node, hence per block, so there is a lot of storage waste; we instead use B trees
and B+ trees.
5.3.1) B TREE :
Order of B tree (p) = Maximum number of tree (block or Node) pointers per node.
SEARCHING IN B TREE : same as in a binary search tree. TC with n keys = O(log_p n × log_2 p) = O(log_2 n)
INSERTION IN B TREES : Insert the keys 78, 52, 81, 40, 33, 90, 85, 20 and 38 with order 3
Sometimes, if the number of keys in an overflowing node is even, it is hard to decide which key
to push up as the middle; in that case, we follow a left-biased or right-biased split.
NOTE : Always stick to left or right biased but only one of them consistently. But by default, right
biased.
#leaf nodes = 1 + #keys in non-leaf nodes … the idea is that the immediate (in-order) predecessor of each non-leaf
key lies in a leaf node, and distinct non-leaf keys get distinct such leaves; but the rightmost leaf node does not
contain the immediate predecessor of any non-leaf key, hence the +1. This holds for B trees as well as B+ trees.
//Lecture 14
Consider a B tree with height h and order p; then the max # of keys in the B tree is computed as follows.
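A sketch of the count, assuming height h means h levels and every node is completely full : level i has at most p^(i−1) nodes, each holding at most p − 1 keys, so max #keys = (p − 1)(1 + p + p² + … + p^(h−1)) = (p − 1) · (p^h − 1)/(p − 1) = p^h − 1.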
In B tree we can observe that some keys are in non-leaf node and some in leaf. In B+ tree all keys are
in leaf node.
//Lecture 16
5.3.2) B+ TREE :
B+ tree keep all keys (Key + Data pointer) in leaf node while it keeps only key in non-leaf node. And
leaf-nodes have block pointer at the end which points to next leaf node. There are no data pointer in
non-leaf node.
Thus, in B+ tree we have duplicate key (only key not data pointer)
At most, a key value appears twice: once in a non-leaf node and once in a leaf node (where it has its
data pointer). A B+ tree never contains duplicates within the non-leaf part alone or within the leaf part alone.
Here total pointers are counted, whereas in a B tree, tree pointers are counted. This is because a leaf node in a B+ tree
contains data pointers and one block (next-leaf) pointer.
In B+ tree,
FEW OBSERVATIONS :
//Lecture 18
INSERTION IN B+ TREE :
All searches (successful OR unsuccessful) in B+ tree take exactly same amount of time (index I/O cost
= # of levels) i.e. O(h), with n nodes and order p it will take O(logpn)
Similar to B tree we can also find no. max or min number of keys…
For the same system and same file (i.e. same block size) and the same n keys, a B+ tree will take fewer
levels and fewer nodes than a B tree.
NOTE : During searching in a B+ tree we need not always go down to the leaf node, because the key might be present
in an internal node; we are doing only a search, not a data access. If data access were required, then we would
definitely go to the leaf.
It is worth noting that, as a result of deletion, a key value that is present in a nonleaf node of the B+-
tree may not be present at any leaf of the tree. For example, in Figure, the value “Gold” has been
deleted from the leaf level but is still present in a nonleaf node.
//Korth page 733
We study several algorithms for computing the join of relations, and we analyze their respective costs.
Above code shows a simple algorithm to compute the theta join, r ⋈θ s, of two relations r and s. This
algorithm is called the nested-loop join algorithm, since it basically consists of a pair of nested for
loops.
Extending the algorithm to compute the natural join is straightforward, since the natural join can be
expressed as a theta join followed by elimination of repeated attributes by a projection. The only
change required is an extra step of deleting repeated attributes from the tuple tr ⋅ ts, before adding it
to the result.
For each record in r, we have to perform a complete scan of s. We assume that each relation must fit
into one track.
In worst case,
• Block transfer : nr ∗ bs + br … Only one buffer (to hold block) is present for each relation.
• No. of Seeks : nr + br … S is stored sequentially so one seek. Total nr seeks for s. and
br seeks for r in total.
In best case,
To try running example take two cases for inner relation as student and same for takes.
A variant of the nested-loop join where every block of the inner relation is paired with every block of
the outer relation. Within each pair of blocks, every tuple in one block is paired with every tuple in the
other block, to generate all pairs of tuples.
• block transfer : br ∗ bs + br … each block in the inner relation s is read only once for each
block in the outer relation, instead of once for each tuple in the outer relation.
• No. of seeks : 2 ∗ br … br for outer relation and another br for inner relation being
sequential and access br times.
• Block transfer : br + bs
• No. of seeks : 2
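A worked cost comparison (all numbers assumed for illustration) : let r have nr = 5,000 tuples in br = 100 blocks and let s have bs = 400 blocks, with r as the outer relation. Plain nested-loop join, worst case : nr ∗ bs + br = 5,000 × 400 + 100 = 2,000,100 block transfers and nr + br = 5,100 seeks. Block nested-loop join, worst case : br ∗ bs + br = 100 × 400 + 100 = 40,100 block transfers and 2 ∗ br = 200 seeks. Best case (the inner relation fits entirely in memory) : br + bs = 500 block transfers and 2 seeks.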
Silly Mistakes :
A transaction is seen by the DBMS as a series, or list, of actions. The actions that can be executed by
a transaction include reads and writes of database objects.
Problem can happen when several transactions executed concurrently for example, two people
booking same seat. This will be handled by transaction manager and recovery manager.
Data item can be database record, but it can also be a larger unit such as a whole disk block, or even
a smaller unit such as an individual field (attribute) value of some record in the database.
Each data item has a unique name. if the item is a single record then the record id can be the item
name etc.
State of database : A database has a state, which is value for each of its elements. Basically, database
state = database instance = database snapshot
//Lecture 2
Read(A, t) : Read(A)
Write(A, t) :
Blind write :
In each transaction, each write should have its own preceding read on the same object; if not, it is a blind write.
If T1 : r(x), w(x), w(x), the second write is a blind write because it is not preceded by its own read.
Here in all cases we are taking simplified version of read and write but actual implementation is
different which we will study in recovery management chapter.
Suppose,
Meaning in actual implementation, value of A will be stored immediately after W(A) or before commit
or after commit !!
NOTE : write(X, t) brings the block containing X BUT write(X, t) doesn’t read X. It simply puts value
of t into X.
Abort or Rollback : undo whatever the transaction has done (undo the changes made by T, perhaps because some failure occurred).
Concurrency control : the database must go from one consistent state to another consistent state after
concurrent execution of transactions.
//Lecture 3
6.1.1) ATOMICITY :
A transaction is an atomic unit of processing; it should either be performed in its entirety or not
performed at all.
If a transaction begins to execute but fails for whatever reason, any changes to the database that the
transaction may have made must be undone.
6.1.2) DURABILITY :
After commit, despite of any failure, all changes by T must be reflect in DB (must be made permanent
in DB).
6.1.3) CONSISTENCY :
If a transaction is completely executed from beginning to end without interference from other
transactions, it should take the database from one consistent state to another. In short, an individual
transaction (running in isolation) must take the database from one consistent state to another.
Q : Is serial execution always correct ? in other words does serial execution always preserves database
consistency ? – Yes, because of consistency property, individual transaction must take database from
one consistent state to another.
6.1.4) ISOLATION :
A transaction should not see the changes made by other transactions; each transaction should feel as if it is
executing alone.
There are many ways to ensure isolation but strongest way is serializability.
//Lecture 5
6.2) SERIALIZABILITY :
When transactions are executing concurrently in an interleaved fashion, then the order of execution
of operations from all the various transactions is known as a schedule (or history).
The idea of serializability is to allow those concurrent executions whose effect is equivalent to some serial
schedule.
S (schedule) is serializable iff for every initial Database state, effect of S is same as some serial
schedule. If for Some initial Database state, effect of S is not same as some serial schedule then S is
not serializable.
Here, as W(A) is not done after A = A + B, the content of the database is A = 200, B = 400, instead of A = 600.
Each data item maintains one buffer in which the transaction stores its updated value, and on W(A) that value is
stored into the DB.
//Lecture 6
We say a schedule S is serializable if there is a serial schedule S’ such that for every initial consistent
database state, the effects (final database state) of S on Database is same as effect of some serial
schedule on Database.
Problem with general notion of serializability : we have to consider all computations which is hard to
implement and expensive.
Solution : Don't rely on computations; look only at reads and writes, not the computations.
//Lecture 7
Conflict serializability purely based on read and write operation it ignores other computation for
example,
CONFLICT PAIRS : We say that two operations I and J in a schedule conflict if they belong to different transactions, access the same data item, and at least one of them is a write :
For example,
NOTE : There is no effect of commit in conflict serializable because by default by the end of
transaction if nothing is written it is automatically committed. So, there is no effect of commit
operations on serializability.
If an abort occurs in some transaction, simply consider that that transaction never existed
(ignore the transaction completely).
For example, T1 → T2 implies that there is a conflicting pair between T1 and T2, with T1's operation
coming first in the schedule.
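A small example (not from the notes) : S : R1(A), W2(A), R2(B), W1(B). The conflicting pairs give R1(A)–W2(A) : T1 → T2 and R2(B)–W1(B) : T2 → T1, so the precedence graph has a cycle and S is not conflict serializable.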
View serializability is slightly similar to conflict serializability because it is purely based on read, write
operations only. But for view serializability schedule, it should be view equivalent to some serial
schedule.
This is happening because View serializability uses the blind writes carefully and conflict serializability
doesn’t take benefit of the blind writes.
Which means: if a schedule is non-conflict-serializable and has no blind writes, then it is not view serializable; but
if a schedule is non-conflict-serializable and a blind write is present, then it can be view serializable.
If a schedule is VS but not CS then there is a blind write. But if there is a blind write then the schedule may
or may not be VS.
If a schedule contains only reads and writes then VS = serializability, but if some computations are
also given then VS ≠ serializability.
For example,
//Lecture 9
6.3) RECOVERABILITY :
So, by far we have studied schedule while assuming implicitly that there are no transaction failures.
We now address the effect of transaction failures during concurrent execution.
- Isolation
- Consistency of database
- Consistency of transaction
- Consistency of schedule
- Atomicity
- Durability
NOTE : Showing wrong result/values to user is also considered violation of consistency of database.
Schedules
Recoverable Irrecoverable
NOTE : serializability and recoverability are not related (nothing to do with each other)
To preserve atomicity in the above case we have to abort T6 and all the transactions which have read from
T6. But as T7 has already committed, this is not possible; thus atomicity is violated.
Strict schedule or Strict recoverable schedule : Transactions can neither read nor write an item X until
the last transaction that wrote X has committed (or aborted). – don’t read/write from uncommitted
write.
//From korth page chapter 19
When a system crash occurs, we must consult the log to determine those transactions that need to
be redone and those that need to be undone. In principle, we need to search the entire log to
determine this information; to reduce this overhead, we introduce checkpoints.
1) Output onto stable storage all log records currently residing in main memory.
2) Output to the disk all modified buffer blocks.
3) Output onto stable storage a log record of the form <checkpoint L>, where L is a list of
transactions active at the time of the checkpoint.
Example, As an illustration, consider the set of transactions {T0, T1, …, T100}. Suppose that the most
recent checkpoint took place during the execution of transaction T67 and T69, while T68 and all
transactions with subscripts lower than 67 completed before the checkpoint. Thus, only transactions
T67, T69, …, T100 need to be considered during the recovery scheme. Each of them needs to be redone
if it has completed (i.e., either committed or aborted); otherwise, it was incomplete and needs to be
undone.
NOTE : Undo means we have to delete all updates made by that transaction (even those made before the
checkpoint). Redo means we have to do its operations after the checkpoint again.
//Lecture 10
Till now we were first completing the schedule and then checking for serializability. A better idea would be
to prepare beforehand.
Thus, while the database is executing, we ensure that only conflict-serializable schedules are generated; for
that we need a set of rules (protocols).
• The Lost Update Problem : This problem occurs when two transactions that access the same
database items have their operations interleaved in a way that makes the value of some
database items incorrect.
In Serializable schedules (Conflict Serializable), We DO NOT have lost update problem because when
there is lost update problem, there will be cycle in the precedence graph, So Not Conflict serializable.
• The Temporary update (dirty read) problem : One transaction reads from write of
uncommitted transaction and commits.
• The Incorrect Summary Problem : If one transaction is calculating an aggregate summary
function on a number of database items while other transactions are updating some of these
items.
• The Unrepeatable Read Problem : Another problem that may occur is called unrepeatable
read, where a transaction T reads the same item twice and then item is changed by another
transaction T′ between the two reads.
So, if each transaction, individually, follows these rules, then the system automatically generates ONLY conflict-
serializable schedules.
Exclusive (X) mode : Data item can be both read as well as written. X-lock is requested using
lock-X instruction
Shared (S) mode : Data item can only be read. S-lock is requested using lock-S instruction.
Note that all these requests are handled by the CCM (concurrency control manager) in the database s/w.
But by using these general S, X locking protocol, non-serializable schedules are possible (can be
generated).
Thus, using locks (read/write locks) in transactions, as described earlier, does not guarantee
serializability of schedules on its own. Deadlock is possible, starvation is possible, inconsistency is possible;
serializability is not guaranteed.
This protocol ensures that if every transaction individually follow 2PL rules then every schedule that
will be generated will be conflict serializable. And the equivalent serial schedule is in the order of lock
point.
Which means 2PL → conflict serializable. Thus, if a schedule is not generated by 2PL then it may or may not be
conflict serializable. Thus, not all conflict serializable schedules are produced by 2PL.
For example,
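One standard illustration (not from the notes) : S : R1(A), W2(A), R3(B), W1(B). Its precedence graph has only T1 → T2 and T3 → T1, so S is conflict serializable (serial order T3, T1, T2). But under 2PL, T1 would have to acquire its lock on B before releasing its lock on A (which must happen before W2(A) at step 2), and it keeps that lock on B at least until W1(B) at step 4; then T3 could never read B at step 3. So this conflict-serializable schedule cannot be produced by 2PL.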
//Lecture 12
Not only is deadlock possible, but 2PL also does not guarantee recoverable or cascadeless schedules.
Strict 2PL : a transaction releases its exclusive locks only after it commits/aborts.
Rigorous 2PL : a transaction releases all its locks only after commit/abort; transactions can be serialized in commit order.
Conservative 2PL : take all the locks before the transaction starts executing.
Q : Does granularity of data item affect amount of concurrency we can achieve ? – Yes, if entire
Database is data item then only serial execution of transaction is possible because at a time, we can
lock entire DB then unlock then lock again unlock, so only serial execution is possible. Thus, less
concurrency.
The larger the data-item, the lesser the amount of concurrency we can achieve
- Upgrade (A) : Converts shared lock to exclusive lock ……. Applicable only in growing phase
- Downgrade (A) : Convert exclusive lock to shared lock ……. Applicable only in shrinking phase
System with many transactions … Deadlock happens when there exists a set of transaction such that
every transaction in set is waiting for other transaction to release locks.
DEADLOCK AVOIDANCE : You have to periodically check for deadlock and avoid.
DEADLOCK PREVENTION : If you implement me then deadlock never occur and you don’t have to
periodically check for deadlock.
• Conservative approach : Take all the locks before transaction starts execution. Deadlock never
possible.
• Partial ordering the data-items : Given then order and access them according to this order
only.
• Transaction Timestamp : Time at which transaction was created.
We will always give more priority to older as these transactions had done more computation.
In both preventions when a transaction aborts, it will restart after some time but with old timestamp
to avoid starvation.
If we don’t implement deadlock prevention then deadlock possible thus, we periodically detect
deadlock and recover.
We use wait-for graph, where nodes are transaction and Ti → Tj edge implies Ti waiting for Tj to
release lock.
The system is in deadlock state if and only if the wait-for graph has a cycle.
Again, as we say earlier, to remove deadlock we abort some process who have done less computation.
//Lecture 14
We again focus our mind into Concurrency control protocols. We already saw lock-based protocol
then 2PL. Now, we will see timestamp-based protocols.
Timestamp based protocol : Your fate (serializability) is already decided (you are just acting (locking)
according to it)
This protocol has one rule : If TS(T) < TS(T’) then order of conflict operations in any schedule between
T, T’ should be in the order of T → T’. If not then the transaction who is LATE, is aborted & restarted
with a NEW FRESH Timestamp.
NOTE : For every element X the protocol uses a read timestamp read_TS(X) and a write timestamp write_TS(X) (both
record the youngest transaction that has performed that operation on X).
Transaction who came first assigned a lower number then the transaction who came later.
If above does not occur, then execute the write_item(X) operation of T and set write_TS(X) to TS(T).
If write_TS(X) ≤ TS(T), then execute the read_item(X) operation of T, and set read_TS(X) to the larger of
TS(T) and the current read_TS(X) (i.e. the youngest).
THOMAS WRITE RULE : Only modifies conditions for write operations as follows :
1) Read_TS(X) > TS(T) → abort and roll back T and reject the operation. (same as timestamp)
2) Write_TS(X) > TS(T) → ignore T’s request but continue processing.
3) If neither of this satisfy then set write_TS(X) = TS(T)
Using Thomas write rule, we can allow/generate some VS and non-CS schedules.
Example of TSP,
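A small worked run (transactions and timestamps assumed) : TS(T1) = 1, TS(T2) = 2, and initially read_TS(A) = write_TS(A) = 0. Schedule : R1(A) sets read_TS(A) = 1; W2(A) is allowed since TS(T2) = 2 is not less than read_TS(A) or write_TS(A), and it sets write_TS(A) = 2; then W1(A) finds write_TS(A) = 2 > TS(T1) = 1. Under the basic timestamp protocol T1 is aborted and restarted with a new timestamp; under the Thomas write rule the obsolete write is simply ignored and T1 continues.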
TSP ensures conflict serializability in timestamp order and avoids deadlock, but it does not avoid
cascading rollbacks.
Reason : All the arcs in the precedence graph are of the form,
NOTE :
• If you are using Thomas write rule then some non-conflict serializable (but correct) are also
allowed.
• As soon as a conflict occurs in the opposite direction (of the timestamp order), the late transaction is aborted and is
not considered in the subsequent conflict pairs.
• 2PL is not subset of TSP and vice versa.
Silly mistake :