[go: up one dir, main page]

0% found this document useful (0 votes)
69 views17 pages

Normalization PDF

The document discusses the design process for a database and outlines some issues with an initial design for an EMP-PROJ relation. Specifically, it notes that the initial design leads to information repetition when attributes are stored multiple times for each project-employee assignment. It also discusses anomalies such as insertion anomalies where a new project cannot be added until an employee is assigned, deletion anomalies where deleting an employee also requires deleting associated project records, and modification anomalies where changing a project requires updating multiple records. The document advocates for a normalized design to address these issues.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
69 views17 pages

Normalization PDF

The document discusses the design process for a database and outlines some issues with an initial design for an EMP-PROJ relation. Specifically, it notes that the initial design leads to information repetition when attributes are stored multiple times for each project-employee assignment. It also discusses anomalies such as insertion anomalies where a new project cannot be added until an employee is assigned, deletion anomalies where deleting an employee also requires deleting associated project records, and modification anomalies where changing a project requires updating multiple records. The document advocates for a normalized design to address these issues.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 17

Design Process - Where are we?

Conceptual
Design

Conceptual
Schema
(ER Model)

Step 1: ER-to-Relational
Logical Mapping
Design Step 2: Normalization:
“Improving” the design
Logical Schema
(Relational Model)
5-1

Relational Design Principles

n Relations should have semantic unity


n Information repetition should be avoided
l Anomalies: insertion, deletion, modification
n Avoid null values as much as possible
l Difficulties with interpretation
à don’t know, don’t care, known but unavailable,
does not apply
l Specification of joins
n Avoid spurious joins
5-2
Bad Design
Employee No Employee
Name
Salary
Title


Project No EMP-PROJ
Resp
Project Name
Duration
Budget

EMP-PROJ


ENO ENAME TITLE SALARY PNO PNAME BUDGET DURATION RESP

E1 J. Doe Elect. Eng. 40000 P1 Instrumentation 150000 12 Manager
E2 M. Smith Analyst 34000 P1 Instrumentation 150000 24 Analyst
E2 M. Smith Analyst 34000 P2 Database Develop. 135000 6 Analyst
E3 A. Lee Mech. Eng. 27000 P3 CAD/CAM 250000 10 Consultant
E3 A. Lee Mech. Eng. 27000 P4 Maintenance 310000 48 Engineer
E4 J. Miller Programmer 24000 P2 Database Develop. 135000 18 Programmer
E5 B. Casey Syst. Anal. 34000 P2 Database Develop. 135000 24 Manager
E6 L. Chu Elect. Eng. 40000 P4 Maintenance 310000 48 Manager
E7 R. Davis Mech. Eng. 27000 P3 CAD/CAM 250000 36 Engineer
E8 J. Jones Syst. Anal. 34000 P3 CAD/CAM 250000 40 Manager
5-3

Information Repetition
n The TITLE, SALARY, BUDGET attribute values are
repeated for each project that the engineer is involved in.
l Waste of space
l Complicates updates This example instance of
EMP -PROJ relation violates
one of the constraints in our
earlier design. Which one?
EMP-PROJ

ENO ENAME TITLE SALARY PNO PNAME BUDGET DURATION RESP


E1 J. Doe Elect. Eng. 40000 P1 Instrumentation 150000 12 Manager


E2 M. Smith Analyst 34000 P1 Instrumentation 150000 24 Analyst
E2 M. Smith Analyst 34000 P2 Database Develop. 135000 6 Analyst
E3 A. Lee Mech. Eng. 27000 P3 CAD/CAM 250000 10 Consultant
E3 A. Lee Mech. Eng. 27000 P4 Maintenance 310000 48 Engineer
E4 J. Miller Programmer 24000 P2 Database Develop. 135000 18 Programmer
E5 B. Casey Syst. Anal. 34000 P2 Database Develop. 135000 24 Manager
E6 L. Chu Elect. Eng. 40000 P4 Maintenance 310000 48 Manager
E7 R. Davis Mech. Eng. 27000 P3 CAD/CAM 250000 36 Engineer
E8 J. Jones Syst. Anal. 34000 P3 CAD/CAM 250000 40 Manager
5-4
Insertion Anomaly

n It is difficult (impossible?) to store information


about a new project until an employee is assigned
to it. Why?

EMP-PROJ


ENO ENAME TITLE SALARY PNO PNAME BUDGET DURATION RESP

E1 J. Doe Elect. Eng. 40000 P1 Instrumentation 150000 12 Manager
E2 M. Smith Analyst 34000 P1 Instrumentation 150000 24 Analyst
E2 M. Smith Analyst 34000 P2 Database Develop. 135000 6 Analyst
E3 A. Lee Mech. Eng. 27000 P3 CAD/CAM 250000 10 Consultant
E3 A. Lee Mech. Eng. 27000 P4 Maintenance 310000 48 Engineer
E4 J. Miller Programmer 24000 P2 Database Develop. 135000 18 Programmer
E5 B. Casey Syst. Anal. 34000 P2 Database Develop. 135000 24 Manager
E6 L. Chu Elect. Eng. 40000 P4 Maintenance 310000 48 Manager
E7 R. Davis Mech. Eng. 27000 P3 CAD/CAM 250000 36 Engineer
E8 J. Jones Syst. Anal. 34000 P3 CAD/CAM 250000 40 Manager

5-5

Deletion Anomaly
n If an engineer, who is the only employee on a project,
leaves the company, his personal information cannot be
deleted, or the information about that project is lost.
n May have to delete many tuples.

EMP-PROJ

ENO ENAME TITLE SALARY PNO PNAME BUDGET DURATION RESP MGR

E1 J. Doe Elect. Eng. 40000 P1 Instrumentation 150000 12 Manager E1


E2 M. Smith Analyst 34000 P1 Instrumentation 150000 24 Analyst E1
E2 M. Smith Analyst 34000 P2 Database Develop. 135000 6 Analyst E5
E3 A. Lee Mech. Eng. 27000 P3 CAD/CAM 250000 10 Consultant E8
E3 A. Lee Mech. Eng. 27000 P4 Maintenance 310000 48 Engineer E6
E4 J. Miller Programmer 24000 P2 Database Develop. 135000 18 Programmer E5
E5 B. Casey Syst. Anal. 34000 P2 Database Develop. 135000 24 Manager E5
E6 L. Chu Elect. Eng. 40000 P4 Maintenance 310000 48 Manager E6
E7 R. Davis Mech. Eng. 27000 P3 CAD/CAM 250000 36 Engineer E8
E8 J. Jones Syst. Anal. 34000 P3 CAD/CAM 250000 40 Manager E8

5-6
Modification Anomaly
n If any attribute of project (say BUDGET of P1) is
modified, all the tuples for all employees who
work on that project need to be modified.

EMP-PROJ

ENO ENAME TITLE SALARY ≈


≈ PNO PNAME BUDGET DURATION RESP MGR

E1 J. Doe Elect. Eng. 40000 P1 Instrumentation 150000 12 Manager E1


E2 M. Smith Analyst 34000 P1 Instrumentation 150000 24 Analyst E1
E2 M. Smith Analyst 34000 P2 Database Develop. 135000 6 Analyst E5
E3 A. Lee Mech. Eng. 27000 P3 CAD/CAM 250000 10 Consultant E8
E3 A. Lee Mech. Eng. 27000 P4 Maintenance 310000 48 Engineer E6
E4 J. Miller Programmer 24000 P2 Database Develop. 135000 18 Programmer E5
E5 B. Casey Syst. Anal. 34000 P2 Database Develop. 135000 24 Manager E5
E6 L. Chu Elect. Eng. 40000 P4 Maintenance 310000 48 Manager E6
E7 R. Davis Mech. Eng. 27000 P3 CAD/CAM 250000 36 Engineer E8
E8 J. Jones Syst. Anal. 34000 P3 CAD/CAM 250000 40 Manager E8

5-7

What to do?
n Take each relation individually and “improve” it in terms
of the desired characteristics
l Normal forms
à Atomic values (1NF)
à Can be defined according to keys and dependencies.
à Functional Dependencies ( 2NF, 3NF, BCNF)
à Multivalued dependencies (4NF)
l Normalization
à Normalization is a process of concept separation which applies a top -
down methodology for producing a schema by subsequent
refinements and decompositions.
à Do not combine unrelated sets of facts in one table; each relation
should contain an independent set of facts.
à Universal relation assumption
à 1NF to 3NF; 1NF to BCNF
5-8
Normalization Issues
n How do we decompose a schema into a desirable normal
form?
n What criteria should the decomposed schemas follow in order
to preserve the semantics of the original schema?
l Reconstructability: recover the original relation ⇒ no spurious joins
l Lossless decomposition: no information loss
l Dependency preservation: the constraints (i.e., dependencies) that
hold on the original relation should be enforceable by means of the
constraints (i.e., dependencies) defined on the decomposed relations.
n What happens to queries?
l Processing time may increase due to joins
l Denormalization
5-9

Normal Forms
All relations
1NF
2NF
3NF
BCNF
4NF

5-10
Functional Dependence
n Given relation R defined over U = {A1, A2, ..., An}
where X ⊆ U, Y ⊆ U. If, for all pairs of tuples t1
and t2 in any legal instance of relation scheme R,
t1[X] = t2[X] ⇒ t1[Y] = t2[Y],
then the functional dependency X → Y holds in R.
n Example
l In relation EMP-PROJ
à (ENO, PNO) → (ENAME, TITLE, SALARY, DURATION,
RESP)
à ENO → (ENAME, TITLE, SALARY)
à PNO → (PNAME, BUDGET)
à TITLE → SALARY
5-11

Some Basics

n Superkey
l A set of one or more attributes, which, taken collectively,
allow us to identify uniquely a tuple in a relation.
l Let R be a relation scheme. A subset K of R is a superkey
of R if, in any legal relation [instance] r of R, for all pairs
t 1 and t 2 of tuples in r such that t 1 [K] = t 2 [K] ⇒ t1 = t2 .
n Candidate key
l A superkey for which no proper subset is a superkey.
n Primary key
l The candidate key that is chosen by the database designer
as the principle key.

5-12
Some Basics

n Attributes
l Prime attribute is a member of any key
l Non-prime attribute is any attribute which is not prime
n Full functional dependency
l A FD X→Y is a full functional dependency if X is minimal, i.e.,
removal of any attribute A from X means the dependency does not
hold anymore.
f
l Formally - X → Y iff for all A ∈X, (X−{A})→Y.
/
n Partial functional
p
dependency
l Formally - X → Y iff for some A ∈X, (X−{A})→Y.

n Transitive dependency
l Formally - X→Y and Y→Z and X→Z and Y→X / and Z ⊄ Y
5-13

Normal Forms Based on FDs


1NF eliminates the relations within relations
or relations as attributes of tuples.
First Normal Form (1NF)
eliminate the partial functional
dependencies of non-prime attributes
Lossless & to key attributes
Second Normal Form (2NF)
Dependency
preserving eliminate the transitive functional
dependencies of non-prime attributes
to key attributes
Third Normal Form (3NF)
Lossless eliminate the partial and transitive
functional dependencies of prime (key)
attributes to key.
Boyce-Codd Normal Form (BCNF)

5-14
First Normal Form
n All attribute values are atomic
n 1NF relation cannot have an attribute value that
is:
l a set of values (set-value)
l a tuple of values (nested relation)
n This is a standard assumption in relational
DBMSs and in the rest of this section
n In object-oriented DBMSs this assumption is
relaxed.

5-15

Second Normal Form


n Two possible definitions:
l A relation R ∈2NF iff all non-prime attributes in R are fully
functionally dependent on primary key.
l A relation R ∈2NF iff the attributes are either
à a candidate key, or
à fully dependent on every key .
n Partial functional dependencies cause problems.
n 2NF is only of historical importance, since it is subsumed
by 3NF.
n In the example, EMP-PROJ is not 2NF, we turn it into
2NF by decomposing it:
l EMP(ENO, ENAME, TITLE, SALARY)
l PROJ(PNO,PNAME,BUDGET,MGR)
l ASSIGN(ENO,PNO,DURATION,RESP)
5-16
Third Normal Form
n Intuitively: A relation R ∈ 3NF iff
l R ∈ 2NF (i.e., every non-prime attribute is fully
functionally dependent on every key)
l No non-prime attribute of R is transitively dependent
on the primary key.
n The issues is to remove the transitive
dependencies
n N.B.: The absence of transitive dependencies
guarantees absence of partial functional
dependencies.
5-17

Third Normal Form


n Formally: A relation scheme R defined over U =
{A1, A2, …, An} is in 3NF if for all functional
dependencies that hold on R of the form X→Y,
where X ⊆ U and X ⊆ U, at least one of the
following holds:
l X→Y is a trivial functional dependency (i.e., Y ⊆ X)
l X is a superkey for R
l Y is contained in a candidate key for R (Y is a set of
prime attributes
n The first two conditions deal with transitive
dependencies.
5-18
3NF – Example
EMP
ENO ENAME TITLE SALARY

fd1
fd2

n EMP is not in 3NF because of fd 2


l TITLE → SALARY but TITLE is not a superkey and
SALARY is not prime
l Problem is that ENO transitively determines SALARY
(as well as directly determining it)
n Solution:
EMP PAY
ENO ENAME TITLE TITLE SALARY

fd1 fd2 5-19

Boyce-Codd Normal Form


n You can still have transitive dependencies in 3NF
if the dependent attribute(s) are prime.
n A 1NF relation scheme R is in BCNF if for every
non-trivial functional dependency X→Y, X is a
superkey.
n Properties of BCNF
l All non-prime attributes are fully dependent on every
key.
l All prime attributes are fully dependent on the keys that
they do not belong to.
l No attribute is non-trivially dependent on any set of
non-prime attributes.
5-20
Boyce-Codd Normal Form
n Formally: A relation scheme R defined over
U = {A1, A2, …, An} is in BCNF if for all
functional dependencies that hold on R of the
form X→ A, where X ⊆ U and A ⊆ U, at least
one of the following holds :
l X→A is a trivial functional dependency
l X is a superkey for R
n No transitive dependencies.

5-21

BCNF – Example
n Assume the following definition of the PROJECT
relation with:
l Each employee on a project has a unique location and
responsibility with respect to that project, and
l Only one project can be found at each location
n FDs would be
PROJECT
PJNO ENO LOCATION RESP

which makes PROJECT in 3NF but not in BCNF 5-22


Inferencing over FDs

n We would like to be able to infer from a given set


of FDs F all implied FDs F+, which is called the
closure of F.
n Important because the 3NF and BCNF definitions
refer to “all functional dependencies”.
n Example:
ENO → (ENAME, TITLE, SALARY,APT#,STREET,CITY)
⇒ (ENO → ENAME)
n This requires a set of inference rules
l Armstrong’s axioms
l Additional rules
5-23

Inference Rules
n Let X, Y and Z be sets of attributes in relation scheme R
n Armstrong’s axioms:
l Augmentation: {X →Y} ⇒ {XZ → YZ}
l Transitivity: {X →Y ,Y → Z} ⇒ {X → Z}
l Reflexivity: W ⊆ X ⇒ {X → W}
n These rules are
l Sound: do not generate any incorrect FDs – anything derived
+
from F is in F
l Complete: given F as a set of FDs, they permit us to find all of
+
F
n Additional Rules:
l Union: {X→Y, X→Z} ⇒ (X→YZ)
l Decomposition: {X→YZ} ⇒ {X→Y, X→Z}
l Pseudotransitivity: {X→Y, WY→Z} ⇒ {XW→Z}
5-24
Why These Rules?
n Lossless join decomposition:
l If R is decomposed into R1 , …, Rn , it should be possible
to reconstruct R with no additional (spurious) tuples.
l If a relation scheme R is decomposed into R1 and R2 ,
then at least one of the following FDs should be in F+
à R1 ∩ R2 → R1
à R1 ∩ R2 → R2

n Dependency preservation:
l If a relation scheme R is decomposed into R1 and R2 ,
then every FD in F that holds on relation R (even the
implied ones) should be guaranteed to hold whenever
the projected dependencies within relations R1 and R2
are enforced.
5-25

Closure of a Set of FDs


n This is most easily done by converting it to the
problem of computing the closure of a set of
attributes.
n For each FD defined on the base relations, pick the
attribute (or set of attributes) that appear on its
left-hand-side
l Find their closure which gives the set of attributes that
are dependent on that attribute
+
à Theorem 1: X →Y ∈ F iff Y ⊆ ComputeX+(X, F).
à Theorem 2: X is a superkey of R iff ComputeX+(X, F) = R.
l This also gives the set of FDs that can be inferred from
the original FD.
5-26
Closure of a Set of Attributes

function ComputeX +(X, F)


begin
X+ ← X
while there exists Y → Z ∈ F such that
+ +
Y ⊆ X and Z ⊆ X
+ +
then X ← X ∪ Z
+
return(X )
end
5-27

Attribute Closure Example

n Let F consist of
l A→B
l C → D, E
l E, G → H
n ComputeX +({C, G}, F)
l Initial: X+ = {C, G}
l Iteration 1( C → D, E): X+ = {C, G, D, E}
l Iteration 2 (E, G → H): X+ = {C, G, D, E, H}

5-28
Lossless Join BCNF
Decomposition

Input: Relation R<U,F> /* U={attributes}, F:{FDs} */


Output: Decomposition D for R
Step 1. D←{R}; /* We are talking about attributes of R*/
Step 2. While there is a relation schema Q∈D that is not
in BCNF do
if X→Y is the FD causing violation
then D←(D−Q) ∪ (Q−Y) ∪ (X∪Y)
}
}
R1 R2
5-29

BCNF Decomposition Example

n Consider the relation and F


l EMP(ENO, ENAME, TITLE, PNO, PNAME, RESP)
l F = {ENO → ENAME, TITLE,
PNO → PNAME,
ENO, PNO → RESP}
n EMP is not in BCNF, because ENO and PNO are
individually not superkeys. Thus,
l ENO → ENAME, TITLE
l PNO → PNAME
both cause violation of BCNF.
5-30
BCNF Decomposition Example

n We start with D = {ENO, ENAME, TITLE, PNO,


PNAME, RESP}
n Iteration 1
l Pick one of the FDs that violate BNCF
l ENO → ENAME, TITLE
n D = {R1, R2} where
l R1 (ENO, PNO, PNAME, RESP)
l R2 (ENO, ENAME, TITLE)
n R2 is in BCNF, but R1 is not
5-31

BCNF Decomposition Example

n Iteration 2
l D has R1 which is not in BCNF
l Pick one of the FDs that violate BNCF
l PNO → PNAME
n D = {R2, R3, R4} where
l R3 (ENO, PNO, RESP)
l R4 (PNO, PNAME)
n Both relations are in BCNF
n Threfore, replace EMP with R2, R3, R4
5-32
Complexity of Normalization
n Assume we are given a set of attributes A and a set of FDs
F, and let n = the size of this input (at most O(|A|*|F|)).
l The number of dependencies in F+ may be exponential in n.
l A+ can be found in linear time.
l Testing whether X → Y is in F+ can be done in linear time.
l Testing whether a decomposition is lossless can be done in linear
time.
l Testing whether a decomposition is dependency preserving can be
done in polynomial time.
l Testing whether a relation scheme is in BCNF is NP-complete.
l There is a quadratic algorithm to find a set of relations over
attributes A where
à Each is in 3NF
à The set preserves all dependencies in F, and
à The set correspond to a lossless decomposition of the universal
relation covering all of A. 5-33

You might also like