Faculty of Computer Science & Engineering
Ho Chi Minh City University of Technology
Data modeling
Dr. Vo Thi Ngoc Chau
(chauvtn@cse.hcmut.edu.vn)
Semester 1 2014-2015
Main references
[1] R. Elmasri, S. R. Navathe, Fundamentals of Database
Systems- 4th Edition, Pearson- Addison Wesley, 2003.
[2] H. G. Molina, J. D. Ullman, J. Widom, Database System
Implementation, Prentice-Hall, 2000.
[3] H. G. Molina, J. D. Ullman, J. Widom, Database Systems: The
Complete Book, Prentice-Hall, 2002
[4] A. Silberschatz, H. F. Korth, S. Sudarshan, Database System
Concepts 3rd Edition, McGraw-Hill, 1999.
[5] R. Ramakrishnan, J. Gehrke, Database Management Systems- 2rd
Edition, McGraw-Hill.
[6] M. P. Papazoglou, S. Spaccapietra, Z. Tari, Advances in ObjectOriented Data Modeling, MIT Press, 2000.
[7]. G. Simsion, Data Modeling: Theory and Practice, Technics
Publications, LLC, 2007.
Content
2.1. Concepts
2.2. Why is data modeling important?
2.3. Data modeling with entity-
relationship and relational models
2.4. Dependencies and normalization
2.5. Conclusion
3
2.1. Concepts
Data
modeling
Conceptual
data model
Entity relationship model
Representational
Relational data model
Database
data model
design
Relational database design
4
Data modeling
Modeling - Cambridge dictionary
Model = a representation of something, either
as a physical object which is usually smaller
than the real object, or as a simple description
of the object which might be used in calculations
Modeling = constructing a representation of
something, either as a physical object which is
usually smaller than the real object, or as a
simple description of the object which might be
used in calculations
Data modeling
Data modeling [7], pp. 31-32
formalizing and representing the data structures
of reality
a representation of the things of significance to
an enterprise and the relationships among those
things
Shoval and Frumermann, 1994, p28
Hay, 1996a
an attempt to capture the essence of things
both concrete and abstract
Keuffel, 1996
6
Data modeling
Data modeling [7], pp. 31-32
an abstract representation of the data about
entities, events, activities, and their associations
within an organization
McFadden, Hoffer et al., 1999
The core idea underlying all the definitions is
the same: a data model is used for describing
entities and their relationships within a core
domain.
Topi and Ramesh, 2002
7
Data modeling
Data modeling [7], pp. 31-32
Data modeling is generally viewed as a design
activity
an activity that involves the creation of
abstractions
Davydov, 1994
the art and science of arranging the structure
and relationship of data
Srinivasan and Teeni, 1990
McComb, 2004, p293
data modeling is a design discipline
Simsion and Witt, 2005, p7
Conceptual data model
A data model (aka semantic data model)
Provides concepts close to the way many users
perceive data
Provides the concepts essential for supporting
the application environment at a very high nonsystem-specific level
Used for a conceptual schema of a database
from data requirements
Example: entity-relationship model
9
Conceptual data model
Characteristics of a conceptual data model
Expressiveness
Simplicity
A small number of basic concepts that are distinct and orthogonal
in their meaning
Formality
Simple enough for an end user to use and understand an easy
diagrammatic notation
Minimality
Distinctions between data, relationships, constraints
Concepts must be formally defined state criteria for the validity
of a schema in the model
Unique interpretation
Complete and unambiguous semantics for each modeling construct
S. Navathe. Evolution of data modeling for databases. Communications of the ACM 35(9)(1992) 112-123.
10
Representational data model
A data model (aka implementation/logical data
model)
Provides concepts understood by end-users but not
too far removed from the way data is organized
within the computer
Hides some details of data storage but able to be
implemented on a computer system in a direct way
(in some DBMS)
Example: relational data model
11
Database design
Design the logical and physical structure of one or
more databases to accommodate the information
needs of the users in an organization for a defined
set of applications
The overall database design activity has to undergo
a systematic process called the design methodology,
whether the target database is managed by an
RDBMS, ODBMS, or ORDBMS, ...
The result of the design activity is a rigidly defined
database schema that cannot easily be modified
once the database is implemented.
12
Database design
Goals:
Satisfy the information content requirements of
the specified users and applications
Provide a natural and easy-to-understand
structuring of the information
Support processing requirements and any
performance objectives, such as response time,
processing time, and storage space
13
Database
design
Figure 2.1 Phases of database
design and implementation for
large databases
[1], pp. 368
14
Database design
Six main phases of the overall database
design and implementation process
Requirements collection and analysis
Conceptual database design
Choice of a DBMS
Data model mapping (also called logical
database design)
Physical database design
Database system implementation and tuning
15
Database design
Phase 1: Requirements Collection and Analysis
Identify the major application areas and user groups that will use the database
or whole work will be affected by the database
Study and analyze existing documentation concerning the applications
Study the current operating environment and planned use of the information
Analyze the types of transactions and their frequencies as well as of the flow of
information within the system
Study geographic characteristics regarding users, origin of transactions, destination of
reports,
Specify the input and output data for the transactions
Collect written responses to sets of questions from the potential database users
or user groups
Users priorities and the importance users place on various applications
Requirements from users and applications of the information system that
will interact with the database system
Time-consuming but crucial to the success of the information system
Know and analyze the expectations of the users and the intended uses of
16
the database in as much detail as possible
Database design
Phase 2: Conceptual Database Design
Phase 2a: Conceptual Schema Design
Examine the data requirements resulting from Phase 1
Produce a conceptual database schema in a DBMSindependent high-level data model
Phase 2b: Transaction Design
Design the characteristics of known database transactions
(applications) in a DBMS-independent way to ensure that
the database schema will include all the information
required by these transactions
Identify transactions input/output and functional
behavior
Group transactions into three categories: retrieval
transactions, update transactions, mixed transactions
17
Database design
Phase 2: Conceptual Database Design
Phase 2a: Conceptual Schema Design
Goal = a complete understanding of the database
structure, meaning (semantics), interrelationships, and
constraints
Identify: entity types, relationship types, attributes,
key attributes, cardinality and participation constraints
on relationships, weak entity types,
specialization/generalization hierarchies
Approaches
The centralized (or one-shot) schema design approach
The view integration approach
18
Database design
Phase 2a: Conceptual Schema Design
The centralized (or one-shot) schema design
approach
The requirements of the different applications and user
groups from Phase 1 are merged into a single set of
requirements before schema design begins.
A single schema corresponding to the merged set of
requirements is then designed.
The DBA is responsible for deciding how to merge the
requirements and for designing the conceptual schema
for the whole database
Once the conceptual schema is designed and finalized,
external schemas for the various user groups and
applications can be specified by the DBA.
19
Database design
Phase 2a: Conceptual Schema Design
The view integration approach
The requirements are not merged.
A schema (or view) is designed for each user group or
application based only on its own requirements.
Each user group or application
These schemas are merged or integrated into a global
conceptual schema for the entire database.
The DBA
The individual views can be reconstructed as external
schemas after view integration.
20
Database design
Phase 3: Choice of a DBMS
Technical factors
Storage structures and access paths, the user and
programmer interfaces, high-level query languages,
development tools, architectural options, supported by
DBMS
DBMS portability among different types of hardware
Non-technical factors
Suitability & type of the DBMS for the task
Cost: software acquisition cost, maintenance cost, hardware
acquisition cost, database creation and conversion cost,
personnel cost, training cost, operating cost
Availability of vendor services
21
Database design
Phase 4: Data Model Mapping (Logical
Database Design)
Create a conceptual schema and external
schemas in the data model of the selected DBMS
The result of this phase should be DDL
statements in the language of the chosen DBMS
that specify the conceptual and external level
schemas of the database system.
22
Database design
Phase 5: Physical Database Design
Restricted to choosing the most appropriate structures for the
database files from among the options offered by that DBMS
Choose specific storage structures and access paths for the
database files
Response time
Space utilization
Transaction throughput
Estimate record size and number of records in each database
file
Estimate the update and retrieval patterns for the file
cumulatively from all the transactions
Estimate the file growth, either in the record size because of
new attributes or in the number of records
23
Database design
Phase 6: Database System Implementation and Tuning
Create the database schemas and (empty) database files
Responsibility of the DBA in conjunction with the database
designers
Reformat the data for loading into the new database if needed
Load/populate with the data if needed
Implement database transactions referring to the conceptual
specifications of transaction, then write and test program code
with embedded DML commands
Responsibility of the application programmers
Database tuning continues as long as the database is in
existence, as long as performance problems are discovered,
and while the requirements keep changing.
24
Relational database design
Two main approaches
A top-down design technique
Designing a conceptual schema in a high-level data model
Mapping the conceptual schema into a relational database
schema
Analyzing each of the resulting relations based on the
functional dependencies and assigned primary keys
A bottom-up design technique (relational synthesis)
View relational database schema design strictly in terms of
functional and other types of dependencies specified on the
database attributes
Synthesize the relation schemas applying a normalization
algorithm
25
2.2. Why is data modeling important?
Leverage
Make programming simpler and cheaper
Poor data organization can be expensive to fix.
Conciseness
Take more directly to the heart of the business requirements
Data quality
Problems with data quality can be traced back to a lack of
consistency in:
Defining and interpreting data
Implementing mechanisms to enforce the definitions
A key role in achieving good data quality by establishing a
common understanding of what is to be held in each table
and column and how it is to be interpreted
G. C. Simsion, G. C. Witt, Data modeling essentials 3rd edition, Elsevier Inc, 2005, pp. 8-10.
26
2.3. Data modeling with entity-relationship and
relational models
Entity-relationship
Relational
Data
model
data model
model mapping
27
Database
design
Entity Relationship Model
P. P-S. Chen. The EntityRelationship Model Toward
a Unified View of Data. ACM
Transactions on Database
Systems 1(1)(March 1976)
9-36.
Figure 2.1 Phases of database
design and implementation for
large databases
[1], pp. 368
28
Entity Relationship Model
P. P-S. Chen. The Entity-Relationship Model Toward a Unified View of
Data. ACM Transactions on Database Systems 1(1)(March 1976) 9-36.
The entity-relationship model adopts the
more natural view that the real world
consists of entities and relationships.
The model can achieve a high degree of data
independence and is based on set theory
and relation theory.
The entity-relationship model can be used as
a basis for a unified view of data.
A special diagrammatric technique, the
entity-relationship diagram, is introduced as
a tool for database design.
29
Entity Relationship Model
P. P-S. Chen. The Entity-Relationship Model Toward a Unified View of
Data. ACM Transactions on Database Systems 1(1)(March 1976) 9-36.
Fig. 1. Analysis of data models using multiple levels
of logical views
30
Entity Relationship Model
P. P-S. Chen. The Entity-Relationship Model Toward a Unified View of
Data. ACM Transactions on Database Systems 1(1)(March 1976) 9-36.
The entity-relationship (ER) modeling
concepts
Entity types
Relationship types
Attributes
Keys
Structural constraints
31
Entity Relationship Model
P. P-S. Chen. The Entity-Relationship Model Toward a Unified View of
Data. ACM Transactions on Database Systems 1(1)(March 1976) 9-36.
Symbol
Meaning
Example
Employee
Entity type
Employees
Dependent
Weak entity type
Dependents of
each employee
works_on
Relationship type
Employee
works_on Project
dependents_of
Identifying
relationship of the
weak entity type
Dependents
dependents_of
Employees
32
Entity Relationship Model
P. P-S. Chen. The Entity-Relationship Model Toward a Unified View of
Data. ACM Transactions on Database Systems 1(1)(March 1976) 9-36.
Street
Symbol
Meaning
Example
Name
Attribute
Name of an
employee
EmployeeID
Key Attribute
Distinct identifier of
an employee
PhoneNumber
Multivalued
Attribute
Phone numbers of an
employee
Composite
Attribute
Address (Street,
District, City) of an
employee
Derived Attribute
Age of an employee
(derived from date of
birth)
District
Address
Age
City
33
Entity Relationship Model
P. P-S. Chen. The Entity-Relationship Model Toward a Unified View of
Data. ACM Transactions on Database Systems 1(1)(March 1976) 9-36.
Constraints
34
Figure 3.2. An ER schema diagram
for the COMPANY database
[1], pp. 54.
35
a. List the entity types
b. Is there a weak entity
type? If so, give its
name, partial key, and
identifying relationship
c.
What constraints do the
partial key and the
identifying relationship
of the weak entity type
specify in this diagram?
d. List the names of all
relationship types, and
specify the (min, max)
constraint on each
participation of an entity
type in a relationship
type.
Figure 3.18. An ER diagram for a BANK database schema
[1], pp. 81.
36
Enhanced Entity Relationship Modeling
Class/subclass relationships and type inheritance
The relationship between a superclass and any one of its subclasses
Specialization
Define a set of subclasses of an entity type
Establish additional specific attributes with each subclass
Establish additional specific relationship types between each
subclass and other entity types or other subclasses
Generalization
A reverse process of abstraction in which we suppress the
differences among several entity types
Identify their common features
Generalize them into a single superclass of which the original entity
types are special subclasses
37
partial specialization
total specialization
Figure 4.1. EER diagram notation to represent subclasses and specialization.
[1], pp. 87.
38
Figure 4.3. Generalization
[1], pp. 91.
39
- Disjoint (d): the subclasses of the specialization must be disjoint.
- Overlapping (o): the subclasses are not constrained to be disjoint,
their sets of entities may overlap.
Figure 4.5. EER diagram notation for overlapping (nondisjoint) specialization.
[1], pp. 93.
40
Figure 4.7. A specialization lattice
with multiple inheritance for a
UNIVERSITY database
[1], pp. 96.
a. List superclasses, subclasses
of each superclass
b. List class/subclass
relationships
c. Describe constraints on each
specialization
41
Database
design
A representational data
model supported by the
chosen DBMS
Relational Data Model
E. F. Codd. A Relational
Model of Data for Large
Shared Data Banks.
Communications of the ACM
13(6)(June, 1970) 377-387.
Figure 2.1 Phases of database
design and implementation for
large databases
[1], pp. 368
42
Data model
E. F. Codd. Data models in database management, ACM, 1980.
A combination of three following components
(1). A collection of data structure types (the building
blocks of any database that conforms to the model);
(2). A collection of operators or inferencing rules, which
can be applied to any valid instances of the data types
listed in (1), to retrieve or derive data from any parts
of those structures in any combinations desired;
(3). A collection of general integrity rules, which
implicitly or explicitly define the set of consistent
database states or changes of state or both -- these
rules may sometimes be expressed as insert-updatedelete rules
43
Relational Data Model
E. F. Codd. A Relational Model of Data for Large Shared Data Banks.
Communications of the ACM 13(6)(June, 1970) 377-387.
The term relation is used here in its accepted
mathematical sense.
A relational database based on the relational data
model is a collection of relations.
Given sets S1, S2, , Sn (not necessarily distinct),
R is a relation on these n sets if it is a set of ntuples each of which has its first element from S1,
its second element from S2, and so on.
Sj is the jth domain of R.
R is said to have degree n.
More concisely, R is a subset of the Cartesian
product S1xS2xxSn.
44
Relational Data Model
E. F. Codd. A Relational Model of Data for Large Shared Data Banks.
Communications of the ACM 13(6)(June, 1970) 377-387.
Fig. 1. A relation of degree 4
A relation of degree 4, called supply, which reflects the shipments-in-progress
of parts from specified suppliers to specified projects in specified quantities.
45
Relational Data Model
E. F. Codd. A Relational Model of Data for Large Shared Data Banks.
Communications of the ACM 13(6)(June, 1970) 377-387.
An n-ary relation R has the following
properties:
Each row represents an n-tuple of R.
The ordering of rows is immaterial.
All rows are distinct.
The ordering of columns is significant it
corresponds to the ordering S1, S2, , Sn of the
domains on which R is defined.
The significance of each column is partially
conveyed by labeling it with the name of the
corresponding domain.
46
Data model
E. F. Codd. Data models in database management, ACM, 1980.
A combination of three following components
(1). A collection of data structure types (the building
blocks of any database that conforms to the model);
(2). A collection of operators or inferencing rules, which
can be applied to any valid instances of the data types
listed in (1), to retrieve or derive data from any parts
of those structures in any combinations desired;
(3). A collection of general integrity rules, which
implicitly or explicitly define the set of consistent
database states or changes of state or both -- these
rules may sometimes be expressed as insert-updatedelete rules
47
Relational Data Model
E. F. Codd. A Relational Model of Data for Large Shared Data Banks.
Communications of the ACM 13(6)(June, 1970) 377-387.
Operations on relations
Usual set operations: union, intersection, minus
Projection ()
Select a subset of the tuples from a relation that satisfy a
selection condition
Cartesian product (cross product) (x)
Select certain columns of a relation (striking out the others)
Selection ()
Compatible relations
Combine tuples from two relations in a combinatorial fashion
Join (inner/outer join, equijoin, natural join)
Combine related tuples from two relations into single tuples
49
Relational Data Model
E. F. Codd. A Relational Model of Data for Large Shared Data Banks.
Communications of the ACM 13(6)(June, 1970) 377-387.
50
Relational Data Model
E. F. Codd. A Relational Model of Data for Large Shared Data Banks.
Communications of the ACM 13(6)(June, 1970) 377-387.
T R :S
51
Data model
E. F. Codd. Data models in database management, ACM, 1980.
A combination of three following components
(1). A collection of data structure types (the building
blocks of any database that conforms to the model);
(2). A collection of operators or inferencing rules, which
can be applied to any valid instances of the data types
listed in (1), to retrieve or derive data from any parts
of those structures in any combinations desired;
(3). A collection of general integrity rules, which
implicitly or explicitly define the set of consistent
database states or changes of state or both -- these
rules may sometimes be expressed as insert-updatedelete rules
52
Relational Data Model
E. F. Codd. A Relational Model of Data for Large Shared Data Banks.
Communications of the ACM 13(6)(June, 1970) 377-387.
Constraints inherent in the data model
Domain constraints
Key constraints
Constraints on nulls
Entity integrity constraints
Referential integrity constraints
53
Relational Data Model
E. F. Codd. A Relational Model of Data for Large Shared Data Banks.
Communications of the ACM 13(6)(June, 1970) 377-387.
Domain constraints
Within each tuple, the value of each attribute A
must be an atomic value from the domain dom(A).
Key constraints
1. Two distinct tuples in any state of the relation
cannot have identical values for (all) the attributes in
the key.
2. It is a minimal superkey that is, a superkey from
which we cannot remove any attributes and still have
the uniqueness constraint in condition 1 hold.
Superkey of the relation schema R specifies a uniqueness
constraint that no two distinct tuples in any state r of R
can have the same value for the superkey.
54
Relational Data Model
E. F. Codd. A Relational Model of Data for Large Shared Data Banks.
Communications of the ACM 13(6)(June, 1970) 377-387.
Key constraints
A relation schema may have more than one key.
Constraints on nulls
Specify whether null values are or are not
permitted on attributes
Entity integrity constraint
Candidate keys: primary key and secondary keys
No primary key value can be null.
Referential integrity constraint
Specify a referential integrity constraint between
the two relation schemas R1 and R2
55
Relational Data Model
E. F. Codd. A Relational Model of Data for Large Shared Data Banks.
Communications of the ACM 13(6)(June, 1970) 377-387.
Referential integrity constraint
A set of attributes FK (foreign key) in relation
schema R1 is a foreign key of R1 that references
relation R2 if it satisfies the following two rules:
The attributes in FK have the same domain(s) as the
primary key attributes PK of R2; the attributes FK are
said to reference or refer to the relation R2.
A value of FK in a tuple t1 of the current state r1(R1)
either occurs as a value of PK for some tuple t2 in the
current state r2(R2) or is null. In the former case, we
have t1[FK] = t2[PK], and we say that the tuple t1
references or refers to the tuple t2.
56
Relational Data Model
E. F. Codd. A Relational Model of Data for Large Shared Data Banks.
Communications of the ACM 13(6)(June, 1970) 377-387.
Relation Employee
Relation schema: EMPLOYEE (FNAME, MINIT, LNAME, SSN,
BDATE, ADDRESS, SEX, SALARY, SUPERSSN, DNO)
Attributes = FNAME, MINIT, LNAME, SSN,
Domain of Attribute SEX = {F, M}
Domain of SALARY = a set of positive integer numbers
Primary key = SSN
Foreign key = SUPERSSN
57
The SQL standard has gone through a number of revisions.
http://en.wikipedia.org/wiki/SQL (Accessed on 09/05/2012)
High-level database language on relational
databases: SQL-86, SQL-89, SQL-92, SQL-99,
58
SQL (structured query language)
Data Definition Language (DDL): Create,
Alter, Drop
Data Manipulation Language (DML): Select,
Insert, Update, Delete
Data Control Language (DCL): Commit,
Rollback, Grant, Revoke
59
SQL (structured query language)
CREATE TABLE TableName
{(colName dataType [NOT NULL] [UNIQUE]
[DEFAULT defaultOption]
[CHECK searchCondition] [,...]}
[PRIMARY KEY (listOfColumns),]
{[UNIQUE (listOfColumns),] [,]}
{[FOREIGN KEY (listOfFKColumns)
REFERENCES ParentTableName [(listOfCKColumns)],
[ON UPDATE referentialAction]
[ON DELETE referentialAction ]] [,]}
{[CHECK (searchCondition)] [,] })
60
SQL (structured query language)
SELECT [DISTINCT | ALL]
{* | [columnExpression [AS newName]] [,...] }
FROM TableName [alias] [, ...]
[WHERE condition]
[GROUP BY columnList]
[HAVING condition]
[ORDER BY columnList]
61
Database
design
Data model mapping
from a conceptual
schema based on the ER
model to a relational
database schema based
on the relational data
model
Figure 2.1 Phases of database
design and implementation for
large databases
[1], pp. 368
62
Data model mapping
Table 7.1. Correspondence between ER and Relational Models
[1], pp. 198.
63
Data model mapping
Mapping of specialization or generalization
Convert each specialization with m subclasses
{S1, S2, , Sm} and (generalized) superclass C,
where the attributes of C are {k, a1, , an} and
k is the (primary) key, into relation schemas
using one of the four following options:
Multiple relations Superclass and subclasses
Multiple relations Subclass relations only (total)
Single relation with one type attribute (disjoint)
Single relation with multiple type attributes
(overlapping)
64
Using Multiple relations Subclass relations only (total)
Figure 7.4, pp. 200.
Figure 4.3. Generalization
[1], pp. 91.
65
Figure 7.4. Using Single relation with multiple type attributes with Boolean fields MFlag
and PFlag.
[1], pp. 200.
66
Data model mapping
Mapping algorithms
Automatically create a relational schema from a
conceptual schema design
Notes on data model mapping
Slightly different for other post-relational data
models as compared to the relational data model
Constraints on databases
Inherent model-based constraints
Schema-based constraints
Application-based constraints
67
68
69
70
71
Query
Example: For every project located in Stafford,
retrieve the project number, the controlling
department number and the department managers
last name, address and birthdate.
SQL query:
SELECT P.NUMBER, P.DNUM, E.LNAME, E.ADDRESS, E.BDATE
FROM PROJECT AS P, DEPARTMENT AS D, EMPLOYEE AS E
WHERE P.DNUM=D.DNUMBER AND D.MGRSSN=E.SSN AND
P.PLOCATION=STAFFORD;
Relational algebra expression:
PNUMBER, DNUM, LNAME, ADDRESS, BDATE(((PLOCATION=STAFFORD(PROJECT))
DNUM=DNUMBER (DEPARTMENT))
MGRSSN=SSN
(EMPLOYEE))72
Query: For each project on which more than two employees
work, retrieve the project number, project name, and the
number of employees who work on that project.
SELECT
PNUMBER, PNAME, COUNT (*) AS PROJ_NUMBER
FROM
PROJECT, WORKS_ON
WHERE
PNUMBER=PNO
GROUP BY
PNUMBER, PNAME
HAVING
COUNT (*) > 2
Temp (PNUMBER, PNAME, PROJ_NUMBER) PNUMBER, PNAME
PNUMBER=PNO(PROJECT
Result
COUNT *
X WORKS_ON))
PROJ_NUMBER > 2
(Temp)
73
Query: retrieve the last name and first name of each employee
whose salary is greater than the max salary in department 5.
LNAME, FNAME
SALARY > C
EMPLOYEE
MAX SALARY
DNO = 5
EMPLOYEE
74
Data model mapping
Informal measures of quality for relation schema
design
Semantics of the attributes
Reducing the redundant values in tuples
Storage space & update anomalies
Reducing the null values in tuples
How to interpret the attribute values stored in a tuple of the
relation how the attribute values in a tuple relate to one another
Multiple interpretations of nulls
Disallowing the possibility of generating spurious tuples
Join relations with equality conditions on attributes that are either primary
keys or foreign keys in a way that guarantees that no spurious tuples are
generated
75
2.4. Dependencies and normalization
Functional dependency (FD)
The single most important concept in relational schema
design theory
A constraint between two sets of attributes from the
database
A property of the semantics or meaning of the
attributes.
Examples
SSN ENAME
The value of an employees social security number (SSN)
uniquely determines the employee name (ENAME)
PNUMBER {PNAME, PLOCATION}
{SSN, PNUMBER} HOURS
76
2.4. Dependencies and normalization
Functional dependency (FD)
A constraint between two sets of attributes from the database
A constraint on the possible tuples that can form a relation state
r of a relation schema R = {A1, A2, , An}
Denoted by XY
X, Y: two sets of attributes that are subsets of R
For any two tuples t1 and t2 in r that have t1[X] = t2[X], they
must also have t1[Y] = t2[Y].
The values of Y of a tuple in r depend on (are determined by)
the values of X.
The values of X of a tuple uniquely (functionally) determine
the values of Y.
If XY in R, this does not say whether or not YX in R.
77
2.4. Dependencies and normalization
Armstrongs inference rules (IR1-3) and
others (IR4-6)
78
2.4. Dependencies and normalization
Closure F+ of a set of functional dependencies
F specified on relation schema R
F+, the closure of F, is formally the set of all
dependencies that include F as well as all
dependencies that can be inferred from F.
79
2.4. Dependencies and normalization
Closure F+ of a set of functional dependencies F
F = {SSN{ENAME, BDATE, ADDRESS, DNUMBER},
DNUMBER{DNAME, DMGRSSN}}
F+ = {SSN{ENAME, BDATE, ADDRESS, DNUMBER},
DNUMBER{DNAME, DMGRSSN}, SSNSSN, ,
DNUMBER DNAME, SSN{DNAME, DMGRSSN}}
80
2.4. Dependencies and normalization
Closure F+ of a set of functional dependencies F
F = {{SSN, PNUMBER}HOURS, SSNENAME,
PNUMBER{PNAME, PLOCATION}}
F+ = ???
81
2.4. Dependencies and normalization
Closure X+ of a set of attributes X under F
where X is a set of attributes that appears as a
left-hand side of a functional dependency in F
X+, the closure of X under F, is the set
of attributes that are functionally
determined by X based on F.
82
2.4. Dependencies and normalization
Closure X+ of a set of attributes X under F
where X is a set of attributes that appears as a
left-hand side of a functional dependency in F
83
2.4. Dependencies and normalization
Closure X+ of a set of attributes X under F
where X is a set of attributes that appears as a
left-hand side of a functional dependency in F
F = {{SSN, PNUMBER}HOURS, SSNENAME,
PNUMBER{PNAME, PLOCATION}}
{SSN, PNUMBER}+ = {SSN, PNUMBER, HOURS, ENAME,
PNAME, PLOCATION}
{SSN}+ = {SSN, ENAME}
{PNUMBER}+ = {PNUMBER, PNAME, PLOCATION}
84
2.4. Dependencies and normalization
Closure X+ of a set of attributes X under F
where X is a set of attributes that appears as a
left-hand side of a functional dependency in F
F = {{A, B}{C}, {A}{D, E}, {B}{M}, {M}{G, H},
{D}{I, J}}
{A, B}+ = ???
{A}+ = ???
{B}+ = ???
{M}+ = ???, and {D}+ = ???
85
2.4. Dependencies and normalization
Identifying a key K of R based on a set of
given functional dependencies F
86
2.4. Dependencies and normalization
Identifying a key K of R based on a set of
given functional dependencies F
R = (A, B, C, D, E, G)
F = {CE, AEG, EAD}
K = ???
87
2.4. Dependencies and normalization
Identifying a key K of R based on a set of given
functional dependencies F
R = (A, B, C, D, E, G)
F = {CE, AEG, EAD}
K = {A, B, C, D, E, G}
(K-{A})+ = {B, C, D, E, G, A} K = {B, C, D, E, G}
(K-{B})+ = {C, D, E, G, A} K = {B, C, D, E, G}
(K-{C})+ = {B, D, E, G, A} K = {B, C, D, E, G}
(K-{D})+ = {B, C, E, G, A, D} K = {B, C, E, G}
(K-{E})+ = {B, C, G, E, A, D} K = {B, C, G}
(K-{G})+ = {B, C, E, A, D, G} K = {B, C}
88
2.4. Dependencies and normalization
Identifying a key K of R based on a set of
given functional dependencies F
R = (A, B, C, E, G)
F = {AB, BCE, EGA}
K = ???
89
2.4. Dependencies and normalization
Identifying all keys K of R based on a set of given
functional dependencies F
Find N = U - fFright(f)
N is a set of the attributes not in the right side of any functional
dependency in F. U is a set of all the attributes of R.
Find N+, the closure of N under F
If N+ = U then there is only one key K = N.
Otherwise, find D = fFright(f) - fFleft(f)
D is a set of the attributes only in the right side of the functional
dependencies in F.
Find L = U (N+D)
L is a set of the attributes neither in the right side nor functionally
determined by N.
Let Li be any subset of L. Find (NLi)+ under F.
If (NLi)+ = U then K = NLi.
Ki, Kj such that Ki Kj, remove Kj.
90
2.4. Dependencies and normalization
Identifying all keys K of R based on a set of
given functional dependencies F
R = (A, B, C, E, G)
F = {AB, BCE, EGA}
All keys K ???
91
2.4. Dependencies and normalization
R = (A, B, C, E, G)
F = {AB, BCE, EGA}
N = U - fFright(f) = {C, G}
N+ = {C, G}
D = fFright(f) - fFleft(f) =
L = U (N+D) = {A, B, E}
Subsets of L: L1 = {A}, L2 = {B}, L3 = {E}, L4 =
{A, B}, L5 = {A, E}, L6 = {B, E}, L7 = {A, B, E}
(NL1)+ = {C, G, A, B, E} K1 = {C, G, A}
(NL2)+ = {C, G, B, E, A} K2 = {C, G, B}
(NL3)+ = {C, G, E, A, B} K3 = {C, G, E}
92
2.4. Dependencies and normalization
Identifying all keys K of R based on a set of
given functional dependencies F
R = (A, B, C, E, G)
F = {ABC, CGE, BG, EA}
All keys K ???
93
2.4. Dependencies and normalization
Membership of a functional dependency XY
with respect to a set of functional
dependencies F, i.e. XY is inferred from F,
denoted by F XY
A functional dependency XY is inferred from a set of
dependencies F specified on R if XY holds in every
legal relation state r of R; that is, whenever r satisfies
all the dependencies in F, XY also holds in r.
94
2.4. Dependencies and normalization
Membership of a functional dependency XY
with respect to a set of functional
dependencies F, i.e. XY is inferred from F,
denoted by F XY
Check if F XY
Find X+, the closure of X under F
If Y X+ then F XY
95
2.4. Dependencies and normalization
Membership of a functional dependency XY
with respect to a set of functional
dependencies F, i.e. XY is inferred from F,
denoted by F XY
F = {A C, AC D, E AD, E H}
F A CD
A+ under F = {A, C, D} {C, D} F A CD
F E AH ???
96
2.4. Dependencies and normalization
A set of functional dependencies F is said to
cover another set of functional dependencies
E if every FD in E is also in F+; that is, if every
dependency in E can be inferred from F.
Alternatively, E is covered by F.
Two sets of functional dependencies E and F
are equivalent if E+ = F+; that is, every
dependency in E can be inferred from F and
every dependency in F can be inferred from E;
that is also, E covers F and F covers E.
97
2.4. Dependencies and normalization
F = {A C, AC D, E AD, E H}
G = {A CD, E AH}
Are F and G equivalent?
* Check if F covers G.
* Check if G covers F.
98
2.4. Dependencies and normalization
The minimal cover F of a set of functional
dependencies E is a set of functional
dependencies that satisfies the property that
every dependency in E is in the closure F+ of
F. In addition, this property is lost if any
dependency from the set F is removed; F
must have no redundancies in it, and the
dependencies in F are in a standard form.
99
2.4. Dependencies and normalization
The minimal cover F of a set of functional
dependencies E
1.
Every dependency in F has a single attribute for its righthand side.
2.
We cannot replace any dependency X A in F with a
dependency Y A, where Y is a proper subset of X, and still
have a set of dependencies that is equivalent to F.
3.
We cannot remove any dependency from F and still have a
set of dependencies that is equivalent to F.
100
2.4. Dependencies and normalization
The minimal cover F of a set of functional dependencies E
101
2.4. Dependencies and normalization
The minimal cover F of a set of functional
dependencies E
E = {A B, ABCD I, IJ G, IJ H, ACDJ GI}
The minimal cover F of E ???
102
2.4. Dependencies and normalization
Normalization
A process of analyzing the given relation
schemas based on their FDs and primary keys to
achieve the desirable properties of:
(1) minimizing redundancy
(2) minimizing the insertion, deletion, and update
anomalies
Normal form of a relation
The highest normal form condition that the
relation meets the degree to which it has
been normalized
103
2.4. Dependencies and normalization
The process of normalization through
decomposition produces the relational
schemas that have:
The lossless join or non-additive join property
No spurious tuple generation problem occurs.
The dependency preservation property
Each FD is represented in some individual relation.
104
2.4. Dependencies and normalization
Table 10.1. Summary of normal forms based on primary keys and
corresponding normalization ([1], pp. 321)
105
2.4. Dependencies and normalization
Pure relations based on the relational data
model are all in 1NF.
Relations in 2NF have non-key attributes
functionally dependent on the whole
primary key.
Relations in 3NF have non-key attributes
only functionally dependent on the whole
primary key.
106
Primary key = {SSN, PNUMBER}
Non-key attributes = HOURS, ENAME, PNAME, PLOCATION
FD1: {SSN, PNUMBER} HOURS
FD2: SSN ENAME
FD3: PNUMBER {PNAME, PLOCATION}
Functionally dependent on
a part of the primary key 107
Primary key = SSN
Non-key attributes = ENAME, BDATE, ADDRESS,
DNUMBER, DNAME, DMGRSSN
FD1: SSN {ENAME, BDATE, ADDRESS, DNUMBER}
FD2: DNUMBER {DNAME, DMGRSSN}
Non-key attributes are functionally
dependent on another non-key
attribute.
108
2.4. Dependencies and normalization
Relation: LOTS (PROPERTY_ID#,
COUNTY_NAME, LOT#, AREA, PRICE,
TAX_RATE)
Primary key: PROPERTY_ID#
Secondary key: {COUNTY_NAME, LOT#}
Functional dependencies
FD1: PROPERTY_ID# {COUNTY_NAME, LOT#, AREA,
PRICE, TAX_RATE}
FD2: {COUNTY_NAME, LOT#} {PROPERTY_ID#, AREA,
PRICE, TAX_RATE}
FD3: COUNTY_NAME TAX_RATE
FD4: AREA PRICE
109
Is LOTS really in 1NF?
Normalization is carried out as follows. Determine LOTS1, LOTS2,
LOTS1A, and LOTS1B.
Figure 10.11 (d). Summary of the progressive normalization of LOTS.
[1], pp. 322.
110
2.4. Dependencies and normalization
Relation: LOTS (PROPERTY_ID#,
COUNTY_NAME, LOT#, AREA, PRICE,
TAX_RATE)
Primary key: PROPERTY_ID#
Secondary key: {COUNTY_NAME, LOT#}
Functional dependencies
FD1: PROPERTY_ID# {COUNTY_NAME, LOT#, AREA,
PRICE, TAX_RATE}
FD2: {COUNTY_NAME, LOT#} {PROPERTY_ID#, AREA,
PRICE, TAX_RATE}
FD3: COUNTY_NAME TAX_RATE
FD4: AREA PRICE
2NF
111
2.4. Dependencies and normalization
Relation: LOTS1 (PROPERTY_ID#, COUNTY_NAME, LOT#, AREA,
PRICE) -- in 2NF
Primary key: PROPERTY_ID#
Secondary key: {COUNTY_NAME, LOT#}
Functional dependencies
FD1: PROPERTY_ID# {COUNTY_NAME, LOT#, AREA, PRICE}
FD2: {COUNTY_NAME, LOT#} {PROPERTY_ID#, AREA, PRICE}
FD4: AREA PRICE
3NF
Relation: LOTS2 (COUNTY_NAME, TAX_RATE) -- in 3NF
Primary key: COUNTY_NAME
FD: COUNTY_NAME TAX_RATE
112
2.4. Dependencies and normalization
Relation: LOTS1A (PROPERTY_ID#, COUNTY_NAME, LOT#,
AREA) -- in 3NF
Primary key: PROPERTY_ID#
Secondary key: {COUNTY_NAME, LOT#}
Functional dependencies
FD1: PROPERTY_ID# {COUNTY_NAME, LOT#, AREA}
FD2: {COUNTY_NAME, LOT#} {PROPERTY_ID#, AREA}
Relation: LOTS1B (AREA, PRICE) -- in 3NF
Primary key: AREA
FD: AREA PRICE
Relation: LOTS2 (COUNTY_NAME, TAX_RATE) -- in 3NF
Primary key: COUNTY_NAME
FD: COUNTY_NAME TAX_RATE
113
2.4. Dependencies and normalization
Boyce-Codd normal form (BCNF)
A simpler form of 3NF, but stricter than 3NF
Every relation in BCNF is also in 3NF; however, a
relation in 3NF is not necessarily in BCNF.
For a relation in 3NF, each non-key attribute must:
Fully functionally dependent on every key
Nontransitively dependent on every key
Not functionally dependent on other nonkey attributes
For a relation in BCNF, all key/non-key attributes must be
functionally dependent on superkeys.
XA holds in R => X is a superkey of R.
XA holds in R and X is not a superkey of R => R is not
114
in BCNF.
2.4. Dependencies and normalization
Figure 10.12. (a) BCNF normalization of LOTS1A
[1], pp. 325.
FD5: AREA COUNTY_NAME
AREA is not a superkey of LOTS1A.
LOTS1A is in 3NF; but not in BCNF.
115
2.4. Dependencies and normalization
Candidate key = {STUDENT, COURSE}
FD1: {STUDENT, COURSE} INSTRUCTOR
FD2: INSTRUCTOR COURSE
Normalization???
116
2.4. Dependencies and normalization
Multivalued dependencies
A multivalued dependency X
Y specified on
relation schema R, where X and Y are both
subsets of R, specifies the following constraint on
any relation state r of R: if two tuple t1 and t2
exist in r such that t1[X] = t2[X], then two tuples
t3 and t4 should also exist in r with the following
properties, where Z is used to denote (R-(XUY)):
t3[X] = t4[X] = t1[X] = t2[X]
t3[Y] = t1[Y] and t4[Y] = t2[Y]
t3[Z] = t2[Z] and t4[Z] = t1[Z]
117
2.4. Dependencies and normalization
Multivalued dependencies
ENAME
PNAME
ENAME
DNAME
No association between PNAME and DNAME
118
2.4. Dependencies and normalization
Multivalued dependencies (MVDs)
A multivalued dependency X
Y is a trivial MVD
if:
(a) Y is a subset of X
(b) X U Y = R
A trivial MVD does not specify any significant or
meaningful constraint on R.
A nontrivial MVD satisfies neither (a) nor (b).
119
2.4. Dependencies and normalization
Fourth normal form
A relation schema R is in 4NF with respect to a
set of dependencies (that includes functional
dependencies and multivalued dependencies) if,
for every nontrivial multivalued dependency X
in F+, X is a superkey for R.
120
2.5. Conclusion
Data modeling
Database design process: 6 phases
Conceptual database design
Choice of a DBMS a representational data model
Entity-relationship model
Relational data model
Data model mapping
Functional/multivalued dependencies
Normalization: 1NF, 2NF, 3NF, BCNF, 4NF
121
Questions ???
122
Review - Chapter 12 in [1]
12.1. What are the six phases of database design? Discuss each phase.
12.2. Which of the six phases are considered the main activities of the database
design process itself? Why?
12.3. Why is it important to design the schemas and applications in parallel?
12.4. Why is it important to use an implementation-independent data model during
conceptual schema design? What models are used in current design tools? Why?
12.5. Discuss the importance of Requirements Collection and Analysis.
12.7. Discuss the characteristics that a data model for conceptual schema design
should possess.
12.8. Compare and contrast the two main approaches to conceptual schema design.
12.9. Discuss the strategies for designing a single conceptual schema from its
requirements.
12.10. What are the steps of the view integration approach to conceptual schema
design? What are the difficulties during each step?
12.13. Discuss the factors that influence the choice of a DBMS package for the
information system of an organization.
12.14. What is system-independent data model mapping? How is it different from
system-dependent data model mapping?
12.15. What are the important factors that influence physical database design?
123
Review - Chapters 3, 4 in [1]
3.1. Discuss the role of a high-level data model in the database design process.
3.2. List the various cases where use of a null value would be appropriate.
3.3. Define the following terms: entity, attribute, attribute value, relationship
instance, composite attribute, multivalued attribute, derived attribute, complex
attribute, key attribute, value set (domain).
3.4. What is an entity type? What is an entity set? Explain the differences among an
entity, an entity type, and an entity set.
3.6. What is a relationship type? Explain the differences among a relationship
instance, a relationship type, and a relationship set.
3.7. What is a participation role? When is it necessary to use role names in the
description of relationship types?
3.12. When is the concept of a weak entity used in data modeling? Define the terms
owner entity type, weak entity type, identifying relationship type, and partial key.
4.1. What is a subclass? When is a subclass needed in data modeling?
Define the following terms: superclass of a subclass, superclass/subclass
relationship, is-a relationship, specialization, generalization, category, specific (local)
attributes) specific relationships.
4.6. Discuss the two main types of constraints on specializations and
generalizations.
124
Review - Chapter 5 in [1]
5.1. Define the following terms: domain, attribute, n-tuple,
relation schema, relation state, degree of a relation, relational
database schema, relational database state.
5.2. Why are tuples in a relation not ordered?
5.3. Why are duplicate tuples not allowed in a relation?
5.4. What is the difference between a key and a superkey?
5.8. Discuss the entity integrity and referential integrity
constraints. Why is each considered important?
5.9. Define foreign key. What is this concept used for?
125
Review - Chapter 10 in [1]
10.1. Discuss attribute semantics as an informal measure of goodness for a relation
schema.
10.2. Discuss insertion, deletion, and modification anomalies. Why are they considered
bad? Illustrate with examples.
10.3. Why should nulls in a relation be avoided as far as possible? Discuss the problem
of spurious tuples and how we may prevent it.
10.4. State the informal guidelines for relation schema design that we discussed.
Illustrate how violation of these guidelines may be harmful.
10.5. What is a functional dependency? What are the possible sources of the information
that defines the functional dependencies that hold among the attributes of a relation
schema?
10.6. Why can we not infer a functional dependency automatically from a particular
relation state?
10.12. What does the term unnormalized relation refer to? How did the normal forms
develop historically from first normal form up to Boyce-Codd normal form?
10.13. Define first, second, and third normal forms when only primary keys are
considered. How do the general definitions of 2NF and 3NF, which consider all keys of a
relation, differ from those that consider only primary keys?
10.14. What undesirable dependencies are avoided when a relation is in 2NF?
10.15. What undesirable dependencies are avoided when a relation is in 3NF?
10.16. Define Boyce-Codd normal form. How does it differ from 3NF? Why is it
126
considered a stronger form of 3NF?