Module 3-2
SQL DML (Data Manipulation Language), Physical Data Organization
SYLLABUS
• SQL DML (Data Manipulation Language)
▫ SQL queries on single and multiple tables, Nested queries
(correlated and non-correlated), Aggregation and grouping,
Views, assertions, Triggers, SQL data types.
• Physical Data Organization
▫ Review of terms: physical and logical records, blocking factor,
pinned and unpinned organization. Heap files,
Indexing, Single level indices, numerical examples,
Multi-level-indices, numerical examples, B-Trees & B+-Trees
(structure only, algorithms not required), Extendible
Hashing, Indexing on multiple keys – grid files
Data-manipulation language (DML)
• The SQL DML provides the ability to query
information from the database and to insert
tuples into, delete tuples from, and modify
tuples in the database.
▫ Integrity
□ The SQL DDL includes commands for specifying
integrity constraints that the data stored in the
database must satisfy. Updates that violate
integrity constraints are disallowed.
▫ View definition
□ The SQL DDL includes commands for defining
views.
▫ Transaction control
Basic Retrieval Queries in SQL
• SQL has one basic statement for retrieving information from a
database: the SELECT statement
• This is not the same as the SELECT operation of the relational
algebra
• Important distinction between SQL and the formal relational
model:
▫ SQL allows a table (relation) to have two or more tuples that are
identical in all their attribute values
▫ Hence, an SQL relation (table) is a multi-set (sometimes called a
bag) of tuples; it is not a set of tuples
▫ SQL relations can be constrained to be sets by
specifying PRIMARY KEY or UNIQUE attributes, or by using the
DISTINCT option in a query
SELECT <attribute list>
FROM <table list>
WHERE <condition>
• <attribute list>
▫ is a list of attribute names whose values
are to be retrieved by the query
• <table list>
▫ is a list of the relation names required to
process the query
• <condition>
▫ is a conditional (Boolean) expression that
identifies the tuples to be retrieved by
the query
PREPARED BY SHARIKA T R, SNGCE
Q0. Retrieve the birth date and address of the
employee(s) whose name is ‘John B.
Smith’.
Q1. Retrieve the name and address of all
employees who work for the ‘Research’
department.
Q2. For every project located in ‘Stafford’, list the
project number, the controlling department number,
and the department manager’s last name, address,
and birth date.
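The three queries above can be tried out with a minimal sketch like the following, which uses Python's built-in `sqlite3` module. The table layout follows the usual COMPANY example schema, but the exact column names (`Minit`, `Mgr_ssn`, `Plocation`, `Dnum`) and all sample rows are assumptions made for illustration, not taken from the slides.

```python
import sqlite3

# Hypothetical miniature COMPANY database (schema and rows are assumptions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DEPARTMENT (Dname TEXT, Dnumber INTEGER PRIMARY KEY, Mgr_ssn TEXT);
CREATE TABLE EMPLOYEE (Fname TEXT, Minit TEXT, Lname TEXT, Ssn TEXT PRIMARY KEY,
                       Bdate TEXT, Address TEXT, Dno INTEGER);
CREATE TABLE PROJECT (Pname TEXT, Pnumber INTEGER PRIMARY KEY,
                      Plocation TEXT, Dnum INTEGER);
INSERT INTO DEPARTMENT VALUES
  ('Research', 5, '333445555'), ('Administration', 4, '987654321');
INSERT INTO EMPLOYEE VALUES
  ('John', 'B', 'Smith', '123456789', '1965-01-09', '731 Fondren, Houston, TX', 5),
  ('Franklin', 'T', 'Wong', '333445555', '1955-12-08', '638 Voss, Houston, TX', 5),
  ('Jennifer', 'S', 'Wallace', '987654321', '1941-06-20', '291 Berry, Bellaire, TX', 4);
INSERT INTO PROJECT VALUES
  ('ProductX', 1, 'Bellaire', 5), ('Newbenefits', 30, 'Stafford', 4);
""")

# Q0: a selection condition on a single table.
q0 = conn.execute("""
    SELECT Bdate, Address
    FROM EMPLOYEE
    WHERE Fname = 'John' AND Minit = 'B' AND Lname = 'Smith'
""").fetchall()

# Q1: Dname = 'Research' is the selection condition, Dnumber = Dno the join condition.
q1 = sorted(conn.execute("""
    SELECT Fname, Lname, Address
    FROM EMPLOYEE, DEPARTMENT
    WHERE Dname = 'Research' AND Dnumber = Dno
""").fetchall())

# Q2: two join conditions (project -> department -> manager) and one selection.
q2 = conn.execute("""
    SELECT Pnumber, Dnum, Lname, Address, Bdate
    FROM PROJECT, DEPARTMENT, EMPLOYEE
    WHERE Dnum = Dnumber AND Mgr_ssn = Ssn AND Plocation = 'Stafford'
""").fetchall()

print(q0)
print(q1)
print(q2)
```

Each query has the SELECT-FROM-WHERE shape described earlier; only the number of tables and conditions changes.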
Ambiguous Attribute Names, Aliasing, Renaming, and Tuple Variables
• In SQL, we can use the same name for two (or
more) attributes as long as the attributes are in
different relations
• If a multitable query refers to two or more attributes
with the same name, we must qualify the attribute
name with the relation name to prevent ambiguity
• This is done by prefixing the relation name to the
attribute name and separating the two by a period
• Suppose the Dno and Lname attributes of the
EMPLOYEE relation were called Dnumber and Name, and
the Dname attribute of DEPARTMENT was also called
Name. To prevent ambiguity, Q1
should be rephrased as Q1A
Q8. For each employee, retrieve the employee’s
first and last name and the first and last
name of his or her immediate supervisor.
• Alternative relation names E and S are called
aliases or tuple variables, for the EMPLOYEE
relation.
• An alias can follow the keyword AS
• It is also possible to rename the relation
attributes within the query in SQL by giving
them aliases.
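A query like Q8 needs the EMPLOYEE relation twice, once in the role of employee and once in the role of supervisor. The sketch below shows one way this might look with the aliases E and S, run through Python's `sqlite3`. The `Super_ssn` column, the output names `Super_fname` / `Super_lname`, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical EMPLOYEE table; Super_ssn refers to the supervisor's Ssn.
CREATE TABLE EMPLOYEE (Fname TEXT, Lname TEXT, Ssn TEXT PRIMARY KEY, Super_ssn TEXT);
INSERT INTO EMPLOYEE VALUES
  ('James', 'Borg', '888665555', NULL),
  ('Franklin', 'Wong', '333445555', '888665555'),
  ('John', 'Smith', '123456789', '333445555');
""")

# E and S are two tuple variables ranging over the same relation.
# AS in the SELECT clause renames the supervisor attributes in the result.
cursor = conn.execute("""
    SELECT E.Fname, E.Lname, S.Fname AS Super_fname, S.Lname AS Super_lname
    FROM EMPLOYEE AS E, EMPLOYEE AS S
    WHERE E.Super_ssn = S.Ssn
""")
columns = [d[0] for d in cursor.description]
rows = sorted(cursor.fetchall())

print(columns)
print(rows)
```

The employee with a NULL `Super_ssn` does not appear in the result, because the join condition `E.Super_ssn = S.Ssn` is never true for that tuple.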
Unspecified WHERE Clause and Use of the Asterisk
• A missing WHERE clause indicates no condition
on tuple selection;
▫ hence, all tuples of the relation specified in the
FROM clause qualify and are selected for the query
result
• If more than one relation is specified in the
FROM clause and there is no WHERE clause,
then the CROSS PRODUCT (all possible tuple
combinations) of these relations is selected
Q9 and Q10. Select all EMPLOYEE Ssns (Q9)
and all combinations of EMPLOYEE Ssn and
DEPARTMENT Dname (Q10) in the
database.
• To retrieve all the attribute values of the
selected tuples, we do not have to list the
attribute names explicitly in SQL;
• we just specify an asterisk (*), which
stands for all the attributes
Q1C: Retrieves all the attribute values of
any EMPLOYEE who works in DEPARTMENT number 5
Q1D: Retrieves all the attributes of an EMPLOYEE and
the attributes of the DEPARTMENT in which he or she
works for every employee of the ‘Research’ department
Q10A: Specifies the CROSS PRODUCT of the EMPLOYEE and
DEPARTMENT relations
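The effect of leaving out the WHERE clause, and of the asterisk, can be checked with a small sketch such as the one below (Python `sqlite3`). The three-column EMPLOYEE table and the sample rows are simplifying assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, simplified tables: 3 employees and 2 departments.
CREATE TABLE EMPLOYEE (Lname TEXT, Ssn TEXT PRIMARY KEY, Dno INTEGER);
CREATE TABLE DEPARTMENT (Dname TEXT, Dnumber INTEGER PRIMARY KEY);
INSERT INTO EMPLOYEE VALUES
  ('Smith', '123456789', 5), ('Wong', '333445555', 5), ('Wallace', '987654321', 4);
INSERT INTO DEPARTMENT VALUES ('Research', 5), ('Administration', 4);
""")

# Q9: no WHERE clause, so every EMPLOYEE tuple qualifies.
q9 = conn.execute("SELECT Ssn FROM EMPLOYEE").fetchall()

# Q10: two relations and no WHERE clause give the cross product (3 x 2 tuples).
q10 = conn.execute("SELECT Ssn, Dname FROM EMPLOYEE, DEPARTMENT").fetchall()

# Q1C: the asterisk stands for all attributes of the selected tuples.
q1c = conn.execute("SELECT * FROM EMPLOYEE WHERE Dno = 5").fetchall()

print(len(q9), len(q10), q1c)
```

A forgotten join condition is a common source of unexpectedly large results: the cross product grows as the product of the table sizes.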
Tables as Sets in SQL
• SQL usually treats a table not as a set but rather as a multiset;
▫ duplicate tuples can appear more than once in a table, and
in the result of a query.
• SQL does not automatically eliminate duplicate tuples in the
results of queries, for the following reasons
▫ Duplicate elimination is an expensive operation.
One way to implement it is to sort the tuples first and then
eliminate duplicates.
▫ The user may want to see duplicate tuples in the result of
a query.
▫ When an aggregate function is applied to tuples, in most
cases we do not want to eliminate duplicates
DISTINCT Keyword
• To eliminate duplicate tuples from the result of
an SQL query, we use the keyword DISTINCT
in the SELECT clause
• only distinct tuples should remain in the result
• a query with SELECT DISTINCT eliminates
duplicates, whereas a query with SELECT ALL
does not.
• SELECT with neither ALL nor DISTINCT is
equivalent to SELECT ALL
Q11: Retrieves the salary of every employee without DISTINCT
Q11A: Retrieves the salary of every employee using the keyword DISTINCT
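The difference between Q11 and Q11A can be seen with a short sketch like this one (Python `sqlite3`); the salary values are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical data: two employees share the salary 25000.
CREATE TABLE EMPLOYEE (Ssn TEXT PRIMARY KEY, Salary INTEGER);
INSERT INTO EMPLOYEE VALUES
  ('123456789', 30000), ('333445555', 40000),
  ('999887777', 25000), ('987987987', 25000);
""")

# Q11: SELECT ALL (the default) keeps duplicate salary values.
q11 = sorted(conn.execute("SELECT ALL Salary FROM EMPLOYEE").fetchall())

# Q11A: SELECT DISTINCT keeps one copy of each salary value.
q11a = sorted(conn.execute("SELECT DISTINCT Salary FROM EMPLOYEE").fetchall())

print(q11)
print(q11a)
```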
UNION, EXCEPT and INTERSECT
• SQL has directly incorporated some of the set operations:
set union (UNION), set difference (EXCEPT),
and set intersection (INTERSECT).
• The relations resulting from these set operations are sets
of tuples; that is, duplicate tuples are eliminated from
the result.
• These set operations apply only to union-compatible
relations, so we must make sure that the two relations
on which we apply the operation have the same
attributes and that the attributes appear in the same
order in both relations
Q4. Make a list of all project numbers for projects that
involve an employee whose last name is ‘Smith’,
either as a worker or as a manager of the department
that controls the project
The first SELECT query retrieves the projects that involve a ‘Smith’ as
manager of the department that controls the project, and the second
retrieves the projects that involve a ‘Smith’ as a worker on the project.
Notice that if several employees have the last name ‘Smith’, the project
numbers involving any of them will be retrieved.
Applying the UNION operation to the two SELECT queries gives the
desired result.
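A possible formulation of Q4 is sketched below with Python's `sqlite3`. The reduced schema and the sample rows are assumptions; in particular, a second employee named Smith is added so that both halves of the UNION return something. SQLite does not accept parentheses around the two SELECT blocks, so they are written without them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, reduced COMPANY tables.
CREATE TABLE EMPLOYEE (Lname TEXT, Ssn TEXT PRIMARY KEY);
CREATE TABLE DEPARTMENT (Dnumber INTEGER PRIMARY KEY, Mgr_ssn TEXT);
CREATE TABLE PROJECT (Pnumber INTEGER PRIMARY KEY, Dnum INTEGER);
CREATE TABLE WORKS_ON (Essn TEXT, Pno INTEGER);
INSERT INTO EMPLOYEE VALUES
  ('Smith', '123456789'), ('Wong', '333445555'), ('Smith', '999887777');
INSERT INTO DEPARTMENT VALUES (5, '333445555'), (4, '999887777');
INSERT INTO PROJECT VALUES (1, 5), (2, 5), (10, 4), (30, 4);
INSERT INTO WORKS_ON VALUES
  ('123456789', 1), ('123456789', 2), ('333445555', 2), ('999887777', 10);
""")

q4 = sorted(conn.execute("""
    SELECT DISTINCT Pnumber                 -- a 'Smith' manages the controlling department
    FROM PROJECT, DEPARTMENT, EMPLOYEE
    WHERE Dnum = Dnumber AND Mgr_ssn = Ssn AND Lname = 'Smith'
    UNION
    SELECT DISTINCT Pnumber                 -- a 'Smith' works on the project
    FROM PROJECT, WORKS_ON, EMPLOYEE
    WHERE Pnumber = Pno AND Essn = Ssn AND Lname = 'Smith'
""").fetchall())

print(q4)
```

Project 10 satisfies both conditions in this sample data but appears only once, because UNION eliminates duplicates.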
UNION ALL
• The UNION ALL command combines the result
set of two or more SELECT statements (allows
duplicate values).
• The following SQL statement returns the cities
(duplicate values also) from both the "Customers"
and the "Suppliers" table:
This SQL UNION ALL example would return the supplier_id multiple
times in the result set if that same value appeared in both the
suppliers and orders table. The SQL UNION ALL operator does not
remove duplicates. If you wish to remove duplicates, try using the
UNION operator.
INTERSECT Operator
• INTERSECT operator is used to return the records that are in common
between two SELECT statements or data sets.
• If a record exists in one query and not in the other, it will be omitted
from the INTERSECT results.
EXCEPT
• The SQL EXCEPT clause/operator is used to
combine two SELECT statements and returns
rows from the first SELECT statement that are
not returned by the second SELECT statement.
▫This means EXCEPT returns only rows, which
are not available in the second SELECT
statement.
• Just as with the UNION operator, the same
rules apply when using the EXCEPT operator.
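The four operators can be compared side by side on the same pair of single-column queries. The sketch below (Python `sqlite3`) reuses the table names Customers and Suppliers from the text; the `City` column and the city values are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical data: 'London' appears in both tables.
CREATE TABLE Customers (City TEXT);
CREATE TABLE Suppliers (City TEXT);
INSERT INTO Customers VALUES ('Berlin'), ('London'), ('Madrid');
INSERT INTO Suppliers VALUES ('London'), ('Paris');
""")

def run(operator):
    """Combine the two single-column SELECTs with the given set operator."""
    sql = f"SELECT City FROM Customers {operator} SELECT City FROM Suppliers"
    return sorted(row[0] for row in conn.execute(sql))

union_all = run("UNION ALL")   # keeps duplicates
union = run("UNION")           # removes duplicates
intersect = run("INTERSECT")   # rows returned by both queries
except_ = run("EXCEPT")        # rows of the first query not in the second

print(union_all)
print(union)
print(intersect)
print(except_)
```

Both SELECT statements return one column of the same type, which is the union-compatibility requirement stated earlier.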
Substring Pattern Matching and Arithmetic Operators
• LIKE comparison operator
▫ This can be used for string pattern matching
▫ Partial strings are specified using two reserved
characters
□ % replaces an arbitrary number of zero or more
characters, and
□ the underscore (_) replaces a single character
Q 12. Retrieve all employees whose
address is in Houston, Texas.
Q 12A. Find all employees who were born
during the 1950s.
Suppose that we want to see the effect of giving all employees who work
on the ‘ProductX’ project a 10 percent raise; we can issue Query 13 to
see what their salaries would become. This example also shows how we
can rename an attribute in the query result using AS in the SELECT
clause.
• For string data types, the concatenate
operator || can be used in a query to
append two string values.
• For date, time, timestamp, and interval
data types, operators include
incrementing (+) or decrementing (–) a
date, time, or timestamp by an interval.
• In addition, an interval value is the result
of the difference between two date, time,
or timestamp values.
• Another comparison operator, which
can be used for convenience, is
BETWEEN
Q 14. Retrieve all employees in
department 5 whose salary is between
$30,000 and $40,000.
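Queries 12, 12A, 13 and 14 could be written roughly as follows (Python `sqlite3`). The schema and rows are assumptions; in particular, birth dates are assumed to be stored as ten-character text in `YYYY-MM-DD` form, which is what makes the underscore pattern for the 1950s work. The output name `Increased_sal` is also an assumed alias.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical sample data; Bdate is stored as 'YYYY-MM-DD' text.
CREATE TABLE EMPLOYEE (Fname TEXT, Lname TEXT, Ssn TEXT PRIMARY KEY,
                       Bdate TEXT, Address TEXT, Salary REAL, Dno INTEGER);
CREATE TABLE PROJECT (Pname TEXT, Pnumber INTEGER PRIMARY KEY);
CREATE TABLE WORKS_ON (Essn TEXT, Pno INTEGER);
INSERT INTO EMPLOYEE VALUES
  ('John', 'Smith', '123456789', '1965-01-09', '731 Fondren, Houston, TX', 30000, 5),
  ('Franklin', 'Wong', '333445555', '1955-12-08', '638 Voss, Houston, TX', 40000, 5),
  ('Alicia', 'Zelaya', '999887777', '1968-01-19', '3321 Castle, Spring, TX', 25000, 4);
INSERT INTO PROJECT VALUES ('ProductX', 1);
INSERT INTO WORKS_ON VALUES ('123456789', 1);
""")

# Q12: % matches any run of zero or more characters.
q12 = sorted(conn.execute(
    "SELECT Lname FROM EMPLOYEE WHERE Address LIKE '%Houston, TX%'").fetchall())

# Q12A: each _ matches exactly one character; the third character must be '5'.
pattern = "__5" + "_" * 7          # 10 characters, matching 'YYYY-MM-DD'
q12a = conn.execute(
    "SELECT Lname FROM EMPLOYEE WHERE Bdate LIKE ?", (pattern,)).fetchall()

# Q13: arithmetic in the SELECT clause, renamed with AS.
q13 = conn.execute("""
    SELECT Fname, Lname, 1.1 * Salary AS Increased_sal
    FROM EMPLOYEE, WORKS_ON, PROJECT
    WHERE Ssn = Essn AND Pno = Pnumber AND Pname = 'ProductX'
""").fetchall()

# Q14: BETWEEN includes both end points.
q14 = sorted(conn.execute("""
    SELECT Lname FROM EMPLOYEE
    WHERE (Salary BETWEEN 30000 AND 40000) AND Dno = 5
""").fetchall())

print(q12, q12a, q13, q14)
```

Q13 only displays the raised salary; the stored `Salary` values are not changed by a SELECT.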
Student Table
Sort according to multiple columns
Q. Fetch all data from the table Student and then
sort the result in ascending order first according to
the column Age, and then in descending order
according to the column ROLL_NO.
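A sort on multiple columns like the one just described might be written as below (Python `sqlite3`). The Student table contents are not shown in the text, so the `NAME` column and all rows here are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical Student rows.
CREATE TABLE Student (ROLL_NO INTEGER PRIMARY KEY, NAME TEXT, Age INTEGER);
INSERT INTO Student VALUES
  (1, 'Ram', 18), (2, 'Ramesh', 18), (3, 'Sujit', 20), (4, 'Suresh', 18);
""")

# Age is the major sort key (ascending); ROLL_NO breaks ties (descending).
rows = conn.execute(
    "SELECT * FROM Student ORDER BY Age ASC, ROLL_NO DESC").fetchall()
roll_order = [row[0] for row in rows]

print(rows)
```

ASC is the default, so `ORDER BY Age, ROLL_NO DESC` gives the same result.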
Q15. Retrieve a list of employees and the
projects they are working on, ordered by
department and, within each department,
ordered alphabetically by last name, then
first name.
The keyword ALL can be combined with each of the comparison operators
>, >=, <, <=, and <>. For example, the comparison condition (v > ALL
V) returns TRUE if the value v is greater than all the values in the set
(or multiset) V.
Q. Select the Essns of all employees who
work the same (project, hours) combination
on some project that employee ‘John Smith’
(whose Ssn = ‘123456789’) works on.
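One way to express this query is a nested (non-correlated) subquery compared against a tuple of values with IN. The sketch below uses Python's `sqlite3`; the WORKS_ON rows are hypothetical, and the tuple comparison `(Pno, Hours) IN (...)` assumes an SQLite version with row-value support (3.15 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical WORKS_ON rows; '453453453' shares (project 1, 32.5 hours) with John Smith.
CREATE TABLE WORKS_ON (Essn TEXT, Pno INTEGER, Hours REAL);
INSERT INTO WORKS_ON VALUES
  ('123456789', 1, 32.5), ('123456789', 2, 7.5),
  ('453453453', 1, 32.5), ('333445555', 2, 10.0);
""")

# The inner query is evaluated once; the outer query keeps every tuple whose
# (Pno, Hours) pair appears in the inner result.
essns = sorted(row[0] for row in conn.execute("""
    SELECT DISTINCT Essn
    FROM WORKS_ON
    WHERE (Pno, Hours) IN (SELECT Pno, Hours
                           FROM WORKS_ON
                           WHERE Essn = '123456789')
"""))

print(essns)
```

John Smith's own Ssn is part of the result, since his tuples trivially match themselves.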
SELECT COUNT(*)
FROM EMPLOYEE;
Q22. The number of employees in the
‘Research’ department
If we write COUNT(SALARY) instead of COUNT(DISTINCT SALARY), then
duplicate values will not be eliminated. However, any tuples with
NULL for SALARY will not be counted. In general, NULL values are
discarded when aggregate functions are applied to a particular column.
Q. Retrieve the names of all
employees who have two or
more dependents
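The counting rules and the two queries above can be illustrated with a sketch such as the following (Python `sqlite3`). The schema and rows are assumptions; one employee is deliberately given a NULL salary to show how COUNT treats it. The dependents query is written here as a correlated nested query, which is only one of several possible formulations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical data: one NULL salary, one repeated salary (30000).
CREATE TABLE DEPARTMENT (Dname TEXT, Dnumber INTEGER PRIMARY KEY);
CREATE TABLE EMPLOYEE (Fname TEXT, Lname TEXT, Ssn TEXT PRIMARY KEY,
                       Salary INTEGER, Dno INTEGER);
CREATE TABLE DEPENDENT (Essn TEXT, Dependent_name TEXT);
INSERT INTO DEPARTMENT VALUES ('Research', 5), ('Administration', 4);
INSERT INTO EMPLOYEE VALUES
  ('John', 'Smith', '123456789', 30000, 5),
  ('Franklin', 'Wong', '333445555', 40000, 5),
  ('Joyce', 'English', '453453453', 30000, 5),
  ('Ramesh', 'Narayan', '666884444', NULL, 5),
  ('Alicia', 'Zelaya', '999887777', 25000, 4);
INSERT INTO DEPENDENT VALUES
  ('333445555', 'Alice'), ('333445555', 'Theodore'), ('123456789', 'Michael');
""")

def scalar(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]

count_rows = scalar("SELECT COUNT(*) FROM EMPLOYEE")                    # counts tuples
count_salary = scalar("SELECT COUNT(Salary) FROM EMPLOYEE")             # skips NULLs
count_distinct = scalar("SELECT COUNT(DISTINCT Salary) FROM EMPLOYEE")  # skips NULLs and duplicates

# Q22: number of employees in the 'Research' department.
q22 = scalar("""
    SELECT COUNT(*) FROM EMPLOYEE, DEPARTMENT
    WHERE Dno = Dnumber AND Dname = 'Research'
""")

# Employees with two or more dependents: the inner COUNT is re-evaluated
# for each EMPLOYEE tuple (correlated on Ssn).
two_or_more = conn.execute("""
    SELECT Lname, Fname FROM EMPLOYEE
    WHERE (SELECT COUNT(*) FROM DEPENDENT WHERE Ssn = Essn) >= 2
""").fetchall()

print(count_rows, count_salary, count_distinct, q22, two_or_more)
```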
V1 and V2: view definitions
• We can specify SQL queries on a view in the
same way we specify queries involving base
tables.
• For example,
▫ to retrieve the last name and first name of all
employees who work on the ‘ProductX’ project,
we can utilize the WORKS_ON1 view and
specify the query as in QV1:
• A view is supposed to be always up-to-date;
▫ if we modify the tuples in the base tables on
which the view is defined, the view must
automatically reflect these changes.
• Hence, the view is not realized or materialized at the
time of view definition but rather at the time when we
specify a query on the view.
• It is the responsibility of the DBMS and not the user
to make sure that the view is kept up-to-date
DROP VIEW
• If we do not need a view any more, we can use
the DROP VIEW command to dispose of it.
• For example, to get rid of the view V1, we can use
the SQL statement in V1A:
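The life cycle of a view (define, query, observe a base-table change, drop) can be sketched as below with Python's `sqlite3`. The definition of WORKS_ON1 is not shown in the text, so the CREATE VIEW statement here is an assumed definition (a join of EMPLOYEE, PROJECT and WORKS_ON), and the rows are sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical base tables.
CREATE TABLE EMPLOYEE (Fname TEXT, Lname TEXT, Ssn TEXT PRIMARY KEY);
CREATE TABLE PROJECT (Pname TEXT, Pnumber INTEGER PRIMARY KEY);
CREATE TABLE WORKS_ON (Essn TEXT, Pno INTEGER, Hours REAL);
INSERT INTO EMPLOYEE VALUES
  ('John', 'Smith', '123456789'), ('Franklin', 'Wong', '333445555');
INSERT INTO PROJECT VALUES ('ProductX', 1), ('ProductY', 2);
INSERT INTO WORKS_ON VALUES ('123456789', 1, 32.5), ('333445555', 2, 10.0);

-- Assumed definition of the view; it stores no tuples of its own.
CREATE VIEW WORKS_ON1 AS
  SELECT Fname, Lname, Pname, Hours
  FROM EMPLOYEE, PROJECT, WORKS_ON
  WHERE Ssn = Essn AND Pno = Pnumber;
""")

QV1 = "SELECT Fname, Lname FROM WORKS_ON1 WHERE Pname = 'ProductX'"

before = sorted(conn.execute(QV1).fetchall())

# Change a base table; the view is evaluated at query time, so it reflects this.
conn.execute("INSERT INTO WORKS_ON VALUES ('333445555', 1, 5.0)")
after = sorted(conn.execute(QV1).fetchall())

# Dispose of the view; the base tables and their data are untouched.
conn.execute("DROP VIEW WORKS_ON1")
remaining_views = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'view'").fetchall()

print(before, after, remaining_views)
```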
View Implementation, View Update, and Inline View
• Query modification
▫ involves modifying or transforming the view
query (submitted by the user) into a query on
the underlying base tables.
▫ For example, the query QV1 would be
automatically modified to the following
query by the DBMS
• SELECT Fname, Lname FROM EMPLOYEE, PROJECT, WORKS_ON WHERE Ssn = Essn AND
Pno = Pnumber AND Pname = 'ProductX'
Fixed and Variable length records
• A file is a sequence of records.
• In many cases, all records in a file are of the
same record type.
• If every record in the file has exactly the
same size (in bytes), the file is said to be
made up of fixed-length records.
• If different records in the file have different sizes,
the file is said to be made up of variable-length
records.
Spanned and Unspanned Organization
• Part of the record can be stored on one block and
the rest on another.
• A pointer at the end of the first block points to
the block containing the remainder of the
record.
• This organization is called spanned because
records can span more than one block.
• Whenever a record is larger than a block, we
must use a spanned organization.
• If records are not allowed to cross block
boundaries, the organization is called unspanned.
Blocking factor for the file
• The records of a file must be allocated to disk blocks
because a block is the unit of data transfer between
disk and memory.
• When the block size is larger than the record size, each
block will contain numerous records, although some
files may have unusually large records that cannot fit
in one block.
• Suppose that the block size is B bytes.
• For a file of fixed-length records of size R bytes, with B ≥
R, we can fit bfr = ⌊B / R⌋ records per block, where ⌊x⌋ (the floor
function) rounds x down to an integer.
• The value bfr is called the blocking factor for the file.
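The blocking factor and the quantities derived from it can be computed with a few lines of Python. This is a minimal sketch assuming fixed-length records and unspanned organization; the values B = 1024, R = 100 and r = 30,000 records are illustrative, and the function names are hypothetical.

```python
def blocking_factor(block_size, record_size):
    """bfr = floor(B / R): whole records that fit in one block (unspanned)."""
    return block_size // record_size

def blocks_needed(num_records, bfr):
    """b = ceil(r / bfr): blocks required to hold the whole file."""
    return -(-num_records // bfr)   # integer ceiling division

B, R, r = 1024, 100, 30_000         # illustrative values

bfr = blocking_factor(B, R)
unused_per_block = B - bfr * R      # bytes wasted in each block
b = blocks_needed(r, bfr)

print(bfr, unused_per_block, b)
```

With an unspanned organization the `unused_per_block` bytes cannot be used, which is the price paid for never splitting a record across blocks.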
Index Structures
• Indexing is a data structure technique to
efficiently retrieve records from the database
files based on some attributes on which the
indexing has been done.
• An index on a database table provides a
convenient mechanism for locating a row
(data record) without scanning the entire
table and thus greatly reduces the time it
takes to process a query.
• The index is usually specified on one field of
the file.
• One form of an index is a file of entries
<field value, pointer to record>, which is
ordered by field value.
Types of index
• Indexes can be characterized as
1. Dense index
2. Sparse index
• A dense index has an index entry for
every search key value (and hence every
record) in the data file.
• A sparse (or nondense) index, on the
other hand, has index entries for only
some of the search values.
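The difference between the two kinds of index can be made concrete with a toy model in Python. Here the data file is a list of blocks of sorted keys, a dense index has one entry per key, and a sparse index has one entry per block (keyed on the first record of the block). The data and function names are hypothetical, and real indexes store disk addresses rather than list positions.

```python
from bisect import bisect_left, bisect_right

# Toy ordered data file: 3 blocks, 3 records (keys) per block.
blocks = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]

# Dense index: one <key, block number> entry per record.
dense = [(key, b) for b, block in enumerate(blocks) for key in block]
# Sparse index: one <first key, block number> entry per block.
sparse = [(block[0], b) for b, block in enumerate(blocks)]

def lookup_dense(key):
    """Binary-search the dense index; the index alone tells whether the key exists."""
    keys = [k for k, _ in dense]
    i = bisect_left(keys, key)
    return dense[i][1] if i < len(keys) and keys[i] == key else None

def lookup_sparse(key):
    """Find the last entry whose key is <= the search key, then look inside that block."""
    anchors = [k for k, _ in sparse]
    i = bisect_right(anchors, key) - 1
    if i < 0:
        return None
    b = sparse[i][1]
    return b if key in blocks[b] else None

print(len(dense), len(sparse))
print(lookup_dense(50), lookup_sparse(50), lookup_sparse(55))
```

The sparse index is a third of the size here, but it can only narrow the search down to a block; the block itself must still be read to confirm that the key is present.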
• Advantages:
▫Stores and organizes data into computer
files.
▫Makes it easier to find and access data
at any given time.
▫It is a data structure that is added to
a file to provide faster access to the
data.
▫It reduces the number of blocks that the
DBMS has to check.
• Disadvantages
▫ Index needs to be updated periodically for
insertion or deletion of records in the data file.
Structure of index
• An index is a small table having only two
columns.
• The first column contains a copy of the
primary or candidate key of a table
• The second column contains a set of pointers
holding the address of the disk block where that
particular key value can be found.
• If the index entries are sorted, the index is called an
ordered index.
Example
• Suppose we have an ordered file with 30,000
records stored on a disk with block size 1024
bytes; records are of fixed size with unspanned
organisation. Record length = 100 bytes. How
many block accesses are needed to search for a record?
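A worked solution for this example can be written as a short Python calculation. It assumes the search is on the ordering field, so that a binary search over the blocks of the data file is possible; the variable names are only for illustration.

```python
import math

r = 30_000        # number of records
B = 1024          # block size in bytes
R = 100           # record length in bytes

bfr = B // R                                     # records per block (unspanned)
b = math.ceil(r / bfr)                           # blocks in the data file
binary_search_accesses = math.ceil(math.log2(b)) # binary search on the ordered file

print(bfr, b, binary_search_accesses)
```

So bfr = 10 records per block, the file occupies 3000 blocks, and a binary search needs about ⌈log2 3000⌉ = 12 block accesses.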
Types of Indexes
Primary Index
• Primary index is defined on an ordered data file.
The data file is ordered on a key field.
• The key field is generally the primary key of the
relation.
• The first record in each block of the data file is called the
anchor record of the block or the block anchor.
• A primary index is a nondense (sparse) index, since
it includes an entry for each disk block of the data
file, holding the key of its anchor record, rather than
an entry for every search value.
Figure 5.1: Primary Index on the Ordering Key Field
of the File
Example
• Suppose we have an ordered file with 30,000
records stored on a disk with block size 1024
bytes; records are of fixed size with unspanned
organisation.
• Record length = 100 bytes. How many block
accesses are needed when using a primary index file, with an
ordering key field of 9 bytes and a block
pointer of 6 bytes?
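The primary index example can be worked through the same way. The sketch below assumes one index entry per data block (the block anchor), a binary search on the index file, and one extra access to fetch the data block; variable names are illustrative.

```python
import math

r, B, R = 30_000, 1024, 100     # records, block size, record length
V, P = 9, 6                     # key field size, block pointer size (bytes)

bfr = B // R                    # data records per block
b = math.ceil(r / bfr)          # data blocks

Ri = V + P                      # size of one index entry
bfri = B // Ri                  # index entries per block
ri = b                          # primary index: one entry per data block
bi = math.ceil(ri / bfri)       # index blocks

# Binary search on the index, plus one access to read the data block.
accesses = math.ceil(math.log2(bi)) + 1

print(Ri, bfri, ri, bi, accesses)
```

The index has 3000 entries of 15 bytes, 68 per block, so it occupies 45 blocks; the search costs ⌈log2 45⌉ + 1 = 7 block accesses, compared with 12 for a binary search on the data file itself.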
Clustering Index
• Defined on an ordered data file.
• The data file is ordered on a non-key field unlike
primary index, which requires that the ordering field
of the data file have a distinct value for each record.
• Includes one index entry for each distinct value of the
field;
▫ the index entry points to the first data block that
contains records with that field value.
• It is another example of a nondense index; insertion and
deletion are relatively straightforward
with a clustering index.
Figure 5.2: A Clustering Index on Dept_number Ordering Nonkey
Field of a File
Secondary Index
• A secondary index provides a secondary means of
accessing a file for which some primary access
already exists.
• The secondary index may be on a field which is a
candidate key and has a unique value in every
record, or a non-key with duplicate values.
• The index is an ordered file with two fields.
• The first field is of the same data type as some
non-ordering field of the data file that is an indexing
field.
• The second field is either a block pointer or a
record pointer.
• There can be many secondary indexes (and
hence, indexing fields) for the same file.
• Includes one entry for each record in the data file;
hence, it is a dense index.
Example
• Suppose we have an ordered file with 30,000
records stored on a disk with block size 1024
bytes; records are of fixed size with unspanned
organisation. Record length = 100 bytes. How
many block accesses are needed when using a secondary index file?
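This example does not state the field and pointer sizes, so the calculation below assumes the same sizes as in the primary index example (a 9-byte indexing field and a 6-byte block pointer) and a dense secondary index on a non-ordering key field. These are assumptions, not values given in the question.

```python
import math

r, B, R = 30_000, 1024, 100     # records, block size, record length
V, P = 9, 6                     # ASSUMED: indexing field and block pointer sizes

bfr = B // R
b = math.ceil(r / bfr)          # data blocks

# Without an index, a non-ordering field needs a linear search:
linear_avg = b // 2             # on average half of the blocks are read

Ri = V + P
bfri = B // Ri                  # index entries per block
ri = r                          # dense: one entry per record
bi = math.ceil(ri / bfri)       # index blocks

# Binary search on the index, plus one access to read the data block.
accesses = math.ceil(math.log2(bi)) + 1

print(linear_avg, bfri, bi, accesses)
```

Under these assumptions the index occupies 442 blocks and a search costs ⌈log2 442⌉ + 1 = 10 block accesses, against an average of 1500 for a linear scan of the data file.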
Figure 5.3: Dense Secondary Index (with Block Pointer) on
a Non Ordering Key Field of the File
Figure 5.4: Secondary Index (with Record Pointer) on a Non Key Field implemented
using one level of indirection so that Index entries are of Fixed Length and have unique
field values
Single level and Multi-level indexing
• Because a single-level index is an ordered
file, we can create a primary index to the
index itself;
• In this case, the original index file is called
the first-level index and the index to the
index is called the second-level index.
• We can repeat the process, creating a third,
fourth, ..., top level until all entries of the top
level fit in one disk block.
• A multi-level index can be created for any type
of first-level index (primary, secondary, clustering)
as long as the first-level index consists of more than one disk block.
Figure 5.5: Two-Level Primary Index
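The number of levels in a multi-level index follows directly from the rule "keep indexing the index until the top level fits in one block". The sketch below assumes every level has the same fan-out (index entries per block); the input values 45 and 442 first-level blocks with a fan-out of 68 are illustrative figures for a 1024-byte block and 15-byte index entries, and the function name is hypothetical.

```python
import math

def index_levels(first_level_blocks, fan_out):
    """Count index levels until the top level occupies a single block."""
    levels, blocks = 1, first_level_blocks
    while blocks > 1:
        blocks = math.ceil(blocks / fan_out)   # one entry per block of the level below
        levels += 1
    return levels

fan_out = 68   # illustrative: floor(1024 / 15) index entries per block

t_primary = index_levels(45, fan_out)      # sparse first level with 45 blocks
t_secondary = index_levels(442, fan_out)   # dense first level with 442 blocks

# One block access per index level, plus one for the data block.
accesses_primary = t_primary + 1
accesses_secondary = t_secondary + 1

print(t_primary, accesses_primary, t_secondary, accesses_secondary)
```

A search reads exactly one block per level, so no binary search is needed within the index file; this is why the multi-level figures (3 and 4 accesses) are lower than the single-level ones.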
Search Trees
• A search tree is slightly different from a
multilevel index.
• A search tree of order p is a tree such that each
node contains at most p − 1 search values and
p pointers in the order <P1, K1, P2, K2, ...,
Pq−1, Kq−1, Pq>, where q ≤ p.
• Each Pi is a pointer to a child node (or a
NULL pointer), and each Ki is a search
value from some ordered set of values.
• Two constraints must hold at all times on the
search tree:
1. Within each node, K1 < K2 < ... < Kq−1.
2. For all values X in the subtree pointed at
by Pi, we have Ki−1 < X < Ki for 1 < i < q;
X < Ki for i = 1; and Ki−1 < X for i = q.
Figure 5.6: A node in a search tree with pointers to subtrees below
it
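The two constraints are exactly what makes searching possible: at each node, the position of the search value among K1 ... Kq−1 tells which pointer Pi to follow. A minimal Python sketch, assuming a hypothetical `Node` class with a list of keys and a list of child pointers (`None` standing for a NULL pointer):

```python
from bisect import bisect_left

class Node:
    """Search tree node: keys K1..Kq-1 and pointers P1..Pq (None = NULL pointer)."""
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or [None] * (len(keys) + 1)

def search(node, x):
    """Return True if x occurs in the tree rooted at node."""
    while node is not None:
        i = bisect_left(node.keys, x)          # first position with Ki >= x
        if i < len(node.keys) and node.keys[i] == x:
            return True
        node = node.children[i]                # subtree holding values between Ki-1 and Ki
    return False

# Hypothetical tree of order 3; the middle pointer of the root is NULL.
root = Node([10, 20], [Node([3, 7]), None, Node([25])])

print(search(root, 7), search(root, 15), search(root, 25), search(root, 30))
```

Nothing in this definition keeps the tree balanced; that guarantee is what the B-tree adds.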
B-Trees
• The B-tree has additional constraints that
ensure that the tree is always balanced.
• A B-tree of order p, when used as an access
structure on a key field to search for records in
a data file, can be defined as follows:
1. Each internal node in the B-tree is of the
form
□ <P1, <K1, Pr1>, P2, <K2, Pr2>, ..., <Kq–1, Prq–1>, Pq>
where q ≤ p. Each Pi is a tree pointer, that is,
a pointer to another node in the B-tree. Each
Pri is a data pointer, that is, a pointer to the record
whose search key field value is equal to Ki.
2. Within each node, K1 < K2 < ... < Kq−1.
3. For all search key field values X in the subtree
pointed at by Pi, we have: Ki–1 < X < Ki for 1 <
i < q; X < Ki for i = 1; and Ki–1 < X for i = q.
4. Each node has at most p tree pointers.
5. Each node, except the root and leaf nodes, has
at least ⌈p/2⌉ tree pointers. The root node has at least
two tree pointers unless it is the only node in
the tree.
6. A node with q tree pointers, q ≤ p, has q – 1
search key field values (and hence has q – 1
data pointers).
7. All leaf nodes are at the same level. Leaf
nodes have the same structure as internal
nodes except that all of their tree pointers Pi
are NULL
Figure 5.7: B-tree structures. (a) A node in a B-tree with q –
1 search values.
(b) A B-tree of order p = 3. The values were inserted in
the order 8, 5, 1, 7, 3, 12, 9, 6.
• A B-tree starts with a single root node (which
is also a leaf node) at level 0 (zero).
• Once the root node is full with p – 1 search
key values and we attempt to insert another
entry in the tree, the root node splits into two
nodes at level 1.
• Only the middle value is kept in the root node,
and the rest of the values are split evenly
between the other two nodes.
• When a nonroot node is full and a new
entry is inserted into it,
▫ that node is split into two nodes at the
same level, and the middle entry is moved to
the parent node along with two
pointers to the new split nodes.
• If the parent node is full, it is also split.
• If deletion of a value causes a node to be less than
half full,
▫ it is combined with its neighboring nodes,
and this can also propagate all the way to
the root.
▫Hence, deletion can reduce the number of tree
levels.
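Although the algorithms are outside the syllabus, the insertion behaviour described above (fill the node, split on overflow, push the middle value up) can be followed with a simplified Python sketch. It stores keys only, omitting the data pointers Pri, and covers insertion only; the class and function names are hypothetical. The insertion order is the one given in the caption of Figure 5.7(b), with order p = 3.

```python
from bisect import bisect_left

class Node:
    """B-tree node holding keys only (data pointers omitted in this sketch)."""
    def __init__(self, keys=None, children=None):
        self.keys = keys or []
        self.children = children or []     # an empty list marks a leaf

def _insert(node, key, p):
    """Insert key below node; return (middle key, new right node) if node was split."""
    i = bisect_left(node.keys, key)
    if node.children:                      # internal node: descend first
        split = _insert(node.children[i], key, p)
        if split is None:
            return None
        middle, right = split              # child was split: absorb its middle key
        node.keys.insert(i, middle)
        node.children.insert(i + 1, right)
    else:                                  # leaf: place the key in sorted order
        node.keys.insert(i, key)
    if len(node.keys) < p:                 # at most p - 1 keys: no overflow
        return None
    m = len(node.keys) // 2                # overflow: split around the middle key
    middle = node.keys[m]
    right = Node(node.keys[m + 1:], node.children[m + 1:])
    node.keys = node.keys[:m]
    node.children = node.children[:m + 1]
    return middle, right

def insert(root, key, p):
    """Insert key into the tree and return the (possibly new) root."""
    split = _insert(root, key, p)
    if split is None:
        return root
    middle, right = split
    return Node([middle], [root, right])   # root split: the tree grows by one level

root = Node()
for value in [8, 5, 1, 7, 3, 12, 9, 6]:
    root = insert(root, value, p=3)

print(root.keys, [child.keys for child in root.children])
```

The tree only grows in height when the root itself splits, which is why all leaves stay at the same level.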
Properties of a B-tree
• For a tree to be classified as a B-tree, it must fulfill the
following conditions:
▫ the nodes in a B-tree of order m can have a
maximum of m children
▫ each internal node (non-leaf and non-root) must have at
least ⌈m/2⌉ children (m/2 rounded up)
▫ the root should have at least two children – unless it’s
a leaf
▫ a non-leaf node with k children should have k-1 keys
▫ all leaves must appear on the same level
Building a B-tree
• Since we’re starting with an empty tree, the first item
we insert will become the root node of our tree.
• At this point, the root node has the key/value pair.
• The key is 1, but the value is depicted as a star to make
it easier to represent, and to indicate it is a reference
to a record.
• The root node also has pointers to its left and right
children shown as small rectangles to the left and
right of the key.
• Since the node has no children, those pointers will
be empty for now:
We know that this tree has order of 3, so it
can have only up to 2 keys in it. So we can
add the payload with key 2 to the root node in
ascending order:
Algebra
Operations
Outline of a Heuristic Algebraic Optimization Algorithm:
• Using rule 1, break up any select operations with conjunctive
conditions into a cascade of select operations.
• Using rules 2, 4, 6, and 10 concerning the commutativity of select
with other operations, move each select operation as far down the
query tree as is permitted by the attributes involved in the select
condition.
• Using rule 9 concerning associativity of binary operations, rearrange
the leaf nodes of the tree so that the leaf node relations with the most
restrictive select operations are executed first in the query tree
representation.
• Using Rule 12, combine a Cartesian product operation with a
subsequent select operation in the tree into a join operation.
• Using rules 3, 4, 7, and 11 concerning the cascading of project and the
commuting of project with other operations, break down and move
lists of projection attributes down the tree as far as possible by
creating new project operations as needed.
Summary of Heuristics for Algebraic Optimization
• The main heuristic is to apply first the operations that
reduce the size of intermediate results.
• Perform select operations as early as possible to
reduce the number of tuples and perform project
operations as early as possible to reduce the
number of attributes.
▫ This is done by moving select and project operations
as far down the
tree as possible.
• The select and join operations that are most restrictive
should be executed before other similar operations.
Query Execution Plans
• An execution plan for a relational algebra query
consists of a combination of the relational algebra
query tree and information about the access
methods to be used for each relation as well as
the methods to be used in computing the
relational operators stored in the tree.
• Materialized evaluation: the result of an operation
is stored as a temporary relation.
• Pipelined evaluation: as the result of an operator is produced,
it is forwarded to the next operator in sequence.
END