Query Processing
Database System Concepts - 7th Edition 15.1 ©Silberschatz, Korth and Sudarshan
Query Processing
Silberschatz, Korth, Sudarshan,
Database System Concepts,
7° edition, 2011
Database System Concepts - 7th Edition 15.2 ©Silberschatz, Korth and Sudarshan
Basic Steps in Query Processing
1. Parsing and translation We mainly focus on
the optimization phase
2. Optimization
3. Evaluation
Database System Concepts - 7th Edition 15.3 ©Silberschatz, Korth and Sudarshan
Basic Steps in Query Processing (cont.)
Parser and translator
Translate the (SQL) query into relational algebra
Parser checks syntax (e.g., correct relation and operator names)
Evaluation engine
The query-execution engine takes a query-evaluation plan, executes
that plan, and returns the answers to the query
Optimizer (in a nutshell – more details in the next slides)
Chooses the most efficient implementation to execute the query
Produces equivalent relational algebra expressions
Annotates them with instructions (algorithms): query execution plan (QEP)
Estimates the cost of each equivalent QEP, according to a given cost model
Choose the “best” QEP
Database System Concepts - 7th Edition 15.4 ©Silberschatz, Korth and Sudarshan
Basic Steps: Optimization
1st level of optimization: an SQL query has many equivalent relational
algebra expressions
salary75000(salary(instructor)) and
salary(salary75000(instructor)) are equivalent
They both correspond to SELECT salary
FROM instructor
WHERE salary < 75000
2nd level of optimization: a relational algebra operation can be evaluated
using one of several different algorithms
e.g., block nested-loop join VS. merge-join; file scan VS. index scan
Input of optimization: a query in the form of an algebra expression
Output of optimization: the “best” annotated relational algebra expression
specifying detailed evaluation strategy (query evaluation plan or query
execution plan – QEP) answering the input query
Database System Concepts - 7th Edition 15.5 ©Silberschatz, Korth and Sudarshan
Basic Steps: Optimization (Cont.)
Different query evaluation plans have different costs
User is not expected to specify least-cost plans
Query Optimization: amongst all equivalent QEP choose the one
with lowest cost
Cost is estimated using statistical information from the database catalog
# of tuples in relations, tuple sizes, # of distinct values for a given attribute, etc.
We study… (Chapter 15⋆ – evaluation of QEP)
How to measure query costs (establish a cost model)
Algorithms for evaluating relational algebra operations and their cost
How to combine algorithms for individual operations in order to evaluate a
complex expression (QEP)
… and (Chapter 16⋆ – choosing the best QEP)
How to optimize queries, that is, how to find a QEP with lowest estimated
cost
⋆
Silberschatz, Korth, and Sudarshan, Database System Concepts, 7° ed.
Database System Concepts - 7th Edition 15.6 ©Silberschatz, Korth and Sudarshan
How to measure query costs
(cost model)
These slides are a modified version of the slides provided with the book:
(however, chapter numeration refers to 7 th Ed.)
Database System Concepts, 6th Ed.
©Silberschatz, Korth and Sudarshan
See www.db-book.com for conditions on re-use
The original version of the slides is available at: https://www.db-book.com/
Measures of Query Cost
Response time (wall-clock time needed to execute a plan) depends on several factors
system configuration
amount of dedicated buffer in RAM (aka, memory, main memory)
whether or not indices are (partially) stored permanently in the buffer
runtime conditions
amount of free buffer at the time the plan is executed
content of the buffer at the time the plan is executed
parameters, embedded in queries, which are resolved at runtime only
SELECT salary
FROM instructor
WHERE salary < $a
where $a is a variable provided by the application (user)
Thus
1. cost models (like ours) focus on resource consumption rather than response time
(optimizers minimize resource consumption rather than response time)
2. different optimizers may make different assumptions (parameters): every theoretical
analysis must be recast with the actual parameters used by the concrete system
(optimizer) to which the analysis is going to be applied
Database System Concepts - 7th Edition 15.8 ©Silberschatz, Korth and Sudarshan
Measures of Query Cost (Cont.)
Query cost (total elapsed time for answering a query) is measured in terms of
different resources
disk access (I/O operation on disk)
CPU usage
(network communication for distributed DBMS – later in this course)
Typically disk access is the predominant cost, and is also relatively easy to
estimate. Measured by taking into account
Number of seeks (number of random I/O accesses)
Number of blocks read
Number of blocks written
It is generally assumed cost for writing to be twice as the cost for reading
(data is read back after being written to ensure the write was successful)
VERY IMPORTANT!!!
- “disk” refers to permanent drive for file storage, hard-disk, secondary memory, permanent memory
- “memory” refers to volatile drive for data storage, RAM, main memory, buffer
These are all used as synonims
This is a so far accepted choice for measuring query costs (cost model).
New technologies: faster hard-disks (solid-state drives – SSD) and cheaper (thus bigger) RAM
might direct towards different cost models (e.g., based also on CPU usage or RAM I/O operations)
Database System Concepts - 7th Edition 15.9 ©Silberschatz, Korth and Sudarshan
Measures of Query Cost (Cont.)
We ignore difference between writing and reading: we just consider
tS – time for one seek
tT – time to transfer one block
Example: cost for b block transfers plus S seeks
b * tT + S * t S
Values of tT and tS must be calibrated for the specific disk system
Typical values (2018): tS = 4 ms, tT = 0.1 ms
Some DBMS performs, during installation, seeks and block transfers to
estimate average values
We ignore CPU costs for simplicity
Real systems usually do take CPU cost into account
We do not include cost to writing output to disk in our cost formulae
Database System Concepts - 7th Edition 15.10 ©Silberschatz, Korth and Sudarshan
Algorithms for evaluating relational
algebra operations
These slides are a modified version of the slides provided with the book:
(however, chapter numeration refers to 7 th Ed.)
Database System Concepts, 6th Ed.
©Silberschatz, Korth and Sudarshan
See www.db-book.com for conditions on re-use
The original version of the slides is available at: https://www.db-book.com/
Physical organization of records
At the physical level, records are stored (on permanent disks) in files
(managed and organized by the filesystem)
We assume files are organized according to sequential file
organization
i.e., a file is stored in contiguous blocks, with records ordered according
to some attribute(s) – not necessarily ordered by primary key
Other file organization techniques exist (e.g., B+-tree file
organization), leading to different formulas for cost estimate
Database System Concepts - 7th Edition 15.12 ©Silberschatz, Korth and Sudarshan
Selection Operation
File scan (relation scan without indices)
PROs: can be applied to any file, regardless of its ordering, availability of indices,
nature of selection operation, etc.
CONs: it is slow
Algorithm A1 (linear search). Retrieve and scan each file block and
test all records to see whether they satisfy the selection condition
br denotes number of blocks containing records from relation r
Cost estimate??? (selection on a generic, non-key attribute)
cost = br block transfers + 1 seek = tS + br * tT
We assume blocks are stored contiguously so 1 seek operation is enough (disk head
does not need to move to seek next block)
Selection on a key attribute. Cost estimate???
stop on finding record
cost = (br /2) block transfers + 1 seek = tS + (br / 2)* tT
Database System Concepts - 7th Edition 15.13 ©Silberschatz, Korth and Sudarshan
Selections Using Indices
Index scan (relation scan using an index)
selection condition must be on search-key of index
hi : height of the B+-tree (# of accesses to traverse the index
before accessing the data)
A2 (primary index, equality on key). Retrieve a single record
that satisfies the corresponding equality condition. Cost?
cost = (hi + 1) * (tT + tS)
A3 (primary index, equality on nonkey). Retrieve multiple
records. Cost?
Let b = number of blocks containing matching records
Records will be on consecutive blocks
cost = hi * (tT + tS) + tS + tT * b There is a mistake in the 6th ed.
of the book⋆ (Fig. 12.3): the “tS”
⋆ summand is omitted
Silberschatz, Korth, and Sudarshan, Database System Concepts, 6° ed.
Database System Concepts - 7th Edition 15.14 ©Silberschatz, Korth and Sudarshan
Selections Using Indices
A4 (secondary index, equality on key). Cost?
Equal to A2
cost = (hi + 1) * (tT + tS)
A4 (secondary index, equality on nonkey)
Retrieve multiple records. Cost?
each of n matching records may be on a different block
Cost = (hi + n) * (tT + tS)
– Can be very expensive! Can be worse than file scan
Database System Concepts - 7th Edition 15.15 ©Silberschatz, Korth and Sudarshan
Selections Involving Comparisons
Can implement selections of the form
AV (r) or A V(r) by using
a linear file scan,
or by using indices in the following ways:
A5 (primary index, comparison).
A V(r)
use index to find first tuple v and scan relation sequentially from there
RECALL: b is the number of blocks containing matching records
Equal to A3: Cost = hi * (tT + tS) + tS + tT * b
AV(r)
just scan relation sequentially till first tuple > v; do not use the index
Similar to A1 (file scan, equality on key): Cost = tS + b* tT
Database System Concepts - 7th Edition 15.16 ©Silberschatz, Korth and Sudarshan
Selections Involving Comparisons
A6 (secondary index, comparison). Cost?
For A V(r) use index to find first index entry v and scan index sequentially
from there, to find pointers to records.
For AV (r) just scan leaf pages of index finding pointers to records, till first
entry > v
In either case, retrieve records that are pointed to
requires an I/O for each record
Equal to A4, equality on nonkey: cost = (hi + n) * (tT + tS)
Linear file scan may be cheaper
Database System Concepts - 7th Edition 15.17 ©Silberschatz, Korth and Sudarshan
Summary of costs for selections
Database System Concepts - 7th Edition 15.18 ©Silberschatz, Korth and Sudarshan
Complex Selections (cont’d)
A9 (conjunctive selection by intersection of identifiers)
If there are indices with pointers to records (rather than actual records) – this is
our assumption so far anyway
Scan indices but do not access records, just collect sets of pointers (one per
index)
Compute the intersection, and then access records. Cost?
Cost: cost of scanning all indices plus cost of accessing records
Optimization: order records in the intersection and then access them in sorted
order. Advantages:
no block is accessed twice (2 records in the same block are retrieved together)
some seek time is saved as blocks are transferred in sorted order (disk-arm is minimized)
A10 (disjunctive selection by union of identifiers)
If ALL conditions can be checked through some index, then similar to A9
Scan indices but do not access records, just collect sets of pointers (one per index)
Compute the union, and then access records (in sorted order)
Cost: cost of scanning all indices plus cost of accessing records
If even only 1 condition has no associate index, then A1 (linear scan)
Database System Concepts - 7th Edition 15.19 ©Silberschatz, Korth and Sudarshan