Unit 3
Unit 3
2
Query Processing and Optimization
Query Processing includes translation of high-level queries into
low-level expression that can be used at the physical level of the
file system, query optimization and actual execution of the query
to get the result.
Query Optimization is the process in which multiple query-
execution plans for satisfying a query are examined and a most
efficient query plan is identified for execution.
The term optimization is actually a misnomer because in most of
the cases the chosen execution plan is not optimal strategy. It is
just a reasonably efficient or the best available strategy for
executing the query. Finding the optimal strategy is usually too
time consuming except for the simplest of queries.
3
Query Processing Steps
5
6
Query Tree
• It is used to represent relational algebra expressions.
• Tree Data Structure
• Input relations are represented as leaf nodes
• Relational algebra operation as internal nodes
• Example:
• Relation: Book
Bid B_name Category A_id Price
• SQL Query
Select B_name, Price
From Book
Where Price > 50;
7
Query Tree: Example
Select B_name, Price
From Book
Where Price > 50;
Book Book
8
Query Evaluation Plan
• It is used to fully specify how to evaluate a query, each
operation in the query tree is annotated with
instructions which specify the algorithm or the index to
be used to evaluate that operation.
B_name, Price(Price>50(Book))
B_name, Price
Price>50
Book 9
Query Evaluation Plan
• Note:
• Different evaluation plans for a given query can have different costs. It is the
responsibility of the query optimizer to generate a least costly plan.
• Best Query Evaluation plan is finally submitted to the query
evaluation engine for actual execution.
10
Query Blocks
• Query submitted to the database system is first decomposed into
Query Blocks.
A Query Block forms a basic unit that can be translated into
12
Algorithms for External Sorting
13
14
Algorithms for SELECT Operation
• SELECT operation : There are many algorithms for executing a SELECT operation,
which is basically a search operation to locate the records in a disk file that
satisfy a certain condition.
15
16
Search Methods for Complex Selections
• Conjunction: A conjunction selection is of the form:
• σ θ1^ θ2^…. ^Θn( r ).
• Disjunction: A disjunctive selection is a selection of the form:
• σ θ1v θ2v…. vΘn( r ). A disjunctive condition is satisfied by the union of all
records satisfying the individual, simple condition θi.
• Negation : The result of selection σ-θ( r ) is the set of tuples of r for
which the condition θ evaluates to be false.
17
Joining
• Like selection, the join operation can be implemented in a variety of
ways. In terms of disk access, the joining can be very expensive, so
implementing and utilizing efficient join algorithms is critical in
minimizing a query’s execution time.
• The following are 5 well-known types of join algorithms are:
• Nested-Loop Join
• Block Nested Loop Join
• Indexed Nested Loop Join
• Sort Nested Loop Join
• Sort Merge Join
• Hash Join
18
Nested Join Loop
• This algorithm consists of an inner for loop nested within an outer for loop. We
use following notations
• r,s : Relations r and s
• tr : Tuple in relation r
• ts : Tuple in relation s
• nr : Number of records in relation r
• ns : Number of records in relation s
• br : Number of blocks with records in relation r
• bs : Number of blocks with records in relation s
19
Nested Join Loop
• The primary difference in cost between the block nested loop join and the
basic nested-loop join is that, in the worst case, each block in the inner relation
s is read only once for each block in the outer relation, instead of once for each
tuple in the outer relation. Clearly, it is more efficient to use the smaller
relation as the outer relation, in case neither of the relations fits in memory.
21
Block Nested Loop Join
• for each block br of r {
• for each block bs of s {
• for each tuple tr in br {
• for each tuple ts in bs {
• if join condition is true for (tr, ts)
• add tuple tr*ts to the result
•}
•}
•}
•}
22
Sort Merge Join
• This algorithm can be used to perform natural joins and equi-joins and
requires that each relation r and r be sorted by the common attributes
between R ∩ S. The details for how this algorithm works will not be
presented here.
• However, it is notable to point out that each record in r and s is only
scanned once, thus producing a worst and best-case cost of br + bs.
• Variations of the Sort-Merge join algorithm are used for instance, when
the data files are in unsorted order, but there exist secondary indices for
the two relations.
23
Hash Join
• Like the sort-merge join, the hash join algorithm can be used to perform natura
joins and equi-joins. The concept behind the hash –join algorithm is to partition
the tuples of each given relation into sets.
• The partition is done on the basic of same hash value on join attributes.
• The has function provides the hash value. The main goal of using the hash
function in the algorithm is to reduce the number of comparisons and increase
the efficiency to complete the join operation on the relations.
24
Hash Join
• For example, suppose there are two tuples a and b where both of them
satisfy the join condition. It means they have the same hash value for the
join attributes.
• Suppose that both a and b tuples consists of a hash value as i. It implies that
tuple a should be in ai, and tuple b should be in bi.
25
Evaluation of Expression
• One obvious way to evaluate an expression is simply to evaluate one
operation at a time, in an appropriate order. There are two
approaches how a query expression tree can be evaluated.
• Materialization : Compute the result of an evaluation primitive and
materialize (store) the new relation on the disk
• Pipelining : Pass on tuple to parent operation even while an
operation is still being executed.
26
Materialization
• It is easier to understand intuitively to evaluate an expression by looking at a
pictorial representation of the expression in an operator tree.
• Consider the expression: π staff_name (σdept_id=3(Department) ⋈ staff)
• π staff_name
•
• ⋈
• σdept_id=3 Staff
• Department 27
Materialization
29
Pipelining
• It can start generating query results quickly, if the most operator of query
evaluation plan in combined in a pipeline with its inputs. This can be quite
useful if the results are displayed to user as they are generated.
30
Pipelining
• Pipeline can be executed in either of two ways.
• Demand-driven (or Lazy evaluation)Pipelining: In this method, the result of
lower level queries is not passed to the higher level automatically. It will be
passed to higher level only when it is requested by the higher level. In this
method, it retains the result value and state with it and it will be transformed
to the next level only when it is requested.
32
Query Optimization
• Selecting a single record from a r on a non-key attribute can have,
• Cost of [log2(br)] = 100 [binary search]
• Cost of br/2 = 5000 (Linear search]
• It is noticed that the cost difference between the two selects differs by a
factor of 500
33
Cost Based Optimization
• This process of selecting a lower cost mechanism is known as cost based
optimization. This is based on the cost of the query. The query can use
different paths based on indexes, constraints, sorting methods etc.
• This method mainly uses the statistics like record size, number of records ,
number of records per block, number of blocks, table size, whether whole
table fits in a block, organization of tables, uniqueness of column values,
size of columns etc.
• Some of the features of the cost based optimization are as follows.
• It is based on the cost of the query that can be optimized.
• The query can use a lot of paths on the value of indexes, available sorting
methods, constraints etc.
34
Cost Based Optimization
• The aim of query optimization is to choose the most efficient path of
implementing the query at the possible lowest minimum cost in the
form of an algorithm.
35
Heuristic Base Optimization
• Heuristic optimization transforms the query tree by using a set of rules
(Heuristics) that typically (but not in all cases) improve execution performance .
Some common heuristic rules are:
• Perform selections early (reduces the number of tuples)
• Perform the projection early (reduces the number of attributes)
• Perform most restrictive selection and join operations (with smallest result size )
before other similar operations
Initially query tree from SQL statement is generated. Query tree is transformed
into more complex tree, via a series of tree modification, each of which hopefully
reduces the execution time
36
Semantic Base Optimization
• This strategy uses constraints specified on the database schema such as unique
attributes and other more complex constraints-in order to modify one query into
another query that is more efficient to execute.
• Example: Consider the following query
• select e.first_name, last_name from Employee e, Employee m
• where e.super_ssn = m.ssn and e.salary > m.salary
• This query retrieves the names of employees who earn more than their
supervisors. Suppose that we have constraint on the database schema that stated
that no employee can earn more than his or her direct supervisor. If the semantic
query optimizer checks for the existence of this constraint, it does not need to
execute the query at all because it knows that the result of the query will be
empty.
37
Notations for Query Tree and Query Graphs
38
Expression and Tree Transformation
• After a high level query (SQL statement) has been parsed into an equivalent
relational algebra expression, the query optimizer can perform heuristic rules
on the expression and tree to transform the expression and tree into
equivalent, but optimal form.
• Example: Consider the following query:
• Π stu_name, marks_obtained
• σstu_id = 10
• σsub_id = 20
• ⋈
•
• Student Marks
40
Expression and Tree Transformation
• Suppose the student and marks relations both have 100 records and the
number of stu_id = 10 is 50. Note that the Cartesian product resulting in 10000
records can be reduced by 50% if the σstu_id = 10 is performed first.
• We can also combine the σsub_id = 20 and Cartesian product operations into a
more efficient join operation, as well as eliminating any unneeded columns
before the expensive join in performed. The diagram below shows this better
“optimized” version of the tree.
41
Expression and Tree Transformation
• Π stu_name, marks_obtained
•
• ⋈stu_id = sub_id
•
• Π stu_name, marks_obtained Π stu_name, marks_obtained
σstu_id = 10
• σsub_id = 20
student Marks 42
Query Trees and Heuristics for Query Optimization
• Step 1: Scanner and parser generate initial query representation
• Step 2: Representation is optimized according to heuristic rules
• Step 3: Query execution plan is developed
• Execute groups of operations based on access paths available and files involved
43
Heuristic Optimization of Query Trees
• In general, many different relational algebra expressions and hence many
different query trees can be equivalent; that is, they can correspond to
the same query. The query parser will typically generate a standard initial
query tree to correspond to an SQL query, without doing any
optimization. It is now the job of heuristic query optimizer to transform
this initial query tree into a final query tree that is efficient to execute.
• The optimizer must include rules for equivalent among relational algebra
expressions that can be applied to the initial tree. The heuristic query
optimization rules then utilize these equivalence expressions to transform
the initial tree into the final, optimized query tree.
44
Query Trees and Heuristics for Query Optimization (cont’d.)
• Example Heuristic rule
• Apply SELECT and PROJECT before JOIN
• Reduces size of files to be joined
• Query Tree
• Represents relational algebra expression
• Query Graph
• Represents relational calculus expression
45
Example of Transforming a Query
• For every project located in ‘Stafford’, retrieve the project number,
the controlling department number, department manager’s last
name, address, and birthdate
46
Figure 19.1 Two query trees for
the query Q2. (a) Query tree
corresponding to the relational
algebra expression for Q2. (b)
Initial (canonical) query tree for
SQL query Q2. (c) Query graph
for Q2.
47
Query Trees and Heuristics for Query Optimization (cont’d.)
• Query tree represents a specific order of operations for executing a query
• Preferred to query graph for this reason
• Query graph
• Relation nodes displayed as single circles
• Constants represented by constant nodes
• Double circles or ovals
• Selection or join conditions represented as edges
• Attributes to be retrieved displayed in square brackets
48
Heuristic Optimization of Query Trees
• Many different query trees can be used to represent the query and get the
same results
• Figure b shows initial tree for Q2
• Very inefficient - will never be executed
• Optimizer will transform into equivalent final query tree
Query : Consider the following query Q on the following database schema: Find the last
names of employees born after 1957 who work on a project named ‘ Aquarius’.
49
Schema
50
Query Transformation Example
54
Query Transformation Example (cont’d.)
56
General Transformation Rules for Relational
Algebra Operations
57
General Transformation Rules for Relational Algebra Operations
58
59
Summary of Heuristics for Algebraic Optimization
• Apply first the operations that reduce the size of intermediate results
• Perform SELECT and PROJECT operations as early as possible to reduce the number of
tuples and attributes
• The SELECT and JOIN operations that are most restrictive should be executed before
other similar operations
60
Choice of Query Execution Plans
1. Alternatives for Query Evaluation: An execution plan for a relational
algebra expression represented as a query tree includes information
about the access methods available for each relation as well as the
algorithms to be used in computing the relational operators
represented in the tree.
2. Nested Subquery Optimization : The process of removing the nested
query and converting the outer and inner query into one block is called
unnesting. Whenever possible, query optimizer tries to convert
queries with nested subqueries into a join operation.
3. Subquery (View) Merging Transformation : There are instances where
a subquery appears in the FROM clause of a query and amounts to
including a derived relation, which is similar to predefined view that is
involved in the query. This FROM clause subquery is often referred to as
inline view. In such cases, the transformation of the query can be 61
Choice of Query Execution Plans
• -referred to as a view-merging or subquery merging transformation.
• 4. Materialized Views : A view is defined in the database as a query, and
a materialized view stores the results of that query. Using materialized
views. A materialized view may be stored temporarily to allow more
queries to be processed against it or permanently.
• A materialized view constitutes derived data because its content can be
computed as a result of processing the defining query of the materialized
view. The main idea behind materialization is that it is much cheaper to
read it when needed and query against it than to recognize it from
scratch. The savings can be significant when the view involves costly
operations like join, aggregation and so forth.
62
Choice of Query Execution Plans
• Use of Selectivities in Cost Based Optimization: A query optimizer does not
depend solely on heuristic rules or query transformations: it also eliminates
and compares the costs of executing a query using different execution
strategies and algorithms, and it then chooses the strategy with the lower cost
estimates. For this approach to work, accurate cost estimates are required so
that different strategies can be compared fairly and realistically.
63