Query Proc Notes
Query Proc Notes
Query Proc Notes
In databases that provide low-level access routines such flat file databases, the
programmer must write code to perform the queries.
With higher level database query languages such as SQL, a special component of
the DBMS called the Query Processor takes care of arranging the underlying
access routines to satisfy a given query.
Thus queries can be specified in terms of the required results rather than in terms
of how to achieve those results.
For any given query, there may be a number of different ways to execute it.
Each operation in the query (SELECT, JOIN, etc.) can be implemented using one
or more different Access Routines.
For example, an access routine that employs an index to retrieve some rows
would be more efficient that an access routine that performs a full table scan.
The goal of the query optimizer is to find a reasonably efficient strategy for
executing the query (not quite what the name implies) using the access routines.
Optimization typically takes one of two forms: Heuristic Optimization or Cost
Based Optimization
In Heuristic Optimization, the query execution is refined based on heuristic
rules for reordering the individual operations.
With Cost Based Optimization, the overall cost of executing the query is
systematically reduced by estimating the costs of executing several different
execution plans.
Once the query optimizer has determined the execution plan (the specific ordering
of access routines), the code generator writes out the actual access routines to be
executed.
With an interactive session, the query code is interpreted and passed directly to
the runtime database processor for execution.
It is also possible to compile the access routines and store them for later
execution.
At this point, the query has been scanned, parsed, planned and (possibly)
compiled.
The runtime database processor then executes the access routines against the
database.
The results are returned to the application that made the query in the first place.
Any runtime errors are also returned.
Query Optimization
The goal of the query optimizer is to find a reasonably efficient strategy for
executing the query (not quite what the name implies) using the access routines.
Example
SELECT DISTINCT Emp.name, Dept.phone
FROM Emp, Dept
WHERE
Emp.dept = Sales and
Emp.dept = Dept.name
A query can be represented as a tree data structure. Operations are at the interior
nodes and data items (tables, columns) are at the leaves.
The query is evaluated in a depth-first pattern.
In Oracle this is Rule Based optimization.
Consider this query from Elmasri/Navathe:
In relational algebra:
SSN BDATE
ADDRESS
25000 987654321 4
JENNIFER S WALLACE
43000 888665555 4
RAMESH
K NARAYAN
38000 333445555 5
JOYCE
A ENGLISH
25000 333445555 5
AHMAD
V JABBAR
25000 987654321 4
JAMES
E BORG
55000
1
DEPARTMENT TABLE:
DNAME
DNUMBER
--------------- --------HEADQUARTERS
1
ADMINISTRATION
4
RESEARCH
5
MGRSSN
--------888665555
987654321
333445555
PROJECT TABLE:
PNAME
PNUMBER
---------------- ------ProductX
1
ProductY
2
ProductZ
3
Computerization
10
Reorganization
20
NewBenefits
30
PLOCATION
---------Bellaire
Sugarland
Houston
Stafford
Houston
Stafford
MGRSTARTD
--------19-JUN-71
01-JAN-85
22-MAY-78
DNUM
---5
5
5
4
1
4
WORKS_ON TABLE:
ESSN
PNO
--------- --123456789
1
123456789
2
666884444
3
453453453
1
453453453
2
333445555
2
333445555
3
333445555
10
333445555
20
999887777
30
999887777
10
987987987
10
987987987
30
987654321
30
987654321
20
888665555
20
HOURS
----32.5
7.5
40.0
20.0
20.0
10.0
10.0
10.0
10.0
30.0
10.0
35.0
5.0
20.0
15.0
null
Note the two cross product operations. These require lots of space and time
(nested loops) to build.
After the two cross products, we have a temporary table with 144 records (6
projects * 3 departments * 8 employees).
An overall rule for heuristic query optimization is to perform as many select and
project operations as possible before doing any joins.
Just looking at the Syntax of the query may not give the whole picture - need to
look at the data as well.
Several Cost components to consider:
1. Access cost to secondary storage (hard disk)
2. Storage Cost for intermediate result sets
3. Computation costs: CPU, memory transfers, etc. for performing inmemory operations.
4. Communications Costs to ship data around a network. e.g., in a
distributed or client/server database.
Of these, Access cost is the most crucial in a centralized DBMS. The more work
we can do with data in cache or in memory, the better.
Access Routines are algorithms that are used to access and aggregate data in a
database.
An RDBMS may have a collection of general purpose access routines that can be
combined to implement a query execution plan.
We are interested in access routines for selection, projection, join and set
operations such as union, intersection, set difference, cartesian product, etc.
As with heuristic optimization, there can be many different plans that lead to the
same result.
In general, if a query contains n operations, there will be n! possible plans.
However, not all plans will make sense. We should consider:
Perform all simple selections first
Perform joins next
Perform projection last
Overview of the Cost Based optimization process
1. Enumerate all of the legitimate plans (call these P1...Pn) where each plan
contains a set of operations O1...Ok
2. Select a plan
3. For each operation Oi in the plan, enumerate the access routines
4. For each possible Access routine for Oi, estimate the cost
Select the access routine with the lowest cost
5. Repeat previous 2 steps until an efficient access routine has been selected
for each operation
Sum up the costs of each access routine to determine a total cost for the
plan
6. Repeat steps 2 through 5 for each plan and choose the plan with the lowest
total cost.
Example outline: Assume 3 operations (one projection, two selections and a
join) : P1 S1 S2 and J1
In general, perform the selections first, and then the join and finally the projection
1. Enumerate the plans.
Note there are two orderings of selection that are possible so the two plans
become:
2.
3.
4.
5.
6.
Plan A: S1 S2 J1 P1
Plan B: S2 S1 J1 P1
Choose a plan (let us start with Plan A)
For each operation enumerate the access routines:
Operation S1 has possible access routines: Linear Search and binary
search
Operation S2 has possible access routines: Linear Search and indexed
search
Operation J1 has possible access routines: Nested Loop join and indexed
join
Choose the least cost access routine for each operation
Operation S1 least cost access routine is binary search at a cost of 10
blocks
Operation S2 least cost access routine is linear search at a cost of 20
blocks
Operation J1 least cost access routine is indexed join at a cost of 40 blocks
The sum of the costs for each access routine are: 10 + 20 + 30 = 70
Thus the total cost for Plan A will be: 70
In repeating the steps 2 though 5 for Plan B, we come up with:
Operation S2 least cost access routine is binary search at a cost of 20
blocks
Operation S1 least cost access routine is indexed search at a cost of 5
blocks
Operation J1 least cost access routine is indexed join at a cost of 30 blocks
Columns to Index
Clustering Index
The clustering index holds the most potential for performance gains
A clustered index allow for merge join
Columns that benefit from clustering:
1. Primary key column
2. Columns with low index cardinality (# of distinct values) or skewed
distribution
Code Transformation
In Oracle, you can embed hints in SQL statements to guide the optimizer towards
making more efficient choices.
The general syntax for a SELECT statement is:
SELECT /*+ hint */
colA, colB, colC, ...
FROM tab1, tab2, ...
The following query forces the use of the Rule based optimizer:
SELECT /*+ RULE */
fname, lname, salary, dno
FROM
employee
WHERE salary > 39000;
In MS SQL Server, query hints can be added using an OPTIONS clause at the end of the
statement:
SELECT select_list
FROM table_source
WHERE search_condition
GROUP BY group_by_expression
HAVING search_condition
ORDER BY order_expression
OPTIONS (query options)