[go: up one dir, main page]

0% found this document useful (0 votes)
61 views10 pages

Query Proc Notes

Download as doc, pdf, or txt
Download as doc, pdf, or txt
Download as doc, pdf, or txt
You are on page 1/ 10

Query Processing and Optimization Query Processing

In databases that provide low-level access routines such flat file databases, the
programmer must write code to perform the queries.
With higher level database query languages such as SQL, a special component of
the DBMS called the Query Processor takes care of arranging the underlying
access routines to satisfy a given query.
Thus queries can be specified in terms of the required results rather than in terms
of how to achieve those results.

A query is processed in four general steps:


1.
2.
3.
4.

Scanning and Parsing


Query Optimization or planning the execution strategy
Query Code Generator (interpreted or compiled)
Execution in the runtime database processor

1. Scanning and Parsing

When a query is first submitted (via an applications program), it must be scanned


and parsed to determine if the query consists of appropriate syntax.
Scanning is the process of converting the query text into a tokenized
representation.
The tokenized representation is more compact and is suitable for processing by
the parser.
This representation may be in a tree form.
The Parser checks the tokenized representation for correct syntax.
In this stage, checks are made to determine if columns and tables identified in the
query exist in the database and if the query has been formed correctly with the
appropriate keywords and structure.
If the query passes the parsing checks, then it is passed on to the Query Optimizer.

2. Query Optimization or Planning the Execution Strategy

For any given query, there may be a number of different ways to execute it.
Each operation in the query (SELECT, JOIN, etc.) can be implemented using one
or more different Access Routines.
For example, an access routine that employs an index to retrieve some rows
would be more efficient that an access routine that performs a full table scan.
The goal of the query optimizer is to find a reasonably efficient strategy for
executing the query (not quite what the name implies) using the access routines.
Optimization typically takes one of two forms: Heuristic Optimization or Cost
Based Optimization
In Heuristic Optimization, the query execution is refined based on heuristic
rules for reordering the individual operations.

With Cost Based Optimization, the overall cost of executing the query is
systematically reduced by estimating the costs of executing several different
execution plans.

3. Query Code Generator (interpreted or compiled)

Once the query optimizer has determined the execution plan (the specific ordering
of access routines), the code generator writes out the actual access routines to be
executed.
With an interactive session, the query code is interpreted and passed directly to
the runtime database processor for execution.
It is also possible to compile the access routines and store them for later
execution.

4. Execution in the runtime database processor

At this point, the query has been scanned, parsed, planned and (possibly)
compiled.
The runtime database processor then executes the access routines against the
database.
The results are returned to the application that made the query in the first place.
Any runtime errors are also returned.

Query Optimization

The goal of the query optimizer is to find a reasonably efficient strategy for
executing the query (not quite what the name implies) using the access routines.
Example
SELECT DISTINCT Emp.name, Dept.phone
FROM Emp, Dept
WHERE
Emp.dept = Sales and
Emp.dept = Dept.name

1-2 selections, 1 join, 1-2 projections


Several possible query execution plans (QEPs)
Step 1: R1 = Emp Dept
Step 2: R2 = dept = Sales and dept = name2 (R1)
Step 3: name1, phone (R2)
or
Step 1: R1 = dept = Sales (Emp)
Step 2: R2 = R1 R1.dept = Dept.name Dept
Step 3: name1, phone (R2)
Among all possible equivalent QEPs, the one that can be evaluated with the
minimum cost is the so-called optimal plan

cost = I/O cost + CPU cost


(fetches of data from disk + joins and comparisons in main memory)
Usually, I/O cost >> CPU cost
To reduce I/O cost, indexing structures are used as catalogues to data
Primary key retrieval: B+ trees, Hashing
Secondary key retrieval: Inverted files
The Selectivity (s) = the ratio of records satisfying the condition
C1 = V
1 / COLCARD
C1 IN (V1, V2, ..)
# of values in list / COLCARD
C1 > V
(HIGHKEY - V) / (HIGHKEY - LOWKEY)
C2 < V
(V - LOWKEY) / (HIGHKEY - LOWKEY)
Number of I/O needed = s * # of rows or pages (blocks)

Query Optimization Strategies


We divide the query optimization into two types: Heuristic (sometimes called Rule based)
and Systematic (Cost based).

Heuristic Query Optimization

A query can be represented as a tree data structure. Operations are at the interior
nodes and data items (tables, columns) are at the leaves.
The query is evaluated in a depth-first pattern.
In Oracle this is Rule Based optimization.
Consider this query from Elmasri/Navathe:

SELECT PNUMBER, DNUM, LNAME


FROM
PROJECT, DEPARTMENT, EMPLOYEE
WHERE DNUM=DNUMBER and MGRSSN=SSN and
PLOCATION = 'Stafford';

In relational algebra:

on the following schema:


EMPLOYEE TABLE
FNAME
MI LNAME
SALARY SUPERSSN DNO
-------- -- ------------ --------- -JOHN
B SMITH
30000 333445555 5
FRANKLIN T WONG
40000 888665555 5
ALICIA
J ZELAYA

SSN BDATE

ADDRESS

--------- --------- ------------------------- 123456789 09-JAN-55 731 FONDREN, HOUSTON, TX

333445555 08-DEC-45 638 VOSS,HOUSTON TX

999887777 19-JUL-58 3321 CASTLE, SPRING, TX

25000 987654321 4
JENNIFER S WALLACE
43000 888665555 4
RAMESH
K NARAYAN
38000 333445555 5
JOYCE
A ENGLISH
25000 333445555 5
AHMAD
V JABBAR
25000 987654321 4
JAMES
E BORG
55000
1

987654321 20-JUN-31 291 BERRY, BELLAIRE, TX

666884444 15-SEP-52 975 FIRE OAK, HUMBLE, TX

453453453 31-JUL-62 5631 RICE, HOUSTON, TX

987987987 29-MAR-59 980 DALLAS, HOUSTON, TX

888665555 10-NOV-27 450 STONE, HOUSTON, TX

DEPARTMENT TABLE:
DNAME
DNUMBER
--------------- --------HEADQUARTERS
1
ADMINISTRATION
4
RESEARCH
5

MGRSSN
--------888665555
987654321
333445555

PROJECT TABLE:
PNAME
PNUMBER
---------------- ------ProductX
1
ProductY
2
ProductZ
3
Computerization
10
Reorganization
20
NewBenefits
30

PLOCATION
---------Bellaire
Sugarland
Houston
Stafford
Houston
Stafford

MGRSTARTD
--------19-JUN-71
01-JAN-85
22-MAY-78

DNUM
---5
5
5
4
1
4

WORKS_ON TABLE:
ESSN
PNO
--------- --123456789
1
123456789
2
666884444
3
453453453
1
453453453
2
333445555
2
333445555
3
333445555
10
333445555
20
999887777
30
999887777
10
987987987
10
987987987
30
987654321
30
987654321
20
888665555
20

Which of the following query trees is more efficient ?

HOURS
----32.5
7.5
40.0
20.0
20.0
10.0
10.0
10.0
10.0
30.0
10.0
35.0
5.0
20.0
15.0
null

The left hand tree is evaluated in steps as follows:

The right hand tree is evaluated in steps as follows:

Note the two cross product operations. These require lots of space and time
(nested loops) to build.
After the two cross products, we have a temporary table with 144 records (6
projects * 3 departments * 8 employees).
An overall rule for heuristic query optimization is to perform as many select and
project operations as possible before doing any joins.

Systematic (Cost based) Query Optimization

Just looking at the Syntax of the query may not give the whole picture - need to
look at the data as well.
Several Cost components to consider:
1. Access cost to secondary storage (hard disk)
2. Storage Cost for intermediate result sets
3. Computation costs: CPU, memory transfers, etc. for performing inmemory operations.
4. Communications Costs to ship data around a network. e.g., in a
distributed or client/server database.
Of these, Access cost is the most crucial in a centralized DBMS. The more work
we can do with data in cache or in memory, the better.
Access Routines are algorithms that are used to access and aggregate data in a
database.
An RDBMS may have a collection of general purpose access routines that can be
combined to implement a query execution plan.
We are interested in access routines for selection, projection, join and set
operations such as union, intersection, set difference, cartesian product, etc.
As with heuristic optimization, there can be many different plans that lead to the
same result.
In general, if a query contains n operations, there will be n! possible plans.
However, not all plans will make sense. We should consider:
Perform all simple selections first
Perform joins next
Perform projection last
Overview of the Cost Based optimization process
1. Enumerate all of the legitimate plans (call these P1...Pn) where each plan
contains a set of operations O1...Ok
2. Select a plan
3. For each operation Oi in the plan, enumerate the access routines
4. For each possible Access routine for Oi, estimate the cost
Select the access routine with the lowest cost
5. Repeat previous 2 steps until an efficient access routine has been selected
for each operation
Sum up the costs of each access routine to determine a total cost for the
plan
6. Repeat steps 2 through 5 for each plan and choose the plan with the lowest
total cost.
Example outline: Assume 3 operations (one projection, two selections and a
join) : P1 S1 S2 and J1
In general, perform the selections first, and then the join and finally the projection
1. Enumerate the plans.
Note there are two orderings of selection that are possible so the two plans
become:

2.
3.

4.

5.
6.

Plan A: S1 S2 J1 P1
Plan B: S2 S1 J1 P1
Choose a plan (let us start with Plan A)
For each operation enumerate the access routines:
Operation S1 has possible access routines: Linear Search and binary
search
Operation S2 has possible access routines: Linear Search and indexed
search
Operation J1 has possible access routines: Nested Loop join and indexed
join
Choose the least cost access routine for each operation
Operation S1 least cost access routine is binary search at a cost of 10
blocks
Operation S2 least cost access routine is linear search at a cost of 20
blocks
Operation J1 least cost access routine is indexed join at a cost of 40 blocks
The sum of the costs for each access routine are: 10 + 20 + 30 = 70
Thus the total cost for Plan A will be: 70
In repeating the steps 2 though 5 for Plan B, we come up with:
Operation S2 least cost access routine is binary search at a cost of 20
blocks
Operation S1 least cost access routine is indexed search at a cost of 5
blocks
Operation J1 least cost access routine is indexed join at a cost of 30 blocks

The sum of the costs for each access routine are: 20 + 5 + 30 = 55


Thus the total cost for Plan B will be: 55
Final result: Plan B would be the best plan - pass Plan B along to the query code
generator.

Columns to Index

Primary key and foreign key columns


Columns that have aggregates computed frequently
Columns that are searched or joined over less than 5 to 10 percent of the rows
Columns frequently used in an ORDER BY, GROUP BY, and DISTINCT clause

Clustering Index

The clustering index holds the most potential for performance gains
A clustered index allow for merge join
Columns that benefit from clustering:
1. Primary key column
2. Columns with low index cardinality (# of distinct values) or skewed
distribution

3. Column frequently processed in sequence (ORDER BY, GROUP BY,


DISTINCT)
4. Columns frequently searched or joined over a range of values such as
BETWEEN, <, >, LIKE Avoid indexing columns

When Not to Index


1. That are frequently updated
2. That have low cardinality, skewed distribution (unless clustering index is
used)
3. That are longer than 40 bytes
4. That cause lock contention problems
5. Avoid indexing tables
6. That are fewer than 7 pages unless used frequently in joins or in referential
integrity checking

Code Transformation

Adding predicates, transforming code, flattening nested select


Examples of adding predicates
1. WHERE
TA.C1 = TB.C5
AND
TA.C1 > 10
AND
TB.C5 > 10
-- added by the optimizer
2. WHERE
TA.C1 = TB.C5
AND
TB.C5 = TC.C10
AND
TC.C10 = TA.C1 -- added by the optimizer
Examples of transforming code
1. WHERE TA.C1 = 10 OR TA.C1 = 30
transformed to:
WHERE TA.C1 IN (10, 30)

Using Hints in Oracle SQL

In Oracle, you can embed hints in SQL statements to guide the optimizer towards
making more efficient choices.
The general syntax for a SELECT statement is:
SELECT /*+ hint */
colA, colB, colC, ...
FROM tab1, tab2, ...

The /* and */ are normally comments.


Placing the + sign (plus sign) after the opening comment causes the comment to
be treated as a hint.
Different values for hint can include:

ALL_ROWS - Optimize the query for best throughput (lowest resource


utilization)
o FIRST_ROWS - Optimize for fastest response time.
o CHOOSE - Optimizer chooses either Rule based or Cost based. If
statistics are available (via the ANALYZE TABLE command), Cost based
is chosen, otherwise, rule based is chosen.
o RULE - Force the use of the Rule based optimizer.
For example, the following query forces the use of the cost based optimizer,
working to minimize response time:
o

SELECT /*+ FIRST_ROWS */


fname, lname, salary, dno
FROM
employee
WHERE salary > 39000;

The following query forces the use of the Rule based optimizer:
SELECT /*+ RULE */
fname, lname, salary, dno
FROM
employee
WHERE salary > 39000;

Other types of hints include:


o FULL(table) - Force a full table scan for table. i.e., will ignore an index if
one exists.
o INDEX (table index) - Force the use of a given index when accessing the
given table
o ORDERED - Force the tables to be joined in the order in which they
appear in the FROM clause.
o STAR - Forces a star query plan (come back to this when we talk about
data warehousing).
o USE_NL (table1, table2) - Use a nested loop to perform the join
operation for table1 to table2
o USE_MERGE (table1, table2) - Use a sort-merge join operation to join
table1 with table2
o DRIVING_SITE - Force the query execution to be performed at a
different site in a distributed database.
o PUSH_SUBQ - Evaluate a non-correlated subquery as early as possible in
the query plan.
For more details, see Chapter 8 Optimizer Modes and Hints in the book:
Oracle8(TM) Server Tuning Release 8.0. Part Number: A54638-01
For a general explanation of how Oracle handles query optimization see Chapter
19 The Optimizer in the book: Oracle8(TM) Server Concepts Release 8.0. Part
Number: A54643-01

Using Hints in MS SQL Server


Note: This material from the MS SQL Server 7.0 Books Online

In MS SQL Server, query hints can be added using an OPTIONS clause at the end of the
statement:
SELECT select_list
FROM table_source
WHERE search_condition
GROUP BY group_by_expression
HAVING search_condition
ORDER BY order_expression
OPTIONS (query options)

Some of the query options available are:

{HASH | ORDER} GROUP


Specifies that aggregations described in the GROUP BY or COMPUTE clause of the query should
use hashing or ordering.
{MERGE | HASH | CONCAT} UNION
Specifies that all UNION operations are performed by merging, hashing, or concatenating UNION
sets. If more than one UNION hint is specified, the query optimizer selects the least expensive
strategy from those hints specified.
{LOOP | MERGE | HASH |} JOIN
Specifies that all join operations are performed by loop join, merge join, or hash join in the whole
query. If more than one join hint is specified, the query optimizer selects the least expensive join
strategy for the allowed ones. If, in the same query, a join hint is also specified for a specific pair
of tables, it takes precedence in the joining of the two tables.
FAST number_rows
Specifies that the query is optimized for fast retrieval of the first number_rows (a nonnegative
integer). After the first number_rows are returned, the query continues execution and produces its
full result set.
FORCE ORDER
Specifies that the join order indicated by the query syntax is preserved during query optimization.

You might also like