Lecture 9
Query Optimization
November 24, 2010
Dan Suciu -- CSEP544 Fall 2010 1
Outline
• Chapter 15 in the textbook
• Paper: Query Optimizers: Time to
Rethink the Contract ?, by Surajit
Chaudhuri
• Next time: parallel databases, Bloom
filters
Dan Suciu -- CSEP544 Fall 2010 2
Query Optimization Algorithm
• Enumerate alternative plans
• Compute estimated cost of each plan
– Compute number of I/Os
– Compute CPU cost
• Choose plan with lowest cost
– This is called cost-based optimization
Dan Suciu -- CSEP544 Fall 2010 3
Example
Supplier(sid, sname, scity, SELECT sname
sstate) FROM Supplier x, Supply y
Supply(sid, pno, quantity) WHERE x.sid = y.sid
and y.pno = 2
• Some statistics
and x.scity = „Seattle‟
– T(Supplier) = 1000 records
and x.sstate = „WA‟
– T(Supply) = 10,000 records
– B(Supplier) = 100 pages
– B(Supply) = 100 pages
– V(Supplier,scity) = 20, V(Supplier,state) = 10
– V(Supply,pno) = 2,500
– Both relations are clustered
• M = 10
Dan Suciu -- CSEP544 Fall 2010 4
T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 10
T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10
V(Supply,pno) = 2,500
Physical Query Plan 1
(On the fly) sname
(On the fly)
scity=„Seattle‟ sstate=„WA‟ pno=2
(Block-nested loop)
sid = sid
Supplier Supply
(File scan) (File scan)
Dan Suciu -- CSEP544 Fall 2010 5
T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 10
T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10
V(Supply,pno) = 2,500
Physical Query Plan 1
(On the fly) sname Selection and project on-the-fly
-> No additional cost.
(On the fly)
scity=„Seattle‟ sstate=„WA‟ pno=2
(Block-nested loop) Total cost of plan is thus cost of join:
= B(Supplier)+B(Supplier)*B(Supply)/M
sid = sid = 100 + 10 * 100
= 1,100 I/Os
Supplier Supply
(File scan) (File scan)
Dan Suciu -- CSEP544 Fall 2010 6
T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 10
T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10
V(Supply,pno) = 2,500
Physical Query Plan 2
(On the fly) sname
(Sort-merge join)
sid = sid
(Scan
write to T1) (Scan
scity=„Seattle‟ sstate=„WA‟ write to T2)
pno=2
Supplier Supply
(File scan) (File scan)
Dan Suciu -- CSEP544 Fall 2010 7
T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 10
T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10
V(Supply,pno) = 2,500
Physical Query Plan 2
Total cost
(On the fly) sname
= 100 + 100 * 1/20 * 1/10 (1)
+ 100 + 100 * 1/2500 (2)
+ 2 (3)
(Sort-merge join) + 0 (4)
sid = sid Total cost 204 I/Os
(Scan
write to T1) (Scan
scity=„Seattle‟ sstate=„WA‟ write to T2)
pno=2
Supplier Supply
(File scan) (File scan)
Dan Suciu -- CSEP544 Fall 2010 8
T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 10
T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10
V(Supply,pno) = 2,500
Physical Query Plan 3
(On the fly) sname
(On the fly)
scity=„Seattle‟ sstate=„WA‟
(Index nested loop)
sid = sid
(Use index)
pno=2
Supply Supplier
(Index lookup on pno ) (Index lookup on sid)
Assume: clustered Doesn‟t matter if clustered or not9
T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 10
T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10
V(Supply,pno) = 2,500
Physical Query Plan 3
(On the fly) sname
(On the fly)
scity=„Seattle‟ sstate=„WA‟
(Index nested loop)
sid = sid
(Use index) 4 tuples
pno=2
Supply Supplier
(Index lookup on pno ) (Index lookup on sid)
Assume: clustered Doesn‟t matter if clustered or not10
T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 10
T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10
V(Supply,pno) = 2,500
Physical Query Plan 3
(On the fly) sname
Total cost
(On the fly) = 1 (1)
+ 4 (2)
scity=„Seattle‟ sstate=„WA‟ + 0 (3)
+ 0 (3)
Total cost 5 I/Os
(Index nested loop)
sid = sid
(Use index) 4 tuples
pno=2
Supply Supplier
(Index lookup on pno ) (Index lookup on sid)
Assume: clustered Doesn‟t matter if clustered or not11
Simplifications
• In the previous examples, we assumed
that all index pages were in memory
• When this is not the case, we need to
add the cost of fetching index pages
from disk
Dan Suciu -- CSEP544 Fall 2010 12
Lessons
1. Need to consider several physical plan
– even for one, simple logical plan
2. No plan is best in general
– need to have statistics over the data
– the B‟s, the T‟s, the V‟s
Dan Suciu -- CSEP544 Fall 2010 13
The Contract of the Optimizer
[Chaudhuri]
• high-quality execution plans for all
queries,
• while taking relatively small optimization
time, and
• with limited additional input such as
histograms.
Dan Suciu -- CSEP544 Fall 2010 14
Query Optimization
Three major components:
1. Search space
2. Algorithm for enumerating query plans
3. Cardinality and cost estimation
Dan Suciu -- CSEP544 Fall 2010 15
History of Query Optimization
• First query optimizer was for System R,
from IBM, in 1979
• It had all three components in place, and
defined the architecture of query
optimizers for years to come
• You will see often references to System R
• Read Section 15.6 in the book
Dan Suciu -- CSEP544 Fall 2010 16
1. Search Space
• This is the set of all alternative plans
that are considered by the optimizer
• Defined by the set of algebraic laws and
the set of plans used by the optimizer
• Will discuss these laws next
Dan Suciu -- CSEP544 Fall 2010 17
Left-Deep Plans and
Bushy Plans
R2
R4
R3 R1 R3 R1 R2 R4
Left-deep plan Bushy plan
System R considered only left deep plans,
and so do some optimizers
Dan Suciu today
-- CSEP544 Fall 2010 18
Relational Algebra Laws
• Selections
– Commutative: c1(c2(R)) = c2(c1(R))
– Cascading: c1c2(R) = c2(c1(R))
• Projections
• Joins
– Commutativity : R ⋈ S = S ⋈ R
– Associativity: R ⋈ (S ⋈ T) = (R ⋈ S) ⋈ T
– Distributivity: R ⨝ (S T) = (R ⨝ S) (R ⨝ T)
– Outer joins get more complicated
Dan Suciu -- CSEP544 Fall 2010 19
Example
Which plan is more efficient ?
R ⨝ (S ⨝ T) or (R ⨝ S) ⨝ T ?
• Assumptions:
– Every join selectivity is 10%
• That is: T(R ⨝ S) = 0.1 * T(R) * T(S) etc.
– B(R)=100, B(S) = 50, B(T)=500
– All joins are main memory joins
– All intermediate results are materialized
Dan Suciu -- CSEP544 Fall 2010 20
Simple Laws
C AND C‟(R) = C( C‟(R)) = C(R) C‟(R)
C OR C‟(R) = C(R) C‟(R)
C (R ⨝ S) = C (R) ⨝ S
When C involves
C (R – S) = C (R) – S only attributes of R
C (R S) = C (R) C (S)
C (R ⨝ S) = C (R) ⨝ S
Dan Suciu -- CSEP544 Fall 2010 21
Example
• Example: R(A, B, C, D), S(E, F, G)
F=3 (R ⨝ D=E S) = ?
A=5 AND G=9 (R ⨝ D=E S) = ?
Dan Suciu -- CSEP544 Fall 2010 22
Simple Laws
PM(R ⨝ S) = PM(PP(R) ⨝ PQ(S))
PM(PN(R)) = PM(R) /* note that M ⊆ N */
• Example R(A,B,C,D), S(E, F, G)
PA,B,G(R ⨝ D=E S) = P ? (P?(R) ⨝ D=E P?(S))
Dan Suciu -- CSEP544 Fall 2010 23
Laws for Group-by and Join
if agg is
“duplicate
insensitive”
(A, agg(B)(R)) = A, agg(B)(R)
A, agg(B)((R)) = A, agg(B)(R)
Which of the following are “duplicate insensitive” ?
sum, count, avg, min, max
Dan Suciu -- CSEP544 Fall 2010 24
Example
A, agg(D)(R(A,B) ⨝ B=C S(C,D)) =
A, agg(D)(R(A,B) ⨝ B=C (C, agg(D)S(C,D)))
These are very powerful laws.
They were introduced only in the 90‟s.
Dan Suciu -- CSEP544 Fall 2010 25
Laws Involving Constraints
Foreign key
Product(pid, pname, price, cid)
Company(cid, cname, city, state)
Ppid, price(Product ⨝cid=cid Company) = Ppid, price(Product)
Need a second constraint for this law to hold. Which ?
Dan Suciu -- CSEP544 Fall 2010 26
Example Foreign key
Product(pid, pname, price, cid)
Company(cid, cname, city, state)
CREATE VIEW CheapProductCompany
SELECT *
FROM Product x, Company y
WHERE x.cid = y.cid and x.price < 100
SELECT pname, price SELECT pname, price
FROM CheapProductCompany FROM Product
Dan Suciu -- CSEP544 Fall 2010 27
Law of Semijoins
Recall the definition of a semijoin:
• R ⋉ S = P A1,…,An (R ⨝ S)
• Where the schemas are:
– Input: R(A1,…An), S(B1,…,Bm)
– Output: T(A1,…,An)
• The law of semijoins is:
R ⨝ S = (R ⋉ S) ⨝ S
Dan Suciu -- CSEP544 Fall 2010 28
Laws with Semijoins
• Very important in parallel databases
• Often combined with Bloom Filters (next
lecture)
• Read pp. 747 in the textbook
Dan Suciu -- CSEP544 Fall 2010 29
Semijoin Reducer
• Given a query: Q = R1 ⨝ R2 ⨝ . . . ⨝ Rn
Ri1 = Ri1 ⋉ Rj1
• A semijoin reducer for Q is
Ri2 = Ri2 ⋉ Rj2
.....
Rip = Rip ⋉ Rjp
such that the query is equivalent to:
Q = Rk1 ⨝ Rk2 ⨝ . . . ⨝ Rkn
• A full reducer is such that no dangling tuples remain
Dan Suciu -- CSEP544 Fall 2010 30
Example
• Example:
Q = R(A,B) ⨝ S(B,C)
• A semijoin reducer is:
R1(A,B) = R(A,B) ⋉ S(B,C)
• The rewritten query is:
Q = R1(A,B) ⨝ S(B,C)
31
Why Would We Do This ?
• Large attributes:
Q = R(A, B, D, E, F,…) ⨝ S(B, C, M, K, L, …)
• Expensive side computations
Q = γA,B,count(*)R(A,B,D) ⨝ σC=value(S(B,C))
R1(A,B,D) = R(A,B,D) ⋉ σC=value(S(B,C))
Q = γA,B,count(*)R1(A,B,D) ⨝ σC=value(S(B,C))
32
Semijoin Reducer
• Example:
Q = R(A,B) ⨝ S(B,C)
• A semijoin reducer is:
R1(A,B) = R(A,B) ⋉ S(B,C)
• The rewritten query is:
Q = R1(A,B) ⨝ S(B,C)
Are there dangling tuples ?
33
Semijoin Reducer
• Example:
Q = R(A,B) ⨝ S(B,C)
• A full semijoin reducer is:
R1(A,B) = R(A,B) ⋉ S(B,C)
S1(B,C) = S(B,C) ⋉ R1(A,B)
• The rewritten query is:
Q :- R1(A,B) ⨝ S1 (B,C)
No more dangling tuples 34
Semijoin Reducer
• More complex example:
Q = R(A,B) ⨝ S(B,C) ⨝ T(C,D,E)
• A full reducer is:
S‟(B,C) := S(B,C) ⋉ R(A,B)
T‟(C,D,E) := T(C,D,E) ⋉ S(B,C)
S‟‟(B,C) := S‟(B,C) ⋉ T‟(C,D,E)
R‟(A,B) := R (A,B) ⋉ S‟‟(B,C)
Q = R‟(A,B) ⨝ S‟‟(B,C) ⨝ T‟(C,D,E)
35
Semijoin Reducer
• Example:
Q = R(A,B) ⨝ S(B,C) ⨝ T(A,C)
• Doesn‟t have a full reducer (we can reduce forever)
Theorem a query has a full reducer iff it is “acyclic”
[Database Theory, by Abiteboul, Hull, Vianu]
Dan Suciu -- CSEP544 Fall 2010 36
Example with Semijoins
Emp(eid, ename, sal, did) [Chaudhuri‟98]
Dept(did, dname, budget)
DeptAvgSal(did, avgsal) /* view */
View: CREATE VIEW DepAvgSal As (
SELECT E.did, Avg(E.Sal) AS avgsal
FROM Emp E
GROUP BY E.did)
Query: SELECT E.eid, E.sal
FROM Emp E, Dept D, DepAvgSal V
WHERE E.did = D.did AND E.did = V.did
AND E.age < 30 AND D.budget > 100k
AND E.sal > V.avgsal
37
Goal: compute only the necessary part of the view
Example with Semijoins
Emp(eid, ename, sal, did) [Chaudhuri‟98]
Dept(did, dname, budget)
DeptAvgSal(did, avgsal) /* view */
CREATE VIEW LimitedAvgSal As (
New view SELECT E.did, Avg(E.Sal) AS avgsal
uses a reducer: FROM Emp E, Dept D
WHERE E.did = D.did AND D.buget > 100k
GROUP BY E.did)
SELECT E.eid, E.sal
New query: FROM Emp E, Dept D, LimitedAvgSal V
WHERE E.did = D.did AND E.did = V.did
AND E.age < 30 AND D.budget > 100k
AND E.sal > V.avgsal 38
Example with Semijoins
Emp(eid, ename, sal, did)
Dept(did, dname, budget) [Chaudhuri‟98]
DeptAvgSal(did, avgsal) /* view */
CREATE VIEW PartialResult AS
(SELECT E.eid, E.sal, E.did
FROM Emp E, Dept D
Full reducer: WHERE E.did=D.did AND E.age < 30
AND D.budget > 100k)
CREATE VIEW Filter AS
(SELECT DISTINCT P.did FROM PartialResult P)
CREATE VIEW LimitedAvgSal AS
(SELECT E.did, Avg(E.Sal) AS avgsal
FROM Emp E, Filter F
39
WHERE E.did = F.did GROUP BY E.did)
Example with Semijoins
New query:
SELECT P.eid, P.sal
FROM PartialResult P, LimitedDepAvgSal V
WHERE P.did = V.did AND P.sal > V.avgsal
Dan Suciu -- CSEP544 Fall 2010 40
Pruning the Search Space
• Prune entire sets of plans that are
unpromising
• The choice of partial plans influences how
effective we can prune
Dan Suciu -- CSEP544 Fall 2010 41
Complete Plans
R(A,B) SELECT *
S(B,C) FROM R, S, T
T(C,D) WHERE R.B=S.B and S.C=T.C and R.A<40
⨝
⨝
⨝ Pruning is
T difficult
σA<40 ⨝ here.
σA<40 S
R S T
R
Dan Suciu -- CSEP544 Fall 2010 42
Bottom-up Partial Plans
R(A,B) SELECT *
S(B,C) FROM R, S, T
T(C,D) WHERE R.B=S.B and S.C=T.C and R.A<40
Pruning can be done ⨝
more efficiently
⨝ ⨝ T
σA<40 ⨝ σA<40 S ⨝ σA<40 S
…..
R S T R R S R 43
Top-down Partial Plans
R(A,B) SELECT *
S(B,C) FROM R, S, T
T(C,D) WHERE R.B=S.B and S.C=T.C and R.A<40
⨝ ⨝ σA<40
⨝
T T SELECT R.A, T.D
SELECT * FROM R, S, T
FROM R, S WHERE R.B=S.B
WHERE R.B=S.B
SELECT * S and S.C=T.C …..
and R.A < 40
FROM R
WHERE R.A < 40 44
Query Optimization
Three major components:
1. Search space
2. Algorithm for enumerating query plans
3. Cardinality and cost estimation
Dan Suciu -- CSEP544 Fall 2010 45
2. Plan Enumeration Algorithms
• System R (in class)
– Join reordering – dynamic programming
– Access path selection
– Bottom-up; simple; limited
• Modern database optimizers (will not discuss)
– Rule-based: database of rules (x 100s)
– Dynamic programming
– Top-down; complex; extensible
Dan Suciu -- CSEP544 Fall 2010 46
Join Reordering
System R [1979]
• Push all selections down (=early) in the query plan
• Pull all projections up (=late) in the query plan
• What remains are joins:
SELECT list
FROM R1, …, Rn
WHERE cond1 AND cond2 AND . . . AND condk
Dan Suciu -- CSEP544 Fall 2010 47
SELECT list
FROM R1, …, Rn
WHERE cond1 AND cond2 AND . . . AND condk
Join Reordering
Dynamic programming
• For each subquery Q {R1, …, Rn},
compute the optimal join order for Q
• Store results in a table: 2n-1 entries
– Often much fewer entries
Dan Suciu -- CSEP544 Fall 2010 48
SELECT list
FROM R1, …, Rn
WHERE cond1 AND cond2 AND . . . AND condk
Join Reordering
Step 1: For each {Ri} do:
• Initialize the table entry for {Ri} with the cheapest
access path for Ri
Step 2: For each subset Q {R1, …, Rn} do:
• For every partition Q = Q‟ Q‟‟
• Lookup optimal plan for Q‟ and for Q‟‟ in the table
• Compute the cost of the plan Q‟ ⨝ Q‟‟
• Store the cheapest plan Q‟ ⨝ Q‟‟ in table entry for Q
Dan Suciu -- CSEP544 Fall 2010 49
Reducing the Search Space
Restriction 1: only left linear trees (no bushy)
Restriction 2: no trees with cartesian product
R(A,B) ⨝ S(B,C) ⨝ T(C,D)
Plan: (R(A,B)⨝T(C,D)) ⨝ S(B,C)
has a cartesian product.
Most query optimizers will not consider it
Dan Suciu -- CSEP544 Fall 2010 50
Access Path Selection
• Access path: a way to retrieve tuples from a table
– A file scan
– An index plus a matching selection condition
• Index matches selection condition if it can be used to
retrieve just tuples that satisfy the condition
– Example: Supplier(sid,sname,scity,sstate)
– B+-tree index on (scity,sstate)
• matches scity=„Seattle‟
• does not match sid=3, does not match sstate=„WA‟
Dan Suciu -- CSEP544 Fall 2010 51
Access Path Selection
• Supplier(sid,sname,scity,sstate)
• Selection condition: sid > 300 scity=„Seattle‟
• Indexes: B+-tree on sid and B+-tree on scity
Dan Suciu -- CSEP544 Fall 2010 52
Access Path Selection
• Supplier(sid,sname,scity,sstate)
• Selection condition: sid > 300 scity=„Seattle‟
• Indexes: B+-tree on sid and B+-tree on scity
• Which access path should we use?
Dan Suciu -- CSEP544 Fall 2010 53
Access Path Selection
• Supplier(sid,sname,scity,sstate)
• Selection condition: sid > 300 scity=„Seattle‟
• Indexes: B+-tree on sid and B+-tree on scity
• Which access path should we use?
• We should pick the most selective access path
Dan Suciu -- CSEP544 Fall 2010 54
Access Path Selectivity
• Access path selectivity is the number of pages
retrieved if we use this access path
– Most selective retrieves fewest pages
• As we saw earlier, for equality predicates
– Selection on equality: a=v(R)
– V(R, a) = # of distinct values of attribute a
– 1/V(R,a) is thus the reduction factor
– Clustered index on a: cost B(R)/V(R,a)
– Unclustered index on a: cost T(R)/V(R,a)
– (we are ignoring I/O cost of index pages for simplicity)
Dan Suciu -- CSEP544 Fall 2010 55
Other Decisions for the
Optimization Algorithm
• How much memory to allocate to each
operator
• Pipeline or materialize (next)
Dan Suciu -- CSEP544 Fall 2010 56
Materialize Intermediate
Results Between Operators
HashTable S
V2 ⋈ repeat read(R, x)
y join(HashTable, x)
write(V1, y)
HashTable T
⋈ U repeat read(V1, y)
z join(HashTable, y)
V1 write(V2, z)
HashTable U
⋈ T repeat read(V2, z)
u join(HashTable, z)
write(Answer, u)
R S Dan Suciu -- CSEP544 Fall 2010 57
Materialize Intermediate
Results Between Operators
Question in class
Given B(R), B(S), B(T), B(U)
• What is the total cost of the plan ?
– Cost =
• How much main memory do we need ?
– M=
Dan Suciu -- CSEP544 Fall 2010 58
Pipeline Between Operators
⋈
HashTable1 S
HashTable2 T
HashTable3 U
⋈ U repeat read(R, x)
y join(HashTable1, x)
z join(HashTable2, y)
u join(HashTable3, z)
write(Answer, u)
⋈ T
R S Dan Suciu -- CSEP544 Fall 2010 59
Pipeline Between Operators
Question in class
Given B(R), B(S), B(T), B(U)
• What is the total cost of the plan ?
– Cost =
• How much main memory do we need ?
– M=
Dan Suciu -- CSEP544 Fall 2010 60
Pipeline in Bushy Trees
⋈
⋈
⋈
⋈
⋈ ⋈
⋈ V
Z
R S T I X Y
Dan Suciu -- CSEP544 Fall 2010 61
Example (will skip in class)
• Logical plan is:
k blocks
U(y,z)
10,000 blocks
R(w,x) S(x,y)
5,000 blocks 10,000 blocks
• Main memory M = 101 buffers
Dan Suciu -- CSEP544 Fall 2010 62
Example (will skip in class)
M = 101
k blocks
U(y,z)
10,000 blocks
R(w,x) S(x,y)
5,000 blocks 10,000 blocks
Naïve evaluation:
• 2 partitioned hash-joins
• Cost = 3B(R) + 3B(S) + 4k + 3B(U) = 75000 + 4k
63
Example (will skip in class)
M = 101
k blocks
U(y,z)
10,000 blocks
R(w,x) S(x,y)
5,000 blocks 10,000 blocks
Smarter:
• Step 1: hash R on x into 100 buckets, each of 50 blocks; to disk
• Step 2: hash S on x into 100 buckets; to disk
• Step 3: read each Ri in memory (50 buffer) join with Si (1 buffer); hash
result on y into 50 buckets (50 buffers) -- here we pipeline
• Cost so far: 3B(R) + 3B(S) 64
Example (will skip in class)
M = 101
k blocks
U(y,z)
10,000 blocks
R(w,x) S(x,y)
5,000 blocks 10,000 blocks
Continuing:
• How large are the 50 buckets on y ? Answer: k/50.
• If k <= 50 then keep all 50 buckets in Step 3 in memory, then:
• Step 4: read U from disk, hash on y and join with memory
• Total cost: 3B(R) + 3B(S) + B(U) = 55,000 65
Example (will skip in class)
M = 101
k blocks
U(y,z)
10,000 blocks
R(w,x) S(x,y)
5,000 blocks 10,000 blocks
Continuing:
• If 50 < k <= 5000 then send the 50 buckets in Step 3 to disk
– Each bucket has size k/50 <= 100
• Step 4: partition U into 50 buckets
• Step 5: read each partition and join in memory
• Total cost: 3B(R) + 3B(S) + 2k + 3B(U) = 75,000 + 2k 66
Example (will skip in class)
M = 101
k blocks
U(y,z)
10,000 blocks
R(w,x) S(x,y)
5,000 blocks 10,000 blocks
Continuing:
• If k > 5000 then materialize instead of pipeline
• 2 partitioned hash-joins
• Cost 3B(R) + 3B(S) + 4k + 3B(U) = 75000 + 4k
67
Query Optimization
Three major components:
1. Search space
2. Algorithm for enumerating query plans
3. Cardinality and cost estimation
Dan Suciu -- CSEP544 Fall 2010 68
3. Cardinality and Cost
Estimation
• Collect statistical summaries of stored data
• Estimate size (=cardinality) in a bottom-up
fashion
– This is the most difficult part, and still inadequate
in today‟s query optimizers
• Estimate cost by using the estimated size
– Hand-written formulas, similar to those we used
for computing the cost of each physical operator
Dan Suciu -- CSEP544 Fall 2010 69
Statistics on Base Data
• Collected information for each relation
– Number of tuples (cardinality)
– Indexes, number of keys in the index
– Number of physical pages, clustering info
– Statistical information on attributes
• Min value, max value, number distinct values
• Histograms
– Correlations between columns (hard)
• Collection approach: periodic, using sampling
Dan Suciu -- CSEP544 Fall 2010 70
Size Estimation Problem
S = SELECT list
FROM R1, …, Rn
WHERE cond1 AND cond2 AND . . . AND condk
Given T(R1), T(R2), …, T(Rn)
Estimate T(S)
How can we do this ? Note: doesn‟t have to be exact.
Dan Suciu -- CSEP544 Fall 2010 71
Size Estimation Problem
S = SELECT list
FROM R1, …, Rn
WHERE cond1 AND cond2 AND . . . AND condk
Remark: T(S) ≤ T(R1) × T(R2) × … × T(Rn)
Dan Suciu -- CSEP544 Fall 2010 72
Selectivity Factor
• Each condition cond reduces the size
by some factor called selectivity factor
• Assuming independence, multiply the
selectivity factors
Dan Suciu -- CSEP544 Fall 2010 73
Example
R(A,B) SELECT *
S(B,C) FROM R, S, T
T(C,D) WHERE R.B=S.B and S.C=T.C and R.A<40
T(R) = 30k, T(S) = 200k, T(T) = 10k
Selectivity of R.B = S.B is 1/3
Selectivity of S.C = T.C is 1/10
Selectivity of R.A < 40 is ½
What is the estimated size of the query output ?
Dan Suciu -- CSEP544 Fall 2010 74
Rule of Thumb
• If selectivities are unknown, then:
selectivity factor = 1/10
[System R, 1979]
Dan Suciu -- CSEP544 Fall 2010 75
Using Data Statistics
• Condition is A = c /* value selection on R */
– Selectivity = 1/V(R,A)
• Condition is A < c /* range selection on R */
– Selectivity = (c - Low(R, A))/(High(R,A) - Low(R,A))T(R)
• Condition is A = B /* R ⨝A=B S */
– Selectivity = 1 / max(V(R,A),V(S,A))
– (will explain next)
Dan Suciu -- CSEP544 Fall 2010 76
Assumptions
• Containment of values: if V(R,A) <= V(S,B), then
the set of A values of R is included in the set of
B values of S
– Note: this indeed holds when A is a foreign key in R,
and B is a key in S
• Preservation of values: for any other attribute C,
V(R ⨝A=B S, C) = V(R, C) (or V(S, C))
Dan Suciu -- CSEP544 Fall 2010 77
Selectivity of R ⨝A=B S
Assume V(R,A) <= V(S,B)
• Each tuple t in R joins with T(S)/V(S,B) tuple(s) in S
• Hence T(R ⨝A=B S) = T(R) T(S) / V(S,B)
In general: T(R ⨝A=B S) = T(R) T(S) / max(V(R,A),V(S,B))
Dan Suciu -- CSEP544 Fall 2010 78
Size Estimation for Join
Example:
• T(R) = 10000, T(S) = 20000
• V(R,A) = 100, V(S,B) = 200
• How large is R ⨝A=B S ?
Dan Suciu -- CSEP544 Fall 2010 79
Histograms
• Statistics on data maintained by the
RDBMS
• Makes size estimation much more
accurate (hence, cost estimations are
more accurate)
Dan Suciu -- CSEP544 Fall 2010 80
Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Empolyee, age) = 50
min(age) = 19, max(age) = 68
σage=48(Empolyee) = ? σage>28 and age<35(Empolyee) = ?
Dan Suciu -- CSEP544 Fall 2010 81
Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Empolyee, age) = 50
min(age) = 19, max(age) = 68
σage=48(Empolyee) = ? σage>28 and age<35(Empolyee) = ?
Estimate = 25000 / 50 = 500 Estimate = 25000 * 6 / 60 = 2500
Dan Suciu -- CSEP544 Fall 2010 82
Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Empolyee, age) = 50
min(age) = 19, max(age) = 68
σage=48(Empolyee) = ? σage>28 and age<35(Empolyee) = ?
Age: 0..20 20..29 30-39 40-49 50-59 > 60
Tuples 200 800 5000 12000 6500 500
Dan Suciu -- CSEP544 Fall 2010 83
Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Empolyee, age) = 50
min(age) = 19, max(age) = 68
σage=48(Empolyee) = ? σage>28 and age<35(Empolyee) = ?
Age: 0..20 20..29 30-39 40-49 50-59 > 60
Tuples 200 800 5000 12000 6500 500
Estimate = 1200 Estimate = 2*80 + 5*500 = 266084
Types of Histograms
• How should we determine the bucket
boundaries in a histogram ?
Dan Suciu -- CSEP544 Fall 2010 85
Types of Histograms
• How should we determine the bucket
boundaries in a histogram ?
• Eq-Width
• Eq-Depth
• Compressed
• V-Optimal histograms
Dan Suciu -- CSEP544 Fall 2010 86
Employee(ssn, name, age)
Histograms
Eq-width:
Age: 0..20 20..29 30-39 40-49 50-59 > 60
Tuples 200 800 5000 12000 6500 500
Eq-depth:
Age: 0..20 20..29 30-39 40-49 50-59 > 60
Tuples 1800 2000 2100 2200 1900 1800
Compressed: store separately highly frequent values: (48,1900)
Dan Suciu -- CSEP544 Fall 2010 87
V-Optimal Histograms
• Defines bucket boundaries in an optimal
way, to minimize the error over all point
queries
• Computed rather expensively, using
dynamic programming
• Modern databases systems use V-
optimal histograms or some variations
Dan Suciu -- CSEP544 Fall 2010 88
Difficult Questions on Histograms
• Small number of buckets
– Hundreds, or thousands, but not more
– WHY ?
• Not updated during database update,
but recomputed periodically
– WHY ?
• Multidimensional histograms rarely used
– WHY ?
Dan Suciu -- CSEP544 Fall 2010 89
Summary of Query
Optimization
• Three parts:
– search space, algorithms, size/cost estimation
• Ideal goal: find optimal plan. But
– Impossible to estimate accurately
– Impossible to search the entire space
• Goal of today‟s optimizers:
– Avoid very bad plans
Dan Suciu -- CSEP544 Fall 2010 90