[go: up one dir, main page]

0% found this document useful (0 votes)
11 views90 pages

Zyqwadawfafslecture09 Query Optimization

The lecture covers query optimization, focusing on cost-based optimization algorithms that evaluate alternative query plans to choose the one with the lowest estimated cost. It discusses the importance of considering various physical plans, the history of query optimization, and the foundational laws of relational algebra that guide the optimization process. Additionally, it highlights the role of semijoins and their significance in parallel databases.

Uploaded by

Linacero
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views90 pages

Zyqwadawfafslecture09 Query Optimization

The lecture covers query optimization, focusing on cost-based optimization algorithms that evaluate alternative query plans to choose the one with the lowest estimated cost. It discusses the importance of considering various physical plans, the history of query optimization, and the foundational laws of relational algebra that guide the optimization process. Additionally, it highlights the role of semijoins and their significance in parallel databases.

Uploaded by

Linacero
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 90

Lecture 9

Query Optimization
November 24, 2010

Dan Suciu -- CSEP544 Fall 2010 1


Outline
• Chapter 15 in the textbook

• Paper: Query Optimizers: Time to


Rethink the Contract ?, by Surajit
Chaudhuri

• Next time: parallel databases, Bloom


filters
Dan Suciu -- CSEP544 Fall 2010 2
Query Optimization Algorithm
• Enumerate alternative plans

• Compute estimated cost of each plan


– Compute number of I/Os
– Compute CPU cost

• Choose plan with lowest cost


– This is called cost-based optimization
Dan Suciu -- CSEP544 Fall 2010 3
Example
Supplier(sid, sname, scity, SELECT sname
sstate) FROM Supplier x, Supply y
Supply(sid, pno, quantity) WHERE x.sid = y.sid
and y.pno = 2
• Some statistics
and x.scity = „Seattle‟
– T(Supplier) = 1000 records
and x.sstate = „WA‟
– T(Supply) = 10,000 records
– B(Supplier) = 100 pages
– B(Supply) = 100 pages
– V(Supplier,scity) = 20, V(Supplier,state) = 10
– V(Supply,pno) = 2,500
– Both relations are clustered
• M = 10
Dan Suciu -- CSEP544 Fall 2010 4
T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 10
T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10
V(Supply,pno) = 2,500

Physical Query Plan 1


(On the fly) sname

(On the fly)


scity=„Seattle‟ sstate=„WA‟  pno=2

(Block-nested loop)
sid = sid

Supplier Supply
(File scan) (File scan)
Dan Suciu -- CSEP544 Fall 2010 5
T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 10
T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10
V(Supply,pno) = 2,500

Physical Query Plan 1


(On the fly) sname Selection and project on-the-fly
-> No additional cost.
(On the fly)
scity=„Seattle‟ sstate=„WA‟  pno=2

(Block-nested loop) Total cost of plan is thus cost of join:


= B(Supplier)+B(Supplier)*B(Supply)/M
sid = sid = 100 + 10 * 100
= 1,100 I/Os

Supplier Supply
(File scan) (File scan)
Dan Suciu -- CSEP544 Fall 2010 6
T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 10
T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10
V(Supply,pno) = 2,500

Physical Query Plan 2


(On the fly) sname 

(Sort-merge join) 


sid = sid
(Scan
write to T1) (Scan
scity=„Seattle‟ sstate=„WA‟ write to T2)
 pno=2

Supplier Supply
(File scan) (File scan)
Dan Suciu -- CSEP544 Fall 2010 7
T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 10
T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10
V(Supply,pno) = 2,500

Physical Query Plan 2


Total cost
(On the fly) sname 
= 100 + 100 * 1/20 * 1/10 (1)
+ 100 + 100 * 1/2500 (2)
+ 2 (3)
(Sort-merge join)  + 0 (4)
sid = sid Total cost  204 I/Os
(Scan
write to T1) (Scan
scity=„Seattle‟ sstate=„WA‟ write to T2)
 pno=2

Supplier Supply
(File scan) (File scan)
Dan Suciu -- CSEP544 Fall 2010 8
T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 10
T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10
V(Supply,pno) = 2,500

Physical Query Plan 3


(On the fly)  sname
(On the fly)
 scity=„Seattle‟ sstate=„WA‟

 (Index nested loop)


sid = sid

(Use index)
 pno=2

Supply Supplier
(Index lookup on pno ) (Index lookup on sid)
Assume: clustered Doesn‟t matter if clustered or not9
T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 10
T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10
V(Supply,pno) = 2,500

Physical Query Plan 3


(On the fly)  sname
(On the fly)
 scity=„Seattle‟ sstate=„WA‟

 (Index nested loop)


sid = sid

(Use index) 4 tuples


 pno=2

Supply Supplier
(Index lookup on pno ) (Index lookup on sid)
Assume: clustered Doesn‟t matter if clustered or not10
T(Supplier) = 1000 B(Supplier) = 100 V(Supplier,scity) = 20 M = 10
T(Supply) = 10,000 B(Supply) = 100 V(Supplier,state) = 10
V(Supply,pno) = 2,500

Physical Query Plan 3


(On the fly)  sname
Total cost
(On the fly) = 1 (1)
+ 4 (2)
 scity=„Seattle‟ sstate=„WA‟ + 0 (3)
+ 0 (3)
Total cost  5 I/Os
 (Index nested loop)
sid = sid

(Use index) 4 tuples


 pno=2

Supply Supplier
(Index lookup on pno ) (Index lookup on sid)
Assume: clustered Doesn‟t matter if clustered or not11
Simplifications
• In the previous examples, we assumed
that all index pages were in memory

• When this is not the case, we need to


add the cost of fetching index pages
from disk

Dan Suciu -- CSEP544 Fall 2010 12


Lessons
1. Need to consider several physical plan
– even for one, simple logical plan

2. No plan is best in general


– need to have statistics over the data
– the B‟s, the T‟s, the V‟s

Dan Suciu -- CSEP544 Fall 2010 13


The Contract of the Optimizer
[Chaudhuri]
• high-quality execution plans for all
queries,
• while taking relatively small optimization
time, and
• with limited additional input such as
histograms.

Dan Suciu -- CSEP544 Fall 2010 14


Query Optimization
Three major components:

1. Search space

2. Algorithm for enumerating query plans

3. Cardinality and cost estimation


Dan Suciu -- CSEP544 Fall 2010 15
History of Query Optimization
• First query optimizer was for System R,
from IBM, in 1979
• It had all three components in place, and
defined the architecture of query
optimizers for years to come
• You will see often references to System R
• Read Section 15.6 in the book

Dan Suciu -- CSEP544 Fall 2010 16


1. Search Space
• This is the set of all alternative plans
that are considered by the optimizer

• Defined by the set of algebraic laws and


the set of plans used by the optimizer

• Will discuss these laws next


Dan Suciu -- CSEP544 Fall 2010 17
Left-Deep Plans and
Bushy Plans

R2

R4

R3 R1 R3 R1 R2 R4
Left-deep plan Bushy plan

System R considered only left deep plans,


and so do some optimizers
Dan Suciu today
-- CSEP544 Fall 2010 18
Relational Algebra Laws
• Selections
– Commutative: c1(c2(R)) = c2(c1(R))
– Cascading: c1c2(R) = c2(c1(R))
• Projections
• Joins
– Commutativity : R ⋈ S = S ⋈ R
– Associativity: R ⋈ (S ⋈ T) = (R ⋈ S) ⋈ T
– Distributivity: R ⨝ (S  T) = (R ⨝ S)  (R ⨝ T)
– Outer joins get more complicated
Dan Suciu -- CSEP544 Fall 2010 19
Example
Which plan is more efficient ?
R ⨝ (S ⨝ T) or (R ⨝ S) ⨝ T ?
• Assumptions:
– Every join selectivity is 10%
• That is: T(R ⨝ S) = 0.1 * T(R) * T(S) etc.
– B(R)=100, B(S) = 50, B(T)=500
– All joins are main memory joins
– All intermediate results are materialized
Dan Suciu -- CSEP544 Fall 2010 20
Simple Laws

 C AND C‟(R) =  C( C‟(R)) =  C(R)   C‟(R)


 C OR C‟(R) =  C(R)   C‟(R)
 C (R ⨝ S) =  C (R) ⨝ S

When C involves
 C (R – S) =  C (R) – S only attributes of R
 C (R  S) =  C (R)   C (S)
 C (R ⨝ S) =  C (R) ⨝ S
Dan Suciu -- CSEP544 Fall 2010 21
Example
• Example: R(A, B, C, D), S(E, F, G)
 F=3 (R ⨝ D=E S) = ?
 A=5 AND G=9 (R ⨝ D=E S) = ?

Dan Suciu -- CSEP544 Fall 2010 22


Simple Laws

PM(R ⨝ S) = PM(PP(R) ⨝ PQ(S))


PM(PN(R)) = PM(R) /* note that M ⊆ N */

• Example R(A,B,C,D), S(E, F, G)


PA,B,G(R ⨝ D=E S) = P ? (P?(R) ⨝ D=E P?(S))

Dan Suciu -- CSEP544 Fall 2010 23


Laws for Group-by and Join
if agg is
“duplicate
insensitive”
(A, agg(B)(R)) = A, agg(B)(R)
A, agg(B)((R)) = A, agg(B)(R)

Which of the following are “duplicate insensitive” ?


sum, count, avg, min, max

Dan Suciu -- CSEP544 Fall 2010 24


Example

A, agg(D)(R(A,B) ⨝ B=C S(C,D)) =


A, agg(D)(R(A,B) ⨝ B=C (C, agg(D)S(C,D)))

These are very powerful laws.


They were introduced only in the 90‟s.

Dan Suciu -- CSEP544 Fall 2010 25


Laws Involving Constraints
Foreign key
Product(pid, pname, price, cid)
Company(cid, cname, city, state)

Ppid, price(Product ⨝cid=cid Company) = Ppid, price(Product)

Need a second constraint for this law to hold. Which ?

Dan Suciu -- CSEP544 Fall 2010 26


Example Foreign key
Product(pid, pname, price, cid)
Company(cid, cname, city, state)

CREATE VIEW CheapProductCompany


SELECT *
FROM Product x, Company y
WHERE x.cid = y.cid and x.price < 100

SELECT pname, price SELECT pname, price


FROM CheapProductCompany FROM Product

Dan Suciu -- CSEP544 Fall 2010 27


Law of Semijoins
Recall the definition of a semijoin:
• R ⋉ S = P A1,…,An (R ⨝ S)
• Where the schemas are:
– Input: R(A1,…An), S(B1,…,Bm)
– Output: T(A1,…,An)
• The law of semijoins is:
R ⨝ S = (R ⋉ S) ⨝ S
Dan Suciu -- CSEP544 Fall 2010 28
Laws with Semijoins
• Very important in parallel databases
• Often combined with Bloom Filters (next
lecture)
• Read pp. 747 in the textbook

Dan Suciu -- CSEP544 Fall 2010 29


Semijoin Reducer
• Given a query: Q = R1 ⨝ R2 ⨝ . . . ⨝ Rn

Ri1 = Ri1 ⋉ Rj1


• A semijoin reducer for Q is
Ri2 = Ri2 ⋉ Rj2
.....
Rip = Rip ⋉ Rjp

such that the query is equivalent to:


Q = Rk1 ⨝ Rk2 ⨝ . . . ⨝ Rkn

• A full reducer is such that no dangling tuples remain

Dan Suciu -- CSEP544 Fall 2010 30


Example
• Example:
Q = R(A,B) ⨝ S(B,C)

• A semijoin reducer is:


R1(A,B) = R(A,B) ⋉ S(B,C)

• The rewritten query is:


Q = R1(A,B) ⨝ S(B,C)
31
Why Would We Do This ?
• Large attributes:
Q = R(A, B, D, E, F,…) ⨝ S(B, C, M, K, L, …)

• Expensive side computations


Q = γA,B,count(*)R(A,B,D) ⨝ σC=value(S(B,C))

R1(A,B,D) = R(A,B,D) ⋉ σC=value(S(B,C))


Q = γA,B,count(*)R1(A,B,D) ⨝ σC=value(S(B,C))

32
Semijoin Reducer
• Example:
Q = R(A,B) ⨝ S(B,C)

• A semijoin reducer is:


R1(A,B) = R(A,B) ⋉ S(B,C)

• The rewritten query is:


Q = R1(A,B) ⨝ S(B,C)

Are there dangling tuples ?


33
Semijoin Reducer
• Example:
Q = R(A,B) ⨝ S(B,C)

• A full semijoin reducer is:


R1(A,B) = R(A,B) ⋉ S(B,C)
S1(B,C) = S(B,C) ⋉ R1(A,B)

• The rewritten query is:


Q :- R1(A,B) ⨝ S1 (B,C)

No more dangling tuples 34


Semijoin Reducer
• More complex example:
Q = R(A,B) ⨝ S(B,C) ⨝ T(C,D,E)

• A full reducer is:


S‟(B,C) := S(B,C) ⋉ R(A,B)
T‟(C,D,E) := T(C,D,E) ⋉ S(B,C)
S‟‟(B,C) := S‟(B,C) ⋉ T‟(C,D,E)
R‟(A,B) := R (A,B) ⋉ S‟‟(B,C)

Q = R‟(A,B) ⨝ S‟‟(B,C) ⨝ T‟(C,D,E)


35
Semijoin Reducer
• Example:

Q = R(A,B) ⨝ S(B,C) ⨝ T(A,C)

• Doesn‟t have a full reducer (we can reduce forever)

Theorem a query has a full reducer iff it is “acyclic”


[Database Theory, by Abiteboul, Hull, Vianu]

Dan Suciu -- CSEP544 Fall 2010 36


Example with Semijoins
Emp(eid, ename, sal, did) [Chaudhuri‟98]
Dept(did, dname, budget)
DeptAvgSal(did, avgsal) /* view */

View: CREATE VIEW DepAvgSal As (


SELECT E.did, Avg(E.Sal) AS avgsal
FROM Emp E
GROUP BY E.did)

Query: SELECT E.eid, E.sal


FROM Emp E, Dept D, DepAvgSal V
WHERE E.did = D.did AND E.did = V.did
AND E.age < 30 AND D.budget > 100k
AND E.sal > V.avgsal
37
Goal: compute only the necessary part of the view
Example with Semijoins
Emp(eid, ename, sal, did) [Chaudhuri‟98]
Dept(did, dname, budget)
DeptAvgSal(did, avgsal) /* view */
CREATE VIEW LimitedAvgSal As (
New view SELECT E.did, Avg(E.Sal) AS avgsal
uses a reducer: FROM Emp E, Dept D
WHERE E.did = D.did AND D.buget > 100k
GROUP BY E.did)

SELECT E.eid, E.sal


New query: FROM Emp E, Dept D, LimitedAvgSal V
WHERE E.did = D.did AND E.did = V.did
AND E.age < 30 AND D.budget > 100k
AND E.sal > V.avgsal 38
Example with Semijoins
Emp(eid, ename, sal, did)
Dept(did, dname, budget) [Chaudhuri‟98]
DeptAvgSal(did, avgsal) /* view */
CREATE VIEW PartialResult AS
(SELECT E.eid, E.sal, E.did
FROM Emp E, Dept D
Full reducer: WHERE E.did=D.did AND E.age < 30
AND D.budget > 100k)

CREATE VIEW Filter AS


(SELECT DISTINCT P.did FROM PartialResult P)

CREATE VIEW LimitedAvgSal AS


(SELECT E.did, Avg(E.Sal) AS avgsal
FROM Emp E, Filter F
39
WHERE E.did = F.did GROUP BY E.did)
Example with Semijoins

New query:

SELECT P.eid, P.sal


FROM PartialResult P, LimitedDepAvgSal V
WHERE P.did = V.did AND P.sal > V.avgsal

Dan Suciu -- CSEP544 Fall 2010 40


Pruning the Search Space
• Prune entire sets of plans that are
unpromising

• The choice of partial plans influences how


effective we can prune

Dan Suciu -- CSEP544 Fall 2010 41


Complete Plans
R(A,B) SELECT *
S(B,C) FROM R, S, T
T(C,D) WHERE R.B=S.B and S.C=T.C and R.A<40



⨝ Pruning is
T difficult
σA<40 ⨝ here.
σA<40 S
R S T
R
Dan Suciu -- CSEP544 Fall 2010 42
Bottom-up Partial Plans
R(A,B) SELECT *
S(B,C) FROM R, S, T
T(C,D) WHERE R.B=S.B and S.C=T.C and R.A<40

Pruning can be done ⨝


more efficiently
⨝ ⨝ T

σA<40 ⨝ σA<40 S ⨝ σA<40 S


…..
R S T R R S R 43
Top-down Partial Plans
R(A,B) SELECT *
S(B,C) FROM R, S, T
T(C,D) WHERE R.B=S.B and S.C=T.C and R.A<40

⨝ ⨝ σA<40


T T SELECT R.A, T.D
SELECT * FROM R, S, T
FROM R, S WHERE R.B=S.B
WHERE R.B=S.B
SELECT * S and S.C=T.C …..
and R.A < 40
FROM R
WHERE R.A < 40 44
Query Optimization
Three major components:

1. Search space

2. Algorithm for enumerating query plans

3. Cardinality and cost estimation


Dan Suciu -- CSEP544 Fall 2010 45
2. Plan Enumeration Algorithms
• System R (in class)
– Join reordering – dynamic programming
– Access path selection
– Bottom-up; simple; limited
• Modern database optimizers (will not discuss)
– Rule-based: database of rules (x 100s)
– Dynamic programming
– Top-down; complex; extensible
Dan Suciu -- CSEP544 Fall 2010 46
Join Reordering
System R [1979]
• Push all selections down (=early) in the query plan
• Pull all projections up (=late) in the query plan
• What remains are joins:

SELECT list
FROM R1, …, Rn
WHERE cond1 AND cond2 AND . . . AND condk

Dan Suciu -- CSEP544 Fall 2010 47


SELECT list
FROM R1, …, Rn
WHERE cond1 AND cond2 AND . . . AND condk

Join Reordering
Dynamic programming

• For each subquery Q {R1, …, Rn},


compute the optimal join order for Q

• Store results in a table: 2n-1 entries


– Often much fewer entries
Dan Suciu -- CSEP544 Fall 2010 48
SELECT list
FROM R1, …, Rn
WHERE cond1 AND cond2 AND . . . AND condk

Join Reordering
Step 1: For each {Ri} do:
• Initialize the table entry for {Ri} with the cheapest
access path for Ri
Step 2: For each subset Q  {R1, …, Rn} do:
• For every partition Q = Q‟  Q‟‟
• Lookup optimal plan for Q‟ and for Q‟‟ in the table
• Compute the cost of the plan Q‟ ⨝ Q‟‟
• Store the cheapest plan Q‟ ⨝ Q‟‟ in table entry for Q
Dan Suciu -- CSEP544 Fall 2010 49
Reducing the Search Space
Restriction 1: only left linear trees (no bushy)

Restriction 2: no trees with cartesian product

R(A,B) ⨝ S(B,C) ⨝ T(C,D)


Plan: (R(A,B)⨝T(C,D)) ⨝ S(B,C)
has a cartesian product.
Most query optimizers will not consider it
Dan Suciu -- CSEP544 Fall 2010 50
Access Path Selection
• Access path: a way to retrieve tuples from a table
– A file scan
– An index plus a matching selection condition

• Index matches selection condition if it can be used to


retrieve just tuples that satisfy the condition
– Example: Supplier(sid,sname,scity,sstate)
– B+-tree index on (scity,sstate)
• matches scity=„Seattle‟
• does not match sid=3, does not match sstate=„WA‟

Dan Suciu -- CSEP544 Fall 2010 51


Access Path Selection
• Supplier(sid,sname,scity,sstate)

• Selection condition: sid > 300 scity=„Seattle‟

• Indexes: B+-tree on sid and B+-tree on scity

Dan Suciu -- CSEP544 Fall 2010 52


Access Path Selection
• Supplier(sid,sname,scity,sstate)

• Selection condition: sid > 300 scity=„Seattle‟

• Indexes: B+-tree on sid and B+-tree on scity

• Which access path should we use?

Dan Suciu -- CSEP544 Fall 2010 53


Access Path Selection
• Supplier(sid,sname,scity,sstate)

• Selection condition: sid > 300 scity=„Seattle‟

• Indexes: B+-tree on sid and B+-tree on scity

• Which access path should we use?

• We should pick the most selective access path

Dan Suciu -- CSEP544 Fall 2010 54


Access Path Selectivity
• Access path selectivity is the number of pages
retrieved if we use this access path
– Most selective retrieves fewest pages

• As we saw earlier, for equality predicates


– Selection on equality: a=v(R)
– V(R, a) = # of distinct values of attribute a
– 1/V(R,a) is thus the reduction factor
– Clustered index on a: cost B(R)/V(R,a)
– Unclustered index on a: cost T(R)/V(R,a)
– (we are ignoring I/O cost of index pages for simplicity)

Dan Suciu -- CSEP544 Fall 2010 55


Other Decisions for the
Optimization Algorithm
• How much memory to allocate to each
operator

• Pipeline or materialize (next)

Dan Suciu -- CSEP544 Fall 2010 56


Materialize Intermediate
Results Between Operators
HashTable  S

V2 ⋈ repeat read(R, x)
y  join(HashTable, x)
write(V1, y)

HashTable  T
⋈ U repeat read(V1, y)
z  join(HashTable, y)
V1 write(V2, z)

HashTable  U
⋈ T repeat read(V2, z)
u  join(HashTable, z)
write(Answer, u)

R S Dan Suciu -- CSEP544 Fall 2010 57


Materialize Intermediate
Results Between Operators
Question in class

Given B(R), B(S), B(T), B(U)

• What is the total cost of the plan ?


– Cost =
• How much main memory do we need ?
– M=

Dan Suciu -- CSEP544 Fall 2010 58


Pipeline Between Operators


HashTable1  S
HashTable2  T
HashTable3  U
⋈ U repeat read(R, x)
y  join(HashTable1, x)
z  join(HashTable2, y)
u  join(HashTable3, z)
write(Answer, u)
⋈ T

R S Dan Suciu -- CSEP544 Fall 2010 59


Pipeline Between Operators
Question in class

Given B(R), B(S), B(T), B(U)

• What is the total cost of the plan ?


– Cost =
• How much main memory do we need ?
– M=

Dan Suciu -- CSEP544 Fall 2010 60


Pipeline in Bushy Trees




⋈ ⋈
⋈ V
Z
R S T I X Y
Dan Suciu -- CSEP544 Fall 2010 61
Example (will skip in class)
• Logical plan is:

k blocks
U(y,z)
10,000 blocks
R(w,x) S(x,y)
5,000 blocks 10,000 blocks

• Main memory M = 101 buffers

Dan Suciu -- CSEP544 Fall 2010 62


Example (will skip in class)
M = 101

k blocks
U(y,z)
10,000 blocks
R(w,x) S(x,y)
5,000 blocks 10,000 blocks

Naïve evaluation:
• 2 partitioned hash-joins
• Cost = 3B(R) + 3B(S) + 4k + 3B(U) = 75000 + 4k

63
Example (will skip in class)
M = 101

k blocks
U(y,z)
10,000 blocks
R(w,x) S(x,y)
5,000 blocks 10,000 blocks
Smarter:
• Step 1: hash R on x into 100 buckets, each of 50 blocks; to disk
• Step 2: hash S on x into 100 buckets; to disk
• Step 3: read each Ri in memory (50 buffer) join with Si (1 buffer); hash
result on y into 50 buckets (50 buffers) -- here we pipeline
• Cost so far: 3B(R) + 3B(S) 64
Example (will skip in class)
M = 101

k blocks
U(y,z)
10,000 blocks
R(w,x) S(x,y)
5,000 blocks 10,000 blocks
Continuing:
• How large are the 50 buckets on y ? Answer: k/50.
• If k <= 50 then keep all 50 buckets in Step 3 in memory, then:
• Step 4: read U from disk, hash on y and join with memory
• Total cost: 3B(R) + 3B(S) + B(U) = 55,000 65
Example (will skip in class)
M = 101

k blocks
U(y,z)
10,000 blocks
R(w,x) S(x,y)
5,000 blocks 10,000 blocks
Continuing:
• If 50 < k <= 5000 then send the 50 buckets in Step 3 to disk
– Each bucket has size k/50 <= 100
• Step 4: partition U into 50 buckets
• Step 5: read each partition and join in memory
• Total cost: 3B(R) + 3B(S) + 2k + 3B(U) = 75,000 + 2k 66
Example (will skip in class)
M = 101

k blocks
U(y,z)
10,000 blocks
R(w,x) S(x,y)
5,000 blocks 10,000 blocks
Continuing:
• If k > 5000 then materialize instead of pipeline
• 2 partitioned hash-joins
• Cost 3B(R) + 3B(S) + 4k + 3B(U) = 75000 + 4k
67
Query Optimization
Three major components:

1. Search space

2. Algorithm for enumerating query plans

3. Cardinality and cost estimation


Dan Suciu -- CSEP544 Fall 2010 68
3. Cardinality and Cost
Estimation
• Collect statistical summaries of stored data

• Estimate size (=cardinality) in a bottom-up


fashion
– This is the most difficult part, and still inadequate
in today‟s query optimizers
• Estimate cost by using the estimated size
– Hand-written formulas, similar to those we used
for computing the cost of each physical operator
Dan Suciu -- CSEP544 Fall 2010 69
Statistics on Base Data
• Collected information for each relation
– Number of tuples (cardinality)
– Indexes, number of keys in the index
– Number of physical pages, clustering info
– Statistical information on attributes
• Min value, max value, number distinct values
• Histograms
– Correlations between columns (hard)
• Collection approach: periodic, using sampling
Dan Suciu -- CSEP544 Fall 2010 70
Size Estimation Problem

S = SELECT list
FROM R1, …, Rn
WHERE cond1 AND cond2 AND . . . AND condk

Given T(R1), T(R2), …, T(Rn)


Estimate T(S)

How can we do this ? Note: doesn‟t have to be exact.


Dan Suciu -- CSEP544 Fall 2010 71
Size Estimation Problem

S = SELECT list
FROM R1, …, Rn
WHERE cond1 AND cond2 AND . . . AND condk

Remark: T(S) ≤ T(R1) × T(R2) × … × T(Rn)

Dan Suciu -- CSEP544 Fall 2010 72


Selectivity Factor
• Each condition cond reduces the size
by some factor called selectivity factor

• Assuming independence, multiply the


selectivity factors

Dan Suciu -- CSEP544 Fall 2010 73


Example
R(A,B) SELECT *
S(B,C) FROM R, S, T
T(C,D) WHERE R.B=S.B and S.C=T.C and R.A<40

T(R) = 30k, T(S) = 200k, T(T) = 10k

Selectivity of R.B = S.B is 1/3


Selectivity of S.C = T.C is 1/10
Selectivity of R.A < 40 is ½

What is the estimated size of the query output ?


Dan Suciu -- CSEP544 Fall 2010 74
Rule of Thumb
• If selectivities are unknown, then:
selectivity factor = 1/10
[System R, 1979]

Dan Suciu -- CSEP544 Fall 2010 75


Using Data Statistics
• Condition is A = c /* value selection on R */
– Selectivity = 1/V(R,A)

• Condition is A < c /* range selection on R */


– Selectivity = (c - Low(R, A))/(High(R,A) - Low(R,A))T(R)

• Condition is A = B /* R ⨝A=B S */
– Selectivity = 1 / max(V(R,A),V(S,A))
– (will explain next)

Dan Suciu -- CSEP544 Fall 2010 76


Assumptions
• Containment of values: if V(R,A) <= V(S,B), then
the set of A values of R is included in the set of
B values of S
– Note: this indeed holds when A is a foreign key in R,
and B is a key in S

• Preservation of values: for any other attribute C,


V(R ⨝A=B S, C) = V(R, C) (or V(S, C))

Dan Suciu -- CSEP544 Fall 2010 77


Selectivity of R ⨝A=B S
Assume V(R,A) <= V(S,B)

• Each tuple t in R joins with T(S)/V(S,B) tuple(s) in S

• Hence T(R ⨝A=B S) = T(R) T(S) / V(S,B)

In general: T(R ⨝A=B S) = T(R) T(S) / max(V(R,A),V(S,B))

Dan Suciu -- CSEP544 Fall 2010 78


Size Estimation for Join
Example:
• T(R) = 10000, T(S) = 20000
• V(R,A) = 100, V(S,B) = 200
• How large is R ⨝A=B S ?

Dan Suciu -- CSEP544 Fall 2010 79


Histograms
• Statistics on data maintained by the
RDBMS
• Makes size estimation much more
accurate (hence, cost estimations are
more accurate)

Dan Suciu -- CSEP544 Fall 2010 80


Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Empolyee, age) = 50
min(age) = 19, max(age) = 68

σage=48(Empolyee) = ? σage>28 and age<35(Empolyee) = ?

Dan Suciu -- CSEP544 Fall 2010 81


Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Empolyee, age) = 50
min(age) = 19, max(age) = 68

σage=48(Empolyee) = ? σage>28 and age<35(Empolyee) = ?

Estimate = 25000 / 50 = 500 Estimate = 25000 * 6 / 60 = 2500


Dan Suciu -- CSEP544 Fall 2010 82
Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Empolyee, age) = 50
min(age) = 19, max(age) = 68

σage=48(Empolyee) = ? σage>28 and age<35(Empolyee) = ?

Age: 0..20 20..29 30-39 40-49 50-59 > 60

Tuples 200 800 5000 12000 6500 500

Dan Suciu -- CSEP544 Fall 2010 83


Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Empolyee, age) = 50
min(age) = 19, max(age) = 68

σage=48(Empolyee) = ? σage>28 and age<35(Empolyee) = ?

Age: 0..20 20..29 30-39 40-49 50-59 > 60

Tuples 200 800 5000 12000 6500 500

Estimate = 1200 Estimate = 2*80 + 5*500 = 266084


Types of Histograms
• How should we determine the bucket
boundaries in a histogram ?

Dan Suciu -- CSEP544 Fall 2010 85


Types of Histograms
• How should we determine the bucket
boundaries in a histogram ?

• Eq-Width
• Eq-Depth
• Compressed
• V-Optimal histograms
Dan Suciu -- CSEP544 Fall 2010 86
Employee(ssn, name, age)
Histograms
Eq-width:

Age: 0..20 20..29 30-39 40-49 50-59 > 60

Tuples 200 800 5000 12000 6500 500

Eq-depth:

Age: 0..20 20..29 30-39 40-49 50-59 > 60

Tuples 1800 2000 2100 2200 1900 1800

Compressed: store separately highly frequent values: (48,1900)


Dan Suciu -- CSEP544 Fall 2010 87
V-Optimal Histograms
• Defines bucket boundaries in an optimal
way, to minimize the error over all point
queries
• Computed rather expensively, using
dynamic programming
• Modern databases systems use V-
optimal histograms or some variations

Dan Suciu -- CSEP544 Fall 2010 88


Difficult Questions on Histograms
• Small number of buckets
– Hundreds, or thousands, but not more
– WHY ?
• Not updated during database update,
but recomputed periodically
– WHY ?
• Multidimensional histograms rarely used
– WHY ?
Dan Suciu -- CSEP544 Fall 2010 89
Summary of Query
Optimization
• Three parts:
– search space, algorithms, size/cost estimation
• Ideal goal: find optimal plan. But
– Impossible to estimate accurately
– Impossible to search the entire space
• Goal of today‟s optimizers:
– Avoid very bad plans

Dan Suciu -- CSEP544 Fall 2010 90

You might also like