Sorting & Aggregations: Intro To Database Systems Andy Pavlo
Sorting & Aggregations: Intro To Database Systems Andy Pavlo
Sorting & Aggregations: Intro To Database Systems Andy Pavlo
10 Aggregations
ADMINISTRIVIA
C O U R S E S TAT U S
QUERY PL AN
SELECT A.id, B.value
The operators are arranged in a tree. FROM A, B
WHERE A.id = B.id
Data flows from the leaves of the tree AND B.value > 100
up towards the root.
p A.id, B.value
s value>100
A B
CMU 15-445/645 (Fall 2019)
5
D I S K- O R I E N T E D D B M S
T O D AY ' S A G E N D A
SORTING ALGORITHMS
Phase #1 – Sorting
→ Sort blocks of data that fit in main-memory and then
write back the sorted blocks to a file on disk.
Phase #2 – Merging
→ Combine sorted sub-files into a single larger file.
2 -WAY E X T E R N A L M E R G E S O R T
2 -WAY E X T E R N A L M E R G E S O R T
Pass #0
→ Read every B pages of the table into memory
→ Sort pages into runs and write them back to disk.
Memory
Disk
Page #1 Page #2
2 -WAY E X T E R N A L M E R G E S O R T
Pass #0
→ Read every B pages of the table into memory
→ Sort pages into runs and write them back to disk.
Memory
Disk
Page #1 Page #2
2 -WAY E X T E R N A L M E R G E S O R T
Pass #0
→ Read every B pages of the table into memory
→ Sort pages into runs and write them back to disk.
Memory
Disk
Page #1 Page #2 Sorted Run
2 -WAY E X T E R N A L M E R G E S O R T
Pass #0
→ Read every B pages of the table into memory
→ Sort pages into runs and write them back to disk.
Memory Memory
Disk
Page #1 Page #2 Sorted Run
2 -WAY E X T E R N A L M E R G E S O R T
Pass #0
→ Read every B pages of the table into memory
→ Sort pages into runs and write them back to disk.
Memory Memory
Disk
Page #1 Page #2 Sorted Run
2 -WAY E X T E R N A L M E R G E S O R T
Pass #0
→ Read every B pages of the table into memory
→ Sort pages into runs and write them back to disk.
Pass #1,2,3,…
→ Recursively merges pairs of runs into runs twice as long.
→ Uses three buffer pages (2 for input pages, 1 for output).
Disk
Page #1 Page #2 Sorted Run
2 -WAY E X T E R N A L M E R G E S O R T
Pass #0
→ Read every B pages of the table into memory
→ Sort pages into runs and write them back to disk.
Pass #1,2,3,…
→ Recursively merges pairs of runs into runs twice as long.
→ Uses three buffer pages (2 for input pages, 1 for output).
Disk
Page #1 Page #2 Sorted Run
2 -WAY E X T E R N A L M E R G E S O R T
Pass #0
→ Read every B pages of the table into memory
→ Sort pages into runs and write them back to disk.
Pass #1,2,3,…
→ Recursively merges pairs of runs into runs twice as long.
→ Uses three buffer pages (2 for input pages, 1 for output).
2 -WAY E X T E R N A L M E R G E S O R T
EOF
3,4 6,2 9,4 8,7 5,6 3,1 2 ∅
= 1 + ⌈ log2 N ⌉ PASS
2,3 1,2
4-PAGE
#2 RUNS
Total I/O cost 4,4
6,7
3,5
6
= 2N ∙ (# of passes) 8,9 ∅
PASS 1,2
8-PAGE
#3 2,3
RUNS
3,4
4,5
6,6
7,8
9
∅ CMU 15-445/645 (Fall 2019)
13
2 -WAY E X T E R N A L M E R G E S O R T
D O U B L E B U F F E R I N G O P T I M I Z AT I O N
Memory
Disk
Page #1 Page #2
D O U B L E B U F F E R I N G O P T I M I Z AT I O N
Memory
Disk
Page #1 Page #2
Pass #0
→ Use B buffer pages.
→ Produce ⌈N / B⌉ sorted runs of size B
Pass #1,2,3,…
→ Merge B-1 runs (i.e., K-way merge).
Pass #0
→ Use B buffer pages.
→ Produce ⌈N / B⌉ sorted runs of size B
Pass #1,2,3,…
→ Merge B-1 runs (i.e., K-way merge).
EXAMPLE
A G G R E G AT I O N S
S O R T I N G A G G R E G AT I O N
enrolled(sid,cid,grade)
SELECT DISTINCT cid sid cid grade
FROM enrolled 53666 15-445 C
WHERE grade IN ('B','C') 53688 15-721 A
53688 15-826 B
ORDER BY cid
53666 15-721 C
53655 15-445 C
S O R T I N G A G G R E G AT I O N
enrolled(sid,cid,grade)
SELECT DISTINCT cid sid cid grade
FROM enrolled 53666 15-445 C
WHERE grade IN ('B','C') 53688 15-721 A
53688 15-826 B
ORDER BY cid
53666 15-721 C
53655 15-445 C
S O R T I N G A G G R E G AT I O N
enrolled(sid,cid,grade)
SELECT DISTINCT cid sid cid grade
FROM enrolled 53666 15-445 C
WHERE grade IN ('B','C') 53688 15-721 A
53688 15-826 B
ORDER BY cid
53666 15-721 C
53655 15-445 C
A LT E R N AT I V E S T O S O R T I N G
H A S H I N G A G G R E G AT E
E X T E R N A L H A S H I N G A G G R E G AT E
Phase #1 – Partition
→ Divide tuples into buckets based on hash key.
→ Write them out to disk when they get full.
Phase #2 – ReHash
→ Build in-memory hash table for each partition and
compute the aggregation.
PHASE #1 PA R T I T I O N
PHASE #1 PA R T I T I O N
enrolled(sid,cid,grade)
SELECT DISTINCT cid sid cid grade
FROM enrolled 53666 15-445 C
WHERE grade IN ('B','C') 53688 15-721 A
53688 15-826 B
53666 15-721 C
53655 15-445 C
B-1 partitions
sid cid grade cid 15-445 15-445
15-445 15-445
53666 15-445 C 15-445
15-445
PHASE #2 REHASH
PHASE #2 REHASH
enrolled(sid,cid,grade)
SELECT DISTINCT cid sid cid grade
FROM enrolled 53666 15-445 C
WHERE grade IN ('B','C') 53688 15-721 A
53688 15-826 B
Phase #1 Buckets 53666 15-721 C
53655 15-445 C
15-445 15-445
15-445 15-445
15-445
15-826
15-826
PHASE #2 REHASH
enrolled(sid,cid,grade)
SELECT DISTINCT cid sid cid grade
FROM enrolled 53666 15-445 C
WHERE grade IN ('B','C') 53688 15-721 A
53688 15-826 B
Phase #1 Buckets 53666 15-721 C
53655 15-445 C
15-445 15-445
15-445 15-445
15-445
15-826
15-826
PHASE #2 REHASH
enrolled(sid,cid,grade)
SELECT DISTINCT cid sid cid grade
FROM enrolled 53666 15-445 C
WHERE grade IN ('B','C') 53688 15-721 A
53688 15-826 B
Phase #1 Buckets 53666 15-721 C
Hash Table 53655 15-445 C
15-445 15-445
15-445 15-445
15-445
h2 cid
15-445
15-826
15-826
PHASE #2 REHASH
enrolled(sid,cid,grade)
SELECT DISTINCT cid sid cid grade
FROM enrolled 53666 15-445 C
WHERE grade IN ('B','C') 53688 15-721 A
53688 15-826 B
Phase #1 Buckets 53666 15-721 C
Hash Table 53655 15-445 C
15-445 15-445
15-445 15-445
15-445
h2 cid
15-445
15-826 15-826
15-826 h2
⋮
PHASE #2 REHASH
enrolled(sid,cid,grade)
SELECT DISTINCT cid sid cid grade
FROM enrolled 53666 15-445 C
WHERE grade IN ('B','C') 53688 15-721 A
53688 15-826 B
Phase #1 Buckets 53666 15-721 C
Hash Table 53655 15-445 C
15-445 15-445
15-445 15-445
15-445
h2 cid
15-445
15-826
15-826 h2 15-826 Final Result
⋮ cid
15-445
15-826
PHASE #2 REHASH
enrolled(sid,cid,grade)
SELECT DISTINCT cid sid cid grade
FROM enrolled 53666 15-445 C
WHERE grade IN ('B','C') 53688 15-721 A
53688 15-826 B
Phase #1 Buckets 53666 15-721 C
Hash Table 53655 15-445 C
15-445 15-445
15-445 15-445
15-445
h2 cid
15-445
15-826
15-826 h2 15-826 Final Result
⋮ cid
15-445
15-721 15-826
PHASE #2 REHASH
enrolled(sid,cid,grade)
SELECT DISTINCT cid sid cid grade
FROM enrolled 53666 15-445 C
WHERE grade IN ('B','C') 53688 15-721 A
53688 15-826 B
Phase #1 Buckets 53666 15-721 C
53655 15-445 C
15-445 15-445
15-445 15-445
15-445
h2
15-826
15-826 h2 Final Result
⋮ Hash Table cid
cid 15-445
15-721
h2 15-721
15-826
15-721
H A S H I N G S U M M A R I Z AT I O N
H A S H I N G S U M M A R I Z AT I O N
15-445
15-445 h2 Hash Table
15-826 key value
Phase #1 h2 15-445 (2, 7.32)
Buckets
⋮ 15-826 (1, 3.33)
15-721 15-721 (1, 2.89)
h2
H A S H I N G S U M M A R I Z AT I O N
Running Totals
SELECT cid, AVG(s.gpa) AVG(col) → (COUNT,SUM)
FROM student AS s, enrolled AS e MIN(col) → (MIN)
WHERE s.sid = e.sid MAX(col) → (MAX)
GROUP BY cid SUM(col) → (SUM)
COUNT(col) → (COUNT)
15-445
15-445 h2 Hash Table
15-826 key value
Phase #1 h2 15-445 (2, 7.32)
Buckets
⋮ 15-826 (1, 3.33)
15-721 15-721 (1, 2.89)
h2
H A S H I N G S U M M A R I Z AT I O N
Running Totals
SELECT cid, AVG(s.gpa) AVG(col) → (COUNT,SUM)
FROM student AS s, enrolled AS e MIN(col) → (MIN)
WHERE s.sid = e.sid MAX(col) → (MAX)
GROUP BY cid SUM(col) → (SUM)
COUNT(col) → (COUNT)
15-445
15-445 h2 Hash Table Final Result
15-826 key value cid AVG(gpa)
Phase #1 h2 15-445 (2, 7.32) 15-445 3.66
Buckets
⋮ 15-826 (1, 3.33) 15-826 3.33
15-721 15-721 (1, 2.89) 15-721 2.89
h2
C O S T A N A LY S I S
Answer: B ∙ (B-1)
→ A table of N pages needs about sqrt(N) buffers
→ Assumes hash distributes records evenly.
Use a "fudge factor" f>1 for that: we need
B ∙ sqrt(f ∙ N)
CONCLUSION
PROJECT #2
https://15445.courses.cs.cmu.edu/fall2019/project2/
PROJECT #2 TA S K S
Page Layouts
Hash Table Implementation
Table Resizing
Concurrency Control Protocol
DEVELOPMENT HINTS
THINGS TO NOTE
P L A G I A R I S M WA R N I N G
NEXT CLASS