Lec3 1
Lec3 1
15 May 2024
Course Setup
• OH slots:
• Tue 2pm-3pm, Zhengjie@TASC 1 9407
• Mon 11am-12pm, Xinyi@ASB9812
• Wed 1pm-2pm, Obumneme@ASB9812
• Thu 3pm-4pm, Zahra@ASB9812
3
MR Example - Map
1.com 2.com 3.org 4.edu key: document id
ABC BBC CBC AAC value: document contents
4
MR Example - Shuffle & Reduce
A, 1 C, 3 B, 2 A, 4
B, 1 B, 3 C, 2 C, 4
C, 1
5
Trends
• 1 Hybrid Transactional and Analytical Processing
• 2. Cloud-Native Databases
• 4. Specialized Systems
• E.g. Time-series Database, GPU Acceleration, Vector Database
6
Outline
• An overview of data models
• Relational Algebra
• Core operators & Derived operators
• Extension
• Relational calculus
7
Recap: Data Models
• Query: Who have accounts with 0 balance managed by a branch in
Springfield? Branch
Account Id Location
234 Brockway
Id Name Balance Branch
1000 Ned Flanders 0 234
Id Location
1001 Homer Simpson 0 653
653 Springfield
1002 Montgomery Burns 1000000000 781
Id Location
781 Shelbyville
9
Structure of the data
• Schema (e.g., table names, attribute names)
• Describe the conceptual structure of the data
10
Operations on the data
• Query language (e.g., RA, SQL)
• Describe what operations that can be performed on data
11
Constraints on the data.
• Constraints (e.g., balance >= 0, account id is unique)
• describe limitations on what the data can be.
12
Commonly Used Data Models
• Relational Data Model
13
The Relational Model in Brief
Id Name Balance Branch Id Location
1000 Ned Flanders 0 234 234 Brockway
1001 Homer Simpson 0 653 653 Springfield
1002 Montgomery Burns 1000000000 781 781 Shelbyville
16
Edgar F. Codd (1923-2003)
17
http://en.wikipedia.org/wiki/File:Edgar_F_Codd.jpg
Relational data model
• A database is a collection of relations (or tables)
• Each relation has a set of attributes (or columns)
• Each attribute has a name and a domain (or type)
• Set-valued attributes are not allowed
• Each relation contains a set of tuples (or rows)
• Each tuple has a value for each attribute of the relation
• Duplicate tuples are not allowed
• Two tuples are duplicates if they agree on all attributes
• Simplicity is a virtue!
18
Example Group
User gid name
uid name age pop abc Book Club
142 Bart 10 0.9 gov Student Government
123 Milhouse 10 0.2 dps Dead Putting Society
857 Lisa 8 0.7 … …
456 Ralph 8 0.3
… … … …
Member uid gid
142 dps
123 gov
Ordering of rows doesn’t matter
857 abc
(even though output is
always in some order) 857 gov
456 abc
456 gov
… …
19
Schema vs. instance
• Schema (metadata)
• Specifies the logical structure of data
• Is defined at setup time
• Rarely changes
• Instance
• Represents the data content
• Changes rapidly, but always conforms to the schema
• Compare to types vs. collections of objects of these types in a
programming language
20
Example
• Schema
• User (uid int, name string, age int, pop float)
• Group (gid string, name string)
• Member (uid int, gid string)
• Instance
• User: {〈142, Bart, 10, 0.9〉, 〈857, Milhouse, 10, 0.2〉, …}
• Group: {〈abc, Book Club〉, 〈gov, Student Government〉, …}
• Member: {〈142, dps〉, 〈123, gov〉, …}
21
Relational algebra
A language for querying relational data based on “operators”
RelOp
RelOp
• Core operators:
• Selection, projection, cross product, union, difference, and renaming
• Additional, derived operators:
• Join, natural join, intersection, etc.
• Compose operators to make complex queries
22
Selection
• Input: a table 𝑅
• Notation: 𝜎! 𝑅
• 𝑝 is called a selection condition (or predicate)
• Purpose: filter rows according to some criteria
• Output: same columns as 𝑅, but only rows or 𝑅 that satisfy 𝑝
23
Selection example
• Users with popularity higher than 0.5
𝜎!"!#$.& 𝑈𝑠𝑒𝑟
24
More on selection
• Selection condition can include any column of 𝑅, constants,
comparison (=, ≤, etc.) and Boolean connectives (∧: and, ∨: or, ¬:
not)
• Example: users with popularity at least 0.9 and age under 10 or above 12
𝜎"#"$%.' ∧ *+,-.% ∨ *+,0.1 𝑈𝑠𝑒𝑟
• You must be able to evaluate the condition over each single row of
the input table!
• Example: the most popular user
𝜎"#" $ ,2,34 "#" 56 78,3 𝑈𝑠𝑒𝑟
RO N G!
W
25
Projection
• Input: a table 𝑅
• Notation: 𝜋' 𝑅
• 𝐿 is a list of columns in 𝑅
• Purpose: output chosen columns
• Output: same rows, but only the columns in 𝐿
26
Projection example
• IDs and names of all users
𝜋()*,,-./ 𝑈𝑠𝑒𝑟
27
More on projection
• Duplicate output rows are removed (by definition)
• Example: user ages
𝜋*+, 𝑈𝑠𝑒𝑟
uid name age pop age
142 Bart 10 0.9 10
123 Milhouse 10 0.2 10
𝜋,/.
857 Lisa 8 0.7 8
456 Ralph 8 0.3 8
… … … … …
28
Composing Select and Project
• Names of users with popularity higher than 0.5
𝜋,-./ (𝜎!"!#$.& 𝑈𝑠𝑒𝑟)
30
Cross product example
𝑈𝑠𝑒𝑟×𝑀𝑒𝑚𝑏𝑒𝑟 uid gid
uid name age pop
123 gov
123 Milhouse 10 0.2
857 abc
857 Lisa 8 0.7
857 gov
… … … … × … …
• So cross product is commutative, i.e., for any 𝑅 and 𝑆, 𝑅×𝑆 = 𝑆×𝑅 (up
to the ordering of columns)
32
Derived operator: join
(A.k.a. “theta-join”)
• Input: two tables 𝑅 and 𝑆
• Notation: 𝑅 ⋈! 𝑆
• 𝑝 is called a join condition (or predicate)
• Purpose: relate rows from two tables according to some
criteria
• Output: for each row 𝑟 in 𝑅 and each row 𝑠 in 𝑆, output a
row 𝑟𝑠 if 𝑟 and 𝑠 satisfy 𝑝
• Shorthand for 𝜎! 𝑅×𝑆
33
Join example
• Info about users, plus IDs of their groups
𝑈𝑠𝑒𝑟 ⋈!"#$.&'()*#+,#$.&'( 𝑀𝑒𝑚𝑏𝑒𝑟
uid gid
uid name age pop
123 gov
123 Milhouse 10 0.2
857 abc
857
…
Lisa
…
8
…
0.7
…
⋈
×
!"#$.&'()
*#+,#$.&'( 857 gov
… …
35
Natural join example
𝑈𝑠𝑒𝑟 ⋈ 𝑀𝑒𝑚𝑏𝑒𝑟 = 𝜋? 𝑈𝑠𝑒𝑟 ⋈? 𝑀𝑒𝑚𝑏𝑒𝑟
= 𝜋&'(,/0+#,01#,232,1'( 𝑈𝑠𝑒𝑟 ⋈ !"#$.&'() 𝑀𝑒𝑚𝑏𝑒𝑟
*#+,#$.&'(
uid gid
uid name age pop
123 gov
123 Milhouse 10 0.2
857 abc
857 Lisa 8 0.7 ⋈
⋈
!"#$.&'()
*#+,#$.&'( 857 gov
… … … …
… …
37
Difference
• Input: two tables 𝑅 and 𝑆
• Notation: 𝑅 − 𝑆
• 𝑅 and 𝑆 must have identical schema
• Output:
• Has the same schema as 𝑅 and 𝑆
• Contains all rows in 𝑅 that are not in 𝑆
38
Derived operator: intersection
• Input: two tables 𝑅 and 𝑆
• Notation: 𝑅 ∩ 𝑆
• 𝑅 and 𝑆 must have identical schema
• Output:
• Has the same schema as 𝑅 and 𝑆
• Contains all rows that are in both 𝑅 and 𝑆
• Shorthand for?
uid gid uid gid uid gid
123 gov ∩ 123 gov = 123 gov
857 abc 857 gov
39
Intersection
R ∩ S
• R∩S=R–?
= R – ?
40
Intersection
• R ∩ S = R – (R – S)
R ∩ S
• Also = S – (S – R)
= R –( R
? – S )
• How can you write it without –
𝑅⋈𝑆
41
Renaming
• Input: a table 𝑅
• Notation: 𝜌4 𝑅, 𝜌 5!,5",… 𝑅, or 𝜌4 5!,5",… 𝑅
• Purpose: “rename” a table and/or its columns
• Output: a table with the same rows as 𝑅, but called differently
Member M1
uid gid uid gid
123 gov 𝜌,- ./0-,2/0- 𝑀𝑒𝑚𝑏𝑒𝑟 = 123 gov
857 abc 857 gov
42
Renaming
• Used to
• Avoid confusion caused by identical column names
• Create identical column names for natural joins
• As with all other relational operators, it doesn’t modify the database
• Think of the renamed table as a copy of the original
Member M1
uid gid uid gid
123 gov 𝜌,- ./0-,2/0- 𝑀𝑒𝑚𝑏𝑒𝑟 = 123 gov
857 abc 857 gov
43
Renaming example
• IDs of users who belong to at least two groups
𝑀𝑒𝑚𝑏𝑒𝑟 ⋈? 𝑀𝑒𝑚𝑏𝑒𝑟
uid gid uid gid
123 gov 123 gov
857 abc ⋈? 857 abc
857 gov 857 gov
uid gid uid gid
123 gov 123 gov Same uid
123 gov 857 abc
123 gov 857 gov
857 abc 123 gov Different gids
857 abc 857 abc
857 abc 857 gov
857 gov 123 gov
857 gov 857 abc
857 gov 857 gov 44
Renaming example
• IDs of users who belong to at least two groups
𝑀𝑒𝑚𝑏𝑒𝑟 ⋈? 𝑀𝑒𝑚𝑏𝑒𝑟
𝜌 &'(!,1'(! 𝑀𝑒𝑚𝑏𝑒𝑟
𝜋&'(! ⋈&'(!)&'(" ∧ 1'(!91'("
𝜌 &'(",1'(" 𝑀𝑒𝑚𝑏𝑒𝑟
45
Expression tree notation
𝜋#$%,
⋈#$%, (#$%- ∧ '$%, +'$%-
𝑀𝑒𝑚𝑏𝑒𝑟 𝑀𝑒𝑚𝑏𝑒𝑟
46
Summary of core operators
• Selection: 𝜎2 𝑅
• Projection: 𝜋: 𝑅
• Cross product: 𝑅×𝑆
• Union: 𝑅 ∪ 𝑆
• Difference: 𝑅 − 𝑆
• Renaming: 𝜌4 5!,5",… 𝑅
• Does not really add “processing” power
47
Summary of derived operators
• Join: 𝑅 ⋈2 𝑆
• Natural join: 𝑅 ⋈ 𝑆
• Intersection: 𝑅 ∩ 𝑆
• Many more
• Semijoin, anti-semijoin, quotient, …
48
User (uid int, name string, age int, pop float)
An exercise
Group (gid string, name string)
Member (uid int, gid string)
Another exercise
Group (gid string, name string)
Member (uid int, gid string)
𝜋'$% 𝜋'$%
𝐺𝑟𝑜𝑢𝑝 ⋈
𝑀𝑒𝑚𝑏𝑒𝑟 𝜎,-./(""$1-"
𝑈𝑠𝑒𝑟 50
User (uid int, name string, age int, pop float)
A trickier exercise
Group (gid string, name string)
Member (uid int, gid string)
52
Monotone operators
RelOp What happens
Add more rows to the output?
to the input...
• If some old output rows may need to be removed
• Then the operator is non-monotone
• Otherwise the operator is monotone
• That is, old output rows always remain “correct” when more rows are
added to the input
uid gid uid gid uid gid
123 gov − 123 gov = 857 abc
857 abc 857 gov
857 abc
53
Monotone operators
RelOp What happens
Add more rows to the output?
to the input...
• If some old output rows may need to be removed
• Then the operator is non-monotone
• Otherwise the operator is monotone
• That is, old output rows always remain “correct” when more rows are
added to the input The old row is always
uid gid uid gid uid gid “correct” no matter
123 gov − 123 gov = 857 abc what is added to R
857 abc 857 gov 693 abc
857 gov
693 abc 54
Classification of relational operators
• Selection: 𝜎2 𝑅 Monotone
• Projection: 𝜋: 𝑅 Monotone
• Cross product: 𝑅×𝑆 Monotone
• Join: 𝑅 ⋈2 𝑆 Monotone
• Natural join: 𝑅 ⋈ 𝑆 Monotone
• Union: 𝑅 ∪ 𝑆 Monotone
• Difference: 𝑅 − 𝑆 Monotone w.r.t. 𝑅; non-monotone w.r.t 𝑆
• Intersection: 𝑅 ∩ 𝑆 Monotone?
55
Is intersection monotonic?
R ∩ S
•R ∩ S = R – (R – S)
= R –( R
? – S )
• Yes!
R1 ⊆ R2 ⇒ S ∩ R1 ⊆ S ∩ R2
Why is “−” needed for “highest”?
• Composition of monotone operators produces a monotone query
• Old output rows remain “correct” when more rows are added to the input
• Is the “highest” query monotone?
• No!
• Current highest pop is 0.9
• Add another row with pop 0.91
• Old answer is invalidated
• So it must use difference!
57
Why do we need core operator 𝑋?
• Difference
• The only non-monotone operator
• Projection
• The only operator that removes columns
• Cross product
• The only operator that adds columns
• Union
• The only operator that allows you to add rows?
• A more rigorous argument?
• Selection?
• Left as an exercise for the viewer
58
Extensions to relational algebra
• Duplicate handling (“bag algebra”)
• Grouping and aggregation
59
Why is RA a good query language?
• Simple
• A small set of core operators
• Semantics are easy to grasp
• Declarative?
• Yes, compared with older languages like CODASYL
• Though assembling operators into a query does feel somewhat
“procedural”
• Complete?
• With respect to what?
60
User (uid int, name string, age int, pop float)
Relational Calculus
Group (gid string, name string)
Member (uid int, gid string)
• 𝑢. 𝑢𝑖𝑑 𝑢 ∈ 𝑈𝑠𝑒𝑟 ∧
¬ ∃𝑢. ∈ 𝑈𝑠𝑒𝑟: 𝑢. 𝑝𝑜𝑝 < 𝑢. . 𝑝𝑜𝑝 }, or
• 𝑢. 𝑢𝑖𝑑 𝑢 ∈ 𝑈𝑠𝑒𝑟 ∧
∀𝑢. ∈ 𝑈𝑠𝑒𝑟: 𝑢. 𝑝𝑜𝑝 ≥ 𝑢. . 𝑝𝑜𝑝 }
• Relational algebra = “safe” relational calculus
• Every query expressible as a safe relational calculus query is also expressible as a
relational algebra query
• And vice versa
• Example of an “unsafe” relational calculus query
• 𝑢. 𝑛𝑎𝑚𝑒 ¬ 𝑢 ∈ 𝑈𝑠𝑒𝑟
• Cannot evaluate it just by looking at the database
61
Relational Calculus
• RA and RC form the basis for “real” query languages (e.g. SQL),
and for implementation
• Relational Algebra
• Queries are expressed by applying a sequence of operations to relations
• More operational/procedural, very useful for representing how query executes
• Relational Calculus
• Let’s users describe WHAT they want in first-order logic
• Rather than HOW to compute it
• Non-operational, declarative
62
Codd’s Theorem
• Established equivalence in expressivity between :
• Relational Calculus
• Relational Algebra
63
Limits of relational algebra
• Relational algebra has no recursion
• Example: given relation Friend(uid1, uid2), who can Bart reach in his social
network with any number of hops?
• Writing this query in r.a. is impossible!
• So RA is not as powerful as general-purpose languages (not Turing-
complete)
• But why not?
• Optimization becomes undecidable
• Simplicity is empowering
• Besides, you can always implement it at the application level (and
recursion is added to SQL nevertheless 😱)
65
What’s Next
• Next class:
• Database design in E/R model (recording)
• Sign up Piazza
• https://piazza.com/sfu.ca/summer2024/cmpt354d100
• Access code: qc76kdzsz9l
• Assignment 1 releasing soon
• Stay tuned for RA help tools
66