9/18/2014
Lecture No 04
SQL: Behind the scenes
By: Syed Aun Irtaza
What is data model?
Three components: 1. Structures
rows and columns?
1. Structures nodes and edges?
2. Constraints key-value pairs?
3. Operations
a sequence of bytes?
2. Constraints
all rows must have the same number of columns
all values in one column must have the same type
a child cannot have two parents
3. Operations
find the value of key x
find the rows where column lastname is Jordan
get the next N bytes
1
9/18/2014
A collection of information organized to
afford efficient retrieval When people use the word
database, fundamentally what they
are saying is that the data should
be self-describing and it should
have a schema. Thats really all the
word database means.
-- Jim Gray, The Fourth Paradigm
How is the data physically organized on disk?
What kinds of queries are efficiently
1. Sharing supported by this organization, and what
Support concurrent access by multiple readers and kinds are not?
writers How hard is it update the data, or add new
2. Data Model Enforcement data?
Make sure all applications see clean, organized data
What happens when I encounter new queries
3. Scale that I didnt anticipate? Do I reorganize the
Work with datasets too large to fit in memory
data? How hard is that?
4. Flexibility
Use the data in new, unanticipated ways
One View Everything is a table
Relational Database Management Systems Every row in a table has the same columns
were invented to let you use one set of data Relationships are implicit: no pointers
in multiple ways, including ways that are
unforeseen at the time the database is built
and the 1st applications are written.
(Curt Monash, analyst/blogger)
2
9/18/2014
God made the integers; all else is the work of Pre-Relational: if your data changed, your
application broke.
man. Early RDBMS were buggy and slow (and often
(Leopold Kronecker, 19th Century Mathematician) reviled), but required only 5% of the application
Codd made relations; all else is the work of code.
Activities of users at terminals and most
man. application programs should remain unaffected
(Raghu Ramakrishnan, DB text book author) when the internal representation of data is
changed and even when some aspects of the
external representation are changed. Codd-- 79
Key Ideas: Programs that manipulate tabular data
exhibit an algebraic structure allowing reasoning
and manipulation independently of physical data
representation
N = ((z*2)+((z*3)+0))/1 5 1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x 4 3
3. (*) distributes: (n*x+n*y) = n*(x+y) 2
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2:
N = (2+3)*z
two operations instead of five, no division
operator
Same idea works with the Relational Algebra!
3
9/18/2014
Sets: {a,b,c}, {a,d,e,f}, { }, . . .
Bags: {a, a, b, c}, {b, b, b, b, b}, . . .
Relational Algebra has two semantics:
Set semantics = standard Relational Algebra
Bag semantics = extended Relational Algebra
Rule of thumb:
Every paper will assume set semantics
Every implementation will assume bag semantics
R1 R2 R1 - R2
SELECT * FROM R1 SELECT * FROM R1
UNION All EXCEPT
SELECT * FROM R2 SELECT * FROM R2
Derived operator using minus Returns all tuples which satisfy a condition
R1 R2 = R1 (R1 R2) c(R)
Derived operator using minus Examples
sSalary > 40000 (Employee)
R1 R2 = R1 R2 sname = Smith (Employee)
The condition c can be =, <, , >, , <>
4
9/18/2014
Eliminates columns
A1,,An (R)
Example: project social-security number and
names:
P SSN, Name (Employee)
Answer(SSN, Name)
Each tuple in R1 with each tuple in R2
R1 R2
Traditionally rare in practice, but can come up
in analytics
Find all pairs of similar images/tweets/songs
Compute the cross product, then compute
a similarity function f(x1,x2) for every
possible pair
R1 A=B R2 = A=B (R1 R2)
SELECT * SELECT *
FROM R1 JOIN R2 FROM R1, R2
ON R1.A = R2.B WHERE R1.A = R2.B
Two ways to spell the same query
The optimizer doesnt care about the syntax you
use; its going to work on the algebraic
representation anyway.
Sometimes one syntax or the other is more
convenient.
5
9/18/2014
A join that involves a predicate Find all hospitals within 5 miles of a school
R1 R2 = (R1 R2)
name(Hospitals distance(location,location) < 5 Schools)
Here can be any condition
Note that equi-join is a special case of SELECT DISTINCT h.name
FROM Hospitals h, Schools s
theta join where is an equality condition
WHERE distance(h.location, s.location) < 5
Find all user clicks made within 5 seconds of Outer join
a page load Include tuples with no matches in the output
Use NULL values for missing attributes
Clicks abs(click_time load_time) < 5 PageLoads Variants
Left outer join
Right outer join
Full outer join
SELECT *
FROM Clicks c, PageLoads p
WHERE abs(c.click_time - p.load_time) < 5
You might hear band join or range join
SELECT binid,
round(avg(cast(Fluo as float)),3) as Fluo,
round(avg(cast(Oxygen as float)),3) as Oxygen,
round(avg(cast(Nitrate_uM as float)),3) as Nitrate_uM,
round(avg(cast(longitude as float)),3) as longitude,
round(avg(cast(latitude as float)),3) as latitude
FROM (
SELECT *, cast(floor(ts) +
floor((ts - floor(ts))*24*60/binsize) *
binsize / (24*60) as datetime) as binid
FROM (
SELECT *, cast(timestamp as float) as ts, 5.0 as binsize
FROM Tokyo_4_merged_data_time
)x
) bins
GROUP BY binid
ORDER BY binid asc
6
9/18/2014
7
9/18/2014
PostgreSQL (and Greenplum)
SQL, PL/pgSQL, Python, C/C++, R,
Microsoft SQL Server
SQL, T-SQL, C# or any CLR language
Oracle
SQL, PL-SQL, Java, C/C++, Python, others
SQLite
NONE!! (sorry)
8
9/18/2014
A view is just a query with a name
We can use the view just like a real table
9
9/18/2014
A view is a relation defined by a query
10