Databaser
Contents
1 Introduction
1.1 Data vs. systems
1.2 A short history of databases
1.3 SQL
1.4 DBMS
1.5 The book contents
1.6 The big picture
4 Relational data modelling
4.1 Relations and tables
4.2 Functional dependencies
4.3 Keys and superkeys
4.4 Modelling SQL key and uniqueness constraints
4.5 Referential constraints
4.6 Operations on relations
4.7 Multiple tables and joins
4.8 Transitive closure
4.9 General constraints on data
4.10 Multiple values
4.11 Null values
7.4.3 CHECK constraints
7.4.4 Triggers: a first example
7.4.5 The syntax of triggers
7.4.6 A more complex trigger example
7.5 Authorization and grant diagrams
Note on the current edition
These notes were expanded from Aarne’s 2017 version in 2020–21 by Jyrki, who
has also reorganized the chapters. After that, some material that is not relevant
for the course at Chalmers and GU was left out. This is in particular the case
with exercises, which are available through the course web page.
We are grateful to Ana Bove for comments on the earlier version.
Tampere and Gothenburg, January 21, 2021
Jyrki Nummenmaa and Aarne Ranta
Preface
This book is based on the idea of specific learning outcomes, which are at the
core of utilizing database systems when building applications. As such, the book
is useful to a practitioner outside the academic world, while at the same time
providing the basis for an academic first course on databases. The specific
learning outcomes targeted in the book are the following skills:
• Manipulation of a SQL database, including data definition to create the
tables, as well as statements to update and query them.
• Conceptual design of the data and database content for an application.
• Systematic formal definition of the properties of the database data.
• Optimization of the database design using dependencies.
• Understanding the principles of what happens when a query is processed
in a database system, thereby helping to write better queries and to opti-
mize certain aspects of how the data is stored.
• Implementing a database application.
• Knowledge of new data models used in database systems, such as docu-
ment databases and so-called Big Data.
This book originates from the courses given by the authors: more specifi-
cally, the Databases course (TDA357/DIT620) at the IT Faculty of the University
of Gothenburg and Chalmers, and the courses Introduction to Databases and
Database Programming given at the University of Tampere.
The text is intended to support the reader’s intuition while at the same time
giving sufficiently precise information so that the reader can apply the methods
given in the book for different applications. The book proceeds from experimen-
tation and intuition to more formal and precise treatment of the topics, thereby
hopefully making learning easier. In particular, Chapters 2, 3 and 4 are suitable
as study material for people who have little or no background in computing.
A particular virtue of the book is its size. The book is not meant to be a
handbook or manual, as such information is better found on the Internet, but
a concise, intuitive, and precise introduction to databases.
The real value from the book only comes with practice. To build a proper
understanding, you should build and use your own database on a computer.
You should also work on some theoretical problems by pencil and paper.
By far the best way to study the book is to participate in a course with
a teacher - this gives regularity to the study, as it is easy to try to consume
too much too quickly when self-studying. For self-study purposes, each chapter
includes an estimate of how fast a student with no prior knowledge should
progress. The human mind needs time to arrange new material, and therefore
a reasonably steady pace is preferable even if the student has time available to
study the book all day long.
We will use a running example that deals with geographical data: countries
and their capitals, neighbours, currencies, and so on. This is a bit different from
many other slides, books, and articles. In them, you can find examples such as
course descriptions, employer records, and movie databases. Such examples
may feel more difficult since they are not completely common knowledge. Using
familiar common-sense examples means that when learning new mathematical
and programming concepts, you do not need to learn new world knowledge at
the same time.
We find it easier to study new technical material if the contents are familiar.
For instance, it is easier to test a query that is supposed to assign ”Paris” to
”the capital of France” than a query that is supposed to assign ”60,000” to
”the salary of John Johnson”: most people are of course familiar with work and
salaries, but a particular sum may seem strange. There
is simply one thing less to keep in mind. It also eliminates the need to show
example tables all the time, because we can simply refer to ”the table containing
all European countries and their capitals”, which most readers will have clear
enough in their minds. No geographical knowledge is required, and it is not
important at all if the reader can position the countries and cities on the map.
Of course, we will have the occasion to show other kinds of databases as well.
The country database does not have all the characteristics that a database might
have, for instance, very rapid changes in the data. The exercises and suggested
programming assignments will include such material.
In addition to the authors’ own database exploration and study, the book has
drawn inspiration from various sources, even if it seems to be quite an original
compilation of content. A lot of inspiration obviously comes from other database
textbooks, such as Garcia-Molina, Ullman, and Widom, Database Systems: The
Complete Book, and from earlier course material at Chalmers by Niklas Broberg
and Graham Kemp. We are grateful to (list needs to be expanded) Grégoire
Détrez for general advice and comments on the contents, and to Simon Smith,
Adam Ingmansson, and Viktor Blomqvist for comments during the course. More
comments, corrections, and suggestions are therefore most welcome - your name
will be added here if you don’t object!
Gothenburg, February 2019
Jyrki Nummenmaa and Aarne Ranta
1 Introduction
Spreadsheets are a common way to organize data as tables, for calculations,
comparisons, etc. Figure 1 shows a spreadsheet table that has information about
various countries. A quick look already reveals that most of the rows contain
information about a single country: the country name, capital, area, population,
continent where the country resides, currency of the country, and population
density, obviously computed from population and area. The first row contains
a title for the table, while the second row contains names for the columns,
explaining their contents. Then follow the rows containing information for each
country, and finally a row with some summary information.
Are spreadsheets databases? If not, how are they different? Why aren’t
spreadsheets enough? Who needs databases on top of them? These are
common questions from spreadsheet users when they hear about databases for the
first time. We give a short explanation here, but the more in-depth information
on databases in this book should answer these questions properly.
In short, the existence of databases is justified by the need for correctness
and reliability, combined with a high volume of data and users. Correctness
comes in different forms. Consider, for instance, the data types in each col-
umn. In column F, the currency information comes in different and inconsistent
formats, some as codes and others as full currency names.
way to treat the missing values.
In spreadsheets, people do computations that refer to other cells. When you
copy-paste parts of your sheet, you sometimes get the desired outcome and
sometimes not. The spreadsheet refers to cells using row numbers and column
ids, like the population of China being at D10, instead of referring to it somehow
as the value of the population column of the row where China is the country name.
The more data there is, the more complicated it is to work with row numbers
and column ids, and there can be millions or billions of rows and thousands of
columns of data. When the data is accessed by computer programs, their mainte-
nance gets complicated with row number and column id references.
Databases have to support large numbers of concurrent users, without mess-
ing up the users’ view of the data or the content of the data. While having
the spreadsheet in a cloud service does allow you to share the spreadsheet for
concurrent use, the concurrency mechanisms are unlikely to handle successfully
millions of changes per second in the spreadsheet.
The databases are expected to store the data in a reliable and correct way.
Banks, for instance, have to store the data about bank accounts so that no
penny is lost and, in addition, they need records of the banking events for e.g.
book-keeping purposes. Companies need to store and protect the data about
their products, customers, selling and purchasing events, etc.
machine specific. It was difficult to exchange data and maintain it when for in-
stance computers were replaced. Finding information happened through compli-
cated programs, and combining different pieces of information from the database
was cumbersome. As a response to this situation, relational databases were
invented around 1970. They turned out to be both structured and generic
enough for most purposes. They have a mathematical theory that is both pre-
cise and simple. Thus they are easy enough to understand by users and easy
enough to implement in different applications. As a result, relational databases
are often the most stable and reliable parts of information systems. They can
also be the most precious ones, since they contain the results from decades of
work by thousands of people.
Despite their success, relational databases have recently been challenged by
other approaches. Some of the challengers want to support more complex data
than relations. For instance, XML (Extensible Markup Language) supports hi-
erarchical data, and hierarchical databases were popular in the 1960’s but were
deemed too complicated by the proponents of relational databases. Recently,
Big Data applications have called for simpler models. In many applications,
such as social media, accuracy and reliability are not so important as for in-
stance in bank applications. On the other hand, performance is much more
important, and then the traditional relational models can be too rich. Non-
relational approaches are known as NoSQL, also interpreted as Not Only
SQL, by reference to the SQL language introduced in the next section.
1.3 SQL
Relational databases are also known as SQL databases. SQL is a computer
language designed in the early 1970’s, originally called Structured Query Lan-
guage. The full name is seldom used: one says rather ”sequel” or ”es queue el”.
SQL is a special purpose language. Its purpose is to process of relational
databases. This includes several operations:
• queries, asking questions, e.g. ”what are the neighbouring countries of
France”
• updates, changing entries, e.g. ”change the currency of Estonia from
Crown to Euro”
• inserts, adding entries, e.g. South Sudan with all the data attached to it
• removals, taking away entries, e.g. German Democratic Republic when
it ceased to exist
• definitions, creating space for new kinds of data, e.g. for the main domain
names in URL’s
These notes will cover all these operations and also some others. SQL is
designed to make it easy to perform them - easier than a general purpose
programming language, such as Java or C. The idea is that SQL should be
easier to learn as well, so that it is accessible for instance to bank employees
without computer science training. And finally, SQL is meant to be safer in
certain aspects, such as termination of the programs. However, as we will see,
most end users of databases today don’t even need SQL. They use some end
user programs, for instance an ATM interface with menus, which are sim-
pler and less powerful than full SQL. These end user programs are written by
programmers as combinations of SQL and general purpose languages.
Now, since a general purpose language could perform all operations that
SQL can, isn’t SQL superfluous? No, since SQL is a useful intermediate layer
between user interaction and the data. One reason is the high level of abstraction
in SQL.
1.4 DBMS
The implementations of SQL are called SQL database management systems
(DBMS). Here are some popular systems, in an alphabetical order:
• IBM DB2, proprietary
• Microsoft SQL Server, proprietary
• MySQL, open source, supported by Oracle
• MariaDB, open source, a community-developed fork of MySQL
• Oracle, proprietary
• PostgreSQL, open source
• SQLite, open source
• Teradata, proprietary, designed for large amounts of data and database
analytics.
The DBMS implementations are highly optimized and reliable. A general
purpose programmer would have a hard time matching their performance and
the wide selection of services available, and, most importantly, the reliability
that the database systems offer. Losing or destroying data would be a serious
risk.
Each DBMS has a slightly different dialect of SQL. There is also an official
standard, but no existing system implements all of it, or only it. In these notes,
we will most of the time try to keep to the parts of SQL that belong to the
standard and are implemented by at least most of the systems.
However, since we also have to do some practical work, we have to choose
a DBMS to work in. The choice assumed in this book is PostgreSQL. Its main
advantages are:
• It follows the standard more closely than many other systems.
• It is free and open source, hence easier to get hold of.
retrieved from SQL databases. You will learn the main language constructs of
SQL. They enable you to
• Define SQL tables
• Insert and modify the data in the tables
• Query the tables in various ways: selecting data based on the values,
selecting desired columns of the tables, combining data from different tables,
using set-theoretical operations on tables, and grouping and aggregating the data
• Utilize some widely used low-level manipulations of strings and other SQL
datatypes.
With this chapter, you will learn how to use dependencies to design databases
avoiding these problems.
We will also cover some pitfalls in embedding. The basic problem in the em-
bedding is that the datatypes in the database and in the programming language
do not match. Another is that careless programming may create security holes
and open up the system to things like SQL injection, where an end user can
include harmful SQL code in the queries and data manipulations. In one famous
example, the name of a student includes a piece of SQL code that deletes all
data from a student database.
2 Tables and SQL
This chapter is about tables as a basic data structure in databases. We will study
them using the database language SQL, starting from the basics and advancing
little by little. The concepts are introduced using examples, and no prior knowl-
edge of databases is required. We will first explain the main language constructs
of SQL by using just one table. We will learn to define database tables using
SQL. We will insert data to the tables using SQL, as well as delete and update
existing data. We will write queries including selections, projections, renamings,
unions, intersections, groupings, and aggregations. We will also take a look at
the different datatypes of SQL and operations that manipulate their values. Af-
ter introducing the main operations of SQL, we generalize the treatment to many
tables. This generalization involves just a few new SQL constructs: foreign keys,
cartesian products, and joins. But it is an important step conceptually, since it
raises the question of how a database should be divided into separate tables. This
question will be the topic of the subsequent chapters on database design.
The first row with text in bold is the title row, describing the role of data
stored in each column of the subsequent rows. Each of those subsequent rows
contains data about one country.
The table is the basic structure of databases in the sense of SQL. One in-
terpretation is that each row describes a complex object: here, a country with
all its relevant attributes. Another interpretation is that each row of the table
states a fact. For instance, the first row after the title row states the fact that
there exists a country named Tanzania, whose capital is Dodoma,
whose area is 945087 and population 41892895, which lies on the
African continent, and uses a currency whose code is TZS.
Each row states a fact of the same format. Those familiar with logic should see a
similarity between rows and logical propositions. In this interpretation, the table
is a set of facts, which means that the order of the rows is seen as unimportant,
and re-ordering the rows does not change the information content. However, we
will see that SQL also provides a method to order the output produced from a
table in a desired way, for instance alphabetically or by the size of the country.
The values stored in each column have the same type, such as string (name,
capital), or number (area, population). Some are strings with a fixed length, e.g.
in our example table currency has length 3, and continent length 2. Defining
the types for each column is an important part in guaranteeing that the table
makes sense: for instance, that the population is always a number and as such
can be compared with other populations. Thus the choice of type must take
into account the values that are going to be inserted into the
table in the future. If for instance some currencies had 4-character codes, that
should be taken into account when choosing the type for currency. When we use
a table to store values in a database, we will define such properties explicitly,
and then the database system will ensure that the values stored fulfill those
properties.
It is not necessary that all values exist on all rows: in the above table Cuba
has no continent, since it is considered here an island outside continents (tech-
nically, it is usually considered to be in NA). Some values or their combinations
can be unique for each row, and they can thereby be used to identify each row. In
the above table, name seems to be such a column on its own.
You can start the PostgreSQL interactive terminal with the shell command
psql Countries
if Countries is the name of the database that you are using. If you are ad-
ministrating your own PostgreSQL installation, you may use the Unix shell
command
createdb Countries
to create such a database; this you will only have to do once. After this, you
can start PostgreSQL with the command psql Countries.
2.3 Creating an SQL table
Below is a statement in the SQL language to create a table for data on countries.
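A sketch of such a statement, with one possible choice of types and constraints
(other choices are of course possible):

CREATE TABLE Countries (
  name       TEXT PRIMARY KEY,
  capital    TEXT,
  area       FLOAT,
  population INT,
  continent  CHAR(2),
  currency   CHAR(3)
) ;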
A peculiar feature of the SQL language is that it is case-insensitive: Countries,
countries, and COUNTRIES are all interpreted as the same name. But a good
practice that is often followed is to use
• capital initials for tables: Countries
• small initials for attributes: name
• all capitals for SQL keywords: CREATE
However, string literals (in single quotes, used for e.g. attribute values) are
case-sensitive. Hence ’sweden’ is treated as distinct from ’Sweden’.
If you want to get rid of the table you created, you can remove it with the
following command, and then create it with different properties.
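For the Countries table above, the command is

DROP TABLE Countries ;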
This must obviously be used with caution: if there is data in the table, it will
all get lost!
A table once created can be changed later with an ALTER TABLE statement.
The most important ways to alter a table are:
• ADD COLUMN with the usual syntax (attribute, type, inline constraints).
The new column may by default contain NULL values, unless specified
otherwise.
• DROP COLUMN with the attribute name
• ADD CONSTRAINT with the usual constraint syntax. Rejected if the already
existing data in the table violates the constraint.
• DROP CONSTRAINT with a named constraint
Notice that DROP COLUMN must also be used with caution since the data
already in the column gets lost.
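For instance, sketches of such statements (the column gdp and the constraint
name are hypothetical):

ALTER TABLE Countries ADD COLUMN gdp FLOAT ;
ALTER TABLE Countries DROP COLUMN gdp ;
ALTER TABLE Countries ADD CONSTRAINT positive_population CHECK (population >= 0) ;
ALTER TABLE Countries DROP CONSTRAINT positive_population ;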
• in the beginning of a line, + * ? operate on the whole line; elsewhere,
they operate on the word just before
• ## start comments, which explain unexpected notation or behaviour
• text in double quotes means literal code, e.g. "*" means the operator *
• other symbols, e.g. parentheses, also mean literal code (quotes are used
only in some cases, to separate code from grammar notation)
• parentheses can be added to disambiguate the scopes of operators
Also recall that SQL syntax is case-insensitive:
• keywords are usually written with capitals, but can be written by any
combinations of capital and small letters
• the same concerns identifiers, i.e. names of tables, attributes, constraints
• however, string literals in single quotes are case sensitive
As the first example, the grammar of CREATE TABLE statements is specified as
follows:
statement ::=
CREATE TABLE tablename (
* attribute type inlineconstraint*
* [CONSTRAINT name]? constraint
) ;
type ::=
CHAR ( integer )
| VARCHAR ( integer )
| TEXT | INT | FLOAT | BOOLEAN
inlineconstraint ::=
PRIMARY KEY | UNIQUE | NOT NULL
| DEFAULT value
| REFERENCES tablename ( attribute )
| CHECK ( condition )
Since such constraints refer to several attributes, they must be given separately from
the introduction of the attributes themselves, in the constraint part of the
CREATE TABLE statement. Their grammar is similar to inline constraints, except
that they have to mention the list of attributes affected:
constraint ::=
PRIMARY KEY ( attribute+ )
| UNIQUE ( attribute+ )
| FOREIGN KEY ( attribute+ ) REFERENCES tablename ( attribute+ )
| CHECK condition
Exercise 2.1 Make your own SQL CREATE TABLE statement for the Country
data. Use data types and constraints that are different from the book.
Exercise 2.2 Consider the Country table above and the following SQL CRE-
ATE TABLE statement. List the things that make it unreasonable for storing
data on countries.
Exercise 2.3 Suppose you want to sell items, which have at least a name,
a description and a value. Write a CREATE TABLE statement for a table
containing data of those items.
This can be done from the PostgreSQL prompt. But a more convenient way
for a lot of data is to prepare a file that contains the statements and then read
the statements into the program:
\i countries-data.sql
Notice that a complete filepath is required, starting with the root (/ in Linux
and MacOS, \ in Windows). The data in the file must of course match your
database schema. To give an example, if you have a table
Countries (name,capital,area,population,continent,currency)
you can read data from a file that looks like this:
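A sketch of such a file, with a couple of INSERT statements:

INSERT INTO Countries VALUES ('Sweden','Stockholm',449964,9555893,'EU','SEK') ;
INSERT INTO Countries VALUES ('Finland','Helsinki',337030,5244000,'EU','EUR') ;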
The file
http://www.cse.chalmers.se/~aarne/db-course/countries.tsv
can be used for this purpose. It has in turn been extracted from a file received
from the Geonames database,
http://www.geonames.org/
The order of values in INSERT statements must be the same as the order of
the attributes given when creating the table. This makes it a bit brittle, and also
unpractical if the data is read from a file that assumes a different order. More
control is obtained by specifying the attributes separately before the VALUES
keyword:
INSERT INTO Countries
(continent,name,capital,currency,population,area) VALUES
(’SA’,’Peru’,’Lima’,’PEN’,29907003,1285220) ;
The grammar of insert statement is the following:
statement ::= INSERT INTO tablename tableplaces? VALUES ( value+ ) ;
value ::=
integer | float | ’string’
| value operation value
| NULL
| ## nothing between the commas, meaning NULL
2.7 Deleting and updating
To get rid of all of your rows you have inserted, you may either remove the table
completely with the DROP TABLE command or use the following form of DELETE
FROM command:
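For instance, for our Countries table:

DELETE FROM Countries ;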
This will delete all rows from the table but keep the empty table. To select just
a part of the rows for deletion, a WHERE clause can be used:
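For instance,

DELETE FROM Countries WHERE continent = 'EU' ;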
will delete only the European countries. The condition in the WHERE part can
be any SQL condition, which will be explained in more detail later.
By using a sequence of DELETE and INSERT statements we can modify the
contents of the table row by row. It is, however, also practical to be able to
change a part of the contents of rows without having to delete and insert those
rows. For this purpose, the UPDATE statement is to be used:
UPDATE Countries
SET currency = ’EUR’
WHERE name = ’Sweden’
is the command to issue on the day when Sweden joins the Euro zone.
The UPDATE command can also refer to the old values. For instance, when a
new person is born in Finland, we can celebrate this by updating the population
as follows:
UPDATE Countries
SET population = population + 1
WHERE name = ’Finland’
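Queries are made with SELECT statements. A minimal sketch of a query over
our table is

SELECT * FROM Countries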
which returns the whole countries table. The table can be restricted by adding
a WHERE clause:
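For instance, the sketch

SELECT * FROM Countries WHERE continent = 'EU'

gives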
name         capital     area    population  continent  currency
Greece       Athens      131940  11000000    EU         EUR
Sweden       Stockholm   449964  9555893     EU         SEK
Netherlands  Amsterdam   41526   16645000    EU         EUR
Finland      Helsinki    337030  5244000     EU         EUR
The asterisk * means that all attributes are selected for the result. But
another set of attributes can also be listed, as in
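the following sketch, which also restricts the rows to Europe:

SELECT name, population FROM Countries WHERE continent = 'EU'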
name population
Greece 11000000
Sweden 9555893
Netherlands 16645000
Finland 5244000
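Duplicate values are not removed by default. For instance, the query (a sketch)

SELECT currency FROM Countries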
returns
currency
CUP
EUR
EUR
CNY
CLP
PEN
TZS
EUR
SEK
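Duplicates can be removed with the DISTINCT keyword. For instance, the sketch

SELECT DISTINCT currency FROM Countries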
gives the following answer:
currency
CUP
EUR
CNY
CLP
PEN
TZS
SEK
statement ::=
SELECT DISTINCT? attribute+ FROM table+ WHERE condition
condition ::=
expression comparison expression
| expression NOT? BETWEEN expression AND expression
| condition boolean condition
| expression NOT? LIKE ’pattern*’
| expression NOT? IN values
| NOT? EXISTS ( query )
| expression IS NOT? NULL
| NOT ( condition )
comparison ::=
= | < | > | <> | <= | >=
expression ::=
attribute
| value
| expression operation expression
pattern ::=
%
| _
| character ## match a specific string/char
| [ character* ]
| [^ character* ]
Most productions in this grammar have to do with the conditions that can
be used in the WHERE clause. As we saw earlier, such WHERE clauses are also
used in UPDATE and DELETE statements, and conditions are also used in CHECK
constraints. Hence a closer look at conditions is in place here.
The simplest WHERE conditions are comparisons of values, which can come
from attributes and constants. Both numeric values and strings can be com-
pared, and the symbols in this case are similar to ordinary programming lan-
guages, with the slightly deviant symbol for inequality, <>:
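For instance (a sketch),

SELECT name FROM Countries WHERE continent <> 'EU'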
Values that are compared can in general be expressions built from attributes
and constants. Thus the following query selects countries with population den-
sity at least 100:
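A sketch of such a query:

SELECT name FROM Countries WHERE population / area >= 100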
Conditions can moreover be combined with boolean operators AND and OR, as
well as NOT:
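A sketch combining several conditions:

SELECT name FROM Countries
WHERE continent = 'EU' AND NOT (currency = 'EUR' OR population > 10000000)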
A special case is the value NULL: no row is selected by a comparison such as
e = e
if the expression e has value NULL. However, there is a special comparison
operator IS NULL, which can be used:
e IS NULL
which is true if e has the value NULL.
To give an example, recall that in our example table the continent of Cuba
is NULL. Hence only the last one of the following queries returns a row; the
others will evaluate to an empty set of rows.
SELECT * FROM countries WHERE name = ’Cuba’ AND continent <> ’EU’
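Two more sketches of the same form, of which only the last, using IS NULL,
returns the Cuba row:

SELECT * FROM Countries WHERE name = 'Cuba' AND continent = 'EU'
SELECT * FROM Countries WHERE name = 'Cuba' AND continent IS NULL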
SELECT 2+2
These are the minimal queries in SQL: no FROM fields with tables are needed.
However, the result of a query is always displayed as a table:
?column?
4
The column title is often the expression used in the SELECT field, but can
also be a dummy title as ?column? above. Renaming makes it possible to use
some other title:
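For instance, the sketch

SELECT 2+2 AS result

gives the result column the title result instead.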
Here is a more complex example, showing countries with their population densi-
ties. The operator FLOOR drops the digits after the decimal point. The attribute
name is renamed to country, which is more informative than just name:
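A sketch of such a query:

SELECT name AS country, FLOOR(population/area) AS density
FROM Countries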
Exercise 2.5 Write the following queries and try them to query data from the
Countries table.
• Which countries have a population over 100 000 000 and what is their
population?
• Which countries use euros as a currency?
• Which European countries do not use euros as a currency?
• Change the currency to euros for all European countries.
• Delete all European countries from the database.
• Insert the original data of European countries to the database.
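Queries can also be combined with set operations. For instance, the sketch

SELECT name FROM Countries WHERE continent = 'EU'
UNION
SELECT name FROM Countries WHERE continent = 'AF'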
returns the names of European and African countries. The same names would
be obtained by the query that uses OR in the WHERE field:
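(in sketch form)

SELECT name FROM Countries
WHERE continent = 'EU' OR continent = 'AF'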
But the following query would not be easy to express without UNION:
SELECT name AS place FROM Countries
UNION
SELECT capital AS place FROM Countries
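UNION can also combine queries that return constant values, as in the sketch

SELECT name, 'big' AS size FROM Countries WHERE population >= 10000000
UNION
SELECT name, 'small' AS size FROM Countries WHERE population < 10000000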
to show the sizes of all countries as just ”big” or ”small”, hiding the exact
numeric populations.
There is a subtlety in the set operations: they are defined always to return
sets, in the mathematical sense that every row counts just once. This is opposed
to multisets, where duplicates of rows count separately. While SQL tables with
primary keys have no duplicate rows, a SELECT query that omits the primary
key attribute can result in a table with duplicates, as shown by the query
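for instance (a sketch repeating a query from above)

SELECT currency FROM Countries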
The duplicates could be removed with SELECT DISTINCT. However, since UNION
is also a set operation, the same effect can be obtained with a union of the query
with itself, sketched as
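SELECT currency FROM Countries
UNION
SELECT currency FROM Countries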
Hence the intuition that the union of two sets is at least as big as each of the
sets fails here: the SQL union of a table with itself can be smaller than the table itself!
To prevent duplicate removal, SQL provides the ALL keyword. The following
query keeps all duplicates of currencies, actually even doubling them:
SELECT currency FROM countries
UNION ALL
SELECT currency FROM countries
Exercise 2.6 Write the following queries using set operations and try them
to query data from the Countries table.
• Which countries use euros as a currency and do not have a population
density at least 50?
• Which European countries do not use euros as a currency, have a pop-
ulation density at most 100, and have a capital whose name starts with
’A’ ?
2.13 Sorting results
The ORDER BY keyword forces the listing of rows in a specified order. This
operation is generally called sorting in computing. Thus the following query
sorts countries primarily by the continent in ascending order (from A to Z),
and secondarily (within continents) by size in descending order (from largest
to smallest):
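A sketch of such a query:

SELECT * FROM Countries
ORDER BY continent, population DESC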
The default order is ascending, whereas descending order can be forced by the
keyword DESC. Cuba, with NULL continent, appears last in the listing.
ORDER BY is usually presented as a last field of a SELECT group. But it can
also be appended to a query formed by a set-theoretic operation:
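For instance, the sketch

(SELECT name, 'big' AS size FROM Countries WHERE population >= 10000000
 UNION
 SELECT name, 'small' AS size FROM Countries WHERE population < 10000000)
ORDER BY size, name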
shows first all big countries in alphabetical order, then all small ones. Without
parentheses around the union query, ORDER BY would be applied only to the
latter query.
In mathematical terms, ORDER BY means that tables are treated as lists,
where not only duplicates but also their order counts. Set operations such as
UNION and DISTINCT preserve the order, while removing duplicates.
• COUNT, the number of rows
• SUM, the sum of values
• MIN, the smallest value
• MAX, the biggest value
• AVG, the average value
Thus the following query gives the minimum, maximum, total, and average
population for countries in South America, as well as the number of countries:
SELECT
MIN(population), MAX(population), SUM(population),
AVG(population), COUNT(name)
FROM countries
WHERE continent = ’SA’
min max sum avg count
16746491 29907003 46653494 23326747.000000000000 2
Wrapping the average with FLOOR would remove the decimals.
What about the corresponding statistics for all continents? To do this,
we need to add yet another field to SELECT queries: GROUP BY, which forms
subgroups of the data, where the countries on the same continent are grouped
together:
SELECT
continent, MIN(population), MAX(population), SUM(population),
FLOOR(AVG(population)) AS avg, COUNT(name)
FROM countries
GROUP BY continent
Groups can further be filtered with the HAVING clause, which states a condition
on each group as a whole:
SELECT
continent, MIN(population), MAX(population), SUM(population),
FLOOR(AVG(population)) AS avg, COUNT(name)
FROM countries
GROUP BY continent
HAVING COUNT(name) > 1
continent min max sum avg count
SA 16746491 29907003 46653494 23326747 2
EU 2007000 11000000 27806893 6951723 4
As a final note on aggregation syntax at this point, we can mention the use
of the DISTINCT keyword in aggregation functions:
currency count
CNY 1
PEN 1
CUP 1
EUR 3
As we have seen, the grouped table can have more than one aggregation
attribute: all of those appearing in the SELECT field, as in the queries above.
However, it is not only the SELECT field that counts. As a final subtlety of
grouping: the table formed by GROUP BY must also contain the aggregations
used in the HAVING and ORDER BY clauses, even if they are not shown in the
final table:
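A sketch of such a query, where the aggregations in HAVING and ORDER BY
do not appear in the SELECT field:

SELECT continent, COUNT(name)
FROM Countries
GROUP BY continent
HAVING MAX(population) > 20000000
ORDER BY AVG(population)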
From the semantic point of view, GROUP BY is a very complex operator, because
one has to look at many different places to see exactly what table it forms.
We will get a clearer picture of this when looking at relational algebra and query
compilation in Chapter 6.
The table is now so wide that we have had to abbreviate the attribute names to
fit it on the page. The real problem, though, is that the column value creates
a problem of data consistency. The value data needs to be updated daily,
as the currency rates are constantly changing. Since many countries use the
same currency (e.g. EUR), the same update has to be performed on several rows.
What if we forget to update the value on all of those rows? We will then have
an inconsistency in the values, caused by the redundancy in repeating the
same information many times.
To avoid this inconsistency, we will, instead of one table, store the data
in two separate tables and learn how to combine data from different tables
in SQL queries. The basic intuition is to avoid redundancy that may lead to
inconsistencies. The current case is quite obvious, but in general, this can be a
tricky problem, and Chapters 3 and 5 dealing with database design will give
more tools for deciding how to divide data into tables.
So, instead of adding a new attribute to the table on countries, we will
now create a new table which contains just the information on the values of
currencies. We can also add some other information about currencies, which
can later be read from this table instead of being repeated for each country. Our
new table contains the currency name, to support queries like ”which countries
have a currency named peso”:
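A sketch of such a table (the types are one possible choice):

CREATE TABLE Currencies (
  code  CHAR(3) PRIMARY KEY,
  name  TEXT,
  value FLOAT
) ;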
Notice that the primary key is the 3-letter currency code rather than the name,
because many currencies can have the same name.
Now that we have split the information about countries to two separate
tables, we need a way to combine this information. The general term for this is
joining the tables. We will later introduce a set of specific JOIN operations in
SQL. But a more elementary method is to use the FROM part of SELECT FROM
WHERE statements, and give a list of tables that are used, instead of just one
table.
Let us start with a table that shows, for each country, its capital and the
value of its currency:
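A sketch of such a query:

SELECT capital, value
FROM Countries, Currencies
WHERE currency = code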
This query compares the currency attribute of Countries with the code at-
tribute of Currencies to select the matching rows.
But what if we want to show the names of the countries and the currencies?
Following the model of the previous query, we would have
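perhaps a sketch like the following, which the DBMS however rejects as ambiguous:

SELECT name, name
FROM Countries, Currencies
WHERE currency = code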
This is because now both Countries and Currencies contain a column named
name, and it is not clear which one is referred to in the query. The solution is
to use qualified names where the attribute is prefixed by the table name:
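as in the sketch

SELECT Countries.name, Currencies.name
FROM Countries, Currencies
WHERE currency = code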
We can also introduce shorthand names to the tables with the AS construct, and
use these names elsewhere for the table (recalling that the FROM part, where the
names are introduced, is evaluated first in SQL):
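A sketch:

SELECT co.name, cu.name
FROM Countries AS co, Currencies AS cu
WHERE co.currency = cu.code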
The first step in evaluating a query is to form the table in accordance with
the FROM part. When it has two tables like here, their cartesian product is
formed first: a table where each row of Countries is paired with each row of
Currencies. This is of course a large table: even with our miniature database
of the world’s countries, it has 9 × 7 = 63 rows, since Countries has 9 rows
and Currencies has 7. But the WHERE clause shrinks its size to 9, because the
currency is the same as code on only 9 of the rows. 1
The condition in the WHERE part is called a join condition, as it controls
how rows from tables are joined together. We can of course also state other
conditions in the WHERE part, as we did before: for instance,
is possible to optimize the query and shrink the table in advance. Chapter 6 will show some
ways in which this is done.
would only include European countries. Leaving out a join condition will pro-
duce all pairs of rows - the whole cartesian product - in the result. The reader
is urged to try out
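a query without a join condition, sketched as

SELECT capital, Currencies.name
FROM Countries, Currencies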
and see the big table that results, with lots of unrelated capitals and currency
names. You can use the COUNT aggregation function to see just the number of
rows created in a table. Here is an artificial example:
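For instance, the sketch

SELECT COUNT(*)
FROM Countries AS a, Countries AS b

counts the rows of the cartesian product of Countries with itself; with our
nine-row table, the result is 81.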
This example also shows that you can form the cartesian product of a table with
itself, but that you then have to rename the table to avoid ambiguous attributes.
If the foreign key is composite, only the latter method works, just as with
primary keys.
Hint: constraints can even be added after the table has been created by the
ALTER TABLE statement, for instance,
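a sketch (the constraint name is hypothetical):

ALTER TABLE Countries
  ADD CONSTRAINT currency_fk
  FOREIGN KEY (currency) REFERENCES Currencies (code) ;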
The FOREIGN KEY clause in the CREATE TABLE statement for Countries adds
a requirement that every value in the column for currencies must be found
exactly once in the code column of the Currencies table. It is the job of a
database management system to check this requirement, prohibiting all dele-
tions from Currencies or inserts to Countries that would result in a currency
in Countries that doesn’t exist in Currencies. In this way, the database man-
agement system maintains the integrity of the database.
Any conditions in the WHERE part can be used for joining data from tables.
However, there are some particularly interesting cases, like joining a table with
itself. The following query lists all pairs of countries that have the same cur-
rency:
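A sketch of such a query:

SELECT a.name, b.name
FROM Countries AS a, Countries AS b
WHERE a.currency = b.currency AND a.name < b.name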
(Notice the use of < rather than <> to state that the countries are different.
Using <> would repeat every pair in the two possible orders.)
It is of course possible to join more than two tables, basically an unlimited
number. Adding the currency table to the query above we can add the currency
value to the table:
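A sketch:

SELECT a.name, b.name, value
FROM Countries AS a, Countries AS b, Currencies
WHERE a.currency = b.currency AND a.name < b.name
  AND a.currency = code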
Thanks to our foreign key requirement, we know that each currency in the
Countries table is found in the Currencies table. What we do not know is if
there is a currency that is not used in any country. In that case, there would
be data not participating in the join. This is also the case with NULL values.
In the next section, we will have a look at particular SQL statements to join
data, which also deals with the problem of rows not joining with any rows in
the other table.
Figure 2: Example database diagram
Luckily, the JOINs have a compositional meaning. INNER is the simplest join
type, and the keyword can be omitted without change of meaning. The inner
JOIN with an ON condition gives the purest form of join, similar to cartesian
product with a WHERE clause:
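A sketch, continuing the Countries and Currencies example:

SELECT capital, value
FROM Countries JOIN Currencies ON currency = code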
is equivalent to
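the corresponding sketch with a WHERE clause:

SELECT capital, value
FROM Countries, Currencies
WHERE currency = code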
The condition is typically looking for attributes with equal values in the two
tables. With good luck (or design) such attributes have the same name, and
one can write
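a sketch with hypothetical tables T1 and T2 that share attributes a and b:

T1 JOIN T2 USING (a, b)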
as a shorthand for
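the sketch

T1 JOIN T2 ON T1.a = T2.a AND T1.b = T2.b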
well... almost, since when JOIN is used with ON, it repeats the values of a and b
from both tables, like the cartesian product does. JOIN with USING eliminates
the duplicates of the attributes in USING.
An important special case is NATURAL JOIN, where no conditions are needed.
It is equivalent to JOIN USING with the list of all attributes that appear in both
tables. Consider, for instance, the following two tables:
Capitals                    Currencies
country | capital           country | currency
--------+-----------        --------+---------
Sweden  | Stockholm         Norway  | NOK
Norway  | Oslo              Germany | EUR
The effects of different joins are shown in Figure 3. (Notice: to try them in a
DBMS, you have to wrap them in SELECT * FROM ...)
Capitals CROSS JOIN Currencies
country | capital   | country | currency
--------+-----------+---------+---------
Sweden  | Stockholm | Norway  | NOK
Sweden  | Stockholm | Germany | EUR
Norway  | Oslo      | Norway  | NOK
Norway  | Oslo      | Germany | EUR
For example,
WITH
EuroCountries AS (
SELECT *
FROM countries
WHERE currency = ’EUR’
)
SELECT *
FROM EuroCountries A, EuroCountries B
WHERE ...
is a way to avoid the duplication of the query selecting the countries using the
Euro as their currency.
A view is like a query in a WITH clause, but its definition is global. A view
is created with the CREATE VIEW statement
Views are used for ”frequently asked queries”. They can happily contain re-
dundant information. For example, the following view can be used to display
currency rates for different countries:
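A sketch of such a view (the view name is hypothetical):

CREATE VIEW CurrencyValues AS
  SELECT Countries.name AS country, currency, value
  FROM Countries, Currencies
  WHERE currency = code ;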
Just creating a view does not yet display its data. But it can be queried just like
tables created with CREATE TABLE, by writing it in the FROM field. The simplest
example is of course
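(a sketch, using the view name from above)

SELECT * FROM CurrencyValues ;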
Views are evaluated from the underlying tables each time they are used in
queries. Hence, if you have created a view once, it will display different infor-
mation whenever the underlying tables have changed. Views don’t occupy their
own storage in the database, and using them can simplify queries considerably.
2.22 SQL pitfalls
Even if SQL is simple and easy to learn, it does not always behave the way one
would expect. In this section, we list some things that do not feel quite logical
in SQL query design, or whose semantics may feel surprising.
Renaming syntax
Renaming is made with the AS operator, which however works in two opposite
ways:
• In WITH clauses, the name is before the definition: name AS (query).
• In SELECT parts, the name is after the definition: expression AS name.
• In FROM parts, the name is after the definition but AS may be omitted:
table AS? name.
Cartesian products
The bare cartesian product from a FROM clause can be a huge table, since
the sizes are multiplied. With the same logic, if the product contains an empty
table, its size is always 0. Then it does not matter that the empty table might
be ”irrelevant”:
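A sketch: if the Currencies table happens to be empty, the following query returns
no rows at all, even though the condition never mentions Currencies:

SELECT capital
FROM Countries, Currencies
WHERE continent = 'EU'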
NULL values and three-valued logic
Because of NULL values, SQL follows a three-valued logic: TRUE, FALSE, UNKNOWN.
The truth tables as such are natural. But the way they are used in WHERE
clauses is good to keep in mind. Recalling that a comparison with NULL results
in UNKNOWN, and that WHERE clauses only select TRUE instances, the query
SELECT ...
FROM ...
WHERE v = v
does not return the rows where v is NULL, and neither does the seemingly tautological
SELECT ...
FROM ...
WHERE v < 10 OR v >= 10
In other words, if v is NULL, SQL does not even assume that it has the same
value in all occurrences.
Another example involves NOT IN: since
u NOT IN (1,2,v)
means
NOT (u = 1 OR u = 2 OR u = v)
the condition is UNKNOWN rather than TRUE when v is NULL, even if u is
different from both 1 and 2, so no such rows are selected.
p  q  | NOT p | p AND q | p OR q
------+-------+---------+-------
T  T  |   F   |    T    |   T
T  F  |   F   |    F    |   T
T  U  |   F   |    U    |   T
F  T  |   T   |    F    |   T
F  F  |   T   |    F    |   F
F  U  |   T   |    F    |   U
U  T  |   U   |    U    |   T
U  F  |   U   |    F    |   U
U  U  |   U   |    U    |   U
Sets and multisets in SQL
Being a set means that duplicates don’t count. This is what holds in the math-
ematical theory of relations (Chapter 4). But SQL is a funny combination of
sets and multisets, which do allow duplicates.
First of all, typical SQL tables have a primary key. When a primary key
exists, the rows in an SQL table are unique and thereby the SQL tables are sets
of rows. But since SQL queries do not remove duplicates, the results of queries
can be multisets. However, some SQL query operations remove duplicates,
thereby producing sets.
In particular, the set operations UNION, INTERSECT, and EXCEPT do remove
duplicates. Hence even a union of a query result with itself may result in a table
with fewer rows than the original result.
Summarization pitfalls
There are several ways to perform meaningless summarization. Consider the fol-
lowing table, which records currency exchange events for a bank. Each event has
information about a money exchange transaction: how much money was exchanged,
what rate was used, and how much money the client received.
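A sketch of such a table and of a summarization over it that does not make sense
(the table name and most attribute names here are assumptions):

-- Exchanges(date, currency1, currency2, changed, rate, received)
SELECT date, SUM(changed)
FROM Exchanges
GROUP BY date ;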
Why? Because the values of changed are given in different units (currencies).
Thereby, blindly summing them up does not make sense. We can get meaningful
sums either by adding currency1 to the GROUP BY part, or by selecting only rows
with a given currency, such as euros.
Another obviously meaningless summarization would be summing up the
dates - that, however, the SQL database is unlikely to accept. This might theo-
retically have some meaning if dates started from ”the beginning of the world”,
that is, they would have a natural zero point. Since they do not, summing up
is not acceptable.
How about summing up the rates? It is hard to imagine that it would make
any sense either. They also have different units. But wait - some of them have
the same unit, for instance the ones which are used to convert euros to Swedish
crowns. But summing them up does not seem like a great idea either - why?
They are by nature multipliers that make little sense without the values they
multiply, so summing them alone gives results that make no sense.
In many cases where the sum is not meaningful, the average is. For instance, the
average exchange rate makes sense where the sum of exchange rates does not.
There are also summarizations that theoretically could make sense, but in
practice the application is hard to find. For instance, you may sum up the
heights of a group of people, which could make sense e.g. in buying a yoga mat
roll for them, but in general it would be hard to imagine reasonable uses for
such a summation.
Most SQL databases would happily perform most of the above summariza-
tions, since they have no metadata that could describe units and scales.
3 Conceptual modeling using Entity-Relationship
diagrams
Entity-Relationship (E-R) diagrams are a popular way to communicate database
design to stakeholders, who don’t necessarily have technical knowledge of databases.
E-R diagrams specify the connections between data objects and attributes in a
graphical fashion, showing how they are related to each other and what depends
on what. In this way, they can also help the technically knowledgeable database
designers to get a clear view of the database structure. In fact, it is possible to
generate a database schema automatically from an E-R diagram, specifying all
tables, attributes, primary keys, and foreign keys. In this chapter, we start with
an intuitive explanation of E-R diagrams and proceed to showing how database
schemas are derived from them.
3.1 Introduction
The projects where databases are used typically have many stakeholders. Con-
sider, for instance, selling concert tickets online. You have to take into account
the customers who buy the tickets, but they are not normally participating in
designing the system. Concert organizers should obviously be participating,
even though they might not have technical knowledge about how such systems
are built. The same concerns business experts, whose skills are needed to un-
derstand how concerts are arranged, when tickets should be on sale, how to
manage cancellations, and so on.
The technical knowledge is provided by computing professionals. These pro-
fessionals need to understand the different roles of the users of the system, as
well as how the system connects to other information systems, such as pay-
ment services. Professionals are also needed to design the user interface, and, of
course, to design and implement the database that holds the information about
concerts, customers, and reservations.
The related data content needs to be communicated between stakeholders.
Since some of them may be strictly non-technical people, the SQL language is not
the ideal way to communicate the content. Typically, a description on a higher,
conceptual level is appropriate. In this chapter, we will introduce a technique
called Entity-Relationship (E-R) modeling to describe the data content.
This method is simple to understand, which has been the key to its success.
The model has a graphical representation, and the E-R models are often called
E-R diagrams.
A relational database, as we have seen, consists of a set of tables, which are
linked to each other by referential constraints (foreign keys). This is a simple
model to implement and flexible to use. But designing a database directly
as tables can be hard, because only some things are ”naturally” tables; some
other things are more like relationships between tables, and might seem to
require a more complicated model. E-R modelling is a richer structure than
just tables, but it can in the end be automatically converted to a table design.
Figure 4: An example E-R diagram
Since it is easier than building tables directly, it helps design databases with
the right dependencies.
In this chapter, we will learn to model the data content as E-R diagrams.
We will also tell how E-R diagrams can almost mechanically be derived from
descriptive texts. Finally, you will learn how they are, even more mechanically,
converted to relational schemas, and thereby eventually to SQL CREATE TABLE
statements.
In this chapter, we can just give the bare bones of E-R models. Building
them in practice is a skill that has to be practised. This practice is particu-
larly suited for work in pairs or small groups, discussing the pros and cons of
different modeling choices. Sometimes there are many models that are equally
good, but a model that first looks good may finally not be so good if you take
everything into account. Taking everything into account is easier if many people
are contributing.
tionship type. In the figure we do not require that a currency is always used in
a country, and not even that a country always needs to have a currency (there
are indeed countries that only use foreign currencies). As we will see later, the
diagram permits a country to have several currencies, taken into use at different
dates.
Some of the entities and relationships are drawn with double contours. An
example is City, which is related to Country with the double-contour relationship
isCityOf. Such entities and relationships are called weak. The idea is that a
weak entity can only be identified by a relationship to a supporting entity. For
example, the name London does not in itself identify a city, because there is a
London in U.K. and another London in Canada. The full reference to a city
must thus be London, UK or London, Canada (which in reality is called London,
Ontario, but our simplified diagram does not show states within countries). The
reader might notice that this pair of attributes corresponds to a composite key,
where the country name is a foreign key.
Another weak entity in our example is Capital: a capital is always related
to some country, and cannot exist without it. At the same time, it makes sense
to require that a capital is a city in the country. For this purpose, we need yet
another shape in the diagram: an ISA relationship between Capital and City
(”a capital is a city”). These relationships are marked with triangles.
Notice, finally, what is not possible in E-R diagrams:
• there are no connecting lines directly between entities, or between rela-
tionships
• there are no relationships between relationships
• weak relationships and ISA relationships have no attributes
to country, and since there are n cities in a country, where n is an arbitrary
non-negative integer, there is an n on the line to the city. The
marking is similar on the other lines, apart from the is-a relationship, which does not
need this specification.
The other property is totality, which means that an entity always has to
participate in the relationship. We can safely say that a capital must be a
capital of a country, so there is total participation. Totality is marked with a
double line on that participant’s side.
We have also drawn, redundantly, another way to mark similar information.
If there is no arrow at the end of the line, then there is multiple participation.
If there is a sharp arrow at the end of the line, then there is a total single
participation (see Capital and Country). If there is a round arrowhead, then
there is a partial single participation (see City and Country).
Figure 5: Composite and multivalued attributes example
Element ::=
"ENTITY" Name Attributes
| "WEAK" "ENTITY" Name Support+ Attributes
| "ISA" Name SuperEntity Attributes
| "RELATIONSHIP" Name RelatedEntity+ Attributes
Attributes ::=
":" Attribute* # attributes start after colon
| # no attributes at all, no colon needed
RelatedEntity ::= Arrow Entity ("(" Role ")")? # optional role in parentheses
Support ::= Entity WeakRelationship
Attribute ::=
"(" PrimitiveAttribute Ident* ")"
| PrimitiveAttribute
PrimitiveAttribute ::=
Ident Ident*
| "_" Ident Ident*
For clarity, the primitive attributes within a composite attribute are always
non-key attributes; however, the composite attribute itself may be a key attribute.
3.4 From description to E-R
The starting point of an E-R diagram is often a text describing the domain. You
may have to add your own understanding to the text. The expressions used in
the text may give clues to what kind of elements to use. Here are some typical
examples:
It is not always the case that just these grammatical forms are used. You should
rather try if they are usable as alternative ways to describe the domain. For
example, when deciding if something is an attribute of an entity, you should try
if it really is the something of the entity, i.e. if it is unique. In this way, you
can decide that the population is an attribute of a country, but export product
is not.
You can also reason in terms of the informal semantics of the elements:
• An entity is an independent class of objects, which can have properties
(attributes) as well as relationships to other entities.
• An attribute is a simple (atomic) property, such as name, size, colour,
date. It belongs to only one entity.
• A relationship states a fact between two or more entities. These can also
be entities of the same kind (e.g. ”country X is a neighbour of country
Y”).
• A subentity is a special case of a more general entity. It typically has
attributes that the general entity does not have. For instance, an EU
country has an attribute ”EU joining year”, which countries generally do
not have.
• A weak entity is typically a part of some other entity. Its identity (i.e.
key) needs this other entity to be complete. For instance, a city needs a
country, since ”Paris, France” is different from ”Paris, Texas”. The other
entity is called supporting entity, and the relationships are supporting
relationships. If the weak entity has its own key attributes, they are
called discriminators (e.g. the name of the city).
Figure 6: Translating E-R diagrams to SQL database schemas.
• A relationship becomes a table that has the key attributes of all related
entities, as well as its own attributes (a sketch of this rule follows after the list below).
Other kinds of elements have different possibilities:
• In exactly-one relationships, one can leave out the relationship and use
the key of the related entity as attribute directly.
• In weak entities, one likewise leaves out the relationship, as it is always
exactly-one to the strong entity.
• An at-most-one relationship can be treated in two ways:
– the NULL approach: the same way as exactly-one, allowing NULL
values
– the (pure) E-R approach: the same way as to-many, preserving the
relationship. No NULLs needed. However, the key of the related
entity is not needed.
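As a sketch of the first rule above (the table and attribute names are assumptions),
a relationship Uses between Country and Currency, with its own attribute since,
could become:

CREATE TABLE Uses (
  country TEXT REFERENCES Countries (name),
  code    CHAR(3) REFERENCES Currencies (code),
  since   DATE,
  PRIMARY KEY (country, code)
) ;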
relations must have singleton keys. It may be that the only natural key of a
relation includes all attributes in a composite key.
Since many keys are in practice used as foreign keys in other relations, it is
highly desirable that their values do not change. The key values used as foreign
keys are also stored many times and included in search data structures. For
these reasons, it is often simpler and more straightforward in practice to create
artificial keys that are usually just integers.
For instance, Sweden has introduced a system of ”person numbers” to uniquely
identify every resident of the country. Artificial keys may also be automatically
generated by the system internally and never shown to the user. Then they are
known as surrogate keys. The surrogate keys are guaranteed not to change,
whereas natural values, no matter how stable they seem, might do so. Another
related issue is that keys are used to compose indices for the data, used
in joins, and one does not want to grow these structures unnecessarily.
4 Relational data modelling
Modelling data with E-R diagrams gives a conceptual description of the data,
which can be used to produce a tentative database design. However, SQL and
E-R models are not in a one-to-one correspondence, and we cannot use E-R
diagrams to define all the details of our data. SQL provides more details, but it
is unnecessarily complicated for reasoning about our data design and data. This
reasoning is important for both studying the properties of our database design
and in checking the integrity of data. Powerful mechanisms for reasoning are
provided by mathematics and logic, and to utilize those mechanisms we need to
describe our data using mathematics and logic. The relational model of data,
based on set theory, is the tool of choice for this purpose. In this chapter, you
will learn how to use the relational model to define your data as well as basic
operations on the data. You will also learn how the concepts of the relational
data model correspond to the expressions of SQL.
Mathematically, a relation r is a subset of a cartesian product of sets, whose
elements are tuples:
r ⊆ T1 × . . . × Tn
⟨t1 , . . . , tn ⟩ ∈ T1 × . . . × Tn if t1 ∈ T1 , . . . , tn ∈ Tn
In these definitions, each Ti is a set, called a domain. The elements ti are the
values (or components) of the tuple. The domains determine the types of the
values of the tuple. The cartesian product of which the relation is a subset is
its signature.
The most familiar example of a cartesian product in mathematics is the two-
dimensional space of real numbers, R × R. Its elements have the form (x, y),
and the components are usually called the x -coordinate and the y-coordinate.
A relation with the signature R × R is any subset of this two-dimensional space,
such as the graph of a function, or a geometric figure such as a circle or a
triangle. Mathematical relations are typically, but not necessarily, infinite sets
of tuples.
In the database world, a relation is a finite set of tuples, and it is usually
called a table. Tuples are called rows. We will use the words relation and
table, as well as tuple and row, as synonyms when we feel it appropriate. Here is
an example of a table and its mathematical representation:
country capital currency
Sweden Stockholm SEK
Finland Helsinki EUR
Estonia Tallinn EUR
{⟨Sweden, Stockholm, SEK⟩, ⟨Finland, Helsinki, EUR⟩, ⟨Estonia, Tallinn, EUR⟩}
When seeing the relation as a table, it is also natural to talk about its
columns. Mathematically, a column is the set of components from a given
place i :
{ti | ⟨. . . , ti , . . .⟩ ∈ R}
It is a special case of a projection from the relation. (The general case, as we
will see later, is the projection of many components at the same time. The idea
is the same as projecting a 3-dimensional object with xyz coordinates to a plane
with just xy coordinates.)
What is the signature of the above table as a relation? What are the types
of its components? For the time being, it is enough to think that every type is
String. Then the signature is
String × String × String
However, database design can also impose more accurate types, such as 3-
character strings for the currency. This is an important way to guarantee the
quality of the data.
Now, what are ”country”, ”capital”, and ”currency” in the table, mathe-
matically? In databases, they are called attributes. In programming language
terminology, they would be called labels, and the tuples would be records.
Hence yet another representation of the table is a list of records,
[
{country = Sweden, capital = Stockholm, currency = SEK},
{country = Finland, capital = Helsinki, currency = EUR},
{country = Estonia, capital = Tallinn, currency = EUR}
]
Mathematically, the labels can be understood as indexes, that is, indicators of
the positions in tuples. (Coordinates, as in the xyz example, is also a possible
name.) Given a cartesian product (i.e. a relation signature)
T1 × . . . × Tn
we can fix a set of n labels (which are strings),
L = {a1 , . . . , an } ⊂ String
and an indexing function
i : L → {1, . . . , n}
which should moreover be a bijection (i.e. a one-to-one correspondence). Then
we can refer to each component of a tuple by using the label instead of the
index:
t.a = ti(a)
One advantage of labels is that we don’t need to keep the tuples ordered. For
instance, inserting a new row in a table in SQL by just listing the values without
labels is possible, but risky, since we may have forgotten the order; the notation
making the labels explicit is more reliable.
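For instance (a sketch, assuming a table Countries(country, capital, currency)):

-- risky: relies on remembering the order of the columns
INSERT INTO Countries VALUES (’Norway’, ’Oslo’, ’NOK’) ;

-- more reliable: the labels are explicit
INSERT INTO Countries (country, capital, currency) VALUES (’Norway’, ’Oslo’, ’NOK’) ;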
A relation schema consists of the name of the relation, the attributes, and
the domains of the attributes:
Countries(country : String, capital : String, currency : String)
The relation (table) itself is called an instance of the schema. The types are
often omitted, so that we write
Countries(country, capital, currency)
for the same schema as the one above. As database tables are in practice declared
with a primary key, there cannot be any duplicate rows in them. In queries,
however, SQL makes a distinction, marked by the DISTINCT and ORDER BY
keywords. This means that, strictly speaking, SQL tables resulting from queries
are lists of tuples. If the order does not matter but the repetitions do, the tables
are multisets or bags.
In set theory, you should think of a relation as a collection of facts. The first
fact is that Finland is a country whose capital is Helsinki and whose currency is
EUR. Repeating this fact does not add anything to the collection of facts. The
order of facts does not mean anything either, since the facts don’t refer to each
other.
If X is a set of attributes {B1 , . . . , Bm } ⊆ S and t is a tuple of a relation
with schema S, we can form a simultaneous projection,
However, it clearly follows from the above, that for each country there is
always exactly one value (the value associated with the one currency of the
country), which logically follows from the functional dependencies above.
Normally, the set of FDs identified for an application is not exhaustive in the
sense that there are FDs that can be inferred from that set while not explicitly
belonging to the set. So, when a set of attributes X functionally determines an
attribute A, it may not necessarily be an explicitly given FD, but an FD that
can be inferred from the given FDs.
Given a set of FDs, the following rules can be used to correctly compute all
implied FDs.
1. If Y ⊂ X, then X → Y . (Trivial)
2. If X → Y , then XA → Y A. (Adding an attribute to both sides)
3. If X → Y and Y → Z, then X → Z. (Transitivity)
Generally, we will not need to compute all implied FDs as there could be a
huge amount of them. We are more interested in calculating the closure of a
set of attributes X. The closure of X, denoted by X + , is the set of all attributes
that X functionally determines, including X. Given X, we can compute X + by
starting with X+ = X and repeatedly applying the following rule as long as new
attributes can be added to X+: if there is an FD X′ → Y such that X′ ⊆ X+ as
computed this far, and Y contains attributes that are not yet in X+, then add
all those attributes of Y to X+.
For example, the closure of the singleton attribute country is {country, cur-
rency, value} and it can be computed by applying the rule once to add currency
and then once to add value.
This makes country a superkey, and, as such, a candidate when finding keys.
However, country is not the only superkey. All of
country
country currency
country value
country currency value
are superkeys. However, all of these sets but the first are just irrelevant exten-
sions of the first. They are superkeys but not keys. We conclude that country
is the only possible key.
4.4 Modelling SQL key and uniqueness constraints
In SQL, we define one key and additionally we may define a number of unique
constraints. On the logical level, key and unique constraints are the same. So,
both can be modelled in the same way using functional dependencies.
However, to create a mapping between a SQL database definition and the
relational model, we need a way to identify the primary key. We do this by
marking the primary key attributes either by underlining or (in plain ASCII
text) with an underscore prefix.
JustCountries(_name,capital,currency)
currency -> Currencies.code
Currencies(_code,valueInUSD)
We can additionally state that the capital uniquely identifies a country by
adding a uniqueness constraint:
JustCountries(_name,capital,currency)
currency -> Currencies.code
unique capital
In the actual database, the key and uniqueness constraints prevent us from
inserting a new country with the same name or capital.
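In SQL, this schema could be declared for instance as follows (a sketch, assuming
text-valued attributes):

CREATE TABLE Currencies (
  code TEXT PRIMARY KEY,
  valueInUSD FLOAT
) ;

CREATE TABLE JustCountries (
  name TEXT PRIMARY KEY,
  capital TEXT UNIQUE,
  currency TEXT REFERENCES Currencies (code)
) ;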
In JustCountries, currency would not work as a key, because many coun-
tries can have the same currency.
In Currencies, valueInUSD could work as a key, if it is unlikely that two
currencies have exactly the same value. This would not be very natural of
course. But the strongest reason of not using valueInUSD as a key is that we
know that some day two currencies might well get the same value.
The keys above contain only one attribute, and as such, they are called sin-
gleton keys. A key can also be composite. This means that many attributes
together form the key. For example, in
PostalCodes(_city,_street,code)
the city and the street together determine the postal code, but the city alone is
not enough. Nor is the street, because many cities may have the same street-
name. For very long streets, we may have to look at the house number as well.
The postal code determines the city but not the street. The code and the street
together would be another possible composite key, but perhaps not as natural.
JustCountries(country,capital,currency)
Currencies(code,valueInUSD)
For the integrity of the data, we want to require that all currencies in JustCountries
exist in Currencies. We add to the schema a referential constraint,
JustCountries(country,capital,currency)
JustCountries.currency => Currencies.code
Countries (_name,capital,population,currency)
capital => Cities.name
currency => Currencies.code
represents a schema with two referential constraints: every capital must occur
as a name in the Cities table, and every currency as a code in the Currencies table.
Union: R ∪ S = {t | t ∈ R or t ∈ S}
Intersection: R ∩ S = {t | t ∈ R and t ∈ S}
Difference: R − S = {t | t ∈ R and t ∉ S}
Cartesian product: R × S = {⟨t, s⟩ | t ∈ R and s ∈ S}
However, the database versions are a bit different from set theory:
• Union, intersection, and difference are only valid for relations that have
the same schema.
• Cartesian products are flattened: ⟨⟨a, b, c⟩, ⟨d, e⟩⟩ becomes ⟨a, b, c, d, e⟩
3 It is common to use the arrow (->) as the symbol for referential constraints. However, we
chose to use the double arrow => in order not to confuse them with functional dependencies.
These standard operations are a part of relational algebra. They are also a
part of SQL (with different notations). But in addition, some other operations
are important - in fact, even more frequently used:
Projection: πa,b,c R = {⟨t.a, t.b, t.c⟩ | t ∈ R}
Selection: σC R = {t | t ∈ R and C}
Theta join: R ⋈C S = {⟨t, s⟩ | t ∈ R and s ∈ S and C}
In selection and theta join, C is a condition that may refer to the tuples
and their components. In SQL, they correspond to WHERE clauses. The use of
attributes makes them handy. For instance,
σcurrency=EUR Countries
selects those rows where the currency is EUR, i.e. the rows for Finland and
Estonia.
A moment’s reflection shows that theta join can be defined as the combina-
tion of selection and cartesian product:
R ⋈C S = σC (R × S)
The ⋈ symbol without a condition denotes natural join, which joins tuples
that have the same values of the common attributes. It used to be the basic
form of join, but it is less common nowadays. The history behind this is that it is
based on a design philosophy where the attributes referencing the same thing are
given the same name. This makes joining by identical names a feasible idea;
otherwise, of course, the idea is not that feasible. The definition is as follows:
This table has a redundancy, as the USD value of EUR is repeated twice.
As we will see later, redundancy should usually be avoided. For instance, some-
one might update the USD value of the currency of Finland but forget Estonia,
which would lead to inconsistency. You can also think of the database as a story
that states some facts about the countries. Normally you would only state once
the fact that EUR is 1.09 USD.
Redundancy can be avoided by splitting the table into two:
JustCountries:
name capital currency
Sweden Stockholm SEK
Finland Helsinki EUR
Estonia Tallinn EUR
Currencies:
code valueInUSD
SEK 0.12
EUR 1.09
Searching for the USD value of the currency of Sweden now involves a join of
the two tables:
SELECT valueInUSD
FROM JustCountries, Currencies
WHERE name = ’Sweden’ AND currency = code
4.8 Transitive closure
The initial relational model was based on the idea of relational algebra as a
measuring stick for the kinds of queries the user can pose. A language equally
expressive as relational algebra was called relationally complete. However, using
relational algebra it is not possible to query for transitive closure, discussed in
this section. See the following table, which we can call, say, requires.
This relation is transitive in the sense that if course B is required for course
A, and course C is required for course B, then course C is also required for
course A.
Now, how can we compute all courses required for compilers? We can
directly see, just by selection, that programming languages is required. Joining
requires with itself, and taking a projection and a union, we get a tuple saying
that also programming is required. A further join with projection and union to
the previous results gives us tuples for all the required courses.
If we know exactly how many times it is enough to join a relation with itself,
we can write a relational algebra expression that answers this question. If not,
then it is not doable with relational algebra.
For transitive closure computation, let us consider a transitive relation R
that has only two attributes, the first being called A and the second B. Thus, it
only contains one transitive relation and no additional attributes. The following
relation composition for two-attribute relations may be used in search of new
transitive relationships:
{⟨t.A, u.B⟩ | t ∈ R and u ∈ R and t.B = u.A}
• In some cases the transitive relation may be split into different relations.
It is possible to join them first into a single table to avoid this situation.
SQL did not initially contain a transitive closure operation. This led to a
situation where different SQL implementations offer different ways
to specify transitive closure. For example, PostgreSQL provides a special kind
of local WITH definitions, WITH RECURSIVE, to answer transitive queries.4
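For instance, all courses transitively required for compilers could be queried as
follows (a sketch, assuming that the requires table has the columns course and
requiredcourse):

WITH RECURSIVE AllRequired(course, requiredcourse) AS (
  SELECT course, requiredcourse FROM requires
  UNION
  SELECT a.course, r.requiredcourse
  FROM AllRequired AS a, requires AS r
  WHERE a.requiredcourse = r.course
)
SELECT requiredcourse FROM AllRequired WHERE course = ’compilers’ ;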
The SQL standard includes a possibility to define assertions that are SQL
queries stating conditions that the data needs to obey. The SQL assertions
follow the idea that the condition can be whatever SQL allows in the WHERE
part of queries. The above examples can be represented using SQL CHECK
constraints as follows.
The assertions may also be written with the focus on a single database
row (referring to its values). Further, there are natural constraints involving
aggregation functions to express things like a country’s population must be at
least the sum of its cities’ populations.
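For instance, a sketch of such an assertion in the SQL standard’s CREATE
ASSERTION syntax (assuming tables Countries(name, population) and
Cities(country, population); the names are illustrative) could be:

CREATE ASSERTION CityPopulations CHECK (
  NOT EXISTS (
    SELECT * FROM Countries
    WHERE population <
      (SELECT SUM(population) FROM Cities WHERE Cities.country = Countries.name)
  )
) ;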
4 http://www.postgresqltutorial.com/postgresql-recursive-query/
The generality of assertions comes with a cost. They may involve frequent
heavy computation and therefore they are generally not in use in actual database
systems.
Even though the assertions are not used in practice, it makes perfect sense to
describe these general constraints on data. They can be taken care of with other
solutions, to be discussed in later chapters. If they are not described, they are
easily forgotten.
country neighbours
Sweden Finland, Norway
Finland Sweden, Norway, Russia
But this is not possible, since the attributes cannot have list types. One has to
write a new line for each neighbourhood relationship:
country neighbour
Sweden Finland
Sweden Norway
Finland Sweden
Finland Norway
Finland Russia
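In SQL, such a table could be declared for instance as follows (a sketch, assuming
a Countries table whose key is name):

CREATE TABLE Neighbours (
  country TEXT REFERENCES Countries (name),
  neighbour TEXT REFERENCES Countries (name),
  PRIMARY KEY (country, neighbour)
) ;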
5 Dependencies and database design
The use of E-R diagrams implies a design that can automatically be produced
from the E-R design. But is it a good design? Are we even able to identify a
good design before we use the database? There are usually many open questions
on the use of the database: the data volume and distributions, the queries, the
update activity, the number of concurrent users, etc. Even though there could
be initial estimates for these, the lifetime of a database is often long and new ways
to use the database emerge.
We would preferably want to find general criteria for good design and meth-
ods to avoid potentially problematic designs in a general setting. We need
intuitive design goals and formal properties of good database design, and a
methodology to create such good designs.
Our approach is mainly based on the use of functional dependencies.
They give us a way to formalize some of the properties of good design. They also
give methods and algorithms to carry out the design. Functional dependencies
are independent of E-R diagrams, but they also add expressive power: they
allow us to define relationships between attributes that E-R diagrams cannot
express. In this chapter, we will define the different kinds of dependencies, study
their use, and see how they combine with E-R diagrams.
Otherwise, there is a risk of redundancy, repetition of the same informa-
tion (in particular, of the same argument-value pair). Redundancy can lead
to inconsistency, if the updates fail to update all copies of the data, which is
known as an update anomaly. To avoid such inconsistency in a badly designed
database, it may be necessary to execute an extensive amount of database up-
dates as a knock-on effect of a single update. This could be automated by using
triggers (Section 7.4.4). But it is better to design the database in a way that
avoids the problem altogether.
Consider, first, splitting this table into two as follows. The first table tells
who is teaching which student and the second tells who is studying which course.
professor student
Ranta Anders
Ranta Lena
Kemp Anders
Nummenmaa Lena
course student
Compiler construction Anders
Compiler construction Lena
Data Science Anders
Database systems Lena
This is obviously not what we want: if we join these two tables (on student),
we get rows that were not in our original table. We say that this way of splitting
the data is ”lossy” - here, lossy does not mean losing data, which indeed did not
happen, but losing information on which pieces of data belonged together.
Our second attempt is to store the data with two tables, where the first table
tells who is teaching which course and the second tells who is studying which
course.
professor course
Ranta Compiler construction
Kemp Data Science
Nummenmaa Database systems
course student
Compiler construction Anders
Compiler construction Lena
Data Science Anders
Database systems Lena
If we join the tables on course, we get the original table back. In this case our
join is ”lossless” - no information was lost. In this case, though, the losslessness
is based on the fact that no course is being taught by two professors. If e.g.
Nummenmaa starts to teach Data Science to Kati, then the situation with
losslessness changes. If, however, we know that each professor always only
teaches one course, then we will be safe with joining on course and we always
get a lossless join. In this case our database design would have the lossless join
property.
Our initial design with just one table had - trivially - no problems with loss-
lessness. Why would we not just stick with it? If we know that each professor
only teaches one course, then using just one table usually introduces redundant
information that is repeated in different rows. Let’s assume, again, that each
course is only taught by one professor. If all data is stored in just one table
and the professors Ranta and Nummenmaa swap courses so that Nummenmaa
will teach Compiler construction and Ranta will teach Database systems, then we
would need to update a whole lot of rows in the table (well, in our initial exam-
ple just 3, but potentially many more). If the data is split into two tables with the
lossless design, then only the two rows in the (professor, course) table need to be updated, no
matter how many students study these courses. So, the solution where the table
is split seems beneficial.
It seems, right enough, that the key structure plays a central role in solving
these problems. But as a modeling construct, it is not enough. Suppose, now,
that each student studies each course with one particular professor, and let us
suppose the following table, with (student, course) as the key.
This would mean that it is impossible to add a row where Anders studies
Compiler construction under the instruction of Nummenmaa.
Let us make a further assumption that each course is only taught by at most
one professor. There is no way we can model this information with keys, since
course cannot be a key: then, for example, Compiler construction would be a duplicate key value.
Also, in our table we have redundant information on who is teaching which
course, since the course is always repeated with the professor.
We may try to get rid of the redundancy by splitting the table into two, as
follows.
professor course
Ranta Compiler construction
Kemp Data Science
Nummenmaa Database systems
professor student
Ranta Anders
Ranta Lena
Kemp Anders
Nummenmaa Lena
We got rid of the redundant data, but other problems appeared. Joining the
tables on professor we get e.g. a row (Kemp, Anders, Data Science).
Let us now consider a design where we have three tables
professor course
Ranta Compiler construction
Kemp Data Science
Nummenmaa Database systems
professor student
Ranta Anders
Ranta Lena
Kemp Anders
Nummenmaa Lena
course student
Compiler construction Anders
Compiler construction Lena
Data Science Anders
Database systems Lena
This effectively eliminates redundant data. Let us suppose now that the
only key-like constraint is that each student studies each course with exactly
one professor. Then, when updates take place, we can only check this constraint
by joining the tables. Besides, as the reader is urged to check, the joins are not
lossless. Lossy joins follow from the fact that there are no foreign keys.
If no functional dependencies exist, then the three-table design is lossless and
there is no constraint checking problem. However, then intuitive querying be-
comes more problematic as there are multiple relationships between attributes.
E.g. there is a relationship between course and student directly stored, but an-
other one exists and it can be realized by joining the two other tables. Further,
in some high-level query systems, e.g. those based on natural language, the user may
want to just ask for professors and students. With two existing interpretations,
there is some ambiguity.
In addition to removing redundancy, we wish to preserve the possibility to
maintain dependency-wise the correctness of the database, and if we split a
table into several tables, we want to be able to recover the initial table as it
was. These properties are defined formally below.
Definition (Dependency preservation) We call the database schema D de-
pendency preserving with respect to a set of functional dependencies FD, if
we can infer any given functional dependency in FD from the dependencies in
the relations of D.
Definition (Lossless join property). Let D = {R1 , R2 , . . . , Rn } be a database
schema using the set of attributes U . We say that D has the lossless join prop-
erty if every relation rU over U decomposes losslessly onto R1 , R2 , . . . , Rn . If D
has the lossless join property, we say that D is lossless. Otherwise, we say that
D is lossy.
These two concepts are related: it is known that a dependency preserving
database schema D (i.e. set of relations) is lossless if and only if it contains a
relation schema R such that the closure of R contains all attributes in D.
We want to find design methods that produce database schemas which imply
these properties. In what follows, it is important that the reader has a good
recollection of functional dependencies, their closures, and keys.
1. Group F by the left hand side, i.e. so that dependencies X → A and
X → B end up in the same group.
2. For each of the groups, return the schema containing the left hand side X
of the group and all the right hand side attributes, i.e. X A1 . . . An where
X → Ai .
3. (Optionally:) If one of the schemas contains a key of U, these groups are
enough; otherwise, add a schema containing just some key of U.
Synthesis produces database schemas in which all relation schemas are in
third normal form (3NF), defined as follows.
A relation schema R is in 3NF if all dependencies X → A in R, where A is a
single attribute, are such that either X is a key of R or A belongs to some key
of R.
Let us consider the dependencies we get from E-R diagrams. If an attribute
appears in more than one relation schema, it is a part of a key in all schemata.
This means that forming relation schemata directly from the E-R diagram pro-
duces a relation schema in 3NF (with respect to the dependencies defined in the
E-R diagram). Further, the dependency set that we get from the E-R diagram
is a minimal cover. We leave the rest of this reasoning to the reader, but
just notice that the following are equal:
• Producing the relation schema directly from the E-R diagram using the
methodology of this book.
• Producing the dependency set directly from the E-R diagram and synthe-
sizing it into a database schema, without Step 3.
• Computing a minimal cover to the set of dependencies directly obtained
from the E-R diagram and then synthesizing the database schema from
there, without Step 3.
In fact, when we are dealing with a database schema transformed directly
from an E-R diagram, and no other functional dependencies exist, then our
database schema is in Boyce-Codd Normal Form (BCNF), defined as follows.
A relation schema R is in BCNF if all dependencies X → A in R that belong
to a minimal cover are such that X is a key of R. That is, in BCNF we allow
only key dependencies. As a simple consequence of our definition, all relation
schemas with at most two attributes are in BCNF.
BCNF is a beneficial property, as relations whose schemas are in BCNF avoid
redundancy. However, not all functional dependencies can be described in the
E-R diagram. As an example, let us consider addresses, which are commonly
stored in information systems. A part of an address is commonly formed from
a city name, street name, and postcode. Here is a table with cities, streets, and
postcodes.
Here is the relation schema with functional dependencies:
city street postcode
city street -> postcode
postcode -> city
If we use E-R diagrams and model an entity with an address, which con-
tains attributes city, street, and postcode, then we have no way to model these
dependencies in an E-R diagram. The keys of the relation are city,street and
postcode,street. It is easy to check that the relation schema is in 3NF. It seems to
introduce redundancy, though, as postcode,city pairs are stored multiple times.
This redundancy is because of the FD postcode -> city, and that FD is not
a key FD. It follows that the table is not in BCNF.
Since the city is functionally determined by postcode, we could think of
splitting the table by the FD postcode -> city by a decomposition operation,
which would make a new table with the attributes of the FD and remove the
right hand side of the FD from the initial table, thereby giving us two tables as
follows.
street postcode
Framnäsgatan 41264
Rännvägen 41296
Hörsalsvägen 41296
Barnhusgatan 41102
Barnhusgatan 11123
city postcode
Gothenburg 41264
Göteborg 41102
Stockholm 11123
We can get back our initial table by a natural join on postcode. A problem,
though, is that when we insert new data, how are we going to
check that the FD city street -> postcode still holds? To know which city,
street, postcode triples already exist, we have to join the two tables we created
- it seems that the decomposition step has removed the ability to efficiently
check the FD city street -> postcode. The situation is difficult: either we
need to introduce redundancy or produce a design where we cannot check the
dependencies (dependency preservation is lost).
Let us try decomposing not based on an FD, say, projecting new tables on
city,street and street,postcode:
city street
Gothenburg Framnäsgatan
Gothenburg Rännvägen
Gothenburg Hörsalsvägen
Göteborg Barnhusgatan
Stockholm Barnhusgatan
street postcode
Framnäsgatan 41264
Rännvägen 41296
Hörsalsvägen 41296
Barnhusgatan 41102
Barnhusgatan 11123
This is an even worse idea. When we join back on street, we get all the
original rows and additionally
Göteborg Barnhusgatan 11123
Stockholm Barnhusgatan 41102
which is just plain wrong. Our design has led to loss of information.
The idea of the decomposition step we used based on FD postcode ->
city is the basis of decomposing database schemas into BCNF.
Algorithm (BCNF decomposition). Consider a relation R with schema S and a
set F of functional dependencies, which is a minimal cover. Apply the following
steps as long as they lead to a change.
1. If R has no BCNF violations, return R
2. If R has a violating functional dependency X → A, decompose R to two
relations
• R1 with schema X ∪ X +
• R2 with schema S − (X+ − X)
In the city street postcode relation, the only FD which can be used to decompose is
postcode -> city, and it produces, as seen above, the schemas street postcode
and postcode city, which are in BCNF, but the ability to check the dependency
city street -> postcode has disappeared. In this case it would seem like
a good choice not to decompose, particularly as postcodes do not change that
often.
What to do with the other functional dependency, currency -> value? We
could call it a non-key FD, which is not standard terminology, but a handy
term. Looking at the table, we see that it creates a redundancy: the value
is repeated every time a currency occurs. Non-key FD’s are called BCNF
violations. They can be removed by BCNF decomposition: we build a
separate table for each such FD. Here is the result:
country currency
Sweden SEK
Finland EUR
Estonia EUR
currency value
SEK 0.12
EUR 1.10
These tables have no BCNF violations, and no redundancies either. Each of
them has their own functional dependencies and keys:
They also enjoy lossless join: we can reconstruct the original table by a natural
join Countries ⋈ Currencies.
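In SQL, the same reconstruction could be written, for instance, as:

SELECT country, currency, value
FROM Countries NATURAL JOIN Currencies ;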
Example Consider the schema
It has one 3NF violation: currency -> value. Moreover, the FD set is not a
minimal cover: the second FD can be dropped because it follows from the first
and the third ones by transitivity. Hence we have a minimal cover
country currency
currency value
i.e. exactly the same ones as we would obtain by BCNF decomposition. These
relations are hence not only 3NF but also BCNF.
Iterating this, it is possible to reach a state with no violations of normal
forms. At this point it is also possible to compare the dependency-based design
with the E-R design.
It seems not so smart, though, to neglect the E-R diagram in the design. To
utilize the E-R diagram information, the following workflow is recommended:
1. As a basis, get the initial relations and functional dependencies from the
E-R diagram.
2. Model additional FDs.
3. From the functional dependencies, calculate the possible keys of the rela-
tions.
4. Use BCNF decomposition where seen appropriate.
Dependency-based design is, in a way, more mechanical than E-R design.
In E-R design, you have to decide many things: the ontological status of each
concept (whether it is an entity, attribute, relationship, etc). You also have to
decide the keys of the entities. In dependency analysis, you only have to decide
the basic dependencies. Lots of other dependencies are derived from these by
mechanical rules. Also the possible keys - candidate keys - are mechanically
derived. The decomposition to normal forms is mechanical as well. You just
have to decide what normal form (if any) you want to achieve. In addition, you
have to decide which of the candidate keys to declare as the primary key of
each table.
Be reminded that it is a standard practice to replace the calculated keys
by artificial surrogate keys and just use the calculated keys for uniqueness con-
straints.
You need to decide what decomposition you want. Normalization has pros
and cons. This also depends on the application. If query performance is crucial
and one wants to avoid joins, and updates seldom happen or are otherwise
known to be feasible on pre-joined data, then you may decompose less or store
joins directly.
The following section will, to some extent, upset the theory we have learned
until now. We will see that there are further complications and that decomposition
and BCNF do not necessarily solve our design problems. These
complications are related to dependency patterns that may or may not arise.
Anyway, it is important to recognize the limitations of design methods.
5.6 Multivalued dependencies and the fourth normal form
Consider the following information about persons. Bob speaks English. Jill
speaks French. Jill speaks Swedish. Bob is a cook by profession. Jill is a
lawyer by profession. Jill is also a programmer by profession. Let us consider
putting this information on a single table with Name, Language, and Profession
as attributes. We can easily get a row with Bob, English and cook as attribute
values. But how about Jill? If we put Jill and French in a row, what would
be the related profession? To cover all the combinations, we could have rows
with Jill, French and lawyer, as well as Jill, French and programmer. The whole
table would look like the following:
name language profession
Bob English cook
Jill French lawyer
Jill French programmer
Jill Swedish lawyer
Jill Swedish programmer
Thus we can say that the languages of persons are independent of their pro-
fessions. This can be expressed as multivalued dependencies,
name →→ language, name →→ profession
Why ”multivalued”? Because there are multiple values of language and profes-
sion for each person, and these value sets are independent of each other.
Multivalued dependencies (MVD) are based on this idea. An MVD
X→ → Y in a relation schema R says that X determines Y independently of all
other attributes of R. The standard formal definition is a bit more complicated,
but the following observation is sufficient.
Given a relation r with schema R and X ⊂ R, Y ⊂ R, then X →→ Y if
r = πXY r ⋈ πX∪(R−Y ) r.
The MVD X → → Y is trivial if Y ⊆ X or R ⊆ X ∪ Y . In this book, we are
considering non-trivial MVDs.
Let Z = R − (X ∪ Y ) and consider a non-trivial MVD X → → Y . Note that
Y and Z are in a symmetric position, so we also have X → → Z. An alternative
notation is X → → Y | Z, emphasizing that Y is independent of Z. Interestingly,
if X → Y , then also X → →Y.
In short, if r is equal to joining the projections of r on XY and R-Y, then
X→ →Y.
Fourth normal form (4NF). If there is a non-trivial MVD X → → Y in a
relation schema and X is not a superkey, that is a fourth normal form (4NF)
violation, and the relation is not in 4NF. If no 4NF violations exist in a relation,
then that relation is in 4NF.
If we add a row with values Jill, Finnish, pilot, then what? Then the relation
would not be symmetric on language and profession for a person. This could
be an error, but it could also mean that language and profession are not inde-
pendent, and the relation gives the languages in which a person can practice a
profession. Also, the 4NF violation would go away.
Note that MVDs and 4NF are different from FDs and BCNF in an interesting
way: A BCNF violation can only be created by adding a tuple to the database
whereas a 4NF violation can be created also by deleting a tuple from a database.
As another example of MVDs, let us consider the following. Sweden produces
cars. Sweden produces paper. Norway is a neighbour of Sweden. Finland is a
neighbour of Sweden. Thus, we get a table of countries, their export products,
and their neighbours:
country product neighbour
Sweden cars Norway
Sweden cars Finland
Sweden paper Norway
Sweden paper Finland
In this table, Sweden exports both cars and paper, and it has both Finland
and Norway as neighbours. Intuitively, export products and neighbours are two
independent facts about Sweden: they have nothing to do with each other, and
we have the MVDs country → → product, country → → neighbour, or, equivalently,
country →→ product | neighbour.
The 4NF decomposition splits the table in accordance to the violating
MVD:
country product
Sweden cars
Sweden paper
country neighbour
Sweden Norway
Sweden Finland
These tables are free from violations. Hence they are in 4NF. Their natural join
losslessly returns the original table.
It is relatively easy to see that 4NF is a stronger requirement than BCNF.
If X → A violates BCNF, then it also violates 4NF, because
• it is an MVD by the theorem above
• it is not trivial, because
– if {A} ⊆ X, then X → A is a trivial FD and cannot violate BCNF
– if X ∪ {A} = S, then X is a superkey and X → A cannot violate
BCNF
The 4NF decomposition algorithm can be formulated as follows.
Algorithm (4NF decomposition). Consider a relation R with signature S and
a set M of multivalued dependencies. R can be brought to 4NF by the following
steps:
1. If R has no 4NF violations, return R
2. If R has a violating multivalued dependency X →→ Y , decompose R to
two relations
• R1 with signature X ∪ Y
• R2 with signature S − Y
3. Apply the above steps to R1 and R2
In the previous example, we could actually prove the MVD by looking at
the tuples (see the definition of MVD above). Finding a provable MVD in an
instance of a database can be difficult, because so many combinations must
be present. An MVD might of course be assumed to hold as a part of the
domain description. This can lead to a better structure and smaller tables.
However, if the assumption fails, the natural join from those tables can produce
combinations not existing in the reality.
A generalisation of 4NF is the project-join normal form (PJNF) also called
fifth normal form, which takes into account the possibility that a relation could
be a result of several joins instead of just one.
In practice, it is very rare that anyone would actually use a relation violating
4NF.
Decomposing with the FDs in the order they are given (the third one does not
decompose anymore), we get the following tables.
This looks like a good structure, except for the last table. Applying 4NF de-
composition to this gives the final result
6 SQL query processing with relational algebra
Relational algebra is a mathematical query language. It is much simpler than
SQL, as it has only a few operations, each usually denoted by Greek letters or
mathematical symbols. But for the same reason, it is easier to analyse and
optimize. Relational algebra is therefore useful as an intermediate language in a
database management system. SQL queries can be first translated to relational
algebra, which is optimized before it is executed. This chapter will tell the basics
about this translation and some query optimizations.
Since this is the ”official” relational algebra (from the textbook), a couple
of SQL constructs cannot be treated here, however: sorting in DESC order and
aggregation of DISTINCT values. Both of them would be easy to add. More
importantly, this language extends the algebra of Chapter 4 in several ways.
relation ::=
relname   name of relation (can be used alone)
| σcondition relation   selection (sigma) WHERE
| πprojection+ relation   projection (pi) SELECT
| ρrelname (attribute+)? relation   renaming (rho) AS
| γattribute*,aggregationexp+ relation   grouping (gamma) GROUP BY, HAVING
| τexpression+ relation   sorting (tau) ORDER BY
| δ relation   removing duplicates (delta) DISTINCT
| relation × relation   cartesian product FROM, CROSS JOIN
| relation ∪ relation   union UNION
| relation ∩ relation   intersection INTERSECT
| relation − relation   difference EXCEPT
| relation ⋈ relation   natural join NATURAL JOIN
| relation ⋈condition relation   theta join JOIN ON
| relation ⋈attribute+ relation   join on attributes INNER JOIN
| relation ⋈°attribute+ relation   FULL OUTER JOIN
| relation ⋈°L attribute+ relation   LEFT OUTER JOIN
| relation ⋈°R attribute+ relation   RIGHT OUTER JOIN
projection ::=
expression expression, can be just an attribute
| expression → attribute rename projected expression AS
aggregationexp ::=
aggregation( *|attribute ) without renaming
| aggregation( *|attribute ) → attribute with renaming AS
The most important extension is that it operates on multisets (turned to sets
by δ, ∪, ∩) and recognizes order (controlled by τ ).
SELECT projections
FROM table,...,table
WHERE condition
But notice, first of all, that names of relations can themselves be used as alge-
braic queries. Thus we translate
SELECT * FROM Countries
=⇒ Countries
SELECT * FROM Countries WHERE name=’UK’
=⇒ σname=’UK’ Countries
In general, SELECT * does not add anything to the algebra translation. However,
there is a subtle exception with grouping queries, to be discussed later.
If the SELECT field contains attributes or expressions, these are copied into
the same expressions under the π operator. When the field is given another
name by AS, the arrow symbol is used in algebra:
SELECT capital, area/1000 FROM Countries WHERE name=’UK’
=⇒ πcapital,area/1000 σname=’UK’ Countries
SELECT name AS country, population/area AS density FROM Countries
=⇒ πname→country,population/area→density Countries
The renaming of attributes could also be done with the ρ operator:
ρC(country,density) πname,population/area Countries
is a more complicated and, in particular, less compositional solution, because
the SELECT field is used in two different algebra operations. Moreover, it must
invent C as a dummy name for the renamed table. However, ρ is the way to go
when names are given to tables in the FROM field. This happens in particular
when a cartesian product is made with two copies of the same table:
SELECT A.name, B.capital
FROM Countries AS A, Countries AS B
WHERE A.name = B.capital
=⇒
πA.name,B.capital σA.name=B.capital (ρA Countries × ρB Countries)
No renaming of attributes takes place in this case.
Set-theoretical operations and joins work in the same way as in SQL. No-
tice once again that the SQL distinction between ”queries” and ”tables” is not
present in algebra, but everything is relations. This means that all operations
work with all kinds of relation arguments, unlike in SQL (see Section 2.22).
SELECT currency, COUNT(name)
FROM Countries
GROUP BY currency
=⇒
γcurrency,COUNT(name) Countries
A HAVING clause adds a selection on top of the grouping. For instance,
SELECT currency, AVG(population)
FROM Countries
GROUP BY currency
HAVING COUNT(name) > 1
=⇒
πcurrency,AVG(population) σCOUNT(name)>1 γcurrency,AVG(population),COUNT(name) Countries
Thus the HAVING clause itself becomes an ordinary σ. Notice that, since
SELECT does not show the COUNT(name) attribute, a projection must be
applied on top.
Here is an example with ORDER BY, which is translated to τ :
SELECT currency, AVG(population)
FROM Countries
GROUP BY currency
ORDER BY COUNT(name)
=⇒
πcurrency,AVG(population) τCOUNT(name) γcurrency,AVG(population),COUNT(name) Countries
The ”official” version of relational algebra (as in the book) performs the
renaming of attributes under SELECT γ itself:
However, a more compositional (and to our mind more intuitive) way to do this
is in a separate π corresponding to the SELECT:
πcurrency,COUNT(name)→users γcurrency,COUNT(name) Countries
The official notation actually always involves a renaming, even if it is from the
aggregation expression to itself:
γcurrency,COUNT(name)→COUNT(name) Countries
This is of course semantically justified because COUNT(name) on the right
of the arrow is not an expression but an attribute (a string without syntactic
structure). However, the official notation is not consistent in this, since it does
not require corresponding renaming in the π operator.
In addition to GROUP BY, γ must be used whenever an aggregation appears
in the SELECT part. This can be understood as grouping by 0 attributes, which
means that there is only one group. Thus we translate
SELECT *
FROM Countries
GROUP BY name
HAVING count(name) > 0
Surprisingly, the result is the whole Countries table (because the HAVING con-
dition is always true), without a column for count(name). This may be a bug in
PostgreSQL. Otherwise it is a counterexample to the rule that SELECT * does
not change the relation in any way.
Finally, SELECT DISTINCT is translated by using δ:
SELECT DISTINCT currency FROM Countries
=⇒ δ(πcurrency Countries)
6.5.2 Example: pushing conditions in cartesian products
Cartesian products generate huge tables, but only fractions of them usually show
up in the final result. How can we avoid building these tables in intermediate
stages? One of the most powerful techniques is pushing conditions into products,
in accordance with the following equivalence:
Direct translation:
σcode=”EUR” AND continent=”EU” AND code=currency (countries × currencies)
Optimized translation:
σcode=currency (σcontinent=”EU” countries × σcode=”EUR” currencies)
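The SQL query behind these two translations would be something like the
following (a sketch, assuming tables countries and currencies with the attributes
used above):

SELECT *
FROM countries, currencies
WHERE code = ’EUR’ AND continent = ’EU’ AND code = currency ;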
6.6 Indexes
The SQL queries are evaluated based on values in the database. Therefore,
it is beneficial to enable fast value-based search of the data. An index is an
efficient lookup structure, making it fast to fetch information by the values of
the index. The DBMS utilizes the available indices automatically, if it recognizes
that it is evaluating an expression where data is searched by values for which an
index exists. The most prominent examples for using an index include making
a selection based on the index value or such natural join of two tables that the
join attribute appears as a key in one of the tables.
A DBMS automatically creates an index for the primary key of each table.
It is used to eliminate duplicate key values. One can manually create and drop
indexes for other attributes as needed by using the following SQL syntax:
statement ::=
CREATE INDEX indexname ON tablename (attribute+)?
| DROP INDEX indexname
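For instance, to speed up lookups of countries by currency, one could write
(the index name is arbitrary):

CREATE INDEX CurrencyIndex ON Countries (currency) ;
DROP INDEX CurrencyIndex ;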
• The disk is divided to blocks.
• Each tuple is in a block, which may contain many tuples.
• A block access is the smallest unit of time.
• Read 1 tuple = 1 block access.
• Modify 1 tuple = 2 block accesses (read + write).
• Every table is stored in some number n blocks. Hence,
– Reading all tuples of a table = n block accesses.
– In particular, lookup all tuples matching an attribute without index
= n.
– Similarly, modification without index = 2n.
– Insert new value without index = 2 (read one block and write it
back).
• An index is stored in 1 block (idealizing assumption). Hence, for indexed
attributes,
– Reading the whole index = 1 block access.
– Lookup 1 tuple (i.e. to find where it is stored) = 1 block access.
– Fetch all tuples matching a value = 1 + k block accesses (where k is the
number of matching tuples per attribute value and k << n, i.e. k is
significantly smaller than n).
– In particular, fetching a tuple if the index is a key = 2 (1 + k with
k =1).
– Modify (or insert) 1 tuple with index = 4 block accesses (read and
write both the tuple and the index).
With this model, the decision goes as follows:
1. Generate some candidates for indexes: I1 , . . . , Ik
2. Identify a set of typical SQL statements (for instance, the queries and
updates performed via the end user interface): S1 , . . . , Sn
3. For each candidate index configuration, compute the costs of the typical
statements: C(Ii , Sj )
4. Estimate the probabilities of each typical statement, e.g. as its relative
frequency: P (Sj )
5. Select the best index configuration, i.e. the one with the lowest expected
cost: argmini Σj=1..n P(Sj)C(Ii, Sj)
PhoneNumbers(name,number)
Neither the name nor the number can be assumed to be a key, since one person
can have many numbers, and many persons can share a number. Assume that
the typical statements and their relative frequencies are the following:
SELECT number FROM PhoneNumbers WHERE name=X -- 0.8
SELECT name FROM PhoneNumbers WHERE number=Y -- 0.05
INSERT INTO PhoneNumbers VALUES (X,Y) -- 0.15
Here are the costs with each indexing configuration, with ”index both” as
the winner:
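As a rough illustration (writing n for the number of blocks occupied by
PhoneNumbers, k for the number of matching tuples per value, and assuming
that every additional index adds two block accesses to an insert), the expected
costs are:

no index: 0.8·n + 0.05·n + 0.15·2
index on name: 0.8·(1 + k) + 0.05·n + 0.15·4
index on number: 0.8·n + 0.05·(1 + k) + 0.15·4
index on both: 0.8·(1 + k) + 0.05·(1 + k) + 0.15·6

Since n is normally much larger than k, indexing both attributes gives the
lowest expected cost, mainly because the dominant statement, the lookup by
name, becomes cheap.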
7 Database system reliability
Due to the critical nature of data, users intuitively assume database
systems to be highly reliable in several ways. With so much work put into speci-
fying the correctness requirements for the data, we expect the database system
to help in looking after the correctness of the data. For instance, the foreign key
requirements should be maintained when the data is updated. Some updates are
meant to be performed together, the classic example being transferring money
from one account to another, where both withdrawal and deposit should either
happen together, or not happen at all. The presence of concurrent users has to
be correctly taken into account.
Further, once updates are done, we expect them to be permanent, that is,
not unexpectedly lost later. We also expect the database system to support
limiting user rights to access only the data they are entitled to.
In this chapter, you will learn more precisely what these problems mean, what
the role of the database system is in solving them, and, most importantly, what
your role is in getting things done right. You will also learn about user
authorization to control who can access which data.
UPDATE Accounts
SET balance = balance - 100 WHERE holder = ’Alice’ ;
UPDATE Accounts
SET balance = balance + 100 WHERE holder = ’Bob’ ;
If the first update is not feasible, for instance, because Alice has less than 100
pounds, then the second update should not be performed.
A transaction is an atomic sequence of statements that is executed to-
gether: a transaction succeeds or fails as a whole, so that either all or none of
its updates are performed in the database.
We mark the beginning of a transaction with the BEGIN command, and the
COMMIT command indicates that we want to terminate the transaction success-
fully, storing all the transaction updates to the database.
The previous bank transfer collected into a transaction is as follows:
BEGIN ;
UPDATE Accounts
SET balance = balance - 100 WHERE holder = ’Alice’ ;
UPDATE Accounts
SET balance = balance + 100 WHERE holder = ’Bob’ ;
COMMIT ;
The transaction may fail for technical problems or failures, in which case
it is the duty of the database management system to take care that none of
the updates is performed. We can also cancel and roll back a transaction by a
ROLLBACK statement.
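For instance, a transfer attempt can be abandoned explicitly, leaving the database
unchanged (a sketch using the Accounts table above):

BEGIN ;
UPDATE Accounts
SET balance = balance - 100 WHERE holder = ’Alice’ ;
-- suppose we now notice that the transfer must not go through,
-- e.g. because the balance became negative
ROLLBACK ;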
In Postgres, individual SQL statements are by default transactions. This
includes the execution of triggers, which we will use for bank transfers in Section
?.?. They can e.g. be used to check that a balance never goes negative.
For the moment, assume that this is the case.
The BEGIN command starts a transaction and after that the SQL statements
belong to that transaction until it either fails and is rolled back by the system,
or a COMMIT or ROLLBACK statement is used to terminate the transaction.
One of the important properties of database systems is that they give several
users concurrent access to the data. In addition to atomicity, introduced
above, the transactions also serve for isolation, keeping concurrent users’ access
separate.
Suppose that in addition to the bank transfer transaction above, which we
call T1 for now, we have another bank transfer transaction T2 as follows.
BEGIN ;
UPDATE Accounts
SET balance = balance - 50 WHERE holder = ’Bob’ ;
UPDATE Accounts
SET balance = balance + 50 WHERE holder = ’David’ ;
COMMIT ;
T1 T2
BEGIN
Update bal(A) = bal(A)-100
Update bal(B) = bal(B)+100
BEGIN
Update bal(B) = bal(B)-50
ROLLBACK
Update bal(D) = bal(D)+50
COMMIT
We want the database system to maintain a level of isolation between trans-
actions. In the strictest case we require that each transaction only sees data that
is permanently committed to the database. This is the default. There can be ap-
plications, where e.g. transactions take a long time or we only need approximate
calculations, in which case the requirement for isolation can be relaxed.
Once the changes of a transaction are made permanent, they are assumed
to be durable, meaning that the database system will contain the data even if
there are failures like the computers that run the database system being shut
down abruptly.
The requirement above that a balance should always stay non-negative is
about consistency of the data. The requirements of maintaining required
foreign keys, non-null values, and correct data types are also about consistency.
Atomicity, Consistency, Isolation and Durability make up the ACID prop-
erties:
• A, Atomicity: the transaction is an atomic unit that succeeds or fails as
a whole.
• C, Consistency: the transaction keeps the database consistent (i.e. pre-
serves its constraints).
• I, Isolation: parallel transactions operate isolated of each other.
• D, Durability: committed transactions have persistent effect even if the
system has a failure.
The ACID properties are at the heart of database reliability. Having intro-
duced those properties, we will take a closer look at how they are maintained.
Hint 1 The main purpose of transactions is to support the maintenance of ACID
properties. But a transaction can also be faster than individual statements, as
all of them require some overhead when the system processes the start and
finish of a transaction. For instance, if you want to execute thousands of INSERT
statements, it can be better to make them into one transaction.
Hint 2 Transactions keep users isolated from the updated data, using meth-
ods introduced later in this chapter. Generally the time of isolation should be
short, and should preferably not include interaction with users who may take coffee
breaks etc. while the system is stopping other users from making progress. If
possible, collect the necessary data first and then execute the transaction in a
straightforward way.
The full syntax of starting transactions is as follows:
statement ::=
START TRANSACTION mode* | BEGIN | COMMIT | ROLLBACK
mode ::=
ISOLATION LEVEL level
| READ WRITE | READ ONLY
level ::=
SERIALIZABLE | REPEATABLE READ
| READ COMMITTED | READ UNCOMMITTED
The READ WRITE and READ ONLY modes indicate what interferences are
possible. This, as well as the effect of level, are discussed in more detail later in
this chapter.
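For example, a long report that only reads data could be run as follows (a
sketch; the isolation levels actually supported vary between systems):

START TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY ;
SELECT SUM(balance) FROM Accounts ;
COMMIT ;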
We will continue to study them, in the order they are listed. In short,
atomicity is provided by transactions, so correct use of them has a key position.
Maintaining consistency has numerous options that can be specified, so it is a
relatively large topic in this book. Isolation maintenance has some options that
are good to know. Durability mechanisms are executed by the database system,
but it is good to know what it takes to roll back the database state in time,
and, of course, taking backups is necessary.
There are two main approaches for maintaining isolation: locking and timestamping,
and the latter naturally combines with multiversioning.
Behind the techniques there is the idea that if the result of concurrent trans-
action execution is the same as if the transactions were executed in a serial
order, one at a time, then we consider the execution correct.
A schedule is an ordering of the operations of transactions. A serial sched-
ule is a schedule where transactions are executed one after the other. The fol-
lowing schedule is a serial schedule. The other possible serial schedule for T1
and T2 would be to execute all of T2 first, then followed by T1.
T1 T2
BEGIN
Update bal(A) = bal(A)-100
Update bal(B) = bal(B)+100
COMMIT
BEGIN
Update bal(B) = bal(B)-50
Update bal(D) = bal(D)+50
COMMIT
T1 T2
BEGIN
Select bal(A)
BEGIN
Select bal(D)
Update bal(B) = bal(B)+100
COMMIT
Update bal(A) = bal(A)+50
COMMIT
Both transactions T1 and T2 access the data item bal(A). However the
outcome is the same as if T1 executed first, then followed by T2. Finally, the
following schedule is not serializable, as the write of T1 affects the values seen
by T2 and vice versa.
T1 T2
BEGIN
BEGIN
Update bal(A) = bal(A)+50
Select bal(A)
Update bal(B) = bal(B)+100
Select bal(B)
COMMIT
COMMIT
7.3.1 Locking
In locking, when a process needs access to a data item, it will ask for a lock
on the item. There are two types of locks: a Read lock (R-lock, shared lock,
S-lock), and a Read-Write lock (RW lock, exclusive lock, X-lock). Having an
R-lock to a data item gives the right to read that item, while having an RW-lock
to a data item gives the right to both read and write the data item.
The general idea is based on the following thinking. Reads do not generate
any dirty data, so we can allow several transactions to hold an R-lock simulta-
neously. As writes generate temporary dirty data not to be read by others, only
one transaction can hold an RW-lock to a data item at any time, and at that
time no other transaction can even hold an R-lock. Locks are usually automat-
ically requested by the database management system - in that case the user or
programmer does not need to issue additional commands for them.
If a process needs to read a data item and has no previous lock on it, then it
asks for an R-lock on the item. If some other process holds an RW-lock to the
item, then no lock is given and the process either rolls back or waits. If there
are only R-locks or there are no locks on the item, then the R-lock is given to
the process and the information of the lock is inserted into a lock table holding
information on all locks that hold in the database.
Similarly, if a process wants to write a data item, then it asks for an RW-
lock on the item. If some other process holds any kind of lock, then the lock
cannot be given and the requesting process either rolls back or waits. Otherwise
the lock is given and inserted to the lock table.
When to release locks? Let us review the earlier balance transfer example
and add explicit lock operations.
T1                                T2
BEGIN
Take RW-lock on bal(A)
Update bal(A) = bal(A)-100
Take RW-lock on bal(B)
Update bal(B) = bal(B)+100
                                  BEGIN
Release RW-lock on bal(A)
Release RW-lock on bal(B)
                                  Take RW-lock on bal(B)
                                  Update bal(B) = bal(B)-50
ROLLBACK
                                  Take RW-lock on bal(D)
                                  Update bal(D) = bal(D)+50
                                  COMMIT
This way, the locks did not help us to prevent reading dirty data. To
maintain maximum isolation, a transaction should release its locks only at the end of
the transaction. This creates a serializable schedule, where the order of RW-locks
on the same items, as well as of R-locks on the items where someone at some point
has an RW-lock, determines the serial order.
The waiting transactions may eventually get the locks when other transac-
tions terminate. Otherwise they may time out and roll back. They may also
stay waiting, but then there may be problems like deadlock, where there is a
cycle of processes each waiting for the locks that the next process in the cycle
holds. This can be detected by the system and some ”random victim” can be
sacrificed for the common good. Another problem is that a process that rolls
back and restarts might not get the locks even after repeated restarts - this is
called starvation. As a further problem with locking, transactions accessing
large amounts of data will create a lot of entries in the lock tables, making
the lock tables big.
7.3.2 Timestamping
The basic idea is that when a transaction starts, it is given a timestamp (which
can be just a serial number). When the transactions proceed, their operations
are analysed. If the schedule thus formed can be made equal to the serial
schedule of executing the transactions in timestamp order, then transactions
can proceed. Else, some transaction(s) must be rolled back.
Timestamping with multiversioning is based on the idea that several versions
of the data item are stored, along with the information on when they have been
read and when they have been written. This gives more flexibility, as an earlier
transaction (by timestamp) can still access old data correct for it, even if some
later transaction has written a new value for that data item. We only consider
timestamping with multiversioning here.
To read a data item value, a transaction simply goes through the list of
values for that data item, and chooses the right value by the timestamp.
For instance, if the list, assumed here to be ordered, has <write timestamp,
value> pairs <0,5.0>, <104,2.0>, <109,11.0> for some data item A, then a
transaction with timestamp 107 wanting to read A would pick the value 2.0 as
it is the last value written before time 107. The value written at time 0 is too
old and the value written at 109 is too new.
If we ensure that we do not clean up old values that some running transaction
may still need, then reads cannot fail. We can remove values from the list when
there are no such transactions anymore that would need those values. For
instance, if all executing transactions have at least timestamp 104, then we
could remove <0,5.0> from the list. We can also choose to keep them, to have
access to historical data.
Writing, however, can fail if a transaction tries to write data that some-
one should have seen (by timestamp). To determine this, we need to store the
biggest timestamp for transactions that have seen the value. This way, we need
a triple <write timestamp, latest read timestamp, value> for each data item
version. Suppose we add for data item A the latest read timestamps as follows:
<0,70,5.0>, <104,106,2.0>, <109,109,11.0>. Suppose that a transaction with
timestamp 107 wants to write a value 3.0 to A. We know from the timestamps
that no-one has read the value between 106 and 109, so we may add a new ver-
sion, resulting in: <0,70,5.0>, <104,106,2.0>, <107,107,3.0>, <109,109,11.0>.
Suppose, now, that after this a transaction holding timestamp 105 wants
to write the value 6.0 to A. This is not directly doable, as transaction 106 has
read the value 2.0 and for serializable execution, it should have read the value
written by transaction 105. Thus, we either need to roll back transaction 105
or transaction 106, as they are in conflict. (If transaction 105 wanted to write
exactly the same value 2.0 that was valid since 104, then we could get away
from this by just adding a new entry as we know that that is the value read by
transaction 106.)
Rolling back transactions has some complications when using timestamps. If
we roll back a transaction that has written some data that another transaction
has read, then also that transaction has to be rolled back. This way, the rollbacks
may cascade.
Choosing the transaction to be rolled back when a conflicting write is about
to take place also has implications. The writing transaction is clearly
the older transaction. If it is rolled back and then restarted, there is a risk of
starvation. If the younger transactions (the ones that should have seen the data
item the older transaction is about to write) are rolled back instead, then we need to do
additional book-keeping: we need to track all such transactions, which means
that we need to store all reading transactions’ timestamps, not just the last one.
In that case, cascading rollback may hit all the younger transactions. Even
though reading transactions never block, they then have to wait for earlier
transactions to commit or roll back before they can decide whether they can commit
or whether cascading rollback hits them.
A clear bonus of timestamping with multiversioning is that it has very little
overhead if there are no conflicts, as we do not need to write and check lock
tables when accessing the data.
As an example, consider a flight booking database with a table
Seats(date,flight,seat,status)
However, if many customers want the same flight on the same day, how
should the system behave? The safest policy is only to allow one booking trans-
action at a time and let all others wait. This would guarantee the ACID prop-
erties. The opposite is to let SELECTs and UPDATEs be freely intertwined.
This could lead to interference problems of the following kinds.
• Dirty reads: reading dirty data resulting from a concurrent uncommitted
transaction. A practical consequence could be e.g. overbooking a flight.
T1 T2
WRITE a
READ a
ROLLBACK
• Non-repeatable reads: read data twice and get different results (be-
cause of concurrent committed transaction that modifies or deletes the
data).
T1 T2
READ a
UPDATE a = a’
COMMIT
READ a
• Phantoms: a repeated query returns rows that were not there before, be-
cause a concurrent committed transaction has inserted new data.
T1 T2
SELECT * FROM A
INSERT INTO A a’
COMMIT
SELECT * FROM A
The following table shows which interferences are allowed by which isolation
levels, from the strictest to the loosest:
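isolation level      dirty reads    non-repeatable reads    phantoms
SERIALIZABLE         no             no                      no
REPEATABLE READ      no             no                      allowed
READ COMMITTED       no             allowed                 allowed
READ UNCOMMITTED     allowed        allowed                 allowed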
7.4 Consistency
Here we take a deeper look at inserts, updates, and deletions, in the presence of
constraints. The integrity constraints of the database may restrict these actions
or even prohibit them. An important problem is that when one piece of data
is changed, some others may need to be changed as well. For instance, when
a row is deleted or updated, how should this affect other rows that reference
it as foreign key? Some of these things can be guaranteed by constraints in
basic SQL. But some things need more expressive power. For example, when
making a bank transfer, money should not only be taken from one account, but
the same amount must be added to the other account. Database management
systems support triggers, which are programs that do many SQL actions at
once. We can utilize triggers to maintain the consistency of databases. Notably,
they can be used for other things as well (even for destroying consistency, when
used inappropriately). The concepts discussed in this chapter are also called
active elements, because they affect the way in which the database reacts to
actions. This is in contrast to the data itself (the rows in the tables), which is
”passive”.
Assertions are, however, not supported by DBMSs in practice, be-
cause they can be inefficient. Therefore one should use triggers to mimic
assertions. This is what we do in this course.
Constraints that are marked DEFERRABLE in a CREATE TABLE statement
are checked only at the end of each transaction (Section 9.2). This is useful,
for instance, if one INSERT in a transaction involves a foreign key that is given
only in a later statement.
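As a running example, assume a table of bank accounts along the following lines
(a minimal sketch; the exact attributes are not important here):
CREATE TABLE Accounts (
  number  TEXT PRIMARY KEY,
  type    TEXT,
  balance INT CHECK ( balance >= 0 )
) ;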
The type attribute is supposed to hold the information of the bank account
type: a savings account, a current account, etc. Let us then add a table with
transfers, with foreign keys referencing the first table:
CREATE TABLE Transfers (
  sender TEXT REFERENCES Accounts(number),
  recipient TEXT REFERENCES Accounts(number),
  amount INT
)
Recall that a table once created can be changed later with an ALTER TABLE
statement, including adding and dropping constraints.
Hence a good reason to name constraints is to make DROP CONSTRAINT possible.
Here is a ”constraint” that is not possible in Transfers, trying to say that
the balance cannot be exceeded in a transfer:
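One attempt (only an illustration of the idea) would be an inline CHECK on the
amount attribute:
amount INT CHECK ( amount <= (SELECT balance FROM Accounts WHERE number = sender) )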
The reason is that the Transfers table cannot refer to the Accounts table (other
than in FOREIGN KEY constraints). However, in this case, this is also un-
necessary to state: because of the positive balance constraint in Accounts, a
transfer exceeding the sender’s balance would be automatically blocked.
Here is another attempted condition for Accounts, which tries to set a min-
imum for the money in the bank:
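One could try to add, inside the CREATE TABLE Accounts statement, a constraint such
as the following (the limit is just an example):
CHECK ( (SELECT sum(balance) FROM Accounts) >= 1000000 )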
However, this is not valid SQL. The problem is that constraints in CREATE TABLE
statements may not use aggregation functions (here, sum(balance)), because a
constraint can only address individual rows, not entire tables. Global constraints
in SQL are called assertions, and can be defined as follows:
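For instance, the minimum-reserve condition above could be attempted as an assertion
(the name and the limit are again just examples):
CREATE ASSERTION MinimumReserve
  CHECK ( (SELECT sum(balance) FROM Accounts) >= 1000000 ) ;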
However, assertions are not supported by modern DBMSs such as Post-
greSQL, because they are considered too expensive to maintain, since they must
be checked after every change in the data. What remains is to use triggers, which
can be defined to apply only after narrowly specified changes, not every time
anything is changed.
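The first example of a trigger keeps the Accounts table consistent with the
Transfers table: whenever a transfer row is inserted, the two balances are updated
accordingly. The code below is only a minimal sketch of such a function and trigger,
assuming the Accounts and Transfers tables above:
CREATE FUNCTION make_transfer() RETURNS TRIGGER AS $$
BEGIN
  -- checks of the new row could be placed here (see below)
  UPDATE Accounts SET balance = balance - NEW.amount WHERE number = NEW.sender ;
  UPDATE Accounts SET balance = balance + NEW.amount WHERE number = NEW.recipient ;
  RETURN NEW ;
END
$$ LANGUAGE plpgsql ;

CREATE TRIGGER transfer
  AFTER INSERT ON Transfers
  FOR EACH ROW
  EXECUTE PROCEDURE make_transfer() ;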
After defining such a function, we can call it in a trigger, as in the sketch above.
The full details of the trigger syntax are, again, given in the next section.
7.4.5 The syntax of triggers
Triggers in PostgreSQL are written in the language PL/PGSQL which is almost
standard SQL, with an exception: the trigger body can only be a function call,
and the function must be written separately.
Here is the part of the syntax that we will need in our discussion:
functiondefinition ::=
CREATE FUNCTION functionname() RETURNS TRIGGER AS $$
BEGIN
* statement
END
$$ LANGUAGE 'plpgsql'
;
triggerdefinition ::=
CREATE TRIGGER triggername
whentriggered
FOR EACH ROW|STATEMENT
? WHEN ( condition )
EXECUTE PROCEDURE functionname
;
whentriggered ::=
BEFORE|AFTER events ON tablename
| INSTEAD OF events ON viewname
statement ::=
IF ( condition ) THEN statement+ elsif* END IF ;
| RAISE EXCEPTION 'message' ;
| sqlstatement ;
| RETURN NEW|OLD|NULL ;
Comments:
• a trigger is activated BEFORE or AFTER an event, which is a change in the
database - either INSERT, UPDATE, or DELETE
• the statements may refer to attributes in NEW rows (in case of INSERT and
UPDATE) and in OLD rows (in case of UPDATE and DELETE).
• FOR EACH ROW means that the trigger is executed for each new row affected
by the change
• FOR EACH STATEMENT means that the trigger is executed just once for the
whole statement, even when it affects several rows
• a trigger is an atomic transaction, which either succeeds or fails totally
(see below for more details).
• The PL/PGSQL functions have also access to further information, such
as
– function now() telling the current time
– variable TG_OP telling the type of event (INSERT, UPDATE, or DELETE)
The function make_transfer() could also contain checks of conditions, be-
tween the BEGIN line and the first UPDATE:
IF (NEW.sender = NEW.recipient)
THEN RAISE EXCEPTION 'cannot transfer to oneself' ;
END IF ;
CREATE OR REPLACE FUNCTION make_transfer_log() RETURNS TRIGGER AS $$
BEGIN
IF (TG_OP = 'INSERT') THEN
INSERT INTO TransferLog SELECT now(), user, 'I', NEW.*;
ELSIF (TG_OP = 'UPDATE') THEN
INSERT INTO TransferLog SELECT now(), user, 'U', NEW.*;
ELSIF (TG_OP = 'DELETE') THEN
INSERT INTO TransferLog SELECT now(), user, 'D', OLD.*;
END IF;
RETURN NULL;
END
$$ LANGUAGE plpgsql ;
statement ::=
GRANT privilege+ ON object TO user+ grantoption?
privilege ::=
SELECT | INSERT | DELETE | UPDATE | REFERENCES | ...
| ALL PRIVILEGES
object ::=
tablename (attribute+)+ | viewname (attribute+)+ | trigger | ...
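For example, a (hypothetical) user alice could be given the right to read the
Transfers table and to insert new transfers, together with the right to pass these
privileges on:
GRANT SELECT, INSERT ON Transfers TO alice WITH GRANT OPTION ;
The privileges can later be withdrawn with a REVOKE statement:
REVOKE INSERT ON Transfers FROM alice ;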
Many users, though, access the data using a program provided for them.
This program may be a part of the web application, so in fact the user may use
a web page, which then sends a request to a server which provides the data. In
such cases the program functionalities limit the data that the user can access.
The program logs in to the database as a database user and gets the relevant
access. It makes sense to pay attention to the rights given to the programs for
various reasons, such as programming errors, various attacks, etc.
8 SQL in software applications
SQL was designed to be a high-level query language, for direct query and ma-
nipulation. However, direct access to SQL (e.g. via the PostgreSQL shell) can
be both too demanding and too powerful. Most database access by end users
hence takes place via more high-level interfaces such as web forms. Most web
services such as bank transfers and booking train tickets access a database.
End user programs are often built by combining SQL and a general purpose
programming language. This is called embedding, and the general purpose
language is called a host language.
The host language in which SQL is embedded provides GUIs and other
means that makes data access easier and safer. Databases are also accessed by
programs that analyse them, for instance to collect statistics. Then the host
language provides computation methods that are more powerful than those
available in SQL.
In this chapter, you will learn the basic principles of embedding SQL
into a procedural language (Java), a functional language (Haskell), and a
multi-paradigm language (Python). However, the basics of these languages are
necessary prerequisites for this chapter. We will also cover some pitfalls in
embedding.
A practical challenge in the embedding is making the data types match,
starting from the primitive data types, and including the abstract data types
such as tables and views, including their schemas. There is a mismatch between
the primitive data types of SQL and practically all languages into which SQL is
embedded. One has to be careful with string lengths, integer sizes, etc. As
a further problem, programming languages typically do not implement NULL
values in the same way as SQL, and the possibility of NULL values needs to be taken
into account in application programming.
There are also security issues. For instance, a security hole can enable SQL
injection where an end user can include SQL code in the data that she is asked
to give. In one famous example, the name of a student includes a piece of code
that deletes all data from a student database. To round off, we will look at the
highest level of database access from the human point of view: natural language
queries.
The code snippets and class diagrams in this chapter are meant to give
an idea of the proposed ways to implement database connectivity. The book
website contains more, and more complete, examples.
> Sweden
Stockholm
>
The program is very rough in the sense that it does not even recover from errors
or terminate gracefully. Thus the only way to terminate it is by ”control-C”.
A more decent program is shown on the course website - a template of which
Figure 9 is a stripped-down version.
The SQL-specific lines are marked *.
• The first one loads the java.sql JDBC functionalities.
• The second one, in the main method, loads a PostgreSQL driver class.
• The next three ones define the database url, username, and password.
• Then the connection is opened with these parameters. The rest of the main
method sets up the user interaction loop as a ”console”.
• The method getCapital that sends a query and displays the results can
throw an exception (a better method would catch it). This exception
happens for instance when the SQL query has a syntax error.
• The actual work takes place in the body of getCapital:
– The first thing is to create a Statement object, which has a method
for executing a query.
– Executing the query returns a ResultSet, which is an iterator for all
the results.
– We iterate through the rows with a while loop on rs.next(), which
returns false when all rows have been scanned.
– For each row, we print column 2, which holds the capital.
– The ResultSet object rs enables us to getString for the column.
– At the end, we close the result set rs and the statement st nicely to
get ready for the next query.
The rest of the code is ordinary Java. For the database-specific parts, excel-
lent Javadoc documentation can be found by googling for the APIs with class
and method names. The only tricky thing is perhaps the concepts Connection,
Statement, and ResultSet:
import java.sql.*; // JDBC functionalities *
import java.io.*; // Reading user input
8.2 Building queries and updates from input data
When building a query, it is obviously important to get the spaces and quotes in
right positions! A safer way to build a query is to use a prepared statement.
It has question marks for the arguments to be inserted, so we don’t need to care
about spaces and quotes. But we do need to select the type of each argument,
with setString(Arg,Val), setInt(Arg,Val), etc.
static void addMountain(Connection conn, String name, String continent, String height)
throws SQLException
{
PreparedStatement st =
conn.prepareStatement("INSERT INTO Mountains VALUES (?,?,?)") ;
st.setString(1,name) ;
st.setString(2,continent) ;
st.setInt(3,Integer.parseInt(height)) ;
st.executeUpdate() ;
st.close() ;
}
Now, how to divide the work between SQL and the host language, such as Java?
As a guiding principle,
Put as much of your program in the SQL query as possible.
In a more complex program (as we will see shortly), one can send several queries
and collect their results from result sets, then combine the answer with some Java
programming. But this is not using SQL’s capacity to the full:
• You miss the optimizations that SQL provides, and have to reinvent them man-
ually in your code.
• You increase the network traffic and pull unnecessary data from the database
server.
• Your code is likely to be slower than that of the database system.
Just think about the ”pushing conditions” example from Section 6.5.2.
If you just run the first line of the query with SQL and do the WHERE part in Java, you may
have to transfer thousands of times more rows, which you moreover have to inspect,
than when doing everything in SQL.
Figure 10: SQL oriented datatype example for Java
program can be implemented using views, which takes the complications of the query
closer to the database system.
We will present one way, consistent with the primitive datatype management above,
to wrap relations into classes. Wrapping of views can be done in a similar way, apart
from the fact that one cannot by default update through views. The db class holds the
database functionality, while the bean class offers standard setters and getters. The
top level class is left for the application specific code. If the database changes, just the
base and the db class can be replaced, leaving the application specific code hopefully
working as before.
The example, given in Figure 11, wraps a single simplified Countries table in Java
classes.
Note that these classes should be generated automatically from the database de-
scription using a database design tool, such as the QueryConverter.
import Database.HDBC -- you need to install this, e.g.: cabal install HDBC
import Database.HDBC.PostgreSQL (connectPostgreSQL) -- cabal install HDBC-postgresql
main = do
conn <- connectPostgreSQL -- this part needs to be changed for your database
"host=localhost dbname=testdb user=testuser password=XXXXX"
withTransaction conn updateCountry -- use transactional mode, explicit commit/rollback
getCountries <- prepare conn "SELECT * FROM countries;"
execute getCountries []
result <- fetchAllRows getCountries
putStrLn $ show result
disconnect conn
return ()
updateCountry conn = do
updateCountry <-
prepare conn "UPDATE countries SET continent = ?, capital = ? WHERE name = ?"
execute updateCountry [SqlNull, toSql "Habana", toSql "Cuba"]
commit conn -- so we commit here
The result is a list of countries, each country further represented as a list. Removing
all but Finland and Cuba, the list looks like:
[ [SqlByteString "Finland",SqlByteString "Helsinki",SqlDouble 337030.0,
SqlInteger 5244000,SqlByteString "EU",SqlByteString "EUR"],
[SqlByteString "Cuba",SqlByteString "Habana",SqlDouble 110860.0,
SqlInteger 11423000,SqlNull,SqlByteString "CUP"] ]
As the example suggests, the values are wrapped in special Sql datatypes for
HDBC, where SqlNull represents the NULL value. Robustness can be improved by
catching SqlError type exceptions, which could occur for various reasons, such as
the connection failing. Also, if we try to update a country area to some rubbish like
toSql "oink-oink" we will get an SqlError.
Even with the Sql types there is a need for caution. The need to check that string
lengths match the database content is one issue; another is to check that the data
format is correct to start with. Similarly as with Java, there is a need for a systematic
way to construct your software to deal with the type issues. A potential mismatch
between Haskell types and database types is that Haskell's Nothing is not the same as
SQL NULL: NULL is a no-information missing value, leaving it open whether a value exists
or not, whereas Nothing is used to mark a does-not-exist missing value, i.e. a value
that does not exist at all.
According to the documentation, everything in HDBC happens within transac-
tions. However, in practice with Postgres the situation is the same as using the Post-
gres interactive shell: if a transaction is not started explicitly, each database operation
is a transaction of its own. However, the withTransaction function, used in the ex-
ample, calls the function and runs it within a transaction. Changing the commit at the
last line of the example into a rollback would lead to the update being rolled back.
With Java, above, we suggest implementing specific classes matching the database
tables and views. Similarly, in Haskell it is advisable to define types matching the
database table and view schemas, using e.g. record types.
try:
conn = psycopg2.connect("dbname=’testdb’ user=’testuser’ password=’XXXXX’")
cur = conn.cursor()
my_update = """UPDATE countries SET continent = %s, capital = %s WHERE name = %s"""
my_data = (None, "Habana", "Cuba") # query is defined separately from data, None is Null
cur.execute(my_update, my_data) # and they are bound together here
conn.commit()
cur.execute("SELECT * from countries")
rows = cur.fetchall()
print ("\nCountries: \n")
for row in rows:
print (" ", row[1], " ", row[2], " ", row[3], " ", row[4], " ", row[5])
The code largely follows the same lines as the Haskell code, with a couple of
exceptions, commented in the code.
Similarly as with Java and Haskell, it is advisable to define types matching the
database table and view schemas, using e.g. record types.
SELECT * FROM Persons
WHERE name = 'John'' OR 0=0;--'
where the first single quote is escaped and the whole string is hence included in the
name.
In the Haskell code, the HDBC is based on the use of prepared statements.
In Python, the DB-API does not have an explicit function call to create a pre-
pared statement separately. The execute function prepares the statement and then
uses the parameter list to populate the %s parameters with values. This would not
allow for recycling the SQL query. However, the DB-API specifies another function,
executemany, which is given a list of parameter value sets, to be used repeatedly in
the SQL query.
Since the DB-API does not have an explicit call to create a prepared statement,
the psycopg2 library also does not have a call to create a prepared statement.
9 Document databases
The relational data model has been dominating the database world for a long time,
and it is still the appropriate model for many needs. There are, however, data that are
not that naturally encoded as tables. In particular, documents with a hierarchical, hetero-
geneous structure have been a tough case for relational databases. A particular concern
in document databases is the query language or facility, which should offer search structures
more suitable for such documents than SQL.
These documents are modelled typically as key-value pairs, using JSON (JavaScript
Object Notation) or similar language. As the values may hierarchically consist of lists
of key-value pairs, the notion is general and flexible. We will introduce the usage of
JSON using Postgres and MongoDB systems.
Another approach is XML (eXtensible Markup Language). The hierarchical docu-
ment structure is modelled through markup tags. Web page language HTML is an ex-
ample of XML use. XML has designated query languages, such as XPath and XQuery.
This chapter introduces XML and gives a summary of XPath.
9.1 JSON
JSON (JavaScript Object Notation)5 is similar to XML as it supports arbitrarily
nested, tree-like data structures. Data in JSON is grouped in objects, which are
key-value pairs enclosed in curly braces. Thus the following object encodes an
English-Swedish dictionary entry:
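{"pos": "noun", "english": "computer", "swedish": "dator"}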
JSON objects are similar to objects in Java and JavaScript, and to records in some
other programming languages. A peculiarity of JSON is that the keys are quoted, not
only the string values.
Both objects and values can be collected to arrays in square brackets. Thus a
dictionary with two entries is an array of two objects:
[
{"pos": "noun", "english": "computer", "swedish": "dator"},
{"pos": "verb", "english": "compute", "swedish": "beräkna"}
]
and an entry with three Swedish equivalents for an English word has an array of three
string values:
Some things to notice:
• both keys and values are strings in double quotes
• number values can also be given without quotes
• the values null, true, and false (written without quotes) have special semantics
• JSON data can be freely divided into lines
• the order of key-value pairs in an object doesn’t matter
• the order of items in an array does matter
JSON is a very flexible format, and it does not require similar regularity as
tables in SQL. You can e.g. create an array of elements all having different
structures.
[
{"pos": "noun", "english": "computer", "swedish": "dator"},
{"name": "John", "position": "king"}
]
As an element may be an array of elements, it is possible to create arbitrary
nested structures.
[
  {"name": "John",
   "relationship": "friend",
   "skills": [ {"speaks": "English", "repairs": "bicycles"} ],
   "kids": [ {"name": "Jill",
              "pets": [ {"fish": "goldfish",
                         "dog": {"name": "Rekku", "race": "cairn terrier"}} ]},
             {"name": "Bob"} ]},
  {"name": "Ben",
   "relationship": "neighbour"}
]
However, JSON can also be queried with XPath, introduced later in this chapter,
and even PostgreSQL has facilities for inserting and querying JSON data.
In PostgreSQL, an extension of standard SQL introduces the type json as a possible
data type of a column7 . For example, a dictionary entry could be defined as follows:
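A minimal sketch of such a table (it could of course have further attributes) is:
CREATE TABLE Entries (
  entry json
) ;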
The entry part, alas, does not have a specified schema, so the database user must
rely on the data being of an expected format. But given that the user has sufficient
knowledge about the JSON in the entry attribute, one can make the following kinds
of queries:
• retrieve JSON objects with ordinary attribute selection
SELECT entry
FROM Entries
;
{"pos": "noun", "english": "computer", "swedish": "dator"}
‘ {"pos": "verb", "english": "compute", "swedish": "beräkna"}
• retrieve values from JSON objects with the -> operator
SELECT entry -> 'english'
FROM Entries
;
"computer"
"compute"
• retrieve values from JSON objects in unquoted text with the ->> operator
SELECT entry ->> 'english'
FROM Entries
;
computer
compute
• filter with the same arrow operators in WHERE clauses
SELECT entry -> 'english'
FROM Entries
WHERE entry ->> 'pos' = 'noun'
;
computer
Populating a PostgreSQL database with JSON values can be done with the usual
INSERT statements:
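For instance, the entry shown above can be inserted as a string literal, which
PostgreSQL converts to the json type:
INSERT INTO Entries VALUES
  ('{"pos": "noun", "english": "computer", "swedish": "dator"}') ;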
9.3 MongoDB*
(not covered in the course)
7 http://www.postgresqltutorial.com/postgresql-json/
9.4 XML and its data model
XML (eXtensible Markup Language) is a notation for documents and data. For
documents, it can be seen as a generalization of HTML (Hypertext Markup Language):
HTML is just one of the languages that can be defined in XML. If more structure is
wanted for special kinds of documents, HTML can be ”extended” with the help of
XML. For instance, if we want to store an English-Swedish dictionary, we can build
the following kind of XML objects:
<word>
<pos>Noun</pos>
<english>computer</english>
<swedish>dator</swedish>
</word>
(where pos = part of speech = ”ordklass”). When printing the dictionary, this object
could be transformed into an HTML object,
<p>
<i>computer</i> (Noun)
dator
</p>
But the HTML format is less suitable when the dictionary is used as data, where one
can look up words. Therefore the original XML structure is better suited for storing
the dictionary data.
The form of an XML data object is
<tag> content </tag>
where <tag> is the start tag and </tag> is the end tag. A limiting case is tags
with no content in between, which has a shorthand notation,
<tag/> = <tag></tag>
A grammar of XML is given in Figure 12, using the same notation for grammar
rules as used for SQL before.
All XML data must be properly nested between start and end tags. The syntax is
the same for all kinds of XML, including XHTML (which is XML-compliant HTML):
Plain HTML, in contrast to XHTML, also allows start tags without end tags.
From the data perspective, the XML object corresponds to a row in a relational
database with the schema
Words(pos,english,swedish)
A schema for XML data can be defined in a DTD (Document Type Declaration).
The DTD expression for the ”word” schema assumed by the above object is
document ::= header? dtd? element
header ::= "<?xml version=1.0 encoding=utf-8 standalone=no?>"   ## standalone=no if with DTD
dtd ::= <! DOCTYPE ident [ definition* ]>
attribute ::= ident type #REQUIRED|#IMPLIED
type ::= CDATA | ID | IDREF
element ::= starttag element* endtag | emptytag
starttag ::= < ident attr* >
endtag ::= </ ident >
emptytag ::= < ident attr* />
attr ::= ident = string   ## string in double quotes
cond ::= [ exp op exp ] | [ integer ]
exp ::= "@"? ident | integer | string
op ::= = | != | < | > | <= | >=
Figure 12: A grammar of XML (the last three rules are used in XPath conditions).
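<!ELEMENT word (pos, english, swedish)>
<!ELEMENT pos (#PCDATA)>
<!ELEMENT english (#PCDATA)>
<!ELEMENT swedish (#PCDATA)>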
The entries in a DTD define elements, which are structures of data. The first line
defines an element called word as a tuple of elements pos, english, and swedish. These
other elements are all defined as #PCDATA, which means parsed character data. It
can be used for translating TEXT in SQL. But it is moreover parsed, which means that
all XML tags in the data (such as HTML formatting) are interpreted as tags. (There is
also a type for unparsed text, CDATA, but it cannot be used in ELEMENT declarations,
but only for values of attributes.)
XML supports more data structures than the relational model. In the relational
model, the only structure is the tuple, and all its elements must be atomic. In full XML,
the elements of tuples can be structured elements themselves. They are called daugh-
ter elements. In fact, XML supports algebraic datatypes similar to Haskell’s data
definitions:
• Elements can be defined as
tuples of elements: E, F
lists of elements: E*
nonempty lists of elements: E+
alternative elements: E | F
optional elements: E?
strings: #PCDATA
-- XML
<expression>
<addition>
<expression>
<constant>23</constant>
</expression>
<expression>
<multiplication>
<expression>
<constant>15</constant>
</expression>
<expression>
<variable>x</variable>
</expression>
</multiplication>
</expression>
</addition>
</expression>
-- Haskell
data Expression =
Variable String
| Constant String
| Addition Expression Expression
| Multiplication Expression Expression
Addition
  (Constant "23")
  (Multiplication
    (Constant "15")
    (Variable "x"))
Figure 13: A complete XML document for arithmetic expressions and the cor-
responding Haskell code, with 23 + 15x as example.
To encode SQL tuples, we only need the tuple type and the PCDATA type. However,
this DTD encoding does not capture all parts of SQL’s table definitions:
• basic types in XML are not so refined: basically only TEXT is available
• constraints cannot be expressed in the DTD
Some of these problems can be solved by using attributes rather than elements. Here
is an alternative representation of dictionary entries:
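A sketch of such a declaration (exactly which attributes are marked optional is just
an illustration):
<!ELEMENT word EMPTY>
<!ATTLIST word
  pos     CDATA #REQUIRED
  english CDATA #REQUIRED
  swedish CDATA #IMPLIED>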
The #REQUIRED keyword is similar to a NOT NULL constraint in SQL. Optional attributes
have the keyword #IMPLIED.
Let us look at another example, which shows how to model referential constraints:
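A sketch of the idea (the element contents and further attributes are left out for brevity):
<!ELEMENT Currency EMPTY>
<!ATTLIST Currency
  code ID #REQUIRED>
<!ELEMENT Country EMPTY>
<!ATTLIST Country
  name     CDATA #REQUIRED
  currency IDREF #IMPLIED>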
The code attribute of Currency is declared as ID, which means that it is an identifier
(which moreover has to be unique). The currency attribute of Country is declared
as IDREF, which means it must be an identifier declared as ID in some other element.
However, since IDREF does not specify which element, it only goes half way in express-
ing a referential constraint. Some of these problems are solved in an alternative format
to DTD, called XML Schema.
Attributes were originally meant for metadata (such as font size) rather than data.
In fact, the recommendation from W3C is to use elements rather than attributes for
data (see http://www.w3schools.com/xml/xml_dtd_el_vs_attr.asp). However, since at-
tributes enable some constraints that elements don’t, their use for data can sometimes
be justified.
There is an on-line XPath test program in http://xmlgrid.net/xpath.html
There is a more expressive query language called XQuery, which extends XPath.
Another possibility is to use XSLT (eXtensible Stylesheet Language for Transforma-
tions), whose standard use is to convert between XML formats (e.g. from dictionary
data to HTML). Writing queries in a host language (in a similar way as in JDBC) is
of course also a possibility.
9.6 YAML*
(not covered in the course)
10 Big Data
Big Data is a term used for data whose mere size is a problem. Big Data typically
refers to amounts of data that cannot be treated with traditional methods. Data may
also be big in the sense that it is generated rapidly, or there is ’big’ variation in data
types, attributes, etc. At the time of writing, when we just talk about data volume,
Big Data is often expected to be at least terabytes (10^12 bytes), maybe even petabytes
(10^15 bytes). Big Data usually refers to situations where the computations cannot be feasibly
performed using a single computer, even a powerful one.
The database world, though, has grown up with the need for more and more data. There are
database solutions that manage huge amounts of data in distributed SQL databases,
so just the amount of data itself does not serve as a justification for new solutions.
However, applications with relaxed requirements of the ACID properties, such as social
media, music services, etc., have enabled other types of solutions. For example, it is
not so fatal if a comment on social media is lost, or if a sum over huge amounts of
approximate values takes some dirty data into account, and so on.
In Big Data, data is usually distributed, maybe to thousands of computers (or
millions in the case of companies like Google). The computations must also be par-
allelizable as much as possible. This has implications for big data systems, which
makes them different from relational databases:
• simpler queries (e.g. no joins, search on indexed attributes only)
• looser structures (e.g. no tables with rigid schemas)
• less integrity guarantees (e.g. no checking of constraints, no transactions)
• more redundancy (”denormalization” to keep together data that belongs to-
gether)
In database design, we split the data into tables, following a logical design. In
traditional database applications, the database system stores the database data
on a single server, but with Big Data it could be that a single table is too big
for a single server. So now, the tables are further split physically onto different
servers.
We will have a look at some of the concepts through the Postgres and Cassandra
DBMSs and then discuss some conclusions and observations on Big Data in
databases.
The CREATE TABLE statement introduces, not just the table contents but also
the fact that we will partition that table by dates. The partitions are created sepa-
rately, e.g.:
which create partitions for the first two quarters of year 2020.
If, then, you execute the following statement:
you will notice that the inserted row can be queried from both rainfall and
rainfall_y2020q2, but querying rainfall_y2020q1 does not find this row.
This means that the value is actually stored in the partition determined by the
date. It can be inserted and retrieved through the top level table, though. You may
also insert rows directly into the partitions, like
Inserting that row into rainfall_y2020q2 would give an error due to the partition
constraint violation.
So, you may insert data through the top level table or through the partition. Note
that the FROM values are inclusive while the TO values are not: a row whose date
equals a partition's TO bound does not belong to that partition.
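The statements could look as follows in PostgreSQL (the attributes of the table are
assumptions; only the partitioning structure matters here):
CREATE TABLE rainfall (
  city   TEXT,
  date   DATE,
  amount INT
) PARTITION BY RANGE (date) ;

CREATE TABLE rainfall_y2020q1 PARTITION OF rainfall
  FOR VALUES FROM ('2020-01-01') TO ('2020-04-01') ;
CREATE TABLE rainfall_y2020q2 PARTITION OF rainfall
  FOR VALUES FROM ('2020-04-01') TO ('2020-07-01') ;

INSERT INTO rainfall VALUES ('Gothenburg', '2020-05-17', 12) ;
-- the row is stored in rainfall_y2020q2; a reading dated 2020-04-01 would also
-- go there, since FROM is inclusive, whereas 2020-07-01 would not, since TO is exclusive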
These partitions could be further partitioned by range of another attribute.
Partitioning has efficiency benefits. The data is split into smaller sets which can
be more efficient to process. Also, data management may become easier, since the
partitions can be added and deleted individually. There are different values by which
partitioning is meaningful, such as ownership of data, time, geographical location,
organizational structure, etc.
While a partition may also be called a shard, it is more typical to associate shard-
ing with distributing the partitions onto different servers. At the time of writing this,
distributing the partitions is not a part of standard Postgres distribution but it can be
done by using Foreign data wrappers, a mechanism to access data in another database,
Postgres or something else.
One option to do sharding with Postgres is by using Citus, a system for sharding
and replicating the data across servers. There is an open source distribution of Citus,
available to all.
Another option is to use the foreign data wrapper functionality. Initially, this
functionality has been available to access data in other servers and systems. But now,
we want to use the foreign data wrappers for PostgreSQL data.
Sharding is also available in MongoDB.
The data can also be replicated in addition to partitioning it. In MongoDB, at
the time of writing this, the replicas only serve as backup if a server fails. However,
there are systems that allow different copies or replicas to be used to access the data at
the same time on different servers, thus increasing the responsiveness of the database
system. This implies complications with synchronization of updates, a topic beyond
the scope of this book.
10.2 MapReduce
A standard way to perform computations on Big Data is to use rounds of MapReduce
computations. In MapReduce, a round of computations consists of first selecting the
relevant data and mapping it by the values into suitable subsets, then followed by the
reduce step in which those subsets are used to compute some results. These results may
be intermediate and further MapReduce rounds may follow. The MapReduce query
engine is similar to map and fold in functional programming. The computations can be
made in parallel in different computers. While Map reads a common input and maps
the data items onto different sets, Reduce is supposed to be executed independently
on each data set. MapReduce was originally developed at Google, to work on top of
the Bigtable data storage system.
The idea of MapReduce also carries over to SQL and we can demonstrate the idea
using the rainfall tables above and SQL.
Let's suppose that we have a table rainfall_readings containing individual rainfall readings,
and we want to calculate the average rainfall by city for each quarter of year 2020.
The idea is very simple:
• Copy the data from rainfall_readings into rainfall. Thus the data will be split
into different tables for different quarters. Suppose we have the partitions for
each quarter we are interested in. This is the Map step of MapReduce.
• Execute the summarizing query on rainfall, which means it will be executed on
each sub-table. This is the Reduce step of MapReduce (see the sketch below).
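In SQL terms, the two steps could look as follows (a sketch; the attribute names are
assumptions, matching the partitioning example above):
-- Map: route the readings into the partitioned rainfall table,
-- whose partitions split them by quarter
INSERT INTO rainfall
  SELECT city, date, amount
  FROM rainfall_readings
  WHERE date >= '2020-01-01' AND date < '2020-07-01' ;  -- the quarters with partitions

-- Reduce: summarize each subset independently, e.g. the first quarter
SELECT city, avg(amount)
FROM rainfall_y2020q1
GROUP BY city ;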
MapReduce can improve efficiency. Ultimately, if partitions are distributed onto
different servers, then we can speed up Reduce proportionally to the number of
servers. Even if we have all the partitions on the same server, we are dealing
with smaller data sets which brings some efficiency benefit.
MapReduce, though, is not just ”SQL for big amounts of data”. On the one hand,
there are queries that SQL allows but that are not feasible with huge distributed data
sets, such as joins that explode the result set. On the other hand,
MapReduce is not limited to the SQL language, and it is possible to perform
general computations on the data. Also MongoDB, discussed in the previous
chapter, supports MapReduce.
The basic MapReduce does not run on SQL databases. MapReduce may utilize
simple files and various standard programming languages can be used to program
the Map and Reduce functions. The main point is, though, that there is an
underlying system that controls the execution, leaving the programmer free of
complications of distributed computing.
                  Cassandra                                SQL
data object       key-value pair = (rowkey, columns)      row
single value      column = (attribute, value, timestamp)  column value
collection        column family                           table
database          keyspace                                schema, E-R diagram
storage unit      key-value pair                          table
query language    CQL                                     SQL
query engine      MapReduce                               relational algebra
Bigtable is a proprietary system, which was the model of the open-source Cas-
sandra, together with Amazon’s Dynamo data storage system.8 The MapReduce
implementation used by Cassandra is Hadoop.
It is easy to try out Cassandra, if you are familiar with SQL. Let us follow the
instructions from the tutorial in
https://wiki.apache.org/cassandra/GettingStarted
Step 1. Download Cassandra from http://cassandra.apache.org/download/
Step 2. Install Cassandra by unpacking the downloaded archive.
Step 3. Start the Cassandra server by going to the created directory and giving the
command
bin/cassandra -f
Step 4. Start the CQL shell with the command
bin/cqlsh
The following CQL session shows some of the main commands. First time you have
to create a keyspace:
Take it to use for your subsequent commands:
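A minimal session could look as follows (the keyspace name, the replication settings
and the data are just examples):
CREATE KEYSPACE coursedb
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1} ;
USE coursedb ;
CREATE TABLE Countries (
  name text PRIMARY KEY,
  capital text,
  currency text
) ;
INSERT INTO Countries (name, capital, currency) VALUES ('Sweden', 'Stockholm', 'SEK') ;
SELECT capital, currency FROM Countries WHERE name = 'Sweden' ;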
capital | currency
-----------+----------
Stockholm | SEK
So you can only retrieve indexed values. PRIMARY KEY creates the primary index,
but you can also create a secondary index:
> CREATE INDEX on Countries(capital) ;
After this, a query on the capital attribute, such as SELECT name FROM Countries
WHERE capital = 'Oslo', becomes possible:
name
--------
Norway
As the data is organized on the primary key, it is possible to read and write the
data sequentially in order, which is relatively fast. As an example, suppose that we
position timestamp first in the primary key, and know the time interval for which we
need the data. This allows fast retrieval and storage. Building secondary indices does
not give a similar advantage, as the data is only organized on the primary key.
The obvious general drawback is that there may be a need to reorganize the data,
and this can take a reasonably - or indeed unreasonably - long time. This can even result
in a service break. The data replication happens through the Hadoop HDFS file
system, a replicating Big Data file system.
Distributing the data over the servers could happen in a number of ways:
• Data can be distributed by some attribute values, like city data could be dis-
tributed by country or by continent.
• Data can be stored by columns (attributes) rather than by rows. This may be
practical e.g. for some analysis tasks that need all values for some columns but
none for some others.
• Columns that are likely to be used together can be stored together.
• Rows that are likely to be used together can be stored together.
• The data can be split in a seemingly random way onto different servers, e.g.
using a hash function that given a key value gives an address containing the
server used.
• Some data can be replicated, i.e. stored on several servers at the same time. This
improves accessibility but complicates maintenance of data.
A Appendix: SQL in a nutshell
Figure 14 shows a grammar of SQL. It is not full SQL, but it does contain all those
constructs that are used in this course for database definitions and queries. The syntax
for triggers, indexes, authorizations, and transactions will be covered in later chapters.
In addition to the standard constructs, different DBMSs contain their own ones,
thus creating ”dialects” of SQL. They may even omit some standard constructs. In
this respect, PostgreSQL is closer to the standard than many other systems.
The grammar notation is aimed for human readers. It is hence not completely
formal. A full formal grammar can be found e.g. in PostgreSQL references:
http://www.postgresql.org/docs/9.5/interactive/index.html
Another place to look is the Query Converter source file
https://github.com/GrammaticalFramework/gf-contrib/blob/master/query-converter/MinSQL.bnf
It is roughly equivalent to Figure 14, and hence incomplete. But it is completely
formal and is used for implementing an SQL parser.9
The grammar is BNF (Backus Naur form) with the following conventions:
• CAPITAL words are SQL keywords, to take literally
• small character words are names of syntactic categories, defined each in their
own rules
• | separates alternatives
• + means one or more, separated by commas
• * means zero or more, separated by commas
• ? means zero or one
• in the beginning of a line, + * ? operate on the whole line; elsewhere, they
operate on the word just before
• ## start comments, which explain unexpected notation or behaviour
• text in double quotes means literal code, e.g. "*" means the operator *
• other symbols, e.g. parentheses, also mean literal code (quotes are used only in
some cases, to separate code from grammar notation)
• parentheses can be added to disambiguate the scopes of operators
Another important aspect of SQL syntax is case insensitivity:
• keywords are usually written with capitals, but can be written by any combi-
nations of capital and small letters
• the same concerns identifiers, i.e. names of tables, attributes, constraints
• however, string literals in single quotes are case sensitive
statement ::=
    CREATE TABLE tablename (
    * attribute type inlineconstraint*
    * [CONSTRAINT name]? constraint deferrable?
    ) ;
  |
    DROP TABLE tablename ;
  |
    INSERT INTO tablename tableplaces? values ;
  |
    DELETE FROM tablename
    ? WHERE condition ;
  |
    UPDATE tablename
    SET setting+
    ? WHERE condition ;
  |
    query ;
  |
    CREATE VIEW viewname
    AS ( query ) ;
  |
    ALTER TABLE tablename
    + alteration ;
  |
    COPY tablename FROM filepath ;
    ## postgresql-specific, tab-separated

query ::=
    SELECT DISTINCT? columns
    ? FROM table+
    ? WHERE condition
    ? GROUP BY attribute+
    ? HAVING condition
    ? ORDER BY attributeorder+
  |
    query setoperation query
  |
    query ORDER BY attributeorder+
    ## no previous ORDER in query
  |
    WITH localdef+ query

type ::=
    CHAR ( integer ) | VARCHAR ( integer ) | TEXT
  | INT | FLOAT

inlineconstraint ::=   ## not separated by commas!
    PRIMARY KEY
  | REFERENCES tablename ( attribute ) policy*
  | UNIQUE | NOT NULL
  | CHECK ( condition )
  | DEFAULT value

constraint ::=
    PRIMARY KEY ( attribute+ )
  | FOREIGN KEY ( attribute+ )
    REFERENCES tablename ( attribute+ ) policy*
  | UNIQUE ( attribute+ ) | NOT NULL ( attribute )
  | CHECK ( condition )

policy ::=
    ON DELETE|UPDATE CASCADE|SET NULL

deferrable ::=
    NOT? DEFERRABLE (INITIALLY DEFERRED|IMMEDIATE)?

tableplaces ::=
    ( attribute+ )

values ::=
    VALUES ( value+ )   ## keyword VALUES only in INSERT
  | ( query )

setting ::=
    attribute = value

alteration ::=
    ADD COLUMN attribute type inlineconstraint*
  | DROP COLUMN attribute

localdef ::=
    tablename AS ( query )

columns ::=
    "*"
  | column+