Introducing Cassandra Data Model and
Cassandra Query Language
Apache Cassandra:
Core Concepts, Skills, and Tools
Artem Chebotko, Leo Schuman
April 8, 2014
©2014 DataStax Training. Use only with permission. Slide 1
Learning Objectives
• Understand the Cassandra data model
• Introduce cqlsh (optional)
• Understand and use the DDL subset of CQL
• Introduce DevCenter
• Understand and use the DML subset of CQL
• Understand basics of data modeling (optional)
©2014 DataStax Training. Use only with permission. Slide 2
What are the essential constituents of
the Cassandra data model?
• The Cassandra data model defines
1. Column family as a way to store and organize data
2. Table as a two-dimensional view of a multi-dimensional column family
3. Operations on tables using the Cassandra Query Language (CQL)
• We cover these three constituents in the order they are listed
• Understanding column families is a prerequisite to understanding tables
• Understanding tables is a prerequisite to understanding operations
©2014 DataStax Training. Use only with permission. Slide 3
What are row, row key, column key,
and column value?
• Row is the smallest unit that stores related data in Cassandra
• Rows – individual rows constitute a column family
• Row key – uniquely identifies a row in a column family
• Row – stores pairs of column keys and column values
• Column key – uniquely identifies a column value in a row
• Column value – stores one value or a collection of values
Row Column
keys
(or
column
names)
cola colb colc cold
row
key
va vb vc vd
Column
values
(or
cells)
©2014 DataStax Training. Use only with permission. Slide 4
What are row, row key, column key,
and column value?
• Sample rows that describe an artist and a band
• Column keys are inherently sorted
born country died style type
John
Lennon
1940 England 1980 Rock artist
country founded style type
The
Beatles
England 1957 Rock band
• A row can be retrieved if its row key is known
• A column value can be retrieved if its row key and column key are known
©2014 DataStax Training. Use only with permission. Slide 5
What is a wide row?
• Rows may be described as “skinny” or “wide”
• Skinny row – has a fixed, relatively small number of column keys
• Previous examples were skinny rows
• Wide row – has a relatively large number of column keys (hundreds or
thousands); this number may increase as new data values are inserted
• For example, a row that stores all bands of the same style
• The number of such bands will increase as new bands are formed
... The
Animals ... The
Beatles ...
Rock
... ... ...
• Note that column values do not exist in this example
• The column key – in this case a band name – stores all the data desired
• Could have stored the number of albums, or year founded, etc., as column values
©2014 DataStax Training. Use only with permission. Slide 6
What are composite row key and
composite column key?
• Composite row key – multiple components separated by colon
genre performer tracks
Revolver:1966
Rock The
Beatles {1:
'Taxman',
...,
14:
'Tomorrow
Never
Knows'}
• ‘Revolver’ and 1966 are the album title and year
• ‘tracks’ value is a collection (map)
• Composite column key – multiple components separated by colon
• Composite column keys are sorted by each component
1:title 2:title ... 14:title
Revolver:1966
Taxman Eleanor
Rigby ... Tomorrow
Never
Knows
• 1,2, …, 14 are track numbers; ‘title’ is metadata
• We could have stored actual title as components of composite column
keys: 1:Taxman, 2:Eleanor Rigby, …, 14:Tomorrow Never Knows
©2014 DataStax Training. Use only with permission. Slide 7
Can simple and composite column keys co-exist
in the same row?
• Row can contain both simple and composite column keys
1:title 2:title ... genre performer
Revolver:1966
Taxman Eleanor
Rigby ... Rock The
Beatles
• ‘genre’ and ‘performer’ are simple column keys
• ‘1:title’, ‘2:title’, … are composite column keys
©2014 DataStax Training. Use only with permission. Slide 8
What components of a row can store useful values?
• Any component of a row can store data or metadata
• Simple or composite row keys
• Simple or composite column keys
• Atomic or set-valued (collection) column values
1:title 1:duration ... 7:title 7:duration
Revolver:1966:Side
one
Taxman 2:39 ... She
Said
She
Said 2:37
8:title 8:duration ... 14:title 14:duration
Revolver:1966:Side
two
Good
Day
Sunshine 2:10 ... Tomorrow
Never
Knows 2:57
• Metadata: ‘Side one’, ‘Side two’, ‘title’, ‘duration’
• Data: everything else (‘Revolver’, ‘1966’, ‘She Said She Said’, etc.)
©2014 DataStax Training. Use only with permission. Slide 9
What is a column family?
• Column family – set of rows with a similar structure
COLUMNS
cola colb colc cold
row
key3
v3.a v3.b v3.c v3.d
cola colb colc cold
ROWS
row
CELLS
key1
v1.a v1.b v1.c v1.d
cola colb colc cold
row
key2
v2.a v2.b v2.c v2.d
©2014 DataStax Training. Use only with permission. Slide 10
What is a column family?
• Distributed
• Sparse
• Column family that stores data about artists and bands
born country died style type
John
Lennon
1940 England 1980 Rock artist
born country style type
Paul
McCartney
1942 England Rock artist
country founded style type
The
Beatles
England 1957 Rock band
©2014 DataStax Training. Use only with permission. Slide 11
What is a column family?
• Sorted columns
• Multidimensional
• Column family that stores albums and their tracks
1:title ... 11:title ... 14:title
Revolver:1966
Taxman ... Doctor
Robert ... Tomorrow
Never
Knows
1:title ... 11:title
Let
It
Be:1970
Two
Of
Us ... Get
Back
1:title ... 11:title
Magical
Mystery
Tour:1967
Magical
Mystery
Tour ... All
You
Need
Is
Love
©2014 DataStax Training. Use only with permission. Slide 12
What are the size limitations for a column family?
• Size of a column family is only limited to the size of a cluster
• Linear scalability
• Rows are distributed among the nodes in a cluster
• Column family component size considerations
• Data from a one row must fit on one node
• Data from any given row never spans multiple nodes
• Maximum columns per row is 2 billion
• In practice – Up to 100 thousand
• Maximum data size per cell (column value) is 2 GB
• In practice – Up to 100 MB
©2014 DataStax Training. Use only with permission. Slide 13
Exercise 1: Model sample data as column families
©2014 DataStax Training. Use only with permission. Slide 14
What is a CQL table and how is it related to
a column family?
• A CQL table is a column family
• CQL tables provide two-dimensional views of a column family, which contains
potentially multi-dimensional data, due to composite keys and collections
• CQL table and column family are largely interchangeable terms
• Not surprising when you recall tables and relations, columns and attributes, rows
and tuples in relational databases
• Supported by declarative language Cassandra Query Language
• Data Definition Language, subset of CQL
• SQL-like syntax, but with somewhat different semantics
• Convenient for defining and expressing Cassandra database schemas
©2014 DataStax Training. Use only with permission. Slide 15
What is CQL table and how is it related to
column family?
• Cassandra 1.2+ relies on CQL schema, concepts, and terminology,
though the older Thrift API remains available
• Recall that CQL provides a two dimensional view of potentially multi-
dimensional data
Table (CQL API terms)
Column Family (Thrift API terms)
Table is a set of partitions
Column family is a set of rows
Partition may be single or multiple row
Row may be skinny or wide
Partition key uniquely identifies a partition, Row key uniquely identifies a row, and may be
and may be simple or composite
simple or composite
Column uniquely identifies a cell in a Column key uniquely identies a cell in a row,
partition, and may be regular or clustering
and may be simple or composite
Primary key is comprised of a partition key
plus clustering columns, if any, and uniquely
identifies a row in both its partition and table
©2014 DataStax Training. Use only with permission. Slide 16
What are partition, partition key, row, column,
and cell?
• Table with single-row partitions
columns
partition key
performer
born
country
died
founded
style
type
John Lennon
1940
England
1980
Rock
artist
partitions
Paul McCartney
1942
England
Rock
artist
rows
The Beatles
England
1957
Rock
band
• Column family view
cells
©2014 DataStax Training. Use only with permission. Slide 17
What are composite partition key and clustering column?
• Table with multi-row partitions
columns
composite partition key
album_title
year
num track_title
ber
clustering column
Revolver
1966
1
Taxman
Revolver
1966
…
…
rows in a partition/table
Tomorrow Never Knows
Revolver
1966
14
Let It Be
1970
1
Two Of Us
Let It Be
1970
…
…
... ...
partitions
Let It Be
1970
11
Get Back
1:title 11:title 14:title
Revolver:1966
Magical Mystery Tour
Magical Mystery Tour
1967
1
Taxman ... Doctor
Robert ... Tomorrow
Never
Knows
Magical Mystery Tour
1967
…
…
1:title ... 11:title
Magical Mystery Tour
All You Need Is Love
1967
11
Let
It
Be:1970
Two
Of
Us ... Get
Back
1:title ... 11:title
Magical
Mystery
cells
Tour:1967
Magical
Mystery
Tour ... All
You
Need
Is
Love
©2014 DataStax Training. Use only with permission. Slide 18
What are static columns?
• Table with multi-row partitions
clustering column
static columns
composite partition key
album_title
year
num genre
performer
track_title
ber
Revolver
1966
1
Rock
The Beatles
Taxman
rows in
Revolver
1966
…
Rock
The Beatles
…
a partition
Tomorrow Never Knows
Revolver
1966
14
Rock
The Beatles
Let It Be
1970
1
Rock
The Beatles
Two Of Us
Let It Be
1970
…
Rock
The Beatles
…
partitions
Let It Be
1970
11
Rock
The Beatles
Get Back
cells
Magical Mystery Tour
Magical Mystery Tour
1967
1
Rock
The Beatles
Magical Mystery Tour
1967
…
Rock
The Beatles
…
Magical Mystery Tour
All You Need Is Love
1967
11
Rock
The Beatles
• Static column values are shared for all rows in a multi-row partition
©2014 DataStax Training. Use only with permission. Slide 19
What are static columns?
• Table with multi-row partitions
album_title
year
num genre
performer
track_title
ber
static column
value
Revolver
1966
1
Rock
The Beatles
Taxman
Revolver
1966
…
Rock
The Beatles
…
Tomorrow Never Knows
Revolver
1966
14
Rock
The Beatles
Let It Be
1970
1
Rock
The Beatles
Two Of Us
Let It Be
1970
…
Rock
The Beatles
…
Let It Be
1970
11
Rock
The Beatles
Get Back
Magical Mystery Tour
Magical Mystery Tour
1967
1
Rock
The Beatles
Magical Mystery Tour
1967
…
Rock
The Beatles
…
Magical Mystery Tour
All You Need Is Love
1967
11
Rock
The Beatles
©2014 DataStax Training. Use only with permission. Slide 20
What is a primary key?
• Primary key uniquely identifies a row in a table
• Simple or composite partition key and all clustering columns (if present)
performer
born
country
died
founded
style
type
John Lennon
1940
England
1980
Rock
artist
Paul McCartney
1942
England
Rock
artist
The Beatles
England
1957
Rock
band
• Primary key (table above)
album_title
year
num track_title
ber
• performer
Revolver
1966
1
Taxman
• Primary key (table below)
Revolver
1966
…
…
• album, year, number
Revolver
1966
14
Tomorrow Never Knows
Let It Be
1970
1
Two Of Us
• Static columns cannot be part Let It Be
1970
…
…
of a primary key
Let It Be
1970
11
Get Back
Magical Mystery Tour
Magical Mystery Tour
1967
1
©2014 DataStax Training. Use only with permission. Slide 21
What are collection columns?
• Multiple values can be stored in a column
• Set – typed collection of unique values (e.g., genres)
{"Blues", "Jazz", "Rock"}
• Ordered by values
• No duplicates
• List – typed collection of non-unique values (e.g., artists)
["Lennon", "Lennon", "McCartney"]
• Ordered by position
• Duplicates are allowed
• Map – typed collection of key-value pairs (e.g., tracks)
{1:"Taxman", 2:"Eleanor Rigby", 3:"I'm Only Sleeping"}
• Ordered by keys
• Unique keys but not values
©2014 DataStax Training. Use only with permission. Slide 22
What are collection columns?
• Map example
• Collection column tracks holds a map of album tracks
title
year
genre
performer
tracks
Revolver
1966
Rock
The Beatles
{1: 'Taxman',
2: 'Eleanor Rigby',
3: 'I'm Only Sleeping',
4: 'Love You To',
…,
14: 'Tomorrow Never Knows'}
Let It Be
1970
Rock
The Beatles
{1: 'Two Of Us', 2: 'I Dig A Pony', 3: 'Across The
Universe', 4: 'Let It Be', 5: 'Maggie Mae', …, 11:
'Get Back'}
Magical 1967
Rock
The Beatles
{1: 'Magical Mystery Tour', 2: 'The Fool On The
Mystery Hill', 3: 'Flying', 4: 'Blue Jay Way', …, 11: 'All You
Tour
Need Is Love'}
©2014 DataStax Training. Use only with permission. Slide 23
Exercise 2: Represent column families as tables
©2014 DataStax Training. Use only with permission. Slide 24
Learning Objectives
• Understand the Cassandra data model
• Introduce cqlsh (optional)
• Understand and use the DDL subset of CQL
• Introduce DevCenter
• Understand and use the DML subset of CQL
• Understand basics of data modeling (optional)
©2014 DataStax Training. Use only with permission. Slide 25
What is cqlsh and how do you launch it?
• Cassandra client with the command-line interface
• Supports Cassandra Query Language statements
• Supports cqlsh shell commands
• Launching on Linux
$ ./cqlsh [options] [host [port]]
• Launching on Windows
python cqlsh [options] [host [port]]
• Examples
$ ./cqlsh
$ ./cqlsh -u student -p cassandra 127.0.0.1 9160
©2014 DataStax Training. Use only with permission. Slide 26
What shell commands does cqlsh support?
Command
Description
CAPTURE Captures command output and appends it to a file
CONSISTENCY Shows the current consistency level, or given a level, sets it
COPY Imports and exports CSV (comma-separated values) data
DESCRIBE Provides information about a Cassandra cluster or data objects
EXPAND Formats the output of a query vertically
EXIT or QUIT Terminates cqlsh
SHOW Shows the Cassandra version, host, or data type assumptions
SOURCE Executes a file containing CQL statements
TRACING Enables or disables request tracing
©2014 DataStax Training. Use only with permission. Slide 27
What shell commands does cqlsh support?
• CQL commands must be terminated with semi-colon
• SOURCE
SOURCE 'file'
SOURCE './myscript.cql';
• COPY
COPY table_name ( column, ...)
FROM ( 'file_name' | STDIN )
WITH option = 'value' AND ... ;
COPY table_name ( column , ... )
TO ( 'file_name' | STDOUT )
WITH option = 'value' AND ... ;
COPY performers_by_style (style, name)
FROM ‘./performers_by_style.csv'
WITH HEADER = 'true';
©2014 DataStax Training. Use only with permission. Slide 28
What shell commands does cqlsh support?
• DESCRIBE
DESCRIBE CLUSTER | SCHEMA | KEYSPACES |
KEYSPACE keyspace_name | TABLES | TABLE table_name
DESCRIBE TABLE album;
• EXIT
EXIT | QUIT;
©2014 DataStax Training. Use only with permission. Slide 29
Demo 3: How to launch and use cqlsh
©2014 DataStax Training. Use only with permission. Slide 30
Learning Objectives
• Understand the Cassandra data model
• Introduce cqlsh (optional)
• Understand and use the DDL subset of CQL
• Introduce DevCenter
• Understand and use the DML subset of CQL
• Understand basics of data modeling (optional)
©2014 DataStax Training. Use only with permission. Slide 31
What is a keyspace or schema?
• Keyspace – a top-level namespace for a CQL table schema
• Defines the replication strategy for a set of tables
• Keyspace per application is a good idea
• Data objects (e.g., tables) belong to a single keyspace
• Replication strategy – the number and pattern by which partitions
are copied among nodes in a cluster
• Two strategies available
• Simple Strategy (used for prototyping)
• Network Topology Strategy (production)
©2014 DataStax Training. Use only with permission. Slide 32
How to create, use and drop keyspaces/schemas?
• To create a keyspace
CREATE KEYSPACE musicdb
WITH replication = {
'class': 'SimpleStrategy',
'replication_factor' : 3
};
• To assign the working default keyspace for a cqlsh session
USE musicdb;
• To delete a keyspace and all internal data objects
DROP KEYSPACE musicdb;
©2014 DataStax Training. Use only with permission. Slide 33
What is the syntax of the CREATE TABLE statement?
• The CQL below creates a table in the current keyspace
Primary key declared inline
Primary key declared in separate clause
CREATE TABLE performer ( CREATE TABLE performer (
name VARCHAR PRIMARY KEY, name VARCHAR,
type VARCHAR, type VARCHAR,
country VARCHAR, country VARCHAR,
style VARCHAR, style VARCHAR,
founded INT, founded INT,
born INT, born INT,
died INT died INT,
); PRIMARY KEY (name)
);
©2014 DataStax Training. Use only with permission. Slide 34
How are primary key, partition key, and
clustering columns defined?
• Simple partition key, no clustering columns
PRIMARY KEY ( partition_key_column )
• Composite partition key, no clustering columns
PRIMARY KEY ( ( partition_key_col1, …, partition_key_colN ) )
• Simple partition key and clustering columns
PRIMARY KEY ( partition_key_column,
clustering_column1, …, clustering_columnM )
• Composite partition key and clustering columns
PRIMARY KEY ( ( partition_key_col1, …, partition_key_colN ),
clustering_column1, …, clustering_columnM )
©2014 DataStax Training. Use only with permission. Slide 35
How are primary key, partition key, static and
clustering columns defined?
• Example
Can find a performer, genre, and all track
Can find all performers and albums for a numbers and titles for a given album title
given track title
and year
CREATE TABLE albums_by_track ( CREATE TABLE tracks_by_album (
track_title VARCHAR, album_title VARCHAR,
performer VARCHAR, year INT,
year INT, performer VARCHAR STATIC,
album_title VARCHAR, genre VARCHAR STATIC,
PRIMARY KEY number INT,
(track_title, performer, track_title VARCHAR,
year, album_title) PRIMARY KEY
); ((album_title, year),
number)
);
©2014 DataStax Training. Use only with permission. Slide 36
What CQL data types are available?
CQL Type
Constants
Description
ASCII
strings
US-ASCII character string
BIGINT
integers
64-bit signed long
BLOB
blobs
Arbitrary bytes (no validation), expressed as hexadecimal
BOOLEAN
booleans
true or false
COUNTER
integers
Distributed counter value (64-bit long)
DECIMAL
integers, floats
Variable-precision decimal
DOUBLE
integers
64-bit IEEE-754 floating point
FLOAT
integers, floats
32-bit IEEE-754 floating point
INET
strings
IP address string in IPv4 or IPv6 format*
INT
integers
32-bit signed integer
LIST
n/a
A collection of one or more ordered elements
MAP
n/a
A JSON-style array of literals: { literal : literal, literal : literal ... }
SET
n/a
A collection of one or more elements
TEXT
strings
UTF-8 encoded string
TIMESTAMP
integers, strings
Date plus time, encoded as 8 bytes since epoch
TUPLE
n/a
Up to 32k fields
UUID
uuids
A UUID in standard UUID format
TIMEUUID
uuids
Type 1 UUID only (CQL 3)
VARCHAR
strings
UTF-8 encoded string
VARINT
integers
Arbitrary-precision integer
©2014 DataStax Training. Use only with permission. Slide 37
Exercise 4: Create a keyspace and tables using cqlsh
©2014 DataStax Training. Use only with permission. Slide 38
What are UUID and TIMEUUID for?
• UUID and TIMEUUID are universally unique identifiers
• Generated programmatically
• Format
hex{8}-hex{4}-hex{4}-hex{4}-hex{12}
52b11d6d-16e2-4ee2-b2a9-5ef1e9589328
• Used to assign conflict-free (unique) identifiers to data objects
• Numeric range so vast that duplication is statistically all but impossible
• UUID data type supports Version 4 UUIDs
• Randomly generated sequence of 32 hex digits separated by dashes
• 52b11d6d-16e2-4ee2-b2a9-5ef1e9589328
©2014 DataStax Training. Use only with permission. Slide 39
What are UUID and TIMEUUID for?
• TIMEUUID data type supports Version 1 UUIDs
• Embeds a time value within a UUID
• Generated using time (60 bits), a clock sequence number (14 bits), and MAC
address (48 bits)
1be43390-9fe4-11e3-8d05-425861b86ab6
• CQL function now() generates a new TIMEUUID
• Time can be extracted from TIMEUUID
• CQL function dateOf() extracts the embedded timestamp as a date
• TIMEUUID values in clustering columns or in column names are ordered
based on time
• DESC order on TIMEUUID lists most recent data first
©2014 DataStax Training. Use only with permission. Slide 40
What are UUID and TIMEUUID for?
• Example
• Users are identified by UUID
• User activities (i.e., rating a track) are identified by TIMEUUID
• A user may rate the same track multiple times
• Activities are ordered by the time component of TIMEUUID
CREATE TABLE track_ratings_by_user (
user UUID,
activity TIMEUUID,
rating INT,
album_title VARCHAR,
album_year INT,
track_title VARCHAR,
PRIMARY KEY (user, activity)
) WITH CLUSTERING ORDER BY (activity DESC);
©2014 DataStax Training. Use only with permission. Slide 41
What is TIMESTAMP for?
• TIMESTAMP holds date and time
• 64-bit integer representing a number of milliseconds since January 1 1970 at
00:00:00 GMT
• Entered as
• 64-bit integer
• String literal in the ISO 8601 format
• 1979-12-18 08:12:51-0400
• 2014-02-27
• Other variations are allowed
• Displayed in cqlsh as
• yyyy-mm-dd HH:mm:ssZ
©2014 DataStax Training. Use only with permission. Slide 42
What are special properties of the COUNTER
data type?
• Cassandra supports distributed counters
• Useful for tracking a count
• Counter column stores a number that can only be updated
• Incremented or decremented
• Cannot assign an initial value to a counter (initial value is 0)
• Counter column cannot be part of a primary key
• If a table has a counter column, all non-counter columns must be part of a
primary key
CREATE TABLE ratings_by_track (
album_title VARCHAR, album_year INT, track_title VARCHAR,
num_ratings COUNTER,
sum_ratings COUNTER,
PRIMARY KEY (album_title, album_year, track_title)
);
©2014 DataStax Training. Use only with permission. Slide 43
What are special properties of the COUNTER
data type?
• Performance considerations
• Read is as efficient as for non-counter columns
• Update is fast but slightly slower than an update for non-counter columns
• A read is required before a write can be performed
• Accuracy considerations
• If a counter update is timed out, a client application cannot simply retry a
“failed” counter update as the timed-out update may have been persisted
• Counter update is not an idempotent operation
• Running an increment twice is not the same as running it once
©2014 DataStax Training. Use only with permission. Slide 44
What is the purpose of the CLUSTERING
ORDER BY clause?
• CLUSTERING ORDER BY defines how data values in clustering
columns are ordered (ASC or DESC) in a table
• ASC is the default order for all clustering columns
• When retrieving data, the default order or the order specified by a
CLUSTERING ORDER BY clause is used
• The order can be reversed in a query using the ORDER BY clause
CREATE TABLE albums_by_genre (
genre VARCHAR,
performer VARCHAR,
year INT,
title VARCHAR,
PRIMARY KEY (genre, performer, year, title)
) WITH CLUSTERING ORDER BY
(performer ASC, year DESC, title ASC);
©2014 DataStax Training. Use only with permission. Slide 45
Exercise 5: Create tables using UUID, TIMEUUID, and
COUNTER columns
©2014 DataStax Training. Use only with permission. Slide 46
What is the syntax of the ALTER TABLE statement?
• ALTER TABLE manipulates the table metadata
• Adding a column
ALTER TABLE album ADD cover_image VARCHAR;
• Changing a column data type
ALTER TABLE album ALTER cover_image TYPE BLOB;
• Types must be compatible
• Clustering and indexed columns are not supported
• Dropping a column
ALTER TABLE album DROP cover_image;
• PRIMARY KEY columns are not supported
©2014 DataStax Training. Use only with permission. Slide 47
What is the syntax of the DROP TABLE statement?
• DROP TABLE removes a table (all data in the table is lost)
DROP TABLE album;
©2014 DataStax Training. Use only with permission. Slide 48
What are collection columns for?
• Collection columns are multi-valued columns
• Designed to store discrete sets of data (e.g., tags for a blog post)
• A collection is retrieved in its entirety
• 64,000 - maximum number of elements in a collection
• In practice – dozens or hundreds
• 64 KB - maximum size of each collection element
• In practice – much smaller
• Collection columns
• cannot be part of a primary key
• cannot be part of a partition key
• cannot be used as a clustering column
• cannot nest inside of another collection
©2014 DataStax Training. Use only with permission. Slide 49
How are collection columns defined?
• Set – typed collection of unique values
keywords SET<VARCHAR>
• Ordered by values
• No duplicates
• List – typed collection of non-unique values
songwriters LIST<VARCHAR>
• Ordered by position
• Duplicates are allowed
• Map – typed collection of key-value pairs
tracks
MAP<INT,VARCHAR>
• Ordered by keys
• Unique keys but not values
©2014 DataStax Training. Use only with permission. Slide 50
What is a user-defined type?
• User-defined types group related fields of information
• Represents related data in a single table, instead of multiple, separate tables
• Uses any data type, including collections and other user-defined types
• Reserved words cannot be used as a name for a user-defined type
• byte
• smallint
• complex
• enum
• date
• interval
• macaddr
• bitstring
CREATE TYPE track (
album_title VARCHAR,
album_year INT,
track_title VARCHAR,
);
©2014 DataStax Training. Use only with permission. Slide 51
What is a user-defined type?
• Table columns can be user-defined types
• Requires the use of the frozen keyword in C* 2.1
• A user-defined type can be used as a data type for a collection
CREATE TABLE musicdb.track_ratings_by_user (
user UUID,
activity TIMEUUID,
rating INT,
song frozen <track>,
PRIMARY KEY (user, activity)
) WITH CLUSTERING ORDER BY (activity DESC);
©2014 DataStax Training. Use only with permission. Slide 52
What is the syntax of the ALTER TYPE statement?
• ALTER TYPE can change a user-defined type
• Change the type of a field
• Types must be compatible
ALTER TYPE track ALTER album_title TYPE BLOB;
• Add a field to a type
ALTER TYPE track ADD track_number INT;
• Rename a field of a type
ALTER TYPE track RENAME album_year TO year;
• Rename a user-defined type
ALTER TYPE track RENAME TO song;
©2014 DataStax Training. Use only with permission. Slide 53
What is the syntax of the DROP TYPE statement?
• DROP TYPE removes a user-defined type
• Cannot drop a user-defined type that is in use by a table or another type
DROP TYPE track;
©2014 DataStax Training. Use only with permission. Slide 54
What is a tuple?
• Tuples hold fixed-length sets of typed positional fields
• Convenient alternative to creating a user-defined type
• Accommodates up to 32768 fields, but generally only use a few
• Useful when prototyping
• Must use the frozen keyword in C* 2.1
• Tuples can be nested in other tuples
CREATE TABLE user (
id UUID PRIMARY KEY,
email text,
equalizer frozen<tuple<float,float,float,float,float,
float,float,float,float,float>>,
• Tuples can also be nested in other tuples
name text,
preferences set<text>
);
©2014 DataStax Training. Use only with permission. Slide 55
What is a secondary index?
• Tables are indexed on columns in a primary key
• Search on a partition key is very efficient
• Search on a partition key and clustering columns is very efficient
• Search on other columns is not supported
• Secondary indexes
• Can index additional columns to enable searching by those columns
• one column per index
• Cannot be created for
• counter columns
• static columns
©2014 DataStax Training. Use only with permission. Slide 56
How do you create and drop secondary indexes?
• To create a secondary index
CREATE TABLE performer (
name VARCHAR,
type VARCHAR,
country VARCHAR,
style VARCHAR,
founded INT,
born INT,
died INT,
PRIMARY KEY (name)
);
CREATE INDEX performer_style_key ON performer (style);
• To drop a secondary index
DROP INDEX performer_style_key;
©2014 DataStax Training. Use only with permission. Slide 57
When do you want to use a secondary index?
• Secondary indexes are for searching convenience
• Use with low-cardinality columns
• Columns that may contain a relatively small set of distinct values
• For example, there are many artists but only a few dozen music styles
• Allows searching for all artists for a specified style (a potentially expensive query because
it may return a large result set)
• Use with smaller datasets or when prototyping
• Do not use
• On high-cardinality columns
• On counter column tables
• On a frequently updated or deleted columns
• To look for a row in a large partition unless narrowly queried
• e.g., search on both a partition key and an indexed column
©2014 DataStax Training. Use only with permission. Slide 58
Exercise 6: Add user-defined type, alter tables, add
collection column, and add secondary indexes
©2014 DataStax Training. Use only with permission. Slide 59
Learning Objectives
• Understand the Cassandra data model
• Introduce cqlsh (optional)
• Understand and use the DDL subset of CQL
• Introduce DevCenter
• Understand and use the DML subset of CQL
• Understand basics of data modeling (optional)
©2014 DataStax Training. Use only with permission. Slide 60
What is DevCenter and how do you launch it?
• Cassandra client with the GUI interface
• IDE for developers and administrators
• Supports Cassandra Query Language statements
• Does not support cqlsh commands
• SOURCE, COPY, DESCRIBE, etc.
• Launching on Linux
$ ./DevCenter
• Launching on Windows
DevCenter.exe
• Launching on Mac OS
DevCenter.app
©2014 DataStax Training. Use only with permission. Slide 61
What are the main features of DevCenter?
• Main features
• Create and manage Cassandra connections
• Create, edit, and execute CQL scripts
• syntax highlighting
• code auto-completion
• real-time script validation against the current connection
• Explore database objects via the Schema explorer
• Navigate long CQL scripts via the Outline view
• Execute CQL queries and view results and query trace
©2014 DataStax Training. Use only with permission. Slide 62
What are the main features of DevCenter?
©2014 DataStax Training. Use only with permission. Slide 63
Demo 7: How to launch and use DevCenter
©2014 DataStax Training. Use only with permission. Slide 64
Learning Objectives
• Understand the Cassandra data model
• Introduce cqlsh (optional)
• Understand and use the DDL subset of CQL
• Introduce DevCenter
• Understand and use the DML subset of CQL
• Understand basics of data modeling (optional)
©2014 DataStax Training. Use only with permission. Slide 65
What is the syntax of the INSERT statement?
INSERT INTO table_name (column1, column2 ...)
VALUES (value1, value2 ...)
• Inserts a row into a table
• Must specify columns to insert values into
• Primary key columns are mandatory (identify the row)
• Other columns do not have to have values
• Non-existent ‘values’ do not take up space
• Atomicity and isolation
• Inserts are atomic
• All values of a row are inserted or none
• Inserts are isolated
• Two inserts with the same values in primary key columns will not interfere
– executed one after another
©2014 DataStax Training. Use only with permission. Slide 66
What is the syntax of the INSERT statement?
• To insert a row into a table
CREATE TABLE albums_by_performer (
performer VARCHAR,
year INT,
title VARCHAR,
genre VARCHAR,
PRIMARY KEY (performer, year, title)
) WITH CLUSTERING ORDER BY (year DESC, title ASC);
INSERT INTO albums_by_performer (performer,year,title,genre)
VALUES ('The Beatles', 1966, 'Revolver', 'Rock');
INSERT INTO albums_by_performer (performer, year, title)
VALUES ('The Beatles', 1995, 'Beatlemania');
performer
year
title
Genre
The Beatles
1995
Beatlemania
The Beatles
1966
Revolver
Rock
©2014 DataStax Training. Use only with permission. Slide 67
What is the syntax of the INSERT statement?
• To insert a row into a table with UUID and TIMEUUID columns
CREATE TABLE track_ratings_by_user (
user UUID,
activity TIMEUUID,
rating INT,
album_title VARCHAR,
album_year INT,
track_title VARCHAR,
PRIMARY KEY (user, activity)
) WITH CLUSTERING ORDER BY (activity DESC);
INSERT INTO track_ratings_by_user
(user,activity,rating,album_title,album_year,track_title)
VALUES (52b11d6d-16e2-4ee2b2a9-5ef1e9589328,
dbf3fbfc-9fe4-11e3-8d05-425861b86ab6, 5,'Revolver',1966,'Yellow
Submarine');
user
activity
album_title
album_year
rating
track_title
52b11d6d-16e dbf3fbfc-9fe4- Revolver
1966
5
Yellow Submarine
2- …
…
©2014 DataStax Training. Use only with permission. Slide 68
What is the syntax of the UPDATE statement?
UPDATE <keyspace>.<table>
SET column_name1 = value, column_name2 = value,
WHERE primary_key_column = value;
• Updates columns in an existing row
• Row must be identified by values in primary key columns
• Primary key columns cannot be updated
• An existing value is replaced with a new value
• A new value is added if a value for a column did not exist before
• Atomicity and isolation
• Updates are atomic
• All values of a row are updated or none
• Updates are isolated
• Two updates with the same values in primary key columns will not
interfere – executed one after another
©2014 DataStax Training. Use only with permission. Slide 69
What is the syntax of the UPDATE statement?
• To update a row in a table
UPDATE albums_by_performer
SET genre = 'Rock'
WHERE performer = 'The Beatles' AND
year = 1995 AND
title = 'Beatlemania';
• Before update
performer
year
title
Genre
The Beatles
1995
Beatlemania
The Beatles
1966
Revolver
Rock
• After update
performer
year
title
Genre
The Beatles
1995
Beatlemania
Rock
The Beatles
1966
Revolver
Rock
©2014 DataStax Training. Use only with permission. Slide 70
What is an "upsert"?
• UPdate + inSERT
• Both UPDATE and INSERT are write operations
• No reading before writing
• Term “upsert” denotes the following behavior
• INSERT updates or overwrites an existing row
• When inserting a row in a table that already has another row with the
same values in primary key columns
• UPDATE inserts a new row
• When a to-be-updated row, identified by values in primary key columns,
does not exist
• Upserts are legal and do not result in error or warning messages
©2014 DataStax Training. Use only with permission. Slide 71
What are lightweight transactions or ‘compare and set’?
• Introduces a new clause IF NOT EXISTS for inserts
• Insert operation executes if a row with the same primary key does not exist
• Uses a consensus algorithm called Paxos to ensure inserts are done serially
• Multiple messages are passed between coordinator and replicas with a large
performance penalty
• [applied] column returns true if row does not exist and insert executes
• [applied] column is false if row exists and the existing row will be returned
INSERT INTO albums_by_performer (performer,year,title)
VALUES ('The Beatles', 1966, 'Revolver’) IF NOT EXISTS;
[applied]
true
INSERT INTO albums_by_performer (performer, year, title)
VALUES ('The Beatles', 1995, 'Beatlemania’) IF NOT EXISTS;
[applied]
performer
year
false
The Beatles
1966
©2014 DataStax Training. Use only with permission. Slide 72
What are lightweight transactions or Compare and Set?
• Update uses IF to verify the value for column(s) before execution
• [applied] column returns true if condition(s) matches and update written
• [applied] column is false if condition(s) do not match and the current row will
be returned
UPDATE albums_by_performer SET year = 1968 WHERE performer =
'The Beatles' IF title = 'Revolver';
[applied]
true
UPDATE albums_by_performer SET year = 1968 WHERE performer =
'The Beatles' IF title = 'Revolver’ AND year = 1967;
[applied]
performer
year
false
The Beatles
1966
©2014 DataStax Training. Use only with permission. Slide 73
What is the purpose of the TTL option?
• Time-to-live (TTL) defines expiring columns
• INSERT and UPDATE can optionally assign data values a time-to-live
• TTL is specified in seconds
• Expired columns/values are eventually deleted
• With no TTL specified, columns/values never expire
• TTL is useful for automatic deletion
• When data gets outdated after some time
• When only most recent data is needed
• Older data may be archived elsewhere by a background process
• Helps keep the size of a table and its partitions manageable
• Restricts the data view to most recent data
©2014 DataStax Training. Use only with permission. Slide 74
What is the purpose of the TTL option?
• To store a row for 86400 seconds (1 day)
INSERT INTO track_ratings_by_user
(user,activity,rating,album_title,album_year,track_title)
VALUES (52b11d6d-16e2-4ee2-b2a9-5ef1e9589328,
dbf3fbfc-9fe4-11e3-8d05-25861b86ab6,5,'Revolver',1966,'Yellow
Submarine')
USING TTL 86400;
• Re-inserting the same row before it expires will overwrite TTL
• To store a column value for 30 seconds
UPDATE track_ratings_by_user
USING TTL 30
SET rating = 0
WHERE user = 52b11d6d-16e2-4ee2-b2a9-5ef1e9589328 AND
activity = dbf3fbfc-9fe4-11e3-8d05-425861b86ab6;
• Only column ‘rating’ for this row is affected by TTL
©2014 DataStax Training. Use only with permission. Slide 75
What is the syntax of the DELETE statement?
• Deletes a partition, a row or specified columns in a row
• Row must be identified by values in primary key columns
• Primary key columns cannot be deleted without deleting the whole row
• To delete a partition from a table
DELETE FROM track_ratings_by_user
WHERE user = 52b11d6d-16e2-4ee2-b2a9-5ef1e9589328;
• To delete a row from a table
DELETE FROM track_ratings_by_user
WHERE user = 52b11d6d-16e2-4ee2-b2a9-5ef1e9589328 AND
activity = dbf3fbfc-9fe4-11e3-8d05-425861b86ab6;
• To delete a column from a table row
DELETE rating FROM track_ratings_by_user
WHERE user = 52b11d6d-16e2-4ee2-b2a9-5ef1e9589328 AND
activity = dbf3fbfc-9fe4-11e3-8d05-425861b86ab6;
©2014 DataStax Training. Use only with permission. Slide 76
What is the syntax of the TRUNCATE statement?
• TRUNCATE removes all rows in a table
• The table definition (schema) is not affected
TRUNCATE track_ratings_by_user;
©2014 DataStax Training. Use only with permission. Slide 77
Exercise 8: Inserting and updating values using DevCenter
©2014 DataStax Training. Use only with permission. Slide 78
How do you manipulate counters?
• COUNTER – defining and updating
• INSERT is not allowed
• Initial counter value is 0
CREATE TABLE stats (
performer VARCHAR,
albums COUNTER,
concerts COUNTER,
PRIMARY KEY (performer)
);
UPDATE stats
SET albums = albums + 1, concerts = concerts + 10
WHERE performer = 'The Beatles';
performer
albums
concerts
The Beatles
1
10
©2014 DataStax Training. Use only with permission. Slide 79
How do you manipulate collections?
• CQL set – defining and inserting
• Collection column cannot be part of a primary key
CREATE TABLE band (
name VARCHAR PRIMARY KEY,
members SET<VARCHAR>
);
INSERT INTO band (name, members)
VALUES ('The Beatles', {'Paul', 'John', 'George', 'Ringo'});
name
members
The Beatles
{'George', 'John', 'Paul', 'Ringo'}
©2014 DataStax Training. Use only with permission. Slide 80
How do you manipulate collections?
• CQL set – performing union, difference and deletion
UPDATE band SET members = members +
{'Pete', 'Stuart', 'Paul', 'Jonathan'}
WHERE name = 'The Beatles';
name
members
The Beatles
{'George', 'John', 'Jonathan', 'Paul', 'Pete', 'Ringo', 'Stuart'}
UPDATE band SET members = members - {'Jonathan'}
WHERE name = 'The Beatles';
name
members
The Beatles
{'George', 'John', 'Paul', 'Pete', 'Ringo', 'Stuart'}
DELETE members FROM band WHERE name = 'The Beatles';
name
members
The Beatles
©2014 DataStax Training. Use only with permission. Slide 81
How do you manipulate collections?
• CQL list – defining and inserting
• Collection column cannot be part of a primary key
CREATE TABLE song (
id UUID PRIMARY KEY,
title VARCHAR,
songwriters LIST<VARCHAR>
);
INSERT INTO song (id, title, songwriters)
VALUES (252608cb-0f56-4cf3-82ee-b7fe00f3920f,
'I Want to Hold Your Hand', ['John', 'Paul']);
id
songwriters
title
252608cb-0f56-4cf3-82ee- ['John', 'Paul']
I Want to Hold Your Hand
b7fe00f3920f
©2014 DataStax Training. Use only with permission. Slide 82
How do you manipulate collections?
• CQL list – appending and prepending
UPDATE song SET songwriters = songwriters +
['Paul', 'Jonathan']
WHERE id = 252608cb-0f56-4cf3-82ee-b7fe00f3920f;
id
songwriters
title
252608cb-0f56-4cf3-82ee- ['John', 'Paul', 'Paul', 'Jonathan']
I Want to Hold Your Hand
b7fe00f3920f
UPDATE song SET songwriters = ['Patrick'] + songwriters
WHERE id = 252608cb-0f56-4cf3-82ee-b7fe00f3920f;
id
songwriters
title
252608cb-0f56-4cf3-82ee- ['Patrick', 'John', 'Paul', 'Paul', I Want to Hold Your Hand
b7fe00f3920f
'Jonathan']
©2014 DataStax Training. Use only with permission. Slide 83
How do you manipulate collections?
• CQL list – updating, subtracting and deleting
UPDATE song SET songwriters[3] = 'Ringo'
WHERE id = 252608cb-0f56-4cf3-82ee-b7fe00f3920f;
id
songwriters
title
252608cb-0f56-4cf3-82ee- ['Patrick', 'John', 'Paul', 'Ringo', I Want to Hold Your Hand
b7fe00f3920f
'Jonathan']
UPDATE song SET songwriters = songwriters -
['Patrick', 'Jonathan', 'Ringo']
WHERE id = 252608cb-0f56-4cf3-82ee-b7fe00f3920f;
id
songwriters
title
252608cb-0f56- …
['John', 'Paul']
I Want to Hold Your Hand
DELETE songwriters[0], songwriters[1] FROM song
WHERE id = 252608cb-0f56-4cf3-82ee-b7fe00f3920f;
id
songwriters
title
252608cb-0f56- …
I Want to Hold Your Hand
©2014 DataStax Training. Use only with permission. Slide 84
How do you manipulate collections?
• CQL map – defining and inserting
• Collection column cannot be part of a primary key
CREATE TABLE album (
title VARCHAR,
year INT,
tracks MAP<INT,VARCHAR>,
PRIMARY KEY ((title, year))
);
INSERT INTO album (title, year, tracks)
VALUES ('Revolver', 1966, {1: 'Taxman', 2: 'Eleanor Rigby'});
title
year
tracks
Revolver
1966
{1: 'Taxman', 2: 'Eleanor Rigby'}
©2014 DataStax Training. Use only with permission. Slide 85
How do you manipulate collections?
• CQL map – updating
UPDATE album SET tracks[14] = 'Yellow Submarine'
WHERE title = 'Revolver' AND year = 1966;
title
year
tracks
Revolver
1966
{1: 'Taxman', 2: 'Eleanor Rigby', 14: 'Yellow Submarine'}
UPDATE album SET tracks[14] = 'Tomorrow Never Knows'
WHERE title = 'Revolver' AND year = 1966;
title
year
tracks
Revolver
1966
{1: 'Taxman', 2: 'Eleanor Rigby', 14: 'Tomorrow Never Knows'}
©2014 DataStax Training. Use only with permission. Slide 86
How do you manipulate collections?
• CQL map – deleting
DELETE tracks[14] FROM album
WHERE title = 'Revolver' AND year = 1966;
title
year
tracks
Revolver
1966
{1: 'Taxman', 2: 'Eleanor Rigby'}
DELETE tracks FROM album
WHERE title = 'Revolver' AND year = 1966;
title
year
tracks
Revolver
1966
©2014 DataStax Training. Use only with permission. Slide 87
How do you manipulate user-defined types and tuples?
• User-defined type - Defining and inserting
CREATE TYPE track (
album_title text,
album_year int,
track_title text
);
CREATE TABLE track_ratings_by_user (
user UUID,
activity TIMEUUID,
rating INT,
song frozen <track>,
PRIMARY KEY (user, activity)
) WITH CLUSTERING ORDER BY (activity desc));
INSERT INTO track_ratings_by_user (user, activity, rating,
song ) VALUES (6ed4f220-5361-11e4-8d89-c971d060d947,
779a96e0-6eea-11e4-9803-0900200c9a66, 10,
{album_title: 'Let It Be', album_year: 1970,
track_title: 'Let It Be'});
user
activity
rating
song
62d4f220-5361-… 779a96e0-6eea-… 10 {album_title: 'Let It Be',
album_year: 1970,
track_title: 'Let It Be’}
©2014 DataStax Training. Use only with permission. Slide 88
How do you manipulate user-defined types and tuples?
• User-defined type - Updating
UPDATE track_ratings_by_user
SET song = {album_title: 'Let It Be', album_year: 1970,
track_title: ’Two of Us'}
WHERE user = 6ed4f220-5361-11e4-8d89-c971d060d947 AND
activity = 779a96e0-6eea-11e4-9803-0900200c9a66;
user
activity
rating
song
62d4f220-5361-…
779a96e0-6eea-…
10
{album_title: 'Let It Be',
album_year: 1970,
track_title: 'Two of Us’}
• User-defined type - Deleting
DELETE song from track_ratings_by_user WHERE user =
6ed4f220-5361-11e4-8d89-c971d060d947 AND activity =
779a96e0-6eea-11e4-9803-0900200c9a66;
user
activity
rating
song
62d4f220-5361-…
779a96e0-6eea-…
10
©2014 DataStax Training. Use only with permission. Slide 89
How do you manipulate user-defined types and tuples?
• Tuple - Defining and inserting
CREATE TABLE user (
id UUID PRIMARY KEY,
email text,
name text,
preferences set<text>,
equalizer frozen<tuple<float,float,float,float,float,
float,float,float,float,float>>
);
INSERT INTO user (id, equalizer)
VALUES (6ed4f220-5361-11e4-8d89-c971d060d947,
(3.0, 6.0, 9.0, 7.0, 6.0, 5.0, 7.0, 9.0, 11.0, 8.0));
id
equalizer
62d4f220-5361-…
(3.0, 6.0, 9.0, 7.0, 6.0, 5.0, 7.0, 9.0, 11.0, 8.0)
©2014 DataStax Training. Use only with permission. Slide 90
How do you manipulate user-defined types and tuples?
• Tuple - Updating
UPDATE user SET equalizer =
(4.0, 1.6, -1.8, -5.6, -0.7, 0.9, 2.9, 4.3, 4.3, 4.3)
WHERE id = 6ed4f220-5361-11e4-8d89-c971d060d947;
id
equalizer
62d4f220-5361-…
(4.0, 1.6, -1.8, -5.6, -0.7, 0.9, 2.9, 4.3, 4.3, 4.3)
• Tuple - Deleting
DELETE equalizer from user
WHERE id = 6ed4f220-5361-11e4-8d89-c971d060d947
id
equalizer
62d4f220-5361-…
©2014 DataStax Training. Use only with permission. Slide 91
Exercise 9: Manipulate values in counter, collection and
UDT columns
©2014 DataStax Training. Use only with permission. Slide 92
What is the purpose of the BATCH statement?
• BATCH statement combines multiple INSERT, UPDATE, and
DELETE statements into a single logical operation
• Saves on client-server and coordinator-replica communication
• Atomic operation
• If any statement in the batch succeeds, all will
• No batch isolation
• Other “transactions” can read and write data being affected by a partially
executed batch
©2014 DataStax Training. Use only with permission. Slide 93
What is the purpose of the BATCH statement?
• Example
BEGIN BATCH
DELETE FROM albums_by_performer
WHERE performer = 'The Beatles' AND
year = 1966 AND title = 'Revolver';
INSERT INTO albums_by_performer (performer, year, title,
genre)
VALUES ('The Beatles', 1966, 'Revolver', 'Rock');
APPLY BATCH;
• BEGIN UNLOGGED BATCH
• Does not write to the batchlog
• Saves time but no longer atomic
• Allows operations on counter columns
©2014 DataStax Training. Use only with permission. Slide 94
What is the purpose of the BATCH statement?
• Lightweight transactions in batch
• Batch will execute only if conditions for all lightweight transactions are met
• All operations in batch will execute serially with the increased performance
overhead
BEGIN
BATCH
UPDATE user SET lock = true IF lock = false;
WHERE performer = 'The Beatles' AND
year = 1966 AND title = 'Revolver';
INSERT INTO albums_by_performer (performer, year, title,
genre)
VALUES ('The Beatles', 1966, 'Revolver', 'Rock');
UPDATE user SET lock = false;
APPLY BATCH;
©2014 DataStax Training. Use only with permission. Slide 95
What is the syntax of the SELECT statement?
• Retrieves rows from a table that satisfy an optional condition
• SELECT – Which columns to retrieve?
• FROM – Which table to retrieve from?
• WHERE – What condition must rows satisfy?
• ORDER BY – How to sort a result set?
• LIMIT – How many rows to return?
• ALLOW FILTERING – Is scanning over all partitions allowed?
SELECT select_expression
FROM keyspace_name.table_name
WHERE relation AND relation ...
ORDER BY ( clustering_column ( ASC | DESC )...)
LIMIT n
ALLOW FILTERING
©2014 DataStax Training. Use only with permission. Slide 96
What is the syntax of the SELECT statement?
• To retrieve all rows
SELECT *
FROM album;
• To retrieve specific columns of all rows
SELECT performer, title, year
FROM album;
• To retrieve a specific field from a user-defined type column
SELECT performer.lastname
FROM album;
• To compute the number of rows in a table
SELECT COUNT(*)
FROM album;
©2014 DataStax Training. Use only with permission. Slide 97
What predicates are allowed in the WHERE clause?
• Equality search – one partition
• To retrieve one partition, values for all partition key columns must be
specified
• In a single-row partition, row = partition
CREATE TABLE tracks_by_album ( …
PRIMARY KEY ((album_title, year), number));
SELECT album_title, year, number, track_title
FROM tracks_by_album
WHERE album_title = 'Revolver' AND year = 1966;
album_title
year
number
track_title
Revolver
1966
1
Taxman
Revolver
1966
2
Eleanor Rigby
…
…
…
…
Revolver
1966
14
Tomorrow Never
Knows
©2014 DataStax Training. Use only with permission. Slide 98
What predicates are allowed in the WHERE clause?
• Equality search – one row
• To retrieve one row, values for all primary key columns must be specified
• In a single-row partition, primary key = partition key
CREATE TABLE tracks_by_album ( …
PRIMARY KEY ((album_title, year), number));
SELECT album_title, year, number, track_title
FROM tracks_by_album
WHERE album_title = 'Revolver' AND year = 1966 AND
number = 6;
album_title
year
number
track_title
Revolver
1966
6
Yellow Submarine
©2014 DataStax Training. Use only with permission. Slide 99
What predicates are allowed in the WHERE clause?
• Equality search – subset of rows
• To retrieve a subset of rows in a partition, values for all partition key columns
and all clustering columns must be specified with the last clustering column
value being a set
• IN is only allowed on the last clustering column of a primary key
CREATE TABLE tracks_by_album ( …
PRIMARY KEY ((album_title, year), number));
SELECT album_title, year, number, track_title
FROM tracks_by_album
WHERE album_title = 'Revolver' AND year = 1966 AND
number IN (2,6,7,14);
album_title
year
number
track_title
Revolver
1966
2
Eleanor Rigby
Revolver
1966
6
Yellow Submarine
Revolver
1966
7
She Said She Said
Revolver
1966
14
Tomorrow Never Knows
©2014 DataStax Training. Use only with permission. Slide 100
What predicates are allowed in the WHERE clause?
• Equality search – subset of rows
• To retrieve a subset of rows in a partition, values for all partition key columns
and one or more but not all clustering columns must be specified
• Clustering columns in a predicate must constitute a prefix of clustering columns
specified in the primary key definition
CREATE TABLE albums_by_performer ( …
PRIMARY KEY (performer, year, title));
SELECT title, year
FROM albums_by_performer
WHERE performer = 'The Beatles' AND year = 1970;
title
year
At The Hollywood Bowl
1970
Let It Be
1970
The Beatles Christmas Album
1970
©2014 DataStax Training. Use only with permission. Slide 101
What predicates are allowed in the WHERE clause?
• Equality search – multiple partitions
• To retrieve multiple partitions, a set of values for a partition key must be
specified using IN
• IN is only allowed on the last column of a partition key
CREATE TABLE albums_by_performer ( …
PRIMARY KEY (performer, year, title));
SELECT performer, title, year
FROM albums_by_performer
WHERE performer IN ('The Beatles', 'Deep Purple');
performer
title
year
The Beatles
Let It Be...Naked
2003
…
…
…
The Beatles
With The Beatles
1963
Deep Purple
Abandon
1998
…
…
…
©2014 DataStax Training. Use only with permission. Slide 102
What predicates are allowed in the WHERE clause?
• Range search
• >, >=, <, <=
• Can only a range search on a partition key using the token() function
WHERE
token(key) >= token(?) AND token(key) < token(?)
• Results are not meaningful for RandomPartitioner and Murmur3Partitioner
• Allowed on only one clustering column in a predicate
• This column should be defined later in the PRIMARY KEY clause than any other
clustering column used in a predicate
©2014 DataStax Training. Use only with permission. Slide 103
What predicates are allowed in the WHERE clause?
• Range search – subset of rows
CREATE TABLE tracks_by_album ( …
PRIMARY KEY ((album_title, year), number));
SELECT album_title, year, number, track_title
FROM tracks_by_album
WHERE album_title = 'Revolver' AND year = 1966 AND
number >= 6 AND number < 8;
album_title
year
number
track_title
Revolver
1966
6
Yellow Submarine
Revolver
1966
7
She Said She Said
©2014 DataStax Training. Use only with permission. Slide 104
What predicates are allowed in the WHERE clause?
• Range search – slice of a partition
CREATE TABLE track_by_duration ( …
PRIMARY KEY (track_title, minutes, seconds));
SELECT album_title, year, number, track_title
FROM tracks_by_duration
WHERE album_title = 'Revolver' AND year = 1966 AND
(minutes, seconds) >= (2, 30) AND
(minutes, seconds) < (6, 0);
album_title
year
number
track_title
Revolver
1966
6
Yellow Submarine
Revolver
1966
7
She Said She Said
©2014 DataStax Training. Use only with permission. Slide 105
What is the purpose of the LIMIT clause?
• LIMIT restricts the number of returned rows
• Default value is 10,000 (cqlsh)
• To retrieve less rows
SELECT * FROM performer LIMIT 10;
• To retrieve more rows
SELECT * FROM performer LIMIT 100000;
©2014 DataStax Training. Use only with permission. Slide 106
What is the purpose of the ALLOW FILTERING clause?
• Allows scanning over all partitions
• Predicate does not specify values for partition key columns
• Relaxes the requirement that a partition key must be specified
• Potentially expensive queries that may return large results
• Use with caution
• LIMIT clause is recommended
• Predicate can have equality or inequality relations on clustering columns
• Return 7th tracks for the first 10 albums in the table
SELECT * FROM tracks_by_album
number = 7 LIMIT 10 ALLOW FILTERING;
WHERE
• Return the number of albums with 30 or more tracks
SELECT COUNT(*) FROM tracks_by_album
WHERE number = 30 LIMIT 100000 ALLOW FILTERING;
©2014 DataStax Training. Use only with permission. Slide 107
How are indexed columns used in a query?
• A predicate may involve only an indexed column
CREATE INDEX performer_country_key ON performer (country);
SELECT name FROM performer WHERE country = 'Iceland';
• A predicate may involve primary key and indexed columns
• Useful to narrow a search in a large multi-row partition
• A predicate may involve multiple indexed columns
• ALLOW FILTERING must be used
CREATE INDEX performer_country_key ON performer (country);
CREATE INDEX performer_style_key ON performer (style);
SELECT name FROM performer
WHERE country = 'Iceland' AND style = 'Rock' ALLOW FILTERING;
©2014 DataStax Training. Use only with permission. Slide 108
How are indexed collection columns queried?
• Searches on indexed collections uses the CONTAINS keyword
• Set, List, Map – Search for a value
CREATE INDEX ON user (preferences);
SELECT id FROM user
WHERE preferences CONTAINS 'Rock';
• Map – Search for a key
CREATE INDEX ON album (tracks);
SELECT title, tracks FROM album
WHERE tracks CONTAINS KEY 20;
©2014 DataStax Training. Use only with permission. Slide 109
How are indexed UDT and tuple columns queried?
• The column is treated as a blob and must search on all fields
• User-defined type – Search all fields
CREATE INDEX ON track_ratings_by_user (song);
SELECT * FROM track_ratings_by_user
WHERE song = {album_title: 'Beatles For Sale',
album_year: 1964,
track_title: 'Cant Buy Me Love'};
• Tuple – Search all fields
CREATE INDEX ON user (equalizer);
SELECT * FROM user
WHERE equalizer = (1.0, 2.0, 3.0, 4.0, 5.0,
6.0, 7.0, 8.0, 9.0, 10.0);
©2014 DataStax Training. Use only with permission. Slide 110
Exercise 10: Explore equality and range search in queries
©2014 DataStax Training. Use only with permission. Slide 111
What is the purpose of the ORDER BY clause?
• ORDER BY specifies how query results must be sorted
• Allowed only on clustering columns
• Default order is ASC or as defined by WITH CLUSTERING ORDER
• Default order can be reversed for all clustering columns at once
CREATE TABLE tracks_by_album ( …
PRIMARY KEY ((album_title, year), number));
SELECT album_title, year, number, track_title
FROM tracks_by_album
WHERE album_title = 'Revolver' AND year = 1966
ORDER BY number DESC;
album_title
year
number
track_title
Revolver
1966
14
Tomorrow Never Knows
Revolver
1966
13
Got to Get You Into My Life
…
…
…
…
Revolver
1966
1
Taxman
©2014 DataStax Training. Use only with permission. Slide 112
What functions are available in CQL?
• TIMEUUID functions
• dateOf() – extracts the timestamp as a date of a timeuuid column
SELECT dateOf(timeuuid_column), … FROM …;
• now() – generates a new unique timeuuid
INSERT INTO … (timeuuid_column, …) VALUES (now(), …);
• minTimeuuid() and maxTimeuuid() – return a UUID-like result
given a conditional time component as an argument
SELECT * FROM … WHERE … AND
timeuuid_column > maxTimeuuid('2014-01-01 00:00+0000') AND
timeuuid_column < minTimeuuid('2014-03-01 00:00+0000');
• unixTimestampOf() – extracts the “raw” timestamp of a timeuuid
column as a 64-bit integer
SELECT unixTimestampOf(timeuuid_column), … FROM …;
©2014 DataStax Training. Use only with permission. Slide 113
What functions are available in CQL?
• Blob conversion functions
• Series of typeAsBlob() and blobAsType() functions
SELECT varcharAsBlob(varchar_column), … FROM …;
SELECT blobAsBigint(blob_column), … FROM …;
• Token access function
• token() function
SELECT * FROM … WHERE token(partition_key) > token(2014);
©2014 DataStax Training. Use only with permission. Slide 114
Demo 11: Explore queries with various predicates
(optional)
©2014 DataStax Training. Use only with permission. Slide 115
Learning Objectives
• Understand the Cassandra data model
• Introduce cqlsh (optional)
• Understand and use the DDL subset of CQL
• Introduce DevCenter
• Understand and use the DML subset of CQL
• Understand basics of data modeling (optional)
©2014 DataStax Training. Use only with permission. Slide 116
What is data modeling?
• Data modeling is a process that involves
• Collection and analysis of data requirements in an information system
• Identification of participating entities and relationships among them
• Identification of data access patterns
• A particular way of organizing and structuring data
• Design and specification of a database schema
• Schema optimization and data indexing techniques
• Data modeling = Science + Art
©2014 DataStax Training. Use only with permission. Slide 117
What are the key steps of data modeling?
• Key steps of data modeling for Cassandra
1. Understand data and application queries
• Data may or may not exist in some format (RDBMS, XML, CSV, …)
• Queries can be organized into a query graph
2. Design column families
• Design is based on access patterns or queries over data
3. Implement the design using CQL
• Optimizations concerning data types, keys, partition sizes, ordering
©2014 DataStax Training. Use only with permission. Slide 118
What are the key steps of data modeling?
• The products of the data modeling steps are documented as
• Conceptual data model
• Technology-independent, unified view of data
• Entity-relationship model, dimensional model, etc.
• Logical data model
• Unique for Cassandra
• Column family diagrams
• Physical data model
• Unique for Cassandra
• CQL definitions
©2014 DataStax Training. Use only with permission. Slide 119
What is a data modeling framework?
• Defines transitions between
models
Conceptual
Model
• Query-driven methodology
• Formal analysis and validation
Query-‐Driven
Methodology
Logical
• Defines a scientific approach to Model
data modeling
Analysis
&
• Modeling rules
Validation
• Mapping patterns
• Schema optimization techniques
Physical
Model
©2014 DataStax Training. Use only with permission. Slide 120
What is a conceptual data model?
• Unified view of data
• Captures understanding of data entities and relationships
• Technology-independent
• Has nothing to do with existing database models
• Graphical representations
• Entity-relationship diagrams
• Chen notation recommended
• Dimensional modeling diagrams
• UML diagrams
©2014 DataStax Training. Use only with permission. Slide 121
What is a conceptual data model?
• Conceptual data model for music data
• ER diagram (Chen notation)
• Describes entities, relationships, roles, keys, cardinalities
• What is possible and what is not in existing or future data
style year genre id name
country title
format
1 n email
name Performer releases Album User
cover
image
1 preferences
1
IsA performs
born has id
founded disjoint
m
died
covering
n timestamp
n involvedIn Activity
Band Artist
1
Track disjoint
rating
not
covering IsA
n has
m
member number
period title Play Rate
©2014 DataStax Training. Use only with permission. Slide 122
What is the Cassandra data modeling methodology?
• Defines how a conceptual DM maps to a logical DM
• Modeling rules
• Ensure that a query is efficiently supported by a column family
• Mapping patterns
• Pattern input: one or more components of a conceptual DM
• Pattern input: a query
• Pattern output: a column family or several alternative solutions
• Enables an algorithmic approach to Cassandra data modeling
• For each query
• Identify a subset of the conceptual DM that describes query data
• Apply a suitable mapping pattern on the subset and the query
©2014 DataStax Training. Use only with permission. Slide 123
What is a logical data model?
• Data is viewed and organized into column families or tables
• Both column families and tables can be used at the logical level
• Table is a two-dimensional view of a multi-dimensional column family
• Chebotko Diagram
• Graphical representation of a logical data model
• A column family is represented by a rectangle
• Column family name
• Columns that may optionally be designated as K (partition key), C
(clustering column), S (static column), and IDX (indexed column)
• Access patterns are represented by links between column families
• Labeled with queries
©2014 DataStax Training. Use only with permission. Slide 124
What is a logical data model?
Q1 Q2 Q3 Q5 Q8 Q12
Performers_by_style Performer Album Albums_by_genre User Activities_by_user
Q2
style K name K title K Q5 genre K id K Q9 user K
name C↑ type year K performer
C↑ name
activity (timeuuid) C↓
country performer Q3
year C↓ email type IDX
style genre title C↑ preferences
(set) album_title
Q4 founded tracks
(map) album_year
born Q8 track_title
died Q6 rating
Q4
Activities_by_track
Q3 Albums_by_track
Albums_by_performer album_title K
Q1 track K Q7
album_year K
performer
K Q3 performer
C↑
track_title K
year
C↓ year C↓
activity
(timeuuid) C↓
title
C↑ title C↑
user
genre Q7 type
Q7 rating
Tracks_by_album Track_stats
ACCESS
PATTERNS
Q1:
Find
performers
for
a
specified
style;
order
by
performer
(ASC).
album K album_title K
Q2:
Find
information
for
a
specified
performer
(artist
or
band). year K Q11 album_year K
Q3:
Find
information
for
a
specified
album
(title
and
year). number
C↑ track_title K
Q4:
Find
albums
for
a
specified
performer;
order
by
album
release
year
(DESC)
and
title
(ASC).
Q5:
Find
albums
for
a
specified
genre;
order
by
performer
(ASC),
year
(DESC),
and
title
(ASC). performer
S Q10 num_ratings
(counter)
Q6:
Find
albums
and
performers
for
a
specified
track
title;
order
by
performer
(ASC),
year
(DESC),
and
title
(ASC). genre S sum_ratings
(counter)
Q7:
Find
tracks
for
a
specified
album
(title
and
year);
order
by
track
number
(ASC).
Q8:
Find
information
for
a
specified
user. title num_plays
(counter)
Q9:
Find
activities
for
a
specified
user;
order
by
activity
time
(DESC).
Q10:
Find
statistics
for
a
specified
track.
Q11:
Find
user
activities
for
a
specified
track;
order
by
activity
time
(DESC).
Q12:
Find
user
activities
for
a
specified
activity
type.
…
©2014 DataStax Training. Use only with permission. Slide 125
How do you analyse and validate a logical design?
• Important considerations
• Natural or surrogate keys?
• Are write conflicts (overwrites) possible?
• What data types to use?
• How large are partitions?
• How much data duplication is required?
• Are client-side joins required and at what cost?
• Are data consistency anomalies possible?
• How to enable transactions and data aggregation?
• …
• Various optimization techniques are defined and applied
• Result in a physical data model
©2014 DataStax Training. Use only with permission. Slide 126
What is a physical data model?
• Final blueprint of database schema design
• CQL script that instantiates a database schema in Cassandra
• Chetbotko Diagrams can be used at the physical level to visualize the design
• When there are significant differences from the logical design
• A physical-level Chetbotko Diagram should show column data types
Q1 Q2 Q3 Q5 Q8 Q12
Performers_by_style Performer Album Albums_by_genre User Activities_by_user
Q2
style TEXT K name TEXT K title TEXT
K Q5 genre TEXT K id UUID K
Q9 user UUID K
name TEXT C↑ type TEXT year INT
K performer
TEXT C↑ name
TEXT activity TIMEUUID C↓
country performer TEXT Q3
TEXT year INT C↓ email TEXT type TEXT IDX
style TEXT genre TEXT title TEXT C↑ preferences
SET<TEXT> album_title TEXT
Q4 founded INT tracks MAP<INT,TEXT> album_year INT
Q8
born INT track_title TEXT
died INT Q6 Activities_by_track rating INT
Q4
album_title TEXT K
Q3 Albums_by_track activity_month
was
added
album_year INT K
Albums_by_performer Q1 track TEXT K Q7 track_title TEXT K to
the
physical
data
model
performer TEXT
K Q3 performer
TEXT C↑ activity_month TIMESTAMP K to
split
large
partitions
year INT
C↓ year INT C↓ activity
TIMEUUID C↓
title
TEXT
C↑ title TEXT C↑ user UUID
C*
data
types
were
added
genre TEXT Q7 type TEXT to
all
columns
Q7 rating INT
Tracks_by_album Track_stats
ACCESS
PATTERNS
Q1:
Find
performers
for
a
specified
style;
order
by
performer
(ASC).
album TEXT K album_title TEXT K
Q2:
Find
information
for
a
specified
performer
(artist
or
band). year INT K Q11 album_year INT K
Q3:
Find
information
for
a
specified
album
(title
and
year). number
INT C↑ track_title TEXT K
Q4:
Find
albums
for
a
specified
performer;
order
by
album
release
year
(DESC)
and
title
(ASC).
Q5:
Find
albums
for
a
specified
genre;
order
by
performer
(ASC),
year
(DESC),
and
title
(ASC). performer
TEXT S Q10 num_ratings
COUNTER
Q6:
Find
albums
and
performers
for
a
specified
track
title;
order
by
performer
(ASC),
year
(DESC),
and
title
(ASC). genre TEXT S sum_ratings
COUNTER
Q7:
Find
tracks
for
a
specified
album
(title
and
year);
order
by
track
number
(ASC).
Q8:
Find
information
for
a
specified
user. title TEXT num_plays
COUNTER
Q9:
Find
activities
for
a
specified
user;
order
by
activity
time
(DESC).
Q10:
Find
statistics
for
a
specified
track.
Q11:
Find
user
activities
for
a
specified
track;
order
by
activity
time
(DESC).
Q12:
Find
user
activities
for
a
specified
activity
type.
...
©2014 DataStax Training. Use only with permission. Slide 127
Is relational database design similar to
Cassandra database design?
No!
Cassandra
Relational
• Multi-dimensional column family
• Two-dimensional relation
• Equally good for simple and complex • Suited for simple data
data
• Complex data requires many relations
and “star” schemas
• All data required to answer a query • Data from many relations is combined
must be nested in a column family
to answer a query
• Referential integrity is a non-issue
• Referential integrity is important
• Data modeling methodology is driven • Data modeling is driven by data only
by queries and data
• Data duplication is considered a
• Data duplication is considered normal problem (normalization theory)
(side effect of data nesting)
©2014 DataStax Training. Use only with permission. Slide 128
How do you migrate from a relational database
to Cassandra?
• The common ground
• The conceptual data model is the same (technology-independent)
• Application queries executed over data are the same
• SQL and CQL are not
• General idea
• Extract (reverse engineer) a conceptual data model from a relational database
schema
• Analyze queries
• Perform logical and physical design for Cassandra as usual
• Execute SQL queries and import their results into respective column families
in Cassandra
• Rewrite queries in CQL
• There can be many nuances
©2014 DataStax Training. Use only with permission. Slide 129
Where do you learn more about data modeling?
• A course specifically dedicated to data modeling
• Apache Cassandra: Data Modeling
• datastax.com
• planetcassandra.org
• cassandra.apache.org
©2014 DataStax Training. Use only with permission. Slide 130
Demo 12: Explore CQL and data modeling resources
©2014 DataStax Training. Use only with permission. Slide 131
Summary
• Data in Cassandra is stored in column families or tables
• Column family is a set of rows with unique row keys
• Table is a set of partitions with unique partition keys
• Table is a two-dimensional view of a multi-dimensional column
family
• Table partitions and partition keys correspond to column family
rows and row keys
• Table rows are different from column family rows
• Table partitions can be single-row or multi-row depending on the
absence or presence of clustering columns, respectively
• Table primary key uniquely identifies a row and is formed by a
partition key and clustering columns
©2014 DataStax Training. Use only with permission. Slide 132
Summary
• CQL keyspace-related statements: CREATE KEYSPACE, USE, DROP
KEYSPACE
• CQL table-related statements: CREATE TABLE, ALTER TABLE,
DROP TABLE
• CQL index-related statements: CREATE INDEX, DROP INDEX
• CQL data types: VARCHAR, TEXT, INT, UUID, TIMEUUID,
TIMESTAMP, COUNTER, SET, LIST, MAP, etc.
• CQL data manipulation statements: INSERT, UPDATE, DELETE,
TRUNCATE, BATCH, SELECT (INSERT and UPDATE have a TTL
option)
• CQL query clauses: SELECT, FROM, WHERE, ORDER BY, LIMIT,
ALLOW FILTERING
©2014 DataStax Training. Use only with permission. Slide 133
Summary
• Data modeling steps require to understand data and queries, design
column families, optimize, and implement tables in CQL
• Conceptual data model is technology-independent
• Logical data model is captured using column family diagrams
• Physical data model is captured in CQL schema definitions
• Data modeling framework defines transitions between conceptual,
logical and physical data models
• Data modeling methodology is query-driven
©2014 DataStax Training. Use only with permission. Slide 134
Review Questions
• What is the relationship between a column family and a CQL table?
• How are wide rows implemented in CQL?
• How are clustering columns ordered?
• What is the difference between UUID and TIMEUUID?
• When should secondary indexes be used?
• Are CQL counters 100% accurate?
• How does an upsert work?
• What predicates are allowed in a CQL query?
• When should the ALLOW FILTERING clause be used?
• How can data from two tables be combined in a CQL query?
• What are components of the data modeling framework?
• What is the purpose of Chetboko Diagrams?
©2014 DataStax Training. Use only with permission. Slide 135
©2014 DataStax Training. Use only with permission. Slide 136