8.2. CQL Exercises
8.2. CQL Exercises
8.2. CQL Exercises
com
Throughout the course we will work with 2 sets of data:
Movielens data set (real data): 6k users, 4k movies, 1 million ratings - we copy only 10k
entries
Weather data (random generated) 500k entries
Movies.csv extract
Users.csv extract
1 F 1 10 48067
2 M 56 16 70072
3 M 25 15 55117
4 M 45 7 0246
Ratings2.csv
1 1193 5 978300760
1 661 3 978302109
1 914 3 978301968
1 3408 4 978300275
1 2355 5 978824291
uid age gender ocupation zip rating datetime mid title year genres
weather-data.csv
------------+---------------------------------+-------------
All the above files can be found in the data directory
(both cloud and local installations).
cd /data
Ls -l
Note: for the intro part we play a bit with different
data types and we insert data manually, before we
work with the above data.
Valentina.Crisan@academyofdata.com
CQL exercises
Reference for syntax questions:
http://docs.datastax.com/en/cql/3.3/cql/cql_using/useAboutCQL.html
Run Cqlsh
Desc keyspaces // will list all keyspaces on the cluster
describe metro_systems; // you can see the Replication factor & strategy
set at keyspace level
Exit cqlsh and run nodetool status command - this should give you an image of your cluster
(even if it’s a 1 node cluster). Note the datacenter name.
Re-enter Cqlsh
Let’s create another keyspaceNetwork Topology replication strategy and Replication factor
1.
describe metro_systems1;
Let’s alter a created keyspace and change the property durable writes:
OR if we want to modify the replication factor from 1 to 3:
Desc metro_systems1
Stop: let’s explain what means durable writes option in the keyspace description?
First, connect to the created keyspace: i.e. all further CQL commands will be executed in the
context of the chosen keyspace.
USE metro_systems;
2.1. Create Table users with the next structure: userid (uuid), gender (text), age (int),
occupation (text), zipcode (int) and compound primary key: partition key userid and cluster
key gender.
DESCRIBE users; // you can see all properties of the table
ALTER TABLE users ADD relatives int; // add a new column to the table
ALTER TABLE users ALTER relatives type blob; // alter column type from int to blob
Describe users; // to see the types of the columns
Please note: Alter columns types, some changes will not be possible, like for example:
varchar to ascii.
ALTER TABLE users DROP relatives; //delete the new added column to the table
Describe users;
DROP TABLE users; // deletes all the data of the table including the table structure &
indexes
Assuming you dropped the table users previously, let’s re-create it.
We have a table, now let’s insert data manually. We will learn later also how to insert data
from a file, but for the moment let’s do some manual inserts/updates.
Observation: the UUID can be generated automatically with uuid ();
INSERT INTO users (uid, name, age, gender, occupation, zipcode) VALUES
(5132b130-ae79-11e4-ab27-0800200c9a66, 'Andrew', 25, 'M', 'plummer',
11501);
TRACING ON; // allows to see the events in background for Cassandra.
INSERT INTO users (uid, name, age, gender, occupation, zipcode) VALUES
(5132b130-ae79-11e4-ab27-0800200c9a66, 'Lili', 24, 'F', 'actress', 11503);
// note that I am inserting data in the same partition as the first
insert. Given that the Clustering Key is different the data will be added
to the existing partition.
TRACING OFF
SELECT * FROM users; // remember the way Cassandra stores data is
different than the CQL view.
Remember for the next section: Cassandra doesn’t read (by default) before
performing writes.
UPDATE users SET age = 35 WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a77 and
name = 'Andrew'; // the UUID - partition key - is not existent thus the update
will do an insert
SELECT * from users; // let’s check that everything went well
UPDATE u sers SET a ge = 44 WHERE uid = 5132b130-ae79-11e4-ab27-0800200c9a66 and
name = ' Elaine'; / / the partition key existing but not with this clustering
key, a n ew entry i nside an existing partition will be generated
Select * from users where uid = 5132b130-ae79-11e4-ab27-0800200c9a66; // to
see all entries in this partition
Lets do some more inserts. The first row inserted will have a new partition and cluster key. The
second row will have the same partition key but a different clustering key.
insert into users (uid, name, age, gender, occupation, zipcode) values
(uuid(),'John',35,'M','data analyst',11502); // see the uuid generated here
and use it for the next insert
insert into users (uid, name, age, gender, occupation, zipcode) values
(90278d0f-6053-4c87-a1f3-69a05853d2a8,'Elaine',34,'F','engineer',11502);
Valentina.Crisan@academyofdata.com
Before we move on, type exit in cqlsh and lets see some stats for our table.
See that data is still in memtable, no SStables generated, see also the estimated number
of partitions (Number of keys (estimate))
Use metro_systems;
Let’s try a an update on another non existing entry but with LWT:
Now let’s try an update on an existing entry with LWT:
Exit CQL
Let’s force a flush to write the data on disk in an SStable. Run
Nodetool flush metro_systems users // For cloud users please make sure you use
your own keyspace
Go to cd /var/lib/data/cassandra/ . Choose your keyspace directory and the directory
of the table. Use ls -l to list the tables generated after flush.
IF you would like to see how an SStable looks like you can use : sstabledump
nameofthedatafile // e.g. sstabledump mc-1-big-Data.db
Sstabledump -e nameofthedatafile // will show us the partitions in the specific SStable
Valentina.Crisan@academyofdata.com
3. Let’s do some queries on the data - we will come back to
Queries, this is just to get us going with the syntax
Cd /
// You can also pick the columns to display instead of choosing all data.
SELECT * FROM users LIMIT 3; // you can limit the number of returned
values
//You can fine-tune the display order using the ORDER BY clause, either descending
or ascending. The partition key must be defined in the WHERE clause and the ORDER
BY clause defines the clustering column to use for ordering. Order By works only on
clustering keys.
// You can query for a partition key that is part of a list of partition keys
You can return the number of records that fulfills a request:
You can ask for the distinct partitions in a table (DISTINCT works only on partition
keys):
SELECT * FROM users WHERE gender = 'M'; // WILL NOT WORK
//lets count how many records are in the query outcome
Tracing on;
select * from users where gender = 'F'; // to see all the operations C* is doing in the
background (even for a 1 node cluster you can see loads of operations).
CAN YOU SEE THAT THERE ARE 2 TABLES queried ON THE NODE IN ORDER TO GET
THE DATA: on the index table and on the users table
Executing single-partition query on users.users_by_gender
Executing single-partition query on users -- appears 3 times. Can you tell me why?
We will come back on this point in section Indexing in C*.
4.Collections
SETS
UPDATE users SET emails = emails + {'elaine@yahoo.com'} WHERE uid =
89d3a708-0571-440a-8a9b-997856c17789 AND name = 'Elaine';
Valentina.Crisan@academyofdata.com
UPDATE users SET emails = emails - {'elaine@yahoo.com'} WHERE uid =
89d3a708-0571-440a-8a9b-997856c17789 AND name = 'Elaine';
LISTS
update users set top_movies = ['romeo and juliet']+ top_movies where uid =
89d3a708-0571-440a-8a9b-997856c17789 and name = 'Elaine';
You can index a collection, let’s query on a movie name:
MAPS
LET’S ADD A MAP COLUMN TO TABLE USERS:
Let’s re-create table users with below syntax and import data from a CSV file.
WE HAVE IN THE DATA DIRECTORY (CLOUD OR LOCAL INSTALLATIONS) THE CSV
FILES THAT WE WILL BE WORKING WITH. WE NEED Users.csv
Did you know we can have a collection as part Primary Key? So if we want to
know movies by genres - we could define Genre as part of the Primary key.
6.2. Lets also do an exercise on User Defined Types
// When using the frozen keyword, you cannot update parts of a
user-defined type value. The entire value must be overwritten. Cassandra
treats the value of a frozen, user-defined type like a blob.
//Let’s see if we can select only parts of the UDT
select address.street from users_udt where uid = 1;
// Let’s see if we can update parts of the UDT
update users_udt_1 set address.street = 'Vitan' where uid = 1; // will not
work
Important:From 3.4.2: User-defined types may now be stored in a non-frozen
form, allowing individual fields to be updated and deleted in UPDATE
statements and DELETE statements, respectively. (CASSANDRA-7423). Let’s
test.
update users_udt_1 set address.street = 'Vitan' where uid = 1;
Select token(uid) from users1 where uid=1; // // we want to see which token is
allocated for this partition
Let’s look for the data (time & temp) on a single weather station, how would the
query look like?
What about all the temperatures between 2016-01-10 03:49:46 and 2016-01-11
16:29:59?
Query the weather data for the single day and then 2 consecutive days
SELECT * FROM temperature_by_day WHERE w_id='FFOERIRWOL' AND date='2015-03-19';
SELECT * FROM temperature_by_day WHERE w_id='FFOERIRWOL' AND date in
('2015-03-19','2015-03-18');
Valentina.Crisan@academyofdata.com
Let’s see how to query using a non primary key column:
This query will not work - because we are trying to query without partition key:
In order to work either we allow filtering of data:
select * from temperature_by_day where temperature = 2 9 a llow filtering; // use
this with caution, brings all the table data locally i n o ne coordinator.
Deleting old entries - WE CAN SET EXPIRING DATA AT TABLE LEVEL OR
PER ENTRY