SIC - Big Data - Chapter 5 - Workbook
Contents
Lab 1: Start with MariaDB
Lab 2: Working with SQL Tables
Lab 3: Working with Tables
Lab 4: Working with Queries
Lab 5: Using Pig for ETL Processing (Pre-processing)
Lab 6: Running Basic Queries with Hive QL
Lab 7: Handling partitioned table for performance
Lab 8: Working with complex Data type
Lab 9: Complex Queries
Lab 10: Working with Apache Impala
Lab 11: Create a batch view
Lab 12: Create a speed view
Lab 1: Start with MariaDB
In this lab, you will use DDL and DML commands in MariaDB. Many RDBMS products exist, and
their commands are not all identical, but most of them support standard SQL, so command
compatibility is rarely a problem. Here, we use the commands as implemented in MariaDB.
1.1. In a terminal window, log in to MariaDB and list the existing databases.
Note: If you do not supply the password on the command line, you will be prompted for it:
1.2. If the login is successful, the "MariaDB [(none)]>" prompt appears and a screen waiting
for commands is displayed. Enter a command to check which database exists here.
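A possible command sequence for 1.1 and 1.2, assuming the lab account student/student used elsewhere in this workbook:
$ mysql -u student -p
MariaDB [(none)]> show databases;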
1.3. Next, enter the command to create a new test database and review the tables in the test DB.
1.4. Drop the test database.
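The commands for 1.3 and 1.4 might look like this:
MariaDB [(none)]> create database test;
MariaDB [(none)]> use test;
MariaDB [test]> show tables;
MariaDB [test]> drop database test;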
1.5. Using the labs database, create the accounts table as follows and verify the accounts
table.
Column Name Column Type Constraints
user_id serial Primary key
username varchar(30) UNIQUE, NOT NULL
password varchar(20) NOT NULL
email varchar(50) UNIQUE, NOT NULL
last_login timestamp
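A possible CREATE TABLE statement for this schema (run in the labs database):
MariaDB [labs]> create table accounts (user_id serial primary key, username varchar(30) unique not null, password varchar(20) not null, email varchar(50) unique not null, last_login timestamp);
MariaDB [labs]> desc accounts;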
1.5.2. Switch the database in use to labs and create the table.
1.6. Use CTAS (Create Table As Select) to create authors_100, which stores only 100 records
from the existing authors table.
MariaDB [labs]> create table authors_100 as select * from authors where id <= 100;
1.7. Use CTAS to create authors2 with the same schema without records.
MariaDB [labs]> create table authors2 as select * from authors where id = -1;
1.8. Use the authors table to query records with the following conditions, and show the
results.
1.8.2. Find everyone who uses example.com for their email address.
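A query along these lines should work (following the pattern shown in 1.8.3 below):
MariaDB [labs]> select * from authors where email like '%@example.com';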
1.8.3. Among the results of 1.8.2, list only those born on or after January 1, 2019, in
descending order of birth date.
MariaDB [labs]> select * from authors where email like '%@example.com' and birthdate
>= '2019-01-01' order by birthdate desc;
1.8.4. List only the top 10 people born after 2015.
MariaDB [labs]> select * from authors where birthdate >= '2015-01-01' order by
birthdate desc limit 10;
1.9. Use the authors_100 table to delete records with the following conditions, and show
the results.
1.9.1. Display the current number of records, delete all records with a birth date before
2015, and verify that they have been deleted.
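One possible sequence, using the birthdate column from 1.8:
MariaDB [labs]> select count(*) from authors_100;
MariaDB [labs]> delete from authors_100 where birthdate < '2015-01-01';
MariaDB [labs]> select count(*) from authors_100;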
1.9.2. Confirm that record 87 has been deleted, then add the following data to the table
and validate these records. Add the following three rows:
MariaDB [labs]> insert into authors_100 values (77, 'Jason', 'Park', 'jsjeong@abc.com',
'1995-02-05', now());
MariaDB [labs]> insert into authors_100 values (1, 'Taehyung', 'Kim',
'thkim@example.com', '1995-12-30', now());
MariaDB [labs]> insert into authors_100 (first_name) values ('TBD');
1.10. Update table data. Modify the record whose id value is 0 with the following
information.
Lab 2: Working with SQL Tables
1. Create a table named countries. Use the following schema:
Column Name Column Type Constraints
id serial
name varchar(50) UNIQUE, NOT NULL
city varchar(50) NOT NULL
population integer
latitude decimal(10,8)
longitude decimal(11,8)
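A possible CREATE TABLE statement matching the schema above:
MariaDB [labs]> create table countries (id serial, name varchar(50) unique not null, city varchar(50) not null, population integer, latitude decimal(10,8), longitude decimal(11,8));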
1.2. Create a table named products. The table must contain the following sample data and
an auto-incrementing pid column:
Product name Model Max Weight Max service years Parts name
DC motor D9I40GBH 750 15 DCX
BLDC motor S7J30GHE 3,800 25 SNP
AC motor G8I50BHE 10,000 30 GDE
Numbers should be formatted with commas as thousands separators and periods as
decimal points. Decimals should not be formatted beyond the hundredths place.
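One way to model this is sketched below; the column names other than pid are assumptions derived from the headers above, and FORMAT() can be used at query time for the comma/decimal display:
MariaDB [labs]> create table products (pid serial primary key, product_name varchar(50), model varchar(20), max_weight decimal(10,2), max_service_years int, parts_name varchar(20));
MariaDB [labs]> insert into products (product_name, model, max_weight, max_service_years, parts_name) values ('DC motor', 'D9I40GBH', 750, 15, 'DCX'), ('BLDC motor', 'S7J30GHE', 3800, 25, 'SNP'), ('AC motor', 'G8I50BHE', 10000, 30, 'GDE');
MariaDB [labs]> select product_name, format(max_weight, 2) from products;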
1.4. Create a table named orders. Use the following conditions:
2. Modifying tables
2.2.3. Change the data type of date_of_birth to a date type instead of a string
2.3.1. The maximum weight can now be in the range of 0.0001 kg to 200,000 kg
2.4.1. Add a new column to store customer email addresses. Make it a string with
maximum 50 characters
2.4.3. Add a new column called burger_cost. It contains dollars and cents. The
maximum value is $100. The default value is 0.
2.4.4. Add a new column called side_cost. It contains dollars and cents. The
maximum value is $100. The default value is 0.
2.4.5. Add a new column called drink_cost. It contains dollars and cents. The
maximum value is $100. The default value is 0.
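Assuming these columns belong to the orders table from 1.4, DECIMAL(5,2) holds dollars and cents up to 999.99, which covers the $100 maximum:
MariaDB [labs]> alter table orders add column burger_cost decimal(5,2) default 0;
MariaDB [labs]> alter table orders add column side_cost decimal(5,2) default 0;
MariaDB [labs]> alter table orders add column drink_cost decimal(5,2) default 0;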
Lab 3: Working with Tables
1.2.1. Add birthdays for the members of BTS. BTS members are singers and songwriters.
Name Date of Birth
Namjoon Kim September 12, 1994
Jeongguk Jeon September 1, 1997
Yoongi Min March 9, 1993
Hoseok Jung February 18, 1994
Taehyung Kim December 30, 1995
Jimin Park October 13, 1995
Seokjin Kim December 4, 1992
Also, enter the deceased status for the two entries above: Frank Sinatra is
deceased; Tom Cruise is alive, but use the default setting to enter that status.
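A sketch of the inserts, assuming the celebrities table uses the columns shown in 2.1.2 (first name, last name, occupation, date of birth, deceased) in that order:
MariaDB [labs]> insert into celebrities values ('Namjoon', 'Kim', 'Singer, Song Writer', '1994-09-12', false);
(repeat for the other six members with their birthdays; for Frank Sinatra set deceased to true, and for Tom Cruise list the columns explicitly and omit deceased so the default value is used)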
Last name is a required field.
2. Updating tables
2.1.1. Change the last name column so that we can add the entries from 1.2.3.
2.1.2. Review the schema for the celebrities table. What would happen if we tried to
enter the following:
First Name Last Name Occupation Date of Birth Deceased
Alice Perry Singer, Actor '01/08/1945' NULL
2.2. Enter the following into the orders table: You will have to examine the schema and use
INSERT statements appropriately.
Onion Ring 2.00 2
Soda 0.99 1
Shake 2.00 2
2.4.1. Product name, model, max weight, and max service years, ordered by max
service years in ascending order
2.5.3. List the first names of all celebrities who are both singers and actors (or actresses)
2.5.4. List the first and last names of all singers, sorted by age in descending order
2.6.1. List the customer with the single most expensive order
2.6.2. List all customers' email addresses so that we can send them coupons. Do not
include those that do not have an email address
2.6.4. List all orders that included a shake
2.6.5. List any order that did not include a main sandwich.
Lab 4: Working with Queries
1. Creating queries
1.2.3. What is the maximum service years, and which model has it?
1.4.2. How many customers have multiple orders, and who are they?
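A sketch for 1.4.2, assuming the orders table has a customer_id column (hypothetical name) referencing customers:
MariaDB [labs]> select customer_id, count(*) as order_count from orders group by customer_id having count(*) > 1;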
2.1.2. Change the continent to the following:
USA = North America
Germany = Europe
France = Europe
Korea = Asia
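A sketch of the updates, assuming a countries table with name and continent columns (the continent column is assumed to have been added in an earlier step):
MariaDB [labs]> update countries set continent = 'North America' where name = 'USA';
MariaDB [labs]> update countries set continent = 'Europe' where name in ('Germany', 'France');
MariaDB [labs]> update countries set continent = 'Asia' where name = 'Korea';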
2.2.4. Remove the members of BTS that we entered in Lab 3, section 1.2.1. We will
replace the individual members with the group.
Lab 5: Using Pig for ETL Processing (Pre-processing)
In this lab, you will use Pig to explore, correct, and pre-process data in files. The two
data files used in the lab are stored in different formats, so the data format must be unified
before they can be analyzed together. Pig is used to perform this pre-processing.
The pre-processed data is used for analysis with Hive in the next unit.
1.1. Create the working directory for this lab and change to it. In this lab, we will use the
works directory.
$ mkdir ~/works
$ cd ~/works
1.2. Copy the pig_data1.txt and pig_data2.txt files from the Data folder under the home
directory to this working directory.
$ cp ~/Data/pig_data* .
$ pig -x local
1.4. Load the data from the pig_data1.txt file in the shell and show the contents.
Note: You can see some records. You may discover records that are missing some data or have
errors.
1.5. Load the first three field values and display them on the screen.
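A possible way to do this in the grunt shell (col_three matches the describe command shown below; data is a hypothetical name for the relation loaded in 1.4):
grunt> data = LOAD 'pig_data1.txt';
grunt> col_three = FOREACH data GENERATE $0, $1, $2;
grunt> DUMP col_three;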
grunt> describe col_three;
1.7. Next, create a relation that meets the following conditions using the default data type
and field order.
1.7.1. Read pig_data1.txt and find only the records with the 6th field value greater
than 1000, and print only the 1st, 2nd, 3rd, and 6th field values on the result
screen.
1.7.2. Display the previous results, excluding records whose first or third field value is null.
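A sketch using positional field references and the default data types (the cast to int on $5 is needed because untyped fields default to bytearray; relation names are assumptions):
grunt> data = LOAD 'pig_data1.txt';
grunt> price_high = FILTER data BY (int)$5 > 1000;
grunt> result = FOREACH price_high GENERATE $0, $1, $2, $5;
grunt> DUMP result;
grunt> result_notnull = FILTER result BY $0 is not null and $2 is not null;
grunt> DUMP result_notnull;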
2. Processing Input Data from a Pig script
2.1. Create the script file etl_1.pig, using the commands you validated in the grunt shell.
2.1.1. First, copy the data files pig_data1.txt and pig_data2.txt to the home directory in
HDFS.
2.1.2. Load the data file pig_data1.txt using the data schema below.
Index Field Name Data type
0 brand chararray
1 date chararray
2 model chararray
3 agent chararray
4 country chararray
5 price int
6 code chararray
2.1.3. Filter out all records with null values for brand, model, or agent.
2.1.4. Select only values with a price of 1000 or more, and use the foreach command
to create relations in the following order:
Index Field Name Data type
0 code chararray
1 brand chararray
2 model chararray
3 date chararray
4 country chararray
5 price int
2.1.5. Change the brand field to uppercase and remove any leading or trailing
whitespace. (Hint: built-in function TRIM)
$ vim etl_1.pig
data = LOAD 'pig_data1.txt' AS (brand:chararray,
date:chararray,
model:chararray,
agent:chararray,
country:chararray,
price:int,
code:chararray);
data_notnull = FILTER data BY brand is not null and model is not null and agent is not
null;
price_up = filter data_notnull by price > 1000;
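-- A possible completion of steps 2.1.4 and 2.1.5 (the relation name car1_data is an
-- assumption): reorder the fields, upper-case and trim the brand, and store the
-- result in the car1 folder for use in the next lab.
car1_data = FOREACH price_up GENERATE code, UPPER(TRIM(brand)) AS brand, model, date, country, price;
STORE car1_data INTO 'car1';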
$ pig etl_1.pig
3.1.1. First, load the data file pig_data2.txt using the data schema below.
Index Field Name Data type
0 brand chararray
1 date chararray
2 model chararray
3 agent chararray
4 country chararray
5 price int
6 code chararray
3.1.2. Filter out all records with null values for brand, model, or agent.
3.1.4. Change the brand field to uppercase and remove any leading or trailing
whitespace.
3.1.5. Remove the $ sign from the price (cost) field, and create a relation with the following field order:
Index Field Name Data type
0 code chararray
1 brand chararray
2 model chararray
3 date chararray
4 country chararray
5 price chararray
3.1.6. Save the result, in the same format as in etl_1.pig, to the car2 folder in HDFS.
Hint: In the pig_data2 file, each field is separated by ',', so you must use
USING PigStorage to specify the delimiter.
data = LOAD 'pig_data2.txt' USING PigStorage(',')
AS (brand:chararray,
date:chararray,
model:chararray,
agent:chararray,
country:chararray,
price:chararray,
code:chararray);
Lab 6: Running Basic Queries with Hive QL
1. In a terminal window, run Beeline.
$ beeline -u jdbc:hive2://
1.1. If Beeline starts successfully, the "0: jdbc:hive2://> " prompt appears and a screen waiting
for commands is displayed. Enter a command to check which databases exist here.
1.2. Next, enter the command to show the tables in the default DB.
1.3. Create the test database and show the database list.
0: jdbc:hive2://> create database test;
0: jdbc:hive2://> show databases;
1.4. Delete the created test database and verify the result.
0: jdbc:hive2://> drop database test;
0: jdbc:hive2://> show databases;
1.6. Verify the products table using the desc command and check the HDFS file system.
0: jdbc:hive2://> desc products;
1.7. Change the field l_date to the r_date column name, and change the code to id.
0: jdbc:hive2://> alter table products change code id int;
0: jdbc:hive2://> alter table products change l_date r_date string;
1.9. Drop the test table and verify the hdfs directory in /user/hive/warehouse.
Note: When a managed table is deleted, its folder in the HDFS file system is also
deleted.
2. Create and load a table using Shell
2.1. Before using the shell, create the /mywarehouse directory in the HDFS file system for
use in this lab, and change its owner to student.
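A possible command sequence (assuming the HDFS superuser is reachable with sudo -u hdfs on the lab machine):
$ sudo -u hdfs hdfs dfs -mkdir /mywarehouse
$ sudo -u hdfs hdfs dfs -chown student /mywarehouse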
2.3. Move the data pre-processed with Pig in the previous lab here.
$ hdfs dfs -mv car1 car2 /mywarehouse
2.4. Create and load a table using an editor, and execute the script with the shell's -f option.
2.4.1. Create an external table using CREATE TABLE. In the last unit, you practiced
creating files with Pig for data pre-processing. The schema is shown below:
Column Name Column Type
code int
brand string
model string
r_date string
country string
price int
2.4.2. The data are separated by '\t' for each record and are stored in the
/mywarehouse/car1 directory in the hdfs file system.
$ vi products.sql
create external table if not exists products (
code int,
brand string,
model string,
r_date string,
country string,
price int
)
row format delimited
fields terminated by '\t'
location '/mywarehouse/car1';
2.5. Verify and delete the products table created by script. Terminate the beeline.
$ beeline -u jdbc:hive2://
0: jdbc:hive2://> desc products;
0: jdbc:hive2://> select * from products;
0: jdbc:hive2://> drop table products;
0: jdbc:hive2://> !quit
(exit the beeline)
$
Note: The data pre-processed with Pig was saved in the car1 folder. After creating the
external table over this folder, select the table contents.
Note: When a table created as external is deleted, table information is deleted from the
metastore, but car1’s data remains.
3.1. In a terminal, run the following command to import the country table from MariaDB
using hive-import option.
3.1.1. First, log in to Beeline and create the mydb database for the lab.
0: jdbc:hive2://> create database mydb;
Note: The mydb database is created in /user/hive/warehouse with the name mydb.db,
which distinguishes it from ordinary tables.
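A sketch of the import command, reusing the connection options from the hint in 4.2 below. The step above mentions the country table, while the verification in 3.2 queries mydb.authors, so authors is used here; substitute the table your environment actually imports:
$ sqoop import --connect jdbc:mysql://localhost/labs --username student --password student --table authors --hive-import --hive-database mydb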
3.2. Verify from Beeline that the imported table is saved correctly (10,000 records are checked).
0: jdbc:hive2://> select * from mydb.authors;
0: jdbc:hive2://> use mydb;
0: jdbc:hive2://> select count(*) from authors;
Note: This is the result of retrieving 20 records from authors. This is the case where the
database name and table name are used together.
4. Write Hive QL commands that satisfy the following conditions.
4.1. Import the posts table into the mydb database using sqoop, and find the total number of
records.
Hint: The --null-string and --null-non-string arguments are optional. If not specified,
then the string "null" will be used.
The --hive-drop-import-delims argument drops \n, \r, and \01 from string fields when
importing to Hive.
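For example, the import for 4.1 might look like the following sketch (the '\\N' null markers are an assumption; they are the values Hive interprets as NULL):
$ sqoop import --connect jdbc:mysql://localhost/labs --username student --password student --table posts --hive-import --hive-database mydb --null-string '\\N' --null-non-string '\\N' --hive-drop-import-delims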
4.2. Import only the first_name, last_name, and email columns, save them in Parquet
format in the authors_parquet folder under /mywarehouse in HDFS, and create an
authors_parquet external table over this folder.
Hint: sqoop import --connect jdbc:mysql://localhost/labs --username student --
password student --table authors --target-dir /mywarehouse/authors_parquet --as-
parquetfile
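After the import, the external table might be created as sketched below (note that the hint above would also need a --columns "first_name,last_name,email" option to restrict the import to those three columns):
0: jdbc:hive2://> create external table authors_parquet (first_name string, last_name string, email string) stored as parquet location '/mywarehouse/authors_parquet';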
4.3. Use alter table to rename the “first_name” column to “fname” in authors_parquet.
Verify the change with DESCRIBE.
4.4. Rename the entire table to authors_parquet2, and verify that the folder name in
/mywarehouse has changed.
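A sketch of the statements for 4.3 and 4.4:
0: jdbc:hive2://> alter table authors_parquet change first_name fname string;
0: jdbc:hive2://> describe authors_parquet;
0: jdbc:hive2://> alter table authors_parquet rename to authors_parquet2;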
Lab 7: Handling partitioned table for performance
In this lab, you will create a table for car data that is partitioned by part number. You already
built the data using Pig in the previous lab. We can query all the car data in the same table but
distinguish records by part number, because a partition is created for each part number.
$ beeline -u jdbc:hive2://
1.3. Make sure that the processed data is in HDFS, as follows.
1.4. Create the table called cars with the following schema:
1.4.1. The partition column is part_num (type tinyint), and field delimiter is ‘\t’
1.4.2. The table location is /mywarehouse/car with managed type.
1.4.3. Change the cars table to add two partitions. Partitions are one for part 1 and one
for part 2.
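A sketch of 1.4.1 through 1.4.3, reusing the column schema of the car data from Lab 6:
0: jdbc:hive2://> create table cars (code int, brand string, model string, r_date string, country string, price int) partitioned by (part_num tinyint) row format delimited fields terminated by '\t' location '/mywarehouse/car';
0: jdbc:hive2://> alter table cars add partition (part_num=1) partition (part_num=2);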
1.4.4. Load the data from /mywarehouse/car1 into the part 1 partition, and load the
data from /mywarehouse/car2 into the part 2 partition.
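The load for part 1 might look like this (part 2 is handled with hdfs mv in 1.4.7 below):
0: jdbc:hive2://> load data inpath '/mywarehouse/car1' into table cars partition (part_num=1);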
When the load command is executed, the file in the existing car1 is moved to the
cars/part_num=1 folder.
1.4.6. Show the data loaded from car1 using a select statement.
1.4.7. Move the data in car2 to part_num=2 using hdfs mv instead of the load
command and verify that the data looks the same as 1.4.6.
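A possible command, assuming the part_num=2 directory was created by the ALTER TABLE in 1.4.3:
$ hdfs dfs -mv /mywarehouse/car2/* /mywarehouse/car/part_num=2/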
Note: As shown above, we checked the entire data set, including part_num=2.
1.4.8. Verify that the data for both part numbers was loaded correctly by counting the
records for each:
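For example, following the pattern shown in 2.3.3 below:
0: jdbc:hive2://> select count(*) from cars where part_num=1;
0: jdbc:hive2://> select count(*) from cars where part_num=2;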
2.1. In the Firefox URL bar, open the Hue page http://localhost:8888. The ID is student, and the
password is student. Click "Sign in" to enter Hue.
Note: The table browser displays the mydb database name and a list of tables.
2.3. Execute the commands you ran in Beeline in the Hive editor and verify the results.
2.3.3. select count(*) as CNT from cars where part_num=2;
2.3.4. You can check the schema information and sample data contents by clicking "i"
next to the cars table in the left menu.
2.4. There are two problems with the data in the cars table. One is that the price is null in
part_num=2; the other is that a sub-title row such as "Brand, MODEL..." is included in the
middle of the data. Write a query that solves these problems.
2.5.12. What is the average price per brand?
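A possible query for 2.5.12:
select brand, avg(price) as avg_price from cars group by brand;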
Lab 8: Working with complex Data type
In this lab, you will work with complex data types and join operations.
1.1. First, import all the tables in the MariaDB labs database. The imported tables are
created as external Hive tables.
Order information
o Customer who ordered
o Status of order (completed, in progress, placed)
Order items information
o Which order is this for?
o List of products ordered
Products information
o name of product
o cost
o product type (sandwich, side, drink)
o redeem_points
We have added the order status so that the system can help management keep
track of all the orders.
2.3.3. Needs to link to the product's unique key for each product ordered
2.4.3. Product cost
Lab 9: Complex Queries
1. Create complex queries by joining multiple tables
1.1. Create joined queries from countries and continent to return the following information:
1.2. Create joined queries from customer, order, order_item and product to return the
following information:
1.2.1. All orders, including the customer who ordered, and all the products purchased
1.2.2. Any order that does not include a sandwich in it. Show the order id.
1.2.3. Who was the customer who made the order without the sandwich?
1.2.5. For every product, the total number of times that product has been ordered.
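As a sketch of the join pattern for 1.2.5, assuming hypothetical table and column names (products.id, products.name, order_items.product_id) for the imported schema:
0: jdbc:hive2://> select p.name, count(*) as times_ordered from products p join order_items oi on p.id = oi.product_id group by p.name;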
Lab 10: Working with Apache Impala
1. Setting up the lab environment
Impala and Kudu are resource intensive. Further, Kudu cannot be easily set up in a standalone
Hadoop cluster such as our VirtualBox environment. To overcome this limitation, we shall
follow the Apache Kudu Quickstart guidelines (https://kudu.apache.org/docs/quickstart.html),
adapted for the Windows environment, along with a Docker image of Impala to connect to the
Kudu storage.
1.1.1. Either search for "install docker desktop for windows" on your browser or go
directly to https://docs.docker.com/docker-for-windows/install/
1.1.3. At the end of the installation, you will be asked to install WSL 2 backend. Please
install this.
Note: You have previously installed Git in the AWS S3 Storage lab in Chapter 4 Unit 1.
Unless you have removed and uninstalled the program, you may skip this step.
1.2.1. Either search for "install git for windows " or go directly to https://git-
scm.com/download/win
1.2.2. You may choose either the 32-bit or 64-bit version. Whenever possible, it is
better to choose the 64-bit version but this will depend on your Windows
version.
1.3.1. From Windows search, type "powershell" and start a new Windows PowerShell
session
1.3.2. From the home directory on your Windows computer, create and navigate into
C5U3_lab folder
cd ~
mkdir C5U3_lab
cd C5U3_lab
git clone https://github.com/apache/kudu
ls
cd kudu
1.4.1. Setup the KUDU_QUICKSTART_IP environmental variable with the following
command. In PowerShell the back tick (`) is used to continue the command on
the next line.
$env:KUDU_QUICKSTART_IP=(Get-NetIPConfiguration | `
Where-Object {$_.IPv4DefaultGateway -ne $null -and `
$_.NetAdapter.Status -ne "Disconnected"}).IPv4Address.IPAddress
1.4.2. Check to make sure the variable has been set properly
$env:KUDU_QUICKSTART_IP
1.5. Bring up the docker Kudu cluster using the provided YAML configuration file
CAUTION: Docker desktop for Windows must be running and you must be in the kudu
directory before executing the following command.
docker-compose -f docker/quickstart.yml up -d
NOTE: When you first create the Kudu cluster, the displayed text will say "Creating" instead of
"Recreating"
1.5.2. Check to make sure the Kudu service is up by browsing to localhost:8050
NOTE: The Network id generated will be different on your machine. If you have
already attempted this lab before, you may get a message that the quickstart-
network already exists. You may ignore the message and continue
OR
1.7.1. Set up environmental variables for Impala to use with the following two
commands.
$env:QUICKSTART_IP=$(docker network inspect quickstart-network `
-f '{{(index .IPAM.Config 0).Gateway}}')
$env:QUICKSTART_LISTEN_ADDR=$env:QUICKSTART_IP
CAUTION: If you have previously attempted this lab, there may already be a container with
the name "/kudu-impala." In this case, it is best to remove the container and run a fresh new
kudu-impala container. Only follow the instruction to remove the existing container if a
container already exists. If not, skip to step 1.7.5.
1.7.4. CAUTION: Follow only if container already exists. Use docker container rm to
remove the existing container. Use the container id displayed in the error
message when deleting
1.7.5. Check to make sure Impala services are up. Use your browser and navigate to
localhost:25000
1.8.1. Open a command line to the Impala docker container with the following
command. The container id is displayed when you created the container.
1.8.2. Once inside the docker container, start the Impala shell.
impala-shell
Insert the following values and verify the result by executing the select statement.
2.1.1. INSERT INTO test VALUES (1, 'Tesla', '02/17/22', 'CM-9', 'Gabon', '1179.00');
2.1.2. INSERT INTO test VALUES (2, 'Hyundai', '02/05/20', 'K9', 'Korea', '5779.00');
2.1.3. INSERT INTO test VALUES (3, 'Tesla', '03/14/21', 'Y-9', 'USA', '3300.00');
2.2. Change the field l_date to the r_date column name, and change the code to id.
2.4. Verify the number of records in car_products, and then drop the car_products table.
Lab 11: Create a batch view
Creating an actual Lambda architecture is a very difficult and time-consuming project. Many
enterprises are trying to move away from this architecture whenever possible due to its high
maintenance and development costs. However, there are many use cases where Lambda
architecture is the only viable choice.
Creating an actual lambda architecture would be beyond the scope of this lab. An actual proof-
of-concept (POC) project will typically involve many moving pieces and actual code
development. In this lab, we will create a data pipeline that will have many of the working
parts without the real complexity of an actual working POC.
1.1.1. Refer to https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#netcat-tcp-source
1.2.2. Refer to https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#file-channel
1.3.2. Refer to https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#memory-channel
1.3.3. The maximum number of events stored in the channel should be 10000
1.4.2. Refer to https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#hdfs-sink
1.5.2. Refer to https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#logger-sink
1.6.2. The file channel will connect with the HDFS sink
1.6.3. The memory channel will connect with the logger sink
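Putting sections 1.1 through 1.6 together, the agent configuration file might look like the following sketch (the agent name a1 matches the --name a1 option used below; the bind address, port, directories, and HDFS path are assumptions):
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1 c2
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /home/student/flume/checkpoint
a1.channels.c1.dataDirs = /home/student/flume/data
a1.channels.c2.type = memory
a1.channels.c2.capacity = 10000
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/student/weblogs
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
a1.sinks.k2.type = logger
a1.sinks.k2.channel = c2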
--name a1 \
-Dflume.root.logger=INFO,console
cd /home/student/Scripts
python stream2.py ~/Data/anonymous-msweb.data
2.3.8. Set the location to where Flume saved the data in HDFS in step 1.4.3 above
2.4. Confirm that the new Hive table is reading from the HDFS directory. Run the following
query:
Lab 12: Create a speed view
We will simulate ingesting real-time streaming data.
sudo stop-hbase.sh
1.3. Make sure that both ZooKeeper and Kafka are running. If not, repeat the step above.
2. Create an Apache NiFi dataflow
2.5. Add a funnel
2.7.5. Add another funnel and connect the output of ConsumeKafka_2_6 to the funnel
on the success relationship
3.2.1. Navigate to /home/student/Scripts
cd /home/student/Scripts
python stream2.py ~/Data/anonymous-msweb.data
3.3. Wait until you see flowfiles flowing through the data flow. Stop all the processors and
inspect the queues.
3.3.1. Inspect the queue between ListenTCP and PublishKafka_2_6. Make sure you
view the content of the flowfile
3.5. From another terminal, use kafka-console-consumer to verify that the streaming data
has been properly ingested
kafka-console-consumer \
--bootstrap-server localhost:9092 \
--topic weblogs \
--from-beginning
3.5.2. After making sure everything is flowing properly, stop all components in NiFi
END OF LAB