
Chapter 5.

Big Data Analytics


Exercise Workbook

Contents
Lab 1: Start with MariaDB.........................................................................................................................3
Lab 2: Working with SQL Tables.............................................................................................................11
Lab 3: Working with Tables.....................................................................................................................13
Lab 4: Working with Queries..................................................................................................................16
Lab 5: Using Pig for ETL Processing (Pre-processing).......................................................................18
Lab 6: Running Basic Queries with Hive QL.........................................................................................25
Lab 7: Handling partitioned table for performance..........................................................................36
Lab 8: Working with complex Data type..............................................................................................45
Lab 9: Complex Queries...........................................................................................................................47
Lab 10: Working with Apache Impala.....................................................................................................48
Lab 11: Create a batch view......................................................................................................................57
Lab 12: Create a speed view.....................................................................................................................60

Lab 1: Start with MariaDB
In this lab, you will use DDL and DML commands in MariaDB. Many RDBMS products exist and
their commands are not all identical, but most of them support standard SQL, so command
compatibility is rarely a problem. Here, we use MariaDB-based commands.

1. Creating Database & Table

1.1. In a terminal window, log in to MariaDB and list the existing databases.

$ mysql --user=student --password=student

Note: If you do not supply a value after --password, you will be prompted for the password:

1.2. If the login is successful, the "MariaDB [(none)]>" prompt appears and waits for
commands. Enter a command to check which databases exist here.

MariaDB [(none)]> show databases;

1.3. Next, enter the commands to create a new test database and review the tables in the test database.
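For example (a minimal sketch; the test database created here is dropped again in step 1.4):

MariaDB [(none)]> create database test;
MariaDB [(none)]> use test;
MariaDB [test]> show tables;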

1.4. Drop the test database;

MariaDB [test]> drop database test;

1.5. Using the labs database, create the accounts table as follows and verify the accounts
table.
Column Name Column Type Constraints
user_id serial Primary key
username varchar(30) UNIQUE, NOT NULL
password varchar(20) NOT NULL
email Varchar(50) UNIQUE, NOT NULL
last_login timestamp

1.5.1. Create the table directly, without changing the current database.

MariaDB [(none)]> CREATE TABLE IF NOT EXISTS labs.accounts (


user_id serial PRIMARY KEY,
username VARCHAR (30) UNIQUE NOT NULL,
password VARCHAR (20) NOT NULL,
email VARCHAR ( 50 ) UNIQUE NOT NULL,
last_login timestamp );

1.5.2. Switch the current database to labs, and then create the table.

MariaDB [(none)]> use labs;
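After switching, the CREATE TABLE statement from 1.5.1 can be issued without the labs. prefix, for example:

MariaDB [labs]> CREATE TABLE IF NOT EXISTS accounts (
user_id serial PRIMARY KEY,
username VARCHAR (30) UNIQUE NOT NULL,
password VARCHAR (20) NOT NULL,
email VARCHAR (50) UNIQUE NOT NULL,
last_login timestamp );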

1.6. Use CTAS (Create Table As Select) to create authors_100, which stores only 100 records
from the existing authors table.

MariaDB [labs]> create table authors_100 as select * from authors where id <= 100;

1.7. Use CTAS to create authors2 with the same schema but no records.

MariaDB [labs]> create table authors2 as select * from authors where id = -1;

1.8. Use the authors table to query records with the following conditions, and show the
results.

1.8.1. Find all authors whose first name is Walton.

MariaDB [labs]> select * from authors where first_name = 'Walton';

1.8.2. Find everyone who uses example.com for their email address.

MariaDB [labs]> select * from authors where email like '%@example.com';

1.8.3. Among the results of 1.8.2, list only those born on or after January 1, 2019, in
descending order of birthdate.

MariaDB [labs]> select * from authors where email like '%@example.com' and birthdate
>= '2019-01-01' order by birthdate desc;

1.8.4. List only the top 10 people born after 2015.

MariaDB [labs]> select * from authors where birthdate >= '2015-01-01' order by
birthdate desc limit 10;

1.9. Use the authors_100 table to delete records with the following conditions, and
show the results.

1.9.1. Delete all records with a birthdate before 2015, then display the current number of
records to verify the deletion.

MariaDB [labs]> delete from authors_100 where birthdate < '2015-01-01';
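The remaining rows can then be checked with a count, for example:

MariaDB [labs]> select count(*) from authors_100;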

1.9.2. Confirm that the record with id 87 has been deleted, then add the following three rows
to the table and validate them:

id  first_name  last_name  email              birthdate   added

77  Jason       Park       jsjeong@abc.com    1995-02-05  now
1   Taehyung    Kim        thkim@example.com  1995-12-31  now
    TBD

MariaDB [labs]> insert into authors_100 values (77, 'Jason', 'Park', 'jsjeong@abc.com',
'1995-02-05', now());
MariaDB [labs]> insert into authors_100 values (1, 'Taehyung', 'Kim',
'thkim@example.com', '1995-12-30', now());
MariaDB [labs]> insert into authors_100 (first_name) values ('TBD');

1.10. Update table data. Modify the record whose id value is 0 with the following
information.

1.10.1. First_name:Jimin, last_name: Park, email:jmpark@abc.com

MariaDB [labs]> update authors_100 set id = 100, first_name = 'Jimin', last_name =


'Park', email = 'jmpark@abc.com' where id = 0;

1.11. Drop the authors2 table and exit MariaDB.

MariaDB [labs]> drop table authors2;


MariaDB [labs]> exit

Lab 2: Working with SQL Tables
1. Create a table named countries. Use the following schema:
Column Name Column Type Constraints
id serial
name varchar(50) UNIQUE, NOT NULL
city varchar(50) NOT NULL
population integer
latitude decimal(10,8)
longitude decimal(11,8)
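One possible CREATE TABLE statement for this schema (a sketch, assuming MariaDB):

MariaDB [labs]> CREATE TABLE IF NOT EXISTS countries (
id serial,
name VARCHAR (50) UNIQUE NOT NULL,
city VARCHAR (50) NOT NULL,
population INTEGER,
latitude DECIMAL (10,8),
longitude DECIMAL (11,8)
);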

1.1. Create a table named famous_people. Use the following conditions:


Column Name Conditions Constraints
uid auto-incrementing value
name string up to 100 characters Cannot be NULL
occupation string up to 150 characters
birthday string up to 50 characters
existence contains either true or false If unknown, should be false
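A sketch of one way to express these conditions in MariaDB (assuming uid is also the primary key; BOOLEAN is an alias for TINYINT(1)):

MariaDB [labs]> CREATE TABLE IF NOT EXISTS famous_people (
uid INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR (100) NOT NULL,
occupation VARCHAR (150),
birthday VARCHAR (50),
existence BOOLEAN NOT NULL DEFAULT FALSE
);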

1.2. Create a table named products. The table must contain the following sample data and
an auto-incrementing pid column:
Product name Model Max Weight Max service years Parts name
DC motor D9I40GBH 750 15 DCX
BLDC motor S7J30GHE 3,800 25 SNP
AC motor G8I50BHE 10,000 30 GDE

1.3. Other considerations:

• Product name and Model have a maximum of 75 characters

• Product name and Model cannot be empty

• The weights can be in the range of 0.001 kg to 40,000 kg

• Numbers should be formatted to use commas every 1000 places and periods for
decimal places. Decimals should not be formatted smaller than the 100th place.

• Parts name is a 3-letter code

1.4. Create a table named orders. Use the following conditions:

Column Name    Conditions                   Constraints

id             auto-incrementing value
customer_name  string up to 100 characters  cannot be empty
burger         string up to 50 characters
side           string up to 50 characters
drink          string up to 50 characters
order_total    dollars and cents            numeric, cannot be empty; all orders are less than $100

2. Modifying tables

2.1. Rename the famous_people table to celebrities.

2.2. Change the celebrities table with the following:

2.2.1. Change name column to first_name

Change its data type to varchar(80)

2.2.2. Add a new column named last_name.


The column can store strings up to 100 characters. This column cannot be empty

2.2.3. Change the data type of date_of_birth to a date type instead of a string

date_of_birth column cannot be empty

2.3. Change the products table with the following:

2.3.1. The maximum weight can now be in the range of 0.0001 kg to 200,000 kg

2.3.2. The model cannot contain duplicates

2.4. Change the orders table with the following:

2.4.1. Add a new column to store customer email addresses. Make it a string with
maximum 50 characters

2.4.2. Add a new column named customer_loyalty_points. It holds an integer value.


The default value for the column should be 0.

2.4.3. Add a new column called burger_cost. It contains dollars and cents. The
maximum value is $100. The default value is 0.

2.4.4. Add a new column called side_cost. It contains dollars and cents. The
maximum value is $100. The default value is 0.

2.4.5. Add a new column called drink_cost. It contains dollars and cents. The
maximum value is $100. The default value is 0.

2.4.6. Remove the order_total column.
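A sketch of ALTER TABLE statements for some of the changes in 2.4 (MariaDB syntax; the column name customer_email is an assumption):

MariaDB [labs]> ALTER TABLE orders ADD COLUMN customer_email VARCHAR (50);
MariaDB [labs]> ALTER TABLE orders ADD COLUMN customer_loyalty_points INT DEFAULT 0;
MariaDB [labs]> ALTER TABLE orders ADD COLUMN burger_cost DECIMAL (5,2) DEFAULT 0;
MariaDB [labs]> ALTER TABLE orders DROP COLUMN order_total;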

Lab 3: Working with Tables

1. Adding data to tables

1.1. Add to countries table:


Name     Capital          Population
USA      Washington D.C.  333,098,437
Germany  Berlin           84,073,352
France   Paris            65,426,179
Korea    Seoul            51,305,186
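A sketch of the corresponding INSERT (assuming the countries schema from Lab 2, with the capital stored in the city column and the coordinate columns left NULL):

MariaDB [labs]> INSERT INTO countries (name, city, population) VALUES
('USA', 'Washington D.C.', 333098437),
('Germany', 'Berlin', 84073352),
('France', 'Paris', 65426179),
('Korea', 'Seoul', 51305186);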

1.2. Add to celebrities table

1.2.1. Add birthdays for the members of BTS. BTS members are Singer and Song
Writers.
Name           Date of Birth
Namjoon Kim    September 12, 1994
Jeongguk Jeon  September 1, 1997
Yoongi Min     March 9, 1993
Hoseok Jung    February 18, 1994
Taehyun Kim    December 30, 1995
Jimin Park     October 13, 1995
Seokjin Kim    December 4, 1992

1.2.2. Use the INSERT command to add the following:


First Name Last Name Occupation Date of Birth
Yong Cho Singer, Actor December 12, 1915
SY Lee Actor July 3, 1962

Also, enter the deceased status for the two entries above. Yong Cho is
deceased. SY Lee is alive, but use the default setting to enter that status.

1.2.3. Enter the following data. What happens?


First Name Last Name Occupation Date of Birth Deceased
Jason Singer, Actress '08/15/1968' false
Henry Singer, Songwriter, Actor '11/06/1961' true

The inserts fail: last_name is a required (NOT NULL) column, and these rows do not provide a last name.

2. Updating tables

2.1. Update the celebrities table

2.1.1. Change the last name column so that we can add the entries from 1.2.3.

2.1.2. Review the schema for the celebrities table. What would happen if we tried to
enter the following:
First Name Last Name Occupation Date of Birth Deceased
Alice Perry Singer, Actor '01/08/1945' NULL

2.2. Enter the following into the orders table: You will have to examine the schema and use
INSERT statements appropriately.

2.2.1. Customer information:


Henry Kim, Shaun Silverman, and Jason Gomez.

2.2.2. Customer email:


Henry's email address is henry@lab.com.
Shaun's email address is shaun@lab.com.
Jason doesn't have an email address.

2.2.3. Order information:


Henry orders a cheeseburger, fries and a soda
Shaun orders a cheeseburger, onion ring and chocolate shake
He also orders a chicken burger, fries, soda
Jason orders a side of onion rings with a strawberry shake.

The item costs and redeem points are listed below:


Item                 Cost  Redeem points
Burger               3.00  5
Cheeseburger         4.00  7
Chicken Burger       3.50  6
Double Cheeseburger  6.00  9
Fries                1.50  2
Onion Ring           2.00  2
Soda                 0.99  1
Shake                2.00  2

2.3. Query the countries table for the following information:

2.3.1. Population of Korea

2.3.2. Population and capital of all entries in the countries table

2.3.3. Names of countries in alphabetical order

2.3.4. Country, capital and population, ordered by population, descending

2.3.5. Same as above but ordered ascending

2.3.6. Countries with a population less than 100,000,000.

2.3.7. Countries with a population between 50 and 100 million

2.4. Query the products table for the following information:

2.4.1. Product name, model, max weight, and max service years, ordered by max
service years in ascending order

2.4.2. List the names of products in descending order of maximum weight

2.5. Query the celebrities table for the following information:

2.5.1. First and last name of all celebrities still alive

2.5.2. All celebrities born after 1990, in ascending order by age

2.5.3. List the first name of all celebrities who are both singers and actors (or actresses)

2.5.4. List first and last name of all singers sorted by age, descending order

2.6. Query the orders table for the following information:

2.6.1. List the customer with the most expensive single order

2.6.2. List all customers' email addresses so that we can send them coupons. Do not
include customers who do not have an email address

2.6.3. List names of all customers who ordered fries

2.6.4. List all orders that included a shake

2.6.5. List any order that did not include a main sandwich.

Lab 4: Working with Queries

1. Creating queries

1.1. Create queries on the countries table

1.1.1. Name the country with the smallest population

1.1.2. Name the country with the second largest population

1.1.3. List the first row in the countries table

1.1.4. How many countries are listed in the countries table?
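For example, 1.1.2 and 1.1.4 could be answered with queries like the following (a sketch):

MariaDB [labs]> select name from countries order by population desc limit 1 offset 1;
MariaDB [labs]> select count(*) from countries;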

1.2. Create queries on the products table

1.2.1. List all unique model names

1.2.2. Count number of products with a parts name of DCX

1.2.3. What is the maximum number of service years, and which model has it?

1.2.4. What is the average weight of all the products

1.3. Create queries on the celebrities table

1.3.1. How many celebrities have multiple occupations?

1.3.2. Which celebrities have more than 3 occupations?

1.3.3. Who is the oldest celebrity?

1.3.4. Who is the youngest?

1.4. Create queries on the orders table

1.4.1. What is the total amount of all the orders

1.4.2. How many customers have multiple orders, and who are they?

2. Updating data in tables

2.1. Modify the countries table:

2.1.1. Add continent column as string with maximum 100 characters

2.1.2. Change the continent to the following:
USA = North America
Germany = Europe
France = Europe
Korea = Asia

2.2. Modify celebrities table:

2.2.1. Remove anyone who is not a singer

2.2.2. Remove the column occupation

2.2.3. Change the name of the table to singers

2.2.4. Remove the members of BTS that we entered in Lab 3, section 1.2.1. We will
replace the individual members with the group.

2.2.5. Add a new singer with name = BTS.

Lab 5: Using Pig for ETL Processing (Pre-processing)
In this lab, you will use Pig to explore, correct, and pre-process data in files. The two
data files used in the lab are stored in different formats, so in order to analyze them
together, the data format must be unified. Pig is used to perform this pre-processing.
The pre-processed data is then analyzed with Hive in the next unit.

1. Working in the Grunt shell

1.1. Create the working directory for this lab and change to it. In this lab, we'll create a
works directory here.

$ mkdir ~/works
$ cd ~/works

1.2. Copy the pig_data1.txt and pig_data2.txt files from the Data folder under the home directory to
this working directory.

$ cp ~/Data/pig_data* .

1.3. Start the Grunt shell in local mode.

$ pig -x local

1.4. Load the data in the pig_data1.txt file into the shell and show the contents.

grunt> tuples = LOAD 'pig_data1.txt';


grunt> dump tuples;

Note: You can see some records. You may discover records that are missing some data or have
errors.

1.5. Load the first three field values and display them on the screen.

grunt> col_three = load 'pig_data1.txt' as (brand:chararray, date:chararray,


model:chararray);
grunt> dump col_three;

1.6. Use the DESCRIBE command to see the schema of col_three;

grunt> describe col_three;

1.7. Next, create a relation that meets the following conditions using the default data type
and field order.

1.7.1. Read pig_data1.txt and find only the records with the 6th field value greater
than 1000, and print only the 1st, 2nd, 3rd, and 6th field values on the result
screen.

grunt> data = load 'pig_data1.txt';


grunt> hi_price = filter data by $5 > 1000;
grunt> res = foreach hi_price generate $0, $1, $2, $5;
grunt> dump res;

1.7.2. From the previous results, display only the records whose first and third field values are not null.

grunt> res_notnull = filter res by $0 is not null and $2 is not null;


grunt> dump res_notnull;
grunt> quit;

2. Processing Input Data from a Pig script

2.1. Create the script file etl_1.pig after validating your commands in the Grunt shell.

2.1.1. First, copy pig_data1.txt and pig_data2.txt to your home directory in HDFS.

$ hdfs dfs -put pig_data1.txt pig_data2.txt .

2.1.2. Load the data file pig_data1.txt using the data schema below.
Index Field Name Data type
0 brand chararray
1 date chararray
2 model chararray
3 agent chararray
4 country chararray
5 price int
6 code chararray

2.1.3. Filter out all records with a null value for brand, model, or agent.

2.1.4. Select only values with a price of 1000 or more, and use the foreach command
to create relations in the following order:
Index Field Name Data type
0 code chararray
1 brand chararray
2 model chararray
3 date chararray
4 country chararray
5 price int

2.1.5. Change the brand field to uppercase and remove any leading or trailing
whitespace. (Hint: built-in function TRIM)

2.1.6. Save the result relation as car1.

$ vim etl_1.pig
data = LOAD 'pig_data1.txt' AS (brand:chararray,
date:chararray,
model:chararray,
agent:chararray,
country:chararray,
price:int,
code:chararray);

data_notnull = FILTER data BY brand is not null and model is not null and agent is not null;

price_up = filter data_notnull by price > 1000;

reordered = FOREACH price_up GENERATE code,


UPPER(TRIM(brand)),
model,
date,
country,
price;

STORE reordered INTO 'car1';

2.1.7. Run the etl_1.pig

$ pig etl_1.pig

3. Create the script file for pre-processing

3.1. Create the script file etl_2.pig using pig_data2.txt.

3.1.1. First, load the data file pig_data2.txt using the data schema below.
Index Field Name Data type
0 brand chararray
1 date chararray

2 model chararray
3 agent chararray
4 country chararray
5 price int
6 code chararray

3.1.2. Filter out all records with a null value for brand, model, or agent.

3.1.3. Remove duplicate records (records with identical values).

3.1.4. Change the brand field to uppercase and remove any leading or trailing
whitespace.

3.1.5. Remove the $ sign from the price field, and create a relation in the following field order:
Index Field Name Data type
0 code chararray
1 brand chararray
2 model chararray
3 date chararray
4 country chararray
5 price chararray

3.1.6. Save the result in the same format as etl_1.pig to the car2 folder in HDFS.
Hint: In the pig_data2 file each field is separated by ",", so you must use the
USING PigStorage(',') clause to parse it.
data = LOAD 'pig_data2.txt' USING PigStorage(',')
AS (brand:chararray,
date:chararray,
model:chararray,
agent:chararray,
country:chararray,
price:chararray,
code:chararray);
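A sketch of the remaining steps of etl_2.pig, assuming DISTINCT is used for duplicate removal and the built-in REPLACE strips the $ sign:

data_notnull = FILTER data BY brand is not null and model is not null and agent is not null;
data_dedup = DISTINCT data_notnull;
cleaned = FOREACH data_dedup GENERATE code,
                                      UPPER(TRIM(brand)) AS brand,
                                      model,
                                      date,
                                      country,
                                      REPLACE(price, '\\$', '') AS price;
STORE cleaned INTO 'car2';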

Lab 6: Running Basic Queries with Hive QL
1. In a terminal window, run Beeline.
$ beeline -u jdbc:hive2://

1.1. If Beeline is running, the "0: jdbc:hive2://>" prompt appears and waits for
commands. Enter a command to check which databases exist here.

1.2. Next, enter the command to show the tables in the default database.
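For example (a sketch covering 1.1 and 1.2):

0: jdbc:hive2://> show databases;
0: jdbc:hive2://> show tables in default;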

1.3. Create the test database and show the database list.
0: jdbc:hive2://> create database test;
0: jdbc:hive2://> show databases;

Note: The created database is saved as a folder in the hdfs directory.

1.4. Delete the created test database and verify the result.

0: jdbc:hive2://> drop database test;
0: jdbc:hive2://> show databases;

1.5. Create a table named products. Use the following schema:


Column Name Column Type
Code Int
brand string
l_date string
model string
country string
price int

0: jdbc:hive2://> create table if not exists products (


code int,
brand string,
l_date string,
model string,
country string,
price int
)
row format delimited fields terminated by '\t';

1.6. Verify the products table using desc command and check the hdfs file system.

0: jdbc:hive2://> show tables;

0: jdbc:hive2://> desc products;

1.7. Rename the l_date column to r_date, and rename code to id.
0: jdbc:hive2://> alter table products change code id int;
0: jdbc:hive2://> alter table products change l_date r_date string;

1.8. Rename the products table to test.


0: jdbc:hive2://> alter table products rename to test;
Note: When the alter command is executed, the hdfs folder name is also changed from
products to test.

1.9. Drop the test table and verify the hdfs directory in /user/hive/warehouse.

0: jdbc:hive2://> drop table test;

Note: When the managed table is deleted, the folder in the file system of hdfs is also
deleted.

2. Create and load a table using Shell

2.1. Before using the shell, create a /mywarehouse directory in the HDFS file system to be
used in the lab, and change its ownership to student.

2.2. If /mywarehouse already exists in your HDFS, skip this step.


$ sudo -u hadoop hdfs dfs -mkdir /mywarehouse
$ sudo -u hadoop hdfs dfs -chown student:student /mywarehouse

2.3. Then move the data pre-processed with Pig in the previous lab into this directory.
$ hdfs dfs -mv car1 car2 /mywarehouse

2.4. Create and load a table using an editor, and execute the script with the shell's -f option.

2.4.1. Create an external table using CREATE TABLE. In the previous lab, you practiced
creating files using Pig for data pre-processing. The schema is shown below:
Column Name Column Type
Code Int
brand string
model string
r_date string
country string

price int

2.4.2. The fields in each record are separated by '\t', and the data is stored in the
/mywarehouse/car1 directory in the HDFS file system.
$vi products.sql
create external table if not exists products (
code int,
brand string,
model string,
r_date string,
country string,
price int
)
row format delimited
fields terminated by '\t'
location '/mywarehouse/car1';

2.4.3. Run the products.sql with -f option.


$ beeline -u jdbc:hive2:// -f products.sql

2.5. Verify and delete the products table created by the script, then exit Beeline.
$ beeline -u jdbc:hive2://

0: jdbc:hive2://> desc products;
0: jdbc:hive2://> select * from products;
0: jdbc:hive2://> drop table products;
0: jdbc:hive2://> !quit
(exit the beeline)
$

Note: The data pre-processed with Pig was saved in the car1 folder. After creating
this as an external table, select the table contents.

Note: When a table created as external is deleted, table information is deleted from the
metastore, but car1’s data remains.

3. Create a table using Sqoop’s Hive import option

3.1. In a terminal, run the following command to import the authors table from MariaDB
using the hive-import option.

3.1.1. First, log in to Beeline and create the mydb database for the lab.
0: jdbc:hive2://> create database mydb;

Note: The created database mydb is created in /user/hive/warehouse with the name
mydb.db and is distinguished from general tables.

3.1.2. In another terminal, run sqoop command like this:


$ hdfs dfs -rm -r authors
$ sqoop import --connect jdbc:mysql://localhost/labs --username student --password
student --fields-terminated-by '\t' --table authors --hive-import --hive-database
'mydb' --hive-table 'authors' --split-by id

3.2. In Beeline, verify that the imported table is saved correctly and contains 10,000 records.
0: jdbc:hive2://> select * from mydb.authors;
0: jdbc:hive2://> use mydb;
0: jdbc:hive2://> select count(*) from authors;

Note: This is the result showing 20 records of authors. This is the case where the database
name and table name are used together.

Note: This checks the total number of authors records.

4. Write Hive QL commands that satisfy the following conditions.

4.1. Import the posts table into the mydb database using Sqoop, and find the total number of
records.
Hint: The --null-string and --null-non-string arguments are optional. If not specified,
then the string "null" will be used.
The --hive-drop-import-delims argument drops \n, \r, and \01 from string fields when
importing to Hive.
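A sketch of one possible command, assuming the posts table has an id column usable with --split-by:

$ sqoop import --connect jdbc:mysql://localhost/labs --username student --password student \
--table posts --fields-terminated-by '\t' --hive-import --hive-database 'mydb' \
--hive-table 'posts' --hive-drop-import-delims --split-by id

0: jdbc:hive2://> select count(*) from mydb.posts;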

4.2. Import only the first_name, last_name, and email columns, save them in Parquet format to
the authors_parquet folder in /mywarehouse of HDFS, and create an authors_parquet external
table over this folder.
Hint: sqoop import --connect jdbc:mysql://localhost/labs --username student --
password student --table authors --target-dir /mywarehouse/authors_parquet --as-
parquetfile
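The hint can be extended with a column list, and the external table can then be defined over the imported files (a sketch; selecting columns with --columns is an assumption about the intended approach):

$ sqoop import --connect jdbc:mysql://localhost/labs --username student --password student \
--table authors --columns "first_name,last_name,email" \
--target-dir /mywarehouse/authors_parquet --as-parquetfile

0: jdbc:hive2://> create external table mydb.authors_parquet (
first_name string,
last_name string,
email string
)
stored as parquet
location '/mywarehouse/authors_parquet';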

4.3. Use alter table to rename the “first_name” column to “fname” in authors_parquet.
Verify the change with DESCRIBE.

4.4. Rename the entire table to authors_parquet2. And verify the change the folder name in
/mywarehouse.

Lab 7: Handling partitioned table for performance
In this lab, you will create a table for car data that is partitioned by part number. You already
built the data using Pig in the previous lab. We can query the car data in the same table, but
distinguish the records by part number. This is done by creating a partition for each part number.

1. Creating and Loading a static partition to table


In terminal windows,

1.1. Run the Beeline.

$ beeline -u jdbc:hive2://

1.2. Use the mydb database for lab.

0: jdbc:hive2://> use mydb;

1.3. Make sure that the processed data is at the following locations in HDFS.

$ hdfs dfs -ls -R /mywarehouse/car[12]

1.4. Create the table called cars with the following schema:

Column Name Column Type


Code Int
brand string
l_date string
model string
country string
price int

1.4.1. The partition column is part_num (type tinyint), and the field delimiter is '\t'

1.4.2. The table location is /mywarehouse/cars, as a managed table.

0: jdbc:hive2://> create table if not exists cars (


code int,
brand string,
model string,
r_date string,
country string,
price int
)
PARTITIONED BY (part_num tinyint)
row format delimited
fields terminated by '\t'
location '/mywarehouse/cars';

1.4.3. Alter the cars table to add two partitions: one for part 1 and one
for part 2.

0: jdbc:hive2://> alter table cars add partition (part_num=1);


0: jdbc:hive2://> alter table cars add partition (part_num=2);

1.4.4. Load the data from /mywarehouse/car1 into the part 1 partition, and load the
data from /mywarehouse/car2 into the part 2 partition.

0: jdbc:hive2://> load data inpath '/mywarehouse/car1' into table cars


partition(part_num=1);

When the load command is executed, the file in the existing car1 is moved to the
cars/part_num=1 folder.

1.4.5. Verify the car1 and cars folder.

1.4.6. Show the data of car1 using select statement

0: jdbc:hive2://> select code, brand, model, r_date, price from cars;

Note: The warning message here can be ignored.

1.4.7. Load the data in car2 into the part_num=2 partition (the load command moves the
files in HDFS, much like hdfs dfs -mv) and verify that the data looks the same as in 1.4.6.

0: jdbc:hive2://> load data inpath '/mywarehouse/car2' into table cars


partition(part_num=2);

Note: We checked the entire data including part_num=2 as above.

1.4.8. Verify that the data for both part numbers was correctly loaded by counting the
records for each:

0: jdbc:hive2://> select count(*) from cars where part_num=1;


0: jdbc:hive2://> select count(*) from cars where part_num=2;

Question: What is the number of records in each partition?

2. Working with Tables in Hue


In this lab, you will practice using the Hue query editor to execute queries, verify databases and
tables using the Table Browser in Hue, and create a new database and table with it.

2.1. In the Firefox URL bar, enter the Hue address http://localhost:8888. The ID is student, and the password
is student. Click "Sign in" to enter Hue.

2.2. Click the Table browser in the left toolbar.

Note: The table browser displays the mydb database name and a list of tables.

2.3. Execute the commands previously run in Beeline in the Hive editor and verify the results.

2.3.1. select code, brand, model, r_date, price from cars;

2.3.2. select count(*) from cars where part_num=1;

2.3.3. select count(*) as CNT from cars where part_num=2;

2.3.4. You can check the schema information and sample data contents by clicking "i"
next to the cars table in the left menu.

2.4. There are two problems with the data in the cars table. One is that the price is null in
part_num=2; the other is that a header row such as "Brand, MODEL, ..." is included in the
middle of the data. Write a query that works around these problems.

2.5. Creating queries on cars for the following information.

2.5.1. Model of TESLA

2.5.2. Code, Model and Brand of all entries in the cars

2.5.3. Brand name in alphabetical order

2.5.4. Brand, model, r_date, and price, ordered by price, descending

2.5.5. Same as above but ordered ascending

2.5.6. Brand with a price less than $2000.00

2.5.7. Brand with a price between $3000.00 and $4000.00

2.5.8. All records with model names MINI, BWM, Porsche.

2.5.9. Name the Brand with the cheapest price

2.5.10. Name the Brand with the second most expensive price

2.5.11. How many records do not have overlapping brands?

2.5.12. What is the average price per brand?
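For example, 2.5.12 could be answered with a query like this (a sketch):

select brand, avg(price) as avg_price
from cars
group by brand;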

Lab 8: Working with complex Data type
In this lab, you will work with complex data types and join operations.

1. Working with Relational Databases

1.1. First, import all tables from the MariaDB labs database; each imported table is created as
a Hive table. Then create a continents table as follows:

1.1.1. Create id column and set it as unique Key

1.1.2. Create continent_name to hold the name of the continents

1.1.3. Add the following continents in the order shown:


Africa
Asia
Europe
North America
South America

1.2. Modify the countries table

1.2.1. Remove all the items from the table

1.2.2. Remove the continents table we added in the previous lab

1.2.3. Add the following information to the countries table


name capital population continent_id
USA Washington D.C. 333,098,437 match to North America
Germany Berlin 84,073,352 match to Europe
France Paris 65,426,179 match to Europe
Korea Seoul 51,305,186 match to Asia
Mexico Mexico City 130,262,216 match to South America
Egypt Cairo 104,258,327 match to Africa

2. Creating multiple join tables


Currently, the orders table contains a lot of repeated information. If you examine the table carefully,
four groups of information can be identified:
• Customer Information
  o Customer Name
  o Customer email
• Order information
  o Customer who ordered
  o Status of order (completed, in progress, placed)
• Order items information
  o Which order is this for?
  o List of products ordered
• Products information
  o name of product
  o cost
  o product type (sandwich, side, drink)
  o redeem_points
We have added the order status so that the system can help management keep
track of all the orders.

2.1. Create customer table

2.1.1. Needs to have a unique Key

2.1.2. Customer name

2.1.3. Customer email

2.2. Create orders table

2.2.1. Needs to have a unique Key

2.2.2. Needs to link to the customer unique Key

2.2.3. An order status column

2.3. Create order_items table

2.3.1. Needs to have a unique Key

2.3.2. Needs to link to the order unique Key

2.3.3. Needs to link to the products unique Key for each product ordered

2.4. Create a products table

2.4.1. Needs to have a unique Key

2.4.2. Product name

2.4.3. Product cost

2.4.4. Amount of redeem point earned for this product
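A minimal HiveQL sketch of the four normalized tables (the table and column names are assumptions, and plain integer key columns are used because Hive does not enforce uniqueness or foreign-key constraints; if the earlier orders and products tables still exist, different names or a separate database may be needed):

create table customers (
id int,                -- unique key
name string,
email string
);

create table orders (
id int,                -- unique key
customer_id int,       -- links to customers.id
status string          -- completed, in progress, placed
);

create table order_items (
id int,                -- unique key
order_id int,          -- links to orders.id
product_id int         -- links to products.id
);

create table products (
id int,                -- unique key
name string,
cost decimal(5,2),
product_type string,   -- sandwich, side, drink
redeem_points int
);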

Lab 9: Complex Queries
1. Create complex queries by joining multiple tables

1.1. Create joined queries from countries and continent to return the following information:

1.1.1. List of country names and the associated continent

1.1.2. List of all countries in Asia, with names of their capital
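For example, 1.1.2 might look like this (a sketch, assuming the column names from Lab 8):

select c.name, c.capital
from countries c
join continents t on c.continent_id = t.id
where t.continent_name = 'Asia';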

1.2. Create joined queries from customer, order, order_item and product to return the
following information:

1.2.1. All orders, including the customer who ordered, and all the products purchased

1.2.2. Any order that does not include a sandwich in it. Show the order id.

1.2.3. Who was the customer who made the order without the sandwich?

1.2.4. Total cost of all of Shaun Silverman's orders in our records

1.2.5. For every product, the total number of times that product has been ordered.
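And 1.2.5 could be answered with a grouped join (a sketch, assuming the column names from the previous lab's table sketches):

select p.name, count(*) as times_ordered
from order_items oi
join products p on oi.product_id = p.id
group by p.name;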

Lab 10: Working with Apache Impala
1. Setting up the lab environment
Impala and Kudu are resource intensive. Further, Kudu cannot easily be set up in a standalone
Hadoop cluster such as our VirtualBox environment. To overcome this limitation, we shall
follow the Apache Kudu Quickstart guidelines (https://kudu.apache.org/docs/quickstart.html)
adapted for Windows environment along with a Docker image of Impala to connect to the Kudu
storage.

1.1. Install Docker for Windows

1.1.1. Either search for "install docker desktop for windows" on your browser or go
directly to https://docs.docker.com/docker-for-windows/install/

1.1.2. Follow the instructions to install Docker Desktop for Windows.

1.1.3. At the end of the installation, you will be asked to install WSL 2 backend. Please
install this.

1.1.4. Start Docker Desktop for Windows

1.2. Install Git for Windows

Note: You have previously installed Git in the AWS S3 Storage lab in Chapter 4 Unit 1.
Unless you have removed and uninstalled the program, you may skip this step.

1.2.1. Either search for "install git for windows " or go directly to https://git-
scm.com/download/win

1.2.2. You may choose either the 32-bit or 64-bit version. Whenever possible, it is
better to choose the 64-bit version but this will depend on your Windows
version.

1.3. Install Apache Impala Quickstart

1.3.1. From Windows search, type "powershell" and start a new Windows PowerShell
session

1.3.2. From the home directory on your Windows computer, create and navigate into
C5U3_lab folder

1.3.3. Clone the https://github.com/apache/kudu git project

1.3.4. A new "kudu" directory will be created. Navigate to that directory

cd ~
mkdir C5U3_lab
cd C5U3_lab
git clone https://github.com/apache/kudu
ls
cd kudu

1.4. Start the Kudu docker containers

1.4.1. Setup the KUDU_QUICKSTART_IP environmental variable with the following
command. In PowerShell the back tick (`) is used to continue the command on
the next line.

$env:KUDU_QUICKSTART_IP=(Get-NetIPConfiguration | `
Where-Object {$_.IPv4DefaultGateway -ne $null -and `
$_.NetAdapter.Status -ne "Disconnected"}).IPv4Address.IPAddress

1.4.2. Check to make sure the variable has been set properly

$env:KUDU_QUICKSTART_IP

1.5. Bring up the docker Kudu cluster using the provided YAML configuration file

CAUTION: Docker desktop for Windows must be running and you must be in the kudu
directory before executing the following command.

1.5.1. Run the following command from PowerShell:

docker-compose -f docker/quickstart.yml up -d

NOTE: When you first create the Kudu cluster, the displayed text will say "Creating" instead of
"Recreating"

1.5.2. Check to make sure the Kudu service is up by browsing to localhost:8050

1.6. Create a network for Impala and Kudu to communicate

1.6.1. Use docker network command to create a bridge network

NOTE: The network ID generated will be different on your machine. If you have
attempted this lab before, you may get a message that the quickstart-network
already exists. You may ignore the message and continue.

docker network create -d bridge quickstart-network


1.7. Install Apache Impala docker image

1.7.1. Set up environmental variables for Impala to use with the following two
commands.

$env:QUICKSTART_IP=$(docker network inspect quickstart-network `
-f '{{(index .IPAM.Config 0).Gateway}}')

$env:QUICKSTART_LISTEN_ADDR=$env:QUICKSTART_IP

1.7.2. Check that the environmental variables have been created

1.7.3. Start the Impala cluster with the following command:

docker run -d --name kudu-impala --network="docker_default" `


-p 21000:21000 -p 21050:21050 -p 25000:25000 -p 25010:25010 -p 25020:25020 `
--memory=4096m apache/kudu:impala-latest impala

CAUTION: If you have previously attempted this lab, there may already be a container with
the name "/kudu-impala". In this case, it is best to remove the container and run a fresh new
kudu-impala container. Only follow the instruction to remove the existing container if a
container already exists. If not, skip to step 1.7.5.

1.7.4. CAUTION: Follow only if container already exists. Use docker container rm to
remove the existing container. Use the container id displayed in the error
message when deleting

docker container rm <"container id">

1.7.5. Check to make sure Impala services are up. Use your browser and navigate to
localhost:25000

1.8. Start the Impala Shell

1.8.1. Open a command line to the Impala docker container with the following
command. The container id is displayed when you created the container.

docker exec -it <container id> /bin/bash

1.8.2. Once inside the docker container, start the Impala shell.

impala-shell

2. Create the labs database and show the database list.

> create database labs;


> show databases;

2.1. Create a table named test. Use the following schema:


Column Name Column Type
Code Int
brand string
l_date string
model string
country string
price int

Insert the following values and verify the result by executing a select statement.

2.1.1. INSERT INTO test VALUES (1, 'Tesla', '02/17/22', 'CM-9', 'Gabon', '1179.00');

2.1.2. INSERT INTO test VALUES (2, 'Hyundai', '02/05/20', 'K9', 'Korea', '5779.00');

2.1.3. INSERT INTO test VALUES (3, 'Tesla', '03/14/21', 'Y-9', 'USA', '3300.00');

2.2. Rename the l_date column to r_date, and rename code to id.

> alter table test change code id int;


> alter table test change l_date r_date string;

2.3. Rename the test table to car_products.

> alter table test rename to car_products;

2.4. Verify the number of records in car_products, and then drop the car_products table.
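For example (a sketch):

> select count(*) from car_products;
> drop table car_products;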

Lab 11: Create a batch view
Creating an actual Lambda architecture is a very difficult and time-consuming project. Many
enterprises are trying to move away from this architecture whenever possible due to its high
maintenance and development costs. However, there are many use cases where Lambda
architecture is the only viable choice.
Creating an actual lambda architecture would be beyond the scope of this lab. An actual proof-
of-concept (POC) project will typically involve many moving pieces and actual code
development. In this lab, we will create a data pipeline that will have many of the working
parts without the real complexity of an actual working POC.

1. Create an Apache Flume data pipeline


Create a Flume configuration file with the following single source, two channels and two sinks

1.1. A single netcat source

1.1.1. Refer to
https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#netcat
-tcp-source

1.1.2. The hostname will be localhost

1.1.3. It will read from port 44444

1.2. First channel

1.2.1. A file channel

1.2.2. Refer to
https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#file-
channel

1.2.3. Set the checkpoint directory to /tmp/flume/checkpoint

1.2.4. Set the data directory to /tmp/flume/data

1.3. Second channel

1.3.1. A memory channel

1.3.2. Refer to
https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#memo
ry-channel

1.3.3. The maximum number of events stored in the channel should be 10000

1.3.4. Maximum number of events per transaction

1.4. First Sink

1.4.1. A HDFS sink

1.4.2. Refer to
https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#hdfs-
sink

1.4.3. Save the data in /user/student/labc5u4

1.4.4. We will never roll based on time interval

1.4.5. We will roll when the file size reaches 16384 bytes

1.4.6. We will never roll based on number of events

1.4.7. The file type will be a DataStream

1.5. Second Sink

1.5.1. A logger sink

1.5.2. Refer to
https://flume.apache.org/releases/content/1.9.0/FlumeUserGuide.html#logge
r-sink

1.6. Connecting the data pipelines

1.6.1. The single source will connect to both channels

1.6.2. The file channel will connect with the HDFS sink

1.6.3. The memory channel will connect with the logger sink
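A sketch of a configuration file that satisfies the requirements above, saved as net2hdfsandlog.conf (the agent name a1 matches step 1.7; the memory channel's transaction capacity is an assumption, since 1.3.4 does not give a value):

a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

# Netcat TCP source on localhost:44444
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# File channel with checkpoint and data directories
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /tmp/flume/checkpoint
a1.channels.c1.dataDirs = /tmp/flume/data

# Memory channel holding at most 10000 events
a1.channels.c2.type = memory
a1.channels.c2.capacity = 10000
a1.channels.c2.transactionCapacity = 1000

# HDFS sink: roll only on file size (16384 bytes), never on time or event count
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/student/labc5u4
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 16384
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.fileType = DataStream

# Logger sink
a1.sinks.k2.type = logger

# Wiring: the single source fans out to both channels, one sink per channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2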

1.7. Run the flume agent

1.7.1. Assume that the name of the agent is a1

flume-ng agent --conf $FLUME_HOME/conf \
--conf-file net2hdfsandlog.conf \
--name a1 \
-Dflume.root.logger=INFO,console

1.8. Start the streaming source simulator

1.8.1. Navigate to /home/student/Scripts

1.8.2. Execute the stream2.py Python script

cd /home/student/Scripts
python stream2.py ~/Data/anonymous-msweb.data

2. Create an external Hive table linked to HDFS

2.1. Start the beeline console

2.2. If a labs database does not already exist, create one

2.3. Create an external table and name it weblog

2.3.1. Add string column and name it line_type

2.3.2. Add string column and name it id1

2.3.3. Add string column and name it id2

2.3.4. Add string column and name it title

2.3.5. Add string column and name it url

2.3.6. Set the delimiter to a comma

2.3.7. Store as a textfile

2.3.8. Set the location to the HDFS directory where Flume saved the data in step 1.4.3 above
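A sketch of the resulting DDL (assuming the HDFS path from step 1.4.3):

create external table if not exists weblog (
line_type string,
id1 string,
id2 string,
title string,
url string
)
row format delimited
fields terminated by ','
stored as textfile
location '/user/student/labc5u4';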

2.4. Confirm that the new Hive table is reading from the HDFS directory. Run the following
query:

SELECT * FROM weblog LIMIT 10;

Lab 12: Create a speed view
We will simulate ingesting real-time streaming data.

1. Setup Apache Kafka


In order to reduce the resource demand on our virtual machine, we shall stop HBase and run only
Apache Kafka.

1.1. Stop HBase services

sudo stop-hbase.sh

1.2. Restart Kafka and Zookeeper

sudo systemctl stop kafka


sudo systemctl stop zookeeper
sudo systemctl start zookeeper
sudo systemctl status zookeeper
sudo systemctl start kafka
sudo systemctl status kafka

1.3. Make sure that both ZooKeeper and Kafka are running. If not, repeat the steps above.

2. Create an Apache NiFi dataflow

2.1. Open Apache NiFi from the Firefox browser

2.2. Create a new processor group

2.2.1. Name the processor group, Lab C5U4

2.2.2. Double click on processor group labC5U4 to enter its canvas

2.3. Add ListenTCP processor to canvas

2.3.1. Set Port to 44444 in Properties tab

2.3.2. Change Max Size of Socket Buffer to 200000 B

2.3.3. Set Client Auth to NONE

2.4. Add PublishKafka_2_6 processor to canvas

2.4.1. Automatically terminate failure relationship

2.4.2. Set topic name to weblogs

2.4.3. Keep Delivery Guarantee to Best Effort

2.4.4. Change Use Transactions to false

2.5. Add a funnel

2.6. Connect the data sources

2.6.1. Connect ListenTCP to PublishKafka_2_6 on success relationship

2.6.2. Connect PublishKafka_2_6 to a funnel

2.7. Add ConsumeKafka_2_6 to canvas

2.7.1. Change topic name to weblogs

2.7.2. Set honor transactions to false

2.7.3. Set Group ID to 1

2.7.4. Leave offset reset to latest

2.7.5. Add another funnel and connect the output of ConsumeKafka_2_6 to the funnel
on the success relationship

3. Test the speed view dataflow

3.1. Select all the components and run them

3.2. Start the streaming source simulator

3.2.1. Navigate to /home/student/Scripts

3.2.2. Execute the stream2.py Python script

cd /home/student/Scripts
python stream2.py ~/Data/anonymous-msweb.data

3.3. Wait until you see flowfiles flowing through the dataflow. Stop all the processors and
inspect the queues.

3.3.1. Inspect the queue between ListenTCP and PublishKafka_2_6. Make sure you
view the content of the flowfile

3.3.2. Inspect the queue between PublishKafka_2_6 and the funnel

3.3.3. Inspect the queue between ConsumeKafka_2_6 and the funnel

3.4. Start all the components again

3.5. From another terminal, use kafka-console-consumer to verify that the streaming data
has been properly ingested

3.5.1. Run the following command from another terminal

kafka-console-consumer \
--bootstrap-server localhost:9092 \
--topic weblogs \
--from-beginning

3.5.2. After making sure everything is flowing properly, stop all components in NiFi

3.5.3. Terminate the console consumer

END OF LAB
