
Lab Guide

BD Basics
Version 6.0
Copyright 2015 Talend Inc. All rights reserved.
Information in this document is subject to change without notice. The software described in this document is furnished under a license agreement or nondisclosure agreement. The software may be used or copied only in accordance with the terms of those agreements. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, for any purpose other than the purchaser's personal use, without the written permission of Talend Inc.
Talend Inc.
800 Bridge Parkway, Suite 200
Redwood City, CA 94065
United States
+1 (650) 539 3200
Welcome to Talend Training
Congratulations on choosing a Talend training module. Take a minute to review the following points to help you get the most from
your experience.

Technical Difficulty

Instructor-Led
If you are following an instructor-led training (ILT) module, there will be periods for questions at regular intervals. However, if you need an answer in order to proceed with a particular lab, or if you encounter a situation with the software that prevents you from proceeding, don't hesitate to ask the instructor for assistance so it can be resolved quickly.

Self-Paced
If you are following a self-paced, on-demand training (ODT) module, and you need an answer in order to proceed with a particular
lab, or you encounter a situation with the software that prevents you from proceeding with the training module, a Talend professional
consultant can provide assistance. Double-click the Live Expert icon on your desktop to go to the Talend Live Support login page
(you will find your login and password in your ODT confirmation email). The consultant will be able to see your screen and chat with
you to determine your issue and help you on your way. Please be considerate of other students and only use this assistance if you are
having difficulty with the training experience, not for general questions.

Exploring
Remember that you are interacting with an actual copy of the Talend software, not a simulation. Because of this, you may be tempted to perform tasks beyond the scope of the training module. Be aware that doing so can quickly derail your learning experience, leaving your project in a state that is not readily usable within the tutorial, or consuming your limited lab time before you have a chance to finish. For the best experience, stick to the tutorial steps! If you want to explore, feel free to do so with any time remaining after you've finished the tutorial (but note that you cannot receive consultant assistance during such exploration).

Additional Resources
After completing this module, you may want to refer to the following additional resources to further clarify your understanding and
refine and build upon the skills you have acquired:
Talend product documentation (help.talend.com)

Talend Forum (talendforge.org/)

Documentation for the underlying technologies that Talend uses (such as Apache) and third-party applications that complement Talend products (such as MySQL Workbench)


CONTENTS
LESSON 1 Basic Concepts
Basic Concepts 10
Overview 10
Opening a Project 11
Overview 11
Accessing the training environment 11
Run 11
Talend Forge 12
Start 14
Monitoring the Hadoop Cluster 15
Overview 15
Connect to Cloudera Manager 15
Check services health 15
Connect to Hue 16
Creating Cluster Metadata 17
Overview 17
Manual Configuration 17
Discovering Hadoop configuration files 20
Configuration with the Hadoop configuration files 22
Automatic configuration 22
Wrap-Up 26
Recap 26

LESSON 2 Reading and Writing Data in HDFS


Reading and Writing Data in HDFS 28
Overview 28
Storing a File on HDFS 29
Overview 29
Configure HDFS Connection 29
Create Job to write data to HDFS 31
Run Job and check results 32
Storing Multiple files on HDFS 35
Overview 35
Create Job 35
Run 36
Reading Data from HDFS 38
Overview 38
Create Job 38
Run 39
Storing Sparse Data set with HBase 40
Overview 40
Generate sparse data 40
Configure HBase Connection 41
Read sparse data 42
Handle Null values 43
Save Data to HBase 44
Run Job and check results in Hue 46
Challenges 49
Overview 49
Put Support File 49
Double Up Orders 49
Solutions 50
Overview 50
Put Support File 50
Double Up Orders 50
Wrap-Up 53
Recap 53
Further Reading 53

LESSON 3 Working with Tables


Working with Tables 56
Overview 56
Importing Tables with Sqoop 57
Overview 57
Preparing the MySQL database 57
Create generic database Metadata 58
Importing Tables 59
Run Job and Check results 61
Creating Tables with Hive 64
Overview 64
Create Hive Connection Metadata 64
Create a Hive table manually 65
Run Job and check results 66
Using the Hive table creation wizard 68
Check Hive table 71
Wrap-Up 73
Recap 73

LESSON 4 Processing Data and Tables in HDFS


Processing Data and Tables in HDFS 76



Overview 76
Processing Hive Tables with Jobs 77
Overview 77
Extracting useful data 77
Process data 78
Transfer results to Hive 79
Run Job and Check results 82
Profiling Hive Tables 85
Overview 85
Hive Connection Analysis 85
Hive Tables Analysis 87
Column Analysis 88
Product Category column analysis 92
Processing Data with Pig 95
Overview 95
Writing Data to HDFS with tHDFSOutput 95
Load Data 96
Filter and Map Data 97
Aggregate and Sort Data 98
Store Results 99
Run Job and check results 100
Processing Data with Big Data Batch Job 103
Overview 103
Read and filter data 103
Map and Aggregate data 104
Sort and save data 105
Run Job and Check results 107
Convert to Map Reduce Batch Job 108
Run Job and Check results 109
Wrap-Up 113
Recap 113
Further reading 113


LESSON 1
Basic Concepts
This chapter discusses the following.

Basic Concepts 10
Opening a Project 11
Monitoring the Hadoop Cluster 15
Creating Cluster Metadata 17
Wrap-Up 26
Basic Concepts

Overview
During this training, you will be assigned a preconfigured Hadoop cluster built with the Cloudera CDH 5.4 distribution. Its purpose is to let you try the different functionalities, not to serve as a production cluster, so it runs in pseudo-distributed mode, meaning that there is only one node. This is enough to understand the concepts covered in this training.
Before you start building Jobs to read and write data from and to HDFS, there are some prerequisites. First, you will open a project in Talend Studio. Then you will connect to your cluster to discover how to monitor it using Cloudera Manager and Hue. For your Jobs to succeed, the cluster must be up and running, so if a service fails you need to know how to restart it.
The last step is to create your cluster Metadata. This will avoid repetitive configuration of components in the Jobs you will create.

Objectives
After completing this lesson, you will be able to:
Open a new project in Talend Studio

Connect to the Cloudera Manager

Connect to Hue

Create Cluster Metadata manually, using configuration files, and automatically

Before you begin


Be sure that you are working in an environment that contains the following:
A properly installed copy of Talend Studio

A properly configured Hadoop cluster

The supporting files for this lesson

Everything has already been set up in your training environment.


The first step is to open a project in Talend Studio.



Opening a Project

Overview
Before developing Talend Jobs, you need a project to store them. In this lab you will open a pre-existing project for your big data
Jobs.

Accessing the training environment


For this training, two virtual machines have been set up for you. The first machine is a Windows machine where Talend Studio is installed. This is where you will create your Jobs.
The second machine is a Linux machine hosting a Cloudera CDH 5.4 cluster. You should not need to access this machine directly; the cluster is monitored from the Windows machine through a web browser.
You will find detailed explanations about how to proceed later in the course.
To connect to the Windows machine, in your Skytap environment, run the machine called Windows-TalendStudio.

Run
1. Double-click the Talend Studio shortcut on your desktop to run Talend Studio :

2. Using the Local connection, click Select an existing project and then click BDBasics in the Project list:

The project has already been created for you.


Note: You may have a different version of the Studio in your training environment. However, you will have the same functionalities as in the Talend Real-time Big Data Platform.
3. Click Finish to open the project.

Talend Forge
When the Connect to TalendForge window appears, log in with your existing Talend account or create a new one:



When the initialization is complete, Talend Studio displays the Welcome page:

Start
Click Start now!. The Talend Studio main window appears, ready for you to create Jobs:

The next step is to connect to your Hadoop cluster.



Monitoring the Hadoop Cluster

Overview
As you will develop Jobs that use HDFS, HBase, and Hive to store your data, you need a running Hadoop cluster.
For this training, a single Hadoop cluster is used. It runs the core Hadoop services, such as HDFS, YARN, HBase, Hive, and Sqoop. It also runs Hue, which will help you browse your files and tables.

Connect to Cloudera Manager


1. Click the Cloudera Manager shortcut in your web browser, or navigate to the URL http://ClusterCDH54:7180.
If you can't access the web page, wait a couple of minutes for the service to start and then try to connect again.
2. In the Username box, enter "admin" and then, in the Password box, enter "admin".
3. Click Login.

Check services health


Cloudera Manager is a web interface used to monitor and perform administration tasks on a cluster. It helps you check service health and restart services individually if needed.

If a service is flagged in red, it is in bad health. You can restart it individually by clicking the black arrow on the right side of the service and then clicking Restart.

Connect to Hue
Hue is a web interface that helps you check what is happening on your cluster. You can browse your files and tables, and you can also track Map Reduce tasks.
1. Open a new tab in your web browser.
2. Click the Hue shortcut or navigate to http://ClusterCDH54:8888.
3. In the Username box, enter "student" and then, in the Password box, enter "training".
4. Click Sign in.
You will be logged in to Hue:

The next step is to create Metadata on your Hadoop Cluster.



Creating Cluster Metadata

Overview
To run Big Data Jobs, Talend Studio needs a connection to a running Hadoop cluster. The connection information can be configured in each component individually, or it can be stored as Metadata in the Repository and reused as needed in the different components.
You will use this approach instead of configuring the connection individually in components and Jobs.
You will now create your cluster Metadata using three different methods. First you will configure the connection to the cluster manually, then from the Hadoop configuration files, and last, using a wizard.

Manual Configuration
The first way to create Hadoop cluster Metadata is to enter it manually. This requires that you already have several pieces of information about your cluster, such as the Namenode URI and the Resource Manager address or the Job Tracker URI, depending on whether you use YARN or Map Reduce v1. You may also need other information, such as the Job History and Resource Manager Scheduler addresses.
1. In the Studio, in the Repository, under Metadata, locate Hadoop Cluster.
2. Right-click Hadoop Cluster and then click Create Hadoop Cluster:

3. In the Name box, enter "TrainingCluster_manual", and then click Next.
The Hadoop Configuration Import Wizard opens:

4. In the Distribution list, select Cloudera and in the Version list, select Cloudera CDH5.4(YARN mode).



5. Select Enter manually Hadoop services and then, click Finish.
The Hadoop Cluster Connection window opens:

6. Check that the Distribution information is correct.

A few values are preconfigured, such as the Namenode URI and the Resource Manager address.
The localhost value must be changed to the IP address or DNS name of your cluster. If the cluster was configured with the default port values, then 8020 and 8032 are the ports for the Namenode and the Resource Manager, respectively.
The Hadoop cluster has already been configured for you with the name ClusterCDH54.
7. Configure the connection as follows:
Namenode URI: hdfs://ClusterCDH54:8020

Resource Manager: ClusterCDH54:8032

Resource Manager Scheduler: ClusterCDH54:8030

Job History: ClusterCDH54:10020

Staging directory: /user

User name: student

8. Check your configuration:

9. Click Finish. Your cluster Metadata will appear under Repository/Metadata/Hadoop Cluster.

Discovering Hadoop configuration files


An alternative way is to create your Metadata using the Hadoop configuration files. The configuration files have been copied to the HadoopConf folder under C:\StudentFiles.



The Hadoop configuration files are XML files that describe each parameter value of your Hadoop cluster. In the HadoopConf folder, you will find four files: core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml. The files were copied from a Hadoop cluster installed with the same distribution and version as your training cluster.
1. Edit core-site.xml with Notepad++:

The first property that appears in this file is the location of the Namenode:
hdfs://ClusterCDH54:8020.
2. Edit yarn-site.xml with Notepad++:

In this file, you will find the Resource Manager address and the Resource Manager Scheduler address.
3. Edit mapred-site.xml with Notepad++:

In this file, you will find the Job History address.


As you noticed, you need all this information to create your Cluster Metadata. If you choose to create the cluster Metadata using the
configuration files, the files will be parsed to find all these values.

Configuration with the Hadoop configuration files
1. Right-click Hadoop Cluster and then click Create Hadoop Cluster.
2. In the Name box, enter "TrainingCluster_files" and then, click Next.
3. In the Distribution list, select Cloudera and in the Version list, select Cloudera CDH5.4(YARN mode).
4. Select Import configuration from local files and then, click Next.
5. Click Browse... and then locate the configuration files under
C:\StudentFiles\HadoopConf. Click OK:

6. Click Finish.
7. The configuration is almost done, except for the user name.
In the User name box, enter "student" and then click Finish.
Your cluster Metadata will appear under Repository/Metadata/Hadoop Cluster.

Automatic configuration
The last way to configure your Metadata is to connect to Cloudera Manager; all the connection information is then retrieved automatically to create the cluster Metadata.
1. Right-click Hadoop Cluster and then click Create Hadoop Cluster.
2. In the Name box, enter "TrainingCluster" and then, click Next.
3. In the Distribution list, select Cloudera and in the Version list, select Cloudera CDH5.4(YARN mode).
4. Select Retrieve configuration from Ambari or Cloudera and then, click Next.
5. In the Manager URI (with port) box, enter
"http://ClusterCDH54:7180".
6. In the Username and Password boxes, enter "admin".



7. Click Connect. This will list all the clusters administered by the Cloudera Manager:

8. Click Fetch. The wizard will retrieve the configurations files of all running services in your cluster:

9. Click Deselect All then, select YARN.


10. Click Finish.

11. Enter "student" (without quotes) in the User name box:



12. Click Check Services to check whether you can connect to the Namenode and the Resource Manager:

If the progress bars reach 100% and you have no error message, your connection was successful.
13. Click Close.
14. Click Finish to finalize the metadata creation. It will appear in the Repository under Metadata/Hadoop Cluster.

You created the same Metadata three times with different techniques. For the next lessons, the last one will be used (TrainingCluster).
Now that you have covered the three different ways to create your Cluster Metadata, it's time to Wrap-Up.

Wrap-Up

Recap
In this lesson you covered the key knowledge required to build and run Big Data Jobs efficiently.
You first opened your project, then you connected to Cloudera Manager and Hue. Cloudera Manager will help you restart a service if needed, and Hue will help you browse your files and tables and track the execution of your Map Reduce Jobs.
The last part was about creating Metadata for your cluster. You created the same Metadata three times using different techniques, so you can use any of them in your Jobs.
For the next lessons, the TrainingCluster Metadata will be used, but feel free to try any of the three. The results will be the same; if they are not, one of the Metadata entries is not configured correctly.



LESSON 2
Reading and Writing Data in HDFS
This chapter discusses the following.

Reading and Writing Data in HDFS 28


Storing a File on HDFS 29
Storing Multiple files on HDFS 35
Reading Data from HDFS 38
Storing Sparse Data set with HBase 40
Challenges 49
Solutions 50
Wrap-Up 53
Reading and Writing Data in HDFS

Overview
Hadoop's file system—HDFS—scales to hold petabytes of data. In this lesson you will use Talend Big Data components to read and
write data to HDFS.
First, you will read a text file storing customer information from your local system and write it to HDFS. Then, you will read a collection
of customer Twitter messages, stored in separate files, and write them individually to HDFS. Next, you will develop a Job to read a
designated subset of the Twitter files back from HDFS.
Finally, you will simulate a sparse data set and then write it to HDFS using the dedicated HBase components.

Objectives
After completing this lesson, you will be able to:
Write text files to HDFS

Read text files from HDFS

Use HBase to store sparse data on HDFS

Configure connections to HDFS and HBase

Before you begin


Be sure that you are in a working environment that contains the following:
A properly installed copy of the Talend Studio

A properly configured Hadoop cluster

The supporting files for this lesson

The first step is to create a new Job to read a local text file and then write it to HDFS.



Storing a File on HDFS

Overview
Your first step is to store some text data on HDFS. In this lab, you will save a file of customer information: first name, last name, city, state, and so on.
As you did previously for your cluster connection, you can create Metadata for the HDFS connection.

Configure HDFS Connection


Now you will create Metadata for a HDFS connection.
1. Right-click TrainingCluster under Repository>Metadata>Hadoop Cluster, and then click Create HDFS:

2. In the Name box, enter "HDFSConnection" and then, click Next:

A default configuration is proposed; you can adjust it to your needs.

For this training, you will keep the default values. That means that your user name to connect to HDFS will be student, "\n" will be the row separator, and ";" will be the field separator.
3. Click Check. You should be quickly connected to your cluster and you will have the following success message:

4. Click OK and then click Finish to create your HDFS connection Metadata. It will appear in the Repository, below the TrainingCluster Metadata:



Create Job to write data to HDFS
In the C:\StudentFiles folder, you will find the CustomersData.csv file. You will create a Job that will read this file and then write it to
HDFS.
1. In the Repository, right-click Job Designs and then click Create Standard Job:

2. Name your Job "PutCustomersData".


3. Place a tHDFSPut component on the design workspace. You use this component to move files from a local file system to
HDFS.
4. Double-click to open the Component view.
5. In the Property list, select Repository.
6. Click (...) and then locate the HDFSConnection metadata that you previously created.
7. To configure the Local directory, click (...) and then navigate to "C:\StudentFiles".
8. In the HDFS directory box, enter
"/user/student/BDBasics/Customers"
9. In the Overwrite file list, select always. This tells the component to always replace existing files when a new file of the same
name is being saved.
10. Click the green plus sign below the Files table.
11. In the Filemask column, enter "CustomersData.csv" (quotes included).
Your configuration should be as follows:

Next, you will run your Job and then check the result in Hue.
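Before running it, it may help to see what tHDFSPut does conceptually. The sketch below is not the code the Studio generates; it is a minimal equivalent using the standard Hadoop FileSystem API, assuming the Namenode URI, user name, and paths used in this lab.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class PutCustomersDataSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to HDFS as the same user the HDFSConnection metadata uses.
        FileSystem fs = FileSystem.get(new URI("hdfs://ClusterCDH54:8020"), conf, "student");
        // Equivalent of tHDFSPut with Overwrite file = always:
        // copy the local file to the HDFS directory, replacing any existing file.
        fs.copyFromLocalFile(false, true,
                new Path("C:/StudentFiles/CustomersData.csv"),
                new Path("/user/student/BDBasics/Customers/CustomersData.csv"));
        fs.close();
    }
}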

Run Job and check results


1. Run your Job and follow the execution in the Console. At the end of the execution, you should have an exit code equal to 0 and several other messages showing that the Job executed successfully:



2. In your web browser, the page with Hue should already be open. Otherwise, navigate to "http://ClusterCDH54:8888" and log in with student/training.
3. Click File Browser:

4. Navigate to /user/student/BDBasics/Customers/CustomersData.csv:

This shows you the content of the CustomersData.csv file. The file was created by your Job in the new directory BDBasics/Customers that you specified.
Next, you will put multiple files to HDFS.



Storing Multiple files on HDFS

Overview
Your next step is to store a set of files on HDFS. In this scenario, you are saving a series of Twitter message files for later analysis.

Create Job
Under the C:\StudentFiles\tweets_in folder, you will find tweets saved as text files. You will put these files on HDFS.
1. Right-click PutCustomersData Job in the Repository and then click Duplicate.
2. In the Input new name box, enter PutTweets and in the Job Type list, select Standard:

3. Click OK to duplicate the Job. And then, open PutTweets.


4. Double-click tHDFSPut to open the Component view.
5. In the Local directory, locate "C:/StudentFiles/tweets_in".
6. In the HDFS directory, enter "/user/student/BDBasics/tweets".
7. In the Filemask column, enter "*". This means to select all the files under C:\StudenFiles\tweets_in.
Your configuration should be as follows:

Next you will run your Job and check the result in Hue.

Run
1. Run your Job and follow the execution in the Console. At the end of the execution, you should have an exit code equal to 0 and several other messages showing that the Job executed successfully:

2. In your web browser, the page with Hue should already be open. Otherwise, navigate to "http://ClusterCDH54:8888" and log in with student/training.
3. Click File Browser and navigate to
/user/student/BDBasics/tweets:

Eleven files appear because there are eleven tweet files in your local directory. The tHDFSPut component wrote all of the files in your local directory into the Hadoop file system.
You used a file mask to write a file set. File masks give you considerable control over which files you want Talend components to operate on.
Now that some data exists on HDFS, you can create a Job to read it.



Reading Data from HDFS

Overview
Now you will create a Job to read HDFS data. You are going to transfer a subset of the Twitter files from HDFS to your local file system.

Create Job
If you examine the files in the tweets folder that you just created on HDFS, you will notice that the files were produced in 2012 and 2013. You will read only the 2012 files.
1. Create a new Job, naming it "GetTweets".
2. Place a tHDFSGet component on the design workspace.
You use tHDFSGet to read files stored on HDFS and make copies of them on your local system.
3. Double-click tHDFSGet to open the Component view.
4. In the Property list, select Repository.
5. Click (...) and then locate the HDFSConnection metadata that you previously created.
6. To configure the HDFS directory, click (...) and then navigate to "/user/student/BDBasics/tweets".
7. In the Local directory, enter "C:/StudentFiles/tweets_out" (including the quotes).
8. In the Overwrite file list, select always. This tells the component to always replace existing files when a new file of the same
name is being saved.
9. Click the green plus sign below the Files table.
10. Replace "newline" with "2012*" (including the quotes).
You are using a file mask to request all files in the HDFS directory tweets with names that begin with the string "2012":

Now, you will run the Job and check the results in Hue.



Run
1. Run your Job and follow the execution in the Console. At the end of the execution, you should have an exit code equal to 0 and several other messages showing that the Job executed successfully:

2. In a file browser, navigate to C:\StudentFiles\tweets_out:

There are five tweet files that your Job read from HDFS and put on your local system. Because a file mask was used to limit
the files selected for the operation, only files with names that begin with "2012" were read from HDFS.
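Under the hood, a filemask such as "2012*" is treated as a glob over the HDFS directory. A minimal sketch of the same selection with the Hadoop FileSystem API, again an illustration rather than the Job's generated code, assuming the lab's paths and user name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class GetTweetsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://ClusterCDH54:8020"), new Configuration(), "student");
        // The "2012*" filemask is a glob: list only the matching files, then copy each one locally.
        FileStatus[] matches = fs.globStatus(new Path("/user/student/BDBasics/tweets/2012*"));
        for (FileStatus match : matches) {
            fs.copyToLocalFile(false, match.getPath(),
                    new Path("C:/StudentFiles/tweets_out/" + match.getPath().getName()));
        }
        fs.close();
    }
}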
Now that you have worked with text files on HDFS, it is time to work with sparse data sets and discover how to store them efficiently in HDFS.

Storing Sparse Data set with HBase

Overview
Apache HBase is the Hadoop column-oriented database. It is an open-source, non-relational database modeled after Google's BigTable and provides the same functionality on top of Hadoop and HDFS.
HBase is useful when you need random, real-time read/write access to your Big Data. Furthermore, HBase can host very large tables (billions of rows by millions of columns) and is particularly well suited for sparse data sets.
If your relational table looks like the one below (data missing in many columns), it is considered "sparse" and is a good candidate for HBase.

In this lab, you will generate sparse data and build a Job to store it in HDFS with HBase:

Generate sparse data


First, you will open a Job that will generate a sparse data set.
It will create a dataset that represents the number of connections per month to a website by customers identified by their Id.



1. Under the C:\StudentFiles folder, you will find the JobDesigns.zip archive file.
2. Import the Job named GenerateSparseData:

This Job is composed of two components. The tRowGenerator component generates an integer value for the customer ID and random integer values to simulate the number of connections per month.
The second component saves the data in a file named HBase_sample.txt under the C:\StudentFiles folder.
3. Run the Job, then locate and open "C:\StudentFiles\HBase_sample.txt" with Notepad++:

As expected, there are a lot of 0 values in the data.
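For reference, the sketch below shows the kind of data tRowGenerator produces here: one tab-separated line per customer, an Id followed by monthly connection counts that are mostly zero. The 80% proportion of zeros, the 12-month layout, and the value ranges are assumptions for illustration; the imported Job defines the exact columns and ranges.

import java.util.Random;

public class GenerateSparseSketch {
    public static void main(String[] args) {
        Random rnd = new Random();
        // One line per customer: an Id followed by 12 monthly connection counts, most of them 0.
        for (int id = 1; id <= 5; id++) {
            StringBuilder line = new StringBuilder(String.valueOf(id));
            for (int month = 0; month < 12; month++) {
                // Roughly 80% of the months are 0 (an assumption), the rest are small random counts.
                int connections = rnd.nextInt(10) < 8 ? 0 : rnd.nextInt(50);
                line.append('\t').append(connections);
            }
            System.out.println(line);
        }
    }
}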


Next, you will configure the connection to HBase.

Configure HBase Connection


An alternative way to configure a connection is to use a dedicated component. You will now create a new Job and the first step will be
to configure the connection to HBase.
1. Create a new standard Job naming it StoreSparseData.
2. Add a tHBaseConnection component and open the Component view.
3. In the Distribution list, select Cloudera and in the HBase version list, select Cloudera CDH5.4 (YARN mode).
4. In the Zookeeper quorum box, enter "ClusterCDH54" (including the quotes).

5. In the Zookeeper client port box, enter "2181" (including the quotes).
Your configuration should be as follows:

Read sparse data


You will now continue to build the Job to store your data with HBase.
1. Add a tFileInputDelimited component below tHBaseConnection.
2. Connect tHBaseConnection to tFileInputDelimited with an OnSubjobOk trigger.
3. Configure tFileInputDelimited to read the "C:\StudentFiles\HBase_sample.txt" file, knowing that the row separator is
"\n", the field separator is "\t" and that there is 1 Header row.
Your configuration should be as follows:

4. In the GenerateSparseData Job, double-click tFileOutputDelimited to open the Component view.


5. Edit the schema, select all the columns and copy the schema:



6. In the StoreSparseData Job, double-click tFileInputDelimited to open the Component view.
7. Edit the schema and paste the schema:

Handle Null values


HBase is well suited for sparse data sets because it does not persist null values. In the current data set, there are a lot of zeros. You will process the data to find the zeros and replace them with nulls.
You will store both the raw data set and the processed data in HBase.
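The sketch below shows what "not persisting nulls" means at the HBase client level: a cell is simply never written for an empty month. It uses the plain HBase client API rather than the code generated by tHBaseConnection and tHBaseOutput; the table and family names match this lab, while the customer Id and monthly values are made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SparseRowSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "ClusterCDH54");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        try (Connection cnx = ConnectionFactory.createConnection(conf);
             Table table = cnx.getTable(TableName.valueOf("data_withNulls"))) {
            int customerId = 42;                        // hypothetical customer
            String[] months = {"Jan", "Feb", "Mar", "Apr"};
            int[] connectionsPerMonth = {3, 0, 0, 7};   // hypothetical values for Jan..Apr
            Put put = new Put(Bytes.toBytes(customerId));
            put.addColumn(Bytes.toBytes("Id"), Bytes.toBytes("Id"), Bytes.toBytes(customerId));
            for (int i = 0; i < months.length; i++) {
                // Only non-zero months become cells; a null column is never written,
                // which is why sparse rows take so little space in HBase.
                if (connectionsPerMonth[i] != 0) {
                    put.addColumn(Bytes.toBytes("Date"), Bytes.toBytes(months[i]),
                            Bytes.toBytes(connectionsPerMonth[i]));
                }
            }
            table.put(put);
        }
    }
}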
1. Add a tReplicate component at the right side of tFileInputDelimited and connect with the Main row.
2. Add a tMap component at the right side of tReplicate and connect with the Main row.
3. Double-click to open the tMap editor.
4. Add an output and name it "out_with_NULLS".
5. Select all the columns in the row table and drag to the out_with_NULLS table.
Note: You may have a different row index.
6. Edit the expression for each month and modify it as follows:
(row3.Jan==0)?null:row3.Jan
This means that all 0 values will be replaced with null.
Note: Adjust the Jan column name as needed for each month.

Your configuration should be similar to this:

7. Click Ok.

Save Data to HBase


Now, you will add components to save raw and processed data to HBase.
1. At the right side of tMap, add a tHBaseOutput component and connect it with the out_with_NULLS row.
2. Open the Component view.
3. Select Use an existing connection and select tHBaseConnection_1 in the list.
4. Click Sync columns.
5. In the Table name box, enter "data_withNulls".
6. In the Action on table list, select Drop table if exists and create.
7. To create HBase tables, a family name must be associated with each column. To create the family names, click Advanced
settings.
8. In the Family parameters table, add 2 lines as follows:

9. In the Basic settings tab, set the Id column Family Name to "Id".



10. Set the other columns Family Name to "Date".
Your configuration should be as follows:

11. Copy tHBaseOutput_1 and paste below the tMap component.


12. Connect tHBaseOutput_2 to tReplicate with a Main row.
13. Open the Component view of tHBaseOutput_2 and change the Table name to "RawData":

14. Add a tHBaseClose component below the tFileInputDelimited component and connect with an OnSubjobOk trigger.
Now, you can run your Job and check the results in Hue.

Run Job and check results in Hue


1. Run your Job and check the result of the execution in the Console:

Your execution should be successful; otherwise, double-check your Job and check HBase health in Cloudera Manager.
2. Connect to Hue and click Data Browsers>HBase.
This will give you the list of HBase tables. You should see Data_withNulls and RawData in the list:



3. Click RawData:

This is an extract of the RawData table. You will still find the zeros. But if you compare with the content of Data_withNulls:

The Null values are not stored.


Now that you have used HBase to store sparse data, it's time to move to the Challenge to test your knowledge.

Challenges

Overview
Complete these exercises to further reinforce the skills you learned in the previous lesson. See Solutions for possible solutions to the
exercises.

Put Support File


Develop a Job to write an XML file of support requests stored locally to HDFS. Configure the input and output targets as follows:
Source file: C:/StudentFiles/support/support.xml

Target file: /user/student/BDBasics/support/support.xml

Double Up Orders
Develop a second Job to write a file of duplicated orders to HDFS. Use the local file C:/StudentFiles/duplicated_orders as the
source file. Put the file into the HDFS directory /user/student/erp, keeping the same file name. Use the schema stored in the file
orders.xml and sort the file on column id before saving it. Hint: use tSortRow to sort the file.

Solutions

Overview
These are possible solutions to the exercises. Note that your solutions may differ and still be acceptable.

Put Support File


1. Use a tHDFSPut component to create the HDFS folder and write the target file to it.
2. Configure the component as shown in the following image:

Double Up Orders
1. Use a tFileInputDelimited to read the file duplicated_orders from the local file system:



2. Import the schema from the file orders.xml:

3. Sort the data in ascending order by the column id with the tSortRow component :

4. Use a tHDFSOutput component to rewrite the data to HDFS :

5. Click Overwrite in the Action list.


6. Run your Job and check the results in Hue.

Now it is time to wrap up this section.



Wrap-Up

Recap
In this lab, you learned the basics of writing files using Talend's Big Data components for the Hadoop Distributed File System, HDFS.
You used tHDFSPut to write different types of files to HDFS. Then, you used tHDFSGet to read a subset of the files back from HDFS.
And last, you used the dedicated HBase components to store sparse data on HDFS (tHBaseConnection, tHBaseOutput, tHBaseClose).

Further Reading
For more information about topics covered in this lab, see the Talend Data Integration Studio User Guide, Talend Big Data Platform
User Guide and the Talend Components Reference Guide.

LESSON 3
Working with Tables
This chapter discusses the following.

Working with Tables 56


Importing Tables with Sqoop 57
Creating Tables with Hive 64
Wrap-Up 73
Working with Tables

Overview
In this lesson, you will cover two common use cases.
HDFS can be used to optimize a data warehouse, so you may decide to move your data from a relational database to HDFS. The first use case shows you how to transfer MySQL tables to HDFS using Sqoop.
The second use case shows you how to create a table using Hive. This table can then be processed using the HiveQL language, which is very similar to SQL.

Objectives
After completing this lesson, you will be able to:
Transfer MySQL tables to HDFS using Sqoop

Create Hive connection Metadata

Save data as Hive tables

Before you begin


Be sure that you are in a working environment that contains the following:
A properly installed copy of the Talend Studio

A properly configured Hadoop cluster

The supporting files for this lesson



Importing Tables with Sqoop

Overview
Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data using Map Reduce, and then export that data back into an RDBMS.
Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop creates and runs a Map-only Map Reduce Job to import the data.
In this lab, you will transfer MySQL tables into HDFS using the dedicated Sqoop components. First, you will push the customers data into a MySQL database.

Preparing the MySQL database


The Job that copies the customers data into MySQL has already been created for you. It is saved in the C:\StudentFiles folder.
1. From the C:\StudentFiles\JobDesigns.zip archive file, import the PushCustomerDataToMySQL Job and the associated RemoteMySQL database Metadata.
2. Run PushCustomerDataToMySQL.
This copies 1 million rows into a CustomersData table hosted in a remote MySQL database:

3. To check the data copied in the CustomersData table, right-click the tMysqlOutput component and then click Data
Viewer:

You will now transfer this table to HDFS using the tSqoopImport component.

Create generic database Metadata


The tSqoopImport component calls Sqoop to transfer data from a relational database management system such as MySQL or
Oracle into the Hadoop Distributed File System.
First, you will create generic database Metadata, which Sqoop requires to connect to your MySQL database.
1. In the Repository, under Metadata, right-click Db Connections and then click Create connection.
2. In the Name box, enter "genericConnection". You can also add a Purpose and a Description. Then, click Next.
3. In the DB Type list, select "General JDBC".
4. In the JDBC URL box, enter
"jdbc:mysql://ClusterCDH54:3306/training" (without the quotes).
5. Click the (...) next to the Driver jar box and select
"mysql-connector-java-5.1.30-bin.jar".
6. Click the (...) next to the Class name box, and then, in the drop-down list, select "com.mysql.jdbc.Driver".
7. In the User name and Password boxes, enter "root" (without the quotes).



8. Click the (...) next to the Mapping file box, and then select mapping_Mysql.xml in the list. Click OK.
Your configuration should be as follows:

9. Click Check. Your connection should be successful:

10. Click OK and Finish.


The genericConnection Metadata will appear in the Repository.

Importing Tables
You will create a simple Job to import the CustomersData MySQL table into HDFS using a tSqoopImport component.
In the tSqoopImport component, the first option to set is the Mode: Commandline or Java API.
If you choose Commandline, the Sqoop shell is used to call Sqoop. In this mode, you have to deploy and run the Job on the host where Sqoop is installed. This means that you have to install and use the Jobserver, as described in the Talend Data Integration Advanced training or in the Talend Installation Guide.
If you select Use Java API, the Java API is used to call Sqoop. In this mode, the Job can be run locally in the Studio, but you have to configure the connection to your cluster.
Note that a JDK is required to execute the Job with the Java API, and the JDK versions on both machines must be compatible.
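In Java API mode, the component essentially assembles a "sqoop import" argument list and hands it to Sqoop's Java entry point. The sketch below illustrates that idea; it assumes Sqoop 1's org.apache.sqoop.Sqoop.runTool entry point and that the cluster configuration is available on the classpath (which the Studio derives from your HDFSConnection and genericConnection Metadata). It is not the code the Studio generates.

import org.apache.sqoop.Sqoop;

public class SqoopImportSketch {
    public static void main(String[] args) {
        // Roughly the argument list tSqoopImport builds from the connection values used in this lab.
        String[] sqoopArgs = {
            "import",
            "--connect", "jdbc:mysql://ClusterCDH54:3306/training",
            "--username", "root", "--password", "root",
            "--table", "CustomersData",
            "--delete-target-dir",
            "--target-dir", "/user/student/BDBasics/SqoopTable"
        };
        int exitCode = Sqoop.runTool(sqoopArgs);   // launches the Map-only import Job
        System.out.println("Sqoop exit code: " + exitCode);
    }
}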
1. Create a new standard Job and name it "SqoopImport".
2. Add a tSqoopImport component and open the Component view.
3. In the Mode box, select Use Java API.
4. In the Hadoop Property list, select Repository and then browse to find the HDFSConnection you configured earlier in
the course.
This will configure the Distribution, the Hadoop version, the NameNode URI, the Resource Manager, and the Hadoop user
name:

5. In the JDBC Property list, select Repository and then, browse to find the genericConnection Metadata.
This will configure the Connection, the Username, the Password and the Driver JAR values:

6. In the Table Name box, enter "CustomersData".


7. Select Delete target directory.
8. Select Specify Target Dir and enter
"/user/student/BDBasics/SqoopTable" (quotes included).



Your configuration should be as follows:

Run Job and Check results


As you did previously, you will run your Job and then check the results in the Console and in Hue.
1. Run your Job and check the results in the Console. The last line should be an exit code equal to 0.
2. You can examine the Console output a little. For example, you can see the execution of the Map Reduce Job generated by the Sqoop import:

Note that your Job Id will be different. The Id is the number following "job_". In the current example, the Job Id is 1435565883520_0012.

3. Right after, you will find a recap of various counters:

There, you can see that 4 map tasks ran on the cluster and that the 1 million records were transferred in approximately 21 seconds.
4. To see your Job in Hue, click Job Browser.
The Job Browser window opens and you will see the list of all your Jobs.
5. From the Console, find the Id of your Job. Then, in Hue, find your Job Id in the Job list. It should be followed by green boxes,
corresponding to a successful execution of Map and Reduce tasks:



6. In Hue, click File Browser and navigate to /user/student/BDBasics/SqoopTable:

The data has been split into multiple parts.


7. Click part-m-00000:

There, you can check that your data have been imported as expected.
Now that you have imported a MySQL table to HDFS using a tSqoopImport component, you can continue to investigate working
with tables. The next topic will show you how to create tables in HDFS with Hive.

Creating Tables with Hive

Overview
Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarize Big Data and makes querying and analyzing easy.
Hive supports HiveQL, which is similar to SQL but does not support the full SQL syntax.
Hive converts the HiveQL query into Map Reduce code and then submits it to the Hadoop cluster. Through the HiveQL language, Hive provides a higher level of abstraction over Java Map Reduce programming.
First, you will create Hive Metadata in the Repository and then you will use various methods to create Hive tables.
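The two components you will configure below, tHiveCreateTable and tHiveLoad, submit HiveQL statements to HiveServer2, which the HiveConnection Metadata reaches on port 10000. For reference, here is a minimal sketch of the same two statements sent through the standard Hive JDBC driver. It is an illustration, not the Studio's generated code: the column list and types approximate the CustomersData schema used in this lesson, and the semicolon delimiter is assumed from the HDFSConnection defaults.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveTableSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 in Standalone mode listens on port 10000, as in the HiveConnection Metadata.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection cnx = DriverManager.getConnection(
                     "jdbc:hive2://ClusterCDH54:10000/default", "student", "");
             Statement stmt = cnx.createStatement()) {
            // Roughly what tHiveCreateTable submits (column types are an approximation).
            stmt.execute("CREATE TABLE IF NOT EXISTS CustomersData ("
                    + "Id INT, FirstName STRING, LastName STRING, City STRING, "
                    + "State STRING, ProductCategory STRING, Gender STRING, PurchaseDate STRING) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'");
            // Roughly what tHiveLoad submits: move the HDFS file into the table's storage
            // (path as used in this lesson's tHiveLoad step).
            stmt.execute("LOAD DATA INPATH '/user/student/BDBasics/Customers/CustomersData' "
                    + "OVERWRITE INTO TABLE CustomersData");
        }
    }
}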

Create Hive Connection Metadata


As you did previously for the cluster and HDFS connections, you will create a Hive connection Metadata in the Repository.
1. Right-click TrainingCluster in the Repository under Metadata/Hadoop Cluster, and then click Create Hive.
2. In the Name box, enter "HiveConnection", and then click Next.
3. In the Hive Model list, select Standalone.
4. In the Port box, enter "10000" (without quotes).
Your configuration should be as follows:

5. Click Check. You should have a successful connection message:

6. Click Ok and Finish to create the Hive connection Metadata.


The Hive connection Metadata appears in the Repository under Hadoop Cluster/TrainingCluster.



Create a Hive table manually
You will now create a Job that will create a table and then populate it with the customers data.
1. Create a new standard Job and name it "HiveLoad".
2. Add a tHiveCreateTable and open the Component view.
The tHiveCreateTable component will create an empty Hive table according to the specified schema.
3. In the Property Type list, select Repository and then browse to find the HiveConnection Metadata:

4. In the Schema list, select Repository then browse to find the CustomersData generic schema Metadata.
5. In the Table Name box, enter "CustomersData".
6. In the Action on table list, select "Create table if not exists":

7. Add a tHiveLoad component.


The tHiveLoad component will populate the table with data.
8. Connect tHiveLoad to tHiveCreateTable with an OnSubjobOk trigger.
9. Open the Component view.
10. Set the Property Type to Repository and use the HiveConnection Metadata.

11. In the Load action list, select "LOAD".
12. In the File Path box, enter
"/user/student/BDBasics/Customers/CustomersData".
13. In the Table Name box, enter "CustomersData".
14. In the Action on file list, select "OVERWRITE":

You will now run your Job and check the results in the Console and in Hue.

Run Job and check results


1. Run your Job and check the output in the Console:

The Job successfully executed so now, you can check the results in Hue.



2. In Hue, click Data Browsers and then, click Metastore Tables:

3. Click the customersdata table:

This is the description of your CustomersData table.


Note that the column types have been automatically converted.

4. Click the Sample tab:

If you examine the results in the purchasedata column, you will see only NULL values. This is because the Hive timestamp format is not equivalent to the date format in Talend Studio, which makes the data conversion fail.
A possible workaround is to treat dates as String type.
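If you keep the dates as strings, they can still be interpreted as timestamps at query time with Hive's date functions. A hedged sketch over the Hive JDBC connection used in this lesson is shown below; the dd-MM-yyyy pattern is an assumption, so adjust it to the actual format of your purchasedate values.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PurchaseDateAsTimestamp {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection cnx = DriverManager.getConnection(
                     "jdbc:hive2://ClusterCDH54:10000/default", "student", "");
             Statement stmt = cnx.createStatement();
             // 'dd-MM-yyyy' is an assumption: replace it with the pattern actually used
             // by the purchasedate strings in your schema.
             ResultSet rs = stmt.executeQuery(
                     "SELECT id, CAST(from_unixtime(unix_timestamp(purchasedate, 'dd-MM-yyyy')) AS TIMESTAMP) "
                   + "FROM customersdata LIMIT 5")) {
            while (rs.next()) {
                System.out.println(rs.getInt(1) + " " + rs.getTimestamp(2));
            }
        }
    }
}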
You will now experiment with another way to create a Hive table: a wizard that automatically creates a Hive table from a file stored in HDFS.

Using the Hive table creation wizard


To create a Hive table automatically from the CustomersData.csv file stored on HDFS, you will have to change the Studio perspective. You will move to the Profiling perspective, which is dedicated to Data Quality analysis on databases or on HDFS, depending on where your data are stored.
1. To make sure that the CustomersData.csv file is available for the following steps, run the PutCustomersData Job again.
2. In the upper-right corner of the Studio, click Profiling to open the Profiling perspective:



3. In the DQ Repository, expand Metadata/Hadoop Cluster/TrainingCluster:

There, you will find the connection metadata you created earlier.
4. Right-click HDFSConnection and then click Create Hive Table:

The connection to HDFS is checked, and then the wizard that creates a Hive table based on a file stored in HDFS starts.
5. Browse to find the CustomersData.csv file under /user/student/BDBasics/Customers.

6. Select CustomersData.csv and wait until the Creation status changes to Success:

As expected, the wizard detects 8 columns in the CustomersData.csv file.


Note: The Hive table creation wizard will convert all the files under the selected folder. So, in the current example, only
the CustomersData.csv file will be converted because it is the only file in the Customers folder.
7. Click Next.
8. In the Schema, edit the column names as follows:
Column0 > Id

Column1 > FirstName

Column2 > LastName

Column3 > City

Column4 > State

Column5 > ProductCategory

Column6 > Gender

Column7 > PurchaseDate



9. Click Next.
10. In the New Table Name box, enter "CustomersData_auto" (without quotes).
11. In the Hive connection list, select HiveConnection and then, click Finish.
The Hive table will be created and then the wizard will be closed.
12. Under DQ Repository>Metadata>DB connections>HiveConnection>default, right-click Tables and click Reload table list:

13. Click Reload in the Reload pop up message.


The CustomersData_auto table should appear in the table list.

Check Hive table


You can check the table in the Studio or in Hue.
1. In the DQ Repository, right-click the CustomersData_auto table, then click Preview:

2. This Preview translates into a HiveQL query applied to your Hive table:

The query is "select * from default.customersdata_auto" and appears in the SQL Editor, at the top of the window. The result appears in tab 1.
3. To check the new table in Hue, click Data Browsers>Metastore Tables, then click customersdata_auto in the table list.
4. Click the Sample tab:

You have covered the various ways to work with Tables with Hive. So, now it's time to recap what you have learned in this lesson.



Wrap-Up

Recap
In this lesson you learned how to use Talend's Big Data components for Hive and Sqoop.
First, you imported a MySQL table to HDFS using the tSqoopImport component. The import was done through a Map-only Map Reduce Job.
Next, you manually created a Hive table with the tHiveCreateTable component and then populated it with the tHiveLoad component.
And last, you used the Hive table creation wizard to automatically create a Hive table from a file stored on HDFS.

LESSON 4
Processing Data and Tables in HDFS
This chapter discusses the following.

Processing Data and Tables in HDFS 76


Processing Hive Tables with Jobs 77
Profiling Hive Tables 85
Processing Data with Pig 95
Processing Data with Big Data Batch Job 103
Wrap-Up 113
Processing Data and Tables in HDFS

Overview
Once your tables and data are stored on HDFS, you will need to process them to extract useful information.
Depending on your data type, you can adopt various strategies.
If your data are stored as Hive tables, HiveQL might be the best way to address your needs. HiveQL is a high-level language similar to SQL. Hive converts the request into Map Reduce Jobs that are executed on your cluster.
In this lesson, you will analyze Hive tables with a Job or with the Profiling view of the Studio.
If your data are stored as text files, an option is to use Pig. Pig Latin is also a high-level language, providing common operations to group, filter, and join data. The Pig Latin script is automatically converted into Java Map Reduce code to be executed on the cluster. Talend provides components to use Pig with minimal programming effort.
The last way to process your data covered in this lesson is to use a Big Data Batch Job. This kind of Job automatically converts the components into Java Map Reduce code that is run on the cluster.
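To give a feel for what a Pig step looks like before you build it with components, here is a small sketch that submits a few Pig Latin statements through Pig's Java API (PigServer). The operations and the output path are illustrative only, not the exact script of the upcoming lab, and running it requires the cluster configuration to be available on the classpath.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigSketch {
    public static void main(String[] args) throws Exception {
        // Each registered statement is one relational operation; Pig turns the whole
        // script into Map Reduce Jobs when the result is stored.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("customers = LOAD '/user/student/BDBasics/Customers/CustomersData.csv' "
                + "USING PigStorage(';') AS (id:int, firstname:chararray, lastname:chararray, "
                + "city:chararray, state:chararray, productcategory:chararray, gender:chararray, "
                + "purchasedate:chararray);");
        pig.registerQuery("women = FILTER customers BY gender == 'F';");
        pig.registerQuery("bycategory = GROUP women BY productcategory;");
        pig.registerQuery("counts = FOREACH bycategory GENERATE group, COUNT(women);");
        // Hypothetical output directory, used only for this sketch.
        pig.store("counts", "/user/student/BDBasics/pig_sketch_out");
    }
}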

Objectives
After completing this lesson, you will be able to:
Process Hive tables with a standard Job

Process Hive tables in the Profiling perspective of the studio

Process data with Pig components

Process data with a Big Data Batch Job

Before you begin


Be sure that you are in a working environment that contains the following:
A properly installed copy of the Talend Studio

A properly configured Hadoop cluster

The supporting files for this lesson



Processing Hive Tables with Jobs

Overview
Hive converts the HiveQL query into Map Reduce code and then submits it to the Hadoop cluster. Through the HiveQL language, Hive provides a higher level of abstraction over Java Map Reduce programming.
You will now analyze the customers data table you created previously.
Using the Studio, you can analyze your data with your own Hive queries, or you can use the Profiling view and apply the Data Quality functionalities of the Studio to your Hive tables.
You will use various components to extract useful data from a Hive table, process it, and then store the result in another Hive table.
At the end of this lab, your Job will look like the following:

Extracting useful data


The first step is to collect useful data from the CustomersData Hive table you previously created.
You will limit your investigation to the first 100,000 rows of the table.
1. Go back to the Integration Perspective.
2. Create a new standard Job and name it "HiveProcessing".
3. In the Repository, click HiveConnection under Metadata/Hadoop cluster/TrainingCluster/Hive.
4. Drag it to the Designer.
5. Select tHiveConnection in the Components list and click OK.

6. Add a tHiveInput component and connect it with an OnSubjobOk trigger.
7. Open the Component view.
8. Select the Use an existing connection option.
9. Set the Schema to Repository and then use the CustomersData generic schema metadata.
10. In the Table Name box, enter "CustomersData".
11. In the Query box, you will enter the HiveQL query that will be sent to the cluster.
As mentionned in the Overview, the investigations will be limited to the first 100 000 rows.
In the Query box, enter :
"select * from CustomersData where
CustomersData.Id<=100000"
Your configuration should be as follows:

Process data
You will now aggregate the data and store the result in HDFS.
1. Add a tAggregateRow component, connect it with the Main row, and then, open the Component view.
2. Click Sync columns and then click (...) to edit the schema.
3. Configure the output schema to have 3 columns named ProductCategory, Gender and Count, as follows:

4. Click OK to save the schema.


5. Click the green plus sign below the Group by table to add 2 Output columns: ProductCategory and Gender.
6. Click the green plus sign below the Operations table to add Count to the Output Column.
7. In the Function column, select count.



8. In the Input column position column, select ProductCategory.
Your configuration should be as follows:

9. In the Repository, click HDFSConnection under Metadata/Hadoop cluster/TrainingCluster/HDFS.


10. Drag it to the Designer.
11. Select tHDFSOutput in the Components list and then, click OK.
12. Connect tAggregateRow to tHDFSOutput with the Main row and then open the Component view.
13. In the File Name box, enter
"/user/student/BDBasics/Hive/agg_results".
This will save the aggregation results in HDFS. The last step is to transfer the results to a Hive table.
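For reference, the result set this subJob produces is the same one you would get from a single HiveQL aggregation. The sketch below expresses that intent through the Hive JDBC driver; it is an equivalent query you could use to check your results, not what the Job actually executes (the Job filters in tHiveInput and aggregates in tAggregateRow).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AggEquivalent {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection cnx = DriverManager.getConnection(
                     "jdbc:hive2://ClusterCDH54:10000/default", "student", "");
             Statement stmt = cnx.createStatement();
             // One count per (ProductCategory, Gender), limited to the first 100,000 customers.
             ResultSet rs = stmt.executeQuery(
                     "SELECT productcategory, gender, COUNT(productcategory) AS cnt "
                   + "FROM customersdata WHERE id <= 100000 "
                   + "GROUP BY productcategory, gender")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " " + rs.getString(2) + " " + rs.getLong(3));
            }
        }
    }
}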

Transfer results to Hive


1. In the tHDFSOutput Component view, click (...) to edit the schema.
2. Select ProductCategory, Gender, and Count in the Input or Output table, then copy the schema:

3. Close the schema window.


4. Add a tHiveCreateTable below tHiveInput and connect it with an OnSubjobOk trigger.

5. Open the Component view.
6. Select the Use an existing connection option.
7. Click (...) to edit the schema.
8. Paste the Schema:

9. In the DB Type column, select STRING for ProductCategory and Gender and then, select INT for Count:

10. Click OK to save the schema.


11. In the Table Name box, enter "AggResults".



12. In the Action on table list, select Create table if not exists.
Your configuration should be as follows:

13. Add a tHiveRow component below tHiveCreateTable and connect it with an OnSubjobOk trigger.
14. In the Component view, select the Use an existing connection option.
15. Copy the schema in tHiveCreateTable and paste it in the schema of the tHiveRow component.
16. Click OK to save the schema.
17. In the Query box, you will be able to write your own HiveQL.
The query in a tHiveRow component is executed at each flow iteration in your Job. In the current Job, the query will be
executed only once to transfer the data from the HDFS file to the AggResults Hive table.
In the Query box, enter:
"LOAD DATA INPATH
'/user/student/BDBasics/Hive/agg_results' OVERWRITE INTO TABLE AggResults "

Note: Copy and paste the Query from the LabCodeToCopy file in the C:\StudentFiles folder.

18. Your configuration should be as follows:

Your Job is now complete. It's time to run it and check the results.

Run Job and Check results


1. Run your Job and check the results in the Console:

You will see an error message saying that the winutils.exe binary is not available in the Hadoop binaries. In the current Job, this error won't prevent the Job from succeeding, but it could in some cases.
You can fix this issue by downloading the winutils.exe file and saving it in a tmp folder. This has been done for you in your training environment.
2. In the Run view, click Advanced settings.
3. Select the Use specific JVM arguments option.
4. Click New....



5. In the Argument box, enter:
"-Dhadoop.home.dir=C:/tmp/winutils" (without quotes)

6. Click OK to save the new argument and run your Job again.
The Job should run successfully without error messages:

7. Even if it is not clearly stated in the Console, the HiveQL query in the tHiveInput component was executed on the cluster. You can see the generated Job in the Hue Job Browser:
8. Using the Hue File Browser, navigate to


/user/student/BDBasics/Hive:

The folder is empty because the data have been transferred to the Hive table: LOAD DATA INPATH moves the source file rather than copying it, so the file is removed from this HDFS location.
9. In Hue, click Data Browsers>Metastore Tables. The table AggResults should be in the table list.
10. Click AggResults and then click the Sample tab:

You have processed your Hive Table with various components such as tHiveInput and tHiveRow components. You will now process
your Hive table using the Profiling perspective of the Studio.



Profiling Hive Tables

Overview
Using the Studio, you can run various analyses on your Hive tables. In this lab, you will use the Profiling view of the Studio to run analyses of your Hive connection, Hive tables, and columns.

Hive Connection Analysis


1. Switch to the Profiling perspective.
2. In the DQ Repository, under Metadata/DB connections, you will see your HiveConnection.
Right-click HiveConnection and then, click Reload database list:

3. Click Reload in the Reload pop up message.


This will update the table list:

As expected, you will find the AggResults, CustomersData and CustomersData_auto Hive tables.
Now, you will start the Overview Analysis of HiveConnection.
4. Right-click HiveConnection under DQ Repository/Metadata/DB connections, and click Overview analysis:

5. In the Name box, enter OverviewAnalysis and click Next.


6. In the Table name filter and View name filter boxes, you can filter the tables of interest. In the current lab, leave the
boxes empty.

7. Click Finish.
Your analysis will appear in the DQ Repository, under Data Profiling/Analyses:

8. The OverviewAnalysis also opens so that you can examine the Analysis settings.
Click the Running man icon to start the analysis:

9. At the bottom of the Connection Analysis window, you will find the Analysis summary:

In the Analysis summary, you should see that the analysis was successful. There is 1 schema, named default, and 3 tables,
named customersdata, customersdata_auto and aggresults.
As expected, CustomersData and CustomersData_auto have 1 million rows each, and AggResults has 14 rows.
Now that you have an overview of your Hive Tables, you can move to the next step, which is to analyze each table.

Hive Tables Analysis


In the Profiling perspective, you can easily refine the kind of analysis to suit your needs. You will now focus on the CustomersData table.
1. Right-click customersdata under DQ Repository/Metadata/DB connections/HiveConnection/default/Tables, and click Preview:

2. This will open a SQL Editor window, which is split into two parts.
The upper part shows the request submitted to preview your table:

The bottom part is a preview of the CustomersData table:

3. Right-click customersdata and click Table analysis.

4. In the Name box, enter "TableAnalysis" and click Finish.


5. The TableAnalysis window opens. By default, a row count operation is proposed. You will add a filter to count the number of
rows where the customer is a woman (a HiveQL sketch of the resulting requests is shown after step 7).
In the Where box, enter customersdata.gender='F':

6. Click the Running man icon to start the analysis.


7. On the right side of the Table Analysis, you will find the result of your request:

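For reference, the preview and the filtered row count submitted by the Studio correspond roughly to HiveQL requests like the ones below. This is an illustrative sketch, not the exact statements generated by the Profiling perspective, and the LIMIT value is arbitrary:

-- preview of the table, as shown in the SQL Editor
SELECT * FROM customersdata LIMIT 50;
-- row count restricted by the Where clause entered in the table analysis
SELECT COUNT(*) FROM customersdata WHERE customersdata.gender = 'F';
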
You will continue to investigate your data by running an analysis on some columns of the CustomersData table.

Column Analysis
1. Right-click customersdata under DQ Repository/Metadata/DB connections/HiveConnection/default/Tables, and click
Column Analysis:



2. In the Name box, enter "ColumnAnalysis" and then, click Finish.
The Column Analysis page opens.
3. As the analysis can be time-consuming, you will reduce its scope by excluding the id, firstname, lastname, city, and
purchasedate columns.
Under Data preview, click Select Data and then select state, productcategory and gender columns:

4. Click OK to save the selection.


5. Under Analyzed Columns you will find state, productcategory and gender.
Click the plus sign next to state to see the details of what will be analyzed in the state column:

You will have basic information about your columns, such as the number of rows, the number of blank or null values, and the
distinct count.

6. In the Analysis Parameters section, you can choose the execution engine: SQL or Java. To send your requests to the cluster,
select SQL and then click the Running man icon:



7. While the Analysis is running, you can go to the Hue Job Browser and follow the execution of the Map and Reduce tasks
launched by the Column Analysis:

8. At the end of the execution, open the Analysis Results tab. There you will find the results of each analysis for the state,
productcategory and gender columns:

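Conceptually, these simple statistics boil down to aggregate queries run on the cluster. The following HiveQL is only an illustrative sketch for the state column, not the exact requests the Studio submits:

-- row count, distinct count and null count for the state column
SELECT COUNT(*), COUNT(DISTINCT state), SUM(CASE WHEN state IS NULL THEN 1 ELSE 0 END)
FROM customersdata;
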
There are 7 distinct product categories. You will now run an analysis to list these values.

Product Category column analysis
1. Right-click productcategory under DQ Repository/Metadata/DB connections/HiveConnection/default/Tables/customersdata, and click Column Analysis>Analyze:

2. In the Name box, enter "ProductCategoryAnalysis" (without quotes) and click Finish.
3. In the Column Analysis window, under Data preview, click Select Indicators:

This is where you will specify which statistics you are interested in.



4. If you scroll down, you will see an option named Hide non applicable indicators.
Select this option. It will simplify the indicators selection:

5. Expand Simple Statistics and select Row Count in the productcategory column:

6. Expand Advanced Statistics and select Frequency Table.


7. Click OK to save your selection:

8. Click the Running Man to start the analysis.

9. Open the Analysis Results tab:

The frequency analysis lists the product category values and the count for each value. The numbers are very close to each
other because this data set has been randomly generated.
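The Frequency Table indicator is essentially a grouped count. A HiveQL sketch of the equivalent request (illustrative only, not the statement generated by the Studio) would be:

-- one row per product category with its number of occurrences
SELECT productcategory, COUNT(*)
FROM customersdata
GROUP BY productcategory;
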
You have covered the last analysis for this lab. Now it's time to move to the next lab, where you will learn how to process data on
HDFS using Pig.



Processing Data with Pig

Overview
The motivation behind Pig is that MapReduce is very powerful but requires a Java programmer, and that programmer may have to
re-invent common functionality such as joining or filtering.
Pig is a platform for analyzing large data sets. It consists of a high-level programming language, Pig Latin, that opens Hadoop to non-
Java programmers. Pig Latin also provides common operations to group, filter, join or sort data.
Pig provides an execution engine on top of Hadoop: the Pig Latin script is converted into MapReduce code that is executed on your cluster.
In this lab, you will process the Customers data stored in HDFS. You will perform basic tasks such as filtering rows, mapping,
aggregating and sorting data, and then store your results in HDFS (a rough HiveQL equivalent of the whole pipeline is sketched after the Job overview below).
At the end of this lab, your Job will look like the following:

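Before building the Job component by component, it may help to see the overall logic it implements. Expressed as a HiveQL-style query, the pipeline corresponds roughly to the sketch below. This is for illustration only; the lab itself runs Pig components, and customers_source is just a placeholder for the CustomersData file loaded by tPigLoad:

-- keep California customers, count men and women per product category, sort by category
SELECT ProductCategory, Gender, COUNT(*)
FROM customers_source
WHERE State = 'California'
GROUP BY ProductCategory, Gender
ORDER BY ProductCategory;
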
Writing Data to HDFS with tHDFSOutput


Earlier in the course, you used the tHDFSPut component to copy a file from your local file system to HDFS. Another way
to write data to HDFS is to use the tHDFSOutput component, which writes a data flow to HDFS.
You will create a Job that reads the CustomersData.csv file and writes it to HDFS using tHDFSOutput.
1. Switch your Studio to the Integration perspective.
2. Create a new Standard Job and name it "WritingHDFS".
3. Add a tFileInputDelimited component and open the Component view.
4. Next to the File name box, click (...) and navigate to "C:\StudentFiles\CustomersData.csv".
5. Set the Row Separator to "\n" and the Field Separator to ";".
6. Set the Schema type to Repository and browse to find the CustomersData generic schema metadata.
Your configuration should be as follows:

7. Add a tHDFSOutput component and connect it with the Main row.
8. Double-click the tHDFSOutput component to open the Component view.
9. In the Property Type list, select Repository. Then, navigate to find HDFSConnection.
10. In the Schema list, select Repository. Then, navigate to find the CustomersData generic schema metadata.
11. In the File Name box, enter
"/user/student/BDBasics/CustomersData".
Your configuration should be as follows:

12. Save your Job. It will be run later from the main Job that will process the data using Pig components.

Load Data
You will create a new Job named PigProcessing which will process the customers data.
1. Create a new Standard Job and name it "PigProcessing".
2. From the Repository, drag the WritingHDFS Job and drop it in the Designer. It will appear as a tRunJob component
labelled WritingHDFS:

3. Add a tPigLoad component.


4. Connect the WritingHDFS component to tPigLoad with an OnSubjobOk trigger.
5. Double-click tPigLoad to open the Component view.
Pig Jobs can be executed in local or Map/Reduce mode. This is configured in the tPigLoad component. For this lab, you will
use the Map/Reduce mode.



6. As you did for the tHDFSOutput component, set the Property Type to Repository using the HDFSConnection
metadata.
7. Set the Schema type to Repository using the CustomersData generic schema metadata.
8. Select Set Jobhistory address.
9. Select Set resourcemanager scheduler address.
These options provide additional information about the cluster, which is necessary for Pig to run without warning messages.
10. In the Input file URI, enter
"/user/student/BDBasics/CustomersData".
Your configuration should be as follows:

Filter and Map Data


You will continue to build your Job with the next two components to filter and map your data. The goal here is to extract customers living
in California and get the corresponding gender and product category.
You will use the tPigFilterRow component to filter on the State column, and then the tPigMap component will extract the Gender and ProductCategory
columns.
1. Add a tPigFilterRow component and connect it with a Pig Combine row.
2. Open the Component view.
3. Select the Use advanced filter option.

4. In the Filter box, enter "State matches 'California'" (quotes included):

5. Add a tPigMap component, connect it with the Pig Combine row and then open the Component view.
6. Create a new output named "MappingOut".
7. Select ProductCategory and Gender columns in the row2 table and drag in the MappingOut table.
Your mapping should be as follows:

8. Click OK to save the mapping.

Aggregate and Sort Data


You have extracted some data; now you will aggregate and sort it. The goal here is to get the number of men and women per
ProductCategory, and then sort by ProductCategory in alphabetical order.
1. Add a tPigAggregate component, connect it with the MappingOut row and then open the Component view.
2. Edit the schema and add a column named Count with an Integer type:

3. Click OK to save the schema.


4. Click the green plus sign below the Group by table to add ProductCategory and Gender columns.
5. Click the green plus sign below the Operations table.



6. In the Additional Output Column list, select Count.
7. In the Function list, select count.
8. In the Input Column list, select ProductCategory.
Your configuration should be as follows:

9. Add a tPigSort component, connect it with the Pig Combine row and then open the Component view.
10. Click the green plus sign below the Sort key table.
11. Configure it to sort the ProductCategory column in ascending order, as shown below:

Store Results
Once processed, the last step is to store your results on HDFS.
1. Add a tPigStore component, connect it with the Pig Combine row and then open the Component view.
2. Set the Property Type to Repository and then select HDFSConnection.
3. In the Result Folder URI, enter "/user/student/BDBasics/Pig/out".
4. Select the Remove directory if exists option.
This will allow you to run the Job again as needed.

5. Your configuration should be as follows:

Run Job and check results


1. Run the Job and inspect the results in the Console.
2. Scrolling in the Console, you will find information about the WritingHDFS Job:

The WritingHDFS Job executes successfully, and then the Pig components start to work.
3. If you continue to investigate the logs, you will find the Pig requests equivalent to each Pig component. The first Pig
component is the tPigLoad:

4. Right after, you will find Pig requests for tPigFilterRow, tPigMap, tPigAggregate, tPigSort and tPigStore:



5. If you continue to scroll down, you will see several MapReduce Jobs submitted, along with their statistics. This is the final report of
all the MapReduce job executions:

There, you can see that the execution was successful: 1 million rows were read from the CustomersData file and then
14 records were written to the /user/student/BDBasics/Pig/out folder.
6. In Hue, using the File Browser, navigate to /user/student/BDBasics/Pig/out.
7. Click the part-r-00000 file to see the result:

8. In Hue, in the Job Browser, you will be able to see all the Jobs submitted by the Studio. The Jobs have different IDs but they
have the same name BDBASICS_PigProcessing_0.1_tPigLoad_1.

All your Jobs have succeeded:

You can now continue investigating data processing on HDFS and move on to the next lab.



Processing Data with Big Data Batch Job

Overview
The last way to process data covered in this course is to use Big Data Batch Jobs. You will first create a Standard Job and then convert
it to a Big Data Batch Job using the MapReduce framework.
The Job will be very similar to the PigProcessing Job. At the end of this lab, your Job will look like the following:

The first step will be to read the data using the HDFSConnection metadata.

Read and filter data


1. Create a new Standard Job and name it "StandardProcessing".
2. In the Repository, under Metadata/Hadoop Cluster/TrainingCluster/HDFS, click HDFSConnection and drag it to the
Designer.
3. Select tHDFSInput in the Components list and click OK:

4. Double-click tHDFSInput to open the Component view.


5. In the Schema list, select Repository. Then, navigate to find the CustomersData generic schema metadata.
6. Next to the File Name box, click (...) and browse to find
/user/student/BDBasics/CustomersData.

Your configuration should be as follows:

7. Add a tFilterRow component and connect it with the Main row.


8. In the Component view, click the green plus sign below the Conditions table to add a new line.
9. In the InputColumn column, select State.
10. In the Function column, select Match.
11. In the Operator column, select Equals.
12. In the Value column, enter "California" (quotes included).
Your configuration should be as follows:

In the next section, you will map and aggregate your data.

Map and Aggregate data


1. Add a tMap component and connect it with the Filter row.
2. Open the mapping editor and create a new output named "out".
3. In the row2 table, select Gender and ProductCategory and drag to the out table, as follows:

4. Click OK to save your configuration.



5. Add a tAggregateRow component and connect it with the out row.
6. Open the Component view.
7. Edit the schema and add a new column named Count with an Integer type:

8. Click OK to save the schema.


9. Click the green plus sign below the Group by table and add ProductCategory and Gender columns.
10. Click the green plus sign below the Operations table.
11. In the Output column list, select Count.
12. In the Function list, select count.
13. In the Input column position list, select ProductCategory:

The last steps are to sort the results and then to save them on HDFS.

Sort and save data


1. Add a tSortRow component, connect it with the Main row and then, open the Component view.
2. Click the green plus sign below the Criteria table.
3. In the Schema column, select ProductCategory.
4. In the sort num or alpha? column, select alpha.

5. In the Order asc or desc column, select asc:

6. In the Repository, under Metadata/Hadoop Cluster/TrainingCluster/HDFS, click HDFSConnection and drag it to the
Designer.
7. In the Components list, select tHDFSOutput.
8. Connect it with the Main row and then open the Component view.
9. In the File Name box, enter
"/user/student/BDBasics/CustomersDataOut".
10. In the Action list, select Overwrite.
Your configuration should be as follows:

Your Job is now ready to be run.



Run Job and Check results
1. Run your Job and check the results in the Console:

Your Job should execute successfully.


2. In Hue, using the File Browser, navigate to /user/student/BDBasics and open CustomersDataOut:

The results should be the same as in the previous lab named Processing Data with Pig.
You will now convert this Standard Job into a Big Data Batch Job which will use the MapReduce framework.

Convert to Map Reduce Batch Job
Using the Studio, you can convert a standard Job to a Big Data Batch Job and then choose between the MapReduce and Spark frameworks.
In this lesson, you will focus on MapReduce Jobs.
Instead of converting your current Job, you will duplicate it as a MapReduce-based Batch Job.
1. In the Repository, right-click the StandardProcessing Job and then, click Duplicate.
2. In the Input new name box, enter MRProcessing.
3. In the Job Type list, select Big Data Batch.
4. In the Framework list, select MapReduce:

5. Click OK.
6. The Hadoop cluster configuration is set at the Job level in Big Data Batch Jobs. So, before duplicating your Job, you will be
asked to choose from which component the Hadoop configuration will be retrieved:

7. Select the HDFSConnection component with the blue arrow and click OK.



8. In the Repository, under Big Data Batch Jobs, you will find your MRProcessing Job. If it is not already open, double-click it to
open it:

9. Double-click the tHDFSOutput component to open the Component view.


10. In the Folder box, enter
"/user/student/BDBasics/CustomersDataOut_MR".
Your standard Job is now converted to a MapReduce Job.

Run Job and Check results


Before running your Job, you will check that the Hadoop cluster configuration is correct.
1. Open the Run view of the MRProcessing Job.
2. Click the Hadoop Configuration tab:

As the configuration has been retrieved from a tHDFSInput component, some settings are missing, because they are
not necessary to read from or write to HDFS.
The Resource Manager address is necessary to run a MapReduce Job, and it is currently missing. If you run the Job, it will fail
with the following error message:

3. You can either use your TrainingCluster metadata or set the Resource Manager to "ClusterCDH54:8032":

4. Go back to the Basic Run tab and run your Job.


5. You can check that your Job is running in the Designer view. As the different MapReduce Jobs execute, you will see the
progress bars changing.



First they are empty; then, when a Map or Reduce task starts to execute, the progress bar turns red. When a task
successfully completes, the progress bar turns green:

6. You can also follow the execution in the Console:

You should see three reports like the one above, one for each MapReduce Job launched by the Studio.
The progress of the Map and Reduce tasks is reported as a percentage.
7. From the Console, you can get your Job ID and find it in the Hue Job Browser:

8. Using the File Browser in Hue, check the results saved in the /user/student/BDBasics/CustomersDataOut_MR
folder.

9. Click the part-r-00000 file:

Once again, you should have the same results as in the previous labs.
You have now covered the different ways to process tables and data stored on HDFS. Now it's time to recap what you've
learned.



Wrap-Up

Recap
In this chapter, you covered how to process tables and data stored on HDFS.
The first part of the chapter was dedicated to Hive tables. You built a Job to extract the data of interest from your Hive table with a
tHiveInput component. You processed the data, saved the result in HDFS with a tHDFSOutput component, and then transferred the
result into a Hive table using tHiveCreateTable and tHiveRow.
Next, you used the Profiling perspective of the Studio to perform different levels of analysis. You started at the connection level, moved to the
table and column levels, and ended with a custom analysis of the ProductCategory column. Each request was run as a MapReduce
Job on your cluster.
The second part of the chapter was dedicated to data processing with Pig and Big Data Batch Jobs.
You created a Job using Pig components to process your data. Each Pig request was executed as a MapReduce Job on your
cluster.
Next, you created a standard Job to process your data and reproduced the results obtained with Pig. Then, you duplicated the
standard Job and created a Big Data Batch Job using the MapReduce framework.

Further reading
If you want to discover more about data profiling, the Talend Data Quality trainings will help you. If you are interested in learning
more about Big Data Batch Jobs, the Talend Big Data Advanced training will give you an overview of real-life use cases
using Big Data Batch Jobs.
