Lab Guide - PDF - EN
BD Basics
Version 6.0
Copyright 2015 Talend Inc. All rights reserved.
Information in this document is subject to change without notice. The software described in this document is furnished under a license agreement or nondisclosure agreement. The software may be used or copied only in accordance with the terms of those agreements. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, for any purpose other than the purchaser's personal use without the written permission of Talend Inc.
Talend Inc.
800 Bridge Parkway, Suite 200
Redwood City, CA 94065
United States
+1 (650) 539 3200
Welcome to Talend Training
Congratulations on choosing a Talend training module. Take a minute to review the following points to help you get the most from
your experience.
Technical Difficulty
Instructor-Led
If you are following an instructor-led training (ILT) module, there will be periods for questions at regular intervals. However, if you need an answer in order to proceed with a particular lab, or if you encounter a situation with the software that prevents you from proceeding, don't hesitate to ask the instructor for assistance so it can be resolved quickly.
Self-Paced
If you are following a self-paced, on-demand training (ODT) module, and you need an answer in order to proceed with a particular
lab, or you encounter a situation with the software that prevents you from proceeding with the training module, a Talend professional
consultant can provide assistance. Double-click the Live Expert icon on your desktop to go to the Talend Live Support login page
(you will find your login and password in your ODT confirmation email). The consultant will be able to see your screen and chat with
you to determine your issue and help you on your way. Please be considerate of other students and only use this assistance if you are
having difficulty with the training experience, not for general questions.
Exploring
Remember that you are interacting with an actual copy of the Talend software, not a simulation. Because of this, you may be tempted to perform tasks beyond the scope of the training module. Be aware that doing so can quickly derail your learning experience, leaving your project in a state that is not readily usable within the tutorial, or consuming your limited lab time before you have a chance to finish. For the best experience, stick to the tutorial steps! If you want to explore, feel free to do so with any time remaining after you've finished the tutorial (but note that you cannot receive consultant assistance during such exploration).
Additional Resources
After completing this module, you may want to refer to the following additional resources to further clarify your understanding and
refine and build upon the skills you have acquired:
Talend product documentation (help.talend.com)
Documentation for the underlying technologies that Talend uses (such as Apache) and third-party applications that complement Talend products (such as MySQL Workbench)
Basic Concepts
Opening a Project
Monitoring the Hadoop Cluster
Creating Cluster Metadata
Wrap-Up
Basic Concepts
Overview
During this training, you will be assigned a preconfigured Hadoop cluster. This Hadoop cluster has been built with a Cloudera CDH 5.4 distribution. The purpose is to try the different functionalities, not to have a production cluster, so this training cluster runs in pseudo-distributed mode, meaning that there is only one node. This is enough to understand the different concepts covered in this training.
Before starting to build Jobs to read and write data from and to HDFS, there are some prerequisites. First, you will open a new project in Talend Studio. Then you will connect to your cluster to discover how to monitor it, using Cloudera Manager and Hue. For your Jobs to succeed, the cluster must be up and running, so if a service fails you need to know how to restart it.
The last point is to create your cluster metadata. This step avoids repetitive configuration of components in the Jobs you will create.
Objectives
After completing this lesson, you will be able to:
Open a new project in Talend Studio
Connect to Hue
Overview
Before developing Talend Jobs, you need a project to store them. In this lab you will open a pre-existing project for your big data
Jobs.
Run
1. Double-click the Talend Studio shortcut on your desktop to run Talend Studio:
2. Using the Local connection, click Select an existing project and then click BDBasics in the Project list:
Talend Forge
When the Connect to TalendForge window appears, log in with your existing Talend account or create a new one:
Start
Click Start now!. The Talend Studio main window appears, ready for you to create Jobs:
Overview
As you will develop Jobs that use HDFS, HBase, and Hive to store your data, you need a running Hadoop cluster.
For this training, a Hadoop cluster is provided. It runs the core Hadoop functionalities, such as HDFS, YARN, HBase, Hive, and Sqoop. It also runs Hue, which will help you browse your files and tables.
If a service is red-flagged, it is in bad health. You can restart it individually by clicking the black arrow on the right side of the service and then clicking Restart.
Connect to Hue
Hue is a web interface that helps you check what is happening on your cluster. You can browse your files and tables, and you can also track Map Reduce tasks.
1. Open a new tab in your web browser.
2. Click the Hue shortcut or navigate to http://ClusterCDH54:8888.
3. In the Username box, enter "student" and then, in the Password box, enter "training".
4. Click Sign in.
You are now logged in to Hue:
Overview
To run Big Data Jobs, Talend Studio must be connected to a running Hadoop cluster. The connection information can be configured in each component individually, or it can be stored as metadata in the Repository and reused as needed in the different components.
You will use the metadata approach instead of configuring the connection individually in components and Jobs.
You will now create your cluster metadata using three different methods. First you will configure the connection to the cluster manually, then from the Hadoop configuration files, and last, using a wizard.
Manual Configuration
The first way to create Hadoop cluster metadata is to create it manually. This requires that you already have several pieces of information about your cluster, such as the NameNode URI and the Resource Manager address or the Job Tracker URI, depending on whether you use YARN or Map Reduce v1. You may also need other information, such as the Job History or the Resource Manager Scheduler location.
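For reference, the sketch below shows how the same connection details map to standard Hadoop property names in the plain Hadoop Java API. This is not part of the lab and is not what the Studio generates; it is a minimal illustration using the values of this training cluster (the Resource Manager Scheduler address is omitted).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ManualClusterConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same values you enter manually in the wizard:
        conf.set("fs.defaultFS", "hdfs://ClusterCDH54:8020");           // NameNode URI
        conf.set("yarn.resourcemanager.address", "ClusterCDH54:8032");  // Resource Manager
        conf.set("mapreduce.jobhistory.address", "ClusterCDH54:10020"); // Job History
        try (FileSystem fs = FileSystem.get(conf)) {
            System.out.println("NameNode URI: " + fs.getUri());
        }
    }
}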
1. In the Studio, in the Repository, under Metadata, locate Hadoop Cluster.
2. Right-click Hadoop Cluster and then click Create Hadoop Cluster:
3. In the Name box, enter "TrainingCluster_manual", and then click Next.
The Hadoop Configuration Import Wizard opens:
4. In the Distribution list, select Cloudera and in the Version list, select Cloudera CDH5.4(YARN mode).
Job History: ClusterCDH54:10020
9. Click Finish. Your cluster Metadata will appear under Repository/Metadata/Hadoop Cluster.
The first property that appears in core-site.xml is the location of the NameNode:
hdfs://ClusterCDH54:8020.
2. Edit yarn-site.xml with Notepad++:
In this file, you will find the Resource Manager address and the Resource Manager Scheduler address.
3. Edit mapred-site.xml with Notepad++:
Configuration with the Hadoop configuration files
1. Right-click Hadoop Cluster and then click Create Hadoop Cluster.
2. In the Name box, enter "TrainingCluster_files" and then, click Next.
3. In the Distribution list, select Cloudera and in the Version list, select Cloudera CDH5.4(YARN mode).
4. Select Import configuration from local files and then, click Next.
5. Click Browse... and then locate the configuration files under
C:\StudentFiles\HadoopConf. Click OK:
6. Click Finish.
7. The configuration is almost done, except for the user name.
In the User name box, enter "student" and then click Finish.
Your cluster Metadata will appear under Repository/Metadata/Hadoop Cluster.
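The import wizard essentially reads these site files for you. The sketch below shows, with the plain Hadoop Java API, what loading the same files amounts to; it is only an illustration, using the lab location C:\StudentFiles\HadoopConf.

import java.io.File;
import org.apache.hadoop.conf.Configuration;

public class ConfigFromFiles {
    public static void main(String[] args) throws Exception {
        // Start from an empty configuration and load the site files the wizard imports.
        Configuration conf = new Configuration(false);
        conf.addResource(new File("C:/StudentFiles/HadoopConf/core-site.xml").toURI().toURL());
        conf.addResource(new File("C:/StudentFiles/HadoopConf/yarn-site.xml").toURI().toURL());
        conf.addResource(new File("C:/StudentFiles/HadoopConf/mapred-site.xml").toURI().toURL());
        // CDH 5 stores the NameNode location under fs.defaultFS.
        System.out.println("NameNode:         " + conf.get("fs.defaultFS"));
        System.out.println("Resource Manager: " + conf.get("yarn.resourcemanager.address"));
    }
}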
Automatic configuration
The last way to configure your metadata is to connect to Cloudera Manager; all the connection information is then retrieved automatically to create the cluster metadata.
1. Right-click Hadoop Cluster and then click Create Hadoop Cluster.
2. In the Name box, enter "TrainingCluster" and then, click Next.
3. In the Distribution list, select Cloudera and in the Version list, select Cloudera CDH5.4(YARN mode).
4. Select Retrieve configuration from Ambari or Cloudera and then, click Next.
5. In the Manager URI (with port) box, enter
"http://ClusterCDH54:7180".
6. In the Username and Password boxes, enter "admin".
8. Click Fetch. The wizard will retrieve the configuration files of all running services in your cluster:
11. Enter "student" (without quotes) in the User name box:
If the progress bars reach 100% and you have no error message, your connection was successful.
13. Click Close.
14. Click Finish to finalize the metadata creation. It will appear in the Repository under Metadata/Hadoop Cluster.
You created the same metadata three times with different techniques. For the next lessons, the last one will be used (TrainingCluster).
Now that you have covered the three different ways to create your Cluster Metadata, it's time to Wrap-Up.
Wrap-Up
Recap
In this lesson you covered the basic knowledge required to build and run Big Data Jobs efficiently.
You first opened your project, then you connected to Cloudera Manager and Hue. Cloudera Manager will help you restart a service if needed, and Hue will help you browse your files and tables and track the execution of your Map Reduce jobs.
The last part was about creating metadata for your cluster. You created the same metadata three times using different techniques, so you can use any of them in your Jobs.
For the next lessons, the TrainingCluster metadata will be used, but feel free to try any of the three. The results should be the same; if they differ, one of the metadata entries is not configured correctly.
Overview
Hadoop's file system—HDFS—scales to hold petabytes of data. In this lesson you will use Talend Big Data components to read and
write data to HDFS.
First, you will read a text file storing customer information from your local system and write it to HDFS. Then, you will read a collection
of customer Twitter messages, stored in separate files, and write them individually to HDFS. Next, you will develop a Job to read a
designated subset of the Twitter files back from HDFS.
Finally, you will simulate a sparse data set and then write it to HDFS using HBase dedicated components.
Objectives
After completing this lesson, you will be able to:
Write text files to HDFS
The first step is to create a new Job to read a local text file and then write it to HDFS.
Overview
Your first step is to store some text data on HDFS. In this lab, you are saving a file of customer information such as first name, last name, city, and state.
As you did previously for your cluster connection, you can create metadata for the HDFS connection.
2. In the Name box, enter "HDFSConnection" and then, click Next:
4. Click OK and then click Finish to create your HDFS connection metadata. It will appear in the Repository, below the TrainingCluster metadata:
Next, you will run your Job and then check the result in Hue.
4. Navigate to /user/student/BDBasics/Customers/CustomersData.csv:
This shows you the content of the CustomersData.csv file. The file was created by your Job in the new directory BDBasics/Customers that you specified.
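For reference, this is conceptually what tHDFSPut does. The sketch below is a minimal equivalent using the HDFS Java API; the local source path is illustrative only (use your actual customers file), while the HDFS target is the lab path.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutCustomersData {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the training NameNode as the HDFS user used in the lab.
        try (FileSystem fs = FileSystem.get(new URI("hdfs://ClusterCDH54:8020"), conf, "student")) {
            fs.copyFromLocalFile(
                new Path("C:/StudentFiles/CustomersData.csv"),                   // illustrative local path
                new Path("/user/student/BDBasics/Customers/CustomersData.csv")); // HDFS target
        }
    }
}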
Next, you will put multiple files to HDFS.
Overview
Your next step is to store a set of files on HDFS. In this scenario, you are saving a series of Twitter message files for later analysis.
Create Job
Under the C:\StudentFiles\tweets_in folder, you will find tweets saved as text files. You will put these files on HDFS.
1. Right-click PutCustomersData Job in the Repository and then click Duplicate.
2. In the Input new name box, enter PutTweets and in the Job Type list, select Standard:
Next you will run your Job and check the result in Hue.
Run
1. Run your Job and follow the execution in the Console. At the end of the execution, you should have an exit code equal to 0 and several other messages showing that the Job executed successfully:
2. In your web browser, the Hue page should already be open. If not, navigate to "http://ClusterCDH54:8888" and log in with student/training.
3. Click File Browser and navigate to
/user/student/BDBasics/tweets:
Eleven files appear because there are eleven tweet files in your local directory. The tHDFSPut component wrote all of the files in your local directory into the Hadoop file system.
You used a file mask to write a file set. File masks give you considerable control over which files you want Talend components to operate on.
Now that some data exists on HDFS, you can create a Job to read it.
Overview
Now you will create a Job to read HDFS data. You are going to transfer a subset of the Twitter files from HDFS to your local file system.
Create Job
If you examine the files in the tweets folder that you just created on HDFS, you will notice that the files were produced in 2012 and 2013. You will read only the 2012 files.
1. Create a new Job, naming it "GetTweets".
2. Place a tHDFSGet component on the design workspace.
You use tHDFSGet to read files stored on HDFS and make copies of them on your local system.
3. Double-click tHDFSGet to open the Component view.
4. In the Property list, select Repository.
5. Click (...) and then locate the HDFSConnection metadata that you previously created.
6. To configure the HDFS directory, click (...) and then navigate to "/user/student/BDBasics/tweets".
7. In the Local directory, enter "C:/StudentFiles/tweets_out" (including the quotes).
8. In the Overwrite file list, select always. This tells the component to always replace existing files when a new file of the same
name is being saved.
9. Click the green plus sign below the Files table.
10. Replace "newline" with "2012*" (including the quotes).
You are using a file mask to request all files in the HDFS directory tweets with names that begin with the string "2012":
Now, you will run the Job and check the results in Hue.
There are five tweet files that your Job read from HDFS and put on your local system. Because a file mask was used to limit
the files selected for the operation, only files with names that begin with "2012" were read from HDFS.
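The sketch below is a minimal HDFS Java API equivalent of what tHDFSGet does with this file mask: it expands the glob on HDFS and copies each match to the local folder. It assumes the local tweets_out folder exists; it is an illustration, not the code the Studio generates.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GetTweets2012 {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(new URI("hdfs://ClusterCDH54:8020"), conf, "student")) {
            // Same file mask as the Job: only tweet files whose names begin with "2012".
            FileStatus[] matches = fs.globStatus(new Path("/user/student/BDBasics/tweets/2012*"));
            if (matches != null) {
                for (FileStatus status : matches) {
                    fs.copyToLocalFile(status.getPath(), new Path("C:/StudentFiles/tweets_out"));
                }
            }
        }
    }
}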
Now that you have worked with text files on HDFS, it is time to work with sparse data sets and discover how to store them efficiently in HDFS.
Storing Sparse Data Sets with HBase
Overview
Apache HBase is the Hadoop column-oriented database. It is an open-source, non-relational database modeled after Google's BigTable and provides the same functionality on top of Hadoop and HDFS.
HBase is useful when you need random, real-time read/write access to your Big Data. Furthermore, HBase can host very large tables (billions of rows by millions of columns) and is particularly well suited for sparse data sets.
If your relational table looks like below (data missing in columns), it is considered "sparse" and is a good candidate for HBase.
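A key point is that HBase only stores the cells that actually have a value: a missing value simply has no cell. The sketch below, using the HBase 1.x Java client, writes one such sparse row; the table name, column family, and qualifiers are illustrative, not the ones used later in this lab, and it assumes ZooKeeper runs on the single training node.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SparseRowExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "ClusterCDH54");      // assumption: ZooKeeper on the training node
        conf.set("hbase.zookeeper.property.clientPort", "2181"); // same port as in the lab
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("demo_connections"))) { // illustrative table
            // Only the months with an actual connection count are written; other months have no cell.
            Put row = new Put(Bytes.toBytes("customer-42"));
            row.addColumn(Bytes.toBytes("connections"), Bytes.toBytes("jan"), Bytes.toBytes("3"));
            row.addColumn(Bytes.toBytes("connections"), Bytes.toBytes("jun"), Bytes.toBytes("7"));
            table.put(row);
        }
    }
}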
In this lab, you will generate sparse data and build a Job to store it in HDFS with HBase:
This Job is composed of two components. The tRowGenerator component will generate an integer value for the customer ID, and will also generate random integer values to simulate the number of connections per month.
The second component will save the data in a file named HBase_sample.txt under the C:\StudentFiles folder.
3. Run the Job, then locate and open "C:\StudentFiles\HBase_sample.txt" with Notepad++:
5. In the Zookeeper client port box, enter "2181" (including the quotes).
Your configuration should be as follows:
Your configuration should be similar to this:
7. Click Ok.
9. In the Basic settings tab, set the Id column Family Name to "Id".
14. Add a tHBaseClose component below the tFileInputDelimited component and connect with an OnSubjobOk trigger.
Now, you can run your Job and check the results in Hue.
Your execution should be successful; otherwise, double-check your Job and check HBase health in Cloudera Manager.
2. Connect to Hue and click Data Browsers>HBase.
This will give you the list of HBase tables. You should see Data_withNulls and RawData in the list:
This is an extract of the RawData table. You will still find the zeros. But if you compare with the content of Data_withNulls:
Challenges
Overview
Complete these exercises to further reinforce the skills you learned in the previous lesson. See Solutions for possible solutions to the
exercises.
Double Up Orders
Develop a second Job to write a file of duplicated orders to HDFS. Use the local file C:/StudentFiles/duplicated_orders as the
source file. Put the file into the HDFS directory /user/student/erp, keeping the same file name. Use the schema stored in the file
orders.xml and sort the file on column id before saving it. Hint: use tSortRow to sort the file.
Solutions
Overview
These are possible solutions to the exercises. Note that your solutions may differ and still be acceptable.
Double Up Orders
1. Use a tFileInputDelimited to read the file duplicated_orders from the local file system:
3. Sort the data in ascending order by the column id with the tSortRow component:
4. Use a tHDFSOutput component to rewrite the data to HDFS:
Recap
In this lab, you learned the basics of writing files using Talend's Big Data components for the Hadoop Distributed File System, HDFS.
You used tHDFSPut to write different types of files to HDFS. Then, you used tHDFSGet to read a subset of the files back from
HDFS.
And last, you used HBase dedicated components to store sparse data on HDFS (tHBaseConnection, tHBaseOutput,
tHBaseClose).
Further Reading
For more information about topics covered in this lab, see the Talend Data Integration Studio User Guide, Talend Big Data Platform
User Guide and the Talend Components Reference Guide.
LESSON 3
Working with Tables
This chapter discusses the following.
Overview
In this lesson, you will cover two common use cases.
HDFS can be used to optimize a data warehouse, so you might decide to move your data from a relational database to HDFS. The first use case will show you how to transfer MySQL tables to HDFS using Sqoop.
The second use case will show you how to create a table using Hive. This table can then be queried using HiveQL, a language very similar to SQL.
Objectives
After completing this lesson, you will be able to:
Transfer MySQL tables to HDFS using Sqoop
Overview
Sqoop is a tool designed to transfer data between Hadoop and relational databases. You can use Sqoop to import data from a relational database management system (RDBMS) such as MySQL or Oracle into the Hadoop Distributed File System (HDFS), transform the data using Map Reduce, and then export that data back into an RDBMS.
Sqoop automates most of this process, relying on the database to describe the schema for the data to be imported. Sqoop creates
and runs a Map-only Map Reduce Job to import the data.
In this Lab, you will transfer MySQL tables into HDFS using Sqoop dedicated components. First, you will push the customers data
into a MySQL database.
3. To check the data copied into the CustomersData table, right-click the tMysqlOutput component and then click Data Viewer:
You will now transfer this table to HDFS using the tSqoopImport component.
Importing Tables
You will create a simple Job to import the CustomersData MySQL table into HDFS using a tSqoopImport component.
In the tSqoopImport component, the first option is to choose the Mode: Commandline or Java API.
If you choose Commandline, the Sqoop shell is used to call Sqoop. In this mode, you have to deploy and run the Job on the host where Sqoop is installed. This means that you have to install and use the JobServer, as described in the Talend Data Integration Advanced training or in the Talend Installation Guide.
If you select Use Java API, the Java API is used to call Sqoop. In this mode, the Job can be run locally in the Studio, but you have to
configure the connection to your cluster.
Note that a JDK is required to execute the Job with the Java API and the JDK versions on both machines must be compatible.
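To give an idea of what the Java API mode means, the sketch below invokes Sqoop in-process through its Java entry point. Every argument value here is illustrative: in the lab, the JDBC URL, user name, and password come from the genericConnection metadata, and the target directory is chosen by the component, so treat this purely as an assumption-laden illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.sqoop.Sqoop;

public class SqoopImportSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://ClusterCDH54:8020");
        conf.set("yarn.resourcemanager.address", "ClusterCDH54:8032");

        // Illustrative arguments only; replace with the values from your own metadata.
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://localhost:3306/training",
            "--username", "someuser",
            "--password", "somepassword",
            "--table", "CustomersData",
            "--target-dir", "/user/student/BDBasics/Sqoop/CustomersData"
        };
        int exitCode = Sqoop.runTool(importArgs, conf);
        System.out.println("Sqoop exit code: " + exitCode);
    }
}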
1. Create a new standard Job and name it "SqoopImport".
2. Add a tSqoopImport component and open the Component view.
3. In the Mode box, select Use Java API.
4. In the Hadoop Property list, select Repository and then browse to find the HDFSConnection you configured earlier in
the course.
This will configure the Distribution, the Hadoop version, the NameNode URI, the Resource Manager, and the Hadoop user
name:
5. In the JDBC Property list, select Repository and then, browse to find the genericConnection Metadata.
This will configure the Connection, the Username, the Password and the Driver JAR values:
Note that your Job Id will be different. The Id is the number following "job_...". In the current example, the Job Id is
1435565883520_0012.
3. Right after, you will find a recap of various counters:
There, you can see that 4 map tasks ran on the cluster and that the 1 million records were transferred in approximately 21 seconds.
4. To see your Job in Hue, click Job Browser.
The Job Browser window opens and you will see the list of all your Jobs.
5. From the Console, find the Id of your Job. Then, in Hue, find your Job Id in the Job list. It should be followed by green boxes,
corresponding to a successful execution of Map and Reduce tasks:
There, you can check that your data have been imported as expected.
Now that you have imported a MySQL table to HDFS using a tSqoopImport component, you can continue to investigate working
with tables. The next topic will show you how to create tables in HDFS with Hive.
Creating Tables with Hive
Overview
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big
Data, and makes querying and analyzing easy.
Hive supports HiveQL, which is similar to SQL but does not support the full SQL syntax.
Hive converts the HiveQL query into Map Reduce code and then submits it to the Hadoop cluster. Hive, through HiveQL language,
provides a higher level of abstraction over Java Map Reduce programming.
First, you will create Hive Metadata in the Repository and then, you will use various methods to create Hive tables.
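Outside the Studio, a Hive table can also be queried through the HiveServer2 JDBC driver. The sketch below is only an illustration and makes two assumptions that are not stated in this guide: HiveServer2 listens on its default port 10000, and the CustomersData table (created later in this lesson) already exists; credentials depend on how the cluster authenticates users.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Assumption: default HiveServer2 port; adjust user/password to your cluster setup.
        String url = "jdbc:hive2://ClusterCDH54:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "student", "");
             Statement stmt = conn.createStatement()) {
            ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM CustomersData");
            if (rs.next()) {
                System.out.println("Rows in CustomersData: " + rs.getLong(1));
            }
        }
    }
}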
4. In the Schema list, select Repository then browse to find the CustomersData generic schema Metadata.
5. In the Table Name box, enter "CustomersData".
6. In the Action on table list, select "Create table if not exists":
11. In the Load action list, select "LOAD".
12. In the File Path box, enter
"/user/student/BDBasics/Customers/CustomersData".
13. In the Table Name box, enter "CustomersData".
14. In the Action on file list, select "OVERWRITE":
You will now run your Job and check the results in the Console and in Hue.
The Job successfully executed so now, you can check the results in Hue.
4. Click the Sample tab:
If you examine the results in the purchasedata column, you will see only NULL values. This is because the timestamp format of Hive is not equivalent to the date format in Talend Studio, which leads to a failure in the data conversion.
A possible workaround is to consider dates as of String type.
You will now experiment with another way to create a Hive table. You will use a wizard which automatically creates a Hive table from a file stored in HDFS.
There, you will find the connection metadata you created earlier.
4. Right-click HDFSConnection and then click Create Hive Table:
The connection to HDFS is checked, and then the wizard to create a Hive table based on a file stored in HDFS starts.
5. Browse to find the CustomersData.csv file under /user/student/BDBasics/Customers.
6. Select CustomersData.csv and wait until the Creation status changes to Success:
2. This Preview translates into a HiveQL query applied to your Hive table:
The query is "select * from default.customersdata_auto" and appears in the SQL Editor, in top of the window.
The result appears in tab 1.
3. To check the new table in Hue, click Data Browsers>Metastore Tables, then click customersdata_auto in the table list.
4. Click the Sample tab:
You have covered the various ways to work with tables in Hive. Now it's time to recap what you have learned in this lesson.
Recap
In this lesson you learned how to use Talend's Big Data components for Hive and Sqoop.
First, you imported a MySQL table to HDFS using the tSqoopImport component. The import was done through a Map-only Map Reduce Job.
Next, you manually created a Hive table with the tHiveCreateTable component and then populated it with the tHiveLoad component.
And last, you used the Hive table creation wizard to automatically create your Hive table from a file stored on HDFS.
LESSON 4
Processing Data and Tables in HDFS
This chapter discusses the following.
Overview
Once stored on HDFS, you will need to process your tables and data to extract useful information.
Depending on your data type, you can adopt various strategies.
If your data are stored as Hive tables, HiveQL might be the best way to address your needs. HiveQL is a high-level programming language similar to SQL. Hive converts the request into Map Reduce Jobs that will be executed on your cluster.
In this lesson, you will analyze Hive tables with a Job or with the Profiling view of the Studio.
If your data are stored as text files, an option is to use Pig. Pig Latin is also a high-level language providing common operations to group, filter, and join data. The Pig Latin script is automatically converted into Java Map Reduce code to be executed on the cluster.
Talend provides components to use Pig with minimal programming effort.
The last way to process your data, covered in this lesson, is to use a Big Data Batch Job. This kind of Job automatically converts the components into Java Map Reduce code that will be run on the cluster.
Objectives
After completing this lesson, you will be able to:
Process Hive tables with a standard Job
Overview
Hive converts the HiveQL query into Map Reduce code and then submits it to the Hadoop cluster. Hive, through HiveQL language,
provides a higher level of abstraction over Java Map Reduce programming.
You will now analyze the customers data table you created previously.
Using the Studio, you can analyze your data with your own Hive queries, or you can use the Profiling view and the Data Quality functionalities of the Studio on your Hive tables.
You will use various components to extract useful data from a Hive table, process it and then, store the result in another Hive table.
At the end of this lab, your Job will look like the following:
6. Add a tHiveInput component and connect it with an OnSubjobOk trigger.
7. Open the Component view.
8. Select the Use an existing connection option.
9. Set the Schema to Repository and then use the CustomersData generic schema metadata.
10. In the Table Name box, enter "CustomersData".
11. In the Query box, you will enter the HiveQL query that will be sent to the cluster.
As mentioned in the Overview, the investigation will be limited to the first 100,000 rows.
In the Query box, enter:
"select * from CustomersData where
CustomersData.Id<=100000"
Your configuration should be as follows:
Process data
You will now aggregate the data and store the result in HDFS.
1. Add a tAggregateRow component, connect it with the Main row, and then, open the Component view.
2. Click Sync columns and then click (...) to edit the schema.
3. Configure the output schema to have 3 columns named ProductCategory, Gender and Count, as follows:
5. Open the Component view.
6. Select the Use an existing connection option.
7. Click (...) to edit the schema.
8. Paste the Schema:
9. In the DB Type column, select STRING for ProductCategory and Gender and then, select INT for Count:
13. Add a tHiveRow component below tHiveCreateTable and connect it with an OnSubjobOk trigger.
14. In the Component view, select the Use an existing connection option.
15. Copy the schema in tHiveCreateTable and paste it in the schema of the tHiveRow component.
16. Click OK to save the schema.
17. In the Query box, you will be able to write your own HiveQL.
The query in a tHiveRow component is executed at each flow iteration in your Job. In the current Job, the query will be
executed only once to transfer the data from the HDFS file to the AggResults Hive table.
In the Query box, enter:
"LOAD DATA INPATH
'/user/student/BDBasics/Hive/agg_results' OVERWRITE INTO TABLE AggResults "
Note: Copy and paste the Query from the LabCodeToCopy file in the C:\StudentFiles folder.
18. Your configuration should be as follows:
Your Job is now complete. It's time to run it and check the results.
You will see an error message about the winutils.exe binary not being available in the Hadoop binaries. In the current Job, this error won't prevent the Job from succeeding, but it could in some cases.
You can fix this issue by downloading the winutils.exe file and saving it in a tmp folder. This has been done for you in your training environment.
2. In the Run view, click Advanced settings.
3. Select the Use specific JVM arguments option.
4. Click New....
6. Click OK to save the new argument and run your Job again.
The Job should run successfully without error messages:
7. Even if it is not clearly stated in the Console, the HiveQL query in the tHiveInput component executed on the cluster. You can see the generated Job in the Hue Job Browser:
The folder is empty because the data have been transferred into the Hive table, which deleted the file on HDFS.
9. In Hue, click Data Browsers>Metastore Tables. The table AggResults should be in the table list.
10. Click AggResults and then click the Sample tab:
You have processed your Hive table with various components such as tHiveInput and tHiveRow. You will now process your Hive table using the Profiling perspective of the Studio.
Overview
Using the Studio, you can run various analyses on your Hive tables. In this lab, you will use the Profiling view of the Studio to run analyses of your Hive connection, Hive tables, and columns.
As expected, you will find the AggResults, CustomersData and CustomersData_auto Hive tables.
Now, you will start the Overview Analysis of HiveConnection.
4. Right-click HiveConnection under DQ Repository/Metadata/DB connections, and click Overview analysis:
7. Click Finish.
Your analysis will appear in the DQ Repository, under Data Profiling/Analyses:
8. The OverviewAnalysis also opens so that you can examine the Analysis settings.
Click the Running man figure to start the analysis:
9. At the bottom of the Connection Analysis window, you will find the Analysis summary:
In the Analysis summary, you should see that the analysis was successful. There is 1 schema, named default, and 3 tables.
2. This will open a SQL Editor window which is split in two parts.
The upper part is the request submitted to preview your table:
3. Right-click customersdata and click Table analysis.
You will continue to investigate your data by running an analysis on some columns of the CustomersData table.
Column Analysis
1. Right-click customersdata under DQ Repository/Metadata/DB connections/HiveConnection/default/Tables, and click
Column Analysis:
You will have basic information about your columns, such as the number of rows, the number of blank or null values, and the distinct count.
6. In the Analysis Parameter section, you can choose the execution engine: SQL or Java. To send your requests to the cluster, select SQL and then click the running man icon:
8. At the end of the execution, open the Analysis Results tab. There you will find the results of each analysis over the state, productcategory, and gender columns:
There are 7 distinct product categories. You will now run an analysis to list these values.
Product Category column analysis
1. Right-click productcategory under DQ Repository/Metadata/DB connections/HiveConnection/default/Tables/customersdata, and click Column Analysis>Analyze:
2. In the Name box, enter "ProductCategoryAnalysis" (without quotes) and click Finish.
3. In the Column Analysis window, under Data preview, click Select Indicators:
This is where you will specify which statistics you are interested in.
5. Expand Simple Statistics and select Row Count in the productcategory column:
9. Open the Analysis Results tab:
The frequency analysis lists the product category values and the count for each value. The numbers are very close to each other because this data set has been randomly generated.
You have covered the last analysis for this lab. Now it's time to move to the next lab where you will cover how to process data on
HDFS using Pig.
Overview
The motivation behind Pig is that Map Reduce is very powerful but requires a Java programmer, and the programmer may have to re-invent common functionalities such as joining or filtering.
Pig is a platform for analyzing large data sets. It consists of a high-level programming language, Pig Latin, that opens Hadoop to non-
Java programmers. Pig Latin also provides common operations to group, filter, join or sort data.
Pig provides an execution engine on top of Hadoop.
The Pig Latin script is converted into Map Reduce code that will be executed on your cluster.
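The sketch below gives a feel for Pig Latin, run here through the PigServer Java API. It is a simplified, illustrative equivalent of the filter/group/sort this lab performs with Pig components, not the script generated by the Studio: the field list and the ';' delimiter are assumptions, and it expects the Hadoop/Pig client configuration on the classpath and an output folder that does not exist yet.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigProcessingSketch {
    public static void main(String[] args) throws Exception {
        // Runs the registered script on the cluster as Map Reduce jobs.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerQuery("customers = LOAD '/user/student/BDBasics/CustomersData' "
                + "USING PigStorage(';') AS (id:int, firstName:chararray, lastName:chararray, "
                + "state:chararray, gender:chararray, productCategory:chararray);");
        pig.registerQuery("californians = FILTER customers BY state MATCHES 'California';");
        pig.registerQuery("grouped = GROUP californians BY (productCategory, gender);");
        pig.registerQuery("counts = FOREACH grouped GENERATE FLATTEN(group), COUNT(californians);");
        pig.registerQuery("sorted = ORDER counts BY $0 ASC;");  // $0 is productCategory after FLATTEN
        pig.store("sorted", "/user/student/BDBasics/Pig/out");
    }
}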
In this lab, you will process the Customers data stored in HDFS. You will perform basic tasks such as filtering rows, mapping, aggregating, and sorting data, and then store your results in HDFS.
At the end of this lab, your Job will look like the following:
7. Add a tHDFSOutput component and connect it with the Main row.
8. Double-click the tHDFSOutput component to open the Component view.
9. In the Property Type list, select Repository. Then, navigate to find HDFSConnection.
10. In the Schema list, select Repository. Then, navigate to find the CustomersData generic schema metadata.
11. In the File Name box, enter
"/user/student/BDBasics/CustomersData".
Your configuration should be as follows:
12. Save your Job. It will be run later from the main Job that will process the data using Pig components.
Load Data
You will create a new Job named PigProcessing which will process the customers data.
1. Create a new Standard Job and name it "PigProcessing".
2. From the Repository, drag the WritingHDFS Job and drop it in the Designer. It will appear as a tRunJob component
labelled WritingHDFS:
4. In the Filter box, enter "State matches 'California'" (quotes included):
5. Add a tPigMap component, connect with the Pig Combine row and then open the Component view.
6. Create a new output named "MappingOut".
7. Select the ProductCategory and Gender columns in the row2 table and drag them into the MappingOut table.
Your mapping should be as follows:
9. Add a tPigSort component, connect with the Pig Combine row and then, open the Component view.
10. Click the green plus sign below the Sort key table.
11. Configure to sort the ProductCategory column by ascending order, as shown below:
Store Results
Once processed, the last step is to store your results on HDFS.
1. Add a tPigStore component, connect with the Pig Combine row and then, open the Component view.
2. Set the Property Type to Repository and then select HDFSConnection.
3. In the Result Folder URI, enter "/user/student/BDBasics/Pig/out".
4. Select the Remove directory if exists option.
This will allow you to run the Job again as needed.
5. Your configuration should be as follows:
The WritingHDFS Job executes successfully, and then the Pig components start to work.
3. If you continue to investigate the logs, you will find the Pig requests equivalent to each Pig component. The first Pig component is the tPigLoad:
4. Right after, you will find Pig requests for tPigFilterRow, tPigMap, tPigAggregate, tPigSort and tPigStore:
There, you can see that the execution was successful. First, 1 million rows were read from the CustomersData file, and then 14 records were written in the /user/student/BDBasics/Pig/out folder.
6. In Hue, using the File Browser, navigate to /user/student/BDBasics/Pig/out.
7. Click the part-r-00000 file to see the result:
8. In Hue, in the Job Browser, you will be able to see all the Jobs submitted by the Studio. The Jobs have different IDs but they
have the same name BDBASICS_PigProcessing_0.1_tPigLoad_1.
All your Jobs have succeeded:
You can now continue investigating how to process data on HDFS and move to the next lab.
Overview
The last way to process data covered in this course is to use Big Data Batch Jobs. You will first create a Standard Job and then convert it to a Big Data Batch Job using the Map Reduce framework.
The Job will be very similar to the PigProcessing Job. At the end of this lab, your Job will look like the following:
The first step will be to read the data using the HDFSConnection metadata.
Your configuration should be as follows:
In the next section, you will map and aggregate your data.
The last steps are to sort the results and then to save them on HDFS.
5. In the Order asc or desc column, select asc:
6. In the Repository, under Metadata/Hadoop Cluster/TrainingCluster/HDFS, click HDFSConnection and drag it to the
Designer.
7. In the Components list, select tHDFSOutput.
8. Connect it with the Main row and then open the Component view.
9. In the File Name box, enter
"/user/student/BDBasics/CustomersDataOut".
10. In the Action list, select Overwrite.
Your configuration should be as follows:
The results should be the same as in the previous lab named Processing Data with Pig.
You will now convert this Standard Job into a Big Data Batch Job which will use the Map Reduce framework.
Convert to Map Reduce Batch Job
Using the Studio, you can convert a standard Job to a Big Data Batch Job and then choose between the Map Reduce and Spark frameworks.
In this lesson, you will focus on Map Reduce jobs.
Instead of converting your current Job, you will duplicate it as a Map Reduce based Batch Job.
1. In the Repository, right-click the StandardProcessing Job and then, click Duplicate.
2. In the Input new name box, enter MRProcessing.
3. In the Job Type list, select Big Data Batch.
4. In the Framework list, select MapReduce:
5. Click OK.
6. The Hadoop cluster configuration is set at the Job level in Big Data Batch Jobs. So, before duplicating your Job, you will be
asked to choose from which component the Hadoop configuration will be retrieved:
7. Select the HDFSConnection component with the blue arrow and click OK.
As the configuration has been retrieved from a tHDFSInput component, some configuration values are missing because they are not necessary to read from or write to HDFS.
The Resource Manager address is necessary to run a Map Reduce Job and is currently missing. If you run the Job, it will fail with the following error message:
3. You can either use your TrainingCluster metadata or set the Resource Manager to "ClusterCDH54:8032":
You should see three reports as above, one for each Map Reduce Job launched by the Studio.
The progress of the Map Reduce tasks is given as a percentage.
7. From the Console, you can get your Job ID and find it in the Hue Job Browser:
8. Using the File Browser in Hue, check the results saved in the /user/student/BDBasics/CustomersDataOut_MR
folder.
9. Click the part-r-00000 file:
This time again, you should have the same results as in previous labs.
You have now covered the different ways to process tables and data stored on HDFS. And now, it's time to recap what you've
learned.
Recap
In this chapter, you covered how to process tables and data stored on HDFS.
The first part of the chapter was dedicated to Hive tables. You built a Job to extract data of interest from your Hive table with a tHiveInput component. You processed the data, saved the result in HDFS with a tHDFSOutput component, and then you transferred the result into a Hive table using tHiveCreateTable and tHiveRow.
Next, you used the Profiling view of the Studio to perform different levels of analysis. You started at the connection level, moved to the table and column levels, and ended with a custom analysis on the ProductCategory column. Each request was run as a Map Reduce Job on your cluster.
The second part of the chapter was dedicated to data processing with Pig and Big Data Batch Jobs.
You created a Job using Pig components to process your data. Each Pig request was executed as a Map Reduce Job on your
cluster.
Next, you created a standard Job to process your data and reproduced the results obtained with Pig. Then, you duplicated the standard Job and created a Big Data Batch Job using the Map Reduce framework.
Further reading
If you want to discover more about data profiling, the Talend Data Quality trainings will help you. And if you are interested in discovering more about Big Data Batch Jobs, the Talend Big Data Advanced training will give you an overview of real-life use cases using Big Data Batch Jobs.