Fundamentals of Apache Sqoop Notes
What is Sqoop?
Apache Sqoop is a tool designed for efficiently transferring bulk data between
Apache Hadoop and external datastores such as relational databases and
enterprise data warehouses.
Sqoop is used to import data from external datastores into the Hadoop Distributed
File System or related Hadoop eco-systems like Hive and HBase. Similarly,
Sqoop can also be used to extract data from Hadoop or its eco-systems and
export it to external datastores such as relational databases and enterprise data
warehouses. Sqoop works with relational databases such as Teradata,
Netezza, Oracle, MySQL, PostgreSQL, etc.
Key features of Sqoop include -
1. Full Load
2. Incremental Load
3. Parallel import/export
5. Compression
As we are dealing with Big Data, Hadoop stores and processes it using
different processing frameworks like MapReduce, Hive, HBase, Cassandra, Pig,
etc. and storage frameworks like HDFS to achieve the benefits of distributed
computing and distributed storage. In order to store and analyze Big Data from
relational databases, data needs to be transferred between the database systems
and the Hadoop Distributed File System (HDFS). Here, Sqoop comes into the
picture. Sqoop acts as an intermediate layer between Hadoop and relational
database systems. You can import and export data between relational database
systems and Hadoop and its eco-systems directly using Sqoop.
Sqoop Architecture
Sqoop provides a command line interface to the end users. Sqoop can also be
accessed using Java APIs. The command submitted by the end user is parsed by
Sqoop, which launches a Hadoop map-only job to import or export the data,
because the reduce phase is required only when aggregations are needed.
Sqoop just imports and exports the data; it does not do any aggregations.
Generic Syntax:
The Hadoop-specific generic arguments must precede any import arguments,
and the import arguments can be given in any order.
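A sketch of the general form of the import command -
$ sqoop import (generic-args) (import-args)
$ sqoop-import (generic-args) (import-args)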
Importing a Table into HDFS
Syntax:
--query Executes the SQL query provided and imports the results
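A minimal sketch, assuming a MySQL database testdb on localhost with a table employees (the connection string, names, and paths are illustrative):
$ sqoop import --connect jdbc:mysql://localhost/testdb --username root -P --table employees --target-dir /user/cloudera/employees
$ sqoop import --connect jdbc:mysql://localhost/testdb --username root -P --query 'SELECT * FROM employees WHERE $CONDITIONS' --split-by id --target-dir /user/cloudera/employees
When --query is used, the literal $CONDITIONS token must appear in the WHERE clause and a --split-by column (or -m 1) must be given.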
Incremental Imports
Syntax:
--check-column --last-value
1. Append
2. Lastmodified.
Append mode is to be used when new rows are continually being added with
increasing values. The column that is continually increasing should be
specified with --check-column. Sqoop imports rows whose value is greater
than the one specified with --last-value. Lastmodified mode is to be used
when records of the table might be updated, and each such update sets
the current timestamp value in a last-modified column. Records whose check
column timestamp is more recent than the timestamp specified with --last-value
are imported.
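A minimal sketch of an append-mode incremental import (connection details, table, and column names are illustrative):
$ sqoop import --connect jdbc:mysql://localhost/testdb --username root -P --table employees --incremental append --check-column id --last-value 100 --target-dir /user/cloudera/employees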
Notes:
2. The --password parameter is insecure, as anyone can read it from the command line.
The -P option can be used instead, which prompts for the password on the console. Otherwise, it is
recommended to use --password-file pointing to the file containing the password (make
sure you have revoked permission from unauthorized users).
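A sketch of the password-file approach (the file path is illustrative; the file should contain only the password and be readable only by its owner):
$ echo -n "secret" > /home/cloudera/.db-password
$ chmod 400 /home/cloudera/.db-password
$ sqoop import --connect jdbc:mysql://localhost/testdb --username root --password-file file:///home/cloudera/.db-password --table employees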
Argument Description
--create-hive-table Creates the Hive table and fails if that table already exists
--hive-delims-replacement Replaces delimiters like \n, \r, and \01 from string fields with a user-defined string
--map-column-hive Overrides the default mapping from SQL datatypes to Hive datatypes
Syntax:
--hive-table
By specifying --hive-import, Sqoop imports the data into a Hive table rather than
an HDFS directory.
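A minimal sketch of a Hive import (database and table names are illustrative):
$ sqoop import --connect jdbc:mysql://localhost/testdb --username root -P --table employees --hive-import --hive-table default.employees --create-hive-table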
Argument Description
--hbase-create-table If specified, Sqoop creates missing HBase tables
Syntax:
By specifying --hbase-table, Sqoop imports the data into an HBase table rather than
an HDFS directory.
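A minimal sketch of an HBase import (table, column family, and row-key column are illustrative):
$ sqoop import --connect jdbc:mysql://localhost/testdb --username root -P --table employees --hbase-table employees --column-family info --hbase-row-key id --hbase-create-table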
Sqoop-Import-all-Tables
The import-all-tables tool imports all tables in an RDBMS database to HDFS. Data
from each table is stored in a separate directory in HDFS. The following
conditions must be met in order to use sqoop-import-all-tables:
3. You must not use a splitting column, and must not impose any conditions
using a WHERE clause.
Generic Syntax:
The Sqoop-specific arguments are similar to those of the sqoop-import tool, but a few
options such as --table, --split-by, --columns, and --where are invalid here.
Syntax:
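A sketch (the generic form first, then a concrete invocation with an illustrative warehouse directory):
$ sqoop import-all-tables (generic-args) (import-args)
$ sqoop import-all-tables --connect jdbc:mysql://localhost/testdb --username root -P --warehouse-dir /user/cloudera/testdb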
Sqoop-Export
Generic Syntax:
The Sqoop export command prepares INSERT statements with the set of input data
and then hits the database. It is meant for exporting new records; if the table has a
unique value constraint (such as a primary key) and a conflicting row already exists,
the export job fails because the INSERT statement fails. If you have updates, you can
use the --update-key option: Sqoop then prepares UPDATE statements, which update
the existing rows, rather than the INSERT statements used earlier.
Syntax:
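A minimal sketch of an insert-mode export and an update-mode export (the export directory and key column are illustrative):
$ sqoop export (generic-args) (export-args)
$ sqoop export --connect jdbc:mysql://localhost/testdb --username root -P --table employees --export-dir /user/cloudera/employees
$ sqoop export --connect jdbc:mysql://localhost/testdb --username root -P --table employees --export-dir /user/cloudera/employees --update-key id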
Sqoop-Job
Sqoop job command allows us to create a job. A job remembers the
parameters used to create it, so it can be invoked at any time with the same
arguments.
Generic Syntax:
$ sqoop job (generic args) (job args) [-- [subtool name] (subtool args)]
$ sqoop-job (generic args) (job args) [-- [subtool name] (subtool args)]
Sqoop-job makes work easy when we are using incremental imports. The last
imported value is stored in the job configuration of the sqoop-job, so on the
next execution it is picked up directly from the configuration and only the new data is imported.
Sqoop-job options:
Argument Description
--create Defines a new job with the specified job-id (name). The actual Sqoop import command should be separated by "--"
Syntax:
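A sketch of creating, listing, and executing a job (the job name and import arguments are illustrative):
$ sqoop job --create employee_import -- import --connect jdbc:mysql://localhost/testdb --username root -P --table employees --incremental append --check-column id --last-value 0
$ sqoop job --list
$ sqoop job --exec employee_import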
Sqoop-Codegen
Sqoop-codegen command generates Java class files which encapsulate and
interpret imported records. The Java definition of a record is instantiated as part
of the import process. For example, if the Java source is lost, it can be recreated.
New versions of a class can be created which use different delimiters
between fields, and so on.
Generic Syntax:
Syntax:
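A sketch (connection details and table name are illustrative):
$ sqoop codegen (generic-args) (codegen-args)
$ sqoop codegen --connect jdbc:mysql://localhost/testdb --username root -P --table employees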
Sqoop-Eval
Sqoop-eval command allows users to quickly run simple SQL queries against
a database, and the results are printed to the console.
Generic Syntax:
Syntax:
Using this, users can be sure that they are importing the data as expected.
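A sketch (connection details and query are illustrative):
$ sqoop eval (generic-args) (eval-args)
$ sqoop eval --connect jdbc:mysql://localhost/testdb --username root -P --query "SELECT * FROM employees LIMIT 5"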
Sqoop-List-Databases
Used to list all the databases available on the RDBMS server.
Generic Syntax:
Syntax:
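A sketch (connection details are illustrative):
$ sqoop list-databases (generic-args) (list-databases-args)
$ sqoop list-databases --connect jdbc:mysql://localhost/ --username root -P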
Sqoop-List-Tables
Used to list all the tables in a specified database.
Generic Syntax:
Syntax:
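A sketch (connection details are illustrative):
$ sqoop list-tables (generic-args) (list-tables-args)
$ sqoop list-tables --connect jdbc:mysql://localhost/testdb --username root -P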
Hive Commands
Let’s look at the usage of the top hive commands in HQL on both databases
and tables –
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
In the above syntax for create database command, the values mentioned in
square brackets [] are optional.
In Hadoop Hive, the mode is set to RESTRICT by default, and users cannot
delete a database unless it is empty. For deleting a database in Hive along with its
existing tables, users must change the mode from RESTRICT to CASCADE.
In the syntax for drop database Hive command, “if exists” clause is used to
avoid any errors that might occur if the programmer tries to delete a database
which does not exist.
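For example, dropping a non-empty database (the database name is illustrative):
hive> DROP DATABASE IF EXISTS college CASCADE;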
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
(col_name data_type [COMMENT col_comment], ...)
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)];
The LOCATION keyword is used for specifying where the table should be
stored on HDFS.
DROP TABLE command removes the metadata and data for a particular table.
Data is usually moved to the .Trash/Current directory if Trash is configured. If the
PURGE option is specified, the table data does not go to the trash directory,
so it cannot be retrieved in case of an erroneous DROP command execution.
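For example (the table name is illustrative):
hive> DROP TABLE IF EXISTS old_students PURGE;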
ALTER TABLE command can be used to rename a table or to change its properties -
ALTER TABLE table_name RENAME TO [db_name.]new_table_name;
ALTER TABLE table_name SET TBLPROPERTIES ('property_key'='property_new_value');
In the above step, we can set a table property such as the creator attribute, and
similarly we can later alter or modify other table properties as well.
DESCRIBE Table Command in Hive
Gives the information of a particular table. The syntax is as follows -
DESCRIBE [EXTENDED|FORMATTED] table_name[.col_name ( [.field_name] )];
A truncated sample of the DESCRIBE FORMATTED output looks like this -
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.099 seconds, Fetched: 37 row(s)
We can also check the description for a specific column from the table as
follows –
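A sketch (table and column names are illustrative):
hive> DESCRIBE employees.salary;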
Data can be loaded into a Hive table in two ways -
LOAD command
Insert command
The LOAD command syntax is -
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE existing_table_name;
Example
Let’s load a structured file that contains information about different students.
Let’s take a look at the data present in the file –
ID|name|age|fee|city|state |zip
1|Kendall|22|25874|Kulti-Barakar|WB|451333
2|Mikayla|25|35367|Jalgaon|Maharastra|710179
3|Raven|20|49103|Rewa|Madhya Pradesh|392423
4|Carla|19|27121|Pilibhit|UP|769853
5|Edward|21|32053|Tuticorin|Tamil Nadu|368262
6|Wynter|21|43956|Surendranagar|GJ|457441
7|Patrick|19|19050|Mumbai|MH|580220
8|Hayfa|18|15590|Amroha|UP|470705
9|Raven|16|37836|Cuddalore|TN|787001
The file is a ‘|’ delimited file where each row can be inserted as a table
record.
First, let's create a table students based on the contents of the file -
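A minimal sketch of the statement, with column types assumed from the sample rows (column names and the '|' delimiter come from the file shown above):
hive> CREATE TABLE students (
    >   id INT,
    >   name STRING,
    >   age INT,
    >   fee INT,
    >   city STRING,
    >   state STRING,
    >   zip STRING
    > )
    > COMMENT 'This table holds the demography info for each student'
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    > STORED AS TEXTFILE;
OK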
If the keyword LOCAL is specified, Hive loads the file from the local filesystem.
If LOCAL is not specified, Hive assumes the following rules -
It will assume it is an HDFS path and will try to search for the file in HDFS.
If the path is not absolute, then Hive will try to locate the file under /user/<username>
in HDFS.
Using the OVERWRITE keyword while importing means the existing data is
deleted and the new data is loaded; otherwise the new data would simply be
appended. In other words, with OVERWRITE the contents of the target table are
deleted and replaced by the files referred to by the file path; without it, the files
referred to by the file path are added to the table.
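A sketch of the load, assuming the file was saved locally as /home/cloudera/students.txt (the path is illustrative):
hive> LOAD DATA LOCAL INPATH '/home/cloudera/students.txt' OVERWRITE INTO TABLE students;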
Let’s check if the data has been inserted into the table –
hive> select * from students;
2. Check first 5 records:
Now, let's try to retrieve only 5 records using the limit option -
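For example -
hive> SELECT * FROM students LIMIT 5;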
The results below list each state value together with its record count; note the
trailing "state 1" row, which comes from the file's header line being loaded as an
ordinary record.
OK
2016-09-17 17:50:08,645 Stage-1 map = 0%, reduce = 0%
...
1.14 sec
OK
AP 25
AR 1
AS 16
Andhra Pradesh 15
Arunachal Pradesh 1
Assam 11
BR 29
Bihar 20
CT 9
Chhattisgarh 6
DL 2
Delhi 2
GJ 13
Gujarat 14
HP 2
HR 6
Haryana 8
Himachal Pradesh 1
JH 11
JK 1
Jharkhand 4
KA 12
KL 6
Karnataka 24
Kerala 7
MH 20
MN 1
MP 17
MZ 2
Madhya Pradesh 16
Maharastra 32
Meghalaya 1
Mizoram 1
NL 1
OR 7
Odisha 10
PB 8
Pondicherry 2
Punjab 8
RJ 16
Rajasthan 24
TN 17
Tamil Nadu 20
Tripura 1
UP 53
UT 4
Uttar Pradesh 37
Uttarakhand 1
WB 20
West Bengal 28
state 1
iii. Eclipse must be installed, as the MapReduce WordCount example will be run
from the Eclipse IDE.
Word Count - Hadoop Map Reduce Example –
How it works?
Hadoop WordCount operation occurs in 3 stages –
i. Mapper Phase
For instance, consider the sentence "An elephant is an animal". The
mapper phase in the WordCount example will split the string into individual
tokens, i.e. words. In this case, the entire sentence will be split into 5 tokens
(one for each word) with a value 1, as shown below -
(an,1)
(elephant,1)
(is,1)
(an,1)
(animal,1)
Hadoop WordCount Example- Shuffle Phase Execution
After the map phase execution is completed successfully, shuffle phase is
executed automatically wherein the key-value pairs generated in the map
phase are taken as input and then sorted in alphabetical order. After the
shuffle phase is executed from the WordCount example code, the output will
look like this -
(an,1)
(an,1)
(animal,1)
(elephant,1)
(is,1)
In the reduce phase, all the keys are grouped together and the values for
similar keys are added up to find the occurrences for a particular word. It is
like an aggregation phase for the keys generated by the map phase. The
reducer phase takes the output of shuffle phase as input and then reduces
the key-value pairs to unique keys with values added up. In our example, "an"
is the only word that appears twice in the sentence. After the execution of the
reduce phase of the MapReduce WordCount example program, "an" appears as a
key only once, but with a count of 2, as shown below -
(an,2)
(animal,1)
(elephant,1)
(is,1)
This is how the MapReduce word count program executes and outputs the
number of occurrences of a word in any given input file. An important point to
note during the execution of the WordCount example is that the mapper class
in the WordCount program will execute completely on the entire input file and
not just a single sentence. If the input file has 15 lines, then the
mapper class will split the words of all 15 lines and form the initial key-value
pairs for the entire dataset. The reducer execution will begin only after the
mapper phase is executed successfully.
Step 1 –
Let’s create the java project with the name “Sample WordCount” as shown
below -
File > New > Project > Java Project > Next.
Create a Mapper class within the WordCount class, extending MapReduceBase and
implementing the Mapper interface -
// Mapper: emits (word, 1) for every token in each input line.
// Requires imports: java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.*, org.apache.hadoop.mapred.*
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
In the mapper class code, we have used the StringTokenizer class, which
takes the entire line and breaks it into small tokens (strings/words).
Step 6 –
Create a Reducer class within the WordCount class, extending MapReduceBase
and implementing the Reducer interface. The reducer class for the wordcount
example in hadoop will contain the reduce() method, which sums up the values
for each key -
// Reducer: sums the counts emitted for each word (requires java.util.Iterator).
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
Step 7 –
Create main() method within the WordCount class and set the following
properties using the JobConf class -
i. OutputKeyClass
ii. OutputValueClass
iii. MapperClass
iv. ReducerClass
v. InputFormat
vi. OutputFormat
vii. InputFilePath
viii. OutputFolderPath
public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("WordCount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    //conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    // Input and output paths are taken from the program arguments.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
}
Run the program with the input and output paths as arguments -
/user/cloudera/Input/war_and_peace /user/cloudera/Output
Important Note: the war_and_peace file must be available in HDFS
at /user/cloudera/Input/war_and_peace.
If not, upload the file on HDFS using the following commands -
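A sketch of the upload, assuming the file was downloaded to the local home directory (the local path is illustrative):
$ hadoop fs -mkdir -p /user/cloudera/Input
$ hadoop fs -put /home/cloudera/war_and_peace /user/cloudera/Input/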
$ hadoop fs -help
The help hdfs shell command helps hadoop developers figure out all the available
hadoop shell commands and how to use them.
Variations of the Hadoop fs Help Command
$ hadoop fs -help ls
Using the help command with a specific command lists the usage information
along with the options to use the command.
$ hadoop fs -usage ls
Usage command gives all the options that can be used with a particular hdfs
command.
4) ls
This command will list all the available files and subdirectories under the default
directory. For instance, in our example the default directory for the Cloudera VM is
/user/cloudera.
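Example - $ hadoop fs -ls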
5) copyFromLocal
Copies a file from the local filesystem to an HDFS location.
For the following examples, we will use the Sample.txt file available in the
/home/cloudera location.
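Example (the destination directory is illustrative) - $ hadoop fs -copyFromLocal /home/cloudera/Sample.txt /user/cloudera/dezyre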
6) put –
This hadoop command uploads a single file or multiple source files from local
file system to hadoop distributed file system (HDFS).
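Example (the destination directory is illustrative) - $ hadoop fs -put /home/cloudera/Sample.txt /user/cloudera/dezyre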
8) du
Displays the disk usage for all the files available under a given directory.
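Example (the path is illustrative) - $ hadoop fs -du /user/cloudera/dezyre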
9) df
Displays the disk usage of the current hadoop distributed file system.
Example - $ hadoop fs -df
10) Expunge
This HDFS command empties the trash by deleting all the files and
directories.
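Example - $ hadoop fs -expunge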
11) Cat
This is similar to the cat command in Unix and displays the contents of a file.
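Example (the file path is illustrative) - $ hadoop fs -cat /user/cloudera/dezyre/Sample.txt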
12) cp
Copy files from one HDFS location to another HDFS location.
Example - $ hadoop fs -cp /user/cloudera/dezyre/war_and_peace /user/cloudera/dezyre1/
13) mv
Move files from one HDFS location to another HDFS location.
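Example (the paths are illustrative) - $ hadoop fs -mv /user/cloudera/dezyre/Sample.txt /user/cloudera/dezyre1/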
14) rm
Removes the file or directory from the mentioned HDFS location.
rm -r
Deletes or removes the directory and its content from the HDFS location in a
recursive manner.
Example - $ hadoop fs -rm -r /user/cloudera/dezyre3
15) tail
This hadoop command will show the last kilobyte of the file to stdout.
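Example (the file path is illustrative) - $ hadoop fs -tail /user/cloudera/dezyre/war_and_peace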
16) copyToLocal
Copies the files to the local filesystem. This is similar to the hadoop fs -get
command, but in this case the destination location must be a local file
reference.
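Example (the paths are illustrative) - $ hadoop fs -copyToLocal /user/cloudera/dezyre/war_and_peace /home/cloudera/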
18) touchz
Used to create an empty file at the specified location.
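Example (the path is illustrative) - $ hadoop fs -touchz /user/cloudera/dezyre/empty_file.txt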
19) setrep
This hadoop fs command is used to set the replication for a specific file.
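Example (sets the replication factor of the file to 3; the path is illustrative) - $ hadoop fs -setrep 3 /user/cloudera/dezyre/war_and_peace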
21) chown
This command lets you change both the owner and group name
simultaneously.
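Example (the owner and group are illustrative) - $ hadoop fs -chown cloudera:supergroup /user/cloudera/dezyre/Sample.txt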