
Fundamentals of Apache Sqoop

What is Sqoop?
Apache Sqoop is a tool designed for efficiently transferring bulk data between
Apache Hadoop and external datastores such as relational databases and
enterprise data warehouses.

Sqoop is used to import data from external datastores into the Hadoop Distributed
File System (HDFS) or related Hadoop ecosystems like Hive and HBase. Similarly,
Sqoop can be used to extract data from Hadoop or its ecosystems and export it to
external datastores such as relational databases and enterprise data warehouses.
Sqoop works with relational databases such as Teradata, Netezza, Oracle, MySQL,
Postgres, etc.

Why is Sqoop used?


For Hadoop developers, the interesting work starts after data is loaded into
HDFS. They explore that data to find the insights concealed in it. To do this,
data residing in relational database management systems must first be
transferred to HDFS, and the results of the analysis may need to be transferred
back to those relational systems afterwards. In the Big Data world, developers
find this transfer of data between relational database systems and HDFS
uninteresting and tedious, yet it is required all too often. Developers can
always write custom scripts to move data in and out of Hadoop, but Apache Sqoop
provides a better alternative.

Sqoop automates most of this process, relying on the database to describe the
schema of the data to be imported. Sqoop uses the MapReduce framework to
import and export the data, which provides parallelism as well as
fault tolerance. Sqoop makes a developer's life easy by providing a command line
interface: developers just need to supply basic information such as the source,
the destination, and the database authentication details in the sqoop command,
and Sqoop takes care of the rest.
Sqoop provides many salient features like:

1. Full Load

2. Incremental Load

3. Parallel import/export

4. Import results of SQL query

5. Compression

6. Connectors for all major RDBMS Databases

7. Kerberos Security Integration

8. Load data directly into Hive/Hbase

9. Support for Accumulo

Sqoop is robust, and has great community support and contributions. It is
widely used in Big Data companies to transfer data between relational databases
and Hadoop.

Where is Sqoop used?


Relational database systems are widely used by traditional business
applications, so they have become one of the sources that generate Big Data.

Hadoop stores and processes Big Data using processing frameworks like
MapReduce, Hive, HBase, Cassandra and Pig, and storage frameworks like HDFS, to
get the benefit of distributed computing and distributed storage. To store and
analyze Big Data that originates in relational databases, data needs to be
transferred between those database systems and the Hadoop Distributed File
System (HDFS). This is where Sqoop comes into the picture: it acts as an
intermediate layer between Hadoop and relational database systems, letting you
import and export data between relational databases and Hadoop and its
ecosystems directly.

Sqoop Architecture

Sqoop Architecture
Sqoop provides a command line interface to end users and can also be
accessed through Java APIs. The command submitted by the user is parsed by
Sqoop, which then launches a map-only Hadoop job to import or export the data;
a reduce phase is required only when aggregations are needed, and Sqoop just
imports and exports data without performing any aggregation.

Sqoop parses the arguments provided on the command line and prepares the
map job. The map job launches multiple mappers, depending on the number defined
by the user on the command line. For a Sqoop import, each mapper task is
assigned a part of the data to be imported, based on the split key defined on
the command line; Sqoop distributes the input data equally among the mappers to
get high performance. Each mapper then creates a connection to the database
using JDBC, fetches the part of the data assigned to it, and writes it into
HDFS, Hive or HBase based on the options provided on the command line.

Basic Commands and Syntax for Sqoop


Sqoop-Import
The Sqoop import command imports a table from an RDBMS to HDFS. Each row
of the table is stored as a separate record in HDFS. Records can
be stored as text files, or in binary representation as Avro or SequenceFiles.

Generic Syntax:

$ sqoop import (generic args) (import args)

$ sqoop-import (generic args) (import args)

The Hadoop-specific generic arguments must precede any import arguments,
and the import arguments can be given in any order.
Importing a Table into HDFS
Syntax:

$ sqoop import --connect --table --username --password --target-dir

--connect       Takes a JDBC URL and connects to the database
--table         Source table name to be imported
--username      Username to connect to the database
--password      Password of the connecting user
--target-dir    Imports data into the specified HDFS directory
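
For example, a full table import might look like the following (the JDBC URL, table name, credentials and paths here are illustrative, not taken from this tutorial's environment):

$ sqoop import --connect jdbc:mysql://dbhost.example.com/sales --table customers \
  --username sqoop_user --password-file /user/sqoop/.db_password \
  --target-dir /user/cloudera/customers -m 4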

Importing Selected Data from Table


Syntax:

$ sqoop import --connect --table --username --password --columns --where

--columns       Selects subset of columns


--where           Retrieves the data which satisfies the condition
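
A hypothetical example that imports only a subset of columns for rows matching a condition:

$ sqoop import --connect jdbc:mysql://dbhost.example.com/sales --table customers \
  --username sqoop_user -P --columns "id,name,city" --where "city = 'Mumbai'" \
  --target-dir /user/cloudera/customers_mumbai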

Importing Data from Query


Syntax:

$ sqoop import --connect --username --password --query

--query           Executes the SQL query provided and imports the results (used in place of --table)
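
With a free-form query, Sqoop expects a $CONDITIONS placeholder in the WHERE clause (Sqoop substitutes the split condition for each mapper there) along with a --target-dir. An illustrative example:

$ sqoop import --connect jdbc:mysql://dbhost.example.com/sales --username sqoop_user -P \
  --query 'SELECT c.id, c.name, o.amount FROM customers c JOIN orders o ON (c.id = o.customer_id) WHERE $CONDITIONS' \
  --split-by c.id --target-dir /user/cloudera/customer_orders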

Incremental Imports
Syntax:

$ sqoop import --connect --table --username --password --incremental

--check-column --last-value

Sqoop import supports two types of incremental imports: 

1. Append 
2. Lastmodified.

Append mode is used when new rows are continually being added to the table
with increasing values in some column; that column is specified with
--check-column, and Sqoop imports rows whose value is greater than the one
specified with --last-value. Lastmodified mode is used when records of the
table might be updated, and each such update sets the current timestamp in a
last-modified column. Records whose check column timestamp is more recent than
the timestamp specified with --last-value are imported.
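
For example, an append-mode incremental import on an auto-increment id column might look like this (the values are illustrative):

$ sqoop import --connect jdbc:mysql://dbhost.example.com/sales --table orders \
  --username sqoop_user -P --target-dir /user/cloudera/orders \
  --incremental append --check-column id --last-value 10000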

Notes:

1. In the JDBC connection string, the database host should not be given as "localhost",
because Sqoop launches mappers on multiple data nodes and those mappers would not be
able to connect to the database host. Use a hostname or IP address that is reachable
from every node.

2. The --password parameter is insecure, as anyone can read it from the command line.
The -P option can be used instead, which prompts for the password on the console.
Otherwise, it is recommended to use --password-file pointing to the file containing the
password (make sure you have revoked permission for unauthorized users on that file).

A few arguments helpful with Sqoop import:


Argument Description

--num-mappers, -m Number of mappers to launch

--fields-terminated-by Field separator

--lines-terminated-by End-of-line separator

Importing Data into Hive


The Hive arguments mentioned below are used with the sqoop import command to
load data directly into Hive:
Argument Description

--hive-home Override $HIVE_HOME path

--hive-import Import tables into Hive

--hive-overwrite Overwrites existing Hive table data

--create-hive-table Creates Hive table and fails if that table already exists

--hive-table Sets the Hive table name to import

--hive-drop-import-delims Drops delimiters like \n, \r, and \01 from string fields

--hive-delims-replacement Replaces delimiters like \n, \r, and \01 from string fields with user defined delimiters

--hive-partition-key Sets the Hive partition key

--hive-partition-value Sets the Hive partition value

--map-column-hive Overrides default mapping from SQL type datatypes to Hive datatypes

Syntax:

$ sqoop import --connect --table --username --password --hive-import

--hive-table

When --hive-import is specified, Sqoop imports data into a Hive table rather than
an HDFS directory.
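
An illustrative command that loads a table straight into a Hive table (connection details and table names are assumptions):

$ sqoop import --connect jdbc:mysql://dbhost.example.com/sales --table customers \
  --username sqoop_user -P --hive-import --hive-table sales.customers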

Importing Data into HBase


The HBase arguments mentioned below are used with the sqoop import command to
load data directly into HBase:
Argument Description

--column-family Sets column family for the import

--hbase-create-table If specified, creates the target HBase table if it does not already exist

--hbase-row-key Specifies which column to use as the row key

--hbase-table Imports to Hbase table

Syntax:

$ sqoop import --connect --table --username --password --hbase-table

When --hbase-table is specified, Sqoop imports data into the given HBase table
rather than an HDFS directory.
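
A hypothetical HBase import, using the id column as the row key:

$ sqoop import --connect jdbc:mysql://dbhost.example.com/sales --table customers \
  --username sqoop_user -P --hbase-table customers --column-family info --hbase-row-key id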

Sqoop-Import-all-Tables
The import-all-tables tool imports all tables in an RDBMS database into HDFS. Data
from each table is stored in a separate directory in HDFS. The following
conditions must be met in order to use sqoop-import-all-tables:

1. Each table should have a single-column primary key.

2. You should import all columns of each table.

3. You should not use a splitting column, and should not filter rows with a
WHERE clause.

Generic Syntax:

$ sqoop import-all-tables (generic args) (import args)

$ sqoop-import-all-tables (generic args) (import args)

The tool-specific arguments are similar to those of the sqoop-import tool, but options
like --table, --split-by, --columns, and --where are not valid here.

Syntax:

$ sqoop-import-all-tables --connect --username --password
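
An illustrative invocation (--warehouse-dir sets the parent HDFS directory under which each table's directory is created):

$ sqoop import-all-tables --connect jdbc:mysql://dbhost.example.com/sales \
  --username sqoop_user -P --warehouse-dir /user/cloudera/sales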


Sqoop-Export
The Sqoop export command exports a set of files in an HDFS directory back to an
RDBMS table. The target table must already exist in the database.

Generic Syntax:

$ sqoop export (generic args) (export args)

$ sqoop-export (generic args) (export args)

The Sqoop export command prepares INSERT statements from the input data and runs
them against the database. It is meant for exporting new records; if the table has a
unique constraint (such as a primary key) and a record already exists, the export job
fails because the INSERT statement fails. If you have updates, you can use the
--update-key option, in which case Sqoop prepares UPDATE statements that modify the
existing rows instead of INSERT statements.

Syntax:

$ sqoop-export --connect --username --password --table --export-dir
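
For example, an export that also updates existing rows based on the id column (names and paths are illustrative):

$ sqoop export --connect jdbc:mysql://dbhost.example.com/sales --username sqoop_user -P \
  --table customer_summary --export-dir /user/cloudera/customer_summary \
  --update-key id --update-mode allowinsert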

Sqoop-Job
The Sqoop job command allows us to create saved jobs. A saved job remembers the
parameters used to create it, so it can be invoked at any time with the same
arguments.

Generic Syntax:

$ sqoop job (generic args) (job args) [-- [subtool name] (subtool args)]

$ sqoop-job (generic args) (job args) [-- [subtool name] (subtool args)]

Saved jobs are especially useful for incremental imports. The last imported value
is stored in the job configuration, so the next execution picks it up from there
automatically and imports only the new data.
Sqoop-job options:
Argument Description

--create Defines a new job with the specified job id (name). The actual sqoop import command should be separated by "--"

--delete Deletes a saved job

--exec Executes a saved job

--show Shows the configuration of a saved job

--list Lists all the saved jobs

Syntax:

$ sqoop job --create -- import --connect --table
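
A hypothetical saved job for an incremental import, and how to list and run it:

$ sqoop job --create daily_orders -- import --connect jdbc:mysql://dbhost.example.com/sales \
  --table orders --username sqoop_user -P --target-dir /user/cloudera/orders \
  --incremental append --check-column id --last-value 0

$ sqoop job --list

$ sqoop job --exec daily_orders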

Sqoop-Codegen
The sqoop-codegen command generates Java classes which encapsulate and
interpret imported records. The Java definition of a record is created as part
of the import process, but codegen can also be run on its own: for example, if the
Java source is lost it can be recreated, or new versions of a class can be
generated which use different delimiters between fields, and so on.

Generic Syntax:

$ sqoop codegen (generic args) (codegen args)

$ sqoop-codegen (generic args) (codegen args)

Syntax:

$ sqoop codegen --connect --table

Sqoop-Eval
The sqoop-eval command allows users to quickly run simple SQL queries against
a database; the results are printed to the console.

Generic Syntax:

$ sqoop eval (generic args) (eval args)

$ sqoop-eval (generic args) (eval args)

Syntax:

$ sqoop eval --connect --query "SQL query"

Using this, users can be sure that they are importing the data as expected.
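
For example (connection details illustrative):

$ sqoop eval --connect jdbc:mysql://dbhost.example.com/sales --username sqoop_user -P \
  --query "SELECT COUNT(*) FROM customers"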

Sqoop-List-Databases
Used to list all the databases available on the RDBMS server.

Generic Syntax:

$ sqoop list-databases (generic args) (list databases args)

$ sqoop-list-databases (generic args) (list databases args)

Syntax:

$ sqoop list-databases --connect

Sqoop-List-Tables
Used to list all the tables in a specified database.

Generic Syntax:

$ sqoop list-tables (generic args) (list tables args)

$ sqoop-list-tables (generic args) (list tables args)

Syntax:

$ sqoop list-tables --connect
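
Illustrative examples:

$ sqoop list-databases --connect jdbc:mysql://dbhost.example.com/ --username sqoop_user -P

$ sqoop list-tables --connect jdbc:mysql://dbhost.example.com/sales --username sqoop_user -P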


What will you learn from this hive tutorial?
This Hadoop Hive tutorial shows how to use various Hive commands in HQL
to perform operations like creating a table in Hive, deleting a table in
Hive, altering a table in Hive, etc.

Pre-requisites to follow this Hive Tutorial


 Hive Installation must be completed successfully.
 Basic knowledge of SQL is required to follow this hadoop hive tutorial.

Learn the Basics of Hive Hadoop


Hive makes data processing on Hadoop easier by providing a database query
interface to Hadoop. Hive is a friendlier data warehouse tool for users from
an ETL or database background who are accustomed to using SQL for querying
data.


Commonly Used Hive Commands


DDL Commands in Hive
SQL users might already be familiar with what DDL commands are but for readers who
are new to SQL, DDL refers to Data Definition Language. DDL commands are the
statements that are responsible for defining and changing the structure of a database or
table in Hive.

CREATE Database,Table

DROP Database,Table

TRUNCATE Table

ALTER  Database,Table

SHOW Databases,Tables,Table Properties,Partitions,Functions,Index

DESCRIBE Database, Table ,View


Let’s look at the usage of the top hive commands in HQL on both databases
and tables –

DDL Commands on Databases in Hive


Create Database in Hive
As the name implies, this DDL command in Hive is used for creating
databases.

CREATE (DATABASE) [IF NOT EXISTS] database_name

  [COMMENT database_comment]

  [LOCATION hdfs_path]

  [WITH DBPROPERTIES (property_name=property_value, ...)];

In the above syntax for create database command, the values mentioned in
square brackets [] are optional.

Usage of Create Database Command in Hive


hive> create database if not exists firstDB comment "This is my first demo" location
'/user/hive/warehouse/newdb' with DBPROPERTIES ('createdby'='abhay','createdfor'='dezyre');
OK
Time taken: 0.092 seconds

Drop Database in Hive


This command is used for deleting an already created database in Hive and
the syntax is as follows -

DROP (DATABASE) [IF EXISTS] database_name [RESTRICT|CASCADE];

Usage of Drop Database Command in Hive


hive> drop database if exists firstDB CASCADE;
OK
Time taken: 0.099 seconds

In Hadoop Hive, the mode is set to RESTRICT by default, and users cannot
delete a database unless it is empty. For deleting a database in Hive along with its
existing tables, users must change the mode from RESTRICT to CASCADE.
In the syntax for the drop database Hive command, the "if exists" clause is used to
avoid any errors that might occur if the programmer tries to delete a database
which does not exist.

Describe Database Command in Hive


This command is used to check any associated metadata for the databases.
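
For example, for the firstDB database created above:

hive> describe database extended firstDB;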

Alter Database Command in Hive


Whenever the developers need to change the metadata of any of the
databases, alter hive DDL command can be used as follows –

ALTER (DATABASE) database_name SET DBPROPERTIES

(property_name=property_value, ...);

Usage of ALTER database command in Hive –


Let’s use the Alter command to modify the OWNER property and specify the
role for the owner –

ALTER (DATABASE) database_name SET OWNER [USER|ROLE] user_or_role;
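
For example (the user name here is illustrative):

hive> ALTER DATABASE firstDB SET OWNER USER abhay;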

Show Database Command in Hive


Programmers can view the list of existing databases in the current schema.

Usage of Show Database Command


Show databases;
Use Database Command in Hive
This hive command is used to select a specific database for the session on
which hive queries would be executed.

Usage of Use Database Command in Hive
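
For example, to switch to the college database used later in this tutorial:

hive> use college;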

DDL Commands on Tables in Hive


Create Table Command in Hive
Hive create table command is used to create a table in the existing database
that is in use for a particular session.

CREATE TABLE [IF NOT EXISTS] [db_name.]table_name

  [(col_name data_type [COMMENT col_comment], ...)]

  [COMMENT table_comment]

   [LOCATION hdfs_path]

Hive Create Table Usage
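
An abbreviated, illustrative sketch of such a statement (the tutorial's actual table, with the full column list, is created in the LOAD DATA section below; if you are following along, use that complete version):

hive> CREATE TABLE IF NOT EXISTS college.students (
> ID BIGINT COMMENT 'unique id for each student',
> name STRING COMMENT 'student name',
> fee DOUBLE COMMENT 'student college fee',
> city STRING COMMENT 'cities to which students belongs')
> COMMENT 'This table holds the demography info for each student'
> LOCATION '/user/hive/warehouse/college.db/students';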


In the above step, we have created a Hive table named students in the college
database with various fields like ID, name, fee, city, etc. Comments have been
added for each column so that anybody referring to the table gets an overview of
what the columns mean.

The LOCATION keyword is used for specifying where the table should be
stored on HDFS.

How to create a table in hive by copying an existing table schema?


Hive lets programmers create a new table by replicating the schema of an
existing table; remember that only the schema of the new table is replicated,
not the data. When creating the new table, the location parameter can also be
specified.

CREATE TABLE [IF NOT EXISTS] [db_name.]table_name LIKE
[db_name.]existing_table [LOCATION hdfs_path]
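
For example:

hive> CREATE TABLE IF NOT EXISTS college.students_backup LIKE college.students;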

DROP Table Command in Hive


Drops the table and all the data associated with it in the Hive metastore.

DROP TABLE [IF EXISTS] table_name [PURGE];

Usage of DROP Table command in Hive

The DROP TABLE command removes the metadata and data for a particular table.
Data is usually moved to the .Trash/Current directory if Trash is configured. If the
PURGE option is specified, then the table data will not go to the trash directory,
and there will be no way to retrieve the data in case of an erroneous DROP
command execution.
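
For example:

hive> DROP TABLE IF EXISTS college.students_backup PURGE;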

TRUNCATE Table Command in Hive


This Hive command is used to truncate all the rows present in a table, i.e. it
deletes all the data from the table, and the data cannot be restored.

TRUNCATE TABLE [db_name].table_name

Usage of TRUNCATE Table in Hive
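
For example, on an illustrative table:

hive> TRUNCATE TABLE college.temp_students;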

ALTER Table Command in Hive


Using ALTER Table command, the structure and metadata of the table can be
modified even after the table has been created. Let’s try to change the name
of an existing table using the ALTER command –

ALTER TABLE [db_name].old_table_name RENAME TO

[db_name].new_table_name;

Syntax to ALTER Table Properties

ALTER TABLE [db_name.]tablename SET TBLPROPERTIES ('property_key'='property_new_value');

In the above step, we have set the creator attribute for the table; similarly,
we can later alter or modify other table properties as well.
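
For example (the property value is illustrative):

hive> ALTER TABLE college.college_students SET TBLPROPERTIES ('creator'='abhay');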
DESCRIBE Table Command in Hive
Gives the information of a particular table and the syntax is as follows –

DESCRIBE [EXTENDED|FORMATTED] [db_name.]table_name[.col_name ( [.field_name] )]

Usage of Describe Table Command

hive> describe extended college.college_students;


OK
id       bigint    unique id for each student
name     string    student name
age      int       student age between 16-26
fee      double    student college fee
city     string    cities to which students belongs
state    string    student home address state s
zip      bigint    student address zip code

Detailed Table Information Table(tableName:college_students,


dbName:college, owner:abhay, createTime:1474107648, lastAccessTime:0,
retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id,
type:bigint, comment:unique id for each student), FieldSchema(name:name,
type:string, comment:student name), FieldSchema(name:age, type:int,
comment:sudent age between 16-26), FieldSchema(name:fee, type:double,
comment:student college fee), FieldSchema(name:city, type:string,
comment:cities to which students belongs), FieldSchema(name:state ,
type:string, comment:student home address state s),
FieldSchema(name:zip, type:bigint, comment:student address zip code)],
location:hdfs://Boss-
Machine:9000/user/hive/warehouse/college.db/college_students,
inputFormat:org.apache.hadoop.mapred.TextInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,
compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null,
serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe,
parameters:{serialization.format=1}), bucketCols:[], sortCols:[],
parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:
[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false),
partitionKeys:[], parameters:{creator=abhay,
last_modified_time=1474108578, totalSize=0, numFiles=0,
transient_lastDdlTime=1474108578, comment=changed the creater to abhay,
last_modified_by=abhay}, viewOriginalText:null, viewExpandedText:null,
tableType:MANAGED_TABLE)
Time taken: 0.291 seconds, Fetched: 9 row(s)

hive> describe formatted college.college_students;


OK
# col_name    data_type    comment

id       bigint    unique id for each student
name     string    student name
age      int       sudent age between 16-26
fee      double    student college fee
city     string    cities to which students belongs
state    string    student home address state s
zip      bigint    student address zip code

# Detailed Table Information


Database: college
Owner: abhay
CreateTime: Sat Sep 17 15:50:48 IST 2016
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://Boss-
Machine:9000/user/hive/warehouse/college.db/college_students
Table Type: MANAGED_TABLE
Table Parameters:
comment changed the creater to abhay
creator abhay
last_modified_by abhay
last_modified_time 1474108578
numFiles 0
totalSize 0
transient_lastDdlTime 1474108578

# Storage Information
SerDe Library:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
serialization.format 1
Time taken: 0.099 seconds, Fetched: 37 row(s)

We can also check the description for a specific column from the table as
follows –

Show Table Command in Hive


Gives the list of existing tables in the current database schema.
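
For example:

hive> show tables;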

DML Commands in Hive


DML (Data Manipulation Language) commands in Hive are used for inserting
and querying the data from hive tables once the structure and architecture of
the database has been defined using the DDL commands listed above.

Data can be loaded into Hive tables using –

 LOAD command

 Insert command

Usage of LOAD Command for Inserting Data Into Hive Tables


Syntax for Load Command in Hive

LOAD DATA [LOCAL] INPATH 'hdfsfilepath/localfilepath' [OVERWRITE] INTO TABLE

existing_table_name

Example
Let’s load a structured file that contains information about different students.
Let’s take a look at the data present in the file –

ID|name|age|fee|city|state |zip

1|Kendall|22|25874|Kulti-Barakar|WB|451333

2|Mikayla|25|35367|Jalgaon|Maharastra|710179

3|Raven|20|49103|Rewa|Madhya Pradesh|392423

4|Carla|19|27121|Pilibhit|UP|769853

5|Edward|21|32053|Tuticorin|Tamil Nadu|368262

6|Wynter|21|43956|Surendranagar|GJ|457441

7|Patrick|19|19050|Mumbai|MH|580220

8|Hayfa|18|15590|Amroha|UP|470705

9|Raven|16|37836|Cuddalore|TN|787001

The file is a ‘|’ delimited file where each row  can be inserted as a table
record.

First let’s create a table student based on the contents in the file –

 The ROW FORMAT DELIMITED clause must appear before any of the other
clauses, with the exception of the STORED AS … clause.
 The clause ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' means the |
character will be used as the field separator by Hive.
 The clause LINES TERMINATED BY '\n' means that the line delimiter will
be a new line.
 The clauses LINES TERMINATED BY '\n' and STORED AS … do not
require the ROW FORMAT DELIMITED keywords.

hive> CREATE TABLE IF NOT EXISTS college.students (

> ID BIGINT COMMENT 'unique id for each student',

> name STRING COMMENT 'student name',

> age INT COMMENT 'sudent age between 16-26',

> fee DOUBLE COMMENT 'student college fee',


> city STRING COMMENT 'cities to which students belongs',

> state STRING COMMENT 'student home address state s',

> zip BIGINT COMMENT 'student address zip code'

> )

> COMMENT 'This table holds the demography info for each student'

> ROW FORMAT DELIMITED

> FIELDS TERMINATED BY '|'

> LINES TERMINATED BY '\n'

> STORED AS TEXTFILE

> LOCATION '/user/hive/warehouse/college.db/students';

OK

Time taken: 0.112 seconds

Let’s  load the file into the student table –
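
A hedged example, assuming the data file has been saved locally as /home/cloudera/students.txt:

hive> LOAD DATA LOCAL INPATH '/home/cloudera/students.txt' OVERWRITE INTO TABLE college.students;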

If the keyword LOCAL is specified, Hive looks for the file on the local
filesystem. If LOCAL is not specified, Hive assumes the path refers to HDFS:

 It will try to locate the file in HDFS.

 If the path is not absolute, Hive will look for it under the user's home
directory (/user/<username>) in HDFS.

Using the OVERWRITE keyword means the contents of the target table are deleted
and replaced by the files referred to by the file path; without OVERWRITE, the
files referred to by the file path are simply added to the table, appending to the
existing data.

Let’s check if the data has been inserted into the table –
hive> select * from students;

596 Stephen 25 16573.0 Gaya BR 874761

597 Colby 25 19929.0 New Bombay Maharastra 868698

598 Drake 21 49260.0 Nagaon Assam 157775

599 Tanek 18 12535.0 Gurgaon Haryana 201260

600 Hedda 23 43896.0 Ajmer RJ 697025

Time taken: 0.132 seconds, Fetched: 601 row(s)


Check the first 5 records:

Now, let's try to retrieve only 5 records using the limit option -

hive> select * from students limit 5;

OK

NULL name NULL NULL city state NULL

1 Kendall 22 25874.0 Kulti-Barakar WB 451333

2 Mikayla 25 35367.0 Jalgaon Maharastra 710179

3 Raven 20 49103.0 Rewa Madhya Pradesh 392423

4 Carla 19 27121.0 Pilibhit UP 769853

Time taken: 0.144 seconds, Fetched:

Let's count the total number of records in the table -
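
The query would look like this (output not shown here):

hive> select count(*) from students;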


Let’s check the number of students from each state. (State  column holds
name for each state)-

select state, COUNT(*) from students group by state;
2016-09-17 17:50:08,645 Stage-1 map = 0%, reduce = 0%

2016-09-17 17:50:15,017 Stage-1 map = 100%, reduce = 0%, Cumulative CPU

1.14 sec

2016-09-17 17:50:23,598 Stage-1 map = 100%, reduce = 100%, Cumulative

CPU 2.52 sec

MapReduce Total cumulative CPU time: 2 seconds 520 msec

Ended Job = job_1474096442601_0004

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.52 sec HDFS Read:

34525 HDFS Write: 1228 SUCCESS

Total MapReduce CPU Time Spent: 2 seconds 520 msec

OK

AP 25

AR 1

AS 16

Andhra Pradesh 15

Arunachal Pradesh 1

Assam 11

BR 29

Bihar 20

CT 9

Chhattisgarh 6

DL 2

Daman and Diu 1

Delhi 2

GJ 13

Gujarat 14

HP 2

HR 6
Haryana 8

Himachal Pradesh 1

JH 11

JK 1

Jammu and Kashmir 6

Jharkhand 4

KA 12

KL 6

Karnataka 24

Kerala 7

MH 20

MN 1

MP 17

MZ 2

Madhya Pradesh 16

Maharastra 32

Meghalaya 1

Mizoram 1

NL 1

OR 7

Odisha 10

PB 8

Pondicherry 2

Punjab 8

RJ 16

Rajasthan 24

TN 17

Tamil Nadu 20

Tripura 1
UP 53

UT 4

Uttar Pradesh 37

Uttarakhand 1

WB 20

West Bengal 28

state 1

Time taken: 25.048 seconds, Fetched: 53 row(s)


What will you learn from this Hadoop MapReduce Tutorial?
This Hadoop tutorial aims to give Hadoop developers a great start in the world
of Hadoop MapReduce programming by giving them hands-on experience in
developing their first Hadoop-based WordCount application. The Hadoop
MapReduce WordCount example is the standard example with which Hadoop
developers begin their hands-on programming. This tutorial will help
Hadoop developers learn how to implement the WordCount example code in
MapReduce to count the number of occurrences of a given word in the input
file.
Pre-requisites to follow this Hadoop WordCount Example Tutorial
i. Hadoop Installation must be completed successfully.
ii. Single node hadoop cluster must be configured and running.

iii.  Eclipse must be installed as the MapReduce WordCount example will be run
from eclipse IDE.
Word Count - Hadoop Map Reduce Example –
How it works?
Hadoop WordCount operation occurs in 3 stages –

i. Mapper Phase

ii. Shuffle Phase

iii. Reducer Phase

Hadoop WordCount Example- Mapper Phase Execution


The text from the input file is tokenized into words to form a key-value pair
for every word present in the input file. The key is the word from the
input file and the value is '1'.

For instance, consider the sentence "An elephant is an animal". The
mapper phase in the WordCount example will split the string into individual
tokens, i.e. words. In this case, the entire sentence is split into 5 tokens
(one for each word), each with a value of 1, as shown below –

Key-Value pairs from Hadoop Map Phase Execution-


(an,1)

(elephant,1)

(is,1)

(an,1)

(animal,1)

Hadoop WordCount Example- Shuffle Phase Execution
After the map phase execution is completed successfully, shuffle phase is
executed automatically wherein the key-value pairs generated in the map
phase are taken as input and then sorted in alphabetical order. After the
shuffle phase is executed from the WordCount example code, the output will
look like this -

(an,1)

(an,1)

(animal,1)

(elephant,1)
(is,1)

Hadoop WordCount Example- Reducer Phase Execution

In the reduce phase, all the keys are grouped together and the values for
similar keys are added up to find the occurrences for a particular word. It is
like an aggregation phase for the keys generated by the map phase. The
reducer phase takes the output of shuffle phase as input and then reduces
the key-value pairs to unique keys with values added up. In our example
sentence "An elephant is an animal", "an" is the only word that appears twice.
After the execution of the reduce phase of the MapReduce WordCount example
program, "an" appears as a key only once, but with a count of 2, as shown below -

(an,2)

(animal,1)

(elephant,1)

(is,1)

This is how the MapReduce word count program executes and outputs the
number of occurrences of a word in any given input file. An important point to
note during the execution of the WordCount example is that the mapper class
in the WordCount program will execute completely on the entire input file and
not just a single sentence. If the input file has 15 lines, then the
mapper class will split the words of all 15 lines and form the initial key-value
pairs for the entire dataset. The reducer execution will begin only after the
mapper phase is executed successfully.

Running the WordCount Example in Hadoop
MapReduce using Java Project with Eclipse
Now, let's create the WordCount Java project with the Eclipse IDE for Hadoop.
Even if you are working on the Cloudera VM, the steps for creating the Java
project apply to any environment.

Step 1 –
Let’s create the java project with the name “Sample WordCount” as shown
below -

File > New > Project > Java Project > Next.

"Sample WordCount" as our project name and click "Finish":


Step 2 -
The next step is to get references to hadoop libraries by clicking on Add JARS
as follows –
Step 3 -
Create a new package within the project with the name com.code.dezyre.
Step 4 –
Now let’s implement the WordCount example program by creating a
WordCount class under the project com.code.dezyre.
Step 5 -
Create a Mapper class within the WordCount class which extends the
MapReduceBase class and implements the Mapper interface. The mapper class will
contain -

               1. Code to implement the "map" method.

               2. Code for the mapper-stage business logic, which should
be written within this method.
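
The snippets below use the classic org.apache.hadoop.mapred API. A typical set of imports for the WordCount class (not shown in the original listing, so treat it as an assumption) would look something like this:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;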
Mapper Class Code for WordCount Example in Hadoop MapReduce

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {
        // Tokenize the input line and emit a (word, 1) pair for every token
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

In the mapper class code, we have used the StringTokenizer class, which
takes the entire line and breaks it into small tokens (strings/words).

Step 6 –
Create a Reducer class within the WordCount class extending the
MapReduceBase class and implementing the Reducer interface. The reducer class for
the WordCount example in Hadoop will contain -

               1. Code to implement the "reduce" method.

               2. Code for the reducer-stage business logic, which should
be written within this method.

Reducer Class Code for WordCount Example in Hadoop MapReduce

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // Sum the counts emitted by the mappers for this word
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

Step 7 –
Create main() method within the WordCount class and set the following
properties using the JobConf class -

i. OutputKeyClass

ii. OutputValueClass

iii. Mapper Class

iv.  Reducer Class

v. InputFormat

vi. OutputFormat
vii. InputFilePath

viii.  OutputFolderPath

public static void main(String[] args) throws Exception {

    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("WordCount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    // conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // Input file path and output directory are taken from the command line arguments
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
}



Step 8 –
Create the JAR file for the wordcount class –
How to execute the Hadoop MapReduce
WordCount program ?
>> hadoop jar (jar file name) (className_along_with_packageName) (input file) (output folder path)

hadoop jar dezyre_wordcount.jar com.code.dezyre.WordCount

/user/cloudera/Input/war_and_peace /user/cloudera/Output
Important Note: the war_and_peace input file must be available in HDFS
at /user/cloudera/Input/war_and_peace.
If not, upload the file to HDFS using the following commands -

hadoop fs -mkdir /user/cloudera/Input

hadoop fs -put war_and_peace /user/cloudera/Input/war_and_peace

Output of Executing Hadoop WordCount Example –


The program is run with the war_and_peace input file.


What will you learn from this Hadoop Commands tutorial?
This Hadoop tutorial will give you a list of commonly used hadoop
fs commands that can be used to manage files on a Hadoop cluster. These
HDFS commands can be run on a pseudo-distributed cluster or from
any of the vendor VMs like Hortonworks, Cloudera, etc.
Pre-requisites to follow this Hadoop tutorial
 Hadoop must be installed.
 Hadoop Cluster must be configured.

1) help  HDFS Shell Command


Syntax of help hdfs Command

$ hadoop fs -help

Help hdfs shell command helps hadoop developers figure out all the available
hadoop commands and how to use them.
Variations of the Hadoop fs Help Command
$ hadoop fs -help ls
Using the help command with a specific command lists the usage information
along with the options to use the command.


2) Usage HDFS Shell Command

$ hadoop fs -usage ls

Usage command gives all the options that can be used with a particular hdfs
command.

3) ls HDFS Shell Command


Syntax for ls Hadoop Command -
$ hadoop fs -ls

This command will list all the available files and subdirectories under the default
directory. For instance, in our example the default directory for the Cloudera VM is
/user/cloudera

Variations of Hadoop ls Shell Command


$ hadoop fs -ls /
Returns all the available files and subdirectories present under the root
directory.

$ hadoop fs -ls -R /user/cloudera


Returns all the available files and recursively lists all the subdirectories
under /user/Cloudera
 

4) mkdir- Used to create a new directory in HDFS at a given location.


Example of HDFS mkdir Command -
$ hadoop fs -mkdir /user/cloudera/dezyre1
The above command will create a new directory named dezyre1 under the
location /user/cloudera

Note : Cloudera and other hadoop distribution vendors provide /user/ directory


with read/write permission to all users, but other directories are available as
read-only. Thus, to create a folder in the root directory, users require
superuser permission as shown below -
$ sudo -u hdfs hadoop fs -mkdir /dezyre
This command will create a new directory named dezyre under the / (root
directory).

5) copyFromLocal
Copies a file from the local filesystem to an HDFS location.

For the following examples, we will use Sample.txt file available in the
/home/Cloudera location.

Example - $ hadoop fs -copyFromLocal Sample1.txt /user/cloudera/dezyre1


Copy/Upload Sample1.txt available in /home/cloudera (local default) to
/user/cloudera/dezyre1 (hdfs path)

6) put –
This hadoop command uploads a single file or multiple source files from local
file system to hadoop distributed file system (HDFS).

Ex - $ hadoop fs -put Sample2.txt /user/cloudera/dezyre1


Copy/Upload Sample2.txt available in /home/cloudera (local default) to
/user/cloudera/dezyre1 (hdfs path)
7) moveFromLocal 
This hadoop command functions similar to the put command but the source
file will be deleted after copying.

Example - $ hadoop fs -moveFromLocal Sample3.txt /user/cloudera/dezyre1


Move Sample3.txt available in /home/cloudera (local default) to
/user/cloudera/dezyre1 (hdfs path). Source file will be deleted after moving.

8) du
Displays the disk usage for all the files available under a given directory.

Example - $ hadoop fs -du /user/cloudera/dezyre1

9) df
Displays the capacity, used space and free space of the Hadoop distributed file system.
Example - $ hadoop fs -df
10) Expunge
This HDFS command empties the trash by deleting all the files and
directories.

Example - $ hadoop fs -expunge

11) Cat
This is similar to the cat command in Unix and displays the contents of a file.

Example - $ hadoop fs -cat /user/cloudera/dezyre1/Sample1.txt


 

12) cp 
Copy files from one HDFS location to another HDFS location.
Example – $ hadoop fs -cp /user/cloudera/dezyre/war_and_peace
/user/cloudera/dezyre1/

13) mv 
Move files from one HDFS location to another HDFS location.

Example – $ hadoop fs -mv /user/cloudera/dezyre1/Sample1.txt


/user/cloudera/dezyre/

 
 

14) rm
Removes a file or directory from the specified HDFS location.

Example – $ hadoop fs -rm -r /user/cloudera/dezyre3

Using rm with the -r option deletes the directory and its content from the HDFS
location in a recursive manner.

Example – $ hadoop fs -rm /user/cloudera/dezyre3

Deletes or removes files from the HDFS location.

15) tail 
This hadoop command will show the last kilobyte of the file to stdout.

Example – $ hadoop fs -tail /user/cloudera/dezyre/war_and_peace


Example – $ hadoop fs -tail -f /user/cloudera/dezyre/war_and_peace
Using the tail command with the -f option follows the file, continuing to display
data as it is appended to the end of the file.
 

16) copyToLocal
Copies files to the local filesystem. This is similar to the hadoop fs -get
command, but in this case the destination location must be a local file
reference.

Example - $ hadoop fs -copyToLocal /user/cloudera/dezyre1/Sample1.txt /home/cloudera/hdfs_bkp/

Copy/Download Sample1.txt available in /user/cloudera/dezyre1 (hdfs path) to
/home/cloudera/hdfs_bkp/ (local path)
17) get
Downloads or Copies the files to the local filesystem.

Example - $ hadoop fs -get /user/cloudera/dezyre1/Sample2.txt


/home/cloudera/hdfs_bkp/
Copy/Download Sample2.txt available in /user/cloudera/dezyre1 (hdfs path) to
/home/cloudera/hdfs_bkp/ (local path)

18) touchz
Used to create an empty file at the specified location.

Example - $ hadoop fs -touchz /user/cloudera/dezyre1/Sample4.txt


It will create a new empty file Sample4.txt in /user/cloudera/dezyre1/ (hdfs
path)

19) setrep
This hadoop fs command is used to set the replication for a specific file.

Example - $ hadoop fs -setrep -w 1 /user/cloudera/dezyre1/Sample1.txt


It will set the replication factor of Sample1.txt to 1
20) chgrp
This hadoop command is basically used to change the group name.

Example - $ sudo -u hdfs hadoop fs -chgrp -R cloudera /dezyre


It will change the /dezyre directory group membership from supergroup to
cloudera (To perform this operation superuser permission is required)

21) chown
This command lets you change both the owner and the group name
simultaneously.

Example - $ sudo -u hdfs hadoop fs -chown -R cloudera /dezyre


It will change the /dezyre directory ownership from the hdfs user to the cloudera
user (To perform this operation superuser permission is required)

22) hadoop chmod


Used to change the permissions of a given file or directory.

Example - $ hadoop fs -chmod 700 /dezyre


It will change the /dezyre directory permission to 700 (drwx------).

Note : hadoop fs -chmod 777


To execute this, the user must be the owner of the file or must be a
superuser. On executing this command, all users will get read, write and execute
permission on the file.
