Hadoop Related Tools
Syllabus
Hbase - data model and implementations - Hbase clients - Hbase examples - praxis. Pig - Grunt - Pig data model - Pig Latin - developing and testing Pig Latin scripts. Hive - data types and file
formats - HiveQL data definition - HiveQL data manipulation - HiveQL queries.
Contents
5.1 Hbase
5.2 Data Model and Implementations
5.3 Hbase Clients
5.4 Praxis
5.5 Pig
5.6 Hive
5.7 HiveQL Data Definition
5.8 HiveQL Data Manipulation
5.9 HiveQL Queries
5.10 Two Marks Questions with Answers

5.1 Hbase
• HBase is an open source, non-relational, distributed database modeled after Google's BigTable. HBase is an open source, sorted map data store built on Hadoop. It is column oriented and horizontally scalable.
• It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop file system. It runs on top of Hadoop and HDFS, providing BigTable-like capabilities for Hadoop.
• HBase supports massively parallelized processing via MapReduce, for using HBase as both source and sink.
• HBase supports an easy-to-use Java API for programmatic access. It also supports Thrift and REST for non-Java front-ends.
• HBase is a column oriented distributed database in the Hadoop environment. It can store massive amounts of data, from terabytes to petabytes. HBase is scalable, distributed big data storage on top of the Hadoop ecosystem.
• The HBase physical architecture consists of servers in a master-slave relationship. Typically, the HBase cluster has one master node, called HMaster, and multiple Region Servers called HRegionServer. Fig. 5.1.1 shows the HBase architecture.
Fig. 5.1.1 Hbase architecture (client, HMaster, region servers hosting regions, ZooKeeper and HDFS)
• Zookeeper is a centralized monitoring server which maintains configuration information and provides distributed synchronization. If a client wants to communicate with region servers, the client has to approach ZooKeeper first.
• HMaster is the master server of HBase and it coordinates the HBase cluster. HMaster is responsible for the administrative operations of the cluster.
• The Region Servers, in communication with HMaster and ZooKeeper, perform the following functions :
1. Hosting and managing regions
2. Splitting regions automatically
3. Handling read and write requests
4. Communicating with clients directly
• HRegions : For each column family, HRegions maintain a store. The main components of HRegions are the MemStore and the HFile.
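Because clients locate regions through ZooKeeper, a client program only needs the address of the ZooKeeper quorum to reach the cluster. The following is a minimal sketch using the same old-style HTable/Get client API used later in this chapter; the host name, table name and column names are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseConnect {
    public static void main(String[] args) throws Exception {
        // The client only needs the ZooKeeper quorum; region locations are
        // discovered through ZooKeeper and the catalog tables.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zkhost.example.com");      // hypothetical host
        conf.set("hbase.zookeeper.property.clientPort", "2181");

        // Reads and writes are routed to the region server currently hosting the region.
        HTable table = new HTable(conf, "test");                       // assumed table
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        byte[] value = result.getValue(Bytes.toBytes("data"), Bytes.toBytes("FN"));
        System.out.println(value == null ? "not found" : Bytes.toString(value));
        table.close();
    }
}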
• The data model in HBase is designed to accommodate semi-structured data that could vary in field size, data type and columns.
• HBase is a column-oriented, non-relational database. This means that data is stored in individual columns and indexed by a unique row key. This architecture allows for rapid retrieval of individual rows and columns and efficient scans over individual columns within a table.
• Both data and requests are distributed across all servers in an HBase cluster, allowing users to query results on petabytes of data within milliseconds. HBase is most effectively used to store non-relational data, accessed via the HBase API.
Features and Applications of Hbase
Features of Hbase :
1. Hbase is linearly scalable.
2. It has automatic failure support.
3. It provides consistent reads and writes.
4. It integrates with Hadoop, both as a source and a destination.
5. It has an easy Java API for the client.
6. It provides data replication across clusters.
Where to use Hbase ?
1. Apache Hbase is used to have random, real-time read/write access to Big Data.
2. It hosts very large tables on top of clusters of commodity hardware.
3. Apache Hbase is a non-relational database modeled after Google's Bigtable. Bigtable acts upon Google File System; likewise, Apache HBase works on top of Hadoop and HDFS.
Applications of Hbase :
1. It is used whenever there is a need to write write-heavy applications.
2. Hbase is used whenever we need to provide fast random access to available data.
3. Companies such as Facebook, Twitter, Yahoo and Adobe use HBase internally.
Difference between HDFS and Hbase
1. HDFS is a distributed file system suitable for storing large files, whereas HBase is a database built on top of HDFS.
2. HDFS does not support fast individual record lookups, whereas HBase provides fast lookups for larger tables.
3. HDFS provides high latency batch processing, whereas HBase provides low latency access to single rows from billions of records.
4. HDFS provides only sequential access of data, whereas HBase internally uses hash tables, provides random access and stores the data in indexed HDFS files for faster lookups.
5. HDFS is suited for high latency operations, whereas HBase is suited for low latency operations.
6. In HDFS, data are primarily accessed through MapReduce jobs, whereas in HBase data are accessed through shell commands, client APIs in Java, REST, Avro or Thrift.
Difference between Hbase and Relational Database
1. HBase is schema-less, whereas a relational database is bound by its schema.
2. HBase is a column-oriented datastore, whereas a relational database is a row-oriented datastore.
3. HBase is designed to store denormalized data, whereas a relational database is designed to store normalized data.
4. HBase contains wide and sparsely populated tables, whereas a relational database contains thin tables.
5. HBase supports automatic partitioning, whereas a relational database has no built-in support for partitioning.
6. HBase is good for semi-structured as well as structured data, whereas a relational database is good only for structured data.
7. No transactions are there in HBase, whereas an RDBMS is transactional.
Limitations of HBase
1. It takes a very long time to recover if the HMaster goes down and it takes a long time to activate another node if the first node goes down.
2. In HBase, cross-data operations and join operations are very difficult to perform.
3. HBase needs a new design when we want to migrate data from RDBMS external sources to HBase servers.
4. HBase has no built-in security of its own; it depends on an external security factor to grant access to the users.
5. HBase allows only one default sort for a table and it does not support large size binary files.
6. HBase is expensive in terms of hardware requirements and memory block allocations.
5.2 Data Model and Implementations
• The Apache HBase data model is designed to accommodate structured or semi-structured data that could vary in field size, data type and columns. HBase stores data in tables, which have rows and columns. The table schema is very different from traditional relational database tables.
• A database consists of multiple tables. Each table consists of multiple rows, sorted by row key. Each row contains a row key and one or more column families.
• Each column family is defined when the table is created. Column families can have multiple columns (family : column). A cell is uniquely identified by (table, row, family : column). A cell contains an uninterpreted array of bytes and a timestamp.
• The HBase data model has some logical components, which are as follows :
1. Tables 2. Rows 3. Column families / Columns 4. Versions / Timestamp 5. Cells
• Tables : The HBase tables are more like logical collections of rows stored in separate partitions called Regions. As shown above, every Region is then served by exactly one Region Server.
• The syntax to create a table in the HBase shell is shown below.
create '<table name>', '<column family>'
• Example : create 'CustomerContactInformation', 'CustomerName', 'ContactInfo'
* Tables are automatically partitioned horizontally by HBase into regions. Each
region comprises a subset of a table's rows. A region is denoted by the table it
belongs to. Fig. 5.2.1 shows region with table.
Fig. 5.2.1 Region with table
• There is one region server per node. There are many regions in a region server. At any time, a given region is pinned to a particular region server. Tables are split into regions and are scattered across region servers. A table must have at least one region.
• Rows : A row is one instance of data in a table and is identified by a rowkey. Rowkeys are unique in a table and are always treated as a byte[ ].
* Column families : Data in a row are grouped together as Column Families. Each
Column Family has one or more Columns and these Columns in a family are stored
together in a low level storage file known as HFile. Column Families form the
basic unit of physical storage to which certain HBase features like compression are
applied.
• Columns : A Column Family is made of one or more columns. A Column is identified by a Column Qualifier that consists of the Column Family name concatenated with the Column name using a colon - example : columnfamily : columnname. There can be multiple Columns within a Column Family and Rows within a table can have a varied number of Columns.
• Cell : A Cell stores data and is essentially a unique combination of rowkey, Column Family and Column (Column Qualifier). The data stored in a Cell is called its value and the data type is always treated as byte[ ].
• Version : The data stored in a cell is versioned and versions of data are identified by the timestamp. The number of versions of data retained in a column family is configurable and this value by default is 3.
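Versions can be read back through the client API; the sketch below, using the same old-style client classes as the examples later in this chapter, asks for up to three versions of one cell (the table and column names are assumptions).

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionRead {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test");                    // assumed table
        Get get = new Get(Bytes.toBytes("row1"));
        get.addColumn(Bytes.toBytes("data"), Bytes.toBytes("FN"));  // family : qualifier
        get.setMaxVersions(3);                                      // ask for up to three versions
        Result result = table.get(get);
        for (KeyValue kv : result.raw()) {
            // Each KeyValue carries a value plus the timestamp that identifies its version.
            System.out.println(kv.getTimestamp() + " -> " + Bytes.toString(kv.getValue()));
        }
        table.close();
    }
}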
• Time-to-Live : TTL is a built-in feature of HBase that ages out data based on its timestamp. This idea comes in handy in use cases where data needs to be held only for a certain duration of time. So, if on a major compaction the timestamp is older than the specified TTL in the past, the record in question doesn't get put in the HFile being generated by the major compaction; that is, the older records are removed as a part of the normal upkeep of the table.
• If TTL is not used and an aging requirement is still needed, then a much more I/O intensive operation would need to be done.
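TTL is set per column family. A minimal sketch of declaring it through the admin API used elsewhere in this chapter is shown below; the table name and the seven-day TTL are assumptions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class TtlTable {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor htd = new HTableDescriptor("events");   // hypothetical table
        HColumnDescriptor hcd = new HColumnDescriptor("data");
        hcd.setTimeToLive(7 * 24 * 60 * 60);   // TTL is given in seconds; here, seven days
        htd.addFamily(hcd);
        admin.createTable(htd);
    }
}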
5.3 Hbase Clients
• There are a number of client options for interacting with an HBase cluster.
1. Java
• HBase is written in Java.
• Example : Creating a table and inserting data into an HBase table are shown in the following program.
public class ExampleClient {
    public static void main(String[] args) throws IOException {
        Configuration config = HBaseConfiguration.create();
        // Create table
        HBaseAdmin admin = new HBaseAdmin(config);
        HTableDescriptor htd = new HTableDescriptor("test");
        HColumnDescriptor hcd = new HColumnDescriptor("data");
        htd.addFamily(hcd);
        admin.createTable(htd);
        byte[] tablename = htd.getName();
        // Run some operations -- a put
        HTable table = new HTable(config, tablename);
        byte[] row1 = Bytes.toBytes("row1");
        Put p1 = new Put(row1);
        byte[] databytes = Bytes.toBytes("data");
        p1.add(databytes, Bytes.toBytes("FN"), Bytes.toBytes("value1"));
        table.put(p1);
    }
}
* To create a table, we need to first create an instance of HBaseAdmin and then ask
it to create the table named test with a single column family named data.
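The example above stops at the put; the same program can be extended with a get, a scan and table clean-up. A sketch of that continuation (reusing the config, table, admin, row1 and tablename variables from the program above) might look like this :

        // Read back the row we just inserted.
        Get get = new Get(row1);
        Result result = table.get(get);
        System.out.println("Get: " + result);

        // Scan every row in the table.
        Scan scan = new Scan();
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result scannerResult : scanner) {
                System.out.println("Scan: " + scannerResult);
            }
        } finally {
            scanner.close();
        }

        // Disable and drop the table when it is no longer needed.
        admin.disableTable(tablename);
        admin.deleteTable(tablename);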
2. MapReduce
• HBase classes and utilities in the org.apache.hadoop.hbase.mapreduce package facilitate using HBase as a source and/or sink in MapReduce jobs. The TableInputFormat class makes splits on region boundaries, so maps are handed a single region to work on. The TableOutputFormat will write the result of MapReduce into HBase.
* Example : A MapReduce application to count the number of rows in an HBase
table
public class RowCounter {
    /** Name of this 'program'. */
    static final String NAME = "rowcounter";

    static class RowCounterMapper
            extends TableMapper<ImmutableBytesWritable, Result> {
        /** Counter enumeration to count the actual rows. */
        public static enum Counters {ROWS}

        @Override
        public void map(ImmutableBytesWritable row, Result values,
                Context context)
                throws IOException {
            // Count a row as soon as one non-empty cell is seen.
            for (KeyValue value : values.list()) {
                if (value.getValue().length > 0) {
                    context.getCounter(Counters.ROWS).increment(1);
                    break;
                }
            }
        }
    }

    public static Job createSubmittableJob(Configuration conf, String[] args)
            throws IOException {
        String tableName = args[0];
        Job job = new Job(conf, NAME + "_" + tableName);
        job.setJarByClass(RowCounter.class);
        // Columns are space delimited
        StringBuilder sb = new StringBuilder();
        final int columnoffset = 1;
        for (int i = columnoffset; i < args.length; i++) {
            if (i > columnoffset) {
                sb.append(" ");
            }
            sb.append(args[i]);
        }
        Scan scan = new Scan();
        scan.setFilter(new FirstKeyOnlyFilter());
        if (sb.length() > 0) {
            for (String columnName : sb.toString().split(" ")) {
                String[] fields = columnName.split(":");
                if (fields.length == 1) {
                    scan.addFamily(Bytes.toBytes(fields[0]));
                } else {
                    scan.addColumn(Bytes.toBytes(fields[0]), Bytes.toBytes(fields[1]));
                }
            }
        }
        // Second argument is the table name.
        job.setOutputFormatClass(NullOutputFormat.class);
        TableMapReduceUtil.initTableMapperJob(tableName, scan,
                RowCounterMapper.class, ImmutableBytesWritable.class, Result.class, job);
        job.setNumReduceTasks(0);
        return job;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 1) {
            System.err.println("ERROR: Wrong number of parameters: " + args.length);
            System.err.println("Usage: RowCounter <tablename> [<column1> <column2>...]");
            System.exit(-1);
        }
        Job job = createSubmittableJob(conf, otherArgs);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
3. Avro, REST, and Thrift
«© HBase ships with Avro, REST and Thrift interfaces. These are useful when the
interacting application is written in a language other than Java. In all cases, a Java server hosts an instance of the HBase client, brokering Avro, REST and Thrift requests in and out of the HBase cluster. This extra work of proxying requests and responses means these interfaces are slower than using the Java client directly.
• REST : To put up a Stargate instance, start it using the following command :
% hbase-daemon.sh start rest
• This will start a server instance, by default on port 8080, background it and catch any emissions by the server in logfiles under the HBase logs directory.
• Clients can ask for the response to be formatted as JSON, Google's protobufs, or as XML, depending on how the client HTTP Accept header is set.
• To stop the REST server, type :
% hbase-daemon.sh stop rest
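From any language that can speak HTTP, a row can then be fetched from the REST gateway. The following is a small Java sketch against a hypothetical local gateway; the host, port, table, row and column names are assumptions.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestClientExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint of the form http://<rest-host>:8080/<table>/<row>/<family:qualifier>
        URL url = new URL("http://localhost:8080/test/row1/data:FN");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");   // ask for a JSON-formatted response

        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
        conn.disconnect();
    }
}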
• Thrift : Start a Thrift service by putting up a server to field Thrift clients by running the following :
% hbase-daemon.sh start thrift
• This will start the server instance, by default on port 9090, background it and catch any emissions by the server in logfiles under the HBase logs directory. The HBase Thrift documentation notes the Thrift version used in generating classes.
• To stop the Thrift server, type :
% hbase-daemon.sh stop thrift
• Avro : The Avro server is started and stopped in the same manner as we start and stop the Thrift or REST services. The Avro server by default uses port 9090.
5.4 Praxis
• When an HBase cluster is running under load, the following issues are considered :
1. Versions : A particular HBase version will run on any Hadoop that has a matching minor version. For example, HBase 0.20.5 would run on Hadoop 0.20.2, but HBase 0.19.5 would not run on Hadoop 0.20.0.
2. HDFS : In MapReduce, HDFS files are opened, with their content streamed
through a map task and then closed. In HBase, data files are opened on cluster
startup and kept open. Because of this, HBase tends to see issues not normally
encountered by MapReduce clients.
• Running out of file descriptors : Because of the open files on a loaded cluster, it doesn't take long before we run into system- and Hadoop-imposed limits. Each open file consumes at least one descriptor over on the remote datanode. The default limit on the number of file descriptors per process is 1,024. We can verify that the HBase process is running with sufficient file descriptors by looking at the first few lines of a regionserver's log.
• Running out of datanode threads : The Hadoop datanode has an upper bound of
256 on the number of threads it can run at any one time.
• Sync : We must run HBase on an HDFS that has a working sync; otherwise, there will be loss of data. This means running HBase on Hadoop 0.21.x, which adds a working sync/append to Hadoop 0.20.
• UI : HBase runs a web server on the master to present a view on the state of the running cluster. By default, it listens on port 60010. The master UI displays a list of basic attributes such as software versions, cluster load, request rates, lists of cluster tables and participating regionservers.
• Schemas : HBase tables are like those in an RDBMS, except that cells are versioned, rows are sorted and columns can be added on the fly by the client as long as the column family they belong to preexists.
• Joins : There is no native database join facility in HBase, but wide tables can make it so that there is no need for database joins pulling from secondary or tertiary tables. A wide row can sometimes be made to hold all data that pertains to a particular primary key.
5.5 Pig
• Pig is an open-source, high-level data flow system. It is a high-level platform for creating MapReduce programs used with Hadoop; it translates scripts into efficient sequences of one or more MapReduce jobs.
• Pig offers a high-level language to write data analysis programs, which we call Pig Latin. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
• Pig makes use of both the Hadoop Distributed File System as well as MapReduce.
Features of Pig Hadoop :
1. In-built operators : Apache Pig provides a very good set of operators for performing several data operations like sort, join, filter, etc.
2. Ease of programming.
3. Automatic optimization : The tasks in Apache Pig are automatically optimized.
4. Handles all kinds of data : Apache Pig can analyze both structured and unstructured data and store the results in HDFS.
• Fig. 5.5.1 shows Pig architecture. (Refer Fig. 5.5.1 on next page)
• Pig has two execution modes :
1. Local mode : To run Pig in local mode, we need access to a single machine; all files are installed and run using the local host and file system. Local mode is specified using the -x flag (pig -x local).
2. Mapreduce mode : To run Pig in mapreduce mode, we need access to a Hadoop cluster. Mapreduce mode is the default mode; we don't need to specify it, but we can do so using the -x flag (pig -x mapreduce).
Fig. 5.5.1 Pig architecture (Pig Latin scripts flow through Apache Pig's parser, optimizer and compiler and run as MapReduce jobs)
* Pig Hadoop framework has four main components :
1. Parser : When a Pig Latin script is sent to Hadoop Pig, it is first handled by the
parser. The parser is responsible for checking the syntax of the script, along
with other miscellaneous checks. Parser gives an output in the form of a
Directed Acyclic Graph (DAG) that contains Pig Latin statements, together with
other logical operators represented as nodes.
2. Optimizer : After the output from the parser is retrieved, a logical plan for
DAG is passed to a logical optimizer. The optimizer is responsible for carrying
out the logical optimizations.
3. Compiler : The role of the compiler comes in when the output from the optimizer is received. The compiler compiles the logical plan sent by the optimizer; the logical plan is then converted into a series of MapReduce tasks or jobs.
4. Execution Engine : After the logical plan is converted to MapReduce jobs, these jobs are sent to Hadoop in a properly sorted order and executed on Hadoop, yielding the desired result.
• Pig can run in two types of environments : the local environment in a single JVM and the distributed environment on a Hadoop cluster.
• Pig has a variety of scalar data types and standard data processing operations, together with complex types such as tuples, bags and maps, with a map being a set of key-value pairs.
• Most Pig operators take a relation as an input and give a relation as the output. Pig supports arithmetic operations and relational operations too.
• Pig's language layer currently consists of a textual language called Pig Latin. Pig Latin is a dataflow language. This means it allows users to describe how data from one or more inputs should be read, processed and then stored to one or more outputs in parallel.
+ These data flows can be simple linear flows, or complex workflows that include
points where multiple inputs are joined and where data is split into multiple
streams to be processed by different operators. To be mathematically precise, a Pig
Latin script describes a directed acyclic graph (DAG), where the edges are data
flows and the nodes are operators that process the data.
• The first step in a Pig program is to LOAD the data we want to manipulate from HDFS. Then we run the data through a set of transformations. Finally, we DUMP the data to the screen or STORE the results in a file somewhere.
Advantages of Pig :
1. Fast execution that works with MapReduce, Spark and Tez.
2. Its ability to process almost any amount of data, regardless of size.
3. A strong documentation process that helps new users learn Pig Latin.
4. Local and remote interoperability that lets professionals work from anywhere with a reliable connection.
Pig disadvantages :
1. Slow start-up and clean-up of MapReduce jobs.
2. Not suitable for interactive OLAP analytics.
3. Complex applications may require ...
Pig Data Model
• With Pig, the data model is specified when the data is loaded. Any data that we load from the disk into Pig will have a specific schema and structure. Pig's data model is rich enough to manage most of what is thrown its way, like table-like structures and nested hierarchical data structures.
• However, Pig data types can be divided into two groups in general terms : scalar types and complex types.
• Scalar types contain a single value, while complex types contain other values, such as Tuple, Bag and Map.
• In its data model, Pig Latin has these four types :
• Atom : An atom is any single value, like, for example, a string or a number, e.g. 'Hadoop'. The atomic values of Pig are scalar types that appear in most programming languages : int, long, float, double, chararray and bytearray.
• Tuple : A tuple is a record formed by a sequence of fields. Each field can be of any type, e.g. 'Hadoop' or 6. Just think of a tuple as a row in a table.
• Bag : A bag is a collection of tuples, which need not be unique. The bag's schema is flexible; each tuple in the collection can contain an arbitrary number of fields and each field can be of any type.
• Map : A map is a set of key-value pairs. The value can store any type and the key needs to be unique. The key of a map must be a chararray and the value may be of any kind.
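These four types also surface in Pig's Java data API. The following is a small illustrative sketch, assuming the Pig library (org.apache.pig) is on the classpath; the field names and values are made up.

import java.util.HashMap;
import java.util.Map;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class PigDataModelExample {
    public static void main(String[] args) throws Exception {
        // Atom : a single scalar value (int, long, float, double, chararray, bytearray).
        String atom = "Hadoop";

        // Tuple : an ordered set of fields, like a row in a table.
        Tuple tuple = TupleFactory.getInstance().newTuple();
        tuple.append(atom);
        tuple.append(6);

        // Bag : a collection of tuples whose schemas may differ.
        DataBag bag = BagFactory.getInstance().newDefaultBag();
        bag.add(tuple);

        // Map : chararray keys mapped to values of any type.
        Map<String, Object> map = new HashMap<String, Object>();
        map.put("name", atom);
        map.put("records", bag);

        System.out.println(tuple + " " + bag + " " + map);
    }
}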
Pig Latin
• Pig Latin is the data flow language used by Apache Pig to analyze the data in Hadoop. It is a textual language that abstracts the programming from the Java MapReduce idiom into a notation.
• Pig Latin statements are used to process the data. Each statement is an operator that accepts a relation as an input and generates another relation as an output.
a) It can span multiple lines.
b) Each statement must end with a semi-colon.
c) It may include expressions and schemas.
d) By default, these statements are processed using multi-query execution.
Pig Latin statements work with relations. A relation can be defined as follows :
a) A relation is a bag (more specifically, an outer bag).
b) A bag is a collection of tuples.
c) A tuple is an ordered set of fields.
d) A field is a piece of data.
Pig Latin Datatypes
1. Int : It represents a signed 32-bit integer. For example : 13
2. Long : It represents a signed 64-bit integer. For example : 13L
3. Float : This data type represents a signed 32-bit floating point. For example : 13.5F
4. Double : "double" represents a 64-bit floating point. For example : 13.5
5. Chararray : It represents a character array (string) in Unicode UTF-8 format. For example : 'Big Data'
6. Bytearray : This data type represents a byte array.
7. Boolean : "Boolean" represents a Boolean value. For example : true/false.
Developing and Testing Pig Latin Scripts
« Pig provides several tools and diagnostic operators to help us to develop
applications.
• Scripts in Pig can be executed in interactive or batch mode. To use Pig in interactive mode, we invoke it in local or mapreduce mode and then enter commands one after the other. In batch mode, we save commands in a .pig file and specify the path to the file when invoking Pig.
« At an overly simplified level a Pig script consists of three steps. In the first step
we load data from HDFS. In the second step we perform transformations on the
data. In the final step we store the transformed data. Transformations are the heart of
Pig scripts.
• Pig has a schema concept that is used when loading data to specify what it should expect. We first specify the columns and, optionally, their data types. Any columns in the data but not included in the schema are truncated. When we have fewer columns than those specified in the schema, they are filled with nulls.
• To load sample data sets, we first move them to HDFS and from there we load them into Pig.
Pig Script Interfaces : Pig programs can be packaged in three different ways.
1. Script : This method is nothing more than a file consisting of Pig Latin commands, identified by the .pig suffix. Ending a Pig program with the .pig extension is a convention but not required. The commands are interpreted by the Pig Latin compiler and then run in the order determined by the Pig optimizer.
2. Grunt : Grunt acts as a command interpreter where we can interactively enter
Pig Latin at the Grunt command line and immediately see the response. This
method is useful for prototyping during early development stage and with
what-if scenarios.
3. Embedded : Pig Latin statements can run within Java, JavaScript and Python
programs.
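A minimal sketch of the embedded approach, using Pig's Java API (org.apache.pig.PigServer), is shown below; the input file, field names and output path are hypothetical.

import java.io.IOException;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
    public static void main(String[] args) throws IOException {
        // Run Pig in local mode; use ExecType.MAPREDUCE to run on a Hadoop cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // The same LOAD -> transform -> STORE flow described earlier, registered as Pig Latin.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // STORE triggers execution of the whole data flow.
        pig.store("counts", "wordcount_output");
    }
}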
• Pig scripts, Grunt shell Pig commands and embedded Pig programs can be executed in either local mode or MapReduce mode. The Grunt shell enables an
interactive shell to submit Pig commands and run Pig scripts. To start the Grunt
shell in Interactive mode, we need to submit the command pig at the shell.
• To tell the compiler whether a script or Grunt shell is executed locally or in Hadoop mode, we just specify it in the -x flag to the pig command. The following is an example of how we would specify running our Pig script in local mode :
pig -x local mindStick.pig
• Here's how we would run the Pig script in Hadoop mode, which is the default if we don't specify the flag :
pig -x mapreduce mindStick.pig
By default, when we specify the pig command without any parameters, it starts
the Grunt shell in Hadoop mode. If we want to start the Grunt shell in local mode
just add the -x local flag to the command.
5.6 Hive
• Apache Hive is an open source data warehouse software for reading, writing and managing large data set files that are stored directly in either the Apache Hadoop Distributed File System (HDFS) or other data storage systems such as Apache HBase.
Data analysts often use Hive to analyze data, query large amounts of unstructured
data and generate data summaries.
Features of Hive :
1. It stores schema in a database and processed data into HDFS.
2. It is designed for OLAP.
3. It provides SQL type language for querying called HiveQL or HQL.
4. It is familiar, fast, scalable and extensible.
• Hive supports a variety of storage formats : TEXTFILE for plain text, SEQUENCEFILE for binary key-value pairs and RCFILE, which stores the columns of a table in a record columnar format.
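Since Hive exposes its data through HiveQL, client programs typically reach it over JDBC. The following is a minimal sketch using the HiveServer2 JDBC driver (org.apache.hive.jdbc.HiveDriver); the connection URL, credentials, table and column names are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 endpoint and default database.
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // A simple HiveQL query; 'employees' is an assumed table.
        ResultSet rs = stmt.executeQuery("SELECT name, salary FROM employees LIMIT 10");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
        }
        con.close();
    }
}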
• A Hive table structure consists of rows and columns. The rows typically correspond to some record, transaction, or particular entity detail. The values of the corresponding columns represent the various attributes or characteristics for each row.
• Hadoop and its ecosystem are used to apply some structure to unstructured data. Therefore, if a table structure is an appropriate way to view the restructured data, Hive may be a good tool to use.
Following are some Hive use cases :
1. Exploratory or ad-hoc analysis of HDFS data : Data can be queried, transformed and exported to analytical tools.
2. Extracts or data feeds to reporting systems, dashboards, or data repositories such as HBase.
3. Combining external structured data with data already residing in HDFS.
Advantages :
1. Simple querying for anyone already familiar with SQL.
2. Its ability to connect with a variety of relational databases, including Postgres and MySQL.
3. Simplifies working with large amounts of data.
Disadvantages :
1. Updating data is complicated.
2. No real-time access to data.
3. High latency.
• Program example : Write a code in Java for a simple Word Count application that counts the number of occurrences of each word in a given input set, using the Hadoop MapReduce framework on a local standalone set-up.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {