UNIT III

Basics of Hadoop
Syllabus
MapReduce workflows - unit tests with MRUnit - test data and local tests - anatomy of MapReduce job run - classic Map-reduce - YARN - failures in classic Map-reduce and YARN - job scheduling - shuffle and sort - task execution - MapReduce types - input formats - output formats.
Contents
3.1 Data Format
3.2 Hadoop Streaming
3.3 Hadoop Pipes
3.4 Design of Hadoop Distributed File System
3.5 Hadoop I/O
3.6 File-based Data Structures
3.7 Cassandra - Hadoop Integration
3.8 Two Marks Questions with Answers
3.1 Data Format
• The data is stored using a line-oriented ASCII format, in which each line is a record. The format supports a rich set of meteorological elements, many of which are optional or have variable data lengths.
• The default output format provided by Hadoop is TextOutputFormat, which writes records as lines of text. If a file output format is not specified explicitly, text files are created as output files. Output key-value pairs can be of any type, because TextOutputFormat converts them into strings by calling toString() on them.
• HDFS data is stored in units called blocks. These blocks are the smallest unit of data that the file system can store. Files are broken down into these blocks, which are then distributed across the cluster and also replicated for safety.
Analyzing the Data with Hadoop
• Hadoop supports parallel processing, so we take advantage of this by expressing the query as a MapReduce job. After some local, small-scale testing, we will be able to run it on a cluster of machines.
MapReduce works by breaking the processing into two phases : The map phase
and the reduce phase.
Each phase has key-value pairs as input and output, the types of which may be
chosen by the programmer. The programmer also specifies two functions : The
map function and the reduce function.
• MapReduce is a method for distributing a task across multiple nodes. Each node processes the data stored on that node to the extent possible. A running MapReduce job consists of various phases, which are described in Fig. 3.1.1.
Fig. 3.1.1 Phases of Hadoop MapReduce
• In the map phase, the input dataset is split into chunks and map tasks process these chunks in parallel. The outputs of the map tasks are used as inputs for the reduce tasks. Reducers process the intermediate data from the maps into smaller tuples, which leads to the final output of the framework.
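• As a concrete illustration of the two user-supplied functions, the following sketch shows a minimal word-count mapper and reducer written against the org.apache.hadoop.mapreduce API. It is only an illustrative sketch (class names such as WordCountMapper and WordCountReducer are chosen here, not taken from this text).

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase : emit (word, 1) for every word in an input line.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase : sum the counts emitted for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}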
• Hadoop distributes the data blocks across a distributed storage layer and runs the computation over them, providing fault tolerance against failures of storage and network infrastructure, along with monitoring and security.
• MapReduce programs can also be written in any programming language using the streaming API of Hadoop.
Scaling Out
• To scale out, we need to store the data in a distributed file system, typically HDFS, to allow Hadoop to move the MapReduce computation to each machine hosting a part of the data.
• A MapReduce job is a unit of work that the client wants to be performed : It consists of the input data, the MapReduce program and configuration information.
• Hadoop runs the job by dividing it into tasks, of which there are two types : Map tasks and reduce tasks.
There are two types of nodes that control the job execution process : A job tracker
and a number of task trackers.
Job tracker : This tracker plays the role of scheduling jobs and tracking all jobs
assigned to the task tracker.
Task tracker : This tracker plays the role of tracking tasks and reporting the status
of tasks to the job tracker.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input
splits. Hadoop creates one map task for each split, which runs the user defined
map function for each record in the split.
• That split information is used by the YARN ApplicationMaster to try to schedule map tasks on the same node where the split data is residing, thus making the task data-local. If map tasks are spawned at random locations, then each map task has to copy the data it needs to process from the DataNode where that split data is residing, consuming lots of cluster bandwidth. By trying to schedule map tasks on the same node where the split data is residing, the Hadoop framework sends computation to data rather than bringing data to computation, saving cluster bandwidth. This is called the data locality optimization.
• Map tasks write their output to the local disk, not to HDFS. Why is this? Map output is intermediate output : it is processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So storing it in HDFS, with replication, would be overkill. If the node running the map task fails before the map output has been consumed by the reduce task, then Hadoop will automatically rerun the map task on another node to re-create the map output.
• Fig. 3.1.2 shows MapReduce data flow with a single reduce task.
Fig. 3.1.2 MapReduce data flow with a single reduce task
• The number of reduce tasks is not governed by the size of the input, but is specified independently.
• When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys in each partition, but the records for any given key are all in a single partition.
Fig. 3.1.3 shows MapReduce data flow with multiple reduce tasks.
• Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. A driver sketch showing how these choices are configured is given below.
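• The following driver sketch (illustrative only; it reuses the hypothetical WordCountMapper and WordCountReducer classes from the earlier sketch) shows where the programmer specifies the number of reduce tasks and an optional combiner.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        // The combiner is optional; here the reducer class doubles as the combiner.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);

        // The number of reducers is chosen by the programmer, not by the input size.
        job.setNumReduceTasks(2);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}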
Fig. 3.1.3 MapReduce data flow with multiple reduce tasks
3.2 Hadoop Streaming
• Hadoop Streaming is an API that allows writing mappers and reducers in any language. It uses UNIX standard streams as the interface between Hadoop and the user application. Hadoop Streaming is a utility that comes with the Hadoop distribution.
• Streaming is naturally suited for text processing. The data view is line-oriented and processed as a key-value pair separated by a tab character. The reduce function reads lines from the standard input, which is sorted by key, and writes its results to the standard output.
• It helps in real-time data processing, which is much faster with MapReduce programming running on a large cluster. There are different technologies, like Spark, Kafka and others, which help in real-time Hadoop streaming.
• Features of Hadoop Streaming :
1. Users can execute non-Java programmed MapReduce jobs on Hadoop clusters. Supported languages include Python, Perl and C++.
2. Hadoop Streaming monitors the progress of jobs and provides logs of a job's entire execution for analysis.
3. Hadoop Streaming works on the MapReduce paradigm, so it supports scalability, flexibility and security/authentication.
4. Hadoop Streaming jobs are quick to develop and don't require much programming.
• The following command shows how the streaming utility is invoked (the path to the streaming jar is indicative, as it depends on the Hadoop installation) :

hadoop jar /path/to/hadoop-streaming.jar \
    -mapper /path/to/mapper.py \
    -reducer /path/to/reducer.py \
    -input /user/hduser/books/* \
    -output /user/hduser/books-output
Where :
Input = Input location for the Mapper, from where it can read input
Output = Output location for the Reducer to store the output
Mapper = The executable file of the Mapper
Reducer = The executable file of the Reducer
* Fig. 3.2.1 shows code execution process.
Fig. 3.2.1 Code execution Process
• Map and reduce functions read their input from STDIN and produce their output to STDOUT. In the diagram above, the mapper reads the input data from the input reader/format in the form of key-value pairs, maps them as per the logic written in the code, and then passes them through the reduce stream, which performs data aggregation and releases the data to the output.
3.3 Hadoop Pipes
• Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the task tracker communicates with the process running the C++ map or reduce function.
• Fig. 3.3.1 shows the execution of streaming and pipes.
Fig. 3.3.1 Execution of streaming and pipes
* With Hadoop pipes, we can implement applications that require higher
performance in numerical calculations using C++ in MapReduce. The pipes utility
works by establishing a persistent socket connection on a port with the Java pipes
task on one end, and the external C++ process at the other.
• Other dedicated alternatives and implementations are also available, such as Pydoop for Python and libhdfs for C. These are mostly built as wrappers and are JNI-based. It is, however, noticeable that MapReduce tasks are often a smaller component of a larger pipeline of chaining, redirecting and recursing MapReduce jobs. This is usually done with the help of higher-level languages or APIs like Hive and Cascading, which can be used to express such data extraction and transformation problems.
3.4 Design of Hadoop Distributed File System (HDFS)
• The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is the file system component of Hadoop.
• HDFS stores file system metadata and application data separately. As in other distributed file systems, like GFS, HDFS stores metadata on a dedicated server, called the NameNode. Application data is stored on other servers called DataNodes. All servers are fully connected and communicate with each other using TCP-based protocols.
• The Hadoop Distributed File System (HDFS) is a distributed file system that handles large data sets running on commodity hardware. It is used to scale a single Apache Hadoop cluster to hundreds of nodes.
• A block is the minimum amount of data that it can read or write. HDFS blocks are 128 MB by default and this is configurable. When a file is saved in HDFS, the file is broken into smaller chunks or "blocks".
• HDFS is a fault-tolerant and resilient system, meaning it prevents a failure in a node from affecting the overall system's health and allows for recovery from failure too. In order to achieve this, data stored in HDFS is automatically replicated across different nodes.
• HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file.
Hadoop distributed file system is a block-structured file system where each file is
divided into blocks of a pre-determined size. These blocks are stored across a
cluster of one or several machines.
• Apache Hadoop HDFS follows a master/slave architecture, where a cluster comprises a single NameNode (master node) and all the other nodes are DataNodes (slave nodes).
• HDFS can be deployed on a broad spectrum of machines that support Java. Though one can run several DataNodes on a single machine, in practice these DataNodes are spread across various machines.
• Design issues of HDFS :
1. Commodity hardware : HDFS does not require expensive hardware for executing user tasks. It is designed to run on clusters of commodity hardware.
2. Streaming data access : HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern.
3. Single writer, arbitrary file modifications : Files in HDFS may be written to by a single writer. Writes are always made at the end of the file. There is no support for multiple writers, or for modifications at arbitrary offsets in the file.
4. Low-latency data access.
5. Holds lots of small files.
6. Stores very large files.
• HDFS achieves the following goals :
1. Manages large datasets : Organizing and storing datasets can be a hard task to handle. HDFS is used to manage the applications that have to deal with huge datasets. To do this, HDFS should have hundreds of nodes per cluster.
2. Detecting faults : HDFS should have technology in place to scan and detect faults quickly and effectively, as it includes a large number of commodity hardware components. Failure of components is a common issue.
3. Hardware efficiency : When large datasets are involved, it can reduce the network traffic and increase the processing speed.
HDFS Architecture
• Fig. 3.4.1 shows HDFS architecture.
Fig. 3.4.1 Hadoop architecture
• Hadoop distributed file system is a block-structured file system where each file is divided into blocks of a pre-determined size. These blocks are stored across a cluster of one or several machines.
• Apache Hadoop HDFS architecture follows a master/slave architecture, where a cluster comprises a single NameNode (master node) and all the other nodes are DataNodes (slave nodes).
• DataNodes process and store data blocks, while the NameNode manages the many DataNodes, maintains data block metadata and controls client access.
NameNode and DataNode
• The Namenode holds the metadata for HDFS, like namespace information, block information etc. When in use, all this information is stored in main memory. But this information is also stored on disk for persistent storage.
• The Namenode manages the file system namespace. It keeps the directory tree of all files and metadata about files and directories.
• The DataNode is a slave node in HDFS that stores the actual data as instructed by the NameNode. In brief, the NameNode controls and manages a single or multiple data nodes.
• The DataNode serves read and write requests. It also creates, deletes and replicates blocks on the instructions from the NameNode.
• Fig. 3.4.2 shows the Namenode and how the NameNode stores its information.
Fig. 3.4.2 Name node
• Two different files are :
1. fsimage : It is the snapshot of the file system when the name node started.
2. Edit logs : It is the sequence of changes made to the file system after the name node started.
• Only on restart of the namenode are the edit logs applied to the fsimage to get the latest snapshot of the file system.
• But namenode restarts are rare in production clusters, which means edit logs can grow very large for clusters where the namenode runs for a long period of time. The following issues will be encountered in this situation :
1. The edit log becomes very large, which is challenging to manage.
2. Namenode restart takes a long time because of the many changes to be merged.
3. In the case of a crash, a huge amount of metadata is lost, since the fsimage is very old.
• So to overcome these issues we need a mechanism which keeps the edit log size manageable and the fsimage up to date, so that the load on the namenode reduces.
• The secondary Namenode helps to overcome the above issues by taking over the responsibility of merging edit logs with the fsimage from the namenode.
• Fig. 3.4.3 shows the secondary Namenode.
Fig. 3.4.3 Secondary Namenode
+ Working of secondary Namenode :
1. It gets the edit logs from the Namenode at regular intervals and applies them to its fsimage.
2. Once it has new fsimage, it copies back to Namenode.
3, Namenode will use this fsimage for the next restart, which will reduce the
startup time.
• The secondary Namenode's whole purpose is to have a checkpoint in HDFS. It is just a helper node for the Namenode. That is why it is also known as the checkpoint node inside the community.
HDFS Block
• HDFS is a block-structured file system. In general, user data is stored in HDFS in terms of blocks : The files in the file system are divided into one or more segments called blocks. The default size of an HDFS block is 64 MB (128 MB in newer versions), and it can be increased as per need.
• HDFS is fault tolerant, such that if a data node fails then the current block write operation on the data node is re-replicated to some other node. The block size, number of replicas and replication factors are specified in the Hadoop configuration file. The synchronization between the name node and data nodes is done by heartbeat functions which are periodically generated by the data nodes to the name node.
• Apart from the above components, the job tracker and task trackers are used when MapReduce applications run over HDFS. A Hadoop cluster consists of a single job tracker and several task trackers. The job tracker runs on the name node like a master, while task trackers run on data nodes like slaves.
• The job tracker is responsible for taking the requests from a client and assigning task trackers to it with tasks to be performed. The job tracker will try to assign tasks to the task trackers on the data nodes where the data is locally present.
• If for some reason the node fails, the job tracker assigns the task to another task tracker where a replica of the data exists; since the data blocks are replicated across the data nodes, this ensures that the job does not fail even if a node fails within the cluster.
The Command-Line Interface :
• HDFS can be manipulated using the command line. All the commands used for manipulating HDFS through the command-line interface begin with the "hadoop fs" command.
• Most of the Linux commands are supported over HDFS; they start with a "-" sign.
• For example : The command for listing the files in a Hadoop directory will be,
#hadoop fs -ls
• The general syntax of HDFS command-line manipulation is,
#hadoop fs -<command> <arguments>
Java Interface
• Hadoop is written in Java, so most Hadoop filesystem interactions are mediated through the Java API. The filesystem shell, for example, is a Java application that uses the Java FileSystem class to provide file system operations. By exposing its filesystem interface as a Java API, Hadoop makes it awkward for non-Java applications to access HDFS.
1. Reading Data from a Hadoop URL :
• One of the simplest ways to read a file from a Hadoop file system is by using a java.net.URL object to open a stream to read the data. The general idiom is as follows :

InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    // process in
} finally {
    IOUtils.closeStream(in);
}
• Java recognizes Hadoop's hdfs URL scheme by calling the setURLStreamHandlerFactory() method on URL with an instance of FsUrlStreamHandlerFactory.
• The setURLStreamHandlerFactory method sets the application's URLStreamHandlerFactory, which is responsible for creating the URL stream handlers used to retrieve the data, for the Java Virtual Machine.
• This method can only be called once per JVM, so it is typically executed in a static block.
• Example : Displaying files from a Hadoop file system on standard output using a URLCat class; a sketch of this program is given below.
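• The following is a minimal sketch of such a URLCat program (reconstructed here for illustration; it assumes the Hadoop client libraries are on the classpath and that no other code in the JVM has already set a URL stream handler factory) :

import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
    static {
        // May only be called once per JVM, hence the static block.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            // args[0] is a URL such as hdfs://host/path
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}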
2. Reading Data Using the FileSystem API :
• Sometimes it is not possible to set a URLStreamHandlerFactory for our application, so in that case we use the FileSystem API for opening an input stream for a file.
• A file in a Hadoop filesystem is represented by a Hadoop Path object.
• There are two static factory methods for getting a FileSystem instance.
• FileSystem is a generic abstract class used to interface with a file system. The FileSystem class also serves as a factory for concrete implementations, with the following methods :
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
• The first method returns the default file system; the second uses the given URI's scheme and authority to determine the file system to use.
• A Configuration object encapsulates a client's or server's configuration, which is set using configuration files read from the classpath, such as core-site.xml.
FSDataInputStream :
• The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io class.
• This class is a specialization of java.io.DataInputStream with support for random access, so we can read from any part of the stream.
package org.apache.hadoop.fs;
public class FSDataInputStream extends DataInputStream implements Seekable, PositionedReadable { ... }
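• As an illustration of this API, the following sketch (in the style of the well-known FileSystemCat examples; the class name and the 4096-byte buffer size are arbitrary choices) opens a file with FileSystem.get() and open(), copies it to standard output, and then uses seek() to go back to the start and print it a second time :

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemDoubleCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];                      // e.g. hdfs://host/user/hduser/file.txt
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);

        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.seek(0);                            // random access : rewind to the beginning
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}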
Writing Data : | : : -
* The FileSystem class has a number of methods for creating a file. The simplest is
the method that takes a path object for the file to be created and returns an output
stream to write to : public FSDataOutputStream create(Path f) throws IOException
FSDataOutputStream :
• The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream, has a method for querying the current position in the file :
package org.apache.hadoop.fs;
• We can append to an existing file using the append() method :
public FSDataOutputStream append(Path f) throws IOException
• The append operation allows a single writer to modify an already written file by opening it and writing data from the final offset in the file. With this API, applications that produce unbounded files, such as log files, can write to an existing file after a restart. The append operation is optional and not implemented by all Hadoop filesystems.
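• A short sketch of writing a file through this API is shown below (illustrative only; the destination path and message are placeholders). It creates a file, writes a few bytes and prints the position reported by getPos() :

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileWriteExample {
    public static void main(String[] args) throws Exception {
        String dst = args[0];                      // e.g. hdfs://host/user/hduser/out.txt
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);

        // create() returns an FSDataOutputStream to write to.
        FSDataOutputStream out = fs.create(new Path(dst));
        out.writeUTF("hello, HDFS");
        System.out.println("position after write : " + out.getPos());
        out.close();
    }
}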
Data Flow
1. Anatomy of a File Read:
• Fig. 3.4.4 shows the sequence of events when reading a file : the data flows between the client and HDFS, the namenode and the datanodes.
• The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (DFS). DFS calls the namenode, using RPC, to determine the locations of the blocks for the first few blocks in the file.
Fig. 3.4.4 Client reading data from HDFS.
• For each block, the namenode returns the addresses of the datanodes that have a copy of that block. Furthermore, the datanodes are sorted according to their proximity to the client. If the client is itself a datanode, then it will read from the local datanode, if it hosts a copy of the block.
* The DFS returns an FSDatalnputStream to the client for it to read data from.
FSDatalnputStream in turn wraps a DFSInputStream, which manages the datanode
and namenode I/O.
• The client then calls read() on the stream. DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, then connects to the first (closest) datanode for the first block in the file.
• Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream. When the end of the block is reached, DFSInputStream will close the connection to the datanode, then find the best datanode for the next block. This happens transparently to the client, which from its point of view is just reading a continuous stream.
• Blocks are read in order, with the DFSInputStream opening new connections to datanodes as the client reads through the stream. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream.
• During reading, if the DFSInputStream encounters an error while communicating with a datanode, then it will try the next closest one for that block.
2. Anatomy of a File Write: :
• Fig. 3.4.5 shows the anatomy of a file write.
Fig. 3.4.5 Anatomy of a file write
1. The client calls create() on DistributedFileSystem to create a file.
2. An RPC call to the namenode happens through the DFS to create a new file.
3. As the client writes data, the data is split into packets by DFSOutputStream, which are then written to an internal queue, called the data queue. The DataStreamer consumes the data queue.
4. The DataStreamer streams the packets to the first DataNode in the pipeline, which stores each packet and forwards it to the second DataNode in the pipeline.
5. In addition to the internal queue, DFSOutputStream also manages an "ack queue" of the packets that are waiting to be acknowledged by DataNodes.
6. When the client finishes writing the file, it calls close() on the stream.
Heartbeat Mechanism in HDFS
• A task tracker sends its heartbeat to the job tracker, while a datanode sends its heartbeat to the namenode.
• Fig. 3.4.6 shows the heartbeat mechanism.
Fig. 3.4.6 Heartbeat mechanism
• The connectivity between the NameNode and a DataNode is managed by the persistent heartbeats that are sent by the DataNode every three seconds.
• The heartbeat provides the NameNode confirmation about the availability of the blocks and the replicas of the DataNode.
• Additionally, heartbeats also carry information about total storage capacity, storage in use and the number of data transfers currently in progress. These statistics are used by the NameNode for managing space allocation and load balancing.
• During normal operations, if the NameNode does not receive a heartbeat from a DataNode in ten minutes, it considers that DataNode to be out of service and the block replicas hosted by it to be unavailable.
• The NameNode then schedules the creation of new replicas of those blocks on other DataNodes.
• The heartbeats carry roundtrip communications and instructions from the NameNode, including commands to :
a) Replicate blocks to other nodes.
b) Remove local block replicas.
c) Reregister the node.
d) Shut down the node.
e) Send an immediate block report.
Role of Sorter, Shuffler and Combiner in the MapReduce Paradigm
• The combiner is an optional class that operates by accepting the inputs from the Map class and thereafter passing the output key-value pairs to the Reducer class.
• The main function of a combiner is to summarize the map output records with the same key. The output of the combiner will be sent over the network to the actual reducer task as input.
• The process of transferring data from the mappers to the reducers is known as shuffling, i.e. the process by which the system performs the sort and transfers the map output to the reducer as input. So, the shuffle phase is necessary for the reducers; otherwise, they would not have any input.
• The shuffle phase in Hadoop transfers the map output from the Mapper to a Reducer in MapReduce. The sort phase in MapReduce covers the merging and sorting of map outputs.
• Data from the mapper are grouped by the key, split among reducers and sorted by the key. Every reducer obtains all values associated with the same key. The shuffle and sort phases in Hadoop occur simultaneously and are done by the MapReduce framework.
3.5 Hadoop I/O
• Hadoop's input/output system comes with a set of primitives. Hadoop deals with multi-terabyte datasets, so a special consideration of these primitives gives an idea of how Hadoop handles data input and output.
Data Integrity
• Data integrity means that data should remain accurate and consistent across all its storing, processing and retrieval operations.
• However, every I/O operation on the disk or network carries with it a small chance of introducing errors into the data that it is reading or writing. The usual way of detecting corrupted data is by computing a checksum for the data when it first enters the system, and again whenever it is transmitted across a channel that is unreliable and hence capable of corrupting the data.
• The commonly used error-detecting code is CRC-32, which computes a 32-bit integer checksum for input of any size.
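• The idea behind such a checksum can be illustrated with the standard java.util.zip.CRC32 class. This is only an illustration of the error-detecting code itself, not of Hadoop's internal checksum implementation :

import java.util.zip.CRC32;

public class ChecksumDemo {
    public static void main(String[] args) {
        byte[] data = "sample record".getBytes();

        // Compute the checksum when the data first enters the system.
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        long storedChecksum = crc.getValue();     // stored alongside the data

        // Later, recompute the checksum over the received bytes and compare.
        crc.reset();
        crc.update(data, 0, data.length);
        boolean intact = (crc.getValue() == storedChecksum);
        System.out.println("data intact ? " + intact);
    }
}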
Data Integrity in HDFS :
• HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every io.bytes.per.checksum bytes of data. The default is 512 bytes, and because a CRC-32 checksum is 4 bytes long, the storage overhead is less than 1 %.
• All data that enters into the system is verified by the datanodes before being forwarded for storage or further processing. Data written to the datanode pipeline is verified, and any corruption found is immediately notified to the client with a ChecksumException.
• Data read by the client from the datanode also goes through the same drill. The datanode keeps a log of checksum verifications to keep track of the verified blocks. The log is updated by the datanode upon receiving a block-verification success signal from the client. This type of statistics helps in keeping the bad disks at bay.
• Apart from this, a periodic verification of the block store is made with the help of the DataBlockScanner, which runs along with the datanode thread in the background. This protects data from corruption in the physical storage media.
• Because HDFS stores replicas of blocks, it can "heal" corrupted blocks by copying one of the good replicas to produce a new, uncorrupt replica.
• If a client detects an error when reading a block, it reports the bad block and the datanode it was trying to read from to the namenode before throwing a ChecksumException. The namenode marks the block replica as corrupt, so it does not direct clients to it, or try to copy this replica to another datanode.
• It then schedules a copy of the block to be replicated on another datanode, so its replication factor is back at the expected level. Once this has happened, the corrupt replica is deleted.
• It is possible to disable verification of checksums by passing false to the setVerifyChecksum() method on FileSystem before using the open() method to read a file.
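• A minimal sketch of that call sequence is shown below (illustrative; the class name and URI argument are placeholders) :

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadWithoutChecksum {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(args[0]), conf);

        // Disable client-side checksum verification before opening the file.
        fs.setVerifyChecksum(false);

        InputStream in = fs.open(new Path(args[0]));
        IOUtils.copyBytes(in, System.out, 4096, true);   // true : close the stream when done
    }
}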
Hadoop LocalFileSystem :
• The Hadoop LocalFileSystem performs client-side checksumming. When a file is created, it automatically creates a transparent checksum file in the background, which stores the checksum chunks used to check the file.
• Each chunk can check a segment of up to 512 bytes; the chunk of data is defined by the file.bytes-per-checksum property, and the chunk is then stored as metadata in a separate file.
• A file can be read correctly even though the settings of the file might change, and if an error is detected then the local system throws a checksum exception.
• We can use checksum file systems as a security measure to ensure that the file system is not corrupt or damaged in any way.
• In this file system, the underlying file system is called the raw file system. If an error is detected while working on the checksum file system, it will call reportChecksumFailure().
• Here the local system moves the affected file to another directory as a file titled bad_file. It is then the responsibility of an administrator to keep a check on these bad_files and take the necessary action.
Compression
• Compression has two major benefits :
a) It reduces the space needed to store a file.
b) It also increases the speed of data transfer to or from a disk or drive.
• The following are the commonly used methods of compression in Hadoop :
a) Deflate, b) Gzip, c) Bzip2, d) Lzo, e) Lz4 and f) Snappy.
• All these compression methods primarily provide optimization of speed and storage space, and they all have different characteristics and advantages.
• Gzip is a general-purpose compressor used to save space and performs faster than bzip2, but the decompression speed of bzip2 is good.
• Lzo, Lz4 and Snappy can be optimized as required and hence are the better tools in comparison to the others.
Codecs :
• A codec is an algorithm that is used to perform compression and decompression of a data stream in order to transmit or store it.
• In Hadoop, these compression and decompression operations run with different codecs and with different compression formats.
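• As an illustration of how a codec is used programmatically, the sketch below (modelled on the common StreamCompressor pattern; the class name is arbitrary) instantiates a codec by class name and uses it to compress standard input onto standard output :

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompressor {
    public static void main(String[] args) throws Exception {
        // Fully qualified codec class name passed on the command line,
        // e.g. org.apache.hadoop.io.compress.GzipCodec
        Class<?> codecClass = Class.forName(args[0]);
        Configuration conf = new Configuration();
        CompressionCodec codec =
                (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);

        // Wrap standard output in a compressing stream and copy stdin through it.
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish();   // finish writing compressed data without closing the stream
    }
}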
Serialization
• Serialization is the process of converting a data object, a combination of code and data represented within a region of data storage, into a series of bytes that saves the state of the object in an easily transmittable form. In this serialized form, the data can be delivered to another data store, application, or some other destination.
• Data serialization is the process of converting an object into a stream of bytes to more easily save or transmit it.
• Fig. 3.5.1 shows serialization and deserialization.
• The reverse process, constructing a data structure or object from a series of bytes, is deserialization. The deserialization process recreates the object, thus making the data easier to read and modify as a native structure in a programming language.
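• Hadoop's own serialization mechanism is based on the Writable interface. The short sketch below (illustrative; the value 163 and class name are arbitrary) serializes an IntWritable to a byte array and deserializes it back :

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import org.apache.hadoop.io.IntWritable;

public class WritableSerializationDemo {
    public static void main(String[] args) throws Exception {
        // Serialization : write the object's state into a stream of bytes.
        IntWritable original = new IntWritable(163);
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytesOut));
        byte[] serialized = bytesOut.toByteArray();   // 4 bytes for an IntWritable

        // Deserialization : rebuild an equivalent object from those bytes.
        IntWritable restored = new IntWritable();
        restored.readFields(new DataInputStream(new ByteArrayInputStream(serialized)));

        System.out.println(restored.get());           // prints 163
    }
}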