Big Data Analytics - Unit 3 Notes
UNIT III : Basics of Hadoop

Syllabus
MapReduce workflows - unit tests with MRUnit - test data and local tests - anatomy of MapReduce job run - classic Map-reduce - YARN - failures in classic Map-reduce and YARN - job scheduling - shuffle and sort - task execution - MapReduce types - input formats - output formats.

Contents
3.1 Data Format
3.2 Hadoop Streaming
3.3 Hadoop Pipes
3.4 Design of Hadoop Distributed File System (HDFS)
3.5 Hadoop I/O
3.6 File-based Data Structures
3.7 Cassandra - Hadoop Integration
3.8 Two Marks Questions with Answers

3.1 Data Format
• The data is stored using a line-oriented ASCII format, in which each line is a record. The format supports a rich set of meteorological elements, many of which are optional or have variable data lengths.
• The default output format provided by Hadoop is TextOutputFormat, which writes records as lines of text. If no file output format is specified, text files are created as output files. Output key-value pairs can be of any type, because TextOutputFormat converts them into strings with toString().
• HDFS data is stored in something called blocks. These blocks are the smallest unit of data that the file system can store. Files are processed and broken down into these blocks, which are then distributed across the cluster and also replicated for safety.

Analyzing the Data with Hadoop
• Hadoop supports parallel processing, so we take advantage of this by expressing a query as a MapReduce job. After some local, small-scale testing, we will be able to run it on a cluster of machines.
• MapReduce works by breaking the processing into two phases : The map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions : The map function and the reduce function.
• MapReduce is a method for distributing a task across multiple nodes. Each node processes the data stored on that node to the extent possible.
• A running MapReduce job consists of various phases, which are described in the following Fig. 3.1.1.

Fig. 3.1.1 Phases of Hadoop MapReduce

• In the map job, we split the input dataset into chunks. Map tasks process these chunks in parallel. The map outputs are used as inputs for the reduce tasks. Reducers process the intermediate data from the maps into smaller tuples, leading to the final output of the framework.
• Hadoop also takes care of replicating data blocks across the distributed storage, recovering from failures of computation and storage, and handling the network infrastructure, monitoring and security around the job. Programs can also be written in any scripting language using the streaming API of Hadoop.

Scaling Out
• To scale out, we need to store the data in a distributed file system, typically HDFS, to allow Hadoop to move the MapReduce computation to each machine hosting a part of the data.
• A MapReduce job is a unit of work that the client wants to be performed : It consists of the input data, the MapReduce program and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types : Map tasks and reduce tasks.
• There are two types of nodes that control the job execution process : A job tracker and a number of task trackers.
• Job tracker : This tracker plays the role of scheduling jobs and tracking all jobs assigned to the task trackers.
• Task tracker : This tracker plays the role of tracking tasks and reporting the status of tasks to the job tracker.
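To make the map and reduce functions described above concrete, the following is a minimal word-count style sketch against the org.apache.hadoop.mapreduce API. It is not part of the syllabus text; the class names and the word-count logic are illustrative assumptions only.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase : input key-value pair is (byte offset, line of text),
    // output key-value pair is (word, 1).
    class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);          // emit (word, 1)
            }
        }
    }

    // Reduce phase : all values for the same key arrive together,
    // so the reducer simply sums the counts for each word.
    class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum)); // emit (word, total count)
        }
    }

The key-value types of the map output (Text, IntWritable) must match the input types expected by the reducer, which is exactly the "types chosen by the programmer" point made above.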
• Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.
• That split information is used by the YARN ApplicationMaster to try to schedule map tasks on the same node where the split data is residing, thus making the task data local. If map tasks were spawned at random locations, each map task would have to copy the data it needs to process from the DataNode where that split data is residing, consuming a lot of cluster bandwidth. By trying to schedule map tasks on the same node where the split data is residing, the Hadoop framework sends computation to data rather than bringing data to computation, saving cluster bandwidth. This is called the data locality optimization.
• Map tasks write their output to the local disk, not to HDFS. Why is this ? Map output is intermediate output : It is processed by reduce tasks to produce the final output and, once the job is complete, the map output can be thrown away. So storing it in HDFS with replication would be overkill. If the map task fails before the map output has been consumed by the reduce task, Hadoop will automatically rerun the map task on another node to re-create the map output.
• Fig. 3.1.2 shows MapReduce data flow with a single reduce task.

Fig. 3.1.2 MapReduce data flow with a single reduce task

• The number of reduce tasks is not governed by the size of the input, but is specified independently. When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys in each partition, but the records for any given key are all in a single partition.
• Fig. 3.1.3 shows MapReduce data flow with multiple reduce tasks.

Fig. 3.1.3 MapReduce data flow with multiple reduce tasks

• Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all.
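Where the combiner fits is easiest to see in the job driver. The following is a minimal sketch that reuses the hypothetical TokenizerMapper and IntSumReducer classes from the earlier example; reusing the reducer as the combiner is only valid here because summing counts is associative and commutative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(TokenizerMapper.class);
            // The combiner runs on the map output before it is sent to the reducers;
            // Hadoop may call it zero, one or many times for a given map output.
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }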
3.2 Hadoop Streaming
• Hadoop Streaming is an API that allows writing mappers and reducers in any language. It uses UNIX standard streams as the interface between Hadoop and the user application. Hadoop Streaming is a utility that comes with the Hadoop distribution.
• Streaming is naturally suited for text processing. The data view is line-oriented and processed as a key-value pair separated by a 'tab' character. The Reduce function reads lines from the standard input, which is sorted by key, and writes its results to the standard output.
• It helps in real-time data processing, which is much faster using MapReduce programming running on a large cluster. There are different technologies like Spark, Kafka and others which also help in real-time processing.

Features of Hadoop Streaming :
• Users can execute non-Java programmed MapReduce jobs on Hadoop clusters. Supported languages include Python, Perl, Java and C++.
• Hadoop Streaming monitors the entire execution of a job and provides logs of the job's execution for analysis.
• Hadoop Streaming works on the MapReduce paradigm, so it supports scalability, flexibility and security / authentication.
• Hadoop Streaming jobs are quick to develop and don't require much programming.

Hadoop Streaming Utility :
• The following command shows how the streaming utility is invoked :

    hadoop jar /path/to/hadoop-streaming.jar \
        -mapper /path/to/mapper.py \
        -reducer /path/to/reducer.py \
        -input /user/hduser/books/* \
        -output /user/hduser/books-output

Where :
Input = Input location for the Mapper from where it can read input
Output = Output location for the Reducer to store the output
Mapper = The executable file of the Mapper
Reducer = The executable file of the Reducer

• Fig. 3.2.1 shows the code execution process.

Fig. 3.2.1 Code execution process

• Map and reduce functions read their input from STDIN and produce their output to STDOUT. In the diagram above, the Mapper reads the input data from the input reader/format in the form of key-value pairs, maps them as per the logic written in the code, and then passes them through the reduce stream, which performs data aggregation and releases the data to the output.

3.3 Hadoop Pipes
• Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the task tracker communicates with the process running the C++ map or reduce function.
• Fig. 3.3.1 shows execution of streaming and pipes.

Fig. 3.3.1 Execution of streaming and pipes

• With Hadoop Pipes, we can implement applications that require higher performance in numerical calculations using C++ in MapReduce. The pipes utility works by establishing a persistent socket connection on a port with the Java pipes task on one end, and the external C++ process at the other.
• Other dedicated alternatives and implementations are also available, such as Pydoop for Python and libhdfs for C. These are mostly built as wrappers and are JNI-based. It is, however, noticeable that MapReduce tasks are often a smaller component of a larger pipeline of chaining, redirecting and recursing MapReduce jobs. This is usually done with the help of higher-level languages or APIs like Hive and Cascading, which can be used to express such data extraction and transformation problems.

3.4 Design of Hadoop Distributed File System (HDFS)
• The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS stores file system metadata and application data separately. Like other distributed file systems, such as GFS, HDFS stores metadata on a dedicated server, called the NameNode. Application data are stored on other servers called DataNodes. All servers are fully connected and communicate with each other using TCP-based protocols.
• HDFS is a distributed file system that handles large data sets running on commodity hardware. It is used to scale a single Apache Hadoop cluster to hundreds of nodes.
• A block is the minimum amount of data that HDFS can read or write. HDFS blocks are 128 MB by default and this is configurable. When a file is saved in HDFS, the file is broken into smaller chunks or "blocks".
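As a small worked example of how a file maps onto blocks (the file size used here is an assumption, not from the text) : a 500 MB file stored with the default 128 MB block size occupies four blocks, and the last block holds only the remaining 116 MB.

    public class BlockCount {
        public static void main(String[] args) {
            long blockSize = 128L * 1024 * 1024;   // default HDFS block size (128 MB)
            long fileSize  = 500L * 1024 * 1024;   // hypothetical 500 MB file

            long fullBlocks  = fileSize / blockSize;                 // 3 full blocks
            long remainder   = fileSize % blockSize;                 // 116 MB left over
            long totalBlocks = fullBlocks + (remainder > 0 ? 1 : 0); // 4 blocks in total

            System.out.println(totalBlocks + " blocks, last block = "
                    + (remainder / (1024 * 1024)) + " MB");
        }
    }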
• HDFS is a fault-tolerant and resilient system, meaning it prevents a failure in a node from affecting the overall system's health and allows for recovery from failure too. In order to achieve this, data stored in HDFS is automatically replicated across different nodes.
• HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to most other existing file systems; one can create and remove files, move a file from one directory to another, or rename a file.
• Hadoop distributed file system is a block-structured file system where each file is divided into blocks of a pre-determined size. These blocks are stored across a cluster of one or several machines.
• Apache Hadoop HDFS architecture follows a master/slave architecture, where a cluster comprises a single NameNode (master node) and all the other nodes are DataNodes (slave nodes).
• HDFS can be deployed on a broad spectrum of machines that support Java. Though one can run several DataNodes on a single machine, in the practical world these DataNodes are spread across various machines.
• Design issues of HDFS :
1. Commodity hardware : HDFS does not require expensive hardware for executing user tasks. It is designed to run on clusters of commodity hardware.
2. Streaming data access : HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern.
3. Single writer, no arbitrary file modifications : Files in HDFS may be written to by a single writer. Writes are always made at the end of the file. There is no support for multiple writers, or for modifications at arbitrary offsets in the file.
4. Low-latency data access.
5. Lots of small files.
6. Store very large files.
• HDFS should achieve the following goals :
1. Managing large datasets : Organizing and storing huge datasets can be a hard task to handle. HDFS is used to manage the applications that have to deal with huge datasets. To do this, HDFS should have hundreds of nodes per cluster.
2. Detecting faults : HDFS should have technology in place to scan and detect faults quickly and effectively, as it includes a large amount of commodity hardware. Failure of components is a common issue.
3. Hardware efficiency : When large datasets are involved, data locality can reduce the network traffic and increase the processing speed.

HDFS Architecture
• Fig. 3.4.1 shows HDFS architecture.

Fig. 3.4.1 Hadoop architecture

• As described above, each file is divided into blocks of a pre-determined size that are stored across a cluster of one or several machines, and the cluster follows a master/slave architecture with a single NameNode and multiple DataNodes.
• DataNodes process and store data blocks, while the NameNode manages the many DataNodes, maintains data block metadata and controls client access.

NameNode and DataNode
• The NameNode holds the metadata for HDFS, like namespace information, block information etc. When in use, all this information is stored in main memory, but it is also stored on disk for persistent storage.
• The NameNode manages the file system namespace. It keeps the directory tree of all files in the file system, and metadata about files and directories.
• A DataNode is a slave node in HDFS that stores the actual data as instructed by the NameNode. In brief, the NameNode controls and manages a single or multiple DataNodes.
• A DataNode serves read or write requests. It also creates, deletes and replicates blocks on the instructions from the NameNode.
• Fig. 3.4.2 shows the NameNode and how it stores information on disk.

Fig. 3.4.2 Name node

• Two different files are :
1. fsimage : It is the snapshot of the file system when the name node started.
2. Edit logs : It is the sequence of changes made to the file system after the name node started.
• Only on a restart of the namenode are the edit logs applied to the fsimage to get the latest snapshot of the file system.
• But namenode restarts are rare in production clusters, which means edit logs can grow very large on clusters where the namenode runs for a long period of time. The following issues are encountered in this situation :
1. The edit log becomes very large, which is challenging to manage.
2. A namenode restart takes a long time because a lot of changes have to be merged.
3. In the case of a crash, we will lose a huge amount of metadata, since the fsimage is very old.
• So to overcome these issues we need a mechanism which will help us keep the edit log size manageable and have an up-to-date fsimage, so that the load on the namenode reduces.
• The Secondary Namenode helps to overcome the above issues by taking over the responsibility of merging edit logs with the fsimage from the namenode.
• Fig. 3.4.3 shows the secondary Namenode.

Fig. 3.4.3 Secondary Namenode

• Working of secondary Namenode :
1. It gets the edit logs from the Namenode in regular intervals and applies them to the fsimage.
2. Once it has a new fsimage, it copies it back to the Namenode.
3. The Namenode will use this fsimage for the next restart, which will reduce the startup time.
• The Secondary Namenode's whole purpose is to have a checkpoint in HDFS. It is just a helper node for the Namenode. That is why it is also known as the checkpoint node inside the community.

HDFS Block
• HDFS is a block-structured file system. In general the user's data is stored in HDFS in terms of blocks. The files in the file system are divided into one or more segments called blocks. The default size of an HDFS block was 64 MB in earlier Hadoop versions (128 MB in recent versions) and can be increased as per need.
• HDFS is fault tolerant such that if a data node fails then the current block write operation on the data node is re-replicated to some other node. The block size, number of replicas and replication factor are specified in the Hadoop configuration file (see the configuration sketch at the end of this section).
• The synchronization between name node and data node is done by heartbeat functions which are periodically generated by the data node to the name node.
• Apart from the above components, the job tracker and task trackers are used when map reduce applications run over HDFS. Hadoop provides one job tracker and several task trackers. The job tracker runs on the name node like a master while task trackers run on data nodes like slaves.
• The job tracker is responsible for taking the requests from a client and assigning task trackers to it with tasks to be performed. The job tracker will try to assign tasks to the task tracker on the data nodes where the data is locally present.
• If for some reason the node fails, the job tracker assigns the task to another task tracker where a replica of the data exists, since the data blocks are replicated across the data nodes. This ensures that the job does not fail even if a node fails within the cluster.
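Before moving to the command-line interface, a small sketch of how the block size and replication factor mentioned above can be read from the Hadoop configuration. The property names dfs.blocksize and dfs.replication are the standard HDFS keys; the fallback values passed here are only defaults used when the keys are absent.

    import org.apache.hadoop.conf.Configuration;

    public class HdfsConfigPeek {
        public static void main(String[] args) {
            // Reads core-site.xml / hdfs-site.xml from the classpath, if present.
            Configuration conf = new Configuration();

            // Note: if dfs.blocksize is set with a size suffix such as "128m",
            // Configuration#getLongBytes should be used instead of getLong.
            long blockSize   = conf.getLong("dfs.blocksize", 134217728L); // 128 MB fallback
            int  replication = conf.getInt("dfs.replication", 3);         // 3 replicas fallback

            System.out.println("Block size  : " + blockSize + " bytes");
            System.out.println("Replication : " + replication);
        }
    }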
The Command-Line Interface :
• HDFS can be manipulated using the command line. All the commands used for manipulating HDFS through the command-line interface begin with the "hadoop fs" command. Most of the Linux commands are supported over HDFS, prefixed with a "-" sign.
• For example : The command for listing the files in a Hadoop directory will be,
    hadoop fs -ls
• The general syntax of HDFS command-line manipulation is,
    hadoop fs -<command>

Java Interface
• Hadoop is written in Java, so most Hadoop filesystem interactions are mediated through the Java API. The filesystem shell, for example, is a Java application that uses the Java FileSystem class to provide file system operations. By exposing its filesystem interface as a Java API, Hadoop makes it awkward for non-Java applications to access HDFS.

1. Reading Data from a Hadoop URL :
• To read a file from a Hadoop file system, one of the simplest ways is to use a java.net.URL object to open a stream to read the data. The syntax is as follows :

    InputStream in = null;
    try {
        in = new URL("hdfs://host/path").openStream();
        // process in
    } finally {
        IOUtils.closeStream(in);
    }

• Java recognizes Hadoop's hdfs URL scheme by calling the setURLStreamHandlerFactory() method on URL with an instance of FsUrlStreamHandlerFactory.
• The setURLStreamHandlerFactory method of the URL class is used to register the factory responsible for creating URL stream handlers for the Java Virtual Machine; the handler is then used to retrieve the stream for a given scheme. This method can only be called once per JVM, so it is typically executed in a static block.
• Example : A program such as URLCat displays files from a Hadoop file system on standard output, using imports such as java.net.URL, org.apache.hadoop.fs.FsUrlStreamHandlerFactory and org.apache.hadoop.io.IOUtils.

2. Reading Data Using the FileSystem API :
• Sometimes it is not possible to set a URLStreamHandlerFactory for our application, so in that case we use the FileSystem API for opening an input stream for a file.
• A file in a Hadoop filesystem is represented by a Hadoop Path object.
• FileSystem is a generic abstract class used to interface with a file system. The FileSystem class also serves as a factory for concrete implementations. There are two static factory methods for getting a FileSystem instance :
    public static FileSystem get(Configuration conf) throws IOException
    public static FileSystem get(URI uri, Configuration conf) throws IOException
• The first returns the default file system specified in the configuration; the second uses the given URI's scheme and authority to determine the file system to use.
• A Configuration object encapsulates a client's or server's configuration, which is set using configuration files read from the classpath, such as core-site.xml.

FSDataInputStream :
• The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io class.
• This class is a specialization of java.io.DataInputStream with support for random access, so we can read from any part of the stream. It is in the package org.apache.hadoop.fs.
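The following is a minimal sketch of reading a file through the FileSystem API just described; the path passed in args[0] and the 4096-byte buffer size are arbitrary choices for illustration.

    import java.io.InputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class FileSystemCat {
        public static void main(String[] args) throws Exception {
            String uri = args[0];                    // e.g. hdfs://host/user/hduser/file.txt
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);

            InputStream in = null;
            try {
                in = fs.open(new Path(uri));         // returns an FSDataInputStream
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }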
Writing Data :
• The FileSystem class has a number of methods for creating a file. The simplest is the method that takes a Path object for the file to be created and returns an output stream to write to :
    public FSDataOutputStream create(Path f) throws IOException

FSDataOutputStream :
• The create() method on FileSystem returns an FSDataOutputStream which, like FSDataInputStream, has a method (getPos()) for querying the current position in the file. It is in the package org.apache.hadoop.fs.
• We can append to an existing file using the append() method :
    public FSDataOutputStream append(Path f) throws IOException
• The append operation allows a single writer to modify an already written file by opening it and writing data from the final offset in the file. With this API, applications that produce unbounded files, such as log files, can write to an existing file after a restart, for example. The append operation is optional and not implemented by all Hadoop filesystems. A short client-side write example is given after the file-read walkthrough below.

Data Flow
1. Anatomy of a File Read :
• Fig. 3.4.4 shows the sequence of events when reading a file, and how data flows between the client and HDFS, the namenode and the datanodes.

Fig. 3.4.4 Client reading data from HDFS

• The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (DFS). DFS calls the namenode, using RPC, to determine the locations of the blocks for the first few blocks in the file.
• For each block, the namenode returns the addresses of the datanodes that have a copy of that block. Furthermore, the datanodes are sorted according to their proximity to the client. If the client is itself a datanode, then it will read from the local datanode, if it hosts a copy of the block.
• The DFS returns an FSDataInputStream to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.
• The client then calls read() on the stream. DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, then connects to the first (closest) datanode for the first block in the file.
• Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream. When the end of the block is reached, DFSInputStream will close the connection to the datanode, then find the best datanode for the next block. This happens transparently to the client, which from its point of view is just reading a continuous stream.
• Blocks are read in order, with the DFSInputStream opening new connections to datanodes as the client reads through the stream. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream.
• During reading, if the DFSInputStream encounters an error while communicating with a datanode, then it will try the next closest one for that block.
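Before walking through what happens inside HDFS on a write, here is the promised client-side counterpart to the read example, using the create() method described earlier. The destination path and the sample text are assumptions made for the sketch.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FileSystemWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);     // default filesystem from configuration

            Path dest = new Path("/user/hduser/notes/hello.txt");  // hypothetical destination
            FSDataOutputStream out = fs.create(dest);              // overwrites if it exists
            try {
                out.write("hello, hdfs".getBytes(StandardCharsets.UTF_8));
                System.out.println("current position in file : " + out.getPos() + " bytes");
            } finally {
                out.close();
            }
        }
    }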
2. Anatomy of a File Write :
• Fig. 3.4.5 shows the anatomy of a file write.

Fig. 3.4.5 Anatomy of a file write

1. The client calls create() on DistributedFileSystem to create a file.
2. An RPC call to the namenode happens through the DFS to create a new file.
3. As the client writes data, the data is split into packets by DFSOutputStream, which are then written to an internal queue, called the data queue. The DataStreamer consumes the data queue.
4. The DataStreamer streams the packets to the first DataNode in the pipeline, which stores each packet and forwards it to the second DataNode in the pipeline.
5. In addition to the internal queue, DFSOutputStream also manages the "ack queue" of packets that are waiting to be acknowledged by the DataNodes.
6. When the client finishes writing the file, it calls close() on the stream.

Heartbeat Mechanism in HDFS
• A datanode sends its heartbeat to the namenode, and a task tracker sends its heartbeat to the job tracker.
• Fig. 3.4.6 shows the heartbeat mechanism.

Fig. 3.4.6 Heartbeat mechanism

• The connectivity between the NameNode and a DataNode is managed by persistent heartbeats that are sent by the DataNode every three seconds.
• The heartbeat provides the NameNode confirmation about the availability of the blocks and the replicas of the DataNode. Additionally, heartbeats also carry information about total storage capacity, storage in use and the number of data transfers currently in progress. These statistics are used by the NameNode for managing space allocation and load balancing.
• During normal operations, if the NameNode does not receive a heartbeat from a DataNode in ten minutes, the NameNode considers that DataNode to be out of service and the block replicas hosted on it to be unavailable. The NameNode then schedules the creation of new replicas of those blocks on other DataNodes.
• The heartbeats carry roundtrip communications and instructions from the NameNode, including commands to :
a) Replicate blocks to other nodes.
b) Remove local block replicas.
c) Reregister the node.
d) Shut down the node.
e) Send an immediate block report.

Role of Sorter, Shuffler and Combiner in MapReduce Paradigm
• A combiner is an optional class that operates by accepting the inputs from the Map class and thereafter passing the output key-value pairs to the Reducer class.
• The main function of a combiner is to summarize the map output records with the same key. The output of the combiner will be sent over the network to the actual reducer task as input.
• The process of transferring data from the mappers to the reducers is known as shuffling, i.e. the process by which the system performs the sort and transfers the map output to the reducer as input. So, the shuffle phase is necessary for the reducers; otherwise, they would not have any input.
• The shuffle phase in Hadoop transfers the map output from the Mapper to a Reducer in MapReduce. The sort phase in MapReduce covers the merging and sorting of map outputs.
• Data from the mapper are grouped by the key, split among reducers and sorted by the key. Every reducer obtains all values associated with the same key. The shuffle and sort phases in Hadoop occur simultaneously and are done by the MapReduce framework.

3.5 Hadoop I/O
• The Hadoop input/output system comes with a set of primitives. Hadoop deals with multi-terabyte datasets; a special consideration of these primitives gives an idea of how Hadoop handles data input and output.

Data Integrity
• Data integrity means that data should remain accurate and consistent across all its storing, processing and retrieval operations.
• However, every I/O operation on the disk or network carries with it a small chance of introducing errors into the data that it is reading or writing.
• The usual way of detecting corrupted data is by computing a checksum for the data when it first enters the system, and again whenever it is transmitted across a channel that is unreliable and hence capable of corrupting the data.
• The commonly used error-detecting code is CRC-32, which computes a 32-bit integer checksum for input of any size.

Data Integrity in HDFS :
• HDFS transparently checksums all data written to it and by default verifies checksums when reading data. A separate checksum is created for every io.bytes.per.checksum bytes of data. The default is 512 bytes, and because a CRC-32 checksum is 4 bytes long, the storage overhead is less than 1 %.
• All data that enters the system is verified by the datanodes before being forwarded for storage or further processing. Data sent to the datanode pipeline is verified through checksums, and any corruption found is immediately notified to the client.
• A read from the datanode also goes through the same drill. The datanodes maintain a log of checksum verification to keep track of the verified blocks. The log is updated by the datanode upon receiving a block verification success signal from the client. This type of statistics helps in keeping the bad disks at bay.
• Apart from this, a periodic verification of the block store is made with the help of the DataBlockScanner running along with the datanode thread in the background. This protects data from corruption in the physical storage media.
• Since HDFS stores replicas of blocks, it can "heal" corrupted blocks by copying one of the good replicas to produce a new, uncorrupt replica.
• If a client detects an error when reading a block, it reports the bad block and the datanode it was trying to read from to the namenode before throwing a ChecksumException. The namenode marks the block replica as corrupt, so it does not direct clients to it, or try to copy this replica to another datanode.
• It then schedules a copy of the block to be replicated on another datanode, so its replication factor is back at the expected level. Once this has happened, the corrupt replica is deleted.
• It is possible to disable verification of checksums by passing false to the setVerifyChecksum() method on FileSystem before using the open() method to read a file.

LocalFileSystem :
• The Hadoop local file system performs client-side checksums. When a file is created, it automatically creates a transparent .crc file in the background alongside the file, which caches the checksum chunks used to check the file.
• Each chunk can check a segment of up to 512 bytes; the chunk size is controlled by the file.bytes-per-checksum property and is stored as metadata in the .crc file. A file can therefore be read correctly even if the settings change, and if an error is detected the local system throws a checksum exception.

ChecksumFileSystem :
• The local file system uses checksum file systems as a security measure to ensure that the data is not corrupt or damaged in any way.
• In this file system, the underlying file system is called the raw file system. If an error is detected while working on the checksum file system, it will call reportChecksumFailure().
• Here the local system moves the affected file to another directory as a file titled bad_file. It is then the responsibility of an administrator to keep a check on these bad_files and take the necessary action.
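To illustrate the checksum idea itself (this is a conceptual sketch, not HDFS's internal implementation), the code below computes a CRC-32 value over a small byte array with java.util.zip.CRC32 and shows how a single corrupted bit changes the checksum; the sample data is made up.

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    public class ChecksumDemo {
        private static long crcOf(byte[] data) {
            CRC32 crc = new CRC32();
            crc.update(data, 0, data.length);
            return crc.getValue();          // 32-bit checksum, returned in a long
        }

        public static void main(String[] args) {
            byte[] original = "hadoop data block".getBytes(StandardCharsets.UTF_8);
            long storedChecksum = crcOf(original);  // computed when the data enters the system

            // Simulate a single-bit corruption during transfer or storage.
            byte[] corrupted = original.clone();
            corrupted[0] ^= 0x01;

            System.out.println("stored     : " + storedChecksum);
            System.out.println("recomputed : " + crcOf(corrupted));
            System.out.println("match      : " + (storedChecksum == crcOf(corrupted))); // false -> corruption detected
        }
    }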
Compression
• Compression has two major benefits :
a) It creates space for a file.
b) It also increases the speed of data transfer to a disk or drive.
• The following are the commonly used methods of compression in Hadoop : a) Deflate, b) Gzip, c) Bzip2, d) Lzo, e) Lz4 and f) Snappy.
• All these compression methods primarily provide optimization of speed and storage space, and they all have different characteristics and advantages.
• Gzip is a general compressor used to clear space and performs faster than bzip2, but the decompression speed of bzip2 is good. Lzo, Lz4 and Snappy can be optimized as required and hence are the better tools in comparison to the others.

Codecs
• A codec is an algorithm that is used to perform compression and decompression of a data stream to transmit or store it.
• In Hadoop, these compression and decompression operations run with different codecs and with different compression formats.

Serialization
• Serialization is the process of converting a data object, a combination of code and data represented within a region of data storage, into a series of bytes that saves the state of the object in an easily transmittable form. In this serialized form, the data can be delivered to another data store, application, or some other destination.
• Data serialization is the process of converting an object into a stream of bytes to more easily save or transmit it.
• Fig. 3.5.1 shows serialization and deserialization.

Fig. 3.5.1 Serialization and deserialization

• The reverse process, constructing a data structure or object from a series of bytes, is deserialization. The deserialization process recreates the object, thus making the data easier to read and modify as a native structure in a programming language.
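In Hadoop itself, serialization of keys and values is handled by the Writable interface. The following is a minimal sketch, mirroring the serialize/deserialize round trip of Fig. 3.5.1 by writing an IntWritable to a byte array and reading it back; the value 163 is arbitrary.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;

    public class WritableRoundTrip {
        public static void main(String[] args) throws IOException {
            // Serialization : object state -> stream of bytes
            IntWritable original = new IntWritable(163);
            ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytesOut));
            byte[] serialized = bytesOut.toByteArray();   // 4 bytes for an IntWritable

            // Deserialization : stream of bytes -> reconstructed object
            IntWritable restored = new IntWritable();
            restored.readFields(new DataInputStream(new ByteArrayInputStream(serialized)));

            System.out.println(serialized.length + " bytes, value = " + restored.get()); // 4 bytes, value = 163
        }
    }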
