Hadoop Related Tools
Syllabus
Hbase - data model and implementations - Hbase clients - Hbase examples - praxis. Pig - Grunt - Pig data model - Pig Latin - developing and testing Pig Latin scripts. Hive - data types and file
formats - HiveQL data definition - HiveQL data manipulation - HiveQL queries.
Contents
5.1 Hbase
5.2 Data Model and Implementations
5.3 Hbase Clients
5.4 Praxis
5.5 Pig
5.6 Hive
5.7 HiveQL Data Definition
5.8 HiveQL Data Manipulation
5.9 HiveQL Queries
5.10 Two Marks Questions with Answers

5.1 Hbase
• HBase is an open source, non-relational, distributed database modeled after Google's BigTable. HBase is an open source, sorted map data store built on Hadoop. It is column oriented and horizontally scalable.
• It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the Hadoop file system. It runs on top of Hadoop and HDFS, providing BigTable-like capabilities for Hadoop.
• HBase supports massively parallelized processing via MapReduce, for using HBase as both source and sink.
• HBase supports an easy-to-use Java API for programmatic access. It also supports Thrift and REST for non-Java front-ends.
• HBase is a column oriented distributed database in the Hadoop environment. It can store massive amounts of data, from terabytes to petabytes. HBase is scalable, distributed big data storage on top of the Hadoop ecosystem.
• The HBase physical architecture consists of servers in a master-slave relationship. Typically, the HBase cluster has one master node, called HMaster, and multiple Region Servers called HRegionServer. Fig. 5.1.1 shows the HBase architecture.
Fig. 5.1.1 Hbase architecture (client, HMaster, region servers hosting regions, ZooKeeper and HDFS)
• Zookeeper is a centralized monitoring server which maintains configuration information and provides distributed synchronization. If a client wants to communicate with region servers, the client has to approach ZooKeeper first.
• HMaster is the master server of HBase and it coordinates the HBase cluster. HMaster is responsible for the administrative operations of the cluster.
• The Region Servers, in communication with HMaster and ZooKeeper, perform the following functions :
1. Hosting and managing regions
2. Splitting regions automatically
3. Handling read and write requests
4. Communicating with clients directly
• HRegions : For each column family, HRegions maintain a store. The main components of HRegions are the MemStore and the HFile.
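Because clients locate regions through ZooKeeper, a client program only needs the address of the ZooKeeper quorum to reach the cluster. The following is a minimal sketch using the same old-style HTable/Get client API used later in this chapter; the host name, table name and column names are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseConnect {
    public static void main(String[] args) throws Exception {
        // The client only needs the ZooKeeper quorum; region locations are
        // discovered through ZooKeeper and the catalog tables.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zkhost.example.com");      // hypothetical host
        conf.set("hbase.zookeeper.property.clientPort", "2181");

        // Reads and writes are routed to the region server currently hosting the region.
        HTable table = new HTable(conf, "test");                       // assumed table
        Result result = table.get(new Get(Bytes.toBytes("row1")));
        byte[] value = result.getValue(Bytes.toBytes("data"), Bytes.toBytes("FN"));
        System.out.println(value == null ? "not found" : Bytes.toString(value));
        table.close();
    }
}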
• The data model in HBase is designed to accommodate semi-structured data that could vary in field size, data type and columns.
• HBase is a column-oriented, non-relational database. This means that data is stored in individual columns and indexed by a unique row key. This architecture allows for rapid retrieval of individual rows and columns and efficient scans over individual columns within a table.
• Both data and requests are distributed across all servers in an HBase cluster, allowing users to query results on petabytes of data within milliseconds. HBase is most effectively used to store non-relational data, accessed via the HBase API.
Features and Applications of Hbase
Features of Hbase :
1. Hbase is linearly scalable.
2. It has automatic failure support.
3. It provides consistent reads and writes.
4. It integrates with Hadoop, both as a source and a destination.
5. It has an easy Java API for the client.
6. It provides data replication across clusters.
Where to use Hbase ?
1. Apache Hbase is used to have random, real-time read/write access to Big Data.
2. It hosts very large tables on top of clusters of commodity hardware.
3. Apache Hbase is a non-relational database modeled after Google's Bigtable. Bigtable acts upon Google File System; likewise, Apache HBase works on top of Hadoop and HDFS.
Applications of Hbase :
1. It is used whenever there is a need to write write-heavy applications.
2. Hbase is used whenever we need to provide fast random access to available data.
3. Companies such as Facebook, Twitter, Yahoo and Adobe use HBase internally.
Difference between HDFS and Hbase
1. HDFS is a distributed file system suitable for storing large files, whereas HBase is a database built on top of HDFS.
2. HDFS does not support fast individual record lookups, whereas HBase provides fast lookups for larger tables.
3. HDFS provides high latency batch processing, whereas HBase provides low latency access to single rows from billions of records.
4. HDFS provides only sequential access of data, whereas HBase internally uses hash tables, provides random access and stores the data in indexed HDFS files for faster lookups.
5. HDFS is suited for high latency operations, whereas HBase is suited for low latency operations.
6. In HDFS, data are primarily accessed through MapReduce jobs, whereas in HBase data are accessed through shell commands, client APIs in Java, REST, Avro or Thrift.
Difference between Hbase and Relational Database
1. HBase is schema-less, whereas a relational database is bound by its schema.
2. HBase is a column-oriented datastore, whereas a relational database is a row-oriented datastore.
3. HBase is designed to store denormalized data, whereas a relational database is designed to store normalized data.
4. HBase contains wide and sparsely populated tables, whereas a relational database contains thin tables.
5. HBase supports automatic partitioning, whereas a relational database has no built-in support for partitioning.
6. HBase is good for semi-structured as well as structured data, whereas a relational database is good only for structured data.
7. No transactions are there in HBase, whereas an RDBMS is transactional.
Limitations of HBase
1. It takes a very long time to recover if the HMaster goes down and it takes a long time to activate another node if the first node goes down.
2. In HBase, cross-data operations and join operations are very difficult to perform.
3. HBase needs a new design when we want to migrate data from RDBMS external sources to HBase servers.
4. HBase has no built-in security of its own; it depends on an external security factor to grant access to the users.
5. HBase allows only one default sort for a table and it does not support large size binary files.
6. HBase is expensive in terms of hardware requirements and memory block allocations.
5.2 Data Model and Implementations
• The Apache HBase data model is designed to accommodate structured or semi-structured data that could vary in field size, data type and columns. HBase stores data in tables, which have rows and columns. The table schema is very different from traditional relational database tables.
• A database consists of multiple tables. Each table consists of multiple rows, sorted by row key. Each row contains a row key and one or more column families.
• Each column family is defined when the table is created. Column families can have multiple columns (family : column). A cell is uniquely identified by (table, row, family : column). A cell contains an uninterpreted array of bytes and a timestamp.
• The HBase data model has some logical components, which are as follows :
1. Tables 2. Rows 3. Column families / Columns 4. Versions / Timestamp 5. Cells
• Tables : The HBase tables are more like logical collections of rows stored in separate partitions called Regions. As shown above, every Region is then served by exactly one Region Server.
• The syntax to create a table in the HBase shell is shown below.
create '<table name>', '<column family>'
• Example : create 'CustomerContactInformation', 'CustomerName', 'ContactInfo'
* Tables are automatically partitioned horizontally by HBase into regions. Each
region comprises a subset of a table's rows. A region is denoted by the table it
belongs to. Fig. 5.2.1 shows region with table.
Fig. 5.2.1 Region with table
• There is one region server per node. There are many regions in a region server. At any time, a given region is pinned to a particular region server. Tables are split into regions and are scattered across region servers. A table must have at least one region.
• Rows : A row is one instance of data in a table and is identified by a rowkey. Rowkeys are unique in a table and are always treated as a byte[ ].
* Column families : Data in a row are grouped together as Column Families. Each
Column Family has one or more Columns and these Columns in a family are stored
together in a low level storage file known as HFile. Column Families form the
basic unit of physical storage to which certain HBase features like compression are
applied.
• Columns : A Column Family is made of one or more columns. A Column is identified by a Column Qualifier that consists of the Column Family name concatenated with the Column name using a colon - example : columnfamily : columnname. There can be multiple Columns within a Column Family and Rows within a table can have a varied number of Columns.
• Cell : A Cell stores data and is essentially a unique combination of rowkey, Column Family and Column (Column Qualifier). The data stored in a Cell is called its value and the data type is always treated as byte[ ].
• Version : The data stored in a cell is versioned and versions of data are identified by the timestamp. The number of versions of data retained in a column family is configurable and this value by default is 3.
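Versions can be read back through the client API; the sketch below, using the same old-style client classes as the examples later in this chapter, asks for up to three versions of one cell (the table and column names are assumptions).

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionRead {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test");                    // assumed table
        Get get = new Get(Bytes.toBytes("row1"));
        get.addColumn(Bytes.toBytes("data"), Bytes.toBytes("FN"));  // family : qualifier
        get.setMaxVersions(3);                                      // ask for up to three versions
        Result result = table.get(get);
        for (KeyValue kv : result.raw()) {
            // Each KeyValue carries a value plus the timestamp that identifies its version.
            System.out.println(kv.getTimestamp() + " -> " + Bytes.toString(kv.getValue()));
        }
        table.close();
    }
}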
• Time-to-Live : TTL is a built-in feature of HBase that ages out data based on its timestamp. This idea comes in handy in use cases where data needs to be held only for a certain duration of time. So, if on a major compaction the timestamp is older than the specified TTL in the past, the record in question doesn't get put in the HFile being generated by the major compaction; that is, the older records are removed as a part of the normal upkeep of the table.
• If TTL is not used and an aging requirement is still needed, then a much more I/O intensive operation would need to be done.
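TTL is set per column family. A minimal sketch of declaring it through the admin API used elsewhere in this chapter is shown below; the table name and the seven-day TTL are assumptions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class TtlTable {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        HTableDescriptor htd = new HTableDescriptor("events");   // hypothetical table
        HColumnDescriptor hcd = new HColumnDescriptor("data");
        hcd.setTimeToLive(7 * 24 * 60 * 60);   // TTL is given in seconds; here, seven days
        htd.addFamily(hcd);
        admin.createTable(htd);
    }
}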
5.3 Hbase Clients
• There are a number of client options for interacting with an HBase cluster.
1. Java
• HBase is written in Java.
• Example : Creating a table and inserting data into an HBase table are shown in the following program.
public class ExampleClient {
    public static void main(String[] args) throws IOException {
        Configuration config = HBaseConfiguration.create();
        // Create table
        HBaseAdmin admin = new HBaseAdmin(config);
        HTableDescriptor htd = new HTableDescriptor("test");
        HColumnDescriptor hcd = new HColumnDescriptor("data");
        htd.addFamily(hcd);
        admin.createTable(htd);
        byte[] tablename = htd.getName();
        // Run some operations -- a put
        HTable table = new HTable(config, tablename);
        byte[] row1 = Bytes.toBytes("row1");
        Put p1 = new Put(row1);
        byte[] databytes = Bytes.toBytes("data");
        p1.add(databytes, Bytes.toBytes("FN"), Bytes.toBytes("value1"));
        table.put(p1);
    }
}
* To create a table, we need to first create an instance of HBaseAdmin and then ask
it to create the table named test with a single column family named data.
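The example above stops at the put; the same program can be extended with a get, a scan and table clean-up. A sketch of that continuation (reusing the config, table, admin, row1 and tablename variables from the program above) might look like this :

        // Read back the row we just inserted.
        Get get = new Get(row1);
        Result result = table.get(get);
        System.out.println("Get: " + result);

        // Scan every row in the table.
        Scan scan = new Scan();
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result scannerResult : scanner) {
                System.out.println("Scan: " + scannerResult);
            }
        } finally {
            scanner.close();
        }

        // Disable and drop the table when it is no longer needed.
        admin.disableTable(tablename);
        admin.deleteTable(tablename);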
2. MapReduce
• HBase classes and utilities in the org.apache.hadoop.hbase.mapreduce package facilitate using HBase as a source and/or sink in MapReduce jobs. The TableInputFormat class makes splits on region boundaries, so maps are handed a single region to work on. The TableOutputFormat will write the result of MapReduce into HBase.
* Example : A MapReduce application to count the number of rows in an HBase
table
public class RowCounter {
    /** Name of this 'program'. */
    static final String NAME = "rowcounter";

    static class RowCounterMapper
            extends TableMapper<ImmutableBytesWritable, Result> {
        /** Counter enumeration to count the actual rows. */
        public static enum Counters {ROWS}

        @Override
        public void map(ImmutableBytesWritable row, Result values,
                Context context)
                throws IOException {
            // Count a row as soon as one non-empty cell is seen.
            for (KeyValue value : values.list()) {
                if (value.getValue().length > 0) {
                    context.getCounter(Counters.ROWS).increment(1);
                    break;
                }
            }
        }
    }

    public static Job createSubmittableJob(Configuration conf, String[] args)
            throws IOException {
        String tableName = args[0];
        Job job = new Job(conf, NAME + "_" + tableName);
        job.setJarByClass(RowCounter.class);
        // Columns are space delimited
        StringBuilder sb = new StringBuilder();
        final int columnoffset = 1;
        for (int i = columnoffset; i < args.length; i++) {
            if (i > columnoffset) {
                sb.append(" ");
            }
            sb.append(args[i]);
        }
        Scan scan = new Scan();
        scan.setFilter(new FirstKeyOnlyFilter());
        if (sb.length() > 0) {
            for (String columnName : sb.toString().split(" ")) {
                String[] fields = columnName.split(":");
                if (fields.length == 1) {
                    scan.addFamily(Bytes.toBytes(fields[0]));
                } else {
                    scan.addColumn(Bytes.toBytes(fields[0]), Bytes.toBytes(fields[1]));
                }
            }
        }
        // Second argument is the table name.
        job.setOutputFormatClass(NullOutputFormat.class);
        TableMapReduceUtil.initTableMapperJob(tableName, scan,
                RowCounterMapper.class, ImmutableBytesWritable.class, Result.class, job);
        job.setNumReduceTasks(0);
        return job;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 1) {
            System.err.println("ERROR: Wrong number of parameters: " + args.length);
            System.err.println("Usage: RowCounter <tablename> [<column1> <column2>...]");
            System.exit(-1);
        }
        Job job = createSubmittableJob(conf, otherArgs);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
3. Avro, REST, and Thrift
«© HBase ships with Avro, REST and Thrift interfaces. These are useful when the
interacting application is written in a language other than Java. In all cases, a Java server hosts an instance of the HBase client, brokering Avro, REST and Thrift requests in and out of the HBase cluster. This extra work of proxying requests and responses means these interfaces are slower than using the Java client directly.
• REST : To put up a Stargate instance, start it using the following command :
% hbase-daemon.sh start rest
• This will start a server instance, by default on port 8080, background it and catch any emissions by the server in logfiles under the HBase logs directory.
• Clients can ask for the response to be formatted as JSON, Google's protobufs, or as XML, depending on how the client HTTP Accept header is set.
• To stop the REST server, type :
% hbase-daemon.sh stop rest
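From any language that can speak HTTP, a row can then be fetched from the REST gateway. The following is a small Java sketch against a hypothetical local gateway; the host, port, table, row and column names are assumptions.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestClientExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint of the form http://<rest-host>:8080/<table>/<row>/<family:qualifier>
        URL url = new URL("http://localhost:8080/test/row1/data:FN");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("Accept", "application/json");   // ask for a JSON-formatted response

        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
        conn.disconnect();
    }
}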
• Thrift : Start a Thrift service by putting up a server to field Thrift clients by running the following :
% hbase-daemon.sh start thrift
• This will start the server instance, by default on port 9090, background it and catch any emissions by the server in logfiles under the HBase logs directory. The HBase Thrift documentation notes the Thrift version used in generating classes.
• To stop the Thrift server, type :
% hbase-daemon.sh stop thrift
• Avro : The Avro server is started and stopped in the same manner as we start and stop the Thrift or REST services. The Avro server by default uses port 9090.
5.4 Praxis
• When an HBase cluster is running under load, the following issues are considered :
1. Versions : A particular HBase version will run on any Hadoop that has a matching minor version. For example, HBase 0.20.5 would run on Hadoop 0.20.2, but HBase 0.19.5 would not run on Hadoop 0.20.0.
2. HDFS : In MapReduce, HDFS files are opened, with their content streamed
through a map task and then closed. In HBase, data files are opened on cluster
startup and kept open. Because of this, HBase tends to see issues not normally
encountered by MapReduce clients.
• Running out of file descriptors : Because of the open files on a loaded cluster, it doesn't take long before we run into system- and Hadoop-imposed limits. Each open file consumes at least one descriptor over on the remote datanode. The default limit on the number of file descriptors per process is 1,024. We can verify that the HBase process is running with sufficient file descriptors by looking at the first few lines of a regionserver's log.
• Running out of datanode threads : The Hadoop datanode has an upper bound of
256 on the number of threads it can run at any one time.
• Sync : We must run HBase on an HDFS that has a working sync; otherwise, there will be loss of data. This means running HBase on Hadoop 0.21.x, which adds a working sync/append to Hadoop 0.20.
• UI : HBase runs a web server on the master to present a view on the state of the running cluster. By default, it listens on port 60010. The master UI displays a list of basic attributes such as software versions, cluster load, request rates, lists of cluster tables and participating regionservers.
• Schemas : HBase tables are like those in an RDBMS, except that cells are versioned, rows are sorted and columns can be added on the fly by the client as long as the column family they belong to preexists.
• Joins : There is no native database join facility in HBase, but wide tables can make it so that there is no need for database joins pulling from secondary or tertiary tables. A wide row can sometimes be made to hold all data that pertains to a particular primary key.
5.5 Pig
• Pig is an open-source, high-level data flow system. It is a high-level platform for creating MapReduce programs used with Hadoop; it translates scripts into efficient sequences of one or more MapReduce jobs.
• Pig offers a high-level language to write data analysis programs, which we call Pig Latin. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
• Pig makes use of both the Hadoop Distributed File System as well as MapReduce.
Features of Pig Hadoop :
1. In-built operators : Apache Pig provides a very good set of operators for performing several data operations like sort, join, filter, etc.
2. Ease of programming.
3. Automatic optimization : The tasks in Apache Pig are automatically optimized.
4. Handles all kinds of data : Apache Pig can analyze both structured and unstructured data and store the results in HDFS.
• Fig. 5.5.1 shows Pig architecture. (Refer Fig. 5.5.1 on next page)
• Pig has two execution modes :
1. Local mode : To run Pig in local mode, we need access to a single machine; all files are installed and run using the local host and file system. Local mode is specified using the -x flag (pig -x local).
2. Mapreduce mode : To run Pig in mapreduce mode, we need access to a Hadoop cluster. Mapreduce mode is the default mode; we don't need to specify it, but we can do so using the -x flag (pig -x mapreduce).
Fig. 5.5.1 Pig architecture (Pig Latin scripts flow through Apache Pig's parser, optimizer and compiler and run as MapReduce jobs)
* Pig Hadoop framework has four main components :
1. Parser : When a Pig Latin script is sent to Hadoop Pig, it is first handled by the
parser. The parser is responsible for checking the syntax of the script, along
with other miscellaneous checks. Parser gives an output in the form of a
Directed Acyclic Graph (DAG) that contains Pig Latin statements, together with
other logical operators represented as nodes.
2. Optimizer : After the output from the parser is retrieved, a logical plan for
DAG is passed to a logical optimizer. The optimizer is responsible for carrying
out the logical optimizations.
3. Compiler : The role of the compiler comes in when the output from the optimizer is received. The compiler compiles the logical plan sent by the optimizer; the logical plan is then converted into a series of MapReduce tasks or jobs.
4. Execution Engine : After the logical plan is converted to MapReduce jobs, these jobs are sent to Hadoop in a properly sorted order and executed on Hadoop, yielding the desired result.
• Pig can run in two types of environments : the local environment in a single JVM and the distributed environment on a Hadoop cluster.
• Pig has a variety of scalar data types and standard data processing operations, together with complex types such as tuples, bags and maps, with a map being a set of key-value pairs.
• Most Pig operators take a relation as an input and give a relation as the output. Pig supports arithmetic operations and relational operations too.
• Pig's language layer currently consists of a textual language called Pig Latin. Pig Latin is a dataflow language. This means it allows users to describe how data from one or more inputs should be read, processed and then stored to one or more outputs in parallel.
+ These data flows can be simple linear flows, or complex workflows that include
points where multiple inputs are joined and where data is split into multiple
streams to be processed by different operators. To be mathematically precise, a Pig
Latin script describes a directed acyclic graph (DAG), where the edges are data
flows and the nodes are operators that process the data.
• The first step in a Pig program is to LOAD the data we want to manipulate from HDFS. Then we run the data through a set of transformations. Finally, we DUMP the data to the screen or STORE the results in a file somewhere.
Advantages of Pig :
1. Fast execution that works with MapReduce, Spark and Tez.
2. Its ability to process almost any amount of data, regardless of size.
3. A strong documentation process that helps new users learn Pig Latin.
4. Local and remote interoperability that lets professionals work from anywhere with a reliable connection.
Pig disadvantages :
1. Slow start-up and clean-up of MapReduce jobs.
2. Not suitable for interactive OLAP analytics.
3. Complex applications may require ...
Pig Data Model
• With Pig, the data model is specified when the data is loaded. Any data that we load from the disk into Pig will have a specific schema and structure. Pig's data model is rich enough to manage most of what is thrown its way, like table-like structures and nested hierarchical data structures.
• However, Pig data types can be divided into two groups in general terms : scalar types and complex types.
• Scalar types contain a single value, while complex types contain other values, such as Tuple, Bag and Map.
• In its data model, Pig Latin has these four types :
• Atom : An atom is any single value, like, for example, a string or a number, e.g. 'Hadoop'. The atomic values of Pig are scalar types that appear in most programming languages : int, long, float, double, chararray and bytearray.
• Tuple : A tuple is a record formed by a sequence of fields. Each field can be of any type, e.g. 'Hadoop' or 6. Just think of a tuple as a row in a table.
• Bag : A bag is a collection of tuples, which need not be unique. The bag's schema is flexible; each tuple in the collection can contain an arbitrary number of fields and each field can be of any type.
• Map : A map is a set of key-value pairs. The value can store any type and the key needs to be unique. The key of a map must be a chararray and the value may be of any kind.
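These four types also surface in Pig's Java data API. The following is a small illustrative sketch, assuming the Pig library (org.apache.pig) is on the classpath; the field names and values are made up.

import java.util.HashMap;
import java.util.Map;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class PigDataModelExample {
    public static void main(String[] args) throws Exception {
        // Atom : a single scalar value (int, long, float, double, chararray, bytearray).
        String atom = "Hadoop";

        // Tuple : an ordered set of fields, like a row in a table.
        Tuple tuple = TupleFactory.getInstance().newTuple();
        tuple.append(atom);
        tuple.append(6);

        // Bag : a collection of tuples whose schemas may differ.
        DataBag bag = BagFactory.getInstance().newDefaultBag();
        bag.add(tuple);

        // Map : chararray keys mapped to values of any type.
        Map<String, Object> map = new HashMap<String, Object>();
        map.put("name", atom);
        map.put("records", bag);

        System.out.println(tuple + " " + bag + " " + map);
    }
}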
Pig Latin
• Pig Latin is the data flow language used by Apache Pig to analyze the data in Hadoop. It is a textual language that abstracts the programming from the Java MapReduce idiom into a notation.
• Pig Latin statements are used to process the data. Each statement is an operator that accepts a relation as an input and generates another relation as an output.
a) It can span multiple lines.
b) Each statement must end with a semi-colon.
c) It may include expressions and schemas.
d) By default, these statements are processed using multi-query execution.
Pig Latin statements work with relations. A relation can be defined as follows :
a) A relation is a bag (more specifically, an outer bag).
b) A bag is a collection of tuples.
c) A tuple is an ordered set of fields.
d) A field is a piece of data.
Pig Latin Datatypes
1. Int : It represents a signed 32-bit integer. For example : 13
2. Long : It represents a signed 64-bit integer. For example : 13L
3. Float : This data type represents a signed 32-bit floating point. For example : 13.5F
4. Double : "double" represents a 64-bit floating point. For example : 13.5
5. Chararray : It represents a character array (string) in Unicode UTF-8 format. For example : 'Big Data'
6. Bytearray : This data type represents a byte array.
7. Boolean : "Boolean" represents a Boolean value. For example : true/false.
Developing and Testing Pig Latin Scripts
« Pig provides several tools and diagnostic operators to help us to develop
applications.
• Scripts in Pig can be executed in interactive or batch mode. To use Pig in interactive mode, we invoke it in local or mapreduce mode and then enter commands one after the other. In batch mode, we save commands in a .pig file and specify the path to the file when invoking Pig.
« At an overly simplified level a Pig script consists of three steps. In the first step
we load data from HDFS. In the second step we perform transformations on the
data. In the final step we store the transformed data. Transformations are the heart of
Pig scripts.
• Pig has a schema concept that is used when loading data to specify what it should expect. We first specify the columns and, optionally, their data types. Any columns in the data but not included in the schema are truncated. When we have fewer columns than those specified in the schema, they are filled with nulls.
• To load sample data sets, we first move them to HDFS and from there we load them into Pig.
Pig Script Interfaces : Pig programs can be packaged in three different ways.
1. Script : This method is nothing more than a file consisting of Pig Latin commands, identified by the .pig suffix. Ending a Pig program with the .pig extension is a convention but not required. The commands are interpreted by the Pig Latin compiler and then run in the order determined by the Pig optimizer.
2. Grunt : Grunt acts as a command interpreter where we can interactively enter
Pig Latin at the Grunt command line and immediately see the response. This
method is useful for prototyping during early development stage and with
what-if scenarios.
3. Embedded : Pig Latin statements can run within Java, JavaScript and Python
programs.
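A minimal sketch of the embedded approach, using Pig's Java API (org.apache.pig.PigServer), is shown below; the input file, field names and output path are hypothetical.

import java.io.IOException;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
    public static void main(String[] args) throws IOException {
        // Run Pig in local mode; use ExecType.MAPREDUCE to run on a Hadoop cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // The same LOAD -> transform -> STORE flow described earlier, registered as Pig Latin.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // STORE triggers execution of the whole data flow.
        pig.store("counts", "wordcount_output");
    }
}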
• Pig scripts, Grunt shell Pig commands and embedded Pig programs can be executed in either local mode or MapReduce mode. The Grunt shell enables an
interactive shell to submit Pig commands and run Pig scripts. To start the Grunt
shell in Interactive mode, we need to submit the command pig at the shell.
• To tell the compiler whether a script or Grunt shell is executed locally or in Hadoop mode, we just specify it in the -x flag to the pig command. The following is an example of how we would specify running our Pig script in local mode :
pig -x local mindStick.pig
• Here's how we would run the Pig script in Hadoop mode, which is the default if we don't specify the flag :
pig -x mapreduce mindStick.pig
By default, when we specify the pig command without any parameters, it starts
the Grunt shell in Hadoop mode. If we want to start the Grunt shell in local mode
just add the -x local flag to the command.
5.6 Hive
• Apache Hive is an open source data warehouse software for reading, writing and managing large data set files that are stored directly in either the Apache Hadoop Distributed File System (HDFS) or other data storage systems such as Apache HBase.
Data analysts often use Hive to analyze data, query large amounts of unstructured
data and generate data summaries.
Features of Hive :
1. It stores schema in a database and processed data into HDFS.
2. It is designed for OLAP.
3. It provides SQL type language for querying called HiveQL or HQL.
4. It is familiar, fast, scalable and extensible.
• Hive supports a variety of storage formats : TEXTFILE for plain text, SEQUENCEFILE for binary key-value pairs and RCFILE, which stores the columns of a table in a record columnar format.
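Since Hive exposes its data through HiveQL, client programs typically reach it over JDBC. The following is a minimal sketch using the HiveServer2 JDBC driver (org.apache.hive.jdbc.HiveDriver); the connection URL, credentials, table and column names are assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 endpoint and default database.
        Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // A simple HiveQL query; 'employees' is an assumed table.
        ResultSet rs = stmt.executeQuery("SELECT name, salary FROM employees LIMIT 10");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
        }
        con.close();
    }
}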
• A Hive table structure consists of rows and columns. The rows typically correspond to some record, transaction, or particular entity detail. The values of the corresponding columns represent the various attributes or characteristics for each row.
• Hadoop and its ecosystem are used to apply some structure to unstructured data. Therefore, if a table structure is an appropriate way to view the restructured data, Hive may be a good tool to use.
Following are some Hive use cases :
1. Exploratory or ad-hoc analysis of HDFS data : Data can be queried, transformed and exported to analytical tools.
2. Extracts or data feeds to reporting systems, dashboards, or data repositories such as HBase.
3. Combining external structured data with data already residing in HDFS.
Advantages :
1. Simple querying for anyone already familiar with SQL.
2. Its ability to connect with a variety of relational databases, including Postgres and MySQL.
3. Simplifies working with large amounts of data.
Disadvantages :
1. Updating data is complicated.
2. No real-time access to data.
3. High latency.
• Program example : Write a code in Java for a simple Word Count application that counts the number of occurrences of each word in a given input set, using the Hadoop MapReduce framework on a local standalone set-up.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {