
BIG DATA ANALYTICS LABORATORY (R2032121) MANUAL

III/IV B.TECH, Semester-II


Academic Year: 2023-24

DEPARTMENT OF
ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

Faculty: Dr.CH RAMA DEVI, Assoc Professor

SIR C R REDDY COLLEGE OF ENGINEERING


Eluru-534007, West Godavari Dist, Andhra Pradesh, India
(Accredited by NBA, Approved by AICTE, New Delhi & Permanently affiliated to JNTUK,
Kakinada) Telephone No: 08812-230840, 230565, Fax: 08812-224193
Website: www.sircrrengg.ac.in

SIR C R REDDY COLLEGE OF ENGINEERING, ELURU.
DEPARTMENT OF
ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

CERTIFICATE

This is to certify that this is the bonafide record of the work done in the BIG DATA
ANALYTICS LABORATORY by Mr/Ms:________________ bearing
Regd. No:____________ in the III/IV B.Tech AI&DS course during the academic year
2023-2024.

Total Number of Experiments held:_____ Total Number of Experiments done:______

LAB-IN-CHARGE HEAD OF THE DEPARTMENT

EXTERNAL EXAMINER

SIR C.R. REDDY COLLEGE OF ENGINEERING
ELURU-534007, WEST GODAVARI DIST, A.P., INDIA
(Approved by AICTE, New Delhi )
Phone no: 08812-230840, 2300656 Fax: 08812-224193
Visit us at http://www.sircrrengg.ac.in
DEPARTMENT OF
ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

INDEX
S. No   Name of Experiment   Dates

1   Implement the following data structures in Java:
    a) Linked Lists b) Stacks c) Queues d) Set e) Map

2   (i) Perform setting up and installing Hadoop in its three operating modes:
    standalone, pseudo-distributed, fully distributed.
    (ii) Use web-based tools to monitor your Hadoop setup.

3   Implement the following file management tasks in Hadoop:
    (i) Adding files and directories
    (ii) Retrieving files
    (iii) Deleting files

4   Run a basic Word Count MapReduce program to understand the MapReduce paradigm.

5   Write a MapReduce program that mines weather data.

6   Use MapReduce to find the shortest path between two people in a social graph.

7   Implement the Friends-of-Friends algorithm in MapReduce.

8   Implement an iterative PageRank graph algorithm in MapReduce.

9   Perform an efficient semi-join in MapReduce.

10  Install and run Pig, then write Pig Latin scripts to sort, group, join, project, and filter your data.

11  Install and run Hive, then use Hive to create, alter, and drop databases, tables, views, functions, and indexes.
Big Data Analytics Lab Manual
List of Experiments:

Experiment 1: Week 1, 2:
1. Implement the following Data structures in Java
a) Linked Lists b) Stacks c) Queues d) Set e) Map

Experiment 2: Week 3:
2. (i) Perform setting up and installing Hadoop in its three operating modes:
Standalone, Pseudo-distributed, Fully distributed. (ii) Use web-based tools to monitor your Hadoop setup.

Experiment 3: Week 4:

3. Implement the following file management tasks in Hadoop:

 Adding files and directories

 Retrieving files

 Deleting files

Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and copies them into HDFS using one of the above command-line utilities.

Experiment 4: Week 5:
4. Run a basic Word Count MapReduce program to understand MapReduce Paradigm.
Experiment 5: Week 6:
5. Write a map reduce program that mines weather data.
Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data, which is a good candidate for analysis with MapReduce, since it is semi-structured and record-oriented.

Experiment 6: Week 7:
6. Use MapReduce to find the shortest path between two people in a social graph.
Hint: Use an adjacency list to model a graph, and for each node store the distance from the original node, as well as a back pointer to the original node. Use the mappers to propagate the distance to the original node, and the reducer to restore the state of the graph. Iterate until the target node has been reached.

Experiment 7: Week 8:
7. Implement Friends-of-friends algorithm in MapReduce.
Hint: Two MapReduce jobs are required to calculate the FoFs for each user in a social network. The first job calculates the common friends for each user, and the second job sorts the common friends by the number of connections to your friends.

Experiment 8: Week 9:
8. Implement an iterative PageRank graph algorithm in MapReduce.
Hint: PageRank can be implemented by iterating a MapReduce job until the graph has converged. The mappers are responsible for
propagating node PageRank values to their adjacent nodes, and the reducers are responsible for calculating new PageRank
values for each node, and for re-creating the original graph with the updated PageRank values.

Experiment 9: Week 10:


9. Perform an efficient semi-join in MapReduce.
Hint: Perform a semi-join by having the mappers load a Bloom filter from the Distributed Cache, and then filter results from the actual MapReduce data source by performing membership queries against the Bloom filter to determine which data source records should be emitted to the reducers.

Experiment 10: Week 11:

10. Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and filter your data.
Experiment 11: Week 12:
11. Install and Run Hive then use Hive to create, alter, and drop databases, tables, views,
functions, and indexes

Add-On Experiments
1. Write a program on type conversion by auto boxing and unboxing

2. Write a program that adds and deletes elements from a list by using the Map and Set classes of the collection framework.

Experiment 1
AIM: Implement the following Data structures in Java

a) Linked Lists b) Stacks c) Queues d) Set e) Map

1 a) Linked Lists
Program
// Java program to demonstrate the LinkedList class. File name: LinkedListDemo.java
// Importing required classes
import java.util.*;

// Main class
public class LinkedListDemo {
    // Driver code
    public static void main(String args[])
    {
        // Creating an object of the LinkedList class
        LinkedList<String> ll = new LinkedList<String>();

        // Adding elements to the linked list
        ll.add("Grapes");
        ll.add("Banana");
        ll.addLast("Apple");
        ll.addLast("Watermelon");
        System.out.println("LinkedList items\n" + ll);
        System.out.println('\n');

        ll.addFirst("Orange");
        ll.add(3, "Cherry");
        ll.add(3, "Mango"); // the item previously at index 3 shifts to index 4
        System.out.println(ll);
        System.out.println('\n');

        System.out.println("ll.get(0): " + ll.get(0)); // returns the element at the specified position in this list
        System.out.println("ll.peek(): " + ll.peek()); // retrieves, but does not remove, the head (first element) of the list
        ll.remove(3);
        ll.removeFirst();
        ll.removeLast();
        System.out.println("Linked List items: " + ll);
    }
}
Output:

LinkedList items
[Grapes, Banana, Apple, Watermelon]

[Orange, Grapes, Banana, Mango, Cherry, Apple, Watermelon]

ll.get(0): Orange
ll.peek(): Orange
Linked List items: [Grapes, Banana, Cherry, Apple]

1 b) Stacks
Program:
import java.io.*;
import java.util.*;

class StackDemo {

    // Main Method
    public static void main(String[] args)
    {
        // Initialization of Stack using Generics
        Stack<String> stack1 = new Stack<String>();

        // Default initialization of Stack
        Stack stack2 = new Stack();

        // Pushing the elements
        stack1.push("Grapes");
        stack1.push("Mango");
        stack1.push("Banana");
        stack1.push("Cherry");
        stack1.push("Watermelon");
        stack1.push("Orange");
        System.out.println("Printing the Stack1 Elements");
        System.out.println(stack1);
        stack1.remove(3);
        System.out.println("After removal from 3rd index: \n" + stack1);

        // Removing elements using pop() method
        System.out.println("Popped element: " + stack1.pop());
        System.out.println("Popped element: " + stack1.pop());

        // Displaying the Stack after pop operation
        System.out.println("Stack after pop operation:\n" + stack1);

        stack2.push("Hadoop");
        stack2.push(3); // Heterogeneous data can also be added
        stack2.push("Year");

        // Printing the Stack2 Elements
        System.out.println("Stack2 elements\n" + stack2);
    }
}
Output:
Printing the Stack1 Elements
[Grapes, Mango, Banana, Cherry, Watermelon, Orange]
After removal from 3rd index:
[Grapes, Mango, Banana, Watermelon, Orange]
Popped element: Orange
Popped element: Watermelon
Stack after pop operation:
[Grapes, Mango, Banana]
Stack2 elements
[Hadoop, 3, Year]

1 c) Queues
import java.util.*;

class PriorityQueueDemo
{
    public static void main(String args[])
    {
        PriorityQueue<String> queue = new PriorityQueue<String>();
        queue.add("One");
        queue.add("Two");
        queue.add("Three");
        queue.add("Four");
        queue.add("Five");
        System.out.println("\nhead:" + queue.element());
        System.out.println("head:" + queue.peek());
        System.out.println("iterating the queue elements:\n");
        Iterator itr = queue.iterator();
        while (itr.hasNext())
        {
            System.out.println(itr.next());
        }
        queue.remove();
        queue.poll();
        System.out.println();
        System.out.println("After removing two elements: \n");
        Iterator<String> itr2 = queue.iterator();
        while (itr2.hasNext())
        {
            System.out.println(itr2.next());
        }
    }
}

Output:
head:Five
head:Five
iterating the queue elements:

Five
Four
Two
One
Three

After removing two elements:

Three
One
Two

1 d) Set
// Java program illustrating the Set interface

// Importing utility classes
import java.util.*;

// Main class
public class GFG {

    // Main driver method
    public static void main(String[] args)
    {
        // Demonstrating Set using HashSet
        // Declaring object of type String
        Set<String> hash_Set = new HashSet<String>();

        // Adding elements to the Set
        // using add() method
        hash_Set.add("Geeks");
        hash_Set.add("For");
        hash_Set.add("Geeks");
        hash_Set.add("Example");
        hash_Set.add("Set");

        // Printing elements of the HashSet object
        System.out.println(hash_Set);
    }
}

Output:
[Set, Example, Geeks, For]
1 e) Map

1) Simple display of map

import java.util.*;

public class MapHashDemo {

    public static void main(String[] args) {
        Map m1 = new HashMap();
        m1.put("Zara", "8");
        m1.put("Mahnaz", "31");
        m1.put("Ayan", "12");
        m1.put("Daisy", "14");
        System.out.println();
        System.out.println(" Map Elements");
        System.out.print("\t" + m1);
    }
}

Output:
Map Elements

{Daisy=14, Ayan=12, Zara=8, Mahnaz=31}

2) Map-Ex2

import java.util.*;

public class MapHashDemo2 {

    public static void main(String args[]) {
        // Create a hash map
        HashMap hm = new HashMap();

        // Put elements into the map
        hm.put("David", new Double(3434.34));
        hm.put("Mahesh", new Double(123.22));
        hm.put("Kavya", new Double(1378.00));
        hm.put("Lavanya", new Double(99.22));
        hm.put("Kiran", new Double(-19.08));

        // Get a set of the entries
        Set set = hm.entrySet();

        // Get an iterator
        Iterator i = set.iterator();

        // Display elements
        while (i.hasNext()) {
            Map.Entry me = (Map.Entry) i.next();
            System.out.print(me.getKey() + ": ");
            System.out.println(me.getValue());
        }
        System.out.println();

        // Deposit 1000 into David's account
        double balance = ((Double) hm.get("David")).doubleValue();
        hm.put("David", new Double(balance + 1000));
        System.out.println("David's new balance: " + hm.get("David"));
    }
}

Output:
Kiran: -19.08

Kavya: 1378.0

David: 3434.34

Mahesh: 123.22

Lavanya: 99.22

David's new balance: 4434.34

EXPERIMENT-2

AIM: (i)Perform setting up and Installing Hadoop in its three operating modes:
Standalone, Pseudo distributed, Fully distributed
(ii)Use web based tools to monitor your Hadoop setup

(i) installing Hadoop in three operating modes:

As we all know, Hadoop is an open-source framework which is mainly used for storing, maintaining, and analyzing large amounts of data (datasets) on clusters of commodity hardware; in other words, it is a data management tool. Hadoop also possesses a scale-out storage property, which means that we can scale the number of nodes up or down as our requirements change in the future, which is a really useful feature.

Hadoop Mainly works on 3 different Modes:


1. Standalone Mode
2. Pseudo-distributed Mode
3. Fully-Distributed Mode

1. Standalone Mode

In Standalone Mode none of the Daemon will run i.e. Namenode, Datanode, Secondary Name node, Job
Tracker, and Task Tracker. We use job-tracker and task-tracker for processing purposes in Hadoop1. For
Hadoop2 we use Resource Manager and Node Manager. Standalone Mode also means that we are installi ng
Hadoop only in a single system. By default, Hadoop is made to run in this Standalone Mode or we can also call it
as the Local mode. We mainly use Hadoop in this Mode for the Purpose of Learning, testing, and debugging.
Hadoop works very much Fastest in this mode among all of these 3 modes. As we all know HDFS (Hadoop
distributed file system) is one of the major components for Hadoop which utilized for storage Permission is not
utilized in this mode. You can think of HDFS as similar to the file system’s available for windows i.e. NTFS (New
Technology File System) and FAT32(File Allocation Table which stores the data in the blocks of 32 bits ). when
your Hadoop works in this mode there is no need to configure the files – hdfs-site.xml, mapred-site.xml, core-
site.xml for Hadoop environment. In this Mode, all of your Processes will run on a single JVM(Java Virtual
Machine) and this mode can only be used for small development purposes.

2. Pseudo Distributed Mode (Single Node Cluster)

In Pseudo-distributed Mode we also use only a single node, but the main thing is that the cluster is simulated, which means that all the processes inside the cluster run independently of each other. All the daemons, i.e. NameNode, DataNode, Secondary NameNode, ResourceManager, NodeManager, etc., run as separate processes on separate JVMs (Java Virtual Machines), or, we can say, as different Java processes; that is why it is called pseudo-distributed.

One thing we should remember is that, as we are using only a single-node setup, all the master and slave processes are handled by the single system. The NameNode and ResourceManager are used as masters, and the DataNode and NodeManager are used as slaves. The Secondary NameNode is also used as a master; its purpose is simply to keep periodic (e.g. hourly) backups of the NameNode metadata. In this mode,

 Hadoop is used both for development and for debugging purposes.

 Our HDFS (Hadoop Distributed File System) is utilized for managing the input and output processes.

 We need to change the configuration files mapred-site.xml, core-site.xml, and hdfs-site.xml for setting up the environment (a minimal example is sketched below).
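A minimal sketch of these configuration files for a single-node (pseudo-distributed) setup, assuming Hadoop 2.x with HDFS listening on localhost (the property values shown are illustrative defaults, not the only valid ones):

core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>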

3. Fully Distributed Mode (Multi-Node Cluster)

This is the most important mode, in which multiple nodes are used: a few of them run the master daemons, i.e. NameNode and ResourceManager, and the rest run the slave daemons, i.e. DataNode and NodeManager. Here Hadoop runs on a cluster of machines (nodes), and the data used is distributed across the different nodes. This is the production mode of Hadoop; let's understand this mode in physical terms.
When you download Hadoop as a tar or zip file and install it on a single system, you run all the processes on that one system; but in fully distributed mode we extract this tar or zip file onto each of the nodes in the Hadoop cluster and use a particular node for a particular process. Once you distribute the processes among the nodes, you define which nodes work as masters and which of them work as slaves.

(ii) Three Best Monitoring Tools for Hadoop

Here is our list of the best Hadoop monitoring tools:

1. Prometheus – Cloud monitoring software with a customizable Hadoop dashboard, integrations, alerts, and more. It keeps the data long-term, with 3x redundancy, so that we can focus on applying the data rather than maintaining a database. Updates and plugins arrive without lifting a finger, as the Prometheus and Grafana stack is kept up to date. It is easy to use, with no extensive configuration needed to thoroughly monitor your technology stack.

2. LogicMonitor – Infrastructure monitoring software with a Hadoop package, REST API, alerts, reports, dashboards, and more. LogicMonitor finds, queries, and begins monitoring virtually any datacenter resource. If you have a resource in your datacenter that is not immediately found and monitored, LogicMonitor's professional services will investigate how to add it.

3. Dynatrace – Application performance management software with Hadoop monitoring - with
NameNode/DataNode metrics, dashboards, analytics, custom alerts, and more. Dynatrace provides a high-
level overview of the main Hadoop components within your cluster. Enhanced insights are available for HDFS
and MapReduce. Hadoop-specific metrics are presented alongside all infrastructure measurements, providing
you with in-depth Hadoop performance analysis of both current and historical data.

Experiment -3
Hadoop Implementation of file management tasks, such as Adding files and directories, retrieving files and Deleting files

AIM: To implement the following file management tasks in Hadoop:


1. Adding files and directories
2. Retrieving files
3. Deleting Files

SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS
Step 1: Starting HDFS. Initially you have to format the configured HDFS file system, open the namenode (HDFS server), and execute the following command.
$ hadoop namenode -format

After formatting the HDFS, start the distributed file system. The following command will start the namenode as well as the data nodes as a cluster.
$ start-dfs.sh

Listing Files in HDFS

After loading the information in the server, we can find the list of files in a directory and the status of a file using the ls command. Given below is the syntax of ls; you can pass a directory or a filename as an argument.

$ $HADOOP_HOME/bin/hadoop fs -ls

Inserting Data into HDFS:

Assume we have data in a file called file.txt in the local system which ought to be saved in the HDFS file system. Follow the steps given below to insert the required file into the Hadoop file system.

Step-2: Adding Files and Directories to HDFS

$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input

Transfer and store a data file from local systems to the Hadoop file system using the put command.

$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input

Step 3: You can verify the file using the ls command.

$ $HADOOP_HOME/bin/hadoop fs -ls /user/input

Step 4 :Retrieving Data from HDFS

Assume we have a file in HDFS called outfile. Given below is a simple demonstration for retrieving the required file from the
Hadoop file system.

Initially, view the data from HDFS using cat command


$ $HADOOP_HOME/bin/hadoop fs -cat /user/output/outfile

Get the file from HDFS to the local file system using get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/

Step-5: Deleting Files from HDFS

$ hadoop fs -rm file.txt
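To remove a whole directory and its contents (for example, an output directory from an earlier run such as /user/output), the recursive flag of rm can be used:

$ $HADOOP_HOME/bin/hadoop fs -rm -r /user/output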

Step 6:Shutting Down the HDFS


You can shut down the HDFS by using the following command.

$ stop-dfs.sh

Experiment-4
Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm.

AIM: To develop a MapReduce program to calculate the frequency of a given word in a given file.

Map Function – It takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).

Example – (Map function in Word Count)


Input
Set of data
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN,BUS, buS, caR, CAR, car, BUS, TRAIN

Output
Convert into another
set of data(Key,Value)
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1),(BUS,1), (buS,1), (caR,1),
(CAR,1), (car,1), (BUS,1), (TRAIN,1)

Reduce Function – Takes the output from Map as an input and combines those data tuples into a smaller set of tuples.
Example – (Reduce function in Word Count)
Input
Set of Tuples(output of Map function)
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1),(TRAIN,1),(BUS,1), (buS,1),(caR,1),(CAR,1),
(car,1), (BUS,1), (TRAIN,1)

Output Converts into smaller set of tuples


(BUS,7), (CAR,7), (TRAIN,4)

Workflow of MapReduce consists of 5 steps:

1. Splitting – The splitting parameter can be anything, e.g. splitting by space, comma, semicolon, or even by a new line ('\n').
2. Mapping – as explained above.
3. Intermediate splitting / shuffling – the entire process runs in parallel on different nodes; in order to group records in the "Reduce Phase", data with the same key must end up on the same node.
4. Reduce – it is essentially a group-by-and-aggregate phase.
5. Combining – The last phase, where all the data (the individual result sets from each node) is combined together to form the result.

Now Let’s See the Word Count Program in Java

Step 1: Make sure Hadoop and Java are installed properly.

hadoop version

javac -version

Step 2. Create a directory on the Desktop named Lab and inside it create two folders;
one called “Input” and the other called “tutorial_classes”. [You can do this step using GUI normally or through terminal
commands]
cd Desktop
mkdir Lab
mkdir Lab/Input
mkdir Lab/tutorial_classes
Step 3. Add the file attached with this document “WordCount.java” in the directory Lab
Step 4. Add the file attached with this document “input.txt” in the directory Lab/Input

Step 5. Type the following command to export the hadoop classpath into bash.
export HADOOP_CLASSPATH=$(hadoop classpath)
Make sure it is now exported.
echo $HADOOP_CLASSPATH
Step 6. It is time to create these directories on HDFS rather than locally. Type the following commands.
hadoop fs -mkdir /WordCountTutorial
hadoop fs -mkdir /WordCountTutorial/Input
hadoop fs -put Lab/Input/input.txt /WordCountTutorial/Input

Step 7. Go to localhost:9870 in the browser, open "Utilities → Browse File System" and you should see the directories and files we placed in the file system.

Step 8. Then, back to local machine where we will compile the WordCount.java file.
Assuming we are currently in the Desktop directory.
cd Lab
javac -classpath $HADOOP_CLASSPATH -d tutorial_classes WordCount.java

Put the output files in one jar file (There is a dot at the end)
jar -cvf WordCount.jar -C tutorial_classes .
Step 9. Now, we run the jar file on Hadoop (if WordCount.java keeps the "package PackageDemo;" declaration shown in the listing below, use the fully qualified class name PackageDemo.WordCount instead of WordCount).
hadoop jar WordCount.jar WordCount /WordCountTutorial/Input /WordCountTutorial/Output

hadoop fs -cat /WordCountTutorial/Output/*

Program: Type the following program (WordCount.java):


package PackageDemo;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
public static void main(String[] args) throws Exception
{
Configuration c=new Configuration();
String[] files=new GenericOptionsParser(c,args).getRemainingArgs();
Path input=new Path(files[0]);
Path output=new Path(files[1]);
Job j=new Job(c,"wordcount");
j.setJarByClass(WordCount.class);
j.setMapperClass(MapForWordCount.class);
j.setReducerClass(ReduceForWordCount.class);
j.setOutputKeyClass(Text.class);

j.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(j, input);
FileOutputFormat.setOutputPath(j, output);

System.exit(j.waitForCompletion(true)?0:1);
}
public static class MapForWordCount extends Mapper <LongWritable, Text, Text,IntWritable>{
public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException
{
String line = value.toString();
String[] words=line.split(",");
for(String word: words )
{
Text outputKey = new Text(word.toUpperCase().trim());
IntWritable outputValue = new IntWritable(1);
con.write(outputKey, outputValue);
}
}
}
public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable>
{
public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException, InterruptedException
{
int sum = 0;
for(IntWritable value : values)
{
sum += value.get();
}
con.write(word, new IntWritable(sum));
}
}
}
The output is stored in /WordCountTutorial/Output/part-r-00000

OUTPUT:

EXERCISE-5:
AIM:- Write a Map Reduce Program that mines Weather Data.

DESCRIPTION:
Climate change has been attracting a lot of attention for a long time, and its adverse effects are being felt in every part of the earth; for example, sea levels are rising, rainfall is decreasing, and humidity is increasing. The proposed system overcomes some of the issues that occur with other techniques. In this project we use the concepts of Big Data and Hadoop. In the proposed architecture we process offline data stored by the National Climatic Data Centre (NCDC). Through this we are able to find the maximum and minimum temperature of a year, and to predict the future weather. Finally, we plot a graph of the obtained MAX and MIN temperatures for each month of the particular year to visualize the temperature. Based on the previous years' data, the weather of the coming year is predicted.
ALGORITHM:-
MAPREDUCE PROGRAM
WordCount is a simple program which counts the number of occurrences of each word in a given text input data set. WordCount
fits very well with the MapReduce programming model making it a great example to understand the Hadoop Map/Reduce
programming style. Our implementation consists of three main parts:
1.Mapper
2.Reducer
3.Main program

Step-1. Write a Mapper

A Mapper overrides the "map" function from the class "org.apache.hadoop.mapreduce.Mapper", which provides <key, value> pairs as the input. A Mapper implementation may output <key, value> pairs using the provided Context.

The input value of the map task will be a line of text from the input data file, and the key will be the line number, i.e. <line_number, line_of_text>. The map task outputs <word, one> for each word in the line of text.
Pseudo-code
void Map (key, value)
{
for each max_temp x in value:
output.collect(x, 1);
}
void Map (key, value)
{
for each min_temp x in value:
output.collect(x, 1);
}
Step-2. Write a Reducer

A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles a single result. Here, the program will sum up the occurrences of each word into pairs of the form <word, occurrence>.
Pseudo-code
void Reduce (max_temp, <list of value>)
{
for each x in <list of value>:
sum+=x;
final_output.collect(max_temp, sum);

}
void Reduce (min_temp, <list of value>)
{
for each x in <list of value>:
sum+=x;
final_output.collect(min_temp, sum);
}
3. Write a Driver

The driver program configures and runs the MapReduce job. We use the main program to perform basic configuration such as:

Job Name: name of this job.
Executable (Jar) Class: the main executable class; here, WordCount.
Mapper Class: the class which overrides the "map" function; here, Map.
Reducer Class: the class which overrides the "reduce" function; here, Reduce.
Output Key: the type of the output key; here, Text.
Output Value: the type of the output value; here, IntWritable.
File Input Path
File Output Path
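A minimal Java sketch following this plan is given below. It assumes a simplified input format of one record per line, year<TAB>temperature, and computes the maximum temperature per year; the minimum can be obtained the same way by replacing Math.max with Math.min, and parsing of the raw NCDC fixed-width records is omitted.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

    public static class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Each input line: year<TAB>temperature
            String[] parts = value.toString().split("\t");
            context.write(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[1].trim())));
        }
    }

    public static class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text year, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Keep the highest temperature seen for this year
            int max = Integer.MIN_VALUE;
            for (IntWritable value : values) {
                max = Math.max(max, value.get());
            }
            context.write(year, new IntWritable(max));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max temperature");
        job.setJarByClass(MaxTemperature.class);
        job.setMapperClass(MaxTempMapper.class);
        job.setReducerClass(MaxTempReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}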
INPUT:- Set of Weather Data over the years
OUTPUT:-

EXPERIMENT-6

AIM: Use MapReduce to find the shortest path between two people in a social graph.

Program:

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Define your Mapper class (each public class below goes in its own .java file)
public class ShortestPathMapper extends Mapper<LongWritable, Text, Text, Text> {
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// Parse the input data (person, list of friends)
String[] tokens = value.toString().split("\\s+");
String person = tokens[0];
List<String> friends = Arrays.asList(tokens[1].split(","));

// Emit (friend, distance) pairs for each friend


for (String friend : friends) {
context.write(new Text(friend), new Text("0," + person)); // Distance initialized to 0
}
}
}

// Define your Reducer class


public class ShortestPathReducer extends Reducer<Text, Text, Text, Text> {
@Override
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException
{
int minDistance = Integer.MAX_VALUE;
String shortestPath = "";

// Iterate through all values for the current key


for (Text value : values) {
String[] parts = value.toString().split(",");
int distance = Integer.parseInt(parts[0]);
String path = parts[1];

if (distance < minDistance) {


minDistance = distance;
shortestPath = path;
}
}

// Emit (friend, shortest path) pair


context.write(key, new Text(minDistance + "," + shortestPath));
}
}

// Define your main class


public class ShortestPath {
public static void main(String[] args) throws Exception {
// Set up your MapReduce job configuration
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Shortest Path");
job.setJarByClass(ShortestPath.class);
job.setMapperClass(ShortestPathMapper.class);
job.setReducerClass(ShortestPathReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

FileInputFormat.addInputPath(job, new Path(args[0])); // Input path

FileOutputFormat.setOutputPath(job, new Path(args[1])); // Output path


System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
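The listing above performs a single BFS step: each person's direct friends receive a candidate distance and back pointer, and the reducer keeps the minimum. As the hint for this experiment notes, the job is iterated, feeding each iteration's output back in as input, until the target person has been reached. Assuming the classes are packaged into a JAR named, say, ShortestPath.jar, each iteration is launched the same way as in the other experiments:

hadoop jar ShortestPath.jar ShortestPath input_directory output_directory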

EXPERIMENT-7

AIM: Implement Friends-of-friends algorithm in MapReduce.

Program:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FriendsOfFriends {

public static class Map extends Mapper<Object, Text, Text, Text> {

public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer tokenizer = new StringTokenizer(value.toString());
String personA = tokenizer.nextToken();
String personB = tokenizer.nextToken();

context.write(new Text(personA), new Text(personB));


context.write(new Text(personB), new Text(personA));
}
}

public static class Reduce extends Reducer<Text, Text, Text, Text> {

public void reduce(Text key, Iterable<Text> values, Context context) throws IOException,
InterruptedException {
// Collect all direct friends of the current person
List<String> friends = new ArrayList<>();
for (Text val : values) {
if (!friends.contains(val.toString())) {
friends.add(val.toString());
}
}

// Any two distinct friends of this person are friends-of-friends of each other,
// with the current person as their common friend. As the hint notes, a second
// job would aggregate these candidate pairs and drop pairs that are already
// direct friends.
for (String friend : friends) {
for (String otherFriend : friends) {
if (!friend.equals(otherFriend)) {
context.write(new Text(friend), new Text(otherFriend));
}
}
}
}
}

public static void main(String[] args) throws Exception {


Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "FriendsOfFriends");
job.setJarByClass(FriendsOfFriends.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Usage:
Compile the Java code and create a JAR file. Then, you can run the MapReduce job using Hadoop:
hadoop jar FriendsOfFriends.jar input_directory output_directory
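For example, with an edge-list input such as the following (one friendship per line):

A B
A C
B D

the reducer above sees that A's friends are B and C, so it emits (B, C) and (C, B) as friend-of-friend candidates with A as their common friend, and similarly (A, D) and (D, A) via B. As the hint describes, a second MapReduce job would then count the common friends behind each candidate pair and drop pairs that are already direct friends.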

EXPERIMENT-8

AIM: Implement an iterative PageRank graph algorithm in MapReduce.

Program:
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageRank {

    public static class Map extends Mapper<Object, Text, Text, DoubleWritable> {

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Input line format: page <TAB> comma-separated neighbours <TAB> current PageRank
            String[] parts = value.toString().split("\t");
            String page = parts[0];
            String[] neighbors = parts[1].split(",");

            // Emit this page's PageRank contribution to each neighbour
            double currentPageRank = Double.parseDouble(parts[2]);
            double contribution = currentPageRank / neighbors.length;
            for (String neighbor : neighbors) {
                context.write(new Text(neighbor), new DoubleWritable(contribution));
            }

            // Preserve the graph structure: emit the page itself with a zero contribution
            // so that pages without incoming links still appear in the output
            context.write(new Text(page), new DoubleWritable(0));
        }
    }

    public static class Reduce extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

        private static final double dampingFactor = 0.85; // Damping factor for the PageRank calculation

        public void reduce(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException {
            // Sum the contributions received from all pages linking to this page
            double sum = 0;
            for (DoubleWritable value : values) {
                sum += value.get();
            }
            double newPageRank = (1 - dampingFactor) + dampingFactor * sum; // Apply the PageRank formula
            context.write(key, new DoubleWritable(newPageRank));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "PageRank");
        job.setJarByClass(PageRank.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Usage:

Compile the Java code and create a JAR file. Then, you can run the MapReduce job using Hadoop:

hadoop jar PageRank.jar input_directory output_directory
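The listing assumes each input line has the form page<TAB>comma-separated-neighbours<TAB>current-rank, for example (with <TAB> standing for a tab character):

A<TAB>B,C<TAB>1.0
B<TAB>C<TAB>1.0
C<TAB>A<TAB>1.0

Per the hint, the job is run repeatedly, with the adjacency lists re-attached to the newly computed rank values between iterations, until the PageRank values converge.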

EXPERIMENT-9

AIM: Perform an efficient semi-join in MapReduce.

Program:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SemiJoin {

    public static class Map1 extends Mapper<Object, Text, Text, Text> {

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Assuming the join key is the first field
            String[] parts = value.toString().split("\t");
            String joinKey = parts[0];
            context.write(new Text(joinKey), new Text("A\t" + value.toString())); // Prefix 'A' denotes the first dataset
        }
    }

    public static class Map2 extends Mapper<Object, Text, Text, Text> {

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Assuming the join key is the first field
            String[] parts = value.toString().split("\t");
            String joinKey = parts[0];
            context.write(new Text(joinKey), new Text("B")); // Marker 'B' denotes the second dataset
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            boolean foundSecondDataset = false;
            List<String> firstDatasetRecords = new ArrayList<>();

            // The value iterator can be traversed only once, so cache the records
            // from the first dataset while checking for the 'B' marker.
            for (Text value : values) {
                String v = value.toString();
                if (v.equals("B")) {
                    foundSecondDataset = true;
                } else {
                    firstDatasetRecords.add(v.split("\t", 2)[1]);
                }
            }

            // Semi-join: emit records of the first dataset only when the join key
            // also appears in the second dataset.
            if (foundSecondDataset) {
                for (String record : firstDatasetRecords) {
                    context.write(new Text(record), new Text());
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "SemiJoin");
        job.setJarByClass(SemiJoin.class);
        job.setReducerClass(Reduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Input from the first dataset
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, Map1.class);
        // Input from the second dataset
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, Map2.class);

        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Usage:
Compile the Java code and create a JAR file. Then, you can run the MapReduce job using Hadoop:

hadoop jar SemiJoin.jar input_directory_1 input_directory_2 output_directory
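The hint for this experiment describes a map-side variant that loads a Bloom filter from the Distributed Cache instead of using reducer-side tags. A minimal sketch of such a mapper is given below; the cache file name join_keys.bloom and the class name BloomSemiJoinMapper are assumptions for illustration, and the filter is expected to have been built beforehand from the join keys of the second dataset and added to the job with job.addCacheFile(...).

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

public class BloomSemiJoinMapper extends Mapper<Object, Text, Text, NullWritable> {

    private final BloomFilter filter = new BloomFilter();

    @Override
    protected void setup(Context context) throws IOException {
        // The serialized Bloom filter (an assumed file name) is localized by the
        // Distributed Cache into the task's working directory under its file name.
        try (DataInputStream in = new DataInputStream(new FileInputStream("join_keys.bloom"))) {
            filter.readFields(in);
        }
    }

    @Override
    protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Assuming the join key is the first tab-separated field, as in the listing above
        String joinKey = value.toString().split("\t")[0];
        // Emit the record only if its key is (probably) present in the other dataset
        if (filter.membershipTest(new Key(joinKey.getBytes()))) {
            context.write(value, NullWritable.get());
        }
    }
}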

EXPERIMENT-10

AIM: Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and filter your data.

1. Prerequisites

1. Hardware Requirement

* RAM — Min. 8GB, if you have SSD in your system then 4GB RAM would also work.

* CPU — Min. Quad-core, with at least 1.80GHz

2. JRE 1.8 — Offline installer for JRE

3. Java Development Kit — 1.8

4. An un-zipping tool like 7-Zip or WinRAR

* I will be using 64-bit Windows for the process; please check and download the version supported by your system (x86 or x64) for all the software.

5. Hadoop

* I am using Hadoop-2.9.2, you can also use any other STABLE version for Hadoop.

* If you don't have Hadoop, you can refer to installing it from Hadoop: How to install in 5 Steps in Windows 10.

6. MySQL Query Browser

7. Download PIG zip

* I am using PIG-0.17.0, you can also use any other STABLE version of Apache Pig.

2. Unzip and Install PIG

After Downloading the PIG, we need to Unzip the pig-0.17.0.tar.gz file.

 Once extracted, we would get a new file pig-0.17.0.tar.

Now, once again we need to extract this tar file.

 Now we can organize our PIG installation: create a folder and move the final extracted files into it. For example:

 Please note, while creating folders, DO NOT ADD SPACES IN THE FOLDER NAME (it can cause issues later).

 I have placed my PIG in the D: drive; you can use C: or any other drive.

3. Setting Up Environment Variables

Another important step in setting up a work environment is to set your Systems environment variable.

3.1 Setting PIG_HOME

 Open environment Variable and click on “New” in “User Variable”.

 On clicking “New”, we get the below screen.

 Now as shown, add PIG_HOME in variable name and path of PIG in Variable Value.

 Click OK and we are half done with setting PIG_HOME.

3.2 Setting Path Variable

 The last step in setting the Environment variable is setting Path in System Variable.

 Select Path variable in the system variables and click on “Edit”.

 Now we need to add these paths to Path Variable:-

* %PIG_HOME%\bin

 Click OK and OK. & we are done with Setting Environment Variables.

Note:- If you want the path to be set for all users you need to select “New” from System Variables.

3.3 Verify the Paths

 Now we need to verify that what we have done is correct and reflecting.

 Open a NEW Command Window

 Run following commands

 echo %PIG_HOME%

4. Verifying Setup

We are done with setting up the PIG on our System.

Now we need to check if everything works smoothly…

Open a cmd window, run the below command to test the connection and PIG.

pig -version

Upon running the command we should get the version of PIG, i.e. 0.17.0 in our case.

Fig 11:- Checking PIG version

Congrats… we have successfully installed PIG-0.17.0 on our Windows 10.

Don't worry if you get the below error after running pig -version:
'-Xmx1000M' is not recognized as an internal or external command, operable program or batch file.

To resolve this we will need to perform the following steps:-

1. Open the pig.cmd file in edit mode.

We can find the file in the bin folder.

2. Now we need to change the value of the HADOOP_BIN_PATH

Old value:- %HADOOP_HOME%\bin


New Value:- %HADOOP_HOME%\libexec

3. Save the file.

The next step is to verify the setup once again. So, we need to execute the

pig -version command once again.

5. Starting PIG

Now we need to start a new Command Prompt remember to run it as administrator to avoid permission issues and

execute the below commands


pig

Yes, it's that simple… We can see the grunt> prompt once Pig starts.

Fig 13:- Starting PIG

GROUP:
Similar to GROUP in SQL; here we use GROUP for a single relation and COGROUP for two or more relations. GROUP and COGROUP are otherwise similar to each other.

B = GROUP A BY age;

X = GROUP A BY f2*f3;

X = COGROUP A BY owner INNER, B BY friend2 INNER;

B = GROUP A BY (tcid, tpid);

JOIN:
The join concept is similar to SQL joins; here we have many types of joins such as inner join, outer join, and some specialized joins.

INNER JOIN:
The JOIN operator always performs an inner join. Inner joins ignore null keys, so it makes sense to filter them out before the join.

Note: Both Cogroup and join work in a similar way, just the difference is Cogroup creates a nested set of output
records.

X = JOIN A BY a1, B BY b1;

OUTER JOIN:
Use the OUTER JOIN operator to perform left, right, or full outer joins. Outer joins will only work provided the relations which need to produce nulls (in the case of non-matching keys) have schemas.

C = JOIN A by $0 LEFT OUTER, B BY $0;

Sort:
Apache Pig provides the ORDER BY and LIMIT operators to sort and restrict relations.

1. ORDER BY
2. LIMIT

We have used the "finance_data.txt" dataset to perform these operations. We will put "finance_data.txt" in the HDFS location "/pigexample/" from the local file system.
Content of "finance_data.txt":

1,Chanel,Shawnee,KS,9133882079
2,Ezekiel,Easton,MD,4106691642
3,Willow,New York,NY,2125824976
4,Bernardo,Conroe,TX,9363363951
5,Ammie,Columbus,OH,6148019788
6,Francine,Las Cruces,NM,5059773911
7,Ernie,Ridgefield Park,NJ,2017096245
8,Albina,Dunellen,NJ,7329247882
9,Alishia,New York,NY,2128601579
10,Solange,Metairie,LA,5049799175

We will load "finance_data.txt" from the local filesystem into HDFS "/pigexample/" using the below command.
Command:
$hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/finance_data.txt /pigexample/
Now we will create a relation and load data from HDFS to Pig.
Command:
grunt> findata = LOAD '/pigexample/finance_data.txt' USING PigStorage(',') as (empid:
int,empname:chararray,city:chararray,state:chararray,phone:int );

1. ORDER BY
The ORDER BY operator is used to sort the content of a relation based on one or more fields.
Syntax:
grunt> alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias
[ASC|DESC] …] } [PARALLEL n];

We will perform ORDER BY operation on relation “findata” using the state column.
Command:
grunt> orderdata = ORDER findata BY state DESC;

Output:

Now we will use the DUMP operator to print the output of relation “orderdata” on screen.
Command:
grunt> DUMP orderdata;

Output:

2. LIMIT
The LIMIT operator provides a limited number of tuples for a relation.
Syntax:
grunt> alias = LIMIT alias n;

We will use the LIMIT operation to restrict the output of the relation “findata” to ten rows and using the DUMP operator we
will print records on the terminal.
Command:
grunt> limitdata = LIMIT findata 10;
grunt> DUMP limitdata;

Output:

FILTER:
The FILTER operator is used to select the required tuples from a relation based on a condition.
Syntax:
Given below is the syntax of the FILTER operator.
grunt> Relation2_name = FILTER Relation1_name BY (condition);
Example:
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt:
001,Rajiv,Reddy,21,9848022337,Hyderabad

002,siddarth,Battacharya,22,9848022338,Kolkata

003,Rajesh,Khanna,22,9848022339,Delhi

004,Preethi,Agarwal,21,9848022330,Pune

005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar

006,Archana,Mishra,23,9848022335,Chennai

007,Komal,Nayak,24,9848022334,trivendram

008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Let us now use the Filter operator to get the details of the students who belong to the city Chennai.
filter_data = FILTER student_details BY city == 'Chennai';

Verification:
Verify the relation filter_data using the DUMP operator as shown below.
grunt> Dump filter_data;

Output:
It will produce the following output, displaying the contents of the relation filter_data as follows:
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
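The aim for this experiment also asks for projection. In Pig Latin, projection is done with FOREACH ... GENERATE; a minimal sketch on the findata relation defined earlier (using the column names from its LOAD statement) is:

grunt> projdata = FOREACH findata GENERATE empname, state;
grunt> DUMP projdata;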

EXPERIMENT-11

AIM:
Install and Run Hive then use Hive to create, alter, and drop databases, tables, views, functions, and
indexes

1. Pre-requisites:

Install Hadoop by following this guide: https://medium.com/republic-of-coders-india/guide-to-install-and-run-hadoop-on-windows-a0b64fe447b6

Download Apache Derby Binaries:

Hive requires a relational database like Apache Derby to create a Metastore and store all metadata

Download the derby tar file from the following link:

https://downloads.apache.org//db/derby/db-derby-10.14.2.0/db-derby-10.14.2.0-bin.tar.gz

Extract it to the location where you have installed Hadoop

2. Download Hive binaries:

Download Hive binaries from the following link:

https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz

Extract it to the location where you have installed Hadoop

3. Setting up Environment variables:

Type 'environment' in the Windows Search Bar
Click on Environment Variables

Click on New

Add the following variables:

HIVE_HOME: E:\hadoop-3.1.0\apache-hive-3.1.2-bin

DERBY_HOME: E:\hadoop-3.1.0\db-derby-10.14.2.0-bin

HIVE_LIB: E:\hadoop-3.1.0\apache-hive-3.1.2-bin\lib

HIVE_BIN: E:\hadoop-3.1.0\apache-hive-3.1.2-bin\bin

HADOOP_USER_CLASSPATH_FIRST: true

In Path Variable in User Variables add the following paths:

%HIVE_BIN%

%DERBY_HOME%\bin

Now in System Variables add the following:

HADOOP_USER_CLASSPATH_FIRST: true

4. Configuring Hive:

Copy Derby Libraries:

Copy all the jar files stored in Derby library files stored in:

E:\hadoop-3.1.0\db-derby-10.14.2.0-bin\lib

And paste them in Hive libraries directory:

E:\hadoop-3.1.0\apache-hive-3.1.2-bin\lib

5. Configuring Hive-site.xml:

Create a new file with the name hive-site.xml in E:\hadoop-3.1.0\apache-hive-3.1.2-bin\conf

Add the following lines in the file

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.ClientDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <description>Enable user impersonation for HiveServer2</description>
    <value>true</value>
  </property>
  <property>
    <name>hive.server2.authentication</name>
    <value>NONE</value>
    <description>Client authentication types. NONE: no authentication check; LDAP: LDAP/AD based authentication; KERBEROS: Kerberos/GSSAPI authentication; CUSTOM: Custom authentication provider (use with property hive.server2.custom.authentication.class)</description>
  </property>
  <property>
    <name>datanucleus.autoCreateTables</name>
    <value>True</value>
  </property>
  <property>
    <name>hive.server2.active.passive.ha.enable</name>
    <value>true</value> <!-- changed from false to true -->
  </property>
</configuration>

6. Starting Services:

Start Hadoop Services:

Change the directory in terminal to the location where Hadoop is stored and give the following command:

start-all.cmd

Start Derby Network Server:

Start the Derby Network Server with the following command:

StartNetworkServer -h 0.0.0.0

Initialize Hive Metastore:

Give the following command to initialize Hive Metastore:


hive --service schematool -dbType derby -initSchema

Start Hive Server:


hive --service hiveserver2 start

Start Hive:

Start Hive by giving the following command:

hive
Create database:

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name


[COMMENT database_comment]
[LOCATION hdfs_path]
[MANAGEDLOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
LOCATION is used to specify default HDFS location for external table while MANAGEDLOCATION is the default
HDFS location for managed tables.

Example:

CREATE DATABASE IF NOT EXISTS hql;


CREATE SCHEMA IF NOT EXISTS hql;
Output:

OK

Alter database:

ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;


ALTER (DATABASE|SCHEMA) database_name SET LOCATION hdfs_path;
ALTER (DATABASE|SCHEMA) database_name SET MANAGEDLOCATION hdfs_path;
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value,
...);
Example:

ALTER DATABASE hql SET DBPROPERTIES('database usage'='Hive SQL tutorials.');


Output:
OK

Tables:
In Hive, we can create a table using conventions similar to SQL. It supports a wide range of flexibility in where the data files for tables are stored. It provides two types of table:
o Internal table

o External table

Internal Table

The internal tables are also called managed tables, as the lifecycle of their data is controlled by Hive. By default, these tables are stored in a subdirectory under the directory defined by hive.metastore.warehouse.dir (i.e. /user/hive/warehouse). Internal tables are not flexible enough to share with other tools like Pig. If we drop an internal table, Hive deletes both the table schema and the data.

o Let's create an internal table by using the following command:-

hive> create table demo.employee (Id int, Name string, Salary float)
row format delimited
fields terminated by ',';

Here, the command also includes the information that the data is separated by ','.

o Let's see the metadata of the created table by using the following command:-

hive> describe demo.employee;

o Let's see the result when we try to create the existing table again.

In such a case, the exception occurs. If we want to ignore this type of exception, we can use if not exists command while
creating the table.

hive> create table if not exists demo.employee (Id int, Name string, Salary float)
row format delimited
fields terminated by ',';

o Let's see the metadata of the created table by using the following command: -

hive> describe demo.employee;

Hive allows creating a new table by using the schema of an existing table.

hive> create table if not exists demo.copy_employee
like demo.employee;

External Table

The external table allows us to create and access a table and its data externally. The external keyword is used to specify the external table, whereas the location keyword is used to determine the location of the loaded data.

As the table is external, the data is not present in the Hive directory. Therefore, if we try to drop the table, the
metadata of the table will be deleted, but the data still exists.

To create an external table, follow the below steps: -

o Let's create a directory on HDFS by using the following command: -

hdfs dfs -mkdir /HiveDirectory

o Now, store the file on the created directory.

hdfs dfs -put hive/emp_details /HiveDirectory

o Let's create an external table using the following command: -

hive> create external table emplist (Id int, Name string, Salary float)
row format delimited
fields terminated by ','
location '/HiveDirectory';

o Now, we can use the following command to retrieve the data: -

select * from emplist;

Creating a View:

You can create a view at the time of executing a SELECT statement. The syntax is as follows:

CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...) ]
[COMMENT table_comment]
AS SELECT ...
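A minimal example, using the demo.employee table created earlier (the view name and the salary threshold are illustrative):

hive> CREATE VIEW IF NOT EXISTS emp_30000 AS
SELECT Id, Name FROM demo.employee WHERE Salary > 30000;

The view can later be removed with DROP VIEW IF EXISTS emp_30000;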

Creating an Index

An Index is nothing but a pointer on a particular column of a table. Creating an index means creating
a pointer on a particular column of a table. Its syntax is as follows:

CREATE INDEX index_name


ON TABLE base_table_name (col_name, ...)

AS 'index.handler.class.name'

[WITH DEFERRED REBUILD]

[IDXPROPERTIES (property_name=property_value, ...)]

[IN TABLE index_table_name]

[PARTITIONED BY (col_name, ...)]

[ ROW FORMAT ...] STORED AS ...

| STORED BY ...

[LOCATION hdfs_path]

[TBLPROPERTIES (...)]
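To round out the aim (create, alter, and drop), a few illustrative statements are sketched below. The table, database, and view names refer to the objects created earlier in this experiment; the index handler class is the standard compact index handler, and the UDF class in the function example is a placeholder. Note that index support was removed in Hive 3.0, so the CREATE INDEX syntax above applies only to older Hive releases, not to the Hive 3.1.2 installation described in this experiment.

-- Index on the Salary column of demo.employee (Hive 2.x and earlier only)
CREATE INDEX emp_salary_idx
ON TABLE demo.employee (Salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
DROP INDEX IF EXISTS emp_salary_idx ON demo.employee;

-- Alter and drop a table
ALTER TABLE demo.employee RENAME TO demo.employee_new;
DROP TABLE IF EXISTS demo.employee_new;

-- Register and drop a user-defined function (the class name is a placeholder)
CREATE TEMPORARY FUNCTION my_upper AS 'com.example.udf.MyUpper';
DROP TEMPORARY FUNCTION IF EXISTS my_upper;

-- Drop the database created earlier, together with any tables it still contains
DROP DATABASE IF EXISTS hql CASCADE;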
