Big Data Analytics Lab
DEPARTMENT OF
ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
SIR C R REDDY COLLEGE OF ENGINEERING, ELURU.
DEPARTMENT OF
ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
CERTIFICATE
This is to certify that this is a bonafide record of the work done in the BIG DATA
ANALYTICS LABORATORY by Mr/Mrs: ________________ bearing
Regd. No: ____________ in the III/IV B.Tech AI&DS course during the academic year
2023-2024.
EXTERNAL EXAMINER
SIR C.R.REDDY COLLEGE OF ENGINEERING, ELURU-534007, WEST GODAVARI DIST., A.P., INDIA
(Approved by AICTE, New Delhi )
Phone no: 08812-230840, 2300656 Fax: 08812-224193
Visit us at http://www.sircrrengg.ac.in
DEPARTMENT OF
ARTIFICIAL INTELLIGENCE AND DATA SCIENCE
INDEX
S. No   Name of Experiment   Dates
1   Implement the following Data structures in Java: a) Linked Lists b) Stacks c) Queues d) Set e) Map
2   (i) Perform setting up and installing Hadoop in its three operating modes: Standalone, Pseudo-distributed, Fully distributed. (ii) Use web-based tools to monitor your Hadoop setup
3   Implement file management tasks in Hadoop: adding files and directories, retrieving files, and deleting files
4   Run a basic Word Count MapReduce program to understand the MapReduce paradigm
5   Write a MapReduce program that mines weather data
6   Use MapReduce to find the shortest path between two people in a social graph
7   Implement the Friends-of-friends algorithm in MapReduce
8   Implement an iterative PageRank graph algorithm in MapReduce
9   Perform an efficient semi-join in MapReduce
10  Install and run Pig, then write Pig Latin scripts to sort, group, join, project, and filter your data
11  Install and run Hive, then use Hive to create, alter, and drop databases, tables, views, functions, and indexes
Big Data Analytics Lab Manual
List of Experiments:
Experiment 1: Week 1, 2:
1. Implement the following Data structures in Java
a) Linked Lists b) Stacks c) Queues d) Set e) Map
Experiment 2: Week 3:
2. (i) Perform setting up and installing Hadoop in its three operating modes:
Standalone, Pseudo-distributed, Fully distributed. (ii) Use web-based tools to monitor your Hadoop setup.
Experiment 3: Week 4:
3. Implement file management tasks in Hadoop, such as:
Adding files and directories
Retrieving files
Deleting files
Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and copies them into HDFS using one of the above file management commands.
Experiment 4: Week 5:
4. Run a basic Word Count MapReduce program to understand MapReduce Paradigm.
Experiment 5: Week 6:
5. Write a map reduce program that mines weather data.
Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data, which is a good candidate for analysis with MapReduce, since the data is semi-structured and record-oriented.
Experiment 6: Week 7:
6. Use MapReduce to find the shortest path between two people in a social graph.
Hint: Use an adjacency list to model a graph, and for each node store the distance from the original node, as well as a back
pointer to the original node. Use the mappers to propagate the distance to the original node, and the reducer to restore the state of the graph. Iterate until the target node is reached.
Experiment 7: Week 8:
7. Implement Friends-of-friends algorithm in MapReduce.
Hint: Two MapReduce jobs are required to calculate the FoFs for each user in a social network. The first job calculates the
common friends for each user, and the second job sorts the common friends by the number of connections to your friends.
Experiment 8: Week 9:
8. Implement an iterative PageRank graph algorithm in MapReduce.
Hint: PageRank can be implemented by iterating a MapReduce job until the graph has converged. The mappers are responsible for
propagating node PageRank values to their adjacent nodes, and the reducers are responsible for calculating new PageRank
values for each node, and for re-creating the original graph with the updated PageRank values.
Experiment 9: Week 10:
9. Perform an efficient semi-join in MapReduce.
Hint: Perform a semi-join by having the mappers load a Bloom filter from the Distributed Cache, and then filter results
from the actual MapReduce data source by performing membership queries against the Bloom filter to determine which
data source records should be emitted to the reducers.
Experiment 10: Week 11:
10. Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and filter your data.
Experiment 11: Week 12:
11. Install and Run Hive then use Hive to create, alter, and drop databases, tables, views,
functions, and indexes
Add-On Experiments
1. Write a program on type conversion by auto boxing and unboxing
2. Write a program that adds and deletes elements from a list by using the Map and Set classes
of the collection framework
Experiment 1
AIM: Implement the following Data structures in Java
1 a) Linked Lists
Program
// Java Program to Demonstrate implementation of LinkedList class, File name: LinkedListDemo.java
// Importing required classes
import java.util.*;
// Main class
public class LinkedListDemo {
// Driver code
public static void main(String args[])
{
// Creating object of the LinkedList class
LinkedList<String> ll = new LinkedList<String>();
// Adding elements to the linked list
ll.add("Grapes");
ll.add("Banana");
ll.addLast("Apple");
ll.addLast("Watermelon");
System.out.println("LinkedList items\n" + ll);
System.out.println('\n');
ll.addFirst("Orange");
ll.add(3, "Cherry");
ll.add(3, "Mango"); // the item previously at index 3 shifts to index 4
System.out.println(ll);
System.out.println('\n');
System.out.println("ll.get(0): " + ll.get(0)); // returns the element at the specified position in this list
System.out.println("ll.peek(): " + ll.peek()); // retrieves, but does not remove, the head (first element) of this list
ll.remove(3);
ll.removeFirst();
ll.removeLast();
System.out.println("Linked List items: " + ll);
}
}
Output:
LinkedList items
[Grapes, Banana, Apple, Watermelon]
[Orange, Grapes, Banana, Mango, Cherry, Apple, Watermelon]
ll.get(0): Orange
ll.peek(): Orange
Linked List items: [Grapes, Banana, Cherry, Apple]
1 b) Stacks
Program:
import java.io.*;
import java.util.*;
class StackDemo {
// Main Method
public static void main(String[] args)
{
// (The full listing first builds a Stack<String> stack1 holding
// [Grapes, Mango, Banana, Cherry, Watermelon, Orange], removes the element
// at index 3, pops twice and prints it after each step; that part of the
// listing is not reproduced in this manual -- see the Output below.)
Stack<Object> stack2 = new Stack<Object>();
stack2.push("Hadoop");
stack2.push(3); // heterogeneous data can also be added
stack2.push("Year");
System.out.println("Stack2 elements");
System.out.println(stack2);
}
}
Output:
Printing the Stack1 Elements
[Grapes, Mango, Banana, Cherry, Watermelon, Orange]
After removal from 3rd index:
[Grapes, Mango, Banana, Watermelon, Orange]
Popped element: Orange
Popped element: Watermelon
Stack after pop operation:
[Grapes, Mango, Banana]
Stack2 elements
[Hadoop, 3, Year]
1 c) Queues
Program:
import java.util.*;
class PriorityQueueDemo
{
public static void main(String args[])
{
PriorityQueue<String> queue = new PriorityQueue<String>();
queue.add("One");
queue.add("Two");
queue.add("Three");
queue.add("Four");
queue.add("Five");
System.out.println("\nhead:" + queue.element());
System.out.println("head:" + queue.peek());
System.out.println("iterating the queue elements:\n");
Iterator<String> itr = queue.iterator();
while (itr.hasNext())
{
System.out.println(itr.next());
}
queue.remove();
queue.poll();
System.out.println();
System.out.println("After removing two elements: \n");
Iterator<String> itr2 = queue.iterator();
while (itr2.hasNext())
{
System.out.println(itr2.next());
}
}
}
Output:
head:Five
head:Five
iterating the queue elements:
Five
Four
Two
One
Three
Three
One
Two
1 d) Set
Program:
// Java program Illustrating Set Interface (class name SetDemo assumed)
import java.util.*;
public class SetDemo {
public static void main(String[] args)
{
// HashSet stores only unique elements, so the duplicate "Geeks" is ignored
Set<String> hash_Set = new HashSet<String>();
hash_Set.add("Geeks");
hash_Set.add("For");
hash_Set.add("Geeks");
hash_Set.add("Example");
hash_Set.add("Set");
System.out.println(hash_Set);
}
}
Output:
[Set, Example, Geeks, For]
1 e) Map
MapHashDemo {
Output:
Map Elements
2) Map-Ex2
import java.util.*;
public class MapEx2 { // class name assumed
public static void main(String[] args)
{
// Create a hash map
HashMap<String, Double> hm = new HashMap<String, Double>();
// Put elements into the map
hm.put("David", 3434.34);
hm.put("Mahesh", 123.22);
hm.put("Kavya", 1378.00);
hm.put("Lavanya", 99.22);
hm.put("Kiran", -19.08);
// Print each entry as "name: value"
for (Map.Entry<String, Double> e : hm.entrySet())
System.out.println(e.getKey() + ": " + e.getValue());
}
}
Output:
Kiran: -19.08
Mahesh: 123.22
Lavanya: 99.22
EXPERIMENT-2
AIM: (i)Perform setting up and Installing Hadoop in its three operating modes:
Standalone, Pseudo distributed, Fully distributed
(ii)Use web based tools to monitor your Hadoop setup
As we all know, Hadoop is an open-source framework which is mainly used for storing, maintaining,
and analyzing a large amount of data or datasets on clusters of commodity hardware, which means it is
actually a data management tool. Hadoop also possesses a scale-out storage property, which means that we can
scale up or scale down the number of nodes as per our requirement in the future, which is really a cool feature.
1. Standalone Mode
In Standalone Mode none of the daemons will run, i.e. Namenode, Datanode, Secondary Namenode, Job
Tracker, and Task Tracker. We use the job tracker and task tracker for processing purposes in Hadoop 1; for
Hadoop 2 we use the Resource Manager and Node Manager. Standalone Mode also means that we are installing
Hadoop only on a single system. By default, Hadoop is made to run in this Standalone Mode, or we can also call it
Local mode. We mainly use Hadoop in this mode for the purpose of learning, testing, and debugging.
Hadoop works fastest in this mode among all of these 3 modes. As we all know, HDFS (Hadoop
Distributed File System) is one of the major components of Hadoop and is used for storage; HDFS is not
utilized in this mode, and the local file system is used for input and output instead. You can think of HDFS as similar
to the file systems available for Windows, i.e. NTFS (New Technology File System) and FAT32 (32-bit File Allocation Table).
When Hadoop works in this mode there is no need to configure the files hdfs-site.xml, mapred-site.xml, and core-site.xml
for the Hadoop environment. In this mode, all of your processes run in a single JVM (Java Virtual
Machine) and this mode can only be used for small development purposes.
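Since Standalone mode needs no configuration, a quick way to verify it is to run one of the bundled example jobs directly against the local file system; the example-jar path below depends on your Hadoop version and is an assumption.
$ mkdir input
$ cp $HADOOP_HOME/etc/hadoop/*.xml input
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount input output
$ cat output/*
Because no daemons are running, the job executes inside a single local JVM and writes its result to the local directory named output.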
2. Pseudo-distributed Mode
In Pseudo-distributed Mode we also use only a single node, but the main thing is that the cluster is simulated,
which means that all the processes inside the cluster will run independently of each other. All the daemons, that
is Namenode, Datanode, Secondary Namenode, Resource Manager, Node Manager, etc., will be running as
separate processes on separate JVMs (Java Virtual Machines), or we can say they run as different Java processes;
that is why it is called Pseudo-distributed mode.
One thing we should remember is that, as we are using only a single-node setup, all the Master and Slave
processes are handled by the single system. Namenode and Resource Manager are used as masters, and
Datanode and Node Manager are used as slaves. The Secondary Namenode is also used as a master; its
purpose is just to keep an hourly backup of the Namenode. In this mode,
we need to change the configuration files mapred-site.xml, core-site.xml, and hdfs-site.xml to set up the
environment.
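A minimal sketch of these three configuration files for Pseudo-distributed mode is given below; the port number and replication factor are typical single-node values and are assumptions, so adjust them to your installation.
core-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
mapred-site.xml:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
(When YARN is used, yarn-site.xml additionally sets yarn.nodemanager.aux-services to mapreduce_shuffle.)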
3. Fully Distributed Mode
This is the most important mode, in which multiple nodes are used: a few of them run the master daemons,
namely Namenode and Resource Manager, and the rest of them run the slave daemons, namely DataNode and Node
Manager. Here Hadoop runs on a cluster of machines or nodes, and the data that is used is distributed
across the different nodes. This is actually the Production Mode of Hadoop; let's clarify or understand this mode in a
better way in physical terminology.
When you download Hadoop as a tar or zip file, you install it on a single system and run
all the processes on that system; but here, in the fully distributed mode, we extract this tar or zip file onto
each of the nodes in the Hadoop cluster and then use a particular node for a particular process. Once
you distribute the processes among the nodes, you define which nodes work as masters and which of
them work as slaves.
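In a fully distributed setup, every node's core-site.xml points at the master's NameNode and the master keeps a list of slave host names; a minimal sketch is shown below. The host names are assumptions, and the list file is called workers in Hadoop 3.x and slaves in Hadoop 2.x.
core-site.xml (on every node):
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://master-node:9000</value>
</property>
</configuration>
$HADOOP_HOME/etc/hadoop/workers (on the master node):
slave-node-1
slave-node-2
slave-node-3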
(ii) Three Best Monitoring Tools for Hadoop
1. Prometheus – Cloud monitoring software with a customizable Hadoop dashboard, integrations, alerts, and
many more. It keeps the data long-term, with 3x redundancy, so that we can focus on applying the data rather
than maintaining a database. Get updates and plugins without lifting a finger, as they keep
our Prometheus and Grafana stack up-to-date. It is easy to use with no extensive configuration needed to
thoroughly monitor your technology stack.
2. LogicMonitor – Infrastructure monitoring software with a Hadoop package, REST API, alerts, reports,
dashboards, and more. LogicMonitor finds, queries, and begins monitoring virtually any datacenter resource. If
you have a resource in your datacenter that is not immediately found and monitored, LogicMonitor’s
professional services will investigate how to add it.
3. Dynatrace – Application performance management software with Hadoop monitoring - with
NameNode/DataNode metrics, dashboards, analytics, custom alerts, and more. Dynatrace provides a high-
level overview of the main Hadoop components within your cluster. Enhanced insights are available for HDFS
and MapReduce. Hadoop-specific metrics are presented alongside all infrastructure measurements, providing
you with in-depth Hadoop performance analysis of both current and historical data.
Experiment -3
AIM: Implement file management tasks in Hadoop, such as adding files and directories, retrieving files, and deleting files.
SYNTAX AND COMMANDS TO ADD, RETRIEVE AND DELETE DATA FROM HDFS
Step 1: Starting HDFS. Initially you have to format the configured HDFS file system, open the namenode (HDFS server), and
execute the following command.
$ hadoop namenode -format
After formatting the HDFS, start the distributed file system. The following command will start the namenode as well as the data
nodes as cluster.
$ start-dfs.sh
After loading the information in the server, we can find the list of files in a directory, or the status of a file, using 'ls'. Given below is the
syntax of ls, to which you can pass a directory or a filename as an argument.
$ $HADOOP_HOME/bin/hadoop fs -ls <args>
Assume we have data in a file called file.txt in the local system which ought to be saved in the HDFS file system. Follow the
steps given below to insert the required file in the Hadoop file system.
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input
Transfer and store a data file from local systems to the Hadoop file system using the put command.
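The put command itself is not reproduced in the manual; assuming the local file sits in /home/hadoop, the transfer looks like this.
$ $HADOOP_HOME/bin/hadoop fs -put /home/hadoop/file.txt /user/input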
Assume we have a file in HDFS called outfile. Given below is a simple demonstration for retrieving the required file from the
Hadoop file system.
Get the file from HDFS to the local file system using get command.
$ $HADOOP_HOME/bin/hadoop fs -get /user/output/ /home/hadoop_tp/
Finally, you can shut down the HDFS by using the following command.
$ stop-dfs.sh
Experiment-4
Run a basic Word Count Map Reduce program to understand Map Reduce Paradigm.
AIM: To develop a MapReduce program to calculate the frequency of a given word in a given file.
Map Function – It takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).
Example – (Map function in Word Count)
Input (set of data):
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN
Output (converted into another set of (Key, Value) pairs):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1), (BUS,1), (buS,1), (caR,1),
(CAR,1), (car,1), (BUS,1), (TRAIN,1)
Reduce Function – Takes the output from Map as an input and combines those data tuples into a smaller set of tuples.
Example – (Reduce function in Word Count)
Input (set of tuples, the output of the Map function):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1),
(car,1), (BUS,1), (TRAIN,1)
Output (since this WordCount implementation upper-cases each word in the mapper):
(BUS,7), (CAR,7), (TRAIN,4)
1. Splitting – The splitting parameter can be anything, e.g. splitting by space, comma, semicolon, or even by a new line ('\n').
2. Mapping – as explained above.
3. Intermediate splitting – the entire process runs in parallel on different clusters. In order to group them in the "Reduce Phase", data with the same KEY should be on the same cluster.
4. Reduce – it is nothing but mostly a group-by phase.
5. Combining – The last phase where all the data (the individual result set from each cluster) is combined together to form a result.
Step 1. Verify that Hadoop and Java are installed:
hadoop version
javac -version
Step 2. Create a directory on the Desktop named Lab and inside it create two folders;
one called “Input” and the other called “tutorial_classes”. [You can do this step using GUI normally or through terminal
commands]
cd Desktop
mkdir Lab
mkdir Lab/Input
mkdir Lab/tutorial_classes
Step 3. Add the file attached with this document “WordCount.java” in the directory Lab
Step 4. Add the file attached with this document “input.txt” in the directory Lab/Input
Step 5. Type the following command to export the hadoop classpath into bash.
export HADOOP_CLASSPATH=$(hadoop classpath)
Make sure it is now exported.
echo $HADOOP_CLASSPATH
Step 6. It is time to create these directories on HDFS rather than locally. Type the following commands.
hadoop fs -mkdir /WordCountTutorial
hadoop fs -mkdir /WordCountTutorial/Input
hadoop fs -put Lab/Input/input.txt /WordCountTutorial/Input
Step 7. Go to localhost:9870 from the browser, open "Utilities → Browse the file system", and you should see the directories and files
we placed in the file system.
Step 8. Then, go back to the local machine where we will compile the WordCount.java file.
Assuming we are currently in the Desktop directory:
cd Lab
javac -classpath $HADOOP_CLASSPATH -d tutorial_classes WordCount.java
Put the output files in one jar file (note the dot at the end of the command):
jar -cvf WordCount.jar -C tutorial_classes .
Step 9. Now, we run the jar file on Hadoop.
hadoop jar WordCount.jar WordCount /WordCountTutorial/Input /WordCountTutorial/Output
hadoop fs -cat /WordCountTutorial/Output/*
// WordCount.java
// (The opening of this listing, up to setOutputValueClass, is abridged in the manual; a standard driver setup is shown here.)
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static void main(String[] args) throws Exception {
Configuration c = new Configuration();
Path input = new Path(args[0]);
Path output = new Path(args[1]);
Job j = Job.getInstance(c, "wordcount");
j.setJarByClass(WordCount.class);
j.setMapperClass(MapForWordCount.class);
j.setReducerClass(ReduceForWordCount.class);
j.setOutputKeyClass(Text.class);
j.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(j, input);
FileOutputFormat.setOutputPath(j, output);
System.exit(j.waitForCompletion(true)?0:1);
}
public static class MapForWordCount extends Mapper <LongWritable, Text, Text,IntWritable>{
public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException
{
String line = value.toString();
String[] words=line.split(",");
for(String word: words )
{
Text outputKey = new Text(word.toUpperCase().trim());
IntWritable outputValue = new IntWritable(1);
con.write(outputKey, outputValue);
}
}
}
public static class ReduceForWordCount extends Reducer<Text, IntWritable, Text, IntWritable>
{
public void reduce(Text word, Iterable<IntWritable> values, Context con) throws IOException,
InterruptedException
{
int sum = 0;
for(IntWritable value : values)
{
sum += value.get();
}
con.write(word, new IntWritable(sum));
}
}
}
The output is stored in /WordCountTutorial/Output/part-r-00000
OUTPUT:
EXERCISE-5:
AIM:- Write a Map Reduce Program that mines Weather Data.
DESCRIPTION:
Climate change has been receiving a lot of attention for a long time, and its adverse effects are being
felt in every part of the earth. There are many examples of this, such as rising sea levels, reduced rainfall,
and increased humidity. The proposed system overcomes some of the issues that occur with other techniques. In
this project we use the concept of Big Data and Hadoop. In the proposed architecture we are able to process offline
data which is stored by the National Climatic Data Center (NCDC). Through this we are able to find out the
maximum temperature and minimum temperature of a year, and to predict the future weather forecast. Finally,
we plot a graph of the obtained MAX and MIN temperatures for each month of the particular year to visualize the
temperature. Based on the previous years' data, the weather of the coming year is predicted.
ALGORITHM:-
MAPREDUCE PROGRAM
WordCount is a simple program which counts the number of occurrences of each word in a given text input data set. WordCount
fits very well with the MapReduce programming model making it a great example to understand the Hadoop Map/Reduce
programming style. Our implementation consists of three main parts:
1.Mapper
2.Reducer
3.Main program
Step-1 Write a Mapper
A Mapper overrides the "map" function from the class "org.apache.hadoop.mapreduce.Mapper", which provides <key, value> pairs
as the input. A Mapper implementation may output <key, value> pairs using the provided Context.
The input value of the WordCount Map task will be a line of text from the input data file, and the key would be the line number
<line_number, line_of_text>. The Map task outputs <word, one> for each word in the line of text.
Pseudo-code
void Map (key, value)
{
for each max_temp x in value:
output.collect(x, 1);
}
void Map (key, value)
{
for each min_temp x in value:
output.collect(x, 1);
}
Step-2 Write a Reducer
A Reducer collects the intermediate <key, value> output from multiple map tasks and assembles it into a single result. Here, the
WordCount program will sum up the occurrences of each word and output pairs as
<word, occurrence>.
Pseudo-code
void Reduce (max_temp, <list of value>)
{
for each x in <list of value>:
sum+=x;
final_output.collect(max_temp, sum);
}
void Reduce (min_temp, <list of value>)
{
for each x in <list of value>:
sum+=x;
final_output.collect(min_temp, sum);
}
3.Write Driver
The Driver program configures and runs the MapReduce job. We use the main program to perform basic
configurations such as:
Job Name : name of this Job
Executable (Jar) Class: the main executable class. For here, WordCount.
Mapper Class: the class which overrides the "map" function. For here, Map.
Reducer Class: the class which overrides the "reduce" function. For here, Reduce.
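The full weather-mining listing is not reproduced in this manual, so the following is a minimal sketch of a maximum-temperature job. The class names and the simplified input format of one "year,temperature" pair per line are assumptions; real NCDC records need extra parsing and quality checks, and a second job using Math.min finds the minimum temperature in the same way.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class MaxTemperature {
// Mapper: emit (year, temperature) for every input line of the assumed "year,temperature" form
public static class TempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException {
String[] parts = value.toString().split(",");
if (parts.length == 2) {
con.write(new Text(parts[0].trim()), new IntWritable(Integer.parseInt(parts[1].trim())));
}
}
}
// Reducer: keep the largest reading seen for each year
public static class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text year, Iterable<IntWritable> temps, Context con) throws IOException, InterruptedException {
int max = Integer.MIN_VALUE;
for (IntWritable t : temps) {
max = Math.max(max, t.get());
}
con.write(year, new IntWritable(max));
}
}
public static void main(String[] args) throws Exception {
Job job = Job.getInstance(new Configuration(), "max temperature");
job.setJarByClass(MaxTemperature.class);
job.setMapperClass(TempMapper.class);
job.setReducerClass(MaxTempReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
It is packaged and run exactly like the WordCount job above, for example: hadoop jar MaxTemperature.jar MaxTemperature /weather/Input /weather/Output.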
EXPERIMENT-6
AIM: Use MapReduce to find the shortest path between two people in a social graph.
Program:
import java.io.IOException;
import java.util.*;
FileInputFormat.addInputPath(job, new Path(args[0])); // Input path
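The listing above is abridged in this manual, so a minimal sketch of one breadth-first-search iteration is given below, following the hint (an adjacency list per node plus its current distance). The input format node<TAB>distance<TAB>comma-separated-neighbours, the class names, and the marker INF for unreached nodes are assumptions; the job is re-run on its own output until the target person's distance stops changing.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class ShortestPathIteration {
// Mapper: pass the node record through and propagate distance+1 to every neighbour
public static class BfsMapper extends Mapper<Object, Text, Text, Text> {
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String[] parts = value.toString().split("\t");
String node = parts[0];
String distance = parts[1];
String adjacency = parts.length > 2 ? parts[2] : "";
context.write(new Text(node), new Text("NODE\t" + distance + "\t" + adjacency));
if (!distance.equals("INF")) {
int d = Integer.parseInt(distance);
for (String neighbour : adjacency.split(",")) {
if (!neighbour.isEmpty()) {
context.write(new Text(neighbour), new Text("DIST\t" + (d + 1)));
}
}
}
}
}
// Reducer: keep the smallest distance seen for the node and restore its adjacency list
public static class BfsReducer extends Reducer<Text, Text, Text, Text> {
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
int best = Integer.MAX_VALUE;
String adjacency = "";
for (Text val : values) {
String[] parts = val.toString().split("\t");
if (parts[0].equals("NODE")) {
adjacency = parts.length > 2 ? parts[2] : "";
if (!parts[1].equals("INF")) best = Math.min(best, Integer.parseInt(parts[1]));
} else {
best = Math.min(best, Integer.parseInt(parts[1]));
}
}
String distance = (best == Integer.MAX_VALUE) ? "INF" : String.valueOf(best);
context.write(key, new Text(distance + "\t" + adjacency));
}
}
public static void main(String[] args) throws Exception {
Job job = Job.getInstance(new Configuration(), "shortest path iteration");
job.setJarByClass(ShortestPathIteration.class);
job.setMapperClass(BfsMapper.class);
job.setReducerClass(BfsReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0])); // Input path
FileOutputFormat.setOutputPath(job, new Path(args[1])); // Output path for the next iteration
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}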
EXPERIMENT-7
AIM: Implement the Friends-of-friends algorithm in MapReduce.
Program:
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
StringTokenizer tokenizer = new StringTokenizer(value.toString());
String personA = tokenizer.nextToken();
String personB = tokenizer.nextToken();
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException,
InterruptedException {
List<String> friends = new ArrayList<>();
for (Text val : values) {
friends.add(val.toString());
}
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Usage:
Compile the Java code and create a JAR file. Then, you can run the MapReduce job using Hadoop:
hadoop jar FriendsOfFriends.jar input_directory output_directory
EXPERIMENT-8
AIM: Implement an iterative PageRank graph algorithm in MapReduce.
Program:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
private static final double dampingFactor = 0.85; // Damping factor for PageRank calculation
public void reduce(Text key, Iterable<DoubleWritable> values, Context context) throws IOException, InterruptedException {
double sum = 0;
sum += value.get();
job.setJarByClass(PageRank.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
Usage:
Compile the Java code and create a JAR file. Then, you can run the MapReduce job using Hadoop (assuming the jar is named PageRank.jar):
hadoop jar PageRank.jar PageRank input_directory output_directory
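For reference, the update that the reducers apply in each iteration is the standard PageRank formula, with d the damping factor (0.85 above) and N the total number of nodes:
PR(n) = \frac{1 - d}{N} + d \sum_{m \in In(n)} \frac{PR(m)}{C(m)}
where In(n) is the set of nodes that link to n and C(m) is the out-degree of m. The job's output is fed back in as its input until the PageRank values change by less than a chosen tolerance.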
EXPERIMENT-9
AIM: Perform an efficient semi-join in MapReduce.
Program:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
context.write(new Text(joinKey), new Text("A\t" + value.toString())); // Prefix 'A' denotes the first dataset
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
context.write(new Text(joinKey), new Text("B")); // Prefix 'B' denotes the second dataset
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
if (value.toString().equals("B")) {
foundSecondDataset = true;
break;
if (!foundSecondDataset) {
context.write(new Text(parts[1]), new Text()); // Emit values from the first dataset
job.setJarByClass(SemiJoin.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
System.exit(job.waitForCompletion(true) ? 0 : 1);
Usage:
Compile the Java code and create a JAR file. Then, you can run the MapReduce job using Hadoop (assuming the jar is named SemiJoin.jar):
hadoop jar SemiJoin.jar SemiJoin input_directory output_directory
EXPERIMENT-10
AIM: Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and filter your data.
1. Prerequisites
1. Hardware Requirement
* RAM — Min. 8GB, if you have SSD in your system then 4GB RAM would also work.
* I will be using 64-bit Windows for the process; please check and download the version supported by your system.
5. Hadoop
* I am using Hadoop-2.9.2, you can also use any other STABLE version for Hadoop.
* If you don't have Hadoop, you can refer to installing it from Hadoop: How to install in 5 Steps in Windows 10.
* I am using PIG-0.17.0, you can also use any other STABLE version of Apache Pig.
2. Unzip and Install PIG
Once extracted, we would get a new file pig-0.17.0.tar.
Now we can organize our PIG installation, we can create a folder and move the final extracted file in it. For EX:
Please note: while creating folders, DO NOT ADD SPACES IN BETWEEN THE FOLDER NAME (it can cause problems later).
I have placed my PIG in the D: drive; you can use C: or any other drive also.
Another important step in setting up a work environment is to set your Systems environment variable.
3.1 Setting PIG_HOME
Now as shown, add PIG_HOME in variable name and path of PIG in Variable Value.
The last step in setting the Environment variable is setting Path in System Variable.
Select Path variable in the system variables and click on “Edit”.
* %PIG_HOME%\bin
Click OK and OK, and we are done with setting the environment variables.
Note:- If you want the path to be set for all users you need to select “New” from System Variables.
Now we need to verify that what we have done is correct and reflecting.
Open a NEW Command Window
echo %PIG_HOME%
4. Verifying Setup
Open a cmd window, run the below command to test the connection and PIG.
pig -version
Upon running the command we should get the version of PIG. i.e 0.17.0 in our case.
Don't worry if you get the below error after running pig -version:
'-Xmx1000M' is not recognized as an internal or external command,operable program or batch file.
To resolve this, edit the pig.cmd file in the bin folder of the Pig installation and change the line set HADOOP_BIN_PATH=%HADOOP_HOME%\bin to set HADOOP_BIN_PATH=%HADOOP_HOME%\libexec (the usual fix for this error on Windows).
The next step is to verify the setup once again, so we need to execute the pig -version command again.
5. Starting PIG
Now we need to start a new Command Prompt (remember to run it as administrator to avoid permission issues) and enter the command:
pig
Yes, it's that simple. We can see grunt> once Pig starts.
Fig 13:- Starting PIG
GROUP:
Similar to GROUP BY in SQL; here we use GROUP for a single relation and COGROUP for two or more relations. Both
GROUP and COGROUP are similar to each other.
B = GROUP A BY age;
X = GROUP A BY f2*f3;
JOIN:
The join concept is similar to SQL joins; here we have many types of joins, such as inner join, outer join, and some
specialized joins.
INNER JOIN:
The JOIN operator always performs an inner join. Inner joins ignore null keys, so it makes sense to filter them out
before the join.
Note: Both Cogroup and join work in a similar way, just the difference is Cogroup creates a nested set of output
records.
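A minimal inner-join sketch is shown below; the relations custdata and orderdata, their input files, and the shared custid field are assumptions used only for illustration.
grunt> custdata = LOAD '/pigexample/customers.txt' USING PigStorage(',') AS (custid:int, name:chararray);
grunt> orderdata = LOAD '/pigexample/orders.txt' USING PigStorage(',') AS (orderid:int, custid:int, amount:int);
grunt> innerjoined = JOIN custdata BY custid, orderdata BY custid;
grunt> DUMP innerjoined;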
OUTER JOIN:
Use the OUTER JOIN operator to perform left, right, or full outer joins. Outer joins will only work provided the
relations which need to produce nulls (in the case of non -matching keys) have schemas.
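Using the same assumed relations, the three outer-join variants look like this.
grunt> leftjoined = JOIN custdata BY custid LEFT OUTER, orderdata BY custid;
grunt> rightjoined = JOIN custdata BY custid RIGHT OUTER, orderdata BY custid;
grunt> fulljoined = JOIN custdata BY custid FULL OUTER, orderdata BY custid;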
Sort:
Apache Pig provides the ORDER BY and LIMIT operators to perform sorting and to restrict the number of records in a relation.
1. ORDER BY
2. LIMIT
We have used the "finance_data.txt" dataset to perform these operations. We will put "finance_data.txt" into the HDFS location
"/pigexample/" from the local file system.
Content of "finance_data.txt":
1,Chanel,Shawnee,KS,9133882079
2,Ezekiel,Easton,MD,4106691642
3,Willow,New York,NY,2125824976
4,Bernardo,Conroe,TX,9363363951
5,Ammie,Columbus,OH,6148019788
6,Francine,Las Cruces,NM,5059773911
7,Ernie,Ridgefield Park,NJ,2017096245
8,Albina,Dunellen,NJ,7329247882
9,Alishia,New York,NY,2128601579
10,Solange,Metairie,LA,5049799175
We will load "finance_data.txt" from the local filesystem into HDFS "/pigexample/" using the below command.
Command:
$hadoop fs -copyFromLocal /home/cloudduggu/pig/tutorial/finance_data.txt /pigexample/
Now we will create a relation and load data from HDFS to Pig.
Command:
grunt> findata = LOAD '/pigexample/finance_data.txt' USING PigStorage(',') as (empid:int,
empname:chararray, city:chararray, state:chararray, phone:chararray);
1. ORDER BY
The ORDER BY operator is used to sort the content of a relation based on one or more fields.
Syntax:
grunt> alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias
[ASC|DESC] …] } [PARALLEL n];
We will perform ORDER BY operation on relation “findata” using the state column.
Command:
grunt> orderdata = ORDER findata BY state DESC;
Output:
Now we will use the DUMP operator to print the output of relation “orderdata” on screen.
Command:
grunt> DUMP orderdata;
Output:
2. LIMIT
The LIMIT operator provides a limited number of tuples for a relation.
Syntax:
grunt> alias = LIMIT alias n;
We will use the LIMIT operation to restrict the output of the relation “findata” to ten rows and using the DUMP operator we
will print records on the terminal.
Command:
grunt> limitdata = LIMIT findata 10;
grunt> DUMP limitdata;
Output:
FILTER:
The FILTER operator is used to select the required tuples from a relation based on a condition.
Syntax:
Given below is the syntax of the FILTER operator.
grunt> Relation2_name = FILTER Relation1_name BY (condition);
Example:
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt:
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Let us now use the Filter operator to get the details of the students who belong to the city Chennai.
filter_data = FILTER student_details BY city == 'Chennai';
Verification:
Verify the relation filter_data using the DUMP operator as shown below.
grunt> Dump filter_data;
Output:
It will produce the following output, displaying the contents of the relation filter_data as follows:
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
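The AIM also asks for projecting data, which is not shown above; a minimal sketch using the student_details relation already loaded is given below (the chosen columns are arbitrary).
grunt> project_data = FOREACH student_details GENERATE id, firstname, city;
grunt> DUMP project_data;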
EXPERIMENT-11
AIM:
Install and Run Hive then use Hive to create, alter, and drop databases, tables, views, functions, and
indexes
1. Pre-requisites:
and-run-hadoop-on-windows-a0b64fe447b6
Hive requires a relational database like Apache Derby to create a Metastore and store all metadata
https://downloads.apache.org//db/derby/db-derby-10.14.2.0/db-derby-10.14.2.0-bin.tar.gz
2. Download Hive binaries:
https://downloads.apache.org/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz
Click on Environment Variables
Click on New
HIVE_HOME: E:\hadoop-3.1.0\apache-hive-3.1.2-bin
DERBY_HOME: E:\hadoop-3.1.0\db-derby-10.14.2.0-bin
HIVE_LIB: E:\hadoop-3.1.0\apache-hive-3.1.2-bin\lib
HIVE_BIN: E:\hadoop-3.1.0\apache-hive-3.1.2-bin\bin
HADOOP_USER_CLASSPATH_FIRST: true
In Path Variable in User Variables add the following paths:
%HIVE_BIN%
%DERBY_HOME%\bin
Also add the following as a System Variable:
HADOOP_USER_CLASSPATH_FIRST: true
4. Configuring Hive:
Copy all the jar files stored in Derby library files stored in:
E:\hadoop-3.1.0\db-derby-10.14.2.0-bin\lib
And paste them in Hive libraries directory:
E:\hadoop-3.1.0\apache-hive-3.1.2-bin\lib
5. Configuring Hive-site.xml:
Add the following lines in the file
<?xml version="1.0"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.ClientDriver</value>
</property>
<property>
<name>hive.server2.enable.doAs</name>
<value>true</value>
</property>
<property>
<name>hive.server2.authentication</name>
<value>NONE</value>
<description> Client authentication types. NONE: no authentication check LDAP: LDAP/AD based authentication
KERBEROS: Kerberos/GSSAPI authentication CUSTOM: Custom authentication provider (Use with property
hive.server2.custom.authentication.class) </description>
</property>
<property>
<name>datanucleus.autoCreateTables</name>
<value>True</value>
</property>
<property>
<name>hive.server2.active.passive.ha.enable</name>
<value>true</value>
</property>
</configuration>
6. Starting Services:
Change the directory in terminal to the location where Hadoop is stored and give the following command:
start-all.cmd
Then start the Derby network server:
StartNetworkServer -h 0.0.0.0
Start Hive:
Start hive by giving the following command:
hive
Create database:
Example:
hive> CREATE DATABASE demo;
OK
Alter database (setting a database property, for example):
hive> ALTER DATABASE demo SET DBPROPERTIES ('createdby' = 'student');
Tables:
In Hive, we can create a table using conventions similar to SQL. It offers a lot of flexibility in where the
data files for tables are stored. It provides two types of tables:
o Internal table
o External table
Internal Table
The internal tables are also called managed tables as the lifecycle of their data is controlled by Hive. By default,
these tables are stored in a subdirectory under the directory defined by hive.metastore.warehouse.dir (i.e.
/user/hive/warehouse). The internal tables are not flexible enough to share with other tools like Pig. If we try to drop
the internal table, Hive deletes both table schema and data.
1. hive> create table demo.employee (Id int, Name string, Salary float)
2. row format delimited
3. fields terminated by ',';
Here, the command also includes the information that the data is separated by ','.
o Let's see the metadata of the created table by using the following command:-
o Let's see the result when we try to create the existing table again.
In such a case, the exception occurs. If we want to ignore this type of exception, we can use if not exists command while
creating the table.
1. hive> create table if not exists demo.employee (Id int, Name string, Salary float)
2. row format delimited
3. fields terminated by ',';
o Let's see the metadata of the created table by using the following command: -
Hive allows creating a new table by using the schema of an existing table.
1. hive> create table if not exists demo.copy_employee
2. like demo.employee;
External Table
The external table allows us to create and access a table and its data externally. The external keyword is used to
specify an external table, whereas the location keyword is used to determine the location of the loaded data.
As the table is external, the data is not present in the Hive directory. Therefore, if we try to drop the table, the
metadata of the table will be deleted, but the data still exists.
1. hdfs dfs -mkdir /HiveDirectory
1. hive> create external table emplist (Id int, Name string, Salary float)
2. row format delimited
3. fields terminated by ','
4. location '/HiveDirectory';
Creating a View:
You can create a view at the time of executing a SELECT statement. The syntax is as follows:
CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...) ]
[COMMENT table_comment]
AS SELECT ...
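For example, a view over the employee table created earlier could be defined as below; the view name emp_30000 and the salary threshold are illustrative assumptions.
hive> CREATE VIEW emp_30000 AS
SELECT * FROM demo.employee
WHERE Salary > 30000;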
Creating an Index
An Index is nothing but a pointer on a particular column of a table. Creating an index means creating
a pointer on a particular column of a table. Its syntax is as follows:
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS 'index.handler.class.name'
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name = property_value, ...)]
[IN TABLE index_table_name]
[[ROW FORMAT ...] STORED AS ...
| STORED BY ...]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
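For example, an index on the Salary column of the employee table could be created as below; the index name and the choice of the compact index handler are illustrative. Note that index support was removed in Hive 3.0, so this statement is accepted only by older releases such as Hive 2.x, not by the Hive 3.1.2 build installed above.
hive> CREATE INDEX index_salary ON TABLE demo.employee (Salary)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;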