BDA Lab Practical
INDEX
EXPERIMENT-1
AIM: Implement the following data structures in Java.
Data Structures: A data structure is a particular way of storing and organizing data in a computer so that it can be used efficiently. Data structures provide a means to manage large amounts of data efficiently, and efficient data structures are a key to designing efficient algorithms.
The Java Collections Framework (JCF) is a set of classes and interfaces that implement commonly reusable collection data structures. Although referred to as a framework, it works in the manner of a library. The JCF provides both interfaces that define various collections and classes that implement them. The objective of this program is to implement the linked list, stack, queue, set, and map data structures.
A Java collection simply means a single unit of objects. The Java Collections Framework provides many interfaces (Set, List, Queue, Deque, etc.) and classes (ArrayList, Vector, LinkedList, PriorityQueue, HashSet, LinkedHashSet, TreeSet, etc.).
The Map interface, which is also a part of the Java Collections Framework, does not inherit from the Collection interface. The Collection interface is a member of the java.util package. Collections is a utility class in the java.util package; it consists of only static methods, which are used to operate on objects of collection types.
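A small illustrative sketch (not from the lab listings) showing a List from the framework and the static methods of the Collections utility class:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CollectionsDemo
{
    public static void main(String[] args)
    {
        // A List implementation from the framework
        List<Integer> numbers = new ArrayList<Integer>();
        numbers.add(30);
        numbers.add(10);
        numbers.add(20);

        // Static utility methods from java.util.Collections
        Collections.sort(numbers);
        System.out.println(numbers);                  // [10, 20, 30]
        System.out.println(Collections.max(numbers)); // 30
    }
}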
a) Linked List
Program:
package org.arpit.java2blog.datastructures;

class Node {

    public int data;
    public Node next;

    // Print the data held in this node
    public void displayNodeData() {
        System.out.print(data + " ");
    }
}

public class MyLinkedList {

    private Node head;

    // Insert a node at the beginning of the list
    public void insertFirst(int data) {
        Node newNode = new Node();
        newNode.data = data;
        newNode.next = head;
        head = newNode;
    }

    // Insert a node at the end of the list
    public void insertLast(int data) {
        Node newNode = new Node();
        newNode.data = data;
        if (head == null) {
            head = newNode;
            return;
        }
        Node current = head;
        while (current.next != null) {
            current = current.next;
        }
        current.next = newNode;
    }

    // Delete the node that follows the node holding the given data
    public void deleteAfter(Node node) {
        Node current = head;
        while (current != null && current.data != node.data) {
            current = current.next;
        }
        if (current != null && current.next != null) {
            current.next = current.next.next;
        }
    }

    // Method for printing Linked List
    public void printLinkedList() {
        System.out.println("Printing LinkedList (head --> last) ");
        Node current = head;
        while (current != null) {
            current.displayNodeData();
            current = current.next;
        }
        System.out.println();
    }

    public static void main(String args[]) {
        MyLinkedList myLinkedlist = new MyLinkedList();
        myLinkedlist.insertFirst(50);
        myLinkedlist.insertFirst(60);
        myLinkedlist.insertFirst(70);
        myLinkedlist.insertFirst(10);
        myLinkedlist.insertLast(20);
        myLinkedlist.printLinkedList();
        // Linked list will be
        // 10 -> 70 -> 60 -> 50 -> 20
        System.out.println("=========================");
        System.out.println("Delete node after Node 60");
        Node node = new Node();
        node.data = 60;
        myLinkedlist.deleteAfter(node);
        // After deleting the node after 60, Linked list will be
        // 10 -> 70 -> 60 -> 20
        System.out.println("=========================");
        myLinkedlist.printLinkedList();
    }
}
Output:
b) Stack
Program:
package org.arpit.java2blog.datastructures;

public class MyStack {

    int size;
    int arr[];
    int top;

    MyStack(int size) {
        this.size = size;
        this.arr = new int[size];
        this.top = -1;
    }

    // Push an element on top of the stack
    public void push(int element) {
        if (!isFull()) {
            top++;
            arr[top] = element;
            System.out.println("Pushed element:" + element);
        } else {
            System.out.println("Stack is full !");
        }
    }

    // Pop the element from the top of the stack
    public int pop() {
        if (!isEmpty()) {
            int topElement = top;
            top--;
            System.out.println("Popped element :" + arr[topElement]);
            return arr[topElement];
        } else {
            System.out.println("Stack is empty !");
            return -1;
        }
    }

    public boolean isEmpty() {
        return (top == -1);
    }

    public boolean isFull() {
        return (size - 1 == top);
    }

    public static void main(String[] args) {
        MyStack myStack = new MyStack(4);
        myStack.pop();
        System.out.println("=================");
        myStack.push(100);
        myStack.push(90);
        myStack.push(10);
        myStack.push(50);
        System.out.println("=================");
        myStack.pop();
        myStack.pop();
        myStack.pop();
        System.out.println("=================");
    }
}
Output:
Stack is empty!
=================
Pushed element:100
Pushed element:90
Pushed element:10
Pushed element:50
=================
Popped element :50
Popped element :10
Popped element :90
=================
c) Queue
Program:
package org.arpit.java2blog.datastructures;

public class QueueUsingLinkedList {

    // Node data structure
    private class Node {
        int data;
        Node next;
    }

    private Node front;
    private Node rear;
    private int currentSize;

    // constructor
    public QueueUsingLinkedList()
    {
        front = null;
        rear = null;
        currentSize = 0;
    }

    public boolean isEmpty()
    {
        return (currentSize == 0);
    }

    // Remove item from the beginning of the list to simulate Queue
    public int dequeue()
    {
        if (isEmpty())
        {
            System.out.println("Queue is empty");
            return -1;
        }
        int data = front.data;
        front = front.next;
        if (isEmpty())
        {
            rear = null;
        }
        currentSize--;
        System.out.println(data + " removed from the queue");
        return data;
    }

    // Add data to the end of the list to simulate Queue
    public void enqueue(int data)
    {
        Node oldRear = rear;
        rear = new Node();
        rear.data = data;
        rear.next = null;
        if (isEmpty())
        {
            front = rear;
        }
        else
        {
            oldRear.next = rear;
        }
        currentSize++;
        System.out.println(data + " added to the queue");
    }

    public static void main(String a[]) {
        QueueUsingLinkedList queueUsingLinkedList = new QueueUsingLinkedList();
        queueUsingLinkedList.dequeue();
        queueUsingLinkedList.enqueue(10);
        queueUsingLinkedList.enqueue(20);
        queueUsingLinkedList.enqueue(40);
        queueUsingLinkedList.dequeue();
        queueUsingLinkedList.enqueue(70);
        queueUsingLinkedList.dequeue();
        queueUsingLinkedList.enqueue(80);
        queueUsingLinkedList.enqueue(100);
        queueUsingLinkedList.dequeue();
        queueUsingLinkedList.enqueue(150);
        queueUsingLinkedList.enqueue(50);
    }
}
Output:
60 added to the queue
60 removed from the queue
20 added to the queue
40 added to the queue
10 removed from the queue
20 removed from the queue
40 removed from the queue
d) Set
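Program: (the listing for this part is missing in this copy; the following is a minimal HashSet sketch that produces output of the shape shown below, with the class name HashSetDemo chosen for illustration)

import java.util.HashSet;
import java.util.Set;

class HashSetDemo
{
    public static void main(String[] args)
    {
        // Create a HashSet and add a few elements
        Set<String> countries = new HashSet<String>();
        countries.add("India");
        countries.add("Australia");
        countries.add("South Africa");
        System.out.println(countries);

        // Remove an element
        countries.remove("Australia");
        System.out.println("Set after removing Australia:" + countries);

        // Iterate over the set
        System.out.println("Iterating over set:");
        for (String country : countries)
        {
            System.out.println(country);
        }
    }
}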
Output:
[South Africa, Australia, India]
Set after removing Australia:[South Africa, India]
Iterating over set:
South Africa
India
e) Map
Program:
import java.util.*;
class HashMapDemo
{
    public static void main(String args[])
    {
        HashMap<String, Integer> hm = new HashMap<String, Integer>();
        hm.put("a", new Integer(100));
        hm.put("b", new Integer(200));
        hm.put("c", new Integer(300));
        hm.put("d", new Integer(400));

        // Iterate over the entries and print each key:value pair
        for (Map.Entry<String, Integer> me : hm.entrySet())
        {
            System.out.print(me.getKey() + ":" + me.getValue() + " ");
        }
        System.out.println();
    }
}
Output:
a:100 b:200 c:300 d:400
EXPERIMENT-2
Aim: Perform setting up and installing Hadoop in its three operating modes: standalone, pseudo-distributed, and fully distributed.
1. Standalone Mode
• Default mode of Hadoop.
• HDFS is not utilized in this mode.
• The local file system is used for input and output.
• Used for debugging purposes.
• No custom configuration is required in the three Hadoop configuration files (mapred-site.xml, core-site.xml, hdfs-site.xml).
• Standalone mode is much faster than pseudo-distributed mode.
The steps for installing Hadoop in pseudo-distributed mode (single-node cluster) are given below.
Setting Up Hadoop
You can set Hadoop environment variables by appending the following commands to the ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Now apply all the changes to the current running system.

$ source ~/.bashrc

Before proceeding further, you need to make sure that Hadoop is working fine. Just issue the following command:

$ hadoop version

If everything is fine with your setup, then you should see the following result:

Hadoop 2.4.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4

This means your Hadoop standalone mode setup is working fine. By default, Hadoop is configured to run in non-distributed mode on a single machine.
You can find all the Hadoop configuration files in the location
“$HADOOP_HOME/etc/hadoop”. It is required to make changes in those
configuration files according to your Hadoop infrastructure.
$ cd $HADOOP_HOME/etc/hadoop
In order to develop Hadoop programs in java, you have to reset the java environment
variables in hadoop-env.sh file by replacing JAVA_HOME value with the location
of java in your system.
export JAVA_HOME=/usr/local/jdk1.7.0_71
The following are the list of files that you have to edit to configure Hadoop.
core-site.xml
The core-site.xml file contains information such as the port number used for Hadoop
instance, memory allocated for the file system, memory limit for storing the data, and
size of Read/Write buffers.
Open the core-site.xml and add the following properties in between <configuration>,
</configuration> tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
The hdfs-site.xml file contains information such as the value of replication data,
namenode path, and datanode paths of your local file systems. It means the place
where you want to store the Hadoop infrastructure.
Let us assume the following data.

dfs.replication (data replication value) = 1

(In the below given path, /hadoop/ is the user name. hadoopinfra/hdfs/namenode is the directory created by the HDFS file system.)
namenode path = //home/hadoop/hadoopinfra/hdfs/namenode

(hadoopinfra/hdfs/datanode is the directory created by the HDFS file system.)
datanode path = //home/hadoop/hadoopinfra/hdfs/datanode

Open this file and add the following properties in between the <configuration>, </configuration> tags.

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
</configuration>
yarn-site.xml
This file is used to configure yarn into Hadoop. Open the yarn-site.xml file and add
the following properties in between the <configuration>, </configuration> tags in this
file.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop contains a template of mapred-site.xml. First of all, it is required to copy the file from mapred-site.xml.template to mapred-site.xml using the following command.
$ cp mapred-site.xml.template mapred-site.xml
Open mapred-site.xml file and add the following properties in between the
<configuration>, </configuration>tags in this file.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
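The write-up stops at the configuration files. Assuming the paths configured above, the usual remaining steps to bring up the pseudo-distributed cluster (shown in more detail in Experiment 8) are to format the namenode and start the HDFS and YARN daemons:

$ hdfs namenode -format
$ start-dfs.sh
$ start-yarn.sh

The NameNode web interface is then available at http://localhost:50070/.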
EXPERIMENT-3
Aim: Implement the following file management tasks in Hadoop: adding files and directories, retrieving files, and deleting files.
2. Count the number of directories, files, and bytes under the paths that match the specified file pattern.
hadoop fs -count hdfs:/
8. Add a sample text file from the local directory named "data" to the new directory you created in HDFS during the previous step.
hadoop fs -put data/sample.txt /user/training/hadoop
9. List the contents of this new directory in HDFS.
hadoop fs -ls /user/training/hadoop
11. Since /user/training is your home directory in HDFS, any command that does not
have an absolute path is interpreted as relative to that directory. The next command will
therefore list your home directory, and should show the items you've just added there.
hadoop fs -ls
# 12. See how much space this directory occupies in HDFS.
hadoop fs -du -s -h hadoop/retail
# 13. Delete a file 'customers' from the "retail" directory.
hadoop fs -rm hadoop/retail/customers
# 15. Delete all files from the "retail" directory using a wildcard.
hadoop fs -rm hadoop/retail/*
17. Finally, remove the entire retail directory and all of its contents in HDFS.
hadoop fs -rm -r hadoop/retail
20. To view the contents of your text file purchases.txt which is present in your
hadoop directory.
hadoop fs -cat hadoop/purchases.txt
21. Copy the purchases.txt file from the "hadoop" directory in HDFS to the directory "data" in your local file system.
hadoop fs -copyToLocal hadoop/purchases.txt /home/training/data
# 30. Copy a directory from one node in the cluster to another.
# Use the 'distcp' command to copy.
# The -overwrite option overwrites existing files.
# The -update option synchronizes both directories.
hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop
EXPERIMENT-4
Aim: Run a basic Word Count Map Reduce program to understand Map Reduce
Paradigm.
Counting the number of words is a piece of cake in any language, such as C, C++, Python, or Java. MapReduce also uses Java, but it is very easy if you know the syntax and how to write it. Word count is the basic example of MapReduce; you will first learn how to execute this code, similar to the "Hello World" program in other languages. So here are the steps which show how to write MapReduce code for word count.
Example:
Input:
Hello I am GeeksforGeeks Hello I am an Intern
Output:
GeeksforGeeks 1
Hello 2
I 2
Intern 1
am 2
an 1
Steps:
First Open Eclipse -> then select File -> New -> Java Project ->Name
it WordCount -> then Finish.
In the above figure, you can see the Add External JARs option on the right-hand side. Click on it and add the below-mentioned files. You can find these files in /usr/lib/.
1. /usr/lib/hadoop-0.20-mapreduce/hadoop-core-2.6.0-mr1- cdh5.13.0.jar
2. /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.13.0.jar
Mapper Code: You have to copy paste this program into the WCMapper Java Class
file.
// Importing libraries
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
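The WCMapper class body itself did not survive in this copy; a minimal mapper consistent with the imports above (the old org.apache.hadoop.mapred API) and with the word-count output shown earlier might look like the following sketch.

public class WCMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
    // Map function: emit (word, 1) for every word in the input line
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter rep)
            throws IOException
    {
        String line = value.toString();
        for (String word : line.split(" "))
        {
            if (word.length() > 0)
            {
                output.collect(new Text(word), new IntWritable(1));
            }
        }
    }
}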
Reducer Code: You have to copy paste this program into the WCReducer Java Class file.
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WCReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable>
{
    // Reduce function
    public void reduce(Text key, Iterator<IntWritable> value, OutputCollector<Text, IntWritable> output,
            Reporter rep) throws IOException
    {
        int count = 0;
        // Counting the frequency of each word
        while (value.hasNext())
        {
            IntWritable i = value.next();
            count += i.get();
        }
        // Emit the total count once per word, after the loop
        output.collect(key, new IntWritable(count));
    }
}
Driver Code: You have to copy paste this program into the WCDriver Java Class file.
// Main Method
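Only the // Main Method marker above remains of the driver listing in this copy. A driver sketch consistent with the old mapred API and the WCMapper and WCReducer classes used in this experiment might look like this; the exact class used in the original lab may differ.

import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WCDriver extends Configured implements Tool
{
    public int run(String args[]) throws IOException
    {
        if (args.length < 2)
        {
            System.out.println("Please give valid inputs");
            return -1;
        }
        JobConf conf = new JobConf(WCDriver.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setMapperClass(WCMapper.class);
        conf.setReducerClass(WCReducer.class);
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        JobClient.runJob(conf);
        return 0;
    }

    // Main Method
    public static void main(String args[]) throws Exception
    {
        int exitCode = ToolRunner.run(new WCDriver(), args);
        System.out.println(exitCode);
    }
}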
Open the terminal on CDH and change the directory to the workspace.
You can do this by using “cd workspace/” command. Now, Create a text
file(WCFile.txt) and move it to HDFS. For that open terminal and write
this code(remember you should be in the same directory as jar file you
have created just now).
Now, run this command to copy the input file into HDFS.
hadoop fs -put WCFile.txt WCFile.txt
Now run the jar file with the command shown in the screenshot.
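The screenshot with the exact command is not reproduced here. Assuming the project was exported to a jar named wordcount.jar (a hypothetical name) with WCDriver as the driver class, the invocation would be of the form:

hadoop jar wordcount.jar WCDriver WCFile.txt WCOutput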
Output:
After executing the code, you can see the result in the output file or by writing the following command on the terminal.
hadoop fs -cat WCOutput/part-00000
EXPERIMENT-5
Aim: Write a Map Reduce program that mines weather data. Weather sensors
collecting data every hour at many locations across the globe gather a large volume of
log data, which is a good candidate for analysis with MapReduce, since it is semi-structured and record-oriented.
There is a data file for each year. Each data file contains, among other things, the year and the temperature information (which is relevant for this program).
Below is a snapshot of the data with the year and temperature fields highlighted in a green box. This is a snapshot of the data taken from the year 1901 file.
So, in a MapReduce program there are two most important phases: the Map phase and the Reduce phase.
You need to have an understanding of MapReduce concepts so as to understand the intricacies of MapReduce programming. It is one of the major components of Hadoop, along with HDFS.
a) For writing any MapReduce program, you first need to figure out the data flow. In this example I am taking just the year and temperature information in the map phase and passing it on to the reduce phase. So the Map phase in this example is essentially a data preparation phase; the Reduce phase, on the other hand, is more of a data aggregation one.
Map Phase: I will be pulling out the year and temperature data from the log data that is
there in the file, as shown in the above snapshot.
Reduce Phase: The data that is generated by the mapper(s) is fed to the reducer, which is another Java program. This program takes all the values associated with a particular key and finds the average temperature for that key. So, a key in our case is the year, and the value is a set of IntWritable objects which represent all the captured temperature information for that year.
I will be writing a java class, each for a Map and Reduce phase and one driver class to
create a job with configuration information.
AverageMapper.java
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

public class AverageMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
    public static final int MISSING = 9999;

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException
    {
        String line = value.toString();
        String year = line.substring(15, 19);
        int temperature;
        if (line.charAt(87) == '+')
            temperature = Integer.parseInt(line.substring(88, 92));
        else
            temperature = Integer.parseInt(line.substring(87, 92));
        String quality = line.substring(92, 93);
        if (temperature != MISSING && quality.matches("[01459]"))
            context.write(new Text(year), new IntWritable(temperature));
    }
}
Let us get into the details of our AverageMapper class. I need to extend the generic Mapper class with four formal data types: input key, input value, output key, output value. The key for the Map phase is the offset of the beginning of the line from the beginning of the file, but as we have no need for it, we can ignore it. The input value is the line of the record, the output key is the year, and the output value is the temperature, an integer. The data is fed to the map function one line or record at a time. The map() function converts the line into a string and reads the year and temperature parts from the applicable index positions.
The map() function writes its output through the Context object, which carries the output from map(). It contains the year value as Text and the temperature value as IntWritable.
AverageReducer.java
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

public class AverageReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException
    {
        int sum_temp = 0;
        int count = 0;
        for (IntWritable value : values)
        {
            sum_temp += value.get();
            count += 1;
        }
        // Average temperature for this year
        context.write(key, new IntWritable(sum_temp / count));
    }
}
Now coming to the Reducer class. Again, four formal data types (input key, input value, output key, output value) are specified for this class. The input key and value types of the reduce function should match the output key and value of the map function: Text and IntWritable objects. The reduce() function iterates through all the values, finds the sum and count of the values, and finally the average temperature value from that.
AverageDriver.java
import org.apache.hadoop.io.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AverageDriver
{
    public static void main(String[] args) throws Exception
    {
        if (args.length != 2)
        {
            System.err.println("Usage: AverageDriver <input path> <output path>");
            System.exit(-1);
        }
        Job job = Job.getInstance();
        job.setJarByClass(AverageDriver.class);
        job.setJobName("Max temperature");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(AverageMapper.class);
        job.setReducerClass(AverageReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
A Job object forms the specification of the job and gives you control over how the job will be run. Hadoop has a special feature of data locality, wherein the code for the program is sent to the data instead of the other way around. So Hadoop distributes the jar file of the program across the cluster. We pass the name of the class in the setJarByClass() method, which Hadoop can use to locate the jar file containing this class. We need to specify input and output paths. The input path can specify the file or directory which will be used as an input to the program, and the output path is a directory which will be created by the reducer; if the directory already exists, it leads to an error. Then we specify the map and reduce types to use via setMapperClass() and setReducerClass(). Next we set the output types for the map and reduce functions. The waitForCompletion() method submits the job and waits for it to finish; the program then exits with status 0 on success or 1 on failure.
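The compile-and-run step is not shown in this copy. Assuming the three classes are packaged into a jar named average.jar (hypothetical) and the yearly weather files have been copied to an HDFS directory named weather_input (also hypothetical), the job would be submitted as:

hadoop jar average.jar AverageDriver weather_input weather_output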
EXPERIMENT-6
Aim: Implement Matrix Multiplication with Hadoop Map Reduce
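The mapper and reducer listings did not survive in this copy; only the driver tail below remains. A sketch of the standard one-pass MapReduce matrix multiplication, consistent with the input format used in Step 8 (matrix name, row, column, value) and with the Map and Reduce class names referenced by the driver, might look like the following. The 2x2 dimensions of the sample matrices are hard-coded here; a real job would read them from the Configuration.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: every element of M and N is sent to each result cell (i,k) it can
// contribute to. Key = "i,k", value = "matrixName,sharedIndex,value".
class Map extends Mapper<LongWritable, Text, Text, Text>
{
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException
    {
        int mRows = 2, nCols = 2; // dimensions of the sample matrices
        String[] t = value.toString().split(",");
        if (t[0].equals("M"))
        {
            // M,i,j,v contributes to result cells (i, 0..nCols-1)
            for (int k = 0; k < nCols; k++)
                context.write(new Text(t[1] + "," + k), new Text("M," + t[2] + "," + t[3]));
        }
        else
        {
            // N,j,k,v contributes to result cells (0..mRows-1, k)
            for (int i = 0; i < mRows; i++)
                context.write(new Text(i + "," + t[2]), new Text("N," + t[1] + "," + t[3]));
        }
    }
}

// Reducer: join the M and N values on the shared index j and sum the products.
class Reduce extends Reducer<Text, Text, Text, Text>
{
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException
    {
        int n = 2; // shared dimension of the sample matrices
        double[] mRow = new double[n];
        double[] nCol = new double[n];
        for (Text val : values)
        {
            String[] t = val.toString().split(",");
            if (t[0].equals("M"))
                mRow[Integer.parseInt(t[1])] = Double.parseDouble(t[2]);
            else
                nCol[Integer.parseInt(t[1])] = Double.parseDouble(t[2]);
        }
        double sum = 0;
        for (int j = 0; j < n; j++)
            sum += mRow[j] * nCol[j];
        context.write(null, new Text(key.toString() + "," + sum));
    }
}

For these Text keys and values to be accepted, the driver must also call job.setOutputKeyClass(Text.class) and job.setOutputValueClass(Text.class); those calls were presumably in the lost upper part of the driver, whose surviving tail is shown next.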
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.waitForCompletion(true);
}
}
Step 8. Uploading the M, N file which contains the matrix multiplication data to
HDFS.
$ cat M
M,0,0,1
M,0,1,2
M,1,0,3
M,1,1,4
$ cat N
N,0,0,5
N,0,1,6
N,1,0,7
N,1,1,8
$ hadoop fs -mkdir Matrix/
$ hadoop fs -copyFromLocal M Matrix/
$ hadoop fs -copyFromLocal N Matrix/
Step 9. Executing the jar file using the hadoop command, thus fetching the records from HDFS and storing the output in HDFS.
$ hadoop jar MatrixMultiply.jar www.ehadoopinfo.com.MatrixMultiply Matrix/* result/
WARNING: Use "yarn jar" to launch YARN applications.
17/10/09 14:31:22 INFO impl.TimelineClientImpl: Timeline service address:
http://sandbox.hortonworks.com:8188/ws/v1/timeline/
17/10/09 14:31:23 INFO client.RMProxy: Connecting to ResourceManager at
sandbox.hortonworks.com/10.0.2.15:8050
17/10/09 14:31:23 WARN mapreduce.JobResourceUploader: Hadoop command-line
option parsing not performed. Implement the Tool interface and execute your
application with ToolRunner to remedy this.
17/10/09 14:31:24 INFO input.FileInputFormat: Total input paths to process : 2
17/10/09 14:31:24 INFO mapreduce.JobSubmitter: number of splits:2
17/10/09 14:31:24 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1507555978175_0006
17/10/09 14:31:25 INFO impl.YarnClientImpl: Submitted application
application_1507555978175_0006
17/10/09 14:31:25 INFO mapreduce.Job: The url to track the job:
http://sandbox.hortonworks.com:8088/proxy/application_1507555978175_0006/
17/10/09 14:31:25 INFO mapreduce.Job: Running job:
job_1507555978175_0006
Map-Reduce Framework
Map input records=8
Map output records=16
Map output bytes=160
Map output materialized bytes=204
Input split bytes=238
Combine input records=0
Combine output records=0
Reduce input groups=4
Reduce shuffle bytes=204
Reduce input records=16
Reduce output records=4
Spilled Records=32
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=196
CPU time spent (ms)=2720
Physical memory (bytes) snapshot=536309760
Virtual memory (bytes) snapshot=2506076160
Total committed heap usage (bytes)=360185856
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=64
File Output Format Counters
Bytes Written=36
Output:
Step 10. Getting Output from part-r-00000 that was generated after the execution
of the hadoop command.
$ hadoop fs -cat result/part-r-00000
0,0,19.0
0,1,22.0
1,0,43.0
1,1,50.0
EXPERIMENT-7
Aim: Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and
filter your data.
Step 1
Open the homepage of Apache Pig website. Under the section News, click on the link
release page as shown in the following snapshot
Step 2
On clicking the specified link, you will be redirected to the Apache Pig Releases page.
On this page, under the Download section, you will have two links, namely Pig 0.8 and later and Pig 0.7 and before. Click on the link Pig 0.8 and later, and you will be redirected to the page having a set of mirrors.
Step 3
Choose and click any one of these mirrors as shown below.
Click Mirrors
Step 4
These mirrors will take you to the Pig Releases page. This page contains various
versions of Apache Pig.
Step 2
Extract the downloaded tar files as shown below.
$ cd Downloads/
$ tar zxvf pig-0.15.0-src.tar.gz
$ tar zxvf pig-0.15.0.tar.gz
Step 3
Move the content of pig-0.15.0-src.tar.gz file to the Pig directory created earlier as
shown below.
$ mv pig-0.15.0-src.tar.gz/* /usr/local/hadoop/Pig/
Syntax
The syntax of the describe operator is as follows:
grunt> Describe Relation_name
Example
Assume we have a file student_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata 003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai.
And we have read it into a relation student using the LOAD operator as shown below.
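The LOAD and DESCRIBE statements themselves are not reproduced in this copy. Assuming the file was placed under /pig_data/ in HDFS (the path used in the later examples), they would be of the form:

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
grunt> Describe student;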
sh Command
Using the sh command, we can invoke any shell commands from the Grunt shell. Using the sh command from the Grunt shell, we cannot execute the commands that are a part of the shell environment (e.g., cd).
Syntax
Given below is the syntax of the sh command.
grunt> sh shell_command parameters
Example
We can invoke the ls command of the Linux shell from the Grunt shell using the sh option as shown below. In this example, it lists out the files in the /pig/bin/ directory.
grunt> sh ls
pig pig_1444799121955.log pig.cmd pig.py
fs Command
Using the fs command, we can invoke any FsShell commands from the Grunt shell.
Syntax
Given below is the syntax of the fs command.
grunt> fs File_system_command parameters
Example
We can invoke the ls command of HDFS from the Grunt shell using the fs command. In the following example, it lists the files in the HDFS root directory.
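The example itself is missing in this copy; listing the HDFS root directory from Grunt looks like this (the output depends on your cluster):

grunt> fs -ls /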
clear Command
The clear command is used to clear the screen of the Grunt shell.
Syntax
You can clear the screen of the Grunt shell using the clear command as shown below.
grunt> clear
help Command
The help command gives you a list of Pig commands and Pig properties.
Usage
You can get a list of Pig commands using the help command as shown below.
grunt> help
explain - Show the execution plan to compute the alias or for the entire script.
    -script - Explain the entire script.
    -out - Store the output into a directory rather than print to stdout.
    -brief - Don't expand nested plans (presenting a smaller graph for overview).
    -dot - Generate the output in .dot format. Default is text format.
kill <job_id> - Kill the hadoop job specified by the hadoop job id.
set <key> <value> - Provide execution parameters to Pig. Keys and values are case sensitive.
quit Command
You can quit from the Grunt shell using this command.
Usage
Quit from the Grunt shell as shown below.
grunt> quit
Let us now take a look at the commands using which you can control Apache Pig from
the Grunt shell.
exec Command
Using the exec command, we can execute Pig scripts from the Grunt shell.
Syntax
Given below is the syntax of the utility command exec.
grunt> exec [-param param_name = param_value] [-param_file file_name] [script]
FOREACH operator
The FOREACH operator is used to generate specified data transformations based on the
column data.
Syntax
Given below is the syntax of the FOREACH operator.
grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);
Example
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

Let us now get the id, age, and city values of each student from the relation student_details and store them in another relation named foreach_data using the FOREACH operator as shown below.

grunt> foreach_data = FOREACH student_details GENERATE id, age, city;

Output
It will produce the following output, displaying the contents of the relation foreach_data.
EXPERIMENT-8
Aim: Install and configure Hive on top of an existing Hadoop installation.
All Hadoop sub-projects such as Hive, Pig, and HBase support Linux operating system. Therefore,
you need to install any Linux flavored OS. The following simple steps are executed for Hive
installation:
Step 1: Verifying Java Installation
Java must be installed on your system before installing Hive. Let us verify the Java installation using the following command:
$ java -version
If Java is already installed on your system, you get to see the following response:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
If Java is not installed on your system, then follow the steps given below for installing Java.
Installing Java
Step I:
Download Java (JDK <latest version> - X64.tar.gz) by visiting the following link: http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html.
Then jdk-7u71-linux-x64.tar.gz will be downloaded onto your system.
Step II:
Generally you will find the downloaded Java file in the Downloads folder. Verify it and extract the jdk-7u71-linux-x64.gz file using the following commands.
$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz
$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.gz
Step III:
To make Java available to all the users, you have to move it to the location "/usr/local/". Open root, and type the following commands.
$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit
Step IV:
For setting up PATH and JAVA_HOME variables, add the following commands to ~/.bashrc file.
export JAVA_HOME=/usr/local/jdk1.7.0_71 export PATH=$PATH:$JAVA_HOME/bin
Now apply all the changes into the current running system.
$ source ~/.bashrc
Step V:
Use the following commands to configure Java alternatives:
# alternatives --install /usr/bin/java java usr/local/java/bin/java 2
# alternatives --install /usr/bin/javac javac usr/local/java/bin/javac 2
# alternatives --install /usr/bin/jar jar usr/local/java/bin/jar 2
# alternatives --set java usr/local/java/bin/java
Now verify the installation using the command java -version from the terminal as explained above.
Step 2: Verifying Hadoop Installation
Hadoop must be installed on your system before installing Hive. Let us verify the Hadoop installation using the following command:
$ hadoop version
If Hadoop is already installed on your system, then you will get a version response similar to the one shown in Experiment 2. Otherwise, download and extract Hadoop, then move the extracted files to /usr/local/hadoop as the root user:
$ su
password:
• cd /usr/local
• mv hadoop-2.4.1/* hadoop/
• exit
Now apply all the changes into the current running system.
$ source ~/.bashrc
Step II: Hadoop Configuration
You can find all the Hadoop configuration files in the location "$HADOOP_HOME/etc/hadoop". You need to make suitable changes in those configuration files according to your Hadoop infrastructure.
$ cd $HADOOP_HOME/etc/hadoop
In order to develop Hadoop programs using Java, you have to reset the Java environment variables in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java in your system.
export JAVA_HOME=/usr/local/jdk1.7.0_71
Given below is the list of files that you have to edit to configure Hadoop.
core-site.xml
The core-site.xml file contains information such as the port number used for Hadoop instance, memory
allocated for the file system, memory limit for storing the data, and the size of Read/Write buffers.
Open the core-site.xml and add the following properties in between the
<configuration> and
</configuration> tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
The hdfs-site.xml file contains information such as the value of replication data, the namenode path,
and the datanode path of your local file systems. It means the place where you want to store the Hadoop
infra.
Let us assume the following data. dfs.replication (data replication value) = 1
(In the following path /hadoop/ is the user name. hadoopinfra/hdfs/namenode is the directory created
by hdfs file system.
namenode path = //home/hadoop/hadoopinfra/hdfs/namenode
(hadoopinfra/hdfs/datanode is the directory created by hdfs file system.) datanode path =
//home/hadoop/hadoopinfra/hdfs/datanode
Open this file and add the following properties in between the
<configuration>,
</configuration> tags in this file.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode </value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode </value >
</property>
</configuration>
Note: In the above file, all the property values are user-defined and you can make changes according
to your Hadoop infrastructure.
yarn-site.xml
This file is used to configure yarn into Hadoop. Open the yarn-site.xml file and add the following
properties in between the <configuration>, </configuration> tags in this file.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop contains a template of mapred-site.xml. First of all, you need to copy the file from mapred-site.xml.template to the mapred-site.xml file using the following command.
$ cp mapred-site.xml.template mapred-site.xml
Open the mapred-site.xml file and add the following properties in between the <configuration>, </configuration> tags in this file.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Verifying Hadoop Installation
The following steps are used to verify the Hadoop installation.
Step I: Name Node Setup
Set up the namenode using the command "hdfs namenode -format" as follows.
$ hdfs namenode -format
The expected result is as follows.
10/24/14 21:30:56 INFO common.Storage: Storage directory /home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/
Step II: Verifying Hadoop dfs
The following command is used to start dfs. Executing this command will start your Hadoop file system.
$ start-dfs.sh
The expected output is as follows:
localhost: starting datanode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]
Step III: Verifying Yarn Script
The following command is used to start the yarn script. Executing this command will start your yarn daemons.
$ start-yarn.sh
The expected output is as follows:
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-nodemanager-localhost.out
Step IV: Accessing Hadoop on Browser
The default port number to access Hadoop is 50070. Use the following url to get Hadoop services on
your browser.
http://localhost:50070/
We use hive-0.14.0 in this tutorial. You can download it by visiting the following link: http://apache.petsads.us/hive/hive-0.14.0/. Let us assume it gets downloaded onto the /Downloads directory. Here, we download the Hive archive named "apache-hive-0.14.0-bin.tar.gz" for this tutorial. The following command is used to verify the download:
$ ls
On successful download, you get to see the following response:
apache-hive-0.14.0-bin.tar.gz
Step 4: Installing Hive
The following steps are required for installing Hive on your system. Let us assume the Hive archive is downloaded onto the /Downloads directory.
Extracting and verifying the Hive archive
The following commands are used to extract the archive and verify the download:
$ tar zxvf apache-hive-0.14.0-bin.tar.gz
$ ls
On successful download, you get to see the following response:
apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz
Copying files to the /usr/local/hive directory
We need to copy the files as the super user "su -". The following commands are used to copy the files from the extracted directory to the /usr/local/hive directory:
$ su -
passwd:
• cd /home/user/Downloads
• mv apache-hive-0.14.0-bin /usr/local/hive
• exit
Step 5: Setting up the Environment for Hive
You can set up the Hive environment by appending the following lines to the ~/.bashrc file:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin
The following command is used to execute the ~/.bashrc file:
$ source ~/.bashrc
Hive installation is completed successfully. Now you require an external database server to configure the Metastore. We use the Apache Derby database.
Step 6: Downloading and Installing Apache Derby
Follow the steps given below to download and install Apache Derby:
Downloading Apache Derby
The following command is used to download Apache Derby. It takes some time to download.
$ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz
The following command is used to verify the download:
$ ls
On successful download, you get to see the following response:
db-derby-10.4.2.0-bin.tar.gz
Extracting and verifying the Derby archive
The following commands are used for extracting and verifying the Derby archive:
$ tar zxvf db-derby-10.4.2.0-bin.tar.gz
$ ls
On successful download, you get to see the following response:
db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz
Copying files to the /usr/local/derby directory
We need to copy from the super user "su -". The following commands are used to copy the files from the extracted directory to the /usr/local/derby directory:
$ su -
passwd:
• cd /home/user
• mv db-derby-10.4.2.0-bin /usr/local/derby
• exit
You can set up the Derby environment by appending the following lines to the ~/.bashrc file:
export DERBY_HOME=/usr/local/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
The following command is used to execute the ~/.bashrc file:
$ source ~/.bashrc
Create a directory to store the Metastore
Create a directory named data in the $DERBY_HOME directory to store Metastore data.
$ mkdir $DERBY_HOME/data
Derby installation and environmental setup is now complete.
Step 7: Configuring the Metastore of Hive
Configuring Metastore means specifying to Hive where the database is stored. You can do this by
editing the hive-site.xml file, which is in the $HIVE_HOME/conf directory. First of all, copy the
template file using the following command:
$ cp hive-default.xml.template hive-site.xml
Edit hive-site.xml and append the following lines between the <configuration> and
</configuration> tags:
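The property block itself is not reproduced in this copy. Consistent with the jpox.properties settings below, an entry of the following shape (the host name is illustrative) points the metastore at the Derby network server:

<property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
   <description>JDBC connect string for a JDBC metastore</description>
</property>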
Create a file named jpox.properties and add the following lines into it:
javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create = true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine
Before running Hive, you need to create the /tmp folder and a separate Hive folder in HDFS. Here, we use the /user/hive/warehouse folder. You need to set write permission for these newly created folders as shown below.
Now set them in HDFS before verifying Hive. Use the following commands:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
The following commands are used to verify Hive installation:
$ bin/hive
On successful installation of Hive, you get to see the following response:
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201312121621_1494929084.txt
………………….
hive>
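As a quick check that the shell and the Derby metastore are working, a simple statement can be issued at the prompt (the exact output will vary with your setup):

hive> SHOW DATABASES;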