
INSTITUTE OF ENGINEERING AND TECHNOLOGY

Mohanlal Sukhadia University, Udaipur


Name- Shivam Chouhan | Class- B.Tech -CSE (VIII Sem) | Subject- BDA Lab

INDEX

S.No.  Topic                                                               Date    Signature

1. Implement the following data structures in Java:
   i) Linked Lists  ii) Stacks  iii) Queues  iv) Set  v) Map
2. Perform setting up and installing Hadoop in its three operating modes:
   Standalone, Pseudo-distributed, Fully distributed.
3. Implement the following file management tasks in Hadoop:
   adding files and directories, retrieving files, deleting files.
   Hint: A typical Hadoop workflow creates data files (such as log files)
   elsewhere and copies them into HDFS using one of the command line utilities.
4. Run a basic Word Count MapReduce program to understand the MapReduce paradigm.
5. Write a MapReduce program that mines weather data. Weather sensors collecting
   data every hour at many locations across the globe gather a large volume of
   log data, which is a good candidate for analysis with MapReduce, since it is
   semi-structured and record-oriented.
6. Implement Matrix Multiplication with Hadoop MapReduce.
7. Install and run Pig, then write Pig Latin scripts to sort, group, join,
   project, and filter your data.
8. Install and run Hive, then use Hive to create, alter, and drop databases,
   tables, views, functions, and indexes.

EXPERIMENT-1
AIM: Implement the following data structures in Java.

A data structure is a particular way of storing and organizing data in a computer so that it can be used efficiently. Data structures provide a means to manage large amounts of data efficiently, and efficient data structures are a key to designing efficient algorithms.

The Java Collections Framework (JCF) is a set of classes and interfaces that implement commonly reusable collection data structures. Although referred to as a framework, it works in the manner of a library: the JCF provides both interfaces that define various collections and classes that implement them. The objective of this program is to implement the Linked List, Stack and Queue data structures.

A Java collection simply means a single unit of objects. The Java Collections Framework provides many interfaces (Set, List, Queue, Deque, etc.) and classes (ArrayList, Vector, LinkedList, PriorityQueue, HashSet, LinkedHashSet, TreeSet, etc.).

The Map interface, which is also a part of the Java Collections Framework, does not inherit from the Collection interface. The Collection interface is a member of the java.util package. Collections is a utility class in the java.util package; it consists of only static methods which are used to operate on objects of type Collection.
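As a small illustration of the Collections utility class mentioned above, the sketch below calls a few of its static methods on an ArrayList. The class name CollectionsDemo and the sample numbers are chosen only for this example.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Minimal sketch: a few static helpers from the java.util.Collections utility class.
class CollectionsDemo {
    public static void main(String[] args) {
        List<Integer> numbers = new ArrayList<Integer>();
        Collections.addAll(numbers, 40, 10, 30, 20);             // bulk-add elements

        Collections.sort(numbers);                               // natural-order sort
        System.out.println("Sorted: " + numbers);                // [10, 20, 30, 40]

        System.out.println("Max: " + Collections.max(numbers));  // 40

        Collections.reverse(numbers);                            // in-place reverse
        System.out.println("Reversed: " + numbers);              // [40, 30, 20, 10]
    }
}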

Implement the following Data structures in Java

a) Linked List
Program:
package org.arpit.java2blog.datastructures;

class Node {
    public int data;
    public Node next;

    public void displayNodeData() {
        System.out.println("{ " + data + " } ");
    }
}

public class MyLinkedList {
    private Node head;

    public boolean isEmpty() {
        return (head == null);
    }

    // Method for inserting node at the start of linked list
    public void insertFirst(int data) {
        Node newHead = new Node();
        newHead.data = data;
        newHead.next = head;
        head = newHead;
    }

    // Method for deleting node from start of linked list
    public Node deleteFirst() {
        Node temp = head;
        head = head.next;
        return temp;
    }

    // Method used to delete node after provided node
    public void deleteAfter(Node after) {
        Node temp = head;
        while (temp.next != null && temp.data != after.data) {
            temp = temp.next;
        }
        if (temp.next != null)
            temp.next = temp.next.next;
    }

    // Method used to insert at end of LinkedList
    public void insertLast(int data) {
        Node current = head;
        while (current.next != null) {
            current = current.next; // loop until current.next is null
        }
        Node newNode = new Node();
        newNode.data = data;
        current.next = newNode;
    }

    // Method for printing Linked List
    public void printLinkedList() {
        System.out.println("Printing LinkedList (head --> last) ");
        Node current = head;
        while (current != null) {
            current.displayNodeData();
            current = current.next;
        }
        System.out.println();
    }

    public static void main(String args[]) {
        MyLinkedList myLinkedlist = new MyLinkedList();
        myLinkedlist.insertFirst(50);
        myLinkedlist.insertFirst(60);
        myLinkedlist.insertFirst(70);
        myLinkedlist.insertFirst(10);
        myLinkedlist.insertLast(20);
        myLinkedlist.printLinkedList();
        // Linked list will be
        // 10 -> 70 -> 60 -> 50 -> 20
        System.out.println("=========================");
        System.out.println("Delete node after Node 60");
        Node node = new Node();
        node.data = 60;
        myLinkedlist.deleteAfter(node);
        // After deleting the node after 60, the linked list will be
        // 10 -> 70 -> 60 -> 20
        System.out.println("=========================");
        myLinkedlist.printLinkedList();
    }
}

Output:
Printing LinkedList (head --> last)
{ 10 }
{ 70 }
{ 60 }
{ 50 }
{ 20 }
Delete node after Node 60
Printing LinkedList (head --> last)
{ 10 }
{ 70 }
{ 60 }
{ 20 }

b) Stack

Program:

package org.arpit.java2blog.datastructures;

public class MyStack {
    int size;
    int arr[];
    int top;

    MyStack(int size) {
        this.size = size;
        this.arr = new int[size];
        this.top = -1;
    }

    public void push(int element) {
        if (!isFull()) {
            top++;
            arr[top] = element;
            System.out.println("Pushed element:" + element);
        } else {
            System.out.println("Stack is full !");
        }
    }

    public int pop() {
        if (!isEmpty()) {
            int topElement = top;
            top--;
            System.out.println("Popped element :" + arr[topElement]);
            return arr[topElement];
        } else {
            System.out.println("Stack is empty !");
            return -1;
        }
    }

    public int peek() {
        if (!this.isEmpty())
            return arr[top];
        else {
            System.out.println("Stack is Empty");
            return -1;
        }
    }

    public boolean isEmpty() {
        return (top == -1);
    }

    public boolean isFull() {
        return (size - 1 == top);
    }

    public static void main(String[] args) {
        MyStack myStack = new MyStack(5);
        myStack.pop();
        System.out.println("=================");
        myStack.push(100);
        myStack.push(90);
        myStack.push(10);
        myStack.push(50);
        System.out.println("=================");
        myStack.pop();
        myStack.pop();
        myStack.pop();
        System.out.println("=================");
    }
}

Output:
Stack is empty !
=================
Pushed element:100
Pushed element:90
Pushed element:10
Pushed element:50
=================
Popped element :50
Popped element :10
Popped element :90
=================

c) Queue
Program:
package org.arpit.java2blog.datastructures;

public class QueueUsingLinkedList {

    private Node front, rear;
    private int currentSize; // size

    // Node data structure
    private class Node {
        int data;
        Node next;
    }

    // constructor
    public QueueUsingLinkedList() {
        front = null;
        rear = null;
        currentSize = 0;
    }

    public boolean isEmpty() {
        return (currentSize == 0);
    }

    // Remove item from the beginning of the list to simulate Queue
    public int dequeue() {
        int data = front.data;
        front = front.next;
        currentSize--;          // decrement before the empty check so rear is cleared correctly
        if (isEmpty()) {
            rear = null;
        }
        System.out.println(data + " removed from the queue");
        return data;
    }

    // Add data to the end of the list to simulate Queue
    public void enqueue(int data) {
        Node oldRear = rear;
        rear = new Node();
        rear.data = data;
        rear.next = null;
        if (isEmpty()) {
            front = rear;
        } else {
            oldRear.next = rear;
        }
        currentSize++;
        System.out.println(data + " added to the queue");
    }

    public static void main(String a[]) {
        QueueUsingLinkedList queueUsingLinkedList = new QueueUsingLinkedList();
        queueUsingLinkedList.enqueue(60);
        queueUsingLinkedList.dequeue();
        queueUsingLinkedList.enqueue(10);
        queueUsingLinkedList.enqueue(20);
        queueUsingLinkedList.enqueue(40);
        queueUsingLinkedList.dequeue();
        queueUsingLinkedList.enqueue(70);
        queueUsingLinkedList.dequeue();
        queueUsingLinkedList.enqueue(80);
        queueUsingLinkedList.enqueue(100);
        queueUsingLinkedList.dequeue();
        queueUsingLinkedList.enqueue(150);
        queueUsingLinkedList.enqueue(50);
    }
}

Output:
60 added to the queue
60 removed from the queue
10 added to the queue
20 added to the queue
40 added to the queue
10 removed from the queue
70 added to the queue
20 removed from the queue
80 added to the queue
100 added to the queue
40 removed from the queue
150 added to the queue
50 added to the queue



d) Set

Program: Using Hash Set


import java.util.*;

class GFG {
    public static void main(String[] args) {
        Set<String> h = new HashSet<String>();

        // Adding elements into the HashSet using add()
        h.add("India");
        h.add("Australia");
        h.add("South Africa");

        // Adding a duplicate element
        h.add("India");

        // Displaying the HashSet
        System.out.println(h);

        // Removing an item from the HashSet using remove()
        h.remove("Australia");
        System.out.println("Set after removing " + "Australia:" + h);

        // Iterating over the HashSet items
        System.out.println("Iterating over set:");
        Iterator<String> i = h.iterator();
        while (i.hasNext())
            System.out.println(i.next());
    }
}

Output:
[South Africa, Australia, India]
Set after removing Australia:[South Africa, India]
Iterating over set:

South Africa
India

e) Map
Program:
import java.util.*;

class HashMapDemo {
    public static void main(String args[]) {
        HashMap<String, Integer> hm = new HashMap<String, Integer>();
        hm.put("a", new Integer(100));
        hm.put("b", new Integer(200));
        hm.put("c", new Integer(300));
        hm.put("d", new Integer(400));

        // Returns a Set view of the mappings
        Set<Map.Entry<String, Integer>> st = hm.entrySet();
        for (Map.Entry<String, Integer> me : st) {
            System.out.print(me.getKey() + ":");
            System.out.println(me.getValue());
        }
    }
}

Output:
a:100
b:200
c:300
d:400

EXPERIMENT-2
Aim: Perform setting up and Installing Hadoop in its three operating modes:
Standalone, Pseudo distributed, Fully distributed.

Hadoop can be run in three different modes:

1. Standalone Mode
• Default mode of Hadoop.
• HDFS is not utilized in this mode.
• The local file system is used for input and output.
• Used for debugging purposes.
• No custom configuration is required in the three Hadoop configuration files (mapred-site.xml, core-site.xml, hdfs-site.xml).
• Standalone mode is much faster than pseudo-distributed mode.

2. Pseudo-Distributed Mode (Single Node Cluster)
• Configuration is required in the three files mentioned above for this mode.
• The replication factor is one for HDFS.
• One node is used as Master Node / Data Node / Job Tracker / Task Tracker.
• Used to test real code in HDFS.
• A pseudo-distributed cluster is a cluster where all daemons run on one node.

3. Fully Distributed Mode (Multiple Node Cluster)
• This is the production phase.
• Data is used and distributed across many nodes.
• Different nodes are used as Master Node / Data Node / Job Tracker / Task Tracker.

The objective of this experiment is to install Hadoop in pseudo-distributed mode.

Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation. Hadoop was created by computer scientists Doug Cutting and Mike Cafarella in 2006 to support distribution for the Nutch search engine. It was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Any of these parts, which are also called fragments or blocks, can be run on any node in the cluster. After years of development within the open-source community, Hadoop 1.0 became publicly available in November 2012 as part of the Apache project sponsored by the Apache Software Foundation.

The steps to install Hadoop in standalone mode and in pseudo-distributed mode (single node cluster) are given below.

1. Installing Hadoop in Standalone Mode

Here we cover the installation of Hadoop 2.4.1 in standalone mode. There are no daemons running and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.

Setting Up Hadoop

You can set Hadoop environment variables by appending the following command to the ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop

Before proceeding further, you need to make sure that Hadoop is working fine. Just issue the following command:

$ hadoop version

If everything is fine with your setup, then you should see the following result:

Hadoop 2.4.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4

It means your Hadoop standalone mode setup is working fine. By default, Hadoop is configured to run in a non-distributed mode on a single machine.
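As a quick way to exercise standalone mode, you can run one of the example MapReduce jobs bundled with Hadoop against a local directory. The commands below are only a sketch assuming the default layout of the Hadoop 2.4.1 binary distribution; adjust the example jar path and the input/output directory names to your setup.

$ mkdir input
$ cp $HADOOP_HOME/etc/hadoop/*.xml input
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar grep input output 'dfs[a-z.]+'
$ cat output/*

The job reads the copied XML files from the local input directory and writes its result to the local output directory, entirely without HDFS or any running daemons.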

2. Installing Hadoop in Pseudo Distributed Mode


Follow the steps given below to install Hadoop 2.4.1 in pseudo distributed mode.
Step 1 − Setting Up Hadoop
You can set Hadoop environment variables by appending the following commands to the ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME

Now apply all the changes into the current running system.
$ source ~/.bashrc

Step 2 − Hadoop Configuration

You can find all the Hadoop configuration files in the location
“$HADOOP_HOME/etc/hadoop”. It is required to make changes in those
configuration files according to your Hadoop infrastructure.
$ cd $HADOOP_HOME/etc/hadoop
In order to develop Hadoop programs in java, you have to reset the java environment
variables in hadoop-env.sh file by replacing JAVA_HOME value with the location
of java in your system.
export JAVA_HOME=/usr/local/jdk1.7.0_71
The following are the list of files that you have to edit to configure Hadoop.
core-site.xml
The core-site.xml file contains information such as the port number used for Hadoop
instance, memory allocated for the file system, memory limit for storing the data, and
size of Read/Write buffers.
Open the core-site.xml and add the following properties in between <configuration>,
</configuration> tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
The hdfs-site.xml file contains information such as the value of replication data,
namenode path, and datanode paths of your local file systems. It means the place
where you want to store the Hadoop infrastructure.
Let us assume the following data:

dfs.replication (data replication value) = 1

(In the path given below, /hadoop/ is the user name; hadoopinfra/hdfs/namenode is the directory created by the HDFS file system.)
namenode path = //home/hadoop/hadoopinfra/hdfs/namenode

(hadoopinfra/hdfs/datanode is the directory created by the HDFS file system.)
datanode path = //home/hadoop/hadoopinfra/hdfs/datanode


Open this file and add the following properties in between the <configuration>
</configuration> tags in this file.
<configuration>
<property>
<name>dfs.replication</name>

<value>1</value>
</property>

<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode </value>
</property>

<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode </value>
</property>
</configuration>

yarn-site.xml
This file is used to configure yarn into Hadoop. Open the yarn-site.xml file and add
the following properties in between the <configuration>, </configuration> tags in this
file.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop contains only a template of this file (mapred-site.xml.template). First of all, it is required to copy the file from mapred-site.xml.template to mapred-site.xml using the following command.
$ cp mapred-site.xml.template mapred-site.xml
Open mapred-site.xml file and add the following properties in between the
<configuration>, </configuration>tags in this file.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

Verifying Hadoop Installation



The following steps are used to verify the Hadoop installation.

Step 1 − Name Node Setup


Set up the namenode using the command “hdfs namenode -format” as follows.
$ cd ~
$ hdfs namenode -format
The expected result is as follows.
10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory /home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/

Step 2 − Verifying Hadoop dfs


The following command is used to start dfs. Executing this command will start your
Hadoop file system.
$ start-dfs.sh
The expected output is as follows −
10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]
Step 3 − Verifying Yarn Script
The following command is used to start the yarn script. Executing this command will
start your yarn daemons.
$ start-yarn.sh
The expected output as follows −

starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-nodemanager-localhost.out

Step 4 − Accessing Hadoop on Browser


The default port number to access Hadoop is 50070. Use the following url to get
Hadoop services on browser.
http://localhost:50070/

Step 5 − Verify All Applications for Cluster


The default port number to access all applications of cluster is 8088. Use the following
url to visit this service.
http://localhost:8088/

EXPERIMENT-3
Aim: Implement the following file management tasks in Hadoop: adding files and directories, retrieving files, and deleting files.

Frequently used HDFS shell commands

Open a terminal window in the current working directory.

# 1. Print the Hadoop version
hadoop version

# 2. List the contents of the root directory in HDFS
hadoop fs -ls /

# 3. Report the amount of space used and available on the currently mounted filesystem
hadoop fs -df hdfs:/

# 4. Count the number of directories, files and bytes under the paths that match the specified file pattern
hadoop fs -count hdfs:/

# 5. Run a DFS filesystem checking utility
hadoop fsck /

# 6. Run a cluster balancing utility
hadoop balancer

# 7. Create a new directory named "hadoop" below the /user/training directory in HDFS. Since you're currently logged in with the "training" user ID, /user/training is your home directory in HDFS.
hadoop fs -mkdir /user/training/hadoop

# 8. Add a sample text file from the local directory named "data" to the new directory you created in HDFS during the previous step.
hadoop fs -put data/sample.txt /user/training/hadoop

# 9. List the contents of this new directory in HDFS.
hadoop fs -ls /user/training/hadoop

# 10. Add the entire local directory called "retail" to the /user/training directory in HDFS.
hadoop fs -put data/retail /user/training/hadoop

# 11. Since /user/training is your home directory in HDFS, any command that does not have an absolute path is interpreted as relative to that directory. The next command will therefore list your home directory, and should show the items you've just added there.
hadoop fs -ls

# 12. See how much space this directory occupies in HDFS.
hadoop fs -du -s -h hadoop/retail

# 13. Delete the file 'customers' from the "retail" directory.
hadoop fs -rm hadoop/retail/customers

# 14. Ensure this file is no longer in HDFS.
hadoop fs -ls hadoop/retail/customers

# 15. Delete all files from the "retail" directory using a wildcard.
hadoop fs -rm hadoop/retail/*

# 16. To empty the trash
hadoop fs -expunge

# 17. Finally, remove the entire retail directory and all of its contents in HDFS.
hadoop fs -rm -r hadoop/retail

# 18. List the hadoop directory again
hadoop fs -ls hadoop

# 19. Add the purchases.txt file from the local directory named "/home/training/" to the hadoop directory you created in HDFS.
hadoop fs -copyFromLocal /home/training/purchases.txt hadoop/

# 20. View the contents of the text file purchases.txt which is present in your hadoop directory.
hadoop fs -cat hadoop/purchases.txt

# 21. Copy the purchases.txt file from the "hadoop" directory in HDFS to the directory "data" in your local file system.
hadoop fs -copyToLocal hadoop/purchases.txt /home/training/data

# 22. cp is used to copy files between directories present in HDFS
hadoop fs -cp /user/training/*.txt /user/training/hadoop

# 23. The '-get' command can be used as an alternative to the '-copyToLocal' command
hadoop fs -get hadoop/sample.txt /home/training/

# 24. Display the last kilobyte of the file "purchases.txt" to stdout.
hadoop fs -tail hadoop/purchases.txt

# 25. Default file permissions are 666 in HDFS. Use the '-chmod' command to change the permissions of a file.
hadoop fs -ls hadoop/purchases.txt
sudo -u hdfs hadoop fs -chmod 600 hadoop/purchases.txt

# 26. Default names of owner and group are training,training. Use '-chown' to change the owner name and group name simultaneously.
hadoop fs -ls hadoop/purchases.txt
sudo -u hdfs hadoop fs -chown root:root hadoop/purchases.txt

# 27. Default name of group is training. Use the '-chgrp' command to change the group name.
hadoop fs -ls hadoop/purchases.txt
sudo -u hdfs hadoop fs -chgrp training hadoop/purchases.txt

# 28. Move a directory from one location to another
hadoop fs -mv hadoop apache_hadoop

# 29. The default replication factor of a file is 3. Use the '-setrep' command to change the replication factor of a file.
hadoop fs -setrep -w 2 apache_hadoop/sample.txt

# 30. Copy a directory from one node in the cluster to another. Use the 'distcp' command to copy, the -overwrite option to overwrite existing files and the -update option to synchronize both directories.
hadoop distcp hdfs://namenodeA/apache_hadoop hdfs://namenodeB/hadoop

# 31. Command to make the name node leave safe mode
sudo -u hdfs hdfs dfsadmin -safemode leave

# 32. List all the hadoop file system shell commands
hadoop fs

# 33. Last but not least, always ask for help!
hadoop fs -help
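The same file management tasks can also be performed programmatically. The sketch below is a minimal illustration using the org.apache.hadoop.fs.FileSystem Java API; the class name HdfsFileManager and the local path /tmp/sample.txt are made up for this example, while the HDFS paths reuse the ones from the commands above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: add, list and delete files in HDFS through the Java API.
public class HdfsFileManager {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path dir = new Path("/user/training/hadoop");
        fs.mkdirs(dir);                             // adding a directory

        // adding a file: copy a local file into the HDFS directory
        fs.copyFromLocalFile(new Path("data/sample.txt"), new Path("/user/training/hadoop/sample.txt"));

        // retrieving: list the directory and copy the file back to the local file system
        for (FileStatus status : fs.listStatus(dir)) {
            System.out.println(status.getPath());
        }
        fs.copyToLocalFile(new Path("/user/training/hadoop/sample.txt"), new Path("/tmp/sample.txt"));

        // deleting: remove the file, then the directory recursively
        fs.delete(new Path("/user/training/hadoop/sample.txt"), false);
        fs.delete(dir, true);

        fs.close();
    }
}

Compiling and running this class requires the Hadoop client libraries on the classpath, in the same way as the MapReduce programs in the later experiments.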



EXPERIMENT-4
Aim: Run a basic Word Count Map Reduce program to understand Map Reduce
Paradigm.

Hadoop and MapReduce

Counting the number of words in a text is a simple exercise in most languages, such as C, C++, Python or Java. MapReduce also uses Java, and the program is straightforward once you know the syntax. Word Count is the basic example of MapReduce: you will first learn how to execute it, much like the "Hello World" program in other languages. The steps below show how to write a MapReduce program for Word Count.
Example:

Input:
Hello I am GeeksforGeeks Hello I am an Intern

Output:

GeeksforGeeks 1
Hello 2
I 2
Intern 1
am 2
an 1
Steps:
• First open Eclipse -> then select File -> New -> Java Project -> name it WordCount -> then Finish.

• Create three Java classes in the project. Name them WCDriver (having the main function), WCMapper and WCReducer.

• You have to include two reference libraries for that:
Right click on the project -> then select Build Path -> click on Configure Build Path.

In the Configure Build Path dialog, use the Add External JARs option on the right-hand side. Click on it and add the files mentioned below. You can find these files in /usr/lib/
1. /usr/lib/hadoop-0.20-mapreduce/hadoop-core-2.6.0-mr1-cdh5.13.0.jar
2. /usr/lib/hadoop/hadoop-common-2.6.0-cdh5.13.0.jar

Mapper Code: You have to copy-paste this program into the WCMapper Java class file.

// Importing libraries
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WCMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable>
{
    // Map function
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter rep)
            throws IOException
    {
        String line = value.toString();
        // Splitting the line on spaces
        for (String word : line.split(" "))
        {
            if (word.length() > 0)
            {
                output.collect(new Text(word), new IntWritable(1));
            }
        }
    }
}
Reducer Code: You have to copy-paste this program into the WCReducer Java class file.

// Importing libraries
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WCReducer extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {

    // Reduce function
    public void reduce(Text key, Iterator<IntWritable> value, OutputCollector<Text, IntWritable> output, Reporter rep)
            throws IOException
    {
        int count = 0;
        // Counting the frequency of each word
        while (value.hasNext())
        {
            IntWritable i = value.next();
            count += i.get();
        }
        // Emit the word with its total count
        output.collect(key, new IntWritable(count));
    }
}

Driver Code: You have to copy-paste this program into the WCDriver Java class file.

// Importing libraries
import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WCDriver extends Configured implements Tool {

    public int run(String args[]) throws IOException
    {
        if (args.length < 2)
        {
            System.out.println("Please give valid inputs");
            return -1;
        }
        JobConf conf = new JobConf(WCDriver.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setMapperClass(WCMapper.class);
        conf.setReducerClass(WCReducer.class);
        conf.setMapOutputKeyClass(Text.class);
        conf.setMapOutputValueClass(IntWritable.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        JobClient.runJob(conf);
        return 0;
    }

    // Main Method
    public static void main(String args[]) throws Exception
    {
        int exitCode = ToolRunner.run(new WCDriver(), args);
        System.out.println(exitCode);
    }
}
• Now you have to make a jar file. Right click on the project -> click on Export -> select the export destination as JAR file -> name the jar file (WordCount.jar) -> click on Next -> at last click on Finish. Now copy this file into the Workspace directory of Cloudera.

• Open the terminal on CDH and change the directory to the workspace. You can do this by using the "cd workspace/" command. Now, create a text file (WCFile.txt) and move it to HDFS. For that, open the terminal and write this code (remember you should be in the same directory as the jar file you have created just now).

Now, run this command to copy the input file into HDFS.
hadoop fs -put WCFile.txt WCFile.txt

• Now run the jar file with the command shown below.
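With the jar file, driver class, and file names used in the steps above, the run command generally takes the following form (WCOutput is the output directory that is read back in the next step; it must not exist before the job runs):

hadoop jar WordCount.jar WCDriver WCFile.txt WCOutput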

Output:

• After executing the code, you can see the result in the output file or by writing the following command on the terminal.
hadoop fs -cat WCOutput/part-00000

EXPERIMENT-5
Aim: Write a Map Reduce program that mines weather data. Weather sensors
collecting data every hour at many locations across the globe gather a large volume of
log data, which is a good candidate for analysis with Map Reduce, since it is semi
structured and record-oriented.

MapReduce program for a weather data set

MapReduce program: to find the average temperature for each year in the NCDC data set. Big data is a framework for storage and processing of data (structured/unstructured). The program below draws results out of semi-structured data from a weather sensor; it is a MapReduce program written in Java.

The aim of the program is to find the average temperature in each year of the NCDC data. The program takes as input multiple files, where each file contains the weather data of a particular year. This weather data is shared by NCDC (National Climatic Data Center) and is collected by weather sensors at many locations across the globe. The NCDC input data can be downloaded from
https://github.com/tomwhite/hadoop-book/tree/master/input/ncdc/all.

There is a data file for each year. Each data file contains, among other things, the year and the temperature information, which is what is relevant for this program. (A snapshot of the 1901 data file, with the year and temperature fields highlighted, accompanies the original report.)

So, in a MapReduce program there are two most important phases: the Map phase and the Reduce phase. You need an understanding of MapReduce concepts to follow the intricacies of MapReduce programming; MapReduce is one of the major components of Hadoop, along with HDFS.

Continuing with our current program:

a) For writing any MapReduce program, you first need to figure out the data flow. In this example I am taking just the year and temperature information in the map phase and passing it on to the reduce phase. So the Map phase in this example is essentially a data preparation phase; the Reduce phase, on the other hand, is more of a data aggregation one.

b) Secondly, decide on the types for the key/value pairs. A MapReduce program uses lists and (key/value) pairs as its main data primitives, so you need to decide the types for the key/value pairs: K1, V1, K2, V2, K3 and V3 for the input, intermediate, and output key/value pairs. In this example, I am taking LongWritable and Text as (K1, V1) for the input, and Text and IntWritable for both (K2, V2) and (K3, V3).

Map phase: I will be pulling out the year and temperature data from the log data that is in the file.

Reduce phase: The data that is generated by the mapper(s) is fed to the reducer, which is another Java program. This program takes all the values associated with a particular key and finds the average temperature for that key. So, a key in our case is the year, and the value is a set of IntWritable objects which represent all the captured temperature information for that year.

I will be writing one Java class each for the Map and Reduce phases and one driver class to create a job with the configuration information.

So, in this particular example I will be writing 3 Java classes:

1. AverageMapper.java
2. AverageReducer.java
3. AverageDriver.java

Code for all three classes:

AverageMapper.java

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import java.io.IOException;

public class AverageMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
    public static final int MISSING = 9999;

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
    {
        String line = value.toString();
        String year = line.substring(15, 19);
        int temperature;
        if (line.charAt(87) == '+')
            temperature = Integer.parseInt(line.substring(88, 92));
        else
            temperature = Integer.parseInt(line.substring(87, 92));
        String quality = line.substring(92, 93);
        if (temperature != MISSING && quality.matches("[01459]"))
            context.write(new Text(year), new IntWritable(temperature));
    }
}

Let us get into the details of our AverageMapper class. I need to extend the generic class Mapper with four formal data types: input key, input value, output key, output value. The key for the Map phase is the offset of the beginning of the line from the beginning of the file, but as we have no need for it, we can ignore it. The input value is the line of text, the output key is the year, and the output value is the temperature, an integer. The data is fed to the map function one line or record at a time. The map() function converts the line into a string and reads the year and temperature parts from the applicable index positions. The map() function then writes to a Context object, which is the output object from map(); it contains the year value as Text and the temperature value as IntWritable.
AverageReducer.java

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import java.io.IOException;

public class AverageReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
    {
        int max_temp = 0;
        int count = 0;
        for (IntWritable value : values)
        {
            max_temp += value.get();
            count += 1;
        }
        context.write(key, new IntWritable(max_temp / count));
    }
}
Now coming to the Reduce class. Again, four formal data types (input key, input value, output key, output value) are specified for this class. The input key and value types of the reduce function should match the output key and value types of the map function: Text and IntWritable objects. The reduce() function iterates through all the values, finds the sum and count of the values, and finally computes the average temperature value from them.

AverageDriver.java

import org.apache.hadoop.io.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AverageDriver
{
    public static void main(String[] args) throws Exception {
        if (args.length != 2)
        {
            System.err.println("Please enter the input and output parameters");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(AverageDriver.class);
        job.setJobName("Max temperature");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(AverageMapper.class);
        job.setReducerClass(AverageReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A Job object forms the specification of the job and gives you control over how the job will be run. Hadoop has a special feature of data locality, wherein the code for the program is sent to the data instead of the other way around; Hadoop therefore distributes the jar file of the program across the cluster. We pass the name of the class in the setJarByClass() method, which Hadoop can use to locate the jar file containing this class. We need to specify input and output paths. The input path can specify the file or directory which will be used as input to the program, and the output path is a directory which will be created by the Reducer. If the directory already exists, it leads to an error. Then we specify the map and reduce types to use via setMapperClass() and setReducerClass(). Next we set the output types for the map and reduce functions. The waitForCompletion() method submits the job and waits for it to finish. It returns 0 or 1, indicating success or failure of the job.
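Packaging and running the job follows the same pattern as the WordCount experiment; a typical invocation is sketched below, where the jar name AverageTemperature.jar and the HDFS input/output paths are only placeholders for this example:

hadoop jar AverageTemperature.jar AverageDriver /user/training/ncdc /user/training/ncdc_output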

EXPERIMENT-6
Aim: Implement Matrix Multiplication with Hadoop Map Reduce

In mathematics, matrix multiplication or the matrix product is a binary operation that


produces a matrix from two matrices. The definition is motivated by linear equations
and linear transformations on vectors, which have numerous applications in applied
mathematics, physics, and engineering. In more detail, if A is an n × m matrix and B is
an m × p matrix, their matrix product AB is an n × p matrix, in which the m entries
across a row of A are multiplied with the m entries down a column of B and summed
to produce an entry of AB. When two linear transformations are represented by
matrices, then the matrix product represents the composition of the two
transformations.

Algorithm for the Map function:

a. For each element mij of M, produce (key, value) pairs as ((i,k), (M, j, mij)), for k = 1, 2, 3, ... up to the number of columns of N.
b. For each element njk of N, produce (key, value) pairs as ((i,k), (N, j, njk)), for i = 1, 2, 3, ... up to the number of rows of M.
c. Return the set of (key, value) pairs in which each key (i,k) has a list with values (M, j, mij) and (N, j, njk) for all possible values of j.

Algorithm for the Reduce function:

For each key (i,k):
  sort the values beginning with M by j into listM;
  sort the values beginning with N by j into listN;
  multiply mij and njk for the j-th value of each list and sum up the products;
  return ((i,k), sum over j of mij x njk).
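As a quick check of the algorithm, consider the 2 x 2 matrices M and N that are uploaded to HDFS in Step 8 below. For key (0,0) the reducer receives (M,0,1) and (M,1,2) from M, and (N,0,5) and (N,1,7) from N, and computes 1 x 5 + 2 x 7 = 19. Similarly, entry (0,1) = 1 x 6 + 2 x 8 = 22, entry (1,0) = 3 x 5 + 4 x 7 = 43, and entry (1,1) = 3 x 6 + 4 x 8 = 50, which matches the output obtained in Step 10.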

Step 1. Download the hadoop jar files with these links.


Download the Hadoop Common jar file: https://goo.gl/G4MyHp
$ wget https://goo.gl/G4MyHp -O hadoop-common-2.2.0.jar
Download the Hadoop MapReduce jar file: https://goo.gl/KT8yfB
$ wget https://goo.gl/KT8yfB -O hadoop-mapreduce-client-core-2.7.1.jar

Step 2. Creating Mapper file for Matrix Multiplication.



package www.ehadoopinfo.com;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class Map extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, Text>
{
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        int m = Integer.parseInt(conf.get("m"));
        int p = Integer.parseInt(conf.get("p"));
        String line = value.toString();
        // Input record: (M, i, j, Mij) or (N, j, k, Njk)
        String[] indicesAndValue = line.split(",");
        Text outputKey = new Text();
        Text outputValue = new Text();
        if (indicesAndValue[0].equals("M"))
        {
            for (int k = 0; k < p; k++)
            {
                outputKey.set(indicesAndValue[1] + "," + k);
                // outputKey = (i,k)
                outputValue.set(indicesAndValue[0] + "," + indicesAndValue[2] + "," + indicesAndValue[3]);
                // outputValue = (M,j,Mij)
                context.write(outputKey, outputValue);
            }
        } else
        {
            // (N, j, k, Njk)
            for (int i = 0; i < m; i++) {
                outputKey.set(i + "," + indicesAndValue[2]);
                outputValue.set("N," + indicesAndValue[1] + "," + indicesAndValue[3]);
                context.write(outputKey, outputValue);
            }
        }
    }
}

Step 3. Creating the Reduce.java file for Matrix Multiplication.

package www.ehadoopinfo.com;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;
import java.util.HashMap;

public class Reduce extends org.apache.hadoop.mapreduce.Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        String[] value;
        // key = (i,k)
        // values = [(M/N,j,V/W),..]
        HashMap<Integer, Float> hashA = new HashMap<Integer, Float>();
        HashMap<Integer, Float> hashB = new HashMap<Integer, Float>();
        for (Text val : values)
        {
            value = val.toString().split(",");
            if (value[0].equals("M")) {
                hashA.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
            } else
            {
                hashB.put(Integer.parseInt(value[1]), Float.parseFloat(value[2]));
            }
        }
        int n = Integer.parseInt(context.getConfiguration().get("n"));
        float result = 0.0f;
        float m_ij;
        float n_jk;
        for (int j = 0; j < n; j++) {
            m_ij = hashA.containsKey(j) ? hashA.get(j) : 0.0f;
            n_jk = hashB.containsKey(j) ? hashB.get(j) : 0.0f;
            result += m_ij * n_jk;
        }
        if (result != 0.0f) {
            context.write(null, new Text(key.toString() + "," + Float.toString(result)));
        }
    }
}

Step 4. Creating the MatrixMultiply.java driver file.

package www.ehadoopinfo.com;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MatrixMultiply {

    public static void main(String[] args) throws Exception {
        if (args.length != 2)
        {
            System.err.println("Usage: MatrixMultiply <in_dir> <out_dir>");
            System.exit(2);
        }
        Configuration conf = new Configuration();
        // M is an m-by-n matrix; N is an n-by-p matrix.
        conf.set("m", "1000");
        conf.set("n", "100");
        conf.set("p", "1000");
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "MatrixMultiply");
        job.setJarByClass(MatrixMultiply.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.waitForCompletion(true);
    }
}

Step 5. Compiling the program into a folder named operation/

$ javac -cp hadoop-common-2.2.0.jar:hadoop-mapreduce-client-core-2.7.1.jar:operation/:. -d operation/ Map.java
$ javac -cp hadoop-common-2.2.0.jar:hadoop-mapreduce-client-core-2.7.1.jar:operation/:. -d operation/ Reduce.java
$ javac -cp hadoop-common-2.2.0.jar:hadoop-mapreduce-client-core-2.7.1.jar:operation/:. -d operation/ MatrixMultiply.java

Step 6. Let's list the directory after compilation.
$ ls -R operation/
operation/:

Step 7. Creating the jar file for Matrix Multiplication.

$ jar -cvf MatrixMultiply.jar -C operation/ .
added manifest
adding: www/(in = 0) (out= 0)(stored 0%)
adding: www/ehadoopinfo/(in = 0) (out= 0)(stored 0%)
adding: www/ehadoopinfo/com/(in = 0) (out= 0)(stored 0%)
adding: www/ehadoopinfo/com/Reduce.class(in = 2919) (out= 1271)(deflated 56%)
adding: www/ehadoopinfo/com/MatrixMultiply.class(in = 1815) (out= 932)(deflated 48%)
adding: www/ehadoopinfo/com/Map.class(in = 2353) (out= 993)(deflated 57%)

Step 8. Uploading the M and N files, which contain the matrix data, to HDFS.

$ cat M
M,0,0,1
M,0,1,2
M,1,0,3
M,1,1,4
$ cat N
N,0,0,5
N,0,1,6
N,1,0,7
N,1,1,8
$ hadoop fs -mkdir Matrix/
$ hadoop fs -copyFromLocal M Matrix/
$ hadoop fs -copyFromLocal N Matrix/

Step 9. Executing the jar file using the hadoop command, thereby fetching the input records from HDFS and storing the output in HDFS.

$ hadoop jar MatrixMultiply.jar www.ehadoopinfo.com.MatrixMultiply Matrix/* result/
WARNING: Use "yarn jar" to launch YARN applications.
17/10/09 14:31:22 INFO impl.TimelineClientImpl: Timeline service address:
http://sandbox.hortonworks.com:8188/ws/v1/timeline/
17/10/09 14:31:23 INFO client.RMProxy: Connecting to ResourceManager at
sandbox.hortonworks.com/10.0.2.15:8050
17/10/09 14:31:23 WARN mapreduce.JobResourceUploader: Hadoop command-line
option parsing not performed. Implement the Tool interface and execute your
application with ToolRunner to remedy this.
17/10/09 14:31:24 INFO input.FileInputFormat: Total input paths to process : 2
17/10/09 14:31:24 INFO mapreduce.JobSubmitter: number of splits:2
17/10/09 14:31:24 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1507555978175_0006
17/10/09 14:31:25 INFO impl.YarnClientImpl: Submitted application
application_1507555978175_0006
17/10/09 14:31:25 INFO mapreduce.Job: The url to track the job:
http://sandbox.hortonworks.com:8088/proxy/application_1507555978175_000 6/
17/10/09 14:31:25 INFO mapreduce.Job: Running job:
job_1507555978175_0006

17/10/09 14:31:35 INFO mapreduce.Job: Job job_1507555978175_0006 running in


uber mode : false
17/10/09 14:31:35 INFO mapreduce.Job: map 0% reduce 0%
17/10/09 14:31:45 INFO mapreduce.Job: map 100% reduce 0%
17/10/09 14:31:53 INFO mapreduce.Job: map 100% reduce 100%
17/10/09 14:31:54 INFO mapreduce.Job: Job job_1507555978175_0006 completed
successfully 17/10/09 14:31:55 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=198 FILE: Number of bytes written=386063 FILE:
Number of read operations=0
FILE: Number of large read operations=0 FILE: Number of write operations=0
HDFS: Number of bytes read=302 HDFS: Number of bytes written=36 HDFS:
Number of read operations=9
HDFS: Number of large read operations=0 HDFS: Number of write operations=2 Job
Counters
Launched map tasks=2 Launched reduce tasks=1 Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=15088 Total time spent by all
reduces in occupied slots (ms)=6188 Total time spent by all map tasks (ms)=15088
Total time spent by all reduce tasks (ms)=6188 Total vcore-seconds taken by all map
tasks=15088 Total vcore-seconds taken by all reduce tasks=6188
Total megabyte-seconds taken by all map tasks=3772000 Total megabyte- seconds
taken by all reduce tasks=1547000

Map-Reduce Framework Map input records=8 Map output records=16 Map output
bytes=160
Map output materialized bytes=204 Input split bytes=238
Combine input records=0 Combine output records=0 Reduce input groups=4 Reduce
shuffle bytes=204 Reduce input records=16 Reduce output records=4 Spilled
Records=32 Shuffled Maps =2
Failed Shuffles=0 Merged Map outputs=2
GC time elapsed (ms)=196 CPU time spent (ms)=2720
Physical memory (bytes) snapshot=536309760 Virtual memory (bytes)
snapshot=2506076160 Total committed heap usage (bytes)=360185856
Shuffle Errors
BAD_ID=0 CONNECTION=0 IO_ERROR=0 WRONG_LENGTH=0
WRONG_MAP=0 WRONG_REDUCE=0
File Input Format Counters Bytes Read=64
File Output Format Counters Bytes Written=36

Output:

Step 10. Getting the output from part-r-00000, which was generated after the execution of the hadoop command.

$ hadoop fs -cat result/part-r-00000
0,0,19.0
0,1,22.0
1,0,43.0
1,1,50.0

EXPERIMENT-7
Aim: Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and
filter your data.

Install and set up Apache Pig on your system:

Prerequisites: It is essential that you have Hadoop and Java installed on your system before you go for Apache Pig. Therefore, prior to installing Apache Pig, install Hadoop and Java.

Download Apache Pig

First of all, download the latest version of Apache Pig from the following website: https://pig.apache.org/

Step 1
Open the homepage of the Apache Pig website. Under the section News, click on the link "release page".

Step 2
On clicking the specified link, you will be redirected to the Apache Pig Releases page. On this page, under the Download section, you will have two links, namely Pig 0.8 and later and Pig 0.7 and before. Click on the link Pig 0.8 and later; you will then be redirected to a page having a set of mirrors.

Step 3
Choose and click any one of these mirrors.

Step 4
These mirrors will take you to the Pig Releases page. This page contains various versions of Apache Pig. Click the latest version among them.

Install Apache Pig


After downloading the Apache Pig software, install it in your Linux environment by
following the steps given below.
Step 1
Create a directory with the name Pig in the same directory where the installation
directories of Hadoop, Java, and other software were installed.
$ mkdir Pig

Step 2
Extract the downloaded tar files as shown below.
$ cd Downloads/
$ tar zxvf pig-0.15.0-src.tar.gz
$ tar zxvf pig-0.15.0.tar.gz

Step 3
Move the content of pig-0.15.0-src.tar.gz file to the Pig directory created earlier as
shown below.
$ mv pig-0.15.0-src.tar.gz/* /usr/local/hadoop/Pig/

Configure Apache Pig

After installing Apache Pig, we have to configure it. To configure it, we need to edit two files: .bashrc and pig.properties.

.bashrc file

In the .bashrc file, set the following variables: PIG_HOME to the Apache Pig installation folder, the PATH environment variable to include its bin folder, and PIG_CLASSPATH to the etc (configuration) folder of your Hadoop installation (the directory that contains the core-site.xml, hdfs-site.xml and mapred-site.xml files).

export PIG_HOME=/home/Hadoop/Pig
export PATH=$PATH:/home/Hadoop/Pig/bin
export PIG_CLASSPATH=$HADOOP_HOME/conf

pig.properties file

In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file, you can set various parameters as given below.

pig -h properties

The following properties are supported:
Logging: verbose = true|false; default is false. This property is the same as -v.
Additionally, any Hadoop property can be specified.

Verifying the Installation

Verify the installation of Apache Pig by typing the version command. If the installation is successful, you will get the version of Apache Pig as shown below.

$ pig -version
Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
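Before writing any scripts, start the Grunt shell. Pig can be started in local mode or in MapReduce mode; the two standard invocations are shown below for reference (MapReduce mode is the default when you run just "pig").

$ pig -x local        (runs against the local file system)
$ pig -x mapreduce    (runs against HDFS)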

Apache Pig Latin scripts: Describe operator, basic Grunt shell commands, Foreach operator, Order By operator.

Apache Pig - Describe Operator

The describe operator is used to view the schema of a relation.

Syntax
The syntax of the describe operator is as follows −
grunt> describe Relation_name;

Example
Assume we have a file student_data.txt in HDFS with the following content.
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
And we have read it into a relation student using the LOAD operator as shown below.

grunt> student = LOAD 'student_data.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

Now, let us describe the relation named student and verify the schema as shown below.
grunt> describe student;
Output
Once you execute the above Pig Latin statement, it will produce the following output.
grunt> student: {id: int, firstname: chararray, lastname: chararray, phone: chararray, city: chararray}

Apache Pig - Grunt Shell


After invoking the Grunt shell, you can run your Pig scripts in the shell. In addition to
that, there are certain useful shell and utility commands provided by the Grunt shell.
This chapter explains the shell and utility commands provided by the Grunt shell.
Note − In some portions of this chapter, commands like Load and Store are used. Refer to the respective chapters for detailed information on them.
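The Grunt shell itself is started from the command line. A minimal sketch (not part of the original handout, assuming Pig is on your PATH) is:
$ pig -x local        (runs Pig against the local file system)
$ pig -x mapreduce    (runs Pig against HDFS; this is the default mode)
Either command drops you into the grunt> prompt.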
Shell Commands
The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts. In addition, we can invoke shell and file system commands from it using sh and fs.

sh Command
Using the sh command, we can invoke any shell command from the Grunt shell. However, we cannot execute commands that are a part of the shell environment itself (e.g., cd).
Syntax
Given below is the syntax of the sh command.
grunt> sh shell command parameters

Example
We can invoke the ls command of the Linux shell from the Grunt shell using the sh option as shown below. In this example, it lists out the files in the /pig/bin/ directory.
grunt> sh ls
pig pig_1444799121955.log pig.cmd pig.py
fs Command
Using the fs command, we can invoke any FsShell commands from the Grunt shell.

Syntax
Given below is the syntax of the fs command.
grunt> fs File_System_command parameters

Example
We can invoke the ls command of HDFS from the Grunt shell using the fs command. In the following example, it lists the files in the HDFS root directory.
grunt> fs -ls

clear Command
The clear command is used to clear the screen of the Grunt shell.

Syntax
You can clear the screen of the Grunt shell using the clear command as shown below.
grunt> clear
help Command
The help command gives you a list of Pig commands or Pig properties.

Usage

You can get a list of Pig commands using the help command as shown below.
grunt> help
The output describes each command and its options; for the explain command, for instance, it reads −
explain [-script <pigscript>] [-out <path>] [-brief] [-dot|-xml] [-param <param_name>=param_value] [-param_file <file_name>] [alias] -
Show the execution plan to compute the alias or for entire script.
-script - Explain the entire script.
-out - Store the output into directory rather than print to stdout.
-brief - Don't expand nested plans (presenting a smaller graph for overview).
-dot - Generate the output in .dot format. Default is text format.

-xml - Generate the output in .xml format. Default is text format.


-param <param_name>=param_value - See parameter substitution for details.
-param_file <file_name> - See parameter substitution for details.
alias - Alias to explain.
dump <alias> - Compute the alias and writes the results to stdout.

Utility Commands:
exec [-param <param_name>=param_value] [-param_file <file_name>] <script> - Execute the script with access to grunt environment including aliases.
-param <param_name>=param_value - See parameter substitution for details.
-param_file <file_name> - See parameter substitution for details.
script - Script to be executed.
run [-param <param_name>=param_value] [-param_file <file_name>] <script> - Execute the script with access to grunt environment.
-param <param_name>=param_value - See parameter substitution for details.
-param_file <file_name> - See parameter substitution for details.
script - Script to be executed.
sh <shell command> - Invoke a shell command.

kill <job_id> - Kill the hadoop job specified by the hadoop job id.
set <key> <value> - Provide execution parameters to Pig. Keys and values are case
sensitive.

The following keys are supported:


default_parallel - Script-level reduce parallelism. Basic input size heuristics used by
default.

debug - Set debug on or off. Default is off.

job.name - Single-quoted name for jobs. Default is PigLatin:<script name>
job.priority - Priority for jobs. Values: very_low, low, normal, high, very_high. Default is normal
stream.skippath - String that contains the path. This is used by streaming.
Additionally, any Hadoop property can be set.

help - Display this message.


history [-n] - Display the list statements in cache.
-n Hide line numbers.
quit - Quit the grunt shell.
history Command
This command displays a list of the statements executed / used so far since the Grunt shell was invoked.

grunt> customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',');
grunt> orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',');
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',');

Then, using the history command will produce the following output.
grunt> history
customers = LOAD 'hdfs://localhost:9000/pig_data/customers.txt' USING PigStorage(',');
orders = LOAD 'hdfs://localhost:9000/pig_data/orders.txt' USING PigStorage(',');
student = LOAD 'hdfs://localhost:9000/pig_data/student.txt' USING PigStorage(',');

quit Command
You can quit from the Grunt shell using this command.

Usage
Quit from the Grunt shell as shown below.
grunt> quit
Let us now take a look at the commands using which you can control Apache Pig from
the Grunt shell.

exec Command
Using the exec command, we can execute Pig scripts from the Grunt shell.

Syntax
Given below is the syntax of the utility command exec.
grunt> exec [-param param_name = param_value] [-param_file file_name] [script]
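As an illustration (not part of the original handout), assume a small Pig Latin script named sample_script.pig — say, one that LOADs student_data.txt and DUMPs it — has been saved in the directory from which Pig was started. It could then be run from the Grunt shell, optionally with parameters, as follows:
grunt> exec sample_script.pig
grunt> exec -param data=student_data.txt sample_script.pig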

FOREACH operator
The FOREACH operator is used to generate specified data transformations based on the
column data.
Syntax
Given below is the syntax of FOREACH operator.
grunt> Relation_name2 = FOREACH Relation_name1 GENERATE (required data);
Example

Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.

student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad

002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the relation name student_details as shown below.

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')

   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
Let us now get the id, age, and city values of each student from the relation student_details and store them into another relation named foreach_data using the foreach operator as shown below.

grunt> foreach_data = FOREACH student_details GENERATE id,age,city;


Verification
Verify the relation foreach_data using the DUMP operator as shown below.
grunt> Dump foreach_data;

Output
It will produce the following output, displaying the contents of the relation foreach_data.
(1,21,Hyderabad)
(2,22,Kolkata)
(3,22,Delhi)
(4,21,Pune)
(5,23,Bhuwaneshwar)
(6,23,Chennai)
(7,24,trivendram)
(8,24,Chennai)
ORDER BY operator
The ORDER BY operator is used to display the contents of a relation in a sorted order based on one
or more fields.
Syntax
Given below is the syntax of the ORDER BY operator.
grunt> Relation_name2 = ORDER Relation_name1 BY field_name (ASC|DESC);

Example
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/ as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Pig with the relation name student_details as shown below.
grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
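The handout stops here before showing the ORDER BY statement itself. The following Grunt-shell sketch (not part of the original handout) completes the sort on the student_details relation loaded above and, since the Pig experiment also asks for grouping, joining, and filtering, adds the corresponding GROUP, FILTER, and JOIN statements; the employee_details.txt file and the emp relation in the JOIN example are assumed purely for illustration.

grunt> order_by_data = ORDER student_details BY age DESC;
grunt> Dump order_by_data;

grunt> group_data = GROUP student_details BY age;
grunt> Dump group_data;

grunt> filter_data = FILTER student_details BY city == 'Chennai';
grunt> Dump filter_data;

grunt> emp = LOAD 'hdfs://localhost:9000/pig_data/employee_details.txt' USING PigStorage(',')
   as (id:int, name:chararray, salary:int);
grunt> join_data = JOIN student_details BY id, emp BY id;
grunt> Dump join_data;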

EXPERIMENT-8
Aim: Install and Run Hive then use Hive to create, alter, and drop databases, tables, views, functions, and indexes.

All Hadoop sub-projects such as Hive, Pig, and HBase support the Linux operating system. Therefore, you need to install a Linux-flavored OS. The following simple steps are executed for Hive installation:
Step 1: Verifying JAVA Installation
Java must be installed on your system before installing Hive. Let us verify java installation using the
following command:
$ java -version
If Java is already installed on your system, you get to see the following response:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
If Java is not installed on your system, then follow the steps given below for installing it.

Installing Java
Step I:
Download Java (JDK <latest version> - X64.tar.gz) by visiting the following link:
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
Then jdk-7u71-linux-x64.tar.gz will be downloaded onto your system.

Step II:
Generally you will find the downloaded Java file in the Downloads folder. Verify it and extract the jdk-7u71-linux-x64.gz file using the following commands.
$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz
$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.gz

Step III:
To make Java available to all the users, you have to move it to the location "/usr/local/". Open root, and type the following commands.
$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit

Step IV:
For setting up the PATH and JAVA_HOME variables, add the following commands to the ~/.bashrc file.
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
Now apply all the changes into the current running system.
$ source ~/.bashrc

Step V:
Use the following commands to configure Java alternatives:
# alternatives --install /usr/bin/java java usr/local/java/bin/java 2
# alternatives --install /usr/bin/javac javac usr/local/java/bin/javac 2
# alternatives --install /usr/bin/jar jar usr/local/java/bin/jar 2
# alternatives --set java usr/local/java/bin/java

# alternatives --set javac usr/local/java/bin/javac
# alternatives --set jar usr/local/java/bin/jar

Now verify the installation using the command java -version from the terminal as explained above.
Step 2: Verifying Hadoop Installation
Hadoop must be installed on your system before installing Hive. Let us verify the Hadoop installation
using the following command
$ hadoop version
If Hadoop is already installed on your system, then you will get the following response:
Hadoop 2.4.1
Subversion https://svn.apache.org/repos/asf/hadoop/common -r 1529768
Compiled by hortonmu on 2013-10-07T06:28Z
Compiled with protoc 2.5.0
From source with checksum 79e53ce7994d1628b240f09af91e1af4
If Hadoop is not installed on your system, then proceed with the following steps:

Downloading Hadoop
Download and extract Hadoop 2.4.1 from the Apache Software Foundation using the following commands.
$ su
password:
# cd /usr/local
# wget http://apache.claz.org/hadoop/common/hadoop-2.4.1/hadoop-2.4.1.tar.gz
# tar xzf hadoop-2.4.1.tar.gz
# mv hadoop-2.4.1/* hadoop/
# exit

Installing Hadoop in Pseudo Distributed Mode


The following steps are used to install Hadoop 2.4.1 in pseudo distributed mode.

Step I: Setting up Hadoop
You can set the Hadoop environment variables by appending the following commands to the ~/.bashrc file.
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Now apply all the changes into the current running system.
$ source ~/.bashrc
Step II: Hadoop Configuration
You can find all the Hadoop configuration files in the location "$HADOOP_HOME/etc/hadoop". You need to make suitable changes in those configuration files according to your Hadoop infrastructure.
$ cd $HADOOP_HOME/etc/hadoop

In order to develop Hadoop programs using Java, you have to reset the Java environment variables in the hadoop-env.sh file by replacing the JAVA_HOME value with the location of Java in your system.
export JAVA_HOME=/usr/local/jdk1.7.0_71
Given below is the list of files that you have to edit to configure Hadoop.

core-site.xml
The core-site.xml file contains information such as the port number used for Hadoop instance, memory
allocated for the file system, memory limit for storing the data, and the size of Read/Write buffers.
Open the core-site.xml and add the following properties in between the
<configuration> and
</configuration> tags.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
hdfs-site.xml
The hdfs-site.xml file contains information such as the value of replication data, the namenode path,
and the datanode path of your local file systems. It means the place where you want to store the Hadoop
infra.
Let us assume the following data.
dfs.replication (data replication value) = 1
(In the following path, /hadoop/ is the user name; hadoopinfra/hdfs/namenode is the directory created by the hdfs file system.)
namenode path = //home/hadoop/hadoopinfra/hdfs/namenode
(hadoopinfra/hdfs/datanode is the directory created by the hdfs file system.)
datanode path = //home/hadoop/hadoopinfra/hdfs/datanode
Open this file and add the following properties in between the
<configuration>,
</configuration> tags in this file.
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/namenode </value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoop/hadoopinfra/hdfs/datanode </value >
</property>
</configuration>
Note: In the above file, all the property values are user-defined and you can make changes according
to your Hadoop infrastructure.
yarn-site.xml
This file is used to configure yarn into Hadoop. Open the yarn-site.xml file and add the following
properties in between the <configuration>, </configuration> tags in this file.

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>

mapred-site.xml
This file is used to specify which MapReduce framework we are using. By default, Hadoop ships only a template of this file, mapred-site.xml.template. First of all, you need to copy the file from mapred-site.xml.template to mapred-site.xml using the following command.
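A minimal sketch of this step is given below — the copy command assumes the default template name shipped with Hadoop 2.x, and the property shown is the standard one that selects YARN as the MapReduce framework:
$ cp mapred-site.xml.template mapred-site.xml

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>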

Open the mapred-site.xml file and add the property shown above in between the <configuration> and </configuration> tags in this file.

Verifying Hadoop Installation

The following steps are used to verify the Hadoop installation.

Step I: Name Node Setup
Set up the namenode using the command "hdfs namenode -format" as follows.
$ hdfs namenode -format
The expected result is as follows.
10/24/14 21:30:56 INFO common.Storage: Storage directory /home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/
Step II: Verifying Hadoop dfs
The following command is used to start dfs. Executing this command will start your Hadoop file system.
$ start-dfs.sh
The expected output is as follows:
localhost: starting datanode, logging to /home/hadoop/hadoop-2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]

Step III: Verifying Yarn Script
The following command is used to start the yarn script. Executing this command will start your yarn daemons.
$ start-yarn.sh
The expected output is as follows:
starting yarn daemons
starting resourcemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-resourcemanager-localhost.out
localhost: starting nodemanager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-hadoop-nodemanager-localhost.out
Step IV: Accessing Hadoop on Browser
The default port number to access Hadoop is 50070. Use the following url to get Hadoop services on

your browser.

http://localhost:50070/

Step V: Verify all applications for cluster


The default port number to access all applications of cluster is 8088. Use the following url to visit this
service.
http://localhost:8088/

Step 3: Downloading Hive

We use hive-0.14.0 in this tutorial. You can download it by visiting the following link:
http://apache.petsads.us/hive/hive-0.14.0/
Let us assume it gets downloaded onto the /Downloads directory. Here, we download the Hive archive named "apache-hive-0.14.0-bin.tar.gz" for this tutorial. The following command is used to verify the download:
$ ls
On successful download, you get to see the following response:
apache-hive-0.14.0-bin.tar.gz
Step 4: Installing Hive
The following steps are required for installing Hive on your system. Let us assume the Hive archive is downloaded onto the /Downloads directory.

Extracting and verifying Hive Archive
The following commands are used to extract the Hive archive and verify the extraction:
$ tar zxvf apache-hive-0.14.0-bin.tar.gz
$ ls
On successful extraction, you get to see the following response:
apache-hive-0.14.0-bin apache-hive-0.14.0-bin.tar.gz

Copying files to /usr/local/hive directory
We need to copy the files as the super user "su -".

The following commands are used to copy the files from the extracted directory to the /usr/local/hive directory.


# cd /home/user/Download
# mv apache-hive-0.14.0-bin /usr/local/hive
# exit

Setting up environment for Hive

You can set up the Hive environment by appending the required export lines to the ~/.bashrc file and then executing it, as shown in the sketch below.
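A minimal sketch of these lines, assuming Hive was copied to /usr/local/hive as in the previous step:
export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin

$ source ~/.bashrc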

Step 5: Configuring Hive


To configure Hive with Hadoop, you need to edit the hive-env.sh file, which is placed in the $HIVE_HOME/conf directory. The following steps change to the Hive config folder, copy the template file, and append the Hadoop location to hive-env.sh (a sketch is given below).
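A minimal sketch of this step, assuming the template name shipped with the Hive distribution and the Hadoop location used earlier in this manual:
$ cd $HIVE_HOME/conf
$ cp hive-env.sh.template hive-env.sh
Then append the following line to hive-env.sh:
export HADOOP_HOME=/usr/local/hadoop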

Hive installation is completed successfully. Now you require an external database server to configure
Metastore. We use Apache Derby database.
Step 6: Downloading and Installing Apache Derby

Follow the steps given below to download and install Apache Derby:
Downloading Apache Derby
The following command is used to download Apache Derby. It takes some time to download.
$ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz
The following command is used to verify the download:
$ ls
On successful download, you get to see the following response:
db-derby-10.4.2.0-bin.tar.gz

Extracting and verifying Derby archive
The following commands are used for extracting and verifying the Derby archive:
$ tar zxvf db-derby-10.4.2.0-bin.tar.gz
$ ls
On successful extraction, you get to see the following response:
db-derby-10.4.2.0-bin db-derby-10.4.2.0-bin.tar.gz

Copying files to /usr/local/derby directory
We need to copy from the super user "su -". The following commands are used to copy the files from the extracted directory to the /usr/local/derby directory:

# cd /home/user
# mv db-derby-10.4.2.0-bin /usr/local/derby
# exit

Setting up environment for Derby

You can set up the Derby environment by appending the following lines to the ~/.bashrc file:
export DERBY_HOME=/usr/local/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar
The following command is used to execute the ~/.bashrc file:
$ source ~/.bashrc
Create a directory to store Metastore
Create a directory named data in the $DERBY_HOME directory to store Metastore data.
$ mkdir $DERBY_HOME/data
Derby installation and environmental setup is now complete.

Step 7: Configuring Metastore of Hive
Configuring Metastore means specifying to Hive where the database is stored. You can do this by
editing the hive-site.xml file, which is in the $HIVE_HOME/conf directory. First of all, copy the
template file using the following command:
$ cp hive-default.xml.template hive-site.xml
Edit hive-site.xml and append the following lines between the <configuration> and
</configuration> tags:
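The lines themselves are not reproduced in this handout; a minimal sketch, assuming the Derby network server settings used in the jpox.properties file below (host hadoop1, port 1527), is:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://hadoop1:1527/metastore_db;create=true</value>
</property>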
Create a file named jpox.properties and add the following lines into it:
javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false
org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create = true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine

Step 8: Verifying Hive Installation

Before running Hive, you need to create the /tmp folder and a separate Hive folder in HDFS. Here, we
use the /user/hive/warehouse folder. You need to set write permission for these newly created folders
as shown below:
Now set them in HDFS before verifying Hive. Use the following commands:
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
The following commands are used to verify Hive installation:
$ bin/hive
On successful installation of Hive, you get to see the following response:
Hive history file=/tmp/hadoop/hive_job_log_hadoop_201312121621_1494929084.txt
………………….
hive>
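The walkthrough above ends at the Hive prompt. As the aim of this experiment also covers creating, altering, and dropping databases, tables, views, and indexes, the following HiveQL statements are a minimal sketch of those operations; the names userdb, employee, emp, emp_view, and index_salary are illustrative and not part of the original handout.
hive> CREATE DATABASE IF NOT EXISTS userdb;
hive> USE userdb;
hive> CREATE TABLE IF NOT EXISTS employee (eid INT, name STRING, salary FLOAT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
hive> ALTER TABLE employee RENAME TO emp;
hive> CREATE VIEW emp_view AS SELECT * FROM emp WHERE salary > 30000;
hive> CREATE INDEX index_salary ON TABLE emp(salary) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD;
hive> DROP INDEX index_salary ON emp;
hive> DROP VIEW emp_view;
hive> DROP TABLE emp;
hive> USE default;
hive> DROP DATABASE userdb;
Creating or dropping user-defined functions additionally requires a UDF jar (CREATE FUNCTION ... USING JAR ...), which is left out of this sketch.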
