DEPARTMENT OF
Artificial Intelligence And Data Science

III B Tech II Sem

BIG DATA ANALYTICS LAB



EXPERIMENT-1
AIM:
1) Implement the following Data structures in Java
a) Linked Lists b) Stacks c) Queues d) Set e) Map

CODE:
a) Linked Lists

import java.io.*;
import java.util.LinkedList;
import java.util.ListIterator;
public class LLDemo
{
static void insertFirst(LinkedList ll,String a)
{
ll.addFirst(a);
System.out.println(ll);
}
static void insertLast(LinkedList ll,String a)
{
ll.addLast(a);
System.out.println(ll);
}
static void DeleteFirst(LinkedList ll)
{
ll.removeFirst(); System.out.println(ll);
}
static void DeleteLast(LinkedList ll)
{
ll.removeLast(); System.out.println(ll);
}
static void Find(LinkedList ll,String a)
{
int pos=ll.indexOf(a);
if(pos==-1)
System.out.println("\nElement not found");
else
System.out.println("\nElement found at the postion:"+pos);
}

static void Minsert(LinkedList ll,String a,String b)


{
int pos=ll.indexOf(b);
System.out.println("\nInserting at position:"+pos);
ll.add(pos,a);
System.out.println(ll);
}
static void RemoveElement(LinkedList ll,String a)
{
int pos=ll.indexOf(a);
ll.remove(a);
System.out.println(ll);
}
public static void main(String arg[]) throws IOException
{
LinkedList<String> ll=new LinkedList<String>();
DataInputStream din=new DataInputStream(System.in);
String st;
while(true)
{
System.out.println("\nMenu:\n1.Insert First\n2.Insert Last\n3.Remove
First\n4.Remove Last\n5.Search For an element\n6.Middle Insert\n7.Remove");
System.out.println("\nEnter your operation:");int
ch=Integer.parseInt(din.readLine());
switch(ch)
{
case 1:
System.out.println("enter a element to insert:");
st=din.readLine();
insertFirst(ll,st);break;

case 2:
System.out.println("Enter an element to insert:");
st=din.readLine();
insertLast(ll,st);break;
case 3:
DeleteFirst(ll);
break;
case 4:
DeleteLast(ll);
break;
case 5:
System.out.println("Enter an element to search:");
st=din.readLine();
Find(ll,st);
break;
case 6:
System.out.println("Enter an element to insert:");
st=din.readLine();
System.out.println("enter the after element:");
String s=din.readLine();
Minsert(ll,st,s);
break;
case 7:
System.out.println("Enter an element to delete:");
st=din.readLine();
RemoveElement(ll,st);break;

default: System.exit(0);

}
}
}
}

OUTPUT:
javac LLDemo.java
java LLDemo

Menu:
1. Insert First
2.Insert Last
3.Remove First
4.Remove Last
5.Search For an element
6.Middle Insert
7.Remove

Enter your operation:


1
enter a element to insert:1
[1]
Menu:
1.Insert First
2.Insert Last
3.Remove First
4.Remove Last
5.Search For an element
6.Middle Insert
7.Remove


Enter your operation:


2
Enter an element to insert:5
[1, 5]

Menu:
1.Insert First
2.Insert Last
3.Remove First
4.Remove Last
5.Search For an element
6.Middle Insert
7.Remove

Enter your operation:


2
Enter an element to insert:3
[1, 5, 3]

Menu:
1.Insert First
2.Insert Last
3.Remove First
4.Remove Last
5.Search For an element
6.Middle Insert
7.Remove

Enter your operation:


3
[5, 3]

Menu:
1.Insert First
2.Insert Last
3.Remove First
4.Remove Last
5.Search For an element
6.Middle Insert
7.Remove


Enter your operation:


4
[5]

Menu:
1.Insert First
2.Insert Last
3.Remove First
4.Remove Last
5.Search For an element
6.Middle Insert
7.Remove

Enter your operation:


5
Enter an element to search:5

Element found at the position:0

Menu:
1. Insert First
2. Insert Last
3.Remove First
4.Remove Last
5.Search For an element
6.Middle Insert
7.Remove

Enter your operation:


6
Enter an element to insert:9
enter the after element:5

Inserting at position:0
[9, 5]

Menu:
1.Insert First
2.Insert Last
3. Remove First


4. Remove Last
5. Search For an element
6.Middle Insert
7.Remove

Enter your operation:


7
Enter an element to delete:8
[9, 5]

Menu:
1.Insert First
2.Insert Last
3.Remove First
4.Remove Last
5.Search For an element
6.Middle Insert
7.Remove

Enter your operation:


7
Enter an element to delete:9
[5]


B) Stacks

import java.util.*;
import java.io.*;
public class StackDemo
{
static void insert(Stack s,int a)
{
s.add(new Integer(a));
System.out.println("Elements in stack are:"+s);
}
static void delete(Stack s)
{
Integer a=(Integer)s.pop();
System.out.println("Deleted element is "+a);
System.out.println("Remaining elements in stack are:"+s);
}
static void first(Stack s)
{
Integer a=(Integer)s.peek();
System.out.println("the first element in stack is "+a);
}
public static void main(String a[]) throws IOException
{
Stack s=new Stack();
DataInputStream d=new DataInputStream(System.in);
while(true)
{
System.out.println("Menu\n1.Insert\n2.Delete\n3.First
Element\n4.Exit");
System.out.println("Enter your choice:");
int ch=Integer.parseInt(d.readLine());
switch(ch)
{
case 1:
System.out.println("Enter the element to insert:");int
ele=Integer.parseInt(d.readLine()); insert(s,ele);
break;

case 2: delete(s);
break;

case 3: first(s);
break;

case 4: System.exit(0);


default:
System.out.println("invalid choice");

}
}

}
}
Output:
javac StackDemo.java
java StackDemo

Menu
1.Insert
2.Delete
3.First Element
4.Exit
Enter your choice:
1
Enter the element to insert:5
Elements in stack are:[5]
Menu
1.Insert
2.Delete
3.First Element
4.Exit
Enter your choice:
1
Enter the element to insert:8
Elements in stack are:[5, 8]
Menu
1.Insert
2.Delete
3.First Element
4.Exit
Enter your choice:
3
the first element in stack is 8
Menu
1. Insert
2.Delete
3.First Element
4.Exit
Enter your choice:
2
Deleted element is 8
Remaining elements in stack are:[5]


C)Queues

import java.util.*;
import java.io.*;

public class QueueDemo {

static void insert(Queue q, int a) {


q.add(new Integer(a));

System.out.println("Elements in Queue: " + q);


}

static void delete(Queue q){

Integer a = (Integer) q.remove();


System.out.println(a);
System.out.println("Elements in Queue: " +q);
}

static void first(Queue q){

Integer a = (Integer) q.peek();
System.out.println("First element on queue:"+a);
}

public static void main(String args[]) throws IOException {
Queue q = new LinkedList();
DataInputStream din=new DataInputStream(System.in);
while(true)
{
System.out.println("Menu\n 1.Insert\n 2.Delete\n 3.Peek\n 4.Exit");
System.out.println("Enter your choice:");
int ch=Integer.parseInt(din.readLine());

switch(ch)
{
case 1:
System.out.println("Enter element to insert into Queue:");
int ele=Integer.parseInt(din.readLine());
insert(q,ele);
break;
case 2:
delete(q);
break;


case 3: first(q);
break;
case 4: System.exit(0);
default: System.out.println("Invalid choice");

}
}
}
}

Output:
javac QueueDemo.java
java QueueDemo
Menu
1.Insert
2. Delete
3.Peek
4.Exit
Enter your choice:
1
Enter element to insert into Queue:1
Elements in Queue: [1]
Menu
1.Insert
2.Delete
3.Peek
4.Exit
Enter your choice:
1
Enter element to insert into Queue:8
Elements in Queue: [1, 8]
Menu
1.Insert
2.Delete
3.Peek
4.Exit
Enter your choice:
3
First element on queue:1


D) Set

import java.util.Set;
import java.util.HashSet;
import java.util.TreeSet;
import java.util.Iterator;

public class SetDemo{

public static void main(String args[]){


int count[]={34,22,10,60,30,22};
Set<Integer>set=new HashSet<Integer>();
HashSet<Integer>clone1=new HashSet<Integer>();

try{
for(int i =0; i<5; i++){
set.add(count[i]);
clone1.add(count[i]);
}
System.out.println(set);
Object s=clone1.clone();
System.out.println("the cloned set is"+s);
System.out.println("size of set is "+set.size());boolean
b=set.remove(30);
if(b)
System.out.println("element is removed");
System.out.println("the set display using iterator");
Iterator it=set.iterator();
while(it.hasNext())
System.out.println(it.next()+" ");
TreeSet sortedSet=new TreeSet<Integer>();
sortedSet.addAll(set);
System.out.println("The sorted list is:");
System.out.println(sortedSet);
System.out.println("the set is"+(TreeSet)sortedSet.headSet(40));
System.out.println("the subset is "+(TreeSet)sortedSet.subSet(22,40));
System.out.println("the tailset is "+(TreeSet)sortedSet.tailSet(22));

System.out.println("The First element of the set is: "+


(Integer)sortedSet.first());
System.out.println("The last element of the set is: "+
(Integer)sortedSet.last());


}
catch(Exception e){}
}
}

Output:
javac SetDemo.java
java SetDemo
[34, 22, 10, 30, 60]
the cloned set is[34, 22, 10, 60, 30]
size of set is 5
element is removed
the set display using iterator
34
22
10
60
The sorted list is:
[10, 22, 34, 60]
the set is[10, 22, 34]
the subset is [22, 34]
the tailset is [22, 34, 60]
The First element of the set is: 10
The last element of the set is: 60


E) Map

import java.util.HashMap;
import java.util.Map;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeMap;

public class MapDemo {
public static void main(String args[]) {

/* This is how to declare HashMap */


HashMap<Integer, String> hmap = new HashMap<Integer, String>();

/*Adding elements to HashMap*/


hmap.put(12, "chp");
hmap.put(2, "pnr");
hmap.put(7, "kry");
hmap.put(49, "nsr");
hmap.put(3, "chari");
System.out.println("the original hashmap is :"+hmap);Object
s=hmap.clone();
System.out.println("the cloned hashmap is:"+s);
System.out.println("the size of hashmap is"+hmap.size());

/* Display content using Iterator*/


Set set = hmap.entrySet();
Iterator iterator = set.iterator();
while(iterator.hasNext()) {
Map.Entry mentry = (Map.Entry)iterator.next();
System.out.print("key is: "+ mentry.getKey() + " & Value is: ");
System.out.println(mentry.getValue());
}

/* Get values based on key*/


String var= hmap.get(2);
System.out.println("Value at index 2 is: "+var);

/* Remove values based on key*/


hmap.remove(3);
System.out.println("Map key and values after removal:");Set
set2 = hmap.entrySet();
Iterator iterator2 = set2.iterator();
while(iterator2.hasNext()) {
Map.Entry mentry2 = (Map.Entry)iterator2.next();
System.out.print("Key is: "+mentry2.getKey() + " & Value is: ");
System.out.println(mentry2.getValue());
}

if(hmap.containsKey(7))
System.out.println("key exists");
else
System.out.println("key does not exist");
if(hmap.containsValue("mouni"))
System.out.println("Value exists");
else
System.out.println("Value does not exist");

TreeMap<Integer, String> tmap=new TreeMap<Integer, String>();


tmap.putAll(hmap);
Set set3 = tmap.entrySet();
Iterator it = set3.iterator();
while(it.hasNext())
{
Map.Entry me = (Map.Entry)it.next();
System.out.print("Key is: "+me.getKey() + " & Value is:

"+me.getValue()+"\n");
}
}
}

Output:

javac MapDemo.java
java MapDemo
the original hashmap is :{49=nsr, 2=pnr, 3=chari, 7=kry, 12=chp}
the cloned hashmap is:{2=pnr, 49=nsr, 3=chari, 7=kry, 12=chp}
the size of hashmap is5
key is: 49 & Value is: nsr
key is: 2 & Value is: pnr
key is: 3 & Value is: chari
key is: 7 & Value is: kry
key is: 12 & Value is: chp
Value at index 2 is: pnr
Map key and values after removal:
Key is: 49 & Value is: nsr
Key is: 2 & Value is: pnr
Key is: 7 & Value is: kry
Key is: 12 & Value is: chp
key exists
Value does not exist
Key is: 2 & Value is: pnr
Key is: 7 & Value is: kry
Key is: 12 & Value is: chp
Key is: 49 & Value is: nsr


VMWARE INSTALLATION PROCESS


Software Links
Virtual Box:
https://www.virtualbox.org/wiki/Downloads
https://drive.google.com/drive/folders/1KyzfpcSI_iJShS76BfWibBjuX2j0RBev

Please click on the following link to download CodeTantra's Hadoop Virtual Disk Image.
https://s3.ap-south-1.amazonaws.com/ct-hadoop-installation/CodeTantra-Hadoop-VDI.zip

Please find the Hadoop installation links and download the appropriate one according to your OS
architecture.

Hadoop 64bit

https://s3.ap-south-1.amazonaws.com/ct-hadoop-installation/CT-Hadoop-LinuxSetup-64bit-v3.zip

Hadoop 32bit
https://s3.ap-south-1.amazonaws.com/ct-hadoop-installation/CT-Hadoop-LinuxSetup-32bit-v3.zip

Cloudera Version 5.17.0


https://drive.google.com/drive/folders/1KyzfpcSI_iJShS76BfWibBjuX2j0RBev

Cloudera Version 4.7
https://gecgudlavallerumic-my.sharepoint.com/:f:/g/personal/bhagec_gecgudlavallerumic_in/EpfaD_p-
lKdNlu03RVBG9KoBsxMmsM8Lhg1VRY36solKhw?e=D1wk3A

PREREQUISITES FOR BDA LAB

1. Cloudex lab account

2. Install WinSCP: for file transfer between the server and the remote system

3. Install Eclipse Java IDE

4. Cloudera quick start VM 4.7.x installation

[Screenshots in the original manual illustrate the VirtualBox/WinSCP installation steps and the WinSCP connection to the remote cluster; one screenshot notes that the transferred file size is 102 MB.]
Simple way: drag and drop the files between your local Desktop (left-hand window) and the remote server/cluster (right-hand window).

Cloudera quick start VM 4.7.x Installation procedure:


1. Open http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-4-x.html#
and click on “Download for VirtualBox” button.

2. Download Oracle Virtual Box from http://dlc.sun.com.edgesuite.net/virtualbox/4.3.12/VirtualBox-


4.3.12-93733-Win.exe.
3. Open Oracle Virtual Box and click on New.


4. Give the details as given below:

And click on Next.



5. Set the RAM memory as given below and click on Next. Approximately half of the system RAM needs to be allocated to the VirtualBox instance.

6. Select Use an existing virtual hard drive file and click on Create.

7. Now, select ClouderaVM4.7 and click on Start button.


System requirements:
This requires a 64-bit host OS and a virtualization product that can support a 64-bit guest OS.
It is better to have 8 GB of RAM since we are using VirtualBox, but 4 GB is also fine for practice.
Double-click on the "poweroff" button and you will be accessing Cloudera Manager.
Cloudera Manager UserId/Password: cloudera/cloudera


LINUX COMMANDS
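(The original manual presents the Linux commands as screenshots. As a quick reference, a few of the basic commands typically covered are listed below; this list is illustrative, not exhaustive.)

pwd              - print the current working directory
ls -l            - list files in the current directory with details
cd <dir>         - change to the given directory
mkdir <dir>      - create a directory
cat <file>       - display the contents of a file
cp <src> <dest>  - copy a file
mv <src> <dest>  - move or rename a file
rm <file>        - remove a file
man <command>    - show the manual page for a command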


EXPERIMENT-2
AIM: Perform setting up and installing Hadoop in its three operating modes:
Standalone,
Pseudo distributed,
Fully distributed

A) Standalone

Step 1: Open a Terminal


Step 2: Update the repository:
Command: sudo apt-get update

Step 3: Once the update is complete, install Java:
Command: sudo apt-get install openjdk-6-jdk

Step 4: After Java is installed, check whether Java is installed on your system or not by giving the below command:
Command: java -version

Step 5: Install openssh-server


Command: sudo apt-get install openssh-server

Step 6: Create a ssh key:


Command: ssh-keygen -t rsa -P ""

Step 7: Moving the key to authorized key:


Command: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Step 8: Download and extract Hadoop:


Command:
wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.0/hadoop-1.2.0.tar.gz
Command: tar -xvf hadoop-1.2.0.tar.gz

Step 9: Change the directory to hadoop and copy it to /usr/local

9.1.
Open conf/hadoop-env.sh and add JAVA_HOME as:
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64

9.2.
open bashrc file and append the following:

#HADOOP VARIABLES START


export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64


export HADOOP_HOME=/usr/local/hadoop
export
PATH=$PATH:$HADOOP_HOME/bin export
PATH=$PATH:$HADOOP_HOME/sbin
#HADOOP VARIABLES END
Step10:
To check hadoop version give the below command
Command: hadoop version

Step 11: Format the hadoop namenode
Command: hadoop namenode -format


Pseudo distributed mode:

Step 12: Configure core-site.xml
Path: computer/usr/lib/hadoop/conf/core-site.xml
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:8020</value>
</property>

Step 13: Configure hdfs-site.xml

Path: computer/usr/lib/hadoop/conf/hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>


</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>

Step 14: Configure mapred-site.xml

Path: computer/usr/lib/hadoop/conf/mapred-site.xml

<property>
<name>mapred.job.tracker</name>
<value>localhost:8021</value>
</property>

Step 15: Start the namenode and datanode
Command: start-dfs.sh
Step 16: Start the task tracker and job tracker
Command: start-mapred.sh
Step 17: To check if Hadoop started correctly
Command: jps
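If everything started correctly, the jps listing in pseudo-distributed mode typically shows the NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker daemons (plus the Jps process itself); the process IDs will differ from machine to machine. If any of these daemons is missing, check the corresponding log file under the Hadoop logs directory.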


Distributed mode:

First create a user group with 3 systems.


Commands:
sudo groupadd cse_hadoop
sudo useradd -g cse_hadoop srkit
sudo passwd srkit

Step 1: Configure /etc/hosts
Command: sudo gedit /etc/hosts

Step2: Install ssh server on all nodes


Command: sudo apt-get install openssh-server

Step3: Create a ssh key (on Namenode)


Command: ssh-keygen

Step 4: Create a password-less ssh login
Command:
ssh-copy-id -i $HOME/.ssh/id_rsa.pub cse@192.168.5.16
ssh-copy-id -i $HOME/.ssh/id_rsa.pub cse@192.168.5.18
ssh-copy-id -i $HOME/.ssh/id_rsa.pub cse-19@192.168.5.19

Step5: Test ssh login


Command:
ssh 192.168.5.16
ssh 192.168.5.18
ssh 192.168.5.19

Step6: Extract Hadoop-1.2.0


Command: tar -xvf hadoop-1.2.0.tar.gz
Command: cd hadoop-1.2.0

Step 7: Edit hadoop-env.sh

Command: sudo gedit conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64

Step 8: Configure core-site.xml


Path: computer/usr/lib/hadoop/conf/core-site.xml
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://192.168.5.17:8020</value>
</property>

</configuration>


Step9: Configure hdfs-site.xml


Path: computer/usr/lib/hadoop/conf/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>

</configuration>

Step10: Configure mapred-site.xml


Path: computer/usr/lib/hadoop/conf/mapred-site.xml
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>192.168.5.17</value>
</property>

</configuration>

Step 11: Configure masters
Path: usr/lib/hadoop/conf/masters
192.168.5.17

Step 12: Configure slaves
Path: usr/lib/hadoop/conf/slaves
192.168.5.18
192.168.5.19

Step 13: Format the Namenode
Command: hadoop namenode -format
Step 14: Start the Namenode and Datanodes
Command: start-dfs.sh
Step 15: Start the Jobtracker and Tasktrackers
Command: start-mapred.sh
Step 16: To check if Hadoop started correctly
Command: jps


Using HDFS monitoring UI:

HDFS Namenode on UI
Open the browser and type
http://localhost:50070/


HDFS Live Nodes list


HDFS Jobtracker
http://localhost:50030/

HDFS Logs
http://localhost:50070/logs/


HDFS Tasktracker
http://localhost:50060/


EXPERIMENT-3
AIM: Implement the following file management tasks in Hadoop:
⮚ Adding files and directories
⮚ Retrieving files
⮚ Deleting files
Hint: A typical Hadoop workflow creates data files (such as log files) elsewhere and
copies them into HDFS using one of the above command line utilities.

HDFS basic Command-line file operations


1. Create a directory in HDFS at given path(s):
Command: hadoop fs -mkdir <paths>
2. List the contents of a directory:
Command: hadoop fs -ls <args>
Upload and download a file in HDFS:
Upload:
Command: hadoop fs -put <localsrc> <HDFS_dest_path>
Download:
Command: hadoop fs -get <HDFS_src> <localdst>
3. See contents of a file:
Command: hadoop fs -cat <path[filename]>
4. Copy a file from source to destination:
Command: hadoop fs -cp <source> <dest>
Copy a file from/To Local file system to HDFS:
Command: hadoop fs -copyFromLocal <localsrc> URI
Command: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
5. Move file from source to destination:
Command: hadoop fs -mv <src> <dest>
6. Remove a file or directory in HDFS:
Remove files specified as argument. Delete directory only when it is empty.
Command: hadoop fs -rm <arg>
Recursive version of delete
Command: hadoop fs -rmr <arg>
7. Display last few lines of a file:
Command: hadoop fs -tail <path[filename]>
Display the aggregate length of a file:
Command: hadoop fs -du <path>
8. Getting help:
Command: hadoop fs -help
Adding files and directories:
⮚ Creating a directory
Command: hadoop fs -mkdir input/
⮚ Copying the files from localfile system to HDFS
Command: hadoop fs -put inp/file01 input/

Retrieving files:

Command: hadoop fs -get input/file01 localfs


Deleting files and directories:
Command: hadoop fs -rmr input/file01

Hadoop provides a set of command line utilities that work similarly to the Linux file commands.

Default directories
Local file system : /home/cloudera
HDFS : /user/cloudera

Basic file commands:


Hadoop file commands take the form of
hadoop fs -cmd <args>

where cmd is the specific file command and <args> is the variable number of arguments
Example:
Command for listing files is:
hadoop fs -ls
Most common file management tasks in hadoop are—
• Adding files and directories
• Retrieving files
• Deleting files

i) Adding files and directories: Before running Hadoop programs, you need to put the data into HDFS first.
1. mkdir : Create a directory in HDFS at given path(s).
hadoop fs -mkdir <paths>
Example:
hadoop fs -mkdir /user/cloudera/myfolder1
(absolute path)
Or
hadoop fs -mkdir myfolder1
(relative path)
Create a sub directory
Example:
hadoop fs -mkdir /user/cloudera/myfolder1/subfolder1

2. ls : List the contents of a directory.


hadoop fs -ls <args>
Example:
hadoop fs -ls
hadoop fs -ls /     (list the contents of the root directory)
hadoop fs -lsr /    (recursively displays entries in all subdirectories of the path)
hadoop fs -ls -R
hadoop fs -lsr /user/cloudera/myfolder1

3. put or copyFromLocal : Upload a file in HDFS

hadoop fs -put localsrc dst
or
hadoop fs -copyFromLocal localsrc dst

Copy single src file, or multiple src files from local file system to the Hadoop distributed file system


Example
Create two files in the local filesystem using cat or any editor (nano or gedit):
cat > file1
This is Hadoop Lab
Ctrl+D
cat > file2
This is Bigdata Lab
Ctrl+D
hadoop fs -put file1 /user/cloudera/myfolder1
hadoop fs -copyFromLocal file2 /user/cloudera/myfolder1/subfolder1
hadoop fs -put file3 .   (put the file in the default directory)
Checking:
hadoop fs -lsr /user/cloudera/myfolder1
hadoop fs -ls /

ii) Retrieving files


copy files from HDFS to local filesystem.

1. Download: get or copyToLocal : Copies/downloads files to the local file system

hadoop fs -get hdfs_src localdst
or
hadoop fs -copyToLocal hdfs_src localdst
Example:
hadoop fs -get /user/cloudera/myfolder1/file1 .
hadoop fs -copyToLocal /user/cloudera/myfolder1/file2 .

Another way to access the data is to display it. We can use the Hadoop file command with Unix pipes to send its output for further processing.
hadoop fs -cat file1
hadoop fs -cat file1 | head
hadoop fs -tail file1    (display the last 1 KB of file1)
c) Deleting files
Hadoop command for removing files is rm
Example :
hadoop fs -rm file1
hadoop fs -rmr myfolder1 (remove directory recursively)

Looking Up Help
A list of Hadoop file commands, together with the usage and description of each command, can be seen by using the help command.
hadoop fs -help cmd
Example :
hadoop fs -help ls

1. cp : Copy a file from source to destination


hadoop fs -cp <source> <dest>
Example:
hadoop fs -cp /user/cloudera/file1 /user/cloudera/myfolder1

2.mv : Move file from source to destination.


Note:- Moving files across filesystem is not permitted.
hadoop fs -mv <src> <dest>
Example:

hadoop fs -mv /user/cloudera/file1 /user/cloudera/myfolder1

3. du : Shows disk usage, in bytes, for all the files which match path; filenames are reported
with the full HDFS protocol prefix.
hadoop fs -du <path>
Example:
hadoop fs -du /user/cloudera

4. dus : Like -du, but prints a summary of disk usage of all files/directories in the path.
hadoop fs -dus <path>
Example:
hadoop fs -dus /user/cloudera

iii) Moving files across filesystems

5. moveFromLocal : Moves files from the local file system to the Hadoop distributed file system.
hadoop fs -moveFromLocal localsrc dst
Move a single src file, or multiple src files, from the local file system to the Hadoop distributed file system.
Example
Create a file in the local filesystem using cat or any editor (nano or gedit):
cat > file4
This is Hadoop and Bigdata Lab
Ctrl+D

hadoop fs -moveFromLocal file4 /user/cloudera/myfolder1/subfolder1

Checking:
hadoop fs -lsr /user/cloudera/myfolder1
hadoop fs -ls .

6. moveToLocal: copy files from HDFS to local filesystem.


hadoop fs -moveToLocal hdfs_src localdst
Example:
hadoop fs -moveToLocal /user/cloudera/myfolder1/file4 .

7. Chmod : To change permissions of files/directories


hadoop fs -chmod 777 filename/directory name
Example:
hadoop fs -chmod 666 /user/cloudera/file2

8.getmerge: concatenates the files in the source directory into the destination file.
hadoop fs -getmerge <src> <localdst> [addnl]

The addnl option is for adding new line character at the end of each file.
Example :
hadoop fs -getmerge file1 file2 mergfile

9. chown : used to change the ownership of files. The -R option can be used to recursively
change the owner of a directory structure.
hadoop fs -chown [-R] <NewOwnerName>[:NewGroupName] <file or dir name>
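Example (the user and group names here are only placeholders; use an owner that exists on your cluster):
hadoop fs -chown -R cloudera:cloudera /user/cloudera/myfolder1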

10. Expunge : Used to empty the trash.


hadoop fs -expunge

11.setrep: used to change the replication factor of a file.



hadoop fs -setrep -w 4 /user/cloudera/file1

12. touchz: creates a zero byte file. This is similar to the touch command in unix.
hadoop fs -touchz /user/cloudera/filename
Example :
hadoop fs -touchz /user/cloudera/file0


EXPERIMENT-4

AIM: Word count Program


Hadoop MapReduce is a system for parallel processing which was initially adopted by Google for executing a set of functions over large data sets in batch mode, where the data is stored in a large fault-tolerant cluster.

The input data set, which can be a terabyte-scale file, is broken down into chunks of 64 MB by default and forms the input to the Mapper function. The Mapper function then filters and sorts these data chunks on the Hadoop cluster data nodes based on the business requirement.
After the distributed computation is completed, the output of the mapper function is passed to the reducer function, which combines all the elements back together to provide the resulting output.

An example of Hadoop MapReduce usage is the "word-count" algorithm in raw Java using classes provided by the Hadoop libraries: count how many times a given word such as "are", "Hole", or "the" exists in a document, which is the input file.
To begin, consider the figure below, which breaks the word-count process into steps.

Hadoop MapReduce
Word Count Process
The building blocks of Hadoop MapReduce programs are broadly classified into two phases, the map
and reduce.

Both phases take input data in the form of (key, value) pairs and output data as (key, value) pairs. The mapper program runs in parallel on the data nodes in the cluster. Once the map phase is over, reducers run in parallel on the data nodes.

The input file is split into 64 MB chunks and is spread over the data nodes of the cluster. The mapper program runs on each data node of the cluster and generates (K1, V1) as the key-value pair.

Sort and shuffle stage creates the iterator for each key for e.g. (are, 1,1,1) which is passed to the reduce
function that sums up the values for each key to generate (K2, V2) as output. The illustration of the
same is shown in above figure (word count MapReduce process).
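For example, if the input file contains the two lines "are you here" and "you are", the map phase emits (are,1), (you,1), (here,1), (you,1), (are,1); after sort and shuffle the reducer receives (are,[1,1]), (here,[1]) and (you,[1,1]) and writes (are,2), (here,1), (you,2) as the final output.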


Hadoop MapReduce Algorithm for Word Count Program


1. Take one line at a time
2. Split the line into individual word one by one (tokenize)
3. Take each word
4. Note the frequency count (tabulation) for each word
5. Report the list of words and the frequency tabulation
Hadoop MapReduce Code Structure for Word Count Program
Step 1
Import Hadoop libraries for Path, configuration, I/O, Map Reduce, and utilities.

import org.apache.hadoop.mapred.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;

Step 2
The summary of the classes defined in the “word count map reduce” program is as below :

public class WordCount {

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {


......................
.........................
............................

public static void main(String[] args) throws Exception {


========================
}


We have created a package in Eclipse and defined a class named "WordCount". The "WordCount" class has two nested classes and one main method. "Mapper" and "Reducer" are classes provided by the Hadoop framework; their source code is written by the Hadoop developers. We extend the "Mapper" and "Reducer" classes with our "Map" and "Reduce" classes respectively, using inheritance.
Let us understand what is LongWritable, Text, IntWritable. For the same, we need to first understand
serialization and de-serialization in java.

Object serialization is a mechanism where an object can be represented as a sequence of bytes that
includes the object’s data as well as information about the object’s type and the types of data stored in
the object.

The serialized object is written in a file and then de-serialized to recreate the object back into memory.

For example word “Hai” has a serializable value of say “0010110” and then once it is written in a file,
you can de-serialized back to “Hai”.

In the Hadoop MapReduce framework, the mapper output is fed as the reducer input. These intermediate values are always in serialized form.
Serialization and de-serialization in Java are called Writable in Hadoop MapReduce programming. Therefore, Hadoop developers have provided all the data types in serialized form. For example, int in Java is IntWritable in the MapReduce framework, String in Java is Text in the MapReduce framework, and so on.
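As a small illustration of these wrapper types, here is a standalone sketch (it assumes the Hadoop client libraries are on the classpath; the class name WritableDemo is ours, not part of the lab program):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        IntWritable count = new IntWritable(1); // serializable wrapper around a Java int
        Text word = new Text("Hai");            // serializable wrapper around a Java String
        count.set(count.get() + 1);             // read and update the wrapped value
        System.out.println(word.toString() + " -> " + count.get()); // prints: Hai -> 2
    }
}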

The input and output of the mapper or reducer are in (key, value) format. For a text input file, Hadoop treats each line as one record: the value is the line itself, and the key is the byte offset at which that line starts in the file.

For example, if the file contains the line "(1, aaa)" followed by the line "(2, bbb)", the key for the first line is 0, because it starts at the very beginning of the file, and its value is the text "(1, aaa)". The key for the second line is the byte offset at which "(2, bbb)" starts (the length of the first line plus its newline), and its value is the text "(2, bbb)".

Now, let us try to understand the below with an example:

Mapper<LongWritable, Text, Text, IntWritable> {

Consider, we have the first line in the file as “Hi! How are you”.

The mapper input key-value pair for this line is (0, "Hi! How are you"), where 0 is the byte offset of the line in the file. Therefore, the key received by the mapper class has the data type "LongWritable" (the first type parameter), and the value received by the mapper class is "Text".

The mapper output would be the word and the count of the word, i.e. (Hi!,1), (How,1), (are,1), (you,1).

If the word "are" were repeated in the sentence, the mapper would emit (are,1) once for each occurrence. Hence, the key of the mapper output is "Text" while the value is "IntWritable".

This mapper output is fed as the input to the reducer. Therefore, if the reducer input is (are, 1, 1) then the output of the reducer will be (are,2). Here, the reducer output has the key as "Text" and the value as "IntWritable".

Step 3
Define the map class. The key and value input pair have to be serializable by the framework and hence
need to implement the Writable interface.
Output pairs do not need to be of the same types as input pairs. Output pairs are collected with calls to context.

Inside the static class “map” we are declaring an object with the name “one” to store the incremental
value of the given word and the particular word is stored in the variable named “word”.

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {


private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);

while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);

}
}
}

The above piece of code takes each line as an input and stores it into the variable “line”. StringTokenizer
allows an application to break a string into tokens. For example:

StringTokenizer st = new StringTokenizer("my name is kumar", " ");

The output of the above line will be:
my
name
is
kumar
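A minimal standalone snippet that reproduces this behaviour (the class name TokenDemo is only for illustration):

import java.util.StringTokenizer;

public class TokenDemo {
    public static void main(String[] args) {
        // break the sentence on spaces and print one token per line
        StringTokenizer st = new StringTokenizer("my name is kumar", " ");
        while (st.hasMoreTokens()) {
            System.out.println(st.nextToken()); // prints my, name, is, kumar on separate lines
        }
    }
}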

If the “tokenizer” variable has more number of tokens to count then the while loop will get open. The
context will take care of executing the for loop i.e. to read line by line of the file and store the output
70
lOMoARcPSD|372 403 83

BIG DATA ANALYTICS LAB

as the particular word and their occurrences. For example: if you have “hai, hai, hai” then the context
will store (hai, 1, 1, 1)

Step 4
Reduce class will accept shuffled key-value pairs as input. The code then totals the values for the key-value pairs with the same key and outputs the totaled key-value pairs, e.g. <word,3>.
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}

Step 5
The main method sets up the MapReduce configuration by defining the type of input. In this case, the input is text. The code then defines the Map, Combine, and Reduce classes, as well as specifying the input/output formats.

public static void main(String[] args) throws Exception {

Configuration conf = new Configuration();


Job job = new Job(conf, "wordcount");//Name for the job is “wordcount”
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class); // Mapper Class Name
job.setReducerClass(Reduce.class); //Reducer Class Name
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}


Step 6
The full Java code for the “word count” program is as below:

import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {


public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

private final static IntWritable one = new IntWritable(1);


private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException,
InterruptedException {

String line = value.toString();


StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);

}
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {

sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {

Configuration conf = new Configuration();


Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
//import org.apache.hadoop.mapreduce.Counter;
public class WordCountMapper extends
Mapper<LongWritable, Text, Text, LongWritable> {

private final static LongWritable one = new LongWritable(1);


@Override
protected void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String line = value.toString();
String[] words = line.split(" ");
for (int i = 0; i < words.length; i++) {
context.write(new Text(words[i]), one);
}
/*StringTokenizer strtock = new StringTokenizer(str);
while (strtock.hasMoreTokens()) {
temp.set(strtock.nextToken());
context.write(temp, one);

}*/
}
}

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.io.Text;

public class WordCountReducer extends Reducer<Text,LongWritable,Text,LongWritable>


{
@Override
protected void reduce(Text key,Iterable<LongWritable> value,Context context)throws
IOException,InterruptedException
{
long sum=0;
while(value.iterator().hasNext())
{
sum+=value.iterator().next().get();
}
context.write(key,new LongWritable(sum));
}

}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
//import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.util.Tool;
public class WordCountJob implements Tool{
private Configuration conf;
@Override
public Configuration getConf()
{
return conf;
}
@Override
public void setConf(Configuration conf)
{
this.conf=conf;
}
@Override
public int run(String []args)throws Exception
{

Job wordcountjob=new Job(getConf());



wordcountjob.setJobName("mat word count");


wordcountjob.setJarByClass(this.getClass());
wordcountjob.setMapperClass(WordCountMapper.class);
wordcountjob.setReducerClass(WordCountReducer.class);
wordcountjob.setCombinerClass(WordCountCombiner.class);
wordcountjob.setMapOutputKeyClass(Text.class);
wordcountjob.setMapOutputValueClass(LongWritable.class);
wordcountjob.setOutputKeyClass(Text.class);
wordcountjob.setOutputValueClass(LongWritable.class);
FileInputFormat.setInputPaths(wordcountjob,new Path(args[0]));
FileOutputFormat.setOutputPath(wordcountjob,new Path(args[1]));
wordcountjob.setNumReduceTasks(2);
return wordcountjob.waitForCompletion(true)==true? 0:1;
}
public static void main(String []args)throws Exception
{
ToolRunner.run(new Configuration(),new WordCountJob(),args);
}
}
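Note that the driver above registers a WordCountCombiner class that is not listed in this manual. For word count, the combiner can use the same logic as the reducer; a minimal sketch (our reconstruction, not the original file) is:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // pre-aggregate the per-word counts on the map side before the shuffle
        long sum = 0;
        for (LongWritable value : values) {
            sum += value.get();
        }
        context.write(key, new LongWritable(sum));
    }
}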

Implementation: Run a basic Word Count MapReduce program to understand the MapReduce paradigm.

PROGRAM:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}

}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}}

public static void main(String[] args) throws Exception {


Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
⮚ Create the temporary content file in the input directory
Command: sudo mkdir input
Command: sudo gedit input/file.txt
⮚ Type some text on that file, save the file and close

⮚ Put the file.txt into hdfs


Command: hadoop fs -mkdir input
Command: hadoop fs -put input/file.txt input/
⮚ Create jar file WordCount Program
Command: hadoop com.sun.tools.javac.Main WordCount.java
Command: jar cf wc.jar WordCount*.class

⮚ Run WordCount jar file on input directory


Command: hadoop jar wc.jar WordCount input output

⮚ To see the output


Command: hadoop fs -cat output/*


EXPERIMENT-5
AIM: Write a Map Reduce program that mines weather data. Weather sensors collecting data every
hour at many locations across the globe gather a large volume of log data, which is a good candidate
for analysis with MapReduce, since it is semi structured and record-oriented.

Steps to Execute Analyzing NCDC Data with Java MapReduce


Step 1: Create your input file with sample data for the program in your VM local file system by downloading it from the NCDC website with the following steps.
You can find weather data for each year at ftp://ftp.ncdc.noaa.gov/pub/data/noaa/. All files are zipped by year and weather station. For each year, there are multiple files for different weather stations.

Here is an example for 1990 (ftp://ftp.ncdc.noaa.gov/pub/data/noaa/1901/).


▪ 010080-99999-1990.gz
▪ 010100-99999-1990.gz
▪ 010150-99999-1990.gz
Step 1.1: Download the above .gz files of each year.
Step 1.2: Merge all of a year's .gz files into a single .gz file using the command
#zcat file1.gz file2.gz file3.gz | gzip -c > allfiles-zcat.gz
▪ $zcat 010080-99999-1990.gz 010100-99999-1990.gz 010150-99999-1990.gz | gzip -c > 1990.gz

Step 1.3: Repeat this for all years starting from 1901 to 2020 by repeating steps 1.1 and 1.2.
Step 1.4: Store all the 1901.gz, 1902.gz, ..., 2020.gz files in a folder named all in the /cloudera directory.
Step 1.5 (Merge all .gz files in the all directory): the .gz files in the all directory, i.e. 1901.gz, 1902.gz, ..., 2020.gz, are once again merged into a single file with the .gz extension and stored as ncdc.gz with the following command:
▪ $zcat 1901.gz 1902.gz 1903.gz ... 2020.gz | gzip -c > ncdc.gz
Step 1.6: Extract it (right-click and extract) and rename the result to ncdc.txt.


Step 2: Now move the input file, i.e. ncdc.txt, to HDFS using the following command in a terminal:
$ hadoop fs -put <local input file path> <hdfs path>
Make sure that the target file does not already exist in HDFS. Example:
$ hadoop fs -copyFromLocal /home/cloudera/Downloads/all/ncdc.txt /user/cloudera/data
Step 3: Open the Eclipse Java IDE in Cloudera and create a Java project with any name; create three classes in the same Java project (one .java file per class), open each class, and type in the code available in the prescribed Tom White textbook, pages 46, 47 and 48 respectively (the mapper, the reducer and the driver).
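For reference, a minimal sketch along the lines of the textbook MaxTemperature example is given below. The authoritative code is on the textbook pages cited above; here the three classes are combined into a single file for brevity, and the field offsets assume the NCDC fixed-width record format.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

    public static class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final int MISSING = 9999;
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String year = line.substring(15, 19);            // year field of the NCDC record
            int airTemperature;
            if (line.charAt(87) == '+') {                    // parseInt does not like leading plus signs
                airTemperature = Integer.parseInt(line.substring(88, 92));
            } else {
                airTemperature = Integer.parseInt(line.substring(87, 92));
            }
            String quality = line.substring(92, 93);         // quality code of the reading
            if (airTemperature != MISSING && quality.matches("[01459]")) {
                context.write(new Text(year), new IntWritable(airTemperature));
            }
        }
    }

    public static class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int maxValue = Integer.MIN_VALUE;
            for (IntWritable value : values) {
                maxValue = Math.max(maxValue, value.get()); // keep the highest reading per year
            }
            context.write(key, new IntWritable(maxValue));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Max temperature");
        job.setJarByClass(MaxTemperature.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}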


Step 4: Go to the project build path and add the following jars:

a. hadoop-common.jar (filesystem/usr/lib/hadoop/hadoop-common.jar)
b. hadoop-core.jar (filesystem/usr/lib/hadoop-0.20-mapreduce/hadoop-core.jar)


Step 5: Now, once the errors are resolved, right-click on your project and save (export) it to some location on the local file system with the .jar file extension.
Step 6: Command to run the MapReduce program:
a. hadoop jar <jar location> <packagename.driver class name> <input file hdfs path> <output hdfs folder path>
ex: $ hadoop jar /home/cloudera/max.jar tem.com.MaxTemperature /user/cloudera/data/ncdc.txt /user/cloudera/data/output
Step 7: Check the output by opening the HDFS file browser in the folder /user/cloudera/data/output. The output file name is:
if a reducer is used: part-r-00000
if no reducer is used: part-m-00000
Open it and you will see:
1901 317
1902 244
1903 289
1904 256
1905 283
1906 294
1907 283


EXPERIMENT-6
AIM: Write a MapReduce program that mines weather data.
Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data, which is a good candidate for analysis with MapReduce, since it is semi-structured and record-oriented.

PROGRAM:
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
public class MyMaxMin {
public static class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text,
Text> {
@Override
public void map(LongWritable arg0, Text Value, Context context) throws IOException,
InterruptedException {
String line = Value.toString();
if (!(line.length() == 0)) {
String date = line.substring(6, 14);
float temp_Min = Float.parseFloat(line.substring(22, 28).trim());
float temp_Max = Float.parseFloat(line.substring(32, 36).trim());
if (temp_Max > 35.0) {
context.write(new Text("Hot Day " + date),new
Text(String.valueOf(temp_Max)));
}
if (temp_Min < 10) {
context.write(new Text("Cold Day " + date),new
Text(String.valueOf(temp_Min)));
}
}
}
}
public static class MaxTemperatureReducer extends Reducer<Text, Text, Text, Text>
{
public void reduce(Text Key, Iterator<Text> Values, Context context)throws
IOException, InterruptedException


{
String temperature = Values.next().toString();
context.write(Key, new Text(temperature));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "weather example");
job.setJarByClass(MyMaxMin.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path OutputPath = new Path(args[1]);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

sample input dataset:

⮚ Compiling and creating the jar file for the Hadoop MapReduce Java program:
Command: hadoop com.sun.tools.javac.Main MyMaxMin.java
Command: jar cf we.jar MyMaxMin*.class

⮚ Running the weather dataset MapReduce jar file on Hadoop


Command: hadoop jar we.jar MyMaxMin weather/input weather/output

output:


EXPERIMENT-7
AIM: Write a program to find how many flights between origin and destination by
using Map reduce.

PROGRAM:
package com.lbrce.flight;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.conf.Configuration;
public class Flight {
public static class FlightMapper extends Mapper<LongWritable, Text, Text, Text>
{
public void map(LongWritable arg0, Text Value, Context context) throws IOException,
InterruptedException
{
String line = Value.toString();
if (!(line.length() == 0))
{
String fno = line.substring(0, 4);
String origin=line.substring(8, 12).trim();
String dest =line.substring(13, 18).trim();
if(origin.equals("HYD")&&dest.equals("SAN"))
{
context.write(new Text("Flight " + fno),new Text("HYD SAN"));
}
}
}
}
public static class FlightReducer extends Reducer<Text, Text, Text, Text>
{
public void reduce(Text Key, Iterator<Text> Values, Context context)throws IOException,
InterruptedException
{
String nof = Values.next().toString();
context.write(Key, new Text(nof));
}}


public static void main(String[] args) throws Exception

{
Configuration conf = new Configuration();
Job job = new Job(conf, "weather example");
job.setJarByClass(Flight.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);
job.setMapperClass(FlightMapper.class);
job.setReducerClass(FlightReducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path OutputPath = new Path(args[1]);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

Input:
FlightNum  Origin  Destination  ArrivalTime
--------------------------------------
AI111 HYD SAN 22:30
QA222 BOM NEY 24:26
SA333 DEL DAL 32:24
BA444 CHE SAN 42:15
SA555 HYD NEJ 24:26
QA666 BAN DAL 22:30
AI777 HYD SAN 32:24
SA888 DEL SAN 42:15
BA999 BAN NEY 32:24
SA123 BOM NEJ 24:26
QA321 CHE SAN 42:15
SA345 BAN DAL 24:26
AI456 CHE SAN 42:15
BA789 HYD SAN 22:30
QA156 BOM NEJ 32:24
SA234 BAN DAL 24:26
BA132 BOM NEJ 42:15
AI431 HYD SAN 22:30
AA001 CHE SAN 32:24
AA007 BOM NEJ 24:26
AA009 HYD SAN 24:26
DT876 BAN DAL 42:15
JT567 HYD SAN 22:30
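To compile and run this job, the same pattern as in Experiment 4 can be followed; the jar name and the HDFS directory names below are only placeholders and the class is referred to by its fully qualified name because of the package declaration:

Command: hadoop com.sun.tools.javac.Main -d . Flight.java
Command: jar cf flight.jar com/
Command: hadoop jar flight.jar com.lbrce.flight.Flight flight/input flight/output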


EXPERIMENT-8
AIM: Installation of Hive along with practice examples.

Download and extract Hive:


Command: wget https://archive.apache.org/dist/hive/hive-0.14.0/apache-hive-0.14.0-bin.tar.gz
Command: tar zxvf apache-hive-0.14.0-bin.tar.gz
Command: sudo mv apache-hive-0.14.0-bin /usr/lib/hive
Command: sudo gedit $HOME/.bashrc
export HIVE_HOME=/usr/lib/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:/usr/lib/hadoop/lib/*.jar
export CLASSPATH=$CLASSPATH:/usr/lib/hive/lib/*.jar
Command: sudo cd $HIVE_HOME/conf
Command: sudo cp hive-env.sh.template hive-env.sh
export HADOOP_HOME=/usr/lib/hadoop

⮚ Downloading Apache Derby


The following command is used to download Apache Derby. It takes some time to
download.
Command: wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz
Command: tar zxvf db-derby-10.4.2.0-bin.tar.gz
Command: sudo mv db-derby-10.4.2.0-bin /usr/lib/derby
Command: sudo gedit $HOME/.bashrc
export DERBY_HOME=/usr/lib/derby
export PATH=$PATH:$DERBY_HOME/bin
export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar:$DERBY_HOME/lib/derbyclient.jar
Command: sudo mkdir $DERBY_HOME/data
Command: sudo cd $HIVE_HOME/conf
Command: sudo cp hive-default.xml.template hive-site.xml
Command: sudo gedit $HIVE_HOME/conf/hive-site.xml
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true </value>
<description>JDBC connect string for a JDBC metastore </description>
</property>

⮚ Create a file named jpox.properties and add the following lines into it:
javax.jdo.PersistenceManagerFactoryClass = org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema = false
org.jpox.validateTables = false
org.jpox.validateColumns = false


org.jpox.validateConstraints = false
org.jpox.storeManagerType = rdbms
org.jpox.autoCreateSchema = true
org.jpox.autoStartMechanismMode = checked
org.jpox.transactionIsolation = read_committed
javax.jdo.option.DetachAllOnCommit = true
javax.jdo.option.NontransactionalRead = true
javax.jdo.option.ConnectionDriverName = org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL = jdbc:derby://hadoop1:1527/metastore_db;create =
true
javax.jdo.option.ConnectionUserName = APP
javax.jdo.option.ConnectionPassword = mine

Command: $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
Command: $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
Command: $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
Command: $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse
Command: hive
Logging initialized using configuration in jar:file:/home/hadoop/hive-
0.9.0/lib/hive-common-0.9.0.jar!/hive-log4j.properties Hive history
file=/tmp/hadoop/hive_job_log_hadoop_201312121621_1494929084.txt
………………….

hive> show tables;


OK
Time Taken: 2.798 seconds

⮚ Database and table creation, dropping:


hive> CREATE DATABASE IF NOT EXISTS userdb;
hive> SHOW DATABASES;
default
userdb
hive> DROP DATABASE IF EXISTS userdb;
hive> CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
> salary String, destination String)
> COMMENT 'Employee details'
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> LINES TERMINATED BY '\n'
> STORED AS TEXTFILE;

Example
We will insert the following data into the table. It is a text file named sample.txt in
/home/user directory.
1201 Gopal 45000 Technical manager
1202 Manisha 45000 Proof reader
1203 Masthanvali 40000 Technical writer
1204 Krian 40000 Hr Admin
1205 Kranthi 30000 Op Admin

hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt'
> OVERWRITE INTO TABLE employee;
hive> SELECT * FROM employee WHERE Salary>=40000;

+------+-------------+--------+-------------------+------+
| ID   | Name        | Salary | Designation       | Dept |
+------+-------------+--------+-------------------+------+
| 1201 | Gopal       | 45000  | Technical manager | TP   |
| 1202 | Manisha     | 45000  | Proofreader       | PR   |
| 1203 | Masthanvali | 40000  | Technical writer  | TP   |
| 1204 | Krian       | 40000  | Hr Admin          | HR   |
+------+-------------+--------+-------------------+------+

hive> ALTER TABLE employee RENAME TO emp;


hive> DROP TABLE IF EXISTS employee;

Functions:

Return Type   Signature                                  Description
BIGINT        round(double a)                            Returns the rounded BIGINT value of the double.
BIGINT        floor(double a)                            Returns the maximum BIGINT value that is equal to or less than the double.
BIGINT        ceil(double a)                             Returns the minimum BIGINT value that is equal to or greater than the double.
DOUBLE        rand(), rand(int seed)                     Returns a random number that changes from row to row.
STRING        concat(string A, string B, ...)            Returns the string resulting from concatenating B after A.
STRING        substr(string A, int start)                Returns the substring of A starting from the start position till the end of string A.
STRING        substr(string A, int start, int length)    Returns the substring of A starting from the start position with the given length.
STRING        upper(string A)                            Returns the string resulting from converting all characters of A to upper case.
STRING        ucase(string A)                            Same as above.
STRING        lower(string A)                            Returns the string resulting from converting all characters of A to lower case.
hive> SELECT round(2.6) from temp;
3.0
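A few of the other functions from the table can be tried the same way; the temp table only serves to produce one output row per input row, and the values below are the results Hive would normally return for these constant arguments (not output captured from this installation):

hive> SELECT floor(2.6), ceil(2.6), concat('big', 'data'), upper('hive'), substr('hadoop', 1, 3) FROM temp;
2    3    bigdata    HIVE    had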
⮚ Views:
Example
Let us take an example for view. Assume employee table as given below, with the
fields Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the employee
details who earn a salary of more than Rs 30000. We store the result in a view named
emp_30000.

+------+-------------+--------+-------------------+-------+
| ID   | Name        | Salary | Designation       | Dept  |
+------+-------------+--------+-------------------+-------+
| 1201 | Gopal       | 45000  | Technical manager | TP    |
| 1202 | Manisha     | 45000  | Proofreader       | PR    |
| 1203 | Masthanvali | 40000  | Technical writer  | TP    |
| 1204 | Krian       | 40000  | Hr Admin          | HR    |
| 1205 | Kranthi     | 30000  | Op Admin          | Admin |
+------+-------------+--------+-------------------+-------+

The following query retrieves the employee details using the above scenario:
hive> CREATE VIEW emp_30000 AS
> SELECT * FROM employee
> WHERE salary>30000;
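Once created, the view can be queried and dropped just like a table; for example (a straightforward follow-up that is not part of the original listing):

hive> SELECT * FROM emp_30000;
hive> DROP VIEW emp_30000;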
⮚ Indexes:
The following query creates an index:
hive> CREATE INDEX index_salary ON TABLE employee(salary)
> AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler';
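The index can later be rebuilt, listed, and removed with standard HiveQL; the statements below are added for completeness and use the index name created above:

hive> ALTER INDEX index_salary ON employee REBUILD;
hive> SHOW INDEX ON employee;
hive> DROP INDEX IF EXISTS index_salary ON employee;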


EXPERIMENT – 9
AIM: Install and Run Pig then write Pig Latin scripts to sort, group, join, project, and filter your
data.
PROCEDURE:
⮚ Download and extract pig-0.13.0.
Command: wget https://archive.apache.org/dist/pig/pig-0.13.0/pig-0.13.0.tar.gz
Command: tar xvf pig-0.13.0.tar.gz
Command: sudo mv pig-0.13.0 /usr/lib/pig
⮚ Set Path for pig
Command: sudo gedit $HOME/.bashrc
export PIG_HOME=/usr/lib/pig
export PATH=$PATH:$PIG_HOME/bin
export PIG_CLASSPATH=$HADOOP_COMMON_HOME/conf
⮚ pig.properties file
In the conf folder of Pig, we have a file named pig.properties. In the pig.properties file,
you can set various parameters as given below.
Command: pig -h properties
⮚ Verifying the Installation
Verify the installation of Apache Pig by typing the version command. If the installation is
successful, you will get the version of Apache Pig as shown below.
Command: pig -version

Local mode:
Command: $ pig -x local
15/09/28 10:13:03 INFO pig.Main: Logging error messages to: /home/Hadoop/pig_1443415383991.log
2015-09-28 10:13:04,838 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt>

MapReduce mode:
Command: $ pig -x mapreduce
15/09/28 10:28:46 INFO pig.Main: Logging error messages to: /home/Hadoop/pig_1443416326123.log
2015-09-28 10:28:46,427 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
grunt>


Grouping Of Data:
⮚ put dataset into hadoop
Command: hadoop fs -put pig/input/data.txt pig_data/

⮚ Run pig script program of GROUP on hadoop mapreduce


grunt>
student_details = LOAD
'hdfs://localhost:8020/user/pcetcse/pig_data/student_details.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray,
city:chararray);
group_data = GROUP student_details by age;
Dump group_data;
Output:
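The dump screenshot is not reproduced here. With the student_details.txt sample listed later in this experiment, the grouped relation would look roughly as follows (the order of tuples inside each bag is not guaranteed):

(21,{(1,Rajiv,Reddy,21,9848022337,Hyderabad),(4,Preethi,Agarwal,21,9848022330,Pune)})
(22,{(2,siddarth,Battacharya,22,9848022338,Kolkata),(3,Rajesh,Khanna,22,9848022339,Delhi)})
(23,{(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar),(6,Archana,Mishra,23,9848022335,Chennai)})
(24,{(7,Komal,Nayak,24,9848022334,trivendram),(8,Bharathi,Nambiayar,24,9848022333,Chennai)})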

Joining Of Data:
⮚ Run pig script program of JOIN on hadoop mapreduce
grunt>
customers = LOAD 'hdfs://localhost:8020/user/pcetcse/pig_data/customers.txt'
USING PigStorage(',')as (id:int, name:chararray, age:int, address:chararray,
salary:int);
orders = LOAD 'hdfs://localhost:8020/user/pcetcse/pig_data/orders.txt' USING
PigStorage(',') as (oid:int, date:chararray, customer_id:int, amount:int);


grunt> coustomer_orders = JOIN customers BY id, orders BY customer_id;


⮚ Verification
Verify the relation coustomer_orders using the DUMP operator as shown below.
grunt> Dump coustomer_orders;
⮚ Output
You will get the following output that wills the contents of the relation named
coustomer_orders.
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Sorting of Data:
⮚ Run pig script program of SORT on hadoop mapreduce
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/
as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the schema name student_details as
shown below.
grunt>
student_details = LOAD
'hdfs://localhost:8020/user/pcetcse/pig_data/student_details.txt' USING
PigStorage(',')as (id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
Let us now sort the relation in a descending order based on the age of the student and
store it into another relation named data using the ORDER BY operator as shown
below.
grunt> order_by_data = ORDER student_details BY age DESC;
⮚ Verification
Verify the relation order_by_data using the DUMP operator as shown below.
grunt> Dump order_by_data;
⮚ Output
It will produce the following output, displaying the contents of the relation order_by_data as
follows.
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
(7,Komal,Nayak,24,9848022334,trivendram)
(6,Archana,Mishra,23,9848022335,Chennai)
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(4,Preethi,Agarwal,21,9848022330,Pune)
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
Filtering of data:
⮚ Run pig script program of FILTER on hadoop mapreduce
Assume that we have a file named student_details.txt in the HDFS directory /pig_data/
as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Pig with the schema name student_details as
shown below.
grunt>
student_details = LOAD
'hdfs://localhost:8020/user/pcetcse/pig_data/student_details.txt' USING
PigStorage(',')as (id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
Let us now use the Filter operator to get the details of the students who belong to the city
Chennai.
grunt> filter_data = FILTER student_details BY city == 'Chennai';
⮚ Verification
Verify the relation filter_data using the DUMP operator as shown below.
grunt> Dump filter_data;
⮚ Output
It will produce the following output, displaying the contents of the relation filter_data as
follows.
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)


PIG Running Modes


We can manually override the default mode with the -x or -exectype option:
$ pig -x local
$ pig -x mapreduce
Building blocks
- Field – a piece of data. Ex: "abc"
- Tuple – an ordered set of fields, written with "(" and ")"
  Ex: (10.3, abc, 5)
- Bag – a collection of tuples, written with "{" and "}"
  Ex: { (10.3, abc, 5), (def, 12, 13.5) }



Load Command
LOAD 'data' [USING function] [AS schema];
• data – name of the directory or file – Must be in single quotes
• USING – specifies the load function to use
– By default uses PigStorage which parses each line into fields
using a delimiter.
- Default delimiter is tab ('\t')
• AS – assign a schema to incoming data
– Assigns names to fields
– Declares types to fields
LOADING DATA:
• Create file in local file system
[cloudera@localhost ~]$ cat > a.txt
25
35
45
55
65
24
12
19
27
35
24
copy file from local file system to hdfs
[cloudera@localhost ~]$ hadoop fs -put a.txt
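The diagnostic, filter, sort, and group examples that follow all operate on a relation named data, whose LOAD step is not shown in the manual; based on the schema that DESCRIBE reports below (age: int), it would presumably be created as:

grunt> data = LOAD 'a.txt' AS (age:int);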
Pig Latin – Diagnostic Tools
• Display the structure of the Bag
grunt> DESCRIBE <bag_name>;
ex: DESCRIBE data;
• Display Execution Plan
– Produces Various reports
• Logical Plan
• MapReduce Plan
grunt> EXPLAIN <bag_name>;
ex: EXPLAIN data;
• Illustrate how Pig engine transforms the data
grunt> ILLUSTRATE <bag_name>;
ex: ILLUSTRATE data;
Filter data
grunt> filter1 = filter data by age > 30;
grunt> dump filter1;
(35)
(45)
(55)
(65)
(35)


grunt> filter2 = filter data by age < 20;


grunt> dump filter2;
(12)
(19)
Sort data
Sort by Ascending order
grunt> sort1 = order data by age ASC;
grunt> dump sort1;
(12)
(19)
(24)
(24)
(25)
(27)
(35)
(35)
(45)
(55)
(65)
Sort by Descending order
grunt> sort2 = order data by age DESC;
grunt> dump sort2;
(65)
(55)
(45)
(35)
(35)
(27)
(25)
(24)
(24)
(19)
(12)
Grouping data:
grunt> group1 = group data by age;
grunt> describe group1;
group1: {group: int,data: {(age: int)}}
grunt> dump group1;
(12,{(12)})
(19,{(19)})
(24,{(24),(24)})
(25,{(25)})
(27,{(27)})
(35,{(35),(35)})
(45,{(45)})
(55,{(55)})
(65,{(65)})


The data bag is grouped by 'age'; therefore the group element contains unique values.
To see how Pig transforms the data:
grunt> ILLUSTRATE group1;
FOREACH
FOREACH <bag> GENERATE <data>
Iterates over each element in the bag and produces a result.
grunt> records = LOAD 'std.txt' USING PigStorage(',') AS (roll:int, name:chararray);
grunt> dump records;
(501,aaa)
(502,hhh)
(507,yyy)
(204,rrr)
(510,bbb)
grunt> stdname = foreach records generate name;
grunt> dump stdname;
(aaa)
(hhh)
(yyy)
(rrr)
(bbb)
grunt> stdroll = foreach records generate roll;
grunt> dump stdroll;
(501)
(502)
(507)
(204)
(510)

JOIN:
The JOIN operator is used to combine records from two or more relations. While performing a join
operation, we declare one (or a group of) tuple(s) from each relation, as keys. When these keys match, the
two particular tuples are matched, else the records are dropped. Joins can be of the following types −

Self-join
Inner-join
Outer-join − left join, right join, and full join
Self-join
Self-join is used to join a table with itself.
Inner Join
Default Join is Inner Join – Rows are joined where the keys match – Rows that do not have matches
are not included in the result
Outer Join
Records which will not join with the 'other' record-set are still included in the result:
Left Outer – Records from the first data-set are included whether they
have a match or not. Fields from the unmatched (second) bag are set to null.
Right Outer – The opposite of Left Outer Join: Records from the
second data-set are included no matter what. Fields from the
unmatched (first) bag are set to null.
Full Outer – Records from both sides are included. For unmatched records the fields from the 'other' bag are set to null.
[cloudera@localhost ~]$ cat > a.txt
1,2,3
4,2,1
8,3,4
4,3,3
7,2,5
8,4,3
[cloudera@localhost ~]$ cat>b.txt
2,4
8,9
1,3
2,7
2,9
4,6
4,9
[cloudera@localhost ~]$ hadoop fs -put a.txt
[cloudera@localhost ~]$ hadoop fs -put b.txt
Self join
Self-join is used to join a table with itself as if the table were two relations, temporarily renaming at least
one relation.
i.e we join one table to itself rather than joining two tables.
grunt> ONE= load 'a.txt' using PigStorage(',') as (a1:int,a2:int,a3:int);
grunt> TWO = load 'a.txt' using PigStorage(',') as (a1:int,a2:int,a3:int);
SELFJ = JOIN ONE by a1 , TWO BY a1;
grunt> describe SELFJ;
SELFJ: {ONE::a1: int,ONE::a2: int,ONE::a3: int,TWO::a1: int,TWO::a2: int,TWO::a3: int}
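The original listing stops at DESCRIBE. Dumping the relation would show every pairing of tuples that share the same a1 key; for the a.txt contents above, the expected result is sketched below (tuple order may differ):

grunt> DUMP SELFJ;
(1,2,3,1,2,3)
(4,2,1,4,2,1)
(4,2,1,4,3,3)
(4,3,3,4,2,1)
(4,3,3,4,3,3)
(7,2,5,7,2,5)
(8,3,4,8,3,4)
(8,3,4,8,4,3)
(8,4,3,8,3,4)
(8,4,3,8,4,3)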

Equi-join
An inner join is used quite frequently; it is also referred to as an equi-join.

grunt> A = load 'a.txt' using PigStorage(',') as (a1:int,a2:int,a3:int);


grunt> B = load 'b.txt' using PigStorage(',') as (b1:int,b2:int,b3:int);
grunt> X = Join A by a1, B by b1;
grunt> Dump X;
(1,2,3,1,3,)
(4,2,1,4,6,)
(4,2,1,4,9,)
(4,3,3,4,6,)
(4,3,3,4,9,)
(8,3,4,8,9,)
(8,4,3,8,9,)


Left outer join


A = LOAD 'a.txt' using PigStorage(',') AS (a1:int,a2:int,a3:int);
B = LOAD 'b.txt' using PigStorage(',') AS (b1:int,b2:int);
LEFTJ = JOIN A by a1 LEFT OUTER, B BY b1;
DUMP LEFTJ;
(1,2,3,1,3)
(4,3,3,4,9)
(4,3,3,4,6)
(4,2,1,4,9)
(4,2,1,4,6)
(7,2,5,,)
(8,4,3,8,9)
(8,3,4,8,9)
Right outer join
A = LOAD 'a.txt' using PigStorage(',') AS (a1:int,a2:int,a3:int);
B = LOAD 'b.txt' using PigStorage(',') AS (b1:int,b2:int);
RIGHTJ = JOIN A by a1 RIGHT OUTER, B BY b1;
DUMP RIGHTJ;
(1,2,3,1,3)
(,,,2,4)
(,,,2,7)
(,,,2,9)
(4,2,1,4,6)
(4,2,1,4,9)
(4,3,3,4,6)
(4,3,3,4,9)
(8,3,4,8,9)
(8,4,3,8,9)
Full join
A = LOAD 'a.txt' using PigStorage(',') AS (a1:int,a2:int,a3:int);
B = LOAD 'b.txt' using PigStorage(',') AS (b1:int,b2:int);
FULLJ = JOIN A by a1 FULL, B BY b1;
DUMP FULLJ;
(1,2,3,1,3)

(,,,2,4)
(,,,2,7)
(,,,2,9)
(4,2,1,4,6)

(4,2,1,4,9)
(4,3,3,4,6)
(4,3,3,4,9)
(7,2,5,,)
(8,3,4,8,9)
(8,4,3,8,9)


UNION & SPLIT:


UNION combines multiple relations together whereas SPLIT partitions a relation in to multiple ones.
grunt> cat a.txt
1,2,3
4,2,1
8,3,4
grunt> cat b.txt
4,3,3
7,2,5
8,4,3
grunt> a = load 'a.txt' using PigStorage(',') as (a1:int, a2:int, a3:int);
grunt> b = load 'b.txt' using PigStorage(',') as (b1:int, b2:int, b3:int);
grunt> dump a;
(1,2,3)
(4,2,1)
(8,3,4)
grunt> dump b;
(4,3,3)
(7,2,5)
(8,4,3)
grunt> c = UNION a, b;
grunt> dump c;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
grunt> SPLIT c into sp1 if $0 == 4, sp2 if $0 == 8;
The SPLIT operation on 'c' sends a tuple to sp1 if its first field ($0) is 4, and to sp2 if it is 8.
grunt> dump sp1;
(4,3,3)
(4,2,1)
grunt > dump sp2;
(8,4,3)
(8,3,4)
grunt> chars = LOAD 'char.txt' AS (c:chararray);
grunt> chargrp = GROUP chars by c;

grunt> dump chargrp;


(a,{(a),(a),(a)})

(c,{(c),(c)})
(i,{(i),(i),(i)})
(k,{(k),(k),(k),(k)})
(l,{(l),(l)})


grunt> describe chargrp;


chargrp: {group: chararray,chars: {(c: chararray)}}
FOREACH with Functions:
– Pig comes with many functions including COUNT, FLATTEN, CONCAT, etc...
– Can implement a custom function
COUNT:

grunt> counts = FOREACH chargrp GENERATE group, COUNT(chars);


(a,3)
(c,2)
(i,3)
(k,4)
(l,2)
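CONCAT and FLATTEN, which are mentioned above but not demonstrated, can be exercised on the relations already defined in this experiment; this is an illustrative sketch rather than output captured from the lab run:

grunt> flat = FOREACH chargrp GENERATE FLATTEN(chars);   -- un-nests the grouped bags back to one character per tuple
grunt> tagged = FOREACH records GENERATE CONCAT(name, '_pig');   -- appends a literal suffix to each student name
grunt> dump tagged;
(aaa_pig)
(hhh_pig)
(yyy_pig)
(rrr_pig)
(bbb_pig)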



EXPERIMENT-10
AIM: Install and Run Hive then use Hive to create, alter, and drop databases, tables, views,
functions, and indexes

Installing Hive:

1. Ensure Hadoop is already installed on your computer.
2. Download the Hive setup from the following link: https://archive.apache.org/dist/hive/hive-0.11.0
3. Extract it and rename the extracted directory as hive.
4. Copy the hive directory to /usr/local.
5. In /usr/local/hive/conf, rename hive-default.xml.template as hive-site.xml and hive-env.sh.template as hive-env.sh.
6. Open hive-env.sh and specify the Hadoop path:
   export HADOOP_HOME=/usr/local/hadoop
7. Open hive-site.xml and set the property that tells Hive where to keep the warehouse data:
   <property>
     <name>hive.metastore.warehouse.dir</name>
     <value>/home/chp/Hive/warehouse</value>
     <description>location of default database for the warehouse</description>
   </property>
8. Open the bashrc file and append the Hive environment variables:
   #HIVE VARIABLES START
   export JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
   export HADOOP_HOME=/usr/local/hadoop
   export HIVE_HOME=/usr/local/hive
   export PATH=$PATH:$HIVE_HOME/bin
   export CLASSPATH=$CLASSPATH:/usr/local/Hadoop/lib/*:.
   export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.
   #HIVE VARIABLES END
9. Check the Hive installation by running the command hive in a terminal.
Q) Create a database chp and use the database.
hive> create database chp;
OK
Time taken: 0.116 seconds
hive> show databases;
OK
chp
default
hive> use chp;
OK
Time taken: 0.018 seconds

Q) Create tables emp and dept and load data from text files on hdfs.
hadoop fs -mkdir /user/chp/data
hadoop fs -put /home/chp/Desktop/hive_data/*.txt /user/chp/data
hive> create table emp(id int, name string, sal double) row format delimited fields terminated by ',';
OK
Time taken: 8.331 seconds
hive> show tables;
OK
emp
hive> create table dept(eid int, dept string) row format delimited fields terminated by '@';
OK
Time taken: 0.088 seconds
hive> load data inpath '/user/chp/data/faculty.txt' into table emp;
hive> load data inpath '/user/chp/data/dept.txt' into table dept;


hive> select * from emp;


OK
1 chp 10000.0
2 pnr 20000.0
3 kry 30000.0
Time taken: 0.379 seconds, Fetched: 3 row(s)
hive> select * from dept;
OK
2 cse
3 mca
4 cse
Time taken: 0.133 seconds, Fetched: 4 row(s)

Views:
Q) Create a view from emp table with the fields id and name.

create view emp_view as select id,name from emp;


select * from emp_view;

1 chp
2 pnr
3 kry

Q) Find no.of employees using above view.


select count(*) from emp_view;
3
Q) Drop view.
drop view emp_view;

Functions:

Q) Display employee names in uppercase


hive> select upper(name) from emp;
CHP
PNR
KRY
Q) Display employee names from 2nd character
hive> select substr(name,2) from emp;
hp
nr
ry


Q) Concatenate emp id and name


hive> select concat(id,name) from emp;
1chp
2pnr
3kry

Q) Find the salaries of the employees by applying ceil function.


hive> select ceil(sal) from emp;
10000
20000
30000

Q) Find the square root of the emp salaries.


hive> select sqrt(sal) from emp;
100.0
141.4213562373095
173.20508075688772

Q) Find the length of the emp names.

hive> select name,length(name) from emp;
chp 3
pnr 3
kry 3

Q) Find no.of employees in the table emp.


hive> select count(*) from emp;
3

Q) Find the salary of all the employees.


hive> select sum(sal) from emp;
60000.0

Q) Find the average salary of the employees.


hive> select avg(sal) from emp;
20000.0

Q) Find the minimum salary of all the employees.


hive> select min(sal) from emp;
10000.0

Q) Find the maximum salary of all the employees.


hive> select max(sal) from emp;
30000.0


Index:

Create index:
hive>create index emp_index on table emp(name,sal) as
'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' with deferred rebuild;

hive> create index dept_index on table dept(eid) as 'bitmap' with deferred rebuild;
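Because both indexes are created with deferred rebuild, they remain empty until they are explicitly rebuilt; the usual follow-up (standard HiveQL, not shown in the original listing) is:

hive> alter index emp_index on emp rebuild;
hive> alter index dept_index on dept rebuild;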

hive> show formatted index on emp;
idx_name     tab_name   col_names    idx_tab_name              idx_type
emp_index    emp        name, sal    default  emp_emp_index    compact

hive> show formatted index on dept;
idx_name     tab_name   col_names    idx_tab_name              idx_type
dept_index   dept       eid          default  dept_dept_index  bitmap

Drop index:
hive> drop index if exists emp_index on emp;
