SCHOOL OF INFORMATION AND
COMMUNICATION TECHNOLOGY
BIG DATA ANALYTICS LAB
AI 381
NAME- ANUSHKA SRIVASTAVA
ROLL NO- 215/UAI/031
BRANCH- B.TECH AI
SEM-5th
INDEX
S.No Program Date Signature
1. Installation of VMWare to set up the Hadoop environment and its ecosystems.
2. Perform setting up and installing Hadoop in its three operating modes: i. Standalone, ii. Pseudo-distributed, iii. Fully distributed.
3. Use web-based tools to monitor your Hadoop setup.
4. Implementing the basic commands of the LINUX Operating System - File/Directory creation, deletion, and update operations.
5. Implement the following file management tasks in Hadoop: i. Adding files and directories, ii. Retrieving files, iii. Deleting files.
6. Run a basic word count MapReduce program to understand the MapReduce paradigm.
7. Write a MapReduce program that mines weather data.
8. Matrix multiplication with Hadoop MapReduce.
1. Installation of VMWare to set up the Hadoop environment and its ecosystems.
Steps-
Step 1: Install VMware Player before downloading the Hadoop image.
Step 2: Download the "Cloudera Setup File" from the Cloudera download page and extract the zipped file onto your hard drive. Scroll down and select Accept.
Step 3: Start VMware Player and click "Open a Virtual Machine". Browse to the extracted folder.
Login credentials: Machine login credentials are: Username - admin, Password - admin
Cloudera Manager credentials are: Username - admin, Password - admin
Step 4: Checking your Hadoop cluster
● Type sudo jps to see if all nodes are running (if you see an error like the one in the screenshot below, wait for some time and then try again; your daemons have not started yet).
● Type sudo su hdfs
● Execute your command, e.g. hadoop dfs -ls /
Screenshot
2. Perform setting up and installing Hadoop in its three operating modes: 1. Standalone, 2. Pseudo-distributed, 3. Fully distributed.
1) Standalone-
ALGORITHM
● Command for installing ssh is "sudo apt-get install ssh".
● Command for key generation is ssh-keygen -t rsa -P "".
● Store the key into rsa.pub by using the command cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
● Extract Java by using the command tar xvfz jdk-8u60-linux-i586.tar.gz
● Extract Eclipse by using the command tar xvfz eclipse-jee-mars-R-linux-gtk.tar.gz
● Extract Hadoop by using the command tar xvfz hadoop-2.7.1.tar.gz
2) Pseudo-distributed-
ALGORITHM
● To install pseudo-distributed mode, we need to configure the Hadoop configuration files residing in the directory /home/lendi/hadoop-2.7.1/etc/hadoop.
● First configure the hadoop-env.sh file by changing the Java path.
● Configure core-site.xml, which contains a property tag holding a name and a value: set the name to fs.defaultFS and the value to hdfs://localhost:9000 (see the snippet after this list).
● Configure hdfs-site.xml.
● Configure yarn-site.xml.
● Configure mapred-site.xml; before configuring it, copy mapred-site.xml.template to mapred-site.xml.
● Now format the NameNode by using the command hdfs namenode -format.
● Type the commands start-dfs.sh and start-yarn.sh to start the daemons (NameNode, DataNode, SecondaryNameNode, ResourceManager, and NodeManager).
● Run jps, which lists all running daemons. Create a directory in Hadoop by using the command hdfs dfs -mkdir /csedir, enter some data into lendi.txt using the command nano lendi.txt, copy it from the local directory into Hadoop using the command hdfs dfs -copyFromLocal lendi.txt /csedir/, and run the sample wordcount jar file to check whether pseudo-distributed mode is working or not.
● Display the contents of the file by using the command hdfs dfs -cat /newdir/part-r-00000
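For reference, the core-site.xml property named in the list above sits in the file like this (only the property block inside the existing configuration element is added):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>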
3) Fully distributed-
ALGORITHM
● Stop all single-node clusters: $ stop-all.sh
● Decide one host as the NameNode (Master) and the remaining ones as DataNodes (Slaves).
● Copy the public key to all three hosts to get passwordless SSH access: $ ssh-copy-id -i $HOME/.ssh/id_rsa.pub lendi@l5sys24
● Configure all configuration files to name the Master and Slave nodes: $ cd $HADOOP_HOME/etc/hadoop, $ nano core-site.xml, $ nano hdfs-site.xml
● Add the hostnames to the file slaves and save it: $ nano slaves
● Configure yarn-site.xml: $ nano yarn-site.xml
● On the Master node, format the NameNode and start the daemons: $ hdfs namenode -format, $ start-dfs.sh, $ start-yarn.sh
● The daemons now start on the Master and Slave nodes.
3. Use web-based tools to monitor your Hadoop setup.
Introduction-
A Hadoop setup can be managed by different web-based tools, which make it easy for the user to identify the running daemons. A few of the tools used in the real world are-
● Apache Ambari
● Hortonworks
● Apache Spark
4. Implementing the basic commands of the LINUX Operating System - File/Directory creation, deletion, and update operations.
File Operations-
● Creating a File:
touch filename.txt
● Editing a File:
nano filename.txt
● Renaming or Moving a File (update):
mv filename.txt newname.txt
● Deleting a File:
rm filename.txt
Directory Operations-
● Creating a Directory:
mkdir directoryname
● Changing Directory:
cd directoryname
● Deleting a Directory (must be empty; use rm -r for non-empty directories):
rmdir directoryname
5. Implement the following file management tasks in Hadoop: 1. Adding files and directories, 2. Retrieving files, 3. Deleting files.
1) Adding files and directories-
Before we run Hadoop programs on data stored in HDFS, we'll need to put the data into HDFS first by creating a directory and putting a file in it. HDFS has a default working directory of /user/$USER, where $USER is our login user name. This directory isn't automatically created for us, though, so we create it with the mkdir command. For the purpose of illustration, we use chuck; substitute your own user name in the example commands.
hadoop fs -mkdir /user/chuck
hadoop fs -put example.txt /user/chuck
2) Retrieving files-
The Hadoop command get copies files from HDFS back to the local filesystem, while cat displays a file's contents. To retrieve example.txt, we can run the following commands:
hadoop fs -get example.txt .
hadoop fs -cat example.txt
3) Deleting files-
hadoop fs -rm example.txt
The command for creating a directory in HDFS is
"hdfs dfs -mkdir /lendicse".
Adding a directory is done through the command
"hdfs dfs -put lendi_english/ /lendicse".
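These shell commands have programmatic equivalents in the HDFS Java API. The following is a minimal sketch of the three tasks, assuming a reachable cluster; the class name and paths are illustrative, not part of the original lab:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileTasks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);       // handle to HDFS (fs.defaultFS)

        // 1. Adding files and directories
        fs.mkdirs(new Path("/user/chuck"));
        fs.copyFromLocalFile(new Path("example.txt"), new Path("/user/chuck/example.txt"));

        // 2. Retrieving files (HDFS -> local filesystem)
        fs.copyToLocalFile(new Path("/user/chuck/example.txt"), new Path("example.copy.txt"));

        // 3. Deleting files (second argument: delete recursively)
        fs.delete(new Path("/user/chuck/example.txt"), false);

        fs.close();
    }
}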
OUTPUT-
6. Run a basic word count MapReduce program to understand the MapReduce paradigm.
Prerequisites-
● Java Installation - Check whether Java is installed using the following command: java -version
● Hadoop Installation - Check whether Hadoop is installed using the following command: hadoop version
Steps-
Step-1 Write a Mapper
● A Mapper overrides the "map" function of the class "org.apache.hadoop.mapreduce.Mapper", which provides <key,value> pairs as the input. A Mapper implementation may output <key,value> pairs using the provided Context.
● The input value of the WordCount map task is a line of text from the input data file, and the key is the line number: <line_number, line_of_text>. The map task outputs <word, one> for each word in the line of text.
Pseudo-code
void Map (key, value){
    for each word x in value:
        output.collect(x, 1);
}
Step-2 Write a Reducer
A Reducer collects the intermediate <key,value> output from multiple map tasks and assembles a single result. Here, the WordCount program sums up the occurrences of each word into pairs of the form <word, occurrence>.
Pseudo-code
void Reduce (keyword, <list of value>){
    for each x in <list of value>:
        sum += x;
    final_output.collect(keyword, sum);
}
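Putting Step-1 and Step-2 together with a driver gives the standard WordCount implementation; the sketch below is essentially the canonical example from the Hadoop documentation (file and path names are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: for each line, emit <word, 1> for every word in the line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sum the 1s for each word, emitting <word, occurrence>.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configure and submit the job.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, the job can be run with, for example, hadoop jar wc.jar WordCount /csedir /newdir; the counts then appear in /newdir/part-r-00000, which can be displayed as in Program 2.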
OUTPUT-
7. Write a MapReduce program that mines weather data.
Steps-
Step-1 Write a Mapper
● A Mapper overrides the "map" function of the class "org.apache.hadoop.mapreduce.Mapper", which provides <key,value> pairs as the input. A Mapper implementation may output <key,value> pairs using the provided Context.
● The input value of the weather-mining map task is a line of text from the weather data file, and the key is the line number: <line_number, line_of_text>. The map task outputs <max_temp, one> and <min_temp, one> for each temperature reading in the line, as the pseudo-code below shows.
Pseudo-code
void Map (key, value){
    for each max_temp x in value:
        output.collect(x, 1);
}
void Map (key, value){
    for each min_temp x in value:
        output.collect(x, 1);
}
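A Java rendering of the first Map above, assuming each input line has the form "<year> <max_temp> <min_temp>" (this record layout and the class name are assumptions for illustration; real archives such as NCDC records need more parsing). Swapping fields[1] for fields[2] gives the min_temp variant:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mirrors the pseudo-code above: emits <max_temp, 1> for each record.
public class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed line layout: "<year> <max_temp> <min_temp>"
        String[] fields = value.toString().trim().split("\\s+");
        if (fields.length >= 3) {
            context.write(new Text(fields[1]), ONE); // fields[1] = max_temp
        }
    }
}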
Step-2 Write a Reducer
A Reducer collects the intermediate output from multiple map tasks and assembles a single result. Here, the program sums up the occurrences of each temperature reading into pairs of the form <temperature, occurrence>.
Pseudo-code
void Reduce (max_temp, <list of value>){
    for each x in <list of value>:
        sum += x;
    final_output.collect(max_temp, sum);
}
void Reduce (min_temp, <list of value>){
    for each x in <list of value>:
        sum += x;
    final_output.collect(min_temp, sum);
}
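The matching Java reducer, continuing the illustrative sketch above:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the 1s emitted by the mapper, yielding <temperature, occurrence>.
public class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text temp, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(temp, new IntWritable(sum));
    }
}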
Step-3 Write a Driver
The Driver program configures and runs the MapReduce job. We use the main program to perform basic configuration such as:
● Job Name: the name of this job
● Executable (Jar) Class: the main executable class; here, the job's driver class
● Mapper Class: the class which overrides the "map" function; here, Map
● Reducer Class: the class which overrides the "reduce" function; here, Reduce
● Output Key: the type of the output key; here, Text
● Output Value: the type of the output value; here, IntWritable
● File Input Path
● File Output Path
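A driver following this checklist, wired to the illustrative MaxTempMapper and MaxTempReducer sketched above (the class names remain assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WeatherDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "weather mining"); // Job Name
        job.setJarByClass(WeatherDriver.class);      // Executable (Jar) Class
        job.setMapperClass(MaxTempMapper.class);     // Mapper Class
        job.setReducerClass(MaxTempReducer.class);   // Reducer Class
        job.setOutputKeyClass(Text.class);           // Output Key type
        job.setOutputValueClass(IntWritable.class);  // Output Value type
        FileInputFormat.addInputPath(job, new Path(args[0]));    // File Input Path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // File Output Path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}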
OUTPUT-
8. Matrix multiplication with Hadoop MapReduce.
Steps-
Here A is an I x K matrix and B is a K x J matrix; they are processed in blocks of size IB x KB and KB x JB respectively, and R is the number of reducers.

setup():
    var NIB = (I-1)/IB + 1
    var NKB = (K-1)/KB + 1
    var NJB = (J-1)/JB + 1

map(key, value):
    if from matrix A with key = (i,k) and value = a(i,k):
        for 0 <= jb < NJB:
            emit (i/IB, k/KB, jb, 0), (i mod IB, k mod KB, a(i,k))
    if from matrix B with key = (k,j) and value = b(k,j):
        for 0 <= ib < NIB:
            emit (ib, k/KB, j/JB, 1), (k mod KB, j mod JB, b(k,j))

Intermediate keys (ib, kb, jb, m) sort in increasing order, first by ib, then by kb, then by jb, then by m. Note that m = 0 for A data and m = 1 for B data.

The partitioner maps intermediate key (ib, kb, jb, m) to a reducer r as follows:
    r = ((ib*JB + jb)*KB + kb) mod R

These definitions for the sorting order and partitioner guarantee that each reducer R[ib,kb,jb] receives the data it needs for blocks A[ib,kb] and B[kb,jb], with the data for the A block immediately preceding the data for the B block.

var A = new matrix of dimension IB x KB
var B = new matrix of dimension KB x JB
var sib = -1
var skb = -1

reduce(key, valueList):
    if key is (ib, kb, jb, 0):
        sib = ib
        skb = kb
        zero matrix A
        for each value = (i, k, v) in valueList:
            A(i,k) = v
    if key is (ib, kb, jb, 1):
        if ib != sib or kb != skb: return  // A[ib,kb] must be zero!
        zero matrix B
        for each value = (k, j, v) in valueList:
            B(k,j) = v
        ibase = ib*IB
        jbase = jb*JB
        for 0 <= i < row dimension of A:
            for 0 <= j < column dimension of B:
                sum = 0
                for 0 <= k < column dimension of A (= row dimension of B):
                    sum += A(i,k) * B(k,j)
                if sum != 0:
                    emit (ibase+i, jbase+j), sum
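For contrast, the sketch below is a minimal non-blocked Java variant (one reduce call per output cell of C = A x B). It is far less efficient than the blocked algorithm above but shows the same map/shuffle/reduce structure. The input line format ("A,i,k,value" and "B,k,j,value") and the dimensions I, J, K passed through the Configuration are assumptions for illustration:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MatrixMultiply {

    // Mapper: replicate each element to every output cell (i,j) whose dot product needs it.
    public static class MatMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            int I = conf.getInt("I", 0); // rows of A
            int J = conf.getInt("J", 0); // columns of B
            String[] t = value.toString().split(","); // "A,i,k,v" or "B,k,j,v"
            if (t[0].equals("A")) {
                for (int j = 0; j < J; j++) {   // A(i,k) is needed by cells (i, 0..J-1)
                    context.write(new Text(t[1] + "," + j), new Text("A," + t[2] + "," + t[3]));
                }
            } else {
                for (int i = 0; i < I; i++) {   // B(k,j) is needed by cells (0..I-1, j)
                    context.write(new Text(i + "," + t[2]), new Text("B," + t[1] + "," + t[3]));
                }
            }
        }
    }

    // Reducer: key is one output cell (i,j); pair A(i,k) with B(k,j) by k and sum the products.
    public static class MatReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            int K = context.getConfiguration().getInt("K", 0); // columns of A = rows of B
            double[] a = new double[K];
            double[] b = new double[K];
            for (Text val : values) {
                String[] t = val.toString().split(",");
                int k = Integer.parseInt(t[1]);
                if (t[0].equals("A")) a[k] = Double.parseDouble(t[2]);
                else                  b[k] = Double.parseDouble(t[2]);
            }
            double sum = 0;
            for (int k = 0; k < K; k++) sum += a[k] * b[k];
            if (sum != 0) context.write(key, new Text(Double.toString(sum)));
        }
    }
}

A driver sets conf.setInt("I", ...), conf.setInt("J", ...), and conf.setInt("K", ...) before Job.getInstance and otherwise mirrors the WordCount driver in Program 6.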
OUTPUT-