Practical – 4
Aim: Hadoop installation as single node cluster and multi node cluster.
Pre-requisite:
OS: UBUNTU 14.04 LTS
FRAMEWORK: Hadoop 2.7.3
JAVA VERSION: 1.7.0_131
Single node Cluster:
Steps:
1. Check whether the Linux repository service is working:
Gcet@gfl1-5:~$ sudo apt-get update
Ign http://extras.ubuntu.com trusty InRelease
Ign http://in.archive.ubuntu.com trusty InRelease
Get:1 http://extras.ubuntu.com trusty Release.gpg [72 B]
Hit http://in.archive.ubuntu.com trusty/universe Translation-en
Ign http://in.archive.ubuntu.com trusty/main Translation-en_IN
Ign http://in.archive.ubuntu.com trusty/multiverse Translation-en_IN
Ign http://in.archive.ubuntu.com trusty/restricted Translation-en_IN
Ign http://in.archive.ubuntu.com trusty/universe Translation-en_IN
Fetched 4,302 kB in 40s (107 kB/s)
Reading package lists... Done
2. Check the Java version:
Gcet@gfl1-5:~$ java -version
java version "1.7.0_131"
OpenJDK Runtime Environment (IcedTea 2.6.9) (7u131-2.6.9-0ubuntu0.14.04.2)
OpenJDK Server VM (build 24.131-b00, mixed mode)
3. Download Hadoop from the Apache Hadoop site (hadoop.apache.org) and extract it as
under:
Gcet@gfl1-5:~$ tar -xvf hadoop-2.7.3.tar.gz
hadoop-2.7.3/share/hadoop/tools/lib/hadoop-extras-2.7.3.jar
hadoop-2.7.3/share/hadoop/tools/lib/asm-3.2.jar
hadoop-2.7.3/include/
hadoop-2.7.3/include/hdfs.h
hadoop-2.7.3/include/Pipes.hh
hadoop-2.7.3/include/TemplateFactory.hh
hadoop-2.7.3/include/StringUtils.hh
hadoop-2.7.3/include/SerialUtils.hh
hadoop-2.7.3/LICENSE.txt
hadoop-2.7.3/NOTICE.txt
hadoop-2.7.3/README.txt
Gcet@gfl1-5:~$ sudo mv /home/Gcet/Downloads/hadoop-2.7.3 /usr/local/hadoop
4. Check whether Hadoop is working properly using the command below:
Gcet@gfl1-5:~$ /usr/local/hadoop/hadoop-2.7.3/bin/hadoop
classpath            prints the class path needed to get the Hadoop jar and the required libraries
credential           interact with credential providers
daemonlog            get/set the log level for each daemon
trace                view and modify Hadoop tracing settings
Most commands print help when invoked w/o parameters.
5. Install openssl, ssh and rsync:
Gcet@gfl1-5:~$ sudo apt-get install openssl
[sudo] password for Gcet:
Reading package lists... Done
Building dependency tree
Reading state information... Done
openssl is already the newest version.
0 upgraded, 0 newly installed, 0 to remove and 460 not upgraded.
Gcet@gfl1-5:~$ sudo apt-get install ssh
Reading package lists... Done
Building dependency tree
Reading state information... Done
ssh is already the newest version.
0 upgraded, 0 newly installed, 0 to remove and 460 not upgraded.
Gcet@gfl1-5:~$ sudo apt-get install rsync
Reading package lists... Done
Building dependency tree
Reading state information... Done
rsync is already the newest version.
0 upgraded, 0 newly installed, 0 to remove and 460 not upgraded.
6. Set the environment variable for Java:
Gcet@gfl1-5:~$ export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-i386
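The export above only lasts for the current shell session. To make it persistent, the same line can be appended to ~/.bashrc (a sketch; the JDK path matches the OpenJDK 7 install shown above):

```shell
# Append JAVA_HOME to ~/.bashrc so it is set in every new shell,
# then reload the file so it takes effect in the current session.
echo 'export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-i386' >> ~/.bashrc
. ~/.bashrc
echo "$JAVA_HOME"
```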
7. Running the examples on a single system requires input and output directories:
Gcet@gfl1-5:~$ mkdir input1
Gcet@gfl1-5:~$ mkdir output1
8. Now copy an XML file from the Hadoop configuration folder to the input folder:
Gcet@gfl1-5:~$ cp /usr/local/hadoop/hadoop-2.7.3/etc/hadoop/capacity-scheduler.xml /home/Gcet/Desktop/input1
9. Now run the Hadoop examples:
Gcet@gfl1-5:~$ /usr/local/hadoop/hadoop-2.7.3/bin/hadoop jar /usr/local/hadoop/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar grep /home/Gcet/Desktop/input1/ /home/Gcet/output1/output1 'principal[.]*'
17/07/26 15:15:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
17/07/26 15:15:52 INFO Configuration.deprecation: session.id is deprecated. Instead, use
dfs.metrics.session-id
17/07/26 15:15:52 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
17/07/26 15:15:52 INFO input.FileInputFormat: Total input paths to process : 2
17/07/26 15:15:52 INFO mapreduce.JobSubmitter: number of splits:2
17/07/26 15:15:53 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_local1790612813_0001
17/07/26 15:15:53 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
17/07/26 15:15:53 INFO mapreduce.Job: Running job: job_local1790612813_0001
17/07/26 15:15:53 INFO mapred.LocalJobRunner: OutputCommitter set in config null
17/07/26 15:15:53 INFO output.FileOutputCommitter: File Output Committer Algorithm version
is 1
17/07/26 15:15:53 INFO mapred.LocalJobRunner: OutputCommitter is
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
17/07/26 15:15:53 INFO mapred.LocalJobRunner: Waiting for map tasks
17/07/26 15:15:53 INFO mapred.LocalJobRunner: Starting task:
attempt_local1790612813_0001_m_000000_0
17/07/26 15:15:53 INFO output.FileOutputCommitter: File Output Committer Algorithm version
is 1
17/07/26 15:15:53 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
...
17/07/26 15:15:55 INFO mapreduce.Job: Job job_local192240145_0002 running in uber mode :
false
17/07/26 15:15:55 INFO mapreduce.Job: map 100% reduce 100%
17/07/26 15:15:55 INFO mapreduce.Job: Job job_local192240145_0002 completed successfully
17/07/26 15:15:55 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=1195494
FILE: Number of bytes written=2315812
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=0
Spilled Records=0
Shuffled Maps =1
GC time elapsed (ms)=10
Total committed heap usage (bytes)=854065152
Shuffle BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=98
File Output Format Counters
Bytes Written=8
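The grep example counts matches of the given regular expression in the input files and writes the result to the output directory passed on the command line. Its effect can be previewed with ordinary grep, and the result inspected afterwards (illustrative; paths as created in the steps above, and part-r-00000 is the standard MapReduce output file name):

```shell
# Plain grep gives the same match count the MapReduce grep job computes
# for a single local input file.
grep -o 'principal' /home/Gcet/Desktop/input1/capacity-scheduler.xml | wc -l

# The job itself writes its counts to a part file in the output directory.
cat /home/Gcet/output1/output1/part-r-00000
```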
Multi node cluster:
Steps:
We have two machines (master and slave) with IP:
Master IP: 192.168.56.102
Slave IP: 192.168.56.103
STEP 1: Check the IP address of all machines.
Command: ip addr show (you can use the ifconfig command as well)
STEP 2: Disable the firewall restrictions. On Ubuntu:
Command: sudo ufw disable
(On RHEL/CentOS the equivalents are service iptables stop and sudo chkconfig iptables off.)
STEP 3: Open hosts file to add master and data node with their respective IP addresses.
Command: sudo nano /etc/hosts
Same properties will be displayed in the master and slave hosts files.
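The screenshot of the hosts file is not reproduced here; with the IP addresses listed above, the added entries would look like:

```
192.168.56.102 master
192.168.56.103 slave
```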
STEP 4: Restart the SSH service.
Command: sudo service ssh restart
(On RHEL/CentOS the service is named sshd: service sshd restart.)
STEP 5: Create the SSH key on the master node. (Press the Enter key when it asks you for a filename in which to
save the key.)
Command: ssh-keygen -t rsa -P ""
STEP 6: Copy the generated ssh key to master node’s authorized keys.
Command: cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
STEP 7: Copy the master node’s ssh key to slave’s authorized keys.
Command: ssh-copy-id -i $HOME/.ssh/id_rsa.pub edureka@slave
STEP 8: Download the Java 8 package (jdk-8u101-linux-i586.tar.gz is used here) and save the file in your home directory.
STEP 9: Extract the Java Tar File on all nodes.
Command: tar -xvf jdk-8u101-linux-i586.tar.gz
STEP 10: Download the Hadoop 2.7.3 Package on all nodes.
Command: wget https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
STEP 11: Extract the Hadoop tar File on all nodes.
Command: tar -xvf hadoop-2.7.3.tar.gz
STEP 12: Add the Hadoop and Java paths to the bash file (.bashrc) on all nodes.
Open the .bashrc file and add the Hadoop and Java paths:
Command: sudo gedit .bashrc
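The screenshot of the added lines is not reproduced here; a plausible set of additions, assuming Hadoop and the JDK were extracted into the home directory as in steps 9 to 11, is:

```shell
# Illustrative .bashrc additions -- adjust the paths to your actual extract locations.
export HADOOP_HOME=/home/edureka/hadoop-2.7.3
export JAVA_HOME=/home/edureka/jdk1.8.0_101
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin
```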
Then, save the bash file and close it.
For applying all these changes to the current Terminal, execute the source command.
Command: source .bashrc
To make sure that Java and Hadoop have been properly installed on your system and can be
accessed through the Terminal, execute the java -version and hadoop version commands.
Command: java -version
Command: hadoop version
Now edit the configuration files in hadoop-2.7.3/etc/hadoop directory.
STEP 13: Create a masters file and edit it as follows on both the master and slave machines:
Command: sudo gedit masters
STEP 14: Edit the slaves file on the master machine as follows:
Command: sudo gedit /home/edureka/hadoop-2.7.3/etc/hadoop/slaves
STEP 15: Edit the slaves file on the slave machine as follows:
Command: sudo gedit /home/edureka/hadoop-2.7.3/etc/hadoop/slaves
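The screenshots of these files are not reproduced here. With the hostnames used in this setup, plausible contents are: the masters file contains the single line master on both machines; the slaves file on each machine lists every host that should run a DataNode and NodeManager (include master itself only if it should also store data). For example, the slaves file on the master could contain:

```
master
slave
```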
STEP 16: Edit core-site.xml on both master and slave machines as follows:
Command: sudo gedit /home/edureka/hadoop-2.7.3/etc/hadoop/core-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
STEP 17: Edit hdfs-site.xml on the master as follows:
Command: sudo gedit /home/edureka/hadoop-2.7.3/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/edureka/hadoop-2.7.3/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/edureka/hadoop-2.7.3/datanode</value>
  </property>
</configuration>
STEP 18: Edit hdfs-site.xml on slave machine as follows:
Command: sudo gedit /home/edureka/hadoop-2.7.3/etc/hadoop/hdfs-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/edureka/hadoop-2.7.3/datanode</value>
  </property>
</configuration>
STEP 19: Copy mapred-site.xml from the template in the configuration folder and then edit mapred-site.xml on both
master and slave machines as follows:
Command: cp mapred-site.xml.template mapred-site.xml
Command: sudo gedit /home/edureka/hadoop-2.7.3/etc/hadoop/mapred-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
STEP 20: Edit yarn-site.xml on both master and slave machines as follows:
Command: sudo gedit /home/edureka/hadoop-2.7.3/etc/hadoop/yarn-site.xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
STEP 21: Format the namenode (Only on master machine).
Command: hadoop namenode -format
STEP 22: Start all daemons (Only on master machine).
Command: ./sbin/start-all.sh
STEP 23: Check the daemons running on both the master and slave machines.
Command: jps
On the master you should typically see NameNode, SecondaryNameNode and ResourceManager (plus DataNode and NodeManager if the master is also listed in the slaves file); on the slave, DataNode and NodeManager.
Finally, open a browser on the master machine and go to master:50070/dfshealth.html; this brings up the
NameNode web interface. Scroll down to the number of live nodes: if it is 2, you have successfully set up a
multi-node Hadoop cluster. If it is not 2, you have probably missed one of the steps above; go back, verify
the configurations, and correct any issues.