BIG DATA AND HADOOP (BTCS-702) 19100BTIT06586
Shri Vaishnav Vidhyapeeth Vishwavidhyalaya
Shri Vaishnav Institute of Information Technology
Department of Information Technology
Year (2022-23)
Subject Name with Code:-
BIG DATA AND HADOOP LAB
(BTCS-702)
4th Year (Semester VII)
SUBMITTED TO :- SUBMITTED BY:-
Prof. Anand Gadwal JAY
PARTH SHARMA 1
Experiment-01
Objective:- Installation process of VirtualBox.
Introduction of VirtualBox:-
VirtualBox is a powerful x86 and AMD64/Intel64 virtualization product for
enterprise as well as home use. Not only is VirtualBox an extremely
feature-rich, high-performance product for enterprise customers, it is also the only
professional solution that is freely available as Open Source Software under the
terms of the GNU General Public License (GPL) version 2. VirtualBox is a
general-purpose full virtualizer for x86 hardware, targeted at server, desktop and
embedded use.
Why is Virtualization useful:-
The techniques and features that Oracle VM VirtualBox provides are useful in
the following scenarios:-
1. Running multiple operating systems simultaneously:- Oracle VM
VirtualBox enables you to run more than one OS at a time. This way, you
can run software written for one OS on another, such as Windows
software on Linux or a Mac, without having to reboot to use it. Since you
can configure what kinds of virtual hardware should be presented to each
such OS, you can install an old OS such as DOS or OS/2 even if your real
computer's hardware is no longer supported by that OS.
2. Easier software installations:- Software vendors can use virtual
machines to ship entire software configurations. For example, installing a
complete mail server solution on a real machine can be a tedious task.
With Oracle VM VirtualBox, such a complex setup, often called an
appliance, can be packed into a virtual machine. Installing and running a
mail server becomes as easy as importing such an appliance into Oracle
VM VirtualBox.
3. Testing and disaster recovery:- Once installed, a virtual machine and its
virtual hard disks can be considered a container that can be arbitrarily
frozen, copied, backed up, and transported between hosts. On top of that,
with the use of another Oracle VM VirtualBox feature called snapshots,
one can save a particular state of a virtual machine and revert back to that
state, if necessary. This way, one can freely experiment with a computing
environment. If something goes wrong, such as problems after installing
software or infecting the guest with a virus, you can easily switch back to a
previous snapshot and avoid the need of frequent backups and restores. Any
number of snapshots can be created, allowing you to travel back and forward
in virtual machine time. You can delete snapshots while a VM is running to
reclaim disk space.
4. Infrastructure consolidation:- Virtualization can significantly reduce
hardware and electricity costs. Most of the time, computers today only
use a fraction of their potential power and run with low average system
loads. A lot of hardware resources as well as electricity is thereby wasted.
So, instead of running many such physical computers that are only
partially used, one can pack many virtual machines onto a few powerful
hosts and balance the loads between them.
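The snapshot workflow described in item 3 can also be driven from the command line. The following is a minimal sketch, assuming VirtualBox's VBoxManage tool is available on the PATH; the VM name "ubuntu-20.04" and the snapshot name "before-install" are hypothetical. It only builds the command lines and does not run them.

```python
from typing import List


def snapshot_commands(vm_name: str, snapshot_name: str) -> List[List[str]]:
    """Build VBoxManage command lines to take a snapshot and later restore it."""
    return [
        # Save the current state of the VM under the given snapshot name.
        ["VBoxManage", "snapshot", vm_name, "take", snapshot_name],
        # Roll the VM back to that snapshot (the VM must not be running).
        ["VBoxManage", "snapshot", vm_name, "restore", snapshot_name],
    ]


if __name__ == "__main__":
    # Hypothetical names, used only for illustration.
    for cmd in snapshot_commands("ubuntu-20.04", "before-install"):
        print(" ".join(cmd))
```

Each inner list could be passed to subprocess.run once the VM actually exists.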
Steps of Installation of VirtualBox:-
Step 1:- To download VirtualBox, open the following link:-
https://www.virtualbox.org/wiki/Downloads
Then, depending on your OS, select which version to install. In our case, it
is the first one (Windows host).
Step 2:- Once the option is selected, click on “Next”.
Step 3:- You have the option asking where to install the application. We can
leave it as default and click on “Next”.
Step 4:- Once the options are selected as shown in the following screenshot,
click on Next.
Step 5:- A dialog box will come up asking whether to proceed with the
installation. Click “Yes”.
Step 6:- In the next step, click on “Install”.
Step 7:- Tick the start VirtualBox check box and click on “Finish”.
Step 8:- The VirtualBox application will now open as shown in the following
screenshot. We are now ready to install virtual machines.
Creating an Ubuntu virtual machine in VirtualBox on Windows 10:-
Step 1:- Download the ISO file that is required to install Ubuntu in
VirtualBox. Click on the first link.
Step 2:- Then click on the download tab to download the latest version that is
available.
Step 3:- After that, a thank-you page will appear and the ISO file of Ubuntu
will be downloaded to our system.
Step 4:- Next, open VirtualBox, click on the New button and provide the name
of the machine. Then click on Next.
Step 5:- Then provide the memory size and click on Next.
Step 6:- Then select the file location and the size of the virtual hard disk
and click on Create.
Step 7:- The new machine, Ubuntu 20.04, will then be visible.
Step 8:- Then select the machine, open Settings, select the Advanced tab under
General, and set both Shared Clipboard and Drag'n'Drop to Bidirectional. This
allows us to share files and clipboard content between our host machine
(Windows) and our virtual machine (Ubuntu).
Step 9:- Then select the ISO file and its location, then click on OK.
That is how a virtual machine is created in VirtualBox.
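The same machine can be described with VirtualBox's VBoxManage command-line tool instead of the GUI. The sketch below is an illustration under stated assumptions, not the procedure used in this experiment: the VM name, memory size, disk size and file paths are hypothetical, and the controller name "SATA" is an arbitrary label. It only builds the command lines. The clipboard and drag-and-drop settings from Step 8 are left to the GUI here because the corresponding option names differ between VirtualBox releases.

```python
from typing import List


def create_vm_commands(name: str, memory_mb: int, disk_mb: int,
                       disk_path: str, iso_path: str) -> List[List[str]]:
    """Return VBoxManage command lines that mirror Steps 4 to 9 above."""
    return [
        # Step 4: create and register a new 64-bit Ubuntu machine.
        ["VBoxManage", "createvm", "--name", name,
         "--ostype", "Ubuntu_64", "--register"],
        # Step 5: set the memory size in megabytes.
        ["VBoxManage", "modifyvm", name, "--memory", str(memory_mb)],
        # Step 6: create the virtual hard disk (size in megabytes).
        ["VBoxManage", "createmedium", "disk",
         "--filename", disk_path, "--size", str(disk_mb)],
        # Add a SATA controller so the disk and the ISO can be attached.
        ["VBoxManage", "storagectl", name, "--name", "SATA", "--add", "sata"],
        ["VBoxManage", "storageattach", name, "--storagectl", "SATA",
         "--port", "0", "--device", "0", "--type", "hdd",
         "--medium", disk_path],
        # Step 9: attach the downloaded Ubuntu ISO as a virtual DVD.
        ["VBoxManage", "storageattach", name, "--storagectl", "SATA",
         "--port", "1", "--device", "0", "--type", "dvddrive",
         "--medium", iso_path],
    ]


if __name__ == "__main__":
    # Hypothetical values, used only for illustration.
    for cmd in create_vm_commands("ubuntu-20.04", 4096, 20480,
                                  "ubuntu.vdi", "ubuntu-20.04.iso"):
        print(" ".join(cmd))
```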
Experiment-02
Objective:- Installation process of Hadoop.
Introduction of Hadoop:-
Hadoop is an open source framework by Apache that is used to efficiently store
and process large datasets ranging in size from gigabytes to petabytes of data.
Instead of using one large computer to store and process the data, Hadoop
allows clustering multiple computers to analyse massive datasets in parallel
more quickly. Hadoop consists of four main modules:
● Hadoop Distributed File System (HDFS) – A distributed file system
that runs on standard or low-end hardware. HDFS provides better
data throughput than traditional file systems, in addition to high fault
tolerance and native support of large datasets.
● Yet Another Resource Negotiator (YARN) – Manages and monitors
cluster nodes and resource usage. It schedules jobs and tasks.
● MapReduce – A framework that helps programs perform parallel
computation on data. The map task takes input data and converts it
into a dataset of key-value pairs. The output of
the map task is consumed by reduce tasks, which aggregate it and
provide the desired result.
● Hadoop Common – Provides common Java libraries that can be used
across all modules.
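The map and reduce tasks described above can be illustrated with a word count. This is a minimal single-process sketch in plain Python, not Hadoop's own API; the function names are hypothetical and the shuffle step stands in for the grouping that the framework performs between the two phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) key-value pair for every word in the input."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values that share the same key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: aggregate the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence on an in-memory input."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop"]))
```

In a real cluster the map and reduce functions run on many nodes at once, each on its own portion of the data.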
Why is Hadoop useful:-
● Open Source:- Hadoop is open source, which means it is free to use. The
source code is available online for anyone to study or to
modify as per industry requirements.
● Fault Tolerance:- Hadoop runs on inexpensive systems that can
crash at any moment. Data is therefore replicated on several DataNodes
in the Hadoop cluster, which keeps the data available even if
one of the systems crashes.
● High Availability:- Because of this fault tolerance, if any DataNode
goes down, the same data can be retrieved from another node where the data
is replicated.
● Cost Effective:- Runs on low-cost commodity hardware.
● Easy to use:- Hadoop is easy to use since developers need not
worry about distributing the processing work; this is managed by Hadoop itself.
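The link between replication and availability can be made concrete with a small simulation. The sketch below is a deliberately simplified model and an assumption of this example: it places replicas round-robin, whereas HDFS uses its own placement policy, and the block and node names are hypothetical.

```python
from typing import Dict, List, Set


def place_replicas(blocks: List[str], nodes: List[str],
                   replication: int) -> Dict[str, Set[str]]:
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement: Dict[str, Set[str]] = {}
    for i, block in enumerate(blocks):
        placement[block] = {nodes[(i + r) % len(nodes)]
                            for r in range(replication)}
    return placement


def readable_blocks(placement: Dict[str, Set[str]],
                    failed: Set[str]) -> List[str]:
    """Return the blocks that still have a replica on a live node."""
    return [block for block, holders in placement.items()
            if holders - failed]


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3"]
    blocks = ["b0", "b1", "b2"]
    for factor in (1, 2):
        placement = place_replicas(blocks, nodes, factor)
        print(factor, readable_blocks(placement, failed={"n1"}))
```

With a replication factor of 1, the block stored only on the failed node becomes unreadable; with a factor of 2, every block survives the loss of one node.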
Steps of Installation of Hadoop:-
Step 1:- Check the installed Java version by running this command at the command prompt:
java -version
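The Java check can also be scripted. The following sketch assumes the command in question is `java -version` and that JDK 1.8 is the version wanted, as mentioned later in this experiment; the helper names are hypothetical.

```python
import re
import subprocess
from typing import Optional


def parse_java_version(output: str) -> Optional[str]:
    """Extract the quoted version string from a `java -version` banner."""
    match = re.search(r'version "([^"]+)"', output)
    return match.group(1) if match else None


def is_java_8(version: str) -> bool:
    """Java 8 reports itself with the legacy '1.8.' prefix."""
    return version.startswith("1.8.")


def installed_java_version() -> Optional[str]:
    """Run `java -version`; return None if Java is not on the PATH."""
    try:
        result = subprocess.run(["java", "-version"],
                                capture_output=True, text=True)
    except FileNotFoundError:
        return None
    # The JVM prints its version banner to stderr, not stdout.
    return parse_java_version(result.stderr + result.stdout)


if __name__ == "__main__":
    version = installed_java_version()
    print("Java not found" if version is None else version)
```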
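Step 2:- Download Hadoop from this link:-
https://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.2.4/hadoop-3.2.4-src.tar.gz
and extract it to a folder. Note that the -src archive contains only the source
code; the bin and etc folders used in the following steps are part of the
binary archive (hadoop-3.2.4.tar.gz) in the same download directory.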
Step 3:- Set up the system environment variables:-
Open the Control Panel to edit the system environment variables.
Create a new user variable with the variable name HADOOP_HOME and, as its
value, the path of the folder where you extracted Hadoop (the folder that
contains bin, not the bin folder itself).
Similarly, create a new user variable with the variable name JAVA_HOME and, as
its value, the path of the Java installation folder (again the folder that
contains bin).
Then add the Hadoop bin directory and the Java bin directory to the Path
system variable: select Path under System variables, click on Edit, then
click on New and add the bin directory path of Hadoop and of Java.
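A quick way to confirm these settings is to inspect the environment from a script. The sketch below is a minimal illustration, assuming that HADOOP_HOME and JAVA_HOME point at the installation folders and that their bin subfolders are the entries added to Path; the function name and the example paths are hypothetical.

```python
import ntpath
import os
from typing import List, Mapping


def check_hadoop_env(env: Mapping[str, str], sep: str = ";") -> List[str]:
    """Return a list of problems with the variables set in Step 3."""
    problems: List[str] = []
    # Normalise PATH entries so the comparison ignores case and trailing slashes.
    entries = {ntpath.normcase(ntpath.normpath(p))
               for p in env.get("PATH", "").split(sep) if p}
    for var in ("HADOOP_HOME", "JAVA_HOME"):
        home = env.get(var)
        if not home:
            problems.append(f"{var} is not set")
            continue
        # Assumption: the bin folder sits directly under the home folder.
        bin_dir = ntpath.normcase(ntpath.normpath(ntpath.join(home, "bin")))
        if bin_dir not in entries:
            problems.append(f"{var}\\bin is not on PATH")
    return problems


if __name__ == "__main__":
    # On Windows, os.environ looks up variable names case-insensitively.
    for problem in check_hadoop_env(os.environ):
        print(problem)
```

An empty result means both variables are set and both bin folders are on the Path; a new command prompt must be opened before changed variables become visible.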
Step 4:- Configurations:-
Edit the following files, located in the etc\hadoop folder of the directory
where we installed Hadoop: core-site.xml, hadoop-env.cmd, hdfs-site.xml,
mapred-site.xml and yarn-site.xml.
1. Edit the file core-site.xml in the Hadoop directory. Copy this xml property in
the configuration in the file.
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
2. Edit mapred-site.xml and copy this property into the configuration.
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
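All of the site files share the same layout: a configuration element holding property elements, each with a name and a value. As a sketch of that structure, the hypothetical helper below renders such a file from a Python dictionary using only the standard library; the property names are the ones given above.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def render_site_xml(properties: Dict[str, str]) -> str:
    """Render name/value pairs as a Hadoop <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # Pretty-printing is only available from Python 3.9 onwards.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_site_xml({"fs.defaultFS": "hdfs://localhost:9000"}))
    print(render_site_xml({"mapreduce.framework.name": "yarn"}))
```

Generating the files this way avoids unbalanced tags, which are an easy mistake to make when editing the XML by hand.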
3. Create a folder named ‘data’ in the Hadoop directory, then create a folder
named ‘datanode’ and a folder named ‘namenode’ inside this data directory.
4. Edit the file hdfs-site.xml and add the properties below to the configuration.
Note: The namenode and datanode paths given as values must be the paths of the
datanode and namenode folders you just created.
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>C:\Users\hp\Downloads\hadoop-3.1.0\hadoop-3.1.0\data\namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>C:\Users\hp\Downloads\hadoop-3.1.0\hadoop-3.1.0\data\datanode</value>
  </property>
</configuration>
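Creating the data folders and writing their paths into hdfs-site.xml are two halves of the same step, so they can be kept in sync by a script. The sketch below is a minimal illustration with a hypothetical function name; it creates the namenode and datanode folders under a given Hadoop directory and returns the three property values used above.

```python
from pathlib import Path
from typing import Dict


def prepare_hdfs_dirs(hadoop_home: Path) -> Dict[str, str]:
    """Create data/namenode and data/datanode; return hdfs-site.xml values."""
    namenode = hadoop_home / "data" / "namenode"
    datanode = hadoop_home / "data" / "datanode"
    for folder in (namenode, datanode):
        # parents=True also creates the intermediate 'data' folder.
        folder.mkdir(parents=True, exist_ok=True)
    return {
        # A single-machine setup has only one DataNode, hence one replica.
        "dfs.replication": "1",
        "dfs.namenode.name.dir": str(namenode),
        "dfs.datanode.data.dir": str(datanode),
    }


if __name__ == "__main__":
    # Hypothetical location, used only for illustration.
    for name, value in prepare_hdfs_dirs(Path("hadoop-demo")).items():
        print(name, "=", value)
```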
5. Edit the file yarn-site.xml and add the properties below to the configuration.
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
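Hadoop does not report a misspelled property name; the setting is simply ignored. A small check that reads a site file back and looks for the expected names can catch such typos. The sketch below uses only the standard library; the helper names are hypothetical and the sample document is an inline string rather than a real file.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def read_properties(xml_text: str) -> Dict[str, str]:
    """Parse a Hadoop site file into a {name: value} dictionary."""
    root = ET.fromstring(xml_text)
    return {prop.findtext("name", "").strip():
            prop.findtext("value", "").strip()
            for prop in root.iter("property")}


def missing_properties(xml_text: str, required: List[str]) -> List[str]:
    """Return the required property names that the file does not define."""
    present = read_properties(xml_text)
    return [name for name in required if name not in present]


if __name__ == "__main__":
    sample = """
    <configuration>
      <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
      </property>
    </configuration>
    """
    print(read_properties(sample))
    print(missing_properties(sample, ["yarn.nodemanager.aux-services"]))
```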
Step 6:- Edit hadoop-env.cmd and replace %JAVA_HOME% with the path of
the Java folder where JDK 1.8 is installed.
Step 7:- Hadoop needs Windows-specific files that are not included in the
default Hadoop download. To add them, replace the bin folder in the
Hadoop directory with the bin folder provided in this GitHub repository:
https://github.com/s911415/apache-hadoop-3.1.0-winutils
Download it as a zip file, extract it, and copy its bin folder over the one in
the Hadoop directory.
Step 8:- Check whether Hadoop is successfully installed by running this
command in cmd:
hadoop version
If the command prints the Hadoop version without throwing an error, Hadoop is
successfully installed on the system.
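The same check can be run from a script, for example as the last step of an automated setup. This is a minimal sketch with hypothetical helper names, assuming the hadoop launcher is reachable through the Path and that the first line of its output has the form "Hadoop x.y.z".

```python
import shutil
import subprocess
from typing import Optional


def parse_hadoop_version(output: str) -> Optional[str]:
    """Extract the version from the 'Hadoop x.y.z' line, if present."""
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "Hadoop":
            return parts[1]
    return None


def installed_hadoop_version() -> Optional[str]:
    """Run `hadoop version`; return None if the launcher is not found."""
    launcher = shutil.which("hadoop")
    if launcher is None:
        return None
    result = subprocess.run([launcher, "version"],
                            capture_output=True, text=True)
    return parse_hadoop_version(result.stdout)


if __name__ == "__main__":
    version = installed_hadoop_version()
    print("Hadoop not found" if version is None else f"Hadoop {version}")
```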