Processing Big Data with Hadoop MapReduce Technology
By
Dr. Aditya Bhardwaj
aditya.bhardwaj@bennett.edu.in
Big Data Analytics and Business Intelligence (CSET 371)
Session 1. History and Introduction to Hadoop
1.1.1 Introduction to Hadoop
1.1.2 History of Hadoop
1.1.3 How Does Hadoop Work?
1.1.1 Introduction to Hadoop
• Hadoop is an open-source tool to handle, process, and
analyze Big Data.
• The Hadoop platform provides an improved programming
model that is used to create and run distributed
systems quickly and efficiently.
• It is licensed under the Apache License, so it is also called
Apache Hadoop.
Who Uses Hadoop?
1.1.2 History of Hadoop
• In 2003, Google developed a distributed file system (DFS) called the Google
File System (GFS) to provide efficient, reliable access to data using a large
cluster with a master-slave architecture, where data is stored and processed
on distributed data-node servers.
• Although GFS was a great breakthrough, the file system remained
proprietary.
• In 2005, Doug Cutting, an employee at Yahoo, developed an open-source
implementation of GFS for a search-engine project, and that
open-source project was named Apache Hadoop.
• With open source we have the freedom to run the program, the
freedom to modify the source code, and the freedom to
redistribute exact copies of it.
Keyword ‘Hadoop’ Search on Google
Inventor of Hadoop
• Doug named it Hadoop after his son's toy elephant.
How Big Data Is Used In Amazon Recommendation
Systems
https://youtu.be/S4RL6prqtGQ
While Developing Hadoop, Two Major Concerns for Doug Cutting's Team
How to store large files, from terabytes to petabytes in size,
across different machines, i.e., a storage framework was
required?
How to facilitate the processing of large amounts of data
in structured and unstructured formats, i.e., a parallel-
processing framework was required?
A smaller concern: how to provide fault tolerance in the
Hadoop architecture.
Functional Architecture of Hadoop
• The core components of Hadoop are HDFS (storage) and MapReduce (processing).
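On a running single-node cluster, these components show up as Java daemons. A minimal sketch, assuming a classic Hadoop 1.x setup in which MapReduce runs as JobTracker/TaskTracker (process IDs omitted from the output and annotations added):

  $ jps                  # lists the running Hadoop Java daemons
    NameNode             # HDFS master
    SecondaryNameNode    # checkpoints the NameNode metadata
    DataNode             # HDFS slave, stores the actual blocks
    JobTracker           # MapReduce master (Hadoop 1.x)
    TaskTracker          # MapReduce slave (Hadoop 1.x)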
How Does Hadoop Work?
The working of Hadoop can be described in five steps:
Step-1: Input data is broken into small chunks called blocks,
64 MB or 128 MB in size, and these blocks are stored in a
distributed fashion on different nodes in the cluster using the
HDFS mechanism, thus enabling highly efficient parallel
processing. Each block is replicated according to the
replication factor (3 by default), which is configured in
hdfs-site.xml (example commands for checking these settings
follow Step-2).
Step-2: Once all the blocks of the data are stored on
DataNodes, the user can process the data.
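A minimal sketch of inspecting the Step-1 settings from the command line; the exact configuration key names vary slightly across Hadoop versions, and /user/demo/input.txt is a hypothetical file path:

  $ hdfs getconf -confKey dfs.blocksize      # block size from hdfs-site.xml, e.g. 134217728 (128 MB)
  $ hdfs getconf -confKey dfs.replication    # replication factor, 3 by default
  $ hdfs fsck /user/demo/input.txt -files -blocks -locations
                                             # lists each block of the file and the DataNodes holding its replicas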
How Does Hadoop Work? (contd.)
Step-3: The JobTracker receives requests for MapReduce
execution from the client.
Step-4: A TaskTracker runs on each DataNode and actually
executes the map and reduce tasks on the data stored
on that node.
Step-5: Once all the nodes have processed their data, the output is
written back to HDFS (a job-submission sketch follows Step-5).
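As a minimal sketch of Steps 3-5, a job can be submitted with the example JAR that ships with Hadoop; the JAR path and the /input and /output directories are assumptions that depend on the installation and version:

  $ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
        wordcount /input /output        # the MapReduce master schedules map and reduce tasks on the workers
  $ hdfs dfs -ls /output                # Step-5: results are written back to HDFS
  $ hdfs dfs -cat /output/part-r-00000  # view the reducer output (the exact part-file name may vary)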
Session 2. Understanding HDFS
Syllabus covered in this session:
2.1.1 Introduction to HDFS
2.1.2 HDFS Architecture (Using Read and Write Operations)
2.1.3 Hadoop Distribution and Basic Commands
2.1.4 HDFS Command Line and Web Interface
2.1.1 High-Level Architecture of a Hadoop Multi-Node Cluster
Why a Distributed File System (DFS) Performs Well
To read 1 TB of data:
                        1 machine      10 machines
I/O channels            4              4
Speed per channel       100 MB/s       100 MB/s
Time to read 1 TB       45 minutes     4.5 minutes
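The timings follow from simple arithmetic: 1 TB is roughly 1,000,000 MB, and one machine reads through 4 channels x 100 MB/s = 400 MB/s, so it needs about 1,000,000 / 400 ≈ 2,500 seconds ≈ 42 minutes (rounded to 45 in the table). Ten machines reading their shares in parallel cut this to about one tenth, roughly 4.5 minutes.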
2.1.1 Introduction to HDFS
A file in HDFS is split into large blocks of 64 MB or
128 MB by default, and each block of the file is
independently replicated on multiple DataNodes.
HDFS provides fault tolerance by replicating the data on
three nodes: two on the same rack and one on a different
rack. The NameNode implements this functionality and
actively tracks the replicas of each block. The default
replication factor is 3.
HDFS has a master-slave architecture, which comprises
a NameNode and a number of DataNodes.
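The state of this master-slave layout can be checked on a live cluster; a minimal sketch (the command may require HDFS superuser privileges):

  $ hdfs dfsadmin -report    # prints overall cluster capacity and lists every live DataNode with its usage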
HDFS Components:
There are two major components of Hadoop HDFS: the NameNode and
the DataNode.
i. NameNode
It is also known as the master node.
The NameNode does not store the actual data or dataset. It stores
metadata, i.e., the number of blocks, their locations, the rack and
DataNode on which each block is stored, and other details. The
namespace it manages consists of files and directories.
Tasks of the HDFS NameNode
Manages the file-system namespace.
Regulates clients' access to files.
Executes file-system operations such as naming,
opening, and closing files and directories.
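The metadata that the NameNode keeps for a file can be queried without reading the file's contents; a minimal sketch, where /user/demo/input.txt is a hypothetical path and the format specifiers assume a reasonably recent Hadoop release:

  $ hdfs dfs -stat "%n  size=%b bytes  replication=%r  blocksize=%o" /user/demo/input.txt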
2.1.1 HDFS (contd..)
ii. DataNode
It is also known as the slave node.
• The HDFS DataNode is responsible for storing the actual data in HDFS.
• DataNodes perform read and write operations as requested by
the clients.
2.1.2 HDFS Architecture (Read and Write Operations)
This figure demonstrates how a file stored on DataNodes is read in the HDFS architecture.
2.1.2 HDFS Architecture (contd..)
• This figure demonstrates how a file is written to DataNodes in the HDFS architecture.
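A minimal sketch of the two paths from the client's side (the file names below are hypothetical): for a write, the client asks the NameNode for target DataNodes and streams the blocks to them; for a read, the client asks the NameNode for block locations and pulls the data directly from the DataNodes.

  $ hdfs dfs -put localdata.txt /user/demo/data.txt   # write path: NameNode picks DataNodes, client streams blocks to them
  $ hdfs dfs -get /user/demo/data.txt copy.txt        # read path: NameNode returns block locations, client reads from the DataNodes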
2.1.6 HDFS Basic Commands
Command                            Description
hdfs version                       Prints the HDFS/Hadoop version.
hdfs dfs -mkdir <path>             Creates directories.
hdfs dfs -ls <path>                Displays a list of the contents of the
                                   directory specified by <path>, showing
                                   the name, permissions, owner, size,
                                   and modification date of each entry.
hdfs dfs -put <localSrc> <dest>    Copies the file or directory from the
                                   local file system to the destination
                                   within HDFS.
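A short usage sketch of the commands above; the directory and file names are hypothetical:

  $ hdfs version                              # print the Hadoop/HDFS version
  $ hdfs dfs -mkdir -p /user/demo             # create a directory in HDFS (-p also creates missing parents)
  $ hdfs dfs -put localdata.txt /user/demo/   # copy a local file into HDFS
  $ hdfs dfs -ls /user/demo                   # list the directory contents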
2.1.6 HDFS Basic Commands
Command                                     Description
hdfs dfs -copyFromLocal <localSrc> <dest>   Similar to the put command, but the
                                            source is restricted to a local file
                                            reference.
hdfs dfs -cat <file>                        Displays the contents of the file on
                                            the console (stdout).
hdfs dfs -mv <source> <destination>         Moves files from the source to the
                                            destination within HDFS.
hdfs dfs -chmod <mode> <path>               Changes the file permissions.
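A short usage sketch of these commands; the file names and the permission mode are hypothetical:

  $ hdfs dfs -copyFromLocal notes.txt /user/demo/notes.txt   # like put, but the source must be local
  $ hdfs dfs -cat /user/demo/notes.txt                       # print the file to stdout
  $ hdfs dfs -mv /user/demo/notes.txt /user/demo/old.txt     # move/rename within HDFS
  $ hdfs dfs -chmod 644 /user/demo/old.txt                   # change the file's permissions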
2.1.7 HDFS Command Line and Web Interface
Step-1. Install and set up a Hadoop cluster.
Step-2. The fs.default.name property in the core-site.xml
file is set to hdfs://localhost/, which is used to set the default
Hadoop distributed file system.
User applications access the HDFS file system through
the HDFS client. It is a library that exposes the interface
of the HDFS file system and hides almost all of the
complexity of the underlying HDFS implementation.
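A minimal sketch of checking the configured default file system and reaching the web interface; the port is an assumption that depends on the Hadoop version (the classic NameNode web UI listens on 50070, Hadoop 3.x on 9870), and fs.defaultFS is the newer name for the deprecated fs.default.name key:

  $ hdfs getconf -confKey fs.defaultFS   # prints e.g. hdfs://localhost/
  $ hdfs dfs -ls hdfs://localhost/       # same listing as "hdfs dfs -ls /" once the default FS is set
  # NameNode web UI (browse the file system, inspect DataNodes):
  #   http://localhost:50070  (Hadoop 1.x/2.x)   or   http://localhost:9870  (Hadoop 3.x)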
Thanks Note