Lecture 4: Introduction to Hadoop

This lecture provides an overview of Hadoop, an open-source tool for processing and analyzing Big Data, including its history, architecture, and functionality. It explains how Hadoop works as a distributed system using HDFS and MapReduce, detailing the roles of the NameNode and DataNodes. It also covers basic commands for interacting with HDFS and highlights the importance of fault tolerance in data storage.

Processing Big Data with Hadoop MapReduce Technology

By
Dr. Aditya Bhardwaj

aditya.bhardwaj@bennett.edu.in

Big Data Analytics and Business Intelligence (CSET 371)


Session 1. History and Introduction to Hadoop

1.1.1 Introduction to Hadoop
1.1.2 History of Hadoop
1.1.3 How Does Hadoop Work?
1.1.1 Introduction to Hadoop

• Hadoop is an open-source tool to handle, process, and analyze Big Data.

• The Hadoop platform provides an improved programming model, which is used to create and run distributed systems quickly and efficiently.

• It is licensed under the Apache License, so it is also called Apache Hadoop.

Who Uses Hadoop?

1.1.2 History of Hadoop
• In 2003, Google developed a distributed file system called the Google File System (GFS) to provide efficient, reliable access to data using a large cluster with a master-slave architecture, where the data is stored and processed on distributed data-node servers.

• Although GFS was a great discovery, this file system was commercial, not open source.

• In 2005, Doug Cutting, an employee at Yahoo, developed an open-source implementation of GFS for a search-engine project, and that open-source project was named Apache Hadoop.

• With open source we have the freedom to run the program, the freedom to modify the source code, and the freedom to redistribute exact copies of it.
Keyword ‘Hadoop’ Search on Google

Inventor of Hadoop
• Doug named it Hadoop after his son's toy elephant.

How Big Data Is Used in Amazon Recommendation Systems

https://youtu.be/S4RL6prqtGQ

While Developing Hadoop, Two Major Concerns for Doug Cutting's Team

 How to store large files, from terabytes to petabytes in size, across different terminals, i.e., a storage framework was required.

 How to facilitate the processing of large amounts of data in structured and unstructured formats, i.e., a parallel-processing framework was required.

 A smaller concern: how to provide fault tolerance in the Hadoop architecture.

Functional Architecture of Hadoop
• The core components of Hadoop are HDFS (the storage framework) and MapReduce (the parallel-processing framework).
How Does Hadoop Work?
The working of Hadoop can be described in five steps:

Step-1: Input data is broken into small chunks called blocks, 64 MB or 128 MB in size, and these blocks are stored in a distributed fashion on different nodes in the cluster using the HDFS mechanism, thus enabling highly efficient parallel processing. Each block is replicated according to the replication factor (3 by default), which is configured in hdfs-site.xml, as sketched below.
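A minimal hdfs-site.xml sketch of the settings named in Step-1 (dfs.replication and dfs.blocksize are the standard Hadoop 2+ property names; the values shown are the common defaults):

<configuration>
  <!-- Number of copies HDFS keeps of each block (default: 3) -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- Block size in bytes: 134217728 bytes = 128 MB -->
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
</configuration>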

Step-2: Once all the blocks of the data are stored on DataNodes, the user can process the data.

How Does Hadoop Work? (contd.)
Step-3: The JobTracker receives requests for MapReduce execution from the client.

Step-4: TaskTrackers run on the DataNodes and actually execute the MapReduce algorithm on the data stored in those nodes.

Step-5: Once all the nodes have processed the data, the output is written back to HDFS.

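As a concrete illustration of Steps 3-5, the stock WordCount example that ships with Hadoop can be submitted from the shell. This is a sketch: the paths are illustrative, and the examples jar location varies by distribution and version.

# Put some local text into HDFS, run the bundled WordCount job, read the result
hdfs dfs -mkdir -p /user/demo/input
hdfs dfs -put books.txt /user/demo/input/
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    wordcount /user/demo/input /user/demo/output
hdfs dfs -cat /user/demo/output/part-r-00000   # reducer output file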
Session 2. Understanding HDFS
Syllabus to be covered in this topic:
 2.1.1 Introduction to HDFS
 2.1.2 HDFS Architecture (Using Read and Write Operations)
 2.1.3 Hadoop Distribution and Basic Commands
 2.1.4 HDFS Command Line and Web Interface
2.1.1 High-Level Architecture of a Hadoop Multinode Cluster
Why a Distributed File System (DFS) Performs Well

Time to read 1 TB of data:

                   1 machine        10 machines
I/O channels       4                4 per machine
Channel speed      100 MB/s         100 MB/s
Read time          ~45 minutes      ~4.5 minutes
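The arithmetic behind these figures, treating 1 TB as roughly 1,000,000 MB:

One machine:  4 channels × 100 MB/s = 400 MB/s
              1,000,000 MB ÷ 400 MB/s = 2,500 s ≈ 42 minutes (~45 minutes)
Ten machines: each reads one tenth of the data in parallel,
              2,500 s ÷ 10 = 250 s ≈ 4.2 minutes (~4.5 minutes)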


2.1.1 Introduction to HDFS
 A file in HDFS is split into large blocks, 64 to 128 MB by default, and each block of the file is independently replicated on multiple DataNodes.

 HDFS provides fault tolerance by replicating the data on three nodes: two on the same rack and one on a different rack. The NameNode implements this functionality and actively monitors the information regarding the replicas of each block. The default replication factor is 3 (see the command sketch below).

 HDFS has a master-slave architecture, which comprises a NameNode and a number of DataNodes.

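To check or change a file's replication factor from the shell (a sketch; the path is illustrative):

# Print the current replication factor of a file
hdfs dfs -stat "replication: %r" /user/demo/input/books.txt
# Raise it to 4 and wait (-w) until re-replication completes
hdfs dfs -setrep -w 4 /user/demo/input/books.txt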
HDFS Components:
There are two major components of Hadoop HDFS: the NameNode and the DataNodes.

i. NameNode
It is also known as the Master node.
The NameNode does not store the actual data or dataset. It stores metadata, i.e., the number of blocks, their locations, on which rack and on which DataNode the data is stored, and other details; the fsck sketch below shows how to inspect this metadata. The namespace consists of files and directories.

Tasks of the HDFS NameNode:
• Manages the file system namespace.
• Regulates clients' access to files.
• Executes file system operations such as naming, closing, and opening files and directories.
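One way to see the block metadata that the NameNode tracks is the fsck tool (a sketch; it assumes a running cluster, and the path is illustrative):

# List a file's blocks, their DataNode locations, and their rack placement
hdfs fsck /user/demo/input/books.txt -files -blocks -locations -racks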
2.1.1 HDFS (contd.)
ii. DataNode: It is also known as the Slave node.
• The HDFS DataNode is responsible for storing the actual data in HDFS.
• DataNodes perform read and write operations as per the requests of the clients.

2.1.2 HDFS Architecture (Read-Write Operations)
• This figure demonstrates how a file stored on the DataNodes is read in the HDFS architecture.

2.1.2 HDFS Architecture (contd.)
• This figure demonstrates how a file is written to the DataNodes in the HDFS architecture.
2.1.6 HDFS Basic Commands

hdfs version
    Prints the installed HDFS version.

hdfs dfs -mkdir <path>
    Creates directories.

hdfs dfs -ls <path>
    Displays a list of the contents of the directory specified by <path>,
    showing the name, permissions, owner, size, and modification date of
    each entry.

hdfs dfs -put <localSrc> <dest>
    Copies a file or directory from the local file system to the
    destination within the DFS.

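A quick sketch of these commands in use (the paths are illustrative):

hdfs dfs -mkdir -p /user/demo          # create a directory (-p makes parent dirs)
hdfs dfs -put report.csv /user/demo/   # copy a local file into HDFS
hdfs dfs -ls /user/demo                # list it with permissions, owner, size, date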
2.1.6 HDFS Basic Commands (contd.)

hdfs dfs -copyFromLocal <localSrc> <dest>
    This Hadoop shell command is similar to the put command, but the
    source is restricted to a local file reference.

hdfs dfs -cat <filename>
    Displays the contents of the file on the console (stdout).

hdfs dfs -mv <source> <destination>
    Moves files from source to destination.

hdfs dfs -chmod <mode> <path>
    Changes the file permissions.

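And a sketch of the remaining commands, again with illustrative paths:

hdfs dfs -copyFromLocal notes.txt /user/demo/          # like put, source must be local
hdfs dfs -cat /user/demo/notes.txt                     # print the file to stdout
hdfs dfs -mv /user/demo/notes.txt /user/demo/old.txt   # move/rename within HDFS
hdfs dfs -chmod 640 /user/demo/old.txt                 # change its permissions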
2.1.7 HDFS Command Line and Web Interface
Step-1: Install the Hadoop cluster setup.
Step-2: The fs.default.name property in the core-site.xml file is set to hdfs://localhost/, which is used to set the default Hadoop distributed file system, as sketched below.
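A minimal core-site.xml sketch matching the setting described in Step-2 (fs.default.name is the legacy property name; newer Hadoop releases use fs.defaultFS for the same purpose):

<configuration>
  <!-- Default file system URI; clients resolve paths against this -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>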

 User applications access the HDFS file system with the help of the HDFS client, a library that exposes the interface of the HDFS file system and hides almost all of the complexity of the HDFS implementation.

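The web interface named in this section's title is served by the NameNode itself. On a default single-node setup it can be opened in a browser or fetched with curl; the port below assumes Hadoop 3.x (older 2.x releases used 50070):

# NameNode web UI on a default Hadoop 3.x single-node cluster
curl http://localhost:9870/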
Thank You

