*HDFS ARCHITECTURE
Hadoop Distributed File System
*HDFS - FEATURES
*HDFS stores very large files across a
cluster of commodity hardware.
*HDFS stores data reliably even when
hardware fails. It provides high
throughput by serving data access
in parallel.
*HDFS ARCHITECTURE EXPLAINED
* Hadoop Distributed File System follows the
master-slave architecture.
* Each cluster comprises a single master node and
multiple slave nodes.
* Internally, each file is divided into one or more blocks,
and each block is replicated across different slave
machines according to the replication factor.
* The master node is the NameNode, and the DataNodes
are the slave nodes.
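The block split and placement described above can be sketched roughly as follows. This is a minimal illustration, not HDFS code: the 128 MB block size, the function names, and the round-robin placement are all assumptions made for the example.

```python
BLOCK_SIZE_MB = 128  # assumed block size for this sketch


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file would be cut into."""
    if file_size_mb <= 0:
        return []
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be smaller
    return sizes


def place_blocks(block_sizes, datanodes, replication_factor):
    """Map each block index to `replication_factor` distinct DataNodes.

    Round-robin placement is a simplification chosen for illustration.
    """
    if replication_factor > len(datanodes):
        raise ValueError("not enough DataNodes for the replication factor")
    placement = {}
    for index in range(len(block_sizes)):
        placement[index] = [
            datanodes[(index + r) % len(datanodes)]
            for r in range(replication_factor)
        ]
    return placement


blocks = split_into_blocks(300)
layout = place_blocks(blocks, ["dn1", "dn2", "dn3", "dn4"], 3)
```

Here a 300 MB file becomes two full blocks plus one smaller block, and every block lands on three different (hypothetical) DataNodes.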
*MASTER NODE / NAME NODE
*NameNode is the centerpiece of the
Hadoop Distributed File System.
*It maintains and manages the file
system namespace and controls
client access to files.
*Fsimage: Fsimage stands for File System
image. It contains the complete
namespace of the Hadoop file system
since the NameNode creation.
*Edit log: It contains all the changes
made to the file system namespace
since the most recent Fsimage.
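One way to picture how the Fsimage and the edit log fit together: the current namespace is the Fsimage snapshot with every logged edit replayed on top. The sketch below assumes a toy record format (tuples such as `("create", path)`); the real on-disk formats are different, and all names here are hypothetical.

```python
import copy


def apply_edit(namespace, edit):
    """Apply one logged operation to an in-memory namespace dict."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = {"blocks": []}
    elif op == "add_block":
        namespace[path]["blocks"].append(edit[2])
    elif op == "rename":
        namespace[edit[2]] = namespace.pop(path)
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {op}")


def load_namespace(fsimage, edit_log):
    """Rebuild the current namespace: snapshot first, then replay the edits."""
    namespace = copy.deepcopy(fsimage)  # leave the snapshot itself untouched
    for edit in edit_log:
        apply_edit(namespace, edit)
    return namespace


fsimage = {"/a.txt": {"blocks": ["blk_1"]}}
edit_log = [
    ("create", "/b.txt"),
    ("add_block", "/b.txt", "blk_2"),
    ("delete", "/a.txt"),
]
current = load_namespace(fsimage, edit_log)
```

Writing the replayed result back out as a new snapshot and starting a fresh log is, in essence, what a checkpoint does.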
*HDFS DATA NODE
*DataNodes are the slave nodes in Hadoop
HDFS.
*DataNodes are inexpensive commodity
hardware.
*They store blocks of a file.
*HDFS DATA NODE RESPONSIBILITIES
* DataNodes are responsible for serving client
read/write requests.
* Based on instructions from the NameNode,
DataNodes perform block creation, replication, and
deletion.
* DataNodes send heartbeats to the NameNode to report
that they are alive and healthy.
* DataNodes also send block reports to the NameNode to
report the list of blocks they contain.
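The heartbeat and block report bookkeeping above might look something like this on the NameNode side. It is a simplified sketch: the class name, the method names, and the 30-second timeout are illustrative assumptions, not HDFS defaults or APIs.

```python
class NameNodeMonitor:
    """Toy tracker for DataNode heartbeats and block reports."""

    def __init__(self, timeout_s=30.0):  # timeout is an illustrative value
        self.timeout_s = timeout_s
        self.last_heartbeat = {}  # DataNode id -> time of last heartbeat
        self.block_map = {}       # DataNode id -> set of block ids it reported

    def heartbeat(self, datanode, now):
        self.last_heartbeat[datanode] = now

    def block_report(self, datanode, block_ids):
        self.block_map[datanode] = set(block_ids)

    def live_datanodes(self, now):
        """DataNodes whose last heartbeat is within the timeout."""
        return sorted(
            dn for dn, seen in self.last_heartbeat.items()
            if now - seen <= self.timeout_s
        )

    def replica_count(self, block_id, now):
        """Number of live DataNodes that reported holding the block."""
        live = set(self.live_datanodes(now))
        return sum(
            1 for dn, blocks in self.block_map.items()
            if dn in live and block_id in blocks
        )


monitor = NameNodeMonitor(timeout_s=30.0)
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
monitor.block_report("dn1", ["blk_1", "blk_2"])
monitor.block_report("dn2", ["blk_1"])
replicas_early = monitor.replica_count("blk_1", now=10)
monitor.heartbeat("dn1", now=40)  # dn2 stays silent
replicas_late = monitor.replica_count("blk_1", now=50)
```

When the live replica count for a block drops below the replication factor, the NameNode would instruct another DataNode to create a new copy.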
*SECONDARY NAMENODE
*HDFS BACKUP NODES
*A Backup node provides the same
checkpointing functionality as the
Checkpoint node.
*In Hadoop, the Backup node keeps an
in-memory, up-to-date copy of the file
system namespace. It is always
synchronized with the active NameNode
state.
*Replication Management
* HDFS stores replicas of a block on multiple
DataNodes based on the replication factor.
* If the replication factor is 3, then three copies
of a block get stored on different DataNodes.
* So if one DataNode containing the data block
fails, the block is still accessible from
another DataNode that holds a replica of the
block.
*Replication Management
*If we store a file of 128 MB and the
replication factor is 3, then 3 × 128 =
384 MB of disk space is occupied by the file,
because three copies of the block get stored.
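The same arithmetic, generalized as a small sketch. The function names are hypothetical; the only rule used is the one on the slide, namely that raw disk usage equals file size times replication factor.

```python
def raw_storage_mb(file_size_mb, replication_factor):
    """Raw disk space (MB) a file occupies across the cluster."""
    if file_size_mb < 0 or replication_factor < 1:
        raise ValueError("size must be >= 0 and replication factor >= 1")
    return file_size_mb * replication_factor


def usable_capacity_mb(raw_capacity_mb, replication_factor):
    """Amount of user data (MB) that fits in a given raw capacity."""
    return raw_capacity_mb // replication_factor


print(raw_storage_mb(128, 3))  # 384
```

Read the other way around, a cluster with 3000 MB of raw disk holds about 1000 MB of user data at replication factor 3.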
*HDFS Rack awareness algorithm
*The first replica is stored on a DataNode
in the local rack.
*The second replica is stored on another
DataNode in the same rack.
*The third replica is stored on a DataNode
in a different rack.
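The three placement rules above can be sketched as a small function. This follows the order given on the slide; the default policy in a particular Hadoop release may differ. The topology dict, the function name, and the deterministic "take the first available node" choice are assumptions made for illustration.

```python
def place_replicas(writer_rack, topology):
    """Pick three DataNodes following the slide's rack awareness rules.

    topology: dict mapping rack name -> list of DataNode names (hypothetical).
    """
    local_nodes = topology[writer_rack]
    if len(local_nodes) < 2:
        raise ValueError("the local rack needs at least two DataNodes")
    remote_racks = [
        rack for rack in sorted(topology)
        if rack != writer_rack and topology[rack]
    ]
    if not remote_racks:
        raise ValueError("at least one other rack with a DataNode is required")

    first = local_nodes[0]                # replica 1: local rack
    second = local_nodes[1]               # replica 2: another node, same rack
    third = topology[remote_racks[0]][0]  # replica 3: a different rack
    return [first, second, third]


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
replicas = place_replicas("rack1", topology)
```

Keeping two replicas on one rack limits cross-rack write traffic, while the third replica on another rack protects the block if a whole rack fails.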
*HDFS READ/WRITE OPERATION
*Study link from the web
*https://data-flair.training/blogs/hadoop-hdfs-architecture/