BDA MODULE – 2
Introduction to Hadoop: Introducing hadoop, Why hadoop, Why not RDBMS, RDBMS Vs Hadoop, History
of Hadoop, Hadoop overview, Use case of Hadoop, HDFS (Hadoop Distributed File System),Processing data
with Hadoop, Managing resources and applications with Hadoop YARN(Yet Another Resource Negotiator).
Introduction to Map Reduce Programming: Introduction, Mapper, Reducer, Combiner, Partitioner,
Searching, Sorting, Compression.
TB1: Ch 5: 5.1–5.8, 5.10–5.12, Ch 8: 8.1–8.8
HADOOP
Hadoop is an open-source framework written in Java. It was originally developed to support Nutch, an open-source text search engine. Hadoop is based on Google's MapReduce and Google File System (GFS) technologies and is now widely used by companies such as Yahoo, Facebook, LinkedIn, and Twitter.
With a neat diagram explain HDFS architecture.
HDFS (Hadoop Distributed File System) is the storage component of Hadoop. It is designed to store and
process very large files efficiently across multiple machines.
1. Storage Component
HDFS acts as the main storage system of Hadoop, where all big data files are stored.
2. Distributed File System
Data is not stored on one machine. Instead, files are distributed across multiple nodes (DataNodes)
in the cluster, as shown in the figure.
3. Based on Google File System
The design of HDFS is inspired by Google File System (GFS), which was created to handle large-scale
data reliably.
4. High Throughput
HDFS uses large block sizes and moves computation closer to the data, which helps in processing
data faster.
5. Replication
Each file block is replicated on multiple DataNodes. If a DataNode fails, the system automatically re-
replicates the missing blocks, ensuring fault tolerance.
6. Handles Large Files
HDFS is optimized to read and write very large files (in GBs or TBs) efficiently.
7. Runs on Native File Systems
HDFS runs on native file systems like ext3, ext4, etc., present on the operating system.
8. Client–Server Interaction
o The client contacts the NameNode to get metadata (file name, block locations).
o Actual data read/write happens directly with DataNodes.
o DataNodes communicate with each other and perform the replication operations.
9. Block Example
If a file of 192 MB is stored with a 64 MB block size, it is divided into 3 blocks (A, B, C).
These blocks are replicated and stored on different DataNodes, as shown in the figure.
• NameNode: Stores metadata (file name, block info, DataNode locations).
• DataNodes: Store actual data blocks (A, B, C).
• Client: Requests metadata from NameNode and reads/writes data from DataNodes.
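The block arithmetic from the example above can be sketched as follows (a minimal illustration; the helper function and the default replication factor of 3 are assumptions for the sketch, not Hadoop API):

```python
import math

def split_into_blocks(file_size_mb, block_size_mb=64, replication=3):
    """Compute how many HDFS blocks a file occupies and the total
    number of replicated block copies stored across the cluster."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    total_copies = num_blocks * replication
    return num_blocks, total_copies

# The 192 MB file from the example with a 64 MB block size:
blocks, copies = split_into_blocks(192)
print(blocks)  # 3 blocks (A, B, C)
print(copies)  # 9 block copies with a replication factor of 3
```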
5.10.1 HDFS Daemons
HDFS daemons are background programs that manage and run the HDFS file system. HDFS works with
three main daemons: NameNode, DataNode, and Secondary NameNode.
1. NameNode
• HDFS splits large files into blocks and the NameNode uses rack IDs to identify the racks (groups of
DataNodes) where blocks are stored.
• The NameNode tracks block locations on DataNodes and manages file operations such as read, write,
create, and delete.
• The main function of the NameNode is to manage the file system namespace (all files and directories
in the cluster).
• The namespace information, including block location and file properties, is stored in a file called
FsImage.
• Every change to file system metadata is recorded in the EditLog.
• On startup, the NameNode loads FsImage and EditLog into memory and applies all logged changes. Only one NameNode exists per cluster.
2. DataNode
• There are multiple DataNodes in a cluster, and during pipeline read and write operations, DataNodes
communicate with each other.
• Each DataNode continuously sends a heartbeat message to the NameNode to ensure connectivity.
• If the NameNode stops receiving heartbeats from a DataNode, it re-replicates that DataNode's blocks on other nodes so that the cluster continues running normally.
3. Secondary NameNode
• The Secondary NameNode takes snapshots of HDFS metadata at regular intervals as specified in the
Hadoop configuration.
• Since its memory requirements are the same as the NameNode, it is better to run them on different
machines.
• In case of NameNode failure, the Secondary NameNode can be manually configured to bring up the
cluster, but it does not record real-time changes to HDFS metadata.
Anatomy of HDFS File Read
1. The client opens a file using open() on the DistributedFileSystem.
2. DistributedFileSystem asks the NameNode for block locations, and NameNode provides the addresses
of DataNodes storing the blocks.
3. The client reads the file by connecting to the closest DataNode for the first block using
FSDataInputStream.
4. Data is streamed from the DataNode by repeated read() calls.
5. After reaching the end of a block, the client connects to the next DataNode for subsequent blocks.
6. Once reading is complete, the client calls close() on the FSDataInputStream.
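The read steps above can be sketched as a small simulation (all names here are illustrative, not the real Hadoop API; the NameNode holds only metadata, and block data is fetched directly from DataNodes, one block at a time):

```python
# "NameNode" metadata: file -> ordered list of (block_id, DataNodes holding it)
namenode = {
    "/sample/big.txt": [("A", ["dn1", "dn2"]),
                        ("B", ["dn2", "dn3"]),
                        ("C", ["dn1", "dn3"])],
}
# Each DataNode's locally stored block contents
datanodes = {
    "dn1": {"A": "hello ", "C": "!"},
    "dn2": {"A": "hello ", "B": "world"},
    "dn3": {"B": "world", "C": "!"},
}

def read_file(path):
    data = []
    for block_id, holders in namenode[path]:       # step 2: get block locations
        closest = holders[0]                       # step 3: pick closest DataNode
        data.append(datanodes[closest][block_id])  # step 4: stream block data
    return "".join(data)                           # step 5: next block, in order

print(read_file("/sample/big.txt"))  # hello world!
```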
Anatomy of HDFS File Write
1. The client creates a file using create() on DistributedFileSystem, which requests the NameNode to
create a file without blocks.
2. FSDataOutputStream splits the data into packets to form a data queue for the DataStreamer.
3. DataStreamer requests the NameNode to allocate blocks and selects DataNodes to form a pipeline
4. Packets are sent through the pipeline, and each DataNode writes the blocks to its local storage: the first DataNode stores and forwards to the second, and the second stores and forwards to the third.
5. FSDataOutputStream manages an "Ack queue" to track acknowledgments; packets are removed only
when all DataNodes acknowledge.
6. On finishing writing, the client calls close(), which flushes remaining packets and informs the
NameNode that the file creation is complete.
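The write pipeline and the "Ack queue" can be sketched as a simulation (illustrative only; real Hadoop forwards packets asynchronously, while this sketch processes them one at a time):

```python
from collections import deque

def write_through_pipeline(packets, pipeline):
    """Push packets through a chain of DataNodes; a packet leaves the
    ack queue only after every node in the pipeline has stored it."""
    stores = {dn: [] for dn in pipeline}  # each DataNode's local storage
    ack_queue = deque()
    for packet in packets:                # step 2: packets form a data queue
        ack_queue.append(packet)
        for dn in pipeline:               # step 4: store and forward down the chain
            stores[dn].append(packet)
        ack_queue.popleft()               # step 5: all DataNodes acknowledged
    return stores, ack_queue

stores, acks = write_through_pipeline(["p1", "p2", "p3"], ["dn1", "dn2", "dn3"])
print(stores["dn3"])  # ['p1', 'p2', 'p3']
print(len(acks))      # 0 -> every packet was acknowledged
```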
Give HDFS Commands to perform the following operations
i) To get the list of directories and files at the root of HDFS ⭐
➢ hadoop fs -ls /
ii) To get the list of complete directories and files of HDFS
➢ hadoop fs -ls -R /
iii) To create a directory (say, sample) in HDFS⭐
➢ hadoop fs -mkdir /sample
iv) To copy a file from local file system to HDFS ⭐
➢ hadoop fs -put /root/sample/test.txt /sample/test.txt
v) To copy a file from HDFS to local file system
➢ hadoop fs -get /sample/test.txt /root/sample/test.txt
vi) To copy a file from local file system to HDFS via copyFromLocal command
➢ hadoop fs -copyFromLocal /root/sample/test.txt /sample/test.txt
vii) To copy a file from Hadoop file system to local file system via copyToLocal command
➢ hadoop fs -copyToLocal /sample/test.txt /root/sample/test.txt
viii) To display the contents of an HDFS file on console⭐
➢ hadoop fs -cat /sample/test.txt
ix) To copy a file from one directory to another on HDFS
➢ hadoop fs -cp /sample/test.txt /sample1
x) To remove a directory from HDFS
➢ hadoop fs -rm -r /sample1
Implement a word count program in Hadoop.
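A real Hadoop word count is written in Java against the MapReduce API (Mapper and Reducer classes submitted as a job). As a minimal, illustrative sketch of the same logic, here is a Python simulation of the map, shuffle-and-sort, and reduce phases (not actual Hadoop code):

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word, like Hadoop's map phase.
    for word in line.split():
        yield word.lower(), 1

def shuffle_and_sort(pairs):
    # Group values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reducer(key, values):
    # Sum the counts for one word.
    return key, sum(values)

lines = ["Hello Hadoop", "Hello MapReduce"]
pairs = [kv for line in lines for kv in mapper(line)]
counts = dict(reducer(k, vs) for k, vs in shuffle_and_sort(pairs))
print(counts)  # {'hadoop': 1, 'hello': 2, 'mapreduce': 1}
```

In Hadoop, the same mapper and reducer logic would run distributed across TaskTrackers, with the framework handling the shuffle-and-sort step.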
What is MapReduce? Explain its component and workflow
MapReduce is a software framework that enables parallel processing of massive amounts of data across a Hadoop cluster. It works on the principle of splitting the input data into independent chunks, processing them in parallel, and then combining the results.
Components of MapReduce
1. JobTracker (Master Daemon)
o Responsible for scheduling tasks on TaskTrackers.
o Monitors running tasks and re-executes failed tasks.
o Creates the execution plan for the submitted job.
o There is only one JobTracker per cluster.
2. TaskTracker (Slave Daemon)
o Executes tasks assigned by the JobTracker.
o Uses multiple JVMs to run map and reduce tasks in parallel.
o Continuously sends heartbeat messages to JobTracker.
o If heartbeat is missed, JobTracker assumes failure and reschedules the task.
MapReduce Workflow
1. The input data is split into small chunks.
2. The JobTracker creates master and worker processes and assigns map tasks to the TaskTracker to
process the chunks in parallel.
3. Each mapper processes its chunk and produces intermediate key–value pairs.
4. The partitioner decides which reducer gets the data.
5. The map output is shuffled and sorted by key and sent to reducers.
6. Reducers process the data and write the final output to the file system.
7. After all reducers finish execution, the final output is given to the user.
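Step 4 of the workflow can be sketched as follows. Hadoop's default partitioner hashes each key modulo the number of reducers; this sketch uses a hand-rolled deterministic hash (Python's built-in `hash` of strings is randomized per process), so the function names and hash are illustrative:

```python
def stable_hash(key):
    # Simple deterministic string hash for the sketch.
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % (2 ** 32)
    return h

def partition(key, num_reducers):
    # Same idea as Hadoop's default HashPartitioner: hash(key) mod R.
    # Every occurrence of a key maps to the same reducer, so all values
    # for that key are reduced together.
    return stable_hash(key) % num_reducers

for word in ["hello", "hadoop", "mapreduce"]:
    print(word, "-> reducer", partition(word, 2))
```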
Differentiate between Combiner, Reducer, and Mapper
Aspect    | Mapper                                    | Reducer                                  | Combiner
Purpose   | Converts input data into key–value pairs  | Combines values with the same key globally | Reduces data size before the reducer
Function  | Runs the map function on input records    | Runs the reduce function on grouped data | Aggregates mapper output locally
Input     | Key–value pairs from input data           | Key–value pairs from mappers             | Key–value pairs from the mapper
Output    | Intermediate key–value pairs              | Final output key–value pairs             | Reduced intermediate key–value pairs
Phase     | First phase                               | Last phase                               | Optional middle phase
Data Flow | Reads input and produces intermediate data | Receives data and produces final output  | Processes mapper output before the reducer
Storage   | Data stored locally before sending        | Output written to HDFS                   | Data stays local
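The combiner's effect of shrinking the data sent to reducers can be shown with a small sketch (illustrative only; for word count the combiner applies the same summing function as the reducer, but locally, per mapper):

```python
from collections import defaultdict

# One mapper's raw output: four key–value pairs.
mapper_output = [("hello", 1), ("hadoop", 1), ("hello", 1), ("hello", 1)]

def combine(pairs):
    # Aggregate locally so fewer pairs cross the network to the reducer.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

combined = combine(mapper_output)
print(len(mapper_output), "pairs before the combiner")  # 4
print(len(combined), "pairs after the combiner")        # 2
print(dict(combined))  # {'hello': 3, 'hadoop': 1}
```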