Module 2: HDFS

Hadoop Distributed File System
A brief timeline of how Hadoop came about:

Late 1990s: Search engines and indexes are created.
Late 1990s: Open-source web search engines are invented.
Late 1990s: Web search engines use web crawlers to copy, process, and index web pages.
Late 1990s: Nutch, a web crawler and distributed processing system, is developed.
Late 1990s: Google works on a similar concept for storing and processing data in a distributed environment.
2006: Doug Cutting joins Yahoo and retains the name Nutch for the web crawler portion of the project.
2006: Doug Cutting names the storage and distributed processing portion of the project Hadoop.
2008: Yahoo releases Hadoop as an open-source project.
2012: Hadoop is released as a framework by the Apache Software Foundation (ASF).


What is Hadoop?
Hadoop is an open-source framework for storing and processing massive datasets across
clusters of computers.
● Massive Storage: Stores petabytes of data across a distributed cluster.
● Faster Processing: Distributes computation across multiple machines for rapid
analysis.
● Fault Tolerant: Automatically handles failures to provide high availability.
Hadoop Goals
The main goals of Hadoop are listed below:
1. Scalable: It can scale up from a single server to thousands of servers.
2. Fault tolerance: It is designed with a very high degree of fault tolerance.
3. Economical: It uses commodity hardware instead of high-end hardware.
4. Handle hardware failures: The resiliency of these clusters comes from
the software’s ability to detect and handle failures at the application layer.
Core Hadoop Components
Hadoop consists of the following components:
1. Hadoop Common: This package provides file system and OS level abstractions. It contains
libraries and utilities required by other Hadoop modules.
2. Hadoop Distributed File System (HDFS): HDFS is a distributed file system that provides a
limited interface for managing the file system.
3. Hadoop MapReduce: MapReduce is the programming model that the Hadoop MapReduce engine
uses to distribute work across a cluster.
4. Hadoop Yet Another Resource Negotiator (YARN) (MapReduce 2.0): YARN is a resource-
management platform responsible for managing compute resources in clusters and using them
to schedule users' applications.
Hadoop Distributed File System (HDFS)

Main components of HDFS
Name Node
● NameNode is the master that contains the metadata. In general, it maintains
the directories and files and manages the blocks which are present on the
DataNode.
● The NameNode maps each block to the DataNodes that hold it, monitors the status (health)
of the DataNodes, and re-replicates missing blocks.
● It keeps an EditLog that records all changes to the file system, while the FSImage holds a
checkpoint of the file system metadata.
● The Secondary NameNode performs periodic checkpoints and keeps an image of the NameNode's
metadata. These checkpoints are used to restart the NameNode in case of
failure.
Functions of NameNode:
1. Manages namespace of the file system in memory.
2. Maintains “inode” information.
3. Maps inode to the list of blocks and locations.
4. Takes care of authorization and authentication.
5. Creates checkpoints and logs the namespace changes.
DataNodes
● DataNodes are the slaves that provide the actual storage and are deployed on each
machine in the cluster. The default block size is 128 MB.

● They are responsible for serving read and write requests from clients.
● Functions of DataNode:

1. Handles block storage on multiple volumes and maintains block integrity.

2. Periodically sends heartbeats and block reports to the NameNode.
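
To make the NameNode's role concrete, here is a minimal sketch that asks for the block locations of a file and prints the DataNodes holding each replica. It assumes a reachable HDFS cluster, the hadoop-client library on the classpath, and a hypothetical file at /data/sample.txt.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/data/sample.txt")); // hypothetical path
        // The NameNode answers this metadata query; no file data is read here.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Each block lists the DataNodes (hosts) that store a replica of it.
            System.out.println("offset " + block.getOffset() + " -> "
                    + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}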


Sequence diagram: how HDFS handles processing requests from the user.
Read and Write Process
Read process:
a) The client first asks the NameNode which DataNodes hold the requested data.
b) It then contacts those DataNodes directly to read the data.

Write process:
a) The client asks the NameNode for a set of DataNodes to write to, and the NameNode
returns them if available.
b) The client then writes the data directly to those DataNodes.
Both paths are illustrated in the sketch below.
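
This is a minimal client-side sketch of both paths, assuming a running HDFS cluster, the hadoop-client dependency, and a hypothetical file at /user/demo/hello.txt; the FileSystem API hides the NameNode/DataNode interaction described above.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");   // hypothetical path

        // Write: the client asks the NameNode for target DataNodes,
        // then streams the bytes to those DataNodes directly.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client fetches block locations from the NameNode,
        // then reads the data directly from the DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}

Roughly the same operations are available from the command line via hdfs dfs -put and hdfs dfs -cat.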
MapReduce
The MapReduce algorithm aids in parallel processing and basically comprises two sequential
phases: map and reduce.
1. In the map phase, a set of key–value pairs forms the input and over each key–value pair, the
desired function is executed so as to generate a set of intermediate key–value pairs.
2. In the reduce phase, the intermediate key–value pairs are grouped by key and the values are
combined according to the reduce function provided by the user. Depending on the type of
operation coded by the user, a reduce phase is sometimes not required.
In the MapReduce paradigm, each job has a user-defined map phase followed by a user-defined
reduce phase as follows:
1. The map phase is a parallel, shared-nothing processing of the input.
2. In the reduce phase, the output of the map phase is aggregated. The classic word-count
example below illustrates both phases.
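
A minimal sketch of the word-count job, assuming the hadoop-mapreduce-client dependencies; the input directory and a not-yet-existing output directory are passed as arguments.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);   // intermediate key-value pair
                }
            }
        }
    }

    // Reduce phase: values with the same key are grouped; sum them.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}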
Main Components of MapReduce
The main components of MapReduce are listed below:
1. JobTracker: JobTracker is the master which manages the jobs and resources in the
cluster. The JobTracker tries to schedule each map on the TaskTracker which is
running on the same DataNode as the underlying block.
2. TaskTrackers: TaskTrackers are slaves which are deployed on each machine in the
cluster. They are responsible for running the map and reduce tasks as instructed by the
JobTracker.
3. JobHistoryServer: JobHistoryServer is a daemon that saves historical information
about completed tasks/applications.
Yet Another Resource Negotiator (YARN)
● YARN addresses problems with MapReduce 1.0's architecture, specifically those faced by
the JobTracker service.
● Hadoop clusters can grow to tens of thousands of nodes; at that scale, MapReduce 1.0
suffered from scalability, memory usage, synchronization, and
Single Point of Failure (SPOF) issues.
● In effect, YARN became another core component of Apache Hadoop.
● It splits up the two major functionalities “resource management” and “job scheduling
and monitoring” of the JobTracker into two separate daemons.
● One acts as a global Resource Manager (RM) and the other as an ApplicationMaster
(AM) per application. Thus, instead of having a single node handle both scheduling
and resource management for the entire cluster, YARN distributes this responsibility
across the cluster.
YARN
The RM and the NodeManager manage the applications in a distributed manner. The RM is
the one that arbitrates resources among all the applications in the system. The per-
application AM negotiates resources from the RM and works with the NodeManager(s) to
execute and monitor the component tasks.
1. The RM has a scheduler that takes into account constraints such as queue
capacities, user-limits, etc. before allocating resources to the various running
applications.
2. The scheduler performs its scheduling function based on the resource requirements
of the applications.
3. The NodeManager is responsible for launching the applications’ containers. It
monitors the application’s resource usage (CPU, memory, disk, network) and reports
the information to the RM.
4. Each AM runs as a normal container. It has the responsibility of negotiating
appropriate resource containers from the scheduler, tracking their status and monitoring
their progress.
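
As a rough illustration of how clients talk to the RM, the sketch below uses the YarnClient API to list the running NodeManagers and applications. It assumes a running YARN cluster, a yarn-site.xml on the classpath, and the hadoop-yarn-client dependency.

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnInfo {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());   // reads yarn-site.xml for the RM address
        yarn.start();

        // The RM tracks every NodeManager in the cluster.
        for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + " capacity=" + node.getCapability());
        }
        // ...and every application (each with its own AM container).
        List<ApplicationReport> apps = yarn.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + " " + app.getYarnApplicationState());
        }
        yarn.stop();
    }
}

Submitting an application, which would launch its ApplicationMaster in a container, uses the same client but is omitted here.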
Hadoop Ecosystem
The main ecosystem components of the Hadoop architecture are as follows:

1. Apache HBase: Columnar (Non-relational) database.

2. Apache Hive: Data access and query.

3. Apache HCatalog: Metadata services.

4. Apache Pig: Scripting platform.

5. Apache Mahout: Machine learning libraries for Data Mining.

6. Apache Oozie: Workflow and scheduling services.

7. Apache ZooKeeper: Cluster coordination.

8. Apache Sqoop: Data integration services.


Apache HBase

Column-Oriented NoSQL Database built on HDFS.

● High Performance: Optimized for fast read/write operations on large datasets.


● Scalability: Handles massive amounts of data across a distributed cluster.
● Flexibility: Stores sparse data efficiently.
● Access: Accessed through Java, Thrift, and REST APIs.
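
A minimal sketch of the HBase Java client writing and reading one cell, assuming a running HBase cluster, the hbase-client dependency, and a hypothetical pre-created table 'users' with column family 'info'.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {  // hypothetical table
            // Put: write one cell into column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Get: read the same row back by its row key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}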
Apache Hive

Data Warehouse Infrastructure on Hadoop

● SQL-like Interface (HiveQL): Enables querying and managing large datasets.


● Data Organization: Tables, Partitions, and Buckets for efficient data management.
● Processing Engine: Converts HiveQL queries into MapReduce jobs for distributed
processing.
● Extensibility: Supports User-Defined Functions (UDFs) for custom operations.
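
One way to run HiveQL programmatically is through the HiveServer2 JDBC driver. The sketch below assumes HiveServer2 listening on localhost:10000, the hive-jdbc driver on the classpath, and a hypothetical 'employees' table.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, port, and database are placeholders.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // HiveQL looks like SQL; behind the scenes it is compiled into
            // distributed jobs that run over data stored in HDFS.
            ResultSet rs = stmt.executeQuery(
                "SELECT department, COUNT(*) FROM employees GROUP BY department");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}

The same query could also be run interactively from the beeline shell; JDBC is used here only to keep the examples in Java.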
HCatalog

Metadata Management for Hadoop

● Centralized Repository: Stores metadata about data stored in HDFS.


● Simplified Access: Provides a common interface for various tools to access HDFS data.
● Data Sharing: Enables seamless data sharing across different platforms and applications.
● Improved Efficiency: Streamlines data management and reduces complexity.
PIG

High-Level Data Analysis Language

● Data Flow Language: Defines data transformations using a scripting language (Pig Latin).
● MapReduce Abstraction: Automatically converts Pig Latin scripts into MapReduce jobs.
● Flexible Data Model: Handles complex data structures like nested tuples and maps.
● Data Manipulation: Provides operators for loading, transforming, filtering, and storing data.
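
As a rough illustration, Pig Latin can also be embedded in Java through the PigServer API. The sketch below assumes the pig dependency on the classpath and a hypothetical space-delimited file access.log, and runs in local mode for simplicity.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigSketch {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; MAPREDUCE mode would run on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Pig Latin: load and filter; each statement is a data-flow step
        // that Pig translates into MapReduce jobs on a cluster.
        pig.registerQuery("logs = LOAD 'access.log' USING PigStorage(' ') AS (ip:chararray, code:int);");
        pig.registerQuery("errors = FILTER logs BY code >= 500;");
        pig.store("errors", "error_output");  // writes the filtered records to 'error_output'
        pig.shutdown();
    }
}

The same two statements could equally be run as a Pig Latin script from the grunt shell.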
SQOOP

Data Transfer Tool between Hadoop and Relational Databases

● Import: Efficiently transfers data from relational databases (MySQL, Oracle, etc.) to Hadoop
(HDFS, Hive, HBase).
● Export: Exports processed data from Hadoop back to relational databases for reporting and
visualization.
● Incremental Loads: Supports importing only new or updated data for efficient data
synchronization.
OOZIE

Workflow and Job Coordinator for Hadoop

● Orchestrates Hadoop Jobs: Manages and schedules MapReduce, Pig, Hive, Sqoop, and
other jobs.
● Dependency Management: Defines job dependencies using Directed Acyclic Graphs
(DAGs).
● Automation: Automates complex data processing pipelines with conditional logic.
● Integration: Seamlessly integrates with other Hadoop components for end-to-end workflow
management.
MAHOUT

Scalable Machine Learning Library

● Core Algorithms: Recommendation, Classification, Clustering, Frequent Itemset Mining.
● Distributed Processing: Designed for large-scale data processing on Hadoop
clusters.
● Performance: Optimizes machine learning algorithms for distributed environments.
● Flexibility: Offers a range of algorithms and tools for various machine learning tasks.
ZOOKEEPER

Centralized Service for Distributed Coordination

● Core Functions: Configuration management, naming, synchronization, group services.
● Data Model: Hierarchical namespace (zNodes) for storing data.
● High Availability: Master-slave architecture for fault tolerance.
● Integration: Used by HBase and other distributed systems for coordination.
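
A minimal sketch of the ZooKeeper Java client creating and reading a zNode, assuming a ZooKeeper server at localhost:2181 and the zookeeper client library; the paths and data are hypothetical.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connect to a (hypothetical) local ZooKeeper ensemble.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // zNodes form a hierarchical namespace, much like a small file system.
        if (zk.exists("/config", false) == null) {
            zk.create("/config", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        zk.create("/config/app", "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        byte[] data = zk.getData("/config/app", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}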
Hadoop Limitations
1. Accessibility: HDFS is not directly mountable, requiring workarounds for
data access.
2. Security: Security is disabled by default, leaving data vulnerable to attacks. The lack
of encryption at the storage and network levels is a major concern.
3. Performance: Not optimized for small files, impacting performance for certain
workloads.
4. Stability: Open-source nature can lead to stability issues, requiring careful
version management.
5. Scope: Not a one-size-fits-all solution for big data; other platforms like Google
Cloud Dataflow offer additional benefits.
