Module 2: HDFS

Hadoop Distributed File System
A brief timeline of how Hadoop came about:

Late 1990s: Search engines and indexes are created.
Late 1990s: Open-source web search engines are invented.
Late 1990s: Web search engines use web crawlers to copy, process, and index web pages.
Late 1990s: Nutch, a web crawler and distributed processing system, is developed.
Late 1990s: Google works on a similar concept for storing and processing data in a distributed environment.
2006: Doug Cutting joins Yahoo and retains the name Nutch for the web crawler portion of the project.
2006: Doug Cutting names the storage and distributed processing portion of the project Hadoop.
2008: Yahoo releases Hadoop as an open-source project.
2012: Hadoop is released as a framework by the Apache Software Foundation (ASF).


What is Hadoop?
Hadoop is an open-source framework for storing and processing massive datasets across
clusters of computers.
● Massive Storage: Stores petabytes of data across a distributed cluster.
● Faster Processing: Distributes computation across multiple machines for rapid
analysis.
● Fault Tolerant: Automatically handles failures to provide high availability.
Hadoop Goals
The main goals of Hadoop are listed below:
1. Scalable: It can scale up from a single server to thousands of servers.
2. Fault tolerance: It is designed with a very high degree of fault tolerance.
3. Economical: It uses commodity hardware instead of high-end hardware.
4. Handle hardware failures: The resiliency of these clusters comes from
the software’s ability to detect and handle failures at the application layer.
Core Hadoop Components
Hadoop consists of the following components:
1. Hadoop Common: This package provides file system and OS level abstractions. It contains
libraries and utilities required by other Hadoop modules.
2. Hadoop Distributed File System (HDFS): HDFS is a distributed file system that provides a
limited interface for managing the file system.
3. Hadoop MapReduce: MapReduce is the programming model that the Hadoop MapReduce engine
uses to distribute work across a cluster.
4. Hadoop Yet Another Resource Negotiator (YARN) (MapReduce 2.0): YARN is a resource-
management platform responsible for managing compute resources in clusters and using them
to schedule users' applications.
Hadoop Distributed File System (HDFS)

Main components of HDFS
Name Node
● NameNode is the master that contains the metadata. In general, it maintains
the directories and files and manages the blocks which are present on the
DataNode.
● The NameNode maps each block to the DataNodes that hold it, monitors the status (health)
of the DataNodes, and re-replicates missing blocks.
● It keeps an EditLog that records all changes to the file system, while the FSImage holds a
checkpoint of the file system metadata.
● The Secondary NameNode performs periodic checkpoints and keeps an image of the NameNode's
metadata. These checkpoints are used to restart the NameNode in case of
failure.
Functions of NameNode:
1. Manages namespace of the file system in memory.
2. Maintains “inode” information.
3. Maps inode to the list of blocks and locations.
4. Takes care of authorization and authentication.
5. Creates checkpoints and logs the namespace changes.
DataNodes
● DataNodes are the slaves that provide the actual storage and are deployed on each
machine in the cluster. The default block size is 128 MB.

● They are responsible for serving read and write requests from clients.
● Functions of DataNode:

1. Handles block storage on multiple volumes and maintains block integrity.

2. Periodically sends heartbeats and block reports to the NameNode.
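
To make the NameNode's role concrete, here is a minimal sketch that asks for the block locations of a file and prints the DataNodes holding each replica. It assumes a reachable HDFS cluster, the hadoop-client library on the classpath, and a hypothetical file at /data/sample.txt.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/data/sample.txt")); // hypothetical path
        // The NameNode answers this metadata query; no file data is read here.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Each block lists the DataNodes (hosts) that store a replica of it.
            System.out.println("offset " + block.getOffset() + " -> "
                    + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}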


Sequence diagram: how HDFS handles processing requests from the user.
Read and Write Process
Read process:
a) The client first asks the NameNode which DataNodes hold the requested data.
b) It then contacts those DataNodes directly to read the data.

Write process:
a) The client asks the NameNode for a set of DataNodes to write to, and the NameNode
returns them if available.
b) The client then writes the data directly to those DataNodes.
Both paths are illustrated in the sketch below.
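
This is a minimal client-side sketch of both paths, assuming a running HDFS cluster, the hadoop-client dependency, and a hypothetical file at /user/demo/hello.txt; the FileSystem API hides the NameNode/DataNode interaction described above.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");   // hypothetical path

        // Write: the client asks the NameNode for target DataNodes,
        // then streams the bytes to those DataNodes directly.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client fetches block locations from the NameNode,
        // then reads the data directly from the DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}

Roughly the same operations are available from the command line via hdfs dfs -put and hdfs dfs -cat.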
MapReduce
The MapReduce algorithm aids in parallel processing and basically comprises two sequential
phases: map and reduce.
1. In the map phase, a set of key–value pairs forms the input and over each key–value pair, the
desired function is executed so as to generate a set of intermediate key–value pairs.
2. In the reduce phase, the intermediate key–value pairs are grouped by key and the values are
combined according to the reduce function provided by the user. Depending on the type of
operation coded by the user, a reduce phase is sometimes not required.
In the MapReduce paradigm, each job has a user-defined map phase followed by a user-defined
reduce phase as follows:
1. The map phase is a parallel, shared-nothing processing of the input.
2. In the reduce phase, the output of the map phase is aggregated. The classic word-count
example below illustrates both phases.
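
A minimal sketch of the word-count job, assuming the hadoop-mapreduce-client dependencies; the input directory and a not-yet-existing output directory are passed as arguments.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);   // intermediate key-value pair
                }
            }
        }
    }

    // Reduce phase: values with the same key are grouped; sum them.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}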
Main Components of MapReduce
The main components of MapReduce are listed below:
1. JobTracker: JobTracker is the master which manages the jobs and resources in the
cluster. The JobTracker tries to schedule each map on the TaskTracker which is
running on the same DataNode as the underlying block.
2. TaskTrackers: TaskTrackers are slaves which are deployed on each machine in the
cluster. They are responsible for running the map and reduce tasks as instructed by the
JobTracker.
3. JobHistoryServer: JobHistoryServer is a daemon that saves historical information
about completed tasks/applications.
Yet Another Resource Negotiator (YARN)
● YARN addresses problems with MapReduce 1.0's architecture, specifically those faced by
the JobTracker service.
● Hadoop clusters can grow to tens of thousands of nodes; at that scale, MapReduce 1.0
suffered from scalability, memory usage, synchronization, and
Single Point of Failure (SPOF) issues.
● In effect, YARN became another core component of Apache Hadoop.
● It splits up the two major functionalities “resource management” and “job scheduling
and monitoring” of the JobTracker into two separate daemons.
● One acts as a global Resource Manager (RM) and the other as an ApplicationMaster
(AM) per application. Thus, instead of having a single node handle both scheduling
and resource management for the entire cluster, YARN distributes this responsibility
across the cluster.
YARN
The RM and the NodeManager manage the applications in a distributed manner. The RM is
the one that arbitrates resources among all the applications in the system. The per-
application AM negotiates resources from the RM and works with the NodeManager(s) to
execute and monitor the component tasks.
1. The RM has a scheduler that takes into account constraints such as queue
capacities, user-limits, etc. before allocating resources to the various running
applications.
2. The scheduler performs its scheduling function based on the resource requirements
of the applications.
3. The NodeManager is responsible for launching the applications’ containers. It
monitors the application’s resource usage (CPU, memory, disk, network) and reports
the information to the RM.
4. Each AM runs as a normal container. It has the responsibility of negotiating
appropriate resource containers from the scheduler, tracking their status and monitoring
their progress.
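
As a rough illustration of how clients talk to the RM, the sketch below uses the YarnClient API to list the running NodeManagers and applications. It assumes a running YARN cluster, a yarn-site.xml on the classpath, and the hadoop-yarn-client dependency.

import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnInfo {
    public static void main(String[] args) throws Exception {
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(new YarnConfiguration());   // reads yarn-site.xml for the RM address
        yarn.start();

        // The RM tracks every NodeManager in the cluster.
        for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + " capacity=" + node.getCapability());
        }
        // ...and every application (each with its own AM container).
        List<ApplicationReport> apps = yarn.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + " " + app.getYarnApplicationState());
        }
        yarn.stop();
    }
}

Submitting an application, which would launch its ApplicationMaster in a container, uses the same client but is omitted here.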
Hadoop Ecosystem
The main ecosystem components of the Hadoop architecture are as follows:

1. Apache HBase: Columnar (Non-relational) database.

2. Apache Hive: Data access and query.

3. Apache HCatalog: Metadata services.

4. Apache Pig: Scripting platform.

5. Apache Mahout: Machine learning libraries for Data Mining.

6. Apache Oozie: Workflow and scheduling services.

7. Apache ZooKeeper: Cluster coordination.

8. Apache Sqoop: Data integration services.


Apache HBase

Column-Oriented NoSQL Database built on HDFS.

● High Performance: Optimized for fast read/write operations on large datasets.


● Scalability: Handles massive amounts of data across a distributed cluster.
● Flexibility: Stores sparse data efficiently.
● Access: Accessed through Java, Thrift, and REST APIs.
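
A minimal sketch of the HBase Java client writing and reading one cell, assuming a running HBase cluster, the hbase-client dependency, and a hypothetical pre-created table 'users' with column family 'info'.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {  // hypothetical table
            // Put: write one cell into column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Get: read the same row back by its row key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}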
Apache Hive

Data Warehouse Infrastructure on Hadoop

● SQL-like Interface (HiveQL): Enables querying and managing large datasets.


● Data Organization: Tables, Partitions, and Buckets for efficient data management.
● Processing Engine: Converts HiveQL queries into MapReduce jobs for distributed
processing.
● Extensibility: Supports User-Defined Functions (UDFs) for custom operations.
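
One way to run HiveQL programmatically is through the HiveServer2 JDBC driver. The sketch below assumes HiveServer2 listening on localhost:10000, the hive-jdbc driver on the classpath, and a hypothetical 'employees' table.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL; host, port, and database are placeholders.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // HiveQL looks like SQL; behind the scenes it is compiled into
            // distributed jobs that run over data stored in HDFS.
            ResultSet rs = stmt.executeQuery(
                "SELECT department, COUNT(*) FROM employees GROUP BY department");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}

The same query could also be run interactively from the beeline shell; JDBC is used here only to keep the examples in Java.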
HCatalog

Metadata Management for Hadoop

● Centralized Repository: Stores metadata about data stored in HDFS.


● Simplified Access: Provides a common interface for various tools to access HDFS data.
● Data Sharing: Enables seamless data sharing across different platforms and applications.
● Improved Efficiency: Streamlines data management and reduces complexity.
PIG

High-Level Data Analysis Language

● Data Flow Language: Defines data transformations using a scripting language (Pig Latin).
● MapReduce Abstraction: Automatically converts Pig Latin scripts into MapReduce jobs.
● Flexible Data Model: Handles complex data structures like nested tuples and maps.
● Data Manipulation: Provides operators for loading, transforming, filtering, and storing data.
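
As a rough illustration, Pig Latin can also be embedded in Java through the PigServer API. The sketch below assumes the pig dependency on the classpath and a hypothetical space-delimited file access.log, and runs in local mode for simplicity.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigSketch {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; MAPREDUCE mode would run on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Pig Latin: load and filter; each statement is a data-flow step
        // that Pig translates into MapReduce jobs on a cluster.
        pig.registerQuery("logs = LOAD 'access.log' USING PigStorage(' ') AS (ip:chararray, code:int);");
        pig.registerQuery("errors = FILTER logs BY code >= 500;");
        pig.store("errors", "error_output");  // writes the filtered records to 'error_output'
        pig.shutdown();
    }
}

The same two statements could equally be run as a Pig Latin script from the grunt shell.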
SQOOP

Data Transfer Tool between Hadoop and Relational Databases

● Import: Efficiently transfers data from relational databases (MySQL, Oracle, etc.) to Hadoop
(HDFS, Hive, HBase).
● Export: Exports processed data from Hadoop back to relational databases for reporting and
visualization.
● Incremental Loads: Supports importing only new or updated data for efficient data
synchronization.
OOZIE

Workflow and Job Coordinator for Hadoop

● Orchestrates Hadoop Jobs: Manages and schedules MapReduce, Pig, Hive, Sqoop, and
other jobs.
● Dependency Management: Defines job dependencies using Directed Acyclic Graphs
(DAGs).
● Automation: Automates complex data processing pipelines with conditional logic.
● Integration: Seamlessly integrates with other Hadoop components for end-to-end workflow
management.
MAHOUT

Scalable Machine Learning Library

● Core Algorithms: Recommendation, Classification, Clustering, Frequent Itemset Mining.
● Distributed Processing: Designed for large-scale data processing on Hadoop
clusters.
● Performance: Optimizes machine learning algorithms for distributed environments.
● Flexibility: Offers a range of algorithms and tools for various machine learning tasks.
ZOOKEEPER

Centralized Service for Distributed Coordination

● Core Functions: Configuration management, naming, synchronization, group services.
● Data Model: Hierarchical namespace (zNodes) for storing data.
● High Availability: Master-slave architecture for fault tolerance.
● Integration: Used by HBase and other distributed systems for coordination.
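
A minimal sketch of the ZooKeeper Java client creating and reading a zNode, assuming a ZooKeeper server at localhost:2181 and the zookeeper client library; the paths and data are hypothetical.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        // Connect to a (hypothetical) local ZooKeeper ensemble.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // zNodes form a hierarchical namespace, much like a small file system.
        if (zk.exists("/config", false) == null) {
            zk.create("/config", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        zk.create("/config/app", "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        byte[] data = zk.getData("/config/app", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}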
Hadoop Limitations
1. Accessibility: HDFS is not directly mountable, requiring workarounds for
data access.
2. Security: Security is disabled by default, leaving data vulnerable to attacks. The lack
of encryption at the storage and network levels is a major concern.
3. Performance: Not optimized for small files, impacting performance for certain
workloads.
4. Stability: Open-source nature can lead to stability issues, requiring careful
version management.
5. Scope: Not a one-size-fits-all solution for big data; other platforms like Google
Cloud Dataflow offer additional benefits.
