Hadoop Distributed File System
History
Hadoop Distributed File System (HDFS) is an open-source implementation of
Google's GFS architecture, developed by the Apache Software Foundation.
Development was initiated at Yahoo! in 2006. Inspired by Google's GFS and
MapReduce papers, Yahoo! was looking to build an open-source system to
fulfill its storage requirements.
They decided to pass the storage and data-processing parts of their ‘Nutch’ search
engine project to the Apache Software Foundation and form Hadoop as an open-source
project.
Features
HDFS is a Java-based distributed file system that provides scalable and reliable data
storage.
HDFS is also designed to run on large clusters of commodity servers.
It is a sub-project of the Apache Hadoop project.
It is highly fault-tolerant and provides high-throughput access to application data.
The file system is also available to users on the Amazon EC2 cloud platform.
HDFS is designed to reliably store very large files across multiple machines in a large
cluster.
An HDFS cluster contains two types of nodes: a single master node, called the
NameNode, and a number of slave nodes, called DataNodes.
Files are broken into a sequence of large blocks (typically 64 MB or 128 MB).
These blocks are stored on the DataNodes' commodity servers.
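As a concrete illustration, the minimal sketch below writes a file into HDFS through Hadoop's Java FileSystem API. The NameNode address and the file path are hypothetical; the splitting of the file into blocks and their placement on DataNodes happen transparently on the client's behalf.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical NameNode address; normally read from core-site.xml.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // The client contacts the NameNode only for metadata; the bytes are
        // streamed to DataNodes, which store the file as fixed-size blocks.
        try (FSDataOutputStream out = fs.create(new Path("/data/example.txt"))) {
            out.writeBytes("hello, hdfs\n");
        }
    }
}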
Fault Tolerance
Consider a scenario of node failure in HDFS.
To increase the fault tolerance of the system, HDFS replicates each block over
multiple DataNodes. By default it keeps 3 replicas. Both the block size and the
replication factor are configurable.
During a read operation, data is fetched from any one of the replicas.
During a write operation, data is sent to all of the DataNodes that hold replicas
of the file.
The master node stores the metadata about the blocks.
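Both knobs are plain configuration properties. A minimal Java sketch of tuning them might look as follows; the property names dfs.replication and dfs.blocksize are the standard Hadoop ones, while the values and the file path are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSettingsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor for files created by this client.
        conf.setInt("dfs.replication", 3);
        // Block size in bytes (128 MB here).
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);

        // Replication can also be raised or lowered per file; the NameNode
        // then schedules the creation or deletion of replicas.
        fs.setReplication(new Path("/data/example.txt"), (short) 5);
    }
}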
Master-Slave Architecture
Every server in an HDFS cluster has a DataNode and a TaskTracker associated with
it.
The single NameNode runs on a master server that manages the file system and stores
metadata about the DataNodes.
The master server also runs a JobTracker that coordinates all of the activities
across the cluster.
Every server, whether master or slave, has the MapReduce functionality implemented on it.
Every node also has a database engine.
The NameNode and DataNode are actually pieces of software written in
Java that typically run on the Linux operating system.
The use of a portable language like Java ensures that the software can be
deployed on a broad range of commodity hardware.
In real-life deployments, one DataNode is generally created per server, although
the HDFS architecture does not prevent running multiple DataNodes on the
same server.
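This division of labour between the NameNode (metadata) and the DataNodes (block storage) is visible from a client program. The sketch below, with a hypothetical file path, asks the NameNode which hosts hold each block of a file; the query is answered from metadata alone, without contacting any DataNode.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/example.txt"));

        // The NameNode answers this metadata query from its in-memory state.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset %d, length %d, replicas on: %s%n",
                    block.getOffset(), block.getLength(),
                    String.join(", ", block.getHosts()));
        }
    }
}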
The goals of HDFS
❏ Fast recovery from hardware failures
❏ Streaming access to data
❏ Accommodation of large data sets
❏ Portability
Pig Latin: A High-Level Language
Introduction
Pig Latin is a high-level data flow language developed by Yahoo! that has been
implemented on top of Hadoop in the Apache Pig project.
Pig Latin, Sawzall, and DryadLINQ are different approaches to building languages
on top of MapReduce and its extensions.
Example
Given below is a Pig Latin statement that loads data into Apache Pig.
grunt> Student_data = LOAD 'student_data.txt' USING PigStorage(',')
       AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
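A LOAD statement on its own is evaluated lazily. Assuming the comma-separated file student_data.txt actually exists, the relation can then be materialized and printed with DUMP:
grunt> DUMP Student_data;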