BDA MODULE – 2
Introduction to Hadoop: Introducing hadoop, Why hadoop, Why not RDBMS, RDBMS Vs Hadoop, History
of Hadoop, Hadoop overview, Use case of Hadoop, HDFS (Hadoop Distributed File System),Processing data
with Hadoop, Managing resources and applications with Hadoop YARN(Yet Another Resource Negotiator).
Introduction to Map Reduce Programming: Introduction, Mapper, Reducer, Combiner, Partitioner,
Searching, Sorting, Compression.
TB1: Ch 5: 5.1–5.8, 5.10–5.12, Ch 8: 8.1–8.8
HADOOP
Hadoop is an open-source framework written in Java. It was originally developed to support Nutch, an open-source text search engine. Hadoop is based on Google's MapReduce and Google File System (GFS) technologies and is now widely used by companies such as Yahoo, Facebook, LinkedIn, and Twitter.
With a neat diagram explain HDFS architecture.
HDFS (Hadoop Distributed File System) is the storage component of Hadoop. It is designed to store and
process very large files efficiently across multiple machines.
1. Storage Component
HDFS acts as the main storage system of Hadoop, where all big data files are stored.
2. Distributed File System
Data is not stored on one machine. Instead, files are distributed across multiple nodes (DataNodes)
in the cluster, as shown in the figure.
3. Based on Google File System
The design of HDFS is inspired by Google File System (GFS), which was created to handle large-scale
data reliably.
4. High Throughput
HDFS uses large block sizes and moves computation closer to the data, which helps in processing
data faster.
5. Replication
Each file block is replicated on multiple DataNodes. If a DataNode fails, the system automatically re-
replicates the missing blocks, ensuring fault tolerance.
6. Handles Large Files
HDFS is optimized to read and write very large files (in GBs or TBs) efficiently.
7. Runs on Native File Systems
HDFS runs on native file systems like ext3, ext4, etc., present on the operating system.
8. Client–Server Interaction
o The client contacts the NameNode to get metadata (file name, block locations).
o Actual data read/write happens directly with DataNodes.
o DataNodes communicate with each other and perform the replication operations.
9. Block Example
If a file of 192 MB is stored with a 64 MB block size, it is divided into 3 blocks (A, B, C).
These blocks are replicated and stored on different DataNodes, as shown in the figure.
• NameNode: Stores metadata (file name, block info, DataNode locations).
• DataNodes: Store actual data blocks (A, B, C).
• Client: Requests metadata from NameNode and reads/writes data from DataNodes.
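The block arithmetic from the example above can be sketched as follows (a minimal illustration; the helper function and the default replication factor of 3 are assumptions for the sketch, not Hadoop API):

```python
import math

def split_into_blocks(file_size_mb, block_size_mb=64, replication=3):
    """Compute how many HDFS blocks a file occupies and the total
    number of replicated block copies stored across the cluster."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    total_copies = num_blocks * replication
    return num_blocks, total_copies

# The 192 MB file from the example with a 64 MB block size:
blocks, copies = split_into_blocks(192)
print(blocks)  # 3 blocks (A, B, C)
print(copies)  # 9 block copies with a replication factor of 3
```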
5.10.1 HDFS Daemons
HDFS daemons are background programs that manage and run the HDFS file system. HDFS works with
three main daemons: NameNode, DataNode, and Secondary NameNode.
1. NameNode
• HDFS splits large files into blocks and the NameNode uses rack IDs to identify the racks (groups of
DataNodes) where blocks are stored.
• The NameNode tracks block locations on DataNodes and manages file operations such as read, write,
create, and delete.
• The main function of the NameNode is to manage the file system namespace (all files and directories
in the cluster).
• The namespace information, including block location and file properties, is stored in a file called
FsImage.
• Every change to file system metadata is recorded in the EditLog.
• On startup, the NameNode loads FsImage and EditLog into memory and applies all logged changes. Only one NameNode exists per cluster.
2. DataNode
• There are multiple DataNodes in a cluster, and during pipeline read and write operations, DataNodes
communicate with each other.
• Each DataNode continuously sends a heartbeat message to the NameNode to ensure connectivity.
• If the NameNode stops receiving heartbeats from a DataNode, it re-replicates that DataNode's blocks on other nodes so that the cluster continues running normally.
3. Secondary NameNode
• The Secondary NameNode takes snapshots of HDFS metadata at regular intervals as specified in the
Hadoop configuration.
• Since its memory requirements are the same as the NameNode, it is better to run them on different
machines.
• In case of NameNode failure, the Secondary NameNode can be manually configured to bring up the
cluster, but it does not record real-time changes to HDFS metadata.
Anatomy of HDFS File Read
1. The client opens a file using open() on the DistributedFileSystem.
2. DistributedFileSystem asks the NameNode for block locations, and NameNode provides the addresses
of DataNodes storing the blocks.
3. The client reads the file by connecting to the closest DataNode for the first block using
FSDataInputStream.
4. Data is streamed from the DataNode by repeated read() calls.
5. After reaching the end of a block, the client connects to the next DataNode for subsequent blocks.
6. Once reading is complete, the client calls close() on the FSDataInputStream.
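The read steps above can be sketched as a small simulation (all names here are illustrative, not the real Hadoop API; the NameNode holds only metadata, and block data is fetched directly from DataNodes, one block at a time):

```python
# "NameNode" metadata: file -> ordered list of (block_id, DataNodes holding it)
namenode = {
    "/sample/big.txt": [("A", ["dn1", "dn2"]),
                        ("B", ["dn2", "dn3"]),
                        ("C", ["dn1", "dn3"])],
}
# Each DataNode's locally stored block contents
datanodes = {
    "dn1": {"A": "hello ", "C": "!"},
    "dn2": {"A": "hello ", "B": "world"},
    "dn3": {"B": "world", "C": "!"},
}

def read_file(path):
    data = []
    for block_id, holders in namenode[path]:       # step 2: get block locations
        closest = holders[0]                       # step 3: pick closest DataNode
        data.append(datanodes[closest][block_id])  # step 4: stream block data
    return "".join(data)                           # step 5: next block, in order

print(read_file("/sample/big.txt"))  # hello world!
```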
Anatomy of HDFS File Write
1. The client creates a file using create() on DistributedFileSystem, which requests the NameNode to
create a file without blocks.
2. FSDataOutputStream splits the data into packets to form a data queue for the DataStreamer.
3. DataStreamer requests the NameNode to allocate blocks and selects DataNodes to form a pipeline
4. Packets are sent through the pipeline, and each DataNode writes the blocks to its local storage: the first DataNode stores and forwards to the second, and the second stores and forwards to the third.
5. FSDataOutputStream manages an "Ack queue" to track acknowledgments; packets are removed only
when all DataNodes acknowledge.
6. On finishing writing, the client calls close(), which flushes remaining packets and informs the
NameNode that the file creation is complete.
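The write pipeline and the "Ack queue" can be sketched as a simulation (illustrative only; real Hadoop forwards packets asynchronously, while this sketch processes them one at a time):

```python
from collections import deque

def write_through_pipeline(packets, pipeline):
    """Push packets through a chain of DataNodes; a packet leaves the
    ack queue only after every node in the pipeline has stored it."""
    stores = {dn: [] for dn in pipeline}  # each DataNode's local storage
    ack_queue = deque()
    for packet in packets:                # step 2: packets form a data queue
        ack_queue.append(packet)
        for dn in pipeline:               # step 4: store and forward down the chain
            stores[dn].append(packet)
        ack_queue.popleft()               # step 5: all DataNodes acknowledged
    return stores, ack_queue

stores, acks = write_through_pipeline(["p1", "p2", "p3"], ["dn1", "dn2", "dn3"])
print(stores["dn3"])  # ['p1', 'p2', 'p3']
print(len(acks))      # 0 -> every packet was acknowledged
```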
Give HDFS Commands to perform the following operations
i) To get the list of directories and files at the root of HDFS ⭐
➢ hadoop fs -ls /
ii) To get the list of complete directories and files of HDFS
➢ hadoop fs -ls -R /
iii) To create a directory (say, sample) in HDFS⭐
➢ hadoop fs -mkdir /sample
iv) To copy a file from local file system to HDFS ⭐
➢ hadoop fs -put /root/sample/test.txt /sample/test.txt
v) To copy a file from HDFS to local file system
➢ hadoop fs -get /sample/test.txt /root/sample/test.txt
vi) To copy a file from local file system to HDFS via copyFromLocal command
➢ hadoop fs -copyFromLocal /root/sample/test.txt /sample/test.txt
vii) To copy a file from Hadoop file system to local file system via copyToLocal command
➢ hadoop fs -copyToLocal /sample/test.txt /root/sample/test.txt
viii) To display the contents of an HDFS file on console⭐
➢ hadoop fs -cat /sample/test.txt
ix) To copy a file from one directory to another on HDFS
➢ hadoop fs -cp /sample/test.txt /sample1
x) To remove a directory from HDFS
➢ hadoop fs -rm -r /sample1
Implement a word count program in Hadoop.
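A real Hadoop word count is written in Java against the MapReduce API (Mapper and Reducer classes submitted as a job). As a minimal, illustrative sketch of the same logic, here is a Python simulation of the map, shuffle-and-sort, and reduce phases (not actual Hadoop code):

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word, like Hadoop's map phase.
    for word in line.split():
        yield word.lower(), 1

def shuffle_and_sort(pairs):
    # Group values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reducer(key, values):
    # Sum the counts for one word.
    return key, sum(values)

lines = ["Hello Hadoop", "Hello MapReduce"]
pairs = [kv for line in lines for kv in mapper(line)]
counts = dict(reducer(k, vs) for k, vs in shuffle_and_sort(pairs))
print(counts)  # {'hadoop': 1, 'hello': 2, 'mapreduce': 1}
```

In Hadoop, the same mapper and reducer logic would run distributed across TaskTrackers, with the framework handling the shuffle-and-sort step.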
What is MapReduce? Explain its component and workflow
MapReduce is a software framework that enables parallel processing of massive amounts of data across a Hadoop cluster. It works on the principle of splitting the input data into independent chunks, processing them in parallel, and then combining the results.
Components of MapReduce
1. JobTracker (Master Daemon)
o Responsible for scheduling tasks on TaskTrackers.
o Monitors running tasks and re-executes failed tasks.
o Creates the execution plan for the submitted job.
o There is only one JobTracker per cluster.
2. TaskTracker (Slave Daemon)
o Executes tasks assigned by the JobTracker.
o Uses multiple JVMs to run map and reduce tasks in parallel.
o Continuously sends heartbeat messages to JobTracker.
o If heartbeat is missed, JobTracker assumes failure and reschedules the task.
MapReduce Workflow
1. The input data is split into small chunks.
2. The JobTracker creates master and worker processes and assigns map tasks to the TaskTracker to
process the chunks in parallel.
3. Each mapper processes its chunk and produces intermediate key–value pairs.
4. The partitioner decides which reducer gets the data.
5. The map output is shuffled and sorted by key and sent to reducers.
6. Reducers process the data and write the final output to the file system.
7. After all reducers finish execution, the final output is given to the user.
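Step 4 of the workflow can be sketched as follows. Hadoop's default partitioner hashes each key modulo the number of reducers; this sketch uses a hand-rolled deterministic hash (Python's built-in `hash` of strings is randomized per process), so the function names and hash are illustrative:

```python
def stable_hash(key):
    # Simple deterministic string hash for the sketch.
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % (2 ** 32)
    return h

def partition(key, num_reducers):
    # Same idea as Hadoop's default HashPartitioner: hash(key) mod R.
    # Every occurrence of a key maps to the same reducer, so all values
    # for that key are reduced together.
    return stable_hash(key) % num_reducers

for word in ["hello", "hadoop", "mapreduce"]:
    print(word, "-> reducer", partition(word, 2))
```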
Differentiate between Combiner, Reducer, and Mapper
Aspect    | Mapper                                    | Reducer                                  | Combiner
Purpose   | Converts input data into key–value pairs  | Combines values with the same key globally | Reduces data size before the reducer
Function  | Runs the map function on input records    | Runs the reduce function on grouped data | Aggregates mapper output locally
Input     | Key–value pairs from input data           | Key–value pairs from mappers             | Key–value pairs from the mapper
Output    | Intermediate key–value pairs              | Final output key–value pairs             | Reduced intermediate key–value pairs
Phase     | First phase                               | Last phase                               | Optional middle phase
Data Flow | Reads input and produces intermediate data | Receives data and produces final output  | Processes mapper output before the reducer
Storage   | Data stored locally before sending        | Output written to HDFS                   | Data stays local
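The combiner's effect of shrinking the data sent to reducers can be shown with a small sketch (illustrative only; for word count the combiner applies the same summing function as the reducer, but locally, per mapper):

```python
from collections import defaultdict

# One mapper's raw output: four key–value pairs.
mapper_output = [("hello", 1), ("hadoop", 1), ("hello", 1), ("hello", 1)]

def combine(pairs):
    # Aggregate locally so fewer pairs cross the network to the reducer.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

combined = combine(mapper_output)
print(len(mapper_output), "pairs before the combiner")  # 4
print(len(combined), "pairs after the combiner")        # 2
print(dict(combined))  # {'hello': 3, 'hadoop': 1}
```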