Unit-2 Hadoop and MapReduce
By
Prof Shibdas Dutta
Associate Professor,
DCG DATA CORE SYSTEMS INDIA PVT LTD
Kolkata
Big data and Parallel Processing
• Traditionally, data moves from storage to compute capacity (which includes memory) and back
to storage once the important results are computed. With big data, we have more data than
will fit on a single computer.
• Linear processing: the problem statement is broken into a set of instructions that are
executed sequentially until all instructions complete successfully.
If an error occurs in any one of the instructions, the entire sequence of instructions
is executed from the beginning after the error has been resolved. Linear processing is
best suited for minor computing tasks and is inefficient and time-consuming when it
comes to processing complex problems such as big data.
Big data and Parallel Processing
Parallel Processing: the problem statement is broken down into a set of executable
instructions. The instructions are then distributed to multiple execution nodes of equal
processing power and are executed in parallel. Since the instructions run on separate
execution nodes, errors can be fixed and re-executed locally, independent of other
instructions. This reduces computation time and lowers the memory and compute requirements
of any single machine.
Flexibility: the biggest advantage of parallel processing is that execution nodes can be
added or removed as and when required. This significantly reduces infrastructure cost.
A minimal single-machine analogy is sketched below.
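To make the contrast concrete, here is a minimal single-machine sketch in Java (an illustrative analogy only, not Hadoop itself): the same summation done sequentially and in parallel. Java's parallel streams split the work across worker threads much as a cluster distributes it across execution nodes.

import java.util.stream.LongStream;

public class ParallelSumDemo {
    public static void main(String[] args) {
        long n = 100_000_000L;

        // Linear processing: a single thread walks the whole range sequentially.
        long sequential = LongStream.rangeClosed(1, n).sum();

        // Parallel processing: the range is split into chunks that are summed
        // on separate worker threads and then combined into one result.
        long parallel = LongStream.rangeClosed(1, n).parallel().sum();

        System.out.println("sequential = " + sequential);
        System.out.println("parallel   = " + parallel);
    }
}

Both runs produce the same total; the parallel version simply spreads the instructions across all available cores, mirroring how execution nodes work independently of one another.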
What is Hadoop?
• Hadoop is the solution to Big Data problems. It is the technology to store massive
datasets on a cluster of cheap machines in a distributed manner.
• Doug Cutting created Hadoop. In the year 2008, Yahoo gave Hadoop to the Apache
Software Foundation. Since then, two major versions of Hadoop have been released:
• version 1.0 in the year 2011 and version 2.0.6 in the year 2013. Hadoop comes in
various flavors like Cloudera, IBM BigInsights, MapR and Hortonworks.
History of Hadoop
From Humble Beginnings to Big Data Giant
Hadoop, the open-source framework synonymous with large-scale data processing, boasts a rich
and fascinating history. Here's a breakdown of its key milestones:
2002: Doug Cutting and Mike Cafarella develop the Apache Nutch project, an open-source web
crawler used to build search engines.
2003: Google publishes a paper on its GFS (Google File System), highlighting its distributed
nature and high scalability.
2004: Google engineers Jeff Dean and Sanjay Ghemawat present a paper on MapReduce, a
programming model for processing large datasets across clusters of commodity hardware.
2005: Nutch developers face limitations managing and analyzing their crawl data. They adapt
the ideas from GFS and MapReduce to create the Nutch Distributed File System (NDFS) and Nutch-MR.
2006: Doug Cutting joins Yahoo. Recognizing the potential of NDFS and MapReduce, he spins
them out into a new project called "Hadoop," named after his son's toy elephant.
History of Hadoop.....cont...
Rapid Growth and Adoption (2007-2010):
2007: Yahoo runs Hadoop on a 1,000-node cluster, demonstrating its ability to handle massive
datasets.
2008: Hadoop graduates to become a top-level Apache Software Foundation project, signifying
its growing community and industry interest.
2009: The first "Hadoop World" conference takes place, bringing together developers and users
from around the globe.
2011: The release of Hadoop 1.0 marks a significant milestone, solidifying its core architecture
and APIs.
2011-2012: The ecosystem around Hadoop grows with subprojects such as HBase (NoSQL
database), Hive (data warehouse), Pig (high-level data flow language), and ZooKeeper
(distributed coordination service).
2013-2014: Hadoop YARN (Yet Another Resource Negotiator) becomes the default resource
management system, improving scalability and flexibility.
2015: The Apache Spark project gains traction as a faster alternative for certain use cases.
2016-present: Cloud deployments of Hadoop become increasingly popular, driven by ease of use
and scalability. Hadoop continues to evolve with new features and integrations, remaining a
core part of the big data landscape.
History of Hadoop.....cont...
Key Notes:
Hadoop emerged from the need to manage and analyze large datasets at scale.
Its open-source nature and community contributions fueled its rapid adoption.
The ecosystem has diversified, offering various tools and technologies for diverse big data
needs.
Though challenged by newer technologies, Hadoop remains a crucial player in the big data
landscape.
Why Hadoop?
• Apache Hadoop is not only a storage system but is a platform for data storage as well
as processing.
• It is scalable (we can add more nodes on the fly) and fault-tolerant (even if a node
goes down, its data is processed by another node).
• Flexibility to store and mine any type of data, whether structured, semi-structured
or unstructured. It is not bound by a single schema.
• Excels at processing data of complex nature. Its scale-out architecture divides
workloads across many nodes. Another added advantage is that its flexible file
system eliminates ETL bottlenecks.
• Scales economically, as it can be deployed on commodity hardware. Apart from this, its
open-source nature guards against vendor lock-in.
Hadoop Architecture
Hadoop Features
1. Reliability
In the Hadoop cluster, if any node goes down, it will not disable the whole cluster.
Instead, another node takes the place of the failed node, and the cluster continues
functioning as if nothing had happened. Hadoop has a built-in fault tolerance feature.
2. Scalable
Hadoop integrates with cloud-based services. If you are installing Hadoop on the
cloud, you need not worry about scalability: you can easily procure more hardware and
expand your Hadoop cluster within minutes.
3. Economical
Hadoop gets deployed on commodity hardware, which means cheap machines. This makes
Hadoop very economical. Also, as Hadoop is open-source software, there is no licensing
cost either.
Hadoop Features
4. Distributed Processing
In Hadoop, any job submitted by the client gets divided into a number of sub-tasks.
These sub-tasks are independent of each other, so they execute in parallel, giving
high throughput.
5. Distributed Storage
Hadoop splits each file into a number of blocks. These blocks are stored in a distributed
manner on the cluster of machines.
6. Fault Tolerance
Hadoop replicates every block of a file multiple times, depending on the replication factor,
which is 3 by default. If any node goes down, the data on that node can be recovered,
because copies of its blocks are available on other nodes thanks to replication. This makes
Hadoop fault tolerant. The replication factor can be tuned, as sketched below.
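As a small illustration (a sketch only; it assumes a reachable HDFS cluster, and /user/demo/data.txt is a hypothetical path), the replication factor can be set as a cluster-wide default via the dfs.replication property or per file through the FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default for newly written files
        // (normally configured in hdfs-site.xml).
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);

        // Per-file override; the path below is hypothetical.
        Path file = new Path("/user/demo/data.txt");
        boolean changed = fs.setReplication(file, (short) 3);
        System.out.println("Replication updated: " + changed);

        fs.close();
    }
}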
Prerequisites to Learn Hadoop
Familiarity with some basic Linux commands – Hadoop is set up over the Linux
operating system, preferably Ubuntu. So one must know certain basic Linux commands.
These commands are used for uploading files to HDFS, downloading files from HDFS, and
so on.
Basic Java concepts – one can get started in Hadoop while simultaneously grasping
basic concepts of Java.
We can write map and reduce functions in Hadoop using other languages too, such as
Python, Perl, C, Ruby, etc. This is possible via the Streaming API, which supports
reading from standard input and writing to standard output (a minimal sketch follows below).
Hadoop also has high-level abstraction tools like Pig and Hive which do not require
familiarity with Java.
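As a minimal sketch of the Streaming protocol (illustrative only: streaming jobs are usually written in scripting languages such as Python, but any executable that reads lines from standard input and writes tab-separated key-value pairs to standard output will do, even a plain Java program):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.StringTokenizer;

// A streaming-style mapper: reads raw lines from stdin and emits
// "key<TAB>value" pairs on stdout, which is all Hadoop Streaming expects.
public class StreamingMapper {
    public static void main(String[] args) throws Exception {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            StringTokenizer tok = new StringTokenizer(line);
            while (tok.hasMoreTokens()) {
                // Emit one (word, 1) pair per token; a reducer sums the 1s.
                System.out.println(tok.nextToken() + "\t1");
            }
        }
    }
}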
Components of Hadoop
• The master is a high-end machine, whereas the slaves are inexpensive computers. Big Data files
get divided into a number of blocks, and Hadoop stores these blocks in a distributed fashion
on the cluster of slave nodes. On the master, we store the metadata.
Hadoop Distributed File System (HDFS)
[Figure: the NameNode holds the metadata, with a Secondary NameNode performing checkpointing. A 300 MB file is split into blocks of 128 MB, 128 MB and 44 MB.]
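The block arithmetic from the figure as a tiny sketch (128 MB is the default HDFS block size in Hadoop 2.x; the last block only occupies as much space as its actual data):

public class BlockSplitDemo {
    public static void main(String[] args) {
        long fileMb = 300;   // file size from the figure above
        long blockMb = 128;  // default HDFS block size in Hadoop 2.x

        long fullBlocks = fileMb / blockMb;   // 2 full 128 MB blocks
        long lastBlockMb = fileMb % blockMb;  // 44 MB remainder block

        System.out.println(fullBlocks + " x " + blockMb + " MB + "
                + lastBlockMb + " MB = " + fileMb + " MB");
    }
}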
HDFS – Write Pipeline
HDFS – Read Pipeline
MapReduce
MapReduce processes data in two phases. They are:
Map Phase – this phase applies business logic to the data; the input data gets converted
into key-value pairs.
Reduce Phase – this phase takes the key-value pairs produced by the map tasks, groups
them by key, and aggregates them into the final result.
[Figure: multiple Map( ) tasks feeding multiple Reduce( ) tasks.]
MapReduce
• The client specifies the input file for the Map function, and the framework splits it into tuples.
• The Map function derives a key and a value from the input file. The output of the map function
is this key-value pair.
• MapReduce framework sorts the key-value pair from map function.
• The framework merges the tuples having the same key together.
• The reducers get these merged key-value pairs as input.
• Reducer applies aggregate functions on key-value pair.
• The output from the reducer gets written to HDFS.
Word Count Example: Input, Splitting, Mapping, Shuffling, Reducing, Final Result

Input – each line is read as a (K1, V1) record:
This is a mirror
Mirror shows reflection
Reflection is a property

Splitting: each line becomes one input split, handled by its own map task.

Mapping – each map task emits List (K2, V2), one (word, 1) pair per word:
Split 1: (This, 1) (is, 1) (a, 1) (mirror, 1)
Split 2: (Mirror, 1) (shows, 1) (reflection, 1)
Split 3: (Reflection, 1) (is, 1) (a, 1) (property, 1)

Shuffling – the framework groups pairs with the same key (word case is folded, so Mirror/mirror count as one key):
This, (1)
is, (1, 1)
a, (1, 1)
mirror, (1, 1)
shows, (1)
reflection, (1, 1)
property, (1)

Reducing – each reducer sums the values per key, producing List (K3, V3):
This, 1
is, 2
a, 2
mirror, 2
shows, 1
reflection, 2
property, 1

Final Result: This, 1; is, 2; a, 2; mirror, 2; shows, 1; reflection, 2; property, 1
1. Mapper Code: the mapper logic is written here; it runs in parallel on different
nodes.
2. Reducer Code: the mapper's output is taken as input to the reducer, which does all
the aggregation.
3. Driver Code: it carries all the job configurations, i.e. job name, input path, output
path, mapper class, etc. A sketch of all three parts follows.
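A minimal sketch of all three parts for the word-count walkthrough above, using the classic Hadoop Java MapReduce API (class names and the two command-line path arguments are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // 1. Mapper Code: emits (word, 1) for every word in its input split.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // 2. Reducer Code: sums the 1s for each word after the shuffle phase.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // 3. Driver Code: wires job name, mapper/reducer classes and paths.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, a job like this would typically be launched with: hadoop jar wordcount.jar WordCount <input path> <output path>.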
Yet Another Resource Negotiator (YARN)
Resource Manager:
• Resource Manager runs on the master node.
• It knows the location of the slaves (Rack
Awareness).
• It is aware of how many resources each slave
has.
• Resource Scheduler decides how the resources get
assigned to various tasks.
• Application Manager is one more service run by
Resource Manager.
• Application Manager negotiates the first container
for an application.
• Resource Manager keeps track of the heartbeats
from the Node Managers.
Yet Another Resource Negotiator (YARN)
Node Manager:
• It runs on slave machines.
• It manages containers. Containers are nothing but
a fraction of the Node Manager's resource capacity.
• Node Manager monitors the resource utilization of
each container.
• It sends heartbeats to the Resource Manager.
Yet Another Resource Negotiator (YARN)
Happy Learning