
Unit-2: Hadoop and MapReduce

By
Prof Shibdas Dutta
Associate Professor,

DCG Data Core Systems India Pvt Ltd
Kolkata
Big data and Parallel Processing

• Data moves from storage to compute capacity (which includes memory) and back to storage once
the important results are computed. With big data, we have more data than will fit on a single
computer.

• Linear processing: the problem statement is broken into a set of instructions that are
executed sequentially until all instructions are completed successfully.

If an error occurs in any one of the instructions, the entire sequence of instructions is
executed from the beginning after the error has been resolved. Linear processing is best suited
for minor computing tasks and is inefficient and time-consuming when it comes to processing
complex problems such as big data.
Big data and Parallel Processing

Parallel Processing: the problem statement is broken down into a set of executable
instructions. The instructions are then distributed to multiple execution nodes of equal
processing power and are executed in parallel. Since the instructions run on separate
execution nodes, errors can be fixed and re-executed locally, independent of other
instructions. This reduces computation time and requires less memory and compute per node.

Flexibility: the biggest advantage of parallel processing is that execution nodes can be
added or removed as and when required. This significantly reduces the infrastructure cost.
What is Hadoop?

• Hadoop is a solution to big data problems. It is a technology for storing massive datasets
on a cluster of inexpensive machines in a distributed manner.

• It is open-source software developed as a project by the Apache Software Foundation.

• Doug Cutting created Hadoop. In 2008 Hadoop became a top-level Apache Software Foundation
project. Since then, two major versions of Hadoop have been released.

• Version 1.0 in 2011 and version 2.0.6 in 2013. Hadoop comes in various flavors such as
Cloudera, IBM BigInsights, MapR and Hortonworks.
History of Hadoop
From Humble Beginnings to Big Data Giant
Hadoop, the open-source framework synonymous with large-scale data processing, boasts a rich
and fascinating history. Here's a breakdown of its key milestones:

Early days (2002-2004):

2002: Doug Cutting and Mike Cafarella develop the Apache Nutch project, an open-source web
crawler used to build search engines.
2003: Google publishes a paper on its GFS (Google File System), highlighting its distributed
nature and high scalability.
2004: Google engineers Jeff Dean and Sanjay Ghemawat present a paper on MapReduce, a
programming model for processing large datasets across clusters of commodity hardware.

Birth of Hadoop (2005-2006):

2005: Nutch developers face limitations managing and analyzing their crawl data. They adapt
GFS and MapReduce to create the Nutch Distributed File System (NDFS) and Nutch-MR.
2006: Doug Cutting joins Yahoo. Recognizing the potential of NDFS and MapReduce, he spins them
out into a new project called "Hadoop," named after his son's toy elephant.
History of Hadoop.....cont...
Rapid Growth and Adoption (2007-2011):

2007: Yahoo runs Hadoop on a 1,000-node cluster, demonstrating its ability to handle massive
datasets.
2008: Hadoop graduates to become a top-level Apache Software Foundation project, signifying
its growing community and industry interest.
2009: The first "Hadoop World" conference takes place, bringing together developers and users
from around the globe.
2011: The release of Hadoop 1.0 marks a significant milestone, solidifying its core
architecture and APIs.

Evolution and Diversification (2011-present):

2011-2012: Ecosystem projects around Hadoop mature and gain adoption, including HBase (NoSQL
database), Hive (data warehouse), Pig (high-level data flow language), and ZooKeeper
(distributed coordination service).
2013-2014: Hadoop YARN (Yet Another Resource Negotiator) becomes the default resource
management system, improving scalability and flexibility.
2015: The Apache Spark project gains traction as a faster alternative for certain use cases.
2016-present: Cloud deployments of Hadoop become increasingly popular, driven by ease of use
and scalability. Hadoop continues to evolve with new features and integrations, remaining a
crucial part of the big data landscape.
History of Hadoop.....cont...

Key Notes:

• Hadoop emerged from the need to manage and analyze large datasets at scale.

• Its open-source nature and community contributions fueled its rapid adoption.

• The ecosystem has diversified, offering various tools and technologies for diverse big data
needs.

• Though challenged by newer technologies, Hadoop remains a crucial player in the big data
landscape.
Why Hadoop?
• Apache Hadoop is not only a storage system but is a platform for data storage as well
as processing.

• It is scalable (we can add more nodes on the fly) and fault-tolerant (even if nodes go down,
data is processed by another node).

• Following characteristics of Hadoop make it a unique platform:

• Flexibility to store and mine any type of data, whether it is structured, semi-structured
or unstructured. It is not bound by a single schema.
• Excels at processing data of complex nature. Its scale-out architecture divides workloads
across many nodes. Another advantage is that its flexible file system eliminates ETL
bottlenecks.
• Scales economically: it can be deployed on commodity hardware. Apart from this, its
open-source nature guards against vendor lock-in.
Hadoop Architecture
Hadoop Features

1. Reliability
In a Hadoop cluster, if any node goes down, it does not disable the whole cluster. Instead,
another node takes the place of the failed node, and the cluster continues functioning as if
nothing had happened. Hadoop has a built-in fault-tolerance feature.

2. Scalable
Hadoop integrates with cloud-based services. If you are installing Hadoop on the cloud, you
need not worry about scalability. You can easily procure more hardware and expand your Hadoop
cluster within minutes.

3. Economical
Hadoop gets deployed on commodity hardware, i.e. inexpensive machines. This makes Hadoop very
economical. Also, as Hadoop is open-source software, there is no licensing cost either.
Hadoop Features

4. Distributed Processing
In Hadoop, any job submitted by the client gets divided into a number of sub-tasks. These
sub-tasks are independent of each other, so they execute in parallel, giving high throughput.

5. Distributed Storage
Hadoop splits each file into a number of blocks. These blocks are stored in a distributed
manner on the cluster of machines.

6. Fault Tolerance
Hadoop replicates every block of a file several times, depending on the replication factor
(3 by default). If any node goes down, the data on that node is recovered, because copies of
the data are available on other nodes due to replication. This makes Hadoop fault tolerant.
(A small configuration sketch follows.)
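
The replication factor is normally set cluster-wide via the dfs.replication property in
hdfs-site.xml, but a client can also request a different value. A minimal sketch using the
Hadoop FileSystem Java API (the file path below is purely illustrative, not from these slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        // Reads fs.defaultFS and dfs.replication from the cluster configuration files
        Configuration conf = new Configuration();

        // Request a replication factor of 2 for files written by this client
        conf.setInt("dfs.replication", 2);
        FileSystem fs = FileSystem.get(conf);

        // Change the replication factor of an existing file (illustrative path)
        fs.setReplication(new Path("/user/demo/data.txt"), (short) 2);

        fs.close();
    }
}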
Prerequisites to Learn Hadoop

Familiarity with basic Linux commands – Hadoop is set up over the Linux operating system,
preferably Ubuntu, so one must know certain basic Linux commands. These commands are used for
uploading files to HDFS, downloading files from HDFS, and so on.

Basic Java concepts – One can get started with Hadoop while simultaneously grasping basic
concepts of Java.

We can also write map and reduce functions in other languages, such as Python, Perl, C and
Ruby. This is possible via the streaming API, which reads from standard input and writes to
standard output.

Hadoop also has high-level abstraction tools like Pig and Hive which do not require
familiarity with Java.
Components of Hadoop

Hadoop consists of three core components –

1. Hadoop Distributed File System (HDFS) – It is the storage layer of Hadoop.
2. Map-Reduce – It is the data processing layer of Hadoop.
3. YARN – It is the resource management layer of Hadoop.
Hadoop Distributed File System (HDFS)

• HDFS has a master-slave topology.

• The master is a high-end machine, whereas the slaves are inexpensive computers. Big data
files get divided into a number of blocks, and Hadoop stores these blocks in a distributed
fashion on the cluster of slave nodes. The metadata is stored on the master.
Hadoop Distributed File System (HDFS)

HDFS has two daemons running for it:

NameNode – the NameNode performs the following functions:

• The NameNode daemon runs on the master machine.
• It is responsible for maintaining, monitoring and managing the DataNodes.
• It records the metadata of the files, such as the location of blocks, file size, permissions,
hierarchy, etc.
• The NameNode records all changes to the metadata, such as deletion, creation and renaming of
files, in edit logs.
• It regularly receives heartbeats and block reports from the DataNodes.
Hadoop Distributed File System (HDFS)

DataNode – the DataNode daemon runs on the slave machines.

• It stores the actual business data.
• It serves read/write requests from the user.
• The DataNode does the groundwork of creating, replicating and deleting blocks on the command
of the NameNode.
• Every 3 seconds, by default, it sends a heartbeat to the NameNode reporting its health.
Hadoop Distributed File System (HDFS)

Secondary NameNode & Checkpointing

• Checkpointing is the process of combining the edit logs with the FSImage (File System Image).
• The Secondary NameNode takes over the responsibility of checkpointing, thereby making the
NameNode more available.
• It allows faster failover, as it prevents the edit logs from growing too large.
• Checkpointing happens periodically (default: 1 hour).

[Diagram: the Secondary NameNode merges the current edit log with the FSImage to produce a new
(final) FSImage, and the NameNode starts a fresh edit log.]
Hadoop Distributed File System (HDFS)

[Diagram: a 300 MB file is split into blocks of 128 MB, 128 MB and 44 MB (with the default
128 MB block size, ceil(300 / 128) = 3 blocks). The NameNode holds the metadata, and each block
is stored on a different DataNode.]
HDFS – Write Pipeline
HDFS – Read Pipeline
MapReduce

It is the data processing layer of Hadoop. It processes data in two phases.

They are:-

Map Phase – This phase applies business logic to the data; the input data gets converted into
key-value pairs.

Reduce Phase – This phase takes the output of the Map phase as its input and applies
aggregation based on the keys of the key-value pairs.
MapReduce

[Diagram: the input is split across several Map tasks; their outputs are passed to Reduce
tasks, which produce the final output.]
MapReduce

Map-Reduce works in the following way:

• The client specifies the input file for the Map function; the framework splits it into
records (tuples).
• The Map function derives a key and a value from each input record; its output is a set of
key-value pairs.
• The MapReduce framework sorts the key-value pairs produced by the Map function.
• The framework merges the tuples having the same key together.
• The reducers get these merged key-value pairs as input.
• The reducer applies aggregate functions to the key-value pairs.
• The output from the reducer gets written to HDFS.
Word-count example (Input → Splitting → Mapping → Shuffling → Reducing → Final Result),
counting words case-insensitively:

Input:
  This is a mirror
  Mirror shows reflection
  Reflection is a property

Splitting (K1, V1): each line becomes one input record.

Mapping – list(K2, V2):
  This is a mirror         → (This, 1) (is, 1) (a, 1) (mirror, 1)
  Mirror shows reflection  → (Mirror, 1) (shows, 1) (reflection, 1)
  Reflection is a property → (Reflection, 1) (is, 1) (a, 1) (property, 1)

Shuffling – K2, list(V2):
  This, (1)   is, (1, 1)   a, (1, 1)   mirror, (1, 1)   shows, (1)   reflection, (1, 1)
  property, (1)

Reducing – list(K3, V3):
  This, 1   is, 2   a, 2   mirror, 2   shows, 1   reflection, 2   property, 1

Final Result:
  This, 1
  is, 2
  a, 2
  mirror, 2
  shows, 1
  reflection, 2
  property, 1
MapReduce

Three major parts of a MapReduce program:

1. Mapper Code: the mapper logic is written here; it runs in parallel on different nodes.

2. Reducer Code: the mapper's output is taken as input to the reducer, which does all the
aggregations.

3. Driver Code: it carries all the job configuration, i.e. the job name, input path, output
path, mapper class, etc. (A minimal word-count sketch of all three parts is given below.)
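
A minimal sketch of these three parts for a word-count job, using the standard Hadoop
MapReduce Java API (class names such as WordCount, TokenizerMapper and IntSumReducer are
illustrative):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper Code: emits (word, 1) for every word in its input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer Code: sums the counts received for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver Code: job name, mapper/reducer classes, input and output paths
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}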
Yet Another Resource Negotiator (YARN)

Resource Manager:
• The Resource Manager runs on the master node.
• It knows the location of the slaves (rack awareness).
• It is aware of how many resources each slave has.
• The Resource Scheduler decides how resources get assigned to the various tasks.
• The Application Manager is another service run by the Resource Manager.
• The Application Manager negotiates the first container for an application.
• The Resource Manager keeps track of the heartbeats from the Node Managers.
Yet Another Resource Negotiator (YARN)

Node Manager:
• It runs on the slave machines.
• It manages containers. Containers are nothing but a fraction of the Node Manager's resource
capacity.
• The Node Manager monitors the resource utilization of each container.
• It sends heartbeats to the Resource Manager.
Yet Another Resource Negotiator (YARN)

The application startup process is as follows:

• The client submits the job to the Resource Manager.
• The Resource Manager consults the Resource Scheduler and allocates a container.
• The Resource Manager then contacts the relevant Node Manager to launch the container.
• The container runs the Application Master.
HDFS Commands

1. ls: This command lists the files and directories at a given path.

hdfs dfs -ls <path>

2. mkdir: To create a directory.

hdfs dfs -mkdir <folder name>

3…..
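
These shell commands also have programmatic equivalents. A minimal sketch using the Hadoop
FileSystem Java API (the paths and the class name HdfsExample are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the cluster configuration on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Equivalent of: hdfs dfs -mkdir /user/demo
        fs.mkdirs(new Path("/user/demo"));

        // Equivalent of: hdfs dfs -ls /user
        for (FileStatus status : fs.listStatus(new Path("/user"))) {
            System.out.println(status.getPath());
        }

        fs.close();
    }
}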
Happy Learning
