0% found this document useful (0 votes)

53 views32 pages

Unit-2 Hadoop and MapReduce

Uploaded by

Kalighat Okira

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

53 views32 pages

Unit-2 Hadoop and MapReduce

Uploaded by

Kalighat Okira

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPTX, PDF, TXT or read online on Scribd

You are on page 1/ 32

Hadoop and MapReduce

By
Prof Shibdas Dutta
Associate Professor,

DCGDATACORESYSTEMSINDIAPVTLTD
Kolkata
Big data and Parallel Processing

• Data from its storage capacity to computer capacity (which includes memory) and back
to storage once important results are computed. With big data, we have more data that
will fit on a single computer.

• Linear processing: problem statement is broken into set of instructions that are
executed sequentially till all instructions are completed successfully.

If an error occurs in any one of the instructions, the entire sequence of instruction
is executed from the beginning after the error has been resolved. Linear processing is
best suited for minor computing task and is inefficient and time consuming when it
comes to processing complex problems such as big data.
Big data and Parallel Processing

Parallel Processing: the problem statement is broken down into a set of executable
instructions. The instructions are the distributed to multiple execution nodes of equal
processing power and are executed in parallel. Since the instructions are run on separate
execution nodes, errors can be fixed and executed locally independent of other
instructions. Reduce computation time, less memory and computer requirements.

Flexibility: The biggest advantage of parallel processing is that execution nodes can be
added or removed as and when required. This significantly reduces the infrastructure cost.
What is Hadoop?

• Hadoop is the solution to Big Data problems. It is the technology to store massive
datasets on a cluster of cheap machines in a distributed manner.

• It is an open-source software developed as a project by Apache Software Foundation.

• Doug Cutting created Hadoop. In the year 2008 Yahoo gave Hadoop to Apache
Software Foundation. Since then two versions of Hadoop has come.

• Version 1.0 in the year 2011 and version 2.0.6 in the year 2013. Hadoop comes in
various flavors like Cloudera, IBM BigInsight, MapR and Hortonworks.
History of Hadoop
From Humble Beginnings to Big Data Giant
Hadoop, the open-source framework synonymous with large-scale data processing, boasts a rich
and fascinating history. Here's a breakdown of its key milestones:

Early days (2002-2004):

2002: Doug Cutting and Mike Cafarella develop the Apache Nutch project, an open-source web
crawler used to build search engines.
2003: Google publishes a paper on its GFS (Google File System), highlighting its distributed
nature and high scalability.
2004: Google engineers Jeff Dean and Sanjay Ghemawat present a paper on MapReduce, a
programming model for processing large datasets across clusters of commodity hardware.

Birth of Hadoop (2005-2006):

2005: Nutch developers face limitations managing and analyzing their crawl data. They adapt
GFS and MapReduce to create the Nutch Distributed File System (NDFS) and Nutch-MR.
2006: Doug Cutting leaves Google and joins Yahoo. Recognizing the potential of NDFS and
MapReduce, he merges them into a new project called "Hadoop," named after his son's toy
elephant.
History of Hadoop.....cont...
Rapid Growth and Adoption (2007-2010):

2007: Yahoo runs Hadoop on a 1,000-node cluster, demonstrating its ability to handle massive
datasets.
2008: Hadoop graduates to become a top-level Apache Software Foundation project, signifying
its growing community and industry interest.
2009: The first "Hadoop World" conference takes place, bringing together developers and users
from around the globe.
2010: The release of Hadoop 1.0 marks a significant milestone, solidifying its core architecture
and APIs.

Evolution and Diversification (2011-present):

2011-2012: Several subprojects emerge within the Hadoop ecosystem, including HBase (NoSQL
database), Hive (data warehouse), Pig (high-level data flow language), and ZooKeeper
(distributed coordination service).
2013-2014: Hadoop YARN (Yet Another Resource Negotiator) becomes the default resource
management system, improving scalability and flexibility.
2015: The Apache Spark project gains traction as a faster alternative for certain use cases.
2016-present: Cloud deployments of Hadoop become increasingly popular, driven by ease of use
and scalability. Hadoop continues to evolve with new features and integrations, remaining a
History of Hadoop.....cont...

Key Notes:

 Hadoop emerged from the need to manage and analyze large datasets at scale.

 Its open-source nature and community contributions fueled its rapid adoption.

 The ecosystem has diversified, offering various tools and technologies for diverse big data
needs.

 Though challenged by newer technologies, Hadoop remains a crucial player in the big data
landscape.
Why Hadoop?
• Apache Hadoop is not only a storage system but is a platform for data storage as well
as processing.

• It is scalable (as we can add more nodes on the fly), Fault-tolerant (Even if nodes go
down, data processed by another node).

• Following characteristics of Hadoop make it a unique platform:

• Flexibility to store and mine any type of data whether it is structured, semi-
structured or unstructured. It is not bounded by a single schema.
• Excels at processing data of complex nature. Its scale-out architecture divides
workloads across many nodes. Another added advantage is that its flexible file-
system eliminates ETL bottlenecks.
• Scales economically, it can deploy on commodity hardware. Apart from this its
open-source nature guards against vendor lock.
Hadoop Architecture
Hadoop Features

1. Reliability
In the Hadoop cluster, if any node goes down, it will not disable the whole cluster.
Instead, another node will take the place of the failed node. Hadoop cluster will continue
functioning as nothing has happened. Hadoop has built-in fault tolerance feature.

2. Scalable
Hadoop gets integrated with cloud-based service. If you are installing Hadoop on the
cloud you need not worry about scalability. You can easily procure more hardware and
expand your Hadoop cluster within minutes.

3. Economical
Hadoop gets deployed on commodity hardware which is cheap machines. This makes
Hadoop very economical. Also as Hadoop is an open system software there is no cost of
license too.
Hadoop Features

4. Distributed Processing
In Hadoop, any job submitted by the client gets divided into the number of sub-tasks.
These sub-tasks are independent of each other. Hence they execute in parallel giving
high throughput.

5. Distributed Storage
Hadoop splits each file into the number of blocks. These blocks get stored distributedly
on the cluster of machines.

6. Fault Tolerance
Hadoop replicates every block of file many times depending on the replication factor.
Replication factor is 3 by default. In Hadoop suppose any node goes down then the data
on that node gets recovered. This is because this copy of the data would be available on
other nodes due to replication. Hadoop is fault tolerant.
Prerequisites to Learn Hadoop

Familiarity with some basic Linux Command – Hadoop is set up over Linux
Operating System preferable Ubuntu. So one must know certain basic Linux commands.
These commands are for uploading the file in HDFS, downloading the file from HDFS and
so on.

Basic Java concepts – One can get started in Hadoop while simultaneously grasping
basic concepts of Java.

We can write map and reduce functions in Hadoop using other languages too.
And these are Python, Perl, C, Ruby, etc. This is possible via streaming API. It supports
reading from standard input and writing to standard output.

Hadoop also has high-level abstractions tools like Pig and Hive which do not require
familiarity with Java.
Components of Hadoop

Hadoop consists of three core components –

1. Hadoop Distributed File System (HDFS) – It is the storage layer of

Hadoop.
2. Map-Reduce – It is the data processing layer of Hadoop.
3. YARN – It is the resource management layer of Hadoop.
Hadoop Distributed File System (HDFS)

• HDFS has a master-slave topology.

• Master is a high-end machine whereas slaves are inexpensive computers. The Big Data files
get divided into the number of blocks. Hadoop stores these blocks in a distributed fashion
on the cluster of slave nodes. On the master, we have metadata stored.
Hadoop Distributed File System (HDFS)

HDFS has two daemons running for it. They are :

NameNode : NameNode performs following functions –

• NameNode Daemon runs on the master machine.

• It is responsible for maintaining, monitoring and managing DataNodes.
• It records the metadata of the files like the location of blocks, file size, permission,
hierarchy etc.
• Namenode captures all the changes to the metadata like deletion, creation and
renaming of the file in edit logs.
• It regularly receives heartbeat and block reports from the DataNodes.
Hadoop Distributed File System (HDFS)

DataNode runs on the slave machine.

• It stores the actual business data.

• It serves the read-write request from the user.
• DataNode does the ground work of creating, replicating and deleting the blocks on
the command of NameNode.
• After every 3 seconds, by default, it sends heartbeat to NameNode reporting the
health of HDFS.
Hadoop Distributed File System (HDFS)

Secondary
NameNode
Secondary NameNode & Checkpointing NameNode

• Checkpointing is a process of combining

editLogs with FSImage (File System Image) editLog editLog
• Secondary NameNode takes over the
responsibility of checkpointing, therefore
making NameNode more available. FSImag FSImag
e e
• Allows faster failover as it prevents edit
logs from getting too huge.
• Checkpointing happens periodically editLog FSImag
(New) e (final)
(default: 1 hour)
Hadoop Distributed File System (HDFS)

NameNod
300 MB e

DataNode DataNode DataNode

128 MB 128 MB 44 MB

128 128
44 MB
MB MB
HDFS – Write Pipeline
HDFS – Write Pipeline
HDFS – Write Pipeline
HDFS – Read Pipeline
MapReduce

It is the data processing layer of Hadoop. It processes data in two phases.

They are:-

Map Phase- This phase applies business logic to the data. The input data gets converted
into key-value pairs.

Reduce Phase- The Reduce

phase takes as input the
output of Map Phase. It
applies aggregation based on
the key of the key-value
pairs.
MapReduce

Map( )

Reduce( )

Input Map( ) Ouput

Reduce( )

Map( )

Reduce
Map Tasks
Tasks
MapReduce

Map-Reduce works in the following way:

• The client specifies the file for input to the Map function. It splits it into tuples
• Map function defines key and value from the input file. The output of the map function
is this key-value pair.
• MapReduce framework sorts the key-value pair from map function.
• The framework merges the tuples having the same key together.
• The reducers get these merged key-value pairs as input.
• Reducer applies aggregate functions on key-value pair.
• The output from the reducer gets written to HDFS.
Input Splitting Mapping Shufflin Reducin Final Result
g g
List (K2,
K1, V1 V2) This, (1) This, 1
This – 1
This is a is -1
mirror. a -1 is, (1,1) is, 2
mirror - 1
List (K3,
V3)
a, (1,1) a, 2
This is a mirror This, 1
Mirror – 1 is, 2
Mirror shows
Mirror shows shows -1 a, 2
reflection
Reflection is a reflection reflection - mirror, 2 mirror, 2
mirror, (1,1)
property 1 shows, 1
reflection, 2
property, 1
Reflection shows, (1) shows, 1
–1
Reflection is is – 1
a property a -1
reflection, (1,1) reflection, 2
property -
1

property, (1) property, 1

K1, List
(V2)
MapReduce

Three major parts of MapReduce program:

1. Mapper Code: Mapper logic is written over here, which processes parallelly on
different node.

2. Reducer Code: Mapper’s output will be taken as an input to reducer. It does all
the aggregations.

3. Driver Code: It carries all the job configurations i.e. job name, input path, output
path, mapper class etc.
Yet Another Resource Locator (YARN)

Resource Manager:
• Resource Manager runs on the master node.
• It knows where the location of slaves (Rack
Awareness).
• It is aware about how much resources each slave
have.
• Resource Scheduler decides how the resources get
assigned to various tasks.
• Application Manager is one more service run by
Resource Manager.
• Application Manager negotiates the first container
for an application.
• Resource Manager keeps track of the heart beats
from the Node Manager.
Yet Another Resource Locator (YARN)

Node Manager:
• It runs on slave machines.
• It manages containers. Containers are nothing but
a fraction of Node Manager’s resource capacity
• Node manager monitors resource utilization of
each container.
• It sends heartbeat to Resource Manager.
Yet Another Resource Locator (YARN)

The application startup process is as follows:-

• The client submits the job to Resource Manager.
• Resource Manager contacts Resource Scheduler and allocates container.
• Now Resource Manager contacts the relevant Node Manager to launch the
container.
• Container runs Application Master.
HDFS Commands

1. ls: This command is used to list all the files

hdfs dfs -ls <path>

2. mkdir: To create a directory.

hdfs dfs -mkdir <folder name>

3…..
Happy Learning

MC4411 Project Work - Format
No ratings yet
MC4411 Project Work - Format
65 pages
Module 4-1
No ratings yet
Module 4-1
21 pages
Cassandra Hadoop Integration
No ratings yet
Cassandra Hadoop Integration
2 pages
IR Unit III - Notes
No ratings yet
IR Unit III - Notes
18 pages
Unit V Easy To Learn
No ratings yet
Unit V Easy To Learn
21 pages
Certified Data Engineer Professional Questions Answers Only
100% (1)
Certified Data Engineer Professional Questions Answers Only
96 pages
Introduction To Big Data Analytics
No ratings yet
Introduction To Big Data Analytics
33 pages
Principles of Parallel and Distributed Computing
No ratings yet
Principles of Parallel and Distributed Computing
34 pages
Introduction To Information Technology Turban, Rainer and Potter John Wiley & Sons, Inc
No ratings yet
Introduction To Information Technology Turban, Rainer and Potter John Wiley & Sons, Inc
39 pages
Moodle Administrator: User Manual For Faculty Members
No ratings yet
Moodle Administrator: User Manual For Faculty Members
79 pages
Unit 4
No ratings yet
Unit 4
57 pages
Social Media and Web Analytics Unit-5
No ratings yet
Social Media and Web Analytics Unit-5
10 pages
(IJCST-V12I6P9) :Mrs.N.Dhivya, Mrs.S.Senthamarai Selvi, R.Gayathri
No ratings yet
(IJCST-V12I6P9) :Mrs.N.Dhivya, Mrs.S.Senthamarai Selvi, R.Gayathri
5 pages
Concepts of Abstraction and Virtualization in Cloud Computing
No ratings yet
Concepts of Abstraction and Virtualization in Cloud Computing
20 pages
MIS Part 2 Management Information System
No ratings yet
MIS Part 2 Management Information System
127 pages
Future Skills - An Introduction, General Overview of The Future Skills Sub-Sector-1
No ratings yet
Future Skills - An Introduction, General Overview of The Future Skills Sub-Sector-1
15 pages
50 MCQ Database Questions
No ratings yet
50 MCQ Database Questions
16 pages
Social Information Filtering
No ratings yet
Social Information Filtering
25 pages
Implementation Issues Task
No ratings yet
Implementation Issues Task
18 pages
Velocity v9 Best Practices
100% (1)
Velocity v9 Best Practices
818 pages
MSC IT Syllabus
93% (15)
MSC IT Syllabus
69 pages
BDA Answers-1
No ratings yet
BDA Answers-1
15 pages
A Kannada English Dictionary PDF
No ratings yet
A Kannada English Dictionary PDF
1,833 pages
Unit - 1
No ratings yet
Unit - 1
14 pages
Chapter - 1 Introduction
No ratings yet
Chapter - 1 Introduction
22 pages
Lecture 5. Industry 4.0 Technologies and Enterprise Architecture
No ratings yet
Lecture 5. Industry 4.0 Technologies and Enterprise Architecture
13 pages
Multi-Criteria Decision Making
No ratings yet
Multi-Criteria Decision Making
5 pages
Rainfall Prediction in India Using Multiple Linear Regression
No ratings yet
Rainfall Prediction in India Using Multiple Linear Regression
3 pages
Iwt - Session 1 - Unit 2 - III Bcom CA B
100% (1)
Iwt - Session 1 - Unit 2 - III Bcom CA B
31 pages
NetApp Ncsa PDF
No ratings yet
NetApp Ncsa PDF
96 pages
16-029 Pentaho Hadoop Ebook v7
No ratings yet
16-029 Pentaho Hadoop Ebook v7
22 pages
Chapter 1
No ratings yet
Chapter 1
27 pages
E Commerce: AN Indian Perspective, Fifth Edition by P.T. Joseph, S.J
No ratings yet
E Commerce: AN Indian Perspective, Fifth Edition by P.T. Joseph, S.J
3 pages
Web Services
No ratings yet
Web Services
2 pages
GDG Notes
No ratings yet
GDG Notes
8 pages
Exception Handling in Java PDF
No ratings yet
Exception Handling in Java PDF
1 page
JSP Program
No ratings yet
JSP Program
10 pages
Big Data Fund
No ratings yet
Big Data Fund
5 pages
Field Type
No ratings yet
Field Type
36 pages
Big Data Analytics PDF
No ratings yet
Big Data Analytics PDF
22 pages
Bda Unit 1
No ratings yet
Bda Unit 1
32 pages
Shelf Life Based On SL PPDS
No ratings yet
Shelf Life Based On SL PPDS
15 pages
Java Notes
No ratings yet
Java Notes
169 pages
Printed 黃建華Oracle - EBS Workflow
No ratings yet
Printed 黃建華Oracle - EBS Workflow
90 pages
Introduction To Operating System (OS) : Associate Professor, DCG Data Core Systems India PVT LTD Kolkata
No ratings yet
Introduction To Operating System (OS) : Associate Professor, DCG Data Core Systems India PVT LTD Kolkata
59 pages
PBS Software Solutions 2006
No ratings yet
PBS Software Solutions 2006
68 pages
Cloning Oracle E-Business Suite Release 12.2 With Rapid Clone (Doc ID 1383621.1)
100% (1)
Cloning Oracle E-Business Suite Release 12.2 With Rapid Clone (Doc ID 1383621.1)
31 pages
Dpu Mba Artificial Intelligence & Machine Learning Management
No ratings yet
Dpu Mba Artificial Intelligence & Machine Learning Management
26 pages
Unit-4 Big Data Analytics Methods Using R
No ratings yet
Unit-4 Big Data Analytics Methods Using R
57 pages
Lesson 03 - Configuring File Access and Printers On Windows 7 Clients
No ratings yet
Lesson 03 - Configuring File Access and Printers On Windows 7 Clients
27 pages
Unit-1 Introduction To Big Data Analytics
No ratings yet
Unit-1 Introduction To Big Data Analytics
57 pages
GC 2024 12 20
No ratings yet
GC 2024 12 20
32 pages
Unit-3 Hadoop Environment
No ratings yet
Unit-3 Hadoop Environment
31 pages
Marco - Data Governance Fundamentals
No ratings yet
Marco - Data Governance Fundamentals
34 pages
Unit 4
No ratings yet
Unit 4
29 pages
Cloud Computing Unit-1
No ratings yet
Cloud Computing Unit-1
61 pages
DBMS End Term
No ratings yet
DBMS End Term
27 pages
File Handling
No ratings yet
File Handling
7 pages
Introduction To Information and Big Data Security
No ratings yet
Introduction To Information and Big Data Security
39 pages
PG - M.sc. - Computer Science - 34141 Data Mining and Ware Housing
No ratings yet
PG - M.sc. - Computer Science - 34141 Data Mining and Ware Housing
192 pages
Machine Learning With Python - Machine Learning Algorithms - K-Means Clustering Algo
No ratings yet
Machine Learning With Python - Machine Learning Algorithms - K-Means Clustering Algo
25 pages
Machine Learning With Python - Machine Learning Algorithms - Decision Tree
No ratings yet
Machine Learning With Python - Machine Learning Algorithms - Decision Tree
17 pages
Oracle DBA Overview July 19
No ratings yet
Oracle DBA Overview July 19
21 pages
Cloud Computing Notes Btech Csvtu 6th Sem
No ratings yet
Cloud Computing Notes Btech Csvtu 6th Sem
12 pages
Final - Unit 1.1
No ratings yet
Final - Unit 1.1
23 pages
ST-1 Solution Big Data KCS061
No ratings yet
ST-1 Solution Big Data KCS061
26 pages
Spark Essentials
No ratings yet
Spark Essentials
15 pages
Harnessing Big Data
No ratings yet
Harnessing Big Data
29 pages
Aditya Paruchuri
No ratings yet
Aditya Paruchuri
7 pages
Ccs367-Question Bank
No ratings yet
Ccs367-Question Bank
23 pages
Oracle Apps Technical Concept
No ratings yet
Oracle Apps Technical Concept
14 pages
Data Analytics Curriculum
No ratings yet
Data Analytics Curriculum
13 pages
IOT (Unit III) ...
No ratings yet
IOT (Unit III) ...
39 pages
Machine Learning With Python - Machine Learning Algorithms - Linear Regression
No ratings yet
Machine Learning With Python - Machine Learning Algorithms - Linear Regression
8 pages
IT8005 EC Notes UNIT 4
No ratings yet
IT8005 EC Notes UNIT 4
27 pages
Tourism Report PDF
No ratings yet
Tourism Report PDF
40 pages
BDM Unit I Slides Part 1
No ratings yet
BDM Unit I Slides Part 1
27 pages
BW Quiz From SearchSAP
No ratings yet
BW Quiz From SearchSAP
17 pages
Assignment - 1 - Linux - Baisc - Commands Submission
No ratings yet
Assignment - 1 - Linux - Baisc - Commands Submission
6 pages
Lesson 6 Virtualization
No ratings yet
Lesson 6 Virtualization
5 pages
Certified List of Candidates: Commission On Elections
No ratings yet
Certified List of Candidates: Commission On Elections
7 pages
Hoc SQL
No ratings yet
Hoc SQL
3 pages
CS2255 Au Question Paper
No ratings yet
CS2255 Au Question Paper
7 pages
21IT1701 - MCMAD - Unit IV Notes
No ratings yet
21IT1701 - MCMAD - Unit IV Notes
23 pages
Business Analytics CDS
No ratings yet
Business Analytics CDS
5 pages
Turban Dss9e Ch08
No ratings yet
Turban Dss9e Ch08
39 pages
Text Analytics and Text Mining Overview
No ratings yet
Text Analytics and Text Mining Overview
16 pages
Univ - QP - DC - Case Study Based Questions
No ratings yet
Univ - QP - DC - Case Study Based Questions
6 pages
Project
No ratings yet
Project
14 pages
IT Infrastructure Architecture: Hosting and Deployment Options (Chapter 13 - 14)
No ratings yet
IT Infrastructure Architecture: Hosting and Deployment Options (Chapter 13 - 14)
31 pages
Data Sheet Five9 Geographic Redundancy
No ratings yet
Data Sheet Five9 Geographic Redundancy
2 pages
Virtualization and Five Step Process
No ratings yet
Virtualization and Five Step Process
19 pages
Jboss Eap Infinispan
No ratings yet
Jboss Eap Infinispan
2 pages
HDInsight Essentials - Second Edition
From Everand
HDInsight Essentials - Second Edition
Rajesh Nadipalli
No ratings yet
Optimizing Hadoop for MapReduce
From Everand
Optimizing Hadoop for MapReduce
Khaled Tannir
No ratings yet
IoT Platform Complete Self-Assessment Guide
From Everand
IoT Platform Complete Self-Assessment Guide
Gerardus Blokdyk
No ratings yet
Trackpad Pro Ver. 5.0 Class 6
From Everand
Trackpad Pro Ver. 5.0 Class 6
Nidhi Arora
No ratings yet
The Datadog Handbook: A Guide to Monitoring, Metrics, and Tracing
From Everand
The Datadog Handbook: A Guide to Monitoring, Metrics, and Tracing
Robert Johnson
No ratings yet

Unit-2 Hadoop and MapReduce

Uploaded by

Unit-2 Hadoop and MapReduce

Uploaded by

Hadoop and MapReduce

• It is an open-source software developed as a project by Apache Software Foundation.

Early days (2002-2004):

Birth of Hadoop (2005-2006):

Evolution and Diversification (2011-present):

• Following characteristics of Hadoop make it a unique platform:

Hadoop consists of three core components –

1. Hadoop Distributed File System (HDFS) – It is the storage layer of

• HDFS has a master-slave topology.

HDFS has two daemons running for it. They are :

NameNode : NameNode performs following functions –

• NameNode Daemon runs on the master machine.

DataNode runs on the slave machine.

• It stores the actual business data.

• Checkpointing is a process of combining

DataNode DataNode DataNode

It is the data processing layer of Hadoop. It processes data in two phases.

Reduce Phase- The Reduce

Input Map( ) Ouput

Map-Reduce works in the following way:

property, (1) property, 1

Three major parts of MapReduce program:

The application startup process is as follows:-

1. ls: This command is used to list all the files

hdfs dfs -ls <path>

2. mkdir: To create a directory.

hdfs dfs -mkdir <folder name>

You might also like