12-05-2020 Dr Sumalatha Aradhya, Dept of CSE, SIT, Tumakuru
Src:https://images.app.goo.gl/F5SJHpViizCDGFXe7
❖ MapReduce is a widely used parallel data processing model for the processing and analysis of massive-scale data.
[Figure: Healthcare example. Src: https://journals.sagepub.com/doi/pdf/10.1177/0256090915575450]
❖ MapReduce has two phases: Map and Reduce
❖ MapReduce programs are written in a functional programming style* to create the map and reduce functions
*Programming Languages that support functional programming: Haskell, JavaScript, Scala, Erlang, Lisp, ML, Clojure, OCaml, Common Lisp, Racket.
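As a minimal sketch of this functional style, the map and reduce functions can even be supplied as ordinary executables through Hadoop Streaming. The jar path, release version and HDFS paths below are assumptions, not part of these slides.

```bash
# Word count with Hadoop Streaming: any executable can act as the map
# or reduce function (jar path, version and HDFS paths are assumptions).
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar \
  -input  /user/hadoop/books \
  -output /user/hadoop/wordcount-out \
  -mapper 'tr -s " " "\n"' \
  -reducer 'uniq -c'
# Map: split each input line into one word per line (each word becomes a key).
# Shuffle: the framework sorts all intermediate keys.
# Reduce: count runs of identical, already-sorted keys with uniq -c.
```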
❖ Input data to the map and reduce phases is in the form of key-value pairs
❖ MapReduce runtime systems typically run on large clusters built of commodity hardware
Src: http://cedlabs.in/Blog/CMS%204.2.1/Blog.php
❖ Tasks of MapReduce Run-time systems:
❑ partitioning the data
❑ scheduling of jobs
❑ communication between nodes in the cluster
❖ By handling these tasks, the MapReduce runtime makes it easier for programmers to analyze massive-scale data
Let's look at the flow of data for a MapReduce job or task…!
@Input: a set of key-value pairs is provided as input.
@Output: a set of key-value pairs is produced as output.
@MAP:
1. In the MAP phase, data is read from a distributed file system.
2. Data is partitioned among a set of computing nodes in the cluster.
3. Data is sent to the nodes as a set of key-value pairs.
=> The Map tasks process the input records independently of each other and produce intermediate results as key-value pairs. The intermediate results are stored on the local disk of the node running the Map task.
@REDUCE:
▪ When all the MAP tasks are completed, the Reduce phase begins, in which the intermediate data with the same key is aggregated.
▪ An optional Combine task can be used to perform aggregation on the intermediate data of the same key for the output of the mapper, before transferring the output to the Reduce task.
[Figure: data flows from the Input through parallel MAP tasks into Reduce tasks, which produce the Output.]
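A minimal way to see this Input → Map → Reduce → Output flow end to end is to run the word-count example that ships with Hadoop; the jar location, release version and HDFS paths below are assumptions.

```bash
# Stage the input as records in HDFS (assumed paths).
hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put local-data.txt /user/hadoop/input

# Run the bundled word-count example: map tasks read the input splits,
# emit (word, 1) pairs, and reduce tasks aggregate the counts per key.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar \
  wordcount /user/hadoop/input /user/hadoop/output

# Inspect the (key, value) output produced by the reduce tasks.
hdfs dfs -cat /user/hadoop/output/part-r-*
```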
❖ Faster data processing and less data transmission between the nodes in a cluster
=> due to the locality of data
=> data processing takes place on the nodes where the data resides
❖ Efficiency
=> The MapReduce programming model moves the computation to where the data resides, thus decreasing the transmission of data and improving efficiency.
❖ Parallel Processing of massive scale data
The MapReduce programming model is well suited for parallel processing of massive-scale data, in which the data analysis tasks can be accomplished by independent map and reduce operations.
❖ What is a Hadoop Cluster?
▪ Master Node + Backup Node + a number of Slave Nodes
- The Master Node runs the NameNode and JobTracker processes.
- The Backup Node runs the Secondary NameNode process.
- The Slave Nodes run the DataNode and TaskTracker processes.
[Figure: Components of a Hadoop Cluster. Clients connect to the Master Node, which runs the NameNode and JobTracker; a Backup Node runs the Secondary NameNode; and each Slave Node runs a DataNode and a TaskTracker.]
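A quick way to check which of these processes a given machine is running is the JDK's jps tool; the output below is only an illustration of a Hadoop 1.x-style cluster like the one in the figure, and the process IDs are placeholders.

```bash
# On the master node (illustrative output, PIDs will differ):
jps
#   2301 NameNode
#   2467 JobTracker

# On a slave node:
jps
#   1984 DataNode
#   2050 TaskTracker
```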
❖ What are the functions of the key processes of Hadoop?
❑ NameNode:
▪ The NameNode keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept.
▪ The NameNode does not store the data of the files itself.
▪ Client applications talk to the NameNode whenever a file is to be located/added/copied/moved/deleted (see the command sketch below).
▪ The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data lives.
▪ The NameNode serves as both the directory namespace manager and the 'inode table' for the Hadoop DFS.
▪ In any DFS deployment, a single NameNode is found in execution.
[Figure: client apps send "locate a file", "add a file", "copy/move a file" and "delete a file" requests to the NameNode.]
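Each of these locate/add/copy/move/delete requests corresponds to an ordinary HDFS shell command issued by the client; the file and directory names below are only placeholders.

```bash
hdfs dfs -ls /user/hadoop                       # locate: the NameNode returns the metadata
hdfs dfs -put report.csv /user/hadoop           # add: the NameNode picks DataNodes, the client writes blocks to them
hdfs dfs -cp /user/hadoop/report.csv /backup/   # copy
hdfs dfs -mv /backup/report.csv /archive/       # move/rename (a metadata-only operation on the NameNode)
hdfs dfs -rm /archive/report.csv                # delete
```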
❖ What are the functions of the key processes of Hadoop?
❑ Secondary NameNode:
▪ When the NameNode goes down, the file system goes offline. An optional Secondary NameNode, hosted on a separate machine, creates checkpoints of the namespace.
Note: an HDFS cluster has the NameNode as a single point of failure; hence, HDFS is not currently a high-availability system.
❑ JobTracker:
▪ JobTracker is the service within Hadoop that distributes MapReduce tasks to specific nodes in the cluster.
=> ideally the nodes that have the data, or at least are in the same rack
❑ TaskTracker:
▪ TaskTracker is a node in a Hadoop cluster
▪ TaskTracker accepts Map, Reduce and Shuffle tasks from the JobTracker
▪ Each TaskTracker has a defined number of slots which indicate the number of tasks that it can accept
▪ When the JobTracker tries to find a TaskTracker to schedule a map or reduce task, it first looks for an empty slot on the same node that hosts the DataNode containing the data.
▪ If an empty slot is not found on the same node, the JobTracker looks for an empty slot on a node in the same rack (a slot-configuration sketch follows below).
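The number of slots a TaskTracker advertises is set in mapred-site.xml. The sketch below uses the classic Hadoop 1.x property names and conf/ directory layout, which are assumptions; values and host name are placeholders.

```bash
# Sketch: give each TaskTracker 2 map slots and 2 reduce slots
# (Hadoop 1.x property names; adjust to your release).
cat > $HADOOP_HOME/conf/mapred-site.xml <<'EOF'
<configuration>
  <property>
    <name>mapred.job.tracker</name>                      <!-- where the JobTracker runs -->
    <value>master:9001</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>    <!-- map slots per TaskTracker -->
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name> <!-- reduce slots per TaskTracker -->
    <value>2</value>
  </property>
</configuration>
EOF
```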
❖ What are the functions of the key processes of Hadoop?
❑ DataNode:
▪ A DataNode stores data in an HDFS file system.
▪ A functional HDFS filesystem has more than one DataNode, with data replicated across them.
▪ DataNodes connect to the NameNode on startup.
▪ DataNodes respond to requests from the NameNode for file system operations.
▪ Client applications can talk directly to the DataNode, once the NameNode has provided the location of the data.
▪ MapReduce operations assigned to TaskTracker instances near a DataNode talk directly to that DataNode to access the files.
▪ TaskTracker instances can be deployed on the same servers that host DataNode instances, so that MapReduce operations are
performed close to the data.
To know further visit: https://www.researchgate.net/publication/277935711_Hadoop_MapReduce_and_HDFS_a_developers_perspective
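To see the DataNodes that have registered with the NameNode, along with their capacity and replication health, one can run the standard dfsadmin report; the output shown is abbreviated and purely illustrative (host names, IP addresses and ports are placeholders).

```bash
hdfs dfsadmin -report
# Illustrative, abbreviated output:
#   Live datanodes (2):
#   Name: 10.0.0.11:50010 (slave1)   Configured Capacity: ...
#   Name: 10.0.0.12:50010 (slave2)   Configured Capacity: ...
```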
[Figure: MapReduce job execution flow. A Client Node runs the MapReduce program and the JobClient; the JobClient submits the job to the JobTracker on the JobTracker Node; the JobTracker assigns Map or Reduce tasks to the TaskTrackers on the TaskTracker Nodes; all nodes share the distributed file system (HDFS).]
1. Job execution starts when client applications submit jobs to the JobTracker. The JobTracker talks to the NameNode to determine the location of the data.
2. The JobTracker locates TaskTracker nodes with available slots at or near the data.
3. The TaskTrackers send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive.
4. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster new work can be delegated.
5. The JobTracker submits the work to the TaskTracker nodes when they poll for tasks. To choose a task for a
TaskTracker, the JobTracker uses various scheduling algorithms.
6. The default scheduling algorithm in Hadoop is FIFO. In FIFO scheduling, a work queue is maintained and the JobTracker pulls the oldest job first for scheduling. FIFO scheduling does not take job priority or job size into account.
7. The TaskTracker nodes are monitored using the heartbeat signals that are sent by the TaskTrackers to JobTracker.
8. The TaskTracker spawns a separate JVM process for each task so that any task failure does not bring down the
TaskTracker
9. The TaskTracker monitors these spawned processes while capturing the output and exit codes.
10. When the process finishes, successfully or not, the TaskTracker notifies the JobTracker. When a task fails, the TaskTracker notifies the JobTracker, and the JobTracker decides whether to resubmit the job to some other TaskTracker or mark that specific record as something to avoid.
11. The JobTracker can blacklist a TaskTracker as unreliable if there are repeated task failures.
12. When the job is completed, the JobTracker updates the status. Client applications can poll the JobTracker for status (a command-line sketch follows below).
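A hedged command-line sketch of this life cycle: submit a job, then poll its status. The jar path, release version, HDFS paths and job ID shown are placeholders, not values from these slides.

```bash
# 1. Submit a job (the client talks to the job-scheduling service).
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar \
  wordcount /user/hadoop/input /user/hadoop/output

# 2. Poll job status from another shell.
mapred job -list                            # list jobs and their current states
mapred job -status job_202005120001_0001    # hypothetical job ID
```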
▪ Hadoop is an open-source framework written in Java.
▪ Hadoop is designed to work with commodity hardware.
▪ The Hadoop file system HDFS is highly fault tolerant
▪ The preferred operating system for Hadoop is Linux. It can also be set up on Windows-like operating systems with a Cygwin environment.
A multi-node Hadoop cluster configuration:
- It comprises one master node that runs the NameNode and JobTracker, and two slave nodes that run the TaskTracker and DataNode.
- The hardware used for the Hadoop cluster consists of EC2 (m1.large) instances running Ubuntu Linux.
❖ Install Java:
- Hadoop requires Java 6 or a later version.
- Commands for installing Java on Linux, step by step (a sketch follows below):
1. Install updates
2. Set the properties
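A minimal sketch of these two steps on an Ubuntu node, assuming the OpenJDK packages; the exact package name and JDK path depend on the distribution and the Java version chosen.

```bash
# 1. Install updates and a JDK (OpenJDK 8 is an assumption; any Java 6+ works per the slide).
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk

# 2. Set the properties: point JAVA_HOME at the installed JDK
#    (path is distribution-dependent) and expose it to the shell.
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> ~/.bashrc
echo 'export PATH=$PATH:$JAVA_HOME/bin' >> ~/.bashrc
source ~/.bashrc
java -version   # verify the installation
```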
❖ Install and Configure Hadoop
➢ Install Hadoop
➢ Configure Network
➢ Configure Hadoop
1. After unpacking the Hadoop setup package on all the nodes of the cluster, the next step is to configure the network so that all the nodes can connect to each other over the network.
2. To make addressing of the nodes simple, assign simple host names to the nodes (like master, slave1 and slave2).
3. The /etc/hosts file is edited on all the nodes, and the IP addresses and host names of all the nodes are added.
4. Hadoop control scripts use SSH for cluster-wide operations such as starting and stopping the NameNode, DataNode, JobTracker, TaskTracker and other daemons on the nodes in the cluster.
5. For the control scripts to work, all the nodes in the cluster must be able to connect to each other via a password-less SSH login.
6. To enable this, a public/private RSA key pair is generated on each node.
7. The private key is stored in the file ~/.ssh/id_rsa and the public key is stored in the file ~/.ssh/id_rsa.pub.
8. The public SSH key of each node is copied to the ~/.ssh/authorized_keys file of every other node. This can be done by manually editing the ~/.ssh/authorized_keys file on each node or by using the ssh-copy-id command.
9. The final step in setting up the networking is to save the host key fingerprints of each node to the known_hosts file of every other node. This is done by connecting from each node to every other node over SSH (a command sketch follows this list).
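A hedged sketch of these steps as run on one node; the host names master, slave1 and slave2 come from the slide, while the user name and IP addresses are placeholders.

```bash
# /etc/hosts entries on every node (IP addresses are placeholders).
echo "10.0.0.10 master" | sudo tee -a /etc/hosts
echo "10.0.0.11 slave1" | sudo tee -a /etc/hosts
echo "10.0.0.12 slave2" | sudo tee -a /etc/hosts

# Generate the RSA key pair (stored in ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub).
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# Copy the public key into every other node's ~/.ssh/authorized_keys
# ("hadoop" is a placeholder user name).
ssh-copy-id hadoop@slave1
ssh-copy-id hadoop@slave2

# Connect once to each node so its host key fingerprint lands in ~/.ssh/known_hosts.
ssh hadoop@slave1 exit
ssh hadoop@slave2 exit
```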
Hadoop is configured using a number of configuration files, listed below:
▪ core-site.xml : Configuration parameters for the Hadoop core, common to MapReduce and HDFS
▪ mapred-site.xml : Configuration parameters for the MapReduce daemons (JobTracker and TaskTracker)
▪ hdfs-site.xml : Configuration parameters for the HDFS daemons (NameNode, Secondary NameNode and DataNode)
▪ hadoop-env.sh : Environment variables for the Hadoop daemons
▪ masters : List of nodes that run a Secondary NameNode
▪ slaves : List of nodes that run a TaskTracker and a DataNode
▪ log4j.properties : Logging properties for the Hadoop daemons
▪ mapred-queue-acls.xml : Access control lists for the MapReduce job queues
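For example, a minimal core-site.xml plus the masters/slaves files for the three-node cluster described earlier might look like the sketch below; the classic Hadoop 1.x conf/ layout, the fs.default.name property and the port number are assumptions.

```bash
# Minimal core-site.xml: tell every daemon where the HDFS NameNode lives.
cat > $HADOOP_HOME/conf/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>      <!-- fs.defaultFS in newer releases -->
    <value>hdfs://master:9000</value>
  </property>
</configuration>
EOF

# masters: node that runs the Secondary NameNode.
echo "master" > $HADOOP_HOME/conf/masters
# slaves: nodes that run a TaskTracker and a DataNode.
printf "slave1\nslave2\n" > $HADOOP_HOME/conf/slaves
```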
Steps involved in configuring Hadoop (from the embedded object on the slide):
1. Check the host details (not mandatory).
2. Fetch the Hadoop mirror URL and download the stable release from https://downloads.apache.org/hadoop/common/stable/. Choose hadoop-3.2.1.tar.gz (both .gz files are chosen here; the .src archive is not needed, however).
3. Extract the archive using the tar -xzf hadoop-3.2.1.tar.gz command.
Steps involved in starting and stopping the Hadoop cluster (from the embedded object on the slide):
- After Hadoop is configured, the relevant file system is created.
- Using the hdfs executable, one can create input and output directories and also format the NameNode.
- Use the command "hdfs namenode -format" to format the NameNode and to save its image file.
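A condensed, hedged sketch of these steps on the master node. The archive name matches the slide; the sbin script names are those shipped with Hadoop 3.x, which replaces the JobTracker/TaskTracker pair described earlier with YARN.

```bash
# Download and extract the stable release chosen on the slide.
wget https://downloads.apache.org/hadoop/common/stable/hadoop-3.2.1.tar.gz
tar -xzf hadoop-3.2.1.tar.gz
export HADOOP_HOME=$PWD/hadoop-3.2.1

# Format the NameNode (writes the initial file-system image).
$HADOOP_HOME/bin/hdfs namenode -format

# Start and stop the cluster daemons.
$HADOOP_HOME/sbin/start-dfs.sh     # NameNode, Secondary NameNode, DataNodes
$HADOOP_HOME/sbin/start-yarn.sh    # ResourceManager, NodeManagers
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/stop-dfs.sh
```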
https://www.youtube.com/watch?v=mafw2-CVYnA
http://blog.newtechways.com/2017/10/apache-hadoop-ecosystem.html
https://www.slideshare.net/LiorSidi/hadoop-ecosystem-65935516
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
https://www.youtube.com/watch?v=l2n124ioO1I
https://downloads.apache.org/hadoop/common/stable/