12-05-2020 Dr Sumalatha Aradhya, Dept of CSE, SIT, Tumakuru
Src:https://images.app.goo.gl/F5SJHpViizCDGFXe7
❖ MapReduce is a widely used parallel data processing model for the processing and analysis of massive-scale data.
[Figure: Healthcare example. Src: https://journals.sagepub.com/doi/pdf/10.1177/0256090915575450]
❖ MapReduce has two phases: Map and Reduce
❖ MapReduce programs are written in a functional programming style* to create the map and reduce functions
*Programming Languages that support functional programming: Haskell, JavaScript, Scala, Erlang, Lisp, ML, Clojure, OCaml, Common Lisp, Racket.
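As a minimal sketch of this functional style, the map and reduce functions can even be supplied as ordinary executables through Hadoop Streaming. The jar path, release version and HDFS paths below are assumptions, not part of these slides.

```bash
# Word count with Hadoop Streaming: any executable can act as the map
# or reduce function (jar path, version and HDFS paths are assumptions).
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar \
  -input  /user/hadoop/books \
  -output /user/hadoop/wordcount-out \
  -mapper 'tr -s " " "\n"' \
  -reducer 'uniq -c'
# Map: split each input line into one word per line (each word becomes a key).
# Shuffle: the framework sorts all intermediate keys.
# Reduce: count runs of identical, already-sorted keys with uniq -c.
```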
❖ Input data to the map and reduce phases is in the form of key-value pairs
❖ MapReduce runtime systems typically run on large clusters built of commodity hardware
Src: http://cedlabs.in/Blog/CMS%204.2.1/Blog.php
❖ Tasks of MapReduce Run-time systems:
❑ partitioning the data
❑ scheduling of jobs
❑ communication between nodes in the cluster
❖ By handling these tasks, the MapReduce runtime makes it easier for programmers to analyze massive-scale data
Let's look at the flow of data for a MapReduce job or task…!
@Input: a set of key-value pairs is provided as input.
@Output: a set of key-value pairs is produced as output.
@MAP:
1. In the MAP phase, data is read from a distributed file system.
2. Data is partitioned among a set of computing nodes in the cluster.
3. Data is sent to the nodes as a set of key-value pairs.
=> The Map tasks process the input records independently of each other and produce intermediate results as key-value pairs. The intermediate results are stored on the local disk of the node running the Map task.
@REDUCE:
▪ When all the MAP tasks are completed, the Reduce phase begins, in which the intermediate data with the same key is aggregated.
▪ An optional Combine task can be used to perform aggregation on the intermediate data of the same key for the output of the mapper, before transferring the output to the Reduce task.
[Figure: data flows from the Input through parallel MAP tasks into Reduce tasks, which produce the Output.]
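A minimal way to see this Input → Map → Reduce → Output flow end to end is to run the word-count example that ships with Hadoop; the jar location, release version and HDFS paths below are assumptions.

```bash
# Stage the input as records in HDFS (assumed paths).
hdfs dfs -mkdir -p /user/hadoop/input
hdfs dfs -put local-data.txt /user/hadoop/input

# Run the bundled word-count example: map tasks read the input splits,
# emit (word, 1) pairs, and reduce tasks aggregate the counts per key.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar \
  wordcount /user/hadoop/input /user/hadoop/output

# Inspect the (key, value) output produced by the reduce tasks.
hdfs dfs -cat /user/hadoop/output/part-r-*
```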
❖ Faster data processing and less data transmission between the nodes in a cluster
=> due to the locality of data
=> data processing takes place on the nodes where the data resides
❖ Efficiency
=> The MapReduce programming model moves the computation to where the data resides, thus decreasing the transmission of data and improving efficiency.
❖ Parallel Processing of massive scale data
The MapReduce programming model is well suited for parallel processing of massive-scale data, in which the data analysis tasks can be accomplished by independent map and reduce operations.
❖ What is a Hadoop Cluster?
▪ Master Node + Backup Node + a number of Slave Nodes
- The Master Node runs the NameNode and JobTracker processes.
- The Backup Node runs the Secondary NameNode process.
- The Slave Nodes run the DataNode and TaskTracker processes.
[Figure: Components of a Hadoop Cluster. Clients connect to the Master Node, which runs the NameNode and JobTracker; a Backup Node runs the Secondary NameNode; and each Slave Node runs a DataNode and a TaskTracker.]
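A quick way to check which of these processes a given machine is running is the JDK's jps tool; the output below is only an illustration of a Hadoop 1.x-style cluster like the one in the figure, and the process IDs are placeholders.

```bash
# On the master node (illustrative output, PIDs will differ):
jps
#   2301 NameNode
#   2467 JobTracker

# On a slave node:
jps
#   1984 DataNode
#   2050 TaskTracker
```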
❖ What are the functions of the key processes of Hadoop?
❑ NameNode:
▪ The NameNode keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept.
▪ The NameNode does not store the data of the files itself.
▪ Client applications talk to the NameNode whenever a file is to be located/added/copied/moved/deleted (see the command sketch below).
▪ The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data lives.
▪ The NameNode serves as both the directory namespace manager and the 'inode table' for the Hadoop DFS.
▪ In any DFS deployment, a single NameNode is found in execution.
[Figure: client apps send "locate a file", "add a file", "copy/move a file" and "delete a file" requests to the NameNode.]
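Each of these locate/add/copy/move/delete requests corresponds to an ordinary HDFS shell command issued by the client; the file and directory names below are only placeholders.

```bash
hdfs dfs -ls /user/hadoop                       # locate: the NameNode returns the metadata
hdfs dfs -put report.csv /user/hadoop           # add: the NameNode picks DataNodes, the client writes blocks to them
hdfs dfs -cp /user/hadoop/report.csv /backup/   # copy
hdfs dfs -mv /backup/report.csv /archive/       # move/rename (a metadata-only operation on the NameNode)
hdfs dfs -rm /archive/report.csv                # delete
```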
❖ What are the functions of the key processes of Hadoop?
❑ Secondary NameNode:
▪ When the NameNode goes down, the file system goes offline. An optional Secondary NameNode, hosted on a separate machine, creates checkpoints of the namespace.
Note: an HDFS cluster has the NameNode as a single point of failure; hence, HDFS is not currently a high-availability system.
❑ JobTracker:
▪ JobTracker is the service within Hadoop that distributes MapReduce tasks to specific nodes in the cluster.
=> ideally the nodes that have the data, or at least are in the same rack
❑ TaskTracker:
▪ TaskTracker is a node in a Hadoop cluster
▪ TaskTracker accepts Map, Reduce and Shuffle tasks from the JobTracker
▪ Each TaskTracker has a defined number of slots which indicate the number of tasks that it can accept
▪ When the JobTracker tries to find a TaskTracker to schedule a map or reduce task, it first looks for an empty slot on the same node that hosts the DataNode containing the data.
▪ If an empty slot is not found on the same node, the JobTracker looks for an empty slot on a node in the same rack (a slot-configuration sketch follows below).
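The number of slots a TaskTracker advertises is set in mapred-site.xml. The sketch below uses the classic Hadoop 1.x property names and conf/ directory layout, which are assumptions; values and host name are placeholders.

```bash
# Sketch: give each TaskTracker 2 map slots and 2 reduce slots
# (Hadoop 1.x property names; adjust to your release).
cat > $HADOOP_HOME/conf/mapred-site.xml <<'EOF'
<configuration>
  <property>
    <name>mapred.job.tracker</name>                      <!-- where the JobTracker runs -->
    <value>master:9001</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>    <!-- map slots per TaskTracker -->
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name> <!-- reduce slots per TaskTracker -->
    <value>2</value>
  </property>
</configuration>
EOF
```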
❖ What are the functions of the key processes of Hadoop?
❑ DataNode:
▪ A DataNode stores data in an HDFS file system.
▪ A functional HDFS filesystem has more than one DataNode, with data replicated across them.
▪ DataNodes connect to the NameNode on startup.
▪ DataNodes respond to requests from the NameNode for file system operations.
▪ Client applications can talk directly to the DataNode, once the NameNode has provided the location of the data.
▪ MapReduce operations assigned to TaskTracker instances near a DataNode talk directly to that DataNode to access the files.
▪ TaskTracker instances can be deployed on the same servers that host DataNode instances, so that MapReduce operations are
performed close to the data.
To know further visit: https://www.researchgate.net/publication/277935711_Hadoop_MapReduce_and_HDFS_a_developers_perspective
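To see the DataNodes that have registered with the NameNode, along with their capacity and replication health, one can run the standard dfsadmin report; the output shown is abbreviated and purely illustrative (host names, IP addresses and ports are placeholders).

```bash
hdfs dfsadmin -report
# Illustrative, abbreviated output:
#   Live datanodes (2):
#   Name: 10.0.0.11:50010 (slave1)   Configured Capacity: ...
#   Name: 10.0.0.12:50010 (slave2)   Configured Capacity: ...
```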
[Figure: MapReduce job execution flow. A Client Node runs the MapReduce program and the JobClient; the JobClient submits the job to the JobTracker on the JobTracker Node; the JobTracker assigns Map or Reduce tasks to the TaskTrackers on the TaskTracker Nodes; all nodes share the distributed file system (HDFS).]
1. Job execution starts when client applications submit jobs to the JobTracker. The JobTracker talks to the NameNode to determine the location of the data.
2. The JobTracker locates TaskTracker nodes with available slots at or near the data.
3. The TaskTrackers send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive.
4. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster new work can be delegated.
5. The JobTracker submits the work to the TaskTracker nodes when they poll for tasks. To choose a task for a
TaskTracker, the JobTracker uses various scheduling algorithms.
6. The default scheduling algorithm in Hadoop is FIFO. In FIFO scheduling, a work queue is maintained and the JobTracker pulls the oldest job first for scheduling. FIFO scheduling does not take job priority or job size into account.
7. The TaskTracker nodes are monitored using the heartbeat signals that are sent by the TaskTrackers to JobTracker.
8. The TaskTracker spawns a separate JVM process for each task so that any task failure does not bring down the
TaskTracker
9. The TaskTracker monitors these spawned processes while capturing the output and exit codes.
10. When the process finishes, successfully or not, the TaskTracker notifies the JobTracker. When a task fails, the TaskTracker notifies the JobTracker, and the JobTracker decides whether to resubmit the job to some other TaskTracker or mark that specific record as something to avoid.
11. The JobTracker can blacklist a TaskTracker as unreliable if there are repeated task failures.
12. When the job is completed, the JobTracker updates the status. Client applications can poll the JobTracker for status (a command-line sketch follows below).
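A hedged command-line sketch of this life cycle: submit a job, then poll its status. The jar path, release version, HDFS paths and job ID shown are placeholders, not values from these slides.

```bash
# 1. Submit a job (the client talks to the job-scheduling service).
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar \
  wordcount /user/hadoop/input /user/hadoop/output

# 2. Poll job status from another shell.
mapred job -list                            # list jobs and their current states
mapred job -status job_202005120001_0001    # hypothetical job ID
```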
▪ Hadoop is an open-source framework written in Java.
▪ Hadoop is designed to work with commodity hardware.
▪ The Hadoop file system HDFS is highly fault tolerant
▪ The preferred operating system for Hadoop is Linux. It can also be set up on Windows-like operating systems with a Cygwin environment.
A multi-node Hadoop cluster configuration:
- It comprises one master node that runs the NameNode and JobTracker, and two slave nodes that run the TaskTracker and DataNode.
- The hardware used for the Hadoop cluster consists of EC2 (m1.large) instances running Ubuntu Linux.
❖ Install Java:
- Hadoop requires Java 6 or a later version.
- Commands for installing Java on Linux, step by step (a sketch follows below):
1. Install updates
2. Set the properties
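A minimal sketch of these two steps on an Ubuntu node, assuming the OpenJDK packages; the exact package name and JDK path depend on the distribution and the Java version chosen.

```bash
# 1. Install updates and a JDK (OpenJDK 8 is an assumption; any Java 6+ works per the slide).
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk

# 2. Set the properties: point JAVA_HOME at the installed JDK
#    (path is distribution-dependent) and expose it to the shell.
echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> ~/.bashrc
echo 'export PATH=$PATH:$JAVA_HOME/bin' >> ~/.bashrc
source ~/.bashrc
java -version   # verify the installation
```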
❖ Install and Configure Hadoop
➢ Install Hadoop
➢ Configure Network
➢ Configure Hadoop
1. After unpacking the Hadoop setup package on all the nodes of the cluster, the next step is to configure the network so that all the nodes can connect to each other over the network.
2. To make addressing of the nodes simple, assign simple host names to the nodes (like master, slave1 and slave2).
3. The /etc/hosts file is edited on all the nodes, and the IP addresses and host names of all the nodes are added.
4. Hadoop control scripts use SSH for cluster-wide operations such as starting and stopping the NameNode, DataNode, JobTracker, TaskTracker and other daemons on the nodes in the cluster.
5. For the control scripts to work, all the nodes in the cluster must be able to connect to each other via a password-less SSH login.
6. To enable this, a public/private RSA key pair is generated on each node.
7. The private key is stored in the file ~/.ssh/id_rsa and the public key is stored in the file ~/.ssh/id_rsa.pub.
8. The public SSH key of each node is copied to the ~/.ssh/authorized_keys file of every other node. This can be done by manually editing the ~/.ssh/authorized_keys file on each node or by using the ssh-copy-id command.
9. The final step in setting up the networking is to save the host key fingerprints of each node to the known_hosts file of every other node. This is done by connecting from each node to every other node over SSH (a command sketch follows this list).
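A hedged sketch of these steps as run on one node; the host names master, slave1 and slave2 come from the slide, while the user name and IP addresses are placeholders.

```bash
# /etc/hosts entries on every node (IP addresses are placeholders).
echo "10.0.0.10 master" | sudo tee -a /etc/hosts
echo "10.0.0.11 slave1" | sudo tee -a /etc/hosts
echo "10.0.0.12 slave2" | sudo tee -a /etc/hosts

# Generate the RSA key pair (stored in ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub).
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# Copy the public key into every other node's ~/.ssh/authorized_keys
# ("hadoop" is a placeholder user name).
ssh-copy-id hadoop@slave1
ssh-copy-id hadoop@slave2

# Connect once to each node so its host key fingerprint lands in ~/.ssh/known_hosts.
ssh hadoop@slave1 exit
ssh hadoop@slave2 exit
```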
Hadoop is configured using a number of configuration files, listed below:
▪ core-site.xml : Configuration parameters for the Hadoop core, common to MapReduce and HDFS
▪ mapred-site.xml : Configuration parameters for the MapReduce daemons (JobTracker and TaskTracker)
▪ hdfs-site.xml : Configuration parameters for the HDFS daemons (NameNode, Secondary NameNode and DataNode)
▪ hadoop-env.sh : Environment variables for the Hadoop daemons
▪ masters : List of nodes that run a Secondary NameNode
▪ slaves : List of nodes that run a TaskTracker and a DataNode
▪ log4j.properties : Logging properties for the Hadoop daemons
▪ mapred-queue-acls.xml : Access control lists for the MapReduce job queues
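For example, a minimal core-site.xml plus the masters/slaves files for the three-node cluster described earlier might look like the sketch below; the classic Hadoop 1.x conf/ layout, the fs.default.name property and the port number are assumptions.

```bash
# Minimal core-site.xml: tell every daemon where the HDFS NameNode lives.
cat > $HADOOP_HOME/conf/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>      <!-- fs.defaultFS in newer releases -->
    <value>hdfs://master:9000</value>
  </property>
</configuration>
EOF

# masters: node that runs the Secondary NameNode.
echo "master" > $HADOOP_HOME/conf/masters
# slaves: nodes that run a TaskTracker and a DataNode.
printf "slave1\nslave2\n" > $HADOOP_HOME/conf/slaves
```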
Steps involved in configuring Hadoop (from the embedded object on the slide):
1. Check the host details (not mandatory).
2. Fetch the Hadoop mirror URL and download the stable release from https://downloads.apache.org/hadoop/common/stable/. Choose hadoop-3.2.1.tar.gz (both .gz files are chosen here; the .src archive is not needed, however).
3. Extract the archive using the tar -xzf hadoop-3.2.1.tar.gz command.
Steps involved in starting and stopping the Hadoop cluster (from the embedded object on the slide):
- After Hadoop is configured, the relevant file system is created.
- Using the hdfs executable, one can create input and output directories and also format the NameNode.
- Use the command "hdfs namenode -format" to format the NameNode and to save its image file.
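A condensed, hedged sketch of these steps on the master node. The archive name matches the slide; the sbin script names are those shipped with Hadoop 3.x, which replaces the JobTracker/TaskTracker pair described earlier with YARN.

```bash
# Download and extract the stable release chosen on the slide.
wget https://downloads.apache.org/hadoop/common/stable/hadoop-3.2.1.tar.gz
tar -xzf hadoop-3.2.1.tar.gz
export HADOOP_HOME=$PWD/hadoop-3.2.1

# Format the NameNode (writes the initial file-system image).
$HADOOP_HOME/bin/hdfs namenode -format

# Start and stop the cluster daemons.
$HADOOP_HOME/sbin/start-dfs.sh     # NameNode, Secondary NameNode, DataNodes
$HADOOP_HOME/sbin/start-yarn.sh    # ResourceManager, NodeManagers
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/stop-dfs.sh
```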
https://www.youtube.com/watch?v=mafw2-CVYnA
http://blog.newtechways.com/2017/10/apache-hadoop-ecosystem.html
https://www.slideshare.net/LiorSidi/hadoop-ecosystem-65935516
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
https://www.youtube.com/watch?v=l2n124ioO1I
https://downloads.apache.org/hadoop/common/stable/