Unit – 5
(Learning Notes)
SYLLABUS:
MapReduce/Hadoop
MapR M3
Advanced SQL Tools
Hadoop Overview
Hortonworks' founder predicted that by the end of 2020, 75% of Fortune 2000
companies would be running 1000-node Hadoop clusters in production. The
tiny toy elephant in the big data room has become the most popular big data
solution across the globe. However, implementation of Hadoop in production
is still accompanied by deployment and management challenges such as
scalability, flexibility and cost-effectiveness.
Many organizations that venture into enterprise adoption of Hadoop, whether
driven by business users or by an analytics group within the company, do not
have any knowledge of what a good Hadoop architecture design should look like
or how a Hadoop cluster actually works in production. This lack of knowledge
leads to the design of a Hadoop cluster that is more complex than necessary
for a particular big data application, making it a pricey implementation.
Apache Hadoop was developed with the purpose of having a low-cost,
redundant data store that would allow organizations to leverage big data
analytics at economical cost and maximize profitability of the business.
A good hadoop architectural design requires various design considerations
in terms of computing power, networking and storage. This section gives
an in-depth explanation of the Hadoop architecture and the factors to be
considered when designing and building a Hadoop cluster for production
success.
Hadoop Architecture Overview
Apache Hadoop offers a scalable, flexible and reliable distributed
computing big data framework for a cluster of systems with storage capacity
and local computing power by leveraging commodity hardware. Hadoop
follows a master-slave architecture for the transformation and analysis of
large datasets using the Hadoop MapReduce paradigm. The three important Hadoop
components that play a vital role in the Hadoop architecture are -
1. Hadoop Distributed File System (HDFS) – Patterned after the UNIX file
system
2. Hadoop MapReduce
3. Yet Another Resource Negotiator (YARN)
Hadoop follows a master-slave architecture design for data storage and
distributed data processing, using HDFS and MapReduce respectively. The
master node for data storage in Hadoop HDFS is the NameNode, and the
master node for parallel processing of data using Hadoop MapReduce is the
Job Tracker. The slave nodes in the Hadoop architecture are the other
machines in the Hadoop cluster which store data and perform complex
computations. Every slave node runs a Task Tracker daemon and a DataNode
that synchronize their processes with the Job Tracker and NameNode
respectively. In a Hadoop architectural implementation the master and slave
systems can be set up in the cloud or on-premise.
Role of Distributed Storage - HDFS in Hadoop Application
Architecture Implementation
A file on HDFS is split into multiple blocks and each block is replicated
within the Hadoop cluster. A block on HDFS is a blob of data within the
underlying file system with a default size of 64 MB. The size of a block can
be extended up to 256 MB based on the requirements.
Hadoop Distributed File System (HDFS) stores the application data and
file system metadata separately on dedicated servers. NameNode and
DataNode are the two critical components of the Hadoop HDFS architecture.
Application data is stored on servers referred to as DataNodes and file
system metadata is stored on servers referred to as NameNode. HDFS
replicates the file content on multiple DataNodes based on the replication
factor to ensure reliability of data. The NameNode and DataNode
communicate with each other using TCP-based protocols. For the Hadoop
architecture to be performance-efficient, HDFS must satisfy certain
prerequisites –
1. All the hard drives should have a high throughput.
2. Good network speed to manage intermediate data transfer and block
replications.
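To make the block and replication behaviour above concrete, the short Java sketch below uses the standard Hadoop FileSystem API to write a file with an explicit replication factor and block size. The file path, replication factor of 3 and 128 MB block size are illustrative values chosen for this example, not recommendations from the notes.

    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBlockDemo {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath,
            // so the client knows where the NameNode is.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/tmp/block-demo.txt");   // illustrative path

            // create(path, overwrite, bufferSize, replication, blockSize):
            // each block of this file is replicated on 3 DataNodes and the
            // file is split into 128 MB blocks (example values only).
            short replication = 3;
            long blockSize = 128L * 1024 * 1024;
            try (FSDataOutputStream out =
                     fs.create(file, true, 4096, replication, blockSize)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            fs.close();
        }
    }

The NameNode records which DataNodes hold each replica of the file's blocks; the client only supplies the data and the desired replication and block size.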
NameNode
All the files and directories in the HDFS namespace are represented on the
NameNode by inodes that contain various attributes like permissions,
modification timestamp, disk space quota, namespace quota and access
times. The NameNode maps the entire file system structure into memory. Two
files, fsimage and edits, are used for persistence across restarts.
The fsimage file contains the inodes and the list of blocks which define the
metadata. It has a complete snapshot of the file system's metadata at any
given point of time.
The edits file contains any modifications that have been performed on the
content of the fsimage file. Incremental changes like renaming a file or
appending data to it are recorded in the edit log to ensure durability,
instead of creating a new fsimage snapshot every time the namespace is altered.
DataNode
The DataNode manages the state of an HDFS node and interacts with the blocks.
A DataNode can perform CPU-intensive jobs like semantic and language
analysis, statistics and machine learning tasks, and I/O-intensive jobs like
clustering, data import, data export, search, decompression, and indexing. A
DataNode needs a lot of I/O for data processing and transfer.
On startup, every DataNode connects to the NameNode and performs a
handshake to verify the namespace ID and the software version of the
DataNode. If either of them does not match, the DataNode shuts down
automatically. A DataNode reports the block replicas in its ownership by
sending a block report to the NameNode. As soon as the DataNode registers,
the first block report is sent. The DataNode then sends a heartbeat to the
NameNode every 3 seconds to confirm that the DataNode is operating and the
block replicas it hosts are available.
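The registration and heartbeat flow above can be summarised in the following hypothetical Java sketch. The class, interface and method names are invented purely for illustration; they do not correspond to Hadoop's internal classes.

    // Hypothetical illustration of the DataNode startup protocol described above.
    // None of these types exist in Hadoop; they only mirror the logical steps.
    public class DataNodeStartupSketch {

        interface NameNodeClient {
            String namespaceId();
            String softwareVersion();
            void register(String dataNodeId);
            void sendBlockReport(String dataNodeId, java.util.List<String> blockIds);
            void sendHeartbeat(String dataNodeId);
        }

        void start(NameNodeClient nameNode, String myId,
                   String myNamespaceId, String myVersion,
                   java.util.List<String> localBlocks) throws InterruptedException {
            // Handshake: namespace ID and software version must match,
            // otherwise the DataNode shuts itself down.
            if (!nameNode.namespaceId().equals(myNamespaceId)
                    || !nameNode.softwareVersion().equals(myVersion)) {
                return; // shut down automatically
            }
            nameNode.register(myId);
            nameNode.sendBlockReport(myId, localBlocks); // first block report right after registering
            while (true) {
                nameNode.sendHeartbeat(myId);            // liveness signal
                Thread.sleep(3_000);                     // every 3 seconds
            }
        }
    }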
How does the Hadoop MapReduce architecture work?
The execution of a MapReduce job begins when the client submits the
job configuration to the Job Tracker, specifying the map, combine and
reduce functions along with the locations of the input and output data. On
receiving the job configuration, the Job Tracker identifies the number of
splits based on the input path and selects Task Trackers based on their
network vicinity to the data sources. The Job Tracker then sends a request to
the selected Task Trackers.
The processing of the Map phase begins when the Task Tracker
extracts the input data from the splits. The map function is invoked for each
record parsed by the "InputFormat", producing key-value pairs in a
memory buffer. The buffered output is sorted and partitioned for the different
reducer nodes, and the combine function is invoked to pre-aggregate it before
it is written to disk. On completion of the map task, the Task
Tracker notifies the Job Tracker. When all Task Trackers are done, the Job
Tracker notifies the selected Task Trackers to begin the reduce phase. Each Task
Tracker reads the region files and sorts the key-value pairs for each key. The
reduce function is then invoked, which collects the aggregated values into
the output file.
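As an illustration of the map, combine and reduce functions referred to above, here is a minimal word-count job written against the standard Hadoop MapReduce Java API (the org.apache.hadoop.mapreduce API rather than the classic Job Tracker-era one). The input and output paths are supplied on the command line; the reducer is reused as the combine function.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: for each input record, emit (word, 1) pairs.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce (also usable as the combine function): sum the counts per word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // combine function
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }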
MAPR M3 Architecture
MapR Database is an enterprise-grade, high-performance, NoSQL ("Not Only
SQL") database management system. You can use it to add real-time,
operational analytics capabilities to big data applications. As a multi-model
NoSQL database, it supports both JSON document models and wide-column
data models.
Why use MapR Database?
Integrated analytics with SQL: MapR Database's integration with
Drill for MapR provides a low latency, distributed, SQL-like query
engine for large-scale datasets, including structured and semi-
structured, nested data.
Operational analytics: MapR Database can run in the same cluster
as Apache™ Hadoop® and Apache Spark, letting you immediately
analyze or process live, interactive data. This also enables you to
eliminate data silos to speed the data-to-action cycle, providing a more
efficient data architecture.
Global distribution of applications: Application access to MapR
Database tables is distributable on a global scale.
Flexible data model: You can use MapR Database as both a
document database and a wide-column database. As a document
database, MapR Database stores JSON documents in JSON tables. As
a wide-column database, it stores binary files in binary tables.
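To show what the JSON document model looks like in practice, the sketch below inserts a document into a MapR Database JSON table using the OJAI Java API. The table path /apps/user_profiles and the document fields are illustrative, and the snippet assumes the OJAI driver and a MapR client are available on the classpath.

    import org.ojai.Document;
    import org.ojai.store.Connection;
    import org.ojai.store.DocumentStore;
    import org.ojai.store.DriverManager;

    public class MaprDbJsonExample {
        public static void main(String[] args) throws Exception {
            // Connect to the cluster through the OJAI driver.
            Connection connection = DriverManager.getConnection("ojai:mapr:");

            // Open a handle to an existing JSON table; the path is illustrative.
            DocumentStore store = connection.getStore("/apps/user_profiles");

            // Build a JSON document and write it to the table.
            Document doc = connection.newDocument()
                    .set("_id", "user0001")
                    .set("name", "Alice")
                    .set("address.city", "Pune");
            store.insertOrReplace(doc);

            store.close();
            connection.close();
        }
    }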
How is MapR Database Related to MapR
Filesystem?
MapR Database implements tables within the framework of the MapR file
system. MapR Database creates tables (both binary and JSON tables) in
logical units called volumes.
MapR Database's architecture has the
following advantages:
It reduces process overhead because it has no extra layers to pass
through when performing operations on data.
MapR Database, like several other NoSQL databases, is a log-based
database. MapR Database runs inside the MapR file system
process, which enables it to read from and write to disks directly. In
contrast, other NoSQL databases must communicate with a separate
process to perform disk reads and writes. The approach taken
by MapR Database eliminates extra process hops, duplicate caching,
and needless abstractions, thereby optimizing I/O
operations on your data.
It minimizes compaction delays because it avoids I/O storms when it
merges logged operations with structures on disk.
As a log-based database, MapR Database must write logged
operations to disk. MapR Database stores table regions (also
called tablets) and smaller structures within them partially as b-trees.
Together with write-ahead logs (WAL), these b-trees comprise log-
structured-merge trees. Write-ahead logs for the smaller structures
within regions are periodically restructured by rolling merge
operations on the b-trees. Because MapR Database performs these
merges at small scales, applications running against MapR
Database see no significant effects on latency while the merges are
taking place.
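The log-structured write path described above can be pictured with the following generic, greatly simplified Java sketch. It is not MapR's actual implementation; it only illustrates the idea of appending operations to a write-ahead log, buffering them in memory, and periodically merging them into the on-disk sorted structure in small increments.

    import java.util.TreeMap;

    // Generic, simplified illustration of a log-structured write path
    // (WAL + in-memory buffer + periodic merge). Not MapR Database code.
    public class LsmSketch {
        private final StringBuilder writeAheadLog = new StringBuilder(); // stands in for the on-disk WAL
        private final TreeMap<String, String> memoryBuffer = new TreeMap<>();
        private final TreeMap<String, String> onDiskSorted = new TreeMap<>(); // stands in for the b-tree

        public void put(String key, String value) {
            // 1. Log the operation first so it is durable.
            writeAheadLog.append(key).append('=').append(value).append('\n');
            // 2. Apply it to the in-memory buffer.
            memoryBuffer.put(key, value);
            // 3. Merge in small increments, so readers never see a long pause.
            if (memoryBuffer.size() >= 4) {
                rollingMerge();
            }
        }

        private void rollingMerge() {
            // Merge buffered entries into the sorted on-disk structure;
            // the corresponding log entries can then be discarded.
            onDiskSorted.putAll(memoryBuffer);
            memoryBuffer.clear();
            writeAheadLog.setLength(0);
        }

        public String get(String key) {
            // Recent writes are served from memory, older ones from the merged structure.
            String v = memoryBuffer.get(key);
            return v != null ? v : onDiskSorted.get(key);
        }
    }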
SQL vs NoSQL