BDA Module1

The document serves as a syllabus for a course on Big Data and Hadoop, covering topics such as the characteristics of Big Data, its sources, and the Hadoop ecosystem. It includes case studies demonstrating the application of Big Data across various industries including education, healthcare, weather forecasting, transportation, and banking. The syllabus emphasizes the importance of understanding data types, value generation from Big Data, and the technological tools used in the field.
Syllabus:

CHAP 1. INTRODUCTION TO BIG DATA AND HADOOP

Contents

◻ Introduction to Big Data


◻ Big Data characteristics
◻ Types of Big Data
◻ Traditional vs. Big Data business approach
◻ Case Study of Big Data Solutions.
◻ Concept of Hadoop
◻ Core Hadoop Components
◻ Hadoop Ecosystem
Evolution of Technology
8
5 marks

What is Big Data?

◻ Big Data: data so large that it becomes difficult to process using traditional systems.
Sizes of data

◻ How huge does this data need to be? Data on the order of GB, TB, PB, EB or more in size is considered Big Data.
Sizes of data

◻ Example: it is difficult to (1) process, (2) resize/enhance and edit a 10 TB image file (JPG, BMP, PNG) (3) within a specified time using a traditional system. This is a Big Data problem.
Sources of Big Data

◻ Social Media Data: Social media such as Facebook and Twitter hold information and views posted by millions of people across the globe.
◻ Black Box Data: A component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings of microphones and earphones, and the performance information of the aircraft.
◻ Stock Exchange Data: Holds information about the 'buy' and 'sell' decisions made by customers on shares of different companies.
◻ Transport Data: Includes the model, capacity, distance and availability of a vehicle.
◻ Search Engine Data: Search engines retrieve lots of data from different databases.
◻ Power Grid Data: Holds information about the power consumed by a particular node with respect to a base station.
5 Marks
Characteristics of Big Data (Vs of Big Data)

The core Vs describe the amount of data, the speed at which it arrives, and the form it takes:
◻ Volume
◻ Velocity
◻ Variety
1. Volume

● The amount of data generated every second.
● Here we are talking about zettabytes or more.
● The task of Big Data is to convert such huge data into valuable information.
● The volume of data to be analyzed is massive.
● Data is generated by machines, networks and human interaction on systems like social media.
1. Volume: Storage sizes

1 Byte (B) = 8 bits
1 Kilobyte (K/KB) = 2^10 bytes = 1024 bytes
1 Megabyte (M/MB) = 2^20 bytes = 1024 KB
1 Gigabyte (G/GB) = 2^30 bytes = 1024 MB
1 Terabyte (T/TB) = 2^40 bytes = 1024 GB
1 Petabyte (P/PB) = 2^50 bytes = 1024 TB
1 Exabyte (E/EB) = 2^60 bytes = 1024 PB
1 Zettabyte (Z/ZB) = 2^70 bytes = 1024 EB
1 Yottabyte (Y/YB) = 2^80 bytes = 1024 ZB
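Each unit in the table above is 2^10 = 1024 times the previous one (binary prefixes). Below is a minimal Python sketch, added for illustration and not part of the original slides, of the same conversion arithmetic:

# Illustrative sketch of the binary-prefix arithmetic in the table above:
# each unit is 2**10 = 1024 times the previous one.

UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: float) -> str:
    """Convert a byte count into the largest sensible binary unit."""
    for unit in UNITS:
        if num_bytes < 1024 or unit == UNITS[-1]:
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024

if __name__ == "__main__":
    print(human_readable(10 * 1024**4))   # 10 TB  -> "10.00 TB"
    print(human_readable(2 * 1024**5))    # 2 PB   -> "2.00 PB"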
Example of Volume (1): Airbus
● Airbus generates 10 TB every 30 minutes.
● About 640 TB is generated in one flight.
Example of Volume (2)
● Self-driving cars will generate 2 petabytes of data every year.
● From now on, the amount of data in the world will double every two years.
● By 2030, we will have 50 times the amount of data that we had in 2020.
3. Variety

● Refers to the different types of data we can now use.
● In the past, data was mostly structured and fitted neatly into columns and rows:
  • Stored in databases
  • Spreadsheets
● Now much of the data is unstructured, which is difficult to store, analyze and mine:
  • Email, photos, audio
  • Monitoring devices, PDFs
3. Variety
21
2. Velocity

◻ The speed at which data is generated.
◻ High-frequency stock trading algorithms reflect market changes within microseconds.
◻ Machine-to-machine processes exchange data between billions of devices.
◻ Online gaming systems support millions of concurrent users, each producing multiple inputs per second.
2. Velocity: examples of scale
● 30 billion RFID tags today (1.3 billion in 2005)
● 4.6 billion camera phones worldwide
● 400 million tweets every day
● 100s of millions of GPS-enabled devices sold annually
● 100+ TBs of log data every day
● 2+ billion people on the Web by end of 2011
● 76 million smart meters in 2009, 200 million by 2014
Examples of Velocity

● Almost 2.5 million queries on Google are performed.
● Around 20 million photos are viewed.
● Every minute, 48 hours of video are uploaded to YouTube.
● Every minute, over 200 million emails are sent.
● 400 million tweets are sent every day.
Characteristics of Big Data (Vs of Big Data)

The five Vs: Volume, Velocity, Variety, Veracity, Value.
4. VALUE

• Having access to big data is no good unless we can turn it into value.
• Companies are starting to generate amazing value from their big data:
 -> Discovering a consumer preference or sentiment
 -> Making a relevant offer by location
 -> Identifying a piece of equipment that is about to fail.
4. VALUE (contd.)

● The real big data challenge is a human one:
 -> learning to ask the right questions,
 -> recognizing patterns,
 -> making informed assumptions,
 -> predicting behavior.
5. Veracity / Validity

◻ Consistency, accuracy, quality and trustworthiness of the data.
◻ Are the results meaningful for the given problem space?
◻ Especially in automated decision-making, where no human is involved anymore, you need to be sure that both the data and the analyses are correct.
All Vs
30
Types of data

Data types: Structured data, Unstructured data, Semi-structured data.
Types of data

1. Structured Data
◻ Refers to data that has a defined length and format.
◻ Examples: numbers, dates, and groups of words and numbers called strings.
◻ It is usually stored in a database.
Types of data

◻ Structured Data: Applications
 Airline reservation systems
 Inventory control
 CRM systems
 ERP systems
Types of data

2. Unstructured Data
✔ Data that cannot be fitted into tabular databases.
✔ Applications: music (audio), movies (video), pictures, X-rays.
Types of data

◻ Semi-Structured Data
 Data which does not have a proper (rigid) format attached to it.
 Examples:
 ◻ Data within an email
 ◻ Data in a Doc file
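For a concrete illustration (not from the original slides), the email example above can be written as semi-structured JSON: the field names travel with the data, but there is no fixed schema that every record must follow. A minimal Python sketch with a hypothetical email record:

import json

# A hypothetical email record in JSON: keys describe the data, but different
# emails may carry different fields (semi-structured, not a fixed table schema).
email_json = """
{
  "from": "alice@example.com",
  "to": ["bob@example.com"],
  "subject": "Project update",
  "body": "See the attached report.",
  "attachments": [{"name": "report.pdf", "size_kb": 220}]
}
"""

record = json.loads(email_json)
print(record["subject"])            # Project update
print(len(record["attachments"]))   # 1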
Traditional vs. Big Data approach (2/5 marks)

(Comparison slides contrasting the traditional approach with the Big Data approach.)

Case Study of Big data solution---1 (8 marks)

◻ Big Data in the Education Industry
 Customized and dynamic learning programs
 Reframing course material
 Grading systems
 Career prediction
Types of Educational Data Collected
39
Key Applications of Big Data in Education
a. Personalized Learning

● Adaptive learning platforms tailor content to individual student needs.

● Example: Platforms like Khan Academy or Coursera adjust difficulty and topics based on learner performance.

40
🧠 b. Early Warning Systems

● Predict students at risk of dropping out or failing using data on attendance, grades, and engagement.

● Enables timely interventions by educators.

📊 c. Curriculum & Instructional Improvement

● Analyze which teaching methods and materials produce the best results.

● Optimize curriculum design with real-time feedback.

🤖 d. Automated Grading and Feedback

● Use AI and big data for grading essays, quizzes, and assignments to reduce educator workload.

🌐 e. Enhancing Administrative Efficiency

● Improve resource allocation—class sizes, faculty hiring, and facility management—based on data-driven insights.

💡 f. Lifelong Learning & Career Pathways

● Track student skills and recommend courses or career options aligned with job market trends.
41

Case Study: Arizona State University (ASU)


Big Data Initiative:

● Uses data from online courses, campus systems, and student engagement.

● AI models identify students needing help and customize support.

Outcome:

● Improved retention rates by 5-10%

● Increased graduation rates with targeted interventions


Benefits of Big Data in Education
42
Case Study of Big data solution---2
43

◻ Big Data in the Healthcare Industry
 Big Data reduces the cost of treatment, since there is less chance of having to perform unnecessary diagnoses.
 It helps avoid preventable diseases by detecting them in early stages.
 Patients can be provided with evidence-based medicine.
Types of Healthcare Data
44
Key Applications of Big Data in Healthcare
45

a. Personalized Medicine
● Tailoring treatments based on patient genetics, lifestyle, and history.

● Example: Cancer therapies targeted using genomic data.

b. Predictive Analytics
● Predict disease outbreaks, patient admissions, and risk of chronic conditions.

● Hospitals use predictive models to reduce readmissions.

c. Genomic Research
● Big Data enables faster DNA sequencing and analysis for discovering disease
markers.
46

d. Operational Efficiency
● Optimize hospital resource allocation (staffing, equipment use) based on patient flow data.

e. Remote Monitoring & Wearables


● Continuous patient monitoring for early detection of complications.

f. Drug Discovery & Development


● Analyze vast clinical data to accelerate new drug approvals and trials.

g. Fraud Detection & Compliance


● Identify insurance fraud and ensure regulatory compliance using data analytics.
47

Case Study: Mount Sinai Health System


Big Data Initiative:

● Uses data from EHRs, genomics, and wearables.

● Developed AI models to predict patient deterioration in ICUs.

Impact:

● Improved patient outcomes by early intervention.

● Reduced ICU stays and healthcare costs.


Benefits of Big Data in Healthcare
48
Case Study of Big data solution---3
49

◻ Big Data in Weather Patterns


In weather forecasting
To study global warming
In understanding the patterns of
natural disasters
To make necessary preparations in the
case of crises
To predict the availability of usable
water around the world
50

Weather forecasting has become increasingly data-driven thanks to the


explosion of Big Data. Massive volumes of sensor, satellite, and historical
climate data allow meteorologists to:
● Predict weather events more accurately and earlier

● Track climate change trends

● Issue warnings for natural disasters like hurricanes, floods, and


droughts
Types of Weather Data Collected

51
Applications of Big Data in Weather Forecasting

52

a. Improved Short-term Forecasting

● Real-time data streams enable high-resolution models to predict storms, rain, heatwaves.

● Example: NOAA (National Oceanic and Atmospheric Administration) uses petabytes of data daily for
precise forecasts.

b. Climate Modeling & Research

● Analyzing decades of historical weather data to identify long-term climate trends.

● Helps understand global warming, glacier retreat, sea level rise.

c. Disaster Prediction & Early Warning Systems

● Big data models predict hurricanes, cyclones, tornado paths.

● Governments use these models to evacuate and prepare populations.

d. Agriculture & Resource Management

● Weather data combined with soil moisture and satellite imagery guide irrigation schedules and crop
protection.

e. Energy Sector Optimization

● Forecasting wind and solar patterns to manage renewable energy grids efficiently.
53

Case Study: IBM’s The Weather Company


Big Data Approach:

● Uses billions of weather observations daily from satellites, radars, weather stations, and
aircraft.

● Applies machine learning to improve forecasting models.

● Provides hyperlocal weather forecasts via mobile apps.

Impact:

● More accurate, localized forecasts support agriculture, disaster management, and


transportation.
Benefits of Big Data in Weather Forecasting
54
Case Study of Big data solution---4
55

◻ Big Data in Transportation Industry


Route planning
Congestion management and traffic control
Safety level of traffic
https://www.youtube.com/watch?v=O76A0FIFdqs
56

The transportation industry is rapidly evolving through digital transformation. Big Data is enabling smarter logistics, predictive
maintenance, route optimization, and real-time tracking — leading to safer, faster, and more efficient travel and freight movement.

What Kind of Data Is Collected?


Key Applications of Big Data in Transportation

a. Route Optimization

● Real-time data helps in rerouting based on traffic, weather, or construction.

● Example: Uber and Ola adjust routes dynamically using Google Maps APIs + internal ride data.
57
b. Fleet Management & Logistics

● Companies like FedEx and DHL use data to:

○ Monitor delivery vehicle locations

○ Improve fuel efficiency

○ Predict delivery times

c. Smart Public Transit

● Passenger flow analysis to adjust bus/train frequencies.

● Example: London Underground uses Oyster card data + CCTV to predict peak loads.

d. Predictive Maintenance

● Airlines and railways analyze engine data to predict part failures.

● Example: GE Aviation collects terabytes of jet engine data per flight to prevent breakdowns.

e. Infrastructure Planning

● Big Data helps cities design better roads, bike lanes, and signal systems based on real commuter behavior.

f. Accident & Risk Analysis

● Crash data + driving patterns help identify dangerous zones, and adjust speed limits or signage.
58

Benefits of Big Data in Transportation

Tools & Technologies in Use


59

Future Trends
● Autonomous Vehicles: Data from LIDAR, cameras, sensors, and AI

● Hyperloop & High-Speed Rail: Real-time system diagnostics

● Urban Mobility-as-a-Service (MaaS) platforms

● Electric Vehicle (EV) analytics: Charging patterns, battery health


Case Study of Big data solution---5
60

◻ Big Data in Banking Sector


Misuse of credit/debit cards
Business clarity (Make data-driven lending
and investment decisions)

Customer statistics alteration


Money laundering
Risk mitigation (Reduce risk and fraud
)
Increase profitability
Improve customer service
Types of Data Used in Banking
61
62

Key Applications of Big Data in Banking


a. Fraud Detection & Risk Management
● Real-time anomaly detection for suspicious transactions

● Uses machine learning models trained on millions of data points


● Example: Unusual location + large amount + new device triggers alert

b. Personalized Banking & Offers


● Tailor-made financial products based on spending patterns and life stages

● Example: Offering home loans to customers saving aggressively

c. Chatbots & Customer Service


● AI-driven chatbots like HDFC’s Eva or SBI’s Intelligent Assistant

● Use data from past queries to predict and auto-respond

d. Credit Scoring & Lending


● Traditional credit scores augmented with behavioral & alternative data

● Example: Paytm or Fintechs use mobile usage + payments history for loans

e. Predictive Analytics for Investment & Retention


● Predict customer churn and take preemptive action (offers, engagement)

● Analyze economic indicators to guide investment decisions

f. Regulatory Compliance (RegTech)


● Use of big data to comply with KYC/AML regulations

● Automates tracking of large transactions and suspicious activity


63

Case Study: JPMorgan Chase

JPMorgan Chase & Co. is an American multinational investment bank and financial services holding
company headquartered in New York City. It's one of the largest financial institutions in the world,
and the largest of the "Big Four" banks in the United States. JPMorgan Chase offers a wide array of
financial services, including consumer and commercial banking, investment banking, financial
transaction processing, and asset management.

Challenge: Manual legal document review for financial risk

Big Data Solution:

● Developed COIN (Contract Intelligence) platform using machine learning

● Reviewed 12,000 documents in seconds, saving 360,000 man-hours annually

Impact:

● Faster contract processing

● Reduced errors and operational risk


Case Study of Big data solution---6
64

◻ Big Data in Media and Entertainment


Industry
Predicting the interests of audiences
(Understand audience preferences)
Optimized or on-demand scheduling of
media streams in digital media
distribution platforms (Optimize content production)
Getting insights from customer
reviews(Personalize user experiences)
Effective targeting of the advertisements
(Drive advertising revenue)

https://www.youtube.com/watch?v=TzxmjbL-i4Y&t=2s
65

What Kind of Data Is Collected?

M&E companies gather structured and unstructured data from multiple sources:
Applications of Big Data in M&E

66

a. Personalized Recommendations
● Netflix, YouTube, Spotify use algorithms that analyze watch/listen history to suggest content. Example: Netflix claims 80% of viewer
activity is driven by recommendations.

b. Content Creation & Acquisition


● Analyze viewing trends to decide what types of shows/movies to produce.Example: Netflix’s House of Cards was greenlit using
insights showing:

○ Kevin Spacey was popular


○ Political dramas had strong engagement
○ David Fincher’s content was widely watched

c. Targeted Advertising
● Platforms use user data to show demographically relevant ads, increasing ROI for advertisers. Example: YouTube uses Google’s ad
system to deliver interest-based video ads.

d. Predictive Analytics for Hits


● Big data helps predict the success of a song, movie, or series based on:

○ Past success of similar genres


○ Viewer demographics
○ Social media sentiment

e. Piracy & Copyright Protection


● AI + data tools monitor for illegal content distribution.

● Use of watermarks and content fingerprinting to trace leaks.

f. User Engagement Optimization


● Platforms test thumbnail images, show titles, and trailers based on A/B data to drive more clicks and views.
67

Case Study: Netflix


Goal: Improve viewer retention and reduce content cost

Big Data Use:

● Analyzed billions of interactions (pauses, rewinds, binging)

● Customized thumbnails based on user behavior

● Tracked watch trends across regions and time of day

Outcome:

● Saved millions in failed productions

● Increased user satisfaction and reduced churn


Case Study
68

1. Crash Overview

● Flight AI 171 (registration VT-ANB), a Boeing 787-8 Dreamliner, departed Ahmedabad for London Gatwick on June 12, 2025 around 13:38 IST (08:08 UTC).
● Shortly after takeoff, the aircraft stalled at approximately 600-625 ft, and within 30-36 seconds the crew issued a "Mayday - no power" emergency call.
● It crashed into a medical college hostel near the airport, tragically killing 241 on board and several more on the ground. One passenger survived.

2. Black Box Recovery & Data Extraction

● Both the Flight Data Recorder (FDR) and Cockpit Voice Recorder (CVR) were recovered:

○ FDR from the rooftop of the building hit on June 13


○ CVR was later recovered from crash debris on June 16 .

● Both were sent to the AAIB’s new New Delhi lab, with support from the U.S. NTSB, and data extraction
began around June 24–25
69

3. Role of Big Data & Analytics


● FDR logs include continuous parameters—altitude, airspeed, engine RPM, fuel flow, control positions—and
event-records for anomalies
● CVR provides cockpit conversations, alerts, alarms, and “Mayday…” call, essential for human-factors analysis.

● Data is cross-referenced with other sources:

○ Maintenance logs (engines and C‑checks)


○ Radar & ATC communications
○ CCTV/enthusiast videos showing possible RAM Air Turbine (RAT) deployment

Insights & Hypotheses (Preliminary)


● Complete power loss—pilots explicitly report “no power”
● Evidence suggests dual engine failure: RAT deployment followed by inability to climb
● Possible causes include:

○ Simultaneous engine shutdown/fuel control issues


○ Bird strike
○ Flap misconfiguration affecting lift
○ High ambient temperature (reduced performance)
○ Human error or system malfunction
70
The rise of big data

(A sequence of slides tracing the rise of big data, leading to two key ideas: Parallel Processing and Distributed Databases.)
Big data challenges and solution
77
5 marks

WHAT IS HADOOP?

● An open-source software framework for storing and processing big data in a distributed and scalable way on large clusters of commodity hardware.
● (The Apache Software Foundation is the parent organization that provides free, reliable, and scalable software tools like Hadoop, all developed by a large global community.)
● While "Hadoop" is often thought of as an acronym, it actually doesn't stand for anything specific. It was named after a toy elephant belonging to the son of Doug Cutting, the creator of Hadoop.
What does Hadoop do?
79

Hadoop is a framework that:


● Stores massive volumes of data across multiple computers (distributed storage)

● Processes that data quickly and efficiently using parallel processing


(Simple Analogy: Imagine you have a huge book to read. You split it into 100 pages and
give 10 friends 10 pages each to read at the same time. They read it in parallel and report
their findings. That’s how Hadoop works with big data)

What is Hadoop and How It Works?


Apache Hadoop is an open-source framework that stores and processes huge data sets
across a cluster of computers. It uses HDFS for distributed storage and MapReduce for
parallel processing. Data is broken into blocks and stored across multiple nodes. If one
node fails, Hadoop uses backup copies to recover. It's cost-effective, scalable, and
fault-tolerant, making it ideal for big data analytics.
Instead of backups, Hadoop uses data replication. Internally, it creates multiple copies of each block of data (by default, 3
copies). It also has a function called 'distcp', which allows you to replicate copies of data between clusters. This is what's
typically done for "backups" by most Hadoop operators.
Visual Diagram of How
Hadoop Works
80

Components Explained:
● Client sends data for storage or processing.

● NameNode tracks where data is stored (like


an index).

● DataNodes store actual data blocks.

● MapReduce runs in parallel on the


DataNodes.

● If one node fails, data is available from


another node (replicated).
81

The NameNode acts as the central control unit of HDFS. It stores metadata for the HDFS file system,
managing the directory structure and tracking file metadata. The NameNode does not store the actual
data but rather serves as the system's “index.”

Data Storage: DataNodes store the actual data blocks of files in HDFS. They are responsible for reading and writing these data blocks. Heartbeat and Block Report: DataNodes send periodic heartbeat signals to the NameNode to indicate that they are alive and well, confirming their availability, and send "block reports" detailing the data blocks they store. This mechanism ensures the NameNode knows the status and data distribution across the cluster.

The "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing by
marshalling the distributed servers, running the various tasks in parallel, managing all communications
and data transfers between the various parts of the system, and providing for redundancy and fault
tolerance. In essence, it simplifies the complexities of distributed computing by abstracting away the low-level
details of managing servers and data movement.

Replication ensures that data or services are accessible even if some nodes fail or become
unavailable. Users can access replicated data from other available nodes, reducing downtime and
improving system reliability.
While replication guarantees that data remains accessible even with node failures, ensuring that all copies of the
data are identical (consistent) after updates becomes a complex task.
82

To address the consistency challenge in data replication, systems use various consistency models and consensus
algorithms. These mechanisms ensure that replicas converge to the same state after updates, either by requiring
immediate consistency across all nodes or by allowing some temporary divergence with eventual consistency.

1. Consistency Models:
Strong Consistency:
Ensures that all replicas are updated simultaneously, and any read operation will always return the most recent
write. This provides the highest level of consistency but can impact performance due to the need for coordination
across all replicas. This is often achieved with synchronous replication where the leader waits for confirmation
from followers before acknowledging a write request.
Weak/Eventual Consistency:
Allows replicas to diverge temporarily, but guarantees that all replicas will eventually converge to the same state.
This offers higher performance and scalability but may result in reads returning slightly stale data.
Other Consistency Models:
There are various other models like causal consistency, which preserves causality between related
operations, and sequential consistency, which provides a total order of operations across all replicas.

2. Consensus Algorithms:
Raft, Paxos, and other consensus algorithms such as Zab (used in Apache ZooKeeper) and Viewstamped Replication are used, each with its own strengths and weaknesses.
83

Replication Strategies:
Synchronous Replication:
The leader node waits for all replicas to acknowledge a write before confirming the write to the client. This
provides strong consistency but can slow down write operations.

Asynchronous Replication:
The leader node doesn't wait for acknowledgments from all replicas. This improves write performance but can
lead to eventual consistency issues.

Semi-synchronous Replication:
A hybrid approach that aims to balance consistency and performance. It typically waits for a quorum of replicas to

acknowledge the write before confirming to the client.
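To make the three strategies concrete, here is a minimal Python sketch (illustrative only, not Hadoop or HDFS code): the strategies differ only in how many follower acknowledgements the leader waits for before confirming a write to the client.

# Illustrative sketch (not Hadoop/HDFS code): the three strategies differ only
# in how many follower acknowledgements the leader waits for before confirming.

def required_acks(mode: str, num_followers: int) -> int:
    if mode == "synchronous":        # wait for every follower -> strong consistency
        return num_followers
    if mode == "semi-synchronous":   # wait for a quorum (majority) of followers
        return num_followers // 2 + 1
    return 0                          # asynchronous: confirm immediately

def leader_write(leader: list, followers: list, value, mode: str) -> str:
    leader.append(value)                           # leader applies the write locally
    needed = required_acks(mode, len(followers))
    acks = 0
    for follower in followers:
        if acks >= needed:
            break                                  # stop waiting; the rest catch up lazily
        follower.append(value)                     # ship the write and wait for the ack
        acks += 1
    return f"write confirmed after {acks}/{len(followers)} follower acks ({mode})"

leader, followers = [], [[], [], []]
print(leader_write(leader, followers, "x=1", "synchronous"))       # 3/3 acks
print(leader_write(leader, followers, "x=2", "semi-synchronous"))  # 2/3 acks
print(leader_write(leader, followers, "x=3", "asynchronous"))      # 0/3 acks

The followers that did not acknowledge before the confirmation are the ones that can serve slightly stale reads, which is exactly the eventual-consistency trade-off described above.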


2/5 Marks

84
FEATURES OF HADOOP
1. Cost effective system
2. Large cluster of node
3. Parallel Processing
4. Distributed Data
5. Automatic failover Management
6. Data locality optimization
7. Heterogeneous cluster
8. Scalability

41
FEATURES OF HADOOP
Cost effective system:
 It can be implemented on simple hardware; these hardware components are technically referred to as commodity hardware (like your PC or laptop). This drastically reduces the hardware cost. Hadoop is free to use, so there's no need to pay for software licenses, unlike many traditional database systems.
85
FEATURES OF HADOOP
86

Large cluster of node:


 Offering More Computing Power and a Huge Storage system to
the clients.
● Each node handles a part of the data or processing.

● As data grows, you can simply add more nodes


(horizontal scaling).

● Clusters can contain hundreds or thousands of


nodes, making them ideal for big data systems.

43
FEATURES OF HADOOP
87

Parallel Processing
 Data can be processed simultaneously across all the nodes
within the cluster, and thus saving a lot of time.
● Faster execution: Tasks are completed quicker by working in parallel.
● Efficiency: Utilizes all computing power available in the cluster.
● Scalability: Works well as data volume increases.

Example:
In Hadoop's MapReduce:
● A large file is split into blocks.
● Each block is processed in parallel by different nodes in the cluster.
● Final results are combined (reduced) to give the complete output.
FEATURES OF HADOOP
88

◻ Distributed Data

● Improves speed: Data is accessed and processed from multiple locations simultaneously.
● Enhances fault tolerance: Copies of data are stored on different nodes — if one fails, others have backups.
● Enables parallel processing: Each node works on its own part of the data.
● Supports large-scale storage: Easily handles terabytes or petabytes of data.

Example: In Hadoop’s HDFS (Hadoop Distributed File System), A 100 GB file may be split into 64 MB blocks.
These blocks are stored across different nodes in the cluster. Processing happens close to where each block resides.
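Using the figures from the example above, here is a minimal Python sketch (illustrative only; the node names and round-robin placement are made up, and real HDFS placement is rack-aware) of how many blocks the file produces and where their replicas might land:

import math

# Figures from the example above: 100 GB file, 64 MB blocks, replication factor 3.
FILE_SIZE_MB  = 100 * 1024          # 100 GB expressed in MB
BLOCK_SIZE_MB = 64
REPLICATION   = 3
NODES = [f"node{i}" for i in range(1, 7)]   # a hypothetical 6-node cluster

num_blocks = math.ceil(FILE_SIZE_MB / BLOCK_SIZE_MB)
print(f"{num_blocks} blocks of {BLOCK_SIZE_MB} MB")       # 1600 blocks

# Naive round-robin placement of the 3 replicas of each block.
placement = {
    block: [NODES[(block + r) % len(NODES)] for r in range(REPLICATION)]
    for block in range(num_blocks)
}
print(placement[0])   # e.g. ['node1', 'node2', 'node3']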
FEATURES OF HADOOP
89
Automatic failover Management
 The Hadoop Framework would replace that particular machine,
with another machine
 It replicates all the configuration settings and the data, from the failed
machine onto this newly replicated machine.
Automatic failover Management

90

Automatic failover management is a system’s ability to detect failures and automatically switch to a backup
component or node, without human intervention, to avoid downtime or data loss.

● Ensures high availability of services

● Minimizes system downtime in case of hardware or software failure

● Data is not lost because it is replicated across nodes

● Critical for 24/7 operations and large-scale data processing

Example in Hadoop:
● The NameNode is the master in HDFS.

● If the active NameNode fails, the Standby NameNode takes over automatically.

● Similarly, data blocks are stored in multiple nodes (default 3 copies), so if one node crashes, another has a
backup.
FEATURES OF HADOOP
91

◻ Data locality optimization:
 Instead of moving petabytes of data across the network (for example, between a data centre in India and a program in the USA), the code, which is only a few megabytes in size, is transferred to the data centre where the data resides.
 This saves a lot of time and bandwidth.
FEATURES OF HADOOP
92

Heterogeneous cluster
 Each node can be from a different vendor, and each node can be
running a different version and flavor of operating system.

IBM Intel AMD HP

RHEL UBUNTU FEDORA CENTOS


FEATURES OF HADOOP
93

Scalability
 We can easily add or remove a node to or from a Hadoop
Cluster without bringing down or affecting the cluster
operation.
10 Marks

HDFS Architecture Overview


94
10 Marks

HADOOP’S ARCHITECTURE

◻ Hadoop Components:
● HDFS: Distributed Storage
● MapReduce: Distributed Processing
● YARN: Yet Another Resource Negotiator (Job Scheduler and Resource Manager)
● Common Utilities: Java libraries and utilities (Java files and scripts)
HDFS Architecture Overview

● The Namenode periodically receives a heartbeat and a block report from each Datanode in the cluster.
● Every Datanode sends a heartbeat message to the Namenode every 3 seconds.
● The health report is simply information about whether a particular Datanode is working properly or not; in other words, whether that Datanode is alive.
● A block report of a particular Datanode contains information about all the blocks that reside on that Datanode.
● When the Namenode doesn't receive any heartbeat message for 10 minutes (by default) from a particular Datanode, that Datanode is considered dead or failed by the Namenode.
● Since its blocks will then be under-replicated, the system starts the replication process from one Datanode to another, using the block information from the block report of the corresponding Datanode.
● The data for replication transfers directly from one Datanode to another, without the data passing through the Namenode.
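A minimal sketch of the NameNode-side bookkeeping just described (illustrative only, not HDFS code; the 3-second heartbeat and 10-minute dead-node timeout are the defaults mentioned above):

import time

HEARTBEAT_INTERVAL_S = 3          # every DataNode reports every 3 seconds
DEAD_AFTER_S = 10 * 60            # silent for 10 minutes -> considered dead

last_heartbeat: dict[str, float] = {}      # DataNode id -> time of last heartbeat

def receive_heartbeat(datanode_id: str) -> None:
    last_heartbeat[datanode_id] = time.time()

def dead_datanodes(now: float) -> list[str]:
    """DataNodes that have been silent longer than the dead-node timeout."""
    return [dn for dn, t in last_heartbeat.items() if now - t > DEAD_AFTER_S]

# When a DataNode is declared dead, its blocks become under-replicated and the
# NameNode schedules new copies from the surviving replicas (DataNode to
# DataNode, without the data passing through the NameNode).
receive_heartbeat("dn1")
receive_heartbeat("dn2")
print(dead_datanodes(time.time() + 11 * 60))   # ['dn1', 'dn2'] after 11 silent minutes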

The three major components of HDFS, i.e. Namenode, Datanode and Secondary NameNode, are explained in the following slides.

Namenode

● The Hadoop file system is a master/slave file system in which the Namenode works as the master and the Datanodes work as slaves.
● The Namenode is a critical component of the Hadoop file system because it acts as the central component of HDFS.
● If the Namenode goes down, the whole Hadoop cluster is inaccessible and considered dead.

Basic operations of Namenode:

● Namenode maintains and manages the Data Nodes and assigns tasks to them.
● Namenode does not contain the actual data of files.
● Namenode stores metadata of the actual data, such as filename, path, number of data blocks, block IDs, block locations, number of replicas, and other slave-related information.
● Namenode manages all client requests (read, write) for the actual data files.
● Namenode executes file system namespace operations such as opening/closing files and renaming files and directories.
● Takes care of authorization and authentication.
● Takes care of the replication factor.
● Creates checkpoints and logs namespace changes.
● Handles DataNode failure.
Namenode

NAMENODE (one per cluster)

Maintains the HDFS namespace (held in RAM), which includes:
 file name
 file path
 file permissions
 file owner
 number of blocks
 block IDs
 replication level

Maintains the mapping of file blocks to DataNodes:
 Read: ask the NameNode for the block locations
 Write: ask the NameNode to nominate DataNodes
Namenode

Namenode uses two files for storing this metadata information:

 FsImage:
It is the snapshot of the filesystem when the Namenode started. Think of it like a blueprint that shows how files and directories are organized within the system. It allows HDFS to quickly load the current state of the namespace without having to replay all the entries in the EditLog.

 EditLog:
It is the sequence of changes made to the filesystem after the Namenode started. When a change occurs in HDFS (e.g., a file is created), the change is recorded in the EditLog. This log allows the NameNode to recover the state of the file system in case of a failure, ensuring data consistency and durability.

 Only on a restart of the Namenode are the edit logs applied to the fsimage to get the latest snapshot of the file system.
100
60
HDFS ARCHITECTURE OVERVIEW..
NAMENODE – ONE PER CLUSTER
101

Functions:
 Maintain and manage “node” (Slave node) information
 Manage namespace of the file system in memory
 Take care of authorization and authentication.
 Take care of the replication factor
 Create checkpoints and logs the namespace change
 Handle DataNode failure

61
Datanode

DATANODE: The Datanode stores the actual data and works as instructed by the Namenode. A Hadoop file system can have multiple DataNodes but only one active Namenode.

 DataNode is a daemon (a process that runs in the background) that runs on the 'SlaveNode' in a Hadoop cluster.
 In HDFS, a file is broken into small chunks called blocks (default block size of 128/256 MB).
 These blocks of data are stored on the slave nodes.
 The DataNode stores the actual data, so a large number of disks is required (8 disks recommended).
 The data read/write operations to disk are performed by the DataNode.
 Commodity hardware can be used for hosting DataNodes.
Datanode

Basic Operations of Datanode:

● Datanodes are responsible for storing the actual data.
● Upon instruction from the Namenode, they perform operations like creation/replication/deletion of data blocks.
● When one of the Datanodes goes down, it does not affect the Hadoop cluster, due to replication.
● All Datanodes are synchronized in the Hadoop cluster in such a way that they can communicate with each other for various operations.
Secondary NameNode (SNN)

SECONDARY NAMENODE (SNN):

 Is a helper to the Primary NameNode (PNN), but does not replace the PNN.
 Is responsible for performing periodic checkpoints.
 It is usually run on a different machine than the PNN because it has the same memory requirements as the PNN.
 It stores the latest checkpoint in a directory which is structured the same way as on the PNN.
 The checkpoint image is always ready to be read by the PNN if necessary.

104
Secondary NameNode (SNN)

Process:
1. The SNN asks the PNN to roll its edits file,
so new edits go to a new file. (Rolling edits
means finalizing the current
edits_inprogress and starting a new one.)
2. The SNN retrieves fsimage and edits
from the PNN
3. The SNN loads fsimage into memory,
applies each operation from edits, then
creates a new consolidated fsimage file.
4. The SNN sends the new fsimage
back to the primary.
5. The PNN replaces the old fsimage with the
new one from the SNN, and the old edits
file with the new one it started in step 1.
6. It also updates the fstime file to record the
time that the checkpoint was taken.
(checkpointing is the process of merging any
outstanding edit logs with the latest fsimage,
saving the full state to a new fsimage file)
(The "fstime" file stores the file system timestamps, specifically the last write time, and potentially creation and access times. It includes fields for hours, minutes, seconds, day, month, and year. It is used alongside the fsimage and edits files to track metadata changes.)
HADOOP’S ARCHITECTURE

Main Components:
 HDFS: Distributed Storage
 MapReduce: Distributed Processing
 YARN
 Common Framework Utilities
MAIN COMPONENTS OF MAPREDUCE

Job Tracker (a daemon on the Master Node, alongside the NameNode):
 Resource management.
 Assigns MapReduce tasks to Task Trackers in the cluster of nodes.
 Sometimes reassigns the same tasks to other Task Trackers when the previous Task Trackers fail or shut down.
 Maintains the status of all Task Trackers: up/running, failed, recovered, etc.
MAIN COMPONENTS OF MAPREDUCE

Task Tracker (a daemon on the Slave Nodes, alongside the DataNode):
 Agents deployed to each machine in the cluster to run the map and reduce tasks.
 Executes the tasks assigned by the Job Tracker.
 Sends the status of those tasks to the Job Tracker.

Job History Server:
 A component that tracks completed jobs; typically deployed as a separate function or together with the Job Tracker.
MAP-REDUCE

(Data flow: Big Data is split across multiple Map() tasks, whose outputs feed the Reduce() tasks, which produce the final output.)

MapReduce is a programming framework that allows us to perform distributed and parallel processing on large data sets in a distributed environment.

MapReduce consists of two distinct tasks: Map and Reduce.

The Reducer phase takes place after the Mapper phase has been completed.
10 Marks

MAPREDUCE

(Big Data -> Map() tasks -> Reduce() -> output)

The Map job:
 A block of data is read and processed to produce key-value pairs as intermediate outputs.
 The output of a Mapper or Map job (key-value pairs) is input to the Reducer.

The Reducer:
 Receives the key-value pairs from multiple Map jobs.
 Then the Reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output.
MAPREDUCE… EG.
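The original example slide is an image; below is a minimal word-count sketch in the same spirit, written in the Hadoop Streaming style (Python functions standing in for the Mapper and Reducer; illustrative only, not taken from the slides, and the input lines are made up for the example).

from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) key-value pair for every word in the input."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: pairs arrive grouped by key; sum the counts per word."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    text = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
    # Hadoop sorts and shuffles the mapper output by key before the reduce phase.
    shuffled = sorted(mapper(text))
    for word, total in reducer(shuffled):
        print(f"{word}\t{total}")
    # bear 2 / car 3 / deer 2 / river 2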
5/10 Marks

HADOOP’S ARCHITECTURE

Main Components:
 MapReduce
 HDFS
 YARN: Yet Another Resource Negotiator (Job Scheduler and Resource Manager)
 Common Framework Utilities
YARN
(YET ANOTHER RESOURCE NEGOTIATOR)

The responsibilities of the classic Job Tracker are handled in YARN by three daemons: the Resource Manager, the Node Manager, and the Application Master.
YARN
114
(YET ANOTHER RESOURCE NEGOTIATOR)

72
YARN
(YET ANOTHER RESOURCE NEGOTIATOR)

Resource Manager
 This daemon process resides on the Master Node.
 Responsible for:
 ❑ Managing resource scheduling for different compute applications in an optimum way
 ❑ Coordinating with two processes on the master node: the Scheduler and the Application Manager
YARN
(YET ANOTHER RESOURCE NEGOTIATOR)

Scheduler
 This daemon process resides on the Master Node (runs along with the Resource Manager daemon).
 Responsible for:
 ❑ Scheduling job execution as per the submission requests received by the Resource Manager
 ❑ Allocating resources to applications submitted to the cluster
 ❑ Coordinating with the Application Manager daemon and keeping track of the resources of running applications

Application Manager
 This daemon process resides on the Master Node (runs along with the Resource Manager daemon).
 Responsible for:
 ❑ Helping the Scheduler daemon keep track of running applications through coordination
 ❑ Negotiating the first container for executing application-specific tasks with a suitable Application Master on a slave node
YARN
(YET ANOTHER RESOURCE NEGOTIATOR)

Node Manager
 o This daemon process resides on the slave nodes (runs along with the DataNode daemon).
 o Responsible for:
 • Managing and executing containers
 • Monitoring resource usage (i.e. usage of memory, CPU, network, etc.) and reporting it back to the Resource Manager daemon
 • Periodically sending heartbeats to the Resource Manager for its health status updates
YARN
(YET ANOTHER RESOURCE NEGOTIATOR)

Application Master
 o This daemon process runs on the slave nodes (along with the Node Manager daemon).
 o It is a per-application library that works with the Node Manager to execute the tasks.
 o There is one instance of this daemon per application, which means that when multiple jobs are submitted to the cluster, there may be more than one instance of the Application Master on the slave nodes.
 o Responsible for:
 ▪ Negotiating suitable resource containers on slave nodes from the Resource Manager
 ▪ Working with one or multiple Node Managers to monitor task execution on slave nodes
YARN WORKFLOW:

 1. The client submits an application.
 2. The Resource Manager allocates a container to start the Application Manager.
 3. The Application Manager registers with the Resource Manager.
 4. The Application Manager asks for containers from the Resource Manager.
 5. The Application Manager notifies the Node Manager to launch the containers.
 6. The application code is executed in the container.
 7. The client contacts the Resource Manager / Application Manager to monitor the application's status.
 8. The Application Manager unregisters with the Resource Manager.

https://www.youtube.com/watch?v=Pu9qgnebCjs
10 marks

120
HADOOP ECOSYSTEM

80
121
HADOOP ECOSYSTEM
Core Hadoop Components
❑ Hadoop Distributed File System (HDFS)
❑ MapReduce- Distributed Data Processing
Framework of Apache Hadoop
❑ YARN

81
122
HADOOP ECOSYSTEM
Core Hadoop Components
HDFS
HDFS is the one, which makes it possible to store different types of large
data sets (i.e. structured, unstructured and semi structured data).
It helps us in storing our data across various nodes and maintaining
the log file about the stored data (metadata).
HDFS has two core components, i.e. NameNode and DataNode.
Tasks of the HDFS NameNode:
 Manages the file system namespace.
 Regulates clients' access to files.
 Executes file system operations such as naming, closing and opening files and directories.
Tasks of the HDFS DataNode:
 Performs operations like block replica creation, deletion and replication according to the instructions of the NameNode.
 Manages the data storage of the system.
123
HADOOP ECOSYSTEM
Core Hadoop Components
MapReduce: the Distributed Data Processing Framework of Apache Hadoop
 Is responsible for analysing large datasets in parallel before reducing them to find the results.
 Operations performed:
 The Map task in the Hadoop ecosystem takes input data and splits it into independent chunks; the output of this task is the input for the Reduce task.
 In the same Hadoop ecosystem, the Reduce task combines the mapped data tuples into a smaller set of tuples.
 Meanwhile, both the input and output of tasks are stored in a file system.
 MapReduce takes care of scheduling jobs, monitoring jobs and re-executing failed tasks.
124
HADOOP ECOSYSTEM
Core Hadoop Components
YARN (Yet Another Resource Negotiator)
It provides the resource management.
YARN is called as the operating system of Hadoop as it is
responsible for managing and monitoring workloads.
It allows multiple data processing engines such as real-time
streaming and batch processing to handle data stored on a single
platform.

84
125
HADOOP ECOSYSTEM
Data Access Components of Hadoop
Ecosystem
Pig-
Hive

85
126
HADOOP ECOSYSTEM
Data Access Components of Hadoop
Ecosystem
Pig-
PIG has two parts:
• Pig Latin: the language
• The pig runtime: the execution environment(JRE)
It has SQL like command structure.
How does Pig work?
 In Pig, the load command first loads the data.
 Then various functions are performed on it, like grouping, filtering, joining, sorting, etc.
 At last, you can either dump the data on the screen or store the result back in HDFS.
 E.g.: healthcare data
HADOOP
127
ECOSYSTEM
Data Access Components of Hadoop Ecosystem
Hive
Open source data warehouse system for Summarization, querying
and analyzing large data set stored in files
Performs reading, writing and managing large data sets in
a distributed environment using SQL-like interface.
The query language of Hive is called Hive Query
Language (HQL)
 HIVE + SQL = HQL
The Hive Command line interface is used to execute HQL commands.
Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) are used to establish connections to the data storage.
HADOOP
128
ECOSYSTEM
Data Access Components of Hadoop Ecosystem
Hive
129
HADOOP ECOSYSTEM
Data Integration Components of Hadoop
Ecosystem-
Sqoop
Flume

88
HADOOP
130
ECOSYSTEM
Data Integration Components of Hadoop Ecosystem-
Sqoop
Is used for importing data from external sources into related Hadoop components
like HDFS, HBase or Hive.
It can also be used for exporting data from Hadoop to other external structured
data stores.
Sqoop works with relational databases such as Teradata, Netezza, Oracle and MySQL.
E.g.: Coupons.com
HADOOP
131
ECOSYSTEM
Data Integration Components of Hadoop Ecosystem-
Flume
Flume is a service which helps in ingesting unstructured and semi-structured data into HDFS.
It helps us ingest online streaming data from various sources like network traffic, social media, email messages, log files, etc. into HDFS.

90
HADOOP
132
ECOSYSTEM
Data Integration Components of Hadoop Ecosystem-
Flume
The flume agent has 3 components: source, sink and channel.
Source: it accepts the data from the incoming streamline and stores the data in
the channel.
Channel: it acts as the local storage or the primary storage. A Channel is
a temporary storage between the source of data and persistent data in
the HDFS.
Sink: collects the data from the channel and commits or writes the data in
the HDFS permanently.

(Flow: Web Server -> Source -> Channel -> Sink -> HDFS)
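A minimal sketch of the source -> channel -> sink flow (illustrative only, not Flume code or configuration): the channel is simply a buffer between the producer side and the consumer side.

from collections import deque

channel = deque()                      # temporary storage between source and sink
hdfs = []                              # stand-in for the permanent HDFS store

def source(events):
    """Accept events from the incoming stream and put them on the channel."""
    for event in events:
        channel.append(event)

def sink():
    """Drain the channel and commit the events permanently."""
    while channel:
        hdfs.append(channel.popleft())

source(["GET /index.html", "POST /login", "GET /feed"])   # e.g. web-server log lines
sink()
print(hdfs)    # ['GET /index.html', 'POST /login', 'GET /feed']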
133
HADOOP ECOSYSTEM
Data Storage Component of Hadoop Ecosystem
Hbase

92
HADOOP
134
ECOSYSTEM
Data Storage Component of Hadoop Ecosystem
Hbase
HBase is an open source, non-relational distributed
database. In other words, it is a NoSQL database.
It supports all types of data
HBase is a column-oriented database that uses HDFS
for underlying storage of data.
It can create large tables with millions of rows and
columns on commodity hardware machines.
HBase supports random reads and also batch
computations using MapReduce.
Eg. Facebook
91
135
HADOOP ECOSYSTEM
Monitoring and Management Components of Hadoop
Ecosystem-
Oozie
Zookeeper

92
136
HADOOP ECOSYSTEM
Monitoring and Management Components of Hadoop
Ecosystem-
Oozie
Is a clock and alarm service inside Hadoop Ecosystem.
For Apache jobs, Oozie has been just like a scheduler.
It schedules Hadoop jobs and binds them together as one logical work.
There are two kinds of Oozie jobs:
Oozie workflow:
These are sequential set of actions to be executed.
Oozie Coordinator:
These are the Oozie jobs which are triggered when the data is made
available to it.
An Oozie coordinator responds to the availability of data and it rests
otherwise.
93
HADOOP
137
ECOSYSTEM
Monitoring and Management Components of Hadoop
Ecosystem-
Zookeeper
Apache Zookeeper is a centralized service and a Hadoop
Ecosystem component
Zookeeper manages and coordinates a large cluster of
machines.
Features of Zookeeper:
• Fast – Zookeeper is fast with workloads where reads to data are more common than writes. The ideal read/write ratio is 10:1.
• Ordered – Zookeeper maintains a record of all transactions.
HADOOP
138
ECOSYSTEM
Apache AMBARI
The Ambari provides:
Hadoop cluster provisioning:
It gives us step by step process for installing Hadoop services across
a number of hosts.
It also handles configuration of Hadoop services over a cluster.
Hadoop cluster management:
It provides a central management service for starting, stopping and
re-configuring Hadoop services across the cluster.
Hadoop cluster monitoring:
For monitoring health and status, Ambari provides us a
dashboard.
The Ambari Alert framework is an alerting service which notifies the user whenever attention is needed.
 For example, if a node goes down or there is low disk space on a node, etc.
HADOOP
139
ECOSYSTEM
Apache MAHOUT
Mahout is open source framework for creating scalable
machine learning algorithm and data mining library.
Provides the data science tools to automatically find meaningful patterns
in those big data sets.
Algorithms of Mahout are:
Clustering – Here it takes the item in particular class and organizes them into
naturally occurring groups, such that item belonging to the same group are
similar to each other.
Collaborative filtering – It mines user behavior and makes product
recommendations (e.g. Amazon recommendations)
Classifications – It learns from existing categorization and then assigns
unclassified items to the best category.
Frequent pattern mining – It analyzes items in a group (e.g. items in a shopping cart or terms in a query session) and then identifies which items typically appear together.
Any Queries?

140
Descriptive Questions

1. What is Big Data? Explain the characteristics / 5 Vs of Big Data.
2. Explain the types of Big Data.
3. Differentiate between the traditional and the Big Data approach.
4. What is Hadoop? Explain the features of Hadoop.
5. Hadoop's architecture / components of Hadoop.
6. Components of the Hadoop ecosystem.
