BDA Module1
CHAPTER 1. INTRODUCTION TO BIG DATA AND HADOOP
Contents
Sizes of data
Big Data: > GB/TB/PB/EB in size
◻ Example...
It is difficult to edit a 10 TB file within a limited time in a traditional system.
(Slide figure: image files such as JPG, BMP and PNG must be 1. processed and 2. resized/enhanced 3. within a specified time.)
Sources of Big Data
Social Media Data: Social media sites such as Facebook and Twitter hold information and views posted by millions of people across the globe.
Power Grid Data: The power grid data holds information about the power consumed by a particular node with respect to a base station.
Characteristics of Big Data (Vs of Big Data)
5 Marks
(Slide figure: Volume, Velocity and Variety; the amount of data and the form of data.)
1. Volume
100s of millions of GPS-enabled devices sold annually
? TBs of data every day
4. VALUE
Consistency, accuracy, quality and trustworthiness. Especially in automated decision-making, where no human is involved anymore, you need to be sure that both the data and the analyses are correct.
All Vs
Types of data
2. Unstructured Data
✔ Data which cannot be fitted into tabular databases.
Examples: pictures, X-rays
Types of data
◻ Semi-Structured Data
Data which does not have a proper format attached to it.
Ex.
Applications of Big Data in Education
● Example: Platforms like Khan Academy or Coursera adjust difficulty and topics based on learner performance.
🧠 b. Early Warning Systems
● Predict students at risk of dropping out or failing using data on attendance, grades, and engagement.
● Analyze which teaching methods and materials produce the best results.
● Use AI and big data for grading essays, quizzes, and assignments to reduce educator workload.
● Improve resource allocation—class sizes, faculty hiring, and facility management—based on data-driven insights.
● Track student skills and recommend courses or career options aligned with job market trends.
● Uses data from online courses, campus systems, and student engagement.
Outcome:
Applications of Big Data in Healthcare
a. Personalized Medicine
● Tailoring treatments based on patient genetics, lifestyle, and history.
b. Predictive Analytics
● Predict disease outbreaks, patient admissions, and risk of chronic conditions.
c. Genomic Research
● Big Data enables faster DNA sequencing and analysis for discovering disease
markers.
d. Operational Efficiency
● Optimize hospital resource allocation (staffing, equipment use) based on patient flow data.
Impact:
Applications of Big Data in Weather Forecasting
● Real-time data streams enable high-resolution models to predict storms, rain, heatwaves.
● Example: NOAA (National Oceanic and Atmospheric Administration) uses petabytes of data daily for
precise forecasts.
● Weather data combined with soil moisture and satellite imagery guide irrigation schedules and crop
protection.
● Forecasting wind and solar patterns to manage renewable energy grids efficiently.
● Uses billions of weather observations daily from satellites, radars, weather stations, and
aircraft.
Impact:
The transportation industry is rapidly evolving through digital transformation. Big Data is enabling smarter logistics, predictive
maintenance, route optimization, and real-time tracking — leading to safer, faster, and more efficient travel and freight movement.
a. Route Optimization
● Example: Uber and Ola adjust routes dynamically using Google Maps APIs + internal ride data.
b. Fleet Management & Logistics
● Example: London Underground uses Oyster card data + CCTV to predict peak loads.
d. Predictive Maintenance
● Example: GE Aviation collects terabytes of jet engine data per flight to prevent breakdowns.
e. Infrastructure Planning
● Big Data helps cities design better roads, bike lanes, and signal systems based on real commuter behavior.
● Crash data + driving patterns help identify dangerous zones, and adjust speed limits or signage.
Future Trends
● Autonomous Vehicles: Data from LIDAR, cameras, sensors, and AI
● Example: Paytm or Fintechs use mobile usage + payments history for loans
JPMorgan Chase & Co. is an American multinational investment bank and financial services holding
company headquartered in New York City. It's one of the largest financial institutions in the world,
and the largest of the "Big Four" banks in the United States. JPMorgan Chase offers a wide array of
financial services, including consumer and commercial banking, investment banking, financial
transaction processing, and asset management.
Impact:
https://www.youtube.com/watch?v=TzxmjbL-i4Y&t=2s
M&E companies gather structured and unstructured data from multiple sources:
Applications of Big Data in M&E
a. Personalized Recommendations
● Netflix, YouTube, Spotify use algorithms that analyze watch/listen history to suggest content. Example: Netflix claims 80% of viewer
activity is driven by recommendations.
c. Targeted Advertising
● Platforms use user data to show demographically relevant ads, increasing ROI for advertisers. Example: YouTube uses Google’s ad
system to deliver interest-based video ads.
Outcome:
1. Crash Overview
● Flight AI 171 (registration VT‑ANB), a Boeing 787‑8 Dreamliner, departed Ahmedabad for London Gatwick on June 12, 2025 around 13:38 IST (08:08 UTC).
● Shortly after takeoff, the aircraft stalled at approximately 600–625 ft, and within 30–36 seconds, the crew
issued a “Mayday – no power” emergency call
● It crashed into a medical college hostel near the airport, tragically killing 241 on board and several more
on the ground. One passenger survived
● Both the Flight Data Recorder (FDR) and Cockpit Voice Recorder (CVR) were recovered:
● Both were sent to the AAIB’s new New Delhi lab, with support from the U.S. NTSB, and data extraction
began around June 24–25
Big data challenges and solutions
(Slide figure: parallel processing, distributed database.)
5 marks
WHAT IS HADOOP?
While "Hadoop" is often thought of as an acronym, it actually doesn't stand for anything specific. It
was named after a toy elephant belonging to the son of Doug Cutting, the creator of Hadoop
What does Hadoop do?
79
Components Explained:
● Client sends data for storage or processing.
The NameNode acts as the central control unit of HDFS. It stores metadata for the HDFS file system,
managing the directory structure and tracking file metadata. The NameNode does not store the actual
data but rather serves as the system's “index.”
Data Storage: DataNodes store the actual data blocks of files in HDFS. They are responsible for reading and writing these data blocks.
Heartbeat and Block Report: DataNodes send periodic heartbeat signals to the NameNode to indicate that they are alive and well. This confirms their availability. They also send "block reports" detailing the data blocks they store. This mechanism ensures the NameNode knows the status and data distribution across the cluster.
The "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing by
marshalling the distributed servers, running the various tasks in parallel, managing all communications
and data transfers between the various parts of the system, and providing for redundancy and fault
tolerance. In essence, it simplifies the complexities of distributed computing by abstracting away the low-level
details of managing servers and data movement.
Replication ensures that data or services are accessible even if some nodes fail or become
unavailable. Users can access replicated data from other available nodes, reducing downtime and
improving system reliability.
While replication guarantees that data remains accessible even with node failures, ensuring that all copies of the
data are identical (consistent) after updates becomes a complex task.
To address the consistency challenge in data replication, systems use various consistency models and consensus
algorithms. These mechanisms ensure that replicas converge to the same state after updates, either by requiring
immediate consistency across all nodes or by allowing some temporary divergence with eventual consistency.
1. Consistency Models:
Strong Consistency:
Ensures that all replicas are updated simultaneously, and any read operation will always return the most recent
write. This provides the highest level of consistency but can impact performance due to the need for coordination
across all replicas. This is often achieved with synchronous replication where the leader waits for confirmation
from followers before acknowledging a write request.
Weak/Eventual Consistency:
Allows replicas to diverge temporarily, but guarantees that all replicas will eventually converge to the same state.
This offers higher performance and scalability but may result in reads returning slightly stale data.
Other Consistency Models:
There are various other models like causal consistency, which preserves causality between related
operations, and sequential consistency, which provides a total order of operations across all replicas.
2. Consensus Algorithms:
Examples include Raft, Paxos, and other consensus algorithms like Zab (used in Apache ZooKeeper) and Viewstamped Replication, each with its own strengths and weaknesses.
Replication Strategies:
Synchronous Replication:
The leader node waits for all replicas to acknowledge a write before confirming the write to the client. This
provides strong consistency but can slow down write operations.
Asynchronous Replication:
The leader node doesn't wait for acknowledgments from all replicas. This improves write performance but can
lead to eventual consistency issues.
Semi-synchronous Replication:
A hybrid approach that aims to balance consistency and performance. It typically waits for a quorum of replicas to acknowledge a write before confirming it to the client, as sketched below.
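Below is a minimal sketch of the quorum idea behind semi-synchronous replication, written in plain Java. The Replica interface, the 5-second timeout and the thread-pool wiring are assumptions for illustration, not any particular system's API.

import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical replica interface: a write returns true once that replica has acknowledged it.
interface Replica {
    boolean write(String key, String value);
}

class QuorumWriter {
    private final List<Replica> replicas;
    private final ExecutorService pool = Executors.newCachedThreadPool();

    QuorumWriter(List<Replica> replicas) { this.replicas = replicas; }

    // Returns true once a majority (quorum) of replicas have acknowledged the write,
    // without waiting for every replica, which is the semi-synchronous trade-off.
    boolean write(String key, String value) throws InterruptedException {
        int quorum = replicas.size() / 2 + 1;
        CountDownLatch acks = new CountDownLatch(quorum);
        for (Replica r : replicas) {
            pool.submit(() -> {
                if (r.write(key, value)) acks.countDown();   // count each acknowledgement
            });
        }
        // Wait (with a bounded timeout) for the quorum rather than for all replicas.
        return acks.await(5, TimeUnit.SECONDS);
    }
}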
FEATURES OF HADOOP
1. Cost effective system
2. Large cluster of node
3. Parallel Processing
4. Distributed Data
5. Automatic failover Management
6. Data locality optimization
7. Heterogeneous cluster
8. Scalability
FEATURES OF HADOOP
Cost effective system:
Hadoop can be implemented on simple hardware; these hardware components are technically referred to as Commodity Hardware (like your PC or laptop). This drastically reduces the hardware cost. Hadoop is free to use, so there's no need to pay for software licenses, unlike many traditional database systems.
FEATURES OF HADOOP
Parallel Processing
Data can be processed simultaneously across all the nodes within the cluster, thus saving a lot of time.
● Faster execution: tasks are completed more quickly by working in parallel.
● Efficiency: utilizes all the computing power available in the cluster.
● Scalability: works well as data volume increases.
Example: In Hadoop's MapReduce:
● A large file is split into blocks.
● Each block is processed in parallel by different nodes in the cluster.
● Final results are combined (reduced) to give the complete output.
FEATURES OF HADOOP
◻ Distributed Data
● Improves speed: Data is accessed and processed from multiple locations simultaneously.
● Enhances fault tolerance: Copies of data are stored on different nodes — if one fails, others have backups.
● Enables parallel processing: Each node works on its own part of the data.
● Supports large-scale storage: Easily handles terabytes or petabytes of data.
Example: In Hadoop's HDFS (Hadoop Distributed File System), a 100 GB file may be split into 64 MB blocks.
These blocks are stored across different nodes in the cluster, and processing happens close to where each block resides, as the sketch below illustrates.
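As a hedged illustration of distributed block storage, the following sketch uses the standard HDFS Java client API to list the blocks of a file and the DataNodes that hold each block; the NameNode address and the file path are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/large-input.txt");       // hypothetical file
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation b : blocks) {
            // Each block reports its offset, its length and the DataNodes holding a replica.
            System.out.println(b.getOffset() + " len=" + b.getLength()
                    + " hosts=" + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}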
FEATURES OF HADOOP
Automatic failover Management
The Hadoop framework replaces a failed machine with another machine, and replicates all the configuration settings and the data from the failed machine onto the newly added machine.
Automatic failover management is a system’s ability to detect failures and automatically switch to a backup
component or node, without human intervention, to avoid downtime or data loss.
Example in Hadoop:
● The NameNode is the master in HDFS.
● If the active NameNode fails, the Standby NameNode takes over automatically.
● Similarly, data blocks are stored in multiple nodes (default 3 copies), so if one node crashes, another has a
backup.
FEATURES OF HADOOP
Data locality optimization
(Slide figure: rather than moving PBs of data, e.g. stored in the USA, to the program, Hadoop moves the MB-sized program to the nodes where the data resides.)
FEATURES OF HADOOP
Heterogeneous cluster
Each node can be from a different vendor, and each node can be
running a different version and flavor of operating system.
Scalability
We can easily add or remove a node to or from a Hadoop
Cluster without bringing down or affecting the cluster
operation.
10 Marks
HADOOP’S ARCHITECTURE
◻ Hadoop Components
● HDFS: Distributed Storage
● YARN: Yet Another Resource Negotiator (Job Scheduler and Resource Manager)
● Common: Framework Utilities (Java libraries & utilities, Java files & scripts)
HADOOP’S ARCHITECTURE
Main Components
● MapReduce
● HDFS: Distributed Storage
● YARN
● Common: Framework Utilities
HDFS Architecture Overview
● The Namenode periodically receives a heartbeat and a block report from each Datanode in the cluster.
● Every Datanode sends a heartbeat message to the Namenode every 3 seconds.
● The heartbeat simply tells the Namenode whether a particular Datanode is working properly or not; in other words, whether that Datanode is alive.
● A block report of a particular Datanode contains information about all the blocks that reside on that Datanode.
● When the Namenode doesn't receive any heartbeat message for 10 minutes (by default) from a particular Datanode, that Datanode is considered dead or failed by the Namenode (see the sketch after this list).
● Since its blocks will then be under-replicated, the system starts the replication process from one Datanode to another, using the block information from the block report of the corresponding Datanode.
● The data for replication transfers directly from one Datanode to another without passing through the Namenode.
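The dead-node rule above (heartbeats every 3 seconds, a node declared dead after roughly 10 minutes of silence) can be sketched in a few lines of Java; the class below is illustrative only and is not Hadoop's actual NameNode code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HeartbeatMonitor {
    private static final long EXPIRY_MS = 10 * 60 * 1000;   // 10 minutes (default expiry window)
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Called whenever a DataNode heartbeat arrives (roughly every 3 seconds).
    public void onHeartbeat(String dataNodeId) {
        lastHeartbeat.put(dataNodeId, System.currentTimeMillis());
    }

    // A DataNode is considered dead if no heartbeat was seen within the expiry window;
    // at that point its blocks would be re-replicated from the surviving copies.
    public boolean isDead(String dataNodeId) {
        Long last = lastHeartbeat.get(dataNodeId);
        return last == null || System.currentTimeMillis() - last > EXPIRY_MS;
    }
}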
The three major components of HDFS, i.e. the Namenode, Datanode and Secondary NameNode, are explained in the following slides:
Namenode
● The Hadoop file system is a master/slave file system in which the Namenode works as the master and the Datanodes work as slaves.
● The Namenode is critical to the Hadoop file system because it acts as the central component of HDFS.
● If the Namenode goes down, the whole Hadoop cluster is inaccessible and considered dead.
● The Namenode maintains and manages the Datanodes and assigns tasks to them.
● The Namenode does not contain the actual data of files.
● The Namenode stores metadata of the actual data, like filename, path, number of data blocks, block IDs, block locations, number of replicas and other slave-related information.
● The Namenode manages all client requests (read, write) for the actual data files.
● The Namenode executes file system namespace operations like opening/closing files and renaming files and directories.
● Takes care of authorization and authentication.
● Takes care of the replication factor.
● Creates checkpoints and logs namespace changes.
● Handles DataNode failure.
Namenode
Only on a restart of the Namenode are the edit logs applied to the fsimage to get the latest snapshot of the file system.
HDFS ARCHITECTURE OVERVIEW..
NAMENODE – ONE PER CLUSTER
Functions:
Maintain and manage “node” (Slave node) information
Manage namespace of the file system in memory
Take care of authorization and authentication.
Take care of the replication factor
Create checkpoints and logs the namespace change
Handle DataNode failure
Datanode
DATANODE: The Datanode stores the actual data and works as instructed by the Namenode. A Hadoop file system can have multiple Datanodes but only one active Namenode.
The DataNode is a daemon (a process that runs in the background) that runs on the SlaveNode in a Hadoop cluster.
In HDFS, a file is broken into small chunks called blocks (default block size of 128/256 MB).
It stores the actual data, so a large number of disks is required to store the data (8 disks are recommended).
Secondary NameNode (SNN)
It is usually run on a different machine than the Primary NameNode (PNN) because it has the same memory requirements as the PNN.
It stores the latest checkpoint in a directory which is structured the same way as on the PNN.
The checkpoint image is always ready to be read by the PNN if necessary.
Secondary NameNode (SNN)
Process:
1. The SNN asks the PNN to roll its edits file, so new edits go to a new file. (Rolling edits means finalizing the current edits_inprogress file and starting a new one.)
2. The SNN retrieves fsimage and edits from the PNN.
3. The SNN loads fsimage into memory, applies each operation from edits, then creates a new consolidated fsimage file.
4. The SNN sends the new fsimage back to the primary.
5. The PNN replaces the old fsimage with the new one from the SNN, and the old edits file with the new one it started in step 1.
6. It also updates the fstime file to record the time that the checkpoint was taken.
(Checkpointing is the process of merging any outstanding edit logs with the latest fsimage, saving the full state to a new fsimage file.)
(The fstime file stores the file system timestamps, specifically the last write time, and potentially creation and access times. It includes fields for hours, minutes, seconds, day, month, and year. It is used alongside the fsimage and edits files to track metadata changes.)
HADOOP’S ARCHITECTURE
Main Components
● HDFS
● YARN
● Common: Framework Utilities
MAIN COMPONENTS OF MAPREDUCE
Job Tracker
Resource management.
It is used to assign MapReduce tasks to Task Trackers in the cluster of nodes.
Sometimes it reassigns the same tasks to other Task Trackers when the previous Task Trackers have failed or shut down.
It maintains the status of all Task Trackers, like up/running, failed, recovered, etc.
(Slide figure: the Master Node runs the Job Tracker and the Name Node.)
MAIN COMPONENTS OF MAPREDUCE
(Slide figure: each Slave Node runs a Task Tracker and a Data Node.)
Task Tracker
Agents deployed to each machine in the cluster to run the map and reduce tasks.
The Task Tracker executes the tasks assigned by the Job Tracker and sends the status of those tasks back to the Job Tracker.
Job History Server
A component that tracks completed jobs; it is typically deployed as a separate function or along with the Job Tracker.
MAPREDUCE
(Slide figure: Big Data is split across several Map() tasks whose outputs feed Reduce() tasks, which produce the final output.)
The Reducer:
Receives the key-value pairs from multiple map jobs.
Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output.
MAPREDUCE… EG.
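The canonical MapReduce example is word count. The sketch below follows the standard org.apache.hadoop.mapreduce API and the well-known example from the Hadoop documentation; the input and output paths are supplied on the command line rather than being anything specific to these slides.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: emit (word, 1) for every word in this task's input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce task: sum the counts for each word and emit the final (word, total) pair.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on map output
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A typical invocation, assuming the class is packaged into wordcount.jar, is: hadoop jar wordcount.jar WordCount /input /output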
5/10 Marks
HADOOP’S ARCHITECTURE
Main Components
● MapReduce
● HDFS
● YARN: Yet Another Resource Negotiator (Job Scheduler and Resource Manager)
● Common: Framework Utilities
YARN
(YET ANOTHER RESOURCE NEGOTIATOR)
(Slide figures omitted; they relate YARN to the earlier Job Tracker.)
YARN
(YET ANOTHER RESOURCE NEGOTIATOR)
Resource Manager
This daemon process resides on the Master Node.
Responsible for:
❑ Managing resource scheduling for different compute applications in an optimum way
❑ Coordinating with two processes on the master node:
Scheduler
Application Manager
YARN
(YET ANOTHER RESOURCE NEGOTIATOR)
Scheduler
This daemon process resides on the Master Node (runs along with the Resource Manager daemon).
Responsible for:
Scheduling job execution as per the submission requests received by the Resource Manager.
Allocating resources to applications submitted to the cluster.
Coordinating with the Application Manager daemon and keeping track of the resources of running applications.
Application Manager
This daemon process resides on the Master Node (runs along with the Resource Manager daemon).
Responsible for:
Helping the Scheduler daemon to keep track of running applications by coordinating with it.
Negotiating the first container for executing the application-specific task with a suitable Application Master on a slave node.
YARN
(YET ANOTHER RESOURCE NEGOTIATOR)
Node Manager
o This daemon process resides on the slave nodes (runs along with the Data Node daemon)
o Responsible for:
• Managing and executing containers
• Negotiating suitable resource containers
HADOOP ECOSYSTEM
Core Hadoop Components
❑ Hadoop Distributed File System (HDFS)
❑ MapReduce - Distributed Data Processing Framework of Apache Hadoop
❑ YARN
HADOOP ECOSYSTEM
Core Hadoop Components
HDFS
HDFS is the component which makes it possible to store different types of large data sets (i.e. structured, unstructured and semi-structured data).
It helps us in storing our data across various nodes and maintaining the log file about the stored data (metadata).
HDFS has two core components, i.e. NameNode and DataNode.
Tasks of HDFS NameNode
Manage file system namespace.
Regulates client’s access to files.
Executes file system operations such as naming, closing and opening files and directories.
Tasks of HDFS DataNode
The DataNode performs operations like block replica creation, deletion, and replication according to the instructions of the NameNode.
The DataNode manages the data storage of the system.
HADOOP ECOSYSTEM
Core Hadoop Components
MapReduce - Distributed Data Processing Framework of Apache Hadoop
It is responsible for analysing large datasets in parallel before reducing them to find the results.
Operations performed:
The Map task in the Hadoop ecosystem takes input data and splits it into independent chunks; the output of this task will be the input for the Reduce task.
In the same Hadoop ecosystem, the Reduce task combines the mapped data tuples into a smaller set of tuples.
Meanwhile, both the input and the output of the tasks are stored in a file system.
MapReduce takes care of scheduling jobs, monitoring jobs and re-executing failed tasks.
HADOOP ECOSYSTEM
Core Hadoop Components
YARN (Yet Another Resource Negotiator)
It provides resource management.
YARN is called the operating system of Hadoop, as it is responsible for managing and monitoring workloads.
It allows multiple data processing engines, such as real-time streaming and batch processing, to handle data stored on a single platform.
HADOOP ECOSYSTEM
Data Access Components of Hadoop
Ecosystem
Pig
Hive
HADOOP ECOSYSTEM
Data Access Components of Hadoop Ecosystem
Pig
PIG has two parts:
• Pig Latin: the language
• The Pig runtime: the execution environment (JRE)
It has a SQL-like command structure.
How does Pig work?
In PIG, first the load command loads the data.
Then various functions are performed on it, like grouping, filtering, joining, sorting, etc.
At last, either you can dump the data on the screen or you can store the result back in HDFS.
Eg: Healthcare data
HADOOP ECOSYSTEM
Data Access Components of Hadoop Ecosystem
Hive
An open-source data warehouse system for summarization, querying and analyzing large data sets stored in files.
Performs reading, writing and managing large data sets in a distributed environment using a SQL-like interface.
The query language of Hive is called Hive Query Language (HQL).
HIVE + SQL = HQL
The Hive command line interface is used to execute HQL commands.
Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) are used to establish a connection to the data storage (a minimal JDBC sketch follows below).
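As a hedged illustration of the JDBC route mentioned above, the sketch below runs one HQL query against HiveServer2 from Java; the host name, user, table and column names are assumptions, and the Hive JDBC driver jar must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 typically listens on port 10000; "default" is the default database.
        String url = "jdbc:hive2://hiveserver-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hiveuser", "");
             Statement stmt = conn.createStatement()) {

            // An ordinary HQL query; Hive compiles it into distributed jobs under the hood.
            ResultSet rs = stmt.executeQuery(
                    "SELECT department, COUNT(*) AS emp_count " +
                    "FROM employees GROUP BY department");

            while (rs.next()) {
                System.out.println(rs.getString("department") + " -> " + rs.getLong("emp_count"));
            }
        }
    }
}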
HADOOP ECOSYSTEM
Data Integration Components of Hadoop
Ecosystem-
Sqoop
Flume
HADOOP ECOSYSTEM
Data Integration Components of Hadoop Ecosystem-
Sqoop
Is used for importing data from external sources into related Hadoop components
like HDFS, HBase or Hive.
It can also be used for exporting data from Hadoop to other external structured
data stores.
Sqoop works with relational databases such as Teradata, Netezza, Oracle and MySQL.
Eg. Coupons.com
HADOOP ECOSYSTEM
Data Integration Components of Hadoop Ecosystem-
Flume
Flume is a service which helps in ingesting unstructured and semi-structured data into HDFS.
It helps us to ingest online streaming data from various sources like network traffic, social media, email messages, log files etc. into HDFS.
HADOOP ECOSYSTEM
Data Integration Components of Hadoop Ecosystem-
Flume
The Flume agent has 3 components: source, sink and channel.
Source: it accepts the data from the incoming stream and stores the data in the channel.
Channel: it acts as the local storage or the primary storage. A channel is a temporary store between the source of the data and the persistent data in HDFS.
Sink: it collects the data from the channel and commits or writes the data to HDFS permanently.
(Slide figure: Web Server → Source → Channel → Sink → HDFS.)
HADOOP ECOSYSTEM
Data Storage Component of Hadoop Ecosystem
HBase
HADOOP ECOSYSTEM
Data Storage Component of Hadoop Ecosystem
HBase
HBase is an open source, non-relational distributed
database. In other words, it is a NoSQL database.
It supports all types of data
HBase is a column-oriented database that uses HDFS
for underlying storage of data.
It can create large tables with millions of rows and columns on commodity hardware machines.
HBase supports random reads and also batch computations using MapReduce; a minimal client sketch follows below.
Eg. Facebook
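A minimal sketch of HBase's random read/write path using the HBase Java client is shown below; the table name, column family, qualifier and row key are assumptions for illustration, and the table is assumed to already exist.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write: row key "user1", column family "profile", qualifier "city".
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("city"), Bytes.toBytes("Pune"));
            table.put(put);

            // Random read of the same cell (HBase supports low-latency point lookups).
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] city = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("city"));
            System.out.println("city = " + Bytes.toString(city));
        }
    }
}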
HADOOP ECOSYSTEM
Monitoring and Management Components of Hadoop
Ecosystem-
Oozie
Zookeeper
HADOOP ECOSYSTEM
Monitoring and Management Components of Hadoop
Ecosystem-
Oozie
Oozie works like a clock and alarm service inside the Hadoop Ecosystem.
For Apache Hadoop jobs, Oozie acts as a scheduler.
It schedules Hadoop jobs and binds them together as one logical unit of work.
There are two kinds of Oozie jobs:
Oozie workflow:
These are sequential set of actions to be executed.
Oozie Coordinator:
These are the Oozie jobs which are triggered when the data is made
available to it.
An Oozie coordinator responds to the availability of data and it rests
otherwise.
HADOOP ECOSYSTEM
Monitoring and Management Components of Hadoop
Ecosystem-
Zookeeper
Apache Zookeeper is a centralized service and a Hadoop
Ecosystem component
Zookeeper manages and coordinates a large cluster of
machines.
Features of Zookeeper:
• Fast – Zookeeper is fast with workloads where reads to the data are more common than writes. The ideal read/write ratio is 10:1.
• Ordered – Zookeeper maintains a record of all transactions.
(A minimal ZooKeeper client sketch follows below.)
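A minimal sketch of using the ZooKeeper Java client for coordination is shown below; the ensemble address, znode path and stored value are assumptions for illustration.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (the watcher callback is left empty for brevity).
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 3000, event -> { });

        // Create a persistent znode at the root holding a small piece of configuration.
        zk.create("/demo-config", "value-1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Reads are cheap; ZooKeeper is optimized for read-heavy workloads (around 10:1).
        byte[] data = zk.getData("/demo-config", false, null);
        System.out.println(new String(data));

        zk.close();
    }
}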
HADOOP ECOSYSTEM
Apache AMBARI
Ambari provides:
Hadoop cluster provisioning:
It gives us a step-by-step process for installing Hadoop services across a number of hosts.
It also handles configuration of Hadoop services over a cluster.
Hadoop cluster management:
It provides a central management service for starting, stopping and
re-configuring Hadoop services across the cluster.
Hadoop cluster monitoring:
For monitoring health and status, Ambari provides us a
dashboard.
The Ambari Alert framework is an alerting service which notifies the user whenever attention is needed, for example if a node goes down or a node has low disk space.
HADOOP ECOSYSTEM
Apache MAHOUT
Mahout is an open-source framework for creating scalable machine learning algorithms and a data mining library.
Provides the data science tools to automatically find meaningful patterns
in those big data sets.
Algorithms of Mahout are:
Clustering – It takes items in a particular class and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other.
Collaborative filtering – It mines user behavior and makes product
recommendations (e.g. Amazon recommendations)
Classifications – It learns from existing categorization and then assigns
unclassified items to the best category.
Frequent pattern mining – It analyzes items in a group (e.g. items in a shopping cart or terms in a query session) and then identifies which items typically appear together.
Any Queries?
Descriptive Questions