Unit-1 - Session 1 - Supervised & Unsupervised Learning
UNIT-I
Machine Learning - What and Why
• The rise of big data demands machine learning for efficient data analysis and
decision-making.
• For instance, there are around 1 trillion web pages, and every second, one hour of
video content is uploaded to YouTube, equating to 10 years of content every day.
Additionally, thousands of human genomes, each consisting of approximately 3.8
billion base pairs, have been sequenced, and Walmart handles over 1 million
transactions per hour, resulting in databases containing more than 2.5 petabytes of
information.
• Machine learning comprises techniques that can automatically detect patterns within
data and leverage these patterns to predict future data or make decisions under
uncertainty.
• The optimal approach to addressing such challenges is through probability theory,
which applies to any problem involving uncertainty.
Artificial Intelligence (AI)
• Programs with the ability to learn and reason like humans
• Learning is the essential step: after training (learning), the model must be tested.
Pattern Recognition

Pipeline: Camera (Image Acquisition) → Image Processing (Pre-Processing) → AI/ML Algorithm → Decision Making

• For computer vision, the input is images or videos. Therefore, to capture the
image one needs an image acquisition device such as a camera. After that, some
pre-processing techniques are applied. Finally, we apply an ML algorithm for
decision-making purposes.
Types of Machine Learning
• Supervised: learning with a labelled training set, i.e., a dataset with proper label information
➢ (e.g., email classification, house rent price prediction)
Semi-Supervised Learning
• This approach is desirable when a small amount of labelled data and a large
amount of unlabelled data are available.
❑ Example:
For medical image classification, labelled data is very difficult to obtain. In such
cases, we may have a small amount of labelled data and a large amount of
unlabelled data; a minimal sketch of this setting is shown below.
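As an illustration of this setting (not from the slides), the sketch below hides most labels of a toy dataset and lets scikit-learn's LabelSpreading propagate the few known labels to the unlabelled points; the dataset and the 80% masking fraction are arbitrary choices.

```python
# Minimal semi-supervised sketch: hide ~80% of labels, then propagate them.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.semi_supervised import LabelSpreading

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

y_partial = y.copy()
y_partial[rng.random(len(y)) < 0.8] = -1  # -1 marks an unlabelled example

model = LabelSpreading().fit(X, y_partial)
print("agreement with the true labels:", (model.transduction_ == y).mean())
```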
Machine Learning
• Supervised learning:
  - Regression (predicts a continuous variable): Linear, Polynomial
  - Classification (predicts a categorical variable)
• Unsupervised learning:
  - Clustering: PCA, K-means
  - Association: Apriori, FP-Growth
Predictive or Supervised Learning:
•Goal: Learn a mapping from inputs x to outputs y, given a labeled set of input-output pairs D = {(xi, yi)} for i = 1, …, N.
•Training set D: the set of input-output pairs; N: the number of training examples.
•Training input xi:
• Typically a D-dimensional vector of numbers.
• Represents features, attributes, or covariates (e.g., height and weight of a person).
• Can be complex structured objects (e.g., images, sentences, time series, molecular shapes, graphs).
•Output or response variable yi:
• Can be categorical/nominal (e.g., male or female) or real-valued (e.g., income level).
• Categorical problems are known as classification or pattern recognition.
• Real-valued problems are known as regression.
• Ordinal regression: Label space Y has a natural ordering (e.g., grades A-F).
Descriptive or Unsupervised Learning:
•Goal: Find "interesting patterns" in the data.
•Given only inputs D = {xi} for i = 1, …, N, without labelled outputs. Also known as knowledge discovery.
•No well-defined problem as patterns are not specified in advance.
Reinforcement Learning:
•Useful for learning how to act or behave when given occasional reward or punishment signals.
•Example: How a baby learns to walk.
Supervised learning
1. Classification:
• Goal of Classification: Learn a mapping from inputs x to outputs y,
where y∈{1,…,C} with C being the number of classes.
• Binary Classification: When C=2, known as binary classification (e.g.,
y∈{0,1})
• Multiclass Classification: When C>2, known as multiclass classification.
• Multi-label Classification: When class labels are not mutually exclusive
(e.g., someone classified as tall and strong), predicting multiple related
binary class labels (multiple output model).
• One way to formalize the problem is as function approximation.
Assume y=f(x) for an unknown function f; learning aims to estimate f
using a labeled training set and then to make predictions via ŷ = f̂(x).
• Generalization: The main goal is to make predictions on novel inputs not
seen before, emphasizing the importance of generalization over fitting the
training set.
Supervised Learning
[Figure: image classification example with labelled training images of classes 'Car' and 'Ship']
Supervised Learning
[Figure: face-recognition example: a supervised learning model is trained on labelled face images (e.g., Dhoni, Virat, Sunil, Anand) and the resulting predictive model labels a new image as 'Dhoni']
Classification Analysis
Multi-class Classification: Emotion Analysis
(Class name → numeric label)
1. Anger → 0
2. Disgust → 1
3. Fear → 2
4. Happiness → 3
5. Neutral → 4
6. Sadness → 5
Supervised learning-Cont.
Example
• Two classes of objects with labels 0 and 1.
• Inputs are colored shapes, described by D features or attributes.
• Features are stored in an N×D design matrix.
• Input features x can be discrete, continuous, or both; y is the vector of training labels.
• Test objects: blue crescent, yellow circle, and blue arrow.
• These test objects have not been seen before, requiring generalization beyond the training set.
Generalization:
•Blue crescent likely has y=1 since all blue shapes in the training set are labeled 1.
•Yellow circle's label is unclear due to mixed labels for yellow objects and circles.
•Blue arrow's label is also unclear due to lack of specific information from the training set.
Supervised learning-Cont.
•For a new input x, the best guess for the output is ŷ = argmax_c p(y = c ∣ x, D).
This is the mode of the distribution p(y ∣ x, D) and is known as a MAP estimate (maximum a
posteriori).
• Confidence in predictions is crucial, especially in risk-averse domains like medicine
and finance.
• IBM's Watson for Jeopardy uses a confidence module to decide when to answer.
• Google's SmartASS (ad selection system) predicts the click-through rate (CTR) to
maximize expected profit.
• Systems like Watson and SmartASS assess the risk of their predictions, making
decisions based on confidence levels to optimize performance and minimize errors.
Supervised learning-Cont.
Real-world applications:
(i) Document classification and email spam filtering
• In document classification, the primary objective is to categorize documents, such as web
pages or email messages, into one of C predefined classes by computing p(y = c ∣ x, D), where x
is a text representation of the document.
• A classic example is email spam filtering, where the classes are typically labeled as spam
(y = 1) or non-spam (y = 0).
• Most classifiers assume a fixed-size input vector x. To handle variable-length documents,
a common approach is the bag of words (BoW) representation.
• Bag of Words (BoW):
• Documents are transformed into fixed-size feature vectors.
• Each vector element corresponds to a word from a predefined vocabulary.
• If a word appears in the document, its corresponding vector element is set to 1; otherwise,
it remains 0.
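As a concrete illustration of the representation described above, here is a small plain-Python sketch; the documents and vocabulary are invented for the example:

```python
# Binary bag-of-words: one vector element per vocabulary word (1 = present, 0 = absent).
docs = ["win money now", "meeting at noon", "win a free prize now"]
vocab = sorted({w for d in docs for w in d.split()})

def bow_vector(doc, vocab):
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocab]

for d in docs:
    print(d, "->", bow_vector(d, vocab))
```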
Supervised learning-Cont.
(ii) Classifying flowers
• The goal is to classify iris flowers into three types: setosa, versicolor, and virginica,
based on four extracted features: sepal length, sepal width, petal length, and petal
width.
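A brief sketch of this task using the iris dataset bundled with scikit-learn; the logistic-regression classifier and the 70/30 split are illustrative choices, not prescribed by the slides:

```python
# Classify iris flowers from the four sepal/petal measurements.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # features: sepal length/width, petal length/width
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```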
Supervised learning-Cont.
(iv) Face detection and recognition
• Object detection, or localization, involves
identifying specific objects within an image. A
notable application is face detection, which is
crucial for tasks like autofocus in cameras and
privacy features in services like Google's
StreetView.
• One approach to face detection is the sliding
window detector method. It divides the image into
small overlapping patches at various locations,
scales, and orientations.
• Each patch is classified based on whether it
exhibits face-like textures or features. Locations
where the probability of containing a face is high
are identified as potential face locations.
• Modern digital cameras often integrate face
detection systems to assist with autofocus by
identifying and focusing on faces within the frame.
• Services like Google's StreetView use face detection to automatically blur faces to protect privacy.
Supervised learning-Cont.
2. Regression:
• Regression is just like classification except that the response variable is continuous (see the sketch below)
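A minimal regression sketch on synthetic data; the true slope, intercept, and noise level are made up for illustration:

```python
# Fit y ≈ w·x + b to noisy synthetic data; the response variable is continuous.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + 2.0 + rng.normal(0, 1.0, size=100)

model = LinearRegression().fit(X, y)
print("estimated slope:", model.coef_[0], "estimated intercept:", model.intercept_)
```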
Performance Analysis of ML models: Classification
E.g., for binary classification, a model predicts two classes, 'spam' and 'not_spam', for inbox mail:

                     Prediction
                     spam                  not_spam
Actual   spam        True Positive (TP)    False Negative (FN)
         not_spam    False Positive (FP)   True Negative (TN)

• TP: how many times spam is recognized as spam
• FN: how many times spam is recognized as not spam
• FP: how many times not_spam is recognized as spam
• TN: how many times not_spam is recognized as not spam
Performance Analysis of ML models: Classification (cont.)
• Train the model with a learning algorithm on a training set, then tune/validate it using a
validation set
• Finally, we need to evaluate the performance of the model with the help of held-out
testing samples
• A sufficiently complex model can represent the training samples perfectly, yet during
testing its error on unseen data will be high because it fails to generalize. This
corresponds to the case of overfitting.
Performance Analysis of ML models: Classification (cont.)
• Precision: the ratio of correct positive predictions to the overall number of
positive predictions.
  Precision = TP / (TP + FP)
• Recall: the ratio of correct positive predictions to the overall number of positive
examples.
  Recall = TP / (TP + FN)
• F1 Score: the harmonic mean of Precision and Recall. A perfect model has an F1
score of 1.
  F1 Score = (2 × Precision × Recall) / (Precision + Recall)
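Computing the three metrics directly from raw counts; the TP/FP/FN values here are toy numbers (they happen to match the worked example a few slides below):

```python
# Precision, Recall and F1 from confusion-matrix counts.
tp, fp, fn = 6, 1, 2

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```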
Performance Analysis of ML models: Classification (cont.)
# Whether Precision or Recall matters more depends on the specific problem.

(1) Problem related to the diagnosis of cancer:

                     Prediction
                     cancer                         not_cancer
Actual   cancer      True Positive (TP) (perfect)   False Negative (FN) (high risk)
         not_cancer  False Positive (FP)            True Negative (TN)

• The model may raise a false alarm (a false positive), but the actual positive cases
must not go undetected.
• (Recall is important)

(2) Problem related to the detection of whether an email is spam or not spam:

                     Prediction
                     spam                           not_spam
Actual   spam        True Positive (TP) (perfect)   False Negative (FN) (low risk)
         not_spam    False Positive (FP)            True Negative (TN)

• It is more important not to lose an important email by mis-flagging it as spam (a false
positive) than to occasionally receive a spam email in the inbox (a false negative).
• (Precision is important)
Performance Analysis of ML models: Classification (cont.)
Calculate: Accuracy, Precision, Recall, F1-Score

                     Prediction
                     1                        0
Actual   1           True Positive (TP)       False Negative (FN)
         0           False Positive (FP)      True Negative (TN)

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)

Example (positive class: India wins the match):

                     Prediction
                     India                    England
Actual   India       True Positive (TP)       False Negative (FN)
         England     False Positive (FP)      True Negative (TN)

• Condition 1 (True Positive): you had predicted that India would win, and it won
• Condition 2 (True Negative): you had predicted that India would not win, and it lost
• Condition 3 (False Positive): you had predicted that India would win, but it lost
• Condition 4 (False Negative): you had predicted that India would not win, but it won
Performance Analysis of ML models: Classification (cont.)
# Consider the actual and predicted classification results below. Find the mapping through the confusion matrix.

Actual:    1 1 1 1 1 1 1 1 0 0 0 0
Predicted: 0 0 1 1 1 1 1 1 1 0 0 0
Result:    ? ? ? ? ? ? ? ? ? ? ? ?

                     Prediction
                     1                        0
Actual   1           True Positive (TP = ?)   False Negative (FN = ?)
         0           False Positive (FP = ?)  True Negative (TN = ?)
Performance Analysis of ML models: Classification (cont.)
# Consider the actual and predicted classification results below. Find the mapping through the confusion matrix.

Actual:    1  1  1  1  1  1  1  1  0  0  0  0
Predicted: 0  0  1  1  1  1  1  1  1  0  0  0
Result:    FN FN TP TP TP TP TP TP FP TN TN TN

                     Prediction
                     1                        0
Actual   1           True Positive (TP = 6)   False Negative (FN = 2)
         0           False Positive (FP = 1)  True Negative (TN = 3)
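The same mapping can be checked in code; this sketch assumes scikit-learn is available and should reproduce TP = 6, FN = 2, FP = 1, TN = 3:

```python
# Verify the worked example with scikit-learn's confusion_matrix.
from sklearn.metrics import confusion_matrix

actual    = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
predicted = [0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]

# labels=[1, 0] puts the positive class first: rows = actual, columns = predicted.
cm = confusion_matrix(actual, predicted, labels=[1, 0])
(tp, fn), (fp, tn) = cm
print("TP =", tp, "FN =", fn, "FP =", fp, "TN =", tn)  # TP = 6 FN = 2 FP = 1 TN = 3
```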
Performance Analysis of ML models: Classification (cont.)
# Multi-class classification (e.g., Emotion Analysis). Find the mapping through the confusion matrix, taking 'Anger' as the class of interest:

                       Predicted
             Anger   Disgust  Fear   Happiness  Neutral  Sadness
Actual
  Anger       TP       FN      FN       FN        FN       FN
  Disgust     FP       TP
  Fear        FP                TP
  Happiness   FP                         TP
  Neutral     FP                                   TP
  Sadness     FP                                            TP

• For the class 'Anger': the diagonal cell is a TP, the remaining cells in the 'Anger' row are FNs, and the remaining cells in the 'Anger' column are FPs.
Performance Analysis of ML models: Classification (cont.)
Area under the ROC Curve (AUC):
• The ROC (Receiver Operating Characteristic) curve is a commonly used method to assess the
performance of binary classification models
• It uses the combination of TPR (the proportion of positive examples predicted correctly,
defined exactly as Recall) and FPR (the proportion of negative examples predicted
incorrectly)
• TPR = TP / (TP + FN) and FPR = FP / (FP + TN)

Example: a classifier that labels every email as spam (first table) versus one that labels
every email as not_spam (second table):

                     Prediction
                     spam         not_spam
Actual   spam        TP = 10      FN = 0
         not_spam    FP = 10      TN = 0

                     Prediction
                     spam         not_spam
Actual   spam        TP = 0       FN = 10
         not_spam    FP = 0       TN = 10

• The first classifier has TPR = 1 and FPR = 1; the second has TPR = 0 and FPR = 0. These are
the two extreme points, (1, 1) and (0, 0), of the ROC curve.
Performance Analysis of ML models: Regression
• R² measures the proportion of variance in the dependent variable that is explained by the
independent variables. The R² value lies in the range 0 to 1 and is estimated as:
  R² = 1 − Σⱼ (yⱼ − ŷⱼ)² / Σⱼ (yⱼ − ȳ)²
• The Mean Absolute Error (MAE) measures the average absolute difference between the actual
responses and the predicted responses:
  MAE = (1/n) Σⱼ |yⱼ − ŷⱼ|
• The Mean Squared Error (MSE) measures the average squared difference between the predicted
responses and the actual responses:
  MSE = (1/n) Σⱼ (ŷⱼ − yⱼ)²
• The Root Mean Squared Error (RMSE) is the square root of the average squared difference
between the actual and the predicted responses. A lower RMSE confirms better predictive
performance of the model:
  RMSE = √[ (1/n) Σⱼ (ŷⱼ − yⱼ)² ]
(Here yⱼ is the actual value, ŷⱼ the predicted value, ȳ the mean of the actual values, and the
sums run over j = 1, …, n.)
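The four metrics computed by hand with NumPy, matching the formulas above; the actual and predicted values are toy numbers:

```python
# R², MAE, MSE and RMSE from the formulas above.
import numpy as np

y     = np.array([3.0, -0.5, 2.0, 7.0])  # actual responses
y_hat = np.array([2.5,  0.0, 2.0, 8.0])  # predicted responses

mae  = np.mean(np.abs(y - y_hat))
mse  = np.mean((y - y_hat) ** 2)
rmse = np.sqrt(mse)
r2   = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"MAE={mae:.3f} MSE={mse:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
```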
Unsupervised Learning:
• Organizing data into classes such that there is high intra-class similarity and low
inter-class similarity
• Finding the class labels and the number of classes directly from the data
[Figure: classification of labelled data (supervised) vs. clustering of unlabelled data (unsupervised)]
1. Discovering clusters:
•Clustering involves grouping data points into clusters based on similarities in their
features, without predefined labels.
•The goal is to estimate the distribution p(K∣D) over the number of clusters K, indicating
the presence of subgroups within the data.
•Model selection in clustering aims to determine the optimal number of clusters K*, often
approximated by the mode of p(K∣D). Unlike supervised learning, where classes are
predefined, unsupervised learning allows flexibility in choosing the number of clusters that
best represents the underlying structure of the data.
•Each data point i is assigned to a cluster zi ∈ {1, …, K} based on the probability p(zi = k ∣ xi, D),
where xi is the feature vector of the data point.
•Assignments zi* are inferred to determine the cluster membership of each data point,
illustrated by different colors representing clusters in visualizations; a small clustering sketch follows.
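The slides describe inferring the number of clusters K probabilistically; as a simpler illustration, the sketch below fixes K = 2 and uses K-means on synthetic two-blob data (both choices are assumptions for the example):

```python
# K-means clustering: assign each point x_i a cluster label z_i ∈ {0, 1}.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),   # blob around (0, 0)
               rng.normal(3, 0.5, (50, 2))])  # blob around (3, 3)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("first ten assignments z_i:", km.labels_[:10])
print("cluster centers:", km.cluster_centers_)
```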
Unsupervised learning-Cont.
Applications of Clustering:
•Astronomy: Clustering methods like Autoclass have been used to discover new
types of stars based on astrophysical measurements.
•E-commerce: Clustering users based on purchasing or web-surfing behavior
allows for targeted advertising and personalized recommendations.
•Biology: Clustering flow-cytometry data helps identify different sub-populations of
cells, aiding in biological research such as understanding disease mechanisms.
Unsupervised learning-Cont.
2. Discovering latent factors:
• Dimensionality reduction involves projecting high-dimensional data into a lower-
dimensional subspace that captures essential characteristics of the data.
• Despite high-dimensional appearances, data often exhibit variability across a smaller
number of latent factors. Dimensionality reduction helps in focusing on these key
factors, such as lighting, pose, or identity in face image modeling.
• PCA is a common approach for dimensionality reduction, resembling an unsupervised
form of multi-output linear regression.
• Given high-dimensional responses y, PCA infers latent low-dimensional factors z that
explain most of the variability in y.
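A minimal PCA sketch; using the 4-D iris measurements as the high-dimensional data is an arbitrary choice for illustration:

```python
# Project 4-D data onto 2 latent factors z and report the variance they explain.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
z = pca.fit_transform(X)  # low-dimensional factors z

print("shape of z:", z.shape)
print("explained variance ratio:", pca.explained_variance_ratio_)
```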
Unsupervised learning-Cont.
Applications:
• In biology, it is common to use PCA to interpret gene microarray data, to account for the
fact that each measurement is usually the result of many genes whose behavior is correlated
because they belong to the same biological pathway.
• In natural language processing, it is common to use a variant of PCA called latent
semantic analysis for document retrieval.
• In signal processing (e.g., of acoustic or neural signals), it is common to use ICA (which
is a variant of PCA) to separate signals into their different sources.
• In computer graphics, it is common to project motion capture data to a low dimensional
space, and use it to create animations.
Unsupervised learning-Cont.
3. Discovering graph structure
• Learning sparse graphical models involves representing relationships between correlated
variables using a graph G, where nodes depict variables and edges denote direct
dependencies. This approach is pivotal in both discovering new knowledge and enhancing
joint probability density estimators.
• In systems biology, sparse graphical models are used to uncover relationships among
biological entities. For instance, graphs derived from protein phosphorylation data reveal
complex interactions within cellular networks. Similarly, neural wiring diagrams in birds can
be reconstructed from EEG data, highlighting functional connectivity patterns.
• In fields like financial portfolio management, sparse graphs help model covariance between
stocks for better prediction and decision-making. Utilizing sparse graph structures has proven
beneficial in outperforming traditional methods, enabling more effective trading strategies.
• Applications extend to traffic prediction systems, such as JamBayes, which leverage learned
graphical models to forecast traffic flow dynamics. These models contribute to accurate
predictions and efficient management of transportation networks, illustrating the broad
applicability and utility of sparse graphical learning in real-world scenarios.
Unsupervised learning-Cont.
4. Matrix completion
• Sometimes we have missing data, that is, variables whose values are unknown. For example, we might have
conducted a survey, and some people might not have answered certain questions.
• The corresponding design matrix will then have “holes” in it; these missing entries are often represented by
NaN, which stands for “not a number”. The goal of imputation is to infer plausible values for the missing
entries. This is sometimes called matrix completion (a toy sketch appears after this list).
• Image Inpainting: Technique to fill in missing parts of images due to scratches or occlusions, achieved by
modeling joint probability of pixels from clean images.
• Collaborative Filtering: Predicting user preferences for items (like movies) based on sparse ratings matrices,
aiming to fill in missing ratings for better recommendation systems.
• Market basket analysis:
❖ Involves examining a large, sparse binary matrix where columns represent items/products and rows
represent transactions.
❖ Each entry in the matrix indicates whether an item was purchased in a specific transaction. By analyzing
correlations among items often bought together, predictions can be made about additional items a consumer
might buy based on partial transaction data.
❖ This technique is also applicable in other domains, such as predicting file dependencies in software systems.
❖ Common methods for market basket analysis include frequent itemset mining, which generates association
rules, and probabilistic modeling, which fits a joint density model to the data.
❖ Data mining emphasizes interpretability of models, whereas machine learning focuses on model accuracy.
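As a toy sketch of the imputation setup (real matrix-completion methods model the joint distribution of the entries rather than using column means; the matrix below is invented):

```python
# Fill NaN "holes" in a small design matrix with per-column means.
import numpy as np

X = np.array([[5.0,    np.nan, 3.0],
              [4.0,    2.0,    np.nan],
              [np.nan, 1.0,    4.0]])

col_means = np.nanmean(X, axis=0)             # mean of the observed entries per column
X_filled = np.where(np.isnan(X), col_means, X)
print(X_filled)
```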
Imbalanced Dataset – Challenges and Solutions