
Bahir Dar University

College of Science
Data Science Department
Course Title: Machine Learning

By

Adane Kasie Chekole (MSc)


What is machine learning?
An algorithm is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Example 1: Learning to recognize faces
– T: recognize faces
– P: % of correct recognitions
– E: opportunity to make guesses and be told what the truth is
Example 2: Learning to find clusters in data
– T: finding clusters
– P: compactness of groups detected
– E: analysis of a growing set of data
Chapter 2: Supervised learning
 Classification
 Regression
 Commonly used algorithms
 Linear Regression
 Logistic Regression
 Anomaly Detection
 Support Vector Machines
 Decision Tree
 Random Forest
 Model performance evaluation, diagnostics & predictions
 Evaluating hypothesis
 Model selection (Train/test/validation)
 Regularization and Bias/Variance
 Learning Curves
 MSE, lift, AUC, Type 1 vs 2
Supervised Learning
Background – Terminology
• Let’s review some common ML terms.
• Data is usually represented with a feature matrix.
• Features
• Attributes used to classify instances
• Represented by columns in the feature matrix (F1–F5 in Fig. 1)
• Instances / feature vectors
• Entities with certain attribute values
• Represented by rows in the feature matrix
• Class labels
• Indicate the category of each instance; this example has two classes (C1 and C2)
• Only used for supervised learning
Fig. 1 Feature matrix: each row is an instance with values for features F1–F5 and a class label (C1 or C2).
What is Supervised learning?
 Definition: Supervised learning is a type of machine
learning algorithm that learns from labeled data.
 Labeled Data: Labeled data is data that has been tagged
with a correct answer or classification.
 Role of Supervisor: Supervised learning involves the
presence of a supervisor, acting as a teacher.
 Training Process: It entails teaching or training the
machine using well-labeled data, where some data is
already tagged with the correct answer.
 Outcome : After training, the machine is provided with new
examples to analyze and produce correct outcomes based
on the labeled data.
 For example, a labeled dataset might contain images of elephants, camels, and cows, each tagged with the correct animal name.
Cont’d
How It Works:
1.The algorithm is trained with input-output pairs.
2.It learns patterns and relationships.
3.Once trained, it predicts outcomes for new data.

Examples:
•Spam Detection: Classifies emails as spam or not spam.
•House Price Prediction: Estimates house prices based on features like
size and location.
•Medical Diagnosis: Identifies diseases from symptoms and patient data.
Steps Involved in Supervised Learning

 Determine the type of training dataset.
 Collect/gather the labelled training data.
 Split the dataset into a training set and a test set.
 Determine the suitable algorithm for the model.
 Execute (train) the algorithm on the training set.
 Evaluate the accuracy of the model by providing the test set.
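As a rough sketch of these steps in Python with scikit-learn (the Iris dataset, the KNN model, and the 80/20 split are illustrative assumptions, not part of the slides):

# Minimal sketch of the supervised-learning steps above (illustrative only).
from sklearn.datasets import load_iris                  # steps 1-2: a labelled dataset
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                       # labelled training data
# Step 3: split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Steps 4-5: choose an algorithm and train it on the training set
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Step 6: evaluate accuracy on the held-out test set
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))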
Challenges of Supervised Learning
• Supervised learning requires a certain level of expertise to
structure the model accurately.
• It is incapable of self-learning. A data scientist must be
present to train the model.
• It takes a lot of time to label the data.
• Supervised Learning is inflexible as it’s a struggle to label
data outside the bounds of the training data set.
• Requires a large amount of labeled data, which can be time-
consuming and costly to obtain.
• Prone to overfitting if the model becomes too specialized to
the training data.
Key Points
 Supervised learning involves training a machine from labeled data.
 Labeled data consists of examples with the correct answer/classification.
 The machine learns the relationship between inputs (fruit images) and
outputs (fruit labels).
 The trained machine can then make predictions on new, unlabeled data.
Types of Supervised Learning
Supervised learning is classified into two categories of algorithms:
• Regression: A regression problem is when the output variable is a real
value, such as “dollars” or “weight”.
• Classification: A classification problem is when the output variable is a
category, such as “Red” or “blue” , “disease” or “no disease”.
What is Classification?
 Classification is predicting a categorical label or Predicting discrete
class labels (categories) from labeled data.
 The model learns from past data and assigns labels to new, unseen
instances.
 Example: Identifying spam emails (Spam vs. Not Spam).
Types of Classification Algorithms
 Binary Classification: Two possible labels (e.g., Yes/No, Spam/Not
Spam, exam Pass or Not).
 Multi-class Classification: More than two classes (e.g., Classifying
types of animals, classifying types of flowers).
 Multi-label Classification: Each instance can belong to multiple
classes (e.g., Tagging images with multiple objects).
Single label vs multi-label Classification

•The classification problem can be further divided into two categories:


•Single-label classification : only one label
•Multi-label classification: more than one label

Single label classification
• Single label classification: learning from a set of data points that are
associated with only one target label from a set of disjoint
labels Y , with |Y| ≥ 2.
• If |Y| = 2, the learning problem is called binary classification,
e.g., disease diagnosis, spam detection, malware detection, etc.
• If |Y| > 2, then it is called a multi-class classification problem,
e.g., language identification, animal image classification, character recognition, face recognition, etc.
Multi-label classification

• What is the issue with Single-label classification?


Ans: In many real-world applications, each data instance
can be associated with multiple class variables
• Examples:
• A news article may cover multiple topics, such as politics,
Agriculture, and economics

• Class 1: Red vs. Blue
• Class 2: Triangle vs. Oval
• So, how can we handle such a multi-label problem?

Approaches to solve the multi-label classification problem
• Problem transformation: transforms the multi-label classification problem into multiple single-label classification problems ("adapt the data to the algorithm").
• Binary Relevance (BR): splits the multi-label problem into independent single-label (binary) problems.
• Label Powerset (LP): turns the multi-label problem into a multi-class problem by treating each label set as a class.
• Classifier Chains (CC): resolves the BR limitation by modeling label correlations.
Other solutions [will not be covered in this chapter]
• Adapted algorithms: perform multi-label classification directly, rather than transforming the problem into subsets of problems.
• Ensemble approaches: learn multiple classifier systems (train multiple hypotheses) to solve the same problem.
Binary Relevance
• Transforms the multi-label problem into single-label problems.
• Suppose a training dataset D has five instances with an input feature vector X (x1, x2) and an output class vector Y (y1, y2, y3).

X1  | X2  | Y1 | Y2 | Y3
0.3 | 0.7 | 1  | 1  | 0
0.9 | 0.3 | 1  | 1  | 0
0.6 | 0.6 | 0  | 0  | 1
0.4 | 0.1 | 0  | 1  | 0
0.5 | 0.2 | 1  | 0  | 1

• BR converts this into multiple single-label problems: X→Y1, X→Y2, X→Y3.
Advantages
• Computationally efficient
Disadvantages
• Does not capture the dependence relations among the class variables
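A minimal Binary Relevance sketch in Python on the five-instance example above, assuming scikit-learn's LogisticRegression as the base learner (any binary classifier could be substituted); the query point is hypothetical.

# Binary Relevance: one independent binary classifier per label column.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.3, 0.7], [0.9, 0.3], [0.6, 0.6], [0.4, 0.1], [0.5, 0.2]])
Y = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 1]])  # columns: y1, y2, y3

models = []
for j in range(Y.shape[1]):                 # one single-label problem per output column
    clf = LogisticRegression().fit(X, Y[:, j])
    models.append(clf)

x_new = np.array([[0.7, 0.5]])              # hypothetical query point
pred = [int(m.predict(x_new)[0]) for m in models]
print("Predicted label vector:", pred)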
Label Powerset Method
• Transform each label combination into a class value and then learn a multi-class classifier with the new class values.

X1  | X2  | Y1 | Y2 | Y3          X1  | X2  | Ycomb
0.3 | 0.7 | 1  | 1  | 0           0.3 | 0.7 | 1
0.9 | 0.3 | 1  | 1  | 0    →      0.9 | 0.3 | 1
0.6 | 0.6 | 0  | 0  | 1           0.6 | 0.6 | 2
0.4 | 0.1 | 0  | 1  | 0           0.4 | 0.1 | 3
0.5 | 0.2 | 1  | 0  | 1           0.5 | 0.2 | 4

• A single multi-class problem remains: X→Ycomb.
Advantages
• Learns the full joint distribution of the class variables; each new class value maps back to a label combination.
Disadvantages
• The number of new class values can be exponential (|YLP| = O(2^d)), and learning a multi-class classifier over exponentially many classes is expensive.
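A minimal Label Powerset sketch in Python on the same toy data, assuming a DecisionTreeClassifier as the multi-class learner; the combination-to-class mapping mirrors the Ycomb column above.

# Label Powerset: map each distinct label combination to one class value,
# then train a single multi-class classifier.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.array([[0.3, 0.7], [0.9, 0.3], [0.6, 0.6], [0.4, 0.1], [0.5, 0.2]])
Y = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 1]])

combos = {}                                  # label tuple -> class id
y_comb = []
for row in map(tuple, Y):
    if row not in combos:
        combos[row] = len(combos) + 1        # 1, 2, 3, 4 as in the table
    y_comb.append(combos[row])

clf = DecisionTreeClassifier().fit(X, y_comb)
inverse = {v: k for k, v in combos.items()}  # class id -> label tuple
pred = clf.predict([[0.7, 0.5]])[0]          # hypothetical query point
print("Predicted labels:", inverse[pred])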
Classifier Chains
• Similar to Binary Relevance: solve multiple single-label problems, but use the previous predictions as additional input features.
• Step 1: predict Y1 from (X1, X2).
• Step 2: predict Y2 from (X1, X2, Y1) — Y1 is now a feature, Y2 is the class.
• Step 3: predict Y3 from (X1, X2, Y1, Y2) — Y1 and Y2 are features, Y3 is the class.

X1  | X2  | Y1 | Y2 | Y3
0.3 | 0.7 | 1  | 1  | 0
0.9 | 0.3 | 1  | 1  | 0
0.6 | 0.6 | 0  | 0  | 1
0.4 | 0.1 | 0  | 1  | 0
0.5 | 0.2 | 1  | 0  | 1

Limitation: the result can vary for different chain orders. Solution: use an ensemble of chains.
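A minimal Classifier Chain sketch in Python on the same toy data (scikit-learn also ships a ready-made ClassifierChain in sklearn.multioutput); the chain order y1 → y2 → y3 and the LogisticRegression base learner are assumptions.

# Classifier Chain: each classifier receives the previous labels as extra features;
# at prediction time the chain's own predictions are fed forward.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.3, 0.7], [0.9, 0.3], [0.6, 0.6], [0.4, 0.1], [0.5, 0.2]])
Y = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 1]])

chain = []
X_aug = X
for j in range(Y.shape[1]):
    clf = LogisticRegression().fit(X_aug, Y[:, j])
    chain.append(clf)
    X_aug = np.hstack([X_aug, Y[:, [j]]])    # append the true label as a feature for the next step

x_new = np.array([[0.7, 0.5]])               # hypothetical query point
preds = []
for clf in chain:
    y_hat = clf.predict(x_new)[0]
    preds.append(int(y_hat))
    x_new = np.hstack([x_new, [[y_hat]]])    # feed the prediction forward
print("Predicted label vector:", preds)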
Classification - A Two-Step Process
• Model construction (training phase): describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
• The set of tuples used for model construction is the training set
• The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage (testing phase): classifying future or unknown objects
• Estimate the accuracy of the model
• The known label of each test sample is compared with the classified result from the model
• The accuracy rate is the percentage of test-set samples that are correctly classified by the model
• The test set is independent of the training set (otherwise the estimate is optimistic due to overfitting)
• If the accuracy is acceptable, use the model to classify new data
• Note: If the test set is used to select models, it is called a validation (test) set.
Common Algorithms in Supervised Learning
 K-Nearest-Neighbor
 Decision Tree
 Random Forest
 Support Vector Machine (SVM)
 Anomaly Detection
 Linear Regression with Regularization
 Logistic Regression with Regularization
K-Nearest-Neighbor
•The K-Nearest Neighbors (KNN) algorithm is a simple
yet powerful supervised machine-learning technique
used for classification tasks.
•It works based on the idea that similar data points are
often near each other in feature space.
•KNN is like the "ask your neighbors" rule in real life.
When you need to make a decision or guess something,
you check what the closest people (or examples) around
you are doing and follow the majority.
•It’s a simple machine learning algorithm that looks at who’s closest to you to make a prediction.
Applications of KNN

•Handwriting recognition (e.g. digit classifications).


•Recommendation systems.
•Pattern recognition.
•Customer segmentation.
E.g., if the object walks like a duck and quacks like a duck, then it is probably a duck.
[Figure: for a test record, compute the distance to all training records, then choose the k nearest records.]
K-Nearest-Neighbor
Real-Life Example: Fruit Identification
Imagine you find a fruit and don’t know what it is. You compare
it with nearby fruits based on: Size, Color, Weight
If it looks similar to apples near you, you call it an apple!
For Example: you find a fruit that is: Size: 7.5 cm, Color: Red,
Weight: 155 g
You compare it to the data below. It’s closest to the apples, so you classify it as an Apple.

No | Size (cm) | Color  | Weight (g) | Label
1  | 10        | Red    | 190        | Tomato
2  | 6         | Yellow | 120        | Banana
3  | 8         | Red    | 160        | Apple
4  | 12        | Yellow | 250        | Orange
5  | 9         | Red    | 165        | Apple
Key Concepts of KNN
1. Instance-based learning: Predictions are made based on
the similarity of a new data point to existing instances.
* KNN does not explicitly learn a model but memorizes the
training dataset. *
2. Distance Metric: KNN relies on measuring the distance between data points. Common distance metrics include:
• Euclidean distance
• Manhattan distance
• Minkowski distance
3. Number of Neighbors (K):
 The parameter K determines how many nearest neighbors are considered for classification or regression.
 A small K may lead to noisy predictions (overfitting), while a large K may oversimplify the model (underfitting).
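As a quick illustration of these metrics, the Python snippet below computes all three for a pair of 2-D points; the point values and the Minkowski order p are arbitrary choices.

# The three distance metrics listed above, computed for two points (illustrative values).
import numpy as np

a = np.array([3.0, 7.0])
b = np.array([7.0, 4.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))        # sqrt(16 + 9) = 5.0
manhattan = np.sum(np.abs(a - b))                # 4 + 3 = 7.0
p = 3                                            # Minkowski order (p=1 -> Manhattan, p=2 -> Euclidean)
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)

print(euclidean, manhattan, minkowski)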
Nearest Neighbor Classification
Steps of K-NN Algorithm
k-nearest neighbors algorithm steps.

Step 1. Determine parameter k= number of nearest neighbors

Step 2. Calculate the distance between the query-instance and all the training
samples

Step 3. Sort the distance and determine nearest neighbors based on the k-th
minimum distance

Step 4. Gather the category Y of the nearest neighbor

Step 5. Use a simple majority of the categories of the nearest neighbors as the prediction value of the query instance.
k-NN Example
Question: A factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7. Classify this paper as good or bad using the k-nearest neighbor method.
Solution
Step 1. Determine the nearest-neighbor parameter k. In this example we assume k = 3.

Acid Durability (X1) | Strength (X2) | Classification (Y)
7                    | 7             | Bad
7                    | 4             | Bad
3                    | 4             | Good
1                    | 4             | Good
k-NN Example
Step 2. Calculate the distance between the query instance and all the training examples.

Acid Durability (X1) | Strength (X2) | Squared distance to query instance (3, 7)
7                    | 7             | (7-3)² + (7-7)² = 16
7                    | 4             | (7-3)² + (4-7)² = 25
3                    | 4             | (3-3)² + (4-7)² = 9
1                    | 4             | (1-3)² + (4-7)² = 13
k-NN Example
Step 3. Sort the distances and determine the nearest neighbors based on the k-th minimum distance.

Acid Durability (X1) | Strength (X2) | Squared distance to query (3, 7) | Rank (minimum distance) | Included in 3 nearest neighbors?
7                    | 7             | (7-3)² + (7-7)² = 16             | 3                       | Yes
7                    | 4             | (7-3)² + (4-7)² = 25             | 4                       | No
3                    | 4             | (3-3)² + (4-7)² = 9              | 1                       | Yes
1                    | 4             | (1-3)² + (4-7)² = 13             | 2                       | Yes
k-NN Example
Step 4. Gather the category Y of the nearest neighbors. Note that in the second row the category of the nearest neighbor (Y) is not included because the rank of that data point is greater than 3 (= k).

Acid Durability (X1) | Strength (X2) | Squared distance to query (3, 7) | Rank (minimum distance) | Included in 3 nearest neighbors? | Y = category of nearest neighbor
7                    | 7             | (7-3)² + (7-7)² = 16             | 3                       | Yes                              | Bad
7                    | 4             | (7-3)² + (4-7)² = 25             | 4                       | No                               | -
3                    | 4             | (3-3)² + (4-7)² = 9              | 1                       | Yes                              | Good
1                    | 4             | (1-3)² + (4-7)² = 13             | 2                       | Yes                              | Good
k-NN Example
Step 5. Use a simple majority of the categories of the nearest neighbors as the prediction value of the query instance.

 In this example we have 2 Good and 1 Bad; since 2 > 1, we conclude that a new paper tissue that passes the lab test with X1 = 3 and X2 = 7 is classified as GOOD.

* Please try (x1 = 2, x2 = 6)
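A minimal Python sketch that reproduces the worked example above, assuming k = 3 and the squared Euclidean distance from Step 2; the second call is the suggested exercise.

# Reproducing the k-NN worked example (k = 3, query = (3, 7)) with plain Python.
from collections import Counter

train = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]

def knn_predict(query, data, k=3):
    # Step 2: squared Euclidean distance to every training sample
    dists = [((x1 - query[0]) ** 2 + (x2 - query[1]) ** 2, label) for (x1, x2), label in data]
    # Steps 3-4: sort, keep the k nearest, gather their categories
    nearest = sorted(dists)[:k]
    # Step 5: majority vote
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn_predict((3, 7), train))   # -> 'Good' (distances 16, 25, 9, 13; neighbours Bad, Good, Good)
print(knn_predict((2, 6), train))   # the exercise: try (x1 = 2, x2 = 6)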
K-Nearest-Neighbor
Advantages of KNN
 Simple to understand and implement.
 No assumptions about the underlying data distribution.
 Effective for small datasets with well-separated classes.
 Versatile, applicable to both classification and regression.

Disadvantages of KNN
 Computationally expensive during prediction since it requires calculating distances
for all training data points.
 Memory-intensive as it requires storing the entire training set.
 Sensitive to irrelevant or noisy features.
 Requires careful selection of K and the distance metric.
Common Algorithms in Supervised Learning
 K-Nearest-Neighbor
 Decision Tree
 Random Forest
 Support Vector Machine (SVM)
 Anomaly Detection
 Linear Regression with Regularization
 Logistic Regression with Regularization
Decision Tree

Decision tree structure


• Root node: beginning of a tree and represents
entire population being analyzed.
• Internal node: denotes a test on an attribute
• Branch: represents an outcome of the test
• Leaf nodes: represent class labels or class
distribution

• Tree is constructed in a top-down recursive divide-and-conquer manner


Decision Tree

• Root node: has no incoming edges and zero or more outgoing edges.
• Internal nodes: each of which has exactly one incoming edge and two or
more outgoing edges.
• Leaf or terminal node, each of which has exactly one incoming and no
outgoing edges.

Solving the classification problem using a DT is a two-step process:
• Decision tree induction: construct a DT using the training data (induction).
• For each tuple ti in D, apply the DT to determine its class (deduction).
Decision Tree Example
Output: a decision tree for "buys_computer"

Training dataset:

age   | income | student | credit_rating | buys_computer
<=30  | high   | no      | fair          | no
<=30  | high   | no      | excellent     | no
31…40 | high   | no      | fair          | yes
>40   | medium | no      | fair          | yes
>40   | low    | yes     | fair          | yes
>40   | low    | yes     | excellent     | no
31…40 | low    | yes     | excellent     | yes
<=30  | medium | no      | fair          | no
<=30  | low    | yes     | fair          | yes
>40   | medium | yes     | fair          | yes
<=30  | medium | yes     | excellent     | yes
31…40 | medium | no      | excellent     | yes
31…40 | high   | yes     | fair          | yes
>40   | medium | no      | excellent     | no

Resulting decision tree:
age?
  <=30  → student? (no → no, yes → yes)
  31…40 → yes
  >40   → credit_rating? (excellent → no, fair → yes)
Decision Tree
Algorithm:
• The algorithm is a greedy algorithm that does not guarantee optimality.
• Decision tree induction involves three general phases:
• Phase I: find the splitting criterion based on all the samples at the splitting point (node).
• Phase II: split the sample data based on the splitting criterion, forming branches; each successor node receives its samples.
• Phase III: repeat Phases I and II iteratively until a stopping criterion is fulfilled.

Attribute Selection Measure


1. Information Gain (ID3 algorithm)
2. Gain Ratio (C4.5 algorithm) [will not be discussed]
3. Gini Index (CART algorithm) [will not be discussed]
• ID3, C4.5, and CART adopt a greedy (i.e., nonbacktracking) approach in which decision trees
are constructed in a top-down recursive divide-and-conquer manner using Information gain
attribute selection approach.
Information Gain (ID3)
• Developed by J. Ross Quinlan in late 1970s
• All attributes are assumed to be categorical
• Uses information gain to split node into branches
• Details of the algorithm are given below:
• Select the attribute with the highest information gain.
• This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or "impurity" in these partitions.
• Such an approach minimizes the expected number of tests needed to classify a given tuple and guarantees that a simple (but not necessarily the simplest) tree is found.
• The amount of information needed is maximal if all the classes have the same number of tuples (i.e., I(D) = 1 for two equally sized classes).
• The amount of information needed is minimal if all the tuples belong to one class (i.e., I(D) = 0).
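A short Python sketch of how ID3's measure is computed for the buys_computer data above, shown here only for the "age" attribute; the same two functions can be reused for income, student, and credit_rating.

# Entropy and information gain for the "buys_computer" data (ID3 attribute selection).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# class labels of the 14 tuples, grouped by the value of "age"
age_partitions = {
    "<=30":  ["no", "no", "no", "yes", "yes"],
    "31…40": ["yes", "yes", "yes", "yes"],
    ">40":   ["yes", "yes", "no", "yes", "no"],
}
all_labels = [y for part in age_partitions.values() for y in part]

info_D = entropy(all_labels)                                   # I(D) ≈ 0.940
info_age = sum(len(p) / len(all_labels) * entropy(p) for p in age_partitions.values())
print("Gain(age) =", round(info_D - info_age, 3))              # ≈ 0.246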
Decision Tree
Advantages of Decision Trees:
• Simplicity and Interpretability: They are easy to understand and visualize, making
them accessible even to those without a deep understanding of machine learning.
• Versatility: Capable of handling both numerical and categorical data.
• No Need for Data Normalization: They do not require feature scaling or centering.

Disadvantages of Decision Trees:


• Overfitting: They can create overly complex trees that do not generalize well to
unseen data.
• Instability: Small variations in the data can result in a completely different tree being
generated.
• Underfitting: the tree may over-generalize during training if it has too few conditions (splits).
• Bias: trees may favor features with more levels, leading to biased splits.
• High-variance estimators and can be more costly.
Applications of Decision Trees
Decision trees are widely used in various industries:
• Healthcare: Diagnosing diseases based on patient symptoms and
test results.
• Finance: Assessing credit risk and predicting loan defaults.
• Retail: Analyzing customer behavior for targeted marketing.
• Manufacturing: Optimizing supply chain logistics and defect detection.

Decision Trees in Ensemble Methods
• Random Forest: Builds multiple decision trees and combines their outputs to improve accuracy and robustness.

• Gradient Boosting: Sequentially builds trees to correct errors from


previous ones, resulting in a strong predictive model.
Common Algorithms in Supervised Learning
 K-Nearest-Neighbor
 Decision Tree
 Random Forest
 Support Vector Machine (SVM)
 Anomaly Detection
 Linear Regression with Regularization
 Logistic Regression with Regularization
Ensemble vs. Transfer vs. Hybrid Learning
1️⃣ Ensemble Learning: Combines multiple models (e.g., decision trees, neural networks) to improve accuracy and robustness.
Example: Random Forest (uses multiple decision trees for better predictions).
Key Idea: "Many weak models work together to make a strong model."
2️⃣ Transfer Learning: Uses a pre-trained model on one task and fine-tunes it for a new but related task.
Example: Using a model trained on ImageNet to classify medical images.
Key Idea: "Learn once, reuse knowledge for another related task."
3️⃣ Hybrid Learning: Combines two or more learning approaches (e.g., merging deep learning with rule-based systems).
Example: A chatbot that uses both AI-based language models and predefined rule-based responses.
Key Idea: "Mixing different learning techniques for better results."
What is Random Forest ?
Random Forest Algorithm
 The Random Forest algorithm is a powerful tree-based learning technique in ML: it builds many decision trees and then votes across all the trees to make a prediction.
 Random Forest is a collection of decision trees that work together to make better predictions.

Example: Imagine a panel of doctors diagnosing a patient.


Each doctor examines the patient based on their own experience and medical
knowledge (decision trees trained on different subsets of data). Some doctors may
focus on symptoms, while others prioritize medical history or test results. Instead of
relying on a single doctor’s opinion, the final diagnosis is determined by considering
the majority opinion (majority voting for classification) or averaging the suggested
treatments.
Random Forest Algorithm
 Process starts with a dataset with rows and their corresponding class labels
 Then - Multiple Decision Trees are created from the training data.
 Each tree is trained on a random subset of the data (sampled with replacement) and a random subset of features; this is called bagging, or bootstrap aggregating.
 Each Decision Tree in the ensemble learns to make predictions independently.
 When presented with a new, unseen instance, each Decision Tree in the ensemble
makes a prediction.
 The final prediction is made by combining the predictions of all the Decision Trees.
 This is typically done through a majority vote (for classification) or averaging (for
regression).
Assumptions of Random Forest
 Each tree makes its own decisions: Every tree in the forest makes its
own predictions without relying on others.
 Random parts of the data are used: Each tree is built using random
samples and features to reduce mistakes.
 Enough data is needed: Sufficient data ensures the trees are different
and learn unique patterns and variety.
 Different predictions improve accuracy: Combining the predictions
from different trees leads to a more accurate final results.
How Random Forest Algorithm Works?
 Random Forest builds multiple decision trees using random samples of the data. Each
tree is trained on a different subset of the data which makes each tree unique.
 When creating each tree the algorithm randomly selects a subset of features or
variables to split the data rather than using all available features at a time. This adds
diversity to the trees.
 Each decision tree in the forest makes a prediction based on the data it was trained on.
When making final prediction random forest combines the results from all the trees.
 For classification tasks the final prediction is decided by a majority vote: the category predicted by most trees is the final prediction. For regression tasks the final prediction is the average of the predictions from all the trees.
 The randomness in data samples and feature selection helps to prevent the model
from overfitting making the predictions more accurate and reliable.
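A minimal Random Forest sketch in Python with scikit-learn illustrating the points above (bootstrap samples, random feature subsets, majority vote); the dataset and hyperparameter values are illustrative assumptions.

# Random Forest sketch: many trees on bootstrap samples + random feature subsets.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

rf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features="sqrt",   # random subset of features considered at each split
    bootstrap=True,        # each tree sees a bootstrap sample of the data
    random_state=42,
)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))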
Random Forest Algorithm
Advantages of Random Forest
 Random Forest provides very accurate predictions even with large datasets.
 Random Forest can handle missing data well without compromising accuracy.
 It doesn’t require normalization or standardization on dataset.
 When we combine multiple decision trees it reduces the risk of overfitting of
the model.
Limitations of Random Forest
 It can be computationally expensive especially with a large number of trees.
 It’s harder to interpret the model compared to simpler models like decision
trees.
Random Forest Algorithm
What is the difference between decision tree and random forest?
 Decision tree is an independent model that makes predictions based on a series of
decisions whereas random forest is group of multiple decision trees which work to
improve the overall prediction accuracy.
 The accuracy of decision tree is low and sensitive to variations in training data
whereas random forest provides an improved accuracy.

What is the difference between XGBoost and Random Forest?


 Random forest is an ensemble learning algorithm based on bagging, where multiple decision trees are independently trained and their predictions are averaged or voted on.
 XGBoost is a boosting algorithm that gradually trains weaker learners where each
successive learner focuses on the mistakes of its predecessor to improve overall
performance.
Common Algorithms in Supervised Learning
 K-Nearest-Neighbor
 Decision Tree
 Random Forest
 Support Vector Machine (SVM)
 Anomaly Detection
 Linear Regression with Regularization
 Logistic Regression with Regularization
What is the Support Vector Machine (SVM) Algorithm?
Support Vector Machine (SVM) Algorithm
• Definition: Support Vector Machine (SVM) is a supervised machine learning
algorithm used for both classification and regression tasks.
• Geometric Model: SVM views input data as two sets of vectors in an n-dimensional
space, constructing a separating hyperplane.
• Maximizing Margin: The algorithm aims to find the hyperplane that maximizes the
margin between the two data sets.
• Margin Calculation: Two parallel hyperplanes are created on either side of the
separating hyperplane to calculate the margin.
• Support Vectors: The points that define the width of the margin are known as
support vectors, which are crucial for determining the optimal hyperplane.
• Suitability: SVM is particularly effective for classification tasks, focusing on
achieving the best separation between different classes in the data.
• Note: The algorithm maximizes the margin between the closest points of different classes.
Support Vector Machine (SVM) Terminology
• Hyperplane: A decision boundary separating different classes in feature space,
represented by the equation wx + b = 0 in linear classification.
• Support Vectors: The closest data points to the hyperplane, crucial for determining the
hyperplane and margin in SVM.
• Margin: The distance between the hyperplane and the support vectors. SVM aims to
maximize this margin for better classification performance.
• Kernel: A function that maps data to a higher-dimensional space, enabling SVM to
handle non-linearly separable data.
• Hard Margin: A maximum-margin hyperplane that perfectly separates the data without
misclassifications.
• Soft Margin: Allows some misclassifications by introducing slack variables, balancing
margin maximization and misclassification penalties when data is not perfectly
separable.
Support Vector Machine (SVM) Terminology
• Hinge Loss: A loss function penalizing misclassified points or margin violations,
combined with regularization in SVM.
• Dual Problem: Involves solving for Lagrange multipliers associated with support
vectors, facilitating the kernel trick and efficient computation.
Types of Support Vector Machine
• Linear SVM is a Support Vector Machine that uses a straight line (or hyperplane in
higher dimensions) to separate data into classes. It aims to find the optimal hyperplane
that maximizes the margin between different classes for clear classification boundaries.
• Non-Linear SVM is used when data cannot be separated by a straight line. It applies
the kernel trick to transform data into a higher-dimensional space, making it possible to
find a non-linear decision boundary. This allows SVM to handle more complex patterns
and non-linearly separable data effectively.
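A hedged Python sketch of both SVM types with scikit-learn: a linear-kernel SVM and an RBF-kernel SVM on a toy non-linearly-separable dataset; the dataset choice and the C and gamma values are assumptions for illustration. Feature scaling is included because SVMs are sensitive to it.

# Linear vs. non-linear (RBF-kernel) SVM on a toy dataset.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)   # not linearly separable
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))

for name, model in [("linear", linear_svm), ("rbf", rbf_svm)]:
    model.fit(X_train, y_train)
    print(name, "kernel accuracy:", model.score(X_test, y_test))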
Support Vector Machine (SVM)
• The SVM algorithm has a feature to ignore outliers and find the hyperplane that has the maximum margin.
• Example: add a new feature z = x² + y² and then plot the data points on the x and z axes; in this new space the classes become linearly separable.
• Kernel trick: the SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e., it converts a non-separable problem into a separable problem.
Support Vector Machine (SVM)

• To separate the two classes of data points:


• There are many possible hyperplanes that could be chosen.
• Our objective is to find a plane that has the maximum margin, i.e
the maximum distance between data points of both classes.
• Maximizing the margin distance provides some reinforcement so
that future data points can be classified with more confidence.
Advantages of Support Vector Machine (SVM)
• High-Dimensional Performance: SVM excels in high-dimensional spaces, making it
suitable for image classification and gene expression analysis.
• Nonlinear Capability: Utilizing kernel functions like RBF and polynomial, SVM
effectively handles nonlinear relationships.
• Outlier Resilience: The soft margin feature allows SVM to ignore outliers, enhancing
robustness in spam detection and anomaly detection.
• Binary and Multiclass Support: SVM is effective for both binary classification and
multiclass classification, suitable for applications in text classification.
• Memory Efficiency: SVM focuses on support vectors, making it memory efficient
compared to other algorithms.
Disadvantages of Support Vector Machine (SVM)
• Slow Training: SVM can be slow for large datasets, affecting performance in SVM in
data mining tasks.
• Parameter Tuning Difficulty: Selecting the right kernel and adjusting parameters
like C requires careful tuning, impacting SVM algorithms.
• Noise Sensitivity: SVM struggles with noisy datasets and overlapping classes,
limiting effectiveness in real-world scenarios.
• Limited Interpretability: The complexity of the hyperplane in higher dimensions
makes SVM less interpretable than other models.
• Feature Scaling Sensitivity: Proper feature scaling is essential; otherwise, SVM
models may perform poorly.
Quiz 5%
Question: Enkutatash Technologies PLC, a telecommunications
company, is struggling with customer churn prediction. Their
dataset includes 20+ features such as customer demographics, service
usage patterns, transaction history, and customer support interactions
(mix of numerical and categorical variables). The data has missing
values, nonlinear relationships, and potential multicollinearity. The
company wants a model that prioritizes interpretability of key factors
driving churn while maintaining high accuracy.
Which machine learning algorithm is most suitable for this
problem, and why?
Common Algorithms in Supervised Learning
 K-Nearest-Neighbor
 Decision Tree
 Random Forest
 Support Vector Machine (SVM)
 Linear Regression with Regularization
 Logistic Regression with Regularization
 Anomaly Detection
Linear Regression with Regularization
What is Linear Regression?
Linear Regression
 Linear regression is a statistical method used to model the relationship between a
dependent variable and one or more independent variables.

 It computes the linear relationship between the dependent variable and one or more
independent features by fitting a linear equation with observed data.

 It predicts the continuous output variables based on the independent input variable.

 It provides valuable insights for prediction and data analysis.

 For example: if we want to predict house price, we consider various factors such as house age, distance from the main road, location, area, and number of rooms; linear regression uses all these parameters to predict house price, since it assumes a linear relation between these features and the price of the house.
Why Linear Regression is Important?
 Simplicity & Interpretability – Linear regression provides a straightforward way to
understand r/ship b/n dependent & independent variables, making it easy to interpret.
 Predictive Modeling – It helps in forecasting and predicting outcomes based on
historical data, such as predicting sales based on advertising spending.
 Identifying Trends & Relationships – It helps in understanding the impact of one or
more independent variables on a dependent variable (e.g., how temperature affects ice
cream sales).
 Feature Importance – It helps determine which variables (features) have the most
influence on the target variable.
 Baseline Model for Comparison – In machine learning, linear regression serves as a
benchmark model to compare with more complex models.
 Efficiency and Low Computational Cost – It is computationally inexpensive and
works well with small to moderately large datasets.
 Widespread Applications – Used in finance, healthcare, economics, marketing, and
many other fields for decision-making and analysis.
Types of Linear Regression
 Simple Linear Regression (Univariate Linear Regression):
with one independent variable
 Multiple Linear Regression (Multivariate Regression)
with multiple independent variables
Simple Linear Regression
 Simple linear regression is the simplest form of linear regression and it involves only
one independent variable and one dependent variable.
 The relationship is modeled using a straight line.
 The equation for simple linear regression is: Y = β0 + β1X
where:
• Y is the dependent variable
• X is the independent variable
• β0 is the intercept
• β1 is the slope
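A minimal Python sketch of fitting Y = β0 + β1X with scikit-learn; the house-area/price numbers are invented purely for illustration.

# Simple linear regression: fit Y = b0 + b1*X by ordinary least squares.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[50], [60], [80], [100], [120]])      # e.g. house area in m^2 (made up)
y = np.array([110, 130, 170, 205, 250])             # e.g. price in thousands (made up)

model = LinearRegression().fit(X, y)
print("intercept (b0):", model.intercept_)
print("slope (b1):", model.coef_[0])
print("predicted price for 90 m^2:", model.predict([[90]])[0])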
Assumptions of Simple Linear Regression
 Linearity: independent & dependent variables have a linear r/ship with one another.
 This implies that changes in the dependent variable follow those in the independent
variable(s) in a linear fashion. This means that there should be a straight line that can
be drawn through the data points.
 If the relationship is not linear, then linear regression will not be an accurate model.
Assumptions of Simple Linear Regression
 Homoscedasticity: Across all levels of the independent variable(s), the variance of
the errors is constant.
 This indicates that the amount of the independent variable(s) has no impact on the
variance of the errors.
 If the variance of the residuals is not constant, then linear regression will not be an
accurate model.
Assumptions of Simple Linear Regression
 Independence: The observations in the dataset are independent of each other.
 This means that the value of the dependent variable for one observation does not
depend on the value of the dependent variable for another observation.
 If the observations are not independent, then linear regression will not be an accurate
model.
 Normality: The residuals should be normally distributed.
 This means that the residuals should follow a bell-shaped curve.
 If the residuals are not normally distributed, then linear regression will not be an
accurate model.
Multiple Linear Regression
 Multiple linear regression involves more than one independent variable and one
dependent variable.
 The equation for multiple linear regression is: Y = β0 + β1X1 + β2X2 + … + βnXn
where:
• Y is the dependent variable
• X1, X2, …, Xn are the independent variables
• β0 is the intercept
• β1, β2, …, βn are the slopes (coefficients)
 The goal of the algorithm is to find the best-fit line equation that can predict the values based on the independent variables.
Assumptions of Multiple Linear Regression
 For Multiple Linear Regression, all four of the assumptions from Simple Linear
Regression apply. In addition to this, below are few more:
 No multicollinearity: There is no high correlation b/n the independent variables.
 This indicates that there is little or no correlation between the independent variables.
 Multicollinearity occurs when two or more independent variables are highly
correlated with each other, which can make it difficult to determine the individual
effect of each variable on the dependent variable.
 If there is multicollinearity, then multiple linear regression will not be an accurate
model.
 Additivity: The model assumes that the effect of changes in a predictor variable on
the response variable is consistent regardless of the values of the other variables.
 This assumption implies that there is no interaction between variables in their effects
on the dependent variable.
Assumptions of Multiple Linear Regression
 Feature Selection: In multiple linear regression, it is essential to carefully select the
independent variables that will be included in the model.
 Including irrelevant or redundant variables may lead to overfitting and complicate
the interpretation of the model.
 Overfitting: Overfitting occurs when the model fits the training data too closely,
capturing noise or random fluctuations that do not represent the true underlying
relationship between variables.
 This can lead to poor generalization performance on new, unseen data.
Multicollinearity
Multicollinearity is a statistical phenomenon where two or more independent variables
in a multiple regression model are highly correlated, making it difficult to assess the
individual effects of each variable on the dependent variable.
Detecting multicollinearity involves two common techniques:
• Correlation matrix: high correlations (near 1 or -1) indicate potential multicollinearity.
• VIF (Variance Inflation Factor): a VIF above 10 suggests multicollinearity.
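A small Python sketch of the VIF check, computing VIF_j = 1 / (1 - R_j²) by regressing each feature on the others; the synthetic data deliberately makes x1 and x2 nearly collinear so their VIFs come out large.

# Multicollinearity check: VIF for each feature in a small synthetic dataset.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)       # nearly collinear with x1
x3 = rng.normal(size=200)                           # independent feature
X = np.column_stack([x1, x2, x3])

for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    print(f"VIF(x{j + 1}) = {1 / (1 - r2):.1f}")    # values above 10 suggest multicollinearity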
Use Case of Multiple Linear Regression
 Real Estate Pricing: In real estate MLR is used to predict property prices based on
multiple factors such as location, size, number of bedrooms, etc. This helps buyers
and sellers understand market trends and set competitive prices.
 Financial Forecasting: Financial analysts use MLR to predict stock prices or economic indicators based on multiple influencing factors such as interest rates, inflation rates, and market trends. This enables better investment strategies and risk management.
 Agricultural Yield Prediction: Farmers can use MLR to estimate crop yields based
on several variables like rainfall, temperature, soil quality and fertilizer usage. This
information helps in planning agricultural practices for optimal productivity
 E-commerce Sales Analysis: An e-commerce company can utilize MLR to assess
how various factors such as product price, marketing promotions and seasonal trends
impact sales.
Common Algorithms in Supervised Learning
 K-Nearest-Neighbor
 Decision Tree
 Random Forest
 Support Vector Machine (SVM)
 Linear Regression with Regularization
 Logistic Regression with Regularization
 Anomaly Detection
What is Logistic Regression?
 Logistic regression is a supervised ML algorithm used for classification tasks where
the goal is to predict the probability that an instance belongs to a given class or not.
 Logistic regression is a statistical algorithm which analyzes the relationship between two data factors.

 Logistic regression is a statistical method for developing machine learning models with binary dependent variables.

 Logistic regression is used for binary classification, where we use the sigmoid function, which takes the independent variables as input and produces a probability value between 0 and 1.

 For example, suppose we have two classes, Class 0 and Class 1. If the value of the logistic function for an input is greater than 0.5 (the threshold value), then it belongs to Class 1; otherwise it belongs to Class 0. It is referred to as regression because it is an extension of linear regression, but it is mainly used for classification problems.
Key Point of Logistic Regression?
 Logistic regression predicts the output of a categorical dependent variable. Therefore,
the outcome must be a categorical or discrete value.
 It can be either Yes or No, 0 or 1, true or False, etc. but instead of giving the exact
value as 0 and 1, it gives the probabilistic values which lie between 0 and 1.
 In Logistic regression, instead of fitting a regression line, we fit an “S” shaped
logistic function, which predicts two maximum values (0 or 1).
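A minimal Python sketch of the idea: a linear score passed through the sigmoid gives a probability, and the 0.5 threshold turns it into a class label. The hours-studied data is invented for illustration.

# Logistic regression: sigmoid squashes a linear score into a probability in (0, 1).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])   # e.g. hours studied (made up)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])                   # 0 = fail, 1 = pass

clf = LogisticRegression().fit(X, y)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = clf.intercept_[0] + clf.coef_[0][0] * 4.5            # linear score for x = 4.5
print("P(class 1):", sigmoid(z))                         # same value as clf.predict_proba([[4.5]])[0, 1]
print("prediction:", clf.predict([[4.5]])[0])            # class 1 if the probability exceeds 0.5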
Types of Logistic Regression
On the basis of the categories, Logistic Regression can be classified into three types:
• Binomial: In binomial Logistic regression, there can be only two possible types of
the dependent variables, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as “cat”, “dogs”, or “sheep”
• Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types
of dependent variables, such as “low”, “Medium”, or “High”.
Assumptions of Logistic Regression
 Independent observations: Each observation is independent of the others, meaning there is no correlation between observations.
 Binary dependent variables: It takes the assumption that the dependent variable
must be binary or dichotomous, meaning it can take only two values. For more than
two categories SoftMax functions are used.
 Linearity r/ship b/n independent variables and log odds: The relationship between
the independent variables and the log odds of the dependent variable should be linear.
 No outliers: There should be no outliers in the dataset.
 Large sample size: The sample size is sufficiently large
Differences Between Linear and Logistic Regression

Linear Regression | Logistic Regression
It is used to predict the continuous dependent variable using a given set of independent variables. | It is used to predict the categorical dependent variable using a given set of independent variables.
It is used for solving regression problems. | It is used for solving classification problems.
We predict the value of continuous variables. | We predict the values of categorical variables.
We find the best-fit line. | We find the S-curve.
The least squares estimation method is used for estimation of accuracy. | The maximum likelihood estimation method is used for estimation of accuracy.
The output must be a continuous value, such as price, age, etc. | The output must be a categorical value such as 0 or 1, Yes or No, etc.
It requires a linear relationship between dependent and independent variables. | It does not require a linear relationship.
There may be collinearity between the independent variables. | There should be little to no collinearity between the independent variables.
Common Algorithms in Supervised Learning
 K-Nearest-Neighbor
 Decision Tree
 Random Forest
 Support Vector Machine (SVM)
 Linear Regression with Regularization
 Logistic Regression with Regularization
 Anomaly Detection
Anomaly Detection
What is Anomaly Detection?
Anomaly Detection
• Anomaly Detection (Outlier Detection) is a technique in data analysis and machine
learning that identifies data points, events, or observations that significantly deviate
from the normal pattern of a dataset.

• Anomaly detection is critical in many fields: finance (detecting fraudulent transactions), manufacturing (finding defects), healthcare (spotting unusual clinical conditions), and cybersecurity (detecting security breaches or threats).

• Recognizing odd data patterns is called anomaly detection. It discovers unexpected things that don't fit normal trends.

• These irregular findings often signal major trouble: think mistakes, wrongdoing, or unauthorized access.
Types of Anomalies
1. Point Anomalies: a point anomaly occurs when a single data point deviates significantly from the overall distribution.
• Example: In credit card transaction analysis, a point anomaly may be a transaction whose value is significantly larger than the average values recorded for that account, indicating potential fraud.

2. Contextual Anomalies (If-Then Anomalies): contextual anomalies occur when data points appear normal overall but deviate in a specific context, such as time-series or geographical data, where the context defines normalcy.
• Example: A temperature of 85°F might be normal during the summer, but in the winter it would be considered atypical. For instance, running air conditioning in streets or offices in the middle of winter in Bahir Dar would be contextually anomalous.

3. Collective Anomalies: collective anomalies occur when a group of data points seems normal individually but forms an outlier collectively, often seen in sequential data such as telecom and healthcare monitoring systems.
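A tiny Python sketch of point-anomaly detection using z-scores (one simple way to flag values that deviate strongly from the rest); the transaction amounts and the 2.5-standard-deviation threshold are illustrative choices.

# Point-anomaly sketch: flag values whose z-score exceeds 2.5 standard deviations.
import numpy as np

amounts = np.array([42, 55, 38, 61, 47, 52, 49, 950, 44, 58])   # one obvious outlier (made up)
z = (amounts - amounts.mean()) / amounts.std()

anomalies = amounts[np.abs(z) > 2.5]        # a stricter cut-off (e.g. 3) is also common
print("flagged as anomalous:", anomalies)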
Why is Anomaly Detection Important?
 Early Issue and Threat Detection: Anomaly detection helps identify potential problems
and risks early, preventing significant damage. In cybersecurity, detecting unusual network
traffic can signal a breach, enabling proactive prevention of data theft.
 Fraud Prevention: In finance helps identify and prevent fraudulent transactions by
spotting deviations from a user's normal behavior, protecting assets and saving millions.
 Quality Control & Maintenance: In manufacturing, anomaly detection ensures product
quality and enables predictive maintenance by identifying defects and abnormal equipment
behavior, reducing downtime and costs.
 Healthcare Monitoring: Anomaly detection helps track patient health by identifying
abnormal vital signs, enabling early detection of potential issues or deterioration.
 Improving the Customer Experience: Companies employ anomaly detection to track
service performance and user interactions.
 Enhanced Security: Anomaly detection boosts safety and security by identifying suspicious
activities or behaviors, complementing cybersecurity and fraud prevention efforts.
Anomaly Detection Use Cases
1. Fraud Detection:
 Banking and Finance: Identifies potential fraud by automatically flagging unusual
activities such as large amounts, foreign transactions, or rapid sequences of transactions.
 Insurance: Flags suspicious claims, such as when damage seems inconsistent with
reported losses or multiple claims are made for the same incident.
2. Intrusion Detection (Cybersecurity):
• Network Security: Monitors network traffic for unusual events such as DoS attacks,
phishing, or malware spreading, based on deviations from normal traffic patterns.
• System Security: Tracks system operations and alerts on suspicious activities, like
unauthorized access or abnormal access patterns.
3. Industrial Anomaly Detection:
 Manufacturing Processes: Monitors production lines to identify and remove defective or
non-compliant products, preventing supply chain issues.
 Oil and Gas: Tracks infrastructure and machinery with sensor data to detect failures or
safety risks early.
Model performance evaluation, diagnostics &
predictions
 Regularization and Bias/Variance
 Evaluating hypothesis
 Model selection (Train/ Test / Validation)
 Learning Curves
 MSE, lift, AUC, Type 1 vs 2
Regularization Techniques in Machine Learning
 Regularization is a technique used to prevent overfitting by adding a penalty term to the model's objective function during training.

 The objective is to discourage the model from fitting the training data too closely and
promote simpler models that generalize better to unseen data.

 Regularization methods control the complexity of models by penalizing large


coefficients or by selecting a subset of features, thus helping to strike the right
balance between bias and variance.

 Regularization provides a methodical way to avoid overfitting and to enhance the capacity of machine learning models to generalize.

 Regularization plays a pivotal role in enhancing the generalization ability of machine


learning models.
What Are Overfitting and Underfitting?
 In the field of machine learning, overfitting and underfitting are two critical concepts
that directly impact the performance and reliability of models.

 Overfitting occurs when a model captures noise and patterns specific to the training
data, leading to poor generalization on unseen data.

 Underfitting arises when a model is too simple to capture the underlying patterns in
the data, resulting in poor performance on both the training and testing datasets.

 By mitigating overfitting, regularization techniques improve the model's performance


on unseen data, leading to more reliable predictions in real-world scenarios.

 Additionally, regularization facilitates feature selection and helps in building


interpretable models by identifying the most relevant features for prediction.
What Are Overfitting and Underfitting?
 The image illustrates three scenarios in model performance:
 Overfitting – The model is too complex, capturing noise and outliers, leading to poor generalization.
 Underfitting – The model is too simple, failing to capture the underlying data patterns.
 Optimal Fit – A balanced model that generalizes well, achieving low bias and low variance; it can be achieved by using regularization techniques.
When Do Overfitting and Underfitting Occur?
• Overfitting occurs when a machine learning model is too tailored to the training data, failing to
generalize to unseen data. This happens when the model learns noise instead of underlying
patterns. For example, predicting tomorrow's weather based solely on last week's data might
lead to irrelevant conclusions, like basing predictions on a one-time rainstorm.
• Underfitting happens when a model fails to capture even the basic patterns in the dataset. In
this case, the model performs poorly on both training and validation data. To address
underfitting, we may need to increase the model's complexity or add more features.
Types of Regularization
1. Lasso Regularization – (L1 Regularization): A regression model which uses the
L1 Regularization technique is called LASSO(Least Absolute Shrinkage and
Selection Operator) regression. Lasso Regression adds the “absolute value of
magnitude” of the coefficient as a penalty term to the loss function(L)

2. Ridge Regularization – (L2 Regularization): A regression model that uses the L2


regularization technique is called Ridge regression. Ridge regression adds the “squared
magnitude” of the coefficient as a penalty term to the loss function(L).

3. Elastic Net Regularization – (L1 and L2 Regularization): Elastic Net Regression is


a combination of both L1 as well as L2 regularization. That implies that we add the
absolute norm of the weights as well as the squared measure of the weights. With the
help of an extra hyperparameter that controls the ratio of the L1 and L2 regularization.
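A short Python sketch of the three regularization types using scikit-learn's Lasso, Ridge, and ElasticNet; the synthetic regression data and the alpha/l1_ratio values are arbitrary. It also shows L1's feature-selection effect (coefficients driven exactly to zero).

# L1, L2, and Elastic Net regularization on synthetic data.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge, ElasticNet

X, y = make_regression(n_samples=100, n_features=20, n_informative=5, noise=10, random_state=0)

models = {
    "L1 (Lasso)": Lasso(alpha=1.0),
    "L2 (Ridge)": Ridge(alpha=1.0),
    "Elastic Net": ElasticNet(alpha=1.0, l1_ratio=0.5),
}
for name, model in models.items():
    model.fit(X, y)
    n_zero = (model.coef_ == 0).sum()       # count coefficients shrunk exactly to zero
    print(f"{name}: {n_zero} coefficients shrunk exactly to zero")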
Benefits of Regularization
• Prevents Overfitting: Regularization helps models focus on underlying patterns instead
of memorizing noise in the training data.
• Improves Interpretability: L1 (Lasso) regularization simplifies models by reducing less
important feature coefficients to zero.
• Enhances Performance: Prevents excessive weighting of outliers or irrelevant features,
improving overall model accuracy.
• Stabilizes Models: Reduces sensitivity to minor data changes, ensuring consistency
across different data subsets.
• Prevents Complexity: Keeps models from becoming too complex, which is crucial for
limited or noisy data.
• Handles Multicollinearity: Reduces the magnitudes of correlated coefficients, improving
model stability.
• Allows Fine-Tuning: Hyperparameters like alpha and lambda control regularization
strength, balancing bias and variance.
• Promotes Consistency: Ensures reliable performance across different datasets, reducing
the risk of large performance shifts.
What are Bias and Variance?
• Bias refers to the error introduced by an overly simplistic model. When the model is too simple to fit the data, we face high bias (underfitting): the model is unable to learn the patterns in the data at hand and performs poorly.
• Variance refers to the error that occurs when we make predictions on data the model has not previously seen. High variance (overfitting) occurs when the model learns the noise that is present in the training data.
Bias Variance tradeoff
• The bias-variance tradeoff refers to the balance between bias and variance, which affects predictive model performance. Finding the right balance is crucial.
• The bias-variance tradeoff demonstrates the inverse relationship between bias and
variance. When one decreases, the other tends to increase, and vice versa.
• An overly simple model with high bias won’t capture the underlying patterns, while an
overly complex model with high variance will fit the noise in the data.
Model performance evaluation, diagnostics &
predictions
 Regularization and Bias/Variance
 Evaluating hypothesis
 Model selection (Train/ Test / Validation)
 Learning Curves
 MSE, lift, AUC, Type 1 vs 2
Evaluating hypothesis
• Machine learning is a crucial aspect of artificial intelligence that enables machines to learn
from data and make predictions or decisions.
• Assessing whether a model’s predictions align with observed data using statistical methods.
• Model performance evaluation is essential in ensuring the accuracy, reliability, and
robustness of machine learning models.
• The process of machine learning involves training a model on a dataset, and then using that
model to make predictions on new, unseen data.
• Before deploying a machine learning model, it is essential to evaluate its performance to
ensure that it is accurate and reliable.
• One crucial step in this evaluation process is hypothesis testing.
• Purpose: To determine if the machine learning model is accurate and if it can be
generalized to new data or unseen data on the testing phase.
Why are Hypotheses Essential in Machine Learning?
• Hypotheses are essential in machine learning because they provide a framework for understanding the problem that we are trying to solve.
• They help us to identify the key variables that are relevant to the problem, and they
provide a basis for evaluating the performance of our machine learning model.
• Without a clear hypothesis, it is difficult to develop an effective machine learning
model. A hypothesis helps us to:

• Identify the key variables that are relevant to the problem


• Develop a clear understanding of the problem that we are trying to solve
• Evaluate the performance of our machine learning model
• Refine our model and improve its accuracy.
Evaluating Hypotheses in Machine Learning
• Evaluating hypotheses in machine learning involves testing the null hypothesis against
the alternative hypothesis. This is typically done using statistical methods, such as t-
tests, ANOVA, and regression analysis. Here are the general steps involved in evaluating
hypotheses in machine learning:
• Formulate the null and alternative hypotheses: Clearly define the null and alternative
hypotheses that you want to test.
• Collect and prepare the data: Collect the data that you will use to test the hypotheses.
Ensure that the data is clean, relevant, and representative of the population.
• Choose a statistical method: Select a suitable statistical method to test the hypotheses.
This could be a t-test, ANOVA, regression analysis, or another method.
• Test the hypotheses: Use the chosen statistical method to test the null hypothesis
against the alternative hypothesis.
• Interpret the results: Interpret the results of the hypothesis test. If the null hypothesis is
rejected, it suggests that there is a significant relationship between the variables. If the
null hypothesis is not rejected, it suggests that there is no significant relationship
between the variables.
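• As a concrete instance of these steps, the following minimal sketch (added for illustration, not from the slides) uses a paired t-test to compare the cross-validated accuracies of two classifiers; scikit-learn, SciPy, and a built-in dataset are assumed.

# Hypothesis test: do two classifiers differ in mean cross-validated accuracy?
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=0)        # same folds for both models

scores_a = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=cv)
scores_b = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

# H0: the two models have the same mean accuracy; H1: their accuracies differ.
t_stat, p_value = stats.ttest_rel(scores_a, scores_b)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the accuracy difference is statistically significant.")
else:
    print("Fail to reject H0: no significant difference detected.")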
Model performance evaluation, diagnostics &
predictions
 Regularization and Bias/Variance
 Evaluating hypothesis
 Model selection (Train/ Test / Validation)
 Learning Curves
 MSE, lift, AUC, Type 1 vs 2
Model selection (Train/ Test / Validation)
• Model selection is the process of choosing the most appropriate machine learning model
from a set of candidates for a given task.
• It involves evaluating models based on their performance using training, validation, and
test datasets.
• The goal is to select a model that generalizes well to new, unseen data, avoiding issues
like overfitting or underfitting.
• Key techniques for model selection include cross-validation, hyperparameter tuning, and evaluating performance metrics such as accuracy, precision, or AUC; a brief tuning sketch follows the goals below.
Goals:
• Accuracy: Maximize predictive accuracy on validation or test datasets.
• Simplicity: Favor simpler models that are easier to interpret, unless a more complex
model significantly improves performance.
• Robustness: Ensure that the selected model performs well across different subsets of data.
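• A minimal sketch of model selection through hyperparameter tuning, assuming scikit-learn's GridSearchCV and its built-in Iris dataset (illustrative choices added for this document, not from the slides). Each candidate setting is scored by cross-validation on the training split, and the best candidate is then checked on the held-out test split.

# Model selection: grid search over SVM hyperparameters with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 0.01]}   # candidate settings
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)                  # picks the candidate with the best CV score

print("Best hyperparameters:", search.best_params_)
print("Cross-validated accuracy of best model:", round(search.best_score_, 3))
print("Held-out test accuracy:", round(search.score(X_test, y_test), 3))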
Train Test Validation Split
Training Set
• This is the actual dataset from which a model learns, i.e. the model sees and learns from this data in order to predict outcomes or make the right decisions.
• Most of the training data is collected from several sources and then preprocessed and organized to support proper performance of the model.
• The type of training data largely determines the model's ability to generalize.
• The better the quality and diversity of the training data, the better the performance of the model. This split is typically more than 60% of the total data available for the project.
• It is the set of data that is used to train and make the model learn the hidden
features/patterns in the data.
• In each epoch, the same training data is fed to the neural network architecture repeatedly,
and the model continues to learn the features of the data.
• The training set should have a diversified set of inputs so that the model is trained in all
scenarios and can predict any unseen data sample that may appear in the future.
Testing Set
• This dataset is independent of the training set but follows a similar class probability distribution. It is used as a benchmark to evaluate the model, and only after training of the model is complete.
• The testing set is usually a well-organized dataset covering all kinds of scenarios that the model would probably face when used in the real world.
• Often the validation and testing sets are combined and used as a single testing set, which is not considered good practice.
• If the accuracy of the model on training data is greater than that on testing data, the model is said to be overfitting.
• This split is approximately 20-25% of the total data available for the project.
Validation Set
• The validation set is used to fine-tune the hyperparameters of the model and is considered a part of the training process.
• The model only sees this data for evaluation and does not learn from it, which provides an objective, unbiased assessment of the model.
• The validation set can also act as a form of regularization through early stopping, i.e. interrupting training when the loss on the validation set rises above the loss on the training set, helping to keep bias and variance in check.
• This split is approximately 10-15% of the total data available for the project, but it can change depending on the number of hyperparameters; a model with many hyperparameters benefits from a larger validation set.
• Whenever the accuracy of the model on validation data is comparable to (or greater than) its accuracy on training data, the model is said to have generalized well.
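• A minimal sketch of carving a dataset into roughly 60% training, 20% validation, and 20% test splits by calling train_test_split twice; scikit-learn and its built-in wine dataset are assumed purely for illustration (not from the slides).

# Train/validation/test split in roughly 60/20/20 proportions.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# First carve off the test set (20% of all data), keeping class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

# Then carve the validation set out of the remaining 80% (0.25 * 0.80 = 20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42, stratify=y_train)

print(len(X_train), "training /", len(X_val), "validation /", len(X_test), "test samples")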
Cross-Validation
• Cross-validation is a statistical method used in machine learning to evaluate how well a
model performs on an independent data set.
• It involves dividing the available data into multiple folds or subsets, using one of these
folds as a validation set and training the model on the remaining folds.
• This process is repeated multiple times, each time using a different fold as the validation set.
• Finally, the results from each validation step are averaged to produce a more robust estimate of the model's performance.
• The main purpose of cross-validation is to prevent overfitting, which occurs when a model is trained too well on the training data and performs poorly on new, unseen data.
• By evaluating the model on multiple validation sets, cross-validation provides a more realistic estimate of the model's generalization performance, i.e. its ability to perform well on new, unseen data.
• If you want to make sure your machine learning model is not just memorizing the training data but is capable of adapting to real-world data, cross-validation is a commonly used technique.
Types of Cross-Validation
• Holdout Validation splits the dataset into two equal parts: 50% for training and 50% for testing. It's quick but may miss important data in the unused half, leading to higher bias.
• LOOCV (Leave-One-Out Cross-Validation) trains the model on all but one data point and tests on the omitted point, repeating for each data point. It uses all data, minimizing bias, but can result in higher variation and longer execution time due to testing each data point.
• Stratified Cross-Validation ensures that each fold of the dataset has the same class distribution as the entire dataset. It's useful for imbalanced datasets, ensuring balanced representation in each fold, which helps in classification tasks.
• K-Fold Cross-Validation splits the dataset into k subsets. The model is trained on k-1 subsets, with one reserved for testing. This process is repeated k times, using each subset for testing once.
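• The sketch below (an illustration added for this document, not from the slides) runs plain k-fold and stratified k-fold cross-validation with scikit-learn and averages the per-fold accuracies.

# k-fold vs stratified k-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# Plain k-fold: 5 folds, each used exactly once as the held-out validation fold.
kf_scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# Stratified k-fold: preserves the class distribution in every fold.
skf_scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))

print("k-fold accuracies:           ", kf_scores.round(3), "mean =", round(kf_scores.mean(), 3))
print("stratified k-fold accuracies:", skf_scores.round(3), "mean =", round(skf_scores.mean(), 3))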
Model performance evaluation, diagnostics &
predictions
 Regularization and Bias/Variance
 Evaluating hypothesis
 Model selection (Train/ Test / Validation)
 Learning Curves
 MSE, lift, AUC, Type 1 vs 2
Learning Curves
• Machine learning models are employed to learn patterns in data.
• The best models can generalize well when faced with instances that were not part of the
initial training data.
• During the research phase, several experiments are conducted to find the solution that best solves the business problem and reduces the error made by the model.
• An error may be defined as the difference between the model's prediction for an observation and the true value of that observation.
• There are two major causes for errors in machine learning models:
• Bias describes a model which makes simplified assumptions so the target function is
easier to approximate; a model may learn that every 5'9 male in the world wears a size
medium top - this is clearly biased.
• Variance describes the variability in the model prediction; how much the prediction of the
model changes when we change the data used to train it.
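• A learning curve plots training and validation error against the number of training examples; the gap and the level of the two curves help diagnose whether a model suffers more from bias or from variance. Below is a minimal sketch assuming scikit-learn's learning_curve utility and its built-in digits dataset (illustrative choices, not from the slides).

# Learning curve: error vs training-set size for a logistic regression classifier.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=2000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy")

for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{int(n):5d} samples: train error = {1 - tr:.3f}, validation error = {1 - va:.3f}")
# A large, persistent gap between the two curves suggests high variance (overfitting);
# curves that converge at a high error suggest high bias (underfitting).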
Model performance evaluation, diagnostics &
predictions
 Regularization and Bias/Variance
 Evaluating hypothesis
 Model selection (Train/ Test / Validation)
 Learning Curves
 MSE, lift, AUC, Type 1 vs 2
MSE, lift, AUC, Type 1 vs 2
MSE (Mean Squared Error)
• Definition: MSE is a metric used to evaluate the accuracy of a regression model. It
measures the average squared difference between the predicted and actual values.
• Purpose: It helps to quantify how well a regression model fits the data.
• Goal: Minimize MSE to achieve a better-fitting model, as lower MSE indicates higher
accuracy in predictions.
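• Formally, MSE = (1/n) * Σ(y_i − ŷ_i)². The minimal sketch below (added for illustration, not from the slides) computes it both by hand with NumPy and with scikit-learn's helper.

# Mean Squared Error: manual NumPy computation vs scikit-learn's helper.
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # actual values
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # model predictions

mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)
print(mse_manual, mse_sklearn)              # both print 0.375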
Lift
 Definition: Lift is a metric used in classification models to measure how much better a
model performs compared to random guessing. It is commonly used in marketing and
sales to evaluate targeting strategies.
 Purpose: To measure the effectiveness of a classification model in identifying true positive
outcomes.
 Goal: A higher lift means the model is better at predicting the target variable, particularly
in situations with imbalanced datasets.
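• The sketch below (an illustration added for this document, not from the slides) computes lift in the top decile: the positive rate among the 10% highest-scored instances divided by the overall positive rate. Values above 1 mean the model beats random targeting; the helper function and synthetic data are assumptions.

# Lift in the top 10% of model scores.
import numpy as np

def lift_at(y_true, scores, fraction=0.10):
    order = np.argsort(scores)[::-1]            # indices sorted by score, highest first
    top_k = max(1, int(len(scores) * fraction))
    top_rate = y_true[order[:top_k]].mean()     # positive rate in the top-scored slice
    base_rate = y_true.mean()                   # positive rate in the whole population
    return top_rate / base_rate

rng = np.random.RandomState(0)
y = rng.binomial(1, 0.2, size=1000)             # about 20% positives overall
scores = y * 0.5 + rng.uniform(size=1000) * 0.8 # scores loosely correlated with the label
print("Lift in top 10%:", round(lift_at(y, scores), 2))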
MSE, lift, AUC, Type 1 vs 2
AUC (Area Under the Curve)
• Definition: AUC is the area under the ROC curve (Receiver Operating Characteristic
curve) and is used to evaluate the performance of a binary classification model. It
measures the ability of the model to distinguish between positive and negative classes.
• Purpose: It provides a single value that reflects the overall performance of a classifier
across all possible classification thresholds.
• Goal: Maximize AUC (closer to 1) to achieve a model with better discriminative ability.
• Interpretation:
• AUC = 0.5: No discrimination (random classifier)
• AUC = 1.0: Perfect classification
• AUC < 0.5: Worse than random guessing.
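• A minimal sketch of computing AUC from a classifier's predicted probabilities, assuming scikit-learn and its built-in breast cancer dataset (illustrative choices, not from the slides).

# AUC: area under the ROC curve for a binary classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]          # probability of the positive class

print("AUC:", round(roc_auc_score(y_test, probs), 3))   # 0.5 = random, 1.0 = perfect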
Type 1 vs. Type 2 Errors
• Type 1 Error (False Positive): Incorrectly rejecting the null hypothesis when it is true. It
occurs when the model incorrectly predicts a positive outcome (e.g., predicting a disease
when the person is healthy).
• Type 2 Error (False Negative): Failing to reject the null hypothesis when it is false. It
occurs when the model incorrectly predicts a negative outcome (e.g., predicting no disease
when the person is actually sick).
• Purpose: To understand the trade-offs between false positives and false negatives in model
predictions, especially in classification tasks.
Goal:
• Minimize Type 1 Errors when the cost of false positives is high (e.g., unnecessary medical
treatments).
• Minimize Type 2 Errors when the cost of false negatives is high (e.g., missing a critical
diagnosis).
• Relationship: Type 1 and Type 2 errors are inversely related. Reducing one often increases
the other, so finding a balance depending on the problem context is important.
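• The sketch below (added for illustration, not from the slides) reads the Type 1 and Type 2 error counts directly off a confusion matrix using scikit-learn.

# Confusion matrix: false positives are Type 1 errors, false negatives are Type 2 errors.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]   # 1 = disease present, 0 = healthy
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"Type 1 errors (false positives): {fp}")   # healthy people flagged as sick
print(f"Type 2 errors (false negatives): {fn}")   # sick people the model missed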
Other Related Topics
Enkutatash Tech
THANK YOU
Wisdom is the essence of our uniqueness!
Adane Kasie Chekole
+251-938427723
adanekasie26@gmail.com
www.enkutatashplc.com
Lemlemitu, Bahir Dar, Ethiopia