ST JOSEPH’S UNIVERSITY
BENGALURU
INTRODUCTION TO
MACHINE LEARNING
PRESENTED BY:
AARAN DLIMA
INTRODUCTION:
Human beings can learn from our
experiences because we have a learning capacity.
Additionally, we have computers or machines
which work on our instructions.
But can a machine also learn from experiences
or past data like a human does?
This is where Machine Learning comes in.
HUMAN VS MACHINE
INTRODUCTION:
Subset of Artificial Intelligence.
Primary Focus : Creation of algorithms that
enable a computer to independently learn from
data and previous experiences.
Algorithms create a mathematical model that,
without being explicitly programmed, helps in
making predictions or decisions with the
assistance of sample past data, or training data.
EXAMPLE: WANT TO PREDICT CAT OR DOG
For the purpose of developing predictive models, machine learning
brings together statistics and computer science.
A machine can learn if it can gain more data to improve its performance.
HOW DOES ML WORK?
ML builds prediction models, learns
from the data and predicts the output
of new data when it receives it.
The more data you have, the better the
model will generally be. The amount of data also affects the
accuracy of the predicted output.
DO WE NEED ML?
Increased Data
Solving complex problems (difficult
for a human)
Decision making in various sectors
Finding hidden patterns and
extracting useful information from
data.
CLASSIFICATION OF ML
SUPERVISED LEARNING
UNSUPERVISED LEARNING
REINFORCEMENT LEARNING
SUPERVISED LEARNING
REMEMBER: Labeled data is used here.
OBJECTIVE: Mapping of the input data to
the output data.
The system uses labeled data to build a
model that learns the relationship between
each input and its label.
Training and Testing: the model is fitted on one part of the data and evaluated on a held-out part.
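A minimal sketch of supervised learning for the cat-or-dog example, assuming scikit-learn is installed. The features (weight and height), their values, and the choice of a k-nearest-neighbours classifier are illustrative assumptions, not part of the slides.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical labeled data: [weight in kg, height in cm] -> label.
X_train = [[4.0, 25], [3.5, 23], [5.0, 26], [4.5, 24],
           [20.0, 55], [25.0, 60], [18.0, 50], [30.0, 65]]
y_train = ["cat", "cat", "cat", "cat", "dog", "dog", "dog", "dog"]

# Training: the model learns the mapping from inputs to labels.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Prediction: the model labels animals it has not seen before.
X_new = [[4.2, 24], [22.0, 58]]
predictions = model.predict(X_new).tolist()
print(predictions)
```

Each new animal is labeled by a vote among its three closest training examples, which is one simple way of mapping input data to output data.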
UNSUPERVISED LEARNING
REMEMBER: Data that has not been labeled,
classified, or categorized is used here.
OBJECTIVE: To restructure the input data into
new features or a group of objects with similar
patterns.
It is a learning method in which a machine learns
without any supervision.
No predetermined result.
The machine tries to find useful insights from the
huge amount of data.
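A small sketch of unsupervised learning, assuming scikit-learn and NumPy are installed. K-means clustering is used here as one common example of grouping unlabeled data; the points are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled data: two loose groups of 2-D points.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [8.3, 7.9], [7.9, 8.1]])

# No labels are given; the algorithm groups similar points on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster index assigned to each point
print(kmeans.cluster_centers_)  # centre of each discovered group
```

The cluster numbers themselves are arbitrary; what matters is that points with similar patterns end up in the same group.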
REINFORCEMENT LEARNING
Feedback-based learning method.
Reward for a right action and penalty for a wrong
action.
The agent learns automatically from this
feedback and improves its performance.
A robotic dog that automatically learns the
movement of its limbs is an example of
reinforcement learning.
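A toy sketch of feedback-based learning in plain Python. It uses an epsilon-greedy agent choosing between two actions, which is a deliberately simplified stand-in for a real reinforcement learning task; the reward values, step count, and function name are assumptions made for illustration.

```python
import random

def run_agent(rewards=(0.0, 1.0), steps=500, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: action 0 is 'wrong' (reward 0), action 1 is 'right' (reward 1)."""
    rng = random.Random(seed)
    q = [0.0] * len(rewards)     # estimated value of each action
    counts = [0] * len(rewards)  # how often each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(rewards))                      # explore
        else:
            action = max(range(len(rewards)), key=lambda a: q[a])    # exploit
        reward = rewards[action]                                      # feedback
        counts[action] += 1
        q[action] += (reward - q[action]) / counts[action]            # running average
    return q, counts

q_values, counts = run_agent()
best_action = max(range(len(q_values)), key=lambda a: q_values[a])
print(q_values, counts, best_action)
```

The agent is never told which action is correct; it only sees rewards, and its value estimates drift toward the rewarded action.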
APPLICATIONS OF ML
Image recognition
Speech recognition
Prediction
Recommendation
Voice Assistant
Fraud Detection
Language Translation
Stock market
TYPES OF DATA
Numerical data: Such as house price,
temperature, etc.
Categorical data: Such as Yes/No,
True/False, Blue/green, etc.
Ordinal data: Similar to categorical data, but
the categories have a meaningful order and can
be compared, such as Low/Medium/High.
TYPES OF DATASETS
IMAGE DATASETS
Images
Used for image classification, object detection, and image segmentation.
TEXT DATASETS
Textual information: articles, books, etc.
Used for sentiment analysis and text classification.
TABULAR DATASETS
Rows and columns (table format)
DATA PREPROCESSING:
Data preprocessing is the set of tasks required for
cleaning the data and making it suitable for a
machine learning model, which also increases
the accuracy and efficiency of the
model.
Raw data often contains noise, missing values, and
unusable formats, so it cannot be used directly
for machine learning models.
STEPS IN DATA
PREPROCESSING
01 SEARCH THE DATASET
Each dataset is different from the others.
Search for a dataset according to the needs of your problem statement.
The data we use is usually in CSV format, text file format or Excel file format.
02 IMPORTING LIBRARIES
Import some predefined Python libraries.
A few libraries we usually come across are numpy, pandas, matplotlib, seaborn and scikit-learn.
03 IMPORT DATASETS
After importing the libraries, we need to import the data that we have collected.
Here we make use of functions such as read_csv, read_excel and so on. (Learn different ways of importing the datasets.)
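A short sketch of steps 02 and 03, assuming pandas is installed. To keep the example self-contained, the CSV content is an in-memory string with made-up columns and values; with a real file you would pass its path to `read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file on disk.
csv_text = """area_sqft,bedrooms,city,price
1200,2,Bengaluru,85
1500,3,Mysuru,
900,,Bengaluru,60
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())        # first rows of the dataset
print(df.isna().sum())  # number of missing values per column
```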
STEPS IN DATA
PREPROCESSING
Once the first three steps are done, explore the data. Know
what your dependent and independent variables are.
04 DATA CLEANING
Check for missing values.
How do you deal with missing values?
Deleting
Substituting values with mean, median or mode
Check for column names to be renamed and so on.
05 ENCODING
This step is basically for treating categorical variables.
A few encoding techniques are the Label encoder and the One hot encoder, available in the scikit-learn package.
06 SPLITTING
We split the dataset into X_train, X_test, Y_train, Y_test.
Here you need to decide your test size (usually 20% to 30%).
Random state keyword.
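A minimal sketch of cleaning and splitting, assuming pandas, NumPy and scikit-learn are installed. The column names and values are hypothetical; missing values are replaced with the column mean, and `random_state` is fixed so the split is reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset with two missing values in the independent variable.
df = pd.DataFrame({
    "hours_studied": [1, 2, np.nan, 4, 5, 6, 7, np.nan, 9, 10],
    "passed":        [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
})

# Data cleaning: substitute missing values with the column mean.
df["hours_studied"] = df["hours_studied"].fillna(df["hours_studied"].mean())

X = df[["hours_studied"]]  # independent variable(s)
y = df["passed"]           # dependent variable

# Splitting: 20% of the rows are held out for testing.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```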
Dummy Variables:
Dummy variables are variables that take the value 0 or 1. A 1 marks the presence of a category in a particular row, and
the remaining dummy columns are 0. With one-hot encoding, we get one column per category; one column is often dropped because it can be inferred from the others.
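A small sketch of dummy variables using `pandas.get_dummies`, assuming pandas is installed. The `color` column is a made-up example; `drop_first=True` shows the variant that keeps one column fewer than the number of categories.

```python
import pandas as pd

df = pd.DataFrame({"color": ["Blue", "Green", "Blue", "Red"]})

# One 0/1 column per category.
dummies = pd.get_dummies(df["color"], dtype=int)
print(dummies)

# Dropping the first category leaves k-1 columns for k categories.
reduced = pd.get_dummies(df["color"], dtype=int, drop_first=True)
print(reduced)
```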
STEPS IN DATA PREPROCESSING:
7. Feature Scaling
Final step in data preprocessing.
It is a technique to bring the independent variables of the
dataset into a common range.
We put our variables in the same range and on the same scale so
that no variable dominates the others.
Two ways of feature scaling:
Standardization
Normalization
For feature scaling, we can import the StandardScaler class from the
sklearn.preprocessing module.
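A brief sketch of both scaling approaches, assuming scikit-learn and NumPy are installed. The age and salary values are invented; `MinMaxScaler` is used here as one common way to do normalization.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales: [age, salary].
X = np.array([[25, 20000.0], [35, 50000.0], [45, 80000.0], [55, 110000.0]])

# Standardization: each column gets mean 0 and standard deviation 1.
standardized = StandardScaler().fit_transform(X)

# Normalization: each column is rescaled to the range [0, 1].
normalized = MinMaxScaler().fit_transform(X)

print(standardized)
print(normalized)
```

In a real workflow the scaler is usually fitted on the training set only and then applied to the test set, so that no information leaks from the test data.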
LIFE CYCLE OF ML
Problem Definition
Data Collection
Data Preparation
Model Building
Model Evaluation
Model Deployment
OVERFITTING & UNDERFITTING
These are the two main problems that we encounter in Machine Learning,
and they degrade the performance of machine learning models.
Few Terms to Keep in Mind:
Bias: Bias is the prediction error introduced when a model oversimplifies
the problem. It is the difference between the model's average prediction
and the actual values.
Variance: Variance is how much the model's predictions change when the training data changes. If the model performs well with the training dataset but
does not perform well with the test dataset, it has high variance.
OVERFITTING:
Occurs when our ML model tries to cover all the data points, or more
than the required data points, present in the given dataset.
As a result, the model starts capturing noise and inaccurate values present
in the dataset, which reduces the efficiency and
accuracy of the model.
The overfitted model has low bias and high variance.
The chance of overfitting increases the longer we
train our model on the same data.
Overfitting is the main problem that occurs in Supervised Learning.
LET US TRY TO UNDERSTAND OVERFITTING WITH AN EXAMPLE:
As we can see from the regression output graph, the model tries to cover all the data
points present in the scatter plot.
It may look efficient, but in reality, it is not so.
The goal of the regression model is to find the best fit line, but here the curve follows every point instead of the
overall trend, so it will generate prediction errors on new data.
HOW TO AVOID OVERFITTING:
Cross Validation
Training with more data
Removing Features
Early Stopping
Regularization
Ensembling
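A sketch combining two of the techniques listed above, cross validation and regularization, assuming scikit-learn and NumPy are installed. The synthetic data and the choice of Ridge regression with `alpha=1.0` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data: only the first two of five features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=60)

# Regularization: Ridge penalizes large coefficients (strength set by alpha).
model = Ridge(alpha=1.0)

# Cross validation: train on 4 folds, score on the 5th, repeated 5 times.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores, scores.mean())
```

Scoring on held-out folds gives a more honest estimate of how the model behaves on unseen data than the training score alone.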
UNDERFITTING:
Underfitting occurs when our machine learning model is not able to
capture the underlying trend of the data.
One technique for avoiding overfitting is early stopping. If training is stopped
too early, however, the model may fail to find the best fit of the dominant trend in the
data.
Here the model is not able to learn enough from the training data, and
hence its accuracy drops and it produces unreliable predictions.
An underfitted model has high bias and low variance.
LET US TRY TO UNDERSTAND UNDERFITTING WITH AN EXAMPLE:
Here, the model is unable to capture the trend of the data points present in the plot.
How to avoid underfitting:
By increasing the training time of the model.
By increasing the number of features.
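A small NumPy sketch that contrasts underfitting and overfitting by fitting polynomials of increasing degree to noisy data. The quadratic target, the noise level, and the degrees 1, 2 and 9 are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    """Hypothetical underlying trend: a simple parabola."""
    return x ** 2

x_train = np.linspace(-1, 1, 15)
x_test = np.linspace(-0.95, 0.95, 15)
y_train = true_fn(x_train) + rng.normal(scale=0.1, size=x_train.size)
y_test = true_fn(x_test) + rng.normal(scale=0.1, size=x_test.size)

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

results = {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares fit
    results[degree] = (
        mse(y_train, np.polyval(coeffs, x_train)),  # training error
        mse(y_test, np.polyval(coeffs, x_test)),    # test error
    )
    print(degree, results[degree])
```

The straight line (degree 1) cannot follow the curve, so its error is high on both sets, which is underfitting. The training error keeps falling as the degree grows, but a high-degree fit typically starts chasing noise, so its test error tends to stop improving or get worse, which is overfitting.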
ERRORS IN MACHINE LEARNING
Reducible errors: These errors can be reduced to improve the model accuracy.
Irreducible errors: These errors will always be present in the model regardless of which
algorithm has been used. They are caused by unknown variables or noise in the data, so they
cannot be reduced.
BIAS
Low Bias: A low bias model will make fewer assumptions
about the form of the target function.
High Bias: A model with a high bias makes more assumptions,
and the model becomes unable to capture the important
features of our dataset. A high bias model also cannot
perform well on new data.
Some examples of machine learning algorithms with low bias
are Decision Trees, k-Nearest Neighbours and Support
Vector Machines.
Algorithms with high bias include Linear
Regression, Linear Discriminant Analysis and Logistic
Regression.
VARIANCE
Low variance means there is a small variation in the prediction of the
target function with changes in the training data set.
High variance shows a large variation in the prediction of the target
function with changes in the training dataset.
A model that shows high variance learns a lot and performs well with
the training dataset, but does not generalize well to unseen
data. It gives good results with the training dataset but high error on the test dataset.
Low variance - Linear Regression, Logistic Regression, and Linear
discriminant analysis.
High variance - decision tree, Support Vector Machine, and K-nearest
neighbours.
DIFFERENT COMBINATIONS OF BIAS-VARIANCE
Low-Bias, Low-Variance: This shows an ideal machine learning model.
However, it is not possible practically.
Low-Bias, High-Variance: With low bias and high variance, model
predictions are inconsistent but accurate on average. This case occurs
when the model learns with a large number of parameters, and it
leads to overfitting.
High-Bias, Low-Variance: With high bias and low variance, predictions
are consistent but inaccurate on average. This case occurs when a
model does not learn well from the training dataset or uses too few
parameters. It leads to underfitting problems in the
model.
High-Bias, High-Variance: Predictions are inconsistent and also
inaccurate on average.
BIAS-VARIANCE TRADE-OFF
While building the machine learning model, it is really important to take
care of bias and variance in order to avoid overfitting and underfitting
in the model.
If the model is very simple with fewer parameters, it may have low
variance and high bias.
If the model has a large number of parameters, it will have high
variance and low bias.
So, it is required to make a balance between bias and variance errors,
and this balance between the bias error and variance error is known as
the Bias-Variance trade-off.
For accurate predictions, a model needs low variance and low bias. In practice this is rarely possible
because bias and variance are related to each other:
If we decrease the variance, the bias tends to increase.
If we decrease the bias, the variance tends to increase.
Bias-Variance trade-off is a central issue in supervised learning.
Hence, the Bias-Variance trade-off is about finding the sweet spot to make a balance between bias and
variance errors.
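The trade-off is commonly written as a decomposition of the expected squared error at a point x. The notation below is assumed for illustration and does not come from the slides: y = f(x) + noise is the true relationship, \hat{f} is the fitted model, the expectation is taken over training sets, and \sigma^2 is the variance of the noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```

The first two terms are the reducible errors; the last term is the irreducible error that no choice of model can remove.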
CONFUSION MATRIX
The confusion matrix is a matrix used to determine the
performance of the classification models for a given
set of test data.
The matrix itself can be easily understood, but the
related terminologies may be confusing.
It shows the errors in the model performance in the
form of a matrix.
Also known as an error matrix.
For a classifier with 2 prediction classes, the matrix is a 2x2 table; for 3 classes, it
is a 3x3 table, and so on.
The matrix has two dimensions, predicted values and actual
values, along with the total number of predictions.
Predicted values are those values, which are predicted by the model, and actual
values are the true values for the given observations.
NEED FOR CONFUSION MATRIX
It evaluates the performance of the classification models, when they make
predictions on test data, and tells how good our classification model is.
It shows not only the errors made by the classifier but also the type of error, that is,
type-I (false positive) or type-II (false negative).
With the help of the confusion matrix, we can calculate the different parameters
for the model, such as accuracy, precision, etc.
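A minimal sketch of building a 2x2 confusion matrix with scikit-learn, assuming it is installed. The actual and predicted labels are made-up values.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes (order: 0, 1).
cm = confusion_matrix(y_true, y_pred)
print(cm)

# For the binary case, the four cells can be unpacked directly.
tn, fp, fn, tp = (int(v) for v in cm.ravel())
print("TN:", tn, "FP (type-I):", fp, "FN (type-II):", fn, "TP:", tp)
```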
PERFORMANCE METRICS FOR CLASSIFICATION
Confusion Matrix
Accuracy
Precision
Recall
F Score
AUC (Area Under the Curve) - ROC
ACCURACY
Accuracy is the ratio of the number of correct predictions to
the total number of predictions.
When to Use Accuracy?
It is good to use the Accuracy metric when the target variable classes in the data are
approximately balanced.
For example, suppose 60% of the images in a fruit dataset are of Apple and 40% are of
Mango. If a model predicts whether an image is of Apple
or Mango with, say, 97% accuracy, that figure is a fair summary of its performance because neither class dominates.
PRECISION
Precision is the ratio of correctly classified positive samples (True Positives) to the
total number of samples classified as positive (either correctly or incorrectly).
Precision indicates how reliable the machine learning model is when it
classifies a sample as positive.
The precision metric is used to overcome the limitation of Accuracy on imbalanced classes.
RECALL
Recall is calculated as the ratio of the number of Positive samples
correctly classified as Positive to the total number of Positive samples.
It is similar to the Precision metric.
It measures the proportion of actual positives that were identified
correctly.
Recall measures the model's ability to detect positive samples.
The higher the recall, the more positive samples are detected.
When to use Precision and Recall?
From the above definitions of Precision and Recall, we can say that recall describes the
performance of a classifier with respect to false negatives, whereas precision describes
its performance with respect to false positives.
So, if we want to minimize false negatives, recall should be as close to 100% as possible, and if
we want to minimize false positives, precision should be as close to 100% as possible.
In simple words, maximizing precision minimizes the FP errors, and
maximizing recall minimizes the FN errors.
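A plain-Python sketch that computes accuracy, precision and recall directly from the four confusion-matrix counts, following the definitions given in these slides. The labels are invented for illustration.

```python
# Hypothetical actual and predicted labels (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(y_true)  # correct predictions / all predictions
precision = tp / (tp + fp)          # how many predicted positives are real
recall = tp / (tp + fn)             # how many real positives were found

print(accuracy, precision, recall)
```

Here the model misses half of the positives (recall 0.5) even though most of its positive predictions are right (precision about 0.67), which accuracy alone (0.7) does not reveal.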
F SCORE
F-score or F1 Score is a metric to evaluate a binary classification model on the basis of
predictions that are made for the positive class.
It is calculated with the help of Precision and Recall. The F1 Score is the
harmonic mean of precision and recall, assigning equal weight to each of them.
When to use F-Score?
As F-score makes use of both precision and recall, it should be used when both of them are
important for evaluation. If one (precision or recall) is more important than
the other, the weighted F-beta score can be used instead.
For example, when false negatives are comparatively more important than false positives, or vice
versa.
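A short sketch of the F1 score and its weighted variant, the F-beta score, using scikit-learn (assumed installed). The labels are made up; `beta=2` is an illustrative choice that weights recall more heavily than precision.

```python
from sklearn.metrics import f1_score, fbeta_score

# Hypothetical actual and predicted labels (1 = positive, 0 = negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# F1: harmonic mean of precision (2/3) and recall (1/2), equal weight.
f1 = f1_score(y_true, y_pred)

# F-beta with beta > 1 leans toward recall; beta < 1 leans toward precision.
f2 = fbeta_score(y_true, y_pred, beta=2)

print(f1, f2)
```

Because recall is the weaker of the two here, the recall-weighted F2 score comes out lower than F1.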
AUC-ROC
It is one of the popular and important metrics for evaluating the performance of a
classification model.
The AUC (Area Under Curve) - ROC (Receiver Operating Characteristic) curve is a graph that shows the
performance of a classification model at different threshold levels.
The curve is plotted between two parameters, which are:
True Positive Rate
False Positive Rate
$$\text{TPR} = \frac{TP}{TP + FN} \qquad \text{FPR} = \frac{FP}{FP + TN}$$
To calculate the value at any point on a ROC curve, we could evaluate a logistic regression model multiple times with different classification
thresholds, but this would not be very efficient. Instead, an efficient method known as AUC is used.
AUC: AREA UNDER THE ROC CURVE
AUC calculates the performance across all the thresholds and provides an aggregate measure.
The value of AUC ranges from 0 to 1.
A model whose predictions are 100% wrong has an AUC of 0.0, whereas a model whose predictions are 100%
correct has an AUC of 1.0.
When to Use AUC?
AUC should be used to measure how well the predictions are
ranked rather than their absolute values.
It measures the quality of predictions of the model without
considering the classification threshold.
When not to use AUC?
AUC is scale-invariant, which is not always desirable: when
we need well-calibrated probability outputs, AUC is not
preferable.
AUC is not a useful metric when there are wide disparities in
the cost of false negatives vs. false positives, and it is important
to minimize one type of classification error.
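A small sketch of the ROC curve and AUC with scikit-learn (assumed installed). The labels and scores are invented; the second AUC call multiplies every score by 10 to illustrate that only the ranking of predictions matters.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical actual labels and predicted scores for the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# One (FPR, TPR) point per threshold; together they form the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, scores)
print(list(zip(fpr, tpr)))

# AUC summarizes the whole curve in a single number between 0 and 1.
auc = roc_auc_score(y_true, scores)

# Rescaling the scores keeps their order, so the AUC is unchanged.
auc_rescaled = roc_auc_score(y_true, [10 * s for s in scores])
print(auc, auc_rescaled)
```

Three of the four positive-negative pairs are ranked correctly here, which is why the AUC is 0.75.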
PERFORMANCE METRICS FOR REGRESSION
The metrics used for regression are different from the classification metrics.
It means we cannot use the Accuracy metric (explained above) to evaluate a regression model; instead, the performance of a
Regression model is reported as errors in the prediction.
Mean Absolute Error
Mean Squared Error
R2 Score
Adjusted R2
MEAN ABSOLUTE ERROR
Mean Absolute Error (MAE) measures the average absolute difference between actual and
predicted values, where absolute means taking a number as positive.
Let's take an example of Linear Regression, where the model draws a best fit line
between dependent and independent variables. To measure the MAE, or error in
prediction, we calculate the difference between each actual value and its
predicted value. To find the absolute error for the complete dataset,
we take the mean of these absolute differences.
$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \lvert Y_i - Y'_i \rvert$$
Y is the Actual value, Y' is the predicted value, and N is the total number of observations.
MAE is more robust to outliers than squared-error metrics.
One of the limitations of MAE is that it is not differentiable at zero error,
which makes it less convenient for optimizers such as
Gradient Descent.
To overcome this limitation, another metric can be
used, which is Mean Squared Error or MSE.
MEAN SQUARED ERROR
Mean Squared Error or MSE is one of the most suitable metrics for Regression
evaluation. It measures the average of the squared differences between the values predicted
by the model and the actual values.
MSE is never negative, and it is zero only when every prediction is exact.
Because the differences are squared, large errors are penalized heavily, so a few outliers can lead to
over-estimation of how bad the model is.
MSE is often preferred over other regression metrics because it is
differentiable and hence easier to optimize.
$$\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (Y_i - Y'_i)^2$$
Y is the Actual value, Y' is the predicted value, and N is the total number of observations.
R2 SCORE
R squared, also known as the Coefficient of Determination, is another
popular metric used for Regression model evaluation.
Determines the goodness of fit.
Strength of the relationship between dependent and independent variables, usually reported on a scale of
0-100%.
The R squared score is always less than or equal to 1, regardless of
whether the values are large or small.
ADJUSTED R2
Adjusted R squared, as the name suggests, is an improved version of R squared.
R squared has a limitation: its score never decreases as terms are added,
even when the model is not improving, which may mislead data scientists.
To overcome this issue, adjusted R squared is used, which is
never higher than R².
It adjusts for the number of predictors and only shows
improvement if there is a real improvement.
$$R_a^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$
n is the number of observations
k denotes the number of independent variables
and R_a^2 denotes the adjusted R2
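A compact sketch of the four regression metrics, assuming scikit-learn is installed. The actual and predicted values are made up, and `k = 1` (a single independent variable) is an assumption needed for the adjusted R2 calculation, which scikit-learn does not provide as a ready-made function.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)  # mean of |Y - Y'|
mse = mean_squared_error(y_true, y_pred)   # mean of (Y - Y')^2
r2 = r2_score(y_true, y_pred)              # coefficient of determination

# Adjusted R2 computed by hand from n observations and k predictors.
n = len(y_true)
k = 1  # assumed number of independent variables
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(mae, mse, r2, adjusted_r2)
```

The single error of size 1 contributes twice as much as the two errors of size 0.5 combined in the MSE, while in the MAE they contribute equally, which shows how squaring emphasizes larger errors.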
QUESTIONS
1) Classifications of ML.
2) Steps involved in Data Preprocessing
3) Underfitting and Overfitting (When does it occur, Graph and how to
avoid?)
4) Performance metrics for Classification
5) Performance metrics for Regression
THANK YOU