
Machine Learning

Chapter 1: Machine Learning 1 - Supervised Learning

Minor MLDS Course

Dr. Nagaraju K, Asst Prof, Dept of CSE


Chapter Description
Chapter objectives

✓ Be able to introduce machine learning-based data analysis according to the business objective, strategy, and policy, and manage the overall process.
✓ Be able to select and apply a machine learning algorithm that is the most suitable to the given problem and perform
hyperparameter tuning.
✓ Be able to design, maintain, and optimize a machine learning workflow for AI modeling by using structured and
unstructured data.

Chapter contents

✓ Topic 1. Machine Learning Based Data Analysis


✓ Topic 2. Application of Supervised Learning Model for Numerical Prediction
✓ Topic 3. Application of Supervised Learning Model for Classification
✓ Topic 4. Decision Tree
✓ Topic 5. Naïve Bayes Algorithm
✓ Topic 6. KNN Algorithm
✓ Topic 7. SVM Algorithm
✓ Topic 8. Ensemble Algorithms



Topic 1.

Machine Learning Based Data Analysis


1.1. What is machine learning?
1.2. Python scikit-learn library for machine learning
1.3. Preparation and division of data set
1.4. Data pre-processing for making a good training data set
1.5. Practicing to find an optimal method to solve problems with scikit-learn



1.1. What is machine learning? UNIT 01

What is machine learning?

‣ A statistical model that learns from data.


‣ A rather simple model can make complex predictions.


Samuel’s definition in the early phase of artificial intelligence


‣ “Programming Computers to learn from experience should eventually eliminate the need for much of this detailed
programming effort.” - Samuel, 1959

Modern definition
‣ “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if
its performance at tasks in T, as measured by P, improves with experience E.” – Mitchell, 1997 (p.2)
‣ “Programming computers to optimize a performance criterion using example data or past experience.”
–Alpaydin, 2010
‣ “Computational methods using experience to improve performance or to make accurate predictions.” – Mohri, 2012


Mathematical definition
‣ Suppose that the x-axis is the invested advertising expense (feature) while the y-axis is sales (target).
‣ Question about prediction – What are the sales when an arbitrary advertising expense is given?
• Linear regression, with w and b as parameters:
𝑦 = 𝑤𝑥 + 𝑏
• 'w' is commonly used as an abbreviation of 'weight.'

[Figure: scatter plot of advertising expense (x) vs. sales (y) with three candidate lines f1, f2, f3]

‣ Since the optimal values are unknown in the beginning, start with arbitrary values and then reach the optimal values by gradually improving the performance.
• In the figure, the fit starts from f1 and continues as f1 → f2 → f3.
• The optimal fit is f3, where w = 0.5 and b = 2.0.
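A minimal sketch of this setup with scikit-learn's LinearRegression (the expense/sales numbers below are made up to mimic the figure, not taken from the slide):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[2], [4], [6], [8], [10]])  # advertising expenses (feature)
y = np.array([3.0, 4.1, 4.9, 6.2, 7.0])   # sales (target), hypothetical values
model = LinearRegression()
model.fit(X, y)                           # searches for the optimal w and b
print(model.coef_[0], model.intercept_)   # w ≈ 0.5, b ≈ 2.0, as in the figure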


Statistics, data mining, and machine learning from a data analysis perspective

[Figure: Venn diagram of the connections among machine learning and related fields – artificial intelligence, pattern recognition, statistics, deep learning, data mining, data science, databases, and computational neuroscience]


Types of machine learning according to methods of supervision

Machine Learning
• Supervised – the target pattern is given.
• Unsupervised – the target pattern must be found.
• Reinforcement – a policy is optimized.


Machine learning workflow

1. Problem definition – understand the business and define the problem.
2. Data preparation – collect the raw data, pre-process it, and perform feature engineering.
3. Modeling and optimization – split the data (train / validate / test) and train the model on the data.
4. Model performance evaluation – measure the performance metrics.
5. Enhance the model performance and apply it to real life.


Machine learning types:

Type | Algorithm/Method
Unsupervised learning | Clustering; MDS, t-SNE; PCA, NMF; association analysis
Supervised learning | Linear regression; logistic regression; tree, random forest, AdaBoost, XGBoost; naïve Bayes; KNN; support vector machine (SVM); neural network


Parameters vs. Hyperparameters


Parameters

‣ Learned from data by training and not manually set by the practitioner.
‣ Contain the data pattern.

Ex Coefficients of linear regression.


Ex Weights of neural network.

Hyperparameters
‣ Can be set manually by the practitioner.
‣ Can be tuned to optimize the machine learning performance.

Ex k in KNN algorithm.
Ex Learning rate in neural network.
Ex Maximum depth in Tree algorithm.
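A small sketch of the distinction (toy data, not from the slides): the k of KNN is fixed by hand before training, while the coefficients of a linear regression come out of fit.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)  # hyperparameter: k is set by the practitioner

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.1, 2.1, 2.9])
reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)           # parameters: learned from the data, contain its pattern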



Unit 1.

Machine Learning Based Data Analysis


1.1. What is machine learning?
1.2. Python scikit-learn library for machine learning
1.3. Preparation and division of data set
1.4. Data pre-processing for making a good training data set
1.5. Practicing to find an optimal method to solve problems with scikit-learn



1.2. Python scikit-learn library for machine learning UNIT 01

Features of the scikit-learn library


Features

‣ Integrated library interface by applying the façade design pattern


‣ Installed with various kinds of machine learning algorithms, model selection and data pre-processing functions
‣ Simple and efficient tools for predictive data analysis
‣ Based on NumPy, SciPy and matplotlib
‣ Easily accessible and can be reused in many different situations
‣ Highly compatible with different libraries
‣ Does not support GPU
‣ Can be used as an open source and for commercial purposes


Mechanism of scikit-learn
Scikit-learn is characterized by its intuitive and easy interface, built around a high-level API:

Instance → fit → predict / transform


Estimator, Classifier, Regressor

‣ An estimator refers to an object that can fit a model and infer certain features of new data based on the training data.
‣ A classifier refers to a class that implements a classification algorithm, while a regressor refers to a class that implements a regression algorithm.

Estimator – training: .fit / prediction: .predict

Classifier                 | Regressor
DecisionTreeClassifier     | LinearRegression
KNeighborsClassifier       | KNeighborsRegressor
GradientBoostingClassifier | GradientBoostingRegressor
GaussianNB …               | Ridge …


About the Scikit-Learn library

‣ It is the representative Python machine learning library.

‣ To import a machine learning algorithm as a class:
from sklearn.<family> import <machine learning algorithm>
Ex from sklearn.linear_model import LinearRegression
‣ Hyperparameters are specified when the machine learning object is instantiated:
Ex myModel = KNeighborsClassifier(n_neighbors=10) # KNN with k = 10



‣ To train a supervised learning model: myModel.fit(X_train, Y_train)
‣ To train an unsupervised learning model: myModel.fit(X_train)
‣ To predict using an already trained model: myModel.predict(X_test)
‣ To import a preprocessor as class: from sklearn.preprocessing import <a preprocessor>
‣ To split the dataset into a training set and a testing set:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=123)
‣ To calculate a performance metric (accuracy): metrics.accuracy_score(Y_test, Y_pred)
‣ To cross validate and do hyperparameter tuning at the same time:
myGridCV = GridSearchCV(estimator, parameter_grid, cv=k)
myGridCV.fit(X_train, Y_train)
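A minimal end-to-end sketch that strings these calls together (scikit-learn's bundled breast-cancer data; the parameter grid is an illustrative choice, not prescribed by the slides):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

X, Y = load_breast_cancer(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=123)

myModel = KNeighborsClassifier(n_neighbors=10)  # hyperparameter set at instantiation
myModel.fit(X_train, Y_train)                   # supervised training
Y_pred = myModel.predict(X_test)                # prediction with the trained model
print(metrics.accuracy_score(Y_test, Y_pred))   # performance metric

myGridCV = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 10]}, cv=5)
myGridCV.fit(X_train, Y_train)                  # cross-validation + hyperparameter tuning together
print(myGridCV.best_params_)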


Practicing scikit-learn
‣ The sklearn.datasets module includes utilities to load datasets, including methods to load and fetch popular reference
datasets. It also features some artificial data generators.

‣ Import data with the load_breast_cancer().

‣ Container object exposing keys as attributes.


Bunch objects are sometimes used as an output for functions and methods.
They extend dictionaries by enabling values to be accessed by key, bunch["value_key"], or by an attribute, bunch.value_key.

‣ The loaded object's data array becomes x (the independent variables), and its target array becomes y (the dependent variable, the actual values).
‣ help(train_test_split) shows that the default value of test_size is 0.25, i.e., a 75:25 split; splitting 7:3 or 8:2 is also common.
‣ Use train_test_split() to split the data for building and evaluating the model: of the 569 observations in total, 426 (75%) go to the training set and 143 (25%) to the test set.


Practicing scikit-learn
‣ When instantiating, pass the model's hyperparameters as arguments. A hyperparameter is an option that requires human setting and strongly affects the model performance.
• After loading the data set, instantiate the estimator with its hyperparameters – for example, initializing a decision tree model that uses entropy as the branching criterion.
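A sketch of this step (the slides' screenshots are not reproduced; the breast-cancer data from the earlier practice is assumed):

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)                          # load the practice data set
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)  # hyperparameters at instantiation:
tree.fit(X, y)                                                      # entropy used for branching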


fit
‣ Use the fit method of the instantiated estimator for training. For a supervised learning algorithm, pass the training data and the label data together as arguments.

predict
‣ The instantiated estimator that has completed training with fit can then use the predict method. predict returns the model's estimates for the input data.
• These are estimated values, so the actual values for X_test may differ. Measure the accuracy by comparing the two.


‣ The comparison data frame shows the rows where the predicted value and the actual value differ.


‣ Accuracy: 133 correct predictions out of 143 test samples ≈ 93%.


‣ It showed 93% accuracy, which is already quite a good result. In practice, a step to increase the accuracy is usually required during data pre-processing, and standardization is one of the options. The following is a brief summary of standardization.
• Standardization rescales the data to the standard normal distribution. Another term for standardization is z-transformation, and the standardized value is also referred to as the z-score. About 94% accuracy would be obtained from KNN wine classification through standardization.
• Standardization is widely used in data pre-processing in general, not only for KNN, and the equation is:

$z = \dfrac{x - \mu}{\sigma}$  (μ: mean, σ: standard deviation)

• Standardization is available as the StandardScaler class in scikit-learn.
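A brief sketch of standardization with StandardScaler (the wine data is assumed, matching the KNN wine-classification example mentioned above):

import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)  # z = (x - mean) / std, column by column
print(np.round(X_std.mean(axis=0), 2))     # ≈ 0 for every feature
print(np.round(X_std.std(axis=0), 2))      # ≈ 1 for every feature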


• Inspecting the data frame before standardization, the differences among column values are huge.
• After standardization, the column values do not deviate far from 0, and better performance can be expected than before standardization.


transform
‣ Feature processing is done with 'transform', which returns the converted result. The practice proceeds in three steps: inspect the output before pre-processing, apply the scaling, and check the result after pre-processing.
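A sketch of the fit/transform mechanics (toy numbers standing in for the slides' screenshots; fitting on the training data only is the key point):

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[2.5, 500.0]])

scaler = StandardScaler()
scaler.fit(X_train)               # learn per-column mean and std from the training data
print(scaler.transform(X_train))  # convert the training data with the learned statistics
print(scaler.transform(X_test))   # reuse the same statistics on the test data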


fit_transform
‣ fit and transform are combined as fit_transform: one call both learns the parameters and converts the data, so the before-and-after comparison can be done in a single step.
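Continuing the toy sketch above, the two calls collapse into one (same result as fit followed by transform):

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_train_std = StandardScaler().fit_transform(X_train)  # fit + transform in a single call
print(X_train_std)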


Major scikit-learn modules

Classification | Module | Embedded functions
Data example | sklearn.datasets | Data sets for practicing
Feature processing | sklearn.preprocessing | Pre-processing techniques (one-hot encoding, normalization, scaling, etc.)
Feature processing | sklearn.feature_selection | Techniques to search for and select features that have a significant impact on the model
Feature processing | sklearn.feature_extraction | Feature extraction from source data; the supporting API for image feature extraction is in the submodule image, while the supporting API for text feature extraction is in the submodule text
Dimension reduction | sklearn.decomposition | Algorithms related to dimension reduction (PCA, NMF, Truncated SVD, etc.)
Validation, hyperparameter tuning, data separation | sklearn.model_selection | Validation, hyperparameter tuning, data splitting, etc. (cross_validate, GridSearchCV, train_test_split, learning_curve, etc.)
Model evaluation | sklearn.metrics | Techniques to measure and evaluate model performance (accuracy, precision, recall, ROC curve, etc.)

Machine learning algorithm | sklearn.ensemble | Ensemble algorithms (random forest, AdaBoost, bagging, etc.)
Machine learning algorithm | sklearn.linear_model | Linear algorithms (linear regression, logistic regression, SGD, etc.)
Machine learning algorithm | sklearn.naive_bayes | Naive Bayes algorithms (Bernoulli NB, Gaussian NB, multinomial NB, etc.)
Machine learning algorithm | sklearn.neighbors | Nearest neighbor algorithms (k-NN, etc.)
Machine learning algorithm | sklearn.svm | Support vector machine algorithms
Machine learning algorithm | sklearn.tree | Decision tree algorithms
Machine learning algorithm | sklearn.cluster | Unsupervised learning (clustering) algorithms (KMeans, DBSCAN, etc.)
Utility | sklearn.pipeline | Serial composition of feature processing and machine learning algorithms, etc.



Unit 1.

Machine Learning Based Data Analysis


1.1. What is machine learning?
1.2. Python scikit-learn library for machine learning
1.3. Preparation and division of data set
1.4. Data pre-processing for making a good training data set
1.5. Practicing to find an optimal method to solve problems with scikit-learn



1.3. Preparation and division of data set UNIT 01

Preparation and division of data set


Chapter objectives
‣ Be able to understand the meaning and ripple effect of overfitting and generalization and design data set division to solve
issues.
‣ Be able to properly divide training data set and test data set for machine learning technique application according to the
analysis purpose and data set features.
‣ Be able to divide training and validation data sets and decide an appropriate k-value for cross validation by judging the necessity of cross validation according to the problem and the applied technique. Be able to divide the data set and perform sampling by considering the prediction results based on the data features and the class variable distribution.
‣ Be able to analyze differences of various sampling methods for data set division and apply appropriate sampling methods.


Necessity of data set division


‣ When analyzing machine learning-based data, especially when applying a supervised learning-based model, do not analyze
the overall data set but analyze by dividing the training and evaluation (test) data sets.

Overall data set → Training data set (perform k-fold cross validation if necessary) → Model
Overall data set → Test data set → Performance evaluation → Final model

[Figure: machine learning modeling process through division of training and test data sets]


Overfitting and generalization of modeling

‣ Strictly speaking, the data included in the provided training data set can be considered values obtained by chance, so a new data set obtained to predict the values of new objective variables (or response variables) is not the same as the existing training data set.
‣ Thus, the chance that the patterns of the training data and new data perfectly agree is extremely low. So, when fitting a machine learning model, overfitting occurs when the model reflects too much of the training data set pattern, while generalization for accurate prediction of new data underperforms.
‣ To prevent such issues, the data set is generally divided into a training data set and a test data set. Measuring how accurately the machine learning model learned on the training data set predicts the objective variables (or response variables) of the test data set becomes the standard for model performance evaluation.

[Figure: polynomial fits of order 1, 2, 3, 4, and 12 – overfitting and underfitting]


Overfitting and generalization of modeling

[Figure: polynomial fits of order 1, 2, 3, 4, and 12 – overfitting and underfitting]

‣ Even if machine learning finds the optimal solution for the data distribution, a wide margin of error can remain because the model has a small capacity. This phenomenon is referred to as underfitting; the linear-equation model in the leftmost figure above is an example.
‣ An easy alternative is to use higher-degree polynomials, which are non-linear equations.
‣ The rightmost figure above is fitted with a 12th-order polynomial.
‣ The model capacity is larger, and there are 13 parameters to estimate:

$y = w_{12}x^{12} + w_{11}x^{11} + w_{10}x^{10} + \cdots + w_1 x + w_0$

Overfitting
‣ When choosing a 12th-order polynomial curve, it approximates the training set almost perfectly.
‣ However, an issue occurs when predicting new data.
• The region around the red bar at x₀ should be predicted, but the red dot is predicted instead.
‣ The reason is the large capacity of the model.
• Accepting the noise during the learning process → overfitting
‣ Model selection is required to select a model of adequate size.

[Figure: inaccurate prediction at x₀ by the 12th-order polynomial – overfitting]


Overfitting and generalization of modeling

‣ The flexibility of a machine learning technique increases with the order of the polynomial – in other words, the possibility that the model accurately matches the given data patterns rises.
‣ The root mean squared error (RMSE) of the training data set shows a monotone decreasing trend, while the RMSE of the test data set declines in the beginning as the order of the polynomial rises but increases again after a certain point.
‣ Summing up, the figure shows an overfitting trend that reflects the training data set pattern too much beyond the 4th-order polynomial. If there were no RMSE computed on test data, the training RMSE would keep declining as the polynomial order rises, leading to the selection of an overfitted model.
‣ Thus, test data is required.

[Figure: RMSE vs. flexibility (degree of the polynomial expression) for the training and test sets – comparison of the difference between the root mean squared errors of training and test data]


Method and process of data set division


Cross-Validation:

‣ The data should be split into a training set and a testing set.
‣ In principle, the testing set should be used only once! It should not be reused!
‣ If the training set is used also for evaluation, the errors can be unrealistically small.
‣ We would like to evaluate realistic errors while training by splitting the training data into two.

[Diagram: Training data → train + cross-validate; Testing data → evaluate]

Cross-Validation and Hyperparameter optimization:

‣ As we can repeatedly evaluate errors while training, it is also possible to tune the hyperparameters.


Cross-Validation:
1) Split the data into a training set and a testing set.
2) Further subdivide the training set into a smaller training and a validation set.
3) Train the model with the smaller training set.
4) Evaluate the errors with the validation set.
5) Repeat a few times from the step 2).
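A sketch of this procedure with cross_val_score, which automates steps 2)–5) (the iris data is an assumed stand-in):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

# The training set is repeatedly re-split; each fold serves once as the validation set.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_train, y_train, cv=5)
print(scores, scores.mean())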


Cross-Validation method: k-Fold

[Figure: the training data subdivided into k folds; each fold serves once as the validation set while the remaining folds are used for training]

‣ Subdivide the training dataset into 𝑘 equal parts. Then, apply sequentially.


Cross-Validation method: Leave One Out (LOO)

[Figure: each single observation serves once as the validation set while all the others are used for training]

‣ Leave only one observation for validation. Apply sequentially. More time-consuming.


Cross-Validation method: k-fold cross validation

‣ n = k; n = 10 most of the time.

[Figure: 10 rounds of repeated measurement; in each round a different fold is the validation set and the remaining folds form the training set. Each round reports an accuracy (e.g., 93%, 90%, 91%, …, 95%), and the final average accuracy is computed over rounds 1–10.]



Unit 1.

Machine Learning Based Data Analysis


1.1. What is machine learning?
1.2. Python scikit-learn library for machine learning
1.3. Preparation and division of data set
1.4. Data pre-processing for making a good training data set
1.5. Practicing to find an optimal method to solve problems with scikit-learn



1.4. Data pre-processing for making a good training data set UNIT 01

Missing value processing


Data cleansing for machine learning-based data analysis uses missing value and noise processing to eliminate discrepancies in the collected data.
‣ Missing value processing is done as follows.
‣ First, import the iris data for a quick examination.


Missing value processing


1) Ignore the record (row)
• In data classification, ignore the record if the class label is not distinguished.

Ex In the case of the iris data, ignore the fourth row as shown in the table below.

   x1           x2          x3           x4          y
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1  5.1          3.5         1.4          0.2         setosa
2  4.9          3.0         1.4          0.2         setosa
3  4.7          3.2         1.3          0.2         setosa
4  4.6          3.1         1.5          0.2
5  5.0          3.6         1.4          0.2         setosa
6  5.4          3.9         1.7          0.4         setosa

• 'Ignore the record' is extremely inefficient if missing values frequently occur.



2) Insert the missing value
• Enter a placeholder value like 'unknown' for the missing value, or enter a statistic of the data such as the overall average, the median, or the average of the records belonging to the same class.

x1 x2 x3 x4 y

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa


2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 unknown
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa



2) Insert the missing value
• The average value of Sepal.Length for iris is 5.843 as provided earlier, so insert 5.843.

x1 x2 x3 x4 y

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa


2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 3.1 1.5 0.2 setosa

5 5 3.6 1.4 0.2 setosa


x1 x2 x3 x4 y
6 5.4 3.9 1.7 0.4 setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa


2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 5.843 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
Dr. Nagaraju K, 6Asst Prof, Dept of 64
5.4CSE 3.9 1.7 0.4 setosa
1.4. Data pre-processing for making a good training data set UNIT 01

Missing value processing


3) Manual entry
• A person in charge (or an expert) should check the data and modify it into an appropriate value.
• It requires a lot of time but provides high reliability.



‣ Identify the missing values in the table-type data below and process them appropriately.
• In Python, a missing value is specified as np.nan or a null value.
• 'nan' is an abbreviation of 'Not a Number.'


‣ The omitted values are changed to NaN. Here this is not problematic due to the small amount of data, but it is extremely inconvenient to find missing values manually in a huge data frame.



‣ isnull() returns a boolean data frame showing whether each cell holds a numerical value (False) or is missing (True). Then, sum() is used to obtain the number of missing values per column.
‣ It is mandatory to check the number of missing values when importing data.
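A sketch of the check (a hypothetical toy data frame, since the slide's frame is not reproduced):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, np.nan],
                   "B": [4.0, np.nan, np.nan],
                   "C": [7.0, 8.0, 9.0]})
print(df.isnull())        # boolean frame: True where a value is missing
print(df.isnull().sum())  # number of missing values per column: A=1, B=2, C=0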


Removing the training sample or feature with missing value

‣ Completely delete a certain training sample (row) or feature (column) with dropna(). help(df.dropna) shows that axis=0 is the default, so rows with NaN values are deleted; use axis=1 to delete columns instead.
‣ To reflect the deletion immediately in the object itself, do not omit the inplace=True option.
‣ If all the values in a row are NaN, use how='all' to delete only such rows.
‣ Use thresh to keep only the rows that contain at least a given number of non-NaN values.
‣ To delete rows with NaN in a certain column only, use subset.
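A compact sketch of the dropna options above (hypothetical toy data frame):

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, np.nan, np.nan],
                   "B": [4.0, np.nan, np.nan, 8.0],
                   "C": [7.0, 8.0, np.nan, 10.0]})

print(df.dropna())              # default axis=0: drop every row containing NaN
print(df.dropna(axis=1))        # drop every column containing NaN
print(df.dropna(how="all"))     # drop only rows whose values are all NaN
print(df.dropna(thresh=2))      # keep rows with at least 2 non-NaN values
print(df.dropna(subset=["A"]))  # drop rows with NaN in column "A" only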


Imputation
‣ It is sometimes hard to delete a training sample or a certain column, because doing so loses too much useful data. In that case, estimate the missing values from the other training samples in the data set by interpolation. The most commonly used method is to impute with the average value, which changes the missing value into the overall average of the column. In scikit-learn, use the SimpleImputer class.

‣ Impute using df.values.
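A sketch of mean imputation with SimpleImputer (toy frame; the column means fill the gaps):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"A": [1.0, 2.0, np.nan], "B": [4.0, np.nan, 6.0]})
imputer = SimpleImputer(strategy="mean")    # 'median' and 'most_frequent' also work
imputed = imputer.fit_transform(df.values)  # each NaN becomes its column's average
print(imputed)                              # A: NaN -> 1.5, B: NaN -> 5.0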


• Check that the imputed value equals the average of the column.
‣ For the strategy parameter, median and most_frequent can also be set.


Review on the scikit-learn estimator API

‣ In the previous section, the missing values of the data set were imputed using the SimpleImputer class of scikit-learn.
‣ The SimpleImputer class is a transformer class of scikit-learn that is used for data conversion.
‣ The two main methods of such estimators are fit and transform.
‣ Use the fit method to learn the model parameters from the training data.
‣ Use the transform method to convert the data using the learned parameters.
‣ The data array to be converted must have the same number of features as the data used to fit the model.


Categorical data processing

‣ Data is generally classified into categorical scales and continuous scales depending on its features.
‣ In the liberal arts and social sciences, questionnaires are mainly used to collect data.
‣ Categorical scale
• A scale that distinguishes data into different categories; it is classified into nominal scale and ordinal scale.
‣ Continuous scale
• A scale for continuous data, divided according to the purpose of the survey; it is classified into interval scale and ratio scale.
‣ Actual data sets typically include more than one categorical feature. As explained earlier, categorical data divides into ordered and unordered features. An ordinal scale can be described as an ordered categorical scale whose values can be arranged in sequence.


‣ The data in the table has both ordered and unordered features: size is ordered, but color is not. Thus, size is classified as an ordinal scale while color is a nominal scale.



‣ Convert ordered data into numerical values. The reason for changing text data into numerical data is to allow a computer to perform arithmetic operations on it.
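A sketch of such an ordinal conversion via an explicit mapping (the clothing columns are assumed for illustration):

import pandas as pd

df = pd.DataFrame({"color": ["green", "red", "blue"],
                   "size": ["M", "L", "XL"],
                   "price": [10.1, 13.5, 15.3]})
size_mapping = {"M": 1, "L": 2, "XL": 3}  # the order M < L < XL is meaningful
df["size"] = df["size"].map(size_mapping)
print(df)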


Class label encoding


‣ The class refers to the y value, which is a column with an actual value.
‣ Create a mapping to convert the class label from strings to integers.



‣ 'enumerate' creates an object with an index.



‣ Using 'enumerate', change the class label from strings to integers.
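A sketch of the manual mapping with enumerate (hypothetical label array):

import numpy as np

y = np.array(["setosa", "versicolor", "virginica", "setosa"])
class_mapping = {label: idx for idx, label in enumerate(np.unique(y))}
print(class_mapping)  # {'setosa': 0, 'versicolor': 1, 'virginica': 2}
y_encoded = np.array([class_mapping[v] for v in y])
print(y_encoded)      # [0 1 2 0]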



‣ Since the method in the previous slide is rather inconvenient, scikit-learn supports LabelEncoder for easy conversion.
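The same conversion, sketched with LabelEncoder:

import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.array(["setosa", "versicolor", "virginica", "setosa"])
le = LabelEncoder()
y_encoded = le.fit_transform(y)         # strings -> integers in one call
print(y_encoded)                        # [0 1 2 0]
print(le.inverse_transform(y_encoded))  # back to the original string labels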


Application of one-hot encoding to unordered features

There are cases when categorical data cannot be used directly in a machine learning algorithm such as regression analysis, and conversion is then required so that a computer can recognize it.
‣ In such cases, use dummy variables expressed as 0 or 1. The 0 or 1 does not represent magnitude but shows whether a certain feature value is present or not.
‣ If a certain feature value is present, it is expressed as 1; if it is not found, it is 0. Likewise, one-hot encoding is the conversion of categorical data into a one-hot vector of 0s and 1s that a computer can recognize.
‣ Practice with the iris.target object.


The encoding is done with integers, so insert the iris 'species' values.


Use the get_dummies() function of pandas to convert every unique value of a categorical variable into a new dummy variable.
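A one-line sketch of get_dummies (hypothetical species column):

import pandas as pd

df = pd.DataFrame({"species": ["setosa", "versicolor", "virginica", "setosa"]})
print(pd.get_dummies(df["species"]))  # one 0/1 dummy column per unique value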


Use the sklearn library to conveniently process one-hot encoding. The result is given as a sparse matrix in the linear-algebra sense: in a sparse matrix, most of the entries are 0. The opposite concept is a dense matrix.

[Figure: example of a sparse matrix – only 9 of its 35 entries are non-zero]


OneHotEncoder
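A sketch of OneHotEncoder on the iris target, whose output is a sparse matrix by default:

from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder

iris = load_iris()
species = iris.target.reshape(-1, 1)  # integer-encoded classes, shape (150, 1)
sparse = OneHotEncoder().fit_transform(species)
print(sparse[:3])                     # stored as (row, column) -> 1.0 entries
print(sparse.toarray()[:3])           # dense 0/1 view of the first rows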


Conversion to the sparse matrix
• Entry (0, 0) is 1, thus setosa (the setosa entries continue up to row 50).


Refer to the table below for easier understanding.

Row index | Species    | setosa | versicolor | virginica | Sparse matrix expression
0         | setosa     | 1      | 0          | 0         | (0, 0)
1         | setosa     | 1      | 0          | 0         | (1, 0)
…         | setosa     | 1      | 0          | 0         |
49        | setosa     | 1      | 0          | 0         | (49, 0)
50        | versicolor | 0      | 1          | 0         | (50, 1)
51        | versicolor | 0      | 1          | 0         | (51, 1)
…         | versicolor | 0      | 1          | 0         |
100       | versicolor | 0      | 1          | 0         |
101       | virginica  | 0      | 0          | 1         | (101, 2)
102       | virginica  | 0      | 0          | 1         | (102, 2)
…         | virginica  | 0      | 0          | 1         |
150       | virginica  | 0      | 0          | 1         | (150, 2)



Using hold-out in real life to split the data set into a training set and a test set
‣ df_wine holds measurements of wines produced in Vinho Verde, a region adjacent to the Atlantic Ocean in the northwest of Portugal. The grade, taste, and acidity of 1,599 red wine samples and 4,898 white wine samples were measured and analyzed to create the data. If the data is not found at the following route, it is possible to download it directly from the UCI repository and import it from the local path.


• When the wine data set of the UCI machine learning repository is not accessible, uncomment the following code and read the data set from the local path:
• df_wine = pd.read_csv('wine.data', header=None)

‣ Data splitting is possible with the train_test_split function provided in the model_selection module of scikit-learn. First, convert the features from index 1 to 13 to a NumPy array and assign it to variable X. The train_test_split function returns four arrays as a tuple, so assign them by designating appropriate variables.

‣ Randomly split X and y into training and test data sets. With test_size=0.3, 30% of the samples are assigned to X_test and y_test.
‣ Regarding the stratify parameter, if the class label array y is passed, the class ratio in the training data set and the test data set is kept identical to that of the original data set.
‣ The most widely used ratios in real life are 6:4, 7:3, or 8:2, depending on the size of the data set. For a large data set, it is common and suitable to split the training and test data sets at a ratio of 9:1 or 9.9:0.1.
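A sketch of the stratified hold-out split described above, reading wine.data from the usual UCI route (fall back to a local copy if the URL is unreachable):

import pandas as pd
from sklearn.model_selection import train_test_split

url = ("https://archive.ics.uci.edu/ml/"
       "machine-learning-databases/wine/wine.data")
df_wine = pd.read_csv(url, header=None)

X = df_wine.iloc[:, 1:].values  # features in columns 1..13 as a NumPy array
y = df_wine.iloc[:, 0].values   # class label in column 0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)  # class ratios preserved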


Arranging the scale between features (variables)

‣ Refer to the practical code (Chapter5_Unit1_Machine Learning-Based Data Analysis 1 (Supervised Learning)) for the detailed code.

# MaxAbsScaler divides the data by the maximum absolute value of each feature, so the maximum value of each feature becomes 1 and the overall feature range changes to [-1, 1].
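A sketch comparing the common scalers on toy values (MaxAbsScaler behaves as described in the comment above):

import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler

X = np.array([[1.0, -200.0], [2.0, 400.0], [4.0, 800.0]])
print(MinMaxScaler().fit_transform(X))    # each feature rescaled to [0, 1]
print(StandardScaler().fit_transform(X))  # each feature to mean 0, std 1
print(MaxAbsScaler().fit_transform(X))    # each feature divided by its max |value| -> [-1, 1]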


Limiting model complexity through L1 and L2 regularization

Bias and Variance
‣ If the predicted values deviate greatly from the actual target value in general, the result is said to have high bias. When the predicted values are scattered far away from one another, the result is said to have high variance.
‣ The following figure expresses the predicted results of a model on a target. High bias is when the predictions deviate significantly from the center of the target; if the predicted values gather around the center of the target, bias is low. Bias therefore refers to the overall similarity between the predicted values and the actual value. High variance is when the predicted values are far apart from one another; on the contrary, low variance is when the predicted values are closely gathered together. Thus, variance refers to the overall similarity among the predicted values.

[Figure: dartboard diagram of the four combinations of low/high bias and low/high variance]


Trade-off
‣ Bias and variance have a trade-off relationship: when one of them increases, the other falls, and vice versa. In the beginning of learning, the model becomes more complex and the overall error cost falls due to decreased bias. However, at some point, as the model keeps learning and becomes much more complicated, variance rises and the overall error cost increases again. In other words, the model gets overfitted to the training data. One way to prevent overfitting is therefore to stop learning at the appropriate time. Regularization is a method to prevent overfitting by lowering variance, but it can increase bias instead due to the trade-off relationship.

[Figure: total error = bias² + variance vs. model complexity; the optimum model complexity lies at the minimum of the total error]


Ridge Regression
‣ The ridge regression model is a technique to limit the L2 norm of w, the regression coefficient vector. A constraint is added to the cost function of linear regression that minimizes the sum of the squares of the weights. If the linear regression model is

$\hat{y} = wX$ (with $w$ the regression coefficient vector),

‣ then the cost function of the ridge regression model is as follows, where N is the number of data points and M is the number of elements of the regression coefficient vector. A constraint is added to the existing SSE (Sum of Squared Errors):

$\hat{w}^{\,ridge} = \operatorname{argmin}_w \sum_{i=1}^{N} (y_i - wX_i)^2 + \lambda \sum_{j=1}^{M} w_j^2$

‣ λ is a hyperparameter that adjusts the weight between the existing SSE and the added constraint. When λ is large, regularization is strongly applied and the regression coefficients become smaller. When λ becomes smaller, regularization gets weaker, and when λ equals 0, the constraint term also becomes 0, which is the same as the general linear regression model.


Ridge Regression
‣ The following is an example of simple linear regression model equation.
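A sketch of ridge regression in scikit-learn (toy data; the alpha argument plays the role of λ):

import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.1, 1.9, 3.2, 3.8])
ridge = Ridge(alpha=1.0)  # larger alpha -> stronger regularization, smaller coefficients
ridge.fit(X, y)
print(ridge.coef_, ridge.intercept_)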


Ridge Regression
‣ When drawing the cost function SSE(w1, w2) on a coordinate plane with w1 on the x-axis and w2 on the y-axis, an ellipse is created as in the following figure.

[Figure: elliptical cost contours ("minimize cost"), a circular L2 constraint region $\lambda \lVert W \rVert_2^2$ ("minimize penalty"), and their meeting point ("minimize cost + penalty") in the (w1, w2) plane]

‣ In the figure, the ellipse drawn in a solid line is the cost function: the combinations of w1 and w2 with the same cost (SSE). The central point of the ellipse is where the cost becomes 0. Moving outward from the ellipse gives combinations of w1 and w2 with higher cost, in other words models (consisting of the weights w1 and w2) with higher error. The colored circle is the constraint region; the circle becomes smaller when λ gets larger, and vice versa. The point where the cost function (ellipse) and the constraint (colored circle) meet is the optimal solution, where the cost of the ridge regression model is minimal.


Lasso Regression
‣ The Lasso (Least Absolute Shrinkage and Selection Operator) regression model is a technique to limit the L1 norm of the regression coefficient vector w. A constraint is added to the cost function of linear regression that minimizes the sum of the absolute values of the weights. The cost function of the Lasso regression model is:

$\hat{w}^{\,lasso} = \operatorname{argmin}_w \sum_{i=1}^{N} (y_i - wX_i)^2 + \lambda \sum_{j=1}^{M} \lvert w_j \rvert$


Lasso Regression
‣ When drawing the cost function of the Lasso regression model on the (w1, w2) coordinate plane, a rhombus-shaped constraint region is created as in the following figure.

[Figure: elliptical cost contours and the rhombus-shaped L1 constraint region $\lambda \lVert W \rVert_1$ in the (w1, w2) plane – "minimize cost + penalty"]

‣ Because the constraint region of the Lasso regression model is a rhombus, it is highly likely that the point meeting the cost function is a vertex of the rhombus. The vertexes of the rhombus are always points where w1 or w2 is 0. So, the Lasso regression model can produce weights that are exactly 0.


Elastic-net regression
‣ The Elastic-net regression model applies both the L2 norm and the L1 norm to the regression coefficient vector. The constraint is both the sum of the squared weights and the sum of the absolute weight values. The cost function of Elastic-net is as follows; there are two hyperparameters, λ1 and λ2:

$\hat{w}^{\,elastic} = \operatorname{argmin}_w \sum_{i=1}^{N} (y_i - wX_i)^2 + \lambda_1 \sum_{j=1}^{M} w_j^2 + \lambda_2 \sum_{j=1}^{M} \lvert w_j \rvert$


Elastic-net regression
‣ Elastic-net applies both the L2 norm and the L1 norm at the same time, so its constraint region is somewhere in between the circle and the rhombus. It reduces larger weights while driving unimportant weights to 0.

[Figure: constraint regions in the (w1, w2) plane – the elastic-net region lies between the L2 circle and the L1 rhombus]



Unit 1.

Machine Learning Based Data Analysis


1.1. What is machine learning?
1.2. Python scikit-learn library for machine learning
1.3. Preparation and division of data set
1.4. Data pre-processing for making a good training data set
1.5. Practicing to find an optimal method to solve problems with scikit-learn



1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Finding an optimal method to solve problems with scikit-learn


Practicing
‣ Using the following problem-solving methodology, consider which algorithm to apply. Perform pre-processing and the overall process on the iris data, and compare with the result code provided below.


Practicing

Sample (instance, observation):

    Sepal length | Sepal width | Petal length | Petal width | Class label
1   | 5.1 | 3.5 | 1.4 | 0.2 | Setosa
2   | 4.9 | 3.0 | 1.4 | 0.2 | Setosa
…
50  | 6.4 | 3.5 | 4.5 | 1.2 | Versicolor
…
150 | 5.9 | 3.0 | 5.0 | 1.8 | Virginica

The four measurement columns are the features (properties, measurement values, dimensions); the class label column is the target.


Considerations in machine learning


1) Define the problem of the business and check for solutions or best alternative plans.
2) Check if it is possible to define as supervised or unsupervised problems.
3) Check which method to use for measuring model performance.
4) Check if the performance index is linked with the business objective and confirm if the project participants made an
agreement regarding the performance.
‣ In this practice problem, define the problem as classifying iris species (supervised) and suppose that the result is satisfactory if the model predicts with 85% or higher classification accuracy.


Understanding the iris data

Domain knowledge of iris data

‣ Data name: IRIS


‣ Number of data: 150
‣ Number of variables: 5
‣ Understanding variables
Sepal Length Length information of the sepal
Sepal Width Width information of the sepal
Petal Length Length information of the petal
Petal Width Width information of the petal
Species Flower species, classified into setosa / versicolor / virginica


Iris data pre-processing and EDA

• Import the libraries required for practicing.
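A sketch of the setup (the library list is an assumption based on what the later steps use):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

iris = load_iris()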


• Convert the loaded data to a NumPy ndarray and a pandas DataFrame.
• Merge the features and the target into one data frame.
• Change the column names, change the target values to the species names, and check for missing values.
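A sketch of these preparation steps (the column names and species mapping are assumptions, not the slides' exact code):

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data)  # features as a data frame
df["target"] = iris.target    # merge the features and the target
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
df["species"] = df["species"].map({0: "setosa", 1: "versicolor", 2: "virginica"})
print(df.isnull().sum())      # check the missing values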



‣ Basic statistical analysis
• Perform basic statistical analysis for a better understanding of the data. Through this analysis it is possible to understand the data size (number of observations), the shape of the data (matrix shape), the data types, the data distribution, and the relationships between features, and thereby enhance the performance of machine learning.


• petal_length has the greatest standard deviation, while petal_width seems to have a narrower range of values than the other features. Given the scale differences between features, it would be better to apply scaling after checking the model performance.



‣ Correlation analysis
• Use corr() to analyze the relationships among features.
• The correlation coefficient of petal_length and petal_width is 0.962865, which is extremely high. Since highly correlated features may induce multicollinearity problems, it is recommended to select only one of the two variables to use.
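A sketch of the correlation check (assumed column names as above):

import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=["sepal_length", "sepal_width",
                                      "petal_length", "petal_width"])
print(df.corr())  # petal_length vs. petal_width ≈ 0.96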



‣ Aggregation analysis
• The number of data points in each target class was counted using the aggregation function 'size', confirming that 50 data points appear in each class. Select between 'size' and 'count' depending on the purpose of the analysis: 'size' counts the data including missing values, while 'count' counts the data excluding missing values. In this case there is no difference between them, because the iris data has no missing values.



‣ Data visualization
• Previously, basic statistics analysis was done on the data, but it is not easy to understand due to too many numbers, and
incorrect reading of decimal points would result in significant error. If so, visualization of data as a graph provides an
intuitive understanding of the reader. Also, data visualization is efficient to explain data analysis results.

Dr. Nagaraju K, Asst Prof, Dept of CSE 140


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Iris data pre-processing and EDA


‣ Visualizing basic statistics and outlier

Dr. Nagaraju K, Asst Prof, Dept of CSE 141


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Iris data pre-processing and EDA

[Box plots of the four features: sepal_length, sepal_width (cm), petal_length, petal_width]

Dr. Nagaraju K, Asst Prof, Dept of CSE 142


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Iris data pre-processing and EDA


‣ Visualizing data distribution

Dr. Nagaraju K, Asst Prof, Dept of CSE 143


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Iris data pre-processing and EDA


‣ For sepal_width, the frequency of the median class interval is high and falls off toward both tails; accordingly, in the box plot, the box of sepal_width is short because most of the data is concentrated around the median. For petal_length, the frequency of the median class interval is also high, but many observations fall in the lower class intervals; accordingly, the box of petal_length extends far downward because many values are small.

[Histograms of sepal_length, sepal_width (cm), petal_length, petal_width]

Dr. Nagaraju K, Asst Prof, Dept of CSE 144


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Visualizing correlation

[Correlation heatmap of sepal_length, sepal_width (cm), petal_length, and petal_width, with a color bar ranging from -1.00 to 1.00]

Dr. Nagaraju K, Asst Prof, Dept of CSE 145


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Visualizing the correlation between features and data distribution by using pairplot

Dr. Nagaraju K, Asst Prof, Dept of CSE 146


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Visualizing the correlation between features and data distribution by using pairplot
[Pair plot of sepal_length, sepal_width (cm), petal_length, and petal_width, colored by species: setosa, versicolor, virginica]

Dr. Nagaraju K, Asst Prof, Dept of CSE 147


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Visualizing the correlation between features and data distribution by using pairplot
‣ setosa forms a cluster that is clearly separated from the other classes; it can be split off by drawing a single imaginary line, so a linear model will classify setosa well. versicolor and virginica look difficult to separate with a line in the plot of sepal_width against sepal_length because they overlap, but in the other plots they can be separated, even if the boundary looks a little vague.

Dr. Nagaraju K, Asst Prof, Dept of CSE 148


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Visualizing the class ratio of the target

Dr. Nagaraju K, Asst Prof, Dept of CSE 149


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Visualizing the class ratio of the target

[Pie chart of the target classes: setosa 33.3%, versicolor 33.3%, virginica 33.3%]

Line 15
• The data are evenly arranged in each target class.

Dr. Nagaraju K, Asst Prof, Dept of CSE 150


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Visualizing the class ratio of the target

‣ Before starting machine learning, split the data set into training data and performance test data. The final objective of machine learning is a generalized model that accurately predicts new data. If the performance is evaluated with the data used for training, the model is likely to get the answers right simply because it is already familiar with those data. For a reliable evaluation, keep the performance test set separate from the training set. Because part of the data is held back, this is referred to as the hold-out method.
‣ Split the training set and the performance test set with the train_test_split function of sklearn. Label the training data 'train' and the performance test data 'test.' X is the feature part of the data set, and y is the target. For structured data analysis, a common convention is to write a DataFrame with a capital letter and a Series in lower case. The test_size=0.33 option sets aside 33% of the total data as the test set. random_state=42 makes the split reproducible for this practice problem; without random_state, the split differs every time.
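
‣ A minimal sketch of the hold-out split described above (assuming df is the pre-processed iris DataFrame with a 'species' target column):

# a minimal sketch, assuming df holds the pre-processed iris data
from sklearn.model_selection import train_test_split

X = df.drop(columns='species')   # features (DataFrame -> capital X)
y = df['species']                # target (Series -> lower-case y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)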

Dr. Nagaraju K, Asst Prof, Dept of CSE 151


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Algorithm selection

[scikit-learn algorithm cheat sheet: a flowchart that starts from the amount of data and the task type (classification, regression, clustering, dimensionality reduction) and guides the choice among estimators such as Linear SVC, KNeighbors Classifier, SGD Classifier, ensemble classifiers, Naive Bayes, Lasso, ElasticNet, Ridge Regression, SVR (linear and rbf kernels), ensemble regressors, KMeans, MiniBatch KMeans, spectral clustering, GMM, MeanShift, VBGMM, randomized PCA, Isomap, spectral embedding, LLE, and kernel approximation]

Dr. Nagaraju K, Asst Prof, Dept of CSE 152


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Algorithm selection

Dr. Nagaraju K, Asst Prof, Dept of CSE 153


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Algorithm selection

‣ # Regularization: constrains the degrees of freedom of the decision tree.

‣ # Lowering max_depth constrains the model and reduces the risk of overfitting.
‣ # min_samples_split: the minimum number of samples a node must contain before it can be split.
‣ # min_samples_leaf: the minimum number of samples a leaf node must contain.
‣ # min_weight_fraction_leaf: same as min_samples_leaf, but expressed as a fraction of the total weighted samples.
‣ # max_leaf_nodes: the maximum number of leaf nodes.
‣ # max_features: the maximum number of features evaluated for splitting at each node.
‣ # Increasing a parameter that starts with min_ or lowering a parameter that starts with max_ constrains the model more.
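
‣ A hedged sketch of instancing a constrained tree with these hyperparameters (the exact values here are illustrative assumptions, not from the original practice code):

# a minimal sketch; the hyperparameter values are illustrative assumptions
from sklearn.tree import DecisionTreeClassifier

constrained_tree = DecisionTreeClassifier(
    max_depth=4,           # limit tree depth to reduce the risk of overfitting
    min_samples_split=4,   # a node needs at least 4 samples before it can split
    min_samples_leaf=2,    # every leaf keeps at least 2 samples
    max_leaf_nodes=None,   # no limit on the number of leaves
    random_state=42)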

Dr. Nagaraju K, Asst Prof, Dept of CSE 154


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Algorithm selection
‣ # Gini impurity or entropy
‣ # In practice, the difference between Gini impurity and entropy is small: both produce similar trees.
‣ # Gini impurity is quicker to compute, so it is a reasonable default. When the two criteria do produce different trees,
‣ # Gini impurity tends to isolate the most frequent class in its own branch, while entropy tends to produce a slightly more balanced tree.

Dr. Nagaraju K, Asst Prof, Dept of CSE 155


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Model learning
‣ Perform model learning with the train data to check the model performance. The current model uses default hyperparameters except for random_state.

Score
‣ Evaluate the performance by using the performance test data set. In scikit-learn, the score of a classifier refers to accuracy. Since the iris data set is a well-structured practice data set, it generally shows high performance with almost any model.
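
‣ A minimal sketch of fitting and scoring (reusing the hold-out split from the previous steps):

# a minimal sketch; default hyperparameters except random_state, as described above
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)            # learn from the training data
print(model.score(X_test, y_test))     # accuracy on the held-out test set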

Dr. Nagaraju K, Asst Prof, Dept of CSE 156


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Model generalization strategy


‣ Machine learning performance is data-driven, so a sufficient amount of data is required for good performance. Too little data may result in overfitting, meaning the model fits only the peculiarities of the training data and therefore predicts unseen data poorly. The following generalization strategies help the model perform well on unseen data.

Validation set
‣ The performance test data set split off with train_test_split is reserved for the final performance evaluation of the model. Because model performance also needs to be checked during training, hold out some of the training data and use it as the validation set. The validation set reveals overfitting during learning and is also used to find hyperparameters.

training set     validation set        test set

Model fitting    Parameter selection   Evaluation

Dr. Nagaraju K, Asst Prof, Dept of CSE 157


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Cross validation
‣ This is a strategy that builds several validation sets so that every observation is used for validation exactly once.
Divide the data set into k folds (k-fold). Use the first fold as the validation set, train on the other k-1 folds,
and measure the performance.
‣ Then use the second fold as the validation set and the remaining folds for training, and measure the performance again. Repeat the same
process for every fold so that all data take part in training. Average the k performance scores to
estimate the model performance. The following figure is an example with k=5.

[Cross-validation diagram: five splits (CV iterations); in each split a different one of folds 1-5 serves as the test fold while the remaining four folds are used for training]

Dr. Nagaraju K, Asst Prof, Dept of CSE 158


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Cross_val_score
‣ Cross validation can be easily performed by using the cross_val_score function of scikit-learn.

from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X_train, y_train, cv=10)  # model, X_train, y_train from earlier steps
for i, score in enumerate(scores):
    print("{}th cross validation score : {}".format(i, score))
print("\ncross validation final score : {}".format(scores.mean()))

0th cross validation score : 0.9


1st cross validation score : 1.0
2nd cross validation score : 0.8
3rd cross validation score : 1.0
4th cross validation score : 0.8
5th cross validation score : 0.9
6th cross validation score : 1.0
7th cross validation score : 0.9
8th cross validation score : 1.0
9th cross validation score : 1.0
Final cross validation score : 0.93

Dr. Nagaraju K, Asst Prof, Dept of CSE 159


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

stratified
‣ Randomly splitting the train set and validation set can leave the target classes unevenly distributed in the hold-out sets. The data distribution then differs between the train set and validation set, which hurts learning. Machine learning rests on the premise that the distribution of the training data matches the distribution of real-world data; when that premise does not hold, the performance of the learned model falls. The stratified method prevents this issue by keeping the target class ratio even across the splits.
The following figure provides an intuitive picture of how the stratified method splits the data.

[Stratified cross-validation diagram: three splits (CV iterations) over 150 data points; each split keeps the same proportion of class 0, class 1, and class 2 in both the training data and the test data]

‣ Cross validation is possible by sending the instance of StratifiedKFold to the cv option of cross_val_score.
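
‣ A minimal sketch of stratified cross validation (reusing the model and training data from above):

# a minimal sketch; StratifiedKFold keeps the class ratio even in every fold
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=skf)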

Dr. Nagaraju K, Asst Prof, Dept of CSE 160


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

stratified

("{}th stratified cross validation score : {}".format(i,_))

0th stratified cross validation score : 0.9


1st stratified cross validation score : 0.9
2nd stratified cross validation score : 0.8
3rd stratified cross validation score : 0.9
4th stratified cross validation score : 1.0
5th stratified cross validation score : 1.0
6th stratified cross validation score : 0.9
7th stratified cross validation score : 0.8
8th stratified cross validation score : 1.0
9th stratified cross validation score : 1.0

("\nstratified Final stratified cross validation score : {}".format(fin)result))

Final stratified cross validation score : 0.9199999999999999

Dr. Nagaraju K, Asst Prof, Dept of CSE 161


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Learning Curve
‣ !pip install scikit-plot

Dr. Nagaraju K, Asst Prof, Dept of CSE 162


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Learning Curve
‣ !pip install scikit-plot
• The green line is the cross-validation result. If the green line rises to the right and then starts to fall, overfitting is occurring. The red line is the score on the data used for training. The red line may dip momentarily as data are added, but this phenomenon is temporary and the curve converges in the long run.
• The cv option is not set, so 3-fold is applied as the default. There are 100 data in the train set and 33% of them are used for cross validation, so the maximum value on the x-axis is 66. The curve stops while the green line is still rising, so we cannot tell what would happen with more data; at this point it is impossible to know whether the data are sufficient. The learning curve looks different for each algorithm even on the same data. What this curve does tell us is that the current decision tree model would perform better with more data.

Dr. Nagaraju K, Asst Prof, Dept of CSE 163


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Learning Curve
‣ !pip install scikit-plot

[Learning curve plot: score on the y-axis against the number of training examples on the x-axis]

Dr. Nagaraju K, Asst Prof, Dept of CSE 164


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Learning Curve
‣ If there is enough data, the same data distribution is maintained even when the train set and validation set are split randomly; cross validation is needed precisely when data are scarce. To judge whether there is enough data, draw a learning curve: it puts the number of training examples on the x-axis and the performance score on the y-axis, showing how performance changes as the amount of training data grows. The test score is calculated by internal cross validation.
‣ The learning curve can be drawn with the scikitplot library, which complements scikit-learn. Install it separately from scikit-learn.
‣ scikitplot is not included in Anaconda by default, so install it with a package management tool. Run the following code in a Jupyter notebook to install the library. Note that the name used for installing (scikit-plot) differs from the name used for importing (scikitplot).
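
‣ A minimal sketch of drawing the learning curve with scikitplot (reusing the model and training data from above):

# a minimal sketch; note: install as scikit-plot, import as scikitplot
import matplotlib.pyplot as plt
import scikitplot as skplt

skplt.estimators.plot_learning_curve(model, X_train, y_train)
plt.show()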

Dr. Nagaraju K, Asst Prof, Dept of CSE 165


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Model optimization strategy

Dr. Nagaraju K, Asst Prof, Dept of CSE 166


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Model optimization strategy


‣ #Hyperparameter
• In machine learning, the machine learns from the data and finds the parameters by itself.
Hyperparameters are the parameters the machine cannot find on its own and that a human must set directly.
In scikit-learn, hyperparameters are set when instantiating an algorithm.
‣ #Hyperparameter search by using GridSearchCV
• In general, hyperparameters are found through the analyst's expertise.
scikit-learn provides the GridSearchCV function, which lays out every combination of the candidate hyperparameters on a grid, then trains and measures performance for each combination. It may look like brute force, but the analyst designates the ranges of the hyperparameters and the machine does the work automatically. It takes some time, but it makes finding hyperparameters much easier.
‣ Like the algorithms, GridSearchCV is also instantiated. When instantiating, pass the instantiated algorithm model as the argument to the estimator option, and pass a dictionary of candidate hyperparameters as the argument to param_grid.
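
‣ A hedged sketch of the grid search (the candidate value lists are assumptions chosen to reproduce the 1,600 combinations shown in the table below, not the original practice code):

# a minimal sketch; the candidate values are illustrative assumptions
from sklearn.model_selection import GridSearchCV

params = {'criterion': ['gini', 'entropy'],
          'max_depth': [4, 6, 8, 10, 12],
          'min_impurity_decrease': [0, 0.1, 0.15, 0.2],
          'min_weight_fraction_leaf': [0, 0.1, 0.2, 0.3],
          'random_state': [7, 23, 42, 78, 142],
          'splitter': ['best', 'random']}
grid = GridSearchCV(estimator=model, param_grid=params, cv=10, refit=True)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)   # optimal combination and its score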

Dr. Nagaraju K, Asst Prof, Dept of CSE 167


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

The total number of combinations that can be made from the parameters in the practice problem is 1,600. Since k=10 in the k-fold cross validation, 10 cross validations were performed for each combination, so a total of 16,000 trainings were run. The following table shows the hyperparameter combinations in the practice problem.
‣ The optimal parameters and the optimized performance found with GridSearchCV are recorded in the best_params_ and best_score_ attributes.
‣ If the refit option is set to True, the model is retrained with the optimal hyperparameters and recorded in the best_estimator_ attribute.

                          0     1      2     3      4     ‧‧‧  1595     1596     1597     1598     1599
criterion                 gini  gini   gini  gini   gini  ‧‧‧  entropy  entropy  entropy  entropy  entropy
max_depth                 4     4      4     4      4     ‧‧‧  12       12       12       12       12
min_impurity_decrease     0     0      0     0      0     ‧‧‧  0.2      0.2      0.2      0.2      0.2
min_weight_fraction_leaf  0     0      0     0      0     ‧‧‧  0.3      0.3      0.3      0.3      0.3
random_state              7     7      23    23     42    ‧‧‧  42       78       78       142      142
splitter                  best  random best  random best  ‧‧‧  random   best     random   best     random
6 rows × 1600 columns

Dr. Nagaraju K, Asst Prof, Dept of CSE 168


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Dr. Nagaraju K, Asst Prof, Dept of CSE 169


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Evaluation criteria and model evaluation


‣ Use the X_test and y_test from hold out for final evaluation of the model. For accurate evaluation, it is important to be aware
of different kinds of evaluation criteria.

Dr. Nagaraju K, Asst Prof, Dept of CSE 170


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Evaluation criteria and model evaluation


‣ #Limitations of accuracy
• So far, accuracy has been the only criterion used to validate the model, but it has a limitation: accuracy alone is not enough to evaluate a model properly.
Ex If a model predicts 'setosa' no matter what data it receives, its performance would be doubtful.

• Now assume the test set contains 48 setosas, 1 versicolor, and 1 virginica. Evaluated on this test set, the problem is that the model would still reach 96% accuracy, yet not because the model performance is great. Other evaluation criteria must be checked as well to evaluate the model performance accurately.

Dr. Nagaraju K, Asst Prof, Dept of CSE 171


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Confusion Matrix
‣ For binary classification, the confusion matrix can be expressed as follows.

Predicted positive class Predicted negative class

Actual Positive TP (True Positive) FN (False Negative)


Actual Negative FP (False Positive) TN (True Negative)

‣ Evaluation scores such as precision, recall, and f1-score are built from the four concepts provided above
(TP, FP, TN, FN).
‣ Use the confusion matrix to analyze both the correct and the incorrect predictions. It shows, in several
ways, how well the predicted targets match the actual targets.

Dr. Nagaraju K, Asst Prof, Dept of CSE 172


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Multi-class classification

Predicted setosa Predicted versicolor Predicted virginica


Actual setosa Actual setosa and predicted Actual setosa but predicted Actual setosa but predicted
setosa versicolor virginica

Actual versicolor Actual versicolor but predicted Actual versicolor and predicted Actual versicolor but predicted
setosa versicolor virginica

Actual virginica Actual virginica but predicted Actual virginica but predicted Actual virginica and predicted
setosa versicolor virginica

‣ Because the iris data is a multi-class classification problem, it cannot be expressed with the four concepts above alone. Instead, treat each of setosa, versicolor, and virginica as its own binary classification problem and build three matrices. Take setosa, for example.

Dr. Nagaraju K, Asst Prof, Dept of CSE 173


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Confusion matrix of the iris data set

‣ With scikit-learn, it is possible to easily calculate the confusion matrix by using confusion_matrix function. Send the
arguments to actual class and then predicted class.
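
‣ A minimal sketch (y_pred is the model's prediction on X_test, reused in the later metric examples):

# a minimal sketch, reusing model, X_test, y_test from above
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))   # rows: actual class, columns: predicted class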

Dr. Nagaraju K, Asst Prof, Dept of CSE 174


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Confusion matrix of the iris data set

[Confusion matrix heatmap: true labels setosa, versicolor, virginica on the y-axis against the same predicted labels on the x-axis]

Dr. Nagaraju K, Asst Prof, Dept of CSE 175


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Confusion matrix of the iris data set


‣ Use scikit-plot to visualize the confusion matrix as a more intuitive heatmap. The scikit-learn output has no labels on the x-axis and y-axis, but scikit-plot labels both axes, which makes the result easier to interpret.
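
‣ A minimal sketch of the heatmap with scikit-plot:

# a minimal sketch, reusing y_test and y_pred from above
import matplotlib.pyplot as plt
import scikitplot as skplt

skplt.metrics.plot_confusion_matrix(y_test, y_pred)
plt.show()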

Dr. Nagaraju K, Asst Prof, Dept of CSE 176


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

precision / recall / fall-out / f-score


‣ Evaluation scores differ for each target class in a multi-class classification problem.

                           Predicted setosa      Not predicted setosa
                                                 (predicted versicolor or virginica)
Actual setosa              TP (True Positive)    FN (False Negative)
Not actual setosa          FP (False Positive)   TN (True Negative)
(versicolor or virginica)

‣ Score the evaluation results based on the TP, TN, FP, FN of the confusion matrix. These four concepts exist only in binary classification problems. For multi-class problems with N target classes, such as the iris data, treat each target class as a binary classification and obtain N confusion matrices.

Ex Iris data
Treat each of setosa, versicolor, and virginica as a binary classification problem and create three confusion matrices.
The following shows the confusion matrix of setosa.
Dr. Nagaraju K, Asst Prof, Dept of CSE 177


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

precision
‣ Precision is the ratio of correct predictions among the samples that were predicted as the class.

precision = TP / (TP + FP)

(f"{target}precision:{score}")

setosa precision: 1.0


versicolor precision: 0.9375
virginica precision: 1.0

Line 45
• In multi-class classification, average cannot be "binary."
• "binary" is the default of the average option.

Dr. Nagaraju K, Asst Prof, Dept of CSE 178


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

recall
‣ Also called sensitivity, recall is the ratio of correct predictions among the samples that actually belong to the target class.

recall = TP / (TP + FN)

(f"{target}sensitivity:{score}")

setosa sensitivity: 1.0


versicolor sensitivity: 1.0
virginica sensitivity : 0.9375

Dr. Nagaraju K, Asst Prof, Dept of CSE 179


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

fall-out
‣ Fall-out is the ratio of incorrect predictions among the samples that actually do not belong to the target class. It is also expressed as 1 - specificity.

fall-out = FP / (FP + TN)

‣ scikit-learn does not provide a function to calculate fall-out.
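
‣ It can still be computed from the confusion matrix; a minimal sketch for one class (taking class index 0, setosa, as an assumed label order):

# a minimal sketch; class index 0 (setosa) is an assumption about the label order
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
i = 0                                                        # class index
FP = cm[:, i].sum() - cm[i, i]                               # predicted i, actually another class
TN = cm.sum() - cm[i, :].sum() - cm[:, i].sum() + cm[i, i]   # neither actual nor predicted i
print(FP / (FP + TN))                                        # fall-out of class i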

Dr. Nagaraju K, Asst Prof, Dept of CSE 180


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

f-score
‣ Precision and recall have a trade-off relationship. The f-score is the weighted harmonic mean of precision and recall. If β is less than 1, more weight is given to precision, and if β is greater than 1, more weight is given to recall. The f-score is used to understand model performance accurately when the data classes are imbalanced.

F_β = (1 + β²) · (precision × recall) / (β² · precision + recall)

‣ To weight precision and recall evenly, β is set to 1 most of the time, which is specifically referred to as the f1-score.

F_1 = 2 · (precision × recall) / (precision + recall)

Dr. Nagaraju K, Asst Prof, Dept of CSE 181


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

f-score
‣ #F1 measure – precision and recall are weighted equally. The F1 score is the harmonic mean of precision and recall (sensitivity): with a (precision) and b (recall), F1 = 2ab / (a + b).
‣ #F0.5 measure – precision is weighted more than recall; recall receives half the weight of precision.
‣ #F2 measure – recall is weighted more; recall receives twice the weight of precision.

Dr. Nagaraju K, Asst Prof, Dept of CSE 182


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

f-score

(f"{target}fbetas score:{score}")

(f"{target}f1 score:{score}")

setosa fbetas score : 1.0


versicolor fbetas score : 0.967741935483871
virginica fbetas score : 0.967741935483871
setosa f1 score : 1.0
versicolor f1 score : 0.967741935483871
virginica f1 score : 0.967741935483871
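
‣ A minimal sketch of both scores (beta=1 is assumed here, which makes the fbeta score equal to the f1 score, consistent with the output above):

# a minimal sketch, reusing y_test and y_pred from above; beta=1 is an assumption
from sklearn.metrics import f1_score, fbeta_score

print(fbeta_score(y_test, y_pred, beta=1, average=None))
print(f1_score(y_test, y_pred, average=None))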

Dr. Nagaraju K, Asst Prof, Dept of CSE 183


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

accuracy

accuracy = (TP + TN) / (TP + TN + FP + FN)

Dr. Nagaraju K, Asst Prof, Dept of CSE 184


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

classification_report
‣ Use the classification_report function of scikit-learn to batch calculate precision, recall, and f1-score.
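
‣ A minimal sketch:

# a minimal sketch, reusing y_test and y_pred from above
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred,
                            target_names=['setosa', 'versicolor', 'virginica']))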

Dr. Nagaraju K, Asst Prof, Dept of CSE 185


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

ROC curve
‣ The ROC curve has the TPR (True Positive Rate) on the y-axis and the FPR (False Positive Rate) on the x-axis.
TPR is recall, and FPR refers to fall-out.

TPR = TP / (TP + FN)        FPR = FP / (FP + TN)
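
‣ A minimal sketch of plotting the per-class ROC curves with scikit-plot (the predicted class probabilities, not the class labels, are required):

# a minimal sketch, reusing model, X_test, y_test from above
import matplotlib.pyplot as plt
import scikitplot as skplt

y_probas = model.predict_proba(X_test)   # class probabilities, one column per class
skplt.metrics.plot_roc(y_test, y_probas)
plt.show()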

Dr. Nagaraju K, Asst Prof, Dept of CSE 186


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

ROC curve

Dr. Nagaraju K, Asst Prof, Dept of CSE 187


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

ROC curve

[ROC curves plot: true positive rate against false positive rate; ROC curve of class setosa (area = 1.00), ROC curve of class versicolor (area = 0.99), ROC curve of class virginica (area = 0.99), micro-average ROC curve (area = 0.99)]

Dr. Nagaraju K, Asst Prof, Dept of CSE 188


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

AUC (Area Under Curve)

[Four example ROC curves plotting true positive rate against false positive rate, with AUC = 0.4, 0.5, 0.6, and 0.85; the larger the area under the curve, the better the classifier]

The Hundred-Page Machine Learning Book

Dr. Nagaraju K, Asst Prof, Dept of CSE 189


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Poor performance: Return to previous step


‣ Final model
‣ Save model - Training can take days when there is a lot of data, and it is extremely inefficient to retrain the model for
every prediction, so the best approach is to save the model for reuse. Use 'pickle' to save the model.
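
‣ A minimal sketch of saving the trained model (the file name model.pkl is an assumption):

# a minimal sketch; 'model.pkl' is an assumed file name
import pickle

with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)   # serialize the trained model to disk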

Dr. Nagaraju K, Asst Prof, Dept of CSE 190


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Poor performance: Return to previous step


‣ Final model

Line 55
• Import model

Dr. Nagaraju K, Asst Prof, Dept of CSE 191


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Poor performance: Return to previous step


‣ Final model

Line 56
• Final prediction

Dr. Nagaraju K, Asst Prof, Dept of CSE 192


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT 01

Poor performance: Return to previous step


‣ Final model

Line 57
• Save csv
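
‣ A minimal sketch of the three steps above in one place (import the model, make the final prediction, save to csv; the file names are assumptions):

# a minimal sketch; file names are assumed
import pandas as pd
import pickle

with open('model.pkl', 'rb') as f:          # Line 55: import model
    loaded_model = pickle.load(f)
final_pred = loaded_model.predict(X_test)   # Line 56: final prediction
pd.DataFrame({'prediction': final_pred}).to_csv('predictions.csv', index=False)  # Line 57: save csv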

Dr. Nagaraju K, Asst Prof, Dept of CSE 193


Unit 2.

Application of the Supervised Learning Model for


Numerical Prediction
2.1. Training and Testing in Machine Learning
2.2. Linear Regression Basics
2.3. Linear Regression Diagnostics
2.4. Other Regression Types
2.5. Practicing the Supervised Learning Model for Numerical Prediction

Dr. Nagaraju K, Asst Prof, Dept of CSE 194


2.1. Training and Testing in Machine Learning UNIT 02

Machine Learning Types

Machine Learning

Supervised Unsupervised Reinforcement

Target pattern is given. Target pattern must be found out. Policy optimization.

Dr. Nagaraju K, Asst Prof, Dept of CSE 195


Supervised Learning

Supervised learning, as the name indicates, has the presence of a supervisor acting as a teacher.
Basically, supervised learning is when we teach or train the machine using data that is well labelled,
which means some data is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data) so that the
supervised learning algorithm analyses the training data (set of training examples) and produces a correct outcome from the labelled data.

Dr. Nagaraju K, Asst Prof, Dept of CSE 196


Dr. Nagaraju K, Asst Prof, Dept of CSE 197
Supervised Learning

⚫ X, Y (pre-classified training examples)

⚫ Given an observation x, what is the best label y?

Dr. Nagaraju K, Asst Prof, Dept of CSE 198


Supervised learning is classified into two categories of algorithms:
⚫ Classification: A classification problem is when the output variable is a category, such as "red" or "blue," "disease" or "no disease."
⚫ Regression: A regression problem is when the output variable is a real value, such as "dollars" or "weight."

Dr. Nagaraju K, Asst Prof, Dept of CSE 199


Classification
Naive Bayes Classifiers
K-NN (k nearest neighbors)
Decision Trees

Dr. Nagaraju K, Asst Prof, Dept of CSE 200


− Supervised learning
⚫ Advantages:
− Supervised learning allows collecting data and producing outputs from previous experience.
− It helps to optimize performance criteria with the help of experience.
− Supervised machine learning helps to solve various types of real-world computation problems.

⚫ Disadvantages:
− Classifying big data can be challenging.
− Training a supervised model needs a lot of computation time, so it can be slow.

Dr. Nagaraju K, Asst Prof, Dept of CSE 207


Machine Learning Types

Machine Learning

Supervised Unsupervised Reinforcement

Target pattern is given. Target pattern must be found out. Policy optimization.

Dr. Nagaraju K, Asst Prof, Dept of CSE 208


Dr. Nagaraju K, Asst Prof, Dept of CSE 209
Unsupervised learning is the training of a machine using information that is neither classified nor labeled, allowing the algorithm to
act on that information without guidance.
Here the task of the machine is to group unsorted information according to similarities, patterns, and differences without any prior training on the data.

Dr. Nagaraju K, Asst Prof, Dept of CSE 210


Dr. Nagaraju K, Asst Prof, Dept of CSE 211
Unsupervised learning is classified into two categories of algorithms:
⚫ Clustering: A clustering problem is where you want to
discover the inherent groupings in the data, such as
grouping customers by purchasing behavior.
⚫ Association: An association rule learning problem is
where you want to discover rules that describe large
portions of your data, such as people that buy X also tend
to buy Y.

Dr. Nagaraju K, Asst Prof, Dept of CSE 212


Clustering types:
Hierarchical clustering
K-means clustering
Principal Component Analysis
Singular Value Decomposition
Independent Component Analysis

Clustering:
Exclusive (partitioning)
Agglomerative
Overlapping
Probabilistic
Dr. Nagaraju K, Asst Prof, Dept of CSE 213


Dr. Nagaraju K, Asst Prof, Dept of CSE 214
2.1. Training and Testing in Machine Learning UNIT 02

Machine Learning Types

Machine Learning

Supervised Unsupervised Reinforcement

Target pattern is given. Target pattern must be found out. Policy optimization.

Dr. Nagaraju K, Asst Prof, Dept of CSE 215


Dr. Nagaraju K, Asst Prof, Dept of CSE 216
Reinforcement learning is an area of Machine Learning. It is about taking suitable action to maximize reward in a particular situation.

Dr. Nagaraju K, Asst Prof, Dept of CSE 217


⚫ Main points in reinforcement learning
⚫ Input: The input should be an initial state from which the model will start.
⚫ Output: There are many possible outputs, as there are a variety of solutions to a particular problem.
⚫ Training: The training is based upon the input; the model will return a state, and the user will decide to reward or punish the model based on its output.
⚫ The model continues to learn.
⚫ The best solution is decided based on the maximum reward.

Dr. Nagaraju K, Asst Prof, Dept of CSE 222


Machine Learning Types

Machine Learning

Supervised Unsupervised Reinforcement

Target pattern is given. Target pattern must be found out. Policy optimization.

Dr. Nagaraju K, Asst Prof, Dept of CSE 223


Supervised Learning

Numeric Y: Y = 13.45, 73, 9.5, …
Categorical Y: Y = red, green, blue, …

Dr. Nagaraju K, Asst Prof, Dept of CSE 232


Classification
Naive Bayes Classifiers
K-NN (k nearest neighbors)
Decision Trees

Dr. Nagaraju K, Asst Prof, Dept of CSE 233


A regression problem is when the output variable is a real value, such as "dollars" or "weight".
Dr. Nagaraju K, Asst Prof, Dept of CSE 234
R-squared is a statistical measure that represents the goodness of fit of a regression model. The value of
R-squared lies between 0 and 1.
R-squared equals 1 when the model fits the data perfectly and there is no difference between the
predicted and actual values.
R-squared equals 0 when the model does not explain any variability in the data and learns no
relationship between the dependent and independent variables.
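
A worked form of the definition (a math sketch; y_i are the actual values, ŷ_i the predictions, and ȳ the mean of the actual values):

\[
R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
\]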

Dr. Nagaraju K, Asst Prof, Dept of CSE 255


Predict price of a home with area = 3300 sqr ft

Predict price of a home with area = 5000 sqr ft

Dr. Nagaraju K, Asst Prof, Dept of CSE 268


Find price of home with 3000 sqr ft area, 3 bedrooms, 40 year old
m = 135.78767123, c = 80616.438, price = 628715.7534

Find price of home with 2500 sqr ft area, 4 bedrooms, 5 year old: 859554.7945

Dr. Nagaraju K, Asst Prof, Dept of CSE 269


Find price of home with 3000 sqr ft area, 3 bedrooms, 40 year old

Find price of home with 2500 sqr ft area, 4 bedrooms, 5 year old

Dr. Nagaraju K, Asst Prof, Dept of CSE 278


Here area, bedrooms, and age are called independent variables or features, whereas price is a dependent variable.

Dr. Nagaraju K, Asst Prof, Dept of CSE 279


• Logistic Regression is a supervised machine learning algorithm used for classification problems.
• Unlike linear regression, which predicts continuous values, it predicts the probability that an input belongs to a
specific class. It is used for binary classification, where the output can be one of two possible categories such as
Yes/No, True/False or 0/1.
• It uses the sigmoid function to convert inputs into a probability value between 0 and 1, as sketched below.
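
A math sketch of the sigmoid mapping (w and b are the learned weight and bias):

\[
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad P(y = 1 \mid x) = \sigma(wx + b)
\]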

Dr. Nagaraju K, Asst Prof, Dept of CSE 404


1. Binomial Logistic Regression: This type is used when the dependent variable has only
two possible categories. Examples include Yes/No, Pass/Fail or 0/1. It is the most common
form of logistic regression and is used for binary classification problems.

Dr. Nagaraju K, Asst Prof, Dept of CSE 414


1. Binomial Logistic Regression
2.Multinomial Logistic Regression: This is used when the dependent variable has three
or more possible categories that are not ordered. For example, classifying animals into
categories like "cat," "dog" or "sheep." It extends the binary logistic regression to handle
multiple classes.

Dr. Nagaraju K, Asst Prof, Dept of CSE 415


1. Binomial Logistic Regression
2.Multinomial Logistic Regression
3.Ordinal Logistic Regression: This type applies when the dependent variable has three
or more categories with a natural order or ranking. Examples include ratings like "low,"
"medium" and "high." It takes the order of the categories into account when modeling.

Dr. Nagaraju K, Asst Prof, Dept of CSE 416


