ML Intro Linear Regression
Chapter 1
Machine Learning 1
- Supervised Learning
✓ Be able to introduce machine learning-based data analysis according to the business objective, strategy, and policy and
manage the overall process.
✓ Be able to select and apply a machine learning algorithm that is the most suitable to the given problem and perform
hyperparameter tuning.
✓ Be able to design, maintain, and optimize a machine learning workflow for AI modeling by using structured and
unstructured data.
Modern definition
‣ “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if
its performance at tasks in T, as measured by P, improves with experience E.” – Mitchell, 1997 (p.2)
‣ “Programming computers to optimize a performance criterion using example data or past experience.” – Alpaydin, 2010
‣ “Computational methods using experience to improve performance or to make accurate predictions.” – Mohri, 2012
Mathematical definition
‣ Suppose that the x-axis is the invested advertising expense while the y-axis is sales (the target).
‣ Prediction question: what are the sales when an arbitrary advertising expense is given?
• Linear regression, with w and b as parameters:

$y = wx + b$

‣ Since the optimal values are unknown at the beginning, start with arbitrary values and reach the optimal ones by gradually improving the performance.
• In the figure, the candidate lines improve from f1 to f2 to f3 (f1 → f2 → f3).
• The optimal value is f3, where w = 0.5 and b = 2.0.
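To make the idea concrete, here is a minimal sketch (not from the source) of reaching w and b by gradual improvement, using gradient descent on tiny synthetic advertising/sales data; the data values, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

# Illustrative synthetic data: sales = 0.5 * expenses + 2.0
x = np.array([1.0, 2.0, 3.0, 4.0])   # advertising expenses
y = 0.5 * x + 2.0                    # sales (true w = 0.5, b = 2.0)

w, b = 0.0, 0.0                      # arbitrary starting values (f1)
lr = 0.05                            # learning rate (assumed)
for _ in range(2000):                # gradual improvement: f1 -> f2 -> f3
    y_hat = w * x + b
    w -= lr * (2 * (y_hat - y) * x).mean()   # dMSE/dw
    b -= lr * (2 * (y_hat - y)).mean()       # dMSE/db

print(round(w, 2), round(b, 2))      # approaches w = 0.5, b = 2.0
```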
[Figure: Venn diagram relating Artificial Intelligence, Machine Learning, Deep Learning, Pattern Recognition, Statistics, Data Mining, Data Science, Databases, and Computational Neuroscience.]
Machine Learning
‣ Supervised learning: the target pattern is given. Unsupervised learning: the target pattern must be found. Reinforcement learning: policy optimization.
Machine learning workflow
‣ Problem definition: understanding the business, defining the problem, and searching for data.
‣ Data preparation: pre-processing and feature engineering of the raw data.
‣ Modeling and optimization: train and validate the model on the data.
‣ Model performance evaluation.
Type and algorithm/method:
‣ Unsupervised learning: clustering; MDS, t-SNE; PCA, NMF; association analysis.
‣ Supervised learning: linear regression; logistic regression; tree, Random Forest, AdaBoost, XGBoost; naïve Bayes; KNN; support vector machine (SVM); neural network.
Parameters
‣ Learned from data by training, not set manually by the practitioner.
‣ Contain the data pattern.
Hyperparameters
‣ Can be set manually by the practitioner.
‣ Can be tuned to optimize the machine learning performance.
Ex k (the number of neighbors) in the KNN algorithm.
Ex Learning rate in a neural network.
Ex Maximum depth in a tree algorithm.
Mechanism of scikit-learn
Scikit-learn is characterized by an intuitive, easy interface with a high-level API.
‣ Workflow: create an estimator instance, call fit to train it, then call predict (or transform).
‣ Estimators divide into classifiers and regressors:
• Classifier: DecisionTreeClassifier, KNeighborsClassifier, GradientBoostingClassifier, GaussianNB, …
• Regressor: LinearRegression, KNeighborsRegressor, GradientBoostingRegressor, Ridge, …
Scikit-Learn Library
About the Scikit-Learn library
Practicing scikit-learn
‣ The sklearn.datasets module includes utilities to load datasets, including methods to load and fetch popular reference
datasets. It also features some artificial data generators.
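A minimal loading sketch, assuming the scikit-learn breast cancer dataset (569 observations) that the line-by-line notes below refer to:

```python
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X = data.data        # independent variables (x), shape (569, 30)
y = data.target      # dependent variable (actual values, y)
```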
Line 5
• This data becomes x (independent variable, data).
Line 6
• This data becomes y (dependent variable, actual value).
Line 1
• Provides details about the data.
• The help shows that the default value of test_size is 0.25.
Line 7
• From the total of 569 observations, split the data into training and evaluation sets, e.g., 7:3 or 8:2. The default split is 75:25.
‣ Use train_test_split() to split the data for making and evaluating the model.
Line 11
• 426 observations (75%) out of the total 569.
Line 13
• 143 observations (25%) out of the total 569.
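A sketch of the split described above, assuming X and y loaded as before; the variable names follow scikit-learn convention:

```python
from sklearn.model_selection import train_test_split

# test_size defaults to 0.25, giving a 75:25 split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape[0])   # 426 observations (75%)
print(X_test.shape[0])    # 143 observations (25%)
```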
‣ When instantiating, pass the model's hyperparameters as arguments. A hyperparameter is an option that must be set by a human and strongly affects model performance.
Line 1-5
• Load the data set.
Line 1-8
• Instantiate the estimator and set its hyperparameters.
• Initialize the model to use entropy for branching.
fit
‣ Use the fit method of the estimator instance for training. For a supervised learning algorithm, pass the training data and the label data together as arguments.
predict
‣ Once training with fit is complete, the estimator instance can be used with the predict method, which returns the model's estimates for the input data.
Line 2
• These are estimated values, so they may differ from the actual values for X_test. Measure accuracy by comparing the two.
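A minimal sketch of the instantiate → fit → predict flow; a decision tree with the entropy criterion is assumed here, matching the branching example above:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

model = DecisionTreeClassifier(criterion='entropy')   # hyperparameter set at instancing
model.fit(X_train, y_train)        # training data and labels sent together
y_pred = model.predict(X_test)     # estimated values for X_test
print(accuracy_score(y_test, y_pred))   # compare estimates with actual values
```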
Line 57
• The data frame shows the rows where the predicted and actual values differ.
Line 66
• 133 of 143 test observations are correct (≈93% accuracy).
‣ The model showed 93% accuracy, which is quite a good result. In practice, a step that improves data quality is needed during pre-processing, and standardization is one option. The following is a brief summary of standardization.
• Standardization rescales values toward the standard normal distribution. Another term for standardization is z-transformation, and the standardized value is also referred to as the z-score. About 94% accuracy can be obtained for the KNN wine classification after standardization.
• Standardization is widely used in data pre-processing in general, not only for KNN. The equation is:

$z = \dfrac{x - \mu}{\sigma}$ (where $\mu$ is the mean and $\sigma$ the standard deviation)
Line 35
• Data frame before standardization
Line 39
• The differences among column values are huge before standardization.
Line 40
• After standardization, the column values do not significantly deviate from 0.
• Better performance would be possible compared to before standardization.
transform
‣ Feature processing is done with 'transform', which returns the processed result.
Line 3-3
• Output before pre-processing
Line 3-7
• Pre-processing: apply scaling
Line 3-9
• Check the result after pre-processing
fit_transform
‣ fit and transform are combined into a single call, fit_transform.
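A sketch of fit / transform / fit_transform with StandardScaler (z-transformation), assuming the train/test split above:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)                        # learn each column's mean and std
X_train_std = scaler.transform(X_train)    # apply z = (x - mean) / std
X_test_std = scaler.transform(X_test)      # reuse the training statistics

# fit and transform combined in one call (on training data only):
X_train_std = scaler.fit_transform(X_train)
```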
Dimension reduction (sklearn.decomposition): algorithms related to dimensionality reduction (PCA, NMF, TruncatedSVD, etc.)
Validation, hyperparameter tuning, data splitting (sklearn.model_selection): cross_validate, GridSearchCV, train_test_split, learning_curve, etc.
Utility (sklearn.pipeline): serial composition of feature processing and machine learning algorithms, etc.
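A small sketch of sklearn.pipeline chaining feature processing and an estimator in series; the scaler/KNN combination is an assumption for illustration:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

pipe = Pipeline([('scaler', StandardScaler()),
                 ('knn', KNeighborsClassifier(n_neighbors=5))])
pipe.fit(X_train, y_train)          # scaling and training run in sequence
print(pipe.score(X_test, y_test))   # test data is scaled automatically
```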
Performance evaluation
[Figure: machine learning modeling process through division into training and test data sets; the held-out test data set is used for performance evaluation.]
Overfitting and underfitting
[Figure: the same data fitted with models of increasing capacity, from a straight line (underfitting) to a high-order polynomial curve (overfitting).]
‣ Even when machine learning finds the optimal solution for the data distribution, a wide margin of error can remain because the model has too small a capacity. This phenomenon is called underfitting; the linear model in the leftmost figure above is an example.
‣ An easy alternative is to use a higher-degree polynomial, i.e., a non-linear model.
‣ The rightmost figure above uses a 12th-order polynomial.
‣ The model capacity is larger, and there are 13 parameters to estimate.
$y = w_{12}x^{12} + w_{11}x^{11} + w_{10}x^{10} + \cdots + w_1 x + w_0$
1.3 Preparation and division of the data set
Overfitting
‣ When choosing a 12th order polynomial curve, it approximates almost perfectly to the training set.
‣ However, an issue occurs when predicting new data.
• The region around the red bar at x₀ should be predicted, but the red dot is predicted instead.
‣ The reason is the model's large capacity.
• It absorbs the noise during the learning process → overfitting
‣ Model selection is required to select an adequate size model.
[Figure: inaccurate prediction around x₀ when the 12th-order polynomial overfits.]
‣ The data should be split into a training set and a testing set.
‣ In principle, the testing set should be used only once! It should not be reused!
‣ If the training set is used also for evaluation, the errors can be unrealistically small.
‣ We would like to evaluate realistic errors while training by splitting the training data into two.
‣ As we can repeatedly evaluate errors while training, it is also possible to tune the hyperparameters.
Cross-Validation:
1) Split the data into a training set and a testing set.
2) Further subdivide the training set into a smaller training and a validation set.
3) Train the model with the smaller training set.
4) Evaluate the errors with the validation set.
5) Repeat a few times from the step 2).
k-fold cross-validation
‣ Subdivide the training data set into k equal parts, then use each part in turn as the validation set and the rest for training.
Leave-one-out cross-validation
‣ Leave only one observation out for validation and apply sequentially; this is more time-consuming.
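A sketch of both splitters in scikit-learn, assuming feature array X; KFold makes k equal parts, LeaveOneOut keeps one observation per validation round:

```python
from sklearn.model_selection import KFold, LeaveOneOut

kf = KFold(n_splits=5)                    # k equal parts, applied sequentially
for train_idx, val_idx in kf.split(X):
    pass                                  # train on train_idx, validate on val_idx

loo = LeaveOneOut()                       # one observation held out per round
print(loo.get_n_splits(X))                # as many rounds as observations
```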
Missing values
Ex In the case of the iris data, blank out one value in the fourth row, as in the table below (columns x1, x2, x3, x4, y).
‣ The omitted value becomes NaN. With this small data set that is not a problem, but manually finding missing values in a huge data frame is extremely inconvenient.
Line 17
• The number of missing values can be counted.
Line 18
• axis=0 is default, so the row with the NaN value is deleted.
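A sketch of the two calls referenced above, assuming a DataFrame df containing a NaN:

```python
print(df.isnull().sum())   # Line 17: count missing values per column
df_clean = df.dropna()     # Line 18: axis=0 is the default, so rows with NaN are dropped
```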
Imputation
‣ It is sometimes hard to delete a training sample or an entire column, because too much useful data would be lost. In that case, estimate the missing values from the other training samples in the data set by interpolation. The most common method is mean imputation, which replaces a missing value with the overall average of its column. In scikit-learn, use the SimpleImputer class.
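A sketch of mean imputation with SimpleImputer, assuming the DataFrame df with a NaN from above:

```python
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed = imputer.fit_transform(df.values)   # NaN replaced by the column average
```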
Line 45
• Check that the imputed value is the average of the column.
‣ The table contains both ordered and unordered features. Size is ordered while color is not; thus size is on an ordinal scale and color on a nominal scale.
Line 62
• Change the class label from strings to integers.
The encoding is done with integers, so insert the iris 'species' value.
Use the get_dummies() function of pandas to convert every unique value of a categorical variable into a new dummy variable.
The sklearn library conveniently handles one-hot encoding; the result is given as a sparse matrix in the linear-algebra sense. In a sparse matrix most entries are 0; the opposite concept is a dense matrix.
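A sketch of both encodings; the 'color' column is an assumed example of a nominal feature:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'color': ['green', 'red', 'blue']})
print(pd.get_dummies(df['color']))          # one dummy column per unique value

enc = OneHotEncoder()                        # result is a sparse matrix
sparse = enc.fit_transform(df[['color']])
print(sparse)                                # printed as (row, column) value pairs
```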
OneHotEncoder
Line 82
• Entry (0, 0) is 1, thus setosa (the rows up to index 49 are setosa).

Row index  Class       Columns 0/1/2  Sparse entry
49         setosa      1 0 0          (49, 0)
50         versicolor  0 1 0          (50, 1)
51         versicolor  0 1 0          (51, 1)
…          versicolor  0 1 0
100        virginica   0 0 1          (100, 2)
Using hold-out in practice to split the data into a training set and a test set
‣ df_wine contains measurements of wines produced in the Vinho Verde region in northwest Portugal, adjacent to the Atlantic Ocean. The grade, taste, and acidity of 1,599 red wine samples and 4,898 white wine samples were measured and analyzed to create the data. If the data cannot be found at the following path, it can be downloaded directly from the UCI repository and imported locally.
Line 85
• When the wine data set is not accessible from the UCI machine learning repository,
• uncomment the following code and read the data set from a local path:
• df_wine = pd.read_csv('wine.data', header=None)
‣ Data splitting is possible with the train_test_split function in scikit-learn's model_selection module. First, convert the features at indexes 1 to 13 to a NumPy array and assign it to variable X. train_test_split returns four arrays as a tuple, so assign them to appropriately named variables.
‣ Randomly split X and y into training and test data sets. test_size=0.3, so 30% of the sample is assigned to X_test and y_test.
‣ Regarding the stratify parameter, if the class label array y is sent, the class ratio found in the training data set and test data
set is identically maintained with the original data set.
‣ The most widely used ratios in real life are 6:4, 7:3 or 8:2 depending on the size of the data set. For large data set, it is
common and suitable to split the training data set and test data set into the ratio of 9:1 or 9.9:0.1.
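A sketch of the hold-out split described above, assuming the 13-feature wine data file referenced in the earlier comment:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df_wine = pd.read_csv('wine.data', header=None)   # or download from the UCI repository

X = df_wine.iloc[:, 1:].values   # features in columns 1..13 as a NumPy array
y = df_wine.iloc[:, 0].values    # class label in column 0
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)   # keep the class ratio
```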
# MaxAbsScaler divides each feature by its maximum absolute value, so the maximum of each feature becomes 1.
# Every feature is thereby mapped into the [-1, 1] range.
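A two-line sketch of the scaler described in the comment:

```python
from sklearn.preprocessing import MaxAbsScaler

X_scaled = MaxAbsScaler().fit_transform(X_train)   # every feature mapped into [-1, 1]
```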
[Figure: dartboard diagram of the four combinations of low/high bias and low/high variance.]
trade-off
‣ Bias and variance have a trade-off relationship: when one increases, the other falls, and vice versa. Early in learning, the model grows more complex and the overall error cost falls because bias decreases. At some point, however, as the model keeps learning and becomes still more complex, variance rises and the overall error cost increases again. In other words, the model overfits the training data. One way to prevent overfitting is therefore to stop learning at the right time. Regularization is a method that prevents overfitting by lowering variance, but it can increase bias instead because of the trade-off.
[Figure: variance and bias² plotted against model complexity.]
Ridge Regression
‣ The ridge regression model is a technique that limits the L2 norm of the regression coefficient vector w. A constraint that minimizes the sum of the squared weights is added to the cost function of linear regression. If the linear regression model is $\hat{y} = wX$ with $w = (w_1, \dots, w_M)$,
‣ then the cost function of the ridge regression model is as follows, where N is the number of data points and M the number of elements of the regression coefficient vector. The constraint is added to the existing SSE (Sum of Squared Errors).
$\hat{w}_{\mathrm{ridge}} = \operatorname*{argmin}_{w} \sum_{i=1}^{N} (y_i - wX_i)^2 + \lambda \sum_{j=1}^{M} w_j^2$
‣ λ is a hyperparameter that adjusts the weight between the existing SSE and the added constraint. When λ is large, regularization is strong and the regression coefficients shrink. When λ becomes smaller, regularization weakens, and when λ equals 0 the constraint term vanishes, which is the same as the ordinary linear regression model.
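A minimal sketch of ridge regression in scikit-learn, assuming regression arrays X_train and y_train; the alpha argument plays the role of λ above:

```python
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)     # larger alpha -> stronger regularization, smaller weights
ridge.fit(X_train, y_train)
print(ridge.coef_)           # shrunken regression coefficients
```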
Ridge Regression
‣ As a simple example, consider a linear regression model with two weights, $\hat{y} = w_1 x_1 + w_2 x_2$.
‣ Drawing the cost function SSE(w1, w2) on a plane with w1 on the x-axis and w2 on the y-axis produces an ellipse, as in the following figure.
[Figure: in the (w₁, w₂) plane, the SSE contour ellipse ('minimize cost'), the circular L2 constraint region λ∥w∥₂² ('minimize penalty'), and their meeting point ('minimize cost + penalty').]
‣ In the figure above, the solid-line ellipse is the cost function: the combinations of w1 and w2 with the same cost (SSE). The center of the ellipse is where the cost becomes 0. Moving outward from the ellipse gives combinations of w1 and w2 with higher cost, i.e., models (weight pairs w1, w2) with higher error. The colored circle is the constraint region; it shrinks as λ grows, and vice versa. The point where the cost function (ellipse) and the constraint (colored circle) meet is the optimal solution, where the cost of the ridge regression model is minimal.
Lasso Regression
‣ The Lasso (Least Absolute Shrinkage and Selection Operator) regression model is a technique that limits the L1 norm of the regression coefficient vector w. A constraint that minimizes the sum of the absolute values of the weights is added to the cost function of linear regression. The cost function of the Lasso regression model is:
$\hat{w}_{\mathrm{lasso}} = \operatorname*{argmin}_{w} \sum_{i=1}^{N} (y_i - wX_i)^2 + \lambda \sum_{j=1}^{M} |w_j|$
‣ Drawing the cost function of the Lasso regression model in the (w1, w2) plane gives a rhombus-shaped constraint region, as in the following figure.
[Figure: cost ellipses with the rhombus-shaped L1 constraint λ∥w∥₁ in the (w₁, w₂) plane; label: minimize cost + penalty.]
‣ Because the constraint region of the Lasso regression model is a rhombus, the point where it meets the cost function is very likely to be a vertex of the rhombus. The vertices of the rhombus are always points where w1 or w2 is 0, so the Lasso regression model can shrink some weights to exactly 0.
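A minimal sketch showing Lasso's exact-zero weights, under the same regression-data assumption:

```python
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.5)     # alpha plays the role of lambda
lasso.fit(X_train, y_train)
print(lasso.coef_)           # some coefficients shrink to exactly 0
```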
Elastic-net regression
‣ The Elastic-net regression model applies both the L2 norm and the L1 norm to the regression coefficient vector. The constraint combines the sum of squared weights and the sum of absolute weight values. The cost function of Elastic-net is shown below; it has two hyperparameters, λ1 and λ2.
$\hat{w}_{\mathrm{elastic}} = \operatorname*{argmin}_{w} \sum_{i=1}^{N} (y_i - wX_i)^2 + \lambda_1 \sum_{j=1}^{M} w_j^2 + \lambda_2 \sum_{j=1}^{M} |w_j|$
‣ Elastic-net applies the L2 norm and the L1 norm at the same time, so its constraint region lies between the circle and the rhombus. It shrinks large weights while driving unimportant weights to 0.
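A minimal sketch of Elastic-net in scikit-learn; note that the library parameterizes the two penalties as an overall strength alpha and a mixing ratio l1_ratio rather than separate λ1 and λ2:

```python
from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha=0.5, l1_ratio=0.5)   # equal mix of L1 and L2 penalties
enet.fit(X_train, y_train)
print(enet.coef_)
```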
Practicing
[Table: iris samples (instances/observations) with columns sepal length, sepal width, petal length, petal width, and class label.]
Line 3-1 ~ 4
• Import the library required for practicing.
Line 4-1
• Convert the variable data to a NumPy ndarray and a pandas DataFrame.
Line 1
• Merge the feature and target
Line 3 ~ 5
• Change the column name
Line 10
• Change the target value
Line 11
• Check the missing value
Line 13
• petal_length has the greatest standard deviation; petal_width has a comparatively narrow range of values. Because of the scale differences between features, it would be better to apply scaling after checking the model performance.
Line 14
• The correlation coefficient of petal_length and petal_width is 0.962865, which is extremely high. Since highly
correlated features may induce multicollinearity problems, it is recommended to select one of the two
variables to use.
Line 15
• The number of data points per target class was counted with the aggregation function 'size', confirming 50 observations in each class. Choose between 'size' and 'count' depending on the purpose of analysis: 'size' counts data including missing values while 'count' excludes them. Here there is no difference, because the iris data has no missing values.
Visualizing correlation
[Figure: correlation heatmap of sepal_length, sepal_width, petal_length, and petal_width, with a color bar from -1.00 to 1.00.]
Visualizing the correlation between features and data distribution by using pairplot
[Figure: pairplot of sepal_length, sepal_width, petal_length, and petal_width, colored by species (setosa, versicolor, virginica).]
‣ setosa is clearly separated from the other classes; it could be classified by drawing an imaginary line, i.e., with a linear model. versicolor and virginica look hard to separate with a line in the plot of the sepal_width and sepal_length features because they are mixed; however, even if somewhat vaguely, they can be separated in the other plots.
[Figure: pie chart of the target classes setosa, versicolor, and virginica at 33.3% each.]
Line 15
• The data are evenly arranged in each target class.
‣ Before starting machine learning, split the data set into training data and performance-test data. The final objective of machine learning is a generalized model that accurately predicts new data. If performance is evaluated with data that was used for learning, the model is likely to get it right simply because it is already familiar with that data. For a reliable evaluation, separate the performance-test data set from the training data set. Because part of the data is held out, this is referred to as the hold-out method.
‣ Split the training and performance-test data sets with the train_test_split function of sklearn. Label the training data 'train' and the performance-test data 'test'. X holds the features of the data set and y the target. For structured data analysis, the convention here is to name DataFrames with capital letters and Series in lower case. The test_size=0.33 option separates 33% of the total data as the test set. random_state=42 makes the results reproducible for the practice problem; without random_state, the split differs on every run.
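A sketch of the split with the options named above, assuming the iris features in X and targets in y:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)   # 33% held out, reproducible split
```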
Algorithm selection
[Figure: scikit-learn algorithm cheat-sheet flowchart (START, yes/no branches leading to estimators such as MeanShift, VBGMM, and kernel approximation).]
‣ # Gini impurity or entropy?
‣ # The difference between Gini impurity and entropy is negligible in practice; both produce similar trees.
‣ # Gini impurity is faster to compute, so it is the recommended default. When they do produce different trees, Gini impurity tends to isolate the most frequent class in one branch, while entropy tends to produce a more balanced tree.
Model learning
‣ Train the model with the train data to check model performance. The current model uses default hyperparameters except for random_state.
Score
‣ Evaluate the performance with the performance-test data set. In scikit-learn, score refers to accuracy. Since the iris data set is a well-structured practice data set, it generally shows high performance with any model.
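A minimal sketch of fit and score with the decision tree assumed in this practice:

```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=42)   # defaults except random_state
model.fit(X_train, y_train)
print(model.score(X_test, y_test))                # score = accuracy on the test set
```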
Validation set
‣ The performance-test data set split off with train_test_split is for the final performance evaluation of the model. Because model performance must also be checked during training, hold out some of the training data and use it as a validation set. The validation set reveals overfitting during learning and is also used to tune hyperparameters.
Cross validation
‣ This is a strategy that makes many validation sets so that every data point is used for validation once. Divide the data set into k folds. Use the first fold as the validation set, train on the other k-1 folds, and measure the performance.
‣ Then use the second fold as the validation set and the remaining folds for training, and measure the performance again. Repeat for all other folds so that every data point participates in training. Obtain k performance results and average them to estimate the model performance. The following figure shows the case k=5.
[Figure: cross_validation with k=5; in each of the five splits, one fold serves as the test (validation) set and the remaining four folds as the train set.]
cross_val_score
‣ Cross validation can be easily performed by using the cross_val_score function of scikit-learn.
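A sketch of cross_val_score with the model above; cv=5 is an assumed fold count:

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X_train, y_train, cv=5)
print(scores.mean())    # average of the k performance estimates
```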
stratified
‣ Randomly splitting the train and validation sets during hold-out can produce inconsistent target-class ratios. The data distribution then differs between train and validation sets, which affects learning. Machine learning rests on the premise that the distribution of the training data matches the real-world data distribution; if the premise is violated, the trained model's performance falls. To prevent this, the stratified method keeps the target-class ratio even across splits.
The following figure gives an intuitive picture of how the stratified method splits data.
[Figure: stratified cross-validation splits that preserve the class ratio in every fold.]
‣ Cross validation is possible by sending the instance of StratifiedKFold to the cv option of cross_val_score.
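A sketch of passing a StratifiedKFold instance through the cv option:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_train, y_train, cv=skf)   # class ratios preserved
print(scores)
```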
Learning Curve
‣ !pip install scikit-plot
• The green line is the cross-validation result. When the green line rises to the right and then starts to fall, overfitting is occurring. The red line is the score on the data used for training. The red line may dip momentarily when there is a lot of data; this phenomenon is temporary, and the curve converges in the long run.
• The cv option is not set, so 3-fold is applied by default. There are 100 data points in the train set and 33% of them are used for cross validation, so the maximum of the x-axis is 66. The curve is cut off while the green line is still rising, so it cannot be determined from this plot whether there is enough data. The learning curve also differs by algorithm even on the same data. What this curve does show is that the current decision tree model would perform better with more data.
[Figure: learning curve with the number of training examples on the x-axis and score on the y-axis.]
‣ If there is enough data, an identical data distribution is maintained even when the train and validation sets are split randomly; the cross-validation method is needed when data is insufficient. To determine whether there is enough data, draw a learning curve: with the number of training examples on the x-axis and the performance score on the y-axis, it shows how performance changes as the amount of training data is gradually increased. The test score is calculated by internal cross validation.
‣ The learning curve can be drawn with the scikit-plot library, which complements scikit-learn and must be installed separately.
‣ scikit-plot is not included in Anaconda by default, so install it with a package management tool by running the following code in a Jupyter notebook. Note that the install name differs from the import name.
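A sketch using scikit-plot's estimator helper; remember the install name (scikit-plot) differs from the import name (scikitplot):

```python
import matplotlib.pyplot as plt
import scikitplot as skplt

skplt.estimators.plot_learning_curve(model, X_train, y_train)
plt.show()
```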
The total number of hyperparameter combinations in the practice problem is 1,600. Since k=10 in the k-fold cross validation, 10 cross validations were performed per combination, for a total of 16,000 training runs. The following table shows the hyperparameter combinations in the practice problem.
‣ The optimal parameters and the optimized performance found by GridSearchCV are recorded in the best_params_ and best_score_ attributes.
‣ If the refit option is set to True, the model is retrained with the optimal hyperparameters and stored in the best_estimator_ attribute.
criterion                  gini  gini  gini  gini  gini  ⋯  entropy  entropy  entropy  entropy  entropy
max_depth                  4     4     4     4     4     ⋯  12       12       12       12       12
min_impurity_decrease      0     0     0     0     0     ⋯  0.2      0.2      0.2      0.2      0.2
min_weight_fraction_leaf   0     0     0     0     0     ⋯  0.3      0.3      0.3      0.3      0.3
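A sketch of GridSearchCV over the hyperparameters in the table; the value lists here are illustrative assumptions, not the full 1,600-combination grid:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': [4, 8, 12],
              'min_impurity_decrease': [0.0, 0.1, 0.2],
              'min_weight_fraction_leaf': [0.0, 0.1, 0.2, 0.3]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, cv=10, refit=True)   # k=10 cross validations per combination
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
best_model = grid.best_estimator_   # refit on the whole training set
```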
• However, suppose the test set contained 48 setosas, 1 versicolor, and 1 virginica. A model that simply predicted setosa every time would score 96% accuracy on this test set, yet not because its performance is great. Other evaluation criteria must be checked as well to evaluate the model performance accurately.
Confusion Matrix
‣ For binary classification, the confusion matrix can be expressed as follows.
‣ Evaluation scores such as precision, recall, and f1-score are built from the four concepts above (TP, FP, TN, FN).
‣ Use the confusion matrix to analyze both correctly and incorrectly predicted results; it validates performance from several angles by showing how well the predicted targets match the actual targets.
Actual versicolor: predicted setosa (wrong) | predicted versicolor (correct) | predicted virginica (wrong)
Actual virginica:  predicted setosa (wrong) | predicted versicolor (wrong)   | predicted virginica (correct)
‣ Because the iris data is a multi-class classification problem, it cannot be expressed with only the four concepts above. So create three indexes, one each for setosa, versicolor, and virginica, by treating each as a binary classification problem. Take setosa, for example.
‣ With scikit-learn, the confusion matrix is easily calculated with the confusion_matrix function; pass the actual classes first and the predicted classes second.
Confusion Matrix
[Figure: 3×3 confusion matrix heatmap with true labels setosa, versicolor, and virginica.]
‣ Score the evaluation results based on the TP, TN, FP, and FN of the confusion matrix. These four concepts apply directly only to binary classification problems. For multi-class problems with N target classes, such as the iris data, treat each target class as a binary classification and obtain N confusion matrices.
Ex Iris data
Treat each of setosa, versicolor, and virginica as a binary classification problem and create three confusion matrices. The following shows the confusion matrix for setosa.
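A sketch of the call order described above (actual classes first, then predicted):

```python
from sklearn.metrics import confusion_matrix

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))   # one row per actual class
```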
precision
‣ Precision is the ratio of correct predictions among all observations predicted as the target class.

$\text{precision} = \dfrac{TP}{TP + FP}$
(f"{target}precision:{score}")
Line 45
• In multi-class classification, average cannot be “binary”.
• “binary” is the default value of the average parameter.
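A sketch of the averaging requirement; average='macro' is one common choice among the non-binary options:

```python
from sklearn.metrics import precision_score

# the default average='binary' raises an error for the 3-class iris targets
print(precision_score(y_test, y_pred, average='macro'))
```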
recall
‣ Also called sensitivity, recall is the ratio of correct predictions among all observations of the actual target class.

$\text{recall} = \dfrac{TP}{TP + FN}$
(f"{target}sensitivity:{score}")
fall-out
‣ Fall-out is the ratio of incorrect predictions among the observations that are actually not the target class. It is also expressed as 1 - specificity.

$\text{fall-out} = \dfrac{FP}{FP + TN}$
‣ scikit-learn does not provide a function for calculating fall-out directly.
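A sketch of computing fall-out by hand from a binary confusion matrix, assuming binary label arrays y_true_bin and y_pred_bin:

```python
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_true_bin, y_pred_bin).ravel()
fall_out = fp / (fp + tn)   # FPR = 1 - specificity
```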
f-score
‣ Precision and recall have a trade-off relationship. The f-score is the weighted harmonic mean of precision and recall: if β is less than 1, more weight is given to precision, and if β is greater than 1, more weight is given to recall. The f-score is used to assess model performance accurately when the data classes are imbalanced.

$F_\beta = (1 + \beta^2) \cdot \dfrac{\text{precision} \times \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$

‣ To weight precision and recall evenly, β is usually set to 1, which is specifically referred to as the f1-score.

$F_1 = 2 \cdot \dfrac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$
‣ # F1 measure: precision and recall are weighted equally; the F1 score is the harmonic mean of precision and recall (sensitivity).
‣ # With a = precision and b = recall: 2ab / (a + b).
‣ # F0.5 measure: precision is weighted more than recall (recall receives half the weight of precision).
‣ # F2 measure: recall is weighted more (recall counts twice as much as precision).
f-score
print(f"{target} fbeta score: {score}")
print(f"{target} f1 score: {score}")
accuracy
$\text{accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$
classification_report
‣ Use the classification_report function of scikit-learn to batch calculate precision, recall, and f1-score.
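A sketch of the batch report:

```python
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))   # precision, recall, f1-score per class
```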
ROC curve
‣ ROC curve has TPR (True Positive Rate) on the y-axis, and FPR(False Positive Rate) on the x-axis.
TPR is recall, and FPR refers to fall-out.
$TPR = \dfrac{TP}{TP + FN} \qquad FPR = \dfrac{FP}{FP + TN}$
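A sketch of obtaining the curve points for a binary problem, assuming a fitted classifier clf with predict_proba and binary labels y_test_bin:

```python
from sklearn.metrics import roc_curve

probs = clf.predict_proba(X_test)[:, 1]                # score for the positive class
fpr, tpr, thresholds = roc_curve(y_test_bin, probs)    # FPR (x) and TPR (y) per threshold
```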
[Figure: ROC curves with the true positive rate on the y-axis and the false positive rate on the x-axis.]
Line 55
• Import the model.
Line 56
• Make the final prediction.
Line 57
• Save the result to CSV.
Machine Learning: supervised learning (the target pattern is given)
Supervised learning, as the name indicates, involves a supervisor acting as a teacher. Basically, supervised learning means teaching or training the machine with data that is well labelled, i.e., already tagged with the correct answer. The machine is then given a new set of examples (data) so that the supervised learning algorithm analyses the training data (the set of training examples) and produces a correct outcome from the labelled data.
⚫ Disadvantages:
− Classifying big data can be challenging.
− Training for supervised learning needs a lot of computation time, so it requires a lot of time.
Machine Learning: unsupervised learning (the target pattern must be found)
Clustering types:
‣ Exclusive (partitioning)
‣ Agglomerative
‣ Overlapping
‣ Probabilistic
Machine Learning
‣ Numeric Y → regression. Ex Find the price of a home with 2,500 sq ft area, 4 bedrooms, 5 years old: 859,554.7945.
‣ Categorical Y → classification.