PE 515 CS – DATA SCIENCE
UNIT – V
04.01.2023
Performance Measure
Logistic regression implementation in R
K- Nearest Neighbors (KNN)
K- Nearest Neighbors Implementation in R
Clustering: K-Means Algorithm
K-Means Implementation in R
Case Studies of Data Science Application
o Weather Forecasting
o Stock Market Prediction
o Object Recognition
o Real Time Sentiment Analysis
Performance measures
Result of a case study – classification of cars based on certain attributes
2 classes are Hatchback and SUV
Confusion matrix – compares what was predicted by the classifier
(Prediction) against the actual class (Reference / true condition)
Number of data points used = 20
              Reference
Prediction    Hatchback   SUV
Hatchback        10         1
SUV               0         9
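A minimal sketch of building this confusion matrix by hand in R (the entries are taken from the table above):
# Case-study confusion matrix: rows = prediction, columns = reference
conf <- matrix(c(10, 0, 1, 9), nrow = 2,
               dimnames = list(Prediction = c("Hatchback", "SUV"),
                               Reference = c("Hatchback", "SUV")))
conf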
Example: True Positive
o 1st word (True / False) – refers to the truth / non-truth of the prediction,
i.e. whether the prediction matches the actual class
o 2nd word (Positive / Negative) – relates to the prediction, i.e. the class
that was predicted
True Positive – success of the classifier (Power)
False positive – mistake of the classifier (Type I error)
True Negative – success of the classifier
False Negative – mistake of the classifier (Type II error)
Perfect classifier – all elements other than the diagonal are ‘0’.
Measures of Performance
Terminology
TP – True Positive (Correct Identification of Positive Labels)
TN – True Negative (Correct Identification of Negative Labels)
FP – False Positive (Incorrect Identification of Positive Labels)
FN – False Negative (Incorrect Identification of Negative Labels)
Total Samples, N = TP + TN + FP + FN
1) Accuracy – overall effectiveness of a classifier, A = (TP + TN) / N
Maximum value that accuracy can take is 1
This happens when the classifier separates the two groups exactly (i.e.
FP = 0 & FN = 0)
Total number of actual positive labels = TP + FN
Total number of actual negative labels = TN + FP
2) Sensitivity – effectiveness of a classifier to identify positive labels,
Se=TP/(TP+FN)
3) Specificity – effectiveness of a classifier to identify negative labels,
Sp= TN/(FP+TN)
Both Se & Sp lie between 0 & 1, 1 is an ideal value for each of them
4) Balanced Accuracy – BA = (Sensitivity + Specificity) / 2
5) Prevalence – How often does the yes condition actually occur in our
sample, P = (TP+FN)/N
6) Positive Predictive Value – Proportion of correct results in labels
identified as positive,
PPV = (sensitivity * prevalence) / ((sensitivity * prevalence) +
(1-specificity)*(1-prevalence))
7) Negative Predictive Value – Proportion of correct results in labels
identified as negative
NPV = (specificity * (1-prevalence)) / ((1-sensitivity)*prevalence +
(specificity)*(1-prevalence))
8) Detection Rate = TP / N
9) Detection Prevalence – Prevalence of Predicted Events, DP = (TP + FP) / N
10) The Kappa Statistic (or value) is a metric that compares an
Observed Accuracy with Expected Accuracy (random chance)
Kappa = (observed accuracy – expected accuracy) / (1 – expected
accuracy)
Observed Accuracy, OA = (a + d) / N
Expected Accuracy, EA = ((a+c)(a+b) + (b+d)(c+d)) / N²
where a, b, c and d are TP, FP, FN and TN respectively
In the confusionMatrix() output, the first factor level in the result is the
positive label and the second is the negative label
Refer to the final outcome for confirmation: “Positive Class” –
Hatchback
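A sketch computing the measures above from the case-study confusion matrix (taking Hatchback as the positive class: TP = 10, FP = 1, FN = 0, TN = 9):
# Performance measures from the confusion matrix entries
TP <- 10; FP <- 1; FN <- 0; TN <- 9
N <- TP + TN + FP + FN                      # 20
accuracy    <- (TP + TN) / N                # 0.95
sensitivity <- TP / (TP + FN)               # 1
specificity <- TN / (FP + TN)               # 0.9
balAccuracy <- (sensitivity + specificity) / 2
prevalence  <- (TP + FN) / N                # 0.5
ppv <- (sensitivity * prevalence) /
       (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
npv <- (specificity * (1 - prevalence)) /
       ((1 - sensitivity) * prevalence + specificity * (1 - prevalence))
detectionRate <- TP / N
detectionPrev <- (TP + FP) / N
# Kappa: observed vs. expected (chance) accuracy
OA <- (TP + TN) / N
EA <- ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / N^2   # 0.5
kappa <- (OA - EA) / (1 - EA)               # 0.9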
Which measure is most important depends on the kind of application
ROC – Receiver Operating Characteristics
Originally developed and used in signal detection theory
ROC graph:
Sensitivity as a function of 1 − specificity:
Sensitivity (y-axis) and 1 − specificity (x-axis)
Best value for sensitivity and specificity is 1.
For single classifier:
ROC can be used to
o see the classifier performance at different threshold levels (from 0
to 1)
o AUC – Area under the ROC
An area of 1 represents a perfect test; an area of 0.5
represents a worthless model
0.90 – 1 = excellent
0.80 – 0.90 = good
0.70 – 0.80 = fair
0.60 – 0.70 = poor
o If AUC < 0.5, check whether your labels are marked the opposite way
(positive and negative swapped)
For several classifiers:
ROC can be used to
o Compare different classifiers at one threshold or over all threshold
levels
o Performance in the illustration: Model 3 > Model 2 > Model 1
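A hedged sketch of drawing an ROC curve and computing the AUC in R. This assumes the pROC package (not part of the lecture code) and two hypothetical vectors: 'labels' (the true classes) and 'scores' (predicted probabilities, e.g. from predict(..., type = 'response')):
# install.packages("pROC")
library(pROC)
# labels and scores are placeholders for your own data
rocObj <- roc(response = labels, predictor = scores)  # build the ROC object
plot(rocObj)                                          # ROC curve
auc(rocObj)                                           # area under the curve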
05.01.2023
Logistic Regression implementation in R
Case study
o Problem statement
Solve the case study using R
o Read the data from a .csv file
o Understand the data
o glm() function
o Interpret the results
Key points
Logistic regression is primarily used as a classification algorithm
It is a supervised learning algorithm
o Data is labeled
Parametric Approach
The decision boundary is derived based on the probability interpretation
Decision boundary can be linear or nonlinear
The probabilities are modeled as a sigmoidal function
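A quick sketch of that sigmoidal shape in base R (illustration only):
# The logistic (sigmoid) function squashes any real score into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))
curve(sigmoid, from = -6, to = 6,
      xlab = "linear predictor", ylab = "probability")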
Automotive Crash Testing
Problem statement – a crash test is a form of destructive testing that is
performed in order to ensure high safety standards for various cars
Several cars have rolled into an independent audit unit for crash tests
They are being evaluated on a defined scale {poor (-10) to excellent (10)}
on the following parameters:
1. Manikin head impact
2. Manikin body impact
3. Interior impact
4. HVAC (heating, ventilation & air conditioning) impact
5. Safety alarm system
Each crash test is very expensive to perform
The crash test was performed for only 100 cars
Type of a car – Hatchback / SUV, was noted
However, with this data they should be able to predict the type of
a car in future
Part of the data is reserved for building a model and the remainder is
kept for testing
Data for 80 cars is given in crashTest_1.csv
Data for remaining 20 cars is given in crashTest_1_TEST.csv
Use logistic regression classification technique to classify the car types
as hatchback / SUV
Solution to Case Study Using R
Setting working directory, clearing variables in the workspace
Installing or loading required packages
# set the working directory as the directory which contains the data files
# setwd("path of the directory with data files")
rm(list = ls())   # to clear the environment
# install.packages("caret", dependencies = TRUE)
library(caret)    # for confusionMatrix()
Reading the data
Data for this case study is provided in files named crashTest_1.csv
(Training Data) & crashTest_1_TEST.csv (Testing Data)
To read the data from a .csv file, use read.csv() function
read.csv() – reads a file in a table format and creates a data frame from
it
Syntax – read.csv(file, row.names=1)
o file – the name of the file which the data are to be read from. Each
row of the table appears as one line of the file.
o row.names – a vector of row names. This can be a vector giving
the actual row names, or a single number giving the column of the
table which contains the row names, or character string giving the
name of the table column containing the row names.
# Reading the data
crashTest_1 <- read.csv("crashTest_1.csv", row.names = 1)
crashTest_1_TEST <- read.csv("crashTest_1_TEST.csv", row.names = 1)
Viewing the data
View(crashTest_1)
Understanding the data
crashTest_1 contains 80 observations of 6 variables
crashTest_1_TEST contains 20 observations of 6 variables
The variables are manikin head impact, manikin body impact, interior
impact, HVAC impact and safety alarm system
First five columns are the details about the car and the last column is the
label which says whether the car type is Hatchback or SUV
Structure of the data
Variables and their data types
str()
syntax – str(object)
object – any R object about which you want to have some information
Summary of the data
Summary of data – The function invokes particular methods which
depend on the class of the first argument
summary()
o summary gives a 5 point summary for numeric attributes in the
data
Syntax
o summary (object)
object - any R object about which you want to have some information
Summary of training data
Summary of test data
glm()
glm(formula, data, family)
Arguments:
formula – an object of class "formula" (or one that can be coerced to
that class): a symbolic description of the model to be fitted
data – data frame containing the variables
family – a description of the error distribution and link function to be
used in the model.
For glm, this can be a character string naming a family function, a
family function or the result of a call to a family function. In particular,
family = 'binomial' corresponds to logistic regression
Building a logistic regression model
# Model
logisfit <- glm(formula = CarType ~ ., family = 'binomial',
                data = crashTest_1)
Modeled the probability as a sigmoidal function
Log odds ratio:
The odds ratio is p(X) / (1 − p(X)), the probability of success divided by
the probability of failure, where p(X) = probability of success and
1 − p(X) = probability of failure
log(p(X) / (1 − p(X))) = β0 + β1X1 + … + βpXp
The decision boundary is the hyperplane equation
β0 + β1X1 + … + βpXp = 0, i.e. where p(X) = 0.5
Two degrees-of-freedom values are reported:
1st – for the NULL model (intercept only – Reduced Model): 80 – 1 = 79
2nd – with all the variables included in the modeling (Full Model): 80 – 6 = 74
Summary of model
Fisher Scoring iterations (MLE), no. of iterations = 25
Finding the Odds
predict()
syntax: predict(object)
# Finding the odds
logisTrain <- predict(logisfit, type = 'response')   # by default – training set
predict() with type = 'response' gives probabilities;
by default (without it) predict() returns log(odds)
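The two outputs are related through the logistic function; a quick check using plogis(), which is base R:
# log odds -> probabilities via the logistic function
logOdds <- predict(logisfit)                     # default: log(odds)
probs   <- predict(logisfit, type = 'response')  # probabilities
all.equal(plogis(logOdds), probs)                # TRUE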
Plotting the probabilities
plot(logisTrain)
Classes are well separated
Which side belongs to which car type?
Identifying probabilities associated with the Car Type
Mean of probabilities
This helps us identify the probabilities associated with the two classes
tapply(logisTrain, crashTest_1$CarType, mean)
Low probabilities are associated with car type ‘Hatchback’;
higher probabilities are associated with car type ‘SUV’
Predicting on test data
# Predicting on test data
logispred <- predict(logisfit, newdata = crashTest_1_TEST, type = 'response')
plot(logispred)
“logispred” is the output which has the probability values
Results
Each test point is classified as Hatchback or SUV by setting a
threshold of 0.5 on the predicted probability
crashTest_1_TEST[logispred <= 0.5, "LogisPred"] <- "Hatchback"
crashTest_1_TEST[logispred > 0.5, "LogisPred"] <- "SUV"
Predicted results are stored in the 7th column (LogisPred)
Confusion Matrix
confusionMatrix(table(crashTest_1_TEST[, 7], crashTest_1_TEST[, 6]),
positive = 'Hatchback')
10.01.2023
k – Nearest Neighbors (kNN)
Simple and powerful classification algorithm
k – Nearest Neighbors (kNN) is a non-parametric method used for classification
It is a lazy learning algorithm where all computation is deferred until
classification
It is also an instance based learning algorithm where the function is
approximated locally
In logistic regression, seen earlier, the training data was used to fit a hyperplane;
the estimated parameters were then used to predict the test data
In kNN, the data itself is used, not estimated parameters; the number of
neighbours is used instead, which is a tuning parameter for the algorithm
(it is not a parameter derived from the data)
Distinction between parametric and non-parametric:
In logistic regression, predicting data is not possible without the parameters
But in kNN, it is possible using the data alone
No prior training work is required for doing classification using kNN (a key
difference between logistic regression and kNN)
Why kNN and when does one use it?
Why kNN?
Simplest of all classification algorithms and easy to implement
There is no explicit training phase and the algorithm does not perform any
generalization of the trained data (no optimization required)
When does one use this kNN algorithm?
When there are nonlinear decision boundaries between classes and when the
amount of data is large
With more data the computation also becomes more demanding, but there are
many ways to address this issue.
Input features
input features can be both quantitative and qualitative
Outputs
Outputs are categorical values, which typically are the classes of the data
kNN predicts a categorical value using the majority vote of the nearest neighbours
Assumptions
Being nonparametric, the algorithm does not make any assumptions about the
underlying data distribution
Select the parameter ‘k’ based on the data
Requires a distance metric to define proximity between any two data points
Example: Euclidean distance, Mahalanobis distance or Hamming distance
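A small sketch of such distance computations in base R (dist() and mahalanobis() are both in the stats package; the numbers here are hypothetical, for illustration):
# Euclidean distance between two points (1, 2) and (4, 6)
x <- matrix(c(1, 2,
              4, 6), nrow = 2, byrow = TRUE)
dist(x, method = "euclidean")               # sqrt(3^2 + 4^2) = 5
# mahalanobis() returns the *squared* distance and needs a centre and a
# covariance matrix (identity here, for illustration)
mahalanobis(c(4, 6), center = c(1, 2), cov = diag(2))   # 25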
Algorithm
The kNN classification is performed using the following four steps
1. Compute the distance matrix between the test data point and all the
labeled data points
2. Order the labeled data points in the increasing order of the distance metric
3. Select the top ‘k’ labeled data points and look at the class labels
4. Find the class label that the majority of these ‘k’ labeled data points have
and assign it to the test data point
Easy to solve multiclass problems using kNN
New test point xnew
Calculate the distances from xnew to every labelled point (refer to the
illustration below); a distance of ‘0’ corresponds to xnew itself
If k = 3, find the 3 smallest distances
If all the data points are from class 1, then xnew is also from class 1
If all the data points are from class 2, then xnew is also from class 2
If 2 belong to class 1 & 1 belongs to class 2, then by majority vote
xnew is assigned to class 1
If 2 belong to class 2 & 1 belongs to class 1, then by majority vote
xnew is assigned to class 2
The algorithm can be used, with minor modification, for function
approximation (regression)
If k=5, then take the first 5 points and then look for the majority votes,
and then assign this class to the new test data point.
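A minimal from-scratch sketch of these four steps, assuming numeric features and Euclidean distance (illustration only; in practice use knn() from the 'class' package, as shown later):
# 1. distances, 2./3. sort and take top k, 4. majority vote
knnPredict <- function(train, labels, xnew, k = 3) {
  d <- sqrt(rowSums((train - matrix(xnew, nrow(train), ncol(train),
                                    byrow = TRUE))^2))   # step 1
  topk <- labels[order(d)[1:k]]                          # steps 2 & 3
  names(which.max(table(topk)))                          # step 4
}
# e.g. knnPredict(as.matrix(iris[, 1:2]), iris$Species, c(6.0, 3.0), k = 5)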
Illustration of kNN
There is a possibility of data points getting misclassified in the region
where there is a mix of the data points (as shown in the figure below)
Farther away from this mixed region, the misclassification problem lessens
Note that no explicit boundary was defined here
This algorithm can be used effectively for problems with complicated
nonlinear boundaries
Illustration of kNN (Testing)
No label for test data (yellow color data point)
In the next image, when k=3, the new test data point (yellow) will get
the label (red – class 2)
In the next image, when k=5, the new test data point (yellow) will get
the label (blue – class 1)
Things to consider
Following are some things one should consider before applying kNN
algorithm
o Parameter selection
o Presence of noise
o Feature selection and scaling
o Curse of dimensionality
Parameter selection
o The best choice of ‘k’ depends on the data
o Larger values of ‘k’ reduces the effect of noise on classification
but makes the decision boundaries between classes less
distinct
o Smaller values of ‘k’ tend to be affected by noise, though they give
clearer separation between classes
Feature selection and scaling
o It is important to remove irrelevant features
o When the number of features is too large, and suspected to be
highly redundant, feature extraction is required
o If the features are carefully chosen then it is expected that the
classification will be better
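A hedged sketch of feature scaling before kNN, with hypothetical data frames trainData / testData (kNN is distance based, so features on larger numeric scales would otherwise dominate the distance):
trainX <- scale(trainData)   # centre and scale each column
# scale the test data with the *training* means and SDs
testX <- scale(testData,
               center = attr(trainX, "scaled:center"),
               scale  = attr(trainX, "scaled:scale"))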
28.01.2023
K- nearest neighbours implementation in R
Case Study
o Problem statement
Solve the Case Study using R
o Read the data from a .csv file
o Understand the data
o knn() function
o interpret the results
Key points
o knn is primarily used as a classification algorithm
o it is a supervised learning algorithm
data is labeled
o Non-parametric method
o No explicit training phase is involved
o Lazy learning algorithm
o Notion of distance is needed
o Majority voting method
Automotive Service Company: A Case Study
Problem statement
o An automotive service chain is launching its new grand service
station this weekend. They offer to service a wide variety of
cars. The current capacity of the station is to check 315 cars
thoroughly per day.
o As an inaugural offer, they claim to freely check all cars that
arrive on their launch day, and report whether they need
servicing or not.
o Unexpectedly, they got 450 cars. The servicemen won’t work
longer than the working hours, but the data analysts have to.
o Can you save the day for the new service station?
How can a data scientist save a day for them?
o “serviceTrainData.csv” – a dataset which contains some
attributes of a car that can be measured easily and quickly,
together with whether that car needs servicing or not.
o Now for the cars they cannot check in detail, they measure
those attributes – “serviceTestData.csv”
o knn classification technique is used to classify the cars which
cannot be tested manually and to say whether service is
needed or not.
Getting things ready
o Setting working directory, clearing variables in the workspace
o Installing or loading required packages
# knn implementation in R
# set the working directory as the directory which contains the data files
# setwd("Path of the directory with data files")
rm(list = ls())   # to clear the environment
# install.packages("caret", dependencies = TRUE)
# install.packages("class", dependencies = TRUE)
library(caret)   # for confusionMatrix()
library(class)   # for knn()
Reading the data
Data for this case study is provided in files named
o serviceTrainData.csv
o serviceTestData.csv
To read the data from a .csv file, the read.csv() function is used
read.csv() function – reads a file in table format and creates a data
frame from it
Syntax – read.csv(file, row.names)
file – the name of the file which the data are to be read from. Each
row of the table appears as one line of the file.
row.names – a vector of row names. This can be a vector giving the
actual row names, or a single number giving the column of the table
which contains the row names, or character string giving the name of
the table column containing the row names.
# reading the data
ServiceTrain <- read.csv("serviceTrainData.csv")
ServiceTest <- read.csv("serviceTestData.csv")
Viewing the data
Understanding the data
o ServiceTrain contains 315 observations of 6 variables
o ServiceTest contains 135 observations of 6 variables
o The variables are:
1. OilQual
2. Engineperf
3. NormMileage
4. TyreWear
5. HVACwear
6. Service
o First five columns are the details about the car and last column
is the label which says whether a service is needed or not
Structure of the data
o Structure of data
Variables and their data types
o str() – compactly displays the internal structure of an ‘R’ object
o Syntax – str(object)
o Object – any R object about which you want to have some
information
Summary of the data
o Summary of data
The function invokes particular methods which depend on
the class of the first argument
o summary()
o gives a 5 point summary for numeric attributes in the data
o Syntax – summary(object)
o Object – any R object about which you want to have some
information
Implementation of k-nearest neighbours: knn()
knn(train, test, cl, k = 1)
Arguments
train Matrix or data frame of training set cases
test Matrix or data frame of test set cases.
A vector will be interpreted as a row vector for a single
case
cl Factor of true classifications of training set
k Number of neighbours considered
Applying knn algorithm on data
# Applying the k-NN algorithm
# k-nearest neighbour is a lazy algorithm and can do prediction directly
# with the testing dataset. The command knn() accepts the training and
# testing datasets; the class variable of interest, i.e. the outcome
# categorical variable, is provided via the parameter "cl". Parameter "k"
# specifies the number of nearest neighbours required.
predictedknn <- knn(train = ServiceTrain[, -6],
                    test = ServiceTest[, -6],
                    cl = ServiceTrain$Service,
                    k = 3)
o ServiceTrain[, -6] gives all columns of ServiceTrain except the
last column
o ServiceTest[, -6] gives all columns of ServiceTest except the
last column
o ServiceTrain$Service gives the last column of training data as a
classification factor to the algorithm
Results: Generating Confusion Matrix Manually
# Command to develop and print a confusion matrix
conf_matrix = table(predictedknn, ServiceTest[, 6])
conf_matrix

predictedknn  No Yes
         No   99   0
         Yes   0  36
# A measure of accuracy is calculated by summing the true positives and
true negatives and dividing them by total number of samples
knn_accuracy = sum(diag(conf_matrix))/nrow(ServiceTest)
> knn_accuracy
[1] 1
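The same result can also be obtained with caret (loaded earlier), mirroring the logistic regression case study; here ‘Yes’ is taken as the positive class, which is an assumption for illustration:
confusionMatrix(table(predictedknn, ServiceTest[, 6]), positive = "Yes")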
Conclusion
o read.csv() can be used to read data from .csv files
o str() function gives the data types of each attribute in the given R
object
o summary() – provides a summary of R objects
o k-nearest neighbours is a supervised learning technique – it needs
labeled data
o In R, the knn algorithm can be implemented using knn()