PE 515 CS – DATA SCIENCE
UNIT – V
04.01.2023
Performance Measure
Logistic regression implementation in R
K- Nearest Neighbors (KNN)
K- Nearest Neighbors Implementation in R
Clustering: K-Means Algorithm
K-Means Implementation in R
Case Studies of Data Science Application
o Weather Forecasting
o Stock Market Prediction
o Object Recognition
o Real Time Sentiment Analysis
Performance measures
Result of a case study – classification of cars based on certain attributes
2 classes are Hatchback and SUV
Confusion matrix – compares what was predicted by the classifier
(Prediction) against the actual class (Reference / true condition)
Number of data points used = 20
              Reference
Prediction    Hatchback   SUV
Hatchback        10         1
SUV               0         9
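A minimal sketch of building this confusion matrix by hand in R (the entries are taken from the table above):
# Case-study confusion matrix: rows = prediction, columns = reference
conf <- matrix(c(10, 0, 1, 9), nrow = 2,
               dimnames = list(Prediction = c("Hatchback", "SUV"),
                               Reference = c("Hatchback", "SUV")))
conf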
Example: True Positive
o 1st word (True / False) – refers to the truth / non-truth of the prediction,
i.e. whether the prediction matches the actual class
o 2nd word (Positive / Negative) – relates to the prediction, i.e. the class
that was predicted
True Positive – success of the classifier (Power)
False positive – mistake of the classifier (Type I error)
True Negative – success of the classifier
False Negative – mistake of the classifier (Type II error)
Perfect classifier – all elements other than the diagonal are ‘0’.
Measures of Performance
Terminology
TP – True Positive (Correct Identification of Positive Labels)
TN – True Negative (Correct Identification of Negative Labels)
FP – False Positive (Incorrect Identification of Positive Labels)
FN – False Negative (Incorrect Identification of Negative Labels)
Total Samples, N = TP + TN + FP + FN
1) Accuracy – overall effectiveness of a classifier, A = (TP + TN) / N
Maximum value that accuracy can take is 1
This happens when the classifier separates the two groups exactly (i.e.
FP = 0 & FN = 0)
Total number of actual positive labels = TP + FN
Total number of actual negative labels = TN + FP
2) Sensitivity – effectiveness of a classifier to identify positive labels,
Se=TP/(TP+FN)
3) Specificity – effectiveness of a classifier to identify negative labels,
Sp= TN/(FP+TN)
Both Se & Sp lie between 0 & 1, 1 is an ideal value for each of them
4) Balanced Accuracy – BA = (Sensitivity + Specificity) / 2
5) Prevalence – How often does the yes condition actually occur in our
sample, P = (TP+FN)/N
6) Positive Predictive Value – Proportion of correct results in labels
identified as positive,
PPV = (sensitivity * prevalence) / ((sensitivity * prevalence) +
(1-specificity)*(1-prevalence))
7) Negative Predictive Value – Proportion of correct results in labels
identified as negative
NPV = (specificity * (1-prevalence)) / ((1-sensitivity)*prevalence +
(specificity)*(1-prevalence))
8) Detection Rate = TP / N
9) Detection Prevalence – Prevalence of Predicted Events, DP = (TP + FP) / N
10) The Kappa Statistic (or value) is a metric that compares an
Observed Accuracy with Expected Accuracy (random chance)
Kappa = (observed accuracy – expected accuracy) / (1 – expected
accuracy)
Observed Accuracy, OA = (a + d) / N
Expected Accuracy, EA = ((a+c)(a+b) + (b+d)(c+d)) / N²
where a, b, c and d are TP, FP, FN and TN respectively
In the confusionMatrix() output, the first factor level in the result is the
positive label and the second is the negative label
Refer to the final outcome for confirmation: “Positive Class” –
Hatchback
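A sketch computing the measures above from the case-study confusion matrix (taking Hatchback as the positive class: TP = 10, FP = 1, FN = 0, TN = 9):
# Performance measures from the confusion matrix entries
TP <- 10; FP <- 1; FN <- 0; TN <- 9
N <- TP + TN + FP + FN                      # 20
accuracy    <- (TP + TN) / N                # 0.95
sensitivity <- TP / (TP + FN)               # 1
specificity <- TN / (FP + TN)               # 0.9
balAccuracy <- (sensitivity + specificity) / 2
prevalence  <- (TP + FN) / N                # 0.5
ppv <- (sensitivity * prevalence) /
       (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
npv <- (specificity * (1 - prevalence)) /
       ((1 - sensitivity) * prevalence + specificity * (1 - prevalence))
detectionRate <- TP / N
detectionPrev <- (TP + FP) / N
# Kappa: observed vs. expected (chance) accuracy
OA <- (TP + TN) / N
EA <- ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / N^2   # 0.5
kappa <- (OA - EA) / (1 - EA)               # 0.9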
Which measure is most important depends on the kind of application
ROC – Receiver Operating Characteristics
Originally developed and used in signal detection theory
ROC graph:
Sensitivity as a function of 1 − specificity:
Sensitivity (y-axis) and 1 − specificity (x-axis)
Best value for sensitivity and specificity is 1.
For single classifier:
ROC can be used to
o see the classifier performance at different threshold levels (from 0
to 1)
o AUC – Area under the ROC
An area of 1 represents a perfect test; an area of 0.5
represents a worthless model
0.90 – 1 = excellent
0.80 – 0.90 = good
0.70 – 0.80 = fair
0.60 – 0.70 = poor
o If AUC < 0.5, check whether your labels are marked the opposite way
(positive and negative swapped)
For several classifiers:
ROC can be used to
o Compare different classifiers at one threshold or over all threshold
levels
o Performance in the illustration: Model 3 > Model 2 > Model 1
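A hedged sketch of drawing an ROC curve and computing the AUC in R. This assumes the pROC package (not part of the lecture code) and two hypothetical vectors: 'labels' (the true classes) and 'scores' (predicted probabilities, e.g. from predict(..., type = 'response')):
# install.packages("pROC")
library(pROC)
# labels and scores are placeholders for your own data
rocObj <- roc(response = labels, predictor = scores)  # build the ROC object
plot(rocObj)                                          # ROC curve
auc(rocObj)                                           # area under the curve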
05.01.2023
Logistic Regression implementation in R
Case study
o Problem statement
Solve the case study using R
o Read the data from a .csv file
o Understand the data
o glm() function
o Interpret the results
Key points
Logistic regression is primarily used as a classification algorithm
It is a supervised learning algorithm
o Data is labeled
Parametric Approach
The decision boundary is derived based on the probability interpretation
Decision boundary can be linear or nonlinear
The probabilities are modeled as a sigmoidal function
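A quick sketch of that sigmoidal shape in base R (illustration only):
# The logistic (sigmoid) function squashes any real score into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))
curve(sigmoid, from = -6, to = 6,
      xlab = "linear predictor", ylab = "probability")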
Automotive Crash Testing
Problem statement – a crash test is a form of destructive testing that is
performed in order to ensure high safety standards for various cars
Several cars have rolled into an independent audit unit for crash tests
They are being evaluated on a defined scale {poor (-10) to excellent (10)}
on the following parameters:
1. Manikin head impact
2. Manikin body impact
3. Interior impact
4. HVAC (heating, ventilation & air conditioning) impact
5. Safety alarm system
Each crash test is very expensive to perform
The crash test was performed for only 100 cars
Type of a car – Hatchback / SUV, was noted
However, with this data they should be able to predict the type of
a car in future
Part of the data is reserved for building a model and the remainder is
kept for testing
Data for 80 cars is given in crashTest_1.csv
Data for remaining 20 cars is given in crashTest_1_TEST.csv
Use logistic regression classification technique to classify the car types
as hatchback / SUV
Solution to Case Study Using R
Setting working directory, clearing variables in the workspace
Installing or loading required packages
# set the working directory as the directory which contains the data files
# setwd("path of the directory with data files")
rm(list = ls())   # to clear the environment
# install.packages("caret", dependencies = TRUE)
library(caret)    # for confusionMatrix()
Reading the data
Data for this case study is provided in files named crashTest_1.csv
(Training Data) & crashTest_1_TEST.csv (Testing Data)
To read the data from a .csv file, use read.csv() function
read.csv() – reads a file in a table format and creates a data frame from
it
Syntax – read.csv(file, row.names=1)
o file – the name of the file which the data are to be read from. Each
row of the table appears as one line of the file.
o row.names – a vector of row names. This can be a vector giving
the actual row names, or a single number giving the column of the
table which contains the row names, or character string giving the
name of the table column containing the row names.
# Reading the data
crashTest_1 <- read.csv("crashTest_1.csv", row.names = 1)
crashTest_1_TEST <- read.csv("crashTest_1_TEST.csv", row.names = 1)
Viewing the data
View(crashTest_1)
Understanding the data
crashTest_1 contains 80 observations of 6 variables
crashTest_1_TEST contains 20 observations of 6 variables
The variables are manikin head impact, manikin body impact, interior
impact, HVAC impact and safety alarm system
First five columns are the details about the car and the last column is the
label which says whether the car type is Hatchback or SUV
Structure of the data
Variables and their data types
str()
syntax – str(object)
object – any R object about which you want to have some information
Summary of the data
Summary of data – The function invokes particular methods which
depend on the class of the first argument
summary()
o summary gives a 5 point summary for numeric attributes in the
data
Syntax
o summary (object)
object - any R object about which you want to have some information
Summary of training data
Summary of test data
glm()
glm(formula, data, family)
Arguments:
formula – an object of class "formula" (or one that can be coerced to
that class): a symbolic description of the model to be fitted
data – data frame containing the variables
family – a description of the error distribution and link function to be
used in the model.
For glm, this can be a character string naming a family function, a
family function or the result of a call to a family function. In particular,
family = 'binomial' corresponds to logistic regression
Building a logistic regression model
# Model
logisfit <- glm(formula = CarType ~ ., family = 'binomial',
                data = crashTest_1)
Modeled the probability as a sigmoidal function
Log odds ratio:
The odds ratio is p(X) / (1 − p(X)), the probability of success divided by
the probability of failure, where p(X) = probability of success and
1 − p(X) = probability of failure
log(p(X) / (1 − p(X))) = β0 + β1X1 + … + βpXp
The decision boundary is the hyperplane equation
β0 + β1X1 + … + βpXp = 0, i.e. where p(X) = 0.5
Two degrees-of-freedom values are reported:
1st – for the NULL model (intercept only – Reduced Model): 80 – 1 = 79
2nd – with all the variables included in the modeling (Full Model): 80 – 6 = 74
Summary of model
Fisher Scoring iterations (MLE), no. of iterations = 25
Finding the Odds
predict()
syntax: predict(object)
# Finding the odds
logisTrain <- predict(logisfit, type = 'response')   # by default – training set
predict() with type = 'response' gives probabilities;
by default (without it) predict() returns log(odds)
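The two outputs are related through the logistic function; a quick check using plogis(), which is base R:
# log odds -> probabilities via the logistic function
logOdds <- predict(logisfit)                     # default: log(odds)
probs   <- predict(logisfit, type = 'response')  # probabilities
all.equal(plogis(logOdds), probs)                # TRUE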
Plotting the probabilities
plot(logisTrain)
Classes are well separated
Which side belongs to which car type?
Identifying probabilities associated with the Car Type
Mean of probabilities
This helps us identify the probabilities associated with the two classes
tapply(logisTrain, crashTest_1$CarType, mean)
Low probabilities are associated with car type ‘Hatchback’;
higher probabilities are associated with car type ‘SUV’
Predicting on test data
# Predicting on test data
logispred <- predict(logisfit, newdata = crashTest_1_TEST, type = 'response')
plot(logispred)
“logispred” is the output which has the probability values
Results
Each test point is classified as Hatchback or SUV by setting a
threshold of 0.5 on the predicted probability
crashTest_1_TEST[logispred <= 0.5, "LogisPred"] <- "Hatchback"
crashTest_1_TEST[logispred > 0.5, "LogisPred"] <- "SUV"
Predicted results are stored in the 7th column (LogisPred)
Confusion Matrix
confusionMatrix(table(crashTest_1_TEST[, 7], crashTest_1_TEST[, 6]),
positive = 'Hatchback')
10.01.2023
k – Nearest Neighbors (kNN)
Simple and powerful classification algorithm
k – Nearest Neighbors (kNN) is a non-parametric method used for classification
It is a lazy learning algorithm where all computation is deferred until
classification
It is also an instance based learning algorithm where the function is
approximated locally
In logistic regression, seen earlier, the training data was used to fit a hyperplane;
the estimated parameters were then used to predict the test data
In kNN, the data itself is used, not estimated parameters; the number of
neighbours is used instead, which is a tuning parameter for the algorithm
(it is not a parameter derived from the data)
Distinction between parametric and non-parametric:
In logistic regression, predicting data is not possible without the parameters
But in kNN, it is possible using the data alone
No prior training work is required for doing classification using kNN (a key
difference between logistic regression and kNN)
Why kNN and when does one use it?
Why kNN?
Simplest of all classification algorithms and easy to implement
There is no explicit training phase and the algorithm does not perform any
generalization of the trained data (no optimization required)
When does one use this kNN algorithm?
When there are nonlinear decision boundaries between classes and when the
amount of data is large
With more data the computation also becomes more demanding, but there are
many ways to address this issue.
Input features
input features can be both quantitative and qualitative
Outputs
Outputs are categorical values, which typically are the classes of the data
kNN predicts a categorical value using the majority vote of the nearest neighbours
Assumptions
Being nonparametric, the algorithm does not make any assumptions about the
underlying data distribution
Select the parameter ‘k’ based on the data
Requires a distance metric to define proximity between any two data points
Example: Euclidean distance, Mahalanobis distance or Hamming distance
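A small sketch of such distance computations in base R (dist() and mahalanobis() are both in the stats package; the numbers here are hypothetical, for illustration):
# Euclidean distance between two points (1, 2) and (4, 6)
x <- matrix(c(1, 2,
              4, 6), nrow = 2, byrow = TRUE)
dist(x, method = "euclidean")               # sqrt(3^2 + 4^2) = 5
# mahalanobis() returns the *squared* distance and needs a centre and a
# covariance matrix (identity here, for illustration)
mahalanobis(c(4, 6), center = c(1, 2), cov = diag(2))   # 25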
Algorithm
The kNN classification is performed using the following four steps
1. Compute the distance matrix between the test data point and all the
labeled data points
2. Order the labeled data points in the increasing order of the distance metric
3. Select the top ‘k’ labeled data points and look at the class labels
4. Find the class label that the majority of these ‘k’ labeled data points have
and assign it to the test data point
Easy to solve multiclass problems using kNN
New test point xnew
Calculate the distances from xnew to every labelled point (refer to the
illustration below); a distance of ‘0’ corresponds to xnew itself
If k = 3, find the 3 smallest distances
If all the data points are from class 1, then xnew is also from class 1
If all the data points are from class 2, then xnew is also from class 2
If 2 belong to class 1 & 1 belongs to class 2, then by majority vote
xnew is assigned to class 1
If 2 belong to class 2 & 1 belongs to class 1, then by majority vote
xnew is assigned to class 2
The algorithm can be used, with minor modification, for function
approximation (regression)
If k=5, then take the first 5 points and then look for the majority votes,
and then assign this class to the new test data point.
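A minimal from-scratch sketch of these four steps, assuming numeric features and Euclidean distance (illustration only; in practice use knn() from the 'class' package, as shown later):
# 1. distances, 2./3. sort and take top k, 4. majority vote
knnPredict <- function(train, labels, xnew, k = 3) {
  d <- sqrt(rowSums((train - matrix(xnew, nrow(train), ncol(train),
                                    byrow = TRUE))^2))   # step 1
  topk <- labels[order(d)[1:k]]                          # steps 2 & 3
  names(which.max(table(topk)))                          # step 4
}
# e.g. knnPredict(as.matrix(iris[, 1:2]), iris$Species, c(6.0, 3.0), k = 5)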
Illustration of kNN
There is a possibility of data points getting misclassified in the region
where there is a mix of the data points (as shown in the figure below)
Farther away from this mixed region, the misclassification problem lessens
Note that no explicit boundary was defined here
This algorithm can be used effectively for problems with complicated
nonlinear boundaries
Illustration of kNN (Testing)
No label for test data (yellow color data point)
In the next image, when k=3, the new test data point (yellow) will get
the label (red – class 2)
In the next image, when k=5, the new test data point (yellow) will get
the label (blue – class 1)
Things to consider
Following are some things one should consider before applying kNN
algorithm
o Parameter selection
o Presence of noise
o Feature selection and scaling
o Curse of dimensionality
Parameter selection
o The best choice of ‘k’ depends on the data
o Larger values of ‘k’ reduces the effect of noise on classification
but makes the decision boundaries between classes less
distinct
o Smaller values of ‘k’ tend to be affected by noise, though they give
clearer separation between classes
Feature selection and scaling
o It is important to remove irrelevant features
o When the number of features is too large, and suspected to be
highly redundant, feature extraction is required
o If the features are carefully chosen then it is expected that the
classification will be better
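A hedged sketch of feature scaling before kNN, with hypothetical data frames trainData / testData (kNN is distance based, so features on larger numeric scales would otherwise dominate the distance):
trainX <- scale(trainData)   # centre and scale each column
# scale the test data with the *training* means and SDs
testX <- scale(testData,
               center = attr(trainX, "scaled:center"),
               scale  = attr(trainX, "scaled:scale"))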
28.01.2023
K- nearest neighbours implementation in R
Case Study
o Problem statement
Solve the Case Study using R
o Read the data from a .csv file
o Understand the data
o knn() function
o interpret the results
Key points
o knn is primarily used as a classification algorithm
o it is a supervised learning algorithm
data is labeled
o Non-parametric method
o No explicit training phase is involved
o Lazy learning algorithm
o Notion of distance is needed
o Majority voting method
Automotive Service Company: A Case Study
Problem statement
o An automotive service chain is launching its new grand service
station this weekend. They offer to service a wide variety of
cars. The current capacity of the station is to check 315 cars
thoroughly per day.
o As an inaugural offer, they claim to freely check all cars that
arrive on their launch day, and report whether they need
servicing or not.
o Unexpectedly, they got 450 cars. The servicemen won’t work
longer than the working hours, but the data analysts have to.
o Can you save the day for the new service station?
How can a data scientist save a day for them?
o “serviceTrainData.csv” – a dataset which contains some
attributes of a car that can be measured easily and quickly,
together with whether that car needs servicing or not.
o Now for the cars they cannot check in detail, they measure
those attributes – “serviceTestData.csv”
o knn classification technique is used to classify the cars which
cannot be tested manually and to say whether service is
needed or not.
Getting things ready
o Setting working directory, clearing variables in the workspace
o Installing or loading required packages
# knn implementation in R
# set the working directory as the directory which contains the data files
# setwd("Path of the directory with data files")
rm(list = ls())   # to clear the environment
# install.packages("caret", dependencies = TRUE)
# install.packages("class", dependencies = TRUE)
library(caret)   # for confusionMatrix()
library(class)   # for knn()
Reading the data
Data for this case study is provided in files named
o serviceTrainData.csv
o serviceTestData.csv
To read the data from a .csv file, the read.csv() function is used
read.csv() function – reads a file in table format and creates a data
frame from it
Syntax – read.csv(file, row.names)
file – the name of the file which the data are to be read from. Each
row of the table appears as one line of the file.
row.names – a vector of row names. This can be a vector giving the
actual row names, or a single number giving the column of the table
which contains the row names, or character string giving the name of
the table column containing the row names.
# reading the data
ServiceTrain <- read.csv("serviceTrainData.csv")
ServiceTest <- read.csv("serviceTestData.csv")
Viewing the data
Understanding the data
o ServiceTrain contains 315 observations of 6 variables
o ServiceTest contains 135 observations of 6 variables
o The variables are:
1. OilQual
2. Engineperf
3. NormMileage
4. TyreWear
5. HVACwear
6. Service
o First five columns are the details about the car and last column
is the label which says whether a service is needed or not
Structure of the data
o Structure of data
Variables and their data types
o str() – compactly displays the internal structure of an ‘R’ object
o Syntax – str(object)
o Object – any R object about which you want to have some
information
Summary of the data
o Summary of data
The function invokes particular methods which depend on
the class of the first argument
o summary()
o gives a 5 point summary for numeric attributes in the data
o Syntax – summary(object)
o Object – any R object about which you want to have some
information
Implementation of k-nearest neighbours: knn()
knn(train, test, cl, k = 1)
Arguments
train Matrix or data frame of training set cases
test Matrix or data frame of test set cases.
A vector will be interpreted as a row vector for a single
case
cl Factor of true classifications of training set
k Number of neighbours considered
Applying knn algorithm on data
# Applying the k-NN algorithm
# k-nearest neighbour is a lazy algorithm and can do prediction directly
# with the testing dataset. The command knn() accepts the training and
# testing datasets; the class variable of interest, i.e. the outcome
# categorical variable, is provided via the parameter "cl". Parameter "k"
# specifies the number of nearest neighbours required.
predictedknn <- knn(train = ServiceTrain[, -6],
                    test = ServiceTest[, -6],
                    cl = ServiceTrain$Service,
                    k = 3)
o ServiceTrain[, -6] gives all columns of ServiceTrain except the
last column
o ServiceTest[, -6] gives all columns of ServiceTest except the
last column
o ServiceTrain$Service gives the last column of training data as a
classification factor to the algorithm
Results: Generating Confusion Matrix Manually
# Command to develop and print a confusion matrix
conf_matrix = table(predictedknn, ServiceTest[, 6])
conf_matrix

predictedknn  No Yes
         No   99   0
         Yes   0  36
# A measure of accuracy is calculated by summing the true positives and
true negatives and dividing them by total number of samples
knn_accuracy = sum(diag(conf_matrix))/nrow(ServiceTest)
> knn_accuracy
[1] 1
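The same result can also be obtained with caret (loaded earlier), mirroring the logistic regression case study; here ‘Yes’ is taken as the positive class, which is an assumption for illustration:
confusionMatrix(table(predictedknn, ServiceTest[, 6]), positive = "Yes")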
Conclusion
o read.csv() can be used to read data from .csv files
o str() function gives the data types of each attribute in the given R
object
o summary() – provides a summary of R objects
o k-nearest neighbours is a supervised learning technique – it needs
labeled data
o In R, the knn algorithm can be implemented using knn()