CSE3506 Essentials of Data Analytics

Name : Shivam Batra

Reg. No. : 19BPS1131
Lab Exercise: 5

Objective: Applying logistic regression to predict red wine quality (good vs. not good) from physicochemical properties.

Methods -

1. Exploratory data analysis (EDA)

2. Data preparation

3. Modeling -> Logistic regression

4. ROC Curve

STEPS:

#Importing the dataset

data <- read.csv('winequality-red.csv', sep = ';')
str(data)

## 'data.frame': 1599 obs. of 12 variables:

## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...

#Format outcome variable

data$quality <- ifelse(data$quality >= 7, 1, 0)
data$quality <- factor(data$quality, levels = c(0, 1))

#Descriptive statistics
summary(data)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 0:1382
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1: 217
## Median :3.310 Median :0.6200 Median :10.20
## Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :4.010 Max. :2.0000 Max. :14.90

Univariate analysis

#Dependent variable
#Frequency plot
par(mfrow=c(1,1))
barplot(table(data[[12]]),
main = sprintf('Frequency plot of the variable: %s',
colnames(data[12])),
xlab = colnames(data[12]),
ylab = 'Frequency')
#Check class imbalance
table(data$quality)

##
## 0 1
## 1382 217

round(prop.table((table(data$quality))),2)

##
## 0 1
## 0.86 0.14
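
With 86% of the wines in class 0, any accuracy figure needs to be read against a trivial baseline. The following is a minimal sketch, using only the class counts printed above, of the accuracy a model would reach by always predicting the majority class. The variable names are illustrative and not part of the original script.

```r
# Class counts as reported by table(data$quality) above
counts <- c(`0` = 1382, `1` = 217)

# Accuracy of a trivial classifier that always predicts the majority class
baseline_accuracy <- max(counts) / sum(counts)
round(baseline_accuracy, 2)
```

A model is only informative if it beats this baseline on the set where it is evaluated.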

#Independent variable
#Boxplots
par(mfrow=c(3,4))
for (i in 1:(length(data)-1)){
boxplot(x = data[i],
horizontal = TRUE,
main = sprintf('Boxplot of the variable: %s',
colnames(data[i])),
xlab = colnames(data[i]))
}
#Histograms
par(mfrow=c(3,4))
for (i in 1:(length(data)-1)){
hist(x = data[[i]],
main = sprintf('Histogram of the variable: %s',
colnames(data[i])),
xlab = colnames(data[i]))
}

Bivariate analysis

#Correlation matrix
library(ggcorrplot)

## Loading required package: ggplot2

ggcorrplot(round(cor(data[-12]), 2),
type = "lower",
lab = TRUE,
title =
'Correlation matrix of the red wine quality dataset')

Data preparation

#Missing values
sum(is.na(data))

## [1] 0

#Outliers
#Identifying outliers (1.5 * IQR rule)
is_outlier <- function(x) {
return(x < quantile(x, 0.25) - 1.5 * IQR(x) |
x > quantile(x, 0.75) + 1.5 * IQR(x))
}
outlier <- data.frame(variable = character(),
sum_outliers = integer(),
stringsAsFactors=FALSE)
for (j in 1:(length(data)-1)){
variable <- colnames(data[j])
#Count the values flagged as outliers in column j
sum_outliers <- sum(is_outlier(data[[j]]))
row <- data.frame(variable,sum_outliers)
outlier <- rbind(outlier, row)
}
outlier

## variable sum_outliers
## 1 fixed.acidity 49
## 2 volatile.acidity 19
## 3 citric.acid 1
## 4 residual.sugar 155
## 5 chlorides 112
## 6 free.sulfur.dioxide 30
## 7 total.sulfur.dioxide 55
## 8 density 45
## 9 pH 35
## 10 sulphates 59
## 11 alcohol 13

#Identifying the percentage of outliers

for (i in 1:nrow(outlier)){
if (outlier[i,2]/nrow(data) * 100 >= 5){
print(paste(outlier[i,1],
'=',
round(outlier[i,2]/nrow(data) * 100, digits = 2),
'%'))
}
}

## [1] "residual.sugar = 9.69 %"

## [1] "chlorides = 7 %"

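#Imputing outlier values: upper outliers in residual.sugar (column 4) and chlorides (column 5) are replaced by the column mean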

for (i in 4:5){
for (j in 1:nrow(data)){
if (data[[j, i]] > as.numeric(quantile(data[[i]], 0.75) +
1.5 * IQR(data[[i]]))){
if (i == 4){
data[[j, i]] <- round(mean(data[[i]]), digits = 2)
} else{
data[[j, i]] <- round(mean(data[[i]]), digits = 3)
}
}
}
}
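
The loop above re-evaluates the quantile, IQR and mean on every row, so each replacement slightly shifts the threshold used for the rows that follow. As a hedged alternative, the sketch below computes the upper fence and the replacement value once and applies them in a single vectorised step. The function name `impute_upper_outliers` and the sample vector are hypothetical, and the results may differ slightly from those of the loop.

```r
# Replace values above the upper 1.5 * IQR fence with the (rounded) mean.
# The fence and the mean are computed once, from the original values.
impute_upper_outliers <- function(x, digits = 2) {
  upper <- quantile(x, 0.75, names = FALSE) + 1.5 * IQR(x)
  replacement <- round(mean(x), digits = digits)
  x[x > upper] <- replacement
  x
}

# Hypothetical values: the last one lies above the upper fence
x <- c(1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.6, 15.5)
x_imputed <- impute_upper_outliers(x)
x_imputed
```

On the wine data this could be applied as `data[[4]] <- impute_upper_outliers(data[[4]], 2)` and `data[[5]] <- impute_upper_outliers(data[[5]], 3)`, mirroring the rounding used in the loop.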

Modeling
#Splitting the dataset into the Training set and Test set
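#Balanced training sample: 80% of the positives plus an equal number of negatives (the majority class is undersampled); all remaining rows form the test set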
data_ones <- data[which(data$quality == 1), ]
data_zeros <- data[which(data$quality == 0), ]
#Train data
set.seed(123)
train_ones_rows <- sample(1:nrow(data_ones), 0.8*nrow(data_ones))
train_zeros_rows <- sample(1:nrow(data_zeros), 0.8*nrow(data_ones))
train_ones <- data_ones[train_ones_rows, ]
train_zeros <- data_zeros[train_zeros_rows, ]
train_set <- rbind(train_ones, train_zeros)
table(train_set$quality)

##
## 0 1
## 173 173

#Test Data
test_ones <- data_ones[-train_ones_rows, ]
test_zeros <- data_zeros[-train_zeros_rows, ]
test_set <- rbind(test_ones, test_zeros)
table(test_set$quality)

##
## 0 1
## 1209 44
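
The split above yields a balanced training set and a heavily imbalanced test set. For comparison, the sketch below shows a proportionally stratified 80/20 split, in which both sets keep roughly the original 86/14 class ratio. It runs on synthetic labels with the same class counts as the dataset; the variable names are illustrative, and this is an alternative design rather than the procedure used in this exercise.

```r
set.seed(123)

# Synthetic labels with the same class counts as the wine data
quality <- factor(rep(c(0, 1), times = c(1382, 217)), levels = c(0, 1))

# Sample 80% of the row indices within each class
idx_by_class <- split(seq_along(quality), quality)
train_idx <- unlist(lapply(idx_by_class,
                           function(idx) sample(idx, floor(0.8 * length(idx)))),
                    use.names = FALSE)

train_y <- quality[train_idx]
test_y  <- quality[-train_idx]

# Both sets keep approximately the original class proportions
round(prop.table(table(train_y)), 2)
round(prop.table(table(test_y)), 2)
```

Which design is preferable depends on the goal: undersampling helps the model see enough positives, while a proportional test set reflects the class mix the model would face in practice.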

Logistic regression

#Logistic Regression
lr = glm(formula = quality ~.,
data = train_set,
family = binomial)
#Predictions
prob_pred = predict(lr,
type = 'response',
newdata = test_set[-12])
library(InformationValue)
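#Note: the cutoff is tuned on the test set here; tuning it on training or validation data would avoid an optimistic bias
optCutOff <- optimalCutoff(test_set$quality, prob_pred)[1]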
y_pred = ifelse(prob_pred > optCutOff, 1, 0)

#Making the confusion matrix

cm_lr = table(test_set[, 12], y_pred)
cm_lr

## y_pred
## 0 1
## 0 1207 2
## 1 43 1

#Accuracy: correct predictions (diagonal of the confusion matrix) over all predictions
accuracy_lr = (cm_lr[1,1] + cm_lr[2,2])/
(cm_lr[1,1] + cm_lr[2,2] + cm_lr[2,1] + cm_lr[1,2])
accuracy_lr

## [1] 0.9640862
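
Because the test set is dominated by class 0, overall accuracy says little about how well the good wines are detected. The sketch below rebuilds the confusion matrix from the counts printed above and derives sensitivity, specificity and precision. The object names are illustrative.

```r
# Confusion matrix as printed above (rows = actual, columns = predicted)
cm <- matrix(c(1207, 43, 2, 1), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Share of actual good wines that were predicted as good
sensitivity <- cm["1", "1"] / sum(cm["1", ])

# Share of actual not-good wines that were predicted as not good
specificity <- cm["0", "0"] / sum(cm["0", ])

# Share of predicted good wines that are actually good
precision <- cm["1", "1"] / sum(cm[, "1"])

round(c(sensitivity = sensitivity,
        specificity = specificity,
        precision = precision), 3)
```

Only 1 of the 44 good wines in the test set is identified, so the model is close to always predicting class 0 at the chosen cutoff.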

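#ROC curve (computed here from the hard class labels y_pred, so the AUC reflects a single operating point; passing prob_pred instead would trace the full curve)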
library(ROSE)

## Loaded ROSE 0.0-4

par(mfrow = c(1, 1))

roc.curve(test_set$quality, y_pred)

## Area under the curve (AUC): 0.511
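
The call above passes `y_pred` (hard 0/1 labels) to `roc.curve`, so the curve has a single interior point and its area reduces to the average of sensitivity and specificity. The sketch below reproduces that value from the confusion matrix counts, then shows a rank-based AUC computed from continuous scores. The helper `auc_rank` and the toy labels and scores are hypothetical and only illustrate the idea; they are not output from the fitted model.

```r
# AUC of a one-point ROC curve = (sensitivity + specificity) / 2
sensitivity <- 1 / 44        # from the confusion matrix above
specificity <- 1207 / 1209
single_point_auc <- (sensitivity + specificity) / 2
round(single_point_auc, 3)

# Rank-based (Mann-Whitney) AUC for continuous scores:
# the probability that a random positive is scored above a random negative
auc_rank <- function(labels, scores) {
  r <- rank(scores)                     # average ranks are used for ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example with hypothetical scores
labels <- c(0, 0, 1, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.9)
auc_rank(labels, scores)
```

Applied to the model, `auc_rank(as.numeric(as.character(test_set$quality)), prob_pred)` would give a threshold-independent AUC, which may differ considerably from 0.511.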
