Predictive Modelling Project
By Parijat Dev

Contents
Linear Regression Problem
1.1 Read the data and do exploratory data analysis. Describe the data briefly.
(Check the null values, data types, shape, EDA). Perform Univariate and Bivariate
Analysis.
    Univariate Analysis
    Bivariate Analysis
    Correlation Heatmap
1.2 Impute null values if present? Do you think scaling is necessary in this case?
1.3 Encode the data (having string values) for Modelling. Data Split: Split the data
into test and train (70:30). Apply Linear regression. Performance Metrics: Check the
performance of Predictions on Train and Test sets using Rsquare, RMSE.
1.4 Inference: Based on these predictions, what are the business insights and
recommendations.
Logistic Regression and Linear Discriminant Analysis
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
condition check, write an inference on it. Perform Univariate and Bivariate Analysis.
Do exploratory data analysis.
    Univariate Analysis
    Bivariate Analysis
2.2 Encode the data (having string values) for Modelling. Data Split: Split the data into
train and test (70:30). Apply Logistic Regression and LDA (linear discriminant
analysis).
2.3 Performance Metrics: Check the performance of Predictions on Train and Test
sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for
each model. Compare both the models and write inferences, which model is
best/optimized.
    Linear Discriminant Analysis
    Logistic Regression
2.4 Inference: Based on these predictions, what are the insights and
recommendations.

Linear Regression Problem
Problem Statement - You are part of an investing firm and your work is to research
these 759 firms. You are provided with a dataset containing the sales and other
attributes of these 759 firms. Predict the sales of these firms on the basis of the details
given in the dataset to help your company invest consciously. Also, provide them
with the 5 attributes that are most important.

1.1 Read the data and do exploratory data analysis. Describe the data
briefly. (Check the null values, data types, shape, EDA). Perform
Univariate and Bivariate Analysis.
Answer – The data has been read and loaded successfully into the Python environment.
The following are the key findings about the data.

• The dataset has 10 columns; the data dictionary is given below.

• Data Dictionary for Firm_level_data:
o Unnamed column – shows index values
o 1. sales: Sales (in millions of dollars).
o 2. capital: Net stock of property, plant, and equipment.
o 3. patents: Granted patents.
o 4. randd: R&D stock (in millions of dollars).
o 5. employment: Employment (in 1000s).
o 6. sp500: Membership of firms in the S&P 500 index. S&P, is a stock
market index that measures the stock performance of 500 large
companies listed on stock exchanges in the United States
o 7. tobinq: Tobin's q (also known as q ratio and Kaldor's v) is the ratio
between a physical asset's market value and its replacement value.
o 8. value: Stock market value.
o 9. institutions: Proportion of stock owned by institutions.
• Shape of the data is 759 rows × 10 columns
• Dropped unnamed column from the dataset
• Information about the dataset
• Data columns (total 9 columns):
   #   Column        Non-Null Count  Dtype
   --- ------        --------------  -----
   0   sales         759 non-null    float64
   1   capital       759 non-null    float64
   2   patents       759 non-null    int64
   3   randd         759 non-null    float64
   4   employment    759 non-null    float64
   5   sp500         759 non-null    object
   6   tobinq        738 non-null    float64
   7   value         759 non-null    float64
   8   institutions  759 non-null    float64
• We can observe that there are 21 missing values in the tobinq column (738 of 759
rows are non-null). The sp500 column is of object datatype; all the other columns are
numeric.
• Description of the data

        SALES          CAPITAL       PATENTS      RANDD         EMPLOYMENT  SP500  TOBINQ      VALUE         INSTITUTIONS
COUNT   759.000000     759.000000    759.000000   759.000000    759.000000  759    738.000000  759.000000    759.000000
UNIQUE  NaN            NaN           NaN          NaN           NaN         2      NaN         NaN           NaN
TOP     NaN            NaN           NaN          NaN           NaN         no     NaN         NaN           NaN
FREQ    NaN            NaN           NaN          NaN           NaN         542    NaN         NaN           NaN
MEAN    2689.705158    1977.747498   25.831357    439.938074    14.164519   NaN    2.794910    2732.734750   43.020540
STD     8722.060124    6466.704896   97.259577    2007.397588   43.321443   NaN    3.366591    7071.072362   21.685586
MIN     0.138000       0.057000      0.000000     0.000000      0.006000    NaN    0.119001    1.971053      0.000000
25%     122.920000     52.650501     1.000000     4.628262      0.927500    NaN    1.018783    103.593946    25.395000
50%     448.577082     202.179023    3.000000     36.864136     2.924000    NaN    1.680303    410.793529    44.110000
75%     1822.547366    1075.790020   11.500000    143.253403    10.050001   NaN    3.139309    2054.160386   60.510000
MAX     135696.788200  93625.200560  1220.000000  30425.255860  710.799925  NaN    20.000000   95191.591160  90.150000

Univariate Analysis

Boxplot of Variables before outlier treatment

o Outliers are present in all the variables except institutions.
o We cap the outliers using a quartile-based treatment.
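
The report does not spell out the capping rule, so the following is only a minimal sketch of one common quartile-based treatment. It assumes the usual 1.5 × IQR fences; the helper name cap_outliers_iqr and the toy numbers are hypothetical and are not taken from the firm data.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds.

    k = 1.5 is an assumption (the conventional boxplot fence).
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Toy column with one extreme value (made-up numbers).
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])
capped = cap_outliers_iqr(toy)
print(capped.tolist())
```

Capping keeps every row in the dataset and only pulls extreme values back to the fence, which is why the boxplots after treatment show no points beyond the whiskers.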

Boxplot of variables after outlier treatment

Bivariate Analysis
Scatter Plot of Variables with hue of SP500

Correlation Heatmap

There is a high level of correlation between:

o Capital and Sales
o Randd and Sales
o Employment and Sales

We chose to keep all the independent variables, as each serves an important function.
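
As a rough illustration of how such a correlation matrix can be computed, here is a sketch on synthetic data. The column names mirror the dataset, but the values are randomly generated assumptions, not the firm data, and the plotting step is left out.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the firm data: sales is built to track capital,
# institutions is independent noise.
rng = np.random.default_rng(2)
capital = rng.normal(size=300)
toy = pd.DataFrame({
    "capital": capital,
    "sales": 2.0 * capital + rng.normal(scale=0.3, size=300),
    "institutions": rng.normal(size=300),
})

# Pairwise Pearson correlations; a heatmap is a colour-coded view of this matrix.
corr = toy.corr()

# Correlation of every other variable with the target, strongest first.
sales_corr = corr["sales"].drop("sales").sort_values(ascending=False)
print(sales_corr)
```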

1.2 Impute null values if present? Do you think scaling is necessary in
this case?
Null values in the tobinq column have been replaced with the mean of the column;
after imputation there are no missing values left.
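
A minimal sketch of mean imputation with pandas, using a toy frame with made-up values in place of the real tobinq column:

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (made-up numbers).
toy = pd.DataFrame({"tobinq": [1.0, np.nan, 3.0, 2.0]})

# Replace missing entries with the mean of the observed values.
toy["tobinq"] = toy["tobinq"].fillna(toy["tobinq"].mean())

print(toy["tobinq"].isna().sum())
```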

Scaling is not necessary in this case. I performed the regression analysis on both the
original (unscaled) dataset and the scaled dataset and observed no major change in the
performance metrics: the R-squared scores are identical and only the units of RMSE change.
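
The reason R-squared does not move is that ordinary least squares with an intercept is unaffected by shifting and rescaling the variables. The sketch below checks this on synthetic data; the data, the feature scales and the use of StandardScaler are assumptions made for illustration, since the report does not say which scaler was used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (not the firm data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])
y = X @ np.array([2.0, 0.1, 0.01]) + rng.normal(size=200)

# Fit on the raw data.
r2_raw = LinearRegression().fit(X, y).score(X, y)

# Fit on standardized features and a standardized target.
X_scaled = StandardScaler().fit_transform(X)
y_scaled = StandardScaler().fit_transform(y.reshape(-1, 1)).ravel()
r2_scaled = LinearRegression().fit(X_scaled, y_scaled).score(X_scaled, y_scaled)

print(r2_raw, r2_scaled)
```

The two R-squared values agree up to floating-point error, while RMSE would be reported in standard-deviation units for the scaled fit.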

1.3 Encode the data (having string values) for Modelling. Data Split:
Split the data into test and train (70:30). Apply Linear regression.
Performance Metrics: Check the performance of Predictions on
Train and Test sets using Rsquare, RMSE.
SP500 has string values, which have been encoded into numeric values using dummy
variables: 0 means the firm does not belong to the sp500 group and 1 means it does.
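
One way to produce such a dummy variable is pandas.get_dummies. This is a sketch on a toy frame; dropping the first level is an assumption, chosen because it yields a single sp500_yes column like the one that appears in the regression equation.

```python
import pandas as pd

# Toy frame with the string column to encode (made-up rows).
toy = pd.DataFrame({
    "sales": [10.0, 20.0, 30.0],
    "sp500": ["no", "yes", "no"],
})

# drop_first=True keeps only the "yes" indicator, so "no" becomes 0.
encoded = pd.get_dummies(toy, columns=["sp500"], drop_first=True, dtype=int)
print(encoded)
```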

Data has been split into train and test sets in the ratio of 70:30, and a linear
regression model has been fitted on the training data. Below are the scores.

For non scaled data

                    Train RMSE     Test RMSE    Training Score   Test Score
Linear Regression   393.365800     402.933381   0.933731         0.930351

For Scaled data

                    Train RMSE     Test RMSE    Training Score   Test Score
Linear Regression   2.574918e-01   0.263755     0.933731         0.930351
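
The scores in these tables come from a standard split, fit and evaluate workflow. The sketch below shows that workflow on synthetic data with the same number of rows; the features, coefficients and random_state are assumptions, so the printed numbers will not match the tables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data with 759 rows (not the firm data).
rng = np.random.default_rng(1)
X = rng.normal(size=(759, 4))
y = X @ np.array([3.0, -1.0, 0.5, 2.0]) + rng.normal(scale=0.5, size=759)

# 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LinearRegression().fit(X_train, y_train)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))


train_rmse = rmse(y_train, model.predict(X_train))
test_rmse = rmse(y_test, model.predict(X_test))
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))

print(train_rmse, test_rmse, train_r2, test_r2)
```

Train and test scores that sit close together, as they do in the tables, suggest the model is not overfitting.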

Linear Regression Equation -
y = 0.25 * capital - 0.02 * patents + 0.03 * randd + 0.44 * employment
    - 0.05 * tobinq + 0.30 * value + 0.01 * institutions + 0.01 * sp500_yes - 0.00
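
To make the equation concrete, the sketch below evaluates it for one hypothetical firm. The coefficients are copied from the equation; the input values are invented, and treating them as scaled quantities is an assumption suggested by the near-zero intercept rather than something the report states.

```python
# Coefficients as reported in the regression equation.
COEFFICIENTS = {
    "capital": 0.25,
    "patents": -0.02,
    "randd": 0.03,
    "employment": 0.44,
    "tobinq": -0.05,
    "value": 0.30,
    "institutions": 0.01,
    "sp500_yes": 0.01,
}
INTERCEPT = 0.0


def predict_sales(features: dict) -> float:
    """Evaluate the linear equation for one firm."""
    return INTERCEPT + sum(
        COEFFICIENTS[name] * features[name] for name in COEFFICIENTS
    )


# Hypothetical firm: one unit above average in capital and employment,
# average (zero) in everything else.
firm = {name: 0.0 for name in COEFFICIENTS}
firm["capital"] = 1.0
firm["employment"] = 1.0

prediction = predict_sales(firm)
print(round(prediction, 2))
```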

1.4 Inference: Based on these predictions, what are the business
insights and recommendations.
Sales of a firm are highly dependent on its capital and employment.

There are 217 companies that belong to the sp500 group.

The majority of companies have sales below 10,000.

Top 5 attributes

1. Employment
2. Capital
3. Value
4. Tobinq
5. Patents

The company should focus on increasing its patent count and employment, which will
ultimately boost its sales in the market.

Logistic Regression and Linear Discriminant Analysis
Problem 2: You are hired by Government to do analysis on car crashes. You are provided
details of car crashes, among which some people survived and some didn't. You have to
help the government in predicting whether a person will survive or not on the basis of the
information given in the data set so as to provide insights that will help government to
make stronger laws for car manufacturers to ensure safety measures. Also, find out the
important factors on the basis of which you made your predictions.

2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and
do null value condition check, write an inference on it. Perform
Univariate and Bivariate Analysis. Do exploratory data analysis.
Data Dictionary of the dataset

1. dvcat: factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39, 40-54,
55+

2. weight: Observation weights, albeit of uncertain accuracy, designed to account for
varying sampling probabilities. (The inverse probability weighting estimator can be used
to demonstrate causality when the researcher cannot conduct a controlled experiment
but has observed data to model) for further information go to this link:

3. Survived: factor with levels Survived or not_survived

4. airbag: a factor with levels none or airbag

5. seatbelt: a factor with levels none or belted

6. frontal: a numeric vector; 0 = non-frontal, 1=frontal impact

7. sex: a factor with levels f: Female or m: Male

8. ageOFocc: age of occupant in years

9. yearacc: year of accident

10. yearVeh: Year of model of vehicle; a numeric vector

11. abcat: Did one or more (driver or passenger) airbag(s) deploy? This factor has levels
deploy, nodeploy and unavail

12. occRole: a factor with levels driver or pass: passenger

13. deploy: a numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or
more bags deployed.

Information about the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11217 entries, 0 to 11216
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dvcat 11217 non-null object
1 weight 11217 non-null float64
2 Survived 11217 non-null object
3 airbag 11217 non-null object
4 seatbelt 11217 non-null object
5 frontal 11217 non-null int64
6 sex 11217 non-null object
7 ageOFocc 11217 non-null int64
8 yearacc 11217 non-null int64
9 yearVeh 11217 non-null float64
10 abcat 11217 non-null object
11 occRole 11217 non-null object
12 deploy 11217 non-null int64
13 injSeverity 11140 non-null float64
dtypes: float64(3), int64(4), object(7)
memory usage: 1.2+ MB

There are missing values in the injSeverity column (77 of 11217 rows); we can impute
them using the mode of the column.
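
A minimal sketch of mode imputation with pandas, on a toy column with made-up severity codes:

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (made-up codes).
toy = pd.DataFrame({"injSeverity": [3.0, 1.0, np.nan, 3.0, 2.0]})

# mode() returns a Series (there can be ties); take the first entry.
most_frequent = toy["injSeverity"].mode()[0]
toy["injSeverity"] = toy["injSeverity"].fillna(most_frequent)

print(toy["injSeverity"].tolist())
```

The mode is a reasonable choice here because injSeverity is a coded category stored as a number, so a mean would produce a value that is not a valid code.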

Univariate Analysis
Count plot of Survived individuals

The count of people who survived the car crash is significantly higher than the count
of people who did not survive.

Countplot of Crash Speeds

Significantly more people crashed their cars at lower speeds than at higher speeds.

Countplot of Vehicles with Airbags

Most of the vehicles were equipped with the Airbags and they worked properly.

Countplot of people using seatbelt during the crash

Most people were wearing a seatbelt during the crash.

Countplot of Sex of an Individual

There are more males than females who crashed their cars.

Countplot of Airbags functioning or not

There were 2599 cases where the airbags were present but they did not get deployed.

Boxplot of Variables before outliers treatment

Boxplot of Variables after outliers treatment

Bivariate Analysis
Scatter Plot of variables

Barplot of Survived individuals by speed of Car

We can see that the maximum number of deaths occurred at speeds of 55+ km/h.

Barplot of individuals that survived vs seatbelt

People who were wearing a seatbelt had a higher probability of survival than people
who were not wearing a seatbelt.
Bar plot of survival after frontal or non-frontal impact

Survival is not much dependent on whether the car crashes from the front or any other
side.

2.2 Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Logistic Regression and
LDA (linear discriminant analysis).
The string columns have been encoded into numeric codes.

Column Name   Variable in the column   Encoding
DVCAT         1-9km/h                  0
              10-24                    1
              25-39                    2
              40-54                    3
              55+                      4
Survived      survived                 0
              Not_survived             1
airbag        airbag                   0
              none                     1
Seatbelt      belted                   0
              none                     1
Sex           f                        0
              m                        1
abcat         deploy                   0
              nodeploy                 1
              unavail                  2
occRole       driver                   0
              Pass                     1
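
The encoding in this table can be applied with explicit mapping dictionaries, as in the sketch below. The codes are taken from the table; the exact spelling and casing of the raw labels and column names in the CSV are assumptions, and the toy rows are made up.

```python
import pandas as pd

# Label-to-code mappings from the encoding table (label spelling assumed).
ENCODING = {
    "dvcat": {"1-9km/h": 0, "10-24": 1, "25-39": 2, "40-54": 3, "55+": 4},
    "Survived": {"survived": 0, "Not_survived": 1},
    "airbag": {"airbag": 0, "none": 1},
    "seatbelt": {"belted": 0, "none": 1},
    "sex": {"f": 0, "m": 1},
    "abcat": {"deploy": 0, "nodeploy": 1, "unavail": 2},
    "occRole": {"driver": 0, "pass": 1},
}

# Toy rows covering a few of the columns.
toy = pd.DataFrame({
    "dvcat": ["55+", "1-9km/h", "25-39"],
    "seatbelt": ["belted", "none", "belted"],
    "sex": ["m", "f", "f"],
})

# Apply each mapping to the columns that are present.
for column, mapping in ENCODING.items():
    if column in toy.columns:
        toy[column] = toy[column].map(mapping)

print(toy)
```

An explicit mapping keeps the speed bands in their natural order (0 for the slowest, 4 for the fastest), which an automatic alphabetical label encoder would not guarantee.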

2.3 Performance Metrics: Check the performance of Predictions on
Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve
and get ROC_AUC score for each model. Compare both the models
and write inferences, which model is best/optimized.
Linear Discriminant Analysis
Model Performance on Test Dataset

Model Performance on Training Dataset

Confusion Matrix – Training Dataset

Confusion Matrix - Test Dataset

ROC Curve – Linear Discriminant Analysis

Logistic Regression
Model Performance on Test Dataset

Confusion Matrix – Training Dataset

Confusion Matrix – Test Dataset

ROC Curve – Logistic Regression
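
The metrics named in this section can be computed for both models as in the following sketch. It uses a synthetic binary dataset from make_classification instead of the crash data, and default model settings, so the numbers it prints are illustrative only and say nothing about the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (not the car crash dataset).
X, y = make_classification(
    n_samples=1000,
    n_features=6,
    n_informative=4,
    n_clusters_per_class=1,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Probability of the positive class, needed for ROC AUC.
    proba = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "confusion_matrix": confusion_matrix(y_test, pred),
        "roc_auc": roc_auc_score(y_test, proba),
    }

for name, scores in results.items():
    print(name, scores["accuracy"], scores["roc_auc"])
    print(scores["confusion_matrix"])
```

The same loop can be run on the training split to compare train and test performance; the ROC curve itself can be drawn from sklearn.metrics.roc_curve using the same predicted probabilities.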

2.4 Inference: Based on these predictions, what are the insights and
recommendations.
The Logistic Regression model performed much better than the LDA model: it was able to
account for 99% of the data, which is certainly a very good figure.

Survival in a car crash was very much dependent on the airbags, on whether the person
was wearing a seatbelt, and on the speed of the car. Apart from these, other factors
played an almost negligible role.

We observed that there were many cases where the airbags were present in the car but
did not deploy in time, leading to injury and death in several cases.
