Predictive
Modelling Project
By Parijat Dev
Contents
Linear Regression Problem
  1.1 Read the data and do exploratory data analysis. Describe the data briefly.
      (Check the null values, data types, shape, EDA). Perform Univariate and Bivariate
      Analysis.
      Univariate Analysis
      Bivariate Analysis
      Correlation Heatmap
  1.2 Impute null values if present? Do you think scaling is necessary in this case?
  1.3 Encode the data (having string values) for Modelling. Data Split: Split the data
      into test and train (70:30). Apply Linear regression. Performance Metrics: Check the
      performance of Predictions on Train and Test sets using Rsquare, RMSE.
  1.4 Inference: Based on these predictions, what are the business insights and
      recommendations.
Logistic Regression and Linear Discriminant Analysis
  2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and do null value
      condition check, write an inference on it. Perform Univariate and Bivariate Analysis.
      Do exploratory data analysis.
      Univariate Analysis
      Bivariate Analysis
  2.2 Encode the data (having string values) for Modelling. Data Split: Split the data into
      train and test (70:30). Apply Logistic Regression and LDA (linear discriminant
      analysis).
  2.3 Performance Metrics: Check the performance of Predictions on Train and Test
      sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for
      each model. Compare both the models and write inferences, which model is
      best/optimized.
      Linear Discriminant Analysis
      Logistic Regression
  2.4 Inference: Based on these predictions, what are the insights and
      recommendations.
Linear Regression Problem
Problem Statement: You are part of an investing firm and your work is to do research
on these 759 firms. You are provided with a dataset containing the sales and other
attributes of these 759 firms. Predict the sales of these firms on the basis of the details
given in the dataset to help your company invest consciously. Also, provide them
with the 5 attributes that are most important.
1.1 Read the data and do exploratory data analysis. Describe the data
briefly. (Check the null values, data types, shape, EDA). Perform
Univariate and Bivariate Analysis.
Answer: The data has been read and successfully loaded into the Python environment.
The following are the key findings about the data.
• Dataset has 10 columns, here is the data dictionary
• Data Dictionary for Firm_level_data:
o Unnamed column – shows index values
o 1. sales: Sales (in millions of dollars).
o 2. capital: Net stock of property, plant, and equipment.
o 3. patents: Granted patents.
o 4. randd: R&D stock (in millions of dollars).
o 5. employment: Employment (in 1000s).
o 6. sp500: Membership of firms in the S&P 500 index. The S&P 500 is a stock
market index that measures the stock performance of 500 large
companies listed on stock exchanges in the United States.
o 7. tobinq: Tobin's q (also known as q ratio and Kaldor's v) is the ratio
between a physical asset's market value and its replacement value.
o 8. value: Stock market value.
o 9. institutions: Proportion of stock owned by institutions.
• Shape of the data is 759 rows × 10 columns
• Dropped unnamed column from the dataset
• Information about the dataset
• Data columns (total 9 columns):
   #   Column        Non-Null Count  Dtype
   --- ------        --------------  -----
   0   sales         759 non-null    float64
   1   capital       759 non-null    float64
   2   patents       759 non-null    int64
   3   randd         759 non-null    float64
   4   employment    759 non-null    float64
   5   sp500         759 non-null    object
   6   tobinq        738 non-null    float64
   7   value         759 non-null    float64
   8   institutions  759 non-null    float64
• We can observe that there are missing values in the tobinq column (738 of 759 values
are non-null, so 21 are missing). The sp500 column is of object data type; all the other
columns are numeric.
• Description of the data
        sales          capital       patents      randd         employment  sp500  tobinq      value         institutions
count   759.000000     759.000000    759.000000   759.000000    759.000000  759    738.000000  759.000000    759.000000
unique  NaN            NaN           NaN          NaN           NaN         2      NaN         NaN           NaN
top     NaN            NaN           NaN          NaN           NaN         no     NaN         NaN           NaN
freq    NaN            NaN           NaN          NaN           NaN         542    NaN         NaN           NaN
mean    2689.705158    1977.747498   25.831357    439.938074    14.164519   NaN    2.794910    2732.734750   43.020540
std     8722.060124    6466.704896   97.259577    2007.397588   43.321443   NaN    3.366591    7071.072362   21.685586
min     0.138000       0.057000      0.000000     0.000000      0.006000    NaN    0.119001    1.971053      0.000000
25%     122.920000     52.650501     1.000000     4.628262      0.927500    NaN    1.018783    103.593946    25.395000
50%     448.577082     202.179023    3.000000     36.864136     2.924000    NaN    1.680303    410.793529    44.110000
75%     1822.547366    1075.790020   11.500000    143.253403    10.050001   NaN    3.139309    2054.160386   60.510000
max     135696.788200  93625.200560  1220.000000  30425.255860  710.799925  NaN    20.000000   95191.591160  90.150000
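The loading and inspection steps listed above could be reproduced along the following lines. This is a minimal sketch, not the report's actual notebook: the three data rows are made up so that the snippet runs without the real Firm_level_data file, and only the column layout follows the data dictionary.

```python
import io
import pandas as pd

# Hypothetical stand-in for the Firm_level_data file: three made-up rows with
# the same column layout, so the snippet runs without the real data.
csv_text = """,sales,capital,patents,randd,employment,sp500,tobinq,value,institutions
0,826.99,161.60,10,382.07,2.30,no,11.04,1625.45,80.27
1,407.75,122.10,2,0.00,1.86,no,,243.11,59.02
2,8407.84,6221.14,138,3296.70,49.65,yes,5.20,25865.23,47.70
"""

df = pd.read_csv(io.StringIO(csv_text))

# The first, unnamed column only repeats the row index, so drop it.
df = df.drop(columns=df.columns[0])

df.info()                              # dtypes and non-null counts
missing = df.isnull().sum()            # missing values per column
summary = df.describe(include="all")   # numeric and string columns together

print(df.shape)
print(missing)
print(summary)
```

Passing include="all" to describe is what produces the unique, top and freq rows for the sp500 column alongside the numeric statistics.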
Univariate Analysis
Boxplot of Variables before outlier treatment
o Outliers are present in all the variables except institutions.
o We cap the outliers using a quartile-based treatment.
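The report does not spell out the exact capping rule. A common quartile-based choice, assumed here, clips values to the fences Q1 - 1.5 × IQR and Q3 + 1.5 × IQR. The helper name cap_outliers_iqr and the sample values are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up sales figures with one extreme value.
sales = pd.Series([120.0, 450.0, 900.0, 1800.0, 135000.0])
capped = cap_outliers_iqr(sales)
print(capped.tolist())
```

Capping keeps every row in the dataset, which matters with only 759 firms; dropping outlier rows would be the alternative.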
Boxplot of variables after outlier treatment
Bivariate Analysis
Scatter Plot of Variables with hue of SP500
Correlation Heatmap
There is a high level of correlation between
o Capital and Sales
o Randd and Sales
o Employment and Sales
We chose to keep all the independent variables, as each one has an important function.
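As an illustration of how such correlations can be read off, the sketch below computes a Pearson correlation matrix and ranks the predictors by their correlation with sales. The values are made up; only the column names come from the dataset.

```python
import pandas as pd

# Made-up values for four of the numeric columns.
df = pd.DataFrame({
    "sales": [120.0, 450.0, 900.0, 1800.0, 2600.0],
    "capital": [50.0, 200.0, 480.0, 1075.0, 1500.0],
    "employment": [0.9, 2.9, 5.5, 10.0, 14.0],
    "institutions": [60.0, 25.0, 44.0, 90.0, 43.0],
})

corr = df.corr()  # Pearson correlation by default

# Absolute correlation of each predictor with sales, strongest first.
with_sales = corr["sales"].drop("sales").abs().sort_values(ascending=False)
print(with_sales)
```

The same corr matrix is what a heatmap plotting function would take as input.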
1.2 Impute null values if present? Do you think scaling is necessary in
this case?
Null values in the tobinq column have been replaced with the mean of the column, and
after imputation there are no missing values left.
Scaling is not necessary in this case. I fitted the regression on both the original dataset
and the scaled dataset, and no major change in the performance metrics was observed:
the R-squared values are identical, and RMSE differs only because it is expressed in the
units of the target.
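Both points can be checked on a small example. The sketch below uses synthetic data (the values are not from the report), fills the gaps in tobinq with the column mean, and then compares the R-squared of an ordinary least squares fit before and after z-score scaling. The helper r_squared is a hypothetical name.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in data; column names follow the report, values do not.
df = pd.DataFrame({
    "capital": rng.uniform(50, 1000, 200),
    "tobinq": rng.uniform(0.5, 5, 200),
})
df["sales"] = 2.0 * df["capital"] + 30.0 * df["tobinq"] + rng.normal(0, 50, 200)
df.loc[::20, "tobinq"] = np.nan  # knock out a few values

# Mean imputation, as described for tobinq.
df["tobinq"] = df["tobinq"].fillna(df["tobinq"].mean())

def r_squared(X: np.ndarray, y: np.ndarray) -> float:
    """R^2 of an ordinary least squares fit with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centered = y - y.mean()
    return float(1.0 - (resid @ resid) / (centered @ centered))

X = df[["capital", "tobinq"]].to_numpy()
y = df["sales"].to_numpy()
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)  # z-score scaling
y_scaled = (y - y.mean()) / y.std()

r2_raw = r_squared(X, y)
r2_scaled = r_squared(X_scaled, y_scaled)
print(round(r2_raw, 6), round(r2_scaled, 6))
```

Least squares predictions are unchanged by a linear rescaling of the inputs, so the two R-squared values agree up to floating-point error; only the coefficients and the RMSE change units.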
1.3 Encode the data (having string values) for Modelling. Data Split:
Split the data into test and train (70:30). Apply Linear regression.
Performance Metrics: Check the performance of Predictions on
Train and Test sets using Rsquare, RMSE.
The sp500 column has string values, which have been encoded into numeric values using
dummy variables: 0 means the firm does not belong to the S&P 500 group and 1 means it
does.
The data has been split into train and test sets in the ratio 70:30, and a linear regression
model has been fitted to the training data. Below are the scores.
For non-scaled data
                    Train RMSE   Test RMSE    Training Score  Test Score
Linear Regression   393.365800   402.933381   0.933731        0.930351
For scaled data
                    Train RMSE   Test RMSE    Training Score  Test Score
Linear Regression   0.257492     0.263755     0.933731        0.930351
Linear Regression Equation:
y = 0.25 * capital - 0.02 * patents + 0.03 * randd + 0.44 * employment
    - 0.05 * tobinq + 0.30 * value + 0.01 * institutions + 0.01 * sp500_yes - 0.00
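The encoding, split, fit and scoring steps of this section could look roughly like the sketch below. It runs on synthetic data with a reduced set of columns, and the random_state value is an arbitrary assumption, so the numbers it prints will not match the tables above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 300
# Synthetic stand-in for the firm data (values are made up).
df = pd.DataFrame({
    "capital": rng.uniform(50, 1000, n),
    "employment": rng.uniform(1, 50, n),
    "sp500": rng.choice(["no", "yes"], n),
})
df["sales"] = 1.5 * df["capital"] + 20 * df["employment"] + rng.normal(0, 40, n)

# drop_first=True keeps a single 0/1 column named sp500_yes.
encoded = pd.get_dummies(df, columns=["sp500"], drop_first=True, dtype=int)

X = encoded.drop(columns="sales")
y = encoded["sales"]
# test_size=0.30 gives the 70:30 train:test split; random_state is arbitrary.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = LinearRegression().fit(X_train, y_train)

def rmse(y_true, y_pred) -> float:
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))
train_rmse = rmse(y_train, model.predict(X_train))
test_rmse = rmse(y_test, model.predict(X_test))
print(f"R2   train={train_r2:.3f} test={test_r2:.3f}")
print(f"RMSE train={train_rmse:.1f} test={test_rmse:.1f}")
```

After fitting, model.coef_ and model.intercept_ hold the numbers needed to write out an equation of the form shown above.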
1.4 Inference: Based on these predictions, what are the business
insights and recommendations.
The sales of a firm depend strongly on its capital and employment.
There are 217 companies that belong to the sp500 group.
The majority of companies have sales below 10,000.
Top 5 attributes
1. Employment
2. Capital
3. Value
4. Tobinq
5. Patents
The company should focus on increasing its patent count and employment, which will
ultimately boost its sales in the market.
Logistic Regression and Linear Discriminant Analysis
Problem 2: You are hired by the government to analyse car crashes. You are provided with
details of car crashes, in which some people survived and some didn't. You have to
help the government predict whether a person will survive or not on the basis of the
information given in the dataset, so as to provide insights that will help the government
make stronger laws for car manufacturers to ensure safety measures. Also, find out the
important factors on the basis of which you made your predictions.
2.1 Data Ingestion: Read the dataset. Do the descriptive statistics and
do null value condition check, write an inference on it. Perform
Univariate and Bivariate Analysis. Do exploratory data analysis.
Data Dictionary of the dataset
1. dvcat: factor with levels (estimated impact speeds) 1-9km/h, 10-24, 25-39, 40-54,
55+
2. weight: Observation weights, albeit of uncertain accuracy, designed to account for
varying sampling probabilities. (The inverse probability weighting estimator can be used
to demonstrate causality when the researcher cannot conduct a controlled experiment
but has observed data to model) for further information go to this link:
3. Survived: factor with levels Survived or not_survived
4. airbag: a factor with levels none or airbag
5. seatbelt: a factor with levels none or belted
6. frontal: a numeric vector; 0 = non-frontal, 1=frontal impact
7. sex: a factor with levels f: Female or m: Male
8. ageOFocc: age of occupant in years
9. yearacc: year of accident
10. yearVeh: Year of model of vehicle; a numeric vector
11. abcat: Did one or more (driver or passenger) airbag(s) deploy? This factor has levels
deploy, nodeploy and unavail
12. occRole: a factor with levels driver or pass: passenger
13. deploy: a numeric vector: 0 if an airbag was unavailable or did not deploy; 1 if one or
more bags deployed.
Information about the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11217 entries, 0 to 11216
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 dvcat 11217 non-null object
1 weight 11217 non-null float64
2 Survived 11217 non-null object
3 airbag 11217 non-null object
4 seatbelt 11217 non-null object
5 frontal 11217 non-null int64
6 sex 11217 non-null object
7 ageOFocc 11217 non-null int64
8 yearacc 11217 non-null int64
9 yearVeh 11217 non-null float64
10 abcat 11217 non-null object
11 occRole 11217 non-null object
12 deploy 11217 non-null int64
13 injSeverity 11140 non-null float64
dtypes: float64(3), int64(4), object(7)
memory usage: 1.2+ MB
There are missing values in the injSeverity column (11140 of 11217 values are non-null,
so 77 are missing). We can impute them using the mode of the column.
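Mode imputation replaces each missing entry with the most frequent value of the column. A minimal sketch with made-up injSeverity values:

```python
import numpy as np
import pandas as pd

# Made-up injSeverity values with two gaps.
inj = pd.Series([3.0, 1.0, 3.0, np.nan, 4.0, 3.0, np.nan], name="injSeverity")

# mode() returns a Series (there can be ties); take the first value.
most_common = inj.mode().iloc[0]
inj_filled = inj.fillna(most_common)

print(most_common, int(inj_filled.isnull().sum()))
```

The mode suits injSeverity better than the mean because the column holds a small set of discrete severity codes stored as floats.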
Univariate Analysis
Count plot of Survived individuals
The number of people who survived the car crash is significantly higher than the number
who did not survive.
Countplot of Crash Speeds
Significantly more people crashed their cars at lower speeds than at higher speeds.
Countplot of Vehicles with Airbags
Most of the vehicles were equipped with airbags, and they worked properly.
Countplot of people using seatbelt during the crash
Most people were wearing a seatbelt during the crash.
Countplot of Sex of an Individual
There are more males than females who crashed their cars.
Countplot of Airbags functioning or not
There were 2599 cases where airbags were present but did not deploy.
Boxplot of Variables before outliers treatment
Boxplot of Variables after outliers treatment
Bivariate Analysis
Scatter Plot of variables
Barplot of Survived individuals by speed of Car
We can see that the highest number of deaths occurred in the 55+ km/h speed category.
Barplot of individuals that survived vs seatbelt
People who were wearing a seatbelt had a higher probability of survival than people
who were not wearing a seatbelt.
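A comparison like this can also be put in numbers with a row-normalised cross-tabulation, which gives the share of each outcome within each seatbelt group. The six rows below are made up for illustration, and the label spellings are assumed to match the dataset.

```python
import pandas as pd

# Made-up rows; label spellings are assumed to match the dataset.
crashes = pd.DataFrame({
    "seatbelt": ["belted", "belted", "belted", "none", "none", "belted"],
    "Survived": ["survived", "survived", "Not_survived",
                 "Not_survived", "survived", "survived"],
})

# normalize="index" turns counts into proportions within each seatbelt group.
rates = pd.crosstab(crashes["seatbelt"], crashes["Survived"], normalize="index")
print(rates)
```

Proportions are easier to compare than raw counts here because far more occupants were belted than unbelted.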
Bar plot of survival after frontal or non-frontal impact
Survival does not depend much on whether the impact was frontal or from any other
side.
2.2 Encode the data (having string values) for Modelling. Data Split:
Split the data into train and test (70:30). Apply Logistic Regression and
LDA (linear discriminant analysis).
The string columns have been encoded as integer codes.
Column Name   Value in the column   Encoding
dvcat         1-9km/h               0
              10-24                 1
              25-39                 2
              40-54                 3
              55+                   4
Survived      survived              0
              Not_survived          1
airbag        airbag                0
              none                  1
seatbelt      belted                0
              none                  1
sex           f                     0
              m                     1
abcat         deploy                0
              nodeploy              1
              unavail               2
occRole       driver                0
              pass                  1
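One way to apply such an encoding is a dictionary of value-to-code mappings passed to pandas' map. The sketch below copies the codes from the table; the two sample rows are made up, and the exact label spellings (for example Not_survived and pass) are assumptions that may need adjusting to the raw data.

```python
import pandas as pd

# Integer codes copied from the encoding table; label spellings are assumed.
encoding = {
    "dvcat": {"1-9km/h": 0, "10-24": 1, "25-39": 2, "40-54": 3, "55+": 4},
    "Survived": {"survived": 0, "Not_survived": 1},
    "airbag": {"airbag": 0, "none": 1},
    "seatbelt": {"belted": 0, "none": 1},
    "sex": {"f": 0, "m": 1},
    "abcat": {"deploy": 0, "nodeploy": 1, "unavail": 2},
    "occRole": {"driver": 0, "pass": 1},
}

# Two made-up rows with a subset of the columns.
crashes = pd.DataFrame({
    "dvcat": ["55+", "10-24"],
    "Survived": ["Not_survived", "survived"],
    "seatbelt": ["none", "belted"],
})

for column, codes in encoding.items():
    if column in crashes.columns:
        crashes[column] = crashes[column].map(codes)

print(crashes)
```

An explicit mapping keeps dvcat in speed order (0 for the slowest band, 4 for the fastest), which an automatic alphabetical label encoder would not guarantee.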
2.3 Performance Metrics: Check the performance of Predictions on
Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve
and get ROC_AUC score for each model. Compare both the models
and write inferences, which model is best/optimized.
Linear Discriminant Analysis
Model Performance on Test Dataset
Model Performance on Training Dataset
Confusion Matrix – Training Dataset
Confusion Matrix - Test Dataset
ROC Curve – Linear Discriminant Analysis
Logistic Regression
Model Performance on Test Dataset
Confusion Matrix – Training Dataset
Confusion Matrix – Test Dataset
ROC Curve – Logistic Regression
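The metrics named in this section (accuracy, confusion matrix and ROC AUC on the train and test sets, for both models) can be computed as in the sketch below. It uses synthetic features rather than the encoded crash data, and the random_state, stratify and max_iter settings are assumptions, so its scores say nothing about the results reported here.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000
# Synthetic stand-in for the encoded crash data (made-up relationship).
X = rng.normal(size=(n, 3))
logit = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

models = {
    "LDA": LinearDiscriminantAnalysis(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
splits = {"train": (X_train, y_train), "test": (X_test, y_test)}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    for split, (X_part, y_part) in splits.items():
        pred = model.predict(X_part)
        prob = model.predict_proba(X_part)[:, 1]  # probability of class 1
        results[(name, split)] = {
            "accuracy": accuracy_score(y_part, pred),
            "confusion_matrix": confusion_matrix(y_part, pred),
            "roc_auc": roc_auc_score(y_part, prob),
        }
        print(name, split,
              round(results[(name, split)]["accuracy"], 3),
              round(results[(name, split)]["roc_auc"], 3))
```

For the ROC plot itself, sklearn.metrics.roc_curve returns the false positive and true positive rates to draw. Comparing train and test scores side by side for each model is what shows whether either one is overfitting.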
2.4 Inference: Based on these predictions, what are the insights and
recommendations.
The Logistic Regression model performed much better than the LDA model. The Logistic
Regression model was able to account for 99% of the data, which is certainly a very good
figure.
Survival in a car crash was very much dependent on the airbags, on whether the person
was wearing a seatbelt, and on the speed of the car. Apart from these, other factors played
an almost negligible role.
We observed that there were many cases where airbags were present in the car but
did not deploy in time, leading to injury and death in several cases.