INSURANCE PREDICTION USING A MULTIPLE REGRESSION MODEL
Students: Mayuri.A, Apoorva Tumma, Prashanthi R S, Ragortham R C, R M Uma, Swetha.R, J Jeya Sharmila, Anjali Gupta, Antony Joshy, Gokul Premkumar
Subject: Statistical Machine Learning
Date: 11/02/2023
Objective
1) EDA
2) Multiple linear regression
Question: Take a suitable dataset having at least six features and build a linear regression ML model. Should the p-values of the feature variables be taken into account to check the adequacy of the model?
Insurance charges prediction
EDA
Exploratory Data Analysis (EDA) is a preliminary phase for understanding the data holistically.
We use different graphs and plots, and handle missing observations and outliers appropriately.
Dataset
The dataset below was sourced from the Kaggle website:
https://www.kaggle.com/datasets/noordeen/insurance-premium-prediction
The dataset is commonly used by practitioners to learn multiple linear regression. The original data had more than 12 attributes, which the team cleaned and refined down to the variables of interest.
We have also kept a few qualitative parameters to demonstrate how to deal with such data.
Sources of reference for our work
1) Applied Regression and Generalized Linear Models - John Fox
2) Regression Modelling - Michael Panik
3) Applied Linear Regression - Sanford Weisberg
Code references
We have adapted code and examples from:
1) Kaggle, GitHub, and the official documentation of the Python libraries used (sklearn, pandas, numpy, matplotlib, seaborn)
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import drive
drive.mount('/content/drive')
## Importing the Given Dataset
insurance=pd.read_csv("/content/drive/MyDrive/insurance data.csv")
insurance.head()
age sex bmi children smoker region expenses
0 19 female 27.9 0 yes southwest 16884.92
1 18 male 33.8 1 no southeast 1725.55
2 28 male 33.0 3 no southeast 4449.46
3 33 male 22.7 0 no northwest 21984.47
4 32 male 28.9 0 no northwest 3866.86
len(insurance.values)
1338
The length (number of observations) of the dataset is 1338.
insurance.shape  # (rows, columns)
(1338, 7)
insurance.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1338 non-null int64
1 sex 1338 non-null object
2 bmi 1338 non-null float64
3 children 1338 non-null int64
4 smoker 1338 non-null object
5 region 1338 non-null object
6 expenses 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
insurance.isnull()
age sex bmi children smoker region expenses
0 False False False False False False False
1 False False False False False False False
2 False False False False False False False
3 False False False False False False False
4 False False False False False False False
... ... ... ... ... ... ... ...
1333 False False False False False False False
1334 False False False False False False False
1335 False False False False False False False
1336 False False False False False False False
1337 False False False False False False False
1338 rows × 7 columns
insurance.isnull().sum()
age 0
sex 0
bmi 0
children 0
smoker 0
region 0
expenses 0
dtype: int64
There are no null values in the dataset.
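Since every count is zero, no imputation is needed here. If missing values were present, one common approach (a minimal sketch, not part of the original workflow) is to fill numeric columns with the mean and qualitative columns with the mode:

# Hypothetical handling of missing values (not needed for this dataset,
# since isnull().sum() shows zero nulls in every column).
for col_name in insurance.columns:
    if insurance[col_name].dtype == 'object':
        # categorical column: fill with the most frequent value
        insurance[col_name] = insurance[col_name].fillna(insurance[col_name].mode()[0])
    else:
        # numeric column: fill with the mean
        insurance[col_name] = insurance[col_name].fillna(insurance[col_name].mean())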
insurance.tail()
age sex bmi children smoker region expenses
1333 50 male 31.0 3 no northwest 10600.55
1334 18 female 31.9 0 no northeast 2205.98
1335 18 female 36.9 0 no southeast 1629.83
1336 21 female 25.8 0 no southwest 2007.95
1337 61 female 29.1 0 yes northwest 29141.36
insurance
age sex bmi children smoker region expenses
0 19 female 27.9 0 yes southwest 16884.92
1 18 male 33.8 1 no southeast 1725.55
2 28 male 33.0 3 no southeast 4449.46
3 33 male 22.7 0 no northwest 21984.47
4 32 male 28.9 0 no northwest 3866.86
... ... ... ... ... ... ... ...
1333 50 male 31.0 3 no northwest 10600.55
1334 18 female 31.9 0 no northeast 2205.98
1335 18 female 36.9 0 no southeast 1629.83
1336 21 female 25.8 0 no southwest 2007.95
1337 61 female 29.1 0 yes northwest 29141.36
1338 rows × 7 columns
col=list(insurance.columns)
col
['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'expenses']
display(insurance['smoker'].mode()[0])
insurance['children'].mean()
'no'
1.0949177877429
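For the qualitative columns, value_counts() gives a quick frequency summary at this stage, while sex, smoker, and region are still in the dataframe (a small sketch):

# Frequency tables for the categorical variables.
for cat_col in ['sex', 'smoker', 'region']:
    print(insurance[cat_col].value_counts(), "\n")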
insurance['age'].dtype
dtype('int64')
col
['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'expenses']
for col_name in col:
    if (insurance[col_name].dtypes == 'int64' or insurance[col_name].dtypes == 'float64'):
        plt.hist(insurance[col_name])
        plt.xlabel(col_name)
        plt.ylabel('count')
        plt.show()
We use a loop to create histograms for every numeric column (such as age, bmi, children, and expenses), plotting each variable against its frequency (count).
import seaborn as sns
sns.scatterplot(data=insurance, x="age", y="expenses")
<Axes: xlabel='age', ylabel='expenses'>
Check for outliers
for col_name in col:
    if (insurance[col_name].dtypes == 'int64' or insurance[col_name].dtypes == 'float64'):
        plt.boxplot(insurance[col_name])
        plt.xlabel(col_name)
        plt.ylabel('count')
        plt.show()
insurance.describe()
age bmi children expenses
count 1338.000000 1338.000000 1338.000000 1338.000000
mean 39.207025 30.665471 1.094918 13270.422414
std 14.049960 6.098382 1.205493 12110.011240
min 18.000000 16.000000 0.000000 1121.870000
25% 27.000000 26.300000 0.000000 4740.287500
50% 39.000000 30.400000 1.000000 9382.030000
75% 51.000000 34.700000 2.000000 16639.915000
max 64.000000 53.100000 5.000000 63770.430000
Interquartile Range (IQR)
# treating outliers with the IQR method
Q1 = insurance.bmi.quantile(0.25)
Q3 = insurance.bmi.quantile(0.75)
Q1
Q3
34.7
IQR = Q3 - Q1
IQR
8.400000000000002
Q1 - 1.5*IQR
Q3 + 1.5*IQR
47.300000000000004
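Only the last expression of each cell is displayed above; a short sketch that prints both fences explicitly, using the same Q1, Q3, and IQR:

lower_fence = Q1 - 1.5 * IQR   # 26.3 - 1.5 * 8.4 = 13.7
upper_fence = Q3 + 1.5 * IQR   # 34.7 + 1.5 * 8.4 = 47.3
print("Lower fence:", lower_fence)
print("Upper fence:", upper_fence)

# Count the bmi values falling outside the fences (the points treated as outliers below).
print("bmi outliers:", ((insurance.bmi < lower_fence) | (insurance.bmi > upper_fence)).sum())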
Box Plot
plt.boxplot(insurance['bmi'])
plt.show()
fig = plt.figure(figsize=(10, 20))
for i, v in enumerate(col):
    if insurance[v].dtype != 'object':
        plt.subplot(8, 2, i + 1)
        sns.boxplot(insurance[v])
The code above uses a loop to build these box plots; v is the loop variable that takes each numeric column name (bmi, children, and so on) in turn.
Removing the outliers in the data
Q1 = insurance.bmi.quantile(0.25)
Q3 = insurance.bmi.quantile(0.75)
IQR = Q3 - Q1
insurance = insurance[(insurance.bmi >= Q1 - 1.5*IQR) & (insurance.bmi <= Q3 + 1.5*IQR)]
insurance['bmi'].dtype
dtype('float64')
Q1 = insurance.expenses.quantile(0.25)
Q3 = insurance.expenses.quantile(0.75)
IQR = Q3 - Q1
insurance = insurance[(insurance.expenses >= Q1 - 1.5*IQR) & (insurance.expenses <= Q3 + 1.5*IQR)]
insurance['expenses'].dtype
dtype('float64')
insurance.shape
(1191, 7)
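Filtering on bmi and then on expenses reduced the data from 1338 to 1191 rows (147 rows removed). Since the same IQR filter is applied twice, a small helper function could factor out the repetition (the name remove_outliers_iqr is ours, a sketch):

def remove_outliers_iqr(df, column):
    """Keep only the rows whose value in `column` lies inside the 1.5*IQR fences."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    return df[(df[column] >= q1 - 1.5 * iqr) & (df[column] <= q3 + 1.5 * iqr)]

# Equivalent to the two filtering steps above:
# insurance = remove_outliers_iqr(insurance, 'bmi')
# insurance = remove_outliers_iqr(insurance, 'expenses')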
After removing the outliers
for col_name in col:
    if (insurance[col_name].dtypes == 'int64' or insurance[col_name].dtypes == 'float64'):
        sns.boxplot(insurance[col_name])
        plt.xlabel(col_name)
        plt.ylabel('count')
        plt.show()
Drop the object-type columns, because the regression needs only numeric data
# Note: the target column in this dataset is named 'expenses' (not 'charges').
for i in col:
    if i != 'charges' and insurance[i].dtype == 'float':
        # numeric column: fill any missing values with this column's mean (a no-op here, there are none)
        insurance.fillna(insurance[i].mean(), inplace=True)
    elif i != 'charges' and insurance[i].dtype == 'object':
        # qualitative column (sex, smoker, region): drop it
        insurance.drop(i, axis=1, inplace=True)
    else:
        pass
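Dropping sex, smoker, and region discards potentially useful information. An alternative, not used in this notebook, is to one-hot encode the qualitative columns with pandas so they can enter the regression as 0/1 indicators (a sketch, shown on a fresh copy of the raw data since the columns were just dropped above):

# Hypothetical alternative: encode the categorical columns instead of dropping them.
raw = pd.read_csv("/content/drive/MyDrive/insurance data.csv")
# drop_first=True avoids the dummy-variable trap (perfect collinearity with the intercept).
insurance_encoded = pd.get_dummies(raw, columns=['sex', 'smoker', 'region'], drop_first=True)
insurance_encoded.head()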
insurance.corr().T
age bmi children expenses
age 1.000000 0.123845 0.038179 0.448798
bmi 0.123845 1.000000 0.007357 -0.064589
children 0.038179 0.007357 1.000000 0.089083
expenses 0.448798 -0.064589 0.089083 1.000000
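The correlation matrix is easier to read as a heatmap; a small visualization sketch using seaborn:

# Visualize the correlation matrix of the remaining numeric columns.
sns.heatmap(insurance.corr(), annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.show()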
insurance.shape
(1191, 4)
Linear Regression
We use modules and built-in features from scikit-learn for the linear regression model and the train/test split, and statsmodels for the variance inflation factor.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from statsmodels.stats.outliers_influence import variance_inflation_factor
col_list = []
for col in insurance.columns:
    if ((insurance[col].dtype != 'object') & (col != 'charges')):  # keep only the numeric columns
        col_list.append(col)
X = insurance[col_list]
X.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1191 entries, 0 to 1337
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 1191 non-null int64
1 bmi 1191 non-null float64
2 children 1191 non-null int64
3 expenses 1191 non-null float64
dtypes: float64(2), int64(2)
memory usage: 46.5 KB
X.values
array([[1.900000e+01, 2.790000e+01, 0.000000e+00, 1.688492e+04],
[1.800000e+01, 3.380000e+01, 1.000000e+00, 1.725550e+03],
[2.800000e+01, 3.300000e+01, 3.000000e+00, 4.449460e+03],
...,
[1.800000e+01, 3.690000e+01, 0.000000e+00, 1.629830e+03],
[2.100000e+01, 2.580000e+01, 0.000000e+00, 2.007950e+03],
[6.100000e+01, 2.910000e+01, 0.000000e+00, 2.914136e+04]])
for i in range(len(X.columns)):
    print(i)
0
1
2
3
from statsmodels.stats.outliers_influence import variance_inflation_factor
col_list = []
for col in insurance.columns:
    if ((insurance[col].dtype != 'object') & (col != 'charges')):
        col_list.append(col)
X = insurance[col_list]
vif_data = pd.DataFrame()
print(X.columns)
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]
print(vif_data)
Index(['age', 'bmi', 'children', 'expenses'], dtype='object')
feature VIF
0 age 10.418850
1 bmi 7.955137
2 children 1.786420
3 expenses 3.660522
VIF (Variance Inflation Factor) is used to assess multicollinearity among the variables. A VIF > 5 indicates that the variable is strongly collinear with the others, so we drop such parameters.
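The notebook performs this drop manually below (age has the largest VIF, so it is removed). The same idea can be written as a small reusable function that repeatedly drops the feature with the highest VIF until every VIF is at most 5 (the name drop_high_vif is ours, a sketch):

def drop_high_vif(df, threshold=5.0):
    """Iteratively drop the column with the largest VIF until all VIFs are <= threshold."""
    cols = list(df.columns)
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(df[cols].values, i) for i in range(len(cols))],
            index=cols,
        )
        if vifs.max() <= threshold or len(cols) == 1:
            return df[cols], vifs
        cols.remove(vifs.idxmax())  # drop the most collinear feature and recompute

# Usage (equivalent to dropping 'age' manually below):
# X_reduced, final_vifs = drop_high_vif(X)
# print(final_vifs)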
insurance.size
4764
len(X.columns)
insurance=insurance.drop(['age'], axis=1)
col_list = []
for col in insurance.columns:
    if ((insurance[col].dtype != 'object') & (col != 'charges')):
        col_list.append(col)
X = insurance[col_list]
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
for i in range(len(X.columns))]
print(vif_data)
# vif value < 5
feature VIF
0 bmi 3.120146
1 children 1.784762
2 expenses 2.676247
x = insurance.loc[:, ['bmi', 'children', 'expenses']]  # predictors (note: 'expenses' is included here)
y = insurance.iloc[:, -1]                               # target: the last column, i.e. 'expenses'
insurance.head()
bmi children expenses
0 27.9 0 16884.92
1 33.8 1 1725.55
2 33.0 3 4449.46
3 22.7 0 21984.47
4 28.9 0 3866.86
x_train, x_test, y_train, y_test=train_test_split(x,y,train_size=0.8, random_state=0)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
(952, 3)
(952,)
(239, 3)
(239,)
from sklearn.linear_model import LinearRegression
l_model=LinearRegression()
#building the model
l_model.fit(x_train, y_train)
predictions=l_model.predict(x_test)
predictions
array([ 4673.39, 8551.35, 15170.07, 2261.57, 1631.82, 10141.14,
7729.65, 5630.46, 33732.69, 11015.17, 3756.62, 13143.86,
9850.43, 4185.1 , 2138.07, 14133.04, 10422.92, 12347.17,
23807.24, 17081.08, 7151.09, 2473.33, 33900.65, 10370.91,
8823.28, 8457.82, 11093.62, 6406.41, 5415.66, 4189.11,
1837.24, 4320.41, 7256.72, 4058.12, 8347.16, 22218.11,
10796.35, 14043.48, 8825.09, 2585.85, 18608.26, 5458.05,
20149.32, 5227.99, 6593.51, 3556.92, 11299.34, 4779.6 ,
11944.59, 9447.25, 18972.5 , 11763. , 2680.95, 5124.19,
1711.03, 1875.34, 3757.84, 1769.53, 13880.95, 10325.21,
2699.57, 12244.53, 4133.64, 28340.19, 20167.34, 13204.29,
24180.93, 30284.64, 10795.94, 14474.68, 9625.92, 25309.49,
5152.13, 7209.49, 18903.49, 10450.55, 12265.51, 20420.6 ,
12741.17, 24227.34, 1880.49, 10381.48, 2134.9 , 6272.48,
20781.49, 18033.97, 19539.24, 11931.13, 4428.89, 14256.19,
5397.62, 12430.95, 5080.1 , 4670.64, 8211.1 , 22478.6 ,
5966.89, 10579.71, 6414.18, 3994.18, 8606.22, 6289.75,
3353.28, 7243.81, 2207.7 , 24535.7 , 3176.29, 7749.16,
11082.58, 11534.87, 8515.76, 14455.64, 13747.87, 8023.14,
11272.33, 10214.64, 11187.66, 8334.59, 13352.1 , 3847.67,
13228.85, 14988.43, 8978.19, 1141.45, 12222.9 , 10959.33,
11345.52, 4687.8 , 6238.3 , 17748.51, 19214.71, 5253.52,
18310.74, 15359.1 , 14210.54, 7419.48, 19023.26, 14478.33,
4074.45, 7935.29, 22462.04, 10493.95, 2721.32, 3268.85,
10928.85, 6548.2 , 5246.05, 2352.97, 2302.3 , 11356.66,
34166.27, 4561.19, 5383.54, 24869.84, 6360.99, 8627.54,
1708. , 13887.2 , 7640.31, 1725.55, 10704.47, 1980.07,
8603.82, 10577.09, 2897.32, 8162.72, 6653.79, 21771.34,
6875.96, 21880.82, 11848.14, 9058.73, 17663.14, 17942.11,
13041.92, 4762.33, 4260.74, 4151.03, 9225.26, 4237.13,
28468.92, 5693.43, 6186.13, 11512.41, 15019.76, 20234.85,
2850.68, 10106.13, 8442.67, 6799.46, 4931.65, 10355.64,
7804.16, 17904.53, 9778.35, 4234.93, 15555.19, 17179.52,
10226.28, 13224.06, 1635.73, 9095.07, 7153.55, 3062.51,
18955.22, 8252.28, 2523.17, 14235.07, 7325.05, 19350.37,
11552.9 , 4571.41, 1622.19, 7623.52, 8068.19, 6474.01,
11305.93, 12094.48, 11482.63, 12957.12, 7740.34, 11881.97,
11411.69, 1136.4 , 10797.34, 13224.69, 23887.66, 6435.62,
4433.92, 9620.33, 27375.9 , 14394.4 , 27218.44, 2020.55,
9048.03, 3056.39, 1972.95, 11737.85, 9301.89])
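The raw predictions alone do not show how adequate the model is; a short evaluation sketch on the held-out test set using standard scikit-learn metrics:

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Compare predictions with the actual test-set expenses.
print("R^2 :", r2_score(y_test, predictions))
print("MAE :", mean_absolute_error(y_test, predictions))
print("RMSE:", np.sqrt(mean_squared_error(y_test, predictions)))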
error_pred=pd.DataFrame({'Actual_data':y_test,'Prediction_data':pd.Series(predictions)})
error_pred
Actual_data Prediction_data
0 NaN 4673.39
1 1725.55 8551.35
2 NaN 15170.07
3 NaN 2261.57
4 NaN 1631.82
... ... ...
1311 4571.41 NaN
1315 11272.33 NaN
1329 10325.21 NaN
1331 10795.94 NaN
1332 11411.69 NaN
440 rows × 2 columns
error_pred['Error']=error_pred['Actual_data']-error_pred['Prediction_data']
error_pred
Actual_data Prediction_data Error
0 NaN 4673.39 NaN
1 1725.55 8551.35 -6825.8
2 NaN 15170.07 NaN
3 NaN 2261.57 NaN
4 NaN 1631.82 NaN
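The NaN entries above arise because y_test keeps the row labels of the original dataframe (1, 1311, 1315, ...) while pd.Series(predictions) gets a fresh 0-238 index; pandas aligns on the index, so the two columns rarely match and the frame grows to 440 partly empty rows instead of 239 matched pairs. A sketch of an index-aligned construction:

# Align actual and predicted values by position rather than by the original row labels.
error_pred = pd.DataFrame({
    'Actual_data': y_test.reset_index(drop=True),   # reindexed 0..238
    'Prediction_data': pd.Series(predictions),      # already indexed 0..238
})
error_pred['Error'] = error_pred['Actual_data'] - error_pred['Prediction_data']
error_pred.head()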
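Returning to the assignment question about p-values: scikit-learn's LinearRegression does not report them, but the same regression can be refit with statsmodels to inspect the coefficient p-values and R² as adequacy checks (a sketch, assuming x_train and y_train from the split above):

import statsmodels.api as sm

# Refit the same regression with statsmodels to obtain p-values for each feature.
X_train_const = sm.add_constant(x_train)        # add the intercept term explicitly
ols_model = sm.OLS(y_train, X_train_const).fit()

print(ols_model.summary())                      # coefficients, p-values, R^2, F-statistic
print(ols_model.pvalues)                        # p-value of each feature (H0: coefficient = 0)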