Project Report On Car Price Prediction Using
Machine Learning
Submitted by Mr. Omkar Balwant Jadhav
+ Data
+ What Problem We Have and Which Metric to Use?
+ Exploratory Data Analysis
  = Target Variable
  = Numerical Features
  = Categorical Features
+ Model Selection
  = Baseline Model
  = Models with Ridge & Lasso & ElasticNet and KNN
  = Models with Random Forest & Extra Trees & Gradient Boosting & XGBoost
  = Best Model with Hyperparameter Tuning
  = Feature Importance
+ Conclusion
1. Collecting Data
In [1]: import pandas as pd
        data = pd.read_csv("C:\\Users\\Onkar\\Downloads\\CarPrice.csv")
        data
2. Defining the problem statement
In this project, we study a car price dataset in tabular format. We use libraries such as numpy, pandas and matplotlib together with different machine learning algorithms. We study the different columns of the table and try to correlate them with one another to find relations between them. Since price is a continuous target, this is a regression problem, and we will evaluate our models with RMSE and the R2 score, as the quick sketch below illustrates.
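To make the metric concrete, here is a minimal sketch (the numbers are made up purely for illustration) of computing RMSE and R2 with scikit-learn:

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy true vs. predicted prices, purely illustrative.
y_true = np.array([10000.0, 15000.0, 30000.0])
y_hat = np.array([11000.0, 14000.0, 28000.0])
rmse = np.sqrt(mean_squared_error(y_true, y_hat))
print(f'RMSE: {rmse:.2f}, R2: {r2_score(y_true, y_hat):.4f}')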
3. Exploratory Data Analysis
Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypotheses, and to check assumptions with the help of summary statistics and graphical representations.
It is good practice to understand the data first and to try to gather as many insights from it as possible. EDA is all about making sense of the data in hand.
In [2]: data.shape
Out[2]: (205, 26)

In [3]: data.head()
Out[3]:
   car_ID  symboling                   CarName fueltype aspiration doornumber     carbody drivewheel  ...
0       1          3        alfa-romero giulia      gas        std        two  convertible        rwd  ...
1       2          3       alfa-romero stelvio      gas        std        two  convertible        rwd  ...
2       3          1  alfa-romero Quadrifoglio      gas        std        two    hatchback        rwd  ...
3       4          2               audi 100 ls      gas        std       four        sedan        fwd  ...
4       5          2                audi 100ls      gas        std       four        sedan        4wd  ...

5 rows x 26 columns
In [5]: data['CarName'].value_counts()
Out[5]:
toyota corona           ...
toyota corolla          ...
peugeot 504             ...
subaru dl               ...
mitsubishi mirage g4    ...
                       ...
mazda glc               ...
mazda rx2 coupe         ...
maxda glc deluxe        ...
maxda rx3               ...
volvo 246               ...
Name: CarName, Length: 147, dtype: int64
In [87]: import numpy as np
         import pandas as pd
         import matplotlib.pyplot as plt
         import seaborn as sns

         from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder, PowerTransformer
         from sklearn.model_selection import KFold, cross_val_predict, train_test_split
         from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
         from sklearn.metrics import r2_score, mean_squared_error
         from sklearn.pipeline import make_pipeline
         from sklearn.compose import make_column_transformer
         from sklearn.neighbors import KNeighborsRegressor
         from sklearn.svm import SVR
         from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
         from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

In [26]: df = pd.read_csv("C:\\Users\\Onkar\\Downloads\\CarPrice.csv")
         df.head()

Out[26]:
   car_ID  symboling                   CarName fueltype aspiration doornumber     carbody drivewheel  ...
0       1          3        alfa-romero giulia      gas        std        two  convertible        rwd  ...
1       2          3       alfa-romero stelvio      gas        std        two  convertible        rwd  ...
2       3          1  alfa-romero Quadrifoglio      gas        std        two    hatchback        rwd  ...
3       4          2               audi 100 ls      gas        std       four        sedan        fwd  ...
4       5          2                audi 100ls      gas        std       four        sedan        4wd  ...

5 rows x 26 columns
In [27]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   car_ID            205 non-null    int64
 1   symboling         205 non-null    int64
 2   CarName           205 non-null    object
 3   fueltype          205 non-null    object
 4   aspiration        205 non-null    object
 5   doornumber        205 non-null    object
 6   carbody           205 non-null    object
 7   drivewheel        205 non-null    object
 8   enginelocation    205 non-null    object
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64
 14  enginetype        205 non-null    object
 15  cylindernumber    205 non-null    object
 16  enginesize        205 non-null    int64
 17  fuelsystem        205 non-null    object
 18  boreratio         205 non-null    float64
 19  stroke            205 non-null    float64
 20  compressionratio  205 non-null    float64
 21  horsepower        205 non-null    int64
 22  peakrpm           205 non-null    int64
 23  citympg           205 non-null    int64
 24  highwaympg        205 non-null    int64
 25  price             205 non-null    float64
dtypes: float64(8), int64(8), object(10)
memory usage: 41.8+ KB

In [28]: df.duplicated().sum()
Out[28]: 0
In [29]: def missing(df):
             missing_number = df.isnull().sum().sort_values(ascending=False)
             missing_percent = (df.isnull().sum() / df.isnull().count()).sort_values(ascending=False)
             missing_values = pd.concat([missing_number, missing_percent], axis=1,
                                        keys=['Missing Number', 'Missing Percent'])
             return missing_values

         missing(df)

Out[29]:
                  Missing Number  Missing Percent
car_ID                         0              0.0
symboling                      0              0.0
highwaympg                     0              0.0
citympg                        0              0.0
peakrpm                        0              0.0
horsepower                     0              0.0
compressionratio               0              0.0
stroke                         0              0.0
boreratio                      0              0.0
fuelsystem                     0              0.0
enginesize                     0              0.0
cylindernumber                 0              0.0
enginetype                     0              0.0
curbweight                     0              0.0
carheight                      0              0.0
carwidth                       0              0.0
carlength                      0              0.0
wheelbase                      0              0.0
enginelocation                 0              0.0
drivewheel                     0              0.0
carbody                        0              0.0
doornumber                     0              0.0
aspiration                     0              0.0
fueltype                       0              0.0
CarName                        0              0.0
price                          0              0.0
In [30]: df.nunique()
Out[30]:
car_ID              205
symboling             6
CarName             147
fueltype              2
aspiration            2
doornumber            2
carbody               5
drivewheel            3
enginelocation        2
wheelbase            53
carlength            75
carwidth             44
carheight            49
curbweight          171
enginetype            7
cylindernumber        7
enginesize           44
fuelsystem            8
boreratio            38
stroke               37
compressionratio     32
horsepower           59
peakrpm              23
citympg              29
highwaympg           30
price               189
dtype: int64
+ There is no zero-variance variable.
+ The car_ID column is just a repetition of the index, so I'll drop it.
+ CarName has 147 distinct values. I'll check it and try to find a way to reduce the cardinality.
+ Other than that, there is no problem.
In [31]: df1 = df.copy()
In [32]: df1['CarName'].sample(5)
Out[32]:
bmw z4
isuzu MU-X
honda accord lx
saab 99e
volkswagen model 111
Name: CarName, dtype: object
In [33]: df1['CarName'].unique()
Out[33]:
array(['alfa-romero giulia', 'alfa-romero stelvio',
       'alfa-romero Quadrifoglio', 'audi 100 ls', 'audi 100ls',
       'audi fox', 'audi 5000', 'audi 4000', 'audi 5000s (diesel)',
       'bmw 320i', 'bmw x1', 'bmw x3', 'bmw z4', 'bmw x4', 'bmw x5',
       'chevrolet impala', 'chevrolet monte carlo', 'chevrolet vega 2300',
       'dodge rampage', 'dodge challenger se', 'dodge d200',
       'dodge monaco (sw)', 'dodge colt hardtop', 'dodge colt (sw)',
       'dodge coronet custom', 'dodge dart custom',
       'dodge coronet custom (sw)', 'honda civic', 'honda civic cvcc',
       'honda accord cvcc', 'honda accord lx', 'honda civic 1500 gl',
       'honda accord', 'honda civic 1300', 'honda prelude',
       'honda civic (auto)', 'isuzu MU-X', 'isuzu D-Max ',
       'isuzu D-Max V-Cross', 'jaguar xj', 'jaguar xf', 'jaguar xk',
       'maxda rx3', 'maxda glc deluxe', 'mazda rx2 coupe', 'mazda rx-4',
       'mazda glc deluxe', 'mazda 626', 'mazda glc', 'mazda rx-7 gs',
       'mazda glc 4', 'mazda glc custom l', 'mazda glc custom',
       'buick electra 225 custom', 'buick century luxus (sw)',
       'buick century', 'buick skyhawk', 'buick opel isuzu deluxe',
       'buick skylark', 'buick century special',
       'buick regal sport coupe (turbo)', 'mercury cougar',
       'mitsubishi mirage', 'mitsubishi lancer', 'mitsubishi outlander',
       'mitsubishi g4', 'mitsubishi mirage g4', 'mitsubishi montero',
       'mitsubishi pajero', 'Nissan versa', 'nissan gt-r', 'nissan rogue',
       'nissan latio', 'nissan titan', 'nissan leaf', 'nissan juke',
       'nissan note', 'nissan clipper', 'nissan nv200', 'nissan dayz',
       'nissan fuga', 'nissan otti', 'nissan teana', 'nissan kicks',
       'peugeot 504', 'peugeot 304', 'peugeot 504 (sw)', 'peugeot 604sl',
       'peugeot 505s turbo diesel', 'plymouth fury iii',
       'plymouth cricket', 'plymouth satellite custom (sw)',
       'plymouth fury gran sedan', 'plymouth valiant', 'plymouth duster',
       'porsche macan', 'porcshce panamera', 'porsche cayenne',
       'porsche boxter', 'renault 12tl', 'renault 5 gtl', 'saab 99e',
       'saab 99le', 'saab 99gle', 'subaru', 'subaru dl', 'subaru brz',
       'subaru baja', 'subaru r1', 'subaru r2', 'subaru trezia',
       'subaru tribeca', 'toyota corona mark ii', 'toyota corona',
       'toyota corolla 1200', 'toyota corona hardtop',
       'toyota corolla 1600 (sw)', 'toyota carina', 'toyota mark ii',
       'toyota corolla', 'toyota corolla liftback',
       'toyota celica gt liftback', 'toyota corolla tercel',
       'toyota corona liftback', 'toyota starlet', 'toyota tercel',
       'toyota cressida', 'toyota celica gt', 'toyouta tercel',
       'vokswagen rabbit', 'volkswagen 1131 deluxe sedan',
       'volkswagen model 111', 'volkswagen type 3', 'volkswagen 411 (sw)',
       'volkswagen super beetle', 'volkswagen dasher', 'vw dasher',
       'vw rabbit', 'volkswagen rabbit', 'volkswagen rabbit custom',
       'volvo 145e (sw)', 'volvo 144ea', 'volvo 244dl', 'volvo 245',
       'volvo 264gl', 'volvo diesel', 'volvo 246'], dtype=object)
+ I'll use only the brands/makes, not the models.
+ I have seen several typos (e.g. 'maxda', 'toyouta', 'vokswagen', 'porcshce'); I'll handle those.
In [34]: df1['model'] = [x.split()[0] for x in df1['CarName']]
         df1['model'] = df1['model'].replace({'maxda': 'Mazda', 'mazda': 'Mazda',
                                              'nissan': 'Nissan',
                                              'porcshce': 'Porsche', 'porsche': 'Porsche',
                                              'toyouta': 'Toyota', 'toyota': 'Toyota',
                                              'vokswagen': 'Volkswagen', 'volkswagen': 'Volkswagen',
                                              'vw': 'Volkswagen'})

Let's drop the 'CarName' and 'car_ID' columns.

In [35]: df1 = df1.drop(['car_ID', 'CarName'], axis=1)

In [36]: print(f'We have {df1.shape[0]} instances with the {df1.shape[1]-1} features and 1 output variable')

We have 205 instances with the 24 features and 1 output variable

In [37]: numerical = df1.drop(['price'], axis=1).select_dtypes('number').columns
         categorical = df1.select_dtypes('object').columns
         print(f'Numerical Columns: {df1[numerical].columns}')
         print('\n')
         print(f'Categorical Columns: {df1[categorical].columns}')

Numerical Columns: Index(['symboling', 'wheelbase', 'carlength', 'carwidth', 'carheight',
       'curbweight', 'enginesize', 'boreratio', 'stroke', 'compressionratio',
       'horsepower', 'peakrpm', 'citympg', 'highwaympg'],
      dtype='object')

Categorical Columns: Index(['fueltype', 'aspiration', 'doornumber', 'carbody', 'drivewheel',
       'enginelocation', 'enginetype', 'cylindernumber', 'fuelsystem',
       'model'],
      dtype='object')
Target Variable
In [38]: df1['price'].describe()
Out[38]: count      205.000000
         mean     13276.710571
         std       7988.852332
         min       5118.000000
         25%       7788.000000
         50%      10295.000000
         75%      16503.000000
         max      45400.000000
         Name: price, dtype: float64
In [39]: print(f"Skewness: {df1['price'].skew()}")

Skewness: 1.7776781560914454
In [41]: df1['price'].plot(kind='hist')
Out[41]: <AxesSubplot:ylabel='Frequency'>

[Histogram of price (x-axis 5,000 to 45,000, y-axis frequency): the distribution is clearly right-skewed.]
+ Even though the target variable is right-skewed, I will not apply any transformation to it (a quick sketch of what such a transformation would look like follows).
+ Let's look at the numerical features.
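For reference, here is a minimal sketch (not applied in this report) of the log transform one could use to reduce the target's right skew:

# Hypothetical alternative, not used in this report: log1p compresses the
# right tail; predictions would then need np.expm1 to return to price units.
log_price = np.log1p(df1['price'])
print(f"skew before: {df1['price'].skew():.2f}, after log1p: {log_price.skew():.2f}")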
Numerical Features
In [42]: df1[numerical].describe()
Out[42]:
        symboling   wheelbase   carlength    carwidth   carheight   curbweight  enginesize  ...
count  205.000000  205.000000  205.000000  205.000000  205.000000   205.000000  205.000000  ...
mean     0.834146   98.756585  174.049268   65.907805   53.724878  2555.565854  126.907317  ...
std      1.245307    6.021776   12.337289    2.145204    2.443522   520.680204   41.642693  ...
min     -2.000000   86.600000  141.100000   60.300000   47.800000  1488.000000   61.000000  ...
25%      0.000000   94.500000  166.300000   64.100000   52.000000  2145.000000   97.000000  ...
50%      1.000000   97.000000  173.200000   65.500000   54.100000  2414.000000  120.000000  ...
75%      2.000000  102.400000  183.100000   66.900000   55.500000  2935.000000  141.000000  ...
max      3.000000  120.900000  208.100000   72.300000   59.800000  4066.000000  326.000000  ...
In [44]: df1[numerical].plot(kind='hist');

[Overlaid histograms of all numerical features: symboling, wheelbase, carlength, carwidth, carheight, curbweight, enginesize, boreratio, stroke, compressionratio, horsepower, peakrpm, citympg, highwaympg.]
In [55]: df1[numerical].plot(kind='hist', subplots=True, bins=50)
Out[55]: array([<AxesSubplot:ylabel='Frequency'>, ...], dtype=object)

[One 50-bin histogram per numerical feature; several features are visibly skewed.]
+ During the modelling process, we can use a power transformer on the skewed features (a sketch of how the skewed columns can be identified follows).
+ Let's observe the correlation among the numerical features,
+ and also the correlation with the target variable.
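The cell that defines the skew_cols variable used later in the modelling pipeline did not survive this export; the following is a plausible reconstruction (the 0.5 threshold is an assumption):

# Plausible reconstruction (assumption): keep the numerical columns whose
# absolute skewness exceeds 0.5; skew_cols.index is what the pipeline uses.
# Note: recompute after dropping 'citympg' so the index matches df2.
skewness = df1[numerical].skew().sort_values(ascending=False)
skew_cols = skewness[abs(skewness) > 0.5].to_frame(name='skew')
print(skew_cols)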
In [61]: numerical1 = df1.select_dtypes('number').columns

         matrix = np.triu(df1[numerical1].corr())
         fig, ax = plt.subplots(figsize=(14, 10))
         sns.heatmap(df1[numerical1].corr(), annot=True, fmt='.2f', vmin=-1, vmax=1, mask=matrix, ax=ax)

[Lower-triangle correlation heatmap of all numerical features and price.]
+ We have 9 numerical features with more than .5 correlation with the price variable.
+ This is a good sign for the prediction capability of the model, but we still need to see it in practice.
+ From the .9-threshold perspective: highwaympg and citympg have a .97 correlation. We can drop one of them to avoid multicollinearity problems for the linear models (a sketch for listing such pairs follows).
+ I have observed several highly correlated features below the .9 level.
+ Let's drop 'citympg'.
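A minimal sketch (not part of the original notebook) for listing the feature pairs whose absolute correlation exceeds a chosen threshold:

# List feature pairs with |correlation| above a threshold (0.9 here).
corr = df1[numerical1].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))  # upper triangle; k=1 skips the diagonal
pairs = upper.stack().sort_values(ascending=False)
print(pairs[pairs > 0.9])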
In [62]: df1 = df1.drop('citympg', axis=1)
Categorical Features
In [63]: df1[categorical].head()
Out[63]:
  fueltype aspiration doornumber      carbody drivewheel enginelocation enginetype  ...
0      gas        std        two  convertible        rwd          front       dohc  ...
1      gas        std        two  convertible        rwd          front       dohc  ...
2      gas        std        two    hatchback        rwd          front       ohcv  ...
3      gas        std       four        sedan        fwd          front        ohc  ...
4      gas        std       four        sedan        4wd          front        ohc  ...
Fuel Type and Price
In [65]: print(df1.groupby('fueltype')['price'].mean().sort_values())
         print()
         df1.groupby('fueltype')['price'].mean().plot(kind='hist', subplots=True)

fueltype
gas       12999.7982
diesel    15838.1500
Name: price, dtype: float64

Out[65]: array([<AxesSubplot:ylabel='Frequency'>], dtype=object)

[Histogram of mean price by fuel type (13,000 to 15,500).]
+ Diesel cars are, on average, more expensive than gas cars.
Aspiration and Price
In [67]: print(df1.groupby('aspiration')['price'].mean().sort_values())
         print()
         df1.groupby('aspiration')['price'].mean().plot(kind='hist', subplots=True)

aspiration
std      12611.270833
turbo    16298.166676
Name: price, dtype: float64

Out[67]: array([<AxesSubplot:ylabel='Frequency'>], dtype=object)

[Histogram of mean price by aspiration (12,500 to 16,000).]
+ Turbo-aspirated cars are more expensive than standard-aspiration cars.
CarBody and Price
In [68]: print(df1.groupby('carbody')['price'].mean().sort_values())
         print()
         df1.groupby('carbody')['price'].mean().plot(kind='hist', subplots=True)

carbody
hatchback      10376.652386
wagon          12371.960000
sedan          14344.270833
convertible    21898.500000
hardtop        22208.500000
Name: price, dtype: float64

Out[68]: array([<AxesSubplot:ylabel='Frequency'>], dtype=object)

[Histogram of mean price by car body (10,000 to 22,000).]
+ Based on price, there are clear differences among the car body types.
+ While hatchbacks are the least expensive ones, hardtops and convertibles are the most expensive ones.
Drivewheel and Price
In [70]: print(df1.groupby('drivewheel')['price'].mean().sort_values())
         print()
         df1.groupby('drivewheel')['price'].mean().plot(kind='hist', subplots=True)

drivewheel
fwd     9239.308333
4wd    11087.463000
rwd    19910.809211
Name: price, dtype: float64

Out[70]: array([<AxesSubplot:ylabel='Frequency'>], dtype=object)

[Histogram of mean price by drive wheel (10,000 to 20,000).]
+ Rear-wheel-drive cars are the most expensive ones; front-wheel-drive cars are the least expensive ones.
Engine Location and Price
In [72]: print(df1.groupby('enginelocation')['price'].mean().sort_values())
         print()
         df1.groupby('enginelocation')['price'].mean().plot(kind='hist', subplots=True)

enginelocation
front    12961.097361
rear     34528.000000
Name: price, dtype: float64

Out[72]: array([<AxesSubplot:ylabel='Frequency'>], dtype=object)

[Histogram of mean price by engine location (12,500 to 30,000).]

+ Rear-engine cars are far more expensive than front-engine cars.
+ Our dataset has 7 different engine types, and price changes significantly among them.
Fuel System and Price
In [75]: print(df1.groupby('fuelsystem')['price'].mean().sort_values())
         print()
         df1.groupby('fuelsystem')['price'].mean().plot(kind='hist', subplots=True)

fuelsystem
2bbl     7478.151515
1bbl     7555.545455
spdi    10990.444444
spfi    11048.000000
4bbl    12145.000000
mfi     12964.200000
idi     15838.150000
mpfi    17754.602840
Name: price, dtype: float64

Out[75]: array([<AxesSubplot:ylabel='Frequency'>], dtype=object)

[Histogram of mean price by fuel system (8,000 to 18,000).]
+ Our dataset has 8 different fuel systems, and price changes significantly among them.
Model and Price
In [76]: print(df1.groupby('model')['price'].mean().sort_values())
         print()
         df1.groupby('model')['price'].mean().plot(kind='hist', subplots=True)

model
chevrolet       6007.000000
dodge           7875.444444
plymouth        7963.428571
honda           8184.692308
subaru          8541.250000
isuzu           8916.500000
mitsubishi      9239.769231
renault         9595.000000
Toyota          9885.812500
Volkswagen     10077.500000
Nissan         10415.666667
Mazda          10652.882353
saab           15223.333333
peugeot        15489.090909
alfa-romero    15498.333333
mercury        16503.000000
audi           17859.166667
volvo          18063.181818
bmw            26118.750000
Porsche        31400.500000
buick          33647.000000
jaguar         34600.000000
Name: price, dtype: float64

Out[76]: array([<AxesSubplot:ylabel='Frequency'>], dtype=object)

[Histogram of mean price by brand (5,000 to 35,000).]
+ Based on the model, Porsche, Buick and Jaguar are the most expensive ones.
+ Chevrolet is the least expensive one.
Get The Dummies
In [77]: df2 = pd.get_dummies(df1, columns=categorical, drop_first=True)
         df2.head()

Out[77]:
   symboling  wheelbase  carlength  carwidth  carheight  curbweight  enginesize  boreratio  stroke  ...
0          3       88.6      168.8      64.1       48.8        2548         130       3.47    2.68  ...
1          3       88.6      168.8      64.1       48.8        2548         130       3.47    2.68  ...
2          1       94.5      171.2      65.5       52.4        2823         152       2.68    3.47  ...
3          2       99.8      176.6      66.2       54.3        2337         109       3.19    3.40  ...
4          2       99.4      176.6      66.4       54.3        2824         136       3.19    3.40  ...

5 rows x 64 columns
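For intuition, here is a tiny standalone example (an illustration, not from the original notebook) of what drop_first=True does: with k categories we keep k-1 dummy columns, dropping the first alphabetically so the columns are not perfectly collinear.

import pandas as pd

toy = pd.DataFrame({'fueltype': ['gas', 'diesel', 'gas']})
print(pd.get_dummies(toy, columns=['fueltype'], drop_first=True))
# Only 'fueltype_gas' remains; 'diesel' is encoded implicitly as the all-zero row.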
4. Model Selection

+ I'll use a linear regression model as a base model.
+ Then I will use Ridge, Lasso, ElasticNet, KNeighborsRegressor and the Support Vector Machine Regressor.
+ Then I will use ensemble models like Random Forest, Gradient Boosting and Extra Trees.
+ Finally I will look at the XGBoost Regressor.
+ After evaluating the algorithms, we will select our best model.
+ Let's start.
Baseline Model
In [78]: X = df2.drop('price', axis=1)
         y = df2['price']
         X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

         model = LinearRegression()
         model.fit(X_train, y_train)
         y_pred = model.predict(X_test)

         print(f'model : {model} and rmse score is : {np.sqrt(mean_squared_error(y_test, y_pred))}, r2 score is {r2_score(y_test, y_pred)}')

model : LinearRegression() and rmse score is : 2650.560337022249, r2 score is 0.8985995076954914
+ The baseline model, in our case a Linear Regression model without any scaling or transformation, did quite a good job (a single hold-out split can be optimistic, though; see the cross-validation sketch below).
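Since KFold and cross_val_predict were already imported, a natural robustness check (my addition, not in the original notebook) is to score the baseline with cross-validation instead of a single split:

# Cross-validated baseline: every row is predicted exactly once by a model
# that never saw it during fitting.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
y_cv = cross_val_predict(LinearRegression(), X, y, cv=kf)
print(f'CV rmse: {np.sqrt(mean_squared_error(y, y_cv)):.2f}, CV r2: {r2_score(y, y_cv):.4f}')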
Ridge & Lasso & ElasticNet & KNN with Scaler and Transformer
In [79]: rmse_test = []
         r2_test = []
         model_names = []

         numerical2 = df2.drop(['price'], axis=1).select_dtypes('number').columns

         X = df2.drop('price', axis=1)
         y = df2['price']
         X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

         s = StandardScaler()
         p = PowerTransformer(method='yeo-johnson', standardize=True)

         rr = Ridge()
         las = Lasso()
         el = ElasticNet()
         knn = KNeighborsRegressor()
         models = [rr, las, el, knn]

         for model in models:
             ct = make_column_transformer((s, numerical2), (p, skew_cols.index), remainder='passthrough')
             pipe = make_pipeline(ct, model)
             pipe.fit(X_train, y_train)
             y_pred = pipe.predict(X_test)
             rmse_test.append(round(np.sqrt(mean_squared_error(y_test, y_pred)), 2))
             r2_test.append(round(r2_score(y_test, y_pred), 2))
             print(f'model : {model} and rmse score is : {round(np.sqrt(mean_squared_error(y_test, y_pred)), 2)}, r2 score is {round(r2_score(y_test, y_pred), 2)}')

         model_names = ['Ridge', 'Lasso', 'ElasticNet', 'KNeighbors']
         result_df = pd.DataFrame({'RMSE': rmse_test, 'R2_Test': r2_test}, index=model_names)
         result_df

model : Ridge() and rmse score is : 2423.29, r2 score is 0.92
model : Lasso() and rmse score is : 2329.06, r2 score is 0.92
model : ElasticNet() and rmse score is : 3350.1, r2 score is 0.84
model : KNeighborsRegressor() and rmse score is : 4048.13, r2 score is 0.76

C:\Users\Onkar\anaconda3\lib\site-packages\sklearn\linear_model\_coordinate_descent.py:647: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 1.809e+07, tolerance: 8.716e+05
  model = cd_fast.enet_coordinate_descent(

Out[79]:
              RMSE  R2_Test
Ridge      2423.29     0.92
Lasso      2329.06     0.92
ElasticNet 3350.10     0.84
KNeighbors 4048.13     0.76
+ By using a standard scaler and a power transformer for the skewness,
+ for the linear models we got .92 for the R2, and
+ 2307.47 RMSE, which are better scores compared to the baseline model.
Best Model with Hyperparameter Tuning
In [81]: X = df2.drop('price', axis=1)
         y = df2['price']
         X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

         rf = RandomForestRegressor(n_estimators=220, random_state=42)
         rf.fit(X_train, y_train)
         y_pred = rf.predict(X_test)
         print(f'rmse score is : {round(np.sqrt(mean_squared_error(y_test, y_pred)), 4)}, r2 score is {round(r2_score(y_test, y_pred), 4)}')

rmse score is : 1975.8483, r2 score is 0.9437
+ With hyperparameter tuning we got a lift (a sketch of how such a search could be run follows):
+ RMSE improved from 1984.44 to 1975.8483, and
+ R2 improved from 0.9432 to 0.9437.
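The notebook shows only the tuned model (n_estimators=220); here is a hedged sketch of how such a search might be run with GridSearchCV (the grid values are hypothetical, not the report's actual search):

from sklearn.model_selection import GridSearchCV

# Hypothetical grid; the original report does not show its search procedure.
param_grid = {'n_estimators': [100, 150, 220, 300], 'max_depth': [None, 10, 20]}
search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                      scoring='neg_root_mean_squared_error', cv=5)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)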
Feature Importance
In [84]: importances = rf.feature_importances_
         feature_names = [f'feature {i}' for i in range(X.shape[1])]

         # what are the scores for the features
         for i in range(len(rf.feature_importances_)):
             if rf.feature_importances_[i] > 0.001:
                 print(f'{X_train.columns[i]} : {round(rf.feature_importances_[i], 3)}')

         print()

         plt.bar([X_train.columns[i] for i in range(len(rf.feature_importances_))], rf.feature_importances_)
         plt.xticks(rotation=90)
         plt.rcParams["figure.figsize"] = (12.4, 11.2)
         plt.show()

symboling : 0.002
wheelbase : 0.008
carlength : 0.013
carwidth : 0.026
carheight : 0.004
curbweight : 0.167
enginesize : 0.6
boreratio : 0.005
stroke : 0.003
compressionratio : 0.005
horsepower : 0.028
peakrpm : 0.005
highwaympg : 0.118
enginetype_ohc : 0.001
model_bmw : 0.006

[Bar chart of all feature importances; enginesize dominates, followed by curbweight and highwaympg.]
+ Based on the Random Forest Regressor:
  = enginesize
  = curbweight
  = highwaympg
  = horsepower
  have the biggest importance scores.
+ It is important to note that the Random Forest Regressor gave an importance score bigger than 0 to only 16 features.
+ The model used 16 out of 63 features to get its best prediction (a compact way to rank them is sketched below).
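A compact alternative (my addition, not in the original) for ranking the importances without the index-based loop:

# Pair importances with column names, then sort; the head is the ranking
# discussed above.
imp = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
print(imp.head(10))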
Conclusion
We have developed a model to predict car prices.
First, we made a detailed exploratory analysis.
We decided which metric to use.
We analyzed both the target and the features in detail.
We transformed the categorical variables into numeric ones so we could use them in the model.
We transformed the numerical variables to reduce skewness and get closer to a normal distribution.
We used pipelines to avoid data leakage.
We looked at the results of each model and selected the best one for the problem at hand.
We performed hyperparameter tuning on the best model to see the improvement.
We looked at the feature importance.
After this point, it is up to you to develop and improve the models further.