CHAPTER 19
SIMPLE LINEAR REGRESSION

In this chapter and subsequent chapters, we are going to discuss Machine Learning models which are useful to analyze data and provide predictions about new data. Model is a term that represents an algorithm or logic. The main purpose of a model is to understand the given data. It is something like the brain of human beings, which analyzes the data received from the sense organs like eyes, ears, nose, tongue and skin. The light which is reflected from objects enters the eyes and then the brain. The neural network in the brain transmits this data to a particular center (or point) in the brain where the light signal is understood and interpreted regarding what object was seen by the eyes. Depending on the shapes and patterns already stored in the brain, it interprets that object as a cat or a car or a human being, etc.
A Machine Learning model also does the same thing. When data is given to the model, it uses some mathematical formula and fits the data into that formula. If the data fits into the formula in the best possible manner, then the model will understand the relationship between the pieces of data according to the formula. When new data is encountered, the model will apply the same formula on the new data and make predictions about the new data.
Various Machine Learning models were created by Computer Scientists and Data Scientists to explain various relationships between the pieces of data. When certain data is given to us, it is up to us to select the correct Machine Learning model to apply on the data. When our model is not correct, then the results will not be accurate.
The word 'regression' means a measure of the relation between variables or pieces of data. A regression model such as Linear Regression or Ridge Regression tries to understand the relationship between different pieces of data. There are 2 objectives of any regression model. They are:

1. To establish relationship between two variables: There are two types of relationships that exist between variables. When a variable increases, if another variable also increases, then it is called positive relationship. When a variable increases, if another one
decreases, then it is called negative relationship. For example, when income increases, expenditure can also increase. This is positive relationship. When temperature increases, the humidity in the climate will decrease. This is negative relationship.

2. Predict new observations: Once the regression model understands the relationship between the variables, it can predict new results. For example, when the sales data of the past 1 year is given to a regression model, it can predict the sales of the next quarter.
Variables

A variable is nothing but data. We know that a data frame contains data in the form of several rows and columns. Here, the columns are called variables. We can classify these variables into 2 types. They are:

1. Dependent variable: This is the variable whose value is to be forecast or predicted. Its value is dependent on the values of other variables, called 'independent variables'. Dependent variables are also called 'response variables' or 'target variables'. In the mathematical equations, they are generally represented by the letter 'y'.

2. Independent variable: This is the variable which is useful to calculate the value of another variable. Independent variables do not depend on each other. While this may not be possible practically, independent variables are so called since they are treated as not having dependency on any other variable. Independent variables are also called 'features' or 'regressors'. In the mathematical equations, they are represented by the letter 'x'.
Linear Regression

Linear regression is a Machine Learning model that depends on the linear relationship between a dependent variable and one or more independent variables. Let us understand the term 'linear relationship'. We can say two variables are in linear relationship if their values can be represented using a straight line. That means, the data points (i.e. the values of the variables) lie on the straight line.

When there is only one independent variable, it is called 'Simple Linear regression'. When there are more than one independent variables, the model is called 'Multiple Linear regression'. Linear regression is also called 'Least Squares regression', a term we will understand later in this chapter. In this Chapter, we will focus on Simple Linear regression, where only one independent variable is considered.
The Linear Equation
Scanned with CamScannerope and b equation is useful to fi
called the dependent variable and x is known a:
ind y value depending, on
em i
an 6 independent variable,
atu Heres 8
Se tities, WC write the same equation as:
syatisticS+
wee BR
ihe slope is B, The constant value B, is called intercept. fi, indicates the distance on
et Tus take a linear equation: y = 4+2x, Compare this with y = B, + Px. Here, B,=
ys%*. “1 9, By substituting x value into this equation, we can find the value of y. So, xis
tal e dependent variable and y is called dependent variable since y value is dependent
al
ga xvalue
If x=0, then y = 4+2(0) = 4.
If x=2, then y = 4+2(2) = 8.
If x=4, then y = 4+2(4) = 12.
If x=6, then y = 4+2(6) = 16.
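As a quick check, the same substitution can be done in plain Python. This small snippet is our own illustration, not from the original text:

# evaluate y = 4 + 2x for a few x values
for x in [0, 2, 4, 6]:
    y = 4 + 2 * x
    print(x, y)   # prints: 0 4, 2 8, 4 12, 6 16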
In this manner, when x is increased in steps of 2, the y values increase in steps of 4. These values are shown in Figure 19.1. From the Figure, we can calculate the slope β1 and intercept β0 values, as:

Slope β1 = deviation in y / deviation in x = dy / dx = 4 / 2 = 2
Intercept β0 = distance on y axis where the line crosses = 4
Figure 19.1: Understanding the Linear Equation
Why do we call y = 4+2x a linear equation? Because when the x values and y values are drawn in the form of a graph as in Figure 19.1, it shows a straight line. That means the relationship between the independent variable (x) and the dependent variable (y) is linear. When such a relation exists in the data, we can apply Linear regression to analyze the data.
The r Squared Value

After reading the data from the dataset, we can plot it in the form of a graph. The data points may not be exactly on the straight line. There will be deviations from the straight line. This is called error 'E'. Linear regression should consider this error also, so the formula will be:

y = β0 + β1x + E

Let us discuss this error term 'E' now. Suppose the actual y values on the line are:

y = (1, 2, 3, 4, 5)
But the observed y values are deviated due to the deviations of the data points from the straight line. These deviated y values are:

y1 = (0.8, 2.5, 3, 4.8, 4.4)

That means we should get 1 but we got 0.8 as the y value. This difference is called the error (E1 = y - y1). We should square these errors. If we do not square them, while finding the total, the positive and negative values may cancel out. Hence squaring is needed. Similarly, we have to calculate the differences of the y values from their mean. These are the deviations from the mean value (E2 = y - Mean). We have to square this value as well (E2²).

Now, r squared value = 1 - (Sum of E1² / Sum of E2²)

The above formula can be used to find the value of r squared.
Please observe the following table to understand how to calculate the r squared value.

Table 19.1: Calculating r squared value

   y      y1     E1 = y - y1    E1²      E2 = y - Mean    E2²
   1      0.8        0.2        0.04          -2           4
   2      2.5       -0.5        0.25          -1           1
   3      3          0.0        0.00           0           0
   4      4.8       -0.8        0.64           1           1
   5      4.4        0.6        0.36           2           4
 Mean = 3                    Sum1 = 1.29               Sum2 = 10
Using the above table data, the formula will be:

r squared = 1 - (Sum1 / Sum2) = 1 - (1.29 / 10) = 0.871
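Before using sklearn, the same calculation can be cross-checked by hand in plain Python. This is an illustrative sketch of Table 19.1's computation, with variable names of our own choosing:

y = [1, 2, 3, 4, 5]            # actual values on the line
y1 = [0.8, 2.5, 3, 4.8, 4.4]   # deviated (observed) values
mean = sum(y) / len(y)          # mean = 3

sum1 = sum((a - b) ** 2 for a, b in zip(y, y1))  # sum of E1 squared = 1.29
sum2 = sum((a - mean) ** 2 for a in y)           # sum of E2 squared = 10
print(1 - sum1 / sum2)                           # approximately 0.871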
The r squared value is also called the 'Coefficient of determination'. The r squared value obtained in this example is 0.871. In percentage, it is 0.871 × 100 = 87.1. This indicates an 87% accuracy level for the model. That means the Linear regression model in this example can explain 87% of the data successfully, whereas the remaining 13% cannot be explained by the model. Hence there is a chance of error at the 13% level. The accuracy level of the model is 87%.

The r squared value will be in the range of 0 to 1. If the r squared value is closer to 1, then the actual and predicted values (on the line) will be very close. It represents high accuracy of the model. When the r squared value is nearer to 0, they are much apart. So, the predictions may not be correct.
What happens when there are no deviations between y and y1? That means, the E1 value is 0. Then E1² will also be 0. Then the sum of squares of errors (Sum1) will be 0. This indicates 100% accuracy for the model. So, the point is this: in Linear regression models, the sum of squares should have the least value, and that represents high accuracy. This is the reason the 'Simple Linear Regression model' is also called the 'Least Squares Regression model'.
The following Python code explains how to calculate the r squared value for the above example. 'sklearn' is a package from scikit-learn.org that contains many machine learning related modules. In sklearn, we have a module by the name metrics. This module contains a function r2_score(). By calling this function and passing the original data and predicted data, we can find the r squared value as shown below:
from sklearn.metrics import r2_score
y = [1, 2, 3, 4, 5]
y1 = [0.8, 2.5, 3, 4.8, 4.4]
R_square = r2_score(y, y1)
print('Coefficient of Determination', R_square)

Output:
Coefficient of Determination 0.871
Practical Use of Simple Linear Regression

We are given the prices of houses based on their area in New York city. That means, our data will be the area of the house in square feet and its price. We should understand whether there exists any linear relationship between the area and the price of the houses. Then, we can apply the Simple Linear Regression model to analyze the data. Finally, we have to find the price of a new house when its area is given.

The dataset: homeprices.csv

This dataset is a simple dataset that contains only 5 rows with 2 columns: the area and the price of the house. The price of the house is mentioned in dollars. This dataset is available in kaggle.com. It is shown in Figure 19.2.
Figure 19.2: Home prices in New York city
In this dataset, if there exists a linear relationship between area and price, that relationship can be represented by the equation of the straight line: y = mx + b. In this equation, the dependent variable is 'y' whose value is dependent on 'x'. In our dataset, we are supposed to find the price of a house depending on its area. So, what we have to find, i.e. the price, becomes the dependent variable or response variable. What it depends on, i.e. the area, becomes the independent variable or feature. So the linear equation becomes:

price = m * area + b

We know that 'm' represents the slope of the line and 'b' is a constant value that represents the intercept on the y axis. These 'm' and 'b' values are calculated from the data by the Simple Linear regression model.
Let us draw a scatter plot to see the relationship between the area and price variables. This can be done using the scatterplot() function of the seaborn module, as:

sns.scatterplot(data=df, x='area', y='price')
Output:

Figure 19.3: Data points are aligned in a straight line
By observing the data points (dots) in the scatter plot, we can understand that they can be connected more or less using a straight line. Hence, we can apply the Simple Linear Regression machine learning model on this data.

Machine learning models (or logic) are implemented in the form of various Python classes in the 'sklearn' package by scikit-learn.org. The name 'scikit' represents 'SciPy toolkit'. If we want to use a particular machine learning model, first we should import the class and then create an object of that class. For example, to implement the Linear Regression model, we have to first create an object of the 'LinearRegression' class of the linear_model module of the sklearn package. Observe the following code:

from sklearn.linear_model import LinearRegression
reg = LinearRegression()  # create object to LinearRegression class
Now the model is available to us in the form of the 'reg' object. We can call any methods of the LinearRegression class using this object. The next step is to train the machine learning model, by calling the fit() method on the data, in the form of:

reg.fit(x, y)

Here, 'x' indicates the independent variable, i.e. df.area, which should be passed to the fit() method as a 2D array. 'y' indicates the dependent variable, i.e. df.price. Now the question is: how to convert the df.area column data into a 2D array? There are two ways. The first way is to first convert the df.area column into an array by using the values attribute, as:

df.area.values

Then convert the 1D array into a 2D array by using the reshape() method of numpy arrays, as:

df.area.values.reshape(-1, 1)  # gives 2D array
What is this reshape(-1, 1) doing? It is converting the 1D array into a 2D array by reshaping the array in such a way that the resultant array contains only 1 column. For example, let us take an array of shape (2, 4). When we reshape it with (-1, 1), the array will get reshaped in such a way that the resultant array has only 1 column, and this is only possible by taking 8 rows (i.e. 2 × 4 = 8). Hence the resultant array will have the shape (8, 1).
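The following small numpy snippet (our own illustration, not from the original text) shows this reshaping in action:

import numpy as np

a = np.arange(8).reshape(2, 4)  # a 2D array of shape (2, 4)
b = a.reshape(-1, 1)            # -1 asks numpy to infer the number of rows
print(b.shape)                  # (8, 1): only 1 column, so 8 rows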
Finally, we can use the fit() method, as:

x = df.area.values.reshape(-1, 1)
reg.fit(x, df.price)

A simpler way of converting the df.area values into a 2D array is to take the 'area' column name in two pairs of square braces, as:

df[['area']]  # area column in 2D array format

In this case, we can write the fit() method as:

reg.fit(df[['area']], df.price)

With this, the model has been trained with the data. That means, the model could fit the data in the form of a straight line, i.e. it could have understood that there is a linear relationship in the data. Once the model understands the data, it is ready to be tested with new data.
Let us now predict the price of a house having 3300 square feet of area, which is not available in the dataset. We have to call the predict() method of the model and pass the new area value in the form of a 2D array, as:

reg.predict([[3300]])

Observe the output. It shows only 1 element in the form of a 1D array. It says that the price of a 3300 sft house will be around 628715 dollars. Please remember, the output of machine learning models will be a 1D array.
Let us now understand the inner details of the Simple Linear Regression model. The following is the equation used by the model:

price = m * area + b

To calculate the price of a 3300 sft house, the above formula is used by our model. But how is it using the above formula? To use the above formula, it should know the slope (m) of the line and the intercept (b) value. The following attributes of the LinearRegression class can be used to obtain these values:

reg.coef_       # slope m: 135.78767123
reg.intercept_  # intercept b: 180616.43835616432

Let us substitute these values to find out the price:

price = m * area + b = 135.78767123 * 3300 + 180616.43835616432 = 628715.7534151643

It means the price of a 3300 sqft house is 628715 dollars. The same output has already been given by the predict() method.
Hence, we can confirm that the Simple Linear Regression model is using the equation of the straight line to predict new values.

Similarly, if we want to predict the price of a house having an area of 5000 sqft, we can use the predict() method, as:

reg.predict([[5000]])  # 859554.79452055

That means, the predicted price is around 859554 dollars.
To find the accuracy level of the model, we have to calculate the r squared value with the help of the r2_score() function. To this function, we have to pass the original prices and the prices predicted by our model. Observe the following code:

from sklearn.metrics import r2_score
y_original = df.price
y_predicted = reg.predict(df[['area']])
R_square = r2_score(y_original, y_predicted)
print('r squared value', R_square)

Output:
r squared value 0.9584307138199486

This shows 95.8% accuracy for our model. That means our model performs accurately in most cases, which is pretty good.

To see how the data is fit by the model in the form of a straight line (or regression line), we can draw the scatter plot once again with a line using the lmplot() function, as:

sns.lmplot(data=df, x='area', y='price')

lmplot() will draw the straight line that best fits the data points, as shown in Figure 19.4:
Figure 19.4: Data points with regression line
Program 1: Predicting the price of a house in New York depending on its area. Predict the prices of houses whose areas are 3300 sqft and 5000 sqft: given the house area, find out the price!
# Simple Linear regression:
# predicting the house prices depending on area
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression

# load the data into dataframe
df = pd.read_csv("E:/test/homeprices.csv")
df

# plot a scatter plot
sns.scatterplot(data=df, x='area', y='price')

# once we see the scatter plot, we can understand that the
# distribution is linear and can use Linear regression model.
reg = LinearRegression()
reg.fit(df[['area']], df.price)  # fitting means training

# predict the price of 3300 sft house
reg.predict([[3300]])  # 628715.7534151643

# find the coefficient. this is slope m
reg.coef_

# find the intercept. this is b
reg.intercept_

# if we substitute m and b values in y = mx+b,
# we get the predicted value above.
y = 135.78767123 * 3300 + 180616.43835616432
y  # displays 628715.7534151643

# next predict the price of 5000 sft house
reg.predict([[5000]])  # 859554.79452055

# find accuracy level of the model by finding r squared value
# gives 95.8% accuracy
from sklearn.metrics import r2_score
y_original = df.price
y_predicted = reg.predict(df[['area']])
R_square = r2_score(y_original, y_predicted)
print('r squared value', R_square)

# display the scatter plot with a regression line
sns.lmplot(data=df, x='area', y='price')
Run the above program line by line in Spyder IDE, as shown in Figure 19.5:
Figure 19.5: Executing the program in Spyder
Simple Linear Regression with Train and Test Data

In the previous program, please observe the following statements:

reg = LinearRegression()
reg.fit(df[['area']], df.price)  # train the model

Here, we passed all the rows of the dataset to the Simple Linear Regression model, and the model is trained on that data. But in many cases, this may not be the best approach. Machine Learning programmers divide the data rows into 2 parts: some rows are used for training the model and the other rows are used for testing purposes. Generally, 70% of the rows are used for the purpose of training the model and the remaining 30% are used for testing the model. Alternately, we can use 80% of the rows for training and the remaining 20% for testing. First, let us see how to split the data into train and test data.
The sklearn package has a model_selection module that contains a function by the name train_test_split(). This function is used to split the data rows into 4 parts: x_train, x_test, y_train and y_test.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state = 0)
The train_test_split() function takes the x (independent variable) and y (dependent variable) parts of the data and splits the data into 4 parts. We represent them as x_train, x_test, y_train, and y_test. Please observe the attribute test_size=0.3; this indicates that the test size is 30%. That means it randomly selects 30% of the rows in the dataset and uses them for testing purposes. The remaining 70% of the rows are used as training data. It selects rows randomly depending on the seed given in random_state. Suppose random_state=0; then it will take the seed 0 and create some random numbers, like: 3, 1, 5, 6, 9. That means the 3rd, 1st, 5th, 6th and 9th rows are selected for testing purposes. When the same seed is given, it will always generate the same row numbers for selecting the test data. Hence the output of the program will be the same on running it several times. If the seed number is changed, then it will select a different set of rows. If we do not use the random_state attribute, then it will generate an integer number randomly and based on that selects the rows for testing. These rows may change every time the program is run.
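The effect of random_state can be demonstrated with a tiny experiment. The toy arrays below are our own illustration; only train_test_split() comes from sklearn:

import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(10).reshape(-1, 1)   # 10 sample rows: 0 to 9
y = np.arange(10)

# the same seed selects the same test rows on every run
_, x_test1, _, _ = train_test_split(x, y, test_size=0.3, random_state=0)
_, x_test2, _, _ = train_test_split(x, y, test_size=0.3, random_state=0)
print(x_test1.ravel())             # some 3 rows out of 10
print(x_test2.ravel())             # exactly the same 3 rows

# a different seed usually selects a different set of rows
_, x_test3, _, _ = train_test_split(x, y, test_size=0.3, random_state=1)
print(x_test3.ravel())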
We can also use the train_test_split() function by mentioning the train_size attribute instead of the test_size attribute, as:

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7,
random_state = 0)

Once the model is trained, we should verify its accuracy by comparing the original test data y_test with the predicted data y_pred. This can be done using the r2_score() function:

from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
These details can be easily understood from Figure 19.6.

Figure 19.6: Splitting and using the data with a Machine Learning Model
To understand the relationship between the x and y variables in the train data, we can draw a scatter plot. This scatter plot represents how the original data is distributed along the x and y axes:

plt.scatter(x_train, y_train)

The regression line is the line that is used by the model to fit the above data. This regression line follows the equation y = mx + c. When we pass train data to the model, it predicts the output according to this formula. Hence, the regression line should be drawn between the train data and the predicted data:

plt.plot(x_train, reg.predict(x_train))

In the above statement, x_train represents the train data and reg.predict(x_train) represents the predicted data for all the rows in x_train.
When we display the scatter plot along with the line plot using the previous two statements, it looks like the plot shown in Figure 19.7.

Figure 19.7: Scatter plot with regression line
Now, let us solve a task related to how to decide the salary of an employee depending on his experience. We are given Salary_Data.csv, which contains the experience of employees and their salaries. Split the dataset into train data and test data. Using the train data, train the model, and then predict the salary of an employee having 11 years of experience.

The dataset: Salary_Data.csv

This dataset contains data about the experience of an employee and his salary. There are 30 rows and 2 columns. The column names are: YearsExperience and Salary. Since we have to predict the Salary depending on YearsExperience, we understand that 'Salary' is the dependent variable and YearsExperience becomes the independent variable. See Figure 19.8.
Figure 19.8: Employee experience and salary dataset
Since there is only 1 independent variable, we can use the Simple Linear Regression model, and check whether the data fits into this model or not. This can be done by drawing a scatter plot between the x and y variables as:

plt.scatter(x, y)
Output:

Figure 19.9: The data points are showing linear relationship.
Scanned with CamScannerChapter 19
This indicates linear relationship between x and y variables. When
value is also increasing. Hence this is positive relationship. So, we
Regression model to analyze the data.
Program 2: Train the computer using the Simple Linear Regression model with employee experience and salary data. Predict the salary of an employee having 11 years of experience.

# Simple Linear Regression with train and test data
import pandas as pd
import matplotlib.pyplot as plt
# load the dataset from the computer into dataset object
dataset = pd.read_csv("E:/test/Salary_Data.csv")

# retrieve only 0th column and take it as x
x = dataset.iloc[:, :-1].values
x

# retrieve only 1st column and take it as y
y = dataset.iloc[:, 1].values
y

# draw scatter plot to verify Simple Linear Regression
# model can be used. Scatter plot shows dots as straight line
plt.scatter(x, y)

# take 70% of data for training and 30% for testing
# random_state indicates the random seed used in selecting test rows
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state = 0)

# train the computer with Simple Linear Regression model
from sklearn.linear_model import LinearRegression

# create LinearRegression class object
reg = LinearRegression()

# train the model by passing train data to fit() method
reg.fit(x_train, y_train)

# test the model by passing test data and obtain predicted data
y_pred = reg.predict(x_test)
y_pred

# find the r squared value by comparing test data
# (expected data) and predicted data. accuracy is 97.4%
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)  # 0.9740993407213511
# predict the salary of an employee with 11 years of experience
print(reg.predict([[11]]))  # [129740.26548933]

# matplotlib: draw scatter plot and line plot
# select all the below statements and then execute at once
plt.scatter(x_train, y_train, color='red')
plt.plot(x_train, reg.predict(x_train), color='blue')
plt.title("Experience vs Salary")
plt.xlabel("Yrs of exp")
plt.ylabel("Salary")
plt.show()
Let us now attempt another task, where we are given the per capita income of Canada during the years from 1970 to 2016. Using the Simple Linear Regression Model, we have to predict the per capita income during the years 2020 and 2021.

Dataset used: canada_per_capita_income.xlsx
Per capita income represents the income earned per person in a given country. It is calculated by dividing the country's total income by its total population. When the per capita income is more, the people in that country are leading comfortable and probably rich lives. The dataset canada_per_capita_income.xlsx is an Excel file that contains the year number and the per capita income in that year for Canadians. This xlsx file contains a sheet by the name 'Sheet1', and hence while reading data from this file, we can use the read_excel() function of the pandas module, as:

dataset = pd.read_excel("E:/test/canada_per_capita_income.xlsx", "Sheet1")

This dataset has 47 rows and 2 columns. The first few rows of this dataset are shown in Figure 19.10.
Figure 19.10: Data related to income per person in Canada
To check if the data fits into the Simple Linear Regression Model, we can draw a scatter plot, as:

plt.scatter(x, y, color='red')
Output:
Figure 19.11: The data points are nearly linear
The data points are more or less showing a straight line. Hence, we can use the Simple Linear Regression Model on this data.
We can also find the accuracy of the model by comparing y_test and y_pred using the r2_score() function, as:

r2_score(y_test, y_pred)

Another way to measure the accuracy of a Linear Regression Model is by calling the score() method on the model object (reg), as:

reg.score(x_test, y_test)

The score() method takes x_test and calculates the y_pred values based on it. Then it compares the y_test and y_pred values to provide the score of the model. Hence we get the same score with the r2_score() function and the score() method.
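Assuming reg, x_test and y_test are already available from the steps above, the equivalence of the two approaches can be verified with this illustrative sketch:

from sklearn.metrics import r2_score

y_pred = reg.predict(x_test)       # predictions for the test rows
print(r2_score(y_test, y_pred))    # r squared from predicted values
print(reg.score(x_test, y_test))   # the same value computed by the model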
Program 3: Find out the per capita income of Canada during the years 2020 and 2021.

# Simple Linear Regression - predicting the per capita income of Canada
import pandas as pd
import matplotlib.pyplot as plt
# load data from "Sheet1" of Excel file
dataset = pd.read_excel("E:/test/canada_per_capita_income.xlsx", "Sheet1")

x = dataset.iloc[:, 0:1].values  # retrieve 1st column as 2D array
x

y = dataset.iloc[:, -1].values  # retrieve last column as 1D array
y

# check whether the distribution of data is linear or not
plt.scatter(x, y, color='red')

# take 70% of data for training and 30% for testing
# random_state indicates the random seed used in splitting the data
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3,
random_state = 5)

# train the computer using Simple Linear Regression model
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train, y_train)

# make prediction based on test data
y_pred = reg.predict(x_test)
y_pred

# find the r squared value by comparing test data and predicted data
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)  # 0.8433026110551844

# another way to know the score of a linear regression model
reg.score(x_test, y_test)  # 0.8433026110551844

# predict the per capita income during the years 2020 and 2021
# Output: array([41819.49650873, 42681.02869595])
# this means 41819$ in 2020 and 42681$ in 2021.
reg.predict([[2020], [2021]])

# draw scatter plot and regression line
# select the below block and run at once
plt.scatter(x_train, y_train, color='red')
plt.plot(x_train, reg.predict(x_train), color='blue')
plt.title("PER CAPITA INCOME OF CANADA")
plt.xlabel("Year")
plt.ylabel("Per Capita Income")
plt.show()
When we run this program in Spyder, we find the per capita income during the years 2020 and 2021 as:

reg.predict([[2020], [2021]])

Here, we are passing the years as elements in a 2D array, and we get the output:

array([41819.49650873, 42681.02869595])

This indicates that the per capita income would be 41819 dollars in 2020 and 42681 dollars in 2021. However, note that the accuracy of the model is 84% only.
Figure 19.12: Running the program in Spyder
Points to Remember

❑ Linear regression is a Machine Learning model that depends on the linear relationship between variables.

❑ The input data is represented by independent variables and the target data is the dependent variable.

❑ When a variable value increases, if another variable value also increases, then it is called positive relationship.

❑ When a variable value increases, if another variable value decreases, then it is called negative relationship.

❑ The Simple linear regression model uses the formula: y = β0 + β1x + E. Here β0 represents the intercept and β1 represents the slope of the line. E is the error term.

❑ The deviation of the data points from the regression line is measured using a value known as the 'r squared value'.

❑ The train_test_split() function is used to split the dataset into train and test data. We can mention either train_size or test_size, based on which it divides the dataset.
CHAPTER 20
MULTIPLE LINEAR REGRESSION

In the Simple Linear Regression model, we take only 1 independent variable (x) that is useful to predict the dependent variable (y) value. For example, the 'area' of a house is useful to predict the house 'price'. But in many practical cases, the house 'price' does not depend only on 'area'. The price of a house can be decided depending on various factors like 'area', 'the number of bedrooms', 'the age of the house', etc. These variables are called independent variables, and they are useful to predict the price of the house. When we use multiple (more than 1) variables in the Linear Regression model, it is called the 'Multiple Linear Regression Model'.
In Multiple Linear Regression, the target variable (y) value can be calculated based on several independent variables (x1, x2, x3, ...). So, the equation used by this model will be in the form of:

y = m1x1 + m2x2 + m3x3 + ... + b

Here, y is called the dependent variable or target variable. x1, x2, x3, ... are called independent variables or features. m1, m2, m3, ... are called the coefficients associated with the independent variables. b is called the intercept.
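Written as code, a prediction of this model is just the above formula evaluated as a dot product. A minimal sketch (the coefficient and intercept values below are only for illustration, not produced by a real model):

import numpy as np

m = np.array([142.9, -48591.7, -8529.3])  # m1, m2, m3 (illustrative values)
b = 485561.9                              # intercept (illustrative value)
x = np.array([3000, 3, 40])               # x1=area, x2=bedrooms, x3=age

y = np.dot(m, x) + b                      # y = m1x1 + m2x2 + m3x3 + b
print(y)                                  # predicted price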
The relationship between y and x1 should be linear. That means it can be shown in the form of a straight line. Similarly, the relationship between y and x2 should be linear. Also, the relationship between y and x3 should be linear.

As usual, when the Multiple Linear Regression model is applied on data, there will be certain deviations between the predicted values and the original values, which can be measured using the r squared value. The r squared value should be between 0 and 1. If it is nearer to 0, then the model is not performing well. If it is nearer to 1, then the model is doing well and the accuracy level is high.
Let us apply the Multiple Linear Regression model on the house prices data. We will first look at the data and then understand how to use the model on the data.
The dataset: homeprices.csv

This dataset represents home prices in Monroe Township, New Jersey, USA. It is a sample dataset that contains only 6 rows and 4 columns. The columns are: the area of the house in square feet, the number of bed rooms, the age of the house in years, and the price of the house in dollars. This is shown in Figure 20.1.

   area  bedrooms  age   price
   2600      3      20  550000
   3000      4      15  565000
   3200             18  610000
   3600      3      30  595000
   4000      5       8  760000
   4400      5       8  795000

Figure 20.1: Homeprices dataset
We are supposed to calculate the home 'price' depending on the 'area', 'bedrooms' and 'age' columns. So, the Multiple Linear Regression model uses the following formula:

y = m1x1 + m2x2 + m3x3 + b
price = m1*area + m2*bedrooms + m3*age + b
Please observe that this dataset has a missing value in the 'bedrooms' column. So, first of all, let us clean the data, i.e. make the data ready for the model. For this purpose, we can either delete the row that contains the missing value or substitute an appropriate value in its place.
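Both options can be written in pandas as shown below. This is a sketch only; df is the dataframe holding this dataset:

# option 1: delete the row that contains the missing value
df_without_nan = df.dropna(subset=['bedrooms'])

# option 2: substitute a suitable value, e.g. the column median
# (this chapter uses the floor of the median, as shown next)
df.bedrooms = df.bedrooms.fillna(df.bedrooms.median())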
To find out the missing values in the data frame (df), we can use:

df.isnull().sum()

Output:
area        0
bedrooms    1
age         0
price       0
dtype: int64
The output easily tells us that there is 1 missing value, found in the 'bedrooms' column. We want to calculate the median of all the other values in that column and then fill that value in the place of the missing value. So, first let us find the median value for the 'bedrooms' column, as:

df.bedrooms.median()

Output:
4.0
We get a float number as a result of executing the above statement. Convert that into an integer using the floor() function of the math module, as:

import math
med = math.floor(df.bedrooms.median())
med

Output:
4

Alternately, we can use the int() function as:

med = int(df.bedrooms.median())
med
Now, let's fill the median value into the missing place of the 'bedrooms' column, as:

df.bedrooms = df.bedrooms.fillna(med)
df

Output:
   area  bedrooms  age   price
0  2600       3.0   20  550000
1  3000       4.0   15  565000
2  3200       4.0   18  610000
3  3600       3.0   30  595000
4  4000       5.0    8  760000
5  4400       5.0    8  795000
Now that the data is alright, we can check whether we can apply the Linear Regression model on this data or not. This can be done by checking the relationships between 'area' and 'price', between 'bedrooms' and 'price', and between 'age' and 'price'. To view these relationships, we can draw lmplots. An lmplot displays the scattered data points along with the regression (relationship) line, and can be drawn using the lmplot() function of the seaborn module. Now, let's use lmplot() in 3 ways.

To find the relation between the constructed area and price, we draw the lmplot as:

import seaborn as sns
sns.lmplot(x='area', y='price', data=df)

Output:

Figure 20.2: Relationship between area and price of house

The output shows a positive relationship. That means, if the constructed area of the house increases, the price of the house will also increase.

To find the relationship between the number of bedrooms and price, we draw the lmplot as:

sns.lmplot(x='bedrooms', y='price', data=df)

Output:

Figure 20.3: Relationship between bedrooms and price of house

The output shows that a positive relationship exists between the number of bedrooms and price. That means, if the number of bedrooms is increased, the price of the house will also increase.

To find the relation between the age of the house and price, we draw the lmplot as:

sns.lmplot(x='age', y='price', data=df)

Output:

Figure 20.4: Relationship between age and price of house

This output shows that there is a negative relationship between age and price of the house. If the age of the house is increased, the price will decrease, since it represents an older construction.

Since there is a linear (i.e. straight line) relationship between the multiple independent variables (i.e. area, bedrooms and age) and the dependent variable (price), we can apply the Multiple Linear Regression model on the data.
Let us train the model on the data using the fit() method. We should remember that while passing x (the independent variables), we have to pass them in the form of a 2D array, and y (the dependent variable) should be given as a 1D array.

reg.fit(df[['area', 'bedrooms', 'age']], df['price'])

To see the coefficient values used by the model, we can display the coef_ attribute, as:

reg.coef_  # array([142.895644, -48591.66405516, -8529.30115951])

To see the intercept, display the intercept_ attribute, as:

reg.intercept_  # 485561.89282339806

Once the model has been trained, we can predict the house price for given area, bedrooms and age values. This can be done using the predict() method, to which we have to pass the area, bedrooms and age values in the form of a 2D array:

reg.predict([[3000, 3, 40]])

The output will also be displayed in the form of a 1D array. See the output below:

array([427301.78627387])

That means, the price of a house in New Jersey with 3000 square feet of constructed area, 3 bedrooms and 40 years of age would be 427301 dollars. In this manner, we can use the model on new data to provide predictions or forecasts.
Program 1: We are given home prices in Monroe Township, NJ (USA). We should predict the prices for the following homes:
1. 3000 sqft area, 3 bed rooms, 40 years old
2. 2500 sqft area, 4 bed rooms, 5 years old
# Multiple Linear Regression model - predicting house prices
import pandas as pd

# load the dataset into dataframe
df = pd.read_csv("e:/test/homeprices.csv")
df

# find out any missing values in the dataset
# bedrooms has a missing value
df.isnull().sum()

# fill the missing data (NaN) with median of bedrooms
import math
med = math.floor(df.bedrooms.median())
med  # 4

# fill the missing data (NaN columns) with this median value
df.bedrooms = df.bedrooms.fillna(med)
df

# represent the relations between independent and dependent vars
# area, bedrooms and age are independent vars and price is dependent
import seaborn as sns
sns.lmplot(x='area', y='price', data=df)
sns.lmplot(x='bedrooms', y='price', data=df)
sns.lmplot(x='age', y='price', data=df)

# create linear regression model with multiple variables
# take the independent vars first and take dependent var next.
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(df[['area', 'bedrooms', 'age']], df['price'])
# fitting means training

# print coefficients, i.e. m1, m2, m3 values
reg.coef_  # 142.895644, -48591.66405516, -8529.30115951

# intercept
reg.intercept_  # 485561.89282339806

# predict the price of 3000 sqft area, 3 bed rooms, 40 years old house
reg.predict([[3000, 3, 40]])  # 427301

# predict the price of 2500 sqft area, 4 bed rooms, 5 years old house
reg.predict([[2500, 4, 5]])  # 605787
We are going to attempt another task, which is similar to the previous task but with a bigger dataset. We are given a dataset that contains house prices in California state depending on the constructed area, number of bedrooms and number of bathrooms. Unlike before, we have to divide the dataset into train data and test data, and find out the price of a house depending on the area, bedrooms and bathrooms.
The given dataset: cal1-03homes.xls

This is an Excel workbook file that contains one sheet where the data is stored. Observe the columns: the price of the house, the square foot area, bed rooms, bath rooms, garage and the zip code of the place where the house is located. The dataset has 7 columns. We are not going to take all these columns into consideration; we take only 3 columns, i.e. SqFt, BedRooms and Baths, to calculate the price of the house, since the other factors like Garage and Zip code are not important. From this discussion, we can understand that the independent variables are 'SqFt', 'BedRooms', 'Baths' and the dependent variable is 'Price'. So, the Multiple Linear Regression Model internally uses the following formula:

y = m1x1 + m2x2 + m3x3 + b
Price = m1*SqFt + m2*BedRooms + m3*Baths + b
We will now draw 3 plots: an lmplot to find out whether any linear relationship exists between the SqFt column and Price, and box plots representing the relation between BedRooms and Price, and between Baths and Price. The boxplots are useful to find out whether any outliers are present in the data.

First, let us draw the lmplot that contains the data points along with the regression line. This can be done using the lmplot() function of seaborn as:

sns.lmplot(x='SqFt', y='Price', data=df)
Output:

Figure 20.5: lmplot between area and price

This lmplot represents that there is a linear relationship between the area of the house and price, and it is a positive relationship. That means if the area increases, the price also increases.

Let us now draw the box plots using the seaborn boxplot() function, as:

sns.boxplot(x='BedRooms', y='Price', data=df)
sns.boxplot(x='Baths', y='Price', data=df)
Output: the box plots show a few isolated points beyond the whiskers. These outlier points are nothing but abnormal values in the data. We can find these outlier datapoints using the IQR (Inter Quartile Range) method, as:
import numpy as np

# calculate iqr
q3 = df['Price'].quantile(0.75)
q1 = df['Price'].quantile(0.25)
iqr = q3 - q1

# calculate upper and lower limits from iqr
# any value above ul or below ll will become an outlier.
ul = q3 + (1.5 * iqr)
ll = q1 - (1.5 * iqr)

# Price should not be more than ul or less than ll.
# If it is so, then it becomes an outlier.
upper = np.where(df['Price'] >= ul)
lower = np.where(df['Price'] <= ll)
Once we find out the outliers, we can delete the rows with outlier values using the drop() method of the data frame, as:

df.drop(upper[0], inplace=True)
df.drop(lower[0], inplace=True)

Let us now decide the independent variables (x) and the dependent variable (y) in our data. The columns 'SqFt', 'BedRooms' and 'Baths' are the independent variables; they are the 2nd, 3rd and 4th columns. The column 'Price' is the 1st column, and it should be taken as the dependent variable.

# retrieve only 2nd, 3rd and 4th columns and take them as x
x = df.iloc[:, 2:5].values
x

# retrieve 1st column and take it as y
y = df.iloc[:, 1].values
y
In the above statements, the .values property converts the values into array format. The x value will be in 2D array format and the y value will be in 1D array format. These formats are required when we want to supply this data to the Machine Learning model.

Now, split the data into train and test data using the train_test_split() function, as:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,
random_state = 10)

The x_train and y_train data should be used to train the model. x_test and y_test are to be used for testing the accuracy of the model.
Multiple Linear Regression is also nothing but the Linear Regression model. Hence, we can create the model by creating an object of the LinearRegression class as:

reg = LinearRegression()

Train the model:

reg.fit(x_train, y_train)

Once the training of the model is completed, we can find the accuracy level of the model by calculating the r squared value as:

y_pred = reg.predict(x_test)
r2_score(y_test, y_pred)
Here we are comparing the test rows (y_test) against the predicted rows (y_pred). The predicted rows are produced by the model from the x_test rows passed to the predict() method. Of course, we can also use the score() method to find the score, as:

reg.score(x_test, y_test)

The score() method calculates y_pred values based on the x_test values and compares them with y_test to decide the score of the model.

Finally, let us use the model to make predictions. Let us predict the price of a house having 780 square feet of constructed area, with 3 bed rooms and 1 bath room:

reg.predict([[780, 3, 1]])

Suppose we want to predict the prices of 2 houses with the following specifications: 1500 sqft area, 3 bedrooms and 2 bathrooms; and 2000 sqft area, 4 bedrooms and 4 bathrooms. In this case, we can use the predict() method as:

reg.predict([[1500, 3, 2], [2000, 4, 4]])

Please remember, the input should be in the form of a 2D array, and the output will be given in the form of a 1D array by any Machine Learning model. For example, the above statement will produce the following output, which is a 1D array:

[128866.13085266 205294.90136746]

Here, the first element 128866.13085266 represents the price of the first house and the second element 205294.90136746 represents the price of the second house.

Program 2: Create a Multiple Linear Regression Model for the house prices dataset. Divide the dataset into train and test data while giving it to the model, and predict the prices of houses with the following specifications:
1. Predict the house price with 780 sqft area, 3 bed rooms and 1 bath room.
2. Predict the house prices for two houses: one with 1500 sqft, 3 bed rooms and 2 bath rooms, and another one with 2000 sqft, 4 bed rooms and 4 bath rooms.
# multiple linear regression - predicting house prices
import pandas as pd

# load the dataset into dataframe
df = pd.read_excel("e:/test/cal1-03homes.xls", "Sheet1")
df

# find out any missing values in the dataset
# there are no missing values in any column
df.isnull().sum()

# find out outliers by drawing box plots - there are outliers in Price
import seaborn as sns
sns.lmplot(x='SqFt', y='Price', data=df)
sns.boxplot(x='BedRooms', y='Price', data=df)
sns.boxplot(x='Baths', y='Price', data=df)

# delete the rows with outliers using iqr method
# calculate q3 (third quartile).
q3 = df['Price'].quantile(0.75)
q3

# calculate q1 (first quartile).
q1 = df['Price'].quantile(0.25)
q1

# find iqr value. this gives 80000
iqr = q3 - q1
iqr

# calculate upper and lower limits from iqr
# any value above ul or below ll will become an outlier.
ul = q3 + (1.5 * iqr)
ll = q1 - (1.5 * iqr)
print(ul, ll)  # 304900.0 -15100.0

# upper bound
import numpy as np
upper = np.where(df['Price'] >= ul)

# lower bound
lower = np.where(df['Price'] <= ll)

# delete the rows above upper and below lower values
df.drop(upper[0], inplace=True)
df.drop(lower[0], inplace=True)

# retrieve only 2nd, 3rd and 4th columns and take them as x
x = df.iloc[:, 2:5].values
x

# retrieve 1st column and take it as y
y = df.iloc[:, 1].values
y

# take 80% of data for training and 20% for testing
# random_state indicates the random seed used in selecting test rows
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,
random_state = 10)

# train the computer with Multiple Linear Regression model
from sklearn.linear_model import LinearRegression

# create LinearRegression class object
reg = LinearRegression()

# train the model by passing train data to fit() method
reg.fit(x_train, y_train)

# test the model by passing test data and obtain predicted data
y_pred = reg.predict(x_test)

# find the r squared value by comparing test data
# (expected data) and predicted data. accuracy is 82.9%
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)  # 0.829686049412441

# another way to find the score
reg.score(x_test, y_test)  # 0.829686049412441

# predict the price of a house with 780 sqft, 3 bedrooms and 1 bathroom
# this gives 56120 dollars.
print(reg.predict([[780, 3, 1]]))  # 56120.32684253

# predict the prices of houses with 1500 sqft, 3 bedrooms and 2 bathrooms
# and 2000 sqft, 4 bedrooms and 4 bathrooms.
# this gives 128866 dollars and 205294 dollars.
print(reg.predict([[1500, 3, 2], [2000, 4, 4]]))
Output:

Observe the following screenshot, where we executed the program in Spyder. Execute the program line by line and observe the output at the bottom right, as shown in Figure 20.7.
Figure 20.7: Executing the program in Spyder IDE
Points to Remember

❑ There are 2 types of Linear Regression: Simple Linear Regression and Multiple Linear Regression.

❑ In Simple Linear Regression, the target value is predicted based on only one independent variable or feature.

❑ In Multiple Linear Regression, the target value is dependent on multiple (more than 1) independent variables.

❑ The Multiple linear regression model uses the formula: y = m1x1 + m2x2 + m3x3 + ... + b. Here, x1, x2, x3, ... are called independent variables or features. m1, m2, m3, ... are called coefficients. b is called the intercept.

❑ In linear regression models, the score or accuracy can be known using the score() method or the r2_score() function.

❑ The r squared value measures the squared deviations of the data points from the regression line. It lies between 0 and 1. If it is nearer to 0, then the model is not performing well. If it is nearer to 1, then the model is doing well and the accuracy level is high.