6/2/23, 8:33 PM linearRegression
Linear Regression
Linear regression is an important statistical technique used to model and analyze the
relationship between a dependent variable and one or more independent variables.
Question 1: What is linear regression?
Linear regression is a statistical method used to model the relationship between a
dependent variable and one or more independent variables by fitting a linear equation
to observed data.
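As a minimal sketch of what "fitting a linear equation" means for a single predictor, the snippet below computes the least-squares slope and intercept from their closed-form expressions. The data values are made up for illustration and do not come from the advertising dataset.

```python
import numpy as np

# Hypothetical toy data (not from advertising.csv): y is exactly 2 + 3x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 + 3.0 * x

# Closed-form least-squares estimates for y = intercept + slope * x
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

print(slope, intercept)
```

Because the toy data lie exactly on a line, the estimates recover the slope 3 and intercept 2 used to generate them.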
Question 2: What is the difference between simple linear regression and multiple linear
regression?
Simple linear regression involves a single independent variable predicting the dependent
variable, while multiple linear regression involves two or more independent variables
predicting the dependent variable. Multiple linear regression allows for the analysis of
the combined effects of multiple predictors on the outcome variable.
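The difference can be illustrated with a small sketch: with two predictors, the model gains one coefficient per predictor plus an intercept. The example below uses NumPy's least-squares solver on made-up data; the variable names x1 and x2 are hypothetical.

```python
import numpy as np

# Hypothetical toy data: y = 1 + 2*x1 - 1*x2 exactly
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0, 2.0])
y = 1.0 + 2.0 * x1 - 1.0 * x2

# Design matrix: a column of ones (intercept) plus one column per predictor
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution of X @ beta ~ y
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta)  # [intercept, coefficient of x1, coefficient of x2]
```

Dropping the x2 column from the design matrix turns the same code into a simple linear regression.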
Question 3: How is the quality of a linear regression model evaluated?
The quality of a linear regression model is assessed using various metrics, such as:
1. the coefficient of determination (R-squared)
2. root mean square error (RMSE)
3. mean absolute error (MAE)
4. adjusted R-squared
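The four metrics listed above can be computed directly from their definitions. The sketch below does so with NumPy on made-up values; the arrays and the assumed number of predictors p are illustrative only.

```python
import numpy as np

# Hypothetical observed and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])
n = len(y_true)
p = 1  # assumed number of predictors in the model

residuals = y_true - y_hat
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares

r2 = 1 - ss_res / ss_tot                         # coefficient of determination
rmse = np.sqrt(np.mean(residuals ** 2))          # root mean square error
mae = np.mean(np.abs(residuals))                 # mean absolute error
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)    # adjusted R-squared

print(r2, rmse, mae, adj_r2)
```

Adjusted R-squared penalizes additional predictors, so it is the more useful of the two R-squared variants when comparing models with different numbers of variables.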
Question 4: How can you deal with multicollinearity in linear regression?
Multicollinearity occurs when independent variables in a regression model are highly
correlated with each other. It can lead to unstable coefficient estimates and reduce the
model's interpretability. Dealing with multicollinearity can involve removing one of the
correlated variables, combining variables, or using dimensionality reduction techniques
such as principal component analysis (PCA).
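A rough sketch of two of these ideas follows: inspecting pairwise correlations to spot nearly redundant predictors, and applying PCA (here via an SVD of the centered data) to obtain uncorrelated components. The predictors x1, x2 and x3 are simulated, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# Hypothetical predictors: x2 is almost a copy of x1, x3 is unrelated
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# Pairwise correlations between predictors; values near +1 or -1 signal trouble
corr = np.corrcoef(X, rowvar=False)

# PCA via SVD of the centered data: the resulting components are uncorrelated
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S ** 2 / np.sum(S ** 2)  # share of variance per component
scores = Xc @ Vt.T                   # principal component scores

print(np.round(corr, 2))
print(np.round(explained, 3))
```

In this simulated case the last component carries almost no variance, which reflects the redundancy between x1 and x2; regressing on the leading components is one way to sidestep the problem, at the cost of less direct interpretability.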
In [1]: # Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [3]: df = pd.read_csv('advertising.csv')
df.head()
Out[3]: TV Radio Newspaper Sales
0 230.1 37.8 69.2 22.1
1 44.5 39.3 45.1 10.4
2 17.2 45.9 69.3 12.0
3 151.5 41.3 58.5 16.5
4 180.8 10.8 58.4 17.9
In [4]: df.shape
Out[4]: (200, 4)
In [5]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TV 200 non-null float64
1 Radio 200 non-null float64
2 Newspaper 200 non-null float64
3 Sales 200 non-null float64
dtypes: float64(4)
memory usage: 6.4 KB
In [7]: df.describe()
Out[7]: TV Radio Newspaper Sales
count 200.000000 200.000000 200.000000 200.000000
mean 147.042500 23.264000 30.554000 15.130500
std 85.854236 14.846809 21.778621 5.283892
min 0.700000 0.000000 0.300000 1.600000
25% 74.375000 9.975000 12.750000 11.000000
50% 149.750000 22.900000 25.750000 16.000000
75% 218.825000 36.525000 45.100000 19.050000
max 296.400000 49.600000 114.000000 27.000000
In [11]: df.isnull().sum()*100/df.shape[0]
Out[11]: TV 0.0
Radio 0.0
Newspaper 0.0
Sales 0.0
dtype: float64
In [13]: # Outlier analysis: one box plot per predictor
fig, axs = plt.subplots(3, figsize=(5, 5))
sns.boxplot(x=df['TV'], ax=axs[0])
sns.boxplot(x=df['Newspaper'], ax=axs[1])
sns.boxplot(x=df['Radio'], ax=axs[2])
plt.tight_layout()
plt.show()
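By default, box plots draw whiskers up to 1.5 times the interquartile range (IQR) beyond the quartiles and mark anything further out as an outlier. As a rough check, the sketch below applies that rule to the quartiles, minimum and maximum reported by df.describe() above. The helper name is hypothetical, and quartiles computed by the plotting library may differ slightly from these.

```python
# Quartiles and extremes copied from the df.describe() output above
summary = {
    "TV":        {"min": 0.7, "q1": 74.375, "q3": 218.825, "max": 296.4},
    "Radio":     {"min": 0.0, "q1": 9.975,  "q3": 36.525,  "max": 49.6},
    "Newspaper": {"min": 0.3, "q1": 12.75,  "q3": 45.1,    "max": 114.0},
}

def has_points_outside_whiskers(stats, k=1.5):
    """Return True if min or max falls outside [q1 - k*IQR, q3 + k*IQR]."""
    iqr = stats["q3"] - stats["q1"]
    lower = stats["q1"] - k * iqr
    upper = stats["q3"] + k * iqr
    return stats["min"] < lower or stats["max"] > upper

flags = {name: has_points_outside_whiskers(s) for name, s in summary.items()}
print(flags)
```

Under this rule only Newspaper has values beyond the whiskers: its upper fence is 45.1 + 1.5 × 32.35 = 93.625, while its maximum is 114.0.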
In [14]: sns.pairplot(df, x_vars=['TV', 'Newspaper', 'Radio'], y_vars='Sales', height=4,
plt.show()
In [15]: # Let's see the correlation between different variables.
sns.heatmap(df.corr(), cmap="YlGnBu", annot = True)
plt.show()
TV appears to be the variable most strongly correlated with Sales, so it is used as the
single predictor in the model below.
In [16]: X = df['TV']
y = df['Sales']
In [17]: from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.6, test
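To make the 60/40 split concrete, here is a hand-rolled sketch of the same idea using a shuffled index. It is not how train_test_split is implemented internally, and the seed below is arbitrary and unrelated to whatever the original call used.

```python
import numpy as np

n_rows = 200          # number of rows in the dataset (from df.shape)
train_fraction = 0.6  # same fraction as train_size above

rng = np.random.default_rng(42)  # arbitrary seed, chosen only for this sketch
shuffled = rng.permutation(n_rows)

# First 60% of the shuffled row positions go to training, the rest to testing
n_train = round(train_fraction * n_rows)
train_idx = shuffled[:n_train]
test_idx = shuffled[n_train:]

print(len(train_idx), len(test_idx))
```

Fixing the seed keeps the split identical between runs, which matters when comparing metrics across experiments.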
In [20]: # Add a constant to get an intercept
import statsmodels.api as sm
X_train_sm = sm.add_constant(X_train)
In [21]: lr = sm.OLS(y_train, X_train_sm).fit()
In [22]: lr.params
Out[22]: const 6.780417
TV 0.055639
dtype: float64
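These two parameters define the fitted line Sales = 6.780417 + 0.055639 × TV. A small sketch of using them for a prediction follows; the function name and the TV value of 100 are hypothetical.

```python
# Coefficients copied from the lr.params output above
const = 6.780417
tv_coef = 0.055639

def predict_sales(tv_value):
    """Predicted Sales for a given TV value under the fitted line."""
    return const + tv_coef * tv_value

# Hypothetical TV value, chosen only for illustration
print(predict_sales(100.0))
```

Read this way, each one-unit increase in TV is associated with an increase of about 0.0556 in predicted Sales, and the intercept is the predicted Sales when TV is zero.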
In [23]: plt.scatter(X_train, y_train)
# Draw the fitted line using the estimated coefficients from lr.params
plt.plot(X_train, lr.params['const'] + lr.params['TV'] * X_train, 'r')
plt.show()
In [24]: y_train_pred = lr.predict(X_train_sm)
res = (y_train - y_train_pred)
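Residuals from a least-squares fit that includes an intercept have two useful properties: they sum to zero and they are uncorrelated with the predictor on the training data. The sketch below checks this on simulated data (not the advertising dataset) with an arbitrary seed.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# Hypothetical data, not the advertising dataset
x = rng.uniform(0, 300, size=50)
y = 7.0 + 0.05 * x + rng.normal(scale=2.0, size=50)

# Fit y = b0 + b1 * x by least squares
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

residuals = y - X @ beta

print(residuals.sum())       # close to 0
print(np.dot(residuals, x))  # close to 0
```

Because these properties hold by construction, they say nothing about model quality; the shape of the residual distribution and any pattern against the predictor are what the diagnostic plots are meant to reveal.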
In [25]: fig = plt.figure()
# histplot replaces the deprecated distplot; kde=True overlays a density estimate
sns.histplot(res, bins=15, kde=True, stat='density')
fig.suptitle('Error Terms', fontsize=15) # Plot heading
plt.xlabel('y_train - y_train_pred', fontsize=15) # X-label
plt.show()
In [26]: plt.scatter(X_train,res)
plt.show()
In [29]: X_test_sm = sm.add_constant(X_test)
y_pred = lr.predict(X_test_sm)
In [30]: y_pred.head()
Out[30]: 126 7.214399
104 20.033555
99 14.302769
92 18.892962
111 20.228291
dtype: float64
In [31]: from sklearn.metrics import mean_squared_error, r2_score
# RMSE on the test set: square root of the mean squared error
np.sqrt(mean_squared_error(y_test, y_pred))
Out[31]: 1.994739178382777
In [32]: r_squared = r2_score(y_test, y_pred)
r_squared
Out[32]: 0.7807592057194056
In [33]: # Best-fit line on the test set, using the coefficients estimated on the training set
plt.scatter(X_test, y_test)
plt.plot(X_test, lr.params['const'] + lr.params['TV'] * X_test, 'r')
plt.show()