
Linear Regression

Linear regression is an important statistical technique used to model and analyze the
relationship between a dependent variable and one or more independent variables.

Question 1: What is linear regression?

Linear regression is a statistical method used to model the relationship between a
dependent variable and one or more independent variables by fitting a linear equation
to observed data.
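In its simplest form the fitted model is the line $y = \beta_0 + \beta_1 x + \varepsilon$, where $\beta_0$ is the intercept, $\beta_1$ the slope on the single predictor $x$, and $\varepsilon$ the error term.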

Question 2: What is the difference between simple linear regression and multiple linear
regression?

Simple linear regression involves a single independent variable predicting the dependent
variable, while multiple linear regression involves two or more independent variables
predicting the dependent variable. Multiple linear regression allows for the analysis of
the combined effects of multiple predictors on the outcome variable.
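With $p$ predictors the equation generalizes to $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon$, so each coefficient measures the effect of its predictor with the others held fixed.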

Question 3: How is the quality of a linear regression model evaluated?

The quality of a linear regression model is assessed using various metrics, such as the four below (a sketch computing them follows the list):

1. coefficient of determination (R-squared)
2. root mean square error (RMSE)
3. mean absolute error (MAE)
4. adjusted R-squared
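A minimal sketch of computing all four with scikit-learn and NumPy, using hypothetical arrays y_true and y_pred in place of real model output:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 9.0])   # hypothetical observed values
y_pred = np.array([2.8, 5.3, 7.1, 9.4])   # hypothetical predictions

r2 = r2_score(y_true, y_pred)                        # R-squared
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # RMSE
mae = mean_absolute_error(y_true, y_pred)            # MAE

# Adjusted R-squared penalizes extra predictors:
# n = number of observations, p = number of predictors
n, p = len(y_true), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)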

Question 4: How can you deal with multicollinearity in linear regression?

Multicollinearity occurs when independent variables in a regression model are highly
correlated with each other. It can lead to unstable coefficient estimates and reduce the
model's interpretability. Dealing with multicollinearity can involve removing one of the
correlated variables, combining variables, or using dimensionality reduction techniques
such as principal component analysis (PCA). A standard way to detect it is the variance
inflation factor (VIF), sketched below.
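A minimal sketch of a VIF check with statsmodels, assuming the advertising DataFrame df loaded in the cells below and treating the three channels as predictors:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = df[['TV', 'Radio', 'Newspaper']]   # assumed predictor set
X_const = sm.add_constant(X)           # VIF is computed on the design matrix
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns,
)
# Rule of thumb: a VIF above roughly 5-10 flags problematic collinearity.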

In [1]: ## import libraries


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [3]: df = pd.read_csv('advertising.csv')
df.head()

Out[3]: TV Radio Newspaper Sales

0 230.1 37.8 69.2 22.1
1 44.5 39.3 45.1 10.4
2 17.2 45.9 69.3 12.0
3 151.5 41.3 58.5 16.5
4 180.8 10.8 58.4 17.9

In [4]: df.shape

Out[4]: (200, 4)

In [5]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 TV 200 non-null float64
1 Radio 200 non-null float64
2 Newspaper 200 non-null float64
3 Sales 200 non-null float64
dtypes: float64(4)
memory usage: 6.4 KB

In [7]: df.describe()


Out[7]: TV Radio Newspaper Sales

count 200.000000 200.000000 200.000000 200.000000

mean 147.042500 23.264000 30.554000 15.130500

std 85.854236 14.846809 21.778621 5.283892

min 0.700000 0.000000 0.300000 1.600000

25% 74.375000 9.975000 12.750000 11.000000

50% 149.750000 22.900000 25.750000 16.000000

75% 218.825000 36.525000 45.100000 19.050000

max 296.400000 49.600000 114.000000 27.000000

In [11]: df.isnull().sum()*100/df.shape[0]   # percentage of missing values per column

Out[11]: TV 0.0
Radio 0.0
Newspaper 0.0
Sales 0.0
dtype: float64

In [13]: #outlier analysis


fig, axs = plt.subplots(3, figsize = (5,5))
plt1 = sns.boxplot(df['TV'], ax = axs[0])
plt2 = sns.boxplot(df['Newspaper'], ax = axs[1])
plt3 = sns.boxplot(df['Radio'], ax = axs[2])
plt.tight_layout()


In [14]: sns.pairplot(df, x_vars=['TV', 'Newspaper', 'Radio'], y_vars='Sales', height=4)

plt.show()

In [15]: # Let's see the correlation between different variables.


sns.heatmap(df.corr(), cmap="YlGnBu", annot = True)
plt.show()


TV seems to be the most correlated with Sales.
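A quick numeric check of that observation (a one-line sketch against the loaded df):

df.corr()['Sales'].sort_values(ascending=False)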

In [16]: X = df['TV']
y = df['Sales']

In [17]: from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.6, test_size = 0.4)

In [20]: # Add a constant to get an intercept


import statsmodels.api as sm
X_train_sm = sm.add_constant(X_train)

In [21]: lr = sm.OLS(y_train, X_train_sm).fit()

In [22]: lr.params

Out[22]: const 6.780417
TV 0.055639
dtype: float64
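The intercept and slope can also be read from the full regression table (coefficients, standard errors, t-statistics, R-squared), a one-line sketch using the standard statsmodels results API:

lr.summary()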

In [23]: plt.scatter(X_train, y_train)

# draw the fitted line using the estimated parameters from lr.params
plt.plot(X_train, 6.780 + 0.0556*X_train, 'r')
plt.show()


In [24]: y_train_pred = lr.predict(X_train_sm)


res = (y_train - y_train_pred)

In [25]: fig = plt.figure()

# distplot is deprecated in recent seaborn; histplot is the axes-level replacement
sns.histplot(res, bins = 15, kde = True)
fig.suptitle('Error Terms', fontsize = 15) # Plot heading
plt.xlabel('y_train - y_train_pred', fontsize = 15) # X-label
plt.show()


In [26]: plt.scatter(X_train, res)   # residuals vs predictor should show no pattern
plt.show()

In [29]: X_test_sm = sm.add_constant(X_test)


y_pred = lr.predict(X_test_sm)


In [30]: y_pred.head()

Out[30]: 126 7.214399
104 20.033555
99 14.302769
92 18.892962
111 20.228291
dtype: float64

In [31]: from sklearn.metrics import mean_squared_error


from sklearn.metrics import r2_score
#Returns the mean squared error; we'll take a square root
np.sqrt(mean_squared_error(y_test, y_pred))

Out[31]: 1.994739178382777

In [32]: r_squared = r2_score(y_test, y_pred)


r_squared

Out[32]: 0.7807592057194056

In [33]: # best-fit line for the test set

plt.scatter(X_test, y_test)
plt.plot(X_test, 6.780 + 0.0556 * X_test, 'r')
plt.show()

