Recipe 5: Identifying a linear relationship
Linear models assume that the independent variables X have a linear relationship with the dependent variable Y. If this assumption is not met, the
model may show poor performance. In this recipe, we will learn how to visualize the linear relationships between X and Y.
import pandas as pd
import numpy as np
# for plotting
import matplotlib.pyplot as plt
import seaborn as sns
# the dataset for the demo
from sklearn.datasets import load_boston
# for linear regression
from sklearn.linear_model import LinearRegression
# load the Boston house price data from scikit-learn
boston_dataset = load_boston()
# create a dataframe with the independent variables
boston = pd.DataFrame(boston_dataset.data,
columns=boston_dataset.feature_names)
# add the target
boston['MEDV'] = boston_dataset.target
boston.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
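Note that load_boston was removed in scikit-learn 1.2. If the import above fails on a recent version, a sketch of the workaround suggested in scikit-learn's removal notice (assuming the original StatLib source at lib.stat.cmu.edu is still reachable) is:

# load_boston was removed in scikit-learn 1.2;
# this fallback reads the raw data from the original StatLib source
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# each record spans two rows in the raw file
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
boston = pd.DataFrame(data, columns=feature_names)
boston['MEDV'] = target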
# this is the information about the Boston house price dataset
# get familiar with the variables before continuing with
# the notebook
# the aim is to predict the "Median value of the houses",
# the MEDV column of this dataset,
# using variables with characteristics about
# the homes and the neighborhoods
print(boston_dataset.DESCR)
.. _boston_dataset:
Boston house prices dataset
---------------------------
**Data Set Characteristics:**
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
.. topic:: References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
# create a dataframe with a variable x that
# follows a normal distribution and shows a
# linear relationship with y
# this will provide the expected plots,
# i.e., how the plots should look if the
# linear assumption is met
np.random.seed(29) # for reproducibility
n = 200 # in the book we pass 200 directly, without defining n
x = np.random.randn(n)
y = x * 10 + np.random.randn(n) * 2
data = pd.DataFrame({'x': x, 'y': y})
data.head()
x y
0 -0.417482 -1.271561
1 0.706032 7.990600
2 1.915985 19.848687
3 -2.141755 -21.928903
4 0.719057 5.579070
Linear relationships can be assessed with scatter plots.
# for the simulated data
# this is how the scatter plot looks when
# there is a linear relationship between X and Y
sns.lmplot(x="x", y="y", data=data, order=1)
# order 1 indicates that we want seaborn to
# estimate a linear model (the line in the plot below)
# between x and y
plt.ylabel('Target')
plt.xlabel('Independent variable')
[Figure: scatter plot of the simulated x vs y with the fitted regression line]
# now we make a scatter plot for the boston
# house price dataset
# we plot the variable LSTAT (% lower status of the population)
# vs the target MEDV (median value of the house)
sns.lmplot(x="LSTAT", y="MEDV", data=boston, order=1)
[Figure: scatter plot of LSTAT vs MEDV with the fitted regression line]
Although not perfect, the relationship is fairly linear.
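A quick numeric complement to the scatter plot is the Pearson correlation coefficient, which measures the strength of the linear association; a minimal sketch using pandas:

# Pearson correlation between LSTAT and the target MEDV
# values close to -1 or 1 indicate a strong linear association
print(boston['LSTAT'].corr(boston['MEDV']))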
# now we plot CRIM (per capita crime rate by town)
# vs the target MEDV (median value of the house)
sns.lmplot(x="CRIM", y="MEDV", data=boston, order=1)
[Figure: scatter plot of CRIM vs MEDV with the fitted regression line]
Linear relationships can also be assessed by evaluating the residuals. The residuals are the differences between the real output and the values
estimated by the linear model. If the relationship is linear, the residuals should be normally distributed and centered around zero.
# SIMULATED DATA
# step 1: build a linear model
# call the linear model from sklearn
linreg = LinearRegression()
# fit the model
linreg.fit(data['x'].to_frame(), data['y'])
# step 2: obtain the predictions
# make the predictions
pred = linreg.predict(data['x'].to_frame())
# step 3: calculate the residuals
error = data['y'] - pred
# plot predicted vs real
plt.scatter(x=pred, y=data['y'])
plt.xlabel('Predictions')
plt.ylabel('Real value')
[Figure: predictions vs real values for the simulated data]
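For a well-fitting linear model, the points in the predicted-vs-real plot should lie close to the diagonal, where the prediction equals the real value. A minimal sketch that overlays this reference line:

# overlay the diagonal reference line on the predicted vs real plot
# points on this line mean the prediction equals the real value
lims = [data['y'].min(), data['y'].max()]
plt.scatter(x=pred, y=data['y'])
plt.plot(lims, lims, color='red', linestyle='--')
plt.xlabel('Predictions')
plt.ylabel('Real value')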
# step 4: observe the distribution of the residuals
# Residuals plot
# if the relationship is linear, the noise should be
# random, centered around zero, and follow a normal distribution
# we plot the error terms vs the independent variable x
# error values should be around 0 and homogeneously distributed
plt.scatter(y=error, x=data['x'])
plt.ylabel('Residuals')
plt.xlabel('Independent variable x')
[Figure: residuals vs the independent variable x]
# step 5: observe the distribution of the residuals
# plot a histogram of the residuals
# they should follow a gaussian distribution
# centered around 0
# note: distplot was removed in recent seaborn versions;
# histplot with kde=True is the modern equivalent
sns.histplot(error, bins=30, kde=True)
plt.xlabel('Residuals')
[Figure: histogram of the residuals for the simulated data]
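A Q-Q plot is another common visual check of normality: if the residuals follow a Gaussian distribution, the points fall along the diagonal. A minimal sketch using scipy.stats.probplot:

# Q-Q plot of the residuals against the theoretical
# quantiles of a normal distribution
import scipy.stats as stats
stats.probplot(error, dist="norm", plot=plt)
plt.show()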
# now we do the same for the variable LSTAT of the boston
# house price dataset from sklearn
# call the linear model from sklearn
linreg = LinearRegression()
# fit the model
linreg.fit(boston['LSTAT'].to_frame(), boston['MEDV'])
# make the predictions
pred = linreg.predict(boston['LSTAT'].to_frame())
# calculate the residuals
error = boston['MEDV'] - pred
# plot predicted vs real
plt.scatter(x=pred, y=boston['MEDV'])
plt.xlabel('Predictions')
plt.ylabel('MEDV')
[Figure: predictions vs real MEDV values]
# Residuals plot
# if the relationship is linear, the noise should be
# random, centered around zero, and follow a normal distribution
plt.scatter(y=error, x=boston['LSTAT'])
plt.ylabel('Residuals')
plt.xlabel('LSTAT')
[Figure: residuals vs LSTAT]
# plot a histogram of the residuals
# they should follow a gaussian distribution
# note: distplot was removed in recent seaborn versions
sns.histplot(error, bins=30, kde=True)
[Figure: histogram of the residuals for the LSTAT model]
For this particular case, the residuals are centered around zero, but they are not homogeneously distributed across the values of LSTAT: the
largest and smallest values of LSTAT show larger residuals. In addition, the histogram shows that the residuals do not follow a strictly
Gaussian distribution.
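The visual impression can be complemented with a formal normality test; a minimal sketch using the Shapiro-Wilk test from scipy, where a small p-value indicates a departure from normality:

# Shapiro-Wilk test for normality of the residuals
# a p-value below, e.g., 0.05 suggests the residuals are not Gaussian
from scipy.stats import shapiro
stat, p_value = shapiro(error)
print(f'statistic: {stat:.3f}, p-value: {p_value:.3f}')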