Recipe 5: Identifying a linear relationship
Linear models assume that the independent variables X have a linear relationship with the dependent variable Y. If this assumption is not met, the
model may show poor performance. In this recipe, we will learn how to visualize the linear relationships between X and Y.
import pandas as pd
import numpy as np
# for plotting
import matplotlib.pyplot as plt
import seaborn as sns
# the dataset for the demo
from sklearn.datasets import load_boston
# for linear regression
from sklearn.linear_model import LinearRegression
# load the Boston house price data from scikit-learn
boston_dataset = load_boston()
# create a dataframe with the independent variables
boston = pd.DataFrame(boston_dataset.data,
columns=boston_dataset.feature_names)
# add the target
boston['MEDV'] = boston_dataset.target
boston.head()
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
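Note that load_boston was removed in scikit-learn 1.2. If the import above fails on a recent version, a sketch of the workaround suggested in scikit-learn's removal notice (assuming the original StatLib source at lib.stat.cmu.edu is still reachable) is:

# load_boston was removed in scikit-learn 1.2;
# this fallback reads the raw data from the original StatLib source
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# each record spans two rows in the raw file
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
feature_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
boston = pd.DataFrame(data, columns=feature_names)
boston['MEDV'] = target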
# this is the information about the Boston house price dataset
# get familiar with the variables before continuing with
# the notebook
# the aim is to predict the "Median value of the houses",
# the MEDV column of this dataset,
# using variables with characteristics about
# the homes and the neighborhoods
print(boston_dataset.DESCR)
.. _boston_dataset:
Boston house prices dataset
---------------------------
**Data Set Characteristics:**
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
.. topic:: References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
# create a dataframe with a variable x that
# follows a normal distribution and shows a
# linear relationship with y
# this will provide the expected plots,
# i.e., how the plots should look if the
# linear assumption is met
np.random.seed(29) # for reproducibility
n = 200 # in the book we pass 200 directly, without defining n
x = np.random.randn(n)
y = x * 10 + np.random.randn(n) * 2
data = pd.DataFrame({'x': x, 'y': y})
data.head()
x y
0 -0.417482 -1.271561
1 0.706032 7.990600
2 1.915985 19.848687
3 -2.141755 -21.928903
4 0.719057 5.579070
Linear relationships can be assessed with scatter plots.
# for the simulated data
# this is how the scatter plot looks when
# there is a linear relationship between X and Y
sns.lmplot(x="x", y="y", data=data, order=1)
# order 1 indicates that we want seaborn to
# estimate a linear model (the line in the plot below)
# between x and y
plt.ylabel('Target')
plt.xlabel('Independent variable')
[Figure: scatter plot of the simulated x vs y with the fitted regression line]
# now we make a scatter plot for the boston
# house price dataset
# we plot the variable LSTAT (% lower status of the population)
# vs the target MEDV (median value of the house)
sns.lmplot(x="LSTAT", y="MEDV", data=boston, order=1)
[Figure: scatter plot of LSTAT vs MEDV with the fitted regression line]
Although not perfect, the relationship is fairly linear.
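A quick numeric complement to the scatter plot is the Pearson correlation coefficient, which measures the strength of the linear association; a minimal sketch using pandas:

# Pearson correlation between LSTAT and the target MEDV
# values close to -1 or 1 indicate a strong linear association
print(boston['LSTAT'].corr(boston['MEDV']))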
# now we plot CRIM (per capita crime rate by town)
# vs the target MEDV (median value of the house)
sns.lmplot(x="CRIM", y="MEDV", data=boston, order=1)
[Figure: scatter plot of CRIM vs MEDV with the fitted regression line]
Linear relationships can also be assessed by evaluating the residuals. The residuals are the differences between the real output and the values
estimated by the linear model. If the relationship is linear, the residuals should be normally distributed and centered around zero.
# SIMULATED DATA
# step 1: build a linear model
# call the linear model from sklearn
linreg = LinearRegression()
# fit the model
linreg.fit(data['x'].to_frame(), data['y'])
# step 2: obtain the predictions
# make the predictions
pred = linreg.predict(data['x'].to_frame())
# step 3: calculate the residuals
error = data['y'] - pred
# plot predicted vs real
plt.scatter(x=pred, y=data['y'])
plt.xlabel('Predictions')
plt.ylabel('Real value')
[Figure: predictions vs real values for the simulated data]
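For a well-fitting linear model, the points in the predicted-vs-real plot should lie close to the diagonal, where the prediction equals the real value. A minimal sketch that overlays this reference line:

# overlay the diagonal reference line on the predicted vs real plot
# points on this line mean the prediction equals the real value
lims = [data['y'].min(), data['y'].max()]
plt.scatter(x=pred, y=data['y'])
plt.plot(lims, lims, color='red', linestyle='--')
plt.xlabel('Predictions')
plt.ylabel('Real value')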
# step 4: observe the distribution of the residuals
# Residuals plot
# if the relationship is linear, the noise should be
# random, centered around zero, and follow a normal distribution
# we plot the error terms vs the independent variable x
# error values should be around 0 and homogeneously distributed
plt.scatter(y=error, x=data['x'])
plt.ylabel('Residuals')
plt.xlabel('Independent variable x')
[Figure: residuals vs the independent variable x]
# step 5: observe the distribution of the residuals
# plot a histogram of the residuals
# they should follow a gaussian distribution
# centered around 0
# note: distplot was removed in recent seaborn versions;
# histplot with kde=True is the modern equivalent
sns.histplot(error, bins=30, kde=True)
plt.xlabel('Residuals')
[Figure: histogram of the residuals for the simulated data]
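A Q-Q plot is another common visual check of normality: if the residuals follow a Gaussian distribution, the points fall along the diagonal. A minimal sketch using scipy.stats.probplot:

# Q-Q plot of the residuals against the theoretical
# quantiles of a normal distribution
import scipy.stats as stats
stats.probplot(error, dist="norm", plot=plt)
plt.show()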
# now we do the same for the variable LSTAT of the boston
# house price dataset from sklearn
# call the linear model from sklearn
linreg = LinearRegression()
# fit the model
linreg.fit(boston['LSTAT'].to_frame(), boston['MEDV'])
# make the predictions
pred = linreg.predict(boston['LSTAT'].to_frame())
# calculate the residuals
error = boston['MEDV'] - pred
# plot predicted vs real
plt.scatter(x=pred, y=boston['MEDV'])
plt.xlabel('Predictions')
plt.ylabel('MEDV')
[Figure: predictions vs real MEDV values]
# Residuals plot
# if the relationship is linear, the noise should be
# random, centered around zero, and follow a normal distribution
plt.scatter(y=error, x=boston['LSTAT'])
plt.ylabel('Residuals')
plt.xlabel('LSTAT')
[Figure: residuals vs LSTAT]
# plot a histogram of the residuals
# they should follow a gaussian distribution
# note: distplot was removed in recent seaborn versions
sns.histplot(error, bins=30, kde=True)
[Figure: histogram of the residuals for the LSTAT model]
For this particular case, the residuals are centered around zero, but they are not homogeneously distributed across the values of LSTAT: the
largest and smallest values of LSTAT show larger residuals. In addition, the histogram shows that the residuals do not follow a strictly
Gaussian distribution.
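The visual impression can be complemented with a formal normality test; a minimal sketch using the Shapiro-Wilk test from scipy, where a small p-value indicates a departure from normality:

# Shapiro-Wilk test for normality of the residuals
# a p-value below, e.g., 0.05 suggests the residuals are not Gaussian
from scipy.stats import shapiro
stat, p_value = shapiro(error)
print(f'statistic: {stat:.3f}, p-value: {p_value:.3f}')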