Ds - Lab - 4.ipynb - Colab

Uploaded by

tarun.24msd7001

11/29/24, 10:17 PM ds_lab_4.ipynb - Colab

G KALYAN 24MSD7034

from google.colab import files


uploaded = files.upload()

Advertising_lab_4_Q.csv(text/csv) - 5166 bytes, last modified: 11/23/2024 - 100% done
Saving Advertising_lab_4_Q.csv to Advertising_lab_4_Q.csv

# Necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

data = pd.read_csv("Advertising_lab_4_Q.csv")

1) Create three well-labelled scatterplots of this data with TV, Radio and Newspaper on the x-axis and Sales on the y-axis, and describe the relationship you see. The scatterplot colour should be red, blue and green respectively. Add suitable labels and a title to each plot.

import matplotlib.pyplot as plt

# Scatterplot for TV, Radio, and Newspaper vs. Sales


plt.figure(figsize=(16, 5))

# TV vs Sales
plt.subplot(1, 3, 1)
plt.scatter(data['TV'], data['Sales'], color='red')
plt.title('TV Advertising Budget vs Sales')
plt.xlabel('TV Advertising Budget')
plt.ylabel('Sales')
plt.grid(True)

# Radio vs Sales
plt.subplot(1, 3, 2)
plt.scatter(data['Radio'], data['Sales'], color='blue')
plt.title('Radio Advertising Budget vs Sales')
plt.xlabel('Radio Advertising Budget')
plt.ylabel('Sales')
plt.grid(True)

# Newspaper vs Sales
plt.subplot(1, 3, 3)
plt.scatter(data['Newspaper'], data['Sales'], color='green')
plt.title('Newspaper Advertising Budget vs Sales')
plt.xlabel('Newspaper Advertising Budget')
plt.ylabel('Sales')
plt.grid(True)

plt.tight_layout()
plt.show()

2) In the scatterplot you made, what is the explanatory variable? What is the response variable? Why might you want to construct the problem
in this way?

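Explanatory variable: The explanatory variable is the variable used to explain or predict the response variable. It is also known as the independent variable. In these scatterplots, the explanatory variables are the TV, Radio, and Newspaper advertising budgets (x-axis).

Response variable: The response variable is the outcome that is expected to change when the explanatory variable changes. It is also known as the dependent variable. Here it is Sales (y-axis).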

Predictive Modeling :

Framing the problem with ad budgets as explanatory variables allows us to build predictive models that estimate sales based on different
spending levels.

Business Decisions:

Understanding the relationship between advertising spend and sales helps businesses optimize their budgets for maximum return on
investment (ROI).

By constructing the problem this way, we focus on identifying actionable insights for sales prediction and budget optimization.
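
As a minimal sketch of this framing, the snippet below fits a straight line with the budget as the explanatory variable and sales as the response, then predicts sales at a new spending level. The numbers are made up for illustration and are not taken from Advertising_lab_4_Q.csv; `budget`, `sales`, and `predict_sales` are hypothetical names.

```python
import numpy as np

# Hypothetical data (not from the lab CSV): sales = 2 + 0.1 * budget exactly
budget = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # explanatory variable
sales = 2.0 + 0.1 * budget                          # response variable

# Least-squares line; np.polyfit returns coefficients from highest degree down
slope, intercept = np.polyfit(budget, sales, deg=1)

def predict_sales(new_budget):
    """Predict the response from the explanatory variable using the fitted line."""
    return intercept + slope * new_budget

print(predict_sales(60.0))
```

Swapping the roles of the two variables would answer a different question (what budget is typical for a given sales level), which is why the budgets are placed on the x-axis here.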

3) Compute Pearson’s correlation coefficient between sales and each of the independent variables. What is your observation?

# Pearson's correlation coefficients


cor_tv, _ = pearsonr(data['TV'], data['Sales'])
cor_radio, _ = pearsonr(data['Radio'], data['Sales'])
cor_newspaper, _ = pearsonr(data['Newspaper'], data['Sales'])
print(f"Pearson's correlation between TV and Sales: {cor_tv}")
print(f"Pearson's correlation between Radio and Sales: {cor_radio}")
print(f"Pearson's correlation between Newspaper and Sales: {cor_newspaper}")

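Pearson's correlation between TV and Sales: 0.7822244248616065
Pearson's correlation between Radio and Sales: 0.576222574571055
Pearson's correlation between Newspaper and Sales: 0.2282990263761654

Observation: All three correlations with Sales are positive. TV has the strongest linear association (r ≈ 0.78), Radio a moderate one (r ≈ 0.58), and Newspaper a weak one (r ≈ 0.23).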
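
For reference, Pearson's r is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it directly from that definition and compares it with NumPy's built-in value. The data are hypothetical and `pearson_r` is an illustrative helper, not part of the lab code.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: sum of co-deviations over the root of the product of squared deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Hypothetical values, not taken from the lab CSV
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(pearson_r(x, y))
print(np.corrcoef(x, y)[0, 1])  # NumPy's built-in value, for comparison
```

The coefficient only measures linear association, so a low value does not rule out a non-linear relationship.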

4) Split the data into train (80%) and test (20%) (without shuffling). Fit a simple linear regression model on the train data for the three
independent variables separately and assess the accuracy of the model in terms of MSE(train and test). Which independent variable
contributes to accurate prediction of Sales?

# Split the data into train (80%) and test (20%) without shuffling
train, test = train_test_split(data, test_size=0.2, shuffle=False)
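
To illustrate what `shuffle=False` does, here is a small sketch on a hypothetical 10-row frame (`toy` is a stand-in, not the advertising data). The split keeps the original row order, so the first 80% of rows become the training set and the last 20% the test set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row frame standing in for the advertising data
toy = pd.DataFrame({'TV': np.arange(10.0), 'Sales': np.arange(10.0) * 2})

toy_train, toy_test = train_test_split(toy, test_size=0.2, shuffle=False)

# With shuffle=False the split preserves row order: first 8 rows train, last 2 test
print(toy_train.index.tolist())
print(toy_test.index.tolist())
```

Because no shuffling happens, any ordering present in the CSV carries over to the split, which is worth keeping in mind when comparing train and test errors.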

# Function to evaluate a simple linear regression model
def evaluate_model(feature):
    model = LinearRegression()
    X_train, y_train = train[[feature]], train['Sales']
    X_test, y_test = test[[feature]], test['Sales']

    # Fit the model
    model.fit(X_train, y_train)

    # Predictions and MSE
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    mse_train = mean_squared_error(y_train, y_train_pred)
    mse_test = mean_squared_error(y_test, y_test_pred)

    return mse_train, mse_test

# Evaluate models for TV, Radio, and Newspaper
results = {}
for feature in ['TV', 'Radio', 'Newspaper']:
    mse_train, mse_test = evaluate_model(feature)
    results[feature] = {'MSE Train': mse_train, 'MSE Test': mse_test}

print("Simple Linear Regression Results:")
for feature, mse_values in results.items():
    print(f"{feature}: Train MSE = {mse_values['MSE Train']}, Test MSE = {mse_values['MSE Test']}")

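Simple Linear Regression Results:
TV: Train MSE = 9.699713411632143, Test MSE = 14.128761342728321
Radio: Train MSE = 19.063327668527208, Test MSE = 14.44042373035678
Newspaper: Train MSE = 26.026670592327, Test MSE = 24.35470998177176

Observation: Among the single-variable models, TV gives the lowest train MSE (9.70) and the lowest test MSE (14.13), with Radio close behind on the test set (14.44). Newspaper is the weakest predictor on both splits.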
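
The MSE reported here is the average of the squared differences between actual and predicted values, so it is expressed in squared Sales units. A minimal sketch of the calculation on hypothetical numbers follows; `mse` is an illustrative helper that mirrors what `mean_squared_error` computes by default.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical actual and predicted values
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(mse(y_true, y_pred))  # (1 + 0 + 4) / 3
```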

5) Fit a multiple linear regression model on the train data for the different possible combinations of the three independent variables and assess the accuracy of the model in terms of MSE (train and test). Which combination contributes to accurate prediction of Sales?

from itertools import combinations

# Function to evaluate multiple linear regression for different combinations
def evaluate_multiple_models(features):
    model = LinearRegression()
    X_train, y_train = train[features], train['Sales']
    X_test, y_test = test[features], test['Sales']

    # Fit the model
    model.fit(X_train, y_train)

    # Predictions and MSE
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    mse_train = mean_squared_error(y_train, y_train_pred)
    mse_test = mean_squared_error(y_test, y_test_pred)

    return mse_train, mse_test

# Evaluate all combinations of features
all_features = ['TV', 'Radio', 'Newspaper']
combination_results = {}
for r in range(1, len(all_features) + 1):
    for combo in combinations(all_features, r):
        mse_train, mse_test = evaluate_multiple_models(list(combo))
        combination_results[combo] = {'MSE Train': mse_train, 'MSE Test': mse_test}

print("Multiple Linear Regression Results:")
for combo, mse_values in combination_results.items():
    print(f"{combo}: Train MSE = {mse_values['MSE Train']}, Test MSE = {mse_values['MSE Test']}")

Multiple Linear Regression Results:
('TV',): Train MSE = 9.699713411632143, Test MSE = 14.128761342728321
('Radio',): Train MSE = 19.063327668527208, Test MSE = 14.44042373035678
('Newspaper',): Train MSE = 26.026670592327, Test MSE = 24.35470998177176
('TV', 'Radio'): Train MSE = 2.8221633041460903, Test MSE = 2.7930889237731136
('TV', 'Newspaper'): Train MSE = 8.77967640891637, Test MSE = 13.103347790031586
('Radio', 'Newspaper'): Train MSE = 19.06326627813012, Test MSE = 14.43136345887287
('TV', 'Radio', 'Newspaper'): Train MSE = 2.82179249487708, Test MSE = 2.7911451862764003

Observation: The two combinations that include both TV and Radio give by far the lowest errors (test MSE ≈ 2.79). Adding Newspaper to TV and Radio changes the test MSE only in the third decimal place (2.7931 versus 2.7911).
6) What is the difference between R² and Adjusted R²? Comment!

R-squared
Measures the proportion of the variance in the dependent variable that is explained by the independent variables.

Adjusted R-squared
Adjusts the R-squared value for the number of predictors in the model and the sample size.

Difference between R-squared and Adjusted R-squared

R-squared:

Measures the proportion of variance explained by the model.
Never decreases on the training data when more variables are added to a least-squares model, even if they are irrelevant, so it can be misleading in models with many predictors.

Adjusted R-squared:

Penalizes the addition of unnecessary variables.
Gives a fairer basis for comparing models that use different numbers of predictors.
Can decrease if a variable does not improve the model enough to offset the penalty.

Summary:

R-squared tells you how well your model fits the data.
Adjusted R-squared tells you how well your model fits the data, taking into account the number of predictors.
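
The difference can be seen on synthetic data. The sketch below fits ordinary least squares with one useful predictor, then again with an extra column of pure noise, and reports both metrics. Everything here is hypothetical (random data with a fixed seed, and the helper `r2_and_adjusted`). R² cannot go down when the noise column is added, while Adjusted R² applies a penalty and may go down.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Fit OLS with an intercept and return (R², adjusted R²) on the same data."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])       # add intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    adjusted = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, adjusted

rng = np.random.default_rng(0)                      # synthetic data, fixed seed
n = 50
x_useful = rng.normal(size=n)
x_noise = rng.normal(size=n)                        # unrelated to y by construction
y = 3.0 + 2.0 * x_useful + rng.normal(size=n)

r2_one, adj_one = r2_and_adjusted(x_useful.reshape(-1, 1), y)
r2_two, adj_two = r2_and_adjusted(np.column_stack([x_useful, x_noise]), y)
print(f"one predictor:  R2={r2_one:.4f}, adjusted={adj_one:.4f}")
print(f"plus noise col: R2={r2_two:.4f}, adjusted={adj_two:.4f}")
```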

# Function to compute R² and Adjusted R²
def calculate_r2_adj_r2(features):
    model = LinearRegression()
    X_train, y_train = train[features], train['Sales']
    model.fit(X_train, y_train)

    # R²
    r2 = model.score(X_train, y_train)

    # Adjusted R²
    n = len(y_train)
    p = len(features)
    adjusted_r2 = 1 - ((1 - r2) * (n - 1)) / (n - p - 1)

    return r2, adjusted_r2

# Example: Evaluate R² and Adjusted R² for all features
r2, adj_r2 = calculate_r2_adj_r2(['TV', 'Radio', 'Newspaper'])
print(f"R²: {r2}, Adjusted R²: {adj_r2}")

R²: 0.8961523241120161, Adjusted R²: 0.8941552534218625
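
As a quick arithmetic check, the reported Adjusted R² can be reproduced from the reported R² with the same formula. Note that `n = 160` is an assumption: the notebook never prints the size of the training split, but this value is consistent with the two reported figures.

```python
r2 = 0.8961523241120161          # R² as printed by the notebook
reported_adj = 0.8941552534218625
p = 3                            # predictors: TV, Radio, Newspaper
n = 160                          # ASSUMPTION: training rows; not printed in the notebook

# Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)
adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(adj)
print(abs(adj - reported_adj) < 1e-9)
```

With n much larger than p, the factor (n - 1) / (n - p - 1) stays close to 1, which is why the two metrics differ only in the third decimal place.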

7) Give your final comments on which model, simple linear or multiple linear, is apt for accurate prediction of Sales, based on the MSE values for train and test and the R² and Adjusted R² values.

R² (0.896):

The R² value of 0.896 means that the model using TV, Radio, and Newspaper explains 89.6% of the variance in Sales on the training data. This is a high value.

Adjusted R² (0.894):

The Adjusted R² value is 0.894, very close to R². The penalty for using three predictors is therefore small, so the high R² is not simply an artifact of adding predictors.

Multiple Linear Regression is appropriate here because:

The MSE drops sharply when predictors are combined. The best simple model (TV) has a train MSE of 9.70 and a test MSE of 14.13, while the TV + Radio model has a train MSE of 2.82 and a test MSE of 2.79. Adding Newspaper on top of TV and Radio barely changes the result (test MSE 2.7931 versus 2.7911), so most of the gain comes from TV and Radio.

The test MSE (2.79) is comparable to the train MSE (2.82) for the full model, which suggests that it generalizes well and is a good choice for prediction.
