Aim:
To perform multiple linear regression on multiple datasets, compare the results, and determine which
dataset yields the better model.
Theory:
Multiple Linear Regression: Theory and Understanding
Multiple linear regression (MLR) is a statistical technique used to model the relationship between a
single dependent variable (what you want to predict) and multiple independent variables (features
that influence the dependent variable). It assumes a linear relationship between these variables and
builds a linear equation to capture this relationship.
Key Concepts:
Equation:
y_hat = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ
y_hat is the predicted value of the dependent variable.
β₀ is the intercept term (the predicted value when all independent variables are zero).
βᵢ is the coefficient of the independent variable xᵢ, for i = 1, ..., p.
p is the number of independent variables.
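As a minimal sketch of the equation above, the snippet below builds y_hat from two features with
made-up coefficients and then recovers those coefficients by ordinary least squares. The data and
the names beta_true, X_design, and beta_hat are illustrative assumptions, not taken from the
datasets used in this experiment.

```python
import numpy as np

# Hypothetical coefficients: intercept b0 = 2.0, b1 = 3.0, b2 = -1.5
beta_true = np.array([2.0, 3.0, -1.5])

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))  # 50 samples, p = 2 independent variables

# Prepend a column of ones so the intercept is estimated like any other coefficient
X_design = np.column_stack([np.ones(len(X)), X])

# y_hat = b0 + b1*x1 + b2*x2 (noise-free, so the coefficients are recoverable exactly)
y = X_design @ beta_true

# Ordinary least squares estimate of the coefficients
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(np.round(beta_hat, 3))
```

With noisy real data the estimated coefficients would only approximate the underlying ones.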
Limitations of MLR:
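Cannot capture non-linear relationships unless the features are transformed first (for example, with
polynomial terms).
Relies on assumptions such as linearity, independent errors, constant error variance, and low
multicollinearity among the features; violating them can lead to inaccurate results.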
Cannot establish causation; only identifies correlations.
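The first limitation above can be illustrated with a small sketch. It assumes a synthetic, purely
quadratic relationship (not one of the datasets in this experiment) and compares a fit on x alone
with a fit on x and x² produced by PolynomialFeatures.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic relationship (an assumption made for illustration)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = x.ravel() ** 2

# A straight line cannot follow the curve
linear_r2 = LinearRegression().fit(x, y).score(x, y)

# Adding x^2 as a feature keeps the model linear in its coefficients
x_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
poly_r2 = LinearRegression().fit(x_poly, y).score(x_poly, y)

print(f"R2 with x only:    {linear_r2:.3f}")
print(f"R2 with x and x^2: {poly_r2:.3f}")
```

The model with the squared term is still a linear regression; only the feature set has changed.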
Applications of MLR:
Predicting house prices based on features like size, location, and amenities.
Understanding how factors like age, income, and education affect job satisfaction.
Analysing the impact of advertising campaigns on sales.
Code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
# Load the dataset (replace 'boston.csv' with the path to your own file)
df = pd.read_csv('boston.csv')
# Handle outliers using IQR (adjust based on your data's characteristics)
numeric_cols = df.select_dtypes(include=[np.number]).columns
Q1 = df[numeric_cols].quantile(0.25)
Q3 = df[numeric_cols].quantile(0.75)
IQR = Q3 - Q1
df = df[~((df[numeric_cols] < (Q1 - 1.5 * IQR)) | (df[numeric_cols] > (Q3 + 1.5 * IQR))).any(axis=1)]
# Extract features and target variable (using the provided column names)
X = df.drop(['TOWN', 'TRACT', 'LON', 'LAT', 'MEDV'], axis=1)
y = df['MEDV']
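# Train-test split (done before scaling so the test set does not influence the scaler)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Feature scaling (fit on the training data only, then applied to both sets)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)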
# Feature selection (experiment with different thresholds and methods)
# SelectFromModel must be fitted before transform() can be called; fit() trains the forest
sfm = SelectFromModel(RandomForestRegressor(random_state=0), threshold=0.1) # Adjust threshold if needed
sfm.fit(X_train, y_train)
X_train = sfm.transform(X_train)
X_test = sfm.transform(X_test)
# Polynomial features (consider different degrees)
poly = PolynomialFeatures(degree=2, include_bias=False) # Adjust degree if needed
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
# Model fitting
regressor = LinearRegression()
regressor.fit(X_train_poly, y_train)
# Evaluation
y_pred = regressor.predict(X_test_poly)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print('Train Score: ', regressor.score(X_train_poly, y_train))
print('Test Score: ', regressor.score(X_test_poly, y_test))
print('Mean Squared Error (MSE): ', mse)
print('R-squared (R2): ', r2)
# Visualization (optional)
plt.scatter(y_test, y_pred)
plt.xlabel("Actual MEDV")
plt.ylabel("Predicted MEDV")
plt.title("Actual MEDV vs Predicted MEDV")
plt.show()
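The scaling, polynomial expansion, and fitting steps above can also be chained in a scikit-learn
Pipeline, which guarantees that every preprocessing step is fitted on the training data only. The
following is a sketch under stated assumptions: it uses synthetic data from make_regression as a
stand-in for the real CSV files, and it omits the outlier and feature selection steps for brevity.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the real data (assumption: 200 rows, 5 numeric features)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Each step is fitted on the training data only when fit() is called
model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)
model.fit(X_train, y_train)

print("Train Score:", model.score(X_train, y_train))
print("Test Score: ", model.score(X_test, y_test))
```

Calling model.predict(X_test) applies the fitted scaler and polynomial expansion automatically, so
the test data never has to be transformed by hand.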
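Performance Metrics:
Metric                      Multiple Regression Dataset    Boston Housing Dataset
Train Score                 0.983                          0.822
Test Score (R-squared)      0.887                          0.877
Mean Squared Error (MSE)    2,611,228                      5.379
Output:
Multiple Regression dataset: [figure not available]
Boston Housing Dataset Output: [figure not available]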
Comparison:
Comparing the performance of models trained on a multiple regression dataset and the Boston
Housing dataset:
Train Score:
The multiple regression model achieves a very high train score (0.983), indicating an excellent fit to
the training data.
The Boston Housing model also demonstrates a reasonably high train score (0.822), suggesting a
good fit to its training data.
Test Score:
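Both models exhibit high test scores, with the multiple regression model at 0.887 and the Boston
Housing model at 0.877. The multiple regression model drops from 0.983 on training data to 0.887 on
test data, which suggests some overfitting, while the Boston Housing model scores slightly higher on
test data than on training data.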
Mean Squared Error (MSE):
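The multiple regression model has an MSE of 2,611,228, while the Boston Housing model has an MSE of
5.379.
MSE is expressed in the squared units of the target variable, so these two values are not directly
comparable. With similar R-squared values, the much larger figure is largely a consequence of a
target measured on a much larger scale, not evidence of worse predictions.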
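Because MSE is measured in the squared units of the target, its size depends on the scale of the
data. The sketch below uses made-up targets and predictions (not results from either dataset) and
expresses the same values in two different units to show how MSE and R-squared respond.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions in "small" units (e.g. thousands of dollars)
y_true = np.array([20.0, 25.0, 30.0, 35.0, 40.0])
y_pred = np.array([22.0, 24.0, 29.0, 37.0, 39.0])

# The same values expressed in units 1000 times smaller (e.g. dollars)
scale = 1000.0
mse_small = mean_squared_error(y_true, y_pred)
mse_large = mean_squared_error(y_true * scale, y_pred * scale)

r2_small = r2_score(y_true, y_pred)
r2_large = r2_score(y_true * scale, y_pred * scale)

print("MSE:", mse_small, "vs", mse_large)  # grows by scale squared
print("R2: ", r2_small, "vs", r2_large)    # unchanged by rescaling
```

The predictions are equally good in both cases, yet the MSE differs by a factor of one million, so
a scale-free metric such as R-squared is the safer basis for comparing models across datasets.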
R-squared (R2):
The multiple regression model and the Boston Housing model both achieve high R-squared values on the
test set (0.887 and 0.877 respectively), meaning each explains close to 90% of the variance in its
dependent variable. These values match the test scores above because LinearRegression.score returns
R-squared.
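To make the definition concrete, the sketch below computes R-squared by hand as 1 - SS_res / SS_tot
and checks it against r2_score and LinearRegression.score. The data is synthetic and the
coefficients are arbitrary assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data with three features and some noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# R-squared = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y, y_pred), model.score(X, y))
```

All three numbers agree, which is why the "Test Score" and "R-squared (R2)" lines printed by the
experiment code show the same value.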
Conclusion:
Both models perform well, with test R-squared values of 0.887 for the multiple regression dataset
and 0.877 for the Boston Housing dataset. R-squared is scale-free, so it is the appropriate metric
for comparing models across different datasets; on that basis the two are nearly equal, with the
multiple regression model marginally ahead.
The Boston Housing model's much lower MSE (5.379 against 2,611,228) does not by itself show superior
prediction accuracy, because MSE depends on the units of each target variable.
The Boston Housing model generalizes more consistently (train 0.822, test 0.877), whereas the
multiple regression model's drop from 0.983 to 0.887 points to some overfitting.
Therefore, the multiple regression model gives a slightly better fit on test data, while the Boston
Housing model gives the more stable one.