
Ese Lab File

The document outlines a series of experiments related to Empirical Software Engineering, focusing on house price prediction using machine learning. It identifies research gaps in existing models, discusses exploratory data analysis, and details various statistical tests such as t-tests and chi-square tests for analyzing house price data. The document also includes code examples for data preprocessing, feature evaluation, and model development.

Uploaded by

rohansahu02

EMPIRICAL SOFTWARE ENGINEERING (SE 302a)

LAB FILE

Subject Code : SE 302a


Subject Name : Empirical Software Engineering
Branch: Software Engineering
Year: 3rd year/6th Semester

Submitted by:
LOVEESH SINGH
2K22/SE/107

Submitted to:
Dr. Shweta Meena
Assistant Professor
Department of Software Engineering
Delhi Technological University

Delhi Technological University


Shahbad Daulatpur, Main Bawana Road, Delhi-110042
EXPERIMENT -1

AIM: Collection of Empirical Studies


EXPERIMENT -2

AIM: Identify research gaps from empirical studies. Collection of datasets from open-source repositories

Research Gaps in "House Price Prediction Using Machine Learning"

1. The Model is Slow and Needs a Lot of Time to Train


The authors point out that their system takes more than a day to train on the dataset. This is a big problem if you want to use the
model in real life, where fast results are important. They suggest using multiple computers at once (parallel processing) to make it
faster, but they don’t actually do this or test how well it would work [1].

2. Not Enough Work on Choosing the Best Features


The paper uses many house features (like bedrooms, area, location), but doesn’t really dig into which features are most important
for predicting price. There’s no deep analysis or method for picking the most useful features. This means the model might be using
extra information that doesn’t help much, making it slower and harder to understand [1].

3. Location and Time Factors Aren’t Explored Much


While the authors mention that it would be nice to let users pick a region or district for more detailed results, their current model
doesn’t really include detailed location-based or time-based analysis. In reality, house prices change a lot depending on where and
when you buy or sell, but the model doesn’t fully handle this [1].

4. The Model’s Decisions Are Hard to Explain


The paper doesn’t make it clear how the model comes up with its predictions in a way that regular people (like home buyers or real
estate agents) can easily understand. This makes it harder for people to trust or use the predictions in real situations [1].

5. Testing and Validation Could Be Improved


The authors show the accuracy of their models, but don’t use advanced testing methods to check if the results are reliable in
different situations or places. They also don’t compare their predictions to official house price indexes or test how the model works
as market conditions change over time [1].

6. Limited Algorithmic Comparison

• The study only compares a small number of models (Linear Regression, Lasso Regression, Gradient Boosting).

• More advanced or recent models like Random Forest, CatBoost, LightGBM, or deep learning-based architectures could have
been included for a broader performance benchmark.

7. Insufficient Evaluation Metrics

• The paper provides basic accuracy without detailing error metrics like RMSE, MAE, or R² score, which are critical for
regression problems.

• There's no discussion of statistical significance, confidence intervals, or cross-validation methods used.

8. No Real-World Deployment Considerations

There's no mention of how the model could be deployed, updated with new data, or used in a production environment for
real-time price predictions.

9. No Regional/Geographic Customization

The model assumes a general approach, but house pricing is highly location-dependent. There’s no effort to incorporate or
compare regional models.

Summary:
The main gaps are that the system is slow to train, doesn’t focus on the most important features, doesn’t fully consider location and
time, is hard to interpret for users, and could use better ways to test if it really works well in practice. Addressing these issues would
make the model much more useful and reliable for real-world house price prediction.
EXPERIMENT-3

AIM: Write a program to perform exploratory analysis of the dataset

THEORY:
Exploratory Data Analysis (EDA) – Theory for House Price Prediction
What is Exploratory Data Analysis (EDA)?
Exploratory Data Analysis (EDA) is a crucial initial step in any data science or machine learning project. It involves examining datasets
to summarize their main characteristics—often using visual methods and descriptive statistics. The goal is to understand the data,
detect patterns, spot anomalies, and check assumptions before applying predictive models.
In the context of House Price Prediction, EDA helps uncover the relationships between features (e.g., number of bedrooms, square
footage, location) and the target variable (house price). It guides decisions on feature selection, data cleaning, and model building.

Goals of EDA in House Price Prediction


1. Understand the structure of the data (types, size, shape).
2. Identify missing or inconsistent values.
3. Detect outliers and extreme values that may affect the model.
4. Visualize distributions of features (e.g., histogram of prices).
5. Explore relationships between variables using correlations and plots.
6. Prepare data for model building (feature engineering, transformations).

Key Steps in EDA for House Price Dataset


1. Data Loading and Overview
• Load the dataset using pandas.
• Display the first few rows (head()) to get a sense of the structure.
• Check data types, column names, and basic info (info()).
2. Descriptive Statistics
• Use describe() to summarize numerical columns: mean, std, min, max, etc.
• Understand the spread and central tendency of house prices and predictors.
3. Missing Value Analysis
• Use isnull().sum() to check for missing entries in features.
• Decide whether to fill, drop, or impute missing values.
4. Univariate Analysis
• Analyze individual variables:
o Histogram of house prices
o Countplot of categorical features (e.g., number of bedrooms, waterfront)
o Boxplots to check for outliers in continuous variables
5. Bivariate/Multivariate Analysis
• Use scatterplots (e.g., price vs. area), barplots (e.g., price by condition), etc.
• Use correlation matrix and heatmaps to identify relationships.
• Check multicollinearity using correlation or Variance Inflation Factor (VIF).
6. Outlier Detection
• Use boxplots and statistical thresholds (e.g., IQR method) to detect anomalies.
• Decide whether to remove or treat outliers based on domain knowledge.
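
The IQR method named in step 6 can be sketched as follows. This is a minimal illustration rather than part of the lab program: the helper name iqr_outlier_mask, the conventional multiplier k = 1.5, and the sample prices are all assumptions made for the example.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return (values < lower_fence) | (values > upper_fence)


# Hypothetical prices (in lakhs); the last value is deliberately extreme
prices = pd.Series([40.0, 45.0, 50.0, 55.0, 60.0, 65.0, 70.0, 600.0], name="price")

mask = iqr_outlier_mask(prices)
outliers = prices[mask]
print("Outliers flagged by the IQR rule:", outliers.tolist())
```

On a real dataset the same mask could be computed per numeric column. Whether flagged rows are dropped, capped, or kept remains a domain decision, as noted in step 6.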

CODE:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load your dataset
file_path = "Bengaluru_House_Data.csv"  # Change this to your actual CSV file
df = pd.read_csv(file_path)

# Basic info
print("\n📋 Dataset Info:")
print(df.info())
# First few rows
print("\n👀 First 5 Rows:")
print(df.head())

# Missing values
print("\n❌ Missing Values:")
print(df.isnull().sum())

# Summary statistics
print("\n📊 Summary Statistics:")
print(df.describe())

# Correlation matrix
print("\n🔗 Correlation Matrix:")
print(df.corr(numeric_only=True))

# Visualizations

# Histogram of all numerical features


df.hist(figsize=(12, 10), bins=20)
plt.tight_layout()
plt.suptitle("🔍 Histograms of Numerical Features", fontsize=16, y=1.02)
plt.show()

# Heatmap of correlation
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("🔥 Feature Correlation Heatmap")
plt.show()

# Pairplot (can be slow for large datasets)


# sns.pairplot(df)
# plt.show()

# Boxplots for outlier detection
numeric_columns = df.select_dtypes(include='number').columns
for col in numeric_columns:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=df[col])
    plt.title(f"📦 Boxplot of {col}")
    plt.show()

OUTPUT:
EXPERIMENT- 4

AIM: Write a program to perform the following feature reduction techniques:

a) Correlation based feature evaluation
b) Relief attribute feature evaluation
c) Information gain feature evaluation
d) Principal component analysis

THEORY:

1. Correlation Analysis
Measures linear relationships between features and price using Pearson coefficients.
Values range from -1 to 1, indicating direction and strength of association.
Limited to detecting linear patterns only.

2. F-Test Evaluation
Assesses feature significance through ANOVA-based statistical testing.
Higher F-scores indicate that a feature explains more of the variance in price.
Effective for identifying features with a strong linear relationship to the target.

3. Mutual Information
Quantifies non-linear dependencies using entropy reduction.
Measures information gained about price from each feature.
Handles both numerical and categorical data effectively.

4. Principal Component Analysis


Reduces dimensionality through orthogonal transformation.
First components capture maximum variance in the data.
Helps mitigate multicollinearity in regression models.
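
To make the PCA description concrete, the sketch below derives the explained variance ratios directly from the covariance matrix with NumPy. It is an illustration under stated assumptions: the three synthetic features (area, rooms, age) and the random seed are invented for the example and are not columns of the lab dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features: rooms is strongly tied to area, age is unrelated
area = rng.normal(1500, 400, size=200)
rooms = area / 500 + rng.normal(0, 0.2, size=200)
age = rng.normal(10, 5, size=200)
X = np.column_stack([area, rooms, age])

# Standardize so every feature contributes on the same scale
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Principal components are the eigenvectors of the covariance matrix
cov = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order, so sort them descending
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

explained_ratio = eigvals / eigvals.sum()
for i, ratio in enumerate(explained_ratio, start=1):
    print(f"PC{i}: {ratio:.4f}")
```

Because area and rooms are highly correlated here, most of their shared variance collapses into the first component, which is the effect that helps with multicollinearity.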

CODE:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_regression
from skrebate import ReliefF
import warnings
warnings.filterwarnings('ignore')

# Load dataset
df = pd.read_csv('Bengaluru_House_Data-2.csv')

# Data preprocessing
def preprocess_data(df):
    df = df.copy()  # work on a copy so the raw frame stays untouched

    # Convert size (e.g. "2 BHK") to a numeric bedroom count
    df['size'] = df['size'].str.split().str[0].astype(float)

    # Handle total_sqft ranges (e.g. "2100 - 2850") by taking the midpoint
    def convert_sqft(sqft):
        try:
            if isinstance(sqft, str) and '-' in sqft:
                low, high = map(float, sqft.split('-'))
                return (low + high) / 2
            return float(sqft)
        except (TypeError, ValueError):
            # Entries that cannot be parsed as numbers become missing values
            return np.nan

    df['total_sqft'] = df['total_sqft'].apply(convert_sqft)

    # Handle availability: 1 if ready to move, else 0
    df['is_ready'] = df['availability'].apply(lambda x: 1 if 'Ready' in str(x) else 0)

    # Drop unnecessary columns
    df = df.drop(['area_type', 'availability', 'society', 'location'], axis=1)

    # Handle missing values: every remaining feature must be NaN-free,
    # otherwise ReliefF, mutual information and PCA will fail
    df = df.dropna(subset=['price'])
    for col in ['size', 'bath', 'balcony', 'total_sqft']:
        df[col] = df[col].fillna(df[col].median())

    return df

# Preprocess data
df_clean = preprocess_data(df)
X = df_clean.drop('price', axis=1)
y = df_clean['price']

# a) Correlation-based feature evaluation


correlation_matrix = df_clean.corr()
price_correlation = correlation_matrix['price'].sort_values(ascending=False)
print("Correlation with Price:\n", price_correlation, "\n")

# b) Relief attribute evaluation


relief = ReliefF(n_features_to_select=5, n_neighbors=100)
relief.fit(X.values, y.values)
relief_scores = pd.Series(relief.feature_importances_, index=X.columns)
print("Relief Feature Scores:\n", relief_scores.sort_values(ascending=False), "\n")

# c) Information gain evaluation


mi_scores = mutual_info_regression(X, y)
mi_scores = pd.Series(mi_scores, index=X.columns)
print("Mutual Information Scores:\n", mi_scores.sort_values(ascending=False), "\n")

# d) Principal Component Analysis


scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA()
pca.fit(X_scaled)
explained_variance = pca.explained_variance_ratio_

print("PCA Explained Variance Ratio:")
for i, ratio in enumerate(explained_variance):
    print(f"PC{i+1}: {ratio:.4f}")

OUTPUT:
EXPERIMENT -5
AIM: Develop a machine learning model for the selected topic
EXPERIMENT -6
AIM: Consider the model developed in exp5 and identify the following:
a) State the hypothesis
b) Formulate an analysis plan
c) Analyze sample data
d) Interpret results
e) Estimate type-I and type-II error
EXPERIMENT-7

AIM: Write a program to implement t-test

THEORY:
T-Test – Theory (Brief)
A T-test is a statistical method used to determine whether there is a significant difference in average values between two
groups. In this case, it is used to compare house prices between two locations in Bengaluru.
Objective
To check if the average house prices in two selected locations (e.g., Whitefield vs Rajaji Nagar) are statistically different.
Hypotheses
• Null Hypothesis (H₀): No difference in average house prices between the two locations.
• Alternative Hypothesis (H₁): There is a significant difference.
Method
• Data cleaned to remove missing values.
• Prices grouped by location.
• Independent two-sample T-test is applied using scipy.stats.ttest_ind() with equal_var=False (Welch's t-test), which does not assume equal variances in the two groups.
Interpretation
• p-value < 0.05 → Significant difference (reject H₀).
• p-value ≥ 0.05 → No significant difference (fail to reject H₀).
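
As a check on what ttest_ind computes when equal_var=False, the sketch below works out Welch's t-statistic and its degrees of freedom by hand and compares them with SciPy. The two price samples and the helper welch_t are hypothetical and only serve the illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical prices (in lakhs) for two locations; not taken from the dataset
group_a = np.array([55.0, 62.0, 70.0, 48.0, 66.0, 59.0])
group_b = np.array([80.0, 95.0, 72.0, 110.0, 88.0])


def welch_t(a, b):
    """Welch's t-statistic and Welch-Satterthwaite degrees of freedom."""
    se2_a = a.var(ddof=1) / len(a)  # squared standard error of mean(a)
    se2_b = b.var(ddof=1) / len(b)  # squared standard error of mean(b)
    t = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, dof


t_manual, dof = welch_t(group_a, group_b)
p_manual = 2 * stats.t.sf(abs(t_manual), dof)  # two-sided p-value

t_scipy, p_scipy = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"manual: t = {t_manual:.4f}, dof = {dof:.2f}, p = {p_manual:.4f}")
print(f"scipy : t = {t_scipy:.4f}, p = {p_scipy:.4f}")
```

The sign of the statistic only reflects which group is listed first; the two-sided p-value is unaffected by the order.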

CODE:
import pandas as pd
from scipy import stats

# Load dataset
file_path = "Bengaluru_House_Data.csv"
df = pd.read_csv(file_path)

# Drop rows with missing price or location


df = df[['location', 'price']].dropna()

# Get two locations to compare


loc1 = 'Whitefield'
loc2 = 'Rajaji Nagar'

# Filter data for the two locations


group1 = df[df['location'] == loc1]['price']
group2 = df[df['location'] == loc2]['price']

# Check if both groups have enough data
if len(group1) < 2 or len(group2) < 2:
    print("❌ Not enough data in one or both groups for t-test.")
else:
    # Perform independent t-test (Welch's version: equal variances not assumed)
    t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=False)

    print(f"📊 T-Test Between '{loc1}' and '{loc2}' on House Prices")
    print(f"T-statistic: {t_stat:.4f}")
    print(f"P-value: {p_val:.4f}")

    if p_val < 0.05:
        print("✅ Significant difference in average house prices.")
    else:
        print("❌ No significant difference in average house prices.")

OUTPUT:
EXPERIMENT-8

AIM: Write a program to implement chi-square test

THEORY:
Chi-Square Test – Brief Theory
The Chi-Square Test of Independence is a statistical test used to determine whether there is a significant association between two
categorical variables.
Purpose in House Price Data
In the context of the Bengaluru House Price dataset, the Chi-Square test can help answer questions like:
"Is the number of bedrooms (size) dependent on the house location?"
How It Works
• A contingency table is created showing the frequency of occurrences between the two variables (e.g., different size
categories across location values).
• The Chi-Square test compares the observed frequencies with the expected frequencies (assuming no relationship).
• It calculates a Chi-square statistic and a p-value.
Hypotheses
• H₀ (Null Hypothesis): The two variables are independent.
• H₁ (Alternative Hypothesis): The two variables are related.
Interpretation
• If p-value < 0.05, reject H₀ → there is a significant relationship.
• If p-value ≥ 0.05, fail to reject H₀ → there is not enough evidence of a relationship between the variables.
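
The comparison of observed and expected frequencies can be sketched by hand as below. The 2 x 3 table of counts is hypothetical (two locations by three bedroom categories); the expected count of each cell is taken as row total times column total divided by the grand total, and the result is checked against chi2_contingency.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = two locations, columns = 1, 2 and 3 bedrooms
observed = np.array([
    [20, 50, 30],
    [30, 40, 30],
])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected counts if location and bedroom count were independent
expected = row_totals * col_totals / grand_total

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi2_manual = ((observed - expected) ** 2 / expected).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

chi2_scipy, p_value, dof_scipy, expected_scipy = chi2_contingency(
    observed, correction=False
)

print("Expected counts:\n", expected)
print(f"manual chi2 = {chi2_manual:.4f}, dof = {dof_manual}")
print(f"scipy  chi2 = {chi2_scipy:.4f}, dof = {dof_scipy}, p = {p_value:.4f}")
```

The chi-square approximation is usually considered reliable only when expected counts are not too small, so a table with many sparsely populated locations may need its rare categories merged first.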

CODE:
import pandas as pd
from scipy.stats import chi2_contingency

# Load the dataset
file_path = "Bengaluru_House_Data.csv"
df = pd.read_csv(file_path)

# Drop missing values in relevant columns


df = df[['location', 'size']].dropna()

# Optional: Clean 'size' column to keep only number of bedrooms
# (raw string avoids an invalid-escape warning for \d)
df['size'] = df['size'].str.extract(r'(\d+)', expand=False).astype(float)

# Drop rows with missing or non-numeric sizes after extraction


df = df.dropna()

# Convert bedroom count to categorical for Chi-Square test


df['size'] = df['size'].astype(int).astype(str)

# Create a contingency table: location vs size


contingency_table = pd.crosstab(df['location'], df['size'])

# Perform Chi-Square test


chi2, p, dof, expected = chi2_contingency(contingency_table)

# Output results
print("📊 Chi-Square Test of Independence")

print("----------------------------------")
print("Chi-Square Statistic:", round(chi2, 4))
print("Degrees of Freedom:", dof)
print("P-value:", round(p, 4))

# Interpretation
alpha = 0.05
if p < alpha:
    print("✅ Significant association between 'location' and 'number of bedrooms'")
else:
    print("❌ No significant association between 'location' and 'number of bedrooms'")

OUTPUT:
EXPERIMENT-9

AIM: Write a program to implement Friedman test

THEORY:
Friedman Test (Non-Parametric Test for Repeated Measures)
The Friedman Test is a non-parametric statistical test used to detect differences in treatments across multiple test attempts. It is often
seen as the non-parametric alternative to the repeated-measures ANOVA.
When to Use:
• You have one group that is tested on three or more conditions (or time points).
• The dependent variable is ordinal or not normally distributed.
• The data is in a repeated-measures or matched-subjects format.
Basic Idea:
The test ranks the scores within each block (e.g., each subject), and then analyzes the ranks to determine if there's a statistically significant
difference between the treatments.
Hypotheses:
• Null Hypothesis (H₀): All treatments (conditions) have the same effect (i.e., median ranks are equal).
• Alternative Hypothesis (H₁): At least one treatment has a different effect.
Test Statistic (χ²_F):
$$\chi^2_F = \frac{12}{n k (k + 1)} \sum_{j=1}^{k} R_j^2 - 3n(k + 1)$$
Where:
• $n$ = number of blocks (e.g., subjects)
• $k$ = number of treatments
• $R_j$ = sum of ranks for treatment $j$
Decision Rule:
Compare the test statistic to a critical value from the chi-square distribution with $k - 1$ degrees of freedom, or use a p-value.
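
The formula can be checked by hand on the same illustrative predictions used in this experiment's program (5 houses as blocks, 3 models as treatments). The sketch below is not part of the original program: it assumes no tied values within a block, so the tie correction that SciPy applies reduces to 1, and the variable names are chosen for the example.

```python
import numpy as np
from scipy.stats import chi2, friedmanchisquare, rankdata

# Rows = houses (blocks); columns = Linear Regression, Lasso, XGBoost (treatments)
predictions = np.array([
    [400000, 390000, 405000],
    [410000, 415000, 412000],
    [395000, 400000, 398000],
    [420000, 425000, 422000],
    [405000, 410000, 407000],
])
n, k = predictions.shape

# Rank the treatments within each block, then sum the ranks per treatment
ranks = np.array([rankdata(row) for row in predictions])
rank_sums = ranks.sum(axis=0)  # R_j for j = 1..k

# Friedman statistic from the formula above (valid when there are no ties)
chi2_f = 12.0 / (n * k * (k + 1)) * np.sum(rank_sums ** 2) - 3 * n * (k + 1)
p_manual = chi2.sf(chi2_f, df=k - 1)

stat_scipy, p_scipy = friedmanchisquare(*predictions.T)

print("Rank sums:", rank_sums)
print(f"manual: statistic = {chi2_f:.4f}, p = {p_manual:.4f}")
print(f"scipy : statistic = {stat_scipy:.4f}, p = {p_scipy:.4f}")
```

For these numbers the rank sums are 6, 13 and 11, giving a statistic of 5.2 with 2 degrees of freedom and a p-value of about 0.074, so H₀ is not rejected at the 0.05 level.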

CODE:
from scipy.stats import friedmanchisquare

# Example: 3 algorithms' predictions on 5 houses


linear_regression = [400000, 410000, 395000, 420000, 405000]
lasso_regression = [390000, 415000, 400000, 425000, 410000]
xgboost_model = [405000, 412000, 398000, 422000, 407000]

# Apply Friedman Test


statistic, p_value = friedmanchisquare(linear_regression, lasso_regression, xgboost_model)

# Output results
print("📊 Friedman Test Results")
print("------------------------")
print(f"Test Statistic: {statistic:.4f}")
print(f"P-value: {p_value:.4f}")

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("✅ Significant difference between the model predictions.")
else:
    print("❌ No significant difference between the model predictions.")

OUTPUT:
EXPERIMENT-10

AIM: Write a program to implement Wilcoxon signed rank test

THEORY:
Wilcoxon Signed Rank Test – Theory (For House Price Prediction)
The Wilcoxon Signed Rank Test is a non-parametric statistical test used to compare two related samples. It helps determine
whether their population mean ranks differ — without assuming a normal distribution.
In your case:
• You have two related sets of data:
o Actual house prices
o Predicted house prices (from your model)
• The test checks if the differences between actual and predicted prices are symmetric around zero, i.e., whether your
model systematically under- or over-predicts prices.

How It Works (In Simple Steps)


1. Calculate differences between predicted and actual prices.
2. Ignore zero differences (no change).
3. Take the absolute value of differences and rank them (smallest to largest).
4. Add signs back to ranks based on whether the prediction was higher or lower.
5. Compute:
o W+ = sum of positive signed ranks
o W− = sum of negative signed ranks
6. The Wilcoxon statistic (W) is the smaller of W+ and W−.
7. A p-value is calculated to test the null hypothesis.

Hypotheses
• Null Hypothesis (H₀):
The differences between actual and predicted prices are symmetric around zero (no systematic over- or under-prediction).
• Alternative Hypothesis (H₁):
The differences are not centred on zero (the model systematically over- or under-predicts).

Interpretation
• If p-value < 0.05 (or your chosen significance level):
o You reject H₀ → your model's predictions differ significantly from actual values.
• If p-value ≥ 0.05:
o You fail to reject H₀ → there is no evidence of systematic over- or under-prediction. This alone does not show that the
model is accurate, because large errors in both directions can cancel out.

CODE:
import pandas as pd
import numpy as np
from scipy.stats import wilcoxon

# ======= Sample data for illustration =======


# In practice, replace this with: pd.read_csv("your_file.csv")
data = {
'actual_price': [210000, 340000, 280000, 500000, 310000],
'predicted_price': [205000, 345000, 275000, 490000, 300000]
}
df = pd.DataFrame(data)

# Rename columns for test


actual = df['actual_price']
predicted = df['predicted_price']

# === Wilcoxon Signed Rank Test ===


# Using scipy
print("\n=== Wilcoxon Signed Rank Test (SciPy) ===")
stat, p_value = wilcoxon(actual, predicted)
print(f"Wilcoxon statistic = {stat:.4f}")
print(f"P-value = {p_value:.4f}")

# === Manual Implementation ===


def manual_wilcoxon(x, y):
    diff = y - x
    non_zero_diff = diff[diff != 0]
    abs_diff = np.abs(non_zero_diff)
    ranks = abs_diff.rank(method='average')
    signed_ranks = ranks * np.sign(non_zero_diff)

    W_plus = signed_ranks[signed_ranks > 0].sum()
    W_minus = -signed_ranks[signed_ranks < 0].sum()

    W = min(W_plus, W_minus)
    return {
        "W+": W_plus,
        "W-": W_minus,
        "Wilcoxon Statistic (W)": W
    }

print("\n=== Manual Implementation ===")
manual_result = manual_wilcoxon(actual, predicted)
for k, v in manual_result.items():
    print(f"{k}: {v:.4f}")

# === Interpretation ===
alpha = 0.05
print("\n=== Interpretation ===")
if p_value < alpha:
    print(f"Since p-value ({p_value:.4f}) < {alpha}, we reject the null hypothesis.")
    print("There is a significant difference between actual and predicted prices.")
else:
    print(f"Since p-value ({p_value:.4f}) >= {alpha}, we fail to reject the null hypothesis.")
    print("There is no significant difference between actual and predicted prices.")

OUTPUT:
EXPERIMENT-11

AIM: Write a program to implement ANOVA test

THEORY:

ANOVA (Analysis of Variance) Test

The ANOVA test is a statistical method used to determine whether there are any significant differences between the means of three
or more independent groups.

Hypotheses:

• Null Hypothesis (H₀):
All BHK groups have the same average house price.
(No difference between means)

• Alternative Hypothesis (H₁):
At least one BHK group has a different average price.
(There's a difference between group means)

How It Works:

1. ANOVA splits the total variation in price into:
o Between-group variance (due to different BHKs)
o Within-group variance (natural variation within each BHK group)

2. It calculates the F-statistic, which is the ratio:

$$F = \frac{\text{Variance between groups}}{\text{Variance within groups}}$$

3. A higher F-value suggests greater group differences.

4. The p-value tells you whether the observed F-statistic is statistically significant.

Interpretation:

• If p-value < 0.05 (or your chosen alpha level):
o Reject H₀ → There is a significant difference in average prices across BHKs.

• If p-value ≥ 0.05:
o Fail to reject H₀ → No significant price difference across BHKs.
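
The variance split described under "How It Works" can be reproduced by hand. The sketch below uses three small hypothetical BHK groups (prices in lakhs, invented for the example), computes the between-group and within-group mean squares, and compares the resulting F-statistic with scipy.stats.f_oneway.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical prices (in lakhs) per BHK group; not taken from the dataset
groups = {
    "1 BHK": np.array([30.0, 35.0, 28.0, 40.0]),
    "2 BHK": np.array([55.0, 60.0, 48.0, 65.0, 58.0]),
    "3 BHK": np.array([90.0, 110.0, 85.0, 120.0]),
}
samples = list(groups.values())

all_values = np.concatenate(samples)
grand_mean = all_values.mean()
k = len(samples)           # number of groups
n_total = len(all_values)  # total number of observations

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in samples)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in samples)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within

f_scipy, p_value = f_oneway(*samples)

print(f"manual F = {f_manual:.4f}")
print(f"scipy  F = {f_scipy:.4f}, p = {p_value:.6f}")
```

Dividing each sum of squares by its degrees of freedom (k - 1 and N - k) is what turns the two sums into the "variance between groups" and "variance within groups" of the ratio.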

CODE:

import pandas as pd

from scipy.stats import f_oneway

# Load the dataset


df = pd.read_csv("Bengaluru_House_Data.csv")

# Drop rows with missing 'size' or 'price'

df = df[['size', 'price']].dropna()

# Extract numeric BHK value from the 'size' column

df['bhk'] = df['size'].str.extract(r'(\d+)').astype(float)

# Drop rows where bhk is missing after extraction

df = df.dropna(subset=['bhk'])

# Group prices by BHK

grouped_prices = df.groupby('bhk')['price'].apply(list)

# Keep only BHK groups with at least 2 data points

valid_groups = [prices for prices in grouped_prices if len(prices) >= 2]

# Perform ANOVA

f_stat, p_value = f_oneway(*valid_groups)

# Output results

print("\n=== ANOVA Test: Does Price Differ by BHK? ===")

print(f"F-statistic: {f_stat:.4f}")

print(f"P-value: {p_value:.4f}")

# Interpretation
alpha = 0.05

print("\n=== Interpretation ===")

if p_value < alpha:
    print("P-value is less than 0.05 → Reject H₀")
    print("=> There is a significant difference in prices between BHK levels.")
else:
    print("P-value is greater than or equal to 0.05 → Fail to reject H₀")
    print("=> No significant difference in prices between BHK levels.")



OUTPUT:
EXPERIMENT-12

AIM: Write a program to implement Nemenyi test

THEORY:
The Nemenyi test is a non-parametric post-hoc test used to compare the performance of multiple models or
methods. It is typically applied after a Friedman test has indicated that at least one model differs, when you
have more than two models and want to determine which pairs differ significantly in a performance metric such
as prediction accuracy or error rate. The test ranks the models on each dataset (or fold), averages the ranks,
and compares the average ranks pairwise. If the difference between the average ranks of two models is large
enough, it indicates a significant difference in their performance. Because it works on ranks, the Nemenyi test
does not require normally distributed data, making it a robust choice for model comparison in real-world
scenarios like house price prediction.

When to Use It:

• After running multiple models for your house price prediction, you can apply the Nemenyi test to
compare their performance metrics and determine if the differences you observe are statistically
significant.
• It is most useful when you're dealing with multiple models or multiple configurations (e.g.,
hyperparameters) and want to check whether one model consistently outperforms others.
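
The pairwise rank comparison can be sketched with the critical difference (CD) form of the test, CD = q_α · sqrt(k(k+1) / (6n)), where k is the number of models and n the number of datasets or folds. The block below reuses the illustrative MAE scores from this experiment's program. The value q_α = 2.343 for k = 3 at α = 0.05 is taken from the commonly tabulated critical values (e.g. Demšar, 2006) and is an input assumed for this sketch rather than something the code derives.

```python
from itertools import combinations

import numpy as np
from scipy.stats import rankdata

model_names = ["Model A", "Model B", "Model C"]

# Rows = datasets/folds, columns = models (MAE, lower is better)
scores = np.array([
    [3.1, 2.9, 3.5],
    [2.8, 2.5, 3.2],
    [3.0, 2.7, 3.3],
    [3.2, 3.0, 3.6],
    [2.9, 2.6, 3.1],
])
n, k = scores.shape

# Rank models within each row (rank 1 = lowest MAE), then average per model
ranks = np.array([rankdata(row) for row in scores])
avg_ranks = ranks.mean(axis=0)

# Tabulated critical value for k = 3 models at alpha = 0.05 (assumed input)
q_alpha = 2.343
cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * n))

# Two models differ significantly if their average ranks differ by more than CD
significant_pairs = [
    (model_names[i], model_names[j])
    for i, j in combinations(range(k), 2)
    if abs(avg_ranks[i] - avg_ranks[j]) > cd
]

print("Average ranks:", dict(zip(model_names, avg_ranks)))
print(f"Critical difference: {cd:.4f}")
print("Significantly different pairs:", significant_pairs)
```

With only five folds the critical difference is large (about 1.48), so only the pair whose average ranks differ by 2 (Model B and Model C) is flagged. The p-value matrix from scikit_posthocs expresses the same comparison in a different form.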

CODE:
import numpy as np
import scikit_posthocs as sp
import pandas as pd

# Example MAE scores of 3 models across 5 datasets/folds


# Rows = datasets/folds, Columns = models
scores = np.array([
[3.1, 2.9, 3.5], # Dataset 1
[2.8, 2.5, 3.2], # Dataset 2
[3.0, 2.7, 3.3], # Dataset 3
[3.2, 3.0, 3.6], # Dataset 4
[2.9, 2.6, 3.1], # Dataset 5
])
# Convert to DataFrame for better readability
df = pd.DataFrame(scores, columns=["Model A", "Model B", "Model C"])

# Perform Nemenyi post-hoc test


nemenyi_result = sp.posthoc_nemenyi_friedman(df)

print("=== Nemenyi Post-Hoc Test Result ===")


print(nemenyi_result)

OUTPUT:
EXPERIMENT-13

AIM: Write a program to analyze the performance of model developed

THEORY:
To analyze the performance of your house price prediction models, you can write a program that compares
the performance of multiple models using a relevant metric, such as Root Mean Squared Error (RMSE), Mean
Absolute Error (MAE), or R-squared (R²). Then, you can apply the Nemenyi test to compare the models
statistically if you want to see if any differences in their performance are significant.
Here's an outline of the program using Python. This example assumes you're comparing models like Linear
Regression, Random Forest, and XGBoost.
Steps:
1. Train the models on the same dataset.
2. Calculate performance metrics (e.g., RMSE or MAE) for each model.
3. Apply the Nemenyi test to compare the models.
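
The three metrics named above can be written out directly with NumPy, which makes clear what each one measures. The prices in this sketch are hypothetical values (in lakhs) chosen so the arithmetic is easy to follow; the function name regression_metrics is likewise an assumption for the example.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true

    mae = np.mean(np.abs(errors))         # average absolute error
    rmse = np.sqrt(np.mean(errors ** 2))  # penalizes large errors more heavily

    ss_res = np.sum(errors ** 2)                    # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot

    return mae, rmse, r2


# Hypothetical actual and predicted prices (in lakhs)
actual = [30, 50, 70, 90]
predicted = [20, 50, 80, 100]

mae, rmse, r2 = regression_metrics(actual, predicted)
print(f"MAE  = {mae:.4f}")
print(f"RMSE = {rmse:.4f}")
print(f"R^2  = {r2:.4f}")
```

MAE and RMSE are in the same units as the price, while R² is unitless, so reporting at least one error metric alongside R² gives a fuller picture of model performance.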

CODE:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.datasets import make_regression
import scipy.stats as stats

# Step 1: Load and prepare your dataset
# (a synthetic regression dataset stands in for the house price data here)
X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=42)

# Split the data into train and test sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 2: Train the models
# Model 1: Linear Regression
lr_model = LinearRegression()

lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)

# Model 2: Random Forest Regressor


rf_model = RandomForestRegressor(random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

# Model 3: XGBoost (if installed)


from xgboost import XGBRegressor
xgb_model = XGBRegressor(random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)

# Step 3: Calculate the performance metrics (RMSE in this case)
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

rmse_lr = rmse(y_test, y_pred_lr)


rmse_rf = rmse(y_test, y_pred_rf)
rmse_xgb = rmse(y_test, y_pred_xgb)

# Step 4: Output the performance of each model


print(f"RMSE for Linear Regression: {rmse_lr:.4f}")
print(f"RMSE for Random Forest: {rmse_rf:.4f}")
print(f"RMSE for XGBoost: {rmse_xgb:.4f}")

# Step 5: Collect repeated measurements for the statistical comparison
# A single RMSE per model cannot be tested (a t-test on one value per model
# yields NaN), so compute one RMSE per cross-validation fold instead.
from sklearn.model_selection import KFold, cross_val_score
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

models = {
    'Linear Regression': lr_model,
    'Random Forest': rf_model,
    'XGBoost': xgb_model,
}
cv = KFold(n_splits=10, shuffle=True, random_state=42)

# Rows = folds, columns = models
fold_rmse = pd.DataFrame({
    name: -cross_val_score(model, X, y, cv=cv,
                           scoring='neg_root_mean_squared_error')
    for name, model in models.items()
})

# Average rank of each model across folds (rank 1 = lowest RMSE)
print("\nAverage rank of each model based on RMSE (1 = best):")
print(fold_rmse.rank(axis=1).mean())

# Step 6: Friedman test, followed by the Nemenyi post-hoc test
stat, p_friedman = friedmanchisquare(*[fold_rmse[col] for col in fold_rmse.columns])
print(f"\nFriedman statistic = {stat:.4f}, p-value = {p_friedman:.4f}")

print("\nPairwise p-values from Nemenyi test:")
print(sp.posthoc_nemenyi_friedman(fold_rmse))

OUTPUT:
