Ese Lab File

The document outlines a series of experiments related to Empirical Software Engineering, focusing on house price prediction using machine learning. It identifies research gaps in existing models, discusses exploratory data analysis, and details various statistical tests such as t-tests and chi-square tests for analyzing house price data. The document also includes code examples for data preprocessing, feature evaluation, and model development.

Uploaded by

rohansahu02

EMPIRICAL SOFTWARE ENGINEERING (SE 302a)

LAB FILE

Subject Code: SE 302a
Subject Name: Empirical Software Engineering
Branch: Software Engineering
Year: 3rd Year / 6th Semester

Submitted by:
LOVEESH SINGH
2K22/SE/107

Submitted to:
Dr. Shweta Meena
Assistant Professor
Department of Software Engineering
Delhi Technological University

Delhi Technological University

Shahbad Daulatpur, Main Bawana Road, Delhi-110042

EXPERIMENT-1

AIM: Collection of Empirical Studies

EXPERIMENT-2

AIM: Identify research gaps from empirical studies. Collection of datasets from open-source repositories

Research Gaps in "House Price Prediction Using Machine Learning"

1. The Model is Slow and Needs a Lot of Time to Train

The authors point out that their system takes more than a day to train on the dataset. This is a big problem if you want to use the
model in real life, where fast results are important. They suggest using multiple computers at once (parallel processing) to make it
faster, but they don’t actually do this or test how well it would work [1].

2. Not Enough Work on Choosing the Best Features

The paper uses many house features (like bedrooms, area, location), but doesn’t really dig into which features are most important
for predicting price. There’s no deep analysis or method for picking the most useful features. This means the model might be using
extra information that doesn’t help much, making it slower and harder to understand [1].

3. Location and Time Factors Aren’t Explored Much

While the authors mention that it would be nice to let users pick a region or district for more detailed results, their current model
doesn’t really include detailed location-based or time-based analysis. In reality, house prices change a lot depending on where and
when you buy or sell, but the model doesn’t fully handle this [1].

4. The Model’s Decisions Are Hard to Explain

The paper doesn’t make it clear how the model comes up with its predictions in a way that regular people (like home buyers or real
estate agents) can easily understand. This makes it harder for people to trust or use the predictions in real situations [1].

5. Testing and Validation Could Be Improved

The authors show the accuracy of their models, but don’t use advanced testing methods to check if the results are reliable in
different situations or places. They also don’t compare their predictions to official house price indexes or test how the model works
as market conditions change over time [1].

6. Limited Algorithmic Comparison

• The study compares only a small number of models (Linear Regression, Lasso Regression, Gradient Boosting).

• More advanced or recent models such as Random Forest, CatBoost, LightGBM, or deep learning-based architectures could have
been included for a broader performance benchmark.

7. Insufficient Evaluation Metrics

• The paper reports basic accuracy without error metrics such as RMSE, MAE, or R² score, which are critical for
regression problems.

• There is no discussion of statistical significance, confidence intervals, or the cross-validation methods used.

8. No Real-World Deployment Considerations

There is no mention of how the model could be deployed, updated with new data, or used in a production environment for
real-time price predictions.

9. No Regional/Geographic Customization

The model assumes a general approach, but house pricing is highly location-dependent. There is no effort to incorporate or
compare regional models.

Summary:
The main gaps are that the system is slow to train, doesn’t focus on the most important features, doesn’t fully consider location and
time, is hard to interpret for users, and could use better ways to test if it really works well in practice. Addressing these issues would
make the model much more useful and reliable for real-world house price prediction.
EXPERIMENT-3

AIM: Write a program to perform exploratory analysis of the dataset

THEORY:
Exploratory Data Analysis (EDA) – Theory for House Price Prediction
What Is Exploratory Data Analysis (EDA)?
Exploratory Data Analysis (EDA) is a crucial initial step in any data science or machine learning project. It involves examining datasets
to summarize their main characteristics—often using visual methods and descriptive statistics. The goal is to understand the data,
detect patterns, spot anomalies, and check assumptions before applying predictive models.
In the context of House Price Prediction, EDA helps uncover the relationships between features (e.g., number of bedrooms, square
footage, location) and the target variable (house price). It guides decisions on feature selection, data cleaning, and model building.

Goals of EDA in House Price Prediction

1. Understand the structure of the data (types, size, shape).
2. Identify missing or inconsistent values.
3. Detect outliers and extreme values that may affect the model.
4. Visualize distributions of features (e.g., histogram of prices).
5. Explore relationships between variables using correlations and plots.
6. Prepare data for model building (feature engineering, transformations).

Key Steps in EDA for House Price Dataset

1. Data Loading and Overview
• Load the dataset using pandas.
• Display the first few rows (head()) to get a sense of the structure.
• Check data types, column names, and basic info (info()).
2. Descriptive Statistics
• Use describe() to summarize numerical columns: mean, std, min, max, etc.
• Understand the spread and central tendency of house prices and predictors.
3. Missing Value Analysis
• Use isnull().sum() to check for missing entries in features.
• Decide whether to fill, drop, or impute missing values.
4. Univariate Analysis
• Analyze individual variables:
  o Histogram of house prices
  o Countplot of categorical features (e.g., number of bedrooms, waterfront)
  o Boxplots to check for outliers in continuous variables
5. Bivariate/Multivariate Analysis
• Use scatterplots (e.g., price vs. area), barplots (e.g., price by condition), etc.
• Use correlation matrix and heatmaps to identify relationships.
• Check multicollinearity using correlation or Variance Inflation Factor (VIF).
6. Outlier Detection
• Use boxplots and statistical thresholds (e.g., IQR method) to detect anomalies.
• Decide whether to remove or treat outliers based on domain knowledge.
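The IQR method mentioned in the outlier-detection step can be sketched with the standard library alone. This is a minimal illustration, not the procedure used in the lab: the price list is hypothetical, the helper name `iqr_outliers` is made up for this example, and the multiplier 1.5 is the conventional default rather than a value taken from the dataset.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the IQR rule."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Hypothetical prices (in lakhs); not taken from the Bengaluru dataset
prices = [38, 42, 45, 51, 55, 60, 62, 70, 75, 400]
lower, upper, outliers = iqr_outliers(prices)
print(f"Fences: [{lower:.2f}, {upper:.2f}]  Outliers: {outliers}")
```

Whether a flagged value is dropped, capped, or kept is still a judgement call that depends on domain knowledge, as the step above notes.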

CODE:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset (change the path to point at your own CSV file)
file_path = "Bengaluru_House_Data.csv"
df = pd.read_csv(file_path)

# Basic info (info() prints directly and returns None, so it is not wrapped in print)
print("\n📋 Dataset Info:")
df.info()

# First few rows
print("\n👀 First 5 Rows:")
print(df.head())

# Missing values
print("\n❌ Missing Values:")
print(df.isnull().sum())

# Summary statistics
print("\n📊 Summary Statistics:")
print(df.describe())

# Correlation matrix (numeric columns only)
print("\n🔗 Correlation Matrix:")
print(df.corr(numeric_only=True))

# Visualizations

# Histogram of all numerical features
df.hist(figsize=(12, 10), bins=20)
plt.suptitle("Histograms of Numerical Features", fontsize=16, y=1.02)
plt.tight_layout()
plt.show()

# Heatmap of correlation
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Feature Correlation Heatmap")
plt.show()

# Pairplot (can be slow for large datasets)
# sns.pairplot(df)
# plt.show()

# Boxplots for outlier detection, one per numeric column
numeric_columns = df.select_dtypes(include='number').columns
for col in numeric_columns:
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=df[col])
    plt.title(f"Boxplot of {col}")
    plt.show()

OUTPUT:
EXPERIMENT-4

AIM: Write a program to perform following feature reduction techniques:

a) Correlation based feature evaluation
b) Relief attribute feature evaluation
c) Information gain feature evaluation
d) Principal component analysis

THEORY:

1. Correlation Analysis
Measures linear relationships between features and price using Pearson coefficients.
Values range from -1 to 1, indicating direction and strength of association.
Limited to detecting linear patterns only.

2. F-Test Evaluation
Assesses feature significance through ANOVA-based statistical testing.
Higher F-scores indicate stronger group-wise variance with price.
Effective for identifying linearly discriminative features.

3. Mutual Information
Quantifies non-linear dependencies using entropy reduction.
Measures information gained about price from each feature.
Handles both numerical and categorical data effectively.

4. Principal Component Analysis

Reduces dimensionality through orthogonal transformation.
First components capture maximum variance in the data.
Helps mitigate multicollinearity in regression models.
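As a worked illustration of the correlation-based evaluation described in item 1, the sketch below computes the Pearson coefficient from its definition and ranks features by absolute correlation with price. The feature values and prices are hypothetical and the helper name `pearson_r` is made up for this example; it is not the lab's implementation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical feature values and prices (in lakhs)
features = {
    "total_sqft": [600, 850, 1100, 1500, 2000, 2600],
    "bath": [1, 2, 2, 3, 3, 4],
    "balcony": [1, 0, 2, 1, 3, 0],
}
price = [30, 42, 58, 80, 110, 150]

scores = {name: pearson_r(vals, price) for name, vals in features.items()}

# Rank features by the strength (absolute value) of the linear relationship
for name, r in sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:>10}: r = {r:+.3f}")
```

A feature with a coefficient near zero may still matter through a non-linear relationship, which is why the mutual information score in item 3 is a useful complement.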

CODE:

import warnings

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_regression
from sklearn.preprocessing import StandardScaler
from skrebate import ReliefF

warnings.filterwarnings('ignore')

# Load dataset
df = pd.read_csv('Bengaluru_House_Data-2.csv')


def convert_sqft(sqft):
    """Convert total_sqft to float; a 'low - high' range becomes its midpoint."""
    try:
        if isinstance(sqft, str) and '-' in sqft:
            low, high = map(float, sqft.split('-'))
            return (low + high) / 2
        return float(sqft)
    except (TypeError, ValueError):
        # Entries that cannot be parsed as numbers become missing values
        return np.nan


# Data preprocessing
def preprocess_data(df):
    df = df.copy()  # avoid modifying the caller's DataFrame

    # Convert size to numerical values (keep the leading number)
    df['size'] = df['size'].str.split().str[0].astype(float)

    # Handle total_sqft ranges
    df['total_sqft'] = df['total_sqft'].apply(convert_sqft)

    # Handle availability
    df['is_ready'] = df['availability'].apply(lambda x: 1 if 'Ready' in str(x) else 0)

    # Drop unnecessary columns
    df = df.drop(['area_type', 'availability', 'society', 'location'], axis=1)

    # Handle missing values: drop rows without a target, fill features with the median.
    # 'size' is filled as well, because the methods below do not accept NaN values.
    df = df.dropna(subset=['price'])
    for col in ['size', 'bath', 'balcony', 'total_sqft']:
        df[col] = df[col].fillna(df[col].median())

    return df


# Preprocess data
df_clean = preprocess_data(df)
X = df_clean.drop('price', axis=1)
y = df_clean['price']

# a) Correlation-based feature evaluation
correlation_matrix = df_clean.corr()
price_correlation = correlation_matrix['price'].sort_values(ascending=False)
print("Correlation with Price:\n", price_correlation, "\n")

# b) Relief attribute evaluation
relief = ReliefF(n_features_to_select=5, n_neighbors=100)
relief.fit(X.values, y.values)
relief_scores = pd.Series(relief.feature_importances_, index=X.columns)
print("Relief Feature Scores:\n", relief_scores.sort_values(ascending=False), "\n")

# c) Information gain evaluation
mi_scores = mutual_info_regression(X, y)
mi_scores = pd.Series(mi_scores, index=X.columns)
print("Mutual Information Scores:\n", mi_scores.sort_values(ascending=False), "\n")

# d) Principal Component Analysis (features are standardized first)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA()
pca.fit(X_scaled)
explained_variance = pca.explained_variance_ratio_

print("PCA Explained Variance Ratio:")
for i, ratio in enumerate(explained_variance):
    print(f"PC{i+1}: {ratio:.4f}")

OUTPUT:
EXPERIMENT-5

AIM: Develop a machine learning model for the selected topic

EXPERIMENT-6

AIM: Consider the model developed in Experiment 5 and identify the following:
a) State the hypothesis
b) Formulate an analysis plan
c) Analyze sample data
d) Interpret results
e) Estimate type-I and type-II error
EXPERIMENT-7

AIM: Write a program to implement t-test

THEORY:
T-Test – Theory (Brief)
A T-test is a statistical method used to determine whether there is a significant difference in average values between two
groups. In this case, it is used to compare house prices between two locations in Bengaluru.
Objective
To check if the average house prices in two selected locations (e.g., Whitefield vs Rajaji Nagar) are statistically different.
Hypotheses
• Null Hypothesis (H₀): No difference in average house prices between the two locations.
• Alternative Hypothesis (H₁): There is a significant difference.
Method
• Data cleaned to remove missing values.
• Prices grouped by location.
• An independent two-sample T-test is applied using scipy.stats.ttest_ind(). With equal_var=False, as in the code below, this is Welch's variant, which does not assume equal variances in the two groups.
Interpretation
• p-value < 0.05 → Significant difference (reject H₀).
• p-value ≥ 0.05 → No significant difference (fail to reject H₀).
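To make the test statistic concrete, the following sketch computes Welch's t statistic and its approximate (Welch-Satterthwaite) degrees of freedom by hand, using only the standard library. The two price lists are hypothetical, and the helper name `welch_t` is made up for this example. The p-value is left out because it needs the t distribution, which the standard library does not provide.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Sample variances (n - 1 in the denominator), each divided by its group size
    se_a = statistics.variance(sample_a) / n_a
    se_b = statistics.variance(sample_b) / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof


# Hypothetical prices (in lakhs) for two locations
location_a = [62, 75, 80, 95, 110, 130]
location_b = [150, 180, 210, 240, 300]

t_stat, dof = welch_t(location_a, location_b)
print(f"t = {t_stat:.4f}, approximate degrees of freedom = {dof:.2f}")
```

A large absolute t value relative to the degrees of freedom corresponds to a small p-value, which is what the interpretation rule above acts on.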

CODE:
import pandas as pd
from scipy import stats

# Load dataset
file_path = "Bengaluru_House_Data.csv"
df = pd.read_csv(file_path)

# Drop rows with missing price or location
df = df[['location', 'price']].dropna()

# Get two locations to compare
loc1 = 'Whitefield'
loc2 = 'Rajaji Nagar'

# Filter data for the two locations
group1 = df[df['location'] == loc1]['price']
group2 = df[df['location'] == loc2]['price']

# Check if both groups have enough data
if len(group1) < 2 or len(group2) < 2:
    print("❌ Not enough data in one or both groups for t-test.")
else:
    # Perform independent t-test (equal_var=False selects Welch's t-test)
    t_stat, p_val = stats.ttest_ind(group1, group2, equal_var=False)

    print(f"📊 T-Test Between '{loc1}' and '{loc2}' on House Prices")
    print(f"T-statistic: {t_stat:.4f}")
    print(f"P-value: {p_val:.4f}")

    if p_val < 0.05:
        print("✅ Significant difference in average house prices.")
    else:
        print("❌ No significant difference in average house prices.")

OUTPUT:
EXPERIMENT-8

AIM: Write a program to implement chi-square test

THEORY:
Chi-Square Test – Brief Theory
The Chi-Square Test of Independence is a statistical test used to determine whether there is a significant association between two
categorical variables.
Purpose in House Price Data
In the context of the Bengaluru House Price dataset, the Chi-Square test can help answer questions like:
"Is the number of bedrooms (size) dependent on the house location?"
How It Works
• A contingency table is created showing the frequency of occurrences between the two variables (e.g., different size
categories across location values).
• The Chi-Square test compares the observed frequencies with expected frequencies (assuming no relationship).
• It calculates a Chi-square statistic and a p-value.
Hypotheses
• H₀ (Null Hypothesis): The two variables are independent.
• H₁ (Alternative Hypothesis): The two variables are related.
Interpretation
• If p-value < 0.05, reject H₀ → there is a significant relationship.
• If p-value ≥ 0.05, fail to reject H₀ → there is not enough evidence of a relationship (this does not prove that the variables are independent).
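The comparison of observed and expected frequencies can be worked through by hand. The sketch below uses a small hypothetical contingency table (two locations by three bedroom counts) and a made-up helper name, `chi_square_independence`. The closed-form p-value `exp(-chi2 / 2)` is valid only when there are exactly 2 degrees of freedom, as in this table; other table sizes need a chi-square distribution function.

```python
import math


def chi_square_independence(table):
    """Chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof


# Hypothetical counts: rows = two locations, columns = 1, 2 and 3 BHK
observed_counts = [
    [30, 50, 20],
    [10, 40, 50],
]

chi2, dof = chi_square_independence(observed_counts)
p_value = math.exp(-chi2 / 2)  # exact only for dof == 2
print(f"chi2 = {chi2:.4f}, dof = {dof}, p = {p_value:.6f}")
```

The approximation behind the test becomes unreliable when many expected counts are small, which can happen when a variable such as location has many rare categories.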

CODE:
import pandas as pd
from scipy.stats import chi2_contingency

# Load the dataset
file_path = "Bengaluru_House_Data.csv"
df = pd.read_csv(file_path)

# Drop missing values in relevant columns
df = df[['location', 'size']].dropna().copy()

# Clean 'size' column to keep only the number of bedrooms
# (raw string avoids an invalid-escape warning for \d)
df['size'] = df['size'].str.extract(r'(\d+)', expand=False).astype(float)

# Drop rows with missing or non-numeric sizes after extraction
df = df.dropna()

# Convert bedroom count to categorical for Chi-Square test
df['size'] = df['size'].astype(int).astype(str)

# Create a contingency table: location vs size
contingency_table = pd.crosstab(df['location'], df['size'])

# Perform Chi-Square test
chi2, p, dof, expected = chi2_contingency(contingency_table)

# Output results
print("📊 Chi-Square Test of Independence")
print("----------------------------------")
print("Chi-Square Statistic:", round(chi2, 4))
print("Degrees of Freedom:", dof)
print("P-value:", round(p, 4))

# Interpretation
alpha = 0.05
if p < alpha:
    print("✅ Significant association between 'location' and 'number of bedrooms'")
else:
    print("❌ No significant association between 'location' and 'number of bedrooms'")

OUTPUT:
EXPERIMENT-9

AIM: Write a program to implement Friedman test

THEORY:
Friedman Test (Non-Parametric Test for Repeated Measures)
The Friedman Test is a non-parametric statistical test used to detect differences in treatments across multiple test attempts. It is often
seen as the non-parametric alternative to the repeated-measures ANOVA.
When to Use:
• You have one group that is tested on three or more conditions (or time points).
• The dependent variable is ordinal or not normally distributed.
• The data is in a repeated-measures or matched-subjects format.
Basic Idea:
The test ranks the scores within each block (e.g., each subject), and then analyzes the ranks to determine if there's a statistically significant
difference between the treatments.
Hypotheses:
• Null Hypothesis (H₀): All treatments (conditions) have the same effect (i.e., median ranks are equal).
• Alternative Hypothesis (H₁): At least one treatment has a different effect.
Test Statistic (χ²_F):
$$\chi^2_F = \frac{12}{n k (k + 1)} \sum_{j=1}^{k} R_j^2 - 3n(k + 1)$$
Where:
• $n$ = number of blocks (e.g., subjects)
• $k$ = number of treatments
• $R_j$ = sum of ranks for treatment $j$
Decision Rule:
Compare the test statistic to a critical value from the chi-square distribution with $k - 1$ degrees of freedom, or use a p-value.
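The test statistic can be evaluated directly from the formula. The sketch below ranks three models' predictions within each of five houses, sums the ranks per model, and plugs them into the formula. The prediction values are illustrative only, the helper names are made up, and no tie correction is applied. The closed-form p-value `exp(-chi2 / 2)` holds only for k = 3 treatments (2 degrees of freedom).

```python
import math


def rank_row(row):
    """Rank values within one block (1 = smallest); ties share the average rank."""
    return [1 + sum(v < x for v in row) + (sum(v == x for v in row) - 1) / 2 for x in row]


def friedman_statistic(blocks):
    """Friedman chi-square (no tie correction) and the rank sum of each treatment."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for j, r in enumerate(rank_row(row)):
            rank_sums[j] += r
    chi2 = 12 / (n * k * (k + 1)) * sum(r ** 2 for r in rank_sums) - 3 * n * (k + 1)
    return chi2, rank_sums


# Illustrative predictions: rows = houses (blocks), columns = three models (treatments)
predictions = [
    [400000, 390000, 405000],
    [410000, 415000, 412000],
    [395000, 400000, 398000],
    [420000, 425000, 422000],
    [405000, 410000, 407000],
]

chi2, rank_sums = friedman_statistic(predictions)
p_value = math.exp(-chi2 / 2)  # exact only for k = 3 (2 degrees of freedom)
print(f"Rank sums: {rank_sums}")
print(f"chi2_F = {chi2:.4f}, p = {p_value:.4f}")
```

With n = 5 and k = 3 the rank sums are 6, 13 and 11, so the statistic is 0.2 × 326 − 60 = 5.2, which falls just short of significance at the 0.05 level.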

CODE:
from scipy.stats import friedmanchisquare

# Example: 3 algorithms' predictions on 5 houses
linear_regression = [400000, 410000, 395000, 420000, 405000]
lasso_regression = [390000, 415000, 400000, 425000, 410000]
xgboost_model = [405000, 412000, 398000, 422000, 407000]

# Apply Friedman Test
statistic, p_value = friedmanchisquare(linear_regression, lasso_regression, xgboost_model)

# Output results
print("📊 Friedman Test Results")
print("------------------------")
print(f"Test Statistic: {statistic:.4f}")
print(f"P-value: {p_value:.4f}")

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("✅ Significant difference between the model predictions.")
else:
    print("❌ No significant difference between the model predictions.")

OUTPUT:
EXPERIMENT-10

AIM: Write a program to implement Wilcoxon signed rank test

THEORY:
Wilcoxon Signed Rank Test – Theory (For House Price Prediction)
The Wilcoxon Signed Rank Test is a non-parametric statistical test used to compare two related samples. It helps determine
whether their population mean ranks differ — without assuming a normal distribution.
In this case:
• You have two related sets of data:
  o Actual house prices
  o Predicted house prices (from your model)
• The test checks if the differences between actual and predicted prices are symmetric around zero, i.e., whether your
model systematically under- or over-predicts prices.

How It Works (In Simple Steps)

1. Calculate differences between predicted and actual prices.
2. Ignore zero differences (no change).
3. Take the absolute value of differences and rank them (smallest to largest).
4. Add signs back to ranks based on whether the prediction was higher or lower.
5. Compute:
o W+ = sum of positive signed ranks
o W− = sum of negative signed ranks
6. The Wilcoxon statistic (W) is the smaller of W+ and W−.
7. A p-value is calculated to test the null hypothesis.

Hypotheses
• Null Hypothesis (H₀):
There is no significant difference between actual and predicted prices.
• Alternative Hypothesis (H₁):
There is a significant difference between actual and predicted prices.

Interpretation
• If p-value < 0.05 (or your chosen significance level):
  o You reject H₀ → your model's predictions differ systematically from actual values.
• If p-value ≥ 0.05:
  o You fail to reject H₀ → there is no evidence of systematic over- or under-prediction. This does not by itself show that the model is accurate; error metrics such as RMSE or MAE are still needed for that.

CODE:
import pandas as pd
import numpy as np
from scipy.stats import wilcoxon

# ======= Sample data for illustration =======
# In practice, replace this with: pd.read_csv("your_file.csv")
data = {
    'actual_price': [210000, 340000, 280000, 500000, 310000],
    'predicted_price': [205000, 345000, 275000, 490000, 300000]
}
df = pd.DataFrame(data)

# Extract the two paired columns for the test
actual = df['actual_price']
predicted = df['predicted_price']

# === Wilcoxon Signed Rank Test ===
# Using scipy
print("\n=== Wilcoxon Signed Rank Test (SciPy) ===")
stat, p_value = wilcoxon(actual, predicted)
print(f"Wilcoxon statistic = {stat:.4f}")
print(f"P-value = {p_value:.4f}")


# === Manual Implementation ===
def manual_wilcoxon(x, y):
    diff = y - x
    non_zero_diff = diff[diff != 0]                # drop zero differences
    abs_diff = np.abs(non_zero_diff)
    ranks = abs_diff.rank(method='average')        # tied values share the average rank
    signed_ranks = ranks * np.sign(non_zero_diff)  # restore the sign of each difference

    W_plus = signed_ranks[signed_ranks > 0].sum()
    W_minus = -signed_ranks[signed_ranks < 0].sum()
    W = min(W_plus, W_minus)
    return {
        "W+": W_plus,
        "W-": W_minus,
        "Wilcoxon Statistic (W)": W
    }


print("\n=== Manual Implementation ===")
manual_result = manual_wilcoxon(actual, predicted)
for k, v in manual_result.items():
    print(f"{k}: {v:.4f}")

# === Interpretation ===
alpha = 0.05
print("\n=== Interpretation ===")
if p_value < alpha:
    print(f"Since p-value ({p_value:.4f}) < {alpha}, we reject the null hypothesis.")
    print("There is a significant difference between actual and predicted prices.")
else:
    print(f"Since p-value ({p_value:.4f}) >= {alpha}, we fail to reject the null hypothesis.")
    print("There is no significant difference between actual and predicted prices.")

OUTPUT:
EXPERIMENT-11

AIM: Write a program to implement ANOVA test

THEORY:

ANOVA (Analysis of Variance) Test

The ANOVA test is a statistical method used to determine whether there are any significant differences between the means of three
or more independent groups.

Hypotheses:

• Null Hypothesis (H₀): All BHK groups have the same average house price (no difference between means).

• Alternative Hypothesis (H₁): At least one BHK group has a different average price (there is a difference between group means).

How It Works:

1. ANOVA splits the total variation in price into:
   o Between-group variance (due to different BHKs)
   o Within-group variance (natural variation within each BHK group)

2. It calculates the F-statistic, which is the ratio:

$$F = \frac{\text{Variance between groups}}{\text{Variance within groups}}$$

3. A higher F-value suggests greater group differences.

4. The p-value tells you whether the observed F-statistic is statistically significant.

Interpretation:

• If p-value < 0.05 (or your chosen alpha level):
  o Reject H₀ → There is a significant difference in average prices across BHKs.

• If p-value ≥ 0.05:
  o Fail to reject H₀ → No significant price difference across BHKs.
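The F ratio can be computed by hand from the between-group and within-group sums of squares. The sketch below uses three small hypothetical price groups (one per BHK level) and a made-up helper name, `one_way_anova_f`. It returns the F statistic and its two degrees of freedom; the p-value is omitted because it needs the F distribution.

```python
def one_way_anova_f(groups):
    """One-way ANOVA: return (F, df_between, df_within)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: how far each group mean is from the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: spread of values around their own group mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within, k - 1, n_total - k


# Hypothetical prices (in lakhs) grouped by BHK
price_groups = [
    [40, 45, 50, 55],      # 1 BHK
    [60, 70, 80, 90],      # 2 BHK
    [100, 120, 140, 160],  # 3 BHK
]

f_stat, df_between, df_within = one_way_anova_f(price_groups)
print(f"F = {f_stat:.4f} with ({df_between}, {df_within}) degrees of freedom")
```

Here the between-group mean square is about 7058.33 and the within-group mean square about 291.67, giving F = 24.2.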

CODE:

import pandas as pd
from scipy.stats import f_oneway

# Load the dataset
df = pd.read_csv("Bengaluru_House_Data.csv")

# Drop rows with missing 'size' or 'price'
df = df[['size', 'price']].dropna().copy()

# Extract numeric BHK value from the 'size' column
df['bhk'] = df['size'].str.extract(r'(\d+)', expand=False).astype(float)

# Drop rows where bhk is missing after extraction
df = df.dropna(subset=['bhk'])

# Group prices by BHK
grouped_prices = df.groupby('bhk')['price'].apply(list)

# Keep only BHK groups with at least 2 data points
valid_groups = [prices for prices in grouped_prices if len(prices) >= 2]

# Perform ANOVA
f_stat, p_value = f_oneway(*valid_groups)

# Output results
print("\n=== ANOVA Test: Does Price Differ by BHK? ===")
print(f"F-statistic: {f_stat:.4f}")
print(f"P-value: {p_value:.4f}")

# Interpretation
alpha = 0.05
print("\n=== Interpretation ===")
if p_value < alpha:
    print(f"P-value is less than {alpha} → Reject H₀")
    print("=> There is a significant difference in prices between BHK levels.")
else:
    print(f"P-value is greater than or equal to {alpha} → Fail to reject H₀")
    print("=> No significant difference in prices between BHK levels.")

OUTPUT:
EXPERIMENT-12

AIM: Write a program to implement Nemenyi test

THEORY:
The Nemenyi test is a non-parametric statistical test used to compare the performance of multiple models or
methods. It is particularly useful when you have more than two models and want to determine if there are
significant differences in their performance metrics, such as prediction accuracy or error rates. It is normally
applied as a post-hoc test, after a Friedman test has indicated that at least one model differs from the others.
The test works by ranking the models based on their performance and then comparing the average ranks pairwise.
If the difference between the average ranks of two models is large enough, it indicates a significant difference
in their performance. The Nemenyi test is especially helpful when the data does not follow a normal distribution,
making it a robust choice for model comparison in real-world scenarios like house price prediction.

When to Use It:

• After running multiple models for your house price prediction, you can apply the Nemenyi test to
compare their performance metrics and determine if the differences you observe are statistically
significant.
• It is most useful when you're dealing with multiple models or multiple configurations (e.g.,
hyperparameters) and want to check whether one model consistently outperforms others.
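The pairwise rank comparison is often expressed through a critical difference, CD = q_α · sqrt(k(k + 1) / (6n)), where k is the number of models and n the number of datasets or folds: two models differ significantly when their average ranks differ by more than CD. The sketch below is a minimal illustration under stated assumptions. The MAE scores are illustrative, the helper names are made up, and the q_α constants are the values commonly tabulated for α = 0.05, which should be checked against a published table before being relied on.

```python
import math
from itertools import combinations

# Commonly tabulated critical values q_alpha for alpha = 0.05 (assumed here), keyed by k
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}


def rank_row(row):
    """Rank values within one dataset (1 = lowest error); ties share the average rank."""
    return [1 + sum(v < x for v in row) + (sum(v == x for v in row) - 1) / 2 for x in row]


def nemenyi_critical_difference(scores, names):
    """Return average ranks, the critical difference, and the significantly different pairs."""
    n, k = len(scores), len(names)
    ranked = [rank_row(row) for row in scores]
    avg_rank = {name: sum(r[j] for r in ranked) / n for j, name in enumerate(names)}
    cd = Q_ALPHA_005[k] * math.sqrt(k * (k + 1) / (6 * n))
    significant = [
        (a, b) for a, b in combinations(names, 2)
        if abs(avg_rank[a] - avg_rank[b]) > cd
    ]
    return avg_rank, cd, significant


# Illustrative MAE scores: rows = datasets/folds, columns = models
mae_scores = [
    [3.1, 2.9, 3.5],
    [2.8, 2.5, 3.2],
    [3.0, 2.7, 3.3],
    [3.2, 3.0, 3.6],
    [2.9, 2.6, 3.1],
]
model_names = ["Model A", "Model B", "Model C"]

avg_rank, cd, significant = nemenyi_critical_difference(mae_scores, model_names)
print("Average ranks:", avg_rank)
print(f"Critical difference: {cd:.4f}")
print("Significantly different pairs:", significant)
```

In this example the average ranks are 2, 1 and 3, and the critical difference is about 1.48, so only the pair with a rank gap of 2 (Model B versus Model C) is flagged.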

CODE:
import numpy as np
import scikit_posthocs as sp
import pandas as pd

# Example MAE scores of 3 models across 5 datasets/folds
# Rows = datasets/folds, Columns = models
scores = np.array([
    [3.1, 2.9, 3.5],  # Dataset 1
    [2.8, 2.5, 3.2],  # Dataset 2
    [3.0, 2.7, 3.3],  # Dataset 3
    [3.2, 3.0, 3.6],  # Dataset 4
    [2.9, 2.6, 3.1],  # Dataset 5
])

# Convert to DataFrame for better readability
df = pd.DataFrame(scores, columns=["Model A", "Model B", "Model C"])

# Perform Nemenyi post-hoc test
nemenyi_result = sp.posthoc_nemenyi_friedman(df)

print("=== Nemenyi Post-Hoc Test Result ===")
print(nemenyi_result)

OUTPUT:
EXPERIMENT-13

AIM: Write a program to analyze the performance of model developed

THEORY:
To analyze the performance of your house price prediction models, you can write a program that compares
the performance of multiple models using a relevant metric, such as Root Mean Squared Error (RMSE), Mean
Absolute Error (MAE), or R-squared (R²). Then, you can apply the Nemenyi test to compare the models
statistically if you want to see if any differences in their performance are significant.
Here's an outline of the program using Python. This example assumes you're comparing models like Linear
Regression, Random Forest, and XGBoost.
Steps:
1. Train the models on the same dataset.
2. Calculate performance metrics (e.g., RMSE or MAE) for each model.
3. Apply the Nemenyi test to compare the models.
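The three metrics named above can be computed directly from their definitions, which helps when checking library output. This is a small sketch using only the standard library; the actual and predicted prices are illustrative values and the helper name `regression_metrics` is made up for this example.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R-squared for paired actual and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e ** 2 for e in errors)  # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "R2": r2}


# Illustrative prices
actual = [210000, 340000, 280000, 500000, 310000]
predicted = [205000, 345000, 275000, 490000, 300000]

metrics = regression_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

RMSE penalises large errors more heavily than MAE, so reporting both gives a fuller picture than either alone.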

CODE:
import numpy as np
import pandas as pd
import scikit_posthocs as sp
from scipy.stats import friedmanchisquare
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from xgboost import XGBRegressor  # requires the xgboost package to be installed

# Step 1: Load and prepare your dataset (synthetic data used here for illustration)
X, y = make_regression(n_samples=500, n_features=5, noise=0.1, random_state=42)

# Step 2: Define the models to compare
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "XGBoost": XGBRegressor(random_state=42),
}


def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))


# Step 3: Calculate RMSE for every model on every cross-validation fold.
# A single train/test split yields only one score per model, which is not
# enough for a rank-based test, so each fold is treated as one block.
kf = KFold(n_splits=10, shuffle=True, random_state=42)
fold_scores = {name: [] for name in models}
for train_idx, test_idx in kf.split(X):
    for name, model in models.items():
        model.fit(X[train_idx], y[train_idx])
        fold_scores[name].append(rmse(y[test_idx], model.predict(X[test_idx])))

scores = pd.DataFrame(fold_scores)  # rows = folds, columns = models

# Step 4: Output the performance of each model
for name in models:
    print(f"Mean RMSE for {name}: {scores[name].mean():.4f}")

# Step 5: Rank the models within each fold (rank 1 = lowest RMSE)
ranks = scores.rank(axis=1)
print("\nAverage rank of models based on RMSE:")
print(ranks.mean())

# Step 6: Friedman test across folds, followed by the Nemenyi post-hoc test
stat, p_value = friedmanchisquare(*[scores[name] for name in models])
print(f"\nFriedman statistic: {stat:.4f}, p-value: {p_value:.4f}")

nemenyi_result = sp.posthoc_nemenyi_friedman(scores)
print("\nPairwise p-values from Nemenyi test:")
print(nemenyi_result.round(4))

OUTPUT:

11 pages
Formal Research Paper Slideshow by Slidesgo
No ratings yet
Formal Research Paper Slideshow by Slidesgo
9 pages
Final Data Science Report 25 Pages
No ratings yet
Final Data Science Report 25 Pages
4 pages
Synopsis Format1 PDF
No ratings yet
Synopsis Format1 PDF
6 pages
Review Analysis and Sentiment Learning Using NLP
No ratings yet
Review Analysis and Sentiment Learning Using NLP
15 pages