West Rox

The document outlines a Python function for performing a descriptive analysis of the West Roxbury housing dataset, including data exploration, descriptive statistics, and visualizations for both numerical and categorical features. It checks for missing values, analyzes the target variable 'TOTAL VALUE', and explores relationships between features. The analysis also includes correlation analysis and visual representations of various features in the dataset.

Uploaded by Sharath Jonnala

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os


def descriptive_analysis(filepath):
    """
    Performs a descriptive analysis of the West Roxbury housing dataset.

    Args:
        filepath (str): The path to the CSV file.
    """
    try:
        # Validate file path
        if not os.path.exists(filepath):
            raise FileNotFoundError(f"The file {filepath} does not exist")

        # Load the dataset
        df = pd.read_csv(filepath)

        # Clean column names - remove leading and trailing spaces
        df.columns = df.columns.str.strip()

        # --- Initial Data Exploration ---
        print("--- Initial Data Exploration ---")
        print("First 5 rows of the dataset:")
        print(df.head())
        print("\n" + "="*50 + "\n")

        print("Dataset Information:")
        df.info()
        print("\n" + "="*50 + "\n")

        print("Check for Missing Values:")
        print(df.isnull().sum())
        print("\n" + "="*50 + "\n")

        # --- Descriptive Statistics for ALL Columns ---
        print("--- Descriptive Statistics (All Columns) ---")
        print(df.describe(include='all'))
        print("\n" + "="*50 + "\n")

        # --- Value Counts for All Categorical Columns ---
        print("--- Value Counts for Categorical Columns ---")
        categorical_cols = df.select_dtypes(
            include=['object', 'category']).columns
        for col in categorical_cols:
            if col not in df.columns:
                print(f"Warning: Column '{col}' not found in DataFrame.")
                continue
            print(f"Value counts for {col}:")
            print(df[col].value_counts(dropna=False))
            print("-" * 40)
        print("\n" + "="*50 + "\n")

        # --- Plots for ALL Numeric Columns ---
        numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
        for col in numeric_cols:
            if col not in df.columns:
                print(f"Warning: Column '{col}' not found in DataFrame.")
                continue
            plt.figure(figsize=(10, 4))
            sns.histplot(df[col], kde=True, bins=30)
            plt.title(f'Distribution of {col}', fontsize=16)
            plt.xlabel(col, fontsize=12)
            plt.ylabel('Frequency', fontsize=12)
            plt.show()

            plt.figure(figsize=(8, 2))
            sns.boxplot(x=df[col])
            plt.title(f'Box Plot of {col}', fontsize=16)
            plt.xlabel(col, fontsize=12)
            plt.show()

        # --- Plots for ALL Categorical Columns ---
        for col in categorical_cols:
            if col not in df.columns:
                print(f"Warning: Column '{col}' not found in DataFrame.")
                continue
            plt.figure(figsize=(10, 4))
            sns.countplot(x=col, data=df,
                          order=df[col].value_counts().index)
            plt.title(f'Distribution of {col}', fontsize=16)
            plt.xlabel(col, fontsize=12)
            plt.ylabel('Count', fontsize=12)
            plt.xticks(rotation=45)
            plt.show()

        # --- Analysis of the Target Variable: TOTAL VALUE ---
        print("--- Analyzing Target Variable: TOTAL VALUE ---")

        # Set plot style
        sns.set_style("whitegrid")

        if 'TOTAL VALUE' in df.columns:
            # Histogram of TOTAL VALUE
            plt.figure(figsize=(12, 6))
            sns.histplot(df['TOTAL VALUE'], kde=True, bins=50)
            plt.title('Distribution of Total Value', fontsize=16)
            plt.xlabel('Total Value ($)', fontsize=12)
            plt.ylabel('Frequency', fontsize=12)
            plt.show()

            # Box plot of TOTAL VALUE
            plt.figure(figsize=(10, 4))
            sns.boxplot(x=df['TOTAL VALUE'])
            plt.title('Box Plot of Total Value', fontsize=16)
            plt.xlabel('Total Value ($)', fontsize=12)
            plt.show()
        else:
            print("Warning: 'TOTAL VALUE' column not found in DataFrame.")

        # --- Explore Relationships between Features and Target Variable ---
        print("--- Exploring Feature Relationships ---")

        # Scatter plots for numerical features vs. TOTAL VALUE
        # NOTE: 'LAND AREA' is not a column in this dataset (lot size is
        # stored in 'LOT SQFT'), so that entry triggers the warning below.
        numerical_features = ['LIVING AREA', 'GROSS AREA', 'LAND AREA',
                              'TAX', 'YR BUILT']
        for feature in numerical_features:
            if feature in df.columns and 'TOTAL VALUE' in df.columns:
                plt.figure(figsize=(10, 6))
                sns.scatterplot(x=df[feature], y=df['TOTAL VALUE'])
                plt.title(f'Total Value vs. {feature}', fontsize=16)
                plt.xlabel(feature, fontsize=12)
                plt.ylabel('Total Value ($)', fontsize=12)
                plt.show()
            else:
                print(f"Warning: '{feature}' or 'TOTAL VALUE' column "
                      "not found in DataFrame.")

        # Box plots for categorical features vs. TOTAL VALUE
        if 'REMODEL' in df.columns and 'TOTAL VALUE' in df.columns:
            print("Value counts for REMODEL column:")
            print(df['REMODEL'].value_counts())
            print("\n" + "="*50 + "\n")

            # NOTE: rows with a missing REMODEL are not drawn, so the plot
            # shows no 'None' box unless those values are filled first.
            plt.figure(figsize=(10, 6))
            sns.boxplot(x='REMODEL', y='TOTAL VALUE', data=df,
                        order=['None', 'Old', 'Recent'])
            plt.title('Total Value by Remodel Type', fontsize=16)
            plt.xlabel('Remodel Type', fontsize=12)
            plt.ylabel('Total Value ($)', fontsize=12)
            plt.show()
        else:
            print("Warning: 'REMODEL' or 'TOTAL VALUE' column not found "
                  "in DataFrame.")

        # --- Correlation Analysis ---
        print("--- Correlation Analysis ---")
        # Select only numeric columns for correlation matrix
        numeric_df = df.select_dtypes(include=['float64', 'int64'])

        plt.figure(figsize=(14, 10))
        correlation_matrix = numeric_df.corr()
        sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm',
                    fmt=".2f")
        plt.title('Correlation Matrix of Numerical Features', fontsize=16)
        plt.show()
        print("\n" + "="*50 + "\n")

        # --- Analysis of Categorical Variables ---
        # These count columns are numeric but take few distinct values,
        # so they are plotted as categories.
        print("--- Analyzing Categorical Features ---")
        plt.figure(figsize=(8, 6))
        sns.countplot(x='ROOMS', data=df,
                      order=df['ROOMS'].value_counts().index)
        plt.title('Distribution of Number of Rooms', fontsize=16)
        plt.xlabel('Number of Rooms', fontsize=12)
        plt.ylabel('Count', fontsize=12)
        plt.xticks(rotation=45)
        plt.show()

        plt.figure(figsize=(8, 6))
        sns.countplot(x='BEDROOMS', data=df)
        plt.title('Distribution of Number of Bedrooms', fontsize=16)
        plt.xlabel('Number of Bedrooms', fontsize=12)
        plt.ylabel('Count', fontsize=12)
        plt.show()

        plt.figure(figsize=(8, 6))
        sns.countplot(x='FULL BATH', data=df)
        plt.title('Distribution of Number of Full Bathrooms', fontsize=16)
        plt.xlabel('Number of Full Bathrooms', fontsize=12)
        plt.ylabel('Count', fontsize=12)
        plt.show()

        plt.figure(figsize=(8, 6))
        sns.countplot(x='HALF BATH', data=df)
        plt.title('Distribution of Number of Half Bathrooms', fontsize=16)
        plt.xlabel('Number of Half Bathrooms', fontsize=12)
        plt.ylabel('Count', fontsize=12)
        plt.show()

    except FileNotFoundError:
        print(f"Error: The file at {filepath} was not found.")
    except Exception as e:
        print(f"An error occurred: {e}")


# --- Main Execution ---
if __name__ == '__main__':
    # NOTE: Make sure the CSV file is in the same directory as this script,
    # or provide the full path to the file.
    file_path = 'WestRoxbury_cleaned.csv'
    descriptive_analysis(file_path)

--- Initial Data Exploration ---


First 5 rows of the dataset:
   TOTAL VALUE   TAX  LOT SQFT  YR BUILT  GROSS AREA  LIVING AREA  FLOORS  \
0        344.2  4330      9965      1880        2436         1352     2.0
1        412.6  5190      6590      1945        3108         1976     2.0
2        330.1  4152      7500      1890        2294         1371     2.0
3        498.6  6272     13773      1957        5032         2608     1.0
4        331.5  4170      5000      1910        2370         1438     2.0

   ROOMS  BEDROOMS  FULL BATH  HALF BATH  KITCHEN  FIREPLACE REMODEL
0      6         3          1          1        1          0     NaN
1     10         4          2          1        1          0  Recent
2      8         4          1          1        1          0     NaN
3      9         5          1          1        1          1     NaN
4      7         3          2          0        1          0     NaN

==================================================

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5802 entries, 0 to 5801
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   TOTAL VALUE  5802 non-null   float64
 1   TAX          5802 non-null   int64
 2   LOT SQFT     5802 non-null   int64
 3   YR BUILT     5802 non-null   int64
 4   GROSS AREA   5802 non-null   int64
 5   LIVING AREA  5802 non-null   int64
 6   FLOORS       5802 non-null   float64
 7   ROOMS        5802 non-null   int64
 8   BEDROOMS     5802 non-null   int64
 9   FULL BATH    5802 non-null   int64
 10  HALF BATH    5802 non-null   int64
 11  KITCHEN      5802 non-null   int64
 12  FIREPLACE    5802 non-null   int64
 13  REMODEL      1456 non-null   object
dtypes: float64(2), int64(11), object(1)
memory usage: 634.7+ KB

==================================================

Check for Missing Values:


TOTAL VALUE 0
TAX 0
LOT SQFT 0
YR BUILT 0
GROSS AREA 0
LIVING AREA 0
FLOORS 0
ROOMS 0
BEDROOMS 0
FULL BATH 0
HALF BATH 0
KITCHEN 0
FIREPLACE 0
REMODEL 4346
dtype: int64

==================================================

--- Descriptive Statistics (All Columns) ---


        TOTAL VALUE           TAX      LOT SQFT     YR BUILT   GROSS AREA  \
count   5802.000000   5802.000000   5802.000000  5802.000000  5802.000000
unique          NaN           NaN           NaN          NaN          NaN
top             NaN           NaN           NaN          NaN          NaN
freq            NaN           NaN           NaN          NaN          NaN
mean     392.685715   4939.485867   6278.083764  1936.744916  2924.842123
std       99.177414   1247.649118   2669.707974    35.989910   883.984726
min      105.000000   1320.000000    997.000000     0.000000   821.000000
25%      325.125000   4089.500000   4772.000000  1920.000000  2347.000000
50%      375.900000   4728.000000   5683.000000  1935.000000  2700.000000
75%      438.775000   5519.500000   7022.250000  1955.000000  3239.000000
max     1217.800000  15319.000000  46411.000000  2011.000000  8154.000000

        LIVING AREA       FLOORS        ROOMS     BEDROOMS    FULL BATH  \
count   5802.000000  5802.000000  5802.000000  5802.000000  5802.000000
unique          NaN          NaN          NaN          NaN          NaN
top             NaN          NaN          NaN          NaN          NaN
freq            NaN          NaN          NaN          NaN          NaN
mean    1657.065322     1.683730     6.994829     3.230093     1.296794
std      540.456726     0.444884     1.437657     0.846607     0.522040
min      504.000000     1.000000     3.000000     1.000000     1.000000
25%     1308.000000     1.000000     6.000000     3.000000     1.000000
50%     1548.500000     2.000000     7.000000     3.000000     1.000000
75%     1873.750000     2.000000     8.000000     4.000000     2.000000
max     5289.000000     3.000000    14.000000     9.000000     5.000000

          HALF BATH     KITCHEN    FIREPLACE REMODEL
count   5802.000000  5802.00000  5802.000000    1456
unique          NaN         NaN          NaN       2
top             NaN         NaN          NaN  Recent
freq            NaN         NaN          NaN     875
mean       0.613926     1.01534     0.739917     NaN
std        0.533839     0.12291     0.565108     NaN
min        0.000000     1.00000     0.000000     NaN
25%        0.000000     1.00000     0.000000     NaN
50%        1.000000     1.00000     1.000000     NaN
75%        1.000000     1.00000     1.000000     NaN
max        3.000000     2.00000     4.000000     NaN

==================================================

--- Value Counts for Categorical Columns ---


Value counts for REMODEL:
REMODEL
NaN 4346
Recent 875
Old 581
Name: count, dtype: int64
----------------------------------------

==================================================
--- Analyzing Target Variable: TOTAL VALUE ---
--- Exploring Feature Relationships ---
Warning: 'LAND AREA' or 'TOTAL VALUE' column not found in DataFrame.
Value counts for REMODEL column:
REMODEL
Recent 875
Old 581
Name: count, dtype: int64

==================================================

--- Correlation Analysis ---


==================================================

--- Analyzing Categorical Features ---


Complete Descriptive Analysis of the West Roxbury Housing Data
This notebook provides a comprehensive descriptive analysis of the West Roxbury housing
dataset. The analysis covers the following aspects:

1. Data Overview
• Shape of the dataset: Number of rows (properties) and columns (features).
• Column names and types: List of all columns and their data types.
• Sample records: Display the first few rows to get a sense of the data.

2. Missing Values
• Missing value summary: Count and percentage of missing values for each column.
• Handling missing values: Discussion of how missing values are handled or
recommendations.
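The missing value summary described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's own code: the helper name missing_summary is hypothetical, and the small frame reuses the five sample rows from the head() output (a subset of columns) so the block runs without the CSV.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Columns with the most missing values come first.
    return summary.sort_values("missing", ascending=False)


# Five sample rows copied from the head() output (subset of columns).
sample = pd.DataFrame({
    "TOTAL VALUE": [344.2, 412.6, 330.1, 498.6, 331.5],
    "LIVING AREA": [1352, 1976, 1371, 2608, 1438],
    "REMODEL": [None, "Recent", None, None, None],
})

summary = missing_summary(sample)
print(summary)
```

On the full dataset the same calculation gives 4,346 missing REMODEL values out of 5,802 rows, or about 74.9%; every other column is complete.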

3. Descriptive Statistics
• Numerical features: Summary statistics (mean, median, min, max, quartiles, std) for all
numeric columns.
• Categorical features: Frequency counts for all categorical columns.

4. Distribution Analysis
• Histograms: Distribution plots for all numeric columns.
• Boxplots: Boxplots to visualize spread and outliers for numeric columns.
• Countplots: Bar plots for categorical columns to show frequency of each category.

5. Relationship Analysis
• Correlation matrix: Heatmap showing correlations between numeric features.
• Scatter plots: Relationships between key features and the target variable (TOTAL
VALUE).
• Boxplots by category: How TOTAL VALUE varies by categorical features (e.g.,
REMODEL).

6. Key Insights
• Summary of findings: Highlight interesting patterns, trends, or anomalies discovered in
the data.

Example Steps in Python


1. Load and Inspect Data
– Use pandas to load the CSV and inspect the first few rows.
2. Check for Missing Values
– Use isnull().sum() and visualize missingness if needed.
3. Descriptive Statistics
– Use describe() for numeric and value_counts() for categorical columns.
4. Visualizations
– Use matplotlib and seaborn for histograms, boxplots, countplots,
scatterplots, and heatmaps.
5. Interpretation
– Summarize the main findings from the above analyses.
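For steps 1 and 2, one detail is worth a small sketch. The script's box plot expects a 'None' category in REMODEL, while the recorded output shows only NaN, Old, and Recent. A plausible explanation, stated here as an assumption about the source file, is that the CSV stores the literal string None and read_csv converts it to a missing value (recent pandas versions include None among the default NA strings). The inline CSV below is hypothetical and only mimics the first three sample rows.

```python
import io

import pandas as pd

# Hypothetical CSV text shaped like the first three sample rows; the literal
# string "None" in REMODEL is an assumption about the source file.
csv_text = "TOTAL VALUE,REMODEL\n344.2,None\n412.6,Recent\n330.1,None\n"

# Default parsing: how "None" is treated depends on the pandas version.
default_df = pd.read_csv(io.StringIO(csv_text))

# Strict parsing: only empty fields count as missing, so "None" stays a label.
strict_df = pd.read_csv(
    io.StringIO(csv_text),
    keep_default_na=False,
    na_values=[""],
)

print("missing with default parsing:", default_df["REMODEL"].isnull().sum())
print("labels with strict parsing:", strict_df["REMODEL"].tolist())
```

Whether to keep the label at load time or fill it in afterwards is a judgment call; either way, the missing-value count should be checked before interpreting REMODEL.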

Key Insights from the West Roxbury Housing Data
Based on the descriptive analysis, here are some important insights and observations about the
West Roxbury housing dataset:

1. Property Value Distribution


• Total Value: The distribution of TOTAL VALUE is right-skewed: the mean (392.7) exceeds the
median (375.9), and the maximum (1,217.8) lies far above the 75th percentile (438.8). Most
properties are clustered at lower values, with a few high-value outliers.
• Outliers: Boxplots reveal several properties with exceptionally high values, which may
warrant further investigation.
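The skew and outlier observations can be checked against the describe() output alone. The sketch below applies the usual box-plot rule (fences at 1.5 times the interquartile range beyond the quartiles) to the TOTAL VALUE statistics reported earlier; the helper name iqr_fences is hypothetical.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the standard box-plot rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Summary statistics for TOTAL VALUE, copied from the describe() output.
stats = {
    "min": 105.0,
    "q1": 325.125,
    "median": 375.9,
    "mean": 392.685715,
    "q3": 438.775,
    "max": 1217.8,
}

lower, upper = iqr_fences(stats["q1"], stats["q3"])
mean_above_median = stats["mean"] > stats["median"]

print(f"fences: {lower:.2f} to {upper:.2f}")
print(f"mean above median (right skew): {mean_above_median}")
print(f"max beyond upper fence: {stats['max'] > upper}")
print(f"min beyond lower fence: {stats['min'] < lower}")
```

The fences come out at roughly 154.65 and 609.25, so the maximum of 1,217.8 is a clear high outlier, and the minimum of 105.0 falls below the lower fence as well.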

2. Property Size and Features


• Living Area & Gross Area: Both LIVING AREA and GROSS AREA show a wide range, but
most properties fall within a moderate size range. Larger properties tend to have higher
assessed values.
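• Lot Size: The dataset has no LAND AREA column (the script prints a warning for it); lot
size is stored in LOT SQFT. Its distribution is also skewed, with a median of 5,683 and a
maximum of 46,411, so a few properties have much larger lots.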

3. Tax and Year Built


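• Tax: Property tax correlates strongly with TOTAL VALUE, as expected. In the five sample
rows, TAX is about 12.58 times TOTAL VALUE in every case, which suggests that TAX is
derived from the assessed value rather than being an independent feature.
• Year Built: Half of the properties were built between 1920 and 1955 (median 1935), and
the newest dates from 2011. The minimum YR BUILT of 0 looks like a data-entry error.
Older properties may have lower values unless remodeled.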
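A quick way to see how tightly TAX tracks TOTAL VALUE is to look at their ratio. The sketch below uses only the five sample rows from the head() output, so it illustrates the check and is not a result for the full dataset.

```python
import pandas as pd

# Five sample rows copied from the head() output.
sample = pd.DataFrame({
    "TOTAL VALUE": [344.2, 412.6, 330.1, 498.6, 331.5],
    "TAX": [4330, 5190, 4152, 6272, 4170],
})

# If TAX were computed from the assessed value, this ratio would be
# nearly constant from row to row.
ratio = sample["TAX"] / sample["TOTAL VALUE"]
spread = ratio.max() - ratio.min()

print(ratio.round(3).tolist())
print(f"spread: {spread:.4f}")
```

A near-constant ratio means TAX carries almost the same information as the target, which matters if the analysis is later extended to predictive modeling.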

4. Room and Bathroom Counts


• Rooms: The majority of homes have between 6 and 8 rooms (median 7).
• Bedrooms: At least half of the properties have 3 or 4 bedrooms (median 3).
• Bathrooms: Most homes have 1-2 full baths and 0-1 half baths, as expected for suburban
homes.

5. Remodel Status
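• Remodel Impact: Properties with recent remodels (REMODEL = Recent) tend to have
higher TOTAL VALUE compared to those with no remodel or older remodels.
• Distribution: The majority of properties have not been remodeled, assuming a missing
REMODEL value means no remodel: REMODEL is missing for 4,346 of 5,802 rows (about
75%), against 875 Recent and 581 Old.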
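The remodel comparison can be sketched as below. Treating a missing REMODEL as "no remodel" and labeling it 'None' is an assumption, not something the output confirms. The medians use only the five sample rows from the head() output and are illustrative; the shares use the value counts reported for the full dataset.

```python
import pandas as pd

# Five sample rows copied from the head() output (subset of columns).
sample = pd.DataFrame({
    "TOTAL VALUE": [344.2, 412.6, 330.1, 498.6, 331.5],
    "REMODEL": [None, "Recent", None, None, None],
})

# Assumption: a missing REMODEL means the property was never remodeled.
remodel = sample["REMODEL"].fillna("None")
medians = sample.groupby(remodel)["TOTAL VALUE"].median()
print(medians)

# Category shares from the value counts reported for the full dataset.
counts = pd.Series({"None": 4346, "Recent": 875, "Old": 581})
shares = (counts / counts.sum()).round(3)
print(shares)
```

Filling the missing label first also lets a grouped box plot draw the no-remodel group, which is otherwise dropped.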

6. Categorical Features
• Other Categorical Columns: REMODEL is the only non-numeric column in this dataset;
there are no property type or neighborhood features. Count-like columns such as ROOMS,
BEDROOMS, FULL BATH, and HALF BATH are numeric but are plotted as categories.

7. Correlation Analysis
• Strongest Correlations: TOTAL VALUE is most strongly correlated with LIVING AREA,
GROSS AREA, and TAX.
• Weak Correlations: Features like YR BUILT and number of bathrooms have weaker, but
still positive, correlations with value.
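Since the heatmap itself does not survive in the text output, a ranked list of correlations with the target is an easy way to make these claims checkable. The sketch below runs on the five sample rows from the head() output, so the numbers it prints are illustrative only and will differ from the full-dataset correlations.

```python
import pandas as pd

# Five sample rows copied from the head() output (subset of columns).
sample = pd.DataFrame({
    "TOTAL VALUE": [344.2, 412.6, 330.1, 498.6, 331.5],
    "TAX": [4330, 5190, 4152, 6272, 4170],
    "YR BUILT": [1880, 1945, 1890, 1957, 1910],
    "GROSS AREA": [2436, 3108, 2294, 5032, 2370],
    "LIVING AREA": [1352, 1976, 1371, 2608, 1438],
    "REMODEL": [None, "Recent", None, None, None],
})

# Keep numeric columns only, then rank features by correlation with the target.
numeric_df = sample.select_dtypes(include="number")
target_corr = (
    numeric_df.corr()["TOTAL VALUE"]
    .drop("TOTAL VALUE")
    .sort_values(ascending=False)
)
print(target_corr.round(3))
```

Printing this series next to the heatmap gives exact values to quote instead of reading colors off the plot.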

Summary
• High-value properties are rare and may be outliers.
• Larger homes and those with recent remodels command higher values.
• Most homes are of moderate size, built in the mid-20th century, and have not been
remodeled.
• Tax and living area are the features most strongly associated with property value in this
dataset, although TAX appears to be computed from the assessed value itself.
