Exploratory Data Analysis

This document presents an Exploratory Data Analysis (EDA) of house listing data from Austin, Texas, focusing on factors affecting house prices. It includes data cleaning, feature engineering, and hypothesis testing to explore relationships between various features and the latest house prices. The analysis suggests further data collection for improved predictions and recognizes the importance of certain features in determining house values.

Uploaded by

Raza Abbas

EXPLORATORY DATA ANALYSIS
IBM Machine Learning

Raza Abbas
Content

01 A First Look
Target
Features Summary
Data Summary
General Changes

02 Data Visualization
Latest mean price by last sale year
Latest mean price per Home type
House Features affecting Latest price

03 Data Cleaning & Feature Engineering
Data Cleaning
Feature Engineering

04 Hypothesis Testing

05 Conclusion
Questions
Thank You
01
A First Look
TARGET

This presentation showcases the Exploratory Data Analysis (EDA) conducted on house listing data from Austin, Texas. The project is
part of the IBM Machine Learning Course Series, specifically focusing on the first course, Exploratory Data Analysis.

The primary objective of this analysis is to explore how various factors—ranging from the number of bedrooms and bathrooms to the
source of the latest price—impact the latest house prices in one of the hottest housing markets in the United States.

The dataset, updated until 2021, provides insights into the latest trends, historical trends, and other features influencing house prices.
This comprehensive analysis offers a closer look at how the Austin housing market has evolved over the years.

A special thanks to Kaggle contributor Eric Pierce for making this valuable dataset available to the community.
You can find the dataset here:

Austin Housing Prices Dataset


Features Summary
Below are the column names, their descriptions, and their possible/example values:

zpid- a unique identifier assigned by Zillow for every listing
city- the lowercase name of a city in or around Texas
streetAddress- the name of the street where the house is located
zipcode- the listing's 5-digit zip code, e.g. 78109
description- the description of the listing on Zillow
Latitude- the latitudinal location of the house
Longitude- the longitudinal location of the house
propertyTaxRate- the property tax rate
garageSpaces- the number of garage spaces in the house
hasAssociation- boolean value that indicates whether a homeowner association is associated with the listing
hasCooling- boolean value indicating whether the house has cooling
hasGarage- boolean value indicating whether the house has a garage
hasHeating- boolean value indicating whether the house has heating
hasSpa- boolean value indicating whether the house has a spa
hasView- boolean value indicating whether the listing mentions a view from the house
homeType- the home type, e.g. Single Family, Townhouse, Condo
parkingSpaces- the number of parking spots that come with the home
yearBuilt- the year the house was built
latestPrice- the latest price of the listing
numPriceChanges- the number of price changes the listing went through
latest_saledate- the date of the latest purchase of the house, e.g. 20/4/19
latest_salemonth- the month of the latest purchase of the house, from 1 to 12
latest_saleyear- the year of the latest purchase of the house
latestPriceSource- the party that has provided the latest price of the house
numOfPhotos- the number of photos of the house listed on Zillow
numOfAccessibilityFeatures- the number of unique accessibility features mentioned in the listing
numOfAppliances- the number of unique appliances mentioned in the listing
numOfParkingFeatures- the number of unique parking features mentioned in the listing
numOfPatioAndPorchFeatures- the number of unique patio and/or porch features in the Zillow listing
numOfSecurityFeatures- the number of unique security features mentioned in the Zillow listing
numOfWaterfrontFeatures- the number of unique waterfront features mentioned in the Zillow listing
numOfWindowFeatures- the number of unique window aesthetics mentioned in the Zillow listing
numOfCommunityFeatures- the number of community features in the area of the house in the listing
lotSizeSqFt- the lot size of the property measured in square feet
livingAreaSqFt- the living area of the property measured in square feet
numOfPrimarySchools- the number of primary schools in the area of the house
numOfElementarySchools- the number of elementary schools in the area of the house
numOfMiddleSchools- the number of middle schools in the area of the house
numOfHighSchools- the number of high schools in the area of the house
avgSchoolDistance- the average distance between any school in the area and the house that is listed
avgSchoolRating- the average schools rating in the area of the house
avgSchoolSize- the average size of the schools in the area of the house
MedianStudentsPerTeacher- the median number of students per teacher present in the area of the home
numOfBathrooms- the number of bathrooms in the house
numOfBedrooms- the number of bedrooms in the house
numOfStories- the number of stories in the house
homeImage- the first image in the Zillow listing, the images are provided with the csv file
The columns and their datatypes

The Summary of Numeric Data



The Summary of Categorical data
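
The three summaries named on these slides (column datatypes, numeric summary, categorical summary) can be produced with a few pandas calls. The sketch below is only an illustration: the small DataFrame is made up and stands in for the real listing data, and the slides do not show which calls were actually used.

```python
import pandas as pd

# Tiny stand-in for the listing data; all values are invented for illustration.
df = pd.DataFrame({
    "zpid": [1, 2, 3, 4],
    "city": ["austin", "austin", "pflugerville", "del valle"],
    "homeType": ["Single Family", "Condo", "Single Family", "Townhouse"],
    "latestPrice": [350000.0, 220000.0, 410000.0, 275000.0],
    "hasGarage": [True, False, True, True],
})

# Columns and their datatypes.
dtypes = df.dtypes

# Summary of numeric data: count, mean, std, min, quartiles, max.
numeric_summary = df.describe()

# Summary of categorical data: count, unique, top (most frequent value), freq.
categorical_summary = df[["city", "homeType"]].describe()
```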


General Changes
1) First, the latest_saledate column had an object datatype. It was converted to datetime format using the pd.to_datetime function.

2) The columns description, homeImage, numOfPhotos and streetAddress were dropped, as they were of no use in this EDA project.

3) The column latestPriceSource had too many unique values, so all values other than the 2 most frequent (Agent provided, Broker provided) were renamed to “other”.

4) The column city also had too many unique values, so all values other than the 3 most frequent (austin, pflugerville, del valle) were renamed to “other”.

5) The column homeType also had too many unique values, so all values other than the 5 most frequent (SingleFamily, Condo, Vacantland, TownHouse, Multiple Occupancy) were renamed to “other”.
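
A minimal sketch of the general changes listed above, using pandas. The sample rows are made up, and the helper keep_top_n is a hypothetical name introduced here for illustration; it is not taken from the original notebook.

```python
import pandas as pd

# Made-up rows standing in for the real listing data.
df = pd.DataFrame({
    "latest_saledate": ["2019-04-20", "2020-07-01", "2021-01-15", "2018-11-30",
                        "2020-03-09", "2019-08-22", "2021-02-02"],
    "latestPriceSource": ["Agent Provided", "Broker Provided", "Agent Provided", "Zillow",
                          "Broker Provided", "Owner", "Agent Provided"],
    "description": ["a", "b", "c", "d", "e", "f", "g"],
    "numOfPhotos": [10, 5, 7, 3, 8, 12, 4],
})


def keep_top_n(series, n, other="other"):
    """Keep the n most frequent values of a column and relabel everything else."""
    top_values = series.value_counts().nlargest(n).index
    return series.where(series.isin(top_values), other)


# 1) Convert the sale date from text to a datetime column.
df["latest_saledate"] = pd.to_datetime(df["latest_saledate"])

# 2) Drop columns that are not used in the analysis.
df = df.drop(columns=["description", "numOfPhotos"])

# 3) Collapse rare categories into "other" (top 2 here; city and homeType
#    would use n=3 and n=5 in the same way).
df["latestPriceSource"] = keep_top_n(df["latestPriceSource"], n=2)
```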
02
Data Visualization

Latest mean price by sale year

The graph shows the mean price of houses grouped together by the year of
their latest respective sale/purchase.
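
The grouping behind such a graph can be sketched as follows. The prices are invented for illustration; only the column names latest_saleyear and latestPrice come from the dataset description.

```python
import pandas as pd

# Invented prices; only the column names come from the dataset description.
df = pd.DataFrame({
    "latest_saleyear": [2018, 2018, 2019, 2020, 2020, 2021],
    "latestPrice": [300000, 340000, 360000, 400000, 420000, 500000],
})

# Mean latest price for each year of latest sale.
mean_price_by_year = df.groupby("latest_saleyear")["latestPrice"].mean()

# With matplotlib installed, mean_price_by_year.plot(kind="bar") would draw the chart.
```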
The Latest Price per HomeType

This graph shows how the home type of a house affects the latest price of the house in the listing. We can see a trend: when the home type is either ‘Single Family’ or ‘Vacant Land’, the price of the house tends to be higher.
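
When comparing prices across home types, it can help to look at the median and the number of listings next to the mean, since a group mean built on few listings is less reliable. The sketch below uses invented values and is not the author's output.

```python
import pandas as pd

# Invented listings; the real dataset has many more rows per home type.
df = pd.DataFrame({
    "homeType": ["Single Family", "Single Family", "Single Family",
                 "Condo", "Condo", "Vacant Land"],
    "latestPrice": [400000, 450000, 500000, 250000, 270000, 900000],
})

# Mean, median and listing count per home type, highest mean first.
price_by_type = (
    df.groupby("homeType")["latestPrice"]
    .agg(["mean", "median", "count"])
    .sort_values("mean", ascending=False)
)
```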

How a number of features affect the Latest Price of the Homes


03
Data Cleaning & Feature
Engineering

Data Cleaning

1) Looking for duplicates:
I searched for duplicate rows using zpid, the primary key that is unique for every row, and found none.
2) Looking for missing values:
I searched every column for missing values and found none.
3) Data inconsistencies: Outliers were found in 3 different columns and were removed so that they would not distort the distributions of those columns.
4) The categorical columns were one-hot encoded into several bool columns.
5) The numeric columns with a skew value of more than 0.75 were log transformed to reduce their skew.
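
The five cleaning steps above can be sketched as follows. The data is made up, and the slides do not say how outliers were identified, so the 1.5 × IQR rule in drop_iqr_outliers (a hypothetical helper) is an assumption, as is the use of log1p for the log transform.

```python
import numpy as np
import pandas as pd

# Made-up rows; the last one has an implausibly large living area.
df = pd.DataFrame({
    "zpid": [101, 102, 103, 104, 105, 106],
    "homeType": ["Single Family", "Condo", "Single Family",
                 "Townhouse", "Condo", "Single Family"],
    "livingAreaSqFt": [1500.0, 900.0, 1800.0, 1200.0, 1000.0, 25000.0],
    "lotSizeSqFt": [4000.0, 4200.0, 4500.0, 5000.0, 12000.0, 6000.0],
})

# 1) Duplicates on the unique key, 2) missing values anywhere.
n_duplicates = int(df.duplicated(subset="zpid").sum())
n_missing = int(df.isna().sum().sum())


def drop_iqr_outliers(frame, column, k=1.5):
    """Drop rows whose value lies outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, q3 = frame[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask].copy()


# 3) Remove outliers (shown for one column).
df = drop_iqr_outliers(df, "livingAreaSqFt")

# 4) One-hot encode the categorical column.
df = pd.get_dummies(df, columns=["homeType"])

# 5) Log-transform numeric columns whose skew exceeds 0.75.
numeric_cols = ["livingAreaSqFt", "lotSizeSqFt"]
skewed_cols = [col for col in numeric_cols if df[col].skew() > 0.75]
for col in skewed_cols:
    df[col] = np.log1p(df[col])
```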
Feature Engineering:
In this section, we adjusted some features to fit the requirements of our linear regression model.

The Pearson correlation coefficient was used to uncover correlations between the features and our target variable. In addition, features that obviously affect the price of a house were added to the feature list.

As the linear regression model expects the feature/target relationship to be linear, we also visualized this relationship to detect any non-linear relationship between features and target.

Three relationships were found to be non-linear. For these, polynomial terms were added through the Polynomial Features method so that a linear model could fit them.
04
Hypothesis Testing
Hypothesis #1

Null: Home type does not affect the price of houses.
Alternate: Home type does affect the price of houses.

Results:
We compared the mean prices of all the home types and rejected the null hypothesis at the 95% confidence level. As we can see, Vacant Land has the highest mean price of all home types.
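
The slides do not name the statistical test that was used. One common way to compare mean prices across several groups is a one-way ANOVA, sketched below with scipy.stats.f_oneway on invented prices (in thousands); treat it as an assumed stand-in rather than the author's method.

```python
import pandas as pd
from scipy import stats

# Invented prices in thousands of dollars, five listings per home type.
df = pd.DataFrame({
    "homeType": ["Single Family"] * 5 + ["Condo"] * 5 + ["Townhouse"] * 5,
    "latestPrice": [450, 470, 500, 520, 480,
                    250, 270, 260, 240, 280,
                    330, 310, 350, 340, 320],
})

# One array of prices per home type.
groups = [group["latestPrice"].to_numpy() for _, group in df.groupby("homeType")]

# One-way ANOVA: the null hypothesis is that all group means are equal.
f_stat, p_value = stats.f_oneway(*groups)

# Reject the null at the 5% significance level (95% confidence).
reject_null = p_value < 0.05
```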
Hypothesis #2

Null: The lot size does not affect the price of the house.
Alternate: The lot size affects the price of the house significantly.

Results: As we can see here, no clear correlation is established between lot size and the price of the house. Hence, we fail to reject the null hypothesis at the 95% confidence level.
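
A correlation check of this kind can be sketched with scipy.stats.pearsonr, which tests the null hypothesis of no linear correlation. The numbers below are invented and deliberately constructed to have no linear association; they are not the author's data or results.

```python
from scipy import stats

# Invented values arranged symmetrically so that the linear correlation is zero.
lot_size = [4000, 6000, 8000, 10000, 12000, 14000, 16000]   # square feet
price = [430, 390, 450, 410, 450, 390, 430]                 # thousands of dollars

# Pearson correlation coefficient and its two-sided p-value.
r, p_value = stats.pearsonr(lot_size, price)

# With p >= 0.05 there is no evidence of a linear relationship at 95% confidence.
reject_null = p_value < 0.05
```

Note that a Pearson test only speaks to linear association; a non-linear relationship could still exist.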
Hypothesis #3

Null: The total number of educational institutions in the area does not affect the latest price.
Alternate: The total number of educational institutions in the area affects the latest price.

Results: The predictor and the latest price do not share a linear relationship, so the predictor does not fit our linear regression model as-is. Through polynomial regression, however, we reject the null hypothesis: the number of educational institutions does affect the latest price.
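
One way to test a non-linear effect is to fit a polynomial and run the overall regression F-test, whose null hypothesis is that the predictor explains none of the variance in price. The sketch below is an assumption about how such a test could be run, not the author's procedure; the data is synthetic and n_schools is a hypothetical name for the total school count.

```python
import numpy as np
from scipy import stats

# Synthetic data: price (in thousands) is U-shaped in the number of schools.
rng = np.random.default_rng(1)
n_schools = rng.integers(1, 8, size=200).astype(float)
price = 300 + 40 * (n_schools - 4) ** 2 + rng.normal(0, 30, size=200)

# Fit a degree-2 polynomial by least squares.
degree = 2
coeffs = np.polyfit(n_schools, price, deg=degree)
fitted = np.polyval(coeffs, n_schools)

# Overall F-test: explained variance per model term vs. residual variance.
n = len(price)
ss_res = np.sum((price - fitted) ** 2)
ss_tot = np.sum((price - price.mean()) ** 2)
f_stat = ((ss_tot - ss_res) / degree) / (ss_res / (n - degree - 1))
p_value = stats.f.sf(f_stat, degree, n - degree - 1)

# Reject "no effect" at the 5% significance level.
reject_null = p_value < 0.05
```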
Suggestions for further analysis:

My first suggestion for further analysis would be to request more data, as I feel that the current predictors are not enough. Variables such as the price at which the house was last sold or the house quality would help in predicting the latest price of the houses. With more data, we can formulate more hypotheses and recognize more complex relationships. The quality of the data is neither very good nor very bad, as its main source was Zillow.

Questions?
THANK YOU
