2021 3rd International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI)
978-1-6654-1790-7/21/$31.00 ©2021 IEEE | DOI: 10.1109/MLBDBI54094.2021.00059

Decision Trees for Objective House Price Prediction

Zhishuo Zhang 1, *
1 Jinan New Channel, Jinan, Shandong, China
* guanghua.ren@gecacdemy.cn

Abstract—Different people buy houses of the same value at different prices, which often leads to dissatisfaction with housing prices and to unfair housing prices. To solve this problem, we designed an objective housing price prediction scheme based on a decision tree. First, we selected five important features based on the decision tree for subsequent modeling. Then we designed a housing price prediction model based on a decision tree. To obtain the optimal parameters, we used grid search. The results showed that the number of rooms is the most important factor affecting housing prices, followed by the local population's quality, geographic location, education, and crime rate. To verify the effectiveness of the decision tree scheme, we compared it with other advanced machine learning models. The implementation results show that our scheme achieves the best results.

Keywords—decision trees; machine learning; house price forecasting.

I. INTRODUCTION

With the rapid development of the country's economy in the past few years, housing price, which touches many livelihood issues, has become a pressing domestic economic problem. People buy houses at different prices because they do not thoroughly understand the house price system. Besides, house prices are not evaluated objectively because they are influenced by many factors, such as politics and population [1]. To promote fairness in house prices, alleviate people's psychological imbalance over house prices, and provide an objective way to assess house prices, house price forecasting is particularly important. Currently, machine learning methods provide superior performance for predicting house prices [2, 3]. Current house price forecasting systems are already quite strong, but it is still possible to increase accuracy and improve the prediction scheme. With increasing instability in the housing market, traditional statistical analysis has become less applicable. Forecasting house prices with computer systems, which can offer specific quantitative analysis, has become the main trend, divided into two research directions. 1) Traditional statistical methods: these methods predict house prices based on related principles [4, 5]. In the early days, it was considered more reliable to use various mathematical and statistical techniques to predict house prices [6]. GDP, currency, and population can be quantified into numbers, so using these indicators to make statistical regression forecasts remains a popular method. However, such data-driven statistics may result in one-sided, inaccurate, and biased correlations. 2) Machine learning-based house price prediction: machine learning has shown its strength in many fields [7-9]. This technique uses regression-based methods to predict housing prices. Azadeh et al. [10] introduced a hybrid algorithm based on fuzzy linear regression and fuzzy cognitive maps to solve the problem of prediction and optimization of housing market fluctuations. The experimental results showed that machine learning prediction of house prices could retrieve and combine more features to produce more reasonable and accurate house price predictions.

To provide accurate and objective house price prediction results, we designed a decision tree method, where the Boston housing dataset was used to train and test the model and evaluate its performance. To build a more effective decision tree model, we screened the important features based on the information gain of the decision tree and then built the housing price model based on these important features.

The remainder of this article is organized as follows. Section II describes the designed methodologies for house price prediction. Section III describes the experimental results. Section IV concludes this article.

II. METHODS

A. Dataset and data preprocessing

In this work, we used the Boston house price dataset [11] to verify our methods. The dataset contains information about housing prices in Boston, Massachusetts, USA, collected by the US Census Bureau. It is a small dataset of 506 cases. Each sample contains 14 attributes: the first 13 are used as feature inputs to predict Boston house prices, and the 14th is used as the label to be predicted, as shown in Table I.

TABLE I. THE FEATURE DESCRIPTION OF THE BOSTON HOUSE PRICE DATASET

Feature    Description
CRIM       Per-capita crime rate for each town
ZN         Proportion of residential land zoned for lots over 25,000 square feet
INDUS      Proportion of non-retail business acres in each town
CHAS       Charles River dummy variable (1 if the block borders the river; 0 otherwise)
NOX        Concentration of nitric oxide (parts per 10 million)
RM         Average number of rooms per dwelling
AGE        Proportion of owner-occupied units built before 1940
DIS        Weighted distance to Boston's five employment centers
RAD        Index of accessibility to radial highways
TAX        Full property tax rate per $10,000

PTRATIO    Student-teacher ratio by town
B          1000(Bk - 0.63)^2, where Bk is the proportion of Black residents in each town
LSTAT      Percentage of the population that is lower status
MEDV       Median value of owner-occupied homes, in $1,000 increments

Before building the model, data pre-processing should be carried out. Different evaluation indicators often have different scales and units of measurement, which affects the results of data analysis. To eliminate the influence of scale differences between indicators, data standardization (normalization) is needed to make the indicators comparable. Once the raw data have been normalized, the indicators are of the same order of magnitude and suitable for comprehensive comparative evaluation [12]. Therefore, we applied Min-Max normalization to transform the raw data into the range [0, 1]. The transformation function is as follows:

    X_new = (x - X_min) / (X_max - X_min)    (1)

where X_max is the maximum value of the sample data and X_min is the minimum value of the sample data.

The total number of samples is 506, and the data consist of 13 feature attributes. First, the dataset is divided into two parts: features X and labels Y. The label part Y is reshaped into one-dimensional data, which is convenient for the subsequent splitting process. Then the dataset is split into a test set and a training set at a ratio of 1:3.

B. Decision tree regressor

The decision tree is a classical machine learning method whose core idea is that the same (or similar) input produces the same (or similar) output. The purpose of the decision tree is to classify or regress samples with the same attributes by judging decisions over the samples' different attributes and assigning them to the next leaf node. A decision tree classifies data through a set of rules: it provides a rule-like approach to decide what values will be obtained under what conditions. There are two types of decision trees: classification trees for discrete variables, and regression trees for continuous variables [6].

C. Feature selection

The decision branching is plotted as a graph similar to the branches of a tree. In machine learning, a decision tree is a predictive model representing a mapping between features and labels. Typically, decision tree algorithms such as ID3, C4.5, and C5.0 use entropy to minimize the level of confusion in the system and achieve stable classification or regression performance. Based on the selected feature evaluation criterion, child nodes are generated recursively from top to bottom until the dataset is indivisible, at which point the decision tree stops growing. The recursive structure is the easiest way to understand the tree structure. Feature selection refers to the process of selecting a feature from the many features in the training data and using it as the splitting criterion for the current node. Different quantitative evaluation criteria for selecting features result in different decision tree algorithms. The five most important features were selected by testing the model.

D. House price prediction based on the decision tree regressor

A decision tree is a probabilistic model constructed over various possible scenarios. It is a good way to assess house price forecasting trends because housing price forecasts exhibit fairly clear threshold differences [13]. Therefore, in this work, we designed a decision tree to first select important features and then build a house price prediction model based on them.

Decision trees are prone to overfitting and generally need to be pruned to reduce the size of the tree structure and mitigate overfitting. There are two pruning techniques: pre-pruning and post-pruning.

According to the feature selection algorithm, decision tree models are divided into ID3, C4.5 [14], and CART [15]. The decision tree regression model used in this experiment uses the CART algorithm to split the features. The CART algorithm is dichotomous (binary) and can greatly improve the speed of decision tree generation, especially on large-scale datasets.

The full name of the CART algorithm is Classification and Regression Tree. It uses the Gini index as the splitting criterion (choosing the feature with the smallest Gini index) and also includes a post-pruning operation. Unpruned decision trees tend to have many branches and large sizes; to simplify decision trees and improve the efficiency of generating them, CART selects testing attributes based on the Gini coefficient, calculated as follows:

    Gini(p) = 1 - Σ_{k=1}^{K} p_k^2    (2)

where p_k is the probability that the judged node falls into the k-th value interval.

III. EXPERIMENTAL RESULTS

A. Evaluation indicators

To evaluate the effectiveness of the decision tree method, we compared it with linear regression and support vector regression (SVR) models. Two common metrics, MSE (Mean Square Error) and RMSE (Root Mean Square Error), were used as evaluation indicators [16].

The MSE is the mean square error: the squared differences between the true values of the data and the model's predicted values are summed and averaged:

    MSE = (1/m) Σ_{i=1}^{m} (ŷ_i - y_i)^2    (3)

where ŷ_i and y_i are the predicted house prices and real house prices, respectively.

The RMSE is the root mean square error, which measures the deviation between the observed and true values and is often used to measure the prediction results of machine learning models:

    RMSE = sqrt((1/m) Σ_{i=1}^{m} (ŷ_i - y_i)^2)    (4)
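As a brief illustrative sketch (not part of the original paper), the two metrics in Eqs. (3) and (4) can be computed directly with NumPy; the sample values below are made up for demonstration only:

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: average of squared residuals, Eq. (3)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean((y_pred - y_true) ** 2))

def rmse(y_true, y_pred):
    # Root mean squared error: square root of the MSE, Eq. (4)
    return float(np.sqrt(mse(y_true, y_pred)))

# Illustrative values only (not taken from the paper's experiments)
y_true = [24.0, 21.6, 34.7, 33.4]
y_pred = [25.0, 20.6, 33.7, 34.4]
print(mse(y_true, y_pred))   # 1.0
print(rmse(y_true, y_pred))  # 1.0
```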

B. Feature selection results

We designed a decision tree based on the Gini index and selected the five most important features for the Boston housing price prediction task; the results are shown in Figure 1. It can be seen that RM (the average number of rooms per dwelling) has the highest impact on house prices. This is reasonable because the number of rooms is directly proportional to the housing price. Besides, the lower-status population, the per-capita crime rate, geographic location, and the student-teacher ratio are also important factors for predicting house prices.

Figure 1. The feature importance of the Boston house price dataset based on the decision tree.
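The feature-ranking step above can be sketched with scikit-learn's impurity-based importances. This is a hedged illustration, not the paper's code: since the real Boston data are not bundled here, the arrays below are random stand-ins in which RM and LSTAT are made influential by construction, so the printed ranking will not match Figure 1:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Feature names from Table I; the data are synthetic placeholders.
FEATURES = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
            "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

rng = np.random.default_rng(0)
X = rng.random((506, 13))  # stand-in for the 506 normalized samples
# Synthetic target: price rises with RM (index 5), falls with LSTAT (index 12)
y = 20 + 5 * X[:, 5] - 3 * X[:, 12] + rng.normal(0, 0.1, 506)

tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# Gini/impurity-based importances; keep the five largest, as in this section
top5 = sorted(zip(FEATURES, tree.feature_importances_),
              key=lambda t: -t[1])[:5]
for name, score in top5:
    print(f"{name}: {score:.3f}")
```

On this synthetic target the two constructed signals (RM, LSTAT) dominate the ranking, which mirrors how the paper reads the largest importances off the fitted tree.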

C. Model evaluation

Based on the Boston housing price dataset, we compared the decision tree model with SVR and linear regression; the experimental results are shown in Table II.

TABLE II. THE PERFORMANCE COMPARISON BETWEEN MODELS

        Decision tree    Linear regression    SVR
MSE     26.62            62.80                44.63
RMSE    5.16             7.92                 6.68

Figure 2. The predicted performance comparison between the different methods. Here, y_verify and y_predict are the real house prices and the predicted house prices, respectively.

Based on the above results, it can be seen that the prediction ability of the decision tree is higher than that of the linear regression and SVR models, with the linear regression model having the worst prediction ability. To visualize the predictive ability of the models, a line graph of the true and predicted label values for the three models is plotted.
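The three-way comparison can be sketched as below. This is an assumed workflow with synthetic stand-in data (the real dataset is not bundled here), so the printed scores will not reproduce Table II; it only shows the fit-predict-score loop over the three model types:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Random stand-in for the 506-sample, 13-feature Boston data.
rng = np.random.default_rng(42)
X = rng.random((506, 13))
y = 20 + 5 * X[:, 5] - 3 * X[:, 12] + rng.normal(0, 0.5, 506)

# 1:3 test:train split, as described in Section II-A
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=4)

models = {
    "Decision tree": DecisionTreeRegressor(random_state=4),
    "Linear regression": LinearRegression(),
    "SVR": SVR(),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    score = mean_squared_error(y_te, pred)
    print(f"{name}: MSE={score:.2f}, RMSE={np.sqrt(score):.2f}")
```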

From Figure 2, it is easy to see that the predicted and true values of the decision tree model are by and large the closest, although the deviations at certain points are still relatively large. The linear regression and SVR models have larger deviations than the decision tree; where the decision tree deviates more, the deviations of the other two models tend to be even greater.

D. Tuning by grid search

First, we tune the decision tree to see whether we can get better results. As the decision tree has a large number of parameters, we chose the three parameters shown in Table III.

TABLE III. THE TUNING PARAMETERS FOR THE DECISION TREE

max_depth                Maximum depth of the tree
min_impurity_decrease    Minimum impurity decrease required to split a node
min_samples_leaf         Minimum number of samples required at a leaf node

Grid search is an exhaustive search method that specifies candidate parameter values and optimizes the parameters of the estimation function by cross-validation to obtain the optimal learning algorithm [14]. The possible values of each parameter are arranged and combined to produce a "grid" of all possible combinations. The combinations are then used for decision tree training, and the performance is evaluated by cross-validation. After the fitting function has tried all the parameter combinations, a suitable estimator is returned, automatically adjusted to the best combination of parameters, shown in Table IV.

TABLE IV. THE OPTIMAL PARAMETERS FOR THE DECISION TREE

max_depth                5
min_impurity_decrease    0.267
min_samples_leaf         11
random_state             4

IV. CONCLUSION

This work designed a house price prediction scheme based on an important-feature selection scheme and prediction models based on the decision tree. Through the screening of important features, it was found that the number of rooms is the most important feature in determining the housing price. In addition, the basic quality of the local population, education, public security, and geographic location are also very important features that affect housing prices. We selected these five important features and then designed the decision tree housing price prediction scheme. To prove the effectiveness of our scheme, we compared it with SVR and linear regression models. The experimental results show that decision trees provide effective solutions for house price forecasting. However, this experiment only used the given reference characteristics, while many more factors influence house prices. There are many more unknowns waiting to be explored in the future.

REFERENCES

[1] E. Ahmed and M. Moustafa, "House price estimation from visual and textual features," arXiv preprint arXiv:1609.08399, 2016.
[2] Y. Kang et al., "Understanding house price appreciation using multi-source big geo-data and machine learning," Land Use Policy, p. 104919, 2020.
[3] M. Thamarai and S. Malarvizhi, "House price prediction modeling using machine learning," International Journal of Information Engineering & Electronic Business, vol. 12, no. 2, 2020.
[4] K. Adam, P. Kuang, and A. Marcet, "House price booms and the current account," NBER Macroeconomics Annual, vol. 26, no. 1, pp. 77-122, 2012.
[5] M. Berlemann, J. Freese, and S. Knoth, "Dating the start of the US house price bubble: an application of statistical process control," Empirical Economics, vol. 58, no. 5, pp. 2287-2307, 2020.
[6] P. J. S. Chang, "House price, income and finance structure's mathematical model research," Special Zone Economy, p. 06, 2013.
[7] M. I. Jordan and T. M. Mitchell, "Machine learning: Trends, perspectives, and prospects," Science, vol. 349, no. 6245, pp. 255-260, 2015.
[8] B. T. Pham, T.-A. Hoang, D.-M. Nguyen, and D. T. Bui, "Prediction of shear strength of soft soil using machine learning methods," CATENA, vol. 166, pp. 181-191, 2018.
[9] D.-C. Feng et al., "Machine learning-based compressive strength prediction for concrete: An adaptive boosting approach," Construction and Building Materials, vol. 230, p. 117000, 2020.
[10] A. Azadeh, B. Ziaei, and M. Moghaddam, "A hybrid fuzzy regression-fuzzy cognitive map algorithm for forecasting and optimization of housing market fluctuations," Expert Systems with Applications, vol. 39, no. 1, pp. 298-315, 2012.
[11] Boston House Price Dataset. Available: http://t.cn/RfHTAgY
[12] W. W. Koczkodaj et al., "On normalization of inconsistency indicators in pairwise comparisons," International Journal of Approximate Reasoning, vol. 86, pp. 73-79, 2017.
[13] M. Adelino, A. Schoar, and F. Severino, "Credit supply and house prices: evidence from mortgage market segmentation," National Bureau of Economic Research, 2012.
[14] B. Hssina, A. Merbouha, H. Ezzikouri, and M. Erritali, "A comparative study of decision tree ID3 and C4.5," International Journal of Advanced Computer Science and Applications, vol. 4, no. 2, pp. 13-19, 2014.
[15] L. Rutkowski, M. Jaworski, L. Pietruczuk, and P. Duda, "The CART decision tree for mining data streams," Information Sciences, vol. 266, pp. 1-15, 2014.
[16] T. Chai and R. R. Draxler, "Root mean square error (RMSE) or mean absolute error (MAE)? - Arguments against avoiding RMSE in the literature," Geoscientific Model Development, vol. 7, no. 3, pp. 1247-1250, 2014.

