American Journal of Engineering and Applied Sciences
Original Research Paper
Evaluating Patient Readmission Risk: A Predictive Analytics
Approach
Avishek Choudhury (1,2,3) and Dr. Christopher M. Greene (4)
1 Applied Data Science, Syracuse University, New York, USA
2 Process Improvement, UnityPoint Health, Iowa, USA
3 Systems Engineering, Stevens Institute of Technology, New Jersey, USA
4 Systems Science and Industrial Engineering, Binghamton University, NY, USA
Article history
Received: 31-10-2018
Revised: 26-11-2018
Accepted: 06-12-2018
Corresponding Author:
Avishek Choudhury
Applied Data Science,
Syracuse University, New
York, USA
Tel: +1 (515) 608-0777
Email: achoud02@syr.edu
Achoudh7@stevens.edu
Abstract: With the launch of the Hospital Readmissions Reduction
Program of the Centers for Medicare and Medicaid Services on October 1,
2012, forecasting unplanned patient readmission risk became crucial to the
healthcare domain. Several published studies focus on developing
readmission risk prediction models; however, these models are not
accurate enough to be deployed in an actual clinical setting. Our
study treats patient readmission risk as the objective for optimization
and develops a risk prediction model to address unplanned
readmissions. Furthermore, a Genetic Algorithm and a Greedy Ensemble are
used to optimize the developed models.
Keywords: Prediction Model, Patient Readmission Risk, Healthcare
Expenses, Healthcare Quality, Optimization Model
Introduction
Burgeoning healthcare expenses are a growing concern for the
federal budget of the United States
(Shipeng Yua, 2015). One of the main factors contributing
to healthcare cost is avoidable patient readmission.
Unplanned patient readmission has long been a significant
measure of care quality, and the Affordable Care
Act of 2010 introduced the Readmission Reduction
Program, which became effective on October 1, 2012.
According to the School of Public Health, the Veterans
Administration can save $2,140 per patient by managing
patients prone to readmission (Kathleen Carey, 2016).
Moreover, studies have shown that 15 to 25 percent
of discharged patients are readmitted in less than 30
days. According to the Agency for Healthcare Research
and Quality, about 1.8 million patients were readmitted
(Anika and Hines, 2014). Fierce Healthcare reported that
in 2011, hospitals spent $41.3 billion to treat unplanned
readmissions (Shinkman, 2014).
A study published by Harvard Business Review
stated that prioritized and effective communication with
the patient, together with compliance with evidence-based
care standards, could reduce the patient readmission rate by 5
percent (Claire Senot, 2015). However, fostering the desired
communication within a hospital is arduous because of the
complexity of the system.
Our study focuses on predicting patient readmission.
Individuals with a high risk of readmission can be
provided with alternative preventive measures such as
intensive post-discharge care or home care (Davood
Golmohammadi, 2015).
We define patient readmission as a readmission
caused by poor discharge planning that results in
recurrence of the treated disease and a worsening health
condition. Specifically, a patient readmission occurs when an
individual requires readmission within 90 days post-discharge
for the same cause for which she or he was originally
admitted to a hospital. We consider a 90-day window because
patients are susceptible to disease during the first three
months post-discharge, and individuals who have a mental
disorder are prone to suicidal behavior in this period
(Appleby, 2013).
Alarming Hospital Discharge Concerns
This section describes the three most crucial patient
discharge problems that encourage patient readmission.
Early Patient Discharge
The fundamental decision healthcare providers need
to make is whether an individual has recovered enough to
leave the hospital independently. Poor decision making at
this point compromises patient safety, resulting in emergency
© 2018 Avishek Choudhury and Dr. Christopher M. Greene. This open access article is distributed under a Creative
Commons Attribution (CC-BY) 3.0 license.
Avishek Choudhury and Dr. Christopher M. Greene / American Journal of Engineering and Applied Sciences 2018, 11 (4): 1320.1331
DOI: 10.3844/ajeassp.2018.1320.1331
readmission or sometimes death. In one reported case, “a man
died after a hospital failed to treat sepsis” and discharged
him prematurely (Ombudsman, 2003). According to Homeless
Link, more than 70% of underprivileged people were
discharged without housing arranged or their underlying
health conditions addressed (IHSMHL, 2012).
Poor Patient Assessment and Consulting Prior to Discharge
Patients who are physically fit enough are often not mentally
capable of coping at home. After discharge, these patients
often fail to continue their medications and their mental
health declines, which in turn raises the likelihood of
readmission. Such conditions are common among older adults who
cannot maintain their health independently because of
cognitive or financial constraints.
According to the King’s Fund, “being discharged without
proper support is an invitation to relapse, worsening of
the condition and readmission” (Maguire, 2015).
Absence of Home Care Plans
Insufficient communication and coordination
between hospitals and community healthcare providers
is another concern that needs attention. Because of
insufficient domestic healthcare facilities, discharged
patients with healthcare needs are left alone at
home, which leaves them susceptible to health
deterioration and emergency readmission. Between 2002
and 2012, 3,225 suicides were recorded by the
National Confidential Inquiry into Suicide and
Homicide by People with Mental Illness (2014). To
minimize such occurrences, the NHS recommends that
hospitals follow up with their discharged patients
within 7 days post-discharge and ensure the
availability of crisis support (Assessment, 2013).
Problem Statement
Several possible causes are responsible for
unplanned patient readmission. However, our study does
not focus on identifying the responsible cause; instead, it
provides an efficient prediction model that can be
deployed in a clinical setting to help healthcare units
prepare for unavoidable readmissions and
provide alternative care for preventable readmissions. The
proposed model provides healthcare providers with a
decision support system to identify individuals prone to
readmission and thus minimize early discharge and
ensure follow-up with discharged patients.
Systematic Literature Review
Design
The study conducts a systematic review of methods
and models used in predicting unplanned patient
readmission to meet our research motive (Dixon Woods,
2005). The literature search process comprised the
following three steps: (a) a systematic literature search
using electronic database searches and ‘snowballing’
(Greenhalgh, 2005), (b) identification of relevant papers
based on their title and (c) article selection based on
their abstract.
Search Strategy and Inclusion Criteria
The literature survey was limited to the following
databases: ACM Digital Library, ASME Digital
Collection, BIOSIS Citation Index, CINDAS
Microelectronics Packaging Material Database, CiteSeer,
Computer Database, Emerald Library, Energy and
Power Source, Engineering Village, IEEE Xplore,
MEDLINE, OSA Publishing, PubMed, Safari Books
Online, ScienceDirect, Sci-Finder, SPIE Digital Library
and Springer. The search had no time-period
constraint and the following material types were
considered: Articles, Newspaper Articles, Dissertations,
Conference Proceedings, Statistics Data Sets, Technical
Reports and Websites. The search keywords used were
“patient readmission,” “readmission risk,” “readmission
survey,” “readmission prediction” and “prediction
models.” All the papers that contained any of these
terms anywhere in the article were selected. Then,
based on their title, 104 papers were shortlisted.
Finally, after reading the abstracts, 33 peer-reviewed
articles were finalized as the references for this study.
Findings From the Literature Review
Miller et al. (1984) used multiple regression to
develop a five-year prediction model for patient
readmission. Their paper indicated that multiple
regression is a viable option for predictor variable
analysis. Hodgson et al. (2001) estimated readmission
rates for all psychiatric admissions in North
Staffordshire. Survival analysis, specifically the
log-rank test and Cox regression, was used to find the
factors that predicted readmission.
Betihavas et al. (2012) extended the scope of earlier
studies by including non-clinical factors such as
social instability. They also called out the need for
predictive analysis and the lack of such tools for
clinicians to use in risk assessment of readmission.
Allison (2012) studied variables that could potentially
predict readmission for patients previously
admitted to pulmonary rehabilitation. Apart from
eliminating excessive pain and unnecessary illness, the
study sought to reduce healthcare cost in general, using
discriminant and predictive analysis. Zheng et al.
(2015) used metaheuristic and data mining approaches,
including a neural network algorithm and SVM classifiers.
These models measured risk with
higher sensitivity and F-measure.
Bakal et al. (2014) generated a prediction and
risk evaluation model for the rehospitalization of Heart
Failure (HF) patients. Over five years of
follow-up, half the patients had returned for
rehospitalization. The study concluded that lengthening the
gap between hospitalizations should be an essential goal
when evaluating the quality of treatment. Ajorlou et al.
(2014) proposed a risk prediction model based on a
hierarchical nonlinear mixed-effect framework to identify
patients with a high likelihood of discharge non-compliance,
with the aim of decreasing Medicare costs and improving the
quality of care provided by hospitals. They applied stepwise
variable selection in the mixed-effect framework and extended
the (typical) random frailty model for the Weibull hazard
function with incorporated patient factors. Wang et al.
(2014) validated the use of the LACE index when
studying the readmission risk of patients with CHF.
Bayati et al. (2014) evaluated the cost-effectiveness
and efficiency of a methodology that combines
prediction and decision making. Machine learning
classifiers were applied to patient data to perform
the cost analysis. Inouye et al. (2015) was one of the
few studies that used patient self-reports as a means of
readmission risk assessment; an automated multi-call
follow-up system was implemented.
Amarasingham et al. (2015) focused
on Electronic Medical Record (EMR) models to assess
readmission. Kang et al. (2015) used retrospective
analysis and multivariate analysis to determine
readmission risk factors. Futoma et al. (2015) compared
several predictive models for predicting early
readmissions, analyzing the five
conditions that CMS uses to penalize hospitals. They used
logistic regression, penalized logistic regression,
random forests, support vector machines and deep neural
networks, and developed a framework for
assessing patient readmission risk. Random forests,
penalized regressions and deep neural
networks were found to be the best predictors. Shams et al.
(2015) developed a new metric for evaluating potentially
avoidable readmissions. They proposed a tree-based
classification method that factored in the patient’s
previous readmission history and the various risk factors
identified by the researchers. Pack et al.
(2016) focused specifically on readmission prediction for
patients undergoing heart valve surgery, using a generalized
predictive equation.
Turgeman and May (2016) developed a predictive
model for hospital readmissions using a boosted C5.0
tree and a Support Vector Machine as base and secondary
classifiers, respectively, in an attempt to balance the
readmission classification problem. Lewis et al. (2016)
compared the accuracy of two risk prediction
models, the Hospital Readmission Reduction Program
(HRRP) and the Risk-Standardized Readmission Rate
(RSRR) models. Tong et al. (2016) compared several
existing models on an all-cause, non-elective basis: LACE,
LASSO logistic, AdaBoost and STEPWISE logistic were
compared with varying sample sizes. Golmohammadi and
Radnia (2016) used neural networks, classification and
regression models and chi-square automatic interaction
detection for analysis. All models had an
overall accuracy of over 80%.
The latter two gave the user the additional ability to select
misclassification costs. C5.0 was used to find
recurring patterns in patient history. Low et al.
(2016) also compared their results with the LACE index. After
retrospective cohort analysis, like Kang et al. (2015), they
grouped the predictors into categories. Wang et al. (2016)
aimed to find the accuracy of Severity of Illness (SOI) and
Risk of Mortality (ROM), individually, in predicting
readmissions. Similar to Hogarth’s work, Mahajan et al.
(2016; 2017) created a regularized logistic regression model
for risk prediction on a thirty-day basis. This was yet
another study in the heart-related patient readmission
domain and was limited to risk prediction and the comparison
of risk prediction models (Mahajan et al.,
2016). Kroeger et al. (2018) determined whether the Pediatric
Early Warning Score before transfer may serve as a
predictor of unplanned readmission to the cardiac intensive
care unit. Jiang et al. (2018) used feature selection
algorithms and machine learning models to develop a risk
prediction system that is dynamic and accurate.
Several studies have implemented diverse modeling
methods to determine the factors that influence the
hospital patient readmission rate (Betihavas et al., 2012;
Davison et al., 2016; Golas et al., 2018; Hebert et al., 2014;
Lum et al., 2012; Shameer et al., 2017; Shams et al.,
2015; Wasfy et al., 2013; Yu et al., 2015).
Methodology
Our study does not involve the participation of any
patient. All analysis is based on anonymized data, which
ensures confidentiality. The dataset consists of 55
attributes and 100,000 instances and
represents 10 years of data collected from 130 US
hospitals (Avishek, 2018; Strack et al., 2014). Table 1
below shows the data distribution. The original database
contains incomplete, redundant and noisy information, as
expected in most real-world data (Strack et al.,
2014). Some attributes could not be used directly because
they had a high percentage of missing values: “weight”
(97% of values missing), “payer code” (40%) and “medical
specialty” (47%). The “weight” attribute was too sparse
and was excluded from further analysis.
“Payer code” was excluded because it had a high
percentage of missing values and was not considered
relevant to the outcome. The “medical specialty” attribute
was retained, with the value “missing” added
to account for missing values. The large percentage
of missing values for the “weight” attribute can be
explained by the fact that, prior to the HITECH
legislation of the American Recovery and Reinvestment
Act of 2009, hospitals and clinics were not required to
capture it in a structured format (Strack et al., 2014).
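The attribute handling described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the column names (`weight`, `payer_code`, `medical_specialty`) and the markers treated as missing (`None`, empty string, `"?"`) are assumptions made for the example.

```python
# Hypothetical column names and missing-value markers, chosen for illustration.
DROP_COLUMNS = {"weight", "payer_code"}
MISSING_MARKERS = {None, "", "?"}


def clean_record(record):
    """Drop the sparse attributes and label a missing specialty explicitly."""
    cleaned = {k: v for k, v in record.items() if k not in DROP_COLUMNS}
    if cleaned.get("medical_specialty") in MISSING_MARKERS:
        # Keep the attribute, but make "missing" a category of its own.
        cleaned["medical_specialty"] = "missing"
    return cleaned


example = {"weight": "?", "payer_code": "MC", "medical_specialty": "?", "age": 70}
print(clean_record(example))
```

Treating "missing" as its own category keeps every encounter in the analysis instead of discarding the 47% of records without a recorded specialty.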
Table 1: Data description
                                                                Number of     % of          Readmitted:   Readmitted:
Predictors                                                      encounters    population    encounters    % in group
HbA1c
  No test was performed                                         57,080        81.6%         5,324         9.4%
  Result was high and the diabetic medication was changed       4,071         5.8%          361           8.9%
  Result was high, but the diabetic medication was not changed  2,196         3.1%          166           7.6%
  Normal result of the test                                     6,637         9.5%          590           8.9%
Gender
  Female                                                        37,234        53.2%         3,462         9.3%
  Male                                                          32,750        46.8%         2,997         9.2%
Discharge disposition
  Home                                                          44,339        63.4%         3,184         7.2%
  Otherwise                                                     25,645        36.6%         3,275         12.8%
Admission source
  Emergency room                                                37,277        53.3%         3,563         9.6%
  Referrals                                                     22,800        32.6%         2,032         8.9%
  Otherwise                                                     9,907         14.2%         846           8.5%
Specialty of the admitting physician
  Internal medicine                                             10,642        15.2%         1,044         9.8%
  Cardiology                                                    4,213         6.0%          309           7.3%
  Surgery                                                       3,541         5.1%          284           8.0%
  General practice                                              4,984         7.1%          492           9.9%
  Missing values                                                33,641        48.1%         3,237         9.6%
  Other                                                         12,963        18.5%         1,093         8.4%
Primary diagnosis
  Circulatory system                                            21,411        18.5%         1,093         8.4%
  Diabetes                                                      5,747         8.2%          529           9.2%
  Respiratory system                                            9,490         13.6%         710           7.5%
  Digestive system                                              6,485         9.3%          532           8.2%
  Injury and poisoning                                          4,697         6.7%          524           11.2%
  Musculoskeletal and connective tissue problem                 4,076         5.8%          354           8.7%
  Genitourinary system disease                                  3,435         4.9%          313           9.1%
  Neoplasm                                                      2,536         3.6%          239           9.4%
  Other                                                         12,107        17.3%         1,129         9.3%
Race
  African American                                              12,626        18.0%         1,116         8.8%
  Caucasian                                                     52,300        74.7%         4,943         9.5%
  Other                                                         3,138         4.5%          256           8.2%
  Missing                                                       1,920         2.7%          144           7.5%
Age
  Less than or equal to 30 yrs.                                 1,808         2.6%          112           6.2%
  30-60 yrs.                                                    21,871        31.3%         1,614         7.4%
  Older than 60 yrs.                                            46,305        66.2%         4,733         10.2%

Predictors                                                      Mean          Median        1st Qu.       3rd Qu.
Age
  Age in years                                                  64.9          67            55            77
Time in hospital
  Days between admission and discharge                          4.3           3             2             6
The primary dataset contained multiple inpatient
visits for some patients, so the observations could not be
considered statistically independent, which is an assumption
of the logistic regression model. We therefore used only one
encounter per patient; in particular, we took
the first encounter for each patient as the primary
admission and determined whether or not the patient was
readmitted within 90 days.
Furthermore, we removed all encounters that resulted
in either discharge to a hospice or patient death. After
filtering, we were left with 69,984
encounters, which constituted the final dataset for analysis.
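The two filtering steps above can be sketched as follows. The field names (`patient_id`, `encounter_id`, `discharge`) and the disposition labels are hypothetical, and the sketch assumes that a smaller encounter identifier means an earlier visit, which the text does not state.

```python
# Hypothetical disposition labels for hospice discharge and death.
EXCLUDED_DISPOSITIONS = {"hospice", "expired"}


def first_encounters(encounters):
    """Keep each patient's earliest encounter, then drop hospice/death cases."""
    first = {}
    # Assumption: encounter_id increases with time, so sorting gives visit order.
    for enc in sorted(encounters, key=lambda e: e["encounter_id"]):
        first.setdefault(enc["patient_id"], enc)
    return [e for e in first.values()
            if e["discharge"] not in EXCLUDED_DISPOSITIONS]


visits = [
    {"patient_id": 1, "encounter_id": 10, "discharge": "home"},
    {"patient_id": 2, "encounter_id": 11, "discharge": "expired"},
    {"patient_id": 1, "encounter_id": 12, "discharge": "home"},
    {"patient_id": 3, "encounter_id": 13, "discharge": "home"},
]
print([e["encounter_id"] for e in first_encounters(visits)])
```

Keeping one row per patient is what restores the independence assumption mentioned above; the later visits are used only to decide the readmission label.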
The methodology employed in this study can be
broadly categorized into the following sections: Data
preprocessing, implementation of predictive models and
model optimization.
Data Preprocessing
Data preprocessing included the following steps: feature
selection, handling outliers, data balancing and data
partitioning.
Handling Outliers
The Median Absolute Deviation method was used to address
outliers (Avishek and Greene, 2018; Leys, 2013).
All missing values were replaced by the column mean.
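A minimal sketch of these two steps is given below. The cutoff of 2.5 and the consistency constant 1.4826 are common choices for the Median Absolute Deviation rule and are assumptions here; the text does not report the values used, nor what was done with the flagged observations, so the sketch only flags them.

```python
from statistics import mean, median


def mad_outlier_flags(values, threshold=2.5, scale=1.4826):
    """Flag values lying more than `threshold` scaled MADs from the median."""
    med = median(values)
    mad = scale * median(abs(v - med) for v in values)
    if mad == 0:
        # All deviations are zero at the median; nothing can be flagged.
        return [False] * len(values)
    return [abs(v - med) / mad > threshold for v in values]


def impute_with_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    column_mean = mean(observed)
    return [column_mean if v is None else v for v in values]


print(mad_outlier_flags([1, 2, 3, 4, 100]))
print(impute_with_mean([1.0, None, 3.0]))
```

The median-based rule is used instead of a mean and standard deviation rule because a single extreme value barely moves the median, so the outlier cannot hide itself by inflating the spread estimate.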
Feature Selection
Large datasets slow algorithms down and can
even degrade classification accuracy (Kohavi and John,
1997). The concern raised by data size is termed the
minimal-optimal problem (Nilsson, 2007). Our study
employs the Boruta algorithm and stepwise regression to
determine the best features within the dataset.
The Boruta algorithm is a wrapper built around the
random forest classification algorithm (Liaw and
Wiener, 2002). In this algorithm, the relevance of an
attribute is measured as the loss of classification
accuracy caused by permuting the attribute's values
among objects. Shuffling breaks any correlation
between the attribute and the response. Boruta then
computes a Z-score for each attribute by dividing the
mean accuracy loss by its standard deviation and uses
it as the measure of relevance. In addition to Boruta,
stepwise regression was also implemented.
Stepwise regression is an automatic
computational procedure in which input variables are
added for as long as they improve the performance of
the regression (Barnett et al., 1975; Campolongo et al.,
2000). It is a variant of
forward selection in which, after every step in which a
variable is added, all attributes already in the model
are re-examined for any loss in relevance. If an
irrelevant variable is found, it is removed from the
model. Stepwise regression requires two significance
levels: one for adding attributes and one for
eliminating attributes. The cutoff probability for
adding an attribute must be less than the cutoff
probability for eliminating attributes to avoid an
infinite loop (Mengchao Wang, 2013).
Data Balancing
The entire dataset was randomly shuffled.
Oversampling, undersampling and ROSE sampling
methods were applied to balance the response
variable (Al-Wesabi et al., 2018). Figure 1 compares
the different data sampling methods.
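Random under-sampling and over-sampling can be sketched as below. This is an illustrative version only: the label key `readmitted` is a hypothetical field name, and ROSE sampling, which generates synthetic examples, is not reproduced here.

```python
import random


def _group_by_label(rows, label_key):
    groups = {}
    for row in rows:
        groups.setdefault(row[label_key], []).append(row)
    return groups


def undersample(rows, label_key="readmitted", seed=0):
    """Randomly drop rows so every class matches the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(rows, label_key)
    target = min(len(g) for g in groups.values())
    balanced = [row for g in groups.values() for row in rng.sample(g, target)]
    rng.shuffle(balanced)
    return balanced


def oversample(rows, label_key="readmitted", seed=0):
    """Randomly duplicate rows so every class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_label(rows, label_key)
    target = max(len(g) for g in groups.values())
    balanced = [row for g in groups.values()
                for row in g + rng.choices(g, k=target - len(g))]
    rng.shuffle(balanced)
    return balanced


rows = [{"readmitted": 0}] * 90 + [{"readmitted": 1}] * 10
print(len(undersample(rows)), len(oversample(rows)))
```

Under-sampling discards majority-class rows and so shrinks the training set, while over-sampling keeps every row but repeats minority-class rows, which can encourage overfitting to those duplicates.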
Missing Value Imputation
The Median Absolute Deviation method was used to
address missing values (Avishek, 2018; Leys, 2013). This
method helps to limit the influence of outliers within the
dataset. The same concern can also be addressed by scaling
and normalizing the dataset between 0 and 1. Equation (1),
used in this study to perform data normalization, is
given below:
$$\mathrm{Normalized}(N_i) = \frac{N_i - E_{\min}}{E_{\max} - E_{\min}} \tag{1}$$
Where:
$E_{\min}$ = The minimum value for variable E
$E_{\max}$ = The maximum value for variable E
*Note: If $E_{\max}$ equals $E_{\min}$, then $\mathrm{Normalized}(N_i)$ is set to 0.5
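Equation (1) and its note translate directly into code. The sketch below is a plain-Python illustration for a single variable, with the function name chosen for the example.

```python
def min_max_normalize(values):
    """Scale one variable to [0, 1]; a constant variable maps to 0.5."""
    e_min, e_max = min(values), max(values)
    if e_max == e_min:
        # The note to Equation (1): avoid dividing by zero.
        return [0.5] * len(values)
    return [(v - e_min) / (e_max - e_min) for v in values]


print(min_max_normalize([2, 4, 6]))  # [0.0, 0.5, 1.0]
print(min_max_normalize([3, 3]))     # [0.5, 0.5]
```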
Data Partitioning
The balanced data was partitioned into training
(70%) and testing (30%). All predictive models were
fitted on the training data and the testing accuracy was
measured for model evaluation.
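A 70/30 partition of this kind can be sketched as follows. The fixed random seed is an assumption added so that the example is reproducible; the text does not say how the split was seeded.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and cut them into training and testing parts."""
    rng = random.Random(seed)
    shuffled = list(rows)          # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


train, test = train_test_split(range(100))
print(len(train), len(test))  # 70 30
```

Every row lands in exactly one of the two parts, so the testing accuracy is measured on data the fitted models have not seen.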
Implementation of Prediction Models
Algorithms such as random forest, support vector
machine, Recursive Partitioning and Regression Tree,
Gradient Boosting Method and General Linear Model were
used on the training data to predict the readmission risk.
Random Forest
Random Forests (RF) is an ensemble of decision trees that
employs a modified tree learning algorithm. At each
split during the learning phase, RF
randomly selects a subset of the input features. This
process, also called “feature bagging”, helps identify
the few attributes that most strongly influence the
target output while keeping accuracy high. Usually, for
a classification problem with ‘x’ features, √x
(rounded down) features are used in each split, whereas
for regression it is recommended to use
x/3 (rounded down) features per split and a minimum node
size of 5.
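The two rules of thumb for the number of candidate features per split can be written out as below. The function names are illustrative, and the lower bound of one feature is an added safeguard for very small feature sets.

```python
import math
import random


def features_per_split(n_features, task="classification"):
    """sqrt(x) rounded down for classification, x/3 rounded down otherwise."""
    if task == "classification":
        return max(1, math.isqrt(n_features))
    return max(1, n_features // 3)


def sample_split_candidates(feature_names, task="classification", rng=None):
    """Draw the random feature subset considered at one split."""
    rng = rng or random.Random(0)
    k = features_per_split(len(feature_names), task)
    return rng.sample(list(feature_names), k)


print(features_per_split(55))                 # 7
print(features_per_split(55, "regression"))   # 18
```

With the 55 attributes of the dataset, a classification forest would therefore consider 7 randomly chosen attributes at each split.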
Support Vector Machine
A Support Vector Machine (SVM) is a
discriminative classifier that uses a hyperplane to
separate different classes, either linearly or with a
radial kernel. In other words, in supervised learning, the
algorithm produces an optimal hyperplane that classifies
output into specific categories.
For a linear SVM, the prediction for an input is based on
the dot product between the input (x)
and each support vector ($x_i$), as shown in Equation (2):
$$f(x) = B_0 + \sum_i a_i \langle x, x_i \rangle \tag{2}$$
Equation 2 computes the inner products of an input
vector (x) with all support vectors in the training data. The
coefficients $B_0$ and $a_i$ are estimated from the training
data by the learning algorithm.
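Equation (2) can be evaluated directly once the support vectors and coefficients are known. The sketch below assumes they have already been estimated; the numeric values are made up for illustration. The polynomial and exponential kernels that this section goes on to define are included so that the same function covers them, with `degree` and `gamma` defaults that are assumptions.

```python
import math


def linear_kernel(x, xi):
    """Inner product <x, xi>."""
    return sum(a * b for a, b in zip(x, xi))


def polynomial_kernel(x, xi, degree=2):
    """(1 + <x, xi>) ** degree; the degree default is illustrative."""
    return (1 + linear_kernel(x, xi)) ** degree


def rbf_kernel(x, xi, gamma=0.5):
    """exp(-gamma * squared distance); the gamma default is illustrative."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xi)))


def decision_value(x, support_vectors, coefficients, intercept,
                   kernel=linear_kernel):
    """f(x) = B0 + sum_i a_i * K(x, x_i)."""
    return intercept + sum(
        a * kernel(x, sv) for a, sv in zip(coefficients, support_vectors)
    )


# Made-up support vectors and coefficients, for illustration only.
svs = [[1.0, 0.0], [0.0, 1.0]]
coefs = [0.5, -0.25]
print(decision_value([1.0, 2.0], svs, coefs, intercept=0.1))
```

The sign of the returned value gives the predicted class; swapping the `kernel` argument changes the shape of the decision boundary without changing the prediction formula.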
Fig. 1: Comparing data sampling methods (Original, Over, Rose and Under). In this figure the x-axis represents the performance measures (F1, precision, recall,
sensitivity and specificity). The y-axis shows their respective values, ranging from 0 to 1. A higher y value signifies
better performance
The SVM polynomial kernel and exponential kernel are
expressed as Equations (3) and (4), respectively:
$$K(x, x_i) = \left(1 + \sum x \cdot x_i\right)^{d} \tag{3}$$
$$K(x, x_i) = e^{-\gamma \sum (x - x_i)^2} \tag{4}$$
Gradient Boosting Method
Gradient boosting method (GBM) is a classification and
regression technique that combines several weak
prediction models, typically decision trees, into a
stronger one. Like any other supervised learning method,
GBM defines and minimizes a loss function. Equation (5)
shows the loss function:
$$\mathrm{Loss} = \mathrm{MSE} = \sum (y_i - y_i^p)^2 \tag{5}$$
where $y_i$ is the ith target value, $y_i^p$ is the ith
prediction and $L(y_i, y_i^p)$ is the loss function.
By implementing gradient descent, the minimum
MSE can be found using Equation (6). Because the gradient
of the loss with respect to a prediction is
$-2 (y_i - y_i^p)$, a descent step moves the prediction in
the direction of the residual:
$$y_i^p = y_i^p + \alpha \cdot 2 \cdot \sum (y_i - y_i^p) \tag{6}$$
where $\alpha$ is the learning rate and $\sum (y_i - y_i^p)$
is the sum of residuals.
General Linear Model
The General Linear Model (GLM) is defined by
Equation (7):
$$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_{n-1} X_{n-1} + \beta_n X_n \tag{7}$$
The βs in the GLM equation are the coefficients,
or weights, assigned to the input (predictor)
variables, i.e., the X’s on the right-hand side of the
prediction equation.
Recursive Partitioning and Regression Tree
The Recursive Partitioning and Regression Tree
(Rpart) algorithm splits the dataset recursively; that is,
splitting continues until a given termination
criterion is attained. The algorithm makes the best
decision at each splitting stage,
without any consideration of optimality in later
stages; in other words, the approach is greedy and ensures
only local optimality. As a result, deep trees are prone to
overfitting. However, overfitting can be limited by
growing shallower trees, either by terminating the algorithm
at a suitable point or by pruning the deep tree to the
desired criterion. Rpart follows the latter technique to
minimize overfitting.
Overfitting is minimized using Equations (8) and (9):
$$\min \left( C_\alpha(T) \right) \tag{8}$$
Equation 8 minimizes the cost $C_\alpha(T)$ assigned to
each tree $T$, which is the linear combination (see
Equation 9 below) of the error $R(T)$ and the number of leaf
nodes in the tree $|T|$:
$$C_\alpha(T) = R(T) + \alpha |T| \tag{9}$$
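The trade-off expressed by Equations (8) and (9) can be illustrated with a small calculation. The candidate subtrees and their errors below are invented for the example, and the tuple layout (name, R(T), |T|) is an assumption of the sketch.

```python
def cost_complexity(error, n_leaves, alpha):
    """C_alpha(T) = R(T) + alpha * |T|."""
    return error + alpha * n_leaves


def best_subtree(candidates, alpha):
    """Pick the candidate (name, R(T), |T|) with the lowest C_alpha(T)."""
    return min(candidates, key=lambda t: cost_complexity(t[1], t[2], alpha))


# Invented pruning candidates: (name, error R(T), number of leaves |T|).
candidates = [("deep", 0.10, 20), ("medium", 0.14, 8), ("stump", 0.30, 2)]
for alpha in (0.0, 0.01, 0.1):
    print(alpha, best_subtree(candidates, alpha)[0])
```

With α = 0 the deepest tree wins because only the error counts; as α grows, each extra leaf costs more and progressively smaller trees are preferred, which is how pruning limits overfitting.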
Performance Measures
To analyze and compare the models, we considered
accuracy, sensitivity and specificity (Choudhury, 2018).
Accuracy (ACC) is the number of
correct predictions divided by the total number of
instances in the dataset. Accuracy varies from 0 to 1,
where 1 is the best possible result.
Sensitivity (SN) is the count of true (correct) positive
predictions divided by the total number of positives. The
value of SN varies from 0 to 1, where 1 is the best
possible sensitivity.
Specificity (SP) is the number of true (correct)
negative predictions divided by the total number of
negatives. The value of SP varies from 0 to 1, where 1 is
the best possible specificity.
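The three measures follow from the confusion-matrix counts. The sketch below assumes binary labels with 1 as the positive (readmitted) class; the example labels are invented.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def sensitivity(y_true, y_pred):
    tp, _, _, fn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)   # correct positives over all actual positives


def specificity(y_true, y_pred):
    _, fp, tn, _ = confusion_counts(y_true, y_pred)
    return tn / (tn + fp)   # correct negatives over all actual negatives


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(accuracy(y_true, y_pred), sensitivity(y_true, y_pred),
      specificity(y_true, y_pred))
```

Reporting sensitivity and specificity alongside accuracy matters for readmission data, because a model that always predicts "not readmitted" can still score a high accuracy on an imbalanced sample.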
Table 2: Parameters for genetic algorithm
GA characteristic               Setting
Genes’ alleles in chromosome    Real values; chromosome length = No. of variables
Population                      Random (uniform) population of real values; size = 15
Selection strategy              Linear-rank selection
Crossover method                Local arithmetic crossover; crossover probability = 0.8
Mutation method                 Uniform random mutation; mutation probability = 0.1
Replacement strategy            Elitism by 5%
Termination strategy            No. of generations = 15
Constraint handling             Constraints repair mechanism; bound of [-10, 10]
Model Optimization
A Genetic Algorithm and a Greedy Ensemble algorithm
were implemented to obtain the best-fit model. A Genetic
Algorithm (GA) is a randomized but global search of the
solution space, inspired by the natural behavior of
chromosomes in transmitting characteristics (genes)
from one generation to the next. Genes are
updated randomly through crossover and
mutation strategies across generations to produce a
chromosome that best represents the optimal
solution (Rowe, 2015). This study used GA to
tune the set of classifiers that provided the best
performance among the data mining methods applied in the
previous section, in order to enhance predictive accuracy
and to investigate the effect of the classifiers’ parameters
on their performance. The GA characteristics applied
are listed in Table 2.
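A real-valued GA with the settings of Table 2 can be sketched as below. This is an illustrative implementation, not the authors' code: the fitness function is a stand-in supplied by the caller (in the study it would be the accuracy of a classifier trained with the candidate parameters), mutation is applied per gene, and clipping is used as the repair mechanism. All three are assumptions.

```python
import random

LOW, HIGH = -10.0, 10.0                      # bounds from Table 2
POP_SIZE, GENERATIONS = 15, 15
P_CROSSOVER, P_MUTATION = 0.8, 0.1
N_ELITE = max(1, round(0.05 * POP_SIZE))     # 5% elitism, at least one


def repair(chromosome):
    """Constraint repair (assumed): clip every gene back into the bounds."""
    return [min(HIGH, max(LOW, g)) for g in chromosome]


def rank_select(ranked, rng):
    """Linear-rank selection; `ranked` is sorted from worst to best."""
    weights = range(1, len(ranked) + 1)
    return rng.choices(ranked, weights=weights, k=1)[0]


def local_arithmetic_crossover(a, b, rng):
    """Blend two parents with a separate random weight for each gene."""
    ws = [rng.random() for _ in a]
    child1 = [w * x + (1 - w) * y for w, x, y in zip(ws, a, b)]
    child2 = [(1 - w) * x + w * y for w, x, y in zip(ws, a, b)]
    return child1, child2


def mutate(chromosome, rng):
    """Uniform random mutation, applied gene by gene (assumed)."""
    return [rng.uniform(LOW, HIGH) if rng.random() < P_MUTATION else g
            for g in chromosome]


def run_ga(fitness, n_genes, seed=0):
    """Maximize `fitness` over real-valued chromosomes of length n_genes."""
    rng = random.Random(seed)
    population = [[rng.uniform(LOW, HIGH) for _ in range(n_genes)]
                  for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        ranked = sorted(population, key=fitness)     # worst first
        next_gen = ranked[-N_ELITE:]                 # elites survive unchanged
        while len(next_gen) < POP_SIZE:
            p1, p2 = rank_select(ranked, rng), rank_select(ranked, rng)
            if rng.random() < P_CROSSOVER:
                c1, c2 = local_arithmetic_crossover(p1, p2, rng)
            else:
                c1, c2 = p1[:], p2[:]
            next_gen.extend(repair(mutate(c, rng)) for c in (c1, c2))
        population = next_gen[:POP_SIZE]
    return max(population, key=fitness)


# Stand-in fitness with its optimum at (1, 1); replace with model accuracy.
def toy_fitness(c):
    return -sum((g - 1.0) ** 2 for g in c)


print(run_ga(toy_fitness, n_genes=2))
```

Because the elite chromosome is copied unchanged into each new generation, the best fitness found never decreases from one generation to the next, even though crossover and mutation are random.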
Results
The Boruta algorithm performed 18 iterations in 1.32 h and
identified 10 important attributes, as shown in Fig. 2.
Stepwise regression, however, identified “number of
inpatients”, “number of emergencies”, “number of
diagnoses”, “diabetes med”, “number of outpatients”,
“number of procedures” and “number of medications” as the
top seven influential factors, as shown in Fig. 3.
The important attributes identified by both Boruta
and stepwise regression were (a) number of medications,
(b) number of procedures, (c) number of emergencies,
(d) number of outpatients, (e) number of inpatients and
(f) number of diagnoses.
For further analysis we conducted an interaction effect
study. An interaction effect study analyzes how
multiple predictor variables, when considered together,
affect the response variable. It helps
establish a relationship not just between the predictors
and the response variable, but also among the predictor
variables themselves, as shown in Fig. 4. The blue and red
lines mark the low and high levels of one variable when it
is considered along with another variable. For
example, in the interaction plot of the number of
outpatients*number of inpatients, at the higher setting
of the number of inpatients (21), the mean of readmitted
decreases (approaching zero) as the number of outpatients
increases. This signifies that if a patient also becomes an
outpatient every time he or she is an inpatient, it
becomes less likely that he or she will be readmitted.
Similarly, there seems to be a change in behavior
when the number of inpatients and the number of diagnoses
are considered together, with the number of diagnoses as the
changing variable. When the number of diagnoses
is at its higher setting, it naturally means that the chances of
readmission will remain high with an increasing number of
inpatient visits for a particular patient. However, even at the
low setting, the mean of readmitted increases and crosses
over the high-setting line. This may signify that if a
particular patient is an inpatient again and again, but
there are no new diagnoses, the hospital may be failing
with respect to accuracy of judgment, which is why the chances
of readmission still increase.
Fig. 2: Boruta feature selection. Importance of each attribute (No. In-Patient, No. Emergency, No. out-Patient, No. Lab Procedures, No. Medications, Time in hospital, No. Procedures, No. Diagnosis, Age, Diabetes Med) compared with the shadow attributes (Shadow Max, Shadow Mean, Shadow Min)
Fig. 3: Stepwise regression variable importance (nsubsets, sqrt gcv and sqrt rss) for number_inpatient, number_diagnoses, number_outpatient, number_emergency, diabetesMed, num_procedures and num_medications
Fig. 4: Interaction effect analysis. Interaction plots for readmitted, describing how the mean of readmitted changes if you change the settings of two X variables (num_procedures, num_medications, number_outpatient, number_inpatient, number_diagnoses)
Table 3: Model performance after data processing
Algorithm                                     Accuracy (%)   Sensitivity   Specificity
Gradient boosting method                      96.12          0.94          0.97
General linear model                          96.35          0.94          0.96
Support vector machine                        97.00          0.98          0.96
Recursive partitioning and regression trees   96.90          1.00          0.96

Table 4: Model performance after implementing genetic algorithm
Algorithm                                     Optimized accuracy (%)   Optimized classifiers’ parameters
Gradient boosting method                      97.05                    -0.69, -2.36
General linear model                          97.05                    1.32, 1.64
Support vector machine                        97.04                    -4.46, 3.30
Recursive partitioning and regression tree    97.04                    0.50, 5.67

We applied under-sampling to reduce
the bias of the response variable. Table 3 shows the
model performance and testing accuracy of all the
selected algorithms.
To further enhance the performance of the models,
the Genetic Algorithm was used to obtain the optimized
performance shown in Table 4.
Since Genetic Optimization gave two best
models with the same output, we performed a greedy ensemble
to find the single best model.
The Gradient Boosting Method was found to be the best
after both Genetic Optimization and the Greedy
Ensemble Method, and we recommend GBM, with 98.50%
prediction accuracy, as the best-fit model for this dataset.
Moreover, “number of inpatients” was found to be the most
influential factor determining patient readmission risk.
However, patient age, diabetes, time spent in the hospital,
number of lab procedures, number of outpatients and the
number of emergencies also had significant relevance.
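The accuracy, sensitivity and specificity values reported in Table 3 follow the standard confusion-matrix definitions. The sketch below is a minimal illustration of those definitions, assuming a binary outcome in which 1 means readmitted. The function name and the toy labels are hypothetical and are not taken from the study's data or code.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity and specificity for binary labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        # Share of all cases classified correctly.
        "accuracy": (tp + tn) / len(pairs),
        # Share of truly readmitted cases that the model flags.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Share of non-readmitted cases that the model leaves unflagged.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Toy labels for illustration only (not study data).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

On an under-sampled test set, accuracy alone can hide an imbalance between the two error types, which is why sensitivity and specificity are reported alongside it.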
Discussion and Conclusion
The readmission rate is a quality evaluation metric
customarily used to extrapolate the quality of life index
of the patient population and the quality of healthcare
delivery (Shameer et al., 2017). Irrespective of the
developments in biomedical and healthcare research
practices, hospital quality control offices still use
traditional predefined sets of variables to infer the
probability of patient readmission (Shameer et al., 2017).
However, predictive analytics could provide evidence to
improve the quality of healthcare delivery. Uniting
predictive analytics with preventive measures would
allow patients, physicians and payers to contribute
proactively to managing health and wellness.
In this study, we implement a predictive analytical
approach to identify patients prone to readmission and
thus systematically reduce the number of avoidable
readmissions, which are mainly caused by patient non-compliance
with medication instructions or early discharge from
hospital. Our proposal can capture
both patient-based and population-based variations in hospital
readmissions. It incorporates patients with diverse health
concerns across 130 US hospitals. The novelty of our
method is that it directly incorporates patients’ history of
readmissions into the modeling framework along with other
demographic and clinical characteristics. We also verify
the effectiveness of the proposed approach by validating
training accuracy. The contributions of this paper
are (i) applying the Boruta algorithm and stepwise variable
selection and (ii) implementing genetic and greedy
ensemble algorithms to optimize our predictive models.
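The paper does not list the configuration of its genetic algorithm, so the following is only a generic sketch of how a real-coded genetic algorithm can search over a pair of classifier parameters, as in Table 4. The operators chosen here (tournament selection, blend crossover, Gaussian mutation, elitism), the function names, the bounds and the toy fitness function are all assumptions made for illustration. In the study's setting the fitness would be a classifier's validation accuracy.

```python
import random


def genetic_search(fitness, bounds, pop_size=30, generations=40,
                   mutation_sigma=0.3, seed=0):
    """Maximise `fitness` over real-valued parameter vectors within `bounds`."""
    rng = random.Random(seed)

    def clip(value, lo, hi):
        return max(lo, min(hi, value))

    def tournament(scored, k=3):
        # Pick k individuals at random and keep the fittest one.
        return max(rng.sample(scored, k), key=lambda pair: pair[0])[1]

    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        scored = sorted(((fitness(ind), ind) for ind in population),
                        key=lambda pair: pair[0], reverse=True)
        history.append(scored[0][0])
        # Elitism: carry the best individual over unchanged.
        next_population = [list(scored[0][1])]
        while len(next_population) < pop_size:
            p1, p2 = tournament(scored), tournament(scored)
            alpha = rng.random()
            # Blend crossover followed by Gaussian mutation, clipped to bounds.
            child = [alpha * a + (1 - alpha) * b for a, b in zip(p1, p2)]
            child = [clip(g + rng.gauss(0, mutation_sigma), lo, hi)
                     for g, (lo, hi) in zip(child, bounds)]
            next_population.append(child)
        population = next_population
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


def toy_fitness(params):
    # Hypothetical stand-in for validation accuracy; peaks at (1.0, -2.0).
    x, y = params
    return -((x - 1.0) ** 2 + (y + 2.0) ** 2)


best_params, best_history = genetic_search(
    toy_fitness, bounds=[(-5.0, 5.0), (-5.0, 5.0)])
```

Because the best individual is copied unchanged into each new generation, the best fitness recorded in `best_history` can never decrease from one generation to the next.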
Our study recommends the optimized gradient boosting
method for identifying the patients most likely to be
readmitted. Furthermore, the study emphasizes the
effectiveness of data preprocessing. It measures the
influence of data balancing, removing outliers and imputing
missing values on the classification accuracy. Our study
also produces the highest readmission prediction accuracy.
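Two of the preprocessing steps named above, imputing missing values and balancing the classes, can be sketched as follows. This is a minimal illustration under stated assumptions: median imputation and random under-sampling are common choices but are not confirmed as the exact procedures used in the study, outlier removal is omitted, and the helper names, column names and toy records are hypothetical.

```python
import random
from statistics import median


def impute_median(rows, column):
    """Return copies of `rows` with None in `column` replaced by the observed median."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [dict(r, **{column: fill}) if r[column] is None else dict(r)
            for r in rows]


def undersample(rows, label, seed=0):
    """Randomly drop rows from larger classes until all classes match the smallest."""
    rng = random.Random(seed)
    by_class = {}
    for r in rows:
        by_class.setdefault(r[label], []).append(r)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced


# Toy records for illustration only (not study data).
rows = [
    {"time_in_hospital": 3, "readmitted": 0},
    {"time_in_hospital": None, "readmitted": 0},
    {"time_in_hospital": 5, "readmitted": 0},
    {"time_in_hospital": 2, "readmitted": 0},
    {"time_in_hospital": 8, "readmitted": 1},
    {"time_in_hospital": 7, "readmitted": 1},
]
clean = impute_median(rows, "time_in_hospital")
balanced = undersample(clean, "readmitted")
```

Under-sampling discards majority-class records, so it is usually applied to the training split only, leaving the test split with its original class proportions.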
Future research could try
different variable selection techniques, such as LASSO or
the Nonnegative Garrote, for better subset regressions. Also, in
the presence of heavily right-censored data, it would be interesting to
consider health care cost measures from which it
may be possible to statistically estimate the mean
population cost of readmission.
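To illustrate why LASSO acts as a variable selection technique, the sketch below fits an L1-penalized least-squares model by cyclic coordinate descent. It is a minimal example, assuming centred predictors and response with no intercept. The function names, the penalty value and the toy design matrix are hypothetical, and the Nonnegative Garrote is not shown.

```python
def soft_threshold(value, lam):
    """Shrink `value` toward zero by `lam`; exactly zero inside [-lam, lam]."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/(2n)) * ||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent.

    Assumes the columns of X and the response y are already centred.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlate feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                pred_without_j = sum(X[i][k] * beta[k]
                                     for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / z if z else 0.0
    return beta


# Toy data: y depends on the first column only (illustrative values).
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.0, 2.0, -2.0, -2.0]
beta = lasso_coordinate_descent(X, y, lam=0.5)
```

In this toy case the coefficient of the irrelevant second column is set exactly to zero, while the first coefficient is shrunk from its least-squares value of 2.0 to 1.5 by the penalty.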
Author’s Contributions
Avishek Choudhury: Research, data collection,
analysis, data interpretation, figure formation, coding
and writing manuscript.
Dr. Christopher M. Greene: Manuscript writing
and formatting.
Ethics
All data was collected with the permission of the
organization, and patients’ medical and personal
information was secured.
Data Statement
The data used in this study can be retrieved from
DOI: 10.17632/nntck7ddgt.1.
URL: https://data.mendeley.com/datasets/nntck7ddgt/2
References
Ajorlou, S., I. Shams and K. Yang, 2014. Predicting
patient risk of readmission with frailty models in the
Department of Veteran Affairs. Proceedings of the
IEEE International Conference on Automation
Science and Engineering, Aug. 18-22, IEEE Xplore
Press, Taipei, Taiwan, pp: 576-581.
DOI: 10.1109/CoASE.2014.6899384
Allison, N., 2012. Examining the likelihood of
readmission to inpatient pulmonary rehabilitation
using a variety of predictors. ProQuest Dissertations
and Thesis, ProQuest Dissertations Publishing.
Al-Wesabi, Y.M.S., A. Choudhury and D. Won, 2018.
Classification of cervical cancer dataset.
Amarasingham, R., F. Velasco, B. Xie, C. Clark and Y.
Ma et al., 2015. Electronic medical record-based
multicondition models to predict the risk of 30-day
readmission or death among adult medicine patients:
Validation and comparison to existing models. BMC
Med. Inform. Dec. Mak., 15: 39-39.
DOI: 10.1186/s12911-015-0162-6
Anika, L. and M.L. Hines, 2014. Conditions with the
largest number of adult hospital readmissions by
payer. HCUP Statistical Brief# 172. Rockville:
Agency for Healthcare Research and Quality.
https://www.hcup-us.ahrq.gov/reports/statbriefs/sb172-Conditions-Readmissions-Payer.pdf
Appleby, L.K.N., 2013. The national confidential inquiry
into suicide and homicide by people with mental
illness. Healthwatch England.
Avishek, C. and C.M. Greene, 2018. Prognosticating
autism spectrum disorder using artificial neural
network: Levenberg-Marquardt algorithm. J.
Bioinform. Syst. Biol., 1: 001-010.
DOI: 10.26502/fjbsb001
Avishek, C.K., 2018. Decision support system for renal
transplantation. Proceedings of the IISE Annual
Conference, (AC’ 18), IISE, Orlando.
Bakal, J.A., F.A. Mcalister, W. Liu and J.A. Ezekowitz,
2014. Heart failure readmission: Measuring the
ever-shortening gap between repeat heart failure
hospitalizations. PLoS ONE, 9: 106494-106494.
DOI: 10.1371/journal.pone.0106494
Bayati, M., M. Braverman, M. Gillam, K.M. Mack and
G. Ruiz et al., 2014. Data-driven decisions for
reducing readmissions for heart failure: General
methodology and case study. PLoS ONE, 9: e109264-e109264.
DOI: 10.1371/journal.pone.0109264
Betihavas, V., P.M. Davidson, P.J. Newton, S.A. Frost
and P.S. Macdonald et al., 2012. What are the
factors in risk prediction models for
rehospitalisation for adults with chronic heart
failure? Australian Critical Care, 25: 31-40.
DOI: 10.1016/j.aucc.2011.07.004
Campolongo, F., S. Tarantola and A. Saltelli, 2000.
Sensitivity analysis as an ingredient of modeling.
Stat. Sci. DOI: 10.1214/ss/1009213004
Choudhury, C., 2018. Identification of cancer:
Mesothelioma’s disease using logistic regression
and association rule. Am. J. Eng. Applied Sci.
Claire Senot, A.C., 2015. What has the biggest impact on
hospital readmission rates. Harvard Business
Review.
Davison, B.A., M. Metra, S. Senger, C. Edwards and O.
Milo et al., 2016. Patient journey after admission for
acute heart failure: length of stay, 30-day
readmission and 90-day mortality. Eur. J. Heart
Failure. DOI: 10.1002/ejhf.540
Davood Golmohammadi, N.R., 2015. Prediction
modeling and pattern recognition for patient
readmission. Int. J. Product. Econom., 171: 151-161.
DOI: 10.1016/j.ijpe.2015.09.027
Dixon Woods, M.A., 2005. Synthesizing qualitative and
quantitative evidence: A review of possible
methods. J. Health Services Res. Policy, 10: 45-53.
DOI: 10.1177/135581960501000110
Futoma, J., J. Morris and J. Lucas, 2015. A comparison
of models for predicting early hospital readmissions.
J. Biomed. Inform., 56: 229-238.
DOI: 10.1016/j.jbi.2015.05.016
Golas, S.B., T. Shibahara, S. Agboola, H. Otaki and J.
Sato et al., 2018. A machine learning model to
predict the risk of 30-day readmissions in patients
with heart failure: A retrospective analysis of
electronic medical records data. BMC Med. Inform.
Dec. Mak. DOI: 10.1186/s12911-018-0620-z
Golmohammadi, D. and N. Radnia, 2016. Prediction
modeling and pattern recognition for patient
readmission. Int. J. Product. Econom.
DOI: 10.1016/j.ijpe.2015.09.027
Greenhalgh, 2005. Effectiveness and efficiency of search
methods in systematic reviews of complex evidence:
Audit of primary sources. BMJ, 331: 1064-1065.
DOI: 10.1136/bmj.38636.593461.68
Hebert, C., C. Shivade, R. Foraker, J. Wasserman and C.
Roth et al., 2014. Diagnosis-specific readmission
risk prediction using electronic health data: A
retrospective cohort study. BMC Med. Inform. Dec.
Mak. DOI: 10.1186/1472-6947-14-65
Hodgson, R.E., M. Lewis and A.P. Boardman, 2001.
Prediction of readmission to acute psychiatric units.
Soc. Psychiatry Psychiatric Epidemiol., 36: 304-309.
DOI: 10.1007/s001270170049
IHSMHL, 2012. Improving hospital admission and
discharge for people who are homeless. Inclusion
Health St Mungo's Homeless Link, Healthwatch
England.
Inouye, S., V. Bouras, E. Shouldis, A. Johnstone and Z.
Silverzweig et al., 2015. Predicting readmission of
heart failure patients using automated follow-up
calls. BMC Med. Inform. Dec. Mak., 15: 22-22.
DOI: 10.1186/s12911-015-0144-8
Jiang, S., K.S. Chin, G. Qu and K.L. Tsui, 2018. An
integrated machine learning framework for hospital
readmission prediction. Knowledge-Based Syst.,
146: 73-90. DOI: 10.1016/j.knosys.2018.01.027
Kang, C., K. Kim, J.H. Lee, Y.H. Jo and J.W. Park et al.,
2015. Predictors of revisit and admission after
discharge from an emergency department in acute
pyelonephritis. Hong Kong J. Emergency Med., 22:
154-162. DOI: 10.1177/102490791502200304
Kathleen Carey, T.S., 2016. The Cost of hospital
readmissions: Evidence from the VA. Health Care
Manage. Sci., 19: 241-248.
DOI: 10.1007/s10729-014-9316-9
Kohavi, R. and G.H. John, 1997. Wrappers for feature
subset selection. Artificial Intell., 97: 273-324.
DOI: 10.1016/S0004-3702(97)00043-X
Kroeger, A.R., J. Morrison and A.H. Smith, 2018.
Predicting unplanned readmissions to a pediatric
cardiac intensive care unit using pre-discharge
Pediatric Early Warning Scores. Congenital Heart
Dis., 13: 98-104. DOI: 10.1111/chd.12525
Lewis, C.M., Z.L. Cox, P. Lai and D.J. Lenihan, 2016.
Evaluation of two different models to predict heart
failure readmissions. Heart Lung: J. Acute Critical
Care, 45: 374-374.
DOI: 10.1016/j.hrtlng.2016.05.008
Leys, C., 2013. Detecting outliers: Do not use standard
deviation around the mean, use absolute deviation
around the mean. J. Exp. Soc. Psychol., 49: 764-766.
DOI: 10.1016/j.jesp.2013.03.013
Liaw, A. and M. Wiener, 2002. Classification and
Regression by random Forest. R News.
Lum, H.D., S.A. Studenski, H.B. Degenholtz and S.E.
Hardy, 2012. Early hospital readmission is a
predictor of one-year mortality in community-dwelling
older Medicare beneficiaries. J. General
Internal Med. DOI: 10.1007/s11606-012-2116-3
Maguire, D., 2015. The king's fund. Premature
discharge: Is going home early really a Christmas
gift? kingsfund.org.uk
Mahajan, S., P. Burman and M. Hogarth, 2016.
Analyzing 30-day readmission rate for heart failure
using different predictive models. Nurs. Inform.,
2016: 225: 143-147. PMID: 27332179
Mahajan, S.M., P. Burman, A. Newton and P.A.
Heidenreich, 2017. A validated risk model for 30-day
readmission for heart failure. Stud. Health
Technol. Inform., 245: 506-510. PMID: 29295146
Mengchao Wang, J.W., 2013. A comparison of
approaches to stepwise regression for global
sensitivity analysis used with evolution
optimization. Proceedings of the 13th Conference of
International Building Performance Simulation
Association, (PSA’ 13), France, pp: 2551-2558.
Miller, D.J., N.C. Beck and C. Fraps, 1984. Predicting
rehospitalization at a community mental health
center: A “double-crossed” validation. J. Clin.
Psychol., 40: 35-39.
DOI: 10.1002/1097-4679(198401)40:1<35::AID-JCLP2270400106>3.0.CO;2-8
Nilsson, R., 2007. Consistent feature selection for
pattern recognition in polynomial time. J. Mach.
Learn. Res.
Ombudsman, T.P.O., 2003. The parliamentary
ombudsman and the health service ombudsman
website. Parliamentary Ombudsman and the Health
Service Ombudsman.
Pack, Q.R., A. Priya, T. Lagu, P.S. Pekow and R.
Engelman et al., 2016. Development and validation
of a predictive model for short- and medium-term
hospital readmission following heart valve surgery.
J. Am. Heart Assoc., 5: e003544-e003544.
DOI: 10.1161/JAHA.116.003544
Rowe, J.E., 2015. Genetic Algorithms. In: Springer
Handbook of Computational Intelligence,
Kacprzyk, J. and W. Pedrycz (Eds.), Springer,
Berlin, pp: 825-844.
Shameer, K., K.W. Johnson, A. Yahi, R. Miotto and L.
Li et al., 2017. Predictive modeling of hospital
readmission rates using electronic medical record-wide
machine learning: A case-study using Mount
Sinai heart failure cohort. Biocomputing.
Shams, I., S. Ajorlou and K. Yang, 2015. A predictive
analytics approach to reducing 30-day avoidable
readmissions among patients with heart failure,
acute myocardial infarction, pneumonia, or COPD.
Health Care Manage. Sci., 18: 19-34.
DOI: 10.1007/s10729-014-9278-y
Shinkman, R., 2014. Questex. Fierce Healthcare.
https://www.fiercehealthcare.com/finance/readmissions-lead-to-41-3b-additional-hospital-costs
Shipeng Yua, F.F., 2015. Predicting Readmission Risk with
Institution-Specific Prediction Models. Artificial Intell.
Med., 65: 89-96. DOI: 10.1016/j.artmed.2015.08.005
Strack, B., J.P. Deshazo, C. Gennings, J.L. Olmo and S.
Ventura et al., 2014. Impact of HbA1c measurement
on hospital readmission rates: Analysis of 70,000
clinical database patient records. BioMed Res. Int.
DOI: 10.1155/2014/781670
Tong, L., C. Erdmann, M. Daldalian, J. Li and T. Esposito,
2016. Comparison of predictive modeling approaches
for 30-day all-cause non-elective readmission risk.
BMC Med. Res. Methodol.
DOI: 10.1186/s12874-016-0128-0
Turgeman, L. and J.H. May, 2016. A mixed-ensemble
model for hospital readmission. Artificial Intell.
Med., 72: 72-82.
DOI: 10.1016/j.artmed.2016.08.005
Wang, H., C. Johnson, R.D. Robinson, V.A. Nejtek and
C.D. Schrader et al., 2016. Roles of disease severity
and post-discharge outpatient visits as predictors of
hospital readmissions. BMC Health Services Res.,
16: 564-564. DOI: 10.1186/s12913-016-1814-7
Wang, H., R.D. Robinson, C. Johnson, N.R. Zenarosa
and R.D. Jayswal et al., 2014. Using the LACE
index to predict hospital readmissions in
congestive heart failure patients. BMC
Cardiovascular Disorders, 14: 97-97.
DOI: 10.1186/1471-2261-14-97
Wasfy, J.H., K. Rosenfield, K. Zelevinsky, R. Sakhuja and
A. Lovett et al., 2013. A prediction model to identify
patients at high risk for 30-day readmission after
percutaneous coronary intervention. Circulation:
Cardiovascular Quality and Outcomes.
Yu, S., F. Farooq, A. van Esbroeck, G. Fung and V.
Anand et al., 2015. Predicting readmission risk
with institution-specific prediction models.
Artificial Intell. Med.
DOI: 10.1016/j.artmed.2015.08.005
Zheng, B., J. Zhang, S.W. Yoon, S.S. Lam and M.
Khasawneh et al., 2015. Predictive modeling of
hospital readmissions using metaheuristics and data
mining. Expert Syst. Applic., 42: 7110-7120.
DOI: 10.1016/j.eswa.2015.04.066