Robust Prediction Model For Multidimensional and Unbalanced Datasets
Abstract:
Data mining is a promising field and is applied in multiple domains for its predictive capabilities. Real-world data cannot be readily used for data mining, as it suffers from the problems of multidimensionality, class imbalance and missing values, and its predictive capabilities are difficult for novice users to exploit. In particular, it is difficult for a beginner to find the relevant set of attributes in a large pool of data. This paper presents a Robust Prediction Model (RPM) that finds a relevant set of attributes, resolves the problems of unbalanced and multidimensional real-life datasets, and helps in finding patterns for informed decision making. The model is tested on five different datasets from the domains of health, education, business and fraud detection. The results showcase the robust behaviour of the model and its applicability across domains.
HTTPS://WWW.SSRN.COM/LINK/IJISMS-PIP.HTML
AUTHOR ELSEVIER-SSRN (ISSN: 1556-5068)
4th International Conference on Computers and Management (ICCM) 2018
The paper further describes the Robust Prediction Model (RPM) in detail in Section 2; Section 3 describes the types of datasets on which RPM is applied to test its robustness; Section 4 depicts the results obtained and compares the results of RPM with the PCA (Principal Component Analysis) method of dimensionality reduction; and Section 5 concludes the research with its future scope.

2. Robust Prediction Model (RPM)

A Robust Prediction Model (RPM) has been designed by integrating the unsupervised and supervised learning techniques of clustering and classification (Thakar, Mehta and Manisha, 2017) (Hu, 2017). It helps in predicting the target class of the dataset. The RPM model works in three phases. In the first phase, automated pre-processing is performed, where raw data is converted to a ready dataset that helps in better prediction through classification (Thakar, Mehta, & Manisha, 2016). It reduces dimensionality and finds a relevant set of attributes, which can then be readily used for better classification results. "k-means clustering" is applied to the attribute set for the purpose of finding a relevant set of attributes (Kim, 2005). In the second phase, the dataset received from the first phase undergoes ensemble vote classification: instead of choosing one method for classification, the voting ensemble method is used for improved classification accuracy. The third phase derives rules from the results obtained in the second phase. The Robust Prediction Model (RPM) thus deals with complex datasets, automatically selects a relevant set of attributes from a large pool of attributes and integrates learning algorithms to predict the target class. The model is scalable enough to be applied in any domain and thus offers an easy and generalized solution to real-life prediction problems. The three phases of the model are described below:

Phase I: At this stage, the dataset is pre-processed in an innovative way: the raw dataset is fed into the system to find a relevant set of attributes and is transformed into refined data that can be readily used for classification. The raw dataset is first balanced by Sample Bootstrapping, so that equal instances of each class are taken into consideration. Thereafter, k-means clustering is applied on the balanced dataset after transposing attributes to instances and instances to attributes. This results in sets of clusters (clusters of attributes), in which related attributes are clubbed together. All obtained clusters are re-transposed and tested for their capabilities: Simple CART classification is applied to every cluster (Denison, 1998), and the cluster with the best classification accuracy is selected and taken to the next level, where it is transposed again and k-means clustering is re-applied. This process can be repeated n times, until the chosen cluster produces no better results. After n levels of clustering, clusters are obtained by this process of filtration. Finally, the selected cluster is transposed and Chi-Square weighting is applied to select the top attributes, yielding a refined and transformed dataset with selected attributes. This automated approach enables fast and easy selection of relevant attributes from a large pool of attributes and also enhances the quality of the dataset for further classification, resulting in improved classification accuracy (Thakar, Mehta and Manisha, 2017).

Phase II: In the second phase, the transformed dataset derived from Phase I is used for classification. Instead of choosing one method for classification, the voting ensemble method is used for improved classification results. The Vote method uses the vote of each base learner for the classification of an instance; the prediction with the maximum votes is taken into consideration. It thus combines the predictions of the base learners into a single prediction by majority voting.

Phase III: The results obtained from Phase II become the basis for generating rules. The results produced by the classification methods help in writing rules for the dataset.

Abbreviations for the Algorithm of RPM:

Dun:       Variable to store the unprocessed/raw dataset
SB():      Method for Sample Bootstrapping
CL1..CLm:  Target classes of the raw dataset, where m is the total number of target classes
D1:        Sampled dataset after combining each target class
Append():  Method for appending instances
TRAN():    Method to transpose a matrix
Cx:        Selected cluster
D2:        Transposed dataset/cluster
DC1..DCn:  Dataset clusters after separating the clusters, where n is the total number of clusters
C1..Cn:    Re-transposed clusters, where n is the total number of clusters
Df:        Final dataset for classification
Pi:        Prediction accuracy of the dataset, where i is the iteration number
MAX():     Method to find the cluster with maximum prediction accuracy
CHI():     Method to find the chi-square weights of the cluster
SELECT():  Method to select v attributes from a dataset/cluster, where v is the number of attributes

The Algorithm of the RPM Model:

Step 1: Load the raw/unprocessed dataset (Dun) in the form of a matrix and initialise cluster Cx to Null and i=1.
Step 2: Perform Sample Bootstrapping SB() on Dun, i.e. SB(Dun, m), w.r.t. the number of classes (m) in Dun, where Dun ∈ CL1..CLm
Step 3: Select equal samples of each class CL1..CLm and create dataset D1, i.e. D1 = Append(SB(Dun, m))
Step 4: Transpose (TRAN()) the matrix received (D1 or Cx) and create dataset D2, i.e. D2 = TRAN(D1) or TRAN(Cx)
Step 5: Apply k-means clustering on D2, where k = number of classes in D2
Step 6: Filter the clusters of D2 w.r.t. cluster number and generate new datasets as data clusters DC1, DC2..DCn
Step 7: Transpose all data clusters:
        C1 = TRAN(DC1)
        C2 = TRAN(DC2)
        ...
        Cn = TRAN(DCn)
Step 8: Apply Simple CART classification to each data cluster C1..Cn
Step 9: Validate the performance of all clusters and find the cluster with maximum prediction accuracy, i.e. Cx = MAX(C1, C2..Cn)
Step 10: Apply Chi-square CHI(Cx) on the selected cluster and find the weight of each attribute.
Step 11: Select the top v attributes with maximum weights from the selected cluster and create the final dataset Df, i.e. Df = SELECT(CHI(Cx), v)
Step 12: Apply ensemble vote classification on Df and find the performance Pi with 10-fold cross-validation
Step 13: If Pi < Pi-1 or i == 1, then i = i+1 and GOTO Step 4 with Cx; else GOTO Step 14
Step 14: Generate rules from the results obtained, where Pi < Pi-1

These final steps check whether further iteration on the obtained cluster is required or not: if Pi < Pi-1, the cluster is taken for further iteration; otherwise Pi-1 is used for generating rules
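The core operators above can be sketched in a few lines of Python: TRAN() as a matrix transpose, the attribute clustering of Steps 4-7 as a naive k-means over attribute vectors, and the chi-square weighting CHI() as a Pearson chi-square statistic of an attribute against the class. This is an illustrative toy only, with made-up data and deterministic seeding; the authors run RPM inside RapidMiner, not this code.

```python
from collections import Counter

def transpose(matrix):
    # TRAN(): attributes become rows, instances become columns (and back)
    return [list(row) for row in zip(*matrix)]

def kmeans(points, k, iters=20):
    # Naive k-means with deterministic seeding (evenly spaced start points)
    idx = [i * (len(points) - 1) // (k - 1) for i in range(k)] if k > 1 else [0]
    centroids = [list(points[i]) for i in idx]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            assign.append(dists.index(min(dists)))
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def chi_square_weight(attr_values, labels):
    # CHI(): Pearson chi-square statistic of one attribute vs. the class
    n = len(labels)
    joint = Counter(zip(attr_values, labels))
    a_cnt, c_cnt = Counter(attr_values), Counter(labels)
    stat = 0.0
    for a in a_cnt:
        for c in c_cnt:
            expected = a_cnt[a] * c_cnt[c] / n
            stat += (joint.get((a, c), 0) - expected) ** 2 / expected
    return stat

# Rows are instances, columns are attributes; columns 0/1 are mutually
# correlated, as are columns 2/3.
data = [[1, 1, 10, 10], [2, 2, 20, 20], [3, 3, 30, 30], [4, 4, 40, 40]]
print(kmeans(transpose(data), k=2))                    # [0, 0, 1, 1]
print(chi_square_weight([0, 0, 1, 1], [0, 0, 1, 1]))   # 4.0 (informative)
print(chi_square_weight([0, 1, 0, 1], [0, 0, 1, 1]))   # 0.0 (independent)
```

Attributes sharing a cluster label would then be split into the DC1..DCn groups, and the chi-square weights drive the top-v SELECT() step.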
(Thakar, Mehta and Manisha, 2018), (Thakar, Mehta and Manisha, 2015). This enhances the quality of the dataset for classification and results in improved classification accuracy. Classification is then performed on the transformed dataset derived from the previous phase. The voting ensemble method is used to improve classification accuracy: instead of choosing one method for classification, an ensemble method is used (Catal & Nangir, 2017). Base learners are selected because they perform better than adjoining classifiers applied alone. The Vote method of ensemble classification employs the vote of each learner for the classification of an instance; the prediction with the maximum votes is taken into consideration. The base learners' predictions are combined into a final prediction by majority voting. The rules thus generated can be readily used for decision making. The user only has to feed in the raw dataset; the model will produce relevant rules and patterns that can be used for informed decision making. To test the model, it is applied to various publicly available datasets in different domains that share the common problems of multidimensionality, imbalance and large size with missing values.

3. Application of Robust Prediction Model

3.1. Datasets

To test the robustness of RPM, it is applied to various public datasets. All the datasets were taken from "The UCI Machine Learning Repository" (Dheeru, 2017). The datasets are described below.

DATASET A: The "Epileptic Seizure Recognition Data Set" records EEG readings at different points in time (Andrzejak, 2001). It is a pre-processed and restructured version of a commonly used dataset for epileptic seizure detection. All cases that fall in classes 2, 3, 4 and 5 are subjects who did not have an epileptic seizure; cases in class 1 had an epileptic seizure. The dataset was pre-processed to convert it into a binary-class dataset, with Class 1 denoting an epileptic seizure and the rest merged into Class 0, representing the absence of an epileptic seizure. The data is large, multidimensional and unbalanced, with very few instances belonging to the class of subjects having an epileptic seizure. The goal is to predict the presence or absence of epileptic seizures in subjects.

DATASET B: The "Student Performance Data Set" concerns student achievement in secondary education at two Portuguese schools. Two datasets, on Mathematics and on the Portuguese language, are provided (Silva, April 2008). The goal is to predict student performance in the two subjects separately. The class G3 has a strong correlation with the G2 and G1 attributes, because G3 is the final-year grade, issued in the third period, while G1 and G2 correspond to the first- and second-period grades. Predicting G3 without G2 and G1 is difficult, but such prediction is much more useful (Silva, April 2008). The model was applied to predict G3 without G1 and G2. The data were converted into a binary class (Pass and Fail) in both datasets: PASS if G3>=10 and FAIL if G3<10.

DATASET C: The "Turkiye Student Evaluation Dataset" is composed of a total of 5820 evaluation scores submitted by students of "Gazi University" in Ankara, Turkey (Gunduz, 2013) (Oyedotun, 2015). There were 28 course-specific questions along with 5 additional attributes. Q1-Q28 were all Likert-type, ranging over {1,2,3,4,5}. The target attribute "nb.repeat" takes 3 values (1, 2, 3) that indicate how many times the course will be repeated by the student.

DATASET D: The "Bank Marketing Dataset" is the result of direct marketing campaigns, conducted through phone calls, of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (Moro, 2014). The target class is binary, "YES" or "NO", as per the response of the client.

DATASET E: The "Polish Companies bankruptcy data" dataset is for bankruptcy prediction of Polish companies. The bankrupt companies were analyzed over the period 2000-2012, while the still-operating companies were evaluated from 2007-2013 (Zięba, 2016). The data was collected from the "Emerging Markets Information Service". Based on the collected data, 5 classification cases were distinguished, depending on the forecasting period; out of the 5, the model is tested on 3 datasets. 1st Year data: financial rates from the 1st year of the forecasting period with a class label indicating bankruptcy status after 5 years; it contains 7027 instances ("financial statements"), of which 271 represent bankrupted companies and 6756 represent firms that did not go bankrupt in the forecasting period. 2nd Year data: financial rates from the 2nd year of the forecasting period with a class label indicating bankruptcy status after 4 years; it contains 10173 instances, of which 400 represent bankrupted companies and 9773 represent firms that did not go bankrupt. 5th Year data: financial rates from the 5th year of the forecasting period with a class label indicating bankruptcy status after 1 year; it contains 5910 instances, of which 410 represent bankrupted companies and 5500 represent firms that did not go bankrupt.

3.2. Experimental Setup and Measures

RapidMiner Studio Educational Version 7.6.003 was used to implement the RPM. This version also extends and implements algorithms designed for the Weka mining tool. "10-fold cross-validation" is used as the estimation approach for finding classifier performance. Since there is no separate test dataset, this technique divides the training dataset into ten equal parts, of which nine are used as the training set for the machine learning algorithm and one part is used as the test dataset. The procedure is run ten times on the same dataset, so that every part acts as the test set exactly once. Classification accuracy is the number of correct predictions divided by the total number of predictions, multiplied by 100 to turn it into a percentage. In real-life problems with a large class imbalance, the accuracy paradox may arise: a model can predict the majority class for all instances and still achieve high classification accuracy (Chawla, 2009). The datasets used in this study suffer from this, hence accuracy alone may not be a perfect indicator of performance.
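The 10-fold scheme and the accuracy paradox described above can be illustrated with a small pure-Python sketch. The data and the deliberately degenerate majority-class "learner" below are toys, not the paper's setup (the actual experiments used RapidMiner's cross-validation operator):

```python
from collections import Counter

def k_fold_indices(n, k=10):
    # Split indices 0..n-1 into k contiguous folds of (near-)equal size
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(range(start, start + size))
        start += size
    return folds

def majority_fit(y_train):
    # Degenerate "learner": always predict the training majority class
    return Counter(y_train).most_common(1)[0][0]

def cross_val_accuracy(y, k=10):
    # Each fold acts as the test set once; the rest is the training set
    n, correct = len(y), 0
    for test_idx in k_fold_indices(n, k):
        test_set = set(test_idx)
        label = majority_fit([y[i] for i in range(n) if i not in test_set])
        correct += sum(y[i] == label for i in test_idx)
    return 100.0 * correct / n   # percentage, as reported in the tables

y = [0] * 80 + [1] * 20          # unbalanced binary dataset
print(cross_val_accuracy(y))     # 80.0 -- high accuracy, zero real skill
```

The 80% score despite the model never predicting the minority class is exactly the accuracy paradox the section warns about.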
Classification accuracy alone is not enough to judge the effectiveness of the model. Weighted Mean Precision (WMP) and Weighted Mean Recall (WMR) take False Positives and False Negatives into account and are used as additional measures of performance (Kumar & Rathee, 2011), (Sokolova, 2009). The benefit of these metrics is that they aggregate precision and recall over the result set (Elmishali et al., 2016), (Ori Bar-Ilan et al.). Kappa is another measure used in this study: the Kappa statistic is a chance-normalized measure of agreement (Witten, 2016).

3.3. Application of RPM on all Datasets (A-E)

RPM was applied to each dataset separately. The model works at three levels. The first level implements automated pre-processing, where the raw dataset is converted to refined data. The data is then taken to the second level for classification, using an ensemble voting method with the best classifiers for the respective case. The third level generates rules to facilitate decision making. Principal Component Analysis (PCA) is a well-known method of dimensionality reduction, widely used to find a relevant set of attributes from a dataset, and is considered a very versatile, old and popular technique in multivariate analysis for overcoming the problem of multidimensional data (Abdi, Hervé, and Lynne, 2010). To test the RPM model, a comparative analysis is done on all datasets (A-E): instead of applying the automated pre-processing of RPM to find the relevant set of attributes, PCA was applied in Phase 1 of RPM, followed by the ensemble vote classification of Phase 2 and the rule generation of Phase 3. Thus, only Phase 1 of RPM is replaced with PCA. The results clearly showcase improved performance when RPM is applied with all its phases.

4. Results

Table 1: Results of Dataset A with RPM

Parameter                  Result
Accuracy:                  95.21% +/- 0.66%
Kappa:                     0.904 +/- 0.013
Weighted Mean Recall:      95.21% +/- 0.66%
Weighted Mean Precision:   95.31% +/- 0.68%

The data suffered from the two intrinsic problems of multidimensionality and imbalance. RPM could overcome these problems, and the results are excellent, with 95.21% accuracy and a Kappa statistic of 0.904. This demonstrates the robustness of the model in dealing with datasets that are large, multidimensional and unbalanced in nature. The Ensemble Vote method combined four base classifiers, namely Random Tree, kStar, Simple CART and Random Forest (Breiman, 2001). Results were also obtained by applying PCA on the same dataset, followed by RPM Phases 2 and 3, and are shown in Table 2. Compared against the RPM results, the model (RPM) outperforms PCA. The graph in Fig. 2 depicts the comparative results.

Table 2: Results of Dataset A with PCA

Parameter                  Result
Accuracy:                  80.00% +/- 0.00%
Kappa:                     0.000 +/- 0.000
Weighted Mean Recall:      50.00% +/- 0.00%
Weighted Mean Precision:   40.00% +/- 0.00%
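The value of Kappa over raw accuracy shows up exactly in the PCA column of Table 2: 80% accuracy together with a Kappa of zero is the signature of a model that always predicts the majority class. A pure-Python sketch with toy numbers (not the paper's data) makes the point; note also that support-weighted mean recall coincides with overall accuracy:

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    # kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    n = len(y_true)
    po = sum(t == p for t, p in zip(y_true, y_pred)) / n
    t_cnt, p_cnt = Counter(y_true), Counter(y_pred)
    pe = sum(t_cnt[l] * p_cnt.get(l, 0) for l in t_cnt) / (n * n)
    return (po - pe) / (1 - pe)

def weighted_mean_recall(y_true, y_pred):
    # Per-class recall, weighted by class support
    n = len(y_true)
    total = 0.0
    for label, support in Counter(y_true).items():
        hits = sum(t == p == label for t, p in zip(y_true, y_pred))
        total += (hits / support) * (support / n)
    return total

# A degenerate model that always predicts the majority class 0:
y_true = [0] * 80 + [1] * 20
y_pred = [0] * 100
print(cohen_kappa(y_true, y_pred))           # 0.0 -- no skill beyond chance
print(weighted_mean_recall(y_true, y_pred))  # 0.8 -- inflated by imbalance
```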
Table 3: Results of dataset B (Portuguese) with RPM

Parameter                  Result
Accuracy:                  91.67% +/- 3.87%
Kappa:                     0.685 +/- 0.078

Parameter                  Result
Accuracy:                  62.54% +/- 4.82%
Kappa:                     0.114 +/- 0.095
Weighted Mean Recall:      55.43% +/- 4.39%
Weighted Mean Precision:   56.45% +/- 5.67%

Results showcase that the performance of RPM is better than that of PCA.

Fig. 4: RPM and PCA for DATASET B (Mathematics)

Table 7: Results of dataset C with RPM

Parameter                  Result
Accuracy:                  83.30% +/- 1.20%
Kappa:                     0.749 +/- 0.018
Weighted Mean Recall:      83.30% +/- 1.20%
Parameter                  Result
Accuracy:                  82.11% +/- 0.57%

Fig. 5: RPM and PCA for DATASET C

4.3. Results of DATASET D

Weighted Mean Recall:      91.75% +/- 0.40%
Weighted Mean Precision:   92.16% +/- 0.41%

Table 10: Results of Dataset D with PCA

Parameter                  Result
Accuracy:                  85.42% +/- 0.41%
Weighted Mean Recall:      54.20% +/- 0.59%
Weighted Mean Precision:   57.70% +/- 1.07%

The model was applied to the "Polish Companies bankruptcy data" dataset for bankruptcy after 1, 2 and 5 years, and the results are shown in Tables 11 to 13. The RPM model identifies the top 12 significant attributes through its automated pre-processing capability and achieves 93.12%, 87.49% and 90.94% accuracy for bankruptcy datasets 1, 2 and 5 respectively in predicting whether a company will go bankrupt in the forecasting period.

Kappa:                     0.750 +/- 0.020
Weighted Mean Recall:      87.49% +/- 1.02%

Table 13: Results of dataset E (Bankruptcy 5) with RPM

Parameter                  Result
Accuracy:                  90.94% +/- 1.25%
Weighted Mean Recall:      90.94% +/- 1.25%
Weighted Mean Precision:   92.35% +/- 0.90%
The Ensemble Vote method combined two base classifiers, namely kStar and IBk. Results obtained after applying PCA are depicted in Figs. 7 to 9.

Table 14: Results of Dataset E (Bankruptcy 1) with PCA

Accuracy:                  96.11% +/- 0.13%

Fig. 7: RPM and PCA for DATASET E (Bankruptcy 1)

Table 15: Results of Dataset E (Bankruptcy 2) with PCA

Parameter                  Result

Table 16: Results of Dataset E (Bankruptcy 5) with PCA

Parameter                  Result
Accuracy:                  93.13% +/- 0.11%
Kappa:                     0.026 +/- 0.028
Weighted Mean Precision:   69.91% +/- 23.85%

All the datasets worked well with the RPM model and attained accuracy levels greater than 83%, with Kappa statistics greater than 0.6. The results showcased a remarkable improvement in performance compared to PCA. A pictorial representation of all datasets in terms of accuracy percentage and Kappa statistics is depicted in Figs. 10 and 11.
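The majority-voting mechanics of the Ensemble Vote method can be sketched in a few lines. The three threshold "base learners" below are hypothetical stand-ins, not the kStar and IBk models used in the experiments, but the vote-counting logic is the same:

```python
from collections import Counter

def majority_vote(predictions):
    # One predicted label per base learner; the label with most votes wins
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(base_learners, x):
    # Each learner casts its vote for the instance x
    return majority_vote([clf(x) for clf in base_learners])

# Three toy "base learners": threshold rules on a single feature
clf_a = lambda x: 1 if x > 0.3 else 0
clf_b = lambda x: 1 if x > 0.5 else 0
clf_c = lambda x: 1 if x > 0.7 else 0

print(ensemble_predict([clf_a, clf_b, clf_c], 0.6))  # 1 (two of three vote 1)
print(ensemble_predict([clf_a, clf_b, clf_c], 0.4))  # 0 (only one votes 1)
```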