Robust Prediction Model For Multidimensional and Unbalanced Datasets
Abstract:
Data mining is a promising field and is applied in multiple domains for its predictive capabilities. Real-world data cannot be readily used for data mining, as it suffers from the problems of multidimensionality, class imbalance and missing values, and its predictive capabilities are difficult for novice users to exploit. In particular, it is difficult for a beginner to find the relevant set of attributes in a large pool of data. This paper presents a Robust Prediction Model (RPM) that finds a relevant set of attributes, resolves the problems of unbalanced and multidimensional real-life datasets, and helps in finding patterns for informed decision making. The model is tested on five different datasets from the domains of health, education, business and fraud detection. The results showcase the robust behaviour of the model and its applicability across domains.
HTTPS://WWW.SSRN.COM/LINK/IJISMS-PIP.HTML
AUTHOR ELSEVIER-SSRN (ISSN: 1556-5068)
4th International Conference on Computers and Management (ICCM) 2018
The paper further describes the Robust Prediction Model (RPM) in detail in Section 2; Section 3 describes the types of datasets on which RPM is applied to test its robustness; Section 4 depicts the results obtained and compares the results of RPM with the PCA (Principal Component Analysis) method of dimensionality reduction; and Section 5 concludes the research with its future scope.

2. Robust Prediction Model (RPM)

A Robust Prediction Model (RPM) has been designed by integrating the unsupervised and supervised learning techniques of clustering and classification (Thakar, Mehta and Manisha, 2017) (Hu, 2017). It helps in predicting the target class of the dataset. The RPM model works in three phases. In the first phase, automated pre-processing is performed, where raw data is converted to a ready dataset that helps in better prediction through classification (Thakar, Mehta, & Manisha, 2016). It reduces dimensionality and finds a relevant set of attributes, which can then be readily used for better classification results. "k-means clustering" is applied to the attribute set for the purpose of finding a relevant set of attributes (Kim, 2005). In the second phase, the dataset received from the first phase undergoes ensemble vote classification: instead of choosing one method for classification, the voting ensemble method is used for improved classification accuracy. The third phase derives rules from the results obtained in the second phase. The Robust Prediction Model (RPM) thus deals with complex datasets, automatically selects a relevant set of attributes from a large pool of attributes and integrates learning algorithms to predict the target class. The model is scalable enough to be applied in any domain and thus offers an easy and generalized solution to real-life prediction problems. The three phases of the model are described below:

Phase I: At this stage, the dataset is pre-processed in an innovative way: the raw dataset is fed into the system to find a relevant set of attributes and is transformed into refined data that can be readily used for classification. The raw dataset is first balanced by Sample Bootstrapping, so that equal instances of each class are taken into consideration. Thereafter, k-means clustering is applied on the balanced dataset after transposing attributes to instances and instances to attributes. This results in sets of clusters (clusters of attributes), in which related attributes are clubbed together. All obtained clusters are re-transposed and tested for their capabilities: Simple CART classification is applied to every cluster (Denison, 1998), and the cluster with the best classification accuracy is selected and taken to the next level, where it is transposed again and k-means clustering is re-applied. This process can be repeated n times, until the chosen cluster produces no better results. After n levels of clustering, clusters are obtained by this process of filtration. Finally, the selected cluster is transposed and Chi-Square weighting is applied to select the top attributes, yielding a refined and transformed dataset with selected attributes. This automated approach enables fast and easy selection of relevant attributes from a large pool of attributes and also enhances the quality of the dataset for further classification, resulting in improved classification accuracy (Thakar, Mehta and Manisha, 2017).

Phase II: In the second phase, the transformed dataset derived from Phase I is used for classification. Instead of choosing one method for classification, the voting ensemble method is used for improved classification results. The Vote method uses the vote of each base learner for the classification of an instance; the prediction with the maximum votes is taken into consideration. It thus combines the predictions of the base learners into a single prediction by majority voting.

Phase III: The results obtained from Phase II become the basis for generating rules. The results produced by the classification methods help in writing rules for the dataset.

Abbreviations for the Algorithm of RPM:

Dun:       Variable to store the unprocessed/raw dataset
SB():      Method for Sample Bootstrapping
CL1..CLm:  Target classes of the raw dataset, where m is the total number of target classes
D1:        Sampled dataset after combining each target class
Append():  Method for appending instances
TRAN():    Method to transpose a matrix
Cx:        Selected cluster
D2:        Transposed dataset/cluster
DC1..DCn:  Dataset clusters after separating the clusters, where n is the total number of clusters
C1..Cn:    Re-transposed clusters, where n is the total number of clusters
Df:        Final dataset for classification
Pi:        Prediction accuracy of the dataset, where i is the iteration number
MAX():     Method to find the cluster with maximum prediction accuracy
CHI():     Method to find the chi-square weights of the cluster
SELECT():  Method to select v attributes from a dataset/cluster, where v is the number of attributes

The Algorithm of the RPM Model:

Step 1: Load the raw/unprocessed dataset (Dun) in the form of a matrix and initialise cluster Cx to Null and i=1.
Step 2: Perform Sample Bootstrapping SB() on Dun, i.e. SB(Dun, m), w.r.t. the number of classes (m) in Dun, where Dun ∈ CL1..CLm
Step 3: Select equal samples of each class CL1..CLm and create dataset D1, i.e. D1 = Append(SB(Dun, m))
Step 4: Transpose (TRAN()) the matrix received (D1 or Cx) and create dataset D2, i.e. D2 = TRAN(D1) or TRAN(Cx)
Step 5: Apply k-means clustering on D2, where k = number of classes in D2
Step 6: Filter the clusters of D2 w.r.t. cluster number and generate new datasets as data clusters DC1, DC2..DCn
Step 7: Transpose all data clusters:
        C1 = TRAN(DC1)
        C2 = TRAN(DC2)
        ...
        Cn = TRAN(DCn)
Step 8: Apply Simple CART classification to each data cluster C1..Cn
Step 9: Validate the performance of all clusters and find the cluster with maximum prediction accuracy, i.e. Cx = MAX(C1, C2..Cn)
Step 10: Apply Chi-square CHI(Cx) on the selected cluster and find the weight of each attribute.
Step 11: Select the top v attributes with maximum weights from the selected cluster and create the final dataset Df, i.e. Df = SELECT(CHI(Cx), v)
Step 12: Apply ensemble vote classification on Df and find the performance Pi with 10-fold cross-validation
Step 13: If Pi < Pi-1 or i == 1, then i = i+1 and GOTO Step 4 with Cx; else GOTO Step 14
Step 14: Generate rules from the results obtained, where Pi < Pi-1

These final steps check whether further iteration on the obtained cluster is required or not: if Pi < Pi-1, the cluster is taken for further iteration; otherwise Pi-1 is used for generating rules
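The core operators above can be sketched in a few lines of Python: TRAN() as a matrix transpose, the attribute clustering of Steps 4-7 as a naive k-means over attribute vectors, and the chi-square weighting CHI() as a Pearson chi-square statistic of an attribute against the class. This is an illustrative toy only, with made-up data and deterministic seeding; the authors run RPM inside RapidMiner, not this code.

```python
from collections import Counter

def transpose(matrix):
    # TRAN(): attributes become rows, instances become columns (and back)
    return [list(row) for row in zip(*matrix)]

def kmeans(points, k, iters=20):
    # Naive k-means with deterministic seeding (evenly spaced start points)
    idx = [i * (len(points) - 1) // (k - 1) for i in range(k)] if k > 1 else [0]
    centroids = [list(points[i]) for i in idx]
    assign = [0] * len(points)
    for _ in range(iters):
        assign = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            assign.append(dists.index(min(dists)))
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def chi_square_weight(attr_values, labels):
    # CHI(): Pearson chi-square statistic of one attribute vs. the class
    n = len(labels)
    joint = Counter(zip(attr_values, labels))
    a_cnt, c_cnt = Counter(attr_values), Counter(labels)
    stat = 0.0
    for a in a_cnt:
        for c in c_cnt:
            expected = a_cnt[a] * c_cnt[c] / n
            stat += (joint.get((a, c), 0) - expected) ** 2 / expected
    return stat

# Rows are instances, columns are attributes; columns 0/1 are mutually
# correlated, as are columns 2/3.
data = [[1, 1, 10, 10], [2, 2, 20, 20], [3, 3, 30, 30], [4, 4, 40, 40]]
print(kmeans(transpose(data), k=2))                    # [0, 0, 1, 1]
print(chi_square_weight([0, 0, 1, 1], [0, 0, 1, 1]))   # 4.0 (informative)
print(chi_square_weight([0, 1, 0, 1], [0, 0, 1, 1]))   # 0.0 (independent)
```

Attributes sharing a cluster label would then be split into the DC1..DCn groups, and the chi-square weights drive the top-v SELECT() step.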
(Thakar, Mehta and Manisha, 2018), (Thakar, Mehta and Manisha, 2015). This enhances the quality of the dataset for classification and results in improved classification accuracy. Classification is then performed on the transformed dataset derived from the previous phase. The voting ensemble method is used to improve classification accuracy: instead of choosing one method for classification, an ensemble method is used (Catal & Nangir, 2017). Base learners are selected because they perform better than adjoining classifiers applied alone. The Vote method of ensemble classification employs the vote of each learner for the classification of an instance; the prediction with the maximum votes is taken into consideration. The base learners' predictions are combined into a final prediction by majority voting. The rules thus generated can be readily used for decision making. The user only has to feed in the raw dataset; the model will produce relevant rules and patterns that can be used for informed decision making. To test the model, it is applied to various publicly available datasets in different domains that share the common problems of multidimensionality, imbalance and large size with missing values.

3. Application of Robust Prediction Model

3.1. Datasets

To test the robustness of RPM, it is applied to various public datasets. All the datasets were taken from "The UCI Machine Learning Repository" (Dheeru, 2017). The datasets are described below.

DATASET A: The "Epileptic Seizure Recognition Data Set" records EEG readings at different points in time (Andrzejak, 2001). It is a pre-processed and restructured version of a commonly used dataset for epileptic seizure detection. All cases that fall in classes 2, 3, 4 and 5 are subjects who did not have an epileptic seizure; cases in class 1 had an epileptic seizure. The dataset was pre-processed to convert it into a binary-class dataset, with Class 1 denoting an epileptic seizure and the rest merged into Class 0, representing the absence of an epileptic seizure. The data is large, multidimensional and unbalanced, with very few instances belonging to the class of subjects having an epileptic seizure. The goal is to predict the presence or absence of epileptic seizures in subjects.

DATASET B: The "Student Performance Data Set" concerns student achievement in secondary education at two Portuguese schools. Two datasets, on Mathematics and on the Portuguese language, are provided (Silva, April 2008). The goal is to predict student performance in the two subjects separately. The class G3 has a strong correlation with the G2 and G1 attributes, because G3 is the final-year grade, issued in the third period, while G1 and G2 correspond to the first- and second-period grades. Predicting G3 without G2 and G1 is difficult, but such prediction is much more useful (Silva, April 2008). The model was applied to predict G3 without G1 and G2. The data were converted into a binary class (Pass and Fail) in both datasets: PASS if G3>=10 and FAIL if G3<10.

DATASET C: The "Turkiye Student Evaluation Dataset" is composed of a total of 5820 evaluation scores submitted by students of "Gazi University" in Ankara, Turkey (Gunduz, 2013) (Oyedotun, 2015). There were 28 course-specific questions along with 5 additional attributes. Q1-Q28 were all Likert-type, ranging over {1,2,3,4,5}. The target attribute "nb.repeat" takes 3 values (1, 2, 3) that indicate how many times the course will be repeated by the student.

DATASET D: The "Bank Marketing Dataset" is the result of direct marketing campaigns, conducted through phone calls, of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (Moro, 2014). The target class is binary, "YES" or "NO", as per the response of the client.

DATASET E: The "Polish Companies bankruptcy data" dataset is for bankruptcy prediction of Polish companies. The bankrupt companies were analyzed over the period 2000-2012, while the still-operating companies were evaluated from 2007-2013 (Zięba, 2016). The data was collected from the "Emerging Markets Information Service". Based on the collected data, 5 classification cases were distinguished, depending on the forecasting period; out of the 5, the model is tested on 3 datasets. 1st Year data: financial rates from the 1st year of the forecasting period with a class label indicating bankruptcy status after 5 years; it contains 7027 instances ("financial statements"), of which 271 represent bankrupted companies and 6756 represent firms that did not go bankrupt in the forecasting period. 2nd Year data: financial rates from the 2nd year of the forecasting period with a class label indicating bankruptcy status after 4 years; it contains 10173 instances, of which 400 represent bankrupted companies and 9773 represent firms that did not go bankrupt. 5th Year data: financial rates from the 5th year of the forecasting period with a class label indicating bankruptcy status after 1 year; it contains 5910 instances, of which 410 represent bankrupted companies and 5500 represent firms that did not go bankrupt.

3.2. Experimental Setup and Measures

RapidMiner Studio Educational Version 7.6.003 was used to implement the RPM. This version also extends and implements algorithms designed for the Weka mining tool. "10-fold cross-validation" is used as the estimation approach for finding classifier performance. Since there is no separate test dataset, this technique divides the training dataset into ten equal parts, of which nine are used as the training set for the machine learning algorithm and one part is used as the test dataset. The procedure is run ten times on the same dataset, so that every part acts as the test set exactly once. Classification accuracy is the number of correct predictions divided by the total number of predictions, multiplied by 100 to turn it into a percentage. In real-life problems with a large class imbalance, the accuracy paradox may arise: a model can predict the majority class for all instances and still achieve high classification accuracy (Chawla, 2009). The datasets used in this study suffer from this, hence accuracy alone may not be a perfect indicator of performance.
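The 10-fold scheme and the accuracy paradox described above can be illustrated with a small pure-Python sketch. The data and the deliberately degenerate majority-class "learner" below are toys, not the paper's setup (the actual experiments used RapidMiner's cross-validation operator):

```python
from collections import Counter

def k_fold_indices(n, k=10):
    # Split indices 0..n-1 into k contiguous folds of (near-)equal size
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(range(start, start + size))
        start += size
    return folds

def majority_fit(y_train):
    # Degenerate "learner": always predict the training majority class
    return Counter(y_train).most_common(1)[0][0]

def cross_val_accuracy(y, k=10):
    # Each fold acts as the test set once; the rest is the training set
    n, correct = len(y), 0
    for test_idx in k_fold_indices(n, k):
        test_set = set(test_idx)
        label = majority_fit([y[i] for i in range(n) if i not in test_set])
        correct += sum(y[i] == label for i in test_idx)
    return 100.0 * correct / n   # percentage, as reported in the tables

y = [0] * 80 + [1] * 20          # unbalanced binary dataset
print(cross_val_accuracy(y))     # 80.0 -- high accuracy, zero real skill
```

The 80% score despite the model never predicting the minority class is exactly the accuracy paradox the section warns about.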
Classification accuracy alone is not enough to judge the effectiveness of the model. Weighted Mean Precision (WMP) and Weighted Mean Recall (WMR) take False Positives and False Negatives into account and are used as additional measures of performance (Kumar & Rathee, 2011), (Sokolova, 2009). The benefit of these metrics is that they aggregate precision and recall over the result set (Elmishali et al., 2016), (Ori Bar-Ilan et al.). Kappa is another measure used in this study: the Kappa statistic is a chance-normalized measure of agreement (Witten, 2016).

3.3. Application of RPM on all Datasets (A-E)

RPM was applied to each dataset separately. The model works at three levels. The first level implements automated pre-processing, where the raw dataset is converted to refined data. The data is then taken to the second level for classification, using an ensemble voting method with the best classifiers for the respective case. The third level generates rules to facilitate decision making. Principal Component Analysis (PCA) is a well-known method of dimensionality reduction, widely used to find a relevant set of attributes from a dataset, and is considered a very versatile, old and popular technique in multivariate analysis for overcoming the problem of multidimensional data (Abdi, Hervé, and Lynne, 2010). To test the RPM model, a comparative analysis is done on all datasets (A-E): instead of applying the automated pre-processing of RPM to find the relevant set of attributes, PCA was applied in Phase 1 of RPM, followed by the ensemble vote classification of Phase 2 and the rule generation of Phase 3. Thus, only Phase 1 of RPM is replaced with PCA. The results clearly showcase improved performance when RPM is applied with all its phases.

4. Results

Table 1: Results of Dataset A with RPM

Parameter                  Result
Accuracy:                  95.21% +/- 0.66%
Kappa:                     0.904 +/- 0.013
Weighted Mean Recall:      95.21% +/- 0.66%
Weighted Mean Precision:   95.31% +/- 0.68%

The data suffered from the two intrinsic problems of multidimensionality and imbalance. RPM could overcome these problems, and the results are excellent, with 95.21% accuracy and a Kappa statistic of 0.904. This demonstrates the robustness of the model in dealing with datasets that are large, multidimensional and unbalanced in nature. The Ensemble Vote method combined four base classifiers, namely Random Tree, kStar, Simple CART and Random Forest (Breiman, 2001). Results were also obtained by applying PCA on the same dataset, followed by RPM Phases 2 and 3, and are shown in Table 2. Compared against the RPM results, the model (RPM) outperforms PCA. The graph in Fig. 2 depicts the comparative results.

Table 2: Results of Dataset A with PCA

Parameter                  Result
Accuracy:                  80.00% +/- 0.00%
Kappa:                     0.000 +/- 0.000
Weighted Mean Recall:      50.00% +/- 0.00%
Weighted Mean Precision:   40.00% +/- 0.00%
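The value of Kappa over raw accuracy shows up exactly in the PCA column of Table 2: 80% accuracy together with a Kappa of zero is the signature of a model that always predicts the majority class. A pure-Python sketch with toy numbers (not the paper's data) makes the point; note also that support-weighted mean recall coincides with overall accuracy:

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    # kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    n = len(y_true)
    po = sum(t == p for t, p in zip(y_true, y_pred)) / n
    t_cnt, p_cnt = Counter(y_true), Counter(y_pred)
    pe = sum(t_cnt[l] * p_cnt.get(l, 0) for l in t_cnt) / (n * n)
    return (po - pe) / (1 - pe)

def weighted_mean_recall(y_true, y_pred):
    # Per-class recall, weighted by class support
    n = len(y_true)
    total = 0.0
    for label, support in Counter(y_true).items():
        hits = sum(t == p == label for t, p in zip(y_true, y_pred))
        total += (hits / support) * (support / n)
    return total

# A degenerate model that always predicts the majority class 0:
y_true = [0] * 80 + [1] * 20
y_pred = [0] * 100
print(cohen_kappa(y_true, y_pred))           # 0.0 -- no skill beyond chance
print(weighted_mean_recall(y_true, y_pred))  # 0.8 -- inflated by imbalance
```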
Table 3: Results of dataset B (Portuguese) with RPM

Parameter                  Result
Accuracy:                  91.67% +/- 3.87%
Kappa:                     0.685 +/- 0.078

Parameter                  Result
Accuracy:                  62.54% +/- 4.82%
Kappa:                     0.114 +/- 0.095
Weighted Mean Recall:      55.43% +/- 4.39%
Weighted Mean Precision:   56.45% +/- 5.67%

Results showcase that the performance of RPM is better than that of PCA.

Fig. 4: RPM and PCA for DATASET B (Mathematics)

Table 7: Results of dataset C with RPM

Parameter                  Result
Accuracy:                  83.30% +/- 1.20%
Kappa:                     0.749 +/- 0.018
Weighted Mean Recall:      83.30% +/- 1.20%
Parameter                  Result
Accuracy:                  82.11% +/- 0.57%

Fig. 5: RPM and PCA for DATASET C

4.3. Results of DATASET D

Weighted Mean Recall:      91.75% +/- 0.40%
Weighted Mean Precision:   92.16% +/- 0.41%

Table 10: Results of Dataset D with PCA

Parameter                  Result
Accuracy:                  85.42% +/- 0.41%
Weighted Mean Recall:      54.20% +/- 0.59%
Weighted Mean Precision:   57.70% +/- 1.07%

The model was applied to the "Polish Companies bankruptcy data" dataset for bankruptcy after 1, 2 and 5 years, and the results are shown in Tables 11 to 13. The RPM model identifies the top 12 significant attributes through its automated pre-processing capability and achieves 93.12%, 87.49% and 90.94% accuracy for bankruptcy datasets 1, 2 and 5 respectively in predicting whether a company will go bankrupt in the forecasting period.

Kappa:                     0.750 +/- 0.020
Weighted Mean Recall:      87.49% +/- 1.02%

Table 13: Results of dataset E (Bankruptcy 5) with RPM

Parameter                  Result
Accuracy:                  90.94% +/- 1.25%
Weighted Mean Recall:      90.94% +/- 1.25%
Weighted Mean Precision:   92.35% +/- 0.90%
The Ensemble Vote method combined two base classifiers, namely kStar and IBk. Results obtained after applying PCA are depicted in Figs. 7 to 9.

Table 14: Results of Dataset E (Bankruptcy 1) with PCA

Accuracy:                  96.11% +/- 0.13%

Fig. 7: RPM and PCA for DATASET E (Bankruptcy 1)

Table 15: Results of Dataset E (Bankruptcy 2) with PCA

Parameter                  Result

Table 16: Results of Dataset E (Bankruptcy 5) with PCA

Parameter                  Result
Accuracy:                  93.13% +/- 0.11%
Kappa:                     0.026 +/- 0.028
Weighted Mean Precision:   69.91% +/- 23.85%

All the datasets worked well with the RPM model and attained accuracy levels greater than 83%, with Kappa statistics greater than 0.6. The results showcased a remarkable improvement in performance compared to PCA. A pictorial representation of all datasets in terms of accuracy percentage and Kappa statistics is depicted in Figs. 10 and 11.
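The majority-voting mechanics of the Ensemble Vote method can be sketched in a few lines. The three threshold "base learners" below are hypothetical stand-ins, not the kStar and IBk models used in the experiments, but the vote-counting logic is the same:

```python
from collections import Counter

def majority_vote(predictions):
    # One predicted label per base learner; the label with most votes wins
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(base_learners, x):
    # Each learner casts its vote for the instance x
    return majority_vote([clf(x) for clf in base_learners])

# Three toy "base learners": threshold rules on a single feature
clf_a = lambda x: 1 if x > 0.3 else 0
clf_b = lambda x: 1 if x > 0.5 else 0
clf_c = lambda x: 1 if x > 0.7 else 0

print(ensemble_predict([clf_a, clf_b, clf_c], 0.6))  # 1 (two of three vote 1)
print(ensemble_predict([clf_a, clf_b, clf_c], 0.4))  # 0 (only one votes 1)
```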