Abstract
Classification models trained on imbalanced datasets tend to be biased towards the majority class, resulting in reduced accuracy on minority classes. A common approach to this problem is to generate artificial data for the underrepresented classes; the Synthetic Minority Over-sampling Technique (SMOTE) algorithm and its variants are widely used for this purpose. In this paper, we propose a modification of the data generation mechanism called Iteration-based SMOTE (ISMOTE). Unlike SMOTE, the ISMOTE algorithm trains the model over multiple iterations. In each iteration, new samples are generated in the vicinity of suitable misclassified data and fed back into the classification model, improving classification accuracy over the course of the iterations. We compare the performance of ISMOTE with SMOTE and other commonly used oversampling algorithms. Our empirical results demonstrate that ISMOTE significantly improves the quality of the generated data compared to other oversampling methods. Additionally, we conduct experiments to verify the effect of the parameters on the model and provide suggestions for choosing appropriate values to improve performance.
1 Introduction
When learning classification models, datasets with balanced categories are ideal. In real-life scenarios, however, datasets are often imbalanced, as in medical classification and abnormal network traffic datasets [1,2,3,4,5,6]. Models trained on such imbalanced datasets tend to favor the categories with more samples during training, resulting in higher accuracy for those categories. Yet the smaller classes are usually of more interest to researchers, since they generally represent anomalies in normal situations: identifying abnormal traffic among normal traffic protects network security, and identifying ill patients among healthy ones can speed up diagnosis and drug development [7,8,9].
Researchers therefore focus on how to process class-imbalanced data to obtain a better classification model. Research on class-imbalanced classification mainly covers three aspects: (1) the data level, where the data is manipulated to balance the number of samples across classes [10,11,12]; (2) the algorithm level, where the algorithm is designed to pay more attention to the classification of the minority classes [13,14,15]; (3) the ensemble level, where multiple weak classifiers are combined to obtain a stronger classifier [16,17,18,19,20,21]. The algorithm-level and ensemble-level approaches build on the data-level approach, i.e., a good dataset is an important factor in determining the final performance of an algorithm. Therefore, the study of data-level algorithms is necessary.
Oversampling methods are widely used among data-level algorithms due to their superior performance. Oversampling balances the number of samples across classes by increasing the number of minority class samples. The most classical oversampling method is the Synthetic Minority Oversampling Technique (SMOTE), in which samples are selected using the nearest neighbor method and new samples are generated by combining two samples [22]. Various improved oversampling methods differ in how they select the initial samples to oversample, further improving classification accuracy. However, the selected samples remain somewhat random in nature. Since the goal is a more accurate model, why not oversample precisely the samples the model gets wrong?
Neural networks are classical machine learning algorithms that are trained iteratively to find the optimal neuron weights and thus obtain a more accurate model [23,24,25]. During training, data that was originally misclassified can come to be correctly identified by the model, continuously improving its accuracy. However, it is essential to note that overfitting can occur when the model is trained excessively, resulting in an undesirable model.
In imbalanced datasets, low classification accuracy often stems from the difficulty of correctly identifying minority class samples. Relying solely on random generation of sample points can be unreliable and may yield subpar results, so haphazardly generating minority class samples is inadvisable. Instead, to alleviate the adverse effects of data generation, it is more prudent to generate samples strategically, aiming to help the model recognize the distinctive features of the misclassified minority samples.
Therefore, a novel oversampling method is proposed that iteratively oversamples the safe error data during training to improve model accuracy. Here, error data refers to data that were misclassified in the previous classification pass; safe error data refers to error data judged not to be outliers. This method, called the "iteration-based Synthetic Minority Oversampling Technique (ISMOTE)," combines the iterative-training idea of neural networks with the oversampling process. ISMOTE iteratively oversamples misclassified data to obtain a new dataset, allowing the model to correctly classify the previously misclassified data and improve model accuracy. The contributions of this paper are as follows:
1. Proposing an iteration-based Synthetic Minority Oversampling Technique (ISMOTE).

2. Generating data not only for the minority classes but also for the majority classes.

3. Providing a framework based on SMOTE that is equally adaptable to other improved variants of SMOTE.

4. Showing, through experiments on 20 publicly available datasets, that ISMOTE can serve as a novel oversampling method.
2 Related Work
The present work is an extension of oversampling methods, so this paper reviews only the relevant oversampling techniques. The simplest oversampling technique is random oversampling, which balances the data by replicating minority class instances. Although this technique causes no information loss, replicating identical instances makes the model prone to overfitting.
To alleviate the overfitting problem caused by random oversampling, the SMOTE method was proposed. In the SMOTE method, one sample is randomly selected from the K most similar samples for each minority class sample. New data are then generated by linear interpolation between the two minority class samples. The SMOTE method considers only distance similarity, using the Euclidean distance to measure the degree of sample similarity. The K most similar samples are equivalent to the nearest neighbor samples.
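The interpolation step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the original SMOTE implementation; `smote_oversample` and its parameters are names chosen here, and the neighbor search is a brute-force Euclidean distance computation.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Sketch of classical SMOTE: for each new sample, pick a random
    minority point, find its k nearest minority neighbors (Euclidean
    distance), and interpolate linearly toward one of them."""
    rng = np.random.default_rng(seed)
    # pairwise Euclidean distances between minority samples
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    # k nearest neighbors of each point (skip column 0: the point itself)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        j = nn[i, rng.integers(nn.shape[1])]   # one of the k neighbors
        r = rng.random()                       # interpolation factor in [0, 1)
        new.append(X_min[i] + r * (X_min[j] - X_min[i]))
    return np.asarray(new)
```

Because each synthetic point lies on the segment between two real minority samples, the generated data stays inside the convex hull of the minority class.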
However, the similarity between two vectors can also be considered in terms of their angles or directions. From this perspective, the SMOTE-Cosine algorithm was proposed, which calculates sample similarity using Euclidean distance similarity while considering cosine similarity [26]. This algorithm allows for better access to the nearest neighbors of a small number of samples.
To strengthen the minority boundary samples and avoid generating noisy samples, the Borderline-SMOTE1 and Borderline-SMOTE2 methods were proposed. Both oversample only the borderline minority samples, i.e., those whose nearest neighbors are predominantly, but not exclusively, majority-class. They differ in how synthetic samples are generated: Borderline-SMOTE1 interpolates a borderline sample only with its minority-class neighbors, whereas Borderline-SMOTE2 also interpolates with majority-class neighbors. Although Borderline-SMOTE helps the algorithm better determine decision boundaries, it is more sensitive to noisy points in the data, which are easily misclassified as boundary samples that need to be synthesized, generating unnecessary synthetic samples and further exacerbating the over-synthesis problem.
After Borderline-SMOTE achieved better results, researchers found that oversampling sample points with certain properties may work even better. Safe-Level-SMOTE was proposed, which selects appropriate samples for synthesis by setting a safe-level metric [27]. Safe-Level-SMOTE generates all synthetic instances only in the safe region, reducing the number of noisy and erroneous synthetic samples.
K-Means SMOTE analyzes each region through the k-means clustering algorithm [28]. The target ratio of minority to majority instances is achieved by assigning more samples to clusters with a sparse distribution of minority samples. While keeping the minority and majority samples balanced, K-Means SMOTE ensures that the generated synthetic samples follow the distribution of the original data and retain its structural information.
In contrast to K-Means SMOTE, OEGFCM-SMOTE employs a soft clustering approach coupled with an optimization algorithm to fine-tune hyperparameters, taking into account the entropy of the original data, thereby yielding favorable outcomes [29].
While combining clustering algorithms with SMOTE has demonstrated effectiveness, the clustering step often demands intricate operations or long execution times to achieve satisfactory results. To simplify the operation and better preserve the original distribution of the dataset, Selected-SMOTE was proposed. Selected-SMOTE generates only those synthetic samples that are far from the real minority samples and randomly selects nearest neighbors when generating them, which avoids overfitting and reduces the number of useless samples generated. Similarly, to mitigate the errors that data generation introduces into experimental outcomes and to simplify parameter selection, Zhu et al. propose Oversampling with the aim of Reliably Expanding Minority class regions (OREM) [30]. OREM ensures data generation quality by creating new samples only in clean regions close to minority class instances.
Furthermore, combining SMOTE algorithms with deep learning is an emerging research avenue in the continuous evolution of deep learning. Notably, DeepSMOTE stands out among these methods [31]. DeepSMOTE first encodes the data into a low-dimensional feature space using an encoder, then applies the SMOTE algorithm to oversample the minority class in that space, and finally decodes the synthetic samples through a decoder to restore them to the original data space.
In summary, numerous SMOTE-based oversampling methods have been proposed, each generating different synthetic data according to its oversampling scheme [29, 32,33,34]. However, none of them exploit the model's own errors: a model can be strengthened by sampling near appropriately chosen misclassified data.
3 Principle and Methodology
Let \(\:U\) be a sample dataset defined as \(\:U={U}_{L}\cup\:{U}_{M}\), where \(\:{U}_{L}=\{{x}_{1},\:{x}_{2},\:…,{x}_{L}\}\) represents the minority class sample set and \(\:{U}_{M}=\{{x}_{1},\:{x}_{2},\:…,{x}_{M}\}\) represents the majority class sample set. The operational flow of the proposed model is described below.
3.1 Pre-Training
To obtain the classification results of the model, the data is pre-trained using a classifier as follows:

$$y_{pre}=f(x)$$

where \(\:x\) is the input data, \(\:f\) represents the classifier, and \(\:{y}_{pre}\) is the predicted class corresponding to \(\:x\). The predicted class is marked as 0 for the minority class and 1 for the majority class. Therefore, the class corresponding to \(\:{y}_{pre}\) can be expressed using the following equation:

$$y_{pre}=\begin{cases}0, & x \text{ is predicted as the minority class}\\[2pt] 1, & x \text{ is predicted as the majority class}\end{cases}$$
3.2 Error Set Construction
The predicted class \(\:{y}_{pre}\) obtained from pre-training is compared with the true class \(\:{y}_{true}\). Data with incorrect predictions are put into the error set \(\:{U}_{W}=\{{x}_{1},\:{x}_{2},\:…,{x}_{w}\}\), which is defined as follows:

$$U_{W}=\{x_{i}\in U \mid y_{pre}(x_{i})\ne y_{true}(x_{i})\}$$
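As a concrete illustration, the error-set construction is a simple mask over the predictions. The sketch below assumes a classifier object with a scikit-learn-style `predict` method; the function name is chosen here, not taken from the paper.

```python
import numpy as np

def build_error_set(X, y_true, clf):
    """Collect the error set U_W: the samples whose predicted class
    differs from the true class after pre-training."""
    y_pre = clf.predict(X)
    mask = y_pre != np.asarray(y_true)     # True where the model is wrong
    return X[mask], np.asarray(y_true)[mask]
```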
3.3 Error Set Classification
In this study, we divide the error set \(\:{U}_{W}\) into two categories, safety data \(\:{U}_{F}\) and noise data \(\:{U}_{N}\), based on the distribution of sample points around each error point. Safety data refers to classification errors caused by the scarcity of data in the minority class; expanding the data around such sample points can subsequently improve the model's classification performance. Noise data refers to discrete points that the model cannot accurately classify, and hence they are not considered further. The classification operations for safety data and noise data are as follows:
1) Given a data point \(\:{x}_{i}\) from the error set \(\:{U}_{W}\);

2) Calculate the K nearest neighbor samples of \(\:{x}_{i}\) in \(\:U\), where the number of minority class samples is noted as \(\:{N}_{L}\) and the number of majority class samples is noted as \(\:{N}_{M}\);

3) Determine whether the data point is safe or noisy as follows:

- If \(\:{N}_{M}>{N}_{L}\) and the true class of \(\:{x}_{i}\) is the minority class, the sample is classified as safe data.
- If \(\:{N}_{M}>{N}_{L}\) and the true class of \(\:{x}_{i}\) is the majority class, the sample is classified as noise data.
- If \(\:{N}_{M}<{N}_{L}\) and the true class of \(\:{x}_{i}\) is the minority class, the sample is classified as safe data.
- If \(\:{N}_{M}<{N}_{L}\) and the true class of \(\:{x}_{i}\) is the majority class, the sample is classified as noise data.
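The safe/noise test above can be sketched as follows. The function name and signature are chosen here, not taken from the paper; note that, as stated, the four rules reduce to checking the true class of the misclassified point, so the neighbour counts \(N_L\) and \(N_M\) are computed and returned for completeness.

```python
import numpy as np

def classify_error_point(x_i, y_i, X, y, k=5, minority=0):
    """Count minority (N_L) and majority (N_M) samples among the k nearest
    neighbours of x_i in U, then label the point: safe if its true class
    is the minority class, noise otherwise (per the four rules above)."""
    d = np.linalg.norm(X - x_i, axis=1)      # distances to all of U
    nn = np.argsort(d)[:k]                   # k nearest neighbours
    n_l = int(np.sum(y[nn] == minority))     # minority neighbours, N_L
    n_m = k - n_l                            # majority neighbours, N_M
    label = "safe" if y_i == minority else "noise"
    return label, n_l, n_m
```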
3.4 Generate New Samples
The safety data set \(\:{U}_{F}\) is expanded to improve the classification performance of the model. First, a safety data point \(\:{x}_{i}\in\:{U}_{F}\) is selected. Then, a sample is randomly selected from the K nearest neighbor samples of the same class. Finally, a new sample is generated using the following data generation formula:

$$x_{new}=x_{i}+r\,(x_{j}-x_{i})$$

where \(\:{x}_{new}\) is the newly generated sample, \(\:{x}_{i}\) is the safety data point, \(\:{x}_{j}\) is the randomly selected sample of the same class, and \(\:r\) is a random number between 0 and 1.
3.5 Termination Condition
The model stops iterating, and a final training run is performed to obtain the trained model, once either of the following termination conditions is met:

- The number of minority class samples in the dataset equals the number of majority class samples.
- The model reaches the maximum number of iterations.
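Putting Sections 3.1 through 3.5 together, the whole loop can be sketched as below. This is a minimal illustration under simplifying assumptions chosen here, not the paper's implementation: brute-force neighbour search, any scikit-learn-style classifier, and safe error data taken to be the misclassified minority points; `ismote` and its parameters are names chosen for this sketch.

```python
import numpy as np

def ismote(X, y, clf, k=5, max_iter=10, minority=0, seed=0):
    """Sketch of the ISMOTE loop: pre-train, collect the safe error data
    (misclassified minority points), interpolate new samples near them,
    retrain, and stop when the classes are balanced or max_iter is hit."""
    rng = np.random.default_rng(seed)
    X, y = X.copy(), y.copy()
    for _ in range(max_iter):
        if np.sum(y == minority) >= np.sum(y != minority):
            break                              # termination: classes balanced
        clf.fit(X, y)
        wrong = clf.predict(X) != y            # error set U_W
        safe = wrong & (y == minority)         # safe error data U_F
        if not safe.any():
            break                              # nothing left to reinforce
        X_min = X[y == minority]
        for x_i in X[safe]:
            kk = min(k, len(X_min) - 1)        # guard for small minority sets
            d = np.linalg.norm(X_min - x_i, axis=1)
            x_j = X_min[np.argsort(d)[rng.integers(1, kk + 1)]]
            x_new = x_i + rng.random() * (x_j - x_i)
            X = np.vstack([X, x_new])
            y = np.append(y, minority)
            X_min = np.vstack([X_min, x_new])
    clf.fit(X, y)                              # final training
    return X, y, clf
```

Any classifier exposing `fit` and `predict` (e.g. an SVM or C4.5-style decision tree, as in the experiments below) can be plugged in for `clf`.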
4 Experiments
To assess the effectiveness of our proposed ISMOTE model, we conducted a comparative analysis against other commonly used oversampling techniques. Specifically, we compared the performance of ISMOTE with SMOTE (labeled Mod1), Borderline1-SMOTE (Mod2), Borderline2-SMOTE (Mod3), ADASYN (Mod4) [30], Safe-Level-SMOTE (Mod5), Random-SMOTE (Mod6), SMOTE-Out (Mod7), SMOTE-Cosine (Mod8), Selected-SMOTE (Mod9), and Kmeans-SMOTE (Mod10). The baseline model (direct classification without oversampling) is labeled Mod0. To ensure the reliability of our findings, we evaluated these models on three metrics: AUC, F-Score, and G-mean. Furthermore, we conducted ten replications of five-fold cross-validation to ensure the consistency and stability of our results.
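For reference, the F-Score and G-mean used in these experiments can be computed directly from the confusion counts, with the minority class treated as the positive class. The sketch below uses a function name chosen here; AUC additionally requires ranking scores (e.g. `sklearn.metrics.roc_auc_score`), so it is omitted.

```python
import numpy as np

def gmean_fscore(y_true, y_pred, pos=0):
    """Compute F-Score and G-mean from confusion counts, treating the
    class labeled `pos` (here the minority class) as positive."""
    tp = np.sum((y_pred == pos) & (y_true == pos))
    fp = np.sum((y_pred == pos) & (y_true != pos))
    fn = np.sum((y_pred != pos) & (y_true == pos))
    tn = np.sum((y_pred != pos) & (y_true != pos))
    recall = tp / (tp + fn)                  # sensitivity on the minority class
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)             # accuracy on the majority class
    f_score = 2 * precision * recall / (precision + recall)
    g_mean = np.sqrt(recall * specificity)   # balances both class accuracies
    return f_score, g_mean
```

G-mean is the geometric mean of minority-class recall and majority-class specificity, so it penalizes a model that sacrifices one class for the other, which is exactly the failure mode of imbalanced learning.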
4.1 Dataset
To assess the efficacy of our model, we conducted experiments on a collection of 20 datasets retrieved from the KEEL dataset repository [31]. These datasets varied in the number of features, ranging from 3 to 19, and sample sizes ranging from 150 to 2308. Furthermore, the imbalance rates of these datasets ranged from 1.82 to 8.6. A comprehensive summary of each dataset, including its name, number of features, sample size, minority and majority category sample size, and imbalance rate, is presented in Table 1. To facilitate subsequent analysis, we abbreviated datasets with lengthy names. Specifically, ecoli-0_vs_1 was abbreviated as ecoli, glass-0-1-2-3_vs_4-5-6 as glass, and new-thyroid1 as new-th1. In addition, to further evaluate the effectiveness of each algorithm, we computed the average performance of all datasets and denoted it as “average.”
4.2 Selecting the Appropriate Classification Model
In the ISMOTE model, the choice of classification algorithm can have a significant impact on the resulting classification performance. Commonly used classifiers for imbalanced datasets include Support Vector Machines (SVM) and the C4.5 decision tree. This section investigates the effect of using SVM and C4.5 as the classifier inside the ISMOTE model, and compares the effectiveness of these models against the other classification models on the datasets. In this comparison, the nearest neighbor parameter K defaults to 5 and the number of iterations defaults to 10.
4.2.1 SVM Classifier
SVM is a frequently used machine learning model classification algorithm. The hyperplane produced by the SVM algorithm is known for its robustness, making it widely applicable to various classification tasks and exhibiting strong performance advantages.
In this study, the performance of the ISMOTE oversampling method was evaluated using SVM as a classifier, and compared against 10 commonly used oversampling methods. The evaluation results, which were measured by the AUC, F-score, and G-mean metrics, are presented in Tables 2, 3, and 4, respectively.
The ISMOTE algorithm demonstrated the best performance across multiple metrics and datasets. Specifically, it outperformed all other algorithms in terms of the AUC metric for 6 datasets, the F-Score metric for 11 datasets, and the G-mean metric for 10 datasets. Additionally, when considering the overall effectiveness of each algorithm, ISMOTE achieved the highest average score across all metrics and datasets. Notably, the ISMOTE algorithm outperformed the second-best algorithm by a margin of 0.001, 0.0284, and 0.0266 for the AUC, F-Score, and G-mean metrics, respectively.
4.2.2 C4.5 Classifier
The C4.5 algorithm is widely regarded as one of the top ten most effective data mining algorithms. It uses the information gain ratio for attribute selection, which overcomes the bias of information gain toward attributes with many values. In addition, C4.5 produces classification rules that are both accurate and easy to interpret.
In this study, C4.5 was employed as a classifier to evaluate the performance of the ISMOTE oversampling method relative to 10 other commonly used oversampling methods. The results of the evaluation, as measured by the AUC, F-score, and G-mean metrics, are presented in Tables 5, 6, and 7, respectively.
The ISMOTE algorithm demonstrated superior performance compared to ten other oversampling methods across multiple metrics and datasets, despite using the C4.5 classifier. Specifically, ISMOTE surpassed all other algorithms in AUC for eight datasets, F-Score for nine datasets, and G-mean for nine datasets. When considering overall effectiveness, ISMOTE achieved the highest average scores across all metrics and datasets. Notably, ISMOTE outperformed the second-best algorithm by 0.0061, 0.0241, and 0.0103 on the AUC, F-Score, and G-mean metrics, respectively.
4.2.3 Wilcoxon Signed-Rank Tests
To verify whether there are significant differences between ISMOTE and each of the other oversampling methods, we perform Wilcoxon signed-rank tests on the performance values of the tested models. The results of the significance tests are listed in Table 8. "+*" and "++" signify that ISMOTE is statistically better than the compared algorithm at significance levels of 0.1 and 0.05, respectively; "+" denotes that ISMOTE is only quantitatively better. ISMOTE achieves statistically superior performance in most cases, demonstrating that it is highly effective compared with the other two-class oversampling algorithms.
4.3 Selection of K
The parameter K plays a crucial role in ISMOTE as it controls the number of nearest neighbors and consequently affects the quality of data generation. To study the impact of different K values on the experimental results, we conducted an experiment as follows:
1) We selected K values ranging from 3 to 10;

2) For each K value, we calculated the performance metrics of the ISMOTE algorithm on the corresponding dataset;

3) We repeated the experiment 20 times and calculated the average value of each performance metric for each K value on the different datasets. These values were plotted on a graph.
The SVM classifier was used in this experiment. The performance curves for the AUC, F-Score, and G-mean metrics for different K values are presented in Figs. 1, 2, and 3, respectively.
Figures 1, 2, and 3 show the impact of varying K values on the performance of the ISMOTE algorithm. Our findings reveal that the ISMOTE algorithm produces varying outcomes depending on the K value used. Larger K values generally yield better results, although they may lead to model instability; conversely, a K value that is too small may hurt model performance. Through our analysis, we identify the optimal K range for the ISMOTE model as 5 to 7, which ensures both superior performance and stability.
To further investigate the effectiveness of the ISMOTE model under different K values, we conducted a comparison of average metrics across the dataset. The results of this analysis can be found in Table 9.
4.4 Selection of the Number of Iterations
Determining the appropriate number of iterations is an important aspect of the termination condition, as it controls the quality of data generation to some extent. To study the impact of the number of iterations on experimental results, an experiment was conducted in which the number of iterations (n) was varied between 10, 20, 30, and 50. This experiment was repeated 20 times, and the average performance metric was calculated for the different datasets with each number of iterations. The performance curves corresponding to the number of iterations were plotted to observe the effect on the experimental results. SVM was used as the classifier in this experiment. The resulting performance curves for the AUC, F-Score, and G-mean metrics are shown in Figs. 4, 5, and 6, respectively.
The figures indicate that the ISMOTE model is influenced differently depending on the number of iterations used. The best results are obtained for most datasets when the maximum number of iterations (50) is utilized, but some datasets yield the worst results. Conversely, the minimum value of 10 produces the best results for some datasets, but the worst for others. On the other hand, middle values of 20 and 30 generally result in better performance for the ISMOTE model.
To further examine the impact of varying the number of iterations on the ISMOTE model, the average metrics of the datasets were compared for each value of n. Table 10 presents the results of this analysis.
In addition to the impact on performance, the number of iterations also affects the running time of the algorithm. To better compare the effect of the number of iterations on the algorithm, Table 11 shows the running time comparison of the algorithm under each number of iterations.
As observed in Table 11, the running time of the algorithm increases significantly as the number of iterations increases. Therefore, to balance the running time with the performance effect of the algorithm, it is recommended to choose the number of iterations as 20, which results in relatively good performance and a reasonable running time.
4.5 Analysis and Discussion
The results of this study demonstrate that the ISMOTE algorithm outperforms the other 10 oversampling methods, regardless of whether the SVM or C4.5 classifier is used, in terms of F-score, G-mean, and AUC metrics. Furthermore, the effectiveness of the ISMOTE algorithm is influenced by K and the number of iterations. Within a certain range, increasing K results in higher model accuracy. However, when K exceeds this range, increasing K causes a decline in model accuracy. Nevertheless, the reduced model accuracy is still higher than the model accuracy obtained when K is equal to 3. Similarly, within a certain range, increasing the number of iterations leads to an increase in model accuracy. Beyond this range, however, further iterations result in a decline in model accuracy, although the reduced model accuracy may still be higher than the model accuracy obtained when the number of iterations is equal to 10.
Overall, the ISMOTE algorithm significantly enhances the classification accuracy of imbalanced sample classes and improves the overall model accuracy compared to other oversampling methods. Although parameter selection influences the effectiveness of the ISMOTE algorithm, it does not significantly affect its overall performance. Even in the worst-case scenario, the ISMOTE method outperforms the other 10 oversampling methods.
Finally, this study analyzes the reasons for the decrease in model accuracy as the K value and the number of iterations increase. It has been observed that the model tends to become more complex as K value and iterations increase, resulting in overfitting similar to that observed in neural networks. This overfitting phenomenon ultimately leads to a decrease in the final accuracy of the model.
While the ISMOTE model yields commendable results, it shares similar constraints with iterative algorithms, notably the time-consuming nature of iterations. Furthermore, given that the model’s oversampling technique is based on the SMOTE algorithm, it is susceptible to encountering overgeneralization issues due to noisy samples, particularly when dealing with high imbalance rates.
5 Conclusion and Future Work
In this study, we have investigated the ISMOTE algorithm as a means of generating high-quality data to improve the accuracy of classification models. By iteratively generating new data near the safe misclassified data from the previous iteration, we have demonstrated that model accuracy can be significantly improved. Our comparative evaluation of ISMOTE against 10 popular oversampling methods on 20 datasets indicates that ISMOTE achieves the best results on most datasets, with its average performance significantly outperforming the other algorithms. We have also investigated the impact of classifier selection, the K value, and the number of iterations on the ISMOTE algorithm, and provided recommendations for their optimal settings.
In future work, we aim to optimize the ISMOTE algorithm by exploring more effective approaches to data generation beyond random selection as in the SMOTE algorithm. Additionally, we intend to generalize our findings by combining ISMOTE with other oversampling methods.
References
Sayed GI, Soliman MM, Hassanien AE (2021) A novel melanoma prediction model for imbalanced data using optimized SqueezeNet by bald eagle search optimization. Comput Biol Med 136:104712. https://doi.org/10.1016/j.compbiomed.2021.104712
Hussain S (2017) Survey on current trends and techniques of data mining research. Lond J Res Comput Sci Technol 17:11
Alam TM, Shaukat K, Khan WA, Hameed IA, Almuqren LA, Raza MA, Aslam M, Luo S (2022) An efficient deep learning-based skin Cancer classifier for an Imbalanced dataset. Diagnostics 12:2115. https://doi.org/10.3390/diagnostics12092115
Santos LI, Camargos MO, D’Angelo MFSV, Mendes JB, de Medeiros EEC, Guimarães ALS, Palhares RM (2022) Decision tree and artificial immune systems for stroke prediction in imbalanced data. Expert Syst Appl 191:116221. https://doi.org/10.1016/j.eswa.2021.116221
Al S, Dener M (2021) STL-HDL: A new hybrid network intrusion detection system for imbalanced dataset on big data environment. Comput Secur 110:102435. https://doi.org/10.1016/j.cose.2021.102435
Fu Y, Du Y, Cao Z, Li Q, Xiang W (2022) A deep learning model for Network Intrusion detection with Imbalanced Data. Electronics 11:898. https://doi.org/10.3390/electronics11060898
Prati RC, Batista GEAPA, Silva DF (2015) Class imbalance revisited: a new experimental setup to assess the performance of treatment methods. Knowl Inf Syst 45:247–270. https://doi.org/10.1007/s10115-014-0794-3
Wei G, Mu W, Song Y, Dou J (2022) An improved and random synthetic minority oversampling technique for imbalanced data. Knowl Based Syst 248:108839. https://doi.org/10.1016/j.knosys.2022.108839
El Bakrawy LM, Cifci MA, Kausar S (2022) A modified ant lion optimization method and its application for Instance Reduction Problem in Balanced and Imbalanced Data. Axioms 11:95. https://doi.org/10.3390/axioms11030095
Han H, Wang W-Y, Mao B-H (2005) Borderline-SMOTE: a New Over-sampling Method in Imbalanced Data sets Learning. In: Huang D-S, Zhang X-P, Huang G-B (eds) Advances in Intelligent Computing. Springer, Berlin Heidelberg, Berlin, Heidelberg, pp 878–887. https://doi.org/10.1007/11538059_91.
Torres FR, Carrasco-Ochoa JA, Martínez-Trinidad JF (2016) SMOTE-D a deterministic version of SMOTE. In: Martínez-Trinidad JF, Carrasco-Ochoa JA, Ayala Ramirez V, Olvera-López JA, Jiang X (eds) Pattern recognition. Springer International Publishing, Cham, pp 177–188. https://doi.org/10.1007/978-3-319-39393-3_18.
Dong Y, Wang X (2011) A new over-sampling approach: Random-SMOTE for learning from imbalanced data sets. In: Knowledge Science, Engineering and Management. Springer, Berlin, Heidelberg, pp 343–352. https://doi.org/10.1007/978-3-642-25975-3_30
Gu B, Sheng VS, Tay KY, Romano W, Li S (2017) Cross Validation through two-dimensional solution surface for cost-sensitive SVM. IEEE Trans Pattern Anal Mach Intell 39:1103–1121. https://doi.org/10.1109/TPAMI.2016.2578326
Liu Y, Lu H, Yan K, Xia H, An C (2016) Applying cost-sensitive Extreme Learning machine and dissimilarity integration to gene expression data classification. Comput Intell Neurosci 2016(e8056253). https://doi.org/10.1155/2016/8056253
Tapkan P, Özbakır L, Kulluk S, Baykasoğlu A (2016) A cost-sensitive classification algorithm: BEE-Miner. Knowl Based Syst 95:99–113. https://doi.org/10.1016/j.knosys.2015.12.010
Radtke PVW, Granger E, Sabourin R, Gorodnichy DO (2014) Skew-sensitive boolean combination for adaptive ensembles – an application to face recognition in video surveillance. Inform Fusion 20:31–48. https://doi.org/10.1016/j.inffus.2013.11.001
Díez-Pastor JF, Rodríguez JJ, García-Osorio CI, Kuncheva LI (2015) Diversity techniques improve the performance of the best imbalance learning ensembles. Inf Sci 325:98–117. https://doi.org/10.1016/j.ins.2015.07.025
Bhardwaj M, Bhatnagar V, Sharma K (2016) Cost-effectiveness of classification ensembles. Pattern Recogn 57:84–96. https://doi.org/10.1016/j.patcog.2016.03.017
Fernández-Baldera A, Buenaposada JM, Baumela L (2018) BAdaCost: multi-class boosting with costs, Pattern Recognition. 79:467–479. https://doi.org/10.1016/j.patcog.2018.02.022
Sun Z, Song Q, Zhu X, Sun H, Xu B, Zhou Y (2015) A novel ensemble method for classifying imbalanced data. Pattern Recogn 48:1623–1637. https://doi.org/10.1016/j.patcog.2014.11.014
Chen Z, Duan J, Kang L, Qiu G (2021) A hybrid data-level ensemble to enable learning from highly imbalanced dataset. Inf Sci 554:157–176. https://doi.org/10.1016/j.ins.2020.12.023
Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357. https://doi.org/10.1613/jair.953
Bishop CM (1994) Neural networks and their applications. Rev Sci Instrum 65:1803–1832. https://doi.org/10.1063/1.1144830
Joloudari JH, Marefat A, Nematollahi MA (2023) Effective class-imbalance learning based on SMOTE and convolutional neural networks. Appl Sci 13:4006. https://doi.org/10.3390/app13064006
Desuky AS, Elbarawy YM, Kausar S (2022) Single-point crossover and Jellyfish optimization for handling Imbalanced Data classification problem. IEEE Access 10:11730–11749. https://doi.org/10.1109/ACCESS.2022.3146424
Koto F (2014) SMOTE-Out, SMOTE-Cosine, and Selected-SMOTE: an enhancement strategy to handle imbalance in data level. In: 2014 International Conference on Advanced Computer Science and Information System, pp 280–284. https://doi.org/10.1109/ICACSIS.2014.7065849
Bunkhumpornpat C, Sinapiromsaran K, Lursinsap C (2009) Safe-Level-SMOTE: safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In: Theeramunkong T, Kijsirikul B, Cercone N, Ho T-B (eds) Advances in Knowledge Discovery and Data Mining. Springer, Berlin, Heidelberg, pp 475–482. https://doi.org/10.1007/978-3-642-01307-2_43
Douzas G, Bacao F, Last F (2018) Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Inf Sci 465:1–20. https://doi.org/10.1016/j.ins.2018.06.056
El Moutaouakil K, Roudani M, El Ouissari A (2023) Optimal Entropy genetic Fuzzy-C-Means SMOTE (OEGFCM-SMOTE). Knowl Based Syst 262:110235. https://doi.org/10.1016/j.knosys.2022.110235
Zhu T, Liu X, Zhu E (2023) Oversampling with reliably expanding minority class regions for imbalanced data learning. IEEE Trans Knowl Data Eng 35:6167–6181. https://doi.org/10.1109/TKDE.2022.3171706
Dablain D, Krawczyk B, Chawla NV (2023) DeepSMOTE: Fusing Deep Learning and SMOTE for Imbalanced Data. IEEE Trans Neural Networks Learn Syst 34:6390–6404. https://doi.org/10.1109/TNNLS.2021.3136503
Camacho L, Douzas G, Bacao F (2022) Geometric SMOTE for regression. Expert Syst Appl 193:116387. https://doi.org/10.1016/j.eswa.2021.116387
He H, Bai Y, Garcia EA, Li S (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pp 1322–1328. https://doi.org/10.1109/IJCNN.2008.4633969
Fernández A, García S, del Jesus MJ, Herrera F (2008) A study of the behaviour of linguistic fuzzy rule based classification systems in the framework of imbalanced data-sets. Fuzzy Sets Syst 159:2378–2398. https://doi.org/10.1016/j.fss.2007.12.023
Acknowledgements
Not applicable.
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Author information
Contributions
Jiuxiang Song: Methodology, Validation, Writing - Review & Editing. Jizhong Liu: Funding acquisition. All authors reviewed the manuscript.
Ethics declarations
Competing Interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Song, J., Liu, J. ISMOTE: A More Accurate Alternative for SMOTE. Neural Process Lett 56, 240 (2024). https://doi.org/10.1007/s11063-024-11695-w