1 Introduction

When training classification models, datasets with balanced categories are ideal. In real-world scenarios, however, datasets are often imbalanced, as in medical classification and abnormal network traffic datasets [1,2,3,4,5,6]. Models trained on such imbalanced datasets tend to favor the classes with more samples, achieving higher accuracy on those classes. Yet the smaller classes are usually of greater interest to researchers, since they generally correspond to anomalies: identifying abnormal traffic among normal traffic protects network security, and identifying ill patients among healthy ones speeds up drug development [7,8,9].

Researchers therefore focus on how to process class-imbalanced data to obtain better classification models. Research on class-imbalance classification mainly covers three levels: (1) the data level, where the data are manipulated so that different classes have balanced sample counts [10,11,12]; (2) the algorithm level, where the algorithm is designed to pay more attention to the classification of minority classes [13,14,15]; (3) the ensemble level, where multiple weak classifiers are combined into a stronger classifier [16,17,18,19,20,21]. The algorithm-level and ensemble-level approaches build on the data level, i.e., a good dataset is an important factor in the final performance of an algorithm. The study of data-level algorithms is therefore necessary.

Among data-level algorithms, oversampling methods are widely used due to their superior performance. Oversampling balances the class sizes by increasing the number of minority-class samples. The most classical oversampling method is the Synthetic Minority Oversampling Technique (SMOTE), which selects samples via the nearest-neighbor method and generates new samples by combining two existing ones [22]. Various improved oversampling methods refine how the initial samples to oversample are selected, further improving classification accuracy. However, the selected samples remain somewhat random. Since the goal is a more accurate model, why not oversample the misclassified samples instead?

A neural network is a classical machine learning model that is trained iteratively to find the optimal neuron weights and thus a more accurate model [23,24,25]. During training, data that were originally misclassified can be correctly identified after further training, continuously improving the model's accuracy. It is essential to note, however, that excessive training can cause overfitting and an undesirable model.

In imbalanced datasets, low classification accuracy often stems from the difficulty of correctly identifying minority-class samples. Randomly generating sample points from the data alone can be unreliable and may yield subpar results, so haphazardly generating minority samples is not advisable. To mitigate the adverse effects of data generation, it is more prudent to generate samples strategically, helping the model recognize the distinctive features of the misclassified minority samples.

Therefore, a novel oversampling method is proposed that iteratively oversamples safe error data during training to improve model accuracy. Here, error data are the data misclassified in the previous classification round, and safe error data are the error data judged not to be outliers. The method, called the iteration-based Synthetic Minority Oversampling Technique (ISMOTE), combines the iterative-training idea of neural networks with the oversampling process: ISMOTE iteratively oversamples misclassified data to obtain a new dataset, allowing the model to correctly classify the previously misclassified data and thus improve its accuracy. The contributions of this paper are as follows:

  1. Proposing an iteration-based Synthetic Minority Oversampling Technique (ISMOTE).

  2. Generating data not only for the minority classes but also for the majority classes.

  3. Providing a framework built on SMOTE that is equally adaptable to other improved versions of SMOTE.

  4. Showing, through experiments on 20 publicly available datasets, that ISMOTE can serve as a novel oversampling method.

2 Related Work

The present work extends oversampling methods, so this paper reviews only the relevant oversampling techniques. The simplest is random oversampling, which balances the data by replicating minority-class instances. Although this technique causes no information loss, replicating identical data makes the model prone to overfitting.

To alleviate the overfitting problem caused by random oversampling, the SMOTE method was proposed. In the SMOTE method, one sample is randomly selected from the K most similar samples for each minority class sample. New data are then generated by linear interpolation between the two minority class samples. The SMOTE method considers only distance similarity, using the Euclidean distance to measure the degree of sample similarity. The K most similar samples are equivalent to the nearest neighbor samples.
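The SMOTE interpolation step described above can be sketched in a few lines of NumPy. This is a minimal illustration (the function name and interface are our own, not the reference implementation): each synthetic sample lies on the segment between a minority point and one of its K nearest minority neighbors.

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=1, rng=None):
    """Generate n_new synthetic minority samples by SMOTE interpolation."""
    rng = np.random.default_rng(rng)
    # pairwise Euclidean distances between minority samples
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]  # k nearest neighbors per sample
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))          # pick a random minority sample
        j = rng.choice(neighbors[i])          # pick one of its k neighbors
        r = rng.random()                      # interpolation factor in [0, 1)
        out.append(X_min[i] + r * (X_min[j] - X_min[i]))
    return np.asarray(out)
```

Because each new point is a convex combination of two minority samples, all synthetic points stay inside the convex hull of the minority class.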

However, the similarity between two vectors can also be measured by their angle or direction. From this perspective, the SMOTE-Cosine algorithm was proposed, which combines Euclidean distance similarity with cosine similarity when computing sample similarity [26], giving better access to the true nearest neighbors of minority samples.

To enhance the minority boundary samples while avoiding noisy ones, the Borderline-SMOTE1 and Borderline-SMOTE2 methods were proposed. Both identify borderline minority samples, those whose neighborhoods contain more majority than minority samples, and generate synthetic samples in the boundary region between the two classes; they differ in which neighbors are used for synthesis, with Borderline-SMOTE1 interpolating only between minority-class samples and Borderline-SMOTE2 also interpolating toward majority-class neighbors. Although Borderline-SMOTE helps the algorithm determine decision boundaries, it is sensitive to noisy points in the data, which can easily be mistaken for boundary samples in need of synthesis, generating unnecessary synthetic samples and exacerbating the oversynthesis problem.

After Borderline-SMOTE achieved better results, researchers found that oversampling sample points with certain properties can work even better. Safe-Level-SMOTE was proposed, which selects appropriate samples for synthesis by setting a safe-level metric [27]. Safe-Level-SMOTE generates all synthetic instances only in the safe region, reducing the number of unnecessary noisy and erroneous synthetic samples.

K-Means SMOTE analyzes each region via the K-means clustering algorithm [28]. The target ratio of minority to majority instances is achieved by assigning more samples to clusters where minority samples are sparse. While keeping the minority and majority classes balanced, K-Means SMOTE ensures that the generated synthetic samples resemble the distribution of the original data and retain its structural information.

In contrast to K-Means SMOTE, OEGFCM-SMOTE employs a soft clustering approach coupled with an optimization algorithm to fine-tune hyperparameters, taking into account the entropy of the original data, thereby yielding favorable outcomes [29].

While combining clustering algorithms with SMOTE models has proven effective, the clustering step often demands intricate operations or long execution times to achieve satisfactory results. To simplify the operation and better preserve the original data distribution, Selected-SMOTE was proposed. Selected-SMOTE generates only synthetic samples that are far from real minority samples and randomly selects nearest neighbors during generation, which avoids overfitting and reduces the number of useless samples. Similarly, to mitigate errors introduced by data generation and to simplify parameter selection, Tuanfei Zhu proposed the Oversampling method for Reliably Expanding Minority class regions (OREM) [30]. OREM ensures data-generation quality by creating new samples exclusively in clean regions close to minority-class instances.

Furthermore, combining SMOTE algorithms with deep learning is an emerging research avenue. Notably, DeepSMOTE stands out among these methods [31]. DeepSMOTE first encodes the data into a one-dimensional representation using an encoder, then applies the SMOTE algorithm to oversample the encoded data, and finally decodes the result back into its two-dimensional structure.

In summary, numerous SMOTE-based oversampling methods have been proposed, each generating different synthetic data according to its oversampling scheme [29, 32,33,34]. However, none of them exploits misclassification information: by identifying error data, a model can be strengthened by sampling near the appropriate error samples.

3 Principle and Methodology

Let \(U\) be a sample dataset defined as \(U={U}_{L}\cup {U}_{M}\), where \(U_{L}=\{x_{1}, x_{2}, \ldots, x_{L}\}\) is the minority class sample set and \(U_{M}=\{x_{1}, x_{2}, \ldots, x_{M}\}\) is the majority class sample set. The operational flow of the proposed model is described below.

3.1 Pre-Training

To obtain the classification results of the model, the data are first pre-trained using a classifier:

$$y_{pre}=f\left(x\right)$$
(1)

where \(x\) is the input data, \(f\) is the classifier, and \(y_{pre}\) is the predicted class of \(x\), marked 0 for the minority class and 1 for the majority class. The class corresponding to \(y_{pre}\) can therefore be expressed as:

$$\left\{\begin{array}{cc}x\in U_{L} & y_{pre}=0\\ x\in U_{M} & y_{pre}=1\end{array}\right.$$
(2)

3.2 Error Set Construction

The predicted class \(y_{pre}\) obtained from pre-training is compared with the true class \(y_{true}\). Samples with incorrect predictions are placed in the error set \(U_{W}=\{x_{1}, x_{2}, \ldots, x_{w}\}\), defined as follows:

$$U_{W}=\left\{\begin{array}{cc}\left[U_{W}, x\right] & y_{pre}\neq y_{true}\\ \left[U_{W}\right] & y_{pre}=y_{true}\end{array}\right.$$
(3)
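Sections 3.1 and 3.2 together amount to a short routine: pre-train a classifier, predict, and keep the misclassified samples. In the sketch below, `fit_predict` is a hypothetical stand-in for any classifier \(f\) that returns predicted labels for the training data.

```python
import numpy as np

def build_error_set(X, y, fit_predict):
    """Pre-train a classifier (Eq. 1) and collect misclassified samples (Eq. 3).

    fit_predict(X, y) is assumed to train on (X, y) and return the
    predicted labels y_pre for X.
    """
    y_pre = fit_predict(X, y)
    wrong = y_pre != y        # the condition y_pre != y_true of Eq. (3)
    return X[wrong], y[wrong]
```

Any scikit-learn-style classifier can be adapted by wrapping its `fit`/`predict` pair into a single callable.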

3.3 Error Set Classification

In this study, we divide the error set \(U_{W}\) into two categories based on the distribution of sample points around each error: safe data \(U_{F}\) and noise data \(U_{N}\). Safe data are classification errors caused by the scarcity of minority-class data; expanding the data around such points can subsequently improve the model's classification of them. Noise data are discrete outliers that the model cannot accurately classify, and they are not considered further. Safe and noise data are distinguished as follows:

  1. Take a data point \(x_{i}\) from the error set \(U_{W}\);

  2. Find the K nearest neighbors of \(x_{i}\) in \(U\), denoting the number of minority-class neighbors by \(N_{L}\) and the number of majority-class neighbors by \(N_{M}\);

  3. Determine whether the data point is safe or noisy as follows:

  • If \(N_{M}>N_{L}\) and the true class of \(x_{i}\) is the minority class, the sample is classified as safe data.

  • If \(N_{M}>N_{L}\) and the true class of \(x_{i}\) is the majority class, the sample is classified as noise data.

  • If \(N_{M}<N_{L}\) and the true class of \(x_{i}\) is the minority class, the sample is classified as safe data.

  • If \(N_{M}<N_{L}\) and the true class of \(x_{i}\) is the majority class, the sample is classified as noise data.
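The split above can be sketched as follows (names are our own). Note that, as stated, the four rules reduce to checking the true class of \(x_{i}\): minority-class errors are safe, majority-class errors are noise; the neighbor counts \(N_{L}\) and \(N_{M}\) are still computed for fidelity to the procedure.

```python
import numpy as np

def split_safe_noise(X, y, X_err, y_err, k=5, minority_label=0):
    """Classify each misclassified point as safe or noise (Sect. 3.3)."""
    safe, noise = [], []
    for xi, yi in zip(X_err, y_err):
        d = np.linalg.norm(X - xi, axis=1)
        neigh = np.argsort(d)[:k]                      # k nearest neighbors in U
        n_l = int(np.sum(y[neigh] == minority_label))  # N_L
        n_m = k - n_l                                  # N_M
        # all four rules assign minority-class errors to safe data and
        # majority-class errors to noise data, for either neighbor balance
        if yi == minority_label:
            safe.append(xi)
        else:
            noise.append(xi)
    return np.asarray(safe), np.asarray(noise)
```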

3.4 Generate New Samples

The safe data set \(U_{F}\) is expanded to improve the classification performance of the model. First, a safe data point \(x_{i}\in U_{F}\) is selected. Then, a sample is randomly selected from its K nearest same-class neighbors. Finally, a new sample is generated using the following formula:

$$x_{new}=x_{i}+r\cdot\left(x_{j}-x_{i}\right)$$
(4)

where \(x_{new}\) is the newly generated sample, \(x_{i}\) is the safe data point, \(x_{j}\) is the randomly selected same-class sample, and \(r\) is a random number between 0 and 1.
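Equation (4) can be illustrated with a small NumPy sketch; the brute-force neighbor search and the interface are assumptions of this illustration.

```python
import numpy as np

def generate_sample(x_i, X_same, k=5, rng=None):
    """Eq. (4): interpolate between a safe point x_i and one of its k
    nearest same-class neighbors x_j, with r drawn uniformly from [0, 1)."""
    rng = np.random.default_rng(rng)
    d = np.linalg.norm(X_same - x_i, axis=1)
    neigh = np.argsort(d)[:k]          # k nearest same-class samples
    x_j = X_same[rng.choice(neigh)]    # randomly selected neighbor
    r = rng.random()
    return x_i + r * (x_j - x_i)
```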

3.5 Termination Condition

Once either of the termination conditions is met, iteration stops and a final training pass is performed to obtain the trained model. The termination conditions are:

  • The number of minority-class samples in the dataset equals the number of majority-class samples.

  • The maximum number of iterations is reached.
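Putting Sects. 3.1 to 3.5 together, the overall loop might look like the following sketch, where `error_fn`, `safe_fn`, and `gen_fn` are hypothetical callbacks standing in for the pre-training/error-set, safe-noise split, and sample-generation steps.

```python
import numpy as np

def ismote(X, y, error_fn, safe_fn, gen_fn, minority_label=0, max_iter=10):
    """High-level ISMOTE loop (a sketch over Sects. 3.1-3.5)."""
    for _ in range(max_iter):                         # termination: max iterations
        n_min = int(np.sum(y == minority_label))
        if n_min >= len(y) - n_min:                   # termination: balanced classes
            break
        X_err, y_err = error_fn(X, y)                 # Sects. 3.1-3.2
        X_safe, y_safe = safe_fn(X, y, X_err, y_err)  # Sect. 3.3 (safe data only)
        if len(X_safe) == 0:
            break
        X_new, y_new = gen_fn(X_safe, y_safe)         # Sect. 3.4, Eq. (4)
        X = np.vstack([X, X_new])
        y = np.concatenate([y, y_new])
    return X, y
```

A final classifier is then trained on the returned, rebalanced \((X, y)\).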

4 Experiments

To assess the effectiveness of the proposed ISMOTE model, we conducted a comparative analysis against commonly used oversampling techniques: SMOTE (labeled Mod1), Borderline1-SMOTE (Mod2), Borderline2-SMOTE (Mod3), ADASYN (Mod4) [30], Safe-Level-SMOTE (Mod5), Random-SMOTE (Mod6), SMOTE-Out (Mod7), SMOTE-Cosine (Mod8), Selected-SMOTE (Mod9), and Kmeans-SMOTE (Mod10). The baseline model (direct classification without oversampling) is labeled Mod0. To ensure the reliability of our findings, we evaluated these models on three metrics: AUC, F-Score, and G-mean, and performed ten replications of five-fold cross-validation to ensure the consistency and stability of the results.
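For reference, F-Score and G-mean can be computed directly from the confusion matrix, as in the sketch below (assuming binary labels with the minority class encoded as 0, as in Sect. 3.1, and treated as the positive class; AUC additionally requires classifier scores and is omitted here).

```python
import numpy as np

def gmean_fscore(y_true, y_pred):
    """F-Score and G-mean for binary labels, minority class 0 as positive."""
    pos = 0                                         # minority label (Sect. 3.1)
    tp = np.sum((y_pred == pos) & (y_true == pos))
    fp = np.sum((y_pred == pos) & (y_true != pos))
    fn = np.sum((y_pred != pos) & (y_true == pos))
    tn = np.sum((y_pred != pos) & (y_true != pos))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # sensitivity (minority-class recall)
    specificity = tn / (tn + fp)   # majority-class recall
    f_score = 2 * precision * recall / (precision + recall)
    g_mean = np.sqrt(recall * specificity)
    return f_score, g_mean
```

G-mean is the geometric mean of the per-class recalls, so it penalizes a model that sacrifices the minority class for overall accuracy.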

4.1 Dataset

To assess the efficacy of our model, we conducted experiments on a collection of 20 datasets retrieved from the KEEL dataset repository [31]. These datasets varied in the number of features, ranging from 3 to 19, and sample sizes ranging from 150 to 2308. Furthermore, the imbalance rates of these datasets ranged from 1.82 to 8.6. A comprehensive summary of each dataset, including its name, number of features, sample size, minority and majority category sample size, and imbalance rate, is presented in Table 1. To facilitate subsequent analysis, we abbreviated datasets with lengthy names. Specifically, ecoli-0_vs_1 was abbreviated as ecoli, glass-0-1-2-3_vs_4-5-6 as glass, and new-thyroid1 as new-th1. In addition, to further evaluate the effectiveness of each algorithm, we computed the average performance of all datasets and denoted it as “average.”

Table 1 Dataset description

4.2 Selecting the Appropriate Classification Model

In the ISMOTE model, the choice of classification algorithm can significantly affect the resulting classification performance. Commonly used classifiers for imbalanced datasets include Support Vector Machines (SVM) and the C4.5 decision tree. This section investigates the effect of the classifier choice by using SVM and C4.5 as the classifier in the ISMOTE model, and compares their effectiveness with other classification models on the datasets. In this comparison, the nearest-neighbor parameter K defaults to 5 and the number of iterations defaults to 10.

4.2.1 SVM Classifier

SVM is a frequently used machine learning model classification algorithm. The hyperplane produced by the SVM algorithm is known for its robustness, making it widely applicable to various classification tasks and exhibiting strong performance advantages.

In this study, the performance of the ISMOTE oversampling method was evaluated using SVM as the classifier and compared against 10 commonly used oversampling methods. The evaluation results, measured by the AUC, F-score, and G-mean metrics, are presented in Tables 2, 3, and 4, respectively.

Table 2 The effectiveness of the SVM classifier on the datasets measured by AUC
Table 3 The effectiveness of the SVM classifier on the datasets measured by F-Score
Table 4 The effectiveness of the SVM classifier on the datasets measured by G-mean

The ISMOTE algorithm demonstrated the best performance across multiple metrics and datasets. Specifically, it outperformed all other algorithms in terms of the AUC metric for 6 datasets, the F-Score metric for 11 datasets, and the G-mean metric for 10 datasets. Additionally, when considering the overall effectiveness of each algorithm, ISMOTE achieved the highest average score across all metrics and datasets. Notably, the ISMOTE algorithm outperformed the second-best algorithm by a margin of 0.001, 0.0284, and 0.0266 for the AUC, F-Score, and G-mean metrics, respectively.

4.2.2 C4.5 Classifier

The C4.5 algorithm is widely regarded as one of the top-10 most effective data mining algorithms. It uses the information gain ratio for attribute selection, which overcomes the bias of plain information gain toward attributes with many values. In addition, C4.5 produces classification rules that are both accurate and easy to interpret.

In this study, C4.5 was employed as the classifier to evaluate the performance of the ISMOTE oversampling method relative to 10 other commonly used oversampling methods. The results, measured by the AUC, F-score, and G-mean metrics, are presented in Tables 5, 6, and 7, respectively.

Table 5 The effectiveness of the C4.5 classifier on the datasets measured by AUC
Table 6 The effectiveness of the C4.5 classifier on the datasets measured by F-Score
Table 7 The effectiveness of the C4.5 classifier on the datasets measured by G-mean

The ISMOTE algorithm also demonstrated superior performance compared to the ten other oversampling methods when the C4.5 classifier was used. Specifically, ISMOTE surpassed all other algorithms in AUC on eight datasets, F-Score on nine datasets, and G-mean on nine datasets. Considering overall effectiveness, ISMOTE achieved the highest average scores across all metrics and datasets, outperforming the second-best algorithm by 0.0061, 0.0241, and 0.0103 on the AUC, F-Score, and G-mean metrics, respectively.

4.2.3 Wilcoxon Signed-Rank Tests

To verify whether there are significant differences between ISMOTE and each of the other oversampling methods, we performed Wilcoxon signed-rank tests on the performance values of the tested models. The results are listed in Table 8. "+*" and "++" signify that ISMOTE is statistically better than the compared algorithm at significance levels of 0.1 and 0.05, respectively; "+" denotes that ISMOTE is only quantitatively better. ISMOTE achieves statistically superior performance in most cases, demonstrating that it is highly effective compared with the other two-class oversampling algorithms.

Table 8 p-values of Wilcoxon signed-rank tests for the average performance comparisons between ISMOTE and other models

4.3 Selection of K

The parameter K plays a crucial role in ISMOTE as it controls the number of nearest neighbors and consequently affects the quality of data generation. To study the impact of different K values on the experimental results, we conducted an experiment as follows:

  1. We selected K values ranging from 3 to 10;

  2. For each K value, we computed the performance metrics of the ISMOTE algorithm on each dataset;

  3. We repeated the experiment 20 times, averaged each performance metric for each K value across the datasets, and plotted the results.

The SVM classifier was used in this experiment. The performance curves for the AUC, F-Score, and G-mean metrics under different K values are presented in Figs. 1, 2, and 3, respectively.

Fig. 1 Effectiveness of the ISMOTE model on each dataset under different K values (AUC)

Fig. 2 Effectiveness of the ISMOTE model on each dataset under different K values (F-Score)

Fig. 3 Effectiveness of the ISMOTE model on each dataset under different K values (G-mean)

Figures 1, 2, and 3 examine the impact of varying K on the performance of the ISMOTE algorithm. The outcomes vary with the chosen K value: larger K values generally yield better results but may cause model instability, while a K value that is too small may hurt performance. Our analysis identifies 5 to 7 as the optimal K range for the ISMOTE model, which ensures both superior performance and stability.

To further investigate the effectiveness of the ISMOTE model under different K values, we conducted a comparison of average metrics across the dataset. The results of this analysis can be found in Table 9.

Table 9 Comparison of average performance metrics under varying K values

4.4 Selection of the Number of Iterations

Determining an appropriate number of iterations is an important aspect of the termination condition, as it controls the quality of data generation to some extent. To study its impact, we varied the number of iterations (n) over 10, 20, 30, and 50, repeated the experiment 20 times, and calculated the average performance metric for each dataset at each setting. The performance curves were then plotted to observe the effect of the number of iterations. SVM was used as the classifier in this experiment. The resulting performance curves for the AUC, F-Score, and G-mean metrics are shown in Figs. 4, 5, and 6, respectively.

Fig. 4 Effect of the ISMOTE model on each dataset with different numbers of iterations (AUC)

Fig. 5 Effect of the ISMOTE model on each dataset with different numbers of iterations (F-Score)

Fig. 6 Effect of the ISMOTE model on each dataset with different numbers of iterations (G-mean)

The figures indicate that the ISMOTE model is influenced differently depending on the number of iterations used. The best results are obtained for most datasets when the maximum number of iterations (50) is utilized, but some datasets yield the worst results. Conversely, the minimum value of 10 produces the best results for some datasets, but the worst for others. On the other hand, middle values of 20 and 30 generally result in better performance for the ISMOTE model.

To further examine the impact of varying the number of iterations on the ISMOTE model, the average metrics of the datasets were compared for each value of n. Table 10 presents the results of this analysis.

Table 10 Comparison of average performance metrics under different numbers of iterations

In addition to the impact on performance, the number of iterations also affects the running time of the algorithm. To better compare the effect of the number of iterations on the algorithm, Table 11 shows the running time comparison of the algorithm under each number of iterations.

Table 11 Comparison of running time under different numbers of iterations

As observed in Table 11, the running time of the algorithm increases significantly as the number of iterations increases. Therefore, to balance the running time with the performance effect of the algorithm, it is recommended to choose the number of iterations as 20, which results in relatively good performance and a reasonable running time.

4.5 Analysis and Discussion

The results of this study demonstrate that the ISMOTE algorithm outperforms the other 10 oversampling methods, regardless of whether the SVM or C4.5 classifier is used, in terms of F-score, G-mean, and AUC metrics. Furthermore, the effectiveness of the ISMOTE algorithm is influenced by K and the number of iterations. Within a certain range, increasing K results in higher model accuracy. However, when K exceeds this range, increasing K causes a decline in model accuracy. Nevertheless, the reduced model accuracy is still higher than the model accuracy obtained when K is equal to 3. Similarly, within a certain range, increasing the number of iterations leads to an increase in model accuracy. Beyond this range, however, further iterations result in a decline in model accuracy, although the reduced model accuracy may still be higher than the model accuracy obtained when the number of iterations is equal to 10.

Overall, the ISMOTE algorithm significantly enhances the classification accuracy of imbalanced sample classes and improves the overall model accuracy compared to other oversampling methods. Although parameter selection influences the effectiveness of the ISMOTE algorithm, it does not significantly affect its overall performance. Even in the worst-case scenario, the ISMOTE method outperforms the other 10 oversampling methods.

Finally, this study analyzes the reasons for the decrease in model accuracy as the K value and the number of iterations increase. It has been observed that the model tends to become more complex as K value and iterations increase, resulting in overfitting similar to that observed in neural networks. This overfitting phenomenon ultimately leads to a decrease in the final accuracy of the model.

While the ISMOTE model yields commendable results, it shares similar constraints with iterative algorithms, notably the time-consuming nature of iterations. Furthermore, given that the model’s oversampling technique is based on the SMOTE algorithm, it is susceptible to encountering overgeneralization issues due to noisy samples, particularly when dealing with high imbalance rates.

5 Conclusion and Future Work

In this study, we have investigated the ISMOTE algorithm as a means of generating high-quality data to improve the accuracy of classification models. By iteratively generating new data near the last correctly classified error data, we have demonstrated that the accuracy of the model can be significantly improved. Our comparative evaluation of ISMOTE against 10 popular oversampling methods on 20 datasets indicates that ISMOTE achieves optimal results on most datasets, with its average performance significantly outperforming other algorithms. We have also investigated the impact of classifier selection, K-value, and number of iterations on the ISMOTE algorithm and provided recommendations for their optimal settings.

In future work, we aim to optimize the ISMOTE algorithm by exploring more effective approaches to data generation beyond random selection as in the SMOTE algorithm. Additionally, we intend to generalize our findings by combining ISMOTE with other oversampling methods.