CN113204481B

CN113204481B - Class imbalance software defect prediction method based on data resampling

Info

Publication number: CN113204481B
Application number: CN202110428102.3A
Authority: CN
Inventors: 荆晓远; 孔晓辉; 陈昊文
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2021-04-21
Filing date: 2021-04-21
Publication date: 2022-03-04
Anticipated expiration: 2041-04-21
Also published as: CN113204481A

Abstract

The invention provides a class-imbalance software defect prediction method based on data resampling. For each minority class data item, the method computes the Euclidean distances to the minority class data and to the majority class data, selects the nearest minority class data item and the nearest majority class data item, and derives a distance parameter for the minority class data item from these distances. Each minority class data item is then labeled according to its distance parameter to obtain its data point type. The K-nearest-neighbor set of each minority class data item is computed, and the numbers of majority class and minority class data in that set are counted to determine the number of new minority class data to generate. Two classifiers are selected, the confidence of the newly generated minority class software defect prediction data is evaluated to obtain a training data set, the selected classifiers are trained, and the final prediction result is obtained by weighted voting. The invention effectively addresses the class imbalance problem in software defect prediction.

Description

A Class Imbalanced Software Defect Prediction Method Based on Data Resampling

Technical Field

The invention belongs to the field of software defect prediction, and in particular relates to a class-imbalance software defect prediction method based on data resampling.

Background

With the development of society and advances in science and technology, the Internet has become deeply integrated into every aspect of daily life. Online shopping, ride hailing, smart homes, restaurant ordering, and many other everyday activities can now be carried out through software, and software has penetrated into food, clothing, housing, and transportation. During software development, functional requirements keep growing, the number of users served keeps increasing, and development time is continually compressed. These pressures make software defects likely. A software defect can prevent the software from functioning normally, cause large production and economic losses, and seriously affect people's daily lives, so avoiding defects as far as possible is both important and necessary. Software defect prediction during development helps developers find defects as early as possible and fix the defective code promptly, thereby avoiding such losses.

In real development environments, however, data with software defects is far scarcer than data without software defects. A software defect prediction model built on such data is less likely to find the code modules that contain defects, whereas an ideal model should be more sensitive to defective data and predict more precisely whether a code module is defective. Solving the class imbalance problem in software defect prediction is therefore very important. To address this shortcoming, the present invention proposes a class-imbalance software defect prediction method.

Summary of the Invention

The main purpose of the present invention is to solve the class imbalance problem in software defect prediction by providing a class-imbalance software defect prediction method that is generally applicable to software defect prediction. To achieve this purpose, the invention comprises the following steps:

Step 1: Select any minority class data item in the minority class data set and compute its Euclidean distance to each minority class data item in the minority class data set in turn, then select from the minority class data set the minority class data item nearest to the selected one. Likewise, compute the Euclidean distance from the selected minority class data item to each majority class data item in the majority class data set in turn, then select from the majority class data set the majority class data item nearest to the selected one. Calculate the distance parameter of the selected minority class data item from its nearest Euclidean distance to the minority class data set and its nearest Euclidean distance to the majority class data set. Label the minority class data in the minority class data set according to their distance parameters to obtain the data point type of each. Compute the K-nearest-neighbor set of each minority class data item in the minority class data set, divide it into a K-nearest-neighbor majority class data set and a K-nearest-neighbor minority class data set, count the majority class data in the former and the minority class data in the latter, and calculate the number of new minority class data to generate for each minority class data item in the minority class data set;

Step 2: Select a first classifier and a second classifier, and evaluate the confidence of the newly generated minority class software defect prediction data to obtain a training data set;

Step 3: Using the first classifier and the second classifier selected in step 2 and the resulting training set $S'$, obtain the final prediction result by weighted voting;

Preferably, the software defect data in step 1 is: $S = \{S_{min}, S_{max}\}$;

The minority class data set in step 1 is: $S_{min} = \{p_1, p_2, \ldots, p_N\}$;

The majority class data set in step 1 is: $S_{max} = \{d_1, d_2, \ldots, d_K\}$;

where $S_{min}$ denotes the minority class data set, $S_{max}$ denotes the majority class data set, $p_i$ denotes the $i$-th minority class data item in the minority class data set, $i \in [1, N]$, $N$ denotes the number of minority class data in the minority class data set, $d_k$ denotes the $k$-th majority class data item in the majority class data set, $k \in [1, K]$, and $K$ denotes the number of majority class data in the majority class data set;

The minority class data item nearest to the selected minority class data item in step 1 is: $p_{min_i}$, $i \in [1, N]$, $min_i \in [1, N]$;

where $p_{min_i}$ denotes the minority class data item in the minority class data set that is nearest to the selected $i$-th minority class data item, and $N$ denotes the number of minority class data in the minority class data set;

The majority class data item nearest to the selected minority class data item in step 1 is: $d_{max_i}$, $i \in [1, N]$, $max_i \in [1, K]$;

where $d_{max_i}$ denotes the majority class data item in the majority class data set that is nearest to the selected $i$-th minority class data item, and $K$ denotes the number of majority class data in the majority class data set;

The nearest Euclidean distance between the selected minority class data item and the minority class data in the minority class data set in step 1 is the Euclidean distance between $p_i$ and $p_{min_i}$, $\lVert p_i - p_{min_i} \rVert_2$;

The nearest Euclidean distance between the selected minority class data item and the majority class data in the majority class data set in step 1 is the Euclidean distance between $p_i$ and $d_{max_i}$, $\lVert p_i - d_{max_i} \rVert_2$;
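The nearest-neighbor search described above can be sketched in Python as follows. This is an illustrative sketch only, not the patented implementation: the function name `nearest_neighbors`, the 0-based indices, the toy data, and the choice to exclude a data item from its own minority neighbor search are assumptions that the text does not state.

```python
import math

def nearest_neighbors(s_min, s_max):
    """For each minority item p_i return (min_i, dist_min, max_i, dist_max).

    min_i and max_i are 0-based indices of the nearest minority and the
    nearest majority item (the text uses 1-based indices).
    """
    result = []
    for i, p in enumerate(s_min):
        # Assumption: an item is not counted as its own nearest minority neighbor.
        dist_min, min_i = min(
            (math.dist(p, q), j) for j, q in enumerate(s_min) if j != i
        )
        dist_max, max_i = min((math.dist(p, d), k) for k, d in enumerate(s_max))
        result.append((min_i, dist_min, max_i, dist_max))
    return result

# Hypothetical toy feature vectors, for illustration only.
s_min = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
s_max = [(0.0, 3.0), (4.0, 5.0)]
neighbors = nearest_neighbors(s_min, s_max)
```

Each tuple in `neighbors` holds the two neighbor indices and the two nearest distances from which a distance parameter could then be derived. The sketch requires at least two minority items and one majority item.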

The distance parameter of the selected minority class data item is calculated in step 1, where $\propto_i$ is the distance parameter of the $i$-th minority class data item in the minority class data set;

In step 1, the minority class data in the minority class data set are labeled according to their distance parameters as follows:

If $\propto_i < 1$, the data point type of the selected $i$-th minority class data item in the minority class data set is marked as a safe point, $flag_i = 1$;

If $\propto_i = 1$, the data point type of the selected $i$-th minority class data item in the minority class data set is marked as a confusion point, $flag_i = 2$;

If $\propto_i > 1$, the data point type of the selected $i$-th minority class data item in the minority class data set is marked as a danger point, $flag_i = 3$;
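The three labeling rules above map directly onto a small function. The sketch below is hedged: the formula for the distance parameter is not reproduced in this text, so `distance_parameter` assumes it is the ratio of the nearest minority distance to the nearest majority distance. That assumption is merely consistent with a value below 1 meaning "safe". The names `point_type`, `SAFE`, `CONFUSION`, and `DANGER` are hypothetical.

```python
SAFE, CONFUSION, DANGER = 1, 2, 3  # flag_i values given in the text

def distance_parameter(dist_min, dist_max):
    # ASSUMPTION (formula not given in the text): nearest-minority distance
    # divided by nearest-majority distance; dist_max must be non-zero.
    return dist_min / dist_max

def point_type(alpha):
    """Return flag_i for a distance parameter, following the three rules."""
    if alpha < 1:
        return SAFE
    if alpha == 1:
        return CONFUSION
    return DANGER

# (nearest minority distance, nearest majority distance) for three toy items.
pairs = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
flags = [point_type(distance_parameter(a, b)) for a, b in pairs]
```

With floating-point features the exact equality test for confusion points will rarely fire. A tolerance could be used in practice, although the text itself specifies strict equality.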

In step 1, the K-nearest-neighbor set of each minority class data item in the minority class data set is computed;

In step 1, the K-nearest-neighbor set of each minority class data item is divided into a K-nearest-neighbor majority class data set and a K-nearest-neighbor minority class data set;

In step 1, the number of majority class data in the K-nearest-neighbor majority class data set and the number of minority class data in the K-nearest-neighbor minority class data set are each counted;
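A minimal sketch of the neighbor counting described above, under stated assumptions: neighbors are searched over both classes with Euclidean distance, the item itself is excluded, and the neighborhood size is passed as the hypothetical parameter `k`. It is written in lower case because the text also uses $K$ for the size of the majority class data set.

```python
import math

def knn_class_counts(i, s_min, s_max, k=5):
    """Return (n_majority, n_minority) among the k nearest neighbors of s_min[i]."""
    p = s_min[i]
    # Assumption: the item itself is not one of its own neighbors.
    candidates = [(math.dist(p, q), "minority")
                  for j, q in enumerate(s_min) if j != i]
    candidates += [(math.dist(p, d), "majority") for d in s_max]
    nearest = sorted(candidates)[:k]
    n_majority = sum(1 for _, label in nearest if label == "majority")
    return n_majority, len(nearest) - n_majority

# Hypothetical toy data, for illustration only.
s_min = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
s_max = [(0.0, 3.0), (4.0, 5.0), (0.0, 1.5), (2.0, 0.0)]
counts = knn_class_counts(0, s_min, s_max, k=3)
```

The two counts are the inputs from which the number of new minority class data per item would be derived. That formula is not reproduced in this text, so it is deliberately not sketched.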

In step 1, the number of newly generated minority class data for each minority class data item in the minority class data set is calculated, where $\propto_i$ is the distance parameter of the $i$-th minority class data item in the minority class data set, and $n_i$ is the number of newly generated minority class data for the $i$-th minority class data item in the minority class data set;

In step 1, the newly generated software defect prediction data are calculated;

In step 1, the $i$-th minority class data item in the minority class data set generates $n_i$ new minority class data, so the newly generated minority class data are denoted $p^{new}_{i,j}$, where $j \in [1, n_i]$;

In step 1, the offset away from the majority class of the $j$-th newly generated data item of the $i$-th minority class data item in the minority class data set is denoted $\varepsilon_{i,j}$;

The offset $\varepsilon_{i,j}$ is calculated from a parameter for the degree of deviation from the majority class, which is a random number between 0 and 1, and from the majority class data item nearest to $p_i$.

In step 1, the offset toward the majority class of the $j$-th newly generated data item of the $i$-th minority class data item in the minority class data set is denoted $\sigma_{i,j}$;

The offset $\sigma_{i,j}$ is calculated from a parameter for the degree of bias toward the minority class, which is a random number between 0 and 1.5, and from the minority class data item nearest to $p_i$.

The newly generated minority class software defect prediction data in step 1 are denoted $p^{new}_{i,j}$;

The $j$-th newly generated data item of the $i$-th minority class data item is calculated as:

$p^{new}_{i,j} = p_i + \varepsilon_{i,j} + \sigma_{i,j}$
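The generation formula $p^{new}_{i,j} = p_i + \varepsilon_{i,j} + \sigma_{i,j}$ can be illustrated with the sketch below. The exact formulas for the two offsets are not reproduced in this text, so the sketch makes two clearly labeled assumptions: $\varepsilon_{i,j}$ is a random fraction in [0, 1] of the vector pointing from the nearest majority item to $p_i$, and $\sigma_{i,j}$ is a random fraction in [0, 1.5] of the vector pointing from $p_i$ to its nearest minority item. The function name `generate_sample` and its arguments are hypothetical.

```python
import random

def generate_sample(p_i, d_near, p_near, rng):
    """Return one synthetic item p_new = p_i + eps + sigma.

    p_i    : minority item being resampled
    d_near : its nearest majority item
    p_near : its nearest minority item
    """
    r1 = rng.uniform(0.0, 1.0)   # deviation-from-majority parameter (range from the text)
    r2 = rng.uniform(0.0, 1.5)   # bias parameter (range from the text)
    # ASSUMPTION: eps pushes p_i away from the nearest majority item.
    eps = [r1 * (x - d) for x, d in zip(p_i, d_near)]
    # ASSUMPTION: sigma pulls p_i toward the nearest minority item.
    sigma = [r2 * (q - x) for x, q in zip(p_i, p_near)]
    return [x + e + s for x, e, s in zip(p_i, eps, sigma)]

rng = random.Random(0)  # seeded only to make the illustration repeatable
new_point = generate_sample((0.0, 0.0), (0.0, 3.0), (1.0, 0.0), rng)
```

In this toy case the new point moves down, away from the majority item at (0, 3), and right, toward the minority item at (1, 0). Calling the function $n_i$ times for item $p_i$ would produce its share of the new minority class data.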

The new minority class data set obtained in step 1 is denoted $S_{new}$;

In step 1, given the number $n_i$ of new defect data to generate for the minority class point $p_i$, the new minority class data set $S_{new}$ is obtained by generating minority class data $p^{new}$ in the manner described above,

where $N'$ is the number of elements in the new minority class data set $S_{new}$. The class of the new data is marked as defect data, and this label is recorded as the weak label $L_w$. The $i$-th item of the new minority class data set is denoted $p'_i$ and carries a corresponding weak label.

Preferably, step 2 is as follows:

In step 2, the influence degree of the first classifier and the influence degree of the second classifier are calculated respectively;

In step 2, the first classifier $H_1$ is trained on the new minority class data set $S_{new}$, and the items of $S_{new}$ are then fed into $H_1$ in turn to obtain the predicted classes $L_{p1}$; each $i$-th point $p'_i$ in $S_{new}$ thus has a weak label and a class predicted by $H_1$;

Likewise, the second classifier $H_2$ is trained on the new minority class data set $S_{new}$, and the items of $S_{new}$ are then fed into $H_2$ in turn to obtain the predicted classes $L_{p2}$; each $i$-th point $p'_i$ in $S_{new}$ thus has a weak label and a class predicted by $H_2$;

The influence degree of the first classifier and the influence degree of the second classifier are each calculated from indicator terms, where $N$ is the number of elements in the minority class data set $S_{min}$. The indicator term for the first classifier takes the value 1 when the class predicted by $H_1$ is the same as the weak label $L_w$ and 0 otherwise, and the indicator term for the second classifier takes the value 1 when the class predicted by $H_2$ is the same as the weak label $L_w$ and 0 otherwise.
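The indicator terms described above suggest an agreement rate between a classifier's predictions and the weak labels. The sketch below assumes, since the formula itself is not reproduced in this text, that the influence degree is the mean of the indicator values over the evaluated items. The function name `influence_degree` and the label encoding (1 for defect) are hypothetical.

```python
def influence_degree(predicted, weak_labels):
    """Fraction of items whose predicted class equals the weak label.

    ASSUMPTION: the influence degree is the mean of the 0/1 indicator terms;
    the normalizing count used in the patent formula is not shown in the text.
    """
    if len(predicted) != len(weak_labels) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    indicators = [1 if p == w else 0 for p, w in zip(predicted, weak_labels)]
    return sum(indicators) / len(indicators)

weak = [1, 1, 1, 1]        # every generated item is weakly labeled as defect
pred_h1 = [1, 1, 0, 1]     # hypothetical predictions of the first classifier
pred_h2 = [1, 0, 0, 1]     # hypothetical predictions of the second classifier
o1 = influence_degree(pred_h1, weak)
o2 = influence_degree(pred_h2, weak)
```

A classifier that agrees more often with the weak labels receives a larger influence degree and therefore more weight in the later voting step.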

In step 2, the labels of the minority class data are updated according to the influence degree of the first classifier and the influence degree of the second classifier, so as to construct the updated original software defect data;

In step 2, the confidence of each weak label is calculated and denoted $\gamma_i$.

In step 2, the weak labels of the new minority class data set are judged according to the influence degrees of the classifiers: when the confidence $\gamma_i > \beta$, the minority class data item is added to the training data; when $\gamma_i \le \beta$, it is deleted directly and is not added to the new training set.
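The screening rule above can be sketched as a simple threshold filter. The formula for $\gamma_i$ is not reproduced in this text, so the sketch assumes that it is the influence-weighted agreement of the two classifiers with the weak label, normalized to [0, 1]. The names `confidence` and `filter_new_samples` are hypothetical, and the default `beta=0.5` is the value stated in the embodiment.

```python
def confidence(agree1, agree2, o1, o2):
    # ASSUMPTION: gamma_i is the influence-weighted agreement of both
    # classifiers with the weak label, normalized by o1 + o2.
    total = o1 + o2
    if total == 0:
        return 0.0
    return (o1 * agree1 + o2 * agree2) / total

def filter_new_samples(samples, agree1, agree2, o1, o2, beta=0.5):
    """Keep only generated items whose confidence exceeds beta."""
    kept = []
    for sample, a1, a2 in zip(samples, agree1, agree2):
        if confidence(a1, a2, o1, o2) > beta:
            kept.append(sample)
    return kept

samples = ["a", "b", "c"]   # stand-ins for generated feature vectors
agree_h1 = [1, 1, 0]        # 1 if the first classifier matches the weak label
agree_h2 = [1, 0, 0]        # 1 if the second classifier matches the weak label
kept = filter_new_samples(samples, agree_h1, agree_h2, o1=0.75, o2=0.5)
```

Under these assumptions an item accepted by both classifiers is always kept, an item rejected by both is always dropped, and a split decision is settled by the relative influence degrees.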

In step 2, the new minority class data set $S_{new}$ is thus re-screened to obtain the filtered new minority class data set $S_{new}'$, and $S_{new}'$ is added to the original software defect data $S$ to obtain the new training set $S'$;

Preferably, step 3 specifically comprises the following steps:

After the new training data set $S'$ is obtained, the first classifier $H_1$ and the second classifier $H_2$ are trained on it. The trained classifiers $H_1$ and $H_2$ then predict the data item $v$, yielding the first classifier's prediction result $L_1$ and the second classifier's prediction result $L_2$. Using the influence degree $o_1$ of the first classifier and the influence degree $o_2$ of the second classifier, the prediction result is obtained from the value of $L_{pre} = L_1 \cdot o_1 + L_2 \cdot o_2$;

In step 3, when the value of $L_{pre}$ is greater than $\beta$, the class of $v$ is predicted to be the minority class;

In step 3, when the value of $L_{pre}$ is less than or equal to $\beta$, the class of $v$ is predicted to be the majority class;
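The weighted vote $L_{pre} = L_1 \cdot o_1 + L_2 \cdot o_2$ and its threshold rule can be sketched as follows. The encoding of the class predictions as 1 (minority, defective) and 0 (majority, non-defective) is an assumption, as are the function name `weighted_vote` and the string return values. The default `beta=0.5` is the value stated in the embodiment.

```python
def weighted_vote(l1, l2, o1, o2, beta=0.5):
    """Combine two class predictions with their influence degrees.

    ASSUMPTION: l1 and l2 are encoded as 1 for the minority (defect) class
    and 0 for the majority class.
    """
    l_pre = l1 * o1 + l2 * o2
    label = "minority" if l_pre > beta else "majority"
    return label, l_pre

# Only the first (more influential) classifier predicts a defect.
vote_a = weighted_vote(1, 0, o1=0.75, o2=0.5)
# Only the second classifier predicts a defect: L_pre equals beta, so majority.
vote_b = weighted_vote(0, 1, o1=0.75, o2=0.5)
```

Because the comparison is strict, a score exactly equal to $\beta$ falls to the majority class, which matches the two rules stated above.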

Compared with the prior art, the advantages and positive effects of the present invention are:

The invention effectively addresses the class imbalance problem.

The invention adds a screening process for the newly generated minority class data, which removes data that deviate from reality and retains data that reflect the true characteristics of the minority class.

The invention provides a software defect prediction method that addresses class imbalance and is widely applicable to various kinds of software defect data.

Brief Description of the Drawings

FIG. 1 is a flowchart of the class-imbalance software defect prediction method of the present invention.

Detailed Description

To make the purpose, technical solution, and advantages of the present invention clearer, the invention is further described below with reference to the accompanying drawing and a specific embodiment. The embodiment is given only to explain the invention and does not limit it.

The overall implementation flow of the invention is shown in FIG. 1, and the specific implementation is as follows:

Step 1: Select any minority class data item in the minority class data set and compute its Euclidean distance to each minority class data item in the minority class data set in turn, then select from the minority class data set the minority class data item nearest to the selected one. Likewise, compute the Euclidean distance from the selected minority class data item to each majority class data item in the majority class data set in turn, then select from the majority class data set the majority class data item nearest to the selected one. Calculate the distance parameter of the selected minority class data item from its nearest Euclidean distance to the minority class data set and its nearest Euclidean distance to the majority class data set. Label the minority class data in the minority class data set according to their distance parameters to obtain the data point type of each. Compute the K-nearest-neighbor set of each minority class data item in the minority class data set, divide it into a K-nearest-neighbor majority class data set and a K-nearest-neighbor minority class data set, count the majority class data in the former and the minority class data in the latter, and calculate the number of new minority class data to generate for each minority class data item in the minority class data set.

The software defect data in step 1 is: $S = \{S_{min}, S_{max}\}$;

The minority class data set, the majority class data set, the nearest minority class and majority class data, the distance parameter, the data point labels, and the generation of the new minority class data $p^{new}_{i,j} = p_i + \varepsilon_{i,j} + \sigma_{i,j}$ in step 1 are as defined in the Summary of the Invention.

When the K-nearest-neighbor set of each minority class data item in the minority class data set is computed in step 1, the experiment sets $K = 5$.

Step 2 is as follows:

The influence degrees of the first classifier $H_1$ and the second classifier $H_2$ and the confidence $\gamma_i$ of each weak label are calculated as described in the Summary of the Invention.

When the confidence $\gamma_i > \beta = 0.5$, the minority class data item is added to the training data; when $\gamma_i \le \beta = 0.5$, it is deleted directly and is not added to the new training set.

Step 3 specifically comprises the following steps:

In step 3, when the value of $L_{pre}$ is greater than $\beta = 0.5$, the class of $v$ is predicted to be the minority class;

In step 3, when the value of $L_{pre}$ is less than or equal to $\beta = 0.5$, the class of $v$ is predicted to be the majority class.

This embodiment compares the method of the invention with several mainstream existing methods, namely SMOTE+SVM, SMOTE+decision tree, SMOTE+k-nearest neighbors, and SMOTE+naive Bayes, using accuracy, F-measure, balance, and AUC as evaluation metrics. Among all the compared methods, the method of the invention achieves the highest accuracy, and its recognition accuracy reaches the advanced level in the field.
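For reference, three of the metrics named above can be computed from a confusion matrix as in the sketch below. The text does not define the metrics, so the definitions here are the ones commonly used in the defect prediction literature and are an assumption about what the embodiment measured. In particular, balance is taken as the normalized distance from the ideal point (false-alarm rate 0, detection rate 1). AUC is omitted because it needs ranked prediction scores. The function name `defect_metrics` and the example counts are hypothetical.

```python
import math

def defect_metrics(tp, fp, tn, fn):
    """Accuracy, F-measure and balance, with the defect class as positive."""
    pd = tp / (tp + fn)            # probability of detection (recall)
    pf = fp / (fp + tn)            # probability of false alarm
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * precision * pd / (precision + pd)
    # ASSUMPTION: balance = 1 - distance to (pf=0, pd=1) divided by sqrt(2).
    balance = 1 - math.sqrt(pf ** 2 + (1 - pd) ** 2) / math.sqrt(2)
    return {"accuracy": accuracy, "f_measure": f_measure, "balance": balance}

# Hypothetical confusion-matrix counts, not results from the patent.
metrics = defect_metrics(tp=8, fp=2, tn=88, fn=2)
```

On imbalanced data, accuracy alone can look high even when few defects are found, which is why F-measure and balance are usually reported alongside it.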

The above is only a preferred embodiment of the present invention, but the protection scope of the invention is not limited thereto. Any equivalent change made by a person skilled in the art, within the technical scope disclosed by the invention and according to the system and implementation method described herein, shall fall within the protection scope of the invention. Therefore, the protection scope of the invention shall be subject to the protection scope of the claims.

Claims

1. A class-imbalance software defect prediction method based on data resampling, characterized in that:

Step 1: Select any minority class data item in the minority class data set and compute its Euclidean distance to each minority class data item in the minority class data set in turn, then select from the minority class data set the minority class data item nearest to the selected one. Likewise, compute the Euclidean distance from the selected minority class data item to each majority class data item in the majority class data set in turn, then select from the majority class data set the majority class data item nearest to the selected one. Calculate the distance parameter of the selected minority class data item from its nearest Euclidean distance to the minority class data set and its nearest Euclidean distance to the majority class data set. Label the minority class data in the minority class data set according to their distance parameters to obtain the data point type of each. Compute the K-nearest-neighbor set of each minority class data item in the minority class data set, divide it into a K-nearest-neighbor majority class data set and a K-nearest-neighbor minority class data set, count the majority class data in the former and the minority class data in the latter, and calculate the number of new minority class data to generate for each minority class data item in the minority class data set;

Step 2: Select a first classifier and a second classifier, and evaluate the confidence of the newly generated minority class software defect prediction data to obtain a training data set;

Step 3: Using the first classifier and the second classifier selected in step 2 and the resulting training set $S'$, obtain the final prediction result by weighted voting;

The software defect data is: $S = \{S_{min}, S_{max}\}$;

The minority class data set in step 1 is: $S_{min} = \{p_1, p_2, \ldots, p_N\}$;

The majority class data set in step 1 is: $S_{max} = \{d_1, d_2, \ldots, d_K\}$;

where $S_{min}$ denotes the minority class data set, $S_{max}$ denotes the majority class data set, $p_i$ denotes the $i$-th minority class data item in the minority class data set, $i \in [1, N]$, $N$ denotes the number of minority class data in the minority class data set, $d_k$ denotes the $k$-th majority class data item in the majority class data set, $k \in [1, K]$, and $K$ denotes the number of majority class data in the majority class data set;

The minority class data item nearest to the selected $i$-th minority class data item in step 1 is $p_{min_i}$, with $i \in [1, N]$ and $min_i \in [1, N]$;

The majority class data item nearest to the selected $i$-th minority class data item in step 1 is $d_{max_i}$, with $i \in [1, N]$ and $max_i \in [1, K]$;

The nearest Euclidean distance between the selected minority class data item and the majority class data in the majority class data set in step 1 is the Euclidean distance between $p_i$ and $d_{max_i}$, $\lVert p_i - d_{max_i} \rVert_2$;

The distance parameter of the selected minority class data item is calculated in step 1, where $\propto_i$ is the distance parameter of the $i$-th minority class data item in the minority class data set;

In step 1, the minority class data in the minority class data set are labeled according to their distance parameters as follows:

If $\propto_i < 1$, the data point type of the selected $i$-th minority class data item in the minority class data set is marked as a safe point, $flag_i = 1$;

If $\propto_i = 1$, the data point type of the selected $i$-th minority class data item in the minority class data set is marked as a confusion point, $flag_i = 2$;

If $\propto_i > 1$, the data point type of the selected $i$-th minority class data item in the minority class data set is marked as a danger point, $flag_i = 3$;

In step 1, the K-nearest-neighbor set of each minority class data item in the minority class data set is computed;

In step 1, the K-nearest-neighbor set of each minority class data item is divided into a K-nearest-neighbor majority class data set and a K-nearest-neighbor minority class data set;

In step 1, the number of majority class data in the K-nearest-neighbor majority class data set and the number of minority class data in the K-nearest-neighbor minority class data set are each counted;

In step 1, the number of newly generated minority class data for each minority class data item in the minority class data set is calculated, where $\propto_i$ is the distance parameter of the $i$-th minority class data item in the minority class data set, and $n_i$ is the number of newly generated minority class data for the $i$-th minority class data item in the minority class data set;

In step 1, the newly generated software defect prediction data are calculated;

In step 1, the $i$-th minority class data item in the minority class data set generates $n_i$ new minority class data, so the newly generated minority class data are denoted $p^{new}_{i,j}$, where $j \in [1, n_i]$;

In step 1, the offset away from the majority class of the $j$-th newly generated data item of the $i$-th minority class data item in the minority class data set is denoted $\varepsilon_{i,j}$;

The offset $\varepsilon_{i,j}$ is calculated using the majority class data item nearest to $p_i$;

In step 1, the offset toward the majority class of the $j$-th newly generated data item of the $i$-th minority class data item in the minority class data set is denoted $\sigma_{i,j}$;

The offset $\sigma_{i,j}$ is calculated using the minority class data item nearest to $p_i$;

The newly generated minority class software defect prediction data in step 1 are denoted $p^{new}_{i,j}$;

The $j$-th newly generated data item of the $i$-th minority class data item is calculated as:

$p^{new}_{i,j} = p_i + \varepsilon_{i,j} + \sigma_{i,j}$

The new minority class data set obtained in step 1 is denoted $S_{new}$;

In step 1, given the number $n_i$ of new defect data to generate for the minority class point $p_i$, the new minority class data set $S_{new}$ is obtained by generating minority class data $p^{new}$ in the manner described above,

where $N'$ is the number of elements in the new minority class data set $S_{new}$; the class of the new data is marked as defect data, and this label is recorded as the weak label $L_w$; the $i$-th item of the new minority class data set is denoted $p'_i$ and carries a corresponding weak label;

Step 2 is as follows:

In step 2, the influence degree of the first classifier and the influence degree of the second classifier are calculated respectively;

In step 2, the first classifier $H_1$ is trained on the new minority class data set $S_{new}$, and the items of $S_{new}$ are then fed into $H_1$ in turn to obtain the predicted classes $L_{p1}$; each $i$-th point $p'_i$ in $S_{new}$ thus has a weak label and a class predicted by $H_1$;

Likewise, each $i$-th point $p'_i$ in $S_{new}$ has a class predicted by the second classifier $H_2$;

The influence degree of the first classifier and the influence degree of the second classifier are each calculated from indicator terms, where $N$ is the number of elements in the minority class data set $S_{min}$; the indicator term for the first classifier takes the value 1 when the class predicted by $H_1$ is the same as the weak label $L_w$ and 0 otherwise, and the indicator term for the second classifier takes the value 1 when the class predicted by $H_2$ is the same as the weak label $L_w$ and 0 otherwise;

In step 2, the labels of the minority class data are updated according to the influence degree of the first classifier and the influence degree of the second classifier, so as to construct the updated original software defect data;

In step 2, the confidence of each weak label is calculated and denoted $\gamma_i$;

In step 2, the weak labels of the new minority class data set are judged: when the confidence $\gamma_i > \beta$, the minority class data item is added to the training data; when $\gamma_i \le \beta$, it is deleted directly and is not added to the new training set;

In step 2, the new minority class data set $S_{new}$ is thus re-screened to obtain the filtered new minority class data set $S_{new}'$, and $S_{new}'$ is added to the original software defect data $S$ to obtain the new training set $S'$.

2. The class-imbalance software defect prediction method based on data resampling according to claim 1, characterized in that:

Step 3 specifically comprises the following steps:

After the new training data set $S'$ is obtained, the first classifier $H_1$ and the second classifier $H_2$ are trained on it; the trained classifiers $H_1$ and $H_2$ then predict the data item $v$, yielding the first classifier's prediction result $L_1$ and the second classifier's prediction result $L_2$; using the influence degree $o_1$ of the first classifier and the influence degree $o_2$ of the second classifier, the prediction result is obtained from the value of $L_{pre} = L_1 \cdot o_1 + L_2 \cdot o_2$;

In step 3, when the value of $L_{pre}$ is greater than $\beta$, the class of the predicted data item $v$ is the minority class;

In step 3, when the value of $L_{pre}$ is less than or equal to $\beta$, the class of the predicted data item $v$ is the majority class.