Software defect prediction model based on feature mapping and attribute compensation technology
Technical Field
The invention belongs to the field of software safety and relates to a software project defect prediction model based on feature mapping and attribute (metric) compensation.
Background
With the rapid development of internet technology, software is becoming more complex by the day, and software safety and reliability are increasingly important in software engineering. To keep software quality highly reliable, defect prediction methods have become a research hotspot. A good defect prediction method can greatly improve testing efficiency, so building a high-performance defect prediction model is of great significance for ensuring software quality.
Software defect prediction builds a prediction model from historical data accumulated during software development. When no historical data exists, or the historical defect data is insufficient to build a model, traditional defect prediction methods that rely on the target project's own data cannot meet practical requirements. In practice, the project to be predicted is usually newly developed and therefore lacks historical data. This gave rise to cross-project defect prediction, in which data from other related projects is used as training data to build a defect prediction model for the target data. The approach brings with it the problem of data differences between the source project and the target project. Because their contexts differ, different projects have different feature spaces and different distributions of defect data; that is, the metric values of the source and target project data sets are distributed differently, so the independent and identically distributed assumption is hard to satisfy. A defect prediction model built with traditional machine learning techniques therefore cannot achieve a good prediction effect.
Current cross-project defect prediction methods fall roughly into two categories: those based on attribute conversion and those based on data selection. Attribute (feature) transformation makes the source project and the target project follow the same distribution while preserving their respective defect features. To address the low model performance caused by large differences in data distribution between projects, a metric compensation method is used to change the data distribution of the source data set so that it adapts to the target data set; during instance training, weights derived from the difference between the source and target project data are used to reduce the distribution difference between them. Transfer learning is then applied so that existing knowledge of the source project can be used when the target domain has only a few labeled instances or none at all. The semi-supervised transfer component analysis (SSTCA) method finds a mapping space in which the distance between source and target project data is minimal while the internal attributes of each are preserved as far as possible. Combining the two methods strengthens the similarity between the data distributions of the source and target domains. In short, feature mapping and attribute value conversion are used to make the data distributions of the source and target data sets as similar as possible in cross-project defect prediction.
Disclosure of Invention
Existing cross-project defect prediction methods fall roughly into two types: one based on attribute conversion and the other based on data selection. Most attribute-conversion methods use weights to fit the source and target data more closely and so obtain similar data distributions. The invention combines an attribute compensation technique with the transfer learning method SSTCA, which makes full use of the class label information of the source-domain data when learning transfer components, so that the data distribution of the target project becomes similar to that of the source project. The transferred source project data is used as training data, and a sampling method is further applied for class-imbalance learning. In addition, the average of the prediction results obtained from multiple source training sets on the same target data set is taken as the final prediction result of the model. The experimental results of the method are compared with those of existing metric compensation methods to verify its efficiency.
The invention provides a cross-project defect prediction model combining a transfer learning technology and an attribute compensation technology, which comprises the following steps:
step 1, reasonably dividing a data set by using cross validation, and performing sampling and normalization pretreatment on a sample set used for testing to obtain a more balanced defect data set;
step 2, solving the problem of data distribution difference among the cross projects by applying an attribute compensation method, so that the data distribution of the target project is more fit with the data distribution of the source project;
step 3, after the sample set processed by the measurement compensation technology is obtained, minimizing the data distance between the source data and the target data by utilizing the transfer learning technology to obtain more similar data distribution;
step 4, classifying the data as defective or not with a decision tree classification model, using the training sample set and the test sample set obtained in step 3, and then evaluating the prediction performance of the model from the prediction results.
In a first aspect, the specific steps of step 1 are as follows:
step 1.1, dividing the data set required for verifying the model performance and reading it with the loadtxt() file reading method, which reads out the metric data and label data of the corresponding training sample set and test sample set, using a delimiter character to split the fields; the first N columns of the read data are taken as metric data and stored in x_list, and the data in column N + 1 is stored in y_list as the defect label;
step 1.2, normalizing the data with the dispersion standardization (min-max) method, converting it into the range [0,1] to obtain normalized training set data, so that the values of the different metric attributes are directly comparable;
step 1.3, oversampling the training data with the SMOTE sampling method, which alleviates the class imbalance of the defect data and improves training accuracy on the data set. A sampling ratio is set according to the imbalance ratio of the samples to determine the sampling rate; for each sample in the minority class, the Euclidean distance to every other sample in the minority class sample set is calculated, and after sorting, its k nearest neighbors are obtained and used to expand the minority class.
In a second aspect, the calculation flow of the attribute compensation method includes:
step 2.1, storing the preprocessed training data sample set as a list and multiplying every sample of the target training set by a weight, where the weight is the ratio of the mean of the training (source) data to the mean of the target data, to obtain a new target data set better adapted to the source data distribution. Similarly, every sample in the source data set is multiplied by the ratio of the target data mean to the source data mean, giving, after one round of attribute value conversion, data sets in which source and target are adapted to each other. The conversion between the source data and the target data is calculated as follows:
source[i,j]=(source[i,j]*metric_mean_target)/metric_mean_source
target[i,j]=(target[i,j]*metric_mean_source)/metric_mean_target
In the above expressions, source[i,j] denotes the j-th metric value of the i-th data instance in the source project data, and metric_mean_source denotes the mean of the j-th metric over all data instances of the source project; similarly, target[i,j] denotes the j-th metric value of the i-th data instance in the target project data, and metric_mean_target denotes the mean of the j-th metric over all data instances of the target project.
Step 2.2, letting sour1[i,j] denote the j-th metric value of the i-th data instance of the new source data obtained in step 2.1, and letting metric_mean_source and metric_mean_target denote the recalculated means of the j-th metric of the source data and the target data, the source data is processed once more in the same way as in step 2.1, so that the source data set undergoes a second attribute value conversion toward the distribution of the target data. This further improves the similarity and fit between the source and target data and completes the two-round metric mapping of the new source and target data. sour1 is calculated as:
sour1[i,j]=(sour1[i,j]*metric_mean_target)/metric_mean_source.
step 2.3, passing the source data and target data obtained from the attribute value conversion as parameters to the transfer learning method, which performs the next conversion, based on feature mapping.
In a third aspect, a computing process for processing source and target data distribution by using a semi-supervised migration component analysis method is as follows:
step 3.1, the SSTCA transfer component analysis technique is used to further process the training data $X_S$ and the test data $X_T$ in order to improve the robustness of the defect prediction model. The distance between the two domains is calculated with the MMD (maximum mean discrepancy) algorithm as $\mathrm{MMD}(X_S, X_T) = \operatorname{tr}(KL)$, where L is a matrix introduced by the MMD algorithm, K is the kernel matrix obtained by kernel function mapping, and $\operatorname{tr}(KL)$ denotes the trace of the matrix product KL.
Step 3.2, a kernel function is used to turn the quantified distance into a kernel learning problem. The matrices L and H are computed first; in their definitions, H is the centering matrix, n1 and n2 are the numbers of instances in the source domain $X_{src}$ and the target domain $X_{tar}$, and $x_i$, $x_j$ denote samples in the domains. The kernel matrix K is then calculated.
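The quantities named in steps 3.1 and 3.2 can be illustrated with a short NumPy sketch. It assumes the standard transfer component analysis definitions, which the text above does not spell out: L[i,j] is 1/n1² when both samples are in the source domain, 1/n2² when both are in the target domain, and -1/(n1·n2) otherwise, and H = I - (1/(n1+n2))·11ᵀ. The linear kernel and the function names are illustrative choices, not taken from the source.

```python
import numpy as np

def mmd_matrices(n1, n2):
    """Build the MMD matrix L and the centering matrix H (standard TCA forms, assumed)."""
    n = n1 + n2
    # e holds 1/n1 for source rows and -1/n2 for target rows, so that
    # e @ e.T reproduces the piecewise definition of L.
    e = np.vstack([np.full((n1, 1), 1.0 / n1), np.full((n2, 1), -1.0 / n2)])
    L = e @ e.T
    H = np.eye(n) - np.ones((n, n)) / n
    return L, H

def linear_kernel(X_src, X_tar):
    """Kernel matrix K over the stacked source and target samples (linear kernel assumed)."""
    X = np.vstack([X_src, X_tar])
    return X @ X.T

# Hypothetical toy data: 2 source samples and 3 target samples with 2 metrics each.
X_src = np.array([[0.0, 0.0], [2.0, 2.0]])
X_tar = np.array([[1.0, 3.0], [3.0, 5.0], [2.0, 4.0]])

L, H = mmd_matrices(len(X_src), len(X_tar))
K = linear_kernel(X_src, X_tar)
distance = np.trace(K @ L)  # tr(KL), the domain distance used in step 3.1
print(distance)
```

With a linear kernel, tr(KL) equals the squared Euclidean distance between the source and target mean vectors, which gives a simple way to sanity-check the matrices.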
Step 3.3, the kernel matrix K calculated in step 3.2 is used to solve for the first m eigenvalues of $(KLK + \mu I)^{-1}KHK$ and their corresponding eigenvectors. As above, L is the matrix introduced by the MMD algorithm, K is the kernel matrix obtained by kernel function mapping, H is the centering matrix, μ is an introduced parameter, and I is the identity matrix. The SSTCA algorithm computes the dimension-reduced matrix: it takes the source domain $X_S$, the target domain $X_T$ and the desired dimension dim after reduction as inputs and returns the new source data features and new target data features (that is, the dimension-reduced results of the source domain and the target domain). Here the number of rows of X is the feature number of the original data, and the columns are the total feature number.
In a fourth aspect, the specific steps of step 4 are as follows:
step 4.1, after the training sample set has undergone the attribute conversion and transfer component analysis operations, defect data with similar distributions and reduced dimensionality is obtained; a corresponding decision tree classifier object is created, and the decision tree classifier is used to predict the defect class of the target data set.
Step 4.2, the prediction result for a given target data set is determined jointly by models trained on several different training data sets: for a selected target data set, several classification models with different prediction performance are trained on different training data sets, and the average of their prediction results is taken as the final prediction for the target data.
Compared with the prior art, the invention has the beneficial effects that:
1. The cross-project defect prediction model combining transfer learning with metric compensation predicts defects on the basis of similar source and target data distributions. The attribute compensation technique uses weights during training to adapt the source data to the distribution of the target data, and once relatively well-fitted data has been obtained, a second metric value conversion is performed. The distributions of both source and target data are thus taken into account, making the resulting defect prediction model more robust.
2. When a defect prediction model is built from training data and test data whose distributions are inconsistent, the trained model performs poorly and may fail to predict correct results in testing. Processing the data distribution with transfer learning addresses this problem and thereby further enhances the similarity between different data domains.
Drawings
FIG. 1 is a general flow diagram of a cross-project defect prediction method incorporating feature mapping and attribute value conversion techniques.
FIG. 2 is a detailed flow diagram of a cross-project defect prediction method incorporating feature mapping and attribute value conversion techniques.
FIG. 3 is data sample set information used in the experimental segment of the present invention.
Fig. 4 shows the information and results of experiments performed on the defect prediction models obtained by processing data using different schemes with KC2 as the target project and CM1, KC1, JM1, and PC1 as the training set of source projects.
FIG. 5 shows the information and results of experiments performed on defect prediction models obtained by processing data using different schemes with CM1 as the target project and KC2, KC1, JM1, and PC1 as the training set of source projects.
Fig. 6 shows the information and results of experiments performed on the defect prediction models obtained by data processing using different schemes with KC1 as the target project and CM1, KC2, JM1, and PC1 as the source project training sets.
FIG. 7 shows the information and results of experiments performed on defect prediction models obtained by processing data using different schemes with PC1 as the target project and CM1, KC1, JM1, and KC2 as the training set of source projects.
Fig. 8 shows the AUC values, recall and precision of experiments in which data is processed with different schemes, using KC2 as the target project and CM1, KC1, JM1, KC2 as the source project training sets.
FIG. 9 shows the AUC values, recall and precision of experiments in which data is processed with different schemes, using PC1 as the target project and KC1, CM1, JM1, KC2 as the source project training sets.
FIG. 10 shows the AUC values, recall and precision of experiments in which data is processed with different schemes, using KC1 as the target project and PC1, CM1, JM1, KC2 as the source project training sets.
Detailed Description
The invention will be further described with reference to the accompanying drawings and embodiments, which are described for the purpose of facilitating an understanding of the invention and are not intended to be limiting in any way.
The invention addresses the large difference in data distribution between the source data set and the target data set when a defect prediction model is built for cross-project defect prediction. It provides a method for improving the similarity of the source and target data, so that a cross-project defect prediction model with better prediction performance and higher robustness can be established, and sufficient experiments are carried out to demonstrate the feasibility and efficiency of the method.
As shown in fig. 1, a cross-project defect prediction method combining transfer learning and attribute compensation techniques of the present invention includes:
step 201, dividing the data set required for verifying the model performance and reading it with the loadtxt() file reading method, so that the metric data and label data of the corresponding training sample set and test sample set are read out, using a delimiter character to split the fields; the first N columns of the read data are taken as metric data and stored in x_list, and the data in column N + 1 is stored in y_list as the defect label; sampling and normalization preprocessing are then performed on the sample set used for testing to obtain a more balanced defect data set;
the purpose of data preprocessing in the invention is as follows. When the obtained source data sets are used as training data, each project data set consists of a number of samples, and each defect sample is described by a number of metrics, but different metrics are measured on different scales. The training samples therefore need to be preprocessed before the defect prediction model is trained so that the different metrics share the same scale. In short, when the units of the training data differ across dimensions, a normalization step is needed to preprocess the data.
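As an illustration of the reading step, the following minimal sketch uses NumPy's loadtxt to split a file into metric columns and a label column. The comma delimiter, the three-metric toy content and the function name load_defect_data are assumptions made for the example; the source does not state the delimiter.

```python
import io
import numpy as np

def load_defect_data(source, n_metrics, delimiter=","):
    """Read a defect data file and split it into metric values and labels."""
    data = np.loadtxt(source, delimiter=delimiter, ndmin=2)
    x_list = data[:, :n_metrics]   # first N columns: metric values
    y_list = data[:, n_metrics]    # column N + 1: defect label
    return x_list, y_list

# Hypothetical file contents: three metrics followed by a 0/1 defect label.
demo_file = io.StringIO("10,2,0.5,0\n30,4,1.5,1\n20,3,1.0,0\n")
x_list, y_list = load_defect_data(demo_file, n_metrics=3)
print(x_list.shape, y_list)
```

A file path can be passed instead of the in-memory buffer; loadtxt accepts both.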
In step 2011, the data is normalized with the dispersion standardization (min-max) method and converted into the range [0,1], so that the values of the different metric attributes are directly comparable.
The normalization is calculated as follows: (1) find the maximum and minimum values in the sample data; (2) apply the transfer function
$$f' = \frac{f - f_{\min}}{f_{\max} - f_{\min}}$$
to the metric and label values in the data to convert the data samples into [0,1], where $f'$ is the normalized data.
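The dispersion (min-max) normalization can be sketched as follows. The per-column treatment and the handling of constant columns are assumptions added for the example; the source only states that the maximum and minimum are computed and the data mapped into [0,1].

```python
import numpy as np

def min_max_normalize(x):
    """Scale every column of x into [0, 1] using its own minimum and maximum."""
    x = np.asarray(x, dtype=float)
    col_min = x.min(axis=0)
    col_max = x.max(axis=0)
    span = col_max - col_min
    span[span == 0] = 1.0  # assumption: a constant column is mapped to 0
    return (x - col_min) / span

# Hypothetical metric matrix: rows are samples, columns are metrics.
metrics = np.array([[10.0, 2.0], [30.0, 4.0], [20.0, 3.0]])
normalized = min_max_normalize(metrics)
print(normalized)
```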
Step 2012, the SMOTE oversampling method is used to sample and expand the training data. A sampling ratio is set according to the imbalance ratio of the samples to determine the sampling rate; for each sample in the minority class, the Euclidean distance to every other sample in the minority class sample set is calculated, and after sorting, its k nearest neighbors are obtained and used to expand the minority class.
The sampling is calculated as follows: for a minority class sample $x$ and each randomly selected neighbor $x_n$, a new sample is constructed according to
$$x_{new} = x + \mathrm{rand}(0,1)\times(x_n - x)$$
so that a more balanced data set is obtained according to the set sampling ratio.
Through the operation, the preprocessed standard input data can be obtained.
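The oversampling step can be illustrated with a small hand-written SMOTE sketch in NumPy. It follows the usual SMOTE interpolation between a minority sample and one of its k nearest minority neighbors; the function name, the seed parameter and the toy data are hypothetical, and at least two minority samples are assumed.

```python
import numpy as np

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbor interpolation."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    k = min(k, n - 1)  # cannot have more neighbors than other samples
    # Pairwise Euclidean distances inside the minority class.
    diff = minority[:, None, :] - minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    neighbours = np.argsort(dist, axis=1)[:, :k]  # sorted k nearest neighbors
    synthetic = np.empty((n_new, minority.shape[1]))
    for s in range(n_new):
        i = rng.integers(n)                             # pick a minority sample x
        x = minority[i]
        x_n = minority[neighbours[i, rng.integers(k)]]  # pick one of its neighbors
        synthetic[s] = x + rng.random() * (x_n - x)     # interpolate between them
    return synthetic

# Hypothetical minority class with four samples and two metrics.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_samples = smote(minority, n_new=6, k=2)
print(new_samples.shape)
```

Because each synthetic point lies on a segment between two existing minority samples, it stays inside the bounding box of the minority class.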
Step 202, addressing the difference in data distribution between projects by applying the attribute compensation method, so that the data distribution of the target project fits that of the source project more closely;
in step 202, the step of processing the data distribution by the attribute compensation method is as follows:
step 2021, the preprocessed training data sample set is stored as a list, and every sample of the target training set is multiplied by a weight, where the weight is the ratio of the mean of the training (source) data to the mean of the target data, to obtain a new target data set better adapted to the source data distribution. Similarly, every sample in the source data set is multiplied by the ratio of the target data mean to the source data mean, giving, after one round of attribute value conversion, data sets in which source and target are adapted to each other. The conversion between the source data and the target data is calculated as follows:
source[i,j]=(source[i,j]*metric_mean_target)/metric_mean_source
target[i,j]=(target[i,j]*metric_mean_source)/metric_mean_target
wherein:
metric_mean_source=np.mean(source[:,j])
metric_mean_target=np.mean(target[:,j])
In the above expressions, source[i,j] denotes the j-th metric value of the i-th data instance in the source project data, and metric_mean_source denotes the mean of the j-th metric over all data instances of the source project; similarly, target[i,j] denotes the j-th metric value of the i-th data instance in the target project data, and metric_mean_target denotes the mean of the j-th metric over all data instances of the target project.
Step 2022, sour1[i,j] denotes the j-th metric value of the i-th data instance of the new source data obtained in step 2021, and metric_mean_source and metric_mean_target denote the recalculated means of the j-th metric of the source data and the target data. Using the same weighting idea, the source data set undergoes a second attribute value conversion toward the distribution of the target data, which further improves the similarity and fit between the source and target data and completes the two-round metric mapping of the new source and target data. sour1 is calculated as:
sour1[i,j]=(sour1[i,j]*metric_mean_target)/metric_mean_source.
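The two rounds of attribute compensation can be written in vectorized NumPy form. This is a sketch under two assumptions that the text leaves open: the round-one means are taken from the data before either set is rescaled, and every metric mean is non-zero. The names compensate and targ1 are introduced here for illustration.

```python
import numpy as np

def compensate(source, target):
    """Apply the two-round metric compensation to source and target data."""
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)

    # Round 1: rescale each set by the ratio of the other set's column mean to its own.
    metric_mean_source = source.mean(axis=0)
    metric_mean_target = target.mean(axis=0)
    sour1 = source * metric_mean_target / metric_mean_source
    targ1 = target * metric_mean_source / metric_mean_target

    # Round 2: recompute the means and convert the source data once more.
    metric_mean_source = sour1.mean(axis=0)
    metric_mean_target = targ1.mean(axis=0)
    sour1 = sour1 * metric_mean_target / metric_mean_source
    return sour1, targ1

# Hypothetical data: two metrics, two instances per project.
source = np.array([[1.0, 2.0], [3.0, 6.0]])
target = np.array([[2.0, 1.0], [6.0, 3.0]])
new_source, new_target = compensate(source, target)
print(new_source.mean(axis=0), new_target.mean(axis=0))
```

Under this reading, both outputs end up with the same per-metric means, which is the alignment the two rounds aim for.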
step 203, after the sample set processed by the metric compensation technology is obtained, minimizing the data distance between the source and the target data by using a transfer learning Technology (TCA) to obtain more similar data distribution;
in step 203, the step of changing the distribution of the source and target data by using the SSTCA technology is as follows:
step 2031, the source and target data obtained after the two rounds of mapping in step 202 are passed as parameters to the SSTCA algorithm, and transfer component analysis is used to map the data once more.
Step 2032, the difference between the two domains is expressed as a distance between them, which is measured quantitatively with the MMD (maximum mean discrepancy) algorithm. A kernel function is used to turn the quantified distance into a kernel learning problem. The matrices L and H are computed first; in their definitions, H is the centering matrix, n1 and n2 are the numbers of instances in the source domain $X_{src}$ and the target domain $X_{tar}$, and $x_i$, $x_j$ denote samples in the domains. The kernel matrix K is then calculated.
Step 2033, the kernel matrix K calculated in step 2032 is used to solve for the first m eigenvalues of $(KLK + \mu I)^{-1}KHK$ and their corresponding eigenvectors. As above, L is the matrix introduced by the MMD algorithm, K is the kernel matrix obtained by kernel function mapping, H is the centering matrix, μ is an introduced parameter, and I is the identity matrix. The SSTCA algorithm computes the dimension-reduced matrix: it takes the source domain Xs, the target domain Xt and the desired dimension dim after reduction as inputs and returns the new source data features and new target data features, which are the dimension-reduced results of the source domain and the target domain. Here the number of rows of X is the feature number of the original data, and the columns are the total feature number.
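The eigen-decomposition in step 2033 can be sketched with NumPy as below. The sketch implements only the unsupervised transfer component analysis core given by the expression $(KLK + \mu I)^{-1}KHK$; the label-dependent terms that distinguish SSTCA are not included, and the linear kernel, the default value of mu and the function name tca are assumptions for illustration.

```python
import numpy as np

def tca(X_src, X_tar, dim, mu=1.0):
    """Project source and target samples onto dim transfer components."""
    X_src = np.asarray(X_src, dtype=float)
    X_tar = np.asarray(X_tar, dtype=float)
    n1, n2 = len(X_src), len(X_tar)
    n = n1 + n2
    X = np.vstack([X_src, X_tar])

    # MMD matrix L and centering matrix H (standard TCA definitions, assumed).
    e = np.vstack([np.full((n1, 1), 1.0 / n1), np.full((n2, 1), -1.0 / n2)])
    L = e @ e.T
    H = np.eye(n) - np.ones((n, n)) / n

    K = X @ X.T  # linear kernel (assumption)

    # Solve (KLK + mu*I)^-1 KHK without forming the inverse explicitly.
    A = np.linalg.solve(K @ L @ K + mu * np.eye(n), K @ H @ K)
    eigvals, eigvecs = np.linalg.eig(A)
    order = np.argsort(-eigvals.real)[:dim]  # indices of the dim largest eigenvalues
    W = eigvecs[:, order].real

    Z = K @ W  # embedded samples, source rows first
    return Z[:n1], Z[n1:]

# Hypothetical data: 6 source and 5 target samples with 4 metrics each.
rng = np.random.default_rng(0)
X_src = rng.random((6, 4))
X_tar = rng.random((5, 4)) + 0.5
new_src, new_tar = tca(X_src, X_tar, dim=2)
print(new_src.shape, new_tar.shape)
```

The returned arrays have one row per sample and dim columns, so they can be passed directly to a classifier as training and test features.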
And step 204, classifying the data by utilizing a decision tree technology according to the training sample set and the test sample set obtained in the step 203 to obtain the defect prediction performance of the model.
step 2041, after the training sample set has undergone the attribute conversion and transfer component analysis operations, defect data with similar distributions and reduced dimensionality is obtained; a corresponding decision tree classifier object is created, and the decision tree classifier is used to predict the defect class of the target data set;
step 2042, the prediction result for a given target data set is determined jointly by models trained on several different training data sets: for a selected target data set, several classification models with different prediction performance are trained on different training data sets, and the average of their prediction results is taken as the final prediction for the target data.
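The averaging in step 2042 can be illustrated as follows. The per-model predictions are given here as hypothetical 0/1 arrays; in practice they could come from decision tree classifiers trained on the different source projects. The 0.5 threshold used to turn the averaged value into a class label is an assumption, since the text only specifies taking the average.

```python
import numpy as np

def average_predictions(per_model_predictions, threshold=0.5):
    """Average the predictions of several models on the same target data."""
    scores = np.mean(np.asarray(per_model_predictions, dtype=float), axis=0)
    labels = (scores >= threshold).astype(int)  # assumed decision rule
    return scores, labels

# Hypothetical predictions of four models (one per source project)
# for the same four target instances.
predictions = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]
scores, labels = average_predictions(predictions)
print(scores, labels)
```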
The invention mainly provides a method for handling the large difference between source-domain and target-domain data distributions in cross-project defect prediction. The data is processed by combining transfer learning with the attribute compensation technique to convert attribute values. Five subsets of the NASA data set are selected as experimental data, namely CM1, JM1, KC1, KC2 and PC1; detailed information on the data sets used in the experiments is shown in fig. 3.
To improve the performance of the cross-project defect prediction model, the invention proposes several measures for reducing the data difference between projects. Because the metric data of different projects is distributed very differently, the invention uses weights to perform attribute conversion on the source data and the target data so that the difference between projects is minimized. It also applies the transfer component analysis technique SSTCA to map the attributes a second time, and when predicting the target data it takes the average of the prediction results of several classifiers with different performance as the final prediction of the model, which improves the generalization ability and prediction performance of the model.
The line charts of figs. 8-10 show the results visually. Across the four data processing methods and the five selected data sets, the method proposed in 2017, which adapts the target data set to the distribution of the source data set, improves recall to some extent over the 2008 method, which adapts the source data set to the distribution of the target data set; that is, it allows the model to find defects more completely. The present work therefore also adopts the idea of adapting the target data to the distribution of the source data and combines it with the SSTCA feature-mapping-based dimension reduction method so that the attribute values of the test data and the target data are as similar as possible, which improves the prediction effect of the defect prediction model and gives the best result among the compared methods.
Once the idea of adapting the target data to the source data distribution had been adopted, the invention investigated how to design suitable weights so that the source data set is better adapted to the target data. In the attribute compensation method, the target data is adapted to the distribution of the source data set, and the source data set additionally undergoes a second conversion toward the target data. Each method was run 500 times on each pair of source and target data, and the average of the experimental results over the different data sets was taken as the final performance of the model. The recall, AUC value and precision obtained when the data distribution is processed by combining SSTCA feature mapping with attribute value conversion are shown in FIGS. 4-7. For comparison, similar experiments were also performed with the attribute value conversion methods from 2008 and 2017.
With the data processing method chosen as above, defect prediction was carried out on the five data sets selected for the experiments. The proposed method is first applied to handle the distribution difference between the source data and the target data, yielding source and target data with similar distributions; the target data is then predicted with a classifier trained on the transferred data. In addition, to verify the effectiveness of the proposed method, several groups of comparison experiments were set up in which the same data sets were processed with different data processing methods, and the results were compared comprehensively on indexes such as AUC. The experimental results are shown in figs. 4-7.
Figs. 8-10 show that the proposed algorithm enables the decision tree classifier to obtain a more accurate defect prediction effect, because the method takes the data distributions of both the source and target data into account when training the model. In addition, through the transfer component analysis technique, the method uses the class label information of the source project data to minimize the distance between the source data and the target data in the mapping space and takes the transferred source project data as training data. In other words, attribute value conversion and feature-mapping-based transfer learning together make the distributions of the source and target data as similar as possible, which improves the prediction effect of the defect prediction system. The proposed method therefore effectively improves on methods based on metric value conversion, and the improved method raises the performance of the prediction model to a certain extent.
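The evaluation indexes named in the experiments (recall, precision and AUC) can be computed with a few lines of NumPy. This is a generic sketch of the standard definitions, not the evaluation code used in the experiments; the labels and scores below are hypothetical.

```python
import numpy as np

def recall_precision(y_true, y_pred):
    """Recall and precision for the defective class (label 1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

def auc(y_true, scores):
    """AUC as the fraction of (defective, clean) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

# Hypothetical labels, hard predictions and scores for five target instances.
y_true = [1, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1]
scores = [0.9, 0.6, 0.4, 0.2, 0.8]
recall, precision = recall_precision(y_true, y_pred)
auc_value = auc(y_true, scores)
print(recall, precision, auc_value)
```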