CN113626316A - Software defect prediction model based on feature mapping and attribute compensation technology - Google Patents

Software defect prediction model based on feature mapping and attribute compensation technology

Publication number
CN113626316A
Authority
CN
China
Prior art keywords
data
source
target
training
metric
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110851716.2A
Other languages
Chinese (zh)
Other versions
CN113626316B (en
Inventor
陈锦富
王小丽
蔡赛华
陈海波
张翅
徐家平
黄创飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN202110851716.2A priority Critical patent/CN113626316B/en
Publication of CN113626316A publication Critical patent/CN113626316A/en
Application granted granted Critical
Publication of CN113626316B publication Critical patent/CN113626316B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Prevention of errors by analysis, debugging or testing of software
    • G06F11/3668 Testing of software
    • G06F11/3672 Test management
    • G06F11/3684 Test management for test design, e.g. generating new test cases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Prevention of errors by analysis, debugging or testing of software
    • G06F11/3668 Testing of software
    • G06F11/3672 Test management
    • G06F11/3688 Test management for test execution, e.g. scheduling of test suites
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Stored Programmes (AREA)

Abstract

The invention provides a software defect prediction model based on feature mapping and attribute value conversion, comprising: step 1, dividing the data set using cross-validation, and applying sampling and normalization preprocessing to the sample set used for testing to obtain a more balanced defect data set; step 2, applying a metric compensation method to address the data distribution difference between the source project and the target project, so that the data distribution of the target project better fits that of the source project; step 3, using transfer learning to map the data distance between the source data and the target data into a feature space and minimize it; step 4, using a decision tree classification model to classify the data for defects, based on the training sample set and test sample set obtained in step 3.

Figure 202110851716

Description

Software defect prediction model based on feature mapping and attribute compensation technology
Technical Field
The invention belongs to the field of software security and relates to a software project defect prediction model based on feature mapping and metric compensation.
Background
With the rapid development of internet technology, software is becoming increasingly complex, and software security and reliability are increasingly important in software engineering. To ensure high software quality, defect prediction has become a research hotspot, since a good software defect prediction method can greatly improve testing efficiency. Building a high-performance defect prediction model is therefore of great significance for ensuring software quality.
Software defect prediction builds a prediction model from historical data accumulated during software development. When no historical data exists, or the historical defect data is insufficient to build a model, traditional defect prediction methods based on the target project's own data cannot meet practical needs. In practice, the projects to be predicted are usually newly developed, so historical data is often insufficient. This gave rise to cross-project defect prediction, in which data from other related projects is used as training data to build a defect prediction model for the target data. It brings with it the problem of data differences between the source project and the target project. Because their contexts differ, different projects have different feature spaces and defect data distributions; that is, the metric values of the source and target data sets are distributed differently, so the independent and identically distributed (i.i.d.) assumption is hard to satisfy. A defect prediction model built with traditional machine learning techniques therefore cannot achieve good prediction performance.
Current cross-project defect prediction methods fall roughly into two categories: attribute-conversion-based and data-selection-based. Attribute feature transformation makes the source and target projects follow the same distribution while preserving their respective defect characteristics. To address the low performance of defect prediction models caused by large distribution differences between projects, a metric compensation method changes the data distribution of the source data set to fit the target data set, using weights during instance training to reduce the distribution difference between source and target project data. Transfer learning is then used to exploit existing knowledge from the source project when the target domain has only a few labeled instances or none at all. A semi-supervised transfer component analysis (SSTCA) method minimizes the data distance between the source and target projects in a mapped space while preserving the internal properties of each as far as possible. Combining the two methods enhances the similarity of the data distributions of the source and target domains. In short, feature mapping and attribute value conversion are used to make the data distributions of the source and target data sets as similar as possible in cross-project defect prediction.
Disclosure of Invention
Existing cross-project defect prediction methods fall roughly into two types: those based on attribute conversion and those based on data selection. Most attribute-conversion methods use weights to make the source and target data fit each other more closely and so obtain similar data distributions. The invention combines an attribute compensation technique with SSTCA, a transfer learning method that makes full use of the class label information of source-domain data for transfer component learning, so that the data distribution of the target project becomes similar to that of the source project. SSTCA makes full use of the class labels in the source project and takes the transferred source project data as training data; a sampling method is then used for class imbalance learning. In addition, the average of the predictions that models trained on multiple source training sets make for the same target data set is taken as the model's final prediction. Experimental results are compared with existing metric compensation methods to verify the effectiveness of the method.
The invention provides a cross-project defect prediction model combining a transfer learning technology and an attribute compensation technology, which comprises the following steps:
step 1, dividing the data set using cross-validation, and applying sampling and normalization preprocessing to the sample set used for testing to obtain a more balanced defect data set;
step 2, applying an attribute compensation method to address the data distribution difference between projects, so that the data distribution of the target project better fits that of the source project;
step 3, after obtaining the sample set processed by the attribute (metric) compensation technique, minimizing the data distance between the source data and the target data using transfer learning to obtain more similar data distributions;
step 4, classifying the data for defects with a decision tree classification model, using the training sample set and test sample set obtained in step 3, and evaluating the model's prediction performance from the prediction results.
In a first aspect, the specific steps of step 1 are as follows:
step 1.1, reading the data set needed to verify the model's performance with the loadtxt() file reading method, and reading out the metric data and label data of the corresponding training and test sample sets, with a delimiter character separating the fields; the first N columns of the read data are stored in x_list as metric data, and column N+1 is stored in y_list as the defect label;
step 1.2, normalizing the data with min-max (dispersion) normalization, mapping the values into [0,1] to obtain normalized training set data, so that the values of the different metric attributes are on a common scale;
step 1.3, oversampling the training data with the SMOTE sampling method, which alleviates the class imbalance of the defect data and improves training accuracy. A sampling ratio is set according to the sample imbalance ratio to determine the sampling rate; for each minority-class sample, the Euclidean distance to all other samples in the minority-class set is computed and sorted to obtain its k nearest neighbors, which are then used to expand the minority class.
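The column split described in step 1.1 can be sketched as follows. This is a minimal illustration assuming NumPy's `np.loadtxt`; the comma delimiter, the value of N, and the in-memory sample data are all assumptions made for the example, since the text does not specify them.

```python
import io
import numpy as np

# Hypothetical file contents: three metric columns followed by a 0/1 defect label.
raw = io.StringIO("10,2,0.5,0\n20,4,1.5,1\n30,6,2.5,0\n")

N = 3  # number of metric columns (assumed for this example)
data = np.loadtxt(raw, delimiter=",")  # "," is an assumed delimiter

x_list = data[:, :N]             # first N columns: metric data
y_list = data[:, N].astype(int)  # column N+1: defect label
```

In practice `raw` would be replaced by the path of a project data file, and N by the number of metrics in that data set.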
In a second aspect, the calculation flow of the attribute compensation method includes:
step 2.1, storing the preprocessed training data sample set as a list, and multiplying each sample of the target training set by a weight equal to the ratio of the source (training) data mean to the target data mean, giving a new target data set that better fits the source data distribution. Similarly, every sample in the source data set is multiplied by the ratio of the target data mean to the source data mean, so that the source data fits the target data. The result is a pair of data sets that have been adapted to each other by one round of attribute value conversion. The conversion between source and target data is calculated as:
source[i,j]=(source[i,j]*metric_mean_target)/metric_mean_source
target[i,j]=(target[i,j]*metric_mean_source)/metric_mean_target
wherein:
metric_mean_source=np.mean(source[:,j])
metric_mean_target=np.mean(target[:,j])
In the above expressions, source[i,j] denotes the j-th metric value of the i-th data instance in the source project data, and metric_mean_source denotes the mean of the j-th metric over all data instances of the source project; similarly, target[i,j] denotes the j-th metric value of the i-th data instance in the target project data, and metric_mean_target denotes the mean of the j-th metric over all data instances of the target project.
step 2.2, letting sour1[i,j] denote the j-th metric value of the i-th data instance of the new source data obtained in step 2.1, and letting metric_mean_source and metric_mean_target denote the recomputed means of the j-th metric of the converted source and target data, the source data is processed once more in the same way as in step 2.1. The source data set thus undergoes a second attribute value conversion toward the distribution of the target data, further improving the similarity and fit between source and target data and completing two rounds of metric mapping. sour1 is calculated as:
sour1[i,j]=(sour1[i,j]*metric_mean_target)/metric_mean_source
step 2.3, passing the source and target data produced by the attribute value conversion as parameters to the transfer learning method for the next conversion, which is based on feature mapping.
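The two rounds of attribute value conversion in steps 2.1 and 2.2 can be sketched with NumPy as below. This is one reading of the text, not a definitive implementation: the function names are hypothetical, the means are taken per metric column, and every column mean is assumed to be non-zero.

```python
import numpy as np

def compensate(source, target):
    """One round (step 2.1): scale each metric column by the ratio of the
    other data set's column mean to its own column mean."""
    mean_s = source.mean(axis=0)  # metric_mean_source, one value per metric j
    mean_t = target.mean(axis=0)  # metric_mean_target, one value per metric j
    new_source = source * mean_t / mean_s
    new_target = target * mean_s / mean_t
    return new_source, new_target

def two_round_compensation(source, target):
    """Step 2.1 followed by step 2.2 (second conversion of the source only)."""
    source1, target1 = compensate(source, target)
    # Step 2.2: recompute the means on the converted data, then convert again.
    mean_s = source1.mean(axis=0)
    mean_t = target1.mean(axis=0)
    sour1 = source1 * mean_t / mean_s
    return sour1, target1

# Hypothetical toy data: rows are instances, columns are metrics.
source = np.array([[1.0, 10.0], [3.0, 30.0]])
target = np.array([[2.0, 5.0], [6.0, 15.0], [4.0, 10.0]])
sour1, target1 = two_round_compensation(source, target)
```

Under this reading, the per-metric means of the returned source and target sets coincide, which is the sense in which the two data sets are brought closer together before feature mapping.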
In a third aspect, the calculation flow for processing the source and target data distributions with semi-supervised transfer component analysis is as follows:
step 3.1, using the SSTCA transfer component analysis technique to further improve the similarity between the training data ($X_S$) and the test data ($X_T$) and so improve the robustness of the defect prediction model. The distance between the two domains is calculated with the MMD (maximum mean discrepancy) measure, $\mathrm{MMD}(X_S, X_T) = \mathrm{tr}(KL)$, where L is the matrix introduced by MMD, K is the kernel matrix obtained by kernel function mapping, and tr(KL) denotes the trace of the matrix product KL.
Step 3.2, a kernel function is used to turn the distance computation into a kernel learning problem. The L and H matrices are computed first, defined as
$$L_{ij} = \begin{cases} \dfrac{1}{n_1^2}, & x_i, x_j \in X_{src} \\ \dfrac{1}{n_2^2}, & x_i, x_j \in X_{tar} \\ -\dfrac{1}{n_1 n_2}, & \text{otherwise} \end{cases} \qquad H = I - \frac{1}{n_1 + n_2}\mathbf{1}\mathbf{1}^{\top}$$
where H is the centering matrix, $n_1$ and $n_2$ are the numbers of instances in the source domain $X_{src}$ and the target domain $X_{tar}$, and $x_i$, $x_j$ denote samples in the domains. The kernel matrix K is then computed.
Step 3.3, the kernel matrix K calculated in step 3.2 is used to solve for the eigenvectors of $(KLK + \mu I)^{-1}KHK$ corresponding to its m largest eigenvalues. As above, L is the matrix introduced by MMD, K is the kernel matrix obtained by kernel function mapping, H is the centering matrix, μ is an introduced trade-off parameter, and I is the identity matrix. The SSTCA algorithm takes the source domain $X_S$, the target domain $X_T$ and the desired data dimension dim after reduction, and returns the new source data features and new target data features (that is, the dimension-reduced source and target domains). Here the number of rows of X is the number of features of the original data, and the number of columns is the total number of features.
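Steps 3.1 to 3.3 can be sketched in plain NumPy as follows. The sketch follows the expression written in the text, $(KLK + \mu I)^{-1}KHK$, which is the eigenproblem of unsupervised transfer component analysis; the label-dependent terms that distinguish SSTCA are not included. The linear kernel, the function names, the row-per-instance layout, and the default parameters are assumptions made for illustration.

```python
import numpy as np

def mmd_matrices(n1, n2):
    """Build the MMD matrix L and the centering matrix H for n1 + n2 samples."""
    n = n1 + n2
    e = np.concatenate([np.full(n1, 1.0 / n1), np.full(n2, -1.0 / n2)])
    L = np.outer(e, e)                      # 1/n1^2, 1/n2^2, -1/(n1*n2) blocks
    H = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    return L, H

def mmd(Xs, Xt):
    """MMD(Xs, Xt) = tr(KL) with an assumed linear kernel K = X X^T."""
    X = np.vstack([Xs, Xt])
    L, _ = mmd_matrices(len(Xs), len(Xt))
    return np.trace(X @ X.T @ L)

def tca(Xs, Xt, dim=2, mu=1.0):
    """Project source and target rows onto `dim` transfer components."""
    n1, n2 = len(Xs), len(Xt)
    X = np.vstack([Xs, Xt])
    K = X @ X.T                             # linear kernel (assumption)
    L, H = mmd_matrices(n1, n2)
    # Solve (KLK + mu*I)^(-1) KHK without forming the inverse explicitly.
    A = np.linalg.solve(K @ L @ K + mu * np.eye(n1 + n2), K @ H @ K)
    eigvals, eigvecs = np.linalg.eig(A)
    top = np.argsort(-eigvals.real)[:dim]   # indices of the largest eigenvalues
    W = eigvecs[:, top].real
    Z = K @ W                               # embedded samples, one row each
    return Z[:n1], Z[n1:]

# Hypothetical toy data: 6 source and 5 target instances with 4 metrics.
rng = np.random.default_rng(0)
Xs = rng.normal(size=(6, 4))
Xt = rng.normal(loc=1.0, size=(5, 4))
Zs, Zt = tca(Xs, Xt, dim=2)
```

With a linear kernel, `mmd(Xs, Xt)` reduces to the squared distance between the two domain means, which gives a quick sanity check of the L matrix.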
In a fourth aspect, the specific steps of step 4 are as follows:
step 4.1, after the attribute conversion and transfer component analysis operations have produced defect data with similar distributions and reduced dimensionality from the training sample set, creating a decision tree classifier object and using it to predict the defect class of the target data set;
step 4.2, determining the prediction for the same target data jointly from models trained on several different training data sets: for a selected target data set, several classification models with different prediction performance are trained on different training data sets, and the average of their predictions is taken as the final predicted value for the target data.
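Steps 4.1 and 4.2 can be sketched with scikit-learn's `DecisionTreeClassifier`. This is a sketch under stated assumptions: the text does not say whether labels or probabilities are averaged, so the example averages the predicted probability of the defective class and applies an assumed 0.5 threshold; the function name is hypothetical, and each source set is assumed to contain both classes.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ensemble_predict(sources, X_target, threshold=0.5):
    """sources: list of (X_train, y_train) pairs, one per source project.
    Returns the averaged defect probability and the thresholded label."""
    probs = []
    for X_train, y_train in sources:
        clf = DecisionTreeClassifier(random_state=0)
        clf.fit(X_train, y_train)
        defect_col = list(clf.classes_).index(1)  # column for class 1 (defective)
        probs.append(clf.predict_proba(X_target)[:, defect_col])
    mean_prob = np.mean(probs, axis=0)            # average over source models
    return mean_prob, (mean_prob >= threshold).astype(int)

# Hypothetical toy data: two source projects with one metric each.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
sources = [(X, np.array([0, 0, 1, 1])), (X, np.array([0, 1, 1, 1]))]
X_target = np.array([[0.2], [2.8]])
mean_prob, labels = ensemble_predict(sources, X_target)
```

In the setting of the text, each `(X_train, y_train)` pair would be one source project after compensation and feature mapping, and `X_target` the correspondingly transformed target project.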
Compared with the prior art, the invention has the beneficial effects that:
1. The cross-project defect prediction model, which combines transfer learning with metric compensation, predicts defects on the basis of similar source and target data distributions. The attribute compensation technique uses weights during training to adapt the source data to the distribution of the target data, and a second metric value conversion is performed once reasonably well-fitted data has been obtained. The distributions of both the source and the target data are thus taken into account, making the resulting defect prediction model more robust.
2. When a defect prediction model is built from training and test data with inconsistent distributions, the trained model predicts poorly and may fail to produce correct results in testing. Processing the data distribution with transfer learning mitigates this and further enhances the similarity between different data domains.
Drawings
FIG. 1 is a general flow diagram of a cross-project defect prediction method incorporating feature mapping and attribute value conversion techniques.
FIG. 2 is a detailed flow diagram of a cross-project defect prediction method incorporating feature mapping and attribute value conversion techniques.
FIG. 3 shows information on the data sample sets used in the experiments of the invention.
FIG. 4 shows the experimental setup and results for defect prediction models obtained by processing the data with different schemes, with KC2 as the target project and CM1, KC1, JM1, and PC1 as the source project training sets.
FIG. 5 shows the experimental setup and results for defect prediction models obtained by processing the data with different schemes, with CM1 as the target project and KC2, KC1, JM1, and PC1 as the source project training sets.
FIG. 6 shows the experimental setup and results for defect prediction models obtained by processing the data with different schemes, with KC1 as the target project and CM1, KC2, JM1, and PC1 as the source project training sets.
FIG. 7 shows the experimental setup and results for defect prediction models obtained by processing the data with different schemes, with PC1 as the target project and CM1, KC1, JM1, and KC2 as the source project training sets.
FIG. 8 shows the AUC, recall, and precision obtained when the data is processed with different schemes, with KC2 as the target project and CM1, KC1, JM1, and KC2 as the source project training sets.
FIG. 9 shows the AUC, recall, and precision obtained when the data is processed with different schemes, with PC1 as the target project and KC1, CM1, JM1, and KC2 as the source project training sets.
FIG. 10 shows the AUC, recall, and precision obtained when the data is processed with different schemes, with KC1 as the target project and PC1, CM1, JM1, and KC2 as the source project training sets.
Detailed Description
The invention will be further described with reference to the accompanying drawings and embodiments, which are described for the purpose of facilitating an understanding of the invention and are not intended to be limiting in any way.
The invention addresses the large data distribution difference between the source and target data sets when building a defect prediction model for cross-project defect prediction. It provides a method for improving the similarity of source and target data so as to build a cross-project defect prediction model with better prediction performance and greater robustness, and reports experiments that demonstrate the feasibility and effectiveness of the method.
As shown in fig. 1, a cross-project defect prediction method combining transfer learning and attribute compensation techniques of the present invention includes:
Step 201, reading the data set needed to verify the model's performance with the loadtxt() file reading method, and reading out the metric data and label data of the corresponding training and test sample sets; the first N columns of the read data are stored in x_list as metric data, and column N+1 is stored in y_list as the defect label. Sampling and normalization preprocessing are applied to the sample set used for testing to obtain a more balanced defect data set;
The purpose of data preprocessing in the invention is as follows. When a source data set is used as training data, each project data set consists of many samples, and each defect sample is described by several metrics whose scales differ from one another. The training samples therefore need to be preprocessed before a defect prediction model is trained, so that the different metrics are on the same scale. In short, when the units of the training data differ across dimensions, a normalization step is needed to preprocess the data.
In step 2011, the data is normalized with min-max (dispersion) normalization and mapped into [0,1], so that the values of the different metric attributes are on a common scale.
The normalization calculation comprises (1) computing the maximum and minimum values in the sample data, and (2) applying the transfer function
$$f' = \frac{f - f_{\min}}{f_{\max} - f_{\min}}$$
to the metric and label values in the data, which maps the data samples into [0,1]; $f'$ is the normalized data.
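The min-max normalization described above can be sketched in NumPy as follows. Taking the maximum and minimum per metric column is an assumption, as is mapping constant columns to 0; the function name is hypothetical.

```python
import numpy as np

def min_max_normalize(data):
    """Map every column of `data` into [0, 1] via (f - min) / (max - min)."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_max = data.max(axis=0)
    span = col_max - col_min
    span[span == 0] = 1.0  # assumption: a constant column is mapped to 0
    return (data - col_min) / span

# Hypothetical toy data: two metrics on very different scales.
metrics = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 300.0]])
normalized = min_max_normalize(metrics)
```

After normalization both metrics span the same [0,1] range, so neither dominates distance computations such as those used in the oversampling step.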
Step 2012, the SMOTE oversampling method is used to sample and expand the training data. A sampling ratio is set according to the sample imbalance ratio to determine the sampling rate. For each minority-class sample, the Euclidean distance to all other samples in the minority-class set is computed and sorted to obtain its k nearest neighbors, which are used to expand the minority class.
The sampling calculation is as follows: for each randomly selected neighbor $x_n$ of a minority-class sample $x$, a new sample is constructed according to
$$x_{new} = x + \mathrm{rand}(0,1) \times (x_n - x)$$
so that a more balanced data set is obtained according to the set sampling ratio.
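The neighbor search and interpolation used by SMOTE can be sketched in NumPy as below. This is a simplified illustration rather than a full SMOTE implementation: the function name, the fixed seed, and the choice of k are assumptions, and k must be smaller than the number of minority samples.

```python
import numpy as np

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances within the minority class.
    diff = minority[:, None, :] - minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)             # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))        # pick a minority sample x
        n = rng.choice(neighbors[i])           # pick one of its neighbors x_n
        gap = rng.random()                     # rand(0, 1)
        synthetic.append(minority[i] + gap * (minority[n] - minority[i]))
    return np.array(synthetic)

# Hypothetical toy data: four minority-class samples with two metrics.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_samples = smote_sketch(minority, n_new=5, k=2)
```

Each synthetic sample lies on the line segment between two existing minority samples, so the new points stay inside the region the minority class already occupies.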
Through the operation, the preprocessed standard input data can be obtained.
Step 202, applying the attribute compensation method to address the data distribution difference between projects, so that the data distribution of the target project better fits that of the source project;
In step 202, the attribute compensation method processes the data distribution as follows:
Step 2021, the preprocessed training data sample set is stored as a list, and each sample of the target training set is multiplied by a weight equal to the ratio of the source (training) data mean to the target data mean, giving a new target data set that better fits the source data distribution. Similarly, every sample in the source data set is multiplied by the ratio of the target data mean to the source data mean, giving data sets that have been adapted to each other by one round of attribute value conversion. The conversion between source and target data is calculated as:
source[i,j]=(source[i,j]*metric_mean_target)/metric_mean_source
target[i,j]=(target[i,j]*metric_mean_source)/metric_mean_target
wherein:
metric_mean_source=np.mean(source[:,j])
metric_mean_target=np.mean(target[:,j])
In the above expressions, source[i,j] denotes the j-th metric value of the i-th data instance in the source project data, and metric_mean_source denotes the mean of the j-th metric over all data instances of the source project; similarly, target[i,j] denotes the j-th metric value of the i-th data instance in the target project data, and metric_mean_target denotes the mean of the j-th metric over all data instances of the target project.
Step 2022, let sour1[i,j] denote the j-th metric value of the i-th data instance of the new source data obtained in step 2021, and let metric_mean_source and metric_mean_target denote the recomputed means of the j-th metric of the converted source and target data. Weights are used again so that the source data set undergoes a second attribute value conversion toward the distribution of the target data, further improving the similarity and fit between source and target data and completing two rounds of metric mapping. sour1 is calculated as:
sour1[i,j]=(sour1[i,j]*metric_mean_target)/metric_mean_source
Step 203, after obtaining the sample set processed by the metric compensation technique, minimizing the data distance between the source and target data using a transfer learning technique (TCA) to obtain more similar data distributions;
In step 203, the distributions of the source and target data are changed with the SSTCA technique as follows:
Step 2031, the source and target data obtained from the two rounds of mapping in step 202 are used as the parameters of the SSTCA algorithm, and transfer component analysis is applied to map the data once more.
Step 2032, the difference between the two domains is expressed as a distance between them and measured quantitatively with MMD (maximum mean discrepancy). A kernel function is used to turn the distance computation into a kernel learning problem. The L and H matrices are computed first, defined as
$$L_{ij} = \begin{cases} \dfrac{1}{n_1^2}, & x_i, x_j \in X_{src} \\ \dfrac{1}{n_2^2}, & x_i, x_j \in X_{tar} \\ -\dfrac{1}{n_1 n_2}, & \text{otherwise} \end{cases} \qquad H = I - \frac{1}{n_1 + n_2}\mathbf{1}\mathbf{1}^{\top}$$
where H is the centering matrix, $n_1$ and $n_2$ are the numbers of instances in the source domain $X_{src}$ and the target domain $X_{tar}$, and $x_i$, $x_j$ denote samples in the domains. The kernel matrix K is then computed.
Step 2033, the kernel matrix K calculated in step 2032 is used to solve for the eigenvectors of $(KLK + \mu I)^{-1}KHK$ corresponding to its m largest eigenvalues. As above, L is the matrix introduced by MMD, K is the kernel matrix obtained by kernel function mapping, H is the centering matrix, μ is an introduced trade-off parameter, and I is the identity matrix. The SSTCA algorithm takes the source domain Xs, the target domain Xt and the desired data dimension dim after reduction, and returns the new source data features and new target data features, which are the dimension-reduced source and target domains. Here the number of rows of X is the number of features of the original data, and the number of columns is the total number of features.
Step 204, classifying the data with a decision tree, using the training sample set and test sample set obtained in step 203, to obtain the defect prediction performance of the model.
Step 2041, after attribute conversion and transfer component analysis have produced defect data with similar distributions and reduced dimensionality from the training sample set, a decision tree classifier object is created and used to predict the defect class of the target data set;
Step 2042, the prediction for the same target data is determined jointly from models trained on several different training data sets: for a selected target data set, several classification models with different prediction performance are trained on different training data sets, and the average of their predictions is taken as the final predicted value for the target data.
The invention mainly provides a method for solving the problem of data distribution difference aiming at large source domain target data distribution difference in cross-project defect prediction, data are processed by combining a transfer learning and attribute compensation technology to convert attribute values, 5 subsets in a NASA (self-organizing adaptive analysis and analysis) data set are selected as experimental data, namely CM1, JM1, KC1, KC2 and PC1, and detailed information of the data set used in the experiment is shown in figure 3.
To improve the performance of the cross-project defect prediction model, the invention provides solutions for the data differences between projects. Because the distributions of metric data differ greatly across projects, the invention uses a weighting idea to perform attribute conversion on the source data and the target data so that the difference between projects is minimized. It also applies the transfer component analysis technique SSTCA to map the attributes a second time and, when predicting the target data, takes the average of the prediction results of several classifiers with different performance as the final predicted value of the model, which improves the generalization ability and prediction performance of the model.
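The weighting idea can be illustrated with a short sketch of the first conversion round, in which each metric column is rescaled by the ratio of the other data set's column mean to its own column mean, mirroring the form sour1[i,j] = (sour1[i,j] * metric_mean_target) / metric_mean_source given in the claims. The helper name `compensate` and the handling of zero means are assumptions made for illustration.

```python
import numpy as np


def compensate(data, reference):
    """Rescale each metric column of `data` towards the column means of `reference`.

    Each value data[i, j] is multiplied by mean(reference[:, j]) / mean(data[:, j]).
    Assumption: a metric whose own mean is zero is left unchanged.
    """
    data = np.asarray(data, dtype=float)
    reference = np.asarray(reference, dtype=float)
    data_mean = data.mean(axis=0)
    ref_mean = reference.mean(axis=0)
    zero = data_mean == 0
    ratio = np.where(zero, 1.0, ref_mean / np.where(zero, 1.0, data_mean))
    return data * ratio


source = np.array([[1.0, 10.0], [3.0, 30.0]])
target = np.array([[2.0, 5.0], [6.0, 15.0]])

# Target adapted to the source distribution, and source adapted to the target.
target1 = compensate(target, source)
source1 = compensate(source, target)
```

After this round the column means of `target1` equal those of the original source data, and vice versa. The second conversion round applies the same rule with recalculated means; since the source does not fully specify how those means are recalculated, it is left out of this sketch.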
This can be seen from the line charts in FIGS. 8-10. Across the four data processing methods and the five selected data sets, the method proposed in 2017, which adapts the target data set to the source data distribution, improves recall to some extent over the 2008 method, which adapts the source data set to the target data distribution; that is, it lets the model find defect data more completely. This research therefore also adopts the idea of adapting the target data to the source data distribution, and combines it with the SSTCA feature-mapping-based dimension reduction method so that the attribute values of the source data and the target data are as similar as possible, which improves the prediction effect of the defect prediction model.
Having settled on adapting the target data to the source data distribution, the invention studies how to design suitable weights so that the source data set also fits the target data better. In the attribute compensation method, the target data are adapted to the source data distribution and, at the same time, the source data set undergoes a second conversion that adapts it to the target data. Each method is run 500 times on each pair of source and target data, and the average of the results over the different data sets is taken as the final performance of the model. The recall, AUC and precision obtained by combining SSTCA feature mapping with attribute value conversion are shown in FIGS. 4-7. For comparison, the same experiments were also run with the attribute value conversion methods from 2008 and 2017.
With the data processing method chosen, defect prediction is performed on the five data sets selected for the experiment. The proposed method is applied to the distribution difference between source and target data, yielding source and target data with similar distributions; a classifier trained on the transferred data then predicts the target data. To verify the effectiveness of the proposed method, several groups of comparison experiments are set up in which the same source and target data sets are processed with different data processing methods and compared on indexes such as AUC; the results are shown in FIGS. 4-7.
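For reference, the evaluation indexes named above (recall, precision and AUC) can be computed as in the sketch below. It uses the standard textbook definitions and is not taken from the patent; the function names are hypothetical, label 1 is assumed to mean "defective", and AUC is computed as the fraction of defective/non-defective pairs that the scores rank correctly, with ties counted as one half.

```python
def recall_precision(y_true, y_pred):
    """Recall and precision for the defective class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


def auc(y_true, scores):
    """Probability that a defective instance is scored above a clean one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one instance of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average(values):
    """Mean over repeated runs, e.g. the 500 repetitions per data set pair."""
    return sum(values) / len(values)
```

Averaging the per-run values with `average` corresponds to taking the mean over repeated experiments as the final performance figure.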
FIGS. 8-10 show that the proposed algorithm lets the decision tree classifier predict defects more accurately, because the method uses the data distributions of both the source and the target data when training the model. In addition, through transfer component analysis the method uses the class label information of the source project to minimize the distance between the source data and the target data in the mapped space, and uses the transferred source-project data as training data. In other words, attribute value conversion together with feature-mapping-based transfer learning keeps the source and target distributions as similar as possible, which improves the prediction effect of the defect prediction system. The proposed method therefore improves on methods based on metric value conversion, and the improved method raises the performance of the prediction model to a certain extent.

Claims (5)

1. A software defect prediction model based on feature mapping and attribute conversion is characterized by comprising the following steps:
step 1, dividing the data set using cross-validation, and performing sampling and normalization preprocessing on the sample set used for testing to obtain a more balanced defect data set;
step 2, applying an attribute compensation method to address the difference in data distribution between projects, so that the data distribution of the target project fits the data distribution of the source project more closely;
step 3, after obtaining the sample set processed by the metric compensation technique, minimizing the data distance between the source data and the target data using transfer learning to obtain more similar data distributions;
step 4, classifying the data as defective or not with a decision tree classification model, using the training sample set and the test sample set obtained in step 3, and evaluating the prediction performance of the model from the prediction results.
2. The method of claim 1, wherein the step 1 is implemented by the following steps:
step 1.1, dividing the data set required for verifying model performance, reading it with the loadtxt() file reading method to obtain the metric data and label data of the corresponding training sample set and test sample set, wherein ' is the delimiter used for reading, the first N columns of the read data are taken as metric data and stored in x_list, and the data in column N+1 are stored in y_list as the defect labels;
step 1.2, normalizing the data using min-max (dispersion) normalization, mapping the values into [0,1] to obtain normalized training set data, so that the data of each metric attribute are easier to operate on;
step 1.3, oversampling the training data using the SMOTE sampling method to alleviate the class imbalance of the defect data and improve the training accuracy on the data set, wherein a sampling ratio is set according to the sample imbalance ratio to determine the sampling rate, and, for each sample in the minority class, the Euclidean distance from the sample to all samples in the minority class sample set is calculated and sorted to obtain the k nearest neighbors of the sample, which are used to expand the minority class samples.
3. The method as claimed in claim 1, wherein the step 2 is implemented by the following steps:
step 2.1, storing the preprocessed training data sample set in list form, and multiplying each sample of the target data set by a weight, the weight being the ratio of the mean of the training data to the mean of the target data, so as to obtain a new target data set that better fits the source data distribution; similarly, for all samples in the source data set, the ratio of the mean of the target data to the mean of the source data is used as the weight assigned to the source data set for adapting to the target data, so as to obtain source and target data sets adapted to each other after one round of attribute value conversion, the source and target data conversion being calculated as follows:
Figure FDA0003182562810000011
wherein:
Figure FDA0003182562810000012
wherein source[i,j] represents the j-th metric value of the i-th data instance in the source project data, and metric_mean_source represents the mean of the j-th metric over all data instances of the source project; similarly, target[i,j] represents the j-th metric value of the i-th data instance in the target project data, and metric_mean_target represents the mean of the j-th metric over all data instances of the target project;
step 2.2, using sour1[i,j] to represent the j-th metric value of the i-th data instance of the new source data obtained in step 2.1, and using metric_mean_source and metric_mean_target to represent the recalculated means of the j-th metric of the source data and the target data, performing on the source data once more a data processing similar to that of step 2.1, so that the source data set undergoes a second conversion of attribute values towards the distribution of the target data, thereby further improving the similarity and fit of the source data and the target data and completing the two rounds of metric mapping of the new source and target data, wherein sour1 is calculated as follows:
sour1[i,j]=(sour1[i,j]*metric_mean_target)/metric_mean_source.
step 2.3, passing the source data and the target data after attribute value conversion as parameters to the transfer learning method for the next conversion based on feature mapping.
4. The method as claimed in claim 1, wherein the specific implementation of step 3 comprises the following steps:
step 3.1, further improving the similarity of the training data X_S and the test data X_T using the SSTCA transfer component analysis technique, so as to improve the robustness of the defect prediction model, wherein the distance between the two domains is calculated using the Maximum Mean Discrepancy (MMD) algorithm as MMD(X_S, X_T) = tr(KL), L being the matrix introduced by the MMD algorithm, K being the kernel matrix obtained by kernel function mapping, and tr(KL) denoting the trace of the matrix product KL;
step 3.2, using a kernel function to convert the distance computation into a kernel learning problem, and first calculating the L and H matrices, defined as
Figure FDA0003182562810000021
where H is the centering matrix, n1 and n2 are the numbers of instances in the source domain X_src and the target domain X_tar, and x_i, x_j denote sample data in the domains; the kernel matrix K is then calculated;
step 3.3, using the kernel matrix K calculated in step 3.2, solving for the first m eigenvalues (and corresponding eigenvectors) of (KLK + μI)^(-1) KHK, wherein, as above, L is the matrix introduced by the MMD algorithm, K is the kernel matrix obtained by kernel function mapping, H is the centering matrix, μ is an introduced parameter, and I is the identity matrix; computing the dimension-reduced matrix using the SSTCA algorithm, which takes as input the source domain X_S, the target domain X_T and the desired dimensionality dim after reduction, and yields new source data features and new target data features, i.e., the dimension-reduced results of the source domain and the target domain, wherein the number of rows of X is the feature count of the original data and the number of columns is the total feature count.
5. The method as claimed in claim 1, wherein the specific implementation of step 4 comprises the following steps:
step 4.1, after the attribute conversion and transfer component analysis operations are applied to the training sample set, obtaining defect data with similar distributions and reduced dimensionality, creating a corresponding decision tree classifier object, and using the decision tree classifier to predict the defect class of the target data set;
step 4.2, determining the prediction for the same target data jointly from models trained on several different training data sets, wherein, for a selected target data set, several classification models with different prediction performance are trained on different training data sets, and the average of the predictions of these models is taken as the final prediction for the target data.
CN202110851716.2A 2021-07-27 2021-07-27 A software defect prediction model based on feature mapping and attribute compensation technology Active CN113626316B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110851716.2A CN113626316B (en) 2021-07-27 2021-07-27 A software defect prediction model based on feature mapping and attribute compensation technology

Publications (2)

Publication Number Publication Date
CN113626316A true CN113626316A (en) 2021-11-09
CN113626316B CN113626316B (en) 2024-12-20

Family

ID=78381202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110851716.2A Active CN113626316B (en) 2021-07-27 2021-07-27 A software defect prediction model based on feature mapping and attribute compensation technology

Country Status (1)

Country Link
CN (1) CN113626316B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109446735A (en) * 2018-12-18 2019-03-08 中国石油大学(北京) A kind of generation method, equipment and the system of modeling logging data
CN110659207A (en) * 2019-09-02 2020-01-07 北京航空航天大学 Heterogeneous cross-project software defect prediction method based on nuclear spectrum mapping migration integration

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈曙;叶俊民;刘童;: "一种基于领域适配的跨项目软件缺陷预测方法", 软件学报, no. 02, 15 February 2020 (2020-02-15), pages 24 - 39 *

Also Published As

Publication number Publication date
CN113626316B (en) 2024-12-20

Similar Documents

Publication Publication Date Title
CN108446711B (en) A software defect prediction method based on transfer learning
CN112989635B (en) Integrated learning soft measurement modeling method based on self-encoder diversity generation mechanism
CN113469470B (en) Energy consumption data and carbon emission correlation analysis method based on electric brain center
CN107784394A (en) Consider that the highway route plan of prospect theory does not know more attribute method for optimizing
CN111539657A (en) Classification and synthesis method of load characteristics of typical electricity industry combined with daily electricity consumption curve of users
CN110458601B (en) Method and device for processing resource data, computer equipment and storage medium
CN113376483A (en) XLPE cable insulation state evaluation method
CN111008870A (en) Regional logistics demand prediction method based on PCA-BP neural network model
CN114372693B (en) Transformer fault diagnosis method based on cloud model and improved DS evidence theory
CN103700030A (en) Grey rough set-based power grid construction project post-evaluation index weight assignment method
CN117150416B (en) A detection method, system, media and equipment for abnormal nodes in the industrial Internet
CN113112090B (en) Spatial load prediction method based on principal component analysis of comprehensive mutual information
CN116500480A (en) Intelligent battery health monitoring method based on feature transfer learning hybrid model
CN114548306B (en) An intelligent monitoring method for early overflow in drilling based on misclassification cost
CN101206727B (en) Data processing apparatus, data processing method
CN115454990A (en) A data cleaning method for oil-paper insulation based on improved KNN
CN105931133A (en) Distribution transformer replacement priority evaluation method and device
CN113626316A (en) Software defect prediction model based on feature mapping and attribute compensation technology
CN110264010B (en) Novel rural power saturation load prediction method
CN118197297A (en) A device fault detection method based on voiceprint signal
CN116484295A (en) Mineral resources classification and prediction method and system based on multi-source small sample joint learning
CN115186584A (en) A Width Learning Semi-Supervised Soft Sensing Modeling Method Fusing Attention Mechanism and Adaptive Composition
CN118169727B (en) Fusion positioning method based on 5G communication network and Beidou satellite navigation system
CN118313106B (en) Data-driven risk assessment method and decision-making system for power grid emergency repair
CN116304794A (en) Method, system, equipment and medium for engine vibration prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant