CN110083637B

CN110083637B - Bridge disease rating data-oriented denoising method

Info

Publication number: CN110083637B
Application number: CN201910327313.0A
Authority: CN
Inventors: 周扬名; 王凯; 叶琪; 阮彤; 翟洁
Original assignee: East China University of Science and Technology
Current assignee: East China University of Science and Technology
Priority date: 2019-04-23
Filing date: 2019-04-23
Publication date: 2023-04-18
Anticipated expiration: 2039-04-23
Also published as: CN110083637A

Abstract

The invention relates to a denoising method for bridge disease rating data, comprising the following steps: first, deleting from the original data set the features whose values have no discernible order relation; second, comparing every two samples with different labels in the data set to obtain a conflict pair set consisting of samples with label conflicts; then, sorting the samples from high to low by the number of times they appear in the conflict pair set, calculating in turn the silhouette coefficients of the top t% of samples, deleting the samples whose silhouette coefficient is smaller than ε from both the conflict pair set and the data set, and repeating the sorting, calculation and deletion until no sample in the top t% has a silhouette coefficient smaller than ε; and finally, obtaining a data set with the noise removed. The denoising method can effectively improve the accuracy of bridge disease data classification.

Description

A denoising method for bridge disease rating data

Technical Field

The invention belongs to the technical field of data processing and aims to provide a denoising method for bridge disease rating data.

Background Art

Since the reform and opening up, China's highway bridges have entered a period of large-scale construction and development. At present, the total number of highway bridges in China is close to 800,000, ranking first in the world in both number and scale. However, the number of in-service bridges entering their maintenance period is increasing. According to incomplete statistics, about 40% of China's in-service bridges have been in service for more than 20 years, as many as 30% are defective bridges with a technical grade of three or four, and more than 100,000 are classed as dangerous bridges, so the safety hazards cannot be ignored. Rating the disease condition of bridges is the basis of road and bridge management and maintenance. Traditional manual rating is time-consuming, labor-intensive and of limited accuracy, so there is an urgent need to apply machine learning to rate the diseases of in-service bridges automatically. Existing bridge data sets often contain a large amount of label noise. To improve the prediction performance of machine learning methods for bridge disease rating, the label-noise data in the original data set must be filtered out. There are currently two mainstream approaches to label noise filtering: (1) directly adapting the classification algorithm to reduce the impact of label noise on its performance; (2) using pre-classifiers to classify and vote on the data, and then filtering out the data suspected of being noise. However, neither approach performs well at filtering label noise from bridge disease data.

In summary, this interdisciplinary field urgently needs a new label noise filtering method to solve the above problems.

Summary of the Invention

In view of this, the present invention discloses a denoising method for bridge disease rating data that effectively improves the prediction performance of machine-learning-based bridge disease rating. First, a new data set is obtained by deleting from the original data set the features whose values have no discernible order relation, so that the values of every remaining feature are ordered; the original data set includes the basic information of each bridge, the disease information of each type, and the corresponding bridge disease grade labels. Second, all samples in the data set are compared in pairs, every two mutually conflicting samples form a conflict pair, and all conflict pairs in the data set form a conflict pair set. Third, the number of occurrences of each sample in the conflict pair set is counted, and the samples are sorted by that count. Fourth, working from the highest frequency downward, a certain proportion of samples is eliminated according to sample frequency and silhouette coefficient to obtain a new data set. Finally, the stacking method is used to train one model on the original data set and one on the new data set, and the bridge disease grade prediction performance of the two models is evaluated to verify the effectiveness of the denoising method; if it is confirmed to be effective, a relatively clean data set has been obtained.

The technical solution of the present invention is a denoising method for bridge disease rating data. First, a conflict pair set is obtained by pairwise comparison of samples. Then noise data is removed according to the number of times each sample appears in the conflict pair set, combined with the sample's silhouette coefficient, to obtain a filtered data set. Next, the same machine learning method is used to train a model on the original data set and on the filtered data set respectively. Finally, the prediction performance of the two models is compared. The specific steps are:

S1. Preprocess the data in the original data set to obtain data set W₁, and remove the features in W₁ that have no total order relation to obtain a new data set W₂;

S2. Based on data set W₂, compare samples with different labels pairwise according to the feature values a_{i,j} of the features a_i, and construct conflict pairs c_i;

S3. Construct a conflict set C = {c₁, c₂, ..., c_N} from the conflict pairs c_i, where N is the total number of conflict pairs in C;

S4. Count the frequency f_k with which each sample s_k occurs in the conflict set C to obtain a dictionary D = {s_k: f_k};

S5. Sort the samples in dictionary D by frequency from high to low;

S6. For the top t% of the sorted samples, calculate the silhouette coefficient s(k) in data set W₂, delete the samples s_k whose s(k) is less than ε to obtain the filtered data set W₃, and at the same time delete from the conflict pair set C every conflict pair containing a suspected noise sample s_k;

S7. Repeat S4, S5 and S6 until step S62 finds no sample with s(k) less than ε;

S8. Using the same machine learning method, train models M₁ and M₃ on data sets W₁ and W₃ respectively, then evaluate and compare the prediction performance of model M₃.

Further, step S1 comprises:

S11. Based on data set W₁, use hot-deck imputation to fill in missing feature values with the values of the most similar sample. The most similar sample is determined by the measure

[formula image not reproduced]

where a_{i,j} is the value of the j-th feature of the i-th sample in the data set, and

[symbol image not reproduced]

is the missing feature value;

S12. Delete useless features that have no effect on the label value;

S13. Delete from data set W₁ the features whose values have no total order relation to obtain data set W₂.
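The hot-deck imputation of step S11 can be sketched as follows. The similarity formula is not reproduced in this text, so the sketch assumes a Euclidean distance over the features that both samples have observed. The function name `hot_deck_fill` and the use of `None` for a missing value are illustrative choices, not part of the patent.

```python
import math

def hot_deck_fill(samples):
    """Fill each missing value (None) with the value of the most similar donor.

    Assumption: similarity is the Euclidean distance over the features that are
    observed in both samples; the patent's own measure is not reproduced here.
    """
    filled = [list(row) for row in samples]
    for i, row in enumerate(samples):
        for j, value in enumerate(row):
            if value is not None:
                continue
            best_dist, best_value = math.inf, None
            for k, donor in enumerate(samples):
                if k == i or donor[j] is None:
                    continue  # a donor must have the missing feature observed
                shared = [(a, b) for a, b in zip(row, donor)
                          if a is not None and b is not None]
                if not shared:
                    continue  # nothing to compare on
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared))
                if dist < best_dist:
                    best_dist, best_value = dist, donor[j]
            filled[i][j] = best_value  # stays None if no donor exists
    return filled
```

For example, `hot_deck_fill([[1.0, 2.0, None], [1.0, 2.1, 5.0], [9.0, 9.0, 7.0]])` fills the gap in the first sample from the second sample, which is its nearest neighbour on the two observed features.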

Further, step S2 comprises:

S21. The feature set of data set W₂ is A = {a₁, a₂, ..., a_{Ni}}, where Ni is the total number of features of W₂;

S22. The set of values of feature a_i is D = {a_{i,1}, a_{i,2}, ..., a_{i,Na}}, where Na is the total number of samples in W₂ and therefore also the total number of values of feature a_i;

S23. First check the labels of the two samples. If they are the same, skip the comparison; if they differ, compare the values of the two samples feature by feature across all features, using the formula

[formula image not reproduced]

If f(A, B) is true, then A and B form a conflict pair (A, B);

S24. Select the first sample and compare every later sample with it in the manner of step S23, constructing conflict pairs, until the last sample has been compared. Then select the second sample and compare every later sample with the second sample in the same way. Continue likewise, and stop once the second-to-last sample has been selected and compared.
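The pairwise comparison of steps S23 and S24 can be sketched as follows. The formula for f(A, B) is not reproduced in this text, so the predicate below is an assumption: labels are treated as ordinal grades, larger feature values are treated as more severe, and two samples conflict when one is at least as severe on every feature yet carries the strictly lower grade. All function names are hypothetical.

```python
from itertools import combinations

def dominates(x, y):
    """True if x is >= y on every feature (x is at least as severe as y)."""
    return all(a >= b for a, b in zip(x, y))

def is_conflict(feat_a, label_a, feat_b, label_b):
    """Assumed form of f(A, B): the labels differ and their order contradicts
    the feature-wise order of the two samples."""
    if label_a == label_b:
        return False  # S23: same label, skip the comparison
    if dominates(feat_a, feat_b) and label_a < label_b:
        return True
    if dominates(feat_b, feat_a) and label_b < label_a:
        return True
    return False

def build_conflict_set(features, labels):
    """S24: compare each sample with every later sample; return index pairs."""
    return [(i, j) for i, j in combinations(range(len(features)), 2)
            if is_conflict(features[i], labels[i], features[j], labels[j])]
```

With features `[[3, 3], [1, 1], [2, 2]]` and labels `[1, 2, 2]`, the first sample is more severe than both others but has the lower grade, so it forms a conflict pair with each of them, while the last two share a label and are never compared.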

Further, step S3 comprises:

S31. Assemble all conflict pairs constructed in step S23 into a conflict set C = {c₁, c₂, ..., c_N}, where N is the total number of conflict pairs in C.

Further, step S4 comprises:

S41. Count the number of times f_{lk} that sample s_k appears as the left element of a conflict pair;

S42. Count the number of times f_{rk} that sample s_k appears as the right element of a conflict pair;

S43. Calculate the total frequency f_k = f_{lk} + f_{rk};

S44. From the one-to-one mapping between each sample s_k and its frequency f_k, construct a dictionary D = {s_k: f_k}, k = 1, 2, ..., Na.

Further, step S5 comprises:

S51. Sort the samples in dictionary D by frequency f_k from high to low.
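Steps S41 to S51 amount to a frequency count followed by a sort. A minimal sketch, assuming each conflict pair is stored as a tuple of two sample indices; the function names are illustrative, and samples that appear in no conflict pair are simply left out of the dictionary here rather than stored with frequency 0.

```python
from collections import Counter

def conflict_frequencies(conflict_set):
    """S41-S44: f_k = f_lk + f_rk for every sample index k in the conflict set."""
    left = Counter(a for a, _ in conflict_set)    # f_lk: appearances on the left
    right = Counter(b for _, b in conflict_set)   # f_rk: appearances on the right
    return dict(left + right)

def rank_by_frequency(freq):
    """S51: sample indices sorted by frequency, highest first (ties by index)."""
    return sorted(freq, key=lambda k: (-freq[k], k))
```

Breaking ties by index is a choice made here only to keep the ordering deterministic; the patent does not say how ties are resolved.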

Further, step S6 comprises:

S61. Calculate the silhouette coefficient according to the formula

[formula image not reproduced]

where

[formula image not reproduced]

is the intra-cluster dissimilarity of sample s_k, a_{i,k} is the value of the i-th feature of sample s_k, and r_k is the label of sample s_k; b(k) = min{b(k)₁, b(k)₂, ..., b(k)_n} is the inter-cluster dissimilarity of sample s_k, and

[formula image not reproduced]

is the dissimilarity between sample s_k and the n-th cluster (that is, the class with label r_n);

S62. If the silhouette coefficient s(k) of sample s_k is less than ε, record the index of s_k, treat it as a suspected noise sample, and delete it from data set W₂ to obtain the new data set W₃;

S63. Delete from the conflict pair set C every conflict pair that contains a suspected noise sample s_k identified in step S62.
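The formula images for step S61 are not reproduced in this text. For reference, the standard silhouette coefficient, written with the intra-cluster dissimilarity a(k) and inter-cluster dissimilarity b(k) named in S61, has the form below. That the patent's formula is identical to this one, and which distance it uses inside a(k) and b(k)_n, are assumptions.

```latex
s(k) = \frac{b(k) - a(k)}{\max\{a(k),\, b(k)\}}, \qquad -1 \le s(k) \le 1
```

Under this definition, s(k) < 0 holds exactly when a(k) > b(k), that is, when sample s_k is on average less similar to the samples sharing its own label than to the samples of some other label. With ε = 0, those are the samples that step S62 would flag as suspected noise.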

Further, step S7 comprises:

S71. Repeat S4, S5 and S6 until step S62 finds no sample with s(k) less than ε (in this patent, ε is set to 0).
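The loop formed by steps S4 to S7 can be sketched as follows. This is a sketch under stated assumptions rather than the patented implementation: each label is treated as one cluster, dissimilarity is the mean Euclidean distance (the patent's distance is not reproduced in this text), the silhouette takes the standard form (b - a) / max(a, b), and the default `t=0.2` is an arbitrary illustrative value. The names `silhouette` and `denoise` are hypothetical.

```python
import math

def silhouette(k, features, labels, alive):
    """Silhouette of sample k over the samples in `alive`, one cluster per label."""
    dists = {}
    for i in alive:
        if i != k:
            d = math.dist(features[k], features[i])  # assumed: Euclidean distance
            dists.setdefault(labels[i], []).append(d)
    own = dists.pop(labels[k], [])
    if not own or not dists:
        return 0.0  # convention for a singleton class or a single remaining class
    a = sum(own) / len(own)                            # intra-cluster dissimilarity
    b = min(sum(d) / len(d) for d in dists.values())   # nearest other cluster
    return (b - a) / max(a, b) if max(a, b) > 0 else 0.0

def denoise(features, labels, conflicts, t=0.2, eps=0.0):
    """S4-S7: count, sort, test the top t fraction, delete, and repeat."""
    alive = set(range(len(features)))
    conflicts = list(conflicts)
    while conflicts:
        freq = {}
        for a, b in conflicts:                         # S4: frequency dictionary
            freq[a] = freq.get(a, 0) + 1
            freq[b] = freq.get(b, 0) + 1
        ranked = sorted(freq, key=lambda k: (-freq[k], k))          # S5
        top = ranked[:max(1, math.ceil(t * len(ranked)))]           # top t fraction
        removed = False
        for k in top:                                  # S6: test in ranked order
            if silhouette(k, features, labels, alive) < eps:
                alive.discard(k)                       # S62: drop suspected noise
                conflicts = [p for p in conflicts if k not in p]    # S63
                removed = True
        if not removed:                                # S7: nothing below eps
            break
    return sorted(alive)
```

The function returns the indices of the samples that remain, which correspond to data set W₃. Each pass either deletes at least one sample or stops, so the loop terminates.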

Further, step S8 comprises:

S81. Divide data set W₁ and the new data set W₃ each into three parts in the same proportions: a training set, a validation set and a test set;

S82. Use the stacking method to train models M₁ and M₃ on W₁ and W₃ respectively;

S83. Evaluate and compare the prediction performance of model M₃.
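The split in step S81 and the evaluation in step S83 can be sketched with the standard library alone. The 60/20/20 proportions, the fixed seed and the use of plain accuracy as the metric are assumptions made for illustration; the patent only requires that W₁ and W₃ be split in the same proportions. The stacking learner of step S82 is not shown.

```python
import random

def split_dataset(samples, ratios=(0.6, 0.2, 0.2), seed=0):
    """S81: shuffle, then split into training, validation and test sets.

    The ratios and seed are illustrative; apply the same values to W1 and W3.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = int(ratios[0] * len(items))
    n_val = int(ratios[1] * len(items))
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

def accuracy(predict, test_set):
    """S83: fraction of (features, label) pairs that `predict` labels correctly."""
    if not test_set:
        return 0.0
    return sum(predict(x) == y for x, y in test_set) / len(test_set)
```

Calling `split_dataset` once on W₁ and once on W₃ with the same `ratios` gives comparable partitions, and `accuracy` can then be applied to the prediction functions of M₁ and M₃ on their respective test sets.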

With the above method, the positive effects of the present invention are:

(1) For the label-noise data found in bridge disease data sets, the present invention provides a distinctly different noise elimination algorithm. It exploits label conflicts between samples and uses the number of conflicts each sample is involved in to distinguish how likely different samples are to be noise, which increases the accuracy of label noise filtering and improves the data quality of the data set.

(2) Compared with traditional methods that use classification algorithms to filter label noise, the present invention draws on the intrinsic structure of the data set itself, which improves the prediction performance of the machine learning model that is finally trained.

Brief Description of the Drawings

After reading the specific embodiments of the present invention with reference to the accompanying drawings, the reader will understand the various aspects of the invention more clearly. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of an embodiment of the denoising method for bridge disease rating data of the present invention.

FIG. 2 is a detailed flow chart of step S2 in an embodiment of the denoising method for bridge disease rating data of the present invention.

FIG. 3 is a detailed flow chart of step S4 in an embodiment of the denoising method for bridge disease rating data of the present invention.

FIG. 4 is a detailed flow chart of step S6 in an embodiment of the denoising method for bridge disease rating data of the present invention.

FIG. 5 is a detailed flow chart of step S8 in an embodiment of the denoising method for bridge disease rating data of the present invention.

Detailed Description

To make the purpose, technical solution and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it.

S1. Preprocess the data in the original data set to obtain data set W₁, then remove the features in W₁ that have no total order relation to obtain data set W₂.

S11. Use hot-deck imputation to fill in missing feature values with the values of the most similar sample. The most similar sample is determined by the measure

[formula image not reproduced]

where a_{i,j} is the value of the j-th feature of the i-th sample in the data set, and

[symbol image not reproduced]

is the missing feature value;

S13. Delete from data set W₁ the features whose values have no total order relation to obtain data set W₂.

S2. Based on data set W₂, compare samples with different labels pairwise according to the feature values a_{i,j} of the features a_i, and construct conflict pairs c_i.

S22. The set of values of feature a_i is D = {a_{i,1}, a_{i,2}, ..., a_{i,Na}}, where Na is the total number of samples in W₂ and therefore also the total number of values of feature a_i;

If f(A, B) is true, then A and B form a conflict pair (A, B).

S3. Construct a conflict set C = {c₁, c₂, ..., c_N} from the conflict pairs c_i, where N is the total number of conflict pairs in C.

S31. Assemble all conflict pairs constructed in step S23 into a conflict set C = {c₁, c₂, ..., c_N}, where N is the total number of conflict pairs in C.

S4. Count the frequency f_k with which each sample s_k occurs in the conflict set C to obtain a dictionary D = {s_k: f_k}.

S43. Calculate the frequency f_k = f_{lk} + f_{rk};

S5. Sort the samples in dictionary D by frequency from high to low.

S6. For the top t% of the sorted samples, calculate the silhouette coefficient s(k) in data set W₂, delete the samples s_k whose s(k) is less than ε to obtain the filtered data set W₃, and at the same time delete from the conflict pair set C every conflict pair containing a suspected noise sample s_k.

S61. Calculate the silhouette coefficient according to the formula

[formula image not reproduced]

S62. If the silhouette coefficient s(k) of sample s_k is less than ε, record the index of s_k, treat it as a suspected noise sample, and delete it from data set W₂ to obtain the new data set W₃;

S7. Repeat S4, S5 and S6 until step S62 finds no sample with s(k) less than ε.

S8. Use the stacking method to train models M₁ and M₃ on data sets W₁ and W₃ respectively, then evaluate and compare the prediction performance of model M₃.

S81. Divide data set W₁ and the new data set W₃ each into three parts in the same proportions, to serve as the training set, validation set and test set;

S82. Use the same machine learning algorithm to train models M₁ and M₃ on the W₁ and W₃ data sets respectively;

S83. Evaluate and compare the prediction performance of model M₃.

The specific embodiments of the present invention have been described above with reference to the drawings. A person of ordinary skill in the art will appreciate, however, that various changes and substitutions can be made to these embodiments without departing from the spirit and scope of the invention. All such changes and substitutions fall within the scope defined by the claims of the present invention.

Claims

1. A denoising method for bridge disease rating data, in which a conflict pair set is first obtained by pairwise comparison of sample data; noise data is then eliminated according to the number of times each sample appears in the conflict pair set, combined with the silhouette coefficient of the sample, to obtain a filtered data set; a model is then trained on the original data set and on the filtered data set respectively using a stacking method; and finally the prediction performance of the two models is evaluated and compared to verify the effectiveness of the denoising method, a clean data set being obtained if the effectiveness is confirmed; the specific steps being as follows:

S1. Preprocessing the data in an original data set to obtain a data set W₁, and removing the features in W₁ that have no total order relation to obtain a data set W₂, wherein the original data set comprises the basic information of each bridge, the bridge disease information of each type, and the corresponding bridge disease grade labels;

S2. Based on the data set W₂, comparing samples with different labels pairwise according to the feature values a_{i,j} of the features a_i to construct conflict pairs c_i;

S3. Constructing a conflict set C = {c₁, c₂, ..., c_N} from the conflict pairs c_i, where N is the total number of conflict pairs in the conflict set C;

S4. Counting the frequency f_k with which each sample s_k occurs in the conflict set C to obtain a dictionary D = {s_k: f_k};

S5. Sorting the samples in the dictionary D by frequency from high to low;

S6. For the top t% of the sorted samples, calculating the silhouette coefficient s(k) in the data set W₂, deleting the samples s_k whose s(k) is smaller than ε to obtain a new filtered data set W₃, and simultaneously deleting from the conflict pair set C the conflict pairs containing the suspected noise samples s_k, wherein t is a threshold used to reduce the number of samples for which the coefficient must be calculated;

S7. Repeating S4, S5 and S6 until step S6 finds no sample with s(k) smaller than ε, wherein the value of ε is 0;

S8. Training models M₁ and M₃ on the data sets W₁ and W₃ respectively using the same machine learning algorithm, evaluating and verifying the bridge disease grade prediction performance of the two models, and evaluating and comparing the prediction performance of model M₃.

2. The denoising method for bridge disease rating data according to claim 1, wherein step S1 specifically comprises:

S11. Based on the data set W₁, using hot-deck imputation to fill in missing feature values with the values of the most similar sample, the most similar sample being determined by the measure

[formula image not reproduced]

where a_{i,j} is the value of the j-th feature of the i-th sample in the data set,

[symbol image not reproduced]

is the missing feature value, Na is the total number of samples in the data set W₂, and i₀ is the index of the most similar sample;

S12. Deleting useless features that have no effect on the label value;

S13. Deleting from the data set W₁ the features whose values have no total order relation to obtain the data set W₂.

3. The denoising method for bridge disease rating data according to claim 1, wherein step S2 specifically comprises:

S21. The feature set of the data set W₂ is A = {a₁, a₂, ..., a_{Ni}}, where Ni is the total number of features of the data set W₂;

S22. The set of values of feature a_i is D = {a_{i,1}, a_{i,2}, ..., a_{i,Na}}, where Na is the total number of samples in the data set W₂ and also the total number of values of feature a_i;

S23. First checking the labels of the two samples, skipping the comparison if the labels are the same, and, if the labels differ, comparing the values of the two samples feature by feature across all features, using the formula

[formula image not reproduced]

wherein, if f(A, B) is true, A and B form a conflict pair (A, B);

S24. Selecting the first sample and comparing every later sample with it in the manner of step S23 to construct conflict pairs, until the last sample has been compared; then selecting the second sample and comparing every later sample with the second sample in the same manner; and continuing likewise, stopping once the second-to-last sample has been selected and compared.

4. The denoising method for bridge disease rating data according to claim 1, wherein step S3 specifically comprises:

S31. Assembling all conflict pairs constructed in step S23 into a conflict set C = {c₁, c₂, ..., c_N}, where N is the total number of conflict pairs in the conflict set C.

5. The denoising method for bridge disease rating data according to claim 1, wherein step S4 specifically comprises:

S41. Counting the number of times f_{lk} that sample s_k appears as the left element of a conflict pair;

S42. Counting the number of times f_{rk} that sample s_k appears as the right element of a conflict pair;

S43. Calculating the total frequency f_k = f_{lk} + f_{rk};

S44. From the one-to-one mapping between each sample s_k and its frequency f_k, constructing a dictionary D = {s_k: f_k}, k = 1, 2, ..., Na.

6. The denoising method for bridge disease rating data according to claim 1, wherein step S5 specifically comprises:

S51. Sorting the samples in the dictionary D by frequency f_k from high to low.

7. The denoising method for bridge disease rating data according to claim 1, wherein step S6 specifically comprises:

S61. Calculating the silhouette coefficient according to the formula

[formula image not reproduced]

where

[formula image not reproduced]

is the intra-cluster dissimilarity of sample s_k, a_{i,k} is the value of the i-th feature of sample s_k, and r_k is the label of sample s_k; b(k) = min{b(k)₁, b(k)₂, ..., b(k)_n} is the inter-cluster dissimilarity of sample s_k, and

[formula image not reproduced]

is the dissimilarity between sample s_k and the n-th cluster;

S62. If the silhouette coefficient s(k) of sample s_k is less than ε, recording the index of the sample s_k, treating it as a suspected noise sample, and deleting the sample s_k from the data set W₂ to obtain a new data set W₃;

S63. Deleting from the conflict pair set C the conflict pairs that contain a suspected noise sample s_k identified in step S62.

8. The denoising method for bridge disease rating data according to claim 1, wherein step S8 specifically comprises:

S81. Dividing the data set W₁ and the new data set W₃ each into three parts in the same proportions, namely a training set, a validation set and a test set;

S82. Training models M₁ and M₃ on the W₁ and W₃ data sets respectively using the same machine learning algorithm;

S83. Evaluating and comparing the prediction performance of model M₃.