CN113704389A

CN113704389A - Data evaluation method and device, computer equipment and storage medium

Info

Publication number: CN113704389A
Application number: CN202110266548.0A
Authority: CN
Inventors: 陈岁迪; 童丽霞
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2021-03-11
Filing date: 2021-03-11
Publication date: 2021-11-26

Abstract

The embodiments of the present application disclose a data evaluation method, an apparatus, a computer device, and a storage medium. A data set to be evaluated is obtained; the data set to be evaluated is filtered according to a preset filtering strategy to obtain a candidate data set; the candidate data set is analyzed based on an evaluation dimension; an evaluation parameter of the data set to be evaluated is obtained according to the analysis result; and an evaluation result corresponding to the data set to be evaluated is determined according to the evaluation parameter, so that the data can be processed according to the evaluation result. This improves the accuracy and reliability of data evaluation.

Description

Data evaluation method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of internet technologies, and in particular, to a data evaluation method, apparatus, computer device, and storage medium.

Background

With the rapid development of artificial intelligence, various models are more and more widely applied. For example, in the field of customer service, a call center can generate massive amounts of call audio every day. In order to supervise service quality and public opinion risks, a text semantic model constructed in an intelligent quality inspection system can be used to analyze whether what a user says shows a risk tendency, whether the user expresses dissatisfaction, whether the wording of the customer service agent is inappropriate, and the like.

The detection effect of the text semantic model depends on training with a large amount of high-quality labeled data. In the data labeling process, because different annotators understand the data differently, the labeled data set often contains a lot of abnormal noise data, and training on noise data directly degrades the effect of the text semantic model. Moreover, different data contribute differently to the performance improvement of the text semantic model: simple data is easy to distinguish and improves the text semantic model only to a limited extent, and too much simple data can dominate the changes of the text semantic model parameters and reduce the efficiency of training. How to evaluate the quality of the data is therefore significant.

In the prior art, the data set needs to be evaluated manually: sampled data is manually rechecked by experts, and the accuracy of the data set is estimated from the accuracy of the sampled data; or, a gold set is set up, the gold set is put into the data to be labeled by each annotator, and the accuracy of each annotator's labeled data is estimated from that annotator's accuracy on the gold set.

In the sampling recheck method, the spot-check proportion is specified manually: if the proportion is high, more labor cost and time cost are needed; if the proportion is low, the spot-check result is limited and deviates considerably, so the accuracy of data evaluation is low. In the gold set method, an expert is required to set up the gold set and put it into the data set in advance, the gold set needs to be updated frequently, and it must be guaranteed that the data labeled by each annotator each time contains gold-set items that the annotator has not yet labeled; the method is therefore complex and the accuracy of data evaluation is low.

Disclosure of Invention

The embodiment of the application provides a data evaluation method, a data evaluation device, computer equipment and a storage medium, which can improve the accuracy and reliability of data evaluation.

In order to solve the above technical problem, an embodiment of the present application provides the following technical solutions:

The embodiment of the application provides a data evaluation method, which comprises the following steps:

acquiring a data set to be evaluated;

filtering the data set to be evaluated according to a preset filtering strategy to obtain a candidate data set;

analyzing the candidate dataset based on an evaluation dimension;

obtaining evaluation parameters of the data set to be evaluated according to the analysis result;

and determining an evaluation result corresponding to the data set to be evaluated according to the evaluation parameters, and processing the data according to the evaluation result.

According to an aspect of the present application, there is also provided a data evaluation apparatus including:

the first acquisition unit is used for acquiring a data set to be evaluated;

the filtering unit is used for filtering the data set to be evaluated according to a preset filtering strategy to obtain a candidate data set;

an analysis unit for analyzing the candidate data set based on an evaluation dimension;

the second acquisition unit is used for acquiring the evaluation parameters of the data set to be evaluated according to the analysis result;

and the determining unit is used for determining an evaluation result corresponding to the data set to be evaluated according to the evaluation parameters so as to process the data according to the evaluation result.

According to an aspect of the present application, there is also provided a computer device, including a processor and a memory, where the memory stores a computer program, and the processor executes any one of the data evaluation methods provided by the embodiments of the present application when calling the computer program in the memory.

According to an aspect of the present application, there is also provided a storage medium for storing a computer program, which is loaded by a processor to execute any one of the data evaluation methods provided by the embodiments of the present application.

The embodiments of the present application can acquire a data set to be evaluated, filter the data set to be evaluated according to a preset filtering strategy to obtain a candidate data set, analyze the candidate data set based on an evaluation dimension, and obtain evaluation parameters of the data set to be evaluated according to the analysis result; the evaluation result corresponding to the data set to be evaluated can then be determined according to the evaluation parameters, so that the data can be processed according to the evaluation result. In this scheme, the data set to be evaluated is filtered, the candidate data set obtained through filtering is analyzed based on the evaluation dimension to obtain the evaluation parameters, and the evaluation result corresponding to the data set to be evaluated is determined according to the evaluation parameters, so that manual review is avoided and the accuracy and reliability of data evaluation are improved.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

FIG. 1 is a schematic view of a scenario in which a data evaluation method provided in an embodiment of the present application is applied;

FIG. 2 is a schematic flow chart diagram of a data evaluation method provided in an embodiment of the present application;

FIG. 3 is a schematic diagram of complexity evaluation of data provided by an embodiment of the present application;

FIG. 4 is another schematic flow chart diagram of a data evaluation method provided in an embodiment of the present application;

FIG. 5 is a schematic flowchart of training a text semantic model according to an embodiment of the present application;

FIG. 6 is a schematic flowchart of detecting a text to be detected by using a trained text semantic model according to an embodiment of the present application;

FIG. 7 is a schematic diagram of a data evaluation device provided in an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a computer device provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The embodiment of the application provides a data evaluation method, a data evaluation device, computer equipment and a storage medium.

Referring to FIG. 1, FIG. 1 is a schematic view of a scenario in which the data evaluation method provided in an embodiment of the present application is applied. The scenario may include a data evaluation device, which may be specifically integrated in a server, a terminal, or other computer equipment. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms, but is not limited thereto. The server and the terminal may be directly or indirectly connected through wired or wireless communication, which is not limited in the present application. The terminal can be a mobile phone, a tablet computer, a notebook computer, a desktop computer, a wearable device, or the like.

The computer device may be configured to obtain a data set to be evaluated, filter the data set to be evaluated according to a preset filtering policy to obtain a candidate data set, analyze the candidate data set based on the evaluation dimension, and obtain evaluation parameters of the data set to be evaluated according to an analysis result, for example, the candidate data set may be subjected to availability evaluation, consistency evaluation, anomaly detection evaluation, complexity evaluation, and the like to obtain evaluation parameters of availability, consistency, anomaly, complexity, and the like. At this time, the evaluation result (for example, data quality is qualified or unqualified) corresponding to the data set to be evaluated can be determined according to the evaluation parameter, so that the data is processed according to the evaluation result, and the accuracy and reliability of data evaluation are improved.

It should be noted that the scenario diagram of the data evaluation method application shown in fig. 1 is only an example, and the data evaluation method application and the scenario described in the embodiment of the present application are for more clearly illustrating the technical solution of the embodiment of the present application, and do not form a limitation on the technical solution provided in the embodiment of the present application.

Detailed descriptions are given below. It should be noted that the order in which the following embodiments are described is not intended to limit the preferred order of the embodiments.

In the present embodiment, description will be made from the perspective of a data evaluation apparatus, which may be specifically integrated in a computer device such as a server or a terminal.

Referring to FIG. 2, FIG. 2 is a schematic flow chart of a data evaluation method according to an embodiment of the present application. The data evaluation method may include:

S101, acquiring a data set to be evaluated.

The data set to be evaluated may include a plurality of sample data, and the specific type of the sample data may be flexibly set according to actual needs. For example, the sample data may be text, voice, image, or other types of data; when the sample data is voice, speech recognition may be performed to convert the voice into text. Sample data of the data set to be evaluated may be provided with a label, which may be used to characterize the class of the sample data or to identify characteristics of the sample data, e.g., whether the sample data shows a risk tendency.

The obtaining mode of the data set to be evaluated may include: obtaining a plurality of pre-stored sample data from a local database to obtain a data set to be evaluated; or, the data set to be evaluated can be downloaded from the server; or, a plurality of sample data sent by the terminal may be received, and a data set to be evaluated is generated based on the plurality of sample data; and so on.

S102, filtering the data set to be evaluated according to a preset filtering strategy to obtain a candidate data set.

In order to improve the efficiency and reliability of processing the data set to be evaluated, the data set to be evaluated may be filtered to remove unwanted interference data.

The preset filtering strategy can be flexibly set according to actual needs.

In an embodiment, filtering the data set to be evaluated according to a preset filtering strategy to obtain a candidate data set may include: filtering out empty samples, all-digit samples, or texts whose word count is smaller than a preset word-count threshold from the data set to be evaluated according to the preset filtering strategy, to obtain the candidate data set. An empty sample may be blank text, garbled text, or the like.
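As an illustration only, the filtering step may be sketched as follows. The function name `filter_dataset`, the `(text, label)` tuple layout, and measuring length in characters with a threshold of 2 are assumptions made for this sketch and are not specified by the present application.

```python
def filter_dataset(samples, min_chars=2):
    """Keep only (text, label) pairs that pass the preset filtering strategy.

    Assumption for this sketch: length is measured in characters and the
    threshold is 2; both are hypothetical and may be set as needed.
    """
    candidates = []
    for text, label in samples:
        stripped = text.strip()
        if not stripped:               # empty or blank sample
            continue
        if stripped.isdigit():         # sample consisting only of digits
            continue
        if len(stripped) < min_chars:  # shorter than the preset threshold
            continue
        candidates.append((stripped, label))
    return candidates


raw = [
    ("hello there", "ok"),
    ("   ", "ok"),
    ("12345", "risk"),
    ("a", "ok"),
    ("refund please", "risk"),
]
candidates = filter_dataset(raw)
```

In this sketch, two of the five samples survive filtering and form the candidate data set.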

S103, analyzing the candidate data set based on the evaluation dimension.

S104, obtaining the evaluation parameters of the data set to be evaluated according to the analysis result.

The evaluation dimension may include at least any one of availability evaluation, consistency evaluation, anomaly detection evaluation, complexity evaluation, and the like, and the evaluation parameter may include at least any one of availability rate, consistency rate, anomaly rate, complexity rate, and the like. Of course, the evaluation dimension and the evaluation parameter may also be flexibly set according to actual needs, and the specific content is not limited herein.

In an embodiment, the evaluation dimension includes availability evaluation and the evaluation parameter includes an availability rate. Analyzing the candidate data set based on the evaluation dimension and obtaining the evaluation parameter of the data set to be evaluated according to the analysis result may include: acquiring a first data amount of the data set to be evaluated and a second data amount of the candidate data set; and calculating the availability rate of the data set to be evaluated according to the first data amount and the second data amount.

In order to accurately judge whether the data set to be evaluated is available, availability evaluation can be performed on it. The availability evaluation characterizes the availability of the data set to be evaluated by whether the candidate data set obtained after filtering is available, and the availability rate can be characterized by the proportion of the data set to be evaluated that the candidate data set accounts for. The higher the availability rate of the data set to be evaluated, the better its data quality; conversely, the lower the availability rate, the worse its data quality.

Specifically, in the process of performing the availability evaluation, a first data amount of the data set to be evaluated (i.e., the data size of the data set to be evaluated) and a second data amount of the candidate data set (i.e., the data size of the candidate data set) may be obtained. The ratio of the second data amount to the first data amount may then be calculated, and the availability rate of the data set to be evaluated determined from this ratio. For example, if the first data amount of the data set to be evaluated is $C$ and the data amount of the candidate data set $D_a$ is $|D_a|$, then the availability rate $p_a$ of the data set to be evaluated is: $p_a = |D_a| / C$.
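For illustration, the availability rate may be computed as in the following minimal sketch. The function name `availability_rate` and the choice to return 0.0 for an empty data set are assumptions of this sketch.

```python
def availability_rate(original_count, candidate_count):
    """Availability rate: candidate data amount divided by original data amount C."""
    if original_count == 0:
        return 0.0  # assumption: an empty data set is treated as unavailable
    return candidate_count / original_count


# Example: 5 samples before filtering, 2 samples left in the candidate data set.
p_a = availability_rate(5, 2)
```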

In an embodiment, the evaluation dimension includes consistency evaluation, the evaluation parameter includes consistency rate, the analyzing the candidate data set based on the evaluation dimension, and the obtaining the evaluation parameter of the data set to be evaluated according to the analysis result may include: carrying out repeated item detection on the candidate data set to obtain a repeated data set; extracting data with consistent labels from the repeated data set; acquiring the ratio of the data quantity of the data with consistent labels to the data quantity in the data set to be evaluated; and determining the consistency rate of the data set to be evaluated according to the ratio.

In order to accurately judge the data quality of the data set to be evaluated, consistency evaluation can be performed on it. The consistency evaluation characterizes the consistency of the data set to be evaluated by whether the labels of the sample data in the candidate data set obtained after filtering are consistent, and the consistency rate can be characterized by the proportion of the data set to be evaluated that the label-consistent sample data in the candidate data set accounts for. The higher the consistency rate of the data set to be evaluated, the better its data quality; conversely, the lower the consistency rate, the worse its data quality.

Specifically, in the consistency evaluation process, duplicate item detection may be performed on the candidate data set to obtain a duplicate data set, where the duplicate data set may include multiple duplicate data samples. Because the sample data of the data set to be evaluated may be provided with labels, the sample data of the duplicate data set is also provided with labels. Then, data (i.e., sample data) with consistent labels can be extracted from the duplicate data set, the ratio of the data amount of the label-consistent data to the data amount of the data set to be evaluated is obtained, and the consistency rate of the data set to be evaluated is determined according to the ratio. For example, if the data amount of the label-consistent data in the duplicate data set is $C_1$ and the data amount of the data set to be evaluated is $C$, the consistency rate of the data set to be evaluated is $p_d = C_1 / C$. For another example, the data with inconsistent labels may be extracted from the duplicate data set; if its data amount is $C_d$ and the data amount of the data set to be evaluated is $C$, the consistency rate of the data set to be evaluated is $p_d = 1 - C_d / C$.
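As a hedged sketch, the second formula (one minus the share of label-inconsistent duplicates) may be computed as follows. Treating two samples as duplicates only when their texts are exactly equal, and the names `consistency_rate` and `labels_by_text`, are assumptions of this sketch; the present application does not fix how duplicates are detected.

```python
from collections import defaultdict


def consistency_rate(candidates, total_count):
    """Return 1 - C_d / C, where C_d counts duplicated samples whose labels disagree."""
    labels_by_text = defaultdict(list)
    for text, label in candidates:
        labels_by_text[text].append(label)

    # A text is a duplicate if it appears more than once; it is inconsistent
    # if those occurrences carry more than one distinct label.
    inconsistent = sum(
        len(labels)
        for labels in labels_by_text.values()
        if len(labels) > 1 and len(set(labels)) > 1
    )
    return 1 - inconsistent / total_count


candidates = [
    ("refund please", "risk"),
    ("refund please", "ok"),   # same text, different label
    ("thanks", "ok"),
    ("thanks", "ok"),          # same text, same label
    ("hello there", "ok"),
]
# Assume the data set to be evaluated held 6 samples before filtering.
p_d = consistency_rate(candidates, total_count=6)
```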

In an embodiment, the evaluation dimension includes an anomaly detection evaluation, the evaluation parameter includes an anomaly rate, the analyzing the candidate data set based on the evaluation dimension, and the obtaining the evaluation parameter of the data set to be evaluated according to the analysis result may include: performing word segmentation processing on each sample data in the candidate data set respectively to obtain a word set corresponding to each sample data; vectorizing the word set to obtain a word vector; projecting the word vectors to a sample feature space with preset dimensionality to obtain text feature vectors; clustering the text feature vectors, and determining abnormal data in the candidate data set based on a clustering result; and determining the abnormal rate of the data set to be evaluated according to the abnormal data.

In order to improve the accuracy of the evaluation of the data set to be evaluated, anomaly detection evaluation may be performed on it. The anomaly detection evaluation characterizes the abnormality of the data set to be evaluated by whether the sample data in the candidate data set obtained after filtering is abnormal, and the anomaly rate can be characterized by the proportion of the data set to be evaluated that the abnormal data in the candidate data set accounts for. The higher the anomaly rate of the data set to be evaluated, the worse its data quality; conversely, the lower the anomaly rate, the better its data quality.

Specifically, in the process of performing anomaly detection evaluation, when the sample data is text, word segmentation processing may be performed on each sample data in the candidate data set to obtain a word set corresponding to each sample data; when the sample data is voice, speech recognition may first be performed on the sample data to convert the voice into text, after which word segmentation processing may be performed on each sample data in the candidate data set to obtain a word set corresponding to each sample data. The word segmentation method may be flexibly set according to actual needs. For example, each sample data in the candidate data set may be segmented by using the jieba word segmentation tool, or semantic recognition may be performed on the sample data and each sample data in the candidate data set may be segmented into single words, phrases, or the like.

Then, the word set corresponding to each sample data may be vectorized to obtain the word vectors corresponding to each sample data. For example, the word set of each sample data may be vectorized by a word-to-vector (word2vec for short) model that generates word vectors, and the word vectors of each sample data may be projected to a sample feature space of a preset dimension (for example, N dimensions, where N may be 128 or may be flexibly set according to actual needs) to obtain a text feature vector, which may specifically be as follows:

where $sen\_vec$ may represent the text feature vector (also referred to as a sentence vector), $m$ may represent the number of words contained in each sample data, and $vec_i$ may represent the $i$-th word vector.
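One reading consistent with these symbols is that the sentence vector is the average of the m word vectors of a sample. The sketch below assumes this averaging; a plain sum over the word vectors would also fit the symbol definitions, so the division by m is an assumption of this sketch. The function name `sentence_vector` is hypothetical.

```python
def sentence_vector(word_vectors):
    """Average m word vectors (lists of floats of equal length) into one sentence vector.

    Assumption: averaging; the symbol definitions would also fit a plain sum.
    """
    m = len(word_vectors)
    dim = len(word_vectors[0])
    return [sum(vec[k] for vec in word_vectors) / m for k in range(dim)]


# Two 2-dimensional word vectors stand in for N-dimensional word2vec outputs.
sen_vec = sentence_vector([[1.0, 2.0], [3.0, 4.0]])
```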

At this time, the text feature vectors may be clustered, and abnormal data in the candidate data set may be determined based on the clustering result. In one embodiment, clustering the text feature vectors, and determining abnormal data in the candidate data set based on the clustering result may include: performing dimensionality reduction on the text feature vector to obtain a dimensionality-reduced feature vector; normalizing the feature vector subjected to dimension reduction to obtain a normalized feature vector; and clustering the normalized feature vectors, and determining abnormal data in the candidate data set based on a clustering result.

In order to improve the accuracy of clustering, the text feature vector may be subjected to dimension reduction and normalization. For example, the N-dimensional text feature vector may be reduced in dimension based on Latent Semantic Analysis (LSA) to obtain a dimension-reduced feature vector; the dimension of the dimension-reduced feature vector may be D, where D may take a value of 15 or may be flexibly set according to actual needs. LSA may map a text represented in a high-dimensional Vector Space Model (VSM) to a low-dimensional latent semantic space, and the mapping may be implemented by Singular Value Decomposition (SVD) of the text feature vector matrix.
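A minimal sketch of SVD-based dimension reduction is given below, assuming NumPy is available. The function name `reduce_dimensions`, the random stand-in features, and the choice to keep the first D left singular vectors scaled by their singular values are assumptions of this sketch rather than details given in the present application.

```python
import numpy as np


def reduce_dimensions(features, d):
    """Project an (num_samples, N) feature matrix onto its top-d singular directions."""
    # Thin SVD: features = u @ diag(s) @ vt, singular values sorted in descending order.
    u, s, _vt = np.linalg.svd(features, full_matrices=False)
    # Coordinates of each sample in the d-dimensional latent space.
    return u[:, :d] * s[:d]


rng = np.random.default_rng(0)
features = rng.normal(size=(20, 128))   # stand-in for 20 sentence vectors with N = 128
reduced = reduce_dimensions(features, d=15)
```

Here `reduced` has one row per sample and D = 15 columns, matching the example values of N and D mentioned above.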

Then, the dimension-reduced feature vector may be normalized to obtain a normalized feature vector. For example, since the feature vector is represented by numerical values, each value of the dimension-reduced feature vector may be normalized into the range of 0 to 1 in order to improve the reliability of subsequent clustering. The normalized feature vectors may then be clustered, and abnormal data in the candidate data set may be determined based on the clustering result. For example, the normalized feature vectors may be clustered by the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. The abnormal data can be data whose labels are wrong due to negligence, and can be isolated points (i.e., isolated data) in the clustered data set: samples that are close to each other in the text feature vector space are clustered into a class cluster, and points that do not belong to any class cluster are isolated points.
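The normalization into the range of 0 to 1 may, for example, be done per dimension with min-max scaling, as sketched below. Min-max scaling, the function name `min_max_normalize`, and mapping a constant dimension to 0.0 are assumptions of this sketch; the present application only states the target range.

```python
def min_max_normalize(vectors):
    """Scale every dimension of the given vectors into the range [0, 1]."""
    dims = len(vectors[0])
    lows = [min(v[k] for v in vectors) for k in range(dims)]
    highs = [max(v[k] for v in vectors) for k in range(dims)]

    normalized = []
    for v in vectors:
        row = []
        for k in range(dims):
            span = highs[k] - lows[k]
            # Assumption: a dimension with no spread is mapped to 0.0.
            row.append((v[k] - lows[k]) / span if span else 0.0)
        normalized.append(row)
    return normalized


normalized = min_max_normalize([[1.0, 10.0], [3.0, 10.0], [2.0, 10.0]])
```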

In an embodiment, clustering the normalized feature vectors, and determining abnormal data in the candidate data set based on the clustering result may include: discretizing the normalized feature vectors into a plurality of feature points; selecting any one feature point from the plurality of feature points as a core point; assigning all feature points in a preset neighborhood range centered on the core point to the same class; selecting another feature point from the plurality of feature points as a core point, and returning to execute the operation of assigning all feature points in the preset neighborhood range centered on the core point to the same class, until the plurality of feature points have been traversed; and screening out the sample data corresponding to the feature points not assigned to any class to obtain the abnormal data.

Specifically, the normalized feature vectors may be discretized into a plurality of feature points, and any feature point is selected from the plurality of feature points as a core point. All feature points in a preset neighborhood range centered on the core point are assigned to the same class; for example, all feature points within a radius $R$ of the core point are assigned to the same class, where the value of $R$ may be flexibly set according to actual needs. Another feature point is then selected from the plurality of feature points as a core point, and the operation of assigning all feature points in the preset neighborhood range centered on the core point to the same class is executed again, until the plurality of feature points have been traversed. For example, all feature points $\{y_1, y_2, \ldots, y_i\}$ in the neighborhood of the core point may be traversed in turn as core points: $y_i$ is taken as a new core point and all feature points in the neighborhood of $y_i$ are assigned to the same class cluster; all feature points $\{z_1, z_2, \ldots, z_i\}$ in the neighborhood of $y_i$ are then traversed in turn as core points, and so on, so that the cluster grows gradually until there are no core points that can be extended. Then another core point that has not been assigned to a class cluster is found and assigned to a new class cluster, and the above steps are repeated to expand it, until all the core points in the data set have been assigned cluster labels. The data points in the data set that have not been assigned any cluster label are abnormal points, so that an abnormal point set $D_{outlier}$ is finally obtained; that is, the sample data corresponding to the feature points not assigned to any class are screened out to obtain the abnormal data.
The core point may be a point whose density exceeds a threshold and may be the center of a cluster; the density $\rho$ may be the number of all data points whose distance from the point is less than the radius $R$, where the distance may be a Euclidean distance. Outliers (which may also be referred to as noise points) may be points that do not belong to any class cluster.
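The cluster expansion described above may be sketched as a small DBSCAN-style routine. The function name `dbscan_outliers`, the parameter names `radius` and `min_points`, the example values, and the use of "less than or equal to" for the neighborhood test are assumptions of this sketch, not values given in the present application.

```python
import math


def dbscan_outliers(points, radius, min_points):
    """Return the indices of points that end up in no cluster (noise points)."""

    def neighbors(i):
        # All points (including i itself) within the radius, by Euclidean distance.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= radius]

    labels = [None] * len(points)  # None = not visited yet, -1 = noise
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_points:
            labels[i] = -1         # not dense enough to be a core point
            continue
        labels[i] = cluster        # i is a core point and starts a new cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise point reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_points:
                queue.extend(reachable)  # j is also a core point: keep expanding
        cluster += 1
    return [i for i, label in enumerate(labels) if label == -1]


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.9, 0.9)]
outliers = dbscan_outliers(points, radius=0.2, min_points=3)
```

In this example the first three points form one cluster, and the last point, which has no neighbors within the radius, is reported as an outlier.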

After the abnormal data is obtained, the anomaly rate of the data set to be evaluated can be determined according to the abnormal data. For example, the ratio of the data amount of the abnormal data to the data amount of the data set to be evaluated may be calculated and taken as the anomaly rate: anomaly rate = data amount of the abnormal data / data amount of the data set to be evaluated.

In an embodiment, the evaluation dimension includes an anomaly detection evaluation, the evaluation parameter includes an anomaly rate, the analyzing the candidate data set based on the evaluation dimension, and the obtaining the evaluation parameter of the data set to be evaluated according to the analysis result may include: carrying out classification prediction on each sample data in the candidate data set through the trained classification model to obtain the classification probability corresponding to each sample data in the candidate data set; screening sample data with classification probability smaller than a preset threshold value as abnormal data; and determining the abnormal rate of the data set to be evaluated according to the screened abnormal data with the classification probability smaller than the preset threshold value.

In order to improve the accuracy of the anomaly rate calculation, the anomaly rate can be calculated from abnormal data screened out by a trained classification model, where the specific type, structure, and the like of the trained classification model can be flexibly set according to actual needs; for example, the classification model can be a FastText model, a TextCNN model, an RCNN model, a BERT model, or the like. After the candidate data set is obtained, classification prediction can be performed on each sample data in the candidate data set through the trained classification model, so as to obtain the class label corresponding to each sample data in the candidate data set and the classification probability of each sample data, where the classification probability can be the probability that the sample data belongs to that class label. Then, the sample data whose classification probability is smaller than a preset threshold can be screened out and taken as abnormal data; the specific value of the preset threshold can be flexibly set according to actual needs. The anomaly rate of the data set to be evaluated may then be determined according to the screened abnormal data, for example: anomaly rate = data amount of the abnormal data with classification probability smaller than the preset threshold / data amount of the data set to be evaluated.
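For illustration, the threshold screening may be sketched as follows. The prediction values are made up for the example, and the function name `screen_low_confidence`, the threshold 0.6, and the original data amount of 5 are assumptions of this sketch; a real classifier would supply the probabilities.

```python
def screen_low_confidence(predictions, threshold):
    """Return indices of samples whose classification probability is below the threshold."""
    return [
        index
        for index, (_label, probability) in enumerate(predictions)
        if probability < threshold
    ]


# (predicted class label, classification probability) for each candidate sample.
predictions = [("risk", 0.95), ("ok", 0.40), ("ok", 0.88), ("risk", 0.55)]
abnormal = screen_low_confidence(predictions, threshold=0.6)

# Assume the data set to be evaluated held 5 samples before filtering.
anomaly_rate = len(abnormal) / 5
```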

In one embodiment, the data evaluation method may further include: obtaining a plurality of training samples, wherein the training samples comprise label samples and complementary label samples, the label samples are samples marked with real labels, and the complementary label samples are samples marked with labels complementary to the real labels; carrying out negative learning on the basis of the label samples and the complementary label samples through the initial classification model to train the initial classification model to obtain a candidate classification model and predict the sample classification probability of each training sample; screening out training samples with the sample classification probability larger than a target probability threshold value to obtain candidate training samples; and carrying out fine tuning training on the candidate classification model based on the candidate training sample to obtain the trained classification model.

To improve the accuracy and reliability with which the trained classification model screens abnormal data, the classification model can be trained in advance. Because sample data can be hard to distinguish and annotators' understanding can deviate, the labels set for the sample data may be wrong. To accurately detect such mislabeled abnormal data, the classification model can be trained by negative learning, so that the trained classification model can quickly and accurately find the mislabeled abnormal points $D_{noise}$. Negative learning can train with complementary labels: in a multi-class labeling task with label set $\{L_1, L_2, \ldots, L_N\}$, if the annotator assigns a sample the label $T$, the complementary label is $C = \mathrm{random}(\{L_1, L_2, \ldots, L_N\} \setminus \{T\})$, that is, one label selected at random from the label set with the annotated label removed. When training with complementary labels, the probability that the complementary label is the real label is low for mislabeled data and is 0 for correctly labeled data. Negative learning therefore reduces the risk of learning from wrong information, avoids the overfitting to abnormal points that occurs in positive learning, and is better suited to noisy abnormal data.
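As a small illustration of the complementary-label selection described above, the following sketch draws one label uniformly at random from the label set with the annotated label removed. The function name `complementary_label` and the example labels are hypothetical.

```python
import random
from typing import Optional, Sequence


def complementary_label(
    label_set: Sequence[str],
    annotated: str,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick a complementary label: any label in label_set except `annotated`."""
    chooser = rng if rng is not None else random
    candidates = [label for label in label_set if label != annotated]
    if not candidates:
        raise ValueError("label_set needs at least one label besides the annotated one")
    return chooser.choice(candidates)


labels = ["A", "B", "C", "D"]
# A seeded generator keeps the example reproducible.
print(complementary_label(labels, annotated="B", rng=random.Random(0)))
```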

Specifically, a plurality of training samples stored in advance may be obtained from a local database, or a plurality of training samples may be downloaded from a server, and so on. The training samples may be texts, voices, images, and the like. The plurality of training samples may include label samples and complementary label samples, where the label samples are samples annotated with real labels and the complementary label samples are samples annotated with labels complementary to the real labels. For example, the real label annotated for training sample A is category A and the real label of training sample B is category B; training sample B may also serve as complementary label sample B, in which case the label that is annotated for complementary label sample B and is complementary to its real label may be category C, category D, category E, category F, category G, or the like.

Then, negative learning can be performed on the basis of the label samples and the complementary label samples through the initial classification model, so that the initial classification model is trained to obtain candidate classification models, and the sample classification probability of each training sample is obtained through prediction. For example, the prediction class to which the training sample belongs and the sample classification probability may be predicted by an initial classification model, and a loss value may be calculated by a loss function based on the sample classification probability, so as to adjust a parameter of the classification model to an appropriate value according to the loss value, thereby obtaining a candidate classification model. Wherein, the loss function in negative learning may be:

where $L$ may represent the loss value, $N$ may represent the number of training samples, $C_k$ may represent the $k$-th training sample, and $P_k$ may represent the sample classification probability of the $k$-th training sample.
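The loss formula itself does not appear in this text, so the following is only a hedged illustration and not the formula of the application. A form commonly used for negative learning with complementary labels penalizes the probability that the model assigns to each sample's complementary label, averaging $-\log(1 - p_k)$ over the $N$ training samples. In this sketch, `comp_label_probs` (the predicted probability of each sample's complementary label) and the clamp `eps` are assumptions.

```python
import math
from typing import Sequence


def negative_learning_loss(comp_label_probs: Sequence[float], eps: float = 1e-12) -> float:
    """Average of -log(1 - p_k), where p_k is the predicted probability
    of the complementary label of the k-th training sample.

    The loss is 0 when the model gives the complementary label zero
    probability and grows as that probability approaches 1.
    """
    n = len(comp_label_probs)
    if n == 0:
        raise ValueError("at least one training sample is required")
    # Clamp to eps so that p_k == 1.0 does not produce log(0).
    return -sum(math.log(max(1.0 - p, eps)) for p in comp_label_probs) / n


print(negative_learning_loss([0.05, 0.10, 0.60]))
```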

Secondly, the training samples with the sample classification probability larger than the target probability threshold can be screened out to obtain candidate training samples, wherein the target probability threshold can be flexibly set according to actual needs. At this time, the candidate classification model may be further trained based on the candidate training samples to perform fine tuning training on parameters of the candidate classification model, so as to obtain a trained classification model.

In an embodiment, determining the abnormal rate of the data set to be evaluated according to the abnormal data with the screened classification probability smaller than the preset threshold may include: acquiring a target characteristic vector corresponding to each sample data in the candidate data set; determining target abnormal data in the candidate data set based on the target feature vector; and determining the abnormal rate of the data set to be evaluated according to the target abnormal data and the screened abnormal data with the classification probability smaller than the preset threshold.

To improve the accuracy with which the abnormal rate is determined, the abnormal rate of the data set to be evaluated can be determined by combining the abnormal data screened out by clustering with the abnormal data screened out by the trained classification model. Specifically, word segmentation may be performed on each sample data in the candidate data set in the manner described above to obtain a word set corresponding to each sample data; the word set is vectorized to obtain word vectors; the word vectors are projected into a sample feature space of preset dimensionality to obtain target feature vectors (namely, text feature vectors); dimensionality reduction is performed on the target feature vectors to obtain dimensionality-reduced feature vectors; the dimensionality-reduced feature vectors are normalized to obtain normalized feature vectors; and the normalized feature vectors are clustered, with the target abnormal data in the candidate data set determined based on the clustering result. In addition, in the manner described above, classification prediction is performed on each sample data in the candidate data set through the trained classification model to obtain the classification probability corresponding to each sample data, and the sample data with classification probability smaller than the preset threshold is screened out as abnormal data. The abnormal rate of the data set to be evaluated can then be determined from the target abnormal data $D_{isolate}$ and the screened abnormal data $D_{noise}$ with classification probability smaller than the preset threshold. For example, the total abnormal data is $D_{outlier} = D_{isolate} \cup D_{noise}$, which gives the abnormal rate $p_o$:

$p_o = |D_{outlier}| / C = |D_{isolate} \cup D_{noise}| / C$

where $C$ may represent the data volume of the data set to be evaluated.
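A minimal sketch of combining the two sources of abnormal data follows. It assumes each sample is identified by an integer id, so that a sample flagged by both the clustering step and the classification model is counted only once in the union; the function name and ids are hypothetical.

```python
from typing import Iterable, Set, Tuple


def combined_anomaly_rate(
    isolated_ids: Iterable[int],
    noisy_ids: Iterable[int],
    total_count: int,
) -> Tuple[Set[int], float]:
    """Return (D_outlier, p_o) with D_outlier = D_isolate ∪ D_noise
    and p_o = |D_outlier| / C, where C is total_count."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    outliers = set(isolated_ids) | set(noisy_ids)
    return outliers, len(outliers) / total_count


# Sample 3 is flagged by both methods but counted once.
outliers, p_o = combined_anomaly_rate({1, 2, 3}, {3, 4}, total_count=20)
print(sorted(outliers), p_o)
```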

In an embodiment, the evaluation dimension includes complexity evaluation, the evaluation parameter includes a complexity rate, and analyzing the candidate data set based on the evaluation dimension and obtaining the evaluation parameter of the data set to be evaluated according to the analysis result may include: performing word segmentation processing on each sample data in the data set respectively to obtain a word set corresponding to each sample data; vectorizing the word set to obtain a vector matrix; performing a convolution operation on the vector matrix to obtain a plurality of one-dimensional vectors; performing a pooling operation on the plurality of one-dimensional vectors to obtain pooled vectors; concatenating the pooled vectors to obtain a concatenated vector; performing a fully connected operation on the concatenated vector to obtain the label category probability of each sample data; and determining the complexity rate of the data set to be evaluated according to the label category probability.

In order to improve the accuracy of the evaluation of the data set to be evaluated, the complexity of the data set to be evaluated may be evaluated, where the complexity evaluation may be measured by the complexity of sample data in a candidate data set obtained after filtering the data set to be evaluated, and the complexity rate may be represented by a proportion of the complex sample data in the candidate data set occupying the data set to be evaluated. The higher the complexity of the data set to be evaluated is, the better the data quality of the data set to be evaluated is, and conversely, the lower the complexity of the data set to be evaluated is, the worse the data quality of the data set to be evaluated is.

Specifically, the complexity of each sample data may be evaluated based on a TextCNN model. As shown in fig. 3, word segmentation may be performed on each sample data in the data set using the jieba word segmentation strategy to obtain a word set corresponding to each sample data. The word set is then vectorized using Word2vec to obtain a vector matrix. The vector matrix can be a two-dimensional matrix of size M × D, where D is the size of a word vector and M is the maximum length of the sample data; for example, M can be 20 and D can be 128, or the values can be flexibly set according to actual needs.

After the vector matrix is obtained, a convolution operation is performed on the vector matrix using three convolution kernels of sizes 2, 3, and 4 respectively to obtain a plurality of one-dimensional vectors. A pooling operation is performed on the plurality of one-dimensional vectors through the pooling layer to obtain the pooled vectors corresponding to each sample data; for example, the maximum of each one-dimensional vector obtained by convolution is taken through the pooling layer to obtain a pooled vector. The pooled vectors corresponding to each sample data are concatenated into one vector to obtain a concatenated vector, and a fully connected operation is performed on the concatenated vector through a fully connected layer to obtain the label category probability of each sample data, where the largest label category probability is the first category probability value $p_{max}$ and the second largest label category probability is the second category probability value $p_{second}$. If $p_{max}$ is less than $P_{th1}$ and $p_{second}$ is greater than $P_{th2}$, the prediction is uncertain, which means that the sample data lies near the classification boundary, is relatively complex, and is hard to predict. $P_{th1}$ and $P_{th2}$ can be flexibly set according to actual needs; for example, $P_{th1}$ can be 0.65 and $P_{th2}$ can be 0.25. The sample data with $p_{max}$ less than $P_{th1}$ and $p_{second}$ greater than $P_{th2}$ can be screened out to obtain the complex samples $D_{complex}$, so that the complexity rate $p_c$ of the data set to be evaluated can be calculated:

$p_c = |D_{complex}| / C$

where $C$ may represent the data volume of the data set to be evaluated.
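The screening rule for complex samples can be sketched as follows. The label category probabilities are assumed to come from a classifier such as the TextCNN described above, which is not implemented here; the function names are hypothetical, and the default thresholds are the example values 0.65 and 0.25 given in the text.

```python
from typing import List, Sequence, Tuple


def is_complex(probs: Sequence[float], p_th1: float = 0.65, p_th2: float = 0.25) -> bool:
    """A sample is complex when p_max < P_th1 and p_second > P_th2."""
    top_two = sorted(probs, reverse=True)[:2]
    if len(top_two) < 2:
        return False  # a single-class output has no second category probability
    p_max, p_second = top_two
    return p_max < p_th1 and p_second > p_th2


def complexity_rate(
    all_probs: Sequence[Sequence[float]],
    total_count: int,
    p_th1: float = 0.65,
    p_th2: float = 0.25,
) -> Tuple[List[int], float]:
    """Return (indices of D_complex, p_c) with p_c = |D_complex| / C."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    complex_ids = [i for i, p in enumerate(all_probs) if is_complex(p, p_th1, p_th2)]
    return complex_ids, len(complex_ids) / total_count


probs = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.50, 0.40, 0.10],  # near the boundary between two classes
    [0.60, 0.20, 0.20],  # low p_max, but no strong runner-up
]
print(complexity_rate(probs, total_count=4))
```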

S105, determining an evaluation result corresponding to the data set to be evaluated according to the evaluation parameters, and processing the data according to the evaluation result.

In an embodiment, the evaluation parameter may include at least any one of an availability rate, a consistency rate, an anomaly rate, and a complexity rate, and determining an evaluation result corresponding to the data set to be evaluated according to the evaluation parameter may include: acquiring accumulated values of the availability ratio, the consistency ratio, the abnormal ratio and the complexity ratio, and determining an evaluation result corresponding to the data set to be evaluated according to the accumulated values; or respectively setting weighted values for the availability ratio, the consistency ratio, the abnormal ratio and the complex ratio, carrying out weighting operation according to the availability ratio, the consistency ratio, the abnormal ratio, the complex ratio and the corresponding weighted values to obtain a target numerical value, and determining an evaluation result corresponding to the data set to be evaluated by the target numerical value.

The evaluation result may include that the data quality is qualified, that the data quality is unqualified, and the like. For example: accumulated value = availability rate + consistency rate + abnormal rate + complexity rate. The higher the accumulated value, the better the data quality; conversely, the lower the accumulated value, the worse the data quality. When the accumulated value is greater than or equal to the target threshold, the evaluation result is that the data quality is qualified; when the accumulated value is less than the target threshold, the evaluation result is that the data quality is unqualified. The target threshold may be flexibly set according to actual needs, and its specific value is not limited here.

For another example: target value = availability rate × weight value of the availability rate + consistency rate × weight value of the consistency rate + abnormal rate × weight value of the abnormal rate + complexity rate × weight value of the complexity rate. The higher the target value, the better the data quality; conversely, the lower the target value, the worse the data quality. When the target value is greater than or equal to the target threshold, the evaluation result is that the data quality is qualified; when the target value is less than the target threshold, the evaluation result is that the data quality is unqualified. The weight values of the availability rate, the consistency rate, the abnormal rate, and the complexity rate, as well as the target threshold, may be flexibly set according to actual needs, and their specific values are not limited here.
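Both scoring variants above (the plain accumulated value and the weighted target value) can be sketched in one function, treating the accumulated value as the special case in which every weight is 1. The function name, dictionary keys, rates, weights, and threshold below are hypothetical illustration values.

```python
from typing import Dict, Optional, Tuple


def evaluate_quality(
    rates: Dict[str, float],
    target_threshold: float,
    weights: Optional[Dict[str, float]] = None,
) -> Tuple[float, str]:
    """Return (score, evaluation result).

    Without weights the score is the accumulated value (sum of the rates);
    with weights it is the weighted target value.
    """
    if weights is None:
        weights = {name: 1.0 for name in rates}
    score = sum(rates[name] * weights[name] for name in rates)
    result = "qualified" if score >= target_threshold else "unqualified"
    return score, result


rates = {"availability": 0.9, "consistency": 0.8, "abnormal": 0.1, "complexity": 0.2}
print(evaluate_quality(rates, target_threshold=1.5))
print(evaluate_quality(
    rates,
    target_threshold=1.5,
    weights={"availability": 0.5, "consistency": 0.5, "abnormal": 0.0, "complexity": 0.0},
))
```

Passing only a subset of the four rates in `rates` corresponds to evaluating the data set on one, two, or three dimensions.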

It should be noted that, according to actual requirements, only any one dimension of the availability ratio, the consistency ratio, the anomaly ratio, and the complexity ratio may be considered to evaluate the data set to be evaluated, or any two or three dimensions of the availability ratio, the consistency ratio, the anomaly ratio, and the complexity ratio may be considered to evaluate the data set to be evaluated, or the four dimensions may be considered comprehensively to evaluate the labeling quality, so as to determine whether the data set to be evaluated is qualified. And the quality of the labeled data is evaluated from multiple dimensions, so that the evaluation efficiency and accuracy are improved.

In an embodiment, after determining an evaluation result corresponding to the data set to be evaluated according to the evaluation parameter, the data evaluation method may further include: when the evaluation result corresponding to the data set to be evaluated is that the data quality is qualified, training the model to be trained by using the data set to be evaluated to obtain a trained model, and processing the data through the trained model; and when the evaluation result corresponding to the data set to be evaluated is unqualified data quality, analyzing factors influencing the quality of the data set to be evaluated.

The model structure, specific type, and the like of the model to be trained can be flexibly set according to actual needs; for example, the model to be trained can be a text semantic model. When the evaluation result corresponding to the data set to be evaluated is that the data quality is qualified, the model to be trained can be trained using the data set to be evaluated to obtain a trained model, and data can then be processed by the trained model. For example, in the field of customer service, a call center can generate a massive amount of call voice every day. To supervise service quality and public opinion risk, one or more trained models can be deployed in the intelligent quality inspection system to analyze whether what a user says shows a risk tendency, whether the user is dissatisfied, whether the wording of the customer service staff is inappropriate, and so on. As another example, the category to which the data belongs may be analyzed by the trained model.

When the evaluation result corresponding to the data set to be evaluated is that the data quality is unqualified, the factors affecting the quality of the data set to be evaluated may be analyzed. For example, whether the quality is affected by low availability may be analyzed based on the availability rate, whether it is affected by low consistency may be analyzed based on the consistency rate, whether it is affected by abnormal data may be analyzed based on the abnormal rate, whether it is affected by low complexity may be analyzed based on the complexity rate, and so on.

It should be noted that the data evaluation method in the embodiment of the present application may also be used to manage the quality of the labeled data, and by performing a full evaluation on the data set labeled by the labeling staff, the work of the labeling staff is supervised and assessed. Through comparison experiments, qualified marking data in the same marking time are greatly reduced, and the accuracy of model training is greatly improved by comparing models obtained by original data training.

The method and the device for evaluating the data set can acquire the data set to be evaluated, filter the data set to be evaluated according to a preset filtering strategy to obtain a candidate data set, analyze the candidate data set based on evaluation dimensionality, and acquire evaluation parameters of the data set to be evaluated according to an analysis result; at this time, the evaluation result corresponding to the data set to be evaluated can be determined according to the evaluation parameter, so as to process the data according to the evaluation result. According to the scheme, the data set to be evaluated can be filtered, the candidate data set obtained through filtering is analyzed based on the evaluation dimension to obtain the evaluation parameter, the evaluation result corresponding to the data set to be evaluated is determined according to the evaluation parameter, manual review evaluation is avoided, and the efficiency, the accuracy and the reliability of data evaluation are improved.

The method described in the above embodiments is further illustrated in detail by way of example.

In this embodiment, for example, a data evaluation device is integrated in a server, and the flow of the data evaluation method provided in the embodiment of the present application may include text screening, model training, model application, and the like, which may specifically be as follows:

First, screening of texts

As shown in fig. 4, the process of text filtering provided in the embodiment of the present application may include:

S201, acquiring a data set to be evaluated containing a plurality of texts.

Taking sample data as a text, for example, the server may obtain a plurality of pre-stored texts from the database to obtain the data set to be evaluated, or the server may receive a plurality of texts sent by the terminal to obtain the data set to be evaluated, and so on.

S202, filtering the data set to be evaluated according to a preset filtering strategy to obtain a candidate data set.

For example, the server may filter out empty texts, garbled texts, all-digit texts, or texts whose word count is less than a preset word count threshold from the data set to be evaluated according to the preset filtering strategy to obtain the candidate data set, thereby removing unnecessary interference data and improving the efficiency and reliability of processing the data set to be evaluated.
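A minimal sketch of such a filtering strategy is given below. The concrete rules are assumptions made for illustration: length is measured in non-whitespace characters rather than segmented words, and a text is treated as garbled when it contains the Unicode replacement character or no letter or digit at all. The function names and the threshold are hypothetical.

```python
from typing import Iterable, List


def is_usable(text: str, min_chars: int = 2) -> bool:
    """Return False for empty, all-digit, too-short, or garbled texts."""
    compact = "".join(text.split())  # drop all whitespace
    if not compact:
        return False  # empty text
    if compact.isdigit():
        return False  # all-digit text
    if len(compact) < min_chars:
        return False  # shorter than the preset threshold
    # Assumed heuristic for garbled text: a decoding-failure marker,
    # or no alphanumeric character (CJK characters count as alphabetic).
    if "\ufffd" in compact or not any(ch.isalnum() for ch in compact):
        return False
    return True


def filter_dataset(texts: Iterable[str], min_chars: int = 2) -> List[str]:
    """Apply the filtering strategy and return the candidate data set."""
    return [t for t in texts if is_usable(t, min_chars)]


raw = ["", "   ", "12345", "好", "\ufffd??", "我要投诉客服态度", "refund please"]
print(filter_dataset(raw))
```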

S203, carrying out availability evaluation on the candidate data set to obtain the availability rate of the data set to be evaluated.

For example, the server may obtain a first data amount of the data set to be evaluated (that is, the data size of the data set to be evaluated) and a second data amount of the candidate data set (that is, the data size of the candidate data set), then calculate the ratio of the second data amount to the first data amount, and determine the availability rate of the data set to be evaluated from that ratio: availability rate $p_a = |D_a| / C$, where $|D_a|$ is the second data amount and $C$ is the first data amount.

S204, carrying out consistency evaluation on the candidate data set to obtain the consistency rate of the data set to be evaluated.

For example, the server may perform duplicate detection on the candidate data set to obtain a duplicate data set, where the duplicate data set may include multiple repeated texts. The server may then extract the texts with inconsistent labels from the duplicate data set, obtain the ratio of the data volume of the texts with inconsistent labels to the data volume of the data set to be evaluated, and determine the consistency rate of the data set to be evaluated from that ratio: consistency rate $p_d = 1 - C_d / C$, where $C_d$ is the data volume of the texts with inconsistent labels and $C$ is the data volume of the data set to be evaluated.
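The consistency check can be sketched as follows, under two assumptions that the text does not fix: duplicates are detected by exact text match, and every copy of a text whose copies carry conflicting labels is counted toward the inconsistent volume. The function name and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Sequence, Set, Tuple


def consistency_rate(samples: Sequence[Tuple[str, str]], total_count: int) -> float:
    """samples are (text, label) pairs from the candidate data set.

    Returns p_d = 1 - C_d / C, where C_d counts the samples whose text
    also appears with a different label and C is total_count.
    """
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    labels_by_text: Dict[str, Set[str]] = defaultdict(set)
    for text, label in samples:
        labels_by_text[text].add(label)
    inconsistent = sum(1 for text, _ in samples if len(labels_by_text[text]) > 1)
    return 1 - inconsistent / total_count


samples = [
    ("where is my order", "logistics"),
    ("where is my order", "complaint"),  # same text, conflicting label
    ("cancel my account", "account"),
    ("cancel my account", "account"),    # duplicate with a consistent label
    ("thanks", "other"),
]
print(consistency_rate(samples, total_count=10))
```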

S205, carrying out anomaly detection and evaluation on the candidate data set to obtain the anomaly rate of the data set to be evaluated.

For example, the server may perform word segmentation on each text in the candidate data set using the jieba word segmentation strategy to obtain a word set corresponding to each text; vectorize the word set using word2vec to obtain word vectors; project the word vectors into a sample feature space of preset dimensionality to obtain text feature vectors; perform dimensionality reduction on the text feature vectors using LSA to obtain dimensionality-reduced feature vectors; normalize the dimensionality-reduced feature vectors to obtain normalized feature vectors; cluster the normalized feature vectors through the DBSCAN algorithm; determine the abnormal data in the candidate data set based on the clustering result; and calculate the ratio of the data volume of the abnormal data to the data volume of the data set to be evaluated and take that ratio as the abnormal rate: abnormal rate = data volume of the abnormal data / data volume of the data set to be evaluated.
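The last two steps of this pipeline (normalization and density-based outlier detection) can be illustrated with a simplified, dependency-free sketch. It assumes the feature vectors have already been produced by the earlier steps, and it implements only the noise-detection part of DBSCAN (a point is noise when it is neither a core point nor within `eps` of a core point) without assigning cluster ids. The function names, `eps`, `min_samples`, and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dbscan_noise(points: Sequence[Sequence[float]], eps: float, min_samples: int) -> List[int]:
    """Return the indices that DBSCAN would mark as noise.

    A core point has at least min_samples points (itself included)
    within distance eps. Noise points are neither core points nor
    neighbors of any core point.
    """
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_samples for nb in neighbors]
    return [
        i for i in range(n)
        if not core[i] and not any(core[j] for j in neighbors[i])
    ]


# Toy feature vectors: three similar texts and one isolated text.
features = [[1.0, 0.0], [2.0, 0.1], [3.0, 0.2], [0.0, 5.0]]
normalized = [l2_normalize(v) for v in features]
noise = dbscan_noise(normalized, eps=0.2, min_samples=3)
total_count = 5  # size of the data set to be evaluated, before filtering
print(noise, len(noise) / total_count)
```

In practice a library implementation of DBSCAN would typically be used; this sketch only shows how isolated points translate into the abnormal rate.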

For another example, the server may perform classification prediction on each text in the candidate data set through a trained classification model (e.g., a FastText model) to obtain the classification probability corresponding to each text in the candidate data set; screen out the texts with classification probability smaller than a preset threshold as abnormal data; and determine the abnormal rate of the data set to be evaluated from the screened abnormal data: abnormal rate = data volume of the abnormal data with classification probability smaller than the preset threshold / data volume of the data set to be evaluated.

For another example, the abnormal data screened out by clustering and the abnormal data screened out by the trained classification model can be combined to determine the abnormal rate $p_o$ of the data set to be evaluated: $p_o = |D_{isolate} \cup D_{noise}| / C$, where $D_{isolate}$ may represent the abnormal data screened out by clustering, $D_{noise}$ may represent the abnormal data screened out by the trained classification model, and $C$ may represent the data volume of the data set to be evaluated.

S206, performing complexity evaluation on the candidate data set to obtain the complexity rate of the data set to be evaluated.

For example, the server may perform complexity evaluation on the candidate data set through the TextCNN model. Specifically, word segmentation may be performed on each text in the data set using the jieba word segmentation strategy to obtain a word set corresponding to each text. The word set is then vectorized using Word2vec to obtain a vector matrix, where the vector matrix can be a two-dimensional matrix, and a convolution operation is performed on the vector matrix using three convolution kernels of sizes 2, 3, and 4 respectively to obtain a plurality of one-dimensional vectors. A max pooling operation is performed on the plurality of one-dimensional vectors through the pooling layer to obtain the pooled vectors corresponding to each text. The pooled vectors corresponding to each text are concatenated into one vector to obtain a concatenated vector, and a fully connected operation is performed on the concatenated vector through a fully connected layer to obtain the label category probability of each sample data, where the largest label category probability is the first category probability value $p_{max}$ and the second largest label category probability is the second category probability value $p_{second}$. The texts with $p_{max}$ less than $P_{th1}$ and $p_{second}$ greater than $P_{th2}$ can be screened out to obtain the complex samples $D_{complex}$, so that the complexity rate $p_c$ of the data set to be evaluated can be calculated: $p_c = |D_{complex}| / C$, where $C$ may represent the data volume of the data set to be evaluated.

S207, determining an evaluation result corresponding to the data set to be evaluated according to the availability rate, the consistency rate, the abnormal rate and the complexity rate.

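For example, the server may calculate the accumulated value of the availability rate, the consistency rate, the abnormal rate, and the complexity rate: accumulated value = availability rate + consistency rate + abnormal rate + complexity rate. When the accumulated value is greater than or equal to the target threshold, the evaluation result is that the data quality is qualified; when the accumulated value is less than the target threshold, the evaluation result is that the data quality is unqualified.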

For another example, the server may perform a weighting operation on the availability rate, the consistency rate, the abnormal rate, and the complexity rate with their corresponding weight values to obtain a target value: target value = availability rate × its weight value + consistency rate × its weight value + abnormal rate × its weight value + complexity rate × its weight value. When the target value is greater than or equal to the target threshold, the evaluation result is that the data quality is qualified; when the target value is less than the target threshold, the evaluation result is that the data quality is unqualified.

It should be noted that the execution sequence between steps S203 to S206 can be flexibly set according to actual needs, for example, steps S203 to S206 can be executed serially in sequence, or steps S203 to S206 can be executed in parallel, or steps S203 to S206 can be executed in any sequence.

According to the data quality evaluation method and device, the data set to be evaluated can be filtered, the candidate data set obtained through filtering is analyzed based on multiple different evaluation dimensions to obtain evaluation parameters such as the availability ratio, the consistency ratio, the abnormal rate and the complexity ratio, and the evaluation result corresponding to the data set to be evaluated is determined according to the evaluation parameters, so that the data quality is evaluated from multiple dimensions, and the data evaluation efficiency and accuracy are improved.

Second, training of model

As shown in fig. 5, the process of model training provided in the embodiment of the present application may include:

S301, acquiring a target data set whose evaluation result is that the data quality is qualified.

S302, predicting the target data set through the initial text semantic model to obtain a prediction result.

S303, adjusting parameters of the initial text semantic model according to the prediction result to obtain the trained text semantic model.

When the evaluation result corresponding to the data set to be evaluated is that the data quality is qualified, the server may use the data set to be evaluated, of which the evaluation result is that the data quality is qualified, as a target data set, and may train the initial text semantic model by using the target data set to obtain the trained text semantic model. For example, the text in the target data set can be predicted through the initial text semantic model, and the parameters of the initial text semantic model are adjusted according to the prediction result, so that the trained text semantic model is obtained. According to the text semantic model training method and device, the text semantic model can be trained through the target data set with qualified data quality, accuracy of model training is improved, and performance of the model is improved.

Third, application of model

As shown in fig. 6, the process of applying the model provided in the embodiment of the present application may include:

S401, acquiring the text to be detected generated by the source end.

The source end can be a terminal such as a mobile phone, a computer or other intelligent devices, and the text to be detected can be a text input by a user through the source end, or a text obtained by converting voice input by the user through the source end.

S402, performing semantic analysis on the text to be detected through the trained text semantic model to obtain an analysis result.

For example, in order to monitor the service quality and public opinion risk by detecting call voice generated by a call center, whether the text to be detected corresponding to the call voice is in risk tendency, whether the text is discontented, whether the expression of customer service is unreasonable and the like can be analyzed through the trained text semantic model. For another example, the category to which the text to be detected belongs may be analyzed through the trained text semantic model.

S403, determining the risk level of the text to be detected according to the analysis result.

The risk level can be flexibly set according to actual needs, the higher the risk level is, the larger the risk is, for example, if it is determined that the text to be detected has a high risk tendency based on the analysis result, the higher the risk level is. And when the risk grade is 0, indicating that the text to be detected has no risk tendency.

S404, carrying out corresponding processing on the source end that generated the text to be detected according to the risk level.

When the risk level is high (for example, the risk level is greater than a preset level threshold, which can be flexibly set according to actual needs), the server may output a risk early warning, the identifier of the source end, and the like. For example, the risk early warning and the identifier of the source end may be sent to a designated mailbox or device so that maintenance personnel can view them and take corresponding measures against the source end that generated the text to be detected, such as punishing or warning the user corresponding to the source end. When the risk level is low, the server can output prompt information indicating no risk or a low risk level for the maintenance personnel to view. In this embodiment, the trained text semantic model can analyze the risk level of the text to be detected so that corresponding processing measures can be taken, which improves the timeliness and reliability of processing the source end that generated the text to be detected.
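The branching on the risk level can be sketched as below. The integer risk scale, the default level threshold, the action names, and the returned dictionary are all hypothetical; delivery to a mailbox or device is left out and represented only by the returned record.

```python
from typing import Dict, Union


def handle_risk(
    source_id: str,
    risk_level: int,
    level_threshold: int = 2,
) -> Dict[str, Union[str, int]]:
    """Decide what the server outputs for a text from the given source end.

    A risk level of 0 means no risk tendency; levels above
    level_threshold trigger a risk early warning that carries the
    identifier of the source end.
    """
    if risk_level > level_threshold:
        action = "risk_warning"
    elif risk_level == 0:
        action = "no_risk_notice"
    else:
        action = "low_risk_notice"
    return {"action": action, "source_id": source_id, "risk_level": risk_level}


print(handle_risk("terminal-001", risk_level=3))
print(handle_risk("terminal-002", risk_level=0))
```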

In order to better implement the data evaluation method provided by the embodiment of the present application, the embodiment of the present application further provides a device based on the data evaluation method. Wherein the meanings of the nouns are the same as those in the data evaluation method, and the specific implementation details can refer to the description in the method embodiment.

Referring to fig. 7, fig. 7 is a schematic structural diagram of a data evaluation apparatus according to an embodiment of the present disclosure, where the data evaluation apparatus may include a first obtaining unit 501, a filtering unit 502, an analyzing unit 503, a second obtaining unit 504, a determining unit 505, and the like.

The first obtaining unit 501 is configured to obtain a data set to be evaluated.

The filtering unit 502 is configured to filter the data set to be evaluated according to a preset filtering policy to obtain a candidate data set.

An analyzing unit 503 for analyzing the candidate data set based on the evaluation dimension.

A second obtaining unit 504, configured to obtain an evaluation parameter of the data set to be evaluated according to the analysis result.

The determining unit 505 is configured to determine an evaluation result corresponding to the data set to be evaluated according to the evaluation parameter, so as to process the data according to the evaluation result.

In an embodiment, the evaluation dimension includes an anomaly detection evaluation, the evaluation parameter includes an anomaly rate, and the analysis unit 503 may specifically be configured to: performing word segmentation processing on each sample data in the candidate data set respectively to obtain a word set corresponding to each sample data; vectorizing the word set to obtain a word vector; projecting the word vectors to a sample feature space with preset dimensionality to obtain text feature vectors; clustering the text feature vectors, and determining abnormal data in the candidate data set based on a clustering result; the second obtaining unit 504 may specifically be configured to: and determining the abnormal rate of the data set to be evaluated according to the abnormal data.

In an embodiment, the analysis unit 503 may specifically be configured to: performing dimensionality reduction on the text feature vector to obtain a dimensionality-reduced feature vector; normalizing the feature vector subjected to dimension reduction to obtain a normalized feature vector; and clustering the normalized feature vectors, and determining abnormal data in the candidate data set based on a clustering result.

In an embodiment, the analysis unit 503 may specifically be configured to: discretizing the normalized feature vector into a plurality of feature points; selecting any one feature point from the plurality of feature points as a core point; assigning all feature points within a preset neighborhood range centered on the core point to the same class; selecting another feature point from the plurality of feature points as a core point, and returning to the operation of assigning all feature points within the preset neighborhood range centered on the core point to the same class, until the plurality of feature points have been traversed; and screening out the sample data corresponding to the feature points that are not assigned to any class to obtain the abnormal data.
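The neighborhood-based clustering above resembles density clustering in the style of DBSCAN. The following is a minimal sketch under that assumption: `eps` stands for the preset neighborhood range, while `min_pts` (the minimum number of points a neighborhood must contain for its center to count as a core point) is borrowed from DBSCAN and is not stated in this passage. Euclidean distance and the function name `find_anomalies` are likewise illustrative choices.

```python
import math


def find_anomalies(points, eps, min_pts):
    """Return indices of feature points left without a class after clustering."""
    n = len(points)
    labels = [None] * n   # None means "not assigned to any class"
    cluster = 0

    def neighbors(i):
        # All points (including i itself) within the neighborhood range of i.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            continue          # not a core point; a later core point may still claim it
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                # j is itself a core point, so its neighborhood joins the class.
                queue.extend(k for k in j_nbrs if labels[k] is None)
        cluster += 1

    # Points that never received a class are treated as abnormal data.
    return [i for i, label in enumerate(labels) if label is None]
```

For example, with three points packed near the origin and one point far away, only the distant point remains unassigned and is reported as abnormal.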

In an embodiment, the evaluation dimension includes an anomaly detection evaluation, the evaluation parameter includes an anomaly rate, and the analysis unit 503 may specifically be configured to: carrying out classification prediction on each sample data in the candidate data set through the trained classification model to obtain the classification probability corresponding to each sample data in the candidate data set; screening sample data with classification probability smaller than a preset threshold value as abnormal data; the second obtaining unit 504 may specifically be configured to: and determining the abnormal rate of the data set to be evaluated according to the screened abnormal data with the classification probability smaller than the preset threshold value.
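As a minimal sketch of the probability-based screening, the function below flags samples whose classification probability falls below a threshold. The function name, the default threshold of 0.5, and the choice to divide the abnormal count by the number of scored samples are assumptions made for illustration; the passage does not fix the denominator of the anomaly rate.

```python
def anomaly_rate_by_probability(probabilities, threshold=0.5):
    """Return (indices of abnormal samples, anomaly rate).

    probabilities[i] is the classification probability that the trained
    classification model assigns to sample i.
    """
    abnormal = [i for i, p in enumerate(probabilities) if p < threshold]
    # Assumed definition: abnormal samples divided by all scored samples.
    rate = len(abnormal) / len(probabilities) if probabilities else 0.0
    return abnormal, rate
```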

In one embodiment, the data evaluation apparatus may further include:

the third acquisition unit is used for acquiring a plurality of training samples, wherein the training samples comprise label samples and complementary label samples, the label samples are samples marked with real labels, and the complementary label samples are samples marked with labels complementary to the real labels;

the training unit is used for carrying out negative learning on the basis of the label samples and the complementary label samples through the initial classification model so as to train the initial classification model to obtain a candidate classification model and predict the sample classification probability of each training sample;

the screening unit is used for screening out the training samples with the sample classification probability larger than the target probability threshold value to obtain candidate training samples;

and the fine tuning unit is used for carrying out fine tuning training on the candidate classification model based on the candidate training sample to obtain the trained classification model.
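The passage does not spell out the loss used for negative learning. A common formulation, shown here as an assumption rather than as the method of the application, penalizes the probability assigned to the complementary label with -log(1 - p), in contrast to the ordinary cross-entropy -log(p) applied to a real label. The helper names below are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def positive_loss(logits, label):
    """Ordinary cross-entropy: raises the probability of the real label."""
    return -math.log(max(softmax(logits)[label], 1e-12))


def negative_loss(logits, complementary_label):
    """Negative-learning loss: lowers the probability of the complementary label."""
    p = softmax(logits)[complementary_label]
    return -math.log(max(1.0 - p, 1e-12))


def select_confident(sample_probs, target_threshold):
    """Keep indices of training samples whose probability exceeds the threshold."""
    return [i for i, p in enumerate(sample_probs) if p > target_threshold]
```

The samples returned by `select_confident` correspond to the candidate training samples used for the fine-tuning step; the threshold value itself is left to the caller.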

In an embodiment, the second obtaining unit 504 may specifically be configured to: acquiring a target characteristic vector corresponding to each sample data in the candidate data set; determining target abnormal data in the candidate data set based on the target feature vector; and determining the abnormal rate of the data set to be evaluated according to the target abnormal data and the screened abnormal data with the classification probability smaller than the preset threshold.
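One way to combine the two sources of abnormal data is to take their union, so that a sample flagged by either route is counted once. This is only an assumption: the passage says the anomaly rate is determined from both sets but does not say how they are merged. The function name and the use of sample indices are illustrative.

```python
def combined_anomaly_rate(feature_abnormal, low_prob_abnormal, total):
    """Anomaly rate from the union of two sets of abnormal sample indices.

    feature_abnormal:  indices flagged from the target feature vectors
    low_prob_abnormal: indices whose classification probability was too low
    total:             number of samples used as the denominator (assumed)
    """
    merged = set(feature_abnormal) | set(low_prob_abnormal)
    return len(merged) / total if total else 0.0
```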

In an embodiment, the evaluation dimension includes complexity evaluation, the evaluation parameter includes a complexity rate, and the analysis unit 503 may specifically be configured to: performing word segmentation processing on each sample data in the data set respectively to obtain a word set corresponding to each sample data; vectorizing the word set to obtain a vector matrix; performing a convolution operation on the vector matrix to obtain a plurality of one-dimensional vectors; performing a pooling operation on the plurality of one-dimensional vectors to obtain pooled vectors; splicing the pooled vectors to obtain a spliced vector; performing a full-connection operation on the spliced vector to obtain the label category probability of each sample data; the second obtaining unit 504 may specifically be configured to: determining the complexity rate of the data set to be evaluated according to the label category probability.
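The convolution, pooling, splicing, and full-connection steps can be sketched as a small forward pass in plain Python. Several details are assumptions: ReLU activation, max pooling, and a final softmax are common choices for text convolutional networks but are not stated in this passage, and the weights would normally come from training rather than being passed in by hand.

```python
import math


def conv_over_words(matrix, kernel, bias=0.0):
    """Slide an (h x d) kernel over an (n x d) vector matrix.

    Returns a one-dimensional vector of n - h + 1 ReLU activations.
    """
    h = len(kernel)
    out = []
    for start in range(len(matrix) - h + 1):
        acc = bias
        for r in range(h):
            acc += sum(w * x for w, x in zip(kernel[r], matrix[start + r]))
        out.append(max(acc, 0.0))
    return out


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def text_cnn_forward(matrix, kernels, fc_weights, fc_bias):
    """Return label category probabilities for one sample's vector matrix."""
    feature_maps = [conv_over_words(matrix, k) for k in kernels]  # convolution
    pooled = [max(fm) for fm in feature_maps]                     # max pooling (assumed)
    spliced = pooled                                              # one value per kernel, concatenated
    logits = [sum(w * x for w, x in zip(row, spliced)) + b
              for row, b in zip(fc_weights, fc_bias)]             # full connection
    return softmax(logits)
```

Each kernel height corresponds to an n-gram width, so using kernels of several heights yields the plurality of one-dimensional vectors mentioned above.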

In an embodiment, the evaluation dimension includes availability evaluation, the evaluation parameter includes availability ratio, and the analysis unit 503 may be specifically configured to: acquiring a first data volume of a data set to be evaluated and a second data volume of a candidate data set; the second obtaining unit 504 may specifically be configured to: and calculating the availability ratio of the data set to be evaluated according to the first data quantity and the second data quantity.
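A natural reading, taken here as an assumption, is that the availability rate is the share of the original data that survives filtering, that is, the second data volume divided by the first. The function name is hypothetical.

```python
def availability_rate(first_volume: int, second_volume: int) -> float:
    """Share of the data set to be evaluated that remains in the candidate set.

    first_volume:  number of samples in the data set to be evaluated
    second_volume: number of samples in the candidate data set after filtering
    """
    if first_volume <= 0:
        raise ValueError("the data set to be evaluated must not be empty")
    return second_volume / first_volume
```

For instance, if 850 of 1000 samples pass the filtering strategy, this definition gives an availability rate of 0.85.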

In an embodiment, the evaluation dimension includes consistency evaluation, the evaluation parameter includes a consistency rate, and the analysis unit 503 may specifically be configured to: performing duplicate item detection on the candidate data set to obtain a duplicate data set; extracting data with consistent labels from the duplicate data set; the second obtaining unit 504 may specifically be configured to: acquiring the ratio of the data volume of the data with consistent labels to the data volume of the data set to be evaluated; and determining the consistency rate of the data set to be evaluated according to the ratio.
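A minimal sketch of the consistency computation follows. It assumes that duplicates are samples with exactly the same text, that a duplicate group counts as consistent only when all of its labels agree, and that every member of a consistent group is counted; none of these details is fixed by the passage, and the function name is hypothetical.

```python
from collections import defaultdict


def consistency_rate(candidates, total_volume):
    """Ratio of consistently labelled duplicate samples to the full data volume.

    candidates:   iterable of (text, label) pairs from the candidate data set
    total_volume: number of samples in the data set to be evaluated
    """
    groups = defaultdict(list)
    for text, label in candidates:
        groups[text].append(label)
    # Only groups with more than one member are duplicates (exact match assumed).
    consistent = sum(len(labels) for labels in groups.values()
                     if len(labels) > 1 and len(set(labels)) == 1)
    return consistent / total_volume if total_volume else 0.0
```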

In an embodiment, the evaluation parameter includes at least one of an availability rate, a consistency rate, an anomaly rate, and a complexity rate, and the determining unit 505 may specifically be configured to: acquiring an accumulated value of the availability rate, the consistency rate, the anomaly rate, and the complexity rate, and determining the evaluation result corresponding to the data set to be evaluated according to the accumulated value; or setting a weight value for each of the availability rate, the consistency rate, the anomaly rate, and the complexity rate, performing a weighting operation on the availability rate, the consistency rate, the anomaly rate, and the complexity rate with their corresponding weight values to obtain a target value, and determining the evaluation result corresponding to the data set to be evaluated according to the target value.
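The two aggregation options can be sketched as follows. The dictionary keys and any weight values supplied by a caller are hypothetical; the passage does not prescribe specific weights or how the resulting value is mapped to an evaluation result.

```python
def accumulated_score(rates):
    """Option 1: plain sum of the evaluation parameters."""
    return sum(rates.values())


def weighted_score(rates, weights):
    """Option 2: weighted sum, pairing each parameter with its weight by name."""
    return sum(rates[name] * weights[name] for name in rates)
```

A caller would typically compare the returned value against a preset limit to decide whether the data quality is qualified.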

In one embodiment, the data evaluation apparatus may further include:

the processing unit is used for training the model to be trained by using the data set to be evaluated to obtain a trained model when the evaluation result corresponding to the data set to be evaluated is that the data quality is qualified, and processing the data through the trained model; and when the evaluation result corresponding to the data set to be evaluated is unqualified data quality, analyzing factors influencing the quality of the data set to be evaluated.

In the embodiment of the application, the first obtaining unit 501 may obtain a data set to be evaluated, the filtering unit 502 filters the data set to be evaluated according to a preset filtering strategy to obtain a candidate data set, the analyzing unit 503 may analyze the candidate data set based on an evaluation dimension, and the second obtaining unit 504 obtains an evaluation parameter of the data set to be evaluated according to an analysis result; at this time, the determining unit 505 may determine an evaluation result corresponding to the data set to be evaluated according to the evaluation parameter, so as to process the data according to the evaluation result. According to the scheme, the data set to be evaluated can be filtered, the candidate data set obtained through filtering is analyzed based on the evaluation dimension to obtain the evaluation parameter, the evaluation result corresponding to the data set to be evaluated is determined according to the evaluation parameter, manual review evaluation is avoided, and the efficiency, the accuracy and the reliability of data evaluation are improved.

An embodiment of the present application further provides a computer device, which may be a server or a terminal. Fig. 8 shows a schematic structural diagram of the computer device according to the embodiment of the present application. Specifically:

the computer device may include components such as a processor 601 of one or more processing cores, memory 602 of one or more computer-readable storage media, a power supply 603, and an input unit 604. Those skilled in the art will appreciate that the computer device configuration illustrated in FIG. 8 does not constitute a limitation of computer devices, and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. Wherein:

the processor 601 is a control center of the computer device, connects various parts of the whole computer device by using various interfaces and lines, and performs various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 602 and calling data stored in the memory 602, thereby monitoring the computer device as a whole. Optionally, processor 601 may include one or more processing cores; preferably, the processor 601 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 601.

The memory 602 may be used to store software programs and modules, and the processor 601 executes various functional applications and data processing by running the software programs and modules stored in the memory 602. The memory 602 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to use of the computer device, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 602 may also include a memory controller to provide the processor 601 with access to the memory 602.

The computer device further comprises a power supply 603 for supplying power to the various components, and preferably, the power supply 603 is logically connected to the processor 601 through a power management system, so that functions of managing charging, discharging, and power consumption are realized through the power management system. The power supply 603 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.

The computer device may also include an input unit 604, the input unit 604 being operable to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

Although not shown, the computer device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 601 in the computer device loads the executable file corresponding to the process of one or more application programs into the memory 602 according to the following instructions, and the processor 601 runs the application programs stored in the memory 602, thereby implementing various functions as follows:

acquiring a data set to be evaluated; filtering the data set to be evaluated according to a preset filtering strategy to obtain a candidate data set; analyzing the candidate data set based on the evaluation dimension; obtaining evaluation parameters of the data set to be evaluated according to the analysis result; and determining an evaluation result corresponding to the data set to be evaluated according to the evaluation parameters, and processing the data according to the evaluation result.
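The five functions above can be read as a simple pipeline. The sketch below wires them together with caller-supplied callables; the names `evaluate_dataset`, `keep`, `analyzers`, and `decide`, the sample data, and the 0.5 cut-off in the usage example are all hypothetical and only illustrate the order of the steps.

```python
from typing import Callable, Dict, List, Tuple

Sample = str
Analyzer = Callable[[List[Sample], List[Sample]], float]


def evaluate_dataset(
    raw: List[Sample],
    keep: Callable[[Sample], bool],
    analyzers: Dict[str, Analyzer],
    decide: Callable[[Dict[str, float]], str],
) -> Tuple[str, Dict[str, float]]:
    # Filter the data set to be evaluated with the preset filtering strategy.
    candidates = [s for s in raw if keep(s)]
    # Analyze per evaluation dimension and collect one parameter per dimension.
    parameters = {name: analyze(raw, candidates)
                  for name, analyze in analyzers.items()}
    # Determine the evaluation result from the evaluation parameters.
    return decide(parameters), parameters


# Hypothetical usage: drop blank samples and evaluate availability only.
result, params = evaluate_dataset(
    raw=["hello", "", "refund please", "   "],
    keep=lambda s: bool(s.strip()),
    analyzers={"availability": lambda raw, cand: len(cand) / len(raw)},
    decide=lambda p: "qualified" if p["availability"] >= 0.5 else "unqualified",
)
```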

In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and parts that are not described in detail in a certain embodiment may refer to the above detailed description of the data evaluation method, and are not described herein again.

According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations of the above embodiments.

It will be understood by those skilled in the art that all or part of the steps of the methods of the embodiments described above may be performed by computer instructions, or by computer instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor. To this end, the present application provides a storage medium, in which a computer program is stored, where the computer program may include computer instructions, and the computer program can be loaded by a processor to execute any one of the data evaluation methods provided by the present application.

The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.

The storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and the like.

Since the instructions stored in the storage medium can execute the steps in any data evaluation method provided in the embodiments of the present application, the beneficial effects that can be achieved by any data evaluation method provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.

The data evaluation method, apparatus, computer device, and storage medium provided in the embodiments of the present application are described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method and core ideas of the present application. Meanwhile, those skilled in the art may, according to the ideas of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims

1. A data evaluation method, characterized by comprising:

acquiring a data set to be evaluated;

filtering the data set to be evaluated according to a preset filtering strategy to obtain a candidate data set;

analyzing the candidate data set based on an evaluation dimension;

obtaining an evaluation parameter of the data set to be evaluated according to the analysis result; and

determining an evaluation result corresponding to the data set to be evaluated according to the evaluation parameter, so as to process data according to the evaluation result.

2. The data evaluation method according to claim 1, wherein the evaluation dimension includes anomaly detection evaluation, the evaluation parameter includes an anomaly rate, and the analyzing the candidate data set based on the evaluation dimension and obtaining the evaluation parameter of the data set to be evaluated according to the analysis result comprises:

Perform word segmentation processing on each sample data in the candidate data set to obtain a word set corresponding to each sample data;

vectorizing the word set to obtain word vectors;

Projecting the word vector to a sample feature space of a preset dimension to obtain a text feature vector;

Clustering the text feature vectors, and determining abnormal data in the candidate data set based on the clustering results;

The abnormality rate of the data set to be evaluated is determined according to the abnormal data.

3. The data evaluation method according to claim 2, wherein the performing clustering on the text feature vector and determining the abnormal data in the candidate data set based on the clustering result comprises:

Dimensionality reduction is performed on the text feature vector to obtain a feature vector after dimensionality reduction;

normalizing the dimension-reduced feature vector to obtain a normalized feature vector;

The normalized feature vectors are clustered, and abnormal data in the candidate data set is determined based on the clustering result.

4. The data evaluation method according to claim 3, wherein the clustering of the normalized feature vectors, and determining the abnormal data in the candidate data set based on the clustering result comprises:

discretizing the normalized feature vector into a plurality of feature points;

Select any one feature point from the plurality of feature points as the core point;

assigning all feature points within a preset neighborhood range centered on the core point to the same class;

selecting another feature point from the plurality of feature points as the core point, and returning to the operation of assigning all feature points within the preset neighborhood range centered on the core point to the same class, until the plurality of feature points have been traversed;

screening out the sample data corresponding to the feature points that are not assigned to any class to obtain the abnormal data.

5. The data evaluation method according to claim 1, wherein the evaluation dimension includes anomaly detection evaluation, the evaluation parameter includes an anomaly rate, and the analyzing the candidate data set based on the evaluation dimension and obtaining the evaluation parameter of the data set to be evaluated according to the analysis result comprises:

Classify and predict each sample data in the candidate data set by using the trained classification model to obtain the classification probability corresponding to each sample data in the candidate data set;

Filter out sample data whose classification probability is less than a preset threshold as abnormal data;

Determine the abnormality rate of the to-be-evaluated data set according to the filtered abnormal data whose classification probability is less than a preset threshold.

6. The data evaluation method according to claim 5, wherein the data evaluation method further comprises:

Acquiring a plurality of training samples, the plurality of training samples include a label sample and a complementary label sample, the label sample is a sample marked with a real label, and the complementary label sample is a sample marked with a label complementary to the real label;

Perform negative learning based on the label samples and the complementary label samples through the initial classification model, so as to train the initial classification model, obtain candidate classification models, and predict the sample classification probability of each training sample;

Filter out the training samples whose classification probability is greater than the target probability threshold to obtain candidate training samples;

Perform fine-tuning training on the candidate classification model based on the candidate training samples to obtain a trained classification model.

7. The data evaluation method according to claim 5, wherein the determining the anomaly rate of the data set to be evaluated according to the screened abnormal data whose classification probability is less than the preset threshold comprises:

Obtain the target feature vector corresponding to each sample data in the candidate data set;

determining target abnormal data in the candidate data set based on the target feature vector;

The abnormality rate of the data set to be evaluated is determined according to the target abnormal data and the filtered abnormal data whose classification probability is less than a preset threshold.

8. The data evaluation method according to claim 1, wherein the evaluation dimension includes complexity evaluation, the evaluation parameter includes a complexity rate, and the analyzing the candidate data set based on the evaluation dimension and obtaining the evaluation parameter of the data set to be evaluated according to the analysis result comprises:

Perform word segmentation processing on each sample data in the data set to obtain a word set corresponding to each sample data;

Perform vectorization processing on the word set to obtain a vector matrix;

Perform a convolution operation on the vector matrix to obtain a plurality of one-dimensional vectors;

performing a pooling operation on the plurality of one-dimensional vectors to obtain a pooled vector;

Perform splicing processing on the pooled vector to obtain a spliced vector;

Perform a full connection operation on the spliced vector to obtain the label category probability of each sample data;

The complexity rate of the data set to be evaluated is determined according to the label category probability.

9. The data evaluation method according to claim 1, wherein the evaluation dimension includes availability evaluation, the evaluation parameter includes an availability rate, and the analyzing the candidate data set based on the evaluation dimension and obtaining the evaluation parameter of the data set to be evaluated according to the analysis result comprises:

obtaining the first data volume of the data set to be evaluated and the second data volume of the candidate data set;

The availability rate of the data set to be evaluated is calculated according to the first data amount and the second data amount.

10. The data evaluation method according to claim 1, wherein the evaluation dimension includes consistency evaluation, the evaluation parameter includes a consistency rate, and the analyzing the candidate data set based on the evaluation dimension and obtaining the evaluation parameter of the data set to be evaluated according to the analysis result comprises:

performing duplicate item detection on the candidate data set to obtain a duplicate data set;

extracting data with consistent labels from the repeated datasets;

Obtain the ratio of the data volume of the data with the same label to the data volume in the data set to be evaluated;

The consistency rate of the data set to be evaluated is determined according to the ratio.

11. The data evaluation method according to claim 1, wherein the evaluation parameter includes at least one of an availability rate, a consistency rate, an anomaly rate, and a complexity rate, and the determining the evaluation result corresponding to the data set to be evaluated according to the evaluation parameter comprises:

acquiring an accumulated value of the availability rate, the consistency rate, the anomaly rate, and the complexity rate, and determining the evaluation result corresponding to the data set to be evaluated according to the accumulated value; or,

setting a weight value for each of the availability rate, the consistency rate, the anomaly rate, and the complexity rate, performing a weighting operation on the availability rate, the consistency rate, the anomaly rate, and the complexity rate with their corresponding weight values to obtain a target value, and determining the evaluation result corresponding to the data set to be evaluated according to the target value.

12. The data evaluation method according to any one of claims 1 to 11, wherein after the evaluation result corresponding to the data set to be evaluated is determined according to the evaluation parameter, the data evaluation method further comprises:

When the evaluation result corresponding to the to-be-evaluated data set is that the data quality is qualified, use the to-be-evaluated data set to train the to-be-trained model to obtain a post-training model, so as to process the data through the post-training model;

When the evaluation result corresponding to the data set to be evaluated is unqualified data quality, analyze the factors affecting the quality of the data set to be evaluated.

13. A data evaluation device, comprising:

a first acquisition unit, used to acquire the data set to be evaluated;

a filtering unit, configured to filter the to-be-evaluated data set according to a preset filtering strategy to obtain a candidate data set;

an analysis unit, configured to analyze the candidate data set based on the evaluation dimension;

a second obtaining unit, configured to obtain the evaluation parameters of the data set to be evaluated according to the analysis result;

A determination unit, configured to determine an evaluation result corresponding to the data set to be evaluated according to the evaluation parameter, so as to process the data according to the evaluation result.

14. A computer device, characterized by comprising a processor and a memory, wherein a computer program is stored in the memory, and the processor executes the data evaluation method according to any one of claims 1 to 12 when calling the computer program in the memory.

15. A storage medium, characterized in that the storage medium is used for storing a computer program, and the computer program is loaded by a processor to execute the data evaluation method according to any one of claims 1 to 12.