CN115577285A

CN115577285A - Training set processing method, device, electronic device and storage medium for classification

Info

Publication number: CN115577285A
Application number: CN202211188253.7A
Authority: CN
Inventors: 罗欢; 李辛; 戴辰玥; 侯元春
Original assignee: Shanghai Ximalaya Technology Co Ltd
Current assignee: Shanghai Ximalaya Technology Co Ltd
Priority date: 2022-09-28
Filing date: 2022-09-28
Publication date: 2023-01-06

Abstract

The invention provides a training set processing method and device for classification, an electronic device, and a storage medium. The method comprises: obtaining a classification sample set; determining the predicted classifications corresponding to each sample, the probability of each predicted classification, and a disambiguation score corresponding to each sample; for each sample, determining the most similar sample corresponding to the sample from among the samples having the target predicted classification corresponding to the sample; determining, based on the disambiguation scores, the most similar predicted classification corresponding to each classification label; and, based on the most similar predicted classification, determining from all samples having the classification label a first target sample whose predicted classification is the most similar predicted classification, and determining from all samples having the most similar predicted classification a second target sample whose predicted classification is the classification label. By analyzing the first target sample, the second target sample, and the most similar sample corresponding to each sample, defects such as category confusion or sample labeling errors can be identified quickly, which improves the speed of sample cleaning.

Description

Training set processing method, device, electronic device and storage medium for classification

Technical Field

The present invention relates to the technical field of text classification, and in particular to a training set processing method, device, electronic device, and storage medium for classification.

Background

Classification tasks play an important role in many scenarios, such as intent recognition in intelligent customer service and the recognition of flowers, plants, and animals in nature. Besides relying on classification model technology, classification tasks also depend heavily on the classification corpus: the classification effect is closely related to the data quality and the number of samples of the training set. When the labeled samples of the training set contain a large number of labeling errors, the performance of the algorithm suffers and the classification effect is seriously degraded.

Summary of the Invention

One of the objectives of the present invention is to provide a training set processing method, device, electronic device, and storage medium for classification, which can quickly perform data cleaning on a training set and identify samples or categories that are mislabeled or confusingly labeled. Embodiments of the present invention can be implemented as follows:

In a first aspect, the present invention provides a training set processing method for classification, the method comprising:

obtaining a classification sample set, wherein the classification sample set contains a plurality of samples and a classification label corresponding to each sample, and each kind of classification label corresponds to at least one sample;

determining the predicted classifications corresponding to each sample, the probability of each predicted classification, and a disambiguation score corresponding to each sample, wherein the disambiguation score characterizes the degree of error of the classification label corresponding to the sample;

for each sample, determining, from among the samples having the target predicted classification corresponding to the sample, the most similar sample corresponding to the sample, wherein the target predicted classification is the predicted classification with the largest probability;

determining, based on the disambiguation scores, the most similar predicted classification corresponding to each classification label; and, based on the most similar predicted classification, determining from all samples having the classification label a first target sample whose predicted classification is the most similar predicted classification, and determining from all samples having the most similar predicted classification a second target sample whose predicted classification is the classification label;

wherein the first target sample and the second target sample are used to indicate that a merging strategy is to be performed on the classification label and the most similar predicted classification, or that a correction strategy is to be performed on the first target sample and/or the second target sample; and the most similar sample is used to indicate that the sample is to be corrected.

In a second aspect, the present invention provides a training set processing device for classification, comprising:

an acquisition module, configured to obtain a classification sample set, wherein the classification sample set contains a plurality of samples and a classification label corresponding to each sample, and each kind of classification label corresponds to at least one sample;

a determining module, configured to determine the predicted classifications corresponding to each sample, the probability of each predicted classification, and a disambiguation score corresponding to each sample, wherein the disambiguation score characterizes the error score of the classification label corresponding to the sample;

the determining module being further configured to, for each sample, determine, from among the samples having the target predicted classification corresponding to the sample, the most similar sample corresponding to the sample, wherein the target predicted classification is the predicted classification with the largest probability;

the determining module being further configured to determine, based on the disambiguation scores, the most similar predicted classification corresponding to each classification label, and, based on the most similar predicted classification, to determine from all samples having the classification label a first target sample whose predicted classification is the most similar predicted classification, and to determine from all samples having the most similar predicted classification a second target sample whose predicted classification is the classification label.

In a third aspect, the present invention provides an electronic device comprising a processor and a memory, wherein the memory stores a computer program executable by the processor, and the processor can execute the computer program to implement the training set processing method for classification provided in the first aspect.

In a fourth aspect, the present invention provides a readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the training set processing method for classification provided in the first aspect.

In the training set processing method, device, electronic device, and storage medium for classification provided by the present invention, the method comprises first obtaining a classification sample set and then predicting the category of each sample and the probability of each category. Since each sample has a correct classification label, for each category it is possible to determine, from the prediction results of the samples having that category, the predicted classification that conflicts most with the category and the first target samples corresponding to that predicted classification, and to determine, from the samples having that predicted classification, the second target samples whose prediction result is the category. By analyzing the first target samples and the second target samples, it can be determined whether categories need to be merged or whether samples need to be adjusted. The most similar sample corresponding to each sample can also be determined, and from it whether the category of the sample needs to be adjusted. The whole process can quickly identify labeling errors and labeling confusion, which improves the speed of sample cleaning.

Brief Description of the Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present invention and therefore should not be regarded as limiting its scope. A person of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.

FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present invention;

FIG. 2 is a structural block diagram of an electronic device provided by an embodiment of the present invention;

FIG. 3 is a schematic flowchart of a training set processing method for classification provided by an embodiment of the present invention;

FIG. 4 is a schematic diagram of a text sample set in an embodiment of the present invention;

FIG. 5 is a schematic diagram of predicted classifications provided by an embodiment of the present invention;

FIG. 6 is a schematic flowchart of step S302 in an embodiment of the present invention;

FIG. 7 is a schematic diagram of the sample knowledge corresponding to each sample provided by an embodiment of the present invention;

FIG. 8 is a schematic flowchart of step S304 provided by an embodiment of the present invention;

FIG. 9 is a schematic diagram of the analysis of category disambiguation results provided by an embodiment of the present invention;

FIG. 10 is a functional block diagram of a training set processing device for classification provided by an embodiment of the present invention.

Detailed Description

To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the drawings herein may be arranged and designed in a variety of different configurations.

Accordingly, the following detailed description of the embodiments of the present invention provided in the drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

It should be noted that similar reference numerals and letters denote similar items in the following drawings. Therefore, once an item is defined in one drawing, it need not be further defined or explained in subsequent drawings.

In the description of the present invention, it should be noted that where terms such as "upper", "lower", "inner", and "outer" appear, the orientation or positional relationship they indicate is based on the orientation or positional relationship shown in the drawings, or on the orientation or positional relationship in which the product of the invention is customarily placed in use. Such terms are used only to facilitate and simplify the description of the present invention, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation. They therefore cannot be construed as limiting the present invention.

In addition, where terms such as "first" and "second" appear, they are used only to distinguish between descriptions and cannot be understood as indicating or implying relative importance.

It should be noted that, where no conflict arises, the features in the embodiments of the present invention may be combined with each other.

Classification tasks play an important role in many scenarios, such as intent recognition in intelligent customer service and the recognition of flowers, plants, and animals in nature. Taking a text classification task as an example, refer to FIG. 1, which is a schematic diagram of an application scenario provided by an embodiment of the present invention. The scenario includes a terminal device 11 and a server 12. Various application programs may be installed on the terminal device 11. After a communication connection is established between an application program of the terminal device 11 and the server 12 through a communication network, the client on the terminal device 11 can send the text to be recognized to the server 12; the server 12 classifies the text to obtain a classification result and then sends the classification result to the application program of the terminal device 11.

The terminal device 11 may be, but is not limited to, a computer device with an information collection function, such as a personal computer, a notebook computer, a smartphone, a tablet computer, or a smart wearable device.

The server 12 may be implemented as an independent server, or as a server cluster or distributed system composed of multiple servers. It may also be a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. The terminal device 11 and the server 12 may be connected directly or indirectly through wired or wireless communication, which is not limited in this application.

The terminal device 11 and the server 12 may communicate through a communication network. The communication network may be a short-range communication network such as a wireless fidelity (Wi-Fi) hotspot network, a Bluetooth (BT) network, or a near field communication (NFC) network; it may also be a third-generation mobile communication technology (3G) network, a fourth-generation mobile communication technology (4G) network, a fifth-generation mobile communication technology (5G) network, a future evolved public land mobile network (PLMN), or the Internet.

Continuing to refer to FIG. 1, at present, before a classification task is performed, a classification model must first be trained on a pre-labeled training set, and the trained classification model is then used to perform classification. The classification effect is therefore closely tied to the data quality of the training set. When the labeled samples of the training set contain errors, for example when some samples in the training set are mislabeled or some categories conflict with each other, the performance of the classification model suffers and the classification effect is seriously degraded.

Therefore, an embodiment of the present invention provides a training set processing method that first performs data cleaning on the training set to achieve knowledge disambiguation. Here, "disambiguation" means eliminating sample labeling errors and eliminating category confusion. The results are fed back to operations staff, who adjust the samples or the categories, thereby pre-cleaning the training set.

First, refer to FIG. 2, which is a structural block diagram of an electronic device provided by an embodiment of the present invention. The electronic device can be used to execute the training set processing method for classification provided by the embodiments of the present invention.

As shown in FIG. 2, the electronic device 200 includes a memory 201, a processor 202, and a communication interface 203, which are electrically connected to one another, directly or indirectly, to enable data transmission or interaction. For example, these elements may be electrically connected to one another through one or more communication buses or signal lines.

The memory 201 can be used to store software programs and modules, such as the instructions/modules of the training set processing device 400 for classification provided by the embodiments of the present invention, which may be stored in the memory 201 in the form of software or firmware, or built into the operating system (OS) of the electronic device 200. The processor 202 executes the software programs and modules stored in the memory 201 to perform various functional applications and data processing. The communication interface 203 can be used for signaling or data communication with other node devices.

The memory 201 may be, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), and the like.

The processor 202 may be an integrated circuit chip with signal processing capability. The processor 202 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

It can be understood that the structure shown in FIG. 2 is only illustrative. The electronic device 200 may include more or fewer components than those shown in FIG. 2, or have a configuration different from that shown in FIG. 2. Each component shown in FIG. 2 may be implemented in hardware, software, or a combination thereof.

Taking the electronic device shown in FIG. 2 as the executing entity, the training set processing method for classification provided by the embodiments of the present invention is described in detail below. Refer to FIG. 3, which is a schematic flowchart of the training set processing method for classification provided by an embodiment of the present invention. The method includes:

S301: Obtain a classification sample set. The classification sample set contains a plurality of samples and a classification label corresponding to each sample; each kind of classification label corresponds to at least one sample.

S302: Determine the predicted classifications corresponding to each sample, the probability of each predicted classification, and a disambiguation score corresponding to each sample, where the disambiguation score characterizes the degree of error of the classification label corresponding to the sample.

S303: For each sample, determine, from among the samples having the target predicted classification corresponding to the sample, the most similar sample corresponding to the sample, where the target predicted classification is the predicted classification with the largest probability.

S304: Based on the disambiguation scores, determine the most similar predicted classification corresponding to each classification label; and, based on the most similar predicted classification, determine from all samples having the classification label a first target sample whose predicted classification is the most similar predicted classification, and determine from all samples having the most similar predicted classification a second target sample whose predicted classification is the classification label.

The first target sample and the second target sample are used to indicate that a merging strategy is to be performed on the classification label and the most similar predicted classification, or that a correction strategy is to be performed on the first target sample and/or the second target sample. The most similar sample is used to indicate that the sample is to be corrected.

In the above training set processing method, a classification sample set is first obtained, and the category of each sample and the probability of each category are then predicted. Since each sample has a correct classification label, for each category it is possible to determine, from the prediction results of the samples having that category, the predicted classification that conflicts most with the category and the first target samples corresponding to that predicted classification, and to determine, from the samples having that predicted classification, the second target samples whose prediction result is the category. By analyzing the first target samples and the second target samples, it can be determined whether categories need to be merged or whether samples need to be adjusted. The most similar sample corresponding to each sample can also be determined, and from it whether the category of the sample needs to be adjusted. The whole process can quickly identify labeling errors and labeling confusion, which improves the speed of sample cleaning.
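As a rough illustration of how the first and second target samples of step S304 might be selected once the most similar predicted classification of a label is known, consider the following Python sketch. The dictionary keys "classify" (the manual label, following FIG. 4) and "predicted" (the predicted classification with the largest probability), the function name, and the reading of "samples having the most similar predicted classification" as samples labeled with that classification are all assumptions made for illustration; the embodiment does not specify them.

```python
from typing import Dict, List, Tuple

Sample = Dict[str, str]


def select_target_samples(
    samples: List[Sample], label: str, most_similar_class: str
) -> Tuple[List[Sample], List[Sample]]:
    """Return (first_targets, second_targets) for one classification label.

    first_targets:  labeled `label`, but predicted as `most_similar_class`.
    second_targets: labeled `most_similar_class`, but predicted as `label`
                    (assumed interpretation, see the lead-in above).
    """
    first_targets = [
        s for s in samples
        if s["classify"] == label and s["predicted"] == most_similar_class
    ]
    second_targets = [
        s for s in samples
        if s["classify"] == most_similar_class and s["predicted"] == label
    ]
    return first_targets, second_targets
```

If both lists are large relative to the two categories, an operator might lean toward merging the categories; if only a few samples appear, correcting those samples is probably the better fit. This heuristic is a suggestion, not a rule stated in the source.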

Steps S301 to S304 are described in detail below.

In step S301, a classification sample set is obtained.

In this embodiment, the classification sample set contains a plurality of samples and a classification label corresponding to each sample; each kind of classification label corresponds to at least one sample.

As an optional implementation, the samples in the classification sample set may be corpus entries, text, images, and the like, which is not limited here.

Taking text samples as an example, refer to FIG. 4, which is a schematic diagram of a text sample set in an embodiment of the present invention. Here, sentence is the sample, classify is the classification label corresponding to the sample, and split_num is the sample identifier. In the embodiments of the present invention, this sample split identifier serves as the data basis for the subsequent cross-training on the classification sample set.

As an optional implementation, data may be crawled from the backend of application software, or the user's behavior in the application software may be recorded in real time and the text entered by the user extracted, to form the classification sample set. The application software may be, but is not limited to, a media application, a game application, a shopping application, and so on, which is not limited here.

It can be understood that the classification labels are annotated manually. The classification label of a sample may match the actual category of the sample, in which case the label is correct, or it may not match the actual category, in which case the label is incorrect. The embodiments of the present invention can help operations staff identify mislabeled samples and easily confused categories.

In step S302, the predicted classifications corresponding to each sample, the probability of each predicted classification, and the disambiguation score corresponding to each sample are determined.

In this embodiment, each sample can be predicted by the classification model of the embodiments of the present invention to obtain the predicted classifications corresponding to the sample and the probability of each predicted classification. It can be understood that the probability refers to the probability that the predicted classification is the classification label corresponding to the sample.

The disambiguation score can be used to roughly estimate the degree of error of the classification label corresponding to a sample. In this embodiment, the disambiguation score is the larger of the maximum misclassification probability and the confusion probability. The maximum misclassification probability is the probability of the predicted classification that differs from the classification label and has the largest probability; the confusion probability is the difference between the two largest probabilities among all the probabilities. By deriving the disambiguation score of a sample from these two indicators, the embodiments of the present invention address both labeling errors and confusing labels, and provide a data basis for subsequently identifying whether the samples contain labeling errors or labeling confusion.
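The score described above can be sketched in a few lines of Python. This is a minimal illustration that follows the text literally (the larger of the maximum misclassification probability and the difference between the two largest probabilities); the function name, the dictionary input format, and the requirement of at least two predicted classifications are assumptions made here and are not part of the embodiment.

```python
from typing import Dict


def disambiguation_score(probs: Dict[str, float], label: str) -> float:
    """Compute a disambiguation score for one sample.

    probs: predicted classification -> probability (hypothetical format).
    label: the manually annotated classification label of the sample.
    """
    if len(probs) < 2:
        raise ValueError("at least two predicted classifications are required")

    # Maximum misclassification probability: the largest probability among
    # predicted classifications that differ from the annotated label.
    max_misclassification = max(p for c, p in probs.items() if c != label)

    # Confusion probability: difference between the two largest probabilities.
    top_two = sorted(probs.values(), reverse=True)[:2]
    confusion = top_two[0] - top_two[1]

    # The score is the larger of the two indicators.
    return max(max_misclassification, confusion)
```

For example, calling disambiguation_score({"a": 0.5, "b": 0.25, "c": 0.25}, label) with label "b" is dominated by the misclassification term, because the model assigns its highest probability to a class other than the annotated one.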

参见图5，图5为本发明实施例提供的预测分类的示意图，在图5中可以看到一个句子对应的TOP2预测分类以及概率，在实际实施场景中，选择展示的预测分类的数量可以自定义，不仅限于展示概率为TOP2预测分类，通过上述方式可以得到全部样本的预测结果。Referring to Fig. 5, Fig. 5 is a schematic diagram of the prediction classification provided by the embodiment of the present invention. In Fig. 5, the TOP2 prediction classification and probability corresponding to a sentence can be seen. The definition is not limited to the display probability for the TOP2 prediction classification, and the prediction results of all samples can be obtained through the above method.

Therefore, this embodiment also provides an implementation of step S302. Referring to FIG. 6, FIG. 6 is a schematic flowchart of step S302 in an embodiment of the present invention:

S302-1: Using the trained classification model, obtain the predicted classifications of each sample and the probability of each predicted classification.

In this embodiment, the classification model can be trained by any existing training method.

As an optional implementation, the classification model in this embodiment may be trained through the following steps:

a1: According to the sample identifier corresponding to each sample, determine, from the training set, a plurality of sample subsets and a prediction sample set corresponding to each sample subset.

The union of each sample subset and the prediction sample set corresponding to that sample subset is the classification sample set. Within one sample subset, the sample identifier of each sample yields the same remainder when divided by the number of sample subsets, and this remainder differs from the remainder corresponding to each sample in the prediction sample set.

In this embodiment, the sample identifier of each sample is divided by the number of sample subsets to obtain the remainder corresponding to the sample. Samples with the same remainder then form a sample subset, and samples with the other remainder form the prediction sample set corresponding to that sample subset.

For ease of understanding, assume that the number of sample subsets is 3 and that the remainders 0, 1 and 2 are used to divide the classification sample set into 3 sample subsets. The samples whose "split_num" leaves a remainder of 1 or 2 when divided by 3 form the first sample subset, denoted S_{1,2}, and the samples with a remainder of 0 serve as the prediction sample set of S_{1,2}. The samples with a remainder of 0 or 1 form the second sample subset, denoted S_{0,1}, and the samples with a remainder of 2 serve as the prediction sample set of S_{0,1}. The samples with a remainder of 0 or 2 form the third sample subset, denoted S_{0,2}, and the samples with a remainder of 1 serve as the prediction sample set of S_{0,2}.

That is, in an actual implementation, the number N of sample subsets can be determined first, followed by N different remainders matching that number. When the sample subsets are determined, the samples having N-1 of the remainders form a sample subset and the samples having the one remaining remainder form the prediction sample set of that sample subset, and so on, yielding a plurality of sample subsets and the prediction sample set corresponding to each sample subset.
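The remainder-based split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: sample identifiers are taken to be non-negative integers, and the function name `split_by_remainder` is hypothetical and does not come from the patent.

```python
from collections import defaultdict


def split_by_remainder(sample_ids, n_subsets):
    """Return one (subset_ids, prediction_ids) pair per remainder.

    For each remainder r in 0..n_subsets-1, the prediction set holds the
    samples whose id % n_subsets == r, and the sample subset holds the
    samples with the other n_subsets-1 remainders.
    """
    by_remainder = defaultdict(list)
    for sid in sample_ids:
        by_remainder[sid % n_subsets].append(sid)

    pairs = []
    for held_out in range(n_subsets):
        prediction_ids = list(by_remainder[held_out])
        subset_ids = [
            sid
            for r in range(n_subsets)
            if r != held_out
            for sid in by_remainder[r]
        ]
        pairs.append((subset_ids, prediction_ids))
    return pairs


if __name__ == "__main__":
    for subset_ids, prediction_ids in split_by_remainder(range(7), 3):
        print("subset:", subset_ids, "prediction set:", prediction_ids)
```

With N = 3 this reproduces the pairing in the example: the subset built from remainders 1 and 2 is paired with the prediction set of remainder 0, and so on for the other two remainders.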

a2: Train the classification model with each sample subset in turn, and use the trained classification model to predict the prediction sample set corresponding to that sample subset, obtaining the predicted classifications of each prediction sample in the prediction sample set and the probability of each predicted classification.
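Steps a1 and a2 together amount to predicting every sample with a model that was not trained on it. The sketch below shows that loop in Python. The patent does not prescribe a model, so `train_prior_model` is a deliberately trivial, hypothetical stand-in that only returns the label frequencies of its training data; the tuple layout `(sample_id, text, label)` and both function names are likewise assumptions made for illustration.

```python
from collections import Counter


def train_prior_model(train_samples):
    """Toy stand-in for a real classifier (assumption, not the patent's model).

    It ignores the text and predicts the label frequencies seen in training.
    """
    counts = Counter(label for _, _, label in train_samples)
    total = sum(counts.values())
    probs = {label: count / total for label, count in counts.items()}
    return lambda text: dict(probs)


def out_of_fold_probabilities(samples, n_subsets, train_fn):
    """samples: list of (sample_id, text, label) with integer ids.

    Returns {sample_id: {class_name: probability}}, where each sample is
    predicted by a model trained only on the other remainders.
    """
    predictions = {}
    for held_out in range(n_subsets):
        subset = [s for s in samples if s[0] % n_subsets != held_out]
        prediction_set = [s for s in samples if s[0] % n_subsets == held_out]
        if not subset or not prediction_set:
            continue  # nothing to train on or nothing to predict
        model = train_fn(subset)
        for sample_id, text, _ in prediction_set:
            predictions[sample_id] = model(text)
    return predictions
```

In practice `train_fn` would wrap whatever text classifier is actually used; the only requirement of this sketch is that it returns a callable mapping a text to a dictionary of class probabilities.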

S302-2: For each sample, determine the maximum misclassification probability and the confusion probability corresponding to the sample based on the result of comparing the predicted classifications with the classification label corresponding to the sample.

The purpose of comparing the predicted classifications with the classification label corresponding to the sample is to determine whether a predicted classification matches the classification label; the outcome of this comparison determines how the maximum misclassification probability is subsequently obtained.

In this embodiment, the maximum misclassification probability is the largest probability assigned to a wrong category, that is, a category different from the classification label. For example, for a class A sample whose predicted classifications are class A, class B and class C, both class B and class C can be regarded as misclassifications of the sample, and the probability of the most probable misclassification is taken as the maximum misclassification probability. This is the probability of class B when class B is more probable than class C.

The confusion probability reflects the degree of conflict among the predicted classifications of a sample. Continuing the above example, if the probabilities of class A, class B and class C satisfy P_A > P_B > P_C, the confusion probability is the difference between the probabilities of class A and class B, and it can be used to measure how easily class A and class B are confused.

Therefore, an embodiment of the present invention provides an implementation of step S302-2, namely:

b1: Sort the probabilities in descending order, determine the first maximum probability and the second maximum probability, and take the difference between the first maximum probability and the second maximum probability as the confusion probability.

b2: If the predicted classification corresponding to the first maximum probability is consistent with the classification label, take the second maximum probability as the maximum misclassification probability.

b3: If the predicted classification corresponding to the first maximum probability is inconsistent with the classification label, take the first maximum probability as the maximum misclassification probability.

S302-3: Determine the larger of the maximum misclassification probability and the confusion probability as the disambiguation score corresponding to the sample.
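Steps b1 to b3 and S302-3 translate directly into a small function. The following is a minimal sketch, assuming the model output for one sample is available as a dictionary of at least two class probabilities; the function name and the returned tuple layout are illustrative choices, not part of the patent.

```python
def disambiguation_score(probs, label):
    """Compute the disambiguation score of one sample.

    probs: {class_name: probability} with at least two classes.
    label: the annotated classification label of the sample.
    Returns (score, max_misclassification_probability, confusion_probability).
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    (top_class, first_max), (_, second_max) = ranked[0], ranked[1]

    # b1: confusion probability is the gap between the two largest probabilities.
    confusion = first_max - second_max

    # b2 / b3: pick the largest probability that belongs to a wrong class.
    max_misclassification = second_max if top_class == label else first_max

    # S302-3: the score is the larger of the two indicators.
    return max(max_misclassification, confusion), max_misclassification, confusion


if __name__ == "__main__":
    probs = {"A": 0.7, "B": 0.2, "C": 0.1}
    print(disambiguation_score(probs, "A"))
    print(disambiguation_score(probs, "B"))
```

For the probabilities in the demo, a sample labeled A has a maximum misclassification probability equal to the probability of B, whereas a sample labeled B has one equal to the probability of A, because the top prediction disagrees with its label.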

With the maximum misclassification probability and the confusion probability obtained above, the disambiguation score corresponding to each sample can be determined, and the category disambiguation analysis results and the sample disambiguation analysis results can then be compiled and output. The sample disambiguation analysis results are described first.

Referring to FIG. 7, FIG. 7 is a schematic diagram of the sample knowledge corresponding to each sample provided by an embodiment of the present invention. As FIG. 7 shows, for each sample the embodiment of the present invention can obtain the disambiguation score of the sample, the most similar category corresponding to the category of the sample, the probability of the most similar category, the confusion probability, and so on. Based on this sample knowledge, the embodiment of the present invention can dig further to find categories that may contain labeling errors or that are easily confused.

In step S303, for each sample, the most similar sample corresponding to the sample is determined from the samples having the target predicted classification corresponding to the sample, where the target predicted classification is the predicted classification with the highest probability.

In this embodiment, each sample corresponds to a plurality of predicted classifications, and the predicted classification with the highest probability is taken as the target predicted classification. Note that the target predicted classification may or may not be the same as the classification label of the sample, so the most similar sample is determined according to the following two cases.

Case 1: If the target predicted classification is consistent with the classification label corresponding to the sample, the most similar sample is output as empty.

Case 2: If the target predicted classification is inconsistent with the classification label corresponding to the sample, all candidate samples having the target predicted classification are extracted from the classification sample set, the similarity between each candidate sample and the sample is calculated, and the candidate sample with the highest similarity is determined as the most similar sample.
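The two cases can be sketched as follows. Several points here are assumptions rather than statements of the patent: the similarity measure is not specified in the text, so a character-level Jaccard similarity is used as a placeholder; candidate samples are taken to be those whose classification label equals the target predicted classification; and the dictionary keys `text`, `label` and `predicted` are hypothetical.

```python
def jaccard(a, b):
    """Character-level Jaccard similarity (placeholder similarity measure)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def most_similar_sample(sample, dataset, similarity=jaccard):
    """Return the most similar sample for `sample`, or None.

    sample and dataset entries are dicts with 'text', 'label' and
    'predicted' (the target predicted classification).
    """
    # Case 1: prediction agrees with the label, so the result is empty.
    if sample["predicted"] == sample["label"]:
        return None

    # Case 2: collect candidates carrying the target predicted classification
    # (assumed here to mean: labeled with that class).
    candidates = [
        other
        for other in dataset
        if other is not sample and other["label"] == sample["predicted"]
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda other: similarity(sample["text"], other["text"]))
```

Any other text similarity, for example cosine similarity over sentence embeddings, could be passed in through the `similarity` parameter without changing the rest of the logic.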

In this embodiment, all samples in the classification sample set can first be sorted in descending order of disambiguation score, and the most similar sample of each sample can then be displayed in turn, which makes it easier to analyze the likely causes of prediction errors.

The category disambiguation analysis results are described next.

In step S304, the most similar predicted classification corresponding to each classification label is determined based on the disambiguation score. Based on the most similar predicted classification, a first target sample whose predicted classification is the most similar predicted classification is determined from all samples having the classification label, and a second target sample whose predicted classification is the classification label is determined from all samples having the most similar predicted classification.

In this embodiment, one classification label corresponds to at least one sample. That is, if the classification sample set contains M samples covering N categories, then N is less than or equal to M. For each category, the embodiment of the present invention can find the most similar category, namely the one that conflicts with it most. The greatest conflict can be understood as two categories that cannot easily be told apart, which may be caused by an overly fine granularity of category division.

Therefore, the embodiment of the present invention finds the most similar category of each category, then finds the samples corresponding to that category and to its most similar category, and by examining these samples determines whether the categories need to be merged or whether the classification labels of the samples need to be changed.

As an optional implementation, step S304 can be implemented in the manner shown in FIG. 8. FIG. 8 is a schematic flowchart of step S304 provided by an embodiment of the present invention:

S304-1: From all samples having the classification label, determine the samples to be confirmed that have the same predicted classification.

S304-2: Determine the sum of the disambiguation scores of all samples to be confirmed that have the same predicted classification, and determine the predicted classification with the largest sum of disambiguation scores as the most similar predicted classification.

S304-3: Determine all samples to be confirmed that correspond to the most similar predicted classification as the first target samples, and determine, from all samples having the most similar predicted classification, the second target samples whose predicted classification is the classification label.

In this embodiment, since the predicted classification of each sample has been obtained in advance, for each classification label the samples whose predicted classification is that classification label can be determined, from among the samples having the most similar predicted classification, as the second target samples. Likewise, since each classification label corresponds to a plurality of samples, the samples having that classification label whose predicted classification is the most similar predicted classification can be determined as the first target samples.
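Steps S304-1 to S304-3 can be illustrated with the sketch below. It rests on two interpretive assumptions that the text does not state explicitly: samples whose prediction already equals their label are left out of the score sums, and "samples having the most similar predicted classification" is read as samples labeled with that class. The dictionary keys `label`, `predicted` and `score` and the function name are hypothetical.

```python
from collections import defaultdict


def category_disambiguation(samples, label):
    """Find the most similar predicted class for `label` and the target samples.

    samples: dicts with 'label', 'predicted' and 'score' (disambiguation score).
    Returns (most_similar_class, first_targets, second_targets).
    """
    # S304-1 / S304-2: group samples of this label by predicted class and
    # sum their disambiguation scores (correct predictions are skipped,
    # which is an assumption of this sketch).
    score_sums = defaultdict(float)
    for s in samples:
        if s["label"] == label and s["predicted"] != label:
            score_sums[s["predicted"]] += s["score"]
    if not score_sums:
        return None, [], []
    most_similar = max(score_sums, key=score_sums.get)

    # S304-3: first targets are labeled `label` but predicted as the most
    # similar class; second targets are the mirror image.
    first_targets = [
        s for s in samples
        if s["label"] == label and s["predicted"] == most_similar
    ]
    second_targets = [
        s for s in samples
        if s["label"] == most_similar and s["predicted"] == label
    ]
    return most_similar, first_targets, second_targets
```

Reviewing `first_targets` and `second_targets` side by side is what lets an operator decide between merging the two categories and relabeling individual samples.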

In this embodiment, the first target sample and the second target sample are used to indicate that a merging strategy should be applied to the classification label and the most similar predicted classification, or that a correction strategy should be applied to the first target sample and/or the second target sample.

For example, based on the above results, operators can look up the wrongly predicted similar questions of the two categories (namely the first target samples and the second target samples). If the conflict appears significant, they can consider merging the categories or adjusting the corresponding samples.

For example, suppose a first target sample is labeled as class A but the classification model predicts class B. This suggests that the same text could reasonably be assigned to either class A or class B. The operator can then decide whether the categories involved need to be merged, or whether the classification label of the first target sample needs to be changed so that the sample is clearly distinguished from other categories. The same applies to the second target sample.

For ease of understanding, suppose one of the classification labels is "why membership benefits cannot be applied" and 10 samples carry this label, of which 5 are predicted as S1, 3 as S2 and 2 as S3. Since the disambiguation score of each sample has already been obtained, adding up the disambiguation scores of all samples corresponding to each predicted classification gives the sum of disambiguation scores for that classification: the scores of the 5 samples corresponding to S1 are added together, as are those of the 3 samples corresponding to S2 and those of the 2 samples corresponding to S3. The resulting sum for each predicted classification can be understood as the weight with which that category needs disambiguation, and the category with the largest sum is determined as the most similar predicted classification of "why membership benefits cannot be applied".

Referring to FIG. 9, FIG. 9 is a schematic diagram of the category disambiguation analysis results provided by an embodiment of the present invention. As can be seen, for each classification the embodiment of the present invention can find the classification most similar to it and then, from the classification results of the samples, obtain the samples that may be in conflict. The results shown in FIG. 9 are presented to operators for knowledge disambiguation, which completes the data cleaning and improves labeling accuracy.

Therefore, after obtaining the most similar sample corresponding to each sample, the most similar classification corresponding to each classification, and the first target samples and second target samples that may be in conflict, the embodiment of the present invention can also display these results as follows to make analysis easier for operators:

In descending order of the sum of disambiguation scores, display in turn the classification label, the number of samples corresponding to the classification label, the most similar predicted classification corresponding to the classification label, the first target samples together with the probability of the most similar predicted classification for each first target sample, and the second target samples together with the probability of the classification label for each second target sample;

In descending order of disambiguation score, display in turn each sample, the classification label corresponding to the sample, the predicted classification, the disambiguation score and the most similar sample.
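The two display orders amount to two descending sorts. The sketch below assumes that the category-level and sample-level results have already been collected into lists of dictionaries; the keys `score_sum` and `score` and the function name `build_reports` are hypothetical names chosen for illustration.

```python
def build_reports(category_rows, sample_rows):
    """Order the two result tables for display.

    category_rows: dicts with at least 'label' and 'score_sum'
                   (sum of disambiguation scores for the label's most
                   similar predicted classification).
    sample_rows:   dicts with at least 'text' and 'score'
                   (disambiguation score of the sample).
    Returns (categories_sorted, samples_sorted), both in descending order.
    """
    categories_sorted = sorted(
        category_rows, key=lambda row: row["score_sum"], reverse=True
    )
    samples_sorted = sorted(
        sample_rows, key=lambda row: row["score"], reverse=True
    )
    return categories_sorted, samples_sorted
```

Sorting both tables this way puts the categories and samples most likely to need attention at the top of what the operator sees.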

The training set processing method for classification provided by the embodiment of the present invention can be executed in a hardware device or implemented in the form of a software module. When the method is implemented in the form of a software module, an embodiment of the present invention further provides a training set processing device for classification. Referring to FIG. 10, FIG. 10 is a functional block diagram of the training set processing device for classification provided by an embodiment of the present invention. The training set processing device 400 for classification may include:

an acquisition module 410, configured to acquire a classification sample set, where the classification sample set includes a plurality of samples and a classification label corresponding to each sample, and one classification label corresponds to at least one sample;

a determination module 420, configured to determine the predicted classifications corresponding to each sample, the probability of each predicted classification, and the disambiguation score corresponding to each sample, where the disambiguation score characterizes the degree of error of the classification label corresponding to the sample;

the determination module 420 being further configured to determine, for each sample, the most similar sample corresponding to the sample from the samples having the target predicted classification corresponding to the sample, where the target predicted classification is the predicted classification with the highest probability;

the determination module 420 being further configured to determine, based on the disambiguation score, the most similar predicted classification corresponding to each classification label and, based on the most similar predicted classification, to determine a first target sample whose predicted classification is the most similar predicted classification from all samples having the classification label, and a second target sample whose predicted classification is the classification label from all samples having the most similar predicted classification.

It can be understood that the acquisition module 410 and the determination module 420 can cooperate to execute the steps in FIG. 3 to achieve the corresponding technical effects.

In an optional implementation, the determination module 420 is specifically configured to: obtain, using the trained classification model, the predicted classifications of each sample and the probability of each predicted classification; for each sample, determine the maximum misclassification probability and the confusion probability corresponding to the sample based on the result of comparing the predicted classifications with the classification label corresponding to the sample; and determine the larger of the maximum misclassification probability and the confusion probability as the disambiguation score corresponding to the sample.

In an optional implementation, the determination module 420 is specifically configured to: determine, from the training set and according to the sample identifier corresponding to each sample, a plurality of sample subsets and a prediction sample set corresponding to each sample subset, where the union of each sample subset and its corresponding prediction sample set is the classification sample set, and where, within one sample subset, the sample identifier of each sample yields the same remainder when divided by the number of sample subsets and this remainder differs from the remainder corresponding to each prediction sample in the prediction sample set; and train the classification model with each sample subset in turn and use the trained classification model to predict the prediction sample set corresponding to that sample subset, obtaining the predicted classifications of each prediction sample in the prediction sample set and the probability of each predicted classification.

In an optional implementation, the determination module 420 is specifically configured to: sort the probabilities in descending order, determine the first maximum probability and the second maximum probability, and take the difference between the first maximum probability and the second maximum probability as the confusion probability; if the predicted classification corresponding to the first maximum probability is consistent with the classification label, take the second maximum probability as the maximum misclassification probability; and if the predicted classification corresponding to the first maximum probability is inconsistent with the classification label, take the first maximum probability as the maximum misclassification probability.

In an optional implementation, the determination module 420 is specifically configured to: determine, from all samples having the classification label, the samples to be confirmed that have the same predicted classification; determine the sum of the disambiguation scores of all samples to be confirmed that have the same predicted classification, and determine the predicted classification with the largest sum of disambiguation scores as the most similar predicted classification; and determine all samples to be confirmed that correspond to the most similar predicted classification as the first target samples, and determine, from all samples having the most similar predicted classification, the second target samples whose predicted classification is the classification label.

In an optional implementation, the determination module 420 is specifically configured to: if the target predicted classification is consistent with the classification label corresponding to the sample, output the most similar sample as empty; and if the target predicted classification is inconsistent with the classification label corresponding to the sample, extract all candidate samples having the target predicted classification from the classification sample set, calculate the similarity between each candidate sample and the sample, and determine the candidate sample with the highest similarity as the most similar sample.

In an optional implementation, the device may further include a display module, configured to: display in turn, in descending order of the sum of disambiguation scores, the classification label, the number of samples corresponding to the classification label, the most similar predicted classification corresponding to the classification label, the first target samples together with the probability of the most similar predicted classification for each first target sample, and the second target samples together with the probability of the classification label for each second target sample; and display in turn, in descending order of disambiguation score, each sample, the classification label corresponding to the sample, the predicted classification, the disambiguation score and the most similar sample.

An embodiment of the present invention further provides a storage medium storing a computer program which, when executed by a processor, implements the training set processing method for classification according to any one of the foregoing implementations. The computer-readable storage medium may be, but is not limited to, a USB flash drive, a removable hard disk, ROM, RAM, PROM, EPROM, EEPROM, a magnetic disk, an optical disc, or any other medium capable of storing program code.

It should be understood that the device and method disclosed in the present invention can also be implemented in other ways. The device embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the accompanying drawings show the possible architecture, functions and operations of devices, methods and computer program products according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should further be noted that each block in the block diagrams and/or flowcharts, and each combination of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.

In addition, the functional modules in the embodiments of the present invention may be integrated together to form an independent part, each module may exist separately, or two or more modules may be integrated to form an independent part.

The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention. It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further defined or explained in subsequent drawings.

Claims

1. A method of training set processing for classification, the method comprising:

obtaining a classification sample set; the classification sample set comprises a plurality of samples and classification labels corresponding to the samples; one classification label corresponds to at least one sample;

determining a respective prediction classification for each of the samples, a probability for each of the prediction classifications, and a disambiguation score for each of the samples; wherein the disambiguation score characterizes a degree of error of the classification label corresponding to the sample;

for each sample, determining a most similar sample corresponding to each sample from the samples having a target prediction classification corresponding to the sample, the probability of the target prediction classification being the maximum;

determining, based on the disambiguation score, a most similar prediction classification corresponding to each classification label; and, based on the most similar prediction classification, determining, from all samples having the classification label, a first target sample whose prediction classification is the most similar prediction classification, and, from all samples having the most similar prediction classification, a second target sample whose prediction classification is the classification label;

wherein the first target sample and the second target sample are used to indicate that a merging strategy is to be applied to the classification label and the most similar prediction classification, or that a correction strategy is to be applied to the first target sample and/or the second target sample; and the most similar sample is used to guide correction of the sample.

2. The method of claim 1, wherein determining the respective prediction classifications corresponding to each of the samples, the probability of each of the prediction classifications, and the disambiguation score corresponding to each of the samples comprises:

obtaining each prediction classification of each sample and the probability of each prediction classification by using a trained classification model;

for each sample, determining the maximum misclassification probability and the confusion probability corresponding to the sample based on the comparison result of the prediction classification and the classification label corresponding to the sample;

and determining the larger of the maximum misclassification probability and the confusion probability as the disambiguation score corresponding to the sample.

3. The method of claim 2, wherein obtaining the respective prediction classification corresponding to each of the samples and the probability of each of the prediction classifications using the trained classification model comprises:

determining a plurality of sample subsets and a prediction sample set corresponding to each sample subset from the training set according to the sample identifier corresponding to each sample;

wherein the union of each sample subset and the prediction sample set corresponding to that sample subset is the classification sample set; and within any one of the sample subsets, the sample identifiers of all samples leave the same remainder when divided by the number of sample subsets, and that remainder differs from the remainder corresponding to each prediction sample in the corresponding prediction sample set;

and sequentially training a classification model by using each sample subset, and predicting a prediction sample set corresponding to the sample subset by using the trained classification model to obtain each prediction classification of each prediction sample in the prediction sample set and the probability of each prediction classification.
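As a non-limiting illustration of the remainder-based partition recited in claim 3, the following Python sketch reads the claim wording literally: each sample subset collects the identifiers sharing one remainder, and its prediction sample set collects all other identifiers. The function name `remainder_splits`, the use of integer identifiers, and the parameter `num_subsets` are assumptions made for the example and do not appear in the source; model training and prediction are omitted.

```python
from typing import List, Tuple


def remainder_splits(
    sample_ids: List[int], num_subsets: int
) -> List[Tuple[List[int], List[int]]]:
    """Partition sample identifiers by remainder (hypothetical helper).

    For each remainder r in 0..num_subsets-1, returns a pair
    (sample_subset, prediction_sample_set) where every identifier in
    sample_subset satisfies id % num_subsets == r and every identifier in
    prediction_sample_set has a different remainder.
    """
    if num_subsets < 2:
        raise ValueError("num_subsets must be at least 2")
    splits = []
    for r in range(num_subsets):
        subset = [i for i in sample_ids if i % num_subsets == r]
        prediction = [i for i in sample_ids if i % num_subsets != r]
        splits.append((subset, prediction))
    return splits


if __name__ == "__main__":
    for subset, prediction in remainder_splits(list(range(6)), 3):
        # The union of each pair covers the whole classification sample set.
        print(subset, prediction)
```

Because the split depends only on the identifier and the number of subsets, it is deterministic and needs no random seed; which side of each pair is used for training and which for prediction follows the claim text here and could be swapped without changing the partitioning logic.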

4. The method of claim 2, wherein determining, for each sample, the maximum misclassification probability and the confusion probability corresponding to the sample based on the comparison result of the prediction classification and the classification label corresponding to the sample comprises:

sorting the probabilities in descending order to determine a first maximum probability and a second maximum probability, and taking the difference between the first maximum probability and the second maximum probability as the confusion probability;

if the prediction classification corresponding to the first maximum probability is consistent with the classification label, taking the second maximum probability as the maximum misclassification probability;

and if the prediction classification corresponding to the first maximum probability is inconsistent with the classification label, taking the first maximum probability as the maximum misclassification probability.
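As a non-limiting illustration, the computation recited in claims 2 and 4 can be sketched in Python as follows. The function name `disambiguation_score` and the representation of the model output as a dictionary mapping each prediction classification to its probability are assumptions made for the example; the sketch follows the claim wording literally and assumes at least two prediction classifications per sample.

```python
from typing import Dict


def disambiguation_score(probs: Dict[str, float], label: str) -> float:
    """Score one sample from its class probabilities and its label.

    probs: hypothetical mapping {prediction classification: probability}.
    label: the classification label attached to the sample.
    """
    if len(probs) < 2:
        raise ValueError("at least two prediction classifications are required")
    # Rank the classifications by probability, largest first.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    (top_class, p_first), (_, p_second) = ranked[0], ranked[1]

    # Confusion probability: difference between the two largest probabilities.
    confusion = p_first - p_second

    # Maximum misclassification probability: the largest probability
    # assigned to a classification other than the label.
    max_misclassification = p_second if top_class == label else p_first

    # Disambiguation score: the larger of the two quantities.
    return max(max_misclassification, confusion)


if __name__ == "__main__":
    probs = {"a": 0.5, "b": 0.25, "c": 0.25}
    print(disambiguation_score(probs, "a"))
    print(disambiguation_score(probs, "b"))
```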

5. The method of claim 1, wherein determining the most similar prediction classification corresponding to each of the classification labels based on the disambiguation score, and determining, based on the most similar prediction classification, the first target sample whose prediction classification is the most similar prediction classification from all samples having the classification label and the second target sample whose prediction classification is the classification label from all samples having the most similar prediction classification, comprises:

determining samples to be confirmed having the same prediction classification from all samples having the classification label;

determining the sum of disambiguation scores of all samples to be confirmed with the same prediction classification, and determining the prediction classification corresponding to the maximum sum of the disambiguation scores as the most similar prediction classification;

and determining all samples to be confirmed corresponding to the most similar prediction classification as the first target sample, and determining a second target sample of which the prediction classification is the classification label from all samples with the most similar prediction classification.
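As a non-limiting illustration of claim 5, the Python sketch below groups the samples of one classification label by prediction classification, sums their disambiguation scores, and selects the group with the largest sum. The record layout (dictionaries with `label`, `pred`, and `score` keys) and the function name `most_similar_class` are assumptions made for the example. The sketch also assumes that samples whose prediction classification equals their label are excluded from the grouping, and that "samples having the most similar prediction classification" means samples labeled with that classification; the claim does not state either point explicitly.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Sample = Dict[str, object]  # hypothetical record: {"id", "label", "pred", "score"}


def most_similar_class(
    samples: List[Sample], label: str
) -> Tuple[Optional[str], List[Sample], List[Sample]]:
    """Return (most similar prediction classification, first targets, second targets)."""
    # Sum disambiguation scores per prediction classification among the
    # samples carrying `label` (assumption: correct predictions are skipped).
    sums: Dict[str, float] = defaultdict(float)
    for s in samples:
        if s["label"] == label and s["pred"] != label:
            sums[s["pred"]] += float(s["score"])
    if not sums:
        return None, [], []

    similar = max(sums, key=sums.get)
    # First target samples: labeled `label`, predicted as `similar`.
    first = [s for s in samples if s["label"] == label and s["pred"] == similar]
    # Second target samples: labeled `similar`, predicted as `label`.
    second = [s for s in samples if s["label"] == similar and s["pred"] == label]
    return similar, first, second
```

Inspecting the two returned lists side by side is what allows a reviewer to decide between merging the two classifications and correcting individual samples.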

6. The method of claim 1, wherein determining, for each sample, the most similar sample corresponding to the sample from among the samples having the target prediction classification corresponding to the sample comprises:

if the target prediction classification is consistent with the classification label corresponding to the sample, outputting the most similar sample as null;

if the target prediction classification is inconsistent with the classification label corresponding to the sample, extracting all candidate samples with the target prediction classification from the classification sample set, calculating the similarity between all the candidate samples and the sample, and determining the candidate sample with the maximum similarity as the most similar sample.
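As a non-limiting illustration of claim 6, the Python sketch below returns no most similar sample when the target prediction classification matches the label, and otherwise returns the candidate with the highest similarity. The claim does not specify a similarity measure; cosine similarity over numeric feature vectors is used here purely as an assumption, as are the function names and the representation of candidates as `(identifier, vector)` pairs already extracted for the target prediction classification.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar_sample(
    vector: Sequence[float],
    label: str,
    target_class: str,
    candidates: List[Tuple[str, Sequence[float]]],
) -> Optional[str]:
    """Return the identifier of the most similar candidate, or None.

    candidates: hypothetical (identifier, vector) pairs for the samples
    having the target prediction classification.
    """
    # Label and target prediction classification agree: output is null.
    if target_class == label:
        return None
    if not candidates:
        return None
    best_id, _ = max(candidates, key=lambda c: cosine_similarity(vector, c[1]))
    return best_id
```

For text samples, the vectors could come from any sentence representation; that choice is outside the scope of this sketch.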

7. The method of claim 5, further comprising:

displaying, in descending order of the sum of the disambiguation scores, each classification label, the number of samples corresponding to the classification label, the most similar prediction classification corresponding to the classification label, the first target sample together with its probability for the most similar prediction classification, and the second target sample together with its probability for the classification label;

and displaying, in descending order of the disambiguation score, each sample, the classification label corresponding to the sample, the prediction classification, the disambiguation score, and the most similar sample.

8. A training set processing apparatus for classification, comprising:

an acquisition module configured to acquire a classification sample set; the classification sample set comprises a plurality of samples and classification labels corresponding to the samples; one classification label corresponds to at least one sample;

a determining module configured to determine respective prediction classifications corresponding to each of the samples, a probability of each of the prediction classifications, and a disambiguation score corresponding to each of the samples; wherein the disambiguation score characterizes a degree of error of the classification label corresponding to the sample;

the determining module is further configured to determine, for each sample, a most similar sample corresponding to the sample from among the samples having a target prediction classification corresponding to the sample, the target prediction classification being the prediction classification with the maximum probability;

the determining module is further configured to determine, based on the disambiguation score, a most similar prediction classification corresponding to each of the classification labels, and, based on the most similar prediction classification, to determine, from all samples having the classification label, a first target sample whose prediction classification is the most similar prediction classification, and, from all samples having the most similar prediction classification, a second target sample whose prediction classification is the classification label.

9. An electronic device comprising a processor and a memory, the memory storing a computer program executable by the processor, the processor being configured to execute the computer program to perform the method of any of claims 1 to 7.

10. A storage medium on which a computer program is stored, which computer program, when being executed by a processor, carries out the method according to any one of claims 1 to 7.