CN110196908A - Data classification method, device, computer installation and storage medium
- Publication number: CN110196908A
- Application number: CN201910310574.1A
- Authority: CN (China)
- Prior art keywords
- data
- classified
- model
- data set
- label
- Legal status: Pending (the listed status is an assumption and is not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/55—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
Abstract
The invention provides a data classification method and device, a computer device, and a storage medium. The method includes: obtaining a data set to be labeled; labeling the data set with labeling functions to obtain initial labels of the data set; computing pairwise correlations of the labeling functions from the initial labels, and building a generative model of the labeling functions from the pairwise correlations; estimating probabilistic labels for the data set with the generative model; training a discriminative model on the probabilistic labels to obtain a trained discriminative model; and inputting data to be classified into the trained discriminative model to obtain the category of the data to be classified. The invention improves the efficiency and accuracy of training-data labeling; the training data can be used to train the discriminative model quickly, and the trained discriminative model enables fast and accurate data classification.
Description
Technical Field
The invention relates to the technical field of machine learning, and in particular to a data classification method and device, a computer device, and a computer storage medium.
Background
With the rapid development of artificial intelligence, machine learning (and deep learning in particular) has been applied across many industries. Labeling training data has gradually become the biggest bottleneck in deploying machine learning systems widely.
Traditional manual labeling is time-consuming, labor-intensive, and costly, and existing data augmentation approaches such as semi-supervised learning, active learning, and transfer learning cannot generate training data quickly and at scale.
How to devise a suitable scheme that reduces the workload of manually labeling training data and improves labeling efficiency is a technical problem that practitioners currently need to solve.
Summary of the Invention
In view of the above, it is necessary to provide a data classification method and device, a computer device, and a computer storage medium that can improve the labeling efficiency of training data and classify data quickly and accurately.
A first aspect of the present application provides a data classification method applied to a machine learning system, the method comprising:
obtaining a data set to be labeled {x_i | i = 1, 2, ..., m};
labeling the data set with labeling functions λ_j, j = 1, 2, ..., n, to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, ..., m, j = 1, 2, ..., n;
computing pairwise correlations of the labeling functions from the initial labels, and building a generative model of the labeling functions from the pairwise correlations;
estimating probabilistic labels for the data set with the generative model;
training a discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model;
inputting data to be classified into the trained discriminative model to obtain the category of the data to be classified.
In another possible implementation, the generative model is:
p_w(Λ, Y) = (1/Z_w) exp( Σ_{i=1}^{m} w^T φ_i(Λ, y_i) )
where Λ is the initial-label matrix formed by the initial labels, Y is the true-label matrix, Z_w is a normalization constant, φ_i(Λ, y_i), i = 1, 2, ..., m are the pairwise correlations of the labeling functions for each data item in the data set, and w is the parameter vector of the generative model to be estimated, w ∈ R^{2n+|C|}.
In another possible implementation, the pairwise correlation is:
φ_{i,(j,k)}(Λ, Y) = 1{Λ_{i,j} = Λ_{i,k}}, (j, k) ∈ C
where 1{Λ_{i,j} = Λ_{i,k}} is an indicator whose value depends on whether the condition inside the braces {} holds.
In another possible implementation, the data set to be labeled is an image set and the data to be classified is an image to be classified; or
the data set to be labeled is a text set and the data to be classified is a text to be classified; or
the data set to be labeled is a speech set and the data to be classified is a speech sample to be classified.
In another possible implementation, inputting the data to be classified into the trained discriminative model to obtain the category of the data to be classified includes:
inputting the image to be classified into the trained discriminative model to obtain the user, object, or face attributes corresponding to the image to be classified;
inputting the text to be classified into the trained discriminative model to obtain the sentiment orientation, subject matter, or technical field corresponding to the text to be classified;
inputting the speech to be classified into the trained discriminative model to obtain the user, age group, or emotion corresponding to the speech to be classified.
In another possible implementation, training the discriminative model of the machine learning system on the probabilistic labels includes:
training the discriminative model on the probabilistic labels by minimizing a noise-aware variant of the discriminative model's loss function.
In another possible implementation, before labeling the data set with the labeling functions λ_j, j = 1, 2, ..., n, the method further includes:
filling missing values in the data set; and/or
correcting outliers in the data set.
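This optional preprocessing step can be sketched in Python; the concrete rules below (mean imputation for missing values, clipping outliers to three standard deviations) are illustrative assumptions, since the application does not prescribe specific ones:

```python
import numpy as np

# Sketch of the optional preprocessing step: fill missing values with the
# column mean and correct outliers by clipping to mean +/- 3 standard
# deviations. Both rules are illustrative choices, not prescribed ones.
def preprocess(X):
    X = X.astype(float).copy()
    for j in range(X.shape[1]):
        col = X[:, j]
        mean = np.nanmean(col)
        col[np.isnan(col)] = mean              # fill missing values
        lo, hi = mean - 3 * col.std(), mean + 3 * col.std()
        X[:, j] = np.clip(col, lo, hi)         # correct outliers
    return X

X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, 9.0]])
print(preprocess(X))
```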
A second aspect of the present application provides a data classification device applied to a machine learning system, the device comprising:
an acquisition module configured to obtain a data set to be labeled {x_i | i = 1, 2, ..., m};
a labeling module configured to label the data set with labeling functions λ_j, j = 1, 2, ..., n, to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, ..., m, j = 1, 2, ..., n;
a building module configured to compute pairwise correlations of the labeling functions from the initial labels and to build a generative model of the labeling functions from the pairwise correlations;
an estimation module configured to estimate probabilistic labels for the data set with the generative model;
a training module configured to train the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model;
a classification module configured to input data to be classified into the trained discriminative model to obtain the category of the data to be classified.
In another possible implementation, the generative model is:
p_w(Λ, Y) = (1/Z_w) exp( Σ_{i=1}^{m} w^T φ_i(Λ, y_i) )
where Λ is the initial-label matrix formed by the initial labels, Y is the true-label matrix, Z_w is a normalization constant, φ_i(Λ, y_i), i = 1, 2, ..., m are the pairwise correlations of the labeling functions for each data item in the data set, and w is the parameter vector of the generative model to be estimated, w ∈ R^{2n+|C|}.
In another possible implementation, the pairwise correlation is:
φ_{i,(j,k)}(Λ, Y) = 1{Λ_{i,j} = Λ_{i,k}}, (j, k) ∈ C
where 1{Λ_{i,j} = Λ_{i,k}} is an indicator whose value depends on whether the condition inside the braces {} holds.
In another possible implementation, the data set to be labeled is an image set and the data to be classified is an image to be classified; or
the data set to be labeled is a text set and the data to be classified is a text to be classified; or
the data set to be labeled is a speech set and the data to be classified is a speech sample to be classified.
In another possible implementation, inputting the data to be classified into the trained discriminative model to obtain the category of the data to be classified includes:
inputting the image to be classified into the trained discriminative model to obtain the user, object, or face attributes corresponding to the image to be classified;
inputting the text to be classified into the trained discriminative model to obtain the sentiment orientation, subject matter, or technical field corresponding to the text to be classified;
inputting the speech to be classified into the trained discriminative model to obtain the user, age group, or emotion corresponding to the speech to be classified.
In another possible implementation, training the discriminative model of the machine learning system on the probabilistic labels includes:
training the discriminative model on the probabilistic labels by minimizing a noise-aware variant of the discriminative model's loss function.
In another possible implementation, the device further includes:
a preprocessing module configured to fill missing values in the data set and/or correct outliers in the data set.
A third aspect of the present application provides a computer device comprising a processor, the processor being configured to implement the data classification method when executing a computer program stored in a memory.
A fourth aspect of the present application provides a computer storage medium having a computer program stored thereon, the computer program implementing the data classification method when executed by a processor.
The present invention obtains a data set to be labeled {x_i | i = 1, 2, ..., m}; labels the data set with labeling functions λ_j, j = 1, 2, ..., n, to obtain initial labels Λ_{i,j} = λ_j(x_i), i = 1, 2, ..., m, j = 1, 2, ..., n; computes pairwise correlations of the labeling functions from the initial labels and builds a generative model of the labeling functions from the pairwise correlations; estimates probabilistic labels for the data set with the generative model; trains the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model; and inputs data to be classified into the trained discriminative model to obtain its category. The invention can quickly generate the training data required by the discriminative model of a machine learning system, addressing the technical problems that manually labeled training data is difficult to obtain, slow to produce, and of unguaranteed accuracy; it reduces the workload of manual labeling, improves the efficiency and accuracy of training-data labeling, allows the discriminative model to be trained quickly on that data, and achieves fast and accurate data classification with the trained model.
Brief Description of the Drawings
Fig. 1 is a flowchart of a data classification method provided by an embodiment of the present invention.
Fig. 2 is a structural diagram of a data classification device provided by an embodiment of the present invention.
Fig. 3 is a schematic diagram of a computer device provided by an embodiment of the present invention.
Detailed Description
To make the above objects, features, and advantages of the present invention clearer, the invention is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features within them may be combined with one another.
Many specific details are set forth in the following description to facilitate a full understanding of the present invention; the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the invention.
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of the invention. The terms used in this description are for the purpose of describing specific embodiments only and are not intended to limit the invention.
Preferably, the data classification method of the present invention is applied in one or more computer devices. A computer device is an apparatus capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), and embedded devices.
The computer device may be a desktop computer, a notebook, a palmtop computer, a cloud server, or another computing device. The computer device may interact with a user through a keyboard, mouse, remote control, touch panel, or voice-control equipment.
Embodiment 1
Fig. 1 is a flowchart of the data classification method provided by Embodiment 1 of the present invention. The data classification method is applied to a computer device.
The data classification method of the present invention is applied to a machine learning system: it generates training data, uses the training data to train a discriminative model of the machine learning system, and uses the trained discriminative model to classify the data to be classified. The method can quickly generate the training data required by the discriminative model, addressing the technical problems that manually labeled training data is difficult to obtain, slow to produce, and of unguaranteed accuracy; it reduces the workload of manual labeling, improves labeling efficiency and accuracy, allows the discriminative model to be trained quickly, and achieves fast and accurate data classification.
As shown in Fig. 1, the data classification method includes:
Step 101: obtain a data set to be labeled {x_i | i = 1, 2, ..., m}.
The data set to be labeled includes multiple data items that need to be labeled. Labeling each item in the data set produces a label for that item, which is used to train the discriminative model.
In application scenarios where the data classification method is applied to image classification, the data set to be labeled may be an image set. For example, the data set may include images of different users, and the user corresponding to each image is to be labeled. As another example, the data set may include images of different objects (e.g., pens, balls, books), and the object in each image is to be labeled. As yet another example, the data set may include images with different face attributes (e.g., ethnicity, gender, age, expression), and the face attributes of each image are to be labeled.
In application scenarios where the data classification method is applied to text classification, the data set to be labeled may be a text set. For example, the data set may include texts with different sentiment orientations, and the sentiment orientation of each text is to be labeled. As another example, the data set may include texts on different subjects, and the subject of each text is to be labeled. As yet another example, the data set may include texts from different technical fields (e.g., physics, chemistry, machinery), and the technical field of each text is to be labeled.
In application scenarios where the data classification method is applied to speech classification, the data set to be labeled may be a speech set. For example, the data set may include speech from multiple different users, and the user corresponding to each sample is to be labeled. As another example, the data set may include speech from users of different age groups, and the age group of each sample is to be labeled. As yet another example, the data set may include speech expressing different emotions, and the emotion of each sample is to be labeled.
Data may be collected in real time, and the collected data forms the data set to be labeled. For example, images of people may be captured in real time and used as the data set to be labeled.
Alternatively, the data set to be labeled may be obtained from a preset data source. For example, data may be retrieved from a preset database that stores a large amount of data, such as images, to form the data set to be labeled.
Alternatively, a data set to be labeled input by a user may be received. For example, multiple images input by the user may be received and used as the data set to be labeled.
Step 102: label the data set {x_i | i = 1, 2, ..., m} with the labeling functions λ_j, j = 1, 2, ..., n, to obtain the initial labels Λ_{i,j} = λ_j(x_i) of the data set, where i = 1, 2, ..., m and j = 1, 2, ..., n.
A labeling function is a function expressing the mapping from data to labels: it receives a data item and outputs a label for it. A labeling function is a black-box function that can be written as λ: X → Y ∪ {∅}, where λ is the labeling function, X is the data, Y is the initial label corresponding to X, and ∅ indicates that the labeling function abstains.
Compared with manually labeling training data, labeling functions make it possible to use various sources of weak supervision (such as heuristics and external knowledge bases) to generate the initial labels. For example, suppose an image contains two people, A and B, whose relationship is to be labeled, and it is known that A is D's father and B is D's mother. The heuristic "if A is C's father and B is C's mother, then A and B are a couple" then yields the labeling result (i.e., the initial label) that A and B are a couple.
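The couple heuristic above might be written as a labeling function like the following sketch; the attribute names and the label encoding (with 0 denoting abstention) are hypothetical:

```python
# Sketch of two heuristic labeling functions (names and encoding hypothetical).
# Convention: return a class label, or ABSTAIN when the function cannot decide.
ABSTAIN = 0
COUPLE = 1
SIBLING = 2

def lf_shared_child(x):
    # Heuristic: if A is the father and B is the mother of the same child C,
    # label the pair (A, B) as a couple.
    if x.get("father_of") and x.get("father_of") == x.get("mother_of"):
        return COUPLE
    return ABSTAIN

def lf_same_parents(x):
    # Heuristic: if A and B share the same parents, label them as siblings.
    if x.get("parents_of_a") and x.get("parents_of_a") == x.get("parents_of_b"):
        return SIBLING
    return ABSTAIN

pair = {"father_of": "Ding", "mother_of": "Ding"}
print(lf_shared_child(pair))  # 1 (COUPLE)
print(lf_same_parents(pair))  # 0 (ABSTAIN: no parent information)
```

Each labeling function only fires when its heuristic applies and abstains otherwise, which is why the resulting initial labels can be incomplete or conflicting.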
Labeling functions are not required to be accurate. That is, the initial labels obtained from the labeling functions are unreliable; unreliability may include incorrect labels, multiple labels, insufficient labels, partial labels, and so on.
Multiple labeling functions may be defined in advance as needed, for example six labeling functions.
Different labeling functions are allowed to conflict in their results for the same data item. For example, labeling function 1 may label a pair of people as a couple while labeling function 2 labels the same pair as siblings.
Step 103: compute the pairwise correlations of the labeling functions λ_j, j = 1, 2, ..., n from the initial labels, and build a generative model of the labeling functions λ_j, j = 1, 2, ..., n from the pairwise correlations.
The pairwise correlation of the labeling functions λ_j, j = 1, 2, ..., n refers to the dependency between two labeling functions. The method models the statistical dependencies between labeling functions to improve predictive performance. For example, if two labeling functions express similar heuristics, this dependency can be included in the generative model to avoid a "double counting" problem. Pairwise correlations are the most common kind of dependency, so a set C of labeling-function pairs (j, k) is chosen to be modeled as correlated.
In this embodiment, the pairwise correlation of the labeling functions λ_j, j = 1, 2, ..., n can be expressed as:
φ_{i,(j,k)}(Λ, Y) = 1{Λ_{i,j} = Λ_{i,k}}, (j, k) ∈ C
where Λ is the initial-label matrix formed by the initial labels, Λ_{i,j} = λ_j(x_i); Y is the true-label matrix, Y = (y_1, y_2, ..., y_m); and C is the set of labeling-function pairs (j, k). 1{Λ_{i,j} = Λ_{i,k}} is an indicator whose value depends on whether the condition inside the braces {} holds: in this embodiment, it takes the value 1 when the condition holds and 0 when it does not.
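The indicator above can be computed for a whole initial-label matrix at once; the label matrix and the correlation set C below are hypothetical examples:

```python
import numpy as np

# Sketch: the indicator 1{Λ[i,j] == Λ[i,k]} for each data item i and each
# labeling-function pair (j, k) in a chosen correlation set C.
Lambda = np.array([
    [1, 1, 2],   # labels given to x_1 by labeling functions λ1, λ2, λ3
    [2, 2, 2],   # labels given to x_2
    [1, 2, 1],   # labels given to x_3
])
C = [(0, 1), (1, 2)]  # pairs modeled as correlated (0-indexed)

def pairwise_agreement(Lambda, C):
    # Returns an (m, |C|) matrix of 0/1 agreement indicators.
    return np.stack(
        [(Lambda[:, j] == Lambda[:, k]).astype(int) for (j, k) in C],
        axis=1,
    )

print(pairwise_agreement(Lambda, C))
```

Here the first column records agreement between λ1 and λ2 on each data item, and the second column agreement between λ2 and λ3.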
From the pairwise correlations, a generative model of the labeling functions λ_j, j = 1, 2, ..., n can be built.
The core operation of the method is to model and integrate the noisy signals provided by the set of labeling functions, modeling each labeling function as a noisy "voter" whose errors are correlated with those of the other labeling functions.
For each data item x_i in the data set {x_i | i = 1, 2, ..., m}, the initial label vector is Λ_i = (Λ_{i,1}, Λ_{i,2}, ..., Λ_{i,n}). In this embodiment, the generative model built from the pairwise correlations is:
p_w(Λ, Y) = (1/Z_w) exp( Σ_{i=1}^{m} w^T φ_i(Λ, y_i) )
where Z_w is a normalization constant, φ_i(Λ, y_i) are the pairwise correlations of the labeling functions λ_j, j = 1, 2, ..., n for each data item in the data set, and w is the parameter vector of the generative model to be estimated, w ∈ R^{2n+|C|}.
Step 104: estimate the probabilistic labels Ỹ of the data set {x_i | i = 1, 2, ..., m} with the generative model.
In this embodiment, the generative model is p_w(Λ, Y) = (1/Z_w) exp( Σ_{i=1}^{m} w^T φ_i(Λ, y_i) ).
To learn this model without access to the true labels, the negative log marginal likelihood given the observed initial-label matrix Λ is minimized:
ŵ = argmin_w -log Σ_Y p_w(Λ, Y)
This objective can be optimized by interleaving stochastic gradient descent steps with Gibbs sampling steps, yielding an estimate ŵ of the parameters w of the generative model.
Once the estimate ŵ is determined, the generative model is determined; inputting the initial-label matrix Λ into the generative model then yields the probabilistic labels of the data set {x_i | i = 1, 2, ..., m}: Ỹ = p_ŵ(Y | Λ).
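The full estimate requires the learned generative model. As a much-simplified illustration of turning noisy votes into a probabilistic label, the sketch below assumes two classes {1, 2}, independent labeling functions with hypothetical known accuracies, a uniform prior, and ignores the correlation factors of the full model:

```python
import math

# Simplified sketch of step 104: a naive-Bayes-style posterior over labels
# given the labeling-function votes. Accuracies are hypothetical estimates;
# the full method would learn the model parameters w instead.
def probabilistic_label(votes, accuracies):
    # votes: label from each λj for one data item (0 = abstain)
    # accuracies: assumed probability that each λj is correct when it votes
    log_odds = 0.0  # log P(y=1 | votes) / P(y=2 | votes)
    for v, a in zip(votes, accuracies):
        if v == 0:
            continue  # abstentions carry no signal in this sketch
        if v == 1:
            log_odds += math.log(a / (1 - a))
        else:
            log_odds -= math.log(a / (1 - a))
    p1 = 1.0 / (1.0 + math.exp(-log_odds))
    return {1: p1, 2: 1.0 - p1}

probs = probabilistic_label(votes=[1, 1, 2, 0], accuracies=[0.8, 0.7, 0.6, 0.9])
print(probs)
```

Two confident votes for class 1 outweigh one weak vote for class 2, so the resulting probabilistic label leans toward class 1 without committing to it outright.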
Steps 103-104 use the generative model to denoise the initial labels Λ_{i,j} = λ_j(x_i), producing the probabilistic labels of the data set {x_i | i = 1, 2, ..., m}. The data set together with the probabilistic labels constitutes the training data of the machine learning system.
Step 105: train the discriminative model of the machine learning system on the probabilistic labels to obtain the trained discriminative model.
Training the discriminative model of the machine learning system on the probabilistic labels means using the data set carrying the probabilistic labels as training samples for the discriminative model.
The ultimate goal of training the discriminative model on the probabilistic labels Ỹ is a discriminative model that generalizes beyond the information expressed by the labeling functions. The discriminative model h_θ can be trained on the probabilistic labels Ỹ by minimizing a noise-aware variant of its loss function l(h_θ(x_i), y), i.e., the expected loss with respect to Ỹ:
θ̂ = argmin_θ (1/m) Σ_{i=1}^{m} E_{y∼Ỹ}[ l(h_θ(x_i), y) ]
When training the discriminative model, its parameters are adjusted so that this noise-aware loss attains its minimum. The training process may use the RMSprop algorithm, an improved stochastic gradient descent algorithm; RMSprop is well known in the art and is not described further here.
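A minimal sketch of this noise-aware training loop, assuming a logistic-regression discriminative model, binary probabilistic labels, synthetic data, and hand-rolled RMSprop updates (the full method would use the labels estimated in step 104 and a model of the practitioner's choice):

```python
import numpy as np

# Sketch of step 105: minimize the expected (noise-aware) cross-entropy of a
# logistic model under probabilistic labels, using RMSprop updates.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                   # synthetic features
true_w = np.array([2.0, -1.0, 0.5])             # hypothetical ground truth
p_y1 = 1 / (1 + np.exp(-X @ true_w))            # probabilistic labels P(y=1)

w, cache = np.zeros(3), np.zeros(3)
lr, decay, eps = 0.05, 0.9, 1e-8
for _ in range(500):
    p_hat = 1 / (1 + np.exp(-X @ w))            # h_theta(x)
    grad = X.T @ (p_hat - p_y1) / len(X)        # grad of expected log-loss
    cache = decay * cache + (1 - decay) * grad ** 2
    w -= lr * grad / (np.sqrt(cache) + eps)     # RMSprop step

pred = (1 / (1 + np.exp(-X @ w)) > 0.5).astype(int)
target = (p_y1 > 0.5).astype(int)
print("agreement:", (pred == target).mean())
```

Because the targets are probabilities rather than hard labels, the gradient uses p_hat - p_y1 directly, which is exactly the gradient of the expected cross-entropy under the probabilistic label distribution.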
Step 106: input the data to be classified into the trained discriminative model to obtain the category of the data to be classified.
In application scenarios where the data classification method is applied to image classification, the image to be classified is input into the trained discriminative model to obtain its category: for example, the user corresponding to the image, the object in the image, or the face attributes of the image.
In application scenarios where the data classification method is applied to text classification, the text to be classified is input into the trained discriminative model to obtain its category: for example, the sentiment orientation of the text (e.g., positive or negative), its subject matter, or its technical field.
In application scenarios where the data classification method is applied to speech classification, the speech to be classified is input into the trained discriminative model to obtain its category: for example, the user corresponding to the speech, the speaker's age group, or the emotion of the speech.
The data classification method of Embodiment 1 obtains the data set {x_i | i=1, 2, ..., m} to be labeled; labels the data set with the labeling functions λ_j, j=1, 2, ..., n, obtaining the initial labels Λ_{i,j} = λ_j(x_i), i=1, 2, ..., m, j=1, 2, ..., n; computes the pairwise correlations of the labeling functions from the initial labels and builds a generative model of the labeling functions from those pairwise correlations; estimates the probability labels of the data set from the generative model; trains the discriminant model of the machine learning system with the probability labels, obtaining a trained discriminant model; and inputs the data to be classified into the trained discriminant model, obtaining the category of the data to be classified. This embodiment can quickly generate the training data required by the discriminant model of a machine learning system, addressing the technical problems that manually labeled training data is hard to obtain, slow to produce, and of unguaranteed accuracy. It reduces the workload of manual labeling, improves the labeling efficiency and accuracy of the training data, allows a discriminant model to be trained quickly on that training data, and uses the discriminant model to achieve fast and accurate data classification.
In another embodiment, before labeling the data set {x_i | i=1, 2, ..., m} with the labeling functions λ_j, the method may further include: preprocessing the data set {x_i | i=1, 2, ..., m}.

Preprocessing the data set {x_i | i=1, 2, ..., m} may include filling in missing values in the data set.

The K-nearest-neighbor algorithm can be used: determine the K data points closest to the data point with the missing value (for example, closest under Euclidean distance), and estimate the missing value as a weighted average of the values of those K data points.
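A minimal sketch of this K-nearest-neighbor filling (the function and variable names are illustrative, and an equal-weight average is used where the embodiment allows any weighting):

```python
import math

def knn_impute(rows, target_idx, k=2):
    """Fill the missing feature (None) at rows[target_idx] with the
    average of that feature over the K nearest complete rows."""
    target = rows[target_idx]
    miss_j = target.index(None)                 # index of the missing feature
    obs = [j for j in range(len(target)) if j != miss_j]

    def dist(row):                              # Euclidean distance on observed features
        return math.sqrt(sum((row[j] - target[j]) ** 2 for j in obs))

    complete = [r for i, r in enumerate(rows)
                if i != target_idx and None not in r]
    neighbors = sorted(complete, key=dist)[:k]
    # Equal-weight average; a distance-based weighting could be
    # substituted here per the weighted-average variant above.
    return sum(r[miss_j] for r in neighbors) / len(neighbors)

rows = [[1.0, 10.0], [2.0, 20.0], [9.0, 90.0], [1.5, None]]
filled = knn_impute(rows, target_idx=3, k=2)
```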
Alternatively, a prediction model can be used to predict the missing value; if the missing value is numeric, it can be filled with the mean, and if it is non-numeric, it can be filled with the mode.

Alternatively, mean substitution can be used. Preferably, because mean substitution rests on the assumption that values are missing completely at random and shrinks the variance and standard deviation of the data, the method may further include: multiplying the fill value obtained by mean substitution by a preset expansion coefficient, and taking the product as the final fill value. The preset expansion coefficient is set in advance and is greater than 1.
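The mean-substitution variant with the expansion coefficient can be sketched as follows (the coefficient value 1.1 is illustrative; the embodiment only requires it to be greater than 1):

```python
def fill_with_scaled_mean(values, expansion=1.1):
    """Replace None entries with mean(observed) * expansion, expansion > 1,
    partially offsetting the variance shrinkage of plain mean substitution."""
    assert expansion > 1
    observed = [v for v in values if v is not None]
    fill = (sum(observed) / len(observed)) * expansion
    return [fill if v is None else v for v in values]

data = [10.0, 20.0, None, 30.0]
filled = fill_with_scaled_mean(data)   # observed mean 20.0, scaled fill 22.0
```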
Other methods can also be used to fill in the missing values; for example, missing values can be filled by regression fitting or by interpolation.
Preprocessing the data set {x_i | i=1, 2, ..., m} may also include correcting outliers in the data set. Outliers are values that deviate markedly from the rest of the data.

Outliers can be corrected with the same methods used to fill missing values. For example, the K-nearest-neighbor algorithm can determine the K data points closest to the data point with the outlier (for example, closest under Euclidean distance), and the corrected value can be estimated as a weighted average of the values of those K data points. Alternatively, a prediction model can be used to predict the corrected value; if the outlier is numeric, it can be corrected with the mean, and if it is non-numeric, with the mode.

Alternatively, mean substitution can be used to replace outliers. Preferably, because mean substitution rests on the assumption that values are missing completely at random and shrinks the variance and standard deviation of the data, the method may further include: multiplying the corrected value obtained by mean substitution by a preset expansion coefficient, and taking the product as the final corrected value. The preset expansion coefficient is set in advance and is greater than 1.

Other methods can also be used to correct the outliers; for example, outliers can be corrected by regression fitting or by interpolation.

The method of correcting outliers may also differ from the method of filling missing values.

Preprocessing the data set may also include directly discarding data with missing values and/or data with outliers, which keeps the data set clean.
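The discard strategy can be sketched in one pass. The embodiment does not define what counts as an outlier, so the two-standard-deviation rule below is an assumed criterion for illustration only:

```python
def mean_std(col):
    m = sum(col) / len(col)
    var = sum((v - m) ** 2 for v in col) / len(col)
    return m, var ** 0.5

def clean(rows):
    """Drop rows containing missing values (None), then drop rows whose
    value lies more than two standard deviations from its column mean
    (an assumed outlier criterion, not fixed by the embodiment)."""
    rows = [r for r in rows if None not in r]
    cols = list(zip(*rows))
    stats = [mean_std(c) for c in cols]
    return [r for r in rows
            if all(abs(v - m) <= 2 * s for v, (m, s) in zip(r, stats))]

raw = [[1.0], [2.0], [3.0], [2.0], [None],
       [1.0], [2.0], [3.0], [2.0], [1000.0]]
cleaned = clean(raw)   # drops the None row and the extreme value
```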
Embodiment 2

Fig. 2 is a structural diagram of a data classification device provided by Embodiment 2 of the present invention. The data classification device 20 is applied to a machine learning system: it generates training data, uses that training data to train the discriminant model of the machine learning system, and uses the trained discriminant model to classify the data to be classified. The data classification device 20 can quickly generate the training data required by the discriminant model, addressing the technical problems that manually labeled training data is hard to obtain, slow to produce, and of unguaranteed accuracy. It reduces the workload of manual labeling, improves the labeling efficiency and accuracy of the training data, allows a discriminant model to be trained quickly on that training data, and uses the discriminant model to achieve fast and accurate data classification.

As shown in Fig. 2, the data classification device 20 may include an acquisition module 201, a labeling module 202, a construction module 203, an estimation module 204, a training module 205, and a classification module 206.
The acquisition module 201 is configured to obtain the data set {x_i | i=1, 2, ..., m} to be labeled.

The data set to be labeled includes multiple data points that need to be labeled. Labeling each data point in the data set yields that data point's label, which is used to train the discriminant model.

In an application scenario where the data classification method is applied to image classification, the data set to be labeled may be an image set. For example, the data set includes images of different users, and the user corresponding to each image is to be labeled. As another example, the data set includes images of different objects (such as pens, balls, and books), and the object in each image is to be labeled. As yet another example, the data set includes images of different face attributes (such as race, gender, age, and expression), and the face attributes of each image are to be labeled.

In an application scenario where the data classification method is applied to text classification, the data set to be labeled may be a text set. For example, the data set includes texts of different sentiment orientations, and the sentiment orientation of each text is to be labeled. As another example, the data set includes texts on different subjects, and the subject of each text is to be labeled. As yet another example, the data set includes texts from different technical fields (such as physics, chemistry, and machinery), and the technical field of each text is to be labeled.

In an application scenario where the data classification method is applied to speech classification, the data set to be labeled may be a speech set. For example, the data set includes speech from multiple different users, and the user corresponding to each utterance is to be labeled. As another example, the data set includes speech from users of different age groups, and the age group of each utterance is to be labeled. As yet another example, the data set includes utterances with different emotions, and the emotion of each utterance is to be labeled.

Data can be collected in real time, with the data collected in real time forming the data set to be labeled. For example, images of people can be captured in real time and used as the data set to be labeled.

Alternatively, the data set to be labeled can be obtained from a preset data source. For example, data forming the data set to be labeled can be retrieved from a preset database that stores a large amount of data, such as images, in advance.

Alternatively, the data set to be labeled can be received from user input. For example, multiple images input by the user can be received and used as the data set to be labeled.
The labeling module 202 is configured to label the data set {x_i | i=1, 2, ..., m} with the labeling functions λ_j, obtaining the initial labels Λ_{i,j} = λ_j(x_i) of the data set, where i=1, 2, ..., m and j=1, 2, ..., n.

A labeling function is a function expressing a mapping from data to labels: it receives a data point and outputs a label for it. A labeling function is a black-box function that can be expressed as λ: X → Y ∪ {∅}, where λ is the labeling function, X is the data, Y is the initial label corresponding to X, and ∅ indicates that the labeling function abstains.

Compared with manually labeling training data, labeling functions allow various sources of weak supervision (such as heuristics and external knowledge bases) to be used to generate the initial labels. For example, suppose an image contains two people, A and B, whose relationship is to be labeled, and it is known that A is D's father and B is D's mother. The heuristic "if A is C's father and B is C's mother, then A and B are spouses" yields the labeling result (i.e., the initial label) that A and B are spouses.
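The father/mother heuristic above can be written directly as a labeling function. In this sketch, `ABSTAIN`, the label constants, and the knowledge-base dictionary `KB` are illustrative stand-ins for the weak-supervision sources; the second labeling function is deliberately noisy to show that accuracy is not required:

```python
ABSTAIN = None                    # the labeling function abstains
SPOUSES = "spouses"
SIBLINGS = "siblings"

# Illustrative external knowledge base: child -> (father, mother)
KB = {"D": ("A", "B")}

def lf_parents_are_spouses(pair):
    """Heuristic: if one person is some child's father and the other is
    that child's mother, label the pair as spouses."""
    a, b = pair
    for father, mother in KB.values():
        if (a, b) in ((father, mother), (mother, father)):
            return SPOUSES
    return ABSTAIN

def lf_shared_initial_siblings(pair):
    """A deliberately noisy heuristic that may conflict with the one
    above: same first character of the name -> guess siblings."""
    a, b = pair
    return SIBLINGS if a[0] == b[0] else ABSTAIN

pair = ("A", "B")
labels = [lf_parents_are_spouses(pair), lf_shared_initial_siblings(pair)]
```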
Labeling functions are not required to be accurate. That is, the initial labels produced by the labeling functions are unreliable; unreliability can include incorrect labels, multiple labels, insufficient labels, and partial labels.

Multiple labeling functions can be defined in advance as needed, for example six labeling functions.

Different labeling functions are allowed to conflict on the same data point. For example, labeling function 1 labels a pair of people as spouses while labeling function 2 labels the same pair as siblings.
The construction module 203 is configured to compute the pairwise correlations of the labeling functions λ_j, j=1, 2, ..., n from the initial labels, and to build a generative model of the labeling functions λ_j, j=1, 2, ..., n from those pairwise correlations.

The pairwise correlation of the labeling functions λ_j, j=1, 2, ..., n refers to the dependency between two labeling functions. This method models the statistical dependencies between labeling functions to improve predictive performance. For example, if two labeling functions express similar heuristics, this dependency can be included in the generative model to avoid the "double counting" problem. Pairwise correlations are the most common kind, so a set C of labeling-function pairs (j, k) is selected to be modeled as correlated.

In this embodiment, the pairwise correlation of the labeling functions λ_j, j=1, 2, ..., n can be expressed as:

φ^Corr_{i,j,k}(Λ, Y) = 𝟙{Λ_{i,j} = Λ_{i,k}}, (j, k) ∈ C

where Λ is the initial label matrix formed by the initial labels, Λ_{i,j} = λ_j(x_i); Y is the true label matrix, Y = (y_1, y_2, ..., y_m); and C is the set of labeling-function pairs (j, k). The indicator 𝟙{Λ_{i,j} = Λ_{i,k}} takes one value when the condition inside the braces holds and another when it does not; in this embodiment, it takes the value 1 when the condition holds and 0 otherwise.
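The indicator above measures, per data point, whether two labeling functions emit the same label. A sketch of how its empirical mean can be used to pick which pairs (j, k) go into C — the 0.75 agreement threshold is an assumed hyperparameter, not a value from the embodiment:

```python
def agreement_rate(L, j, k):
    """Fraction of data points on which labeling functions j and k agree:
    the empirical mean of the indicator 1{L[i][j] == L[i][k]}."""
    hits = sum(1 for row in L if row[j] == row[k])
    return hits / len(L)

def correlated_pairs(L, threshold=0.75):
    """Pairs (j, k) whose agreement rate exceeds an assumed threshold."""
    n = len(L[0])
    return [(j, k) for j in range(n) for k in range(j + 1, n)
            if agreement_rate(L, j, k) >= threshold]

# Initial label matrix: 4 data points, 3 labeling functions (None = abstain)
L = [[1, 1, 0],
     [0, 0, 1],
     [1, 1, None],
     [0, 0, 0]]
C = correlated_pairs(L)   # only functions 0 and 1 always agree
```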
A generative model of the labeling functions λ_j, j=1, 2, ..., n can be built from the pairwise correlations.

The core operation of this method is to model and integrate the noisy signals provided by a set of labeling functions, modeling each labeling function as a noisy "voter" whose errors are correlated with those of other labeling functions.

For each data point x_i in the data set {x_i | i=1, 2, ..., m}, the initial label vector is Λ_i = (Λ_{i,1}, Λ_{i,2}, ..., Λ_{i,n}). In this embodiment, the generative model built from the pairwise correlations is:

p_w(Λ, Y) = Z_w^{-1} exp( Σ_{i=1}^{m} w^T φ_i(Λ, y_i) )

where Z_w is a normalizing constant, φ_i(Λ, y_i) is the vector of factors — including the pairwise correlations of the labeling functions λ_j, j=1, 2, ..., n — for each data point in the data set, and w is the parameter vector of the generative model to be estimated, w ∈ R^{2n+|C|}.
The estimation module 204 is configured to estimate the probability labels Ŷ of the data set {x_i | i=1, 2, ..., m} from the generative model.

In this embodiment, the generative model is:

p_w(Λ, Y) = Z_w^{-1} exp( Σ_{i=1}^{m} w^T φ_i(Λ, y_i) )

To learn this model without access to the true labels, the negative log marginal probability given the observed initial label matrix Λ can be minimized:

ŵ = argmin_w −log Σ_Y p_w(Λ, Y)

This objective can be optimized by interleaving stochastic gradient descent steps with Gibbs sampling steps, yielding an estimate ŵ of the generative model's parameters w.

Once the estimate ŵ is determined, the generative model is determined; feeding the initial label matrix Λ into the generative model yields the probability labels Ŷ of the data set {x_i | i=1, 2, ..., m}.
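Once ŵ is fixed, the probability label for x_i is the posterior over y_i given the observed votes Λ_i. As a simplified sketch — assuming binary labels in {−1, +1}, an empty correlation set C, and only the learned per-function accuracy weights — the posterior reduces to a weighted vote passed through a sigmoid:

```python
import math

def prob_label(votes, w_acc):
    """Posterior P(y = +1 | votes) for y in {-1, +1}, assuming independent
    labeling functions with accuracy factor exp(w_j * 1{vote_j = y}); a
    vote of 0 means the labeling function abstained. This is a simplified
    special case of the full factor-graph posterior (C taken as empty)."""
    # log P(+1 | votes) - log P(-1 | votes) = sum_j w_j * vote_j
    score = sum(w * v for w, v in zip(w_acc, votes) if v != 0)
    return 1.0 / (1.0 + math.exp(-score))

w_hat = [1.5, 0.5, 0.2]       # estimated accuracy weights, one per labeling function
votes = [+1, -1, 0]           # two labeling functions disagree, one abstains
p = prob_label(votes, w_hat)  # the higher-weight function dominates the vote
```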
The construction module 203 and the estimation module 204 thus use the generative model to denoise the initial labels Λ_{i,j} = λ_j(x_i), obtaining the probability labels of the data set {x_i | i=1, 2, ..., m}. The data set carrying the probability labels is the resulting training data of the machine learning system.
The training module 205 is configured to train the discriminant model of the machine learning system according to the probability labels, obtaining a trained discriminant model.

Training the discriminant model of the machine learning system according to the probability labels means using the data set carrying the probability labels Ŷ as training samples for the discriminant model.

The ultimate goal of training on the probability labels Ŷ is a discriminant model that generalizes beyond the information expressed by the labeling functions. The discriminant model h_θ can be trained on the probability labels by minimizing the noise-aware variant of its loss function l(h_θ(x_i), y), i.e. the expected loss with respect to Ŷ:

θ̂ = argmin_θ Σ_{i=1}^{m} E_{y∼Ŷ}[ l(h_θ(x_i), y) ]

When the discriminant model is trained, the parameters of the discriminant model are adjusted so that this noise-aware loss reaches its minimum. The training process may use the RMSprop algorithm, an improved variant of stochastic gradient descent. RMSprop is well known in the art and is not described further here.
The classification module 206 is configured to input the data to be classified into the trained discriminant model to obtain the category of the data to be classified.

In an application scenario where the data classification method is applied to image classification, the image to be classified is input into the trained discriminant model to obtain the category of the image to be classified: for example, the user corresponding to the image, the object shown in the image, or the face attributes of the image.

In an application scenario where the data classification method is applied to text classification, the text to be classified is input into the trained discriminant model to obtain the category of the text to be classified: for example, the sentiment orientation of the text (such as positive or negative), its subject matter, or its technical field.

In an application scenario where the data classification method is applied to speech classification, the speech to be classified is input into the trained discriminant model to obtain the category of the speech to be classified: for example, the user corresponding to the speech, the speaker's age group, or the emotion of the speech.
The data classification device 20 of Embodiment 2 obtains the data set {x_i | i=1, 2, ..., m} to be labeled; labels the data set with the labeling functions λ_j, j=1, 2, ..., n, obtaining the initial labels Λ_{i,j} = λ_j(x_i), i=1, 2, ..., m, j=1, 2, ..., n; computes the pairwise correlations of the labeling functions from the initial labels and builds a generative model of the labeling functions from those pairwise correlations; estimates the probability labels of the data set from the generative model; trains the discriminant model of the machine learning system with the probability labels, obtaining a trained discriminant model; and inputs the data to be classified into the trained discriminant model, obtaining the category of the data to be classified. This embodiment can quickly generate the training data required by the discriminant model of a machine learning system, addressing the technical problems that manually labeled training data is hard to obtain, slow to produce, and of unguaranteed accuracy. It reduces the workload of manual labeling, improves the labeling efficiency and accuracy of the training data, allows a discriminant model to be trained quickly on that training data, and uses the discriminant model to achieve fast and accurate data classification.
In another embodiment, the data classification device 20 may further include a preprocessing module configured to preprocess the data set {x_i | i=1, 2, ..., m}.

Preprocessing the data set {x_i | i=1, 2, ..., m} may include filling in missing values in the data set.

The K-nearest-neighbor algorithm can be used: determine the K data points closest to the data point with the missing value (for example, closest under Euclidean distance), and estimate the missing value as a weighted average of the values of those K data points.

Alternatively, a prediction model can be used to predict the missing value; if the missing value is numeric, it can be filled with the mean, and if it is non-numeric, it can be filled with the mode.

Alternatively, mean substitution can be used. Preferably, because mean substitution rests on the assumption that values are missing completely at random and shrinks the variance and standard deviation of the data, the method may further include: multiplying the fill value obtained by mean substitution by a preset expansion coefficient, and taking the product as the final fill value. The preset expansion coefficient is set in advance and is greater than 1.

Other methods can also be used to fill in the missing values; for example, missing values can be filled by regression fitting or by interpolation.

Preprocessing the data set {x_i | i=1, 2, ..., m} may also include correcting outliers in the data set. Outliers are values that deviate markedly from the rest of the data.

Outliers can be corrected with the same methods used to fill missing values. For example, the K-nearest-neighbor algorithm can determine the K data points closest to the data point with the outlier (for example, closest under Euclidean distance), and the corrected value can be estimated as a weighted average of the values of those K data points. Alternatively, a prediction model can be used to predict the corrected value; if the outlier is numeric, it can be corrected with the mean, and if it is non-numeric, with the mode.

Alternatively, mean substitution can be used to replace outliers. Preferably, because mean substitution rests on the assumption that values are missing completely at random and shrinks the variance and standard deviation of the data, the method may further include: multiplying the corrected value obtained by mean substitution by a preset expansion coefficient, and taking the product as the final corrected value. The preset expansion coefficient is set in advance and is greater than 1.

Other methods can also be used to correct the outliers; for example, outliers can be corrected by regression fitting or by interpolation.

The method of correcting outliers may also differ from the method of filling missing values.

Preprocessing the data set may also include directly discarding data with missing values and/or data with outliers, which keeps the data set clean.
Embodiment 3

This embodiment provides a computer storage medium on which a computer program is stored. When the computer program is executed by a processor, the steps of the data classification method embodiment above are implemented, for example steps 101-106 shown in Fig. 1:

Step 101: obtain the data set {x_i | i=1, 2, ..., m} to be labeled;

Step 102: label the data set with the labeling functions λ_j, j=1, 2, ..., n, obtaining the initial labels Λ_{i,j} = λ_j(x_i), i=1, 2, ..., m, j=1, 2, ..., n;

Step 103: compute the pairwise correlations of the labeling functions from the initial labels, and build a generative model of the labeling functions from the pairwise correlations;

Step 104: estimate the probability labels of the data set from the generative model;

Step 105: train the discriminant model of the machine learning system with the probability labels, obtaining a trained discriminant model;

Step 106: input the data to be classified into the trained discriminant model, obtaining the category of the data to be classified.
Alternatively, when the computer program is executed by the processor, the functions of the modules in the device embodiment above are implemented, for example modules 201-206 in FIG. 2:
An acquisition module 201, configured to acquire a data set {x_i | i = 1, 2, ..., m} to be labeled;
A labeling module 202, configured to label the data set with the labeling functions λ_j, j = 1, 2, ..., n, obtaining the initial labels Λ_{i,j} = λ_j(x_i), i = 1, 2, ..., m, j = 1, 2, ..., n;
A construction module 203, configured to compute the pairwise correlations of the labeling functions from the initial labels, and to build a generative model of the labeling functions from the pairwise correlations;
An estimation module 204, configured to estimate probability labels for the data set using the generative model;
A training module 205, configured to train the discriminative model of the machine learning system with the probability labels to obtain a trained discriminative model;
A classification module 206, configured to input the data to be classified into the trained discriminative model to obtain the category of the data to be classified.
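The construction module's pairwise correlation (step 103) is not given a concrete formula in the text. One plausible empirical measure, sketched below purely as an assumption, scores each pair of labeling functions by their agreement rate over the samples where neither abstains; the toy label matrix and the 2·agreement − 1 scaling are illustrative:

```python
import numpy as np

ABSTAIN = -1

def pairwise_correlation(L):
    """Empirical pairwise correlation of labeling functions: for each
    pair (j, k), restrict to samples where neither abstained and map
    their agreement rate from [0, 1] onto [-1, 1]."""
    m, n = L.shape
    C = np.eye(n)
    for j in range(n):
        for k in range(j + 1, n):
            both = (L[:, j] != ABSTAIN) & (L[:, k] != ABSTAIN)
            if both.sum() == 0:
                continue  # the pair never co-votes: no evidence either way
            agree = (L[both, j] == L[both, k]).mean()
            C[j, k] = C[k, j] = 2.0 * agree - 1.0
    return C

# A toy initial label matrix Lambda (rows = samples, columns = labeling
# functions; -1 marks an abstention)
L = np.array([[1,  1, -1],
              [0,  0,  0],
              [1, -1,  1],
              [0,  0,  1]])
print(pairwise_correlation(L))
```

Pairs with correlation near ±1 carry redundant signal, which is exactly what a generative model over the labeling functions must account for when weighting their votes.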
Embodiment 4
FIG. 3 is a schematic diagram of the computer device provided by Embodiment 4 of the present invention. The computer device 30 includes a memory 301, a processor 302, and a computer program 303, such as a data classification program, stored in the memory 301 and executable on the processor 302. When the processor 302 executes the computer program 303, the steps of the data classification method embodiment above are implemented, for example steps 101-106 shown in FIG. 1:
Step 101: acquire a data set {x_i | i = 1, 2, ..., m} to be labeled;
Step 102: label the data set with the labeling functions λ_j, j = 1, 2, ..., n, obtaining the initial labels Λ_{i,j} = λ_j(x_i), i = 1, 2, ..., m, j = 1, 2, ..., n;
Step 103: compute the pairwise correlations of the labeling functions from the initial labels, and build a generative model of the labeling functions from the pairwise correlations;
Step 104: estimate probability labels for the data set using the generative model;
Step 105: train the discriminative model of the machine learning system with the probability labels to obtain a trained discriminative model;
Step 106: input the data to be classified into the trained discriminative model to obtain the category of the data to be classified.
Alternatively, when the computer program is executed by the processor, the functions of the modules in the device embodiment above are implemented, for example modules 201-206 in FIG. 2:
An acquisition module 201, configured to acquire a data set {x_i | i = 1, 2, ..., m} to be labeled;
A labeling module 202, configured to label the data set with the labeling functions λ_j, j = 1, 2, ..., n, obtaining the initial labels Λ_{i,j} = λ_j(x_i), i = 1, 2, ..., m, j = 1, 2, ..., n;
A construction module 203, configured to compute the pairwise correlations of the labeling functions from the initial labels, and to build a generative model of the labeling functions from the pairwise correlations;
An estimation module 204, configured to estimate probability labels for the data set using the generative model;
A training module 205, configured to train the discriminative model of the machine learning system with the probability labels to obtain a trained discriminative model;
A classification module 206, configured to input the data to be classified into the trained discriminative model to obtain the category of the data to be classified.
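The text does not specify how the discriminative model consumes probability labels in step 105. A common choice, sketched here as an assumption rather than the patented implementation, is a noise-aware loss: logistic regression trained with soft cross-entropy against the probability labels instead of hard 0/1 targets. The features, learning rate, and epoch count are illustrative:

```python
import numpy as np

def train_soft_logistic(X, p, lr=0.5, epochs=200):
    """Logistic regression trained against probability labels p in
    [0, 1] (soft cross-entropy). The gradient with respect to the
    logit is simply q - p, where q is the predicted P(y = 1 | x)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        q = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = q - p                       # soft-label residual
        w -= lr * (X.T @ grad) / len(p)    # batch gradient step
        b -= lr * grad.mean()
    return w, b

# Toy one-dimensional features with probability labels as the
# generative model might produce them (values are illustrative)
X = np.array([[0.9], [0.1], [0.8], [0.2]])
p = np.array([0.9, 0.1, 0.8, 0.3])

w, b = train_soft_logistic(X, p)

# Step 106: classify new data with the trained discriminative model
score = 1.0 / (1.0 + np.exp(-(np.array([[0.85]]) @ w + b)))
print(score)
```

The same soft-target idea carries over to neural classifiers by using the probability labels as the targets of a cross-entropy loss.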
Exemplarily, the computer program 303 may be divided into one or more modules, which are stored in the memory 301 and executed by the processor 302 to carry out the method. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, the segments being used to describe the execution of the computer program 303 in the computer device 30. For example, the computer program 303 may be divided into the acquisition module 201, the labeling module 202, the construction module 203, the estimation module 204, the training module 205, and the classification module 206 of FIG. 2; see Embodiment 2 for the specific functions of each module.
The computer device 30 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. Those skilled in the art will understand that FIG. 3 is merely an example of the computer device 30 and does not limit it; the device may include more or fewer components than shown, combine certain components, or use different components. For example, the computer device 30 may further include input/output devices, network access devices, buses, and the like.
The processor 302 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor 302 may be any conventional processor. The processor 302 is the control center of the computer device 30, connecting the various parts of the entire device through various interfaces and lines.
The memory 301 may be used to store the computer program 303; the processor 302 implements the various functions of the computer device 30 by running or executing the computer programs or modules stored in the memory 301 and by invoking the data stored therein. The memory 301 may mainly include a program storage area and a data storage area: the program storage area may store the operating system and the application programs required by at least one function (such as a sound playback function or an image playback function), while the data storage area may store data created through use of the computer device 30 (such as audio data or a phone book). In addition, the memory 301 may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
If the integrated modules of the computer device 30 are implemented as software functional modules and sold or used as independent products, they may be stored in a computer storage medium. Based on this understanding, the present invention implements all or part of the processes of the method embodiments above, which may also be accomplished by a computer program instructing the relevant hardware: the computer program may be stored in a computer storage medium, and when executed by a processor it implements the steps of each of the method embodiments above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice within a jurisdiction; for example, in some jurisdictions, computer-readable media exclude electrical carrier signals and telecommunication signals in accordance with legislation and patent practice.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in actual implementation.
The modules described as separate components may or may not be physically separate, and the components displayed as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing module, may each exist physically on their own, or two or more modules may be integrated into one module. The integrated modules above may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
The integrated modules implemented as software functional modules may be stored in a computer storage medium. Such software functional modules are stored in a storage medium and include several instructions causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute some of the steps of the methods described in the embodiments of the present invention.
It will be apparent to those skilled in the art that the invention is not limited to the details of the exemplary embodiments above and can be embodied in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as exemplary and non-restrictive, the scope of the invention being defined by the appended claims rather than by the foregoing description; all changes falling within the meaning and range of equivalents of the claims are therefore intended to be embraced by the invention. Any reference sign in a claim should not be construed as limiting the claim concerned. Moreover, the word "comprising" does not exclude other modules or steps, and the singular does not exclude the plural. Multiple modules or devices stated in the system claims may also be implemented by a single module or device through software or hardware. Words such as "first" and "second" denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art will understand that the technical solutions of the present invention may be modified or equivalently replaced without departing from their spirit and scope.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910310574.1A CN110196908A (en) | 2019-04-17 | 2019-04-17 | Data classification method, device, computer installation and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910310574.1A CN110196908A (en) | 2019-04-17 | 2019-04-17 | Data classification method, device, computer installation and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110196908A true CN110196908A (en) | 2019-09-03 |
Family
ID=67752025
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910310574.1A Pending CN110196908A (en) | 2019-04-17 | 2019-04-17 | Data classification method, device, computer installation and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110196908A (en) |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110648325A (en) * | 2019-09-29 | 2020-01-03 | 北部湾大学 | Nixing pottery big data analysis genuine product checking system |
CN111291823A (en) * | 2020-02-24 | 2020-06-16 | 腾讯科技(深圳)有限公司 | Fusion method and device of classification models, electronic equipment and storage medium |
CN111582360A (en) * | 2020-05-06 | 2020-08-25 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for labeling data |
CN111860572A (en) * | 2020-06-04 | 2020-10-30 | 北京百度网讯科技有限公司 | Data set distillation method, device, electronic equipment and storage medium |
CN111967541A (en) * | 2020-10-21 | 2020-11-20 | 上海冰鉴信息科技有限公司 | Data classification method and device based on multi-platform samples |
CN112102062A (en) * | 2020-07-24 | 2020-12-18 | 北京淇瑀信息科技有限公司 | Risk assessment method and device based on weak supervised learning and electronic equipment |
CN112115267A (en) * | 2020-09-28 | 2020-12-22 | 平安科技(深圳)有限公司 | Training method, device and equipment of text classification model and storage medium |
CN112199502A (en) * | 2020-10-26 | 2021-01-08 | 网易(杭州)网络有限公司 | Emotion-based poetry sentence generation method and device, electronic equipment and storage medium |
CN112488141A (en) * | 2019-09-12 | 2021-03-12 | 中移(苏州)软件技术有限公司 | Method and device for determining application range of Internet of things card and computer readable storage medium |
CN112529024A (en) * | 2019-09-17 | 2021-03-19 | 株式会社理光 | Sample data generation method and device and computer readable storage medium |
CN112651447A (en) * | 2020-12-29 | 2021-04-13 | 广东电网有限责任公司电力调度控制中心 | Resource classification labeling method and system based on ontology |
CN112825144A (en) * | 2019-11-20 | 2021-05-21 | 深圳云天励飞技术有限公司 | Picture labeling method and device, electronic equipment and storage medium |
CN112860919A (en) * | 2021-02-20 | 2021-05-28 | 平安科技(深圳)有限公司 | Data labeling method, device and equipment based on generative model and storage medium |
CN112925958A (en) * | 2021-02-05 | 2021-06-08 | 深圳力维智联技术有限公司 | Multi-source heterogeneous data adaptation method, device, equipment and readable storage medium |
CN113344916A (en) * | 2021-07-21 | 2021-09-03 | 上海媒智科技有限公司 | Method, system, terminal, medium and application for acquiring machine learning model capability |
CN113590677A (en) * | 2021-07-14 | 2021-11-02 | 上海淇玥信息技术有限公司 | Data processing method and device and electronic equipment |
CN113761925A (en) * | 2021-07-23 | 2021-12-07 | 中国科学院自动化研究所 | Named Entity Recognition Method, Device and Device Based on Noise Awareness Mechanism |
CN113836118A (en) * | 2021-11-24 | 2021-12-24 | 亿海蓝(北京)数据技术股份公司 | Ship static data supplementing method and device, electronic equipment and readable storage medium |
CN114064973A (en) * | 2022-01-11 | 2022-02-18 | 人民网科技(北京)有限公司 | Video news classification model establishing method, classification method, device and equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104463202A (en) * | 2014-11-28 | 2015-03-25 | 苏州大学 | Multi-class image semi-supervised classifying method and system |
CN107220600A (en) * | 2017-05-17 | 2017-09-29 | 清华大学深圳研究生院 | A kind of Picture Generation Method and generation confrontation network based on deep learning |
CN108229543A (en) * | 2017-12-22 | 2018-06-29 | 中国科学院深圳先进技术研究院 | Image classification design methods and device |
US20190080164A1 (en) * | 2017-09-14 | 2019-03-14 | Chevron U.S.A. Inc. | Classification of character strings using machine-learning |
- 2019-04-17: CN application CN201910310574.1A filed; published as CN110196908A (en); status active, Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104463202A (en) * | 2014-11-28 | 2015-03-25 | 苏州大学 | Multi-class image semi-supervised classifying method and system |
CN107220600A (en) * | 2017-05-17 | 2017-09-29 | 清华大学深圳研究生院 | A kind of Picture Generation Method and generation confrontation network based on deep learning |
US20190080164A1 (en) * | 2017-09-14 | 2019-03-14 | Chevron U.S.A. Inc. | Classification of character strings using machine-learning |
CN108229543A (en) * | 2017-12-22 | 2018-06-29 | 中国科学院深圳先进技术研究院 | Image classification design methods and device |
Non-Patent Citations (2)
Title |
---|
KAI FAN et al.: "Learning a generative classifier from label proportions", Neurocomputing, pages 47-55 *
JIANG Junzhao; CHENG Lianglun; LI Quanjie: "A multi-label classification algorithm for convolutional neural networks based on label correlation", Industrial Control Computer, no. 07, pages 108-109 *
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112488141B (en) * | 2019-09-12 | 2023-04-07 | 中移(苏州)软件技术有限公司 | Method and device for determining application range of Internet of things card and computer readable storage medium |
CN112488141A (en) * | 2019-09-12 | 2021-03-12 | 中移(苏州)软件技术有限公司 | Method and device for determining application range of Internet of things card and computer readable storage medium |
CN112529024A (en) * | 2019-09-17 | 2021-03-19 | 株式会社理光 | Sample data generation method and device and computer readable storage medium |
CN110648325B (en) * | 2019-09-29 | 2022-05-17 | 北部湾大学 | An authentic inspection system for Nixing pottery pottery big data analysis |
CN110648325A (en) * | 2019-09-29 | 2020-01-03 | 北部湾大学 | Nixing pottery big data analysis genuine product checking system |
CN112825144A (en) * | 2019-11-20 | 2021-05-21 | 深圳云天励飞技术有限公司 | Picture labeling method and device, electronic equipment and storage medium |
CN112825144B (en) * | 2019-11-20 | 2024-06-07 | 深圳云天励飞技术有限公司 | Picture marking method and device, electronic equipment and storage medium |
CN111291823A (en) * | 2020-02-24 | 2020-06-16 | 腾讯科技(深圳)有限公司 | Fusion method and device of classification models, electronic equipment and storage medium |
CN111291823B (en) * | 2020-02-24 | 2023-08-18 | 腾讯科技(深圳)有限公司 | Fusion method and device of classification model, electronic equipment and storage medium |
CN111582360A (en) * | 2020-05-06 | 2020-08-25 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for labeling data |
CN111582360B (en) * | 2020-05-06 | 2023-08-15 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for labeling data |
CN111860572B (en) * | 2020-06-04 | 2024-01-26 | 北京百度网讯科技有限公司 | Data set distillation method, device, electronic equipment and storage medium |
CN111860572A (en) * | 2020-06-04 | 2020-10-30 | 北京百度网讯科技有限公司 | Data set distillation method, device, electronic equipment and storage medium |
CN112102062A (en) * | 2020-07-24 | 2020-12-18 | 北京淇瑀信息科技有限公司 | Risk assessment method and device based on weak supervised learning and electronic equipment |
CN112115267A (en) * | 2020-09-28 | 2020-12-22 | 平安科技(深圳)有限公司 | Training method, device and equipment of text classification model and storage medium |
CN112115267B (en) * | 2020-09-28 | 2023-07-07 | 平安科技(深圳)有限公司 | Training method, device, equipment and storage medium of text classification model |
CN111967541A (en) * | 2020-10-21 | 2020-11-20 | 上海冰鉴信息科技有限公司 | Data classification method and device based on multi-platform samples |
CN111967541B (en) * | 2020-10-21 | 2021-01-05 | 上海冰鉴信息科技有限公司 | Data classification method and device based on multi-platform samples |
CN112199502B (en) * | 2020-10-26 | 2024-03-15 | 网易(杭州)网络有限公司 | Verse generation method and device based on emotion, electronic equipment and storage medium |
CN112199502A (en) * | 2020-10-26 | 2021-01-08 | 网易(杭州)网络有限公司 | Emotion-based poetry sentence generation method and device, electronic equipment and storage medium |
CN112651447B (en) * | 2020-12-29 | 2023-09-26 | 广东电网有限责任公司电力调度控制中心 | Ontology-based resource classification labeling method and system |
CN112651447A (en) * | 2020-12-29 | 2021-04-13 | 广东电网有限责任公司电力调度控制中心 | Resource classification labeling method and system based on ontology |
CN112925958A (en) * | 2021-02-05 | 2021-06-08 | 深圳力维智联技术有限公司 | Multi-source heterogeneous data adaptation method, device, equipment and readable storage medium |
WO2022174496A1 (en) * | 2021-02-20 | 2022-08-25 | 平安科技(深圳)有限公司 | Data annotation method and apparatus based on generative model, and device and storage medium |
CN112860919A (en) * | 2021-02-20 | 2021-05-28 | 平安科技(深圳)有限公司 | Data labeling method, device and equipment based on generative model and storage medium |
CN113590677A (en) * | 2021-07-14 | 2021-11-02 | 上海淇玥信息技术有限公司 | Data processing method and device and electronic equipment |
CN113344916A (en) * | 2021-07-21 | 2021-09-03 | 上海媒智科技有限公司 | Method, system, terminal, medium and application for acquiring machine learning model capability |
CN113761925B (en) * | 2021-07-23 | 2022-10-28 | 中国科学院自动化研究所 | Named entity identification method, device and equipment based on noise perception mechanism |
CN113761925A (en) * | 2021-07-23 | 2021-12-07 | 中国科学院自动化研究所 | Named Entity Recognition Method, Device and Device Based on Noise Awareness Mechanism |
CN113836118B (en) * | 2021-11-24 | 2022-03-08 | 亿海蓝(北京)数据技术股份公司 | Ship static data supplementing method and device, electronic equipment and readable storage medium |
CN113836118A (en) * | 2021-11-24 | 2021-12-24 | 亿海蓝(北京)数据技术股份公司 | Ship static data supplementing method and device, electronic equipment and readable storage medium |
CN114064973B (en) * | 2022-01-11 | 2022-05-03 | 人民网科技(北京)有限公司 | Video news classification model establishing method, classification method, device and equipment |
CN114064973A (en) * | 2022-01-11 | 2022-02-18 | 人民网科技(北京)有限公司 | Video news classification model establishing method, classification method, device and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110196908A (en) | Data classification method, device, computer installation and storage medium | |
WO2022105115A1 (en) | Question and answer pair matching method and apparatus, electronic device and storage medium | |
CN109800307B (en) | Analysis method, device, computer equipment and storage medium for product evaluation | |
US10909970B2 (en) | Utilizing a dynamic memory network to track digital dialog states and generate responses | |
WO2021232594A1 (en) | Speech emotion recognition method and apparatus, electronic device, and storage medium | |
CN111126574A (en) | Method and device for training machine learning model based on endoscopic image and storage medium | |
CN109036577B (en) | Diabetic complications analysis method and device | |
US20220269939A1 (en) | Graph-based labeling rule augmentation for weakly supervised training of machine-learning-based named entity recognition | |
CN113627797B (en) | Method, device, computer equipment and storage medium for generating staff member portrait | |
CN114547267B (en) | Method, device, computing device and storage medium for generating intelligent question-answering model | |
CN112015868A (en) | Question-answering method based on knowledge graph completion | |
JP7610573B2 (en) | New word classification technology | |
CN113707299A (en) | Auxiliary diagnosis method and device based on inquiry session and computer equipment | |
WO2022160442A1 (en) | Answer generation method and apparatus, electronic device, and readable storage medium | |
CN111832312A (en) | Text processing method, device, equipment and storage medium | |
CN111639500A (en) | Semantic role labeling method and device, computer equipment and storage medium | |
US10885593B2 (en) | Hybrid classification system | |
CN113435180B (en) | Text error correction method and device, electronic equipment and storage medium | |
WO2021244099A1 (en) | Voice editing method, electronic device and computer readable storage medium | |
CN113807973A (en) | Text error correction method and device, electronic equipment and computer readable storage medium | |
WO2021147404A1 (en) | Dependency relationship classification method and related device | |
WO2023178979A1 (en) | Question labeling method and apparatus, electronic device and storage medium | |
WO2022141867A1 (en) | Speech recognition method and apparatus, and electronic device and readable storage medium | |
CN118916491A (en) | Emotion analysis method, device, system and medium | |
WO2022222228A1 (en) | Method and apparatus for recognizing bad textual information, and electronic device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20190903 |
WD01 | Invention patent application deemed withdrawn after publication |