CN104699685B

CN104699685B - Model modification device and method, data processing equipment and method, program

Info

Publication number: CN104699685B
Application number: CN201310647831.3A
Authority: CN
Inventors: 夏迎炬; 孙健; 王云芝; 李中华
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2013-12-04
Filing date: 2013-12-04
Publication date: 2018-02-09
Anticipated expiration: 2033-12-04
Also published as: CN104699685A

Abstract

The present application discloses a model update device and method, and a data processing device and method, for updating a target model in a multi-model system, where each model in the multi-model system is pre-trained on a training data set in a different way. The model update device includes: a pseudo-label acquisition unit that processes the test data set with a calibration model and takes the processing result as pseudo-labels; a first feature distribution acquisition unit that obtains the feature distribution of the test data set based on the pseudo-labels; a second feature distribution acquisition unit that obtains the feature distribution of the training data set based on the target model; an adjustment unit that adjusts the feature space division of the target model based on the feature distribution of the training data set and the feature distribution of the test data set, so that the training data set and the test data set have similar distributions over that division; and an update unit that updates the target model using the training data set based on the adjusted feature space division.

Description

Model update device and method, data processing device and method, program

Technical Field

The present application relates to the field of data processing, and in particular to a model update device and a model update method for updating a model used in data processing, and to a data processing device and a data processing method that use the model update device and model update method.

Background

As society advances and information technology develops rapidly, processing massive amounts of information efficiently has become particularly important. Using computers to process data, for example for entity relationship extraction, object recognition, and data mining, has become common practice. Such processing is usually based on one or more models. For example, in entity relationship extraction, machine-learning methods extract features from relationship examples, train a model on those examples, and use the trained model to identify entity relationships at extraction time. In other words, the models used are generally trained on a training data set.

As described above, processing a test data set with a trained model rests on the assumption that the training data set and the test data set are homogeneous. In many cases this assumption does not fully hold, so a model obtained from the training data set is poorly suited to the test data set; the problem is especially pronounced when new data is generated. Moreover, labeling examples in the new data takes a great deal of time and labor, which makes updating the model costly.

Summary of the Invention

A brief overview of the invention is given below in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor to delineate the scope of the invention. Its purpose is merely to present some concepts in a simplified form as a prelude to the more detailed description that follows.

In view of the needs described in the Background section, the present invention focuses on how to update a model at low cost so that it stays close to the real data, thereby improving system performance. To this end, the present invention proposes a model update device and method that update a model through feature space division based on pseudo-feedback: the difference between the test data set and the training data set is used to adjust the feature space division of the model trained on the training data set, and the adjusted feature space division is used to update the model.

According to one aspect of the present invention, there is provided a model update device for updating a target model in a multi-model system, where each model in the multi-model system is pre-trained on a training data set in a different way. The model update device includes: a pseudo-label acquisition unit configured to process the test data set using, as a calibration model, a model in the multi-model system other than the target model, and to take the processing result as pseudo-labels; a first feature distribution acquisition unit configured to obtain the feature distribution of the test data set based on the pseudo-labels; a second feature distribution acquisition unit configured to obtain the feature distribution of the training data set based on the target model; an adjustment unit configured to adjust the feature space division of the target model based on the feature distribution of the training data set and the feature distribution of the test data set, so that the training data set and the test data set have similar distributions over that division; and an update unit configured to update the target model using the training data set based on the adjusted feature space division.

According to another aspect of the present invention, there is provided a data processing device that processes a test data set using a multi-model system and includes the above model update device.

According to yet another aspect of the present invention, there is provided a model update method for updating a target model in a multi-model system, where each model in the multi-model system is pre-trained on a training data set in a different way. The model update method includes: processing the test data set using, as a calibration model, a model in the multi-model system other than the target model, and taking the processing result as pseudo-labels; obtaining the feature distribution of the test data set based on the pseudo-labels; obtaining the feature distribution of the training data set based on the target model; adjusting the feature space division of the target model based on the feature distribution of the training data set and the feature distribution of the test data set, so that the training data set and the test data set have similar distributions over that division; and updating the target model using the training data set based on the adjusted feature space division.

According to still another aspect of the present invention, there is provided a data processing method that processes a test data set using a multi-model system and includes the above model update method.

According to other aspects of the present invention, corresponding computer program code, a computer-readable storage medium, and a computer program product are also provided.

These and other advantages of the present invention will become more apparent from the following detailed description of preferred embodiments of the present invention with reference to the accompanying drawings.

Brief Description of the Drawings

To further set forth the above and other advantages and features of the present application, specific embodiments of the present application are described in further detail below with reference to the accompanying drawings. The drawings, together with the detailed description below, are incorporated in and form a part of this specification. Elements having the same function and structure are denoted by the same reference numerals. It should be understood that these drawings depict only typical examples of the application and should not be considered as limiting its scope. In the drawings:

FIG. 1 is a structural block diagram showing a model update device according to an embodiment of the present application;

FIG. 2 is a schematic diagram showing an example of a comparison between the feature distribution of a training data set and the feature distribution of a test data set according to an embodiment of the present application;

FIG. 3 is a structural block diagram showing an adjustment unit according to an embodiment of the present application;

FIG. 4 is a structural block diagram showing a data processing device according to an embodiment of the present application;

FIG. 5 is a structural block diagram showing a data processing device according to another embodiment of the present application;

FIG. 6 is a schematic diagram showing several tables collected in a specific example of entity relationship mining;

FIG. 7 is a flowchart showing a model update method according to an embodiment of the present application;

FIG. 8 is a flowchart showing the adjustment step in the model update method according to an embodiment of the present application; and

FIG. 9 is a block diagram of an exemplary structure of a general-purpose personal computer in which methods and/or devices according to embodiments of the present invention can be implemented.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings. In the interest of clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that in developing any such actual embodiment, many implementation-specific decisions must be made in order to achieve the developer's specific goals, such as complying with system-related and business-related constraints, and that these constraints may vary from one implementation to another. Moreover, it should also be understood that such development work, while potentially complex and time-consuming, would be a routine undertaking for those skilled in the art having the benefit of this disclosure.

It should also be noted here that, to avoid obscuring the present invention with unnecessary detail, only the device structures and/or processing steps closely related to the solution according to the present invention are shown in the drawings, and other details of little relevance to the present invention are omitted.

The description below proceeds in the following order:

1. Model update device

2. Data processing device

3. Model update method and data processing method

4. Computing device for implementing the devices and methods of the present application

[1. Model update device]

First, the configuration and structure of a model update device 100 according to an embodiment of the present application are described with reference to FIG. 1. The model update device 100 updates a target model in a multi-model system, where each model in the multi-model system is pre-trained on a training data set in a different way. These models, individually and in combination, are used to process the test data. The model update device 100 according to this embodiment can apply a pseudo-feedback-based update to each of the models, or to at least some of them, thereby improving the performance of the individual models and, in turn, the performance of the model obtained by fusing them. Hereinafter, the model to be updated is sometimes referred to as the target model.

As shown in FIG. 1, the model update device 100 includes: a pseudo-label acquisition unit 101 configured to process the test data set using, as a calibration model, a model in the multi-model system other than the target model, and to take the processing result as pseudo-labels; a first feature distribution acquisition unit 102 configured to obtain the feature distribution of the test data set based on the pseudo-labels; a second feature distribution acquisition unit 103 configured to obtain the feature distribution of the training data set based on the target model; an adjustment unit 104 configured to adjust the feature space division of the target model based on the feature distribution of the training data set and the feature distribution of the test data set, so that the training data set and the test data set have similar distributions over that division; and an update unit 105 configured to update the target model using the training data set based on the adjusted feature space division.

In the model training phase, multiple models are trained in different ways on the existing training data set. For example, the models may include models obtained with one or more of the following methods: support vector machine (SVM), random forest, decision tree, k-nearest neighbors (KNN), boosting, naive Bayes, and maximum entropy. It should be noted that these algorithms are only examples; models can also be obtained with various other algorithms.

During model training, the performance of each model on the training set can also be obtained. The multi-model system can serve a variety of applications, such as entity relationship extraction, topic classification, and object recognition. A trained multi-model system is suitable for a test data set that is homogeneous with the training data set, but the test data set often differs from the training data set, so the trained multi-model system has difficulty producing accurate results on it.

Therefore, the model update device 100 shown in FIG. 1 can be used to update each model in the multi-model system based on the test data set. Specifically, one model in the multi-model system is first selected as the target model, that is, the model to be updated or adjusted. A calibration model, whose processing results are used to adjust the target model, is then selected from the remaining models.

As an example, suppose a multi-model system includes a maximum entropy model, an SVM, a random forest, and a KNN model, and the maximum entropy model is to be adjusted to improve system performance. The maximum entropy model is then the target model, and the SVM, random forest, and KNN models are all candidates for the calibration model.

In one embodiment, the calibration model may be the best-performing model in the multi-model system other than the target model. In the example above, if the SVM performs best among the SVM, random forest, and KNN models, the SVM can be selected as the calibration model.
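The "best-performing model other than the target" rule can be sketched as follows. This is only an illustration: the function name `pick_calibration_model` and the accuracy figures are hypothetical, and the text does not specify which performance metric is used.

```python
def pick_calibration_model(train_scores, target):
    """Return the name of the best-scoring model other than the target.

    train_scores maps a model name to its performance on the training set
    (the metric itself is an assumption; the text only says "performance").
    """
    candidates = {name: s for name, s in train_scores.items() if name != target}
    if not candidates:
        raise ValueError("need at least one model besides the target")
    return max(candidates, key=candidates.get)


# Hypothetical training-set scores for the four-model example.
scores = {"maxent": 0.81, "svm": 0.88, "random_forest": 0.85, "knn": 0.79}
calibration = pick_calibration_model(scores, target="maxent")
print(calibration)
```

With these made-up scores the SVM is chosen, matching the example in the text.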

Alternatively, a model obtained with an algorithm that is complementary in principle to that of the target model can be selected as the calibration model. Complementary in principle means that the principles underlying the algorithms, or the information in the data that the algorithms exploit, substantially complement each other. For example, traditional generative models such as Bayesian networks and maximum entropy tend to represent the distribution of the data from a statistical point of view and reflect the similarity among data of the same class, whereas traditional discriminative models such as SVM and KNN look for the optimal decision surface between classes and reflect the differences between data of different classes. From this point of view, models obtained with these two kinds of algorithm can be regarded as complementary. In contrast, random forest is an extension of the decision tree algorithm and rests on the same principle, so the two are not complementary.

In addition, the calibration model may be a model obtained by fusing the models in the multi-model system other than the target model. The fusion method is described in detail later.

After the calibration model is selected, the pseudo-label acquisition unit 101 processes the test data set with the calibration model and takes the processing result as pseudo-labels. The pseudo-labels are then used as feedback to adjust the feature space division of the target model, where the feature space denotes the range of the feature values.

Specifically, the first feature distribution acquisition unit 102 treats the obtained pseudo-labels as correct processing results and obtains the feature distribution of the test data set based on them. Similarly, the second feature distribution acquisition unit 103 obtains the feature distribution of the training data set based on the target model.
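One way to picture a feature distribution built from labels is a normalized histogram over (feature bin, class) pairs. The sketch below assumes a single scalar feature in [0, 1] and equal-width bins; `feature_distribution` and the sample values are hypothetical and not taken from the patent.

```python
from collections import Counter


def feature_distribution(values, labels, n_bins=10, lo=0.0, hi=1.0):
    """Estimate the joint distribution P(bin, label) from labeled samples."""
    width = (hi - lo) / n_bins
    counts = Counter()
    for v, y in zip(values, labels):
        # Clamp so that v == hi falls into the last bin.
        b = min(int((v - lo) / width), n_bins - 1)
        counts[(b, y)] += 1
    total = sum(counts.values())
    return {key: c / total for key, c in counts.items()}


# Test samples labeled by the calibration model (pseudo-labels).
test_values = [0.05, 0.15, 0.55, 0.95]
pseudo_labels = ["pos", "pos", "neg", "neg"]
dist = feature_distribution(test_values, pseudo_labels, n_bins=2)
print(dist)
```

The same function could be applied to the training samples with the classes assigned by the target model, which gives the second distribution to compare against.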

FIG. 2 shows an example comparing the feature distribution of the test data set with that of the training data set according to an embodiment of the present application. The horizontal axis represents the feature value and the vertical axis represents the sample category. The solid circles and triangles at the top show the pseudo-label-based feature distribution of the test data set, and the hollow circles and triangles at the bottom show the feature distribution of the training samples. The circles and triangles denote two sample categories, which are determined from the pseudo-labels for the test data set and from the target model for the training data set.

It can be seen that the sample distributions of the test data set and the training data set differ. The adjustment unit 104 then adjusts the feature space division of the target model based on this difference, so that the training data set and the test data set have similar distributions over that division. In other words, the adjustment unit 104 adjusts the target model using the characteristics of the test data, so that the adjusted target model takes into account the difference between the test data set and the training data set, which makes its processing results more accurate and improves its performance.

The update unit 105 then retrains on the adjusted feature space division using the training data set to adjust the parameters, yielding the updated model. In one embodiment, the model update device 100 may be configured to update periodically, but it is not limited to this; the update interval or the way an update is triggered can be chosen according to the application scenario and the processing capability of the equipment. In addition, the model update device 100 can update some or all of the models in the multi-model system.

The structure and configuration of the adjustment unit 104 according to an embodiment of the present application are described below with reference to FIG. 3. The adjustment unit 104 shown in FIG. 3 includes: a partition module 4001 configured to divide the feature space of the training data set and the test data set into multiple regions; a distribution calculation module 4002 configured to calculate, based on the feature distribution of the training data set and the feature distribution of the test data set respectively, the distribution of one or more adjacent regions on the training data set and on the test data set; a distance calculation module 4003 configured to calculate the distance between these two distributions for the one or more adjacent regions; and a merging module 4004 configured to merge the one or more adjacent regions into one division of the feature space when the distance is less than a predetermined threshold.

Still taking FIG. 2 as an example, the partition module 4001 divides the feature space represented by the horizontal axis into multiple regions. For example, if the feature space is [0, 1], it can be divided into 100 regions. The number of regions determines the granularity of the adjustment: the more regions, the finer the granularity.
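The initial partition can be sketched as equal-width regions over the feature range. Equal widths are an assumption here (the text only gives the count of 100 regions), and the helper names `region_edges` and `region_index` are hypothetical.

```python
def region_edges(lo, hi, n_regions):
    """Boundaries of n_regions equal-width regions covering [lo, hi]."""
    step = (hi - lo) / n_regions
    return [lo + i * step for i in range(n_regions + 1)]


def region_index(value, lo, hi, n_regions):
    """Index of the region containing value; value == hi maps to the last region."""
    step = (hi - lo) / n_regions
    return min(int((value - lo) / step), n_regions - 1)


edges = region_edges(0.0, 1.0, 100)   # 100 regions, hence 101 boundaries
idx = region_index(0.255, 0.0, 1.0, 100)
print(len(edges), idx)
```

Raising `n_regions` gives a finer starting grid, at the cost of fewer samples per region when the distributions are estimated.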

Then, starting from the first region, the distribution calculation module 4002 calculates the distribution of one or more adjacent regions on the training data set and on the test data set, based on the feature distribution of the training data set and the feature distribution of the test data set respectively. This is equivalent to a sliding window in which the size of each region is the sliding step; hereinafter, the one or more regions inside the window are called an interval. The distance calculation module 4003 calculates the distance between the two distributions computed by the distribution calculation module 4002, and when this distance is less than the predetermined threshold, the merging module 4004 merges the corresponding one or more regions into one division of the feature space. Repeating this process re-divides the entire feature space.

The distance calculation module 4003 can calculate the distance between distributions in various ways, such as the chi-square distance, KL distance, Resistor distance, or Topsoe distance. Taking the KL distance as an example, it can be calculated according to the following formula (1):

Here, P_tr denotes the probability distribution of the training data set, P_ts denotes the probability distribution of the test data set, f_i denotes a feature value in the interval under consideration, and y denotes the set of sample categories. As shown in FIG. 2, a KL distance curve can be obtained from formula (1); the feature values at the peaks of the curve are the positions at which to divide, and the region between two adjacent peaks is merged into one division of the feature space.
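Formula (1) itself is not reproduced in this text, so the sketch below uses the textbook KL divergence, summed over the (feature value, category) pairs of an interval, as a stand-in; whether the patent uses exactly this form is an assumption. The function name `kl_distance` and the smoothing constant `eps` are likewise hypothetical.

```python
import math


def kl_distance(p_tr, p_ts, eps=1e-9):
    """KL divergence of p_tr from p_ts over one interval.

    Both arguments are dicts keyed by (feature_bin, label) holding
    probabilities restricted to the interval. eps avoids log(0) and
    division by zero for pairs missing from one side.
    """
    total = 0.0
    for key in set(p_tr) | set(p_ts):
        p = p_tr.get(key, 0.0) + eps
        q = p_ts.get(key, 0.0) + eps
        total += p * math.log(p / q)
    return total


train = {(0, "pos"): 0.5, (0, "neg"): 0.5}
test = {(0, "pos"): 0.9, (0, "neg"): 0.1}
same = kl_distance(train, train)   # identical distributions
diff = kl_distance(train, test)    # class balance shifted in the test data
print(same, diff)
```

A small value suggests the interval looks alike in both data sets and can be merged; a large value marks a candidate division point.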

When the data for a certain interval is missing from the test data set, that is, when the test data set has no samples within that feature value interval, the distance calculation module 4003 uses a Bayesian measure to compute the distribution of the one or more adjacent regions (that is, the interval) on the training data set and takes this as the distance. The reason is that, because the corresponding data is missing from the test data set, it is desirable to keep using the original feature space division. The Bayesian measure is given by the following formula (2).

Here, y_j denotes the j-th sample category, y_k denotes the k-th sample category, there are m sample categories in total, and f denotes the feature value.

Pseudocode for the above algorithm is given below as an example. It should be understood that it is provided for illustration only and is not intended to be limiting.

Initialization:
    Set the interval-merging step size T
    Set the feature value range [MINB, MAXB]; set lowb = MINB, upb = lowb + T
    Set the Bayesian threshold θb, the Bayesian distribution Bp of the previous interval, and the distribution Bc of the current interval
    Set the KL distance threshold θd, the KL value Dp of the previous interval, and the KL value Dc of the current interval

while upb < MAXB
    compute Bp, Dp
    upb = lowb + T
    compute Bc, Dc
    if <split condition on Bc, Dc against θb, θd>
        divide [lowb, upb - T)
        lowb = upb - T
    else
        upb += T
    end if
end while

In the pseudocode above, the interval-merging step size T corresponds to the size of each region of the feature space. It determines the granularity of the adjustment and can be chosen according to the application or from empirical values. In addition, although the description above uses a Bayesian measure to compute the distribution over an interval when the data for that interval is missing from the test data set, the approach is not limited to this; other strategies, such as assigning a fixed value, can also be used.
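The merging loop can be sketched in Python under simplifying assumptions: a single distance threshold, no Bayesian fallback for intervals with missing test data, and no peak detection. The names `merge_regions` and `window_distance` are hypothetical, and the toy distance function stands in for a real train/test comparison such as the KL distance.

```python
def merge_regions(edges, window_distance, threshold):
    """Greedy left-to-right merge of adjacent regions.

    edges: sorted region boundaries (at least two values).
    window_distance(lo, hi): distance between the training and test
        distributions restricted to [lo, hi).
    Returns the boundaries of the new feature space division.
    """
    cuts = [edges[0]]
    lo = 0
    for hi in range(2, len(edges)):
        # Try to grow the window [edges[lo], edges[hi - 1]) by one region.
        if window_distance(edges[lo], edges[hi]) >= threshold:
            # Too different: close the window before the new region.
            cuts.append(edges[hi - 1])
            lo = hi - 1
    cuts.append(edges[-1])
    return cuts


def toy_distance(lo, hi):
    """Stand-in distance: large only for windows that straddle 0.5."""
    return 1.0 if lo < 0.5 < hi else 0.0


edges = [0.0, 0.25, 0.5, 0.75, 1.0]
new_edges = merge_regions(edges, toy_distance, threshold=0.5)
print(new_edges)
```

In this toy setup the four initial regions collapse into two divisions split at 0.5, the only place where the stand-in distance exceeds the threshold.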

如上所述，模型更新装置100通过利用多模型系统中不同于目标模型的校准模型来对待测数据集进行处理，并且基于该处理结果获得待测数据集与训练数据集的差异，根据该差异来调整目标模型的特征空间的划分，使得目标模型更贴近待测数据集，即目标模型对待测数据集进行正确处理的程度更高，从而提高了其性能。As mentioned above, the model updating device 100 processes the data set to be tested by using a calibration model different from the target model in the multi-model system, and obtains the difference between the data set to be tested and the training data set based on the processing result, and according to the difference Adjusting the division of the feature space of the target model makes the target model closer to the data set to be tested, that is, the target model has a higher degree of correct processing of the data set to be tested, thereby improving its performance.

[2.数据处理装置][2. Data processing device]

图4示出了根据本申请的一个实施例的数据处理装置200的配置，数据处理装置200使用多模型系统对待测数据集进行处理，包括如上所述的模型更新装置100。其中，为了不模糊本申请的精神和范围，图4省略了数据处理装置200中可能包括的其他通用部件，比如输入输出接口、显示单元等。此外，由于以上已经详细描述了模型更新装置100的结构与配置，在此不再赘述。FIG. 4 shows the configuration of a data processing device 200 according to an embodiment of the present application. The data processing device 200 uses a multi-model system to process the data set to be tested, including the model updating device 100 as described above. Wherein, in order not to obscure the spirit and scope of the present application, FIG. 4 omits other common components that may be included in the data processing device 200, such as input and output interfaces, display units, and the like. In addition, since the structure and configuration of the model updating apparatus 100 have been described in detail above, details will not be repeated here.

通过包括模型更新装置100，数据处理装置200在处理包括新的数据的待测数据集时可以获得更好的性能。By including the model update device 100, the data processing device 200 can obtain better performance when processing the test data set including new data.

图5是根据本申请的另一个实施例的数据处理装置300的结构框图，除了模型更新装置100之外，数据处理装置300还包括：控制单元301，被配置为执行控制以使得模型更新装置100将多模型系统中的每个模型作为目标模型进行更新；以及融合单元302，被配置为将更新后的各个模型进行融合以获得最终模型，其中，数据处理装置300使用该最终模型对待测试数据集进行处理。5 is a structural block diagram of a data processing device 300 according to another embodiment of the present application. In addition to the model updating device 100, the data processing device 300 also includes: a control unit 301 configured to perform control so that the model updating device 100 updating each model in the multi-model system as a target model; and a fusion unit 302 configured to fuse the updated models to obtain a final model, wherein the data processing device 300 uses the final model to be tested for the data set to process.

In one embodiment, the models may be fused using the following formula (3):

where K is the number of models in the multi-model system, P_i is the output value of the fused model on sample i, P_ik is the output value of the k-th model on sample i, and α is the weight value.
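Formula (3) itself is not reproduced in this text. A minimal sketch of one reading consistent with the symbol definitions above, in which the fused output P_i is a weighted average of the individual outputs P_ik, is given below. The use of one weight per model, the normalization by the sum of the weights, and the names `fuse_outputs`, `outputs`, and `weights` are assumptions made for illustration, not the formula of the application.

```python
from typing import List, Sequence


def fuse_outputs(outputs: Sequence[Sequence[float]],
                 weights: Sequence[float]) -> List[float]:
    """Weighted average of K model outputs (assumed form of the fusion).

    outputs[k][i] plays the role of P_ik, weights[k] the role of the
    weight for model k, and the returned list holds P_i for each sample i.
    """
    if len(outputs) != len(weights):
        raise ValueError("one weight is required per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_samples = len(outputs[0])
    return [
        sum(w * model_out[i] for w, model_out in zip(weights, outputs)) / total
        for i in range(n_samples)
    ]
```

With equal weights this reduces to a plain average of the model outputs; unequal weights let a better-performing model contribute more to the fused result.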

It can be understood that the data processing device 300 can further improve its performance by processing the data to be tested with the final model obtained by fusing the updated models.

The operation of the data processing device 300 is described below through a specific example. In this example, the ownership relationship between two entities from the web, namely an author and an article, must be determined. This determination is not easy because of factors such as authors sharing the same name, changes of affiliation, changes of research direction, the reliability of information sources, and inconsistencies among multiple information sources. In other words, the example concerns entity relationship mining. It should be understood, however, that the data processing device 300 of the present application is not limited to this application and can be widely applied to various data processing scenarios that use a multi-model system.

The collected information is organized into several tables, as shown in FIG. 6, including an author table, an article table, a conference table, a journal table, and an article-author table, each with different fields. On the basis of these tables, a batch of labeled items in the format "author ID, article ID, yes/no" is obtained by manual annotation to indicate whether the article with the given article ID belongs to the author with the given author ID.

Assume that the multi-model system comprises five classifiers (i.e., models): random forest, SVM, Bayesian network, boosting, and maximum entropy. These classifiers are first trained on the training data set, and the performance of each classifier is obtained.

For example, if maximum entropy is first selected as the target model, then random forest, SVM, Bayesian network, and boosting are the candidate calibration models. Next, the SVM is selected as the calibration model, for example according to a best-performance strategy. The pseudo-label acquisition unit 101 uses the selected SVM to classify the data set to be tested and obtains its pseudo-label information, that is, items of the form "author ID, article ID, yes/no".
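The selection and pseudo-labeling steps can be sketched as follows. This is a minimal illustration, assuming each classifier is available as a callable that returns yes/no for a sample and that a performance score per classifier has already been measured; the names `pick_calibration_model` and `pseudo_labels` and the score values used with them are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

# A classifier maps a feature vector to a yes/no decision (assumed interface).
Classifier = Callable[[Sequence[float]], bool]


def pick_calibration_model(scores: Dict[str, float], target: str) -> str:
    """Return the name of the best-scoring model other than the target."""
    candidates = {name: s for name, s in scores.items() if name != target}
    if not candidates:
        raise ValueError("no candidate calibration model besides the target")
    return max(candidates, key=lambda name: candidates[name])


def pseudo_labels(calibration: Classifier,
                  samples: Sequence[Sequence[float]]) -> List[bool]:
    """Label every sample of the data set to be tested with the calibration model."""
    return [calibration(x) for x in samples]
```

Because the target model is excluded from the candidates, the pseudo-labels always come from a different model than the one being updated.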

For a feature that takes continuous values in [0,1], such as a name similarity feature indicating how similar two names are, a region size is selected, for example 0.01, and the partition module 4001 divides the feature space into 100 regions. Starting from 0, for one or more adjacent regions, the distribution calculation module 4002 calculates their distributions on the training data set and on the data set to be tested, and the distance calculation module 4003 calculates from these distributions the distribution distance between the training data set and the data set to be tested, for example the KL distance. If, over two or more regions, the KL distance difference is smaller than a predetermined threshold, the merging module 4004 merges those regions, and the final merged result is taken as the new feature space division.
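One possible reading of this procedure is sketched below. It assumes that the "distribution" of a region is the smoothed yes/no label distribution of the samples falling in it (true labels for the training data set, pseudo-labels for the data set to be tested), and that regions are merged greedily from left to right while the KL distance of the merged region stays below the threshold. Both choices, the smoothing constant, and all names are assumptions; the source does not fix these details.

```python
import math
from typing import List, Sequence, Tuple

Labeled = Tuple[Sequence[float], Sequence[bool]]  # (feature values, labels)


def label_dist(values, labels, lo, hi, eps=1e-6):
    """Smoothed (P(yes), P(no)) over samples with lo <= value < hi."""
    pos = sum(1 for v, y in zip(values, labels) if lo <= v < hi and y)
    neg = sum(1 for v, y in zip(values, labels) if lo <= v < hi and not y)
    total = pos + neg + 2 * eps
    return ((pos + eps) / total, (neg + eps) / total)


def kl(p, q):
    """KL divergence between two discrete distributions with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def merge_regions(edges: Sequence[float], train: Labeled, test: Labeled,
                  threshold: float) -> List[float]:
    """Greedily merge adjacent regions; returns the new region boundaries."""
    result = [edges[0]]
    start = edges[0]
    for i in range(1, len(edges) - 1):
        hi = edges[i + 1]  # candidate region: current group plus the next cell
        d = kl(label_dist(*train, start, hi), label_dist(*test, start, hi))
        if d >= threshold:
            # Distributions diverge: close the current group at edges[i].
            result.append(edges[i])
            start = edges[i]
    result.append(edges[-1])
    return result
```

For the name similarity feature, `edges` would hold the 101 boundaries 0, 0.01, ..., 1 of the 100 initial regions. A sample whose value equals the last boundary is not counted here because the regions are treated as half-open; a fuller implementation would close the last region on the right.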

As an example, the feature space division of the name similarity feature of the maximum entropy model is [0,0.2), [0.2,0.4), [0.4,0.6), [0.6,0.8), and [0.8,1] before the update, and [0,0.3), [0.3,0.5), [0.5,0.8), and [0.8,1] after the re-division described above. The maximum entropy model is then retrained on the training data set using this feature space division.
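To illustrate how such a division is applied when the model is retrained, the sketch below maps a continuous feature value to the index of the region that contains it. The function name `region_index` and the choice to treat the last region as closed on the right are assumptions made for illustration.

```python
from bisect import bisect_right
from typing import Sequence


def region_index(value: float, edges: Sequence[float]) -> int:
    """Index of the region [edges[j], edges[j+1]) that contains value.

    The last region is treated as closed so that the maximum value
    (1.0 for a similarity in [0,1]) still falls inside the division.
    """
    if not edges[0] <= value <= edges[-1]:
        raise ValueError("value lies outside the feature space")
    return min(bisect_right(edges, value) - 1, len(edges) - 2)


# Boundaries taken from the example above.
EDGES_BEFORE = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
EDGES_AFTER = [0.0, 0.3, 0.5, 0.8, 1.0]
```

A name similarity of 0.25 falls in the second region of the old division but in the first region of the new one, so the discretized feature seen by the retrained classifier changes even though the raw value does not.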

If there are multiple features, the feature space of every feature can be re-divided by the same steps, and the update unit 105 retrains the maximum entropy classifier on the training data set in the new feature space.

The control unit 301 may then select another classifier as the target model and repeat the above operations until every target model has been updated. It should be understood that although updating all of the models is shown here as an example, this is not required; only one model or a few models may be updated.

Finally, the fusion unit 302 fuses all of the updated models to obtain the final model. When only one or a few models have been updated, the fusion unit 302 uses the remaining, non-updated models as they are during fusion.
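The control flow described above can be sketched as a loop that treats each selected model in turn as the target and leaves the others untouched. The function `update_system` and the callback `update_target`, which stands in for the whole update of one target model, are hypothetical names; drawing calibration candidates from the original models rather than from already-updated ones is likewise an assumption.

```python
from typing import Callable, Dict, Iterable, Optional


def update_system(models: Dict[str, object],
                  update_target: Callable[[str, Dict[str, object]], object],
                  targets: Optional[Iterable[str]] = None) -> Dict[str, object]:
    """Return a copy of `models` with each selected model replaced by its update.

    Models that are not selected are carried over as they are, so a later
    fusion step sees a mix of updated and non-updated models.
    """
    selected = list(models) if targets is None else list(targets)
    updated = dict(models)
    for name in selected:
        # The callback receives the original models so that it can choose
        # a calibration model among those other than `name`.
        updated[name] = update_target(name, models)
    return updated
```

Passing `targets=None` corresponds to updating every model; passing a subset corresponds to the case where only one or a few models are updated.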

Experiments show that processing the data set to be tested with either an updated single model or the final model obtained by fusion yields classification results with higher accuracy than the corresponding model before the update; that is, system performance is improved.

[3. Model updating method and data processing method]

Embodiments of the model updating device and the data processing device according to the present invention have been described above with reference to the drawings, and in the course of that description a model updating method and a data processing method have in fact also been described. These methods are briefly described below with reference to FIGS. 7 and 8; for details, refer to the foregoing description of the model updating device and the data processing device.

FIG. 7 shows a model updating method for updating a target model in a multi-model system according to an embodiment of the present application, wherein the models in the multi-model system are models pre-trained on a training data set in different ways. The model updating method includes: using a model in the multi-model system other than the target model as a calibration model to process the data set to be tested, and using the processing result as pseudo-labels (S11); obtaining the feature distribution of the data set to be tested based on the pseudo-labels (S12); obtaining the feature distribution of the training data set based on the target model (S13); adjusting the feature space division of the target model based on the feature distribution of the training data set and the feature distribution of the data set to be tested, so that the training data set and the data set to be tested have similar distributions with respect to the feature space division (S14); and updating the target model using the training data set based on the adjusted feature space division (S15).

The calibration model may be the best-performing model in the multi-model system other than the target model.

In addition, the calibration model and the target model may be obtained using algorithms that are complementary in principle.

In one embodiment, the calibration model may also be a model obtained by fusing the models in the multi-model system other than the target model.
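A fused calibration model of this kind might be built as in the sketch below. The source does not say how the other models are fused for this purpose; simple majority voting over yes/no classifiers is used here purely as an assumed example, and the name `fused_calibration` is hypothetical.

```python
from typing import Callable, Dict, Sequence

Classifier = Callable[[Sequence[float]], bool]


def fused_calibration(models: Dict[str, Classifier], target: str) -> Classifier:
    """Build a calibration model by majority vote over all non-target models."""
    others = [m for name, m in models.items() if name != target]
    if not others:
        raise ValueError("at least one non-target model is required")

    def predict(x: Sequence[float]) -> bool:
        votes = sum(1 for m in others if m(x))
        # Strict majority of "yes" votes; ties count as "no".
        return 2 * votes > len(others)

    return predict
```

The returned callable can be used wherever a single calibration model is expected, for example to produce the pseudo-labels of step S11.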

The models may include models obtained by one or more of the following methods: support vector machine, random forest, decision tree, k-nearest neighbor algorithm, boosting algorithm, naive Bayes algorithm, and maximum entropy algorithm. It should be understood that the types of models are not limited to these. The multi-model system described above can be used in various applications, including but not limited to entity relationship extraction.

In one embodiment, the above model updating method may be performed periodically. It may also be performed on demand.

FIG. 8 shows a flowchart of the sub-steps of step S14. Step S14 includes: dividing the feature space of the training data set and the data set to be tested into a plurality of regions (S401); calculating, based on the feature distribution of the training data set and the feature distribution of the data set to be tested respectively, the distributions of one or more adjacent regions on the training data set and on the data set to be tested (S402); and calculating the distance between the two distributions of the one or more adjacent regions and, when the distance is smaller than a predetermined threshold, merging the one or more adjacent regions into one division of the feature space (S403).

The distance may be the KL distance.
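For reference, the standard Kullback-Leibler divergence between two discrete distributions P and Q over the same set of outcomes x is shown below. The source does not state which data set plays the role of P or whether a symmetrized variant is intended, so reading P as the distribution on the training data set and Q as the distribution on the data set to be tested is an assumption.

```latex
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
```

The quantity is zero when the two distributions coincide and grows as they diverge, which is why a small value can serve as the merging criterion of step S403. It is not symmetric in P and Q.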

In one embodiment, a Bayesian measure may be used to calculate the distribution of the one or more adjacent regions on the training data set as the distance.

As described above, the present application also provides a data processing method that uses a multi-model system to process a data set to be tested and includes the model updating method described above with reference to FIGS. 7 and 8.

In one embodiment, the data processing method further includes: updating each model in the multi-model system as the target model; and fusing the updated models to obtain a final model, and processing the data set to be tested using the final model.

The relevant details of the above embodiments have been given in the description of the model updating device and the data processing device and are not repeated here.

[4. Computing device for implementing the device and method of the present application]

The constituent modules and units of the above devices may be configured by software, firmware, hardware, or a combination thereof. The specific means or manners available for such configuration are well known to those skilled in the art and are not described here. When implemented by software or firmware, a program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure (for example, the general-purpose computer 900 shown in FIG. 9), and the computer, with the various programs installed, is capable of performing various functions.

In FIG. 9, a central processing unit (CPU) 901 performs various processes according to a program stored in a read-only memory (ROM) 902 or a program loaded from a storage section 908 into a random access memory (RAM) 903. The RAM 903 also stores, as needed, the data required when the CPU 901 performs the various processes. The CPU 901, the ROM 902, and the RAM 903 are connected to one another via a bus 904. An input/output interface 905 is also connected to the bus 904.

The following components are connected to the input/output interface 905: an input section 906 (including a keyboard, a mouse, and the like), an output section 907 (including a display, such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker), a storage section 908 (including a hard disk and the like), and a communication section 909 (including a network interface card such as a LAN card, a modem, and the like). The communication section 909 performs communication processing via a network such as the Internet. A drive 910 may also be connected to the input/output interface 905 as needed. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 910 as needed, so that a computer program read from it is installed into the storage section 908 as needed.

When the above series of processes is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 911.

Those skilled in the art should understand that such a storage medium is not limited to the removable medium 911 shown in FIG. 9, which stores the program and is distributed separately from the device to provide the program to the user. Examples of the removable medium 911 include magnetic disks (including floppy disks (registered trademark)), optical disks (including compact disc read-only memory (CD-ROM) and digital versatile disc (DVD)), magneto-optical disks (including MiniDisc (MD) (registered trademark)), and semiconductor memories. Alternatively, the storage medium may be the ROM 902, a hard disk contained in the storage section 908, or the like, in which the program is stored and which is distributed to the user together with the device containing it.

The present invention also provides a program product storing machine-readable instruction code. When the instruction code is read and executed by a machine, the above-described method according to the embodiments of the present invention can be performed.

Accordingly, a storage medium carrying the above program product storing machine-readable instruction code is also included in the disclosure of the present invention. The storage medium includes, but is not limited to, a floppy disk, an optical disk, a magneto-optical disk, a memory card, a memory stick, and the like.

Finally, it should be noted that the terms "include" and "comprise" and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Furthermore, absent further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.

Although the embodiments of the present invention have been described in detail above with reference to the drawings, it should be understood that the embodiments described above only illustrate the present invention and do not limit it. Those skilled in the art can make various modifications and changes to the above embodiments without departing from the spirit and scope of the present invention. The scope of the present invention is therefore defined only by the appended claims and their equivalents.

From the above description, the embodiments of the present invention provide the following technical solutions, without being limited to them.

Supplementary note 1. A model updating device for updating a target model in a multi-model system, wherein the models in the multi-model system are models pre-trained on a training data set in different ways, the model updating device comprising:

a pseudo-label acquisition unit configured to use a model in the multi-model system other than the target model as a calibration model to process a data set to be tested, and to use the processing result as pseudo-labels;

a first feature distribution acquisition unit configured to obtain the feature distribution of the data set to be tested based on the pseudo-labels;

a second feature distribution acquisition unit configured to obtain the feature distribution of the training data set based on the target model;

an adjustment unit configured to adjust the feature space division of the target model based on the feature distribution of the training data set and the feature distribution of the data set to be tested, so that the training data set and the data set to be tested have similar distributions with respect to the feature space division; and

an update unit configured to update the target model using the training data set based on the adjusted feature space division.

Supplementary note 2. The model updating device according to supplementary note 1, wherein the adjustment unit comprises:

a partition module configured to divide the feature space of the training data set and the data set to be tested into a plurality of regions;

a distribution calculation module configured to calculate, based on the feature distribution of the training data set and the feature distribution of the data set to be tested respectively, the distributions of one or more adjacent regions on the training data set and on the data set to be tested;

a distance calculation module configured to calculate the distance between the two distributions of the one or more adjacent regions; and

a merging module configured to merge the one or more adjacent regions into one division of the feature space when the distance is smaller than a predetermined threshold.

Supplementary note 3. The model updating device according to supplementary note 2, wherein the distance is the KL distance.

Supplementary note 4. The model updating device according to supplementary note 2, wherein the distance calculation module is configured to use a Bayesian measure to calculate the distribution of the one or more adjacent regions on the training data set as the distance.

Supplementary note 5. The model updating device according to any one of supplementary notes 1 to 4, wherein the calibration model is the best-performing model in the multi-model system other than the target model.

Supplementary note 6. The model updating device according to any one of supplementary notes 1 to 4, wherein the calibration model and the target model are obtained using algorithms that are complementary in principle.

Supplementary note 7. The model updating device according to any one of supplementary notes 1 to 4, wherein the calibration model is a model obtained by fusing the models in the multi-model system other than the target model.

Supplementary note 8. The model updating device according to any one of supplementary notes 1 to 4, wherein the models include models obtained by one or more of the following methods: support vector machine, random forest, decision tree, k-nearest neighbor algorithm, boosting algorithm, naive Bayes algorithm, and maximum entropy algorithm.

Supplementary note 9. The model updating device according to any one of supplementary notes 1 to 4, wherein the multi-model system is used for entity relationship extraction.

Supplementary note 10. The model updating device according to any one of supplementary notes 1 to 4, wherein the model updating device is configured to perform the update periodically.

Supplementary note 11. A data processing device that uses a multi-model system to process a data set to be tested, comprising the model updating device according to any one of supplementary notes 1 to 10.

Supplementary note 12. The data processing device according to supplementary note 11, further comprising:

a control unit configured to perform control so that the model updating device updates each model in the multi-model system as the target model; and

a fusion unit configured to fuse the updated models to obtain a final model,

wherein the data processing device uses the final model to process the data set to be tested.

Supplementary note 13. A model updating method for updating a target model in a multi-model system, wherein the models in the multi-model system are models pre-trained on a training data set in different ways, the model updating method comprising:

using a model in the multi-model system other than the target model as a calibration model to process a data set to be tested, and using the processing result as pseudo-labels;

obtaining the feature distribution of the data set to be tested based on the pseudo-labels;

obtaining the feature distribution of the training data set based on the target model;

adjusting the feature space division of the target model based on the feature distribution of the training data set and the feature distribution of the data set to be tested, so that the training data set and the data set to be tested have similar distributions with respect to the feature space division; and

updating the target model using the training data set based on the adjusted feature space division.

Supplementary note 14. The model updating method according to supplementary note 13, wherein the step of adjusting the feature space division of the target model comprises:

dividing the feature space of the training data set and the data set to be tested into a plurality of regions;

calculating, based on the feature distribution of the training data set and the feature distribution of the data set to be tested respectively, the distributions of one or more adjacent regions on the training data set and on the data set to be tested; and

calculating the distance between the two distributions of the one or more adjacent regions, and merging the one or more adjacent regions into one division of the feature space when the distance is smaller than a predetermined threshold.

Supplementary note 15. The model updating method according to supplementary note 14, wherein the distance is the KL distance.

Supplementary note 16. The model updating method according to supplementary note 14, wherein a Bayesian measure is used to calculate the distribution of the one or more adjacent regions on the training data set as the distance.

Supplementary note 17. The model updating method according to any one of supplementary notes 13 to 16, wherein the calibration model and the target model are obtained using algorithms that are complementary in principle.

Supplementary note 18. The model updating method according to any one of supplementary notes 13 to 16, wherein the models include models obtained by one or more of the following methods: support vector machine, random forest, decision tree, k-nearest neighbor algorithm, boosting algorithm, naive Bayes algorithm, and maximum entropy algorithm.

Supplementary note 19. A data processing method that uses a multi-model system to process a data set to be tested, comprising the model updating method according to any one of supplementary notes 13 to 18.

Supplementary note 20. The data processing method according to supplementary note 19, further comprising:

updating each model in the multi-model system as the target model; and

fusing the updated models to obtain a final model, and processing the data set to be tested using the final model.

Claims

1. A model updating device for updating the target model in the multi-model system, wherein each model in the multi-model system is a model obtained by pre-training in different ways for the training data set, and the model updating device includes:

A pseudo-label acquisition unit configured to use a model different from the target model in the multi-model system as a calibration model to process the data set to be tested, and use the processed result as a pseudo-label;

a first feature distribution acquisition unit configured to obtain a feature distribution of the data set to be tested based on the pseudo-label;

a second feature distribution acquisition unit configured to acquire the feature distribution of the training data set based on the target model;

an adjustment unit configured to adjust the feature space division of the target model based on the feature distribution of the training data set and the feature distribution of the data set to be tested, so that the training data set and the data set to be tested have similar distributions with respect to the feature space division; and

An update unit configured to use the training data set to update the target model based on the adjusted feature space partition.

2. The model updating device according to claim 1, wherein the adjustment unit comprises:

a partition module configured to divide the feature space of the training data set and the test data set into multiple regions;

a distribution calculation module configured to calculate, based on the feature distribution of the training data set and the feature distribution of the data set to be tested respectively, the distributions of one or more adjacent regions on the training data set and on the data set to be tested;

a distance calculation module configured to calculate the distance between two distributions of adjacent one or more regions; and

A combining module configured to combine the one or more adjacent regions as a division of the feature space when the distance is smaller than a predetermined threshold.

3. The model updating device according to claim 2, wherein the distance is a KL distance.

4. The model updating apparatus according to claim 2, wherein the distance calculation module is configured to use a Bayesian metric to calculate the distribution of the adjacent one or more regions on the training data set as the distance.

5. The model updating device according to any one of claims 1 to 4, wherein the calibration model is a model with the best performance other than the target model in the multi-model system.

6. The model updating device according to any one of claims 1 to 4, wherein the calibration model and the target model are obtained using algorithms that are complementary in principle.

7. The model updating device according to any one of claims 1 to 4, wherein the respective models include models obtained by one or more of the following methods: support vector machine, random forest, decision tree, k-nearest neighbor algorithm, boosting algorithm, naive Bayes algorithm, and maximum entropy algorithm.

8. The model updating device according to any one of claims 1 to 4, wherein the model updating device is configured to update periodically.

9. A data processing device using a multi-model system to process a data set to be tested, comprising the model updating device according to any one of claims 1-8.

10. A model update method for updating a target model in a multi-model system, wherein each model in the multi-model system is a model obtained by pre-training in different ways for a training data set, and the model update method includes:

Using a model different from the target model in the multi-model system as a calibration model to process the data set to be tested, and use the processed result as a pseudo-label;

Obtaining the feature distribution of the data set to be tested based on the pseudo-label;

Obtaining the feature distribution of the training data set based on the target model;

adjusting the feature space division of the target model based on the feature distribution of the training data set and the feature distribution of the data set to be tested, so that the training data set and the data set to be tested have similar distributions with respect to the feature space division; and

The target model is updated using the training dataset based on the adjusted feature space partition.