CN104657744A - Multi-classifier training method and classifying method based on non-deterministic active learning

Info

Publication number: CN104657744A
Application number: CN201510046879.8A
Authority: CN
Inventors: 张晓宇; 王树鹏; 吴广君
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2015-01-29
Filing date: 2015-01-29
Publication date: 2015-05-27
Anticipated expiration: 2035-01-29
Also published as: CN104657744B

Abstract

The invention discloses a multi-classifier training method and a classification method based on non-deterministic active learning. The method comprises: 1) selecting or initializing a multi-classifier and, for each sample in the unlabeled sample set, using the multi-classifier to compute the overall information amount Info of the sample, where the overall information amount is the sum of the model change information amount and the model tuning information amount; 2) clustering the unlabeled sample set to obtain J subclasses; 3) selecting from each subclass several unlabeled samples with the smallest Info value, then selecting K of these samples, labeling them, and adding them to the labeled sample set L; 4) retraining the multi-classifier with the updated labeled set L as training data; 5) iterating steps 1) to 4) a set number of times, then using the resulting multi-classifier to classify the unlabeled set. The invention achieves a comprehensive evaluation of sample information and thereby obtains an efficient and intelligent multi-classifier.

Description

A multi-classifier training method and classification method based on non-deterministic active learning

Technical field

The invention relates to a multi-classifier training method and a classification method based on non-deterministic active learning, and belongs to the technical field of software engineering.

Background

Data classification has long been a research hotspot; see, for example, patent ZL 201010166225.6 "An adaptive cascade classifier training method based on online learning", patent ZL 200910076428.3 "A training method and classification method for a cross-domain text sentiment classifier", and patent ZL 200810094208.9 "Document classifier generation method and system".

For the classification of massive data, "active learning" (reference: A. McCallum and K. Nigam, "Employing EM in pool-based active learning for text classification," in Proc. of the 15th International Conference on Machine Learning, 1998, pp. 350-358) is a machine learning method that makes efficient use of expert labeling. Its main idea is that the machine actively and selectively chooses the most informative samples and hands them to an expert for labeling (that is, it queries the expert), so as to obtain the largest possible gain in classification performance under a limited labeling budget. See, for example, the granted patents ZL 201210050383 "Multi-class image classification method based on active learning and semi-supervised learning" and ZL 200810082814.9 "Method for adapting a boosted classifier to new samples".

The advantage of active learning is especially clear when labeling is costly and labeled samples are scarce, while unlabeled samples are plentiful and easy to obtain. The selective sampling strategy is the key component of active learning. Existing selective sampling strategies fall roughly into the following types: (1) uncertainty-based: the samples that the current model is least certain how to classify are submitted to an expert for labeling (reference: D. Lewis and W. Gale, "A sequential algorithm for training text classifiers," in Proc. of the ACM SIGIR Conference on Research and Development in Information Retrieval, 1994, pp. 3-12); (2) committee-based: several different models vote, and the samples on which they disagree most are submitted to an expert for labeling (reference: H. S. Seung, M. Opper, and H. Sompolinsky, "Query by committee," in Proc. of the ACM Workshop on Computational Learning Theory, 1992, pp. 287-294); (3) expected error minimization: starting from decision theory, the expected error of the model after an unlabeled sample has been labeled is estimated, and the sample that yields the smallest expected error is submitted to an expert for labeling (reference: Y. Guo and R. Greiner, "Optimistic active learning using mutual information," in Proc. of the International Joint Conference on Artificial Intelligence, 2007, pp. 823-829).

The notation used in this document is as follows. A sample is represented by its feature vector x. A label is represented by $y \in C^{N} = \{1, 2, ..., N\}$, where N is the number of classes. The unlabeled set and the labeled set are denoted U and L respectively. The classification model is represented by the posterior probability $P(y \mid x; \theta_{L}^{N})$, where $\theta_{L}^{N}$ denotes the parameters of the N-class model corresponding to the labeled set L.

In traditional active learning methods, the number of classes N is known in advance from empirical analysis or prior knowledge and is therefore treated as a constant. Such methods are called "determinate active learning" (D-AL). Depending on the number of classes (N = 2 or N > 2), the classification model is either binary or multi-class. A binary model assigns a sample to one of two classes and is a basic classification model that has been widely studied and applied. A multi-class model assigns a sample to one of several classes and is the generalization of the binary model. A multi-class model can be built in the following two ways:

(1) A direct approach is to decompose the multi-class model into several binary models. In the training phase, a binary model is built for each class or for each pair of classes. In the prediction phase, the trained binary models are combined into one overall classifier by voting or fusion. For example, for each class $c \in C^{N}$, with the label $y_{c} \in \{0, 1\}$ indicating whether sample x belongs to that class, logistic regression can be used to build the binary model:

$P(y_{c} = 1 \mid x; \theta_{L_{c}}^{2}) = h_{\theta_{L_{c}}^{2}}(x) = \frac{1}{1 + \exp(-(\theta_{L_{c}}^{2})^{T} x)}.$ Formula (1)

The final predicted label is the class with the largest posterior probability:

$y^{*} = \arg\max_{c \in C^{N}} P(y_{c} = 1 \mid x; \theta_{L_{c}}^{2}).$ Formula (2)
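Formulas (1) and (2) can be illustrated with a minimal sketch, assuming one already-trained weight vector per class. The function names and the weight values are hypothetical and chosen only for illustration; the source does not specify any.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), as in Formula (1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_rest(x, thetas):
    """Return (label, probabilities) for sample x.

    thetas[c - 1] is the weight vector of the binary model for class c
    (illustrative values only; a real model would be trained on L).
    """
    probs = [sigmoid(sum(t_i * x_i for t_i, x_i in zip(theta, x)))
             for theta in thetas]
    # Formula (2): choose the class with the largest posterior (classes are 1..N).
    label = max(range(len(probs)), key=probs.__getitem__) + 1
    return label, probs

# Made-up weights for N = 3 classes and 2-dimensional samples.
thetas = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
label, probs = predict_one_vs_rest([0.5, 1.5], thetas)
```

Because each binary model is fitted separately, the N probabilities need not sum to one, which is one reason the jointly modeled alternative is easier to compare across classes.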

(2) The other approach considers all classes together and models them jointly. For example, given a sample x, softmax regression uses an N-dimensional vector to estimate the probability that the label y takes each value from 1 to N, modeling all N classes in a single process:

$\begin{bmatrix} P(y = 1 \mid x; \theta_{L}^{N}) \\ P(y = 2 \mid x; \theta_{L}^{N}) \\ \vdots \\ P(y = N \mid x; \theta_{L}^{N}) \end{bmatrix} = h_{\theta_{L}^{N}}(x) = \frac{1}{\sum_{i = 1}^{N} \exp((\theta_{L}^{N})_{i}^{T} x)} \begin{bmatrix} \exp((\theta_{L}^{N})_{1}^{T} x) \\ \exp((\theta_{L}^{N})_{2}^{T} x) \\ \vdots \\ \exp((\theta_{L}^{N})_{N}^{T} x) \end{bmatrix}.$ Formula (3)

Because the probability distribution over all classes is modeled jointly, the results are directly comparable, and the largest element of the vector corresponds to the final predicted label. This approach is therefore better suited to building multi-class models.
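A small sketch of Formula (3) follows, assuming the N weight vectors are given. Subtracting the maximum score before exponentiating is a standard numerical safeguard that is not part of the source and does not change the result; the weights are made up.

```python
import math

def softmax_posterior(x, thetas):
    """Formula (3): posterior over N classes from N weight vectors."""
    scores = [sum(t_i * x_i for t_i, x_i in zip(theta, x)) for theta in thetas]
    m = max(scores)  # shift by the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative weights for N = 3 classes.
probs = softmax_posterior([0.5, 1.5], [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])
predicted = probs.index(max(probs)) + 1  # largest element gives the label
```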

To optimize the classification model, the traditional D-AL-based multi-class method selects the most informative sample and submits it to an expert for labeling, thereby updating the model. The most informative sample is selected as follows: for each sample in the unlabeled set, compute the expected error of the model after that sample is labeled, and choose the sample that minimizes the expected error. Formally:

$x_{D\text{-}AL}^{*} = \underset{x \in U}{\arg\min} \sum_{\tilde{y} \in C^{N}} P(\tilde{y} \mid x; \theta_{L}^{N}) \, F(x, \tilde{y}; \theta_{L}^{N}).$ Formula (4)

where

$\begin{aligned} F(x, \tilde{y}; \theta_{L}^{N}) &= \sum_{x_{u} \in U - x} H(y_{u} \mid x_{u}; \theta_{L + (x, \tilde{y})}^{N}) \\ &= \sum_{x_{u} \in U - x} \Big( - \sum_{\tilde{y}_{u} \in C^{N}} P(\tilde{y}_{u} \mid x_{u}; \theta_{L + (x, \tilde{y})}^{N}) \cdot \log P(\tilde{y}_{u} \mid x_{u}; \theta_{L + (x, \tilde{y})}^{N}) \Big). \end{aligned}$ Formula (5)

In Formula (5), $F(x, \tilde{y}; \theta_{L}^{N})$ is the sum of the information entropies of the other unlabeled samples $x_{u} \in U - x$, given the existing model parameters $\theta_{L}^{N}$ and the newly labeled sample $(x, \tilde{y})$; $L + (x, \tilde{y})$ denotes the new labeled set after sample x has been labeled.

The most informative sample $x_{D\text{-}AL}^{*}$ selected by Formula (4) is labeled by an expert through human-computer interaction. Once labeled, the sample is removed from the unlabeled set and added to the labeled set.
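As a rough illustration of Formulas (4) and (5), the sketch below scores each unlabeled sample by the expected entropy of the remaining unlabeled samples after it is labeled. The nearest-centroid softmax model (`fit_centroids`, `posterior`) is a toy stand-in used only to keep the example self-contained; the source does not prescribe it, and all names and data are hypothetical.

```python
import math

def fit_centroids(labeled):
    """Toy 'training' (an assumption): one mean vector per class."""
    sums, counts = {}, {}
    for x, y in labeled:
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def posterior(x, centroids):
    """Softmax over negative squared distances to the class centroids."""
    scores = {y: -sum((a - b) ** 2 for a, b in zip(x, c))
              for y, c in centroids.items()}
    m = max(scores.values())
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}

def entropy(p):
    """Shannon entropy of a discrete distribution given as a dict."""
    return -sum(v * math.log(v) for v in p.values() if v > 0.0)

def F(i, y_tilde, labeled, unlabeled):
    """Formula (5): entropy of U - x after adding (x, y_tilde) to L."""
    model = fit_centroids(labeled + [(unlabeled[i], y_tilde)])
    return sum(entropy(posterior(xu, model))
               for j, xu in enumerate(unlabeled) if j != i)

def select_d_al(labeled, unlabeled):
    """Formula (4): index of the sample with the smallest expected entropy."""
    current = fit_centroids(labeled)
    def expected(i):
        p = posterior(unlabeled[i], current)
        return sum(p[y] * F(i, y, labeled, unlabeled) for y in p)
    return min(range(len(unlabeled)), key=expected)

labeled = [([0.0, 0.0], 1), ([4.0, 4.0], 2)]
unlabeled = [[0.5, 0.2], [2.0, 2.1], [3.8, 4.1]]
idx = select_d_al(labeled, unlabeled)
```

Each candidate requires one retraining per possible label, so the cost grows with both the size of U and the number of classes N.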

For the D-AL-based multi-class method, the number of classes N is known in advance, so the form of the N-class model can be fixed directly. The remaining task is to select and label the most informative samples through determinate active learning, continually optimizing the model parameters $\theta_{L}^{N}$ within the existing model framework. In many practical applications, however, the number of classes cannot be known accurately in advance, and in some applications it even changes over time. In such cases the number of classes N is itself a variable to be solved for: active learning must not only optimize the existing model parameters $\theta_{L}^{N}$ but also update the number of classes N (that is, the form of the classification model) according to the sample distribution.

For clarity, this document refers to the optimization of the model parameters $\theta_{L}^{N}$ under the existing N-class model as "model tuning", and to the model reconstruction triggered by an update of the number of classes N as "model change". The traditional D-AL-based multi-class method attends only to model tuning and ignores model change, so it applies only when the number of classes is known. When the number of classes is uncertain, it is confined to evaluating sample information under the existing N-class model and cannot accurately describe and fit the true distribution of the sample data, so it cannot effectively improve classification performance.

Summary of the invention

The purpose of the invention is to provide a multi-classifier training method and a classification method based on non-deterministic active learning. On the one hand, the ability of a sample to optimize the model parameters within the existing model framework is evaluated; on the other hand, the possibility that the sample introduces a new class and thereby triggers model reconstruction is evaluated. By considering both model tuning and model change, the invention achieves a comprehensive evaluation of sample information and thereby obtains an efficient and intelligent classification model for massive data.

1. The provided multi-class method based on non-deterministic active learning measures the information amount of a sample from two aspects, model change and model tuning. It evaluates the ability of the sample to optimize the model parameters within the existing model framework, and it evaluates the possibility that the sample is labeled as a new class and thereby triggers model reconstruction. Combining the two yields a comprehensive evaluation of sample information. Selecting the most informative samples for labeling by this criterion can ensure the best classification result under a limited labeling budget, yielding an efficient and intelligent classification model for massive data.

2. The provided method for computing sample information considers not only the probability that a sample is labeled as each existing class but also the probability that it is labeled as a new class, forming a unified and comprehensive computation of sample information.

3. The provided clustering-based batch selection method clusters the samples in the unlabeled set and then selects the most informative set of samples as a batch, which preserves sample information while avoiding redundancy.

The technical solution of the invention is as follows:

A multi-classifier training method based on non-deterministic active learning, comprising the steps of:

1) selecting or initializing a multi-classifier; for each sample in the unlabeled sample set, using the multi-classifier to compute the overall information amount Info of the sample, where the overall information amount is the sum of the model change information amount and the model tuning information amount;

2) clustering the unlabeled sample set to obtain J subclasses;

3) selecting from each subclass several unlabeled samples with the smallest Info value; then selecting K of the selected samples, labeling them, and adding them to the labeled sample set L;

4) retraining the multi-classifier with the updated labeled sample set L as training data.

A multi-classifier classification method based on non-deterministic active learning, comprising the steps of:

1) selecting or initializing a multi-classifier; for each sample in the unlabeled sample set, using the multi-classifier to compute the overall information amount Info of the sample, where the overall information amount is the sum of the model change information amount and the model tuning information amount;

2) clustering the unlabeled sample set to obtain J subclasses;

3) selecting from each subclass several unlabeled samples with the smallest Info value; then selecting K of the selected samples, labeling them, and adding them to the labeled sample set L;

4) retraining the multi-classifier with the updated labeled set L as training data;

5) iterating steps 1) to 4) a set number of times; then using the resulting multi-classifier to classify the unlabeled set.
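The iteration over steps 1) to 4) can be outlined as a short loop. This is only an orchestration sketch: `train`, `score_info`, `cluster`, `select`, and `ask_expert` are hypothetical callables standing in for the model-specific parts, and `select` is assumed to return K distinct indices into the unlabeled list.

```python
def active_learning_loop(labeled, unlabeled, rounds, k,
                         train, score_info, cluster, select, ask_expert):
    """Run steps 1) to 4) `rounds` times and return the final model.

    All five callables are hypothetical placeholders; `labeled` and
    `unlabeled` are modified in place.
    """
    model = train(labeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        info = score_info(model, labeled, unlabeled)   # step 1: Info per sample
        cluster_of = cluster(unlabeled)                # step 2: cluster ids
        chosen = select(info, cluster_of, k)           # step 3: K indices
        for i in chosen:
            labeled.append((unlabeled[i], ask_expert(unlabeled[i])))
        for i in sorted(chosen, reverse=True):         # delete from the back
            del unlabeled[i]
        model = train(labeled)                         # step 4: retrain
    return model
```

After the loop, the returned model would be applied to whatever remains in the unlabeled set, which corresponds to the final part of step 5).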

Further, the model change information amount is computed as follows: select a sample a from the unlabeled sample set and set its label to a new class; then use the multi-classifier to compute the information entropy, with respect to the new class, of the unlabeled sample set with sample a removed, and take this entropy as the model change information amount of sample a. The model tuning information amount is computed as follows: select a sample a from the unlabeled sample set and set its label to one of the classes of the multi-classifier; then use the updated multi-classifier to compute the weighted sum, over the existing classes, of the information entropy of the unlabeled sample set with sample a removed, and take it as the model tuning information amount of sample a.

Further, the model change information amount is computed as follows: first construct an (N+1)-class classifier from the labeled sample set L, which contains training data for N classes; then, for each sample x in the unlabeled sample set with sample a removed, define the probability that x belongs to none of the existing N classes as the probability that x belongs to the (N+1)-th class; then use the multi-classifier to compute the information entropy, with respect to the new class, of the unlabeled sample set with sample a removed, and take it as the model change information amount of sample a.

Further, the model change information amount is computed as follows: first construct a binary classifier from the labeled sample set L, which contains training data for N classes, where the existing N classes are merged into one class A and everything outside the existing N classes is assigned to another class B; then, for each sample x in the unlabeled sample set with sample a removed, define the probability that x belongs to none of the existing N classes as the probability that x belongs to class B; then use the multi-classifier to compute the information entropy, with respect to the new class, of the unlabeled sample set with sample a removed, and take it as the model change information amount of sample a.

Further, the model change information amount is computed as follows: first construct a one-class classifier from the labeled sample set L, which contains training data for N classes; then, for each sample x in the unlabeled sample set with sample a removed, define the probability that x belongs to none of the existing N classes as the probability that x is an outlier; then use the multi-classifier to compute the information entropy, with respect to the new class, of the unlabeled sample set with sample a removed, and take it as the model change information amount of sample a.

The main content of the invention includes:

For classification problems with an uncertain number of classes, N is taken to be the number of distinct labels among the samples in the current labeled set, and N is adjusted as the labeled set grows. Figure 1 illustrates the construction of a classification model when the number of classes is uncertain. In Figure 1(1), the initial labeled set contains only two labeled samples, A and B, belonging to class 1 and class 2 respectively, so the model is binary. In Figure 1(2), sample C is labeled as class 1 and added to the labeled set; no new label appears, so the model remains binary. In Figure 1(3), sample D is labeled as class 3 and added to the labeled set; because a new label (class 3) appears, the model changes to a three-class model.

Figure 1 also shows that model tuning and model change are equally important for optimizing the classification model, and neither can be neglected. In Figure 1(1), judged only by the existing binary model, sample C is more informative than sample D (C lies closer to the decision boundary and therefore has greater uncertainty). In fact, however, sample D matters more for optimizing the model: labeling D not only helps update the model parameters but also introduces new label information, so the model is rebuilt as a three-class model that better fits the true data distribution.
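The Figure 1 scenario can be replayed in a few lines, treating N as the number of distinct labels in the labeled set. The helper name `num_classes` and the use of sample names in place of feature vectors are illustrative assumptions.

```python
def num_classes(labeled):
    """N is the number of distinct labels currently in the labeled set."""
    return len({y for _, y in labeled})

labeled = [("A", 1), ("B", 2)]   # Figure 1(1): two classes, binary model
n1 = num_classes(labeled)
labeled.append(("C", 1))         # Figure 1(2): existing label, model tuning only
n2 = num_classes(labeled)
labeled.append(("D", 3))         # Figure 1(3): new label, model change
n3 = num_classes(labeled)
```

Whenever the count rises, the classifier has to be rebuilt with one more output, which is the "model change" event described above.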

1. Measuring sample information

The multi-class method based on non-deterministic active learning provided by the invention integrates the measures of sample information into one unified framework, giving a comprehensive and effective measure of sample information.

The method starts from information theory and rests on the following analysis: (1) when a sample is labeled as a new class and added to the labeled set, it introduces into the existing model entirely new information that was not modeled before, which increases the uncertainty of the model's global estimate of the unlabeled samples; in information-theoretic terms, the sample increases the overall information entropy of the unlabeled set; (2) when a sample is labeled as a known class and added to the labeled set, it provides a new constraint that helps the existing model fit the data distribution better; in information-theoretic terms, the sample tends to reduce the overall information entropy of the unlabeled set.

Based on this analysis, the multi-class method based on non-deterministic active learning measures the information amount of a sample from two aspects, model change and model tuning:

(1) The "model change information amount" of a sample is defined as follows: assuming the sample is labeled as a new class, use the multi-classifier to compute the information entropy, with respect to the new class, of the unlabeled set with the sample removed. Formally:

$\mathrm{Info}_{upgrade}(x; \varphi_{L}, \theta_{L}^{N}) = P(y \notin C^{N} \mid x; \varphi_{L}) \, F(x, N + 1; \theta_{L}^{N + 1}).$ Formula (6)

Info_upgrade is positively correlated with the information amount of the sample. In the formula, $P(y \notin C^{N} \mid x; \varphi_{L})$ is the probability that sample x is labeled as a new class (that is, belongs to none of the existing N classes) under the existing model, and $\varphi_{L}$ is the parameter of that probability model.

(2) The "model tuning information amount" of a sample is defined as follows: assuming the sample is labeled as one of the existing N classes, use the multi-classifier to compute the weighted sum, over the existing classes, of the information entropy of the unlabeled set with the sample removed. Formally:

$\begin{aligned} \mathrm{Info}_{update}(x; \varphi_{L}, \theta_{L}^{N}) &= P(y \in C^{N} \mid x; \varphi_{L}) \sum_{\tilde{y} \in C^{N}} P(\tilde{y} \mid x; \theta_{L}^{N}) \, F(x, \tilde{y}; \theta_{L}^{N}) \\ &= (1 - P(y \notin C^{N} \mid x; \varphi_{L})) \sum_{\tilde{y} \in C^{N}} P(\tilde{y} \mid x; \theta_{L}^{N}) \, F(x, \tilde{y}; \theta_{L}^{N}). \end{aligned}$ Formula (7)

Info_update is negatively correlated with the information amount of the sample.
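Formulas (6) and (7) reduce to simple arithmetic once the new-class probability, the class posteriors, and the entropy terms F are available. The sketch below assumes those quantities have already been computed elsewhere; the function names and all numbers are made up for illustration.

```python
def info_upgrade(p_new, f_new):
    """Formula (6): P(y not in C^N | x) * F(x, N+1)."""
    return p_new * f_new

def info_update(p_new, class_probs, f_values):
    """Formula (7): (1 - P(y not in C^N | x)) * sum_y P(y | x) * F(x, y)."""
    expected = sum(p * f for p, f in zip(class_probs, f_values))
    return (1.0 - p_new) * expected

# Illustrative values for one sample x with N = 2 existing classes:
# new-class probability 0.2, F(x, N+1) = 5.0,
# class posteriors (0.75, 0.25), F(x, 1) = 2.0, F(x, 2) = 4.0.
up = info_upgrade(0.2, 5.0)
upd = info_update(0.2, [0.75, 0.25], [2.0, 4.0])
```

A sample is attractive when `up` is large (it is likely to reveal a new class) or when `upd` is small (labeling it is expected to leave little entropy in the rest of U).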

The multi-class method based on non-deterministic active learning combines the model change information amount and the model tuning information amount of a sample into one comprehensive measure. Given their respective characteristics, the focus is on samples whose Info_upgrade value is markedly high and samples whose Info_update value is markedly low. One formulation (the invention is not limited to this form) is:

$x_{IMC\text{-}AL}^{*} = \underset{x \in U}{\arg\min} \ \mathrm{Info}(x; \varphi_{L}, \theta_{L}^{N}).$ Formula (8)

where

$\begin{aligned} \mathrm{Info}(x; \varphi_{L}, \theta_{L}^{N}) &= \log \Big[ \mathrm{Info}_{update}(x; \varphi_{L}, \theta_{L}^{N}) - \min_{x' \in U} \mathrm{Info}_{update}(x'; \varphi_{L}, \theta_{L}^{N}) + \sigma \Big] \\ &\quad + \lambda \log \Big[ - \Big( \mathrm{Info}_{upgrade}(x; \varphi_{L}, \theta_{L}^{N}) - \max_{x' \in U} \mathrm{Info}_{upgrade}(x'; \varphi_{L}, \theta_{L}^{N}) \Big) + \sigma \Big]. \end{aligned}$ Formula (9)

Info is the overall information amount of the sample. In the formula, λ is a parameter that adjusts the relative weight of Info_upgrade and Info_update, and σ is a very small constant (such as $e^{-10}$) added so that the result never reaches -∞. The samples with the smallest Info values under Formula (8), which are the most informative ones, are selected for labeling.
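A short sketch of Formulas (8) and (9), assuming the per-sample Info_update and Info_upgrade values are already available as lists. The default weight `lam=1.0` is an arbitrary illustrative choice; only the example value of σ comes from the text.

```python
import math

def combined_info(updates, upgrades, lam=1.0, sigma=math.exp(-10)):
    """Formula (9) for every sample in U; a lower score is more informative."""
    min_update = min(updates)
    max_upgrade = max(upgrades)
    return [math.log(u - min_update + sigma)
            + lam * math.log(-(g - max_upgrade) + sigma)
            for u, g in zip(updates, upgrades)]

# Made-up Info_update and Info_upgrade values for three unlabeled samples.
updates = [2.0, 1.0, 3.0]
upgrades = [1.0, 4.0, 0.5]
scores = combined_info(updates, upgrades)
best = scores.index(min(scores))  # Formula (8): arg min over U
```

Both bracketed terms are non-negative before σ is added, so each logarithm is bounded below by log σ, and a sample that attains the minimum Info_update or the maximum Info_upgrade is pushed strongly toward the top of the ranking.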

2、样本新类别概率计算2. Calculation of probability of new category of samples

在公式中，样本x不属于现有N个类别中任何一类的概率的计算有多种方法。给定现有已标注集L和N分类模型及其模型参数上述概率的计算方法包括但不限于以下三种：In the formula, the probability that sample x does not belong to any of the existing N classes There are many ways to calculate . Given the existing labeled sets L and N classification models and their model parameters The methods for calculating the above probability include but are not limited to the following three methods:

(1)基于已标注集L构建一个N+1分类模型，从而将样本x不属于现有N个类别中任何一类的概率定义为样本x属于第N+1个类别的概率。公式化表示如下：(1) Construct an N+1 classification model based on the labeled set L, so that the probability that the sample x does not belong to any of the existing N categories is defined as the probability that the sample x belongs to the N+1th category. Formulated as follows:

$P (y &NotElement; C^{N} | x; φ_{L}) = P (y = N + 1 | x; θ_{L}^{N + 1}) .$ 公式(10) $P (the y &NotElement; C^{N} | x; φ_{L}) = P (the y = N + 1 | x; θ_{L}^{N + 1}) .$ Formula (10)

(2)基于已标注集L构建一个二分类模型，将现有N个类别合并为一个类别“+1”，将现有N个类别以外的其它类别归为“-1”，从而将样本x不属于现有N个类别中任何一类的概率定义为样本x属于类别“-1”的概率。公式化表示如下：(2) Construct a binary classification model based on the labeled set L, merge the existing N categories into one category "+1", and classify other categories other than the existing N categories as "-1", so that the sample x The probability of not belonging to any of the existing N classes is defined as the probability that sample x belongs to class "-1". Formulated as follows:

$P (y &NotElement; C^{N} | x; φ_{L}) = P (y = - 1 | x; θ_{L}^{2}) .$ 公式(11) $P (the y &NotElement; C^{N} | x; φ_{L}) = P (the y = - 1 | x; θ_{L}^{2}) .$ Formula (11)

(3) Construct a one-class classification model based on the labeled set L. Such a model trains a classifier on a labeled set that contains only one category, in order to recognize samples of that category or to detect outliers; a common method is the one-class support vector machine (OC-SVM). This method merges the existing N categories into one category and converts the OC-SVM output value into a probability through the sigmoid function, so that the probability that sample x does not belong to any of the existing N categories is defined as the probability that sample x is an outlier. Formulated as follows:

$P (y \notin C^{N} \mid x; φ_{L}) = \frac{1}{1 + \exp (- \mathrm{Output}_{\mathrm{OC\text{-}SVM}} (y = \mathrm{outlier} \mid x; L))} .$ Formula (12)
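
As an illustration of formula (12), the following minimal Python sketch converts an outlier score into a probability with the sigmoid function. The function name `new_class_probability` and the sign convention (a larger score means the sample looks more like an outlier) are assumptions made for this illustration; the OC-SVM itself is not implemented here.

```python
import math


def new_class_probability(outlier_score: float) -> float:
    """Map a real-valued outlier score to (0, 1) with the logistic sigmoid.

    `outlier_score` is assumed to grow as the sample looks more like an
    outlier, matching Output_OC-SVM(y = outlier | x; L) in formula (12).
    """
    # Two algebraically equal branches avoid overflow in math.exp
    # for scores of large magnitude.
    if outlier_score >= 0:
        return 1.0 / (1.0 + math.exp(-outlier_score))
    z = math.exp(outlier_score)
    return z / (1.0 + z)
```

Some libraries use the opposite sign convention. For instance, the `decision_function` of scikit-learn's `OneClassSVM` is positive for inliers and negative for outliers, so its value would be negated before being passed to this function.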

3. Cluster-based batch sample selection mechanism

In practical applications, to keep the method efficient, each round selects not a single most informative sample but a batch (e.g., K samples). If the K samples with the smallest Info values are simply taken according to the formula, redundant information is inevitably introduced, which lowers classification efficiency.

The present invention provides an improved batch selection method, namely a cluster-based batch sample selection method: (1) cluster the samples in the unlabeled set into J (J ≥ K) clusters, using a clustering method such as, but not limited to, K-means, K-medoids, or spectral clustering; (2) in each cluster, select the most informative sample according to the formula, yielding a candidate set of J samples; (3) from this candidate set, select the K most informative samples (those with the smallest Info values) according to the formula. J is chosen greater than or equal to K in order to handle information redundancy.
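
Steps (2) and (3) of the cluster-based selection can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the Info values and the cluster assignments are taken as already computed, and the function name `select_batch` and its parameters are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence


def select_batch(info: Sequence[float],
                 cluster_ids: Sequence[Hashable],
                 k: int) -> List[int]:
    """Return the indices of k samples chosen by cluster-based selection.

    info[i]        -- Info value of unlabeled sample i (smaller = more informative)
    cluster_ids[i] -- cluster label of sample i, from any clustering method
    """
    if len(info) != len(cluster_ids):
        raise ValueError("info and cluster_ids must have the same length")

    # Step (2): keep the sample with the smallest Info value in each cluster.
    best_in_cluster: Dict[Hashable, int] = {}
    for i, c in enumerate(cluster_ids):
        if c not in best_in_cluster or info[i] < info[best_in_cluster[c]]:
            best_in_cluster[c] = i

    if k > len(best_in_cluster):
        raise ValueError("k must not exceed the number of clusters J")

    # Step (3): among the J candidates, keep the k with the smallest Info.
    candidates = sorted(best_in_cluster.values(), key=lambda i: info[i])
    return candidates[:k]
```

For example, with Info values `[0.9, 0.1, 0.2, 0.8, 0.5, 0.3]`, cluster labels `[0, 0, 0, 1, 1, 2]`, and K = 2, taking the two smallest Info values directly would pick samples 1 and 2, both from cluster 0, whereas the cluster-based selection picks samples 1 and 5 from different clusters.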

After the K most informative samples have been selected by the method of the present invention: (1) the K selected samples are labeled manually; (2) the K labeled samples are removed from the unlabeled set and added to the labeled set; (3) a new classification model is trained on the new labeled set according to formula (3), from which the classification result is obtained.

Compared with the prior art, the positive effects of the present invention are:

The multi-classification method based on non-deterministic active learning provided by the present invention measures the information amount of a sample from two aspects, model change and model tuning. On the one hand, it evaluates the sample's ability to optimize the model parameters within the existing model framework; on the other hand, it evaluates the likelihood that the sample introduces a new category and thereby triggers model reconstruction. Combining the two factors gives a comprehensive evaluation of the sample's information amount, and thus provides an intelligent solution for using a limited number of labeled samples efficiently to obtain optimized classification results on massive data.

Description of drawings

Figure 1 shows an example of building a classification model when the number of categories is uncertain, in which:

(1) the initial labeled set contains only two labeled samples, A and B;

(2) sample C is labeled as category 1 and added to the labeled set;

(3) sample D is labeled as category 3 and added to the labeled set.

Figure 2 is a flowchart of the multi-classification method based on non-deterministic active learning provided by the present invention.

Detailed description

The principles and features of the present invention are described below with reference to the accompanying drawings. The examples given serve only to explain the present invention and are not intended to limit its scope.

Example: multi-classification method based on non-deterministic active learning

The multi-classification method based on non-deterministic active learning provided by the present invention optimizes the classification model step by step through an iterative loop.

Suppose that K samples are to be labeled in each iteration; the following procedure is executed within each iteration:

When the method finishes, if the number of iterations is M, the total number of samples labeled by experts through human-computer interaction is K×M.

Take image classification as an example. An image sample is represented by a feature vector x composed of a color histogram, wavelet texture, and similar features. The images in the initial labeled set fall into 5 classes (car, ship, airplane, tiger, and elephant), denoted by the numbers 1 to 5, so an image label is written y ∈ {1, 2, …, 5}. The unlabeled images form the unlabeled set U, and the labeled images form the labeled set L. The classification model is represented by the posterior probability P(y | x; θ_L).

To improve the performance of the classification model, some unlabeled images must be selected for labeling, and the existing model must be updated with the new label information. Assuming that each model update requires K = 5 newly labeled image samples, the following procedure is executed iteratively:

1) Calculate the total information amount Info of each unlabeled image sample (that is, the sum of the sample's model change information amount and model tuning information amount);

2) Cluster the unlabeled images into J = 10 subclasses; from each subclass select the one image sample with the smallest Info value, giving 10 image samples in total; from these 10 image samples, select the 5 with the smallest Info values;

3) Label the 5 selected image samples and add them to the labeled set;

4) Retrain the image classification model using the new labeled set L as training data;

5) Classify the unlabeled set with the updated classification model to obtain an improved image classification result.
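
The iterative procedure of steps 1) to 4) can be sketched in Python as follows. This is a minimal sketch, not the patented implementation: the classifier training, the Info computation, the clustering, and the human annotator are passed in as hypothetical callables (`train`, `info_of`, `cluster`, `oracle`), and the name `active_learning_loop` is likewise an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Sequence[float]
Labeled = List[Tuple[Sample, int]]


def active_learning_loop(
    labeled: Labeled,
    unlabeled: List[Sample],
    train: Callable[[Labeled], object],
    info_of: Callable[[object, Sample], float],
    cluster: Callable[[List[Sample], int], List[int]],
    oracle: Callable[[Sample], int],
    k: int,
    j: int,
    rounds: int,
):
    """Run up to `rounds` iterations of steps 1) to 4); return the final model.

    `labeled` and `unlabeled` are modified in place.
    """
    model = train(labeled)
    for _ in range(rounds):
        if len(unlabeled) < k:
            break
        # 1) Info value of every unlabeled sample under the current model.
        info = [info_of(model, x) for x in unlabeled]
        # 2) Cluster into j groups, keep the min-Info sample of each group,
        #    then keep the k candidates with the smallest Info values.
        groups = cluster(unlabeled, j)
        best = {}
        for i, g in enumerate(groups):
            if g not in best or info[i] < info[best[g]]:
                best[g] = i
        chosen = sorted(best.values(), key=lambda i: info[i])[:k]
        # 3) Obtain labels and move the samples to the labeled set
        #    (pop from the highest index down so indices stay valid).
        for i in sorted(chosen, reverse=True):
            x = unlabeled.pop(i)
            labeled.append((x, oracle(x)))
        # 4) Retrain on the enlarged labeled set.
        model = train(labeled)
    return model
```

With K = 5, J = 10, and M rounds, this loop adds K×M labels to the labeled set, provided enough unlabeled samples and non-empty clusters remain; step 5) then applies the returned model to the remaining unlabeled set.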

The multi-classification method based on non-deterministic active learning provided by the present invention can obtain an optimized classification result from a limited number of labeled samples even when the number of categories is uncertain.

Claims

1. A multi-classifier training method based on non-deterministic active learning, comprising the following steps:

1) selecting or initializing a multi-classifier; for each sample in the unlabeled sample set, calculating the total information amount Info of the sample by using the multi-classifier, the total information amount being the sum of the model change information amount and the model tuning information amount;

2) clustering the unlabeled sample set to obtain J subclasses;

3) selecting, from each subclass, a plurality of unlabeled samples with the smallest total information amount Info; selecting K samples from the selected unlabeled samples, labeling the K samples, and adding the K samples to the labeled sample set L;

4) retraining the multi-classifier using the updated labeled sample set L as training data.

2. The method of claim 1, wherein the model change information amount is calculated by: selecting a sample a from the unlabeled sample set and setting its label to a new class; then using the multi-classifier to calculate the information entropy, with respect to the new class, of the unlabeled sample set after sample a is removed, and taking that information entropy as the model change information amount of sample a; and wherein the model tuning information amount is calculated by: selecting a sample a from the unlabeled sample set and setting its label to one of the existing classes of the multi-classifier; then using the updated multi-classifier to calculate the weighted sum of the information entropies, with respect to each existing class, of the unlabeled sample set after sample a is removed, and taking that weighted sum as the model tuning information amount of sample a.

3. The method of claim 2, wherein the model change information amount is calculated by: first, constructing an (N+1)-class classifier from the labeled sample set L, whose training data cover N classes; then, for each sample x in the unlabeled sample set after sample a is removed, defining the probability that sample x does not belong to any of the existing N classes as the probability that sample x belongs to the (N+1)-th class; and then using the multi-classifier to calculate the information entropy, with respect to the new class, of the unlabeled sample set after sample a is removed, and taking that information entropy as the model change information amount of sample a.

4. The method of claim 2, wherein the model change information amount is calculated by: first, constructing a binary classifier from the labeled sample set L, whose training data cover N classes, wherein the existing N classes are merged into a class A and all classes other than the existing N classes are assigned to another class B; then, for each sample x in the unlabeled sample set after sample a is removed, defining the probability that sample x does not belong to any of the existing N classes as the probability that sample x belongs to class B; and then using the multi-classifier to calculate the information entropy, with respect to the new class, of the unlabeled sample set after sample a is removed, and taking that information entropy as the model change information amount of sample a.

5. The method of claim 2, wherein the model change information amount is calculated by: first, constructing a one-class classifier from the labeled sample set L, whose training data cover N classes; then, for each sample x in the unlabeled sample set after sample a is removed, defining the probability that sample x does not belong to any of the existing N classes as the probability that sample x is an outlier; and then using the multi-classifier to calculate the information entropy, with respect to the new class, of the unlabeled sample set after sample a is removed, and taking that information entropy as the model change information amount of sample a.

6. A multi-classifier classification method based on non-deterministic active learning, comprising the following steps:

1) selecting or initializing a multi-classifier; for each sample in the unlabeled sample set, calculating the total information amount Info of the sample by using the multi-classifier, the total information amount being the sum of the model change information amount and the model tuning information amount;

2) clustering the unlabeled sample set to obtain J subclasses;

3) selecting, from each subclass, a plurality of unlabeled samples with the smallest total information amount Info; selecting K samples from the selected samples, labeling the K samples, and adding the K samples to the labeled sample set L;

4) retraining the multi-classifier using the updated labeled sample set L as training data;

5) iteratively executing steps 1) to 4) a set number of times, and then classifying the unlabeled sample set using the finally obtained multi-classifier.

7. The method of claim 6, wherein the model change information amount is calculated by: selecting a sample a from the unlabeled sample set and setting its label to a new class; then using the multi-classifier to calculate the information entropy, with respect to the new class, of the unlabeled sample set after sample a is removed, and taking that information entropy as the model change information amount of sample a; and wherein the model tuning information amount is calculated by: selecting a sample a from the unlabeled sample set and setting its label to one of the existing classes of the multi-classifier; then using the updated multi-classifier to calculate the weighted sum of the information entropies, with respect to each existing class, of the unlabeled sample set after sample a is removed, and taking that weighted sum as the model tuning information amount of sample a.

8. The method of claim 7, wherein the model change information amount is calculated by: first, constructing an (N+1)-class classifier from the labeled sample set L, whose training data cover N classes; then, for each sample x in the unlabeled sample set after sample a is removed, defining the probability that sample x does not belong to any of the existing N classes as the probability that sample x belongs to the (N+1)-th class; and then using the multi-classifier to calculate the information entropy, with respect to the new class, of the unlabeled sample set after sample a is removed, and taking that information entropy as the model change information amount of sample a.

9. The method of claim 7, wherein the model change information amount is calculated by: first, constructing a binary classifier from the labeled sample set L, whose training data cover N classes, wherein the existing N classes are merged into a class A and all classes other than the existing N classes are assigned to another class B; then, for each sample x in the unlabeled sample set after sample a is removed, defining the probability that sample x does not belong to any of the existing N classes as the probability that sample x belongs to class B; and then using the multi-classifier to calculate the information entropy, with respect to the new class, of the unlabeled sample set after sample a is removed, and taking that information entropy as the model change information amount of sample a.

10. The method of claim 7, wherein the model change information amount is calculated by: first, constructing a one-class classifier from the labeled sample set L, whose training data cover N classes; then, for each sample x in the unlabeled sample set after sample a is removed, defining the probability that sample x does not belong to any of the existing N classes as the probability that sample x is an outlier; and then using the multi-classifier to calculate the information entropy, with respect to the new class, of the unlabeled sample set after sample a is removed, and taking that information entropy as the model change information amount of sample a.