CN104217225A

CN104217225A - A visual target detection and labeling method

Info

Publication number: CN104217225A
Application number: CN201410442817.4A
Authority: CN
Inventors: 黄凯奇; 任伟强; 王冲; 张俊格
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2014-09-02
Filing date: 2014-09-02
Publication date: 2014-12-17
Anticipated expiration: 2034-09-02
Also published as: CN104217225B

Abstract

The invention discloses a visual target detection and labeling method comprising: an image input step, in which an image to be detected is input; a candidate region extraction step, in which a selective search algorithm extracts candidate windows from the image to be detected as candidate regions; a feature description extraction step, in which a pre-trained large-scale convolutional neural network describes the features of each candidate region and outputs its feature description; a visual target prediction step, in which, based on the feature descriptions, a pre-trained object detection model makes predictions on the candidate regions to estimate the regions where the visual target exists; and a position labeling step, in which the position of the visual target is marked according to the estimation result. Experiments show that, compared with mainstream weakly supervised visual target detection and labeling methods, the invention has a stronger ability to mine positive samples and broader applicability, and is suitable for visual target detection and automatic labeling tasks on large-scale data sets.

Description

A Visual Object Detection and Labeling Method

Technical Field

The invention relates to the technical field of object detection in computer vision, and in particular to a visual target detection and labeling method based on weakly supervised learning.

Background

Object detection and automatic position labeling in images is a fundamental problem in computer vision and one of the core problems studied in the field. Given a test image, object detection answers the question of what is where. Object detection is widely used in many other vision research problems, such as object recognition, pedestrian detection, face detection, foreground detection in surveillance scenes, motion tracking, and behavior recognition and analysis.

Conventional object detection requires a database in which the bounding rectangles of objects are annotated, so that purely supervised object detection models, such as those based on the histogram of oriented gradients (HOG) or the deformable part model (DPM), can be trained. The rapid development of digital media technology has led to explosive growth of image and video data, and the spread of the Internet has made it much easier to obtain massive amounts of such data. Faced with image data at this scale, a serious problem for current object detection and labeling algorithms is that most of the data has no usable object position annotations. Annotating positions for massive image data is a very labor-intensive and costly task.

By comparison, labeling the category of a whole image is much easier, and pre-filtering with methods such as unsupervised clustering makes it possible to build a fairly large classification database in a short time. Therefore, using an image database with only category labels to learn object categories and localize objects automatically, that is, achieving visual target detection and labeling through weakly supervised learning, has significant theoretical value and practical importance.

In traditional weakly supervised learning algorithms, candidate regions are generally selected by a densely sampled candidate window algorithm; the number of windows is very large, and neither the recall nor the overlap is satisfactory. In addition, candidate windows are usually described with a bag-of-words model. A bag-of-words model usually has few levels of feature transformation, so the resulting features can be regarded as mid-level representations, which lack the higher-level information that would let the model discover object appearance models from images automatically.

At present, mainstream methods for weakly supervised object detection and labeling include multi-instance learning, topic models, and conditional random fields. Many traditional multi-instance learning algorithms depend heavily on kernel learning or distance-metric-based learning frameworks and use optimization algorithms of high complexity, such as heuristic algorithms, quadratic programming, and integer programming, so they are difficult to apply efficiently to large-scale data sets.

Therefore, how to improve and optimize weakly supervised learning algorithms so as to efficiently achieve object detection and automatic position labeling on massive image collections is an important problem in the prior art that urgently needs to be solved.

Summary of the Invention

In view of this, the main purpose of the present invention is to provide a visual target detection and labeling method for weakly supervised scenarios which, given only image category labels, can automatically locate targets of interest in an image collection and can also automatically label object positions in images.

To achieve the above purpose, the present invention provides the following technical solutions:

A visual target detection and labeling method, characterized in that it comprises:

an image input step of inputting an image to be detected;

a candidate region extraction step of using a selective search algorithm to extract candidate windows from the image to be detected as candidate regions;

a feature description extraction step of using a pre-trained large-scale convolutional neural network to describe the features of each candidate region and output the feature description of that candidate region;

a visual target prediction step of, based on the feature descriptions of the candidate regions, using a pre-trained object detection model to make predictions on the candidate regions and estimate the region where the visual target exists;

a position labeling step of marking the position of the visual target according to the estimation result.

Preferably, the selective search algorithm in the candidate region extraction step further comprises:

converting the color space of the image to be detected into predetermined color spaces, segmenting the image with a graph-based over-segmentation algorithm, and repeatedly merging the two regions with the highest similarity to obtain a hierarchical segmentation of the image; after the region sets from the multiple color spaces and multiple hierarchy levels are combined and deduplicated, the candidate region set of the image is obtained.

Preferably, the predetermined color spaces include HSV, RGI, I, and Lab.

Preferably, the pre-trained convolutional neural network is a convolutional neural network trained on the object classification database ImageNet 2013.

Preferably, the method further comprises an object detection model training step, which specifically comprises:

inputting training set images with image category labels;

using the selective search algorithm to extract candidate windows from the training set images as candidate regions;

using the pre-trained large-scale convolutional neural network to describe the features of each candidate region and output the feature description of that candidate region;

based on the feature descriptions of the candidate regions, training an object appearance model with a multi-instance linear support vector machine.

Preferably, training the object detection model with the multi-instance linear support vector machine comprises:

training the object detection model with MILinear, an unconstrained large-margin multi-instance learning algorithm, whose objective function is:

$\min_{w} \frac{1}{2} \|w\|^{2} + \frac{C}{|B|} \sum_{i=1}^{|B|} \left( \max\left(0, 1 - y^{i} w^{T} B_{I_{i}}^{i}\right) \right)^{2},$

where an image $I^{i}$ is described by a bag $B^{i}$ containing $n^{i}$ $d$-dimensional examples, the $j$-th of which is denoted $B_{j}^{i}$. If a bag contains at least one positive example, the bag label $y^{i}$ is +1; if all its examples are negative, the bag label $y^{i}$ is -1. The training set is $B = \{(B^{i}, y^{i}) \mid i = 1, 2, \ldots, N\}$, where $|B| = N$ is the number of training samples, $w$ is the classifier coefficient vector, $C$ is the regularization parameter that controls the penalty for misclassification, and $I_{i}$ is the index of the example with the highest prediction score in bag $B^{i}$.

Preferably, the MILinear problem is solved with the trust region Newton method, comprising:

noting that the MILinear optimization objective is an unconstrained differentiable function whose first derivative is:

$g(w) = w + 2 \frac{C}{|B|} \sum_{i \in I_{B}} \left( w^{T} B_{I_{i}}^{i} B_{I_{i}}^{iT} - y^{i} B_{I_{i}}^{iT} \right),$

where $I_{B} = \{ i \mid 1 - y^{i} w^{T} B_{I_{i}}^{i} > 0,\ i = 1, 2, \ldots, |B| \}$ is the set of examples whose margin is smaller than 1;

computing the generalized Hessian matrix by the following formula

where $I$ is the identity matrix;
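The Hessian formula itself did not survive in this text. As a hedged reconstruction, which is an assumption and not a quotation of the source, differentiating the first derivative $g(w)$ above while holding the set $I_{B}$ and each index $I_{i}$ fixed gives a form in which the identity matrix $I$ appears exactly as the text describes:

$H(w) = I + 2 \frac{C}{|B|} \sum_{i \in I_{B}} B_{I_{i}}^{i} B_{I_{i}}^{iT}$

Because the squared hinge loss is not twice differentiable where the margin equals 1, such a matrix is a generalized Hessian and not an ordinary second derivative.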

optimizing the objective function iteratively by computing

$\begin{aligned} s^{k} = \arg\min_{s} q_{k}(s) &= \arg\min_{s} \nabla f(w^{k})^{T} s + \frac{1}{2} s^{T} \nabla^{2} f(w^{k}) s \\ &= \arg\min_{s} g(w^{k})^{T} s + \frac{1}{2} s^{T} H(w^{k}) s, \quad \text{s.t. } \|s\| \leq \Delta_{k}, \end{aligned}$

where $s^{k}$ is the update step, $\Delta_{k}$ is the trust region radius, and $g(w^{k})$ and $H(w^{k})$ are the first and second derivatives of the MILinear objective function, respectively.
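The quadratic model $q_{k}$ and the ratio test used in this part of the method can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the solver used by the invention: the text does not say how the subproblem is solved, so the sketch substitutes the simple Cauchy point (the minimizer of $q_{k}$ along the negative gradient inside the trust region). The function names and the value of `eta0` in the usage note are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mat_vec(H, v):
    return [dot(row, v) for row in H]

def quadratic_model(g, H, s):
    """q_k(s) = g^T s + 1/2 s^T H s, the predicted change of the objective."""
    return dot(g, s) + 0.5 * dot(s, mat_vec(H, s))

def cauchy_step(g, H, delta):
    """Assumed subproblem solver: minimize q_k along -g with ||s|| <= delta."""
    g_norm = math.sqrt(dot(g, g))
    if g_norm == 0.0:
        return [0.0] * len(g)
    curvature = dot(g, mat_vec(H, g))
    tau_max = delta / g_norm  # step length that reaches the region boundary
    tau = min(dot(g, g) / curvature, tau_max) if curvature > 0.0 else tau_max
    return [-tau * gi for gi in g]

def accept_step(f, w, s, q_s, eta0):
    """Return w + s if (f(w + s) - f(w)) / q_k(s) > eta0, otherwise w."""
    if q_s == 0.0:
        return w
    w_new = [wi + si for wi, si in zip(w, s)]
    rho = (f(w_new) - f(w)) / q_s
    return w_new if rho > eta0 else w
```

For example, with `f = lambda w: 0.5 * dot(w, w)`, `w = [2.0, 0.0]`, gradient `g = w`, `H` the 2x2 identity, and a radius of 10, `cauchy_step` returns a step of length 2 toward the origin, `quadratic_model` predicts a change of -2, and `accept_step` with `eta0 = 1e-4` accepts the step. A practical solver would also grow or shrink the radius between iterations, which the text here does not describe.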

After the update step $s^{k}$ has been obtained, $w^{k}$ is updated if the actual decrease of the objective function is large enough; otherwise $w^{k}$ is kept unchanged:

$w^{k+1} = \begin{cases} w^{k} + s^{k} & \text{if } \dfrac{f(w^{k} + s^{k}) - f(w^{k})}{q_{k}(s^{k})} > \eta_{0}, \\ w^{k} & \text{otherwise,} \end{cases}$

where $\eta_{0}$ is a predefined positive number that sets the minimum acceptable actual decrease of the function.

Preferably, the method further comprises running a bag decomposition algorithm with the trained object detection model to gradually reduce the ambiguity of the positive bags in an iterative manner, comprising:

using the object detection model obtained by MILinear training to compute prediction probabilities for all candidate windows of the training set images; decomposing each positive bag into one positive bag and one negative bag according to these probabilities; and training a new MILinear object detection model on the decomposed data set, where the decomposition process may be iterated several times.

The visual target detection and labeling method provided by the present invention has several clear advantages:

1) Selective search obtains the candidate windows in which the target is most likely to appear from a large number of over-segmentation results. Windows obtained this way preserve object boundaries well and overlap closely with the real objects, while maintaining very high recall with only a few hundred to a few thousand candidate windows.

2) A convolutional neural network pre-trained on a very large image classification data set extracts feature representations from the candidate windows. These representations carry stronger high-level semantic information, which allows the model to discover object appearance models from images automatically.

3) A new multi-instance linear support vector machine model is adopted and optimized with an algorithm based on the trust region Newton method, so weakly supervised detection models can be learned efficiently on large-scale data sets.

4) A new bag decomposition algorithm is adopted. By decomposing each positive bag into one positive bag and one negative bag, it greatly reduces the ambiguity in the positive bags and effectively improves the performance of the weakly supervised detection model.

Description of the Drawings

Fig. 1 is a flowchart of model training and testing for the weakly supervised visual target detection and labeling method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of MILinear and of MILinear with bag decomposition according to an embodiment of the present invention;

Fig. 3 is a schematic comparison of the results of optimization with the trust region Newton method and with other optimization methods according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of the relationship between the prediction score of the trained object detection model and the sample overlap according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of the performance improvement of several object categories over the iterations of the bag decomposition algorithm according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of the detection results on the Pascal VOC2007 database of the object detection model trained according to an embodiment of the present invention.

Detailed Description

To make the purpose, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

The key ideas of the present invention are: 1) selective search, based on a large number of over-segmentation results, achieves high target recall and overlap with relatively few candidate windows; 2) a convolutional neural network pre-trained on a very large image classification data set extracts feature representations from the candidate windows, yielding rich representations with stronger high-level semantic information; 3) a new multi-instance linear support vector machine model is adopted and optimized with an algorithm based on the trust region Newton method, so weakly supervised detection models can be learned efficiently on large-scale data sets; 4) a new bag decomposition algorithm decomposes each positive bag into one positive bag and one negative bag, which greatly reduces the ambiguity in the positive bags and effectively improves the performance of the weakly supervised detection model.

As shown in Fig. 1, the upper half of Fig. 1 is the model training flowchart of the weakly supervised visual target detection and labeling method according to an embodiment of the present invention. First, an image is input. Second, a selective search algorithm extracts candidate windows from the input image to obtain candidate regions. Then the candidate regions, that is, the candidate window samples, are fed one by one into the convolutional neural network to obtain the feature description, or region representation, of each candidate region. Finally, based on the feature descriptions, the weakly supervised learning algorithm proposed by the present invention learns the object appearance model automatically, which amounts to positive sample mining. The lower half of Fig. 1 shows the testing process of the method. For a test image, candidate windows are extracted in the same way as during training, a deep convolutional neural network describes the features of the window regions, and the previously trained object appearance model classifies the window regions to perform the target detection or labeling task. The method includes the following steps:

S1. Candidate region extraction: a selective search algorithm extracts candidate windows from the training set images as candidate regions.

When only image category labels are given, it is known only that an image contains objects of certain categories, such as "car" or "person", while the positions of the "car" and the "person" are unknown, so the bounding rectangles of the objects must be determined by an algorithm. Extracting every possible rectangle from an image is unrealistic because the number of possible rectangles is enormous. A candidate region extraction algorithm therefore extracts a limited number of likely object rectangles that cover the objects to be localized as completely as possible. Three indicators are critical here: first, the number of candidate windows, where fewer windows mean higher algorithm efficiency; second, the recall, that is, the ratio of the number of real objects covered by the candidate windows to the total number of objects; third, the overlap between the candidate windows and the bounding rectangles of the real objects. With densely sampled candidate window algorithms, the number of windows is very large, and neither the recall nor the overlap is satisfactory.
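The three indicators can be made concrete with a small sketch. The box format `(x1, y1, x2, y2)`, the function names, the use of intersection over union as the overlap measure, and the 0.5 threshold for counting an object as covered are assumptions made for illustration; the text does not fix any of them.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def proposal_metrics(candidates, ground_truth, threshold=0.5):
    """Return (number of windows, recall, mean best overlap)."""
    if not ground_truth:
        return len(candidates), 0.0, 0.0
    # Best overlap achieved by any candidate window for each real object.
    best = [max((iou(c, gt) for c in candidates), default=0.0)
            for gt in ground_truth]
    recall = sum(o >= threshold for o in best) / len(ground_truth)
    mean_best_overlap = sum(best) / len(best)
    return len(candidates), recall, mean_best_overlap
```

A good candidate region extractor keeps the first number small while pushing the second and third toward 1.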

The selective search algorithm used in the present invention is a candidate window extraction algorithm based on over-segmentation. It over-segments the image with different parameters to obtain different sets of image blocks, and then merges the blocks hierarchically to find the bounding rectangles most likely to contain objects. The specific steps are as follows. First, the original image is converted from the RGB color space to other color spaces, including HSV, RGI, I, and Lab. Then a graph-based over-segmentation algorithm segments each converted image, and the two most similar regions are merged repeatedly, following the idea of hierarchical grouping, to obtain a hierarchical segmentation of the image. The region sets from the multiple color spaces and multiple hierarchy levels are combined and deduplicated to give the candidate region set of the image.

The selective search algorithm runs efficiently and achieves very high recall and overlap with a few hundred to a few thousand candidate windows.
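The merge-and-deduplicate part of this procedure can be illustrated with a toy sketch. It is a deliberate simplification under stated assumptions: each initial region is a hypothetical dictionary with a bounding box, a pixel count, and a single mean color value, similarity is just the difference of mean colors, and any two regions may be merged. Published selective search implementations typically merge only neighboring regions and combine several similarity measures, and the over-segmentation itself is not shown.

```python
def merge_regions(a, b):
    """Merge two regions: union bounding box, size-weighted mean color."""
    size = a["size"] + b["size"]
    color = (a["color"] * a["size"] + b["color"] * b["size"]) / size
    box = (min(a["box"][0], b["box"][0]), min(a["box"][1], b["box"][1]),
           max(a["box"][2], b["box"][2]), max(a["box"][3], b["box"][3]))
    return {"box": box, "size": size, "color": color}

def hierarchical_boxes(regions):
    """Repeatedly merge the two most similar regions, keeping every box."""
    active = list(regions)
    boxes = [r["box"] for r in active]
    while len(active) > 1:
        pairs = ((i, j) for i in range(len(active))
                 for j in range(i + 1, len(active)))
        # Most similar pair = smallest difference in mean color (assumed measure).
        i, j = min(pairs, key=lambda p: abs(active[p[0]]["color"]
                                            - active[p[1]]["color"]))
        merged = merge_regions(active[i], active[j])
        active = [r for k, r in enumerate(active) if k not in (i, j)]
        active.append(merged)
        boxes.append(merged["box"])
    return boxes

def candidate_set(box_lists):
    """Combine boxes from several color spaces and drop exact duplicates."""
    seen, result = set(), []
    for boxes in box_lists:
        for box in boxes:
            if box not in seen:
                seen.add(box)
                result.append(box)
    return result
```

In use, `hierarchical_boxes` would be run once per color space on that color space's over-segmentation, and `candidate_set` would combine the resulting lists.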

S2. A pre-trained large-scale convolutional neural network describes the features of each candidate region and outputs the feature description.

After candidate regions that may contain objects of interest have been obtained, determining with computer vision and pattern recognition algorithms whether a candidate window contains a certain object first requires a feature description of the candidate region, so that a classifier can then make the classification decision. In image classification and recognition, commonly used image descriptions include low-level feature descriptors such as SIFT, LBP, and HOG, mid-level feature descriptions such as the bag-of-words model, and hierarchical feature representations that have become very popular in recent years, such as convolutional neural networks and deep belief networks. Weakly supervised object detection and labeling has to solve an object-level recognition problem: it must answer the semantic-level question of what object is where by eliminating the ambiguity of weak supervision. Low-level and mid-level feature descriptions cannot handle such a high-level semantic problem well; it requires highly abstract high-level feature representations. Convolutional neural networks have achieved a series of major breakthroughs in object recognition. Their hierarchical feature representation abstracts features layer by layer from low level to high level: the early feature layers usually act as edge and corner detectors, and as the number of layers increases, the later features gradually come to describe object parts and whole objects. Extracting features from the later layers of a convolutional neural network therefore yields a higher-level (for example, object-level) description of the image. Another important property of convolutional neural networks is their very large model capacity: the more layers and neurons a network has, the higher the model complexity and the more information it can encode and store.

On this basis, the present invention trains a large-scale convolutional neural network on a very large image data set, ImageNet 2013, so that a large amount of general object information is stored in the network. Preferably, the large-scale general object classification database ImageNet 2013 is used to train the convolutional neural network. The training data contains about 1.2 million images in 1,000 categories. The network has 5 convolutional layers and 2 fully connected layers, with a max pooling layer after the 1st, 2nd, and 5th convolutional layers, and contains about 650,000 neurons in total. Just as the large body of knowledge humans possess helps them tell objects apart, this convolutional neural network, which holds a large amount of general visual prior information, can be used effectively for general-purpose description of objects.

S3. Given only image category labels, a multi-instance linear support vector machine (MI-SVM) trains an object detection model on the feature representations of the candidate regions.

The present invention has obtained a set of candidate windows from the image with the selective search algorithm and described these candidate windows with a pre-trained large-scale convolutional neural network. The next task is to learn an object detection model automatically on these candidate window feature descriptions. With the trained object detection model, predictions can be made on the candidate regions to find the regions most likely to contain objects.

Weakly supervised object detection and labeling can usually be modeled as a multi-instance learning problem. An image $I^{i}$ is described by a bag $B^{i}$ containing $n^{i}$ $d$-dimensional examples, the $j$-th of which is denoted $B_{j}^{i}$. If a bag contains at least one positive example, the bag label $y^{i}$ is +1; if all its examples are negative, the bag label $y^{i}$ is -1. To avoid handling the bias term explicitly later, the present invention appends an extra 1 to the end of each example feature. The MI-SVM problem is written as

$\min_{w} \frac{1}{2} \|w\|^{2} + C \sum_{i=1}^{|B|} \xi_{i} \quad (1)$

$\text{s.t. } \max_{j} \left( w^{T} B_{j}^{i} \right) \geq +1 - \xi_{i}, \quad y^{i} = +1$

$\max_{j} \left( w^{T} B_{j}^{i} \right) \leq -1 + \xi_{i}, \quad y^{i} = -1$

$\xi_{i} \geq 0$

The training set is $B = \{(B^{i}, y^{i}) \mid i = 1, 2, \ldots, N\}$, where $|B| = N$ is the number of training samples, $w$ is the classifier coefficient vector, $C$ is the regularization parameter that controls the penalty for misclassification, and $\xi_{i}$ is the slack variable.

In the multi-instance learning framework, image-level annotation introduces ambiguity in the positive bags: it is known only that a positive bag contains at least one positive sample, not which one. The MI-SVM algorithm addresses this by considering only the example with the largest prediction score $w^{T} B_{j}^{i}$ in each bag and predicting the bag label from it, as shown in Fig. 2(a). The hyperplane of the MI-SVM algorithm is determined by the highest-scoring example of each bag. Its optimization problem is a mixed integer program that can only be solved with heuristic algorithms, which is very slow.

S3.1 MILinear algorithm

Unlike the small data sets handled in traditional multi-instance learning problems, the present invention mainly considers large-scale problems with more than 5,000 bags, each containing hundreds to thousands of high-dimensional examples. To solve weakly supervised problems efficiently at this scale, the present invention proposes a new unconstrained large-margin multi-instance linear support vector machine algorithm, called MILinear, defined as:

$\min_{w} \frac{1}{2} \|w\|^{2} + \frac{C}{|B|} \sum_{i=1}^{|B|} \left( \max\left(0, 1 - y^{i} w^{T} B_{I_{i}}^{i}\right) \right)^{2} \quad (2)$

where $B_{j}^{i}$ is the feature vector of the $j$-th instance in the $i$-th bag and $y^{i}$ is the category label of the $i$-th bag. The second term uses the squared hinge loss, where $\max(a, b)$ takes the larger of $a$ and $b$, and

$I_{i} = \arg\max_{j} w^{T} B_{j}^{i} \quad (3)$

is the index of the example with the highest prediction score in bag $B^{i}$.

Gradient-based optimization methods are widely used for large-scale optimization problems, and the present invention accordingly uses a differentiable hinge loss, namely the squared hinge loss in Eq. (2). As shown in Fig. 2(a), MI-SVM and MILinear both solve this large-scale multi-instance learning problem by selecting the highest-scoring example in each bag.
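The MILinear objective of Eq. (2), together with the selection of the highest-scoring example in Eq. (3), can be written directly in plain Python. This is a minimal sketch: the function names and the list-of-lists data layout are assumptions, and each example is assumed to already carry the appended constant 1 that stands in for the bias term.

```python
def score(w, x):
    """Linear prediction score w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def add_bias(x):
    """Append the constant 1 so the bias is part of w."""
    return list(x) + [1.0]

def top_instance(w, bag):
    """Index I_i of the highest-scoring example in a bag, as in Eq. (3)."""
    return max(range(len(bag)), key=lambda j: score(w, bag[j]))

def milinear_objective(w, bags, labels, C):
    """Value of Eq. (2): L2 regularizer plus averaged squared hinge loss."""
    regularizer = 0.5 * score(w, w)
    loss = 0.0
    for bag, y in zip(bags, labels):
        top = bag[top_instance(w, bag)]
        # Only the top-scoring example of each bag enters the loss.
        loss += max(0.0, 1.0 - y * score(w, top)) ** 2
    return regularizer + C / len(bags) * loss
```

Because only one example per bag contributes to the loss, the cost of one evaluation is dominated by scoring every example once to find each bag's maximum.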

S3.2 Bag decomposition algorithm

In experiments with MILinear, it was found that within a positive bag the positive samples are usually concentrated among the top 30% of instances by score. Based on this observation, the present invention proposes a new bag decomposition algorithm that reduces the ambiguity of a positive bag by splitting it into one positive bag and one negative bag. Preferably, the model trained by MILinear is used to compute prediction probabilities for all candidate windows of the training images, and each positive bag is split according to these probabilities: the 30% of samples with the highest probability form the new positive bag, and the remaining samples form a new negative bag. A new MILinear model is then trained on the decomposed data set, as shown in Figure 2(b). Bag decomposition reduces the ambiguity of the samples in the positive bags and thereby improves the classification performance of the model. The decomposition may be iterated several times, until model performance stops improving.
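The split of one positive bag described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `decompose_bag`, the integer `top_percent` parameter, the rule that at least one instance always stays positive, and the example scores are all introduced here and are not specified by the source.

```python
def decompose_bag(scores, top_percent=30):
    """Split one positive bag into (new_positive, new_negative) index lists.

    scores: prediction probability (or score) of each candidate window
    in the bag, as produced by the current MILinear model.
    """
    n = len(scores)
    # Rank instances from highest to lowest score.
    order = sorted(range(n), key=lambda j: scores[j], reverse=True)
    # Ceiling of n * top_percent / 100 in integer arithmetic;
    # keeping at least one positive instance is an assumption.
    keep = max(1, -(-n * top_percent // 100))
    return order[:keep], order[keep:]

# Hypothetical scores for a bag of ten candidate windows.
scores = [0.1, 0.9, 0.4, 0.8, 0.2, 0.3, 0.05, 0.6, 0.7, 0.5]
pos, neg = decompose_bag(scores)
print(pos, neg)
```

In the full procedure, this split would be applied to every positive bag, a new model trained on the resulting bags, and the two steps repeated until performance stops improving.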

S3.3 Gradient optimization algorithm

The MILinear algorithm has been defined above. This section discusses how to learn the model efficiently on large-scale data sets. The MILinear objective function is unconstrained and differentiable, and its first derivative is

$g(w) = w + 2 \frac{C}{|B|} \sum_{i \in I_B} \left( w^{T} B_{I_i}^{i} \, B_{I_i}^{iT} - y^{i} B_{I_i}^{iT} \right) \qquad (4)$

where

$I_B = \{\, i \mid 1 - y^{i} w^{T} B_{I_i}^{i} > 0, \; i = 1, 2, \ldots, |B| \,\} \qquad (5)$

is the set of selected instances (one per bag) whose margin is less than 1.
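The gradient in Formula (4), with the active set of Formula (5), can be sketched as follows. This is a minimal pure-Python illustration; the function names and toy data are assumptions, and the selected instance of each bag is held fixed at the current w, as the formulas imply.

```python
def dot(w, x):
    """Inner product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))

def milinear_gradient(w, bags, labels, C):
    """Formula (4): w + 2C/|B| * sum over I_B of ((w^T x) x - y x),
    where x is the highest-scoring instance of each bag (Formula 3)."""
    grad = list(w)  # gradient of the regularizer 0.5*||w||^2
    scale = 2.0 * C / len(bags)
    for bag, y in zip(bags, labels):
        x = max(bag, key=lambda inst: dot(w, inst))
        score = dot(w, x)
        if 1.0 - y * score > 0.0:  # bag is in I_B (Formula 5)
            for d in range(len(w)):
                grad[d] += scale * (score * x[d] - y * x[d])
    return grad

# Toy data (hypothetical): one positive and one negative bag of 2-D instances.
bags = [[[1.0, 0.0], [0.0, 2.0]], [[-1.0, 0.0], [0.0, -1.0]]]
labels = [+1, -1]
g = milinear_gradient([0.0, 1.0], bags, labels, C=1.0)
print(g)
```

Bags whose selected instance already has a margin of at least 1 contribute nothing, so the cost of one gradient evaluation is dominated by scoring the instances.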

Once an analytical expression for the gradient of the objective function is available, many methods can be used to optimize it, including stochastic gradient descent (SGD), L-BFGS, and the nonlinear conjugate gradient method (CG). SGD processes the data set one sample at a time and updates the model iteratively. L-BFGS is a quasi-Newton method that avoids storing the full Hessian matrix by using a low-rank approximation of it. In general, SGD has a low cost per step but needs many iterations, whereas second-order methods such as L-BFGS cost more per step but converge faster overall.

To learn the object appearance model more efficiently, the present invention proposes an optimization algorithm for the multi-instance linear support vector machine based on the trust region Newton method, which is more efficient than L-BFGS. The trust region Newton method is a very efficient solver for large-scale unconstrained problems and has been applied to large-scale logistic regression and support vector machine training. To apply it to the MILinear problem, the generalized Hessian matrix is calculated using the following formula

where I is the identity matrix.
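As a worked step, differentiating the gradient in Formula (4) once more with respect to w, with the index set $I_B$ and the selected instances held fixed, gives a matrix of the form below. This is a derivation sketch that is consistent with the identity-matrix term mentioned above; it is an assumption derived from Formula (4), not a quotation of the patent's own formula.

$H(w) = I + 2 \frac{C}{|B|} \sum_{i \in I_B} B_{I_i}^{i} \, B_{I_i}^{iT}$

A matrix of this form never has to be stored explicitly: its product with a vector v can be accumulated as v plus a weighted sum of $B_{I_i}^{i} (B_{I_i}^{iT} v)$ over the bags in $I_B$.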

The trust region Newton method optimizes the objective function iteratively; each iteration attempts to solve the following subproblem, which includes a trust region constraint

$\begin{matrix} s^{k} = \min q_k(s) = \underset{s}{\min} \; \nabla f(w^{k})^{T} s + \frac{1}{2} s^{T} \nabla^{2} f(w^{k}) s \\ = \underset{s}{\min} \; g(w^{k})^{T} s + \frac{1}{2} s^{T} H(w^{k}) s, \quad \text{s.t. } \|s\| \leq \Delta_k \end{matrix} \qquad (7)$

where $s^k$ is the update step, $\Delta_k$ is the trust region, and $g(w^k)$ and $H(w^k)$ are the first and second derivatives of the MILinear objective function (Formula 2), respectively.

This subproblem can be solved efficiently with a conjugate gradient method that takes the trust region into account.
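One common way to solve such a subproblem is a truncated conjugate gradient iteration that stops at the trust region boundary (often called Steihaug-style CG). The source does not name the variant it uses, so the sketch below is an assumption; the function names and the `hess_vec` callback (which returns the product H·v) are hypothetical.

```python
import math

def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

def to_boundary(s, d, delta):
    """Move from s along d until ||s + tau*d|| = delta, with tau >= 0."""
    a, b, c = dot(d, d), 2.0 * dot(s, d), dot(s, s) - delta * delta
    tau = (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return [si + tau * di for si, di in zip(s, d)]

def trust_region_cg(g, hess_vec, delta, tol=1e-10, max_iter=100):
    """Approximately minimize g^T s + 0.5 s^T H s subject to ||s|| <= delta."""
    s = [0.0] * len(g)
    r = [-gi for gi in g]   # residual -(g + H s) at s = 0
    d = list(r)             # first search direction: steepest descent
    rr = dot(r, r)
    for _ in range(max_iter):
        if math.sqrt(rr) < tol:
            break
        Hd = hess_vec(d)
        dHd = dot(d, Hd)
        if dHd <= 0.0:      # non-positive curvature: step to the boundary
            return to_boundary(s, d, delta)
        alpha = rr / dHd
        s_next = [si + alpha * di for si, di in zip(s, d)]
        if math.sqrt(dot(s_next, s_next)) >= delta:
            return to_boundary(s, d, delta)  # step would leave the region
        s = s_next
        r = [ri - alpha * hi for ri, hi in zip(r, Hd)]
        rr_new = dot(r, r)
        d = [ri + (rr_new / rr) * di for ri, di in zip(r, d)]
        rr = rr_new
    return s

# Toy check (hypothetical): H = I and g = (2, 0).
identity = lambda v: list(v)
inside = trust_region_cg([2.0, 0.0], identity, delta=10.0)
clipped = trust_region_cg([2.0, 0.0], identity, delta=1.0)
print(inside, clipped)
```

With a large radius the iteration returns the full Newton step; with a small radius the step is cut off on the boundary of the trust region.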

Once the update step $s^k$ has been obtained, $w^k$ is updated if the actual decrease in the objective function is large enough; otherwise $w^k$ is left unchanged.

$w^{k+1} = \begin{cases} w^{k} + s^{k} & \text{if } \dfrac{f(w^{k} + s^{k}) - f(w^{k})}{q_k(s^{k})} > \eta_0, \\ w^{k} & \text{otherwise.} \end{cases} \qquad (8)$

where $\eta_0$ is a predefined positive number that sets the minimum acceptable actual decrease of the objective function; the update direction is accepted when the ratio in Formula (8) exceeds this value. In one embodiment of the present invention, it is preferably set to 1e-4.
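The acceptance rule of Formula (8) can be sketched as follows. This is an illustrative sketch: the function name `accept_step`, the guard against a zero predicted change, and the toy objective are assumptions introduced here, and the adjustment of the trust region size between iterations, which a full trust region method also performs, is left out.

```python
def accept_step(f, q_k, w, s, eta0=1e-4):
    """Formula (8): return w + s if the ratio of actual change to
    predicted change q_k(s) exceeds eta0, otherwise return w unchanged."""
    w_new = [wi + si for wi, si in zip(w, s)]
    predicted = q_k(s)  # negative for a step that the model expects to help
    if predicted == 0.0:
        return list(w)  # guard against division by zero (assumption)
    ratio = (f(w_new) - f(w)) / predicted
    return w_new if ratio > eta0 else list(w)

def f(w):
    # Toy piecewise-quadratic objective (hypothetical), steeper for negative w.
    return sum(wi * wi if wi >= 0 else 10.0 * wi * wi for wi in w)

def q(s):
    # Quadratic model of f around w = [1.0]: gradient 2, Hessian 2.
    return 2.0 * s[0] + s[0] * s[0]

w = [1.0]
accepted = accept_step(f, q, w, [-1.0])  # model and objective agree
rejected = accept_step(f, q, w, [-1.5])  # model predicts a decrease, f increases
print(accepted, rejected)
```

The second step is rejected because the quadratic model is a poor description of the objective that far from w, which is exactly the situation the ratio test is meant to catch.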

Strictly speaking, the MILinear objective function is non-convex because of the max function it introduces, and it is not twice differentiable. Although a global optimum cannot be guaranteed, in practice the algorithm effectively learns object appearance models from large-scale data sets.

S4. Extract candidate regions from the test image, describe their features in the same way, and locate the objects of interest with the object detection model trained above. In the testing stage, a number of candidate regions are first obtained with the selective search algorithm, and their features are then described with the same convolutional neural network used in the training stage. The object appearance model trained above then classifies the window features, which determines whether each candidate window is an object of interest and thus which object is at which position. In this way, objects of interest are detected and labeled automatically using only image label information.
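The classification of candidate windows at test time can be sketched as follows, assuming the windows and their feature vectors have already been produced by selective search and the convolutional network. The function name `detect`, the score threshold of 0, and the example windows are hypothetical and not taken from the source.

```python
def detect(w, windows, threshold=0.0):
    """Return (box, score) pairs whose linear score w^T x exceeds
    the threshold, sorted from highest to lowest score."""
    hits = []
    for box, feat in windows:
        score = sum(wi * xi for wi, xi in zip(w, feat))
        if score > threshold:
            hits.append((box, score))
    return sorted(hits, key=lambda h: h[1], reverse=True)

# Hypothetical candidate windows: ((x1, y1, x2, y2), feature vector).
windows = [((0, 0, 50, 50), [0.2, 0.1]),
           ((10, 10, 90, 90), [1.0, 0.5]),
           ((40, 40, 60, 60), [-0.5, 0.3])]
w = [1.0, -1.0]  # stand-in for a trained object appearance model
result = detect(w, windows)
print([box for box, _ in result])
```

Each returned box is a labeled position for the object category that the model was trained on.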

Figure 3 compares the results of optimization with the trust region Newton method against other optimization methods according to an embodiment of the present invention. Figure 4 shows the relationship between the prediction score of the trained object detection model and the sample overlap according to an embodiment of the present invention. Figure 5 shows the performance improvement for several object categories over the iterations of the bag decomposition algorithm according to an embodiment of the present invention. Figure 6 shows detection results of the trained object detection model on the Pascal VOC2007 database according to an embodiment of the present invention.

In summary, the present invention proposes a new visual target detection and labeling method based on weakly supervised learning. It uses a selective search algorithm to extract candidate windows, uses a deep convolutional neural network pre-trained on a large amount of data as the feature representation model and general prior for the candidate windows, and uses an algorithm based on a multi-instance linear support vector machine to mine positive samples. By optimizing the model with the trust region Newton method and gradually reducing the ambiguity of the positive bags with a novel bag decomposition algorithm, the method achieves visual target detection and automatic labeling in weakly supervised settings. Experiments show that, compared with mainstream weakly supervised visual target detection and labeling methods, the invention has stronger positive sample mining ability and broader application prospects, and is suitable for visual target detection and automatic labeling tasks on large-scale data sets.

The specific embodiments described above further explain the purpose, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A method for visual target detection and labeling, comprising:

an image input step of inputting an image to be detected;

a candidate region extraction step of extracting candidate windows from the image to be detected as candidate regions using a selective search algorithm;

a feature description extraction step of describing the features of the candidate regions using a pre-trained large-scale convolutional neural network and outputting the feature descriptions of the candidate regions;

a visual target prediction step of predicting on the candidate regions, based on their feature descriptions, using a pre-trained object detection model, and estimating the regions in which a visual target exists;

a position labeling step of marking the position of the visual target according to the estimation result.

2. The method according to claim 1, wherein the selective search algorithm in the candidate region extraction step further comprises:

converting the color space of the image to be detected into a predetermined color space, segmenting the image with a graph-based over-segmentation algorithm, and repeatedly merging the two most similar regions to obtain a hierarchical segmentation of the image; the region sets from the multiple segmentation levels are then merged and deduplicated to obtain the candidate region set of the image.

3. The method according to claim 2, wherein the predetermined color space comprises: HSV, RGI, I, Lab.

4. The method according to claim 1, wherein the pre-trained convolutional neural network is a convolutional neural network trained on the ImageNet 2013 object classification database.

5. The method according to claim 1, further comprising an object detection model training step, specifically comprising:

inputting training set images with image category labels;

extracting candidate windows from the training set images as candidate regions using the selective search algorithm;

describing the features of the candidate regions using the pre-trained large-scale convolutional neural network and outputting the feature descriptions of the candidate regions;

training the object appearance model using a multi-instance linear support vector machine, based on the feature descriptions of the candidate regions.

6. The method according to claim 5, wherein training the object detection model using a multi-instance linear support vector machine comprises:

training the object detection model with the MILinear unconstrained large-margin multi-instance learning algorithm, whose objective function is

$\underset{w}{\min} \; \frac{1}{2} \|w\|^{2} + \frac{C}{|B|} \sum_{i=1}^{|B|} \left( \max\left(0, 1 - y^{i} w^{T} B_{I_i}^{i}\right) \right)^{2},$

where an image $I^i$ is described by a bag $B^i$ containing $n^i$ d-dimensional instances, the j-th instance being denoted $B_j^i$; if a bag contains at least one positive instance, its label $y^i$ is +1, and if all of its instances are negative, its label $y^i$ is -1; the training set is $B = \{(B^i, y^i) \mid i = 1, 2, \ldots, N\}$, $|B| = N$ is the number of samples in the training set, w is the classifier coefficient vector, C is the regularization parameter controlling the penalty for misclassification, and $I_i$ is the index of the instance with the highest prediction score in bag $B^i$.

7. The method according to claim 6, wherein a trust region Newton method is used to solve the MILinear algorithm, comprising:

determining that the optimization objective function of MILinear is an unconstrained differentiable objective function whose first derivative is

$g(w) = w + 2 \frac{C}{|B|} \sum_{i \in I_B} \left( w^{T} B_{I_i}^{i} \, B_{I_i}^{iT} - y^{i} B_{I_i}^{iT} \right),$

where

$I_B = \{\, i \mid 1 - y^{i} w^{T} B_{I_i}^{i} > 0, \; i = 1, 2, \ldots, |B| \,\}$

is the set of selected instances (one per bag) whose margin is less than 1;

Calculate the generalized Hessian matrix by the following formula

Wherein, I is the identity matrix;

The objective function is optimized in an iterative manner, computing

$\begin{matrix} s^{k} = \min q_k(s) = \underset{s}{\min} \; \nabla f(w^{k})^{T} s + \frac{1}{2} s^{T} \nabla^{2} f(w^{k}) s \\ = \underset{s}{\min} \; g(w^{k})^{T} s + \frac{1}{2} s^{T} H(w^{k}) s, \quad \text{s.t. } \|s\| \leq \Delta_k \end{matrix},$

where k is the iteration number, $s^k$ is the update step, $w^k$ is the weight vector at the k-th iteration, $\Delta_k$ is the trust region, and $\nabla f(w^k) = g(w^k)$ and $\nabla^2 f(w^k) = H(w^k)$ are the first and second derivatives of the MILinear objective function, respectively;

after the update step $s^k$ has been obtained, $w^k$ is updated if the actual decrease in the objective function is large enough, and is otherwise left unchanged, according to the following formula:

$w^{k+1} = \begin{cases} w^{k} + s^{k} & \text{if } \dfrac{f(w^{k} + s^{k}) - f(w^{k})}{q_k(s^{k})} > \eta_0, \\ w^{k} & \text{otherwise,} \end{cases}$

where $\eta_0$ is a predefined positive number that sets the minimum acceptable actual decrease of the objective function.

8. The method according to claim 7, further comprising running a bag decomposition algorithm with the trained object detection model to gradually reduce the ambiguity of the positive bags in an iterative manner, specifically comprising:

using the object detection model trained by MILinear to obtain prediction probabilities for all candidate windows of the training set images, decomposing each positive bag into one positive bag and one negative bag according to the prediction probabilities, and training a new MILinear object detection model on the decomposed data set, wherein the decomposition process is iterated several times.