CN109815911B - Video moving object detection system, method and terminal based on depth fusion network - Google Patents
Video moving object detection system, method and terminal based on depth fusion network
- Publication number: CN109815911B
- Application number: CN201910078362.5A
- Authority
- CN
- China
- Prior art keywords
- video
- basic
- training
- moving object
- fusion
- Prior art date: 2019-01-26
- Legal status: Active (the status is an assumption and is not a legal conclusion)
Abstract
The present invention provides a video moving object detection system based on a deep fusion network, comprising: a video feature extraction module, which receives a video sequence as input, performs feature extraction on the video content to obtain a feature representation of the scene information in the video (the video scene feature representation), and sends it to the deep fusion module; a basic result detection module, which receives the video sequence as input, detects moving objects with basic detectors to obtain the corresponding basic detection results, and sends them to the deep fusion module; and a deep fusion module, which receives the video scene feature representation and the basic detection results, performs optimal fusion with a deep neural network, and outputs the final detection result. A video moving object detection method and a detection terminal are also provided. The present invention achieves high-accuracy detection results.
Description
Technical Field
The present invention relates to the technical field of video moving object detection, and in particular to a video moving object detection system, method and terminal based on a deep fusion network.
Background Art
Video moving object detection serves as the first stage of video image processing and video content analysis. It provides preliminary results for subsequent operations and helps improve the performance of the entire video processing and analysis system, which makes it a crucial technology.
Researchers have proposed a large number of methods for video moving object detection. Most of these methods, however, target a specific scene or class of scenes and are designed through feature engineering with hand-crafted operators. Such traditional methods fall into categories including statistical-model-based, clustering-based, and sparse-representation-based approaches. To date, no traditional method handles all kinds of scenes robustly; most are effective only in certain scenes and perform poorly in others.
Recently, a small number of deep-learning-based video moving object detection methods have appeared. Their biggest difference from traditional methods is that no manual parameter tuning is required; instead, the detection model is learned automatically from data. For example, Wang et al. designed a semi-automatic video moving object detection algorithm based on a deep convolutional network. The method first requires manual annotation of the detection results for some key frames; a deep convolutional neural network is then trained on these annotations and, once training is complete, automatically analyzes the remaining video frames to obtain their moving object detection results. This method achieves highly accurate detection results, but it requires manual intervention and cannot run fully automatically.
The biggest difficulty in obtaining a detection model through deep learning is the scarcity of training data: without enough labeled data, a neural network cannot be trained effectively. No description or report of a technique similar to the present invention has been found, nor has any similar material been collected at home or abroad.
Summary of the Invention
In view of the above deficiencies in the prior art, the present invention provides a video moving object detection system, method and terminal based on a deep fusion network. By combining traditional methods with deep learning techniques, it achieves very robust detection results across a wide variety of scenes.
The present invention is realized through the following technical solutions.
According to one aspect of the present invention, a video moving object detection system based on a deep fusion network is provided, comprising the following modules:
A video feature extraction module, which receives a video sequence as input, performs feature extraction on the video content to obtain a feature representation of the scene information in the video (the video scene feature representation), and sends it to the deep fusion module.

A basic result detection module, which receives the video sequence as input, detects moving objects with basic detectors to obtain the corresponding basic detection results, and sends them to the deep fusion module.

A deep fusion module, which receives the video scene feature representation and the basic detection results, performs optimal fusion with a deep neural network, and outputs the final detection result.
Preferably, the video feature extraction module uses a pre-trained VGG-16 network as the feature extractor: it extracts the features of each video frame and then stacks the per-frame features together to form a descriptor of the video scene, i.e., the video scene feature representation.
Preferably, there are multiple basic detectors, each of which applies one traditional motion detection method to detect moving objects, yielding multiple corresponding basic detection results.
Preferably, there are four basic detectors, each of which applies one of the following traditional motion detection methods:

- a pixel-based adaptive semantic-association segmentation method;
- a foreground-background segmentation method based on edge detection;
- a background segmentation method based on a shared model;
- a background segmentation method based on sample-point weighting.
Preferably, the deep fusion module takes the video scene feature representation as input, obtains an optimal fusion weight map through four convolutional layers and one Soft-Max layer, and then applies pixel-wise linear weighting to the basic detection results according to the optimal fusion weight map.
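To make this preferred structure concrete, the following is a minimal PyTorch sketch of such a fusion module, assuming four basic detectors. The intermediate channel widths, kernel sizes, and the bilinear upsampling of the weight map back to frame resolution are assumptions not fixed by the invention, which specifies only the four convolutional layers, the Soft-Max layer, and the pixel-wise weighting.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepFusionModule(nn.Module):
    """Four convolutional layers plus a Soft-Max produce a per-pixel weight
    map M, which linearly weights the basic detection results B."""

    def __init__(self, in_channels: int, num_detectors: int = 4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, num_detectors, 3, padding=1),
        )

    def forward(self, scene_descriptor: torch.Tensor,
                basic_results: torch.Tensor) -> torch.Tensor:
        # scene_descriptor: (N, C, h, w) stacked per-frame feature maps
        # basic_results:    (N, 4, H, W) masks B(n) from the basic detectors
        logits = self.body(scene_descriptor)
        # Assumption: the weight map is upsampled to frame resolution so the
        # weighting can be applied pixel by pixel.
        logits = F.interpolate(logits, size=basic_results.shape[-2:],
                               mode="bilinear", align_corners=False)
        weights = torch.softmax(logits, dim=1)       # M(n); sums to 1 per pixel
        return (weights * basic_results).sum(dim=1)  # P = sum_n M(n) * B(n)
```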
According to another aspect of the present invention, a video moving object detection method based on a deep fusion network is provided, comprising the following steps:

S1: Sequentially read the current frame of the video and the several frames preceding it as the video sequence input.

S2: Analyze each frame of the input video sequence with the feature extractor to obtain several groups of frame features, and stack these feature groups along the channel direction to form a descriptor of the video scene, i.e., the video scene descriptor. At the same time, analyze the moving objects in the input video sequence with traditional motion detection methods to obtain the basic detection results.

S3: Feed the video scene descriptor and the basic detection results obtained in S2 into the deep fusion network. The deep fusion network analyzes the video scene descriptor to obtain the optimal fusion weight map, and then uses this weight map to perform linearly weighted fusion of the basic detection results (a sketch of this inference loop is given below).
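As a reading aid, the following is a minimal sketch of the S1 to S3 inference loop for a single frame. The `extractor`, `fusion`, and `basic_detectors` interfaces are assumptions for illustration, not defined by the invention.

```python
import torch

def detect_frame(frames, extractor, fusion, basic_detectors):
    """frames: (T, 3, H, W) tensor holding the current frame and the T-1
    frames before it (S1). `extractor` maps one frame to a feature map,
    `fusion` is the deep fusion network, and `basic_detectors` is a list of
    callables each returning an (H, W) mask; all assumed interfaces."""
    with torch.no_grad():
        # S2: per-frame features, stacked along the channel direction
        descriptor = torch.cat([extractor(f.unsqueeze(0)) for f in frames], dim=1)
        # S2: basic detection results B(n) on the same clip
        basic = torch.stack([d(frames) for d in basic_detectors]).unsqueeze(0)
        # S3: optimal fusion into the final mask P
        return fusion(descriptor, basic)
```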
Preferably, the deep fusion network is based on a deep convolutional network: the input video scene descriptor passes through four convolutional layers and one Soft-Max layer to produce the optimal fusion weight map, which is then used to apply pixel-wise linear weighting to the basic detection results.
Preferably, the video moving object detection method based on the deep fusion network further comprises offline training of the feature extractor and the deep fusion network, as follows:

Randomly sample video clips from the training videos; each clip (from which the predicted motion mask is computed), together with the annotated mask of the real moving objects (the ground-truth motion mask), forms a training pair, and multiple training pairs constitute a training set. Randomly crop the training videos of each pair to obtain training samples, then randomly flip the samples left-right and up-down to augment the training set.

Using one training pair as input at a time, jointly optimize the parameters of the feature extractor and the deep fusion network with a stochastic gradient descent algorithm, performing multiple rounds of learning over all training pairs in the training set until the loss converges.
Preferably, the loss function used in the stochastic gradient descent algorithm is the mean squared error between the predicted motion mask and the ground-truth motion mask.
Preferably, the parameter update rate of the deep fusion network is set to 100 to 10,000 times that of the feature extractor.
According to a third aspect of the present invention, a detection terminal is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, can perform the video moving object detection method based on the deep fusion network described above.
Compared with the prior art, the present invention has the following beneficial effects:

1. The present invention makes full use of multiple existing traditional video moving object detection systems, improving effectiveness across different scenes.

2. The present invention makes full use of deep learning techniques, improving the ability to describe high-level semantic features of video images.

3. The parameters of the system are learned automatically from data, so no feature-engineering-based parameter tuning is required.

4. By combining traditional methods with deep learning, the present invention obtains a robust, high-performance video moving object detection system and method with high detection accuracy across a wide variety of scenes.

5. The present invention combines the efficiency of traditional methods in specific scenes with the powerful expressive ability of deep learning for extracting video image content features: a deep fusion network optimally fuses multiple traditional detection results according to the video scene features, yielding robust detection results in all kinds of scenes.
Brief Description of the Drawings

Other features, objects and advantages of the present invention will become more apparent by reading the detailed description of the non-limiting embodiments with reference to the following drawings:

Fig. 1 is a structural block diagram of a video moving object detection system based on a deep fusion network according to an embodiment of the present invention;

Fig. 2 is a flowchart of a video moving object detection method based on a deep fusion network according to an embodiment of the present invention.
Detailed Description

The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the present invention, but do not limit the present invention in any form. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present invention, all of which fall within the protection scope of the present invention.
An embodiment of the present invention provides a video moving object detection system based on a deep fusion network, comprising the following modules:

Module 1: a video feature extraction module, which receives a video sequence as input, performs feature extraction on the video content to obtain a feature representation of the scene information in the video (the video scene feature representation), and sends it to the deep fusion module, where it is used to optimally fuse the individual basic detection results.

Module 2: a basic result detection module, which receives the video sequence as input, detects moving objects with basic detectors to obtain the corresponding basic detection results, and sends them to the deep fusion module.

Module 3: a deep fusion module, which receives the video scene feature representation and the basic detection results, performs optimal fusion with a deep neural network, and outputs the final detection result.
In some preferred embodiments, the video feature extraction module uses a pre-trained VGG-16 network as the feature extractor: it extracts the features of each video frame and then stacks the per-frame features together to form a descriptor of the video scene.
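A minimal sketch of this feature extractor, assuming the convolutional part of torchvision's ImageNet-pretrained VGG-16 and omitting input preprocessing (resizing, normalization) for brevity:

```python
import torch
import torchvision

# Convolutional part of a pre-trained VGG-16 as the per-frame feature extractor.
vgg16_features = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()

def video_scene_descriptor(clip: torch.Tensor) -> torch.Tensor:
    """clip: (T, 3, H, W) float tensor of video frames.
    Returns a (1, T*512, H/32, W/32) descriptor: the last-layer feature map
    of every frame, stacked in the channel direction."""
    with torch.no_grad():
        per_frame = [vgg16_features(frame.unsqueeze(0)) for frame in clip]
    return torch.cat(per_frame, dim=1)
```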
Further, the basic result detection module contains multiple basic detectors, each of which applies one traditional motion detection method to detect moving objects, yielding multiple corresponding basic detection results. In the embodiments, the number of basic detectors may be four or any other number. When there are four, the motion detection methods may be, for example (but are not limited to): PWACS, EFIC, SharedModel and WeSamBE. PWACS is a pixel-based adaptive semantic-association segmentation method; EFIC is a foreground-background segmentation method based on edge detection; SharedModel is a background segmentation method based on a shared model; WeSamBE is a background segmentation method based on sample-point weighting. All of these are traditional, non-deep-learning methods for moving object detection. Of course, different motion detection methods may be used in different embodiments; by fusing the results of several traditional motion detection methods, the embodiments of the present invention obtain a more robust detection result.
In some preferred embodiments, the deep fusion module takes the video scene feature representation (the video feature descriptor) as input, obtains an optimal fusion weight map through four convolutional layers and one Soft-Max layer, and then applies pixel-wise linear weighting to the basic detection results according to this weight map.
An embodiment of the present invention further provides a video moving object detection method based on a deep fusion network, comprising the following steps:

Step 1: Sequentially read the current frame of the video and the several frames preceding it (for example 16 frames; this number depends on the input format of the specific implementation and varies accordingly) as the video sequence input.

Step 2: Analyze each frame of the input video sequence with the feature extractor to obtain several groups of frame features (16 groups when 16 frames are used), and stack these feature groups along the channel direction to form a descriptor of the video scene. At the same time, analyze the moving objects in the input video sequence with traditional motion detection methods to obtain the basic detection results.

Step 3: Feed the video scene descriptor and the basic detection results obtained in step 2 into the deep fusion network. The deep fusion network further analyzes the video scene descriptor to obtain the optimal fusion weight map, and finally uses this weight map to perform linearly weighted fusion of the basic detection results.
In step 1, the video sequence input serves as the system input and is a video clip comprising the current frame and several frames preceding it.

In step 2, the output of the feature extractor is a set of deep-learning-based feature maps.

In step 3, the optimal fusion is based on a deep convolutional network, and the final fusion step is a linear weighting operation based on the optimal fusion weight map.
Further, the method may also include an offline training step for the feature extractor and the deep fusion network, as follows:

Step 1: Randomly sample video clips from the training videos; each clip (from which the predicted motion mask is computed), together with the annotated mask of the real moving objects (the ground-truth motion mask), forms a training pair, and multiple training pairs constitute a training set. Randomly crop the training videos of each pair to obtain training samples, then randomly flip the samples left-right and up-down to augment the training set.

Step 2: Using one training pair as input at a time, jointly optimize the parameters of the feature extractor and the deep fusion network with a stochastic gradient descent algorithm, performing multiple rounds of learning over all training pairs in the training set until the loss converges.

In step 1, the size of the training samples may be 128x128 or another size, depending on the available computing resources; if resources permit, a larger size such as 256x256 or 512x512 may be used.

In step 2, the loss function used in the stochastic gradient descent algorithm may be the mean squared error between the predicted motion mask and the ground-truth motion mask. Further, the parameter update rate of the deep fusion network is set to 100 to 10,000 times that of the feature extractor. The joint optimization applies gradient descent to the error of the detection results, iteratively optimizing step by step. After training, the optimal model parameters are saved and used directly in the video moving object detection method.
Based on the above, the technical solutions of the present invention are described in further detail below with reference to the accompanying drawings and specific examples.
As shown in Fig. 1, the video moving object detection system based on a deep fusion network in an embodiment of the present invention comprises three types of modules: a video feature extraction module (the video feature extraction network), a basic result detection module, and a deep fusion module (the deep fusion network).

In this embodiment, the system contains one video feature extraction module and one deep fusion module; the type and number of basic detection systems in the basic result detection module can be selected flexibly according to the characteristics of the specific scene and the capability of the processing platform.

In this embodiment, the video feature extraction module uses a pre-trained VGG-16 network as the feature extractor: it analyzes all video frames in a video clip in turn, and the resulting feature maps are stacked together as the video feature descriptor.

In this embodiment, the basic result detection module uses four basic detection systems: PWACS, EFIC, SharedModel and WeSamBE, which have complementary performance on dynamic backgrounds, night scenes, camera jitter and infrared scenes.

In this embodiment, the deep fusion module consists mainly of four convolutional layers and one Soft-Max layer in cascade. The module receives the video descriptor as input, analyzes it to obtain the optimal fusion weight map, and then applies pixel-wise linear weighting to the basic detection results according to this weight map.
As shown in Fig. 2, in a specific embodiment, the method of video moving object detection using the deep-fusion-network-based system comprises the following steps:

Step 1: Sequentially read the current frame and the 16 frames preceding it as the system input (the video sequence input).

Step 2: Analyze each frame with the VGG-16 network, and stack the last-layer feature maps obtained for each frame along the channel direction to form a descriptor of the video features.

At the same time, analyze the system input with the basic result detection module to obtain four basic detection results, denoted B(n), n = 1, 2, 3, 4, representing the detection results of the four basic detection methods.
Step 3: Feed the video descriptor and the basic detection results from step 2 into the deep fusion network. The deep fusion network further analyzes the video descriptor to obtain the optimal fusion weight map M, and finally uses this weight map to perform the linearly weighted fusion of the basic detection results given by formula (1):

P = Σ_{n=1}^{4} M(n) ⊙ B(n)   (1)

In formula (1), B(n) denotes the n-th basic detection result and M(n) the weight map corresponding to the n-th basic detection result; both are two-dimensional images of the same size as the input video frame, and ⊙ denotes element-wise multiplication. Formula (1) thus takes the pixel-wise weighted average of the four basic detection results as the final prediction P.
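A small NumPy check of formula (1), illustrating that with Soft-Max weights the fused result is a per-pixel convex combination of the basic detection results (toy random data, not the invention's trained weights):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 4, 4
B = rng.random((4, H, W))                    # basic detection results B(n)
logits = rng.standard_normal((4, H, W))
M = np.exp(logits) / np.exp(logits).sum(0)   # Soft-Max over n: sum_n M(n) = 1
P = (M * B).sum(axis=0)                      # formula (1): P = sum_n M(n) * B(n)

assert np.allclose(M.sum(axis=0), 1.0)       # weights sum to 1 at every pixel
assert (P >= B.min(axis=0) - 1e-9).all() and (P <= B.max(axis=0) + 1e-9).all()
```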
In this embodiment, the offline training steps for the parameters of the feature extractor and the deep fusion network are as follows:
Step 1: Randomly sample video clips from the training videos, pairing each with the annotated mask of the real moving objects to form a training pair. Randomly crop the training videos to obtain 128x128 training samples, then randomly flip the samples left-right and up-down to augment the training set (a sketch of this augmentation follows).
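A minimal sketch of the augmentation, assuming PyTorch tensors and applying the same crop and flips to the clip and its ground-truth mask:

```python
import random
import torch

def augment(clip: torch.Tensor, mask: torch.Tensor, size: int = 128):
    """clip: (T, 3, H, W) video clip; mask: (H, W) ground-truth motion mask.
    Random size x size crop plus random left-right and up-down flips,
    applied identically to both."""
    _, _, H, W = clip.shape
    top = random.randint(0, H - size)
    left = random.randint(0, W - size)
    clip = clip[..., top:top + size, left:left + size]
    mask = mask[top:top + size, left:left + size]
    if random.random() < 0.5:                 # random left-right flip
        clip, mask = clip.flip(-1), mask.flip(-1)
    if random.random() < 0.5:                 # random up-down flip
        clip, mask = clip.flip(-2), mask.flip(-2)
    return clip, mask
```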
Step 2: Jointly optimize the parameters of the entire system with a stochastic gradient descent algorithm until the loss converges.

The optimization method in step 2 is the Adam optimizer. The loss function is set to formula (2):

Loss = (1 / (H·W)) Σ_{i=1}^{H} Σ_{j=1}^{W} (P_{i,j} − G_{i,j})²   (2)

In formula (2), H and W denote the height and width of the image, and G denotes the ground-truth motion annotation mask.
In step 2, the learning rate for the parameters of the video feature extraction module is set to 10⁻⁷, while the learning rate for the deep fusion network is set to 10⁻⁴. After training converges, the parameters are saved and simply loaded for use in practice (a sketch of this training configuration follows).
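A minimal sketch of this training configuration, reusing the `vgg16_features` extractor and `DeepFusionModule` from the sketches above; `train_loader`, yielding (clip, basic results, ground-truth mask) training pairs, is an assumed interface:

```python
import torch

fusion = DeepFusionModule(in_channels=17 * 512)  # 17 frames x 512 VGG channels

# Adam with the per-module learning rates of this embodiment:
# 1e-7 for the feature extractor, 1e-4 for the deep fusion network.
optimizer = torch.optim.Adam([
    {"params": vgg16_features.parameters(), "lr": 1e-7},
    {"params": fusion.parameters(), "lr": 1e-4},
])

for clip, basic, gt_mask in train_loader:        # assumed loader of training pairs
    descriptor = torch.cat([vgg16_features(f.unsqueeze(0)) for f in clip], dim=1)
    pred = fusion(descriptor, basic.unsqueeze(0))[0]  # predicted motion mask P
    loss = ((pred - gt_mask) ** 2).mean()        # formula (2): mean squared error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```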
Based on the above method, an embodiment of the present invention further provides a detection terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, can perform the video moving object detection method based on the deep fusion network described above.
In the deep-fusion-network-based video moving object detection system and method and the detection terminal provided by the above embodiments, once a video sequence is input into the system, the video feature extraction operation and the basic result detection operation are carried out simultaneously, and the deep fusion module then optimally fuses the multiple basic detection results according to the video features. The above embodiments use deep convolutional networks to build the feature extraction module and the deep fusion module, and train them on a large amount of data to obtain the optimal model parameters, so that moving object detection can be performed automatically in practical applications. Experimental results show that the system achieves highly accurate detection results.
The specific parameters in the above embodiments are given only to illustrate the implementation of the technical solutions of the present invention; other specific parameters may be used in other embodiments without materially affecting the implementation of the present invention.

It should be noted that the steps of the method provided by the present invention can be implemented using the corresponding modules, devices, units, etc. of the system, and those skilled in the art can refer to the technical solution of the system to implement the step flow of the method; that is, the embodiments of the system can be understood as preferred examples of implementing the method, and the details are not repeated here.

Those skilled in the art will know that, besides implementing the system and its modules, devices and units purely as computer-readable program code, the method steps can be logically programmed so that the system and its devices realize the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and its devices provided by the present invention can be regarded as hardware components, and the devices included therein for realizing various functions can also be regarded as structures within the hardware components; the devices for realizing various functions can likewise be regarded either as software modules implementing the method or as structures within the hardware components.

Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments described above; those skilled in the art can make various variations or modifications within the scope of the claims, and these do not affect the essential content of the present invention.
Claims (9)
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201910078362.5A | 2019-01-26 | 2019-01-26 | Video moving object detection system, method and terminal based on depth fusion network
Publications (2)

Publication Number | Publication Date
---|---
CN109815911A | 2019-05-28
CN109815911B | 2020-11-03
Family

ID=66605404

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN201910078362.5A (CN109815911B, Active) | Video moving object detection system, method and terminal based on depth fusion network | 2019-01-26 | 2019-01-26

Country Status (1)

Country | Link
---|---
CN | CN109815911B (en)
Families Citing this family (1)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN110544260B | 2019-08-22 | 2020-06-30 | Hohai University | A remote sensing image target extraction method integrating self-learning semantic features and design features
Citations (3)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
US7302101B2 | 2002-05-20 | 2007-11-27 | Simmonds Precision Products, Inc. | Viewing a compartment
CN108960337A | 2018-07-18 | 2018-12-07 | Zhejiang University | A multi-modal complex activity recognition method based on a deep learning model
CN109034012A | 2018-07-09 | 2018-12-18 | Sichuan University | First-person gesture recognition method based on dynamic images and video sequences

Family Cites Families (3)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN105091744B | 2015-05-07 | 2018-06-26 | Institute of Automation, Chinese Academy of Sciences | Pose detection apparatus and method based on a vision sensor and a laser rangefinder
CN108038420B | 2017-11-21 | 2020-10-30 | Huazhong University of Science and Technology | A human action recognition method based on depth video
CN109272530B | 2018-08-08 | 2020-07-21 | Beihang University | Target tracking method and device for space-based monitoring scenes
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant