CN111461043B

CN111461043B - Video significance detection method based on deep network

Info

Publication number: CN111461043B
Application number: CN202010266351.2A
Authority: CN
Inventors: 于明; 夏斌红; 刘依; 郭迎春; 郝小可; 朱叶; 师硕; 于洋; 阎刚
Original assignee: Hebei University of Technology
Current assignee: Hebei University of Technology
Priority date: 2020-04-07
Filing date: 2020-04-07
Publication date: 2023-04-18
Anticipated expiration: 2040-04-07
Also published as: CN111461043A

Abstract

The present invention is a video saliency detection method based on a deep network, which relates to the field of image data processing. The method uses a ResNet50 deep network to obtain spatial features, and then extracts time and edge information to jointly obtain a saliency prediction result map, and completes a depth-based The video saliency detection of the network, the steps are, input video frame I, carry out preprocessing; Extract the initial spatial feature map S of video frame I ^' ; Obtain the spatial feature map S _{final of} five scales; Obtain feature map F; Obtain rough The spatiotemporal saliency map Y _ST and the edge contour map E _t of salient objects; obtain the final saliency prediction result map Y _final ; calculate the loss for the input video frame I, and complete the video saliency detection based on the deep network. The invention overcomes the defects of incomplete salient object detection and inaccurate algorithm detection when the foreground and background colors are similar in the prior art video salient detection.

Description

Video saliency detection method based on deep network

技术领域Technical Field

本发明的技术方案涉及图像数据处理领域,具体地说是基于深度网络的视频显著性检测方法。The technical solution of the present invention relates to the field of image data processing, and specifically to a video saliency detection method based on a deep network.

背景技术Background Art

视频显著性检测旨在提取连续的视频帧中人眼最感兴趣的区域。具体地说是利用计算机模拟人眼的视觉注意力机制，从视频帧中提取人眼感兴趣的区域，是计算机视觉领域的关键技术之一。Video saliency detection aims to extract the areas of greatest interest to the human eye in continuous video frames. Specifically, it uses computers to simulate the visual attention mechanism of the human eye and extract the areas of interest to the human eye from video frames. It is one of the key technologies in the field of computer vision.

传统的视频显著性检测方法大多数都基于低级的手工特征(例如颜色，纹理等)，这些方法是典型的启发式方法，具有速度慢(由于耗时的光流计算)和预测精度低(由于低水平特征的可表征性有限)的缺点。近年来深度神经网络开始应用于视频显著性检测领域,深度学习方法是指利用卷积神经网络提取图像的高级语义特征计算图像的显著值，但采用深度卷积网络会丢失目标的位置信息和细节信息，在检测显著目标时可能会引入误导信息，导致检测到的目标不完整。Most of the traditional video saliency detection methods are based on low-level manual features (such as color, texture, etc.). These methods are typical heuristic methods with the disadvantages of slow speed (due to time-consuming optical flow calculation) and low prediction accuracy (due to the limited representability of low-level features). In recent years, deep neural networks have begun to be applied to the field of video saliency detection. Deep learning methods refer to the use of convolutional neural networks to extract high-level semantic features of images and calculate the saliency value of images. However, the use of deep convolutional networks will lose the location information and detail information of the target, which may introduce misleading information when detecting salient targets, resulting in incomplete detected targets.

2016年，Liu等人在“Saliency detection for unconstrained videos usingsuperpixel-level graph and spatiotemporal propagation”一文中提出了SGSP算法，该算法使用超像素级的图模型和时空传播来进行视频显著性的检测，首先，提取超像素级的运动和颜色直方图以及全局运动直方图来构建图。接着，基于图模型使用背景先验通过图上的最短路径迭代地计算运动显著性。然后在时间上往前向和后向传播，在空间上局部和全局地传播，最后将这两个结果融合起来形成最后的显著图。该算法的计算量很大，但得到的显著图仍存在显著性目标检测不完全的问题。基于深度学习模型旨在利用卷积神经网络得到更丰富的深度特征，进而得到更准确的检测结果。Wang等人于2017年在“Videosalient object detection via fully convolutional networks”一文中提出了基于全卷积网络的视频显著性检测方法，这是基于深度学习的全卷积网络第一次用在了视频显著性检测领域，但是由于没有考虑到帧与帧之间的时间信息，导致得到的显著图的边缘不够精细，边缘噪声比较大。CN106372636A公开了一种基于HOG_TOP的视频显著性检测方法，该方法利用原始视频在三个正交的平面XY、XT、YT计算得到HOG_TOP特征，分别在XY平面计算得到空域显著图和在XT，YT平面得到时域显著图，最后通过自适应融合得到最终的显著图，此方法在计算时域显著图时需要计算每个像素点的光流，计算量很大，速度慢。CN109784183A公开了一种基于级联卷积网络和光流的视频显著性目标检测方法，该方法利用级联网络结构,在高、中、低三个尺度上分别对当前帧的图像进行像素级的显著性预测。使用MSAR10K图像数据集训练级联网络结构,显著性标注图作为训练的监督信息,损失函数为交叉熵损失函数。训练终止后,利用训练好的级联网络对视频中的每一帧图像进行静态显著性预测，利用Locus-Kanada算法进行光流场提取。然后使用三层卷积网络结构构建动态优化网络结构。将每一帧图像的静态检测结果和光流场检测结果进行拼接得到优化网络的输入数据。该方法较耗时，且在一些对于复杂场景的时候利用Locus-Kanada算法提取到的光流信息并不准确，鲁棒性较差。CN109118469A公开了一种用于视频显著性的预测方法，该方法先对图像进行量化得到稀疏矩阵响应，再根据局部坐标约束得到分解矩阵，最后对视频中的每一帧进行显著图计算,并进行质量预测。该方法丢失了显著性目标的一些细节信息，使得预测结果会存在显著性目标检测不完整的问题。CN105913456B公开了一种区域分割的视频显著性检测方法，该方法先利用非线性聚类得到超像素块来提取静态特征，再利用分光流法得到动态特征，最后用线性回归模型来预测两个特征融合之后的显著图，该方法的计算量较大，效率较低。CN109034001A公开了一种基于时空线索的跨模态视频显著性检测方法，该方法利用初始的显著图，可见光和热红外两个模态的权重构造显著图，该方法难以找到一个合适权重值导致鲁棒性较差。CN108241854A公开了一种基于运动和记忆信息的深度视频显著性检测方法，该方法先根据当前帧的人眼注视图来提取局部信息和全局信息，再将此作为先验信息和原图像一起输入到深度网络模型当中来预测最终的显著图，当显著目标触及图像边界时，该方法会出现误检，显著目标会被误检测为背景。CN110598537A公开了一种基于深度卷积网络的视频显著性检测方法，该方法以视频的当前帧及其对应的光流图像作为特征提取网络的输入来预测最终的显著图，该方法需要提前计算当前帧的光流信息，计算量较大。In 2016, Liu et al. proposed the SGSP algorithm in the paper "Saliency detection for unconstrained videos using superpixel-level graph and spatiotemporal propagation". The algorithm uses superpixel-level graph models and spatiotemporal propagation to detect video saliency. First, superpixel-level motion and color histograms and global motion histograms are extracted to construct a graph. Then, based on the graph model, motion saliency is iteratively calculated through the shortest path on the graph using background priors. Then, it propagates forward and backward in time, and propagates locally and globally in space, and finally combines the two results to form the final saliency map. The algorithm is very computationally intensive, but the saliency map obtained still has the problem of incomplete detection of salient objects. The deep learning model aims to use convolutional neural networks to obtain richer deep features, thereby obtaining more accurate detection results. Wang et al. proposed a video saliency detection method based on a fully convolutional network in the article "Videosalient object detection via fully convolutional networks" in 2017. This is the first time that a fully convolutional network based on deep learning has been used in the field of video saliency detection. However, since the time information between frames is not taken into account, the edges of the obtained saliency map are not fine enough and the edge noise is relatively large. CN106372636A discloses a video saliency detection method based on HOG_TOP. The method uses the original video to calculate the HOG_TOP features on three orthogonal planes XY, XT, and YT, respectively calculates the spatial domain saliency map on the XY plane and the temporal domain saliency map on the XT and YT planes, and finally obtains the final saliency map through adaptive fusion. This method needs to calculate the optical flow of each pixel when calculating the temporal domain saliency map, which is very computationally intensive and slow. CN109784183A discloses a video salient target detection method based on cascade convolutional network and optical flow. The method uses a cascade network structure to perform pixel-level saliency prediction on the image of the current frame at three scales: high, medium and low. The cascade network structure is trained using the MSAR10K image data set, and the saliency annotation map is used as the supervisory information for training. The loss function is the cross entropy loss function. After the training is terminated, the trained cascade network is used to perform static saliency prediction on each frame of the video, and the Locus-Kanada algorithm is used to extract the optical flow field. Then a three-layer convolutional network structure is used to construct a dynamic optimization network structure. The static detection results of each frame of the image and the optical flow field detection results are spliced to obtain the input data of the optimization network. This method is time-consuming, and the optical flow information extracted by the Locus-Kanada algorithm is not accurate and has poor robustness in some complex scenes. CN109118469A discloses a method for predicting video saliency, which first quantizes the image to obtain a sparse matrix response, then obtains a decomposition matrix based on local coordinate constraints, and finally calculates a saliency map for each frame in the video and performs quality prediction. This method loses some detailed information of the salient target, so that the prediction result will have the problem of incomplete detection of salient targets. CN105913456B discloses a method for detecting video saliency by region segmentation, which first uses nonlinear clustering to obtain superpixel blocks to extract static features, then uses the optical flow method to obtain dynamic features, and finally uses a linear regression model to predict the saliency map after the fusion of the two features. The method has a large amount of calculation and low efficiency. CN109034001A discloses a cross-modal video saliency detection method based on spatiotemporal cues, which uses the initial saliency map and the weights of the two modalities of visible light and thermal infrared to construct a saliency map. The method is difficult to find a suitable weight value, resulting in poor robustness. CN108241854A discloses a deep video saliency detection method based on motion and memory information. The method first extracts local information and global information based on the human eye gaze map of the current frame, and then inputs this information into the deep network model together with the original image as prior information to predict the final saliency map. When the salient target touches the image boundary, the method will have false detection, and the salient target will be mistakenly detected as the background. CN110598537A discloses a video saliency detection method based on a deep convolutional network. The method uses the current frame of the video and its corresponding optical flow image as the input of the feature extraction network to predict the final saliency map. The method needs to calculate the optical flow information of the current frame in advance, and the amount of calculation is large.

总之，视频显著性目标检测的现有技术中仍存在显著目标检测不完整、当前景背景颜色相似时算法检测不准确的问题。In summary, the existing technology of video salient object detection still has problems such as incomplete salient object detection and inaccurate algorithm detection when the foreground and background colors are similar.

发明内容Summary of the invention

本发明所要解决的技术问题是：提供基于深度网络的视频显著性检测方法，该方法是先用ResNet50深度网络来取空间特征，然后再提取时间和边缘信息来共同得到显著性预测结果图，完成基于深度网络的视频显著性检测，克服了现有技术视频显著性检测中存在的显著目标检测不完整、当前景背景颜色相似时算法检测不准确的缺陷。The technical problem to be solved by the present invention is to provide a video saliency detection method based on a deep network. The method first uses a ResNet50 deep network to obtain spatial features, and then extracts time and edge information to jointly obtain a saliency prediction result map, thereby completing video saliency detection based on a deep network, overcoming the defects of incomplete salient target detection and inaccurate algorithm detection when the foreground and background colors are similar in the prior art video saliency detection.

本发明解决该技术问题所采用的技术方案是：基于深度网络的视频显著性检测方法，是先用ResNet50深度网络来取空间特征，然后再提取时间和边缘信息来共同得到显著性预测结果图，完成基于深度网络的视频显著性检测，具体步骤如下：The technical solution adopted by the present invention to solve the technical problem is: a video saliency detection method based on a deep network first uses a ResNet50 deep network to obtain spatial features, and then extracts time and edge information to jointly obtain a saliency prediction result map to complete the video saliency detection based on a deep network. The specific steps are as follows:

第一步，输入视频帧I，进行预处理:The first step is to input the video frame I and perform preprocessing:

输入视频帧I，将视频帧的尺寸都统一为宽高都是473×473像素，并且视频帧I中的每个像素值都减去其相对应的通道的均值，其中，每个视频帧I的R通道的均值是104.00698793，每个视频帧I中的G通道的均值是116.66876762，每个视频帧I中的B通道的均值是122.67891434，这样，输入到ResNet50深度网络之前的视频帧I的形状为473×473×3，将如此进行预处理之后的视频帧记为I′，如下公式(1)所示：Input video frame I, unify the size of the video frame to 473×473 pixels in width and height, and subtract the mean of the corresponding channel from each pixel value in video frame I, where the mean of the R channel of each video frame I is 104.00698793, the mean of the G channel of each video frame I is 116.66876762, and the mean of the B channel of each video frame I is 122.67891434. In this way, the shape of video frame I before inputting into the ResNet50 deep network is 473×473×3. The video frame after such preprocessing is recorded as I′, as shown in the following formula (1):

I′＝Resize(I-Mean(R,G,B)) (1),I′＝Resize(I-Mean(R,G,B)) (1),

公式(1)中，Mean(R,G,B)为红，绿，蓝三个颜色通道的均值，Resize(·)为调整视频帧I′大小的函数；In formula (1), Mean(R,G,B) is the mean of the three color channels of red, green, and blue, and Resize(·) is the function for adjusting the size of the video frame I′;

第二步，提取视频帧I′的初始空间特征图S：The second step is to extract the initial spatial feature map S of the video frame I′:

将上述第一步预处理之后的视频帧I′送入到ResNet50深度网络去提取初始空间特征图S，如下公式(2)所示：The video frame I′ after the first step of preprocessing is sent to the ResNet50 deep network to extract the initial spatial feature map S, as shown in the following formula (2):

S＝ResNet50(I′) (2),S＝ResNet50(I′) (2),

公式(2)中，ResNet50(·)为ResNet50深度网络，In formula (2), ResNet50(·) is the ResNet50 deep network.

ResNet50深度网络包含卷积层，池化层，非线性激活函数Relu层和残差连接；The ResNet50 deep network contains convolutional layers, pooling layers, non-linear activation function Relu layers and residual connections;

第三步，获得五个尺度的空间特征图S_final：The third step is to obtain the spatial feature map S _final of five scales:

将第二步中提取到的视频帧I′的初始空间特征图S分别送入到ResNet50深度网络中扩张率为2、4、8、16的四个不同的扩张卷积中去，得到扩张率分别为2、4、8、16的四个尺度的结果T_k，再将该结果与ResNet50深度网络的输出结果初始空间特征图S串联起来最终获得五个尺度的空间特征图S_final，The initial spatial feature map S of the video frame I′ extracted in the second step is sent to four different dilated convolutions with dilation rates of 2, 4, 8, and 16 in the ResNet50 deep network, and the results T _k of four scales with dilation rates of 2, 4, 8, and 16 are obtained. Then, the result is connected in series with the output result initial spatial feature map S of the ResNet50 deep network to finally obtain the spatial feature map S _final of five scales.

第四步，获得特征图F：The fourth step is to obtain the feature map F:

将上述第三步得到的五个尺度的空间特征图S_final通过一个卷积核为3×3×32的卷积操作获得形状为60×60×32的特征图F，如下公式(3)所示，The spatial feature maps S _final of the five scales obtained in the third step above are subjected to a convolution operation with a convolution kernel of 3×3×32 to obtain a feature map F with a shape of 60×60×32, as shown in the following formula (3):

F＝BN(Relu(Conv(S_final))) (3),F＝BN(Relu(Conv(S _final ))) (3),

公式(3)中，Conv(·)为卷积操作，Relu(·)为非线性激活函数，BN(·)为对其进行标准化操作；In formula (3), Conv(·) is the convolution operation, Relu(·) is the nonlinear activation function, and BN(·) is the normalization operation.

第五步，获得粗略的时空显著图Y_ST和显著性物体的边缘轮廓图E_t：The fifth step is to obtain a rough spatiotemporal saliency map Y _ST and an edge contour map E _t of a salient object:

将上述第四步获得的特征图F同时分别输入到时空分支和边缘检测分支得到一个时空特征图F_ST和得到显著性物体的边缘轮廓图E_t，具体操作如下，The feature map F obtained in the fourth step is input into the spatiotemporal branch and the edge detection branch to obtain a spatiotemporal feature map F _ST and an edge contour map E _t of a salient object. The specific operation is as follows:

将上述第四步得到的特征图F输入到时空分支的ConvLSTM当中去，得到一个时空特征图F_ST，如下公式(4)所示，The feature map F obtained in the fourth step above is input into the ConvLSTM of the spatiotemporal branch to obtain a spatiotemporal feature map F _ST , as shown in the following formula (4):

F_ST＝ConvLSTM(F,H_t-1) (4),F _ST =ConvLSTM(F,H _t-1 ) (4),

公式(4)中，ConvLSTM(·)为ConvLSTM操作，H_t-1为前一时刻ConvLSTM单元的状态；In formula (4), ConvLSTM(·) is the ConvLSTM operation, H _t-1 is the state of the ConvLSTM unit at the previous moment;

再将得到的时空特征图F_ST再送入到一层卷积核大小为1×1的卷积中得到一个粗略的时空显著图Y_ST，公式如下：Then the obtained spatiotemporal feature map F _ST is sent to a convolution layer with a convolution kernel size of 1×1 to obtain a rough spatiotemporal saliency map Y _ST , the formula is as follows:

Y_ST＝Conv(F_ST) (5),Y _ST =Conv(F _ST ) (5),

公式(5)中，Conv(·)为卷积操作；In formula (5), Conv(·) is the convolution operation;

将上述第四步得到的特征图F输入到边缘检测分支中得到显著性物体的边缘轮廓图E_t，具体操作如下，The feature map F obtained in the fourth step is input into the edge detection branch to obtain the edge contour map E _t of the salient object. The specific operation is as follows:

通过ResNet50深度网络和扩张卷积，获得T帧的输入视频的静态为

其中X_t为第t帧的视频帧，给定X_t，X_t经过边缘检测分支后输出为边缘轮廓图E_t∈[0,1]^W×H，其中W和H分别为预测边缘图像的宽度和高度，是从边缘检测网络

中计算出来的，它将先前的视频帧考虑在内，具体如下公式(6)和公式(7)所示，Through the ResNet50 deep network and dilated convolution, the static image of the input video of frame T is obtained as follows:

Where _Xt is the video frame of the tth frame. Given _Xt , _Xt is output as an edge contour map _Et∈ [0,1] ^W×H after the edge detection branch, where W and H are the width and height of the predicted edge image, respectively, which is obtained from the edge detection network.

It is calculated in , which takes the previous video frame into account, as shown in the following formulas (6) and (7),

H_t＝ConvLSTM(X_t,H_t-1) (6),H _t =ConvLSTM(X _t ,H _t-1 ) (6),

公式(6)和公式(7)中，

为3D张量隐藏状态，M为通道数，E_t′为未加权的边缘轮廓图，H_t为当前ConvLSTM单元的状态，H_t-1为上一时刻ConvLSTM单元的状态，X₁为第一帧的视频帧，In formula (6) and formula (7),

is the 3D tensor hidden state, M is the number of channels, E _t ′ is the unweighted edge contour map, H _t is the state of the current ConvLSTM unit, H _t-1 is the state of the ConvLSTM unit at the previous moment, X ₁ is the video frame of the first frame,

在ConvLSTM当中嵌入ConvLSTM，获得边缘轮廓图E_t的关键组成部分是边缘检测网络

如下公式(8)所示，The key component of embedding ConvLSTM in ConvLSTM to obtain the edge contour map E _t is the edge detection network

As shown in the following formula (8),

然后用上述边缘检测网络

进行加权，得到显著性物体的边缘轮廓图E_t，如下公式(9)所示，Then use the above edge detection network

After weighting, the edge contour map E _t of the salient object is obtained, as shown in the following formula (9):

公式(9)中，

为一个1×1的卷积核，用来映射边缘检测网络

得到一个权重矩阵，sigmoid函数σ为把这个矩阵归一化到[0,1]；In formula (9),

is a 1×1 convolution kernel used to map the edge detection network

Get a weight matrix, and the sigmoid function σ normalizes this matrix to [0,1];

由此完成获得粗略的时空显著图Y_ST和显著性物体的边缘轮廓图E_t；Thus, a rough spatiotemporal saliency map Y _ST and an edge contour map E _t of a salient object are obtained;

第六步，获得最终的显著性预测结果图Y_final：Step 6: Get the final significance prediction result graph Y _final :

将上述第五步得到的粗略的时空显著图Y_ST和显著性物体的边缘轮廓图E_t进行融合，得到最终的显著性预测结果图Y_final，如下公式(10)所示，The rough spatiotemporal saliency map Y _ST obtained in the fifth step above is fused with the edge contour map E _t of the salient object to obtain the final saliency prediction result map Y _final , as shown in the following formula (10):

公式(10)中，

为矩阵相乘，σ为sigmoid函数，Resize(·)为调整视频帧大小的函数，In formula (10),

is matrix multiplication, σ is the sigmoid function, Resize(·) is the function for adjusting the video frame size,

将得到的视频帧恢复到原输入视频帧的大小473×473；The obtained video frame is restored to the size of the original input video frame 473×473;

第七步，计算对于输入视频帧I的损失：Step 7: Calculate the loss for the input video frame I:

经过上述第一步到第六步，计算出对于输入视频帧I的显著图，为了衡量上述第六步获得的最终的显著性预测结果图Y_final与ground-truth之间的差异，训练时采用二值交叉熵损失函数

如下公式(11)所示，After the first to sixth steps above, the saliency map for the input video frame I is calculated. In order to measure the difference between the final saliency prediction result map Y _final obtained in the sixth step above and the ground-truth, the binary cross entropy loss function is used during training

As shown in the following formula (11),

公式(11)中，G(i,j)∈[0,1]为像素点(i,j)的真实值，M(i,j)∈[0,1]为像素点(i,j)的预测值，取N＝473，In formula (11), G(i,j)∈[0,1] is the true value of pixel (i,j), M(i,j)∈[0,1] is the predicted value of pixel (i,j), and N=473.

通过不断缩小

的大小进行网络的训练，采用随机梯度下降法优化二值交叉熵损失函数

By continuously shrinking

The size of the network is trained, and the stochastic gradient descent method is used to optimize the binary cross entropy loss function

至此完成基于深度网络的视频显著性检测。This completes the video saliency detection based on deep network.

上述基于深度网络的视频显著性检测方法，所述获得五个尺度的空间特征图S_final的具体操作如下：In the above-mentioned video saliency detection method based on deep network, the specific operation of obtaining the spatial feature map S _final of five scales is as follows:

ResNet50深度网络中的扩张卷积核表示为

其中K为扩张卷积层的个数，c×c为宽度和高度的相乘，C为通道数，

为扩张卷积的参数其步长设置为1，基于这些参数得到四个输出特征图

其中，W和H分别为宽度和高度，如下公式(12)所示，The dilated convolution kernel in the ResNet50 deep network is represented as

Where K is the number of dilated convolutional layers, c×c is the multiplication of width and height, and C is the number of channels.

The parameters of the dilated convolution are set to 1, and four output feature maps are obtained based on these parameters.

Where W and H are the width and height respectively, as shown in the following formula (12):

公式(12)中，C_k为取值为k的扩张卷积核，K为扩张卷积的个数，

为扩张卷积操作，S为初始空间特征图，In formula (12), C _k is the dilated convolution kernel with a value of k, K is the number of dilated convolutions,

is the dilated convolution operation, S is the initial spatial feature map,

通过ResNet50深度网络之后得到的初始空间特征图S的形状为60×60×2048，取值为4，k的取值范围是[1，2，3，4]，扩张率r_k的取值有四个，分别是r_k＝{2,4,8,16},并且其扩张卷积核C_k的形状都为3×3×512，由此最后得到有四个不同尺度的特征图

再将它们依次串联起来，由如下公式(13)所示，The shape of the initial spatial feature map S obtained after the ResNet50 deep network is 60×60×2048, the value is 4, the value range of k is [1, 2, 3, 4], the expansion rate r _k has four values, namely r _k = {2, 4, 8, 16}, and the shape of its expansion convolution kernel C _k is 3×3×512, thus finally obtaining feature maps of four different scales

Then connect them in series, as shown in the following formula (13):

S_final＝[S,T₁，T₂,…,T_K] (13),S _final =[S,T ₁ ,T ₂ ,…,T _K ] (13),

公式(13)中，S_final为最后得到的多尺度的空间特征图，S为由ResNet50深度网络提取的初始空间特征图S，T_K为的是经过扩张卷积之后得到的特征图，五个尺度的空间特征图S_final的形状为60×60×4096。In formula (13), S _final is the final multi-scale spatial feature map, S is the initial spatial feature map S extracted by the ResNet50 deep network, T _K is the feature map obtained after the dilated convolution, and the shape of the five-scale spatial feature map S _final is 60×60×4096.

本发明的有益效果是：与现有技术相比，本发明的突出的实质性特点和显著进步如下：The beneficial effects of the present invention are as follows: Compared with the prior art, the outstanding substantive features and significant improvements of the present invention are as follows:

(1)本发明方法与CN106372636A相比，本发明采取的是基于深度学习的方法，先利用ResNet50和扩张卷积来提取多尺度的空间特征，再利用ConvLSTM来提取时间信息，最后再整合为时空信息。本发明具有的突出的实质性特点和显著进步是不需要去计算光流信息，而是用ConvLSTM来提取时间信息，显著目标的检测精度比计算光流的方法更好，并且速度更快。(1) Compared with CN106372636A, the method of the present invention adopts a method based on deep learning, first using ResNet50 and dilated convolution to extract multi-scale spatial features, then using ConvLSTM to extract temporal information, and finally integrating it into spatiotemporal information. The outstanding substantive characteristics and significant progress of the present invention are that it does not need to calculate optical flow information, but uses ConvLSTM to extract temporal information. The detection accuracy of salient targets is better than the method of calculating optical flow, and the speed is faster.

(2)本发明方法与CN109784183A相比，本发明采用的是带有残差网络的连接方式，多个卷积层都有残差块的连接，本发明具有的突出的实质性特点和显著进步是能使训练网络收敛的更快，提取的特征更加精细，预测的准确率更高。(2) Compared with CN109784183A, the method of the present invention adopts a connection mode with a residual network, and multiple convolutional layers are connected with residual blocks. The outstanding substantial characteristics and significant progress of the present invention are that it can make the training network converge faster, the extracted features are more refined, and the prediction accuracy is higher.

(3)本发明方法与CN109118469A相比，本发明具有的突出的实质性特点和显著进步是无需进行繁琐的稀疏矩阵的提取，采用深度神经网络从视频帧中提取高级特征，对每一个像素点进行预测，检测结果更加准确，鲁棒性较好。(3) Compared with CN109118469A, the method of the present invention has outstanding substantive features and significant improvements in that it does not require cumbersome sparse matrix extraction, but uses a deep neural network to extract high-level features from video frames and predict each pixel point, so that the detection result is more accurate and has better robustness.

(4)本发明方法与CN105913456B相比，本发明具有的突出的实质性特点和显著进步是不需要进行计算量较大的线性迭代和k-means聚类，而直接采用端到端的神经网络方法，当训练完成之后能较快速地得到预测结果。(4) Compared with CN105913456B, the method of the present invention has the outstanding substantive characteristics and significant progress that it does not require linear iteration and k-means clustering with large computational complexity, but directly adopts an end-to-end neural network method, and can obtain prediction results more quickly after training is completed.

(5)本发明方法与CN109034001A相比，本发明采用的是基于深度网络的边缘检测分支去提取原图像中的显著性物体的边缘，并以此来指导下面的完整显著图的生成。本发明具有的突出的实质性特点和显著进步是得到的显著图中的显著目标更完整。(5) Compared with CN109034001A, the method of the present invention uses an edge detection branch based on a deep network to extract the edges of salient objects in the original image, and uses this to guide the generation of the following complete salient map. The outstanding substantive feature and significant progress of the present invention is that the salient objects in the obtained salient map are more complete.

(6)本发明方法与CN108241854A相比，虽然都是用的深度学习的方法，但是本发明采用扩张卷积提取了四种不同尺度的特征图，与之相比，本发明提取到的特征更加全面，因此本发明具有的突出的实质性特点和显著进步是得到的最终显著图中的显著目标的边缘更加平滑。(6) Compared with CN108241854A, although both methods use deep learning methods, the present invention uses dilated convolution to extract feature maps of four different scales. Compared with the above, the features extracted by the present invention are more comprehensive. Therefore, the outstanding substantial feature and significant improvement of the present invention is that the edges of the salient targets in the final salient map obtained are smoother.

(7)本发明方法与CN110598537A相比，本发明具有的突出的实质性特点和显著进步是利用ConvLSTM来模拟帧间的光流信息，提取到的光流信息比用传统方法计算出来的更加准确。(7) Compared with CN110598537A, the method of the present invention has an outstanding substantive feature and a significant improvement in that ConvLSTM is used to simulate the optical flow information between frames, and the extracted optical flow information is more accurate than that calculated by traditional methods.

(8)与Video Salient Object Detection via Fully Convolutional Networks相比，本发明具有的突出的实质性特点和显著进步是利用到了帧与帧之间的时间信息，得到的预测结果图更加准确。(8) Compared with Video Salient Object Detection via Fully Convolutional Networks, the present invention has the outstanding substantial feature and significant improvement of utilizing the time information between frames, and the obtained prediction result map is more accurate.

(9)本发明方法提出了一个基于深度网络的视频显著性检测方法模型。首先在视频显著性检测领域使用基于深度学习的显著性目标的边缘检测方法，此方法区别于传统的边缘检测算法，它能准确的检测出视频序列中每一帧中的显著性目标的轮廓，用来指导显著图的预测。(9) The method of the present invention proposes a video saliency detection method model based on a deep network. First, in the field of video saliency detection, a deep learning-based edge detection method of salient targets is used. This method is different from the traditional edge detection algorithm. It can accurately detect the contours of salient targets in each frame of the video sequence to guide the prediction of saliency maps.

(10)本发明利用深度显著性目标边缘检测分支生成显著性目标轮廓图与视频中每一帧的时空显著图进行融合，使它的轮廓更加平滑，能更准确的预测出视频序列中每一帧中的显著性目标。(10) The present invention utilizes the deep salient target edge detection branch to generate a salient target contour map and fuses it with the spatiotemporal salient map of each frame in the video, making its contour smoother and being able to more accurately predict the salient targets in each frame in the video sequence.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

下面结合附图和实施例对本发明进一步说明。The present invention is further described below in conjunction with the accompanying drawings and embodiments.

图1是本发明基于深度网络的视频显著性检测方法的流程示意框图。FIG1 is a schematic flow chart of a method for detecting video saliency based on a deep network according to the present invention.

图2是本发明实施例中的显著目标为一个猫和一个盒子的视频帧I的显著性预测结果图Y_final。FIG. 2 is a saliency prediction result graph Y _final of a video frame I in which salient objects are a cat and a box in an embodiment of the present invention.

具体实施方式DETAILED DESCRIPTION

图1所示实施例表明,本发明基于深度网络的视频显著性检测方法的流程如下:The embodiment shown in FIG1 shows that the process of the video saliency detection method based on deep network of the present invention is as follows:

输入视频帧I，进行预处理→提取视频帧I′的初始空间特征图S→获得五个尺度的空间特征图S_final→获得特征图F→获得粗略的时空显著图Y_ST和显著性物体的边缘轮廓图E_t→获得最终的显著性预测结果图Y_final→计算对于输入视频帧I的损失→完成基于深度网络的视频显著性检测。Input video frame I, perform preprocessing → extract the initial spatial feature map S of video frame I′ → obtain the spatial feature map S _final at five scales → obtain the feature map F → obtain a rough spatiotemporal saliency map Y _ST and the edge contour map E _t of the salient object → obtain the final saliency prediction result map Y _final → calculate the loss for the input video frame I → complete the video saliency detection based on the deep network.

实施例1Example 1

本实施例中显著目标为一个猫和一个盒子，本实施例所述的基于深度网络的视频显著性检测方法，具体步骤如下：In this embodiment, the salient objects are a cat and a box. The video saliency detection method based on a deep network described in this embodiment has the following specific steps:

输入显著目标为一个猫和一个盒子的视频帧I，将视频帧的尺寸都统一为宽高都是473×473像素，并且视频帧I中的每个像素值都减去其相对应的通道的均值，其中，每个视频帧I的R通道的均值是104.00698793，每个视频帧I中的G通道的均值是116.66876762，每个视频帧I中的B通道的均值是122.67891434，这样，输入到ResNet50深度网络之前的视频帧I的形状为473×473×3，将如此进行预处理之后的视频帧记为I′，如下公式(1)所示：Input a video frame I with a cat and a box as salient objects. The size of the video frames is unified to 473×473 pixels in width and height, and the mean of the corresponding channel is subtracted from each pixel value in the video frame I. The mean of the R channel of each video frame I is 104.00698793, the mean of the G channel of each video frame I is 116.66876762, and the mean of the B channel of each video frame I is 122.67891434. In this way, the shape of the video frame I before inputting into the ResNet50 deep network is 473×473×3. The video frame after such preprocessing is recorded as I′, as shown in the following formula (1):

I′＝Resize(I-Mean(R,G,B)) (1),I′＝Resize(I-Mean(R,G,B)) (1),

S＝ResNet50(I′) (2),S＝ResNet50(I′) (2),

获得五个尺度的空间特征图S_final的具体操作如下：The specific operations to obtain the spatial feature map S _final of five scales are as follows:

ResNet50深度网络中的扩张卷积核表示为

其中，W和H分别为宽度和高度，如下公式(3)所示，The dilated convolution kernel in the ResNet50 deep network is represented as

Where W and H are the width and height respectively, as shown in the following formula (3):

公式(3)中，C_k为取值为k的扩张卷积核，K为扩张卷积的个数，

为扩张卷积操作，S为初始空间特征图，In formula (3), C _k is the dilated convolution kernel with a value of k, K is the number of dilated convolutions,

is the dilated convolution operation, S is the initial spatial feature map,

再将它们依次串联起来，由如下公式(4)所示，The shape of the initial spatial feature map S obtained after the ResNet50 deep network is 60×60×2048, the value is 4, the value range of k is [1, 2, 3, 4], the expansion rate r _k has four values, namely r _k = {2, 4, 8, 16}, and the shape of its expansion convolution kernel C _k is 3×3×512, thus finally obtaining feature maps of four different scales

Then connect them in series, as shown in the following formula (4):

S_final＝[S,T₁，T₂,…,T_K] (4),S _final =[S,T ₁ ,T ₂ ,…,T _K ] (4),

公式(4)中，S_final为最后得到的多尺度的空间特征图，S为由ResNet50深度网络提取的初始空间特征图S，T_K为的是经过扩张卷积之后得到的特征图，五个尺度的空间特征图S_final的形状为60×60×4096；In formula (4), S _final is the final multi-scale spatial feature map, S is the initial spatial feature map S extracted by the ResNet50 deep network, T _K is the feature map obtained after the dilated convolution, and the shape of the five-scale spatial feature map S _final is 60×60×4096;

第四步，获得特征图F：The fourth step is to obtain the feature map F:

将上述第三步得到的五个尺度的空间特征图S_final通过一个卷积核为3×3×32的卷积操作获得形状为60×60×32的特征图F，如下公式(5)所示，The spatial feature maps S _final of the five scales obtained in the third step above are subjected to a convolution operation with a convolution kernel of 3×3×32 to obtain a feature map F with a shape of 60×60×32, as shown in the following formula (5):

F＝BN(Relu(Conv(S_final))) (5),F＝BN(Relu(Conv(S _final ))) (5),

公式(5)中，Conv(·)为卷积操作，Relu(·)为非线性激活函数，BN(·)为对其进行标准化操作；In formula (5), Conv(·) is the convolution operation, Relu(·) is the nonlinear activation function, and BN(·) is the normalization operation.

将上述第四步得到的特征图F输入到时空分支的ConvLSTM当中去，得到一个时空特征图F_ST，如下公式(6)所示，The feature map F obtained in the fourth step above is input into the ConvLSTM of the spatiotemporal branch to obtain a spatiotemporal feature map F _ST , as shown in the following formula (6):

F_ST＝ConvLSTM(F,H_t-1) (6),F _ST =ConvLSTM(F,H _t-1 ) (6),

公式(6)中，ConvLSTM(·)为ConvLSTM操作，H_t-1为前一时刻ConvLSTM单元的状态；In formula (6), ConvLSTM(·) is the ConvLSTM operation, H _t-1 is the state of the ConvLSTM unit at the previous moment;

Y_ST＝Conv(F_ST) (7),Y _ST =Conv(F _ST ) (7),

公式(7)中，Conv(·)为卷积操作；In formula (7), Conv(·) is the convolution operation;

边缘检测分支中包含一个两层的ConvLSTM，它是一个强大的循环模型，不仅捕捉时序信息，而且根据时间信息来描绘出显著性物体的轮廓边缘，区分出图像中显著性物体与非显著性物体，更具体的说，通过ResNet50深度网络和扩张卷积，获得T帧的输入视频的静态为

中计算出来的，它将先前的视频帧考虑在内，具体如下公式(8)和公式(9)所示，The edge detection branch contains a two-layer ConvLSTM, which is a powerful recurrent model that not only captures temporal information, but also depicts the contour edges of salient objects based on temporal information, distinguishing salient objects from non-salient objects in the image. More specifically, through the ResNet50 deep network and dilated convolution, the static image of the input video of frame T is obtained as

It is calculated in , which takes the previous video frame into account, as shown in the following formulas (8) and (9),

H_t＝ConvLSTM(X_t,H_t-1) (8),H _t =ConvLSTM(X _t ,H _t-1 ) (8),

公式(8)和公式(9)中，

为3D张量隐藏状态，M为通道数，E_t′为未加权的边缘轮廓图，H_t为当前ConvLSTM单元的状态，H_t-1为上一时刻ConvLSTM单元的状态，X₁为第一帧的视频帧，In formula (8) and formula (9),

如下公式(10)所示，The key component of embedding ConvLSTM in ConvLSTM to obtain the edge contour map E _t is the edge detection network

As shown in the following formula (10),

然后用上述边缘检测网络

进行加权，得到显著性物体的边缘轮廓图E_t，如下公式(11)所示，Then use the above edge detection network

After weighting, the edge contour map E _t of the salient object is obtained, as shown in the following formula (11):

公式(11)中，

为一个1×1的卷积核，用来映射边缘检测网络

得到一个权重矩阵，sigmoid函数σ为把这个矩阵归一化到[0,1]；In formula (11),

is a 1×1 convolution kernel used to map the edge detection network

将上述第五步得到的粗略的时空显著图Y_ST和显著性物体的边缘轮廓图E_t进行融合，得到最终的显著性预测结果图Y_final，如下公式(12)所示，The rough spatiotemporal saliency map Y _ST obtained in the fifth step above is fused with the edge contour map E _t of the salient object to obtain the final saliency prediction result map Y _final , as shown in the following formula (12):

公式(12)中，‘ο’为矩阵相乘，σ为sigmoid函数，Resize(·)为调整视频帧大小的函数，In formula (12), ‘ο’ is matrix multiplication, σ is the sigmoid function, and Resize(·) is the function for adjusting the video frame size.

图2为本实施例的视频帧I的最终的显著性预测结果图Y_final，其中有两个显著目标，猫和盒子。FIG. 2 is a final saliency prediction result graph Y _final of the video frame I in this embodiment, in which there are two salient objects, a cat and a box.

如下公式(13)所示，After the first to sixth steps above, the saliency map for the input video frame I is calculated. In order to measure the difference between the final saliency prediction result map Y _final obtained in the sixth step above and the ground-truth, the binary cross entropy loss function is used during training

As shown in the following formula (13),

公式(13)中，G(i,j)∈[0,1]为像素点(i,j)的真实值，M(i,j)∈[0,1]为像素点(i,j)的预测值，取N＝473，In formula (13), G(i,j)∈[0,1] is the true value of pixel (i,j), M(i,j)∈[0,1] is the predicted value of pixel (i,j), and N=473.

通过不断缩小

By continuously shrinking

上述实施例中，所述ResNet50深度网络、ConvLSTM、ground-truth、随机梯度下降法均是本技术领域所公知的。In the above embodiments, the ResNet50 deep network, ConvLSTM, ground-truth, and stochastic gradient descent method are all well known in the technical field.

Claims

1. A video saliency detection method based on a deep network is characterized in that: the spatial features are first obtained by using a ResNet50 deep network, and then the time and edge information are extracted to jointly obtain a saliency prediction result map, thereby completing the video saliency detection based on a deep network. The specific steps are as follows:

The first step is to input the video frame I and perform preprocessing:

Input video frame I, unify the size of the video frame to 473×473 pixels in width and height, and subtract the mean of the corresponding channel from each pixel value in video frame I, where the mean of the R channel of each video frame I is 104.00698793, the mean of the G channel of each video frame I is 116.66876762, and the mean of the B channel of each video frame I is 122.67891434. In this way, the shape of video frame I before inputting into the ResNet50 deep network is 473×473×3. The video frame after such preprocessing is recorded as I′, as shown in the following formula (1):

I′＝Resize(I-Mean(R,G,B)) (1),

In formula (1), Mean(R,G,B) is the mean of the three color channels of red, green, and blue, and Resize(·) is the function for adjusting the size of the video frame I′;

The second step is to extract the initial spatial feature map S of the video frame I′:

The video frame I′ after the first step of preprocessing is sent to the ResNet50 deep network to extract the initial spatial feature map S, as shown in the following formula (2):

S＝ResNet50(I′) (2),

In formula (2), ResNet50(·) is the ResNet50 deep network.

The ResNet50 deep network contains convolutional layers, pooling layers, non-linear activation function Relu layers and residual connections;

The third step is to obtain the spatial feature map S _final of five scales:

The initial spatial feature map S of the video frame I′ extracted in the second step is sent to four different dilated convolutions with dilation rates of 2, 4, 8, and 16 in the ResNet50 deep network, and the results T _k of four scales with dilation rates of 2, 4, 8, and 16 are obtained. Then, the result is connected in series with the output result initial spatial feature map S of the ResNet50 deep network to finally obtain the spatial feature map S _final of five scales.

The fourth step is to obtain the feature map F:

The spatial feature maps S _final of the five scales obtained in the third step above are subjected to a convolution operation with a convolution kernel of 3×3×32 to obtain a feature map F with a shape of 60×60×32, as shown in the following formula (3):

F＝BN(Relu(Conv(S _final ))) (3),

In formula (3), Conv(·) is the convolution operation, Relu(·) is the nonlinear activation function, and BN(·) is the normalization operation.

The fifth step is to obtain a rough spatiotemporal saliency map Y _ST and an edge contour map E _t of a salient object:

The feature map F obtained in the fourth step is input into the spatiotemporal branch and the edge detection branch to obtain a spatiotemporal feature map F _ST and an edge contour map E _t of a salient object. The specific operation is as follows:

The feature map F obtained in the fourth step above is input into the ConvLSTM of the spatiotemporal branch to obtain a spatiotemporal feature map F _ST , as shown in the following formula (4):

F _ST =ConvLSTM(F,H _t-1 ) (4),

In formula (4), ConvLSTM(·) is the ConvLSTM operation, H _t-1 is the state of the ConvLSTM unit at the previous moment;

Then the obtained spatiotemporal feature map F _ST is sent to a convolution layer with a convolution kernel size of 1×1 to obtain a rough spatiotemporal saliency map Y _ST , the formula is as follows:

Y _ST =Conv(F _ST ) (5),

In formula (5), Conv(·) is the convolution operation;

The feature map F obtained in the fourth step is input into the edge detection branch to obtain the edge contour map E _t of the salient object. The specific operation is as follows:

Through the ResNet50 deep network and dilated convolution, the static image of the input video of frame T is obtained as follows:

H _t =ConvLSTM(X _t ,H _t-1 ) (6),

In formula (6) and formula (7),

The key component of embedding ConvLSTM in ConvLSTM to obtain the edge contour map E _t is the edge detection network

As shown in the following formula (8),

Then use the above edge detection network

In formula (9),

is a 1×1 convolution kernel used to map the edge detection network

Thus, a rough spatiotemporal saliency map Y _ST and an edge contour map E _t of a salient object are obtained;

Step 6: Get the final significance prediction result graph Y _final :

The rough spatiotemporal saliency map Y _ST obtained in the fifth step above is fused with the edge contour map E _t of the salient object to obtain the final saliency prediction result map Y _final , as shown in the following formula (10):

In formula (10),

The obtained video frame is restored to the size of the original input video frame 473×473;

Step 7: Calculate the loss for the input video frame I:

After the first to sixth steps above, the saliency map for the input video frame I is calculated. In order to measure the difference between the final saliency prediction result map Y _final obtained in the sixth step above and the ground-truth, the binary cross entropy loss function is used during training

As shown in the following formula (11),

In formula (11), G(i,j)∈[0,1] is the true value of pixel (i,j), M(i,j)∈[0,1] is the predicted value of pixel (i,j), and N=473.

By continuously shrinking

This completes the video saliency detection based on deep network.

2. According to the deep network-based video saliency detection method of claim 1, it is characterized in that: the specific operation of obtaining the spatial feature map S _final of five scales is as follows:

The dilated convolution kernel in the ResNet50 deep network is represented as

In formula (12), C _k is the dilated convolution kernel with a value of k, K is the number of dilated convolutions,

is the dilated convolution operation, S is the initial spatial feature map,

The shape of the initial spatial feature map S obtained after the ResNet50 deep network is 60×60×2048, the value is 4, the value range of k is [1, 2, 3, 4], the expansion rate r _k has four values, namely r _k = {2, 4, 8, 16}, and the shape of its expansion convolution kernel C _k is 3×3×512, thus finally obtaining feature maps of four different scales

Then connect them in series, as shown in the following formula (13):

S _final =[S,T ₁ ,T ₂ ,…,T _K ] (13),

In formula (13), S _final is the final multi-scale spatial feature map, S is the initial spatial feature map S extracted by the ResNet50 deep network, T _K is the feature map obtained after the dilated convolution, and the shape of the five-scale spatial feature map S _final is 60×60×4096.