
CN111461043B - Video saliency detection method based on deep network - Google Patents

Video saliency detection method based on deep network

Info

Publication number
CN111461043B
CN111461043B (granted from application CN202010266351.2A)
Authority
CN
China
Prior art keywords
map
video frame
saliency
final
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN202010266351.2A
Other languages
Chinese (zh)
Other versions
CN111461043A
Inventor
于明
夏斌红
刘依
郭迎春
郝小可
朱叶
师硕
于洋
阎刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei University of Technology
Original Assignee
Hebei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei University of Technology filed Critical Hebei University of Technology
Priority to CN202010266351.2A
Publication of CN111461043A
Application granted
Publication of CN111461043B
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/70Denoising; Smoothing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/13Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

本发明基于深度网络的视频显著性检测方法,涉及图像数据处理领域,该方法是先用ResNet50深度网络来取空间特征,然后再提取时间和边缘信息来共同得到显著性预测结果图,完成基于深度网络的视频显著性检测,步骤是,输入视频帧I,进行预处理;提取视频帧I的初始空间特征图S;获得五个尺度的空间特征图Sfinal;获得特征图F;获得粗略的时空显著图YST和显著性物体的边缘轮廓图Et;获得最终的显著性预测结果图Yfinal;计算对于输入视频帧I的损失,完成基于深度网络的视频显著性检测。本发明克服了现有技术视频显著性检测中存在的显著目标检测不完整、当前景背景颜色相似时算法检测不准确的缺陷。


The present invention is a video saliency detection method based on a deep network, relating to the field of image data processing. The method first uses a ResNet50 deep network to extract spatial features, and then extracts temporal and edge information to jointly obtain a saliency prediction result map, completing deep-network-based video saliency detection. The steps are: input a video frame I and preprocess it; extract the initial spatial feature map S of the video frame I′; obtain the five-scale spatial feature map S final ; obtain the feature map F; obtain a coarse spatiotemporal saliency map Y ST and the edge contour map E t of the salient objects; obtain the final saliency prediction result map Y final ; and compute the loss for the input video frame I, completing the deep-network-based video saliency detection. The invention overcomes the defects of the prior art in video saliency detection, namely incomplete salient object detection and inaccurate detection when the foreground and background colors are similar.


Description

基于深度网络的视频显著性检测方法Video saliency detection method based on deep network

技术领域Technical Field

本发明的技术方案涉及图像数据处理领域,具体地说是基于深度网络的视频显著性检测方法。The technical solution of the present invention relates to the field of image data processing, and specifically to a video saliency detection method based on a deep network.

背景技术Background Art

视频显著性检测旨在提取连续的视频帧中人眼最感兴趣的区域。具体地说是利用计算机模拟人眼的视觉注意力机制,从视频帧中提取人眼感兴趣的区域,是计算机视觉领域的关键技术之一。Video saliency detection aims to extract the areas of greatest interest to the human eye in continuous video frames. Specifically, it uses computers to simulate the visual attention mechanism of the human eye and extract the areas of interest to the human eye from video frames. It is one of the key technologies in the field of computer vision.

传统的视频显著性检测方法大多数都基于低级的手工特征(例如颜色,纹理等),这些方法是典型的启发式方法,具有速度慢(由于耗时的光流计算)和预测精度低(由于低水平特征的可表征性有限)的缺点。近年来深度神经网络开始应用于视频显著性检测领域,深度学习方法是指利用卷积神经网络提取图像的高级语义特征计算图像的显著值,但采用深度卷积网络会丢失目标的位置信息和细节信息,在检测显著目标时可能会引入误导信息,导致检测到的目标不完整。Most of the traditional video saliency detection methods are based on low-level manual features (such as color, texture, etc.). These methods are typical heuristic methods with the disadvantages of slow speed (due to time-consuming optical flow calculation) and low prediction accuracy (due to the limited representability of low-level features). In recent years, deep neural networks have begun to be applied to the field of video saliency detection. Deep learning methods refer to the use of convolutional neural networks to extract high-level semantic features of images and calculate the saliency value of images. However, the use of deep convolutional networks will lose the location information and detail information of the target, which may introduce misleading information when detecting salient targets, resulting in incomplete detected targets.

2016年,Liu等人在“Saliency detection for unconstrained videos usingsuperpixel-level graph and spatiotemporal propagation”一文中提出了SGSP算法,该算法使用超像素级的图模型和时空传播来进行视频显著性的检测,首先,提取超像素级的运动和颜色直方图以及全局运动直方图来构建图。接着,基于图模型使用背景先验通过图上的最短路径迭代地计算运动显著性。然后在时间上往前向和后向传播,在空间上局部和全局地传播,最后将这两个结果融合起来形成最后的显著图。该算法的计算量很大,但得到的显著图仍存在显著性目标检测不完全的问题。基于深度学习模型旨在利用卷积神经网络得到更丰富的深度特征,进而得到更准确的检测结果。Wang等人于2017年在“Videosalient object detection via fully convolutional networks”一文中提出了基于全卷积网络的视频显著性检测方法,这是基于深度学习的全卷积网络第一次用在了视频显著性检测领域,但是由于没有考虑到帧与帧之间的时间信息,导致得到的显著图的边缘不够精细,边缘噪声比较大。CN106372636A公开了一种基于HOG_TOP的视频显著性检测方法,该方法利用原始视频在三个正交的平面XY、XT、YT计算得到HOG_TOP特征,分别在XY平面计算得到空域显著图和在XT,YT平面得到时域显著图,最后通过自适应融合得到最终的显著图,此方法在计算时域显著图时需要计算每个像素点的光流,计算量很大,速度慢。CN109784183A公开了一种基于级联卷积网络和光流的视频显著性目标检测方法,该方法利用级联网络结构,在高、中、低三个尺度上分别对当前帧的图像进行像素级的显著性预测。使用MSAR10K图像数据集训练级联网络结构,显著性标注图作为训练的监督信息,损失函数为交叉熵损失函数。训练终止后,利用训练好的级联网络对视频中的每一帧图像进行静态显著性预测,利用Locus-Kanada算法进行光流场提取。然后使用三层卷积网络结构构建动态优化网络结构。将每一帧图像的静态检测结果和光流场检测结果进行拼接得到优化网络的输入数据。该方法较耗时,且在一些对于复杂场景的时候利用Locus-Kanada算法提取到的光流信息并不准确,鲁棒性较差。CN109118469A公开了一种用于视频显著性的预测方法,该方法先对图像进行量化得到稀疏矩阵响应,再根据局部坐标约束得到分解矩阵,最后对视频中的每一帧进行显著图计算,并进行质量预测。该方法丢失了显著性目标的一些细节信息,使得预测结果会存在显著性目标检测不完整的问题。CN105913456B公开了一种区域分割的视频显著性检测方法,该方法先利用非线性聚类得到超像素块来提取静态特征,再利用分光流法得到动态特征,最后用线性回归模型来预测两个特征融合之后的显著图,该方法的计算量较大,效率较低。CN109034001A公开了一种基于时空线索的跨模态视频显著性检测方法,该方法利用初始的显著图,可见光和热红外两个模态的权重构造显著图,该方法难以找到一个合适权重值导致鲁棒性较差。CN108241854A公开了一种基于运动和记忆信息的深度视频显著性检测方法,该方法先根据当前帧的人眼注视图来提取局部信息和全局信息,再将此作为先验信息和原图像一起输入到深度网络模型当中来预测最终的显著图,当显著目标触及图像边界时,该方法会出现误检,显著目标会被误检测为背景。CN110598537A公开了一种基于深度卷积网络的视频显著性检测方法,该方法以视频的当前帧及其对应的光流图像作为特征提取网络的输入来预测最终的显著图,该方法需要提前计算当前帧的光流信息,计算量较大。In 2016, Liu et al. proposed the SGSP algorithm in the paper "Saliency detection for unconstrained videos using superpixel-level graph and spatiotemporal propagation". The algorithm uses superpixel-level graph models and spatiotemporal propagation to detect video saliency. First, superpixel-level motion and color histograms and global motion histograms are extracted to construct a graph. Then, based on the graph model, motion saliency is iteratively calculated through the shortest path on the graph using background priors. Then, it propagates forward and backward in time, and propagates locally and globally in space, and finally combines the two results to form the final saliency map. The algorithm is very computationally intensive, but the saliency map obtained still has the problem of incomplete detection of salient objects. The deep learning model aims to use convolutional neural networks to obtain richer deep features, thereby obtaining more accurate detection results. Wang et al. proposed a video saliency detection method based on a fully convolutional network in the article "Videosalient object detection via fully convolutional networks" in 2017. This is the first time that a fully convolutional network based on deep learning has been used in the field of video saliency detection. However, since the time information between frames is not taken into account, the edges of the obtained saliency map are not fine enough and the edge noise is relatively large. CN106372636A discloses a video saliency detection method based on HOG_TOP. The method uses the original video to calculate the HOG_TOP features on three orthogonal planes XY, XT, and YT, respectively calculates the spatial domain saliency map on the XY plane and the temporal domain saliency map on the XT and YT planes, and finally obtains the final saliency map through adaptive fusion. 
This method needs to calculate the optical flow of each pixel when calculating the temporal domain saliency map, which is very computationally intensive and slow. CN109784183A discloses a video salient target detection method based on cascade convolutional network and optical flow. The method uses a cascade network structure to perform pixel-level saliency prediction on the image of the current frame at three scales: high, medium and low. The cascade network structure is trained using the MSAR10K image data set, and the saliency annotation map is used as the supervisory information for training. The loss function is the cross entropy loss function. After the training is terminated, the trained cascade network is used to perform static saliency prediction on each frame of the video, and the Locus-Kanada algorithm is used to extract the optical flow field. Then a three-layer convolutional network structure is used to construct a dynamic optimization network structure. The static detection results of each frame of the image and the optical flow field detection results are spliced to obtain the input data of the optimization network. This method is time-consuming, and the optical flow information extracted by the Locus-Kanada algorithm is not accurate and has poor robustness in some complex scenes. CN109118469A discloses a method for predicting video saliency, which first quantizes the image to obtain a sparse matrix response, then obtains a decomposition matrix based on local coordinate constraints, and finally calculates a saliency map for each frame in the video and performs quality prediction. This method loses some detailed information of the salient target, so that the prediction result will have the problem of incomplete detection of salient targets. CN105913456B discloses a method for detecting video saliency by region segmentation, which first uses nonlinear clustering to obtain superpixel blocks to extract static features, then uses the optical flow method to obtain dynamic features, and finally uses a linear regression model to predict the saliency map after the fusion of the two features. The method has a large amount of calculation and low efficiency. CN109034001A discloses a cross-modal video saliency detection method based on spatiotemporal cues, which uses the initial saliency map and the weights of the two modalities of visible light and thermal infrared to construct a saliency map. The method is difficult to find a suitable weight value, resulting in poor robustness. CN108241854A discloses a deep video saliency detection method based on motion and memory information. The method first extracts local information and global information based on the human eye gaze map of the current frame, and then inputs this information into the deep network model together with the original image as prior information to predict the final saliency map. When the salient target touches the image boundary, the method will have false detection, and the salient target will be mistakenly detected as the background. CN110598537A discloses a video saliency detection method based on a deep convolutional network. The method uses the current frame of the video and its corresponding optical flow image as the input of the feature extraction network to predict the final saliency map. The method needs to calculate the optical flow information of the current frame in advance, and the amount of calculation is large.

总之,视频显著性目标检测的现有技术中仍存在显著目标检测不完整、当前景背景颜色相似时算法检测不准确的问题。In summary, the existing technology of video salient object detection still has problems such as incomplete salient object detection and inaccurate algorithm detection when the foreground and background colors are similar.

发明内容Summary of the invention

本发明所要解决的技术问题是:提供基于深度网络的视频显著性检测方法,该方法是先用ResNet50深度网络来取空间特征,然后再提取时间和边缘信息来共同得到显著性预测结果图,完成基于深度网络的视频显著性检测,克服了现有技术视频显著性检测中存在的显著目标检测不完整、当前景背景颜色相似时算法检测不准确的缺陷。The technical problem to be solved by the present invention is to provide a video saliency detection method based on a deep network. The method first uses a ResNet50 deep network to obtain spatial features, and then extracts time and edge information to jointly obtain a saliency prediction result map, thereby completing video saliency detection based on a deep network, overcoming the defects of incomplete salient target detection and inaccurate algorithm detection when the foreground and background colors are similar in the prior art video saliency detection.

本发明解决该技术问题所采用的技术方案是:基于深度网络的视频显著性检测方法,是先用ResNet50深度网络来取空间特征,然后再提取时间和边缘信息来共同得到显著性预测结果图,完成基于深度网络的视频显著性检测,具体步骤如下:The technical solution adopted by the present invention to solve the technical problem is: a video saliency detection method based on a deep network first uses a ResNet50 deep network to obtain spatial features, and then extracts time and edge information to jointly obtain a saliency prediction result map to complete the video saliency detection based on a deep network. The specific steps are as follows:

第一步,输入视频帧I,进行预处理:The first step is to input the video frame I and perform preprocessing:

输入视频帧I,将视频帧的尺寸都统一为宽高都是473×473像素,并且视频帧I中的每个像素值都减去其相对应的通道的均值,其中,每个视频帧I的R通道的均值是104.00698793,每个视频帧I中的G通道的均值是116.66876762,每个视频帧I中的B通道的均值是122.67891434,这样,输入到ResNet50深度网络之前的视频帧I的形状为473×473×3,将如此进行预处理之后的视频帧记为I′,如下公式(1)所示:Input video frame I, unify the size of the video frame to 473×473 pixels in width and height, and subtract the mean of the corresponding channel from each pixel value in video frame I, where the mean of the R channel of each video frame I is 104.00698793, the mean of the G channel of each video frame I is 116.66876762, and the mean of the B channel of each video frame I is 122.67891434. In this way, the shape of video frame I before inputting into the ResNet50 deep network is 473×473×3. The video frame after such preprocessing is recorded as I′, as shown in the following formula (1):

I′=Resize(I-Mean(R,G,B))   (1),

公式(1)中,Mean(R,G,B)为红,绿,蓝三个颜色通道的均值,Resize(·)为调整视频帧I′大小的函数;In formula (1), Mean(R,G,B) is the mean of the three color channels of red, green, and blue, and Resize(·) is the function for adjusting the size of the video frame I′;
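For illustration, this preprocessing step can be sketched with NumPy and OpenCV as follows; the helper name preprocess and the use of OpenCV for color conversion and resizing are assumptions made for the sketch, not part of the patent:

import cv2
import numpy as np

# Per-channel means given in the patent, in R, G, B order.
MEAN_RGB = np.array([104.00698793, 116.66876762, 122.67891434], dtype=np.float32)

def preprocess(frame_bgr):
    """Formula (1): I' = Resize(I - Mean(R, G, B)), output shape 473 x 473 x 3."""
    frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB).astype(np.float32)
    frame_rgb -= MEAN_RGB                       # subtract the channel means
    return cv2.resize(frame_rgb, (473, 473))    # resize to 473 x 473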

第二步,提取视频帧I′的初始空间特征图S:The second step is to extract the initial spatial feature map S of the video frame I′:

将上述第一步预处理之后的视频帧I′送入到ResNet50深度网络去提取初始空间特征图S,如下公式(2)所示:The video frame I′ after the first step of preprocessing is sent to the ResNet50 deep network to extract the initial spatial feature map S, as shown in the following formula (2):

S=ResNet50(I′)   (2),

公式(2)中,ResNet50(·)为ResNet50深度网络,In formula (2), ResNet50(·) is the ResNet50 deep network.

ResNet50深度网络包含卷积层,池化层,非线性激活函数Relu层和残差连接;The ResNet50 deep network contains convolutional layers, pooling layers, non-linear activation function Relu layers and residual connections;
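A minimal PyTorch sketch of this feature extraction step is given below; the 60×60×2048 output size reported later in the patent implies an output stride of 8, which is approximated here by replacing the strides of the last two ResNet stages with dilation (a common DeepLab-style modification that the patent does not spell out):

import torch
import torch.nn as nn
from torchvision.models import resnet50

# ResNet50 backbone without the average-pooling and fully-connected head.
backbone = resnet50(weights=None, replace_stride_with_dilation=[False, True, True])
feature_extractor = nn.Sequential(*list(backbone.children())[:-2])

I_prime = torch.randn(1, 3, 473, 473)    # preprocessed frame I' (batch of 1)
S = feature_extractor(I_prime)           # initial spatial feature map S
print(S.shape)                           # torch.Size([1, 2048, 60, 60])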

第三步,获得五个尺度的空间特征图SfinalThe third step is to obtain the spatial feature map S final of five scales:

将第二步中提取到的视频帧I′的初始空间特征图S分别送入到ResNet50深度网络中扩张率为2、4、8、16的四个不同的扩张卷积中去,得到扩张率分别为2、4、8、16的四个尺度的结果Tk,再将该结果与ResNet50深度网络的输出结果初始空间特征图S串联起来最终获得五个尺度的空间特征图SfinalThe initial spatial feature map S of the video frame I′ extracted in the second step is sent to four different dilated convolutions with dilation rates of 2, 4, 8, and 16 in the ResNet50 deep network, and the results T k of four scales with dilation rates of 2, 4, 8, and 16 are obtained. Then, the result is connected in series with the output result initial spatial feature map S of the ResNet50 deep network to finally obtain the spatial feature map S final of five scales.
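The following PyTorch sketch illustrates the four parallel dilated convolutions and the concatenation with S; the 512 output channels per branch follow the 3×3×512 kernels described later in the text, and the module name is only illustrative:

import torch
import torch.nn as nn

class MultiScaleDilation(nn.Module):
    # Four parallel 3x3 dilated convolutions with rates 2, 4, 8, 16; their outputs
    # Tk are concatenated with the backbone feature map S to form Sfinal.
    def __init__(self, in_ch=2048, branch_ch=512, rates=(2, 4, 8, 16)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, branch_ch, kernel_size=3, stride=1, padding=r, dilation=r)
            for r in rates
        ])

    def forward(self, S):
        T = [branch(S) for branch in self.branches]   # T1 ... TK
        return torch.cat([S] + T, dim=1)              # Sfinal = [S, T1, ..., TK]

S = torch.randn(1, 2048, 60, 60)
S_final = MultiScaleDilation()(S)
print(S_final.shape)      # torch.Size([1, 4096, 60, 60])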

第四步,获得特征图F:The fourth step is to obtain the feature map F:

将上述第三步得到的五个尺度的空间特征图Sfinal通过一个卷积核为3×3×32的卷积操作获得形状为60×60×32的特征图F,如下公式(3)所示,The spatial feature maps S final of the five scales obtained in the third step above are subjected to a convolution operation with a convolution kernel of 3×3×32 to obtain a feature map F with a shape of 60×60×32, as shown in the following formula (3):

F=BN(Relu(Conv(Sfinal)))   (3),

公式(3)中,Conv(·)为卷积操作,Relu(·)为非线性激活函数,BN(·)为对其进行标准化操作;In formula (3), Conv(·) is the convolution operation, Relu(·) is the nonlinear activation function, and BN(·) is the normalization operation.
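A sketch of this fusion convolution of formula (3); the patent applies Relu before BN, which is mirrored here:

import torch
import torch.nn as nn

# F = BN(Relu(Conv(Sfinal))) with a 3x3x32 kernel, keeping the 60x60 resolution.
fuse = nn.Sequential(
    nn.Conv2d(4096, 32, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.BatchNorm2d(32),
)

S_final = torch.randn(1, 4096, 60, 60)
F = fuse(S_final)
print(F.shape)    # torch.Size([1, 32, 60, 60])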

第五步,获得粗略的时空显著图YST和显著性物体的边缘轮廓图EtThe fifth step is to obtain a rough spatiotemporal saliency map Y ST and an edge contour map E t of a salient object:

将上述第四步获得的特征图F同时分别输入到时空分支和边缘检测分支得到一个时空特征图FST和得到显著性物体的边缘轮廓图Et,具体操作如下,The feature map F obtained in the fourth step is input into the spatiotemporal branch and the edge detection branch to obtain a spatiotemporal feature map F ST and an edge contour map E t of a salient object. The specific operation is as follows:

将上述第四步得到的特征图F输入到时空分支的ConvLSTM当中去,得到一个时空特征图FST,如下公式(4)所示,The feature map F obtained in the fourth step above is input into the ConvLSTM of the spatiotemporal branch to obtain a spatiotemporal feature map F ST , as shown in the following formula (4):

FST=ConvLSTM(F,Ht-1)   (4),

公式(4)中,ConvLSTM(·)为ConvLSTM操作,Ht-1为前一时刻ConvLSTM单元的状态;In formula (4), ConvLSTM(·) is the ConvLSTM operation, H t-1 is the state of the ConvLSTM unit at the previous moment;

再将得到的时空特征图FST再送入到一层卷积核大小为1×1的卷积中得到一个粗略的时空显著图YST,公式如下:Then the obtained spatiotemporal feature map F ST is sent to a convolution layer with a convolution kernel size of 1×1 to obtain a rough spatiotemporal saliency map Y ST , the formula is as follows:

YST=Conv(FST)   (5),

公式(5)中,Conv(·)为卷积操作;In formula (5), Conv(·) is the convolution operation;
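A minimal ConvLSTM cell is sketched below to illustrate the spatiotemporal branch of formulas (4) and (5); the gate layout follows the standard ConvLSTM formulation and the 32 hidden channels are an assumption, since the patent does not state the cell internals:

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    # Standard ConvLSTM cell: convolutional input/forget/output/candidate gates.
    def __init__(self, in_ch, hidden_ch, kernel_size=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, h_prev, c_prev):
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h_prev], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

cell = ConvLSTMCell(in_ch=32, hidden_ch=32)
to_saliency = nn.Conv2d(32, 1, kernel_size=1)   # 1x1 convolution of formula (5)

F = torch.randn(1, 32, 60, 60)                  # feature map F of the fourth step
h = c = torch.zeros(1, 32, 60, 60)              # previous state Ht-1
F_ST, c = cell(F, h, c)                         # spatiotemporal feature map FST
Y_ST = to_saliency(F_ST)                        # coarse spatiotemporal saliency map YST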

将上述第四步得到的特征图F输入到边缘检测分支中得到显著性物体的边缘轮廓图Et,具体操作如下,The feature map F obtained in the fourth step is input into the edge detection branch to obtain the edge contour map E t of the salient object. The specific operation is as follows:

Through the ResNet50 deep network and the dilated convolutions, the static features of the input video with T frames are obtained as {Xt, t = 1, …, T}, where Xt corresponds to the t-th video frame. Given Xt, the edge detection branch outputs an edge contour map Et∈[0,1]^(W×H), where W and H are the width and height of the predicted edge map; Et is computed by an edge detection network, denoted D_edge(·), which takes the previous video frames into account, as shown in formulas (6) and (7):

Ht=ConvLSTM(Xt,Ht-1)   (6),

Et′=D_edge(Ht)   (7),

In formulas (6) and (7), Ht∈R^(W×H×M) is the 3D tensor hidden state, M is the number of channels, Et′ is the unweighted edge contour map, Ht is the state of the current ConvLSTM unit, Ht-1 is the state of the ConvLSTM unit at the previous moment, and X1 is the first video frame;

A ConvLSTM is embedded within the ConvLSTM; the key component for obtaining the edge contour map Et is the edge detection network D_edge(·), as shown in formula (8):

D_edge(Ht)=ConvLSTM(Ht,D_edge(Ht-1))   (8),

The above edge detection network D_edge is then used for weighting to obtain the edge contour map Et of the salient object, as shown in formula (9):

Et=σ(We∗D_edge(Ht))∘Et′   (9),

In formula (9), We is a 1×1 convolution kernel used to map the output of the edge detection network D_edge to a weight matrix, and the sigmoid function σ normalizes this matrix to [0,1];

由此完成获得粗略的时空显著图YST和显著性物体的边缘轮廓图EtThus, a rough spatiotemporal saliency map Y ST and an edge contour map E t of a salient object are obtained;
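To make the edge branch concrete, the weighting of formulas (6)–(9) can be sketched as follows; the internal layers of the edge detection network are not published in the patent, so the composition shown here (a single convolution applied to the ConvLSTM hidden state) is purely an assumption:

import torch
import torch.nn as nn

class EdgeHead(nn.Module):
    # Predicts an unweighted edge map Et' from the hidden state Ht, maps Ht with a
    # 1x1 convolution We to a weight matrix squashed to [0, 1] by a sigmoid, and
    # multiplies the two to obtain the edge contour map Et.
    def __init__(self, hidden_ch=32):
        super().__init__()
        self.edge_net = nn.Conv2d(hidden_ch, 1, kernel_size=3, padding=1)  # assumed edge network
        self.w_e = nn.Conv2d(hidden_ch, 1, kernel_size=1)                  # 1x1 kernel We

    def forward(self, H_t):
        E_unweighted = self.edge_net(H_t)        # Et' (unweighted edge map)
        weight = torch.sigmoid(self.w_e(H_t))    # weight matrix in [0, 1] (simplified: taken from Ht)
        return weight * E_unweighted             # Et

H_t = torch.randn(1, 32, 60, 60)   # hidden state from the edge-branch ConvLSTM
E_t = EdgeHead()(H_t)
print(E_t.shape)                   # torch.Size([1, 1, 60, 60])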

第六步,获得最终的显著性预测结果图YfinalStep 6: Get the final significance prediction result graph Y final :

将上述第五步得到的粗略的时空显著图YST和显著性物体的边缘轮廓图Et进行融合,得到最终的显著性预测结果图Yfinal,如下公式(10)所示,The rough spatiotemporal saliency map Y ST obtained in the fifth step above is fused with the edge contour map E t of the salient object to obtain the final saliency prediction result map Y final , as shown in the following formula (10):

Yfinal=Resize(σ(YST)∘Et)   (10),

In formula (10), '∘' denotes matrix multiplication, σ is the sigmoid function, and Resize(·) is the function for adjusting the video frame size,

将得到的视频帧恢复到原输入视频帧的大小473×473;The obtained video frame is restored to the size of the original input video frame 473×473;
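A sketch of the fusion of formula (10); whether the sigmoid is applied to YST alone or to both maps before the element-wise product is not recoverable from the published text, so the ordering below is an assumption:

import torch
import torch.nn.functional as Fn

Y_ST = torch.randn(1, 1, 60, 60)   # coarse spatiotemporal saliency map
E_t = torch.rand(1, 1, 60, 60)     # edge contour map, already in [0, 1]

fused = torch.sigmoid(Y_ST) * E_t                        # element-wise product
Y_final = Fn.interpolate(fused, size=(473, 473),
                         mode="bilinear", align_corners=False)
print(Y_final.shape)               # torch.Size([1, 1, 473, 473])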

第七步,计算对于输入视频帧I的损失:Step 7: Calculate the loss for the input video frame I:

After the first to sixth steps above, the saliency map of the input video frame I has been computed. To measure the difference between the final saliency prediction result map Yfinal obtained in the sixth step and the ground-truth, the binary cross-entropy loss function L is adopted during training, as shown in formula (11):

L=−(1/N²)Σi=1..N Σj=1..N [G(i,j)·log M(i,j)+(1−G(i,j))·log(1−M(i,j))]   (11),

公式(11)中,G(i,j)∈[0,1]为像素点(i,j)的真实值,M(i,j)∈[0,1]为像素点(i,j)的预测值,取N=473,In formula (11), G(i,j)∈[0,1] is the true value of pixel (i,j), M(i,j)∈[0,1] is the predicted value of pixel (i,j), and N=473.

The network is trained by continually reducing the value of the loss L, and the stochastic gradient descent method is used to optimize the binary cross-entropy loss function L,
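The training objective of formula (11) can be sketched as follows; "model" stands in for the full network described above, and its definition and the learning rate are assumptions, since the patent does not give optimizer hyperparameters:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 1, kernel_size=1), nn.Sigmoid())   # placeholder network
criterion = nn.BCELoss()            # binary cross-entropy over the 473x473 pixels
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

I_prime = torch.randn(1, 3, 473, 473)                 # preprocessed frame
G = torch.randint(0, 2, (1, 1, 473, 473)).float()     # ground-truth saliency map

M = model(I_prime)                  # predicted saliency map in [0, 1]
loss = criterion(M, G)              # loss L of formula (11)
optimizer.zero_grad()
loss.backward()
optimizer.step()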

至此完成基于深度网络的视频显著性检测。This completes the video saliency detection based on deep network.

上述基于深度网络的视频显著性检测方法,所述获得五个尺度的空间特征图Sfinal的具体操作如下:In the above-mentioned video saliency detection method based on deep network, the specific operation of obtaining the spatial feature map S final of five scales is as follows:

The dilated convolution kernels in the ResNet50 deep network are denoted {Ck∈R^(c×c×C), k=1,…,K}, where K is the number of dilated convolution layers, c×c is the kernel width multiplied by its height, and C is the number of channels; the dilation rates rk are the parameters of the dilated convolutions, and the stride is set to 1. Based on these parameters, four output feature maps {Tk∈R^(W×H×C), k=1,…,K} are obtained, where W and H are the width and height respectively, as shown in formula (12):

Tk=Ck ∗rk S,  k=1,…,K   (12),

In formula (12), Ck is the k-th dilated convolution kernel, K is the number of dilated convolutions, ∗rk denotes the dilated convolution operation with dilation rate rk, and S is the initial spatial feature map,

The initial spatial feature map S obtained through the ResNet50 deep network has a shape of 60×60×2048; K is set to 4, k ranges over [1, 2, 3, 4], the dilation rate rk takes four values, rk={2,4,8,16}, and each dilated convolution kernel Ck has a shape of 3×3×512. The four feature maps of different scales {Tk, k=1,…,4} thus obtained are then concatenated in sequence, as shown in formula (13):

Sfinal=[S,T1,T2,…,TK]   (13),

公式(13)中,Sfinal为最后得到的多尺度的空间特征图,S为由ResNet50深度网络提取的初始空间特征图S,TK为的是经过扩张卷积之后得到的特征图,五个尺度的空间特征图Sfinal的形状为60×60×4096。In formula (13), S final is the final multi-scale spatial feature map, S is the initial spatial feature map S extracted by the ResNet50 deep network, T K is the feature map obtained after the dilated convolution, and the shape of the five-scale spatial feature map S final is 60×60×4096.

本发明的有益效果是:与现有技术相比,本发明的突出的实质性特点和显著进步如下:The beneficial effects of the present invention are as follows: Compared with the prior art, the outstanding substantive features and significant improvements of the present invention are as follows:

(1)本发明方法与CN106372636A相比,本发明采取的是基于深度学习的方法,先利用ResNet50和扩张卷积来提取多尺度的空间特征,再利用ConvLSTM来提取时间信息,最后再整合为时空信息。本发明具有的突出的实质性特点和显著进步是不需要去计算光流信息,而是用ConvLSTM来提取时间信息,显著目标的检测精度比计算光流的方法更好,并且速度更快。(1) Compared with CN106372636A, the method of the present invention adopts a method based on deep learning, first using ResNet50 and dilated convolution to extract multi-scale spatial features, then using ConvLSTM to extract temporal information, and finally integrating it into spatiotemporal information. The outstanding substantive characteristics and significant progress of the present invention are that it does not need to calculate optical flow information, but uses ConvLSTM to extract temporal information. The detection accuracy of salient targets is better than the method of calculating optical flow, and the speed is faster.

(2)本发明方法与CN109784183A相比,本发明采用的是带有残差网络的连接方式,多个卷积层都有残差块的连接,本发明具有的突出的实质性特点和显著进步是能使训练网络收敛的更快,提取的特征更加精细,预测的准确率更高。(2) Compared with CN109784183A, the method of the present invention adopts a connection mode with a residual network, and multiple convolutional layers are connected with residual blocks. The outstanding substantial characteristics and significant progress of the present invention are that it can make the training network converge faster, the extracted features are more refined, and the prediction accuracy is higher.

(3)本发明方法与CN109118469A相比,本发明具有的突出的实质性特点和显著进步是无需进行繁琐的稀疏矩阵的提取,采用深度神经网络从视频帧中提取高级特征,对每一个像素点进行预测,检测结果更加准确,鲁棒性较好。(3) Compared with CN109118469A, the method of the present invention has outstanding substantive features and significant improvements in that it does not require cumbersome sparse matrix extraction, but uses a deep neural network to extract high-level features from video frames and predict each pixel point, so that the detection result is more accurate and has better robustness.

(4)本发明方法与CN105913456B相比,本发明具有的突出的实质性特点和显著进步是不需要进行计算量较大的线性迭代和k-means聚类,而直接采用端到端的神经网络方法,当训练完成之后能较快速地得到预测结果。(4) Compared with CN105913456B, the method of the present invention has the outstanding substantive characteristics and significant progress that it does not require linear iteration and k-means clustering with large computational complexity, but directly adopts an end-to-end neural network method, and can obtain prediction results more quickly after training is completed.

(5)本发明方法与CN109034001A相比,本发明采用的是基于深度网络的边缘检测分支去提取原图像中的显著性物体的边缘,并以此来指导下面的完整显著图的生成。本发明具有的突出的实质性特点和显著进步是得到的显著图中的显著目标更完整。(5) Compared with CN109034001A, the method of the present invention uses an edge detection branch based on a deep network to extract the edges of salient objects in the original image, and uses this to guide the generation of the following complete salient map. The outstanding substantive feature and significant progress of the present invention is that the salient objects in the obtained salient map are more complete.

(6)本发明方法与CN108241854A相比,虽然都是用的深度学习的方法,但是本发明采用扩张卷积提取了四种不同尺度的特征图,与之相比,本发明提取到的特征更加全面,因此本发明具有的突出的实质性特点和显著进步是得到的最终显著图中的显著目标的边缘更加平滑。(6) Compared with CN108241854A, although both methods use deep learning methods, the present invention uses dilated convolution to extract feature maps of four different scales. Compared with the above, the features extracted by the present invention are more comprehensive. Therefore, the outstanding substantial feature and significant improvement of the present invention is that the edges of the salient targets in the final salient map obtained are smoother.

(7)本发明方法与CN110598537A相比,本发明具有的突出的实质性特点和显著进步是利用ConvLSTM来模拟帧间的光流信息,提取到的光流信息比用传统方法计算出来的更加准确。(7) Compared with CN110598537A, the method of the present invention has an outstanding substantive feature and a significant improvement in that ConvLSTM is used to simulate the optical flow information between frames, and the extracted optical flow information is more accurate than that calculated by traditional methods.

(8)与Video Salient Object Detection via Fully Convolutional Networks相比,本发明具有的突出的实质性特点和显著进步是利用到了帧与帧之间的时间信息,得到的预测结果图更加准确。(8) Compared with Video Salient Object Detection via Fully Convolutional Networks, the present invention has the outstanding substantial feature and significant improvement of utilizing the time information between frames, and the obtained prediction result map is more accurate.

(9)本发明方法提出了一个基于深度网络的视频显著性检测方法模型。首先在视频显著性检测领域使用基于深度学习的显著性目标的边缘检测方法,此方法区别于传统的边缘检测算法,它能准确的检测出视频序列中每一帧中的显著性目标的轮廓,用来指导显著图的预测。(9) The method of the present invention proposes a video saliency detection method model based on a deep network. First, in the field of video saliency detection, a deep learning-based edge detection method of salient targets is used. This method is different from the traditional edge detection algorithm. It can accurately detect the contours of salient targets in each frame of the video sequence to guide the prediction of saliency maps.

(10)本发明利用深度显著性目标边缘检测分支生成显著性目标轮廓图与视频中每一帧的时空显著图进行融合,使它的轮廓更加平滑,能更准确的预测出视频序列中每一帧中的显著性目标。(10) The present invention utilizes the deep salient target edge detection branch to generate a salient target contour map and fuses it with the spatiotemporal salient map of each frame in the video, making its contour smoother and being able to more accurately predict the salient targets in each frame in the video sequence.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

下面结合附图和实施例对本发明进一步说明。The present invention is further described below in conjunction with the accompanying drawings and embodiments.

图1是本发明基于深度网络的视频显著性检测方法的流程示意框图。FIG1 is a schematic flow chart of a method for detecting video saliency based on a deep network according to the present invention.

图2是本发明实施例中的显著目标为一个猫和一个盒子的视频帧I的显著性预测结果图YfinalFIG. 2 is a saliency prediction result graph Y final of a video frame I in which salient objects are a cat and a box in an embodiment of the present invention.

具体实施方式DETAILED DESCRIPTION

图1所示实施例表明,本发明基于深度网络的视频显著性检测方法的流程如下:The embodiment shown in FIG1 shows that the process of the video saliency detection method based on deep network of the present invention is as follows:

输入视频帧I,进行预处理→提取视频帧I′的初始空间特征图S→获得五个尺度的空间特征图Sfinal→获得特征图F→获得粗略的时空显著图YST和显著性物体的边缘轮廓图Et→获得最终的显著性预测结果图Yfinal→计算对于输入视频帧I的损失→完成基于深度网络的视频显著性检测。Input video frame I, perform preprocessing → extract the initial spatial feature map S of video frame I′ → obtain the spatial feature map S final at five scales → obtain the feature map F → obtain a rough spatiotemporal saliency map Y ST and the edge contour map E t of the salient object → obtain the final saliency prediction result map Y final → calculate the loss for the input video frame I → complete the video saliency detection based on the deep network.

实施例1Example 1

本实施例中显著目标为一个猫和一个盒子,本实施例所述的基于深度网络的视频显著性检测方法,具体步骤如下:In this embodiment, the salient objects are a cat and a box. The video saliency detection method based on a deep network described in this embodiment has the following specific steps:

第一步,输入视频帧I,进行预处理:The first step is to input the video frame I and perform preprocessing:

输入显著目标为一个猫和一个盒子的视频帧I,将视频帧的尺寸都统一为宽高都是473×473像素,并且视频帧I中的每个像素值都减去其相对应的通道的均值,其中,每个视频帧I的R通道的均值是104.00698793,每个视频帧I中的G通道的均值是116.66876762,每个视频帧I中的B通道的均值是122.67891434,这样,输入到ResNet50深度网络之前的视频帧I的形状为473×473×3,将如此进行预处理之后的视频帧记为I′,如下公式(1)所示:Input a video frame I with a cat and a box as salient objects. The size of the video frames is unified to 473×473 pixels in width and height, and the mean of the corresponding channel is subtracted from each pixel value in the video frame I. The mean of the R channel of each video frame I is 104.00698793, the mean of the G channel of each video frame I is 116.66876762, and the mean of the B channel of each video frame I is 122.67891434. In this way, the shape of the video frame I before inputting into the ResNet50 deep network is 473×473×3. The video frame after such preprocessing is recorded as I′, as shown in the following formula (1):

I′=Resize(I-Mean(R,G,B))   (1),

公式(1)中,Mean(R,G,B)为红,绿,蓝三个颜色通道的均值,Resize(·)为调整视频帧I′大小的函数;In formula (1), Mean(R,G,B) is the mean of the three color channels of red, green, and blue, and Resize(·) is the function for adjusting the size of the video frame I′;

第二步,提取视频帧I′的初始空间特征图S:The second step is to extract the initial spatial feature map S of the video frame I′:

将上述第一步预处理之后的视频帧I′送入到ResNet50深度网络去提取初始空间特征图S,如下公式(2)所示:The video frame I′ after the first step of preprocessing is sent to the ResNet50 deep network to extract the initial spatial feature map S, as shown in the following formula (2):

S=ResNet50(I′)   (2),

公式(2)中,ResNet50(·)为ResNet50深度网络,In formula (2), ResNet50(·) is the ResNet50 deep network.

ResNet50深度网络包含卷积层,池化层,非线性激活函数Relu层和残差连接;The ResNet50 deep network contains convolutional layers, pooling layers, non-linear activation function Relu layers and residual connections;

第三步,获得五个尺度的空间特征图SfinalThe third step is to obtain the spatial feature map S final of five scales:

将第二步中提取到的视频帧I′的初始空间特征图S分别送入到ResNet50深度网络中扩张率为2、4、8、16的四个不同的扩张卷积中去,得到扩张率分别为2、4、8、16的四个尺度的结果Tk,再将该结果与ResNet50深度网络的输出结果初始空间特征图S串联起来最终获得五个尺度的空间特征图SfinalThe initial spatial feature map S of the video frame I′ extracted in the second step is sent to four different dilated convolutions with dilation rates of 2, 4, 8, and 16 in the ResNet50 deep network, and the results T k of four scales with dilation rates of 2, 4, 8, and 16 are obtained. Then, the result is connected in series with the output result initial spatial feature map S of the ResNet50 deep network to finally obtain the spatial feature map S final of five scales.

获得五个尺度的空间特征图Sfinal的具体操作如下:The specific operations to obtain the spatial feature map S final of five scales are as follows:

The dilated convolution kernels in the ResNet50 deep network are denoted {Ck∈R^(c×c×C), k=1,…,K}, where K is the number of dilated convolution layers, c×c is the kernel width multiplied by its height, and C is the number of channels; the dilation rates rk are the parameters of the dilated convolutions, and the stride is set to 1. Based on these parameters, four output feature maps {Tk∈R^(W×H×C), k=1,…,K} are obtained, where W and H are the width and height respectively, as shown in formula (3):

Tk=Ck ∗rk S,  k=1,…,K   (3),

In formula (3), Ck is the k-th dilated convolution kernel, K is the number of dilated convolutions, ∗rk denotes the dilated convolution operation with dilation rate rk, and S is the initial spatial feature map,

The initial spatial feature map S obtained through the ResNet50 deep network has a shape of 60×60×2048; K is set to 4, k ranges over [1, 2, 3, 4], the dilation rate rk takes four values, rk={2,4,8,16}, and each dilated convolution kernel Ck has a shape of 3×3×512. The four feature maps of different scales {Tk, k=1,…,4} thus obtained are then concatenated in sequence, as shown in formula (4):

Sfinal=[S,T1,T2,…,TK]   (4),

公式(4)中,Sfinal为最后得到的多尺度的空间特征图,S为由ResNet50深度网络提取的初始空间特征图S,TK为的是经过扩张卷积之后得到的特征图,五个尺度的空间特征图Sfinal的形状为60×60×4096;In formula (4), S final is the final multi-scale spatial feature map, S is the initial spatial feature map S extracted by the ResNet50 deep network, T K is the feature map obtained after the dilated convolution, and the shape of the five-scale spatial feature map S final is 60×60×4096;

第四步,获得特征图F:The fourth step is to obtain the feature map F:

将上述第三步得到的五个尺度的空间特征图Sfinal通过一个卷积核为3×3×32的卷积操作获得形状为60×60×32的特征图F,如下公式(5)所示,The spatial feature maps S final of the five scales obtained in the third step above are subjected to a convolution operation with a convolution kernel of 3×3×32 to obtain a feature map F with a shape of 60×60×32, as shown in the following formula (5):

F=BN(Relu(Conv(Sfinal)))   (5),

公式(5)中,Conv(·)为卷积操作,Relu(·)为非线性激活函数,BN(·)为对其进行标准化操作;In formula (5), Conv(·) is the convolution operation, Relu(·) is the nonlinear activation function, and BN(·) is the normalization operation.

第五步,获得粗略的时空显著图YST和显著性物体的边缘轮廓图EtThe fifth step is to obtain a rough spatiotemporal saliency map Y ST and an edge contour map E t of a salient object:

将上述第四步获得的特征图F同时分别输入到时空分支和边缘检测分支得到一个时空特征图FST和得到显著性物体的边缘轮廓图Et,具体操作如下,The feature map F obtained in the fourth step is input into the spatiotemporal branch and the edge detection branch to obtain a spatiotemporal feature map F ST and an edge contour map E t of a salient object. The specific operation is as follows:

将上述第四步得到的特征图F输入到时空分支的ConvLSTM当中去,得到一个时空特征图FST,如下公式(6)所示,The feature map F obtained in the fourth step above is input into the ConvLSTM of the spatiotemporal branch to obtain a spatiotemporal feature map F ST , as shown in the following formula (6):

FST=ConvLSTM(F,Ht-1)   (6),

公式(6)中,ConvLSTM(·)为ConvLSTM操作,Ht-1为前一时刻ConvLSTM单元的状态;In formula (6), ConvLSTM(·) is the ConvLSTM operation, H t-1 is the state of the ConvLSTM unit at the previous moment;

再将得到的时空特征图FST再送入到一层卷积核大小为1×1的卷积中得到一个粗略的时空显著图YST,公式如下:Then the obtained spatiotemporal feature map F ST is sent to a convolution layer with a convolution kernel size of 1×1 to obtain a rough spatiotemporal saliency map Y ST , the formula is as follows:

YST=Conv(FST)   (7),

公式(7)中,Conv(·)为卷积操作;In formula (7), Conv(·) is the convolution operation;

将上述第四步得到的特征图F输入到边缘检测分支中得到显著性物体的边缘轮廓图Et,具体操作如下,The feature map F obtained in the fourth step is input into the edge detection branch to obtain the edge contour map E t of the salient object. The specific operation is as follows:

The edge detection branch contains a two-layer ConvLSTM, a powerful recurrent model that not only captures temporal information but also uses it to delineate the contour edges of salient objects and to distinguish salient from non-salient objects in the image. More specifically, through the ResNet50 deep network and the dilated convolutions, the static features of the input video with T frames are obtained as {Xt, t = 1, …, T}, where Xt corresponds to the t-th video frame. Given Xt, the edge detection branch outputs an edge contour map Et∈[0,1]^(W×H), where W and H are the width and height of the predicted edge map; Et is computed by an edge detection network, denoted D_edge(·), which takes the previous video frames into account, as shown in formulas (8) and (9):

Ht=ConvLSTM(Xt,Ht-1)   (8),

Et′=D_edge(Ht)   (9),

In formulas (8) and (9), Ht∈R^(W×H×M) is the 3D tensor hidden state, M is the number of channels, Et′ is the unweighted edge contour map, Ht is the state of the current ConvLSTM unit, Ht-1 is the state of the ConvLSTM unit at the previous moment, and X1 is the first video frame;

A ConvLSTM is embedded within the ConvLSTM; the key component for obtaining the edge contour map Et is the edge detection network D_edge(·), as shown in formula (10):

D_edge(Ht)=ConvLSTM(Ht,D_edge(Ht-1))   (10),

The above edge detection network D_edge is then used for weighting to obtain the edge contour map Et of the salient object, as shown in formula (11):

Et=σ(We∗D_edge(Ht))∘Et′   (11),

In formula (11), We is a 1×1 convolution kernel used to map the output of the edge detection network D_edge to a weight matrix, and the sigmoid function σ normalizes this matrix to [0,1];

由此完成获得粗略的时空显著图YST和显著性物体的边缘轮廓图EtThus, a rough spatiotemporal saliency map Y ST and an edge contour map E t of a salient object are obtained;

第六步,获得最终的显著性预测结果图YfinalStep 6: Get the final significance prediction result graph Y final :

将上述第五步得到的粗略的时空显著图YST和显著性物体的边缘轮廓图Et进行融合,得到最终的显著性预测结果图Yfinal,如下公式(12)所示,The rough spatiotemporal saliency map Y ST obtained in the fifth step above is fused with the edge contour map E t of the salient object to obtain the final saliency prediction result map Y final , as shown in the following formula (12):

Yfinal=Resize(σ(YST)∘Et)   (12),

公式(12)中，'∘'为矩阵相乘，σ为sigmoid函数，Resize(·)为调整视频帧大小的函数，In formula (12), '∘' denotes matrix multiplication, σ is the sigmoid function, and Resize(·) is the function for adjusting the video frame size,

将得到的视频帧恢复到原输入视频帧的大小473×473;The obtained video frame is restored to the size of the original input video frame 473×473;

图2为本实施例的视频帧I的最终的显著性预测结果图Yfinal,其中有两个显著目标,猫和盒子。FIG. 2 is a final saliency prediction result graph Y final of the video frame I in this embodiment, in which there are two salient objects, a cat and a box.

第七步,计算对于输入视频帧I的损失:Step 7: Calculate the loss for the input video frame I:

After the first to sixth steps above, the saliency map of the input video frame I has been computed. To measure the difference between the final saliency prediction result map Yfinal obtained in the sixth step and the ground-truth, the binary cross-entropy loss function L is adopted during training, as shown in formula (13):

L=−(1/N²)Σi=1..N Σj=1..N [G(i,j)·log M(i,j)+(1−G(i,j))·log(1−M(i,j))]   (13),

公式(13)中,G(i,j)∈[0,1]为像素点(i,j)的真实值,M(i,j)∈[0,1]为像素点(i,j)的预测值,取N=473,In formula (13), G(i,j)∈[0,1] is the true value of pixel (i,j), M(i,j)∈[0,1] is the predicted value of pixel (i,j), and N=473.

The network is trained by continually reducing the value of the loss L, and the stochastic gradient descent method is used to optimize the binary cross-entropy loss function L,

至此完成基于深度网络的视频显著性检测。This completes the video saliency detection based on deep network.

上述实施例中,所述ResNet50深度网络、ConvLSTM、ground-truth、随机梯度下降法均是本技术领域所公知的。In the above embodiments, the ResNet50 deep network, ConvLSTM, ground-truth, and stochastic gradient descent method are all well known in the technical field.

Claims (2)

1.基于深度网络的视频显著性检测方法,其特征在于:是先用ResNet50深度网络来取空间特征,然后再提取时间和边缘信息来共同得到显著性预测结果图,完成基于深度网络的视频显著性检测,具体步骤如下:1. A video saliency detection method based on a deep network is characterized in that: the spatial features are first obtained by using a ResNet50 deep network, and then the time and edge information are extracted to jointly obtain a saliency prediction result map, thereby completing the video saliency detection based on a deep network. The specific steps are as follows: 第一步,输入视频帧I,进行预处理:The first step is to input the video frame I and perform preprocessing: 输入视频帧I,将视频帧的尺寸都统一为宽高都是473×473像素,并且视频帧I中的每个像素值都减去其相对应的通道的均值,其中,每个视频帧I的R通道的均值是104.00698793,每个视频帧I中的G通道的均值是116.66876762,每个视频帧I中的B通道的均值是122.67891434,这样,输入到ResNet50深度网络之前的视频帧I的形状为473×473×3,将如此进行预处理之后的视频帧记为I′,如下公式(1)所示:Input video frame I, unify the size of the video frame to 473×473 pixels in width and height, and subtract the mean of the corresponding channel from each pixel value in video frame I, where the mean of the R channel of each video frame I is 104.00698793, the mean of the G channel of each video frame I is 116.66876762, and the mean of the B channel of each video frame I is 122.67891434. In this way, the shape of video frame I before inputting into the ResNet50 deep network is 473×473×3. The video frame after such preprocessing is recorded as I′, as shown in the following formula (1): I′=Resize(I-Mean(R,G,B)) (1),I′=Resize(I-Mean(R,G,B)) (1), 公式(1)中,Mean(R,G,B)为红,绿,蓝三个颜色通道的均值,Resize(·)为调整视频帧I′大小的函数;In formula (1), Mean(R,G,B) is the mean of the three color channels of red, green, and blue, and Resize(·) is the function for adjusting the size of the video frame I′; 第二步,提取视频帧I′的初始空间特征图S:The second step is to extract the initial spatial feature map S of the video frame I′: 将上述第一步预处理之后的视频帧I′送入到ResNet50深度网络去提取初始空间特征图S,如下公式(2)所示:The video frame I′ after the first step of preprocessing is sent to the ResNet50 deep network to extract the initial spatial feature map S, as shown in the following formula (2): S=ResNet50(I′) (2),S=ResNet50(I′) (2), 公式(2)中,ResNet50(·)为ResNet50深度网络,In formula (2), ResNet50(·) is the ResNet50 deep network. ResNet50深度网络包含卷积层,池化层,非线性激活函数Relu层和残差连接;The ResNet50 deep network contains convolutional layers, pooling layers, non-linear activation function Relu layers and residual connections; 第三步,获得五个尺度的空间特征图SfinalThe third step is to obtain the spatial feature map S final of five scales: 将第二步中提取到的视频帧I′的初始空间特征图S分别送入到ResNet50深度网络中扩张率为2、4、8、16的四个不同的扩张卷积中去,得到扩张率分别为2、4、8、16的四个尺度的结果Tk,再将该结果与ResNet50深度网络的输出结果初始空间特征图S串联起来最终获得五个尺度的空间特征图SfinalThe initial spatial feature map S of the video frame I′ extracted in the second step is sent to four different dilated convolutions with dilation rates of 2, 4, 8, and 16 in the ResNet50 deep network, and the results T k of four scales with dilation rates of 2, 4, 8, and 16 are obtained. Then, the result is connected in series with the output result initial spatial feature map S of the ResNet50 deep network to finally obtain the spatial feature map S final of five scales. 
第四步,获得特征图F:The fourth step is to obtain the feature map F: 将上述第三步得到的五个尺度的空间特征图Sfinal通过一个卷积核为3×3×32的卷积操作获得形状为60×60×32的特征图F,如下公式(3)所示,The spatial feature maps S final of the five scales obtained in the third step above are subjected to a convolution operation with a convolution kernel of 3×3×32 to obtain a feature map F with a shape of 60×60×32, as shown in the following formula (3): F=BN(Relu(Conv(Sfinal))) (3),F=BN(Relu(Conv(S final ))) (3), 公式(3)中,Conv(·)为卷积操作,Relu(·)为非线性激活函数,BN(·)为对其进行标准化操作;In formula (3), Conv(·) is the convolution operation, Relu(·) is the nonlinear activation function, and BN(·) is the normalization operation. 第五步,获得粗略的时空显著图YST和显著性物体的边缘轮廓图EtThe fifth step is to obtain a rough spatiotemporal saliency map Y ST and an edge contour map E t of a salient object: 将上述第四步获得的特征图F同时分别输入到时空分支和边缘检测分支得到一个时空特征图FST和得到显著性物体的边缘轮廓图Et,具体操作如下,The feature map F obtained in the fourth step is input into the spatiotemporal branch and the edge detection branch to obtain a spatiotemporal feature map F ST and an edge contour map E t of a salient object. The specific operation is as follows: 将上述第四步得到的特征图F输入到时空分支的ConvLSTM当中去,得到一个时空特征图FST,如下公式(4)所示,The feature map F obtained in the fourth step above is input into the ConvLSTM of the spatiotemporal branch to obtain a spatiotemporal feature map F ST , as shown in the following formula (4): FST=ConvLSTM(F,Ht-1) (4),F ST =ConvLSTM(F,H t-1 ) (4), 公式(4)中,ConvLSTM(·)为ConvLSTM操作,Ht-1为前一时刻ConvLSTM单元的状态;In formula (4), ConvLSTM(·) is the ConvLSTM operation, H t-1 is the state of the ConvLSTM unit at the previous moment; 再将得到的时空特征图FST再送入到一层卷积核大小为1×1的卷积中得到一个粗略的时空显著图YST,公式如下:Then the obtained spatiotemporal feature map F ST is sent to a convolution layer with a convolution kernel size of 1×1 to obtain a rough spatiotemporal saliency map Y ST , the formula is as follows: YST=Conv(FST) (5),Y ST =Conv(F ST ) (5), 公式(5)中,Conv(·)为卷积操作;In formula (5), Conv(·) is the convolution operation; 将上述第四步得到的特征图F输入到边缘检测分支中得到显著性物体的边缘轮廓图Et,具体操作如下,The feature map F obtained in the fourth step is input into the edge detection branch to obtain the edge contour map E t of the salient object. The specific operation is as follows: 通过ResNet50深度网络和扩张卷积,获得T帧的输入视频的静态为
{Xt, t = 1, …, T}，其中Xt为第t帧的视频帧，给定Xt，Xt经过边缘检测分支后输出为边缘轮廓图Et∈[0,1]W×H，其中W和H分别为预测边缘图像的宽度和高度，Et是由边缘检测网络D_edge(·)计算出来的，它将先前的视频帧考虑在内，具体如下公式(6)和公式(7)所示，Through the ResNet50 deep network and the dilated convolutions, the static features of the input video with T frames are obtained as {Xt, t = 1, …, T}, where Xt corresponds to the t-th video frame; given Xt, the edge detection branch outputs an edge contour map Et∈[0,1]^(W×H), where W and H are the width and height of the predicted edge map; Et is computed by the edge detection network D_edge(·), which takes the previous video frames into account, as shown in formulas (6) and (7):
Ht=ConvLSTM(Xt,Ht-1)   (6),
Et′=D_edge(Ht)   (7),
In formulas (6) and (7), Ht∈R^(W×H×M) is the 3D tensor hidden state, M is the number of channels, Et′ is the unweighted edge contour map, Ht is the state of the current ConvLSTM unit, Ht-1 is the state of the ConvLSTM unit at the previous moment, and X1 is the first video frame;
A ConvLSTM is embedded within the ConvLSTM; the key component for obtaining the edge contour map Et is the edge detection network D_edge(·), as shown in formula (8):

D_edge(Ht)=ConvLSTM(Ht,D_edge(Ht-1))   (8),
The above edge detection network D_edge is then used for weighting to obtain the edge contour map Et of the salient object, as shown in formula (9):

Et=σ(We∗D_edge(Ht))∘Et′   (9),
In formula (9), We is a 1×1 convolution kernel used to map the output of the edge detection network D_edge to a weight matrix, and the sigmoid function σ normalizes this matrix to [0,1];
This completes the computation of the rough spatiotemporal saliency map Y_ST and the edge contour map E_t of the salient object;
The sixth step is to obtain the final saliency prediction map Y_final: the rough spatiotemporal saliency map Y_ST obtained in the fifth step is fused with the edge contour map E_t of the salient object to obtain the final saliency prediction map Y_final, as shown in formula (10):
[Formula (10) appears only as an image in the original document.]
In formula (10), the product denotes matrix multiplication, σ is the sigmoid function, and Resize(·) is the function that adjusts the video frame size; the resulting frame is restored to the original input frame size of 473×473;
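Formula (10) is likewise only available as an image; going by the description (a sigmoid, a multiplication of the two maps, and Resize(·) back to 473×473), a hedged sketch of the fusion could look as follows. The operand order, the element-wise product, and the bilinear interpolation mode are assumptions.

```python
import torch
import torch.nn.functional as F

def fuse_and_resize(y_st, e_t, out_size=(473, 473)):
    """Assumed reading of formula (10): combine Y_ST and E_t, then resize to 473x473."""
    fused = torch.sigmoid(y_st) * e_t  # multiply the rough saliency map with the edge map (form assumed)
    return F.interpolate(fused, size=out_size, mode='bilinear', align_corners=False)  # Resize(.)
```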
The seventh step is to compute the loss for the input video frame I: after the first through sixth steps above, the saliency map for the input video frame I has been computed. To measure the difference between the final saliency prediction map Y_final obtained in the sixth step and the ground truth, a binary cross-entropy loss L is used during training, as shown in formula (11):
L = -Σ_{i=1..N} Σ_{j=1..N} [ G(i,j) log M(i,j) + (1 - G(i,j)) log(1 - M(i,j)) ] (11),
In formula (11), G(i,j) ∈ [0,1] is the ground-truth value of pixel (i,j), M(i,j) ∈ [0,1] is the predicted value of pixel (i,j), and N = 473;
The network is trained by continually reducing L, and the binary cross-entropy loss L is optimized with stochastic gradient descent;
This completes the deep-network-based video saliency detection.
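A short training-step sketch for formula (11) and the stochastic gradient descent optimization, assuming PyTorch: `model` stands in for the full network of steps one to six, and the loss reduction, learning rate, and momentum are assumed values not taken from the claim.

```python
import torch.nn.functional as F

def bce_loss(m, g):
    # Formula (11): binary cross-entropy between the predicted map M(i, j) and the
    # ground truth G(i, j), both in [0, 1], accumulated over the N x N pixels (N = 473).
    return F.binary_cross_entropy(m, g, reduction='sum')  # reduction assumed

def train_step(model, optimizer, frames, ground_truth):
    # One SGD update: forward pass, loss of formula (11), backward pass, parameter step.
    optimizer.zero_grad()
    y_final = model(frames)                 # final saliency prediction, values in [0, 1]
    loss = bce_loss(y_final, ground_truth)
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical optimizer setup:
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)  # values assumed
```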
2. The deep-network-based video saliency detection method according to claim 1, wherein the specific operation of obtaining the five-scale spatial feature map S_final is as follows:
The dilated convolution kernels in the ResNet50 deep network are denoted C_k, k = 1, …, K, each of shape c×c×C, where K is the number of dilated convolution layers, c×c is the kernel width times height, and C is the number of channels; the stride of the dilated convolutions is set to 1. Based on these parameters, K output feature maps T_k of width W and height H are obtained, as shown in formula (12):
T_k = C_k ⊛ S, k = 1, …, K (12),
In formula (12), C_k is the k-th dilated convolution kernel, K is the number of dilated convolutions, ⊛ denotes the dilated convolution operation, and S is the initial spatial feature map;
The initial spatial feature map S obtained from the ResNet50 deep network has shape 60×60×2048; K is set to 4, so k ranges over [1, 2, 3, 4]; the dilation rate r_k takes four values, r_k = {2, 4, 8, 16}, and each dilated convolution kernel C_k has shape 3×3×512. This yields feature maps T_1, …, T_4 at four different scales, which are then concatenated in sequence, as shown in formula (13):
S_final = [S, T_1, T_2, …, T_K] (13),
In formula (13), S_final is the resulting multi-scale spatial feature map, S is the initial spatial feature map extracted by the ResNet50 deep network, and T_K is the feature map obtained after dilated convolution; the five-scale spatial feature map S_final has shape 60×60×4096.
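The construction in claim 2 is essentially an ASPP-style multi-scale block: four 3×3×512 dilated convolutions with rates {2, 4, 8, 16} and stride 1 applied to the 60×60×2048 ResNet50 feature map S, concatenated with S itself to give the 60×60×4096 map S_final. A minimal sketch, assuming PyTorch; the class name MultiScaleDilation is hypothetical:

```python
import torch
import torch.nn as nn

class MultiScaleDilation(nn.Module):
    """Sketch of claim 2: dilated 3x3 convolutions at rates {2, 4, 8, 16} plus the input map."""
    def __init__(self, in_channels=2048, branch_channels=512, rates=(2, 4, 8, 16)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, branch_channels, kernel_size=3,
                      stride=1, dilation=r, padding=r)  # padding = rate keeps the 60x60 resolution
            for r in rates
        ])

    def forward(self, s):
        # s: (B, 2048, 60, 60), the initial spatial feature map from ResNet50
        feats = [s] + [branch(s) for branch in self.branches]  # formula (12): T_k from dilated conv of S
        return torch.cat(feats, dim=1)                         # formula (13): S_final = [S, T_1, ..., T_K]
```

With these defaults the concatenated output has 2048 + 4×512 = 4096 channels, matching the 60×60×4096 shape stated in the claim.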
CN202010266351.2A 2020-04-07 2020-04-07 Video significance detection method based on deep network Expired - Fee Related CN111461043B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010266351.2A CN111461043B (en) 2020-04-07 2020-04-07 Video significance detection method based on deep network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010266351.2A CN111461043B (en) 2020-04-07 2020-04-07 Video significance detection method based on deep network

Publications (2)

Publication Number Publication Date
CN111461043A CN111461043A (en) 2020-07-28
CN111461043B true CN111461043B (en) 2023-04-18

Family

ID=71685906

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010266351.2A Expired - Fee Related CN111461043B (en) 2020-04-07 2020-04-07 Video significance detection method based on deep network

Country Status (1)

Country Link
CN (1) CN111461043B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931732B (en) * 2020-09-24 2022-07-15 苏州科达科技股份有限公司 Method, system, device and storage medium for detecting salient object of compressed video
CN113570509B (en) * 2020-11-13 2025-02-21 华南理工大学 Data processing method and computer device
CN112861733B (en) * 2021-02-08 2022-09-02 电子科技大学 Night traffic video significance detection method based on space-time double coding
CN112950477B (en) * 2021-03-15 2023-08-22 河南大学 A High Resolution Salient Object Detection Method Based on Dual Path Processing
CN114119978B (en) * 2021-12-03 2024-08-09 安徽理工大学 Saliency target detection algorithm for integrated multisource feature network
CN114511454B (en) * 2021-12-24 2024-10-11 广州市广播电视台 Video quality assessment method with enhanced edges
CN117152670A (en) * 2023-10-31 2023-12-01 江西拓世智能科技股份有限公司 Behavior recognition method and system based on artificial intelligence


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106157307B (en) * 2016-06-27 2018-09-11 浙江工商大学 A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108256562A (en) * 2018-01-09 2018-07-06 深圳大学 Well-marked target detection method and system based on Weakly supervised space-time cascade neural network
CN109448015A (en) * 2018-10-30 2019-03-08 河北工业大学 Image based on notable figure fusion cooperates with dividing method
CN110929736A (en) * 2019-11-12 2020-03-27 浙江科技学院 Multi-feature cascade RGB-D significance target detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Guo, Y. C., et al. Video Object Extraction Based on Spatiotemporal Consistency Saliency Detection. IEEE Access, 2018, vol. 6, 35171-35181. *
师硕 (Shi Shuo). Research on Local Invariant Features of Images and Their Applications. China Doctoral Dissertations Full-text Database, Information Science and Technology, 2015, vol. 2015 (no. 2015), I138-45. *

Also Published As

Publication number Publication date
CN111461043A (en) 2020-07-28

Similar Documents

Publication Publication Date Title
CN111461043B (en) Video significance detection method based on deep network
Liu et al. Salient object detection for RGB-D image by single stream recurrent convolution neural network
Sengupta et al. SfSNet: Learning shape, reflectance and illuminance of faces 'in the wild'
Kim et al. Deep monocular depth estimation via integration of global and local predictions
CN109886066B (en) Rapid target detection method based on multi-scale and multi-layer feature fusion
US10803546B2 (en) Systems and methods for unsupervised learning of geometry from images using depth-normal consistency
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
CN108520501B (en) A video rain and snow removal method based on multi-scale convolutional sparse coding
Deng et al. A voxel graph cnn for object classification with event cameras
CN108256562A (en) Well-marked target detection method and system based on Weakly supervised space-time cascade neural network
CN110276264B (en) Crowd density estimation method based on foreground segmentation graph
Meunier et al. Em-driven unsupervised learning for efficient motion segmentation
CN112597941A (en) Face recognition method and device and electronic equipment
CN111612807A (en) A Small Object Image Segmentation Method Based on Scale and Edge Information
CN113592018A (en) Infrared light and visible light image fusion method based on residual dense network and gradient loss
CN101477633B (en) Method for automatically estimating visual significance of image and video
CN108564012B (en) Pedestrian analysis method based on human body feature distribution
Xu et al. Video salient object detection via robust seeds extraction and multi-graphs manifold propagation
CN110287826A (en) A Video Object Detection Method Based on Attention Mechanism
CN110533048A (en) The realization method and system of combination semantic hierarchies link model based on panoramic field scene perception
CN108985298B (en) Human body clothing segmentation method based on semantic consistency
CN104680546A (en) Image salient object detection method
Yang et al. Shape tracking with occlusions via coarse-to-fine region-based sobolev descent
CN113033432A (en) Remote sensing image residential area extraction method based on progressive supervision
CN113033656B (en) Interactive hole detection data expansion method based on generation countermeasure network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20230418