CN113869154B

CN113869154B - Video actor segmentation method according to language description

Info

Publication number: CN113869154B
Application number: CN202111081527.8A
Authority: CN
Inventors: 李国荣; 陈伟东; 张新峰; 黄庆明
Original assignee: University of Chinese Academy of Sciences
Current assignee: University of Chinese Academy of Sciences
Priority date: 2021-09-15
Filing date: 2021-09-15
Publication date: 2022-09-02
Anticipated expiration: 2041-09-15
Also published as: CN113869154A

Abstract

本发明公开了一种根据语言描述的视频动作者分割方法，所述方法利用级联跨模态注意力模块，利用剪辑级视觉特征粗略地关注语言查询的信息词，再利用帧级视觉特征来微调单词的注意力，微调目标帧中语言的权重，可以在丰富的视频信息中辨别分割出正例，并通过对比学习挖掘难负例，可以学习从视频中识别目标演员，还可以在不同的视频中区分它，显着提高帧内匹配和分割的准确率。The invention discloses a video actor segmentation method according to language description. The method utilizes cascaded cross-modal attention modules, uses clip-level visual features to roughly focus on the information words of language queries, and then uses frame-level visual features to Fine-tune the attention of words, fine-tune the weight of the language in the target frame, can identify and segment positive examples in the rich video information, and mine difficult negative examples through comparative learning, can learn to identify the target actor from the video, and can also be used in different Distinguish it in the video, significantly improving the accuracy of intra-frame matching and segmentation.

Description

A Video Actor Segmentation Method Based on Language Description

技术领域technical field

本发明涉及视频识别技术领域，具体涉及一种根据语言描述的视频动作者分割方法。The invention relates to the technical field of video recognition, in particular to a method for segmenting video actors according to language description.

背景技术Background technique

近年来，视频理解任务受到了广泛关注，尤其是涉及自然语言处理的问题。在这个领域，已经取得了在语言选择性时间动作定位、视频字幕生成和根据句子描述对视频中动作者和动作分割任务的巨大成就。在现实场景中，一个视频包含多个在做动作的动作者是很常见的。因此，通过语言查询在空间和时间上选择性地细粒度地定位一个特定的动作者及其动作成为计算机更好地理解视频的一项重要任务。Video understanding tasks have received a lot of attention in recent years, especially those involving natural language processing. In this field, great achievements have been made in the tasks of language-selective temporal action localization, video caption generation, and segmentation of actors and actions in videos based on sentence descriptions. In real-world scenarios, it is common for a video to contain multiple actors in action. Therefore, selectively fine-grained localization of a specific actor and its actions in space and time through linguistic queries becomes an important task for computers to better understand videos.

在相关任务中广泛使用的框架，例如视频/图像对象接地，是通过一些检测方法生成视频/图像中的区域提议，然后将文本特征与提议的视觉特征进行匹配以选择最佳的对象提议作为匹配的对象。为了提高匹配两个异构特征的性能，以前的工作首先利用双向LSTM和自注意力机制来生成语言特征，然后使用加权文本特征来处理视觉特征，最后进行文本-视觉特征匹配。但这种自注意力机制学到的语言注意力实际上是训练数据的平均解，而不是专注于某个视频的个性解。这样，在推理过程中，无论输入视频是什么，都确定了关注的语言特征，因为视频是包含丰富内容的高级语义空间，因此很难掌握视频的最具判别力的特征。因而，视频决定了语言查询中的关键，捕获信息性单词和学习视觉感知的判别语言表示对于语言引导的视频动作者-动作分割任务至关重要。A widely used framework in related tasks, such as video/image object grounding, is to generate region proposals in videos/images by some detection methods, and then match text features with proposed visual features to select the best object proposal as a match Object. To improve the performance of matching two heterogeneous features, previous work first utilized bidirectional LSTM and self-attention mechanism to generate linguistic features, then used weighted text features to process visual features, and finally performed text-visual feature matching. But the linguistic attention learned by this self-attention mechanism is actually an average solution of the training data, rather than focusing on the individual solutions of a certain video. In this way, during inference, regardless of the input video, the linguistic features of interest are identified, and since video is a high-level semantic space containing rich content, it is difficult to grasp the most discriminative features of video. Video thus determines the key in language queries, capturing informative words and learning discriminative language representations for visual perception are crucial for language-guided video actor-action segmentation tasks.

如何设计视觉感知的语言编码器和生成判别性语言，从而对视频中的动作者及其动作进行分割，还需进一步优化分割方法，提高帧内匹配和分割的准确率。How to design a visual perception language encoder and generate discriminative language, so as to segment the actors and their actions in the video, it is necessary to further optimize the segmentation method and improve the accuracy of intra-frame matching and segmentation.

发明内容SUMMARY OF THE INVENTION

为了克服上述问题，本发明提供一种根据语言描述对视频动作者及其动作进行分割的方法，其中，利用级联跨模态注意力机制的协作优化网络，显著提高了匹配和分割的准确性。利用两个视角的视觉特征从粗到细地关注语言，生成具有辨别力的视觉感知的语言特征，另外，配备对比学习的方法，设计难负例挖掘策略，有利于网络从负例中识别正例，并进一步提高性能，从而完成本发明。In order to overcome the above problems, the present invention provides a method for segmenting video actors and their actions based on language descriptions, wherein the cooperative optimization network using cascaded cross-modal attention mechanisms significantly improves the accuracy of matching and segmentation . Using visual features from two perspectives to focus on language from coarse to fine, to generate language features with discriminative visual perception. In addition, it is equipped with a comparative learning method to design a mining strategy for difficult negative examples, which is beneficial for the network to identify positive examples from negative examples. example, and further improve the performance, thereby completing the present invention.

本发明第一方面的目的在于提供一种根据语言描述的视频动作者分割方法，所述方法利用级联跨模态注意力模块进行，生成具有辨别力的句子查询特征，提高匹配和分割的准确性。The purpose of the first aspect of the present invention is to provide a video actor segmentation method according to language description, the method utilizes cascaded cross-modal attention modules to generate discriminative sentence query features, and improves the accuracy of matching and segmentation. sex.

所述级联跨模态注意力模块包括剪辑级特征注意力单元和帧级特征注意力单元。The cascaded cross-modal attention module includes a clip-level feature attention unit and a frame-level feature attention unit.

所述剪辑级特征注意力单元采用句子嵌入s和目标帧j的剪辑级特征v_c作为输入。The clip-level feature attention unit takes as input the sentence embedding s and the clip-level feature _vc of the target frame j.

所述剪辑级特征注意力单元利用剪辑级特征v_c对语言特征进行粗略加权，分别得到：The clip-level feature attention unit uses the clip-level feature _vc to roughly weight the language features, and obtain:

F₁＝Att₁·ψ(v_c)+φ(s)F ₁ =Att ₁ ·ψ( _vc )+φ(s)

其中，T为矩阵转置；σ_softmax是softmax激活函数；Att₁为剪辑特征v_c和句子嵌入s的注意力图；F₁为粗加权的句子特征；剪辑特征v_c经过卷积层

和ψ(·)处理后得到

和ψ(v_c)；将组合单词嵌入e_t形成句子嵌入s，接着将其放入一个卷积层φ(·)中以生成句子特征φ(s)。Among them, T is the matrix transpose; σ _softmax is the softmax activation function; Att ₁ is the attention map of the clip feature _vc and sentence embedding s; F ₁ is the _coarsely weighted sentence feature;

and ψ( ) after processing to get

and ψ(v _c ); combine word embeddings e _t to form sentence embeddings s, which are then put into a convolutional layer φ( ) to generate sentence features φ(s).

对于视频V，剪辑级特征v_c由下式进行编码：For video V, the clip-level features _vc are encoded by:

其中，

表示L₂范数，θ_avg为均值池化操作，I3D(·)为双流I3D编码器，优选地，使用I3D网络的Mixed-4f层的输出作为I3D编码器。in,

denotes the L2 norm, _θavg is the mean pooling operation, _I3D ( ) is the dual-stream I3D encoder, preferably, the output of the Mixed-4f layer of the I3D network is used as the I3D encoder.

所述帧级特征注意力单元处理所述粗略加权的句子特征F₁和帧级特征v_f，得到微调的句子特征F₂：The frame-level feature attention unit processes the roughly weighted sentence features F ₁ and frame-level features v _f to obtain fine-tuned sentence features F ₂ :

F₂＝Att₂·ψ′(v_f)+F₁ F ₂ =Att ₂ ·ψ′(v _f )+F ₁

其中，Att₂为帧级特征v_f和粗加权的句子特征F₁的注意力图；F₂表示微调的句子特征，其每一列表示一个单词的向量；

为v_f经过一个线性层

得到的特征；ψ′(v_f)为v_f经过一个线性层ψ′(·)得到的特征。Among them, Att ₂ is the attention map of the frame-level feature v _f and the coarsely weighted sentence feature F ₁ ; F ₂ represents the fine-tuned sentence feature, and each column of which represents a word vector;

go through a linear layer for v _f

The obtained feature; ψ′(v _f ) is the feature obtained by v _f through a linear layer ψ′(·).

所述帧级特征v_f利用ResNet-101网络进行提取，优选地，提取帧级特征前，在COCO数据集上进行预训练，并在A2D数据集训练分割上进行微调；The frame-level features v _f are extracted by using the ResNet-101 network. Preferably, before extracting the frame-level features, pre-training is performed on the COCO dataset, and fine-tuning is performed on the A2D dataset training segmentation;

所述帧级特征为从帧j到帧i的扭曲特征和原始特征v_i的线性加权组合，具体如下式所示：The frame-level feature is a linear weighted combination of the warped feature from frame j to frame i and the original feature v _i , as shown in the following formula:

其中，v_i为对目标帧i的ResNet-101编码特征，β是权重系数，i为目标帧，j为参考帧，2K为参考帧的帧数(即在目标帧前向取K帧，后向取K帧，对目标帧特征进行补偿)，v_j→i为从参考帧j到目标帧i的扭曲特征；Among them, vi is the ResNet-101 coding feature for the target frame _i , β is the weight coefficient, i is the target frame, j is the reference frame, 2K is the number of frames of the reference frame (that is, take K frames in the front of the target frame, and then Take K frames to compensate the target frame features), v _j→i is the distortion feature from the reference frame j to the target frame i;

所述v_j→i为：The _vj→i is:

其中，v_j是参考帧j的ResNet-101编码特征；OF_j→i为参考帧j和目标帧i之间的光流；

为双线性扭曲方程。Among them, v _j is the ResNet-101 encoding feature of reference frame j; OF _j→i is the optical flow between reference frame j and target frame i;

is the bilinear distortion equation.

加权的单词特征h_t经过一个全连接层，得到句子查询特征q：The weighted word feature h _t goes through a fully connected layer to obtain the sentence query feature q:

m_t＝FC(h_t)m _t =FC(h _t )

α_t＝σ_softmax(m_t)α _t =σ _softmax (m _t )

其中，h_t为F₂特征的第t列，表示第t个单词的向量；FC(h_t)为一个全连接层；m_t为h_t经过一个全连接层的中间向量；α_t为第t个单词的加权系数。Among them, h _t is the t-th column of the F ₂ feature, representing the vector of the t-th word; FC(h _t ) is a fully connected layer; m _t is the intermediate vector of h _t through a fully-connected layer; α _t is the th Weighting coefficients for t words.

本发明第二方面还提供了一种计算机可读存储介质，存储有根据语言描述的视频动作者分割的训练程序，所述程序被处理器执行时，使得处理器执行所述根据语言描述的视频动作者分割方法的步骤。A second aspect of the present invention also provides a computer-readable storage medium, which stores a training program for segmentation of video actors according to language description, and when the program is executed by a processor, causes the processor to execute the video according to language description. The steps of the actor segmentation method.

本发明中所述的根据语言描述的视频动作者分割方法可借助软件加必需的通用硬件平台的方式来实现，所述软件存储在计算机可读存储介质(包括ROM/RAM、磁碟、光盘)中，包括若干指令用以使得一台终端设备(可以是手机、计算机、服务器、网络设备等)执行本发明所述方法。The language-described video actor segmentation method described in the present invention can be implemented by means of software plus a necessary general hardware platform, and the software is stored in a computer-readable storage medium (including ROM/RAM, magnetic disk, optical disk) , including several instructions to make a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) execute the method of the present invention.

本发明第三方面还提供了一种计算机设备，包括存储器和处理器，所述存储器存储有根据语言描述的视频动作者分割的训练程序，所述程序被处理器执行时，使得处理器执行所述根据语言描述的视频动作者分割方法的步骤。A third aspect of the present invention also provides a computer device, comprising a memory and a processor, wherein the memory stores a training program for segmentation of video actors according to language description, and when the program is executed by the processor, the processor executes the Describe the steps of a video actor segmentation method according to language description.

本发明所具有的有益效果包括：The beneficial effects of the present invention include:

(1)本发明中提供的方法，利用剪辑级视觉特征粗略地关注语言查询的信息词，再利用帧级视觉特征来微调单词的注意力，微调目标帧中语言的权重，显着提高帧内匹配和分割的准确率。(1) The method provided in the present invention uses clip-level visual features to roughly pay attention to the information words of language queries, and then uses frame-level visual features to fine-tune the attention of words, fine-tune the weight of the language in the target frame, and significantly improve the intra-frame Matching and segmentation accuracy.

(2)本发明中提供的方法可以在丰富的视频信息中辨别分割出正例，并通过对比学习挖掘难负例，可以学习从视频中识别目标演员，还可以在不同的视频中区分它。(2) The method provided in the present invention can identify and segment positive examples in rich video information, and mine difficult negative examples through comparative learning, can learn to identify the target actor from the video, and can also distinguish it in different videos.

(3)利用本发明中的视觉特征编码器，提取剪辑级特征、帧级特征和正例、负例的视觉特征，得到句子查询特征q，通过对比学习区分正负例、正例和难负例，加强了级联跨模态注意力模块的判别能力。(3) Using the visual feature encoder in the present invention to extract clip-level features, frame-level features, and visual features of positive and negative examples to obtain sentence query features q, and distinguish positive and negative examples, positive examples, and difficult negative examples through comparative learning , which strengthens the discriminative ability of cascaded cross-modal attention modules.

附图说明Description of drawings

图1示出根据本发明一种根据语言描述的视频动作者分割方法的网络结构示意图；1 shows a schematic diagram of the network structure of a video actor segmentation method described by language according to the present invention;

图2示出本发明实施例1中一种根据语言描述的视频动作者分割方法在A2D数据集上应用的效果图。FIG. 2 shows an effect diagram of a method for segmentation of video actors according to language description applied to an A2D data set in Embodiment 1 of the present invention.

具体实施方式Detailed ways

下面通过附图和实施方式对本发明进一步详细说明。通过这些说明，本发明的特点和优点将变得更为清楚明确。其中，尽管在附图中示出了实施方式的各种方面，但是除非特别指出，不必按比例绘制附图。The present invention will be further described in detail below through the accompanying drawings and embodiments. The features and advantages of the present invention will become more apparent from these descriptions. Therein, although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise indicated.

本发明提供了一种根据语言描述的视频动作者分割方法，所述方法利用级联跨模态注意力模块进行，生成具有辨别力的句子查询特征，提高匹配和分割的准确性。The present invention provides a video actor segmentation method according to language description, which utilizes cascaded cross-modal attention modules to generate discriminative sentence query features and improves the accuracy of matching and segmentation.

本发明中，由于视频剪辑特征是包含所有信息的全局特征，因此，首先利用剪辑级视觉特征注意力单元粗略地关注语言查询的信息词，如反映运动和时间变化的词。然后，利用帧级特征注意力单元来微调单词的注意力，微调特定帧中句子的权重，例如，反应外观的属性，显著提高帧内匹配和分割的准确率。所述方法中，利用剪辑级特征注意力单元和帧级特征注意力单元学习基于给定视频的判别语言表示。In the present invention, since the video clip feature is a global feature containing all information, first, the clip-level visual feature attention unit is used to roughly focus on the informative words of the language query, such as words reflecting motion and temporal changes. Then, frame-level feature attention units are used to fine-tune word attention, fine-tuning the weights of sentences in specific frames, e.g., to reflect appearance properties, significantly improve the accuracy of intra-frame matching and segmentation. In the method, a clip-level feature attention unit and a frame-level feature attention unit are used to learn a discriminative language representation based on a given video.

所述级联跨模态注意力模块中，剪辑级特征注意力单元采用句子嵌入s和目标帧i的剪辑级特征v_c作为输入，得到粗加权的句子特征F₁。In the cascaded cross-modal attention module, the clip-level feature attention unit takes the sentence embedding s and the clip-level feature _vc of the target frame i as input, and obtains the coarsely weighted sentence feature F ₁ .

本发明中，所述剪辑级特征v_c利用双流I3D编码器进行提取，优选地，提取剪辑级特征前，利用ImageNet和Kinetics对双流I3D编码器进行预训练。In the present invention, the clip-level feature _vc is extracted by using a dual-stream I3D encoder. Preferably, ImageNet and Kinetics are used to pre-train the dual-stream I3D encoder before extracting the clip-level feature.

所述双流I3D编码器具体如文献“Carreira J，Zisserman A.Quo vadis，actionrecognition a new model and the kinetics dataset[C]//proceedings of the IEEEConference on Computer Vision and Pattern Recognition，2017：6299-6308.”中所述。The dual-stream I3D encoder is specifically described in the document "Carreira J, Zisserman A. Quo vadis, actionrecognition a new model and the kinetics dataset[C]//proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 6299-6308." described in.

所述ImageNet具体如文献“Russakovsky O，Deng J，Su H，et al.Imagenet largescale Visual recognition challenge[J].International journal of computervision，2015，115(3)：211-252.”中所述。The ImageNet is specifically described in the document "Russakovsky O, Deng J, Su H, et al. Imagenet largescale Visual recognition challenge [J]. International journal of computervision, 2015, 115(3): 211-252.".

所述Kinetics具体如文献“Kay W，Carreira J，Simonyan K，et al.The kineticshuman action video dataset[J].arXiV preprint arXiv：1705.06950，2017.”中所述。The Kinetics are specifically described in the document "Kay W, Carreira J, Simonyan K, et al. The kineticshuman action video dataset[J]. arXiV preprint arXiv: 1705.06950, 2017.".

其中，

为L₂范数(即指向量各元素的平方和再求平方根)，θ_avg为均值池化操作。I3D(·)为双流13D编码器，优选地，使用I3D网络的Mixed-4f层的输出作为I3D编码器。in,

is the L ₂ norm (that is, the squared sum of each element of the pointing vector and then the square root), and θ _avg is the mean pooling operation. I3D(·) is a dual-stream 13D encoder, preferably, the output of the Mixed-4f layer of the I3D network is used as the I3D encoder.

所述I3D网络的Mixed-4f层具体如文献“Carreira J，Zisserman A.Quo vadis，action recognition a new model and the kinetics dataset[C]//proceedings ofthe IEEE Conference on Computer Vision and Pattern Recognition，2017：6299-6308.”中所述。The Mixed-4f layer of the I3D network is specifically described in the document "Carreira J, Zisserman A. Quo vadis, action recognition a new model and the kinetics dataset [C]//proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 6299 -6308."

本发明中，先将目标帧i的视觉特征v_j送入两个卷积层(

和ψ(·))以分别生成两个新的视觉特征

和ψ(v_j)。同时，通过组合单词嵌入e_t形成句子嵌入s，e_t为第t个单词的嵌入特征。接着将其放入一个卷积层φ(·)中以生成句子特征φ(s)。其中，t∈[1，n]，且t为整数，n为单词总数；e_t具体如文献“MIKOLOV T，SUTSKEVER I，CHEN K，etal.Distributed representations of words and phrases and theircompositionality[C]//Advances in neural information processing systems，2013：3111-3119.”中所述。In the present invention, the visual feature vj of the target frame _i is first sent into two convolutional layers (

and ψ( )) to generate two new visual features respectively

and ψ(v _j ). At the same time, sentence embedding s is formed by combining word embeddings _t , which is the embedding feature of the _t -th word. It is then put into a convolutional layer φ( ) to generate sentence features φ(s). Among them, t∈[1,n], and t is an integer, and n is the total number of words; e _{t is} specifically as in the document "MIKOLOV T, SUTSKEVER I, CHEN K, etal. Distributed representations of words and phrases and their compositionality[C]// Advances in neural information processing systems, 2013: 3111-3119.".

所述

乘以φ(s)以生成逐词空间位置的注意图：said

Multiply by φ(s) to generate an attention map of word-wise spatial locations:

其中，T为矩阵转置，σ_softmax为softmax函数，Att为衡量空间位置对每个单词的注意力图。Among them, T is the matrix transpose, σ _softmax is the softmax function, and Att is the attention map that measures the spatial position of each word.

所述Att乘以ψ(v_j)，再加上φ(s)以得到视觉加权的句子特征：The Att is multiplied by ψ(v _j ), plus φ(s) to obtain visually weighted sentence features:

F＝Att·ψ(v_j)+φ(s)F=Att·ψ(v _j )+φ(s)

其中，F为视觉加权的句子特征。where F is the visually weighted sentence feature.

在本发明中，所述剪辑级特征注意力单元利用剪辑级特征v_c对语言特征进行粗略加权，分别得到：In the present invention, the clip-level feature attention unit uses the clip-level feature _vc to roughly weight the language features, and obtain respectively:

F₁＝Att₁·ψ(v_c)+φ(s)F ₁ =Att ₁ ·ψ( _vc )+φ(s)

其中，T为矩阵转置，σ_softmax是softmax函数，Att₁为剪辑特征v_c和句子嵌入s的注意力图，F₁为粗加权的句子特征。where T is the matrix transpose, σ _softmax is the softmax function, Att ₁ is the attention map of the clip feature _vc and sentence embedding s, and F ₁ is the coarsely weighted sentence feature.

本发明中，所述帧级特征注意力单元处理所述粗略加权的句子特征F₁和帧级特征v_f，得到微调的句子特征F₂。In the present invention, the frame-level feature attention unit processes the roughly weighted sentence feature F ₁ and the frame-level feature v _f to obtain a fine-tuned sentence feature F ₂ .

所述帧级特征v_f利用ResNet-101网络进行提取，优选地，提取帧级特征前，ResNet-101网络在COCO数据集上进行预训练，并在A2D数据集训练分割上进行微调。The frame-level features v _f are extracted by using the ResNet-101 network. Preferably, before extracting the frame-level features, the ResNet-101 network is pre-trained on the COCO dataset, and fine-tuned on the A2D dataset training segmentation.

所述ResNet-101网络具体如文献“He K，Zhang X，Ren S，et al.Deep residuallearning for image recognition[C]//Proceedings of the IEEE conference oncomputer vision and pattern recognition.2016：770-778.”中所述；所述COCO数据集具体如文献“Lin T Y，Maire M，Belongie S，et al.Microsoft coco：Common objects incontext[C]//European conference on computer vision.Springer，Cham，2014：740-755.”中所述；所述A2D数据集具体如文献“Xu C，Hsieh S H，Xiong C，et al.Can humansfly action understanding with multiple classes of actors[C]//Proceedings ofthe IEEE Conference on Computer Vision and Pattern Recognition.2015：2264-2273.”中所述。The ResNet-101 network is specifically described in the document "He K, Zhang X, Ren S, et al. Deep residual learning for image recognition [C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778." Described in; the COCO dataset is specifically described in the document "Lin T Y, Maire M, Belongie S, et al. Microsoft coco: Common objects incontext [C]//European conference on computer vision. Springer, Cham, 2014: 740- 755.”; the A2D dataset is specifically described in the document “Xu C, Hsieh S H, Xiong C, et al. Can humans fly action understanding with multiple classes of actors[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 2264-2273.".

本发明中，为了增强特征图，计算目标帧i附近2K帧个参考帧j之间的光流，即在目标帧i前向取K帧，后向取K帧，作为参考帧，并基于光流将附近2K帧的ResNet-101编码特征扭曲到目标帧上，其中，K为帧数，具体根据下式进行：In the present invention, in order to enhance the feature map, the optical flow between 2K frames and reference frames j near the target frame i is calculated, that is, K frames are taken in the forward direction of the target frame i, and K frames are taken in the backward direction, as reference frames, and based on the light The stream warps the ResNet-101 encoded features of the nearby 2K frames onto the target frame, where K is the number of frames, according to the following formula:

为双线性扭曲方程；v_j→i为从参考帧j到目标帧i的扭曲特征。上述操作具体如文献“ZHU X，WANG Y，DAI J，et al.Flow-guided feature aggregation for Video objectdetection[C]//Proceedings of the IEEE International Conference on ComputerVision.2017：408-417.”中所述。Among them, v _j is the ResNet-101 encoding feature of reference frame j; OF _j→i is the optical flow between reference frame j and target frame i;

is the bilinear warping equation; v _j→i is the warping feature from the reference frame j to the target frame i. The above operation is specifically described in the document "ZHU X, WANG Y, DAI J, et al.Flow-guided feature aggregation for Video objectdetection[C]//Proceedings of the IEEE International Conference on ComputerVision.2017:408-417." .

其中，v_i为对目标帧i的ResNet-101编码特征，β是权重系数，i为目标帧，j为参考帧，2K为参考帧帧数，即在目标帧前向取K帧，后向取K帧，对目标帧特征进行补偿。通过这种方式，帧级特征v_f被附近的帧级特征增强。Among them, vi is the ResNet-101 encoding feature of the target frame _i , β is the weight coefficient, i is the target frame, j is the reference frame, 2K is the number of reference frames, that is, take K frames in the forward direction of the target frame, and backward Take K frames to compensate the target frame features. In this way, the frame-level features _vf are enhanced by nearby frame-level features.

本发明中，将增强后的帧级特征v_f直接放入区域提议网络(RPN)中以生成提议。由于目标帧已被其附近帧补偿，因此可以更容易地检测到移动物体。优选地，所述区域提议网络(RPN)具体如文献“He K，Gkioxari G，Dollár P，et al.Mask r-cnn[C]//Proceedingsof the IEEE international conference on computer vision.2017：2961-2969.”中所述述，优选地，所述区域提议网络(RPN)在COCO数据集上进行预训练，在A2D数据集训练分支割上进行微调得到。In the present invention, the enhanced frame-level features _vf are directly put into the Region Proposal Network (RPN) to generate proposals. Since the target frame has been compensated by its nearby frames, moving objects can be detected more easily. Preferably, the Regional Proposal Network (RPN) is as detailed in the document "He K, Gkioxari G, Dollár P, et al. Mask r-cnn [C]//Proceedings of the IEEE international conference on computer vision. 2017: 2961-2969 .", preferably, the Region Proposal Network (RPN) is pre-trained on the COCO dataset and fine-tuned on the A2D dataset training branch cut.

F₂＝Att₂·ψ′(v_f)+F₁ F ₂ =Att ₂ ·ψ′(v _f )+F ₁

其中，Att₂为帧级特征v_f和粗加权的句子特征F₁的注意力图；F₂表示微调后的句子特征；

为v_f经过一个线性层

得到的特征；ψ′(v_f)为v_f经过一个线性层ψ′(·)得到的特征。Among them, Att ₂ is the attention map of the frame-level feature v _f and the coarsely weighted sentence feature F ₁ ; F ₂ represents the fine-tuned sentence feature;

go through a linear layer for v _f

m_t＝FC(h_t)m _t =FC(h _t )

α_t＝σ_softmax(m_t)α _t =σ _softmax (m _t )

其中，h_t为F₂特征的第t列，表示第t个单词的向量；FC(h_t)为一个全连接层，其具体如文献“Krizhevsky A，Sutskever I，Hinton G E.ImageNet Classification with DeepConvolutional Neural Networks[C]//Advances in Neural Information ProcessingSystems.2012.”中所述；m_t为h_t经过一个全连接层的中间向量；α_t为第t个单词的加权系数；e_t为第t个单词的嵌入特征，具体如文献“MIKOLOV T，SUTSKEVER I，CHEN K，etal.Distributed representations of words and phrases and theircompositionality[C]//Advances in neural information processing systems，2013：3111-3119.”中所述。Among them, h _t is the t-th column of the F ₂ feature, representing the vector of the t-th word; FC(h _t ) is a fully connected layer, which is as detailed in the document "Krizhevsky A, Sutskever I, Hinton G E. ImageNet Classification with Described in DeepConvolutional Neural Networks[C]//Advances in Neural Information ProcessingSystems.2012."; m _t is the intermediate vector of h _t after a fully connected layer; α _t is the weighting coefficient of the t-th word; e _t is the th-th word Embedding features of t words, as in the document "MIKOLOV T, SUTSKEVER I, CHEN K, etal. Distributed representations of words and phrases and their compositionality[C]//Advances in neural information processing systems, 2013: 3111-3119." said.

在本发明的一种优选实施方式中，在训练时，所述根据语言描述的视频动作者分割方法还包括利用句子查询特征q进行对比学习，用以区分正例和难负例。所述正例为目标帧上，与真值IoU(交并比)大于0.5的且IoU值最大的区域提议。所述难负例为目标帧上，与真值的IoU小于0.5的区域提议。所述IoU具体如文献“He K，Gkioxari G，Dollár P，etal.Mask r-cnn[C]//Proceedings of the IEEE international conference oncomputer vision.2017：2961-2969.”中所述In a preferred embodiment of the present invention, during training, the method for segmenting video actors based on language description further includes using sentence query feature q to perform comparative learning to distinguish positive examples from difficult negative examples. The positive example is the region proposal with the highest IoU value on the target frame, and the true value IoU (intersection-union ratio) is greater than 0.5. The hard negative example is the region proposal on the target frame with an IoU less than 0.5 with the ground truth. The IoU is specifically described in the document "He K, Gkioxari G, Dollár P, etal.Mask r-cnn[C]//Proceedings of the IEEE international conference oncomputer vision.2017:2961-2969."

注意到仅利用同目标帧内的难负例不足以学习句子查询特征q从视频区域提议中匹配到想要结果的能力，本发明中设计了难负例挖掘策略，分为两部分。第一部分，从包含目标帧的视频内挖掘难负例：从视频内的其他关键帧上，找到与当前帧真值不同的区域提议(即IoU小于0.5的区域提议)做为难负例。第二部分，从不同的视频上挖掘难负例：当难负例不足时，不断地找到包含同样动作者-动作标签的视频，从这些视频的关键帧上找到与该动作者-动作标签不同的区域提议作为难负例。测试时，从目标帧上的区域提议中，找到与得到的句子查询特征q相似度得分最高的区域提议即为匹配的结果。Noting that only using hard negative examples in the same target frame is not enough to learn the ability of sentence query feature q to match the desired result from the video region proposal, a hard negative example mining strategy is designed in the present invention, which is divided into two parts. The first part, mining hard negative examples from the video containing the target frame: from other key frames in the video, find the region proposals that are different from the ground truth of the current frame (that is, the region proposals with IoU less than 0.5) as hard negative examples. The second part, mining hard negative examples from different videos: when the hard negative examples are insufficient, continuously find videos that contain the same actor-action label, and find the key frames of these videos that are different from the actor-action label The region proposed as a hard negative example. During testing, from the region proposals on the target frame, finding the region proposal with the highest similarity score with the obtained sentence query feature q is the matching result.

本发明中，需要从包含丰富信息的视频中分割出能够与句子查询特征q相匹配的动作者及其动作，因此，所述方法中利用句子查询特征q进行对比学习，挖掘难负例，来加强q和区域提议之间的匹配能力。In the present invention, it is necessary to segment the actors and their actions that can match the sentence query feature q from the video containing rich information. Therefore, in the method, the sentence query feature q is used for comparative learning, and the difficult negative examples are mined to obtain Strengthen the matching ability between q and region proposals.

本发明中，对于句子嵌入s经过剪辑级特征注意力单元和帧级特征注意力单元处理后，得到微调的句子特征F₂后，再与词嵌入加权得到句子查询特征q。所述r⁺为正例的视觉特征，r_l ^-为负例的视觉特征，其中，l＝1，2，...，L，其中L为负例的个数，L为大于等于1的整数，优选为5-50的整数，更优选为15-35，如为25。句子查询特征q的损失函数为：In the present invention, after the sentence embedding s is processed by the clip-level feature attention unit and the frame-level feature attention unit, the fine _- tuned sentence feature F2 is obtained, and then the sentence query feature q is obtained by weighting with the word embedding. The r ⁺ is the visual feature of the positive example, and r _l ^- is the visual feature of the negative example, wherein, l=1, 2,...,L, where L is the number of negative examples, and L is greater than or equal to 1 Integer, preferably an integer of 5-50, more preferably 15-35, such as 25. The loss function of the sentence query feature q is:

其中，T为矩阵转置。where T is the matrix transpose.

其中，损失函数Loss利用了(L+1)元组来优化从L个负例中识别出正例，减轻了网络收敛到局部最优的可能。Among them, the loss function Loss utilizes (L+1) tuples to optimize the identification of positive examples from L negative examples, reducing the possibility of the network converging to a local optimum.

所述正例的视觉特征r⁺或负例的视觉特征r_l ^-利用RPN网络进行提取，即区域提议网络，生成区域提议，其骨干网络为ResNet-101网络。优选地，提取特征前，所述ResNet-101网络和区域提议网络(RPN)在COCO数据集上进行预训练，并在A2D数据集训练分支上进行微调。所述区域提议具有目标特征o和位置特征l。The visual feature r ⁺ of the positive example or the visual feature _r l- ^of the negative example is extracted by using the RPN network, that is, the region proposal network, to generate region proposals, and its backbone network is the ResNet-101 network. Preferably, before extracting features, the ResNet-101 network and the region proposal network (RPN) are pre-trained on the COCO dataset and fine-tuned on the A2D dataset training branch. The region proposal has object feature o and location feature l.

所述目标特征o为经过RPN网络ROI-Aligned模块后的特征，优选地，与句子查询特征q的维度一致，如均为1024维的向量。所述ROI-Aligned模块具体如“Kaiming He，GeorgiaGkioxari，Piotr Dollár，Ross Girshick.Mask r-cnn[C]//Proceedings of the IEEEinternational conference on computer vision.2017：2961-2969.”中所述。The target feature o is the feature after passing through the ROI-Aligned module of the RPN network, preferably, the dimension of the sentence query feature q is consistent, such as a 1024-dimensional vector. The ROI-Aligned module is specifically described in "Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. Mask r-cnn[C]//Proceedings of the IEEE international conference on computer vision. 2017: 2961-2969.".

所述位置特征l为：The location feature l is:

其中，(x_tl，y_tl)和(x_br，y_br)分别是区域提议的左上角和右下角的点的坐标，而w，h，W，H分别是区域提议的宽度、高度、目标帧的宽度和高度。where (x _tl , y _tl ) and (x _br , y _br ) are the coordinates of the upper-left and lower-right corners of the region proposal, respectively, and w, h, W, and H are the width, height, and target of the region proposal, respectively The width and height of the frame.

所述正例的视觉特征r⁺或负例的视觉特征r_l ^-通过将位置特征l连接到目标特征o，然后将连接后的特征传递给全连接层以获得与文本表示具有相同维度的正例的视觉特征r⁺或负例的视觉特征r_l ^-，具体如下式所示：The visual feature r of the positive example ⁺ or the visual feature r _l of the negative example ^- by concatenating the location feature l to the target feature o, and then passing the concatenated feature to a fully connected layer to obtain a positive image with the same dimension as the text representation The visual feature r ⁺ of the example or the visual feature r _l ^- of the negative example, as shown in the following formula:

r＝σ_tanh(W([o；l]))r=σ _tanh (W([o;l]))

其中，r为正例的视觉特征r⁺或负例的视觉特征r_l ^-，为C维目标级特征，[；]表示连接操作，W为要学习的参数矩阵，σ_tanh为tanh激活函数。Among them, r is the visual feature r ⁺ of the positive example or the visual feature r _l ^- of the negative example, which is the C-dimensional target-level feature, [;] represents the connection operation, W is the parameter matrix to be learned, and σ _tanh is the tanh activation function.

所述tanh激活函数具体如文献“LeCun Y，Bottou L，Bengio Y，et al.Gradient-based learning applied to document recognition[J].Proceedings of the IEEE，1998，86(11)：2278-2324.”中所述。The tanh activation function is specifically described in the document "LeCun Y, Bottou L, Bengio Y, et al.Gradient-based learning applied to document recognition[J].Proceedings of the IEEE, 1998, 86(11): 2278-2324." described in.

本发明中，经过句子查询特征q和正/难负例进行对比学习后，得到训练好的网络后，选择与文本特征相似度最高的区域提议，并将所选区域提议的视觉特征r⁺输入到分割分支中以生成结果掩码。本发明中的分割分支采用的是mask r-cnn的分割分支，其具体如文献“He K，Gkioxari G，Dollár P，et al.Mask r-cnn[C]//Proceedings of the IEEEinternational conference on computer vision.2017：2961-2969.”中所述，生成的掩码包含整个实体并有一个很好的边缘。本发明中的根据语言描述分割视频动作者及其动作的方法的网络结构示意图如图1所示。In the present invention, after comparing and learning the sentence query feature q and positive/difficult negative examples, after obtaining a trained network, select the region proposal with the highest similarity with the text feature, and input the visual feature r ⁺ of the selected region proposal into the in the split branch to generate the resulting mask. The segmentation branch in the present invention adopts the segmentation branch of mask r-cnn, which is specifically as in the document "He K, Gkioxari G, Dollár P, et al.Mask r-cnn [C]//Proceedings of the IEEE international conference on computer Vision.2017:2961-2969.", the generated mask contains the entire entity and has a nice edge. A schematic diagram of the network structure of the method for segmenting video actors and their actions according to language descriptions in the present invention is shown in FIG. 1 .

本发明中提供的根据语言描述的视频动作者分割方法，首先利用剪辑级视觉特征粗略地关注语言查询的信息词，然后利用帧级视觉特征来微调单词的注意力，显着提高帧内匹配和分割的准确率。并且，该分割方法中，可以在包含内容丰富的视频中识别出正例，通过难负例挖掘策略并引入了对比学习，它不仅可以学习从视频中识别目标演员，还可以在不同的视频中区分它。这样，使其跨模型检索具有更强的判别能力。此外，采用专门的分割网络来分割在匹配中得到的区域，这使得生成的掩码具有更好的完整性和更好的边缘。The video actor segmentation method according to the language description provided in the present invention firstly uses the clip-level visual features to roughly pay attention to the informative words of the language query, and then uses the frame-level visual features to fine-tune the attention of the words, which significantly improves the intra-frame matching and segmentation accuracy. Moreover, in this segmentation method, positive examples can be identified in videos that contain rich content, through the hard negative example mining strategy and the introduction of contrastive learning, it can not only learn to identify target actors from videos, but also in different videos. differentiate it. In this way, it has stronger discriminative ability for cross-model retrieval. Furthermore, a dedicated segmentation network is employed to segment the regions obtained in the matching, which makes the generated masks with better completeness and better edges.

实施例Example

以下通过具体实例进一步描述本发明，不过这些实例仅仅是范例性的，并不对本发明的保护范围构成任何限制。The present invention is further described below through specific examples, but these examples are only exemplary and do not constitute any limitation to the protection scope of the present invention.

实施例1Example 1

(1)使用I3D网络在数据集ImageNet和数据集Kinetics上进行训练，利用得到的I3D网络的Mixed-4f层的输出作为I3D编码器。(1) Use the I3D network to train on the dataset ImageNet and the dataset Kinetics, and use the output of the Mixed-4f layer of the obtained I3D network as the I3D encoder.

所述数据集ImageNet具体如文献“Olga Russakovsky,Jia Deng,Hao Su,Jonathan Krause,Sanjeev Satheesh,Sean Ma,Zhiheng Huang,Andrej Karpathy,AdityaKhosla,Michael Bernstein,et al.2015.Imagenet large scale visual recognitionchallenge.International journal of computer vision 115,3(2015),211–252”中所述。The dataset ImageNet is specifically described in the document "Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, AdityaKhosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. International Journal of computer vision 115, 3(2015), 211–252”.

所述数据集Kinetics具体如文献“Will Kay,Joao Carreira,Karen Simonyan,Brian Zhang,Chloe Hillier,Sudheendra Vijayanarasimhan,Fabio Viola,Tim Green,Trevor Back,Paul Natsev,et al.2017.The kinetics human action videodataset.arXiv preprint arXiv:1705.06950(2017)”中所述。The data set Kinetics is specifically described in the document "Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. 2017. The kinetics human action videodataset. arXiv preprint arXiv:1705.06950 (2017)”.

给定视频V(具体如图1或图2中的视频帧所示)，剪辑特征v_c由下式编码得到：Given a video V (specifically as shown in the video frame in Figure 1 or Figure 2), the clip feature v _c is encoded by the following formula:

其中，

表示L₂范数，θ_avg表示均值池化操作；v_c表示到得的剪辑特征，其为32×32×832的矩阵。in,

represents the L ₂ norm, θ _avg represents the mean pooling operation; _vc represents the resulting clipping feature, which is a 32×32×832 matrix.

(2)ResNet-101网络在COCO数据集上进行了预训练后，在A2D数据集训练分割上进行微调。所述ResNet-101网络具体如文献“He K，Zhang X，Ren S，et al.Deep residuallearning for image recognition[C]//Proceedings of the IEEE conference oncomputer vision and pattern recognition.2016：770-778.”中所述；所述COCO数据集具体如文献“Lin T Y，Maire M，Belongie S，et al.Microsoft coco：Common objects incontext[C]//European conference on computer vision.Springer，Cham，2014：740-755.”中所述；所述A2D数据集具体如文献“Xu C，Hsieh S H，Xiong C，et al.Can humansfly action understanding with multiple classes of actors[C]//Proceedings ofthe IEEE Conference on Computer Vision and Pattern Recognition.2015：2264-2273.”中所述。(2) The ResNet-101 network is fine-tuned on the A2D dataset training split after pre-training on the COCO dataset. The ResNet-101 network is specifically described in the document "He K, Zhang X, Ren S, et al. Deep residual learning for image recognition [C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778." Described in; the COCO dataset is specifically described in the document "Lin T Y, Maire M, Belongie S, et al. Microsoft coco: Common objects incontext [C]//European conference on computer vision. Springer, Cham, 2014: 740- 755.”; the A2D dataset is specifically described in the document “Xu C, Hsieh S H, Xiong C, et al. Can humans fly action understanding with multiple classes of actors[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 2264-2273.".

计算目标帧i附近14个参考帧j之间的光流(目标帧i前向7帧和后向7帧为参考帧j)，并基于光流将上述附近14帧参考帧j的ResNet-101编码特征扭曲到目标帧上，具体通过下式完成：Calculate the optical flow between 14 reference frames j near the target frame i (the forward 7 frames and the backward 7 frames of the target frame i are the reference frame j), and based on the optical flow, the ResNet-101 of the above 14 nearby reference frames j is calculated. The encoded features are warped onto the target frame by the following formula:

其中，v_j是对参考帧j的ResNet-101编码特征，OF_j→i为参考帧j和目标帧i之间的光流，

表示双线性扭曲方程，v_j→i表示从帧j到帧i扭曲的特征，具体操作如文献“ZHU X，WANG Y，DAI J，et al.Flow-guided feature aggregation for video object detection[C]//Proceedings of the IEEE International Conference on ComputerVision.2017：408-417.”中所述。where v _j is the ResNet-101 encoding feature for reference frame j, OF _j→i is the optical flow between reference frame j and target frame i,

Represents the bilinear warping equation, v _j→i represents the feature of warping from frame j to frame i, the specific operation is described in the literature "ZHU X, WANG Y, DAI J, et al.Flow-guided feature aggregation for video object detection[C ]//Proceedings of the IEEE International Conference on ComputerVision. 2017:408-417.”.

帧级特征v_f为从帧j到帧i的扭曲特征和原始特征v_i的线性加权组合，具体如下式所示，其中，v_i为对目标帧i的ResNet-101编码特征：The frame-level feature v _f is the linear weighted combination of the warped feature from frame j to frame i and the original feature v _i , as shown in the following formula, where v _i is the ResNet-101 encoding feature for the target frame i:

其中，β是权重系数为0.1，K设定为7。参考帧j为目标帧i的前面7帧和后面7帧，视频采样帧率为24fps。通过从帧j到帧i的扭曲特征和原始特征v_i的线性加权组合，使得帧级特征v_f被附近的帧级特征增强。然后将增强后的特征直接放入区域提议网络(RPN)中以生成提议，所述区域提议网络(RPN)具体如文献“Girshick R.Fast r-cnn[C]//Proceedingsof the IEEE international conference on computer vision.2015：1440-1448.”中所述，所述区域提议网络(RPN)在COCO数据集上进行预训练，在A2D数据集训练分支割上进行微调得到。由于目标帧i已被其附近的帧补偿，因此可以更容易地检测到移动物体。where β is a weight coefficient of 0.1 and K is set to 7. The reference frame j is the first 7 frames and the last 7 frames of the target frame i, and the video sampling frame rate is 24fps. The frame-level feature vf is enhanced by nearby frame-level features by a linearly weighted combination of the warped features from frame j to frame _i and the original feature _vi . The enhanced features are then directly put into a Region Proposal Network (RPN) to generate proposals, which is as detailed in the document "Girshick R. Fast r-cnn[C]//Proceedings of the IEEE international conference on As described in computer vision. 2015: 1440-1448.", the region proposal network (RPN) is pre-trained on the COCO dataset and fine-tuned on the A2D dataset training branch cut. Since the target frame i has been compensated by its nearby frames, moving objects can be detected more easily.

其中，v_f为16×16×1024维矩阵。Among them, v _f is a 16×16×1024-dimensional matrix.

(3)利用剪辑级特征注意力单元将剪辑级特征v_c对语言特征进行粗略加权，将目标帧i的剪辑级特征v_c送入卷积层

知ψ(·)，分别生成两个新的视觉特征

和ψ(v_c)。(3) Use the clip-level feature attention unit to roughly weight the clip-level feature v _c to the language features, and send the clip-level feature v _c of the target frame i to the convolutional layer

Knowing ψ( ), two new visual features are generated respectively

and ψ( _vc ).

所述卷积层

为1024维的1×1的卷积。所述卷积层ψ(·)为1024维的1×1的卷积。The convolutional layer

is a 1×1 convolution of 1024 dimensions. The convolution layer ψ(·) is a 1×1 convolution of 1024 dimensions.

通过组合单词嵌入e_t形成句子嵌入s，接着将其放入一个1024维的全连接层中以生成句子特征φ(s)。其中，t∈[1，n]，且t为整数，n为单词总数。e_t为第t个单词的嵌入特征，是一个300维的向量，由Google News数据集训练的word2vec模型得到，具体如文献“Mikolov T，Sutskever I，Chen K，et al.Distributed representations of words andphrases and their compositionality[C].//In Advances in neural informationprocessing systems 2013.3111-3119.”中所述得到；所述Google News数据集具体如文献“A.S.Das，M.Datar，A.Garg，and S.Rajaram，“Google news personalization：scalableonline collaborative filtering，”in Proceedings of the 16th internationalconference on World Wide Web.ACM，2007，pp.271-280.”中所述。φ(·)是一个全连接层，其作用为将300维的信息变换到1024维，与视觉特征

和ψ(v_c)维度保持一致。Sentence embeddings s are formed by combining word embeddings e _t , which are then put into a 1024-dimensional fully connected layer to generate sentence features φ(s). where t∈[1,n], and t is an integer, and n is the total number of words. e _t is the embedding feature of the t-th word, which is a 300-dimensional vector obtained by the word2vec model trained on the Google News dataset, as shown in the document "Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words andphrases and their compositionality[C].//In Advances in neural information processing systems 2013.3111-3119.”; the Google News dataset is specifically as described in the document “AS Das, M. Datar, A. Garg, and S. Rajaram, "Google news personalization: scalable online collaborative filtering," in Proceedings of the 16th internationalconference on World Wide Web. ACM, 2007, pp. 271-280." φ( ) is a fully connected layer whose role is to transform the 300-dimensional information to 1024-dimensional, and the visual features

and ψ( _vc ) dimension is consistent.

剪辑级特征注意力单元利用剪辑级特征v_c对语言特征进行粗略加权，分别得到：The clip-level feature attention unit uses the clip-level feature _vc to roughly weight the language features, and obtain:

F₁＝Att₁·ψ(v_c)+φ(s)F ₁ =Att ₁ ·ψ( _vc )+φ(s)

其中，σ_softmax是softmax激活函数，Att₁为剪辑特征v_c和句子嵌入s的注意力图，其维度为20×1024，其中每一个元素(a，b)表示第b个位置的视觉特征对第a个单词的影响，F₁为粗加权的句子特征，其维度为20×1024，表示经过全局特征加权后的单词的特征。所述softmax激活函数具体如文献“LeCun Y，Bottou L，Bengio Y，et al.Gradient-basedlearning applied to document recognition[J].Proceedings of the IEEE，1998，86(11)：2278-2324.”中所述。Among them, σ _softmax is the softmax activation function, Att ₁ is the attention map of clip feature v _c and sentence embedding s, and its dimension is 20×1024, where each element (a, b) represents the visual feature of the bth position. The influence of a word, F ₁ is the coarsely weighted sentence feature, and its dimension is 20×1024, which represents the feature of the word after the global feature weighting. The softmax activation function is specifically described in the document "LeCun Y, Bottou L, Bengio Y, et al.Gradient-basedlearning applied to document recognition[J].Proceedings of the IEEE, 1998, 86(11): 2278-2324." said.

将上述粗略加权的句子特征F₁和帧级特征v_f输入到帧级特征注意力单元，得到微调的句子特征F₂：Input the above roughly weighted sentence feature F ₁ and frame-level feature v _f to the frame-level feature attention unit to obtain the fine-tuned sentence feature F ₂ :

F₂＝Att₂·ψ′(v_f)+F₁ F ₂ =Att ₂ ·ψ′(v _f )+F ₁

其中，Att₂为帧级特征v_f和粗加权的句子特征F₁的注意力图，F₂表示微调的句子特征。

为1024维的1×1的卷积；ψ′(v_f)为1024维的1×1的卷积。Among them, Att ₂ is the attention map of the frame-level feature v _f and the coarsely weighted sentence feature F ₁ , and F ₂ represents the fine-tuned sentence feature.

is a 1×1 convolution of 1024 dimensions; ψ′(v _f ) is a 1×1 convolution of 1024 dimensions.

(4)将所述微调的句子特征F₂的加权单词h_t嵌入，得到句子查询特征q：(4) Embed the weighted word h _t of the fine-tuned sentence feature F ₂ to obtain the sentence query feature q:

m_t＝FC(h_t)m _t =FC(h _t )

α_t＝σ_softmax(m_t)α _t =σ _softmax (m _t )

(5)针对视频V，句子嵌入s，经上述步骤(1)-(4)得到句子查询特征q(如步骤4中所示)，其中1个正例视觉特征为r⁺，25个负例视觉特征为r_l ^-，其中，l＝1，2，...，25。(5) For the video V, the sentence embedding s, the sentence query feature q (as shown in step 4) is obtained through the above steps (1)-(4), of which 1 positive example visual feature is r ⁺ , 25 negative examples The visual features are r _l ⁻ , where l=1, 2, . . . , 25.

损失函数为：The loss function is:

其中，T为矩阵转置。where T is the matrix transpose.

(6)所述正例的视觉特征r⁺或负例的视觉特征r_l ^-利用ResNet-101网络(所述ResNet-101网络具体如文献“He K，Zhang X，Ren S，et al.Deep residual learning forimage recognition[C].//In Proceedings of the IEEE conference on computervision and pattern recognition.2016，770-778.”中所述)进行提取。提取特征前，该网络在COCO数据集(所述COCO数据集具体如文献“Lin T，Maire M，Belongie S，etal.Microsoft coco：Common objects in context[C].//In European conference oncomputer vision.2014.740-755.”中所述)上进行预训练，并在A2D数据集(所述A2D数据集具体如文献“Xu C，Hsieh S，Xiong，C，etal.Can humans fly action understanding withmultiple classes of actors[C].//In Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition.2015.2264-2273.”中所述)训练分割上进行微调，生成区域提议。区域提议的特征o为微调后的ResNet-01输出的该区域的目标特征，特征l为位置特征。(6) The visual feature r ⁺ of the positive example or the visual feature r _l of the negative example ^— uses the ResNet-101 network (the ResNet-101 network is specifically described in the document “He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C].//In Proceedings of the IEEE conference on computervision and pattern recognition. 2016, 770-778.”) for extraction. Before extracting features, the network is in the COCO dataset (specifically, the COCO dataset is described in the document "Lin T, Maire M, Belongie S, et al. Microsoft coco: Common objects in context[C].//In European conference on computer vision. 2014.740-755.”) and pre-trained on the A2D dataset (the A2D dataset is specifically described in the document “Xu C, Hsieh S, Xiong, C, et al. Can humans fly action understanding with multiple classes of actors [C].//In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.2264-2273.”) fine-tune on training segmentation to generate region proposals. The feature o of the region proposal is the target feature of the region output by the fine-tuned ResNet-01, and the feature l is the position feature.

所述目标特征o为经过RPN网络ROI-Aligned模块后的特征，具体为1024维的向量(与句子查询特征q的维度一致)。The target feature o is the feature after passing through the ROI-Aligned module of the RPN network, and is specifically a 1024-dimensional vector (consistent with the dimension of the sentence query feature q).

所述位置特征l为：The location feature l is:

所述正例的视觉特征r⁺或负例的视觉特征r_l ^-是通过将位置特征l连接到目标特征o，然后将连接后的特征传递给全连接层以获得与文本表示具有相同维度的正例的视觉特征r⁺或负例的视觉特征r_l ^-，具体如下式所示：The visual feature r ⁺ of the positive example or the visual feature r _l ⁻ of the negative example is obtained by concatenating the location feature l to the target feature o, and then passing the concatenated feature to a fully connected layer to obtain a The visual feature r ⁺ of the positive example or the visual feature r _l ^- of the negative example, as shown in the following formula:

r＝σ_tanh(W([o；l]))r=σ _tanh (W([o;l]))

其中，r为正例的视觉特征r⁺或负例的视觉特征r_l ^-，为C维目标级特征，[；]表示连接操作，σ_tanh为tanh激活函数。Among them, r is the visual feature r ⁺ of the positive example or the visual feature r _l ^- of the negative example, which is the C-dimensional target-level feature, [;] represents the connection operation, and σ _tanh is the tanh activation function.

使正例的视觉特征r⁺或负例的视觉特征r_l ^-与句子查询特征q进行对比学习，能够与q匹配的即为所需分割的正例，得到与文本特征相似度最高的区域提议，并将所选区域提议的视觉特征r⁺输入到mask r-cnn的分割分支中以生成结果掩码，得到分割后的最终结果。The visual feature r ⁺ of the positive example or the visual feature r _l ^- of the negative example is compared with the sentence query feature q, and the positive example that can match q is the positive example to be segmented, and the region proposal with the highest similarity to the text feature is obtained. , and input the visual feature r ⁺ proposed by the selected region into the segmentation branch of mask r-cnn to generate the resulting mask, and get the final result after segmentation.

图2中为实施例1中方法在A2D数据集上的效果图，其中第一行中的文字为输入的句子，第二行中的图片序列为输入的视频，第三行为本实施例得到的分割结果，第四行为数据集给出的真实的分割图。从图2中可以看出，利用本发明中提供的根据语言描述分割的视频动作者分割方法能够从视频帧中完整的分割出目标动作者，分割结果与数据集给出的真实的分割图接近度很高。Figure 2 is the effect diagram of the method in Embodiment 1 on the A2D data set, in which the text in the first row is the input sentence, the picture sequence in the second row is the input video, and the third row is obtained in this embodiment. The segmentation result, the real segmentation map given by the fourth row dataset. As can be seen from Fig. 2, the target actor can be completely segmented from the video frame by using the video actor segmentation method according to the language description provided in the present invention, and the segmentation result is close to the real segmentation map given by the data set degree is very high.

以上结合具体实施方式和/或范例性实例以及附图对本发明进行了详细说明，不过这些说明并不能理解为对本发明的限制。本领域技术人员理解，在不偏离本发明精神和范围的情况下，可以对本发明技术方案及其实施方式进行多种等价替换、修饰或改进，这些均落入本发明的范围内。本发明的保护范围以所附权利要求为准。The present invention has been described in detail above with reference to specific embodiments and/or exemplary examples and accompanying drawings, but these descriptions should not be construed as limiting the present invention. Those skilled in the art understand that, without departing from the spirit and scope of the present invention, various equivalent replacements, modifications or improvements can be made to the technical solutions of the present invention and the embodiments thereof, which all fall within the scope of the present invention. The scope of protection of the present invention is determined by the appended claims.

Claims

1. a video actor segmentation method according to language description, is characterized in that, described method utilizes cascading cross-modal attention module to carry out, and described cascading cross-modal attention module comprises clip-level feature attention unit and Frame-level feature attention unit;

The clip-level feature attention unit uses the clip-level feature vc to roughly weight the language features to obtain:

F ₁ =Att ₁ ·ψ( _vc )+φ(s)

Among them, T is the matrix transpose; σ _softmax is the softmax activation function; Att ₁ is the attention map of the clip feature _vc and sentence embedding s; F ₁ is the _coarsely weighted sentence feature;

and ψ( ) after processing to get

and ψ(v _c ); combine word embeddings et _t to form sentence embedding s, which are then put into a convolutional layer φ( ) to generate sentence features φ(s); et is the embedding feature of the t-th word ;

The frame-level feature attention unit processes the roughly weighted sentence features F ₁ and frame-level features v _f to obtain fine-tuned sentence features F ₂ :

F ₂ =Att ₂ ·ψ′(v _f )+F ₁

Among them, Att ₂ is the attention map of the frame-level feature v _f and the coarsely weighted sentence feature F ₁ ; F ₂ represents the fine-tuned sentence feature;

go through a linear layer for v _f

2 . The method of claim 1 , wherein the clip-level feature attention unit uses sentence embedding s and clip-level features _vc of target frame i as input. 3 .

3. method according to claim 1, is characterized in that, for video V, clip-level feature v _c is encoded by following formula:

in,

represents the L ₂ norm, θ _avg is the mean pooling operation, and I3D( ) is the dual-stream I3D encoder.

4. The method of claim 1, wherein

The frame-level features v _f are extracted using the ResNet-101 network.

5. The method according to claim 4, wherein, before extracting frame-level features, the ResNet-101 network is pre-trained on the COCO data set, and fine-tuned on the A2D data set training segmentation;

The frame-level feature is a linear weighted combination of the warped feature from frame j to frame i and the original feature v _i , as shown in the following formula:

Among them, vi is the ResNet-101 encoding feature of the target frame _i , β is the weight coefficient, i is the target frame, j is the reference frame, 2K is the number of frames of the reference frame, K frames are taken in the forward direction of the target frame, and the backward direction is Take K frames to compensate the target frame features, v _j→i is the distortion feature from the reference frame j to the target frame i;

The _vj→i is:

Among them, v _j is the ResNet-101 encoding feature of reference frame j; OF _j→i is the optical flow between reference frame j and target frame i;

is the bilinear distortion equation.

6. The method according to claim 1, wherein the weighted word feature _ht passes through a fully connected layer to obtain the sentence query feature q:

m _t =FC(h _t )

α _t =σ _softmax (m _t )

Among them, h _t is the t-th column of the F ₂ feature, representing the vector of the t-th word; FC(h _t ) is a fully connected layer; m _t is the vector obtained by h _t after passing through a fully connected layer; α _t is Weighting factor for the t-th word.

7. The method according to one of claims 1 to 6, wherein the method further comprises using sentence query feature q to perform comparative learning to distinguish positive examples from difficult negative examples, wherein the positive examples are on the target frame. , and the region proposal with the true value IoU greater than 0.5 and the largest IoU value, the hard negative example is the region proposal on the target frame, and the true value IoU is less than 0.5.

8. The method according to one of claims 1 to 6, characterized in that,

In the method, the loss function of the sentence query feature q is:

Among them, T is the matrix transpose, the r ⁺ is the visual feature of the positive example, r _l ^- is the visual feature of the negative example, where l=1, 2,...,L, where L is the number of negative examples number;

The visual feature r ⁺ of the positive example or the visual feature r _l ^- of the negative example is:

r=σ _tanh (W([o;l])),

r is the visual feature r ⁺ of the positive example or the visual feature r _l ^- of the negative example, the r is the C-dimensional target-level feature, [;] represents the connection operation, W is the parameter matrix to be learned, and σ _tanh is the tanh activation function ;

The visual feature r ⁺ of the positive example or the visual feature r _l - of the negative example is extracted by using the RPN network to generate a region proposal, the target feature of which is o, and the position feature corresponding to the region is 1;

The target feature o is the feature extracted and generated by the RPN network,

The location feature l is:

where (x _tl , y _tl ) and (x _br , y _br ) are the coordinates of the upper-left and lower-right corners of the region proposal, respectively, and w, h, W, H are the width and height of the region proposal, the target The width and height of the frame.

9. A computer-readable storage medium, characterized in that, it stores a training program for segmentation of video actors according to language description, and when the program is executed by a processor, the processor is made to execute one of claims 1 to 8 The steps of the video actor segmentation method according to the language description.

10. A computer device, characterized in that it comprises a memory and a processor, wherein the memory stores a training program segmented according to a video actor described in a language, and when the program is executed by the processor, the processor executes the program as described in the right. The steps of the video actor segmentation method according to the language description described in one of claims 1 to 8.