CN111462183A

CN111462183A - Behavior identification method and system based on attention mechanism double-current network

Info

Publication number: CN111462183A
Application number: CN202010243615.2A
Authority: CN
Inventors: 刘允刚; 陈琳; 满永超; 李峰忠
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2020-03-31
Filing date: 2020-03-31
Publication date: 2020-07-28

Abstract

The present disclosure provides a behavior recognition method and system based on an attention mechanism dual-stream network, belonging to the technical field of behavior recognition. The entire acquired video is divided into multiple equal-length video segments, and the RGB image of each frame of each video segment is extracted. and optical flow grayscale images, and perform preprocessing; randomly sample the preprocessed images to obtain the RGB image and optical flow grayscale image of each video segment; use the dual-stream network model that introduces the attention mechanism to extract the sampled images The appearance features and temporal dynamic features of the video are fused respectively according to the temporal network and the spatial network type, and the fusion result of the temporal network and the fusion result of the spatial network are weighted and fused to obtain the recognition result of the entire video; Make full use of it, and can better extract the local key features of the video frame image, highlight the foreground area where the action occurs, suppress the influence of irrelevant information in the background environment, and improve the accuracy of behavior recognition.

Description

A behavior recognition method and system based on attention mechanism dual-stream network

技术领域technical field

本公开涉及行为识别技术领域，特别涉及一种基于注意力机制双流网络的行为识别方法及系统。The present disclosure relates to the technical field of behavior recognition, and in particular, to a behavior recognition method and system based on an attention mechanism dual-stream network.

背景技术Background technique

本部分的陈述仅仅是提供了与本公开相关的背景技术，并不必然构成现有技术。The statements in this section merely provide background related to the present disclosure and do not necessarily constitute prior art.

近年来，基于视频的人体动作行为识别引起了社会的普遍关注，成为了计算机视觉领域的重要研究方向，已被广泛应用于人体动作分析、人机交互、视频监控、医疗看护等科学领域。随着深度学习尤其是深度卷积神经网络的发展，国内外行为识别方面的研究已取得重大进展。但由于复杂视频中普遍存在视角变动、物体遮挡和背景环境干扰等因素的影响，使得对视频数据的准确分析成为了难点。In recent years, video-based human action behavior recognition has attracted widespread attention and has become an important research direction in the field of computer vision. It has been widely used in human motion analysis, human-computer interaction, video surveillance, medical care and other scientific fields. With the development of deep learning, especially deep convolutional neural networks, the research on behavior recognition at home and abroad has made great progress. However, due to the influence of factors such as viewing angle change, object occlusion, and background environment interference in complex videos, it is difficult to accurately analyze video data.

目前，面向视频的行为识别主要使用双流网络，它包含空域网络和时域网络两个参数不共享的独立部分，其中，空域网络从RGB图像中提取空间外观特征，时域网络从堆叠的光流图像中提取时间动态特征。Currently, video-oriented action recognition mainly uses a two-stream network, which contains two independent parts that do not share the parameters of a spatial network and a temporal network. The spatial network extracts spatial appearance features from RGB images, and the temporal network extracts spatial appearance features from stacked optical Extract temporal dynamic features from images.

本公开发明人发现，虽然双流网络在行为识别领域取得了良好的效果，但传统的双流网络仍面临以下问题：(1)由于传统的双流网络只是单纯地利用视频数据中某一帧的图像信息对整个视频进行识别，导致视频数据的利用率较低，故难以保证视频中其他帧的有效信息被网络提取利用；(2)由于传统的双流网络未考虑图像特征权重的影响，故易受到视频背景环境中无关信息的干扰，导致将背景运动信息编码到最终特征表示中，影响分类准确率。The inventor of the present disclosure found that although the dual-stream network has achieved good results in the field of behavior recognition, the traditional dual-stream network still faces the following problems: (1) Since the traditional dual-stream network simply uses the image information of a certain frame in the video data Identifying the entire video results in a low utilization rate of video data, so it is difficult to ensure that the effective information of other frames in the video can be extracted and utilized by the network; (2) Since the traditional dual-stream network does not consider the influence of image feature weights, it is vulnerable to video The interference of irrelevant information in the background environment leads to encoding background motion information into the final feature representation, which affects the classification accuracy.

发明内容SUMMARY OF THE INVENTION

为了解决现有技术的不足，本公开提供了一种基于注意力机制双流网络的行为识别方法及系统，当视频时间较长时，能够对视频数据进行充分利用，并能更好地提取视频帧图像的局部重点特征，突出动作发生的前景区域，抑制背景环境中无关信息的影响，提高了行为识别准确率。In order to solve the deficiencies of the prior art, the present disclosure provides a behavior recognition method and system based on an attention mechanism dual-stream network, which can make full use of video data and better extract video frames when the video time is long. The local key features of the image highlight the foreground area where the action occurs, suppress the influence of irrelevant information in the background environment, and improve the accuracy of action recognition.

为了实现上述目的，本公开采用如下技术方案：In order to achieve the above object, the present disclosure adopts the following technical solutions:

本公开第一方面提供了一种基于注意力机制双流网络的行为识别方法。A first aspect of the present disclosure provides a behavior recognition method based on an attention mechanism dual-stream network.

一种基于注意力机制双流网络的行为识别方法，包括以下步骤：A behavior recognition method based on attention mechanism dual-stream network, comprising the following steps:

将获取的整段视频分成多个等长视频片段，提取每个视频片段每一帧的RGB图像和光流灰度图像，并进行预处理；Divide the acquired entire video into multiple equal-length video segments, extract the RGB image and optical flow grayscale image of each frame of each video segment, and perform preprocessing;

对预处理后的图像进行随机采样，得到每个视频片段的至少一张RGB图像和至少一份堆叠光流灰度图像；Randomly sample the preprocessed images to obtain at least one RGB image and at least one stacked optical flow grayscale image of each video segment;

利用引入注意力机制的双流网络模型，提取采样得到的图像的外观特征和时间动态特征，将提取到的各个视频片段的特征分别按时域网络和空域网络类型进行融合，将时域网络与空域网络的融合结果进行加权融合，得到整段视频的识别结果。Using the dual-stream network model that introduces the attention mechanism, the appearance features and temporal dynamic features of the sampled images are extracted, and the features of each extracted video clip are fused according to the temporal network and spatial network types, respectively. The fusion results are weighted and fused to obtain the recognition results of the entire video.

本公开第二方面提供了一种基于注意力机制双流网络的行为识别系统。A second aspect of the present disclosure provides a behavior recognition system based on an attention mechanism dual-stream network.

一种基于注意力机制双流网络的行为识别系统，包括：A behavior recognition system based on attention mechanism dual-stream network, including:

数据采集模块，被配置为：将获取的整段视频分成多个等长视频片段，提取每个视频片段每一帧的RGB图像和光流灰度图像，并进行预处理；The data acquisition module is configured to: divide the acquired entire video into multiple equal-length video segments, extract the RGB image and optical flow grayscale image of each frame of each video segment, and perform preprocessing;

图像采样模块，被配置为：对预处理后的图像进行随机采样，得到每个视频片段的至少一张RGB图像和至少一份堆叠光流灰度图像；The image sampling module is configured to: randomly sample the preprocessed image to obtain at least one RGB image and at least one stacked optical flow grayscale image of each video segment;

行为识别模块，被配置为：利用引入注意力机制的双流网络模型，提取采样得到的图像的外观特征和时间动态特征，将提取到的各个视频片段的特征分别按时域网络和空域网络类型进行融合，将时域网络与空域网络的融合结果进行加权融合，得到整段视频的识别结果。The behavior recognition module is configured to: extract the appearance features and temporal dynamic features of the sampled images by using a dual-stream network model that introduces an attention mechanism, and fuse the extracted features of each video segment according to the temporal network and spatial network types. , the fusion results of the temporal network and the spatial network are weighted and fused to obtain the recognition result of the entire video.

本公开第三方面提供了一种介质，其上存储有程序，该程序被处理器执行时实现如本公开第一方面所述的基于注意力机制双流网络的行为识别方法中的步骤。A third aspect of the present disclosure provides a medium on which a program is stored, and when the program is executed by a processor, implements the steps in the attention mechanism dual-stream network-based behavior recognition method described in the first aspect of the present disclosure.

本公开第四方面提供了一种电子设备，包括存储器、处理器及存储在存储器上并可在处理器上运行的程序，所述处理器执行所述程序时实现如本公开第一方面所述的基于注意力机制双流网络的行为识别方法中的步骤。A fourth aspect of the present disclosure provides an electronic device, including a memory, a processor, and a program stored in the memory and executable on the processor, the processor implementing the program as described in the first aspect of the present disclosure when the processor executes the program The steps in the attention mechanism two-stream network-based action recognition method.

与现有技术相比，本公开的有益效果是：Compared with the prior art, the beneficial effects of the present disclosure are:

1、本公开所述的方法、系统、介质及电子设备，采用分段双流网络，将视频分为多个等长片段，并对每个片段随机采样，分别获取各个片段的静态RGB图像和动态光流图像，再将两种图像输入双流网络进行视频特征提取，然后将各个片段的视频特征按时域和空域通道类型进行融合，实现了对视频数据的充分利用，从而提高了行为识别准确率。1. The method, system, medium and electronic device described in this disclosure use a segmented dual-stream network to divide the video into multiple equal-length segments, and randomly sample each segment to obtain static RGB images and dynamic images of each segment. Then, the two images are input into the dual-stream network for video feature extraction, and then the video features of each segment are fused according to the temporal and spatial channel types, which fully utilizes the video data and improves the accuracy of behavior recognition.

2、本公开所述的方法、系统、介质及电子设备，在空域网络中引入注意力机制，使网络更加关注图像局部重点特征，突出动作发生的前景区域，抑制背景环境中无关信息的影响，更进一步的提高了行为识别准确率。2. The method, system, medium and electronic device described in the present disclosure introduce an attention mechanism into the airspace network, so that the network pays more attention to the local key features of the image, highlights the foreground area where the action occurs, and suppresses the influence of irrelevant information in the background environment, The accuracy of action recognition is further improved.

附图说明Description of drawings

构成本公开的一部分的说明书附图用来提供对本公开的进一步理解，本公开的示意性实施例及其说明用于解释本公开，并不构成对本公开的不当限定。The accompanying drawings that constitute a part of the present disclosure are used to provide further understanding of the present disclosure, and the exemplary embodiments of the present disclosure and their descriptions are used to explain the present disclosure and do not constitute an improper limitation of the present disclosure.

图1为本公开实施例1提供的基于注意力机制双流网络的行为识别方法的流程示意图。FIG. 1 is a schematic flowchart of a behavior recognition method based on an attention mechanism dual-stream network according to Embodiment 1 of the present disclosure.

图2为本公开实施例1提供的整体网络结构示意图。FIG. 2 is a schematic diagram of an overall network structure provided by Embodiment 1 of the present disclosure.

图3为本公开实施例1提供的获取的视频RGB图像及其对应的光流灰度图像。FIG. 3 is an acquired video RGB image and a corresponding optical flow grayscale image provided by Embodiment 1 of the present disclosure.

图4为本公开实施例1提供的Inception-v3网络的总体结构示意图。FIG. 4 is a schematic diagram of the overall structure of the Inception-v3 network provided in Embodiment 1 of the present disclosure.

图5为本公开实施例1提供的通道注意力单元结构示意图。FIG. 5 is a schematic structural diagram of the channel attention unit provided in Embodiment 1 of the present disclosure.

具体实施方式Detailed ways

应该指出，以下详细说明都是例示性的，旨在对本公开提供进一步的说明。除非另有指明，本文使用的所有技术和科学术语具有与本公开所属技术领域的普通技术人员通常理解的相同含义。It should be noted that the following detailed description is exemplary and intended to provide further explanation of the present disclosure. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.

需要注意的是，这里所使用的术语仅是为了描述具体实施方式，而非意图限制根据本公开的示例性实施方式。如在这里所使用的，除非上下文另外明确指出，否则单数形式也意图包括复数形式，此外，还应当理解的是，当在本说明书中使用术语“包含”和/或“包括”时，其指明存在特征、步骤、操作、器件、组件和/或它们的组合。It should be noted that the terminology used herein is for the purpose of describing specific embodiments only, and is not intended to limit the exemplary embodiments according to the present disclosure. As used herein, unless the context clearly dictates otherwise, the singular is intended to include the plural as well, furthermore, it is to be understood that when the terms "comprising" and/or "including" are used in this specification, it indicates that There are features, steps, operations, devices, components and/or combinations thereof.

在不冲突的情况下，本公开中的实施例及实施例中的特征可以相互组合。The embodiments of this disclosure and features of the embodiments may be combined with each other without conflict.

实施例1：Example 1:

如图1所示，本公开实施例1提供了一种基于注意力机制双流网络的行为识别方法，包括以下步骤：As shown in FIG. 1 , Embodiment 1 of the present disclosure provides a behavior recognition method based on an attention mechanism dual-stream network, including the following steps:

对预处理后的图像进行随机采样，得到每个视频片段的至少一张RGB图像和至少一份光流灰度图像；Randomly sample the preprocessed images to obtain at least one RGB image and at least one optical flow grayscale image of each video segment;

利用引入注意力机制的双流网络模型，提取采样得到的图像的外观特征和时间动态特征，将提取到的各个视频片段的特征分别按时域网络和空域网络类型进行融合，将时域网络的融合结果与空域网络的融合结果进行加权融合，得到整段视频的识别结果。Using the dual-stream network model that introduces the attention mechanism, the appearance features and temporal dynamic features of the sampled images are extracted, and the features of each extracted video clip are fused according to the temporal network and spatial network types, respectively. The fusion results of the temporal network Weighted fusion is performed with the fusion result of the airspace network to obtain the recognition result of the entire video.

详细步骤如下：The detailed steps are as follows:

步骤1：从视频中获取每帧的RGB图像和光流灰度图像。即使用GPU编译的OpenCV和denseflow工具，对视频进行截帧，来获取视频每帧的RGB图像，并利用TV-L1算法对每相邻两帧的RGB图像进行计算，从而得到视频的光流灰度图像。图3为获取的视频RGB图像及其对应的光流灰度图像(水平方向光流图像和垂直方向光流图像，通过获取水平和垂直光流图像，极大的提高了数据的全面性，提高了视频中行为识别的准确性)。Step 1: Get the RGB image and optical flow grayscale image of each frame from the video. That is to use the OpenCV and denseflow tools compiled by GPU to cut the video frame to obtain the RGB image of each frame of the video, and use the TV-L1 algorithm to calculate the RGB image of each adjacent two frames, so as to obtain the optical flow gray of the video. degree image. Figure 3 shows the acquired video RGB image and its corresponding optical flow grayscale image (horizontal optical flow image and vertical optical flow image. By acquiring the horizontal and vertical optical flow images, the comprehensiveness of the data is greatly improved, and the the accuracy of action recognition in videos).

步骤2：在获取两种图像的基础上，对图像数据进行预处理。本步骤对时域和空域网络中的输入数据采用随机裁剪、角点裁剪、水平翻转和尺度抖动四种数据扩充方法进行预处理。Step 2: On the basis of acquiring two kinds of images, preprocess the image data. In this step, four data augmentation methods, random cropping, corner cropping, horizontal flipping and scale jittering, are used to preprocess the input data in the temporal and spatial domain networks.

随机裁剪方法指从图像中随机选取裁剪区域进行裁剪，角点裁剪方法指从图像的四角及中心选取裁剪区域进行裁剪，尺度抖动方法指按抖动比率决定从图像中裁剪的区域尺寸大小。The random cropping method refers to randomly selecting the cropping area from the image for cropping, the corner cropping method refers to selecting the cropping area from the four corners and the center of the image for cropping, and the scale dithering method refers to determining the size of the cropped area from the image according to the dither ratio.

以公开数据集UCF101为例，其输入数据的尺寸大小统一设置为256×340，用于网络训练的数据尺寸大小统一设置为224×224，抖动比率设置为1、0.875、0.75和0.66，其裁剪区域的尺寸由以上比率随机决定，最后再将裁剪区域统一调整为224×224，用于网络训练。Taking the public dataset UCF101 as an example, the size of the input data is uniformly set to 256×340, the size of the data used for network training is uniformly set to 224×224, and the jitter ratio is set to 1, 0.875, 0.75, and 0.66. The size of the region is randomly determined by the above ratio, and finally the cropped region is uniformly adjusted to 224×224 for network training.

步骤3：在上述操作的基础上，将图像数据输入网络，对双流网络进行训练。Step 3: On the basis of the above operations, input the image data into the network to train the dual-stream network.

对于网络整体结构，本实施例采用分段的思想，首先将视频分为K个等长片段，并对每个片段随机采样，分别获取各个片段的静态RGB图像和动态光流图像，再将两种图像输入双流网络进行视频特征提取，然后将各个片段的视频特征按时域和空域网络类型进行融合，最后集成双流网络的识别结果，从而得到整段视频的识别分类，图2为K取3时的整体网络结构图。在图2中，3个空域网络的参数共享，3个时域网络的参数共享。For the overall structure of the network, this embodiment adopts the idea of segmentation. First, the video is divided into K equal-length segments, and each segment is randomly sampled to obtain the static RGB image and dynamic optical flow image of each segment, and then divide the two segments. Then, the video features of each segment are fused according to the temporal and spatial network types, and finally the recognition results of the dual-stream network are integrated to obtain the recognition and classification of the entire video. Figure 2 shows when K is 3 The overall network structure diagram. In Figure 2, the parameters of the three air domain networks are shared, and the parameters of the three time domain networks are shared.

若K个视频片段用S₁，S₂，…，S_k表示，随机采样得到的图像用T₁，T₂，…，T_k表示，网络参数用W表示，网络的输出用F(T_k；W)表示，聚合函数用G表示，归一化指数函数(即softmax函数)用H表示，则整个网络结构的表达式可表示为：If K video clips are represented by S ₁ , S ₂ , ..., _Sk , images obtained by random sampling are represented by T ₁ , T ₂ , ..., T _k , network parameters are represented by W, and the output of the network is represented by F(T _k ; W), the aggregation function is represented by G, and the normalized exponential function (ie softmax function) is represented by H, then the expression of the entire network structure can be expressed as:

Net(T₁,T₂,…,T_k)＝H(G(F(T₁；W),F(T₂；W),…,F(T_k；W)))Net(T ₁ ,T ₂ ,...,T _k )=H(G(F(T ₁ ;W),F(T ₂ ;W),...,F(T _k ;W)))

对于空域和时域的网络结构，本实施例采用Inception-v3卷积神经网络；该网络能有效解决因图像内容差异而不能准确提取图像特征的问题，其核心是将网络中某层卷积核拆分成多个不同尺寸的小卷积核，拆分后的卷积核获取的信息比原来更丰富，且可提取更小目标的特征，能够更充分地利用图像信息。For the network structure in the spatial and temporal domains, the Inception-v3 convolutional neural network is used in this embodiment; this network can effectively solve the problem that image features cannot be accurately extracted due to differences in image content. It is split into multiple small convolution kernels of different sizes. The information obtained by the split convolution kernel is more abundant than the original, and the features of smaller objects can be extracted, which can make more full use of image information.

图4为Inception-v3网络的总体结构图，得益于其独特的结构，Inception-v3网络显现出了一些特有优势，包括：增加了网络深度和宽度，增加了网络非线性，增加了网络对尺度的适应性，减少了参数数量，能够缓解梯度消失问题等。Figure 4 shows the overall structure of the Inception-v3 network. Thanks to its unique structure, the Inception-v3 network shows some unique advantages, including: increased network depth and width, increased network nonlinearity, and increased network pairing The adaptability of the scale reduces the number of parameters, which can alleviate the gradient disappearance problem, etc.

此外，在空域网络中引入了注意力机制，设计了一个通道注意力单元，如图5所示，并将其嵌入到Inception-v3卷积神经网络中来抑制背景环境中无关信息的影响。该单元结构包含一个通道注意力模块和一个快捷连接模块，能够使网络更加关注图像局部重点特征，突出动作发生的前景区域。In addition, an attention mechanism is introduced into the spatial network, a channel attention unit is designed, as shown in Fig. 5, and it is embedded into the Inception-v3 convolutional neural network to suppress the influence of irrelevant information in the background environment. The unit structure includes a channel attention module and a shortcut connection module, which can make the network pay more attention to the local key features of the image and highlight the foreground area where the action occurs.

对于网络训练，本实施例采用在大型数据集Kinetics上训练的网络参数作为初始参数，在UCF101数据集上进行微调训练，该方法能够有效地解决样本训练数据不足的问题。同时，引入了梯度裁剪方案，将网络梯度约束在固定范围内，以防止训练过程中可能出现的梯度爆炸现象。因该网络结构为端到端，故其网络参数能够通过反向传播算法进行自动更新优化，使网络参数达到更优性能。For network training, this embodiment uses the network parameters trained on the large data set Kinetics as initial parameters, and performs fine-tuning training on the UCF101 data set. This method can effectively solve the problem of insufficient sample training data. At the same time, a gradient clipping scheme is introduced to constrain the network gradient within a fixed range to prevent the gradient explosion phenomenon that may occur during training. Because the network structure is end-to-end, its network parameters can be automatically updated and optimized through the back-propagation algorithm, so that the network parameters can achieve better performance.

网络初始化网络的超参数，包括：学习速率、学习策略选取、步长、批处理尺度、动量、Dropout_ratio、梯度裁剪参数和迭代次数等。The network initializes the hyperparameters of the network, including: learning rate, learning strategy selection, step size, batch scale, momentum, Dropout_ratio, gradient clipping parameters, and number of iterations.

步骤4：在上述步骤的基础上，通过实验确定了视频片段数、片段融合策略和通道集成比率等参数的选取，即视频片段数取3，片段融合方式取均值融合方式，通道集成比率选1:1.6。接着，将训练后的网络用在动作类数据集UCF101上进行验证。Step 4: On the basis of the above steps, the selection of parameters such as the number of video clips, clip fusion strategy and channel integration ratio is determined through experiments, that is, the number of video clips is 3, the clip fusion method is the mean fusion method, and the channel integration ratio is selected 1. : 1.6. Next, the trained network is validated on the action dataset UCF101.

UCF101数据集是在2012年由Soomro等人提出的。该数据集共包含101个动作类别和13320个短视频剪辑，这些短视频剪辑主要来源于YouTube视频，其分辨率均为320×240。该数据集分为三个子数据集split1、split2和split3。分别在此三个子数据集上进行验证，如下所示，表1为本实施例方法在三个子数据集上的识别准确率。The UCF101 dataset was proposed in 2012 by Soomro et al. The dataset contains a total of 101 action categories and 13,320 short video clips, which are mainly from YouTube videos with a resolution of 320×240. The dataset is divided into three sub-datasets split1, split2 and split3. Verification is performed on the three sub-data sets respectively, as shown below, Table 1 shows the recognition accuracy rates of the method of this embodiment on the three sub-data sets.

表1：在UCF101数据集上的识别准确率Table 1: Recognition accuracy on UCF101 dataset

为了展示引入注意力机制后空域网络的性能，本实施例测试了两组基本实验，一组为只利用卷积神经网络Inception-v3提取空域信息得到的准确率结果，另一组为利用带注意力机制的Inception-v3网络提取空域信息得到的准确率结果，具体实验结果如表2所示。In order to demonstrate the performance of the spatial network after introducing the attention mechanism, two groups of basic experiments are tested in this example. The accuracy results obtained by extracting airspace information from the Inception-v3 network of the force mechanism are shown in Table 2.

表2：引入注意力机制后空域网络的识别准确率对比Table 2: Comparison of recognition accuracy of spatial network after introducing attention mechanism

为了进一步展示本发明的优越性能，取UCF101中三个子数据集实验结果的平均值(95.99％)为最终准确率，与目前现有的先进方法在UCF101数据集上的识别准确率进行对比，对比结果如表3所示，从表3可以看出，本实施例方法的识别准确率高于其他方法。In order to further demonstrate the superior performance of the present invention, the average value (95.99%) of the experimental results of the three sub-data sets in UCF101 is taken as the final accuracy rate, which is compared with the recognition accuracy rate of the existing advanced methods on the UCF101 data set. The results are shown in Table 3. It can be seen from Table 3 that the recognition accuracy of the method of this embodiment is higher than that of other methods.

表3：多种方法的识别准确率对比Table 3: Comparison of recognition accuracy of various methods

实施例2：Example 2:

本公开实施例2提供了一种基于注意力机制双流网络的行为识别系统，包括：Embodiment 2 of the present disclosure provides a behavior recognition system based on an attention mechanism dual-stream network, including:

行为识别模块，被配置为：利用引入注意力机制的双流网络模型，提取采样得到的图像的外观特征和时间动态特征，将提取到的各个视频片段的特征分别按时域网络和空域网络类型进行融合，将时域网络的融合结果与空域网络的融合结果进行加权融合，得到整段视频的识别结果。The behavior recognition module is configured to: extract the appearance features and temporal dynamic features of the sampled images by using a dual-stream network model that introduces an attention mechanism, and fuse the extracted features of each video segment according to the temporal network and spatial network types. , the fusion results of the temporal network and the fusion results of the spatial network are weighted and fused to obtain the recognition results of the entire video.

本实施例所述的行为识别系统的工作方法与实施例1中的基于注意力机制双流网络的行为识别方法相同，这里不再赘述。The working method of the behavior recognition system described in this embodiment is the same as the behavior recognition method based on the attention mechanism dual-stream network in Embodiment 1, and will not be repeated here.

实施例3：Example 3:

本公开实施例3提供了一种介质，其上存储有程序，该程序被处理器执行时实现如本公开实施例1所述的基于注意力机制双流网络的行为识别方法中的步骤。Embodiment 3 of the present disclosure provides a medium on which a program is stored, and when the program is executed by a processor, implements the steps in the attention mechanism-based dual-stream network behavior recognition method described in Embodiment 1 of the present disclosure.

实施例4：Example 4:

本公开实施例4提供了一种电子设备，包括存储器、处理器及存储在存储器上并可在处理器上运行的程序，所述处理器执行所述程序时实现如本公开实施例1所述的基于注意力机制双流网络的行为识别方法中的步骤。Embodiment 4 of the present disclosure provides an electronic device, including a memory, a processor, and a program stored in the memory and running on the processor. When the processor executes the program, the implementation is as described in Embodiment 1 of the present disclosure. The steps in the attention mechanism two-stream network-based action recognition method.

本领域内的技术人员应明白，本发明的实施例可提供为方法、系统、或计算机程序产品。因此，本发明可采用硬件实施例、软件实施例、或结合软件和硬件方面的实施例的形式。而且，本发明可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器和光学存储器等)上实施的计算机程序产品的形式。As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied therein, including but not limited to disk storage, optical storage, and the like.

本发明是参照根据本发明实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器，使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block in the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general purpose computer, special purpose computer, embedded processor or other programmable data processing device to produce a machine such that the instructions executed by the processor of the computer or other programmable data processing device produce Means for implementing the functions specified in a flow or flow of a flowchart and/or a block or blocks of a block diagram.

这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中，使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品，该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory result in an article of manufacture comprising instruction means, the instructions The apparatus implements the functions specified in the flow or flow of the flowcharts and/or the block or blocks of the block diagrams.

这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上，使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理，从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions can also be loaded on a computer or other programmable data processing device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process such that The instructions provide steps for implementing the functions specified in the flow or blocks of the flowcharts and/or the block or blocks of the block diagrams.

本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程，是可以通过计算机程序来指令相关的硬件来完成，所述的程序可存储于一计算机可读取存储介质中，该程序在执行时，可包括如上述各方法的实施例的流程。其中，所述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory，ROM)或随机存储记忆体(RandomAccessMemory，RAM)等。Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program, and the program can be stored in a computer-readable storage medium. During execution, the processes of the embodiments of the above-mentioned methods may be included. The storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM) or the like.

以上所述仅为本发明的优选实施例而已，并不用于限制本发明，对于本领域的技术人员来说，本发明可以有各种更改和变化。凡在本发明的精神和原则之内，所作的任何修改、等同替换、改进等，均应包含在本发明的保护范围之内。The above descriptions are only preferred embodiments of the present invention, and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included within the protection scope of the present invention.

以上所述仅为本公开的优选实施例而已，并不用于限制本公开，对于本领域的技术人员来说，本公开可以有各种更改和变化。凡在本公开的精神和原则之内，所作的任何修改、等同替换、改进等，均应包含在本公开的保护范围之内。The above descriptions are only preferred embodiments of the present disclosure, and are not intended to limit the present disclosure. For those skilled in the art, the present disclosure may have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure shall be included within the protection scope of the present disclosure.

上述虽然结合附图对本公开的具体实施方式进行了描述，但并非对本公开保护范围的限制，所属领域技术人员应该明白，在本公开的技术方案的基础上，本领域技术人员不需要付出创造性劳动即可做出的各种修改或变形仍在本公开的保护范围以内。Although the specific embodiments of the present disclosure have been described above in conjunction with the accompanying drawings, they do not limit the protection scope of the present disclosure. Those skilled in the art should understand that on the basis of the technical solutions of the present disclosure, those skilled in the art do not need to pay creative efforts. Various modifications or variations that can be made are still within the protection scope of the present disclosure.

Claims

1. A behavior identification method based on an attention mechanism double-flow network is characterized by comprising the following steps:

dividing the obtained whole video into a plurality of video segments with equal length, extracting an RGB image and an optical flow gray image of each frame of each video segment, and preprocessing;

randomly sampling the preprocessed images to obtain at least one RGB image and at least one stacked optical flow gray image of each video clip;

and extracting appearance characteristics and time dynamic characteristics of the sampled image by using a double-current network model introducing an attention mechanism, fusing the extracted characteristics of each video segment according to the types of a time domain network and a space domain network, and performing weighted fusion on the fusion result of the time domain network and the space domain network to obtain the identification result of the whole video segment.

2. The behavior identification method based on the attention mechanism double-flow network as claimed in claim 1, characterized in that an attention unit is introduced into the spatial domain network of the double-flow network model, and the attention unit at least comprises a channel attention module and a shortcut connection module, so as to make the spatial domain network focus more on the local features of the image.

3. The method for identifying behaviors based on an attention-machine-based dual-flow network as claimed in claim 1, wherein the collected RGB images and optical flow gray scale images are preprocessed by means of random cropping, corner cropping, horizontal flipping and scale dithering.

4. The attention-based dual-flow network behavior recognition method of claim 1, wherein the dual-flow network model comprises two convolutional neural networks of a space domain network and a time domain network, the space domain network takes the RGB image as input, and the time domain network takes the stacked optical flow grayscale image as input.

5. The method for identifying behavior based on attention mechanism dual-flow network as claimed in claim 1, wherein the space domain network and the time domain network both adopt an inclusion-v 3 convolutional neural network.

6. The attention-based dual-flow network behavior recognition method according to claim 1, wherein the training method of the dual-flow network model specifically comprises: and taking the network parameters obtained by training on the first data set as initial parameters, adjusting the network parameters by using the second data set, introducing a gradient cutting scheme, and constraining the network gradient in a fixed range.

7. The attention-driven dual-flow network-based behavior recognition method of claim 1, wherein the optical flow images comprise horizontal optical flow images and vertical optical flow images.

8. An attention-based dual-flow network behavior recognition system, comprising:

a data acquisition module configured to: dividing the obtained whole video into a plurality of video segments with equal length, extracting an RGB image and an optical flow gray image of each frame of each video segment, and preprocessing;

an image sampling module configured to: randomly sampling the preprocessed images to obtain at least one RGB image and at least one optical flow gray image of each video clip;

a behavior recognition module configured to: and extracting appearance characteristics and time dynamic characteristics of the sampled image by using a double-current network model introducing an attention mechanism, fusing the extracted characteristics of each video segment according to the types of a time domain network and a space domain network, and performing weighted fusion on the fusion result of the time domain network and the space domain network to obtain the identification result of the whole video segment.

9. A medium having a program stored thereon, which program, when being executed by a processor, carries out the steps of the method for attention-based dual-flow network behavior recognition according to any of claims 1-7.

10. An electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, wherein the processor implements the steps in the method for attention-based dual-flow network behavior recognition according to any of claims 1-7 when executing the program.