
CN114511808A - Video feature determination method, related device and readable storage medium

Info

Publication number
CN114511808A
Authority
CN
China
Prior art keywords
video
sound source
image frame
feature
feature map
Prior art date
Legal status
Pending
Application number
CN202210079236.3A
Other languages
Chinese (zh)
Inventor
马骥腾
Current Assignee
Iflytek South China Artificial Intelligence Research Institute Guangzhou Co., Ltd.
Original Assignee
Iflytek South China Artificial Intelligence Research Institute Guangzhou Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Iflytek South China Artificial Intelligence Research Institute Guangzhou Co., Ltd.
Priority to CN202210079236.3A
Publication of CN114511808A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present application discloses a video feature determination method, a related device, and a readable storage medium. For computer vision tasks, the image region at the sound source's location in a video is often key information of higher importance. In this solution, after the video to be processed is acquired, a sound source orientation attention feature map is determined for each image frame of the video, and the video features of the video to be processed are determined based on each image frame and its corresponding sound source orientation attention feature map. Video features extracted with this solution have an enhanced ability to represent the image region at the sound source's location in the video.

Description

Video feature determination method, related device and readable storage medium

Technical Field

The present application relates to the technical field of video processing, and more particularly to a video feature determination method, a related device, and a readable storage medium.

Background

In recent years, with the development of Internet and artificial intelligence technologies, video-based computer vision tasks (such as object detection, video classification, object tracking, and real-time human pose estimation) have played a key role in fields such as smart security and smart elderly care. A video-based computer vision task extracts features from video collected by a video capture device (such as a camera) to determine video features, and finally decodes the video features according to the task's requirements to output a result for the current task. A video contains a large amount of information, and for video-based computer vision tasks, key information of higher importance requires greater attention.

Therefore, how to determine video features so that they can better express the key information in a video has become a technical problem to be urgently solved by those skilled in the art.

Summary of the Invention

In view of the above problems, the present application proposes a video feature determination method, a related device, and a readable storage medium. The specific solutions are as follows:

A video feature determination method, the method comprising:

acquiring a video to be processed;

for each image frame of the video to be processed, determining the sound source orientation attention feature map corresponding to the image frame;

determining the video features of the video to be processed based on each image frame and the sound source orientation attention feature map corresponding to each image frame.

Optionally, determining the sound source orientation attention feature map corresponding to the image frame includes:

acquiring the sound signal corresponding to the image frame;

determining the sound source orientation based on the sound signal corresponding to the image frame;

determining the sound source orientation attention feature map corresponding to the image frame according to the image frame and the sound source orientation.

Optionally, determining the sound source orientation based on the sound signal corresponding to the image frame includes:

acquiring the coordinates of at least one microphone array element that collects the sound signal;

calculating the sound source coordinates based on the time differences with which the sound signal arrives at the at least one microphone array element, the sound source coordinates being used to characterize the sound source orientation.

Optionally, determining the sound source orientation attention feature map corresponding to the image frame according to the image frame and the sound source orientation includes:

determining the image plane coordinates of the sound source in the image frame based on the sound source coordinates;

generating a two-dimensional random Gaussian distribution with the image plane coordinates as the mean and a random value as the standard deviation;

normalizing the two-dimensional random Gaussian distribution to obtain the sound source orientation attention feature map corresponding to the image frame.

Optionally, determining the video features of the video to be processed based on each image frame and the sound source orientation attention feature map corresponding to each image frame includes:

inputting each image frame and its corresponding sound source orientation attention feature map into a feature generation network, the feature generation network outputting the video features of the video to be processed based on each image frame and the sound source orientation attention feature map corresponding to each image frame;

the feature generation network is trained with training image frames and their corresponding sound source orientation attention feature maps as training samples, and with preset computer vision task results annotated on the training image frames as sample labels.

Optionally, the feature generation network includes an encoding module and a feature fusion module;

the feature generation network then outputs the video features of the video to be processed based on each image frame and the sound source orientation attention feature map corresponding to each image frame as follows:

for each image frame, the encoding module encodes the image frame and its corresponding sound source orientation attention feature map to obtain an enhanced feature map of the image frame;

the feature fusion module performs a feature fusion operation on the enhanced feature maps of the image frames to obtain the video features of the video to be processed.

Optionally, the encoding module includes an image frame encoding module and a sound source orientation attention feature map encoding module; the image frame encoding module includes I cascaded encoding units, each encoding unit including a convolution layer, a downsampling layer, and a feature fusion layer;

the sound source orientation attention feature map encoding module includes I cascaded downsampling layers, and the output of the i-th downsampling layer of the sound source orientation attention feature map encoding module serves as an input of the feature fusion layer in the i-th encoding unit of the image frame encoding module;

I is an integer greater than or equal to 1, and i is an integer greater than or equal to 1 and less than or equal to I.

Optionally, the feature fusion module includes a temporal pooling layer and a feature fusion layer;

the feature fusion module then performs the feature fusion operation on the enhanced feature maps of the image frames to obtain the video features of the video to be processed as follows:

the temporal pooling layer performs a temporal pooling operation on the enhanced feature maps of the image frames to obtain a temporally pooled feature map for each image frame;

the feature fusion layer fuses the temporally pooled feature maps of the image frames to obtain the video features of the video to be processed.

A video feature determination apparatus, the apparatus comprising:

an acquisition unit, configured to acquire a video to be processed;

a sound source orientation attention feature map determination unit, configured to determine, for each image frame of the video to be processed, the sound source orientation attention feature map corresponding to the image frame;

a video feature determination unit, configured to determine the video features of the video to be processed based on each image frame and the sound source orientation attention feature map corresponding to each image frame.

Optionally, the sound source orientation attention feature map determination unit includes:

a sound signal acquisition unit, configured to acquire the sound signal corresponding to the image frame;

a sound source orientation determination unit, configured to determine the sound source orientation based on the sound signal corresponding to the image frame;

a sound source orientation attention feature map determination subunit, configured to determine the sound source orientation attention feature map corresponding to the image frame according to the image frame and the sound source orientation.

Optionally, the sound source orientation determination unit includes:

a microphone array element coordinate acquisition unit, configured to acquire the coordinates of at least one microphone array element that collects the sound signal;

a sound source coordinate calculation unit, configured to calculate the sound source coordinates based on the time differences with which the sound signal arrives at the at least one microphone array element, the sound source coordinates being used to characterize the sound source orientation.

Optionally, the sound source orientation attention feature map determination subunit includes:

an image plane coordinate determination unit, configured to determine the image plane coordinates of the sound source in the image frame based on the sound source coordinates;

a two-dimensional random Gaussian distribution generation unit, configured to generate a two-dimensional random Gaussian distribution with the image plane coordinates as the mean and a random value as the standard deviation;

a normalization processing unit, configured to normalize the two-dimensional random Gaussian distribution to obtain the sound source orientation attention feature map corresponding to the image frame.

Optionally, the video feature determination unit is configured to:

input each image frame and its corresponding sound source orientation attention feature map into a feature generation network, the feature generation network outputting the video features of the video to be processed based on each image frame and the sound source orientation attention feature map corresponding to each image frame;

the feature generation network is trained with training image frames and their corresponding sound source orientation attention feature maps as training samples, and with preset computer vision task results annotated on the training image frames as sample labels.

Optionally, the feature generation network includes an encoding module and a feature fusion module;

the feature generation network then outputs the video features of the video to be processed based on each image frame and the sound source orientation attention feature map corresponding to each image frame as follows:

for each image frame, the encoding module encodes the image frame and its corresponding sound source orientation attention feature map to obtain an enhanced feature map of the image frame;

the feature fusion module performs a feature fusion operation on the enhanced feature maps of the image frames to obtain the video features of the video to be processed.

Optionally, the encoding module includes an image frame encoding module and a sound source orientation attention feature map encoding module; the image frame encoding module includes I cascaded encoding units, each encoding unit including a convolution layer, a downsampling layer, and a feature fusion layer;

the sound source orientation attention feature map encoding module includes I cascaded downsampling layers, and the output of the i-th downsampling layer of the sound source orientation attention feature map encoding module serves as an input of the feature fusion layer in the i-th encoding unit of the image frame encoding module;

I is an integer greater than or equal to 1, and i is an integer greater than or equal to 1 and less than or equal to I.

Optionally, the feature fusion module includes a temporal pooling layer and a feature fusion layer;

the feature fusion module then performs the feature fusion operation on the enhanced feature maps of the image frames to obtain the video features of the video to be processed as follows:

the temporal pooling layer performs a temporal pooling operation on the enhanced feature maps of the image frames to obtain a temporally pooled feature map for each image frame;

the feature fusion layer fuses the temporally pooled feature maps of the image frames to obtain the video features of the video to be processed.

A video feature determination device, including a memory and a processor;

the memory is configured to store a program;

the processor is configured to execute the program to implement the steps of the video feature determination method described above.

A readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of the video feature determination method described above.

Through the above technical solutions, the present application discloses a video feature determination method, a related device, and a readable storage medium. For computer vision tasks, the image region at the sound source's location in a video is often key information of higher importance. In this solution, after the video to be processed is acquired, a sound source orientation attention feature map is determined for each image frame of the video, and the video features of the video to be processed are determined based on each image frame and its corresponding sound source orientation attention feature map. Video features extracted with this solution have an enhanced ability to represent the image region at the sound source's location in the video.

Brief Description of the Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are provided only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the present application. Throughout the drawings, the same reference numerals denote the same components. In the drawings:

FIG. 1 is a schematic flowchart of a video feature determination method disclosed in an embodiment of the present application;

FIG. 2 is an effect diagram of a sound source orientation attention feature map disclosed in an embodiment of the present application;

FIG. 3 is a schematic diagram of the principle of determining the sound source orientation disclosed in an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a feature generation network disclosed in an embodiment of the present application;

FIG. 5 is a schematic structural diagram of another feature generation network disclosed in an embodiment of the present application;

FIG. 6 is a schematic diagram of the fusion principle of a feature fusion layer in an encoding unit disclosed in an embodiment of the present application;

FIG. 7 is a schematic structural diagram of a video feature determination apparatus disclosed in an embodiment of the present application;

FIG. 8 is a block diagram of the hardware structure of a video feature determination device disclosed in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.

Next, the video feature determination method provided by the present application is introduced through the following embodiments.

Referring to FIG. 1, FIG. 1 is a schematic flowchart of a video feature determination method disclosed in an embodiment of the present application. The method may include:

Step S101: Acquire a video to be processed.

In the present application, the video to be processed may be a video collected and stored by a video capture device (such as a camera), or a video collected in real time by a video capture device; the video may be of any duration, and the present application imposes no limitation in this regard.

The video to be processed may be the original video collected by a video capture device (such as a camera), or a video that has undergone preprocessing (such as noise reduction or shot segmentation); the present application imposes no limitation in this regard.

Step S102: For each image frame of the video to be processed, determine the sound source orientation attention feature map corresponding to the image frame.

For computer vision tasks, the image region at the sound source's location in a video is often key information of higher importance. Therefore, in the present application, after the video to be processed is acquired, the sound source orientation attention feature map corresponding to each image frame of the video is determined. The sound source orientation attention feature map corresponding to an image frame is used to characterize the position of the sound source in the image frame. For ease of understanding, refer to FIG. 2, an effect diagram of a sound source orientation attention feature map disclosed in an embodiment of the present application; the circle and its interior region in FIG. 2 indicate the position of the sound source in the image frame. The manner of determining the sound source orientation attention feature map corresponding to an image frame is described in detail in the following embodiments and is not repeated here.

It should be noted that a video capture device collects a fixed number of image frames per second. In the present application, the corresponding sound source orientation attention feature map may be determined for every image frame of the video or, according to the requirements of the scenario, only for specific image frames; the present application imposes no limitation in this regard.

Step S103: Determine the video features of the video to be processed based on each image frame and the sound source orientation attention feature map corresponding to each image frame.

In the present application, for an image frame of the video, the image frame may be fused with its corresponding sound source orientation attention feature map to obtain an enhanced feature map of the image frame, and the enhanced feature maps of the image frames may then be fused to obtain the video features of the video to be processed. The fusion manner may be determined based on the requirements of the scenario, and the present application imposes no limitation in this regard. The specific implementation is described in detail in the following embodiments and is not repeated here.

This embodiment discloses a video feature determination method. After the video to be processed is acquired, the sound source orientation attention feature map corresponding to each image frame of the video is determined, and the video features of the video to be processed are determined based on each image frame and its corresponding sound source orientation attention feature map, so that the extracted video features have an enhanced ability to represent the image region at the sound source's location in the video.

In another embodiment of the present application, a specific implementation of determining the sound source orientation attention feature map corresponding to an image frame is described. This implementation may include the following steps:

Step S201: Acquire the sound signal corresponding to the image frame.

In the present application, the sound signal in the video is sampled at sampling points synchronized with the image frames, which yields the sound signal corresponding to each image frame.
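As a rough sketch of this alignment (a minimal illustration; the frame rate, sample rate, and the helper name audio_for_frame are assumptions, not values from the patent):

```python
import numpy as np

def audio_for_frame(audio: np.ndarray, frame_idx: int,
                    fps: int = 25, sample_rate: int = 16000) -> np.ndarray:
    """Return the multi-channel audio samples aligned with one image frame.

    audio: shape (num_channels, num_samples), one channel per microphone.
    """
    samples_per_frame = sample_rate // fps   # e.g. 640 samples per frame at 25 fps
    start = frame_idx * samples_per_frame
    return audio[:, start:start + samples_per_frame]
```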

Step S202: Determine the sound source orientation based on the sound signal corresponding to the image frame.

As one implementation, the coordinates of at least one microphone array element that collects the sound signal may be acquired, and the sound source coordinates may be calculated based on the time differences with which the sound signal arrives at the at least one microphone array element; the sound source coordinates are used to characterize the sound source orientation.

For ease of understanding, refer to FIG. 3, a schematic diagram of the principle of determining the sound source orientation disclosed in an embodiment of the present application.

As shown in FIG. 3, four microphone array elements collect the sound signal, denoted S1, S2, S3, and S4. They are deployed at the midpoints of the four edges of the lens of the video capture device. Taking the lens center as the coordinate origin, the coordinates of the four microphone array elements are S1(0, 0, 0.5H), S2(0.5W, 0, 0), S3(0, 0, -0.5H), and S4(-0.5W, 0, 0), where H and W are the height and width of the lens, respectively.

The collected sound signal has four channels, one per microphone array element. From the arrival times of the sound signal on the four channels (i.e., r1, r2, r3, and r4 in FIG. 3), the time differences of arrival (TDOA) of the sound signal at the four microphone array elements can be determined. A system of equations can be constructed from these time differences, and solving it yields the sound source coordinates P(X, Y, Z).
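The patent does not specify how the equation system is solved; the following is a minimal least-squares sketch of TDOA localization (the function name locate_source, the initial guess, and the use of scipy are assumptions):

```python
import numpy as np
from scipy.optimize import least_squares

SPEED_OF_SOUND = 343.0  # m/s in air at room temperature

def locate_source(mic_coords, arrival_times, x0=(0.0, 1.0, 0.0)):
    """Solve the TDOA equation system for the sound source coordinates P(X, Y, Z).

    mic_coords: (4, 3) array holding S1..S4 as described above.
    arrival_times: (4,) array; only the differences between entries matter.
    """
    mics = np.asarray(mic_coords, dtype=float)
    t = np.asarray(arrival_times, dtype=float)

    def residuals(p):
        # range differences implied by a candidate P, minus the measured
        # TDOA (relative to mic S1) converted to distances
        d = np.linalg.norm(mics - p, axis=1)
        return (d[1:] - d[0]) - SPEED_OF_SOUND * (t[1:] - t[0])

    return least_squares(residuals, x0).x
```

With the four mic coordinates from FIG. 3 and the measured arrival times r1 to r4, this returns an estimate of P(X, Y, Z).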

Step S203: Determine the sound source orientation attention feature map corresponding to the image frame according to the image frame and the sound source orientation.

As one implementation, determining the sound source orientation attention feature map corresponding to the image frame according to the image frame and the sound source orientation includes: determining the image plane coordinates of the sound source in the image frame based on the sound source coordinates; generating a two-dimensional random Gaussian distribution with the image plane coordinates as the mean and a random value as the standard deviation; and normalizing the two-dimensional random Gaussian distribution to obtain the sound source orientation attention feature map corresponding to the image frame.

The image plane coordinates (u, v) of the sound source in the image frame can be calculated with the formula [u, v]^T = M · [X, Y, Z]^T, where M is the camera matrix, an intrinsic parameter of the video capture device. The normalization of the two-dimensional random Gaussian distribution may specifically be min-max normalization.
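A minimal NumPy sketch of this step; M is applied exactly as the formula is written (a 2x3 mapping, ignoring homogeneous division), the Gaussian is assumed isotropic, and the range of the random standard deviation is an assumption, since the patent only says "a random value":

```python
import numpy as np

def sound_source_attention_map(P, M, height, width, rng=None):
    """Build the sound source orientation attention map for one image frame.

    P: sound source coordinates (X, Y, Z); M: 2x3 camera matrix.
    """
    rng = np.random.default_rng() if rng is None else rng
    u, v = M @ np.asarray(P, dtype=float)        # [u, v]^T = M . [X, Y, Z]^T
    sigma = rng.uniform(5.0, 50.0)               # random standard deviation, in pixels
    ys, xs = np.mgrid[0:height, 0:width]
    g = np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2.0 * sigma ** 2))
    return (g - g.min()) / (g.max() - g.min() + 1e-8)   # min-max normalization
```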

In the present application, the step of determining the video features of the video to be processed based on each image frame and the sound source orientation attention feature map corresponding to each image frame may be implemented with a neural network model. A feature generation network may be constructed and trained in advance; each image frame and its corresponding sound source orientation attention feature map are then input into the feature generation network, which outputs the video features of the video to be processed based on each image frame and the sound source orientation attention feature map corresponding to each image frame.

It should be noted that the feature generation network is trained with training image frames and their corresponding sound source orientation attention feature maps as training samples, and with preset computer vision task results annotated on the training image frames as sample labels. The preset computer vision task may be any of object detection, video classification, object tracking, real-time human pose estimation, and so on; different computer vision tasks produce different results. For example, the result of an object detection task is the objects contained in the video.
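For concreteness, a schematic training step under the assumption of a classification-type task such as video classification; the names model, head, and the cross-entropy loss are placeholders, not prescribed by the patent:

```python
import torch
import torch.nn as nn

def train_step(model, head, optimizer, frames, attn_maps, labels):
    """One supervised step: (frames, attention maps) -> video feature -> task loss.

    frames: (B, T, 3, H, W); attn_maps: (B, T, H, W); labels: (B,) class ids.
    """
    video_feat = model(frames, attn_maps)        # the feature generation network
    loss = nn.functional.cross_entropy(head(video_feat), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```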

The structure of the feature generation network is described below.

Referring to FIG. 4, FIG. 4 is a schematic structural diagram of a feature generation network disclosed in an embodiment of the present application. The feature generation network includes an encoding module and a feature fusion module.

Based on the feature generation network shown in FIG. 4, the feature generation network outputs the video features of the video to be processed based on each image frame and the sound source orientation attention feature map corresponding to each image frame as follows: for each image frame, the encoding module encodes the image frame and its corresponding sound source orientation attention feature map to obtain an enhanced feature map of the image frame; the feature fusion module then performs a feature fusion operation on the enhanced feature maps of the image frames to obtain the video features of the video to be processed.

It should be noted that the internal structures of the encoding module and the feature fusion module may be implemented with any neural network structure that satisfies the above requirements. As one implementation, the present application discloses the following specific structure.

Referring to FIG. 5, FIG. 5 is a schematic structural diagram of a feature generation network disclosed in an embodiment of the present application. The feature generation network includes an encoding module and a feature fusion module. The encoding module includes an image frame encoding module and a sound source orientation attention feature map encoding module. The image frame encoding module includes I cascaded encoding units, each encoding unit including a convolution layer, a downsampling layer, and a feature fusion layer. The sound source orientation attention feature map encoding module includes I cascaded downsampling layers, and the output of the i-th downsampling layer of the sound source orientation attention feature map encoding module serves as an input of the feature fusion layer in the i-th encoding unit of the image frame encoding module, where I is an integer greater than or equal to 1 and i is an integer greater than or equal to 1 and less than or equal to I. The feature fusion module includes a temporal pooling layer and a feature fusion layer.
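A minimal PyTorch sketch of this encoding module; only the wiring follows the patent, while the channel widths, the ReLU activation, and the use of max pooling for all downsampling layers are assumptions:

```python
import torch
import torch.nn as nn

class EncodingUnit(nn.Module):
    """One of the I cascaded units: a convolution layer, a downsampling layer,
    and a feature fusion layer (the per-position multiply in forward)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)
        self.down = nn.MaxPool2d(2)              # halves height and width

    def forward(self, x, attn):
        x = self.down(torch.relu(self.conv(x)))
        return x * attn.unsqueeze(1)             # feature fusion layer

class EncodingModule(nn.Module):
    """Image frame encoding module plus attention map encoding module."""
    def __init__(self, channels=(3, 16, 32, 64)):    # I = 3 units here
        super().__init__()
        self.units = nn.ModuleList(
            EncodingUnit(ci, co) for ci, co in zip(channels, channels[1:]))
        self.attn_down = nn.MaxPool2d(2)         # parameter-free, reused per level

    def forward(self, frames, attn_maps):
        # frames: (n, 3, H, W); attn_maps: (n, H, W), n frames in parallel
        x, a = frames, attn_maps.unsqueeze(1)
        for unit in self.units:
            a = self.attn_down(a)                # i-th downsampling layer's output...
            x = unit(x, a.squeeze(1))            # ...feeds the i-th fusion layer
        return x                                 # enhanced feature maps
```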

It can be understood that inputting each image frame and its corresponding sound source orientation attention feature map into the feature generation network includes: inputting each image frame into the first encoding unit of the image frame encoding module, and inputting the sound source orientation attention feature map corresponding to each image frame into the first downsampling layer of the sound source orientation attention feature map encoding module.

It should be noted that downsampling is a common operation in image encoders, and the height and width of its output features are usually half those of its input. The downsampling layer in the i-th encoding unit may adopt the same structure as the i-th downsampling layer in the sound source orientation attention feature map encoding module, so that the feature map output by the downsampling layer in the i-th encoding unit has the same height and width as the feature map output by the i-th downsampling layer in the sound source orientation attention feature map encoding module.

It should be noted that, when the feature fusion layer in each encoding unit fuses the input image frame feature map and the sound source orientation attention feature map, a per-position multiplication may be used: all channels at the same position are multiplied by the same value, taken from the corresponding position of the sound source orientation attention feature map. Referring to FIG. 6, FIG. 6 is a schematic diagram of the fusion principle of the feature fusion layer in the i-th encoding unit disclosed in an embodiment of the present application, where tn denotes an image frame and n denotes the number of image frames processed in parallel by the encoding unit.
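In tensor terms, this per-position fusion is a broadcast multiply; a small sketch with hypothetical shapes:

```python
import torch

n, C, h, w = 4, 32, 56, 56             # hypothetical shapes at the i-th level
feat = torch.randn(n, C, h, w)         # frame feature maps, n frames in parallel
attn = torch.rand(n, h, w)             # attention maps at the same spatial scale

# every one of the C channels at position (y, x) of frame t is multiplied by
# the single value attn[t, y, x], as in FIG. 6
fused = feat * attn.unsqueeze(1)       # -> (n, C, h, w)
```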

Based on the feature generation network shown in FIG. 5, the feature fusion module performs the feature fusion operation on the enhanced feature maps of the image frames to obtain the video features of the video to be processed as follows: the temporal pooling layer performs a temporal pooling operation on the enhanced feature maps of the image frames to obtain a temporally pooled feature map for each image frame; the feature fusion layer then fuses the temporally pooled feature maps of the image frames to obtain the video features of the video to be processed.
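The patent does not pin down the pooling kernel or the final fusion operator; the following sketch assumes a sliding average over the time axis (odd window) for the temporal pooling and a mean-then-flatten fusion:

```python
import torch
import torch.nn.functional as F

def fuse_video_feature(enhanced, window=3):
    """Temporal pooling + fusion over the enhanced feature maps of T frames.

    enhanced: (T, C, H, W) tensor; window: odd temporal kernel size.
    """
    T, C, H, W = enhanced.shape
    x = enhanced.permute(1, 2, 3, 0).reshape(1, C * H * W, T)    # time last
    pooled = F.avg_pool1d(x, window, stride=1, padding=window // 2)
    pooled = pooled.reshape(C, H, W, T).permute(3, 0, 1, 2)      # per-frame pooled maps
    return pooled.mean(dim=0).flatten()   # fusion layer: average frames, flatten
```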

The video feature determination apparatus disclosed in the embodiments of the present application is described below; the video feature determination apparatus described below and the video feature determination method described above may be referred to in correspondence with each other.

Referring to FIG. 7, FIG. 7 is a schematic structural diagram of a video feature determination apparatus disclosed in an embodiment of the present application. As shown in FIG. 7, the video feature determination apparatus may include:

an acquisition unit 11, configured to acquire a video to be processed;

a sound source orientation attention feature map determination unit 12, configured to determine, for each image frame of the video to be processed, the sound source orientation attention feature map corresponding to the image frame;

a video feature determination unit 13, configured to determine the video features of the video to be processed based on each image frame and the sound source orientation attention feature map corresponding to each image frame.

As one implementation, the sound source orientation attention feature map determination unit includes:

a sound signal acquisition unit, configured to acquire the sound signal corresponding to the image frame;

a sound source orientation determination unit, configured to determine the sound source orientation based on the sound signal corresponding to the image frame;

a sound source orientation attention feature map determination subunit, configured to determine the sound source orientation attention feature map corresponding to the image frame according to the image frame and the sound source orientation.

As one implementation, the sound source orientation determination unit includes:

a microphone array element coordinate acquisition unit, configured to acquire the coordinates of at least one microphone array element that collects the sound signal;

a sound source coordinate calculation unit, configured to calculate the sound source coordinates based on the time differences with which the sound signal arrives at the at least one microphone array element, the sound source coordinates being used to characterize the sound source orientation.

As one implementation, the sound source orientation attention feature map determination subunit includes:

an image plane coordinate determination unit, configured to determine the image plane coordinates of the sound source in the image frame based on the sound source coordinates;

a two-dimensional random Gaussian distribution generation unit, configured to generate a two-dimensional random Gaussian distribution with the image plane coordinates as the mean and a random value as the standard deviation;

a normalization processing unit, configured to normalize the two-dimensional random Gaussian distribution to obtain the sound source orientation attention feature map corresponding to the image frame.

As one implementation, the video feature determination unit is configured to:

input each image frame and its corresponding sound source orientation attention feature map into a feature generation network, the feature generation network outputting the video features of the video to be processed based on each image frame and the sound source orientation attention feature map corresponding to each image frame;

the feature generation network is trained with training image frames and their corresponding sound source orientation attention feature maps as training samples, and with preset computer vision task results annotated on the training image frames as sample labels.

As one implementation, the feature generation network includes an encoding module and a feature fusion module;

the feature generation network then outputs the video features of the video to be processed based on each image frame and the sound source orientation attention feature map corresponding to each image frame as follows:

for each image frame, the encoding module encodes the image frame and its corresponding sound source orientation attention feature map to obtain an enhanced feature map of the image frame;

the feature fusion module performs a feature fusion operation on the enhanced feature maps of the image frames to obtain the video features of the video to be processed.

As one implementation, the encoding module includes an image frame encoding module and a sound source orientation attention feature map encoding module; the image frame encoding module includes I cascaded encoding units, each encoding unit including a convolution layer, a downsampling layer, and a feature fusion layer;

the sound source orientation attention feature map encoding module includes I cascaded downsampling layers, and the output of the i-th downsampling layer of the sound source orientation attention feature map encoding module serves as an input of the feature fusion layer in the i-th encoding unit of the image frame encoding module;

I is an integer greater than or equal to 1, and i is an integer greater than or equal to 1 and less than or equal to I.

As one implementation, the feature fusion module includes a temporal pooling layer and a feature fusion layer;

the feature fusion module then performs the feature fusion operation on the enhanced feature maps of the image frames to obtain the video features of the video to be processed as follows:

the temporal pooling layer performs a temporal pooling operation on the enhanced feature maps of the image frames to obtain a temporally pooled feature map for each image frame;

the feature fusion layer fuses the temporally pooled feature maps of the image frames to obtain the video features of the video to be processed.

Referring to FIG. 8, FIG. 8 is a block diagram of the hardware structure of a video feature determination device provided by an embodiment of the present application. Referring to FIG. 8, the hardware structure of the video feature determination device may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4;

In this embodiment of the present application, there is at least one of each of the processor 1, the communication interface 2, the memory 3, and the communication bus 4, and the processor 1, the communication interface 2, and the memory 3 communicate with one another through the communication bus 4;

The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), one or more integrated circuits configured to implement the embodiments of the present invention, or the like;

The memory 3 may include high-speed RAM, and may also include non-volatile memory, such as at least one disk memory;

The memory stores a program, and the processor may call the program stored in the memory; the program is used for:

acquiring a video to be processed;

for each image frame of the video to be processed, determining the sound source orientation attention feature map corresponding to the image frame;

determining the video features of the video to be processed based on each image frame and the sound source orientation attention feature map corresponding to each image frame.

Optionally, for the refined and extended functions of the program, refer to the description above.

An embodiment of the present application further provides a readable storage medium that stores a program suitable for execution by a processor, the program being used for:

acquiring a video to be processed;

for each image frame of the video to be processed, determining the sound source orientation attention feature map corresponding to the image frame;

determining the video features of the video to be processed based on each image frame and the sound source orientation attention feature map corresponding to each image frame.

Optionally, for the refined and extended functions of the program, refer to the description above.

Finally, it should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts among the embodiments, reference may be made to one another.

The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A method for video feature determination, the method comprising:
acquiring a video to be processed;
determining a sound source azimuth attention feature map corresponding to the image frame aiming at the video frame of the video to be processed;
and determining the video characteristics of the video to be processed based on each image frame and the sound source azimuth attention characteristic map corresponding to each image frame.
2. The method according to claim 1, wherein the determining a sound source bearing attention feature map corresponding to the image frame comprises:
acquiring a sound signal corresponding to the image frame;
determining a sound source position based on the sound signals corresponding to the image frames;
and determining a sound source azimuth attention feature map corresponding to the image frame according to the image frame and the sound source azimuth.
3. The method of claim 2, wherein determining the sound source orientation based on the sound signals corresponding to the image frames comprises:
acquiring coordinates of at least one microphone element for acquiring the sound signal;
and calculating to obtain sound source coordinates based on the time difference of the sound signals reaching the at least one microphone element, wherein the sound source coordinates are used for representing the sound source azimuth.
4. The method according to claim 3, wherein the determining a sound source azimuth attention feature map corresponding to the image frame according to the image frame and the sound source azimuth comprises:
determining image plane coordinates of the sound source in the image frame based on the sound source coordinates;
generating two-dimensional random Gaussian distribution by taking the image plane coordinates as a mean value and taking a random value as a standard deviation;
and carrying out normalization processing on the two-dimensional random Gaussian distribution to obtain a sound source azimuth attention feature map corresponding to the image frame.
5. The method according to claim 1, wherein the determining the video feature of the video to be processed based on each image frame and the sound source azimuth attention feature map corresponding to each image frame comprises:
inputting the image frames and the sound source position attention feature maps corresponding to the image frames into a feature generation network, wherein the feature generation network outputs video features of the video to be processed based on the image frames and the sound source position attention feature maps corresponding to the image frames;
the feature generation network is obtained by training by taking a training image frame and a sound source azimuth attention feature map corresponding to the training image frame as training samples and taking a preset computer vision processing task result marked by the training image frame as a sample label.
6. The method according to claim 5, wherein the feature generation network comprises an encoding module and a feature fusion module;
and the feature generation network outputting video features of the video to be processed based on each image frame and the sound source azimuth attention feature map corresponding to each image frame comprises:
for each image frame, the encoding module encoding the image frame and the sound source azimuth attention feature map corresponding to the image frame to obtain an enhanced feature map of the image frame;
and the feature fusion module performing a feature fusion operation on the enhanced feature maps of the image frames to obtain the video features of the video to be processed.
7. The method according to claim 6, wherein the encoding module comprises an image frame encoding module and a sound source azimuth attention feature map encoding module, the image frame encoding module comprising I cascaded encoding units, each encoding unit comprising a convolution layer, a down-sampling layer and a feature fusion layer;
the sound source azimuth attention feature map encoding module comprises I cascaded down-sampling layers, wherein the output of the i-th down-sampling layer of the sound source azimuth attention feature map encoding module serves as an input of the feature fusion layer in the i-th encoding unit of the image frame encoding module;
and I is an integer greater than or equal to 1, and i is an integer greater than or equal to 1 and less than or equal to I.
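One way the two-branch encoder of claim 7 could be realised in PyTorch: the image branch alternates convolution and down-sampling, the attention branch is down-sampled in lockstep, and the i-th down-sampled attention map is fused into the i-th encoding unit. Fusion by element-wise multiplication, the channel widths, and the pooling operator are all assumptions; the claim fixes only the layer inventory and the wiring.

```python
import torch
import torch.nn as nn

class EncodingModule(nn.Module):
    """Claim 7 sketch: I encoding units for the image frame, I down-sampling
    layers for the attention map, fused unit by unit."""

    def __init__(self, num_units: int, base_channels: int = 32):
        super().__init__()
        self.convs = nn.ModuleList()
        in_ch = 3
        for i in range(num_units):
            out_ch = base_channels * (2 ** i)
            self.convs.append(nn.Conv2d(in_ch, out_ch, 3, padding=1))
            in_ch = out_ch
        self.down = nn.MaxPool2d(2)  # down-sampling, shared by both branches

    def forward(self, frame, attn_map):
        x = frame     # (B, 3, H, W) image frame
        a = attn_map  # (B, 1, H, W) sound source azimuth attention feature map
        for conv in self.convs:
            x = self.down(torch.relu(conv(x)))  # convolution + down-sampling
            a = self.down(a)                    # i-th attention down-sampling
            x = x * a                           # feature fusion layer (assumed: gating)
        return x  # enhanced feature map of the image frame
```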
8. The method according to claim 6, wherein the feature fusion module comprises a temporal pooling layer and a feature fusion layer;
and the feature fusion module performing the feature fusion operation on the enhanced feature maps of the image frames to obtain the video features of the video to be processed comprises:
the temporal pooling layer performing a temporal pooling operation on the enhanced feature map of each image frame to obtain a temporal pooling feature map of each image frame;
and the feature fusion layer fusing the temporal pooling feature maps of the image frames to obtain the video features of the video to be processed.
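A sketch of claim 8's fusion module, assuming a sliding-window average over the time axis as the per-frame temporal pooling and a pool-then-linear fusion layer; the claim itself fixes neither choice.

```python
import torch
import torch.nn as nn

class FeatureFusionModule(nn.Module):
    """Claim 8 sketch: a temporal pooling layer that smooths each frame's
    enhanced feature map over a temporal window, then a feature fusion layer
    that merges the per-frame results into one video feature vector."""

    def __init__(self, channels: int, feature_dim: int, window: int = 3):
        super().__init__()
        # Average pooling along the time axis with stride 1 and same-padding,
        # so every image frame keeps its own temporal pooling feature map.
        self.temporal_pool = nn.AvgPool3d((window, 1, 1), stride=1,
                                          padding=(window // 2, 0, 0))
        self.fuse = nn.Linear(channels, feature_dim)

    def forward(self, enhanced_maps):
        # enhanced_maps: (B, T, C, H, W) enhanced feature maps per image frame.
        x = enhanced_maps.permute(0, 2, 1, 3, 4)  # (B, C, T, H, W)
        x = self.temporal_pool(x)                 # temporal pooling per frame
        x = x.permute(0, 2, 1, 3, 4)              # back to (B, T, C, H, W)
        # Fusion (assumed): global average over frames and space, then linear.
        pooled = x.mean(dim=(1, 3, 4))            # (B, C)
        return self.fuse(pooled)                  # video features of the video
```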
9. An apparatus for video feature determination, the apparatus comprising:
an acquisition unit, configured to acquire a video to be processed;
a sound source azimuth attention feature map determination unit, configured to determine, for each image frame of the video to be processed, a sound source azimuth attention feature map corresponding to the image frame;
and a video feature determination unit, configured to determine video features of the video to be processed based on each image frame and the sound source azimuth attention feature map corresponding to each image frame.
10. A video feature determination device, comprising a memory and a processor;
the memory being configured to store a program;
and the processor being configured to execute the program to implement the steps of the video feature determination method according to any one of claims 1 to 8.
11. A readable storage medium, on which a computer program is stored which, when executed by a processor, implements the steps of the video feature determination method according to any one of claims 1 to 8.
CN202210079236.3A 2022-01-24 2022-01-24 Video feature determination method, related device and readable storage medium Pending CN114511808A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210079236.3A CN114511808A (en) 2022-01-24 2022-01-24 Video feature determination method, related device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210079236.3A CN114511808A (en) 2022-01-24 2022-01-24 Video feature determination method, related device and readable storage medium

Publications (1)

Publication Number Publication Date
CN114511808A true CN114511808A (en) 2022-05-17

Family

ID=81549569

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210079236.3A Pending CN114511808A (en) 2022-01-24 2022-01-24 Video feature determination method, related device and readable storage medium

Country Status (1)

Country Link
CN (1) CN114511808A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635676A (en) * 2018-11-23 2019-04-16 清华大学 A method of positioning source of sound from video
US10361673B1 (en) * 2018-07-24 2019-07-23 Sony Interactive Entertainment Inc. Ambient sound activated headphone
CN111539449A (en) * 2020-03-23 2020-08-14 广东省智能制造研究所 Sound source separation and positioning method based on second-order fusion attention network model
CN113099374A (en) * 2021-03-30 2021-07-09 四川省人工智能研究院(宜宾) Audio frequency three-dimensional method based on multi-attention audio-visual fusion
WO2022002214A1 (en) * 2020-07-02 2022-01-06 影石创新科技股份有限公司 Video editing method and apparatus, computer readable storage medium, and camera


Similar Documents

Publication Publication Date Title
CN109255352B (en) Target detection method, device and system
CN108764133B (en) Image recognition method, device and system
CN108647639B (en) A real-time human skeleton joint detection method
CN107808111B (en) Method and apparatus for pedestrian detection and attitude estimation
CN111583097A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN108256404B (en) Pedestrian detection method and device
CN111985343A (en) Method for constructing behavior recognition deep network model and behavior recognition method
CN116309781B (en) Cross-modal fusion-based underwater visual target ranging method and device
Kogler et al. Enhancement of sparse silicon retina-based stereo matching using belief propagation and two-stage postfiltering
CN111310590B (en) Action recognition method and electronic equipment
CN115239765B (en) Infrared image target tracking system and method based on multi-scale deformable attention
Liu et al. Online human action recognition with spatial and temporal skeleton features using a distributed camera network
CN114038067B (en) Coal mine personnel behavior detection method, equipment and storage medium
JP2022549661A (en) IMAGE PROCESSING METHOD, APPARATUS, DEVICE, STORAGE MEDIUM AND COMPUTER PROGRAM
CN116206196B (en) A multi-target detection method and detection system in marine low-light environment
CN114511808A (en) Video feature determination method, related device and readable storage medium
CN118247731A (en) A crowd counting model, method and system based on machine vision
CN108596068B (en) A method and device for motion recognition
CN116935474A (en) Micro-expression recognition method and device, electronic equipment and storage medium
CN117237386A (en) Method, device and computer equipment for carrying out structuring processing on target object
CN112613383A (en) Joint point detection method, posture recognition method and device
Zhu et al. Spatio-temporal Focus and Lightweight Memory Network for Continuous Object Detection with Event Camera
CN113255408B (en) Behavior recognition method, behavior recognition device, electronic equipment and storage medium
Zhang et al. GIVFuse: Global Infrared-Visible Fusion Method Based on L1 Distance
CN114723718A (en) Region extraction method and device for large-scene image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination