CN108229355B

CN108229355B - Behavior recognition method and device, electronic device, computer storage medium

Info

Publication number: CN108229355B
Application number: CN201711407861.1A
Authority: CN
Inventors: 颜思捷; 熊元骏; 林达华
Original assignee: Beijing Sensetime Technology Development Co Ltd
Current assignee: Shanghai Chunjian Information Technology Co ltd
Priority date: 2017-12-22
Filing date: 2017-12-22
Publication date: 2021-03-23
Anticipated expiration: 2037-12-22
Also published as: CN108229355A

Abstract

Embodiments of the present disclosure provide a behavior recognition method and apparatus, an electronic device, a computer storage medium, and a program. The method includes: performing human body key point detection on at least one frame of video image to obtain multiple human body key points of the at least one frame of video image; and obtaining a behavior recognition result of the at least one frame of video image based on the feature information of the multiple human body key points and the association information of the multiple human body key points. By combining the feature information of the human body key points with the association information between them, the embodiments make full use of both local and overall information and improve the accuracy of behavior recognition.

Description

Behavior recognition method and device, electronic device, computer storage medium

Technical Field

The present disclosure relates to computer vision technology, and in particular to a behavior recognition method and apparatus, an electronic device, and a computer storage medium.

Background

Behavior recognition identifies the actions or behaviors of people in a video, such as swimming, running, or sweeping the floor, and plays an important role in understanding the content and meaning of the video. Behavior recognition can take video images, speech, or the coordinates of human body key points as input and use a neural network to output the behavior category.

Summary of the Invention

Embodiments of the present disclosure provide a behavior recognition technology.

According to one aspect of the embodiments of the present disclosure, a behavior recognition method is provided, including:

performing human body key point detection on at least one frame of video image to obtain multiple human body key points of the at least one frame of video image;

obtaining, based on the feature information of the multiple human body key points of the at least one frame of video image and the association information of the multiple human body key points, a behavior recognition result of each frame of video image in the at least one frame of video image.

In another embodiment based on the above method of the present invention, the feature information of a human body key point includes the coordinate information of the human body key point; or

the feature information of a human body key point includes the coordinate information of the human body key point together with the estimated confidence of the human body key point and/or the initial feature corresponding to the human body key point.
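The two forms of feature information described above can be illustrated with a minimal sketch. The `KeyPoint` type, its field names, and the `feature_vector` helper are hypothetical and chosen only for illustration; the initial-feature variant is omitted, and the disclosure does not prescribe any particular data layout.

```python
from typing import NamedTuple, Optional, Tuple


class KeyPoint(NamedTuple):
    """One detected human body key point (all field names are illustrative)."""
    frame: int                           # index of the video frame
    part: int                            # index of the body part
    x: float                             # horizontal coordinate
    y: float                             # vertical coordinate
    confidence: Optional[float] = None   # optional estimated confidence


def feature_vector(kp: KeyPoint) -> Tuple[float, ...]:
    """Return coordinates only, or coordinates plus confidence when present."""
    if kp.confidence is None:
        return (kp.x, kp.y)
    return (kp.x, kp.y, kp.confidence)
```

For example, `feature_vector(KeyPoint(0, 3, 12.0, 40.5, 0.9))` yields a three-element tuple, while omitting the confidence yields only the two coordinates.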

In another embodiment based on the above method of the present invention, the association information of the multiple human body key points includes any one or more of the following: spatial association information between at least two human body key points in the same frame of video image, and temporal association information between at least two human body key points that correspond to the same body part and belong to adjacent frames of video images in the at least one frame of video image.

The temporal association information between at least two human body key points that correspond to the same body part and belong to different frames of video images in the at least one frame of video image indicates the movement trajectory of that body part over time in the at least one frame of video image.

In another embodiment based on the above method of the present invention, the at least one frame of video image is specifically multiple consecutive frames of video images in a video; and/or

the spatial association information between at least two human body key points in the same frame of video image is determined according to the connectivity of the human body structure.

In another embodiment based on the above method of the present invention, the spatial association information between the at least two human body key points includes the spatial adjacency of the at least two key points; and/or

the temporal association information between the at least two key points includes the adjacency of the frames to which the at least two key points belong.

In another embodiment based on the above method of the present invention, after performing human body key point detection on the at least one frame of video image and obtaining the multiple human body key points of the at least one frame of video image, the method further includes:

establishing a space-time graph based on the multiple human body key points in the at least one frame of video image, wherein the space-time graph contains the feature information of the multiple human body key points in the at least one frame of video image and the association information of the multiple human body key points;

obtaining the behavior recognition result of the at least one frame of video image based on the feature information of the multiple human body key points of the at least one frame of video image and the association information of the multiple human body key points includes:

obtaining the behavior recognition result of the at least one frame of video image based on the space-time graph.

In another embodiment based on the above method of the present invention, the space-time graph includes a plurality of nodes corresponding to the multiple human body key points, and each of the plurality of nodes includes the feature information of the corresponding human body key point;

each of the plurality of nodes has at least one edge, and the edges of the plurality of nodes indicate the association relationships among the multiple human body key points.

The at least one edge of a human body key point indicates the association relationship between that human body key point and other human body key points.

In another embodiment based on the above method of the present invention, a first node among the plurality of nodes has a spatial edge with each of at least one second node, wherein the first human body key point corresponding to the first node and the second human body key point corresponding to each second node belong to the same frame, and the body part corresponding to the first human body key point is directly connected to the body part corresponding to each second human body key point; and/or

the first node has a temporal edge with each of at least one third node, wherein the first human body key point and the third human body key point corresponding to each third node correspond to the same body part and belong to adjacent frames.

The number of the plurality of nodes is equal to the number of the multiple human body key points, and the nodes are in one-to-one correspondence with the human body key points.

In another embodiment based on the above method of the present invention, establishing the space-time graph based on the multiple human body key points in the at least one frame of video image includes:

connecting, with spatial edges and according to the connectivity of the human body structure, at least two human body key points located in the same frame of video image;

connecting, with temporal edges, at least two human body key points of the same body part in adjacent frames of the at least one frame of video image.
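The graph construction described above can be sketched as follows. This is a minimal illustration under stated assumptions: nodes are identified by hypothetical `(frame, part)` pairs, and `SKELETON` is a made-up five-part connectivity list, not a skeleton layout taken from the disclosure.

```python
from typing import FrozenSet, List, Set, Tuple

Node = Tuple[int, int]      # (frame index, body-part index); illustrative encoding
Edge = FrozenSet[Node]      # undirected edge between two nodes

# Hypothetical skeleton: 0 head, 1 torso, 2 left arm, 3 right arm, 4 legs.
SKELETON: List[Tuple[int, int]] = [(0, 1), (1, 2), (1, 3), (1, 4)]


def build_space_time_graph(
    num_frames: int, num_parts: int, skeleton: List[Tuple[int, int]]
) -> Tuple[Set[Edge], Set[Edge]]:
    """Return (spatial_edges, temporal_edges) over all (frame, part) nodes."""
    spatial: Set[Edge] = set()
    temporal: Set[Edge] = set()
    for t in range(num_frames):
        # Spatial edges: directly connected body parts within the same frame.
        for a, b in skeleton:
            spatial.add(frozenset({(t, a), (t, b)}))
        # Temporal edges: the same body part in adjacent frames.
        if t + 1 < num_frames:
            for p in range(num_parts):
                temporal.add(frozenset({(t, p), (t + 1, p)}))
    return spatial, temporal
```

With 3 frames and the 5-part skeleton above, this produces 4 spatial edges per frame and 5 temporal edges per pair of adjacent frames.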

In another embodiment based on the above method of the present invention, obtaining the behavior recognition result of the at least one frame of video image based on the space-time graph includes:

inputting the space-time graph into a convolutional neural network to obtain the behavior recognition result of the at least one frame of video image.

In another embodiment based on the above method of the present invention, inputting the space-time graph into the convolutional neural network for processing to obtain the behavior recognition result of the at least one frame of video image includes:

performing convolution processing on the multiple human body key points based on the association information between the multiple human body key points to obtain convolution processing results of the multiple human body key points;

obtaining the behavior recognition result of the at least one frame of video image based on the convolution processing results of the multiple human body key points.

In another embodiment based on the above method of the present invention, performing convolution processing on the multiple human body key points based on the association information of the multiple human body key points to obtain the convolution processing results of the multiple human body key points includes:

determining, based on the association information of the multiple human body key points, at least one fifth human body key point that has an association relationship with a fourth human body key point among the multiple human body key points;

obtaining a convolution processing result of the fourth human body key point based on the feature information of each of the fourth human body key point and the at least one fifth human body key point.

The at least one fifth human body key point includes at least one human body key point that has a spatial association relationship with the fourth human body key point; or

the at least one fifth human body key point includes at least one human body key point that has a spatial association relationship with the fourth human body key point and at least one human body key point that has a temporal association relationship with the fourth human body key point.
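The selection of the associated ("fifth") key points for a given ("fourth") key point can be sketched as a neighbor lookup. The `(frame, part)` node encoding, the `skeleton` edge list, and the `include_temporal` flag are assumptions made for this illustration only.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (frame index, body-part index); illustrative encoding


def associated_key_points(
    node: Node,
    num_frames: int,
    skeleton: List[Tuple[int, int]],
    include_temporal: bool = True,
) -> List[Node]:
    """List the nodes that share a spatial (and optionally temporal) edge with `node`."""
    t, p = node
    result: List[Node] = []
    # Spatial association: directly connected body parts in the same frame.
    for a, b in skeleton:
        if a == p:
            result.append((t, b))
        elif b == p:
            result.append((t, a))
    # Temporal association: the same body part in the adjacent frames.
    if include_temporal:
        for dt in (-1, 1):
            if 0 <= t + dt < num_frames:
                result.append((t + dt, p))
    return result
```

Passing `include_temporal=False` corresponds to the first alternative above (spatial association only); the default corresponds to the second (spatial plus temporal association).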

In another embodiment based on the above method of the present invention, obtaining the convolution processing result of the fourth human body key point based on the feature information of each of the fourth human body key point and the at least one fifth human body key point includes:

performing convolution processing on each of the fourth human body key point and the at least one fifth human body key point, using the convolution parameters corresponding to the human body key point set to which that key point belongs, to obtain an initial convolution result of each human body key point;

obtaining the convolution processing result of the fourth human body key point based on the initial convolution results of the fourth human body key point and the at least one fifth human body key point.

In another embodiment based on the above method of the present invention, before performing the convolution processing on each human body key point, the method further includes:

dividing the fourth human body key point and the at least one fifth human body key point into at least one human body key point set, wherein each human body key point set includes at least one human body key point;

determining the convolution parameters of each of the fourth human body key point and the at least one fifth human body key point based on the human body key point set to which that key point belongs, wherein human body key points belonging to different human body key point sets correspond to different convolution parameters.
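The set-dependent convolution can be sketched as below. Each key point is transformed by the weight matrix of the set it belongs to, giving its initial convolution result, and the initial results are then combined. Combining by plain summation is an assumption of this sketch; the text above only states that the final result is based on the initial results. All names (`partitioned_conv`, `set_of`, `weights`) are hypothetical.

```python
from typing import Dict, List, Optional, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Sequence[float]) -> Vector:
    """Multiply matrix `w` by vector `x`."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def partitioned_conv(
    features: Dict[str, Vector],   # key point name -> feature vector
    set_of: Dict[str, int],        # key point name -> index of its key point set
    weights: Dict[int, Matrix],    # set index -> convolution parameters
) -> Optional[Vector]:
    """Combine per-key-point initial convolution results (summation is assumed)."""
    out: Optional[Vector] = None
    for name, x in features.items():
        # Initial convolution result: parameters depend on the key point's set.
        y = matvec(weights[set_of[name]], x)
        out = y if out is None else [a + b for a, b in zip(out, y)]
    return out  # None when `features` is empty
```

Key points in different sets use different weight matrices, while key points in the same set share one.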

In another embodiment based on the above method of the present invention, the at least one human body key point set includes a first human body key point set and a second human body key point set;

dividing the fourth human body key point and the at least one fifth human body key point into at least one human body key point set includes:

assigning the fourth human body key point to the first human body key point set, and assigning the at least one fifth human body key point to the second human body key point set.

In another embodiment based on the above method of the present invention, dividing the fourth human body key point and the at least one fifth human body key point into at least one human body key point set includes:

dividing the fourth human body key point and the at least one fifth human body key point into at least one human body key point set based on the distance between each of these human body key points and a reference point.

In another embodiment based on the above method of the present invention, dividing the fourth human body key point and the at least one fifth human body key point into at least one human body key point set based on the distance between each of these human body key points and the reference point includes:

determining a first distance between the fourth human body key point and the reference point based on the feature information of the fourth human body key point;

determining the key point set to which each of the fourth human body key point and the at least one fifth human body key point belongs, based on how the distance between that key point and the reference point compares with the first distance.

In another embodiment based on the above method of the present invention, determining the key point set to which each human body key point belongs, based on how the distance between that key point and the reference point compares with the first distance, includes:

determining that a human body key point whose distance from the reference point is less than the first distance belongs to a first key point set; and/or

determining that a human body key point whose distance from the reference point is equal to the first distance belongs to a second key point set; and/or

determining that a human body key point whose distance from the reference point is greater than the first distance belongs to a third key point set.
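The three-way distance partition above can be sketched as follows. The function name, the use of 2D Euclidean distance, and the floating-point tolerance `tol` used for the "equal" case are assumptions of this sketch; the text does not specify how the reference point is chosen or how equality is tested.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def partition_by_distance(
    root: Point,                  # the "fourth" key point
    neighbors: Sequence[Point],   # the "fifth" key points
    reference: Point,             # reference point (choice is application-specific)
    tol: float = 1e-9,            # tolerance for the "equal distance" case (assumed)
) -> Dict[int, List[Point]]:
    """Assign each key point to set 1 (closer), 2 (equal), or 3 (farther)."""
    first_distance = math.dist(root, reference)
    sets: Dict[int, List[Point]] = {1: [], 2: [], 3: []}
    for p in [root, *neighbors]:
        d = math.dist(p, reference)
        if math.isclose(d, first_distance, abs_tol=tol):
            sets[2].append(p)
        elif d < first_distance:
            sets[1].append(p)
        else:
            sets[3].append(p)
    return sets
```

The root key point itself always lands in the second set, since its distance to the reference point is the first distance by definition.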

In another embodiment based on the above method of the present invention, obtaining the behavior recognition result of the at least one frame of video image based on the convolution processing results of the multiple human body key points includes:

performing global pooling on the convolution processing result of each of the multiple human body key points included in the space-time graph to obtain a pooling result;

obtaining the behavior recognition result of the at least one frame of video image based on the pooling result.

In another embodiment based on the above method of the present invention, the pooling result includes a one-dimensional feature vector;

obtaining the behavior recognition result of each frame of video image in the at least one frame of video image based on the pooling result includes:

processing the one-dimensional feature vector with a fully connected layer to obtain a recognition vector, wherein the number of values in the recognition vector corresponds to the number of behavior classification categories;

obtaining the human behavior classification in the video image based on the values in the recognition vector.
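The pooling and classification steps above can be sketched in plain Python. Using average pooling and picking the category with the largest value are assumptions of this sketch; the text states only that global pooling yields a one-dimensional feature vector and that the classification is derived from the values of the recognition vector. The function names and the example labels are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def global_average_pool(node_features: Sequence[Sequence[float]]) -> Vector:
    """Average the per-key-point features into one one-dimensional vector."""
    n = len(node_features)
    dim = len(node_features[0])
    return [sum(f[i] for f in node_features) / n for i in range(dim)]


def fully_connected(x: Sequence[float], weights: Sequence[Sequence[float]],
                    bias: Sequence[float]) -> Vector:
    """One output value per behavior category: weights @ x + bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def classify(node_features: Sequence[Sequence[float]],
             weights: Sequence[Sequence[float]],
             bias: Sequence[float],
             labels: Sequence[str]) -> str:
    """Pool, apply the fully connected layer, and return the top-scoring label."""
    pooled = global_average_pool(node_features)
    scores = fully_connected(pooled, weights, bias)
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best]
```

Here `weights` has one row per behavior category, so the recognition vector returned by `fully_connected` has as many values as there are categories.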

According to another aspect of the embodiments of the present disclosure, a behavior recognition apparatus is provided, including:

a key point detection unit, configured to perform human body key point detection on at least one frame of video image to obtain multiple human body key points of the at least one frame of video image;

a behavior recognition unit, configured to obtain a behavior recognition result of the at least one frame of video image based on the feature information of the multiple human body key points of the at least one frame of video image and the association information of the multiple human body key points.

In another embodiment based on the above apparatus of the present invention, the feature information of a human body key point includes the coordinate information of the human body key point; or,

In another embodiment based on the above apparatus of the present invention, the association information of the multiple human body key points includes any one or more of the following: spatial association information between at least two human body key points in the same frame of video image, and temporal association information between at least two human body key points that correspond to the same body part and belong to adjacent frames of video images in the at least one frame of video image.

In another embodiment based on the above apparatus of the present invention, the at least one frame of video image is specifically multiple consecutive frames of video images in a video; and/or

In another embodiment based on the above apparatus of the present invention, the spatial association information between the at least two human body key points includes the spatial adjacency of the at least two key points, and/or

In another embodiment based on the above apparatus of the present invention, the apparatus further includes:

a graph establishing unit, configured to establish a space-time graph based on the multiple human body key points in the at least one frame of video image, wherein the space-time graph contains the feature information of the multiple human body key points in the at least one frame of video image and the association information of the multiple human body key points;

the behavior recognition unit is specifically configured to obtain the behavior recognition result of the at least one frame of video image based on the space-time graph.

In another embodiment based on the above apparatus of the present invention, the space-time graph includes a plurality of nodes corresponding to the multiple human body key points, and each of the plurality of nodes includes the feature information of the corresponding human body key point;

In another embodiment based on the above apparatus of the present invention,

a first node among the plurality of nodes has a spatial edge with each of at least one second node, wherein the first human body key point corresponding to the first node and the second human body key point corresponding to each second node belong to the same frame, and the body part corresponding to the first human body key point is directly connected to the body part corresponding to each second human body key point, and/or

In another embodiment based on the above apparatus of the present invention, the behavior recognition unit is specifically configured to input the space-time graph into a convolutional neural network to obtain the behavior recognition result of the at least one frame of video image.

In another embodiment based on the above apparatus of the present invention, the behavior recognition unit includes:

a convolution processing module, configured to perform convolution processing on the multiple human body key points based on the association information between the multiple human body key points to obtain convolution processing results of the multiple human body key points;

a convolution recognition module, configured to obtain the behavior recognition result of the at least one frame of video image based on the convolution processing results of the multiple human body key points.

In another embodiment based on the above apparatus of the present invention, the convolution processing module includes:

an association determination module, configured to determine, based on the association information of the multiple human body key points, at least one fifth human body key point that has an association relationship with a fourth human body key point among the multiple human body key points;

a feature processing module, configured to obtain a convolution processing result of the fourth human body key point based on the feature information of each of the fourth human body key point and the at least one fifth human body key point.

In another embodiment based on the above apparatus of the present invention, the feature processing module is specifically configured to perform convolution processing on each of the fourth human body key point and the at least one fifth human body key point, using the convolution parameters corresponding to the human body key point set to which that key point belongs, to obtain an initial convolution result of each human body key point;

In another embodiment based on the above apparatus of the present invention, the behavior recognition unit further includes:

a classification module, configured to divide the fourth human body key point and the at least one fifth human body key point into at least one human body key point set, wherein each human body key point set includes at least one human body key point;

a parameter determination module, configured to determine the convolution parameters of each of the fourth human body key point and the at least one fifth human body key point based on the human body key point set to which that key point belongs, wherein human body key points belonging to different human body key point sets correspond to different convolution parameters.

In another embodiment based on the above apparatus of the present invention, the at least one human body key point set includes a first human body key point set and a second human body key point set;

the classification module is specifically configured to assign the fourth human body key point to the first human body key point set and to assign the at least one fifth human body key point to the second human body key point set.

In another embodiment based on the above apparatus of the present invention, the classification module is specifically configured to divide the fourth human body key point and the at least one fifth human body key point into at least one human body key point set based on the distance between each of these human body key points and a reference point.

In another embodiment based on the above apparatus of the present invention, the classification module includes:

a first distance module, configured to determine a first distance between the fourth human body key point and the reference point based on the feature information of the fourth human body key point;

a first relationship module, configured to determine the key point set to which each of the fourth human body key point and the at least one fifth human body key point belongs, based on how the distance between that key point and the reference point compares with the first distance.

In another embodiment based on the above apparatus of the present invention, the first relationship module is specifically configured to determine that a human body key point whose distance from the reference point is less than the first distance belongs to a first key point set; and/or

In another embodiment based on the above apparatus of the present invention, the convolution recognition module is specifically configured to perform global pooling on the convolution processing result of each of the multiple human body key points included in the space-time graph to obtain a pooling result;

According to another aspect of the embodiments of the present disclosure, an electronic device is provided, including a processor, wherein the processor includes the behavior recognition apparatus described above.

According to another aspect of the embodiments of the present disclosure, an electronic device is provided, including: a memory for storing executable instructions;

and a processor for communicating with the memory to execute the executable instructions so as to carry out the behavior recognition method described above.

According to another aspect of the embodiments of the present disclosure, a computer storage medium is provided for storing computer-readable instructions, wherein the behavior recognition method described above is performed when the instructions are executed.

According to another aspect of the embodiments of the present disclosure, a computer program is provided, including computer-readable code, wherein, when the computer-readable code runs on a device, a processor in the device executes instructions for implementing the behavior recognition method described above.

According to yet another aspect of the embodiments of the present disclosure, a computer program product is provided for storing computer-readable instructions which, when executed, cause a computer to perform the behavior recognition method described in any of the above possible implementations.

In one optional implementation, the computer program product is specifically a computer storage medium; in another optional implementation, the computer program product is specifically a software product, such as an SDK.

Embodiments of the present disclosure further provide another behavior recognition method and a corresponding apparatus, electronic device, computer storage medium, computer program, and computer program product, where the method includes: performing human body key point detection on at least one frame of video image to obtain multiple human body key points of the at least one frame of video image; and obtaining a behavior recognition result of the at least one frame of video image based on feature information of the multiple human body key points and association information of the multiple human body key points.

With the behavior recognition method and apparatus, electronic device, computer storage medium, and program provided by the foregoing embodiments of the present disclosure, key point detection is performed on at least one frame of video image to obtain multiple human body key points of the at least one frame of video image, and a behavior recognition result of the at least one frame of video image is obtained based on feature information of the multiple human body key points and association information of the multiple human body key points. This overcomes the drawback of the prior art, in which processing all key points together makes it difficult to attend to local information. By combining the feature information of the human body key points with the association information between them, both local information and overall information are fully used, which improves the accuracy of behavior recognition.

The technical solutions of the present disclosure are described in further detail below with reference to the accompanying drawings and embodiments.

Description of the Drawings

The accompanying drawings, which form a part of the specification, illustrate embodiments of the present disclosure and, together with the description, serve to explain the principles of the present disclosure.

The present disclosure can be understood more clearly from the following detailed description with reference to the accompanying drawings, in which:

FIG. 1 is a schematic flowchart of a behavior recognition method provided by an embodiment of the present disclosure.

FIG. 2 is a schematic structural diagram of a space-time graph constructed in an optional example of the behavior recognition method of the present disclosure.

FIGS. 3a-3d are specific example diagrams of the behavior recognition method provided by an embodiment of the present disclosure.

FIG. 4 is a schematic flowchart of a specific example of the behavior recognition method of the present disclosure.

FIG. 5 is a schematic structural diagram of an embodiment of the behavior recognition apparatus of the present disclosure.

FIG. 6 is a schematic structural diagram of an electronic device used to implement an embodiment of the present application.

Detailed Description

Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that, unless specifically stated otherwise, the relative arrangement of the components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure.

It should also be understood that, for ease of description, the dimensions of the parts shown in the accompanying drawings are not drawn to actual scale.

The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the present disclosure or its application or use.

Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and devices should be considered part of the specification.

It should be noted that similar reference numerals and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it need not be discussed further in subsequent figures.

Embodiments of the present disclosure may be applied to computer systems/servers, which can operate with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations suitable for use with computer systems/servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the foregoing systems.

A computer system/server may be described in the general context of computer-system-executable instructions, such as program modules, executed by the computer system. Generally, program modules may include routines, programs, object programs, components, logic, data structures, and so on, which perform particular tasks or implement particular abstract data types. The computer system/server may be implemented in a distributed cloud computing environment, in which tasks are performed by remote processing devices linked through a communication network. In a distributed cloud computing environment, program modules may be located on local or remote computing system storage media that include storage devices.

In the course of implementing the present disclosure, the inventors found at least the following problem: prior-art methods, whether they use an LSTM network or a ResNet network, do not take into account the natural spatial connections between human body key points.

The prior art ignores the connections between adjacent key points and indiscriminately treats the key points as a vector that is fed directly into the network. Because the network considers all human body key points from the very beginning, it is difficult for the model to attend to local information.

FIG. 1 is a schematic flowchart of a behavior recognition method provided by an embodiment of the present disclosure. As shown in FIG. 1, the method of this embodiment includes:

Step 110: perform key point detection on at least one frame of video image to obtain multiple human body key points of the at least one frame of video image.

Specifically, the at least one frame of video image may come from a video captured by the device itself, a video input by a user, or a video obtained from another device; the embodiments of the present disclosure do not limit how the at least one frame of video image is obtained. In addition, the at least one frame of video image may be one frame or multiple frames of video images, and the embodiments of the present disclosure do not limit the number of frames. Optionally, the at least one frame of video image may be consecutive video images; for example, the at least one frame of video image may belong to a video clip corresponding to one action or behavior, and one behavior recognition result can be obtained based on this video clip. Alternatively, each frame of the at least one frame of video image may correspond to one action or behavior; for example, the at least one frame of video image is non-consecutive and is obtained by taking one frame at a set interval of frames in the video, although the embodiments of the present disclosure are not limited in this respect.
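The interval-based frame selection mentioned above can be sketched as follows. This is only an illustration: the function name `sample_frame_indices`, the 0-based frame indexing, and the choice to start at the first frame are assumptions, not details given in the disclosure.

```python
from typing import List


def sample_frame_indices(num_frames: int, interval: int) -> List[int]:
    """Return the indices of the frames kept when one frame is taken
    every `interval` frames, starting from frame 0 (assumed convention)."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    return list(range(0, num_frames, interval))


# interval == 1 keeps every frame (the consecutive case);
# a larger interval yields non-consecutive frames.
consecutive = sample_frame_indices(num_frames=5, interval=1)
sparse = sample_frame_indices(num_frames=10, interval=3)
```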

Specifically, human body key point detection may be performed on each frame of the at least one frame of video image to obtain at least one human body key point in each frame; in this way, multiple human body key points of the at least one frame of video image are obtained. Optionally, key point detection may be performed with a machine learning method, for example a neural network, a support vector machine, or a random forest (RF); the embodiments of the present disclosure do not limit how key point detection is implemented.

Step 120: obtain a behavior recognition result of the at least one frame of video image based on feature information of the multiple human body key points of the at least one frame of video image and association information of the multiple human body key points.

In one or more optional embodiments, the feature information of a human body key point may include coordinate information of the key point, for example its 2D coordinates or 3D coordinates. The feature information may further include the estimated confidence of the key point and/or an initial feature corresponding to the key point, where, optionally, the initial feature may be obtained by performing feature extraction at the position or in the region where the key point is located. The feature information may also include other information related to the key point; the embodiments of the present disclosure do not limit what information the feature information of a human body key point specifically includes.
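One way to hold the feature information described above is sketched below. The class name `KeyPoint`, its field names, and the layout of the feature vector (coordinates followed by the confidence) are illustrative assumptions; the disclosure does not prescribe a data layout.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class KeyPoint:
    """Hypothetical container for one detected human body key point."""
    frame: int                          # index of the frame the key point belongs to
    part: str                           # body part label, e.g. "left_elbow"
    coords: Tuple[float, ...]           # 2D (x, y) or 3D (x, y, z) coordinates
    confidence: Optional[float] = None  # estimated confidence, if available

    def feature_vector(self) -> List[float]:
        """Concatenate the coordinates and, when present, the confidence."""
        vec = list(self.coords)
        if self.confidence is not None:
            vec.append(self.confidence)
        return vec


kp_2d = KeyPoint(frame=0, part="left_elbow", coords=(120.0, 85.5), confidence=0.9)
kp_3d = KeyPoint(frame=0, part="left_wrist", coords=(1.0, 2.0, 3.0))
```

An initial feature extracted around the key point, which the text also allows, could be appended to the same vector in the same way.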

In step 120, in addition to the feature information of each of the multiple human body key points obtained from the at least one frame of video image, the association information between the multiple human body key points may also be used for behavior recognition, so as to improve its accuracy. Optionally, the association information of the multiple human body key points may include spatial association information and/or temporal association information of the multiple human body key points. As an example, the at least one frame of video image may be multiple consecutive frames of a video; in this case, the association information of the multiple human body key points may include both spatial association information and temporal association information. The spatial association information may indicate the spatial association between human body key points; for example, it may include the positional relationship of human body key points in the same frame of video image, such as being adjacent or directly connected. The temporal association information may indicate the temporal association between human body key points; for example, it may include the relationship between the frames to which key points corresponding to the same body part (for example, the same joint) belong, such as those frames being adjacent. The embodiments of the present disclosure are not limited in this respect.

Optionally, the spatial association between human body key points in the same frame of video image may be determined according to the connectivity of the human body structure (for example, the key point corresponding to the elbow may be regarded as adjacent to the key point corresponding to the wrist and the key point corresponding to the shoulder), or may be predefined; the embodiments of the present disclosure do not limit how the association between human body key points is determined.

As an optional example, when human body key point detection is performed, information about the body part corresponding to each key point, such as a body part label, may also be output, and the association between human body key points may be determined based on this body part information. Optionally, the temporal association information between human body key points that correspond to the same body part and belong to different frames of video image may indicate the trajectory of that body part over time in the at least one frame of video image.

In some optional examples, the spatial association information between human body key points may include the adjacency of the key points in spatial position; for example, it may include information indicating that the body parts corresponding to at least two human body key points in the same video image are directly connected. Optionally, the temporal association information between human body key points may include the adjacency of the frames to which the key points belong; for example, it may include information indicating that the frames to which at least two human body key points corresponding to the same body part belong are adjacent frames, although the embodiments of the present disclosure are not limited thereto.

In step 120, optionally, a machine learning method may be used to process the feature information and association information of the multiple human body key points to obtain the behavior recognition result of the at least one frame of video image. In an optional example, a neural network may be used for this processing; for example, the feature information and association information of the multiple human body key points may be input into a convolutional neural network for processing to obtain the behavior recognition result of the at least one frame of video image. The feature information and association information may be input into the convolutional neural network directly, or may be processed first and then input; the embodiments of the present disclosure do not limit the specific form of the input to the convolutional neural network.

In one or more optional embodiments, a space-time graph may be built based on the feature information and association information of the multiple human body key points in the at least one frame of video image; accordingly, in step 120, the behavior recognition result of the at least one frame of video image may be obtained based on the space-time graph. Optionally, the feature information and association information of the multiple human body key points may also be represented in other ways, and the embodiments of the present disclosure are not limited thereto.

Specifically, the space-time graph may be built based on the temporal association information and/or spatial association information of the multiple human body key points. The space-time graph may include multiple nodes and multiple edges, and each of the nodes may have at least one edge. In some optional examples, the human body key points serve as nodes, and the association information between key points is represented as edges in the space-time graph; that is, human body key points that are temporally and/or spatially associated may be connected by edges, and an edge indicates the association between the key points it connects. As an optional example, spatial edges and temporal edges may be used to indicate the spatial association and the temporal association between human body key points, respectively. In addition, a human body key point may optionally have a self-loop edge connecting it to itself; the embodiments of the present disclosure do not limit the specific implementation of the space-time graph.

As an example, assume that the multiple human body key points come from one frame of video image. The space-time graph may then include spatial edges, each obtained by connecting the node corresponding to a human body key point with the node corresponding to at least one other human body key point, and self-loop edges, each obtained by connecting the node corresponding to a human body key point with itself, where the body parts corresponding to the two key points that form a spatial edge are adjacent or directly connected.

As another example, assume that the multiple human body key points come from at least two frames of video image. In addition to spatial edges and self-loop edges, the space-time graph may then include temporal edges, where a temporal edge is obtained by connecting two human body key points that correspond to the same body part in two adjacent frames of video image, although the embodiments of the present disclosure are not limited thereto.

Optionally, the at least one edge of a human body key point may indicate the association between that key point and other human body key points. For example, different human body key points in the same video image correspond to different human joints, and the key points corresponding to the two joints at the two ends of one bone are associated. In an optional example of the space-time graph, a "temporal edge" exists between human body key points that correspond to the same human joint in adjacent frames (such as the "left elbow" in the third and fourth frames), and a "spatial edge" exists between adjacent human body key points in the same frame (such as the "left elbow" and "left wrist" in the fifth frame). Here, "adjacent human body key points" may be defined manually, determined according to the connectivity of the human body structure, or determined in other ways; for example, the key points corresponding to joints that are connected by the same bone or directly connected (such as the left elbow joint and the left wrist joint) are adjacent human body key points, although the embodiments of the present disclosure are not limited thereto.

In one or more optional embodiments, a first node among the multiple nodes has a spatial edge with each second node of at least one second node, where the first human body key point corresponding to the first node and the second human body key point corresponding to each second node belong to the same frame of video image, and the body part corresponding to the first human body key point is directly connected to the body part corresponding to each second human body key point.

In one or more optional embodiments, the first node has a temporal edge with each third node of at least one third node, where the first human body key point and the third human body key point corresponding to each third node correspond to the same body part and belong to adjacent frames.

Optionally, when building the space-time graph, spatial edges may first be used to connect at least two human body key points located in the same frame of video image, based on the connectivity of the human body structure; temporal edges may then be used to connect at least two human body key points of the same body part in adjacent frames of the at least one frame of video image. In this way, the graph can be built automatically without manual assignment, so that the same network structure is general and can be applied to scenarios with different nodes and different node connection structures, although the embodiments of the present disclosure are not limited in this respect.
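The two-stage construction described above (spatial edges first, then temporal edges, with optional self-loop edges) can be sketched as follows. The function name `build_space_time_graph`, the representation of a node as a `(frame, joint)` pair with 0-based indices, and the three-joint `ARM` skeleton are assumptions made for illustration only.

```python
from typing import List, Set, Tuple

Node = Tuple[int, int]    # (frame index t, joint index i), both 0-based here
Edge = Tuple[Node, Node]


def build_space_time_graph(
    num_frames: int,
    num_joints: int,
    skeleton: List[Tuple[int, int]],
    self_loops: bool = True,
) -> Tuple[Set[Edge], Set[Edge], Set[Edge]]:
    """Return (spatial edges, temporal edges, self-loop edges)."""
    spatial: Set[Edge] = set()
    temporal: Set[Edge] = set()
    loops: Set[Edge] = set()

    # Stage 1: within every frame, connect joints whose body parts are
    # directly connected according to the skeleton definition.
    for t in range(num_frames):
        for i, j in skeleton:
            spatial.add(((t, i), (t, j)))
        if self_loops:
            for i in range(num_joints):
                loops.add(((t, i), (t, i)))

    # Stage 2: connect the same joint across adjacent frames.
    for t in range(num_frames - 1):
        for i in range(num_joints):
            temporal.add(((t, i), (t + 1, i)))

    return spatial, temporal, loops


# Hypothetical three-joint arm: 0 = shoulder, 1 = elbow, 2 = wrist.
ARM = [(0, 1), (1, 2)]
spatial, temporal, loops = build_space_time_graph(num_frames=4, num_joints=3, skeleton=ARM)
```

Because the only skeleton-specific input is the list of connected joint pairs, the same routine applies to skeletons with a different number of joints or a different connection structure.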

Optionally, the number of nodes included in the space-time graph may or may not be equal to the number of human body key points; if the two numbers are equal, the nodes and the human body key points may be in one-to-one correspondence, although the embodiments of the present disclosure are not limited thereto.

Optionally, the behavior recognition result obtained in step 120 may be a behavior recognition result corresponding to each frame of the at least one frame of video image, or may be a behavior recognition result jointly corresponding to all of the video images in the at least one frame of video image. For example, the at least one frame of video image belongs to one video clip of a video stream, and the video clip corresponds to one human action; accordingly, one behavior recognition result can be obtained based on the at least one frame of video image in the video clip. Alternatively, the space-time graph may also be built through other procedures.

FIG. 2 is a schematic diagram of an example of a space-time graph constructed in an embodiment of the present disclosure. In the space-time graph, human body key points are associated with one another through spatial edges and temporal edges, and each human body key point has at least one edge to another human body key point. Accordingly, the adjacent key points of each human body key point in the time dimension and/or in the space dimension can be obtained from the space-time graph; the embodiments of the present disclosure do not limit the specific implementation of the space-time graph.
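Looking up a key point's adjacent key points in the time and space dimensions can be sketched as below. The helper names `build_adjacency` and `split_neighbours`, the `(frame, part)` node representation, and the tiny two-frame edge list are hypothetical; edges are treated as undirected.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Node = Tuple[int, str]   # (frame index, body part label)


def build_adjacency(edges: Iterable[Tuple[Node, Node]]) -> Dict[Node, Set[Node]]:
    """Map every node to the set of nodes it shares an edge with."""
    adj: Dict[Node, Set[Node]] = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def split_neighbours(adj: Dict[Node, Set[Node]], node: Node):
    """Separate a node's neighbours into spatial ones (same frame)
    and temporal ones (different frame)."""
    frame = node[0]
    spatial = {n for n in adj[node] if n[0] == frame and n != node}
    temporal = {n for n in adj[node] if n[0] != frame}
    return spatial, temporal


edges = [
    ((0, "elbow"), (0, "wrist")), ((0, "elbow"), (0, "shoulder")),   # spatial, frame 0
    ((1, "elbow"), (1, "wrist")), ((1, "elbow"), (1, "shoulder")),   # spatial, frame 1
    ((0, "elbow"), (1, "elbow")), ((0, "wrist"), (1, "wrist")),      # temporal
    ((0, "shoulder"), (1, "shoulder")),                              # temporal
]
adjacency = build_adjacency(edges)
spatial_nb, temporal_nb = split_neighbours(adjacency, (0, "elbow"))
```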

As an example, the space-time graph built from a skeleton sequence with N joints and T frames of video image can be expressed as $G = (V, E)$, where $V$ is a node set composed of multiple nodes and $E$ is an edge set composed of multiple edges.

Specifically, the node set $V = \{v_{ti} \mid t = 1, \ldots, T,\ i = 1, \ldots, N\}$ includes all key points of the skeleton sequence, and the feature vector of a node may include a coordinate vector and an estimated confidence, or may further include other information. As an optional example, the edge set $E$ may include two subsets. The first subset describes the connections between body parts within each frame of video image and is denoted $E_s = \{v_{ti}v_{tj} \mid (i, j) \in H\}$, where $H$ is the set of naturally connected human joints. The second subset contains the inter-frame edges, which connect the same joint in consecutive frames, and is denoted $E_F = \{v_{ti}v_{(t+1)i}\}$. All the edges in $E_F$ for a particular node represent the trajectory over time of the body part corresponding to that node.
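As a quick check on these definitions, the sizes of the sets can be counted directly. The counts below assume that each edge is undirected and counted once, that $H$ lists every connected joint pair once, and that self-loop edges are excluded; the numeric values of $N$, $T$, and $|H|$ are purely hypothetical.

```latex
\begin{align*}
|V|   &= N \cdot T, \\
|E_s| &= T \cdot |H|, \\
|E_F| &= (T - 1) \cdot N.
\end{align*}
% Hypothetical example: N = 18 joints, T = 300 frames, |H| = 17 connected pairs
\begin{align*}
|V|   &= 18 \cdot 300 = 5400, \\
|E_s| &= 300 \cdot 17 = 5100, \\
|E_F| &= 299 \cdot 18 = 5382.
\end{align*}
```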

In this way, more information about a human body key point can be obtained from the other key points associated with it, such as its adjacent key points, which helps to obtain a more accurate behavior recognition result.

With the behavior recognition method provided by the foregoing embodiments of the present disclosure, key point detection is performed on at least one frame of video image to obtain multiple human body key points of the at least one frame of video image, and a behavior recognition result of the at least one frame of video image is obtained based on feature information of the multiple human body key points and association information of the multiple human body key points. Compared with other approaches, this makes full use of both the local information and the overall information of human body features and improves the accuracy of behavior recognition.

In one or more optional embodiments, a neural network may be used to process the space-time graph to obtain the behavior recognition result of the at least one frame of video image. In an optional example, convolution may be performed on the space-time graph; in this case, the space-time graph may optionally be input into a convolutional neural network. For example, the feature vectors of the nodes in the space-time graph may be input into the convolutional neural network, where the feature vector of a node may include the feature information of the human body key point corresponding to the node and information about the edges of the node, although the embodiments of the present disclosure are not limited thereto. The convolutional neural network processes the input space-time graph to obtain the behavior recognition result of the at least one frame of video image.

In some optional embodiments, convolution may be performed on the multiple human body key points based on the association information between them to obtain convolution results of the multiple human body key points. For example, a convolutional neural network may be used to process each human body key point in the space-time graph separately to obtain a convolution result for each key point, such as the image feature corresponding to each key point, and the behavior recognition result of the at least one frame of video image is obtained based on the convolution result of each of the multiple human body key points. For ease of description, the neural network based on the space-time graph is referred to below as a space-time graph neural network (Spatial-Temporal Graph Neural Networks, ST-GNCs); the neural network may also have other names, and its name should not be understood as limiting the embodiments of the present disclosure.

Optionally, at least one associated key point of a given human body key point (which may be called the current human body key point) may be determined based on the association information of the multiple human body key points, and the image feature of the current human body key point may be determined based on the feature information of the current human body key point and the feature information of each of the at least one associated key point. Optionally, the associated key points may include adjacent key points. Taking the space-time graph as an example, suppose the human body key point corresponding to the current node is called the fourth human body key point, and the human body key point corresponding to each of at least one edge of the current node is called a fifth human body key point; the at least one fifth human body key point may then be determined as the adjacent key points of the fourth human body key point. Optionally, the at least one edge may be some or all of the edges of the current node, and may include at least one temporal edge and/or at least one spatial edge; accordingly, the at least one adjacent key point of the fourth human body key point (that is, the at least one fifth human body key point) may include adjacent key points in the time dimension and/or adjacent key points in the space dimension, but the embodiments of the present disclosure are not limited thereto. Alternatively, the current node also has a self-connected edge, and the human body key point corresponding to every edge of the current node may be determined as an adjacent key point of the fourth human body key point; in this case, the at least one adjacent key point of the fourth human body key point may include the fourth human body key point itself and the at least one fifth human body key point. Accordingly, the convolution processing result of the fourth human body key point may be determined based on the feature information of each of its at least one adjacent key point, but the embodiments of the present disclosure are not limited thereto.
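As an illustration only, the determination of adjacent key points can be sketched as follows. The sketch assumes that each node of the space-time graph is identified by a hypothetical pair (frame index, joint index), that spatial edges are given as a list of joint pairs called `bones`, and that temporal edges link the same joint in consecutive frames; none of these names come from the disclosure.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (frame index, joint index); an illustrative encoding


def adjacent_key_points(
    node: Node,
    num_frames: int,
    bones: List[Tuple[int, int]],
    include_self: bool = True,
) -> List[Node]:
    """Return the adjacent key points of `node` in a space-time graph."""
    t, j = node
    neighbors: List[Node] = []
    if include_self:
        neighbors.append((t, j))  # self-connected edge
    # Spatial edges: joints linked to j by a bone, within the same frame.
    for a, b in bones:
        if a == j:
            neighbors.append((t, b))
        elif b == j:
            neighbors.append((t, a))
    # Temporal edges: the same joint in the previous and the next frame.
    for dt in (-1, 1):
        if 0 <= t + dt < num_frames:
            neighbors.append((t + dt, j))
    return neighbors


# Hypothetical skeleton: a chain of three joints (0-1-2) over three frames.
bones = [(0, 1), (1, 2)]
print(adjacent_key_points((1, 1), num_frames=3, bones=bones))
# [(1, 1), (1, 0), (1, 2), (0, 1), (2, 1)]
```

With `include_self=False` the function returns only the fifth human body key points; with the default it also returns the fourth human body key point itself, matching the self-connected-edge variant described above.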

Optionally, convolution processing may be performed on the current human body key point and on each of its at least one associated key point to obtain an initial convolution result for each human body key point, and the convolution processing result of the current human body key point is obtained based on the initial convolution result of the current human body key point and the initial convolution result of the at least one associated key point, for example by superimposing these initial convolution results. For example, convolution parameters may be used to perform convolution processing on the fourth human body key point and on each of the at least one fifth human body key point to obtain the initial convolution result of each human body key point. Optionally, the convolution parameters corresponding to the individual human body key points may be the same or different.
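The superposition step can be illustrated with a minimal sketch. It assumes that the feature information of each key point is a short vector, that a "convolution parameter" is a weight matrix applied to that vector, and that superposition means element-wise summation; the function names and the numeric values are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def apply_kernel(weight: Matrix, feature: Vector) -> Vector:
    """Initial convolution result of one key point: weight matrix times feature."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weight]


def convolve_key_point(features: List[Vector], weights: List[Matrix]) -> Vector:
    """Sum the initial results of the current key point and its associated key points."""
    initial = [apply_kernel(w, f) for w, f in zip(weights, features)]
    return [sum(values) for values in zip(*initial)]


# Hypothetical 2-D features: the current key point first, then two associated key points.
features = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]
weights = [identity, doubled, doubled]  # parameters may be shared or may differ

print(convolve_key_point(features, weights))  # [5.0, 4.0]
```

Passing the same matrix for every key point corresponds to shared convolution parameters; passing different matrices corresponds to the case where the parameters differ per key point.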

In some optional embodiments, the at least one associated key point of the current human body key point, or the current human body key point together with its at least one associated key point, may be classified to obtain a classification result, and corresponding convolution parameters are assigned based on the classification result. Optionally, the type of each adjacent node of a node may be determined, and convolution kernels are then assigned according to the types of the adjacent nodes for convolution processing. For example, the fourth human body key point and the at least one fifth human body key point may be divided into at least one human body key point set, where each human body key point set includes at least one human body key point and may correspond to one group of convolution parameters; the convolution parameters of each human body key point may then be determined based on the human body key point set to which it belongs. Optionally, different human body key point sets may correspond to different convolution parameters, for example different convolution kernels, and the weight values in the convolution kernels may be initialized in advance, for example obtained through training. As an example, suppose the human body key points are divided into two human body key point sets: a number may be assigned to each of the two sets and two groups of weight values initialized in advance, so that for a particular human body key point the corresponding weight values, that is, the corresponding convolution kernel, are obtained from the number of the set to which it belongs; but the embodiments of the present disclosure are not limited thereto.

To handle human body key points of different classes differently without changing the network structure, different network parameters (for example, convolution kernel parameters) are assigned to each human body key point set, and the key points in each set are processed by the neural network using the parameters assigned to that set, which can highlight local information. Specifically, the assigned parameters may be determined through training or predefined; by training for different tasks, the network parameters best suited to each classified set can be obtained. The structure of the neural network stays the same regardless of which network parameters are assigned. Specifically, the different classified sets may also be input into different convolutional neural networks, which perform convolution operations on the human body key points to obtain human body features corresponding to the differently classified key points.

The manner in which the associated key points of a human body key point are classified in the embodiments of the present disclosure is described below with reference to FIGS. 3a-d. FIG. 3a is an example of one frame of an input skeleton that includes 18 key points. Optionally, the number of key points in the embodiments of the present disclosure may be any number, which is not limited in the embodiments of the present disclosure.

In one or more optional embodiments, the uniform labeling (Uni-labeling) method shown in FIG. 3b may be used. Specifically, all human body key points may be placed in the same key point class; that is, the current human body key point and its at least one associated key point may use the same convolution parameters.

In one or more optional embodiments, the distance partitioning (Distance-partitioning) method shown in FIG. 3c may be used. Specifically, the at least one associated key point may be classified according to its distance from the current human body key point. As an example, the distance between the current human body key point and itself, connected by the self-connected edge, is 0, and the distance between the current human body key point and a human body key point connected to it by a spatial edge or a temporal edge is 1 (that is, an adjacent key point). The current human body key point (distance 0) may then be placed in one class and all other associated key points of the current human body key point in another class. For example, the at least one human body key point set may include a first human body key point set and a second human body key point set; in this case, optionally, the fourth human body key point may be assigned to the first human body key point set and the at least one fifth human body key point to the second human body key point set.

In one or more optional embodiments, the spatial configuration partitioning (Spatial Configuration Partitioning) method shown in FIG. 3d may be used. Specifically, human body key points may be classified based on their distance from a reference point, where the reference point may be any predefined point, for example a center of gravity, a center point, or another type of reference point. For example, during convolution, among the neighbors of the current node, those closer to the reference point than the current node belong to one class, those farther away belong to another class, and those at the same distance belong to a third class; more or fewer classes may also be set, which is not limited in the embodiments of the present disclosure. In this case, optionally, a first distance between the fourth human body key point and the reference point may be determined based on the coordinate information of the fourth human body key point, and the human body key point set to which a human body key point belongs may be determined based on how the distance between that key point and the reference point compares with the first distance. For example, three key point sets may be set: a first key point set, a second key point set and a third key point set. In this case, a human body key point whose distance from the reference point is less than the first distance may be determined to belong to the first key point set, and/or a human body key point whose distance from the reference point is equal to the first distance may be determined to belong to the second key point set, and/or a human body key point whose distance from the reference point is greater than the first distance may be determined to belong to the third key point set.
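The two partitioning rules can be sketched as small labeling functions. This is only an illustration under stated assumptions: key points are given as coordinate tuples, the set numbers 0, 1 and 2 stand for the first, second and third key point sets, and the tolerance `tol` used to decide that two distances are "equal" is an added assumption that the disclosure does not specify.

```python
import math
from typing import Sequence


def distance_label(graph_distance: int) -> int:
    """Distance partitioning: 0 for the current key point itself, 1 for its neighbors."""
    return 0 if graph_distance == 0 else 1


def spatial_configuration_label(
    current: Sequence[float],
    other: Sequence[float],
    reference: Sequence[float],
    tol: float = 1e-9,  # assumed tolerance for treating distances as equal
) -> int:
    """Spatial configuration partitioning relative to a reference point."""
    first_distance = math.dist(current, reference)  # current key point to reference
    d = math.dist(other, reference)
    if d < first_distance - tol:
        return 0  # first set: closer to the reference point
    if d > first_distance + tol:
        return 2  # third set: farther from the reference point
    return 1      # second set: same distance as the current key point


# Hypothetical coordinates: reference point at the origin, current key point at (0, 2).
reference, current = (0.0, 0.0), (0.0, 2.0)
for other in [(0.0, 1.0), (2.0, 0.0), (0.0, 3.0)]:
    print(other, spatial_configuration_label(current, other, reference))
```

The returned label can then be used as an index into a list of pre-initialized weight groups, one group per key point set.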

Compared with performing convolution processing without distinguishing between adjacent key points, performing convolution processing on different human body key points with different convolution parameters can reflect the local information of the human body key points, which helps improve the accuracy of the behavior recognition result.

Optionally, classification may also be performed in other manners in the embodiments of the present disclosure, which is not limited herein.

Optionally, the above classification process may be performed after the multiple human body key points are obtained, and label information may be added for each human body key point, where the label information indicates the class to which the human body key point belongs. For example, label information may be added to each human body key point or each edge in the space-time graph to indicate the class of that human body key point or of the human body key point corresponding to that edge. The convolutional neural network may then assign corresponding convolution parameters to each human body key point according to the label information for convolution processing, but the embodiments of the present disclosure are not limited thereto.

Optionally, in the embodiments of the present disclosure, the convolution operation of the convolutional neural network may be replaced with a graph convolution operation, yielding a convolutional neural network based on a graph model, so that convolution processing based on the space-time graph can be implemented without changing the network structure. After passing through the convolutional neural network, the space-time graph can still retain the structure of the graph model; however, after layer-by-layer convolution, each node already contains high-level semantic information extracted from the low-level coordinate information.

In one or more optional embodiments, the convolutional neural network may further include a global pooling layer and a fully connected layer after the one or more convolutional layers. Accordingly, global pooling may be performed on the convolution processing result of each of the multiple human body key points included in the space-time graph to obtain a pooling result, and the behavior recognition result of the at least one frame of video image is obtained based on the pooling result.

Because the features input for behavior recognition are features of the space-time graph structure built from human body features, all nodes in the space-time graph are passed through the global pooling layer to combine the information of all nodes into a one-dimensional vector. The global pooling layer is only one way of performing this conversion; the present application does not limit the way in which the feature dimension is converted.

Specifically, the pooling result includes a one-dimensional feature vector, and obtaining the behavior recognition result of each frame of video image in the at least one frame of video image based on the pooling result may include: processing the one-dimensional feature vector with the fully connected layer to obtain a recognition vector, where the number of values in the recognition vector corresponds to the number of behavior classification categories; and obtaining the behavior classification of the video image based on the values in the recognition vector.

Specifically, the number of values in the one-dimensional vector obtained above does not necessarily equal the number of behavior classification categories (which is determined by the dataset used; for example, Kinetics has 400 behavior classes and ActivityNet has 200). To perform behavior classification, the one-dimensional vector may be input into the fully connected layer to obtain the recognition vector, and the classification result of the behavior recognition is obtained from the recognition vector. A more complex network structure may also be used for behavior recognition instead of a single fully connected layer; the present application does not limit the network structure used for behavior recognition.
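The pooling and classification stage can be sketched as follows, under several assumptions that are not fixed by the disclosure: global pooling is taken to be an average over all nodes, the fully connected layer is a plain weight matrix plus bias, and a softmax followed by an arg-max turns the recognition vector into a class index. All numeric values are hypothetical.

```python
import math
from typing import List


def global_average_pool(node_features: List[List[float]]) -> List[float]:
    """Average each feature channel over all nodes of the space-time graph."""
    n = len(node_features)
    return [sum(channel) / n for channel in zip(*node_features)]


def fully_connected(x: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Map the pooled vector to one score per behavior class."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over the class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical sizes: 3 nodes, 2 feature channels, 3 behavior classes.
node_features = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]

pooled = global_average_pool(node_features)      # one-dimensional feature vector
scores = fully_connected(pooled, weight, bias)   # recognition vector
probs = softmax(scores)
predicted = max(range(len(probs)), key=probs.__getitem__)
print(pooled, scores, predicted)  # [2.0, 2.0] [2.0, 2.0, 4.0] 2
```

The length of `pooled` (2) differs from the number of classes (3), which is why the fully connected layer is needed to produce a recognition vector with one value per class.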

FIG. 4 is a schematic flowchart of a specific embodiment of the behavior recognition method of the present disclosure. As shown in the figure, human body key point detection is performed on at least one frame of video image, pose estimation is performed based on the detected human body key points, and a space-time graph of the skeleton sequence is constructed from all the human body key points. Applying the spatial-temporal graph convolutional network (st-gcn) progressively produces high-quality feature maps, and a standard softmax classifier then yields the corresponding action category.

Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks or optical discs.

FIG. 5 is a schematic structural diagram of an embodiment of the behavior recognition apparatus of the present disclosure. The apparatus of this embodiment can be used to implement the above method embodiments of the present disclosure. As shown in FIG. 5, the apparatus of this embodiment includes:

A key point detection unit 51, configured to perform human body key point detection on at least one frame of video image to obtain multiple human body key points of the at least one frame of video image.

Specifically, the at least one frame of video image may come from video captured by the device itself, video input by a user, or video obtained from another device; the embodiments of the present disclosure do not limit the manner in which the at least one frame of video image is obtained. In addition, the at least one frame of video image may be one frame or multiple frames, and the embodiments of the present disclosure do not limit the number of frames. Optionally, the at least one frame of video image may be consecutive video images; for example, the at least one frame of video image may belong to a video clip corresponding to one action or behavior, and one behavior recognition result can be obtained from that video clip.

A behavior recognition unit 52, configured to obtain the behavior recognition result of the at least one frame of video image based on the feature information of the multiple human body key points of the at least one frame of video image and the association information of the multiple human body key points.

In one or more optional embodiments, the association information of the multiple human body key points includes any one or more of the following: spatial association information between at least two human body key points in the same frame of video image, and temporal association information between at least two human body key points that correspond to the same human body part and belong to adjacent frames of the at least one frame of video image.

With the behavior recognition apparatus provided by the above embodiments of the present disclosure, key point detection is performed on at least one frame of video image to obtain multiple human body key points of the at least one frame of video image, and the behavior recognition result of the at least one frame of video image is obtained based on the feature information of the multiple human body key points and the association information of the multiple human body key points. This overcomes the drawback of the prior art, in which processing all key points together makes it difficult to attend to local information: by combining the feature information of the human body key points with the association information between them, both local information and overall information are fully used, which improves the accuracy of behavior recognition.

In one or more optional embodiments, the at least one frame of video image is specifically multiple consecutive frames of video images in a video; and/or

the spatial association information between at least two human body key points in the same frame of video image is determined according to the connectivity of the human body structure.

In one or more optional embodiments, the apparatus of the present disclosure further includes:

a graph establishment unit, configured to establish a space-time graph based on the multiple human body key points in the at least one frame of video image.

Accordingly, the behavior recognition unit is specifically configured to obtain the behavior recognition result of the at least one frame of video image based on the space-time graph.

The space-time graph contains the feature information of the multiple human body key points in the at least one frame of video image and the association information of the multiple human body key points.

Optionally, when the space-time graph is constructed, spatial edges may first be used to connect at least two human body key points located in the same frame of video image according to the connectivity of the human body structure, and temporal edges may then be used to connect at least two human body key points of the same body part in adjacent frames of the at least one frame of video image, but the embodiments of the present disclosure are not limited thereto.
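The two-step construction can be sketched as building two edge lists. The sketch assumes a hypothetical node encoding (frame index, joint index) and a list of joint pairs, here called `bones`, describing the connectivity of the human body structure; these names are illustrative and do not come from the disclosure.

```python
from typing import List, Tuple

Node = Tuple[int, int]   # (frame index, joint index); an illustrative encoding
Edge = Tuple[Node, Node]


def build_space_time_graph(
    num_frames: int,
    num_joints: int,
    bones: List[Tuple[int, int]],
) -> Tuple[List[Edge], List[Edge]]:
    """Return (spatial_edges, temporal_edges) of a skeleton sequence."""
    # Step 1: spatial edges connect joints linked by a bone within each frame.
    spatial = [((t, a), (t, b)) for t in range(num_frames) for a, b in bones]
    # Step 2: temporal edges connect the same joint in adjacent frames.
    temporal = [
        ((t, j), (t + 1, j))
        for t in range(num_frames - 1)
        for j in range(num_joints)
    ]
    return spatial, temporal


# Hypothetical skeleton: three joints in a chain, observed over two frames.
spatial, temporal = build_space_time_graph(num_frames=2, num_joints=3, bones=[(0, 1), (1, 2)])
print(len(spatial), len(temporal))  # 4 3
```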

In one or more optional embodiments, the behavior recognition unit 52 is specifically configured to input the space-time graph into a convolutional neural network to obtain the behavior recognition result of the at least one frame of video image.

Optionally, the behavior recognition result obtained by the behavior recognition unit 52 may be a behavior recognition result corresponding to each frame of the at least one frame of video image, or a single behavior recognition result corresponding to all of the at least one frame of video image together. For example, the at least one frame of video image belongs to a video clip of a video stream, the video clip corresponds to one human action, and accordingly one behavior recognition result can be obtained from the at least one frame of video image in the video clip. Alternatively, the space-time graph may also be constructed through other processes.

In one or more optional embodiments, the behavior recognition unit 52 includes:

a convolution processing module, configured to perform convolution processing on the multiple human body key points based on the association information between the multiple human body key points, to obtain convolution processing results of the multiple human body key points; and

a convolution recognition module, configured to obtain the behavior recognition result of the at least one frame of video image based on the convolution processing results of the multiple human body key points.

Optionally, the convolution processing module includes:

an association determination module, configured to determine, based on the association information of the multiple human body key points, at least one fifth human body key point that is associated with a fourth human body key point among the multiple human body key points; and

a feature processing module, configured to obtain the convolution processing result of the fourth human body key point based on the feature information of each of the fourth human body key point and the at least one fifth human body key point.

In some optional embodiments, the feature processing module is specifically configured to: perform convolution processing on each of the fourth human body key point and the at least one fifth human body key point using the convolution parameters corresponding to the human body key point set to which that key point belongs, to obtain an initial convolution result of each human body key point; and

obtain the convolution processing result of the fourth human body key point based on the initial convolution result of each of the fourth human body key point and the at least one fifth human body key point.

In some optional embodiments, the behavior recognition unit 52 further includes:

a classification module, configured to divide the fourth human body key point and the at least one fifth human body key point into at least one human body key point set, where each human body key point set includes at least one human body key point; and

a parameter determination module, configured to determine the convolution parameters of each human body key point based on the human body key point set to which each of the fourth human body key point and the at least one fifth human body key point belongs.

Human body key points belonging to different human body key point sets correspond to different convolution parameters.

In one or more optional embodiments, the at least one human body key point set includes a first human body key point set and a second human body key point set;

the classification module is specifically configured to assign the fourth human body key point to the first human body key point set and to assign the at least one fifth human body key point to the second human body key point set.

In one or more optional embodiments, the classification module is specifically configured to divide the fourth human body key point and the at least one fifth human body key point into at least one human body key point set based on the distance between each of these key points and a reference point.

Optionally, the classification module includes:

a first distance module, configured to determine a first distance between the fourth human body key point and the reference point based on the feature information of the fourth human body key point; and

a first relationship module, configured to determine the key point set to which each human body key point belongs based on how the distance between each of the fourth human body key point and the at least one fifth human body key point and the reference point compares with the first distance.

Specifically, the first relationship module is configured to determine that a human body key point whose distance from the reference point is less than the first distance belongs to the first key point set; and/or

determine that a human body key point whose distance from the reference point is equal to the first distance belongs to the second key point set; and/or

determine that a human body key point whose distance from the reference point is greater than the first distance belongs to the third key point set.

In one or more optional embodiments, the convolution recognition module is specifically configured to perform global pooling on the convolution processing result of each of the multiple human body key points included in the space-time graph to obtain a pooling result; and

obtain the behavior recognition result of the at least one frame of video image based on the pooling result. That is, the convolution processing results of the human body key points included in the space-time graph are globally pooled, and the behavior recognition result of the at least one frame of video image is derived from the pooling result.

Specifically, the pooling result includes a one-dimensional feature vector, and the behavior recognition result of each frame of video image in the at least one frame of video image is obtained based on the pooling result.

According to one aspect of the embodiments of the present disclosure, an electronic device is provided, including a processor, where the processor includes the behavior recognition apparatus of any of the above embodiments of the present disclosure.

According to one aspect of the embodiments of the present disclosure, an electronic device is provided, including: a memory, configured to store executable instructions;

and a processor, configured to communicate with the memory to execute the executable instructions so as to carry out the behavior recognition method of any of the above embodiments of the present disclosure.

According to one aspect of the embodiments of the present disclosure, a computer storage medium is provided for storing computer-readable instructions which, when executed, carry out the behavior recognition method of any of the above embodiments of the present disclosure.

According to one aspect of the embodiments of the present disclosure, a computer program is provided, including computer-readable code, where, when the computer-readable code runs on a device, a processor in the device executes instructions for implementing the behavior recognition method of any of the above embodiments of the present disclosure.

In one or more optional implementations, the embodiments of the present disclosure further provide a computer program product for storing computer-readable instructions which, when executed, cause a computer to perform the behavior recognition method in any of the above possible implementations.

The computer program product may be implemented by hardware, software, or a combination thereof. In one optional example, the computer program product is embodied as a computer storage medium; in another optional example, the computer program product is embodied as a software product, such as a software development kit (SDK).

In one or more optional implementations, the embodiments of the present disclosure further provide another behavior recognition method and its corresponding apparatus, electronic device, computer storage medium, computer program, and computer program product, wherein the method includes: performing human body key point detection on at least one frame of video image to obtain multiple human body key points of the at least one frame of video image; and obtaining a behavior recognition result of the at least one frame of video image based on feature information of the multiple human body key points of the at least one frame of video image and association information of the multiple human body key points.

In some embodiments, the behavior recognition instruction may specifically be a call instruction: the first apparatus may instruct the second apparatus, by way of a call, to perform the image processing. Accordingly, in response to receiving the call instruction, the second apparatus may perform the steps and/or processes of any embodiment of the above behavior recognition method.

Embodiments of the present disclosure also provide an electronic device, which may be, for example, a mobile terminal, a personal computer (PC), a tablet computer, or a server. Referring now to FIG. 6, a schematic structural diagram of an electronic device 600 suitable for implementing the embodiments of the present application is shown. As shown in FIG. 6, the electronic device 600 includes one or more processors, a communication part, and the like. The one or more processors are, for example, one or more central processing units (CPUs) 601 and/or one or more graphics processing units (GPUs) 613. The processor may perform various appropriate actions and processing according to executable instructions stored in a read-only memory (ROM) 602 or executable instructions loaded from a storage section 608 into a random access memory (RAM) 603. The communication part 612 may include, but is not limited to, a network card, and the network card may include, but is not limited to, an IB (InfiniBand) network card.

The processor may communicate with the read-only memory 602 and/or the random access memory 603 to execute executable instructions, is connected to the communication part 612 through a bus 604, and communicates with other target devices via the communication part 612, thereby completing the operations corresponding to any of the methods provided by the embodiments of the present application, for example: performing human body key point detection on at least one frame of video image to obtain multiple human body key points of the at least one frame of video image; and obtaining a behavior recognition result of the at least one frame of video image based on feature information of the multiple human body key points of the at least one frame of video image and association information of the multiple human body key points.

In addition, the RAM 603 may also store various programs and data required for the operation of the apparatus. The CPU 601, the ROM 602, and the RAM 603 are connected to one another through the bus 604. Where the RAM 603 is present, the ROM 602 is an optional module. The RAM 603 stores executable instructions, or executable instructions are written into the ROM 602 at runtime, and the executable instructions cause the processor 601 to perform the operations corresponding to the above method. An input/output (I/O) interface 605 is also connected to the bus 604. The communication part 612 may be provided as an integrated unit, or may be provided with multiple sub-modules (for example, multiple IB network cards) that are connected to the bus.

The following components are connected to the I/O interface 605: an input section 606 including a keyboard, a mouse, and the like; an output section 607 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read from it can be installed into the storage section 608 as needed.

It should be noted that the architecture shown in FIG. 6 is only one optional implementation. In practice, the number and types of the components in FIG. 6 may be selected, removed, added, or replaced according to actual needs. Different functional components may also be provided separately or in integrated form; for example, the GPU and the CPU may be provided separately, or the GPU may be integrated on the CPU, and the communication part may be provided separately or integrated on the CPU or the GPU. These alternative implementations all fall within the scope of protection of the present disclosure.

In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, the embodiments of the present disclosure include a computer program product that includes a computer program tangibly embodied on a machine-readable medium. The computer program contains program code for performing the methods shown in the flowcharts, and the program code may include instructions corresponding to the method steps provided by the embodiments of the present application, for example: performing human body key point detection on at least one frame of video image to obtain multiple human body key points of the at least one frame of video image; and obtaining a behavior recognition result of the at least one frame of video image based on feature information of the multiple human body key points of the at least one frame of video image and association information of the multiple human body key points. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 609, and/or installed from the removable medium 611. When the computer program is executed by the central processing unit (CPU) 601, the above functions defined in the methods of the present application are performed.

It should be understood that terms such as "first" and "second" in the embodiments of the present disclosure are used only for distinction and should not be construed as limiting the embodiments of the present disclosure.

It should also be understood that, in the present disclosure, "a plurality" may refer to two or more, and "at least one" may refer to one, two, or more.

It should also be understood that any component, data, or structure mentioned in the present disclosure may generally be understood as one or more, unless explicitly limited or the context indicates otherwise.

It should also be understood that the description of the embodiments in the present disclosure emphasizes the differences between the embodiments; for the same or similar parts, the embodiments may be referred to one another, and for brevity these parts are not repeated.

The methods, apparatuses, and devices of the present disclosure may be implemented in many ways, for example by software, hardware, firmware, or any combination of software, hardware, and firmware. The above order of the method steps is for illustration only, and the steps of the methods of the present disclosure are not limited to the order specifically described above unless otherwise specifically stated. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded on a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the methods according to the present disclosure.

The description of the present disclosure is given for purposes of illustration and description, and is not intended to be exhaustive or to limit the present disclosure to the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to better explain the principles and practical application of the present disclosure, and to enable those of ordinary skill in the art to understand the present disclosure and thereby design various embodiments, with various modifications, suited to particular uses.

Claims

1. A behavior recognition method, characterized by comprising:

performing human body key point detection on at least one frame of video image to obtain multiple human body key points of the at least one frame of video image;

establishing a space-time graph based on the multiple human body key points of the at least one frame of video image, wherein the space-time graph includes feature information of the multiple human body key points of the at least one frame of video image and association information of the multiple human body key points;

determining, based on the association information between the multiple human body key points included in the space-time graph, at least one fifth human body key point among the multiple human body key points that has an association relationship with a fourth human body key point;

performing convolution processing on each human body key point among the fourth human body key point and the at least one fifth human body key point, using the convolution parameter corresponding to the human body key point set to which that human body key point belongs, to obtain an initial convolution result of each human body key point;

obtaining a convolution processing result of the fourth human body key point based on the initial convolution result of each human body key point among the fourth human body key point and the at least one fifth human body key point; and

obtaining a behavior recognition result of the at least one frame of video image based on the convolution processing results of the multiple human body key points.

2. The method according to claim 1, wherein the feature information of the human body key points comprises coordinate information of the human body key points; or,

The feature information of the human body key points includes coordinate information of the human body key points, estimated confidence levels of the human body key points and/or initial features corresponding to the human body key points.

3. The method according to claim 1, wherein the association information of the multiple human body key points comprises any one or more of the following: spatial association information between at least two human body key points in the same frame of video image, and temporal association information between at least two human body key points that correspond to the same human body part and belong to adjacent frames of video images in the at least one frame of video image.

4. The method according to claim 3, wherein the at least one frame of video image is specifically multiple frames of continuous video images in the video; and/or

The spatial correlation information between at least two human body key points in the same frame of video image is determined according to the connectivity relationship of human body structures.

5. The method of claim 3, wherein

The spatial association information between the at least two human body key points includes the spatially adjacent relationship of the at least two key points, and/or

The temporal association information between the at least two key points includes: the adjacent relationship of the frames to which the at least two key points belong.

6. The method according to claim 1, wherein the space-time graph includes a plurality of nodes corresponding to the plurality of human body key points, and each node in the plurality of nodes includes feature information of the corresponding human body key point;

Each of the plurality of nodes has at least one edge, and the plurality of edges of the plurality of nodes indicate the association relationship of the plurality of human body key points.

7. The method of claim 6, wherein

a first node among the plurality of nodes has a spatial edge with each of at least one second node, wherein a first human body key point corresponding to the first node and a second human body key point corresponding to each second node belong to the same frame, and the human body part corresponding to the first human body key point is directly connected to the human body part corresponding to each second human body key point; and/or

the first node has a temporal edge with each of at least one third node, wherein the first human body key point and a third human body key point corresponding to each third node correspond to the same human body part and belong to adjacent frames.
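As a non-limiting illustration, the node and edge structure recited in claims 6 and 7 may be sketched as follows. Each node is identified by a (frame index, joint index) pair; spatial edges follow an assumed skeleton connectivity list within each frame, and temporal edges link the same joint in adjacent frames. The function name `build_space_time_graph` and the three-joint skeleton are hypothetical and serve only this sketch.

```python
def build_space_time_graph(num_frames, num_joints, skeleton_edges):
    """Return (spatial_edges, temporal_edges) over nodes (frame, joint).

    skeleton_edges lists pairs of joints whose body parts are directly
    connected; it is an assumed input for this sketch.
    """
    spatial_edges = []
    temporal_edges = []
    for t in range(num_frames):
        # Spatial edges: directly connected body parts in the same frame.
        for a, b in skeleton_edges:
            spatial_edges.append(((t, a), (t, b)))
        # Temporal edges: the same body part in adjacent frames.
        if t + 1 < num_frames:
            for j in range(num_joints):
                temporal_edges.append(((t, j), (t + 1, j)))
    return spatial_edges, temporal_edges

# Hypothetical three-joint chain: 0 = head, 1 = neck, 2 = hip.
skeleton = [(0, 1), (1, 2)]
spatial_edges, temporal_edges = build_space_time_graph(2, 3, skeleton)
```

In a fuller implementation each node would also carry the feature information of its key point, for example its coordinates.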

8. The method according to any one of claims 1-7, wherein before the convolution processing is performed on each human body key point, the method further comprises:

dividing the fourth human body key point and the at least one fifth human body key point into at least one human body key point set, wherein each human body key point set includes at least one human body key point; and

determining the convolution parameter of each human body key point based on the human body key point set to which each human body key point among the fourth human body key point and the at least one fifth human body key point belongs, wherein human body key points belonging to different human body key point sets correspond to different convolution parameters.

9. The method according to claim 8, wherein the at least one human body key point set comprises a first human body key point set and a second human body key point set;

the dividing of the fourth human body key point and the at least one fifth human body key point into at least one human body key point set comprises:

dividing the fourth human body key point into the first human body key point set, and dividing the at least one fifth human body key point into the second human body key point set.
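As a non-limiting illustration, the two-set division of claim 9 combined with the per-set convolution parameters of claim 8 may be sketched as follows. The sketch assumes that each convolution parameter is a small weight matrix and that the initial convolution results are combined by summation; the claims state only that the result is obtained "based on" the initial results, so the summation, the function names, and the toy data are assumptions.

```python
def matvec(matrix, vector):
    """Multiply a weight matrix (list of rows) by a feature vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def convolve_key_point(root, neighbors, features, w_root, w_neighbor):
    """Convolution result for one key point using per-set parameters.

    The root key point (first set) uses w_root; every associated key
    point (second set) uses w_neighbor. The initial results are summed,
    which is an assumed way of combining them.
    """
    result = matvec(w_root, features[root])
    for name in neighbors:
        initial = matvec(w_neighbor, features[name])
        result = [r + i for r, i in zip(result, initial)]
    return result

# Hypothetical features (for example, 2D coordinates) of three key points.
features = {"neck": [1.0, 2.0], "head": [0.0, 1.0], "hip": [2.0, 0.0]}
w_root = [[1.0, 0.0], [0.0, 1.0]]      # parameter for the first set
w_neighbor = [[0.5, 0.0], [0.0, 0.5]]  # parameter for the second set
out = convolve_key_point("neck", ["head", "hip"], features, w_root, w_neighbor)
```

Giving the root and its neighbors separate parameters lets the convolution treat a key point differently from the key points associated with it, which a single shared weight could not do.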

10. The method according to claim 9, wherein the dividing of the fourth human body key point and the at least one fifth human body key point into at least one human body key point set comprises:

dividing the fourth human body key point and the at least one fifth human body key point into at least one human body key point set based on the distance between a reference point and each human body key point among the fourth human body key point and the at least one fifth human body key point.

11. The method according to claim 10, wherein the dividing of the fourth human body key point and the at least one fifth human body key point into at least one human body key point set, based on the distance between the reference point and each human body key point among the fourth human body key point and the at least one fifth human body key point, comprises:

determining a first distance between the fourth human body key point and the reference point based on the feature information of the fourth human body key point; and

determining the key point set to which each human body key point belongs based on the magnitude relationship between the first distance and the distance between the reference point and each human body key point among the fourth human body key point and the at least one fifth human body key point.

12. The method according to claim 11, wherein the determining of the key point set to which each human body key point belongs, based on the magnitude relationship between the first distance and the distance between the reference point and each human body key point among the fourth human body key point and the at least one fifth human body key point, comprises:

determining that a human body key point whose distance from the reference point is less than the first distance belongs to the first key point set; and/or

determining that a human body key point whose distance from the reference point is equal to the first distance belongs to the second key point set; and/or

determining that a human body key point whose distance from the reference point is greater than the first distance belongs to the third key point set.
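As a non-limiting illustration, the distance-based division of claims 10 to 12 may be sketched as follows. The sketch assumes that the feature information of each key point contains 2D coordinates, that distances are Euclidean, and that equality is tested with a floating-point tolerance; the reference point, the function name `partition_by_distance`, and the joint names are hypothetical.

```python
import math

def partition_by_distance(root, neighbors, coords, reference):
    """Assign the root and its associated key points to three sets.

    first  : closer to the reference point than the root
    second : at the same distance as the root (includes the root itself)
    third  : farther from the reference point than the root
    """
    first_distance = math.dist(coords[root], reference)
    sets = {"first": [], "second": [], "third": []}
    for name in [root] + list(neighbors):
        distance = math.dist(coords[name], reference)
        if math.isclose(distance, first_distance):
            # Tolerance-based equality is an assumption of this sketch.
            sets["second"].append(name)
        elif distance < first_distance:
            sets["first"].append(name)
        else:
            sets["third"].append(name)
    return sets

# Hypothetical coordinates, with the reference point at the origin.
coords = {"elbow": (3.0, 4.0), "shoulder": (1.0, 1.0), "wrist": (6.0, 8.0)}
sets = partition_by_distance("elbow", ["shoulder", "wrist"], coords, (0.0, 0.0))
```

Each of the three sets would then be given its own convolution parameter, so that key points nearer to and farther from the reference point contribute differently to the result.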

13. The method according to any one of claims 1-7, wherein the obtaining of a behavior recognition result of the at least one frame of video image based on the convolution processing results of the plurality of human body key points comprises:

performing global pooling on the convolution processing result of each human body key point among the plurality of human body key points included in the space-time graph to obtain a pooling result; and

obtaining a behavior recognition result of the at least one frame of video image based on the pooling result.

14. A behavior recognition device, comprising:

a key point detection unit, configured to perform human body key point detection on at least one frame of video image to obtain a plurality of human body key points of the at least one frame of video image;

a graph establishing unit, configured to establish a space-time graph based on the plurality of human body key points of the at least one frame of video image, wherein the space-time graph includes feature information of the plurality of human body key points of the at least one frame of video image and association information of the plurality of human body key points; and

a behavior recognition unit, including:

a convolution processing module, including an association determination module and a feature processing module;

the association determination module being configured to determine, based on the association information between the plurality of human body key points included in the space-time graph, at least one fifth human body key point among the plurality of human body key points that has an association relationship with a fourth human body key point;

the feature processing module being configured to perform convolution processing on each human body key point among the fourth human body key point and the at least one fifth human body key point, using the convolution parameter corresponding to the human body key point set to which that human body key point belongs, to obtain an initial convolution result of each human body key point, and to obtain a convolution processing result of the fourth human body key point based on the initial convolution result of each human body key point among the fourth human body key point and the at least one fifth human body key point; and

a convolution recognition module, configured to obtain a behavior recognition result of the at least one frame of video image based on the convolution processing results of the plurality of human body key points.

15. The apparatus according to claim 14, wherein the feature information of the human body key points comprises coordinate information of the human body key points; or,

16. The apparatus according to claim 14, wherein the association information of the multiple human body key points comprises any one or more of the following: between at least two human body key points in the same frame of video image and the temporal correlation information between at least two human body key points corresponding to the same human body part and belonging to adjacent frames of video images in the at least one frame of video image.

17. The apparatus according to claim 16, wherein the at least one frame of video image is specifically multiple frames of continuous video images in the video; and/or

18. The apparatus of claim 16, wherein

19. The apparatus according to claim 14, wherein the space-time graph comprises a plurality of nodes corresponding to the plurality of human body key points, and each node in the plurality of nodes comprises a corresponding human body key point characteristic information;

20. The apparatus of claim 19, wherein

21. The device according to any one of claims 14-20, wherein the behavior recognition unit further comprises:

a classification module, configured to divide the fourth human body key point and the at least one fifth human body key point into at least one human body key point set, wherein each human body key point set includes at least one human body key point;

a parameter determination module, configured to determine the convolution parameter of each human body key point based on the human body key point set to which each human body key point among the fourth human body key point and the at least one fifth human body key point belongs, wherein human body key points belonging to different human body key point sets correspond to different convolution parameters.

22. The apparatus according to claim 21, wherein the at least one human body key point set comprises a first human body key point set and a second human body key point set;

The classification module is specifically configured to divide the fourth human body key point into the first human body key point set, and divide the at least one fifth human body key point into the second human body key point set.

23. The apparatus according to claim 22, wherein the classification module is specifically configured to divide the fourth human body key point and the at least one fifth human body key point into at least one human body key point set based on the distance between a reference point and each human body key point among the fourth human body key point and the at least one fifth human body key point.

24. The apparatus according to claim 23, wherein the classification module comprises:

a first distance module, configured to determine a first distance between the fourth human body key point and the reference point based on the feature information of the fourth human body key point;

a first relationship module, configured to determine the key point set to which each human body key point belongs based on the magnitude relationship between the first distance and the distance between the reference point and each human body key point among the fourth human body key point and the at least one fifth human body key point.

25. The apparatus according to claim 24, wherein the first relationship module is specifically configured to determine that a human body key point whose distance from the reference point is smaller than the first distance belongs to the first key point collection; and/or

26. The apparatus according to any one of claims 14 to 20, wherein the convolution recognition module is specifically configured to perform global pooling on the convolution processing result of each human body key point among the plurality of human body key points included in the space-time graph to obtain a pooling result;

27. An electronic device, characterized by comprising a processor, wherein the processor comprises the behavior recognition device according to any one of claims 14 to 26.

28. An electronic device, characterized in that it comprises:

memory for storing executable instructions;

and a processor for executing the executable instructions to complete the behavior recognition method according to any one of claims 1 to 13.

29. A computer storage medium for storing computer-readable instructions, wherein the behavior recognition method according to any one of claims 1 to 13 is executed when the instructions are executed.