CN117994820B

CN117994820B - Hierarchical graph convolution gesture recognition method and device based on time-frequency data fusion

Info

Publication number: CN117994820B
Application number: CN202410405924.3A
Authority: CN
Inventors: 邹峥; 刘石坚; 黄葵; 廖律超; 荆东星; 蔡建成; 吴屹; 陈才艺
Original assignee: Fujian University Of Science And Technology; Fujian Normal University
Current assignee: Fujian University Of Science And Technology; Fujian Normal University
Priority date: 2024-04-07
Filing date: 2024-04-07
Publication date: 2024-06-14
Anticipated expiration: 2044-04-07
Also published as: CN117994820A

Abstract

The invention discloses a hierarchical graph convolution gesture recognition method and device based on time-frequency data fusion, comprising a network structure consisting of a basic block and a full-connection layer; the network structure performs the following steps: the layered hand image convolution module splits the gesture image to be identified into joint layered images with preset numbers, and extracts characteristics of the joint layered images; splicing all joint layered graph features to obtain a first image feature; the time convolution module sequentially carries out convolution, activation and residual error connection operation according to the input first image characteristics to obtain second image characteristics; the spatial attention module performs feature extraction according to the input second image features to obtain third image features; the time convolution module sequentially carries out convolution, activation and residual error connection operation according to the input third image characteristic to obtain a fourth image characteristic; and outputting a gesture prediction result by the full-connection layer according to the fourth image characteristic. Thereby improving the effect of gesture recognition.

Description

Layered graph convolution gesture recognition method and device based on time-frequency data fusion

技术领域Technical Field

本发明涉及计算机视觉技术领域，特别是涉及一种基于时频数据融合的分层图卷积手势识别方法及装置。The present invention relates to the field of computer vision technology, and in particular to a layered graph convolution gesture recognition method and device based on time-frequency data fusion.

背景技术Background technique

随着人工智能技术的发展，使用深度学习模型进行手势识别成为一种趋势。例如现有技术1(Devineau G，Moutarde F，Xi W，et al.《Deep learning for hand gesturerecognition on skeletal data》)首先提出通过卷积神经网络（Convolutional NeuralNetwork，CNN）沿骨架序列时间维度提取特征。现有技术2（Lai K，Yanushkevich S N.《CNN+RNN depth and skeleton based dynamic hand gesture recognition》）提出了将骨架作为时间序列输入到循环神经网络（Recurrent Neural Network，RNN）进行特征提取。但现有技术1和2无法自然的表示非欧式结构的骨架数据。With the development of artificial intelligence technology, the use of deep learning models for gesture recognition has become a trend. For example, prior art 1 (Devineau G, Moutarde F, Xi W, et al. "Deep learning for hand gesture recognition on skeletal data") first proposed to extract features along the time dimension of the skeleton sequence through a convolutional neural network (CNN). Prior art 2 (Lai K, Yanushkevich SN. "CNN+RNN depth and skeleton based dynamic hand gesture recognition") proposed to input the skeleton as a time series into a recurrent neural network (RNN) for feature extraction. However, prior art 1 and 2 cannot naturally represent non-Euclidean skeleton data.

现有技术3（Yan S，Xiong Y，Lin D.《Spatial temporal graph convolutionalnetworks for skeleton-based action recognition》）提出的ST-GCN(Spatial temporalgraph convolutional networks，时空图卷积网络)是基于空域方法的典型代表。ST-GCN由若干空间块和时间块堆叠而成，其中的空间块是通过GCN（Graph Convolutional Network，图卷积神经网络）聚合单帧空间信息的子模块，而时间块则是利用一维卷积在时间维度上关联帧与帧之间信息的子模块。ST-GCN由若干空间块和时间块堆叠而成，其中的空间块是通过GCN聚合单帧空间信息的子模块，而时间块则是利用一维卷积在时间维度上关联帧与帧之间信息的子模块。但图卷积核在训练过程无法变化，因此难以关联较远关节点的信息。ST-GCN (Spatial temporal graph convolutional networks) proposed in prior art 3 (Yan S, Xiong Y, Lin D. "Spatial temporal graph convolutional networks for skeleton-based action recognition") is a typical representative of spatial domain-based methods. ST-GCN is composed of a stack of several spatial blocks and temporal blocks, in which the spatial block is a submodule that aggregates the spatial information of a single frame through GCN (Graph Convolutional Network), while the temporal block is a submodule that uses one-dimensional convolution to associate the information between frames in the temporal dimension. ST-GCN is composed of a stack of several spatial blocks and temporal blocks, in which the spatial block is a submodule that aggregates the spatial information of a single frame through GCN, while the temporal block is a submodule that uses one-dimensional convolution to associate the information between frames in the temporal dimension. However, the graph convolution kernel cannot change during the training process, so it is difficult to associate the information of distant joints.

现有技术4（Shi L，Zhang Y，Cheng J，et al.《Two-stream adaptive graphconvolutional networks for skeleton-based action recognition》）提出的2s-AGCN（Two-stream adaptive graph convolutional networks，双流自适应图卷积网络）通过两个线性层将输入线性变换为/>和/>，之后将/>与/>进行矩阵乘法得到大小为/>的矩阵/>，其中/>表示实数集，/>为输入通道数、/>为帧数以及/>为节点数。最后/>与骨架图的邻接矩阵相加得到最终的骨架图拓扑矩阵，其中/>为输出通道数。这样做的目的在于通过网络训练学习到各节点间的潜在关系。现有技术5（Chen Y，Zhang Z，Yuan C，et al.《Channel-wise topology refinement graph convolution forskeleton-based action recognition》）提出的CTR-GCN（Channel-wise topologyrefinement-graph convolutional networks，通道拓扑图卷积神经网络）使用可学习的矩阵，用来寻找较远骨架节点之间的潜在关系。通过数据驱动对每个通道学习一个图拓扑结构以此得到一个针对输入实例的图邻接矩阵，使得可以发掘潜在的节点关系。The 2s-AGCN (Two-stream adaptive graph convolutional networks, two-stream adaptive graph convolutional networks) proposed in the prior art 4 (Shi L, Zhang Y, Cheng J, et al. "Two-stream adaptive graph convolutional networks for skeleton-based action recognition") converts the input The linear transformation is and/> , then /> With/> Perform matrix multiplication to get a size of/> The matrix of , where/> represents the set of real numbers, /> is the number of input channels, /> is the number of frames and /> is the number of nodes. Finally/> Add it to the adjacency matrix of the skeleton graph to get the final skeleton graph topology matrix, where/> is the number of output channels. The purpose of this is to learn the potential relationship between nodes through network training. The CTR-GCN (Channel-wise topology refinement-graph convolutional networks) proposed in the prior art 5 (Chen Y, Zhang Z, Yuan C, et al. "Channel-wise topology refinement graph convolution for skeleton-based action recognition") uses a learnable matrix to find the potential relationship between distant skeleton nodes. Through data-driven learning of a graph topology structure for each channel, a graph adjacency matrix for the input instance is obtained, so that potential node relationships can be discovered.

现有技术4和5，太过依赖初始的骨架图拓扑结构，虽然模型能够实现对图像的学习，但是这些模型还是会受到初始化骨架图拓扑的影响，从而更加重视原来节点之间的相关性。Existing technologies 4 and 5 rely too much on the initial skeleton graph topology. Although the models can learn images, they are still affected by the initialized skeleton graph topology, thus paying more attention to the correlation between the original nodes.

现有技术6（Lee J，Lee M，Lee D，et al.《Hierarchically decomposed graphconvolutional networks for skeleton-based action recognition》）提出应该着重考虑对识别动作更重要的结点。为此，将骨架图划分为多个层，通过距离中心点的跳数划分关节点组，对不同的关节点组的图邻接矩阵提取的特征进行平均池化，得到更具有代表性的动作特征。然而，如果需要让图卷积神经网络系列的网络模型考虑到全局信息，则必须堆叠较深的GCN层才能将其聚合半径扩大到全图，这会造成计算量过大的问题。Prior art 6 (Lee J, Lee M, Lee D, et al. "Hierarchically decomposed graph convolutional networks for skeleton-based action recognition") proposed that the nodes that are more important for identifying actions should be considered. To this end, the skeleton graph is divided into multiple layers, and the joint point groups are divided by the number of hops from the center point. The features extracted from the graph adjacency matrix of different joint point groups are averaged and pooled to obtain more representative action features. However, if the network model of the graph convolutional neural network series needs to take global information into account, deeper GCN layers must be stacked to expand its aggregation radius to the entire graph, which will cause the problem of excessive computation.

发明内容Summary of the invention

本发明所要解决的技术问题是：提供一种基于时频数据融合的分层图卷积手势识别方法及装置，提高对手势识别的效果。The technical problem to be solved by the present invention is to provide a method and device for gesture recognition based on layered graph convolution of time-frequency data fusion to improve the effect of gesture recognition.

为了解决上述技术问题，本发明采用的技术方案为：In order to solve the above technical problems, the technical solution adopted by the present invention is:

一种基于时频数据融合的分层图卷积手势识别方法，包括由基础块与全连接层组成的网络结构；所述基础块由分层手部图卷积模块、时间卷积模块、空间注意力模块、时间卷积模块依次残差连接组成；所述网络结构执行以下步骤：A hierarchical graph convolution gesture recognition method based on time-frequency data fusion includes a network structure composed of a basic block and a fully connected layer; the basic block is composed of a hierarchical hand graph convolution module, a temporal convolution module, a spatial attention module, and a temporal convolution module sequentially connected by residual connection; the network structure performs the following steps:

所述分层手部图卷积模块将待识别手势图像拆分成预设个数的关节分层图，并提取得到关节分层图特征；拼接所有所述关节分层图特征，得到第一图像特征；The layered hand graph convolution module splits the gesture image to be recognized into a preset number of joint layered graphs, and extracts joint layered graph features; splices all the joint layered graph features to obtain a first image feature;

所述时间卷积模块根据输入的所述第一图像特征依次进行卷积、激活以及残差连接操作，得到第二图像特征；The temporal convolution module sequentially performs convolution, activation and residual connection operations according to the first image feature input to obtain a second image feature;

所述空间注意力模块根据输入的所述第二图像特征进行特征提取，得到第三图像特征；The spatial attention module performs feature extraction based on the input second image feature to obtain a third image feature;

所述时间卷积模块根据输入的所述第三图像特征依次进行卷积、激活以及残差连接操作，得到第四图像特征；The temporal convolution module sequentially performs convolution, activation and residual connection operations according to the input third image feature to obtain a fourth image feature;

所述全连接层根据所述第四图像特征输出手势预测结果。The fully connected layer outputs a gesture prediction result according to the fourth image feature.

为了解决上述技术问题，本发明采用的另一技术方案为：In order to solve the above technical problems, another technical solution adopted by the present invention is:

一种基于时频数据融合的分层图卷积手势识别装置，包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序，所述处理器执行所述计算机程序时实现如上述的一种基于时频数据融合的分层图卷积手势识别方法中的各个步骤。A device for layered graph convolution gesture recognition based on time-frequency data fusion includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, each step in the above-mentioned layered graph convolution gesture recognition method based on time-frequency data fusion is implemented.

本发明的有益效果在于：通过使用图卷积神经网络以及图卷积核，能够自然地表达非欧式结构以及通过训练样本学习到较远关节点之间的相关性；采用分层手部图结构，将手部骨架分为多层并通过训练学习每一层的节点之间的相关性，从而摆脱对于初始的骨架图拓扑结构的依赖，使得模型可以提取到更加多样的骨架序列特征；通过空间注意力块代替冗余的图卷积块提取全局特征，并使用空间注意力块与图卷积块交替堆叠，既可以避免堆叠过多图卷积层带来的计算负担、降低系统复杂度，又确保了模型的全局特征提取能力。The beneficial effects of the present invention are as follows: by using a graph convolutional neural network and a graph convolution kernel, it is possible to naturally express non-Euclidean structures and learn the correlation between distant joints through training samples; by adopting a layered hand graph structure, the hand skeleton is divided into multiple layers and the correlation between the nodes of each layer is learned through training, thereby getting rid of the dependence on the initial skeleton graph topological structure, so that the model can extract more diverse skeleton sequence features; by replacing redundant graph convolution blocks with spatial attention blocks to extract global features, and using spatial attention blocks and graph convolution blocks to be stacked alternately, it is possible to avoid the computational burden brought by stacking too many graph convolution layers, reduce system complexity, and ensure the global feature extraction capability of the model.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1为本发明实施例中的一种基于时频数据融合的分层图卷积手势识别方法的网络结构示意图；FIG1 is a schematic diagram of a network structure of a layered graph convolution gesture recognition method based on time-frequency data fusion in an embodiment of the present invention;

图2为本发明实施例中的一种基于时频数据融合的分层图卷积手势识别方法的步骤流程图；FIG2 is a flowchart of a method for gesture recognition based on layered graph convolution and time-frequency data fusion according to an embodiment of the present invention;

图3为本发明实施例中的一种基于时频数据融合的分层图卷积手势识别方法中的手部层次化分解图卷积块的结构图；FIG3 is a structural diagram of a hand hierarchical decomposition graph convolution block in a layered graph convolution gesture recognition method based on time-frequency data fusion in an embodiment of the present invention;

图4为本发明实施例中的一种基于时频数据融合的分层图卷积手势识别方法中的可视化手部分层图；FIG4 is a visualized hand layer diagram in a layered graph convolution gesture recognition method based on time-frequency data fusion in an embodiment of the present invention;

图5为本发明实施例中的一种基于时频数据融合的分层图卷积手势识别方法中的手部节点编号示意图；FIG5 is a schematic diagram of hand node numbering in a layered graph convolution gesture recognition method based on time-frequency data fusion in an embodiment of the present invention;

图6为本发明实施例中的一种基于时频数据融合的分层图卷积手势识别方法中的多流融合预测结构图；FIG6 is a diagram of a multi-stream fusion prediction structure in a layered graph convolution gesture recognition method based on time-frequency data fusion in an embodiment of the present invention;

图7为本发明实施例中的一种基于时频数据融合的分层图卷积手势识别装置的结构示意图。FIG. 7 is a schematic diagram of the structure of a layered graph convolution gesture recognition device based on time-frequency data fusion in an embodiment of the present invention.

具体实施方式Detailed ways

为详细说明本发明的技术内容、所实现目的及效果，以下结合实施方式并配合附图予以说明。In order to explain the technical content, achieved objectives and effects of the present invention in detail, the following is an explanation in conjunction with the implementation modes and the accompanying drawings.

请参照图1，一种基于时频数据融合的分层图卷积手势识别方法，包括由基础块与全连接层组成的网络结构；所述基础块由分层手部图卷积模块、时间卷积模块、空间注意力模块、时间卷积模块依次残差连接组成；所述网络结构执行以下步骤：Please refer to Figure 1, a layered graph convolution gesture recognition method based on time-frequency data fusion includes a network structure consisting of a basic block and a fully connected layer; the basic block is composed of a layered hand graph convolution module, a temporal convolution module, a spatial attention module, and a temporal convolution module, which are sequentially connected by residual connections; the network structure performs the following steps:

由上述描述可知，本发明的有益效果在于：通过使用图卷积神经网络以及图卷积核，能够自然地表达非欧式结构以及通过训练样本学习到较远关节点之间的相关性；采用分层手部图结构，将手部骨架分为多层并通过训练学习每一层的节点之间的相关性，从而摆脱对于初始的骨架图拓扑结构的依赖，使得模型可以提取到更加多样的骨架序列特征；通过空间注意力块代替冗余的图卷积块提取全局特征，并使用空间注意力块与图卷积块交替堆叠，既可以避免堆叠过多图卷积层带来的计算负担、降低系统复杂度，又确保了模型的全局特征提取能力。From the above description, it can be seen that the beneficial effects of the present invention are: by using graph convolutional neural networks and graph convolution kernels, non-Euclidean structures can be naturally expressed and the correlation between distant joints can be learned through training samples; a layered hand graph structure is adopted, the hand skeleton is divided into multiple layers and the correlation between the nodes of each layer is learned through training, thereby getting rid of the dependence on the initial skeleton graph topological structure, so that the model can extract more diverse skeleton sequence features; global features are extracted by replacing redundant graph convolution blocks with spatial attention blocks, and spatial attention blocks and graph convolution blocks are used to stack alternately, which can not only avoid the computational burden brought by stacking too many graph convolution layers and reduce the system complexity, but also ensure the global feature extraction capability of the model.

进一步地，所述分层手部图卷积模块将待识别手势图像拆分成预设个数的关节分层图包括：Furthermore, the layered hand graph convolution module splits the gesture image to be recognized into a preset number of joint layered graphs including:

以预设数量的关节点将所述待识别手势图像转化成节点手势图，并确定所述节点手势图的中心关节点；Converting the gesture image to be recognized into a node gesture graph with a preset number of joint points, and determining the central joint point of the node gesture graph;

根据所述关节点到所述中心关节点的距离将所有所述关节点进行分组，得到分层关节点集合；Grouping all the joint points according to the distance from the joint point to the central joint point to obtain a hierarchical joint point set;

将相邻的两组所述分层关节点集合依次相连构成边集，并根据所述边集的数量，得到相应数量的所述分层图。Two adjacent groups of the hierarchical joint point sets are sequentially connected to form edge sets, and according to the number of the edge sets, a corresponding number of the hierarchical graphs are obtained.

由上述描述可知，通过以预设数量的关节点将待识别手势图像转化成节点手势图，以及将距离中心节点的跳数作为分组依据得到数据集合以及分层图，使得分层图具有不同的特征，从而通过分层图能够提取到更多具有多样性的图特征。From the above description, it can be seen that by converting the gesture image to be recognized into a node gesture graph with a preset number of joint points, and taking the number of hops from the central node as the basis for grouping, a data set and a layered graph are obtained, so that the layered graph has different characteristics, and more diverse graph features can be extracted through the layered graph.

进一步地，所述分层手部图卷积模块包括三个图卷积流以及一个图边卷积流；Furthermore, the layered hand graph convolution module includes three graph convolution flows and one graph edge convolution flow;

所述提取得到关节分层图特征包括：The extracted joint layered graph features include:

将所述分层图通过四个卷积核改变通道数形成四流输出，并分别通过所述图卷积流以及图边卷积流进行特征提取，得到三组图卷积特征以及一组图边卷积特征，其中，通过所述图卷积流进行特征提取包括：The layered graph is passed through four convolution kernels to change the number of channels to form a four-stream output, and feature extraction is performed through the graph convolution stream and the graph edge convolution stream respectively to obtain three groups of graph convolution features and one group of graph edge convolution features, wherein feature extraction through the graph convolution stream includes:

； ;

其中，L表示图卷积特征；k表示分层数；S=0，1，2，表示通道数；以及/>表示线性层；X表示输入；/>表示图邻接矩阵；/>表示中间变量，通过线性层/>对输入X处理后得到；Among them, L represents the graph convolution feature; k represents the number of layers; S=0, 1, 2, represents the number of channels; and/> represents a linear layer; X represents input; /> Represents a graph adjacency matrix; /> Represents the intermediate variable, through the linear layer/> After processing the input X, we get:

通过所述图边卷积流进行特征提取包括：Feature extraction through the graph edge convolution flow includes:

； ;

其中，表示第k层图边卷积特征，/>为输出通道数，/>为帧数，/>为手部关键点数目；/>表示实数集；in, represents the k-th layer edge convolution feature,/> is the number of output channels, /> is the number of frames, /> is the number of key points of the hand; /> represents the set of real numbers;

拼接所述图卷积特征以及图边卷积特征得到所述关节分层图特征，具体的：；The joint layered graph features are obtained by concatenating the graph convolution features and the graph edge convolution features. Specifically: ;

其中，||表示通道拼接操作；表示第k层的关节分层图特征；/>分别表示基于第零通道卷积流（S=0）得到的第k层图卷积特征、基于第一通道卷积流（S=1）得到的第k层图卷积特征、基于第二通道卷积流（S=2）得到的第k层图卷积特征。Among them, || represents the channel splicing operation; Represents the joint layered graph features of the kth layer; /> They respectively represent the k-th layer graph convolution features obtained based on the zeroth channel convolution flow (S=0), the k-th layer graph convolution features obtained based on the first channel convolution flow (S=1), and the k-th layer graph convolution features obtained based on the second channel convolution flow (S=2).

由上述描述可知，使用图卷积操作和图边卷积操作并行提取特征，通过图卷积操作可以根据节点位置提取拓扑结构信息，而通过图边卷积操作可以找到语义相近的节点进行特征提取，两种操作相结合可以提取到不同的特征使得模型有更强的泛化性；以及使用三个图卷积操作，能够基于每一卷积操作通道学习到不一样的特征，从而提取出更加多样的特征。From the above description, we can see that features are extracted in parallel using graph convolution operations and graph edge convolution operations. The graph convolution operation can extract topological structure information based on the node position, while the graph edge convolution operation can find nodes with similar semantics for feature extraction. The combination of the two operations can extract different features, making the model more generalizable. And by using three graph convolution operations, different features can be learned based on each convolution operation channel, thereby extracting more diverse features.

进一步地，所述拼接所有所述关节分层图特征，得到第一图像特征包括：Furthermore, the splicing of all the joint layered graph features to obtain the first image feature includes:

； ;

其中，；Z为注意力函数，用于计算各层输出/>的系数后进行加权相加得到/>；/>分别表示第1层的关节分层图特征、第2层的关节分层图特征、第3层的关节分层图特征以及第4层的关节分层图特征；/>表示第一图像特征。in, ; Z is the attention function, used to calculate the output of each layer/> The coefficients are then weighted and added to obtain / > ; /> Respectively represent the joint layered graph features of the first layer, the joint layered graph features of the second layer, the joint layered graph features of the third layer, and the joint layered graph features of the fourth layer;/> Represents the first image feature.

由上述描述可知，通过计算各个关节分层图特征的权重系数得到第一图像特征，实现对不同的关节分层图特征的汇总。It can be seen from the above description that the first image feature is obtained by calculating the weight coefficient of each joint layered graph feature, thereby realizing the aggregation of different joint layered graph features.

进一步地，所述时间卷积模块根据输入的所述第一图像特征依次进行卷积、激活以及残差连接包括：Furthermore, the temporal convolution module sequentially performs convolution, activation and residual connection according to the first image feature of the input, including:

； ;

其中，，/>为时间卷积模块的输出，表示第二图像特征；/>为输出通道数，/>为输出的帧数，/>为手部关键点数目；/>为ReLU激活函数，/>为卷积操作；/>为输入。in, ,/> is the output of the temporal convolution module, representing the second image feature; /> is the number of output channels, /> is the number of frames output, /> is the number of key points of the hand; /> is the ReLU activation function, /> It is a convolution operation; /> For input.

由上述描述可知，通过时间卷积模块对输入的特征进行卷积、激活以及残差连接操作，能够同时提取并聚合多帧特征以及提取单帧内的骨架全局特征。From the above description, it can be seen that by performing convolution, activation and residual connection operations on the input features through the temporal convolution module, it is possible to simultaneously extract and aggregate multi-frame features and extract the global skeleton features within a single frame.

进一步地，所述空间注意力模块根据输入的所述第二图像特征进行特征提取，得到第三图像特征包括：Furthermore, the spatial attention module performs feature extraction based on the input second image feature, and obtains the third image feature including:

将所述第二图像特征输入到串行的预设个数的空间注意力块进行处理；Inputting the second image feature into a preset number of spatial attention blocks in series for processing;

将完成空间注意力块处理后的数据通过一个线性层改变通道数并进行残差连接输出；The data processed by the spatial attention block is passed through a linear layer to change the number of channels and output through a residual connection;

将残差连接输的输出依次通过激活函数、线性层和批标准化；Pass the output of the residual connection through the activation function, linear layer and batch normalization in sequence;

再次通过激活函数激活后进行残差连接后输出，得到所述第三图像特征；After being activated again by the activation function, a residual connection is performed and then output, so as to obtain the third image feature;

具体公式如下：The specific formula is as follows:

； ;

其中，表示空间注意力块输出，/>；/>为手部关键点数目，/>为多头注意力机制的头的个数，/>是自注意力块输出通道数，/>为输出的帧数；/>和/>表示嵌入函数；/>表示残差连接输出，/>，/>为输出通道数；/>、/>表示不同线性层权重；/>表示批标准化后的输出，/>，B为批标准化操作；/>表示最终输出，/>，/>为激活函数；/>表示实数集；其中，softmax函数用来输出最后各类的预测概率；/>表示T模块的输出、T表示帧数、/>表示/>中的第τ帧特征。in, represents the output of the spatial attention block, /> ; /> is the number of key points of the hand, /> is the number of heads of the multi-head attention mechanism, /> is the number of output channels of the self-attention block,/> The number of frames to be output; /> and/> Represents an embedded function; /> Represents the residual connection output, /> ,/> is the number of output channels; /> 、/> Represents the weights of different linear layers; /> Represents the output after batch normalization, /> , B is the batch standardization operation; /> Indicates the final output, /> ,/> is the activation function; /> Represents a set of real numbers; the softmax function is used to output the predicted probabilities of each category; /> Represents the output of the T module, T represents the number of frames, /> Indicates/> The features of the τth frame in .

由上述描述可知，通过时间卷积模块以及空间注意力模块对输入的特征进行处理，能够同时提取并聚合多帧特征以及提取单帧内的骨架全局特征。From the above description, it can be seen that by processing the input features through the temporal convolution module and the spatial attention module, it is possible to simultaneously extract and aggregate multi-frame features and extract the global skeleton features within a single frame.

进一步地，还包括：Furthermore, it also includes:

通过傅里叶变化将关节流数据从时域转换到频域，得到相位特征和振幅特征；The joint flow data is converted from the time domain to the frequency domain through Fourier transform to obtain phase characteristics and amplitude characteristics;

将所述相位特征和振幅特征用于模型学习。The phase feature and amplitude feature are used for model learning.

由上述描述可知，通过傅里叶变换拓展特征数据流，使得融合预测时可以考虑到更多样的数据特征，从而提升模型的准确率和鲁棒性。From the above description, it can be seen that by expanding the feature data stream through Fourier transform, more diverse data features can be taken into account during fusion prediction, thereby improving the accuracy and robustness of the model.

本发明另一实施例提供了一种基于时频数据融合的分层图卷积手势识别装置，包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序，所述处理器执行所述计算机程序时实现如上述的一种基于时频数据融合的分层图卷积手势识别方法中的各个步骤。Another embodiment of the present invention provides a layered graph convolution gesture recognition device based on time-frequency data fusion, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, each step in the above-mentioned layered graph convolution gesture recognition method based on time-frequency data fusion is implemented.

本发明提供的基于时频数据融合的分层图卷积手势识别方法及装置能够应用于手势识别的场景，以下通过具体实施方式进行说明：The layered graph convolution gesture recognition method and device based on time-frequency data fusion provided by the present invention can be applied to the scene of gesture recognition, which is described below through specific implementation methods:

实施例一Embodiment 1

请参照图1，一种基于时频数据融合的分层图卷积手势识别方法，如（a）部分所示，包括由基础块（HHTSAT）与全连接层（Fully Connection，FC）组成的网络结构；如（b）部分所示，所述基础块由分层手部图卷积模块（Hierarchical Hand graph convolution module，HH）、时间卷积模块（Temporal convolution module，T）、空间注意力模块（SpatialAttention module，SA）、时间卷积模块依次残差连接组成，即按照HH、T、SA、T依次残差连接而成。Please refer to Figure 1, a hierarchical graph convolution gesture recognition method based on time-frequency data fusion, as shown in part (a), includes a network structure consisting of a basic block (HHTSAT) and a fully connected layer (Fully Connection, FC); as shown in part (b), the basic block is composed of a hierarchical hand graph convolution module (HH), a temporal convolution module (Temporal convolution module, T), a spatial attention module (SpatialAttention module, SA), and a temporal convolution module, which are residually connected in sequence, that is, HH, T, SA, and T are residually connected in sequence.

请参照图2，所述网络结构执行以下步骤：Referring to FIG. 2 , the network structure performs the following steps:

S1、所述分层手部图卷积模块将待识别手势图像拆分成预设个数的关节分层图，并提取得到关节分层图特征；拼接所有所述关节分层图特征，得到第一图像特征；如图1中的（c）部分所示，分层手部图卷积模块由1个手部层次化分解图卷积块（HierarchicallyDecomposed Hand Graph Convolution，HDH-GC）与1个层注意力融合模块（Attention-Guided Hierarchy Aggregation，A-HA）组成，其中HDH-GC可继续拆分成4个如图3所示的图卷积块，表示第k层图卷积块的结构，其能够提取到更具多样性的图特征，具体的：S1. The layered hand graph convolution module splits the gesture image to be recognized into a preset number of joint layered graphs, and extracts the joint layered graph features; splices all the joint layered graph features to obtain the first image feature; as shown in part (c) of Figure 1, the layered hand graph convolution module consists of a hand hierarchically decomposed graph convolution block (Hierarchically Decomposed Hand Graph Convolution, HDH-GC) and a layer attention fusion module (Attention-Guided Hierarchy Aggregation, A-HA), where HDH-GC can be further split into 4 graph convolution blocks as shown in Figure 3, which represents the structure of the k-th layer graph convolution block, which can extract more diverse graph features, specifically:

S11、以预设数量的关节点将所述待识别手势图像转化成节点手势图，并确定所述节点手势图的中心关节点；例如确定掌心节点为中心关节点；S11, converting the gesture image to be recognized into a node gesture graph with a preset number of joint points, and determining the central joint point of the node gesture graph; for example, determining the palm node as the central joint point;

S12、根据每一所述关节点到所述中心关节点的距离将所有所述关节点进行分组，得到分层关节点集合；S12, grouping all the joint points according to the distance from each joint point to the central joint point to obtain a hierarchical joint point set;

S13、将相邻的两组所述分层关节点集合依次相连构成边集，并根据所述边集的数量，得到相应数量的所述分层图；如图4所示，为将手部分层后的可视化结果；其中，每层图中的空心点集初始时不属于该层，而实心点集与斜线条纹点集是相邻的两组点集，将相邻两组的节点依次相连构成边集，定义形成的边集用邻接矩阵表示；例如将手部的关节点分为K+1组（K=4，对于目前使用的两个数据集说K只能取4，若是存储其他数据集且使用不同数量的关键点表示手势，则需要修改K值重新划分组别），其中各分组节点集合具体为：S13, connecting two adjacent groups of the layered joint point sets in sequence to form edge sets, and obtaining a corresponding number of the layered graphs according to the number of the edge sets; as shown in FIG4, it is a visualization result after the hand is layered; wherein, the hollow point set in each layer graph does not initially belong to the layer, while the solid point set and the oblique stripe point set are two adjacent groups of point sets, and the nodes of the two adjacent groups are connected in sequence to form an edge set, and the defined formed edge set is represented by an adjacency matrix; for example, the joint points of the hand are divided into K+1 groups (K=4, for the two data sets currently used, K can only be 4, if other data sets are stored and different numbers of key points are used to represent gestures, it is necessary to modify the K value and re-divide the groups), wherein each grouped node set is specifically:

； ;

如图5所示，其中1到22表示手部节点编号；接着定义分组后的图邻接矩阵：As shown in Figure 5, 1 to 22 represent the hand node numbers; then define the grouped graph adjacency matrix:

； ;

其中，k为分层数，为重心分区策略中由自环边集、向外边集和向内边集三个边子集组成一个集合；/>为手部关键点数目；/>是分层之后的第/>（/>=1、2、3、4）层边集；/>表示将两组节点只与自身相连形成自环边集；/>表示从第/>关节点组中的每个点指向第/>关节点组中的每个点相邻的单向边的向外边集；/>表示从第/>关节点组中的每个点指向第/>关节点组中的每个点形成的单向边的向内边集；Where k is the number of layers, In the centroid partitioning strategy, a set is formed by three edge subsets: the self-loop edge set, the outward edge set and the inward edge set;/> is the number of key points of the hand; /> It is the first after stratification/> （/> =1, 2, 3, 4) layer edge set; /> Indicates that two groups of nodes are connected only to themselves to form a self-loop edge set; /> Indicates that from the /> Each point in the joint point group points to the first/> The outward edge set of the one-way edges adjacent to each point in the joint point group; /> Indicates that from the /> Each point in the joint point group points to the first/> The inward edge set of unidirectional edges formed by each point in the joint point group;

如图3所示，为第k层图卷积块的结构，其包括三个图卷积流以及一个图边卷积流，将所述分层图通过四个大小为1×1的CNN卷积核改变通道数形成四流输出，并分别通过所述图卷积流以及图边卷积流（Edge Convolution， Edge-Conv）进行特征提取，得到三组图卷积特征以及一组图边卷积特征；拼接所述图卷积特征以及图边卷积特征得到所述关节分层图特征，其作用是提取更加多样的特征，因为图卷积操作可以很好的根据节点位置提取拓扑结构信息；而图边卷积操作可以找到语义相近的节点进行特征提取，因此两种操作相结合可以提取到不同的特征使得模型有更强的泛化性；同时使用三个图卷积操作其目的是类比CNN卷积核的多通道，每个通道可以学习到不一样的特征，具体的：As shown in Figure 3, it is the structure of the k-th layer graph convolution block, which includes three graph convolution flows and one graph edge convolution flow. The layered graph is passed through four CNN convolution kernels of size 1×1 to change the number of channels to form a four-stream output, and feature extraction is performed through the graph convolution flow and the graph edge convolution flow (Edge Convolution, Edge-Conv) respectively to obtain three groups of graph convolution features and one group of graph edge convolution features; the graph convolution features and the graph edge convolution features are concatenated to obtain the joint layered graph features, which is used to extract more diverse features, because the graph convolution operation can well extract topological structure information according to the node position; and the graph edge convolution operation can find nodes with similar semantics for feature extraction, so the combination of the two operations can extract different features to make the model more generalized; the purpose of using three graph convolution operations at the same time is to simulate the multi-channel of the CNN convolution kernel, and each channel can learn different features, specifically:

通过所述图卷积流进行特征提取包括，例如每个节点通过K-邻近算法动态聚合与其相关的前K个其他节点特征并通过相加聚合特征，其公式如下：The feature extraction through the graph convolution flow includes, for example, dynamically aggregating the features of the first K other nodes related to each node through the K-nearest neighbor algorithm and aggregating the features by addition, and the formula is as follows:

； ;

其中，L表示图卷积特征；k表示分层数；S=0，1，2，表示通道数，即S=0，1，2分别表示第零通道卷积流、第一通道卷积流以及第二通道卷积流；表示线性层，将x从低纬度映射到高维空间；/>的权重/>，/>为输入通道数，/>为输出通道数；/>表示线性层，/>的权重/>；/>；X表示输入；/>表示图邻接矩阵；/>表示中间变量，通过线性层/>对输入X处理后得到；Where L represents the graph convolution feature; k represents the number of layers; S=0, 1, 2 represents the number of channels, that is, S=0, 1, 2 represent the zeroth channel convolution flow, the first channel convolution flow and the second channel convolution flow respectively; Represents a linear layer that maps x from low-dimensional space to high-dimensional space; /> The weight of /> ,/> is the number of input channels, /> is the number of output channels; /> represents a linear layer, /> The weight of /> ; /> ;X indicates input;/> Represents a graph adjacency matrix; /> Represents the intermediate variable, through the linear layer/> After processing the input X, we get:

； ;

其中，表示第k层图边卷积特征，/>为帧数，并通过平均池化层（avgpool）以及图边卷积流进行处理；in, represents the k-th layer edge convolution feature,/> is the number of frames, and is processed by the average pooling layer (avgpool) and the edge convolution flow;

所述拼接所述图卷积特征以及图边卷积特征得到所述关节分层图特征包括：The step of concatenating the graph convolution features and the graph edge convolution features to obtain the joint layered graph features includes:

； ;

其中，||表示通道拼接操作；表示第k层的关节分层图特征，/>；分别表示基于第零通道卷积流（S=0）得到的第k层图卷积特征、基于第一通道卷积流（S=1）得到的第k层图卷积特征、基于第二通道卷积流（S=2）得到的第k层图卷积特征；Among them, || represents the channel splicing operation; Represents the joint layered graph features of the kth layer, /> ; They respectively represent the k-th layer graph convolution features obtained based on the zero-th channel convolution flow (S=0), the k-th layer graph convolution features obtained based on the first channel convolution flow (S=1), and the k-th layer graph convolution features obtained based on the second channel convolution flow (S=2);

所述拼接所有所述关节分层图特征，得到第一图像特征包括：The splicing of all the joint layered graph features to obtain the first image feature includes:

； ;

其中，；Z为注意力函数，用于计算各层输出/>的系数后进行加权相加得到/>；/>分别表示第1层的关节分层图特征、第2层的关节分层图特征、第3层的关节分层图特征以及第4层的关节分层图特征；/>表示第一图像特征。in, ; Z is the attention function, used to calculate the output of each layer/> The coefficients are then weighted and added to obtain / > ; /> Respectively represent the joint layered graph features of the first layer, the joint layered graph features of the second layer, the joint layered graph features of the third layer, and the joint layered graph features of the fourth layer; /> Represents the first image feature.

S2、所述时间卷积模块对输入的每一所述第一图像特征依次进行卷积、激活以及残差连接操作，得到第二图像特征；如图1中的（d）部分所示，输入通过大小为的卷积核再通过激活函数进行激活，最后进行残差连接，具体公式如下：S2, the temporal convolution module sequentially performs convolution, activation and residual connection operations on each of the first image features of the input to obtain a second image feature; as shown in part (d) of FIG1, the input is a The convolution kernel is then activated by the activation function, and finally the residual connection is performed. The specific formula is as follows:

； ;

其中，，/>为时间卷积模块的输出，表示第二图像特征；/>为输出通道数，/>为输出的帧数，/>为手部关键点数目；/>为ReLU激活函数，/>为卷积操作；/>为输入，其可以是HH模块的输出或者是SA模块的输出；本实施例中T模块的细节图中是以HH模块的输出进行说明。in, ,/> is the output of the temporal convolution module, representing the second image feature; /> is the number of output channels, /> is the number of frames output, /> is the number of key points of the hand; /> is the ReLU activation function, /> It is a convolution operation; /> The input may be the output of the HH module or the output of the SA module. In the detailed diagram of the T module in this embodiment, the output of the HH module is used for explanation.

S3、所述空间注意力模块根据输入的所述第二图像特征进行特征提取，得到第三图像特征；如图1中的（e）部分所示，其具体步骤如下：S3, the spatial attention module extracts features according to the input second image features to obtain third image features; as shown in part (e) of FIG1 , the specific steps are as follows:

S31、将所述第二图像特征输入到串行的预设个数（U）的空间注意力块（SpatialAttentionBlock）进行处理；S31, inputting the second image feature into a preset number (U) of serial spatial attention blocks (SpatialAttentionBlock) for processing;

S32、将完成空间注意力块处理后的数据通过一个线性层改变通道数并进行残差连接输出；S32, passing the data processed by the spatial attention block through a linear layer to change the number of channels and perform residual connection output;

S33、将残差连接输的输出依次通过激活函数LeakyReLU、线性层（Linear）和批标准化（BatchNormalization，BN）；S33, pass the output of the residual connection through the activation function LeakyReLU, the linear layer (Linear) and batch normalization (BatchNormalization, BN) in sequence;

S34、再次通过激活函数LeakyReLU激活后进行残差连接后输出，得到所述第三图像特征，具体公式如下：S34, after being activated again by the activation function LeakyReLU and performing residual connection, the third image feature is obtained. The specific formula is as follows:

； ;

其中，表示空间注意力块输出，通过/>变形而来，/>；/>为手部关键点数目，/>为多头注意力机制的头的个数，/>是自注意力块输出通道数，/>为输出的帧数；/>和/>表示嵌入函数；/>表示残差连接输出，/>，/>为输出通道数；/>、为一个线性层其权重/>，/>为另一个线性层其权重/>；/>表示批标准化后的输出，/>，B为批标准化操作；/>表示最终输出，/>，/>为LeakyReLU激活函数；其中，softmax函数用来输出最后各类的预测概率；/>表示T模块的输出、T表示帧数、/>表示/>中的第τ帧特征。in, Represents the output of the spatial attention block, through/> Transformed, /> ; /> is the number of key points of the hand, /> is the number of heads of the multi-head attention mechanism, /> is the number of output channels of the self-attention block,/> The number of frames to be output; /> and/> Represents an embedded function; /> Represents the residual connection output, /> ,/> is the number of output channels; /> , is a linear layer with weights/> ,/> For another linear layer its weights/> ; /> Represents the output after batch normalization, /> , B is the batch standardization operation; /> Indicates the final output, /> ,/> is the LeakyReLU activation function; the softmax function is used to output the final prediction probability of each category; /> Represents the output of the T module, T represents the number of frames, /> Indicates/> The features of the τth frame in .

S4、所述时间卷积模块对输入的所述第三图像特征依次进行卷积、激活以及残差连接操作，得到第四图像特征；该部分操作与步骤S2中的操作相同。S4, the temporal convolution module sequentially performs convolution, activation and residual connection operations on the input third image feature to obtain a fourth image feature; this part of the operation is the same as the operation in step S2.

T模块和SA模块用于同时提取并聚合多帧特征以及提取单帧内的骨架全局特征。The T module and SA module are used to simultaneously extract and aggregate multi-frame features and extract skeleton global features within a single frame.

S5、所述全连接层根据所述第四图像特征输出手势预测结果；即全连接层输出各类预测分数。S5. The fully connected layer outputs a gesture prediction result according to the fourth image feature; that is, the fully connected layer outputs various prediction scores.

其中，在一可选的实施方式中，为了让模型学习到更加多样的数据，本实施例中利用快速傅里叶变换（Fast Fourier Transform，FFT）将关节流数据从时域转换到频域，得到相位和振幅的特征供模型学习；关节数据流指的是数据集原始的数据，即通过25个关节点在每一帧的三维坐标，是进入模型之前的数据；之后每个数据流都使用本实施例中的模型训练相对应的权重参数。Among them, in an optional implementation, in order to allow the model to learn more diverse data, the joint flow data is converted from the time domain to the frequency domain using Fast Fourier Transform (FFT) in this embodiment to obtain phase and amplitude features for model learning; the joint data stream refers to the original data of the data set, that is, the three-dimensional coordinates of the 25 joint points in each frame, which is the data before entering the model; thereafter, each data stream uses the corresponding weight parameters for model training in this embodiment.

如图6所示，展示了多流融合预测的结构图，通过每个数据流预测的分数进行加权求和，得到一个的预测向量，其中/>为预测类别数，/>通过以下公式计算：As shown in Figure 6, the structure diagram of multi-stream fusion prediction is shown. The weighted sum of the scores predicted by each data stream is used to obtain a The prediction vector of is the number of predicted categories, /> Calculated by the following formula:

其中，为对应/>流数据的预测向量；/>是0到1之间的值；/>，/>为关节流数据，/>为关节运动流数据，/>为本文提出的通过关节流数据经过FFT得到的幅度特征；具体来说，对/>应用FFT可表示为：in, To correspond/> Prediction vector of streaming data; /> is a value between 0 and 1; /> ,/> is the joint flow data, /> is the joint motion flow data, /> This paper proposes the amplitude feature obtained by FFT of joint flow data; specifically, for/> Applying FFT can be expressed as:

其中，是计算结果的实部，/>是计算结果的虚部；本实施例中取/>。in, is the real part of the result, /> is the imaginary part of the calculation result; in this embodiment, / > .

实施例二Embodiment 2

请参照图7，一种基于时频数据融合的分层图卷积手势识别装置，包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序，所述处理器执行所述计算机程序时实现如实施例一所述的一种基于时频数据融合的分层图卷积手势识别方法中的各个步骤；具体实验效果在DHG-14/28以及SHREC17数据集上的展示：Please refer to Figure 7, a layered graph convolution gesture recognition device based on time-frequency data fusion includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, each step in the layered graph convolution gesture recognition method based on time-frequency data fusion as described in Example 1 is implemented; specific experimental results are shown on the DHG-14/28 and SHREC17 data sets:

表1.本实施例模型在DHG-14上与其他模型的性能比较Table 1. Performance comparison of the model in this example with other models on DHG-14

其中，*为本实施例的基线模型。Wherein, * is the baseline model of this embodiment.

表2.本实施例模型在DHG-28上与其他模型的性能比较Table 2. Performance comparison of the model in this example with other models on DHG-28

表3.本实施例模型在SHREC’17的14类手势分类任务上与其他模型的性能对比Table 3. Performance comparison of the model in this embodiment with other models on the 14-category gesture classification task of SHREC’17

表4.本实施例模型在SHREC’17的28类手势分类任务上与其他模型的性能对比Table 4. Performance comparison of the model in this embodiment with other models on the 28-category gesture classification task of SHREC’17

由表1、表2、表3以及表4可知，本实施例中的网络模型在DHG-14/28以及SHREC17数据集上均有较高的准确率。It can be seen from Table 1, Table 2, Table 3 and Table 4 that the network model in this embodiment has a high accuracy on the DHG-14/28 and SHREC17 datasets.

综上所述，本发明提供的一种基于时频数据融合的分层图卷积手势识别方法及装置，通过使用图卷积神经网络以及图卷积核，能够自然地表达非欧式结构以及通过训练样本学习到较远关节点之间的相关性；采用分层手部图结构，将手部骨架分为多层并通过训练学习每一层的节点之间的相关性，从而摆脱对于初始的骨架图拓扑结构的依赖，使得模型可以提取到更加多样的骨架序列特征；通过空间注意力块代替冗余的图卷积块提取全局特征，并使用空间注意力块与图卷积块交替堆叠，既可以避免堆叠过多图卷积层带来的计算负担、降低系统复杂度，又确保了模型的全局特征提取能力；以及通过傅里叶变换拓展特征数据流，使得融合预测时可以考虑到更多样的数据特征，提升模型的准确率和鲁棒性。In summary, the present invention provides a hierarchical graph convolution gesture recognition method and device based on time-frequency data fusion. By using a graph convolutional neural network and a graph convolution kernel, it is possible to naturally express non-Euclidean structures and learn the correlation between distant joints through training samples; a hierarchical hand graph structure is adopted to divide the hand skeleton into multiple layers and learn the correlation between the nodes of each layer through training, thereby getting rid of the dependence on the initial skeleton graph topological structure, so that the model can extract more diverse skeleton sequence features; global features are extracted by replacing redundant graph convolution blocks with spatial attention blocks, and spatial attention blocks and graph convolution blocks are alternately stacked, which can avoid the computational burden brought by stacking too many graph convolution layers, reduce system complexity, and ensure the global feature extraction capability of the model; and the feature data stream is expanded by Fourier transform, so that more diverse data features can be considered in fusion prediction, thereby improving the accuracy and robustness of the model.

即针对背景技术中所提到的问题1：由于本发明使用的是图卷积神经网络，所以可以非常自然地表达非欧式结构。That is, regarding problem 1 mentioned in the background technology: since the present invention uses a graph convolutional neural network, non-Euclidean structures can be expressed very naturally.

针对问题2：本发明使用的是可学习的图卷积核，它可以通过训练样本学习到较远关节点之间的相关性。For question 2: The present invention uses a learnable graph convolution kernel, which can learn the correlation between distant joint points through training samples.

针对问3：本方发明使用手部图卷积模块将手部骨架分为多层并通过训练学习每一层的节点之间的相关性，从而摆脱对于初始的骨架图拓扑结构的依赖。Regarding Question 3: Our invention uses a hand graph convolution module to divide the hand skeleton into multiple layers and learns the correlation between the nodes of each layer through training, thereby getting rid of the dependence on the initial skeleton graph topological structure.

针对问4：本发明通过空间注意力块代替冗余的图卷积块提取全局特征，并使用空间注意力块与图卷积块交替堆叠，起到既可以减少缺点4中提到的计算开销大的缺点，也可以保留骨架数据中局部特征和全局特征信息。Regarding Question 4: The present invention uses spatial attention blocks to replace redundant graph convolution blocks to extract global features, and uses spatial attention blocks and graph convolution blocks to be stacked alternately, which can not only reduce the high computational overhead mentioned in Disadvantage 4, but also retain the local and global feature information in the skeleton data.

以上所述仅为本发明的实施例，并非因此限制本发明的专利范围，凡是利用本发明说明书及附图内容所作的等同变换，或直接或间接运用在相关的技术领域，均同理包括在本发明的专利保护范围内。The above descriptions are merely embodiments of the present invention and are not intended to limit the patent scope of the present invention. Any equivalent transformations made using the contents of the present invention's specification and drawings, or directly or indirectly applied in related technical fields, are also included in the patent protection scope of the present invention.

Claims

1. A hierarchical graph convolution gesture recognition method based on time-frequency data fusion is characterized by comprising a network structure formed by a basic block and a full-connection layer; the basic block is formed by sequentially connecting a layered hand graph convolution module, a time convolution module, a space attention module and a time convolution module in a residual way; the network structure performs the steps of:

the layered hand graph rolling module splits the gesture image to be recognized into joint layered graphs with preset numbers, and extracts characteristics of the joint layered graphs; splicing all the joint layered graph features to obtain a first image feature;

the time convolution module carries out convolution, activation and residual error connection operation on each input first image feature in sequence to obtain a second image feature;

The spatial attention module performs feature extraction according to the input second image features to obtain third image features;

The time convolution module carries out convolution, activation and residual error connection operation on the input third image feature in sequence to obtain a fourth image feature;

The full-connection layer outputs a gesture prediction result according to the fourth image characteristic;

the layering hand graph rolling module splits a gesture image to be recognized into joint layering graphs with preset numbers, and the joint layering graph rolling module comprises:

Converting the gesture image to be recognized into a node gesture image by a preset number of joint points, and determining a central joint point of the node gesture image;

grouping all the joint points according to the distance from each joint point to the central joint point to obtain a layered joint point set;

and connecting two adjacent layered joint point sets in sequence to form an edge set, and obtaining the layered graph with corresponding numbers according to the number of the edge sets.

2. The hierarchical graph convolution gesture recognition method based on time-frequency data fusion according to claim 1, wherein the hierarchical hand graph convolution module comprises three graph convolution streams and one graph edge convolution stream;

the extracting the joint layered graph features comprises the following steps:

Changing the channel number of the layered graph through four convolution kernels to form four-stream output, and extracting features through the graph convolution stream and the graph edge convolution stream respectively to obtain three groups of graph convolution features and one group of graph edge convolution features;

and splicing the graph convolution characteristic and the graph edge convolution characteristic to obtain the joint layered graph characteristic.

3. The hierarchical graph convolution gesture recognition method based on time-frequency data fusion according to claim 2, wherein feature extraction by the graph convolution stream comprises:

；

wherein L represents a graph convolution feature; k represents the number of layers; s=0, 1,2, representing the number of channels; /> Representing a linear layer; x represents an input; /(I)Representing a graph adjacency matrix; /(I)Representing intermediate variables by linear layer/>The input X is processed to obtain;

the feature extraction through the graph edge convolution flow comprises the following steps:

；

Wherein, Representing the edge convolution characteristics of the k-th layer diagram/>For outputting channel number,/>For the number of frames,/>The number of key points of the hand; Representing a real set;

The step of splicing the graph convolution feature and the graph edge convolution feature to obtain the joint layered graph feature comprises the following steps:

；

Wherein, the I represents a channel splicing operation; Features of the joint hierarchy map representing a kth layer; /(I) The k-th layer convolution feature obtained based on the zeroth channel convolution stream (s=0), the k-th layer convolution feature obtained based on the first channel convolution stream (s=1), and the k-th layer convolution feature obtained based on the second channel convolution stream (s=2) are respectively represented.

4. The method for recognizing a hierarchical graph convolution gesture based on time-frequency data fusion according to claim 3, wherein the stitching all the joint hierarchical graph features to obtain a first image feature comprises:

；

Wherein, ; Z is the attention function for calculating the output/>, of each layerAfter the coefficients of (2) are weighted and added to obtain/>；/>Respectively representing the characteristics of the joint hierarchy map of the 1 st layer, the characteristics of the joint hierarchy map of the 2 nd layer, the characteristics of the joint hierarchy map of the 3 rd layer and the characteristics of the joint hierarchy map of the 4 th layer; /(I)Representing a first image feature.

5. The method for recognizing a hierarchical convolution gesture based on time-frequency data fusion according to claim 1, wherein the time convolution module sequentially convolves, activates and residual connects each of the inputted first image features, comprising:

；

Wherein, ，/>Representing a second image feature as an output of the temporal convolution module; /(I)In order to output the number of channels,For the number of frames output,/>The number of key points of the hand; /(I)For ReLU activation function,/>Is a convolution operation; /(I)Is input.

6. The method for recognizing a hierarchical convolution gesture based on time-frequency data fusion according to claim 1, wherein the spatial attention module performs feature extraction according to the input second image feature, and obtaining a third image feature comprises:

Inputting the second image characteristic into a serial space attention block with a preset number for processing;

changing the channel number of the data processed by the spatial attention block through a linear layer, and carrying out residual connection output;

The output of the residual connection input sequentially passes through an activation function, a linear layer and batch standardization;

and performing residual connection after activating the activation function again, and outputting to obtain the third image feature.

7. The method for recognizing a hierarchical convolution gesture based on time-frequency data fusion according to claim 6, wherein the obtaining the third image feature comprises:

；

Wherein, Representing spatial attention block output,/>；/>For the number of key points of hands,/>The number of heads for the multi-head attention mechanism,/>Is the number of self-attention block output channels,/>The number of frames to be output; /(I)And/>Representing an embedding function; /(I)Representing residual connection output,/>，/>The number of output channels; /(I)、/>Representing different linear layer weights; /(I)Representing the output after batch normalization,/>B is batch standardization operation; /(I)The final output is indicated as such,，/>Is an activation function; the softmax function is used for outputting the prediction probability of the last category; /(I)Representing the output of the T module, T representing the number of frames,/>Representation/>Is characterized by the tau frame.

8. The hierarchical graph convolution gesture recognition method based on time-frequency data fusion according to claim 1, further comprising:

Converting the throttle data from a time domain to a frequency domain through Fourier change to obtain a phase characteristic and an amplitude characteristic;

The phase and amplitude characteristics are used to optimize the network structure.

9. A time-frequency data fusion-based hierarchical graph convolution gesture recognition device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein execution of the computer program by the processor implements the steps of a time-frequency data fusion-based hierarchical graph convolution gesture recognition method according to any one of claims 1-8.