
CN116953642A - Millimeter wave radar gesture recognition method based on adaptive coding Vision Transformer network - Google Patents

Millimeter wave radar gesture recognition method based on adaptive coding Vision Transformer network

Info

Publication number
CN116953642A
CN116953642A
Authority
CN
China
Prior art keywords
image
adaptive coding
gesture
rgb
diffusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310782312.1A
Other languages
Chinese (zh)
Inventor
吴振华
吴俊杰
钱军
王腾鑫
张磊
杨利霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202310782312.1A priority Critical patent/CN116953642A/en
Publication of CN116953642A publication Critical patent/CN116953642A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S7/00Details of systems according to groups G01S13/00, G01S15/00, G01S17/00
    • G01S7/02Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S13/00
    • G01S7/41Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S13/00 using analysis of echo signal for target characterisation; Target signature; Target cross-section
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S7/00Details of systems according to groups G01S13/00, G01S15/00, G01S17/00
    • G01S7/02Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S13/00
    • G01S7/41Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S13/00 using analysis of echo signal for target characterisation; Target signature; Target cross-section
    • G01S7/415Identification of targets based on measurements of movement associated with the target
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01SRADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S7/00Details of systems according to groups G01S13/00, G01S15/00, G01S17/00
    • G01S7/02Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S13/00
    • G01S7/41Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S13/00 using analysis of echo signal for target characterisation; Target signature; Target cross-section
    • G01S7/417Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S13/00 using analysis of echo signal for target characterisation; Target signature; Target cross-section involving the use of neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Databases & Information Systems (AREA)
  • Remote Sensing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a millimeter wave radar gesture recognition method based on an adaptive coding Vision Transformer network, comprising: collecting the gestures to be classified with a millimeter wave radar; extracting features from the collected gesture echo data and synthesizing the extracted micro-Doppler features and elevation-azimuth features into three-channel RGB gesture feature maps; expanding the RGB gesture feature maps with a denoising diffusion implicit model; mixing the resulting fitted images with the RGB gesture feature maps to form a data set and dividing it into a training data set and a validation data set; building an adaptive coding Vision Transformer network model; training the adaptive coding Vision Transformer network model with the training data set; and feeding the validation data set into the trained adaptive coding Vision Transformer network model for classification. By adopting the adaptive coding Vision Transformer network model, the invention reduces computational complexity, improves recognition accuracy, and is easy to deploy under low computing power.

Description

Millimeter wave radar gesture recognition method based on adaptive coding Vision Transformer network

Technical Field

The present invention relates to the technical field of millimeter wave radar gesture recognition, and in particular to a millimeter wave radar gesture recognition method based on an adaptive coding Vision Transformer network.

Background

Because the techniques for extracting the micro-Doppler and elevation-azimuth signatures of a moving object from FMCW radar echo signals are relatively mature, FMCW radar offers an additional option in the field of gesture recognition. Many traditional FMCW gesture recognition schemes exist; they require manually separating, after feature extraction, a feature set that depends on domain knowledge, and then classifying these feature vectors into different categories with a suitable machine learning model, such as a support vector machine (SVM), k-nearest neighbors, or random forest, to recognize the gestures.

With the rapid development of deep learning, the many models derived from it provide more options for gesture classification, and deep learning models also reduce the complexity of gesture feature extraction. Some researchers have used three-channel synthesized gesture feature data for CNN gesture classification with very good results. However, gestures themselves are complex: a plain CNN model is not sufficient for gestures with partially similar or indistinct features, so multi-channel gesture feature recognition fusing channel attention and spatial attention has been proposed, such as the CNN-based enhanced multi-channel synthetic gesture feature recognition of Du et al. (see Chuan Du et al., "Enhanced Multi-Channel Feature Synthesis for Hand Gesture Recognition Based on CNN With a Channel and Spatial Attention Mechanism", IEEE Access, 2020, 8:144610). Nevertheless, current CNN classifiers share the following problems:

First, whole-image attention loses the local features of the image, so its performance is not ideal on images that differ only subtly.

Second, existing CNN algorithms demand too much computing power and are unsuitable for low-compute deployment.

Third, they do not consider the correlations among the pixels of the image itself; applying convolution directly ignores the relationships between pixels.

Summary of the Invention

To solve the technical problems existing in the background art, the present invention proposes a millimeter wave radar gesture recognition method based on an adaptive coding Vision Transformer network.

The millimeter wave radar gesture recognition method based on an adaptive coding Vision Transformer network proposed by the present invention includes:

collecting the gestures to be classified with a millimeter wave radar;

extracting features from the collected gesture echo data to obtain micro-Doppler features and elevation-azimuth features, and synthesizing the micro-Doppler features and elevation-azimuth features into a three-channel RGB gesture feature map;

expanding the RGB gesture feature maps with a denoising diffusion implicit model to obtain fitted images;

mixing the fitted images with the RGB gesture feature maps to form a data set, and dividing the data set into a training data set and a validation data set;

building an adaptive coding Vision Transformer network model;

training the adaptive coding Vision Transformer network model with the training data set;

feeding the validation data set into the trained adaptive coding Vision Transformer network model for classification to obtain the classification results.

Further, expanding the RGB gesture feature maps with the denoising diffusion implicit model specifically includes:

establishing the diffusion process and the noise prediction model for the RGB gesture feature maps, where the noise prediction model adopts a U-Net neural network structure;

feeding the RGB gesture feature maps into the noise prediction model according to the diffusion process and training the noise prediction model accordingly, thereby obtaining a trained denoising diffusion implicit model;

expanding the RGB gesture feature maps with the trained denoising diffusion implicit model.

Further, the denoising diffusion implicit model includes a forward diffusion process and a reverse denoising generation process:

the forward diffusion process continually adds Gaussian noise to the RGB gesture feature map x_0, turning the original data distribution into a normal distribution and yielding a pure-noise image x_t;

in the reverse denoising generation process, the noise prediction model is used to rapidly restore the pure-noise image x_t from the normal distribution to the original data distribution, obtaining a fitted image.

Further, the noise prediction model includes a generator and a discriminator. Feeding the RGB gesture feature maps into the noise prediction model according to the diffusion process and training it accordingly to obtain the trained denoising diffusion implicit model specifically includes:

applying diffusion noising to the RGB gesture feature maps to obtain the noised images and the variances of their noise components;

feeding the noised images and the variances of their noise components into the generator to obtain predicted result images;

feeding the RGB gesture feature maps and the predicted result images into the discriminator to obtain quality information of the predicted result images;

optimizing the generator and the discriminator according to the image quality information of the prediction results to obtain a trained generator and discriminator;

updating the trained generator and discriminator into the fast denoising prediction model to obtain the trained denoising diffusion implicit model.

Further, the quality information of the predicted result image includes the Kernel Inception Distance (KID) between the feature maps of the predicted result image and of the corresponding RGB gesture feature map.

Further, the discriminator includes an input layer, a resize layer, an InceptionV3 module, and a computation module.

Further, the adaptive coding Vision Transformer network model includes an image partition module, a patch encoding module, a Transformer Encoder feature extraction module, and an MLP classification module.

The image partition module divides the image into multiple image patches; the patch encoding module encodes each patch, where each encoded patch carries the position information of that patch and information about the relationships between the pixels within it; the Transformer Encoder feature extraction module extracts the feature vectors of the encoded patches; and the MLP classification module classifies according to the extracted feature vectors.

Further, the MLP classification module includes a third normalization layer, a Dropout layer, and three fully connected layers.

Further, the Transformer Encoder feature extraction module includes 8 encoder blocks connected end to end, each Transformer encoder block including a first normalization layer, a second normalization layer, a multi-head attention layer, and a multi-layer perceptron;

the input of the first normalization layer serves as the input of the encoder block; the output of the first normalization layer is connected to the input of the multi-head attention layer, and the input of the first normalization layer is skip-connected with the output of the multi-head attention layer and fed to the input of the second normalization layer; the output of the second normalization layer is connected to the input of the multi-layer perceptron and is skip-connected with the output of the multi-layer perceptron to form the output of the encoder block.

Further, feeding the validation data set into the trained adaptive coding Vision Transformer network model for classification further includes:

displaying the classification effect with the t-SNE algorithm.

In the present invention, the proposed millimeter wave radar gesture recognition method based on an adaptive coding Vision Transformer network collects the gestures to be classified with a millimeter wave radar, extracts micro-Doppler and elevation-azimuth features from the collected gesture echo data, synthesizes them into three-channel RGB gesture feature maps, and then expands the RGB gesture feature maps with a denoising diffusion implicit model. Once a portion of data has been obtained, this eliminates the subsequent data collection process, which is time consuming and yields unstable results. In addition, compared with traditional classification algorithms such as SVM, k-nearest neighbors, random forest, and convolutional neural networks, the present invention adopts an adaptive coding ViT network model that completely abandons convolutional feature extraction and manual feature vector extraction and enhances the perception of local information, so that computational complexity is reduced, recognition accuracy is improved, and deployment under low computing power is easy.

Brief Description of the Drawings

Figure 1 is a schematic flowchart of the millimeter wave radar gesture recognition method based on an adaptive coding Vision Transformer network in an embodiment of the present invention.

Figure 2 is a structural diagram of the U-Net neural network of the DDIM in an embodiment of the present invention.

Figure 3 illustrates the picture generation process of the DDIM in an embodiment of the present invention.

Figure 4 shows the KID curve during DDIM picture generation in an embodiment of the present invention.

Figure 5 is an original gesture feature map in an embodiment of the present invention.

Figure 6 shows the original gesture feature map divided into multiple patches in an embodiment of the present invention.

Figure 7 is a structural diagram of the adaptive coding Vision Transformer network model in an embodiment of the present invention.

Figure 8 shows the loss curve of the adaptive coding Vision Transformer network model during training in an embodiment of the present invention.

Figure 9 is the confusion matrix of the recognition results of the trained adaptive coding Vision Transformer network model on the data set in an embodiment of the present invention.

Figure 10 shows the data distributions, drawn with the t-SNE algorithm, of the original data and of the data after training with the adaptive coding Vision Transformer network model in an embodiment of the present invention.

Figure 11 is a schematic diagram of a prior-art CNN gesture recognition model based on channel and spatial attention mechanisms.

Figure 12 shows the loss curve during training of the prior-art CNN model based on channel and spatial attention mechanisms.

Figure 13 is the confusion matrix of the recognition results of the trained prior-art CNN model based on channel and spatial attention mechanisms on the data set.

Figure 14 shows the data distributions, drawn with the t-SNE algorithm, of the original data and of the data after training with the prior-art CNN model based on channel and spatial attention mechanisms.

Figure 15 shows samples of the 10 gesture feature maps.

Detailed Description

It should be noted that the embodiments of the present invention and the features therein may be combined with each other as long as there is no conflict. The present invention is described in detail below with reference to the accompanying drawings and embodiments.

Referring to Figure 1, the millimeter wave radar gesture recognition method based on an adaptive coding Vision Transformer network proposed by the present invention includes:

S1. Collect the gestures to be classified with a millimeter wave radar.

S2. Extract features from the collected gesture echo data to obtain micro-Doppler features and elevation-azimuth features, and synthesize them into a three-channel RGB gesture feature map.

S3. Expand the RGB gesture feature maps with a denoising diffusion implicit model (DDIM) to obtain fitted images.

S4. Mix the fitted images with the RGB gesture feature maps to form a data set, and divide the data set into a training data set and a validation data set.

S5. Build an adaptive coding Vision Transformer network model.

S6. Train the adaptive coding Vision Transformer network model with the training data set.

S7. Feed the validation data set into the trained adaptive coding Vision Transformer network model for classification to obtain the classification results.

The present invention collects the gestures to be classified with a millimeter wave radar, extracts micro-Doppler and elevation-azimuth features from the collected gesture echo data, synthesizes them into three-channel RGB gesture feature maps, and then expands the RGB gesture feature maps with a denoising diffusion implicit model; once a portion of data has been obtained, this eliminates the subsequent time-consuming and unstable data collection process. In addition, compared with traditional classification algorithms such as SVM, k-nearest neighbors, random forest, and convolutional neural networks, the present invention adopts an adaptive coding ViT network model that completely abandons convolutional feature extraction and manual feature vector extraction and enhances the perception of local information, so that computational complexity is reduced, recognition accuracy is improved, and deployment under low computing power is easy.

In this embodiment, S1, collecting the gestures to be classified with a millimeter wave radar, specifically includes:

collecting the gestures to be recognized in multiple rounds with an IWR1642 millimeter wave radar.

In a further embodiment, collecting the gestures to be classified with a millimeter wave radar specifically includes:

collecting the gestures to be classified with the millimeter waves emitted by a radar system having two transmitters and four receivers.

The receivers of the radar system are Rx1, Rx2, Rx3, and Rx4. Rx1 and Rx2 are placed vertically; because there is a one-wavelength spacing between Rx1 and Rx2, a degree of freedom for elevation measurement is obtained. Rx2, Rx3, and Rx4 are arranged in parallel at half-wavelength spacing, providing a degree of freedom for azimuth measurement. Through this special arrangement of the transmitting and receiving antennas, the receiving antennas have degrees of freedom in azimuth and elevation, and the system can measure the azimuth and elevation of a gesture at a specific time. In this application, a short-time fast Fourier transform (FFT) is used to obtain the spectrogram of the gesture and analyze its micro-Doppler characteristics; the FFT window size and the time step between non-overlapping samples are set to 256 ms and 1 ms respectively.
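As an illustration of this step, the following is a minimal Python sketch of computing such a micro-Doppler spectrogram with a short-time FFT over a 256 ms window advancing in 1 ms steps. The sample rate and the placeholder echo vector are assumptions for the sketch, not values taken from the patent.

```python
import numpy as np
from scipy.signal import stft

fs = 1000.0                                   # assumed slow-time sample rate: 1 kHz (1 ms per sample)
slow_time = (np.random.randn(4096)
             + 1j * np.random.randn(4096))    # placeholder complex radar echo

# 256-sample window (256 ms at 1 kHz) advancing one sample (1 ms) at a time,
# matching the window size and time step quoted above.
f, t, Z = stft(slow_time, fs=fs, nperseg=256, noverlap=255,
               return_onesided=False)
Z = np.fft.fftshift(Z, axes=0)                # center zero Doppler for display
spectrogram_db = 20 * np.log10(np.abs(Z) + 1e-12)   # log-magnitude micro-Doppler map
```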

In S2, after the micro-Doppler features and elevation-azimuth features are synthesized into RGB gesture feature maps in three channels, the method further includes classifying the RGB gesture feature maps to facilitate the subsequent DDIM training.

In this embodiment, in S3, expanding the RGB gesture feature maps with the denoising diffusion implicit model specifically includes:

S31. Establish the diffusion process and the noise prediction model for the RGB gesture feature maps, where the noise prediction model adopts the U-Net neural network structure shown in Figure 2.

S32. Feed the RGB gesture feature maps into the noise prediction model according to the diffusion process and train the noise prediction model accordingly, thereby obtaining a trained denoising diffusion implicit model.

S33. Expand the RGB gesture feature maps with the trained denoising diffusion implicit model.

In the process of expanding the RGB gesture feature maps, the denoising diffusion implicit model uses a non-Markovian process and a U-Net to generate nearly identical fitted images on the basis of the original data set. This increases the amount of training data and at the same time alleviates the large intra-class sample variance caused by differences in the same user's gesture motions, so that the subsequent network can learn gesture features under a variety of conditions.

In this embodiment, in S3, the denoising diffusion implicit model includes a forward diffusion process and a reverse denoising generation process:

the forward diffusion process continually adds Gaussian noise to the RGB gesture feature map x_0, turning the original data distribution into a normal distribution and yielding a pure-noise image x_t;

in the reverse denoising generation process, the noise prediction model is used to rapidly restore the pure-noise image x_t from the normal distribution to the original data distribution, obtaining a fitted image.
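A minimal sketch of the forward diffusion process follows: the closed form q(x_t | x_0) = N(sqrt(ᾱ_t) x_0, (1 − ᾱ_t) I) allows x_t to be sampled at any timestep in one shot. The linear β schedule below is an assumption for illustration.

```python
import numpy as np

T = 1000                                     # assumed total number of noising steps
betas = np.linspace(1e-4, 0.02, T)           # assumed monotonically increasing beta_t
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)              # cumulative product alpha_bar_t

def diffuse(x0: np.ndarray, t: int, rng=np.random.default_rng()):
    """Sample x_t ~ q(x_t | x_0) directly and return it with the noise used."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps
```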

Specifically, in the reverse denoising generation process, the recursive model is

$$x_{t-1}=\sqrt{\bar{\alpha}_{t-1}}\left(\frac{x_t-\sqrt{1-\bar{\alpha}_t}\,\varepsilon_\theta(x_t,t)}{\sqrt{\bar{\alpha}_t}}\right)+\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\,\varepsilon_\theta(x_t,t)+\sigma_t\varepsilon_t$$

where $\varepsilon_\theta$ denotes the noise prediction model, $\sigma$ is a real vector indexed by the timestep, $\alpha_t:=1-\beta_t$ with $\bar{\alpha}_t=\prod_{i=1}^{t}\alpha_i$, $\beta_t$ is the manually set noising parameter at time $t$ satisfying $0<\beta_1<\dots<\beta_{t-1}<\beta_t<\dots<\beta_T<1$, and $x_t$ denotes the given noisy observation.

Here the noise prediction model is $\varepsilon_\theta(x_t,t)$, $t=1,\dots,T$, where $T$ is the total number of noising steps.
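A minimal sketch of one reverse step implementing the recursion above is given below. `eps_model` stands for the trained noise predictor ε_θ; setting σ_t = 0 makes the step deterministic, which is the usual DDIM choice. All names are placeholders.

```python
import numpy as np

def ddim_step(xt, t, t_prev, alpha_bars, eps_model, sigma_t=0.0,
              rng=np.random.default_rng()):
    eps = eps_model(xt, t)                          # predicted noise epsilon_theta(x_t, t)
    x0_pred = (xt - np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])
    direction = np.sqrt(1.0 - alpha_bars[t_prev] - sigma_t**2) * eps
    noise = sigma_t * rng.standard_normal(xt.shape)
    return np.sqrt(alpha_bars[t_prev]) * x0_pred + direction + noise
```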

Specifically, the U-Net neural network includes 4 downsampling layers and 4 upsampling layers, with skip connections added between layers of the same resolution; on this basis, the diffusion model can be viewed as a denoising autoencoder without a bottleneck. The U-Net structure also introduces residual blocks to carry the signal through the network and to expand and reduce the number of channels, thereby preventing vanishing and exploding gradients.
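A compact Keras sketch of such a U-Net follows: 4 downsampling and 4 upsampling stages, skip connections between equal-resolution levels, and residual blocks inside each stage. The channel widths are illustrative, and the conditioning on the noise variance that the full model uses is omitted for brevity.

```python
import tensorflow as tf
from tensorflow.keras import layers

def res_block(x, width):
    shortcut = x if x.shape[-1] == width else layers.Conv2D(width, 1)(x)
    x = layers.Conv2D(width, 3, padding="same", activation="swish")(x)
    x = layers.Conv2D(width, 3, padding="same")(x)
    return layers.Add()([shortcut, x])                 # residual connection

def build_unet(size=64, widths=(32, 64, 96, 128)):
    inp = tf.keras.Input((size, size, 3))
    x, skips = layers.Conv2D(widths[0], 1)(inp), []
    for w in widths:                                   # 4 downsampling stages
        x = res_block(x, w)
        skips.append(x)
        x = layers.AveragePooling2D()(x)
    x = res_block(x, widths[-1])                       # bottleneck
    for w, skip in zip(reversed(widths), reversed(skips)):   # 4 upsampling stages
        x = layers.UpSampling2D()(x)
        x = layers.Concatenate()([x, skip])            # skip connection at equal resolution
        x = res_block(x, w)
    out = layers.Conv2D(3, 1)(x)                       # predicted noise map
    return tf.keras.Model(inp, out)
```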

Specifically, when training the denoising diffusion implicit model, AdamW is set as the optimizer of the overall network, the learning rate δ is set to 10^-3, the numerical stability constant ε_r is set to 10^-4, the noising parameter β_t at time t is set according to a trigonometric schedule, and the input image size is 64×64; the TensorFlow framework automatically computes the gradients of each layer and back-propagates them to update the learnable parameters.
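A sketch of this training configuration follows. In TF 2.10 (the version listed in Table 2) AdamW lives under the experimental namespace, and the exact angle range of the trigonometric schedule is elided in the text, so the endpoints below are assumptions.

```python
import numpy as np
import tensorflow as tf

# AdamW with the stated learning rate and numerical stability constant
# (tf.keras.optimizers.AdamW in newer TF releases).
optimizer = tf.keras.optimizers.experimental.AdamW(learning_rate=1e-3,
                                                   epsilon=1e-4)

# Trigonometric diffusion schedule over T steps; the angle endpoints are
# assumptions since the exact interval is elided in the text.
T = 1000
angles = np.linspace(0.02, np.pi / 2 - 0.02, T)
signal_rates = np.cos(angles)      # plays the role of sqrt(alpha_bar_t)
noise_rates = np.sin(angles)       # plays the role of sqrt(1 - alpha_bar_t)
```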

In a further embodiment, the noise prediction model includes a generator and a discriminator. Feeding the RGB gesture feature maps into the noise prediction model according to the diffusion process and training it accordingly to obtain the trained denoising diffusion implicit model specifically includes:

applying diffusion noising to the RGB gesture feature maps to obtain the noised images and the variances of their noise components;

feeding the noised images and the variances of their noise components into the generator to obtain predicted result images;

feeding the RGB gesture feature maps and the predicted result images into the discriminator to obtain quality information of the predicted result images;

optimizing the generator and the discriminator according to the image quality information of the prediction results to obtain a trained generator and discriminator;

updating the trained generator and discriminator into the fast denoising prediction model to obtain the trained denoising diffusion implicit model.

This arrangement improves the generation ability of the generator and the discrimination ability of the discriminator, so that the trained denoising diffusion implicit model can generate nearly identical fitted images on the basis of the original data set to expand it.

To enable the denoising diffusion implicit model to generate nearly identical fitted images on the basis of the original data set and thus expand it, in a further embodiment the quality information of the predicted result image includes the Kernel Inception Distance (KID) between the feature maps of the predicted result image and of the corresponding RGB gesture feature map.

Specifically, for a noise batch and a generated-image batch of size batch_size (set to 64 here), the polynomial kernel formula is used to compute the kernels of each batch with itself and with the other batch; the per-batch averages of these kernels are then combined to obtain the final KID value.
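A minimal sketch of this computation as an MMD estimate with the cubic polynomial kernel k(x, y) = (x·y/d + 1)³ over feature vectors of the two batches follows. For simplicity the diagonal terms are kept; the textbook unbiased estimator excludes them.

```python
import tensorflow as tf

def polynomial_kernel(a, b):
    d = tf.cast(tf.shape(a)[1], a.dtype)
    return (tf.matmul(a, b, transpose_b=True) / d + 1.0) ** 3

def kid(real_feats, fake_feats):
    k_rr = tf.reduce_mean(polynomial_kernel(real_feats, real_feats))
    k_ff = tf.reduce_mean(polynomial_kernel(fake_feats, fake_feats))
    k_rf = tf.reduce_mean(polynomial_kernel(real_feats, fake_feats))
    return k_rr + k_ff - 2.0 * k_rf          # smaller is better; 0 means near-identical
```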

Note that KID serves as an image evaluation metric whose value should be as small as possible; ideally, KID = 0 means the generated image is almost identical to the original.

Specifically, in S324, while the generator and the discriminator are being optimized according to the image quality information of the predicted result images, generation of predicted result images is stopped once their KID has dropped to a certain level.

In a further embodiment, the discriminator includes an input layer, a resize layer, an InceptionV3 module, and an image quality evaluation module. The resize layer resizes the pictures output by the input layer; the InceptionV3 module extracts features from the resized pictures; and the computation module computes the KID from the extracted features to obtain the quality information of the predicted result images.

Specifically, the resize layer changes the size of the pictures output by the input layer to 75×75 to fit the InceptionV3 module, and the InceptionV3 module is preloaded with "imagenet" weights when performing feature extraction.
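A Keras sketch of this feature pipeline follows: resize to 75×75 (the smallest input InceptionV3 accepts) and extract pooled ImageNet features, which then feed the KID computation above. The 64×64 input size and [0, 255] pixel range are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

feature_extractor = tf.keras.Sequential([
    layers.Input(shape=(64, 64, 3)),                       # assumed input size
    layers.Resizing(75, 75),                               # resize layer
    layers.Lambda(tf.keras.applications.inception_v3.preprocess_input),
    tf.keras.applications.InceptionV3(include_top=False, weights="imagenet",
                                      input_shape=(75, 75, 3)),
    layers.GlobalAveragePooling2D(),                       # pooled feature vector
])
```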

In one specific embodiment, in S4, the data set contains 1,600 pictures, with a 3:1 ratio of training set to validation set.

As shown in Figure 7, in this embodiment the adaptive coding Vision Transformer network model includes an image partition module, a patch encoding module, a Transformer Encoder feature extraction module, and an MLP classification module.

The image partition module divides the image into multiple image patches, as shown in Figures 5 and 6; the patch encoding module encodes each patch, where each encoded patch carries the position information of that patch and information about the relationships between the pixels within it; the Transformer Encoder feature extraction module extracts the feature vectors of the encoded patches; and the MLP classification module classifies according to the extracted feature vectors.

With this arrangement, the adaptive coding Vision Transformer network model directly enhances its perception of local information through adaptive coding, so that the computational complexity of the present invention is reduced, the recognition accuracy is improved, and deployment under low computing power is easy.

Specifically, the position information of an image patch can be encoded directly with an Embedding, while the information about the relationships between the pixels within the patch is learned and output by a multi-layer perceptron over the flattened patch. Unlike linear encoding, this method greatly enhances the adaptability of the encoding. The final encoding of an image patch is obtained by adding and fusing the patch's position encoding and its pixel encoding.
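A Keras sketch of this adaptive patch encoding follows: a learned Embedding for the patch position plus a small MLP over the flattened pixels of each patch, the two added together as the final encoding. The projection width and MLP depth are assumptions. Patches themselves can be extracted beforehand, e.g. with tf.image.extract_patches, and flattened to shape (batch, num_patches, patch_pixels).

```python
import tensorflow as tf
from tensorflow.keras import layers

class AdaptivePatchEncoder(layers.Layer):
    def __init__(self, num_patches, proj_dim):
        super().__init__()
        self.num_patches = num_patches
        # MLP that learns relations between pixels inside each flattened patch
        self.pixel_mlp = tf.keras.Sequential([
            layers.Dense(proj_dim, activation="gelu"),
            layers.Dense(proj_dim),
        ])
        self.pos_embed = layers.Embedding(num_patches, proj_dim)

    def call(self, patches):                    # (batch, num_patches, patch_pixels)
        positions = tf.range(self.num_patches)
        # pixel encoding + position encoding, added and fused
        return self.pixel_mlp(patches) + self.pos_embed(positions)
```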

As shown in Figure 7, in a further embodiment the Transformer Encoder feature extraction module includes 8 encoder blocks connected end to end, each Transformer encoder block including a first normalization layer, a second normalization layer, a multi-head attention layer, and a multi-layer perceptron.

Specifically, the input of the encoder block serves as the input of the first normalization layer; the output of the first normalization layer is connected to the input of the multi-head attention layer, and the input of the first normalization layer is skip-connected with the output of the multi-head attention layer and fed to the input of the second normalization layer; the output of the second normalization layer is connected to the input of the multi-layer perceptron and is skip-connected with the output of the multi-layer perceptron to form the output of the encoder block, preventing exploding or vanishing gradients.

It can thus be seen that, compared with traditional classification algorithms such as SVM, k-nearest neighbors, and random forest, as well as convolutional neural networks, the adaptive coding ViT network completely abandons convolutional feature extraction and manual feature vector extraction and directly enhances its perception of local information through adaptive coding and multi-head attention, so that computational complexity is reduced, recognition accuracy is improved, and deployment under low computing power is easy.

In a still further embodiment, the multi-layer perceptron includes two fully connected layers: the first has 128 units and the second 64 units, both with gelu activation.
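A sketch of one such encoder block in the standard pre-norm form follows, with the 128- and 64-unit gelu MLP just described; the head count is an assumption, and the token width is taken as 64 so the residual shapes match.

```python
import tensorflow as tf
from tensorflow.keras import layers

def encoder_block(x, num_heads=4, key_dim=64):
    h = layers.LayerNormalization(epsilon=1e-6)(x)          # first normalization layer
    h = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(h, h)
    x = layers.Add()([x, h])                                # first skip connection
    h = layers.LayerNormalization(epsilon=1e-6)(x)          # second normalization layer
    h = layers.Dense(128, activation="gelu")(h)             # MLP layer 1
    h = layers.Dense(64, activation="gelu")(h)              # MLP layer 2
    return layers.Add()([x, h])                             # second skip connection
```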

In a further embodiment, the MLP classifier includes a third normalization layer, a Dropout layer, and three fully connected layers.

Specifically, the output of the Transformer encoder serves as the input of the third normalization layer; the output of the third normalization layer is flattened and fed to the Dropout layer, whose output is fed to the three fully connected layers. Among these, the first fully connected layer has 2048 units with gelu activation, the second has 1024 units with gelu activation, and the third has 10 units to classify the 10 kinds of gesture data.
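A sketch of this classification head follows; the dropout rate is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def classification_head(tokens, num_classes=10):
    x = layers.LayerNormalization(epsilon=1e-6)(tokens)     # third normalization layer
    x = layers.Flatten()(x)
    x = layers.Dropout(0.5)(x)                              # assumed dropout rate
    x = layers.Dense(2048, activation="gelu")(x)
    x = layers.Dense(1024, activation="gelu")(x)
    return layers.Dense(num_classes, activation="softmax")(x)   # 10 gesture classes
```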

In this embodiment, in S6, training the adaptive coding Vision Transformer network model with the training data set specifically includes:

feeding the training data set into the constructed adaptive coding Vision Transformer network model and training it with cross entropy as the loss function until convergence, obtaining the trained adaptive coding Vision Transformer network model.

Specifically, the loss of the adaptive coding Vision Transformer network model is optimized with AdamW, the learning rate is set to 0.001 and the weight decay coefficient to 0.0001, and TensorFlow automatically performs the gradient computations to learn the parameters of the network.
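A sketch of this training setup follows, assuming integer class labels (hence the sparse cross-entropy) and using the experimental AdamW of TF 2.10; the epoch count is a placeholder.

```python
import tensorflow as tf

def train(model: tf.keras.Model, train_ds: tf.data.Dataset,
          val_ds: tf.data.Dataset, epochs: int = 100):
    model.compile(
        optimizer=tf.keras.optimizers.experimental.AdamW(
            learning_rate=1e-3, weight_decay=1e-4),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model.fit(train_ds, validation_data=val_ds, epochs=epochs)
```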

In this embodiment, in S7, feeding the validation data set into the trained adaptive coding Vision Transformer network model for classification specifically includes:

feeding the validation data set into the trained adaptive coding Vision Transformer network model for classification to obtain the classification results;

displaying the classification results with the t-SNE algorithm for convenient observation.
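A minimal sketch of this visualization step with scikit-learn and matplotlib follows; the feature and label arrays are placeholders for either the raw data or the trained model's penultimate-layer outputs.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels):
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=8)
    plt.colorbar(label="gesture class")
    plt.show()
```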

The effect of the present invention can be further illustrated by the following simulations.

To verify the superiority of the present invention in millimeter wave radar gesture recognition, a set of experiments compares the present invention with the traditional method. Because the amount of data before DDIM expansion differs too much from the amount after expansion, verification results on the unexpanded data would be largely accidental; therefore no comparison across data volumes is made, and only the recognition results of the adaptive coding Vision Transformer (ViT) network model and of the CNN network are compared.

1. Simulation conditions

The parameters of the test hardware platform are shown in Table 1, those of the software platform in Table 2, and those of the radar platform in Table 3:

Table 1. Hardware platform parameters

CPU: AMD Ryzen 7 4800H
Memory: 16 GB
Graphics card model: NVIDIA GeForce GTX 1660 Ti
Graphics card memory: 6 GB

Table 2. Software platform parameters

Operating system: Windows 10, 64-bit
Development environment: PyCharm 2021.3.1
CUDA version: 11.2.0
Tensorflow version: 2.10
Keras version: 2.10

Table 3. Radar platform parameters

Bandwidth: ~4 GHz
Chirp repetition time: 400 μs
Chirps per frame: 128
Range resolution: ~4 cm
Velocity resolution: 0.032 m/s

2. Simulation methods

(1) The prior-art CNN gesture recognition network based on channel and spatial attention mechanisms, which recognizes and classifies the gesture feature maps;

(2) The method of the present invention, i.e., the millimeter wave radar gesture recognition method based on an adaptive coding Vision Transformer network.

3. Simulation content and results

The radar used in this application is an IWR1642 operating at 77 GHz with an average output power of 12 dBm. It can sense radial velocities from 2.6 cm/s to 2.6 m/s, and its antenna beam width of 120 degrees is suitable for gesture measurement. Hand movements are measured in the main lobe of the radar antenna, at an average distance of about 30 cm from the radar.

The 10 gestures to be recognized in this application are: (1) swipe from left to right, (2) swipe from right to left, (3) swipe from lower left to upper right, (4) swipe from lower right to upper left, (5) swipe forward, (6) swipe backward, (7) wave left and right, (8) double-tap with a finger, (9) clench and open the fist, and (10) snap the fingers. The corresponding gesture feature maps are shown in Figure 15.

In simulation experiment 1, the CNN gesture recognition network based on channel and spatial attention mechanisms recognizes and classifies the gesture feature maps. The confusion matrix of the classification results is shown in Figure 13; the per-gesture classification accuracies can be read from it, and the overall classification accuracy is 91.00%.

In simulation experiment 2, the adaptive coding ViT network model of the present invention recognizes and classifies the gesture feature maps. The confusion matrix of the classification results is shown in Figure 9; the per-gesture classification accuracies can be read from it, and the overall classification accuracy is 97.25%.

The FFT window size and the time step between non-overlapping samples are set to 256 ms and 1 ms respectively. For picture generation, the learning rate of DDIM training is set to 0.001, the batch size to 64, and the original data is repeated 40 times. In each training epoch, validation data equal in size to the training set is randomly selected, and the KID between the generated feature pictures and the validation feature pictures is computed; a Keras callback computes and displays the image KID at each iteration, making it easy to monitor image quality throughout the process and to interrupt training once the image quality reaches a certain level. The total number of training epochs is set to 70, and training on the whole data set takes about one hour. The picture generation process is shown in Figure 3 and the decline of KID in Figure 4; the smaller the KID, the higher the image quality. Finally, nearly 1,500 gesture feature maps were generated with the trained network weights.
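A sketch of such a monitoring callback follows. `kid_fn` and the feature pipeline are assumed to be the ones sketched earlier; `generate_fn` is a placeholder that returns features of a freshly generated batch, and the stopping threshold is an assumption.

```python
import tensorflow as tf

class KIDMonitor(tf.keras.callbacks.Callback):
    def __init__(self, kid_fn, real_feats, generate_fn, threshold=0.05):
        super().__init__()
        self.kid_fn, self.real_feats = kid_fn, real_feats
        self.generate_fn, self.threshold = generate_fn, threshold

    def on_epoch_end(self, epoch, logs=None):
        fake_feats = self.generate_fn()                     # features of generated images
        kid_value = float(self.kid_fn(self.real_feats, fake_feats))
        print(f"epoch {epoch}: KID = {kid_value:.4f}")
        if kid_value < self.threshold:                      # quality target reached
            self.model.stop_training = True
```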

Compared with the existing CNN-enhanced multi-channel synthetic gesture feature recognition based on channel and spatial attention mechanisms, the present invention not only clearly improves the overall accuracy but also greatly improves the classification accuracy on gesture feature maps with indistinct or similar features. This verifies that the present invention can extract more comprehensive and more discriminative features reflecting the properties of the gesture feature images, which helps improve their classification accuracy.

After the original data and the training results are reduced in dimension and visualized with the t-SNE algorithm, the recognition and classification results of the CNN gesture recognition network based on channel and spatial attention mechanisms are shown in Figure 14, and those of the adaptive coding ViT network model of the present invention in Figure 10. It can be seen that the model of the present invention generalizes better and is more robust, which favors the recognition and classification of unseen data, whereas the results of the CNN network based on channel and spatial attention mechanisms are too concentrated.

Comparing Figures 8 and 12 further shows that the stability of gradient descent during training of the present invention is also excellent, which reduces the need for human intervention during training.

The above are only preferred specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any equivalent substitution or change that a person skilled in the art can make within the technical scope disclosed by the present invention, according to its technical solution and inventive concept, shall be covered by the protection scope of the present invention.

Claims (10)

1.一种基于自适应编码Vision Transformer网络的毫米波雷达手势识别方法,其特征在于,包括:1. A millimeter wave radar gesture recognition method based on adaptive coding Vision Transformer network, which is characterized by: 利用毫米波雷达对待分类手势进行采集;Use millimeter wave radar to collect gestures to be classified; 对采集到的手势回波数据进行特征提取,得到微动多普勒特征和俯仰方位角特征,并将微动多普勒特征和俯仰方位角特征按三通道合成RGB手势特征图;Perform feature extraction on the collected gesture echo data to obtain micro-motion Doppler features and pitch azimuth angle features, and synthesize the micro-motion Doppler features and pitch azimuth angle features into an RGB gesture feature map in three channels; 利用去噪扩散隐式模型对RGB手势特征图进行扩充,得到拟合图像;The denoising diffusion implicit model is used to expand the RGB gesture feature map to obtain a fitted image; 将拟合图像和RGB手势特征图混合形成数据集,并将数据集划分为训练数据集和验证数据集;Mix the fitting images and RGB gesture feature maps to form a data set, and divide the data set into a training data set and a verification data set; 构建自适应编码Vision Transformer网络模型;Build an adaptive coding Vision Transformer network model; 利用训练数据集对自适应编码Vision Transformer网络模型进行训练;Use the training data set to train the adaptive coding Vision Transformer network model; 将验证数据集输入训练好的自适应编码Vision Transformer网络模型中进行分类,得到分类结果。Input the verification data set into the trained adaptive coding Vision Transformer network model for classification, and obtain the classification results. 2.根据权利要求1所述的基于自适应编码Vision Transformer网络的毫米波雷达手势识别方法,其特征在于,利用去噪扩散隐式模型对RGB手势特征图进行扩充,具体包括:2. The millimeter wave radar gesture recognition method based on adaptive coding Vision Transformer network according to claim 1, characterized in that the RGB gesture feature map is expanded by using a denoising diffusion implicit model, specifically including: 建立RGB手势特征图的扩散过程和噪声预测模型;其中,噪声预测模型采用U-Net神经网络结构;Establish the diffusion process and noise prediction model of the RGB gesture feature map; among them, the noise prediction model adopts the U-Net neural network structure; 将RGB手势特征图按照扩散过程输入噪声预测模型,按照扩散过程对噪声预测模型进行训练,从而获得训练好的去噪扩散隐式模型;Input the RGB gesture feature map into the noise prediction model according to the diffusion process, and train the noise prediction model according to the diffusion process to obtain the trained denoising diffusion implicit model; 利用训练好的去噪扩散隐式模型对RGB手势特征图进行扩充。The RGB gesture feature map is expanded using the trained denoising diffusion implicit model. 3.根据权利要求2所述的基于自适应编码Vision Transformer网络的毫米波雷达手势识别方法,其特征在于,去噪扩散隐式模型包括前向扩散过程和逆向去噪生成过程;3. The millimeter wave radar gesture recognition method based on adaptive coding Vision Transformer network according to claim 2, characterized in that the denoising diffusion implicit model includes a forward diffusion process and a reverse denoising generation process; 前向扩散过程,通过不断对RGB手势特征图x0添加高斯噪声,将原始数据分布变为正态分布,得到纯噪声图像xtIn the forward diffusion process, by continuously adding Gaussian noise to the RGB gesture feature map x 0 , the original data distribution is changed into a normal distribution, and a pure noise image x t is obtained; 在逆向去噪生成过程中,使用噪声预测模型将纯噪声图像xt从正态分布快速恢复到原始数据分布,得到拟合图像。In the reverse denoising generation process, the noise prediction model is used to quickly restore the pure noise image x t from the normal distribution to the original data distribution to obtain a fitted image. 4.根据权利要求2所述的基于自适应编码Vision Transformer网络的毫米波雷达手势识别方法,其特征在于,噪声预测模型包括生成器和判别器;其中,将RGB手势特征图按照扩散过程输入噪声预测模型,按照扩散过程对噪声预测模型进行训练,从而获得训练好的去噪扩散隐式模型,具体包括:4. 
4. The millimeter wave radar gesture recognition method based on an adaptive coding Vision Transformer network according to claim 2, characterized in that the noise prediction model comprises a generator and a discriminator, and that inputting the RGB gesture feature maps into the noise prediction model according to the diffusion process and training the noise prediction model according to the diffusion process to obtain the trained denoising diffusion implicit model specifically comprises:
diffusing and noising the RGB gesture feature map to obtain a noised image and the variance of its noise component;
inputting the noised image and the variance of its noise component into the generator to obtain a predicted image;
inputting the RGB gesture feature map and the predicted image into the discriminator to obtain quality information of the predicted image;
optimizing the generator and the discriminator according to the quality information of the predicted image to obtain a trained generator and discriminator; and
updating the trained generator and discriminator into the fast denoising prediction model to obtain the trained denoising diffusion implicit model.

5. The millimeter wave radar gesture recognition method based on an adaptive coding Vision Transformer network according to claim 4, characterized in that the quality information of the predicted image comprises the Kernel Inception Distance (KID) between the predicted image and the feature map corresponding to the RGB gesture feature map.

6. The millimeter wave radar gesture recognition method based on an adaptive coding Vision Transformer network according to claim 5, characterized in that the discriminator comprises an input layer, a resize layer, an InceptionV3 module and a computation module.

7. The millimeter wave radar gesture recognition method based on an adaptive coding Vision Transformer network according to any one of claims 1 to 6, characterized in that the adaptive coding Vision Transformer network model comprises an image partitioning module, a patch encoding module, a Transformer Encoder feature extraction module and an MLP classification module;
the image partitioning module divides the image into a plurality of image patches; the patch encoding module encodes each image patch, each encoded image patch containing the position information of that patch and information on the relationships among the pixels within it; the Transformer Encoder feature extraction module extracts feature vectors from the encoded image patches; and the MLP classification module performs classification according to the extracted feature vectors.
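For context on claim 5: the Kernel Inception Distance is commonly computed as an unbiased squared maximum mean discrepancy between two sets of InceptionV3 features under a cubic polynomial kernel, which is consistent with the InceptionV3 module recited in claim 6. The sketch below assumes that standard formulation; real_feats and fake_feats stand for InceptionV3 activations of the RGB gesture feature maps and of the predicted images, and the function names are illustrative.

```python
import numpy as np

def polynomial_kernel(X, Y):
    """Cubic polynomial kernel used by KID: k(x, y) = (x.y / d + 1)**3."""
    d = X.shape[1]
    return (X @ Y.T / d + 1.0) ** 3

def kid(real_feats, fake_feats):
    """Unbiased MMD^2 estimate between two (n_samples, d) feature matrices."""
    m, n = len(real_feats), len(fake_feats)
    k_rr = polynomial_kernel(real_feats, real_feats)
    k_ff = polynomial_kernel(fake_feats, fake_feats)
    k_rf = polynomial_kernel(real_feats, fake_feats)
    # Diagonal terms are excluded to keep the estimator unbiased.
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_ff = (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))
    return term_rr + term_ff - 2.0 * k_rf.mean()
```

A lower KID indicates that the generator's predicted images match the real feature distribution more closely, which is the quality signal the discriminator of claim 4 feeds back into training.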
8. The millimeter wave radar gesture recognition method based on an adaptive coding Vision Transformer network according to claim 7, characterized in that the MLP classification module comprises a third normalization layer, a Dropout layer and three fully connected layers.

9. The millimeter wave radar gesture recognition method based on an adaptive coding Vision Transformer network according to claim 7, characterized in that the Transformer Encoder feature extraction module comprises eight encoder blocks connected end to end, each encoder block comprising a first normalization layer, a second normalization layer, a multi-head attention layer and a multilayer perceptron;
the input of the encoder block feeds the first normalization layer; the output of the first normalization layer feeds the multi-head attention layer; the input of the first normalization layer is joined to the output of the multi-head attention layer by a skip connection, and the resulting sum feeds the second normalization layer; the output of the second normalization layer feeds the multilayer perceptron; and the input of the second normalization layer is joined to the output of the multilayer perceptron by a skip connection, the resulting sum serving as the output of the encoder block.

10. The millimeter wave radar gesture recognition method based on an adaptive coding Vision Transformer network according to claim 1, characterized in that inputting the validation data set into the trained adaptive coding Vision Transformer network model for classification further comprises:
displaying the classification effect with the t-SNE algorithm.
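Claim 9 describes a pre-norm Transformer encoder block of the kind used in the original Vision Transformer. The PyTorch sketch below is one plausible reading of that wiring; the embedding dimension, head count and MLP width are assumptions, since the claims do not fix them.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Pre-norm encoder block as recited in claim 9: norm -> attention -> skip,
    then norm -> MLP -> skip. Sizes here are illustrative assumptions."""
    def __init__(self, dim=256, heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)  # first normalization layer
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)  # second normalization layer
        self.mlp = nn.Sequential(       # multilayer perceptron
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # first skip connection
        x = x + self.mlp(self.norm2(x))                    # second skip connection
        return x

# Eight blocks connected end to end, matching the count recited in claim 9.
encoder = nn.Sequential(*[EncoderBlock() for _ in range(8)])
```

Normalizing before each sub-layer and adding the skip connection afterwards reproduces the two residual paths recited in the claim, and is the arrangement that keeps deep stacks of such blocks stable to train.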
CN202310782312.1A 2023-06-29 2023-06-29 Millimeter wave radar gesture recognition method based on adaptive coding Vision Transformer network Pending CN116953642A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310782312.1A CN116953642A (en) 2023-06-29 2023-06-29 Millimeter wave radar gesture recognition method based on adaptive coding Vision Transformer network

Publications (1)

Publication Number Publication Date
CN116953642A true CN116953642A (en) 2023-10-27

Family

ID=88452098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310782312.1A Pending CN116953642A (en) 2023-06-29 2023-06-29 Millimeter wave radar gesture recognition method based on adaptive coding Vision Transformer network

Country Status (1)

Country Link
CN (1) CN116953642A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117236201A (en) * 2023-11-16 2023-12-15 南京信息工程大学 A downscaling method based on Diffusion and ViT
CN117236201B (en) * 2023-11-16 2024-02-23 南京信息工程大学 A downscaling method based on Diffusion and ViT
CN118097793A (en) * 2024-04-18 2024-05-28 广州炫视智能科技有限公司 Self-adaptive interface gesture operation control system and control method thereof

Similar Documents

Publication Title
CN115100574B (en) Action recognition method and system based on fusion graph convolutional network and Transformer network
CN116953642A (en) Millimeter wave radar gesture recognition method based on adaptive coding Vision Transformer network
CN111368930B (en) Radar human body posture identification method and system based on multi-class spectrogram fusion and hierarchical learning
CN114758288A (en) A kind of distribution network engineering safety management and control detection method and device
CN111310598B (en) A Hyperspectral Remote Sensing Image Classification Method Based on 3D and 2D Hybrid Convolution
CN114445420B (en) Image segmentation model combining coding and decoding structure with attention mechanism and training method thereof
CN112149591B (en) SSD-AEFF automatic bridge detection method and system for SAR image
CN117992919B (en) River flood warning method based on machine learning and multi-meteorological modal fusion
CN115343704B (en) Gesture recognition method of FMCW millimeter wave radar based on multi-task learning
CN116482618B (en) Radar active interference identification method based on multi-loss characteristic self-calibration network
CN114757224A (en) A specific radiation source identification method based on continuous learning and joint feature extraction
CN114509736B (en) A Radar Target Recognition Method Based on UWB Electromagnetic Scattering Features
CN114065834B (en) Model training method, terminal equipment and computer storage medium
CN113239895A (en) SAR image change detection method of capsule network based on attention mechanism
CN116958468A (en) Mountain snow environment simulation method and system based on SCycleGAN
CN115937977B (en) Multi-dimensional feature fusion-based few-sample human body action recognition method
CN115410102A (en) SAR Image Aircraft Target Detection Method Based on Joint Attention Mechanism
Gu et al. Device‐Free Human Activity Recognition Based on Dual‐Channel Transformer Using WiFi Signals
CN111626341B (en) A Feature-Level Information Fusion Method for Underwater Target Recognition
CN115267462B (en) Partial discharge type identification method based on self-adaptive label generation
CN115426713A (en) Indoor positioning method and system based on graph-time convolution network
Zhang et al. Deformable deep convolutional generative adversarial network in microwave based hand gesture recognition system
CN112764002B (en) FMCW radar gesture recognition method based on deformable convolution
CN114785824A (en) Intelligent Internet of things big data transmission method and system
CN118711245A (en) Wi-Fi Visual Perception Method Based on Markov Transition Field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination