CN118230421B - A multimodal gesture recognition method and system based on deep learning - Google Patents
A multimodal gesture recognition method and system based on deep learning
- Publication number
- CN118230421B CN118230421B CN202410435208.XA CN202410435208A CN118230421B CN 118230421 B CN118230421 B CN 118230421B CN 202410435208 A CN202410435208 A CN 202410435208A CN 118230421 B CN118230421 B CN 118230421B
- Authority
- CN
- China
- Prior art keywords
- hand
- image
- coordinates
- joint points
- joint
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
- G06V40/166—Detection; Localisation; Normalisation using acquisition arrangements
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Artificial Intelligence (AREA)
- Human Computer Interaction (AREA)
- Oral & Maxillofacial Surgery (AREA)
- Computational Linguistics (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a multimodal gesture recognition method based on deep learning, which comprises the following steps: acquiring an original image containing a hand and preprocessing the original image to obtain a preprocessed image; detecting the position of the hand in the preprocessed image by a target detection method to obtain the hand position; obtaining the three-dimensional spatial coordinates of the hand joint points in the preprocessed image based on the hand position and a human body joint point recognition method; and recognizing the corresponding gesture according to the relationships among the three-dimensional spatial coordinates of the hand joint points. By recognizing the hand joint points in the image, completing the occluded hand image from the joint points, and then judging the gesture, the method improves the accuracy and robustness of recognition.
Description
Technical Field
The invention relates to the field of image processing, in particular to a multi-mode gesture recognition method and system based on deep learning.
Background
With the rapid development of human-computer interaction technology, gesture recognition based on computer vision has become an important mode of human-computer interaction. Gesture recognition perceives the user's intention by analyzing changes in the user's hand posture, and is widely applied in fields such as virtual reality and smart homes.
Traditional gesture recognition methods mainly include image segmentation based on skin-color modeling and motion tracking based on optical flow. These methods work well for simple static gestures, but they are susceptible to illumination changes, viewpoint changes, and occlusion, and their robustness is limited. With the rise of deep learning, gesture recognition based on convolutional neural networks has been widely studied. Such methods learn discriminative gesture features end to end and improve robustness compared with traditional methods.
However, existing deep-learning gesture recognition techniques often incorporate RNN or LSTM models to process temporal information, and such recurrent network structures make real-time, efficient inference difficult during online recognition. In addition, existing datasets are limited in scale, so networks struggle to learn features that are robust to occlusion and illumination changes, and adaptability to complex environments is insufficient. Overall, constructing a gesture recognition method that can recognize in real time and is robust to occlusion and illumination changes remains an unsolved technical challenge in the art.
Disclosure of Invention
To solve these problems, the invention provides a multimodal gesture recognition method and system based on deep learning. By recognizing the three-dimensional joint points of a person in an image and judging the person's current gesture from the angular relationships among the three-dimensional joint points of the hand, the method addresses the insufficient robustness and weak real-time performance of current gesture recognition.
In order to achieve the above purpose, the present invention provides the following technical solutions:
A multimodal gesture recognition method based on deep learning comprises the following steps:
S1, acquiring an original image containing a hand, and preprocessing the original image to obtain a preprocessed image;
S2, detecting the position of the hand in the preprocessed image by using a target detection method to obtain the hand position;
S3, obtaining the three-dimensional spatial coordinates of the hand joint points in the preprocessed image based on the hand position and by utilizing a human body joint point recognition method;
S4, recognizing the corresponding gesture according to the relationships among the three-dimensional spatial coordinates of the hand joint points.
Further, in step S1, the acquiring an original image including a hand specifically includes: the video stream of the hand movement is collected through the camera, and the video stream is divided into a plurality of images according to frames.
Further, in step S2, before the position of the hand in the preprocessed image is detected by the target detection method to obtain the hand position, the method further includes: detecting and locating the face in the image by using the target detection method, and performing face recognition on the face to acquire the identity of each person in the image.
Further, in step S2, the detecting the position of the hand in the preprocessed image by using the target detection method to obtain the hand position specifically includes: and detecting the image by using a rotating frame detection method in the target detection method to obtain a plurality of hand prediction frames, wherein a rectangular area formed by each hand prediction frame completely covers a single hand in the image.
Further, between the step S2 and the step S3, further includes: and positioning a target user needing to be identified with the gesture according to the identity of each person in the image, and extracting the face key points of the target user.
Further, the specific implementation process of the step S3 includes: the method comprises the steps of constructing a human body joint point recognition model, inputting images to the trained human body joint point recognition model, recognizing human body joints of each image, and obtaining three-dimensional space coordinates of human body joint points of each person, wherein the human body joint points comprise head joint points and hand joint points.
Further, the human body joint point identification model is a convolutional neural network, the convolutional neural network comprises a 3D convolutional layer, the 3D convolutional layer comprises a convolutional kernel and a plurality of input channels, and the formula of the 3D convolutional layer convolutional operation is expressed as follows:
Wherein (x, y, z) represents the 3-dimensional spatial coordinate of the output position, c represents the output feature map channel index, W is the width of the convolution kernel, H is the height of the convolution kernel, and D is the depth of the convolution kernel; S_W is the stride of the convolution over the width, S_H is the stride over the height, P_W is the padding over the width, P_H is the padding over the height, and C_in is the number of input channels.
Further, in step S3, further includes: matching a human face key point of a target user with three-dimensional space coordinates of a head joint point in an image, determining a human body joint point belonging to the target user, and acquiring three-dimensional space coordinates of a hand joint point of the target user from the human body joint point of the target user, wherein the hand joint point comprises point positions of finger tips and finger roots of each hand in the image.
Further, the specific implementation step of the step S4 includes:
S41, obtaining the coordinates of the center of the palm of each hand by calculating the average of the three-dimensional spatial coordinates of the hand joint points, with the formula:
C = ( (1/n) Σ_{i=1..n} x_i , (1/n) Σ_{i=1..n} y_i , (1/n) Σ_{i=1..n} z_i )
wherein C represents the coordinates of the palm center, n is the number of hand joint points, x_i represents the x-axis coordinate of the i-th hand joint point, y_i represents the y-axis coordinate of the i-th hand joint point, and z_i represents the z-axis coordinate of the i-th hand joint point;
S42, calculating the direction vector of each finger of the hand relative to the palm; for the direction vector v_i of the i-th finger, the calculation formula is:
v_i = (x_tip,i − x_root,i, y_tip,i − y_root,i, z_tip,i − z_root,i)
wherein (x_tip,i, y_tip,i, z_tip,i) are the fingertip coordinates of the i-th finger and (x_root,i, y_root,i, z_root,i) are the finger-root coordinates of the i-th finger;
S43, judging the hand gesture according to the relation between the coordinates of the palm center and the direction vector of each finger of the corresponding hand.
Through the above technical scheme, a method that combines multiple deep learning algorithms for judgment is provided: the hand in the image is extracted by the target detection algorithm, the occluded hand image is completed using the hand joint points, and the gesture is then judged, thereby improving the accuracy and robustness of gesture recognition.
Drawings
FIG. 1 is a schematic overall flow chart of a multi-modal gesture recognition method based on deep learning.
FIG. 2 is a schematic diagram of a multi-modal gesture recognition system based on deep learning.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Example 1
Referring to fig. 1, a multi-modal gesture recognition method based on deep learning includes the following steps:
S1, acquiring an original image containing a hand, and preprocessing the original image to obtain a preprocessed image;
s2, detecting the position of the hand in the preprocessed image by using a target detection method to obtain the position of the hand;
S3, based on the hand position and by utilizing a human body joint point identification method, obtaining three-dimensional space coordinates of the hand joint point in the preprocessed image;
S4, according to the relation among the three-dimensional space coordinates of all the hand joint points, corresponding gestures are recognized.
In an optional embodiment, in step S1, the acquiring an original image including a hand specifically includes: the video stream containing the hand movement is collected through the camera, and the video stream is divided into a plurality of images according to frames.
Before images are collected, the camera is calibrated using camera calibration techniques: feature points are extracted from images of a calibration plate with a known geometric structure, and the intrinsic and extrinsic parameters of the camera are estimated using a suitable camera model. These parameters include the focal length, principal point coordinates, distortion correction, and the camera's position and orientation. Camera calibration ensures an accurate mapping between physical dimensions in the image and the real world by minimizing the re-projection error with an optimization algorithm. Automatic camera calibration techniques can be used to re-calibrate automatically after each zoom of the camera, so that accurate three-dimensional spatial coordinates of the captured scene can be obtained.
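A minimal sketch of this calibration step, assuming a standard chessboard calibration plate and OpenCV (the 9x6 pattern, square size, and image folder are illustrative assumptions, not values fixed by the patent), might look as follows:

```python
import glob
import cv2
import numpy as np

pattern_size = (9, 6)          # assumed inner-corner count of the calibration plate
square_size = 0.025            # assumed square size in metres
objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2) * square_size

obj_points, img_points, gray = [], [], None
for path in glob.glob("calib/*.jpg"):          # assumed folder of calibration-plate shots
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern_size)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# Estimates intrinsics (focal length, principal point), distortion coefficients, and
# per-view extrinsics by minimising the re-projection error, as described above.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("RMS re-projection error:", rms)
```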
During acquisition, a camera with a resolution of more than 5 million pixels collects a video stream of the operating scene in real time; the stream is compressed with hardware encoding, transmitted to an edge computing box, and decoded into YUV format by hardware.
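As a minimal sketch of splitting the acquired stream into per-frame images (the camera index and output directory are assumptions; an RTSP URL from the edge box would also work):

```python
import os
import cv2

os.makedirs("frames", exist_ok=True)
cap = cv2.VideoCapture(0)            # assumed local device index or stream URL
frame_id = 0
while cap.isOpened():
    ok, frame = cap.read()           # one BGR frame per iteration
    if not ok:
        break
    cv2.imwrite(f"frames/{frame_id:06d}.jpg", frame)   # save the frame as an image
    frame_id += 1
cap.release()
```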
In an alternative embodiment, in step S2, before the position of the hand in the preprocessed image is detected by the target detection method to obtain the hand position, the method further includes: detecting and locating the face in the image by using the target detection method, and performing face recognition on the face to acquire the identity of each person in the image.
The face is detected mainly by the YOLOv8 target detection algorithm. YOLO is an abbreviation of You Only Look Once; it is a target detector that detects objects using features learned by a deep convolutional neural network. The algorithm divides the image into a fixed number of grid cells and predicts directly on the image with a single neural network model.
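A hedged sketch of running such a single-pass detector on a frame is shown below; it assumes the ultralytics package and a face-detection weight file ("face.pt"), neither of which is specified by the patent:

```python
from ultralytics import YOLO

model = YOLO("face.pt")                  # hypothetical weights trained for face detection
results = model("frame.jpg", conf=0.5)   # one forward pass over the whole image
for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0].tolist()      # detected face box in pixel coordinates
    print(f"face ({x1:.0f},{y1:.0f})-({x2:.0f},{y2:.0f}) score {float(box.conf):.2f}")
```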
In an optional embodiment, in step S2, the detecting a position of the hand in the preprocessed image by using the target detection method, to obtain the hand position, includes the following specific implementation steps: and detecting the image by using a rotating frame detection method in the target detection method to obtain a plurality of hand prediction frames, wherein a rectangular area formed by each hand prediction frame completely covers a single hand in the image.
The YOLOv8 rotating-frame detection algorithm is applied to hand detection; its core principle is to adopt a rotating frame instead of the traditional axis-aligned rectangular bounding box. The rotating frame better fits the natural shape of the hand and provides more detailed hand information by outputting the angle of the rotating frame and the coordinates of the midpoint of the upper edge toward which the fingers point. The specific improvements are as follows. In the rotating-frame detection of YOLOv8, the chosen rotating-frame representation includes the center coordinates (x_c, y_c), the width w, the height h, and the angle θ. The four vertex coordinates (x_i, y_i) (i = 1, 2, 3, 4) of each rotating frame are calculated by the following formula:
where α_i denotes the rotation angle of the i-th vertex with respect to the center, with α_1 = α_3 = θ and α_2 = α_4 = θ + π. This representation fully accounts for the geometry of the rotating frame.
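For illustration, the sketch below converts a (center, width, height, angle) rotating-frame prediction into its four vertices using the standard rotation-matrix convention; the patent's own α_i parameterization above may use a different vertex ordering or convention:

```python
import numpy as np

def rotated_box_corners(xc, yc, w, h, theta):
    """Four corners of a box centred at (xc, yc), size (w, h), rotated by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    # Corner offsets in the box's local frame, then rotated into image coordinates.
    local = np.array([[-w / 2, -h / 2], [w / 2, -h / 2],
                      [w / 2,  h / 2], [-w / 2,  h / 2]])
    rot = np.array([[c, -s], [s, c]])
    return local @ rot.T + np.array([xc, yc])

print(rotated_box_corners(100.0, 50.0, 40.0, 20.0, np.pi / 6))
```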
To enable the network output to predict the properties of the rotating frame, the output is adjusted: two channels are added for predicting the angle and the midpoint coordinates of the rotating frame.
Next, the loss function of the YOLOv8 rotating-frame detection method is modified to include appropriate terms for the rotating-frame angle and midpoint coordinates. Using a squared loss and introducing adjustment coefficients λ_coord, λ_obj, λ_angle, λ_center, the loss function is expressed as:
Here, x̂_i, ŷ_i, ŵ_i, ĥ_i, p̂_i, θ̂_i, ĉ_x,i, ĉ_y,i respectively represent the predicted values of the network, and x_i, y_i, w_i, h_i, p_i, θ_i, c_x,i, c_y,i are the ground-truth labels. For dataset labeling, the label of each hand target needs to include the angle of the rotating frame and its midpoint coordinates; the angle is typically in the range from −π to π. Meanwhile, to better adapt to the rotating frame, the anchor boxes need to be adjusted; new anchor boxes can be generated by clustering the sizes and rotation angles of the hand targets. Finally, the post-processing stage extracts the angle and midpoint coordinates of the rotating frame from the network output and converts them to real coordinates. This series of modifications makes YOLOv8 more flexible and more accurate for rotating-frame detection of hand targets, provides more detailed hand information, and effectively improves detection performance. After a hand is detected, it is rotationally corrected using a reflection transformation so that the fingertips point upward, for subsequent identification of the finger joint points.
In an alternative embodiment, the step S2 and the step S3 further include: and positioning a target user needing to be identified with the gesture according to the identity of each person in the image, and extracting the face key points of the target user.
The ArcFace algorithm is used for face recognition. ArcFace optimizes feature learning by introducing an angular margin into the cosine-based loss (Angular Margin Cosine Loss), which strengthens the discriminability of face features. Through this special loss function, features of the same face become more concentrated in the high-dimensional space while features of different faces become more dispersed; the overlap between features of different faces is reduced, the consistency of the same face's features under different conditions is improved, and the accuracy and stability of recognition are improved.
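A minimal sketch of an ArcFace-style angular-margin logit computation is given below (PyTorch assumed); the scale s and margin m are typical defaults, not values stated in the patent:

```python
import torch
import torch.nn.functional as F

def arcface_logits(embeddings, weights, labels, s=64.0, m=0.50):
    """ArcFace-style logits: add an angular margin m on the true-class angle, scale by s.
    embeddings: (N, d) face features; weights: (C, d) identity centres; labels: (N,) ids."""
    emb = F.normalize(embeddings, dim=1)
    w = F.normalize(weights, dim=1)
    cos = emb @ w.t()                                         # cosine to every identity centre
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = F.one_hot(labels, num_classes=w.size(0)).bool()
    cos_margin = torch.where(target, torch.cos(theta + m), cos)  # margin on true class only
    return s * cos_margin                                     # feed to cross-entropy

# Usage: loss = F.cross_entropy(arcface_logits(emb, centres, ids), ids)
```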
The YOLOv8-based face detection algorithm focuses on face detection; by combining deep learning with keypoint localization, it can accurately detect the face and provide the coordinates of 5 key points. These key points carry important information about the pose and shape of the face, which is used to eliminate rotation and tilt of the face in the image through geometric transformations such as affine or perspective transformation. These transformations, grounded in mathematical principles, standardize the face in the image by adjusting image coordinates and provide better input for the subsequent recognition process.
In an alternative embodiment, the specific implementation procedure of step S3 includes: the method comprises the steps of constructing a human body joint point recognition model, inputting images to the trained human body joint point recognition model, recognizing human body joints of each image, and obtaining three-dimensional space coordinates of human body joint points of each person, wherein the human body joint points comprise head joint points and hand joint points.
In an optional embodiment, the human body joint point identification model is a convolutional neural network, the convolutional neural network comprises a 3D convolutional layer, the 3D convolutional layer comprises a convolutional kernel and a plurality of input channels, and a formula of the 3D convolutional layer convolutional operation is expressed as:
Wherein (x, y, z) represents the 3-dimensional spatial coordinate of the output position, c represents the output feature map channel index, W is the width of the convolution kernel, H is the height of the convolution kernel, and D is the depth of the convolution kernel; S_W is the stride of the convolution over the width, S_H is the stride over the height, P_W is the padding over the width, P_H is the padding over the height, and C_in is the number of input channels.
The convolutional neural network is mainly an improvement of the BlazePose joint point detection model. BlazePose detects joint points in two stages, through a detector-tracker machine learning pipeline. A machine learning pipeline makes it easy to combine multiple algorithms into a single pipeline or workflow; structurally it may contain one or more stages, each of which completes a task such as data processing, data transformation, model training, parameter setting, or data prediction.
The pipeline first locates a pose region of interest (ROI) within the frame using a detector, from which the tracker then predicts all 33 pose keypoints. The invention improves this pipeline by replacing the model in the detector with the model of the rotating-frame detection algorithm and by replacing the 2D convolutional network in the machine learning pipeline with a 3D convolutional network, introducing a 3D convolutional layer.
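For reference, the unmodified detector-tracker pipeline can be exercised through MediaPipe as sketched below; the patent replaces the detector with the rotating-frame model and swaps in 3D convolutions, which this off-the-shelf sketch does not do:

```python
import cv2
import mediapipe as mp

pose = mp.solutions.pose.Pose(static_image_mode=False)  # detector runs once, tracker follows
img = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2RGB)
res = pose.process(img)
if res.pose_landmarks:
    # 33 keypoints, each with x, y, z coordinates and a visibility score.
    for i, lm in enumerate(res.pose_landmarks.landmark):
        print(i, lm.x, lm.y, lm.z, lm.visibility)
pose.close()
```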
The above convolution layer operation proceeds in three dimensions: width, height, and depth (or time), in order to extract 3D features from the data. Each convolution kernel has its own weight parameters, which are updated during training by gradient descent and determine the specific features the kernel can detect and respond to. The stride and padding adjust the sliding distance of the convolution kernel over the input data and the size of the output. The size of the output feature map of the 3D convolution layer is given by:
W_out = ⌊(W_in + 2·P_W − W) / S_W⌋ + 1,  H_out = ⌊(H_in + 2·P_H − H) / S_H⌋ + 1,
with the depth dimension computed analogously, where W_in and H_in are the width and height of the input.
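The output-size relationship can be checked directly with a 3D convolution layer, as in the PyTorch sketch below; the concrete channel counts, kernel size, stride, and padding are illustrative assumptions:

```python
import torch
import torch.nn as nn

# C_in = 3 input channels, 16 output channels; kernel (D, H, W) = (3, 3, 3),
# stride (1, 2, 2) and padding (1, 1, 1) are illustrative choices.
conv3d = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=3, stride=(1, 2, 2), padding=1)
x = torch.randn(1, 3, 16, 128, 128)   # (batch, C_in, depth/time, height, width)
y = conv3d(x)
# Width: (128 + 2*1 - 3) // 2 + 1 = 64, likewise for height; depth: (16 + 2*1 - 3) // 1 + 1 = 16
print(y.shape)                        # torch.Size([1, 16, 16, 64, 64])
```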
These improvements bring more powerful capabilities to the machine learning pipeline, enabling it to process stereo or volumetric data more efficiently. By learning 3D convolution kernels suited to three-dimensional space, the model can better extract key features carrying spatial and temporal information, which is a significant advantage for tasks such as processing 3D images or video data. The method also improves the loss function in the machine learning pipeline: a combined, improved custom loss function is designed from the smoothness loss (Smoothness Loss) and the skeleton constraint loss (Skeleton Constraint Loss), each extended to 3D and then combined. The specific principle is as follows. The goal of the 3D smoothness loss is to encourage smoothness of the depth scene in stereo or volumetric data. The smoothness of neighboring voxel depth values is expressed by a squared-difference loss:
3D Smoothness Loss = Σ_{x,y,z} ||∇D(x, y, z)||²
where ∇D(x, y, z) denotes the gradient of the depth scene D at position (x, y, z). The 3D skeleton constraint loss ensures that the depth estimate is consistent with a predefined 3D skeleton structure in the scene. Assuming that D is the depth scene and S_k denotes the set of three-dimensional positions of the predefined skeleton, the 3D skeleton constraint loss is:
3D Skeleton Constraint Loss = Σ_k ||D(S_k) − D_groundtruth(S_k)||²
The new integrated 3D loss function is a weighted sum of the 3D smoothness loss and the 3D skeleton constraint loss, with the weight controlled by the hyperparameter λ:
Combined 3D Loss = λ · 3D Smoothness Loss + (1 − λ) · 3D Skeleton Constraint Loss
Here the hyperparameter λ balances the relative contributions of the 3D smoothness loss and the 3D skeleton constraint loss, and typically takes values in [0, 1]. During training, the parameters of the convolution kernels are adjusted according to the gradient of the loss function to minimize the loss and improve model performance. This upgrade gives the pipeline more comprehensive and flexible processing power. For the video use case, an efficient neural network pipeline is employed in which the detector runs only on the first frame of the video sequence. In subsequent frames, the region of interest (ROI) is inferred from the pose keypoint information of the previous frame, avoiding re-running the detector on each frame and reducing computational cost. The pose prediction component in the pipeline is responsible for predicting the three-dimensional spatial positions of all 33 human joints and of the 21 hand key points, each joint having four degrees of freedom: x, y, z coordinates and visibility.
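Putting the two loss terms above together, a minimal PyTorch sketch of the combined 3D loss is given below; the gradient is approximated with finite differences between neighbouring voxels, which is an implementation choice rather than something specified by the patent:

```python
import torch

def smoothness_loss_3d(depth):
    """Squared finite-difference gradients of a (D, H, W) depth/volume tensor."""
    dz = depth[1:, :, :] - depth[:-1, :, :]
    dy = depth[:, 1:, :] - depth[:, :-1, :]
    dx = depth[:, :, 1:] - depth[:, :, :-1]
    return (dz ** 2).sum() + (dy ** 2).sum() + (dx ** 2).sum()

def skeleton_constraint_loss(depth, skeleton_idx, depth_gt):
    """Squared error at the predefined skeleton voxel positions S_k (a (K, 3) index tensor)."""
    pred = depth[skeleton_idx[:, 0], skeleton_idx[:, 1], skeleton_idx[:, 2]]
    gt = depth_gt[skeleton_idx[:, 0], skeleton_idx[:, 1], skeleton_idx[:, 2]]
    return ((pred - gt) ** 2).sum()

def combined_3d_loss(depth, skeleton_idx, depth_gt, lam=0.5):
    # lam balances the two terms and lies in [0, 1], as in the text above.
    return lam * smoothness_loss_3d(depth) + (1 - lam) * skeleton_constraint_loss(
        depth, skeleton_idx, depth_gt)
```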
Unlike the computationally intensive heatmap prediction methods currently employed, the model adopts a more efficient regression approach. This approach avoids generating a separate heatmap for each joint point by jointly predicting a combined heatmap and three-dimensional coordinate offsets for all joint points. Specifically, the combined heatmap of the joint points is predicted by regression, and finer positional adjustments are then made in conjunction with the predicted three-dimensional coordinate offsets. This strategy not only improves computational efficiency but also better captures the 3D relationships between joint points, thereby improving overall prediction performance.
In an alternative embodiment, step S3 further includes: matching a human face key point of a target user with three-dimensional space coordinates of a head joint point in an image, determining a human body joint point belonging to the target user, and acquiring three-dimensional space coordinates of a hand joint point of the target user from the human body joint point of the target user, wherein the hand joint point comprises point positions of finger tips and finger roots of each hand in the image.
Face detection and face recognition are used to confirm the authority of the person to be operated and to bind that person. Because the detected face frame falls within the corresponding human body detection frame, the face can be bound to that body frame; subsequent operations are then carried out on the body frame, and as long as the body frame is not lost, the face recognition step is not repeated, which greatly reduces computation. After the body frame is detected by the target detection algorithm, a target tracking algorithm tracks the target to ensure that the body frame is neither lost nor duplicated, effectively removing interference from other people.
On this basis, target determination can further combine the human body joint points: the Euclidean distance between the hand joint points among the 3D human body joint points and the detected finger joint points is calculated, and the hand with the smallest distance, not exceeding 0.02 m, is judged to belong to the target user. Using the 3D human body joint points, the three-dimensional spatial position of the target user can be obtained and the target user's position relative to the camera can be calculated. If the target user walks, the camera follows the target's position with corresponding zooming and rotation, ensuring the target user is not lost; when the target user makes an ending gesture, gesture recognition and tracking end.
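The distance-based assignment of detected hands to the target user can be sketched as follows; the 0.02 m threshold follows the text, while the data layout and the use of the wrist joint as the body-side reference point are assumptions:

```python
import numpy as np

def assign_hand_to_user(user_wrist_xyz, hand_centers_xyz, max_dist=0.02):
    """Return the index of the hand whose 3D centre is closest to the user's wrist joint,
    provided the Euclidean distance does not exceed max_dist (metres); otherwise None."""
    dists = np.linalg.norm(np.asarray(hand_centers_xyz) - np.asarray(user_wrist_xyz), axis=1)
    best = int(np.argmin(dists))
    return best if dists[best] <= max_dist else None

# Usage: idx = assign_hand_to_user(wrist, [(0.31, 0.12, 1.05), (0.60, 0.20, 1.10)])
```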
In an alternative embodiment, the specific implementation step of step S4 includes:
S41, obtaining the coordinates of the center of the palm of each hand by calculating the average of the three-dimensional spatial coordinates of the hand joint points, with the formula:
C = ( (1/n) Σ_{i=1..n} x_i , (1/n) Σ_{i=1..n} y_i , (1/n) Σ_{i=1..n} z_i )
wherein C represents the coordinates of the palm center, n is the number of hand joint points, x_i represents the x-axis coordinate of the i-th hand joint point, y_i represents the y-axis coordinate of the i-th hand joint point, and z_i represents the z-axis coordinate of the i-th hand joint point;
S42, calculating the direction vector of each finger of the hand relative to the palm; for the direction vector v_i of the i-th finger, the calculation formula is:
v_i = (x_tip,i − x_root,i, y_tip,i − y_root,i, z_tip,i − z_root,i)
wherein (x_tip,i, y_tip,i, z_tip,i) are the fingertip coordinates of the i-th finger and (x_root,i, y_root,i, z_root,i) are the finger-root coordinates of the i-th finger;
S43, judging the hand gesture according to the relation between the coordinates of the palm center and the direction vector of each finger of the corresponding hand.
For example, when judging whether the palm of the target user is open, the direction vector of each finger is calculated, and it is then judged whether the direction vectors as a whole are oriented with respect to the palm center. At the same time, the degree of finger spread is considered, which can be estimated by calculating the angles between fingers or the distances between fingers. The specific formula is as follows:
isHandOpen=isDirectionTowardCenter×isFingersSpread
where v_i is the direction vector of the i-th finger, C is the palm center position, and Spread_i represents the degree of finger spread.
For example, when judging that the target user is making a fist: the direction vector of each finger is calculated to confirm that the fingers as a whole point toward the palm center. At the same time, the distances between the fingers are checked; if the fingers are close to each other, this may indicate a fist gesture. The specific formula is as follows:
isFist=isDirectionTowardCenterFingers×isFingersClose
where Distance_i represents the distance between the fingers.
The above formula comprehensively considers the finger direction vector, the palm center position and the relative position information of the fingers, and can judge the gesture type by setting a proper threshold value. Other corresponding gestures may be calculated according to this method.
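Pulling steps S41 to S43 together, the sketch below computes the palm center, the per-finger direction vectors, and simple open-palm / fist decisions; the thresholds and the spread/closeness heuristics stand in for the isDirectionTowardCenter, isFingersSpread and isFingersClose terms above and are illustrative assumptions rather than values fixed by the patent:

```python
import numpy as np

def palm_center(joints_xyz):
    """Mean of the hand joint coordinates (step S41); joints_xyz has shape (n, 3)."""
    return np.mean(np.asarray(joints_xyz), axis=0)

def finger_vectors(tips_xyz, roots_xyz):
    """Per-finger direction vectors v_i = tip_i - root_i (step S42)."""
    return np.asarray(tips_xyz) - np.asarray(roots_xyz)

def classify_gesture(tips_xyz, roots_xyz, spread_thresh=0.04, close_thresh=0.03):
    """Simple step-S43 rule: long, spread fingers -> open palm; bunched fingertips -> fist."""
    tips = np.asarray(tips_xyz)
    v = finger_vectors(tips_xyz, roots_xyz)
    lengths = np.linalg.norm(v, axis=1)                        # extended fingers are long
    tip_gaps = np.linalg.norm(np.diff(tips, axis=0), axis=1)   # gaps between adjacent fingertips
    if lengths.mean() > spread_thresh and tip_gaps.mean() > close_thresh:
        return "open_palm"
    if tip_gaps.mean() < close_thresh:
        return "fist"
    return "other"
```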
Example 2
Referring to fig. 2, a multi-modal gesture recognition system based on deep learning, comprising:
The image acquisition module is used for acquiring an original image containing hands, preprocessing the original image and obtaining a preprocessed image;
the hand detection module is used for detecting the position of the hand in the preprocessed image by using a target detection method to obtain the hand position;
The joint recognition module is used for obtaining three-dimensional space coordinates of the hand joint point in the preprocessed image based on the hand position and by utilizing a human body joint point recognition method;
And the gesture judging module is used for identifying corresponding gestures according to the relation among the three-dimensional space coordinates of all the hand joint points.
The embodiments disclosed in this specification merely illustrate individual aspects of the present invention; the protection scope of the present invention is not limited to these embodiments, and any other functionally equivalent embodiment falls within the protection scope of the present invention. Various other corresponding changes and modifications will occur to those skilled in the art from the foregoing description and the accompanying drawings, and all such changes and modifications are intended to be included within the scope of the present invention as defined in the appended claims.
Claims (6)
1. A multimodal gesture recognition method based on deep learning, characterized by comprising the following steps:
S1, acquiring an original image containing a hand, and preprocessing the original image to obtain a preprocessed image;
S2, detecting and locating the face in the image by using a target detection method, performing face recognition on the face in the image to obtain the identity of each person in the image, detecting the position of the hand in the preprocessed image by using the target detection method to obtain the hand position, locating the target user whose gesture needs to be recognized according to the identity of each person in the image, and extracting the face key points of the target user;
S3, constructing a human body joint point identification model, inputting an image to the trained human body joint point identification model, identifying human body joints of each image to obtain three-dimensional space coordinates of human body joint points of each person, wherein the human body joint points comprise head joint points and hand joint points, matching human face key points of a target user with the three-dimensional space coordinates of the head joint points in the image, determining human body joint points belonging to the target user, and acquiring three-dimensional space coordinates of hand joint points of the target user from the human body joint points of the target user, wherein the hand joint points comprise finger tips and finger root points of each hand in the image;
S4, according to the relation among the three-dimensional space coordinates of all the hand joint points, corresponding gestures are recognized.
2. The method for multi-modal gesture recognition based on deep learning according to claim 1, wherein in step S1, the obtaining the original image including the hand specifically includes: the video stream containing the hand movement is collected through the camera, and the video stream is divided into a plurality of images according to frames.
3. The method for recognizing multi-modal gestures based on deep learning according to claim 1, wherein in step S2, the detecting the position of the hand in the preprocessed image by the target detection method to obtain the hand position comprises the following specific implementation steps: and detecting the image by using a rotating frame detection method in the target detection method to obtain a plurality of hand prediction frames, wherein a rectangular area formed by each hand prediction frame completely covers a single hand in the image.
4. The multi-modal gesture recognition method based on deep learning according to claim 3, wherein the human body joint point recognition model is a convolutional neural network, the convolutional neural network comprises a 3D convolutional layer, the 3D convolutional layer comprises a convolutional kernel and a plurality of input channels, and a formula of the convolutional operation of the 3D convolutional layer is expressed as follows:
wherein (x, y, z) represents the 3-dimensional spatial coordinate of the output position, c represents the output feature map channel index, W is the width of the convolution kernel, H is the height of the convolution kernel, and D is the depth of the convolution kernel; S_W is the stride of the convolution over the width, S_H is the stride over the height, P_W is the padding over the width, P_H is the padding over the height, and C_in is the number of input channels.
5. The method for recognizing multi-modal gestures based on deep learning according to claim 4, wherein the specific implementation step of step S4 includes:
S41, obtaining the coordinates of the center of the palm of each hand by calculating the average of the three-dimensional spatial coordinates of the hand joint points, with the formula:
C = ( (1/n) Σ_{i=1..n} x_i , (1/n) Σ_{i=1..n} y_i , (1/n) Σ_{i=1..n} z_i )
wherein C represents the coordinates of the palm center, n is the number of hand joint points, x_i represents the x-axis coordinate of the i-th hand joint point, y_i represents the y-axis coordinate of the i-th hand joint point, and z_i represents the z-axis coordinate of the i-th hand joint point;
S42, calculating the direction vector of each finger of the hand relative to the palm; for the direction vector v_i of the i-th finger, the calculation formula is:
v_i = (x_tip,i − x_root,i, y_tip,i − y_root,i, z_tip,i − z_root,i),
wherein (x_tip,i, y_tip,i, z_tip,i) are the fingertip coordinates of the i-th finger and (x_root,i, y_root,i, z_root,i) are the finger-root coordinates of the i-th finger;
S43, judging the hand gesture according to the relation between the coordinates of the palm center and the direction vector of each finger of the corresponding hand.
6. A multi-modal gesture recognition system based on deep learning, comprising: the image acquisition module is used for acquiring an original image containing hands, preprocessing the original image and obtaining a preprocessed image;
The hand detection module is used for detecting and locating the face in the image by using a target detection method, performing face recognition on the face in the image to obtain the identity of each person in the image, detecting the position of the hand in the preprocessed image by using the target detection method to obtain the hand position, locating the target user whose gesture needs to be recognized according to the identity of each person in the image, and extracting the face key points of the target user;
The joint recognition module is used for constructing a human joint point recognition model, inputting images to the trained human joint point recognition model, recognizing human joints of each image to obtain three-dimensional space coordinates of human joint points of each person, wherein the human joint points comprise head joint points and hand joint points, matching human face key points of a target user with the three-dimensional space coordinates of the head joint points in the images, determining human joint points belonging to the target user, and acquiring three-dimensional space coordinates of hand joint points of the target user from the human joint points of the target user, wherein the hand joint points comprise points of finger tips and finger roots of each hand in the images; and the gesture judging module is used for identifying corresponding gestures according to the relation among the three-dimensional space coordinates of all the hand joint points.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410435208.XA CN118230421B (en) | 2024-04-11 | 2024-04-11 | A multimodal gesture recognition method and system based on deep learning |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410435208.XA CN118230421B (en) | 2024-04-11 | 2024-04-11 | A multimodal gesture recognition method and system based on deep learning |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN118230421A CN118230421A (en) | 2024-06-21 |
| CN118230421B true CN118230421B (en) | 2024-10-25 |
Family
ID=91512596
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410435208.XA Active CN118230421B (en) | 2024-04-11 | 2024-04-11 | A multimodal gesture recognition method and system based on deep learning |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118230421B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118840695B (en) * | 2024-08-02 | 2025-04-25 | 广东保伦电子股份有限公司 | A method, system and device for identifying target behavior |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114445853A (en) * | 2021-12-23 | 2022-05-06 | 北京时代民芯科技有限公司 | Visual gesture recognition system recognition method |
| CN114937198A (en) * | 2022-05-19 | 2022-08-23 | 王辉 | SAR ship target detection method based on rotating frame label YOLOV3 |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN109871781B (en) * | 2019-01-28 | 2020-11-06 | 山东大学 | Dynamic gesture recognition method and system based on multimodal 3D convolutional neural network |
| WO2022040994A1 (en) * | 2020-08-26 | 2022-03-03 | 深圳市大疆创新科技有限公司 | Gesture recognition method and apparatus |
| CN113269089B (en) * | 2021-05-25 | 2023-07-18 | 上海人工智能研究院有限公司 | Real-time gesture recognition method and system based on deep learning |
| CN114581974B (en) * | 2022-02-14 | 2025-10-10 | 深圳绿米联创科技有限公司 | Intelligent device control method, device, equipment and medium |
| CN115798051A (en) * | 2022-12-12 | 2023-03-14 | 苏州金螳螂文化发展股份有限公司 | 3D-CNN-based dynamic gesture recognition method, device, equipment and storage medium |
- 2024-04-11 CN CN202410435208.XA patent/CN118230421B/en active Active
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114445853A (en) * | 2021-12-23 | 2022-05-06 | 北京时代民芯科技有限公司 | Visual gesture recognition system recognition method |
| CN114937198A (en) * | 2022-05-19 | 2022-08-23 | 王辉 | SAR ship target detection method based on rotating frame label YOLOV3 |
Also Published As
| Publication number | Publication date |
|---|---|
| CN118230421A (en) | 2024-06-21 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Gomez-Donoso et al. | Large-scale multiview 3d hand pose dataset | |
| Kang et al. | Study of a full-view 3D finger vein verification technique | |
| US8824781B2 (en) | Learning-based pose estimation from depth maps | |
| Holte et al. | View-invariant gesture recognition using 3D optical flow and harmonic motion context | |
| US9189855B2 (en) | Three dimensional close interactions | |
| US9330307B2 (en) | Learning based estimation of hand and finger pose | |
| CN101763636B (en) | Method for tracing position and pose of 3D human face in video sequence | |
| Shen et al. | Exemplar-based human action pose correction and tagging | |
| CN102982557B (en) | Method for processing space hand signal gesture command based on depth camera | |
| Del Rincón et al. | Tracking human position and lower body parts using Kalman and particle filters constrained by human biomechanics | |
| Grewe et al. | Fully automated and highly accurate dense correspondence for facial surfaces | |
| Schmaltz et al. | Region-based pose tracking with occlusions using 3d models | |
| CN108919943A (en) | A kind of real-time hand method for tracing based on depth transducer | |
| CN118247850A (en) | A human-computer interaction method and interactive system based on gesture recognition | |
| CN117333635A (en) | Interactive two-hand three-dimensional reconstruction method and system based on single RGB image | |
| Kourbane et al. | A graph-based approach for absolute 3D hand pose estimation using a single RGB image | |
| CN118230421B (en) | A multimodal gesture recognition method and system based on deep learning | |
| Deng et al. | Hand pose understanding with large-scale photo-realistic rendering dataset | |
| Cohen et al. | 3D body reconstruction for immersive interaction | |
| Holte et al. | View invariant gesture recognition using the CSEM SwissRanger SR-2 camera | |
| CN120088854A (en) | A dance movement image analysis method, system and device for image time sequence recognition | |
| Argyros | 3D head pose estimation from multiple distant views | |
| Dai | Modeling and simulation of athlete’s error motion recognition based on computer vision | |
| Sangineto et al. | Real-time viewpoint-invariant hand localization with cluttered backgrounds | |
| Kendrick et al. | An online tool for the annotation of 3d models |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |