Disclosure of Invention
The invention mainly aims to provide a gesture recognition method, an interaction method based on gesture recognition and mixed reality glasses, and aims to solve the technical problem that the recognition accuracy of existing deep-learning-based gesture recognition algorithms is limited by the error of the depth prediction algorithm used to estimate the depth map.
To achieve the above object, the present invention provides a gesture recognition method, including the steps of:
Acquiring a depth map sequence of a gesture to be recognized;
Acquiring a key frame sequence of the depth map sequence based on the depth map sequence;
Inputting the key frame sequence into a pre-trained gesture recognition model to obtain a first semantic sequence of the gesture to be recognized;
and based on the first semantic sequence, obtaining a semantic result of the gesture to be recognized.
Optionally, the step of inputting the key frame sequence into a pre-trained gesture recognition model to obtain the first semantic sequence of the gesture to be recognized includes:
extracting the image space features of each frame of image in the key frame sequence through a 3D CNN convolutional layer;
extracting the time relation features of the key frame sequence through an LSTM RNN time recursion layer;
combining the image space features and the time relation features to obtain the space-time features of the key frame sequence;
and inputting the space-time features into a classifier output layer to obtain the first semantic sequence of the gesture to be recognized.
Optionally, the step of obtaining the semantic result of the gesture to be recognized based on the first semantic sequence includes:
inputting the first semantic sequence into a pre-trained semantic translation model to obtain the semantic result.
Optionally, the step of obtaining the depth map sequence of the gesture to be recognized includes:
acquiring the depth map sequence of the gesture to be recognized captured by a depth camera.
In addition, in order to solve the above problems, the present invention also provides an interaction method based on gesture recognition, the method being applied to a mixed reality device, the method comprising the steps of:
acquiring a depth map of a gesture to be recognized;
Acquiring a key frame sequence of the depth map based on the depth map;
Inputting the key frame sequence into a pre-trained gesture recognition model to obtain a first semantic sequence of the gesture to be recognized;
based on the first semantic sequence, obtaining a semantic result of the gesture to be recognized;
Outputting the semantic result;
Acquiring voice response information aiming at the semantic result;
extracting a voice fragment based on the voice response information;
inputting the voice fragment into a pre-trained voice translation model to obtain a second semantic sequence of the voice response information;
acquiring a gesture graph sequence based on the second semantic sequence;
And displaying the gesture graph sequence.
Optionally, the step of inputting the key frame sequence into a pre-trained gesture recognition model to obtain the first semantic sequence of the gesture to be recognized includes:
extracting the image space features of each frame of image in the key frame sequence through a 3D CNN convolutional layer;
extracting the time relation features of the key frame sequence through an LSTM RNN time recursion layer;
combining the image space features and the time relation features to obtain the space-time features of the key frame sequence;
and inputting the space-time features into a classifier output layer to obtain the first semantic sequence of the gesture to be recognized.
Optionally, the step of obtaining the semantic result of the gesture to be recognized based on the first semantic sequence includes:
inputting the first semantic sequence into a pre-trained semantic translation model to obtain the semantic result.
Optionally, the step of obtaining the depth map sequence of the gesture to be recognized includes:
acquiring the depth map sequence of the gesture to be recognized captured by a depth camera.
In addition, in order to solve the above problems, the present invention also provides a gesture recognition apparatus, including:
The first acquisition module is used for acquiring a depth map sequence of the gesture to be recognized;
the second acquisition module is used for acquiring a key frame sequence of the depth map sequence based on the depth map sequence;
the recognition module is used for inputting the key frame sequence into a pre-trained gesture recognition model so as to obtain a semantic sequence of the gesture to be recognized;
And the obtaining module is used for obtaining the semantic result of the gesture to be recognized based on the semantic sequence.
In addition, in order to solve the above problems, the present invention further provides an interaction device based on gesture recognition, which is applied to a mixed reality device, and the interaction device based on gesture recognition includes:
the first acquisition module is used for acquiring a depth map of the gesture to be recognized;
The second acquisition module is used for acquiring a key frame sequence of the depth map based on the depth map;
the recognition module is used for inputting the key frame sequence into a pre-trained gesture recognition model so as to obtain a first semantic sequence of the gesture to be recognized;
the obtaining module is used for obtaining a semantic result of the gesture to be recognized based on the first semantic sequence;
The interaction module is used for outputting the semantic result and acquiring voice response information aiming at the semantic result;
The extraction module is used for extracting voice fragments based on the voice response information;
the translation module is used for inputting the voice fragment into a pre-trained voice translation model so as to obtain a second semantic sequence of the voice response information;
a fourth obtaining module, configured to obtain a gesture graphic sequence based on the second semantic sequence;
and the display module is used for displaying the gesture graph sequence.
In addition, in order to solve the above problems, the invention further provides an electronic device, which comprises a memory, a processor and a gesture recognition program stored on the memory and executable on the processor, wherein the gesture recognition program is configured to implement the steps of the gesture recognition method as described above; or
the electronic device comprises a memory, a processor and an interaction program based on gesture recognition stored on the memory and executable on the processor, wherein the interaction program based on gesture recognition is configured to implement the steps of the interaction method based on gesture recognition as described above.
In addition, in order to solve the above problems, the invention further provides mixed reality glasses, the mixed reality glasses comprising the electronic device as described above.
In order to solve the above problems, the present invention also provides a storage medium having stored thereon a gesture recognition program which, when executed by a processor, implements the steps of the gesture recognition method as described above, or
The storage medium has stored thereon an interaction program based on gesture recognition, which when executed by a processor implements the steps of the interaction method based on gesture recognition as described above.
The embodiment of the invention provides a gesture recognition method, an interaction method based on gesture recognition and mixed reality glasses. The gesture recognition method directly acquires the depth map of the gesture made by the user, avoiding the errors and corresponding time cost introduced when traditional deep learning methods estimate depth information from an ordinary RGB image, thereby improving both the accuracy and the efficiency of gesture recognition. This also benefits the accuracy and real-time performance of meaning expression when the gesture recognition method is applied to interaction in practice.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Most three-dimensional gesture recognition algorithms based on deep learning first perform depth prediction on an ordinary RGB image to obtain depth information, and then perform gesture recognition based on the predicted depth information. For example, a gesture is captured through a lens mounted on a mobile phone or tablet terminal to obtain an ordinary RGB image, depth prediction is then performed on the RGB image through a depth prediction algorithm running on the mobile phone or tablet terminal to obtain depth information, and gesture recognition is performed based on the predicted depth information. However, the recognition accuracy of this type of gesture recognition algorithm is limited by the error of the depth prediction algorithm; that is, the accuracy of gesture recognition is limited.
In order to solve the above problems, the embodiment of the invention provides a gesture recognition method, which directly acquires a depth map of the gesture made by the user, avoiding the errors and corresponding time cost introduced when traditional deep learning methods estimate depth information from RGB images, thereby improving both the accuracy and the efficiency of gesture recognition. This also benefits the accuracy and real-time performance of meaning expression when the gesture recognition method is applied to interaction in practice.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a device in a hardware running environment for the gesture recognition method and the interaction method based on gesture recognition according to an embodiment of the present invention.
The device may be a User Equipment (UE) such as mixed reality glasses, a mobile phone, a smartphone, a notebook computer, a digital broadcast receiver, a Personal Digital Assistant (PDA) or a tablet computer (PAD), a handheld device, a vehicle-mounted device, a wearable device, a computing device or other processing device connected to a wireless modem, a Mobile Station (MS), and the like. The device may also be referred to as a user terminal, a portable terminal, a desktop terminal, etc.
In general, the device comprises at least one processor 301, a memory 302 and a gesture recognition program stored on the memory and executable on the processor, the gesture recognition program being configured to implement the steps of the gesture recognition method or the interaction method based on gesture recognition as described above.
Processor 301 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 301 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 301 may also include a main processor and a coprocessor; the main processor, also referred to as a CPU (Central Processing Unit), is a processor for processing data in a wake-up state, and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 301 may integrate a GPU (Graphics Processing Unit, image processor) for rendering and drawing the content to be displayed by the display screen. The processor 301 may also include an AI (Artificial Intelligence) processor for processing pertinent gesture recognition operations, so that the gesture recognition model can train and learn autonomously, improving efficiency and accuracy.
Memory 302 may include one or more computer-readable storage media, which may be non-transitory. Memory 302 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in memory 302 is used to store at least one instruction for execution by processor 301 to implement the gesture recognition method provided by the method embodiments of the present application.
In some embodiments, the terminal may also optionally include a communication interface 303 and at least one peripheral device. The processor 301, the memory 302 and the communication interface 303 may be connected by a bus or signal lines. The respective peripheral devices may be connected to the communication interface 303 through a bus, signal line, or circuit board. Specifically, the peripheral devices include at least one of radio frequency circuitry 304, a display screen 305, and a power supply 306.
The communication interface 303 may be used to connect at least one I/O (Input/Output) related peripheral device to the processor 301 and the memory 302. The communication interface 303 is used to receive the movement tracks of a plurality of mobile terminals and other data uploaded by the user through the peripheral devices. In some embodiments, the processor 301, the memory 302 and the communication interface 303 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 301, the memory 302 and the communication interface 303 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 304 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 304 communicates with a communication network and other communication devices through electromagnetic signals, so that movement trajectories and other data of a plurality of mobile terminals can be acquired. The radio frequency circuit 304 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 304 includes an antenna system, an RF transceiver, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, subscriber identity module cards, and so forth. The radio frequency circuit 304 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to, metropolitan area networks, mobile communication networks of various generations (2G, 3G, 4G and 5G), wireless local area networks and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 304 may further include NFC (Near Field Communication) related circuits, which is not limited by the present application.
The display screen 305 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 305 is a touch screen, the display screen 305 also has the ability to collect touch signals at or above its surface. The touch signal may be input to the processor 301 as a control signal for processing. At this point, the display screen 305 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 305, disposed on the front panel of the electronic device; in other embodiments, there may be at least two display screens 305, disposed on different surfaces of the electronic device or in a folded design; in still other embodiments, the display screen 305 may be a flexible screen disposed on a curved or folded surface of the electronic device. The display screen 305 may even be arranged in an irregular, non-rectangular pattern, i.e. a shaped screen. The display screen 305 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The power supply 306 is used to power the various components in the electronic device. The power supply 306 may be an alternating current source, a direct current source, a disposable battery, or a rechargeable battery. When the power supply 306 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast-charge technology.
Those skilled in the art will appreciate that the structure shown in fig. 1 is not limiting of the electronic device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
An embodiment of the present invention provides a gesture recognition method, and referring to fig. 2, fig. 2 is a schematic flow chart of a first embodiment of the gesture recognition method of the present invention.
In this embodiment, a gesture recognition method includes the following steps:
step S100, a depth map of the gesture to be recognized is obtained.
Specifically, in the above step, the gesture to be recognized may be a static gesture or a dynamic gesture. Unlike a static gesture, a dynamic gesture may be formed by combining a plurality of hand actions; for example, it may be a sign language gesture used by the deaf-mute. That is, the gesture to be recognized may be a single gesture, or may be a coherent dynamic gesture sequence composed of a plurality of gestures connected in sequence.
The above step is used for directly acquiring a depth map; compared with an RGB image captured by an ordinary lens, the depth map contains not only a planar image of the gesture but also the position and size information of the gesture.
For example, the depth map may be acquired by a depth camera. In this case, step S100 of obtaining the depth map of the gesture to be recognized may be adapted as follows:
step S100', a depth map sequence of the gesture to be recognized, which is acquired by the depth camera, is acquired.
Specifically, in the above steps, the depth map sequence is composed of several frames of depth maps, and the coordinates of each pixel point in the image signal of the depth map sequence are three-dimensional coordinates, and the three-dimensional coordinates can be obtained based on a spatial coordinate system established by taking the depth camera as an origin. The three-dimensional coordinates of the pixels of the depth map may be used to accurately capture gestures of the user.
For example, in an embodiment, a depth camera using a CMOS light sensing element is equipped on the mixed reality glasses for acquiring depth information of a current scene, and the depth camera using the CMOS light sensing element may be used to acquire a depth map of a gesture to be recognized of a user or an interactive object. For example, the three-dimensional coordinate system may use the depth camera as an origin, a horizontal direction at the position of the depth camera as an x-axis direction, a vertical direction at the position of the depth camera as a y-axis direction, and directions perpendicular to the x-axis and the y-axis at the position of the depth camera as z-axis directions. Thus, in the image signal acquired by the depth camera, the z value in the coordinates of each pixel point may be used to indicate the distance from the position point corresponding to the pixel point to the depth camera, that is, the z value may be used to indicate the depth of field.
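For illustration, the following is a minimal sketch of how the camera-space coordinates of each pixel can be recovered from a depth frame under a standard pinhole camera model; the intrinsic parameters fx, fy, cx and cy are illustrative assumptions and are not values specified by this disclosure.

```python
import numpy as np

def depth_to_points(depth_m: np.ndarray, fx: float, fy: float,
                    cx: float, cy: float) -> np.ndarray:
    """depth_m: (H, W) array of depths in metres; returns (H, W, 3) camera-space points."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column (u) and row (v) indices
    x = (u - cx) * depth_m / fx                     # x grows to the right of the camera
    y = (v - cy) * depth_m / fy                     # y grows downward
    return np.stack([x, y, depth_m], axis=-1)       # z is the depth of field

# Example with an assumed 640x480 sensor:
# points = depth_to_points(depth_frame, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```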
Step S200, based on the depth map sequence, acquiring a key frame sequence of the depth map sequence.
Specifically, the key frame sequence is used to represent the meaning of the gesture to be recognized. Since the meaning expressed by gestures depends on the continuity between gestures, the extracted key frames need to be arranged in time order, i.e. a key frame sequence of the depth map sequence is obtained. For example, when the object captured by the depth camera is a sign language gesture made by a deaf-mute person, the sign language gesture includes a plurality of gestures and the corresponding connecting gestures. That is, the corresponding depth map sequence includes key frames that carry the meaning representation and non-key frames that do not. The above step is therefore used to obtain the key frame sequence that carries the meaning representation.
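The disclosure does not prescribe a particular key frame selection algorithm; one plausible approach is to keep, in temporal order, the frames whose inter-frame depth change indicates new hand motion. The sketch below assumes this motion-threshold strategy, and the threshold value is illustrative only.

```python
import numpy as np

def extract_key_frames(depth_seq: list[np.ndarray],
                       motion_thresh: float = 0.02) -> list[np.ndarray]:
    """depth_seq: time-ordered list of (H, W) depth frames in metres."""
    if not depth_seq:
        return []
    key_frames = [depth_seq[0]]                 # always keep the first frame
    for prev, curr in zip(depth_seq, depth_seq[1:]):
        mean_motion = np.abs(curr - prev).mean()
        if mean_motion > motion_thresh:         # frame carries new hand motion
            key_frames.append(curr)
    return key_frames
```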
Step S300, inputting the key frame sequence into a pre-trained gesture recognition model to obtain a first semantic sequence of gestures to be recognized.
Specifically, the above step is used for performing a recognition operation on the gesture action to obtain the expressed meaning of the gesture to be recognized. In the above step, gesture recognition may be performed in two dimensions by feature point extraction, or a deep learning based gesture recognition method may be used, which is not limited by the present application.
Optionally, conventional gesture recognition networks mostly employ 2D CNN models for training and recognition. However, when the gesture to be recognized is sign language used by the deaf-mute, the continuity of contextual semantics is particularly important because of the connectivity between different gestures. A conventional gesture recognition network, which mainly targets static gesture recognition for machine interaction, therefore has difficulty meeting the requirements of sign language recognition.
To this end, in an embodiment, referring to fig. 3, step S300, inputting a key frame sequence into a pre-trained gesture recognition model to obtain a first semantic sequence of gestures to be recognized includes:
Step S301, extracting the image space characteristics of each frame of image in the key frame sequence through the convolution layer 3D CNN.
In step S302, the temporal relation features of the key frame sequence are extracted by the temporal recursion layer LSTM RNN.
Step S303, combining the image space feature and the time relation feature to obtain the space-time feature of the key frame sequence.
Step S304, inputting the space-time characteristics into a classifier output layer to obtain a first semantic sequence of the gesture to be recognized.
Specifically, the gesture recognition network is a combined neural network including a 3D CNN convolutional layer and an LSTM RNN temporal recursion layer. The 3D CNN convolutional layer has the property of translation invariance and can be used to extract scale-invariant spatial features, such as the palm shape and orientation features of each gesture image in the dynamic gesture sequence. The number of layers of the 3D CNN is preset; for example, the 3D CNN may specifically include 3 convolution layers and 3 downsampling layers, and the convolution kernels of all convolution layers may be the same size, for example 5×5.
The LSTM RNN layer is used to extract the time relation features of the dynamic gesture sequence, that is, the contextual relations that express the gesture sequence. The temporally recurrent LSTM RNN analyzes the feature context of adjacent frames and integrates and propagates it, so that, combined with the spatial features produced by the preceding 3D CNN convolutional layers, the space-time features required for the final classification can be obtained. Specifically, a single-layer LSTM RNN network is concatenated after the last layer of the 3D CNN network.
The classifier output layer may include a plurality of softmax units. The time relation features obtained from the contextual feature analysis are combined with the spatial features to obtain the space-time features, which are input into the classifier with softmax units for classification output, yielding the first semantic sequence of the gesture to be recognized.
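As an illustration only, the following PyTorch sketch shows one possible realization of the combined network described above: three 3D convolution stages each followed by a downsampling layer, a single LSTM layer over the pooled per-frame features, and a softmax classifier output. The channel widths, the (3, 5, 5) kernel shape and the class count are illustrative assumptions rather than values fixed by this disclosure.

```python
import torch
import torch.nn as nn

class GestureRecognitionNet(nn.Module):
    def __init__(self, num_classes: int = 64, in_channels: int = 1):
        super().__init__()

        def block(c_in: int, c_out: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv3d(c_in, c_out, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
                nn.ReLU(inplace=True),
                nn.MaxPool3d(kernel_size=(1, 2, 2)),  # downsampling layer (spatial only)
            )

        # three convolution + downsampling stages of the 3D CNN
        self.cnn3d = nn.Sequential(block(in_channels, 16), block(16, 32), block(32, 64))
        # single-layer LSTM concatenated after the last 3D CNN layer
        self.lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
        self.classifier = nn.Linear(128, num_classes)  # classifier output layer

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, frames, height, width) key-frame sequence
        feats = self.cnn3d(clip)               # image space features, (B, 64, T, H', W')
        feats = feats.mean(dim=(3, 4))         # global average pool -> (B, 64, T)
        feats = feats.permute(0, 2, 1)         # (B, T, 64), time-major for the LSTM
        temporal, _ = self.lstm(feats)         # space-time features, (B, T, 128)
        return self.classifier(temporal).softmax(dim=-1)  # per-frame class probabilities

# e.g. probs = GestureRecognitionNet()(torch.randn(1, 1, 16, 112, 112))  # shape (1, 16, 64)
```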
Step S400, based on the first semantic sequence, obtaining a semantic result of the gesture to be recognized.
In particular, in the above steps, the semantic result may be data such as Mandarin text that may be output directly through a display or player or other interactive device. The steps are used for converting the obtained first semantic sequence into data which can be processed by other interactive equipment, so that meaning expression of the gesture to be recognized can be conveniently output.
For example, in an embodiment, step S400, based on the semantic sequence, obtains a semantic result of the gesture to be recognized, including:
Step S400', the semantic sequence is input into a pre-trained semantic translation model to obtain a semantic result.
In the above step, the first semantic sequence may be input into a semantic translation model, and the corresponding Mandarin text is obtained through the semantic translation model; this Mandarin text is the semantic result of the gesture to be recognized. It is easy to understand that the semantic translation model is prior art whose implementation is known to those skilled in the art, and it will not be described in detail here.
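As a simple illustration of how the first semantic sequence might be prepared for the semantic translation model, the following sketch collapses the per-frame class probabilities into a gloss sequence; the toy vocabulary, the repeat-collapsing rule and the `semantic_translator` interface are hypothetical and not part of this disclosure.

```python
import torch

# toy gloss vocabulary; in practice the vocabulary matches the classifier's classes
GLOSS_VOCAB = ["<blank>", "clothes", "this", "discount", "how-much"]

def decode_gloss_sequence(frame_probs: torch.Tensor) -> list[str]:
    """frame_probs: (T, num_classes) per-frame probabilities from the recognition model."""
    ids = frame_probs.argmax(dim=-1).tolist()
    glosses, prev = [], None
    for idx in ids:
        if idx != prev and GLOSS_VOCAB[idx] != "<blank>":  # drop repeats and blanks
            glosses.append(GLOSS_VOCAB[idx])
        prev = idx
    return glosses

# The gloss list would then be handed to the (assumed) pre-trained translator, e.g.:
# mandarin_text = semantic_translator.translate(decode_gloss_sequence(frame_probs))
```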
Existing gesture recognition methods first perform depth prediction on an ordinary RGB image to obtain depth information, so their recognition accuracy based on the predicted depth information is limited by the error of the depth prediction algorithm. In contrast, the gesture recognition method provided by this embodiment can directly obtain a depth map sequence of the gesture made by the user through a device such as a depth camera, avoiding the errors and corresponding time cost introduced by traditional deep learning methods when estimating the depth map, thereby improving both the accuracy and the efficiency of gesture recognition. This also benefits the accuracy and real-time performance of meaning expression when the method is applied to interaction in practice, which is beneficial to special groups such as the deaf-mute and helps them communicate with ordinary people.
In addition, in order to solve the problems, the application further provides an interaction method embodiment based on gesture recognition, and the method embodiment is applied to mixed reality equipment. The following specifically describes the use of the gesture recognition-based interaction method in mixed reality glasses as an example. It is to be understood that the mixed reality glasses are merely illustrative and not limiting of the embodiments of the present application.
Referring to fig. 4, fig. 4 is a flow chart illustrating an interactive method embodiment based on gesture recognition according to the present invention.
In this embodiment, the interaction method based on gesture recognition includes the following steps:
step S100, a depth map of the gesture to be recognized is obtained.
Specifically, in the above step, the gesture to be recognized may be a static gesture or a dynamic gesture. Unlike a static gesture, a dynamic gesture may be formed by combining a plurality of hand actions; for example, it may be a sign language gesture used by the deaf-mute. That is, the gesture to be recognized may be a single gesture, or may be a coherent dynamic gesture sequence composed of a plurality of gestures connected in sequence.
The above step is used for directly acquiring a depth map; compared with an RGB image captured by an ordinary lens, the depth map contains not only a planar image of the gesture but also the position and size information of the gesture.
For example, the depth map may be acquired by a depth camera. In this case, step S100 of obtaining the depth map of the gesture to be recognized may be adapted as follows:
step S100', a depth map sequence of the gesture to be recognized, which is acquired by the depth camera, is acquired.
Specifically, in the above steps, the depth map sequence is composed of several frames of depth maps, and the coordinates of each pixel point in the image signal of the depth map sequence are three-dimensional coordinates, and the three-dimensional coordinates can be obtained based on a spatial coordinate system established by taking the depth camera as an origin. The three-dimensional coordinates of the pixels of the depth map may be used to accurately capture gestures of the user.
For example, in an embodiment, a depth camera using a CMOS light sensing element is equipped on the mixed reality glasses for acquiring depth information of a current scene, and the depth camera using the CMOS light sensing element may be used to acquire a depth map of a gesture to be recognized of a user or an interactive object. For example, the three-dimensional coordinate system may use the depth camera as an origin, a horizontal direction at the position of the depth camera as an x-axis direction, a vertical direction at the position of the depth camera as a y-axis direction, and directions perpendicular to the x-axis and the y-axis at the position of the depth camera as z-axis directions. Thus, in the image signal acquired by the depth camera, the z value in the coordinates of each pixel point may be used to indicate the distance from the position point corresponding to the pixel point to the depth camera, that is, the z value may be used to indicate the depth of field.
Step S200, based on the depth map sequence, acquiring a key frame sequence of the depth map sequence.
Specifically, the key frame sequence is used to represent the meaning of the gesture to be recognized. Since the meaning expressed by gestures depends on the continuity between gestures, the extracted key frames need to be arranged in time order, i.e. a key frame sequence of the depth map sequence is obtained. For example, when the object captured by the depth camera is a sign language gesture made by a deaf-mute person, the sign language gesture includes a plurality of gestures and the corresponding connecting gestures. That is, the corresponding depth map sequence includes key frames that carry the meaning representation and non-key frames that do not. The above step is therefore used to obtain the key frame sequence that carries the meaning representation.
Step S300, inputting the key frame sequence into a pre-trained gesture recognition model to obtain a first semantic sequence of gestures to be recognized.
Specifically, the above step is used for performing a recognition operation on the gesture action to obtain the expressed meaning of the gesture to be recognized. In the above step, gesture recognition may be performed in two dimensions by feature point extraction, or a deep learning based gesture recognition method may be used, which is not limited by the present application.
Optionally, conventional gesture recognition networks mostly employ 2D CNN models for training and recognition. However, when the gesture to be recognized is sign language used by the deaf-mute, the continuity of contextual semantics is particularly important because of the connectivity between different gestures. A conventional gesture recognition network, which mainly targets static gesture recognition for machine interaction, therefore has difficulty meeting the requirements of sign language recognition.
To this end, in an embodiment, referring to fig. 3, step S300, inputting a key frame sequence into a pre-trained gesture recognition model to obtain a first semantic sequence of gestures to be recognized includes:
Step S301, extracting the image space characteristics of each frame of image in the key frame sequence through the convolution layer 3D CNN.
In step S302, the temporal relation features of the key frame sequence are extracted by the temporal recursion layer LSTM RNN.
Step S303, combining the image space feature and the time relation feature to obtain the space-time feature of the key frame sequence.
Step S304, inputting the space-time characteristics into a classifier output layer to obtain a first semantic sequence of the gesture to be recognized.
Specifically, the gesture recognition network is a combined neural network including a 3D CNN convolutional layer and an LSTM RNN temporal recursion layer. The 3D CNN convolutional layer has the property of translation invariance and can be used to extract scale-invariant spatial features, such as the palm shape and orientation features of each gesture image in the dynamic gesture sequence. The number of layers of the 3D CNN is preset; for example, the 3D CNN may specifically include 3 convolution layers and 3 downsampling layers, and the convolution kernels of all convolution layers may be the same size, for example 5×5.
The LSTM RNN layer is used to extract the time relation features of the dynamic gesture sequence, that is, the contextual relations that express the gesture sequence. The temporally recurrent LSTM RNN analyzes the feature context of adjacent frames and integrates and propagates it, so that, combined with the spatial features produced by the preceding 3D CNN convolutional layers, the space-time features required for the final classification can be obtained. Specifically, a single-layer LSTM RNN network is concatenated after the last layer of the 3D CNN network.
The classifier output layer may include a plurality of softmax units. The time relation features obtained from the contextual feature analysis are combined with the spatial features to obtain the space-time features, which are input into the classifier with softmax units for classification output, yielding the first semantic sequence of the gesture to be recognized.
Step S400, based on the first semantic sequence, obtaining a semantic result of the gesture to be recognized.
In particular, in the above steps, the semantic result may be data such as Mandarin text that may be output directly through a display or player or other interactive device. The steps are used for converting the obtained first semantic sequence into data which can be processed by other interactive equipment, so that meaning expression of the gesture to be recognized can be conveniently output.
For example, in an embodiment, step S400, based on the semantic sequence, obtains a semantic result of the gesture to be recognized, including:
Step S400', the semantic sequence is input into a pre-trained semantic translation model to obtain a semantic result.
In the above step, the first semantic sequence may be input into a semantic translation model, and the corresponding Mandarin text is obtained through the semantic translation model; this Mandarin text is the semantic result of the gesture to be recognized. It is easy to understand that the semantic translation model is prior art whose implementation is known to those skilled in the art, and it will not be described in detail here. For the specific implementation of the above steps, reference may be made to the foregoing embodiments; since these steps adopt all the technical solutions of those embodiments, they have at least all the beneficial effects brought by those technical solutions, which will not be repeated here.
Step S500, outputting a semantic result and acquiring voice response information aiming at the semantic result.
Specifically, this step is used for outputting the semantic result obtained from the gesture to be recognized through a display, a player or another interactive device, for example on a display screen or through speaker playback. It is easy to understand that the person to whom the gesture is addressed can receive this output and, after receiving the expression of the semantic result, make a corresponding response.
In one embodiment, the user of the mixed reality glasses is a deaf-mute person communicating face to face with an ordinary person. The mixed reality glasses recognize the sign language made by the deaf-mute user and obtain a semantic result; the semantic result, for example the Mandarin expression of the sign language's meaning, can then be output through the mounted loudspeaker. After receiving the Mandarin expression played by the loudspeaker, the ordinary person makes a corresponding response, which can serve as the voice response information.
For example, a deaf-mute person wearing the mixed reality glasses goes shopping in a market and makes the sign language for "What discount is there on this piece of clothing?". The depth camera mounted on the mixed reality glasses directly acquires the depth map sequence of the sign language, which is recognized as the corresponding semantic result, namely the Mandarin text "What discount is there on this piece of clothing?". This text is then played in Mandarin through the loudspeaker mounted on the mixed reality glasses. After hearing the voice played by the loudspeaker, the shopping guide in the market makes a corresponding voice response, "Hello, this piece of clothing is at a 30% discount"; at this moment, the receiver mounted on the mixed reality glasses acquires the voice response information "Hello, this piece of clothing is at a 30% discount" for the semantic result "What discount is there on this piece of clothing?".
Step S600, extracting the voice fragment based on the voice response information.
Specifically, the speech segment in the above step is the key speech segment in which the response information is recorded. The obtained voice response information may first be subjected to noise reduction processing, after which the corresponding speech segment is extracted.
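The disclosure does not fix a particular extraction algorithm; one common approach is simple energy-based voice activity detection after noise reduction. The sketch below assumes this approach, and the frame length and energy threshold are illustrative values.

```python
import numpy as np

def extract_speech_segment(audio: np.ndarray, sample_rate: int = 16000,
                           frame_ms: int = 20, energy_thresh: float = 1e-3) -> np.ndarray:
    """audio: mono waveform in [-1, 1]; returns the voiced span of the signal."""
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)                 # per-frame signal energy
    voiced = np.where(energy > energy_thresh)[0]
    if voiced.size == 0:
        return np.empty(0, dtype=audio.dtype)           # nothing above the threshold
    start, end = voiced[0] * frame_len, (voiced[-1] + 1) * frame_len
    return audio[start:end]
```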
Step S700, inputting the speech segment into a pre-trained speech translation model to obtain a second semantic sequence of speech response information.
Specifically, this step is used for converting the speech segment into the corresponding second semantic sequence. The second semantic sequence describes the meaning expression of the voice response information.
Step S800, based on the second semantic sequence, acquiring a gesture graph sequence.
Specifically, in the above step, the second semantic sequence may be converted into a corresponding sign language representation through semantic analysis, the corresponding sign language model animations are loaded from a sign language database, and these animations are assembled into a gesture graphic sequence according to the temporal order of the second semantic sequence.
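As an illustration, the following sketch assembles a gesture graphic sequence by looking up each token of the second semantic sequence in a sign language animation database and concatenating the clips in temporal order; the `SIGN_ANIMATION_DB` mapping and the clip type are hypothetical placeholders, not an interface defined by this disclosure.

```python
from typing import Any

SIGN_ANIMATION_DB: dict[str, Any] = {}   # token -> hand-model animation clip

def build_gesture_sequence(semantic_tokens: list[str]) -> list[Any]:
    """semantic_tokens: time-ordered tokens from the speech translation model."""
    sequence = []
    for token in semantic_tokens:
        clip = SIGN_ANIMATION_DB.get(token)
        if clip is not None:             # skip tokens with no sign-language entry
            sequence.append(clip)
    return sequence

# The resulting clip list would then be rendered frame by frame on the
# mixed reality glasses' display.
```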
Step S900, a gesture graphics sequence is displayed.
Specifically, this step is used to achieve interaction between the mixed reality glasses and the deaf-mute user. The interaction can be achieved by displaying the gesture graphic sequence on the mixed reality glasses, namely, the gesture graphic sequence is deployed and played in a real scene. The deaf-mute wearing the mixed reality glasses can directly receive the gesture graphic sequence through vision, so that the meaning representation of the corresponding voice response information is obtained.
Compared with existing single-ended gesture recognition or speech recognition methods, the interaction method based on gesture recognition provided by this embodiment integrates gesture recognition and speech recognition into a whole, forming an end-to-end closed loop, thereby realizing real-time communication between special groups such as the deaf-mute and ordinary people. When the method is applied to a mixed reality device, realistic and vivid model animation demonstrations of the interaction can be presented through the mixed reality glasses, improving the quality of communication.
In addition, the present invention further provides an embodiment of a gesture recognition apparatus, referring to fig. 5, fig. 5 is a block diagram of the structure of the embodiment, where the apparatus includes:
a first obtaining module 10, configured to obtain a depth map of a gesture to be recognized;
A second acquisition module 20, configured to acquire a key frame sequence of the depth map based on the depth map;
the recognition module 30 is configured to input the key frame sequence into a pre-trained gesture recognition model to obtain a first semantic sequence of gestures to be recognized;
an obtaining module 40, configured to obtain a semantic result of the gesture to be recognized based on the first semantic sequence.
Compared with existing gesture recognition devices, the gesture recognition device provided by this embodiment can directly acquire a depth map sequence of the gesture made by the user through a device such as a depth camera, avoiding the errors and corresponding time cost introduced by traditional deep learning methods when estimating the depth map, thereby improving both the accuracy and the efficiency of gesture recognition. This also benefits the accuracy and real-time performance of meaning expression when the device is applied to interaction in practice, which is beneficial to special groups such as the deaf-mute and helps them communicate with ordinary people.
In addition, the invention also provides an interactive device embodiment based on gesture recognition, which is applied to mixed reality equipment. Referring to fig. 6, fig. 6 is a block diagram of the structure of the present embodiment.
In this embodiment, the interaction device based on gesture recognition includes:
a first obtaining module 10, configured to obtain a depth map of a gesture to be recognized;
A second acquisition module 20, configured to acquire a key frame sequence of the depth map based on the depth map;
the recognition module 30 is configured to input the key frame sequence into a pre-trained gesture recognition model to obtain a first semantic sequence of gestures to be recognized;
an obtaining module 40, configured to obtain a semantic result of the gesture to be recognized based on the first semantic sequence;
The interaction module 50 is used for outputting a semantic result and acquiring voice response information aiming at the semantic result;
An extracting module 60 for extracting a voice clip based on the voice response information;
A translation module 70 for inputting the speech segments into a pre-trained speech translation model to obtain a second semantic sequence of speech response information;
A fourth obtaining module 80, configured to obtain a gesture graphic sequence based on the second semantic sequence;
a display module 90, configured to display a gesture graphics sequence.
Compared with existing single-ended gesture recognition or speech recognition methods, the interaction device based on gesture recognition provided by this embodiment integrates gesture recognition and speech recognition into a whole, forming an end-to-end closed loop, thereby realizing real-time communication between special groups such as the deaf-mute and ordinary people. When the device is applied to a mixed reality device, realistic and vivid model animation demonstrations of the interaction can be presented through the mixed reality glasses, improving the quality of communication.
Other embodiments or specific implementations of the gesture recognition apparatus and the interaction apparatus based on gesture recognition may refer to the above method embodiments, and are not described herein.
In addition, in order to solve the above problems, the invention further provides mixed reality glasses, the mixed reality glasses comprising the electronic device as described above.
Specifically, the mixed reality glasses are provided with a depth camera, a loudspeaker and an earphone. The depth camera is used for collecting the depth map; it is easy to understand that the depth camera can collect gesture actions made by the wearer as well as gesture actions made by the person the wearer communicates with, that is, the mixed reality glasses can be worn by deaf-mute people, and can also be worn by ordinary people, to enable communication between them. The loudspeaker is used for outputting the semantic result, which can improve the real-time performance of communication between the deaf-mute and ordinary people.
In order to solve the above problems, the present invention also provides a storage medium having stored thereon a gesture recognition program which, when executed by a processor, implements the steps of the gesture recognition method as described above, or
The storage medium has stored thereon an interaction program based on gesture recognition which, when executed by a processor, implements the steps of the interaction method based on gesture recognition as described above. The steps implemented when the program is executed can refer to the method embodiments above, so a detailed description is not given here, and the description of the same beneficial effects is likewise omitted. For technical details not disclosed in the embodiments of the computer-readable storage medium of the present application, please refer to the description of the method embodiments of the present application. As an example, the program instructions may be deployed to be executed on one computing device, or on multiple computing devices located at one site or distributed across multiple sites and interconnected by a communication network.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by means of a computer program, which may be stored on a computer-readable storage medium and which, when executed, may include the steps of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
It should be further noted that the above-described apparatus embodiments are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the invention, the connection relation between modules indicates that they have communication connections, which may specifically be implemented as one or more communication buses or signal lines. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present invention may be implemented by means of software plus the necessary general-purpose hardware, or of course by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, functions performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can vary, such as analog circuits, digital circuits or dedicated circuits. However, in many cases, a software implementation is the preferred embodiment of the present invention. Based on such understanding, the technical solution of the present invention, or the part contributing to the prior art, may be embodied essentially in the form of a software product stored on a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk of a computer, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present invention.