CN117877125B - Action recognition and model training method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN117877125B (application CN202410270243.0A)
- Authority
- CN
- China
- Prior art keywords
- interaction
- visual
- features
- audio
- limb
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Medical Informatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses an action recognition method, a model training method therefor, a device, an electronic device, and a storage medium, applied in the technical field of video understanding. A video sample carrying an action tag, together with its audio data, is input into an action recognition model; visual features, text semantic features, and audio features of the video sample are extracted; visual interaction and audio-visual interaction are performed on the visual features and the audio features; and interaction features are added to the text semantic features to obtain multi-modal action tag features. The action recognition model is then iteratively updated according to the losses among the visual interaction features, the audio-visual interaction features, the audio features, and the multi-modal action tag features. The invention solves the problems of poor fine-grained action recognition and slow convergence of action recognition tasks in the related art: it enables the action recognition model to understand and describe fine action characteristics more comprehensively, improves the performance and robustness of action recognition, and enhances the extensibility and flexibility of the model.
Description
Technical Field
The present invention relates to the field of video understanding technologies, and in particular, to a method and apparatus for motion recognition and model training thereof, an electronic device, and a readable storage medium.
Background
Motion recognition is the process of recognizing and understanding an action performed by a specified object from video, sensors, or other sensing devices, such as recognizing a person's limb movements. With the continuous development of computer vision technology, the accuracy requirements for motion recognition have gradually increased, and some current application scenarios require distinguishing actions with only subtle differences, that is, fine-grained action recognition.
In fine-grained action recognition, the related art cannot recognize subtle actions efficiently and accurately. In view of this, improving the accuracy of action recognition while enabling the recognition task to converge rapidly is a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
The invention provides an action recognition and model training method, a device, an electronic device, and a readable storage medium, which enable an action recognition model to understand and describe fine action characteristics more comprehensively, improve action recognition performance and robustness, and further enhance model extensibility and flexibility.
In order to solve the technical problems, the invention provides the following technical scheme:
in one aspect, the present invention provides a method for training an action recognition model, including:
Acquiring a video sample data set carrying action tags and audio data;
Inputting video samples in the video sample dataset into an action recognition model; the action recognition model comprises a target visual language model, a target acoustic model and a multi-modal interaction network module, visual characteristics and text semantic characteristics are extracted by utilizing the target visual language model, audio characteristics are extracted by utilizing the target acoustic model, visual interaction and audio-visual interaction are carried out on the visual characteristics and the audio characteristics by utilizing the multi-modal interaction network module according to a sound source party, an action source party and a characteristic source party, and at least one interaction characteristic is added for the text semantic characteristics so as to obtain multi-modal action tag characteristics added with at least one of auditory information, visual information and audio-visual information;
And according to the visual interaction characteristics, the audio-visual interaction characteristics, the loss information between the audio characteristics and the multi-mode action tag characteristics, iteratively updating the action recognition model until a preset model training ending condition is met, and obtaining an action recognition model for executing an action recognition task.
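To make the training flow concrete, the following minimal PyTorch-style sketch shows one training iteration under the assumptions that the visual language model and acoustic model are frozen and that the model exposes `visual_language_model`, `acoustic_model`, `interaction_net`, and `contrastive_loss` attributes; these names are illustrative and are not taken from the patent.

```python
import torch

def train_step(model, optimizer, video, audio, action_text):
    # Frozen backbones: the visual language model and acoustic model
    # keep their pre-trained weights (a stated premise of the scheme).
    with torch.no_grad():
        vis_feats, text_feats = model.visual_language_model(video, action_text)
        audio_feats = model.acoustic_model(audio)

    # Trainable multi-modal interaction network: produces visual interaction
    # features, audiovisual interaction features, and multi-modal tag features.
    vis_inter, av_inter, tag_feats = model.interaction_net(
        vis_feats, audio_feats, text_feats)

    # Loss over the feature/tag pairings, then a standard update.
    loss = model.contrastive_loss(vis_inter, av_inter, audio_feats, tag_feats)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```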
In a first exemplary embodiment, the visual features include a sound source side visual feature and an action source side visual feature, the multi-modal interaction network module includes a visual interaction module and an audio-visual interaction module, and performing visual interaction and audio-visual interaction on the visual features and the audio features according to the sound source side, the action source side and the feature source side includes:
The visual interaction module is utilized to carry out visual interaction processing on the visual characteristics of the sound source side, the visual characteristics corresponding to the characteristic source side and the visual characteristics of the action source side respectively, so as to obtain visual interaction characteristics;
and performing cross-modal audio-visual interaction processing on the visual interaction characteristics and the audio characteristics by utilizing the audio-visual interaction module to obtain audio-visual interaction characteristics.
In a second exemplary embodiment, the action source side visual features include a limb visual feature and a target object visual feature, and performing visual interaction processing on the sound source side visual feature, the visual features corresponding to the feature source side, and the visual features of the action source side, respectively, to obtain the visual interaction features includes:
acquiring the sound source side visual characteristics and the sound body-limb interaction characteristics of the limb visual characteristics, and carrying out modal fusion on the sound body-limb interaction characteristics of each dimension to obtain sound body-limb interaction information;
coding the sound body-limb interaction information based on a self-attention mechanism, carrying out residual connection on the coded sound body-limb interaction information, and carrying out layer normalization processing to obtain sound body-limb interaction enhancement characteristics;
Acquiring the sound body-limb-target interaction characteristics of the sound body-limb interaction enhancement characteristics and the visual characteristics of the target object, and carrying out modal fusion on the sound body-limb-target interaction characteristics of each dimension to obtain sound body-limb-target interaction information;
Coding the sound body-limb-target interaction information based on a self-attention mechanism, carrying out residual connection on the coded sound body-limb-target interaction information, and carrying out layer normalization processing to obtain sound body-limb-target interaction enhancement characteristics;
And acquiring the sound body-limb-target-frame interaction features of the sound body-limb-target interaction enhancement features and the visual features corresponding to the feature source side, and performing modal fusion on the sound body-limb-target-frame interaction features of each dimension to obtain the visual interaction features.
In a third exemplary embodiment, the visual features corresponding to the feature source side include the current image frame and the video frames corresponding to the video sample, and the acquiring the sound body-limb-target-frame interaction features of the sound body-limb-target interaction enhancement features and the visual features corresponding to the feature source side, and performing modal fusion on the sound body-limb-target-frame interaction features of each dimension to obtain the visual interaction features includes:
Acquiring the interaction enhancement characteristics of the sound body-limb-target and the interaction characteristics of the sound body-limb-target-image frame of the current image frame, and carrying out modal fusion on the interaction characteristics of the sound body-limb-target-image frame of each dimension to obtain interaction information of the sound body-limb-target-image frame;
coding the sound body-limb-target-image frame interaction information based on a self-attention mechanism, carrying out residual connection on the coded sound body-limb-target-image frame interaction information, and carrying out layer normalization processing to obtain sound body-limb-target-image frame interaction enhancement characteristics;
And acquiring the interaction enhancement characteristic of the sound body-limb-target-image frame and the interaction characteristic of the sound body-limb-target-image frame-video frame of the video frame, and carrying out modal fusion on the interaction characteristic of the sound body-limb-target-image frame-video frame of each dimension to obtain the visual interaction characteristic.
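The cascade described in the second and third exemplary embodiments (sound source to limb, then target object, then current image frame, then video frames, with self-attention encoding, residual connection, and layer normalization after each step) can be sketched as follows; the use of cross-attention for the pairwise interaction, the head count, and the dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class InteractionStage(nn.Module):
    """One cascade stage: interact the running feature with a new visual cue,
    then self-attention encoding with residual connection and layer norm."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, running: torch.Tensor, cue: torch.Tensor) -> torch.Tensor:
        inter, _ = self.cross_attn(running, cue, cue)  # e.g. sound body-limb interaction
        enc, _ = self.self_attn(inter, inter, inter)   # self-attention encoding
        return self.norm(enc + inter)                  # residual connection + layer norm

class VisualInteractionCascade(nn.Module):
    """Sound source -> limb -> target object -> image frame -> video frame."""
    def __init__(self, dim: int):
        super().__init__()
        self.stages = nn.ModuleList(InteractionStage(dim) for _ in range(4))

    def forward(self, sound_src, limb, target, image_frame, video_frame):
        x = sound_src
        for stage, cue in zip(self.stages, (limb, target, image_frame, video_frame)):
            x = stage(x, cue)
        return x  # visual interaction features
```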
In a fourth exemplary embodiment, the target visual language model includes an image encoder and a target detector, and the extracting visual features using the target visual language model includes:
acquiring a sound source side image block, a limb image block and a target object image block from the video sample by utilizing the target detector;
And inputting the sound source side image block, the limb image block, the target object image block, the current image and the video sample into the image encoder to obtain sound source side visual characteristics, limb visual characteristics, target object visual characteristics, current image frames and video frames corresponding to the video sample.
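A sketch of this extraction step, assuming a generic detector that returns named bounding boxes and a CLIP-style image encoder; both interfaces are hypothetical placeholders, not a named API.

```python
import torch

def extract_visual_features(video_frames, detector, image_encoder):
    """video_frames: (T, C, H, W) tensor; detector and image_encoder are
    assumed callables (e.g. a two-stage detector and a CLIP-style encoder)."""
    current = video_frames[-1]   # the current image frame
    boxes = detector(current)    # e.g. {'source': (x1, y1, x2, y2), 'limb': ..., 'target': ...}

    def crop(box):
        x1, y1, x2, y2 = box
        return current[:, y1:y2, x1:x2]   # cut the image block at the detected position

    feats = {name: image_encoder(crop(box).unsqueeze(0)) for name, box in boxes.items()}
    feats["current_frame"] = image_encoder(current.unsqueeze(0))
    feats["video_frames"] = torch.stack(
        [image_encoder(f.unsqueeze(0)) for f in video_frames])
    return feats
```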
In a fifth exemplary embodiment, the performing modal fusion on the volume-limb interaction features of each dimension to obtain volume-limb interaction information includes:
Calculating the global average value of the interaction characteristics of the voice body and the limbs in the dimension of the target characteristics;
Determining residual coefficients of the sound body-limb interaction features in corresponding dimensions according to the global average value, and fusing the sound body-limb interaction features and the limb visual features based on the residual coefficients to obtain interaction fusion features;
And carrying out layer normalization processing on the interaction fusion features, and carrying out feature extraction on cross-modal interaction features obtained by the layer normalization processing to obtain the sound body-limb interaction information.
In a sixth exemplary embodiment, the calculating a global average of the pitch-limb interaction feature in a target feature dimension includes:
Calling a mean value calculation relational expression to calculate the global average of the sound body-limb interaction features in the target feature dimension; the mean value calculation relational expression is:

$\bar{F} = \frac{1}{L} \sum_{i=1}^{L} F_i$

where $\bar{F}$ is the global average over the L dimension and $F_i$ is the sound body-limb interaction feature in the i-th dimension.
In a seventh exemplary embodiment, the determining, according to the global average value, a residual coefficient of the pitch-limb interaction feature in a corresponding dimension includes:
Based on the global average value, calling a residual coefficient calculation relational expression to determine the residual coefficient of the sound body-limb interaction feature in the corresponding dimension; the residual coefficient calculation relational expression is:

$\alpha = \sigma\left( W_2 \, \delta( W_1 \bar{F} ) \right)$

where $\alpha$ is the residual coefficient in the L dimension, $\sigma$ is the sigmoid (S-shaped) function, $W_2$ is the second fully connected layer, $\delta$ is the activation function, $W_1$ is the first fully connected layer, and $\bar{F}$ is the global average in the L-th dimension.
In an eighth exemplary embodiment, the fusing the body-limb interaction feature and the limb visual feature based on the residual coefficient to obtain an interaction fusion feature includes:
Calling a feature fusion relational expression to fuse the sound body-limb interaction feature and the limb visual feature; the feature fusion relational expression is:

$F_{\mathrm{fuse}} = \alpha \odot F_{\mathrm{limb}} + F_{\mathrm{inter}}$

where $F_{\mathrm{fuse}}$ is the interaction fusion feature, $\alpha$ is the residual coefficient in the L dimension, $F_{\mathrm{limb}}$ is the limb visual feature, and $F_{\mathrm{inter}}$ is the sound body-limb interaction feature.
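Put together, the mean, residual coefficient, and fusion relations above amount to a squeeze-and-excitation style gate. The sketch below assumes (batch, L, dim) tensors, a ReLU activation, and the reconstructed fusion form, so treat the details as illustrative.

```python
import torch
import torch.nn as nn

class ModalFusion(nn.Module):
    """Dimension-wise modal fusion: the global average of the interaction
    feature drives two fully connected layers and a sigmoid to produce the
    residual coefficient, which then weights the fusion with the limb feature."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim // reduction)  # first fully connected layer
        self.fc2 = nn.Linear(dim // reduction, dim)  # second fully connected layer
        self.act = nn.ReLU()                         # assumed activation function
        self.norm = nn.LayerNorm(dim)

    def forward(self, interaction: torch.Tensor, limb: torch.Tensor) -> torch.Tensor:
        # interaction, limb: (batch, L, dim)
        g = interaction.mean(dim=1)                              # global average over L
        alpha = torch.sigmoid(self.fc2(self.act(self.fc1(g))))  # residual coefficient
        fused = alpha.unsqueeze(1) * limb + interaction          # feature fusion relation
        return self.norm(fused)                                  # layer normalization
```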
In a ninth exemplary embodiment, the limb visual features include a first limb visual feature and a second limb visual feature, and the feature extraction is performed on the cross-modal interaction feature obtained by the layer normalization processing to obtain the volume-limb interaction information, including:
performing feature learning and feature extraction on cross-modal interaction features corresponding to the first limb visual features by using a first feedforward neural network to obtain first sound body-limb interaction information;
performing feature learning and feature extraction on cross-modal interaction features corresponding to the second limb visual features by using a second feedforward neural network to obtain second sound body-limb interaction information;
and splicing the first sound body-limb interaction information and the second sound body-limb interaction information, and mapping the spliced characteristics to the same dimensionality of the visual characteristics of the sound source party by utilizing a third feedforward neural network to obtain the sound body-limb interaction information.
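A sketch of this two-branch refinement, assuming equal feature dimensions and simple two-layer feed-forward networks; the widths are illustrative.

```python
import torch
import torch.nn as nn

def make_ffn(d_in: int, d_out: int) -> nn.Sequential:
    return nn.Sequential(nn.Linear(d_in, 4 * d_in), nn.GELU(),
                         nn.Linear(4 * d_in, d_out))

class LimbInfoMerge(nn.Module):
    """Two feedforward networks refine the cross-modal features of the two
    limbs; the results are spliced, and a third feedforward network maps the
    splice to the dimension of the sound source side visual features."""
    def __init__(self, dim: int):
        super().__init__()
        self.ffn_first = make_ffn(dim, dim)      # first limb branch
        self.ffn_second = make_ffn(dim, dim)     # second limb branch
        self.ffn_merge = make_ffn(2 * dim, dim)  # map splice to source dimension

    def forward(self, first_limb: torch.Tensor, second_limb: torch.Tensor):
        a = self.ffn_first(first_limb)    # first sound body-limb interaction info
        b = self.ffn_second(second_limb)  # second sound body-limb interaction info
        return self.ffn_merge(torch.cat([a, b], dim=-1))
```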
In a tenth exemplary embodiment, the performing cross-modal audio-visual interaction processing on the visual interaction feature and the audio feature to obtain an audio-visual interaction feature includes:
performing cross-modal fusion processing on the visual interaction features and the audio features respectively to obtain visual enhancement interaction features and audio enhancement interaction features;
splicing the visual enhancement interaction characteristics and the audio enhancement interaction characteristics to obtain audio-visual fusion characteristics;
and encoding the audio-visual fusion features based on a self-attention mechanism, and extracting the features of the encoded audio-visual fusion features to obtain audio-visual interaction features.
In an eleventh exemplary embodiment, the cross-modal fusion processing is performed on the visual interaction feature and the audio feature, respectively, including:
Based on the visual interaction feature serving as a query vector, the audio feature serving as a group of key vectors and value vectors, cross-modal audio-visual interaction processing is performed by adopting a cross-attention mechanism, and visual enhancement interaction features are obtained;
Based on the audio feature as a query vector, the visual interaction feature as a set of key vectors and value vectors, cross-modal audio-visual interaction processing is performed by adopting a cross-attention mechanism, and audio-enhanced interaction features are obtained.
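The bidirectional cross-attention just described, followed by the splice and self-attention encoding of the tenth exemplary embodiment, can be sketched as follows; the head count and the final feed-forward extractor are assumptions.

```python
import torch
import torch.nn as nn

class AudioVisualInteraction(nn.Module):
    """Vision queries audio and audio queries vision; the two enhanced
    streams are spliced and re-encoded to give the audiovisual feature."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.v_queries_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a_queries_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        vis_enh, _ = self.v_queries_a(visual, audio, audio)   # visual enhanced feature
        aud_enh, _ = self.a_queries_v(audio, visual, visual)  # audio enhanced feature
        fused = torch.cat([vis_enh, aud_enh], dim=1)          # audiovisual fusion feature
        enc, _ = self.self_attn(fused, fused, fused)          # self-attention encoding
        return self.ffn(enc)                                  # audiovisual interaction feature
```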
In a twelfth exemplary embodiment, the multi-modal interaction network module includes an interaction perception prompt module, and the adding at least one interaction feature to the text semantic feature includes:
respectively learning association relations among the text semantic features, the visual interaction features, the audio-visual interaction features and the audio features by utilizing the interaction perception prompt module to obtain visual-text semantic features, auditory-text semantic features and audio-visual-text semantic features; performing feature enhancement processing on the visual-text semantic features, the auditory-text semantic features and the audiovisual-text semantic features to obtain visual-text action tag features, auditory-text action tag features and audiovisual-text action tag features;
The visual-text action tag feature, the auditory-text action tag feature, the audiovisual-text action tag feature are treated as a set of multimodal action tag features.
In a thirteenth exemplary embodiment, the interaction perception prompt module includes, connected in sequence, a cross-attention mechanism layer, a first residual connection and layer normalization layer, a feedforward neural network layer, and a second residual connection and layer normalization layer.
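The layer order named above maps naturally onto the following sketch; the head count and feed-forward width are assumptions.

```python
import torch
import torch.nn as nn

class InteractionPerceptionPrompt(nn.Module):
    """Cross-attention, first residual connection and layer normalization,
    feedforward network, second residual connection and layer normalization.
    Text semantics act as the query; an interaction feature supplies keys/values."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, interaction: torch.Tensor) -> torch.Tensor:
        x, _ = self.cross_attn(text, interaction, interaction)
        x = self.norm1(x + text)   # first residual connection + layer normalization
        y = self.ffn(x)
        return self.norm2(y + x)   # second residual connection + layer normalization
```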
In a fourteenth exemplary embodiment, the adding at least one interaction feature to the text semantic feature includes:
inputting the visual interaction features and the text semantic features into the cross-attention mechanism layer, taking the text semantic features as query vectors and the visual interaction features as a set of key vectors and value vectors for cross-modal interaction, to obtain visual-text interaction features;
inputting the visual-text interaction features and the text semantic features into the first residual connection and layer normalization layer to obtain visual-text interaction enhancement features;
inputting the visual-text interaction enhancement features into the feedforward neural network layer and then the second residual connection and layer normalization layer, and taking the output of the second residual connection and layer normalization layer as the visual-text action tag features;
inputting the audio features and the text semantic features into the cross-attention mechanism layer, taking the text semantic features as query vectors and the audio features as a set of key vectors and value vectors for cross-modal interaction, to obtain auditory-text interaction features;
inputting the auditory-text interaction features and the text semantic features into the first residual connection and layer normalization layer to obtain auditory-text interaction enhancement features;
inputting the auditory-text interaction enhancement features into the feedforward neural network layer and then the second residual connection and layer normalization layer, and taking the output of the second residual connection and layer normalization layer as the auditory-text action tag features;
inputting the audio-visual interaction features and the text semantic features into the cross-attention mechanism layer, taking the text semantic features as query vectors and the audio-visual interaction features as a set of key vectors and value vectors for cross-modal interaction, to obtain audio-visual-text interaction features;
inputting the audio-visual-text interaction features and the text semantic features into the first residual connection and layer normalization layer to obtain audio-visual-text interaction enhancement features;
and inputting the audio-visual-text interaction enhancement features into the feedforward neural network layer and then the second residual connection and layer normalization layer, and taking the output of the second residual connection and layer normalization layer as the audio-visual-text action tag features.
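Since the same module is applied three times, once per interaction feature, a brief usage sketch (reusing the `InteractionPerceptionPrompt` sketch above, with assumed tensor shapes) looks like:

```python
import torch

prompt = InteractionPerceptionPrompt(dim=512)  # from the sketch above

text_semantic = torch.randn(1, 16, 512)        # text semantic features (action tags)
visual_inter = torch.randn(1, 32, 512)         # visual interaction features
audio_feats = torch.randn(1, 32, 512)          # audio features
audiovisual_inter = torch.randn(1, 64, 512)    # audiovisual interaction features

visual_text_tag = prompt(text_semantic, visual_inter)
auditory_text_tag = prompt(text_semantic, audio_feats)
audiovisual_text_tag = prompt(text_semantic, audiovisual_inter)

multimodal_action_tags = (visual_text_tag, auditory_text_tag, audiovisual_text_tag)
```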
In a fifteenth exemplary embodiment, the iteratively updating the motion recognition model according to loss information between visual interaction features, audiovisual interaction features, the audio features, and the multi-modal motion label features includes:
determining visual text loss information according to the visual interaction characteristics and the visual-text action label characteristics;
determining audible text loss information based on the audio feature and the audible-text action tag feature;
determining audio-visual text loss information according to the audio-visual interaction characteristics and the audio-visual-text action tag characteristics;
And determining a loss function of the action recognition model according to the visual text loss information, the hearing text loss information and the audiovisual text loss information, and iteratively updating the action recognition model based on the loss function.
In a sixteenth exemplary embodiment, said determining a loss function of said action recognition model from said visual text loss information, said auditory text loss information, and said audiovisual text loss information comprises:
calling a loss function relational expression to determine the loss function of the action recognition model according to the visual text loss information, the auditory text loss information and the audiovisual text loss information; the loss function relational expression is:

$L = -\log \dfrac{\exp(\mathrm{sim}(V, T_v)/\tau)}{\sum_{j=1}^{N_1} \exp(\mathrm{sim}(V, t_j)/\tau)} - \log \dfrac{\exp(\mathrm{sim}(A, T_a)/\tau)}{\sum_{j=1}^{N_1} \exp(\mathrm{sim}(A, t_j)/\tau)} - \log \dfrac{\exp(\mathrm{sim}(AV, T_{av})/\tau)}{\sum_{j=1}^{N_1} \exp(\mathrm{sim}(AV, t_j)/\tau)}$

where $L$ is the loss function, $N_1$ is the total number of negative samples, $t_j$ is the linguistic feature of the j-th text sample in the negative sample dataset, $\mathrm{sim}$ denotes a similarity function, $\exp$ denotes the exponential function, $\tau$ is a temperature parameter, $V$ is the visual interaction feature, $AV$ is the audiovisual interaction feature, $A$ is the audio feature, $T_v$ is the visual-text action tag feature, $T_a$ is the auditory-text action tag feature, and $T_{av}$ is the audiovisual-text action tag feature.
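A sketch of this loss under the reconstructed relational expression above; the mean pooling, normalization, and default temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def modality_text_loss(feat, pos_tag, neg_tags, tau=0.07):
    """One InfoNCE-style term: pull a modality feature toward its matching
    text action tag feature and away from the N1 negative text samples.
    feat, pos_tag: (batch, seq, dim); neg_tags: (N1, dim)."""
    q = F.normalize(feat.mean(dim=1), dim=-1)
    pos = F.normalize(pos_tag.mean(dim=1), dim=-1)
    negs = F.normalize(neg_tags, dim=-1)
    pos_sim = torch.exp((q * pos).sum(dim=-1) / tau)     # sim with the positive tag
    neg_sim = torch.exp(q @ negs.t() / tau).sum(dim=-1)  # sims with the negatives
    return -torch.log(pos_sim / (pos_sim + neg_sim)).mean()

def total_loss(vis, av, aud, vis_tag, aud_tag, av_tag, neg_tags, tau=0.07):
    # Visual-text + auditory-text + audiovisual-text loss terms.
    return (modality_text_loss(vis, vis_tag, neg_tags, tau)
            + modality_text_loss(aud, aud_tag, neg_tags, tau)
            + modality_text_loss(av, av_tag, neg_tags, tau))
```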
The second aspect of the present invention provides an action recognition method, including:
training in advance by using the action recognition model training method according to any one of the previous claims to obtain an action recognition model;
Acquiring a video to be identified;
And inputting the video to be identified into the motion identification model to obtain motion identification information.
In a first exemplary embodiment, the video to be identified is a musical instrument playing video, and the inputting the video to be identified to the motion recognition model to obtain motion recognition information includes:
inputting the musical instrument playing video to the action recognition model;
the action recognition model extracts visual characteristics by utilizing a target visual language model, and extracts musical instrument playing audio characteristics by utilizing a target acoustic model; the visual features include instrument visual features, left-hand visual features, right-hand visual features, player visual features, image frames and video frames;
The multi-mode interaction network module is used for carrying out interaction processing on each visual characteristic in a mode of respectively carrying out interaction with left and right hands, players, current image frames and video frames by taking musical instruments as centers, so as to obtain visual interaction characteristics; performing cross-modal audio-visual interaction processing on the visual interaction characteristics and the musical instrument playing audio characteristics to obtain audio-visual interaction characteristics;
and determining hand motion information of a player during playing the musical instrument according to the output of the motion recognition model.
A third aspect of the present invention provides an action recognition model training apparatus, comprising:
the training sample acquisition module is used for acquiring a video sample data set carrying action labels and audio data;
A data processing module for inputting video samples in the video sample dataset into the motion recognition model; the action recognition model comprises a target visual language model, a target acoustic model and a multi-modal interaction network module; extracting visual features and text semantic features by using the target visual language model, extracting audio features by using the target acoustic model, performing visual interaction and audio-visual interaction on the visual features and the audio features by using the multi-modal interaction network module according to a sound source party, an action source party and a feature source party, and adding at least one interaction feature to the text semantic features to obtain multi-modal action tag features added with at least one of acoustic information, visual information and audio-visual information;
And the iteration updating module is used for carrying out iteration updating on the action recognition model according to the visual interaction characteristics, the audio-visual interaction characteristics, the loss information between the audio characteristics and the multi-mode action tag characteristics until the preset model training ending condition is met, so as to obtain the action recognition model for executing the action recognition task.
A fourth aspect of the present invention provides an action recognition device, comprising:
the model training module is used for training in advance by utilizing the action recognition model training method according to any one of the previous claims to obtain an action recognition model;
the video acquisition module is used for acquiring a video to be identified;
and the action recognition module is used for inputting the video to be recognized into the action recognition model to obtain action recognition information.
The invention also provides an electronic device comprising a processor for implementing the steps of the action recognition and model training method according to any of the preceding claims when executing a computer program stored in a memory.
The invention finally provides a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the action recognition and model training method of any of the preceding claims.
The technical scheme provided by the invention has the following advantages. On the basis of an existing visual language model and acoustic model, the action recognition model is built by adding a multi-modal interaction network module, without changing the model parameters or model structures of the visual language model and the acoustic model; the strong cross-modal understanding and generalization capability between vision and language can thus be transferred to the action recognition task, the original model performance is retained, and the extensibility and flexibility of the model are enhanced, making it suitable for a wider range of multi-modal applications. The multi-modal interaction network module performs both same-modality and cross-modality interaction on the visual and audio features, which not only refines the action features but also resolves the problems of cross-modal information fusion and audio-visual inconsistency; the distinguishability of actions is enhanced, actions that look similar but sound different are further separated, and action recognition accuracy is effectively improved. Furthermore, the action tags are given prompt information at different levels across the three modalities of vision, hearing, and audio-vision, making full use of multi-modal information and comprehensively considering prompt information at multiple levels; this improves the generalization capability and recognition performance of the model, allows the model to understand and describe fine action characteristics more comprehensively, improves the performance and robustness of action recognition, alleviates the slow convergence and the training time and resource consumption of spatio-temporal action recognition tasks, and effectively improves action recognition accuracy and efficiency.
In addition, the invention also provides a corresponding implementation device, electronic equipment and a readable storage medium aiming at the action recognition and the model training method thereof, so that the method has more practicability, and the device, the electronic equipment and the readable storage medium have corresponding advantages.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
For a clearer description of the present invention and its technical solutions, the following briefly introduces the drawings used in the description of the embodiments and the related art. The drawings described below are obviously only some embodiments of the present invention, and other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic flow chart of a training method of an action recognition model provided by the invention;
FIG. 2 is a schematic diagram of a configuration of an exemplary application scenario visual feature-limb visual feature interaction module according to the present invention;
FIG. 3 is a schematic diagram of a training process of multiple interaction modules in an exemplary application scenario provided by the present invention;
fig. 4 is a schematic structural diagram of an audio-visual interaction module in an exemplary application scenario according to the present invention;
FIG. 5 is a schematic structural diagram of an interactive perception prompt module in an exemplary application scenario according to the present invention;
FIG. 6 is a schematic flow chart of a motion recognition method according to the present invention;
FIG. 7 is a schematic diagram of a hardware framework of an exemplary application scenario of the motion recognition method provided by the present invention;
FIG. 8 is a schematic diagram of a motion recognition model in an exemplary application scenario according to the present invention;
FIG. 9 is a block diagram of an embodiment of a training device for motion recognition models provided by the present invention;
FIG. 10 is a block diagram of an embodiment of an action recognition device according to the present invention;
fig. 11 is a block diagram of an embodiment of an electronic device according to the present invention.
Detailed Description
In order to make the technical scheme of the present invention better understood by those skilled in the art, the present invention will be further described in detail with reference to the accompanying drawings and the detailed description. Wherein the terms "first," "second," "third," "fourth," and the like in the description and in the above figures are used for distinguishing between different objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations of the two, are intended to cover a non-exclusive inclusion. The term "exemplary" means "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The action recognition model refers to matching actions or behaviors with specific categories or labels so as to be capable of recognizing and understanding human actions from videos, sensors or other sensing devices, and can be applied to application scenes such as action understanding, action monitoring, human-computer interaction, action generation and the like. Because of the great variety and complexity of human actions, the same action can occur at different speeds, angles and forms, which increases the difficulty of action recognition. In addition, there are some ambiguities in the boundaries between actions, and it is difficult to precisely define the start and end points of each action. And the accuracy of motion recognition can be influenced by different gestures and visual angle changes, and particularly, the influence is more obvious in complex scenes. The existence of these factors leads to the need for a motion recognition model with good motion recognition performance to meet the user's requirement for motion recognition accuracy.
Currently, the motion recognition models used in the related art to perform recognition tasks include the following. ST-GCN (Spatial Temporal Graph Convolutional Network), an action classification model based on graph convolutional neural networks, processes spatio-temporal data by modeling a video sequence as a graph structure: it first builds a graph in which the nodes represent time steps in the video sequence and the edges represent the spatio-temporal relationships between nodes, then performs feature extraction and action classification through a graph convolution network, effectively capturing the spatio-temporal information in an action sequence. I3D (Inflated 3D ConvNet) expands two-dimensional (2D) convolutional neural networks into three-dimensional (3D) convolutional models for video action recognition; it is built by pre-training a 2D convolutional neural network and then inflating its weights into 3D convolutional layers. I3D performs well on smaller datasets while also allowing further fine-tuning on larger datasets. TRN (Temporal Relational Reasoning in Videos) is a temporal-relationship-based model that improves recognition performance by modeling the temporal relationships in an action sequence; it captures long-term dependencies by learning the relative relationships between time steps, treats the incoming video sequence as a graph structure in which each node represents a time step, learns the temporal relationships between nodes through a graph convolution network, and then performs action classification. The TSM (Temporal Shift Module) increases the time interval between features by introducing non-uniform sampling in the time dimension so that the model can better perceive and capture timing information in a video sequence; this non-uniform sampling, a lightweight operation referred to as temporal shift, involves no additional parameters and thus does not increase the model's computation. PA3D (Pose-Action 3D machine) achieves learning efficiency by decomposing semantic tasks, convolution operations, and pose morphology at multiple levels, so that various pose dynamics can be flexibly encoded as discriminative cues for classifying complex actions.
Although these motion recognition models can perform recognition tasks and achieve reasonable results, current models rely either on fully supervised learning with labels or on pre-training plus transfer learning. The former requires a large amount of annotated, high-quality label data, which is difficult to obtain at scale; as a result, existing recognition tasks converge slowly, recognize fine actions poorly, and consume time and resources in training. The latter pre-trains on a large-scale dataset and then fine-tunes on the recognition task; fine-tuning modifies the original pre-trained parameters, which harms extensibility and flexibility, so transfer learning based on pre-training also adapts poorly. In addition, for application scenarios with high real-time requirements, the related-art models cannot meet the need for efficient, rapid recognition. For fine-grained scenarios, actions with only subtle differences must be distinguished, and the related-art models lack this discriminative capability.
In view of the above, the present invention adds a multi-modal interaction network module on the basis of an existing visual language model and acoustic model, and uses this module to perform same-modality and cross-modality interaction on the visual features and audio features. This not only refines the action features and achieves a fine description of actions at the visual level, but also enhances the distinguishability of actions, further separating actions that look similar but sound different, and effectively improves recognition accuracy. By combining features at different levels across the three modalities of vision, hearing, and audio-vision, multi-modal information is fully utilized and prompt information at multiple levels is comprehensively considered, which improves the generalization capability and recognition performance of the action recognition model. The model can understand and describe fine action characteristics more comprehensively without requiring a high-quality training sample set; the performance and robustness of recognition are improved, the model converges faster, recognition efficiency rises, and the approach suits application scenarios with high real-time requirements. In addition, the action recognition model can transfer the strong cross-modal understanding and generalization capability between vision and language to the recognition task without changing the model parameters or structures of the visual language model and the acoustic model, so the original model performance is maintained and the extensibility and flexibility of the model are enhanced, making it suitable for a wider range of multi-modal applications. Having described aspects of the invention, various non-limiting embodiments are described in detail below. Numerous specific details are set forth in the following description to provide a better understanding of the invention; it will be understood by those skilled in the art that the invention may be practiced without these specific details. In other instances, well-known methods, procedures, components, and circuits have not been described in detail so as not to obscure the invention.
Referring to fig. 1, fig. 1 is a flow chart of a motion recognition and model training method provided in this embodiment, where the embodiment may include the following:
s101: a video sample dataset carrying action tags and audio data is acquired.
In this step, the video sample data set is a training sample data set used for training the motion recognition model, which may include a large number of video samples covering a plurality of rich scenes, and the number of video samples may be flexibly determined according to actual requirements, which does not affect the implementation of the present invention. Each video sample is provided with an action tag, the action tag is used for marking the action of a specified target appearing on each frame of image in the video sample, the action tag is text data, the video samples in the video sample data set are all marked with text description actions in advance, and marked information is the action tag. Furthermore, in order to improve the recognition capability of the difference of the fine actions, the difference can be distinguished according to the same action but different angles of the sounds, and based on the recognition capability, the video sample also carries corresponding audio data, and the audio data can be extracted from the video sample by any audio extraction method. Of course, prior to training the motion recognition model, relevant parameters of the model training need to be preset, including but not limited to setting a training period, the number of iterations, a learning rate, an optimizer, a batch size, and a parameter updating method such as a gradient descent algorithm.
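A sketch of how such a dataset might be wrapped for training; the sample layout and the placeholder loaders are assumptions, to be replaced with real video and audio readers (e.g. torchvision/torchaudio).

```python
import torch
from torch.utils.data import Dataset

def load_video_frames(path):  # placeholder loader; swap in a real video reader
    return torch.zeros(16, 3, 224, 224)

def load_audio(path):         # placeholder loader; swap in a real audio reader
    return torch.zeros(1, 16000)

class ActionVideoDataset(Dataset):
    """Each sample pairs video frames with the audio track extracted from
    the same video and the text action tag annotated in advance."""
    def __init__(self, samples):
        self.samples = samples  # list of (video_path, audio_path, action_tag_text)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        video_path, audio_path, tag = self.samples[idx]
        return load_video_frames(video_path), load_audio(audio_path), tag
```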
S102: video samples in the video sample dataset are input into the motion recognition model.
In this step, the motion recognition model is used to perform a spatiotemporal motion recognition task, which can recognize, from the video, a motion or behavior occurring in a specified target in the video, where the specified target includes, but is not limited to, a person, an animal, and other non-living objects, and the corresponding motion or behavior includes, but is not limited to, a limb motion of a person, where the limb motion may be any motion of a human limb, including motion of an arm, a hand, a leg, and a body, motion of an animal, motion behavior of a robot, and motion behavior of a vehicle, which do not affect implementation of the present invention. The action recognition model in the step is based on a target visual language model and a target acoustic model, and a multi-mode interaction network module is added to construct a model frame on the premise that the original target visual language model and the original target acoustic model are not modified at all, namely the model structure of the action recognition model comprises the target visual language model, the target acoustic model and the multi-mode interaction network module.
The target visual language model is used for extracting visual features and text semantic features, inputting the visual features and the text semantic features into the multi-modal interaction network module correspondingly, and realizing a model of cross-modal vision and language for simultaneously understanding and fusing visual and language information. The target visual Language model may employ any of the VLP (Vision-Language Pretraining, visual Language pre-training) models in the related art, including, but not limited to, viLBERT (Vision and Language bert (Bidirectional Encoder Representations from Transformers, bi-directional encoder characterization from transducer), visual Language bert) model, LXMERT (Language-Visual Multimodal bert, cross-modal bert framework for visual and Language understanding), UNITER (Universal Image-Text Representation, universal Image text representation), CLIP (Contrastive Language-IMAGE PRETRAINING, comparative Language-Image pre-training) model, OSCAR (Object-SEMANTICS ALIGNED PRETRAINING, object semantic alignment pre-training) model. The target acoustic model is used for converting the audio data of the video sample into text representation, and the voice recognition task is realized. The target acoustic model extracts features of the speech signal of the audio data, converts the original high-dimensional speech signal into a lower-dimensional, more meaningful, and more amenable representation, and provides useful information and modeling capabilities for speech recognition tasks. The present invention may employ any acoustic model capable of performing speech recognition tasks, including but not limited to MFCC (Mel Frequency Cepstrum Coefficient, mel-frequency cepstral coefficient), tasNet (Time-domain Audio Separation Network ).
In the step, after a corresponding number of video samples are selected from a video sample dataset according to preset training parameters and input into a motion recognition model, the motion recognition model extracts visual features and text semantic features by using a target visual language model, extracts audio features by using a target acoustic model, inputs the visual features, the text semantic features and the audio features into a multi-modal interaction network, and performs visual interaction and audio interaction on the visual features and the audio features by using a multi-modal interaction network module according to a sound source side, a motion source side and a feature source side to obtain visual interaction features and audio-visual interaction features. The interaction is a method of integrating features of both interaction parties by learning an intrinsic relation between both interaction parties. The audio-visual interaction refers to the interaction processing of visual features and auditory features, and the visual interaction refers to the interaction processing of different types of visual features. The sound source is an audio data source, for example, the video sample is a playing video of a musical instrument, and the sound source is the musical instrument; the action source side refers to a part which gives out actions and a main body which gives out actions in a video sample, taking a musical instrument playing video or a dance video as an example, the action source side refers to a hand and a player, the dance video as an example, the action source side refers to limbs and dancers, the characteristic source side refers to data from which characteristics for interaction are generated currently, such as visual characteristics are from the video sample and a certain frame of image, and audio characteristics are from the video sample and audio data. According to the method, the obtained interaction characteristics are used as prompt information to be added into text semantic characteristics, at least one interaction characteristic is added into the text semantic characteristics, and multi-mode action tag characteristics added with at least one of auditory information, visual information and audiovisual information are obtained, so that the text semantic characteristics can be combined with the characteristics of multiple modes, and the text semantic characteristics corresponding to the action tags added with the interaction characteristics are defined as the multi-mode action tag characteristics for convenience of description.
S103: and according to the loss information among the visual interaction features, the audio-visual interaction features, the audio features and the multi-mode action tag features, iteratively updating the action recognition model until the preset model training ending condition is met, and obtaining the action recognition model for executing the action recognition task.
After the visual interaction features, audiovisual interaction features, and audio features corresponding to the video sample are obtained in the previous step, they are compared with the multi-modal action tag features corresponding to the action tag of the video sample, and the model parameters of the action recognition model are updated by continuously reducing the differences between them; for example, the model may be trained with mini-batch stochastic gradient descent until a preset training end condition is reached. The end condition may be, for example, that the number of iterations reaches a preset value, that the model converges, or that the model's precision reaches a preset threshold, none of which affects the implementation of the method. Before the gradient update iterations begin, the gradient descent algorithm must be initialized by setting the epoch (training period), batch_size (batch size), weight update period t, and number of iterations. For example, the video sample dataset may contain 60,000 video samples in total, and the model is trained for at least 100 training periods; one training period means that all training samples in the training set are each used once, without repetition, to update the model parameters of the neural network, taking one batch of data at a time. During the gradient update iterations, 500 video samples are used per update; these 500 samples constitute one batch, i.e., batch_size samples. The number of iterations is how many batch_size-sample training steps complete one epoch: iterations = 60000 / 500 = 120. The weight update period means that during training the weights are updated once every t iterations. When the preset training end condition is reached, the model is the trained action recognition model and can execute any fine-grained recognition task requiring accurate prediction on an input video sample. The trained model makes full use of visual, audiovisual, and auditory multi-modal information and comprehensively considers prompt information at multiple levels, improving its generalization capability and recognition performance; it thus understands and describes fine action characteristics more comprehensively, transfers the capability of the visual language model to the spatio-temporal action recognition task while preserving the original model performance, and enhances the extensibility and flexibility of the model, making it suitable for a wider range of multi-modal applications.
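The schedule arithmetic in this step can be checked with a short sketch; the value of the weight update period t is left open by the text, so the one below is an assumption.

```python
num_samples = 60_000      # total video samples in the dataset
batch_size = 500          # samples per batch
epochs = 100              # at least 100 training periods
weight_update_period = 4  # assumed value for t; the text leaves it open

iterations_per_epoch = num_samples // batch_size  # 60000 / 500 = 120

for epoch in range(epochs):
    for it in range(iterations_per_epoch):
        # batch = next(loader); loss = model(batch); loss.backward()
        if (it + 1) % weight_update_period == 0:
            pass  # optimizer.step(); optimizer.zero_grad()
```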
In the technical scheme provided by this embodiment, the action recognition model is built by adding a multi-modal interaction network module on the basis of an existing visual language model and acoustic model, without changing the model parameters or model structures of the visual language model and the acoustic model; the strong cross-modal understanding and generalization capability between vision and language can thus be transferred to the action recognition task, the original model performance is retained, and the extensibility and flexibility of the model are enhanced, making it suitable for a wider range of multi-modal applications. The multi-modal interaction network module performs both same-modality and cross-modality interaction on the visual and audio features, which not only refines the action features but also resolves the problems of cross-modal information fusion and audio-visual inconsistency; the distinguishability of actions is enhanced, actions that look similar but sound different are further separated, and recognition accuracy is effectively improved. Furthermore, the action tags are given prompt information at different levels across the three modalities of vision, hearing, and audio-vision, making full use of multi-modal information and comprehensively considering prompt information at multiple levels; this improves the generalization capability and recognition performance of the model, allows the model to understand and describe fine action characteristics more comprehensively, improves the performance and robustness of recognition, alleviates the slow convergence and the training time and resource consumption of spatio-temporal action recognition tasks, and effectively improves recognition accuracy and efficiency.
In the above embodiment, how to perform the visual interaction and the audio-visual interaction according to the sound source side, the action source side, and the feature source side is not limited in any way, the visual features of the embodiment include the visual features of the sound source side and the visual features of the action source side, the multimodal interaction network module includes the visual interaction module and the audio-visual interaction module, and an exemplary feature interaction manner may include the following:
The visual interaction module is utilized to carry out visual interaction processing on the visual characteristics of the sound source side and the visual characteristics of the action source side corresponding to the characteristic source side respectively, so as to obtain visual interaction characteristics; and performing cross-modal audio-visual interaction processing on the visual interaction characteristics and the audio characteristics by utilizing the audio-visual interaction module to obtain audio-visual interaction characteristics.
In this embodiment, the visual interaction module interacts with the action source side and the feature source side using the sound source side as the core, and then combines the interaction features to obtain the visual interaction features. Alternatively, the features corresponding to the sound source side and the action source side can be interacted first, and the resulting interaction feature then interacted with the features corresponding to the feature source side to obtain the final visual interaction feature; or the sound source side can first interact with the features corresponding to the feature source side, and the resulting interaction feature then interact with the features corresponding to the action source side. The visual features reflect the action characteristics of the action-performing subject, while the audio features reflect the sound of the sound producer. To improve action understanding, the visual interaction module can send the obtained visual interaction features to the audio-visual interaction module, which improves the understanding of the action from the aspect of sound by performing cross-modal interaction between visual and auditory information. Of course, for the audio-visual interaction processing, the visual features corresponding to the sound source side, the action source side, and the feature source side can also each be interacted with the audio features, and the interaction features then combined as the audio-visual interaction features.
In the present application, the visual features are extracted using a target visual language model, which may include an image encoder and a target detector, such as a regional convolutional neural network or any one-stage or two-stage target detector. The process of extracting the visual features with the target visual language model may include: acquiring a sound-source-side image block, limb image blocks and a target object image block from the video sample with the target detector; and inputting the sound-source-side image block, the limb image blocks, the target object image block, the current image frame and the video sample into the image encoder to obtain the sound-source-side visual features, the limb visual features, the target object visual features, the current image frame features and the video frame features. For example, the video sample is input to the target visual language model; the target detector acquires the positions of the instrument, the left hand, the right hand and the person in the current image frame or adjacent image frames, and crops out the image blocks at the corresponding positions, which for convenience of description may be denoted as the instrument block, the left-hand block, the right-hand block and the player block. The image encoder then extracts features from the instrument block, the left-hand block, the right-hand block, the player image, the current frame and all video frames, and the results are recorded as the instrument visual feature Feat_inst, the left-hand visual feature Feat_left, the right-hand visual feature Feat_right, the player visual feature Feat_player, the current image frame feature Feat_curr-frm and the video frame feature Feat_video-frm. The audio feature Feat_audio is the result of the target acoustic model extracting the audio data, which may be expressed as Feat_audio = TargetAcousticModel(audio_data).
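As an illustrative sketch only, the extraction step above could be organized as follows in PyTorch-style code, where detector.detect_regions, image_encoder and the region names are hypothetical placeholders rather than interfaces defined by this application:

```python
import torch

def extract_visual_features(video, detector, image_encoder):
    """video: (T, 3, H, W) frames of one sample; returns a feature dict."""
    feats = {}
    # Hypothetical detector call returning cropped tiles for the instrument,
    # left hand, right hand and player regions.
    tiles = detector.detect_regions(video)
    for name, tile in tiles.items():
        feats[name] = image_encoder(tile)            # per-tile visual features
    feats["curr_frame"] = image_encoder(video[-1:])  # current image frame feature
    feats["video"] = image_encoder(video)            # all-frame video feature
    return feats
```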
As can be seen from the above, this embodiment interacts visual features from different angles and then interacts the visual interaction features with the audio features to obtain the audio-visual interaction features. This not only completes a fine description of the action at the visual level, but also assists the model in associating and matching visual information with auditory information, thereby enhancing the distinguishability of actions and improving the ability to resolve them.
The above embodiment does not limit how the various types of visual features interact with one another. On the basis of the above embodiment, this embodiment further provides an exemplary interaction implementation for the different types of visual features, which may include the following:
The action-source-side visual features may include limb visual features and target object visual features, the target object visual feature being the visual feature of the subject to which the limb belongs; for example, if the limb visual feature is a left-hand feature, the target object visual feature is the visual feature of the person to whom the left hand belongs. An exemplary manner of obtaining the visual interaction features may include:
acquiring the sound body-limb interaction features of the sound-source-side visual features and the limb visual features, and carrying out modal fusion on the sound body-limb interaction features of each dimension to obtain the sound body-limb interaction information; encoding the sound body-limb interaction information based on a self-attention mechanism, performing residual connection on the encoded sound body-limb interaction information, and carrying out layer normalization to obtain the sound body-limb interaction enhancement features; acquiring the sound body-limb-target interaction features of the sound body-limb interaction enhancement features and the target object visual features, and carrying out modal fusion on the sound body-limb-target interaction features of each dimension to obtain the sound body-limb-target interaction information; encoding the sound body-limb-target interaction information based on a self-attention mechanism, performing residual connection on the encoded information, and carrying out layer normalization to obtain the sound body-limb-target interaction enhancement features; and acquiring the sound body-limb-target-frame interaction features of the sound body-limb-target interaction enhancement features and the visual features corresponding to the feature source party, and carrying out modal fusion on the sound body-limb-target-frame interaction features of each dimension to obtain the visual interaction features.
For convenience of description, the sound body-limb interaction feature refers to the feature obtained after the sound-source-side visual features and the limb visual features interact; the sound body-limb-target interaction feature refers to the feature obtained after the sound body-limb interaction feature interacts with the target object visual features; and the sound body-limb-target-frame interaction feature refers to the feature obtained after the sound body-limb-target interaction feature interacts with the visual features corresponding to the feature source party. To further improve feature understanding, where the visual features corresponding to the feature source party are divided into the current image frame and the video frames of the video sample, the process of interacting the sound body-limb-target interaction enhancement feature with the feature-source-side visual features may include: acquiring the sound body-limb-target-image frame interaction features of the sound body-limb-target interaction enhancement features and the current image frame, and carrying out modal fusion on the sound body-limb-target-image frame interaction features of each dimension to obtain the sound body-limb-target-image frame interaction information; encoding the sound body-limb-target-image frame interaction information based on a self-attention mechanism, performing residual connection on the encoded information, and carrying out layer normalization to obtain the sound body-limb-target-image frame interaction enhancement features; and acquiring the sound body-limb-target-image frame-video frame interaction features of the sound body-limb-target-image frame interaction enhancement features and the video frames, and carrying out modal fusion on the sound body-limb-target-image frame-video frame interaction features of each dimension to obtain the visual interaction features. Further, for convenience of description, where the action source party has several limb visual features, the limb visual features may be described as comprising a first limb visual feature and a second limb visual feature; the first and second limb visual features may each be processed interactively according to the above steps, and their interaction features are then spliced to serve as the interaction feature of the whole limb with the sound source party.
For example, a first feed-forward neural network may perform feature learning and feature extraction on the cross-modal interaction features corresponding to the first limb visual feature to obtain the first sound body-limb interaction information; a second feed-forward neural network performs feature learning and feature extraction on the cross-modal interaction features corresponding to the second limb visual feature to obtain the second sound body-limb interaction information; and the first and second sound body-limb interaction information are spliced, with a third feed-forward neural network mapping the spliced features to the same dimension as the sound-source-side visual features, yielding the sound body-limb interaction information.
An implementation of carrying out modal fusion on the sound body-limb interaction features of each dimension to obtain the sound body-limb interaction information may further include: calculating the global average value of the sound body-limb interaction features in the target feature dimension; determining the residual coefficients of the sound body-limb interaction features in the corresponding dimensions according to the global average value, and fusing the sound body-limb interaction features with the limb visual features based on the residual coefficients to obtain the interaction fusion features; and carrying out layer normalization on the interaction fusion features and performing feature extraction on the cross-modal interaction features obtained by the layer normalization to obtain the sound body-limb interaction information.
In order to improve the task execution efficiency of the whole action recognition model, a mean-value calculation relation, a residual-coefficient calculation relation and a feature fusion relation may be stored locally in advance. The global average value of the sound body-limb interaction feature in the target feature dimension is calculated by calling the mean-value calculation relation; the residual-coefficient calculation relation is called based on the global average value to determine the residual coefficients of the sound body-limb interaction feature in the corresponding dimensions; and the feature fusion relation is called to fuse the sound body-limb interaction feature with the limb visual feature. The mean-value calculation relation is:

$$g = \frac{1}{d}\sum_{i=1}^{d} Feat_1^{(i)};$$

the residual-coefficient calculation relation is:

$$\alpha = \mathrm{Sigmoid}\big(FC_2(\delta(FC_1(g)))\big);$$

and the feature fusion relation is:

$$Feat_{fusion} = \alpha \odot Feat_1 + Feat_{limb};$$

where $g$ is the global average over the L dimension, $Feat_1^{(i)}$ is the sound body-limb interaction feature in the $i$-th feature dimension, $\alpha$ is the residual coefficient in the L dimension, $\mathrm{Sigmoid}$ is the S-shaped function, $FC_2$ is the second fully connected layer, $\delta$ is the activation function, $FC_1$ is the first fully connected layer, $Feat_{fusion}$ is the interaction fusion feature, $Feat_{limb}$ is the limb visual feature, and $Feat_1$ is the sound body-limb interaction feature.
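Under this per-position reading of the relations above, a minimal PyTorch sketch of the parameter-learning residual fusion could look as follows; the ReLU activation and the fully connected layer sizes are illustrative assumptions, and the layer normalization that follows the fusion is applied outside this module:

```python
import torch
import torch.nn as nn

class ParamResidualFusion(nn.Module):
    """Parameter-learning residual fusion: per-L-dimension coefficients
    learned from the global average of the interaction feature."""
    def __init__(self, seq_len):
        super().__init__()
        self.fc1 = nn.Linear(seq_len, seq_len)  # first fully connected layer
        self.fc2 = nn.Linear(seq_len, seq_len)  # second layer, same in/out dims
        self.act = nn.ReLU()                    # activation function (delta)

    def forward(self, feat_inter, feat_src):
        # feat_inter: (..., L, d) interaction feature; feat_src: fusion source.
        g = feat_inter.mean(dim=-1)                             # global average per L position
        alpha = torch.sigmoid(self.fc2(self.act(self.fc1(g))))  # residual coefficients over L
        return alpha.unsqueeze(-1) * feat_inter + feat_src      # interaction fusion feature
```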
As can be seen from the above, through the interaction between the limb actions and the sound-source-side visual features, this embodiment can capture data in which several limb parts must cooperate simultaneously to complete an action, improving the model's ability to characterize fine actions. The interaction features are further interacted with the target object visual features and the feature-source-side visual features; this interaction from small to large and from space to time yields a comprehensive action characterization, improves the generalization capability of the model, and improves the accuracy of action recognition.
In order to facilitate practical application, the network structure of the visual interaction module may be constructed based on its processing flow for the data; the visual interaction module may include a sound-source-side visual feature-limb visual feature interaction module and a multiple interaction module. The sound-source-side visual feature-limb visual feature interaction module comprises an input layer, a feature processing network structure and an output layer, as shown in fig. 2. The input layer may comprise several sound-source-side visual feature inputs and several limb visual feature inputs; each limb visual feature input may be used for inputting a different limb visual feature, such as the first limb visual feature and the second limb visual feature, and each sound-source-side visual feature input is used for inputting the sound-source-side visual feature. The feature processing network structure may include a feature extraction layer, several groups of sound-source visual feature processing structures and several groups of identical limb feature processing structures, where the number of sound-source visual feature processing structures equals the number of sound-source-side visual feature inputs and the number of limb feature processing structures equals the number of limb visual feature inputs, each processing the limb visual feature of its corresponding input. The network structures of the individual sound-source visual feature processing structures may be the same or different. A sound-source visual feature processing structure may include a self-attention mechanism layer and a layer normalization and residual connection layer; a limb feature processing structure may include a cross-modal attention mechanism layer, a parameter-learning residual fusion module and a feed-forward neural network. The self-attention mechanism layer encodes the sound-source-side visual features, and the layer normalization and residual connection layer accelerates model convergence and further improves generalization. The cross-modal attention mechanism layer acquires the intrinsic relationship features among different interaction features, and the parameter-learning residual fusion module adaptively fuses the modal information: it learns the per-dimension coefficients of the interacted information, obtains residual block features from these coefficients and the interacted information, and then fuses them through a residual structure. The feed-forward neural network extracts deep features from its input. The feature extraction layer combines the output features of the feed-forward neural networks of the limb feature processing structures and converts the feature dimensions, and the output layer outputs the output features of the feature processing network structure.
In order to make the interaction process of the sound-source-side visual feature-limb visual feature interaction module within the visual interaction module on the different visual features of a video sample clearer to those skilled in the art, this embodiment describes the whole visual-feature interaction process taking the video sample as an instrument playing video. For each instrument, a suitable set of features can always be found to describe it. The sound-source-side visual feature-limb visual feature interaction module in this embodiment is an instrument and left-right-hand interaction model: it takes the instrument as the core and establishes relationships with its surroundings from small to large, that is, it interacts with the left and right hands, the player, the current frame and the video frames respectively, so as to obtain the interaction features on the visual level and thus describe the action accurately. The following may be included:
It can be appreciated that, when playing an instrument, some actions must be performed by the left and right hands together; capturing such information and its interaction helps the model comprehensively understand the relationships between the different views and improves the characterization of comprehensive visual information. Based on this, the sound-source-side visual feature-limb visual feature interaction module may include a left-hand interaction module and a right-hand interaction module. The sound-source-side visual feature is the instrument visual feature Feat_inst. The instrument visual feature is encoded with a self-attention mechanism, Feat_sa = SelfAttention(Feat_inst), where SelfAttention is any self-attention mechanism, e.g. MultiHeadAttention (a multi-head attention mechanism); the output dimension of Feat_sa is the same as the input dimension of Feat_inst, and the number of heads may be set to 3. Residual connection and layer normalization are then applied, namely Feat_0 = LayerNorm(Feat_sa + Feat_inst).
The process of interaction between the instrument and the left hand may be as follows: information on the interaction between the instrument and the left hand is acquired through a cross-modal attention mechanism, and the interacted features are obtained through the parameter-learning residual fusion module. First, the left-hand information related to the instrument is acquired using the cross-modal attention mechanism, i.e. Feat_1 = CrossAttention(query = Feat_0, key = Feat_left, value = Feat_left); here the query is the encoded instrument feature Feat_0, and the key and value are the left-hand feature Feat_left. The parameter-learning residual fusion module then adaptively fuses the modal information. The interacted feature Feat_1 has dimensions L×d, where L and d denote the two dimensions of the feature; when interacting with the left hand, the fusion source Feat_src is Feat_left, and when interacting with the right hand it is the corresponding right-hand feature. Based on Feat_1, the global average is calculated and the residual coefficients of the interaction information in each dimension are learned, i.e. α = Sigmoid(FC_2(δ(FC_1(mean(Feat_1))))), where FC_1 is the first fully connected layer, FC_2 is the second fully connected layer with the same input and output dimensions, δ is the activation function and Sigmoid is the S-shaped function. Based on α, interaction feature fusion is carried out, Feat_fusion = α ⊙ Feat_1 + Feat_src. Finally, layer normalization further processes the fused interaction features to obtain the spread-modality feature Feat_2 = LayerNorm(Feat_fusion). The spread-modality feature Feat_2 is further learned and extracted by a feed-forward network (FFN) comprising several fully connected layers and nonlinear functions, so as to better capture the correlation and information between modalities, yielding the feature obtained after the instrument interacts with the left hand, Feat_IL = FFN(Feat_2), where the input Feat_2 and the output Feat_IL have the same dimensions.
According to the above operations, the interaction between the instrument and the left hand is completed. Since the instrument is played by both hands, interacting the instrument with the right hand as well improves cross-information interaction and comprehensive information fusion; the right hand may share the same cross-modal attention mechanism layer as the left hand, and all other right-hand operations are identical to those of the left hand, so the left-hand interaction process may be referred to and is not repeated here. The feature obtained after the interaction between the instrument and the right hand is denoted Feat_IR. The left-hand and right-hand interaction results are spliced, Feat_cat = Concat(Feat_IL, Feat_IR); the dimension at this point is twice that of the input instrument visual feature Feat_inst, so a feed-forward neural network may be used to map Feat_cat back to the same dimension as the input, yielding the final result. Of course, when mapping the features, any other network in the related art capable of implementing the corresponding function may be used, and the present application imposes no limitation. The above steps are carried out N1 times to obtain the feature after the interaction between the instrument and the left and right hands, namely the instrument-hand interaction feature Feat_inter-hand.
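Combining the steps above, one possible PyTorch sketch of the instrument and left-right-hand interaction module is given below. The 3-head setting follows the text, while the layer sizes, the ReLU feed-forward networks, the use of separate (rather than shared) cross-attention layers per hand and the single pass (rather than N1 repetitions) are illustrative assumptions; ParamResidualFusion is the fusion sketch given earlier:

```python
import torch
import torch.nn as nn

class InstrumentHandInteraction(nn.Module):
    def __init__(self, seq_len, d, n_heads=3):  # d assumed divisible by n_heads
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.norm_sa = nn.LayerNorm(d)
        self.cross, self.fuse, self.norm, self.ffn = (nn.ModuleDict() for _ in range(4))
        for hand in ("left", "right"):
            self.cross[hand] = nn.MultiheadAttention(d, n_heads, batch_first=True)
            self.fuse[hand] = ParamResidualFusion(seq_len)
            self.norm[hand] = nn.LayerNorm(d)
            self.ffn[hand] = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.merge = nn.Linear(2 * d, d)  # maps the spliced result back to dim d

    def interact(self, hand, feat_0, feat_hand):
        f1, _ = self.cross[hand](feat_0, feat_hand, feat_hand)  # Feat_1
        f2 = self.norm[hand](self.fuse[hand](f1, feat_hand))    # Feat_2
        return self.ffn[hand](f2)                               # Feat_IL / Feat_IR

    def forward(self, feat_inst, feat_left, feat_right):
        sa, _ = self.self_attn(feat_inst, feat_inst, feat_inst)
        feat_0 = self.norm_sa(sa + feat_inst)                   # residual + layer norm
        il = self.interact("left", feat_0, feat_left)
        ir = self.interact("right", feat_0, feat_right)
        return self.merge(torch.cat([il, ir], dim=-1))          # one instrument-hand pass
```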
In this embodiment, the multiple interaction module is configured to interact the sound body-limb interaction information with the target object visual features and the feature-source-side visual features; interaction at multiple levels helps the model analyze the audio-visual task more carefully and comprehensively. The multiple interaction module may include a first input end, a second input end, an output layer and an interaction processing module, as shown in fig. 3. The first input end is used for inputting the feature that currently needs to interact, which may be defined as Feat_A; the second input end is used for inputting the feature obtained from the previous interaction, which may be defined as Feat_B. For example, when the sound body-limb interaction information interacts with the target object visual features, the first input end receives the target object visual features and the second input end receives the sound body-limb interaction information. The output layer outputs the visual interaction features, and the interaction processing module interacts the features from the two input ends; it may include a self-attention layer, a layer normalization and residual connection layer, a cross-attention mechanism layer, a parameter-learning residual module and a feature extraction layer. By feeding different features through the first and second input ends, the multiple interaction module can realize multiple interactions among the sound body-limb interaction information, the target object visual features and the feature-source-side visual features. For example, the first pass uses the multiple interaction module to interact the sound body-limb interaction information with the target object visual features to obtain the sound body-limb-target interaction information, and the second pass interacts the sound body-limb-target interaction information with the feature-source-side visual features; this may be repeated N2 times until all features have interacted.
The self-attention layer encodes Feat_A with a self-attention mechanism to obtain E_0 = SelfAttention(Feat_A); SelfAttention may be any self-attention mechanism, such as MultiHeadAttention, the output dimension of E_0 is the same as the input dimension of Feat_A, and the number of heads may be set to 6. The layer normalization and residual connection layer processes E_0 into the enhanced feature E_1, i.e. E_1 = LayerNorm(E_0 + Feat_A). The cross-attention mechanism layer uses cross-attention interaction to obtain the interaction feature information related to the enhanced feature E_1, i.e. E_2 = CrossAttention(query = E_1, key = Feat_B, value = Feat_B), where the query vector is the enhanced feature E_1 and the key and value vectors are the interaction feature Feat_B. The parameter-learning residual module adaptively fuses the modal information to obtain E_3; for the parameter-learning residual module, reference may be made to its description in the above embodiment, which is not repeated here. The feature extraction layer further learns and extracts the spread-modality feature E_3 deeply, i.e. E_4 = FFN(E_3); it may employ a feed-forward neural network FFN comprising several fully connected layers and a nonlinear function, with the input E_3 and the output E_4 having the same dimension.
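A matching sketch of the multiple interaction module might then be as follows (6 heads per the text; the ReLU feed-forward network is an assumption, Feat_A and Feat_B are assumed to share the same length L, and the residual source of the fusion follows the left-hand example above):

```python
import torch.nn as nn

class MultipleInteraction(nn.Module):
    def __init__(self, seq_len, d, n_heads=6):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d)
        self.cross_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.fuse = ParamResidualFusion(seq_len)  # fusion sketch given earlier
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, feat_a, feat_b):
        e0, _ = self.self_attn(feat_a, feat_a, feat_a)  # E0 = SelfAttention(Feat_A)
        e1 = self.norm(e0 + feat_a)                     # E1 = LayerNorm(E0 + Feat_A)
        e2, _ = self.cross_attn(e1, feat_b, feat_b)     # E2 = CrossAttention(E1; Feat_B)
        e3 = self.fuse(e2, feat_b)                      # E3: adaptive modal fusion
        return self.ffn(e3)                             # E4 = FFN(E3)
```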
Likewise, taking the video sample as an instrument playing video, the output feature of the sound-source-side visual feature-limb visual feature interaction module is Feat_inter-hand, the target object visual feature is the player visual feature Feat_player, and the feature-source-side visual features include the current image frame feature Feat_curr-frm and the video frame feature Feat_video-frm. To obtain data on how the player handles the instrument and assist the model in capturing the relevant playing style, the player visual feature Feat_player may be input to the first input end and the instrument-hand interaction feature Feat_inter-hand to the second input end; the multiple interaction module then yields the interaction feature Feat_inter-person. To perceive the player's action within the whole image and assist the model in acquiring environmental and similar information, the current frame feature Feat_curr-frm may be input to the first input end and Feat_inter-person to the second input end, and the multiple interaction module interacts them to obtain the interaction feature Feat_inter-frm. To capture the variation between successive frames and the dynamics of the performance, assisting the model in capturing playing dynamics, the video feature Feat_video-frm may be input to the first input end and Feat_inter-frm to the second input end, and the multiple interaction module interacts them to obtain the final visual interaction feature Feat_inter-video.
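Continuing the sketches above, the cascade just described would chain one instrument-hand pass with three multiple-interaction passes; reusing a single MultipleInteraction instance across the three passes (rather than separate instances per pass) is an assumption here:

```python
# L, d and every feat_* tensor are assumed placeholders of shape (B, L, d).
instrument_hand = InstrumentHandInteraction(seq_len=L, d=d)
multi_inter = MultipleInteraction(seq_len=L, d=d)

feat_inter_hand   = instrument_hand(feat_inst, feat_left, feat_right)  # Feat_inter-hand
feat_inter_person = multi_inter(feat_player, feat_inter_hand)          # Feat_inter-person
feat_inter_frm    = multi_inter(feat_curr_frm, feat_inter_person)      # Feat_inter-frm
feat_inter_video  = multi_inter(feat_video_frm, feat_inter_frm)        # Feat_inter-video
```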
As can be seen from the above, in this embodiment, by constructing the network structure of the visual interaction module and through interaction of visual features of multiple layers, the overall action characterization can be obtained, the generalization capability of the model is improved, and the accuracy of action recognition is improved.
The above embodiment does not limit how to obtain the audio-visual interaction feature, and based on the above embodiment, the present embodiment further provides an exemplary cross-modal interaction implementation of the visual interaction feature and the audio feature, which may include the following: performing cross-modal fusion processing on the visual interaction characteristics and the audio characteristics respectively to obtain visual enhancement interaction characteristics and audio enhancement interaction characteristics; splicing the visual enhancement interaction features and the audio enhancement interaction features to obtain audio-visual fusion features; and encoding the audio-visual fusion characteristics based on a self-attention mechanism, and extracting the characteristics of the encoded audio-visual fusion characteristics to obtain audio-visual interaction characteristics.
The visual enhancement interaction feature takes the visual features as the core and adds auditory features while keeping the focus on the visual features; the audio enhancement interaction feature takes the auditory features as the core and adds visual features while keeping the focus on the auditory features. Here the visual features refer to the visual interaction features and the auditory features refer to the audio features. Any method capable of fusing data of different modalities may be used to process the visual interaction features and the audio features, including but not limited to a self-attention mechanism, a cross-attention mechanism and a cross-modal attention mechanism. Taking the cross-attention mechanism as an example, the fusion process of the visual interaction features and the audio features may include: with the visual interaction features as the query vector and the audio features as a set of key and value vectors, performing cross-modal audio-visual interaction with a cross-attention mechanism to obtain the visual enhancement interaction features; and with the audio features as the query vector and the visual interaction features as a set of key and value vectors, performing cross-modal audio-visual interaction with a cross-attention mechanism to obtain the audio enhancement interaction features.
In order to facilitate practical application, the network structure of the audiovisual interaction module may be constructed based on its processing flow for the data. As shown in fig. 4, the audiovisual interaction module may include an input layer, a first cross-attention mechanism layer, a second cross-attention mechanism layer, a self-attention mechanism layer, a feature extraction layer and an output layer. The input layer comprises a visual feature input end and an auditory feature input end; the visual feature input end is used for inputting the visual interaction features and the auditory feature input end for inputting the audio features. The first cross-attention mechanism layer performs cross-modal audio-visual interaction with a cross-attention mechanism, taking the visual interaction features as the query vector and the audio features as a set of key and value vectors, to obtain the visual enhancement interaction features; the second cross-attention mechanism layer performs cross-modal audio-visual interaction with a cross-attention mechanism, taking the audio features as the query vector and the visual interaction features as a set of key and value vectors, to obtain the audio enhancement interaction features. The self-attention mechanism layer splices the audio enhancement interaction features with the visual enhancement interaction features and encodes the result, the feature extraction layer extracts deep features from its input, and the output layer outputs the audio-visual interaction features.
For example, cross-modal fusion is performed on the visual and auditory information. With the visual interaction features as the query and the audio features as the key and value, a cross-attention mechanism enhances the model's focus and alignment on the visual information, finding its relevant part within the auditory information, namely the visual enhancement interaction feature Feat_VA = CrossAttention(query = Feat_visual, key = Feat_audio, value = Feat_audio). Then, with the audio features as the query and the visual interaction features as the key and value, a cross-attention mechanism concentrates the model on the part closely related to the instrument action, strengthening the connection between the features and the action, namely the audio enhancement interaction feature Feat_AV = CrossAttention(query = Feat_audio, key = Feat_visual, value = Feat_visual). The two cross-modal fusion results are spliced to obtain the audio-visual fusion feature Feat_comb = Concat(Feat_VA, Feat_AV). The spliced result is encoded with a self-attention mechanism, making further full use of the multi-modal information and exploring the optimal mode of multi-modal information interaction so as to compensate for incomplete information, namely Feat_video-audio = SelfAttention(Feat_comb). Any self-attention mechanism may be employed here, such as MultiHeadAttention; the output dimension of Feat_video-audio is the same as the input dimension of Feat_comb, and the number of heads may be set to 6. Finally, the features in Feat_video-audio may be further extracted by a multi-layer perceptron to obtain the final result of the audiovisual interaction module, namely the audio-visual interaction feature Feat_visual-auditory = MLP(Feat_video-audio), where the multi-layer perceptron MLP comprises several fully connected layers and an activation function.
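A compact PyTorch sketch of this audiovisual interaction flow could be as follows; the 6-head self-attention follows the text, while the cross-attention layer configuration and the GELU multi-layer perceptron are assumptions:

```python
import torch
import torch.nn as nn

class AudioVisualInteraction(nn.Module):
    def __init__(self, d, n_heads=6):
        super().__init__()
        self.cross_va = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.cross_av = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, feat_visual, feat_audio):
        va, _ = self.cross_va(feat_visual, feat_audio, feat_audio)   # Feat_VA
        av, _ = self.cross_av(feat_audio, feat_visual, feat_visual)  # Feat_AV
        comb = torch.cat([va, av], dim=1)          # Feat_comb: splice along tokens
        enc, _ = self.self_attn(comb, comb, comb)  # Feat_video-audio
        return self.mlp(enc)                       # Feat_visual-auditory
```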
As can be seen from the above, the present embodiment can improve the understanding ability of the motion by performing cross-modal interaction on the visual information and the auditory information, and learning the motion feature and the audio feature at the same time.
The above embodiment does not limit how interaction features are added to the text semantic features. On the basis of the above embodiment, this embodiment further provides an exemplary generation manner of the multi-modal action label features, which may include the following:
In this embodiment, the multi-modal interaction network module may include an interaction perception prompt module, and the interaction perception prompt module is used to learn association relationships between text semantic features and visual interaction features, audio-visual interaction features, and audio features, so as to obtain visual-text semantic features, auditory-text semantic features, and audio-visual-text semantic features; respectively carrying out feature enhancement processing on the visual-text semantic features, the auditory-text semantic features and the audiovisual-text semantic features to obtain visual-text action tag features, auditory-text action tag features and audiovisual-text action tag features; the visual-text action tag feature, the auditory-text action tag feature, and the audiovisual-text action tag feature are taken as a set of multimodal action tag features.
The text semantic features are the text features of the action labels extracted by the text encoder of the target visual language model; all categories to which the action labels belong are encoded by the text encoder. Taking the CLIP Text Encoder as an example, the text semantic features may be expressed as Code = CLIP_Text_Encode(label), where CLIP_Text_Encode denotes the CLIP text encoder and Code = {E_1, E_2, …, E_c}, with c the number of action-label categories and E_i the encoding result of the i-th action label. The association relationship may be learned through any mechanism capable of capturing the intrinsic relationship between interaction features, including but not limited to a self-attention mechanism and a cross-attention mechanism. The visual-text semantic feature is the multi-modal action label generated after features related to the visual interaction features are added to the text semantic features; the auditory-text semantic feature is the multi-modal action label generated after features related to the audio features are added; and the audiovisual-text semantic feature is the multi-modal action label generated after features related to the audio-visual interaction features are added.
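As an illustration only, encoding all action-label categories with the open-source CLIP text encoder could look as follows; the label strings and the model variant are assumptions:

```python
import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

labels = ["playing guitar", "playing piano", "playing violin"]  # illustrative categories
tokens = clip.tokenize(labels).to(device)
with torch.no_grad():
    code = model.encode_text(tokens)  # Code = {E_1, ..., E_c}, shape (c, d)
```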
The above embodiment does not limit how the interactive perception prompt module learns the association relationship and performs the feature enhancement. As shown in fig. 5, the interactive perception prompt module may include a cross-attention mechanism layer, a first residual connection and layer normalization layer, a feed-forward neural network layer and a second residual connection and layer normalization layer connected in sequence. Of course, the feed-forward neural network layer may adopt another network structure that plays the same role, and the invention imposes no limitation. The visual interaction features and the text semantic features are input into the cross-attention mechanism layer, with the text semantic features as the query vector and the visual interaction features as a set of key and value vectors, and cross-modal interaction yields the visual-text interaction features; the visual-text interaction features and the text semantic features are input into the first residual connection and layer normalization layer to obtain the visual-text interaction enhancement features; the visual-text interaction enhancement features are then fed through the feed-forward neural network layer and the second residual connection and layer normalization layer, whose output is taken as the visual-text action label features. Similarly, the audio features and the text semantic features are input into the cross-attention mechanism layer, with the text semantic features as the query vector and the audio features as a set of key and value vectors, and cross-modal interaction yields the auditory-text interaction features; the auditory-text interaction features and the text semantic features are input into the first residual connection and layer normalization layer to obtain the auditory-text interaction enhancement features; these are fed through the feed-forward neural network layer and the second residual connection and layer normalization layer, whose output is taken as the auditory-text action label features. Finally, the audio-visual interaction features and the text semantic features are input into the cross-attention mechanism layer, with the text semantic features as the query vector and the audio-visual interaction features as a set of key and value vectors, and cross-modal interaction yields the audiovisual-text interaction features; the audiovisual-text interaction features and the text semantic features are input into the first residual connection and layer normalization layer to obtain the audiovisual-text interaction enhancement features; and these are fed through the feed-forward neural network layer and the second residual connection and layer normalization layer, whose output is taken as the audiovisual-text action label features.
For example, the text semantic features may be denoted Feat_action-label, and the visual interaction features, the audio features and the audio-visual interaction features may be uniformly denoted as the feature to be interacted, Feat_interaction. With the action label as the query and the interaction feature as the key and value, a cross-attention mechanism guides the extraction of the three modal kinds of information, visual, auditory and audiovisual, through the semantic information of the text and mixes them in, further enriching the text semantics: Feat_label-interaction = CrossAttention(query = Feat_action-label, key = Feat_interaction, value = Feat_interaction). Residual connection and layer normalization are applied, i.e. Feat_enhance = LayerNorm(Feat_label-interaction + Feat_action-label), and the final result is output through the FFN and residual structure, Text_label = LayerNorm(FFN(Feat_enhance) + Feat_enhance). When the feature to be interacted Feat_interaction is the audio feature Feat_audio, auditory information is added to the text semantics by the interactive perception prompt module, whose output is the auditory-text action label feature Text_label-auditory. When Feat_interaction is the visual interaction feature Feat_inter-video, visual information is added, and the output is the visual-text action label feature Text_label-video. When Feat_interaction is the audio-visual interaction feature Feat_visual-auditory, audiovisual information is added, and the output is the audiovisual-text action label feature Text_label-visual-auditory.
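A minimal sketch of this prompt module, under the assumptions of 6 attention heads and a ReLU feed-forward network, could be:

```python
import torch.nn as nn

class InteractionAwarePrompt(nn.Module):
    def __init__(self, d, n_heads=6):
        super().__init__()
        self.cross = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d)  # first residual connection + layer norm
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.norm2 = nn.LayerNorm(d)  # second residual connection + layer norm

    def forward(self, feat_label, feat_interaction):
        # Action-label semantics as query; a visual/audio/audiovisual
        # interaction feature as key and value.
        x, _ = self.cross(feat_label, feat_interaction, feat_interaction)
        x = self.norm1(x + feat_label)
        return self.norm2(self.ffn(x) + x)  # Text_label-* action label feature
```

The same module instance can then produce Text_label-video, Text_label-auditory and Text_label-visual-auditory simply by changing which interaction feature is fed as the key and value.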
As can be seen from the above, in this embodiment, different types of modal information are added to the action tag, and learning efficiency can be improved by combining multi-modal prompt information, so that training efficiency of the action recognition model is improved.
The above embodiment does not make any limitation on how the motion recognition model is iteratively updated, that is, the loss function employed by the motion recognition model. Of course, as a simple implementation, the loss function of the motion recognition model may be directly based on a mean square error or cross entropy error or other related technique. In order to improve the performance of the motion recognition model, this embodiment also provides a determination manner of the loss function of the motion recognition model, which may include the following:
determining visual text loss information according to the visual interaction characteristics and the visual-text action label characteristics; determining hearing text loss information according to the audio characteristics and the hearing-text action tag characteristics; determining audio-visual text loss information according to the audio-visual interaction characteristics and the audio-visual-text action tag characteristics; and determining a loss function of the motion recognition model according to the visual text loss information, the hearing text loss information and the audiovisual text loss information, and iteratively updating the motion recognition model based on the loss function.
The visual text loss information, the auditory text loss information and the audiovisual text loss information may each be computed by contrastive learning. To further improve the training efficiency of the action recognition model, the loss function relation may be stored in advance and called directly with the visual, auditory and audiovisual text loss information to determine the loss function of the action recognition model; the loss function relation may be expressed as:

$$L = -\log \frac{\exp\!\big(\mathrm{sim}(Feat_{inter\text{-}video},\, Text_{label\text{-}video})/\tau\big)}{\sum_{j=1}^{N1} \exp\!\big(\mathrm{sim}(Feat_{inter\text{-}video},\, t_j)/\tau\big)} - \log \frac{\exp\!\big(\mathrm{sim}(Feat_{audio},\, Text_{label\text{-}auditory})/\tau\big)}{\sum_{j=1}^{N1} \exp\!\big(\mathrm{sim}(Feat_{audio},\, t_j)/\tau\big)} - \log \frac{\exp\!\big(\mathrm{sim}(Feat_{visual\text{-}auditory},\, Text_{label\text{-}visual\text{-}auditory})/\tau\big)}{\sum_{j=1}^{N1} \exp\!\big(\mathrm{sim}(Feat_{visual\text{-}auditory},\, t_j)/\tau\big)}$$

where $L$ is the loss function, $N1$ is the total number of negative samples, $t_j$ is the linguistic feature of the $j$-th text sample in the negative sample data set, $\mathrm{sim}$ denotes a similarity function, $\exp$ the exponential function, and $\tau$ a temperature parameter controlling the distribution of the contrastive loss; $Feat_{inter\text{-}video}$ is the visual interaction feature, $Feat_{visual\text{-}auditory}$ the audio-visual interaction feature, $Feat_{audio}$ the audio feature, $Text_{label\text{-}video}$ the visual-text action label feature, $Text_{label\text{-}auditory}$ the auditory-text action label feature, and $Text_{label\text{-}visual\text{-}auditory}$ the audiovisual-text action label feature. For the calculation relations involved in the invention, since the logarithm base is a fixed number it does not affect the model training process; the base may therefore be omitted, and those skilled in the art may select the required base according to the actual situation without affecting the implementation of the invention.
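A sketch of this reconstructed loss in PyTorch could be as follows; cosine similarity, the τ default and keeping the positive pair out of the denominator are assumed choices consistent with the relation above:

```python
import torch
import torch.nn.functional as F

def contrastive_term(feat, pos_text, neg_texts, tau=0.07):
    # feat, pos_text: (d,); neg_texts: (N1, d) negative text features.
    sim_pos = F.cosine_similarity(feat, pos_text, dim=-1) / tau
    sim_neg = F.cosine_similarity(feat.unsqueeze(0), neg_texts, dim=-1) / tau
    return torch.logsumexp(sim_neg, dim=0) - sim_pos  # -log(exp(pos) / sum exp(neg))

def recognition_loss(f_vis, f_aud, f_av, t_vis, t_aud, t_av, neg_texts):
    # Sum of the visual-, auditory- and audiovisual-text contrastive terms.
    return (contrastive_term(f_vis, t_vis, neg_texts)
            + contrastive_term(f_aud, t_aud, neg_texts)
            + contrastive_term(f_av, t_av, neg_texts))
```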
In addition, based on the above embodiment, the present invention also provides an implementation manner of applying the method to an actual application scenario, that is, the present embodiment provides an action recognition method, please refer to fig. 6, and the action recognition process may include the following steps:
s601: training to obtain an action recognition model.
The motion recognition model is trained in advance by using the motion recognition model training method described in any one of the embodiments.
S602: and acquiring the video to be identified.
When a user initiates an identification task, the target to be identified in the video to be identified needs to be set at the same time so as to identify the action of the target.
S603: and inputting the video to be identified into the motion identification model to obtain motion identification information.
The action recognition information may include actions or behaviors performed by the specified target during a certain time or a certain period of time of the video to be recognized.
Aiming at the problems of poor fine-action recognition in the instrument playing process, slow convergence of the spatio-temporal action recognition task and the training time and resources it consumes, an action recognition model capable of recognizing the player's actions in an instrument playing video is obtained by training with the added visual interaction module, the audio-visual interaction module and the interactive perception prompt module combining the visual, auditory and audiovisual levels. Correspondingly, the video to be recognized is an instrument playing video, and inputting it into the action recognition model to obtain the action recognition information includes the following steps:
Inputting the instrument playing video into the action recognition model; the action recognition model extracts the visual features with the target visual language model and the instrument playing audio features with the target acoustic model, the visual features including the instrument visual features, left-hand visual features, right-hand visual features, player visual features, image frames and video frames; the multi-modal interaction network module interacts each visual feature by taking the instrument as the center and interacting it with the left and right hands, the player, the current image frame and the video frames respectively, so as to obtain the visual interaction features; cross-modal audio-visual interaction is performed on the visual interaction features and the instrument playing audio features to obtain the audio-visual interaction features; and the hand action information of the player during the instrument performance is determined according to the output of the action recognition model.
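A hypothetical end-to-end use of the trained model on an instrument playing video might then look as follows; ActionRecognitionModel, load_video and the file names are assumed helpers, not interfaces defined by this application:

```python
import torch

model = ActionRecognitionModel.load("action_recognition.pt")  # hypothetical loader
model.eval()
frames, waveform = load_video("performance.mp4")  # (T, 3, H, W) frames plus audio
with torch.no_grad():
    scores = model(frames, waveform)              # similarity to each action label
print("hand action:", model.labels[scores.argmax(dim=-1)])
```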
In this embodiment, the visual interaction module takes the instrument as the core and interacts it with the left hand, the right hand, the player, the current frame and the video frames respectively, which addresses action feature refinement, context information acquisition, the modeling of the relationship between action and player, and action recognition accuracy, realizing a fine description of actions on the visual level. The audiovisual interaction module performs cross-modal interaction between visual and auditory information, which addresses cross-modal information fusion, audio-visual inconsistency and the enhancement of action distinguishability and recognition accuracy, thereby strengthening the discrimination of actions and improving the ability to resolve them. The interactive perception prompt module endows each action label with prompt information at different levels in the visual, auditory and audiovisual modalities through a learnable prompt strategy, so as to improve the generalization capability and recognition performance of the model. By combining the visual, auditory and audiovisual levels, the interactive perception prompt module can make full use of multi-modal information, comprehensively consider prompt information at multiple levels, and improve the accuracy and robustness of recognizing the actions of an instrument player.
It should be noted that, in the present invention, the steps are not strictly executed sequentially, so long as they conform to the logic sequence, and the steps may be executed simultaneously or according to a certain preset sequence, and fig. 1 and fig. 6 are only schematic, and do not represent only such an execution sequence.
Finally, based on the above technical solution of the present invention, the following description will exemplify some possible application scenarios related to the technical solution of the present invention with reference to fig. 7, and fig. 7 is a schematic diagram of a hardware composition frame to which the action recognition method provided by the present invention is applicable, which may include the following contents:
The hardware component framework may include a first electronic device 71 and a second electronic device 72, the two being connected through a network 73. The first electronic device 71 deploys a processor for executing the action recognition model training method described in any of the above embodiments and transmits the trained action recognition model to the second electronic device 72. The second electronic device 72 deploys an interface providing human-machine interaction and stores the trained action recognition model; upon receiving a video to be recognized, it invokes the action recognition model to perform action recognition on the video to be recognized, outputs the corresponding action recognition result, and may also feed the result back to the corresponding client. If the video to be recognized is an instrument playing video, the action recognition model outputs the hand action information of the player during the instrument performance.
The first electronic device 71 performs all or part of the steps in the training of the action recognition model according to the above embodiment; the constructed action recognition model is shown in fig. 8 and includes a target detector, an image encoder, a target acoustic model, a text encoder, a visual interaction module, an audio-visual interaction module and an interactive perception prompt module combining the three levels of vision, hearing and audiovision. The target detector and the image encoder extract the visual features of the input video sample, the text encoder extracts the text semantic features of the action labels, and the target acoustic model extracts the audio features of the audio data of the video sample; the visual interaction module performs visual interaction on the different types of visual features according to the sound source party, the action source party and the feature source party, and the audio-visual interaction module interacts the audio features with the visual interaction features. When the video sample is an instrument playing video, the target detector and the image encoder extract the instrument visual features, left-hand visual features, right-hand visual features, player visual features, image frames and video frames, and the target acoustic model extracts the instrument playing audio features; taking the instrument as the center, each visual feature is interacted with the left hand, the right hand, the player, the current image frame and the video frames respectively to obtain the visual interaction features, and cross-modal audio-visual interaction is performed on the visual interaction features and the instrument playing audio features to obtain the audio-visual interaction features. The interactive perception prompt module adds at least one interaction feature to the text semantic features to obtain the multi-modal action label features to which at least one of auditory, visual and audiovisual information has been added. The action recognition model is then iteratively updated according to the loss information among the visual interaction features, the audio-visual interaction features, the audio features and the multi-modal action label features until the preset model-training end condition is met.
It should be noted that the above application scenario is only shown for the convenience of understanding the idea and principle of the present invention, and the embodiment of the present invention is not limited in any way. Rather, embodiments of the invention may be applied to any scenario where applicable.
As can be seen from the above, the present embodiment is based on the existing visual language model and acoustic model, and uses the musical instrument as a core for the task of fine motion recognition in the playing process of the musical instrument, to interact with the left and right hands, the player, the current frame, and all frames of the video, respectively, to obtain the motion description on the visual level; meanwhile, in order to further distinguish similar actions but different sounds, audio characteristics are introduced, and an audio-visual interaction module is established; finally, visual, auditory and audio-visual three-layer modal information is introduced into the action label, so that the generalization capability of the model is improved. The visual language pre-training model capability is migrated to the space-time action recognition task, so that the original model performance is maintained, the expansibility and flexibility of the model are enhanced, and the model is suitable for the wider multi-mode application field. The motion recognition model can be used for more comprehensively understanding and describing the motion characteristics of players in the playing process of musical instruments, the performance and the robustness of motion recognition are improved, and the expansibility and the flexibility of the model are enhanced.
The invention also provides corresponding devices for the action recognition model training method and the action recognition method, making the methods more practical; the devices may be described from the perspective of functional modules and from the perspective of hardware separately. The action recognition model training device and the action recognition device of the invention described below, which implement the action recognition model training method and the action recognition method described above, may include or be divided into one or more program modules stored in a storage medium and executed by one or more processors, so as to implement the action recognition model training method and the action recognition method according to the first embodiment of the invention. A program module in this embodiment refers to a series of computer program instruction segments capable of performing a specific function, and is better suited than the program itself to describing the execution of the action recognition model training device and the action recognition device in the storage medium. The following description specifically introduces the functions of each program module of this embodiment; the action recognition model training device and action recognition device described below may be referred to in correspondence with the action recognition model training method and action recognition method described above.
Based on the angles of the functional modules, referring to fig. 9, fig. 9 is a block diagram of an action recognition model training device provided in this embodiment under a specific implementation manner, where the device may include:
The training sample acquisition module 901 is used for acquiring a video sample data set carrying action tags and audio data;
A data processing module 902 for inputting video samples in the video sample dataset into an action recognition model, the action recognition model comprising a target visual language model, a target acoustic model and a multi-modal interaction network module; extracting visual features and text semantic features with the target visual language model, extracting audio features with the target acoustic model, performing visual interaction and audio-visual interaction on the visual features and the audio features with the multi-modal interaction network module according to the sound source party, the action source party and the feature source party, and adding at least one interaction feature to the text semantic features to obtain the multi-modal action label features to which at least one of auditory information, visual information and audiovisual information has been added;
The iteration updating module 903 is configured to iteratively update the motion recognition model according to the loss information among the visual interaction feature, the audio-visual interaction feature, the audio feature, and the multi-mode motion label feature until a preset model training end condition is satisfied, thereby obtaining a motion recognition model for executing a motion recognition task.
Illustratively, in some implementations of this embodiment, the data processing module 902 may be further configured to:
The visual features comprise sound source side visual features and action source side visual features, and the multi-mode interaction network module comprises a visual interaction module and an audio-visual interaction module;
the visual interaction module is utilized to perform visual interaction processing on the sound source side visual features, the visual features corresponding to the feature source side, and the action source side visual features respectively, so as to obtain visual interaction features;
And performing cross-modal audio-visual interaction processing on the visual interaction characteristics and the audio characteristics by utilizing the audio-visual interaction module to obtain audio-visual interaction characteristics.
In some exemplary implementations of the above embodiments, the visual interaction module may be further configured to:
The action source side visual features comprise limb visual features and target object visual features;
Acquiring sound body-limb interaction characteristics of sound source side visual characteristics and limb visual characteristics, and carrying out modal fusion on the sound body-limb interaction characteristics of each dimension to obtain sound body-limb interaction information;
Encoding the sound body-limb interaction information based on a self-attention mechanism, carrying out residual connection on the encoded sound body-limb interaction information, and carrying out layer normalization processing to obtain sound body-limb interaction enhancement characteristics;
acquiring a sound body-limb-target interaction characteristic of the sound body-limb interaction enhancement characteristic and a visual characteristic of a target object, and carrying out modal fusion on the sound body-limb-target interaction characteristic of each dimension to obtain sound body-limb-target interaction information;
Coding the sound body-limb-target interaction information based on a self-attention mechanism, carrying out residual connection on the coded sound body-limb-target interaction information, and carrying out layer normalization processing to obtain sound body-limb-target interaction enhancement characteristics;
and acquiring the sound body-limb-target-frame interaction characteristics of the sound body-limb-target interaction enhancement characteristics and the visual characteristics corresponding to the feature source side, and carrying out modal fusion on the sound body-limb-target-frame interaction characteristics of all dimensions to obtain the visual interaction characteristics.

As an exemplary implementation of the foregoing embodiment, the foregoing visual interaction module may further be configured to:
The visual features corresponding to the feature source side comprise current image frames and video frames corresponding to the video samples;
Acquiring the interaction enhancement characteristics of the sound body-limb-target and the interaction characteristics of the sound body-limb-target-image frame of the current image frame, and carrying out modal fusion on the interaction characteristics of the sound body-limb-target-image frame of each dimension to obtain interaction information of the sound body-limb-target-image frame;
coding the interaction information of the sound body, the limb, the target and the image frame based on a self-attention mechanism, carrying out residual connection on the coded interaction information of the sound body, the limb, the target and the image frame, and carrying out layer normalization processing to obtain interaction enhancement characteristics of the sound body, the limb, the target and the image frame;
and acquiring the interaction enhancement characteristics of the sound body-limb-target-image frames and the interaction characteristics of the sound body-limb-target-image frames-video frames of the video frames, and carrying out modal fusion on the interaction characteristics of the sound body-limb-target-image frames-video frames of all dimensions to obtain visual interaction characteristics.
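To make this chained interaction concrete, here is a minimal PyTorch sketch of the sound body, limb, target object, image frame, and video frame chain. The simple additive fusion, the layer sizes, and the tensor shapes are assumptions of the sketch; the patent's richer modal fusion with residual coefficients is spelled out in the relations further below:

```python
import torch
import torch.nn as nn

class InteractionStep(nn.Module):
    """One link of the chain: fuse a new visual cue into the running
    interaction feature, encode with self-attention, then apply a
    residual connection and layer normalization."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, running: torch.Tensor, cue: torch.Tensor) -> torch.Tensor:
        fused = running + cue                          # stand-in for modal fusion
        encoded, _ = self.attn(fused, fused, fused)    # self-attention encoding
        return self.norm(encoded + fused)              # residual connection + layer norm

dim = 256
steps = nn.ModuleList(InteractionStep(dim) for _ in range(4))
# sound source, limb, target object, current image frame, video frame features
sound, limb, target, frame, video = (torch.randn(2, 16, dim) for _ in range(5))
x = steps[0](sound, limb)    # sound body-limb interaction enhancement
x = steps[1](x, target)      # sound body-limb-target
x = steps[2](x, frame)       # sound body-limb-target-image frame
x = steps[3](x, video)       # sound body-limb-target-image frame-video frame
```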
As an exemplary implementation of the foregoing embodiment, the foregoing data processing module 902 may further be configured to:
the target visual language model comprises an image encoder and a target detector;
acquiring a sound source side image block, a limb image block and a target object image block from a video sample by using a target detector;
and inputting the sound source side image block, the limb image block, the target object image block, the current image, and the video sample into the image encoder to obtain the sound source side visual features, the limb visual features, the target object visual features, the current image frame, and the video frames corresponding to the video sample.
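A hedged sketch of this extraction step follows; `detector` and `encoder` stand in for the target detector and image encoder of the target visual language model, and the box format is an assumption:

```python
from typing import Callable, Dict
import torch

def extract_visual_features(
    frames: torch.Tensor,                          # (T, C, H, W) video sample
    detector: Callable[[torch.Tensor], Dict[str, tuple]],
    encoder: Callable[[torch.Tensor], torch.Tensor],
) -> Dict[str, torch.Tensor]:
    """Crop the detected image blocks, then encode each patch plus the
    current frame and the whole clip with the shared image encoder."""
    boxes = detector(frames)  # e.g. {"sound_source": (x1, y1, x2, y2), "limb": ..., "target": ...}
    feats = {name: encoder(frames[:, :, y1:y2, x1:x2])
             for name, (x1, y1, x2, y2) in boxes.items()}
    feats["current_frame"] = encoder(frames[-1:])  # current image frame feature
    feats["video"] = encoder(frames)               # video frame (clip-level) feature
    return feats
```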
As another exemplary implementation of the above embodiment, the above visual interaction module may further be configured to:
calculating the global average of the sound body-limb interaction features in the target feature dimension;

determining residual coefficients of the sound body-limb interaction features in the corresponding dimensions according to the global average, and fusing the sound body-limb interaction features and the limb visual features based on the residual coefficients to obtain interaction fusion features;

and carrying out layer normalization processing on the interaction fusion features, and carrying out feature extraction on the cross-modal interaction features obtained by the layer normalization processing to obtain the sound body-limb interaction information.
As an exemplary implementation of the foregoing embodiment, the foregoing visual interaction module may further be configured to:
Calling a mean value calculation relation to calculate the global average of the sound body-limb interaction features in the target feature dimension; the mean value calculation relation is:

$$\bar{F}=\frac{1}{L}\sum_{i=1}^{L}F_{i}$$

where $\bar{F}$ is the global average over the L feature dimensions and $F_{i}$ is the sound body-limb interaction feature in the i-th dimension.
As another exemplary implementation of the above embodiment, the above visual interaction module may further be configured to:
Calling a residual coefficient calculation relation to determine, based on the global average, the residual coefficient of the sound body-limb interaction feature in the corresponding dimension; the residual coefficient calculation relation is:

$$\alpha=\sigma\left(W_{2}\,\delta\left(W_{1}\bar{F}\right)\right)$$

where $\alpha$ is the residual coefficient in the L-th dimension, $\sigma$ is the sigmoid (S-shaped) function, $W_{2}$ is the second fully connected layer, $\delta$ is the activation function, $W_{1}$ is the first fully connected layer, and $\bar{F}$ is the global average in the L-th dimension.
As yet another exemplary implementation of the above embodiment, the above visual interaction module may be further configured to:
Calling a feature fusion relation to fuse the sound body-limb interaction feature and the limb visual feature; the feature fusion relation is:

$$F_{\mathrm{fuse}}=\alpha\cdot F_{\mathrm{sl}}+F_{\mathrm{limb}}$$

where $F_{\mathrm{fuse}}$ is the interaction fusion feature, $\alpha$ is the residual coefficient in the L dimension, $F_{\mathrm{limb}}$ is the limb visual feature, and $F_{\mathrm{sl}}$ is the sound body-limb interaction feature.
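Taken together, the three relations form a squeeze-and-excitation style gate. A minimal sketch, assuming the squeeze is taken over the token axis and a reduction ratio of 4 (both assumptions of the sketch):

```python
import torch
import torch.nn as nn

class ResidualModalFusion(nn.Module):
    """Sketch of the three relations above: global average (squeeze),
    two fully connected layers with an activation and a sigmoid gate
    (the residual coefficient), and residual fusion with the limb
    visual feature, followed by layer normalization."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim // reduction)  # first fully connected layer W1
        self.fc2 = nn.Linear(dim // reduction, dim)  # second fully connected layer W2
        self.act = nn.ReLU()                         # activation function delta
        self.norm = nn.LayerNorm(dim)

    def forward(self, interact: torch.Tensor, limb: torch.Tensor) -> torch.Tensor:
        g = interact.mean(dim=1)                                  # global average (mean relation)
        alpha = torch.sigmoid(self.fc2(self.act(self.fc1(g))))   # residual coefficient relation
        fused = alpha.unsqueeze(1) * interact + limb              # feature fusion relation
        return self.norm(fused)   # layer normalization before feature extraction
```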
As some other exemplary implementations of the above embodiments, the above visual interaction module may be further configured to:
the limb visual features include a first limb visual feature and a second limb visual feature;
performing feature learning and feature extraction on cross-modal interaction features corresponding to the first limb visual features by using a first feedforward neural network to obtain first sound body-limb interaction information;
performing feature learning and feature extraction on cross-modal interaction features corresponding to the second limb visual features by using a second feedforward neural network to obtain second sound body-limb interaction information;
And splicing the first sound body-limb interaction information and the second sound body-limb interaction information, and mapping the spliced characteristics to the same dimension of the visual characteristics of the sound source party by utilizing a third feedforward neural network to obtain the sound body-limb interaction information.
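The two-stream splice just described might look as follows; the hidden sizes and the GELU activation are assumptions of the sketch:

```python
import torch
import torch.nn as nn

class TwoStreamFFN(nn.Module):
    """Separate feedforward networks per limb stream, a splice, and a
    third feedforward network mapping the spliced feature back to the
    sound source party's feature dimension."""
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        def ffn(i, o):
            return nn.Sequential(nn.Linear(i, hidden), nn.GELU(), nn.Linear(hidden, o))
        self.ffn1 = ffn(dim, dim)        # first sound body-limb stream
        self.ffn2 = ffn(dim, dim)        # second sound body-limb stream
        self.ffn3 = ffn(2 * dim, dim)    # maps spliced features back to dim

    def forward(self, first_limb: torch.Tensor, second_limb: torch.Tensor) -> torch.Tensor:
        a = self.ffn1(first_limb)                     # first sound body-limb interaction info
        b = self.ffn2(second_limb)                    # second sound body-limb interaction info
        return self.ffn3(torch.cat([a, b], dim=-1))   # spliced and remapped
```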
In other exemplary implementations of the above embodiment, the above audiovisual interaction module may be further configured to:
Performing cross-modal fusion processing on the visual interaction characteristics and the audio characteristics respectively to obtain visual enhancement interaction characteristics and audio enhancement interaction characteristics;
splicing the visual enhancement interaction features and the audio enhancement interaction features to obtain audio-visual fusion features;
And encoding the audio-visual fusion characteristics based on a self-attention mechanism, and extracting the characteristics of the encoded audio-visual fusion characteristics to obtain audio-visual interaction characteristics.
As an exemplary implementation of the above embodiment, the above audiovisual interaction module may further be configured to:
Based on taking visual interaction characteristics as query vectors, taking audio characteristics as a group of key vectors and value vectors, adopting a cross-attention mechanism to perform cross-mode audio-visual interaction processing to obtain visual enhancement interaction characteristics;
Based on the audio feature as the query vector, the visual interaction feature as a group of key vectors and value vectors, cross-modal audio-visual interaction processing is performed by adopting a cross-attention mechanism, and the audio enhancement interaction feature is obtained.
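A hedged sketch of this bidirectional cross-attention followed by splicing and self-attention encoding; splicing along the token axis and the layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class AudioVisualInteraction(nn.Module):
    """Cross-attention in both directions, splice of the two enhanced
    streams, then self-attention encoding and feature extraction."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.v_from_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.GELU())

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        vis_enh, _ = self.v_from_a(visual, audio, audio)   # Q=visual, K/V=audio
        aud_enh, _ = self.a_from_v(audio, visual, visual)  # Q=audio, K/V=visual
        fused = torch.cat([vis_enh, aud_enh], dim=1)       # audio-visual fusion feature
        enc, _ = self.self_attn(fused, fused, fused)       # self-attention encoding
        return self.ffn(enc)                               # audio-visual interaction feature
```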
Illustratively, in some implementations of this embodiment, the data processing module 902 may be further configured to:
The multi-mode interaction network module comprises an interaction perception prompt module, and the interaction perception prompt module is utilized to respectively learn the association relation between text semantic features and visual interaction features, audio-visual interaction features and audio features to obtain visual-text semantic features, auditory-text semantic features and audio-visual-text semantic features; respectively carrying out feature enhancement processing on the visual-text semantic features, the auditory-text semantic features and the audiovisual-text semantic features to obtain visual-text action tag features, auditory-text action tag features and audiovisual-text action tag features;
The visual-text action tag feature, the auditory-text action tag feature, and the audiovisual-text action tag feature are taken as a set of multimodal action tag features.
In some exemplary implementations of the foregoing embodiments, the interaction perception prompt module includes a cross-attention mechanism layer, a first residual connection and layer normalization layer, a feedforward neural network layer, and a second residual connection and layer normalization layer, connected in sequence.
As an exemplary implementation of the foregoing embodiment, the foregoing interaction perception prompt module may further be configured to:
inputting the visual interaction features and the text semantic features into the cross-attention mechanism layer, and performing cross-modal interaction by taking the text semantic features as query vectors and the visual interaction features as a group of key vectors and value vectors, to obtain visual-text interaction features;
inputting the visual-text interaction features and the text semantic features into the first residual connection and layer normalization layer to obtain visual-text interaction enhancement features;
inputting the visual-text interaction enhancement features into the feedforward neural network layer and then the second residual connection and layer normalization layer, and taking the output of the second residual connection and layer normalization layer as the visual-text action tag features;
inputting the audio features and the text semantic features into the cross-attention mechanism layer, and performing cross-modal interaction by taking the text semantic features as query vectors and the audio features as a group of key vectors and value vectors, to obtain auditory-text interaction features;
inputting the auditory-text interaction features and the text semantic features into the first residual connection and layer normalization layer to obtain auditory-text interaction enhancement features;
inputting the auditory-text interaction enhancement features into the feedforward neural network layer and then the second residual connection and layer normalization layer, and taking the output of the second residual connection and layer normalization layer as the auditory-text action tag features;
inputting the audio-visual interaction features and the text semantic features into the cross-attention mechanism layer, and performing cross-modal interaction by taking the text semantic features as query vectors and the audio-visual interaction features as a group of key vectors and value vectors, to obtain audio-visual-text interaction features;
inputting the audio-visual-text interaction features and the text semantic features into the first residual connection and layer normalization layer to obtain audio-visual-text interaction enhancement features;
and inputting the audio-visual-text interaction enhancement features into the feedforward neural network layer and then the second residual connection and layer normalization layer, and taking the output of the second residual connection and layer normalization layer as the audio-visual-text action tag features.
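A sketch of one interaction perception prompt module under the structure just described (one instance per modality; layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class InteractionPerceptionPrompt(nn.Module):
    """Cross-attention layer, first residual connection and layer
    normalization, feedforward layer, second residual connection and
    layer normalization, in the order described above."""
    def __init__(self, dim: int, heads: int = 4, hidden: int = 1024):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, text: torch.Tensor, modal: torch.Tensor) -> torch.Tensor:
        inter, _ = self.cross(text, modal, modal)  # text as query, modality as key/value
        x = self.norm1(inter + text)               # first residual + layer norm
        return self.norm2(self.ffn(x) + x)         # second residual + layer norm

# One module per modality yields the three action tag features, e.g.
# tag_v = prompt_v(text_feat, visual_feat); tag_a = prompt_a(text_feat, audio_feat)
```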
In some exemplary implementations of the foregoing embodiments, the iterative update module 903 may be further configured to:
determining visual text loss information according to the visual interaction features and the visual-text action tag features;
determining hearing text loss information according to the audio features and the auditory-text action tag features;
determining audio-visual text loss information according to the audio-visual interaction characteristics and the audio-visual-text action tag characteristics;
And determining a loss function of the motion recognition model according to the visual text loss information, the hearing text loss information and the audiovisual text loss information, and iteratively updating the motion recognition model based on the loss function.
As an exemplary implementation of the foregoing embodiment, the foregoing iterative updating module 903 may further be configured to:
Calling a loss function relation to determine the loss function of the action recognition model according to the visual text loss information, the hearing text loss information, and the audiovisual text loss information; the loss function relation is:

$$L=-\log\frac{\exp(\mathrm{sim}(f_{v},t_{v})/\tau)}{\sum_{j=1}^{N_{1}}\exp(\mathrm{sim}(f_{v},t_{j})/\tau)}-\log\frac{\exp(\mathrm{sim}(f_{a},t_{a})/\tau)}{\sum_{j=1}^{N_{1}}\exp(\mathrm{sim}(f_{a},t_{j})/\tau)}-\log\frac{\exp(\mathrm{sim}(f_{av},t_{av})/\tau)}{\sum_{j=1}^{N_{1}}\exp(\mathrm{sim}(f_{av},t_{j})/\tau)}$$

where L is the loss function, $N_{1}$ is the total number of negative samples, $t_{j}$ is the linguistic feature of the j-th text sample in the negative sample dataset, sim denotes the similarity function, exp denotes the exponential function, $\tau$ is the temperature parameter, $f_{v}$ is the visual interaction feature, $f_{av}$ is the audio-visual interaction feature, $f_{a}$ is the audio feature, $t_{v}$ is the visual-text action tag feature, $t_{a}$ is the auditory-text action tag feature, and $t_{av}$ is the audio-visual-text action tag feature.
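The relation is a sum of three InfoNCE-style contrastive terms. A minimal sketch, assuming cosine similarity for sim and batched (B, D) features scored against (N1, D) negative text features (shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def modal_text_loss(feat, tag, negatives, tau=0.07):
    """One term of the loss relation: positive pair (modal feature,
    action tag feature) against the negative text features t_j."""
    pos = F.cosine_similarity(feat, tag, dim=-1) / tau            # sim(f, t)/tau
    neg = F.cosine_similarity(
        feat.unsqueeze(1).expand(-1, negatives.size(0), -1),
        negatives.unsqueeze(0).expand(feat.size(0), -1, -1),
        dim=-1,
    ) / tau                                                        # sim(f, t_j)/tau
    return -(pos - torch.logsumexp(neg, dim=1)).mean()

def total_loss(f_v, f_a, f_av, t_v, t_a, t_av, negatives, tau=0.07):
    # visual-text, auditory-text and audio-visual-text terms summed
    return (modal_text_loss(f_v, t_v, negatives, tau)
            + modal_text_loss(f_a, t_a, negatives, tau)
            + modal_text_loss(f_av, t_av, negatives, tau))
```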
From the perspective of functional modules, refer to fig. 10, which is a block diagram of the action recognition apparatus provided in this embodiment in one implementation; the apparatus may include:
A model training module 101, configured to obtain an action recognition model by training in advance with the action recognition model training method described in the foregoing embodiments;
the video acquisition module 102 is used for acquiring a video to be identified;
An action recognition module 103, configured to input the video to be recognized into the action recognition model to obtain action recognition information.
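At inference time the apparatus reduces to a single forward pass. A hypothetical usage sketch (the checkpoint path, the loader, and the model interface are illustrative, not the patent's API):

```python
import torch

def load_video(path: str) -> torch.Tensor:
    """Hypothetical loader returning (T, C, H, W) frames; in practice a
    real decoder such as torchvision.io.read_video could be used."""
    return torch.zeros(16, 3, 224, 224)

model = torch.load("action_recognition_model.pt")  # module 101's trained model (path assumed)
model.eval()
with torch.no_grad():                              # modules 102 and 103 in sequence
    action_info = model(load_video("to_recognize.mp4"))
```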
The functions of each functional module of the action recognition apparatus and the action recognition model training apparatus in this embodiment may be implemented according to the method in the foregoing method embodiment; for the specific implementation process, refer to the relevant description of the foregoing method embodiment, which is not repeated here.
From the above, this embodiment enables the action recognition model to understand and describe fine-grained action features more comprehensively, improves the performance and robustness of action recognition, and further enhances the expansibility and flexibility of the model.
The action recognition model training apparatus and the action recognition apparatus have been described from the perspective of functional modules. Further, the invention also provides an electronic device, described from the perspective of hardware. Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 11, the electronic device comprises a memory 110 for storing a computer program, and a processor 111 for implementing the steps of the action recognition model training method and the action recognition method mentioned in any of the above embodiments when executing the computer program.
The processor 111 may include one or more processing cores, such as a 4-core or 8-core processor; the processor 111 may also be a controller, microcontroller, microprocessor, or other data processing chip. The processor 111 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 111 may also include a main processor and a coprocessor: the main processor, also referred to as a CPU (Central Processing Unit), is a processor for processing data in an awake state; the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 111 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 111 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
The memory 110 may include one or more computer-readable storage media, which may be non-transitory. The memory 110 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, the memory 110 may be an internal storage unit of the electronic device, such as a hard disk of a server. In other embodiments, the memory 110 may also be an external storage device of the electronic device, such as a plug-in hard disk provided on a server, a smart media card (SMC), a secure digital (SD) card, or a flash card. Further, the memory 110 may also include both an internal storage unit and an external storage device of the electronic device. The memory 110 may be used not only to store application software installed in the electronic device and various types of data, such as the code of the program executing the action recognition model training method and the action recognition method, but also to temporarily store data that has been output or is to be output. In this embodiment, the memory 110 is at least used for storing a computer program 1101 that, when loaded and executed by the processor 111, implements the relevant steps of the action recognition model training method and the action recognition method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 110 may further include an operating system 1102 and data 1103, and the storage may be transient or permanent. The operating system 1102 may include Windows, Unix, Linux, and the like. The data 1103 may include, but is not limited to, action recognition model training results, data corresponding to action recognition results, and the like.
In some embodiments, the electronic device may further include a display 112, an input/output interface 113, a communication interface 114 (also called a network interface), a power supply 115, and a communication bus 116. The display 112 and the input/output interface 113, such as a keyboard, belong to the user interface, which may optionally also include standard wired and wireless interfaces. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display, which may also be referred to as a display screen or display unit, is used for displaying information processed in the electronic device and for displaying a visual user interface. The communication interface 114 may illustratively include a wired interface and/or a wireless interface, such as a WI-FI interface or a Bluetooth interface, and is typically used to establish a communication connection between the electronic device and other electronic devices. The communication bus 116 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in fig. 11, but this does not mean that there is only one bus or one type of bus.
Those skilled in the art will appreciate that the configuration shown in fig. 11 is not limiting of the electronic device and may include more or fewer components than shown, for example, may also include sensors 117 to perform various functions.
The functions of each functional module of the electronic device in this embodiment may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the relevant description of the foregoing method embodiment, which is not repeated herein.
From the above, this embodiment enables the action recognition model to understand and describe fine-grained action features more comprehensively, improves the performance and robustness of action recognition, and further enhances the expansibility and flexibility of the model.
It will be appreciated that, if the action recognition model training method and the action recognition method in the above embodiments are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the related art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium, which performs all or part of the steps of the methods of the various embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), an electrically erasable programmable ROM, registers, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a magnetic memory, a removable disk, a CD-ROM, a magnetic disk, or an optical disk.
Based on this, the present invention also provides a readable storage medium storing a computer program which, when executed by a processor, performs the steps of the action recognition model training method and the action recognition method according to any of the above embodiments.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other. For the hardware including the device and the electronic equipment disclosed in the embodiments, the description is relatively simple because the hardware includes the device and the electronic equipment corresponding to the method disclosed in the embodiments, and relevant places refer to the description of the method.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The method, the device, the electronic equipment and the readable storage medium for identifying the actions and training the models provided by the invention are described in detail. The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the method of the present invention and its core ideas. It should be noted that, based on the embodiments of the present invention, all other embodiments obtained by a person skilled in the art without making any inventive effort fall within the scope of protection of the present invention. The present invention is capable of numerous modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are intended to be within the scope of the present invention.
Claims (19)
1. A method of training a motion recognition model, comprising:
Acquiring a video sample data set carrying action tags and audio data;
Inputting video samples in the video sample dataset into an action recognition model; the action recognition model comprises a target visual language model, a target acoustic model and a multi-modal interaction network module, visual characteristics and text semantic characteristics are extracted by utilizing the target visual language model, audio characteristics are extracted by utilizing the target acoustic model, visual interaction and audio-visual interaction are carried out on the visual characteristics and the audio characteristics by utilizing the multi-modal interaction network module according to a sound source party, an action source party and a characteristic source party, and at least one interaction characteristic is added for the text semantic characteristics so as to obtain multi-modal action tag characteristics added with at least one of auditory information, visual information and audio-visual information; the multi-mode interaction network module comprises a visual interaction module, an audio-visual interaction module and an interaction perception prompt module, wherein the visual characteristics comprise sound source side visual characteristics and action source side visual characteristics, and the action source side visual characteristics comprise limb visual characteristics and target object visual characteristics;
according to the visual interaction characteristics, the audio-visual interaction characteristics, the loss information between the audio characteristics and the multi-mode action tag characteristics, iteratively updating the action recognition model until a preset model training ending condition is met, and obtaining an action recognition model for executing an action recognition task;
wherein the visual interaction and the audio-visual interaction are carried out on the visual characteristics and the audio characteristics according to the sound source party, the action source party, and the characteristic source party, comprising the following steps:
The visual interaction module is utilized to carry out visual interaction processing on the visual characteristics of the sound source side, the visual characteristics corresponding to the characteristic source side and the visual characteristics of the action source side respectively, so as to obtain visual interaction characteristics;
performing cross-modal audio-visual interaction processing on the visual interaction characteristics and the audio characteristics by utilizing the audio-visual interaction module to obtain audio-visual interaction characteristics;
The visual interaction characteristic determining process comprises the following steps:
acquiring the sound source side visual characteristics and the sound body-limb interaction characteristics of the limb visual characteristics, and carrying out modal fusion on the sound body-limb interaction characteristics of each dimension to obtain sound body-limb interaction information;
coding the sound body-limb interaction information based on a self-attention mechanism, carrying out residual connection on the coded sound body-limb interaction information, and carrying out layer normalization processing to obtain sound body-limb interaction enhancement characteristics;
Acquiring the sound body-limb-target interaction characteristics of the sound body-limb interaction enhancement characteristics and the visual characteristics of the target object, and carrying out modal fusion on the sound body-limb-target interaction characteristics of each dimension to obtain sound body-limb-target interaction information;
Coding the sound body-limb-target interaction information based on a self-attention mechanism, carrying out residual connection on the coded sound body-limb-target interaction information, and carrying out layer normalization processing to obtain sound body-limb-target interaction enhancement characteristics;
acquiring the sound body-limb-target-frame interaction characteristics of the sound body-limb-target interaction enhancement characteristics and the visual characteristics corresponding to the characteristic source side, and carrying out modal fusion on the sound body-limb-target-frame interaction characteristics of all dimensions to obtain the visual interaction characteristics;
Wherein the determining process of the audio-visual interaction characteristics comprises the following steps:
performing cross-modal fusion processing on the visual interaction features and the audio features respectively to obtain visual enhancement interaction features and audio enhancement interaction features;
splicing the visual enhancement interaction characteristics and the audio enhancement interaction characteristics to obtain audio-visual fusion characteristics;
Coding the audio-visual fusion features based on a self-attention mechanism, and extracting the characteristics of the coded audio-visual fusion features to obtain audio-visual interaction characteristics;
wherein the process of adding at least one interaction feature to the text semantic feature comprises:
respectively learning association relations among the text semantic features, the visual interaction features, the audio-visual interaction features and the audio features by utilizing the interaction perception prompt module to obtain visual-text semantic features, auditory-text semantic features and audio-visual-text semantic features; performing feature enhancement processing on the visual-text semantic features, the auditory-text semantic features and the audiovisual-text semantic features to obtain visual-text action tag features, auditory-text action tag features and audiovisual-text action tag features;
The visual-text action tag feature, the auditory-text action tag feature, the audiovisual-text action tag feature are treated as a set of multimodal action tag features.
2. The method of claim 1, wherein the visual features corresponding to the feature source include a current image frame and a video frame corresponding to the video sample, the acquiring the audio-limb-target-frame interaction feature of the audio-limb-target interaction enhancement feature and the visual features corresponding to the feature source, and performing modal fusion on the audio-limb-target-frame of each dimension to obtain the visual interaction feature, and the method comprises:
Acquiring the interaction enhancement characteristics of the sound body-limb-target and the interaction characteristics of the sound body-limb-target-image frame of the current image frame, and carrying out modal fusion on the interaction characteristics of the sound body-limb-target-image frame of each dimension to obtain interaction information of the sound body-limb-target-image frame;
coding the sound body-limb-target-image frame interaction information based on a self-attention mechanism, carrying out residual connection on the coded sound body-limb-target-image frame interaction information, and carrying out layer normalization processing to obtain sound body-limb-target-image frame interaction enhancement characteristics;
And acquiring the interaction enhancement characteristic of the sound body-limb-target-image frame and the interaction characteristic of the sound body-limb-target-image frame-video frame of the video frame, and carrying out modal fusion on the interaction characteristic of the sound body-limb-target-image frame-video frame of each dimension to obtain the visual interaction characteristic.
3. The method of claim 2, wherein the target visual language model comprises an image encoder and a target detector, and wherein extracting visual features using the target visual language model comprises:
acquiring a sound source side image block, a limb image block and a target object image block from the video sample by utilizing the target detector;
And inputting the sound source side image block, the limb image block, the target object image block, the current image and the video sample into the image encoder to obtain sound source side visual characteristics, limb visual characteristics, target object visual characteristics, current image frames and video frames corresponding to the video sample.
4. The method for training an action recognition model according to claim 1, wherein the performing modal fusion on the sound body-limb interaction characteristics of each dimension to obtain sound body-limb interaction information comprises:
calculating the global average of the sound body-limb interaction features in the target feature dimension;
Determining residual coefficients of the sound body-limb interaction features in corresponding dimensions according to the global average value, and fusing the sound body-limb interaction features and the limb visual features based on the residual coefficients to obtain interaction fusion features;
And carrying out layer normalization processing on the interaction fusion features, and carrying out feature extraction on cross-modal interaction features obtained by the layer normalization processing to obtain the sound body-limb interaction information.
5. The method of claim 4, wherein calculating a global average of the sound body-limb interaction features in a target feature dimension comprises:
calling a mean value calculation relation to calculate the global average of the sound body-limb interaction features in the target feature dimension; the mean value calculation relation is:

$$\bar{F}=\frac{1}{L}\sum_{i=1}^{L}F_{i}$$

where $\bar{F}$ is the global average over the L feature dimensions and $F_{i}$ is the sound body-limb interaction feature in the i-th dimension.
6. The method of claim 4, wherein determining residual coefficients of the sound body-limb interaction feature in the corresponding dimension from the global average comprises:
calling a residual coefficient calculation relation to determine, based on the global average, the residual coefficient of the sound body-limb interaction feature in the corresponding dimension; the residual coefficient calculation relation is:

$$\alpha=\sigma\left(W_{2}\,\delta\left(W_{1}\bar{F}\right)\right)$$

where $\alpha$ is the residual coefficient in the L-th dimension, $\sigma$ is the sigmoid (S-shaped) function, $W_{2}$ is the second fully connected layer, $\delta$ is the activation function, $W_{1}$ is the first fully connected layer, and $\bar{F}$ is the global average in the L-th dimension.
7. The method of claim 4, wherein the fusing the sound body-limb interaction feature and the limb visual feature based on the residual coefficient to obtain an interaction fusion feature comprises:
calling a feature fusion relation to fuse the sound body-limb interaction feature and the limb visual feature; the feature fusion relation is:

$$F_{\mathrm{fuse}}=\alpha\cdot F_{\mathrm{sl}}+F_{\mathrm{limb}}$$

where $F_{\mathrm{fuse}}$ is the interaction fusion feature, $\alpha$ is the residual coefficient in the L dimension, $F_{\mathrm{limb}}$ is the limb visual feature, and $F_{\mathrm{sl}}$ is the sound body-limb interaction feature.
8. The method for training an action recognition model according to claim 4, wherein the limb visual features include a first limb visual feature and a second limb visual feature, and performing feature extraction on the cross-modal interaction features obtained by the layer normalization processing to obtain the sound body-limb interaction information comprises:
performing feature learning and feature extraction on cross-modal interaction features corresponding to the first limb visual features by using a first feedforward neural network to obtain first sound body-limb interaction information;
performing feature learning and feature extraction on cross-modal interaction features corresponding to the second limb visual features by using a second feedforward neural network to obtain second sound body-limb interaction information;
and splicing the first sound body-limb interaction information and the second sound body-limb interaction information, and mapping the spliced characteristics to the same dimensionality of the visual characteristics of the sound source party by utilizing a third feedforward neural network to obtain the sound body-limb interaction information.
9. The method for training an action recognition model according to claim 1, wherein the cross-modal fusion processing is performed on the visual interaction feature and the audio feature, respectively, and the method comprises:
Based on the visual interaction feature serving as a query vector, the audio feature serving as a group of key vectors and value vectors, cross-modal audio-visual interaction processing is performed by adopting a cross-attention mechanism, and visual enhancement interaction features are obtained;
Based on the audio feature as a query vector, the visual interaction feature as a set of key vectors and value vectors, cross-modal audio-visual interaction processing is performed by adopting a cross-attention mechanism, and audio-enhanced interaction features are obtained.
10. The method for training an action recognition model according to claim 1, wherein the interaction perception prompt module comprises a cross-attention mechanism layer, a first residual connection and layer normalization layer, a feedforward neural network layer, and a second residual connection and layer normalization layer, which are connected in sequence.
11. The method of claim 10, wherein said adding at least one interaction feature to the text semantic feature comprises:
inputting the visual interaction features and the text semantic features into the cross-attention mechanism layer, and performing cross-modal interaction by taking the text semantic features as query vectors and the visual interaction features as a group of key vectors and value vectors, to obtain visual-text interaction features;
inputting the visual-text interaction features and the text semantic features into the first residual connection and layer normalization layer to obtain visual-text interaction enhancement features;
inputting the visual-text interaction enhancement features into the feedforward neural network layer and then the second residual connection and layer normalization layer, and taking the output of the second residual connection and layer normalization layer as visual-text action tag features;
inputting the audio features and the text semantic features into the cross-attention mechanism layer, and performing cross-modal interaction by taking the text semantic features as query vectors and the audio features as a group of key vectors and value vectors, to obtain auditory-text interaction features;
inputting the auditory-text interaction features and the text semantic features into the first residual connection and layer normalization layer to obtain auditory-text interaction enhancement features;
inputting the auditory-text interaction enhancement features into the feedforward neural network layer and then the second residual connection and layer normalization layer, and taking the output of the second residual connection and layer normalization layer as auditory-text action tag features;
inputting the audio-visual interaction features and the text semantic features into the cross-attention mechanism layer, and performing cross-modal interaction by taking the text semantic features as query vectors and the audio-visual interaction features as a group of key vectors and value vectors, to obtain audio-visual-text interaction features;
inputting the audio-visual-text interaction features and the text semantic features into the first residual connection and layer normalization layer to obtain audio-visual-text interaction enhancement features;
and inputting the audio-visual-text interaction enhancement features into the feedforward neural network layer and then the second residual connection and layer normalization layer, and taking the output of the second residual connection and layer normalization layer as audio-visual-text action tag features.
12. The method of claim 1, wherein the iteratively updating the action recognition model according to the loss information among the visual interaction features, the audio-visual interaction features, the audio features, and the multi-modal action tag features comprises:
determining visual text loss information according to the visual interaction characteristics and the visual-text action label characteristics;
determining hearing text loss information according to the audio features and the auditory-text action tag features;
determining audio-visual text loss information according to the audio-visual interaction characteristics and the audio-visual-text action tag characteristics;
And determining a loss function of the action recognition model according to the visual text loss information, the hearing text loss information and the audiovisual text loss information, and iteratively updating the action recognition model based on the loss function.
13. The method of claim 12, wherein the determining a loss function of the action recognition model based on the visual text loss information, the auditory text loss information, and the audiovisual text loss information comprises:
calling a loss function relation to determine a loss function of the action recognition model according to the visual text loss information, the hearing text loss information and the audiovisual text loss information; the loss function relationship is:
$$L=-\log\frac{\exp(\mathrm{sim}(f_{v},t_{v})/\tau)}{\sum_{j=1}^{N_{1}}\exp(\mathrm{sim}(f_{v},t_{j})/\tau)}-\log\frac{\exp(\mathrm{sim}(f_{a},t_{a})/\tau)}{\sum_{j=1}^{N_{1}}\exp(\mathrm{sim}(f_{a},t_{j})/\tau)}-\log\frac{\exp(\mathrm{sim}(f_{av},t_{av})/\tau)}{\sum_{j=1}^{N_{1}}\exp(\mathrm{sim}(f_{av},t_{j})/\tau)}$$

where L is the loss function, $N_{1}$ is the total number of negative samples, $t_{j}$ is the linguistic feature of the j-th text sample in the negative sample dataset, sim denotes the similarity function, exp denotes the exponential function, $\tau$ is the temperature parameter, $f_{v}$ is the visual interaction feature, $f_{av}$ is the audio-visual interaction feature, $f_{a}$ is the audio feature, $t_{v}$ is the visual-text action tag feature, $t_{a}$ is the auditory-text action tag feature, and $t_{av}$ is the audio-visual-text action tag feature.
14. An action recognition method, comprising:
training in advance by using the action recognition model training method according to any one of claims 1 to 13 to obtain an action recognition model;
acquiring a video to be recognized;
and inputting the video to be recognized into the action recognition model to obtain action recognition information.
15. The action recognition method according to claim 14, wherein the video to be recognized is a musical instrument playing video, and the inputting the video to be recognized into the action recognition model to obtain action recognition information comprises:
inputting the musical instrument playing video into the action recognition model;
the action recognition model extracts visual characteristics by utilizing a target visual language model, and extracts musical instrument playing audio characteristics by utilizing a target acoustic model; the visual features include instrument visual features, left-hand visual features, right-hand visual features, player visual features, image frames and video frames;
performing, by the multi-modal interaction network module, interaction processing on each visual feature in an instrument-centered manner, interacting respectively with the left and right hands, the player, the current image frame, and the video frames, to obtain visual interaction features; and performing cross-modal audio-visual interaction processing on the visual interaction features and the musical instrument playing audio features to obtain audio-visual interaction features;
and determining hand motion information of the player during playing the musical instrument according to the output of the action recognition model.
16. An action recognition model training device, comprising:
the training sample acquisition module is used for acquiring a video sample data set carrying action labels and audio data;
A data processing module, configured to input video samples in the video sample dataset into the action recognition model; the action recognition model comprises a target visual language model, a target acoustic model, and a multi-modal interaction network module; extract visual features and text semantic features using the target visual language model; extract audio features using the target acoustic model; perform visual interaction and audio-visual interaction on the visual features and the audio features using the multi-modal interaction network module according to a sound source party, an action source party, and a feature source party; and add at least one interaction feature to the text semantic features to obtain multi-modal action tag features carrying at least one of auditory information, visual information, and audio-visual information; the multi-modal interaction network module comprises a visual interaction module, an audio-visual interaction module, and an interaction perception prompt module; the visual features comprise sound source side visual features and action source side visual features, and the action source side visual features comprise limb visual features and target object visual features;
The iteration updating module is used for carrying out iteration updating on the action recognition model according to the visual interaction characteristics, the audio-visual interaction characteristics, the loss information between the audio characteristics and the multi-mode action tag characteristics until a preset model training ending condition is met, so as to obtain an action recognition model for executing an action recognition task;
Wherein the data processing module is further configured to:
The visual interaction module is utilized to carry out visual interaction processing on the visual characteristics of the sound source side, the visual characteristics corresponding to the characteristic source side and the visual characteristics of the action source side respectively, so as to obtain visual interaction characteristics;
performing cross-modal audio-visual interaction processing on the visual interaction characteristics and the audio characteristics by utilizing the audio-visual interaction module to obtain audio-visual interaction characteristics;
The visual interaction characteristic determining process comprises the following steps:
acquiring the sound source side visual characteristics and the sound body-limb interaction characteristics of the limb visual characteristics, and carrying out modal fusion on the sound body-limb interaction characteristics of each dimension to obtain sound body-limb interaction information;
coding the sound body-limb interaction information based on a self-attention mechanism, carrying out residual connection on the coded sound body-limb interaction information, and carrying out layer normalization processing to obtain sound body-limb interaction enhancement characteristics;
Acquiring the sound body-limb-target interaction characteristics of the sound body-limb interaction enhancement characteristics and the visual characteristics of the target object, and carrying out modal fusion on the sound body-limb-target interaction characteristics of each dimension to obtain sound body-limb-target interaction information;
Coding the sound body-limb-target interaction information based on a self-attention mechanism, carrying out residual connection on the coded sound body-limb-target interaction information, and carrying out layer normalization processing to obtain sound body-limb-target interaction enhancement characteristics;
acquiring the sound body-limb-target-frame interaction characteristics of the sound body-limb-target interaction enhancement characteristics and the visual characteristics corresponding to the characteristic source side, and carrying out modal fusion on the sound body-limb-target-frame interaction characteristics of all dimensions to obtain the visual interaction characteristics;
Wherein the determining process of the audio-visual interaction characteristics comprises the following steps:
performing cross-modal fusion processing on the visual interaction features and the audio features respectively to obtain visual enhancement interaction features and audio enhancement interaction features;
splicing the visual enhancement interaction characteristics and the audio enhancement interaction characteristics to obtain audio-visual fusion characteristics;
Coding the audio-visual fusion features based on a self-attention mechanism, and extracting the characteristics of the coded audio-visual fusion features to obtain audio-visual interaction characteristics;
wherein the process of adding at least one interaction feature to the text semantic feature comprises:
respectively learning association relations among the text semantic features, the visual interaction features, the audio-visual interaction features and the audio features by utilizing the interaction perception prompt module to obtain visual-text semantic features, auditory-text semantic features and audio-visual-text semantic features; performing feature enhancement processing on the visual-text semantic features, the auditory-text semantic features and the audiovisual-text semantic features to obtain visual-text action tag features, auditory-text action tag features and audiovisual-text action tag features;
The visual-text action tag feature, the auditory-text action tag feature, the audiovisual-text action tag feature are treated as a set of multimodal action tag features.
17. An action recognition device, comprising:
A model training module, configured to train in advance to obtain an action recognition model by using the action recognition model training method according to any one of claims 1 to 13;
the video acquisition module is used for acquiring a video to be identified;
and the action recognition module is used for inputting the video to be recognized into the action recognition model to obtain action recognition information.
18. An electronic device comprising a processor and a memory, the processor being configured to implement the action recognition model training method of any one of claims 1 to 13 and/or the steps of the action recognition method of claim 14 or 15 when executing a computer program stored in the memory.
19. A readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, implements the steps of the action recognition model training method according to any one of claims 1 to 13 and/or the action recognition method according to claim 14 or 15.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410270243.0A CN117877125B (en) | 2024-03-11 | 2024-03-11 | Action recognition and model training method and device, electronic equipment and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN117877125A CN117877125A (en) | 2024-04-12 |
| CN117877125B true CN117877125B (en) | 2024-06-07 |
Family
ID=90584861
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410270243.0A Active CN117877125B (en) | 2024-03-11 | 2024-03-11 | Action recognition and model training method and device, electronic equipment and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN117877125B (en) |
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN118506456B (en) * | 2024-07-15 | 2024-10-01 | 中国科学技术大学 | Micro-motion recognition method and device based on wavelet transform hybrid enhanced contrast learning |
| CN118969088B (en) * | 2024-08-20 | 2025-06-24 | 微宏慧联医疗科技(无锡)有限公司 | A targeted peptide design method based on multi-task pre-training and transfer learning |
| CN119206679B (en) * | 2024-11-26 | 2025-04-01 | 杭州海康威视数字技术股份有限公司 | Micro-motion recognition method and device, electronic equipment and storage medium |
| CN119785426B (en) * | 2024-12-13 | 2025-12-16 | 四川大学 | A video action recognition method and system based on spatiotemporal interactive Transformer and object interactivity prediction |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112989977A (en) * | 2021-03-03 | 2021-06-18 | 复旦大学 | A method and device for audiovisual event localization based on cross-modal attention mechanism |
| CN114519809A (en) * | 2022-02-14 | 2022-05-20 | 复旦大学 | Audio-visual video analysis device and method based on multi-scale semantic network |
| CN114581749A (en) * | 2022-05-09 | 2022-06-03 | 城云科技(中国)有限公司 | Audio-visual feature fusion target behavior identification method and device and application |
| CN116310975A (en) * | 2023-03-14 | 2023-06-23 | 北京邮电大学 | A Consistent Segment Selection Based Audiovisual Event Localization Method |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11663823B2 (en) * | 2020-08-10 | 2023-05-30 | International Business Machines Corporation | Dual-modality relation networks for audio-visual event localization |
| US12056213B2 (en) * | 2021-07-19 | 2024-08-06 | Mitsubishi Electric Research Laboratories, Inc. | Method and system for scene-aware audio-video representation |
- 2024-03-11: application CN202410270243.0A filed in CN, granted as patent CN117877125B (status: active)
Non-Patent Citations (2)
| Title |
|---|
| Toward a perceptive pretraining framework for Audio-Visual Video Parsing; Wu, Jianning et al.; Information Sciences; 2022-09-30; pp. 897-912 *
| Audio-visual correlated multimodal concept detection; Dian Yujie; Jin Qin; Journal of Computer Research and Development; 2019-05-15 (05); pp. 167-177 *
Also Published As
| Publication number | Publication date |
|---|---|
| CN117877125A (en) | 2024-04-12 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN113421547B (en) | Voice processing method and related equipment | |
| CN117877125B (en) | Action recognition and model training method and device, electronic equipment and storage medium | |
| CN112487182B (en) | Training method of text processing model, text processing method and device | |
| CN110490213B (en) | Image recognition method, device and storage medium | |
| CN114882862B (en) | Voice processing method and related equipment | |
| CN113505193A (en) | Data processing method and related equipment | |
| CN118246537B (en) | Question and answer method, device, equipment and storage medium based on large model | |
| CN116541492A (en) | Data processing method and related equipment | |
| CN112216307A (en) | Speech emotion recognition method and device | |
| CN116432019A (en) | A data processing method and related equipment | |
| CN115292439A (en) | A data processing method and related equipment | |
| CN120183410B (en) | Processing method, equipment and system based on voice instruction | |
| WO2025055581A1 (en) | Speech encoder training method and apparatus, and device, medium and program product | |
| CN117216536A (en) | A method, device and equipment for model training and storage medium | |
| CN119205988A (en) | Image generation method, device, electronic device and medium | |
| CN118093936B (en) | Video tag processing method, device, computer equipment and storage medium | |
| CN118172449B (en) | Video generation method and related device | |
| CN119905092A (en) | Toy interactive control method, device, terminal and medium based on large language model | |
| CN115114477B (en) | Video information processing method, device, computer equipment and storage medium | |
| CN115146645B (en) | Semantic analysis method and related equipment | |
| Alwakid et al. | Transforming Disability Into Ability: An Explainable Vision-to-Voice Image Captioning Framework Using Transformer Models and Edge Computing | |
| CN120804782A (en) | Dynamic emotion interaction system and method based on multi-mode physiological signals and self-adaptive feedback | |
| CN121354549A (en) | Equipment control method and device and household appliance | |
| CN121403384A (en) | Emotion interactive decision-making method and system for humanoid robot | |
| KR20250066821A (en) | Method and device for processing speech |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |