CN114511877A - Behavior recognition method and device, storage medium and terminal - Google Patents
- Publication number
- CN114511877A CN202111677068.XA CN202111677068A
- Authority
- CN
- China
- Prior art keywords
- image
- human body
- behavior
- body part
- behavior recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a behavior recognition method and device, a storage medium and a terminal. The method comprises: acquiring a target image to be recognized; generating a human body image and images of each part of the human body from the target image; inputting the target image, the human body image and the body part images into a pre-trained behavior recognition model, wherein the pre-trained behavior recognition model is trained on behavior images annotated with body part codes; and outputting the body parts of the human body image and the behaviors corresponding to those body parts. Because the behavior recognition model is trained on behavior images annotated with body part codes, the model can fully capture fine-grained knowledge at the body part level while remaining easy to generalize, which improves its recognition performance.
Description
Technical Field
The invention relates to the technical field of computer vision, in particular to a behavior recognition method, a behavior recognition device, a storage medium and a terminal.
Background
Human behavior recognition is one of the important research topics in computer vision. Most research on human behavior recognition is based on video rather than on single images, yet many common human behaviors can be adequately represented by a single image, such as making a phone call, interacting with a computer, or shooting. Even when video of certain actions is available, methods based on static cues are still needed: behaviors such as playing guitar, riding a horse or running have small motion amplitude and motion trajectories that are not discriminative, so single-image methods are used to recognize them.
In the prior art, traditional human behavior recognition is mainly based on RGB video sequences and is strongly affected by factors such as illumination, scene, and camera motion, making the motion of the human body in a sequence difficult to describe accurately. Current behavior recognition also depends heavily on objects and scenes: algorithms prefer to recognize objects and scenes with coarse-grained knowledge. For example, running and riding can be distinguished by their different backgrounds, but accurate classification becomes difficult once the background is removed. Because different data sets define behavior categories differently, the trained models are hard to generalize, which reduces their recognition performance.
Disclosure of Invention
The embodiment of the application provides a behavior identification method, a behavior identification device, a storage medium and a terminal. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
In a first aspect, an embodiment of the present application provides a behavior identification method, where the method includes:
acquiring a target image to be identified;
generating a human body image and images of all parts of a human body according to the target image;
inputting the target image, the human body image and the images of each part of the human body into a pre-trained behavior recognition model, wherein the pre-trained behavior recognition model is trained on behavior images annotated with body part codes;
and outputting the human body part of the human body image and the corresponding behavior of the human body part.
Optionally, generating a human body image and images of each part of the human body according to the target image includes:
inputting the target image into an object detector, and outputting the coordinates of a human body frame in the target image;
intercepting a human body image according to the coordinates of the human body frame, inputting the human body image into a pre-trained posture estimation network, and outputting each part frame of the human body;
and intercepting images of all parts of the human body based on the frames of all parts of the human body.
Optionally, the generating a pre-trained behavior recognition model according to the following steps includes:
collecting human-centered behavior images, and annotating the human-centered behavior images with body part codes to obtain an original behavior recognition data set;
dividing a training set from the original behavior recognition data set, and adjusting the brightness of images in the training set to obtain a model training sample;
performing data enhancement on the model training sample to obtain an enhanced data sample;
establishing a behavior recognition model by adopting a composite backbone network, and performing body part state recognition and feature extraction according to the enhanced data sample and the behavior recognition model to obtain a body part state recognition result and body part features;
calculating a target loss value according to the body part state recognition result and the body part characteristics;
and when the target loss value reaches a preset threshold value, generating a pre-trained behavior recognition model.
Optionally, calculating a target loss value according to the body part state recognition result and the body part feature, including:
calculating a body part motion score according to the body part features;
calculating prediction probability distribution according to the body part state recognition result and the body part action score;
acquiring real probability distribution of the enhanced data sample and real frame coordinates marked in advance;
calculating a first cross entropy loss value according to the prediction probability distribution and the real probability distribution;
calculating a second cross entropy loss value according to the real frame coordinates marked in advance;
a target loss value is calculated based on the first cross entropy loss value and the second cross entropy loss value.
Optionally, when the target loss value reaches the preset threshold, generating a pre-trained behavior recognition model, including:
and when the target loss value does not reach the preset threshold value, performing back propagation on the target loss value to update the parameters of the model, and continuously performing the step of performing data enhancement on the model training sample to obtain an enhanced data sample.
Optionally, annotating the human-centered behavior images with body part codes to obtain an original behavior recognition data set includes:
selecting a preset number of preprocessed images from the collected behavior images according to preset behavior parameters;
calculating coordinate values of the body part in the preprocessed image;
and annotating the human-centered behavior images with body part codes according to the coordinate values of the body parts to obtain an original behavior recognition data set.
Optionally, performing data enhancement on the model training sample to obtain an enhanced data sample, including:
carrying out normalization processing on the model training sample to obtain a normalized training sample;
scaling the normalized training sample into multiple scales by adopting an affine transformation method to obtain a scaled data sample;
and performing rotation enhancement on the scaled data sample by adopting an affine transformation method to obtain an enhanced data sample.
In a second aspect, an embodiment of the present application provides a behavior recognition apparatus, where the apparatus includes:
the image acquisition module is used for acquiring a target image to be identified;
the image generation module is used for generating a human body image and images of all parts of a human body according to the target image;
the image input module is used for inputting the target image, the human body image and the images of each part of the human body into a pre-trained behavior recognition model, wherein the pre-trained behavior recognition model is trained on behavior images annotated with body part codes;
and the behavior output module is used for outputting the human body part of the human body image and the behavior corresponding to the human body part.
In a third aspect, embodiments of the present application provide a computer storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor and to perform the above-mentioned method steps.
In a fourth aspect, an embodiment of the present application provides a terminal, which may include: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the above-mentioned method steps.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
in the embodiment of the application, a behavior recognition device first acquires a target image to be recognized, generates a human body image and images of each part of the human body from the target image, and inputs the target image, the human body image and the body part images into a pre-trained behavior recognition model, wherein the pre-trained behavior recognition model is trained on behavior images annotated with body part codes; finally, the device outputs the body parts of the human body image and the behaviors corresponding to those body parts. Because the behavior recognition model is trained on behavior images annotated with body part codes, the model can fully capture fine-grained knowledge at the body part level while remaining easy to generalize, which improves its recognition performance.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
Fig. 1 is a schematic flowchart of a behavior recognition method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a behavior recognition model training method according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a behavior recognition apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a terminal according to an embodiment of the present application.
Detailed Description
The following description and the drawings sufficiently illustrate specific embodiments of the invention to enable those skilled in the art to practice them.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
In the description of the present invention, it is to be understood that the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific circumstances. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified. "And/or" describes the association relationship of the associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The application provides a behavior recognition method and device, a storage medium and a terminal, which are used to solve the problems in the related art. In the technical scheme provided by the application, the behavior recognition model is trained on behavior images annotated with body part codes, so the model can fully capture fine-grained knowledge at the body part level while remaining easy to generalize, which improves its recognition performance. The following exemplary embodiments describe this in detail.
The behavior recognition method provided by the embodiment of the present application will be described in detail below with reference to fig. 1 to 2. The method may be implemented in dependence on a computer program, executable on a behavior recognition device based on the von neumann architecture. The computer program may be integrated into the application or may run as a separate tool-like application.
Referring to fig. 1, a flow chart of a behavior recognition method is provided in an embodiment of the present application. As shown in fig. 1, the method of the embodiment of the present application may include the following steps:
s101, acquiring a target image to be identified;
the target image to be recognized may be obtained from a test set divided at a model training stage, or may be obtained from an actual application scene.
In a possible implementation manner, when the target image to be recognized is acquired in an actual application scene, the target image to be recognized is an image of any type, any format, and any size, which is not limited in this application embodiment. The user terminal stores at least one image, and can directly acquire one image in the storage space of the user terminal and determine the image as a target image to be identified. The user terminal can also provide an entrance for uploading images, the user uploads an image based on the entrance for uploading images, and the user terminal determines the image uploaded by the user as a target image to be identified. Of course, the target image to be recognized may also be acquired by other manners, which is not limited in the embodiment of the present application.
S102, generating a human body image and images of all parts of a human body according to the target image;
in a possible implementation manner, when generating the human body image and the body part images from the target image, the target image is first input into an object detector, which outputs the coordinates of the human body box in the target image; the human body image is then cropped according to the body box coordinates and input into a pre-trained pose estimation network, which outputs the boxes of the individual body parts; finally, the images of each body part are cropped based on those part boxes, as sketched below.
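For illustration only, the following Python sketch shows how the three model inputs can be assembled from a target image; the `detect_person` and `estimate_part_boxes` callables stand in for the object detector and the pre-trained pose estimation network, and their interfaces (as well as the box format) are assumptions rather than details from the embodiment.

```python
import numpy as np

def crop(image: np.ndarray, box) -> np.ndarray:
    """Crop a region given a box as (x1, y1, x2, y2) pixel coordinates."""
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    return image[max(y1, 0):y2, max(x1, 0):x2]

def prepare_model_inputs(target_image: np.ndarray, detect_person, estimate_part_boxes):
    """Build the three inputs of the behavior recognition model: the full
    target image, the cropped human body image, and the cropped body part
    images. `detect_person` and `estimate_part_boxes` are hypothetical
    callables standing in for the object detector and pose estimation network."""
    body_box = detect_person(target_image)          # (x1, y1, x2, y2) of the person
    body_image = crop(target_image, body_box)       # human body image
    part_boxes = estimate_part_boxes(body_image)    # list of body part boxes
    part_images = [crop(body_image, b) for b in part_boxes]
    return target_image, body_image, part_images
```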
S103, inputting the target image, the human body image and each part image of the human body into a pre-trained behavior recognition model;
the pre-trained behavior recognition model is trained on behavior images annotated with body part codes; it is a mathematical model that can recognize human behavior in an image.
Generally, when performing model training, a related behavior recognition data set based on body part coding is collected and sorted first, then a data analysis and enhancement method designed according to the data set is adopted to form an enhanced behavior recognition data set, and finally a behavior recognition model is trained through the enhanced behavior recognition data set.
In the embodiment of the application, a pre-trained behavior recognition model is generated as follows. First, human-centered behavior images are collected and annotated with body part codes to obtain an original behavior recognition data set. A training set is then divided from the original data set and the brightness of its images is adjusted to obtain model training samples, which are further data-enhanced to obtain enhanced data samples. Next, a behavior recognition model is created using a composite backbone network, and body part state recognition and feature extraction are performed on the enhanced data samples with this model to obtain body part state recognition results and body part features. A target loss value is then calculated from the recognition results and the features, and when the target loss value reaches a preset threshold, the pre-trained behavior recognition model is generated.
In a possible implementation manner, the user terminal obtains a target image to be recognized (see step S101 for details, which are not repeated here). When the user terminal detects the target image, it transmits the image to a server in a wired or wireless manner. The server stores the pre-trained behavior recognition model; after receiving the target image, the server inputs it into the pre-trained behavior recognition model for recognition through an internal program.
And S104, outputting the human body part of the human body image and the behavior corresponding to the human body part.
In one possible implementation, after model processing, the human body part of the human body image and the behavior corresponding to the human body part can be output.
In the embodiment of the application, a behavior recognition device first acquires a target image to be recognized, generates a human body image and images of each part of the human body from the target image, and inputs the target image, the human body image and the body part images into a pre-trained behavior recognition model, wherein the pre-trained behavior recognition model is trained on behavior images annotated with body part codes; finally, the device outputs the body parts of the human body image and the behaviors corresponding to those body parts. Because the behavior recognition model is trained on behavior images annotated with body part codes, the model can fully capture fine-grained knowledge at the body part level while remaining easy to generalize, which improves its recognition performance.
Referring to fig. 2, a flow chart of a behavior recognition model training method is provided in the embodiment of the present application. As shown in fig. 2, the method of the embodiment of the present application may include the following steps:
s201, collecting a behavior image centered by people, and marking a body part code on the behavior image centered by people to obtain an original behavior recognition data set;
in the embodiment of the application, a preset number of preprocessed images are first selected from the collected behavior images according to preset behavior parameters, the coordinate values of the body parts in the preprocessed images are then calculated, and finally the human-centered behavior images are annotated with body part codes according to those coordinate values to obtain the original behavior recognition data set.
In one possible implementation, human-centered behavior images are collected, comprising 3 million pictures with rough activity labels and 15 million pieces of data gathered over the web. For common object interactions and body motions, 100,000 pictures covering 126 behaviors in total are selected from these. Estimation errors are then corrected manually to ensure high-quality annotation. The left and right arm boxes are estimated by taking the minimum and maximum (x, y) coordinates of the wrist, elbow and shoulder joints as the upper-left and lower-right corner points, and the left and right leg boxes likewise from the foot, knee and hip joints; the remaining body part boxes are centered on a joint, with the box size predefined by scaling the distance between the neck and the pelvis joint. Estimates with confidence above 0.7 are considered visible. When the joints concerned cannot be detected, rules based on body knowledge are used: if the neck or pelvis is not visible, the part boxes are configured from the other visible joint groups (head, body, arms, legs); for example, if only the upper body is visible, the hand box size is set to twice the pupillary distance. Co-occurrence relationships between behaviors and body parts are computed with normalized pointwise mutual information over the annotations, and the 76 candidate local states with the highest NPMI values are selected as the body part set. The manually labeled pictures are used as seeds to automatically generate initial body part labels for the remaining pictures, which then only need to be checked. Considering that a person may perform multiple actions, the 10 corresponding body parts are labeled separately for each action and then combined. To ensure quality, each image is labeled twice, finally yielding the original behavior recognition data set.
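A minimal sketch of the joint-to-box rule described above, assuming joints are given as (x, y) pixel coordinates; the example joint positions are hypothetical.

```python
import numpy as np

def box_from_joints(joints: np.ndarray) -> tuple:
    """Estimate a part box from a set of (x, y) joint coordinates by taking
    the minimum coordinates as the top-left corner and the maximum
    coordinates as the bottom-right corner (e.g. wrist/elbow/shoulder for an
    arm, foot/knee/hip for a leg)."""
    xs, ys = joints[:, 0], joints[:, 1]
    return float(xs.min()), float(ys.min()), float(xs.max()), float(ys.max())

# Illustrative joint positions for one arm (hypothetical values).
arm_joints = np.array([[120.0, 80.0],    # shoulder
                       [140.0, 130.0],   # elbow
                       [150.0, 180.0]])  # wrist
print(box_from_joints(arm_joints))  # (120.0, 80.0, 150.0, 180.0)
```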
S202, dividing a training set from the original behavior recognition data set, and adjusting the brightness of images in the training set to obtain a model training sample;
in a possible implementation manner, the original behavior recognition data set is divided into a training set and a test set. The training set has large variations in brightness and contrast, including severely over-exposed images and extremely dark images; because such large differences would affect model training, brightness and contrast enhancement is applied to the training set so that brightness and contrast variation does not disturb training and prediction. Specifically, the contrast is adjusted by expanding or reducing the difference between bright and dark points while keeping the average brightness constant. Since the average brightness must remain constant, the adjustment ratio is applied to the difference between each pixel value and the average brightness, which guarantees that the computed average brightness is unchanged; the model training samples are obtained once the brightness adjustment is complete. The concrete formula is as follows:
Out=Average+(In-Average)*(1+percent)
wherein In is the original pixel brightness, Average is the average brightness of the whole picture, Out is the adjusted brightness of the training sample, and percent is the adjustment amplitude in the range [-1, 1].
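A possible NumPy implementation of this adjustment, assuming an 8-bit image; variable names follow the formula above.

```python
import numpy as np

def adjust_contrast(image: np.ndarray, percent: float) -> np.ndarray:
    """Out = Average + (In - Average) * (1 + percent), with percent in [-1, 1].
    Expanding or shrinking each pixel's difference to the mean changes the
    contrast while the average brightness of the picture stays constant."""
    img = image.astype(np.float32)
    average = img.mean()
    out = average + (img - average) * (1.0 + percent)
    return np.clip(out, 0, 255).astype(np.uint8)
```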
S203, performing data enhancement on the model training sample to obtain an enhanced data sample;
generally, since the training set contains relatively few body part types and behaviors, model training is prone to over-fitting, so data enhancement is performed on the training set; this enlarges the data set and improves the robustness and diversity of the model. The enhancement transforms the source images with methods such as random flipping, multi-scale cropping and normalization.
In the embodiment of the application, when data enhancement is performed, firstly, a model training sample is normalized to obtain a normalized training sample, then, an affine transformation method is adopted to scale the normalized training sample into multiple scales to obtain a scaled data sample, and finally, the scaled data sample is subjected to rotation enhancement by adopting the affine transformation method to obtain an enhanced data sample.
In a possible implementation manner, the model training samples are normalized with the specific formula

x* = (x - min) / (max - min)

where x denotes the input data and x* the normalized output, so that all data lie in [0, 1]; max denotes taking the maximum value and min taking the minimum value. The normalization process increases the convergence speed and improves the accuracy of the model.
Further, for multi-scale cropping, analysis of the model training samples shows that they differ markedly from common behavior recognition data sets: different pictures vary widely in angle, brightness, contrast, target size and other aspects. A multi-scale training strategy is therefore adopted: the data set is scaled to different sizes before being input, which improves the adaptability of the network to recognizing target behaviors of different sizes. The data set is scaled to six scales using the affine transformation method (warpAffine), avoiding the loss of too much source image information. The specific scaling formula with warpAffine is

x' = f_x * x,  y' = f_y * y

where f_x and f_y are the focal lengths (scaling factors) along the x-axis and y-axis respectively, x and y are the input width and height before scaling, and x' and y' are the width and height after scaling.
Further, when the model training samples are randomly flipped and rotated, the warpAffine method is used to apply rotation enhancement to the data set, with the specific formula

x' = x * cos(θ) - y * sin(θ),  y' = x * sin(θ) + y * cos(θ)

where θ is the rotation angle, x and y are the coordinates before rotation, and x' and y' are the coordinates after rotation.
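The three enhancement steps above could be sketched with OpenCV as follows; the six concrete scale factors are assumptions, since the embodiment does not list them.

```python
import cv2
import numpy as np

def normalize(sample: np.ndarray) -> np.ndarray:
    """Min-max normalization x* = (x - min) / (max - min), mapping values to [0, 1]."""
    sample = sample.astype(np.float32)
    return (sample - sample.min()) / (sample.max() - sample.min() + 1e-8)

def scale_affine(image: np.ndarray, fx: float, fy: float) -> np.ndarray:
    """Scale with warpAffine: x' = fx * x, y' = fy * y."""
    h, w = image.shape[:2]
    m = np.float32([[fx, 0, 0], [0, fy, 0]])
    return cv2.warpAffine(image, m, (int(w * fx), int(h * fy)))

def rotate_affine(image: np.ndarray, theta_deg: float) -> np.ndarray:
    """Rotate with warpAffine around the image center by theta degrees."""
    h, w = image.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), theta_deg, 1.0)
    return cv2.warpAffine(image, m, (w, h))

# Six illustrative scale factors (assumed values; the patent does not list them).
scales = [0.5, 0.75, 1.0, 1.25, 1.5, 2.0]
```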
S204, establishing a behavior recognition model by adopting a composite backbone network, and performing body part state recognition and feature extraction according to the enhanced data sample and the behavior recognition model to obtain a body part state recognition result and body part features;
generally, encoding body part features into the backbone network can effectively overcome the bottleneck of behavior recognition algorithms that rely only on instance-level features or knowledge, enabling more accurate behavior recognition, avoiding false detections and missed detections, and improving performance.
In the embodiment of the application, a behavior recognition model is created with a composite backbone network. An enhanced data sample is input into the object detector, which outputs the coordinates of a second human body box in the sample; a second human body image is cropped according to these coordinates and input into the pre-trained pose estimation network, which outputs the boxes of each part of the second human body; the images of each part of the second human body are cropped based on those boxes. Finally, the second target image, the second human body image and the second body part images are input into the behavior recognition model for training, yielding the body part state recognition results and the body part features.
In a possible implementation manner, the behavior recognition model built on the composite backbone network performs body part state recognition and feature extraction on the enhanced data samples with the specific formula

f_Part = R_A2V(I, b_1, ..., b_n, b_o)

where I is the original image (i.e., the target image), b_1, ..., b_n are the n body part boxes of a person, which are generated automatically using a paired body-part attention method for identifying the person's interaction with an object, b_o is the box of the person, f_Part is the body part feature, and R_A2V is the feature representation model that extracts the body part features; it consists of several convolutional layers and fully connected layers, where a convolutional layer can be expressed by the following formula:
y=BatchNorm(Conv2D(x,inchannel,outchannel,ksize,ksize))
where Conv2D(x, inchannel, outchannel, ksize, ksize) is the 2D convolution operation, x is the input of the convolutional layer, inchannel is the number of input channels, outchannel is the number of output channels, ksize is the convolution kernel size, BatchNorm(x) is the normalization layer, and y is the output of the convolutional layer.
Wherein the fully connected layer can be represented by the following formula:
y=Softmax(w*x+b) (1)
where x is the fully-connected input, w is the neuron weight, b is the offset, Softmax is the normalized exponential function, and y is the fully-connected output.
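A PyTorch sketch of the two building blocks just described, i.e. the BatchNorm-wrapped 2D convolution and the Softmax fully connected layer of formula (1); the channel sizes and the number of local states (76, taken from the data set description above) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """y = BatchNorm(Conv2D(x, inchannel, outchannel, ksize, ksize))."""
    def __init__(self, inchannel: int, outchannel: int, ksize: int):
        super().__init__()
        self.conv = nn.Conv2d(inchannel, outchannel, kernel_size=ksize, padding=ksize // 2)
        self.bn = nn.BatchNorm2d(outchannel)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.bn(self.conv(x))

class FCHead(nn.Module):
    """y = Softmax(w * x + b), formula (1)."""
    def __init__(self, in_features: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(in_features, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.fc(x), dim=-1)

# Example shapes (assumed): a 3-channel crop mapped to 64 feature maps,
# then a flattened 64-dimensional feature classified over 76 local states.
block = ConvBlock(3, 64, 3)
head = FCHead(64, 76)
```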
It should be noted that most current behavior understanding models adopt a paradigm similar to object recognition, i.e. a DNN directly learns the mapping from pixels to semantic concepts. However, because of the particularities of behavior understanding, such as semantic concentration, a more severe long-tail distribution, and the continuous variation of human body structure, directly mapping instance-level features to behaviors currently hits a performance bottleneck. Specifically, the information a backbone network extracts from instance features is insufficient and the task is too difficult, so behavior classification may produce false detections and missed detections, resulting in low recognition efficiency. This application therefore designs a composite backbone network that effectively encodes body part features and reasons from pixels to body parts and then to behavior concepts, so that the network can extract rich body part information beyond the limited instance feature information. This improves the accuracy of behavior recognition and classification, avoids false and missed detections, and effectively overcomes the shortcomings of the prior art.
S205, calculating a target loss value according to the body part state recognition result and the body part characteristics;
in the embodiment of the application, the loss value is calculated as follows: the body part action scores are first calculated from the body part features; the prediction probability distribution is then calculated from the body part state recognition results and the body part action scores; the true probability distribution of the enhanced data samples and the pre-labeled ground-truth box coordinates are obtained; a first cross-entropy loss value is calculated from the prediction probability distribution and the true probability distribution; a second cross-entropy loss value is calculated from the pre-labeled ground-truth box coordinates; and finally the target loss value is calculated based on the first and second cross-entropy loss values.
It should be noted that the behavior recognition task is a multi-label classification task, because a body part may be in several states at once; for example, the head can perform the actions of "eating" and "looking" at the same time. A cross-entropy loss function is used as the loss function.
In one possible implementation, after accurate body part features are obtained, behavior inference based on body parts is performed: the behavior is inferred from the semantics of the local body parts. The specific formula is as follows:
S_part = F_PaSta-R(f_Part, f_o)

wherein F_PaSta-R represents the PaSta-R method, f_o is the feature representation of the object, f_Part is the feature representation of the body part, and S_part is the body part action score at the local state level.
The PaSta-R method has three structures. The first is a fully connected combination, whose formula is shown in (1).
The second is a two-layer fully connected combination, with the formula:
y=Softmax(w2*relu(w1*x+b1)+b2)
where x is the fully connected input, w1, w2 are the neuron weights, b1, b2 are the offsets, relu is the linear rectification function, Softmax is the normalized exponential function, and y is the fully connected output.
The third extracts global graph features through a graph convolutional neural network and then outputs the classification result through a fully connected layer.
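As a sketch only, the second PaSta-R structure (the two-layer fully connected combination) might look as follows in PyTorch; concatenating f_Part with f_o before the first layer is an assumption, since the embodiment only states that both features are passed to the PaSta-R method.

```python
import torch
import torch.nn as nn

class TwoLayerPaStaR(nn.Module):
    """Second PaSta-R structure: y = Softmax(w2 * relu(w1 * x + b1) + b2),
    applied here to the concatenation of the body part feature f_Part and
    the object feature f_o (the fusion by concatenation is an assumption)."""
    def __init__(self, part_dim: int, obj_dim: int, hidden: int, num_actions: int):
        super().__init__()
        self.fc1 = nn.Linear(part_dim + obj_dim, hidden)
        self.fc2 = nn.Linear(hidden, num_actions)

    def forward(self, f_part: torch.Tensor, f_obj: torch.Tensor) -> torch.Tensor:
        x = torch.cat([f_part, f_obj], dim=-1)   # fuse part and object features
        return torch.softmax(self.fc2(torch.relu(self.fc1(x))), dim=-1)
```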
Further, after the body part action scores are obtained, the prediction probability distribution q(x) is calculated by combining the body part action scores with the body part state recognition results. The true probability distribution p(x) of the enhanced data samples and the pre-labeled ground-truth box coordinates box_true are then obtained, and a first cross-entropy loss value is calculated from the prediction probability distribution and the true probability distribution as

L_Part = -Σ_x p(x) * log q(x)

A second cross-entropy loss value L_att is calculated from the pre-labeled ground-truth box coordinates box_true and the predicted box coordinates box_pred, i.e. the human body box coordinates. Finally, the target loss value L_loss is calculated by combining the first and second cross-entropy loss values, where p(x) denotes the true probability distribution (cls_true), q(x) the prediction probability distribution (cls_pred), box_true the ground-truth box coordinates and box_pred the predicted box coordinates. The cross-entropy loss minimizes the difference between the two probability distributions so that the prediction probability distribution is as close to the true one as possible; here L_Part is the body part classification loss (the first cross-entropy loss value), L_att is the attention loss (the second cross-entropy loss value), and L_loss is the total loss.
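A sketch of the loss computation, assuming both loss terms are cross-entropies over probability distributions and that they are combined by an unweighted sum; the embodiment states only that the target loss is calculated from the two terms.

```python
import torch

def cross_entropy(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """H(p, q) = -sum_x p(x) * log q(x), averaged over the batch."""
    return -(p * torch.log(q.clamp_min(1e-12))).sum(dim=-1).mean()

def total_loss(cls_true, cls_pred, att_true, att_pred) -> torch.Tensor:
    """L_loss = L_Part + L_att: the part-classification loss between the true
    and predicted probability distributions plus the attention loss derived
    from the ground-truth box annotations. Equal weighting of the two terms
    is an assumption, not stated in the patent."""
    l_part = cross_entropy(cls_true, cls_pred)   # first cross-entropy loss
    l_att = cross_entropy(att_true, att_pred)    # second (attention) cross-entropy loss
    return l_part + l_att
```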
It should be noted that a second-order understanding paradigm based on the local semantic states of the human body, i.e. a network structure that goes from pixels to body parts and then to behavior concepts and performs accurate behavior recognition according to the body parts, is currently lacking. To address this, a second-order understanding paradigm for picture or video behaviors is established to improve behavior recognition precision, and the composite backbone network is designed on top of this paradigm so as to accurately recognize the body parts and the corresponding behaviors in a picture.
And S206, when the target loss value reaches a preset threshold value, generating a pre-trained behavior recognition model.
In one possible implementation, the pre-trained behavior recognition model is generated when the target loss value reaches a preset threshold.
In another possible implementation manner, when the target loss value does not reach the preset threshold, the target loss value is back-propagated to update the parameters of the model, and the step of performing data enhancement on the model training samples to obtain enhanced data samples is carried out again.
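The training control flow of steps S203-S206 could be sketched as follows; all function and parameter names are assumptions.

```python
import torch

def train(model, optimizer, sample_loader, augment, compute_loss, threshold: float):
    """Iterate until the target loss reaches the preset threshold: enhance the
    training samples, compute the target loss, back-propagate it and update the
    model parameters; once the threshold is reached, the pre-trained behavior
    recognition model is obtained."""
    while True:
        for batch, labels in sample_loader:
            enhanced = augment(batch)                     # data enhancement step
            loss = compute_loss(model(enhanced), labels)  # target loss value
            if loss.item() <= threshold:                  # preset threshold reached
                return model
            optimizer.zero_grad()
            loss.backward()                               # back-propagate the target loss
            optimizer.step()                              # update model parameters
```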
In the embodiment of the application, a behavior recognition device first acquires a target image to be recognized, generates a human body image and images of each part of the human body from the target image, and inputs the target image, the human body image and the body part images into a pre-trained behavior recognition model, wherein the pre-trained behavior recognition model is trained on behavior images annotated with body part codes; finally, the device outputs the body parts of the human body image and the behaviors corresponding to those body parts. Because the behavior recognition model is trained on behavior images annotated with body part codes, the model can fully capture fine-grained knowledge at the body part level while remaining easy to generalize, which improves its recognition performance.
The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention. For details which are not disclosed in the embodiments of the apparatus of the present invention, reference is made to the embodiments of the method of the present invention.
Referring to fig. 3, a schematic structural diagram of a behavior recognition apparatus according to an exemplary embodiment of the present invention is shown. The behavior recognizing means may be implemented as all or a part of the terminal by software, hardware, or a combination of both. The device 1 comprises an image acquisition module 10, an image generation module 20, an image input module 30 and a behavior output module 40.
The image acquisition module 10 is used for acquiring a target image to be identified;
an image generation module 20, configured to generate a human body image and images of various parts of a human body according to the target image;
an image input module 30, configured to input the target image, the human body image and the images of each part of the human body into a pre-trained behavior recognition model, wherein the pre-trained behavior recognition model is trained on behavior images annotated with body part codes;
and the behavior output module 40 is used for outputting the human body part of the human body image and the behavior corresponding to the human body part.
It should be noted that when the behavior recognition apparatus provided in the foregoing embodiment executes the behavior recognition method, the division into the above functional modules is only used as an example; in practical applications, the functions may be assigned to different functional modules as needed, i.e. the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the behavior recognition apparatus and the behavior recognition method provided by the above embodiments belong to the same concept; for details of the implementation process, refer to the method embodiments, which are not repeated here.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
In the embodiment of the application, a behavior recognition device first acquires a target image to be recognized, generates a human body image and images of each part of the human body from the target image, and inputs the target image, the human body image and the body part images into a pre-trained behavior recognition model, wherein the pre-trained behavior recognition model is trained on behavior images annotated with body part codes; finally, the device outputs the body parts of the human body image and the behaviors corresponding to those body parts. Because the behavior recognition model is trained on behavior images annotated with body part codes, the model can fully capture fine-grained knowledge at the body part level while remaining easy to generalize, which improves its recognition performance.
The present invention also provides a computer readable medium having stored thereon program instructions which, when executed by a processor, implement the behavior recognition method provided by the various method embodiments described above.
The invention also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the behavior recognition method of the above-described respective method embodiments.
Please refer to fig. 4, which provides a schematic structural diagram of a terminal according to an embodiment of the present application. As shown in fig. 4, terminal 1000 can include: at least one processor 1001, at least one network interface 1004, a user interface 1003, memory 1005, at least one communication bus 1002.
Wherein a communication bus 1002 is used to enable connective communication between these components.
The user interface 1003 may include a Display screen (Display) and a Camera (Camera), and the optional user interface 1003 may also include a standard wired interface and a wireless interface.
The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), among others.
The Memory 1005 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 1005 includes a non-transitory computer-readable medium. The memory 1005 may be used to store an instruction, a program, code, a set of codes, or a set of instructions. The memory 1005 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like; the storage data area may store data and the like referred to in the above respective method embodiments. The memory 1005 may alternatively be at least one memory device located remotely from the processor 1001. As shown in fig. 4, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a behavior recognition application program.
In the terminal 1000 shown in fig. 4, the user interface 1003 is mainly used as an interface for providing input for a user, and acquiring data input by the user; and the processor 1001 may be configured to invoke the behavior recognition application stored in the memory 1005 and perform the following operations:
acquiring a target image to be identified;
generating a human body image and images of all parts of a human body according to the target image;
inputting the target image, the human body image and the images of each part of the human body into a pre-trained behavior recognition model, wherein the pre-trained behavior recognition model is trained on behavior images annotated with body part codes;
and outputting the human body part of the human body image and the corresponding behavior of the human body part.
In one embodiment, the processor 1001 performs the following operations when generating the human body image and the images of the parts of the human body according to the target image:
inputting the target image into an object detector, and outputting the coordinates of a human body frame in the target image;
intercepting a human body image according to the coordinates of the human body frame, inputting the human body image into a pre-trained posture estimation network, and outputting each part frame of the human body;
and intercepting images of all parts of the human body based on the frames of all parts of the human body.
In one embodiment, the processor 1001 specifically performs the following operations when generating the pre-trained behavior recognition model:
collecting human-centered behavior images, and annotating the human-centered behavior images with body part codes to obtain an original behavior recognition data set;
dividing a training set from the original behavior recognition data set, and adjusting the brightness of images in the training set to obtain a model training sample;
performing data enhancement on the model training sample to obtain an enhanced data sample;
establishing a behavior recognition model by adopting a composite backbone network, and performing body part state recognition and feature extraction according to the enhanced data sample and the behavior recognition model to obtain a body part state recognition result and body part features;
calculating a target loss value according to the body part state recognition result and the body part characteristics;
and when the target loss value reaches a preset threshold value, generating a pre-trained behavior recognition model.
In one embodiment, the processor 1001 specifically performs the following operations when performing the calculation of the target loss value according to the body part state recognition result and the body part feature:
calculating a body part motion score according to the body part features;
calculating prediction probability distribution according to the body part state recognition result and the body part action score;
acquiring real probability distribution of the enhanced data sample and real frame coordinates marked in advance;
calculating a first cross entropy loss value according to the prediction probability distribution and the real probability distribution;
calculating a second cross entropy loss value according to the real frame coordinates marked in advance;
a target loss value is calculated based on the first cross entropy loss value and the second cross entropy loss value.
In one embodiment, when the processor 1001 generates the pre-trained behavior recognition model when the target loss value reaches the preset threshold, it specifically performs the following operations:
and when the target loss value does not reach the preset threshold value, performing back propagation on the target loss value to update the parameters of the model, and continuously performing the step of performing data enhancement on the model training sample to obtain an enhanced data sample.
In one embodiment, when annotating the human-centered behavior images with body part codes to obtain the original behavior recognition data set, the processor 1001 performs the following operations:
selecting a preset number of preprocessed images from the collected behavior images according to preset behavior parameters;
calculating coordinate values of the body part in the preprocessed image;
and annotating the human-centered behavior images with body part codes according to the coordinate values of the body parts to obtain an original behavior recognition data set.
In an embodiment, when performing data enhancement on the model training sample to obtain an enhanced data sample, the processor 1001 specifically performs the following operations:
carrying out normalization processing on the model training sample to obtain a normalized training sample;
scaling the normalized training sample into multiple scales by adopting an affine transformation method to obtain a scaled data sample;
and performing rotation enhancement on the scaled data sample by adopting an affine transformation method to obtain an enhanced data sample.
In the embodiment of the application, a behavior recognition device first acquires a target image to be recognized, generates a human body image and images of each part of the human body from the target image, and inputs the target image, the human body image and the body part images into a pre-trained behavior recognition model, wherein the pre-trained behavior recognition model is trained on behavior images annotated with body part codes; finally, the device outputs the body parts of the human body image and the behaviors corresponding to those body parts. Because the behavior recognition model is trained on behavior images annotated with body part codes, the model can fully capture fine-grained knowledge at the body part level while remaining easy to generalize, which improves its recognition performance.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program to instruct associated hardware, and the behavior recognition program can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory or a random access memory.
The above disclosure is only for the purpose of illustrating the preferred embodiments of the present application and is not to be construed as limiting the scope of the present application, so that the present application is not limited thereto, and all equivalent variations and modifications can be made to the present application.
Claims (10)
1. A method of behavior recognition, the method comprising:
acquiring a target image to be identified;
generating a human body image and images of all parts of the human body according to the target image;
inputting the target image, the human body image and the images of each part of the human body into a pre-trained behavior recognition model; wherein the pre-trained behavior recognition model is trained on behavior images annotated with body part codes;
and outputting the human body part of the human body image and the behavior corresponding to the human body part.
2. The method according to claim 1, wherein the generating of the human body image and the images of each part of the human body according to the target image comprises:
inputting the target image into an object detector, and outputting the coordinates of the human body frame in the target image;
intercepting a human body image according to the coordinates of the human body frame, inputting the human body image into a pre-trained posture estimation network, and outputting each part frame of the human body;
and intercepting images of all parts of the human body based on the frames of all parts of the human body.
3. The method of claim 1, wherein generating a pre-trained behavior recognition model comprises:
collecting human-centered behavior images, and annotating the human-centered behavior images with body part codes to obtain an original behavior recognition data set;
dividing a training set from the original behavior recognition data set, and adjusting the brightness of images in the training set to obtain a model training sample;
performing data enhancement on the model training sample to obtain an enhanced data sample;
establishing a behavior recognition model by adopting a composite backbone network, and performing body part state recognition and feature extraction according to the enhanced data sample and the behavior recognition model to obtain a body part state recognition result and body part features;
calculating a target loss value according to the body part state recognition result and the body part characteristics;
and when the target loss value reaches a preset threshold value, generating a pre-trained behavior recognition model.
4. The method of claim 3, wherein calculating a target loss value based on the body part state recognition result and body part features comprises:
calculating the body part motion score from the body part features;
calculating a prediction probability distribution according to the body part state recognition result and the body part action score;
acquiring real probability distribution of the enhanced data sample and real frame coordinates marked in advance;
calculating a first cross entropy loss value according to the prediction probability distribution and the real probability distribution;
calculating a second cross entropy loss value according to the pre-marked real frame coordinate;
and calculating a target loss value based on the first cross entropy loss value and the second cross entropy loss value.
5. The method of claim 3, wherein generating a pre-trained behavior recognition model when the target loss value reaches a preset threshold comprises:
and when the target loss value does not reach a preset threshold value, performing back propagation on the target loss value to update parameters of a model, and continuously performing the step of performing data enhancement on the model training sample to obtain an enhanced data sample.
6. The method of claim 2, wherein said annotating the human-centered behavior image with body part codes to obtain an original behavior recognition data set comprises:
selecting a preset number of preprocessed images from the collected behavior images according to preset behavior parameters;
calculating coordinate values of the body part in the preprocessed image;
and annotating the human-centered behavior image with body part codes according to the coordinate values of the body parts to obtain an original behavior recognition data set.
7. The method of claim 2, wherein the performing data enhancement on the model training sample to obtain an enhanced data sample comprises:
carrying out normalization processing on the model training sample to obtain a normalized training sample;
scaling the normalized training sample into multiple scales by adopting an affine transformation method to obtain a scaled data sample;
and performing rotation enhancement on the scaled data sample by adopting an affine transformation method to obtain an enhanced data sample.
8. An apparatus for behavior recognition, the apparatus comprising:
the image acquisition module is used for acquiring a target image to be identified;
the image generation module is used for generating a human body image and images of all parts of a human body according to the target image;
the image input module is used for inputting the target image, the human body image and the images of each part of the human body into a pre-trained behavior recognition model; wherein the pre-trained behavior recognition model is trained on behavior images annotated with body part codes;
and the behavior output module is used for outputting the human body part of the human body image and the behavior corresponding to the human body part.
9. A computer storage medium, characterized in that it stores a plurality of instructions adapted to be loaded by a processor and to perform the method steps according to any of claims 1-7.
10. A terminal, comprising: a processor and a memory; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111677068.XA CN114511877A (en) | 2021-12-31 | 2021-12-31 | Behavior recognition method and device, storage medium and terminal |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111677068.XA CN114511877A (en) | 2021-12-31 | 2021-12-31 | Behavior recognition method and device, storage medium and terminal |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114511877A true CN114511877A (en) | 2022-05-17 |
Family
ID=81547645
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111677068.XA Pending CN114511877A (en) | 2021-12-31 | 2021-12-31 | Behavior recognition method and device, storage medium and terminal |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114511877A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116740813A (en) * | 2023-06-20 | 2023-09-12 | 深圳市视壮科技有限公司 | Analysis system and method based on AI image recognition behavior monitoring |
CN116740813B (en) * | 2023-06-20 | 2024-01-05 | 深圳市视壮科技有限公司 | Analysis system and method based on AI image recognition behavior monitoring |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109558832B (en) | Human body posture detection method, device, equipment and storage medium | |
US11830230B2 (en) | Living body detection method based on facial recognition, and electronic device and storage medium | |
CN110176027B (en) | Video target tracking method, device, equipment and storage medium | |
CN110555481B (en) | Portrait style recognition method, device and computer readable storage medium | |
CN108875537B (en) | Object detection method, device and system and storage medium | |
US8750573B2 (en) | Hand gesture detection | |
US8792722B2 (en) | Hand gesture detection | |
WO2020078119A1 (en) | Method, device and system for simulating user wearing clothing and accessories | |
CN112241731A (en) | Attitude determination method, device, equipment and storage medium | |
CN112836625A (en) | Face living body detection method and device and electronic equipment | |
CN110046574A (en) | Safety cap based on deep learning wears recognition methods and equipment | |
CN112819011B (en) | Method and device for identifying relationship between objects and electronic system | |
CN114049512A (en) | Model distillation method, target detection method and device and electronic equipment | |
US20240303848A1 (en) | Electronic device and method for determining human height using neural networks | |
CN117746015A (en) | Small target detection model training method, small target detection method and related equipment | |
CN110610131B (en) | Face movement unit detection method and device, electronic equipment and storage medium | |
CN114511877A (en) | Behavior recognition method and device, storage medium and terminal | |
CN114373195A (en) | Illumination scene adaptive palm anti-counterfeiting method, device, device and storage medium | |
CN113963202A (en) | Skeleton point action recognition method and device, electronic equipment and storage medium | |
CN116740721B (en) | Finger sentence searching method, device, electronic equipment and computer storage medium | |
CN113822134A (en) | Instance tracking method, device, equipment and storage medium based on video | |
CN117994851A (en) | Method, device and equipment for detecting fall of old people based on multitask learning | |
US20230148017A1 (en) | Compositional reasoning of gorup activity in videos with keypoint-only modality | |
CN110059742A (en) | Safety protector wearing recognition methods and equipment based on deep learning | |
CN116994049A (en) | Full-automatic flat knitting machine and method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |