
CN113436064B - Method and equipment for training detection model of key points of target object and detection method and equipment - Google Patents


Info

Publication number
CN113436064B
CN113436064B (application CN202110986015.XA)
Authority
CN
China
Prior art keywords
detected
key point
target
image
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110986015.XA
Other languages
Chinese (zh)
Other versions
CN113436064A (en)
Inventor
王鹏程
高原
刘霄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd filed Critical Beijing Century TAL Education Technology Co Ltd
Priority to CN202110986015.XA
Publication of CN113436064A
Application granted
Publication of CN113436064B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/04 Context-preserving transformations, e.g. by using an importance map
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method and device for training a detection model of key points of a target object, and to a corresponding detection method and device. A video sample is obtained, and a plurality of first to-be-detected image samples contained in the video sample are input into a candidate key point detection network to obtain first candidate key points corresponding to each first to-be-detected image sample. The plurality of first to-be-detected image samples and their corresponding first candidate key points are then input into a self-encoding network to obtain a plurality of target generation image samples output by the self-encoding network. Parameters of the candidate key point detection network are updated based on stability results of the plurality of target generation image samples until the stability results meet a first preset condition, at which point the candidate key point detection network is determined to be the target key point detection network. That is, the candidate key point detection network is trained with the stability results of the plurality of target generation image samples as the convergence condition, so the stability of the target key point detection network is improved.

Description

Method and equipment for training detection model of key points of target object and detection method and equipment
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a method and device for training a detection model of key points of a target object, and to a corresponding detection method and device.
Background
With the development of image processing technology, beautification features are widely used in application software such as short-video shooting and webcast livestreaming. For example, during a live webcast, key points (such as the nose and eyes) of a face image in the video are detected, and the face is beautified based on the detected key points.
An existing key point detection method generally obtains a video sample, manually labels the key points of the to-be-detected image samples in the video sample, trains on the labeled video sample to obtain a key point detection network, and then uses the key point detection network to obtain the key points of the to-be-detected images in a to-be-detected video.
However, with such prior-art methods, the stability of the detected key points is not high.
Disclosure of Invention
In order to solve, or at least partially solve, the above technical problem, the present disclosure provides a method and device for training a detection model of key points of a target object, and a corresponding detection method and device.
In a first aspect, the present disclosure provides a method for training a detection model of a key point of a target object, including:
obtaining a video sample, wherein the video sample comprises a plurality of first to-be-detected image samples containing a first target object, and the first target object in the first to-be-detected image samples is not labeled with key point information;
inputting a plurality of first to-be-detected image samples into a candidate key point detection network to obtain first candidate key points corresponding to each first to-be-detected image sample;
inputting the plurality of first image samples to be detected and the first candidate key points corresponding to each first image sample to be detected into a target self-encoding network to obtain a plurality of target generation image samples;
updating parameters of the candidate key point detection network according to stability results of the plurality of target generation image samples, and returning to the step of inputting the plurality of first to-be-detected image samples into the candidate key point detection network to obtain the first candidate key points corresponding to each first to-be-detected image sample; and, when the stability results meet a first preset condition, determining the candidate key point detection network to be the target key point detection network.
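The iterative procedure of the first aspect can be sketched as follows. This is a hypothetical Python/NumPy outline: the callables standing in for the candidate key point detection network, the target self-encoding network, the stability judgment, and the parameter update, as well as the threshold and iteration cap, are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np

def train_keypoint_detector(frames, detect_keypoints, target_autoencoder,
                            stability_score, update_params,
                            stability_threshold=0.95, max_iters=100):
    """Sketch of the first-aspect loop: detect candidate key points per frame,
    regenerate each frame through the target self-encoding network, and iterate
    until the stability result meets the first preset condition."""
    for _ in range(max_iters):
        # Obtain first candidate key points for each first to-be-detected sample.
        keypoints = [detect_keypoints(f) for f in frames]
        # Obtain the target generation image samples from the self-encoding network.
        generated = [target_autoencoder(f, k) for f, k in zip(frames, keypoints)]
        # Stability of the generated samples is the convergence condition.
        score = stability_score(frames, generated)
        if score >= stability_threshold:  # first preset condition (assumed form)
            break
        update_params(score)  # otherwise update the candidate network's parameters
    return detect_keypoints  # the converged target key point detection network
```

With stub callables the loop exits as soon as the stability score clears the threshold, without ever updating parameters, which mirrors the "until the stability result meets a first preset condition" wording.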
Optionally, the target self-encoding network includes: a first encoder, a second encoder and a decoder;
inputting the plurality of first to-be-detected image samples and the first candidate keypoints corresponding to each first to-be-detected image sample into a target self-encoding network to obtain a plurality of target-generated image samples, including:
for each first image sample to be detected, performing the following steps to obtain a plurality of target generation image samples:
performing first encoding processing on a first candidate key point corresponding to the first image sample to be detected by using a first encoder to obtain a key point feature corresponding to the first candidate key point;
performing second encoding processing on the first image sample to be detected by using a second encoder to obtain image characteristics corresponding to the first image sample to be detected;
and obtaining a target generation image sample corresponding to the first image sample to be detected by using the decoder according to the key point characteristics and the image characteristics.
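The three encoding/decoding steps above can be wired together as in the following toy sketch. The linear "encoders", feature dimensions, and tensor shapes are assumptions purely to illustrate the data flow (first encoder for key points, second encoder for the image, concatenation, decoder); a real implementation would use trained neural networks.

```python
import numpy as np

class TargetAutoencoder:
    """Toy sketch of the target self-encoding network: a first encoder for the
    candidate key points, a second encoder for the image, and a decoder that
    reconstructs the target generation image sample from the joined features."""

    def __init__(self, img_pixels=16, n_keypoints=4, feat=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W_kp = rng.normal(size=(n_keypoints * 2, feat))   # first encoder
        self.W_img = rng.normal(size=(img_pixels, feat))       # second encoder
        self.W_dec = rng.normal(size=(2 * feat, img_pixels))   # decoder

    def __call__(self, image, keypoints):
        kp_feat = keypoints.reshape(-1) @ self.W_kp    # first encoding processing
        img_feat = image.reshape(-1) @ self.W_img      # second encoding processing
        fused = np.concatenate([kp_feat, img_feat])    # connect the two features
        return (fused @ self.W_dec).reshape(image.shape)  # target generation sample
```

The output keeps the input image's shape, matching the idea that the decoder reconstructs an image sample from the key point features and image features.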
Optionally, the target self-encoding network is obtained by training in the following manner:
inputting a plurality of second to-be-detected image samples containing a second target object into a first initial key point detection network to obtain first initial key points corresponding to each second to-be-detected image sample, wherein the second target object in the second to-be-detected image samples is not labeled with key point information;
inputting a plurality of second image samples to be detected and the first initial key points corresponding to each second image sample to be detected into a candidate self-coding network to obtain a candidate generated image;
inputting the candidate generated image into a second initial key point detection network to obtain a second initial key point corresponding to the candidate generated image, wherein the first initial key point detection network is the same as the second initial key point detection network;
performing mode consistency judgment according to the first initial key point and the second initial key point to obtain a judgment result;
and updating parameters of the candidate self-encoding network according to the judgment result, returning to the step of inputting the plurality of second to-be-detected image samples containing the second target object into the first initial key point detection network to obtain the first initial key points corresponding to each second to-be-detected image sample, and, when the judgment result meets a second preset condition, determining the candidate self-encoding network to be the target self-encoding network.
Optionally, the performing mode consistency judgment according to the first initial key points and the second initial key points to obtain a judgment result includes:
generating a first heatmap according to the first initial key points;
generating a second heatmap according to the second initial key points;
acquiring key point features corresponding to the first heatmap, and acquiring key point features corresponding to the second heatmap;
and obtaining the judgment result according to the key point features corresponding to the first heatmap and the key point features corresponding to the second heatmap.
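A sketch of this judgment step, assuming the diagrams are Gaussian key point heatmaps (热力图, sometimes machine-rendered as "thermodynamic diagram") and that the judgment result is a cosine similarity between the two heatmaps' features. Neither the rendering nor the similarity metric is fixed by the patent; both are common illustrative choices.

```python
import numpy as np

def keypoint_heatmap(keypoints, size=(16, 16), sigma=1.5):
    """Render each (x, y) key point as one Gaussian heatmap channel."""
    h, w = size
    ys, xs = np.mgrid[0:h, 0:w]
    channels = []
    for x, y in keypoints:
        d2 = (xs - x) ** 2 + (ys - y) ** 2
        channels.append(np.exp(-d2 / (2.0 * sigma ** 2)))
    return np.stack(channels)  # shape: (num_keypoints, h, w)

def mode_consistency(heatmap_a, heatmap_b):
    """Illustrative judgment result: cosine similarity of the flattened
    heatmap features; 1.0 means the two sets of key points agree perfectly."""
    a, b = heatmap_a.ravel(), heatmap_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Identical key point sets yield a score of 1.0; the further the second initial key points drift from the first, the lower the score, which is what the second preset condition would be checked against.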
Optionally, the candidate self-coding network is obtained by:
and training with a plurality of third to-be-detected image samples containing a third target object to obtain the candidate self-encoding network, wherein the third target object in the plurality of third to-be-detected image samples is labeled with key point information.
Optionally, the training to obtain the candidate self-encoding network by using a plurality of third to-be-detected image samples including a third target object includes:
performing first coding processing on target key points marked on the third image sample to be detected by using a first coder to obtain target key point characteristics;
performing second coding processing on the third image sample to be detected by using a second coder to obtain a target image characteristic corresponding to the third image sample to be detected;
obtaining a target generation image according to the target key point characteristics and the target image characteristics by using the decoder;
and training an initial self-coding network based on the third image sample to be detected and the target generation image until the initial self-coding network is converged to obtain the candidate self-coding network.
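The final training step above can be sketched with a reconstruction objective. The mean-squared pixel error and the loss-plateau convergence test below are assumptions: the patent only requires training "until the initial self-coding network is converged" without naming a loss.

```python
import numpy as np

def reconstruction_loss(sample, generated):
    """Mean-squared pixel error between the third to-be-detected image sample
    and the target generation image (assumed loss; not specified by the patent)."""
    return float(np.mean((sample - generated) ** 2))

def converged(losses, tol=1e-4):
    """Simple illustrative convergence test: the last two recorded losses
    differ by less than a small tolerance."""
    return len(losses) >= 2 and abs(losses[-1] - losses[-2]) < tol
```

In a full training loop, `reconstruction_loss` would drive gradient updates of the initial self-encoding network, and `converged` would decide when to freeze it as the candidate self-encoding network.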
Optionally, the candidate keypoint detection network is obtained by training a plurality of fourth image samples to be detected containing fourth target objects, where the fourth target objects in the plurality of fourth image samples to be detected are labeled with keypoint information.
Optionally, before the updating of the parameters of the candidate key point detection network according to the stability results of the plurality of target generation image samples, the method further includes:
obtaining a generated video sequence based on the plurality of target generation image samples;
and acquiring a stability result by using a target video stability judging network according to the generated video sequence and the video sequence corresponding to the plurality of first to-be-detected image samples included in the video sample.
Optionally, the obtaining, by the target video stability judging network, a stability result according to the generated video sequence and the video sequence corresponding to the plurality of first to-be-detected image samples included in the video sample includes:
performing a three-dimensional convolution on the plurality of target generation images included in the generated video sequence by using a spatio-temporal convolution network to extract a first spatio-temporal feature, and performing a two-dimensional convolution operation on the first spatio-temporal feature to extract a first deepened spatial feature;
performing a three-dimensional convolution on the plurality of first to-be-detected image samples included in the video sample by using the spatio-temporal convolution network to extract a second spatio-temporal feature, and performing a two-dimensional convolution operation on the second spatio-temporal feature to extract a second deepened spatial feature;
and obtaining the stability result according to the first spatio-temporal feature, the first deepened spatial feature, the second spatio-temporal feature and the second deepened spatial feature.
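The feature extraction above can be sketched as follows. The naive fixed-kernel convolutions and the distance-based score are illustrative stand-ins for the learned spatio-temporal network; the patent does not specify kernel sizes or how the four features are combined.

```python
import numpy as np

def conv_nd(x, kernel):
    """Naive valid-mode n-dimensional correlation, standing in for a learned
    convolution layer (a real network uses trained kernels)."""
    out_shape = tuple(np.array(x.shape) - np.array(kernel.shape) + 1)
    out = np.zeros(out_shape)
    for idx in np.ndindex(out_shape):
        window = tuple(slice(i, i + s) for i, s in zip(idx, kernel.shape))
        out[idx] = np.sum(x[window] * kernel)
    return out

def stability_features(video, kernel3d, kernel2d):
    """3-D convolution over (time, height, width) extracts the spatio-temporal
    feature; a follow-up 2-D convolution on each resulting frame extracts the
    deepened spatial feature."""
    st = conv_nd(video, kernel3d)
    deep = np.stack([conv_nd(frame, kernel2d) for frame in st])
    return st, deep

def stability_result(st_gen, deep_gen, st_orig, deep_orig):
    """Illustrative stability score in (0, 1]: the closer the generated
    sequence's features are to the original sequence's, the higher the score."""
    dist = sum(float(np.mean((a - b) ** 2))
               for a, b in ((st_gen, st_orig), (deep_gen, deep_orig)))
    return 1.0 / (1.0 + dist)
```

Running both the generated video sequence and the original one through the same feature extractor and comparing the results captures the intuition that stable key points reproduce the original video's temporal dynamics.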
Optionally, the target object is a human face.
In a second aspect, the present disclosure provides a method for detecting key points of a target object, including:
acquiring a video to be detected, wherein the video to be detected comprises: a plurality of first images to be detected including a first target object;
and detecting the plurality of first to-be-detected images by using a target key point detection network to obtain the target object key points in each first to-be-detected image, wherein the target key point detection network is obtained by the method for training a detection model of key points of a target object of the first aspect.
In a third aspect, the present disclosure provides a training apparatus for a key point detection model of a target object, including:
an obtaining module, configured to obtain a video sample, where the video sample includes a plurality of first to-be-detected image samples containing a first target object, and the first target object in the first to-be-detected image samples is not labeled with key point information;
the processing module is used for inputting a plurality of first to-be-detected image samples into a candidate key point detection network to obtain first candidate key points corresponding to each first to-be-detected image sample;
the processing module is further configured to input the plurality of first to-be-detected image samples and the first candidate keypoints corresponding to each of the first to-be-detected image samples into a target self-encoding network, so as to obtain a plurality of target-generated image samples;
the processing module is further configured to update parameters of the candidate key point detection network according to stability results of the plurality of target generation image samples, and to return to inputting the plurality of first to-be-detected image samples into the candidate key point detection network to obtain the first candidate key points corresponding to each first to-be-detected image sample; and, when the stability results meet a first preset condition, to determine the candidate key point detection network to be the target key point detection network.
In a fourth aspect, the present disclosure provides an apparatus for detecting a key point of a target object, including:
the acquisition module is used for acquiring a video to be detected, wherein the video to be detected comprises: a plurality of first images to be detected including a first target object;
and a processing module, configured to detect the plurality of first to-be-detected images by using a target key point detection network to obtain the target object key points in each first to-be-detected image, wherein the target key point detection network is obtained by the method for training a detection model of key points of a target object of the first aspect.
In a fifth aspect, the present disclosure provides a computer device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any one of the first aspect or the steps of the method of any one of the second aspect when executing the computer program.
In a sixth aspect, the present disclosure provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of any one of the first aspect or the second aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
A video sample is obtained, and a plurality of first to-be-detected image samples contained in the video sample are input into a candidate key point detection network to obtain first candidate key points corresponding to each first to-be-detected image sample. The plurality of first to-be-detected image samples and their corresponding first candidate key points are then input into the self-encoding network to obtain a plurality of target generation image samples output by the self-encoding network. Parameters of the candidate key point detection network are updated based on the stability results of the plurality of target generation image samples until the stability results meet a first preset condition, at which point the candidate key point detection network is determined to be the target key point detection network. That is, the candidate key point detection network is trained with the stability results of the plurality of target generation image samples as the convergence condition, so the stability of the target key point detection network is improved. In addition, in this process, the first target object in the plurality of first to-be-detected image samples is not labeled with key point information, which reduces the cost of model training.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below; it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic flowchart of an embodiment of a method for training a detection model of a target object key point according to the present disclosure;
fig. 2 is a schematic structural diagram of a candidate keypoint detection network provided by the present disclosure;
fig. 3 is a schematic structural diagram of a target self-coding network provided by the present disclosure;
fig. 4 is a schematic structural diagram of a target self-coding network provided by the present disclosure;
FIG. 5 is a schematic flowchart of another embodiment of a method for training a detection model of a key point of a target object according to the present disclosure;
FIG. 6a is a schematic diagram of an architecture for target self-coding network training according to the present disclosure;
fig. 6b is a schematic diagram of a modality-consistent determination network according to the present disclosure;
FIG. 7 is a schematic flowchart illustrating an embodiment of a method for training a detection model of a key point of a target object according to the present disclosure;
FIG. 8 is a schematic flowchart of an embodiment of a method for training a detection model of a key point of a target object according to the present disclosure;
FIG. 9 is a schematic flow chart diagram illustrating a method for detecting key points of a target object according to the present disclosure;
FIG. 10 is a schematic structural diagram of an embodiment of a training apparatus for a key point detection model of a target object according to the present disclosure;
fig. 11 is a schematic structural diagram of a device for detecting key points of a target object according to the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
The target object in the present disclosure may be a specific portion in the image to be detected, for example, if the image to be detected includes a person, the target object may be a human face, an arm, a leg, or other portions, and the target object may also be other objects, for example, a license plate of a vehicle. The target object key point detection may be applied not only to image processing, such as beauty, but also to target tracking, etc., to which the present disclosure is not limited.
The detection of the key points is used as a basis for supporting applications such as image processing and target tracking, and the stability of the detected key points is very important. In the prior art, when the key point detection is performed on a target object in a video, the problem of key point jitter exists.
In the present disclosure, the key points corresponding to a to-be-detected image sample are obtained through a candidate key point detection network, and the to-be-detected image sample and the key points are input into a target self-encoding network. Relying on the image generation capability of the target self-encoding network, a target generation image sample is obtained, and the stability of the key points is judged based on the target generation image sample. The stability result serves as the convergence condition of the candidate key point detection network: the parameters of the candidate key point detection network are adjusted based on the stability result until the stability meets a certain condition, at which point the candidate key point detection network is determined to have converged, yielding the target key point detection network. Because stability is used as the convergence condition of model training, the stability of the key points output by the target key point detection network is improved.
The technical solutions of the present disclosure are described below in several specific embodiments; the same or similar concepts may be cross-referenced between embodiments and are not described repeatedly in each place.
Fig. 1 is a schematic flowchart of an embodiment of a method for training a detection model of a target object key point, as shown in fig. 1, the method of the embodiment is as follows:
s101: a video sample is obtained.
Wherein the video sample includes a plurality of first to-be-detected image samples, each containing a first target object that is not labeled with key point information. In general, key point information is added to a to-be-detected image sample by manual labeling; that is, after an original image is captured, the key point information must be manually labeled to form the image sample. In the technical solution of the present disclosure, however, the first to-be-detected image samples in the video samples do not need to be labeled with key point information, so a large number of video samples can be obtained in more ways to facilitate training of the network model. For example, the video samples may be obtained as follows:
A large number of video samples may be acquired from authorized video platforms based on crawler technology, or, with authorization, collected in places with heavy foot traffic such as shopping malls and roads, and the image sample of each frame may then be extracted from the video samples.
S103: and inputting a plurality of first to-be-detected image samples into a candidate key point detection network to obtain first candidate key points corresponding to each first to-be-detected image sample.
The network models for key point detection in the present disclosure include the candidate key point detection network and the target key point detection network. The two have the same network structure; the difference is that the candidate key point detection network refers to the key point detection network during training, while the target key point detection network refers to the key point detection network after convergence.
The structure diagram of the candidate keypoint detection network is shown in fig. 2, and the input of the candidate keypoint detection network is a first image sample to be detected, and the output of the candidate keypoint detection network is a first candidate keypoint.
The plurality of first to-be-detected image samples included in the video sample are input into the candidate key point detection network, which performs key point detection processing and outputs the first candidate key points corresponding to each first to-be-detected image sample.
After the first candidate keypoints corresponding to each first to-be-detected image sample are obtained, S105 is performed.
S105: and inputting the plurality of first image samples to be detected and the first candidate key points corresponding to each first image sample to be detected into a target self-encoding network to obtain a plurality of target generation image samples.
The structure of the target self-encoding network is shown in fig. 3: its inputs are the first to-be-detected image sample and the corresponding first candidate key points, and its output is the target generation image sample.
The target self-encoding network extracts image features from the first to-be-detected image sample, acquires key point features from the first candidate key points corresponding to the first to-be-detected image sample, and reconstructs an image based on the image features and the key point features to obtain the target generation image sample.
S107: and updating the parameters of the candidate key point detection network according to the stability results of the plurality of target generated image samples, and returning to execute S103.
The candidate key point detection network is a key point detection network in a training process, and if the stability of an output result of the candidate key point detection network cannot meet requirements, parameters of the candidate key point detection network need to be continuously adjusted through the training process, so that the stability of key points output by the candidate key point detection network is better and better.
The stability results of the plurality of target generation image samples are taken as the convergence condition, and the parameters of the candidate key point detection network are adjusted based on the stability results.
S109: and determining the candidate key point detection network as a target key point detection network until the stability result meets a first preset condition.
After multiple rounds of training, once the stability result meets the first preset condition, the candidate key point detection network is determined to have converged, and the converged candidate key point detection network is determined to be the target key point detection network.
In this embodiment, a video sample is obtained, and a plurality of first to-be-detected image samples included in the video sample are input into a candidate key point detection network to obtain first candidate key points corresponding to each first to-be-detected image sample. The plurality of first to-be-detected image samples and their corresponding first candidate key points are then input into the self-encoding network to obtain a plurality of target generation image samples output by the self-encoding network. Parameters of the candidate key point detection network are updated based on the stability results of the plurality of target generation image samples until the stability results meet a first preset condition, at which point the candidate key point detection network is determined to be the target key point detection network. That is, the candidate key point detection network is trained with the stability results of the plurality of target generation image samples as the convergence condition, so the stability of the target key point detection network is improved. In addition, in this process, the first target object in the plurality of first to-be-detected image samples is not labeled with key point information, which reduces the cost of model training.
One implementation of the structure of the target self-encoding network in the above embodiment is shown in fig. 4: the target self-encoding network 400 includes a first encoder 41, a second encoder 42, and a decoder 43. With reference to fig. 4, a possible implementation of S105 is as follows:
for each first image sample to be detected, performing the following steps to obtain a plurality of target generation image samples:
and performing first coding processing on a first candidate key point corresponding to the first to-be-detected image sample by using a first coder to obtain a key point feature corresponding to the first candidate key point. And carrying out second coding processing on the first image sample to be detected by using a second coder to obtain the image characteristics corresponding to the first image sample to be detected. After the key point features and the image features are obtained, the key point features and the image features are connected and input into a decoder. And obtaining a target generation image sample corresponding to the first image sample to be detected by using the decoder according to the key point characteristics and the image characteristics.
In this embodiment, two encoders are used to obtain the key point features and the image features respectively, and the decoder reconstructs the target generation image sample based on the key point features and the image features.
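The forward pass of this two-encoder structure can be sketched as follows; the linear layers and the class name `TwoStreamAutoencoder` are illustrative assumptions for this sketch, not the disclosed network.

```python
import numpy as np

class TwoStreamAutoencoder:
    """Sketch of fig. 4: a first encoder for the key points, a second encoder
    for the image, and a decoder over the concatenated features."""

    def __init__(self, img_dim, kp_dim, feat_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w_kp = rng.normal(size=(kp_dim, feat_dim)) / np.sqrt(kp_dim)
        self.w_img = rng.normal(size=(img_dim, feat_dim)) / np.sqrt(img_dim)
        self.w_dec = rng.normal(size=(2 * feat_dim, img_dim)) / np.sqrt(2 * feat_dim)

    def forward(self, image, keypoints):
        kp_feat = np.tanh(keypoints @ self.w_kp)     # first encoding processing
        img_feat = np.tanh(image @ self.w_img)       # second encoding processing
        joint = np.concatenate([kp_feat, img_feat])  # concatenate the two features
        return joint @ self.w_dec                    # decoder reconstructs the image
```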
Fig. 5 is a schematic flowchart of another embodiment of the method for training a detection model of key points of a target object provided by the present disclosure, and fig. 6a is a schematic diagram of an architecture for training the target self-coding network provided by the present disclosure. Fig. 5 shows, on the basis of the foregoing embodiment, the training process of the target self-coding network; that is, the target self-coding network can be obtained by training in the manner shown in fig. 5. With reference to fig. 5 and 6a, the training includes:
S501: inputting a plurality of second image samples to be detected containing second target objects into a first initial key point detection network to obtain first initial key points respectively corresponding to the second image samples to be detected.
And the second to-be-detected image sample comprises a to-be-detected image sample of which the second target object is not marked with key point information.
The first initial key point detection network is the initial state of the candidate key point detection network.
S502: and inputting a plurality of second image samples to be detected and the first initial key points corresponding to each second image sample to be detected into a candidate self-coding network to obtain a candidate generated image.
Wherein the input of the candidate self-coding network includes the second image sample to be detected and the first initial key point corresponding to the second image sample to be detected, and the output of the candidate self-coding network is the candidate generated image sample.
And the candidate self-coding network extracts image features from the second image sample to be detected, acquires key point features from the first initial key points corresponding to the second image sample to be detected, and reconstructs an image based on the image features and the key point features to obtain the candidate generated image.
S503: and inputting the candidate generated image into a second initial key point detection network to obtain a second initial key point corresponding to the candidate generated image.
Wherein the first initial keypoint detection network is the same as the second initial keypoint detection network.
S504: and judging the mode consistency according to the first initial key point and the second initial key point to obtain a judgment result.
Optionally, the mode consistency determination may be performed through the mode consistency determination network based on the first initial key point and the second initial key point, so as to obtain a determination result.
Fig. 6b is a schematic diagram of an architecture of a mode consistency judgment network provided by the present disclosure, which includes: a heatmap processing module, a convolutional neural network, and a discriminator. With reference to fig. 6b, the mode consistency judgment is implemented as follows:
The first initial key point is input into the heatmap processing module, and the heatmap processing module generates a first heatmap according to the first initial key point. Specifically, the coordinates of the key points are mapped onto an image, so that an image with a black background on which only the key point positions are marked is obtained according to the first initial key point.
Similarly, the second initial key point is input into the heatmap processing module, and the heatmap processing module generates a second heatmap according to the coordinates of the second initial key point.
The first heatmap and the second heatmap are input into a convolutional neural network, which acquires the key point features corresponding to the first heatmap and the key point features corresponding to the second heatmap.
The judgment result is then obtained according to the key point features corresponding to the first heatmap and the key point features corresponding to the second heatmap.
Optionally, the key point features corresponding to the first heatmap and the key point features corresponding to the second heatmap are input into the discriminator, and the judgment result is obtained according to: mode consistency loss = ||discriminator(key point features corresponding to the first heatmap) − discriminator(key point features corresponding to the second heatmap)||^2, where the mode consistency loss is the judgment result.
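The heatmap construction and the above loss can be sketched as follows; `keypoints_to_heatmap` and `feature_net` are illustrative stand-ins (the latter for the convolutional neural network together with the discriminator), not the disclosed implementations.

```python
import numpy as np

def keypoints_to_heatmap(keypoints, size=32):
    # Map normalized (x, y) coordinates in [0, 1) onto a black (all-zero)
    # image with the key point positions set to 1, as the heatmap
    # processing module does.
    heatmap = np.zeros((size, size))
    for x, y in keypoints:
        heatmap[int(y * size), int(x * size)] = 1.0
    return heatmap

def modality_consistency_loss(kp1, kp2, feature_net, size=32):
    # loss = ||D(features of first heatmap) - D(features of second heatmap)||^2
    f1 = feature_net(keypoints_to_heatmap(kp1, size))
    f2 = feature_net(keypoints_to_heatmap(kp2, size))
    return float(np.sum((f1 - f2) ** 2))
```

When the re-detected key points coincide with the original ones, the loss is zero, which is exactly the convergence direction of S505.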
S505: and updating the parameters of the candidate self-coding network according to the judgment result, and returning to execute the step S501.
S506: and determining the candidate self-coding network as a target self-coding network until the judgment result meets a second preset condition.
The second preset condition may be that the determination result is smaller than a first preset threshold, where the first preset threshold may be 0.012, 0.01 or other values close to 0.
In this embodiment, first initial key points corresponding to the second to-be-detected image samples are obtained first. The first initial key points and the second to-be-detected image samples are input into the candidate self-coding network to obtain candidate generated images. The second to-be-detected image sample is a real image, and the candidate generated image can be regarded as a synthetic image obtained based on the information of the real image. The candidate generated image is input into the same key point detection model to obtain second initial key points; the closer the synthetic image is to the real image, the higher the accuracy of the images generated by the self-coding network. The present disclosure judges how close the synthetic image is to the real image by comparing the mode consistency of the first initial key points and the second initial key points, and adjusts the parameters of the candidate self-coding network accordingly. When the mode consistency judgment result satisfies the second preset condition, the candidate self-coding network is considered to have converged, and the target self-coding network is obtained, so that the accuracy of the images generated by the target self-coding network is improved. In addition, since image samples in which the target object is not labeled with key point information are used in the process of training the target self-coding network, the cost of model training is reduced.
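The training loop of S501 to S506 can be sketched as follows; the scalar autoencoder, `keypoint_net` and `consistency` are illustrative toy stand-ins for the disclosed networks, and the update rule is a placeholder for real gradient descent.

```python
import numpy as np

def train_candidate_autoencoder(frames, ae_scale, keypoint_net, consistency,
                                threshold=0.01, lr=0.5, max_iters=100):
    """Reconstruct from the detected key points, re-detect on the
    reconstruction, and update until the judgment result meets the
    second preset condition."""
    worst = None
    for _ in range(max_iters):
        worst = 0.0
        for frame in frames:
            kp1 = keypoint_net(frame)                  # first initial key points
            generated = frame * ae_scale               # candidate generated image (stub)
            kp2 = keypoint_net(generated)              # second initial key points
            worst = max(worst, consistency(kp1, kp2))  # judgment result
        if worst < threshold:                          # second preset condition
            break                                      # candidate becomes the target
        ae_scale = ae_scale + lr * (1.0 - ae_scale)    # move toward the identity map
    return ae_scale, worst
```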
Fig. 7 is a schematic flowchart of an embodiment of the method for training a detection model of key points of a target object provided by the present disclosure. Optionally, the candidate self-coding network in the embodiment shown in fig. 5 may also be obtained by training with a plurality of third to-be-detected image samples containing third target objects, where the third target objects in the plurality of third to-be-detected image samples are labeled with key point information. That is, the candidate self-coding network is first trained with to-be-detected image samples labeled with key point information, and is then further trained with to-be-detected image samples not labeled with key points, which improves the convergence efficiency of the candidate self-coding network. One possible implementation of training to obtain the candidate self-coding network is shown in fig. 7.
S701: and carrying out first coding processing on the target key points marked on the third image sample to be detected by using a first coder to obtain the characteristics of the target key points.
S702: and carrying out second coding processing on the third image sample to be detected by using a second coder to obtain the target image characteristics corresponding to the third image sample to be detected.
S703: and obtaining a target generation image according to the target key point characteristics and the target image characteristics by using the decoder.
S704: and training an initial self-coding network based on the third image sample to be detected and the target generation image until the initial self-coding network is converged to obtain the candidate self-coding network.
Specifically, based on the difference between the third image sample to be detected and the target generation image, the parameters of the initial self-coding network are adjusted until the initial self-coding network converges, and the candidate self-coding network is obtained.
Optionally, a value of the first loss function is obtained according to reconstruction loss = MSE(target generation image, third image sample to be detected), where the reconstruction loss is the value of the first loss function, and MSE(target generation image, third image sample to be detected) is the mean of the squared differences of each pair of corresponding pixels between the target generation image and the third image sample to be detected.
Optionally, a value of the second loss function is obtained according to decoupling loss = −KL(key point features || image features), where the decoupling loss is the value of the second loss function, and KL(key point features || image features) is the divergence between the distribution of the key point features and the distribution of the image features; a smaller value of the second loss function indicates that the distribution of the key point features differs more from the distribution of the image features, that is, the key point features and the image features are better decoupled.
The convergence condition may be that the value of the first loss function is smaller than a second preset threshold and the value of the second loss function is smaller than a third preset threshold, the value of the second preset threshold may be 0.005, 0.01, 0.02 or other values close to 0, and the value of the third preset threshold may be 0.005, 0.01, 0.02 or other values close to 0, which is not limited by the disclosure.
Optionally, the first loss function may also be the mean of the absolute values of the differences of each pair of corresponding pixels between the target generation image and the third image sample to be detected; the choice of the first loss function may be determined according to the actual situation, and is not limited by the present disclosure.
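The two loss functions above can be sketched as follows; normalizing the feature vectors into discrete distributions before computing the KL term is an assumption made for this sketch only, since the disclosure does not specify the feature distributions.

```python
import numpy as np

def reconstruction_loss(generated, target):
    # First loss function: mean of the squared differences of each pair of
    # corresponding pixels (MSE); an L1 (mean absolute) variant is also possible.
    return float(np.mean((generated - target) ** 2))

def decoupling_loss(kp_feat, img_feat, eps=1e-8):
    # Second loss function: -KL(key point features || image features).
    # The feature vectors are normalized into discrete distributions here.
    p = np.abs(kp_feat) + eps
    p = p / p.sum()
    q = np.abs(img_feat) + eps
    q = q / q.sum()
    return float(-np.sum(p * np.log(p / q)))
```

The decoupling loss decreases (becomes more negative) as the two feature distributions move apart, which matches the decoupling objective.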
Optionally, in each of the above embodiments, the candidate keypoint detection network is obtained by training a plurality of fourth image samples to be detected including a fourth target object, where the fourth target object in the plurality of fourth image samples to be detected is labeled with keypoint information. Namely, the candidate key point detection network is obtained by training the to-be-detected image sample labeled with the key point information, and then the candidate key point detection network is continuously trained through the to-be-detected image sample not labeled with the key point, so that the convergence efficiency of the candidate key point detection network is improved.
Fig. 8 is a schematic flowchart of an embodiment of a method for training a detection model of a target object keypoint, provided by the present disclosure, where fig. 8 is based on the foregoing embodiments, and further includes, before S107:
s1061: generating image samples based on the plurality of targets, obtaining a generated video sequence.
S1062: and acquiring a stability result by using a target video stability judging network according to the generated video sequence and the video sequences corresponding to the plurality of first to-be-detected image samples included in the video samples.
One possible implementation is as follows:
performing three-dimensional convolution on a plurality of target generation images included in the generated video sequence by utilizing a space-time convolution network to extract a first space-time characteristic; and performing a two-dimensional convolution operation on the first space-time characteristic to extract a first deepened space characteristic.
Performing three-dimensional convolution on a plurality of first to-be-detected image samples included in the video samples by utilizing a space-time convolution network to extract second space-time characteristics; and performing two-dimensional convolution operation on the second space-time characteristic to extract a second deepened space characteristic.
And obtaining a stability result according to the first space-time characteristic, the first deepening space characteristic, the second space-time characteristic and the second deepening space characteristic.
stability loss = ||space-time convolution network(generated video sequence)||^2 + ||space-time convolution network(input video sequence) − 1||^2
Wherein the stability loss is the obtained stability result; the smaller the stability loss, the better the stability result, which indicates that the key points detected by the key point detection network are more stable, that is, the jitter is smaller.
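The stability loss can be sketched as follows; `spatio_temporal_score` is an illustrative stand-in for the space-time convolution network (a weighted temporal combination followed by a spatial reduction), not the disclosed implementation.

```python
import numpy as np

def spatio_temporal_score(frames, temporal_weights):
    # Stand-in for the space-time convolution network: a weighted temporal
    # combination (the "3-D convolution" step) followed by a spatial
    # reduction (the "2-D convolution" step), squashed into one score in (0, 1].
    spatio_temporal = np.tensordot(temporal_weights, np.stack(frames), axes=(0, 0))
    deepened = spatio_temporal ** 2        # crude spatial deepening
    return float(1.0 / (1.0 + deepened.mean()))

def stability_loss(generated_frames, input_frames, temporal_weights):
    # stability loss = ||D(generated sequence)||^2 + ||D(input sequence) - 1||^2
    d_gen = spatio_temporal_score(generated_frames, temporal_weights)
    d_in = spatio_temporal_score(input_frames, temporal_weights)
    return d_gen ** 2 + (d_in - 1.0) ** 2
```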
Fig. 9 is a schematic flowchart of another method for detecting key points of a target object provided by the present disclosure. The target key point detection network used in fig. 9 can be obtained by the above embodiments of the method for training a detection model of key points of a target object. As shown in fig. 9, the method of this embodiment is as follows:
s901: and acquiring the video to be detected.
The video to be detected comprises: a plurality of first images to be detected including a first target object;
s902: and detecting the plurality of first images to be detected by using a target key point detection network to obtain the key points of the target object in each first image to be detected.
The target key point detection network is obtained by the method for training a detection model of key points of a target object in each of the above embodiments.
In this embodiment, the target object key points in each first image to be detected are obtained by detecting the plurality of first images to be detected with the target key point detection network. Since the candidate key point detection network is trained with the stability results of the plurality of target generation image samples as the convergence condition, the stability of the target key point detection network is improved.
Fig. 10 is a schematic structural diagram of an embodiment of a training apparatus for a target object keypoint detection model provided by the present disclosure, and as shown in fig. 10, the apparatus of the present embodiment includes an obtaining module 1001 and a processing module 1002, wherein,
an obtaining module 1001, configured to obtain a video sample, where the video sample includes: a plurality of first to-be-detected image samples containing first target objects, where the first to-be-detected image samples include to-be-detected image samples in which the first target objects are not labeled with key point information;
a processing module 1002, configured to input a plurality of first to-be-detected image samples into a candidate keypoint detection network, so as to obtain first candidate keypoints corresponding to each first to-be-detected image sample;
the processing module 1002 is further configured to input the plurality of first to-be-detected image samples and the first candidate keypoints corresponding to each of the first to-be-detected image samples into a target self-encoding network, so as to obtain a plurality of target-generated image samples;
the processing module 1002 is further configured to update parameters of the candidate keypoint detection network according to the stability results of the image samples generated by the multiple targets, and return to execute inputting the multiple first to-be-detected image samples into the candidate keypoint detection network to obtain first candidate keypoints corresponding to each first to-be-detected image sample; and determining the candidate key point detection network as a target key point detection network until the stability result meets a first preset condition.
Optionally, the target self-encoding network includes: a first encoder, a second encoder and a decoder;
the processing module 1002 is specifically configured to, for each first to-be-detected image sample, perform first encoding processing on a first candidate keypoint corresponding to the first to-be-detected image sample by using a first encoder to obtain a keypoint feature corresponding to the first candidate keypoint; performing second encoding processing on the first image sample to be detected by using a second encoder to obtain image characteristics corresponding to the first image sample to be detected; and obtaining a target generation image sample corresponding to the first image sample to be detected by using the decoder according to the key point characteristics and the image characteristics.
Optionally, the processing module 1002 is specifically configured to train to obtain the target self-encoding network in the following manner:
inputting a plurality of second image samples to be detected containing second target objects into a first initial key point detection network to obtain first initial key points corresponding to each second image sample to be detected respectively, wherein the second image samples to be detected contain the image samples to be detected with the second target objects not labeled with key point information;
inputting a plurality of second image samples to be detected and the first initial key points corresponding to each second image sample to be detected into a candidate self-coding network to obtain a candidate generated image;
inputting the candidate generated image into a second initial key point detection network to obtain a second initial key point corresponding to the candidate generated image, wherein the first initial key point detection network is the same as the second initial key point detection network;
performing mode consistency judgment according to the first initial key point and the second initial key point to obtain a judgment result;
and updating parameters of the candidate self-coding network according to the judgment result, returning to execute the step of inputting a plurality of second image samples to be detected containing second target objects into a first initial key point detection network to obtain first initial key points corresponding to each second image sample to be detected until the judgment result meets a second preset condition, and determining the candidate self-coding network as the target self-coding network.
Optionally, the processing module 1002 is specifically configured to generate a first heatmap according to the first initial key point; generate a second heatmap according to the second initial key point; acquire key point features corresponding to the first heatmap, and acquire key point features corresponding to the second heatmap; and obtain the judgment result according to the key point features corresponding to the first heatmap and the key point features corresponding to the second heatmap.
Optionally, the processing module 1002 is specifically configured to obtain a candidate self-coding network by:
and training a plurality of third image samples to be detected containing third target objects to obtain the candidate self-coding network, wherein the third target objects in the plurality of third image samples to be detected are marked with key point information.
Optionally, the processing module 1002 is specifically configured to perform first encoding processing on a target key point labeled by the third image sample to be detected by using a first encoder, so as to obtain a target key point feature; performing second coding processing on the third image sample to be detected by using a second coder to obtain a target image characteristic corresponding to the third image sample to be detected; obtaining a target generation image according to the target key point characteristics and the target image characteristics by using the decoder; and training an initial self-coding network based on the third image sample to be detected and the target generation image until the initial self-coding network is converged to obtain the candidate self-coding network.
Optionally, the candidate keypoint detection network is obtained by training a plurality of fourth image samples to be detected containing fourth target objects, where the fourth target objects in the plurality of fourth image samples to be detected are labeled with keypoint information.
The processing module 1002 is further configured to obtain a generated video sequence based on the plurality of target generation image samples; and acquire a stability result by using a target video stability judging network according to the generated video sequence and the video sequences corresponding to the plurality of first to-be-detected image samples included in the video samples.
The processing module 1002 is further configured to perform three-dimensional convolution on a plurality of target generation images included in the generated video sequence by using a spatio-temporal convolution network to extract a first spatio-temporal feature; performing two-dimensional convolution operation on the first time-space characteristic to extract a first deepened space characteristic; performing three-dimensional convolution on a plurality of first to-be-detected image samples included in the video samples by utilizing a space-time convolution network to extract second space-time characteristics; performing two-dimensional convolution operation on the second space-time characteristic to extract a second deepened space characteristic; and obtaining a stability result according to the first space-time characteristic, the first deepening space characteristic, the second space-time characteristic and the second deepening space characteristic.
Optionally, the target object is a human face.
Fig. 11 is a schematic structural diagram of an apparatus for detecting key points of a target object according to the present disclosure, and the apparatus of this embodiment includes an obtaining module 1101 and a processing module 1102, wherein,
the obtaining module 1101 is configured to obtain a video to be detected, where the video to be detected includes: a plurality of first images to be detected including a first target object;
the processing module 1102 is configured to detect the multiple first images to be detected by using a target keypoint detection network to obtain a target object keypoint in each first image to be detected, where the target keypoint detection network is obtained by using the training method of the target object key detection model described in the foregoing embodiments.
The disclosed embodiment provides a computer device, including: the memory, the processor, and the computer program stored in the memory and capable of running on the processor, where the processor executes the computer program to implement the technical solution of any one of the method embodiments shown in fig. 1 to 9, and the implementation principle and the technical effect are similar, and are not described herein again.
The present disclosure also provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the solution of the method embodiment shown in any one of fig. 1 to 9.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (14)

1. A method for training a detection model of key points of a target object is characterized by comprising the following steps:
obtaining a video sample, wherein the video sample comprises: the method comprises the steps that a plurality of first to-be-detected image samples containing first target objects are obtained, wherein the first to-be-detected image samples contain to-be-detected image samples of which the first target objects are not marked with key point information;
inputting a plurality of first to-be-detected image samples into a candidate key point detection network to obtain first candidate key points corresponding to each first to-be-detected image sample;
inputting the plurality of first image samples to be detected and the first candidate key points corresponding to each first image sample to be detected into a target self-encoding network to obtain a plurality of target generation image samples;
obtaining a generated video sequence based on the plurality of target generation image samples;
obtaining a stability result by using a target video stability judging network according to the generated video sequence and the video sequences corresponding to the plurality of first to-be-detected image samples included in the video samples;
updating parameters of the candidate key point detection network according to the stability results of the image samples generated by the targets, and returning to execute the step of inputting the first to-be-detected image samples into the candidate key point detection network to obtain first candidate key points corresponding to each first to-be-detected image sample; and determining the candidate key point detection network as a target key point detection network until the stability result meets a first preset condition.
2. The method of claim 1, wherein the target self-encoded network comprises: a first encoder, a second encoder and a decoder;
inputting the plurality of first to-be-detected image samples and the first candidate keypoints corresponding to each first to-be-detected image sample into a target self-encoding network to obtain a plurality of target-generated image samples, including:
for each first image sample to be detected, performing the following steps to obtain a plurality of target generation image samples:
performing first encoding processing on a first candidate key point corresponding to the first image sample to be detected by using a first encoder to obtain a key point feature corresponding to the first candidate key point;
performing second encoding processing on the first image sample to be detected by using a second encoder to obtain image characteristics corresponding to the first image sample to be detected;
and obtaining a target generation image sample corresponding to the first image sample to be detected by using the decoder according to the key point characteristics and the image characteristics.
3. The method of claim 2, wherein the target self-coding network is obtained by training:
inputting a plurality of second image samples to be detected containing second target objects into a first initial key point detection network to obtain first initial key points corresponding to each second image sample to be detected respectively, wherein the second image samples to be detected contain the image samples to be detected with the second target objects not labeled with key point information;
inputting a plurality of second image samples to be detected and the first initial key points corresponding to each second image sample to be detected into a candidate self-coding network to obtain a candidate generated image;
inputting the candidate generated image into a second initial key point detection network to obtain a second initial key point corresponding to the candidate generated image, wherein the first initial key point detection network is the same as the second initial key point detection network;
performing mode consistency judgment according to the first initial key point and the second initial key point to obtain a judgment result;
and updating parameters of the candidate self-coding network according to the judgment result, returning to execute the step of inputting a plurality of second image samples to be detected containing second target objects into a first initial key point detection network to obtain first initial key points corresponding to each second image sample to be detected until the judgment result meets a second preset condition, and determining the candidate self-coding network as the target self-coding network.
4. The method according to claim 3, wherein the performing mode consistency judgment according to the first initial key point and the second initial key point to obtain a judgment result comprises:
generating a first heatmap according to the first initial key point;
generating a second heatmap according to the second initial key point;
acquiring key point features corresponding to the first heatmap, and acquiring key point features corresponding to the second heatmap;
and obtaining the judgment result according to the key point features corresponding to the first heatmap and the key point features corresponding to the second heatmap.
5. The method according to claim 3 or 4, wherein the candidate self-coding networks are obtained by:
and training a plurality of third image samples to be detected containing third target objects to obtain the candidate self-coding network, wherein the third target objects in the plurality of third image samples to be detected are marked with key point information.
6. The method according to claim 5, wherein the training with a plurality of third image samples to be detected containing a third target object to obtain the candidate self-coding network comprises:
performing first encoding processing on target keypoints labeled on the third image sample to be detected by using a first encoder to obtain target keypoint features;
performing second encoding processing on the third image sample to be detected by using a second encoder to obtain target image features corresponding to the third image sample to be detected;
obtaining a target generated image according to the target keypoint features and the target image features by using the decoder;
and training an initial self-coding network based on the third image sample to be detected and the target generated image until the initial self-coding network converges, to obtain the candidate self-coding network.
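Claim 6 describes a two-encoder/one-decoder self-coding network: one branch encodes the labeled keypoints, the other encodes the image, and a decoder reconstructs the image from the concatenated features. As a loose illustration of that data flow only (random linear maps stand in for the learned encoders and decoder, and all dimensions are invented), one might write:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear stand-ins for the two encoders and the decoder.
W_kp  = rng.normal(size=(16, 10))   # first encoder: keypoint coords -> keypoint features
W_img = rng.normal(size=(16, 64))   # second encoder: flattened image -> image features
W_dec = rng.normal(size=(64, 32))   # decoder: concatenated features -> generated image

def forward(keypoints, image):
    kp_feat  = W_kp @ keypoints               # target keypoint features
    img_feat = W_img @ image                  # target image features
    code = np.concatenate([kp_feat, img_feat])
    return W_dec @ code                       # target generated image

image     = rng.normal(size=64)
keypoints = rng.normal(size=10)
generated = forward(keypoints, image)
loss = float(np.mean((generated - image) ** 2))  # reconstruction objective
print(generated.shape, loss >= 0.0)
```

Training the initial self-coding network, per the claim, would minimize this reconstruction loss between the third image sample and the generated image until convergence.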
7. The method according to any one of claims 1 to 4, wherein the candidate keypoint detection network is obtained by training with a plurality of fourth image samples to be detected containing a fourth target object, wherein the fourth target object in each of the plurality of fourth image samples to be detected is labeled with keypoint information.
8. The method according to claim 1, wherein the obtaining, by using the target video stability determination network, the stability result according to the generated video sequence and the video sequence corresponding to the plurality of first image samples to be detected included in the video sample comprises:
performing three-dimensional convolution on a plurality of target generated images included in the generated video sequence by using a spatio-temporal convolutional network to extract a first spatio-temporal feature, and performing a two-dimensional convolution operation on the first spatio-temporal feature to extract a first deepened spatial feature;
performing three-dimensional convolution on the plurality of first image samples to be detected included in the video sample by using the spatio-temporal convolutional network to extract a second spatio-temporal feature, and performing a two-dimensional convolution operation on the second spatio-temporal feature to extract a second deepened spatial feature;
and obtaining the stability result according to the first spatio-temporal feature, the first deepened spatial feature, the second spatio-temporal feature and the second deepened spatial feature.
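The stability network of claim 8 stacks a 3-D convolution (extracting spatio-temporal features from the clip) with a 2-D convolution (deepening the spatial features). A naive, loop-based sketch of those two operations, assuming single-channel valid-mode convolutions and an arbitrary averaging kernel:

```python
import numpy as np

def conv3d(video, kernel):
    """Valid-mode 3-D convolution over a (T, H, W) clip: spatio-temporal features."""
    kt, kh, kw = kernel.shape
    T, H, W = video.shape
    out = np.empty((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(video[t:t+kt, i:i+kh, j:j+kw] * kernel)
    return out

def conv2d(frame, kernel):
    """Valid-mode 2-D convolution deepening the spatial features of one frame."""
    kh, kw = kernel.shape
    H, W = frame.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(frame[i:i+kh, j:j+kw] * kernel)
    return out

clip = np.random.default_rng(0).normal(size=(8, 16, 16))   # generated video sequence
st_feat = conv3d(clip, np.ones((3, 3, 3)) / 27)            # first spatio-temporal feature
deep_feat = conv2d(st_feat[0], np.ones((3, 3)) / 9)        # first deepened spatial feature
print(st_feat.shape, deep_feat.shape)   # (6, 14, 14) (12, 12)
```

The same two passes would run on the original video sample to obtain the second spatio-temporal and second deepened spatial features, and the stability result would then compare the two feature sets.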
9. The method according to any one of claims 1 to 4, wherein the target object is a human face.
10. A method for detecting keypoints of a target object, characterized by comprising:
acquiring a video to be detected, wherein the video to be detected comprises: a plurality of first images to be detected containing a first target object;
and detecting the plurality of first images to be detected by using a target keypoint detection network to obtain target object keypoints in each first image to be detected, wherein the target keypoint detection network is obtained by the method for training a keypoint detection model of a target object according to any one of claims 1 to 9.
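At inference time, claim 10 reduces to running the trained detector frame by frame over the video to be detected. A minimal illustrative sketch (not part of the claims; `target_keypoint_network` is a hypothetical callable standing in for the trained network):

```python
def detect_video_keypoints(frames, target_keypoint_network):
    """Return the target object keypoints for each first image to be detected."""
    return [target_keypoint_network(frame) for frame in frames]

# Usage with a dummy detector that returns one (x, y) keypoint per frame.
keypoints_per_frame = detect_video_keypoints([0, 1, 2], lambda f: [(f, f)])
print(keypoints_per_frame)  # [[(0, 0)], [(1, 1)], [(2, 2)]]
```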
11. A training device for a keypoint detection model of a target object, characterized by comprising:
an obtaining module, configured to obtain a video sample, wherein the video sample comprises: a plurality of first image samples to be detected containing a first target object, and the plurality of first image samples to be detected include image samples in which the first target object is not labeled with keypoint information;
a processing module, configured to input the plurality of first image samples to be detected into a candidate keypoint detection network to obtain a first candidate keypoint corresponding to each first image sample to be detected;
the processing module is further configured to input the plurality of first image samples to be detected and the first candidate keypoint corresponding to each first image sample to be detected into a target self-coding network to obtain a plurality of target generated image samples;
the processing module is further configured to obtain a generated video sequence based on the plurality of target generated image samples, and to obtain a stability result by using a target video stability determination network according to the generated video sequence and the video sequence corresponding to the plurality of first image samples to be detected included in the video sample;
the processing module is further configured to update parameters of the candidate keypoint detection network according to the stability result, and return to inputting the plurality of first image samples to be detected into the candidate keypoint detection network to obtain the first candidate keypoint corresponding to each first image sample to be detected, until the stability result meets a first preset condition, and determine the candidate keypoint detection network as the target keypoint detection network.
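The iterative scheme of claims 1 and 11 — detect keypoints, generate a video sequence, score its stability, update the detector, repeat until the first preset condition is met — can be sketched informally as follows. All function names (`detect`, `generate`, `stability_score`, `update`) are hypothetical stand-ins for the candidate keypoint network, the target self-coding network, the stability determination network, and the parameter update, respectively.

```python
def train_keypoint_detector(samples, detect, generate, stability_score,
                            update, threshold, max_iters=100):
    """Iterate detect -> generate -> stability check until the
    stability result meets the first preset condition."""
    score = stability_score([], samples)
    for _ in range(max_iters):
        keypoints = [detect(s) for s in samples]              # first candidate keypoints
        generated = [generate(s, k) for s, k in zip(samples, keypoints)]
        score = stability_score(generated, samples)           # compare the two sequences
        if score >= threshold:                                # first preset condition
            return score          # candidate network is now the target network
        update(score)             # adjust detector parameters and loop again
    return score
```

In a real system the update step would be a gradient step on the detector's parameters; here it is deliberately abstract.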
12. An apparatus for detecting keypoints of a target object, characterized by comprising:
an acquisition module, configured to acquire a video to be detected, wherein the video to be detected comprises: a plurality of first images to be detected containing a first target object;
a processing module, configured to detect the plurality of first images to be detected by using a target keypoint detection network to obtain target object keypoints in each first image to be detected, wherein the target keypoint detection network is obtained by the method for training a keypoint detection model of a target object according to any one of claims 1 to 9.
13. A computer device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 9 or the steps of the method according to claim 10.
14. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 9 or the steps of the method according to claim 10.
CN202110986015.XA 2021-08-26 2021-08-26 Method and equipment for training detection model of key points of target object and detection method and equipment Active CN113436064B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110986015.XA CN113436064B (en) 2021-08-26 2021-08-26 Method and equipment for training detection model of key points of target object and detection method and equipment

Publications (2)

Publication Number Publication Date
CN113436064A CN113436064A (en) 2021-09-24
CN113436064B (en) 2021-11-09

Family

ID=77798018

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110986015.XA Active CN113436064B (en) 2021-08-26 2021-08-26 Method and equipment for training detection model of key points of target object and detection method and equipment

Country Status (1)

Country Link
CN (1) CN113436064B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110533721A * 2019-08-27 2019-12-03 Hangzhou Normal University An indoor object 6D pose estimation method based on an enhanced autoencoder
CN110868598A * 2019-10-17 2020-03-06 Shanghai Jiao Tong University Video content replacement method and system based on an adversarial generative network
CN111523511A * 2020-05-08 2020-08-11 Hefei Institutes of Physical Science, Chinese Academy of Sciences Video image Chinese wolfberry branch detection method for a Chinese wolfberry harvesting and clamping device
CN112308770A * 2020-12-29 2021-02-02 Beijing Century TAL Education Technology Co Ltd Portrait conversion model generation method and portrait conversion method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10154281B2 (en) * 2016-01-22 2018-12-11 Mitsubishi Electric Research Laboratories, Inc. Method and apparatus for keypoint trajectory coding on compact descriptor for video analysis
CN110738071A * 2018-07-18 2020-01-31 Zhejiang Zhongzheng Intelligent Technology Co., Ltd. A face algorithm model training method based on deep learning and transfer learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zheng, Jingye et al.; "Boundary Adjusted Network Based on Cosine Similarity for Temporal Action Proposal Generation"; Neural Processing Letters; 2021-05-31; vol. 53, no. 4; pp. 2813-2828 *
Huang, Huaibo; "Face Image Synthesis and Analysis Based on Generative Models"; China Doctoral Dissertations Full-text Database (Information Science and Technology); 2020-02-15; no. 2; pp. I138-68 *

Similar Documents

Publication Publication Date Title
CN108549836B (en) Photo copying detection method, device, equipment and readable storage medium
JP5711387B2 (en) Method and apparatus for comparing pictures
CN107316035A Object recognition method and device based on a deep learning neural network
CN108921782A Image processing method, device and storage medium
CN111931567B (en) Human body identification method and device, electronic equipment and storage medium
JP2018028899A5 (en)
CN107220931A A high dynamic range image reconstruction method based on grayscale maps
WO2023035425A1 (en) Auto-encoder training method and component, and method and component for detecting abnormal image
CN111784624B (en) Target detection method, device, equipment and computer readable storage medium
Ulutas et al. Frame duplication/mirroring detection method with binary features
JP2021068056A (en) On-road obstacle detecting device, on-road obstacle detecting method, and on-road obstacle detecting program
CN112084939A (en) Image feature data management method and device, computer equipment and storage medium
Wu et al. Visual structural degradation based reduced-reference image quality assessment
Lecca et al. Comprehensive evaluation of image enhancement for unsupervised image description and matching
CN110245660B (en) Webpage glance path prediction method based on saliency feature fusion
CN116977674A (en) Image matching method, related device, storage medium and program product
CN113436064B (en) Method and equipment for training detection model of key points of target object and detection method and equipment
KR20210076660A (en) Method and Apparatus for Stereoscopic Image Quality Assessment Based on Convolutional Neural Network
CN115567736A Video content detection method, apparatus, device and storage medium
Saxena et al. Video inpainting detection and localization using inconsistencies in optical flow
KR102101481B1 Apparatus for learning portable security image based on artificial intelligence and method for the same
KR101394473B1 (en) Method for detecting moving object and surveillance system thereof
JP4879257B2 (en) Moving object tracking device, moving object tracking method, and moving object tracking program
Li et al. Detection of partially occluded pedestrians by an enhanced cascade detector
WO2010070128A1 (en) Method for multi-resolution motion estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant