Detailed Description
The following description of the embodiments of the present application is made clearly and completely with reference to the accompanying drawings. It is evident that the embodiments described are some, but not all, embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
The flow diagrams depicted in the figures are merely illustrative; not all of the elements and operations/steps need to be included, nor performed in the order described. For example, some operations/steps may be further divided, combined, or partially merged, so that the actual order of execution may change according to the actual situation.
It is to be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that, in order to clearly describe the technical solutions of the embodiments of the present application, the words "first", "second", etc. are used in the embodiments to distinguish between identical or similar items having substantially the same function and effect. For example, the first image set and the second image set are distinguished merely as different image sets, with no order of precedence implied. It will be appreciated by those skilled in the art that the words "first", "second", and the like do not limit quantity or order of execution, and that items qualified by "first" and "second" are not necessarily different.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
In order to facilitate understanding of the embodiments of the present application, some words related to the embodiments of the present application are briefly described below.
1. Action generation refers to generating a series of actions of a person from a static person image. For example, given a static person image, a trained deep learning model automatically generates a motion video from a series of skeletal joint motion sequences (also called action driving sources), so that the person moves in a given motion pattern, such as dancing or sports.
2. Texture mapping. In computer graphics, texture mapping is a technique that uses images, functions, or other data sources to alter the appearance of an object's surface. Texturing is an efficient technique for "modeling" object surface properties. The pixels of an image texture are commonly referred to as texels. By applying a projector function to points in space, a set of so-called parameter-space values is obtained, which index into the texture; one or more corresponder functions then transform these parameter-space values into texture space. This transformation is referred to as mapping, i.e., texture mapping. UV mapping precisely corresponds each point on a 2D projection image to the surface of a 3D model object and performs smoothing interpolation at the gaps between points.
3. Warp transformation, namely affine transformation of an image, infers the image information at the next moment from the optical flow and the current image information (an illustrative warping sketch is given after these term definitions).
4. 3D reconstruction, also called three-dimensional reconstruction, refers to establishing a mathematical model of a three-dimensional object suitable for computer representation and processing, so that its properties can be processed, operated on, and analyzed in a computer environment; it is also a key technology for building, in a computer, virtual reality that expresses the objective world. Common 3D reconstruction algorithms include HMR, SMPLify-X, and Total Capture. HMR is an end-to-end Human Mesh Recovery framework that reconstructs a 3D SMPL mesh from an RGB image containing a human body and attempts to project it back onto the picture. The SMPLify-X algorithm is associated with the SMPL-X model (an expressive extension of SMPL); it follows the SMPLify method and uses OpenPose to estimate two-dimensional image features bottom-up. OpenPose detects joints of the body, hands, and feet as well as facial features, so gestures, expressions, and other aspects of human motion can be captured better. Total Capture, specifically "A 3D Deformation Model for Tracking Faces, Hands, and Bodies", can estimate the poses of the face, body, and hands simultaneously and fuse them into a single three-dimensional model.
5. Barycentric coordinates, also called centroid coordinates, are a concept in computer graphics and 3D rendering. A 3D model of a human body can be divided into a plurality of triangular patches, and the vertices of these triangular patches have a one-to-one correspondence across frames. When an action is performed, rendering information is looked up from a 2D image of the reference image set; doing so for a 3D action requires the correspondence of the vertices. Each point on a triangular patch has a physical meaning: vertices correspond directly, and the alignment of pixels other than the vertices is calculated with barycentric coordinates (a small interpolation sketch is given after these definitions). In an embodiment of the application, barycentric coordinates are used primarily to calculate the flow mapping of pixels between the images of the two image sets.
6. A GAN network, in full a Generative Adversarial Network, is a machine learning method. A GAN comprises two networks, a generation network and a discrimination network, which may also be referred to as a generative model and a discriminative model, respectively. Taking picture generation as an example, suppose the final goal is to generate cartoon portraits with a GAN. The generative model is the network that generates pictures: it receives random noise z and generates a picture from this noise, the generated data being denoted G(z). The discriminative model is the network that judges whether a picture is "real" (i.e., whether it is fabricated). Its input is x, where x represents a picture, and its output D(x) represents the probability that x is a real picture: an output of 1 means the picture is certainly real, and an output of 0 means it cannot be real. During training, the goal of the generation network is to produce fake pictures that deceive the discrimination network, while the goal of the discrimination network is to tell generated pictures apart from real ones, which forms an adversarial game (a minimal training-loop sketch is given after these definitions). The abilities of the generative and discriminative models increase together during training, until training reaches equilibrium when the discrimination network can no longer distinguish real pictures from fake ones.
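As an illustration of the warp transformation in term 3 above, the following sketch (not part of the original disclosure; OpenCV is assumed as the sampling backend) warps an image with a dense flow field by backward sampling:

```python
import numpy as np
import cv2

def warp_with_flow(image: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp an H x W x 3 image with an H x W x 2 flow field (dx, dy per pixel)."""
    h, w = flow.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    # For every output pixel, sample the input image at (x + dx, y + dy).
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(image, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```

As an illustration of term 5 above, the following sketch computes the barycentric weights of a 2D point inside a triangle and uses them to interpolate a per-vertex attribute; the triangle, the point, and the attribute values are made-up example data:

```python
import numpy as np

def barycentric_weights(p, a, b, c):
    """Barycentric weights (wa, wb, wc) of 2D point p with respect to triangle (a, b, c)."""
    v0, v1, v2 = b - a, c - a, p - a
    d00, d01, d11 = v0 @ v0, v0 @ v1, v1 @ v1
    d20, d21 = v2 @ v0, v2 @ v1
    denom = d00 * d11 - d01 * d01
    wb = (d11 * d20 - d01 * d21) / denom
    wc = (d00 * d21 - d01 * d20) / denom
    return 1.0 - wb - wc, wb, wc

# Example: interpolate a per-vertex attribute (here, a 2D pixel position in another image).
a, b, c = np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])
wa, wb, wc = barycentric_weights(np.array([0.25, 0.25]), a, b, c)
attr_a, attr_b, attr_c = np.array([10.0, 20.0]), np.array([30.0, 20.0]), np.array([10.0, 40.0])
interpolated = wa * attr_a + wb * attr_b + wc * attr_c  # attribute aligned by barycentric weights
```

As an illustration of term 6 above, the following is a minimal GAN training-loop sketch in PyTorch; the noise dimension, image size, and optimizer settings are assumptions, and this is not the network used in the embodiments:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 28 * 28), nn.Tanh())
D = nn.Sequential(nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real: torch.Tensor):
    batch = real.size(0)
    fake = G(torch.randn(batch, 64))

    # Discriminator: real images should score 1, generated images 0.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(batch, 1)) + bce(D(fake.detach()), torch.zeros(batch, 1))
    loss_d.backward()
    opt_d.step()

    # Generator: try to make the discriminator score generated images as 1.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(batch, 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```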
At present, human motion generation schemes generally rely on human key points or pose estimation, but such schemes have difficulty generating motion images with high-definition details; for example, they carry little facial information. In addition, different human bodies differ in morphology, such as height and build, and these details cannot be represented by human key points or pose estimation alone.
To this end, embodiments of the present application provide a method for constructing an action generation model, an action image generation method, a computer device, and a storage medium. When the action generation model is used in a motion generation scheme for a user, action images with more details can be generated, for example images containing more facial information, human body morphology, clothes, hair, and the like, which can improve the user's experience.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a schematic diagram of a method for constructing an action generation model according to an embodiment of the present application. The method may be applied to a computer device, in particular a computer device dedicated to model construction, for example one including a GPU.
As shown in fig. 1, the construction method of the action generation model includes steps S101 to S105.
S101, acquiring a first image set and a second image set of a target person, wherein the first image set and the second image set comprise sequence images of actions of the target person.
Since a plurality of images can form a sequence, a sequence can be formed according to shooting time, for example from the video frames of a video, or by shooting regular actions of the target person. Each image in such a sequence is called a sequence image, and the first image set and the second image set each comprise a plurality of sequence images. Of course, a video of regular actions of the target person can also be shot, and the first image set and the second image set can be determined from the video frames of that video and used as training sets for constructing the action generation model.
It should be noted that, in the embodiment of the present application, the first image set may also be referred to as a target image set, and the second image set may also be referred to as a reference image set.
In some embodiments, in order to quickly obtain a training set of a target person and thereby improve the efficiency and accuracy of constructing an action generation model, an action video of the target person may further be obtained, and the video frames of the action video may be divided to obtain a first image set and a second image set. The first image set and the second image set form the training set of the target person and are used for training the action generation model.
Specifically, a camera or a terminal device can be used to capture a video of regular actions of the target person as the action video of the target person; that is, the video frames of the action video need to include regular actions of the target person. The video frames of the action video are divided into a target image set (target) and a reference image set (source), each of which includes a plurality of video frames. Since these video frames are continuous on the time axis, they can also be referred to as sequence images; that is, the target image set and the reference image set each include a plurality of sequence images, and together they are used as the training set.
In some embodiments, to ensure accuracy of model construction, the action video may also be required to include more than a preset number of video frames, for example at least 15000 frames. Of course, the duration of the video may be constrained instead, for example requiring the action video to be longer than 10 minutes; when duration is used as the criterion, the frame rate of the camera capturing the video also needs to be above a certain threshold, for example greater than 24 frames/second.
The video frames in the action video may be divided according to the video sequence, or divided alternately; for example, the 1st to nth frames are assigned to the target image set, the (n+1)th to mth frames to the reference image set, the (m+1)th to ith frames to the target image set, the (i+1)th to jth frames to the reference image set, and so on, as illustrated by the sketch below.
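A minimal sketch (OpenCV is assumed for decoding, and the chunk size is an arbitrary assumption) of alternately dividing the video frames of an action video into a target image set and a reference image set:

```python
import cv2

def split_video(path: str, chunk: int = 100):
    """Read a video and assign alternating chunks of frames to the two image sets."""
    cap = cv2.VideoCapture(path)
    target_set, reference_set, buffer, index = [], [], [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        buffer.append(frame)
        if len(buffer) == chunk:
            (target_set if index % 2 == 0 else reference_set).extend(buffer)
            buffer, index = [], index + 1
    (target_set if index % 2 == 0 else reference_set).extend(buffer)  # leftover frames
    cap.release()
    return target_set, reference_set
```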
S102, performing three-dimensional reconstruction according to the sequence images of the first image set to obtain a plurality of first 3D models.
Specifically, a 3D reconstruction technique may be used to three-dimensionally reconstruct the sequence images of the first image set to obtain a plurality of first 3D models; since these models are reconstructed from the sequence images of the first image set, they may be referred to as first 3D models corresponding to the first image set. The 3D reconstruction can be performed with a preset 3D reconstruction algorithm, where the preset 3D reconstruction algorithm comprises at least one of the HMR algorithm, the SMPLify-X algorithm, and the Total Capture algorithm.
In the embodiment of the application, the SMPLify-X algorithm is preferably selected, since it can better capture human motions such as gestures and expressions.
In some embodiments, three-dimensional reconstruction may also be performed on the sequence images of the second image set to obtain a plurality of first 3D models; since these models are reconstructed from the sequence images of the second image set, they may be referred to as first 3D models corresponding to the second image set.
A flow mapping relationship of pixels between the sequence images of the second image set and the sequence images of the first image set can thus be calculated using barycentric coordinates, based on the first 3D models corresponding to the second image set and to the first image set. The flow mapping relationship is used to determine alignment features of the sequence images of the second image set and the sequence images of the first image set.
Based on the reconstructed first 3D models, a flow mapping relationship (appearance flow) of pixels between the sequence images of the target image set and those of the reference image set may be calculated using barycentric coordinates. Since the target image set and the reference image set share the same topology when used for 3D model reconstruction, for example with SMPLify-X, the flow mapping relationship of pixels between the target image set and the reference image set can be established from the reconstructed 3D models. Specifically, if a pixel on a sequence image (video frame) of the target image set is visible on a sequence image (video frame) of the reference image set, the flow mapping relationship of pixels between the target image set and the reference image set can be calculated from the barycentric coordinates.
Regarding barycentric coordinates in 3D rendering: a 3D human body may be divided into a plurality of triangular patches whose vertices have a one-to-one correspondence. When motion driving is performed, rendering is needed in addition to motion, and for a 3D action the rendering may be determined from a video frame of the reference image set, which requires the correspondence of vertices. Each point on a triangular patch has a physical meaning: vertices correspond directly, and pixels other than the vertices are completed by calculation with barycentric coordinates.
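A hedged sketch (not from the original disclosure) of how the flow mapping could be computed under the shared-topology assumption. It presumes a rasterizer has already produced, for a target frame, a per-pixel triangle index (`face_id`, -1 for background) and barycentric weights (`bary`); `faces` holds the shared triangle vertex indices, and `src_verts_2d` the mesh vertices projected into the reference (source) image. These inputs and names are illustrative assumptions:

```python
import numpy as np

def appearance_flow(face_id, bary, faces, src_verts_2d, tgt_shape):
    """Per-pixel flow from the target image to the source image; NaN where not visible."""
    h, w = tgt_shape
    flow = np.full((h, w, 2), np.nan, dtype=np.float32)
    valid = face_id >= 0
    tri = faces[face_id[valid]]                 # (N, 3) vertex indices per covered pixel
    corners = src_verts_2d[tri]                 # (N, 3, 2) source-image vertex positions
    weights = bary[valid][..., None]            # (N, 3, 1) barycentric weights
    src_xy = (weights * corners).sum(axis=1)    # barycentric interpolation of source positions
    ys, xs = np.nonzero(valid)
    flow[ys, xs, 0] = src_xy[:, 0] - xs         # dx from target pixel to source position
    flow[ys, xs, 1] = src_xy[:, 1] - ys         # dy from target pixel to source position
    return flow
```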
The flow mapping relationship of pixels between the sequence images of the first image set and the second image set may also be calculated in other manners besides barycentric coordinates, which is not limited herein.
S103, obtaining a texture map of the target person, and mapping the plurality of first 3D models according to the texture map to obtain a plurality of texture maps.
Specifically, the texture map of the target person may be obtained, for example, from a server where it was stored in advance and fetched when model construction is required, or the texture map of the target person may be determined from a plurality of images of the target person. After the texture map of the target person is obtained, the plurality of first 3D models can be mapped according to the texture map to obtain a plurality of texture maps, where these texture maps are images obtained by mapping the first 3D models with the texture map of the target person.
Obtaining the texture map of the target person may specifically include: obtaining a plurality of captured images of the target person, the plurality of captured images being images of the target person taken from different shooting angles; reconstructing, based on a multi-view 3D reconstruction algorithm, a second 3D model corresponding to the target person from the plurality of captured images; and determining the texture map of the target person according to the correspondence between the second 3D model and the captured images.
It should be noted that any 3D reconstruction algorithm may be used to obtain the texture map, but a general multi-view 3D reconstruction algorithm may be used to build the 3D model so that the data can be processed quickly.
In some embodiments, to obtain a more comprehensive and accurate texture map, so that the trained action generation model can generate images containing more detail, the target person may also be asked to hold a preset posture, such as an A-pose, while being photographed from different shooting angles.
For example, a camera or a terminal device may be used to collect multiple images of the target person from multiple views, such as 100 images; specifically, the target person may be asked to maintain an A-pose, and multiple images of the target person in the A-pose may be taken from multiple angles.
After the plurality of images corresponding to the multiple views of the target person are obtained, the pose and body shape in each image are estimated by a multi-view method and a 3D model, namely a second 3D model, is established; specifically, 100 second 3D models are established from the 100 captured multi-view images, and the texture of the target person is completed according to the established correspondence between the 100 second 3D models and the captured images, thereby obtaining the texture map of the target person. An exemplary texture map is shown in fig. 2.
It should be noted that, due to errors in the 3D models and misalignment between the 3D points and their 2D projection points, the texture map obtained in this way may be rough or even inaccurate.
S104, projecting the plurality of first 3D models to obtain a plurality of 2D projection images.
Specifically, the 3D points in each first 3D model are projected onto a 2D plane to obtain a corresponding 2D projection image, so that a plurality of 2D projection images can be obtained.
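A minimal sketch of this projection step; a weak-perspective camera and a point-scatter rasterization are assumptions made for illustration, since the embodiments do not fix a specific camera model:

```python
import numpy as np

def project_vertices(vertices, scale=1.0, translation=(0.0, 0.0)):
    """vertices: (N, 3) array of 3D points; returns (N, 2) image-plane coordinates."""
    xy = vertices[:, :2]                      # drop depth under a weak-perspective assumption
    return scale * xy + np.asarray(translation)

def rasterize_points(points_2d, height, width):
    """Scatter projected points into a binary image as a crude 2D projection image."""
    image = np.zeros((height, width), dtype=np.uint8)
    cols = np.clip(np.round(points_2d[:, 0]).astype(int), 0, width - 1)
    rows = np.clip(np.round(points_2d[:, 1]).astype(int), 0, height - 1)
    image[rows, cols] = 255
    return image
```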
Since the sequence images in the first image set are continuous, the reconstructed plurality of first 3D models, texture maps, and the projected plurality of 2D projection images also have continuity.
S105, constructing an action generating model according to the sequence image of the second image set, the 2D projection image and the texture map.
Specifically, an action generation model to be constructed can be selected, and the sequence images of the second image set, the 2D projection images, and the texture map are input into the action generation model to be constructed to obtain output action images; training continues until the output action images meet the requirement, and the parameters of the action generation model are then saved, so that the constructed action generation model is obtained. Whether the requirement is met can be determined by human judgment, or by comparison with the sequence images in the second image set, for example by checking that the difference between pixels at corresponding positions of a sequence image in the second image set and the output action image is within a certain range.
In this model construction method, the input conditions for model construction are modeled finely, namely the sequence images of the second image set, the 2D projection images, and the texture map are used as the input conditions, so that the constructed action generation model can generate action images containing more details.
It should be noted that the method for constructing an action generation model described in the above embodiment builds a model for one target person as the subject. Of course, models may also be built for a plurality of target persons, where each target person corresponds to one training set and one texture map; likewise, a model may be built for the same target person by collecting the training set and the corresponding texture map multiple times.
The action generation model to be constructed may include a preset network, which may specifically be a neural network, for example a GAN network; the sequence images of the second image set, the 2D projection images, and the texture map are input to the GAN network for model construction until the generation network and the discrimination network of the GAN reach equilibrium, so as to obtain the action generation model.
In some embodiments, in order to further improve the detail of the generated action images, the embodiment of the application not only improves the construction conditions of the model but also refines the action generation model to be constructed. Illustratively, as shown in FIG. 3, the preset network includes a backbone network model, a first branch network model, and a second branch network model, where the backbone network model includes a generative adversarial network and the first and second branch network models each include an Encoder-Decoder network.
The first branch network model is used for fusing, into the backbone network model, alignment features of the sequence images of the second image set and the sequence images of the first image set; the second branch network model is used for fusing, into the backbone network model, texture features corresponding to the texture map; and the backbone network model is used for generating an action image according to the 2D projection image, the alignment features, and the texture features. By supplementing the texture features and the alignment features, finer action images can be obtained, including facial expression, hair, clothing, and the like.
Correspondingly, during model construction, the 2D projection image can be input into the backbone network model, the sequence image of the second image set into the first branch network model, and the texture map into the second branch network model, where the first branch network model fuses the alignment features of the sequence image of the second image set and the sequence image of the first image set at the decoding end of the backbone network model, and the second branch network model fuses the texture features corresponding to the texture map at the decoding end of the backbone network model.
Specifically, the alignment features of the sequence images of the second image set and the sequence images of the first image set, and the texture features corresponding to the texture map, are added to the decoding end (Decoder) of the backbone network model by concatenation (Concat). This further supplements detail generation, generalizes to the action generation of arbitrary persons, and improves the generalization capability of the action generation model, as illustrated by the sketch below.
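A hedged sketch (layer sizes and module names are assumptions, not the modules of the embodiments) of fusing branch features into a backbone decoder block by channel-wise concatenation:

```python
import torch
import torch.nn as nn

class FusionDecoderBlock(nn.Module):
    def __init__(self, main_ch, align_ch, tex_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(main_ch + align_ch + tex_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, main_feat, align_feat, tex_feat):
        # Concat fusion: backbone features + alignment features + texture features.
        fused = torch.cat([main_feat, align_feat, tex_feat], dim=1)
        return self.up(self.conv(fused))
```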
The first branch network model may also be referred to as a reference map Encoder-Decoder network model, the second branch network model may also be referred to as a texture map Encoder-Decoder network model, and the backbone network model may also be referred to as a sequence Encoder-Decoder network model, as shown in fig. 4, 6 and 7, respectively.
Illustratively, as shown in fig. 4, the input of the reference map Encoder-Decoder network model is a reference image, and the model reconstructs that reference image; the reference image is a sequence image in the second image set. By reconstructing the reference image, the features of the reference image can be better extracted and fused into the backbone network model.
In some embodiments, inputting the sequence image of the second image set to the first branch network model may specifically include: determining an affine transformation map corresponding to the sequence image of the second image set according to the flow mapping relationship; inputting the sequence image of the second image set and its corresponding affine transformation map to the first branch network model; reconstructing the sequence image of the second image set; aligning the features at the encoding end of the first branch network model using the flow mapping relationship to obtain alignment features; and fusing the alignment features into the decoding end of the backbone network model.
As shown in fig. 5, the sequence image Source of the second image set is transformed using the flow mapping relationship (appearance flow) to obtain a corresponding affine transformation map Warp. Both the sequence image Source and the affine transformation map Warp are input to the reference map Encoder-Decoder network model, the sequence image Source is reconstructed, the features at the encoding end (Encoder) of the first branch network model are aligned using the appearance flow to obtain the alignment features, and the alignment features are fused into the decoding end of the backbone network model.
The alignment features help address the difficulty of reconstructing target persons wearing loose clothing or having loose or tangled hair. The sequence Encoder-Decoder network model further uses the features of the texture map and of the reference image: the texture map is used to reconstruct the real action image and its features are added to the sequence features, while the reference image provides all the details of the affine transformation map Warp. The features corresponding to the sequence image Source are added according to the appearance flow; for example, if the appearance flow indicates where the eye position of the sequence image Source falls in the sequence image Target of the first image set, the features are added directly to the corresponding positions. This ensures the high definition and continuity of the generated action image.
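A hedged sketch of aligning source-image encoder features to the target pose with an appearance flow, using bilinear sampling; the normalization convention and tensor shapes are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def align_features(src_feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """src_feat: (B, C, H, W) encoder features; flow: (B, 2, H, W) pixel offsets (dx, dy)."""
    b, _, h, w = src_feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(src_feat.device)   # (2, H, W) pixel grid
    coords = base.unsqueeze(0) + flow                                  # absolute sample positions
    # Normalize to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)                   # (B, H, W, 2)
    return F.grid_sample(src_feat, grid, mode="bilinear", align_corners=True)
```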
Illustratively, as shown in FIG. 6, the input of the texture map Encoder-Decoder network model is a texture map, and its output image is a texture image, denoted as P_texture. The texture map has good textures for the face and hands but has no corresponding control points for clothes, hair, and the like. Because the texture map is a strong input condition, the images generated by this network are clear in face and gesture, but it is difficult for the trained network to obtain other useful reconstruction details such as hair and clothes from this input condition; those features therefore need to be obtained from the first branch network model for reconstruction.
Because the texture map is determined from only 100 images with different viewing angles, it contains errors and blur, that is, the texture map is rough or even inaccurate, so the texture map is slightly updated according to the reconstructed images to make it clearer. Specifically, the texture map is updated using the gradient back-propagated from the second branch network model to the texture map, and the plurality of first 3D models are then mapped according to the updated texture map to obtain the plurality of texture maps. By updating the texture map, the texture map becomes clearer, so that the details of the generated action images are clearer, as illustrated by the sketch below.
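A hedged sketch of the texture-update idea: the texture map is treated as a learnable tensor, so the gradient back-propagated through the second branch network also refines it. The network, the differentiable texture sampling, and the example data below are illustrative stand-ins, not the modules of the embodiments:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Rough texture map from step S103, treated as a learnable tensor (3 x 256 x 256 assumed).
texture = nn.Parameter(torch.rand(3, 256, 256))
# Stand-in for the texture map Encoder-Decoder network (second branch).
second_branch = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                              nn.Conv2d(16, 3, 3, padding=1))
optimizer = torch.optim.Adam([texture, *second_branch.parameters()], lr=1e-4)

def render_with_texture(uv_coords, tex):
    """Hypothetical differentiable texture mapping: sample `tex` at per-pixel UV coordinates."""
    return F.grid_sample(tex.unsqueeze(0), uv_coords, mode="bilinear", align_corners=True)

# One illustrative update step on a single (uv_coords, real_frame) pair of placeholder data.
uv_coords = torch.rand(1, 256, 256, 2) * 2 - 1          # placeholder UVs in [-1, 1]
real_frame = torch.rand(1, 3, 256, 256)                  # placeholder ground-truth frame
textured_image = render_with_texture(uv_coords, texture)
loss = F.l1_loss(second_branch(textured_image), real_frame)
optimizer.zero_grad()
loss.backward()                                           # gradient also flows into `texture`
optimizer.step()
```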
Illustratively, as shown in FIG. 7, the input of the sequence Encoder-Decoder network model is a 2D projection image; at the decoding end of the network, the features of the sequence image of the second image set and the texture features of the texture map are fused in, so that its output is an action image containing more detail.
In some embodiments, it is desirable to ensure not only that the generated action images include more detail, but also that they are continuous and stable, thereby preventing jitter. Inputting the 2D projection image into the backbone network model may therefore specifically include: obtaining, in the image sequence, the preceding-frame 2D projection images of the current 2D projection image, where the current 2D projection image is the 2D projection image that needs to be input into the backbone network model at the current moment, the preceding-frame 2D projection images are 2D projection images that have already been input into the backbone network model, and the preceding-frame 2D projection images are continuous with the current 2D projection image; obtaining the rendered images corresponding to the preceding-frame 2D projection images, where a rendered image is the output image obtained by inputting a preceding-frame 2D projection image into the backbone network model; and inputting the current 2D projection image, the preceding-frame 2D projection images, and the rendered images into the backbone network model.
Specifically, when the 2D projection images are input into the sequence Encoder-Decoder network model, where the plurality of 2D projection images form an image sequence, three frames can be input each time in view of the continuity between frames of the image sequence. Assuming the current time is t, the 2D projection images (t-2, t-1, t) and the rendered images (t-2, t-1) are input; this takes into account the continuity between the generated action image and the current 2D projection image, and when the image input at time t deviates, fine adjustment can be made according to the images at times t-2 and t-1. In addition, to maintain the continuity of the images, the sequence Encoder-Decoder network model is set up with two outputs, one being a rendered image and the other a weight image, i.e., the Mask image in fig. 7. In the application, the optical flow between the images at times t-1 and t is regressed using times t-2 and t-1, and then some details of the image at time t can be obtained from time t-1 according to the optical flow and used as a final supplement to the generated rendered image, which is denoted as P_series. The rendered image may be used as the final action image.
The 2D projection images (t-2, t-1, t) are, respectively, the 2D projection image at time t-2, the 2D projection image at time t-1, and the 2D projection image at time t, where the 2D projection image at time t is the current 2D projection image; the rendered images (t-2, t-1) are, respectively, the rendered image at time t-2 and the rendered image at time t-1. The rendered images at times t-2 and t-1 are the output images obtained by inputting, respectively, the 2D projection image at time t-2 and the 2D projection image at time t-1 into the backbone network model. The 2D projection images at times t-2 and t-1 are the preceding-frame 2D projection images.
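A hedged sketch (tensor shapes and data structures are assumptions) of assembling the backbone input at time t from the 2D projection images at t-2, t-1, t and the rendered images already generated at t-2 and t-1:

```python
import torch

def build_backbone_input(projections, rendered, t):
    """projections: list of (C, H, W) tensors indexed by time; rendered: dict {time: (3, H, W)}."""
    parts = [projections[t - 2], projections[t - 1], projections[t],
             rendered[t - 2], rendered[t - 1]]
    # Channel-wise concatenation, with a leading batch dimension for the backbone.
    return torch.cat(parts, dim=0).unsqueeze(0)
```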
As shown in fig. 8, fig. 8 shows a structure of an action generation model according to an embodiment of the present application. In fig. 8, the backbone network model outputs a rendered image and a weight image, the second branch network model outputs a texture image, and the action image output by the preset network is expressed as:
P_final = P_series * M + P_texture * (1 - M)
where P_final denotes the action image, P_series denotes the rendered image, M denotes the weight image, and P_texture denotes the texture image. Using the weight image and the texture image, an action image with clearer details can be obtained.
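A minimal sketch of the composition above: the weight image M blends the rendered image P_series from the backbone with the texture image P_texture from the second branch.

```python
import torch

def compose_final(p_series: torch.Tensor, p_texture: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
    """All tensors of shape (B, 3, H, W); m holds per-pixel weights in [0, 1]."""
    return p_series * m + p_texture * (1.0 - m)
```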
According to the model construction method provided by the above embodiments, by changing the input conditions of the network model and modifying the network, the final action generation model can generate more realistic action images; that is, the generated action images include more details such as facial information, human body morphology, and clothing, which can further improve the user's experience.
Referring to fig. 9, fig. 9 is a schematic flowchart of steps of an action image generation method according to an embodiment of the present application. The action image generation method may be applied to a computer device, which may be a server or a terminal device, and the computer device may store the action generation model trained in the foregoing embodiments.
As shown in fig. 9, the motion image generation method includes steps S201 to S204.
S201, acquiring a plurality of images of a user, and determining a texture map of the user according to the images;
S202, performing three-dimensional reconstruction according to the images to obtain a 3D model of the user;
S203, mapping the 3D model according to the texture map to obtain a texture map;
S204, generating an action image of the user according to the images and the texture map.
The plurality of images of the user may be obtained by continuously capturing multiple images of the user, or by shooting a video and extracting the plurality of images of the user from its video frames. Preferably, the images or the video may be captured from a plurality of different angles so that a more complete texture map can be obtained.
The texture map of the user is determined from the plurality of images; this can be done in the manner provided by the above embodiments, that is, the 3D model is established first, and the texture map of the user is then determined using the correspondence between the 3D model and the images used to establish it.
The action image of the user is generated according to the images and the texture map; specifically, a neural network model can be used to generate the action image of the user from the images and the texture map. For example, from the images and the texture map, an action image of the user can be generated using human key points or pose estimation, or, based on a pre-constructed target 3D model, other motions can be transferred onto the target 3D model using a 3D skeleton representation and motion analysis.
In some embodiments, after the multiple images and the corresponding texture map of the user are obtained, the images and the texture map of the user may be input into a pre-trained action generation model to output the action image of the user. The pre-trained action generation model can be a model constructed by the model construction method described above.
In some embodiments, an action driving source may further be acquired; after the multiple images, the action driving source, and the corresponding texture map of the user are obtained, the images, the action driving source, and the texture map of the user may be input into the pre-trained action generation model to output the action image of the user.
The action driving source is used for driving the pre-trained action generation model to generate action images. In the embodiment of the application, the action driving source may specifically comprise a series of 3D models representing different actions, and it drives the action generation model to generate action images whose actions are similar to those of the 3D models. The action driving source may come with the action generation model, or may be acquired separately. The action driving source may include a 3D model of the user, or a 3D model of another user.
The action image generation method can be applied to different subjects, giving different application scenarios, specifically application scenario 1 and application scenario 2.
In application scenario 1, the terminal device stores the trained action generation model in advance, executes the action image generation method, and inputs the image of the user and the corresponding texture map into the action generation model so that the action generation model outputs the corresponding action image.
In application scenario 2, the terminal device interacts with a server that stores the trained action generation model in advance. The terminal device executes the action image generation method and sends the image of the user and the corresponding texture map to the server, so that the server inputs them into the action generation model, the action generation model outputs the corresponding action image, and the server sends the output action image back to the terminal device.
It should be noted that the action image generation method according to the embodiments of the present application achieves technical effects similar to those of the method for constructing an action generation model provided in the above embodiments, which are therefore not repeated here.
Referring to fig. 10, fig. 10 is a schematic block diagram of a computer device according to an embodiment of the present application. As shown in FIG. 10, the computer device 300 includes one or more processors 301 and a memory 302, where the processors 301 and the memory 302 are connected by a bus 33, such as an I2C (Inter-Integrated Circuit) bus.
Wherein the one or more processors 301 work individually or jointly for performing the steps of the method of constructing an action generating model or the method of generating an action image provided by the above embodiments.
Specifically, the processor 301 may be a Micro-controller Unit (MCU), a Central Processing Unit (CPU), a Digital Signal Processor (DSP), or the like.
Specifically, the memory 302 may be a Flash chip, a Read-Only Memory (ROM), a magnetic disk, an optical disc, a USB flash drive, a removable hard disk, or the like.
The processor 301 is configured to execute a computer program stored in the memory 302, and implement the steps of the method for constructing an action generation model or the method for generating an action image provided in the above embodiment when the computer program is executed.
The processor 301 is for example configured to run a computer program stored in the memory 302 and to implement the following steps when said computer program is executed:
acquiring a first image set and a second image set of a target person, where the first image set and the second image set comprise sequence images of actions of the target person; performing three-dimensional reconstruction according to the sequence images of the first image set to obtain a plurality of first 3D models; obtaining a texture map of the target person, and mapping the plurality of first 3D models according to the texture map to obtain a plurality of texture maps; projecting the plurality of first 3D models to obtain a plurality of 2D projection images; and constructing an action generation model according to the sequence images of the second image set, the 2D projection images, and the texture map.
In some embodiments, the motion generation model comprises a backbone network model, a first branch network model and a second branch network model, wherein the first branch network model is used for fusing alignment features of the sequence images of the second image set and the sequence images of the first image set to the backbone network model, the second branch network model is used for fusing texture features corresponding to the texture map to the backbone network model, and the backbone network model is used for generating a motion image according to the 2D projection image, the alignment features and the texture features.
In some embodiments, the backbone network model comprises a generative antagonism network, and the first and second branch network models comprise Encoder-Decoder networks.
In some embodiments, the plurality of 2D projection images form an image sequence, and the processor is configured to:
obtaining, in the image sequence, a preceding-frame 2D projection image of the current 2D projection image, where the current 2D projection image is the 2D projection image that needs to be input into the backbone network model at the current moment, the preceding-frame 2D projection image is a 2D projection image that has already been input into the backbone network model, and the preceding-frame 2D projection image is continuous with the current 2D projection image; obtaining a rendered image corresponding to the preceding-frame 2D projection image, where the rendered image is the output image obtained by inputting the preceding-frame 2D projection image into the backbone network model; and inputting the current 2D projection image, the preceding-frame 2D projection image, and the rendered image into the backbone network model to obtain the action image.
In some embodiments, the processor is configured to implement:
calculating, based on the first 3D models corresponding to the second image set and to the first image set, a flow mapping relationship of pixels between the sequence images of the second image set and the sequence images of the first image set using barycentric coordinates, where the flow mapping relationship is used to determine alignment features of the sequence images of the second image set and the sequence images of the first image set.
In some embodiments, the processor is configured to implement:
determining an affine transformation map corresponding to the sequence image of the second image set according to the flow mapping relationship; inputting the sequence image of the second image set and the corresponding affine transformation map to the first branch network model; reconstructing the sequence image of the second image set; aligning the features at the encoding end of the first branch network model using the flow mapping relationship to obtain alignment features; and fusing the alignment features into the decoding end of the backbone network model.
In some embodiments, the processor is further configured to implement:
updating the texture map using the gradient back-propagated from the second branch network model to the texture map; and mapping the plurality of first 3D models according to the updated texture map to obtain the texture maps.
In some embodiments, the backbone network model outputs a rendered image and a weight image, the second branch network model outputs a texture image, and the action image output by the action generation model is expressed as:
P_final = P_series * M + P_texture * (1 - M)
where P_final denotes the action image, P_series denotes the rendered image, M denotes the weight image, and P_texture denotes the texture image.
In some embodiments, the processor is configured to implement:
Dividing video frames of the action video to obtain a first image set and a second image set of the target person.
In some embodiments, the processor, when implementing the obtaining the texture map of the target person, is specifically configured to implement:
obtaining a plurality of captured images of the target person, where the captured images are images of the target person taken from different shooting angles; reconstructing, based on a multi-view 3D reconstruction algorithm, a second 3D model corresponding to the target person from the captured images; and determining the texture map of the target person according to the correspondence between the second 3D model and the captured images.
In some embodiments, the first 3D models are reconstructed using a preset 3D reconstruction algorithm, the preset 3D reconstruction algorithm including at least one of the HMR algorithm, the SMPLify-X algorithm, and the Total Capture algorithm.
The processor 301 is for example configured to run a computer program stored in the memory 302 and to implement the following steps when said computer program is executed:
acquiring a plurality of images of a user and determining a texture map of the user according to the images; performing three-dimensional reconstruction according to the images to obtain a 3D model of the user; mapping the 3D model according to the texture map to obtain a texture map; and generating an action image of the user according to the images and the texture map.
In some embodiments, the processor, when implementing the generating the motion image of the user according to the plurality of images and the texture map, is specifically configured to implement:
And inputting the images and the texture map to a pre-trained action generating model to output the action image of the user, wherein the pre-trained action generating model can be any action generating model constructed according to the embodiment of the application.
The embodiment of the present application also provides a computer readable storage medium storing a computer program, where the computer program when executed by a processor causes the processor to implement the steps of the method for constructing an action generation model or the method for generating an action image provided in the above embodiment.
The computer readable storage medium may be an internal storage unit of the computer device according to any one of the foregoing embodiments, for example a hard disk or a memory of the terminal device. The computer readable storage medium may also be an external storage device of the terminal device, such as a plug-in hard disk provided on the terminal device, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash Card, or the like.
While the application has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the application. Therefore, the protection scope of the application is subject to the protection scope of the claims.