
CN111476868B - Animation generation model training and animation generation method and device based on deep learning - Google Patents

Info

Publication number: CN111476868B
Application number: CN202010264566.0A
Authority: CN (China)
Prior art keywords: key frame, network, initial, walking, sub
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN111476868A
Inventors: 屈桢深, 于淼, 李清华
Current Assignee: Harbin Institute of Technology Shenzhen
Original Assignee: Harbin Institute of Technology Shenzhen
Application filed by Harbin Institute of Technology Shenzhen
Priority to CN202010264566.0A
Publication of CN111476868A
Application granted
Publication of CN111476868B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 - Animation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention provides deep learning-based animation generation model training, an animation generation method and corresponding devices, relating to the field of computer vision. The training method comprises the following steps: acquiring a training set sequence, wherein the training set sequence comprises a plurality of key frame images; inputting an initial key frame into the animation generation model and determining predicted subsequent key frames; determining the value of a loss function according to the predicted subsequent key frames and the actual subsequent key frames; and adjusting the parameters of the animation generation model according to the value of the loss function until a convergence condition is met, thereby completing the training of the animation generation model. In the invention, the input is an initial key frame given by an animator, the trained model fills in the other predicted subsequent key frames, and a series of animations are effectively generated through the model, which both ensures that the character design created by the animator is not changed and spares the animator a great deal of repetitive work.

Description

Animation generation model training, animation generation method and device based on deep learning

Technical field

The present invention relates to the field of computer vision, and in particular to a deep learning-based animation generation model training method, an animation generation method and corresponding devices.

Background art

Most animation relies on computer key frame techniques. Computer key frame technology is currently the mainstream method used in computer animation to complete the laborious work of interpolating intermediate frames; its basic idea is to interpolate the intermediate frames between two key frames. Techniques subsequently proposed on this basis, such as linear interpolation, spline interpolation, bi-interpolation, motion deformation and motion mapping, have met people's needs for animation generation to a certain extent.

Most of the methods used so far in computer animation generation are still key-frame based, but regardless of how the key frames are obtained, manual work remains. With the rise of deep learning in recent years, using deep learning to let computers generate animation automatically has gradually become a new research hotspot in computer animation.

At present, deep learning is applied in computer animation mainly in two ways. One uses the feature points of specific objects combined with deep learning to generate key frames, but extracting such feature points is difficult in practice, and inaccurate feature point extraction leads to unsatisfactory results. The other models an object in advance for parameterized generation: a specific object is modeled and parameterized, and the parameters are then varied over time to generate the animation. However, pixel-style characters are diverse, and most of the results of parameterizing a character model (such as height or build) are of no benefit to generating walking animations, which needlessly adds a large amount of work and does not match how pixel-style character animation is actually used.

Summary of the invention

The present invention aims to solve the technical problems in the related art at least to a certain extent. To this end, an embodiment of the first aspect of the present invention provides a deep learning-based animation generation model training method, which includes:

obtaining a training set sequence, wherein the training set sequence includes a plurality of key frame sequences, each key frame sequence includes a plurality of key frame images of different walking postures of a character, the key frame images include an initial key frame and corresponding actual subsequent key frames, the initial key frame is the walking posture of the character in an initial state, and the actual subsequent key frames are walking postures of the character after the initial state;

inputting the initial key frame into the animation generation model, and determining predicted subsequent key frames;

determining a value of a loss function according to the predicted subsequent key frames and the actual subsequent key frames;

adjusting parameters of the animation generation model according to the value of the loss function until a convergence condition is met, thereby completing the training of the animation generation model.

Thus, in the present invention the input is an initial key frame given by an animator, the trained model fills in the other predicted subsequent key frames, and a series of animations are effectively generated through the model, which both ensures that the character design created by the animator is not changed and spares the animator a great deal of repetitive work.

Further, the walking posture includes a walking direction and a walking action of the character; the walking direction includes forward, backward, left and right, and the walking action includes a left-leg stepping action, a right-leg stepping action and standing still.

Thus, through image translation and based on complete action material, pixel-style characters are trained to make different walking postures when facing different walking directions, so that a key frame of a pixel character standing still facing one direction is effectively translated into a key frame facing another direction, ensuring the accuracy of the predicted subsequent key frames.

Further, the animation generation model includes a first cascade network and a second cascade network, and the predicted subsequent key frames include first-cascade subsequent key frames and second-cascade subsequent key frames; inputting the initial key frame into the animation generation model and determining the predicted subsequent key frames includes:

inputting the initial key frame into the first cascade network, and determining the first-cascade subsequent key frames;

inputting the first-cascade subsequent key frames into the second cascade network, and determining the second-cascade subsequent key frames.

Thus, by designing a cascade network and dividing the work between the first cascade network and the second cascade network, the present invention can effectively generate a complete character walking action, and efficiently generates the predicted subsequent key frames of a pixel character's walking action when an animator produces a pixel-style character walking animation.

Further, inputting the initial key frame into the first cascade network includes: applying a mirror transformation to the initial key frame and then inputting it into the first cascade network.

Thus, through the mirror transformation, the present invention effectively avoids network training difficulties caused by local asymmetry of the character.

Further, the walking posture includes a walking direction and a walking action of the character, the first cascade network includes a first generation sub-network and a second generation sub-network, and the first-cascade subsequent key frames include a first generated key frame and a second generated key frame; inputting the initial key frame into the first cascade network and determining the first-cascade subsequent key frames includes:

inputting the initial key frame into the first generation sub-network, and determining the first generated key frame, wherein the first generated key frame has the same walking direction as the initial key frame but a different walking action;

inputting the initial key frame into the second generation sub-network, and determining the second generated key frame, wherein the second generated key frame has a different walking direction from the initial key frame but the same walking action.

Thus, in the first cascade network, the first generation sub-network effectively translates the initial key frame into key frames facing the same direction, and the second generation sub-network effectively translates the initial key frame into key frames facing different directions.

Further, the walking direction includes a first direction, the initial key frame is a first-direction initial frame representing the initial state of the character facing the first direction, and the first generated key frame includes a plurality of first-direction action sub-frames whose walking actions differ from one another.

Thus, the first generation sub-network effectively translates the initial key frame into key frames in which the character faces the first direction and takes different actions.

Further, the second generation sub-network includes a generator and a first discriminator, wherein:

the generator includes a plurality of convolutional layers and is used to output a generated prediction map;

the first discriminator includes a plurality of convolutional layers and fully connected layers and is used to compare the generated prediction map, the initial key frame, and the actual subsequent key frame corresponding to the initial key frame in the second generation sub-network; the output of the first discriminator is the discrimination result of this comparison.

Thus, the generator effectively outputs a generated prediction map, and the discriminator effectively judges whether the generated prediction map is accurate.

Further, the loss function of the generator is an L1 loss function, and the first discriminator uses a cross-entropy loss function.

Thus, the loss functions are chosen specifically for the generator and the discriminator, ensuring efficient and accurate network training.

Further, the second-cascade subsequent key frame is a third generated key frame; inputting the first-cascade subsequent key frames into the second cascade network and determining the second-cascade subsequent key frames includes:

inputting the second generated key frame into the second cascade network, and determining the third generated key frame, wherein the third generated key frame has the same walking direction as the second generated key frame but a different walking action.

Thus, through the second cascade network, the second generated key frame is effectively translated into key frames of different actions facing the same direction.

Further, the walking direction includes a second direction, a third direction and a fourth direction, the second generated key frame includes a second-direction initial frame, a third-direction initial frame and a fourth-direction initial frame, and the second cascade network includes a third generation sub-network, a fourth generation sub-network and a fifth generation sub-network; inputting the second generated key frame into the second cascade network and determining the third generated key frame includes:

inputting the second-direction initial frame into the third generation sub-network, and determining a plurality of second-direction action sub-frames, wherein each second-direction action sub-frame has the same walking direction as the second-direction initial frame but a different walking action, and the walking actions of the plurality of second-direction action sub-frames differ from one another;

inputting the third-direction initial frame into the fourth generation sub-network, and determining a plurality of third-direction action sub-frames, wherein each third-direction action sub-frame has the same walking direction as the third-direction initial frame but a different walking action, and the walking actions of the plurality of third-direction action sub-frames differ from one another;

inputting the fourth-direction initial frame into the fifth generation sub-network, and determining a plurality of fourth-direction action sub-frames, wherein each fourth-direction action sub-frame has the same walking direction as the fourth-direction initial frame but a different walking action, and the walking actions of the plurality of fourth-direction action sub-frames differ from one another.

Thus, the third generation sub-network effectively translates the second-direction initial frame into key frames of different actions facing the same direction; the fourth generation sub-network effectively translates the third-direction initial frame into key frames of different actions facing the same direction; and the fifth generation sub-network effectively translates the fourth-direction initial frame into key frames of different actions facing the same direction.

Further, the first generation sub-network, the third generation sub-network, the fourth generation sub-network and the fifth generation sub-network are identical and each include a forward generator, a reverse generator and a second discriminator, wherein:

the forward generator includes a plurality of convolutional layers, its input is the input key frame image corresponding to the generation sub-network to which it belongs, and its output is a forward generated prediction map;

the reverse generator includes a plurality of convolutional layers, its input is the forward generated prediction map, and its output is a restored prediction map;

the second discriminator includes a plurality of convolutional layers and fully connected layers and is used to compare the forward generated prediction map, the input key frame image, and the actual subsequent key frame corresponding to the input key frame image in the generation sub-network to which the discriminator belongs; the output of the second discriminator is the discrimination result of this comparison.

Thus, the forward generator effectively outputs the forward generated prediction map, the reverse generator effectively generates the restored prediction map, and the discriminator effectively judges whether the generated prediction map is accurate.

Further, the loss functions of the forward generator and the reverse generator both use the L1 loss function, and determining the loss function of the second discriminator includes:

determining a first L1 loss function according to the restored prediction map and the input key frame image;

determining a second L1 loss function according to the forward generated prediction map and the actual subsequent key frame corresponding to the input key frame image in the sub-network to which it belongs;

taking a weighted sum of the first L1 loss function and the second L1 loss function to determine the loss function of the second discriminator.

Thus, the loss functions are chosen specifically for the forward generator, the reverse generator and the discriminator, ensuring efficient and accurate network training.

To achieve the above objective, an embodiment of the second aspect of the present invention further provides a deep learning-based animation generation model training device, which includes:

an acquisition unit, configured to obtain a training set sequence, wherein the training set sequence includes a plurality of key frame sequences, each key frame sequence includes a plurality of key frame images of different walking postures of a character, the key frame images include an initial key frame and corresponding actual subsequent key frames, the initial key frame is the walking posture of the character in an initial state, and the actual subsequent key frames are walking postures of the character after the initial state;

a processing unit, configured to input the initial key frame into the animation generation model and determine predicted subsequent key frames, and further configured to determine the value of a loss function according to the predicted subsequent key frames and the actual subsequent key frames;

a training unit, configured to adjust the parameters of the animation generation model according to the value of the loss function until a convergence condition is met, thereby completing the training of the animation generation model.

The deep learning-based animation generation model training device has the same beneficial effects over the prior art as the deep learning-based animation generation model training method, which will not be repeated here.

To achieve the above objective, an embodiment of the third aspect of the present invention provides a deep learning-based animation generation method, which includes:

obtaining an initial state key frame;

inputting the initial state key frame into an animation generation model and determining predicted subsequent key frames, wherein the animation generation model is trained with the deep learning-based animation generation model training method described above;

performing foreground-background segmentation on the initial state key frame and the predicted subsequent key frames, and determining segmentation result pictures;

playing the segmentation result pictures in a loop to generate the animation.

Thus, in the present invention the input is an initial key frame given by the animator, the trained model fills in the other predicted subsequent key frames, and foreground-background segmentation is performed on the initial state key frame and the predicted subsequent key frames so as to obtain the walking animation of the character accurately. In summary, the present invention effectively generates a series of animations through the model, which both ensures that the character design created by the animator is not changed and spares the animator a great deal of repetitive work.
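
A minimal illustrative sketch of how such a generation pipeline could be driven, assuming a trained `model` that maps one initial key frame to a list of predicted subsequent key frame tensors; `segment_foreground` is a toy stand-in for the foreground-background segmentation step and is not part of the patented method:

```python
import torch

def segment_foreground(frame: torch.Tensor, bg_value: float = 1.0) -> torch.Tensor:
    """Toy foreground-background segmentation: pixels of a CxHxW frame that
    differ from a plain background colour are treated as the character."""
    return (frame - bg_value).abs().sum(dim=0) > 1e-3   # HxW boolean mask

def generate_animation_frames(model, initial_keyframe: torch.Tensor):
    """Drive a trained animation generation model with one initial key frame."""
    model.eval()
    with torch.no_grad():
        predicted = model(initial_keyframe.unsqueeze(0))       # predicted subsequent key frames
    frames = [initial_keyframe] + [p.squeeze(0) for p in predicted]
    masks = [segment_foreground(f) for f in frames]            # one mask per key frame
    return frames, masks
```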

Further, determining the segmentation result pictures includes:

determining a transparency channel according to the result of the foreground-background segmentation;

saving the initial state key frame and the predicted subsequent key frames as segmentation result pictures according to the transparency channel.

Thus, the transparency channel effectively marks the foreground part and the background part, so that the segmentation result pictures are determined efficiently.

Further, determining the transparency channel according to the result of the foreground-background segmentation includes:

setting the transparency channel value to 1 for the foreground part and to 0 for the background part.

Thus, by binarizing the foreground part and the background part, the foreground and background are effectively marked, so that the segmentation result pictures are determined efficiently.

Further, saving the initial state key frame and the predicted subsequent key frames as segmentation result pictures according to the transparency channel includes:

in the initial state key frame and the predicted subsequent key frames, rendering the parts whose transparency channel value is 1 colorless and leaving the color of the parts whose transparency channel value is 0 unchanged;

saving the segmentation result pictures in PNG format.

Thus, the initial state key frame and the predicted subsequent key frames are effectively saved as segmentation result pictures by means of the transparency channel.
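
A minimal sketch of saving one key frame together with its transparency channel as a PNG, using NumPy and Pillow; the 0/1 mask convention follows the description above (value 1 rendered colorless, value 0 left unchanged), and the function name and arguments are illustrative assumptions:

```python
import numpy as np
from PIL import Image

def save_segmented_frame(frame_rgb: np.ndarray, mask: np.ndarray, path: str) -> None:
    """Save an HxWx3 uint8 key frame as an RGBA PNG.

    `mask` holds the 0/1 transparency-channel values described above: pixels
    whose value is 1 are made fully transparent (colorless), pixels whose
    value is 0 keep their original color."""
    alpha = np.where(mask == 1, 0, 255).astype(np.uint8)   # 1 -> transparent, 0 -> opaque
    rgba = np.dstack([frame_rgb, alpha])
    Image.fromarray(rgba, mode="RGBA").save(path, format="PNG")

# Example: save_segmented_frame(frame, mask, "frame_01.png")
```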

Further, playing the segmentation result pictures in a loop to generate the animation includes repeating a single playback, wherein the single playback includes:

playing the segmentation result picture corresponding to the initial state key frame;

playing the segmentation result pictures corresponding to the predicted subsequent key frames in the order in which they were generated.

Thus, by following the generation order of the key frames, the segmentation result pictures are effectively played in a loop to generate the corresponding animation.
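
A minimal playback sketch, assuming the segmentation result pictures have been saved as PNG files listed in generation order; the OpenCV window, frame rate and loop count are illustrative choices only:

```python
import cv2

def play_walk_cycle(frame_paths, fps: int = 8, loops: int = 10) -> None:
    """Single playback: the initial-state frame first, then the predicted
    subsequent frames in generation order; the playback is repeated in a loop."""
    frames = [cv2.imread(p, cv2.IMREAD_UNCHANGED) for p in frame_paths]
    delay_ms = int(1000 / fps)
    for _ in range(loops):                              # repeat the single playback
        for frame in frames:
            cv2.imshow("walk cycle", frame)
            if cv2.waitKey(delay_ms) & 0xFF == ord("q"):
                cv2.destroyAllWindows()
                return
    cv2.destroyAllWindows()

# Example: play_walk_cycle(["frame_initial.png", "frame_01.png", "frame_02.png"])
```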

To achieve the above objective, an embodiment of the fourth aspect of the present invention provides a deep learning-based animation generation device, which includes:

an acquisition unit, configured to obtain an initial state key frame;

a processing unit, configured to input the initial state key frame into an animation generation model and determine predicted subsequent key frames, wherein the animation generation model is trained with the deep learning-based animation generation model training method described above, and further configured to perform foreground-background segmentation on the initial state key frame and the predicted subsequent key frames and save the segmentation result pictures;

a playback unit, configured to play the segmentation result pictures in a loop to generate the animation.

The deep learning-based animation generation device provided by the present invention takes an initial key frame given by the animator as input, fills in the other predicted subsequent key frames with the trained model, and performs foreground-background segmentation on the initial state key frame and the predicted subsequent key frames so as to obtain the walking animation of the character accurately. In summary, the present invention effectively generates a series of animations through the model, which both ensures that the character design created by the animator is not changed and spares the animator a great deal of repetitive work.

To achieve the above objective, an embodiment of the fifth aspect of the present invention provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, it implements the deep learning-based animation generation model training method described above, or implements the deep learning-based animation generation method described above.

The computer-readable storage medium has the same beneficial effects over the prior art as the above deep learning-based animation generation model training method and deep learning-based animation generation method, which will not be repeated here.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the deep learning-based animation generation model training method in an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of the animation generation model in an embodiment of the present invention;

Fig. 3 is a schematic diagram of the specific network structure of the first sub-network, the second sub-network and the third sub-network in an embodiment of the present invention;

Fig. 4 is a schematic diagram of the specific network structure of the fourth to eleventh sub-networks in an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of the generator, the forward generator and the reverse generator in an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of the discriminator in an embodiment of the present invention;

Fig. 7 is a schematic structural diagram of the deep learning-based animation generation model training device in an embodiment of the present invention;

Fig. 8 is a schematic flowchart of the deep learning-based animation generation method according to an embodiment of the present invention;

Fig. 9 is a schematic structural diagram of the deep learning-based animation generation device according to an embodiment of the present invention.

Detailed description of the embodiments

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. When the description refers to the drawings, unless otherwise indicated, the same reference numerals in different drawings denote the same or similar elements. It should be noted that the implementations described in the following exemplary embodiments do not represent all implementations of the present invention; they are merely examples of devices and methods consistent with some aspects of the present disclosure as detailed in the claims, and the scope of the present invention is not limited thereto. Provided that there is no contradiction, the features in the various embodiments of the present invention may be combined with one another.

In addition, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly specifying the number of the indicated technical features. Accordingly, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality of" means at least two, for example two or three, unless otherwise specifically defined.

Most of the methods used in computer animation generation are still key-frame based, but regardless of how the key frames are obtained, a large amount of repetitive manual work is required. Using deep learning to let computers generate animation automatically has therefore gradually become a new research hotspot in computer animation. At present there are mainly two ways in which deep learning is applied in computer animation. One uses the feature points of specific objects combined with deep learning to generate key frames, but extracting such feature points is difficult in practice, and inaccurate feature point extraction leads to unsatisfactory results. The other models an object in advance for parameterized generation: a specific object is modeled and parameterized, and the parameters are then varied over time to generate the animation. However, pixel-style characters are diverse, and most of the results of parameterizing a character model (such as height or build) are of no benefit to generating walking animations, which needlessly adds a large amount of work and does not match how pixel-style character animation is actually used. How to build an animation generation model that is both accurate and saves labor cost is therefore an urgent problem to be solved.

An embodiment of the present invention provides a deep learning-based animation generation model training method. Fig. 1 is a schematic flowchart of the deep learning-based animation generation model training method according to an embodiment of the present invention, which includes steps S101 to S104, wherein:

In step S101, a training set sequence is obtained. The training set sequence includes a plurality of key frame sequences, each key frame sequence includes a plurality of key frame images of different walking postures of a character, and the key frame images include initial key frames and corresponding actual subsequent key frames; an initial key frame is the walking posture of the character in an initial state, and the actual subsequent key frames are walking postures of the character after the initial state. The training set sequence is thus effectively obtained for subsequent network training.

In step S102, the initial key frame is input into the animation generation model, and the predicted subsequent key frames are determined. The predicted subsequent key frames are thus generated by the animation generation model.

In step S103, the value of the loss function is determined according to the predicted subsequent key frames and the actual subsequent key frames. The network is thus effectively trained by determining the loss function.

In step S104, the parameters of the animation generation model are adjusted according to the value of the loss function until a convergence condition is met, completing the training of the animation generation model. Thus, in the present invention the input is an initial key frame given by an animator, the trained model fills in the other predicted subsequent key frames, and a series of animations are effectively generated through the model, which both ensures that the character design created by the animator is not changed and spares the animator a great deal of repetitive work.
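
A minimal training-loop sketch for steps S101 to S104, assuming a PyTorch DataLoader that yields (initial key frame, actual subsequent key frame) tensor pairs and using a plain L1 reconstruction loss; the adversarial terms described below are omitted, and all names are illustrative:

```python
import torch
from torch import nn

def train_animation_model(model, loader, epochs: int = 100, lr: float = 2e-4):
    """S101-S104: predict subsequent key frames from initial key frames, compare
    them with the actual subsequent key frames, and update the model parameters
    until training converges (approximated here by a fixed epoch budget)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    l1 = nn.L1Loss()
    for epoch in range(epochs):
        for initial_frame, actual_next in loader:       # S101: training pairs
            predicted_next = model(initial_frame)       # S102: predicted subsequent key frame
            loss = l1(predicted_next, actual_next)      # S103: value of the loss function
            optimizer.zero_grad()
            loss.backward()                             # S104: adjust the parameters
            optimizer.step()
    return model
```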

Optionally, the walking posture includes the walking direction and walking action of the character; the walking direction includes forward, backward, left and right, and the walking action includes a left-leg stepping action, a right-leg stepping action and standing still. Thus, through image translation and based on complete action material, pixel-style characters are trained to make different walking postures when facing different walking directions, so that a key frame of a pixel character standing still facing one direction is effectively translated into a key frame facing another direction, ensuring the accuracy of the predicted subsequent key frames.

Specifically, the contents of the key frame images are as follows: (1) a key frame of a pixel-style character facing forward and standing still; (2) a key frame of a pixel-style character facing forward with the left leg stepping out; (3) a key frame of a pixel-style character facing forward with the right leg stepping out; (4) a key frame of a pixel-style character facing backward and standing still; (5) a key frame of a pixel-style character facing backward with the left leg stepping out; (6) a key frame of a pixel-style character facing backward with the right leg stepping out; (7) a key frame of a pixel-style character facing left and standing still; (8) a key frame of a pixel-style character facing left with the left leg stepping out; (9) a key frame of a pixel-style character facing left with the right leg stepping out; (10) a key frame of a pixel-style character facing right and standing still; (11) a key frame of a pixel-style character facing right with the left leg stepping out; (12) a key frame of a pixel-style character facing right with the right leg stepping out. The key frames thus cover the pixel-style character facing different walking directions with different walking postures, which makes it possible to translate a key frame of the character standing still facing one direction into a key frame facing another direction, or into a key frame of a walking action facing the same direction.
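
The twelve key frame types enumerated above form a 4 x 3 grid of walking directions and walking actions; a small illustrative listing (the label strings are paraphrases, not identifiers from the patent):

```python
DIRECTIONS = ["front", "back", "left", "right"]
ACTIONS = ["still", "left_leg_step", "right_leg_step"]

# Key frames (1)-(12) in the order listed above: each direction paired with each action.
KEYFRAME_LABELS = [(d, a) for d in DIRECTIONS for a in ACTIONS]
assert len(KEYFRAME_LABELS) == 12
```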

Optionally, the animation generation model includes a first cascade network and a second cascade network, and the predicted subsequent key frames include first-cascade subsequent key frames and second-cascade subsequent key frames. Step S102 includes the following two steps:

the initial key frame is input into the first cascade network to obtain the first-cascade subsequent key frames; thus, by designing a cascade network, the first cascade network effectively generates the first-cascade subsequent key frames;

the first-cascade subsequent key frames are input into the second cascade network to obtain the second-cascade subsequent key frames. Thus, by designing a cascade network and dividing the work between the first and second cascade networks, the present invention can effectively generate a complete character walking action, and efficiently generates the predicted subsequent key frames of the pixel character's walking action when an animator produces a pixel-style character walking animation.

Optionally, inputting the initial key frame into the first cascade network includes: applying a mirror transformation to the initial key frame and then inputting it into the first cascade network. Thus, through the mirror transformation, the present invention effectively avoids the network training difficulties caused by local asymmetry of the character.
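
A minimal sketch of the mirror transformation applied to the initial key frame before it enters the first cascade network; a horizontal flip of a CxHxW image tensor is assumed here:

```python
import torch

def mirror_transform(keyframe: torch.Tensor) -> torch.Tensor:
    """Horizontally mirror a CxHxW key frame (flip along the width axis, dim 2)
    before feeding it to the first cascade network."""
    return torch.flip(keyframe, dims=[2])
```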

In the embodiment of the present invention, the walking posture includes the walking direction and walking action of the character, and the first cascade network includes a first generation sub-network and a second generation sub-network. Inputting the initial key frame into the first cascade network includes the following two steps:

the initial key frame is input into the first generation sub-network to determine the first generated key frame, which has the same walking direction as the initial key frame but a different walking action; thus, in the first cascade network, the first generation sub-network effectively translates the initial key frame into key frames facing the same direction;

the initial key frame is input into the second generation sub-network to determine the second generated key frame, which has a different walking direction from the initial key frame but the same walking action. Thus, in the first cascade network, the second generation sub-network effectively translates the initial key frame into key frames facing different directions.

Optionally, the walking direction includes a first direction, and the initial key frame is a first-direction initial frame representing the initial state of the character facing the first direction; the first generated key frame includes a plurality of first-direction action sub-frames whose walking actions differ from one another. Thus, the first generation sub-network effectively translates the initial key frame into key frames in which the character faces the first direction and takes different actions.

Optionally, the second generation sub-network includes a generator and a first discriminator, wherein: the generator includes a plurality of convolutional layers and is used to output a generated prediction map; the first discriminator includes a plurality of convolutional layers and fully connected layers and is used to compare the generated prediction map, the initial key frame, and the actual subsequent key frame corresponding to the initial key frame in the second generation sub-network, the output of the first discriminator being the discrimination result of this comparison. Thus, the generator effectively outputs a generated prediction map, and the first discriminator effectively judges whether the generated prediction map is accurate.

Optionally, during training, the loss function of the generator is an L1 loss function, and the first discriminator uses a cross-entropy loss function. Thus, the loss functions are chosen specifically for the generator and the discriminator, ensuring efficient and accurate network training.
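
A minimal sketch of a generator built from several convolutional layers and a first discriminator built from convolutional layers followed by a fully connected layer, together with the L1 and cross-entropy losses named above; layer counts, channel sizes and the 64 x 64 input size are illustrative assumptions, not the patent's exact architecture:

```python
import torch
from torch import nn

class Generator(nn.Module):
    """Several convolutional layers; outputs a generated prediction map."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, channels, 3, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)

class FirstDiscriminator(nn.Module):
    """Convolutional layers plus a fully connected layer; compares the generated
    prediction map with the initial key frame and its actual subsequent key frame
    (the three images are stacked along the channel axis)."""
    def __init__(self, channels: int = 3, image_size: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3 * channels, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        self.fc = nn.Linear(128 * (image_size // 4) ** 2, 1)

    def forward(self, generated, initial, actual):
        x = torch.cat([generated, initial, actual], dim=1)
        return self.fc(self.conv(x).flatten(1))          # real/fake logit

l1_loss = nn.L1Loss()                # generator loss (L1)
adv_loss = nn.BCEWithLogitsLoss()    # cross-entropy loss for the discriminator
```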

In the embodiment of the present invention, the second-cascade subsequent key frame is a third generated key frame. Inputting the first-cascade subsequent key frames into the second cascade network specifically includes: inputting the second generated key frame into the second cascade network and determining the third generated key frame, which has the same walking direction as the second generated key frame but a different walking action. Thus, through the second cascade network, the second generated key frame is effectively translated into key frames of different actions facing the same direction.

In the embodiment of the present invention, the walking direction includes a second direction, a third direction and a fourth direction; the second generated key frame includes a second-direction initial frame, a third-direction initial frame and a fourth-direction initial frame; and the second cascade network includes a third generation sub-network, a fourth generation sub-network and a fifth generation sub-network. Inputting the first-cascade subsequent key frames into the second cascade network specifically includes the following three steps:

the second-direction initial frame is input into the third generation sub-network to determine a plurality of second-direction action sub-frames, each of which has the same walking direction as the second-direction initial frame but a different walking action, and the walking actions of the second-direction action sub-frames differ from one another; thus, the third generation sub-network effectively translates the second-direction initial frame into key frames of different actions facing the same direction;

the third-direction initial frame is input into the fourth generation sub-network to determine a plurality of third-direction action sub-frames, each of which has the same walking direction as the third-direction initial frame but a different walking action, and the walking actions of the third-direction action sub-frames differ from one another; thus, the fourth generation sub-network effectively translates the third-direction initial frame into key frames of different actions facing the same direction;

the fourth-direction initial frame is input into the fifth generation sub-network to determine a plurality of fourth-direction action sub-frames, each of which has the same walking direction as the fourth-direction initial frame but a different walking action, and the walking actions of the fourth-direction action sub-frames differ from one another. Thus, the fifth generation sub-network effectively translates the fourth-direction initial frame into key frames of different actions facing the same direction.

Optionally, the first generation sub-network, the third generation sub-network, the fourth generation sub-network and the fifth generation sub-network are identical and each include a forward generator, a reverse generator and a second discriminator, wherein:

the forward generator includes a plurality of convolutional layers, its input is the input key frame image corresponding to the generation sub-network to which it belongs, and its output is a forward generated prediction map;

the reverse generator includes a plurality of convolutional layers, its input is the forward generated prediction map, and its output is a restored prediction map;

the second discriminator includes a plurality of convolutional layers and fully connected layers and is used to compare the forward generated prediction map, the input key frame image, and the actual subsequent key frame corresponding to the input key frame image in the generation sub-network to which the discriminator belongs, the output being the discrimination result of this comparison. Thus, the forward generator effectively outputs the forward generated prediction map, the reverse generator effectively generates the restored prediction map, and the second discriminator effectively judges whether the generated prediction map is accurate.

Optionally, the loss functions of the forward generator and the reverse generator both use the L1 loss function, and the loss function of the second discriminator is determined as follows: a first L1 loss function is determined according to the restored prediction map and the input key frame image; a second L1 loss function is determined according to the forward generated prediction map and the actual subsequent key frame corresponding to the input key frame image in the sub-network to which it belongs; the first L1 loss function and the second L1 loss function are then weighted and summed to determine the loss function of the second discriminator. Thus, the loss functions are chosen specifically for the forward generator, the reverse generator and the discriminator, ensuring efficient and accurate network training.
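
A minimal sketch of the weighted-sum loss construction described above for a sub-network containing a forward generator, a reverse generator and a second discriminator; the generators are assumed to be image-to-image modules such as the convolutional generator sketched earlier, and the weights are illustrative:

```python
from torch import nn

l1 = nn.L1Loss()

def second_discriminator_loss(forward_gen, reverse_gen, input_frame, actual_next,
                              w_restore: float = 0.5, w_forward: float = 0.5):
    """Weighted sum of two L1 terms, as described above:
    - first L1 loss: restored prediction map vs. the input key frame image;
    - second L1 loss: forward generated prediction map vs. the actual subsequent key frame."""
    forward_pred = forward_gen(input_frame)           # forward generated prediction map
    restored_pred = reverse_gen(forward_pred)         # restored prediction map
    loss_restore = l1(restored_pred, input_frame)     # first L1 loss
    loss_forward = l1(forward_pred, actual_next)      # second L1 loss
    return w_restore * loss_restore + w_forward * loss_forward
```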

Specifically, referring to Fig. 2, Fig. 2 is a schematic structural diagram of the deep learning-based animation generation model according to an embodiment of the present invention. Optionally, the predicted subsequent key frames include, in order, the first to the eleventh predicted subsequent key frames. In Fig. 2, key frame (2) is the first predicted subsequent key frame, key frame (3) is the second, key frame (4) is the third, key frame (5) is the fourth, key frame (6) is the fifth, key frame (7) is the sixth, key frame (8) is the seventh, key frame (9) is the eighth, key frame (10) is the ninth, key frame (11) is the tenth, and key frame (12) is the eleventh predicted subsequent key frame.

Referring to Fig. 2, the first cascade network includes the first to fifth sub-networks, and the second cascade network includes the sixth to eleventh sub-networks, wherein the first generation sub-network includes the fourth and fifth sub-networks, the second generation sub-network includes the first, second and third sub-networks, the third generation sub-network includes the sixth and seventh sub-networks, the fourth generation sub-network includes the eighth and ninth sub-networks, and the fifth generation sub-network includes the tenth and eleventh sub-networks.

Referring to Fig. 2, the output of the first sub-network is the third predicted subsequent key frame; the output of the second sub-network is the sixth predicted subsequent key frame; the output of the third sub-network is the ninth predicted subsequent key frame; the output of the fourth sub-network is the first predicted subsequent key frame; and the output of the fifth sub-network is the second predicted subsequent key frame.

Optionally, in the first cascade network, the initial key frame is input into the first sub-network to obtain the third predicted subsequent key frame, into the second sub-network to obtain the sixth predicted subsequent key frame, into the third sub-network to obtain the ninth predicted subsequent key frame, into the fourth sub-network to obtain the first predicted subsequent key frame, and into the fifth sub-network to obtain the second predicted subsequent key frame.

Optionally, in the second cascade network, the third predicted subsequent key frame is input into the sixth sub-network to obtain the fourth predicted subsequent key frame, and into the seventh sub-network to obtain the fifth predicted subsequent key frame; the sixth predicted subsequent key frame is input into the eighth sub-network to obtain the seventh predicted subsequent key frame, and into the ninth sub-network to obtain the eighth predicted subsequent key frame; the ninth predicted subsequent key frame is input into the tenth sub-network to obtain the tenth predicted subsequent key frame, and into the eleventh sub-network to obtain the eleventh predicted subsequent key frame.
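
A minimal sketch of the cascade wiring just described, assuming each of the eleven sub-networks is a callable image-to-image module; the frame indices follow the key frame numbering (1) to (12) of Fig. 2, and the container names are illustrative:

```python
def cascade_forward(subnets, initial_keyframe):
    """Wire the eleven sub-networks as described above.

    `subnets` maps sub-network index (1-11) to a callable; key frame (1) is the
    input, key frames (2)-(12) are predicted. Returns {frame_index: image}."""
    frames = {1: initial_keyframe}
    # First cascade: sub-networks 1-5 all consume the initial key frame (1).
    frames[4] = subnets[1](frames[1])      # facing backward, still
    frames[7] = subnets[2](frames[1])      # facing left, still
    frames[10] = subnets[3](frames[1])     # facing right, still
    frames[2] = subnets[4](frames[1])      # facing forward, left leg stepping out
    frames[3] = subnets[5](frames[1])      # facing forward, right leg stepping out
    # Second cascade: sub-networks 6-11 consume the still frames of the other directions.
    frames[5] = subnets[6](frames[4])      # backward, left leg stepping out
    frames[6] = subnets[7](frames[4])      # backward, right leg stepping out
    frames[8] = subnets[8](frames[7])      # left, left leg stepping out
    frames[9] = subnets[9](frames[7])      # left, right leg stepping out
    frames[11] = subnets[10](frames[10])   # right, left leg stepping out
    frames[12] = subnets[11](frames[10])   # right, right leg stepping out
    return frames
```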

Specifically, referring to FIG. 2, the eleven sub-networks are as follows:

First sub-network: the input is a key frame of the pixel-style character facing forward and standing still; the output is a key frame of the pixel-style character facing backward and standing still.

Second sub-network: the input is a key frame of the pixel-style character facing forward and standing still; the output is a key frame of the pixel-style character facing left and standing still.

Third sub-network: the input is a key frame of the pixel-style character facing forward and standing still; the output is a key frame of the pixel-style character facing right and standing still.

Fourth sub-network: the input is a key frame of the pixel-style character facing forward and standing still; the output is a key frame of the pixel-style character facing forward with the left leg stepping out.

Fifth sub-network: the input is a key frame of the pixel-style character facing forward and standing still; the output is a key frame of the pixel-style character facing forward with the right leg stepping out.

Sixth sub-network: the input is a key frame of the pixel-style character facing backward and standing still; the output is a key frame of the pixel-style character facing backward with the left leg stepping out.

Seventh sub-network: the input is a key frame of the pixel-style character facing backward and standing still; the output is a key frame of the pixel-style character facing backward with the right leg stepping out.

Eighth sub-network: the input is a key frame of the pixel-style character facing left and standing still; the output is a key frame of the pixel-style character facing left with the left leg stepping out.

Ninth sub-network: the input is a key frame of the pixel-style character facing left and standing still; the output is a key frame of the pixel-style character facing left with the right leg stepping out.

Tenth sub-network: the input is a key frame of the pixel-style character facing right and standing still; the output is a key frame of the pixel-style character facing right with the left leg stepping out.

Eleventh sub-network: the input is a key frame of the pixel-style character facing right and standing still; the output is a key frame of the pixel-style character facing right with the right leg stepping out.

Specifically, referring to FIG. 3, FIG. 3 is a schematic diagram of the specific network structure of the first, second and third sub-networks, which includes a generator and a discriminator. It should be understood that the network structure inside the dashed box is used only during training and does not take part in animation generation in actual applications.

Specifically, referring to FIG. 4, FIG. 4 is a schematic diagram of the specific network structure of the fourth to eleventh sub-networks, which includes a forward generator, a reverse generator and a discriminator. It should be understood that the network structure inside the dashed box is used only during training and does not take part in animation generation in actual applications.

Specifically, referring to FIG. 5, FIG. 5 is a schematic structural diagram of the generator, the forward generator and the reverse generator according to an embodiment of the present invention. Each of them includes three downsampling convolutional layers, three upsampling convolutional layers and one convolutional output layer. All convolution kernels are 3*3 except for the convolutional output layer, whose kernel size is 1*1. Each downsampling convolutional layer uses 2*2 max pooling, and each upsampling convolutional layer is a 2*2 strided convolution. The image input to each sub-network is 32*32 pixels. The specific network structure is as follows:

Downsampling convolutional layers: the first downsampling convolutional layer has 32 convolution kernels and outputs a 16*16*32 feature map; the second downsampling convolutional layer has 64 convolution kernels and outputs an 8*8*64 feature map; the third downsampling convolutional layer has 128 convolution kernels and outputs a 4*4*128 feature map.

Upsampling convolutional layers: the first upsampling convolutional layer has 64 convolution kernels and outputs an 8*8*64 feature map; the second upsampling convolutional layer has 32 convolution kernels and outputs a 16*16*32 feature map; the third upsampling convolutional layer has 16 convolution kernels and outputs a 32*32*16 feature map.

Convolutional output layer: 3 convolution kernels, outputting a 32*32*3 color image.

The downsampling convolutional layers extract the features of the input key frame, the upsampling convolutional layers generate the feature map of the corresponding output key frame from the features extracted by the downsampling layers, and a final 1*1 convolutional layer stabilizes the output result.

The output of each of the above layers is passed through a ReLU activation before being taken as that layer's output. ReLU is a nonlinear transformation whose function is expressed as follows:

$$f(x)=\max(0,x)$$

During downsampling, the input of each convolutional layer is the output of the previous convolutional layer. During upsampling, the input of upsampling convolutional layer 1 is the output of downsampling convolutional layer 3; the output of upsampling convolutional layer 1 is merged with the output of downsampling convolutional layer 2 and fed into upsampling convolutional layer 2 (shown by the dashed line in FIG. 3); the output of upsampling convolutional layer 2 is merged with the output of downsampling convolutional layer 1 and fed into upsampling convolutional layer 3 (shown by the dashed line in FIG. 3); and the output of upsampling convolutional layer 3 enters the final convolutional output layer to produce the final result.
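As a rough, non-authoritative illustration of this generator structure (three downsampling layers with 2*2 max pooling, three upsampling layers with skip connections, and a 1*1 output convolution), a PyTorch-style sketch follows. It assumes the skip connections merge feature maps by channel concatenation and that the upsampling layers are stride-2 transposed convolutions; both are assumptions, not statements of the patented implementation:

```python
import torch
import torch.nn as nn

class GeneratorSketch(nn.Module):
    """Sketch of the 32x32 -> 32x32 generator; channel sizes follow the text."""
    def __init__(self):
        super().__init__()
        # Downsampling: 3x3 convolution + ReLU + 2x2 max pooling.
        self.down1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))    # 16x16x32
        self.down2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))   # 8x8x64
        self.down3 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))  # 4x4x128
        # Upsampling: stride-2 transposed convolutions (assumed realization of the 2x2 strided convolution).
        self.up1 = nn.Sequential(nn.ConvTranspose2d(128, 64, 2, stride=2), nn.ReLU())       # 8x8x64
        self.up2 = nn.Sequential(nn.ConvTranspose2d(64 + 64, 32, 2, stride=2), nn.ReLU())   # 16x16x32
        self.up3 = nn.Sequential(nn.ConvTranspose2d(32 + 32, 16, 2, stride=2), nn.ReLU())   # 32x32x16
        # 1x1 convolutional output layer producing the 3-channel key frame.
        self.out = nn.Conv2d(16, 3, 1)

    def forward(self, x):                               # x: (N, 3, 32, 32)
        d1 = self.down1(x)
        d2 = self.down2(d1)
        d3 = self.down3(d2)
        u1 = self.up1(d3)
        u2 = self.up2(torch.cat([u1, d2], dim=1))       # skip connection (concatenation assumed)
        u3 = self.up3(torch.cat([u2, d1], dim=1))       # skip connection (concatenation assumed)
        return self.out(u3)
```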

Optionally, in the network structures of the generator, the forward generator and the reverse generator, the output of each layer is likewise passed through a ReLU activation before being taken as that layer's output; the role and expression of the ReLU layer are the same as above. The input of each of the first four layers is the output of the preceding layer (the input of the first layer is the network input); the input of the fifth layer is the output of the fourth layer, with each pixel treated as one dimension of the fifth layer, and the fifth layer, a fully connected layer, outputs the judgment result.

Specifically, referring to FIG. 6, FIG. 6 is a schematic structural diagram of a discriminator according to an embodiment of the present invention. The discriminator includes the above-mentioned first discriminator and second discriminator. The convolution kernels in the discriminator are all 2*2, and the feature map is zero-padded at each convolution so that the output feature maps keep a consistent size. Each discriminator adopts the following structure: first convolutional layer: 64 convolution kernels, outputting a 32*32*64 feature map; second convolutional layer: 128 convolution kernels, outputting a 32*32*128 feature map; third convolutional layer: 256 convolution kernels, outputting a 32*32*256 feature map; fourth convolutional layer: 512 convolution kernels, outputting a 32*32*512 feature map; fifth layer, a fully connected layer: the input is 524,288-dimensional (the flattened 32*32*512 feature map) and the output is 1-dimensional, giving the judgment result.
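The discriminator structure above can be sketched in the same PyTorch style; the asymmetric zero padding used to keep 32*32 feature maps with a 2*2 kernel, and the number of input channels, are assumptions rather than details taken from the patent:

```python
import torch.nn as nn

class DiscriminatorSketch(nn.Module):
    """Sketch of the discriminator: four 2x2 convolutional layers plus one fully connected layer."""
    def __init__(self, in_channels=3):
        super().__init__()
        def block(c_in, c_out):
            # Pad one extra row/column of zeros so a 2x2 kernel keeps the 32x32 size.
            return nn.Sequential(nn.ZeroPad2d((0, 1, 0, 1)), nn.Conv2d(c_in, c_out, 2), nn.ReLU())
        self.features = nn.Sequential(
            block(in_channels, 64),    # 32x32x64
            block(64, 128),            # 32x32x128
            block(128, 256),           # 32x32x256
            block(256, 512),           # 32x32x512
        )
        self.fc = nn.Linear(32 * 32 * 512, 1)   # 524,288-dimensional input -> 1 judgment value

    def forward(self, x):
        h = self.features(x).flatten(1)
        return self.fc(h)   # raw score; a sigmoid / BCE-with-logits is applied in the loss
```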

In an embodiment of the present invention, the network parameters in FIG. 3 are initialized at the start of training with a Gaussian distribution with mean 0 and variance 0.1. The Adam optimizer is used with a learning rate of 0.0005, and the labels are the corresponding real key frames in the training set; each sub-network is trained end to end separately.
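A minimal sketch of these settings, reusing the GeneratorSketch and DiscriminatorSketch classes from the sketches above; the standard deviation is derived from the stated variance of 0.1, and whether bias terms are also Gaussian-initialized is an assumption:

```python
import torch
import torch.nn as nn

def init_weights(module):
    # Gaussian initialization with mean 0 and variance 0.1 (std = sqrt(0.1)).
    if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d, nn.Linear)):
        nn.init.normal_(module.weight, mean=0.0, std=0.1 ** 0.5)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

generator = GeneratorSketch()
discriminator = DiscriminatorSketch()
generator.apply(init_weights)
discriminator.apply(init_weights)

opt_g = torch.optim.Adam(generator.parameters(), lr=0.0005)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=0.0005)
```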

In an embodiment of the present invention, when training the first, second and third sub-networks, the generator loss function is the L1 loss function and the discriminator uses the cross-entropy loss function.

The L1 loss function is specifically:

$$L_{1}=\frac{1}{m}\sum_{i=1}^{m}\left|\hat{y}^{(i)}-y^{(i)}\right|$$

where $\hat{y}^{(i)}$ denotes the pixel value of the i-th point in the image generated by the sub-network, $y^{(i)}$ denotes the pixel value of the i-th point in the ground-truth label, and m is the total number of pixels.
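In code, this per-pixel L1 loss is simply the mean absolute difference between the generated key frame and the ground-truth key frame; a short sketch (not part of the original disclosure):

```python
import torch.nn.functional as F

def l1_pixel_loss(generated, target):
    # (1/m) * sum_i |y_hat_i - y_i| over all pixels and channels (mean reduction).
    return F.l1_loss(generated, target)
```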

The cross-entropy loss function is specifically:

$$L_{CE}=-\left[y\log\hat{y}+(1-y)\log\left(1-\hat{y}\right)\right]$$

where y is 1 if the image fed to the discriminator is a real image, y is 0 if it is an image generated by the generator, and $\hat{y}$ is the output of the discriminator.

In an embodiment of the present invention, when training the fourth to eleventh sub-networks, the discriminator likewise uses the cross-entropy loss, and the generator loss function is a weighted combination of the L1 loss against the label image and the L1 loss between the restored image and the input image, i.e. the loss function is:

$$L_{G}=\lambda_{1}\,L_{1}\left(\hat{y}_{f},\,y_{l}\right)+\lambda_{2}\,L_{1}\left(\hat{y}_{r},\,y_{i}\right)$$

where $\hat{y}_{f}$ is the image generated by the forward generator, $\hat{y}_{r}$ is the image generated by the reverse generator, $y_{l}$ is the label image, and $y_{i}$ is the input image, with $\lambda_{1}=1$ and $\lambda_{2}=0.03$; the L1 loss functions are computed as described above.
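A sketch of this weighted generator loss for the fourth to eleventh sub-networks; attaching λ1 to the label term and λ2 to the restoration term follows the order in which the text lists them and is an interpretation, not a statement of the patented formula:

```python
import torch.nn.functional as F

def cycle_generator_loss(forward_out, restored, label_img, input_img,
                         lam1=1.0, lam2=0.03):
    # lam1 * L1(forward prediction, label) + lam2 * L1(restored image, input image)
    return lam1 * F.l1_loss(forward_out, label_img) + lam2 * F.l1_loss(restored, input_img)
```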

In an embodiment of the present invention, during training the discriminator is trained five times for each time the generator is trained; the training process runs for 500 cycles, where one cycle means that all the data in the data set have been trained once.
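The 1:5 generator/discriminator schedule over 500 cycles could look roughly like the sketch below; the discriminator is trained with binary cross-entropy on real/generated labels as described above, the generator step here uses the plain L1 loss of the first three sub-networks, and the data loader, models and optimizers are assumed to come from the earlier sketches:

```python
import torch
import torch.nn.functional as F

def train_subnetwork(generator, discriminator, loader, opt_g, opt_d,
                     epochs=500, d_steps=5):
    for epoch in range(epochs):
        for input_frame, real_frame in loader:     # (input key frame, real subsequent key frame)
            # Train the discriminator d_steps times per generator step.
            for _ in range(d_steps):
                with torch.no_grad():
                    fake_frame = generator(input_frame)
                d_real = discriminator(real_frame)
                d_fake = discriminator(fake_frame)
                loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
                          F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
                opt_d.zero_grad(); loss_d.backward(); opt_d.step()
            # One generator step with the L1 loss against the real key frame.
            fake_frame = generator(input_frame)
            loss_g = F.l1_loss(fake_frame, real_frame)
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```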

In another embodiment of the present invention, for the specific network structure in FIG. 2, the inputs of the second sub-network and the third sub-network may be changed to the initial key frame together with the output of the first sub-network; the other steps and parameters are the same as above. This change is made because information about the character's back is missing when the character faces forward: given only the key frame of the character's front, the second and third sub-networks know nothing about the back of the character. Feeding both the forward-facing and the backward-facing key frames of the pixel character into the networks supplements the unknown information about the back, and prevents the key frames of the pixel character facing left and right from conflicting with the key frames of the character facing forward and backward.
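In this variant, the forward-facing initial key frame and the backward-facing key frame produced by the first sub-network would be combined into a single input, for example by channel-wise concatenation (an assumption; the first convolution of the second and third sub-networks would then need to accept 6 input channels):

```python
import torch

# x_front: (N, 3, 32, 32) initial key frame; x_back: (N, 3, 32, 32) output of the first sub-network.
def combine_views(x_front, x_back):
    return torch.cat([x_front, x_back], dim=1)   # (N, 6, 32, 32) input for sub-networks 2 and 3
```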

Optionally, the computer may adopt the following configuration: an Intel(R) Core(TM) i7 6700HQ processor, an NVIDIA GTX 1060m 6GB graphics card, and 8GB of memory, with which the network can be trained effectively.

In the deep-learning-based animation generation model training method provided by the present invention, the input is an initial key frame given by an animator, and the trained model fills in the remaining predicted subsequent key frames. A series of animations is thus generated effectively by the model, which both ensures that the character design created by the animator is not changed and spares the animator a great deal of repetitive work.

An embodiment of the second aspect of the present invention further provides a deep-learning-based animation generation model training device. FIG. 7 is a schematic structural diagram of a deep-learning-based animation generation model training device 800 according to an embodiment of the present invention, which includes an acquisition unit 801, a processing unit 802 and a training unit 803.

Acquisition unit 801: configured to acquire a training set sequence, where the training set sequence includes a plurality of key frame sequences, each key frame sequence includes a plurality of key frame images of different walking postures of a character, and the key frame images include an initial key frame and corresponding actual subsequent key frames; the initial key frame is the walking posture of the character in an initial state, and the actual subsequent key frames are the walking postures of the character after the initial state.

Processing unit 802: configured to input the initial key frame into the animation generation model and determine predicted subsequent key frames, and further configured to determine the value of the loss function according to the predicted subsequent key frames and the actual subsequent key frames.

Training unit 803: configured to adjust the parameters of the animation generation model according to the value of the loss function until a convergence condition is satisfied, completing the training of the animation generation model.

For a more specific implementation of each unit of the deep-learning-based animation generation model training device 800, reference may be made to the description of the deep-learning-based animation generation model training method of the present invention; the device has similar beneficial effects, which are not repeated here.

An embodiment of the third aspect of the present invention provides a deep-learning-based animation generation method. FIG. 8 is a schematic flowchart of a deep-learning-based animation generation method according to an embodiment of the present invention, which includes steps S201 to S204.

In step S201, an initial state key frame is obtained. In the present invention the input is the initial key frame, which is given by an animator.

In step S202, the initial state key frame is input into an animation generation model to determine predicted subsequent key frames, where the animation generation model is trained with the deep-learning-based animation generation model training method described above. The remaining predicted subsequent key frames are thus filled in by the trained model.

In step S203, foreground-background segmentation is performed on the initial state key frame and the predicted subsequent key frames, and segmentation result pictures are determined. Segmenting the foreground and background of the initial state key frame and the predicted subsequent key frames makes it possible to obtain an accurate walking animation of the character.

In step S204, the segmentation result pictures are played in a loop to generate the animation. The present invention thus effectively generates a series of animations through the model, which both ensures that the character design created by the animator is not changed and spares the animator a great deal of repetitive work.

In an embodiment of the present invention, step S203 includes the following two specific steps:

determining a transparency channel according to the result of the foreground-background segmentation;

saving the initial state key frame and the predicted subsequent key frames as the segmentation result pictures according to the transparency channel. The transparency channel effectively marks the foreground part and the background part, so that the segmentation result pictures are determined efficiently.

Optionally, the transparency channel value is 1 for the foreground part and 0 for the background part. By binarizing the foreground and background in this way, the two parts are effectively marked and the segmentation result pictures are determined efficiently.

Optionally, in the initial state key frame and the predicted subsequent key frames, the color of the parts whose transparency channel value is 1 is made colorless, and the color of the parts whose transparency channel value is 0 is left unchanged; the segmentation result pictures are saved in the png format. The initial state key frame and the predicted subsequent key frames are thus saved as segmentation result pictures through the transparency channel. The generated pictures are in RGB format, i.e. the color of each pixel is represented by red, green and blue components. Without segmentation, for a 32*32 character key frame the area around the character also carries color, so the generated animation would carry this surrounding color, which is not wanted; segmentation is therefore required, and the unwanted colored background around the generated character must be cleared. The generated key frames need to be saved in RGBA format, which is similar to the RGB format: for the same pixel, the red, green and blue components represent its color, and an additional transparency channel represents the pixel's transparency. When the transparency channel is 1, the color represented by the pixel is "colorless"; when the transparency channel is 0, the color is the one determined by the red, green and blue components.

Optionally, a region growing method is used for the foreground-background segmentation: the top-left pixel of each generated key frame is taken as the seed point of the background, the color gradient on the boundary of the background region is computed, a boundary pixel is merged into the background region if the gradient is smaller than a threshold, and this is repeated until the region no longer grows. Taking the specific animation generation model in FIG. 2 as an example, the top-left pixels of the initial key frame of the pixel-style character walking animation and of the other 11 predicted subsequent key frames output by the model are used as seed points, and region growing is used to separate the image foreground from the image background of the initial key frame and of the other 11 predicted subsequent key frames output by the model.
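A minimal sketch of this region-growing background segmentation and of writing the result as a transparent-background png (using NumPy and Pillow); the Euclidean RGB distance used as the "color gradient" and the threshold value are assumptions, and the sketch uses the standard png alpha convention (0 = fully transparent) so that the background, not the character, ends up transparent:

```python
from collections import deque
import numpy as np
from PIL import Image

def background_mask(rgb, threshold=30.0):
    """Grow the background region from the top-left pixel; rgb is an HxWx3 uint8 array."""
    h, w, _ = rgb.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[0, 0] = True
    queue = deque([(0, 0)])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx]:
                # Color gradient across the current boundary of the background region.
                grad = np.linalg.norm(rgb[ny, nx].astype(float) - rgb[y, x].astype(float))
                if grad < threshold:
                    mask[ny, nx] = True
                    queue.append((ny, nx))
    return mask

def save_segmented_png(rgb, path):
    mask = background_mask(rgb)
    alpha = np.where(mask, 0, 255).astype(np.uint8)    # background fully transparent
    rgba = np.dstack([rgb, alpha])
    Image.fromarray(rgba, mode="RGBA").save(path)      # png preserves the alpha channel
```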

In an embodiment of the present invention, the loop playback consists of repeating a single playback, and a single playback in step S204 includes the following two specific steps:

playing the segmentation result picture corresponding to the initial state key frame;

playing the segmentation result pictures corresponding to the predicted subsequent key frames in turn, in the order in which the predicted subsequent key frames were generated. The segmentation result pictures are thus played in a loop according to the generation order of the key frames, generating the corresponding animation. Taking the specific animation generation model in FIG. 2 as an example, looping over the png pictures of the 12 key frames of the character and playing each key frame in turn generates a pixel-style character walking animation; the generated key frames can also be interpolated by key frame interpolation to produce a finer animation.
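One simple way to preview the looped playback is to assemble the 12 segmented key frames into a repeating GIF with Pillow; the file names and frame duration below are placeholders, and the frames are converted to RGB because GIF has only limited alpha support:

```python
from PIL import Image

frame_paths = [f"keyframe_{i:02d}.png" for i in range(1, 13)]   # hypothetical file names, in generation order
frames = [Image.open(p).convert("RGB") for p in frame_paths]
frames[0].save("walk_cycle.gif", save_all=True,
               append_images=frames[1:], duration=100, loop=0)  # loop=0 repeats forever
```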

In the deep-learning-based animation generation method provided by the present invention, the input is an initial key frame given by an animator, and the trained model fills in the remaining predicted subsequent key frames. A series of animations is thus generated effectively by the model, which both ensures that the character design created by the animator is not changed and spares the animator a great deal of repetitive work.

An embodiment of the fourth aspect of the present invention provides a deep-learning-based animation generation device. FIG. 9 is a schematic structural diagram of a deep-learning-based animation generation device 900 according to an embodiment of the present invention, which includes an acquisition unit 901, a processing unit 902 and a playback unit 903.

The acquisition unit 901 is configured to acquire an initial state key frame.

The processing unit 902 is configured to input the initial state key frame into an animation generation model and determine predicted subsequent key frames, where the animation generation model is trained with the deep-learning-based animation generation model training method described above; it is further configured to perform foreground-background segmentation on the initial state key frame and the predicted subsequent key frames and save the segmentation result pictures.

The playback unit 903 is configured to play the segmentation result pictures in a loop to generate the animation.

For a more specific implementation of each unit of the deep-learning-based animation generation device 900, reference may be made to the description of the deep-learning-based animation generation method of the present invention; the device has similar beneficial effects, which are not repeated here.

An embodiment of the fifth aspect of the present invention provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the deep-learning-based animation generation model training method according to the first aspect of the present invention or the deep-learning-based animation generation method according to the third aspect of the present invention is implemented.

Although the embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and cannot be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention.

Claims (19)

1. An animation generation model training method based on deep learning, which is characterized by comprising the following steps:
acquiring a training set sequence, wherein the training set sequence comprises a plurality of key frame sequences, each key frame sequence comprises a plurality of key frame images of different walking postures of a person, each key frame image comprises an initial key frame and a corresponding actual subsequent key frame, the initial key frame is the walking posture of the person in an initial state, and the actual subsequent key frame is the walking posture of the person after the initial state; the walking gesture comprises a walking direction and a walking action of the character;
Inputting the initial key frame into the animation generation model, and determining a predicted subsequent key frame;
the animation generation model comprises a first cascade network and a second cascade network, and the predicted subsequent key frames comprise a first cascade subsequent key frame and a second cascade subsequent key frame; inputting the initial key frame into the first cascade network, and determining the first cascade subsequent key frame; inputting the first cascade subsequent key frame into the second cascade network, and determining the second cascade subsequent key frame;
the first cascade subsequent key frame comprises a first generated key frame and a second generated key frame; the second cascade subsequent key frame is a third generated key frame;
the first generated key frame is the same as the walking direction of the initial key frame and the walking action is different; the second generated key frame is different from the initial key frame in the walking direction and the walking action is the same; the walking direction of the third generated key frame is the same as that of the second generated key frame, and the walking actions are different;
determining a value of a loss function from the predicted subsequent key frame and the actual subsequent key frame;
And adjusting parameters of the animation generation model according to the value of the loss function until convergence conditions are met, and completing training of the animation generation model.
2. The deep learning based animation generation model training method of claim 1, wherein the walking direction comprises front, rear, left and right, and the walking action comprises a left leg taking action, a right leg taking action and rest.
3. The deep learning based animation generation model training method of claim 1, wherein the inputting the initial key frame into the first cascade network comprises:
and after mirror-transforming the initial key frame, inputting the initial key frame into the first cascade network.
4. The deep learning based animation generation model training method of claim 1, wherein the first cascade network comprises a first generation sub-network and a second generation sub-network, and wherein the inputting the initial key frame into the first cascade network and determining the first cascade subsequent key frame comprises:
inputting the initial key frame into the first generation sub-network, and determining the first generation key frame;
and inputting the initial key frame into the second generation sub-network, and determining the second generation key frame.
5. The method according to claim 4, wherein the walking direction of the initial key frame includes a first direction, the initial key frame is a first direction initial frame for representing an initial state of a person facing the first direction, the first generated key frame includes a plurality of first direction action subframes, and the walking actions of the plurality of first direction action subframes are different from each other.
6. The deep learning based animation production model training method of claim 4 wherein the second production sub-network comprises a generator, a first arbiter, wherein:
the generator comprises a plurality of convolution layers, and the generator is used for outputting and generating a prediction graph;
the first discriminator comprises a plurality of convolution layers and a full connection layer, the first discriminator is used for comparing the generated predictive graph, the initial key frame and the actual subsequent key frame corresponding to the initial key frame in the second generated sub-network, and the output of the first discriminator is a discrimination result after comparison.
7. The deep learning based animation generation model training method of claim 6 wherein the generator's loss function employs an L1 loss function and the first arbiter employs a cross entropy loss function.
8. The deep learning based animation generation model training method of claim 4, wherein the first cascading subsequent keyframes are input to the second cascading network, and wherein determining the second cascading subsequent keyframes comprises:
and inputting the second generated key frame into the second cascade network, and determining the third generated key frame.
9. The deep learning based animation generation model training method of claim 8 wherein the walking direction of the second generation key frame comprises a second direction, a third direction and a fourth direction, the second generation key frame comprises a second direction initial frame, a third direction initial frame and a fourth direction initial frame, the second cascading network comprises a third generation sub-network, a fourth generation sub-network and a fifth generation sub-network, the inputting the second generation key frame into the second cascading network, determining the third generation key frame comprises:
inputting the second direction initial frame into the third generation sub-network, and determining a plurality of second direction action sub-frames, wherein the second direction action sub-frames are identical to the walking direction of the second direction initial frame and different from each other, and the walking actions of the plurality of second direction action sub-frames are different from each other;
Inputting the third-direction initial frame into the fourth generation sub-network, and determining a plurality of third-direction action sub-frames, wherein the third-direction action sub-frames are the same as the walking direction of the third-direction initial frame and different from each other, and the walking actions of the plurality of third-direction action sub-frames are different from each other;
and inputting the fourth-direction initial frame into the fifth generation sub-network, and determining a plurality of fourth-direction action sub-frames, wherein the fourth-direction action sub-frames are the same as the walking direction of the fourth-direction initial frame and different from each other, and the walking actions of the plurality of fourth-direction action sub-frames are different from each other.
10. The deep learning based animation generation model training method of claim 9 wherein the first generation sub-network, the third generation sub-network, the fourth generation sub-network, and the fifth generation sub-network are identical, comprising a forward generator, a reverse generator, and a second arbiter, wherein:
the forward generator comprises a plurality of convolution layers, wherein the input of the forward generator is an input key frame image corresponding to the generation sub-network to which the input belongs, and the output of the forward generator is a forward generation prediction graph;
The reverse generator comprises a plurality of convolution layers, wherein the input of the reverse generator generates a prediction graph for the forward direction, and the output of the reverse generator generates the prediction graph for reduction;
the second discriminator comprises a plurality of convolution layers and a full-connection layer, the second discriminator is used for comparing the forward generation prediction graph, the input key frame image and the actual follow-up key frame corresponding to the input key frame image of the second discriminator in the generation sub-network, and the output of the second discriminator is a discrimination result after comparison.
11. The training method of the animation generation model based on the deep learning according to claim 10, wherein the loss function of the forward generator and the loss function of the reverse generator both use an L1 loss function, and the determining process of the loss function of the second discriminator comprises:
determining a first L1 loss function according to the restored prediction graph and the input key frame image;
determining a second L1 loss function according to the forward generated predictive graph and the actual subsequent key frames corresponding to the input key frame image in the sub-network to which the predictive graph belongs;
and carrying out weighted summation on the first L1 loss function and the second L1 loss function to determine the loss function of the second discriminator.
12. An animation generation model training device based on deep learning, comprising:
an acquisition unit, configured to acquire a training set sequence, wherein the training set sequence comprises a plurality of key frame sequences, each key frame sequence comprises a plurality of key frame images of different walking postures of a person, each key frame image comprises an initial key frame and a corresponding actual subsequent key frame, the initial key frame is the walking posture of the person in an initial state, and the actual subsequent key frame is the walking posture of the person after the initial state; the walking gesture comprises a walking direction and a walking action of the character;
a processing unit, configured to input the initial key frame into the animation generation model and determine a predicted subsequent key frame, and to determine a value of a loss function from the predicted subsequent key frame and the actual subsequent key frame; wherein the animation generation model comprises a first cascade network and a second cascade network, and the predicted subsequent key frames comprise a first cascade subsequent key frame and a second cascade subsequent key frame; the initial key frame is input into the first cascade network to determine the first cascade subsequent key frame; the first cascade subsequent key frame is input into the second cascade network to determine the second cascade subsequent key frame; the first cascade subsequent key frame comprises a first generated key frame and a second generated key frame; the second cascade subsequent key frame is a third generated key frame; the first generated key frame is the same as the initial key frame in walking direction and different in walking action; the second generated key frame is different from the initial key frame in walking direction and the same in walking action; and the third generated key frame is the same as the second generated key frame in walking direction and different in walking action; and
a training unit, configured to adjust the parameters of the animation generation model according to the value of the loss function until a convergence condition is met, so as to complete the training of the animation generation model.
13. An animation generation method based on deep learning, which is characterized by comprising the following steps:
acquiring an initial state key frame;
inputting the initial state key frame into an animation generation model, and determining a predicted subsequent key frame, wherein the animation generation model is obtained by training by adopting the animation generation model training method based on the deep learning according to any one of claims 1-11;
performing foreground and background segmentation on the initial state key frame and the predicted subsequent key frame, and determining a segmentation result picture;
and circularly playing the segmentation result pictures to generate animation.
14. The deep learning based animation generation method of claim 13, wherein determining a segmentation result picture comprises:
determining a transparency channel according to the result of foreground and background segmentation;
and according to the transparency channel, storing the initial state key frame and the predicted subsequent key frame as the segmentation result picture.
15. The deep learning based animation generation method of claim 14 wherein the determining a transparency channel based on the result of the foreground-background segmentation comprises:
For the foreground portion, the clear channel value is 1; for the background portion, the clear channel value is 0.
16. The deep learning based animation generation method of claim 15, wherein the saving the initial state key frame and the predicted subsequent key frame as a segmentation result picture according to the transparency channel comprises:
in the initial state key frame and the predicted subsequent key frame, the color of the part with the transparent channel value of 1 is colorless, and the color of the part with the transparent channel value of 0 is unchanged;
and saving the segmentation result picture in a png format.
17. The deep learning based animation generation method of claim 13, wherein the looping playing the segmentation result picture to generate an animation comprises looping a single play, the single play comprising:
playing the segmentation result picture corresponding to the initial state key frame;
and sequentially playing the segmentation result pictures corresponding to the predicted subsequent key frames according to the generation sequence of the predicted subsequent key frames.
18. An animation generation device based on deep learning, comprising:
an acquisition unit, configured to acquire an initial state key frame;
a processing unit, configured to input the initial state key frame into an animation generation model and determine a predicted subsequent key frame, wherein the animation generation model is obtained by training with the deep learning based animation generation model training method according to any one of claims 1-11; and further configured to perform foreground and background segmentation on the initial state key frame and the predicted subsequent key frame and determine a segmentation result picture; and
a playing unit, configured to cyclically play the segmentation result pictures to generate an animation.
19. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the deep learning based animation generation model training method according to any one of claims 1-11 or implements the deep learning based animation generation method according to any one of claims 13-17.
CN202010264566.0A 2020-04-07 2020-04-07 Animation generation model training and animation generation method and device based on deep learning Active CN111476868B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010264566.0A CN111476868B (en) 2020-04-07 2020-04-07 Animation generation model training and animation generation method and device based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010264566.0A CN111476868B (en) 2020-04-07 2020-04-07 Animation generation model training and animation generation method and device based on deep learning

Publications (2)

Publication Number Publication Date
CN111476868A CN111476868A (en) 2020-07-31
CN111476868B true CN111476868B (en) 2023-06-23

Family

ID=71749921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010264566.0A Active CN111476868B (en) 2020-04-07 2020-04-07 Animation generation model training and animation generation method and device based on deep learning

Country Status (1)

Country Link
CN (1) CN111476868B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114663561A (en) * 2022-03-24 2022-06-24 熊定 Two-dimensional animation generation method and system
CN117710534B (en) * 2024-02-04 2024-04-23 昆明理工大学 Animation collaborative making method based on improved teaching and learning optimization algorithm

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101827264A (en) * 2009-03-06 2010-09-08 刘永 Hierarchical self-adaptive video frame sampling method
CN106127112A (en) * 2016-06-15 2016-11-16 北京工业大学 Data Dimensionality Reduction based on DLLE model and feature understanding method
CN107968962A (en) * 2017-12-12 2018-04-27 华中科技大学 A kind of video generation method of the non-conterminous image of two frames based on deep learning
CN109389072A (en) * 2018-09-29 2019-02-26 北京字节跳动网络技术有限公司 Data processing method and device
CN109905624A (en) * 2019-03-01 2019-06-18 北京大学深圳研究生院 A video frame interpolation method, device and device
CN110163079A (en) * 2019-03-25 2019-08-23 腾讯科技(深圳)有限公司 Video detecting method and device, computer-readable medium and electronic equipment
CN110909754A (en) * 2018-09-14 2020-03-24 哈尔滨工业大学(深圳) Attribute generation countermeasure network and matching clothing generation method based on same

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9972360B2 (en) * 2016-08-30 2018-05-15 Oath Inc. Computerized system and method for automatically generating high-quality digital content thumbnails from digital video

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101827264A (en) * 2009-03-06 2010-09-08 刘永 Hierarchical self-adaptive video frame sampling method
CN106127112A (en) * 2016-06-15 2016-11-16 北京工业大学 Data Dimensionality Reduction based on DLLE model and feature understanding method
CN107968962A (en) * 2017-12-12 2018-04-27 华中科技大学 A kind of video generation method of the non-conterminous image of two frames based on deep learning
CN110909754A (en) * 2018-09-14 2020-03-24 哈尔滨工业大学(深圳) Attribute generation countermeasure network and matching clothing generation method based on same
CN109389072A (en) * 2018-09-29 2019-02-26 北京字节跳动网络技术有限公司 Data processing method and device
CN109905624A (en) * 2019-03-01 2019-06-18 北京大学深圳研究生院 A video frame interpolation method, device and device
CN110163079A (en) * 2019-03-25 2019-08-23 腾讯科技(深圳)有限公司 Video detecting method and device, computer-readable medium and electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Zoom-In-To-Check: Boosting Video Interpolation via Instance-Level Discrimination; L. Yuan; 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition; full text *
Head-referenced human body contour model (以头部为基准的人体轮廓模型); 叶芳芳, 许力, 杜鉴豪, 杨洁; Journal of Zhejiang University (Engineering Science) (浙江大学学报(工学版)), No. 07; full text *
Research on example-based stylized facial animation generation (基于样例学习的风格化脸部动画生成方法研究); 金秉文; China Master's Theses Collection (中国优秀硕士论文辑); full text *

Also Published As

Publication number Publication date
CN111476868A (en) 2020-07-31

Similar Documents

Publication Publication Date Title
CN111161277B (en) Natural image matting method based on deep learning
CN112884758B (en) Defect insulator sample generation method and system based on style migration method
CN107507134A (en) Super-resolution method based on convolutional neural networks
Zhang et al. Style transfer via image component analysis
CN108932693A (en) Face editor complementing method and device based on face geological information
CN116664726B (en) Video acquisition method and device, storage medium and electronic equipment
CN102509333B (en) Action-capture-data-driving-based two-dimensional cartoon expression animation production method
CN108986132A (en) A method of certificate photo Trimap figure is generated using full convolutional neural networks
Zhao et al. Cartoon image processing: A survey
CN118379401B (en) Speaker video synthesis method, system, device and storage medium
CN111476868B (en) Animation generation model training and animation generation method and device based on deep learning
CN111640172A (en) Attitude migration method based on generation of countermeasure network
CN116634242A (en) Speech-driven speaking video generation method, system, equipment and storage medium
CN111899169A (en) A method for segmentation network of face image based on semantic segmentation
CN110533579A (en) Based on the video style conversion method from coding structure and gradient order-preserving
CN110111254B (en) Depth map super-resolution method based on multi-stage recursive guidance and progressive supervision
KR102815093B1 (en) Machine learning based on volumetric capture and mesh tracking
CN115100334B (en) Image edge tracing and image animation method, device and storage medium
Dakhia et al. A hybrid-backward refinement model for salient object detection
Schreiber et al. Monocular depth estimation using synthetic data for an augmented reality training system in laparoscopic surgery
CN112132923B (en) A two-stage digital image style transfer method and system based on high-definition style thumbnails
Li The influence of digital twins on the methods of film and television creation
CN113781611B (en) Animation production method and device, electronic equipment and storage medium
Jiang et al. Mask‐guided image person removal with data synthesis
Li et al. Video vectorization via bipartite diffusion curves propagation and optimization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant