Disclosure of Invention
The invention aims to provide a digital human video synthesis system to solve the problems in the background art:
Motion and expression generation in existing digital human video systems still appears unnatural, and existing digital humans have limited interaction capability with users, especially through touch screens, responding to user operations inaccurately and inflexibly.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A digital human video synthesis system, comprising:
a model construction module, used for modeling a digital human based on a real person;
a voice synthesis module, used for converting input text content into voice audio through a voice synthesis engine and for adjusting and optimizing the synthesized voice;
an expression generation module, used for generating expressions of the digital human from the facial expressions of the real person;
a rendering engine module, used for rendering the digital human video;
an interaction control module, used for performing data interaction with the user and with third parties;
the model construction module comprises:
a character modeling unit, used for creating appearance models based on 3D modeling technology, scanning the real person based on scanning technology, and constructing the digital human model;
a bone binding and animation unit, used for establishing a skeleton system for the digital human model, determining the positions and movement ranges of the joints, and performing motion recognition and animation production through bone binding;
the expression generation module comprises:
an expression capturing and mapping unit, used for capturing facial expression data of the real person through a facial capture device and mapping the data onto the digital human model to drive expressions in real time, and further used for analyzing input voice or text content with a deep learning algorithm and automatically generating the corresponding digital human expressions;
an expression library and animation curve unit, used for establishing an expression library and for controlling the change curve, amplitude and transition effect of expressions by adjusting animation curves;
the rendering engine module comprises:
a real-time rendering unit, used for rendering the digital human model with a real-time rendering engine;
a rendering output and optimization unit, used for outputting the rendered digital human video as video files of different formats and resolutions, and for compressing and encoding the rendered video output;
the interaction control module comprises:
a user interface and operation control unit, used for providing a user interface through a touch screen, through which the user operates the system and sets its parameters;
a data management and communication interface unit, used for managing and storing system data and providing a communication interface to other systems or applications.
Preferably, the character modeling unit creates appearance models of different parts of the digital human based on the 3D modeling technology and stores the models in a database.
Preferably, the bone binding and animation unit detects the 3D human body posture, compares it with the real human body joints, and analyzes and calculates the coordinate displacement of each joint point v among the 33 joints of the human body based on video frames, specifically as follows:
The 3D coordinate offset D_{j-1,j} between the data of frame j-1 and frame j is calculated as:

D_{j-1,j} = sqrt( (x_j − x_{j-1})^2 + (y_j − y_{j-1})^2 + (z_j − z_{j-1})^2 )

where (x_{j-1}, y_{j-1}, z_{j-1}) are the joint position coordinates of frame j-1 and (x_j, y_j, z_j) are the joint position coordinates of frame j;
The ratio of the 3D coordinate offset D_{j-1,j} to the height h of the current real person is calculated, and the 3D coordinate offset is updated:

D'_{j-1,j} = D_{j-1,j} / h

where D'_{j-1,j} is the updated 3D coordinate offset;
The quotient of D'_{j-1,j} and the time difference Δt between two adjacent frames is taken, so that each joint point v obtains a corresponding velocity value V_{ujv} at the j-th frame of the u-th action video:

V_{ujv} = D'_{j-1,j} / Δt
In a video of P frames, the average velocity value V̄_{uv} of each joint point v is:

V̄_{uv} = (1 / (P − 1)) · Σ_{j=2…P} V_{ujv}
Different actions generate a corresponding action set according to their different action amplitudes, in which each joint point corresponds to a different average velocity; a speed threshold V_v that takes the other actions in the set into account is generated from these average velocities:

V_v = (1 / U) · Σ_{u=1…U} V̄_{uv}

where U is the number of actions in the action set;
The action set is partitioned based on a cosine annealing strategy, with the action recognition judgment error corresponding to the partition of the action set taken as the maximum judgment error for the cosine annealing.
Preferably, the expression capturing and mapping unit performs capture and label recognition based on a facial expression recognition network, wherein the facial expression recognition network comprises an input module, a Transformer module fused with a self-attention mechanism, a multi-scale feature extraction module and a fusion recognition module;
The input module acquires face image data X_0 ∈ R^{H×W×C} of the real person, where H and W are the height and width of the image data and C is the number of image channels; the collected facial expression image data is equidistantly segmented and serialized based on a common divisor λ of the height and width of the image data to obtain X'; each image block is position-encoded and linearly projected, and the position encoding vector corresponding to each image block is added to obtain the real input X;
The Transformer module fused with the self-attention mechanism comprises a multi-head attention layer, a feedforward neural network layer and layer normalization, and a residual connection is further added between the input and the output of the multi-head attention layer and of the feedforward neural network layer;
The multi-scale feature extraction module divides the facial expression of the real person into a plurality of regions, each region corresponding to a square matrix whose dimension is the number of nodes in the region; the image block position parameter feature α and the node number parameter feature β are integrated into the output of the multi-scale feature extraction module and fused into the self-attention mechanism, and the output of the value vector is as follows:
After the node e_m in the value vector matrix is fused with the node number information and the self-attention mechanism, feature extraction yields a new value vector e'_n:

where f_q is the dimension of the matching vector q; m and n are the total number of nodes in the lateral and longitudinal directions of the image block, respectively;
The output Q of the multi-scale feature extraction module is specifically as follows:
where A, B and E are the query vector matrix, the matching vector matrix and the value vector matrix composed of a, b and e respectively; g_α and g_β are the calculation matrices of the corresponding position parameter feature α and node number parameter feature β; T denotes the vector transpose; g_l corresponds to the different regions of the real person's face;
The fusion recognition module splices the outputs of the Transformer module fused with the self-attention mechanism and of the multi-scale feature extraction module to obtain a feature X_{R,Q}, which is input sequentially into a spatial attention module and a channel attention module. In the spatial attention module,
the input features are processed along the channel dimension by pooling operations:

M_avg(X_{R,Q}) = (1/c) · Σ_{i=1…c} X_{R,Q}^{i},   M_max(X_{R,Q}) = max_{1≤i≤c} X_{R,Q}^{i}

where c is the number of channels; M_avg(X_{R,Q}) and M_max(X_{R,Q}) are the average pooling feature and the maximum pooling feature, respectively;
M_avg(X_{R,Q}) and M_max(X_{R,Q}) are concatenated and then a convolution layer performs the convolution operation, giving the output spatial attention feature Y_1:

Y_1 = σ(f_{conv,e}([M_avg(X_{R,Q}); M_max(X_{R,Q})]))
The channel attention mechanism is specifically as follows:
The input features are processed along the spatial dimension by pooling operations:

O_avg(X_{R,Q}) = (1/(h·w)) · Σ_{i=1…h} Σ_{j=1…w} X_{R,Q}(i, j),   O_max(X_{R,Q}) = max_{i,j} X_{R,Q}(i, j)

where h and w are the image block height value and the image block width value; O_avg(X_{R,Q}) and O_max(X_{R,Q}) are the average pooling feature and the maximum pooling feature, respectively;
O_avg(X_{R,Q}) and O_max(X_{R,Q}) are concatenated and then passed through a multilayer perceptron (MLP), giving the output channel attention feature Y_2:

Y_2 = σ(MLP([O_avg(X_{R,Q}); O_max(X_{R,Q})]))
The output of the spatial attention module and the output of the channel attention module are fused to obtain the final fused output Z:

Z = δ·Y_1 + (1 − δ)·Y_2

where δ is the fusion weight between the spatial attention module and the channel attention module;
The fusion weight δ is obtained by optimization based on a cosine annealing strategy; a warm-up strategy is added to the cosine annealing strategy for the initial transition, and the fusion weight is optimized as follows:
Cosine annealing is performed with the model recognition error corresponding to the initial fusion weight taken as the maximum recognition error; the recognition error gradually decreases during training and reaches its minimum at the end, specifically:

γ_t = γ_min + (1/2)·(γ_max − γ_min)·(1 + cos(π·τ / N_k))

where γ_t is the fusion weight value at the t-th iteration; γ_min is the lowest fusion weight; γ_max is the highest fusion weight; τ is the number of cycles currently learned; N_k is the total number of cycles in the current operating environment;
Finally, the recognition result of the facial expression recognition network on the image data is output through a Softmax layer.
Preferably, the expression library and animation curve unit also adjusts the animation curve based on the cosine annealing strategy, and performs cosine annealing based on the combined effect of the change speed, amplitude and transition effect of the expression.
Preferably, the touch feedback screen is designed as a curved or bendable surface so as to adapt to different application scenarios and device forms; the touch feedback screen can also be assembled from a plurality of independent small touch feedback screen modules into one complete screen, which is divided into a plurality of functional partitions according to user-defined requirements, each functional partition having its own touch functions and feedback modes.
Preferably, the touch feedback screen is designed with a three-layer sensing structure corresponding to pressure sensing, position sensing and gesture sensing respectively; the touch sensing circuit of the sensing structure is also manufactured using nanomaterials.
Preferably, a microlens array is constructed on the surface of the touch feedback screen, a microfluidic channel is arranged below the microlens array and filled with liquids having different electrical or optical characteristics, and the liquid pressure and flow state in the microfluidic channel dynamically adjust the curvature, spacing or orientation of the microlens array; information sensed by the microlens array is also fed back to the microfluidic channel to dynamically adjust the flow and distribution of the liquid.
Preferably, the touch feedback screen further uses sensing technology to detect the actions of a finger or stylus hovering above the screen within a certain distance, thereby constructing a touch feedback screen with three-dimensional spatial sensing capability.
Compared with the prior art, the present invention provides a digital human video synthesis system with the following beneficial effects:
According to the invention, the character model can be built quickly from the image of a real person, greatly shortening the long production cycle of traditional video; by combining multi-modal interaction such as voice, motion and expression, real-time voice and text interaction with the user becomes possible, so that the interaction between the digital human and the user is more natural and smooth, and the content and manner of expression can be adjusted according to the user's feedback and needs, providing more intelligent and flexible service. At the same time, the touch screen is optimized, improving the responsiveness to user operations and the overall user experience.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings; it is apparent that the described embodiments are only some, not all, of the embodiments of the present invention.
According to the invention, the character model can be built quickly from the image of a real person, greatly shortening the long production cycle of traditional video; by combining multi-modal interaction such as voice, motion and expression, real-time voice and text interaction with the user becomes possible, so that the interaction between the digital human and the user is more natural and smooth, and the content and manner of expression can be adjusted according to the user's feedback and needs, providing more intelligent and flexible service. At the same time, the touch screen is optimized, improving the responsiveness to user operations and the overall user experience. Specifically, this includes the following.
Example 1:
Referring to FIGS. 1-5, the digital human video synthesis system of the present invention comprises:
Model construction module 1: used for modeling the digital human based on a real person;
The model construction module 1 includes:
Character modeling unit 11: used for creating appearance models of the digital human's body, face, hairstyle, clothing and so on based on 3D modeling technology; the modeling work is performed with software such as Maya or 3ds Max, and the models are stored in a database so they can be called up at any time. The real person is scanned based on scanning technology to acquire data such as the shape and texture of the body and face, and the digital human model is constructed; the data is collected, for example, by structured light scanning or laser scanning.
Bone binding and animation unit 12: used for establishing a skeleton system for the digital human model and determining the positions and movement ranges of the joints so as to realize natural motion expression, and for performing motion recognition and animation production through bone binding. The steps of motion recognition through bone binding are specifically as follows:
The 3D human body posture is detected and compared with the real human body joints, and the coordinate displacement of each joint point v among the 33 joint points of the human body (see FIG. 2) is analyzed and calculated based on video frames, specifically as follows:
The 3D coordinate offset D_{j-1,j} between the data of frame j-1 and frame j is calculated as:

D_{j-1,j} = sqrt( (x_j − x_{j-1})^2 + (y_j − y_{j-1})^2 + (z_j − z_{j-1})^2 )

where (x_{j-1}, y_{j-1}, z_{j-1}) are the joint position coordinates of frame j-1 and (x_j, y_j, z_j) are the joint position coordinates of frame j;
The ratio of the 3D coordinate offset D_{j-1,j} to the height h of the current real person is calculated, and the 3D coordinate offset is updated:

D'_{j-1,j} = D_{j-1,j} / h

where D'_{j-1,j} is the updated 3D coordinate offset;
The quotient of D'_{j-1,j} and the time difference Δt between two adjacent frames is taken, so that each joint point v obtains a corresponding velocity value V_{ujv} at the j-th frame of the u-th action video:

V_{ujv} = D'_{j-1,j} / Δt
In a video of P frames, the average velocity value V̄_{uv} of each joint point v is:

V̄_{uv} = (1 / (P − 1)) · Σ_{j=2…P} V_{ujv}
Different actions generate a corresponding action set according to their different action amplitudes, in which each joint point corresponds to a different average velocity; a speed threshold V_v that takes the other actions in the set into account is generated from these average velocities:

V_v = (1 / U) · Σ_{u=1…U} V̄_{uv}

where U is the number of actions in the action set;
The action set is partitioned based on a cosine annealing strategy, with the action recognition judgment error corresponding to the partition of the action set taken as the maximum judgment error for the cosine annealing. Based on the action set obtained through the cosine annealing strategy, the action made by the real person can be judged and recognized more accurately through the speed threshold.
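As an illustration of the velocity computation and speed threshold described above, the following is a minimal Python sketch, not the claimed implementation itself; the NumPy representation of the pose sequence as a (P, 33, 3) array, the uniform frame rate, and the helper names are assumptions for illustration:

```python
import numpy as np

def joint_velocities(poses, height, fps):
    """poses: (P, 33, 3) array of 3D joint coordinates per frame.
    Returns (P-1, 33) per-joint velocity values V_ujv, normalised by body height."""
    offsets = np.linalg.norm(np.diff(poses, axis=0), axis=2)  # D_{j-1,j}, shape (P-1, 33)
    offsets_norm = offsets / height                           # D'_{j-1,j}
    return offsets_norm * fps                                 # divide by Δt = 1/fps

def action_speed_thresholds(videos, height, fps):
    """videos: list of (P_u, 33, 3) arrays, one per action u in the action set.
    Returns per-joint speed thresholds V_v averaged over the U actions."""
    mean_speeds = np.stack([joint_velocities(v, height, fps).mean(axis=0) for v in videos])
    return mean_speeds.mean(axis=0)  # V_v, shape (33,)

def recognise(poses, thresholds, height, fps):
    """Flag joints whose average speed in the clip exceeds the learned thresholds."""
    avg = joint_velocities(poses, height, fps).mean(axis=0)
    return avg > thresholds
```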
An animation control system can also be developed to support animation generation modes such as keyframe animation, motion capture data import and physical simulation; for example, bone binding and animation production can be performed with MotionBuilder software.
Voice synthesis module 2: used for converting input text content into voice audio through a voice synthesis engine; common voice synthesis techniques include parametric synthesis, concatenative synthesis and deep-learning-based synthesis. The timbre, intonation, speed and other properties of the synthesized speech are adjusted and optimized to suit different digital human roles and application scenarios. Modern speech synthesis engines, for example, can generate speech with various timbres and styles; voice broadcasting in navigation software is realized through text-to-speech technology, providing users with clear and accurate route guidance.
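By way of example only, the following sketch shows text-to-speech conversion with the open-source pyttsx3 engine; the choice of library, the rate and volume values, and the output file name are illustrative assumptions rather than part of the claimed system:

```python
import pyttsx3

def synthesize(text, rate=160, volume=0.9, out_file="digital_human_line.wav"):
    """Convert input text to speech audio and save it to a file."""
    engine = pyttsx3.init()
    engine.setProperty("rate", rate)      # speaking speed (words per minute)
    engine.setProperty("volume", volume)  # 0.0 - 1.0
    engine.save_to_file(text, out_file)   # render to an audio file instead of the speakers
    engine.runAndWait()
    return out_file

synthesize("Welcome, I am your digital assistant.")
```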
Expression generation module 3: used for generating the expressions of the digital human from the facial expressions of the real person;
The expression generation module 3 includes:
Expression capturing and mapping unit 31: used for capturing facial expression data of the real person through a facial capture device such as a camera or a depth sensor, and mapping the facial expression data onto the digital human model to realize real-time expression driving;
The expression capturing and mapping unit 31 performs capture and label recognition based on a facial expression recognition network comprising an input module, a Transformer module fusing a self-attention mechanism, a multi-scale feature extraction module and a fusion recognition module;
The input module acquires face image data X_0 ∈ R^{H×W×C} of the real person, where H and W are the height and width of the image data and C is the number of image channels; the collected facial expression image data is equidistantly segmented and serialized based on a common divisor λ of the height and width of the image data to obtain X'; each image block is position-encoded and linearly projected, and the position encoding vector corresponding to each image block is added to obtain the real input X;
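A minimal PyTorch-style sketch of such an input module is given below (patch splitting by a common divisor λ of the image height and width, linear projection, and addition of learned position encodings); the class name, the default image size and the embedding dimension are assumptions for illustration:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_h=224, img_w=224, channels=3, patch=16, dim=256):
        super().__init__()
        assert img_h % patch == 0 and img_w % patch == 0, "patch must divide H and W (common divisor λ)"
        self.n_patches = (img_h // patch) * (img_w // patch)
        # splitting into λ x λ blocks followed by linear projection, implemented as a strided conv
        self.proj = nn.Conv2d(channels, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches, dim))  # position encoding vectors

    def forward(self, x0):                    # x0: (B, C, H, W) face image X_0
        x = self.proj(x0)                     # (B, dim, H/λ, W/λ)
        x = x.flatten(2).transpose(1, 2)      # serialized image blocks: (B, N, dim) = X'
        return x + self.pos                   # real input X
```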
The Transformer module fusing the self-attention mechanism comprises a multi-head attention layer, a feedforward neural network layer and layer normalization, and a residual connection module is further added between the input and the output of the multi-head attention layer and of the feedforward neural network layer. The residual connection module can be seen in FIG. 3, where the upper and lower paths are the skip connection and the main path respectively; the skip connection merges the input and the output of the stacked layers through an identity mapping and requires no additional parameters. The gradient can propagate directly back to the earlier layers, so adding more layers does not hinder training and training converges faster.
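A corresponding sketch of one Transformer block with multi-head self-attention, a feedforward network, layer normalization and residual (skip) connections is shown below; the layer sizes and the pre-norm arrangement are illustrative assumptions:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim=256, heads=8, mlp_dim=512, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(mlp_dim, dim), nn.Dropout(dropout),
        )

    def forward(self, x):
        # residual (identity) connection around multi-head self-attention
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # residual connection around the feedforward network
        x = x + self.ffn(self.norm2(x))
        return x
```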
The multi-scale feature extraction module divides the facial expression of the real person into a plurality of regions, each region corresponding to a square matrix whose dimension is the number of nodes in the region; the image block position parameter feature α and the node number parameter feature β are integrated into the output of the multi-scale feature extraction module and fused into the self-attention mechanism, and the output of the value vector is as follows:
After the node e_m in the value vector matrix is fused with the node number information and the self-attention mechanism, feature extraction yields a new value vector e'_n:

where f_q is the dimension of the matching vector q; m and n are the total number of nodes in the lateral and longitudinal directions of the image block, respectively;
The output Q of the multi-scale feature extraction module is specifically as follows:
where A, B and E are the query vector matrix, the matching vector matrix and the value vector matrix composed of a, b and e respectively; g_α and g_β are the calculation matrices of the corresponding position parameter feature α and node number parameter feature β; T denotes the vector transpose; g_l corresponds to the different regions of the real person's face;
The fusion recognition module splices the outputs of the Transformer module fused with the self-attention mechanism and of the multi-scale feature extraction module to obtain a feature X_{R,Q}, which is input sequentially into the spatial attention module and the channel attention module. In the spatial attention module,
the input features are processed along the channel dimension by pooling operations:

M_avg(X_{R,Q}) = (1/c) · Σ_{i=1…c} X_{R,Q}^{i},   M_max(X_{R,Q}) = max_{1≤i≤c} X_{R,Q}^{i}

where c is the number of channels; M_avg(X_{R,Q}) and M_max(X_{R,Q}) are the average pooling feature and the maximum pooling feature, respectively;
M_avg(X_{R,Q}) and M_max(X_{R,Q}) are concatenated and then a convolution layer performs the convolution operation, giving the output spatial attention feature Y_1:

Y_1 = σ(f_{conv,e}([M_avg(X_{R,Q}); M_max(X_{R,Q})]))
The channel attention mechanism is specifically as follows:
The input features are processed along the spatial dimension by pooling operations:

O_avg(X_{R,Q}) = (1/(h·w)) · Σ_{i=1…h} Σ_{j=1…w} X_{R,Q}(i, j),   O_max(X_{R,Q}) = max_{i,j} X_{R,Q}(i, j)

where h and w are the image block height value and the image block width value; O_avg(X_{R,Q}) and O_max(X_{R,Q}) are the average pooling feature and the maximum pooling feature, respectively;
O_avg(X_{R,Q}) and O_max(X_{R,Q}) are concatenated and then passed through a multilayer perceptron (MLP), giving the output channel attention feature Y_2:

Y_2 = σ(MLP([O_avg(X_{R,Q}); O_max(X_{R,Q})]))
The output of the spatial attention module and the output of the channel attention module are fused to obtain the final fused output Z:

Z = δ·Y_1 + (1 − δ)·Y_2

where δ is the fusion weight between the spatial attention module and the channel attention module;
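A hedged sketch of this fusion recognition stage follows, implementing CBAM-style spatial and channel attention and the weighted fusion Z = δ·Y_1 + (1 − δ)·Y_2; the layer sizes, the application of the attention maps back to X_{R,Q}, and the handling of δ as an externally supplied value are assumptions for illustration:

```python
import torch
import torch.nn as nn

class SpatialChannelFusion(nn.Module):
    """Spatial and channel attention over X_{R,Q}, fused with weight delta."""
    def __init__(self, channels, delta=0.5, reduction=8, kernel=7):
        super().__init__()
        # spatial branch: pool along the channel dimension, concatenate, convolve (Y_1)
        self.conv = nn.Conv2d(2, 1, kernel, padding=kernel // 2)
        # channel branch: pool along the spatial dimensions, concatenate, MLP (Y_2)
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.delta = delta  # fusion weight δ, set externally (e.g. by the annealing schedule below)

    def forward(self, x):                                        # x: (B, C, H, W) feature X_{R,Q}
        m = torch.cat([x.mean(dim=1, keepdim=True),
                       x.max(dim=1, keepdim=True).values], dim=1)
        y1 = torch.sigmoid(self.conv(m)) * x                     # spatial attention applied to X_{R,Q}
        o = torch.cat([x.flatten(2).mean(dim=2),
                       x.flatten(2).max(dim=2).values], dim=1)
        y2 = torch.sigmoid(self.mlp(o))[:, :, None, None] * x    # channel attention applied to X_{R,Q}
        return self.delta * y1 + (1 - self.delta) * y2           # Z = δ·Y_1 + (1 − δ)·Y_2
```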
The fusion weight δ is obtained by optimization based on a cosine annealing strategy; a warm-up strategy is added to the cosine annealing strategy for the initial transition, and the fusion weight is optimized as follows:
Cosine annealing is performed with the model recognition error corresponding to the initial fusion weight taken as the maximum recognition error; the recognition error gradually decreases during training and reaches its minimum at the end, specifically:

γ_t = γ_min + (1/2)·(γ_max − γ_min)·(1 + cos(π·τ / N_k))

where γ_t is the fusion weight value at the t-th iteration; γ_min is the lowest fusion weight; γ_max is the highest fusion weight; τ is the number of cycles currently learned; N_k is the total number of cycles in the current operating environment;
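A small sketch of the warm-up plus cosine-annealing schedule for the fusion weight is given below; the cosine part follows the formula above, while the linear warm-up length and the default γ_min and γ_max values are assumptions:

```python
import math

def fusion_weight(t, total_cycles, w_min=0.1, w_max=0.9, warmup=10):
    """Cosine-annealed fusion weight γ_t with a linear warm-up transition at the start."""
    if t < warmup:                        # warm-up: ramp linearly toward the annealing start value
        return w_min + (w_max - w_min) * t / warmup
    tau = t - warmup                      # cycles learned so far within the annealing phase
    n_k = max(total_cycles - warmup, 1)   # total annealing cycles N_k
    return w_min + 0.5 * (w_max - w_min) * (1 + math.cos(math.pi * tau / n_k))

schedule = [round(fusion_weight(t, 100), 3) for t in range(100)]
```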
Finally, the recognition result of the facial expression recognition network on the image data is output through a Softmax layer.
For facial expression recognition, a convolutional network is taken as model one, a convolutional network with a self-attention mechanism as model two, a convolutional network with a self-attention mechanism and a spatial attention mechanism as model three, a convolutional network with a self-attention mechanism and a channel attention mechanism as model four, a convolutional network with a self-attention mechanism, a spatial attention mechanism and a channel attention mechanism as model five, and the facial expression recognition network of this embodiment as model six. Facial expression recognition is performed with each model, and the accuracy of the recognition results is shown in Table 1:
TABLE 1 facial expression recognition accuracy results for different models
Expression label | Model one | Model two | Model three | Model four | Model five | Model six
Angry | 66.51% | 67.01% | 69.12% | 70.32% | 71.84% | 72.32%
Disgust | 65.37% | 66.23% | 67.91% | 68.72% | 70.69% | 71.08%
Fear | 60.85% | 66.06% | 67.63% | 69.71% | 71.85% | 72.89%
Happy | 63.45% | 64.89% | 66.68% | 69.28% | 70.35% | 72.94%
Sad | 67.35% | 68.15% | 69.36% | 71.15% | 72.68% | 74.06%
Surprise | 68.16% | 69.48% | 70.06% | 71.26% | 72.68% | 73.48%
As can be seen from Table 1, the model of this embodiment achieves the highest accuracy.
A deep learning algorithm is also used to analyze the input voice or text content and automatically generate the corresponding digital human expressions.
Expression library and animation curve unit 32: used for establishing an expression library containing various common expressions and micro-expressions so that they can be quickly called when needed. The animation curves are adjusted through a cosine annealing strategy, and the annealing is performed based on the combined effect of the change speed, amplitude and transition effect of the expression, making the expressions more natural and smooth.
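As an illustration of a cosine-shaped transition between two expressions, the following sketch blends blend-shape weight dictionaries with a cosine easing curve; the representation of expressions as blend-shape weights and the specific key names are assumptions:

```python
import math

def cosine_ease(t):
    """Cosine easing: 0 at t=0, 1 at t=1, with a smooth start and end."""
    return 0.5 * (1 - math.cos(math.pi * t))

def blend_expressions(expr_a, expr_b, t, amplitude=1.0):
    """Interpolate two blend-shape weight dictionaries with a cosine transition curve."""
    w = amplitude * cosine_ease(max(0.0, min(1.0, t)))
    return {k: (1 - w) * expr_a.get(k, 0.0) + w * expr_b.get(k, 0.0)
            for k in set(expr_a) | set(expr_b)}

neutral = {"mouth_smile": 0.0, "brow_up": 0.0}
happy   = {"mouth_smile": 0.8, "brow_up": 0.3}
frame_weights = [blend_expressions(neutral, happy, t / 30) for t in range(31)]
```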
Rendering engine module 4: used for rendering the digital human video;
The rendering engine module 4 includes:
Real-time rendering unit 41: used for rendering the digital human model with an advanced real-time rendering engine such as Unreal Engine or Unity to generate a vivid visual effect; adjustment and optimization of rendering effects such as ray tracing, shading, materials and textures are also supported to improve the quality of the digital human video.
Rendering output and optimization unit 42: used for outputting the rendered digital human video as video files of different formats (such as MP4 and AVI) and resolutions (such as 720p, 1080p and 4K), and for compressing and encoding the rendered video output to reduce file size and improve transmission and storage efficiency while maintaining video quality;
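For instance, the export step can be sketched with the ffmpeg command-line tool invoked from Python (this assumes ffmpeg is installed; the codec, CRF value and file names are illustrative):

```python
import subprocess

def export_video(src, dst, height=1080, crf=23):
    """Re-encode a rendered video to an H.264 MP4 at the requested resolution."""
    cmd = [
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale=-2:{height}",          # keep aspect ratio, set height to 720/1080/2160...
        "-c:v", "libx264", "-crf", str(crf),  # quality-controlled compression
        "-c:a", "aac",
        dst,
    ]
    subprocess.run(cmd, check=True)

export_video("render_raw.mov", "digital_human_1080p.mp4", height=1080)
```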
Interaction control module 5: used for performing data interaction with the user and with third parties;
The interaction control module 5 includes:
User interface and operation control unit 51: provides a concise and intuitive user interface for the user through the touch screen, making it convenient for the user to operate the digital human video synthesis system and set its parameters, such as selecting a digital human model, inputting text content, and adjusting voice and expression parameters.
Data management and communication interface unit 52: used for managing and storing digital human models, voice data, expression data, animation data and the like, supporting data import, export and backup, and providing a communication interface with other systems or applications to realize data sharing and interaction, for example integration with video editing software or live streaming platforms.
The touch feedback screen is designed as a curved or bendable surface so that it can adapt to different application scenarios and device forms. For example, a wrap-around flexible touch feedback screen designed for a wearable device fits the curve of the human body better and provides a more natural and comfortable interaction experience, while an arc-shaped touch feedback screen designed for an automobile dashboard improves visibility and ease of operation for the driver.
The touch feedback screen is further divided into a plurality of functional partitions, or a modular design is adopted. Different partitions can have different touch functions and feedback modes, and the user can customize the function of each partition according to their own needs. For example, on the touch feedback screen of a game controller, the screen is divided into a direction control area, an action button area, a function setting area and so on, and each area can provide a different tactile feedback effect to enhance the immersion of the game.
A conventional touch feedback screen generally has only one sensing layer; in this embodiment the number of sensing layers is increased and the touch feedback screen is designed with a three-layer sensing structure (see FIG. 4), in which the three layers from top to bottom correspond to pressure sensing, position sensing and gesture sensing respectively, so as to realize more accurate and richer touch input recognition. For example, when a user lightly touches the screen, the first pressure sensing layer detects the magnitude of the pressure; the second position sensing layer determines the coordinates of the touch; and the third gesture sensing layer recognizes gesture actions such as sliding and zooming of the finger. The sensing structure also uses nanomaterials such as carbon nanotubes and silver nanowires to manufacture the touch sensing circuit. These nanomaterials have excellent conductivity and flexibility and can improve the sensitivity and flexibility of the touch feedback screen. For example, using a carbon nanotube film as the sensing layer, faster signal response and higher spatial resolution can be achieved thanks to the high conductivity and small size of carbon nanotubes.
Referring to FIG. 5, a microlens array is constructed on the surface of the touch feedback screen; the microlenses can focus light, enhancing the display brightness and contrast of the screen, and can also be used to realize a 3D touch effect. When the user touches the screen, the position and force of the touch are detected through the optical change of the microlens array, and a three-dimensional feedback effect can be presented visually according to the touch operation; for example, in a game, pressing different positions of the screen with different forces produces the 3D visual effect of an object protruding or sinking to different degrees. A microfluidic channel is arranged below the microlens array and is filled with liquids having different electrical or optical characteristics. For example, when the user touches the screen, the pressure makes the liquid in the microfluidic channel flow, changing the local resistance or capacitance and thereby realizing touch detection; or, by changing the distribution of the liquid, the light transmission or color of the screen is affected locally, providing visual feedback. For example, in an e-book reading application, when the user touches the screen to turn a page, the edge of the screen shows a colored liquid flow effect as feedback. The curvature, spacing or orientation of the microlens array is dynamically adjusted according to the liquid pressure and flow state in the microfluidic channel; conversely, information such as external illumination and touch position sensed by the microlens array can also be fed back to the microfluidic system to dynamically adjust the flow and distribution of the liquid, realizing adaptive adjustment of screen display and touch feedback. For example, in an outdoor strong-light environment, the microlens array detects the increase in illumination intensity, triggers liquid flow in the microfluidic channel and changes the parameters of the microlenses to enhance the display brightness and contrast of the screen; meanwhile, when the user performs a touch operation, the microlens array transmits the touch information to the microfluidic system, producing corresponding liquid flow and tactile feedback.
By using magnetic field, electric field or acoustic-wave technologies, the touch feedback screen can detect the actions of a finger or stylus hovering above the screen within a certain distance, realizing non-contact touch operation and feedback. For example, by swiping a finger in the air above the screen, page scrolling and content selection can be performed, while tactile feedback is provided through air vibration or slight vibration of the device. In this way, a touch feedback screen with three-dimensional spatial sensing capability can be constructed. By arranging a plurality of sensors inside or around the screen, the position, direction and movement of an object in three-dimensional space can be detected. For example, in a virtual reality (VR) or augmented reality (AR) device, the user can directly interact with virtual objects in three-dimensional space, such as touching, grabbing and rotating them, and realistic haptic feedback is provided by means of device vibration, force-feedback gloves and the like.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto; any equivalent substitution or modification made by a person skilled in the art according to the technical scheme of the present invention and its inventive concept, within the scope disclosed by the present invention, shall be covered by the scope of the present invention.