Disclosure of Invention
The invention aims to provide a digital human video synthesis system to solve the problems in the background art:
Motion and expression generation in existing digital human video systems still appears unnatural, and existing digital humans have limited interaction capability with users, especially through touch screens, responding to user operations inaccurately and inflexibly.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A digital human video synthesis system, comprising:
a model construction module, used for modeling a digital human based on a real person;
a voice synthesis module, used for converting input text content into voice audio through a voice synthesis engine and for adjusting and optimizing the synthesized voice;
an expression generation module, used for generating expressions of the digital human from the facial expressions of the real person;
a rendering engine module, used for rendering the digital human video;
an interaction control module, used for performing data interaction with the user and with third parties;
the model construction module comprises:
a character modeling unit, used for creating appearance models based on 3D modeling technology, scanning the real person based on scanning technology, and constructing the digital human model;
a bone binding and animation unit, used for establishing a skeleton system for the digital human model, determining the positions and movement ranges of the joints, and performing motion recognition and animation production through bone binding;
the expression generation module comprises:
an expression capturing and mapping unit, used for capturing facial expression data of the real person through a facial capture device and mapping the data onto the digital human model to drive expressions in real time, and further used for analyzing input voice or text content with a deep learning algorithm and automatically generating the corresponding digital human expressions;
an expression library and animation curve unit, used for establishing an expression library and for controlling the change curve, amplitude and transition effect of expressions by adjusting animation curves;
the rendering engine module comprises:
a real-time rendering unit, used for rendering the digital human model with a real-time rendering engine;
a rendering output and optimization unit, used for outputting the rendered digital human video as video files of different formats and resolutions, and for compressing and encoding the rendered video output;
the interaction control module comprises:
a user interface and operation control unit, used for providing a user interface through a touch screen, through which the user operates the system and sets its parameters;
a data management and communication interface unit, used for managing and storing system data and providing a communication interface to other systems or applications.
Preferably, the character modeling unit creates appearance models of different parts of the digital human based on the 3D modeling technology and stores the models in a database.
Preferably, the bone binding and animation unit detects the 3D human body posture, compares it with the real human body joints, and analyzes and calculates the coordinate displacement of each joint point v among the 33 joints of the human body based on video frames, specifically as follows:
The 3D coordinate offset D_{j-1,j} between the data of frame j-1 and frame j is calculated as:

D_{j-1,j} = sqrt( (x_j − x_{j-1})^2 + (y_j − y_{j-1})^2 + (z_j − z_{j-1})^2 )

where (x_{j-1}, y_{j-1}, z_{j-1}) are the joint position coordinates of frame j-1 and (x_j, y_j, z_j) are the joint position coordinates of frame j;
The ratio of the 3D coordinate offset D_{j-1,j} to the height h of the current real person is calculated, and the 3D coordinate offset is updated:

D'_{j-1,j} = D_{j-1,j} / h

where D'_{j-1,j} is the updated 3D coordinate offset;
The quotient of D'_{j-1,j} and the time difference Δt between two adjacent frames is taken, so that each joint point v obtains a corresponding velocity value V_{ujv} at the j-th frame of the u-th action video:

V_{ujv} = D'_{j-1,j} / Δt
In a video of P frames, the average velocity value V̄_{uv} of each joint point v is:

V̄_{uv} = (1 / (P − 1)) · Σ_{j=2…P} V_{ujv}
Different actions generate a corresponding action set according to their different action amplitudes, in which each joint point corresponds to a different average velocity; a speed threshold V_v that takes the other actions in the set into account is generated from these average velocities:

V_v = (1 / U) · Σ_{u=1…U} V̄_{uv}

where U is the number of actions in the action set;
The action set is partitioned based on a cosine annealing strategy, with the action recognition judgment error corresponding to the partition of the action set taken as the maximum judgment error for the cosine annealing.
Preferably, the expression capturing and mapping unit performs capture and label recognition based on a facial expression recognition network, wherein the facial expression recognition network comprises an input module, a Transformer module fused with a self-attention mechanism, a multi-scale feature extraction module and a fusion recognition module;
The input module acquires face image data X_0 ∈ R^{H×W×C} of the real person, where H and W are the height and width of the image data and C is the number of image channels; the collected facial expression image data is equidistantly segmented and serialized based on a common divisor λ of the height and width of the image data to obtain X'; each image block is position-encoded and linearly projected, and the position encoding vector corresponding to each image block is added to obtain the real input X;
The Transformer module fused with the self-attention mechanism comprises a multi-head attention layer, a feedforward neural network layer and layer normalization, and a residual connection is further added between the input and the output of the multi-head attention layer and of the feedforward neural network layer;
The multi-scale feature extraction module divides the facial expression of the real person into a plurality of regions, each region corresponding to a square matrix whose dimension is the number of nodes in the region; the image block position parameter feature α and the node number parameter feature β are integrated into the output of the multi-scale feature extraction module and fused into the self-attention mechanism, and the output of the value vector is as follows:
After the node e_m in the value vector matrix is fused with the node number information and the self-attention mechanism, feature extraction yields a new value vector e'_n:

where f_q is the dimension of the matching vector q; m and n are the total number of nodes in the lateral and longitudinal directions of the image block, respectively;
The output Q of the multi-scale feature extraction module is specifically as follows:
where A, B and E are the query vector matrix, the matching vector matrix and the value vector matrix composed of a, b and e respectively; g_α and g_β are the calculation matrices of the corresponding position parameter feature α and node number parameter feature β; T denotes the vector transpose; g_l corresponds to the different regions of the real person's face;
The fusion recognition module splices the outputs of the Transformer module fused with the self-attention mechanism and of the multi-scale feature extraction module to obtain a feature X_{R,Q}, which is input sequentially into a spatial attention module and a channel attention module. In the spatial attention module,
the input features are processed along the channel dimension by pooling operations:

M_avg(X_{R,Q}) = (1/c) · Σ_{i=1…c} X_{R,Q}^{i},   M_max(X_{R,Q}) = max_{1≤i≤c} X_{R,Q}^{i}

where c is the number of channels; M_avg(X_{R,Q}) and M_max(X_{R,Q}) are the average pooling feature and the maximum pooling feature, respectively;
M_avg(X_{R,Q}) and M_max(X_{R,Q}) are concatenated and then a convolution layer performs the convolution operation, giving the output spatial attention feature Y_1:

Y_1 = σ(f_{conv,e}([M_avg(X_{R,Q}); M_max(X_{R,Q})]))
The channel attention mechanism is specifically as follows:
The input features are processed along the spatial dimension by pooling operations:

O_avg(X_{R,Q}) = (1/(h·w)) · Σ_{i=1…h} Σ_{j=1…w} X_{R,Q}(i, j),   O_max(X_{R,Q}) = max_{i,j} X_{R,Q}(i, j)

where h and w are the image block height value and the image block width value; O_avg(X_{R,Q}) and O_max(X_{R,Q}) are the average pooling feature and the maximum pooling feature, respectively;
O_avg(X_{R,Q}) and O_max(X_{R,Q}) are concatenated and then passed through a multilayer perceptron (MLP), giving the output channel attention feature Y_2:

Y_2 = σ(MLP([O_avg(X_{R,Q}); O_max(X_{R,Q})]))
The output of the spatial attention module and the output of the channel attention module are fused to obtain the final fused output Z:

Z = δ·Y_1 + (1 − δ)·Y_2

where δ is the fusion weight between the spatial attention module and the channel attention module;
The fusion weight δ is obtained by optimization based on a cosine annealing strategy; a warm-up strategy is added to the cosine annealing strategy for the initial transition, and the fusion weight is optimized as follows:
Cosine annealing is performed with the model recognition error corresponding to the initial fusion weight taken as the maximum recognition error; the recognition error gradually decreases during training and reaches its minimum at the end, specifically:

γ_t = γ_min + (1/2)·(γ_max − γ_min)·(1 + cos(π·τ / N_k))

where γ_t is the fusion weight value at the t-th iteration; γ_min is the lowest fusion weight; γ_max is the highest fusion weight; τ is the number of cycles currently learned; N_k is the total number of cycles in the current operating environment;
Finally, the recognition result of the facial expression recognition network on the image data is output through a Softmax layer.
Preferably, the expression library and animation curve unit also adjusts the animation curve based on the cosine annealing strategy, and performs cosine annealing based on the combined effect of the change speed, amplitude and transition effect of the expression.
Preferably, the touch feedback screen is designed as a curved or bendable surface so as to adapt to different application scenarios and device forms; the touch feedback screen can also be assembled from a plurality of independent small touch feedback screen modules into one complete screen, which is divided into a plurality of functional partitions according to user-defined requirements, each functional partition having its own touch functions and feedback modes.
Preferably, the touch feedback screen is designed with a three-layer sensing structure corresponding to pressure sensing, position sensing and gesture sensing respectively; the touch sensing circuit of the sensing structure is also manufactured using nanomaterials.
Preferably, a microlens array is constructed on the surface of the touch feedback screen, a microfluidic channel is arranged below the microlens array and filled with liquids having different electrical or optical characteristics, and the liquid pressure and flow state in the microfluidic channel dynamically adjust the curvature, spacing or orientation of the microlens array; information sensed by the microlens array is also fed back to the microfluidic channel to dynamically adjust the flow and distribution of the liquid.
Preferably, the touch feedback screen further uses sensing technology to detect the actions of a finger or stylus hovering above the screen within a certain distance, thereby constructing a touch feedback screen with three-dimensional spatial sensing capability.
Compared with the prior art, the present invention provides a digital human video synthesis system with the following beneficial effects:
According to the invention, the character model can be built quickly from the image of a real person, greatly shortening the long production cycle of traditional video; by combining multi-modal interaction such as voice, motion and expression, real-time voice and text interaction with the user becomes possible, so that the interaction between the digital human and the user is more natural and smooth, and the content and manner of expression can be adjusted according to the user's feedback and needs, providing more intelligent and flexible service. At the same time, the touch screen is optimized, improving the responsiveness to user operations and the overall user experience.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings; it is apparent that the described embodiments are only some, not all, of the embodiments of the present invention.
According to the invention, the character model can be built quickly from the image of a real person, greatly shortening the long production cycle of traditional video; by combining multi-modal interaction such as voice, motion and expression, real-time voice and text interaction with the user becomes possible, so that the interaction between the digital human and the user is more natural and smooth, and the content and manner of expression can be adjusted according to the user's feedback and needs, providing more intelligent and flexible service. At the same time, the touch screen is optimized, improving the responsiveness to user operations and the overall user experience. Specifically, this includes the following.
Example 1:
Referring to FIGS. 1-5, the digital human video synthesis system of the present invention comprises:
Model construction module 1: used for modeling the digital human based on a real person;
The model construction module 1 includes:
Character modeling unit 11: used for creating appearance models of the digital human's body, face, hairstyle, clothing and so on based on 3D modeling technology; the modeling work is performed with software such as Maya or 3ds Max, and the models are stored in a database so they can be called up at any time. The real person is scanned based on scanning technology to acquire data such as the shape and texture of the body and face, and the digital human model is constructed; the data is collected, for example, by structured light scanning or laser scanning.
Bone binding and animation unit 12: used for establishing a skeleton system for the digital human model and determining the positions and movement ranges of the joints so as to realize natural motion expression, and for performing motion recognition and animation production through bone binding. The steps of motion recognition through bone binding are specifically as follows:
The 3D human body posture is detected and compared with the real human body joints, and the coordinate displacement of each joint point v among the 33 joint points of the human body (see FIG. 2) is analyzed and calculated based on video frames, specifically as follows:
The 3D coordinate offset D_{j-1,j} between the data of frame j-1 and frame j is calculated as:

D_{j-1,j} = sqrt( (x_j − x_{j-1})^2 + (y_j − y_{j-1})^2 + (z_j − z_{j-1})^2 )

where (x_{j-1}, y_{j-1}, z_{j-1}) are the joint position coordinates of frame j-1 and (x_j, y_j, z_j) are the joint position coordinates of frame j;
The ratio of the 3D coordinate offset D_{j-1,j} to the height h of the current real person is calculated, and the 3D coordinate offset is updated:

D'_{j-1,j} = D_{j-1,j} / h

where D'_{j-1,j} is the updated 3D coordinate offset;
The quotient of D'_{j-1,j} and the time difference Δt between two adjacent frames is taken, so that each joint point v obtains a corresponding velocity value V_{ujv} at the j-th frame of the u-th action video:

V_{ujv} = D'_{j-1,j} / Δt
In a video of P frames, the average velocity value V̄_{uv} of each joint point v is:

V̄_{uv} = (1 / (P − 1)) · Σ_{j=2…P} V_{ujv}
Different actions generate a corresponding action set according to their different action amplitudes, in which each joint point corresponds to a different average velocity; a speed threshold V_v that takes the other actions in the set into account is generated from these average velocities:

V_v = (1 / U) · Σ_{u=1…U} V̄_{uv}

where U is the number of actions in the action set;
The action set is partitioned based on a cosine annealing strategy, with the action recognition judgment error corresponding to the partition of the action set taken as the maximum judgment error for the cosine annealing. Based on the action set obtained through the cosine annealing strategy, the action made by the real person can be judged and recognized more accurately through the speed threshold.
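As an illustration of the velocity computation and speed threshold described above, the following is a minimal Python sketch, not the claimed implementation itself; the NumPy representation of the pose sequence as a (P, 33, 3) array, the uniform frame rate, and the helper names are assumptions for illustration:

```python
import numpy as np

def joint_velocities(poses, height, fps):
    """poses: (P, 33, 3) array of 3D joint coordinates per frame.
    Returns (P-1, 33) per-joint velocity values V_ujv, normalised by body height."""
    offsets = np.linalg.norm(np.diff(poses, axis=0), axis=2)  # D_{j-1,j}, shape (P-1, 33)
    offsets_norm = offsets / height                           # D'_{j-1,j}
    return offsets_norm * fps                                 # divide by Δt = 1/fps

def action_speed_thresholds(videos, height, fps):
    """videos: list of (P_u, 33, 3) arrays, one per action u in the action set.
    Returns per-joint speed thresholds V_v averaged over the U actions."""
    mean_speeds = np.stack([joint_velocities(v, height, fps).mean(axis=0) for v in videos])
    return mean_speeds.mean(axis=0)  # V_v, shape (33,)

def recognise(poses, thresholds, height, fps):
    """Flag joints whose average speed in the clip exceeds the learned thresholds."""
    avg = joint_velocities(poses, height, fps).mean(axis=0)
    return avg > thresholds
```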
An animation control system can also be developed to support animation generation modes such as keyframe animation, motion capture data import and physical simulation; for example, bone binding and animation production can be performed with MotionBuilder software.
Voice synthesis module 2: used for converting input text content into voice audio through a voice synthesis engine; common voice synthesis techniques include parametric synthesis, concatenative synthesis and deep-learning-based synthesis. The timbre, intonation, speed and other properties of the synthesized speech are adjusted and optimized to suit different digital human roles and application scenarios. Modern speech synthesis engines, for example, can generate speech with various timbres and styles; voice broadcasting in navigation software is realized through text-to-speech technology, providing users with clear and accurate route guidance.
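By way of example only, the following sketch shows text-to-speech conversion with the open-source pyttsx3 engine; the choice of library, the rate and volume values, and the output file name are illustrative assumptions rather than part of the claimed system:

```python
import pyttsx3

def synthesize(text, rate=160, volume=0.9, out_file="digital_human_line.wav"):
    """Convert input text to speech audio and save it to a file."""
    engine = pyttsx3.init()
    engine.setProperty("rate", rate)      # speaking speed (words per minute)
    engine.setProperty("volume", volume)  # 0.0 - 1.0
    engine.save_to_file(text, out_file)   # render to an audio file instead of the speakers
    engine.runAndWait()
    return out_file

synthesize("Welcome, I am your digital assistant.")
```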
Expression generation module 3: used for generating the expressions of the digital human from the facial expressions of the real person;
The expression generation module 3 includes:
Expression capturing and mapping unit 31: used for capturing facial expression data of the real person through a facial capture device such as a camera or a depth sensor, and mapping the facial expression data onto the digital human model to realize real-time expression driving;
The expression capturing and mapping unit 31 performs capture and label recognition based on a facial expression recognition network comprising an input module, a Transformer module fusing a self-attention mechanism, a multi-scale feature extraction module and a fusion recognition module;
The input module acquires face image data X_0 ∈ R^{H×W×C} of the real person, where H and W are the height and width of the image data and C is the number of image channels; the collected facial expression image data is equidistantly segmented and serialized based on a common divisor λ of the height and width of the image data to obtain X'; each image block is position-encoded and linearly projected, and the position encoding vector corresponding to each image block is added to obtain the real input X;
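A minimal PyTorch-style sketch of such an input module is given below (patch splitting by a common divisor λ of the image height and width, linear projection, and addition of learned position encodings); the class name, the default image size and the embedding dimension are assumptions for illustration:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_h=224, img_w=224, channels=3, patch=16, dim=256):
        super().__init__()
        assert img_h % patch == 0 and img_w % patch == 0, "patch must divide H and W (common divisor λ)"
        self.n_patches = (img_h // patch) * (img_w // patch)
        # splitting into λ x λ blocks followed by linear projection, implemented as a strided conv
        self.proj = nn.Conv2d(channels, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.n_patches, dim))  # position encoding vectors

    def forward(self, x0):                    # x0: (B, C, H, W) face image X_0
        x = self.proj(x0)                     # (B, dim, H/λ, W/λ)
        x = x.flatten(2).transpose(1, 2)      # serialized image blocks: (B, N, dim) = X'
        return x + self.pos                   # real input X
```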
The Transformer module fusing the self-attention mechanism comprises a multi-head attention layer, a feedforward neural network layer and layer normalization, and a residual connection module is further added between the input and the output of the multi-head attention layer and of the feedforward neural network layer. The residual connection module can be seen in FIG. 3, where the upper and lower paths are the skip connection and the main path respectively; the skip connection merges the input and the output of the stacked layers through an identity mapping and requires no additional parameters. The gradient can propagate directly back to the earlier layers, so adding more layers does not hinder training and training converges faster.
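A corresponding sketch of one Transformer block with multi-head self-attention, a feedforward network, layer normalization and residual (skip) connections is shown below; the layer sizes and the pre-norm arrangement are illustrative assumptions:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim=256, heads=8, mlp_dim=512, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(mlp_dim, dim), nn.Dropout(dropout),
        )

    def forward(self, x):
        # residual (identity) connection around multi-head self-attention
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # residual connection around the feedforward network
        x = x + self.ffn(self.norm2(x))
        return x
```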
The multi-scale feature extraction module divides the facial expression of the real person into a plurality of regions, each region corresponding to a square matrix whose dimension is the number of nodes in the region; the image block position parameter feature α and the node number parameter feature β are integrated into the output of the multi-scale feature extraction module and fused into the self-attention mechanism, and the output of the value vector is as follows:
After the node e_m in the value vector matrix is fused with the node number information and the self-attention mechanism, feature extraction yields a new value vector e'_n:

where f_q is the dimension of the matching vector q; m and n are the total number of nodes in the lateral and longitudinal directions of the image block, respectively;
The output Q of the multi-scale feature extraction module is specifically as follows:
where A, B and E are the query vector matrix, the matching vector matrix and the value vector matrix composed of a, b and e respectively; g_α and g_β are the calculation matrices of the corresponding position parameter feature α and node number parameter feature β; T denotes the vector transpose; g_l corresponds to the different regions of the real person's face;
The fusion recognition module splices the outputs of the Transformer module fused with the self-attention mechanism and of the multi-scale feature extraction module to obtain a feature X_{R,Q}, which is input sequentially into the spatial attention module and the channel attention module. In the spatial attention module,
the input features are processed along the channel dimension by pooling operations:

M_avg(X_{R,Q}) = (1/c) · Σ_{i=1…c} X_{R,Q}^{i},   M_max(X_{R,Q}) = max_{1≤i≤c} X_{R,Q}^{i}

where c is the number of channels; M_avg(X_{R,Q}) and M_max(X_{R,Q}) are the average pooling feature and the maximum pooling feature, respectively;
M_avg(X_{R,Q}) and M_max(X_{R,Q}) are concatenated and then a convolution layer performs the convolution operation, giving the output spatial attention feature Y_1:

Y_1 = σ(f_{conv,e}([M_avg(X_{R,Q}); M_max(X_{R,Q})]))
The channel attention mechanism is specifically as follows:
The input features are processed along the spatial dimension by pooling operations:

O_avg(X_{R,Q}) = (1/(h·w)) · Σ_{i=1…h} Σ_{j=1…w} X_{R,Q}(i, j),   O_max(X_{R,Q}) = max_{i,j} X_{R,Q}(i, j)

where h and w are the image block height value and the image block width value; O_avg(X_{R,Q}) and O_max(X_{R,Q}) are the average pooling feature and the maximum pooling feature, respectively;
O_avg(X_{R,Q}) and O_max(X_{R,Q}) are concatenated and then passed through a multilayer perceptron (MLP), giving the output channel attention feature Y_2:

Y_2 = σ(MLP([O_avg(X_{R,Q}); O_max(X_{R,Q})]))
The output of the spatial attention module and the output of the channel attention module are fused to obtain the final fused output Z:

Z = δ·Y_1 + (1 − δ)·Y_2

where δ is the fusion weight between the spatial attention module and the channel attention module;
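A hedged sketch of this fusion recognition stage follows, implementing CBAM-style spatial and channel attention and the weighted fusion Z = δ·Y_1 + (1 − δ)·Y_2; the layer sizes, the application of the attention maps back to X_{R,Q}, and the handling of δ as an externally supplied value are assumptions for illustration:

```python
import torch
import torch.nn as nn

class SpatialChannelFusion(nn.Module):
    """Spatial and channel attention over X_{R,Q}, fused with weight delta."""
    def __init__(self, channels, delta=0.5, reduction=8, kernel=7):
        super().__init__()
        # spatial branch: pool along the channel dimension, concatenate, convolve (Y_1)
        self.conv = nn.Conv2d(2, 1, kernel, padding=kernel // 2)
        # channel branch: pool along the spatial dimensions, concatenate, MLP (Y_2)
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.delta = delta  # fusion weight δ, set externally (e.g. by the annealing schedule below)

    def forward(self, x):                                        # x: (B, C, H, W) feature X_{R,Q}
        m = torch.cat([x.mean(dim=1, keepdim=True),
                       x.max(dim=1, keepdim=True).values], dim=1)
        y1 = torch.sigmoid(self.conv(m)) * x                     # spatial attention applied to X_{R,Q}
        o = torch.cat([x.flatten(2).mean(dim=2),
                       x.flatten(2).max(dim=2).values], dim=1)
        y2 = torch.sigmoid(self.mlp(o))[:, :, None, None] * x    # channel attention applied to X_{R,Q}
        return self.delta * y1 + (1 - self.delta) * y2           # Z = δ·Y_1 + (1 − δ)·Y_2
```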
The fusion weight δ is obtained by optimization based on a cosine annealing strategy; a warm-up strategy is added to the cosine annealing strategy for the initial transition, and the fusion weight is optimized as follows:
Cosine annealing is performed with the model recognition error corresponding to the initial fusion weight taken as the maximum recognition error; the recognition error gradually decreases during training and reaches its minimum at the end, specifically:

γ_t = γ_min + (1/2)·(γ_max − γ_min)·(1 + cos(π·τ / N_k))

where γ_t is the fusion weight value at the t-th iteration; γ_min is the lowest fusion weight; γ_max is the highest fusion weight; τ is the number of cycles currently learned; N_k is the total number of cycles in the current operating environment;
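A small sketch of the warm-up plus cosine-annealing schedule for the fusion weight is given below; the cosine part follows the formula above, while the linear warm-up length and the default γ_min and γ_max values are assumptions:

```python
import math

def fusion_weight(t, total_cycles, w_min=0.1, w_max=0.9, warmup=10):
    """Cosine-annealed fusion weight γ_t with a linear warm-up transition at the start."""
    if t < warmup:                        # warm-up: ramp linearly toward the annealing start value
        return w_min + (w_max - w_min) * t / warmup
    tau = t - warmup                      # cycles learned so far within the annealing phase
    n_k = max(total_cycles - warmup, 1)   # total annealing cycles N_k
    return w_min + 0.5 * (w_max - w_min) * (1 + math.cos(math.pi * tau / n_k))

schedule = [round(fusion_weight(t, 100), 3) for t in range(100)]
```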
Finally, the recognition result of the facial expression recognition network on the image data is output through a Softmax layer.
For facial expression recognition, a convolutional network is taken as model one, a convolutional network with a self-attention mechanism as model two, a convolutional network with a self-attention mechanism and a spatial attention mechanism as model three, a convolutional network with a self-attention mechanism and a channel attention mechanism as model four, a convolutional network with a self-attention mechanism, a spatial attention mechanism and a channel attention mechanism as model five, and the facial expression recognition network of this embodiment as model six. Facial expression recognition is performed with each model, and the accuracy of the recognition results is shown in Table 1:
TABLE 1 facial expression recognition accuracy results for different models
Expression label | Model one | Model two | Model three | Model four | Model five | Model six
Angry | 66.51% | 67.01% | 69.12% | 70.32% | 71.84% | 72.32%
Disgust | 65.37% | 66.23% | 67.91% | 68.72% | 70.69% | 71.08%
Fear | 60.85% | 66.06% | 67.63% | 69.71% | 71.85% | 72.89%
Happy | 63.45% | 64.89% | 66.68% | 69.28% | 70.35% | 72.94%
Sad | 67.35% | 68.15% | 69.36% | 71.15% | 72.68% | 74.06%
Surprise | 68.16% | 69.48% | 70.06% | 71.26% | 72.68% | 73.48%
As can be seen from Table 1, the model of this embodiment achieves the highest accuracy.
A deep learning algorithm is also used to analyze the input voice or text content and automatically generate the corresponding digital human expressions.
Expression library and animation curve unit 32: used for establishing an expression library containing various common expressions and micro-expressions so that they can be quickly called when needed. The animation curves are adjusted through a cosine annealing strategy, and the annealing is performed based on the combined effect of the change speed, amplitude and transition effect of the expression, making the expressions more natural and smooth.
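As an illustration of a cosine-shaped transition between two expressions, the following sketch blends blend-shape weight dictionaries with a cosine easing curve; the representation of expressions as blend-shape weights and the specific key names are assumptions:

```python
import math

def cosine_ease(t):
    """Cosine easing: 0 at t=0, 1 at t=1, with a smooth start and end."""
    return 0.5 * (1 - math.cos(math.pi * t))

def blend_expressions(expr_a, expr_b, t, amplitude=1.0):
    """Interpolate two blend-shape weight dictionaries with a cosine transition curve."""
    w = amplitude * cosine_ease(max(0.0, min(1.0, t)))
    return {k: (1 - w) * expr_a.get(k, 0.0) + w * expr_b.get(k, 0.0)
            for k in set(expr_a) | set(expr_b)}

neutral = {"mouth_smile": 0.0, "brow_up": 0.0}
happy   = {"mouth_smile": 0.8, "brow_up": 0.3}
frame_weights = [blend_expressions(neutral, happy, t / 30) for t in range(31)]
```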
Rendering engine module 4: used for rendering the digital human video;
The rendering engine module 4 includes:
Real-time rendering unit 41: used for rendering the digital human model with an advanced real-time rendering engine such as Unreal Engine or Unity to generate a vivid visual effect; adjustment and optimization of rendering effects such as ray tracing, shading, materials and textures are also supported to improve the quality of the digital human video.
Rendering output and optimization unit 42: used for outputting the rendered digital human video as video files of different formats (such as MP4 and AVI) and resolutions (such as 720p, 1080p and 4K), and for compressing and encoding the rendered video output to reduce file size and improve transmission and storage efficiency while maintaining video quality;
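For instance, the export step can be sketched with the ffmpeg command-line tool invoked from Python (this assumes ffmpeg is installed; the codec, CRF value and file names are illustrative):

```python
import subprocess

def export_video(src, dst, height=1080, crf=23):
    """Re-encode a rendered video to an H.264 MP4 at the requested resolution."""
    cmd = [
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale=-2:{height}",          # keep aspect ratio, set height to 720/1080/2160...
        "-c:v", "libx264", "-crf", str(crf),  # quality-controlled compression
        "-c:a", "aac",
        dst,
    ]
    subprocess.run(cmd, check=True)

export_video("render_raw.mov", "digital_human_1080p.mp4", height=1080)
```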
Interaction control module 5: used for performing data interaction with the user and with third parties;
The interaction control module 5 includes:
User interface and operation control unit 51: provides a concise and intuitive user interface for the user through the touch screen, making it convenient for the user to operate the digital human video synthesis system and set its parameters, such as selecting a digital human model, inputting text content, and adjusting voice and expression parameters.
Data management and communication interface unit 52: used for managing and storing digital human models, voice data, expression data, animation data and the like, supporting data import, export and backup, and providing a communication interface with other systems or applications to realize data sharing and interaction, for example integration with video editing software or live streaming platforms.
The touch feedback screen is designed as a curved or bendable surface so that it can adapt to different application scenarios and device forms. For example, a wrap-around flexible touch feedback screen designed for a wearable device fits the curve of the human body better and provides a more natural and comfortable interaction experience, while an arc-shaped touch feedback screen designed for an automobile dashboard improves visibility and ease of operation for the driver.
The touch feedback screen is further divided into a plurality of functional partitions, or a modular design is adopted. Different partitions can have different touch functions and feedback modes, and the user can customize the function of each partition according to their own needs. For example, on the touch feedback screen of a game controller, the screen is divided into a direction control area, an action button area, a function setting area and so on, and each area can provide a different tactile feedback effect to enhance the immersion of the game.
A conventional touch feedback screen generally has only one sensing layer; in this embodiment the number of sensing layers is increased and the touch feedback screen is designed with a three-layer sensing structure (see FIG. 4), in which the three layers from top to bottom correspond to pressure sensing, position sensing and gesture sensing respectively, so as to realize more accurate and richer touch input recognition. For example, when a user lightly touches the screen, the first pressure sensing layer detects the magnitude of the pressure; the second position sensing layer determines the coordinates of the touch; and the third gesture sensing layer recognizes gesture actions such as sliding and zooming of the finger. The sensing structure also uses nanomaterials such as carbon nanotubes and silver nanowires to manufacture the touch sensing circuit. These nanomaterials have excellent conductivity and flexibility and can improve the sensitivity and flexibility of the touch feedback screen. For example, using a carbon nanotube film as the sensing layer, faster signal response and higher spatial resolution can be achieved thanks to the high conductivity and small size of carbon nanotubes.
Referring to FIG. 5, a microlens array is constructed on the surface of the touch feedback screen; the microlenses can focus light, enhancing the display brightness and contrast of the screen, and can also be used to realize a 3D touch effect. When the user touches the screen, the position and force of the touch are detected through the optical change of the microlens array, and a three-dimensional feedback effect can be presented visually according to the touch operation; for example, in a game, pressing different positions of the screen with different forces produces the 3D visual effect of an object protruding or sinking to different degrees. A microfluidic channel is arranged below the microlens array and is filled with liquids having different electrical or optical characteristics. For example, when the user touches the screen, the pressure makes the liquid in the microfluidic channel flow, changing the local resistance or capacitance and thereby realizing touch detection; or, by changing the distribution of the liquid, the light transmission or color of the screen is affected locally, providing visual feedback. For example, in an e-book reading application, when the user touches the screen to turn a page, the edge of the screen shows a colored liquid flow effect as feedback. The curvature, spacing or orientation of the microlens array is dynamically adjusted according to the liquid pressure and flow state in the microfluidic channel; conversely, information such as external illumination and touch position sensed by the microlens array can also be fed back to the microfluidic system to dynamically adjust the flow and distribution of the liquid, realizing adaptive adjustment of screen display and touch feedback. For example, in an outdoor strong-light environment, the microlens array detects the increase in illumination intensity, triggers liquid flow in the microfluidic channel and changes the parameters of the microlenses to enhance the display brightness and contrast of the screen; meanwhile, when the user performs a touch operation, the microlens array transmits the touch information to the microfluidic system, producing corresponding liquid flow and tactile feedback.
By using magnetic field, electric field or acoustic-wave technologies, the touch feedback screen can detect the actions of a finger or stylus hovering above the screen within a certain distance, realizing non-contact touch operation and feedback. For example, by swiping a finger in the air above the screen, page scrolling and content selection can be performed, while tactile feedback is provided through air vibration or slight vibration of the device. In this way, a touch feedback screen with three-dimensional spatial sensing capability can be constructed. By arranging a plurality of sensors inside or around the screen, the position, direction and movement of an object in three-dimensional space can be detected. For example, in a virtual reality (VR) or augmented reality (AR) device, the user can directly interact with virtual objects in three-dimensional space, such as touching, grabbing and rotating them, and realistic haptic feedback is provided by means of device vibration, force-feedback gloves and the like.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto; any equivalent substitution or modification made by a person skilled in the art according to the technical scheme of the present invention and its inventive concept, within the scope disclosed by the present invention, shall be covered by the scope of the present invention.