CN119011752A - Digital human video synthesis system - Google Patents

Digital human video synthesis system Download PDF

Info

Publication number
CN119011752A
Authority
CN
China
Prior art keywords
module
digital human
expression
real
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410964070.2A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Haixuan Education Technology Co ltd
Original Assignee
Anhui Haixuan Education Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Haixuan Education Technology Co ltd filed Critical Anhui Haixuan Education Technology Co ltd
Priority to CN202410964070.2A priority Critical patent/CN119011752A/en
Publication of CN119011752A publication Critical patent/CN119011752A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265Mixing
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/24Speech recognition using non-acoustical features
    • G10L15/25Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44012Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present invention discloses a digital human video synthesis system and relates to the field of intelligent technology. Compared with previous digital human video synthesis systems, the present invention addresses the problems that motion and expression generation in existing digital human video systems is still unnatural, and that current digital humans remain limited in their ability to interact with users, especially through the touch screen, where responses to user operations are not sufficiently accurate or flexible. The system can quickly build a character model from a real person's image, greatly shortening the long production cycle of traditional video. It also combines multimodal interaction methods such as voice, action, and expression to interact with the user in real time by voice and text, making the interaction between the digital human and the user more natural and smooth, and it adjusts content and presentation according to the user's feedback and needs, providing more intelligent and flexible services. At the same time, the touch screen is optimized to improve the sensitivity of responses to user operations and improve the user experience.

Description

Digital human video synthesis system
Technical Field
The invention relates to the field of intelligent technology, and in particular to a digital human video synthesis system.
Background
Digital human video synthesis systems are increasingly emerging with the development of technologies such as computer graphics, artificial intelligence, computer vision, and natural language processing.
In the field of computer graphics, the continual advancement of 3D modeling and animation technology has made it possible to create realistic digital human models. By using specialized modeling software and rendering engines, digital human appearance and actions can be constructed with high detail and realism.
The development of artificial intelligence, especially deep learning, provides powerful support for the speech synthesis, expression generation, and motion driving of digital humans. For example, speech synthesis models based on deep neural networks can generate natural, fluent speech, while image generation models based on generative adversarial networks (GANs) and variational autoencoders (VAEs) can be used to generate digital human facial expressions and actions.
Natural language processing techniques enable a digital human to understand and process natural language text and to generate corresponding speech and actions from the input text content. Computer vision technology can be used to capture the actions and expressions of real people and map them onto the digital human model, achieving a more natural and lifelike interaction effect.
Although digital human appearance modeling and rendering techniques have made great progress, in some cases motion and expression generation still appears unnatural, with problems such as stiff motion, discontinuous expressions, and a lack of subtle emotional expression. Current digital humans also have limited interaction capability with users, particularly through the touch screen, where responses to user operations are not sufficiently accurate or flexible.
In order to solve these problems, the invention provides a digital human video synthesis system.
Disclosure of Invention
The invention aims to provide a digital human video synthesis system to solve the problems in the background art:
Motion and expression generation in existing digital human video systems is still unnatural, and existing digital humans have limited interaction capability with users, especially through the touch screen, where responses to user operations are not sufficiently accurate or flexible.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A digital human video synthesis system, comprising:
Model construction module: used for modeling a digital human based on a real person;
Speech synthesis module: used for converting input text content into speech audio through a speech synthesis engine, and for adjusting and optimizing the synthesized speech;
Expression generation module: used for generating the digital human's expressions from the facial expressions of the real person;
Rendering engine module: used for rendering the digital human video;
Interaction control module: used for data interaction with users and third parties;
The model construction module comprises:
Character modeling unit: used for creating an appearance model based on 3D modeling technology, scanning the real person with scanning technology, and constructing the digital human model;
Skeleton binding and animation unit: used for establishing a skeleton system for the digital human model, determining the positions and ranges of motion of the joints, and performing action recognition and animation production through skeleton binding;
The expression generation module comprises:
Expression capture and mapping unit: used for capturing facial expression data of the real person through a facial capture device and mapping it onto the digital human model to drive expressions in real time; a deep learning algorithm is also used to analyze input speech or text content and automatically generate corresponding digital human expressions;
Expression library and animation curve unit: used for establishing an expression library and controlling the change curve, amplitude, and transition effects of expressions by adjusting animation curves;
The rendering engine module comprises:
Real-time rendering unit: used for rendering the digital human model with a real-time rendering engine;
Rendering output and optimization unit: used for outputting the rendered digital human video as video files of different formats and resolutions, and for compressing and encoding the rendered video;
The interaction control module comprises:
User interface and operation control unit: used for providing a user interface through a touch screen, allowing the user to operate the system and set parameters;
Data management and communication interface unit: used for managing and storing system data, and for providing a communication interface with other systems or applications.
Preferably, the character modeling unit creates appearance models of the different parts of the digital human based on 3D modeling technology and stores them in a database.
Preferably, the skeleton binding and animation unit detects the 3D human body posture, compares it with the joints of the real human body, and analyzes and calculates the coordinate displacement of the v-th joint point among the 33 joint points of the human body based on the video frames, as follows:
The 3D coordinate offset d_{j-1,j} is calculated from the data of frame j-1 and frame j:
d_{j-1,j} = sqrt((x_j - x_{j-1})^2 + (y_j - y_{j-1})^2 + (z_j - z_{j-1})^2)
where (x_{j-1}, y_{j-1}, z_{j-1}) are the joint position coordinates of frame j-1 and (x_j, y_j, z_j) are the joint position coordinates of frame j;
The ratio of the 3D coordinate offset d_{j-1,j} to the height of the current real person is then calculated to update the 3D coordinate offset:
d'_{j-1,j} = d_{j-1,j} / height
where d'_{j-1,j} is the updated 3D coordinate offset;
Dividing d'_{j-1,j} by the time difference Δt between two adjacent frames, each joint point v generates a corresponding rate change value V_ujv at the j-th frame of the action-u video;
In a video of P frames, the average velocity of each joint point v is the mean of V_ujv over the frames;
Different actions generate corresponding action sets according to their different amplitudes, where each joint point corresponds to a different average velocity, and a velocity threshold V_v that also takes the other actions in the set into account is generated from these average velocities, where U is the number of actions in the action set;
The action set is divided based on a cosine annealing strategy, with the action recognition judgment error value corresponding to the division of the action set taken as the maximum judgment error value for the annealing.
Preferably, the expression capture and mapping unit captures and identifies expression labels based on a facial expression recognition network, which comprises an input module, a Transformer module incorporating a self-attention mechanism, a multi-scale feature extraction module, and a fusion recognition module;
The input module acquires facial image data X_0 ∈ R^(H×W×C) of the real person, where H and W are the height and width of the image data and C is the number of image channels; the collected facial expression image data is split into equal-sized patches based on a common divisor λ of the image height and width and serialized to obtain X', each image patch is position-encoded and linearly projected, and the position encoding vector of each patch is added to obtain the real input X;
The Transformer module incorporating the self-attention mechanism comprises a multi-head attention layer, a feed-forward neural network layer, and layer normalization, with residual connection modules added between the inputs and outputs of the multi-head attention layer and the feed-forward neural network layer;
The multi-scale feature extraction module divides the facial expression of the real person into several regions, each region corresponding to a square matrix whose dimension is the number of nodes in the region; the output of the multi-scale feature extraction module incorporates the patch position parameter feature α and the node count parameter feature β, which are fused into the self-attention mechanism to produce the value-vector output;
After the node e_m in the value vector matrix is fused with the node count and the self-attention mechanism, features are extracted to obtain a new value vector e'_n,
where f_q is the dimension of the matching vector q, and m and n are the total numbers of nodes in the horizontal and vertical directions of the image patch, respectively;
The output Q of the multi-scale feature extraction module is then computed,
where A, B, and E are the query vector matrix, the matching vector matrix, and the value vector matrix composed of a, b, and e respectively; g_α and g_β are the calculation matrices of the position parameter features α and β; T denotes the vector transpose; and G_l corresponds to the different regions of the real person's face;
The fusion recognition module concatenates the outputs of the Transformer module incorporating the self-attention mechanism and of the multi-scale feature extraction module to obtain the feature X_{R,Q}, which is fed sequentially into a spatial attention module and a channel attention module. In the spatial attention module,
the input features are processed along the channel dimension by pooling operations,
where c is the channel index and M_avg(X_{R,Q}) and M_max(X_{R,Q}) are the average-pooled and max-pooled features, respectively;
M_avg(X_{R,Q}) and M_max(X_{R,Q}) are concatenated and passed through a convolution layer to obtain the output spatial attention feature Y_1:
Y_1 = σ(f_conv,e(M_avg(X_{R,Q}); M_max(X_{R,Q})))
The channel attention mechanism is as follows:
the input features are processed along the spatial dimensions by pooling operations,
where the pooled indices run over the image patch height and width, and O_avg(X_{R,Q}) and O_max(X_{R,Q}) are the average-pooled and max-pooled features, respectively;
O_avg(X_{R,Q}) and O_max(X_{R,Q}) are concatenated and passed through a convolution layer to obtain the output channel attention Y_2:
Y_2 = σ(MLP(O_avg(X_{R,Q}); O_max(X_{R,Q})))
The output of the spatial attention module and the output of the channel attention module are fused to obtain the final fused output Z:
Z = δY_1 + (1-δ)Y_2
where δ is the fusion weight of the spatial attention module and the channel attention module;
The fusion weight δ is obtained by optimization based on a cosine annealing strategy; a warm-up strategy is added to the cosine annealing strategy for the initial transition, and the fusion weight is optimized as follows:
Cosine annealing is performed with the model recognition error value corresponding to the initial fusion weight taken as the maximum recognition error value; the recognition error value gradually decreases during training and reaches its minimum at the end:
γ_t = γ_min + (1/2)(γ_max - γ_min)(1 + cos(τπ/N_k))
where γ_t is the fusion weight value at the t-th iteration, γ_min is the lowest fusion weight, γ_max is the highest fusion weight, τ is the index of the cycle currently being learned, and N_k is the total number of cycles in the current operating environment;
Finally, the recognition result of the facial expression recognition network for the image data is output through the Softmax layer.
Preferably, the expression library and animation curve unit also adjusts the animation curve based on the cosine annealing strategy, and performs cosine annealing based on the combined effect of the change speed, amplitude and transition effect of the expression.
Preferably, the touch feedback screen is designed with a curved or bendable shape to adapt to different application scenarios and device forms; the touch feedback screen may also be formed by splicing several independent small touch feedback screen modules into a complete screen, which is divided into several functional partitions according to user-defined requirements, each functional partition having its own touch functions and feedback modes.
Preferably, the touch feedback screen is designed with a three-layer sensing structure corresponding to pressure sensing, position sensing, and gesture sensing respectively; the touch sensing circuits of the sensing structure are also fabricated from nanomaterials.
Preferably, a microlens array is constructed on the surface of the touch feedback screen, with microfluidic channels arranged below it and filled with different liquids having electrical or optical properties; the liquid pressure and flow state in the microfluidic channels dynamically adjust the curvature, spacing, or orientation of the microlens array, and information from the microlens array is in turn fed back to the microfluidic channels to dynamically adjust the flow and distribution of the liquid.
Preferably, the touch feedback screen also uses sensing technology to detect the movement of a finger or stylus hovering within a certain distance above the screen, thereby constructing a touch feedback screen with three-dimensional spatial sensing capability.
Compared with the prior art, the digital human video synthesis system provided by the invention has the following beneficial effects:
According to the invention, a character model can be quickly built from the image of a real person, greatly shortening the long production cycle of traditional video. Combining multimodal interaction methods such as voice, action, and expression, the system can interact with the user in real time through voice and text, making the interaction between the digital human and the user more natural and smooth, and it adjusts content and presentation according to the user's feedback and needs, providing more intelligent and flexible services. At the same time, the touch screen is optimized to improve the sensitivity of responses to user operations and to improve the user experience.
Drawings
Fig. 1 is a block diagram of the system mentioned in embodiment 1 of the present invention;
Fig. 2 is a schematic view of the joints of the human body according to embodiment 1 of the present invention;
Fig. 3 is a schematic diagram of the residual connection module structure in embodiment 1 of the present invention;
Fig. 4 is a schematic diagram of the sensing structure mentioned in embodiment 1 of the present invention;
Fig. 5 is a schematic diagram of the surface structure of the touch feedback screen according to embodiment 1 of the present invention.
Meanings of the reference numerals in the figures:
1. A model building module; 11. a character modeling unit; 12. a bone binding and animation unit; 2. a speech synthesis module; 3. expression generating module; 31. expression capturing and mapping unit; 32. expression library and animation curve unit; 4. a rendering engine module; 41. a real-time rendering unit; 42. rendering output and optimization unit; 5. an interaction control module; 51. a user interface and an operation control unit; 52. and a data management and communication interface unit.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings; it is apparent that the described embodiments are only some, rather than all, of the embodiments of the present invention.
According to the invention, a character model can be quickly built from the image of a real person, greatly shortening the long production cycle of traditional video. Combining multimodal interaction methods such as voice, action, and expression, the system can interact with the user in real time through voice and text, making the interaction between the digital human and the user more natural and smooth, and it adjusts content and presentation according to the user's feedback and needs, providing more intelligent and flexible services. At the same time, the touch screen is optimized to improve the sensitivity of responses to user operations and to improve the user experience. Specifically, the invention includes the following.
Example 1:
Referring to Figs. 1-5, the digital human video synthesis system of the present invention comprises:
Model building module 1: used for modeling a digital human based on a real person;
the model building module 1 includes:
Character modeling unit 11: used for creating appearance models of the digital human's body, face, hairstyle, clothing, and so on based on 3D modeling technology; modeling work is performed with software such as Maya or 3ds Max and the models are stored in a database for recall at any time. The real person is scanned with scanning technology, for example structured light scanning or laser scanning, to acquire data such as the shape and texture of the body and face, and the digital human model is constructed from this data.
Skeleton binding and animation unit 12: used for establishing a skeleton system for the digital human model and determining the positions and ranges of motion of the joints so as to achieve natural motion, and for performing action recognition and animation production through skeleton binding. The steps for performing action recognition through skeleton binding are as follows:
The 3D human body posture is detected and compared with the joints of the real human body; referring to Fig. 2, the coordinate displacement of the v-th joint point among the 33 joint points of the human body is analyzed and calculated based on the video frames, as follows:
The 3D coordinate offset d_{j-1,j} is calculated from the data of frame j-1 and frame j:
d_{j-1,j} = sqrt((x_j - x_{j-1})^2 + (y_j - y_{j-1})^2 + (z_j - z_{j-1})^2)
where (x_{j-1}, y_{j-1}, z_{j-1}) are the joint position coordinates of frame j-1 and (x_j, y_j, z_j) are the joint position coordinates of frame j;
The ratio of the 3D coordinate offset d_{j-1,j} to the height of the current real person is then calculated to update the 3D coordinate offset:
d'_{j-1,j} = d_{j-1,j} / height
where d'_{j-1,j} is the updated 3D coordinate offset;
Dividing d'_{j-1,j} by the time difference Δt between two adjacent frames, each joint point v generates a corresponding rate change value V_ujv at the j-th frame of the action-u video;
In a video of P frames, the average velocity of each joint point v is the mean of V_ujv over the frames;
Different actions generate corresponding action sets according to their different amplitudes, where each joint point corresponds to a different average velocity, and a velocity threshold V_v that also takes the other actions in the set into account is generated from these average velocities, where U is the number of actions in the action set;
The action set is divided based on a cosine annealing strategy, with the action recognition judgment error value corresponding to the division of the action set taken as the maximum judgment error value. Based on the action set obtained with the cosine annealing strategy, the action made by the real person can be judged and identified more accurately through the velocity threshold.
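As an illustration of the joint-displacement and velocity computation described above, a minimal NumPy sketch follows. The Euclidean form of the offset, the averaging of per-action mean velocities to obtain the threshold V_v, and the 30 fps frame time are assumptions consistent with the text rather than the patent's exact formulas.

```python
import numpy as np

def joint_velocities(joints, height, dt):
    """joints: array of shape (P, 33, 3), per-frame 3D coordinates of the 33 joints.
    height: body height of the real person, used to normalise the offsets.
    dt: time difference between two adjacent frames, in seconds.
    Returns an array of shape (P-1, 33) with the per-joint rate values V_ujv."""
    offsets = np.linalg.norm(np.diff(joints, axis=0), axis=-1)  # d_{j-1,j}, Euclidean offset
    offsets_norm = offsets / height                             # d'_{j-1,j}
    return offsets_norm / dt                                    # V_ujv

def speed_thresholds(action_videos, height, dt):
    """action_videos: list of (P_u, 33, 3) arrays, one per action u in the action set.
    Returns per-joint thresholds V_v as the mean of the per-action average velocities
    (one plausible reading of 'a threshold that considers the other actions in the set')."""
    per_action_means = np.stack([joint_velocities(v, height, dt).mean(axis=0)
                                 for v in action_videos])       # shape (U, 33)
    return per_action_means.mean(axis=0)                        # shape (33,)

# Hypothetical usage with random data standing in for pose-estimator output.
rng = np.random.default_rng(0)
videos = [rng.normal(size=(60, 33, 3)) for _ in range(4)]       # U = 4 actions
thresholds = speed_thresholds(videos, height=1.75, dt=1 / 30)
print(thresholds.shape)  # (33,)
```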
An animation control system can also be developed to support animation generation modes such as keyframe animation, motion capture data import, and physical simulation; for example, skeleton binding and animation production can be performed with MotionBuilder software.
Speech synthesis module 2: used for converting input text content into speech audio through a speech synthesis engine. Common speech synthesis techniques include parametric synthesis, concatenative synthesis, and deep-learning-based synthesis. The timbre, intonation, speed, and so on of the synthesized speech are adjusted and optimized to suit different digital human roles and application scenarios. For example, mature commercial speech synthesis technology can generate speech with various timbres and styles, and voice broadcasting in navigation software is realized through text-to-speech technology to give users clear and accurate route guidance.
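As a rough illustration of the text-to-speech step, the sketch below uses the open-source pyttsx3 engine to convert text into audio and to adjust rate and volume; the engine choice, parameter values, and output file name are assumptions, since the system may instead rely on a parametric, concatenative, or deep-learning synthesizer.

```python
import pyttsx3  # offline text-to-speech engine; stands in for the system's synthesis engine

def synthesise(text: str, out_path: str = "digital_human_line.wav",
               rate: int = 160, volume: float = 0.9) -> None:
    """Convert input text into speech audio and tune basic prosody parameters."""
    engine = pyttsx3.init()
    engine.setProperty("rate", rate)      # speaking speed (words per minute)
    engine.setProperty("volume", volume)  # 0.0 - 1.0
    voices = engine.getProperty("voices")
    if voices:                            # pick a voice to match the digital human persona
        engine.setProperty("voice", voices[0].id)
    engine.save_to_file(text, out_path)   # render to an audio file for lip-sync and rendering
    engine.runAndWait()

synthesise("Welcome, I am your digital assistant.")
```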
Expression generation module 3: used for generating the digital human's expressions from the facial expressions of the real person;
the expression generating module 3 includes:
Expression capture and mapping unit 31: used for capturing facial expression data of the real person through a facial capture device such as a camera or a depth sensor, and mapping the data onto the digital human model to drive expressions in real time;
The expression capture and mapping unit 31 captures and identifies expression labels based on a facial expression recognition network, which comprises an input module, a Transformer module incorporating a self-attention mechanism, a multi-scale feature extraction module, and a fusion recognition module;
The input module acquires facial image data X_0 ∈ R^(H×W×C) of the real person, where H and W are the height and width of the image data and C is the number of image channels; the collected facial expression image data is split into equal-sized patches based on a common divisor λ of the image height and width and serialized to obtain X', each image patch is position-encoded and linearly projected, and the position encoding vector of each patch is added to obtain the real input X;
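A minimal PyTorch sketch of such an input module is given below: the face image is split into equal patches of side λ, each patch is linearly projected, and a position encoding is added. The image size, patch size λ = 16, embedding dimension, and the use of a learnable position encoding are assumptions, not values specified by the patent.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an H x W x C face image into (H/lambda)*(W/lambda) patches, project them
    linearly, and add a position encoding vector per patch to form the real input X."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=256):
        super().__init__()
        assert img_size % patch_size == 0, "lambda must divide both H and W"
        num_patches = (img_size // patch_size) ** 2
        # A stride-lambda convolution is equivalent to cutting patches and applying one linear layer.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, x0):                      # x0: (B, C, H, W)
        x = self.proj(x0)                       # (B, D, H/lambda, W/lambda)
        x = x.flatten(2).transpose(1, 2)        # (B, N, D), serialised patch sequence X'
        return x + self.pos_embed               # real input X with position encoding added

x = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(x.shape)  # torch.Size([2, 196, 256])
```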
The Transformer module incorporating the self-attention mechanism comprises a multi-head attention layer, a feed-forward neural network layer, and layer normalization, with residual connection modules added between the inputs and outputs of the multi-head attention layer and the feed-forward neural network layer. The residual connection module is shown in Fig. 3: the upper and lower paths are the skip connection and the main path respectively; the skip connection merges the input and output of the stacked layers through an identity mapping and requires no additional parameters. Gradients can propagate directly back to the earlier layers through the skip connection, which makes the additional layers easier and faster to train.
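The Transformer block with the added residual connections can be sketched as follows; the post-norm layer ordering, head count, and hidden sizes are assumptions. The skip connections are the identity mappings that let gradients flow back to earlier layers without extra parameters.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Multi-head attention and a feed-forward network, each wrapped in a residual
    (identity) connection followed by layer normalisation."""
    def __init__(self, dim=256, heads=8, ff_hidden=1024, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ff_hidden), nn.GELU(),
                                 nn.Linear(ff_hidden, dim), nn.Dropout(dropout))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                         # x: (B, N, D) patch sequence
        attn_out, _ = self.attn(x, x, x)          # self-attention over the patches
        x = self.norm1(x + attn_out)              # residual connection around attention
        x = self.norm2(x + self.ffn(x))           # residual connection around the FFN
        return x

y = TransformerBlock()(torch.randn(2, 196, 256))
print(y.shape)  # torch.Size([2, 196, 256])
```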
The multi-scale feature extraction module divides the facial expression of the real person into several regions, each region corresponding to a square matrix whose dimension is the number of nodes in the region; the output of the multi-scale feature extraction module incorporates the patch position parameter feature α and the node count parameter feature β, which are fused into the self-attention mechanism to produce the value-vector output;
After the node e_m in the value vector matrix is fused with the node count and the self-attention mechanism, features are extracted to obtain a new value vector e'_n,
where f_q is the dimension of the matching vector q, and m and n are the total numbers of nodes in the horizontal and vertical directions of the image patch, respectively;
The output Q of the multi-scale feature extraction module is then computed,
where A, B, and E are the query vector matrix, the matching vector matrix, and the value vector matrix composed of a, b, and e respectively; g_α and g_β are the calculation matrices of the position parameter features α and β; T denotes the vector transpose; and G_l corresponds to the different regions of the real person's face;
The fusion recognition module concatenates the outputs of the Transformer module incorporating the self-attention mechanism and of the multi-scale feature extraction module to obtain the feature X_{R,Q}, which is fed sequentially into a spatial attention module and a channel attention module. In the spatial attention module,
the input features are processed along the channel dimension by pooling operations,
where c is the channel index and M_avg(X_{R,Q}) and M_max(X_{R,Q}) are the average-pooled and max-pooled features, respectively;
M_avg(X_{R,Q}) and M_max(X_{R,Q}) are concatenated and passed through a convolution layer to obtain the output spatial attention feature Y_1:
Y_1 = σ(f_conv,e(M_avg(X_{R,Q}); M_max(X_{R,Q})))
The channel attention mechanism is as follows:
the input features are processed along the spatial dimensions by pooling operations,
where the pooled indices run over the image patch height and width, and O_avg(X_{R,Q}) and O_max(X_{R,Q}) are the average-pooled and max-pooled features, respectively;
O_avg(X_{R,Q}) and O_max(X_{R,Q}) are concatenated and passed through a convolution layer to obtain the output channel attention Y_2:
Y_2 = σ(MLP(O_avg(X_{R,Q}); O_max(X_{R,Q})))
The output of the spatial attention module and the output of the channel attention module are fused to obtain the final fused output Z:
Z = δY_1 + (1-δ)Y_2
where δ is the fusion weight of the spatial attention module and the channel attention module;
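The spatial and channel attention branches and the δ-weighted fusion Z = δY_1 + (1-δ)Y_2 can be sketched roughly as below in PyTorch. CBAM-style pooling, the convolution kernel size, the MLP width, treating δ as a learnable scalar, and applying each attention map back to the input feature are assumptions rather than the patent's exact formulation.

```python
import torch
import torch.nn as nn

class SpatialChannelFusion(nn.Module):
    """Average/max pooling along the channel and spatial dimensions, a convolution
    (spatial branch) and an MLP (channel branch) with sigmoid activations, fused by
    a weight delta."""
    def __init__(self, channels=256, kernel_size=7, reduction=8):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)       # spatial branch
        self.mlp = nn.Sequential(nn.Linear(2 * channels, channels // reduction),  # channel branch
                                 nn.ReLU(),
                                 nn.Linear(channels // reduction, channels))
        self.delta = nn.Parameter(torch.tensor(0.5))   # fusion weight, later tuned by annealing

    def forward(self, x):                              # x: (B, C, H, W) spliced feature X_{R,Q}
        m_avg = x.mean(dim=1, keepdim=True)            # M_avg: pool along the channel dimension
        m_max = x.max(dim=1, keepdim=True).values      # M_max
        y1 = torch.sigmoid(self.conv(torch.cat([m_avg, m_max], dim=1)))  # spatial attention Y1
        o_avg = x.mean(dim=(2, 3))                     # O_avg: pool along the spatial dimensions
        o_max = x.amax(dim=(2, 3))                     # O_max
        y2 = torch.sigmoid(self.mlp(torch.cat([o_avg, o_max], dim=1)))   # channel attention Y2
        y2 = y2[:, :, None, None]                      # broadcast back to (B, C, 1, 1)
        d = torch.clamp(self.delta, 0.0, 1.0)
        # Z = delta*Y1 + (1-delta)*Y2, with each attention map applied to the feature X
        return d * (x * y1) + (1 - d) * (x * y2)

z = SpatialChannelFusion()(torch.randn(2, 256, 14, 14))
print(z.shape)  # torch.Size([2, 256, 14, 14])
```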
The fusion weight δ is obtained by optimization based on a cosine annealing strategy; a warm-up strategy is added to the cosine annealing strategy for the initial transition, and the fusion weight is optimized as follows:
Cosine annealing is performed with the model recognition error value corresponding to the initial fusion weight taken as the maximum recognition error value; the recognition error value gradually decreases during training and reaches its minimum at the end:
γ_t = γ_min + (1/2)(γ_max - γ_min)(1 + cos(τπ/N_k))
where γ_t is the fusion weight value at the t-th iteration, γ_min is the lowest fusion weight, γ_max is the highest fusion weight, τ is the index of the cycle currently being learned, and N_k is the total number of cycles in the current operating environment;
Finally, the recognition result of the facial expression recognition network for the image data is output through the Softmax layer.
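A small sketch of the cosine annealing schedule with an initial warm-up used to tune the fusion weight is given below; the linear warm-up ramp and the specific bounds are assumptions consistent with the standard cosine annealing form reconstructed above.

```python
import math

def fusion_weight(t, warmup_steps, total_cycles, gamma_min=0.1, gamma_max=0.9):
    """Cosine annealing of the fusion weight with a linear warm-up transition.
    t: current iteration; total_cycles: N_k; returns gamma_t."""
    if t < warmup_steps:                                  # warm-up: ramp up from gamma_min
        return gamma_min + (gamma_max - gamma_min) * t / warmup_steps
    tau = min(t - warmup_steps, total_cycles)             # cycle index currently being learned
    return gamma_min + 0.5 * (gamma_max - gamma_min) * (1 + math.cos(math.pi * tau / total_cycles))

schedule = [round(fusion_weight(t, warmup_steps=10, total_cycles=90), 3) for t in range(0, 101, 10)]
print(schedule)  # rises during warm-up, then decays along the cosine curve
```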
To evaluate the facial expression recognition of the present application, six models are compared: model 1 is a plain convolutional network; model 2 is a convolutional network with a self-attention mechanism; model 3 adds a spatial attention mechanism to model 2; model 4 adds a channel attention mechanism to model 2; model 5 combines the self-attention, spatial attention, and channel attention mechanisms; and model 6 is the facial expression recognition network of this embodiment. Facial expression recognition is performed with each model, and the accuracy of the recognition results is shown in Table 1:
Table 1. Facial expression recognition accuracy of the different models

Expression label | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6
Angry            | 66.51%  | 67.01%  | 69.12%  | 70.32%  | 71.84%  | 72.32%
Disgust          | 65.37%  | 66.23%  | 67.91%  | 68.72%  | 70.69%  | 71.08%
Fear             | 60.85%  | 66.06%  | 67.63%  | 69.71%  | 71.85%  | 72.89%
Happy            | 63.45%  | 64.89%  | 66.68%  | 69.28%  | 70.35%  | 72.94%
Sad              | 67.35%  | 68.15%  | 69.36%  | 71.15%  | 72.68%  | 74.06%
Surprise         | 68.16%  | 69.48%  | 70.06%  | 71.26%  | 72.68%  | 73.48%
As can be seen from Table 1, the model of this embodiment achieves the highest accuracy.
A deep learning algorithm is also used to analyze the input speech or text content and automatically generate corresponding digital human expressions.
Expression library and animation curve unit 32: used for establishing an expression library containing common expression actions and micro-expressions so that they can be called up quickly when needed. The animation curves are adjusted through the cosine annealing strategy, with the annealing applied to the combined effect of the change speed, amplitude, and transition of the expression, making expressions more natural and smooth.
Rendering engine module 4: used for rendering the digital human video;
The rendering engine module 4 includes:
Real-time rendering unit 41: used for rendering the digital human model with an advanced real-time rendering engine such as Unreal Engine or Unity to generate lifelike visual effects; adjustment and optimization of rendering effects such as ray tracing, shading, materials, and textures are also supported to improve the quality of the digital human video.
Rendering output and optimization unit 42: used for outputting the rendered digital human video as video files in different formats, such as MP4 and AVI, and at different resolutions, such as 720p, 1080p, and 4K, and for compressing and encoding the rendered video to reduce file size and improve transmission and storage efficiency while maintaining video quality;
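As an illustration of the compression and encoding step, the rendered video can be handed to FFmpeg from Python; the H.264 codec, CRF value, audio settings, and file names below are assumptions for the sketch, not the system's fixed settings.

```python
import subprocess

def encode_video(src: str, dst: str, height: int = 1080, crf: int = 23) -> None:
    """Re-encode a rendered digital-human video with H.264, scaling to the target
    resolution and letting CRF trade file size against visual quality."""
    subprocess.run([
        "ffmpeg", "-y", "-i", src,
        "-vf", f"scale=-2:{height}",   # keep aspect ratio, set output height (720/1080/2160)
        "-c:v", "libx264", "-crf", str(crf), "-preset", "medium",
        "-c:a", "aac", "-b:a", "128k",
        dst,
    ], check=True)

encode_video("render_output.mov", "digital_human_1080p.mp4")
```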
Interaction control module 5: used for data interaction with users and third parties;
the interaction control module 5 includes:
User interface and operation control unit 51: the touch screen provides a concise and intuitive user interface, making it convenient for the user to operate the digital human video synthesis system and set its parameters, for example selecting a digital human model, entering text content, and adjusting voice and expression parameters.
Data management and communication interface unit 52: used for managing and storing digital human models, voice data, expression data, animation data, and the like, supporting the import, export, and backup of data, and providing a communication interface with other systems or application programs to enable data sharing and interaction, for example integration with video editing software or a live streaming platform.
The touch feedback screen is designed with a curved or bendable shape so that it can adapt to different application scenarios and device forms. For example, a wrap-around flexible touch feedback screen can be designed for wearable devices to better fit the curves of the human body and provide a more natural and comfortable interaction experience, and an arc-shaped touch feedback screen can be designed for an automobile instrument panel to improve the driver's visibility and ease of operation.
The touch feedback screen is further divided into a plurality of functional partitions, or a modular design is adopted. Different partitions can have different touch functions and feedback modes, and users can customize the functions of each partition according to their own requirements. For example, on the touch feedback screen of a game controller, the screen can be divided into a direction control area, an action button area, a function setting area, and so on, with each area providing a different tactile feedback effect to enhance the immersion of the game.
A conventional touch feedback screen generally has only one sensing layer. In this embodiment the number of sensing layers is increased: the screen is designed with a three-layer sensing structure, shown in Fig. 4, in which the layers from top to bottom correspond to pressure sensing, position sensing, and gesture sensing respectively, enabling more accurate and richer recognition of touch input. For example, when a user lightly touches the screen, the first, pressure-sensing layer detects the magnitude of the pressure; the second, position-sensing layer determines the coordinates of the touch; and the third, gesture-sensing layer recognizes gestures such as sliding and zooming of the finger. The touch sensing circuits of the sensing structure are fabricated from nanomaterials such as carbon nanotubes and silver nanowires. These nanomaterials have excellent conductivity and flexibility and can improve the sensitivity and flexibility of the touch feedback screen; for example, using a carbon nanotube film as the sensing layer enables faster signal response and higher spatial resolution thanks to the high conductivity and small size of carbon nanotubes.
Referring to fig. 5, a micro lens array is constructed on the surface of the touch feedback screen, and micro lenses can focus light rays, enhance display brightness and contrast of the screen, and simultaneously can be used for realizing a 3D touch effect. When a user touches the screen, the position and the force of the touch are detected through the optical change of the micro lens array, and a three-dimensional feedback effect can be visually presented according to the touch operation, for example, in a game, the user presses different positions and forces of the screen to see the 3D visual effect that the object has different degrees of protrusion or depression. A micro-fluid channel is arranged below the micro-lens array, wherein different liquids with electrical or optical characteristics are filled in the micro-fluid channel, for example, when a user touches the screen, the pressure can enable the liquid in the micro-fluid channel to flow, so that local resistance or capacitance is changed, and touch detection is realized; or by changing the distribution of the liquid, affecting the light transmission or color of the screen locally, providing visual feedback. For example, in an e-book reading application, when a user touches the screen to turn a page, the edges of the screen will have a colored liquid flow effect as feedback. Dynamically adjusting the curvature, spacing or direction of the microlens array according to the liquid pressure and flow state in the microfluidic channel; on the contrary, the information such as external illumination, touch position and the like sensed by the micro-lens array can also be fed back to the micro-fluid system, so that the flow and distribution of liquid can be dynamically adjusted, and the self-adaptive adjustment of screen display and touch feedback is realized. For example, in an outdoor strong light environment, the microlens array detects an increase in illumination intensity, triggers liquid flow in the microfluidic channel, and changes parameters of the microlenses to enhance display brightness and contrast of the screen; meanwhile, when a user performs touch operation, the micro lens array transmits touch information to the micro fluid system, so that corresponding flow and tactile feedback are generated on the liquid.
By utilizing the magnetic field, the electric field or the sound wave and other technologies, the touch feedback screen can detect the actions of fingers or a touch pen suspended above the screen within a certain distance, and non-contact touch operation and feedback are realized. For example, in the air, by swiping a finger over the screen, page scrolling, content selection, etc. may be accomplished while providing tactile feedback through air vibration or slight vibration of the device. The touch feedback screen with three-dimensional space sensing capability can be constructed by the method. By arranging a plurality of sensors inside or around the screen, detection of the position, direction and movement of the object in three-dimensional space is achieved. For example, in a virtual reality VR or augmented reality AR device, a user may directly interact with a virtual object in three-dimensional space, touch, grab, rotate, etc., and provide realistic haptic feedback by way of vibration of the device, force feedback gloves, etc.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art, who is within the scope of the present invention, should make equivalent substitutions or modifications according to the technical scheme of the present invention and the inventive concept thereof, and should be covered by the scope of the present invention.

Claims (9)

1. A digital human video synthesis system, characterized by comprising:
a model construction module (1), used for modeling a digital human based on a real person;
a speech synthesis module (2), used for converting input text content into speech audio through a speech synthesis engine and for adjusting and optimizing the synthesized speech;
an expression generation module (3), used for generating the digital human's expressions from the facial expressions of the real person;
a rendering engine module (4), used for rendering the digital human video;
an interaction control module (5), used for data interaction with users and third parties;
wherein the model construction module (1) comprises:
a character modeling unit (11), used for creating an appearance model based on 3D modeling technology, scanning the real person with scanning technology, and constructing the digital human model;
a skeleton binding and animation unit (12), used for establishing a skeleton system for the digital human model, determining the positions and ranges of motion of the joints, and performing action recognition and animation production through skeleton binding;
the expression generation module (3) comprises:
an expression capture and mapping unit (31), used for capturing facial expression data of the real person through a facial capture device and mapping it onto the digital human model to drive expressions in real time, and also using a deep learning algorithm to analyze input speech or text content and automatically generate corresponding digital human expressions;
an expression library and animation curve unit (32), used for establishing an expression library and controlling the change curve, amplitude, and transition effects of expressions by adjusting animation curves;
the rendering engine module (4) comprises:
a real-time rendering unit (41), used for rendering the digital human model with a real-time rendering engine;
a rendering output and optimization unit (42), used for outputting the rendered digital human video as video files of different formats and resolutions and for compressing and encoding the rendered video;
the interaction control module (5) comprises:
a user interface and operation control unit (51), used for providing a user interface through a touch screen, allowing the user to operate the system and set parameters;
a data management and communication interface unit (52), used for managing and storing system data and providing a communication interface with other systems or applications.
2. The digital human video synthesis system according to claim 1, characterized in that the character modeling unit (11) creates appearance models of the different parts of the digital human based on 3D modeling technology and stores them in a database.
3. The digital human video synthesis system according to claim 2, characterized in that the skeleton binding and animation unit (12) detects the 3D human body posture, compares it with the joints of the real human body, and analyzes and calculates the coordinate displacement of the v-th joint point among the 33 joint points of the human body based on the video frames, as follows: the 3D coordinate offset d_{j-1,j} is calculated from the data of frame j-1 and frame j, where (x_{j-1}, y_{j-1}, z_{j-1}) are the joint position coordinates of frame j-1 and (x_j, y_j, z_j) are the joint position coordinates of frame j; the ratio of the 3D coordinate offset d_{j-1,j} to the height of the current real person is calculated to update the 3D coordinate offset, where d'_{j-1,j} is the updated offset; d'_{j-1,j} is divided by the time difference Δt between two adjacent frames, so that each joint point v generates a corresponding rate change value V_ujv at the j-th frame of the action-u video; in a video of P frames, the average velocity of each joint point v is obtained; different actions generate corresponding action sets according to their different amplitudes, where each joint point corresponds to a different average velocity, and a velocity threshold V_v that also takes the other actions in the set into account is generated, where U is the number of actions in the action set; the action set is divided based on a cosine annealing strategy, with the action recognition judgment error value corresponding to the division of the action set taken as the maximum judgment error value.
4. The digital human video synthesis system according to claim 3, characterized in that the expression capture and mapping unit (31) captures and identifies expression labels based on a facial expression recognition network comprising an input module, a Transformer module incorporating a self-attention mechanism, a multi-scale feature extraction module, and a fusion recognition module; the input module acquires facial image data X_0 ∈ R^(H×W×C) of the real person, where H and W are the height and width of the image data and C is the number of image channels; the collected facial expression image data is split into equal-sized patches based on a common divisor λ of the image height and width and serialized to obtain X', each image patch is position-encoded and linearly projected, and the position encoding vector of each patch is added to obtain the real input X; the Transformer module incorporating the self-attention mechanism comprises a multi-head attention layer, a feed-forward neural network layer, and layer normalization, with residual connection modules added between the inputs and outputs of the multi-head attention layer and the feed-forward neural network layer; the multi-scale feature extraction module divides the facial expression of the real person into several regions, each region corresponding to a square matrix whose dimension is the number of nodes in the region, and its output incorporates the patch position parameter feature α and the node count parameter feature β, which are fused into the self-attention mechanism to produce the value-vector output; after the node e_m in the value vector matrix is fused with the node count and the self-attention mechanism, features are extracted to obtain a new value vector e'_n, where f_q is the dimension of the matching vector q and m and n are the total numbers of nodes in the horizontal and vertical directions of the image patch, respectively; the output Q of the multi-scale feature extraction module is then computed, where A, B, and E are the query vector matrix, the matching vector matrix, and the value vector matrix composed of a, b, and e respectively, g_α and g_β are the calculation matrices of the position parameter features α and β, T denotes the vector transpose, and G_l corresponds to the different regions of the real person's face; the fusion recognition module concatenates the outputs of the Transformer module incorporating the self-attention mechanism and of the multi-scale feature extraction module to obtain the feature X_{R,Q}, which is fed sequentially into a spatial attention module and a channel attention module; in the spatial attention module, the input features are processed along the channel dimension by pooling operations, where c ∈ [1, C] is the channel index and M_avg(X_{R,Q}) and M_max(X_{R,Q}) are the average-pooled and max-pooled features respectively; M_avg(X_{R,Q}) and M_max(X_{R,Q}) are concatenated and passed through a convolution layer to obtain the output spatial attention feature Y_1 = σ(f_conv,e(M_avg(X_{R,Q}); M_max(X_{R,Q}))); in the channel attention mechanism, the input features are processed along the spatial dimensions by pooling operations over the image patch height and width, where O_avg(X_{R,Q}) and O_max(X_{R,Q}) are the average-pooled and max-pooled features respectively; O_avg(X_{R,Q}) and O_max(X_{R,Q}) are concatenated and passed through a convolution layer to obtain the output channel attention Y_2 = σ(MLP(O_avg(X_{R,Q}); O_max(X_{R,Q}))); the output of the spatial attention module and the output of the channel attention module are fused to obtain the final fused output Z = δY_1 + (1-δ)Y_2, where δ is the fusion weight of the spatial attention module and the channel attention module; the fusion weight δ is obtained by optimization based on a cosine annealing strategy, to which a warm-up strategy is added for the initial transition; cosine annealing is performed with the model recognition error value corresponding to the initial fusion weight taken as the maximum recognition error value, the recognition error value gradually decreasing during training and reaching its minimum at the end, where γ_t is the fusion weight value at the t-th iteration, γ_min is the lowest fusion weight, γ_max is the highest fusion weight, τ is the index of the cycle currently being learned, and N_k is the total number of cycles in the current operating environment; finally, the recognition result of the facial expression recognition network for the image data is output through the Softmax layer.
5. The digital human video synthesis system according to claim 4, characterized in that the expression library and animation curve unit (32) also adjusts the animation curves based on the cosine annealing strategy, with the annealing applied to the combined effect of the change speed, amplitude, and transition of the expression.
6. The digital human video synthesis system according to claim 1, characterized in that the touch feedback screen is designed with a curved or bendable shape to adapt to different application scenarios and device forms; the touch feedback screen is also formed by splicing several independent small touch feedback screen modules into a complete screen, which is divided into several functional partitions according to user-defined requirements, each functional partition having its own touch functions and feedback modes.
7. The digital human video synthesis system according to claim 6, characterized in that the touch feedback screen is designed with a three-layer sensing structure corresponding to pressure sensing, position sensing, and gesture sensing respectively, and the touch sensing circuits of the sensing structure are fabricated from nanomaterials.
8. The digital human video synthesis system according to claim 7, characterized in that a microlens array is constructed on the surface of the touch feedback screen, with microfluidic channels arranged below it and filled with different liquids having electrical or optical properties; the liquid pressure and flow state in the microfluidic channels dynamically adjust the curvature, spacing, or orientation of the microlens array, and information from the microlens array is in turn fed back to the microfluidic channels to dynamically adjust the flow and distribution of the liquid.
9. The digital human video synthesis system according to claim 8, characterized in that the touch feedback screen also uses sensing technology to detect the movement of a finger or stylus hovering within a certain distance above the screen, thereby constructing a touch feedback screen with three-dimensional spatial sensing capability.
CN202410964070.2A 2024-07-18 2024-07-18 Digital human video synthesis system Pending CN119011752A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410964070.2A CN119011752A (en) 2024-07-18 2024-07-18 Digital human video synthesis system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410964070.2A CN119011752A (en) 2024-07-18 2024-07-18 Digital human video synthesis system

Publications (1)

Publication Number Publication Date
CN119011752A true CN119011752A (en) 2024-11-22

Family

ID=93485156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410964070.2A Pending CN119011752A (en) 2024-07-18 2024-07-18 Digital human video synthesis system

Country Status (1)

Country Link
CN (1) CN119011752A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150043067A1 (en) * 2013-08-12 2015-02-12 Electronics And Telecommunications Research Institute Microlens array and method for fabricating thereof
CN108021327A (en) * 2016-10-31 2018-05-11 北京小米移动软件有限公司 Control the method and touch control terminal that display interface slides
CN113867610A (en) * 2021-08-20 2021-12-31 深圳十米网络科技有限公司 Game control method, device, computer equipment and storage medium
CN116311456A (en) * 2023-03-23 2023-06-23 应急管理部大数据中心 Personalized virtual human expression generating method based on multi-mode interaction information
CN117519477A (en) * 2023-11-09 2024-02-06 九耀天枢(北京)科技有限公司 Digital human virtual interaction system and method based on display screen
CN117671095A (en) * 2023-12-18 2024-03-08 国电南瑞科技股份有限公司 Multi-mode digital person state prediction system and method thereof

Similar Documents

Publication Publication Date Title
JP2022515620A (en) Image area recognition method by artificial intelligence, model training method, image processing equipment, terminal equipment, server, computer equipment and computer program
Bonnici et al. Sketch-based interaction and modeling: where do we stand?
Ding et al. A survey of sketch based modeling systems
CN117055724B (en) Working method of generative teaching resource system in virtual teaching scene
CN114581502B (en) Three-dimensional human body model joint reconstruction method based on monocular image, electronic device and storage medium
CN108363973A (en) A kind of unconfined 3D expressions moving method
CN117152843B (en) Digital person action control method and system
CN117115398B (en) A virtual-real fusion digital twin fluid phenomenon simulation method
CN106293099A (en) Gesture identification method and system
CN113506377A (en) Teaching training method based on virtual roaming technology
Thalmann Using virtual reality techniques in the animation process
Calvert Approaches to the representation of human movement: notation, animation and motion capture
JP2023098937A (en) Method and device fo reproducing multidimensional responsive video
Mao et al. A sketch-based gesture interface for rough 3D stick figure animation
Ekmen et al. From 2D to 3D real-time expression transfer for facial animation
Woo et al. Sketch on dynamic gesture tracking and analysis exploiting vision-based 3D interface
CN119011752A (en) Digital human video synthesis system
CN117763430A (en) Method, system and equipment for visually impaired people to recognize virtual textures
McLeod et al. Integrated media systems
CN116645247A (en) Panoramic view-based augmented reality industrial operation training system and method
Liang et al. Interactive experience design of traditional dance in new media era based on action detection
CN116630479A (en) Image generation method, device, electronic equipment and readable storage medium
CN116030168A (en) Method, device, equipment and storage medium for generating intermediate frame
CN115686202A (en) Three-dimensional model interactive rendering method across Unity/Optix platform
CN114779942A (en) Virtual reality immersive interaction system, equipment and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination