CN119206101B - Editable facial three-dimensional reconstruction method, system and storage medium
- Publication number
- CN119206101B (application CN202411742013.6A)
- Authority
- CN
- China
- Prior art keywords
- expression
- model
- expressive
- image data
- point cloud
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/168—Feature extraction; Face representation
Abstract
The invention belongs to the technical field of image data processing, and in particular relates to an editable facial three-dimensional reconstruction method, system and storage medium. The method mainly comprises the following steps: obtaining expressionless image data and expressive image data of a modeler; obtaining a neutral model and a real expressive model; training a deformation field, a color field and a super-resolution neural network; training an expression encoder to construct an expression base vector; forming an expressive facial three-dimensional model of the modeler from the expression base vector and the finally generated model; and outputting a rendered image. The invention further provides a system for implementing the method. The invention performs facial three-dimensional reconstruction based on neural radiance field technology, can generate an accurate, natural, attractive and realistic facial three-dimensional model, allows flexible and convenient personalized adjustment, and has good application prospects in the medical and healthcare fields.
Description
Technical Field
The invention belongs to the technical field of image data processing, and particularly relates to an editable face three-dimensional reconstruction method, an editable face three-dimensional reconstruction system and a storage medium.
Background
Facial scanning is a technique for generating a three-dimensional model of a human face from facial images or video, and it plays an important role in the medical field. It provides doctors and patients with high-quality facial morphology information, helping to diagnose, treat and evaluate various face-related problems such as plastic surgery, oral restoration and maxillofacial deformity. Facial scanning thus assists diagnosis, treatment and evaluation, enhances patient confidence and satisfaction, and can also provide large amounts of facial data for medical research, promoting the development and innovation of medical knowledge.
Existing facial scanning mainly follows these schemes: direct anthropometric measurement, radiology-based facial scanning, dedicated facial scanners, deep-learning-based facial three-dimensional reconstruction, and facial three-dimensional reconstruction based on neural radiance field technology.
Deep-learning-based facial three-dimensional reconstruction starts from facial images, marks facial feature points and generates a three-dimensional model through a model carrying prior knowledge. Its advantages are that large amounts of facial data can be used for training, reconstruction efficiency and stability are improved, and certain changes in occlusion, expression, pose and the like can be handled. Its disadvantages are that the accuracy and detail of the reconstruction are limited by the complexity and expressive power of the model, that individual differences and subtle features are difficult to capture, and that pre-trained face shape and texture models, or deep-learning-based networks, are required as prior input information.
Facial three-dimensional reconstruction based on neural radiance field technology photographs the face from multiple viewing angles and learns the three-dimensional geometry and texture of the face from the images, the camera poses and a neural network. The face can be scanned with an ordinary camera or mobile phone, without special scanning equipment, which reduces hardware cost and difficulty of use and improves the accessibility and convenience of facial scanning. High-quality three-dimensional facial images can be rendered from any viewing angle, unaffected by factors such as occlusion, expression and pose, improving the accuracy and detail of facial scanning.
As can be seen, there are many methods and studies for facial scanning in the prior art. However, these prior art techniques still have the following problems:
1. Dependence on a pre-trained face model or network. Such models and networks often carry limitations and biases, adapt poorly to changes in face shape, texture, expression, pose and the like, and struggle to capture individual differences and fine features, so the edited face model appears ill-fitting, uncoordinated or unrealistic in different scenes, which harms the flexibility and naturalness of face editing.
2. A lack of face editing technology based on the 3D model itself: attributes such as the shape, color, expression and teeth of the patient's face cannot be modified and adjusted according to the requirements of the patient or doctor to generate a new three-dimensional image, so personalized facial design and preview cannot be realized and the personalized requirements of the medical field cannot be met.
Accordingly, there is a need in the art to develop new techniques for constructing editable three-dimensional models of faces from facial scan data.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides an editable face three-dimensional reconstruction method, an editable face three-dimensional reconstruction system and a storage medium.
An editable face three-dimensional reconstruction method comprising the steps of:
obtaining expressionless image data and expressive image data of a modeler photographed at a plurality of angles;
generating a head point cloud from the expressionless image data to obtain a neutral model;
generating a head point cloud from the expressive image data to obtain a real expressive model;
training a deformation field and a color field with the real expressive model;
passing the neutral model through the color field and the deformation field to obtain a finally output model;
obtaining a rendered image from the finally output model, and training a super-resolution neural network with the rendered image and the expressive image data;
training an expression encoder with the expressionless image data and the expressive image data;
after the facial feature points of the patient in the expressionless image data are dragged, sending them to the expression encoder, which encodes the dragged feature point coordinates into an expression base vector;
passing the neutral model through the expression base vector and the finally output model to form an expressive facial three-dimensional model of the modeler;
and obtaining a high-definition rendering of the expressive facial three-dimensional model through the super-resolution network.
Preferably, the non-expressive image data is selected from at least one of a picture or a video, and the expressive image data is selected from at least one of a picture or a video.
Preferably, the construction of the neutral model comprises the following steps:
step a, preprocessing the expressionless image data;
step b, reconstructing camera poses with COLMAP, performing feature extraction and feature matching, and generating a sparse point cloud;
step c, training with the sparse point cloud as the initial point cloud of a 3D Gaussian Splatting algorithm to generate a head point cloud;
and step d, inputting the head point cloud into DMTet to generate a 3D model, namely the neutral model (a minimal scripting sketch of steps b to d is given below).
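The following Python sketch illustrates how step b could be scripted with the standard COLMAP command-line interface; the directory layout, the choice of exhaustive matching and the hand-off to the 3D Gaussian Splatting and DMTet stages are assumptions made for illustration, not details fixed by this disclosure.

```python
import subprocess
from pathlib import Path

def reconstruct_sparse_point_cloud(image_dir: str, work_dir: str) -> Path:
    """Step b: COLMAP feature extraction, feature matching and mapping to recover
    camera poses and a sparse point cloud from the expressionless images."""
    work = Path(work_dir)
    database = work / "database.db"
    sparse_dir = work / "sparse"
    sparse_dir.mkdir(parents=True, exist_ok=True)

    # Feature extraction and feature matching over the multi-angle images.
    subprocess.run(["colmap", "feature_extractor",
                    "--database_path", str(database), "--image_path", image_dir], check=True)
    subprocess.run(["colmap", "exhaustive_matcher",
                    "--database_path", str(database)], check=True)

    # Incremental mapping: recovers camera poses and writes the sparse point cloud.
    subprocess.run(["colmap", "mapper",
                    "--database_path", str(database), "--image_path", image_dir,
                    "--output_path", str(sparse_dir)], check=True)

    # The sparse model under sparse/0 then initialises 3D Gaussian Splatting training
    # (step c); the densified head point cloud is afterwards fed to DMTet (step d).
    # Those two stages depend on the chosen implementations and are not sketched here.
    return sparse_dir / "0"
```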
Preferably, the training process of the expression encoder is as follows:
inputting the expressionless image data and the facial feature points of the expressive image data, and obtaining an expression base vector through the expression encoder;
generating an expressive image and expressive facial feature points from the expression base vector, and computing losses against the real expressive image data and the facial feature points of the real expressive image data respectively, so that the expression encoder is continuously optimized;
the specific process of constructing the expression basis vector is expressed as follows:
Z_exp = E_exp(I_in, P_0)
wherein Z_exp is the expression base vector, E_exp is the expression encoder, I_in is the expressionless image data, and P_0 is the 68 facial feature points of the expressive image data;
in the process of training the expression encoder, the following loss function is adopted:
L_exp = L_gen + L_f1 + α·L_f2
wherein L_exp is the loss function for constructing the expression base vector, L_gen is the mean-square-error loss between each pixel of the newly generated image and the original image, L_f1 is the MSE loss between the facial feature points after feature recovery and the original facial feature points, L_f2 is the loss between the feature points of the generated expressive image and the original feature points, and α is a switch that gates the expressive-feature loss.
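A minimal PyTorch sketch of one possible E_exp is shown below. Only the interface is taken from this disclosure (an expressionless image plus 68 feature points in, an expression base vector out; the embodiment later mentions a 1024-dimensional vector); the convolutional backbone, layer sizes, image resolution and the assumption of 2D feature-point coordinates are illustrative choices.

```python
import torch
import torch.nn as nn

class ExpressionEncoder(nn.Module):
    """Illustrative E_exp: maps an expressionless image I_in and 68 feature points P_0
    to an expression base vector Z_exp. Layer choices are assumptions, not the patent's."""

    def __init__(self, z_dim: int = 1024):
        super().__init__()
        # Small CNN over the expressionless image (3 x 256 x 256 assumed).
        self.image_net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # MLP over the 68 feature-point coordinates (2D image coordinates assumed).
        self.point_net = nn.Sequential(nn.Flatten(), nn.Linear(68 * 2, 256), nn.ReLU())
        self.fuse = nn.Linear(128 + 256, z_dim)

    def forward(self, image: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); points: (B, 68, 2) expressive or dragged feature points.
        features = torch.cat([self.image_net(image), self.point_net(points)], dim=-1)
        return self.fuse(features)  # Z_exp

# Example: z_exp = ExpressionEncoder()(torch.rand(1, 3, 256, 256), torch.rand(1, 68, 2))
```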
Preferably, the training process of the deformation field, the color field and the super-resolution network is as follows:
The neutral model and the expression base vector are passed through the deformation field and the color field to obtain an expressive sparse point cloud;
The low-resolution image rendered from the expressive sparse point cloud is passed through the super-resolution neural network to obtain a corresponding high-resolution image;
The neutral model comprises a neutral point cloud position vector and a neutral point cloud feature-point color vector, wherein the neutral point cloud position vector is a vector formed by the position coordinates of the expressive feature points whose positions change;
In the deformation field, the neutral point cloud position vector and the expression base vector are reshaped into matrices and concatenated along the channel dimension to form a tensor with 4 channels; the tensor undergoes feature extraction and fusion through a ResNet, and finally three fully connected layers map out the position offsets of the corresponding points relative to the neutral model;
in the color field, the neutral point cloud position vector, the neutral point cloud feature-point color vector and the expression base vector are reshaped into matrices and concatenated along the channel dimension to form a tensor with 4 channels; the tensor undergoes feature extraction and fusion through a ResNet, and finally three fully connected layers map out the color offsets of the corresponding points relative to the neutral model.
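A PyTorch sketch of the deformation field described above follows. The 4-channel layout (three position channels plus one channel obtained by projecting the expression base vector onto the grid), the 64 × 64 point grid, the ResNet-18 backbone and the widths of the three fully connected layers are all assumptions used to make the structure concrete; the color field would be analogous, additionally taking the feature-point color vector and emitting per-point color offsets.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class DeformationField(nn.Module):
    """Illustrative f_def: neutral point positions and the expression base vector are
    reshaped into a 4-channel map, passed through a ResNet, and three fully connected
    layers output per-point offsets relative to the neutral model."""

    def __init__(self, grid: int = 64, z_dim: int = 1024):
        super().__init__()
        self.grid = grid
        # Assumed: the expression base vector is projected to one grid-sized channel.
        self.z_to_map = nn.Linear(z_dim, grid * grid)
        # ResNet backbone adapted to the 4-channel input described in the text.
        backbone = resnet18(weights=None)
        backbone.conv1 = nn.Conv2d(4, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Identity()
        self.backbone = backbone
        # Three fully connected layers mapping features to per-point xyz offsets.
        self.head = nn.Sequential(
            nn.Linear(512, 1024), nn.ReLU(),
            nn.Linear(1024, 2048), nn.ReLU(),
            nn.Linear(2048, grid * grid * 3),
        )

    def forward(self, neutral_xyz: torch.Tensor, z_exp: torch.Tensor) -> torch.Tensor:
        # neutral_xyz: (B, grid*grid, 3) neutral point-cloud positions; z_exp: (B, z_dim).
        b = neutral_xyz.shape[0]
        pos_map = neutral_xyz.transpose(1, 2).reshape(b, 3, self.grid, self.grid)
        exp_map = self.z_to_map(z_exp).reshape(b, 1, self.grid, self.grid)
        features = self.backbone(torch.cat([pos_map, exp_map], dim=1))  # 4 channels in
        offsets = self.head(features).reshape(b, -1, 3)
        return neutral_xyz + offsets  # P = P_0 + f_def(P_0, Z_exp)
```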
Preferably, the following three loss functions are used in training the deformation field and the color field:
L_RGB = ||I_rgb − I_gt||_1
L_sil = 1 − IOU(M, M_gt)
L_def = ||P − P_gt||_2
wherein L_RGB is the RGB loss, L_sil is the contour loss, L_def is the facial feature point loss, I_rgb is the RGB value output by the color field, I_gt is the RGB value of the corresponding point of the real expressive model, M is the facial contour of the point cloud, M_gt is the facial contour of the real expressive model, IOU is the intersection-over-union of the two contours, P_gt is the position of a point of the real expressive model, P is the position of the corresponding point of the point cloud, the subscript 1 in the formula for L_RGB denotes the L1 loss function, and the subscript 2 in the formula for L_def denotes the L2 loss function;
P = P_0 + f_def(P_0, Z_exp)
wherein P_0 is the position of the corresponding point of the neutral point cloud, Z_exp is the expression base vector, and f_def is the deformation field;
in the process of training the deformation field and the color field, a prior constraint L_offset is introduced; L_offset penalizes all non-zero displacements:
L_offset = λ_1 · counter(f_def(P_0, Z_exp) = 0)
wherein λ_1 is a weight that scales the loss, and counter(f_def(P_0, Z_exp) = 0) denotes counting, over the feature points, those whose offset is zero;
in the process of training the super-resolution neural network, the following loss function is adopted:
L = λ_2·||I_hr − I_gt||_1 + (1 − λ_2)·||I_lr − I_gt||_1
wherein I_hr is the high-definition image after resolution improvement by the super-resolution neural network, I_lr is the low-resolution image rendered from the point cloud, I_gt is the real expressive image, and λ_2 is a weight.
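The four losses above can be transcribed almost directly; the sketch below assumes soft silhouette masks for the IOU term and reads the L_offset prior as a soft penalty on non-zero displacements (its stated intent), with λ_1 left as a tunable weight.

```python
import torch

def iou(mask: torch.Tensor, mask_gt: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Intersection-over-union of two (soft) facial silhouette masks."""
    inter = (mask * mask_gt).sum()
    union = mask.sum() + mask_gt.sum() - inter
    return inter / (union + eps)

def field_losses(i_rgb, i_gt, mask, mask_gt, p, p_gt, offsets, lam1=0.01):
    """Combined training loss for the deformation and color fields (weights assumed)."""
    l_rgb = (i_rgb - i_gt).abs().mean()        # L_RGB: L1 loss on rendered colors
    l_sil = 1.0 - iou(mask, mask_gt)           # L_sil: 1 - IOU of the facial contours
    l_def = ((p - p_gt) ** 2).mean()           # L_def: L2 loss on facial feature points
    l_offset = lam1 * offsets.abs().mean()     # L_offset: soft penalty on non-zero offsets
    return l_rgb + l_sil + l_def + l_offset
```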
The invention also provides a system for implementing the above editable facial three-dimensional reconstruction method, which comprises:
An input module configured to acquire a series of non-expressive image data and expressive image data of a modeler;
a neutral model construction module configured to generate a head point cloud using the expressionless image data to obtain a neutral model;
the true expressive model construction module is configured to generate head point clouds by using the expressive image data to obtain a true expressive model;
A deformation field and color field training module configured to train the deformation field and color field using the true expressive model;
a final-output-model training module configured to pass the neutral model through the color field and the deformation field to obtain the finally output model;
a super-resolution neural network module configured to obtain a rendered image from the finally output model and to train a super-resolution neural network with the rendered image and the expressive image data;
an expression base vector construction module configured to send the facial feature points of the patient in the expressionless image data, after they have been dragged, to the expression encoder, which encodes the dragged feature point coordinates into an expression base vector;
A facial three-dimensional model construction module configured to construct a modeler's expressive facial three-dimensional model by passing a neutral model through the expression basis vector and the final output model;
and the high-definition rendering module is configured to obtain a high-definition rendering chart from the facial three-dimensional model with the expression through a super-resolution network.
Preferably, the method further comprises:
and the dynamic adjustment module is configured to change the positions of the characteristic points, adjust the facial three-dimensional model and generate a new facial three-dimensional model.
Preferably, in the dynamic adjustment module, the process of generating the new face three-dimensional model comprises generating a new expression base according to the new feature point position, and generating the new face three-dimensional model by using a deformation field and a color field;
or, in the dynamic adjustment module, generating feature points corresponding to at least one expression or facial form through a GAN model, wherein the feature points corresponding to the at least one expression or facial form are integrated into at least one option;
Or, in the dynamic adjustment module, generating a sequence of feature point changes through a Transformer model, and generating the corresponding sequence of facial three-dimensional model changes according to the sequence of feature point changes.
The present invention also provides a computer-readable storage medium having stored thereon a computer program for implementing the above-described editable face three-dimensional reconstruction method, or a computer program for implementing the above-described system.
The invention performs three-dimensional reconstruction of the human face based on neural radiance field technology and, by editing the facial feature points and using a color field, a deformation field and a super-resolution neural network, can output an edited high-definition three-dimensional model of the face. The technical scheme of the invention achieves the following beneficial technical effects:
1. By moving the facial feature points and similar operations, the facial three-dimensional model can be modified, so that the modeler's face can be finely edited while the edited model still retains the modeler's basic facial features and identity features, meeting the personalized requirements of the medical field. In the preferred scheme, feature points corresponding to common expressions and face shapes can be made into options; clicking an option generates the corresponding facial three-dimensional model without complex manual operation, improving the flexibility and convenience of adjusting the facial three-dimensional model.
2. The invention utilizes the combined action of a plurality of multi-layer perceptron (MLP) neural networks (comprising an expression encoder, a deformation field and a color field) to generate an edited face three-dimensional model, further adopts a super-resolution network to improve the resolution of the picture, and further generates an expression base feature vector according to feature points. Therefore, the three-dimensional model of the face ensures the naturalness and the aesthetic property of the face shaping, and improves the authenticity and the individuation.
In conclusion, the facial three-dimensional model generated by the method is accurate, natural, attractive and realistic, can be adjusted flexibly and conveniently for personalization, and has good application prospects in the medical and healthcare fields.
It should be apparent that, in light of the foregoing, various modifications, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.
The above-described aspects of the present invention will be described in further detail below with reference to specific embodiments in the form of examples. It should not be understood that the scope of the above subject matter of the present invention is limited to the following examples only. All techniques implemented based on the above description of the invention are within the scope of the invention.
Drawings
FIG. 1 is an exemplary diagram of a dataset used to construct an expression basis vector;
FIG. 2 is a diagram of a network structure for constructing expression basis vectors;
FIG. 3 is a schematic diagram of a process for training a deformation field, a color field, and a super-resolution neural network.
Fig. 4 is a schematic diagram of specific structures of deformation fields and color fields.
Detailed Description
It should be noted that, in the embodiments, algorithms of steps such as data acquisition, transmission, storage, and processing, which are not specifically described, and hardware structures, circuit connections, and the like, which are not specifically described may be implemented through the disclosure of the prior art.
Example 1 editable face three-dimensional reconstruction method and System
The present embodiment provides a system for three-dimensional reconstruction of a face, specifically including:
An input module configured to acquire a series of non-expressive image data and expressive image data of a modeler;
a neutral model construction module configured to generate a head point cloud using the expressionless image data to obtain a neutral model;
the true expressive model construction module is configured to generate head point clouds by using the expressive image data to obtain a true expressive model;
A deformation field and color field training module configured to train the deformation field and color field using the true expressive model;
The final output model training module is configured to obtain a final output model through a color field and a deformation field by the neutral model;
A super-resolution neural network module configured to obtain a rendered image using the finally output model, train a super-resolution neural network using the rendered image and the expressive image data;
The expression base vector construction module is configured to drag the facial feature points of the patient in the non-expression image data, and then send the facial feature points into the expression encoder, and the expression encoder encodes the dragged feature point coordinates into an expression base vector;
A facial three-dimensional model construction module configured to construct a modeler's expressive facial three-dimensional model by passing a neutral model through the expression basis vector and the final output model;
the high-definition rendering module is configured to obtain a high-definition rendering diagram of the facial three-dimensional model with the expression through a super-resolution network;
and the dynamic adjustment module is configured to change the positions of the characteristic points, adjust the facial three-dimensional model and generate a new facial three-dimensional model.
In the dynamic adjustment module, the process of generating a new facial three-dimensional model comprises generating a new expression base according to the new feature point positions and generating the new facial three-dimensional model with the deformation field and the color field. To facilitate quick selection by the user, feature points corresponding to at least one type of expression or face shape are generated through a GAN model and integrated into at least one option. To dynamically demonstrate the facial three-dimensional model, a sequence of feature point changes is generated through a Transformer model, and the corresponding sequence of facial three-dimensional model changes is generated from it.
The method for carrying out the three-dimensional reconstruction of the face by adopting the system specifically comprises the following steps:
Step 1, acquiring expressionless image data and expressive image data of a modeler photographed at a plurality of angles, wherein the expressionless image data is selected from at least one of pictures and videos, and the expressive image data is selected from at least one of pictures and videos. The pictures or videos can be acquired with an ordinary mobile phone or camera. As a preferred approach, the video is acquired by recording 360° around the modeler's head or in a spiral around the patient's face. The collected expressive image (video) data covers expressions and motions such as nodding, shaking the head, blinking, smiling and opening the mouth.
Step 2, generating head point clouds by using the expression-free image data to obtain a neutral model; generating head point cloud by using the expressive image data to obtain a real expressive model;
Step 3, training a deformation field and a color field with the real expressive model, wherein the neutral model is passed through the color field and the deformation field to obtain the finally output model;
Step 4, obtaining a rendered image from the finally output model, and training the super-resolution neural network with the rendered image and the expressive image data of the modeler photographed at a plurality of angles. During training, image data output by the super-resolution neural network can additionally be added to the training data.
Step 5, training an expression encoder by using the non-expression image data and the expression image data;
Step 6, dragging the facial feature points of the patient in the non-expression image data, and then sending the facial feature points to the expression encoder, wherein the expression encoder encodes the coordinates of the dragged feature points into expression base vectors;
Step 7, the neutral model passes through the expression base vector and the finally output model to form a facial three-dimensional model with expression of a modeler;
And 8, obtaining a high-definition rendering chart by the facial three-dimensional model with the expression through a super-resolution network.
In step 2, the construction of the neutral model includes the following steps:
Step 2.1, preprocessing the expressionless image data (taking video data as an example);
Step 2.2, reconstructing camera poses with COLMAP, performing feature extraction and feature matching, and generating a sparse point cloud;
Step 2.3, training with the sparse point cloud as the initial point cloud of a 3D Gaussian Splatting algorithm to generate a head point cloud;
Step 2.4, inputting the head point cloud into DMTet to generate a 3D model, namely the neutral model.
In step 5, a data set is first constructed from the expressive video. As shown in fig. 1, each data item includes an expressionless face picture, an expressive face picture and the expressive facial feature points. In order to control expressions more finely, feature points are used to control the 3D model; an encoder is used to extract the expression features and construct the expression base vector. The training process of the expression encoder is shown in fig. 2 and specifically comprises the following steps:
inputting the expressionless image data and the facial feature points of the expressive image data, and obtaining an expression base vector through the expression encoder;
generating an expressive image and expressive facial feature points from the expression base vector, and computing losses against the real expressive image data and the facial feature points of the real expressive image data respectively, so that the expression encoder is continuously optimized.
The process of constructing the expression basis vector is expressed as:
Z_exp = E_exp(I_in, P_0)
wherein Z_exp is the expression base vector, E_exp is the expression encoder, I_in is the expressionless image data, and P_0 is the 68 facial feature points of the expressive image data. E_exp compresses its inputs into a 1024-dimensional feature vector.
In the process of training the expression encoder, the following loss function is adopted:
L_exp = L_gen + L_f1 + α·L_f2
wherein L_exp is the loss function for constructing the expression base vector, L_gen is the mean-square-error loss between each pixel of the newly generated image and the original image, L_f1 is the MSE loss between the facial feature points after feature recovery and the original facial feature points, L_f2 is the loss between the feature points of the generated expressive image and the original feature points, and α is a switch that gates the expressive-feature loss.
L_exp = ||I_in − I_out||_2 + ||P_0 − P_1||_2 + α·||P_0 − P_2||_2
α = 0.1 · I(L_gen < 0.1)
Here P_1 is the facial feature points generated through the feature recovery network (the "expressive feature points" in fig. 2), I_out is the result after decoding by the decoder, and P_2 is the facial feature points after the expression has been generated (the "expression + feature points" in fig. 2). It should be noted that, in the initial stage of network training, feature recognition cannot yet be performed on the generated expressive image, so α is set as a switch for the expressive-feature loss: I is an indicator function, meaning that the L_f2 loss is only enabled once L_gen falls below a certain threshold.
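A direct transcription of this training loss is sketched below; the only assumptions are that the squared norms are realized as mean-squared errors over pixels and feature points and that the switch is evaluated on a detached L_gen.

```python
import torch

def expression_encoder_loss(i_in, i_out, p0, p1, p2):
    """L_exp = L_gen + L_f1 + alpha * L_f2, with alpha = 0.1 * I(L_gen < 0.1):
    the expressive-feature term only contributes once the image reconstruction
    loss has fallen below the 0.1 threshold, as stated in the embodiment."""
    l_gen = torch.mean((i_in - i_out) ** 2)       # per-pixel MSE of the reconstructed image
    l_f1 = torch.mean((p0 - p1) ** 2)             # MSE between recovered and original feature points
    l_f2 = torch.mean((p0 - p2) ** 2)             # MSE between expressive and original feature points
    alpha = 0.1 * (l_gen.detach() < 0.1).float()  # indicator switch on the L_f2 term
    return l_gen + l_f1 + alpha * l_f2
```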
In step 4, the training process of the deformation field, the color field and the super-resolution network is shown in fig. 3, and the specific steps are as follows:
The neutral model and the expression base vector are passed through the deformation field and the color field to obtain an expressive sparse point cloud;
The low-resolution image rendered from the expressive sparse point cloud is passed through the super-resolution neural network to obtain a corresponding high-resolution image;
The neutral model comprises a neutral point cloud position vector and a neutral point cloud feature-point color vector, wherein the neutral point cloud position vector is a vector formed by the position coordinates of the expressive feature points whose positions change, and the neutral point cloud feature-point color vector is a vector formed by the color values of the expressive feature points whose positions change.
The structure of the deformation field and the color field is shown in fig. 4.
In the deformation field, the neutral point cloud position vector and the expression base vector are reshaped into matrices and concatenated along the channel dimension to form a tensor with 4 channels; the tensor undergoes feature extraction and fusion through a ResNet, and finally three fully connected layers map out the position offsets of the corresponding points relative to the neutral model;
in the color field, the neutral point cloud position vector, the neutral point cloud feature-point color vector and the expression base vector are reshaped into matrices and concatenated along the channel dimension to form a tensor with 4 channels; the tensor undergoes feature extraction and fusion through a ResNet, and finally three fully connected layers map out the color offsets of the corresponding points relative to the neutral model.
In training the deformation field and the color field, the following three loss functions are adopted:
L_RGB = ||I_rgb − I_gt||_1
L_sil = 1 − IOU(M, M_gt)
L_def = ||P − P_gt||_2
wherein L_RGB is the RGB loss, L_sil is the contour loss, L_def is the facial feature point loss, I_rgb is the RGB value output by the color field (i.e. the RGB values of the "high-resolution image" in fig. 3), I_gt is the RGB value of the corresponding point of the real expressive model (i.e. the RGB values of the "real expressive image" in fig. 3), M is the facial contour of the expressive sparse point cloud, M_gt is the facial contour of the real expressive model, IOU is the intersection-over-union of the two contours, P_gt is the position of a point of the real expressive model, P is the position of the corresponding point of the expressive sparse point cloud, the subscript 1 in the formula for L_RGB denotes the L1 loss function, and the subscript 2 in the formula for L_def denotes the L2 loss function;
P = P_0 + f_def(P_0, Z_exp)
wherein P_0 is the position of the corresponding point of the neutral point cloud, Z_exp is the expression base vector, and f_def is the deformation field;
in the process of training the deformation field and the color field, a prior constraint L_offset is introduced; L_offset penalizes all non-zero displacements:
L_offset = λ_1 · counter(f_def(P_0, Z_exp) = 0)
wherein λ_1 is a weight that scales the loss, and counter(f_def(P_0, Z_exp) = 0) denotes counting, over the feature points, those whose offset is zero.
In the process of training the super-resolution neural network, the following loss function is adopted:
L = λ_2·||I_hr − I_gt||_1 + (1 − λ_2)·||I_lr − I_gt||_1
wherein I_hr is the high-definition image after resolution improvement by the super-resolution neural network, I_lr is the low-resolution image rendered from the point cloud, I_gt is the real expressive image, and λ_2 is a weight.
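A sketch of this loss is shown below. λ_2 = 0.8 is only an assumed value, and the real expressive image is bilinearly downsampled for the low-resolution term, which is one way of making the two L1 norms comparable; neither choice is specified by the disclosure.

```python
import torch
import torch.nn.functional as F

def super_resolution_loss(i_hr, i_lr, i_gt, lam2=0.8):
    """L = lam2 * ||I_hr - I_gt||_1 + (1 - lam2) * ||I_lr - I_gt||_1."""
    # Downsample the real expressive image to the low-resolution size (assumption).
    i_gt_lr = F.interpolate(i_gt, size=i_lr.shape[-2:], mode="bilinear", align_corners=False)
    return lam2 * (i_hr - i_gt).abs().mean() + (1 - lam2) * (i_lr - i_gt_lr).abs().mean()
```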
The model constructed through the above process is used via the dynamic adjustment module as follows:
The user can drag the feature points on the expressionless picture of the modeler of the facial three-dimensional model, and the dynamic adjustment module generates a new expression base according to the new feature point positions, so that a new facial three-dimensional model is generated under the control of the deformation field and the color field.
As a preferred mode, for convenience of use, the dynamic adjustment module can train and integrate a GAN to generate feature points for specified expressions or face shapes and turn them into option buttons; the user clicks a button to generate the corresponding feature points and can then drag and fine-tune them with the mouse, thereby performing quick facial editing.
Preferably, in order to show feature point changes in a dynamic process, the dynamic adjustment module can integrate a Transformer, which is used to generate sequences of feature point changes, so that the dynamic facial three-dimensional model can be observed directly.
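The sketch below shows how these pieces could be chained in an editing loop: dragged (or GAN/Transformer-generated) feature points are encoded into a new expression base, the deformation and color fields update the neutral point cloud, and the rendered low-resolution image is upscaled by the super-resolution network. All callables here are assumed interfaces used for illustration, not APIs defined by this disclosure.

```python
import torch

@torch.no_grad()
def edit_face(encoder, deform_field, color_field, renderer, sr_net,
              neutral_image, neutral_xyz, neutral_colors, dragged_points):
    """Illustrative dynamic-adjustment loop (all module interfaces are assumptions)."""
    z_exp = encoder(neutral_image, dragged_points)             # new expression base vector
    new_xyz = deform_field(neutral_xyz, z_exp)                 # P = P_0 + f_def(P_0, Z_exp)
    new_rgb = color_field(neutral_xyz, neutral_colors, z_exp)  # colors for the new expression
    low_res = renderer(new_xyz, new_rgb)                       # render the expressive point cloud
    return sr_net(low_res)                                     # high-definition output image
```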
According to this embodiment, the facial three-dimensional reconstruction method based on neural radiance field technology can generate an accurate, natural, attractive and realistic facial three-dimensional model, can be adjusted flexibly and conveniently for personalization, and has good application prospects in the medical and healthcare fields.
Claims (7)
1. An editable face three-dimensional reconstruction method is characterized by comprising the following steps:
Obtaining modeler non-expression image data and expression image data shot at a plurality of angles;
Generating a head point cloud by using the expressionless image data to obtain a neutral model;
generating head point cloud by using the expressive image data to obtain a real expressive model;
training a deformation field and a color field by using a true expression model;
the neutral model obtains a model which is finally output through a color field and a deformation field;
Obtaining a rendered image by utilizing the finally output model, and training a super-resolution neural network by using the rendered image and the expressive image data;
training an expression encoder using the non-expressive image data and the expressive image data;
After the facial feature points of the patient in the non-expression image data are dragged, the facial feature points are sent to the expression encoder, and the coordinate of the dragged feature points is encoded into an expression base vector by the expression encoder;
The expression base vector and the finally output model form a facial three-dimensional model with expression of a modeler;
the facial three-dimensional model with the expression obtains a high-definition rendering chart through a super-resolution network;
the training process of the super-resolution network comprises the following steps:
The neutral model and the expression base vector obtain an expressed sparse point cloud through a deformation field and a color field;
The low-resolution image rendered by the expression sparse point cloud obtains a corresponding high-resolution image through a super-resolution neural network;
the training process of the deformation field and the color field is as follows:
The neutral model comprises a neutral point cloud position vector and a neutral point cloud characteristic point color vector, wherein the neutral point cloud position vector is a vector formed by position coordinates of expressive characteristic points with position changes;
In the deformation field, the neutral point cloud position vector and the expression base vector are reshaped into matrices and concatenated along the channel dimension to form a tensor with 4 channels; the tensor undergoes feature extraction and fusion through a ResNet, and finally three fully connected layers map out the position offsets of the corresponding points relative to the neutral model;
in the color field, the neutral point cloud position vector, the neutral point cloud feature-point color vector and the expression base vector are reshaped into matrices and concatenated along the channel dimension to form a tensor with 4 channels; the tensor undergoes feature extraction and fusion through a ResNet, and finally three fully connected layers map out the color offsets of the corresponding points relative to the neutral model;
in the training of deformation fields and color fields, the following three loss functions are adopted:
L_RGB = ||I_rgb − I_gt||_1
L_sil = 1 − IOU(M, M_gt)
L_def = ||P − P_gt||_2
wherein L_RGB is the RGB loss, L_sil is the contour loss, L_def is the facial feature point loss, I_rgb is the RGB value output by the color field, I_gt is the RGB value of the corresponding point of the real expressive model, M is the facial contour of the point cloud, M_gt is the facial contour of the real expressive model, IOU is the intersection-over-union of the two contours, P_gt is the position of a point of the real expressive model, P is the position of the corresponding point of the point cloud, the subscript 1 in the formula for L_RGB denotes the L1 loss function, and the subscript 2 in the formula for L_def denotes the L2 loss function;
P = P_0 + f_def(P_0, Z_exp)
wherein P_0 is the position of the corresponding point of the neutral point cloud, Z_exp is the expression base vector, and f_def is the deformation field;
in the process of training the deformation field and the color field, a prior constraint L_offset is introduced; L_offset penalizes all non-zero displacements:
L_offset = λ_1 · counter(f_def(P_0, Z_exp) = 0)
wherein λ_1 is a weight that scales the loss, and counter(f_def(P_0, Z_exp) = 0) denotes counting, over the feature points, those whose offset is zero;
in the process of training the super-resolution neural network, the following loss function is adopted:
L = λ_2·||I_hr − I_gt||_1 + (1 − λ_2)·||I_lr − I_gt||_1
wherein I_hr is the high-definition image after resolution improvement by the super-resolution neural network, I_lr is the low-resolution image rendered from the point cloud, I_gt is the real expressive image, and λ_2 is a weight;
the specific process of constructing the expression base vector is expressed as follows:
Z_exp = E_exp(I_in, P_0)
wherein Z_exp is the expression base vector, E_exp is the expression encoder, I_in is the expressionless image data, and P_0 is the 68 facial feature points of the expressive image data.
2. The method of three-dimensional reconstruction of an editable face of claim 1, wherein the non-expressive image data is selected from at least one of a picture or a video, and wherein the expressive image data is selected from at least one of a picture or a video.
3. The method for three-dimensional reconstruction of an editable face as defined in claim 1, wherein:
The training process of the expression encoder is as follows:
Inputting facial feature points in the non-expression image data and the expression image data, and obtaining expression base vectors through an expression encoder;
Generating an expressive image and expressive facial feature points through the expression base vectors, and carrying out loss calculation on the expressive image and the expressive facial feature points and the facial feature points of the true expressive image data and the true expressive image data respectively, so that the expression encoder is optimized continuously;
in the process of training the expression encoder, the following loss function is adopted:
L_exp = L_gen + L_f1 + α·L_f2
wherein L_exp is the loss function for constructing the expression base vector, L_gen is the mean-square-error loss between each pixel of the newly generated image and the original image, L_f1 is the MSE loss between the facial feature points after feature recovery and the original facial feature points, L_f2 is the loss between the feature points of the generated expressive image and the original feature points, and α is a switch that gates the expressive-feature loss.
4. A system for implementing the editable face three-dimensional reconstruction method of any one of claims 1 to 3, comprising:
An input module configured to acquire a series of non-expressive image data and expressive image data of a modeler;
a neutral model construction module configured to generate a head point cloud using the expressionless image data to obtain a neutral model;
the true expressive model construction module is configured to generate head point clouds by using the expressive image data to obtain a true expressive model;
A deformation field and color field training module configured to train the deformation field and color field using the true expressive model;
The final output model training module is configured to obtain a final output model through a color field and a deformation field by the neutral model;
A super-resolution neural network module configured to obtain a rendered image using the finally output model, train a super-resolution neural network using the rendered image and the expressive image data;
The expression base vector construction module is configured to drag the facial feature points of the patient in the non-expression image data, and then send the facial feature points into the expression encoder, and the expression encoder encodes the dragged feature point coordinates into an expression base vector;
A facial three-dimensional model construction module configured such that the expression base vector and the finally output model constitute a modeler's expressive facial three-dimensional model;
the high-definition rendering module is configured to obtain a high-definition rendering diagram of the facial three-dimensional model with the expression through a super-resolution network;
the training process of the deformation field, the color field and the super-resolution network is as follows:
The neutral model and the expression base vector obtain an expressed sparse point cloud through a deformation field and a color field;
The low-resolution image rendered by the expression sparse point cloud obtains a corresponding high-resolution image through a super-resolution neural network;
the training process of the deformation field and the color field is as follows:
The neutral model comprises a neutral point cloud position vector and a neutral point cloud characteristic point color vector, wherein the neutral point cloud position vector is a vector formed by position coordinates of expressive characteristic points with position changes;
In the deformation field, the neutral point cloud position vector and the expression base vector are reshaped into matrices and concatenated along the channel dimension to form a tensor with 4 channels; the tensor undergoes feature extraction and fusion through a ResNet, and finally three fully connected layers map out the position offsets of the corresponding points relative to the neutral model;
in the color field, the neutral point cloud position vector, the neutral point cloud feature-point color vector and the expression base vector are reshaped into matrices and concatenated along the channel dimension to form a tensor with 4 channels; the tensor undergoes feature extraction and fusion through a ResNet, and finally three fully connected layers map out the color offsets of the corresponding points relative to the neutral model;
in the training of deformation fields and color fields, the following three loss functions are adopted:
L_RGB = ||I_rgb − I_gt||_1
L_sil = 1 − IOU(M, M_gt)
L_def = ||P − P_gt||_2
wherein L_RGB is the RGB loss, L_sil is the contour loss, L_def is the facial feature point loss, I_rgb is the RGB value output by the color field, I_gt is the RGB value of the corresponding point of the real expressive model, M is the facial contour of the point cloud, M_gt is the facial contour of the real expressive model, IOU is the intersection-over-union of the two contours, P_gt is the position of a point of the real expressive model, P is the position of the corresponding point of the point cloud, the subscript 1 in the formula for L_RGB denotes the L1 loss function, and the subscript 2 in the formula for L_def denotes the L2 loss function;
P = P_0 + f_def(P_0, Z_exp)
wherein P_0 is the position of the corresponding point of the neutral point cloud, Z_exp is the expression base vector, and f_def is the deformation field;
in the process of training the deformation field and the color field, a prior constraint L_offset is introduced; L_offset penalizes all non-zero displacements:
L_offset = λ_1 · counter(f_def(P_0, Z_exp) = 0)
wherein λ_1 is a weight that scales the loss, and counter(f_def(P_0, Z_exp) = 0) denotes counting, over the feature points, those whose offset is zero;
in the process of training the super-resolution neural network, the following loss function is adopted:
L = λ_2·||I_hr − I_gt||_1 + (1 − λ_2)·||I_lr − I_gt||_1
wherein I_hr is the high-definition image after resolution improvement by the super-resolution neural network, I_lr is the low-resolution image rendered from the point cloud, I_gt is the real expressive image, and λ_2 is a weight;
the specific process of constructing the expression base vector is expressed as follows:
Z_exp = E_exp(I_in, P_0)
wherein Z_exp is the expression base vector, E_exp is the expression encoder, I_in is the expressionless image data, and P_0 is the 68 facial feature points of the expressive image data.
5. The system as recited in claim 4, further comprising:
and the dynamic adjustment module is configured to change the positions of the characteristic points, adjust the facial three-dimensional model and generate a new facial three-dimensional model.
6. The system of claim 5, wherein the process of generating a new three-dimensional model of the face in the dynamic adjustment module includes generating a new expression base based on the new feature point locations, generating a new three-dimensional model of the face using the deformation field and the color field;
or, in the dynamic adjustment module, generating feature points corresponding to at least one expression or facial form through a GAN model, wherein the feature points corresponding to the at least one expression or facial form are integrated into at least one option;
Or, in the dynamic adjustment module, generating a sequence of feature point changes through a Transformer model, and generating the corresponding sequence of facial three-dimensional model changes according to the sequence of feature point changes.
7. A computer-readable storage medium, having stored thereon a computer program for implementing the editable face three-dimensional reconstruction method according to any one of claims 1 to 3, or a computer program for implementing the system according to any one of claims 4 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202411742013.6A (CN119206101B) | 2024-11-29 | 2024-11-29 | Editable facial three-dimensional reconstruction method, system and storage medium
Publications (2)
Publication Number | Publication Date
---|---
CN119206101A | 2024-12-27
CN119206101B | 2025-03-25
Family
- ID: 94061597
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant