
CN119417729B - Image processing method, electronic device and storage medium - Google Patents

Image processing method, electronic device and storage medium

Info

Publication number
CN119417729B
CN119417729B (application CN202510013410.8A)
Authority
CN
China
Prior art keywords
image
face
vector
facial
trained
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510013410.8A
Other languages
Chinese (zh)
Other versions
CN119417729A (en)
Inventor
卢溜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honor Device Co Ltd filed Critical Honor Device Co Ltd
Priority to CN202510013410.8A priority Critical patent/CN119417729B/en
Publication of CN119417729A publication Critical patent/CN119417729A/en
Application granted granted Critical
Publication of CN119417729B publication Critical patent/CN119417729B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/77Retouching; Inpainting; Scratch removal
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • G06T15/04Texture mapping
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • G06T15/50Lighting effects
    • G06T15/506Illumination models
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/90Dynamic range modification of images or parts thereof
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20172Image enhancement details
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • G06T2207/30201Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Image Processing (AREA)

Abstract

The application provides an image processing method, an electronic device, and a storage medium in the field of image processing. The method includes: displaying a first interface; acquiring a face image in response to a first operation on a first control in the first interface; processing the face image with a trained inverse rendering model to obtain a face normal image, a face albedo image, and an environment image; and enhancing the face image, the face normal image, the face albedo image, and the environment image with a trained face enhancement model to obtain a face target image. In this implementation, the inverse rendering model's processing result for the face image serves as face prior information. The face enhancement model can effectively use this prior information to restore the face accurately; even under complex lighting and shadow conditions, it avoids generating unnatural textures and outputs a high-quality captured image.

Description

Image processing method, electronic device and storage medium
Technical Field
The present application relates to the field of image processing, and in particular, to an image processing method, an electronic device, and a storage medium.
Background
With the popularization of electronic devices with photographing functions, taking photos with an electronic device has become part of daily life. However, during shooting, the quality of the captured image is often poor due to factors such as dim light, lens shake, and movement of the subject.
Although various methods for enhancing image quality have been proposed in the prior art, these methods improve image quality while introducing new problems, especially in face image processing. For example, the enhanced image may exhibit blurring, pseudo-textures, and similar phenomena that seriously degrade the quality of the face image. A new image processing method is therefore needed to solve these problems.
Disclosure of Invention
The application provides an image processing method, an electronic device, and a storage medium, in which the processing result of an inverse rendering model on a face image is used as face prior information. The face enhancement model can effectively use this prior information to restore the face accurately; even under complex lighting and shadow conditions, it avoids generating unnatural textures and outputs a high-quality captured image.
In order to achieve the above purpose, the application adopts the following technical scheme:
A first aspect provides an image processing method, including: displaying a first interface, the first interface including a first control; acquiring a face image in response to a first operation on the first control; processing the face image with a trained inverse rendering model to obtain a face normal image, a face albedo image, and an environment image; and enhancing the face image, the face normal image, the face albedo image, and the environment image with a trained face enhancement model to obtain a face target image.
Optionally, the first operation is used to instruct to start shooting, and the first operation may be a click operation on a shooting control.
Optionally, the face normal image represents the normal features corresponding to the face in the face image, i.e., the normal directions of the face surface.
Optionally, the face albedo image represents the albedo characteristics corresponding to the face in the face image, such as the color and texture of the face surface.
Optionally, the environment image is used to represent environment content other than the portrait, or to represent the influence of illumination of the shooting environment.
Optionally, the sharpness of the face target image is greater than the sharpness of the face image.
Optionally, the face target image has rich detail and is free of problems such as a strong smearing (over-smoothed) appearance, artifacts, and pseudo-textures.
In this implementation, the trained inverse rendering model splits the face and the environment in the face image into a face normal image, a face albedo image, and an environment image, yielding accurate prior information. During enhancement, the trained face enhancement model makes full use of this prior information: it can effectively recover the details of the face image, eliminate smearing, artifacts, pseudo-textures, and the like, and improve the sharpness and quality of the face image.
With reference to the first aspect, in some implementations of the first aspect, acquiring the face image in response to a first operation on the first control includes: acquiring an original image in response to the first operation on the first control; determining the sharpness of the original image; and, upon detecting that the sharpness of the original image is less than a preset sharpness threshold, cropping the original image to obtain the face image.
In this implementation, original images are screened by sharpness, so that every face image selected for processing is one that actually needs enhancement. On the one hand, processing only those face images improves image processing efficiency; on the other hand, it avoids the artifacts and pseudo-textures that can arise from enhancing face images that do not need it, saving resources of the electronic device and reducing power consumption.
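The passage does not say how sharpness is measured; a common proxy is the variance of the Laplacian. A minimal sketch follows, where the threshold value and function names are illustrative assumptions, not values from the patent:

```python
import cv2
import numpy as np

def sharpness(image_bgr: np.ndarray) -> float:
    """Variance of the Laplacian: higher values mean sharper edges."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def needs_enhancement(image_bgr: np.ndarray, threshold: float = 100.0) -> bool:
    """Screen the original image: only frames below the sharpness threshold are enhanced."""
    return sharpness(image_bgr) < threshold
```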
With reference to the first aspect, in some implementations of the first aspect, the trained inverse rendering model includes a first network, a second network, and a third network, and processing the face image with the trained inverse rendering model to obtain a face normal image, a face albedo image, and an environment image includes: processing the face image with the first network to obtain the face normal image; processing the face image and the face normal image with the second network to obtain the face albedo image; and processing the face image, the face normal image, and the face albedo image with the third network to obtain the environment image.
Optionally, the first network is a VQVAE (Vector Quantized Variational Auto-Encoder) and/or the second network is a VQVAE.
Alternatively, the third network may be a VQVAE or a U-Net.
In this implementation, vector quantization (VQ) is introduced into each network of the inverse rendering model, so that accurate prior information can be extracted. This helps the subsequent face enhancement model exploit the prior information to enhance the face image while avoiding pseudo-textures, artifacts, and unnatural effects, greatly improving image quality.
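As a sketch of the cascade described in this claim, the three networks can be chained so that each later stage also sees the earlier outputs. Module names and shapes are placeholders, not the patent's architecture:

```python
import torch

def inverse_render(face: torch.Tensor, net1, net2, net3):
    """face: (B, 3, H, W). Each later stage conditions on all earlier outputs."""
    normal = net1(face)                                      # first network: face normals
    albedo = net2(torch.cat([face, normal], dim=1))          # second: face + normal image
    env    = net3(torch.cat([face, normal, albedo], dim=1))  # third: all three inputs
    return normal, albedo, env
```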
With reference to the first aspect, in some implementations of the first aspect, the first network includes a first vector codebook module containing a first set of discrete vectors, and processing the face image with the first network to obtain a face normal image includes: extracting, by the first network, a first latent vector from the face image; searching the first set of discrete vectors for a vector close to the first latent vector; and generating the face normal image from that vector.
Optionally, the first latent vector may include 3D structural information and geometric information corresponding to the face.
In this implementation, vector quantization via the VQVAE effectively maps information from a continuous latent space into a discrete space, which helps improve the computational efficiency and generalization ability of the first network. In this way, the first network can extract rich 3D structural and geometric information from the input face image, laying a foundation for the subsequent enhancement of the face image and helping the 3D structural and geometric features of the face be recovered more accurately.
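The nearest-vector lookup described here is the standard VQ-VAE quantization step. A minimal PyTorch sketch, assuming the encoder latents have been flattened to shape (N, D):

```python
import torch

def vector_quantize(z: torch.Tensor, codebook: torch.Tensor):
    """z: (N, D) encoder latents; codebook: (K, D) learned discrete vectors.
    Replaces each latent with its nearest codebook entry (standard VQ-VAE step)."""
    # Squared Euclidean distance between every latent and every codebook vector
    dists = (z.pow(2).sum(1, keepdim=True)
             - 2 * z @ codebook.t()
             + codebook.pow(2).sum(1))
    indices = dists.argmin(dim=1)   # index of the closest discrete vector
    z_q = codebook[indices]         # quantized latents fed to the decoder
    # Straight-through estimator so gradients still flow back to the encoder
    z_q = z + (z_q - z).detach()
    return z_q, indices
```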
With reference to the first aspect, in some implementations of the first aspect, the second network includes a second vector codebook module containing a second set of discrete vectors, and processing the face image and the face normal image with the second network to obtain a face albedo image includes: stitching the face image and the face normal image to obtain a first stitched image; extracting, by the second network, a second latent vector from the first stitched image; searching the second set of discrete vectors for a vector close to the second latent vector; and generating the face albedo image from that vector.
Optionally, the second latent vector may include texture information and skin information corresponding to the face.
In this implementation, vector quantization via the VQVAE effectively maps information from a continuous latent space into a discrete space, which helps improve the computational efficiency and generalization ability of the second network. In this way, the second network can extract accurate texture and skin information from the input face image and face normal image, which helps recover the texture and skin features of the face more accurately in the subsequent enhancement, improves the detail and realism of the face image, and avoids generating pseudo-textures, artifacts, and unnatural effects.
With reference to the first aspect, in some implementations of the first aspect, the trained face enhancement model includes a third vector codebook module containing a third set of discrete vectors, and enhancing the face image, the face normal image, the face albedo image, and the environment image with the face enhancement model to obtain a face target image includes: stitching the face image, the face normal image, the face albedo image, and the environment image to obtain a second stitched image; extracting, by the trained face enhancement model, a third latent vector from the second stitched image; searching the third set of discrete vectors for a vector close to the third latent vector; and generating the face target image from that vector.
Optionally, the third latent vector may include 3D structural information, geometric information, texture information, skin information, ambient lighting information, light and shadow information, and the like, corresponding to the face.
In this implementation, the face enhancement model can obtain accurate 3D structural, geometric, texture, skin, ambient lighting, and light-and-shadow information from the prior information. Even when the face image is of poor quality and the lighting on the face is complex, this information can correctly guide the restoration of the 3D structure, geometry, texture, skin, ambient lighting, and shading of the face image, while effectively avoiding the generation of pseudo-textures and pseudo-shadows. This improves image quality and strengthens the robustness of the face enhancement model, giving it wider applicability.
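Putting the claim together, the enhancement forward pass can be sketched as follows, reusing the `vector_quantize` sketch above. The encoder and decoder are placeholders, and the reshaping between feature maps and flat latent vectors is omitted for brevity:

```python
import torch

def enhance(face, normal, albedo, env, encoder, decoder, codebook):
    """Sketch of the claimed forward pass; all image inputs are (B, 3, H, W).
    encoder is assumed to return flattened (N, D) latent vectors."""
    x = torch.cat([face, normal, albedo, env], dim=1)  # the "second stitched image"
    z = encoder(x)                                     # third latent vector(s)
    z_q, _ = vector_quantize(z, codebook)              # nearest discrete vectors
    return decoder(z_q)                                # face target image
```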
With reference to the first aspect, in some implementations of the first aspect, enhancing the face image, the face normal image, the face albedo image, and the environment image with the trained face enhancement model to obtain a face target image includes: determining a light-supplemented environment image corresponding to the environment image with a trained light-and-shadow enhancement model; and enhancing the face image, the face normal image, the face albedo image, and the light-supplemented environment image with the trained face enhancement model to obtain the face target image.
In this implementation, the light-and-shadow enhancement model intelligently supplements light in the environment image, and the face enhancement model then fine-tunes the face image based on the light-supplemented environment image. This processing not only enhances face details but also achieves comprehensive light supplementation for the face and its surroundings, keeps the face balanced with the ambient light, and helps generate a face target image with higher sharpness and stronger contrast.
With reference to the first aspect, in some implementations of the first aspect, the image processing method further includes: training an initial inverse rendering network with a first training set to obtain an inverse rendering model in training, the first training set including high-definition sample images; and training the inverse rendering model in training with a second training set to obtain the trained inverse rendering model, the second training set including low-definition sample images degraded from the high-definition sample images.
In this implementation, the inverse rendering model is trained in two stages. In the first stage, the initial inverse rendering network is trained on the high-definition sample images so that the vector codebook modules can capture and store the prior information of high-definition images; in the second stage, training continues on the low-definition sample images so that the model learns to use that prior information to accurately recover and output high-quality images. After these two stages, the trained inverse rendering model is highly robust: facing face images in various complex lighting scenes, it can accurately extract and use prior information and generate high-quality images.
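The passage says the low-definition samples are degraded from high-definition ones but does not give the degradation recipe; a typical blur/downsample/noise/JPEG pipeline, shown here purely as an illustrative assumption:

```python
import cv2
import numpy as np

def degrade(hq_bgr: np.ndarray, scale: int = 4, sigma: float = 2.0,
            noise_std: float = 5.0, jpeg_q: int = 40) -> np.ndarray:
    """Synthesize a low-definition training sample from a high-definition one."""
    img = cv2.GaussianBlur(hq_bgr, (0, 0), sigma)                  # optical blur
    h, w = img.shape[:2]
    img = cv2.resize(img, (w // scale, h // scale))                # downsample
    img = img + np.random.normal(0, noise_std, img.shape)          # sensor noise
    img = np.clip(img, 0, 255).astype(np.uint8)
    _, buf = cv2.imencode('.jpg', img, [cv2.IMWRITE_JPEG_QUALITY, jpeg_q])
    img = cv2.imdecode(buf, cv2.IMREAD_COLOR)                      # compression artifacts
    return cv2.resize(img, (w, h))                                 # back to original size
```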
With reference to the first aspect, in some implementations of the first aspect, the image processing method further includes: training an initial face enhancement network with a third training set to obtain a face enhancement model in training, the third training set including high-definition sample images; and training the face enhancement model in training with a fourth training set to obtain the trained face enhancement model, the fourth training set including low-definition sample images together with the normal sample images, albedo sample images, and environment sample images corresponding to them, where the normal sample images and albedo sample images are obtained by processing the low-definition sample images with the trained inverse rendering model, and the environment sample images are obtained by simulation with the trained light-and-shadow enhancement model.
In this implementation, the face enhancement model is trained in two stages. In the first stage, the initial face enhancement network is trained on the high-definition sample images so that its vector codebook module learns how to quantize vectors and how to accurately recover and output high-quality face images using prior information. In the second stage, the model is further trained on the low-definition sample images so that it learns to use prior information to accurately recover and output high-quality images. After these two stages, the trained face enhancement model is highly robust: facing face images in various complex lighting scenes, it can accurately use prior information and generate high-quality images.
With reference to the first aspect, in some implementations of the first aspect, the image processing method further includes: calculating a target loss value according to a preset loss function during training of the face enhancement model, and continuing to train the face enhancement model by back-propagation according to the target loss value to obtain the trained face enhancement model.
Optionally, the target loss value represents the loss between the high-definition sample image and the predicted image output by the face enhancement model in training, and consists of three parts: an L1 loss, a perceptual loss, and an adversarial loss.
The L1 loss represents the pixel-wise loss between the high-definition sample image and the predicted image (also called the reconstructed image); the perceptual loss represents the feature loss between the high-definition sample image and the predicted image; and the adversarial loss represents the loss between real images and the images produced by the generator in the face enhancement model.
In this implementation, adjusting the model parameters with the back-propagation algorithm helps the model converge to a good solution faster, speeding up the training of the face enhancement model. Because all three loss components are considered when adjusting the parameters, the performance of the face enhancement model improves in several respects, helping it cope with a variety of low-quality images with complex lighting.
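A minimal sketch of this three-part objective; the loss weights and the VGG-style perceptual feature extractor are common choices, not values given in the patent:

```python
import torch
import torch.nn.functional as F

def target_loss(pred, hq, disc, vgg_features, w_l1=1.0, w_perc=1.0, w_adv=0.1):
    """pred: generator output; hq: high-definition sample; disc: discriminator
    returning logits; vgg_features: a fixed feature extractor."""
    l1   = F.l1_loss(pred, hq)                                # pixel-wise L1 loss
    perc = F.l1_loss(vgg_features(pred), vgg_features(hq))    # perceptual (feature) loss
    adv  = -torch.mean(torch.log(torch.sigmoid(disc(pred)) + 1e-8))  # adversarial loss
    return w_l1 * l1 + w_perc * perc + w_adv * adv
```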
With reference to the first aspect, in some implementations of the first aspect, the image processing method further includes stitching the original image and the face target image to obtain a high-definition image corresponding to the original image.
In this implementation, enhancement is applied specifically to the face image. On the one hand, details of the face region can be effectively recovered, and smearing, artifacts, pseudo-textures, and the like eliminated, ensuring a marked improvement in the image quality of the face region. On the other hand, enhancing the face image separately and then stitching the original image with the face target image to obtain a high-definition image effectively improves image processing efficiency and reduces the resource consumption of the electronic device. In addition, since users usually pay the most attention to faces in captured images, the image processing method provided by the application significantly improves the image quality of the face region, and thus the user's satisfaction, while speeding up image processing.
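A minimal sketch of the stitching step, assuming the crop box from the face-detection stage is available; a real implementation would also blend the seam:

```python
import cv2
import numpy as np

def paste_back(original: np.ndarray, face_target: np.ndarray, box) -> np.ndarray:
    """Stitch the enhanced face crop back into the original frame.
    box: (x, y, w, h) of the face region that was cropped out."""
    x, y, w, h = box
    result = original.copy()
    # Resize in case the enhancement stage changed the crop resolution
    result[y:y + h, x:x + w] = cv2.resize(face_target, (w, h))
    return result
```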
In a second aspect, the present application provides an image processing apparatus comprising one or more processors and a memory, the memory being coupled to the one or more processors, the memory being for storing computer program code, the computer program code comprising computer instructions, the one or more processors invoking the computer instructions to cause the image processing apparatus to perform any of the methods provided in the first aspect.
In a third aspect, the application provides an electronic device including a processor, a memory, and an interface, the processor, memory, and interface cooperating with one another so that the electronic device performs any one of the methods provided in the first aspect.
In a fourth aspect, the present application provides a chip comprising a processor. The processor is configured to read and execute a computer program stored in the memory to perform the method of the first aspect and any possible implementation thereof.
Optionally, the chip further comprises a memory, and the memory is connected with the processor through a circuit or a wire.
Optionally, the chip further comprises a communication interface.
In a fifth aspect, the present application provides a computer-readable storage medium, in which a computer program is stored, which when executed by a processor causes the processor to perform any one of the methods of the first aspect.
In a sixth aspect, the present application provides a computer program product comprising computer program code which, when run on an electronic device, causes the electronic device to perform any one of the methods of the first aspect.
The technical effects of the second to sixth aspects are similar to those of the corresponding technical means in the first aspect and are not repeated here.
Drawings
Fig. 1 is a schematic view of a scene of a photographed image according to an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram showing a display of a captured image in accordance with an exemplary embodiment of the present application;
FIG. 3 is a flow chart of an image processing method according to an exemplary embodiment of the present application;
FIG. 4 is a flowchart of a method for acquiring a face image according to an exemplary embodiment of the present application;
FIG. 5 is a schematic representation of a face image according to an exemplary embodiment of the present application;
FIG. 6 is a flowchart illustrating another image processing method according to an exemplary embodiment of the present application;
FIG. 7 is a diagram illustrating a trained inverse rendering model architecture according to an exemplary embodiment of the present application;
FIG. 8 is a schematic view of a face image, a face normal image, a face albedo image, and an ambient image, according to an exemplary embodiment of the present application;
FIG. 9 is a schematic diagram illustrating an inverse rendering model process flow according to an exemplary embodiment of the present application;
FIG. 10 is a schematic diagram of a first network architecture shown in an exemplary embodiment of the application;
FIG. 11 is a schematic diagram of a second network architecture shown in an exemplary embodiment of the present application;
FIG. 12 is a schematic illustration of a face enhancement model process flow shown in an exemplary embodiment of the present application;
FIG. 13 is a schematic diagram of a face enhancement model according to an exemplary embodiment of the present application;
FIG. 14 is a flow diagram illustrating training an inverse rendering model according to an exemplary embodiment of the present application;
FIG. 15 is a schematic illustration of enriching high-definition sample images with the light and shadow enhancement model, according to an exemplary embodiment of the application;
FIG. 16 is a flow chart of training a face enhancement model according to an exemplary embodiment of the present application;
Fig. 17 is a schematic diagram of the structure of an electronic device shown in an exemplary embodiment of the present application;
FIG. 18 is a block diagram of the software architecture of an electronic device shown in an exemplary embodiment of the application;
fig. 19 is a schematic view of an image processing apparatus shown in an exemplary embodiment of the present application;
Fig. 20 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
In embodiments of the present application, the following terms "first", "second", "third", "fourth" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated.
In order to facilitate understanding of the technical solutions in the embodiments of the present application, before describing the technical solutions in the examples of the present application, some terms related to the embodiments of the present application are explained.
1. Convolutional Neural Network (CNN)
A convolutional neural network is a deep neural network with a convolutional structure. It contains a feature extractor consisting of convolutional layers and sub-sampling layers; the feature extractor can be regarded as a filter, and a convolutional layer is a layer of neurons that performs convolution on the input signal. In a convolutional layer, a neuron may be connected to only some of the neurons in the adjacent layer. A convolutional layer typically contains a number of feature planes, each of which may be composed of neurons arranged in a rectangular array.
2. RGB (Red, green, blue) color space
The RGB color space may also be referred to as the RGB domain, referring to a color model associated with the structure of the human visual system, with all colors being treated as different combinations of red, green and blue, depending on the structure of the human eye.
3. RGB (Red, green, blue) channel values
RGB channel values may also be referred to as RGB values, meaning that in the RGB color model, the intensity values of three color channels, red (Red), green (Green), blue (Blue), are used to define a particular color, which together determine the color of the final display.
4. Backlight
Backlighting is a condition in which the subject (also called the photographed object) is directly between the light source and the camera, i.e., the light source is behind the subject. In this state the subject tends to be underexposed, so in general the user should avoid photographing a subject under backlight conditions as much as possible.
5. Albedo (Albedo)
Albedo generally refers to the ratio of the reflected radiant flux to the incident radiant flux at the earth's surface under solar radiation. It is an important variable reflecting many surface properties, such as the surface's ability to absorb solar radiation.
In the embodiment of the application, albedo refers to the ratio of the reflected radiant flux to the incident radiant flux for a person's head under illumination, and reflects how strongly the surface of the head (face, scalp, and so on) absorbs the incident radiation.
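In formula form, with $\Phi$ denoting radiant flux:

$$\text{albedo} = \frac{\Phi_{\text{reflected}}}{\Phi_{\text{incident}}}$$

A perfectly absorbing surface has an albedo of 0, and a perfectly reflecting one an albedo of 1.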
6. Normal direction
In the embodiment of the present application, the normal direction indicates the direction of the normal line.
7. Receptive field
In machine vision, a concept in deep neural networks called the receptive field denotes the region of the original image that a neuron at a given position in the network can perceive.
8. Prior
A prior refers to inherent knowledge or preset information about the structure and characteristics of faces, which can help the electronic device improve accuracy and efficiency when processing face images.
9. Back Propagation Algorithm (Back Propagation, BP)
A neural network can use the error back-propagation algorithm to revise the parameters of the initial neural network model during training, so that the reconstruction error loss of the model becomes smaller and smaller. Specifically, the input signal is propagated forward until the output produces an error loss, and the parameters of the initial model are updated by back-propagating the error-loss information, so that the error loss converges.
The back-propagation algorithm is thus a backward pass dominated by the error loss, intended to obtain the parameters of the optimal neural network model, such as its weight matrices.
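In PyTorch terms, one forward/backward iteration of this procedure looks roughly like the following sketch (the model, optimizer, and loss function are illustrative):

```python
import torch

def train_step(model, optimizer, x, target, loss_fn):
    """One iteration: forward until the output produces an error loss,
    then back-propagate and update the parameters."""
    optimizer.zero_grad()
    pred = model(x)                # forward pass
    loss = loss_fn(pred, target)   # error loss at the output
    loss.backward()                # back-propagate the error-loss information
    optimizer.step()               # update parameters so the loss converges
    return loss.item()
```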
The foregoing is a simplified description of the terminology involved in the embodiments of the present application, and is not described in detail below.
With the popularization of electronic devices with photographing functions, taking photos with an electronic device has become part of daily life. However, during shooting, factors such as hardware performance (e.g., the image sensor), the shooting environment (e.g., dim light), lens shake, shooting with a telephoto camera, and movement of the subject often degrade the quality of the captured image.
Although various methods for enhancing image quality have been proposed in the prior art, they improve image quality while introducing new problems, especially in face image processing. For example, the composition of a face carries strong priors, but conventional image enhancement methods (such as linear filtering, histogram equalization, and sharpening) cannot exploit this prior information. When processing a face image, they therefore fail to produce a good facial contour or to enhance facial details well, and when the lighting on the face is complex they introduce abrupt, jarring pseudo-textures. In other words, processing a face image with conventional enhancement methods yields results with low sharpness, insufficient detail, a strong smearing appearance, noise, artifacts, pseudo-textures, and similar problems.
Wherein insufficient detail refers to loss of detail. For example, the information of fine textures (such as skin textures, hair, etc.), edges, shapes, etc. originally present in the face image is lost, resulting in the face image not appearing clear enough or lacking in stereoscopic impression.
The strong smearing feeling means that the color and texture in the face image are excessively smoothed, so that the face image looks like being smeared, and lacks sharpness and definition.
The existence of the artifact means that the content which does not exist truly is added in the processed face image.
For example, suppose the lighting on a face is complex and a shadow lies just above the eyebrows in the face image. A conventional enhancement method cannot accurately distinguish real features from lighting effects and mistakenly enhances the shadow as an eyebrow, so the enhanced image shows two eyebrows: one real and one that does not actually exist. The non-existent eyebrow is called an artifact, and such a processed face image seriously harms realism and the user experience.
The existence of pseudo-texture means that the non-truly existing texture information is added in the processed face image.
For example, suppose the lighting on a face is complex and an unnatural ring of light and shadow surrounds the person's eyes in the face image. A conventional enhancement method cannot accurately distinguish real features from lighting effects and mistakenly enhances the ring as the outline of a pair of glasses, so glasses appear in the enhanced image even though none exist. Such a processed face image seriously harms realism and the user experience.
In view of this, the embodiment of the application provides an image processing method, which may also be regarded as a face image enhancement method. It not only eliminates problems such as face blurring, smearing, noise, artifacts, and pseudo-textures, but also supplements details and performs super-resolution reconstruction. Even under complex lighting, it can produce clear, complete, high-resolution, high-quality face images from poor-quality images, improving the user's shooting experience.
The image processing method specifically includes: displaying a first interface; acquiring a face image in response to a first operation on a first control in the first interface; processing the face image with a trained inverse rendering model to obtain a face normal image, a face albedo image, and an environment image; and enhancing the face image, the face normal image, the face albedo image, and the environment image with a trained face enhancement model to obtain a face target image, the sharpness of the face target image being greater than that of the face image.
According to the image processing method provided by the embodiment of the application, the trained inverse rendering model extracts the face normal image, face albedo image, and environment image corresponding to the captured image, yielding accurate face prior information and laying a solid foundation for the subsequent face enhancement. In the enhancement stage, the trained face enhancement model uses this prior information to restore the face accurately, effectively improve the facial contour, and enhance facial details, avoiding unnatural textures even under complex lighting. The method can thus output high-quality captured images with high sharpness and rich detail, free of strong smearing, artifacts, pseudo-textures, and similar problems.
Before describing the image processing method provided by the embodiment of the present application, an application scenario of the image processing method provided by the embodiment of the present application is illustrated with reference to the accompanying drawings.
Application scenario 1: shooting scenario
The image processing method provided by the embodiment of the application can be applied to shooting scenes. For example, it can be applied to photographed images of various scenes. Referring to fig. 1, fig. 1 is a schematic view of a scene of a photographed image according to an exemplary embodiment of the present application.
In the embodiment of the application, the electronic equipment is taken as a mobile phone for example. Illustratively, the display of the handset displays a main interface, such as main interface UI1 shown in (a) of fig. 1. The main interface UI1 includes icons of a plurality of applications, such as a "video" application icon, an "gallery" application icon, a "camera" application icon, and the like. When the user wants to shoot an image, clicking operation can be performed on an icon of the camera application program, the mobile phone runs the camera application in response to the clicking operation, and meanwhile, the display screen of the mobile phone jumps from the main interface UI1 to the shooting interface UI2.
A photographing interface UI2 as shown in (b) of fig. 1, the photographing interface UI2 including a preview area 102, a photographing control 103, a preview thumbnail 104, and a plurality of photographing mode options. The preview area 102 is used for displaying a preview image, the shooting control 103 is used for indicating the mobile phone to shoot the image when receiving shooting operation (such as clicking the shooting control 103), and the preview thumbnail 104 is used for displaying a thumbnail corresponding to the currently shot image.
The plurality of photographing mode options may include an aperture mode, a night view mode, a portrait mode, a photographing mode, a video mode, a professional mode, and the like. It will be appreciated that the camera application is typically in a photographing mode by default when it is open.
Illustratively, the light source of the shooting environment in which the subject 101 is located is sunlight, and the user wants to photograph the subject 101, so the camera of the mobile phone is aimed at it. The camera captures an image corresponding to the subject 101 and displays a preview image in the preview area 102. When the user is satisfied with the preview image, the shooting control 103 can be clicked; in response, the mobile phone captures the preview image currently displayed in the preview area 102, processes it into a photo, and saves it in the phone's gallery.
When the user wants to view the shot image, clicking operation can be performed on the icon of the 'gallery' application program, and the mobile phone displays the shot image in response to the clicking operation. Or the user can perform clicking operation on the preview thumbnail 104, and the mobile phone can quickly display the currently shot image in response to the clicking operation.
In this shooting scene, the subject 101 is directly between the light source and the camera, so the region where it is located becomes a dark region. A captured image obtained with the related art then typically suffers from low sharpness, insufficient detail, strong smearing, artifacts, pseudo-textures, and the like on the face of the subject 101.
Referring to fig. 2, fig. 2 is a schematic diagram showing a display of a photographed image according to an exemplary embodiment of the present application. A display interface UI3 as shown in (a) in fig. 2, the display interface UI3 being an interface for displaying images in a gallery application. In the embodiment of the present application, the captured image 105 acquired by the related art is displayed in the display interface UI 3. Obviously, the face of the subject 101 in the captured image 105 has problems such as low definition (which may also be referred to as blurring), insufficient details, and artifacts.
According to the image processing method provided by the embodiment of the application, the trained inverse rendering model extracts the face normal image, face albedo image, and environment image corresponding to the captured image, yielding accurate face prior information and laying a solid foundation for the subsequent face enhancement. In the enhancement stage, the trained face enhancement model uses this prior information to restore the face accurately, effectively improve the facial contour, and enhance facial details, avoiding unnatural textures even under complex lighting. The method can thus output high-quality captured images with high sharpness and rich detail, free of smearing, artifacts, pseudo-textures, and similar problems.
A display interface UI4 as shown in (b) of fig. 2, the display interface UI4 being an interface for displaying images in a gallery application. In the embodiment of the present application, the captured image 106 obtained by using the image processing method provided by the embodiment of the present application is displayed in the display interface UI 4. Obviously, compared with the shot image 105, the quality of the shot image 106 is higher, and the face of the shot object 101 in the shot image 106 has high definition and rich details, and the problems of strong smearing feeling, artifacts, pseudo textures and the like are avoided.
Application scenario 2: video recording scenario
The image processing method provided by the embodiment of the application can be applied to video scenes. For example, it can be applied to recorded video of various scenes.
Illustratively, when the camera application is running, a plurality of shooting mode options are displayed in the shooting interface. In response to the user's sliding operation over these options, the mobile phone switches from the photographing mode to the video recording mode. In the preview state, the preview images collected by the phone are displayed in real time in the preview area. When the user is satisfied with the preview image, the shooting control can be clicked; in response, the mobile phone captures the preview images displayed in the preview area in real time, runs the program corresponding to the image processing method provided by the embodiment of the application, obtains the captured video, and saves it to the phone's gallery.
For example, when a subject takes a selfie in a backlit environment, the strong ambient light turns the region where the subject is located into a dark region. Video processed with the related art then suffers from low sharpness, insufficient detail, strong smearing, artifacts, pseudo-textures, and the like on the subject's face. With the image processing method provided by the embodiment of the application, the images in the captured video are split by the inverse rendering model to obtain accurate face prior information. In the enhancement stage, the face enhancement model uses this prior information to restore the face accurately, effectively improve the facial contour, and enhance facial details, avoiding unnatural textures even under complex lighting. The method can thus output high-quality video with high sharpness and rich detail, free of strong smearing, artifacts, pseudo-textures, and similar problems.
Application scenario 3: video call scenario
The image processing method provided by the embodiment of the application can also be applied to video call scenes. For example, when a user performs a video call in a backlight environment, the area where the user is located becomes a dark area due to strong ambient light. At this time, if the processing is performed by using the related technology, the problems of low definition, insufficient details, strong smearing feeling, artifacts, pseudo textures and the like of the face of the user shot by the electronic device occur in the video call process. If the image processing method provided by the embodiment of the application is used for processing, high-quality video can be output.
Application scenario 4: image editing/beautification (retouching) scenario
The image processing method provided by the embodiment of the application can also be applied to image beautification scenarios. For example, when images stored in the gallery suffer from low sharpness, insufficient detail, strong smearing, artifacts, pseudo-textures, and the like, and the user wants to beautify a stored image (one shot earlier, downloaded from the internet, and so on), the user can click a control in the display interface (such as an automatic optimization control, an intelligent retouching control, or a brightness improvement control), and the electronic device, in response to the click operation, processes the image with the image processing method provided by the embodiment of the application to beautify it.
It should be understood that the application scenario of the image processing method is merely illustrative, and does not limit the actual application scenario of the present application. The image processing method provided by the embodiment of the application can be applied to, but is not limited to, a scene for enhancing images and/or videos in video monitoring, a scene for enhancing preview images in various application programs, a scene for enhancing images in video conference applications, a scene for enhancing images in long and short video applications, a scene for enhancing images in video live broadcasting applications, a scene for enhancing images in intelligent fortune mirror application scenes, and the like.
The image processing method provided by the embodiment of the application is described in detail below with reference to the accompanying drawings.
Referring to fig. 3, fig. 3 is a flowchart illustrating an image processing method according to an exemplary embodiment of the application. The method comprises the following steps:
S101, displaying a first interface, wherein the first interface comprises a first control.
For example, a user may instruct the electronic device to run a camera application by clicking an icon of the "camera" application, and typically, after running the camera application, the main camera, i.e., the rear camera, of the electronic device is started by default. Meanwhile, the display interface of the electronic device displays a first interface, such as a photographing interface UI2 shown in (b) of fig. 1, which may include a preview area, a first control, a preview thumbnail, a plurality of photographing mode options, and the like.
The first control can be a shooting control for indicating the electronic equipment to shoot the image when shooting operation (such as operation of clicking the first control) is received.
Optionally, the embodiment of the application also provides other ways of running the camera application. For example, when the electronic device is in the locked-screen state, the user may instruct it to run the camera application by double-clicking a volume key. Alternatively, when the lock-screen interface includes an icon of the camera application, the user may instruct the device to run the camera application by clicking that icon. Alternatively, when another application that has permission to call the camera is running, the user may instruct the device to run the camera application by clicking the corresponding control in that application; for example, in an instant messaging application, the user can select the camera-function control to start the camera application.
It should be appreciated that the foregoing is illustrative of the operation of running the camera application, that the operation may also be indicated by voice, or other operation, that the electronic device is running the camera application, and that the application is not limited in any way. It should also be understood that running the camera application may refer to launching the camera application.
S102, responding to a first operation of the first control, and acquiring a face image.
For example, the first operation is for indicating start of shooting, and the first operation may be a click operation on a shooting control.
It will be appreciated that the first operation is described above as a click operation by way of example. In the embodiment of the application, the first operation may also be an operation that triggers shooting by voice, by face recognition, by gesture recognition, or by pressing a physical key (such as a volume key), and the application does not limit this in any way.
The camera application program starts shooting in response to various different types of first operations triggered by the user, namely, the electronic device collects an original image by using the camera. The camera can be any one of a main camera, a front camera, a long-focus camera, a wide-angle camera and the like. It should be understood that the embodiment of the present application does not limit this in any way as to the kind of camera.
The main camera has the characteristics of large light incoming amount, high resolution and moderate angle of view. The primary camera is typically the default camera for the electronic device. The long-focus camera has longer focal length and smaller field angle, and can be suitable for shooting a shot object far away from the electronic equipment, namely a far object. The wide-angle camera has a short focal length and a large field angle, and can be suitable for shooting a shot object which is close to the electronic equipment, namely a near object.
In one possible implementation, when the original image is an image captured in real time by a camera of the electronic device, the original image may be an RGB domain image, which refers to an image that is located in the RGB domain color space. The original image may also be a Raw domain image, where the Raw domain image refers to an image acquired in a Raw color space, that is, the Raw domain image refers to an image located in the Raw color space. It should be understood that the original image may be other color images such as a multispectral image, which is not limited by the embodiment of the present application.
In another possible implementation, the original image may also be an image pre-stored in the electronic device. For example, the user may store in advance an image captured by himself in a gallery of the electronic device, or an image downloaded from the internet, or an image received from another electronic device, which is not limited in any way by the embodiment of the present application.
The original image in the embodiment of the application may include a face. In one possible implementation, the original image containing a face may be used directly as the face image. In another possible implementation, the original image may be cropped to obtain the face image: the face region in the original image is identified and cropped, and the portion containing the face is retained as the face image.
It should be noted that the number of faces is not limited in the embodiment of the present application, for example, the original image may include one or more faces.
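The patent does not name a face detector; as an illustrative sketch under that assumption, a Haar-cascade detector can locate the face region and crop it with some context padding:

```python
import cv2

def crop_face(original_bgr, pad: float = 0.2):
    """Detect the largest face and crop it with context padding.
    The Haar cascade is an illustrative detector, not the patent's method."""
    det = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    gray = cv2.cvtColor(original_bgr, cv2.COLOR_BGR2GRAY)
    faces = det.detectMultiScale(gray, 1.1, 5)
    if len(faces) == 0:
        return None, None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep the largest face
    dx, dy = int(w * pad), int(h * pad)                  # add padding around the box
    x0, y0 = max(x - dx, 0), max(y - dy, 0)
    x1 = min(x + w + dx, original_bgr.shape[1])
    y1 = min(y + h + dy, original_bgr.shape[0])
    return original_bgr[y0:y1, x0:x1], (x0, y0, x1 - x0, y1 - y0)
```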
S103, processing the face image by using the trained inverse rendering model to obtain a face normal image, a face albedo image and an environment image.
The face normal image represents the normal features corresponding to the face in the face image, i.e., the normal directions of the face surface. It can be used to capture 3D information of the face, which is important for the subsequent analysis of the shape and structure of the face.
The face albedo image represents the albedo characteristics corresponding to the face in the face image, such as the color and texture of the face surface. It is important for subsequently analyzing the inherent properties of the face (e.g., skin tone, spots, wrinkles) accurately.
The environment image represents the content of the environment other than the person, or the influence of the illumination of the shooting environment.
Here, the normal feature is understood as the direction of the normal at each point of the uneven surface of an object; in the present application, it is the direction of the normal at each point of the face surface. The face normal image may encode the normal directions through the RGB color channels. For example, the RGB value of each pixel may encode the three components (X, Y, Z) of the normal vector: the red channel may represent the X component, the green channel the Y component, and the blue channel the Z component, so that the RGB value of each pixel corresponds to a three-dimensional vector indicating the normal direction at that point.
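As an illustration, this conventional mapping between normals and RGB values can be sketched as follows (a minimal example assuming unit normals with components in [-1, 1]; the exact encoding used by the model is not fixed by the present application):

```python
# Minimal sketch: encode (X, Y, Z) unit normals as RGB and back.
# Assumes normals is an (H, W, 3) float array with components in [-1, 1].
import numpy as np

def normals_to_rgb(normals: np.ndarray) -> np.ndarray:
    """Map each component from [-1, 1] to [0, 255]; R=X, G=Y, B=Z."""
    rgb = (normals + 1.0) * 0.5 * 255.0
    return np.clip(rgb, 0, 255).astype(np.uint8)

def rgb_to_normals(rgb: np.ndarray) -> np.ndarray:
    """Inverse mapping: recover approximate unit normals from RGB."""
    n = rgb.astype(np.float32) / 255.0 * 2.0 - 1.0
    return n / (np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8)
```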
The albedo feature refers to the ratio of the luminous flux scattered in all directions by a fully illuminated portion of an object's surface to the luminous flux incident on that surface. For a face image, the albedo feature is the ratio of the luminous flux scattered in all directions by the fully illuminated facial skin to the luminous flux incident on the facial skin. In effect, the albedo feature of a face tells us how the skin surface of the face reflects light.
For example, the trained inverse rendering model is used to process the face image; that is, it splits the face and the environment in the face image, resulting in a face normal image, a face albedo image, and an environment image. In the embodiment of the present application, these three images can be understood as the prior information corresponding to the face image. This prior information provides key information about the 3D structure, contours, color, texture, skin, ambient illumination, light and shadow, and so on of the face, which is important for the subsequent restoration and processing of the face image.
S104, carrying out enhancement processing on the face image, the face normal image, the face albedo image, and the environment image by using the trained face enhancement model to obtain a face target image.
Illustratively, the face image, the face normal image, the face albedo image, and the environment image are stitched together and then input into the trained face enhancement model for processing, obtaining the face target image. Because accurate prior information corresponding to the face image has been obtained through the inverse rendering model, providing accurate contours, rich textures, true colors, a clear 3D structure, accurate ambient illumination conditions, and the like, the trained face enhancement model can make full use of this prior information during enhancement. It can effectively restore the details of the face image, eliminate its smearing, artifacts, pseudo-textures, and the like, and improve its sharpness and quality.
The face albedo image reflects the skin information of the face; it contains no unnecessary textures and is not disturbed by factors such as illumination changes and shadows. When the face enhancement model uses the face albedo image during enhancement, it can effectively enhance the details in the face image and improve its quality. This enhancement not only improves the visual clarity of the face image but also maintains the authenticity and naturalness of the face. The enhanced face image is therefore sharp and rich in detail, free of problems such as strong smearing, artifacts, and pseudo-textures, giving the user a better shooting experience.
Optionally, the sharpness of the face target image is greater than the sharpness of the face image.
Optionally, the face target image is rich in detail and free of problems such as strong smearing, artifacts, and pseudo-textures.
According to the image processing method provided by the embodiment of the present application, the face and the environment in the face image are split by the trained inverse rendering model into the face normal image, the face albedo image, and the environment image, yielding accurate prior information. The trained face enhancement model makes full use of this prior information during enhancement, effectively restoring the details of the face image, eliminating its smearing, artifacts, pseudo-textures, and the like, and improving its sharpness and quality.
Referring to fig. 4, fig. 4 is a flowchart illustrating a method for acquiring a face image according to an exemplary embodiment of the present application. The method comprises the following steps:
S1021, in response to a first operation on the first control, acquiring an original image.
For example, the first operation may be a tap on the shooting control; in response to the various types of first operations triggered by the user, the camera application starts shooting, that is, the electronic device captures the original image with the camera.
S1022, determining the sharpness of the original image.
The sharpness of the original image refers to the clarity of the details in the image, and may include the sharpness of edges, the clarity of textures, and so on. It should be understood that an original image with high sharpness has rich details and distinct face edges.
In one example, edge information in the original image may be detected by an edge detection algorithm; the edge information may include the strength and number of edges, and the sharpness of the original image may be evaluated by counting them. The edge detection algorithm may be the Sobel operator algorithm, the Canny operator algorithm, or the like.
In another example, the original image may be converted to the frequency domain, for example by a Fourier transform, and the high-frequency components of the converted image analyzed. It will be appreciated that the more high-frequency components there are, the higher the sharpness of the original image; conversely, the fewer the high-frequency components, the lower the sharpness.
In yet another example, an image quality assessment index (e.g., sharpness index), edge contrast, etc. may be employed to quantify the sharpness of the original image.
In yet another example, the normalized root mean square similarity (Normalized Root Mean Square Similarity, NRSS) value corresponding to the original image may be determined.
NRSS is a no-reference image quality evaluation method for evaluating the quality of the original image, and the NRSS value is the quality score of the original image computed by this method. The higher the NRSS value, the better the quality of the original image, i.e., the clearer and less blurred it is; the lower the NRSS value, the worse the quality, i.e., the more blurred it is.
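As a rough illustration of the first two approaches, a sharpness score can be derived from the edge strength or from the share of high-frequency spectral energy (a minimal sketch using OpenCV and NumPy; the operators and the cutoff radius are illustrative assumptions, not the method fixed by the present application):

```python
# Minimal sketch: two simple sharpness cues for a grayscale image.
import cv2
import numpy as np

def edge_sharpness(gray: np.ndarray) -> float:
    """Mean Sobel gradient magnitude: larger means stronger edges."""
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    return float(np.mean(np.hypot(gx, gy)))

def high_freq_ratio(gray: np.ndarray, radius: int = 30) -> float:
    """Share of spectral energy outside a low-frequency disc."""
    f = np.fft.fftshift(np.fft.fft2(gray.astype(np.float64)))
    energy = np.abs(f) ** 2
    h, w = gray.shape
    yy, xx = np.ogrid[:h, :w]
    low = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    return float(energy[~low].sum() / energy.sum())
```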
S1023, when the sharpness of the original image is detected to be less than a preset sharpness threshold, cropping the original image to obtain a face image.
For example, if the sharpness of the original image is detected to be greater than or equal to the preset sharpness threshold, the original image is sufficiently sharp: the face details are rich, there are no problems such as strong smearing, artifacts, or pseudo-textures, and no enhancement processing is needed. If the sharpness of the original image is detected to be less than the preset sharpness threshold, the original image is of low sharpness: the face may lack detail and may suffer from strong smearing, artifacts, pseudo-textures, and the like, so enhancement processing is needed.
It will be appreciated that the magnitude of the preset sharpness threshold may be set and modified as desired, and the embodiments of the present application do not limit this in any way.
Illustratively, when the sharpness of the original image is detected to be less than the preset sharpness threshold, a face region is located (or identified) in the original image. For example, a face detection algorithm (e.g., a deep learning model or a facial feature classifier) may be used to detect the face region and its bounding box in the original image. The region within the bounding box is then cropped from the original image to obtain the face image.
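For instance, the detection-and-cropping step might look as follows (a sketch only; the present application does not fix a particular detector, so an off-the-shelf OpenCV Haar cascade is assumed here as a stand-in):

```python
# Minimal sketch: locate a face bounding box and crop it out.
import cv2

def crop_face(original_bgr):
    gray = cv2.cvtColor(original_bgr, cv2.COLOR_BGR2GRAY)
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                    # no face found: skip enhancement
    x, y, w, h = faces[0]              # first detected bounding box
    return original_bgr[y:y + h, x:x + w]
```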
Referring to fig. 5, fig. 5 is a schematic view of a face image according to an exemplary embodiment of the present application. As shown in fig. 5, the sharpness of the original image is detected to be less than the preset sharpness threshold, and the original image on the left is cropped to obtain the face image on the right. It should be noted that the diagonal stripes in both the original image and the face image are used to represent the environment.
In this implementation, the original images are effectively screened by sharpness, so that the determined face images are exactly those that need enhancement processing. On the one hand, processing only the face images that need enhancement improves image processing efficiency; on the other hand, it avoids the artifacts and pseudo-textures that would arise from enhancing face images that do not need it, saving the resources of the electronic device and reducing power consumption.
Referring to fig. 6, fig. 6 is a flowchart illustrating another image processing method according to an exemplary embodiment of the application. The method comprises the following steps:
S201, acquiring a face image.
For example, reference may be made to the foregoing description of acquiring a face image, which is not repeated here.
It should be noted that the number of acquired face images may be one or more frames, which is not limited here; the embodiment of the present application is described taking the acquisition of one frame of face image as an example.
S202, processing the face image by using the trained inverse rendering model to obtain a face normal image, a face albedo image and an environment image.
The trained inverse rendering model is used to split the face and the environment in the face image, obtaining a face normal image, a face albedo image, and an environment image. It is obtained by first training an initial inverse rendering network on a first training set to obtain an inverse rendering model in training, and then training the in-training model with a second training set. The first training set includes high-definition sample images, and the second training set includes low-definition sample images degraded from the high-definition sample images. The training process of the inverse rendering model is described in detail later and is omitted here for the moment.
The trained inverse rendering model may include a first network, a second network, and a third network. Optionally, the first network may be a vector quantized variational autoencoder (Vector Quantized Variational Autoencoder, VQVAE) and/or the second network may be a VQVAE.
Alternatively, the third network may be VQVAE or U-Net.
VQVAE is a model combining the variational autoencoder (Variational Autoencoder, VAE) with vector quantization (Vector Quantized, VQ). The VQ technique maps a continuous vector space onto a discrete codebook, achieving compression and discretization of the image data. Specifically, VQ defines a fixed set of vectors as a codebook; for each input vector, VQ finds the closest vector in the codebook and maps to it, thereby reducing the dimensionality of the image data and improving computational efficiency.
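The core lookup can be sketched as follows (a minimal PyTorch illustration under assumed shapes; the actual codebook sizes and training details are not specified here):

```python
# Minimal sketch: replace each continuous latent vector by its nearest
# codebook entry (Euclidean distance), with a straight-through gradient.
import torch

def quantize(latents: torch.Tensor, codebook: torch.Tensor):
    """latents: (N, D) continuous vectors; codebook: (K, D) entries."""
    dists = torch.cdist(latents, codebook)   # (N, K) pairwise distances
    indices = dists.argmin(dim=1)            # nearest entry per vector
    quantized = codebook[indices]            # (N, D) discrete latents
    # Straight-through estimator so gradients flow back to the encoder.
    quantized = latents + (quantized - latents).detach()
    return quantized, indices
```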
In the embodiment of the present application, the VQ technique is introduced into the inverse rendering model, i.e., into each network contained in it, and the benefits are manifold. First, VQ can extract features from the face image and quantize them into a set of discrete codes, reducing the dimensionality of the feature space and improving the computational efficiency of the inverse rendering model. Second, vector quantization helps strengthen the robustness of the feature representation, improving the performance of the inverse rendering model. In addition, applying VQ in the inverse rendering model can improve the reconstruction and enhancement of the face image, making the generated image more natural while retaining more detail. Therefore, processing the face image with the inverse rendering model not only improves image processing efficiency but also extracts accurate prior information, helping the subsequent face enhancement model use this prior effectively, avoid pseudo-textures, artifacts, unnatural effects, and the like, and greatly improve image quality.
The inverse rendering model is described in detail below, taking the first network as a VQVAE, the second network as a VQVAE, and the third network as a U-Net.
Referring to fig. 7, fig. 7 is a schematic diagram of a trained inverse rendering model structure according to an exemplary embodiment of the present application. As shown in fig. 7, the inverse rendering model provided in the embodiment of the present application includes three networks connected to each other, and the three networks are respectively referred to as a first network, a second network, and a third network.
Illustratively, the first network is connected to the second network and the third network, and the second network is connected to the third network. The first network comprises an input end and an output end; the second network comprises a first input end, a second input end, and an output end; and the third network comprises a first input end, a second input end, a third input end, and an output end.
The input end of the first network, the first input end of the second network, and the first input end of the third network all serve as input ends of the inverse rendering model. The output end of the first network is connected to the second input end of the second network and to the second input end of the third network. The output end of the second network is connected to the third input end of the third network. The output end of the third network serves as an output end of the inverse rendering model, and the output ends of the first network and the second network also serve as output ends of the inverse rendering model.
Based on the structure of the inverse rendering model shown in fig. 7, the face image is input at the input end of the first network, which processes it and outputs the face normal image at its output end. The face image is input at the first input end of the second network and the face normal image at its second input end; the second network processes them and outputs the face albedo image at its output end. The face image is input at the first input end of the third network, the face normal image at its second input end, and the face albedo image at its third input end; the third network processes the three and outputs the environment image at its output end.
Optionally, when the face image is input into the inverse rendering model, the input end of the first network, the first input end of the second network, and the first input end of the third network each receive the face image. The first network determines the face normal image corresponding to the face image, so its output end provides the face normal image to the second input end of the second network and to the second input end of the third network. The second network determines the face albedo image corresponding to the face image from the face image and the face normal image, so its output end provides the face albedo image to the third input end of the third network. The third network determines the environment image corresponding to the face image from the face image, the face normal image, and the face albedo image. The face normal image, the face albedo image determined by the second network, and the environment image determined by the third network are then output as the outputs of the inverse rendering model.
It should be understood that the foregoing is merely a schematic structural diagram of an inverse rendering model, and the embodiments of the present application are not limited in any way.
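The data flow of fig. 7 can be summarized in a few lines (a minimal sketch; net1, net2, and net3 are placeholders for the trained first, second, and third networks, and channel-wise concatenation is assumed as the stitching operation):

```python
# Minimal sketch: forward pass of the three-network inverse rendering model.
import torch

def inverse_render(face, net1, net2, net3):
    normal = net1(face)                                   # face normal image
    albedo = net2(torch.cat([face, normal], dim=1))       # face albedo image
    env = net3(torch.cat([face, normal, albedo], dim=1))  # environment image
    return normal, albedo, env
```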
Referring to fig. 8, fig. 8 is a schematic view of a face image, a face normal image, a face albedo image, and an environment image according to an exemplary embodiment of the present application. Illustratively, the inverse rendering model splits the face and the environment in the face image, resulting in the face normal image, face albedo image, and environment image shown in fig. 8. It should be noted that the dotted lines in the face normal image, the coarse oblique lines in the face albedo image, and the oblique lines in the environment image are only used to distinguish the three images; they do not limit the face normal image, face albedo image, and environment image actually output by the inverse rendering model.
It should be appreciated that, in general, the background of the face normal image actually output by the inverse rendering model is gray, and the face portion shows color variation representing the directions of the normals of the face surface: red represents the X-axis component of the normal, green the Y-axis component, and blue the Z-axis component.
In the face albedo image actually output by the inverse rendering model, the background is black and the face portion is clearly visible; it accurately reflects the skin information of the face, contains no unnecessary textures, and is not disturbed by factors such as illumination changes and shadows.
The environment image actually output by the inverse rendering model shows the environment and the illumination conditions of the subject at the shooting moment.
The processing procedure of the inverse rendering model is described in detail below with reference to the respective images shown in fig. 8, and the structure of the inverse rendering model.
Referring to fig. 9, fig. 9 is a schematic diagram illustrating an inverse rendering model processing flow according to an exemplary embodiment of the present application. The process comprises the following steps:
S2021, processing the face image by using the first network to obtain a face normal image.
In one example, the first network is used to extract a first latent vector from the face image, a vector close to the first latent vector is found in a first set of discrete vectors, and the face normal image is generated from that vector.
As shown in fig. 9, the first network may include a first encoder, a first vector codebook module, and a first decoder. The first encoder is configured to map the input face image to a continuous latent space; in other words, it extracts latent vectors from the face image. For example, the first encoder extracts features (such as edge features, contour features, and geometric features) from the face image and encodes them into one continuous latent vector, i.e., the first latent vector. The first latent vector may include the 3D structural information and geometric information corresponding to the face.
The first vector codebook module is obtained, during training of the inverse rendering model, by having the in-training model extract latent vectors from the face normal images corresponding to high-definition sample images. The first vector codebook module may include a first set of discrete vectors; in other words, the first set of discrete vectors consists of latent vectors from the face normal images corresponding to a plurality of high-definition sample images. The continuous first latent vector extracted by the first encoder can be quantized into a discrete latent vector by the first vector codebook module.
Illustratively, given the first latent vector extracted by the first encoder, a vector close to the first latent vector, i.e., the closest vector, is found in the first set of discrete vectors. For example, the distance (e.g., the Euclidean distance) between the first latent vector and each vector in the first set of discrete vectors is computed, and the vector with the smallest distance is taken as the vector close to the first latent vector.
The first decoder is configured to map the quantized latent vector (i.e., the vector found in the first set of discrete vectors that is close to the first latent vector) back to the face image space; the purpose of this process is to reconstruct, from the quantized latent vector, an image as close as possible to the original face image. Illustratively, the vector close to the first latent vector is used as the input of the first decoder, which maps it back to the face image space; in this mapping process, the first decoder generates the face normal image.
Referring to fig. 10, fig. 10 is a schematic diagram of a first network structure according to an exemplary embodiment of the present application. As shown in fig. 10, the first network includes a first encoder, a first vector codebook module, and a first decoder. The first encoder is used to extract the first latent vector from the face image and input it into the first vector codebook module. The first vector codebook module includes the first set of discrete vectors, also referred to in this embodiment of the present application as the first codebook (Codebook), which consists of latent vectors from the face normal images corresponding to a plurality of high-definition sample images; in fig. 10 these are represented by rectangular boxes of different forms and numbered 0, 1, ..., 8.
The first vector codebook module may further include a first matching module (Matching Block). The Matching Block compares the first latent vector with each vector in the first set of discrete vectors and determines the closest vectors in the set, such as the vectors 3, 7, 6, 2, 4, 3, 1, 5, 0 in the nine-square grid shown in fig. 10. Based on these vectors, the first vector codebook module obtains the quantized latent vectors and outputs them to the first decoder. The first decoder then maps the quantized latent vectors back to the face image space, generating the face normal image in the process.
In this implementation, vector quantization through the VQVAE effectively maps information from the continuous latent space into a discrete space, which helps improve the computational efficiency and generalization ability of the first network. In this way, the first network can extract rich 3D structural information and geometric information from the input face image, laying a foundation for the subsequent enhancement of the face image and helping to restore the 3D structural features and geometric features of the face more accurately during enhancement.
Alternatively, in another example, the first network may be a CNN. The CNN extracts edge features, contour features, geometric features, and the like from the face image; estimates the depth of each point on the face surface; computes the normal vector at each point from the estimated depth; converts the computed normal vectors into RGB channel values for visualization; and combines the RGB channel values into the face normal image.
S2022, processing the face image and the face normal image by using a second network to obtain a face albedo image.
In one example, the face image and the face normal image are stitched to obtain a first stitched image, the second network is used to extract a second latent vector from the first stitched image, a vector close to the second latent vector is found in a second set of discrete vectors, and the face albedo image is generated from that vector.
As shown in fig. 9, the second network may include a second encoder, a second vector codebook module, and a second decoder. The face image and the face normal image serve as the input of the second network; specifically, they are stitched into the first stitched image before being input. For example, the RGB channels of the face image and the RGB channels of the face normal image are stitched to obtain a new six-channel image, i.e., the first stitched image.
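Channel-wise stitching of this kind is a simple concatenation (a minimal sketch assuming NCHW tensors of an illustrative 512x512 size):

```python
# Minimal sketch: stitch a 3-channel face image and a 3-channel face
# normal image into one 6-channel input.
import torch

face = torch.rand(1, 3, 512, 512)    # placeholder face image
normal = torch.rand(1, 3, 512, 512)  # placeholder face normal image
stitched = torch.cat([face, normal], dim=1)
print(stitched.shape)                # torch.Size([1, 6, 512, 512])
```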
The first stitched image is then input to the second encoder, which is configured to map it to a continuous latent space; in other words, the second encoder extracts latent vectors from the first stitched image. For example, the second encoder extracts features (e.g., texture features and skin features) from the first stitched image and encodes them into one continuous latent vector, i.e., the second latent vector. The second latent vector may include the texture information and skin information corresponding to the face.
The second vector codebook module is obtained, during training of the inverse rendering model, by having the in-training model extract latent vectors from the face albedo images corresponding to high-definition sample images. The second vector codebook module may include a second set of discrete vectors; in other words, the second set of discrete vectors consists of latent vectors from the face albedo images corresponding to a plurality of high-definition sample images. The continuous second latent vector extracted by the second encoder can be quantized into a discrete latent vector by the second vector codebook module.
Illustratively, given the second latent vector extracted by the second encoder, a vector close to the second latent vector, i.e., the closest vector, is found in the second set of discrete vectors. For example, the distance (e.g., the Euclidean distance) between the second latent vector and each vector in the second set of discrete vectors is computed, and the vector with the smallest distance is taken as the vector close to the second latent vector.
The second decoder is configured to map the quantized vector (i.e., the vector found in the second set of discrete vectors that is close to the second latent vector) back to the face image space. Illustratively, the vector close to the second latent vector is input to the second decoder, which maps it back to the face image space; in this mapping process, the second decoder generates the face albedo image.
Referring to fig. 11, fig. 11 is a schematic diagram of a second network structure according to an exemplary embodiment of the present application. As shown in fig. 11, the second network includes a second encoder, a second vector codebook module, and a second decoder. The second encoder is used to extract the second latent vector from the first stitched image and input it into the second vector codebook module. The second vector codebook module includes the second set of discrete vectors, also referred to in this embodiment of the present application as the second codebook (Codebook), which consists of latent vectors from the face albedo images corresponding to a plurality of high-definition sample images; in fig. 11 these are represented by rectangular boxes of different forms and numbered 0, 1, ..., 8.
The second vector codebook module may further include a second matching module (Matching Block). The second Matching Block compares the second latent vector with each vector in the second set of discrete vectors and determines the closest vectors in the set, such as the vectors 4, 8, 7, 3, 5, 4, 2, 0, 5 in the nine-square grid shown in fig. 11. Based on these vectors, the second vector codebook module obtains the quantized latent vectors and outputs them to the second decoder; for example, the original second latent vector is replaced with the closest vector, yielding the quantized latent vector. The second decoder then maps the quantized latent vectors back to the face image space, generating the face albedo image in the process.
In this implementation, vector quantization through the VQVAE effectively maps information from the continuous latent space into a discrete space, which helps improve the computational efficiency and generalization ability of the second network. In this way, the second network can extract accurate texture information and skin information from the input face image and face normal image, laying a foundation for the subsequent enhancement of the face image: it helps restore the texture features and skin features of the face more accurately during enhancement, improves the detail and realism of the face image, and avoids generating pseudo-textures, artifacts, unnatural effects, and the like.
Alternatively, in another example, the second network may be a U-Net. The U-Net progressively extracts texture features, skin features, and the like from the first stitched image through multiple convolutional layers. In the encoder part of the U-Net, pooling layers perform downsampling, reducing the size of the first stitched image and enlarging the receptive field to capture more local information of the face. In the decoder part, upsampling gradually restores the spatial dimensions of the image, and at each upsampling step, skip connections combine the encoder features with the decoder features so that more detail information is retained. In the final stage of the decoder, the multi-channel feature map output by the decoder is converted into the final face albedo image, as sketched below.
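The following is a deliberately tiny U-Net sketch of that encoder-skip-decoder pattern (channel counts, depth, and the 6-channel input are illustrative assumptions, not the architecture fixed by the present application):

```python
# Minimal two-level U-Net sketch: conv blocks, pooled downsampling,
# upsampling with a skip connection, and a 1x1 head to the output image.
import torch
import torch.nn as nn

def block(cin: int, cout: int) -> nn.Sequential:
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU())

class TinyUNet(nn.Module):
    def __init__(self, in_ch: int = 6, out_ch: int = 3):
        super().__init__()
        self.enc1, self.enc2 = block(in_ch, 32), block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)
        self.dec = block(64 + 32, 32)         # concat enables the skip
        self.head = nn.Conv2d(32, out_ch, 1)  # e.g., face albedo image

    def forward(self, x):
        e1 = self.enc1(x)                        # full-resolution features
        e2 = self.enc2(self.pool(e1))            # downsampled, larger RF
        d = self.up(e2)                          # restore spatial size
        d = self.dec(torch.cat([d, e1], dim=1))  # skip connection
        return self.head(d)
```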
S2023, processing the face image, the face normal image and the face albedo image by using a third network to obtain an environment image.
In one example, the third network may be a U-Net. The face image, the face normal image, and the face albedo image serve as the input of the third network; specifically, they are stitched into a target stitched image. For example, the RGB channels of the face image, the RGB channels of the face normal image, and the RGB channels of the face albedo image are stitched to obtain a new nine-channel image, i.e., the target stitched image.
The U-Net progressively extracts ambient illumination features, light and shadow features, and the like from the target stitched image through multiple convolutional layers. In the encoder part of the U-Net, pooling layers perform downsampling, reducing the size of the target stitched image and enlarging the receptive field to capture more ambient illumination information and light and shadow information. In the decoder part, upsampling gradually restores the spatial dimensions of the image, and at each upsampling step, skip connections combine the encoder features with the decoder features so that more ambient illumination and light and shadow information is retained. In the final stage of the decoder, the multi-channel feature map output by the decoder is converted into the final environment image.
In another example, the third network may be VQVAE. The process of processing the face image, the face normal image, and the face albedo image to obtain the environment image through VQVAE may refer to the description in step S2022, which is not repeated here.
S203, carrying out enhancement processing on the face image, the face normal image, the face albedo image, and the environment image by using the trained face enhancement model to obtain a face target image.
The trained face enhancement model is obtained by first training an initial face enhancement network on a third training set to obtain a face enhancement model in training, and then training the in-training model with a fourth training set. The third training set includes high-definition sample images; the fourth training set includes low-definition sample images and the normal sample images, albedo sample images, and environment sample images corresponding to them. The normal sample images and albedo sample images are obtained by processing the low-definition sample images with the trained inverse rendering model, and the environment sample images are obtained by simulation with the trained light and shadow enhancement model. The training process of the face enhancement model is described in detail later and is omitted here for the moment.
The face enhancement model may be VQVAE or U-Net, and in the embodiment of the present application, VQVAE is taken as an example for illustration.
It will be appreciated that the benefits of introducing the VQ technique into the face enhancement model are manifold. First, VQ can extract features from the face image, face normal image, face albedo image, and environment image and quantize them into a set of discrete codes, reducing the dimensionality of the feature space and improving the computational efficiency of the face enhancement model. Second, vector quantization helps strengthen the robustness of the feature representation, improving the performance of the face enhancement model. In addition, applying VQ in the face enhancement model can improve the reconstruction and enhancement of the face image, making the enhanced image more natural while retaining more detail. Therefore, enhancing the face image with the face enhancement model not only improves image processing efficiency but also effectively uses the accurate prior information extracted by the inverse rendering model, avoiding pseudo-textures, artifacts, unnatural effects, and the like, and greatly improving image quality.
Referring to fig. 12, fig. 12 is a schematic diagram illustrating a face enhancement model processing flow according to an exemplary embodiment of the present application. The process comprises the following steps:
S2031, stitching the face image, the face normal image, the face albedo image, and the environment image to obtain a second stitched image.
The face image, the face normal image, the face albedo image, and the environment image serve as the input of the face enhancement model; specifically, they are stitched into the second stitched image before being input. For example, the RGB channels of the face image, the face normal image, the face albedo image, and the environment image are stitched to obtain a new twelve-channel image, i.e., the second stitched image. The second stitched image is then input into the face enhancement model for processing.
S2032, extracting a third latent vector from the second stitched image by using the trained face enhancement model.
Illustratively, the trained face enhancement model may include a third encoder, a third vector codebook module, and a third decoder. The second stitched image is input to the third encoder, which is configured to map it to a continuous latent space; in other words, the third encoder extracts latent vectors from the second stitched image. For example, the third encoder extracts features (such as edge, contour, geometric, texture, skin, ambient illumination, and light and shadow features) from the second stitched image and encodes them into one continuous latent vector, i.e., the third latent vector. The third latent vector may include the 3D structural information, geometric information, texture information, skin information, ambient illumination information, light and shadow information, and the like corresponding to the face.
S2033, searching for a vector close to the third latent vector in a third set of discrete vectors.
The third vector codebook module is obtained, during training of the face enhancement model, by having the in-training model extract the latent vectors corresponding to the faces in high-definition sample images. The third vector codebook module may include a third set of discrete vectors; in other words, the third set of discrete vectors consists of latent vectors corresponding to the faces in a plurality of high-definition sample images. The continuous third latent vector extracted by the third encoder can be quantized into a discrete latent vector by the third vector codebook module.
Illustratively, given the third latent vector extracted by the third encoder, a vector close to the third latent vector, i.e., the closest vector, is found in the third set of discrete vectors. For example, the distance (e.g., the Euclidean distance) between the third latent vector and each vector in the third set of discrete vectors is computed, and the vector with the smallest distance is taken as the vector close to the third latent vector.
S2034, generating the face target image according to the vector close to the third latent vector.
The third decoder is configured to map the quantized vector (i.e., the vector found in the third set of discrete vectors that is close to the third latent vector) back to the face image space. Illustratively, the vector close to the third latent vector is used as the input of the third decoder, which maps it back to the face image space; in this mapping process, the third decoder generates the face target image.
Referring to fig. 13, fig. 13 is a schematic diagram illustrating a face enhancement model according to an exemplary embodiment of the present application. As shown in fig. 13, the face enhancement model includes a third encoder, a third vector codebook module, and a third decoder. The third encoder is used to extract the third latent vector from the second stitched image and input it into the third vector codebook module. The third vector codebook module includes the third set of discrete vectors, also referred to in this embodiment of the present application as the third codebook (Codebook), which consists of latent vectors corresponding to the faces in a plurality of high-definition sample images; in fig. 13 these are represented by rectangular boxes of different forms and numbered 0, 1, ..., 8.
The third vector codebook module may further include a third matching module (Matching Block). The third Matching Block compares the third latent vector with each vector in the third set of discrete vectors and determines the closest vectors in the set, such as the vectors 2, 6, 5, 1, 3, 8, 0, 7, 3 in the nine-square grid shown in fig. 13. Based on these vectors, the third vector codebook module obtains the quantized latent vectors and outputs them to the third decoder. The third decoder then maps the quantized latent vectors back to the face image space, generating the face target image in the process.
In this implementation, the face enhancement model can obtain accurate 3D structural information, geometric information, texture information, skin information, ambient illumination information, light and shadow information, and the like from the prior information. Even when the quality of the face image is poor and the light and shadow on the face are complex, this information can correctly guide the restoration of the 3D structure, geometry, texture, skin, ambient illumination, light and shadow, and other characteristics of the face image, while effectively avoiding the generation of pseudo-textures and pseudo-shadows. This improves image quality and the robustness of the face enhancement model, giving the model wider applicability.
S204, stitching the original image and the face target image to obtain a high-definition image corresponding to the original image.
For example, the original image and the face target image may be stitched using a pre-prepared mask, which is a binary image indicating which regions of the original image should be replaced by the face target image. It will be appreciated that, in the mask, the white portions typically represent the regions taken from the face target image, and the black portions represent the regions where the original image should be kept. The face target image is processed with the mask, for example by a bitwise AND operation, ensuring that the pixels under the white portion of the mask are preserved.
The processed face target image is then fused with the original image, for example by a bitwise OR operation: the original image serves as the base, and the corresponding portions of the face target image are merged in through the mask. The fused result is the high-definition image corresponding to the original image.
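The same compositing can be written as a blend (a minimal sketch; a soft float mask in [0, 1] is assumed here instead of the binary bitwise operations described above, which it generalizes):

```python
# Minimal sketch: mask-based fusion of the face target image into the
# original image (1 = take face target image, 0 = keep original).
import numpy as np

def fuse(original: np.ndarray, face_target: np.ndarray,
         mask: np.ndarray) -> np.ndarray:
    """original, face_target: (H, W, 3) uint8; mask: (H, W, 1) float."""
    out = face_target * mask + original * (1.0 - mask)
    return out.astype(np.uint8)
```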
Optionally, after the processed face target image is fused with the original image, some detail adjustment, such as color balancing and edge blending, can be performed on the stitching region to ensure a natural stitching effect.
In this implementation, enhancement processing is applied specifically to the face image. On the one hand, this effectively restores the details of the face region, eliminates the smearing, artifacts, pseudo-textures, and the like of the face image, and ensures a marked improvement in the image quality of the face region. On the other hand, enhancing the face image alone and then stitching the original image with the face target image to obtain the high-definition image effectively improves image processing efficiency and reduces the resource consumption of the electronic device. In addition, users usually pay the most attention to the faces in captured images, so the image processing method provided by the application can markedly improve the image quality of the face region while increasing the image processing speed, greatly improving the user's satisfaction.
Optionally, in one possible implementation, the image processing method provided by the embodiment of the present application may also determine a light supplementing environment image corresponding to the environment image by using the trained light and shadow enhancement model, and then carry out enhancement processing on the face image, the face normal image, the face albedo image, and the light supplementing environment image by using the trained face enhancement model to obtain the face target image.
The trained light and shadow enhancement model may be trained based on a U-Net model and can simulate the light and shadow conditions of a face under various illumination environments, so it can be used to supplement light to the environment image and generate the light supplementing environment image. For example, portrait pose estimation is performed on the original image to determine the portrait pose, i.e., the angle information of the face orientation, which may be represented by a rotation matrix, a rotation vector, a quaternion, Euler angles (yaw, roll, pitch), and so on. Combining the portrait pose with the position of the light source shown in the environment image, a light supplementing position is determined in the environment image, and light is supplemented at that position to obtain the light supplementing environment image.
Alternatively, in one possible implementation, the original environment image may be replaced with a new environment image, which is taken as the light supplementing environment image.
Then, the face image, the face normal image, the face albedo image, and the light supplementing environment image are stitched to obtain a third stitched image, the trained face enhancement model is used to extract a fourth latent vector from the third stitched image, a vector close to the fourth latent vector is found in the third set of discrete vectors, and the face target image is generated from that vector. For the specific implementation, refer to the descriptions in steps S2031 to S2034 above, which are not repeated here.
In this implementation, the environment image is intelligently light-supplemented by the light and shadow enhancement model, and the face enhancement model then refines the face image based on the light-supplemented environment image. This processing not only enhances the details of the face but also achieves a comprehensive light supplementing effect on the face and its surroundings, ensuring a balance between the face and the ambient light and helping to generate a face target image with higher sharpness and stronger contrast.
Optionally, in one possible implementation, after the light supplementing environment image is determined, the trained light and shadow enhancement model may further be used to render the face normal image, the face albedo image, and the light supplementing environment image, generating a light supplementing face image. It should be understood that, from the lighting in the light supplementing environment image, the normal directions in the face normal image, and the face albedo image, the shadows cast by the face under the supplemented lighting can be computed and shown in the rendered light supplementing face image. For example, when the light comes from the right, the brightness of the right side of the face increases while the left side remains relatively dark, and the shadow cast into the environment on the left side of the face appears in the light supplementing face image.
It should be noted that rendering in the embodiment of the present application refers to the process of converting the three-dimensional light transport process into a two-dimensional image.
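A simple way to picture this rendering step is Lambertian shading, in which the shaded color is the albedo scaled by the cosine between the surface normal and the light direction (a deliberately simplified sketch; the actual light and shadow enhancement model is learned and is not limited to this formula):

```python
# Minimal sketch: relight a face from its albedo and normal images
# under a single directional light.
import numpy as np

def lambertian_render(albedo: np.ndarray, normals: np.ndarray,
                      light_dir: np.ndarray) -> np.ndarray:
    """albedo: (H, W, 3) in [0, 1]; normals: (H, W, 3) unit vectors;
    light_dir: (3,) unit vector pointing toward the light source."""
    cos_term = np.clip(normals @ light_dir, 0.0, None)  # (H, W) shading
    return albedo * cos_term[..., None]                 # dark where unlit
```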
Then, the trained face enhancement model is used to carry out enhancement processing on the light supplementing face image, the face normal image, the face albedo image, and the light supplementing environment image, obtaining the face target image.
In this implementation, the trained light and shadow enhancement model achieves light supplementing without any additional light supplementing device on the electronic device, and the user does not need to manually select a light supplementing region during shooting, which improves the user experience. Moreover, the light supplementing processing can be applied to the face image either during shooting or after shooting is completed, offering greater flexibility. In addition, compared with a traditional external light source or manual selection of a light supplementing region, rendering the face normal image, the face albedo image, and the light supplementing environment image with the light and shadow enhancement model produces a light supplementing face image that better matches the illumination conditions of the actual environment, so the light supplementing effect is more natural and realistic.
It should be noted that, before the inverse rendering model is used, the inverse rendering model needs to be trained and generated, and a detailed description is given below of a training process of the inverse rendering model provided by the embodiment of the present application with reference to the accompanying drawings.
Referring to fig. 14, fig. 14 is a schematic flow chart of training an inverse rendering model according to an exemplary embodiment of the present application, which is specifically described below.
S301, acquiring a first training set.
The first training set includes high-definition sample images. Each high-definition sample image contains a face, has high sharpness and rich detail, and is free of quality problems such as smearing, artifacts, and pseudo-textures.
The high-definition sample images may come from various sources, such as high-definition images containing faces pre-captured by the camera of the electronic device, downloaded from the Internet, or received from other electronic devices.
It is worth noting that the number of high-definition sample images may affect the training effect and training speed of the inverse rendering model. In the embodiment of the present application, the number may be set reasonably to achieve the best balance between accuracy and training speed, and the specific number is not limited.
S302, training the initial inverse rendering network by using the first training set to obtain an inverse rendering model in training.
The high-definition sample images in the first training set are cropped to obtain high-definition face images. A high-definition face image is input into the initial inverse rendering network, which splits the face and the environment in the image to obtain a high-definition face normal image, a high-definition face albedo image, and a high-definition environment image.
The structure of the initial inverse rendering network is consistent with that of the trained inverse rendering model, so that the process of processing the high-definition face image by the initial inverse rendering network can refer to the process of processing the face image by using the trained inverse rendering model, and details are not repeated here.
In the process of training the initial inverse rendering network with the first training set, the network can learn to accurately extract the face normal image, the face albedo image, and the environment image from a high-definition sample image. Training the initial inverse rendering network on the high-definition sample images lets its vector codebook modules learn the prior information corresponding to the high-definition sample images, such as the latent vectors corresponding to the high-definition face normal images, the high-definition face albedo images, and the high-definition environment images. The vector codebook modules in the initial inverse rendering network are then fixed, yielding the inverse rendering model in training.
Optionally, the training data may also be enriched, i.e., the illumination scenes of the high-definition sample images may be diversified. Because the illumination scenes of high-definition sample images are generally simple, the light and shadow on the faces are not complex enough, and the trained inverse rendering model would be unable to accurately handle the varied light and shadow changes of real, complex illumination scenes, producing artifacts, pseudo-textures, and the like. Therefore, in the embodiment of the present application, a light and shadow enhancement technique may be introduced when retraining the inverse rendering model; for example, the color, brightness, and the like of specific regions of the face may be adjusted to generate more high-definition sample images containing complex light and shadow effects.
Optionally, the high-definition sample images may also be enriched using the light and shadow enhancement model. For example, a large number of varied illumination environment images are collected in advance from various sources. An original high-definition face image is processed by the inverse rendering model in training to obtain a high-definition face normal image and a high-definition face albedo image, and the light and shadow enhancement model renders the high-definition face normal image and the high-definition face albedo image with the various illumination environment images, simulating the light and shadow on the face under those illumination environments and thereby generating high-definition face images under various illumination environments. The inverse rendering model in training is then trained with these images, so that it learns how to accurately handle varied light and shadow changes in complex illumination scenes without generating artifacts and pseudo-textures, improving the robustness and adaptability of the inverse rendering model.
Referring to fig. 15, fig. 15 is a schematic diagram of enriching high-definition sample images with the light and shadow enhancement model according to an exemplary embodiment of the present application. The original high-definition face image is processed by the inverse rendering model in training to obtain a high-definition face normal image and a high-definition face albedo image; any illumination environment image is selected (shown by the grid pattern in fig. 15); and the light and shadow enhancement model renders the high-definition face normal image, the high-definition face albedo image, and the selected illumination environment image to generate the high-definition face image under the illumination environment shown in fig. 15.
In this implementation, enriching the high-definition sample images with the light and shadow enhancement model greatly increases the light and shadow richness and complexity of the training data, improving the robustness and adaptability of the inverse rendering model. The subsequently trained inverse rendering model can thus accurately handle varied light and shadow changes when facing real, complex illumination scenes, avoiding problems such as artifacts and pseudo-textures and effectively improving image quality.
S303, acquiring a second training set.
The second training set includes a low-definition sample image, the low-definition sample image being degraded from the high-definition sample image. The generation of these low-definition sample images may be implemented, for example, by pre-trained degradation models that are capable of simulating various degradation effects on the high-definition sample images. For example, a high-definition sample image is processed by using a pre-trained degradation model, and at least one degradation effect of noise, blurring, smearing feeling, artifacts, pseudo textures and the like is added to the high-definition sample image, so that a low-definition sample image is obtained.
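A hand-written degradation pipeline along these lines might look as follows; the patent uses a pre-trained degradation model, so the fixed operations and parameter values below are only illustrative assumptions (using OpenCV, for a 3-channel uint8 input):

```python
import cv2
import numpy as np

def degrade(hq: np.ndarray, blur_sigma=2.0, noise_std=8.0,
            down=4, jpeg_q=40) -> np.ndarray:
    """Turn a high-definition sample into a low-definition one:
    blur -> downsample -> noise -> JPEG artifacts -> upsample back."""
    h, w = hq.shape[:2]
    lq = cv2.GaussianBlur(hq, (0, 0), blur_sigma)
    lq = cv2.resize(lq, (w // down, h // down), interpolation=cv2.INTER_AREA)
    lq = np.clip(lq + np.random.normal(0, noise_std, lq.shape), 0, 255)
    lq = lq.astype(np.uint8)
    _, buf = cv2.imencode('.jpg', lq, [cv2.IMWRITE_JPEG_QUALITY, jpeg_q])
    lq = cv2.imdecode(buf, cv2.IMREAD_COLOR)   # adds compression artifacts
    return cv2.resize(lq, (w, h), interpolation=cv2.INTER_CUBIC)
```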
S304, training the inverse rendering model in training by using the second training set to obtain a trained inverse rendering model.
A low-definition sample image in the second training set is cropped to obtain a low-definition face image, the low-definition face image is input into the inverse rendering model in training, and the inverse rendering model in training splits the face and the environment in the low-definition face image to obtain a real face normal image, a real face albedo image and a real environment image.
The structure of the inverse rendering model in training is consistent with that of the trained inverse rendering model, so for the process by which the inverse rendering model in training processes the low-definition face image, reference may be made to the process by which the trained inverse rendering model processes the face image, and details are not repeated here.
Because the prior information of the high-definition sample images is stored in the vector codebook module of the inverse rendering model in training, when the low-definition sample images are processed at this stage, the pre-stored prior information can be effectively used to assist the inverse rendering model in training in recovering and outputting high-quality real face normal images, real face albedo images and real environment images. In this process, the inverse rendering model in training learns to output high-quality images based on the prior information of high-definition sample images when facing low-definition sample images; once it has grasped this ability, the vector codebook module at this point is fixed to obtain the trained inverse rendering model.
In plain terms, the human face has strong priors; for example, the facial features are highly regular. The inverse rendering model in training learns these regularities (i.e., the prior information) from high-definition face images, and when processing a low-definition face image it can infer, from the learned regularities, what the blurred areas should look like, so that a high-quality image can be recovered and output.
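The codebook lookup that realizes this prior can be sketched as a nearest-neighbor substitution, assuming a PyTorch implementation; because every codebook entry was learned from high-definition faces, replacing a degraded latent with its nearest code pulls the result toward a plausible clean face:

```python
import torch

def quantize(latent: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Replace each latent vector with its nearest codebook entry.
    latent: (N, D) latents from the degraded image;
    codebook: (K, D) vectors learned from high-definition faces."""
    dists = torch.cdist(latent, codebook)   # (N, K) pairwise L2 distances
    idx = dists.argmin(dim=1)               # nearest code index per latent
    return codebook[idx]                    # (N, D) quantized latents
```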
In this implementation, the inverse rendering model is trained in two stages. In the first stage, the initial inverse rendering network is trained on high-definition sample images so that the vector codebook module captures and stores the prior information of the high-definition sample images; in the second stage, training continues on low-definition sample images so that the inverse rendering model in training learns to accurately recover and output high-quality images using the prior information. Through these two stages, the trained inverse rendering model gains strong robustness, so that it can accurately extract and use the prior information when facing face images in various complex light and shadow scenes and generate high-quality images.
It should be noted that, before using the face enhancement model, the face enhancement model needs to be trained and generated, and the training process of the face enhancement model provided by the embodiment of the application is described in detail below with reference to the accompanying drawings.
Referring to fig. 16, fig. 16 is a schematic flow chart of training a face enhancement model according to an exemplary embodiment of the present application, which is specifically described below.
S401, acquiring a third training set.
The third training set may include high-definition sample images. It is worth noting that the high-definition sample images included in the third training set may be the same as, or different from, those included in the first training set. A plurality of high-definition images containing faces can be collected through various channels under different conditions to obtain the third training set.
S402, training the initial face enhancement network by using the third training set to obtain a face enhancement model in training.
The high-definition face image is input into the trained inverse rendering model, which splits the face and the environment in the high-definition face image to obtain a high-definition face normal image, a high-definition face albedo image and a high-definition environment image. The initial face enhancement network then processes the high-definition face image, the high-definition face normal image, the high-definition face albedo image and the high-definition environment image to obtain a high-definition face target image.
The structure of the initial face enhancement network is consistent with that of the trained face enhancement model, so that the specific processing procedure of the initial face enhancement network can refer to the specific processing procedure of the trained face enhancement model, and the detailed description is omitted here.
In the process of training the initial face enhancement network with the third training set, the vector codebook module in the initial face enhancement network learns how to quantize vectors so as to accurately recover and output high-quality face images using the prior information. Once the vector codebook module in the initial face enhancement network has this ability, it is fixed to obtain the face enhancement model in training.
In this implementation, the face enhancement model is trained in two stages. In the first stage, the initial face enhancement network is trained on high-definition sample images so that its vector codebook module learns how to quantize vectors and thus accurately recover and output high-quality face images using the prior information. In the second stage, the face enhancement model in training is further trained on low-definition sample images so that it learns to accurately recover and output high-quality images using the prior information. Through these two stages, the trained face enhancement model gains strong robustness, so that it can accurately use the prior information when facing face images in various complex light and shadow scenes and generate high-quality images.
Optionally, the illumination scenes of the high-definition sample images in the third training set may also be enriched. For example, the color, brightness and the like of specific facial areas are adjusted by the light and shadow enhancement technique to generate more high-definition sample images containing complex light and shadow effects; as another example, the high-definition sample images are enriched by the light and shadow enhancement model.
In this implementation, enriching the high-definition sample images with the light and shadow enhancement model greatly increases the light and shadow richness and complexity of the training data and improves the robustness and usability of the face enhancement model, so that the subsequently trained face enhancement model can accurately handle various light and shadow changes when facing low-quality images with complicated facial light and shadow, avoiding artifacts, pseudo textures and similar problems and effectively improving image quality.
S403, acquiring a fourth training set.
The fourth training set includes low-definition sample images. It is worth noting that the low-definition sample images included in the fourth training set may be the same as, or different from, those included in the second training set.
The fourth training set may further include normal sample images, albedo sample images and environment sample images corresponding to the low-definition sample images. The environment sample images can be obtained by processing the low-definition sample images with the trained inverse rendering model, or by simulation with the trained light and shadow enhancement model. When the environment sample images are obtained by simulation with the trained light and shadow enhancement model, the light and shadow complexity of the face enhancement model's training data is enriched, improving the robustness and usability of the face enhancement model.
S404, training the face enhancement model in training by using the fourth training set to obtain a trained face enhancement model.
A low-definition sample image in the fourth training set is cropped to obtain a low-definition face image; the low-definition face image, the normal sample image, the albedo sample image and the environment sample image are stitched to obtain a fourth stitched image; the face enhancement model in training extracts a fifth latent vector from the fourth stitched image, searches its vector codebook module for a vector close to the fifth latent vector, and generates a high-definition face image according to that vector. In the embodiment of the application, the high-definition face image output by the face enhancement model in training is called a predicted image or a reconstructed image.
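The forward pass of this training step can be sketched as follows, reusing the quantize helper above; all module names are hypothetical placeholders for the patent's encoder, vector codebook module and decoder:

```python
import torch

def enhance_step(face_lq, normal, albedo, env, encoder, codebook, decoder):
    """One forward pass of the face enhancement model in training.
    Each input image tensor is (B, 3, H, W); encoder yields (B, N, D)."""
    x = torch.cat([face_lq, normal, albedo, env], dim=1)   # (B, 12, H, W) stitch
    z = encoder(x)                                         # fifth latent vectors
    z_q = quantize(z.flatten(0, 1), codebook).view_as(z)   # nearest-code lookup
    return decoder(z_q)                                    # predicted HD face
```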
The structure of the face enhancement model in training is consistent with that of the trained face enhancement model, so for the specific processing procedure of the face enhancement model in training, reference may be made to the specific processing procedure of the trained face enhancement model, and details are not repeated here.
The face enhancement model in training is trained continuously by back-propagation according to a target loss value, and when the target loss value corresponding to the face enhancement model in training converges, the trained face enhancement model is obtained.
Illustratively, the target loss value is used to represent the loss between the high-definition sample image and the predicted image output by the face enhancement model in training; the loss is composed of three parts, the first part being the L1 loss, the second part being the perceptual loss, and the third part being the adversarial loss.
The L1 loss represents the pixel-level loss between the high-definition sample image and the predicted image (also called the reconstructed image), the perceptual loss represents the feature loss between the high-definition sample image and the predicted image, and the adversarial loss represents the loss between the real image and the image generated by the generator in the face enhancement model.
The preset loss function is as follows:

$$\mathcal{L}_{total} = \lambda_{1}\,\mathcal{L}_{1} + \lambda_{per}\,\mathcal{L}_{per} + \lambda_{adv}\,\mathcal{L}_{adv} \tag{1}$$

$$\mathcal{L}_{1} = \lVert I_{hq} - \hat{I} \rVert_{1}, \qquad \mathcal{L}_{per} = \lVert \Phi(I_{hq}) - \Phi(\hat{I}) \rVert_{2}^{2} \tag{2}$$

In the above formulas (1) and (2), $I_{hq}$ represents the high-definition sample image, $I_{lq}$ represents the low-definition sample image, $\hat{I} = F(I_{lq})$ represents the reconstructed image obtained by passing $I_{lq}$ through the face enhancement network $F$, $\mathcal{L}_{1}$ represents the L1 loss, $\mathcal{L}_{per}$ represents the perceptual loss computed on features $\Phi(\cdot)$, $\mathcal{L}_{adv}$ represents the adversarial loss, and $\lambda_{1}$, $\lambda_{per}$ and $\lambda_{adv}$ represent the weights corresponding to the three losses. The face enhancement network $F$ refers to the network used by the face enhancement model in training.

The face enhancement network $F$ is realized as follows:

$$\hat{I} = F(I_{lq}) = D\big(C(E(I_{lq});\, P)\big) \tag{3}$$

In the above formula (3), $E$ represents the encoder of the face enhancement network, $C$ represents the vector codebook module of the face enhancement network, $D$ represents the decoder of the face enhancement network, and $P$ represents the prior information.

The adversarial loss $\mathcal{L}_{adv}$ is calculated as follows:

$$\mathcal{L}_{adv} = \mathbb{E}\big[\log \mathcal{D}(I_{hq})\big] + \mathbb{E}\big[\log\big(1 - \mathcal{D}(\hat{I})\big)\big] \tag{4}$$

In the above formula (4), $\mathcal{D}$ represents the discriminator network.

Illustratively, $\mathcal{L}_{total}$ is computed to obtain the target loss value, the face enhancement model in training is trained continuously by back-propagation according to the target loss value, and the network parameters of $F$ are adjusted; when the target loss value converges, the network parameters of the face enhancement network are fixed to obtain the trained face enhancement model.
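A sketch of the target loss computation corresponding to formulas (1), (2) and (4), assuming PyTorch, a fixed perceptual feature extractor, and illustrative weights; the adversarial term uses the common non-saturating generator-side form:

```python
import torch
import torch.nn.functional as F

def target_loss(hq, pred, feat_hq, feat_pred, disc_pred,
                w_l1=1.0, w_per=1.0, w_adv=0.1):
    """hq/pred: high-definition sample and reconstructed image;
    feat_*: features from a fixed perceptual network (e.g. VGG);
    disc_pred: discriminator logits for the reconstruction."""
    l1 = F.l1_loss(pred, hq)                     # L1 term of formula (2)
    per = F.mse_loss(feat_pred, feat_hq)         # perceptual term
    adv = F.binary_cross_entropy_with_logits(    # generator-side adversarial
        disc_pred, torch.ones_like(disc_pred))   # term, cf. formula (4)
    return w_l1 * l1 + w_per * per + w_adv * adv # weighted sum, formula (1)
```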
For example, during training of the face enhancement model, the target loss value may gradually decrease and tend to stabilize. In other words, as the number of training times increases, the target loss value no longer decreases significantly, but fluctuates within a smaller range, which indicates that the face enhancement network in training has learned how to accurately recover and output high-quality face images using a priori information, i.e., indicates that the face enhancement model has been trained.
In the embodiment of the application, when the target loss value is detected to be larger than the preset loss threshold value, the training process of the face enhancement model is continuously executed, and when the target loss value is detected to be smaller than or equal to the preset loss threshold value, the training is stopped, and the network parameters of the face enhancement network are saved, so that the trained face enhancement model is obtained.
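The threshold-based stopping rule can be sketched as a plain training loop; compute_loss is an assumed helper wrapping the target loss above, and the threshold and step budget are illustrative:

```python
def train_until_converged(model, loader, optimizer,
                          loss_threshold=0.05, max_steps=100_000):
    """Train by back-propagation and stop once the target loss value
    falls to the preset loss threshold, keeping the network parameters."""
    for step, batch in enumerate(loader):
        loss = compute_loss(model, batch)   # assumed helper
        optimizer.zero_grad()
        loss.backward()                     # back-propagation
        optimizer.step()
        if loss.item() <= loss_threshold or step >= max_steps:
            break
    return model
```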
In this implementation, adjusting the model parameters with the back-propagation algorithm helps the model converge to an optimal solution more quickly, improving the training speed of the face enhancement model; and because all three loss components are considered when adjusting the model parameters, the performance of the face enhancement model can be improved in multiple respects, helping it cope with various low-quality images with complex light and shadow.
The structure of the electronic device according to the embodiment of the present application will be briefly described below with reference to the accompanying drawings.
In the embodiment of the present application, the electronic device may be a touch screen-equipped device such as a mobile phone, a smart screen, a tablet computer, a wearable device, an augmented reality (AR)/virtual reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), a handheld or laptop device, a media player, a smart projector, a smart television, a desktop computer, a vehicle infotainment system, and the like; the embodiment of the application does not specifically limit the type and form of the electronic device.
It should be understood that, the software system, the hardware system, the device and the chip in the embodiments of the present application may execute the training method and the image processing method of the various models in the embodiments of the present application, that is, the specific working processes of the following various products may refer to the corresponding processes in the embodiments of the foregoing methods.
Referring to fig. 17, fig. 17 is a schematic diagram of an electronic device according to an exemplary embodiment of the present application. The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display 194, and a subscriber identity module (subscriber identification module, SIM) card interface 195, etc. The sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
The configuration shown in fig. 17 does not constitute a specific limitation on the electronic apparatus 100. In other embodiments of the application, electronic device 100 may include more or fewer components than those shown in FIG. 17, or electronic device 100 may include a combination of some of the components shown in FIG. 17, or electronic device 100 may include sub-components of some of the components shown in FIG. 17. The components shown in fig. 17 may be implemented by hardware, software, or a combination of hardware and software.
Processor 110 may include one or more processing units. For example, the processor 110 may include at least one of an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and a neural-network processor (neural-network processing unit, NPU). The different processing units may be separate devices or integrated devices.
The controller may be a neural hub and a command center of the electronic device 100, among others. The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby improving the efficiency of the system.
The processor 110 may run the software code of the image processing method provided by the embodiment of the application, process the face image through the inverse rendering model, and use the processing result of the inverse rendering model as face prior information; the face enhancement model can effectively use the face prior information to accurately recover the face, avoid generating unnatural textures even under complex light and shadow conditions, and output a high-quality photographed image.
The electronic device 100 may implement display functions through the GPU, the display screen 194, and the application processor. The GPU is a microprocessor for image processing and is connected to the display screen 194 and the application processor. The GPU can also be used to perform mathematical and geometric calculations, for example for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
In an embodiment of the present application, the ability of the electronic device 100 to display various different display interfaces is dependent on the GPU, the display 194, and the display functions provided by the application processor. For example, a preview image, a photographed video, a processed image, a processed video, and the like are displayed.
In some embodiments, the electronic device 100 may include 1 or N display screens 194, N may be a positive integer greater than 1.
The display screen 194 in the embodiment of the present application is a touch screen. The touch sensor 180K may be integrated in the display screen 194; the touch sensor 180K is also referred to as a "touch panel". That is, the display screen 194 may include a display panel and a touch panel, and the touch sensor 180K and the display screen 194 together form a touch screen, also referred to as a "touch-controlled screen". The touch sensor 180K is used to detect a touch operation acting on or near it. After the touch sensor 180K detects a touch operation, the kernel-layer driver (e.g., the TP driver) may transfer it to an upper layer to determine the touch event type. Visual output related to the touch operation may be provided through the display screen 194. In other embodiments, the touch sensor 180K may also be disposed on the surface of the electronic device 100 at a location different from that of the display screen 194.
The charge management module 140 is configured to receive a charge input from a charger. The power management module 141 is used to connect the battery 142, the charge management module 140, and the processor 110. The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The mobile communication module 150 may provide a solution for wireless communication including 2G/3G/4G/5G, etc., applied to the electronic device 100.
A camera 193 is used to capture images. The shooting function can be triggered and started through an application program instruction, for example to photograph any scene and acquire its image. The camera may include an imaging lens, an optical filter, an image sensor and the like. Light emitted or reflected by an object enters the imaging lens, passes through the optical filter and finally converges on the image sensor. The imaging lens is mainly used for converging and imaging the light emitted or reflected by all objects (also called subjects) within the shooting view angle; the optical filter is mainly used for filtering out redundant light waves (for example, light waves other than visible light, such as infrared light); and the image sensor is mainly used for performing photoelectric conversion on the received optical signals, converting them into electrical signals and inputting them into the processor 110 for subsequent processing. The cameras 193 may be located in front of the electronic device 100 or on the back of the electronic device 100, and the specific number and arrangement of the cameras may be set according to requirements, which is not limited in the present application.
Illustratively, the electronic device 100 includes a front-facing camera and a rear-facing camera, either of which may include one or more cameras. Taking the example in which the electronic device 100 has one rear camera, when the electronic device 100 starts the rear camera to shoot, the image processing method provided by the embodiment of the application can be used. Alternatively, the camera is disposed on an external accessory of the electronic device 100; the external accessory is rotatably connected to the frame of the mobile phone, and the angle formed between the external accessory and the display screen 194 of the electronic device 100 is any angle between 0 and 360 degrees. For example, when the electronic device 100 takes a selfie, the external accessory drives the camera to rotate to a position facing the user. Of course, when the mobile phone has a plurality of cameras, only some of the cameras may be disposed on the external accessory, and the rest are disposed on the body of the electronic device 100, which is not limited in any way by the embodiment of the present application.
The internal memory 121 may be used to store computer executable program code including instructions. The internal memory 121 may include a storage program area and a storage data area. The internal memory 121 may also store software codes of the image processing method provided in the embodiment of the present application, and when the processor 110 runs the software codes, the flow steps of the image processing method are executed, resulting in a high-quality photographed image. The high-quality photographed image has high definition and rich details, and has no problems of smearing feeling, artifacts, pseudo textures and the like. The internal memory 121 may also store photographed images.
Of course, the software code of the image processing method provided in the embodiment of the present application may also be stored in an external memory, and the processor 110 may execute the software code through the external memory interface 120 to execute the flow steps of the image processing method, so as to obtain a high-quality photographed image. The image captured by the electronic device 100 may also be stored in an external memory.
In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. When a touch operation is applied to the display screen 194, the electronic apparatus 100 detects the touch operation intensity according to the pressure sensor 180A. The electronic device 100 may also calculate the location of the touch based on the detection signal of the pressure sensor 180A. In some embodiments, touch operations that act on the same touch location, but at different touch operation strengths, may correspond to different operation instructions.
The fingerprint sensor 180H is used to collect a fingerprint. The electronic device 100 may utilize the collected fingerprint feature to perform functions such as unlocking, accessing an application lock, taking a photograph, and receiving an incoming call.
The keys 190 include a power key and a volume key. The keys 190 may be mechanical keys or touch keys. The electronic device 100 may receive a key input signal and implement a function related to the key input signal.
In addition, on top of the above components, various types of operating systems run, such as the Android system, the iOS operating system, the Symbian operating system, the BlackBerry operating system, the Linux operating system, and the Windows operating system. This is merely illustrative and not limiting. Different applications, such as any type of input method application, may be installed and run on these operating systems.
The image processing method provided in the embodiment of the present application may be implemented in the electronic device 100 having the above-described hardware structure.
The structure of the electronic device 100 according to the embodiment of the present application is briefly described above, and the software structure according to the embodiment of the present application is briefly described below. Referring to fig. 18, fig. 18 is a block diagram illustrating a software structure of an electronic device according to an exemplary embodiment of the present application.
The software system may employ a layered architecture, an event driven architecture, a microkernel architecture, a micro-service architecture, or a cloud architecture, and the embodiment of the present application exemplarily describes the software system of the electronic device 100.
As shown in fig. 18, the layered architecture divides the software into several layers, each with a clear role and division of work, and the layers communicate with each other through software interfaces. In some embodiments, taking the Android system as an example, the software of the electronic device 100 is divided into five layers, namely, from top to bottom, an application layer 210, an application framework layer 220, a hardware abstraction layer 230, a driver layer 240, and a hardware layer 250.
The application layer 210 may include cameras, gallery applications, and may also include calendar, conversation, map, navigation, WLAN, bluetooth, music, video, short message, etc. applications.
The application framework layer 220 provides an application access interface and programming framework for the applications of the application layer 210.
For example, the application framework layer 220 includes a camera access interface for providing a photographing service of a camera through camera management and a camera device.
Camera management in the application framework layer 220 is used to manage cameras. The camera management may obtain parameters of the camera, for example, determine an operating state of the camera, and the like.
The camera devices in the application framework layer 220 are used to provide a data access interface between the camera devices and camera management.
The hardware abstraction layer 230 is used to abstract the hardware. For example, the hardware abstraction layer 230 may include a camera hardware abstraction layer and other hardware device abstraction layers, the camera hardware abstraction layer may include the camera device 1, the camera device 2, and the like, the camera hardware abstraction layer may be connected to a camera algorithm library, and the camera hardware abstraction layer may call an algorithm in the camera algorithm library.
The driver layer 240 is used to provide drivers for different hardware devices. For example, the driver layer may include a camera driver, a digital signal processor driver, and a graphics processor driver.
The hardware layer 250 may include sensors, image signal processors, digital signal processors, graphics processors, and other hardware devices. The sensors may include a sensor 1, a sensor 2, etc., and may also include a depth sensor (TOF) and a multispectral sensor.
The workflow of the software system of the electronic device 100 is exemplarily described below in connection with a shooting scenario.
When a user performs a click operation on the touch sensor 180K, after the camera APP is awakened by the click operation, each camera device of the camera hardware abstraction layer is invoked through the camera access interface. The camera hardware abstraction layer may send an instruction for calling the camera to the camera device driver, and at the same time, the camera algorithm library starts to load the algorithm utilized by the embodiment of the present application.
After a sensor of the hardware layer is called, for example after the sensor 1 in the camera is called to acquire an original image, the original image is sent to the image signal processor for preliminary processing such as registration; after this processing, the image is returned to the hardware abstraction layer through the camera device driver and then processed with the algorithms in the loaded camera algorithm library, for example the trained inverse rendering model and the trained face enhancement model, according to the relevant processing steps provided by the embodiment of the application, to obtain a face target image. The trained inverse rendering model and the trained face enhancement model can be run by calling the digital signal processor through the digital signal processor driver and calling the graphics processor through the graphics processor driver.
The obtained face target image is stitched with the original image to obtain a high-definition image corresponding to the original image, which is then sent back to the camera application for display and storage through the camera hardware abstraction layer and the camera access interface.
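Putting the shooting-scenario flow together, a sketch of the whole inference path might look as follows; every callable here is a hypothetical placeholder for the components described above:

```python
def process_capture(raw_image, detect_face, inverse_model,
                    enhance_model, paste_back):
    """End-to-end flow: crop the face, split it into normal/albedo/
    environment images, enhance with the prior, stitch back."""
    box, face = detect_face(raw_image)                  # crop the face region
    normal, albedo, env = inverse_model(face)           # inverse rendering
    face_hq = enhance_model(face, normal, albedo, env)  # prior-guided enhancement
    return paste_back(raw_image, face_hq, box)          # high-definition output
```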
Referring to fig. 19, fig. 19 is a schematic view of an image processing apparatus according to an exemplary embodiment of the present application.
It should be understood that the image processing apparatus 300 may perform the image processing method provided by the present application, and the image processing apparatus 300 includes a display unit 310, an acquisition unit 320, a first processing unit 330, and a second processing unit 340.
Alternatively, the inverse rendering model, the face enhancement model, and the light and shadow enhancement model may be deployed in the image processing apparatus 300.
A display unit 310, configured to display a first interface, where the first interface includes a first control;
An obtaining unit 320, configured to obtain a face image in response to a first operation on the first control;
a first processing unit 330, configured to process the face image by using the trained inverse rendering model, so as to obtain a face normal image, a face albedo image, and an environment image;
The second processing unit 340 is configured to perform enhancement processing on the face image, the face normal image, the face albedo image, and the environmental image by using the trained face enhancement model, so as to obtain a face target image, where the sharpness of the face target image is greater than that of the face image.
The image processing apparatus 300 is embodied as a functional unit. The term "unit" herein may be implemented in software and/or hardware, without specific limitation.
For example, a "unit" may be a software program, a hardware circuit or a combination of both that implements the functions described above. The hardware circuitry may include Application Specific Integrated Circuits (ASICs), electronic circuits, processors (e.g., shared, proprietary, or group processors, etc.) and memory for executing one or more software or firmware programs, merged logic circuits, and/or other suitable components that support the described functions.
Thus, the elements of the examples described in the embodiments of the present application can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the application also provides a computer-readable storage medium storing computer instructions which, when run on an image processing apparatus, cause the image processing apparatus to execute the image processing method of any of the above embodiments. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (digital subscriber line, DSL)) or a wireless manner (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium, or a semiconductor medium (e.g., a solid state drive (SSD)), etc.
The embodiments of the present application also provide a computer program product comprising computer instructions which, when run on an image processing apparatus, enable the image processing apparatus to perform the image processing method of any of the embodiments described above.
The embodiment of the application also provides a chip. Referring to fig. 20, fig. 20 is a schematic structural diagram of a chip according to an embodiment of the application. The chip shown in fig. 20 may be a general-purpose processor or a special-purpose processor. The chip includes a processor 410. Wherein the processor 410 is configured to perform the image processing method of any of the above embodiments.
Optionally, the chip further comprises a transceiver 420, and the transceiver 420 is configured to receive control of the processor and is configured to support the image processing apparatus to perform the foregoing technical solution.
Optionally, the chip shown in FIG. 20 may also include a storage medium 430.
It is noted that the chip shown in fig. 20 may be implemented using one or more field programmable gate arrays (field programmable gate array, FPGA), programmable logic devices (programmable logic device, PLD), controllers, state machines, gate logic, discrete hardware components, any other suitable circuit, or any combination of circuits capable of performing the various functions described throughout this application.
The electronic device, the computer readable storage medium, the computer program product or the chip provided in this embodiment are used to execute the corresponding method provided above, so that the beneficial effects thereof can be referred to the beneficial effects in the corresponding method provided above, and will not be described herein.
It will be appreciated by those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another apparatus, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and the parts shown as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application may be essentially or a part contributing to the prior art or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a device (may be a single-chip microcomputer, a chip or the like) or a processor (processor) to perform all or part of the steps of the methods of the embodiments of the present application. The storage medium includes a U disk, a removable hard disk, a Read Only Memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present application should be included in the scope of the present application. Therefore, the protection scope of the application should be in the protection scope of the claims.

Claims (12)

1. An image processing method, characterized in that the method comprises: displaying a first interface, the first interface comprising a first control; in response to a first operation on the first control, acquiring a face image; extracting, by using a first network comprised in a trained inverse rendering model, structural features and contour features of the face in the face image, and encoding the structural features and the contour features into a first latent vector; searching a first discrete vector set comprised in the first network for a vector close to the first latent vector; generating a face normal image according to the vector close to the first latent vector; stitching the face image and the face normal image to obtain a first stitched image; extracting, by using a second network comprised in the inverse rendering model, texture features and skin features of the face in the first stitched image, and encoding the texture features and the skin features into a second latent vector; searching a second discrete vector set comprised in the second network for a vector close to the second latent vector; generating a face albedo image according to the vector close to the second latent vector; processing the face image, the face normal image and the face albedo image by using a third network comprised in the inverse rendering model to obtain an environment image; determining face prior information according to the face normal image, the face albedo image and the environment image, the face prior information comprising face structure information, face contour information, face light and shadow information, face skin information and face texture information; and performing enhancement processing on the face image, the face normal image, the face albedo image and the environment image by using a trained face enhancement model, and recovering face details in the face image by using the face prior information to obtain a face target image, the face details comprising face structure, face contour, face light and shadow, face skin and face texture, and the definition of the face target image being greater than the definition of the face image.

2. The method according to claim 1, wherein the acquiring a face image in response to a first operation on the first control comprises: in response to the first operation on the first control, acquiring an original image; determining the definition of the original image; and upon detecting that the definition of the original image is less than a preset definition threshold, cropping the original image to obtain the face image.

3. The method according to claim 1, wherein the trained face enhancement model comprises a third vector codebook module, the third vector codebook module comprises a third discrete vector set, and the performing enhancement processing on the face image, the face normal image, the face albedo image and the environment image by using the trained face enhancement model to obtain a face target image comprises: stitching the face image, the face normal image, the face albedo image and the environment image to obtain a second stitched image; extracting a third latent vector from the second stitched image by using the trained face enhancement model; searching the third discrete vector set for a vector close to the third latent vector; and generating the face target image according to the vector close to the third latent vector.

4. The method according to claim 1, wherein the performing enhancement processing on the face image, the face normal image, the face albedo image and the environment image by using the trained face enhancement model to obtain a face target image comprises: determining a fill-light environment image corresponding to the environment image by using a trained light and shadow enhancement model; and performing enhancement processing on the face image, the face normal image, the face albedo image and the fill-light environment image by using the trained face enhancement model to obtain the face target image.

5. The method according to any one of claims 1 to 4, wherein the method further comprises: training an initial inverse rendering network by using a first training set to obtain an inverse rendering model in training, the first training set comprising high-definition sample images; and training the inverse rendering model in training by using a second training set to obtain the trained inverse rendering model, the second training set comprising low-definition sample images, the low-definition sample images being obtained by degrading the high-definition sample images.

6. The method according to claim 5, wherein the method further comprises: training an initial face enhancement network by using a third training set to obtain a face enhancement model in training, the third training set comprising the high-definition sample images; and training the face enhancement model in training by using a fourth training set to obtain the trained face enhancement model, the fourth training set comprising the low-definition sample images and normal sample images, albedo sample images and environment sample images corresponding to the low-definition sample images, the normal sample images and the albedo sample images being obtained by processing the low-definition sample images with the trained inverse rendering model, and the environment sample images being obtained by simulation with the trained light and shadow enhancement model.

7. The method according to claim 5, wherein the method further comprises: in the process of training the face enhancement model, calculating a target loss value according to a preset loss function, the target loss value being used to represent the loss between the high-definition sample image and a predicted image output by the face enhancement model in training; and continuing to train the face enhancement model in training by back-propagation according to the target loss value to obtain the trained face enhancement model.

8. The method according to claim 1, wherein the first network is a VQVAE, and/or the second network is a VQVAE.

9. The method according to claim 2, wherein the method further comprises: stitching the original image and the face target image to obtain a high-definition image corresponding to the original image.

10. An electronic device, characterized in that the electronic device comprises one or more processors and a memory, the memory being coupled to the one or more processors and being configured to store computer program code, the computer program code comprising computer instructions, and the one or more processors invoking the computer instructions to cause the electronic device to perform the method according to any one of claims 1 to 9.

11. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises instructions which, when run on an electronic device, cause the electronic device to perform the method according to any one of claims 1 to 9.

12. A computer program product, characterized in that the computer program product comprises computer program code which, when run, causes the method according to any one of claims 1 to 9 to be performed.
CN202510013410.8A 2025-01-06 2025-01-06 Image processing method, electronic device and storage medium Active CN119417729B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510013410.8A CN119417729B (en) 2025-01-06 2025-01-06 Image processing method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510013410.8A CN119417729B (en) 2025-01-06 2025-01-06 Image processing method, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN119417729A CN119417729A (en) 2025-02-11
CN119417729B true CN119417729B (en) 2025-08-01

Family

ID=94460190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510013410.8A Active CN119417729B (en) 2025-01-06 2025-01-06 Image processing method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN119417729B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116245741A (en) * 2022-06-28 2023-06-09 荣耀终端有限公司 Image processing method and related equipment
CN117173269A (en) * 2023-09-01 2023-12-05 杭州海康威视数字技术股份有限公司 Facial image generation method, device, electronic device and storage medium
CN117541478A (en) * 2022-02-28 2024-02-09 荣耀终端有限公司 Image processing method and related device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2024006680A (en) * 2022-07-04 2024-01-17 新東工業株式会社 Appearance judgment device and appearance judgment method
CN118840275B (en) * 2024-09-20 2025-04-08 荣耀终端股份有限公司 Model training method, image enhancement method and electronic equipment


Also Published As

Publication number Publication date
CN119417729A (en) 2025-02-11

Similar Documents

Publication Publication Date Title
CN113205568B (en) Image processing methods, devices, electronic equipment and storage media
JP7446457B2 (en) Image optimization method and device, computer storage medium, computer program, and electronic equipment
WO2021078001A1 (en) Image enhancement method and apparatus
WO2022042049A1 (en) Image fusion method, and training method and apparatus for image fusion model
CN113449623B (en) A Lightweight Liveness Detection Method Based on Deep Learning
CN110889410A (en) Robust use of semantic segmentation in shallow depth of field rendering
CN116048244B (en) Gaze point estimation method and related equipment
CN112528760B (en) Image processing methods, devices, computer equipment and media
CN114612283B (en) Image processing method, device, electronic device and storage medium
CN116757970B (en) Training method of video reconstruction model, video reconstruction method, device and equipment
CN115661320A (en) Image processing method and electronic device
CN113395441A (en) Image color retention method and device
CN113128428A (en) Depth map prediction-based in vivo detection method and related equipment
CN114331918A (en) Training method of image enhancement model, image enhancement method and electronic equipment
WO2024055379A1 (en) Video processing method and system based on character avatar model, and related device
CN111612723B (en) Image restoration method and device
CN116245741B (en) Image processing method and related device
CN117132515B (en) Image processing method and electronic device
CN119417729B (en) Image processing method, electronic device and storage medium
CN120410929A (en) Facial image enhancement method, device, equipment and storage medium
CN118446927B (en) Image processing method, image processing system training method and electronic equipment
CN116740777B (en) Training method of face quality detection model and related equipment
CN117729445B (en) Image processing method, electronic device and computer readable storage medium
CN118469813B (en) Image processing method, electronic equipment and computer readable storage medium
CN116051386B (en) Image processing method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant