
CN118570054B - Training method, related device and medium for image generation model - Google Patents


Info

Publication number
CN118570054B
CN118570054B (application CN202411060639.9A)
Authority
CN
China
Prior art keywords
image
sample
target
noise
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411060639.9A
Other languages
Chinese (zh)
Other versions
CN118570054A (en)
Inventor
郭卉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202411060639.9A priority Critical patent/CN118570054B/en
Publication of CN118570054A publication Critical patent/CN118570054A/en
Application granted granted Critical
Publication of CN118570054B publication Critical patent/CN118570054B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/04Context-preserving transformations, e.g. by using an importance map
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/60Image enhancement or restoration using machine learning, e.g. neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/70Denoising; Smoothing
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)

Abstract

The disclosure provides a training method, a related device, and a medium for an image generation model. The method comprises the following steps: acquiring a plurality of image-text sample pairs, wherein each image-text sample pair comprises a background template image, a noise reference image, image description information of the noise reference image, and a sample object image of a sample object; determining sample stitched image features based on the noise reference image, the background template image, and a template mask image; determining denoising network control information based on contour features of a reference object in the background template image, the image description information, and the sample object image; performing noise prediction through the image generation model, based on the sample stitched image features and the denoising network control information, to obtain a noise prediction result for the noise reference image; and training the image generation model based on a comparison of the noise reference images and the noise prediction results across the plurality of image-text sample pairs. The method and device can improve the accuracy of the generated target image.

Description

Training method, related device and medium for image generation model
Technical Field
The disclosure relates to the technical field of big data, in particular to a training method, a related device and a medium of an image generation model.
Background
Currently, in various business scenarios such as video production and virtual exhibition, it is often necessary to replace background images and object images to create personalized images. For example, to create an article image of article A, article B in a background image is replaced with article A. To this end, the related art typically mats (cuts out) a first object from a specified background image using a neural network model, produces a second object with the same orientation and shooting angle as the first object in the background image, and embeds the second object into the background image as a pasted patch, so that the second object is displayed in the specified background image.
However, the above approach is often limited by factors such as the angle and the illumination/filter consistency of the target object with the background image. In practical model training, it is often difficult to collect enough training data that meets expectations, which harms the training effect; the target image generated by the model may then be unsatisfactory (for example, the illumination and filter of the target object cannot be made consistent with the background image), so the accuracy of the generated target image is low.
Disclosure of Invention
The embodiment of the disclosure provides a training method, a related device and a medium for an image generation model, which can improve the accuracy of generating a target image by the image generation model.
According to an aspect of the present disclosure, there is provided a training method of an image generation model, the training method including:
acquiring a plurality of image-text sample pairs, wherein each image-text sample pair comprises a background template image, a noise reference image, image description information corresponding to the noise reference image and a sample object image of a sample object, and the noise reference image is obtained by adding noise to an expected result of replacing a reference object in the background template image with the sample object;
Determining sample stitching image features based on the noise reference image, the background template image, and a template mask image, wherein the template mask image is obtained by masking the reference object in the background template image;
determining denoising network control information of the image generation model based on contour features of the reference object, the image description information and the sample object image in the background template image;
Based on the sample spliced image features and the denoising network control information, performing noise prediction through the image generation model to obtain a noise prediction result of the noise reference image;
and training the image generation model based on a comparison of the noise reference images and the noise prediction results of the plurality of image-text sample pairs.
According to an aspect of the present disclosure, there is provided an image generation method including:
Acquiring a target object image of a target object, a target background image, and target description information, wherein the target background image contains a reference object to be replaced by a target object in the target object image, and the target description information is used for describing replacement from the reference object to the target object;
determining target stitching image characteristics based on the target background image, a preset noise image and a target mask image, wherein the target mask image is obtained by masking a reference object in the target background image;
determining denoising control information of an image generation model based on the outline features of the reference object, the target description information and the target object image in the target background image, wherein the image generation model is generated according to the training method of the image generation model;
And generating an image through the image generation model based on the target stitching image characteristics and the denoising control information to obtain a target image, wherein the target image is used for indicating a result of replacing the reference object in the target background image with the target object of the target object image.
According to an aspect of the present disclosure, there is provided a training apparatus of an image generation model, the training apparatus of an image generation model including:
A first obtaining unit, configured to obtain a plurality of image-text sample pairs, where each image-text sample pair includes a background template image, a noise reference image, image description information corresponding to the noise reference image, and a sample object image of a sample object, where the noise reference image is obtained by adding noise to an expected result of replacing a reference object in the background template image with the sample object;
a first determining unit, configured to determine a sample stitched image feature based on the noise reference image, the background template image, and a template mask image, where the template mask image is obtained by masking the reference object in the background template image;
A second determining unit configured to determine denoising network control information of the image generation model based on contour features of the reference object in the background template image, the image description information, and the sample object image;
The prediction unit is used for carrying out noise prediction through the image generation model based on the sample spliced image characteristics and the denoising network control information to obtain a noise prediction result of the noise reference image;
and the training unit is used for training the image generation model based on the comparison of the noise reference images of the image-text sample pairs and the noise prediction results.
Optionally, the training unit includes:
the calculation module is used for acquiring reference noise in the noise reference image according to each image-text sample pair, and calculating a sub-loss function based on comparison of the reference noise and the noise prediction result;
A determining module, configured to determine a total loss function based on the sub-loss functions of each of the image-text sample pairs;
And the training module is used for training the image generation model based on the total loss function.
Optionally, the noise prediction result is obtained by prediction of a plurality of prediction time steps;
the computing module is used for:
determining the predicted noise of the last predicted time step based on the noise prediction result;
Performing regular term calculation based on the reference noise and the prediction noise to obtain a regular term calculation result;
and determining the sub-loss function based on the regular term calculation result.
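As a minimal NumPy sketch of the loss computation described above — using a mean-squared regular term between the reference noise and the noise predicted at the last prediction time step, which is an illustrative assumption since the patent does not fix a specific regulariser:

```python
import numpy as np

def sub_loss(reference_noise, predicted_noise):
    # Regular-term (here: squared-error) comparison between the
    # reference noise and the noise predicted at the final time step.
    diff = np.asarray(reference_noise) - np.asarray(predicted_noise)
    return float(np.mean(diff ** 2))

def total_loss(sub_losses):
    # Total loss aggregated over the sub-losses of all
    # image-text sample pairs (mean is an assumed aggregation).
    return float(np.mean(sub_losses))
```

In this sketch, training would then backpropagate `total_loss` through the image generation model.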
Optionally, the first determining unit is configured to:
performing first coding processing on the noise reference image to obtain a reference image coding characteristic;
performing second coding processing on the background template image to obtain a background image coding characteristic;
performing third coding processing on the template mask image to obtain mask image coding characteristics;
and splicing the reference image coding feature, the background image coding feature and the mask image coding feature to obtain the sample spliced image feature.
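The encode-and-splice step above can be sketched as a channel-wise concatenation of the three encodings; the encoders themselves are unspecified in the patent, so only the stitching is shown, and the shape convention (H, W, C) is an assumption:

```python
import numpy as np

def stitch_image_features(ref_feat, bg_feat, mask_feat):
    # Concatenate the reference-image, background-image, and
    # mask-image encodings along the channel axis; all inputs are
    # assumed to share the same spatial resolution (H, W, C_i).
    return np.concatenate([ref_feat, bg_feat, mask_feat], axis=-1)
```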
Optionally, the second determining unit includes:
The coding module is used for coding the image description information to obtain an image description embedded vector;
The extraction module is used for extracting the characteristics of the sample object image to obtain sample object characteristic data;
the generation module is used for generating control information through a preset control network based on the image description embedded vector, the sample object feature data and the outline feature of the reference object to obtain the denoising network control information.
Optionally, the encoding module is configured to:
word segmentation is carried out on the image description information to obtain a plurality of description words;
Determining target words in the descriptive words, and searching target word embedding characteristics corresponding to the target words based on a preset dictionary;
aiming at other descriptive words except the target word in the descriptive words, carrying out word embedding processing on the other descriptive words to obtain descriptive word embedding characteristics of the other descriptive words;
integrating the target word embedding feature and the descriptor embedding feature into the image description embedding vector.
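The word-segmentation and embedding flow above might look as follows; whitespace tokenisation and the dictionary/embedding interfaces are illustrative assumptions, not the patented implementation:

```python
import numpy as np

def build_description_embedding(description, target_features, embed_fn):
    # Tokenise the description; target words are looked up in the
    # preset dictionary `target_features`, and all other description
    # words go through the ordinary word-embedding function
    # `embed_fn`.  Vectors are kept in token order and stacked into
    # a single image-description embedding matrix.
    vectors = [target_features[w] if w in target_features else embed_fn(w)
               for w in description.split()]
    return np.stack(vectors)
```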
Optionally, the control network includes a first control sub-network, and a second control sub-network; the sample object feature data comprises a first sample object feature, a second sample object feature, a third sample object feature and a fourth sample object feature, wherein the first sample object feature, the second sample object feature, the third sample object feature and the fourth sample object feature are obtained by respectively extracting features of the sample object images;
The generating module is used for:
Inputting the image description embedded vector, the first sample object feature and the outline feature into the first control sub-network to generate control information, so as to obtain first control information;
Inputting the image description embedded vector and the second sample object characteristic into the second control sub-network to generate control information, so as to obtain second control information;
Determining upsampling network control information based on the third sample object feature and the fourth sample object feature;
Determining downsampled network control information based on the first control information, the second control information, and the first sample object feature, and the second sample object feature;
and integrating the up-sampling network control information and the down-sampling network control information into the denoising network control information.
Optionally, the noise reference image is generated by:
Determining an expected result of replacing a reference object in the background template image with a sample object, wherein the expected result is image data;
Generating random numbers obeying Gaussian distribution based on a predetermined random number generation model;
and adding the random number to the pixel value of each pixel point in the expected result to obtain the noise reference image.
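A minimal sketch of the noise-reference-image construction, assuming per-pixel i.i.d. Gaussian noise (the noise scale and the use of NumPy's generator are illustrative choices):

```python
import numpy as np

def make_noise_reference_image(expected_result, noise_scale=1.0, seed=None):
    # Sample Gaussian random numbers and add them to the pixel value
    # of every pixel point in the expected replacement result.  The
    # sampled noise is returned as well, since it later serves as the
    # reference noise (regression target) during training.
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, noise_scale, size=np.shape(expected_result))
    return np.asarray(expected_result, dtype=float) + noise, noise
```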
Optionally, the image generation model includes a diffusion network, and a denoising network;
the prediction unit includes:
The compression module is used for compressing the sample spliced image features to obtain sample compressed image features;
The diffusion module is used for carrying out diffusion processing on the sample compressed image characteristics based on the diffusion network to obtain sample hidden space characteristic vectors;
And the denoising module is used for denoising the sample hidden space feature vector through the denoising network based on the denoising network control information to obtain the noise prediction result.
Optionally, the denoising network includes an upsampling attention network, and a downsampling attention network; the denoising network control information comprises up-sampling network control information and down-sampling network control information;
the denoising module is used for:
fusing the downsampling network control information to a first attention matrix of the downsampling attention network to update the first attention matrix, and fusing the upsampling network control information to a second attention matrix of the upsampling attention network to update the second attention matrix;
And denoising the sample hidden space feature vector through the downsampling attention network after updating the first attention matrix and the upsampling attention network after updating the second attention matrix to obtain the noise prediction result.
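One way to picture the "fusing" of control information into an attention matrix is a weighted additive update; this is purely a hedged illustration, as the patent does not specify the fusion operator:

```python
import numpy as np

def fuse_control_into_attention(attention_matrix, control_info, weight=1.0):
    # Additively fuse the (up/down)sampling network control
    # information into the attention matrix; the updated matrix then
    # conditions the corresponding attention network when denoising.
    return attention_matrix + weight * control_info
```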
Optionally, the sample object image is generated by:
Acquiring a sample image of the sample object;
Performing image segmentation on the sample image based on a preset object segmentation model to obtain a sample segmentation image with the sample object;
and carrying out image enhancement on the sample segmentation image to obtain the sample object image.
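The segment-then-enhance flow above can be sketched as applying a segmentation mask and a simple brightness enhancement; both the mask format and the enhancement choice are illustrative assumptions:

```python
import numpy as np

def make_sample_object_image(sample_image, object_mask, brightness=1.1):
    # Keep only the pixels covered by the object segmentation mask,
    # then apply a simple brightness scaling as the enhancement step.
    segmented = sample_image.astype(float) * object_mask[..., None]
    return np.clip(segmented * brightness, 0, 255).astype(np.uint8)
```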
Optionally, the template mask image is generated by:
Determining an object contour region of the reference object in the background template image;
And in the background template image, replacing the pixel value of each pixel point in the object contour area with a first value, and replacing the pixel value of each pixel point outside the object contour area with a second value to obtain the template mask image.
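The binary template-mask construction above can be sketched directly; the concrete first/second values (255/0) and the boolean-region representation are assumptions for illustration:

```python
import numpy as np

def make_template_mask(shape, contour_region, first_value=255, second_value=0):
    # Pixels inside the reference object's contour region are set to
    # `first_value`; all pixels outside it are set to `second_value`.
    mask = np.full(shape, second_value, dtype=np.uint8)
    mask[contour_region] = first_value
    return mask
```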
Optionally, the contour features of the reference object in the background template image are determined by:
performing object detection on the background template image to obtain an object skeleton diagram of the reference object;
extracting gesture features of the object skeleton graph to obtain a plurality of object gesture key points;
the contour feature is determined based on the plurality of object pose keypoints.
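As a hedged illustration of deriving a contour feature from pose keypoints — summarising them as an axis-aligned bounding box plus centroid, which is one simple choice and not the patented feature definition:

```python
import numpy as np

def contour_feature_from_keypoints(keypoints):
    # Summarise the object-pose keypoints into a compact contour
    # descriptor: bounding box (x_min, y_min, x_max, y_max) plus the
    # centroid (cx, cy) of the keypoints.
    pts = np.asarray(keypoints, dtype=float)
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)
    cx, cy = pts.mean(axis=0)
    return np.array([x_min, y_min, x_max, y_max, cx, cy])
```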
According to an aspect of the present disclosure, there is provided an image generating apparatus including:
A second acquisition unit configured to acquire a target object image of a target object, a target background image, and target description information, wherein the target background image contains a reference object to be replaced by a target object in the target object image, and the target description information is used to describe replacement from the reference object to the target object;
A third determining unit, configured to determine a target stitching image feature based on the target background image, a preset noise image, and a target mask image, where the target mask image is obtained by masking a reference object in the target background image;
A fourth determining unit, configured to determine denoising control information of an image generation model based on the contour feature of the reference object in the target background image, the target description information, and the target object image, where the image generation model is generated by the training method of the image generation model;
And the image generation unit is used for generating an image through the image generation model based on the target spliced image characteristics and the denoising control information to obtain a target image, wherein the target image is used for indicating a result of replacing the reference object in the target background image by the target object of the target object image.
Optionally, the third determining unit is configured to:
performing first coding processing on the preset noise image to obtain noise image characteristics;
Performing second coding processing on the target background image to obtain target background image characteristics;
performing third coding processing on the target mask image to obtain target mask image characteristics;
And stitching the noise image features, the target background image features and the target mask image features to obtain the target stitched image features.
Optionally, the image generation model includes a diffusion network, a denoising network, and a decoding network;
The image generation unit is used for:
compressing the target spliced image features to obtain target compressed image features;
performing diffusion processing on the target compressed image features based on the diffusion network to obtain target hidden space feature vectors;
Denoising the target hidden space feature vector through the denoising network based on the denoising control information to obtain a target denoising result;
and performing feature decoding on the target denoising result based on the decoding network to obtain the target image.
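The four-stage generation flow above (compress, diffuse, denoise, decode) can be sketched as a pipeline over injected network callables; the stages themselves are left abstract since the patent does not fix their architectures:

```python
import numpy as np

def generate_target_image(stitched_feat, control_info,
                          compress, diffuse, denoise, decode):
    # Compress the target stitched image features, diffuse them into
    # a hidden-space (latent) vector, denoise that latent under the
    # denoising control information, and decode the result into the
    # target image.
    compressed = compress(stitched_feat)
    latent = diffuse(compressed)
    denoised = denoise(latent, control_info)
    return decode(denoised)
```

With identity stubs for each stage, the pipeline simply passes the features through unchanged, which makes the data flow easy to verify.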
According to an aspect of the present disclosure, there is provided an electronic device including a memory storing a computer program and a processor implementing a training method or an image generation method of an image generation model as described above when the computer program is executed.
According to an aspect of the present disclosure, there is provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the training method or the image generation method of the image generation model as described above.
According to an aspect of the present disclosure, there is provided a computer program product comprising a computer program that is read and executed by a processor of an electronic device, causing the electronic device to perform the training method or the image generation method of the image generation model as described above.
In the embodiment of the disclosure, when training the image generation model, an image-text sample pair is constructed from a background template image, a noise reference image, image description information corresponding to the noise reference image, and a sample object image of a sample object, wherein the noise reference image is obtained by adding noise to the expected result of replacing a reference object in the background template image with the sample object; constructed in this way, the reference image better fits real conditions. Then, the image features of the background template image, the object mask image of the background template, and the noise reference image are integrated into sample stitched image features, which serve as the input of model training; because the sample stitched image features carry multiple kinds of image feature information, using them as training data enriches the information content of the training data. In addition, the embodiment of the disclosure further introduces the image description information (which may indicate what the background and the object in the noise reference image are), the contour features of the reference object in the background template image (which may reflect the action, posture, etc. of the reference object), and the sample object image of the sample object (which may reflect the contour, appearance, etc. of the sample object) to jointly generate the denoising network control information, so that the denoising network control information carries multiple constraint conditions.
Further, the denoising network control information and the sample stitched image features are input into the image generation model, so that when the model performs noise prediction on the sample stitched image features it is constrained by the denoising network control information, and the noise prediction process is corrected accordingly; the noise prediction result that the model outputs for the noise reference image is therefore more accurate. Finally, the image generation model is trained by comparing the noise prediction result with the noise in the noise reference image, yielding a model that meets the training requirements and improving the accuracy of the target images it generates.
Additional features and advantages of the disclosure will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the disclosure. The objectives and other advantages of the disclosure will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the disclosed embodiments and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain, without limitation, the disclosed embodiments.
FIG. 1 is an architecture diagram of a system to which the training method of an image generation model and the image generation method according to an embodiment of the present disclosure apply;
FIGS. 2A-2C are schematic diagrams of an image generation method applied in a video production scene according to an embodiment of the present disclosure;
FIG. 3 is a flow chart of a training method of an image generation model according to one embodiment of the present disclosure;
FIG. 4 is a flow chart of determining a sample object image according to one embodiment of the present disclosure;
FIG. 5 is a flow chart of determining a noise reference image according to one embodiment of the present disclosure;
FIG. 6 is a schematic diagram of an implementation of determining a noise reference image in accordance with one embodiment of the present disclosure;
FIG. 7 is a flow chart of generating sample stitched image features according to one embodiment of the present disclosure;
FIG. 8 is a flow chart of determining a template mask image according to one embodiment of the present disclosure;
FIGS. 9A-9B are schematic diagrams of implementations of determining a template mask image according to one embodiment of the present disclosure;
FIG. 10 is a flow chart of generating denoising network control information according to one embodiment of the present disclosure;
FIG. 11 is a flowchart of determining contour features of a reference object in a background template image according to one embodiment of the present disclosure;
FIG. 12 is a schematic diagram of an implementation process of keypoint extraction in determining contour features, according to one embodiment of the disclosure;
FIG. 13 is a flow chart of generating an image description embedding vector according to one embodiment of the present disclosure;
FIG. 14 is a schematic diagram of an implementation process of generating an image description embedding vector according to one embodiment of the present disclosure;
fig. 15 is a flowchart of generating denoising network control information according to one embodiment of the present disclosure;
FIG. 16 is a schematic diagram of an implementation process of generating denoising network control information according to one embodiment of the present disclosure;
FIG. 17 is a flow chart of generating noise prediction results according to one embodiment of the present disclosure;
FIG. 18 is a schematic diagram of an implementation process of generating noise prediction results according to one embodiment of the present disclosure;
FIG. 19 is a flow chart of a denoising process according to one embodiment of the present disclosure;
FIG. 20 is a process diagram of an implementation of a downsampling process at the time of denoising process, according to one embodiment of the present disclosure;
FIG. 21 is a process diagram of an implementation of an upsampling process at the time of denoising according to one embodiment of the present disclosure;
FIG. 22 is a flow chart of a training image generation model according to one embodiment of the present disclosure;
FIG. 23 is a flow chart of determining a sub-loss function according to an embodiment of the present disclosure;
FIG. 24 is a schematic illustration of implementation details of a training image generation model according to one embodiment of the present disclosure;
FIG. 25 is a flow chart of an image generation method according to one embodiment of the present disclosure;
FIG. 26 is a flow chart of determining target stitched image features according to one embodiment of the present disclosure;
FIG. 27 is a flow chart of generating a target image according to one embodiment of the present disclosure;
FIG. 28 is a schematic illustration of an implementation of an image generation method according to one embodiment of the present disclosure;
FIG. 29 is a schematic illustration of implementation details of an image generation method according to one embodiment of the present disclosure;
FIG. 30 is a block diagram of a training apparatus for image generation models according to one embodiment of the present disclosure;
FIG. 31 is a block diagram of an image generation apparatus according to one embodiment of the present disclosure;
FIG. 32 is a terminal block diagram of a training method of an image generation model according to one embodiment of the present disclosure;
FIG. 33 is a server block diagram of a training method of an image generation model according to one embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, the present disclosure will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present disclosure.
Before proceeding to a further detailed description of the disclosed embodiments, the terms involved in the disclosed embodiments are explained as follows:
A cross-attention control module (Cross Attention Control Module): a module used in deep learning to establish a cross-attention mechanism among multiple inputs. The cross-attention control module can help the model automatically learn correlations among the inputs when processing them, thereby improving the model's performance.
The system architecture and scenario to which the embodiments of the present disclosure apply are described below.
Fig. 1 is an architecture diagram of a system to which the training method of an image generation model and the image generation method according to an embodiment of the present disclosure apply. The system includes an object terminal 140, the Internet 130, a gateway 120, an image processing server 110, an image database 150, and the like.
The object terminal 140 takes various forms, such as a desktop computer, a laptop computer, a PDA (personal digital assistant), a tablet computer, a cellular phone, a vehicle-mounted terminal, a home theater terminal, a smart television, or a dedicated terminal. In addition, it can be a single device or a set of multiple devices. The object terminal 140 may communicate with the Internet 130 in a wired or wireless manner to exchange data. The object terminal 140 includes an image processing system for receiving the object-selected background image and object image and submitting them to the image processing server, so that the image processing server generates a target image with the target object in the object image as the foreground and the background image as the background.
The image processing server 110 refers to a computer system capable of providing services to the object terminal 140. Compared with the general object terminal 140, the image processing server 110 must meet higher requirements in terms of stability, security, performance, and the like. The image processing server 110 may be one high-performance computer in a network platform, a cluster of multiple high-performance computers, a portion of one high-performance computer (e.g., a virtual machine), a combination of portions of multiple high-performance computers (e.g., virtual machines), or a cloud server. The image processing server 110 provides various services, and the implementation of each individual service is often associated with some intermediate database, storage medium, or the like. The image processing server 110 is configured to generate, using the trained image generation model, a target image with the target object as the foreground and the background image as the background, and the image database 150 is configured to store various images such as the target image and the background image.
Gateway 120 is also known as an inter-network connector or protocol converter. The gateway implements network interconnection above the transport layer and is a computer system or device that serves as a translator. The gateway translates between two systems that use different communication protocols, data formats, or languages, and even entirely different architectures. At the same time, the gateway may also provide filtering and security functions. A message transmitted from the object terminal 140 to the image processing server 110 is forwarded to the corresponding server through the gateway 120; a message transmitted from the image processing server 110 to the object terminal 140 is likewise forwarded to the corresponding object terminal 140 through the gateway 120.
The embodiments of the present disclosure may be applied in a variety of scenarios, such as the video production scenarios shown in fig. 2A-2C, and the like.
As shown in fig. 2A, when, in the video production process, an object needs a scene with object A as the foreground and background image 5 as the background, the object logs into the image processing system on the object terminal and enters the image processing flow. At this time, a hint field "please select a background image, upload a target object image, and input an image description" is displayed on the page, together with an editing area for selecting the background image, an editing area for selecting the target object image, and an editing area for inputting the image description. Based on this, the object selects "background image 5" in the editing area for selecting the background image; the object selects "D\object image of object A\image 1+image 2" in the editing area for selecting the target object image; the object inputs "replace object B in background image 5 with object A" in the editing area for inputting the image description, and clicks the "ok" button to confirm that object B in background image 5 is to be replaced with object A by means of background image 5 and image 1 and image 2 of object A.
As shown in fig. 2B, after the object clicks the "ok" button, a prompt window is displayed on the page, where the prompt window contains the prompt field "a mask map corresponding to background image 5 and an object outline map of object B are being generated, the image generation model is being called, and image synthesis is being performed according to the mask map and the object outline map; please wait".
As shown in fig. 2C, when the image generation model has finished executing, a prompt field "object B in background image 5 has been replaced with object A" is displayed on the page. Furthermore, the generated image has been saved to the path: d: composite image folder video production. A contrast display of background image 5 and the generated target image is also shown. Background image 5 contains bottle A (object B), the target image contains bottle B (object A), and bottle A in background image 5 is replaced with bottle B, with the background kept unchanged, by the image generation model in the image processing system.
It should be noted that, the image generating method of the embodiment of the present disclosure may be applied to various application scenarios, such as the video production scenario, the item display scenario in electronic commerce, the personalized image production scenario in social media, and the like.
The training method of the image generation model according to the embodiment of the present disclosure is generally described below.
According to one embodiment of the present disclosure, a training method of an image generation model is provided.
The training method of the image generation model is generally applied to a business scene in which a reference object (person, animal, article, etc.) in a fixed background image is replaced with a target object (person, animal, article, etc.), such as a video production scene, an article display scene, etc. shown in fig. 2A-2C. The embodiment of the disclosure provides a scheme for model training based on image description and object difference between an image to be generated and a background image, which can improve the accuracy of generating a target image by an image generation model.
As shown in fig. 3, a training method of an image generation model according to an embodiment of the present disclosure may be performed by an electronic device, which may be the image processing server or the object terminal shown in fig. 1, and the training method of the image generation model may include:
step 310, obtaining a plurality of image-text sample pairs;
Step 320, determining sample stitched image features based on the noise reference image, the background template image, and the template mask image;
Step 330, determining denoising network control information of an image generation model based on contour features of a reference object in a background template image, image description information and a sample object image;
Step 340, based on the sample stitched image features and the denoising network control information, performing noise prediction through the image generation model to obtain a noise prediction result of the noise reference image;
And 350, training an image generation model based on the comparison of the noise reference images and the noise prediction results of the image-text sample pairs.
Steps 310-350 are described in detail below.
In step 310, a plurality of image-text sample pairs are acquired.
In the embodiment of the disclosure, each image-text sample pair serves as one piece of training data, wherein each image-text sample pair includes a background template image, a noise reference image, image description information corresponding to the noise reference image, and a sample object image of a sample object.
The background template image refers to a background image to which a sample object is to be added, wherein the background template image contains a reference object to be replaced by the sample object in the sample object image.
A sample object image of a sample object refers to a series of images that reflect the object characteristics (object pose, object motion, object appearance, etc.) of the sample object.
The noise reference image is obtained by adding noise to the expected result of replacing the reference object in the background template image with the sample object.
The image description information corresponding to the noise reference image is used to indicate a condition constraint on the denoising process. For example, the image description information may be "generate character A in the background template image".
In step 320, sample stitched image features are determined based on the noise reference image, the background template image, and the template mask image.
The template mask image is obtained by masking a reference object in the background template image.
The sample stitched image features are used to indicate feature stitching results of the noise reference image, the background template image, and the template mask image.
In a specific implementation of this embodiment, first, the noise reference image, the background template image, and the template mask image are each mapped from pixel space to a latent vector space, obtaining a vector feature corresponding to the noise reference image, a vector feature corresponding to the background template image, and a vector feature corresponding to the template mask image. Then, feature stitching is performed on these three vector features to obtain the sample stitched image features.
In step 330, denoising network control information of the image generation model is determined based on the contour features of the reference object in the background template image, the image description information, and the sample object image.
The contour features of the reference object in the background template image are used to indicate contour features of actions, gestures, etc. of the reference object in the background template image.
The image generation model of the embodiment of the present disclosure is a neural network model constructed based on a diffusion model, such as Stable Diffusion (SD). The image generation model of the embodiment of the disclosure often takes as input a background image, an object image, information describing how the object in the object image replaces the object in the background image, and the like, and outputs a composite image with the scene of the background image as the background and the object in the object image as the foreground. The image generation model can thus embed an object at a specific position of an image, thereby meeting image generation requirements in various scenarios.
The denoising network control information is used as a condition constraint to assist the image generation model in denoising the image.
For the sake of brevity, a detailed description of a specific process of determining denoising network control information based on the contour features of the reference object in the background template image, the image description information, and the sample object image in the embodiments of the present disclosure will be described in detail below, and will not be repeated here.
In step 340, noise prediction is performed by the image generation model based on the sample stitched image features and the denoising network control information, so as to obtain a noise prediction result of the noise reference image.
The noise prediction result indicates the noise that the image generation model estimates to be contained in the image features corresponding to the noise reference image.
For the sake of space saving, a specific process of performing noise prediction through an image generation model based on the sample stitched image features and the denoising network control information according to the embodiments of the present disclosure will be described in detail below, and will not be described here.
In step 350, the image generation model is trained based on a comparison of the noise reference images and the noise prediction results for the plurality of image-text sample pairs.
In a specific implementation of this embodiment, the model parameters of the image generation model are adjusted according to the degree of noise difference between the noise reference image and the noise prediction result of each image-text sample pair, and steps 310-350 are repeated to continuously reduce this noise difference. When the noise differences between the noise reference images and the noise prediction results of the plurality of image-text sample pairs meet the model training requirement, the updating of the model parameters is stopped, the model parameters at that moment are taken as the final model parameters, and the image generation model with the final model parameters is taken as the trained image generation model.
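The stop-when-the-noise-difference-is-small-enough loop of step 350 can be illustrated with a toy sketch, in which a simple linear predictor stands in for the real denoising network; the learning rate, tolerance, and all names here are hypothetical, not taken from the disclosure.

```python
import numpy as np

def train_noise_predictor(features, true_noise, lr=0.1, max_steps=500, tol=1e-4):
    """Toy stand-in for the loop in steps 310-350: adjust parameters w so the
    predicted noise approaches the true noise, and stop once the noise
    difference (MSE) meets the hypothetical training requirement tol."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=features.shape[1])
    loss = np.inf
    for _ in range(max_steps):
        pred = features @ w                    # noise prediction result
        residual = pred - true_noise           # noise difference
        loss = float(np.mean(residual ** 2))
        if loss < tol:                         # training requirement met
            break
        # gradient step on the difference between prediction and reference noise
        w -= lr * features.T @ residual / len(true_noise)
    return w, loss

# synthetic stand-ins for image-text sample pairs: stitched features and
# the noise that was added to the corresponding reference images
rng = np.random.default_rng(1)
X = rng.normal(size=(64, 4))
noise = X @ np.array([0.5, -1.0, 0.25, 2.0])
w, final_loss = train_noise_predictor(X, noise)
```

The real model updates the denoising network's parameters by backpropagation rather than this closed-form gradient, but the stopping criterion on the noise difference is the same shape of logic.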
Through the above steps 310-350, in the embodiment of the present disclosure, when training the image generation model, an image-text sample pair is constructed from a background template image, a noise reference image, image description information corresponding to the noise reference image, and a sample object image of a sample object. The noise reference image is obtained by adding noise to the expected result of replacing the reference object in the background template image with the sample object, so that the reference image better fits the real situation. Then, the image features of the background template image, the object mask image of the background template, and the noise reference image are integrated into sample stitched image features, which serve as the input for model training; since the sample stitched image features carry multiple kinds of image feature information, using them as training data enriches the information content of the training data. In addition, the embodiment of the disclosure further introduces the image description information (which may indicate what the background and the object in the noise reference image are), the contour features of the reference object in the background template image (which may reflect the actions, postures, etc. of the reference object), and the sample object image of the sample object (which may reflect the contour, appearance, etc. of the sample object) to jointly generate the denoising network control information, so that the denoising network control information carries multiple constraint conditions.
Further, the denoising network control information and the sample stitched image features are input into the image generation model, so that the image generation model is constrained by the denoising network control information when performing noise prediction on the sample stitched image features, and the noise prediction process is corrected according to the denoising network control information, making the noise prediction result of the noise reference image output by the image generation model more accurate. Finally, the image generation model is trained by comparing the noise prediction result against the noise reference image, so as to obtain an image generation model meeting the training requirement, which can improve the accuracy with which the model generates the target image.
The above is a general description of steps 310-350. The detailed description will be developed below for specific implementations of steps 310, 320, 330, 340 and 350.
Step 310 is described in detail below.
In step 310, a plurality of image-text sample pairs are acquired, wherein each image-text sample pair includes a background template image, a noise reference image, image description information corresponding to the noise reference image, and a sample object image of the sample object, and wherein the noise reference image is obtained by adding noise to an expected result of replacing a reference object in the background template image with the sample object.
Referring to fig. 4, in one embodiment, a sample object image of a sample object is determined by:
Step 410, obtaining a sample image of a sample object;
step 420, performing image segmentation on the sample image based on a preset object segmentation model to obtain a sample segmentation image with a sample object;
And step 430, performing image enhancement on the sample segmentation image to obtain a sample object image.
Steps 410-430 are described in detail below.
In step 410, the sample image is an image containing a sample object.
In a specific implementation of this embodiment, with proper authorization, a plurality of sample images of the sample object can be extracted from an existing image database; alternatively, video data containing the sample object can be extracted from an existing video database, the video data can be split into video frames, and each video frame containing the sample object can be taken as a sample image.
In step 420, the sample segmentation image refers to a local image with a sample object segmented from the sample image, the sample segmentation image being part of the sample image.
The object segmentation model may be a lightweight semantic segmentation model such as BiSeNet-v2 or PP-LiteSeg.
Taking the case where the object segmentation model is the semantic segmentation model PP-LiteSeg as an example, the object segmentation model comprises an encoding module, a pyramid pooling module, a decoding module, and an attention fusion module. Specifically, first, the sample image is input to the encoding module of the object segmentation model, and the sample image is multi-scale encoded by the encoding module, sequentially obtaining a first sample image feature whose image size is one fourth of the sample image, a second sample image feature whose image size is one eighth of the sample image, a third sample image feature whose image size is one sixteenth of the sample image, and a fourth sample image feature whose image size is one thirty-second of the sample image. Then, feature pooling is performed on the fourth sample image feature through the pyramid pooling module to obtain a pooled sample image feature. Further, feature fusion is performed on the pooled sample image feature and the third sample image feature through the attention fusion module to obtain a first image fusion feature. Then, feature fusion is performed on the first image fusion feature and the second sample image feature through the attention fusion module to obtain a second image fusion feature. Finally, the second image fusion feature is decoded by the decoding module to obtain a sample segmentation image containing the sample object.
In step 430, image enhancement to the sample-segmented image includes, but is not limited to, contrast enhancement, brightness enhancement, sharpening, noise reduction, and the like to the sample-segmented image. Specifically, first, when image enhancement is performed on a sample-divided image, a linear stretching or logarithmic transformation method is used to expand the pixel value distribution of the sample-divided image so that the bright-dark areas of the sample-image-divided image are more distinct, to achieve contrast enhancement of the sample-divided image. Then, after the contrast enhancement, the brightness values of all the pixels of the sample divided image are adjusted to improve the overall brightness of the sample divided image. Further, after brightness adjustment, a gaussian filter or a median filter is used to smooth the sample-divided image and reduce noise in the sample-divided image. Finally, after noise reduction, edge enhancement is carried out on the sample segmentation image through the Laplacian, so that sharpening processing of the sample segmentation image is realized, and a sample object image is obtained.
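The enhancement chain of step 430 (contrast stretch, brightness offset, smoothing, then sharpening) can be sketched in NumPy; the offsets, kernel size, and gains below are assumed illustrative values, and the mean filter and unsharp masking stand in for the Gaussian/median filtering and Laplacian sharpening named in the text.

```python
import numpy as np

def enhance(img):
    """Illustrative version of step 430's enhancement chain on a grayscale
    image with values in [0, 255]."""
    img = img.astype(np.float64)
    # 1. contrast enhancement: linearly stretch the pixel-value distribution
    lo, hi = img.min(), img.max()
    img = (img - lo) / max(hi - lo, 1e-9) * 255.0
    # 2. brightness enhancement: uniform offset on all pixels
    img = np.clip(img + 10.0, 0.0, 255.0)
    # 3. noise reduction: 3x3 mean filter as a simple smoothing stand-in
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    smooth = sum(padded[i:i + h, j:j + w]
                 for i in range(3) for j in range(3)) / 9.0
    # 4. sharpening: unsharp masking (add back the high-frequency residual)
    return np.clip(img + (img - smooth), 0.0, 255.0)

out = enhance(np.arange(64, dtype=np.float64).reshape(8, 8))
```

In practice a library such as OpenCV would supply the filters; the point here is only the ordering of the four processing stages.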
The embodiment has the advantages that the image segmentation is carried out on the sample image of the sample object, the local image with the sample object is segmented from the sample image, the interference of irrelevant image information can be better eliminated, and the image quality of the sample segmented image for training is improved. Further, various image enhancement processes such as contrast enhancement, brightness enhancement, sharpening, noise reduction and the like are adopted for the sample segmentation image, so that the image quality of the sample object image can be further improved.
Referring to fig. 5, in one embodiment, the noise reference image is determined by:
step 510, determining an expected result of replacing the reference object in the background template image with the sample object;
Step 520, generating random numbers obeying Gaussian distribution based on a predetermined random number generation model;
And 530, adding random numbers to pixel values of the pixel points aiming at each pixel point in the expected result to obtain a noise reference image.
Steps 510-530 are described in detail below.
In step 510, the expected result is image data.
In a specific implementation of this embodiment, image editing software may be utilized to replace the reference object of the background template image with the sample object, and the resulting replaced image is treated as the expected result of replacing the reference object in the background template image with the sample object. The image editing software may be software such as Adobe Photoshop.
In step 520, the predetermined random number generation model refers to a random number generator, a random number being a randomly generated number; the random numbers obeying a Gaussian distribution generated by embodiments of the present disclosure may repeat.
In a specific implementation of this embodiment, the predetermined random number generation model provides a library function (the numpy.random.normal() function). Specifically, random numbers are generated through this library function according to a Gaussian distribution with mean 0 and standard deviation 1, so that the random numbers obey the Gaussian distribution. The random numbers obeying the Gaussian distribution can be expressed in the form of a random noise map whose image size is the same as the image size of the expected result.
In step 530, for each pixel in the expected result, first, a pixel value of the pixel in the expected result is determined, and a noise value of the pixel in the random noise map corresponding to the random number is determined. And then, adding the pixel value of the pixel point and the noise value to obtain the noise added pixel value of the pixel point. And finally, obtaining a noise reference image according to the noise adding pixel value of each pixel point.
As shown in fig. 6, a schematic diagram of a specific implementation process of superimposing random noise on the expected result is shown. The expected result is an 8 x 8 image with 64 pixels. The random noise map corresponding to the random number subjected to Gaussian distribution is also an 8×8 image, and the random noise map has 64 pixel points. Specifically, for each pixel point of the expected result, determining a pixel value of each pixel point, wherein the pixel value of each pixel point comprises 1,2,3,4,5 or 6. Next, for each pixel point, a noise value of each pixel point in the random noise map is determined, wherein the noise value includes 0,1, 2,3,4,5, or 6. Further, for each pixel point, the pixel value and the noise value are added to obtain a noise added pixel value of each pixel point. For example, if the pixel value of the first pixel is 1 and the noise value is 5, the noise pixel value is 1+5=6. The pixel value of the second pixel point is 1, the noise value is 1, and the noise added pixel value is 1+1=2. The pixel value of the third pixel point is 1, the noise value is 4, and the noise adding pixel value is 1+4=5; and the like until the noise adding pixel value of the last pixel point of the last row is determined, and generating a noise reference image.
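The pixel-wise noise addition of steps 520-530 can be sketched in a few lines; the seed and the constant-valued toy "expected result" are illustrative choices, not values from the disclosure.

```python
import numpy as np

def add_gaussian_noise(expected, seed=0):
    """Steps 520-530 in miniature: draw a standard-normal random noise map
    with the same size as the expected result (mean 0, standard deviation 1,
    as in the text) and add it to the expected result pixel by pixel."""
    rng = np.random.default_rng(seed)
    noise_map = rng.normal(loc=0.0, scale=1.0, size=expected.shape)
    return expected + noise_map

expected = np.full((8, 8), 3.0)   # toy 8x8 expected result
noisy = add_gaussian_noise(expected)
```

Each output pixel is exactly "pixel value plus noise value", matching the worked 8×8 example in the text (e.g., pixel value 1 plus noise value 5 gives noised pixel value 6).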
The embodiment has the advantages that random noise conforming to Gaussian distribution is added to the expected result of replacing the reference object in the background template image with the sample object, and environmental noise is introduced into the idealized reference image, so that the finally obtained expected result is more fit with the actual situation, the authenticity and the accuracy of the noise reference image are improved, and the noise reference image has better referenceability.
Step 320 is described in detail below.
In step 320, sample stitched image features are determined based on the noise reference image, the background template image, and a template mask image, wherein the template mask image is obtained by masking a reference object in the background template image.
Referring to FIG. 7, in one embodiment, step 320 includes, but is not limited to, steps 710-740 including:
Step 710, performing a first encoding process on the noise reference image to obtain a reference image encoding feature;
step 720, performing second coding processing on the background template image to obtain coding characteristics of the background image;
step 730, performing third coding processing on the mask image to obtain mask image coding characteristics;
Step 740, stitching the reference image coding feature, the background image coding feature and the mask image coding feature to obtain a sample stitched image feature.
Steps 710-740 are described in detail below.
In step 710, the reference image encoding feature is used to indicate the result of converting the noise reference image from pixel space to the latent vector space.
In a specific implementation of this embodiment, a preset image encoder may be used to perform the first encoding process on the noise reference image, converting the noise reference image from pixel space to the latent vector space so as to capture the key image information of the noise reference image, including but not limited to texture information, edge information, corner information, and the like of the noise reference image, thereby obtaining the reference image encoding feature.
In step 720, the background image encoding feature is used to indicate the result of converting the background template image from pixel space to the latent vector space.
In the specific implementation of this embodiment, the specific process of step 720 is similar to the specific process of step 710 described above. The difference is that the two images to be encoded are different, and the encoder parameters of the adopted image encoder are different. For the sake of space saving, the description is omitted.
In step 730, the mask image encoding feature is used to indicate the result of converting the template mask image from pixel space to the latent vector space.
In a specific implementation of this embodiment, the specific process of step 730 is similar to the specific process of step 710 described above. The difference is that the images to be encoded are different and the encoder parameters of the image encoders used are different. For the sake of space saving, the description is omitted.
In step 740, feature stitching is performed on the reference image encoding feature, the background image encoding feature, and the mask image encoding feature: the three features are stitched into one vector feature with a larger number of feature channels, and the stitched vector feature is determined as the sample stitched image feature.
For example, the feature dimensions of the reference image encoding feature, the background image encoding feature, and the mask image encoding feature are each W×H×4, where W is the feature length, H is the feature height, and 4 is the number of feature channels of each of the three features. Through the feature stitching operation, the feature dimension of the obtained sample stitched image feature is W×H×12.
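The channel-wise stitching of step 740 can be sketched at the shape level; the concrete encoders that produce the three latent features are outside this sketch, and the dimensions are the illustrative W×H×4 ones from the example above.

```python
import numpy as np

def stitch_features(ref_feat, bg_feat, mask_feat):
    """Step 740 as a shape-level sketch: concatenate three W x H x 4 latent
    features along the channel axis into one W x H x 12 sample stitched
    image feature."""
    assert ref_feat.shape == bg_feat.shape == mask_feat.shape
    return np.concatenate([ref_feat, bg_feat, mask_feat], axis=-1)

W, H = 64, 64
ref = np.zeros((W, H, 4))    # reference image encoding feature
bg = np.ones((W, H, 4))      # background image encoding feature
mask = np.zeros((W, H, 4))   # mask image encoding feature
stitched = stitch_features(ref, bg, mask)
```

Only the channel count grows (4 + 4 + 4 = 12); the spatial extent W×H is unchanged, which is what lets the downstream network treat the stitched feature as a single multi-channel image.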
The benefit of this embodiment is that the image information of the noise reference image, the background template image, and the template mask image is converted from pixel space to the latent vector space, and the three images are stitched, in the form of their latent-space image information (the reference image encoding feature, the background image encoding feature, and the mask image encoding feature), into one sample stitched image feature carrying the noisy expected image information, the template image information, and the reference object mask information. Furthermore, the sample stitched image feature is used as the input of the image generation model during model training, so that the expected image generation information, the background template information, and the reference object information in the background template are fused in the model's input data. This better improves the richness and comprehensiveness of the feature information of the sample stitched image feature, helps the model under training learn and mine multiple kinds of image information, and improves the image generation accuracy of the model.
Referring to fig. 8, in one embodiment, the stencil mask image is determined by:
Step 810, determining an object contour area of a reference object in a background template image;
and step 820, in the background template image, replacing the pixel value of each pixel point in the object contour area with a first value, and replacing the pixel value of each pixel point outside the object contour area with a second value, so as to obtain the template mask image.
Steps 810-820 are described in detail below.
In step 810, the object contour region is used to indicate the minimum bounding rectangular region corresponding to the object contour of the reference object.
In a specific implementation of this embodiment, first, the pixel points representing the reference object are determined in the background template image. Next, a two-dimensional coordinate system is constructed with the pixel point at the upper left corner of the background template image as the origin, the height direction of the background template image as the vertical axis, the width direction of the background template image as the horizontal axis, and the distance between two adjacent pixel points as one unit length. Further, the coordinate data of each pixel point representing the reference object is determined based on the constructed two-dimensional coordinate system, and the maximum abscissa x_max, the minimum abscissa x_min, the maximum ordinate y_max, and the minimum ordinate y_min are screened out from the coordinate data of these pixel points. Finally, the minimum circumscribed rectangular area corresponding to the object contour of the reference object is determined based on the maximum abscissa, the minimum abscissa, the maximum ordinate, and the minimum ordinate, and this minimum circumscribed rectangular area is determined as the object contour area of the reference object. The coordinates of the four endpoints of the minimum circumscribed rectangular area are (x_min, y_min), (x_max, y_min), (x_min, y_max), and (x_max, y_max), respectively.
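The minimum-bounding-rectangle computation of step 810 reduces to taking coordinate extrema over the object's pixels; the sketch below assumes, for illustration, that the reference object's pixels are available as a boolean mask.

```python
import numpy as np

def object_contour_region(object_mask):
    """Step 810 as a sketch: compute the minimum circumscribed rectangle of
    the reference object's pixels. object_mask is a boolean H x W array
    (an assumed input format) marking the object's pixels; returns
    (x_min, y_min, x_max, y_max) with the origin at the top-left pixel,
    as in the coordinate system described in the text."""
    ys, xs = np.nonzero(object_mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

m = np.zeros((8, 8), dtype=bool)
m[2:6, 3:7] = True                 # object occupies rows 2-5, columns 3-6
box = object_contour_region(m)
```

The four rectangle endpoints follow directly from the returned extrema as (x_min, y_min), (x_max, y_min), (x_min, y_max), and (x_max, y_max).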
In step 820, the first value refers to a value of 1 that causes the pixel to appear white, and the second value refers to a value of 0 that causes the pixel to appear black.
In a specific implementation of this embodiment, first, in the background template image, the object contour area and the other areas outside the object contour area are distinguished. Next, for each pixel point in the object contour area, the pixel value of the pixel point is replaced with the first value, so that the object contour area appears white. For each pixel point outside the object contour area, the pixel value of the pixel point is replaced with the second value, so that the areas outside the object contour area appear black. The template mask image is thereby obtained from the pixel value changes of the pixel points of the background template image. The template mask image typically appears as a black-and-white image.
As shown in fig. 9A, a brief illustration of masking processing at the pixel level is shown. Specifically, in the background template image with the image size of 8×8, the smallest circumscribed rectangular area corresponding to the object contour of the reference object is a 4×4 image area, where the object contour area includes 3 pixels with 9 pixel values, 2 pixels with 8 pixel values, 2 pixels with 1 pixel value, 2 pixels with 4 pixel values, and 7 pixels with 2 pixel values. Based on this, the pixel values of the 16 pixels are replaced with 1, and the pixel values of the pixels other than the 16 pixels in the background template image are replaced with 0, so as to obtain a template mask image in which the pixel value of one pixel is composed of 0 or 1. Wherein, in the template mask image, the object outline area of the reference object appears white, and other parts except the object outline area appear black.
As shown in fig. 9B, a brief illustration of the masking process at the image level. Specifically, the background template image contains a reference object (character), and a plurality of other objects (five-pointed star and polygon). Based on the method, a minimum circumscribed rectangle corresponding to the outline of a reference object (character) is determined, the pixel value of each pixel point in the minimum circumscribed rectangle is set to be 1, the pixel value of each pixel point except the minimum circumscribed rectangle in a background template image is set to be 0, mask processing of the background template image is achieved, and a black-and-white template mask image is obtained, wherein the outline area of the object is a white rectangular area.
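The pixel-level masking of step 820 can be sketched directly; the 8×8 image size and the box coordinates below are illustrative, echoing the fig. 9A example.

```python
import numpy as np

def template_mask(image_shape, box):
    """Step 820 as a sketch: pixels inside the object contour region get the
    first value (1, shown as white) and all other pixels get the second
    value (0, shown as black)."""
    x_min, y_min, x_max, y_max = box
    mask = np.zeros(image_shape, dtype=np.uint8)
    mask[y_min:y_max + 1, x_min:x_max + 1] = 1
    return mask

tm = template_mask((8, 8), (2, 2, 5, 5))   # 4x4 white region in an 8x8 image
```

The result is exactly the binarized black-and-white template mask image described in the text: a white rectangle over the reference object's contour area on an otherwise black image.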
The method has the advantages that the object outline area of the reference object is determined in the background template image, the image position of the reference object in the background template image can be clearly determined, the binarization processing (mask processing) of the background template image is realized through the pixel value conversion of each pixel point, the distinguishing degree of the pixel positions occupied by the reference object and the non-reference object in the background template image can be effectively enlarged, the model can better explore the object outline characteristics of the reference object based on the template mask image, and therefore the accuracy of replacing the reference object with the target object is improved.
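The pixel-level masking described above can be sketched as follows; a minimal NumPy sketch assuming the minimum circumscribed rectangle of the object contour is already known (all names and the 8×8 size are illustrative, matching Fig. 9A, not the patented implementation):

```python
import numpy as np

def make_template_mask(background, box, first_value=1, second_value=0):
    """Binarize a background template image into a template mask image.

    `box` = (row, col, height, width) of the minimum circumscribed
    rectangle of the reference object's contour region (an assumed input).
    """
    r, c, h, w = box
    # All pixels outside the contour region take the second value (black)
    mask = np.full(background.shape[:2], second_value, dtype=np.uint8)
    # All pixels inside the contour region take the first value (white)
    mask[r:r + h, c:c + w] = first_value
    return mask

# 8x8 background template image with an assumed 4x4 contour region (cf. Fig. 9A)
background = np.random.randint(0, 10, size=(8, 8))
mask = make_template_mask(background, box=(2, 2, 4, 4))
```

The resulting mask contains exactly 16 white pixels, mirroring the 4×4 region of the figure.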
Step 330 is described in detail below.
In step 330, denoising network control information of the image generation model is determined based on the contour features of the reference object in the background template image, the image description information, and the sample object image.
Referring to FIG. 10, in one embodiment, step 330 includes, but is not limited to, steps 1010-1030 comprising:
step 1010, coding the image description information to obtain an image description embedded vector;
step 1020, extracting features of the sample object image to obtain sample object feature data;
step 1030, performing control information generation through a preset control network based on the image description embedded vector, the sample object feature data and the outline feature of the reference object, and obtaining denoising network control information.
Steps 1010-1030 are described in detail below.
In step 1010, the image description embedded vector is used to indicate a vector representation of a conditional constraint on the denoising process in the image description information.
In a specific implementation of this embodiment, the text encoder may be used to encode the image description information, and convert the image description information from the data space to the vector space, thereby obtaining the image description embedded vector. The text encoder comprises a word segmentation device, an embedding layer and a text attention calculating module.
Specifically, first, the image description information is input into a word segmentation device, and the word segmentation device is utilized to segment the image description information to obtain a plurality of description words. Then, word embedding processing is carried out on each descriptor by utilizing an embedding layer, each descriptor is converted into a vector form, and a descriptor vector corresponding to each descriptor is obtained. And finally, performing attention calculation on the descriptor vectors of the descriptors by using a text attention calculation module to obtain an image description embedded vector.
In step 1020, sample object feature data is used to indicate object features that the sample object to be replaced has; the sample object feature data may direct a denoising network of the image generation model to restore objects in the image to more conform to the sample object during denoising.
For example, when the sample object is a person, the sample object feature data includes, but is not limited to, facial contours, facial features, and the like.
It should be noted that the sample object feature data may be obtained through a fine-tuning network generated by a fine-tuning technique (the LoRA technique) based on a deep learning model. The fine-tuning network is a neural network based on a cross-attention algorithm.
In this embodiment, first, a sample object image is input into a fine adjustment network, and the sample object image is linearly projected by the fine adjustment network, so as to generate a key vector, a value vector and a query vector corresponding to the sample object image. And then, performing cross attention calculation by using the key vector, the value vector and the query vector to obtain a calculation result, and converting the calculation result into a vector form to obtain sample object characteristic data.
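The projection-then-attention step performed by the fine-tuning network can be sketched roughly as follows; a minimal NumPy sketch in which the sample object image is linearly projected into query, key, and value vectors before attention is computed, with the result converted into vector form (all shapes and weight matrices are illustrative assumptions, not the network's actual parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Assumed shapes: 16 patches of the sample object image, 32-dim patch
# features, projected down to 8-dim query/key/value vectors
patches = rng.standard_normal((16, 32))
W_q, W_k, W_v = (rng.standard_normal((32, 8)) for _ in range(3))

# Linear projection of the sample object image into Q, K, V
Q, K, V = patches @ W_q, patches @ W_k, patches @ W_v

# Attention calculation, then conversion of the result into vector form
attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
sample_object_features = attn.reshape(-1)
```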
In step 1030, a preset control network is used to generate control information for the denoising network of the image generation model based on the plurality of input data. Wherein the input data allowed by the preset control network comprises condition constraints of the image to be generated, object line manuscript outlines in the background template image, object features of the object to be replaced, and the like.
It should be noted that, in order to improve accuracy of the denoising network control information output by the control network, the control network of the embodiment of the disclosure may also be a neural network based on a cross-attention algorithm.
The denoising network control information is used for guiding and controlling the focus and study of the image generation model on each characteristic information in the denoising process, and the denoising network control information is also used for indicating the fine adjustment degree of the characteristic information of various images in the denoising process.
For the sake of saving space, the specific process of generating control information through a preset control network and obtaining denoising network control information in the embodiments of the present disclosure will be described in detail below, which is not described here again.
The embodiment has the advantages that the image description information, the outline characteristics of the reference object in the background template image and the object characteristics of the sample object are utilized to jointly generate the denoising network control information of the image generation model, so that denoising can be performed according to various constraint information in the image denoising process, and accurate fine-tuning and flexible control of image denoising are realized. The mode can enable the model to carry out image denoising under the constraint of various dimensions, improves the image denoising capability of the model, is beneficial to generating image content which is more in line with image description by the model, and can enable the motion, the gesture and the like of an object in the image finally generated by the model and a reference object in a template image to be more approximate.
Referring to FIG. 11, in one embodiment, the contour features of the reference object in the background template image are determined by:
step 1110, performing object detection on the background template image to obtain an object skeleton diagram of the reference object;
Step 1120, extracting gesture features of the object skeleton graph to obtain a plurality of object gesture key points;
Step 1130, determining contour features based on the plurality of object pose keypoints.
Steps 1110-1130 are described in detail below.
In step 1110, an object skeleton map is used to indicate skeleton contours of reference objects in the background template image.
In the embodiment, first, object positioning is performed in a background template image by using a preset detection algorithm to identify the position of a reference object, so as to obtain an object detection result, where the object detection result is formed by a plurality of pixel points forming the reference object. The preset detection algorithm includes, but is not limited to, YOLO algorithm, SSD algorithm, and the like. Then, an object skeleton map of the reference object is determined based on the pixel points included in the object detection result.
In step 1120, object pose keypoints are used to indicate the motion and pose that the reference object presents in the background template image.
In a specific implementation of this embodiment, first, the object skeleton diagram is input into a preset gesture estimation model, and coordinate positioning is performed on the pixel points of the important parts of the reference object in the object skeleton diagram by using the preset gesture estimation model, so as to obtain a set of key point coordinates. Then, each key point coordinate output by the gesture estimation model is determined as an object gesture key point. The preset gesture estimation model includes, but is not limited to, neural network models based on deep learning algorithms such as PoseNet and AlphaPose.
In step 1130, first, according to the relative positions and association relationships of the object posture key points, a plurality of object posture key points are connected to form a complete object line manuscript. And then, converting the object line manuscript into an image expression form acceptable by a control network to obtain the outline characteristics of the reference object.
As shown in fig. 12, a key point detection process for a person (reference object) in one background template image. Specifically, character key point extraction is carried out on the background template image by using a key point extraction algorithm, so that the face key points and the skeleton key points of the reference object are obtained. The facial key points are used for reflecting facial outline characteristics and five sense organs characteristics of the reference object, and the bone key points are used for reflecting gesture characteristics of the reference object in the background template image. Based on this, the face key point and the bone key point are collectively determined as the object posture key point of the reference object.
The embodiment has the advantages that when the outline characteristics of the reference object are determined, the reference object is positioned in the background template image, and the skeleton characteristics of the reference object are constructed according to the position of the reference object. Further, the gesture estimation model is utilized to detect key points in the skeleton diagram, so that the determination efficiency and the determination accuracy of gesture key points can be improved, the outline features of the reference object are drawn according to the relative positions of a plurality of gesture key points, the determination accuracy of the outline features can be improved, and the accuracy of denoising network control information can be improved.
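The step of connecting the object gesture key points into an object line manuscript according to their relative positions and association relationships can be sketched as follows; a minimal NumPy rasterization under assumed keypoints and skeleton edges (all names, coordinates, and the 64×64 canvas are illustrative):

```python
import numpy as np

def draw_line_draft(keypoints, edges, size=(64, 64)):
    """Rasterize gesture key points into an object line manuscript
    (white lines on a black canvas) acceptable to a control network."""
    canvas = np.zeros(size, dtype=np.uint8)
    for a, b in edges:  # association relationship between two keypoints
        (r0, c0), (r1, c1) = keypoints[a], keypoints[b]
        n = max(abs(r1 - r0), abs(c1 - c0)) + 1
        rows = np.linspace(r0, r1, n).round().astype(int)
        cols = np.linspace(c0, c1, n).round().astype(int)
        canvas[rows, cols] = 255
    return canvas

# Hypothetical keypoints for a person: head, neck, left hand, right hand
keypoints = {"head": (8, 32), "neck": (16, 32),
             "l_hand": (24, 16), "r_hand": (24, 48)}
edges = [("head", "neck"), ("neck", "l_hand"), ("neck", "r_hand")]
line_draft = draw_line_draft(keypoints, edges)
```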
Converting the image description information into an embedded vector by a conventional encoding method often leads to inaccurate conditional constraint content for image generation. The embodiment of the disclosure provides a scheme for encoding the image description information based on a fine-tuning technique, which can improve the accuracy of the generated image description embedded vector, and thereby improve the effectiveness of the conditional constraint content used to generate the denoising network control information.
It should be noted that, the encoding processing of the image description information according to the embodiments of the present disclosure is implemented based on a preset text encoder (text encoder), where the text encoder includes a word segmentation unit (tokenizer), an embedding layer (embedding), and a text attention computation module (text transformer).
Referring to FIG. 13, in one embodiment, step 1010 includes, but is not limited to, steps 1310-1340 including:
Step 1310, word segmentation is carried out on the image description information to obtain a plurality of description words;
step 1320, determining a target word from a plurality of description words, and searching a target word embedding feature corresponding to the target word based on a preset dictionary;
Step 1330, aiming at other descriptive words except the target word in the descriptive words, carrying out word embedding processing on each other descriptive word to obtain descriptive word embedding characteristics of the other descriptive words;
Step 1340, integrating the target word embedding feature and the descriptor embedding feature into an image descriptor embedding vector.
Steps 1310-1340 are described in detail below.
In step 1310, the descriptor is used to indicate lexical information contained in the image description information.
In a specific implementation of this embodiment, first, the image description information may be input into a word segmentation unit of a text encoder. Then, the word segmentation device performs word segmentation on the image description information according to the space, the punctuation mark or the separator to obtain a plurality of words. Further, the stop words in the separated words are removed, and a plurality of description words are obtained.
In step 1320, the target word refers to a word in the image description information that needs to be searched for a corresponding embedded vector representation by a vocabulary search, and the target word embedded feature is used to indicate a vector representation of its target word in the image description information.
The preset dictionary is preset in the embedding layer, and the preset dictionary of the embedding layer is used for indicating the corresponding relation between the index of each descriptive word and the word embedding characteristic.
In this embodiment, first, a descriptor that is predetermined to be represented by a fixed character is selected from the plurality of descriptors as the target word. Then, the index corresponding to the target word is input into the embedding layer of the text encoder, a feature lookup is performed in the preset dictionary of the embedding layer based on the index, and the word embedding feature corresponding to the index is found and taken as the target word embedding feature.
In step 1330, the descriptor embedding feature is used to indicate the vector representation of other descriptors in the image description information.
In the specific implementation of this embodiment, for the other descriptors than the target word among the plurality of descriptors, first, the index of each of the other descriptors is determined based on the word segmentation result of the word segmentation unit. And then, carrying out word embedding processing on each other descriptive word by utilizing an embedding layer of the text encoder, and searching word embedding characteristics corresponding to indexes of each other descriptive word in a preset dictionary of the embedding layer to obtain descriptive word embedding characteristics of the other descriptive words.
In step 1340, first, the target word embedding feature and the descriptor embedding feature are input to the text attention calculating module, and the descriptor embedding vectors of the respective descriptors are output by the text attention calculating module. And then, determining the number of each descriptor according to the sequence of each descriptor in the image description information. Further, according to the number corresponding to the target word embedding feature and the number corresponding to the descriptor embedding feature, the descriptor embedding vectors are spliced according to the size sequence of the numbers, and a complete descriptor embedding sequence is obtained. And finally, determining the spliced descriptor embedding sequence as an image description embedding vector.
As shown in fig. 14, a specific illustration of a process of generating an image description vector as a conditional constraint based on image description information, and controlling image denoising based on the conditional constraint, is shown. Specifically, the input image instance is a clock image. The image description information for the input image instance is "a Photo of clock", and the "clock" in the image description information is taken as the target word, denoted "S*". At this time, the image description information becomes "a Photo of S*". Based on this, the image description information is input into the word segmentation device, word segmentation is carried out on the image description information by the word segmentation device, and the index of each word is determined according to the preset index corresponding to each candidate word in the preset vocabulary index table. The index corresponding to the descriptor "a" is 508, the index corresponding to the descriptor "Photo" is 701, the index corresponding to the descriptor "of" is 73, and the index corresponding to the descriptor "S*" is x. Further, word embedding processing is carried out on the indexes of the descriptors through the embedding layer, the index corresponding to each descriptor is mapped from the numerical space to the vector space, and the word embedding feature corresponding to each descriptor is obtained: the word embedding feature corresponding to the descriptor "a" is v508; the word embedding feature corresponding to the descriptor "Photo" is v701; the word embedding feature corresponding to the descriptor "of" is v73; and the word embedding feature corresponding to the descriptor "S*" is v*.
Further, the word embedding features of the descriptors are input into the text attention calculating module for attention calculation, and the image description vector is output by the text attention calculating module. The image description vector is taken as the conditional constraint for denoising the noise image instance through the image generator, so that the predicted image instance output by the image generator fits the content represented by the image description information. Specifically, the noise image instance is obtained by performing 4 diffusion operations (4 noise addition operations) on the input image instance, and the predicted image instance is obtained by the image generator performing 4 denoising operations on the noise image instance in accordance with the image description vector.
The embodiment has the advantage that, through the fine-tuning technique, the descriptor that is preset to be represented by a fixed character in the image description information is taken as the target word, the word embedding representation of the target word is fine-tuned and looked up through the text encoder to obtain the target word embedding feature corresponding to the target word, and word embedding processing is performed on the other descriptors, so that each descriptor is converted into an embedding vector and the accuracy of the resulting image description embedded vector is improved.
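The lookup-based encoding of steps 1310-1340 can be sketched as follows; a minimal Python sketch in which the target word "S*" carries a dedicated fine-tuned embedding while the other descriptors are looked up in the preset dictionary (the 4-dimensional embeddings and vocabulary indexes are illustrative assumptions, not the actual dictionary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Preset dictionary of the embedding layer: index -> word embedding feature
vocab_index = {"a": 508, "photo": 701, "of": 73}
preset_dict = {i: rng.standard_normal(4) for i in vocab_index.values()}

# The target word "S*" uses a dedicated fine-tuned embedding feature
target_word, target_embedding = "S*", rng.standard_normal(4)

def encode(description):
    # Word segmentation by spaces (a simplified tokenizer)
    tokens = description.lower().replace("s*", "S*").split()
    feats = []
    for tok in tokens:
        if tok == target_word:
            feats.append(target_embedding)          # target word lookup
        else:
            feats.append(preset_dict[vocab_index[tok]])  # dictionary lookup
    # Splice the features in token order into one embedding sequence
    return np.stack(feats)

embeds = encode("a Photo of S*")
```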
In an embodiment of the present disclosure, the control network includes a first control subnetwork, and a second control subnetwork.
The first control sub-network and the second control sub-network are neural network structures capable of performing cross-attention computation. The first control sub-network and the second control sub-network have the same network structure, but have different network parameters.
The sample object feature data includes a first sample object feature, a second sample object feature, a third sample object feature, and a fourth sample object feature, wherein the first sample object feature, the second sample object feature, the third sample object feature, and the fourth sample object feature are obtained by performing respective feature extraction on the sample object images.
The first sample object feature, the second sample object feature, the third sample object feature, and the fourth sample object feature are all used to fine-tune the denoising process of the image generation model, but the fine-tuning performed by each of them is different. All four features are obtained through the fine-tuning network described above; however, the network parameters of the fine-tuning network used to extract the first, second, third, and fourth sample object features are different.
Referring to FIG. 15, in one embodiment, step 1030 includes, but is not limited to, steps 1510-1550 including:
step 1510, inputting the image description embedded vector, the first sample object feature and the contour feature into a first control sub-network to generate control information, so as to obtain first control information;
Step 1520, inputting the image description embedded vector and the second sample object feature to a second control sub-network for control information generation, to obtain second control information;
step 1530, determining upsampled network control information based on the third sample object feature and the fourth sample object feature;
step 1540, determining downsampled network control information based on the first control information, the second control information, and the first sample object feature, and the second sample object feature;
step 1550, integrating the up-sampling network control information and the down-sampling network control information into denoising network control information.
Steps 1510-1550 are described in detail below.
In step 1510, the first control information is used to fine tune intermediate results of the denoising network of the image generation model at the time of downsampling.
In a specific implementation of this embodiment, first, the image description embedding vector, the first sample object feature, and the contour feature are input to the first control subnetwork. Then, the first sample object feature and the outline feature are spliced to obtain a spliced feature vector. Further, linear projection is carried out on the spliced feature vector through a first control sub-network to obtain a query matrix vector; and performing linear projection on the image description embedded vector through the first control sub-network to obtain a key matrix vector and a value matrix vector. Further, cross attention calculation is performed based on the query matrix vector, the key matrix vector and the value matrix vector, and an attention calculation result is obtained. And finally, converting the attention calculation result from a numerical space to a vector space through a first control sub-network to obtain first control information.
The specific process of obtaining the attention calculation result by performing cross attention calculation based on the query matrix vector, the key matrix vector and the value matrix vector can be expressed as shown in formula (1):

Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V    formula (1);

wherein Attention(Q, K, V) is the attention calculation result; Q is the query matrix vector; K is the key matrix vector; V is the value matrix vector; d_k is the feature dimension of the key matrix vector; and K^T is the transposed result of the key matrix vector.
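Formula (1) can be implemented directly; a minimal NumPy sketch of scaled dot-product cross attention, with illustrative shapes for the query, key, and value matrix vectors (the 5×8 and 7×8 dimensions are assumptions for demonstration):

```python
import numpy as np

def cross_attention(Q, K, V):
    """Formula (1): Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]                    # feature dimension of the key matrix
    scores = Q @ K.T / np.sqrt(d_k)      # scaled dot product with K transposed
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 8))   # query: spliced feature vector projection
K = rng.standard_normal((7, 8))   # key: image description embedding projection
V = rng.standard_normal((7, 8))   # value: image description embedding projection
out = cross_attention(Q, K, V)
```

Each output row is a convex combination of the value rows, so the result stays within the range of V.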
In step 1520, the second control information is used to fine tune the result of the downsampling by the denoising network of the image generation model.
In the specific implementation of this embodiment, the specific process of step 1520 is similar to step 1510 described above. The difference is that the input information for the first control subnetwork in step 1510 is the image description embedding vector, the first sample object feature, and the contour feature; and the input information of the second control sub-network in step 1520 is the image description embedding vector and the second sample object feature, the input information of the two are different, and the generated control information is different in the constrained network part in the denoising network. For the sake of space saving, the description is omitted.
In step 1530, the upsampling network control information is used to condition constraint the upsampling process of the denoising network of the image generation model to fine tune and correct the upsampling process.
In a specific implementation of this embodiment, the third sample object feature and the fourth sample object feature are incorporated into the same set, the object feature information within the set is regarded as a whole, and all object feature information within the set is determined as upsampling network control information.
In step 1540, the downsampling network control information is used to condition-constrain the downsampling process of the denoising network of the image generation model to fine-tune and correct the downsampling process.
In a specific implementation of this embodiment, the first control information, the second control information, the first sample object feature, and the second sample object feature are incorporated into the same set, and all information within the set is determined to be downsampled network control information.
In step 1550, the up-sampled network control information and the down-sampled network control information are incorporated into the same set, the control information within the set is treated as a whole, and all information within the set is determined as de-noised network control information.
As shown in fig. 16, feature extraction is performed on the sample object image using a fine tuning network of different model parameters, to obtain a first sample object feature, a second sample object feature, a third sample object feature, and a fourth sample object feature, respectively. Next, the third sample object feature and the fourth sample object feature are integrated into upsampled network control information for constraining the upsampling process. Further, the first sample object feature, the second sample object feature, and the pre-acquired image description embedding vector, the contour feature of the reference object are input together into the control network, the first control information and the second control information are output by the control network, and the first control information, the second control information, the first sample object feature, and the second sample object feature are integrated into downsampling network control information for constraining the downsampling process. Based on the above, the full flow control of the denoising process of the denoising network is realized by using the up-sampling network control information and the down-sampling network control information, so that the accuracy of the denoising process can be improved.
The embodiment has the advantages that the denoising network control information is generated based on the image description information, the sample object characteristics in the sample object image and the outline characteristics of the reference object in the background template image, so that the condition constraint of the denoising network for the image generation model not only contains the description of the image to be generated, but also contains the object fine adjustment information determined according to the outline characteristics of the reference object and the object characteristics of the sample object, and the information comprehensiveness and the information accuracy of the generated denoising network control information are improved.
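The set integration of steps 1530-1550 can be sketched as follows; a minimal Python sketch in which placeholder strings stand in for the control information and sample object features of Fig. 16 (all names are illustrative):

```python
# Placeholder strings stand in for the tensors of Fig. 16
first_ctrl, second_ctrl = "first_control_info", "second_control_info"
f1, f2, f3, f4 = "feat1", "feat2", "feat3", "feat4"

# Step 1530: third and fourth sample object features -> upsampling control
upsample_ctrl = {f3, f4}
# Step 1540: the two control outputs plus the first and second sample
# object features -> downsampling control
downsample_ctrl = {first_ctrl, second_ctrl, f1, f2}
# Step 1550: both sets together form the denoising network control info
denoise_ctrl = {"up": upsample_ctrl, "down": downsample_ctrl}
```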
Step 340 is described in detail below.
In step 340, noise prediction is performed by the image generation model based on the sample stitched image features and the denoising network control information, so as to obtain a noise prediction result of the noise reference image.
In an embodiment of the present disclosure, the image generation model includes a diffusion network, and a denoising network.
The diffusion network is used for realizing gradual noise adding processing on the characteristics of the sample spliced image until the characteristics of the input sample spliced image approach pure noise.
The denoising network is used for denoising hidden space vectors output by the diffusion network step by step until image features corresponding to the sample spliced image features are generated.
Referring to FIG. 17, in one embodiment, step 340 includes, but is not limited to, the following steps 1710-1730:
step 1710, compressing the sample spliced image features to obtain sample compressed image features;
Step 1720, performing diffusion processing on the sample compressed image features based on the diffusion network to obtain sample hidden space feature vectors;
Step 1730, based on the denoising network control information, denoising the sample hidden space feature vector through the denoising network to obtain a noise prediction result.
Steps 1710-1730 are described in detail below.
In step 1710, since the sample stitched image features are stitched from three image features, the feature dimensions of the sample stitched image features are higher than the feature dimensions of the input features acceptable by the diffusion network. Based on the method, firstly, the sample spliced image features are compressed, dimension reduction of the sample spliced image features is achieved, and under the condition that image feature information of the sample spliced image features is unchanged, feature dimensions of the sample spliced image features are reduced, and the sample compressed image features are obtained. The feature dimension of the sample compressed image feature is w×h×4, and satisfies the feature dimension of the input feature acceptable by the diffusion network.
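The compression step can be sketched as a channel-reducing linear projection; a minimal NumPy sketch assuming the stitched features concatenate three 4-channel maps and the diffusion network accepts w×h×4 input (the 1×1 projection is an illustrative assumption, not the patented compression method):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample stitched image features: three 4-channel feature maps concatenated
w, h = 16, 16
stitched = rng.standard_normal((w, h, 12))

# An assumed 1x1 linear projection reduces the channels to the w x h x 4
# dimension accepted by the diffusion network, keeping the spatial layout
W_proj = rng.standard_normal((12, 4))
compressed = stitched @ W_proj
```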
In step 1720, the sample hidden spatial feature vector is used to indicate the result of adding noise to the sample stitched image features over a diffusion network at a fixed time step.
When the embodiment is specifically implemented, when the diffusion network is used for carrying out diffusion processing on the sample compression image characteristics, noise is added to the sample compression image characteristics successively through the diffusion network until the sample compression image characteristics approach pure noise. Wherein the diffusion process of the disclosed embodiments may be a parameterized Markov chain (Markov chain) as a whole.
For example, the fixed time step is set as T, the sample compressed image feature is subjected to T times of noise adding through the forward process of the diffusion network, the hidden space representation corresponding to the noise reference image is generated, and the generated hidden space representation is determined as a sample hidden space feature vector, wherein T is a positive integer.
Specifically, noise is added to the sample compressed image feature successively by the diffusion process of the diffusion network, and the sample compressed image feature gradually loses its characteristics. After T-times noise addition, the sample compressed image features will become a latent spatial representation without any features, which is determined as a sample latent spatial feature vector.
The sample hidden space feature vector refers to a representation of the pure noise image corresponding to the noise reference image, which no longer carries image features. The sample hidden space feature vector may take the form of a vector representation or of a matrix representation, which is not limited here.
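The forward diffusion over T fixed time steps admits a standard closed form; a minimal NumPy sketch using an assumed linear noise schedule (the schedule values and shapes are illustrative, not taken from the patent):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed linear noise schedule over T fixed time steps
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)   # cumulative signal-retention product

def diffuse(x0, t, noise):
    """Closed form of the forward Markov chain after t + 1 noising steps:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

x0 = rng.standard_normal((16, 16, 4))   # sample compressed image feature
noise = rng.standard_normal(x0.shape)
x_T = diffuse(x0, T - 1, noise)         # near pure noise: hidden space vector
```

At t = T − 1 the retained-signal weight alpha_bar is tiny, so x_T is dominated by the injected noise, matching the "approach pure noise" behavior described above.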
In step 1730, when denoising the sample hidden space feature vector through the denoising network, the sample hidden space feature vector is successively denoised according to the denoising network control information as a constraint condition until the sample hidden space feature vector is restored to an image feature satisfying the constraint requirement of the image description information. The backward process of the denoising network can be a parameterized Markov chain as a whole.
For the sake of space saving, the specific process of denoising the sample hidden space feature vector by the denoising network according to the embodiments of the present disclosure to obtain the noise prediction result will be described in detail below. And will not be described in detail herein.
As shown in fig. 18, a noise prediction process of the image generation model is shown. Specifically, the sample spliced image features are input into the image generation model. First, the sample spliced image features are compressed by the image generation model into sample compressed image features with smaller feature dimensions than the sample spliced image features. Then, the sample compressed image features are subjected to T noise-adding operations through the diffusion network and converted into image features close to pure noise, so as to obtain the sample hidden space feature vector, in which the image information of the sample spliced image features can no longer be reflected. Further, through the denoising network, based on the given denoising network control information, T denoising operations are performed on the sample hidden space feature vector, the noise carried by the sample hidden space feature vector is eliminated, the sample hidden space feature vector is restored into image feature form, and the noise prediction result is obtained. The noise prediction result can reflect, to a certain extent, the image information of the sample spliced image features.
The advantage of this embodiment is that the sample spliced image features are compressed into sample compressed image features, so that the input meets the input requirements of the image generation model. Then, noise is added to the sample compressed image features a plurality of times by the diffusion network, reducing the details and definition of the sample compressed image features and yielding the sample hidden space feature vector. Further, with the denoising network control information as a constraint condition, noise is successively removed from the sample hidden space feature vector through the denoising network until the vector is restored to an image feature that meets the constraint requirement of the image description information, and a noise prediction result is obtained.
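As a concrete illustration of the noise adding operation above: the forward diffusion over T steps has the well-known closed form z_t = sqrt(ᾱ_t)·z_0 + sqrt(1−ᾱ_t)·ε, where ᾱ_t is the cumulative product of the per-step signal retention rates. The following is a minimal sketch; the DDPM-style linear schedule parameters, shapes, and function names are illustrative assumptions, not values taken from the embodiment:

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative retention of signal for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(z0, t, alpha_bar, noise):
    """Equivalent of t+1 successive noise-adding steps, applied in one shot."""
    a = alpha_bar[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * noise

alpha_bar = make_alpha_bar()
z0 = np.random.randn(4, 64, 64)            # sample compressed image feature (toy shape)
eps = np.random.randn(*z0.shape)           # Gaussian reference noise
zT = q_sample(z0, 999, alpha_bar, eps)     # after T steps: close to pure noise
```

Because ᾱ_T is close to zero, z_T is dominated by the noise term, matching the statement that the sample hidden space feature vector no longer reflects the spliced image information.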
In the disclosed embodiments, the denoising network may be a U-network structure (U-net network), the denoising network including an upsampling attention network, and a downsampling attention network; the denoising network control information includes upsampling network control information and downsampling network control information.
The downsampling attention network is used to downsample the sample hidden space feature vector to obtain lower-dimensional features.
The upsampling attention network is used to upsample the output result of the downsampling attention network, restoring it to denoised image features with the same feature dimension as the sample hidden space feature vector.
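The U-net shape flow described above — downsampling to lower-dimensional features, then upsampling back to the input dimension — can be sketched as follows, with plain pooling and nearest-neighbour operators standing in for the attention downsampling/upsampling modules (a toy illustration, not the embodiment's network):

```python
import numpy as np

def downsample(x):
    """Halve spatial resolution by 2x2 average pooling."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """Double spatial resolution by nearest-neighbour repetition."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

z = np.random.randn(16, 16)            # sample hidden space feature vector (as a 2-D map)
low = downsample(downsample(z))        # two downsampling stages -> lower-dimensional features
restored = upsample(upsample(low))     # two upsampling stages -> original feature dimension
```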
Referring to FIG. 19, in one embodiment, step 1730 includes, but is not limited to, the following steps 1910-1920:
Step 1910, fusing the downsampling network control information into a first attention matrix of the downsampling attention network to update the first attention matrix, and fusing the upsampling network control information into a second attention matrix of the upsampling attention network to update the second attention matrix;
Step 1920, denoising the sample hidden space feature vector through the downsampling attention network after updating the first attention matrix and the upsampling attention network after updating the second attention matrix to obtain a noise prediction result.
Steps 1910-1920 are described in detail below.
In step 1910, the first attention matrix is used for performing attention calculation in the down-sampling process of the denoising network so as to capture the degree of correlation between different input features (the sample hidden space feature vector and the image description embedded vector); the second attention matrix is used for performing attention calculation during the up-sampling process of the denoising network to capture the degree of correlation between different input features (the output result of the downsampling attention network and the image description embedded vector).
In a specific implementation of this embodiment, the downsampling attention network comprises an attention downsampling module comprising a first attention matrix and a residual block structure. When the first attention matrix is updated, the first sample object feature and the second sample object feature in the downsampling network control information are weighted onto the output of the first attention matrix, and the first control information and the second control information in the downsampling network control information are weighted onto the output of the attention downsampling module. Likewise, the upsampling attention network comprises an attention upsampling module comprising a second attention matrix and a residual block structure. When the second attention matrix is updated, the third sample object feature and the fourth sample object feature in the upsampling network control information are weighted onto the output of the second attention matrix.
As shown in fig. 20, the downsampling attention network includes two attention downsampling modules, each with its own first attention matrix. The first sample object feature is weighted onto the output of the first module's first attention matrix, and the second sample object feature is weighted onto the output of the second module's first attention matrix. Meanwhile, the first control information is weighted onto the output of the first attention downsampling module, and the second control information is weighted onto the output of the second attention downsampling module.
As shown in fig. 21, the upsampling attention network includes two attention upsampling modules, each with its own second attention matrix. The third sample object feature is weighted onto the output of the first module's second attention matrix, and the fourth sample object feature is weighted onto the output of the second module's second attention matrix.
In step 1920, when the sample hidden space feature vector is denoised through the downsampling attention network with the updated first attention matrix and the upsampling attention network with the updated second attention matrix, first, the sample hidden space feature vector and the image description embedded vector are input into the denoising network of the image generation model; the sample hidden space feature vector is linearly projected to obtain a query feature Q corresponding to the sample hidden space feature vector, and the image description embedded vector is linearly projected to obtain a key feature K and a value feature V corresponding to the image description embedded vector. Then, cross attention calculation is performed on the query feature Q, the key feature K, and the value feature V by the first attention matrix of the first attention downsampling module of the downsampling attention network to obtain a first attention calculation result, and the first attention calculation result and the first sample object feature are weighted according to a preset weight proportion to obtain a first weighted result. Further, residual processing is performed on the first weighted result through the residual block structure of the first attention downsampling module to obtain a first residual processing result, and the first control information and the first residual processing result are weighted according to preset weights to obtain a second weighted result.
And then, carrying out linear projection on the second weighted result to obtain a new query feature Q, carrying out cross attention calculation on the key feature K, the value feature V and the new query feature Q by using a first attention matrix of a second attention downsampling module of the downsampling attention network to obtain a second attention calculation result, and carrying out weighted calculation on the second attention calculation result and the second sample object feature according to a preset weight proportion to obtain a third weighted result. Further, residual processing is performed on the third weighted result through a residual block structure of the second attention downsampling module, a second residual processing result is obtained, weighting calculation is performed on the second control information and the second residual processing result according to preset weights, and a fourth weighted result is obtained, wherein the fourth weighted result is an output result of the downsampling attention network.
Further, the fourth weighted result and the image description embedded vector are input into an up-sampling attention network, linear projection is firstly carried out on the fourth weighted result to obtain query features, and then linear projection is carried out on the image description embedded vector to obtain key features and value features corresponding to the image description embedded vector. And then, performing cross attention calculation on the key feature, the value feature and the query feature by using a second attention matrix of a first attention up-sampling module of the up-sampling attention network to obtain a third attention calculation result, and performing weighted calculation on the third attention calculation result and the third sample object feature according to a preset weight proportion to obtain a fifth weighted result. Further, residual processing is carried out on the fifth weighted result through a residual block structure of the first attention up sampling module, and a third residual processing result is obtained. And finally, linearly projecting the third residual error processing result to obtain a new query feature, performing cross attention calculation on the key feature, the value feature and the new query feature by using a second attention matrix of a second attention up-sampling module of the up-sampling attention network to obtain a fourth attention calculation result, and performing weighted calculation on the fourth attention calculation result and the fourth sample object feature according to a preset weight proportion to obtain a sixth weighted result. 
Further, residual processing is carried out on the sixth weighted result through a residual block structure of the second attention up-sampling module, and a denoising result of the denoising network on the sample hidden space feature vector generated in the prediction time step T is obtained; repeating the above process, and continuing the denoising operation for T-1 times on the denoising result of the sample hidden space feature vector to obtain a noise prediction result.
The advantage of this embodiment is that the denoising network control information is introduced at different stages of the denoising process to fine-tune it. Specifically, the first sample object feature and the second sample object feature in the downsampling network control information are fused into the first attention matrix of the downsampling attention network, and the third sample object feature and the fourth sample object feature in the upsampling network control information are fused into the second attention matrix of the upsampling attention network, so that the output results of the plurality of attention matrices in the denoising network can be corrected. In addition, the embodiments of the present disclosure correct the output result of the downsampling attention network through the first control information and the second control information, performing multiple fine adjustments within each denoising pass. This can improve the stability and accuracy of noise prediction, make the finally generated noise prediction result fit the real requirement more closely, and improve the model training effect.
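The weighting of control features onto attention outputs described in steps 1910-1920 can be sketched as a single cross-attention block whose output is blended with a control feature at a preset weight. All shapes, the blending weight `w`, and the projection matrices below are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(z, text_emb, Wq, Wk, Wv):
    """Q from the hidden-space feature; K and V from the image description embedding."""
    Q, K, V = z @ Wq, text_emb @ Wk, text_emb @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def controlled_block(z, text_emb, control_feat, Wq, Wk, Wv, w=0.3):
    """Attention output blended with a control feature at an assumed preset weight w."""
    attn = cross_attention(z, text_emb, Wq, Wk, Wv)
    return (1.0 - w) * attn + w * control_feat   # the weighting step of the embodiment

rng = np.random.default_rng(0)
d = 16
z = rng.standard_normal((8, d))            # sample hidden space features (8 tokens)
text = rng.standard_normal((5, d))         # image description embedded vector (5 tokens)
ctrl = rng.standard_normal((8, d))         # e.g. a sample object feature
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = controlled_block(z, text, ctrl, Wq, Wk, Wv)
```

In the embodiment the same pattern repeats per module, with different object features and control information weighted in at each stage.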
Step 350 is described in detail below.
In step 350, the image generation model is trained based on a comparison of the noise reference image and the noise prediction results of the plurality of image-text sample pairs.
Referring to FIG. 22, in one embodiment, step 350 includes, but is not limited to, the following steps 2210-2230:
Step 2210, for each image-text sample pair, acquiring reference noise in the noise reference image, and calculating a sub-loss function based on a comparison of the reference noise and the noise prediction result;
Step 2220, determining a total loss function based on the sub-loss functions of each image-text sample pair;
Step 2230, training the image generation model based on the total loss function.
Steps 2210-2230 are described in detail below.
In step 2210, the reference noise is used to indicate the degree to which noise is added to the expected result, and the sub-loss function is used to indicate the degree of difference between the reference noise and the prediction noise of an individual image-text sample pair.
In this embodiment, first, for each image-text sample pair, noise extraction is performed on the noise reference image to obtain the reference noise in the noise reference image. Then, the noise difference between the reference noise and the noise prediction result is calculated to obtain a noise difference result. Finally, the sub-loss function is calculated according to the noise difference result.
In step 2220, the total loss function is used to indicate the overall degree of difference between the reference noise and the prediction noise of all image-text sample pairs. The smaller the total loss function, the smaller the overall difference between the reference noise and the prediction noise of all image-text sample pairs, and the higher the image generation accuracy of the image generation model.
In a specific implementation of this embodiment, the sub-loss functions of all image-text sample pairs are averaged to obtain the total loss function. Specifically, first, the total number of image-text sample pairs is determined. Then, the sub-loss functions of all image-text sample pairs are added to obtain the sum of the sub-loss functions. Finally, the sum of the sub-loss functions is divided by the total number of image-text sample pairs to obtain the total loss function.
In step 2230, the model parameters of the image generation model are adjusted with the minimum total loss function as a training target, and the steps 310-350 are repeated to realize iterative training of the image generation model, the model parameters with the minimum total loss function are taken as final model parameters, and the image generation model with the final model parameters is taken as the trained image generation model.
The method has the advantages that the sub-loss function of each image-text sample pair is determined according to the noise difference between the reference noise and the noise prediction result of each image-text sample pair based on a supervised learning mode, and the total loss function is constructed based on a plurality of sub-loss functions.
In the embodiment of the disclosure, the noise prediction result is obtained through prediction of a plurality of prediction time steps.
The prediction time step is used to indicate the number of times noise is added to and removed from the sample spliced image features input to the image generation model.
Referring to FIG. 23, in one embodiment, step 2210 includes, but is not limited to, the following steps 2310-2330:
Step 2310, determining the prediction noise of the last prediction time step based on the noise prediction result;
step 2320, performing regular term calculation based on the reference noise and the prediction noise to obtain a regular term calculation result;
Step 2330, determining a sub-loss function based on the regularized term computation.
Steps 2310-2330 are described in detail below.
In step 2310, the prediction noise is used to indicate the noise contained in the denoising result (denoised image) generated by the image generation model at the last prediction time step.
In this embodiment, at each prediction time step, the image generation model denoises the denoising result (denoised image) generated at the previous prediction time step, and the noise prediction result is obtained after denoising over a plurality of prediction time steps. The noise contained in the denoising results of different prediction time steps differs. On this basis, the noise prediction result includes the denoising result (denoised image) generated at the last prediction time step, so the prediction noise of the last prediction time step can be extracted directly from the noise prediction result.
In step 2320, the regularization term calculation is used to indicate the degree of difference between the reference noise and the prediction noise.
In a specific implementation of this embodiment, first, for each image-text sample pair, the difference between the reference noise and the prediction noise is calculated to obtain a noise difference value. Then, the noise difference value is regularized to obtain a regular term calculation result.
In step 2330, the regular term calculation result is converted, based on the denoising network control information, into the form of a conditional loss function, and the converted loss function is determined as the sub-loss function. The sub-loss function of the embodiments of the present disclosure may be expressed as shown in formula (2):

$$L = \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\lVert \epsilon - \epsilon_\theta\!\left(z_t, t, \tau_\theta(y), c\right) \right\rVert_2^2\right] \qquad (2)$$

where $L$ is the sub-loss function; $\epsilon$ is the reference noise, and $\epsilon \sim \mathcal{N}(0,1)$ indicates that the reference noise obeys a standard normal distribution (mean 0, variance 1); $t$ denotes a prediction time step, which is used to gradually update the sample hidden space feature vector in the image generation process; $z_t$ denotes the sample hidden space feature vector at prediction time step $t$, i.e., the result of adding noise to the sample compressed image feature $z$ up to prediction time step $t$; $y$ and $c$ both refer to the denoising network control information used as a conditional constraint on the image generation process, and $\tau_\theta$ denotes the text encoder; $x$ is the noise reference image used for training; $\mathbb{E}_{\mathcal{E}(x)}$ denotes the expectation over the encoder output $\mathcal{E}(x)$ for the input $x$; $\epsilon_\theta(z_t, t, \tau_\theta(y), c)$ refers to the prediction noise obtained by denoising the sample hidden space feature vector $z_t$ at prediction time step $t$ under the conditional constraints; and $\lVert \cdot \rVert_2^2$ is the regular term calculation result.
The advantage of this approach is that the regular term loss between the reference noise and the noise prediction result of each image-text sample pair is calculated to obtain a regular term calculation result for each pair, and the regular term calculation result (an L2 loss) is used as the sub-loss function to train the image generation model. In this way, the denoising network control information can be used to better adjust the image generation process of the image generation model, making the generation process more controllable. Meanwhile, because condition information (the denoising network control information) is added to the sub-loss function, the generated noise prediction result can be closely related to the given conditions, which improves the consistency and accuracy of the images generated by the model and further improves the image quality of the target image generated by the model.
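Under the interpretation above, the sub-loss of formula (2) reduces, for one image-text sample pair, to the squared L2 difference between the reference noise and the prediction noise, and the total loss of step 2220 is the average of the sub-losses. A minimal sketch (array shapes are illustrative):

```python
import numpy as np

def sub_loss(reference_noise, predicted_noise):
    """Regular-term (mean squared L2) sub-loss for one image-text sample pair."""
    diff = reference_noise - predicted_noise
    return float(np.mean(diff ** 2))

def total_loss(pairs):
    """Average the sub-losses over all image-text sample pairs (step 2220)."""
    losses = [sub_loss(ref, pred) for ref, pred in pairs]
    return sum(losses) / len(losses)

eps = np.zeros((4, 8, 8))
perfect = sub_loss(eps, eps)              # identical prediction -> zero loss
worse = sub_loss(eps, np.ones_like(eps))  # constant offset of 1 -> loss of 1
```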
As shown in fig. 24, an overall flow chart of model training of an embodiment of the present disclosure is shown. Specifically, a background template image, an expected effect image, and the image description information of the expected effect image are taken as inputs.
First, reference noise is generated using the random number i and added to the expected effect image, resulting in the noise reference image; the detailed process is similar to steps 510-530 described above. Next, the background template image is masked to obtain an object replacement mask (template mask image); the specific process is similar to steps 810-820 described above.
Further, the template mask image, the noise reference image, and the background template image are each encoded by the encoder, and the three encoding results are spliced into the sample spliced image features. The sample spliced image features are compressed into sample compressed image features Z by the image generation model; diffusion processing is performed on the sample compressed image features Z through the diffusion network, and noise is added T times to obtain the sample hidden space feature vector Z_T. Further, the image description information of the expected effect diagram is converted, as a condition constraint, into a text form conforming to the input requirement of the text encoder τ and input to the text encoder, which outputs the image description embedded vector; the specific process is similar to steps 1310-1340. Further, the object frame corresponding to the background template image is extracted, and a contour line drawing (contour feature) of the reference object in the background template image is generated based on the extracted object frame; the specific process is similar to steps 1110-1130. Meanwhile, sample object feature data is generated based on a sample object graph of the sample object, wherein the sample object feature data includes first object features QKV-a, second object features QKV-a, third object features QKV-a, and fourth object features QKV-a. Further, the first object feature QKV-a, the second object feature QKV-a, the contour feature, and the image description embedded vector are input to the control network; the first control sub-network of the control network outputs the first control information, and the second control sub-network of the control network outputs the second control information.
Next, when the denoising network denoises the sample hidden space feature vector Z_T, first, based on the first control information, the first object feature QKV-a, and the second object feature QKV-a, the sample hidden space feature vector Z_T is downsampled through each attention matrix of the downsampling attention network of the denoising network to obtain a downsampling result, and the downsampling result is corrected using the second control information to obtain a corrected downsampling result; then, based on the third object feature QKV-a and the fourth object feature QKV-a, the corrected downsampling result is upsampled through each attention matrix of the upsampling attention network of the denoising network to obtain a denoising result Z_{T-1}' at prediction time step T. Further, the denoising network continues to denoise Z_{T-1}' according to the above process; after T-1 further denoising operations, a denoising result Z' is obtained. The specific process is similar to steps 1910-1920 described above.
Finally, noise prediction is performed based on the denoising result Z' to obtain a noise prediction result, a LOSS function is constructed according to the reference noise and the noise prediction result, and iterative training is performed on the image generation model based on the LOSS function until the image generation model meets the training requirement; the specific process is similar to step 350 and, for the sake of brevity, is not repeated here.
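The overall training iteration of fig. 24 — noise the latent, predict the noise under a conditional constraint, and descend the L2 loss — can be condensed into the following toy sketch, where a single linear map stands in for the entire denoising network and `cond` stands in for the denoising network control information (all names, dimensions, and hyperparameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
d = 8                                       # toy latent dimension

def training_step(theta, z0, cond, alpha_bar_t=0.5, lr=0.05):
    """One supervised step: noise the latent, predict the noise, descend the L2 loss."""
    eps = rng.standard_normal(d)                          # reference noise
    z_t = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1 - alpha_bar_t) * eps
    x = np.concatenate([z_t, cond])                       # noisy latent + control signal
    resid = theta @ x - eps                               # prediction error of the toy denoiser
    loss = float(np.mean(resid ** 2))
    grad = (2.0 / d) * np.outer(resid, x)                 # gradient of the mean squared error
    return theta - lr * grad, loss

theta = np.zeros((d, 2 * d))                # toy "denoising network" parameters
z0 = rng.standard_normal(d)                 # compressed image feature (fixed sample)
cond = rng.standard_normal(d)               # stand-in denoising network control information
losses = []
for _ in range(200):
    theta, loss = training_step(theta, z0, cond)
    losses.append(loss)
```

Because the reference noise is recoverable as a linear function of the noisy latent and the fixed condition, the toy model's loss decreases over iterations, mirroring the iterative minimization of the total loss in step 2230.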
An image generating method according to an embodiment of the present disclosure is described in detail below.
According to one embodiment of the present disclosure, an image generation method is provided.
The image generation method is generally applied to business scenarios in which an object in a fixed background image is replaced with a target object such as a target person or a target item, for example, the video production scenario and the object display scenario shown in figs. 2A to 2C. The embodiments of the present disclosure provide a scheme in which an image generation model generates an image based on image description information and object information (the contour information of the reference object in the background image and the object features of the target object to be substituted in), so that the accuracy of generating the target image can be improved.
As shown in fig. 25, an image generating method according to an embodiment of the present disclosure may be performed by an electronic device, which may be the image processing server or the object terminal shown in fig. 1, and may include:
Step 2510, obtaining a target object image, a target background image, and target description information of a target object;
step 2520, determining a target stitched image feature based on the target background image, the preset noise image, and the target mask image;
step 2530, determining denoising control information of an image generation model based on contour features of a reference object in a target background image, target description information and a target object image;
Step 2540, performing image generation through an image generation model based on the target stitched image features and the denoising control information, so as to obtain a target image.
Steps 2510 to 2540 are described in detail below.
In step 2510, a target object image, a target background image, and target description information of the target object are acquired.
The target object image of a target object refers to a series of images that reflect the object characteristics of the target object.
The target background image refers to a background image to which a target object is to be added, wherein the target background image contains a reference object to be replaced by the target object in the target object image.
The target description information is used to describe the replacement from the reference object to the target object.
In the implementation of this embodiment, the specific process of step 2510 is similar to the specific process of acquiring the sample object image, the target background image, and the target description information of the sample object in 310 described above. For the sake of space saving, the description is omitted.
In step 2520, a target stitched image feature is determined based on the target background image, the preset noise image, and the target mask image.
The preset noise image is a noise image generated based on random numbers and subject to a Gaussian distribution, generated in a manner similar to step 520 described above. The difference is that the random number used in step 2520 to generate the preset noise image differs from the random number in step 520; for the sake of brevity, this is not repeated.
The target mask image is obtained by masking the reference object in the target background image, in a manner similar to steps 810-820 described above. For the sake of space saving, the description is omitted.
The target stitching image features are used to indicate feature stitching results of the target background image, the preset noise image, and the target mask image.
For the sake of brevity, the specific process of determining the target stitched image features based on the target background image, the preset noise image, and the target mask image according to the embodiments of the present disclosure is described in detail below and is not repeated here.
In step 2530, denoising control information of the image generation model is determined based on the contour features of the reference object in the target background image, the target description information, and the target object image.
The image generation model is generated according to the training method of the image generation model of the above embodiment.
The denoising control information is used for assisting the image generating model to denoise the image when the image is generated as a condition constraint so as to improve the denoising accuracy of the image and enable the denoising effect of the image to be attached to the real requirement.
In the specific implementation of this embodiment, the specific process of step 2530 is similar to the specific process of step 330 described above. For the sake of space saving, the description is omitted.
In step 2540, image generation is performed by the image generation model based on the target stitched image features and the denoising control information, resulting in a target image.
The target image is used to indicate the result of replacing the reference object in the target background image with the target object of the target object image.
For the sake of brevity, the specific process of performing image generation through the image generation model based on the target stitched image features and the denoising control information to obtain the target image according to the embodiments of the present disclosure is described in detail below and is not repeated here.
Through steps 2510-2540, in the embodiment of the present disclosure, when an image generation model is used for image generation, image features of a target background image, a preset noise image and a target mask image are integrated into a target stitching image feature, so that the target stitching image feature has background image information and reference object information of the target background image, noise information is fused in the target stitching image feature, and the target stitching image feature can be attached to a real situation. Further, the outline characteristics of the reference object, the target description information and the target object image in the target background image are introduced to jointly generate the denoising control information of the denoising network aiming at the image generating model, so that the image generating model can carry out object fine adjustment in a denoising link, the object in the generated target image meets the condition constraint of the target description information, the object characteristics of the target object and the outline characteristics of the reference object, the target object and the background of the target image have good consistency and harmony, and the accuracy of generating the target image by the model is improved.
Referring to FIG. 26, in one embodiment, step 2520 includes, but is not limited to, the following steps 2610-2640:
step 2610, performing a first encoding process on a preset noise image to obtain a noise image feature;
step 2620, performing second coding processing on the target background image to obtain target background image characteristics;
Step 2630, performing third encoding processing on the target mask image to obtain target mask image features;
Step 2640, stitching the noise image feature, the target background image feature, and the target mask image feature to obtain a target stitched image feature.
Steps 2610-2640 are described in detail below.
The noise image feature is used to indicate the result of converting the preset noise image from pixel space to the latent vector space.
The target background image feature is used to indicate the result of converting the target background image from pixel space to the latent vector space.
The target mask image feature is used to indicate the result of converting the target mask image from pixel space to the latent vector space.
In the specific implementation of this embodiment, the specific process of steps 2610-2640 is similar to steps 710-740 described above. For the sake of space saving, the description is omitted.
The advantage of this embodiment is that the image information of the pre-set noise image, the target background image, and the target mask image are all converted from pixel space to potential vector space, and the image information (noise image feature, target background image feature, and target mask image feature) of the pre-set noise image, the target background image, and the target mask image in the potential vector space are stitched to form a target stitched image feature having a plurality of image information. Furthermore, the characteristics of the target spliced image are used as the input of the image generation model when the image is generated, so that the noise information, the target background information and the reference object information in the target background are fused in the input data of the model, the richness and the comprehensiveness of the characteristic information of the characteristics of the template spliced image can be better improved, the image generation accuracy of the model can be improved, and the finally generated target image is more real and accurate.
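Steps 2610-2640 amount to encoding each of the three images into the latent space and concatenating the results along the channel axis. Below is a sketch with a stand-in average-pooling encoder; the 8x spatial compression factor and channel counts are assumptions chosen only to echo typical VAE-style encoders, not the embodiment's actual encoder:

```python
import numpy as np

def toy_encode(img, out_ch=4):
    """Stand-in encoder: 8x spatial average pooling, then pad/trim to out_ch channels."""
    h, w = img.shape[:2]
    pooled = img.reshape(h // 8, 8, w // 8, 8, -1).mean(axis=(1, 3))
    return np.repeat(pooled, out_ch, axis=-1)[..., :out_ch]

noise_img = np.random.randn(64, 64, 3)    # preset noise image
background = np.random.randn(64, 64, 3)   # target background image
mask = np.random.rand(64, 64, 1)          # target mask image (single channel)

features = [toy_encode(noise_img), toy_encode(background), toy_encode(mask)]
stitched = np.concatenate(features, axis=-1)   # target stitched image feature
```

The concatenation is what lets a single model input carry the noise, background, and reference-object (mask) information simultaneously, as the paragraph above describes.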
In an embodiment of the present disclosure, the image generation model includes a diffusion network, a denoising network, and a decoding network.
The decoding network is used to convert image features in the target denoising result from the latent vector space to the pixel space.
Referring to FIG. 27, in one embodiment, step 2540 includes, but is not limited to, the following steps 2710-2740:
step 2710, compressing the target spliced image features to obtain target compressed image features;
Step 2720, performing diffusion processing on the target compressed image features based on a diffusion network to obtain target hidden space feature vectors;
step 2730, denoising the target hidden space feature vector through a denoising network based on denoising control information to obtain a target denoising result;
and 2740, performing feature decoding on the target denoising result based on the decoding network to obtain a target image.
Steps 2710-2740 are described in detail below.
The target compressed image feature is used to indicate a feature dimension-reduction result of the target spliced image feature. The target compressed image feature and the target spliced image feature carry the same image feature information, but the feature dimension of the target compressed image feature is lower than that of the target spliced image feature.
The target hidden space feature vector is used to indicate the result of adding noise to the target compressed image feature over a fixed number of time steps through the diffusion network.
The target denoising result is used to indicate the image feature generated by denoising the target hidden space feature vector over a fixed number of time steps through the denoising network, where the generated image feature satisfies the constraints of the image description information.
In a specific implementation of this embodiment, the specific process of steps 2710-2730 is similar to that of steps 1710-1730 described above, and is not repeated here for brevity.
In step 2740, the target denoising result is input to the decoding network, and the image features in the target denoising result are mapped from the latent vector space back to the original pixel space through the decoding network, so as to generate a noise-free target image that conforms to the target description information.
The advantage of this embodiment is that the target spliced image feature is first compressed into the target compressed image feature, so that the input meets the input requirements of the image generation model. The diffusion network then adds noise to the target compressed image feature a plurality of times, reducing its detail and definition, to obtain the target hidden space feature vector. The denoising network then denoises the target hidden space feature vector a plurality of times to predict the source image, and during this process the denoising control information fine-tunes the predicted image features, so that the object features in the final target denoising result better fit the real requirements. Finally, the decoding network decodes the target denoising result to obtain the predicted target image, improving the image quality of the target image and giving the object and the background in the target image better coordination.
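The diffuse-then-denoise cycle of steps 2710-2740 can be illustrated numerically under the assumption of a standard DDPM-style noise schedule, in which T noising steps collapse into the closed form x_t = sqrt(abar_t)·x_0 + sqrt(1 − abar_t)·eps. This is a sketch of the mathematics only; the schedule values, T, and the perfect noise prediction at the end are illustrative assumptions, not the model of the disclosure.

```python
import math
import random

T = 50
# Linear beta schedule from 1e-4 to 0.02 (a common DDPM choice, assumed here).
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
abar = []            # cumulative product of (1 - beta), i.e. abar_t
prod = 1.0
for b in betas:
    prod *= 1.0 - b
    abar.append(prod)

def q_sample(x0, t, eps):
    """Noised latent at time step t (0-indexed) for a scalar latent x0."""
    return math.sqrt(abar[t]) * x0 + math.sqrt(1.0 - abar[t]) * eps

random.seed(0)
x0 = 0.7                       # a single latent value standing in for Z
eps = random.gauss(0.0, 1.0)   # the Gaussian noise that was added
xT = q_sample(x0, T - 1, eps)  # fully diffused latent, standing in for Z_T

# If the denoising network predicted the noise perfectly (eps_hat == eps),
# the original latent would be recovered exactly by inverting q_sample:
eps_hat = eps
x0_hat = (xT - math.sqrt(1.0 - abar[T - 1]) * eps_hat) / math.sqrt(abar[T - 1])
```

The design point is that the denoising quality is entirely determined by how well eps_hat matches the true noise, which is why training compares predicted noise against reference noise.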
Fig. 28 is a schematic diagram of specific application modules of the image generating method according to an embodiment of the disclosure. Specifically, for a background image containing a reference object, object face mask processing is first performed on the background image to obtain a masked background image, in a process similar to steps 810-820 above; object mask location keypoint extraction is performed on the background image to obtain the contour features of the reference object of the background image, in a process similar to steps 1110-1130 above. In addition, for a target object that is to replace the reference object, feature extraction is performed on a target object image of the target object to obtain object fine-tuning feature data of the target object. Further, the hidden space input is constructed based on the random noise image, the background image, and the masked background image, yielding the target stitched image feature, in a process similar to steps 2610-2640 above. Then, a control network driven by the contour features and the target description information of the target image to be generated outputs denoising control information as an auxiliary (condition constraint), and the image generation model performs diffusion processing and denoising processing on the target stitched image feature to obtain a target denoising result. Finally, the decoding network converts the target denoising result into a target image and outputs it, so that a target image with the background image as background and the target object as foreground is generated, in a process similar to steps 2710-2740 above, which is not repeated here for brevity.
As shown in fig. 29, an overall flowchart of image generation based on an image generation model according to an embodiment of the present disclosure is shown. Specifically, the target background image and the target description information corresponding to the image to be generated are input.
First, a random noise map is generated using a random number i. Next, the target background image is masked to obtain an object replacement mask (the target mask image), in a process similar to steps 810-820 above. Further, the random noise map, the target mask image, and the target background image are each encoded with an encoder, and the three encoding results are stitched into the target stitched image feature.
Further, the target spliced image feature is compressed into a target compressed image feature Z through the image generation model, diffusion processing is performed on Z through the diffusion network, and noise is added to Z a total of T times to obtain the target hidden space feature vector Z_T. Further, the target description information corresponding to the image to be generated is converted, as a condition constraint, into a text form conforming to the input requirements of the text encoder τ; the target description information is input to the text encoder, which outputs the image description embedding vector, in a process similar to steps 1310-1340 above. Further, an object frame corresponding to the target background image is extracted, and a contour line draft (contour feature) of the reference object in the target background image is generated based on the extracted object frame, in a process similar to steps 1110-1130 above. Meanwhile, target object feature data is generated based on the target object image of the target object, where the target object feature data includes a first object feature QKV-a, a second object feature QKV-b, a third object feature QKV-c, and a fourth object feature QKV-d. Further, the first object feature QKV-a, the second object feature QKV-b, the contour feature, and the image description embedding vector are input to the control network; a first control sub-network of the control network outputs first control information, and a second control sub-network of the control network outputs second control information.
Next, when the denoising network denoises the target hidden space feature vector Z_T, first, based on the first control information, the first object feature QKV-a, and the second object feature QKV-b, the target hidden space feature vector Z_T is downsampled through each attention matrix of the downsampling attention network of the denoising network to obtain a downsampling result, and the downsampling result is corrected using the second control information to obtain a corrected downsampling result; then, based on the third object feature QKV-c and the fourth object feature QKV-d, the corrected downsampling result is upsampled through each attention matrix of the upsampling attention network of the denoising network to obtain the denoising result Z_{T-1}' predicted at time step T. Further, the denoising network continues denoising the predicted result according to the above process, and after T-1 further denoising passes, a denoising result Z' is obtained, in a process similar to steps 1910-1920 above. Finally, feature decoding is performed on the denoising result Z' based on the decoding network to obtain the target image I, in a process similar to step 2740 above, which is not repeated here for brevity.
The apparatus and devices of embodiments of the present disclosure are described below.
It will be appreciated that, although the steps in the flowcharts described above are shown in succession in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated in this embodiment, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least some of the steps in the flowcharts described above may include a plurality of sub-steps or stages that are not necessarily performed at the same time but may be performed at different times; the order of execution of these sub-steps or stages is not necessarily sequential, and they may be performed in turn or alternately with at least a portion of other steps, or of sub-steps or stages within other steps.
In the embodiments of the present application, when related processing is performed on data related to the characteristics of a target object, such as attribute information or an attribute information set of the target object, the permission or consent of the target object is obtained first, and the collection, use, and processing of such data comply with relevant laws, regulations, and standards. In addition, when an embodiment of the present application needs to acquire the attribute information of a target object, the individual permission or individual consent of the target object is obtained through a pop-up window, a jump to a confirmation page, or the like; only after the individual permission or individual consent of the target object is explicitly obtained is the target-object-related data necessary for the normal operation of the embodiment acquired.
Fig. 30 is a schematic structural diagram of a training device 3000 for an image generation model according to an embodiment of the present disclosure. The training device 3000 for an image generation model includes:
A first obtaining unit 3010, configured to obtain a plurality of image-text sample pairs, where each image-text sample pair includes a background template image, a noise reference image, image description information corresponding to the noise reference image, and a sample object image of a sample object, where the noise reference image is obtained by adding noise to an expected result of replacing a reference object in the background template image with the sample object;
A first determining unit 3020 for determining a sample stitched image feature based on the noise reference image, the background template image, and a template mask image, wherein the template mask image is obtained by masking a reference object in the background template image;
A second determining unit 3030, configured to determine denoising network control information of the image generation model based on the contour feature of the reference object in the background template image, the image description information, and the sample object image;
The prediction unit 3040 is used for performing noise prediction through the image generation model based on the sample spliced image characteristics and the denoising network control information to obtain a noise prediction result of the noise reference image;
And the training unit 3050 is used for training an image generation model based on the comparison of the noise reference images and the noise prediction results of the image-text sample pairs.
Optionally, the training unit 3050 includes:
A calculation module (not shown) for acquiring reference noise in the noise reference image for each image-text sample pair, and calculating a sub-loss function based on a comparison of the reference noise and a noise prediction result;
A determining module (not shown) for determining a total loss function based on the sub-loss functions of each of the image-text sample pairs;
A training module (not shown) for training the image generation model based on the total loss function.
Optionally, the noise prediction result is obtained by prediction of a plurality of prediction time steps;
a computing module (not shown) is used to:
determining the predicted noise of the last predicted time step based on the noise prediction result;
Performing regular term calculation based on the reference noise and the prediction noise to obtain a regular term calculation result;
Based on the regular term calculation result, a sub-loss function is determined.
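The sub-loss and total-loss computation above can be sketched as follows. The "regular term" is assumed here to be an L2 (mean-squared-error) term between the reference noise and the noise predicted at the last prediction time step, the usual choice for diffusion training; the disclosure does not fix the exact norm, and all names below are illustrative.

```python
def sub_loss(reference_noise, predicted_noise):
    """L2 regular term between flattened reference and predicted noise."""
    assert len(reference_noise) == len(predicted_noise)
    return sum((r - p) ** 2
               for r, p in zip(reference_noise, predicted_noise)) / len(reference_noise)

def total_loss(sample_pairs):
    """Average the sub-losses over all image-text sample pairs."""
    losses = [sub_loss(ref, pred) for ref, pred in sample_pairs]
    return sum(losses) / len(losses)

# Two toy image-text sample pairs: (reference noise, predicted noise).
pairs = [([0.1, -0.2, 0.3], [0.1, -0.2, 0.3]),   # perfect prediction -> 0 loss
         ([1.0, 0.0], [0.0, 0.0])]               # off by 1 in one slot -> 0.5
loss = total_loss(pairs)
```

A perfect noise prediction contributes zero to the total loss, so minimizing the total loss drives the model's predicted noise toward the reference noise for every sample pair.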
Alternatively, the first determining unit 3020 is configured to:
performing first coding processing on the noise reference image to obtain reference image coding characteristics;
performing second coding processing on the background template image to obtain coding characteristics of the background image;
performing third coding processing on the template mask image to obtain mask image coding characteristics;
and splicing the reference image coding feature, the background image coding feature and the mask image coding feature to obtain sample spliced image features.
Optionally, the second determining unit 3030 includes:
An encoding module (not shown) for encoding the image description information to obtain an image description embedded vector;
An extracting module (not shown) for extracting features of the sample object image to obtain sample object feature data;
And the generation module (not shown) is used for generating control information through a preset control network based on the image description embedded vector, the sample object feature data and the outline feature of the reference object to obtain denoising network control information.
Optionally, an encoding module (not shown) is used to:
word segmentation is carried out on the image description information to obtain a plurality of description words;
determining target words in the plurality of descriptive words, and searching target word embedding characteristics corresponding to the target words based on a preset dictionary;
for each descriptive word other than the target word among the plurality of descriptive words, performing word embedding processing on the descriptive word to obtain the descriptor embedding features of the other descriptive words;
integrating the target word embedding feature and the descriptor embedding feature into an image descriptor embedding vector.
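The encoding steps above can be sketched as follows: segment the description into words, look target words up in a preset dictionary, embed the remaining words with a generic word-embedding step, and integrate everything into one description embedding vector. The dictionary contents, the hash-based placeholder embedding, and the dimension are all illustrative assumptions, not the actual text encoder of the disclosure.

```python
DIM = 4
# Preset dictionary mapping target words to their learned embedding features.
preset_dictionary = {"cat": [1.0, 0.0, 0.0, 0.0]}

def embed_word(word):
    """Placeholder embedding for non-target words (deterministic per run)."""
    return [((hash(word) >> (8 * i)) % 100) / 100.0 for i in range(DIM)]

def embed_description(description):
    tokens = description.lower().split()          # word segmentation
    vector = []
    for tok in tokens:
        if tok in preset_dictionary:              # target word: dictionary lookup
            vector.extend(preset_dictionary[tok])
        else:                                     # other words: word embedding
            vector.extend(embed_word(tok))
    return vector

vec = embed_description("a cat on grass")
```

The design point is that target words get precise, pre-learned embeddings from the dictionary, while every other word still contributes a generic embedding, so the integrated vector covers the whole description.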
Optionally, the control network comprises a first control sub-network and a second control sub-network; the sample object feature data includes a first sample object feature, a second sample object feature, a third sample object feature, and a fourth sample object feature, which are obtained by respectively performing feature extraction on the sample object image;
the generating module (not shown) is configured to:
The image description embedding vector, the first sample object feature and the outline feature are input into a first control sub-network to generate control information, so that first control information is obtained;
The image description embedded vector and the second sample object feature are input into a second control sub-network to generate control information, so that second control information is obtained;
Determining upsampling network control information based on the third sample object feature and the fourth sample object feature;
determining downsampling network control information based on the first control information, the second control information, the first sample object feature, and the second sample object feature;
and integrating the up-sampling network control information and the down-sampling network control information into denoising network control information.
Optionally, the noise reference image is generated by:
determining an expected result of replacing the reference object in the background template image with the sample object, wherein the expected result is image data;
Generating random numbers obeying Gaussian distribution based on a predetermined random number generation model;
and, for each pixel point in the expected result, adding a random number to the pixel value of the pixel point to obtain the noise reference image.
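The noise-reference-image construction above can be sketched directly: take the expected replacement result and add one Gaussian random number to every pixel value. The flat grayscale "image", the noise standard deviation, and the seed below are illustrative assumptions.

```python
import random

def make_noise_reference(expected, sigma=25.0, seed=42):
    """Add i.i.d. Gaussian noise to each pixel of the expected result.

    `expected` is a 2-D list of pixel values; the shape is preserved.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    noisy = []
    for row in expected:
        noisy.append([px + rng.gauss(0.0, sigma) for px in row])
    return noisy

expected = [[128.0] * 4 for _ in range(4)]  # 4x4 expected replacement result
noise_ref = make_noise_reference(expected)
```

During training, the per-pixel random numbers here play the role of the reference noise that the model's noise prediction is compared against.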
Optionally, the image generation model includes a diffusion network, and a denoising network;
the prediction unit 3040 includes:
The compression module (not shown) is used for compressing the sample spliced image features to obtain sample compressed image features;
a diffusion module (not shown) for performing diffusion processing on the sample compressed image features based on a diffusion network to obtain sample hidden space feature vectors;
and the denoising module (not shown) is used for denoising the sample hidden space feature vector through the denoising network based on the denoising network control information to obtain a noise prediction result.
Optionally, the denoising network includes an upsampling attention network, and a downsampling attention network; the denoising network control information comprises up-sampling network control information and down-sampling network control information;
a denoising module (not shown) is used for:
Fusing the downsampling network control information into a first attention matrix of the downsampling attention network to update the first attention matrix, and fusing the upsampling network control information into a second attention matrix of the upsampling attention network to update the second attention matrix;
And denoising the sample hidden space feature vector through a downsampling attention network after updating the first attention matrix and an upsampling attention network after updating the second attention matrix to obtain a noise prediction result.
Optionally, the sample object image is generated by:
acquiring a sample image of a sample object;
image segmentation is carried out on the sample image based on a preset object segmentation model, so that a sample segmentation image with a sample object is obtained;
and carrying out image enhancement on the sample segmentation image to obtain a sample object image.
Optionally, the stencil mask image is generated by:
determining an object contour region of a reference object in a background template image;
in the background template image, replacing the pixel values of all pixel points within the object contour region with a first value, and replacing the pixel values of all pixel points outside the object contour region with a second value, so as to obtain the template mask image.
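The template-mask step above can be sketched as follows: every pixel inside the reference object's contour region becomes a first value (here 255) and every pixel outside it a second value (here 0). The rectangular contour region below is an illustrative stand-in for a real segmentation contour.

```python
def make_template_mask(height, width, region, first=255, second=0):
    """Binary template mask; region = (top, left, bottom, right), half-open."""
    top, left, bottom, right = region
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            inside = top <= y < bottom and left <= x < right
            row.append(first if inside else second)
        mask.append(row)
    return mask

# 6x6 image with a 3x3 object contour region at rows/cols 1..3.
mask = make_template_mask(6, 6, (1, 1, 4, 4))
```

The resulting two-valued image tells the model exactly which region of the background is to be regenerated and which region must be preserved.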
Optionally, the contour features of the reference object in the background template image are determined by:
object detection is carried out on the background template image, and an object skeleton diagram of a reference object is obtained;
Extracting gesture features of the object skeleton graph to obtain a plurality of object gesture key points;
Contour features are determined based on the plurality of object pose keypoints.
Fig. 31 is a schematic structural diagram of an image generating apparatus 3100 provided in an embodiment of the present disclosure. The image generating apparatus 3100 includes:
A second acquisition unit 3110, configured to acquire a target object image of a target object, a target background image, and target description information, where the target background image includes a reference object to be replaced by the target object in the target object image, and the target description information is used to describe the replacement of the reference object by the target object;
a third determining unit 3120 configured to determine a target stitched image feature based on a target background image, a preset noise image, and a target mask image, where the target mask image is obtained by masking a reference object in the target background image;
A fourth determining unit 3130, configured to determine denoising control information of an image generation model based on the contour feature of the reference object in the target background image, the target description information, and the target object image, where the image generation model is trained by the training method of an image generation model described above;
An image generating unit 3140, configured to generate an image by using an image generation model based on the target stitched image feature and the denoising control information, to obtain a target image, where the target image is used to indicate a result of replacing a reference object in the target background image with a target object of the target object image.
Alternatively, the third determining unit 3120 is configured to:
Performing first coding processing on a preset noise image to obtain noise image characteristics;
Performing second coding processing on the target background image to obtain target background image characteristics;
performing third coding processing on the target mask image to obtain target mask image characteristics;
And stitching the noise image features, the target background image features and the target mask image features to obtain target stitched image features.
Optionally, the image generation model includes a diffusion network, a denoising network, and a decoding network;
the image generation unit 3140 is configured to:
Compressing the target spliced image features to obtain target compressed image features;
Performing diffusion treatment on the target compressed image features based on a diffusion network to obtain target hidden space feature vectors;
Denoising the target hidden space feature vector through a denoising network based on denoising control information to obtain a target denoising result;
and performing feature decoding on the target denoising result based on the decoding network to obtain a target image.
Referring to fig. 32, fig. 32 is a block diagram of a portion of a terminal implementing the training method of an image generation model or the image generation method according to an embodiment of the present disclosure; the terminal may be the object terminal shown in fig. 1. The terminal comprises: a radio frequency (RF) circuit 3210, a memory 3215, an input unit 3230, a display unit 3240, a sensor 3250, an audio circuit 3260, a wireless fidelity (WiFi) module 3270, a processor 3280, and a power supply 3290. It will be appreciated by those skilled in the art that the terminal structure shown in fig. 32 does not constitute a limitation on the terminal, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
The RF circuit 3210 may be used for receiving and transmitting signals during a message exchange or a call; in particular, after downlink information of a base station is received, it is passed to the processor 3280 for processing, and uplink data is sent to the base station.
The memory 3215 may be used to store software programs and modules, and the processor 3280 performs various functional applications and data processing of the object terminal by executing the software programs and modules stored in the memory 3215.
The input unit 3230 may be used to receive input number or character information and generate key signal inputs related to setting and function control of the object terminal. Specifically, the input unit 3230 may include a touch panel 3231 and other input devices 3232.
The display unit 3240 may be used to display input information or provided information and various menus of the object terminal. The display unit 3240 may include a display panel 3241.
Audio circuitry 3260, speaker 3261, and microphone 3262 may provide an audio interface.
In this embodiment, the processor 3280 included in the terminal may perform the training method or the image generation method of the image generation model of the previous embodiment.
Fig. 33 is a block diagram of a portion of a server implementing the training method of an image generation model or the image generation method of an embodiment of the present disclosure. The server may be the image processing server shown in fig. 1. Servers may vary widely by configuration or performance, and may include one or more central processing units (CPUs) 3322 (e.g., one or more processors), memory 3332, and one or more storage media 3330 (e.g., one or more mass storage devices) that store application programs 3342 or data 3344. The memory 3332 and the storage media 3330 may be transitory or persistent storage. The programs stored on the storage media 3330 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processing unit 3322 may be configured to communicate with the storage media 3330 and execute, on the server, the series of instruction operations in the storage media 3330.
The server may also include one or more power supplies 3326, one or more wired or wireless network interfaces 3350, one or more input/output interfaces 3358, and/or one or more operating systems 3341, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The central processor 3322 in the server may be used to perform a training method or an image generation method of the image generation model of the embodiments of the present disclosure.
The embodiments of the present disclosure also provide a computer-readable storage medium storing a computer program for executing the training method or the image generation method of the image generation model of the foregoing embodiments.
The disclosed embodiments also provide a computer program product comprising a computer program. The processor of the electronic device reads the computer program and executes it, so that the electronic device executes a training method or an image generation method implementing the image generation model described above.
The terms "first," "second," "third," "fourth," and the like in the description of the present disclosure and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein, for example. Furthermore, the terms "comprises," "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this disclosure, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of" and similar expressions refer to any combination of the listed items, including any combination of a single item or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be singular or plural.
It should be understood that in the description of the embodiments of the present disclosure, "a plurality of" (or "multiple") means two or more; "greater than", "less than", "exceeding", and the like are understood as excluding the number itself, while "above", "below", "within", and the like are understood as including the number itself.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present disclosure. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium that can store program code.
It should also be appreciated that the various implementations provided by the embodiments of the present disclosure may be arbitrarily combined to achieve different technical effects.
The above is a specific description of the embodiments of the present disclosure, but the present disclosure is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present disclosure, and are included in the scope of the present disclosure as defined in the claims.

Claims (20)

1. A training method for an image generation model, the training method comprising:
acquiring a plurality of image-text sample pairs, wherein each image-text sample pair comprises a background template image, a noise reference image, image description information corresponding to the noise reference image and a sample object image of a sample object, and the noise reference image is obtained by adding noise to an expected result of replacing a reference object in the background template image with the sample object;
determining sample stitched image features based on the noise reference image, the background template image, and a template mask image, wherein the template mask image is obtained by masking the reference object in the background template image;
determining denoising network control information of the image generation model based on contour features of the reference object in the background template image, the image description information, and the sample object image;
performing noise prediction through the image generation model based on the sample stitched image features and the denoising network control information to obtain a noise prediction result of the noise reference image;
and training the image generation model based on comparisons of the noise reference image and the noise prediction result over the plurality of image-text sample pairs.
2. The training method of an image generation model according to claim 1, wherein the training the image generation model based on the comparison of the noise reference image and the noise prediction results of the plurality of image-text sample pairs comprises:
for each image-text sample pair, acquiring the reference noise in the noise reference image, and calculating a sub-loss function based on a comparison of the reference noise and the noise prediction result;
determining a total loss function based on the sub-loss functions of the image-text sample pairs;
and training the image generation model based on the total loss function.
3. The training method of an image generation model according to claim 2, wherein the noise prediction result is obtained by prediction of a plurality of prediction time steps;
the calculating a sub-loss function based on the comparison of the reference noise and the noise prediction result comprises:
determining the predicted noise of the last prediction time step based on the noise prediction result;
performing a regularization-term calculation based on the reference noise and the predicted noise to obtain a regularization-term result;
and determining the sub-loss function based on the regularization-term result.
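By way of illustration only (not part of the claims), the per-pair and total loss of claims 2 and 3 can be sketched as follows. The function names and the use of a mean squared error as the regularization term are assumptions; the claims do not fix a particular regularization form:

```python
import numpy as np

def sub_loss(reference_noise, predicted_noise):
    # Squared-error regularization term between the reference noise that was
    # added to the expected result and the noise predicted at the last
    # prediction time step (MSE assumed here for illustration).
    diff = np.asarray(reference_noise) - np.asarray(predicted_noise)
    return float(np.mean(diff ** 2))

def total_loss(sample_pairs):
    # Average the per-pair sub-losses into the total training loss.
    return float(np.mean([sub_loss(ref, pred) for ref, pred in sample_pairs]))
```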
4. The training method of an image generation model according to claim 1, wherein the determining sample stitched image features based on the noise reference image, the background template image, and a template mask image comprises:
performing a first coding process on the noise reference image to obtain a reference image coding feature;
performing a second coding process on the background template image to obtain a background image coding feature;
performing a third coding process on the template mask image to obtain a mask image coding feature;
and splicing the reference image coding feature, the background image coding feature, and the mask image coding feature to obtain the sample stitched image features.
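By way of illustration only (not part of the claims), the splicing step of claim 4 amounts to concatenating the three coding features. The channels-first layout and the function name are assumptions:

```python
import numpy as np

def stitch_features(ref_feat, bg_feat, mask_feat):
    # Concatenate the reference, background, and mask coding features along
    # the channel axis (channels-first layout assumed) to form the sample
    # stitched image features.
    return np.concatenate([ref_feat, bg_feat, mask_feat], axis=0)
```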
5. The training method of an image generation model according to claim 1, wherein the determining denoising network control information of the image generation model based on contour features of the reference object in the background template image, the image description information, and the sample object image comprises:
encoding the image description information to obtain an image description embedding vector;
performing feature extraction on the sample object image to obtain sample object feature data;
and generating control information through a preset control network based on the image description embedding vector, the sample object feature data, and the contour features of the reference object to obtain the denoising network control information.
6. The method for training an image generation model according to claim 5, wherein the encoding the image description information to obtain an image description embedding vector comprises:
performing word segmentation on the image description information to obtain a plurality of description words;
determining a target word among the description words, and looking up a target word embedding feature corresponding to the target word in a preset dictionary;
for the description words other than the target word, performing word embedding processing on them to obtain description word embedding features;
and integrating the target word embedding feature and the description word embedding features into the image description embedding vector.
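By way of illustration only (not part of the claims), claim 6 routes target words through a dictionary lookup and the remaining words through a generic embedding step. All names here are assumptions, and `embed_fn` stands in for whatever word-embedding model is used:

```python
def embed_description(words, target_words, dictionary, embed_fn):
    # Target words are looked up in the preset dictionary; every other
    # description word goes through the generic word-embedding step.
    # The order of the description words is preserved.
    return [dictionary[w] if w in target_words else embed_fn(w) for w in words]
```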
7. The training method of an image generation model according to claim 5, wherein the control network comprises a first control sub-network and a second control sub-network; the sample object feature data comprises a first sample object feature, a second sample object feature, a third sample object feature, and a fourth sample object feature, each obtained by feature extraction from the sample object image;
the generating control information through a preset control network based on the image description embedding vector, the sample object feature data, and the contour features of the reference object to obtain the denoising network control information comprises:
inputting the image description embedding vector, the first sample object feature, and the contour features into the first control sub-network for control information generation to obtain first control information;
inputting the image description embedding vector and the second sample object feature into the second control sub-network for control information generation to obtain second control information;
determining upsampling network control information based on the third sample object feature and the fourth sample object feature;
determining downsampling network control information based on the first control information, the second control information, the first sample object feature, and the second sample object feature;
and integrating the upsampling network control information and the downsampling network control information into the denoising network control information.
8. The training method of an image generation model according to claim 1, wherein the noise reference image is generated by:
determining the expected result of replacing the reference object in the background template image with the sample object, wherein the expected result is image data;
generating random numbers obeying a Gaussian distribution based on a predetermined random number generation model;
and adding the random numbers to the pixel values of the pixel points in the expected result to obtain the noise reference image.
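By way of illustration only (not part of the claims), the noise reference image of claim 8 can be sketched as follows. The function name, the noise scale `sigma`, and the fixed seed are assumptions made for reproducibility of the example:

```python
import numpy as np

def make_noise_reference(expected_image, sigma=25.0, seed=0):
    # Draw Gaussian-distributed random numbers and add one to the value of
    # every pixel of the expected replacement result.
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=np.shape(expected_image))
    return np.asarray(expected_image, dtype=np.float64) + noise, noise
```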
9. The training method of an image generation model according to claim 1, wherein the image generation model comprises a diffusion network and a denoising network;
the performing noise prediction through the image generation model based on the sample stitched image features and the denoising network control information to obtain a noise prediction result of the noise reference image comprises:
compressing the sample stitched image features to obtain sample compressed image features;
Performing diffusion processing on the sample compressed image features based on the diffusion network to obtain sample hidden space feature vectors;
And denoising the sample hidden space feature vector through the denoising network based on the denoising network control information to obtain the noise prediction result.
10. The method of training an image generation model of claim 9, wherein the denoising network comprises an upsampling attention network and a downsampling attention network; the denoising network control information comprises up-sampling network control information and down-sampling network control information;
The denoising processing is performed on the sample hidden space feature vector through the denoising network based on the denoising network control information to obtain the noise prediction result, including:
fusing the downsampling network control information to a first attention matrix of the downsampling attention network to update the first attention matrix, and fusing the upsampling network control information to a second attention matrix of the upsampling attention network to update the second attention matrix;
And denoising the sample hidden space feature vector through the downsampling attention network after updating the first attention matrix and the upsampling attention network after updating the second attention matrix to obtain the noise prediction result.
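By way of illustration only (not part of the claims), claim 10 fuses control information into an attention matrix. The claims do not fix a fusion operator; additive fusion of pre-softmax logits followed by row-wise re-normalization is one plausible sketch, and all names here are assumptions:

```python
import numpy as np

def fuse_control(attention_logits, control_info):
    # Add the control information to the attention logits (additive fusion
    # assumed), then re-normalise row-wise with a softmax so each row of the
    # updated attention matrix is again a probability distribution.
    fused = attention_logits + control_info
    fused = np.exp(fused - fused.max(axis=-1, keepdims=True))
    return fused / fused.sum(axis=-1, keepdims=True)
```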
11. The training method of an image generation model according to claim 1, wherein the sample object image is generated by:
acquiring a sample image of the sample object;
performing image segmentation on the sample image based on a preset object segmentation model to obtain a sample segmentation image containing the sample object;
and performing image enhancement on the sample segmentation image to obtain the sample object image.
12. The training method of an image generation model according to claim 1, wherein the template mask image is generated by:
determining an object contour region of the reference object in the background template image;
and, in the background template image, replacing the pixel value of each pixel point inside the object contour region with a first value and the pixel value of each pixel point outside the object contour region with a second value to obtain the template mask image.
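By way of illustration only (not part of the claims), the template mask of claim 12 can be sketched as follows. The choice of 255/0 for the first and second values is an assumption; the claim only requires two distinct values:

```python
import numpy as np

def make_template_mask(shape, contour_region, first_value=255, second_value=0):
    # contour_region: boolean array, True inside the object contour region.
    # Pixels inside the region take the first value, all others the second.
    mask = np.full(shape, second_value, dtype=np.uint8)
    mask[contour_region] = first_value
    return mask
```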
13. The training method of an image generation model according to claim 1, wherein contour features of the reference object in the background template image are determined by:
performing object detection on the background template image to obtain an object skeleton map of the reference object;
extracting pose features from the object skeleton map to obtain a plurality of object pose key points;
and determining the contour features based on the plurality of object pose key points.
14. An image generation method, characterized in that the image generation method comprises:
acquiring a target object image of a target object, a target background image, and target description information, wherein the target background image contains a reference object to be replaced by the target object in the target object image, and the target description information describes the replacement of the reference object with the target object;
determining target stitched image features based on the target background image, a preset noise image, and a target mask image, wherein the target mask image is obtained by masking the reference object in the target background image;
determining denoising control information of an image generation model based on contour features of the reference object in the target background image, the target description information, and the target object image, wherein the image generation model is generated according to the training method of the image generation model of any one of claims 1 to 13;
and performing image generation through the image generation model based on the target stitched image features and the denoising control information to obtain a target image, wherein the target image indicates a result of replacing the reference object in the target background image with the target object of the target object image.
15. The image generation method of claim 14, wherein the image generation model comprises a diffusion network, a denoising network, and a decoding network;
the performing image generation through the image generation model based on the target stitched image features and the denoising control information to obtain a target image comprises:
compressing the target stitched image features to obtain target compressed image features;
performing diffusion processing on the target compressed image features based on the diffusion network to obtain target hidden space feature vectors;
Denoising the target hidden space feature vector through the denoising network based on the denoising control information to obtain a target denoising result;
and performing feature decoding on the target denoising result based on the decoding network to obtain the target image.
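By way of illustration only (not part of the claims), the four-stage generation of claim 15 can be sketched as a pipeline. The four networks are passed in as placeholder callables, since the claims do not fix their architectures:

```python
def generate_image(stitched_features, control_info, compress, diffuse, denoise, decode):
    # Claim-15 pipeline: compress the stitched features, diffuse them into a
    # hidden-space (latent) vector, denoise under the control information,
    # then decode the denoising result into the target image.
    compressed = compress(stitched_features)   # feature compression
    latent = diffuse(compressed)               # diffusion network
    denoised = denoise(latent, control_info)   # denoising network
    return decode(denoised)                    # decoding network
```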
16. An image generation model training apparatus, characterized in that the image generation model training apparatus comprises:
A first obtaining unit, configured to obtain a plurality of image-text sample pairs, where each image-text sample pair includes a background template image, a noise reference image, image description information corresponding to the noise reference image, and a sample object image of a sample object, where the noise reference image is obtained by adding noise to an expected result of replacing a reference object in the background template image with the sample object;
a first determining unit, configured to determine a sample stitched image feature based on the noise reference image, the background template image, and a template mask image, where the template mask image is obtained by masking the reference object in the background template image;
A second determining unit configured to determine denoising network control information of the image generation model based on contour features of the reference object in the background template image, the image description information, and the sample object image;
a prediction unit, configured to perform noise prediction through the image generation model based on the sample stitched image features and the denoising network control information to obtain a noise prediction result of the noise reference image;
and a training unit, configured to train the image generation model based on comparisons of the noise reference image and the noise prediction result over the plurality of image-text sample pairs.
17. An image generation apparatus, characterized in that the image generation apparatus comprises:
a second acquisition unit, configured to acquire a target object image of a target object, a target background image, and target description information, wherein the target background image contains a reference object to be replaced by the target object in the target object image, and the target description information describes the replacement of the reference object with the target object;
a third determining unit, configured to determine target stitched image features based on the target background image, a preset noise image, and a target mask image, wherein the target mask image is obtained by masking the reference object in the target background image;
a fourth determining unit, configured to determine denoising control information of an image generation model based on contour features of the reference object in the target background image, the target description information, and the target object image, wherein the image generation model is generated according to the training method of the image generation model of any one of claims 1 to 13;
and an image generation unit, configured to generate a target image through the image generation model based on the target stitched image features and the denoising control information, wherein the target image indicates a result of replacing the reference object in the target background image with the target object of the target object image.
18. An electronic device, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the training method of an image generation model according to any one of claims 1 to 13 or the image generation method according to any one of claims 14 to 15.
19. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the training method of an image generation model according to any one of claims 1 to 13 or the image generation method according to any one of claims 14 to 15.
20. A computer program product comprising a computer program that is read and executed by a processor of an electronic device, causing the electronic device to perform the training method of an image generation model according to any one of claims 1 to 13 or the image generation method according to any one of claims 14 to 15.
CN202411060639.9A 2024-08-05 2024-08-05 Training method, related device and medium for image generation model Active CN118570054B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411060639.9A CN118570054B (en) 2024-08-05 2024-08-05 Training method, related device and medium for image generation model


Publications (2)

Publication Number Publication Date
CN118570054A CN118570054A (en) 2024-08-30
CN118570054B true CN118570054B (en) 2024-10-01

Family

ID=92469867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411060639.9A Active CN118570054B (en) 2024-08-05 2024-08-05 Training method, related device and medium for image generation model

Country Status (1)

Country Link
CN (1) CN118570054B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118760777B (en) * 2024-09-05 2025-01-17 浙江天猫技术有限公司 Image generation method, and generation method and device of target meridional graph large model
CN118779776B (en) * 2024-09-10 2025-02-14 杭州觅睿科技股份有限公司 A fall action detection method and system based on multi-diffusion model
CN119067868B (en) * 2024-11-01 2025-02-14 腾讯科技(深圳)有限公司 Image processing method, device and equipment, medium and product
CN119850474A (en) * 2024-12-11 2025-04-18 科大讯飞股份有限公司 Image filling method, device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118229844A (en) * 2024-05-22 2024-06-21 腾讯科技(深圳)有限公司 Image generation data processing method, image generation method and device
CN118429755A (en) * 2024-06-28 2024-08-02 腾讯科技(深圳)有限公司 Method, device, equipment and medium for training draft graph model and image prediction method, device and equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20240185588A1 (en) * 2022-12-06 2024-06-06 Adobe Inc. Fine-tuning and controlling diffusion models
CN117218217A (en) * 2023-03-07 2023-12-12 腾讯科技(深圳)有限公司 Training method, device, equipment and storage medium of image generation model


Also Published As

Publication number Publication date
CN118570054A (en) 2024-08-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant