CN117351299B - Image generation and model training method, device, equipment and storage medium - Google Patents
Image generation and model training method, device, equipment and storage medium
- Publication number: CN117351299B (application CN202311184380.4A)
- Authority: CN (China)
- Prior art keywords: parameter matrix, model, image, updated, generation
- Legal status: Active
Classifications
- G06V10/774 — Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06F17/16 — Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
- G06V10/82 — Image or video recognition or understanding using neural networks
Abstract
The disclosure provides an image generation and model training method, device, equipment and storage medium, relating to the technical field of artificial intelligence, in particular to computer vision, deep learning and large models, and applicable to scenarios such as image processing. The training method of the image generation model comprises: obtaining a first generation result of a teacher model; obtaining a second generation result of a student model, where the student model is the image generation model to be trained; constructing a loss function based on the first generation result and the second generation result; updating a first parameter matrix and a second parameter matrix based on the loss function; and obtaining an updated target parameter matrix from the updated first parameter matrix and the updated second parameter matrix, where the rank of the first parameter matrix and the rank of the second parameter matrix are both smaller than the rank of the target parameter matrix. The present disclosure can reduce computing-resource overhead.
Description
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to computer vision, deep learning and large models, and is applicable to scenarios such as image processing. It specifically concerns an image generation and model training method, an image generation and model training apparatus, a device and a storage medium.
Background
A diffusion model is a type of generative model that can be used to generate high-resolution images. The diffusion model decomposes image generation into a number of denoising steps, i.e., the sampling process of the generative model. Because the number of sampling steps is large and each sampling step requires two forward denoising (denoise) passes, sampling is slow.
Disclosure of Invention
The present disclosure provides an image generation and model training method, apparatus, device and storage medium.
According to one aspect of the disclosure, a training method for an image generation model is provided, comprising: obtaining a first generation result of a teacher model; obtaining a second generation result of a student model, where the student model is the image generation model to be trained, a target parameter matrix of the student model is determined based on a first parameter matrix and a second parameter matrix, and the rank of the first parameter matrix and the rank of the second parameter matrix are smaller than the rank of the target parameter matrix; constructing a loss function based on the first generation result and the second generation result; updating the first parameter matrix and the second parameter matrix based on the loss function; and obtaining an updated target parameter matrix from the updated first parameter matrix and the updated second parameter matrix.
According to another aspect of the disclosure, an image generation method is provided, comprising: obtaining an input feature, the input feature being used for image generation; and processing the input feature with an image generation model so as to generate an output image corresponding to the input feature, where the image generation model is trained using the training method of any of the above aspects.
According to another aspect of the disclosure, a training apparatus for an image generation model is provided, comprising a first acquisition module, a second acquisition module, a construction module and an update module. The first acquisition module is used to obtain a first generation result of a teacher model. The second acquisition module is used to obtain a second generation result of a student model, where the student model is the image generation model to be trained, a target parameter matrix of the student model is determined based on a first parameter matrix and a second parameter matrix, and the rank of the first parameter matrix and the rank of the second parameter matrix are smaller than the rank of the target parameter matrix. The construction module is used to construct a loss function based on the first generation result and the second generation result. The update module is used to update the first parameter matrix and the second parameter matrix based on the loss function, and to obtain an updated target parameter matrix from the updated first parameter matrix and the updated second parameter matrix.
According to another aspect of the disclosure, an image generation apparatus is provided, comprising an acquisition module and a generation module. The acquisition module is used to obtain an input feature, the input feature being used for image generation. The generation module is used to process the input feature with an image generation model so as to generate an output image corresponding to the input feature, where the image generation model is trained using the training method of any of the above aspects.
According to another aspect of the present disclosure there is provided an electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method according to any one of the above aspects.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any of the above aspects.
The technical solution of the present disclosure can reduce computing-resource overhead.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
Fig. 2 is a schematic diagram of an application scenario provided according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an overall architecture provided in accordance with an embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 7 is a schematic diagram according to a fifth embodiment of the present disclosure;
Fig. 8 is a schematic diagram of an electronic device for implementing a training method or image generation method of an image generation model of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In order to increase the sampling speed, knowledge distillation can be adopted in the model training stage.
In the knowledge distillation approach, the overall architecture comprises a teacher model and a student model, where the teacher network and the student network are the same size. The teacher model performs two denoising passes: a conditional denoising (condition denoise) pass that yields a conditional generation result y_c, and an unconditional denoising (uncondition denoise) pass that yields an unconditional generation result y_uc. A weighted operation on y_c and y_uc gives the teacher model's final generation result y_t. The student model performs a single conditional denoising pass to obtain its final generation result y_s. A loss function L is constructed from the teacher's final result y_t and the student's final result y_s, and L is then used to adjust the model parameters of the student model.
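For illustration, this distillation step can be sketched in a few lines of NumPy (a minimal sketch, not the patent's implementation: toy vectors stand in for the denoising networks' outputs, and the weight value w is illustrative):

```python
import numpy as np

def teacher_target(y_c, y_uc, w):
    # Weighted combination of the teacher's conditional result y_c and
    # unconditional result y_uc: y_t = y_uc + w * (y_c - y_uc)
    return y_uc + w * (y_c - y_uc)

def distillation_loss(y_t, y_s):
    # MSE between the teacher target and the student's single-pass result
    return float(np.mean((y_t - y_s) ** 2))

# Toy "generation results" standing in for denoising-network outputs
y_c = np.array([1.0, 2.0, 3.0, 4.0])    # conditional denoising result
y_uc = np.array([0.0, 1.0, 2.0, 3.0])   # unconditional denoising result
y_s = np.array([2.0, 3.0, 4.0, 5.0])    # student's conditional result

y_t = teacher_target(y_c, y_uc, w=2.0)  # [2., 3., 4., 5.]
loss = distillation_loss(y_t, y_s)      # 0.0 -- student matches the target
```

The loss L would then drive the parameter update of the student model, as described below.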
In the related art, all model parameters of the student model need to be updated (learnable), which results in high computing resource overhead.
In order to reduce computing resource overhead, the present disclosure provides the following embodiments.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. This embodiment provides a training method for an image generation model, comprising the following steps:
101. Obtain a first generation result of a teacher model.
102. Obtain a second generation result of a student model, where the student model is the image generation model to be trained, a target parameter matrix of the student model is determined based on a first parameter matrix and a second parameter matrix, and the rank of the first parameter matrix and the rank of the second parameter matrix are smaller than the rank of the target parameter matrix.
103. Construct a loss function based on the first generation result and the second generation result.
104. Update the first parameter matrix and the second parameter matrix based on the loss function, and obtain an updated target parameter matrix from the updated first parameter matrix and the updated second parameter matrix.
In order to accelerate sampling, a knowledge distillation architecture can be adopted for the image generation model, wherein the architecture comprises a teacher model and a student model.
The teacher model and the student model are both image generation models. In addition, the teacher model and the student model may select the same model structure, and may be the same size.
The student model is an image generation model to be trained, that is, an image generation model to be finally adopted. After training, the student model is adopted to generate images.
The output of the teacher model may be referred to as a first generation result and the output of the student model may be referred to as a second generation result, and in order to distill knowledge of the teacher model into the student model, a loss function is constructed based on the first generation result and the second generation result, and model parameters of the student model are updated with the loss function.
The student model is a deep neural network model, which typically has a multi-layer structure including, for example, convolution layers, pooling layers and attention layers. Each network layer has its own model parameters; for example, the student model may include parameter matrices W_s^1, W_s^2, and so on.
To reduce the amount of computation and save computing resources, a subset of the parameter matrices can be selected from all model parameters for updating (learning); the parameter matrices to be updated are called target parameter matrices.
The target parameter matrix may be updated by a learning process. Let the target parameter matrix be denoted W_s, with dimensions d1×d2. If W_s is learned as a whole, the number of parameters to be learned is d1×d2; in general, d1 and d2 are both large, so the amount of computation and the computing-resource cost are high.
To reduce the cost of computing resources, in this embodiment the target parameter matrix is not learned directly as a whole; instead, it is decomposed into two parameter matrices, called the first parameter matrix and the second parameter matrix, and the target parameter matrix is computed from them. The first and second parameter matrices are low-rank matrices, i.e., their ranks are smaller than the rank of the target parameter matrix. For example, denote the first and second parameter matrices A and B, where A is d1×r, B is r×d2, and r is the rank of the two matrices, which may be manually set to a small value such as 32. The number of parameters to be learned is then d1×r + r×d2 = r(d1+d2); compared with d1×d2, this significantly reduces the number of learnable parameters and the computing-resource overhead.
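The parameter-count comparison can be verified directly (the dimensions below are illustrative; the text only requires r to be much smaller than d1 and d2):

```python
import numpy as np

d1, d2, r = 1024, 1024, 32

full_params = d1 * d2              # learning W_s as a whole
low_rank_params = d1 * r + r * d2  # learning A (d1 x r) and B (r x d2)
assert low_rank_params == r * (d1 + d2)
# 1048576 vs 65536: a 16x reduction in learnable parameters here
print(full_params, low_rank_params)

# The product A @ B has rank at most r, hence "low-rank"
rng = np.random.default_rng(0)
A = rng.normal(size=(d1, r))
B = rng.normal(size=(r, d2))
assert np.linalg.matrix_rank(A @ B) <= r
```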
In this embodiment, the first parameter matrix and the second parameter matrix are updated based on the loss function, and the updated target parameter matrix is obtained from the updated first and second parameter matrices; the target parameter matrix is thus updated through the two low-rank matrices. Because the ranks of the first and second parameter matrices are small, the number of parameters to be learned is reduced compared with learning the target parameter matrix directly, which lowers computing-resource overhead and improves model-training efficiency.
In order to better understand the embodiments of the present disclosure, application scenarios to which the embodiments of the present disclosure are applicable are described.
Fig. 2 is a schematic diagram of an application scenario provided in an embodiment of the present disclosure. The environment comprises a user terminal 201 and a server 202. The user terminal 201 may be a personal computer (Personal Computer, PC), a mobile phone, a tablet computer, a notebook computer, a smart wearable device, or the like. The server 202 may be a cloud server or a local server, and the user terminal 201 and the server 202 may communicate over a communication network, for example a wired and/or wireless network.
The image generation model may be trained by the server. The user terminal sends training data to the server, and the server trains on that data to obtain the image generation model. In the inference stage, the server can use the image generation model to generate images; alternatively, when the user terminal has offline image generation capability, the server can send the image generation model to the user terminal, and images are generated locally on the terminal.
The image generation model may be a large model (Large Language Model, LLM).
LLMs have been a hot topic in artificial intelligence in recent years. An LLM is a pre-trained language model that learns rich linguistic and world knowledge by pre-training on massive text data, and can achieve remarkable results on various natural language processing (Natural Language Processing, NLP) tasks. Applications such as ChatGPT are developed on top of LLMs and can generate fluent, logical and creative text, and even hold natural conversations with humans. In a natural language processing scenario, the large model may be a Transformer-based Generative Pre-trained Transformer (GPT) model, an Enhanced Representation through Knowledge Integration (ERNIE) model, or the like.
In this embodiment, taking the text-to-image scenario as an example, the corresponding image generation model may be a diffusion model (diffusion model).
The diffusion model samples over time steps. The initial image is a noise image Z_T, and after a preset number (T) of sampling steps, the final denoised image Z_0 is obtained. Each sampling step (e.g., from time t to time t-1) uses a denoising network to process the current image Z_t and obtain the next image Z_{t-1}, for t = T, T-1, ...; sampling continues until the final image Z_0 is obtained.
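The iterative sampling loop can be sketched as follows (a toy sketch only: the denoising network is replaced by a trivial placeholder, whereas a real diffusion model would predict and remove noise conditioned on the time step and, for text-to-image, on the text):

```python
import numpy as np

def denoise_step(z_t, t):
    # Placeholder for the denoising network: simply shrinks the signal.
    # A real network would predict and remove noise at time step t.
    return 0.9 * z_t

T = 50                             # preset number of sampling steps
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 8))        # Z_T: the initial pure-noise image

for t in range(T, 0, -1):          # t = T, T-1, ..., 1
    z = denoise_step(z, t)         # Z_t -> Z_{t-1}

# z now holds Z_0, the final "denoised" image of the toy loop
```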
Taking the Stable Diffusion model as an example, the number of sampling steps T is typically 50, i.e., 50 iterations. To accelerate sampling, a knowledge distillation architecture may be employed for model training.
Thus, as shown in FIG. 2, knowledge-based distillation training may be performed on the training data to obtain a trained image generation model.
As shown in fig. 3, a teacher model and a student model are included under the knowledge distillation architecture.
Taking text-to-image generation as an example, the training data comprise an image sample and a text sample. The image sample is processed by an image encoder to obtain image features, represented in Fig. 3 as noisy hidden features (latent); the text sample is processed by a text encoder to obtain text features.
Since the knowledge distillation architecture of this embodiment aims to accelerate sampling rather than reduce model size, the teacher model and the student model may have the same structure and the same size; for example, both are UNet-style networks of the same size.
The teacher model performs two denoising passes: a conditional denoising pass, whose result is called the conditional generation result, and an unconditional denoising pass, whose result is called the unconditional generation result. The input to the unconditional pass is just the noisy image; the input to the conditional pass additionally includes the text.
Referring to Fig. 3, the teacher model processes the input noisy hidden features to obtain the unconditional generation result, and processes the input noisy hidden features together with the text features to obtain the conditional generation result.
Then, a weighted operation (weighted sum) may be performed on the conditional generation result and the unconditional generation result to obtain the first generation result of the teacher model.
For the student model, to speed up sampling, there is only one denoising pass instead of two. The result of this single pass is a conditional generation result.
Referring to fig. 3, the student model is adopted to process the input noisy hidden features and text features to obtain a condition generation result as a second generation result of the student model.
A loss function, such as a mean squared error (Mean Square Error, MSE) loss function, may then be constructed based on the first and second generation results.
After the loss function is obtained, the model parameters of the student model may be updated using it. To reduce computing-resource overhead, a parameter-efficient update is adopted, i.e., the number of parameters that need to be learned is reduced.
Assume a certain model parameter matrix of the student model is denoted W_s, with size d1×d2. In the related art, this matrix is learned as a whole, so the computing-resource cost is high; for example, the number of parameters to be learned is d1×d2.
In addition, the student model includes multiple network layers, and in the related art the model parameters of every network layer need to be learned and updated; for example, the parameter matrices W_s^1, W_s^2, ... of the multiple layers are each updated.
In the diffusion model scene, the teacher model and the student model are both denoising networks and have the same size.
The denoising network may be selected as a network in the form of UNet, which includes an encoder (encoder) and a decoder (decoder), both of which have cross-attention (cross-attention) networks.
To reduce computing-resource overhead, this embodiment may learn only the parameter matrix (attention weight matrix) in the cross-attention network, keeping the parameter matrices of the other layers unchanged, instead of updating the parameter matrices of all layers. The model parameters that need to be updated are referred to as the target parameter matrix. In addition, a target parameter matrix such as W_s is not learned as a whole; it can be decomposed into two low-rank matrices that are learned instead, further reducing resource overhead.
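Selecting only the cross-attention weight matrices as trainable while freezing every other layer can be sketched with a name-based filter (the parameter names below are hypothetical, not taken from the patent):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter dictionary for a UNet-style student model
params = {
    "encoder.conv1.weight": rng.normal(size=(16, 3)),
    "encoder.cross_attention.weight": rng.normal(size=(8, 8)),
    "decoder.cross_attention.weight": rng.normal(size=(8, 8)),
    "decoder.conv_out.weight": rng.normal(size=(3, 16)),
}

# Only cross-attention weight matrices become target parameter matrices;
# parameter matrices of all other layers are kept frozen.
trainable = {name for name in params if "cross_attention" in name}
frozen = set(params) - trainable

assert trainable == {"encoder.cross_attention.weight",
                     "decoder.cross_attention.weight"}
```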
In combination with the above application scenario, the present disclosure further provides the following embodiments.
Fig. 4 is a schematic diagram of a second embodiment of the present disclosure, which provides a training method for an image generation model. As shown in Fig. 4, the method includes:
401. Process the image features of an image sample with a teacher model to obtain an unconditional generation result.
402. Process the image features and the text features of a text sample with the teacher model to obtain a conditional generation result.
403. Weight the unconditional generation result and the conditional generation result to obtain a first generation result of the teacher model.
Referring to Fig. 3, the image features are represented as noisy hidden features. The image features may be obtained by extracting features from the image sample with an image encoder.
The text feature may be obtained by extracting features from a text sample using a text encoder.
Denote the conditional generation result y_c, the unconditional generation result y_uc, and the first generation result obtained after weighting y_t. The weighting formula may be:

y_t = y_uc + w * (y_c - y_uc)

where w is a preset weight value.
In this embodiment, the first generation result of the teacher model is obtained from the teacher's unconditional and conditional generation results, and this first generation result supervises the output of the student model. The teacher's unconditional and conditional generation passes are thereby distilled into a single generation pass of the student model, further accelerating sampling.
404. Process the image features of the image sample and the text features of the text sample with the student model to obtain a second generation result of the student model.
Referring to Fig. 3, the student model has only one denoising pass, which is a conditional denoising pass: the image features and text features are processed to obtain the student model's conditional generation result, which serves as the second generation result.
In this embodiment, the student model processes the image features and text features to obtain the second generation result; a single sampling pass of the student model achieves the effect of the teacher model's two passes, increasing the sampling speed.
405. Construct a loss function based on the first generation result and the second generation result.
Referring to Fig. 3, the loss function may be an MSE function. Denoting the second generation result y_s, an MSE loss function L may be constructed from the first generation result y_t and the second generation result y_s; the model parameters of the student model are then updated based on L.
406. Update the first parameter matrix and the second parameter matrix based on the loss function, and obtain the updated target parameter matrix of the student model from the updated first parameter matrix and the updated second parameter matrix.
Referring to Fig. 3, to reduce computing-resource overhead, the model parameters of the student model are updated in a parameter-efficient manner.
On the one hand, not all network-layer parameter matrices of the student model need to be updated: the parameter matrices of some network layers can be selected as target parameter matrices to update, while the parameter matrices of the remaining layers are kept unchanged.
In this embodiment, using the parameter matrices of only some network layers as target parameter matrices means that only part of the parameters are updated rather than all of them, which reduces the number of parameters to update and lowers resource overhead.
Further, the plurality of network layers includes an attention network layer, and the target parameter matrix is an attention weight matrix of the attention network layer.
For example, when the student model selects a UNet model that includes a cross-attention network, the attention weight matrix of the cross-attention network may be used as the target parameter matrix. Further, if the number of the cross-attention networks is plural, the attention weight matrix of one or more of the cross-attention networks may be used as the target parameter matrix.
Because the attention network plays an important role in model performance, this embodiment updates the attention weight matrix as the target parameter matrix, which improves model performance and the image generation effect.
On the other hand, a single target parameter matrix is not learned as a whole: it is decomposed into two low-rank parameter matrices whose parameters are learned and updated, and the updated target parameter matrix is then obtained from the updated low-rank parameter matrices.
In the process of disassembly, the target parameter matrix can be directly disassembled into the product of two low-rank parameter matrices, or the increment part of the target parameter matrix can be disassembled into the product of two low-rank parameter matrices.
Accordingly, the product of the updated first parameter matrix and the updated second parameter matrix may be used as the updated target parameter matrix, or
Obtaining an increment part of the target parameter matrix according to the updated first parameter matrix and the updated second parameter matrix; and obtaining an updated target parameter matrix according to the target parameter matrix before updating and the increment part.
For example, for direct decomposition, denote the updated target parameter matrix by W_new and the two updated low-rank parameter matrices by A and B; the calculation formula may be:
W_new = A * B;
where W_new is of size d1 x d2, A is d1 x r, B is r x d2, and r is a manually selected small value, such as 32.
The first and second parameter matrices may be updated using a general learning procedure, for example the stochastic gradient descent (SGD) algorithm, in which the gradient is calculated from the loss function L and the pre-update parameters, and the post-update parameters are then calculated from the gradient. The initial values of A and B may be randomly generated or fixedly set.
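The direct decomposition above can be illustrated with a small NumPy sketch; the dimensions are illustrative assumptions, since the patent only specifies that r is a small manually chosen value such as 32:

```python
import numpy as np

# The target matrix W_new (d1 x d2) is never learned directly; instead two
# small factors A (d1 x r) and B (r x d2) are learned, with r much smaller
# than d1 and d2, and W_new is formed as their product.
d1, d2, r = 640, 640, 32

rng = np.random.default_rng(0)
A = rng.standard_normal((d1, r)) * 0.01  # randomly generated initial values
B = rng.standard_normal((r, d2)) * 0.01

W_new = A @ B  # W_new = A * B

# Parameter savings: d1*r + r*d2 learned values instead of d1*d2.
full_params = d1 * d2
low_rank_params = d1 * r + r * d2
print(W_new.shape)                    # (640, 640)
print(low_rank_params / full_params)  # 0.1: only 10% as many parameters
```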
In this embodiment, the product of the updated first parameter matrix and the updated second parameter matrix is used as the updated target parameter matrix, so that the target parameter matrix can be simply, conveniently and efficiently updated, and the processing efficiency is improved.
For another example, when the increment part is decomposed into two parameter matrices, denote the updated target parameter matrix by W_new, the pre-update target parameter matrix by W_old, the increment part by ΔW, and the two updated low-rank parameter matrices by A and B; the calculation formula may be:
W_new = W_old + ΔW = W_old + A * B;
where W_new and W_old are of size d1 x d2, A is d1 x r, B is r x d2, and r is a manually selected small value, such as 32.
The first and second parameter matrices may be updated using a general learning procedure, for example the stochastic gradient descent (SGD) algorithm, in which the gradient is calculated from the loss function L and the pre-update parameters, and the post-update parameters are then calculated from the gradient. The initial values of A and B may be randomly generated or fixedly set.
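A toy NumPy sketch of this incremental variant follows; the frozen W_old, the quadratic loss, the learning rate, and the dimensions are all illustrative assumptions used only to show the mechanics of updating A and B by gradient descent while W_old stays fixed:

```python
import numpy as np

d1, d2, r, lr = 8, 8, 2, 0.01
rng = np.random.default_rng(1)

W_old = rng.standard_normal((d1, d2))    # pre-update target matrix, kept frozen
A = rng.standard_normal((d1, r)) * 0.01  # randomly generated initial value
B = np.zeros((r, d2))                    # fixed initial value, so A @ B starts at 0

target = rng.standard_normal((d1, d2))   # stand-in for a distillation target
init_residual = np.linalg.norm(W_old - target)

for _ in range(1000):
    W_new = W_old + A @ B                # W_new = W_old + ΔW, with ΔW = A @ B
    err = W_new - target                 # gradient of 0.5*||W_new - target||^2
    grad_A = err @ B.T                   # chain rule through ΔW = A @ B
    grad_B = A.T @ err
    A -= lr * grad_A                     # SGD step on the low-rank factors only
    B -= lr * grad_B                     # W_old itself is never updated

final_residual = np.linalg.norm(W_old + A @ B - target)
# The rank-r increment cannot fit the target exactly, but the residual shrinks.
```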
In this embodiment, the increment part is calculated from the updated first parameter matrix and the updated second parameter matrix, and the target parameter matrix is then updated using the increment part. Performing the low-rank decomposition on the increment part improves operation accuracy and thus model accuracy.
In addition, model training may be an iterative process in which the target parameter matrix is updated in the above manner in each iteration; after multiple iterations, a final target parameter matrix is obtained, that is, the final image generation model (student model). The final student model may then be used for image generation.
In this embodiment, during knowledge distillation, the target parameter matrix is updated via the two low-rank parameter matrices, which reduces the number of updated parameters and the consumption of computing resources. For downstream tasks of a diffusion generation model with a small data volume, because the proportion of updated parameters is small, the capability of the pre-trained diffusion generation model is better preserved on the one hand, and a better distillation effect on the downstream task is obtained on the other.
The above embodiments describe a model training process, and after training to obtain an image generation model, the image generation model may be used to generate an image.
Fig. 5 is a schematic diagram of a third embodiment of the present disclosure, which provides an image generating method, including:
501. Input features are acquired, which are used to generate an image.
502. The input features are processed using an image generation model to generate an output image corresponding to the input features.
Wherein the image generation model is trained using the training method of any one of the above.
In this embodiment, since an image generation model with low computational cost is adopted, resource overhead during image generation can be reduced and image generation efficiency improved.
In some embodiments, the input features include image features of a noisy image and text features of a prompt text, and the output image is a denoised image of the noisy image, accordingly.
Generally, a single sampling process includes two denoising processes; in this embodiment, a single denoising process achieves the effect of the original two, so the sampling speed can be increased and image generation efficiency improved while image quality is maintained.
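The claimed speedup can be made concrete with simple arithmetic; the step count of 50 is an illustrative assumption, not a value from the patent:

```python
# Per sampling step, the teacher runs two denoising passes (unconditional and
# conditional), while the distilled student runs one fused pass.
sampling_steps = 50
teacher_passes = 2 * sampling_steps    # two denoising processes per step
student_passes = 1 * sampling_steps    # single denoising process per step
print(teacher_passes, student_passes)  # 100 50
```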
Fig. 6 is a schematic diagram according to a fourth embodiment of the present disclosure. The present embodiment provides a training apparatus for an image generation model, as shown in fig. 6, the apparatus 600 includes a first acquisition module 601, a second acquisition module 602, a construction module 603, and an update module 604.
The first acquisition module 601 is used for acquiring a first generation result of a teacher model. The second acquisition module 602 is used for acquiring a second generation result of a student model, where the student model is an image generation model to be trained, a target parameter matrix of the student model is determined based on a first parameter matrix and a second parameter matrix, and the rank of the first parameter matrix and the rank of the second parameter matrix are smaller than the rank of the target parameter matrix. The construction module 603 is used for constructing a loss function based on the first generation result and the second generation result. The updating module 604 is used for updating the first parameter matrix and the second parameter matrix based on the loss function and obtaining an updated target parameter matrix from the updated first parameter matrix and the updated second parameter matrix.
In this embodiment, the first parameter matrix and the second parameter matrix are updated based on the loss function, and the updated target parameter matrix is obtained according to the updated first parameter matrix and the updated second parameter matrix, so that the target parameter matrix can be updated by using the first parameter matrix and the second parameter matrix, and the rank of the first parameter matrix and the rank of the second parameter matrix are smaller, so that the number of parameters to be learned can be reduced, the computing resource cost can be reduced, and the model training efficiency can be improved, compared with the direct learning of the target parameter matrix.
In some embodiments, the updating module 604 is further configured to take the product of the updated first parameter matrix and the updated second parameter matrix as the updated target parameter matrix.
In this embodiment, the product of the updated first parameter matrix and the updated second parameter matrix is used as the updated target parameter matrix, so that the target parameter matrix can be simply, conveniently and efficiently updated, and the processing efficiency is improved.
In some embodiments, the updating module 604 is further configured to obtain an incremental portion of the target parameter matrix according to the updated first parameter matrix and the updated second parameter matrix, and obtain the updated target parameter matrix according to the pre-updated target parameter matrix and the incremental portion.
In this embodiment, the increment part is calculated from the updated first parameter matrix and the updated second parameter matrix, and the target parameter matrix is then updated using the increment part. Performing the low-rank decomposition on the increment part improves operation accuracy and thus model accuracy.
In some embodiments, the student model includes a plurality of network layers, and the target parameter matrix is a parameter matrix of a portion of the plurality of network layers.
In this embodiment, the parameter matrices of only some network layers are used as target parameter matrices, so that only part of the parameters, rather than all of them, are updated; this reduces the number of parameters to be updated and thus the resource overhead.
In some embodiments, the plurality of network layers includes an attention network layer, and the target parameter matrix is an attention weight matrix of the attention network layer.
Because the attention network plays an important role in model performance, in this embodiment the attention weight matrix is used as the target parameter matrix for updating, which can improve model performance and thus the image generation effect.
In some embodiments, the first acquisition module 601 is further configured to: perform generation processing on the image features of an image sample using the teacher model to obtain an unconditional generation result; perform generation processing on the image features and the text features of a text sample using the teacher model to obtain a conditional generation result; and perform weighting processing on the unconditional generation result and the conditional generation result to obtain the first generation result.
In this embodiment, the first generating result of the teacher model is obtained based on the unconditional generating result and the conditional generating result of the teacher model, and the generating result of the student model can be supervised by using the first generating result, so that the unconditional generating process and the conditional generating process of the teacher model can be distilled into a single generating process of the student model, and the sampling speed is further increased.
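A hedged sketch of the weighting step described above, in classifier-free-guidance style; the guidance weight w and its exact placement are assumptions, since the patent only states that the unconditional and conditional results are weighted and combined:

```python
import numpy as np

def teacher_first_result(uncond, cond, w):
    # Weighted combination of the teacher's two generation results: start from
    # the unconditional result and move toward the conditional one by factor w.
    return uncond + w * (cond - uncond)

uncond = np.zeros(4)  # stand-in unconditional generation result
cond = np.ones(4)     # stand-in conditional generation result
out = teacher_first_result(uncond, cond, w=2.0)
print(out)  # [2. 2. 2. 2.]
```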
In some embodiments, the second obtaining module 602 is further configured to perform a generating process on the image feature of the image sample and the text feature of the text sample using the student model to obtain the second generating result.
In this embodiment, the student model processes the image features and the text features to obtain the second generation result; a single sampling pass of the student model can achieve the effect of two sampling passes of the teacher model, thereby increasing the sampling speed.
Fig. 7 is a schematic diagram according to a fifth embodiment of the present disclosure. The embodiment provides an image generating apparatus, as shown in fig. 7, the apparatus 700 includes an acquisition module 701 and a generation module 702.
The acquisition module 701 is configured to acquire input features, where the input features are used to generate an image. The generation module 702 is configured to perform generation processing on the input features using an image generation model to generate an output image corresponding to the input features, where the image generation model is trained using any of the training methods described above.
In this embodiment, since an image generation model with low computational cost is adopted, resource overhead during image generation can be reduced and image generation efficiency improved.
In some embodiments, the input features include image features of a noisy image and text features of a prompt text, and the output image is a denoised image of the noisy image, accordingly.
Generally, a single sampling process includes two denoising processes; in this embodiment, a single denoising process achieves the effect of the original two, so the sampling speed can be increased and image generation efficiency improved while image quality is maintained.
It is to be understood that in the embodiments of the disclosure, the same or similar content in different embodiments may be referred to each other.
It can be understood that "first", "second", etc. in the embodiments of the present disclosure are only used for distinguishing, and do not indicate the importance level, the time sequence, etc.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of users' personal information comply with relevant laws and regulations, and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic device 800 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in the electronic device 800 are connected to the I/O interface 805, including an input unit 806 such as a keyboard, a mouse, etc., an output unit 807 such as various types of displays, speakers, etc., a storage unit 808 such as a magnetic disk, an optical disk, etc., and a communication unit 809 such as a network card, a modem, a wireless communication transceiver, etc. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The calculation unit 801 performs the respective methods and processes described above, for example, a training method of an image generation model or an image generation method. For example, in some embodiments, the training method of the image generation model or the image generation method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the training method of the image generation model or the image generation method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform a training method or an image generation method of the image generation model in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special or general purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN), a Wide Area Network (WAN), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service expansibility in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.
Claims (19)
1. A training method of an image generation model, comprising:
Acquiring a first generation result of a teacher model;
the method comprises the steps of obtaining a second generation result of a student model, wherein the student model is an image generation model to be trained, a target parameter matrix of the student model is determined based on a first parameter matrix and a second parameter matrix, and the rank of the first parameter matrix and the rank of the second parameter matrix are smaller than the rank of the target parameter matrix;
constructing a loss function based on the first and second generation results;
updating the first parameter matrix and the second parameter matrix based on the loss function, and obtaining an updated target parameter matrix according to the updated first parameter matrix and the updated second parameter matrix;
The obtaining the first generating result of the teacher model includes:
Generating and processing the image characteristics of the image sample by adopting the teacher model to obtain unconditional generation results;
generating the image characteristics and the text characteristics of the text sample by adopting the teacher model to obtain a conditional generation result;
and weighting the unconditional generation result and the conditional generation result to obtain the first generation result.
2. The method of claim 1, wherein the obtaining the updated target parameter matrix from the updated first parameter matrix and the updated second parameter matrix comprises:
And taking the product of the updated first parameter matrix and the updated second parameter matrix as the updated target parameter matrix.
3. The method of claim 1, wherein the obtaining the updated target parameter matrix from the updated first parameter matrix and the updated second parameter matrix comprises:
obtaining an increment part of the target parameter matrix according to the updated first parameter matrix and the updated second parameter matrix;
and obtaining an updated target parameter matrix according to the target parameter matrix before updating and the increment part.
4. The method of claim 1, wherein the student model comprises a plurality of network layers, the target parameter matrix being a parameter matrix of a portion of the plurality of network layers.
5. The method of claim 4, wherein the plurality of network layers includes an attention network layer, the target parameter matrix being an attention weight matrix of the attention network layer.
6. The method of any of claims 1-5, wherein the obtaining a second generation of the student model comprises:
and generating the image characteristics of the image sample and the text characteristics of the text sample by adopting the student model so as to obtain the second generation result.
7. An image generation method, comprising:
acquiring input features, wherein the input features are used for generating images;
Generating the input features by adopting an image generation model so as to generate output images corresponding to the input features;
Wherein the image generation model is trained using the method of any of claims 1-6.
8. The method of claim 7, wherein the input features include image features of a noisy image and text features of a prompt text, and the output image is a denoised image of the noisy image, accordingly.
9. A training apparatus for an image generation model, comprising:
the first acquisition module is used for acquiring a first generation result of the teacher model;
The second acquisition module is used for acquiring a second generation result of a student model, wherein the student model is an image generation model to be trained, a target parameter matrix of the student model is determined based on a first parameter matrix and a second parameter matrix, and the rank of the first parameter matrix and the rank of the second parameter matrix are smaller than the rank of the target parameter matrix;
The construction module is used for constructing a loss function based on the first generation result and the second generation result;
the updating module is used for updating the first parameter matrix and the second parameter matrix based on the loss function and obtaining an updated target parameter matrix according to the updated first parameter matrix and the updated second parameter matrix;
The first acquisition module is further configured to:
Generating and processing the image characteristics of the image sample by adopting the teacher model to obtain unconditional generation results;
generating the image characteristics and the text characteristics of the text sample by adopting the teacher model to obtain a conditional generation result;
and weighting the unconditional generation result and the conditional generation result to obtain the first generation result.
10. The apparatus of claim 9, wherein the update module is further to:
And taking the product of the updated first parameter matrix and the updated second parameter matrix as the updated target parameter matrix.
11. The apparatus of claim 9, wherein the update module is further to:
obtaining an increment part of the target parameter matrix according to the updated first parameter matrix and the updated second parameter matrix;
and obtaining an updated target parameter matrix according to the target parameter matrix before updating and the increment part.
12. The apparatus of claim 9, wherein the student model comprises a plurality of network layers, the target parameter matrix being a parameter matrix of a portion of the plurality of network layers.
13. The apparatus of claim 12, wherein the plurality of network layers comprises an attention network layer, the target parameter matrix being an attention weight matrix of the attention network layer.
14. The apparatus of any of claims 9-13, wherein the second acquisition module is further to:
and generating the image characteristics of the image sample and the text characteristics of the text sample by adopting the student model so as to obtain the second generation result.
15. An image generating apparatus comprising:
The acquisition module is used for acquiring input features, wherein the input features are used for generating images;
The generation module is used for generating and processing the input features by adopting an image generation model so as to generate output images corresponding to the input features;
Wherein the image generation model is trained using the method of any of claims 1-6.
16. The apparatus of claim 15, wherein the input features include image features of a noisy image and text features of a prompt text, and the output image is a denoised image of the noisy image, accordingly.
17. An electronic device, comprising:
At least one processor, and
A memory communicatively coupled to the at least one processor, wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311184380.4A CN117351299B (en) | 2023-09-13 | 2023-09-13 | Image generation and model training method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117351299A CN117351299A (en) | 2024-01-05 |
CN117351299B true CN117351299B (en) | 2024-12-13 |
Family
ID=89356494
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311184380.4A Active CN117351299B (en) | 2023-09-13 | 2023-09-13 | Image generation and model training method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117351299B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117915106B (en) * | 2024-01-17 | 2025-05-27 | 书行科技(北京)有限公司 | Encoding method and apparatus, electronic device, and computer-readable storage medium |
CN118133989A (en) * | 2024-05-06 | 2024-06-04 | 中移(杭州)信息技术有限公司 | IOC extraction model training method, IOC extraction method, device and related equipment |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116597154A (en) * | 2023-05-24 | 2023-08-15 | 湖北工业大学 | Training method and system for image denoising model |
Family Cites Families (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140156575A1 (en) * | 2012-11-30 | 2014-06-05 | Nuance Communications, Inc. | Method and Apparatus of Processing Data Using Deep Belief Networks Employing Low-Rank Matrix Factorization |
CN103632138B (en) * | 2013-11-20 | 2016-09-28 | 南京信息工程大学 | A face recognition method based on low-rank block sparse representation
CN111079781B (en) * | 2019-11-07 | 2023-06-23 | 华南理工大学 | Lightweight convolutional neural network image recognition method based on low rank and sparse decomposition |
CN111695696A (en) * | 2020-06-12 | 2020-09-22 | 深圳前海微众银行股份有限公司 | Method and device for model training based on federal learning |
CN112053412B (en) * | 2020-08-31 | 2022-04-29 | 浙江大学 | A low-dose sinogram denoising and PET image reconstruction method based on a teacher-student generator
CN114186681A (en) * | 2021-11-30 | 2022-03-15 | 北京百度网讯科技有限公司 | Method, apparatus and computer program product for generating model clusters |
CN115187706B (en) * | 2022-06-28 | 2024-04-05 | 北京汉仪创新科技股份有限公司 | Lightweight method and system for face style migration, storage medium and electronic equipment |
CN115272100A (en) * | 2022-06-29 | 2022-11-01 | 广东工业大学 | Low-dose SPECT sinogram preprocessing and image reconstruction method based on a teacher-student dual model
CN115240250A (en) * | 2022-07-08 | 2022-10-25 | 深圳市优必选科技股份有限公司 | Model training method, apparatus, computer equipment and readable storage medium |
CN115329959A (en) * | 2022-07-19 | 2022-11-11 | 华中师范大学 | Learning objective recommendation method based on a dual-stream knowledge embedding network
CN115222048A (en) * | 2022-07-29 | 2022-10-21 | 平安科技(深圳)有限公司 | Training method, device, equipment and medium for document abstract generation model |
CN115063875B (en) * | 2022-08-16 | 2022-12-16 | 北京百度网讯科技有限公司 | Model training method, image processing method and device and electronic equipment |
CN115578614B (en) * | 2022-10-21 | 2024-03-12 | 北京百度网讯科技有限公司 | Training method of image processing model, image processing method and device |
CN116402025A (en) * | 2023-04-07 | 2023-07-07 | 平安科技(深圳)有限公司 | Sentence segmentation method, model training method, apparatus, device and medium
CN116542321B (en) * | 2023-07-06 | 2023-09-01 | 中科南京人工智能创新研究院 | Image generation model compression and acceleration method and system based on diffusion model |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116597154A (en) * | 2023-05-24 | 2023-08-15 | 湖北工业大学 | Training method and system for image denoising model |
Also Published As
Publication number | Publication date |
---|---|
CN117351299A (en) | 2024-01-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113553864B (en) | Translation model training method and device, electronic equipment and storage medium | |
CN112541124B (en) | Method, apparatus, device, medium and program product for generating a multitasking model | |
CN117351299B (en) | Image generation and model training method, device, equipment and storage medium | |
CN110766142A (en) | Model generation method and device | |
CN111582479B (en) | Distillation method and device for neural network model | |
CN114202076B (en) | Training method of deep learning model, natural language processing method and device | |
CN114972877B (en) | Image classification model training method and device and electronic equipment | |
CN113361621B (en) | Method and device for training model | |
CN115456167B (en) | Lightweight model training method, image processing device and electronic equipment | |
CN112580733A (en) | Method, device and equipment for training classification model and storage medium | |
CN112949433B (en) | Method, device and equipment for generating video classification model and storage medium | |
CN113657468A (en) | Pre-training model generation method and device, electronic equipment and storage medium | |
CN115310590A (en) | Graph structure learning method and device | |
CN113240082A (en) | Transfer learning method and device | |
CN114898742B (en) | Training method, device, equipment and storage medium of stream type voice recognition model | |
CN113468857B (en) | Training method and device for style conversion model, electronic equipment and storage medium | |
CN114186097A (en) | Method and apparatus for training a model | |
WO2025007017A1 (en) | Auxiliary neural network for fast adaptation of pre-trained neural networks | |
CN118710754A (en) | Text-to-image generation method, device, equipment and storage medium based on a diffusion probability model | |
CN117710504A (en) | Image generation method, training method, device and equipment of image generation model | |
CN114998649B (en) | Image classification model training method, image classification method and device | |
CN116030235A (en) | Target detection model training method, target detection device and electronic equipment | |
CN115186738B (en) | Model training method, device and storage medium | |
CN112784967B (en) | Information processing method and device and electronic equipment | |
CN113408702B (en) | Music neural network model pre-training method, electronic device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||