CN118155214B - Prompt learning method, image classification method and related devices
- Publication number: CN118155214B
- Application number: CN202410583471.3A
- Authority: CN (China)
- Prior art keywords: prompt, image, feature, text, target
- Legal status: Active
Classifications
- G—PHYSICS; G06—COMPUTING; G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING; G06V30/10—Character recognition
- G06V30/182—Extraction of features or characteristics of the image by coding the contour of the pattern
- G06V30/19093—Proximity measures, i.e. similarity or distance measures
- G06V30/19173—Classification techniques
Abstract
The embodiment of the application discloses a prompt learning method, an image classification method and a related device. The prompt learning method comprises the following steps: determining an image description text corresponding to a sample image; respectively encoding the image description text and the prompt information to be trained (which comprises the prompt parameters to be trained) through a text encoder in the image-text processing model, to obtain description text features and training prompt features; encoding the sample image through an image encoder in the image-text processing model to obtain sample image features; performing similarity calculation according to the training prompt features and reference alignment features, and determining a target loss according to the similarity calculation result, wherein the reference alignment features are determined according to the description text features and the sample image features; and training the prompt parameters in the prompt information to be trained based on the target loss. The method can prevent the image-text processing model from overfitting and reduce the generalization loss of the image-text processing model.
Description
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a prompt learning method, an image classification method and a related device.
Background
In recent years, image-text processing models (also called image-text pre-training models) obtained by pre-training on large numbers of image-text pairs in a contrastive learning manner have been widely applied to many downstream tasks, and show good generalization and transfer capability. When an image-text processing model is applied to a downstream task, prompt information (prompts) usually needs to be designed manually, so that the image-text processing model executes the downstream task based on the prompt information. Research shows that the performance of the image-text processing model in the downstream task depends heavily on the quality of the prompt information, and unreasonably designed prompt information causes the image-text processing model to perform poorly in the downstream task.
In order to avoid degrading the performance of the image-text processing model due to unreasonable manually designed prompt information, prompt learning (prompt tuning) has emerged. Prompt learning uses learnable prompt parameters (tokens) to replace manually designed prompt information at the input layer, and adapts the image-text processing model to a downstream task by training the prompt parameters. Existing prompt learning methods have advantages such as low training cost and fast adaptation, but they cause a certain degree of overfitting and hence a loss of model generalization; that is, the performance of the image-text processing model in zero-shot learning is reduced compared with that before prompt learning.
Disclosure of Invention
The embodiment of the application provides a prompt learning method, an image classification method and a related device, which can prevent the image-text processing model from overfitting and reduce the generalization loss of the image-text processing model.
The first aspect of the application provides a prompt learning method, which comprises the following steps:
determining an image description text corresponding to the sample image;
encoding the image description text through a text encoder in the image-text processing model to obtain description text features; encoding the prompt information to be trained through the text encoder to obtain training prompt features, wherein the prompt information to be trained comprises prompt parameters to be trained;
encoding the sample image through an image encoder in the image-text processing model to obtain sample image features;
performing similarity calculation according to the training prompt features and reference alignment features, and determining a target loss according to the similarity calculation result, wherein the reference alignment features are determined according to the description text features and the sample image features;
and training the prompt parameters in the prompt information to be trained based on the target loss.
A second aspect of the present application provides an image classification method, the method comprising:
acquiring a target image to be classified;
Encoding the target image by an image encoder in the image-text processing model to obtain target image characteristics;
performing similarity calculation according to the target image features and target prompt features, and determining a classification result corresponding to the target image according to the similarity calculation result; the target prompt features are obtained by encoding target prompt information through a text encoder in the image-text processing model, and the target prompt information comprises prompt parameters obtained through training by the method in the first aspect and candidate category information in a downstream task.
A third aspect of the present application provides a prompt learning apparatus, the apparatus comprising:
the description determining module is used for determining an image description text corresponding to the sample image;
the text encoding module is used for encoding the image description text through a text encoder in the image-text processing model to obtain description text features, and encoding the prompt information to be trained through the text encoder to obtain training prompt features, wherein the prompt information to be trained comprises prompt parameters to be trained;
the image encoding module is used for encoding the sample image through an image encoder in the image-text processing model to obtain sample image features;
the loss determination module is used for performing similarity calculation according to the training prompt features and reference alignment features, and determining a target loss according to the similarity calculation result, wherein the reference alignment features are determined according to the description text features and the sample image features;
And the prompt training module is used for training the prompt parameters in the prompt information to be trained based on the target loss.
A fourth aspect of the present application provides an image classification apparatus, the apparatus comprising:
The image acquisition module is used for acquiring target images to be classified;
the image encoding module is used for encoding the target image through an image encoder in the image-text processing model to obtain target image features;
the image classification module is used for performing similarity calculation according to the target image features and target prompt features, and determining a classification result corresponding to the target image according to the similarity calculation result; the target prompt features are obtained by encoding target prompt information through a text encoder in the image-text processing model, and the target prompt information comprises prompt parameters obtained through training by the method in the first aspect and candidate category information in a downstream task.
A fifth aspect of the application provides a computer apparatus comprising a processor and a memory:
The memory is used for storing a computer program;
The processor is configured to perform the steps of the prompt learning method as described in the first aspect or the steps of the image classification method as described in the second aspect according to the computer program.
A sixth aspect of the present application provides a computer-readable storage medium storing a computer program for executing the steps of the prompt learning method of the first aspect described above or for executing the steps of the image classification method of the second aspect described above.
A seventh aspect of the application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps of the prompt learning method described in the first aspect or performs the steps of the image classification method described in the second aspect.
From the above technical solutions, the embodiment of the present application has the following advantages:
The embodiment of the application provides a prompt learning method, which innovatively introduces an image description text corresponding to a sample image in the process of adapting an image-text processing model to a downstream task based on a prompt learning mechanism, and trains the prompt parameters by using the description text features of the image description text together with the sample image features of the sample image. The image description text corresponding to the sample image is a text that describes the information in the sample image; the information it expresses generally includes both information in the specific downstream task to which the sample image belongs and information in other scenes, so the description text features obtained by encoding the image description text through the text encoder in the image-text processing model can reflect not only information in the specific downstream task but also information in other scenes. Reference alignment features are determined according to the description text features of the image description text and the sample image features obtained by encoding the sample image through the image encoder in the image-text processing model; taking the reference alignment features as a reference target, similarity calculation is performed between them and the training prompt features obtained by encoding the prompt information to be trained (which includes the prompt parameters to be trained) through the text encoder, and a target loss for training the prompt parameters is then determined according to the similarity calculation result. This training process introduces description text features carrying rich information into the prompt parameters; by referring simultaneously to the description text features of the image description text and the sample image features of the sample image while training the prompt parameters, the prompt parameters learn information in other scenes while learning information in the specific downstream task, which prevents the image-text processing model from overfitting when executing tasks based on the trained prompt parameters and accordingly reduces the generalization loss of the image-text processing model.
Drawings
FIG. 1 is a diagram of pre-training and application of a graphics context processing model in the related art;
FIG. 2 is a schematic diagram of a graphics processing model in the related art adapting to a downstream task based on a prompt learning mechanism;
Fig. 3 is a schematic diagram of an application scenario of a prompt learning method according to an embodiment of the present application;
fig. 4 is a schematic flow chart of a prompt learning method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a training process of the first prompting parameter according to the embodiment of the present application;
FIG. 6 is a schematic structural diagram of a training process of a second prompt parameter according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a prompt learning method according to an embodiment of the present application;
fig. 8 is a flow chart of an image classification method according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a prompt learning device according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an image classification device according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a terminal device according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
In order to make the present application better understood by those skilled in the art, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include, for example, sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, pre-training model technologies, operation/interaction systems, and mechatronics. The pre-training model, also called a large model or foundation model, can be widely applied to downstream tasks in all major directions of artificial intelligence after fine-tuning. Artificial intelligence software technology mainly includes directions such as computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing involves natural language, namely the language that people use in daily life, so it is closely related to linguistics; at the same time, it involves model training, an important technology in computer science, mathematics, and artificial intelligence, and the pre-training model developed from the large language model (Large Language Model) in the NLP field. Through fine-tuning, a large language model can be widely applied to downstream tasks. Natural language processing techniques usually include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
Computer vision (CV) is a science that studies how to make machines "see"; more specifically, it replaces human eyes with cameras and computers to perform machine vision tasks such as identifying and measuring targets, and further performs graphic processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies and attempts to build artificial intelligence systems that can acquire information from images or multidimensional data. Large model technology has brought important innovation to the development of computer vision technology; pre-trained models in the vision field such as Swin-Transformer, ViT, V-MoE, and MAE can be quickly and widely applied to specific downstream tasks through fine-tuning (finetune). Computer vision techniques usually include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
Pre-training models (PTM), also called foundation models or large models, refer to deep neural networks (DNN) with a large number of parameters. The deep neural network is trained on massive unlabeled data, and the function approximation capability of the large-parameter DNN enables the PTM to extract common features from the data; through techniques such as fine-tuning (finetune), parameter-efficient fine-tuning (PEFT), and prompt-tuning, the pre-training model is adapted to downstream tasks. Therefore, the pre-training model can achieve ideal effects in small-sample (few-shot) or zero-sample (zero-shot) scenarios. According to the data modality processed, PTM can be classified into language models (ELMo, BERT, GPT), visual models (Swin-Transformer, ViT, V-MoE), speech models (VALL-E), multi-modal models (ViLBERT, CLIP, Flamingo, Gato), and so on, where a multi-modal model refers to a model that builds representations of two or more data modality features. The pre-trained model is an important tool for producing artificial intelligence generated content (AIGC), and can also serve as a general interface for connecting multiple specific task models.
The scheme provided by the embodiment of the application relates to natural language processing, computer vision and other technologies in artificial intelligence, and is specifically described by the following embodiments:
The image-text processing model refers to a model pre-trained on large-scale image and text data; the goal of pre-training is to learn the semantic associations and correspondences between images and texts. The pre-training process of the image-text processing model may specifically be as follows: text is encoded into text features through a text encoder in the image-text processing model, and images are encoded into image features through an image encoder in the image-text processing model; then, through a contrastive learning mechanism, a loss function is constructed according to the similarity calculation results between text features and image features and the actual correspondence between texts and images (that is, whether a text and an image correspond to each other), and the image-text processing model is trained based on the loss function, so that the similarity between the text features and image features encoded from corresponding text-image pairs becomes higher and higher, while the similarity between text features and image features encoded from non-corresponding texts and images becomes lower and lower. Common image-text processing models include, but are not limited to, the CLIP (Contrastive Language-Image Pre-training) model, the ALIGN (A Large-scale ImaGe and Noisy-text embedding) model, the ViLBERT (Vision-and-Language BERT) model, and the like.
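As an illustration of the contrastive pre-training objective described above, the following Python sketch computes a symmetric contrastive loss over a batch of paired image and text features; the function name and the temperature value are illustrative assumptions, not details taken from the patent.

```python
# A minimal sketch of the contrastive pre-training objective, assuming matched
# image-text pairs sit on the diagonal of the similarity matrix.
import torch
import torch.nn.functional as F

def contrastive_loss(image_features, text_features, temperature=0.07):
    # Normalize so the dot product equals cosine similarity.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    # Pairwise similarities: logits[i][j] = sim(image_i, text_j).
    logits = image_features @ text_features.t() / temperature
    # The i-th image corresponds to the i-th text.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss over image-to-text and text-to-image directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```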
After the pre-training of the image-text processing model is completed, the image-text processing model obtained by training can be applied to downstream tasks; the downstream task may specifically be an image classification task of a specific scene, for example, identifying a category (such as pedestrians, vehicles, etc.) to which an object included in an image belongs in an automatic driving scene, identifying a specific type of lesion area included in a medical image in a medical scene, and the like. In the related art, when the graphic processing model is applied to the downstream task, the prompt information is usually required to be designed manually so as to assist the graphic processing model to execute the corresponding downstream task.
Referring to fig. 1, fig. 1 is a schematic diagram of pre-training and applying an image-text processing model in the related art, illustrated with the CLIP model. As shown in the model pre-training stage in fig. 1, the text encoder and the image encoder in the CLIP model may be trained based on image-text pairs through a contrastive learning mechanism, with the goal that image features and text features having a correspondence are more similar. As shown in the prompt encoding stage in fig. 1, when the pre-trained CLIP model is migrated to a downstream task for application, a batch of prompt texts (templates) containing class labels may be designed manually, for example the prompt text "a photo of a {object}", where {object} may be a class label such as airplane, automobile, dog, or bird; each prompt text is encoded through the text encoder to obtain the prompt feature corresponding to each class label. As shown in the downstream task execution stage in fig. 1, the image to be classified may be encoded by the image encoder to obtain the corresponding image feature; the similarity between the image feature and each prompt feature is then calculated, and the class label corresponding to the prompt feature with the highest similarity to the image feature is finally determined as the class of the image in the downstream task.
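The downstream task execution stage in fig. 1 can be summarized with the following sketch; the encoders and the tokenizer are passed in as placeholder callables, since the patent does not fix a specific implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenize):
    # Hand-designed prompt template, as in the "a photo of a {object}" example.
    prompts = [f"a photo of a {name}" for name in class_names]
    text_features = F.normalize(text_encoder(tokenize(prompts)), dim=-1)
    image_feature = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)
    # Similarity between the image feature and each prompt feature.
    similarity = (image_feature @ text_features.t()).squeeze(0)
    # The class label whose prompt feature is most similar to the image feature.
    return class_names[similarity.argmax().item()]
```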
However, research has found that in this implementation the performance of the image-text processing model in the downstream task depends too heavily on the quality of the prompt information, making its performance in the downstream task unstable. In order to avoid this situation, researchers have proposed prompt learning mechanisms, replacing manually designed prompt information with learnable prompt parameters (tokens) at the input layer, thereby avoiding poor performance of the image-text processing model in downstream tasks. Common prompt learning mechanisms include, but are not limited to, CoOp (Context Optimization) and the like.
Referring to fig. 2, fig. 2 is a schematic diagram of adapting an image-text processing model to a downstream task based on a prompt learning mechanism in the related art. As shown in fig. 2, the prompt information may be designed to include learnable prompt parameters, for example t = [V]_1[V]_2...[V]_M[CLASS], where [V]_1[V]_2...[V]_M are the learnable prompt parameters in the prompt information and [CLASS] is a class label in the downstream task, for example helicopter, butterfly, pizza, and the like. The prompt information including each class label is encoded through the text encoder to obtain the prompt feature corresponding to each class label. Then, a sample image in the downstream task is encoded through the image encoder to obtain the corresponding image feature, and the prompt parameters are trained with the goal of making the similarity between the image feature and the prompt feature corresponding to the actual class label the highest. Compared with manually designed prompt information, the prompt information determined in this way is more stable, which helps the image-text processing model perform better when executing downstream tasks.
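A minimal sketch of the learnable prompt construction t = [V]_1[V]_2...[V]_M[CLASS] is given below, assuming a CoOp-style setup in which M context vectors are shared across classes; the module layout is an assumption for illustration only.

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    # M learnable context vectors [V]_1...[V]_M shared across classes; each
    # class label's token embeddings are appended to form one prompt per class.
    def __init__(self, num_context_tokens, embed_dim, class_token_embeddings):
        super().__init__()
        # Learnable prompt parameters, randomly initialized.
        self.context = nn.Parameter(torch.randn(num_context_tokens, embed_dim) * 0.02)
        # Frozen embeddings of the class labels (shape: [n_cls, n_tokens, dim]),
        # e.g. for "helicopter", "butterfly", "pizza".
        self.register_buffer("class_tokens", class_token_embeddings)

    def forward(self):
        n_cls = self.class_tokens.size(0)
        ctx = self.context.unsqueeze(0).expand(n_cls, -1, -1)
        # Concatenate [V]_1 [V]_2 ... [V]_M [CLASS] for every class; a real
        # pipeline would feed this through the frozen text encoder.
        return torch.cat([ctx, self.class_tokens], dim=1)
```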
However, since current prompt learning methods generally train the prompt parameters using only sample images in a specific downstream task, the trained prompt parameters can only learn information in that specific downstream task and cannot learn information in other scenes. As a result, the image-text processing model overfits when executing tasks based on these prompt parameters, that is, its performance in zero-shot learning is reduced compared with that before prompt learning, causing a generalization loss of the image-text processing model.
In order to solve the above problems, the embodiment of the application provides a prompt learning method, which introduces an image description text corresponding to the sample image in the process of adapting the image-text processing model to a downstream task based on a prompt learning mechanism, and trains the prompt parameters using the description text features of the image description text together with the sample image features of the sample image. The image description text corresponding to the sample image is a text that describes the information in the sample image; the information it expresses generally includes both information in the specific downstream task to which the sample image belongs and information in other scenes, so the description text features obtained by encoding the image description text through the text encoder in the image-text processing model can reflect not only information in the specific downstream task but also information in other scenes. Reference alignment features are determined according to the description text features of the image description text and the sample image features obtained by encoding the sample image through the image encoder in the image-text processing model; taking the reference alignment features as a reference target, similarity calculation is performed between them and the training prompt features obtained by encoding the prompt information to be trained (which includes the prompt parameters to be trained) through the text encoder, and a target loss for training the prompt parameters is then determined according to the similarity calculation result. This training process introduces description text features carrying rich information into the prompt parameters; by referring simultaneously to the description text features of the image description text and the sample image features of the sample image while training the prompt parameters, the prompt parameters learn information in other scenes while learning information in the specific downstream task, which prevents the image-text processing model from overfitting when executing tasks based on the trained prompt parameters and accordingly reduces the generalization loss of the image-text processing model.
The prompt learning method and the image classification method provided by the embodiment of the application can be executed by computer equipment, and the computer equipment can be terminal equipment or a server. The terminal equipment comprises, but is not limited to, a mobile phone, a computer, intelligent voice interaction equipment, intelligent household appliances, vehicle-mounted terminals, aircrafts and the like. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server.
It should be noted that, the information, the data and the signals related to the embodiment of the application are authorized by the related objects or fully authorized by all parties, and the collection, the use and the processing of the related data all conform to the related laws and regulations and standards of the related countries and regions.
In order to facilitate understanding of the prompt learning method provided by the embodiment of the present application, an application scenario of the prompt learning method is described below by taking an execution subject of the prompt learning method as an example of a server.
Referring to fig. 3, fig. 3 is an application scenario schematic diagram of a prompt learning method according to an embodiment of the present application. As shown in fig. 3, the application scenario includes a database 310 and a server 320, and the server 320 may access the database 310 through a network, or the database 310 may be integrated inside the server 320.
The database 310 stores sample images in downstream tasks (such as image classification tasks or image matching tasks) to which the image processing model is applicable. Server 320 may initiate a data acquisition request to database 310 to acquire a sample image from database 310.
In practical applications, after the server 320 obtains the sample image from the database 310, the server 320 may determine the image description text corresponding to the sample image, where the image description text is used to describe the information included in the sample image. It should be understood that the information expressed by the image description text generally includes not only information in the specific downstream task to which the sample image belongs, but also information in other scenes.
Then, the server 320 may encode the image description text through the text encoder in the image-text processing model to obtain the description text features; it should be appreciated that since the information expressed by the image description text is not limited to information in the specific downstream task, the information expressed by the description text features is likewise not limited to information in the specific downstream task, but also includes information in other scenes. In addition, the server 320 may further encode the prompt information to be trained through the text encoder to obtain the training prompt features, where the prompt information to be trained includes the prompt parameters to be trained. The server 320 may also encode the sample image through the image encoder in the image-text processing model to obtain the sample image features.
Further, the server 320 may perform similarity calculation based on the training prompt features and the reference alignment features, and determine the target loss based on the similarity calculation result. The reference alignment features are determined according to the description text features and the sample image features, and serve as the training target referred to by the training prompt features during training. Finally, the server 320 may train the prompt parameters in the prompt information to be trained based on the target loss, so that the image-text processing model can execute downstream tasks based on the prompt information including the trained prompt parameters. In this way, description text features carrying rich information are introduced into the training process of the prompt parameters; by referring simultaneously to the description text features of the image description text and the sample image features of the sample image during training, the trained prompt parameters learn information in other scenes while learning information in the specific downstream task, which avoids restricting the trained prompt parameters to information in the specific downstream task. Accordingly, overfitting of the image-text processing model when executing tasks based on the trained prompt parameters can be avoided, and the generalization loss of the image-text processing model is reduced.
It should be understood that the application scenario shown in fig. 3 is only an example, and in practical application, the prompt learning method provided by the embodiment of the present application may also be applied to other scenarios, and the application scenario of the prompt learning method provided by the embodiment of the present application is not limited in any way.
The following describes the prompt learning method provided by the application in detail through the method embodiment.
Referring to fig. 4, fig. 4 is a flowchart illustrating a prompt learning method according to an embodiment of the present application. For convenience of description, the execution subject of the prompt learning method will be described below as an example of a server. As shown in fig. 4, the prompt learning method includes the steps of:
S401: and determining the image description text corresponding to the sample image.
The sample image refers to training sample data used in prompt learning, and may be an image in the downstream task to which the image-text processing model is to be applied; for example, when the image-text processing model needs to be applied to a downstream task of classifying animal images, the sample image may be an image including an animal. The sample image has a corresponding pre-labeled category label, and the category label is used to represent the category of the sample image in the downstream task; still taking the downstream task of classifying animal images as an example, the category label corresponding to a sample image including a Border Collie may be "dog" or "Border Collie".
The image description text refers to a text describing the information contained in the corresponding sample image; for example, for a sample image including a cat, the corresponding image description text may be "A Birman cat captivatingly gazes with its mesmerizing blue eyes". It should be noted that, in the embodiment of the application, the image description text corresponding to the sample image includes not only information related to the downstream task, but also information in other scenes; for example, the image description text above includes not only information related to the downstream task of classifying animal images, such as "cat", but also information in other scenes, such as "eyes" and "blue".
After the sample image is determined, the corresponding image description text may be generated based on the sample image through a large language model (Large Language Model, LLM), for example Vicuna or LLaMA, or through a text generation model. For example, the sample image and a requirement for generating the image description text corresponding to the sample image may be input into the large language model, and the text output by the large language model is taken as the image description text corresponding to the sample image. Alternatively, the corresponding image description text may be generated based on the sample image through a text generation model used for generating description information corresponding to images. The application does not specifically limit the implementation of determining the image description text corresponding to the sample image.
In one possible implementation manner, the determining the image description text corresponding to the sample image may include:
generating, through a text generation model, an initial description text corresponding to the sample image according to the sample image, where the text generation model is used for generating descriptive text corresponding to an input image; and rewriting the initial description text through a large language model to obtain the image description text corresponding to the sample image.
The text generation model refers to a neural network model capable of generating corresponding descriptive text based on an input image, and may be a pre-trained image-text processing model, such as the Bootstrapping Language-Image Pre-training model (BLIP-2) or the generative pre-trained transformer GPT-4V. The application does not limit the specific model structure of the text generation model.
The initial description text refers to the output result obtained by inputting the sample image into the text generation model. Typically, the initial description text only describes the sample image at coarse granularity; for example, the sample image includes a Maine Coon, while the initial description text is "cat", that is, the initial description text only describes the category of the sample image as cat at coarse granularity.
The sample image is input into the text generation model to generate the initial description text corresponding to the sample image, that is, an initial description text describing the coarse-grained image information of the sample image is obtained. Because the initial description text can only describe the image information of the sample image at the coarse-grained level, the information it carries is not rich and fine enough; therefore, the initial description text may be rewritten at fine granularity through the large language model to obtain the image description text corresponding to the sample image.
The large language model refers to a model used for generating the image description text, for example generative pre-trained transformers such as GPT-3.5 and GPT-4. The application does not limit the specific model structure of the large language model.
Fine-grained rewriting means describing the initial description text in more detail based on the sample image. For example, if the sample image includes a Maine Coon and the initial description text is "cat", image information in the sample image beyond the coarse-grained category can be further described through fine-grained rewriting, for example that the type of cat in the sample image is a Maine Coon and that the cat's eyes are blue. The initial description text is input into the large language model, and the large language model can further rewrite the initial description text at fine granularity based on the sample image and the rewrite command, so as to obtain an image description text representing richer and more complete image information of the sample image.
As an example, a fine-grained rewrite command may be input into the large language model, for example: "Short description of an image: {caption}. Tags of this image: {class name}. Please rewrite a definite and brief description according to given description and tags." Here, the short description {caption} may be the initial description text determined based on the sample image, for example "a cat is included in the picture", and the tag {class name} may be the category label, for example "cat"; the large language model then outputs the image description text corresponding to the sample image based on the rewrite command. The application does not specifically limit the short description, the tag, or the fine-grained rewrite command above, which may be determined according to actual application requirements.
In this way, the sample image is input into the text generation model to generate the initial description text corresponding to the sample image, obtaining a coarse-grained description of the sample image; the initial description text is then input into the large language model for fine-grained rewriting, which describes the sample image in more detail and more completely at the fine-grained level based on the initial description text, so that an image description text containing richer and more complete image description information of the sample image can be obtained.
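The two-stage pipeline described above (coarse-grained captioning followed by fine-grained rewriting) can be sketched as follows; caption_model and llm are hypothetical callables standing in for, e.g., BLIP-2 and GPT-3.5, not a specific library API.

```python
# A sketch of the two-stage description-text pipeline: a captioning model
# produces a coarse-grained initial description, then a large language model
# rewrites it at fine granularity using the rewrite command above.
REWRITE_TEMPLATE = (
    "Short description of an image: {caption}. "
    "Tags of this image: {class_name}. "
    "Please rewrite a definite and brief description "
    "according to given description and tags."
)

def build_image_description(sample_image, class_name, caption_model, llm):
    # Stage 1: coarse-grained initial description text, e.g. "a cat".
    initial_caption = caption_model(sample_image)
    # Stage 2: fine-grained rewrite conditioned on the caption and class label.
    prompt = REWRITE_TEMPLATE.format(caption=initial_caption, class_name=class_name)
    return llm(prompt)
```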
S402: encoding the image description text by a text encoder in the image-text processing model to obtain description text characteristics; and encoding the prompt information to be trained through a text encoder to obtain training prompt characteristics, wherein the prompt information to be trained comprises prompt parameters to be trained.
The image-text processing model comprises a text encoder and an image encoder, wherein the text encoder is used for encoding input text, and the image encoder is used for encoding input images.
The description text features are features obtained by encoding the image description text through the text encoder in the image-text processing model; they can reflect the semantic information of the image description text, and may specifically be represented in matrix form. After the image description text is determined based on the sample image, the image description text is input into the text encoder for encoding, and the corresponding description text features can be obtained; these description text features provide basic data support for the subsequent prompt learning training process.
The prompt information to be trained refers to prompt information in a prompt learning mechanism, and the trained prompt information can assist the image-text processing model to better execute downstream tasks. The prompt information to be trained comprises prompt parameters tokens to be trained, wherein the prompt parameters to be trained refer to training objects in a prompt learning method. As an example, the category involved in the downstream task may be spliced with the prompt parameter to be trained, and the prompt information to be trained is determined; or the nouns in the image description text and the prompt parameters to be trained can be spliced to determine the prompt information to be trained.
The training prompt features are features obtained by encoding the prompt information to be trained through a text encoder in the image-text processing model, can reflect the semantics of the current prompt information to be trained, and can be in a matrix form. The prompt information to be trained is input into a text encoder for encoding processing, so that corresponding training prompt characteristics can be obtained, the training prompt characteristics can reflect the quality of prompt parameters in the current prompt information to be trained, and the prompt parameters in the prompt information to be trained can be adjusted by referring to the training prompt characteristics in the process of prompt learning.
S403: and (3) carrying out coding processing on the sample image by an image coder in the image-text processing model to obtain sample image characteristics.
The sample image features refer to patch-level (image block level) features obtained by encoding the sample image through the image encoder in the image-text processing model. The sample image features can reflect the semantic information of the sample image, such as the content included in the sample image, and may be a feature representation in matrix form.
After the sample image is determined, the sample image is input into an image encoder, the sample image is encoded, and sample image characteristics corresponding to the sample image can be obtained, wherein the sample image characteristics can provide basic data support for the subsequent training process of prompt learning.
S404: and carrying out similarity calculation according to the training prompt feature and the reference alignment feature, and determining target loss according to a similarity calculation result, wherein the reference alignment feature is determined according to the description text feature and the sample image feature.
The reference alignment features refer to the training target referred to in the process of training the prompt parameters; during training, the training prompt features are made as close as possible to the reference alignment features. The reference alignment features are determined according to the description text features and the sample image features; as an example, the reference alignment features may be obtained by fusing the description text features and the sample image features, or the reference alignment features may include both the description text features representing global semantic information and the sample image features representing global image information.
After the reference alignment features are determined, the training prompt features and the reference alignment features may be aligned and mapped into a common feature space. The distance between the training prompt features and the reference alignment features in the feature space is then calculated through an algorithm such as cosine similarity or Euclidean distance, that is, the similarity between the training prompt features and the reference alignment features is calculated, so as to obtain the similarity calculation result.
After the similarity calculation result is obtained, the target loss in the prompt parameter training process can be determined through a loss function based on the similarity calculation result. The target loss is the loss value determined during training of the prompt parameters, and it reflects how well the image-text processing model performs in the corresponding downstream task based on the current prompt parameters: the larger the target loss, the worse the performance of the image-text processing model in the downstream task, that is, the worse the quality of the current prompt parameters; conversely, the smaller the target loss, the better the performance of the image-text processing model in the downstream task, that is, the better the quality of the current prompt parameters. The loss function may be a cross-entropy loss function or a contrastive learning loss function; the application does not specifically limit the loss function used in the training process.
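Under the assumption that the loss function is cross-entropy over cosine similarities (one of the options named above), a sketch of the target loss computation might look as follows; the shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def target_loss(training_prompt_features, reference_alignment_feature, label):
    # Cosine similarity between each candidate class's training prompt feature
    # and the reference alignment feature (shapes: [n_cls, d] and [d]).
    sims = F.cosine_similarity(
        training_prompt_features,
        reference_alignment_feature.unsqueeze(0),
        dim=-1,
    )
    # Cross-entropy against the sample's pre-labeled class, treating the
    # similarity scores as classification logits.
    return F.cross_entropy(sims.unsqueeze(0), torch.tensor([label]))
```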
S405: based on the target loss, training the prompt parameters in the prompt information to be trained.
Finally, the prompt parameters in the prompt information to be trained can be trained based on the target loss, that is, the prompt parameters in the prompt information to be trained are continuously adjusted to reduce the target loss, so as to optimize the performance of the image-text processing model when it executes the downstream task based on the prompt information, thereby enhancing the suitability of the image-text processing model for the downstream task.
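A sketch of one training step under these assumptions is shown below; only the prompt parameters are optimized, while the text encoder and image encoder of the image-text processing model remain frozen. compute_target_loss is a hypothetical closure wrapping the similarity and loss computation above.

```python
import torch

def prompt_learning_step(prompt_module, compute_target_loss, optimizer):
    # Only the learnable prompt parameters receive gradients; the pre-trained
    # encoders of the image-text processing model are not updated.
    optimizer.zero_grad()
    loss = compute_target_loss(prompt_module())
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch: optimize nothing but the prompt tokens.
# optimizer = torch.optim.SGD(prompt_module.parameters(), lr=2e-3)
```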
In the prompt learning method provided by the embodiment of the application, the image description text corresponding to the sample image is introduced in the process of adapting the image-text processing model to the downstream task based on the prompt learning mechanism, and the prompt parameters are trained using the description text features of the image description text together with the sample image features of the sample image. The image description text corresponding to the sample image is a text that describes the information in the sample image; the information it expresses generally includes both information in the specific downstream task to which the sample image belongs and information in other scenes, so the description text features obtained by encoding the image description text through the text encoder in the image-text processing model can reflect not only information in the specific downstream task but also information in other scenes. Reference alignment features are determined according to the description text features of the image description text and the sample image features obtained by encoding the sample image through the image encoder in the image-text processing model; taking the reference alignment features as a reference target, similarity calculation is performed between them and the training prompt features obtained by encoding the prompt information to be trained (which includes the prompt parameters to be trained) through the text encoder, and a target loss for training the prompt parameters is then determined according to the similarity calculation result. This training process introduces description text features carrying rich information into the prompt parameters; by referring simultaneously to the description text features of the image description text and the sample image features of the sample image while training the prompt parameters, the prompt parameters learn information in other scenes while learning information in the specific downstream task, which prevents the image-text processing model from overfitting when executing tasks based on the trained prompt parameters and accordingly reduces the generalization loss of the image-text processing model.
In one possible implementation manner, the prompt information to be trained includes first prompt information, the first prompt information includes first prompt parameters to be trained and candidate category information in the downstream task, and the training prompt feature includes a first training prompt feature.
Accordingly, the "performing the similarity calculation according to the training prompt feature and the reference alignment feature and determining the target loss according to the similarity calculation result" in S404 may include:
performing, through a feature fusion module, fusion processing on the description text feature and the sample image feature to obtain a first reference alignment feature;
performing similarity calculation according to the first training prompt feature and the first reference alignment feature to obtain a first similarity result; the first similarity result is used for representing the matching degree between the sample image and each candidate category information in the downstream task;
and determining a first target loss according to the first similarity result and the class label of the sample image in the downstream task.
The prompt information to be trained can be first prompt information, where the first prompt information comprises first prompt parameters to be trained and candidate category information in the downstream task; that is, the first prompt information can be formed by splicing the first prompt parameters to be trained and the category labels in the downstream task. In the embodiment of the application, the first prompt information can also be called local prompt information (token-aware prompts). When the first prompt parameter in the local prompt information is trained, it can be trained based on the fine-grained local features in the description text features and the sample image features.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a training process of the first prompting parameter according to an embodiment of the present application.
The first prompt parameter may be represented by V_1^ta, V_2^ta, ..., V_M^ta. The candidate category information in the downstream task refers to category information determined based on the images included in the downstream task and the category labels corresponding to the images; for example, if the downstream task classifies different types of animal images, the candidate category information may be "dog", "cat", "rabbit", and the like. Correspondingly, the candidate category information in the downstream task may be represented by CLASS. Each candidate category information in the downstream task is spliced with the first prompt parameter respectively to obtain the first prompt information.
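As an illustration of this splicing step, the following hedged sketch builds one first-prompt sequence per candidate category by concatenating M learnable prompt vectors with a class word embedding; the embedding values and class names are placeholders, not the embodiment's data.

```python
import torch

M, D = 4, 512
v_ta = torch.randn(M, D, requires_grad=True)    # first prompt parameters V_1^ta ... V_M^ta
class_word_embeds = {                            # hypothetical class word embeddings
    "dog": torch.randn(1, D),
    "cat": torch.randn(1, D),
    "rabbit": torch.randn(1, D),
}
# One first-prompt sequence per candidate category: [V_1^ta ... V_M^ta, CLASS]
first_prompt_info = {name: torch.cat([v_ta, emb], dim=0)
                     for name, emb in class_word_embeds.items()}
print({k: tuple(v.shape) for k, v in first_prompt_info.items()})  # each (M + 1, D)
```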
The training prompt feature may be a first training prompt feature (token-aware prompt feature), where the first training prompt feature is obtained after the first prompt information is encoded by the text encoder. The first training prompt feature may be represented by F_t^ta = g(t_ta), where g represents the encoding process of the text encoder. As an example, the dimension of the first training prompt feature may be C × 512, where C is the number of categories (i.e., the number of candidate category information in the downstream task).
The sample image feature (patch-level image features) is obtained by encoding a picture X with the image encoder and may be represented by F_v^pl = f(X), where f represents the encoding process of the image encoder. As an example, the dimension of the sample image feature may be width (W) × height (H) × 512, where width refers to the number of image blocks of the sample image in the width direction and height refers to the number of image blocks in the height direction. Specifically, in order to extract the image features corresponding to the sample image more finely, the image encoder partitions the sample image into blocks during encoding, so that the number of image blocks obtained in the width direction is W and the number obtained in the height direction is H.
The description text feature (token-level text features) is obtained by encoding the image description text with the text encoder and may be represented by F_t^tl = g(t), where g represents the encoding process of the text encoder. As an example, the dimension of the description text feature may be 77 × 512.
After determining the sample image feature and the description text feature, the description text feature (g(t_sos), ..., g(t_eos), ..., g(t_N)) and the sample image feature (f(x)_cls, ..., f(x)_j, ..., f(x)_L) may be fused by a feature fusion module to obtain a first reference alignment feature. The first prompt parameter may subsequently be trained based on a similarity calculation between the first reference alignment feature and the first training prompt feature.
The feature fusion module (Image Caption Fusion) is a module that fuses different features; in the embodiment of the present application, the feature fusion module may fuse the sample image feature and the description text feature. The first reference alignment feature is obtained by fusing the description text feature and the sample image feature and is used for characterizing the semantic information of the fused image description text and sample image.
As an example, the feature fusion module may perform a fusion process on the descriptive text feature and the sample image feature to obtain the first reference alignment feature by:
Determining image-text fusion characteristics according to the description text characteristics used as query and the sample image characteristics used as keys and values through a cross attention layer in the characteristic fusion module; and carrying out pooling treatment on the image-text fusion characteristics through a pooling layer in the characteristic fusion module to obtain first reference alignment characteristics.
The feature fusion module comprises a cross attention layer (cross attention) and a pooling layer (average pooling). The cross attention layer is used for modeling the relation between different features; through the cross attention layer, the interaction information between different features can be better utilized, thereby improving the fusion performance of the feature fusion module. The pooling layer is used for reducing the spatial size of the image-text fusion feature and reducing its number of parameters so as to obtain the first reference alignment feature. The image-text fusion feature is obtained by preliminarily fusing the description text feature and the sample image feature through the cross attention layer and is used for characterizing the comprehensive semantic information of the description text feature and the sample image feature.
It should be noted that, in order to avoid the influence of the image description text on the prompt parameters in the application stage, the feature fusion module is designed as a plug-and-play component, that is, the feature fusion module may be selectively inserted in the training process of the prompt parameters, and the feature fusion module may not be inserted in the application stage of the prompt parameters.
As an example, attention interaction may be performed on the description text feature and the sample image feature through the cross attention layer; specifically, the attention weights between the description text feature and the sample image feature may be calculated through the attention formula to obtain the fine-grained image-text fusion feature after the preliminary fusion processing. The attention formula may refer to Equation 1.
Attention(Q, K, V) = softmax(QK^T / √d_k)V (Equation 1), where d_k is the dimension of the key vectors.
The above attention formula uses three variables (Q (query), K (key), V (value)) to represent different features of the input: Q is the query, used for searching for information at other positions related to the current position; K is the key, representing the reference information provided for Q to compare against; and V is the value, used for calculating the output information. In the embodiment of the application, the description text feature can be taken as Q, and the sample image feature as K and V, so that the fine-grained features strongly related to the sample image feature can be extracted from the description text feature.
After the image-text fusion feature is determined, it can be pooled through the pooling layer in the feature fusion module; for example, the image-text fusion feature can be averaged through an average pooling layer to obtain the first reference alignment feature. The present application does not particularly limit the pooling method used in the pooling layer.
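Combining the two steps, a minimal sketch of such a fusion module might look as follows; the module name, head count, and the use of nn.MultiheadAttention are assumptions for illustration, not the embodiment's reference implementation.

```python
import torch
import torch.nn as nn

class ImageCaptionFusion(nn.Module):
    """Plug-and-play fusion sketch: description text features as Q, patch-level
    image features as K and V (Equation 1), followed by average pooling."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        # text_feat: (B, 77, D) description text features as the query
        # image_feat: (B, W*H, D) sample image features as key and value
        fused, _ = self.cross_attn(text_feat, image_feat, image_feat)
        return fused.mean(dim=1)   # average pooling -> (B, D) first reference alignment feature

fusion = ImageCaptionFusion()
ref = fusion(torch.randn(2, 77, 512), torch.randn(2, 14 * 14, 512))
print(ref.shape)  # torch.Size([2, 512])
```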
Therefore, through the attention interaction of the cross attention layer in the feature fusion module, the part strongly related to the sample image, namely the image-text fusion feature, can be extracted from the description text feature; the image-text fusion feature characterizes the association relationship between the description text feature and the sample image feature at a fine-grained level. Then, the image-text fusion feature can be pooled through the pooling layer in the feature fusion module, reducing its spatial size, so that the first reference alignment feature fused by the feature fusion module can be obtained.
Of course, in practical application, the feature fusion module may perform fusion processing on the description text feature and the sample image feature in other manners, for example, the description text feature and the sample image feature may be given corresponding weights, and then fusion processing is performed on the description text feature and the sample image feature in a weighted summation manner, so as to obtain the first reference alignment feature. In this regard, the present application is not particularly limited to the fusion processing method.
After the first reference alignment feature is determined, similarity calculation can be performed according to the first training prompt feature and the first reference alignment feature to obtain a first similarity result, and then the first target loss can be determined according to the first similarity result and the class label of the sample image in the downstream task.
The first similarity result is a result obtained by performing similarity calculation based on the first training prompt feature and the first reference alignment feature and is used for representing the matching degree between the sample image and each candidate category information in the downstream task, namely the matching degree between the sample image and each category label in the downstream task is included in the first similarity result. For example, the first similarity result may represent, in a matrix, a degree of matching between the sample image and each category label in the downstream task.
As an example, after determining the first reference alignment feature, feature alignment may be performed on the first training prompt feature and the first reference alignment feature by mapping them into a common feature space; the similarity between the first training prompt feature and the first reference alignment feature may then be calculated through an algorithm such as a cosine similarity algorithm or a Euclidean distance algorithm, so as to obtain the first similarity result.
As an example, after the first similarity result is obtained, the difference between the first similarity result and the class label corresponding to the sample image in the downstream task may be calculated based on a cross entropy loss function, so that a first target loss L_CR of the first prompt parameter in the training process is obtained. The first target loss refers to the loss value determined for the first prompt parameter in the training process, so that the first prompt parameter can be trained based on this loss value.
It should be noted that training is performed in units of a batch (i.e., the training sample set to which the sample images belong); correspondingly, the finally calculated first target loss is determined based on the first similarity result corresponding to each sample image in the batch, the class label corresponding to each sample image in the downstream task, and the cross entropy loss function. The first similarity results corresponding to the sample images in the batch can be represented by a B×C similarity matrix, where B represents the batch size and C is the number of categories of the sample images in the batch.
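At batch level, the first target loss can therefore be sketched as a cross entropy over a B×C similarity matrix, as below; the scaled-cosine logits and the logit scale of 100 are a CLIP-style convention assumed here for illustration.

```python
import torch
import torch.nn.functional as F

B, C, D = 32, 10, 512
ref_feats = F.normalize(torch.randn(B, D), dim=-1)     # first reference alignment features
prompt_feats = F.normalize(torch.randn(C, D), dim=-1)  # first training prompt features
labels = torch.randint(0, C, (B,))                     # class labels in the downstream task

logits = 100.0 * ref_feats @ prompt_feats.t()          # B x C similarity matrix
first_target_loss = F.cross_entropy(logits, labels)    # L_CR over the batch
```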
In one possible implementation manner, after determining the first target loss, the prompt parameter may be trained based on the first target loss; that is, "training the prompt parameters in the prompt information to be trained based on the target loss" in S405 may include:
based on the first target loss, training a first hint parameter in the first hint information.
Meanwhile, the embodiment of the application can further comprise:
model parameters of the feature fusion module are trained based on the first target loss.
Based on the first target loss, training the first prompt parameters in the first prompt information means taking the reduction of the first target loss as the goal and continuously adjusting the first prompt parameters (V_1^ta, V_2^ta, ..., V_M^ta) in the first prompt information to optimize the performance of the image-text processing model when executing the downstream task based on the first prompt information.
Meanwhile, after determining the first target loss, the model parameters of the feature fusion module can be trained based on the first target loss, so that the first target loss is reduced, and the first reference alignment feature for representing the image information of the more complete and rich sample image can be determined by continuously adjusting the model parameters (such as the attention weight in the cross attention layer, the parameters in the pooling layer and the like) in the feature fusion module to optimize the fusion effect of the feature fusion module on the explanatory text feature and the sample image feature.
Therefore, by fusing the description text feature and the sample image feature through the feature fusion module, a first reference alignment feature that characterizes richer, more specific, and more complete image information of the sample image can be obtained, without being limited to the semantic information represented by the sample image alone. Similarity calculation is then performed based on the first reference alignment feature and the first training prompt feature, so that the first training prompt feature can learn the richer semantic information represented by the first reference alignment feature during training. Training the first prompt parameter in the first prompt information with the first target loss determined according to the first similarity result thus avoids overfitting when the image-text processing model executes tasks based on the first prompt parameter.
In one possible implementation manner, the prompt information to be trained includes second prompt information and third prompt information, the second prompt information includes second prompt parameters to be trained and nouns extracted from the image description text, and the third prompt information includes the second prompt parameters and candidate category information in the downstream task.
The step of "encoding the prompt information to be trained through the text encoder to obtain the training prompt feature" in S402 may include:
And respectively encoding the second prompt information and the third prompt information through a text encoder to obtain second training prompt characteristics corresponding to the second prompt information and third training prompt characteristics corresponding to the third prompt information.
The prompt information to be trained can be second prompt information and third prompt information: the second prompt information is formed by splicing the second prompt parameters to be trained and nouns extracted from the image description text, and the third prompt information is formed by splicing the second prompt parameters to be trained and the category labels in the downstream task. In the embodiment of the application, the second prompt information and the third prompt information are also called global prompt information (global-aware prompts). When the second prompt parameter in the global prompt information is trained, it can be trained based on the coarse-grained global features in the description text features and the sample image features.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a training process of the second prompting parameter according to the embodiment of the present application.
The second prompt parameters in the second prompt information and the third prompt information can be represented by V_1^ga, V_2^ga, ..., V_M^ga, and each candidate category information in the downstream task is spliced with the second prompt parameters respectively, so that the third prompt information can be obtained.
As an example, a word segmentation tool may be used to segment the image description text to obtain each word segment included in the image description text, and then each noun included in the image description text is determined in each word segment, and in the training process, one noun NOUN may be randomly selected from each noun included in the image description text, and spliced with the second prompt parameter to obtain the second prompt information.
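A hedged sketch of this noun-extraction step is shown below; NLTK is used only as one possible word segmentation and part-of-speech tool, and nothing in the embodiment mandates it (its tokenizer and tagger models must be downloaded once via nltk.download). The caption and prompt names are placeholders.

```python
import random
import nltk  # requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

caption = "a small dog plays with a ball on the grass"   # image description text
tokens = nltk.word_tokenize(caption)                     # word segmentation
nouns = [w for w, tag in nltk.pos_tag(tokens) if tag.startswith("NN")]
noun = random.choice(nouns)                              # randomly select one noun NOUN
# Splice the selected noun onto the second prompt parameters (names are placeholders):
second_prompt_info = ["V1_ga", "V2_ga", "V3_ga", "V4_ga", noun]
```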
The second prompt information is input into the text encoder for encoding, so as to obtain a second training prompt feature (global-aware prompt feature) F_t^na, which may be represented by F_t^na = g(t_na), where g represents the encoding process of the text encoder. As an example, the dimension of the second training prompt feature may be 1 × 512.
The third prompt information is input into the text encoder for encoding, so as to obtain a third training prompt feature (global-aware prompt feature) F_t^ga, which may be represented by F_t^ga = g(t_ga), where g represents the encoding process of the text encoder. As an example, the dimension of the third training prompt feature may be C × 512, where C is the number of categories (i.e., the number of candidate category information in the downstream task).
Based on the second prompt parameters, the second prompt information and the third prompt information can be obtained by respectively splicing the second prompt parameters with nouns extracted from the image description text and category information in the downstream task, and then the second prompt information and the third prompt information are respectively encoded through a text encoder, so that the second training prompt feature and the third training prompt feature can be correspondingly obtained. By the method, nouns in the image description text and category information in the downstream task are introduced in the training process as supervision objects, so that the problem of fitting in the training process is avoided.
In a possible implementation manner, the "performing the similarity calculation according to the training prompt feature and the reference alignment feature and determining the target loss according to the similarity calculation result" in S404 may include:
extracting a sub-feature for characterizing global text information from the description text feature as a second reference alignment feature; extracting a sub-feature for characterizing global image information from the sample image feature as a third reference alignment feature;
And performing similarity calculation according to the second training prompt feature, the third training prompt feature, the second reference alignment feature and the third reference alignment feature, and determining second target loss according to a similarity calculation result.
A sub-feature in the descriptive text feature that characterizes global text information means that the sub-feature can characterize semantic information in the entire image descriptive text. The second reference alignment feature refers to a training target referred to by the second training prompt feature in the training process, and the second reference alignment feature may be a feature for characterizing global text information.
Specifically, when the image description text is encoded by the text encoder, a 77 × 512-dimensional description text feature may be obtained, in which there is a 512-dimensional sub-feature capable of characterizing the global text information of the image description text; for example, the sub-feature carrying the global text information label "eos" in the description text feature is such a sub-feature. This sub-feature may therefore be extracted directly from the description text feature and used as the second reference alignment feature, i.e., the second reference alignment feature may be g(t_eos).
A sub-feature in a sample image feature that is used to characterize global image information means that the sub-feature can characterize image information in the entire sample image. The third reference alignment feature refers to a training target referred to by the third training prompt feature in the training process, and the third reference alignment feature may be a sub-feature for characterizing global image information.
Specifically, when the sample image is encoded by the image encoder, a W × H × 512-dimensional sample image feature may be obtained, in which there is a 512-dimensional sub-feature capable of characterizing the global image information of the sample image; for example, the sub-feature carrying the global image information label "cls" in the sample image feature is such a sub-feature. This sub-feature may therefore be extracted directly from the sample image feature and used as the third reference alignment feature, i.e., the third reference alignment feature may be f(x)_cls.
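Because both global sub-features already exist inside the encoder outputs, extracting them is plain indexing, as in the following sketch; the eos position depends on the tokenizer and is an assumption here.

```python
import torch

text_feat = torch.randn(77, 512)            # description text feature F_t^tl
image_feat = torch.randn(1 + 14 * 14, 512)  # cls token followed by W*H patch features

eos_index = 12                     # position of the "eos" token; tokenizer-dependent
second_ref = text_feat[eos_index]  # g(t_eos): global text information, shape (512,)
third_ref = image_feat[0]          # f(x)_cls: global image information, shape (512,)
```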
After determining the second reference alignment feature and the third reference alignment feature, a similarity calculation may be performed based on the second training hint feature, the third training hint feature, the second reference alignment feature, and the third reference alignment feature, and a second target loss may be determined based on the similarity calculation result.
Thus, the sub-feature for characterizing global text information can be determined from the description text feature as the second reference alignment feature, and the sub-feature for characterizing global image information can be determined from the sample image feature as the third reference alignment feature, so that similarity calculation can be performed based on the second training prompt feature, the third training prompt feature, the second reference alignment feature, and the third reference alignment feature, and the second target loss can be determined according to the similarity calculation result. By taking the sub-features characterizing global text information and global image information as the second and third reference alignment features in the training of the second prompt parameter, the second prompt parameter learns, from a global perspective, the richer image information represented by the sample images during training, which reduces the generalization loss of the image-text processing model when executing the downstream task based on the second prompt parameter. Meanwhile, training the prompt parameters based on the sub-features representing global information reduces the amount of data to be processed during training and improves computing efficiency.
In one possible implementation manner, the "performing the similarity calculation according to the second training prompt feature, the third training prompt feature, the second reference alignment feature, and the third reference alignment feature, and determining the second target loss according to the similarity calculation result" may include:
performing similarity calculation according to the second training prompt feature and the second reference alignment feature to obtain a second similarity result; determining a first sub-loss based on the second similarity result;
Performing similarity calculation according to the third training prompt feature and the second reference alignment feature to obtain a third similarity result; determining a second sub-loss according to a third similarity result and a class label of the sample image in the downstream task;
performing similarity calculation according to the third training prompt feature and the third reference alignment feature to obtain a fourth similarity result; determining a third sub-loss according to the fourth similarity result and the class label of the sample image in the downstream task;
and determining a second target loss according to the first sub-loss, the second sub-loss and the third sub-loss.
The second similarity result is a result obtained by performing similarity calculation based on the second training prompt feature and the second reference alignment feature and is used for representing the matching degree between the global text information in the image description text and the text information represented by the second prompt information.
The third similarity result is obtained by performing similarity calculation based on the third training prompt feature and the second reference alignment feature and is used for characterizing the matching degree between the global text information in the image description text and each class label in the downstream task; for example, the third similarity result can represent, in matrix form, the similarity between the global text information in the image description text and each class label in the downstream task.
The fourth similarity result is obtained by performing similarity calculation based on the third training prompt feature and the third reference alignment feature and is used for characterizing the matching degree between the global image information in the sample image and each class label in the downstream task; for example, the fourth similarity result can represent, in matrix form, the similarity between the global image information in the sample image and each class label in the downstream task.
As one example, after determining the second reference alignment feature, the second training hint feature and the second reference alignment feature may be feature aligned and mapped into a common feature space. And then calculating the similarity between the second training prompt feature and the second reference alignment feature through cosine similarity algorithm or Euclidean distance algorithm and other algorithms, and obtaining a second similarity result.
Correspondingly, after the second similarity result is obtained, the first sub-loss L_noun can be calculated from the second similarity result through the contrastive learning loss function. In subsequent training, the second prompt parameter can be trained with the goal of minimizing the first sub-loss L_noun, so that the meaning expressed by the second prompt information is as close as possible to the meaning expressed by the image description text, preventing the second prompt parameter from overfitting to the image information in the downstream task.
Correspondingly, the third training prompt feature and the second reference alignment feature can be subjected to feature alignment, and then the similarity between the third training prompt feature and the second reference alignment feature can be calculated through a cosine similarity algorithm or a Euclidean distance algorithm and other algorithms, so that a third similarity result can be obtained.
After the third similarity result is obtained, the second sub-loss L_cap can be calculated through the cross entropy loss function according to the third similarity result and the corresponding class label of the sample image in the downstream task. In subsequent training, the second prompt parameter can be trained with the goal of minimizing the second sub-loss L_cap, that is, taking as the training objective that the class of the image description text represented by the third similarity result is consistent with the class label of the sample image in the downstream task, so that the meaning expressed by the third prompt information is as close as possible to the meaning expressed by the image description text, avoiding overfitting to the image classes in the downstream task.
Correspondingly, after the third reference alignment feature is determined, the third training prompt feature and the third reference alignment feature can be subjected to feature alignment, then the similarity between the third training prompt feature and the third reference alignment feature can be calculated through a cosine similarity algorithm or a Euclidean distance algorithm, and a fourth similarity result can be obtained.
After the fourth similarity result is obtained, the third sub-loss L_img can be calculated through the cross entropy loss function according to the fourth similarity result and the class label corresponding to the sample image in the downstream task. In subsequent training, the second prompt parameter can be trained with the goal of minimizing the third sub-loss L_img, that is, taking as the training objective that the class of the sample image represented by the fourth similarity result is consistent with the class label corresponding to the sample image in the downstream task, so that the meaning expressed by the third prompt information is as close as possible to the meaning expressed by the sample image, which facilitates fine-tuning the image-text processing model to the downstream task data set.
Finally, a second target loss may be determined based on the first sub-loss, the second sub-loss, and the third sub-loss. As an example, the first sub-loss, the second sub-loss, and the third sub-loss may be assigned different weights, and multiplied by the weights, so that a weighted first sub-loss, a weighted second sub-loss, and a weighted third sub-loss may be obtained, and the weighted sub-losses may be added to obtain a second target loss.
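A compact sketch of this combination is given below; the cosine-distance form of L_noun is a simplification of the contrastive learning loss, and the logit scale and equal weights are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def second_target_loss(f_na, f_ga, g_eos, f_cls, label, w=(1.0, 1.0, 1.0)):
    """Combine L_noun, L_cap and L_img; the weights w are illustrative."""
    f_na, g_eos, f_cls = (F.normalize(t, dim=-1) for t in (f_na, g_eos, f_cls))
    f_ga = F.normalize(f_ga, dim=-1)      # (C, D) third training prompt features
    l_noun = 1.0 - (f_na * g_eos).sum()   # cosine-distance stand-in for the contrastive loss
    l_cap = F.cross_entropy((100.0 * g_eos @ f_ga.t())[None], label)  # caption vs class labels
    l_img = F.cross_entropy((100.0 * f_cls @ f_ga.t())[None], label)  # image vs class labels
    return w[0] * l_noun + w[1] * l_cap + w[2] * l_img

loss = second_target_loss(torch.randn(512), torch.randn(10, 512),
                          torch.randn(512), torch.randn(512), torch.tensor([3]))
```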
It should be noted that, the training process is performed based on the batch in which the sample images are located, and correspondingly, the finally calculated second target loss is determined based on the first sub-loss, the second sub-loss and the third sub-loss corresponding to each sample image in the batch. Meanwhile, the present application is not particularly limited to the order in which the first, second, and third sub-losses are calculated.
Therefore, through the above method, similarity calculation is performed based on the second training prompt feature and the second reference alignment feature, so that the second training prompt feature can learn the global text information characterized by the second reference alignment feature during training; similarity calculation is performed based on the third training prompt feature and the second reference alignment feature, so that the third training prompt feature can learn the global text information characterized by the second reference alignment feature; and similarity calculation is performed based on the third training prompt feature and the third reference alignment feature, so that the third training prompt feature can learn the global image information characterized by the third reference alignment feature. In this way, global text information and global image information can both be learned at a coarse-grained level during training. Determining the second target loss based on the first sub-loss, the second sub-loss, and the third sub-loss prevents, at the coarse-grained level, the second prompt parameter from overfitting to the image information in the downstream task, thereby reducing the generalization loss when the image-text processing model executes the downstream task based on the second prompt parameter.
In one possible implementation, after determining the second target loss, the prompt parameter may be trained based on the second target loss, where "training the prompt parameter in the prompt information to be trained based on the target loss" in S405 above may include: based on the second target loss, training a second hint parameter in the second hint information and the third hint information.
Because the second prompt information and the third prompt information are both formed by splicing the second prompt parameters with their corresponding information, the second prompt parameters in the second and third prompt information can be trained based on the second target loss: with the goal of reducing the second target loss, the second prompt parameters (V_1^ga, V_2^ga, ..., V_M^ga) are continuously adjusted to optimize the coarse-grained performance of the image-text processing model when executing the downstream task based on the second prompt information, thereby enhancing the generalization capability of the image-text processing model when executing the downstream task based on the second prompt information.
It should be noted that, in the training process of the prompt parameters, the embodiment of the present application may train only the first prompt parameters or train only the second prompt parameters based on the description text features and the sample image features. Or referring to fig. 7, fig. 7 is a schematic structural diagram of a prompt learning method according to an embodiment of the present application, that is, by using the prompt learning method shown in fig. 7, the training methods of the two prompt parameters are simultaneously used, and the first prompt parameter and the second prompt parameter are simultaneously trained based on the description text feature and the sample image feature. In addition, in the case of training only one prompt parameter, according to a predetermined conversion mechanism, another prompt parameter can be determined according to the prompt parameter, so that the graphics processing model can also use two prompt parameters in the downstream task to execute the corresponding downstream task.
Based on the prompt learning method, the embodiment of the application also provides a method for executing the downstream task by the image processing model based on the prompt parameters obtained through training by the prompt learning method, namely an image classification method. Referring to fig. 8, fig. 8 is a flowchart illustrating an image classification method according to an embodiment of the present application. For convenience of description, the execution subject of the image classification method will be described below as an example of a server. As shown in fig. 8, the image classification method includes the steps of:
s801: and acquiring a target image to be classified.
The target image to be classified is a processing object of an image classification method, and the image classification method is used for classifying the target image based on the prompt parameters obtained through training of the prompt learning method introduced above through a graphics-text processing model, namely distinguishing the category to which the target image belongs through the graphics-text processing model. The target images to be classified can be stored in a database, and in the actual application process, the server can acquire the target images to be classified from the database, so that the image-text processing model classifies the images based on the acquired target images to be classified; or the target image to be classified can be provided by a user, for example, the user can upload the target image to the server through a specific application program, so that the server classifies the target image through the image-text processing model.
S802: and (3) encoding the target image through an image encoder in the image-text processing model to obtain the target image characteristics.
The target image features refer to features of image block levels obtained by encoding the target image through an image encoder in the image-text processing model. The target image features can reflect semantic information of the target image, e.g., content included in the target image, which may be a feature representation in a matrix form.
After the target image to be classified is obtained, the target image can be input into an image encoder in an image-text processing model, and the target image is encoded, so that target image characteristics corresponding to the target image can be obtained, and basic data support is provided for the subsequent image classification process by the target image characteristics.
S803: and carrying out similarity calculation according to the target image features and the target prompt features, and determining a classification result corresponding to the target image according to the similarity calculation result.
The target prompt feature is obtained by encoding the target prompt information through a text encoder in the image-text processing model, and is used for representing semantic information in the target prompt information, and the target prompt feature can be a matrix type feature representation.
The target prompt information includes prompt parameters obtained through training by the prompt learning method and candidate category information in the downstream task, for example, the target prompt information may include first target prompt information formed by splicing category labels in the first prompt parameters and the downstream task, and second target prompt information formed by splicing category labels in the second prompt parameters and the downstream task, so that the image-text processing model performs image classification according to the first target prompt information and the second target prompt information.
As an example, the first prompt parameter may be a prompt parameter trained through the schematic shown in fig. 5, i.e., V_1^ta, V_2^ta, ..., V_M^ta; the second prompt parameter may be a prompt parameter trained through the schematic shown in fig. 6, i.e., V_1^ga, V_2^ga, ..., V_M^ga.
It should be noted that the target prompt feature may be stored in the database in advance, and the target prompt feature is obtained from the database through the server. Or the text encoder can be directly used for encoding the target prompt information when the image-text processing model executes the downstream task, and then the similarity calculation is carried out based on the target prompt characteristic and the target image characteristic.
After the target prompt feature and the target image feature are acquired, feature alignment can be performed on the target prompt feature and the target image feature, and the target prompt feature and the target image feature are mapped into a common feature space. Then, the distance between the target prompt feature and the target image feature in the feature space can be calculated through cosine similarity algorithm or Euclidean distance algorithm and other algorithms, namely, the similarity between the target prompt feature and the target image feature is calculated, and therefore a similarity calculation result is obtained.
And finally, determining the class label corresponding to the maximum similarity in the similarity calculation result as a class result corresponding to the target image, namely the class of the target image in the downstream task.
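The classification step then reduces to an argmax over per-class similarities, as in this sketch; the class names, shapes, and normalized-cosine form are placeholders and assumptions.

```python
import torch
import torch.nn.functional as F

C, D = 10, 512
target_image_feat = F.normalize(torch.randn(D), dim=-1)       # encoded target image
target_prompt_feats = F.normalize(torch.randn(C, D), dim=-1)  # one feature per class prompt
class_labels = [f"class_{i}" for i in range(C)]               # placeholder category labels

similarities = target_prompt_feats @ target_image_feat        # cosine similarity per class
predicted = class_labels[similarities.argmax().item()]        # label with maximum similarity
```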
Therefore, the image processing model performs image classification based on the target prompt information, so that the processing effect of the image processing model in the downstream task is ensured, namely, the classification effect of the image classification task is ensured to be better.
In one possible implementation, the target prompt feature includes a first target prompt feature corresponding to a first target prompt message including a first prompt parameter and a second target prompt feature corresponding to a second target prompt message including a second prompt parameter, the first and second prompt parameters being obtained in different ways.
In S803, the "performing similarity calculation according to the target image feature and the target prompt feature, and determining the classification result corresponding to the target image according to the similarity calculation result" may include:
performing similarity calculation according to the target image features and the first target prompt features to obtain a fifth similarity result;
performing similarity calculation according to the target image features and the second target prompt features to obtain a sixth similarity result;
and determining a classification result according to the fifth similarity result and the sixth similarity result.
The target prompt features comprise first target prompt features and second target prompt features, the first target prompt features are obtained by encoding first target prompt information through a text encoder of an image-text processing model, and the first target prompt information is formed by splicing first prompt parameters and category labels in downstream tasks.
The second target prompt feature is obtained by encoding second target prompt information through a text encoder of the image-text processing model, and the second target prompt information is formed by splicing second prompt parameters and class labels in downstream tasks. The first prompt parameter and the second prompt parameter are obtained through training in different modes, and concretely, the training process in the prompt learning method can be referred.
As an example, the first prompting parameter is obtained by training based on a fusion feature of a descriptive text feature and a sample image feature having a corresponding relationship, specifically, the training process corresponding to the structural schematic diagram shown in fig. 5 may be referred to, and the first prompting parameter is obtained by training based on a first reference alignment feature obtained by performing a fusion process on the descriptive text feature and the sample image feature having a corresponding relationship through a feature fusion module and the first training prompting feature.
The second prompt parameter is obtained by training based on the global text feature extracted from the explanatory text feature and the global image feature extracted from the sample image feature, specifically, the training process corresponding to the structural diagram shown in fig. 6 may be referred to, and the second prompt parameter is obtained by training based on the global text feature as the second reference feature, the global image feature as the third reference feature, the second training prompt feature and the third training prompt feature.
Therefore, the first prompt parameters for representing the local prompt information and the second prompt parameters for representing the global prompt information, which are determined based on the prompt learning method, are applied to the image classification method, so that when the image-text processing model executes the downstream task based on the prompt information, the local information and the global information can be referenced at the same time, and the expression effect of the image-text processing model when executing the downstream task is enhanced.
As another example, the first prompt parameter is trained based on a fusion feature of the explanatory text feature and the sample image feature having a correspondence, and the second prompt parameter is converted by the first conversion mechanism.
Or the second prompt parameter is trained based on the global text feature extracted from the description text feature and the global image feature extracted from the sample image feature, and the first prompt parameter is obtained by converting the second prompt parameter through a second conversion mechanism.
The first conversion mechanism is a processing mechanism for converting the first prompt parameter into the second prompt parameter, for example, the association relationship between the first prompt parameter and the second prompt parameter can be learned through a machine learning model, and further the model is trained based on the association relationship between the first prompt parameter and the second prompt parameter, so that the trained model can output the converted second prompt parameter based on the input first prompt parameter.
Correspondingly, the second conversion mechanism is a processing mechanism for converting the second prompt parameter into the first prompt parameter; for example, the association relationship between the first prompt parameter and the second prompt parameter can be learned through a machine learning model, and the model is trained based on this association relationship, so that the trained model can output the converted first prompt parameter based on the input second prompt parameter.
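As one hypothetical realization of such a conversion mechanism, a small MLP could map one set of prompt parameters onto the other; the architecture below is purely illustrative, and in practice the converter would be trained on pairs of prompt parameters.

```python
import torch
import torch.nn as nn

M, D = 4, 512
# Hypothetical conversion model mapping first prompt parameters to second ones:
converter = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))
v_ta = torch.randn(M, D)   # trained first prompt parameters
v_ga = converter(v_ta)     # converted second prompt parameters, shape (M, D)
```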
Therefore, the method not only can determine the first prompt parameter and the second prompt parameter based on the training mode in the prompt learning method, but also can determine the first prompt parameter and the second prompt parameter through the first conversion mechanism and the second conversion mechanism, so that the efficiency of determining the target prompt information is improved, and the efficiency of classifying images is further improved.
The fifth similarity calculation result is a result obtained by performing similarity calculation based on the first target prompt feature and the target image feature, and is used for representing the matching degree between the image information in the target image and each category label in the downstream task contained in the first target prompt feature, for example, the fifth similarity result can represent the similarity between the target image and each category label in the downstream task in a matrix form.
The sixth similarity calculation result is a result obtained by performing similarity calculation based on the second target prompt feature and the target image feature, and is used for representing the matching degree between the image information in the target image and each category label in the downstream task included in the second target prompt feature, for example, the sixth similarity result can represent the similarity between the target image and each category label in the downstream task in a matrix form.
After the target image feature and the first target prompt feature are obtained, feature alignment can be performed on the target image feature and the first target prompt feature, and then the similarity between the first target prompt feature and the target image feature can be calculated through cosine similarity algorithm or Euclidean distance algorithm and other algorithms, so that a fifth similarity calculation result is obtained.
After the target image feature and the second target prompt feature are obtained, feature alignment can be performed on the target image feature and the second target prompt feature, and then the similarity between the second target prompt feature and the target image feature can be calculated through cosine similarity algorithm or Euclidean distance algorithm and other algorithms, so that a sixth similarity calculation result is obtained.
Finally, the classification result of the target image can be determined according to the fifth similarity calculation result and the sixth similarity calculation result. As an example, different weights may be assigned to the fifth similarity result and the sixth similarity result, and the weighted fifth similarity result and the weighted sixth similarity result may be obtained by multiplying the weights, and the weighted similarity results may be added to obtain the final similarity result. And then, determining the class label corresponding to the maximum similarity result in the final similarity results as the class result corresponding to the target image. The weights corresponding to the fifth similarity result and the sixth similarity result can be adaptively adjusted according to the actual effect.
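A sketch of this weighted combination follows; the weight alpha is a tunable assumption rather than a value specified by the embodiment.

```python
import torch

C = 10
fifth_result = torch.randn(C)   # similarities from the first target prompt features
sixth_result = torch.randn(C)   # similarities from the second target prompt features
alpha = 0.5                     # weight; adjusted according to the actual effect
final_result = alpha * fifth_result + (1 - alpha) * sixth_result
predicted_class = int(final_result.argmax())   # index of the predicted category label
```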
Therefore, based on the similarity calculation between the first target prompt feature and the second target prompt feature and the target image feature, a final similarity result can be determined, and then a classification result of the target image can be determined based on the final similarity result. When the image-text processing model classifies the target image, the first prompt parameter and the second prompt parameter which learn different dimension information are combined as prompt information, so that the image-text processing model can refer to the learned different dimension information when executing the downstream task based on the prompt information, the expression effect of the image-text processing model when executing the downstream task is enhanced, and the classification effect of the image classification method is improved.
Based on the prompt learning method provided by the foregoing embodiment, the present application further provides a prompt learning device accordingly. Fig. 9 is a schematic structural diagram of a prompt learning device 900 according to an embodiment of the present application, where the device is described below with reference to fig. 9, and the device includes:
The explanation determining module 901 is configured to determine an image explanation text corresponding to the sample image;
the text coding module 902 is configured to perform coding processing on the image description text by using a text coder in the graphic processing model to obtain description text features; coding the prompt information to be trained through the text coder to obtain training prompt characteristics, wherein the prompt information to be trained comprises prompt parameters to be trained;
the image coding module 903 is configured to perform coding processing on the sample image by using an image encoder in the graphics context processing model, so as to obtain a sample image feature;
The loss determination module 904 is configured to perform similarity calculation according to the training prompt feature and the reference alignment feature, and determine a target loss according to a similarity calculation result; the reference alignment feature is determined from the specification text feature and the sample image feature;
Prompt training module 905 is configured to train the prompt parameters in the prompt information to be trained based on the target loss.
Optionally, the prompt information to be trained includes first prompt information, where the first prompt information includes a first prompt parameter to be trained and candidate category information in a downstream task; the training prompt features include a first training prompt feature;
The loss determination module 904 includes:
The fusion unit is used for carrying out fusion processing on the description text feature and the sample image feature through the feature fusion module to obtain a first reference alignment feature;
The first similarity calculation unit is used for performing similarity calculation according to the first training prompt feature and the first reference alignment feature to obtain a first similarity result; the first similarity result is used for representing the matching degree between the sample image and each candidate category information in the downstream task;
And the first determining unit is used for determining a first target loss according to the first similarity result and the category label of the sample image in the downstream task.
Optionally, the fusion unit includes:
The second determining unit is used for determining the image-text fusion feature, through the cross attention layer in the feature fusion module, according to the description text feature serving as the query and the sample image feature serving as the key and the value;
and the pooling processing unit is used for pooling the image-text fusion characteristics through a pooling layer in the characteristic fusion module to obtain the first reference alignment characteristics.
Optionally, the prompt training module 905 includes:
the first training unit is used for training the first prompt parameters in the first prompt information based on the first target loss;
The apparatus further comprises:
and the second training unit is used for training the model parameters of the feature fusion module based on the first target loss.
Optionally, the prompt information to be trained includes a second prompt information and a third prompt information, the second prompt information includes a second prompt parameter to be trained and nouns extracted from the image description text, and the third prompt information includes the second prompt parameter and candidate category information in the downstream task;
the text encoding module 902 includes:
The encoding processing unit is used for respectively encoding the second prompt information and the third prompt information through the text encoder to obtain second training prompt characteristics corresponding to the second prompt information and third training prompt characteristics corresponding to the third prompt information.
Optionally, the loss determination module 904 includes:
An extracting unit for extracting sub-features for characterizing global text information from the descriptive text features as second reference alignment features; extracting sub-features for representing global image information from the sample image features as third reference alignment features;
And the second similarity calculation unit is used for performing similarity calculation according to the second training prompt feature, the third training prompt feature, the second reference alignment feature and the third reference alignment feature, and determining a second target loss according to a similarity calculation result.
Optionally, the second similarity calculation unit includes:
a third similarity calculation unit, configured to perform similarity calculation according to the second training prompt feature and the second reference alignment feature, to obtain a second similarity result; determining a first sub-loss according to the second similarity result;
A fourth similarity calculation unit, configured to perform similarity calculation according to the third training prompt feature and the second reference alignment feature, to obtain a third similarity result; determining a second sub-loss according to the third similarity result and the class label of the sample image in the downstream task;
a fifth similarity calculation unit, configured to perform similarity calculation according to the third training prompt feature and the third reference alignment feature, to obtain a fourth similarity result; determining a third sub-loss according to the fourth similarity result and the class label of the sample image in the downstream task;
And a third determining unit configured to determine the second target loss according to the first sub-loss, the second sub-loss, and the third sub-loss.
Optionally, the prompt training module 905 includes:
and the third training unit is used for training the second prompt parameters in the second prompt information and the third prompt information based on the second target loss.
Optionally, the description determining module 901 includes:
the generation unit is used for generating, through a text generation model, an initial description text corresponding to the sample image according to the sample image; the text generation model is used for generating a description text corresponding to an input image;
and the rewriting unit is used for rewriting the initial description text through a large language model to obtain the image description text corresponding to the sample image.
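A minimal sketch of this two-stage pipeline follows; `caption_model` and `llm_rewrite` are hypothetical stubs standing in for any image captioning model and any large language model, since the embodiment does not prescribe specific ones:

```python
def caption_model(image) -> str:
    """Hypothetical stub for a text generation model that captions an image."""
    return "a cat sitting on a sofa"

def llm_rewrite(prompt: str) -> str:
    """Hypothetical stub for a large language model call."""
    return "A short-haired cat rests on a gray sofa beside a sunny window."

def build_image_description(image) -> str:
    # Stage 1: the text generation model produces an initial description text.
    initial_text = caption_model(image)
    # Stage 2: the large language model rewrites the draft into a richer,
    # more detailed image description text.
    instruction = (
        "Rewrite the following image caption as a detailed, grammatical "
        f"description of the image: {initial_text}"
    )
    return llm_rewrite(instruction)

print(build_image_description(object()))
```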
Based on the image classification method provided by the foregoing embodiments, the application correspondingly further provides an image classification device, described below with reference to fig. 10. Fig. 10 is a schematic structural diagram of an image classification apparatus 1000 according to an embodiment of the present application, where the apparatus includes:
An image acquisition module 1001, configured to acquire a target image to be classified;
The image coding module 1002 is configured to perform coding processing on the target image by using an image encoder in the image-text processing model, so as to obtain a target image feature;
An image classification module 1003, configured to perform similarity calculation according to the target image feature and the target prompt feature, and determine a classification result corresponding to the target image according to a similarity calculation result; the target prompt feature is obtained by encoding target prompt information through a text encoder in the image-text processing model, and the target prompt information comprises prompt parameters obtained through training by the prompt learning method and candidate category information in a downstream task.
Optionally, the target prompt feature includes a first target prompt feature and a second target prompt feature, where the first target prompt feature corresponds to first target prompt information including a first prompt parameter, the second target prompt feature corresponds to second target prompt information including a second prompt parameter, and the first prompt parameter and the second prompt parameter are obtained in different manners;
the image classification module 1003 includes:
A sixth similarity calculation unit, configured to perform similarity calculation according to the target image feature and the first target prompt feature, to obtain a fifth similarity calculation result; performing similarity calculation according to the target image features and the second target prompt features to obtain a sixth similarity calculation result;
And a fourth determining unit configured to determine the classification result according to the fifth similarity calculation result and the sixth similarity calculation result.
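At inference time the two prompt branches can be combined as in the following sketch; the feature tensors are random stand-ins, and averaging the two softmax distributions is one reasonable combination rule, since the embodiment only requires that both similarity results inform the classification:

```python
import torch
import torch.nn.functional as F

C, dim, tau = 10, 512, 0.07
target_image_feat = F.normalize(torch.randn(1, dim), dim=-1)
first_prompt_feat = F.normalize(torch.randn(C, dim), dim=-1)   # first prompt branch
second_prompt_feat = F.normalize(torch.randn(C, dim), dim=-1)  # second prompt branch

fifth_sim = (target_image_feat @ first_prompt_feat.t()) / tau   # (1, C)
sixth_sim = (target_image_feat @ second_prompt_feat.t()) / tau  # (1, C)

probs = (fifth_sim.softmax(-1) + sixth_sim.softmax(-1)) / 2     # fuse both branches
predicted_class = probs.argmax(-1).item()
print(predicted_class)
```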
Optionally, the first prompt parameter is trained based on fusion features of description text features and sample image features having a correspondence;
the second prompt parameter is trained based on a global text feature extracted from the description text features and a global image feature extracted from the sample image features.
Optionally, the first prompt parameter is trained based on fusion features of description text features and sample image features having a correspondence, and the second prompt parameter is obtained by converting the first prompt parameter through a first conversion mechanism;
or,
the second prompt parameter is trained based on a global text feature extracted from the description text features and a global image feature extracted from the sample image features, and the first prompt parameter is obtained by converting the second prompt parameter through a second conversion mechanism.
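The conversion mechanism can be sketched as a learned mapping between the two sets of prompt parameters; realizing it as a single linear layer, as below, is an illustrative assumption, since the embodiment does not fix its form:

```python
import torch
import torch.nn as nn

dim, n_ctx = 512, 4
first_prompt = nn.Parameter(torch.randn(n_ctx, dim))  # trained via fusion features
first_to_second = nn.Linear(dim, dim)                 # first conversion mechanism (assumed linear)

# Derive the second prompt parameters instead of training them separately.
second_prompt = first_to_second(first_prompt)
print(second_prompt.shape)  # torch.Size([4, 512])
```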
The embodiment of the application also provides a computer device, which may be a terminal device or a server. The terminal device and the server provided by the embodiments of the application are described below from the perspective of hardware implementation.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 11, for convenience of explanation, only the portions related to the embodiments of the present application are shown; for specific technical details not disclosed, please refer to the method portions of the embodiments of the present application. The terminal may be any terminal device, including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, a vehicle-mounted computer, and the like; the following takes a computer as an example:
Fig. 11 is a block diagram showing a part of the structure of a computer related to a terminal provided by an embodiment of the present application. Referring to fig. 11, the computer includes: radio frequency (RF) circuitry 1210, memory 1220, input unit 1230 (including touch panel 1231 and other input devices 1232), display unit 1240 (including display panel 1241), sensors 1250, audio circuitry 1260 (to which speaker 1261 and microphone 1262 are connected), wireless fidelity (WiFi) module 1270, processor 1280, and power supply 1290. Those skilled in the art will appreciate that the computer architecture shown in fig. 11 is not limiting; more or fewer components than shown may be included, certain components may be combined, or different arrangements of components may be used.
Memory 1220 may be used to store software programs and modules, and processor 1280 executes the various functional applications and data processing of the computer by running the software programs and modules stored in memory 1220. The memory 1220 may mainly include a program storage area and a data storage area: the program storage area may store an operating system and application programs required for at least one function (such as a sound playing function, an image playing function, etc.); the data storage area may store data created according to the use of the computer (such as audio data, phonebooks, etc.). In addition, memory 1220 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
Processor 1280 is the control center of the computer; it connects the various parts of the entire computer using various interfaces and lines, and performs the various functions of the computer and processes data by running or executing the software programs and/or modules stored in memory 1220 and invoking the data stored in memory 1220. Optionally, processor 1280 may include one or more processing units; preferably, processor 1280 may integrate an application processor and a modem processor, where the application processor primarily handles the operating system, user interfaces, application programs, and the like, and the modem processor primarily handles wireless communications. It will be appreciated that the modem processor may alternatively not be integrated into processor 1280.
In an embodiment of the present application, the processor 1280 included in the terminal is configured to perform steps in the prompt learning method described in the foregoing embodiments, or to perform steps in the image classification method described in the foregoing embodiments.
Referring to fig. 12, fig. 12 is a schematic structural diagram of a server 1300 according to an embodiment of the present application. The server 1300 may vary considerably in configuration or performance and may include one or more central processing units (CPUs) 1322 (e.g., one or more processors), memory 1332, and one or more storage media 1330 (e.g., one or more mass storage devices) storing applications 1342 or data 1344. The memory 1332 and storage media 1330 may be transitory or persistent. The program stored on a storage medium 1330 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 1322 may be configured to communicate with the storage medium 1330 and execute, on the server 1300, the series of instruction operations in the storage medium 1330.
The server 1300 may also include one or more power supplies 1326, one or more wired or wireless network interfaces 1350, one or more input/output interfaces 1358, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 12.
Wherein CPU 1322 is configured to perform steps in the prompt learning method described in the foregoing embodiments or to perform steps in the image classification method described in the foregoing embodiments.
The embodiments of the present application also provide a computer-readable storage medium storing a computer program for executing the steps in the prompt learning method described in the foregoing embodiments or for executing the steps in the image classification method described in the foregoing embodiments.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps in the prompt learning method described in the foregoing embodiments, or performs the steps in the image classification method described in the foregoing embodiments.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or various other media in which a computer program can be stored.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: only A exists, only B exists, and both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b, or c may indicate: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or plural.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (17)
1. A prompt learning method, the method comprising:
determining an image description text corresponding to the sample image;
Encoding the image description text by a text encoder in the image-text processing model to obtain description text features; encoding the prompt information to be trained through the text encoder to obtain training prompt features, wherein the prompt information to be trained comprises prompt parameters to be trained; the prompt information to be trained comprises first prompt information, and the first prompt information comprises first prompt parameters to be trained and candidate category information in a downstream task; the training prompt features include a first training prompt feature;
encoding the sample image through an image encoder in the image-text processing model to obtain sample image features;
Performing similarity calculation according to the training prompt feature and the reference alignment feature, and determining a target loss according to a similarity calculation result, which specifically comprises: performing, through a feature fusion module, fusion processing on the description text features and the sample image features to obtain a first reference alignment feature; performing similarity calculation according to the first training prompt feature and the first reference alignment feature to obtain a first similarity result, wherein the first similarity result is used for representing the matching degree between the sample image and each candidate category information in the downstream task; and determining a first target loss according to the first similarity result and a class label of the sample image in the downstream task; wherein the reference alignment feature is determined from the description text features and the sample image features;
And training the prompt parameters in the prompt information to be trained based on the target loss.
2. The method according to claim 1, wherein the fusing, by the feature fusion module, the text feature and the sample image feature to obtain a first reference alignment feature includes:
determining, by a cross attention layer in the feature fusion module, an image-text fusion feature with the description text feature as the query and the sample image feature as the key and the value;
and pooling the image-text fusion feature through a pooling layer in the feature fusion module to obtain the first reference alignment feature.
3. The method according to claim 1 or 2, wherein the training the prompt parameters in the prompt information to be trained based on the target loss comprises:
training the first prompt parameters in the first prompt information based on the first target loss;
The method further comprises the steps of:
And training model parameters of the feature fusion module based on the first target loss.
4. The method according to any one of claims 1 to 2, wherein the prompt information to be trained includes second prompt information and third prompt information, the second prompt information includes a second prompt parameter to be trained and nouns extracted from the image description text, and the third prompt information includes the second prompt parameter and candidate category information in a downstream task;
wherein the encoding the prompt information to be trained through the text encoder to obtain training prompt features comprises:
encoding the second prompt information and the third prompt information respectively through the text encoder to obtain a second training prompt feature corresponding to the second prompt information and a third training prompt feature corresponding to the third prompt information.
5. The method of claim 4, wherein the performing similarity calculation according to the training prompt feature and the reference alignment feature and determining the target loss according to the similarity calculation result further comprises:
extracting a sub-feature representing global text information from the description text features as a second reference alignment feature; extracting a sub-feature representing global image information from the sample image features as a third reference alignment feature;
and performing similarity calculation according to the second training prompt feature, the third training prompt feature, the second reference alignment feature and the third reference alignment feature, and determining second target loss according to a similarity calculation result.
6. The method of claim 5, wherein the performing similarity calculation according to the second training prompt feature, the third training prompt feature, the second reference alignment feature, and the third reference alignment feature, and determining the second target loss according to the similarity calculation result, comprises:
Performing similarity calculation according to the second training prompt feature and the second reference alignment feature to obtain a second similarity result; determining a first sub-loss according to the second similarity result;
performing similarity calculation according to the third training prompt feature and the second reference alignment feature to obtain a third similarity result; determining a second sub-loss according to the third similarity result and the class label of the sample image in the downstream task;
Performing similarity calculation according to the third training prompt feature and the third reference alignment feature to obtain a fourth similarity result; determining a third sub-loss according to the fourth similarity result and the class label of the sample image in the downstream task;
determining the second target loss according to the first sub-loss, the second sub-loss and the third sub-loss.
7. The method according to claim 5 or 6, wherein the training the prompt parameters in the prompt information to be trained based on the target loss comprises:
training the second prompt parameter in the second prompt information and the third prompt information based on the second target loss.
8. The method of claim 1, wherein determining the image description text corresponding to the sample image comprises:
generating, through a text generation model, an initial description text corresponding to the sample image according to the sample image; the text generation model is used for generating a description text corresponding to an input image;
and rewriting the initial description text through a large language model to obtain the image description text corresponding to the sample image.
9. A method of classifying images, the method comprising:
acquiring a target image to be classified;
Encoding the target image by an image encoder in the image-text processing model to obtain target image characteristics;
Performing similarity calculation according to the target image features and the target prompt features, and determining a classification result corresponding to the target image according to a similarity calculation result; the target prompt feature is obtained by encoding target prompt information through a text encoder in the image-text processing model, and the target prompt information comprises prompt parameters obtained through training by the method of any one of claims 1 to 8 and candidate category information in downstream tasks.
10. The method of claim 9, wherein the target prompt features comprise a first target prompt feature and a second target prompt feature, the first target prompt feature corresponding to first target prompt information comprising a first prompt parameter, the second target prompt feature corresponding to second target prompt information comprising a second prompt parameter, the first prompt parameter and the second prompt parameter being obtained in different manners;
The step of carrying out similarity calculation according to the target image features and the target prompt features and determining a classification result corresponding to the target image according to a similarity calculation result comprises the following steps:
performing similarity calculation according to the target image features and the first target prompt features to obtain a fifth similarity calculation result; performing similarity calculation according to the target image features and the second target prompt features to obtain a sixth similarity calculation result;
and determining the classification result according to the fifth similarity calculation result and the sixth similarity calculation result.
11. The method of claim 10, wherein the first prompt parameter is trained based on fusion features of description text features and sample image features having a correspondence;
the second prompt parameter is trained based on a global text feature extracted from the description text features and a global image feature extracted from the sample image features.
12. The method of claim 10, wherein the first prompt parameter is trained based on fusion features of description text features and sample image features having a correspondence, and the second prompt parameter is obtained by converting the first prompt parameter through a first conversion mechanism;
or,
the second prompt parameter is trained based on a global text feature extracted from the description text features and a global image feature extracted from the sample image features, and the first prompt parameter is obtained by converting the second prompt parameter through a second conversion mechanism.
13. A prompt learning device, the device comprising:
the description determining module is used for determining an image description text corresponding to the sample image;
The text coding module is used for coding the image description text through a text encoder in the image-text processing model to obtain description text features; and coding the prompt information to be trained through the text encoder to obtain training prompt features, wherein the prompt information to be trained comprises prompt parameters to be trained; the prompt information to be trained comprises first prompt information, and the first prompt information comprises first prompt parameters to be trained and candidate category information in a downstream task; the training prompt features include a first training prompt feature;
the image coding module is used for coding the sample image through an image encoder in the image-text processing model to obtain sample image features;
The loss determination module is configured to perform similarity calculation according to the training prompt feature and the reference alignment feature, and determine a target loss according to a similarity calculation result, which specifically includes: performing, through a feature fusion module, fusion processing on the description text features and the sample image features to obtain a first reference alignment feature; performing similarity calculation according to the first training prompt feature and the first reference alignment feature to obtain a first similarity result, wherein the first similarity result is used for representing the matching degree between the sample image and each candidate category information in the downstream task; and determining a first target loss according to the first similarity result and a class label of the sample image in the downstream task; wherein the reference alignment feature is determined from the description text features and the sample image features;
And the prompt training module is used for training the prompt parameters in the prompt information to be trained based on the target loss.
14. An image classification apparatus, the apparatus comprising:
The image acquisition module is used for acquiring target images to be classified;
the image coding module is used for coding the target image through an image encoder in the image-text processing model to obtain target image features;
The image classification module is used for carrying out similarity calculation according to the target image characteristics and the target prompt characteristics and determining a classification result corresponding to the target image according to a similarity calculation result; the target prompt feature is obtained by encoding target prompt information through a text encoder in the image-text processing model, and the target prompt information comprises prompt parameters obtained through training by the method of any one of claims 1 to 8 and candidate category information in downstream tasks.
15. A computer device, the device comprising a processor and a memory;
The memory is used for storing a computer program;
The processor is configured to execute the prompt learning method according to any one of claims 1 to 8 or the image classification method according to any one of claims 9 to 12 according to the computer program.
16. A computer-readable storage medium storing a computer program which, when executed by an electronic device, implements the prompt learning method of any one of claims 1 to 8 or the image classification method of any one of claims 9 to 12.
17. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the prompt learning method of any one of claims 1 to 8 or the image classification method of any one of claims 9 to 12.
Priority Applications (1)
- CN202410583471.3A (priority and filing date: 2024-05-11) — CN118155214B: Prompt learning method, image classification method and related devices

Publications (2)
- CN118155214A, published 2024-06-07
- CN118155214B, granted 2024-08-06

Family
- ID=91297229; family application CN202410583471.3A (filed 2024-05-11), status: Active; country: CN