
CN118093914A - A dialogue image retrieval method based on cross-modal emotional interaction - Google Patents


Info

Publication number
CN118093914A
CN118093914A
Authority
CN
China
Prior art keywords
emotion
image
dialogue
sample
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410158251.6A
Other languages
Chinese (zh)
Inventor
杨巨峰
夏无忧
刘胜哲
秦荣
贾国力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nankai University
Original Assignee
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nankai University filed Critical Nankai University
Priority to CN202410158251.6A
Publication of CN118093914A
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 - Information retrieval of still image data
    • G06F16/53 - Querying
    • G06F16/55 - Clustering; Classification
    • G06F16/58 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of emotion analysis and provides a dialogue image retrieval method based on cross-modal emotion interaction. The method comprises the following steps: introducing an emotion recognition data set and clustering the expression packages in it to obtain a plurality of emotion categories; performing contrastive learning between different image characterizations within the same emotion category, and between image characterizations of different emotion categories, to obtain expression-package image samples with enhanced local features; encoding the dialogue samples and image samples, extracting initial features from the encoded data, and aligning image and text features to obtain multi-modal features; computing a matching score between each image sample and each dialogue sample from the multi-modal features, and constructing matched positive sample pairs and non-matched negative sample pairs; and carrying out optimization training to obtain a retrieval model and performing expression-package retrieval. The invention uses the emotion information of both the dialogue and the expression package, thereby improving the performance and accuracy of dialogue-to-expression retrieval.

Description

Cross-modal emotion interaction-based dialogue image retrieval method
Technical Field
The invention relates to the technical field of emotion analysis, in particular to a dialogue image retrieval method based on cross-modal emotion interaction.
Background
With the rapid development of instant-messaging applications, expression-package (sticker) images are increasingly widely used in network chat. Compared with emoticons, expression packages are more expressive in conveying strong emotion, aggressiveness and a sense of intimacy, and can be used to regulate people's emotional expression in conversation.
The dialogue-based sticker image retrieval (Sticker Response Selection, SRS) task aims to recommend appropriate expression-package images to a user using multiple rounds of historical dialogue information. As expression packages become more popular, the SRS task is attracting increasing attention from researchers. The task can be further extended to fields such as visual dialogue and multi-modal emotion analysis.
Expression packages are often used to replace or supplement the emotion of a conversation, so they are closely related to the emotion of the context. Compared with the traditional text-image retrieval task, the SRS task poses three new challenges. First, expression packages belonging to different topics may express the same emotion, while expression packages belonging to the same topic may convey different emotions; this is known as the affective gap problem in the field of visual emotion analysis. The varied visual concepts found under different expression-package topics make understanding the emotion of an image difficult, and the existing SRS data sets lack image-level labels, further raising the difficulty of extracting emotion information from expression packages. Second, modeling differentiated expression-package representations is challenging: expression packages under the same topic often have similar content, so their semantic annotations tend to be similar; however, for the SRS task the expression-package features extracted by the model should be discriminative so that they can be matched exactly to the corresponding dialogue queries, and using similar semantic tags for expression packages prevents the model from learning discriminative representations. Third, it is challenging to link emotions between the expression package and the dialogue.
Previous deep-learning-based methods can accurately connect the semantic associations of the text and image modalities by locating text phrases and object image regions; however, unlike semantics, the emotion of an image is abstract, so its emotional regions are difficult to locate precisely, which raises the difficulty of cross-modal emotion association.
Disclosure of Invention
The present invention is directed to solving at least one of the technical problems existing in the related art. Therefore, the invention provides a dialogue image retrieval method based on cross-modal emotion interaction.
The invention provides a dialogue image retrieval method based on cross-modal emotion interaction, which comprises the following steps:
S1: introducing an emotion recognition data set, and clustering expression packages in the emotion recognition data set to obtain a plurality of emotion categories;
S2: comparing and learning different image characterizations in the same emotion category, and comparing and learning the image characterizations in different emotion categories to obtain an image sample of the expression package with enhanced local characteristics;
S3: coding the dialogue sample and the image sample, extracting initial characteristics of the coded data, and aligning the obtained image characteristics with the text characteristics to obtain multi-modal characteristics;
S4: obtaining a matching score between each image sample and each dialogue sample by the multi-modal feature calculation, and selecting a corresponding dialogue sample and an image sample based on the matching score to obtain a positive sample pair;
selecting a plurality of non-matching image samples except the matching image samples for a single dialogue sample, and selecting a plurality of non-matching dialogue samples except the matching dialogue samples for the single image sample to obtain a negative sample pair;
S5: and carrying out optimization training based on the positive sample pair and the negative sample pair to obtain a retrieval model, and carrying out expression package retrieval on the dialogue to be retrieved through the retrieval model to obtain a retrieval result.
According to the dialogue image retrieval method based on cross-modal emotion interaction provided by the invention, the step S1 further comprises the following steps:
S11: selecting ResNet models to train on the introduced emotion recognition data set to obtain teacher models;
s12: extracting features of the emotion recognition data set through the teacher model to obtain a plurality of image characterizations;
S13: K-Means clustering is carried out on the image representation based on emotion categories, and a plurality of clusters are obtained;
S14: and selecting a centroid vector of the cluster as an emotion anchor point, and establishing a plurality of emotion categories by taking the emotion anchor point as a center.
According to the dialogue image retrieval method based on cross-modal emotion interaction provided by the invention, the emotion categories in step S13 comprise surprise, happiness, disgust, fear, sadness, anger and neutrality.
According to the dialogue image retrieval method based on cross-modal emotion interaction provided by the invention, the step S14 further comprises the following steps:
s141: and calculating the matrix product of the image representation and the emotion anchor point to be used as an emotion pseudo tag of the emotion category.
According to the dialogue image retrieval method based on cross-modal emotion interaction provided by the invention, in step S2, the optimization target for contrast learning of different image characterizations in the same emotion category can be written in standard InfoNCE form as:

$$\mathcal{L}_{intra} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\left(\mathrm{sim}(v_i^{(1)}, v_i^{(2)})/\tau\right)}{\sum_{j=1}^{N}\exp\left(\mathrm{sim}(v_i^{(1)}, v_j^{(2)})/\tau\right)}$$

wherein $\mathcal{L}_{intra}$ is the optimization target for contrast learning of different image characterizations in the same emotion category, $N$ is the total number of image samples, $i$ is the first index value of the image sample, $j$ is the second index value of the image sample, $v_i^{(1)}$ is the [CLS] token encoding of the $i$-th image sample after the first parallel data enhancement, $v_i^{(2)}$ is the [CLS] token encoding of the $i$-th image sample after the second parallel data enhancement, $v_j^{(2)}$ is the [CLS] token encoding of the $j$-th image sample after the second parallel data enhancement, $\mathrm{sim}(\cdot,\cdot)$ is a similarity function, and $\tau$ is a temperature coefficient;
in step S2, the optimization target for performing contrast learning on the image characterizations of different emotion categories takes the same InfoNCE form over a group of samples:

$$\mathcal{L}_{inter} = -\frac{1}{K}\sum_{i=1}^{K}\log\frac{\exp\left(\mathrm{sim}(v_i^{(1)}, v_i^{(2)})/\tau\right)}{\sum_{j=1}^{K}\exp\left(\mathrm{sim}(v_i^{(1)}, v_j^{(2)})/\tau\right)}$$

wherein $\mathcal{L}_{inter}$ is the optimization objective for contrast learning of image characterizations of different emotion categories, $K$ is the number of samples in the emotion category, and the negative samples indexed by $j$ are drawn from other topics.
According to the dialogue image retrieval method based on cross-modal emotion interaction provided by the invention, the step S4 further comprises the following steps:
S41: and calculating emotion polarity scores of the dialogue samples through the introduced text emotion analysis library, and obtaining text difficult samples based on the matching scores and the emotion polarity scores.
According to the dialogue image retrieval method based on cross-modal emotion interaction, in the step S4, when the non-matching image sample is selected, the normalized value of the sum of the semantic similarity and the emotion similarity of the image sample is used as the sampling probability.
According to the dialogue image retrieval method based on cross-modal emotion interaction provided by the invention, in step S5, the overall optimization target of the retrieval model is the sum of the individual losses:

$$\mathcal{L} = \mathcal{L}_{align} + \mathcal{L}_{neg} + \mathcal{L}_{ce} + \lambda_1\mathcal{L}_{KD} + \lambda_2\mathcal{L}_{intra} + \lambda_3\mathcal{L}_{inter}$$

wherein $\mathcal{L}$ is the overall optimization objective of the retrieval model, $\mathcal{L}_{align}$ is the loss for aligning image features and text features, $\mathcal{L}_{neg}$ is the loss for negative samples from different groups during image-text alignment, $\mathcal{L}_{ce}$ is the cross-entropy loss for obtaining the multi-modal features, $\lambda_1$ is the first balance coefficient, $\mathcal{L}_{KD}$ is the KL divergence between the encoder output for expression packages in the emotion recognition data set and the emotion pseudo tags, $\lambda_2$ is the second balance coefficient, and $\lambda_3$ is the third balance coefficient.
According to the dialogue image retrieval method based on cross-modal emotion interaction, the emotion knowledge distillation module and the topic hierarchy semantic comparison learning module are provided to utilize emotion and semantic information of the expression package image, so that feature quality of the expression package obtained by the model is improved.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a dialogue image retrieval method based on cross-modal emotion interaction.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. The following examples are illustrative of the invention but are not intended to limit the scope of the invention.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the embodiments of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
An embodiment of the present invention is described below with reference to fig. 1.
The invention provides a dialogue image retrieval method based on cross-modal emotion interaction, which comprises the following steps:
S1: introducing an emotion recognition data set, and clustering expression packages in the emotion recognition data set to obtain a plurality of emotion categories;
In this stage, emotion knowledge of expression packages is first extracted from the expression-package emotion recognition data set SER30K and distilled into the image encoder of the proposed PBR method. Specifically, the image features in SER30K are clustered; the centroid vector of each cluster is called an emotion anchor, and M emotion anchors are obtained for each emotion category.
Wherein, step S1 further comprises:
S11: selecting ResNet models to train on the introduced emotion recognition data set to obtain teacher models;
Further, the invention selects ResNet as the teacher model and trains it on SER30K. Specifically, a standard ResNet (size 128) is used as the teacher model of the EKD module and trained for a total of 100 epochs on SER30K; the input image is scaled to 128 x 128 and then randomly rotated and randomly flipped horizontally.
The invention uses an Adam optimizer with a learning rate of 10^-4, decayed to 0.1 times its original value at the 50th and 80th epochs; the final emotion classification accuracy of the teacher model on the SER30K test set is 64.56%.
S12: extracting features of the emotion recognition data set through the teacher model to obtain a plurality of image characterizations;
S13: K-Means clustering is carried out on the image representation based on emotion categories, and a plurality of clusters are obtained;
Wherein the emotion categories in step S13 include surprise, happiness, disgust, fear, sadness, anger and neutrality.
After training the teacher model, it is first used to extract a feature representation for each image in SER30K, and the representations are grouped according to the seven emotion categories, namely surprise, happiness, disgust, fear, sadness, anger and neutrality.
And clustering the features of each emotion category into M clusters by adopting a K-Means clustering method, wherein each cluster can be regarded as being approximately corresponding to one image style, and then, aggregating the centroid vectors of each cluster to obtain a feature matrix, wherein the feature matrix is called an emotion anchor point.
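The per-category clustering and anchor aggregation described above can be sketched as follows; the function names, the toy K-Means implementation, and the choice of M are assumptions for illustration, not the patent's actual implementation:

```python
import numpy as np

def kmeans(features, m, iters=50, seed=0):
    """Toy K-Means: cluster one emotion category's features into m clusters
    and return the m centroid vectors (the emotion anchors)."""
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), m, replace=False)]
    for _ in range(iters):
        # assign every feature to its nearest centroid
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for k in range(m):
            members = features[labels == k]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    return centroids

def build_anchor_matrix(features_by_emotion, m):
    """Aggregate the centroids of every emotion category into one
    feature matrix of emotion anchors (shape: n_categories * m, dim)."""
    return np.concatenate([kmeans(f, m) for f in features_by_emotion], axis=0)
```

In practice a library implementation such as scikit-learn's KMeans would replace the loop above; the point is only that each emotion category is clustered separately and the centroids are stacked into one anchor matrix.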
S14: and selecting a centroid vector of the cluster as an emotion anchor point, and establishing a plurality of emotion categories by taking the emotion anchor point as a center.
Wherein, step S14 further comprises:
s141: and calculating the matrix product of the image representation and the emotion anchor point to be used as an emotion pseudo tag of the emotion category.
In the training stage, the ResNet teacher model is first used to extract a feature representation for each new expression-package image; the matrix product of this representation and the emotion anchors is then taken as the emotion pseudo tag and normalized with a softmax function. To realize emotion knowledge distillation, the KL divergence between the output of the image encoder and the emotion pseudo tag is minimized.
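A minimal NumPy sketch of the pseudo-label and distillation computation just described; the function names and the direction of the KL divergence are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def emotion_pseudo_label(teacher_feat, anchors):
    """Emotion pseudo tag: matrix product of the teacher's feature
    representation with the emotion anchors, softmax-normalized."""
    return softmax(teacher_feat @ anchors.T)

def kd_kl_loss(student_logits, pseudo_label, eps=1e-12):
    """KL divergence between the pseudo tag and the image encoder's
    output distribution, minimized for emotion knowledge distillation."""
    q = softmax(student_logits)
    return float(np.sum(pseudo_label * (np.log(pseudo_label + eps) - np.log(q + eps))))
```

The loss is zero when the image encoder's distribution matches the pseudo tag and grows as they diverge, which is what drives the distillation.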
S2: comparing and learning different image characterizations in the same emotion category, and comparing and learning the image characterizations in different emotion categories to obtain an image sample of the expression package with enhanced local characteristics;
In this stage, the invention proposes an intra-topic semantic contrastive learning strategy that pulls together image representations under the same topic, and an inter-topic semantic contrastive learning strategy that enhances the diversity of image representations across different topics.
In step S2, the optimization target for performing contrast learning on different image characterizations in the same emotion category can be written in standard InfoNCE form as:

$$\mathcal{L}_{intra} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\left(\mathrm{sim}(v_i^{(1)}, v_i^{(2)})/\tau\right)}{\sum_{j=1}^{N}\exp\left(\mathrm{sim}(v_i^{(1)}, v_j^{(2)})/\tau\right)}$$

wherein $\mathcal{L}_{intra}$ is the optimization target for contrast learning of different image characterizations in the same emotion category, $N$ is the total number of image samples, $i$ is the first index value of the image sample, $j$ is the second index value of the image sample, $v_i^{(1)}$ is the [CLS] token encoding of the $i$-th image sample after the first parallel data enhancement, $v_i^{(2)}$ is the [CLS] token encoding of the $i$-th image sample after the second parallel data enhancement, $v_j^{(2)}$ is the [CLS] token encoding of the $j$-th image sample after the second parallel data enhancement, $\mathrm{sim}(\cdot,\cdot)$ is a similarity function, and $\tau$ is a temperature coefficient;
For intra-topic semantic contrastive learning, the invention uses contrastive learning to improve the model's ability to capture local emotion features under the same topic.
For an input expression package image, executing two parallel data enhancement processes adopted by the data enhancement, wherein the enhancement processes are as follows: (1) random scaling and cropping of the image; (2) Randomly changing brightness, tone and contrast of the image; (3) randomly converting the image into a gray scale map; (4) randomly applying gaussian blur to the image; (5) random horizontal flipping.
After the two enhanced images are obtained, they are sent to the image encoder to obtain two augmented outputs. The features at each global [CLS] position are then projected into the semantic representation space using a nonlinear transformation, i.e., a multi-layer perceptron with a ReLU activation function. Other images within the same topic are used as negative samples, and the learning objective is defined as above.
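The contrastive objective over the two augmented views follows the standard InfoNCE pattern; a minimal NumPy sketch, in which the temperature value and function name are assumptions:

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE over two augmented views: z1[i] and z2[i] are the projected
    [CLS] features of the two augmentations of sample i; every other
    sample in the batch serves as a negative."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau  # pairwise cosine similarities, scaled
    m = logits.max(axis=1, keepdims=True)  # stabilize the log-sum-exp
    log_prob = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    return float(-np.mean(np.diag(log_prob)))  # positives lie on the diagonal
```

The loss is small when the two views of each sample agree and all other pairs disagree, which is exactly the pulling/pushing behavior the strategy describes.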
In step S2, the optimization target for performing contrast learning on the image characterizations of different emotion categories takes the same InfoNCE form over a group of samples:

$$\mathcal{L}_{inter} = -\frac{1}{K}\sum_{i=1}^{K}\log\frac{\exp\left(\mathrm{sim}(v_i^{(1)}, v_i^{(2)})/\tau\right)}{\sum_{j=1}^{K}\exp\left(\mathrm{sim}(v_i^{(1)}, v_j^{(2)})/\tau\right)}$$

wherein $\mathcal{L}_{inter}$ is the optimization objective for contrast learning of image characterizations of different emotion categories, $K$ is the number of samples in the emotion category, and the negative samples indexed by $j$ are drawn from other topics.
For inter-topic semantic contrastive learning: the SRS task requires the expression-package representations learned by the model to be not only discriminative but also diverse, so the invention additionally introduces a contrastive learning strategy between different topics. Unlike intra-topic contrastive learning, the negative samples are selected from other samples in the same group across topics, which ensures that the negatives come from other topics.
S3: coding the dialogue sample and the image sample, extracting initial characteristics of the coded data, and aligning the obtained image characteristics with the text characteristics to obtain multi-mode characteristics;
In this stage, image features are extracted first. The specific process is as follows: the invention uses three backbone networks as image encoders, namely Inceptionv, ResNet and ViT. For Inceptionv and ResNet, the feature map after global pooling is used as the image representation of the expression package and is flattened into a one-dimensional sequence. When the ViT model is used as the image encoder, the input image is first cut into P x P image blocks, flattened into a sequence, and then converted into input visual embeddings using a linear transformation; the final expression-package visual features are obtained through a series of self-attention encoding layers.
Next, text features are extracted. The specific process is as follows: for the language modality, four different backbone networks are used to obtain dialogue representations, namely LSTM, GRU, Transformer and BERT. For each dialogue context history, the sentences are spliced together and a [SEP] tag is inserted between different sentences to mark their separation, after which the text representation is extracted.
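The dialogue-history splicing just described amounts to joining utterances with a separator token before encoding; a trivial sketch (the helper name is an assumption):

```python
def splice_dialog(utterances):
    """Concatenate a multi-turn dialogue history, inserting a [SEP]
    tag between consecutive sentences before text encoding."""
    return " [SEP] ".join(utterances)
```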
In order to obtain compact multi-modal features, the invention adopts the ALBEF approach of aligning features with contrastive learning before fusing the visual and textual features, which requires mapping the features of the visual and language modalities into a common embedding space.
Then the image and text features are aligned. For image-to-text alignment, assuming K samples in a group, the positive sample in the i-th pair is combined with the i-th dialogue as a positive sample pair, and the remaining K-1 dialogues are combined with that positive sample as negative sample pairs; the model is optimized with an InfoNCE loss, denoted L_align, over the image-text features at this stage.
For text-to-image alignment, the negative samples of the text do not come from other expression packages in the same group, but from the negative expression packages selected for the candidate set; a loss denoted L_neg is used for optimization.
Next, multi-modal feature extraction is performed. The specific process is: the last six layers of BERT are used as the multi-modal encoder, which differs from the text encoder in that a cross-attention mechanism (CA) is applied in each layer; it is similar to multi-head self-attention, but the Key and Value features come from the visual modality while the Query features come from the text modality.
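The cross-attention just described can be sketched in single-head form; the weight shapes and names are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_feats, image_feats, wq, wk, wv):
    """Single-head cross attention: the Query comes from the text
    modality, while Key and Value come from the visual modality."""
    q = text_feats @ wq                    # (L_text, d)
    k = image_feats @ wk                   # (L_img, d)
    v = image_feats @ wv                   # (L_img, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (L_text, L_img)
    return attn @ v                        # each text token mixes image values
```

The output keeps the text sequence length, so the multi-modal encoder can continue processing it like an ordinary text sequence that has absorbed visual information.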
Finally, the features at the [CLS] position of the multi-modal encoder output are used to predict the matching score between each expression package and dialogue; the invention models this as a binary classification task and optimizes it with a cross-entropy loss L_ce.
S4: obtaining a matching score between each image sample and each dialogue sample by the multi-modal feature calculation, and selecting a corresponding dialogue sample and an image sample based on the matching score to obtain a positive sample pair;
selecting a plurality of non-matching image samples except the matching image samples for a single dialogue sample, and selecting a plurality of non-matching dialogue samples except the matching dialogue samples for the single image sample to obtain a negative sample pair;
The invention models the expression-package retrieval task as a binary classification problem. A positive sample is a dialogue and expression-package image pair matched in the original data. Negative samples are divided into two types: the first type samples other dialogues together with the current expression-package image from the same training batch to form "other dialogue-expression package image" pairs, and the second type samples other candidate images together with the current dialogue from the same expression-package topic to form "dialogue-other expression package image" pairs.
Wherein, step S4 further comprises:
S41: and calculating emotion polarity scores of the dialogue samples through the introduced text emotion analysis library, and obtaining text difficult samples based on the matching scores and the emotion polarity scores.
In step S4, when the non-matching image sample is selected, a normalized value of a sum of semantic similarity and emotion similarity of the image sample is used as a sampling probability.
For each dialogue, the Chinese text emotion analysis library cnsenti is used to extract the emotion polarity scores of each utterance, denoted PT and NT respectively; the invention sums the positive and negative emotion polarity scores of each sentence as the score of the dialogue.
The semantic similarity between the image and the text mode can be obtained from the image-text alignment process, then the negative sample is selected according to the semantic similarity score and emotion polarity score of the expression package and the dialogue, and if the semantic similarity between the samples is large and the emotion polarity difference is small, the samples are defined as difficult samples.
And adding the positive emotion scores and the negative emotion scores of each sentence to obtain the overall emotion score of the sentence, and normalizing PT and NT by using the overall emotion score so as to facilitate the subsequent calculation of the cross-modal emotion polarity difference.
The positive score PI and the negative score PN of the image are initialized to 0. Traversing the expression package's emotion distribution, the scores at the indices of the positive emotions, namely surprise and happiness, are added to PI, and the scores at the indices of the negative emotions, namely disgust, fear, sadness and anger, are added to PN; PI and PN are then normalized.
Then the image emotion differences between all samples in the same group are computed, namely the positive-score difference and the negative-score difference between every two images. Since the sum of the positive and negative score differences of two images is at most 2, the final emotion similarity between two images is computed as 2 minus the absolute values of the polarity differences. Finally, the sum of each image's semantic similarity and emotion similarity is normalized with a softmax function and used as the negative-sampling probability.
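The negative-sampling probability just described can be sketched as follows; the variable names, and including the anchor image among the candidates, are simplifications for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sampling_probs(sem_sim, pi, pn, idx):
    """Hard-negative sampling distribution for the image at position idx.
    sem_sim: semantic similarity of each candidate image to the dialogue;
    pi / pn: normalized positive / negative emotion scores per image.
    Emotion similarity is 2 minus the absolute polarity gaps (max gap sum is 2).
    In practice the matched image itself would be excluded from the draw."""
    emo_sim = 2.0 - np.abs(pi - pi[idx]) - np.abs(pn - pn[idx])
    return softmax(sem_sim + emo_sim)
```

Candidates that are semantically close to the dialogue and emotionally close to the matched image receive the highest probability, which is exactly the "difficult sample" criterion above.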
S5: and carrying out optimization training based on the positive sample pair and the negative sample pair to obtain a retrieval model, and carrying out expression package retrieval on the dialogue to be retrieved through the retrieval model to obtain a retrieval result.
In step S5, the overall optimization target of the retrieval model is the sum of the individual losses:

$$\mathcal{L} = \mathcal{L}_{align} + \mathcal{L}_{neg} + \mathcal{L}_{ce} + \lambda_1\mathcal{L}_{KD} + \lambda_2\mathcal{L}_{intra} + \lambda_3\mathcal{L}_{inter}$$

wherein $\mathcal{L}$ is the overall optimization objective of the retrieval model, $\mathcal{L}_{align}$ is the loss for aligning image features and text features, $\mathcal{L}_{neg}$ is the loss for negative samples from different groups during image-text alignment, $\mathcal{L}_{ce}$ is the cross-entropy loss for obtaining the multi-modal features, $\lambda_1$ is the first balance coefficient, $\mathcal{L}_{KD}$ is the KL divergence between the encoder output for expression packages in the emotion recognition data set and the emotion pseudo tags, $\lambda_2$ is the second balance coefficient, and $\lambda_3$ is the third balance coefficient.
Further, this stage is a training stage, and the overall objective function of the training stage is the sum of the objective functions set forth in each section.
The experimental verification of the dialogue image retrieval method based on cross-modal emotion interaction provided by the invention is described below.
TABLE 1 results of comparative experiments of the method of the invention with other models
As shown in Table 1, the results of the method of the present invention are compared with those of other models.
For a fair comparison, the invention divides the different methods according to the backbone networks of the image and text encoders. Meanwhile, to analyze the influence of external knowledge (namely SER30K and cnsenti), the invention additionally compares a PBR variant with the EKD and PHSM modules removed against the complete PBR method. Compared with methods designed specifically for visual question answering (namely SYNERGISTIC and PSAC), the PBR method designs dedicated modules for the SRS task, realizes the emotion association between dialogue and expression package, and thus obtains better performance. SMN, DAM, MRFN and LSTUR are representative methods from the recommendation field that focus more on modeling the semantic information contained in text.
However, the dialogue-based expression-package retrieval task requires not only reasoning about dialogue semantics but also sensing cross-modal emotion associations, which limits their experimental results. SRS is a method designed specifically for the dialogue-based expression-package image retrieval task; it achieves 70.9% MAP and 59.0% R10@1. PESRS is an extended version of SRS that additionally memorizes the expression-package images a user preferred to send in historical dialogues; it achieves 74.3% MAP and 63.2% R10@1.
However, these methods ignore the role of emotion information. The PBR method uses a more concise framework to extract emotion information from both expression packages and dialogues, thereby significantly improving model performance: compared with PESRS, MAP improves by 4.9% and R10@1 by 6.1%.
In addition, PBR achieves superior performance compared with other methods using the same backbone network: averaged over the five backbone-network combinations, the complete PBR improves MAP by 2.32% and R10@1 by 2.86% over the ablated variant, indicating that emotion information is critical for SRS and that the PBR method generalizes well across different networks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A dialogue image retrieval method based on cross-modal emotion interaction is characterized by comprising the following steps:
S1: introducing an emotion recognition data set, and clustering expression packages in the emotion recognition data set to obtain a plurality of emotion categories;
S2: comparing and learning different image characterizations in the same emotion category, and comparing and learning the image characterizations in different emotion categories to obtain an image sample of the expression package with enhanced local characteristics;
S3: encoding the dialogue samples and the image samples, extracting initial features from the encoded data, and aligning the obtained image features with the text features to obtain multi-modal features;
S4: calculating a matching score between each image sample and each dialogue sample from the multi-modal features, and selecting corresponding dialogue samples and image samples based on the matching scores to obtain positive sample pairs;
selecting, for a single dialogue sample, a plurality of non-matching image samples other than the matched image sample, and selecting, for a single image sample, a plurality of non-matching dialogue samples other than the matched dialogue sample, to obtain negative sample pairs;
S5: carrying out optimization training based on the positive sample pairs and the negative sample pairs to obtain a retrieval model, and performing expression package retrieval on the dialogue to be retrieved through the retrieval model to obtain a retrieval result.
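Step S4's pair construction could be sketched roughly as below. The function name, the argmax-as-match rule, and the uniform choice among non-matching candidates are assumptions of this sketch (claim 7 describes a similarity-weighted sampling instead).

```python
import numpy as np

def build_pairs(scores, num_neg=4, rng=None):
    """From a (dialogues x images) matching-score matrix, take the
    best-matching image as the positive for each dialogue and draw
    num_neg non-matching images as negatives."""
    rng = np.random.default_rng(rng)
    pos = scores.argmax(axis=1)            # matched image per dialogue
    negs = []
    for p in pos:
        candidates = np.delete(np.arange(scores.shape[1]), p)  # exclude match
        negs.append(rng.choice(candidates, size=num_neg, replace=False))
    return pos, np.array(negs)
```

Each dialogue thus yields one positive pair and `num_neg` negative pairs for the optimization in step S5.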
2. The method for searching dialogue images based on cross-modal emotion interaction according to claim 1, wherein step S1 further comprises:
S11: selecting a ResNet model to train on the introduced emotion recognition data set to obtain a teacher model;
S12: extracting features from the emotion recognition data set through the teacher model to obtain a plurality of image representations;
S13: K-Means clustering is carried out on the image representation based on emotion categories, and a plurality of clusters are obtained;
S14: selecting the centroid vector of each cluster as an emotion anchor point, and establishing a plurality of emotion categories with the emotion anchor points as centers.
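Steps S13-S14 can be sketched as plain K-Means over the teacher's image representations, keeping the cluster centroids as emotion anchors. The farthest-point initialisation and the function name are choices of this sketch, not part of the claim.

```python
import numpy as np

def kmeans_anchors(feats, k, iters=20):
    """Cluster image representations with K-Means; return the k centroid
    vectors (the emotion anchors) and each sample's cluster label."""
    # Deterministic farthest-point initialisation for the k centers.
    centers = [feats[0]]
    for _ in range(1, k):
        d = np.min([((feats - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(feats[d.argmax()])
    centers = np.stack(centers).astype(float)
    labels = np.zeros(len(feats), dtype=int)
    for _ in range(iters):
        dists = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)      # assign each sample to nearest center
        for c in range(k):
            members = feats[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)  # centroid update
    return centers, labels
```

With k set to the number of emotion categories (seven per claim 3), the returned centroids serve as the anchors of step S14.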
3. The method of claim 2, wherein the emotion categories in step S13 include surprise, happiness, disgust, fear, sadness, anger, and neutral.
4. The method for searching dialogue images based on cross-modal emotion interaction according to claim 2, wherein step S14 further comprises:
S141: calculating the matrix product of the image representations and the emotion anchor points to be used as the emotion pseudo labels of the emotion categories.
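Claim 4's pseudo label (the matrix product of representations and anchors) could be sketched as follows; the softmax normalisation is an added assumption of this sketch, so that each row reads as a distribution over emotion categories.

```python
import numpy as np

def emotion_pseudo_labels(reps, anchors):
    """reps: (n, d) image representations; anchors: (k, d) emotion anchors.
    Similarity via matrix product, then softmax (an assumption) so each
    row is a soft pseudo label over the k emotion categories."""
    logits = reps @ anchors.T
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    return e / e.sum(axis=1, keepdims=True)
```

The resulting soft labels are what the KL-divergence term of the overall objective compares the encoder output against.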
5. The dialogue image retrieval method based on cross-modal emotion interaction according to claim 4, wherein in step S2 the optimization objective for contrastive learning between different image representations within the same emotion category is

$$\mathcal{L}_{intra}=-\sum_{i=1}^{N}\log\frac{\exp\left(z_{i}^{(1)}\cdot z_{i}^{(2)}\right)}{\sum_{j=1}^{N}\exp\left(z_{i}^{(1)}\cdot z_{j}^{(2)}\right)}$$

wherein $\mathcal{L}_{intra}$ is the optimization objective for contrastive learning between different image representations within the same emotion category, $N$ is the total number of image samples, $i$ is the first index value and $j$ the second index value of an image sample, $z_{i}^{(1)}$ is the [CLS] token encoding of the $i$-th image sample after the first parallel data enhancement, and $z_{i}^{(2)}$ and $z_{j}^{(2)}$ are the [CLS] token encodings of the $i$-th and $j$-th image samples after the second parallel data enhancement;
in step S2, the optimization objective for contrastive learning between image representations of different emotion categories is

$$\mathcal{L}_{inter}=-\sum_{i=1}^{N}\frac{1}{M}\sum_{j\in P(i)}\log\frac{\exp\left(z_{i}\cdot z_{j}\right)}{\sum_{k\neq i}\exp\left(z_{i}\cdot z_{k}\right)}$$

wherein $\mathcal{L}_{inter}$ is the optimization objective for contrastive learning between image representations of different emotion categories, $M$ is the number of samples in the emotion category, $P(i)$ denotes the samples sharing the emotion category of sample $i$, and $z_{i}$ is the representation of the $i$-th image sample.
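An InfoNCE-style sketch of the intra-category objective in claim 5, treating the i-th pair of [CLS] encodings from the two parallel augmentations as the positive and all other batch samples as negatives. The absence of a temperature parameter is a simplification of this sketch.

```python
import numpy as np

def intra_contrastive_loss(z1, z2):
    """z1, z2: (N, d) [CLS] encodings of the same N images under the first
    and second parallel data enhancements. The i-th diagonal entry is the
    positive pair; every other column is a negative."""
    sims = z1 @ z2.T                                  # pairwise similarities
    shifted = sims - sims.max(axis=1, keepdims=True)  # numerical stability
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                # mean over positives
```

The loss is lowest when each view of an image is most similar to its own other view, which is what pulls representations within a category together.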
6. The method for searching dialogue images based on cross-modal emotion interaction according to claim 1, wherein step S4 further comprises:
S41: calculating emotion polarity scores of the dialogue samples through an introduced text sentiment analysis library, and obtaining difficult text samples based on the matching scores and the emotion polarity scores.
7. The dialogue image retrieval method based on cross-modal emotion interaction according to claim 1, wherein in step S4, when the non-matching image sample is selected, a value obtained by normalizing a sum of semantic similarity and emotion similarity of the image sample is used as a sampling probability.
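Claim 7's sampling rule — the normalised sum of semantic similarity and emotion similarity as the sampling probability, with the matched image excluded — might look like this; the array shapes, function name, and exclusion-by-zeroing are assumptions of this sketch.

```python
import numpy as np

def negative_sampling_probs(sem_sim, emo_sim, matched_idx):
    """sem_sim, emo_sim: per-image similarity vectors for one dialogue.
    Returns a probability per candidate negative, proportional to the
    sum of the two similarities, normalised to 1, with the matched
    image excluded from the candidate pool."""
    score = (sem_sim + emo_sim).astype(float)
    score[matched_idx] = 0.0                 # never sample the true match
    return score / score.sum()
```

Sampling from this distribution prefers negatives that are semantically and emotionally close to the dialogue, i.e., hard negatives.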
8. The method for searching dialogue images based on cross-modal emotion interaction according to claim 5, wherein in step S5 the overall optimization objective of the retrieval model is

$$\mathcal{L}=\mathcal{L}_{align}+\lambda_{1}\mathcal{L}_{neg}+\lambda_{2}\mathcal{L}_{ce}+\lambda_{3}\mathcal{L}_{kl}$$

wherein $\mathcal{L}$ is the overall optimization objective of the retrieval model, $\mathcal{L}_{align}$ is the loss for aligning the image features and the text features, $\mathcal{L}_{neg}$ is the loss contributed by negative samples from different groups during image-text alignment, $\mathcal{L}_{ce}$ is the cross-entropy loss used in obtaining the multi-modal features, $\mathcal{L}_{kl}$ is the KL divergence between the encoder output of the expression packages in the emotion recognition data set and the emotion pseudo labels, and $\lambda_{1}$, $\lambda_{2}$ and $\lambda_{3}$ are the first, second and third balance coefficients.
CN202410158251.6A 2024-02-04 2024-02-04 A dialogue image retrieval method based on cross-modal emotional interaction Pending CN118093914A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410158251.6A CN118093914A (en) 2024-02-04 2024-02-04 A dialogue image retrieval method based on cross-modal emotional interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410158251.6A CN118093914A (en) 2024-02-04 2024-02-04 A dialogue image retrieval method based on cross-modal emotional interaction

Publications (1)

Publication Number Publication Date
CN118093914A true CN118093914A (en) 2024-05-28

Family

ID=91149924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410158251.6A Pending CN118093914A (en) 2024-02-04 2024-02-04 A dialogue image retrieval method based on cross-modal emotional interaction

Country Status (1)

Country Link
CN (1) CN118093914A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119323818A (en) * 2024-12-19 2025-01-17 浙江师范大学 Student emotion analysis method and system based on multi-mode dynamic memory big model


Similar Documents

Publication Publication Date Title
CN110298037B (en) Text Recognition Approach Based on Convolutional Neural Network Matching with Enhanced Attention Mechanism
CN113987187B (en) Public opinion text classification method, system, terminal and medium based on multi-label embedding
CN114519120B (en) Image searching method and device based on multi-modal algorithm
CN114020871B (en) Multimodal social media sentiment analysis method based on feature fusion
CN113469214B (en) Fake news detection method, device, electronic device and storage medium
CN116955699B (en) Video cross-mode search model training method, searching method and device
CN109165563B (en) Pedestrian re-identification method and apparatus, electronic device, storage medium, and program product
CN111382565A (en) Multi-label-based emotion-reason pair extraction method and system
CN112883732A (en) Method and device for identifying Chinese fine-grained named entities based on associative memory network
CN114416991B (en) A prompt-based text sentiment analysis method and system
CN117688936B (en) Low-rank multi-mode fusion emotion analysis method for graphic fusion
CN114626463A (en) Language model training method, text matching method and related device
CN114357167B (en) Bi-LSTM-GCN-based multi-label text classification method and system
CN114153971B (en) A device for correcting and identifying errors in Chinese text
CN114298035A (en) A text recognition desensitization method and system thereof
CN116562291A (en) A Chinese Nested Named Entity Recognition Method Based on Boundary Detection
CN115292461B (en) Human-computer interaction learning method and system based on speech recognition
CN118093914A (en) A dialogue image retrieval method based on cross-modal emotional interaction
CN116628207A (en) Training method and device for text classification model, electronic equipment and storage medium
CN113221553A (en) Text processing method, device and equipment and readable storage medium
CN113990420B (en) A method for named entity recognition in electronic medical records
CN116150367A (en) An aspect-based sentiment analysis method and system
CN115759262A (en) Visual commonsense reasoning method and system based on knowledge-aware attention network
CN119248924B (en) Emotion analysis method and device for promoting multi-mode information fusion
CN119886352A (en) Dialogue intention understanding and interpretation text generation method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination