Cross-modal matching method of cross-modal attention screening network based on dynamic routing
Technical Field
The invention relates to a cross-modal matching method, and belongs to the technical field of computer vision and natural language processing.
Background
As information technology and social media advance, diverse multimedia data are aggregated together. For computers to understand, match, and transform such cross-modal data, cross-modal retrieval, which indexes semantically related instances in one modality given a query from another, has become a fundamental technique. It has many applications in visual semantic navigation, visual question answering, image captioning, and the like.
The cross-modal matching task mainly mines semantic associations between images and text by mapping images and sentences into a suitable common space. Most early approaches constructed two sub-networks for the image and text modalities that interact in a common space to model the relationship between the two modalities. For example, Kiros et al. learn representations of images and sentences using a CNN and an LSTM and optimize the model with a triplet ranking loss. Faghri et al. propose a triplet ranking loss combined with hard negative samples and show significant improvements on cross-modal retrieval tasks. Although these pioneering studies have made great progress, they learn the global representation directly, ignoring fine-grained analysis.
Accordingly, more and more researchers are working to explore the fine-grained correspondence between regions in images and words in sentences for image-text matching. Karpathy et al. propose a deep fragment embedding network for bidirectional image-sentence mapping, aligning each fragment pair by extracting fragment features of each image and text. In addition, the bottom-up and top-down attention network proposes to describe an image with a set of salient image regions, each region represented by a convolutional feature vector. Subsequently, Lee et al. encode images as region-level features using a bottom-up attention network and design a stacked cross-attention network that infers image-text matches by attending to region-related words or word-related regions.
Meanwhile, the introduction of external modules improves retrieval results for cross-modal matching. For example, Liu et al. propose a graph structured matching network that explicitly models objects, relationships, and attributes as phrases for inferring fine-grained correspondences. Wang et al. learn consensus-aware concept representations from an external corpus to further strengthen the semantic relationship between image and text. With the success of Transformers in vision and language, researchers have proposed universal encoders aimed at cross-modal retrieval that learn a better joint visual-linguistic representation in a pre-trained manner.
In addition, most currently popular deep learning models perform static inference, with network parameters fixed after training, which limits their representation capability, efficiency, and interpretability. Dynamic networks have advantages in efficiency, compatibility, and adaptability over traditional static network architectures by adapting their structure or parameters to different inputs. In particular, early dynamic approaches aimed to achieve network compression by pruning neurons or skipping layers. For example, Chen et al.'s dynamic region-aware convolution uses a learnable guide to apply channel-level filters along the spatial dimension, which not only improves the representation ability of the convolution but also keeps the computational cost at the level of standard convolution. In recent years, some researchers have designed different dynamic routes for multi-branch or tree-structured networks and routed dynamically within the network, adapting the computational graph to each sample. Li et al. propose a soft conditional gate that dynamically selects the scale transformation path for semantic segmentation, adapting to the scale distribution of each image.
While conventional approaches have made great progress, these efforts largely rely on hand-crafted settings that are not always optimal for a particular purpose, such as the number of regions per image. We therefore construct a cross-modal attention screening network with dynamic routing that automatically configures an appropriate number of regions for each input image. The network gains the ability to decide the number of regions while reducing redundant computation. In addition, the invention designs a novel cross-modal screening module, which retains meaningful interaction characteristics by filtering irrelevant information, suppresses meaningless alignment interference, and adaptively adjusts global and local dependencies.
Disclosure of Invention
The invention aims to solve the problem that most cross-modal matching methods rely on expert experience to detect a fixed number of regions in every image, lacking flexibility in selecting the number of image regions.
The technical scheme adopted for solving the technical problems is as follows:
S1, constructing a dynamic router that selects corresponding attention area blocks according to the complexity of the image, realizing the capability to decide the number of regions.
S2, designing a dynamic routing attention module that combines the dynamic router of S1 and addresses parameter redundancy and computational cost by constructing different adjacency masks for the defined numbers of attention areas.
S3, constructing a cross-modal screening module that retains meaningful interaction characteristics, filters irrelevant information, suppresses the interference of meaningless image-text pairs, and learns semantic relations between images and texts.
S4, combining the modules of S2 and S3 to construct the overall framework of the cross-modal matching method of the cross-modal attention screening network based on dynamic routing.
S5, training the cross-modal attention screening network based on dynamic routing for cross-modal matching.
To achieve dynamic selection for each image, we consider a network of multiple block structures, each block being provided with differently configured modules. Specifically, given an image feature V ∈ R d×m, the routing space can be defined as A = [A0,...,Ab], where b represents the number of attention area blocks and d is the dimension of the image feature, and the routed feature can be defined as:
Vr = Σi αi Ai(V) (1)
where αi is the selection probability predicted by the router for unit Ai, and A is the set of unit operations. A base unit and a corresponding route are designed within each node to select particular regional characteristics. For each image we design 3 candidate region blocks with different numbers of regions, each selecting the top g regions with the highest confidence scores. The image features are used for feature transformations inside the unit and inside the route.
The proposed soft router can be seen as a block decision process, generating the routing probability of each path according to the global meaning of the image. Given the image features V ∈ R d×m, the selection probability α ∈ R b of each attention block can be derived by:
V*=softmax(FC2(V))V (2)
α=relu(tanh(FC1(V*))) (3)
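As an illustration, the soft router of Eqs. (2)-(3) can be sketched in PyTorch as follows; reading Eq. (2) as attention pooling over regions, and the layer shapes, are assumptions rather than the exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftRouter(nn.Module):
    """Sketch of the soft router: pool region features into a global
    summary V*, then predict a selection probability per attention block."""
    def __init__(self, dim=1024, num_blocks=3):
        super().__init__()
        self.fc2 = nn.Linear(dim, 1)           # scores each region for pooling
        self.fc1 = nn.Linear(dim, num_blocks)  # maps V* to block probabilities

    def forward(self, V):                      # V: (batch, m, dim) region features
        # V* = softmax(FC2(V)) V : attention-pooled global image feature (Eq. 2)
        w = F.softmax(self.fc2(V), dim=1)      # (batch, m, 1)
        v_star = (w * V).sum(dim=1)            # (batch, dim)
        # alpha = relu(tanh(FC1(V*))) (Eq. 3)
        return F.relu(torch.tanh(self.fc1(v_star)))

router = SoftRouter(dim=1024, num_blocks=3)
alpha = router(torch.randn(2, 36, 1024))       # 2 images, 36 regions each
```

The non-negative vector alpha then weights the candidate attention blocks, so paths with zero probability can be skipped at inference time.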
FC denotes a fully connected layer. Meanwhile, in the dynamic routing attention module, we employ self-attention and identity mapping as the implementation of each unit. However, the dot-product operations of self-attention result in expensive computation and significant memory usage, so reducing the amount of calculation is important. Thus, we adjust the number of attention areas per input image, which is a key distinction from existing self-attention based methods. By re-examining the definition of standard self-attention, the regional attention weight can be derived as:
ηij = softmax((Wq vi)T(Wk vj)/√d) (4)
where ηij measures the effect of the j-th position on the i-th position, m represents the number of regions in the image, and Η = [ηij] can be regarded as a fully connected graph between the different regions of an image.
To obtain the characteristics of different regions of interest, we need to limit the region connections of each input image. By introducing an adjacency mask M ∈ R m×m, we obtain new region attention weights:
η*ij = Mij ⊙ ηij (5)
where M is binary: an entry is set to 1 when the position lies within the attention area of the target element, and 0 otherwise. Attention operations are thus limited to a certain number of image regions to explore intra-modal semantic relationships. Finally, the output of the router attention module is given as:
V' = η*V + V (6)
where the identity term preserves the input features. Thus the number of regions in the image is limited, which greatly reduces the computational complexity and the errors caused by data redundancy.
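A minimal sketch of the masked region attention described above, assuming scaled dot-product attention restricted to the top-g regions by detector confidence; dropping the query/key projections and taking the confidence scores as an input are simplifying assumptions:

```python
import torch
import torch.nn.functional as F

def masked_region_attention(V, conf, g):
    """V:    (batch, m, d) region features
    conf: (batch, m) per-region confidence scores (assumed given by the detector)
    g:    number of regions kept by this attention block
    The binary adjacency mask M confines attention to the top-g regions."""
    b, m, d = V.shape
    topg = conf.topk(g, dim=1).indices              # indices of kept regions
    keep = torch.zeros(b, m, dtype=torch.bool)
    keep.scatter_(1, topg, True)
    M = keep.unsqueeze(1) & keep.unsqueeze(2)       # (b, m, m) adjacency mask
    # eta_ij = (v_i . v_j) / sqrt(d), restricted by M, then row-wise softmax
    eta = V @ V.transpose(1, 2) / d ** 0.5
    eta = eta.masked_fill(~M, float('-inf'))
    attn = F.softmax(eta, dim=-1).nan_to_num(0.0)   # rows fully masked -> 0
    return attn @ V + V                             # attention + identity mapping

out = masked_region_attention(torch.randn(2, 36, 64), torch.rand(2, 36), g=12)
```

Because attention rows outside the mask are zeroed, unselected regions pass through unchanged via the identity branch.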
To better aggregate shared semantics and bridge the semantic gap between vision and text, a cross-modal screening module is proposed to suppress meaningless alignment interference and further reduce the overall time consumption of the model. In particular, given the local and global feature representations, we compute the concatenated composite feature representations as:
X*=cat((X,duplicate(Xg)),1) (7)
E*=cat((E,duplicate(Eg)),1) (8)
To ensure that Xg (Eg) can be concatenated with X (E), Xg (Eg) is expanded into a matrix by duplicating it into multiple rows. We then calculate the shared semantic representation between each pair by a cosine similarity function:
sij = xiT ej / (‖xi‖ ‖ej‖) (9)
where sij represents the correlation between the i-th region and the j-th word. Given the image query X*, the attention weight of each region over the text set is computed as:
αij = exp(λ sij) / Σj exp(λ sij) (10)
where λ is a factor controlling the smoothness of the attention distribution. The text attention feature Le is derived by a weighted combination of word features:
Le = Σj αij ej (11)
Similarly, given a text query, the attention weight of each word over the image set is:
βij = exp(λ sij) / Σi exp(λ sij) (12)
The image-level attention feature is derived from a weighted combination of image region features:
Lx = Σi βij xi (13)
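The cross-attention step above can be sketched as follows; the value of λ and the use of row/column softmax over cosine similarities are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def cross_attention(X, E, lam=9.0):
    """X: (n_regions, d) image features; E: (n_words, d) word features.
    Returns the text-attended feature per region and the image-attended
    feature per word. lam is the smoothing factor lambda (assumed value)."""
    # cosine similarity between the i-th region and the j-th word
    s = F.normalize(X, dim=-1) @ F.normalize(E, dim=-1).t()  # (n_regions, n_words)
    # region -> word attention: softmax over words, smoothed by lambda
    a = F.softmax(lam * s, dim=1)
    L_e = a @ E                          # text attention feature per region
    # word -> region attention: softmax over regions
    b = F.softmax(lam * s, dim=0).t()    # (n_words, n_regions)
    L_x = b @ X                          # image attention feature per word
    return L_e, L_x

L_e, L_x = cross_attention(torch.randn(36, 1024), torch.randn(12, 1024))
```

Each region thus receives a summary of the words it aligns with, and vice versa, before the screening step filters the interaction.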
Then, the attended semantic features L (image-level Lx / text-level Le) are first mapped to generate a filter vector Lα and a reset vector Lβ as follows:
Lα=tanh(MLP(L)) (14)
Lβ=MLP(L) (15)
The reset feature is then obtained by:
R=Relu(Q⊙Lα+Lβ) (16)
where Q = E* or X*. The similarity score of the image and text can finally be calculated by the following formula:
S=sigmoid(FC(μRx+εRe)) (17)
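The gating of Eqs. (14)-(17) might be sketched as below; the MLP shapes, the pairing of each query with its attended counterpart, and the mean pooling before the final FC are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class CrossModalFilter(nn.Module):
    """Sketch of the screening step: the attended feature L is mapped to a
    filter vector L_alpha and a reset vector L_beta that gate the query Q."""
    def __init__(self, dim=1024):
        super().__init__()
        self.mlp_a = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())  # L_alpha (Eq. 14)
        self.mlp_b = nn.Linear(dim, dim)                            # L_beta  (Eq. 15)
        self.fc = nn.Linear(dim, 1)

    def reset(self, Q, L):
        # R = relu(Q * L_alpha + L_beta) (Eq. 16)
        return torch.relu(Q * self.mlp_a(L) + self.mlp_b(L))

    def forward(self, X_star, E_star, L_e, L_x, mu=1.0, eps=1.0):
        R_x = self.reset(X_star, L_e)   # gate regions with attended words
        R_e = self.reset(E_star, L_x)   # gate words with attended regions
        # Eq. 17, with mean pooling over rows assumed to align dimensions
        return torch.sigmoid(self.fc(mu * R_x.mean(0) + eps * R_e.mean(0)))

f = CrossModalFilter(dim=64)
score = f(torch.randn(36, 64), torch.randn(12, 64),
          torch.randn(36, 64), torch.randn(12, 64))
```

Features whose filter response is small are effectively zeroed by the ReLU, which is how meaningless region-word pairs are suppressed.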
The cross-modal matching method of the cross-modal attention screening network based on dynamic routing comprises a dynamic routing attention module and a cross-modal screening module, which together form the cross-modal attention screening network based on dynamic routing.
Finally, the implementation details of the cross-modal matching method of the cross-modal attention screening network based on dynamic routing are as follows:
Our experiments were conducted with the PyTorch framework on a single NVIDIA TESLA P GPU. For each image we used a Faster R-CNN model with ResNet-101 for region detection and feature extraction, selecting the top k=36 region features ranked by confidence score, with 1024-dimensional features per region. For each text we choose the BERT-Base model with 12 layers, 768 hidden units, and 12 heads, producing the original 768-dimensional word embeddings. The latent common embedding dimension is set to 1024. Models were trained for 20 and 30 epochs on the MSCOCO and Flickr30k datasets, respectively, using the Adam optimizer. The learning rate was initially set to 5e-9, decreased 10-fold every 10 or 15 epochs, respectively. The margin parameter α was set to 0.2, the mini-batch size was 64, and the gradient clipping threshold was 2.0.
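For illustration, the stated optimization settings might be configured as below; the model is a placeholder standing in for the full network, and only the hyperparameter values are taken from the description above:

```python
import torch

model = torch.nn.Linear(1024, 1024)   # placeholder for the full network
optimizer = torch.optim.Adam(model.parameters(), lr=5e-9)
# 10-fold learning-rate decay every 10 epochs (every 15 for Flickr30k)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
margin = 0.2          # margin parameter of the ranking loss
batch_size = 64
clip_threshold = 2.0  # used with torch.nn.utils.clip_grad_norm_

for epoch in range(2):    # 20 epochs on MSCOCO, 30 on Flickr30k in the text
    optimizer.step()      # placeholder for one epoch of training steps
    scheduler.step()
```
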
Compared with the prior art, the invention has the beneficial effects that:
1. The invention provides a cross-modal matching method of a cross-modal attention screening network based on dynamic routing. To the best of our knowledge, this is the first effort to move beyond a fixed number of regions in image-text matching, addressing the information redundancy created by using 36 visual regions as image feature inputs.
2. A dynamic routing attention module is provided, which selects the number of regions according to the complexity of the image, and reduces redundant calculation to reduce the calculation cost of the model.
3. The invention designs a cross-modal screening module, which dynamically suppresses the interference of meaningless region-word pairs in the interaction information and adjusts the global and local dependency relationships.
Drawings
Fig. 1 is a schematic diagram of a cross-modal matching method of a cross-modal attention screening network based on dynamic routing.
Fig. 2 is a schematic diagram of a model of a dynamic route attention module.
Fig. 3 is a schematic diagram of a model of a cross-modal screening module.
Fig. 4 and 5 are graphs comparing the results of the cross-modal matching method of the cross-modal attention screening network based on dynamic routing with the cross-modal matching method of other networks on MSCOCO-Flickr30K and MSCOCO-5K datasets, respectively.
Fig. 6 and 7 are graphs of visual results of a cross-modal matching method of a cross-modal attention screening network based on dynamic routing in image matching text and text matching image directions.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent.
The invention is further illustrated in the following figures and examples.
Fig. 1 is a schematic diagram of a cross-modal matching method of a cross-modal attention screening network based on dynamic routing. As shown in fig. 1, the image and sentence are first encoded into feature representations. Then an intra-modal attention module configured with a dynamic router selects the corresponding attention area blocks according to the complexity of the image to capture complex intra-modal relationships. Finally, a cross-modal screening module fused with global information is constructed, which retains meaningful interaction characteristics while filtering irrelevant information, eliminates meaningless alignment interference, and learns more accurate semantic relations between images and texts.
Fig. 2 is a schematic diagram of a model of a dynamic routing attention module. As shown in fig. 2, to achieve dynamic selection for each image, we consider a network of multiple block structures, each block being provided with differently configured modules. Specifically, given an image feature V ∈ R d×m, the routing space can be defined as A = [A0,...,Ab], where b represents the number of attention area blocks and d is the dimension of the image feature, and the routed feature can be defined as:
Vr = Σi αi Ai(V) (18)
where αi is the selection probability predicted by the router for unit Ai, and A is the set of unit operations. A base unit and a corresponding route are designed within each node to select particular regional characteristics. For each image we design 3 candidate region blocks with different numbers of regions, each selecting the top g regions with the highest confidence scores. The image features are used for feature transformations inside the unit and inside the route.
The proposed soft router can be seen as a block decision process, generating the routing probability of each path according to the global meaning of the image. Given the image features V ∈ R d×m, the selection probability α ∈ R b of each attention block can be derived by:
V*=softmax(FC2(V))V (19)
α=relu(tanh(FC1(V*))) (20)
FC denotes a fully connected layer. Meanwhile, in the dynamic routing attention module, we employ self-attention and identity mapping as the implementation of each unit. However, the dot-product operations of self-attention result in expensive computation and significant memory usage, so reducing the amount of calculation is important. Thus, we adjust the number of attention areas per input image, which is a key distinction from existing self-attention based methods. By re-examining the definition of standard self-attention, the regional attention weight can be derived as:
ηij = softmax((Wq vi)T(Wk vj)/√d) (21)
where ηij measures the effect of the j-th position on the i-th position, m represents the number of regions in the image, and Η = [ηij] can be regarded as a fully connected graph between the different regions of an image.
To obtain the characteristics of different regions of interest, we need to limit the region connections of each input image. By introducing an adjacency mask M ∈ R m×m, we obtain new region attention weights:
η*ij = Mij ⊙ ηij (22)
where M is binary: an entry is set to 1 when the position lies within the attention area of the target element, and 0 otherwise. Attention operations are thus limited to a certain number of image regions to explore intra-modal semantic relationships. Finally, the output of the router attention module is given as:
V' = η*V + V (23)
where the identity term preserves the input features. Thus the number of regions in the image is limited, which greatly reduces the computational complexity and the errors caused by data redundancy.
Fig. 3 is a schematic diagram of a model of a cross-modal screening module. As shown in fig. 3, to better aggregate shared semantics and bridge the semantic gap between vision and text, a cross-modal screening module is proposed to suppress meaningless alignment interference and further reduce the overall time consumption of the model. In particular, given the local and global feature representations, we compute the concatenated composite feature representations as:
X*=cat((X,duplicate(Xg)),1) (24)
E*=cat((E,duplicate(Eg)),1) (25)
To ensure that Xg (Eg) can be concatenated with X (E), Xg (Eg) is expanded into a matrix by duplicating it into multiple rows. We then calculate the shared semantic representation between each pair by a cosine similarity function:
sij = xiT ej / (‖xi‖ ‖ej‖) (26)
where sij represents the correlation between the i-th region and the j-th word. Given the image query X*, the attention weight of each region over the text set is computed as:
αij = exp(λ sij) / Σj exp(λ sij) (27)
where λ is a factor controlling the smoothness of the attention distribution. The text attention feature Le is derived by a weighted combination of word features:
Le = Σj αij ej (28)
Similarly, given a text query, the attention weight of each word over the image set is:
βij = exp(λ sij) / Σi exp(λ sij) (29)
The image-level attention feature is derived from a weighted combination of image region features:
Lx = Σi βij xi (30)
Then, the attended semantic features L (image-level Lx / text-level Le) are first mapped to generate a filter vector Lα and a reset vector Lβ as follows:
Lα=tanh(MLP(L)) (31)
Lβ=MLP(L) (32)
The reset feature is then obtained by:
R=Relu(Q⊙Lα+Lβ) (33)
where Q = E* or X*. The similarity score of the image and text can finally be calculated by the following formula:
S=sigmoid(FC(μRx+εRe)) (34)
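To show how the stages fit together, a deliberately simplified forward pass is sketched below; the routing heuristic, region selection, and scoring are placeholders for illustration only, not the invention's actual computations:

```python
import torch

def match_score(V, E):
    """V: (m, d) image region features; E: (n, d) word features.
    Returns a scalar similarity in (0, 1), mirroring the flow route ->
    intra-modal attention -> cross-modal interaction -> score."""
    # 1. route: decide how many regions to keep (placeholder heuristic)
    g = max(1, V.shape[0] // 2)
    # 2. intra-modal step: keep the g strongest regions (placeholder criterion)
    keep = V.norm(dim=1).topk(g).indices
    X = V[keep]
    # 3. cross-modal step: region-word cosine similarities
    s = torch.nn.functional.cosine_similarity(
        X.unsqueeze(1), E.unsqueeze(0), dim=-1)       # (g, n)
    # 4. screening: pool the strongest alignments into one score
    return torch.sigmoid(s.max(dim=1).values.mean())

score = match_score(torch.randn(36, 64), torch.randn(10, 64))
```
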
Fig. 4 and 5 are graphs comparing the results of the cross-modal matching method of the cross-modal attention screening network based on dynamic routing with the cross-modal matching method of other networks on MSCOCO-Flickr30K and MSCOCO-5K datasets, respectively. As shown in fig. 4 and 5, the cross-modal matching method results of the cross-modal attention screening network based on dynamic routing are more accurate than other models.
Fig. 6 and 7 are graphs of visual results of a cross-modal matching method of a cross-modal attention screening network based on dynamic routing in image matching text and text matching image directions. As shown in fig. 6, given an image, a cross-modal attention screening network based on dynamic routing can match out the corresponding text. As shown in fig. 7, given text, a dynamic routing-based cross-modal attention screening network can match out the corresponding pictures.
The invention provides a novel cross-modal attention screening network with dynamic routing to explore potential cross-modal relationships. The semantic relations between images and texts are learned by retaining meaningful interaction characteristics and filtering irrelevant information, thereby suppressing the interference of useless interaction information. A dynamic routing attention module is designed to select corresponding attention area blocks according to the complexity of the image, reducing computational cost and redundant computation. Extensive experiments on Flickr30K and MSCOCO demonstrate the superiority of our model (IASDR) over several prior methods. In the future we will further study more efficient dynamic model mechanisms and conduct cross-modal matching studies in few-shot settings.
Finally, the details of the above examples of the invention are provided only for illustrating the invention, and any modifications, improvements, substitutions, etc. of the above embodiments should be included in the scope of the claims of the invention.