Cross-modal matching method of cross-modal attention screening network based on dynamic routing
Technical Field
The invention relates to a cross-modal matching method, and belongs to the technical field of computer vision and natural language processing.
Background
As information technology and social media advance, diverse multimedia data are aggregated together. For computers to understand, match, and transform such cross-modal data, cross-modal retrieval, which indexes semantically related instances in one modality given a query from another, has become a fundamental technique. It has many applications in visual semantic navigation, visual question answering, image captioning, and the like.
The cross-modal matching task mainly mines semantic associations between images and text by mapping images and sentences into a suitable common space. Most early approaches constructed two sub-networks for the image and text modalities that interact in a common space to model the relationship between the two modalities. For example, Kiros et al. learn representations of images and sentences using a CNN and an LSTM and optimize the model with a triplet ranking loss. Faghri et al. propose a triplet ranking loss combined with hard negative samples and show significant improvements on cross-modal retrieval tasks. Although these pioneering studies have made great progress, they learn the global representation directly, ignoring fine-grained analysis.
Accordingly, more and more researchers are working to explore the fine-grained correspondence between regions in images and words in sentences for image-text matching. Karpathy et al. propose a deep fragment embedding network for bidirectional image-sentence mapping, aligning each fragment pair by extracting fragment features of each image and text. In addition, the bottom-up and top-down attention network proposes to describe an image with a set of salient image regions, each region represented by a convolutional feature vector. Subsequently, Lee et al. encode images as region-level features using a bottom-up attention network and design a stacked cross-attention network that infers image-text matches by attending to region-related words or word-related regions.
Meanwhile, the introduction of external modules improves retrieval results for cross-modal matching. For example, Liu et al. propose a graph structured matching network that explicitly models objects, relationships, and attributes as phrases for inferring fine-grained correspondences. Wang et al. learn consensus-aware concept representations from an external corpus to further strengthen the semantic relationship between image and text. With the success of Transformers in vision and language, researchers have proposed universal encoders aimed at cross-modal retrieval that learn a better joint visual-linguistic representation in a pre-trained manner.
In addition, most currently popular deep learning models perform static inference, with network parameters fixed after training, which limits their representation capability, efficiency, and interpretability. Dynamic networks have advantages in efficiency, compatibility, and adaptability over traditional static network architectures by adapting their structure or parameters to different inputs. In particular, early dynamic approaches aimed to achieve network compression by pruning neurons or skipping layers. For example, Chen et al.'s dynamic region-aware convolution uses a learnable guide to apply channel-level filters along the spatial dimension, which not only improves the representation ability of the convolution but also keeps the computational cost at the level of standard convolution. In recent years, some researchers have designed different dynamic routes for multi-branch or tree-structured networks and routed dynamically within the network, adapting the computational graph to each sample. Li et al. propose a soft conditional gate that dynamically selects the scale transformation path for semantic segmentation, adapting to the scale distribution of each image.
While conventional approaches have made great progress, these efforts largely rely on hand-crafted settings that are not always optimal for a particular purpose, such as the number of regions per image. We therefore construct a cross-modal attention screening network with dynamic routing that automatically configures an appropriate number of regions for each input image. The network gains the ability to decide the number of regions while reducing redundant computation. In addition, the invention designs a novel cross-modal screening module, which retains meaningful interaction characteristics by filtering irrelevant information, suppresses meaningless alignment interference, and adaptively adjusts global and local dependencies.
Disclosure of Invention
The invention aims to solve the problem that most cross-modal matching methods rely on expert experience to detect a fixed number of regions in every image, lacking flexibility in selecting the number of image regions.
The technical scheme adopted for solving the technical problems is as follows:
S1, constructing a dynamic router that selects corresponding attention area blocks according to the complexity of the image, realizing the capability to decide the number of regions.
S2, designing a dynamic routing attention module that combines the dynamic router of S1 and addresses parameter redundancy and computational cost by constructing different adjacency masks for the defined numbers of attention areas.
S3, constructing a cross-modal screening module that retains meaningful interaction characteristics, filters irrelevant information, suppresses the interference of meaningless image-text pairs, and learns semantic relations between images and texts.
S4, combining the modules of S2 and S3 to construct the overall framework of the cross-modal matching method of the cross-modal attention screening network based on dynamic routing.
S5, training the cross-modal attention screening network based on dynamic routing for cross-modal matching.
To achieve dynamic selection for each image, we consider a network of multiple block structures, each block being provided with differently configured modules. Specifically, given an image feature V ∈ R d×m, the routing space can be defined as A = [A0,...,Ab], where b represents the number of attention area blocks and d is the dimension of the image feature, and the routed feature can be defined as:
Vr = Σi αi Ai(V) (1)
where αi is the selection probability predicted by the router for unit Ai, and A is the set of unit operations. A base unit and a corresponding route are designed within each node to select particular regional characteristics. For each image we design 3 candidate region blocks with different numbers of regions, each selecting the top g regions with the highest confidence scores. The image features are used for feature transformations inside the unit and inside the route.
The proposed soft router can be seen as a block decision process, generating the routing probability of each path according to the global meaning of the image. Given the image features V ∈ R d×m, the selection probability α ∈ R b of each attention block can be derived by:
V*=softmax(FC2(V))V (2)
α=relu(tanh(FC1(V*))) (3)
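As an illustration, the soft router of Eqs. (2)-(3) can be sketched in PyTorch as follows; reading Eq. (2) as attention pooling over regions, and the layer shapes, are assumptions rather than the exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftRouter(nn.Module):
    """Sketch of the soft router: pool region features into a global
    summary V*, then predict a selection probability per attention block."""
    def __init__(self, dim=1024, num_blocks=3):
        super().__init__()
        self.fc2 = nn.Linear(dim, 1)           # scores each region for pooling
        self.fc1 = nn.Linear(dim, num_blocks)  # maps V* to block probabilities

    def forward(self, V):                      # V: (batch, m, dim) region features
        # V* = softmax(FC2(V)) V : attention-pooled global image feature (Eq. 2)
        w = F.softmax(self.fc2(V), dim=1)      # (batch, m, 1)
        v_star = (w * V).sum(dim=1)            # (batch, dim)
        # alpha = relu(tanh(FC1(V*))) (Eq. 3)
        return F.relu(torch.tanh(self.fc1(v_star)))

router = SoftRouter(dim=1024, num_blocks=3)
alpha = router(torch.randn(2, 36, 1024))       # 2 images, 36 regions each
```

The non-negative vector alpha then weights the candidate attention blocks, so paths with zero probability can be skipped at inference time.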
FC denotes a fully connected layer. Meanwhile, in the dynamic routing attention module, we employ self-attention and identity mapping as the implementation of each unit. However, the dot-product operations of self-attention result in expensive computation and significant memory usage, so reducing the amount of calculation is important. Thus, we adjust the number of attention areas per input image, which is a key distinction from existing self-attention based methods. By re-examining the definition of standard self-attention, the regional attention weight can be derived as:
ηij = softmax((Wq vi)T(Wk vj)/√d) (4)
where ηij measures the effect of the j-th position on the i-th position, m represents the number of regions in the image, and Η = [ηij] can be regarded as a fully connected graph between the different regions of an image.
To obtain the characteristics of different regions of interest, we need to limit the region connections of each input image. By introducing an adjacency mask M ∈ R m×m, we obtain new region attention weights:
η*ij = Mij ⊙ ηij (5)
where M is binary: an entry is set to 1 when the position lies within the attention area of the target element, and 0 otherwise. Attention operations are thus limited to a certain number of image regions to explore intra-modal semantic relationships. Finally, the output of the router attention module is given as:
V' = η*V + V (6)
where the identity term preserves the input features. Thus the number of regions in the image is limited, which greatly reduces the computational complexity and the errors caused by data redundancy.
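A minimal sketch of the masked region attention described above, assuming scaled dot-product attention restricted to the top-g regions by detector confidence; dropping the query/key projections and taking the confidence scores as an input are simplifying assumptions:

```python
import torch
import torch.nn.functional as F

def masked_region_attention(V, conf, g):
    """V:    (batch, m, d) region features
    conf: (batch, m) per-region confidence scores (assumed given by the detector)
    g:    number of regions kept by this attention block
    The binary adjacency mask M confines attention to the top-g regions."""
    b, m, d = V.shape
    topg = conf.topk(g, dim=1).indices              # indices of kept regions
    keep = torch.zeros(b, m, dtype=torch.bool)
    keep.scatter_(1, topg, True)
    M = keep.unsqueeze(1) & keep.unsqueeze(2)       # (b, m, m) adjacency mask
    # eta_ij = (v_i . v_j) / sqrt(d), restricted by M, then row-wise softmax
    eta = V @ V.transpose(1, 2) / d ** 0.5
    eta = eta.masked_fill(~M, float('-inf'))
    attn = F.softmax(eta, dim=-1).nan_to_num(0.0)   # rows fully masked -> 0
    return attn @ V + V                             # attention + identity mapping

out = masked_region_attention(torch.randn(2, 36, 64), torch.rand(2, 36), g=12)
```

Because attention rows outside the mask are zeroed, unselected regions pass through unchanged via the identity branch.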
To better aggregate shared semantics and bridge the semantic gap between vision and text, a cross-modal screening module is proposed to suppress meaningless alignment interference and further reduce the overall time consumption of the model. In particular, given the local and global feature representations, we compute the concatenated composite feature representations as:
X*=cat((X,duplicate(Xg)),1) (7)
E*=cat((E,duplicate(Eg)),1) (8)
To ensure that Xg (Eg) can be concatenated with X (E), Xg (Eg) is expanded into a matrix by duplicating it into multiple rows. We then calculate the shared semantic representation between each pair by a cosine similarity function:
sij = xiT ej / (‖xi‖ ‖ej‖) (9)
where sij represents the correlation between the i-th region and the j-th word. Given the image query X*, the attention weight of each region over the text set is computed as:
αij = exp(λ sij) / Σj exp(λ sij) (10)
where λ is a factor controlling the smoothness of the attention distribution. The text attention feature Le is derived by a weighted combination of word features:
Le = Σj αij ej (11)
Similarly, given a text query, the attention weight of each word over the image set is:
βij = exp(λ sij) / Σi exp(λ sij) (12)
The image-level attention feature is derived from a weighted combination of image region features:
Lx = Σi βij xi (13)
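The cross-attention step above can be sketched as follows; the value of λ and the use of row/column softmax over cosine similarities are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def cross_attention(X, E, lam=9.0):
    """X: (n_regions, d) image features; E: (n_words, d) word features.
    Returns the text-attended feature per region and the image-attended
    feature per word. lam is the smoothing factor lambda (assumed value)."""
    # cosine similarity between the i-th region and the j-th word
    s = F.normalize(X, dim=-1) @ F.normalize(E, dim=-1).t()  # (n_regions, n_words)
    # region -> word attention: softmax over words, smoothed by lambda
    a = F.softmax(lam * s, dim=1)
    L_e = a @ E                          # text attention feature per region
    # word -> region attention: softmax over regions
    b = F.softmax(lam * s, dim=0).t()    # (n_words, n_regions)
    L_x = b @ X                          # image attention feature per word
    return L_e, L_x

L_e, L_x = cross_attention(torch.randn(36, 1024), torch.randn(12, 1024))
```

Each region thus receives a summary of the words it aligns with, and vice versa, before the screening step filters the interaction.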
Then, the attended semantic features L (image-level Lx / text-level Le) are first mapped to generate a filter vector Lα and a reset vector Lβ as follows:
Lα=tanh(MLP(L)) (14)
Lβ=MLP(L) (15)
The reset feature is then obtained by:
R=Relu(Q⊙Lα+Lβ) (16)
where Q = E* or X*. The similarity score of the image and text can finally be calculated by the following formula:
S=sigmoid(FC(μRx+εRe)) (17)
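The gating of Eqs. (14)-(17) might be sketched as below; the MLP shapes, the pairing of each query with its attended counterpart, and the mean pooling before the final FC are assumptions made for illustration:

```python
import torch
import torch.nn as nn

class CrossModalFilter(nn.Module):
    """Sketch of the screening step: the attended feature L is mapped to a
    filter vector L_alpha and a reset vector L_beta that gate the query Q."""
    def __init__(self, dim=1024):
        super().__init__()
        self.mlp_a = nn.Sequential(nn.Linear(dim, dim), nn.Tanh())  # L_alpha (Eq. 14)
        self.mlp_b = nn.Linear(dim, dim)                            # L_beta  (Eq. 15)
        self.fc = nn.Linear(dim, 1)

    def reset(self, Q, L):
        # R = relu(Q * L_alpha + L_beta) (Eq. 16)
        return torch.relu(Q * self.mlp_a(L) + self.mlp_b(L))

    def forward(self, X_star, E_star, L_e, L_x, mu=1.0, eps=1.0):
        R_x = self.reset(X_star, L_e)   # gate regions with attended words
        R_e = self.reset(E_star, L_x)   # gate words with attended regions
        # Eq. 17, with mean pooling over rows assumed to align dimensions
        return torch.sigmoid(self.fc(mu * R_x.mean(0) + eps * R_e.mean(0)))

f = CrossModalFilter(dim=64)
score = f(torch.randn(36, 64), torch.randn(12, 64),
          torch.randn(36, 64), torch.randn(12, 64))
```

Features whose filter response is small are effectively zeroed by the ReLU, which is how meaningless region-word pairs are suppressed.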
The cross-modal matching method of the cross-modal attention screening network based on dynamic routing comprises a dynamic routing attention module and a cross-modal screening module, which together form the cross-modal attention screening network based on dynamic routing.
Finally, the implementation details of the cross-modal matching method of the cross-modal attention screening network based on dynamic routing are as follows:
Our experiments were conducted with the PyTorch framework on a single NVIDIA TESLA P GPU. For each image we used a Faster R-CNN model with ResNet-101 for region detection and feature extraction, selecting the top k=36 region features ranked by confidence score, with 1024-dimensional features per region. For each text we choose the BERT-Base model with 12 layers, 768 hidden units, and 12 heads, producing the original 768-dimensional word embeddings. The latent common embedding dimension is set to 1024. Models were trained for 20 and 30 epochs on the MSCOCO and Flickr30k datasets, respectively, using the Adam optimizer. The learning rate was initially set to 5e-9, decreased 10-fold every 10 or 15 epochs, respectively. The margin parameter α was set to 0.2, the mini-batch size was 64, and the gradient clipping threshold was 2.0.
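For illustration, the stated optimization settings might be configured as below; the model is a placeholder standing in for the full network, and only the hyperparameter values are taken from the description above:

```python
import torch

model = torch.nn.Linear(1024, 1024)   # placeholder for the full network
optimizer = torch.optim.Adam(model.parameters(), lr=5e-9)
# 10-fold learning-rate decay every 10 epochs (every 15 for Flickr30k)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
margin = 0.2          # margin parameter of the ranking loss
batch_size = 64
clip_threshold = 2.0  # used with torch.nn.utils.clip_grad_norm_

for epoch in range(2):    # 20 epochs on MSCOCO, 30 on Flickr30k in the text
    optimizer.step()      # placeholder for one epoch of training steps
    scheduler.step()
```
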
Compared with the prior art, the invention has the beneficial effects that:
1. The invention provides a cross-modal matching method of a cross-modal attention screening network based on dynamic routing. To the best of our knowledge, this is the first effort to move beyond a fixed number of regions in image-text matching, addressing the information redundancy created by using 36 visual regions as image feature inputs.
2. A dynamic routing attention module is provided, which selects the number of regions according to the complexity of the image, and reduces redundant calculation to reduce the calculation cost of the model.
3. The invention designs a cross-modal screening module, which dynamically suppresses the interference of meaningless region-word pairs in the interaction information and adjusts the global and local dependency relationships.
Drawings
Fig. 1 is a schematic diagram of a cross-modal matching method of a cross-modal attention screening network based on dynamic routing.
Fig. 2 is a schematic diagram of a model of a dynamic route attention module.
Fig. 3 is a schematic diagram of a model of a cross-modal screening module.
Fig. 4 and 5 are graphs comparing the results of the cross-modal matching method of the cross-modal attention screening network based on dynamic routing with the cross-modal matching method of other networks on MSCOCO-Flickr30K and MSCOCO-5K datasets, respectively.
Fig. 6 and 7 are graphs of visual results of a cross-modal matching method of a cross-modal attention screening network based on dynamic routing in image matching text and text matching image directions.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent.
The invention is further illustrated in the following figures and examples.
Fig. 1 is a schematic diagram of a cross-modal matching method of a cross-modal attention screening network based on dynamic routing. As shown in fig. 1, the image and sentence are first encoded into feature representations. Then an intra-modal attention module configured with a dynamic router selects the corresponding attention area blocks according to the complexity of the image to capture complex intra-modal relationships. Finally, a cross-modal screening module fused with global information is constructed, which retains meaningful interaction characteristics while filtering irrelevant information, eliminates meaningless alignment interference, and learns more accurate semantic relations between images and texts.
Fig. 2 is a schematic diagram of a model of a dynamic routing attention module. As shown in fig. 2, to achieve dynamic selection for each image, we consider a network of multiple block structures, each block being provided with differently configured modules. Specifically, given an image feature V ∈ R d×m, the routing space can be defined as A = [A0,...,Ab], where b represents the number of attention area blocks and d is the dimension of the image feature, and the routed feature can be defined as:
Vr = Σi αi Ai(V) (18)
where αi is the selection probability predicted by the router for unit Ai, and A is the set of unit operations. A base unit and a corresponding route are designed within each node to select particular regional characteristics. For each image we design 3 candidate region blocks with different numbers of regions, each selecting the top g regions with the highest confidence scores. The image features are used for feature transformations inside the unit and inside the route.
The proposed soft router can be seen as a block decision process, generating the routing probability of each path according to the global meaning of the image. Given the image features V ∈ R d×m, the selection probability α ∈ R b of each attention block can be derived by:
V*=softmax(FC2(V))V (19)
α=relu(tanh(FC1(V*))) (20)
FC denotes a fully connected layer. Meanwhile, in the dynamic routing attention module, we employ self-attention and identity mapping as the implementation of each unit. However, the dot-product operations of self-attention result in expensive computation and significant memory usage, so reducing the amount of calculation is important. Thus, we adjust the number of attention areas per input image, which is a key distinction from existing self-attention based methods. By re-examining the definition of standard self-attention, the regional attention weight can be derived as:
ηij = softmax((Wq vi)T(Wk vj)/√d) (21)
where ηij measures the effect of the j-th position on the i-th position, m represents the number of regions in the image, and Η = [ηij] can be regarded as a fully connected graph between the different regions of an image.
To obtain the characteristics of different regions of interest, we need to limit the region connections of each input image. By introducing an adjacency mask M ∈ R m×m, we obtain new region attention weights:
η*ij = Mij ⊙ ηij (22)
where M is binary: an entry is set to 1 when the position lies within the attention area of the target element, and 0 otherwise. Attention operations are thus limited to a certain number of image regions to explore intra-modal semantic relationships. Finally, the output of the router attention module is given as:
V' = η*V + V (23)
where the identity term preserves the input features. Thus the number of regions in the image is limited, which greatly reduces the computational complexity and the errors caused by data redundancy.
Fig. 3 is a schematic diagram of a model of a cross-modal screening module. As shown in fig. 3, to better aggregate shared semantics and bridge the semantic gap between vision and text, a cross-modal screening module is proposed to suppress meaningless alignment interference and further reduce the overall time consumption of the model. In particular, given the local and global feature representations, we compute the concatenated composite feature representations as:
X*=cat((X,duplicate(Xg)),1) (24)
E*=cat((E,duplicate(Eg)),1) (25)
To ensure that Xg (Eg) can be concatenated with X (E), Xg (Eg) is expanded into a matrix by duplicating it into multiple rows. We then calculate the shared semantic representation between each pair by a cosine similarity function:
sij = xiT ej / (‖xi‖ ‖ej‖) (26)
where sij represents the correlation between the i-th region and the j-th word. Given the image query X*, the attention weight of each region over the text set is computed as:
αij = exp(λ sij) / Σj exp(λ sij) (27)
where λ is a factor controlling the smoothness of the attention distribution. The text attention feature Le is derived by a weighted combination of word features:
Le = Σj αij ej (28)
Similarly, given a text query, the attention weight of each word over the image set is:
βij = exp(λ sij) / Σi exp(λ sij) (29)
The image-level attention feature is derived from a weighted combination of image region features:
Lx = Σi βij xi (30)
Then, the attended semantic features L (image-level Lx / text-level Le) are first mapped to generate a filter vector Lα and a reset vector Lβ as follows:
Lα=tanh(MLP(L)) (31)
Lβ=MLP(L) (32)
The reset feature is then obtained by:
R=Relu(Q⊙Lα+Lβ) (33)
where Q = E* or X*. The similarity score of the image and text can finally be calculated by the following formula:
S=sigmoid(FC(μRx+εRe)) (34)
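To show how the stages fit together, a deliberately simplified forward pass is sketched below; the routing heuristic, region selection, and scoring are placeholders for illustration only, not the invention's actual computations:

```python
import torch

def match_score(V, E):
    """V: (m, d) image region features; E: (n, d) word features.
    Returns a scalar similarity in (0, 1), mirroring the flow route ->
    intra-modal attention -> cross-modal interaction -> score."""
    # 1. route: decide how many regions to keep (placeholder heuristic)
    g = max(1, V.shape[0] // 2)
    # 2. intra-modal step: keep the g strongest regions (placeholder criterion)
    keep = V.norm(dim=1).topk(g).indices
    X = V[keep]
    # 3. cross-modal step: region-word cosine similarities
    s = torch.nn.functional.cosine_similarity(
        X.unsqueeze(1), E.unsqueeze(0), dim=-1)       # (g, n)
    # 4. screening: pool the strongest alignments into one score
    return torch.sigmoid(s.max(dim=1).values.mean())

score = match_score(torch.randn(36, 64), torch.randn(10, 64))
```
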
Fig. 4 and 5 are graphs comparing the results of the cross-modal matching method of the cross-modal attention screening network based on dynamic routing with the cross-modal matching method of other networks on MSCOCO-Flickr30K and MSCOCO-5K datasets, respectively. As shown in fig. 4 and 5, the cross-modal matching method results of the cross-modal attention screening network based on dynamic routing are more accurate than other models.
Fig. 6 and 7 are graphs of visual results of a cross-modal matching method of a cross-modal attention screening network based on dynamic routing in image matching text and text matching image directions. As shown in fig. 6, given an image, a cross-modal attention screening network based on dynamic routing can match out the corresponding text. As shown in fig. 7, given text, a dynamic routing-based cross-modal attention screening network can match out the corresponding pictures.
The invention provides a novel cross-modal attention screening network with dynamic routing to explore potential cross-modal relationships. The semantic relations between images and texts are learned by retaining meaningful interaction characteristics and filtering irrelevant information, thereby suppressing the interference of useless interaction information. A dynamic routing attention module is designed to select corresponding attention area blocks according to the complexity of the image, reducing computational cost and redundant computation. Extensive experiments on Flickr30K and MSCOCO demonstrate the superiority of our model (IASDR) over several prior methods. In the future we will further study more efficient dynamic model mechanisms and conduct cross-modal matching studies in few-shot settings.
Finally, the details of the above examples of the invention are provided only for illustrating the invention, and any modifications, improvements, substitutions, etc. of the above embodiments should be included in the scope of the claims of the invention.