
CN114676228B - Cross-modal matching method based on cross-modal attention screening network with dynamic routing - Google Patents


Info

Publication number
CN114676228B
CN114676228B (application CN202210364577.5A; earlier publication CN114676228A)
Authority
CN
China
Prior art keywords
attention
cross
modal
image
regions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210364577.5A
Other languages
Chinese (zh)
Other versions
CN114676228A (en)
Inventor
吴杰
吴春雷
宫法明
张立强
路静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China
Priority to CN202210364577.5A
Publication of CN114676228A
Application granted
Publication of CN114676228B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/5846 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Library & Information Science (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The present invention discloses a cross-modal matching method based on a cross-modal attention screening network with dynamic routing. Significant progress has been made on local alignment for this task. Existing methods usually model the 36 salient regions with the highest detection confidence in each image. However, these regions often include irrelevant, redundant ones, which may introduce noise into the modeling process and interfere with model inference. According to our statistics on the most widely used datasets for this task, the number of regions is much larger than the number of words. We therefore propose a cross-modal attention screening network with dynamic routing that automatically configures an appropriate number of regions for each input image. The network has the ability to decide the number of regions and can dynamically learn different activation regions for different data, thereby reducing redundant computation. In addition, a cross-modal screening module is designed to retain meaningful interaction features by filtering irrelevant information, suppress the interference of meaningless alignments, and adaptively adjust global and local dependencies.

Description

Cross-modal matching method of cross-modal attention screening network based on dynamic routing
Technical Field
The invention relates to a cross-modal matching method in the technical field of computer vision and natural language processing.
Background
As innovative technologies and social media advance, various kinds of multimedia data and information are aggregated together. For a computer to understand, match, and transform such cross-modal data, cross-modal retrieval has become a fundamental technique that indexes semantically related instances of one modality from another. It has many applications in fields such as visual semantic navigation, visual question answering, and image captioning.
The cross-modal matching task mainly mines semantic associations between images and text by mapping images and sentences to a suitable common space. Most early approaches constructed two sub-networks for the image and text modalities that interact in a common space to model cross-modal relationships. For example, Kiros et al. learn representations of images and sentences using a CNN and an LSTM and optimize the model with a triplet ranking loss. Faghri et al. propose a triplet ranking loss combined with hard negative mining and show significant improvements on cross-modal retrieval tasks. Although these pioneering studies made great progress, they learn a global representation directly and ignore fine-grained analysis.
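The hinge-based triplet ranking loss with hard negatives mentioned above can be sketched as follows (a minimal NumPy sketch under stated assumptions, not the exact formulation of the cited works; the margin value and the batch-similarity-matrix layout are assumptions):

```python
import numpy as np

def triplet_loss_hard_negative(S, margin=0.2):
    """Hinge-based triplet ranking loss with hardest negatives.

    S: (n, n) similarity matrix for a batch, where S[i, i] is the score
    of the i-th matched image-sentence pair and off-diagonal entries
    are mismatched (negative) pairs.
    """
    n = S.shape[0]
    pos = np.diag(S)                                  # scores of matched pairs
    neg = np.where(np.eye(n, dtype=bool), -np.inf, S) # exclude positives
    hard_i2t = neg.max(axis=1)                        # hardest sentence per image
    hard_t2i = neg.max(axis=0)                        # hardest image per sentence
    loss = (np.maximum(0, margin + hard_i2t - pos) +
            np.maximum(0, margin + hard_t2i - pos))
    return loss.mean()
```

When matched pairs already outscore all negatives by more than the margin, the loss is zero; otherwise only the single hardest negative per query contributes.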
Accordingly, more and more researchers work to explore fine-grained correspondences between regions in images and words in sentences for image-text matching. Karpathy et al. propose a deep fragment embedding network for bidirectional image-sentence mapping, aligning fragment pairs by extracting the fragment features of each image and sentence. In addition, bottom-up and top-down attention networks propose describing an image with a set of salient image regions, each represented by a convolutional feature vector. Subsequently, Lee et al. encoded images as region-level features using a bottom-up attention network and designed a stacked cross-attention network to infer image-text matches by attending to region-related words or word-related regions.
Meanwhile, the introduction of external modules has improved retrieval results for cross-modal matching. For example, Liu et al. propose a graph-structured matching network that explicitly models objects, relationships, and attributes as phrases and combines them to infer fine-grained correspondences. Wang et al. learn consensus-aware concept representations using an external corpus to further strengthen the semantic relationship between image and text. With the success of Transformers in vision and language, later work proposes universal encoders aimed at cross-modal retrieval by learning a better joint visual-linguistic representation through pre-training.
In addition, most currently popular deep learning models perform static inference: the network parameters are fixed after training, which limits their representation capability, efficiency, and interpretability. Dynamic networks, which adapt their structure or parameters to different inputs, have advantages in efficiency, compatibility, and adaptability over traditional static architectures. Early dynamic approaches aimed to achieve network compression by pruning neurons or skipping layers. For example, Chen et al.'s dynamic region-aware convolution uses a learnable instructor to add a channel-level filter to the spatial dimension, which improves the representation ability of the convolution while keeping the computational cost close to that of a standard convolution. In recent years, some researchers have designed different dynamic routes for multi-branch or tree-structured networks and route dynamically within the network, adapting the computational graph to each sample. Li et al. propose a soft conditional gate that dynamically selects the scale-transformation path for semantic segmentation, adapting to the scale distribution of each image.
While conventional approaches have made great progress, these efforts largely rely on hand-crafted settings that are not always optimal for a particular purpose, such as the number of regions per image. We therefore construct a cross-modal attention screening network with dynamic routing that automatically configures an appropriate number of regions for each input image. The network is given the ability to make decisions on the number of regions while reducing redundant computation. In addition, the invention designs a novel cross-modal screening module that retains meaningful interaction features by filtering irrelevant information, suppresses meaningless alignment interference, and adaptively adjusts global and local dependencies.
Disclosure of Invention
The invention aims to solve the problem that most cross-modal matching methods rely on expert experience to model a fixed number of detected regions for every image, and thus lack flexibility in selecting the number of image regions.
The technical scheme adopted for solving the technical problems is as follows:
S1. Construct a dynamic router that selects the corresponding attention area blocks according to the complexity of the image, thereby gaining the ability to decide the number of regions.
S2. Combining the dynamic router in S1, design a dynamic routing attention module that solves the parameter-redundancy and computation problems by constructing different adjacency masks for the defined numbers of attention regions.
S3. Construct a cross-modal screening module that retains meaningful interaction features, filters irrelevant information, suppresses the interference of meaningless image-text pairs, and learns the semantic relationship between images and texts.
S4. Combine the modules in S2 and S3 to build the overall framework of the cross-modal matching method of the cross-modal attention screening network based on dynamic routing.
S5. Train the cross-modal matching method of the cross-modal attention screening network based on dynamic routing.
To achieve dynamic selection for each image, we consider a network composed of multiple block structures, each block equipped with differently configured modules. Specifically, given an image feature V ∈ R^(d×m), the routing space can be defined as A = [A_0, ..., A_b], where b is the number of attention area blocks and d is the dimension of the image feature. The routed feature can be defined as:
V_r = Σ_(i=0)^(b) α_i·A_i(V) (1)
where α is the selection probability predicted by the router and A is the set of unit operations. A basic unit and a corresponding route are designed within each node to select particular regional features. For each image, we design 3 candidate region blocks of different sizes, each selecting the top g regions with the highest confidence scores. The image features are used for feature transformations inside the unit and inside the route.
The proposed soft router can be seen as a block decision process that generates the routing probability of each path according to the global meaning of the image. Given image features V ∈ R^(d×m), the selection prediction probability α ∈ R^b for each attention block can be derived by:
V*=softmax(FC2(V))V (2)
α=relu(tanh(FC1(V*))) (3)
where FC is a fully connected function. Meanwhile, in the dynamic routing attention module, we employ self-attention and identity mapping as the implementation of each unit. However, the dot-product operation of self-attention incurs expensive computation and significant memory usage, so reducing the amount of computation is crucial. We therefore adjust the number of attention regions for each input image, which is a key distinction from existing self-attention-based methods. By revisiting the definition of standard self-attention, the regional attention weight can be derived:
η_ij = exp(e_ij) / Σ_(j'=1)^(m) exp(e_ij'), with e_ij = (W_q·v_i)ᵀ(W_k·v_j)/√d (4)
where η_ij measures the effect of the j-th position on the i-th position, m is the number of regions in the image, and η_ij can be regarded as a fully connected graph between the different regions of an image.
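The soft router of equations (2)-(3) can be sketched in NumPy as follows (a minimal sketch; treating the rows of V as regions and choosing weight shapes W2 ∈ R^(d×1), W1 ∈ R^(d×b) are layout assumptions, and the weights here are random stand-ins for the learned fully connected layers):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerically stable softmax
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_router(V, W2, W1):
    """V: (m, d) region features -> alpha: (b,) block-selection probabilities.

    Eq. (2): FC2 gives one logit per region, a softmax over regions and a
    weighted sum produce a global descriptor V*; eq. (3): FC1 + tanh + relu
    maps V* to selection probabilities over the b attention-area blocks.
    """
    scores = (V @ W2).ravel()        # FC2: (m,) one logit per region
    attn = softmax(scores)           # attention over regions
    V_star = attn @ V                # (d,) global image descriptor
    alpha = np.maximum(0.0, np.tanh(V_star @ W1))  # (b,) values in [0, 1)
    return alpha
```

The relu(tanh(·)) composition bounds each block-selection probability to [0, 1) and zeroes out blocks with negative evidence.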
To obtain the characteristics of different attention regions, we need to limit the region connections of each input image; by introducing an adjacency mask M ∈ R^(m×m), we obtain new regional attention weights:
η̃_ij = M_ij·exp(e_ij) / Σ_(j'=1)^(m) M_ij'·exp(e_ij'), with e_ij = (W_q·v_i)ᵀ(W_k·v_j)/√d (5)
M is binary: it is set to 1 when position j lies within the attention area of the target element. Attention operations are thereby limited to a certain number of image regions to explore the intra-modal semantic relationships. Finally, the output of the router attention module is given as follows:
v̂_i = Σ_(j=1)^(m) η̃_ij·(W_v·v_j) (6)
In this way, the number of regions in the image is limited, which greatly reduces the computational complexity and the errors caused by data redundancy.
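Masked regional self-attention of this kind can be sketched as follows (a minimal NumPy sketch; the projection matrices Wq, Wk, Wv stand in for learned parameters, and masked-out logits are set to -inf before the softmax so attention is confined to the selected regions):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerically stable softmax
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_region_attention(V, M, Wq, Wk, Wv):
    """Self-attention over image regions restricted by an adjacency mask.

    V: (m, d) region features; M: (m, m) binary mask, M[i, j] = 1 when
    region j is inside the attention area of region i.
    Returns the masked attention weights and the attended features.
    """
    dk = Wq.shape[1]
    logits = (V @ Wq) @ (V @ Wk).T / np.sqrt(dk)   # (m, m) scaled dot products
    logits = np.where(M.astype(bool), logits, -np.inf)
    eta = softmax(logits, axis=-1)                 # masked attention weights
    return eta, eta @ (V @ Wv)                     # weights, attended features
```

With an identity mask each region attends only to itself, so the output reduces to the value projection of each region; denser masks let attention spread over the selected top-g regions.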
To better aggregate shared semantics and bridge the semantic gap between vision and text, a cross-modal screening module is proposed to suppress the interference of meaningless alignments, further reducing the time consumption of the whole model. In particular, given local and global feature representations, we can compute a new synthesized feature representation from:
X*=cat((X,duplicate(Xg)),1) (7)
E*=cat((E,duplicate(Eg)),1) (8)
To ensure that X_g (E_g) can be concatenated with X (E), it is expanded into a matrix by duplicating it into multiple rows. We then compute the shared semantic representation between each pair by a cosine similarity function:
s_ij = x_iᵀ·e_j / (‖x_i‖·‖e_j‖) (9)
where s_ij represents the correlation between the i-th region and the j-th word. Given the image query X*, the attention weight of each region over the text set is computed as:
a_ij = exp(λ·s_ij) / Σ_(j'=1)^(n) exp(λ·s_ij') (10)
where λ is a factor controlling the smoothness of the attention distribution. The text attention feature L_e is derived by a weighted combination of word features:
L_e^i = Σ_(j=1)^(n) a_ij·e_j (11)
Similarly, given a text query, the attention weight of each word over the image set is:
a′_ij = exp(λ·s_ij) / Σ_(i'=1)^(m) exp(λ·s_i'j) (12)
The image-level attention feature is derived from a weighted combination of image region features:
L_x^j = Σ_(i=1)^(m) a′_ij·x_i (13)
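This cosine-similarity cross-attention can be sketched as follows (a minimal NumPy sketch; λ = 9 is an illustrative smoothing value, not one stated in the description):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerically stable softmax
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(X, E, lam=9.0):
    """X: (m, d) region features, E: (n, d) word features.

    Computes cosine similarities s_ij between every region and word,
    then pools word features per region (text-attended L_e) and region
    features per word (image-attended L_x) with a lambda-smoothed softmax.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    S = Xn @ En.T                          # (m, n) cosine similarities
    L_e = softmax(lam * S, axis=1) @ E     # (m, d) text attended per region
    L_x = softmax(lam * S, axis=0).T @ X   # (n, d) image attended per word
    return S, L_e, L_x
```

A larger λ sharpens the attention toward the single best-matching word or region; a smaller λ spreads it more evenly.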
Then, the attention semantic features L (image-level L_x / text-level L_e) are first mapped to generate a filter vector L_α and a reset vector L_β as follows:
Lα=tanh(MLP(L)) (14)
Lβ=MLP(L) (15)
The reset feature is then obtained by
R=Relu(Q⊙Lα+Lβ) (16)
where Q = E* or X*. Finally, the similarity score of the image and text can be calculated by the following formula:
S=sigmoid(FC(μRx+εRe)) (17)
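The filter/reset gating of equations (14)-(17) can be sketched as follows (a minimal NumPy sketch; the MLP and FC layers are reduced to single hypothetical weight matrices, and μ = ε = 0.5 is an assumed weighting):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def screening_gate(Q, L, Wa, Wb):
    """Eqs. (14)-(16): the filter vector L_alpha scales the query Q
    element-wise, the reset vector L_beta shifts it, relu keeps the result.

    Q: query features (E* or X*), L: attended semantic features.
    """
    L_alpha = np.tanh(L @ Wa)                     # filter vector, eq. (14)
    L_beta = L @ Wb                               # reset vector,  eq. (15)
    return np.maximum(0.0, Q * L_alpha + L_beta)  # reset feature, eq. (16)

def similarity_score(R_x, R_e, w_fc, mu=0.5, eps=0.5):
    """Eq. (17): fuse the two reset features and squash to (0, 1)."""
    return sigmoid((mu * R_x + eps * R_e) @ w_fc)
```

The tanh-bounded filter vector can attenuate or flip individual query dimensions, which is what suppresses meaningless region-word interactions before the final score.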
The cross-modal matching method of the cross-modal attention screening network based on dynamic routing comprises a dynamic routing attention module, a cross-modal screening module, and a cross-modal attention screening network based on dynamic routing.
Finally, the cross-modal matching method of the cross-modal attention screening network based on dynamic routing is trained as follows:
Our experiments were conducted with the PyTorch framework on a single NVIDIA Tesla P100 GPU. For each image, we used a Faster-RCNN model with ResNet-101 for region detection and feature extraction, selecting the top K=36 regions by confidence score, with 1024-dimensional features per region. For each text, we chose the BERT-Base model with 12 layers, a hidden size of 768, and 12 heads, yielding the original 768-dimensional word embeddings. The latent common embedding dimension is set to 1024. Models were trained for 20 and 30 epochs on the MSCOCO and Flickr30k datasets, respectively, using the Adam optimizer. The learning rate was initially set to 5e-9 and decreased 10-fold every 10 or 15 epochs, respectively. The margin parameter α was set to 0.2, the mini-batch size to 64, and the gradient clipping threshold to 2.0.
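For reference, the hyperparameters stated above can be collected in one place (a plain summary of the values in this paragraph, not an executable training script):

```python
# Training hyperparameters as stated in the description.
train_config = {
    "gpu": "NVIDIA Tesla P100",
    "region_detector": "Faster-RCNN with ResNet-101",
    "num_regions": 36,            # top-K regions by confidence score
    "region_dim": 1024,
    "text_encoder": "BERT-Base",  # 12 layers, hidden size 768, 12 heads
    "word_dim": 768,
    "common_dim": 1024,
    "optimizer": "Adam",
    "epochs": {"MSCOCO": 20, "Flickr30k": 30},
    "initial_lr": 5e-9,
    "lr_decay_factor": 0.1,       # 10-fold decrease
    "lr_decay_every": {"MSCOCO": 10, "Flickr30k": 15},  # in epochs
    "margin": 0.2,
    "batch_size": 64,
    "grad_clip": 2.0,
}
```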
Compared with the prior art, the invention has the beneficial effects that:
1. The invention provides a cross-modal matching method of a cross-modal attention screening network based on dynamic routing. To the best of our knowledge, this is the first work to address, in image-text matching, the information-redundancy problem created by using a fixed 36 visual regions as image feature input.
2. A dynamic routing attention module is provided that selects the number of regions according to the complexity of the image and reduces redundant computation, lowering the computational cost of the model.
3. The invention designs a cross-modal screening module that dynamically suppresses the interference of meaningless region-word pairs in the interaction information and adjusts the global and local dependency relationships.
Drawings
Fig. 1 is a schematic diagram of a cross-modal matching method of a cross-modal attention screening network based on dynamic routing.
Fig. 2 is a schematic diagram of a model of the dynamic routing attention module.
Fig. 3 is a schematic diagram of a model of a cross-modal screening module.
Fig. 4 and 5 are graphs comparing the results of the cross-modal matching method of the cross-modal attention screening network based on dynamic routing with the cross-modal matching method of other networks on MSCOCO-Flickr30K and MSCOCO-5K datasets, respectively.
Fig. 6 and 7 are graphs of visualized results of the cross-modal matching method of the cross-modal attention screening network based on dynamic routing in the image-to-text and text-to-image matching directions.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent.
The invention is further illustrated in the following figures and examples.
Fig. 1 is a schematic diagram of the cross-modal matching method of the cross-modal attention screening network based on dynamic routing. As shown in fig. 1, the image and sentence are first encoded into feature representations. An intra-modal attention module with a dynamic router is then configured to select the corresponding attention area blocks according to the complexity of the image and capture complex intra-modal relationships. Finally, a cross-modal screening module fused with global information is constructed; it filters irrelevant information while retaining meaningful interaction features, eliminates meaningless alignment interference, and learns a more accurate semantic relationship between images and texts.
Fig. 2 is a schematic diagram of a model of the dynamic routing attention module. As shown in fig. 2, to achieve dynamic selection for each image, we consider a network composed of multiple block structures, each block equipped with differently configured modules. Specifically, given an image feature V ∈ R^(d×m), the routing space can be defined as A = [A_0, ..., A_b], where b is the number of attention area blocks and d is the dimension of the image feature. The routed feature can be defined as:
V_r = Σ_(i=0)^(b) α_i·A_i(V) (18)
where α is the selection probability predicted by the router and A is the set of unit operations. A basic unit and a corresponding route are designed within each node to select particular regional features. For each image, we design 3 candidate region blocks of different sizes, each selecting the top g regions with the highest confidence scores. The image features are used for feature transformations inside the unit and inside the route.
The proposed soft router can be seen as a block decision process that generates the routing probability of each path according to the global meaning of the image. Given image features V ∈ R^(d×m), the selection prediction probability α ∈ R^b for each attention block can be derived by:
V*=softmax(FC2(V))V (19)
α=relu(tanh(FC1(V*))) (20)
where FC is a fully connected function. Meanwhile, in the dynamic routing attention module, we employ self-attention and identity mapping as the implementation of each unit. However, the dot-product operation of self-attention incurs expensive computation and significant memory usage, so reducing the amount of computation is crucial. We therefore adjust the number of attention regions for each input image, which is a key distinction from existing self-attention-based methods. By revisiting the definition of standard self-attention, the regional attention weight can be derived:
η_ij = exp(e_ij) / Σ_(j'=1)^(m) exp(e_ij'), with e_ij = (W_q·v_i)ᵀ(W_k·v_j)/√d (21)
where η_ij measures the effect of the j-th position on the i-th position, m is the number of regions in the image, and η_ij can be regarded as a fully connected graph between the different regions of an image.
To obtain the characteristics of different attention regions, we need to limit the region connections of each input image; by introducing an adjacency mask M ∈ R^(m×m), we obtain new regional attention weights:
η̃_ij = M_ij·exp(e_ij) / Σ_(j'=1)^(m) M_ij'·exp(e_ij'), with e_ij = (W_q·v_i)ᵀ(W_k·v_j)/√d (22)
M is binary: it is set to 1 when position j lies within the attention area of the target element. Attention operations are thereby limited to a certain number of image regions to explore the intra-modal semantic relationships. Finally, the output of the router attention module is given as follows:
v̂_i = Σ_(j=1)^(m) η̃_ij·(W_v·v_j) (23)
In this way, the number of regions in the image is limited, which greatly reduces the computational complexity and the errors caused by data redundancy.
Fig. 3 is a schematic diagram of a model of the cross-modal screening module. As shown in fig. 3, to better aggregate shared semantics and bridge the semantic gap between vision and text, a cross-modal screening module is proposed to suppress the interference of meaningless alignments, further reducing the time consumption of the whole model. In particular, given local and global feature representations, we can compute a new synthesized feature representation from:
X*=cat((X,duplicate(Xg)),1) (24)
E*=cat((E,duplicate(Eg)),1) (25)
To ensure that X_g (E_g) can be concatenated with X (E), it is expanded into a matrix by duplicating it into multiple rows. We then compute the shared semantic representation between each pair by a cosine similarity function:
s_ij = x_iᵀ·e_j / (‖x_i‖·‖e_j‖) (26)
where s_ij represents the correlation between the i-th region and the j-th word. Given the image query X*, the attention weight of each region over the text set is computed as:
a_ij = exp(λ·s_ij) / Σ_(j'=1)^(n) exp(λ·s_ij') (27)
where λ is a factor controlling the smoothness of the attention distribution. The text attention feature L_e is derived by a weighted combination of word features:
L_e^i = Σ_(j=1)^(n) a_ij·e_j (28)
Similarly, given a text query, the attention weight of each word over the image set is:
a′_ij = exp(λ·s_ij) / Σ_(i'=1)^(m) exp(λ·s_i'j) (29)
The image-level attention feature is derived from a weighted combination of image region features:
L_x^j = Σ_(i=1)^(m) a′_ij·x_i (30)
Then, the attention semantic features L (image-level L_x / text-level L_e) are first mapped to generate a filter vector L_α and a reset vector L_β as follows:
Lα=tanh(MLP(L)) (31)
Lβ=MLP(L) (32)
The reset feature is then obtained by
R=Relu(Q⊙Lα+Lβ) (33)
where Q = E* or X*. Finally, the similarity score of the image and text can be calculated by the following formula:
S=sigmoid(FC(μRx+εRe)) (34)
Fig. 4 and 5 are graphs comparing the results of the cross-modal matching method of the cross-modal attention screening network based on dynamic routing with those of other networks on the MSCOCO-Flickr30K and MSCOCO-5K datasets, respectively. As shown in figs. 4 and 5, the results of the cross-modal matching method of the cross-modal attention screening network based on dynamic routing are more accurate than those of other models.
Fig. 6 and 7 are graphs of visualized results of the cross-modal matching method of the cross-modal attention screening network based on dynamic routing in the image-to-text and text-to-image matching directions. As shown in fig. 6, given an image, the cross-modal attention screening network based on dynamic routing can match the corresponding text. As shown in fig. 7, given a text, the cross-modal attention screening network based on dynamic routing can match the corresponding pictures.
The invention provides a novel cross-modal attention screening network with dynamic routing to explore potential cross-modal relationships. The semantic relationship between image and text is learned by retaining meaningful interaction features and filtering irrelevant information, thereby suppressing the interference of useless interaction information. A dynamic routing attention module is designed that selects the corresponding attention area blocks according to the complexity of the image, reducing computational cost and redundant computation. Extensive experiments on Flickr30K and MSCOCO demonstrate the superiority of our model (IASDR) over several prior methods. In the future, we will further study more efficient dynamic model mechanisms and conduct cross-modal matching studies in few-shot settings.
Finally, the details of the above examples are provided only to illustrate the invention; any modifications, improvements, substitutions, etc. of the above embodiments shall fall within the scope of the claims of the invention.

Claims (4)

1.基于动态路由的跨模态注意力筛选网络的跨模态匹配方法,其特征在于,所述方法包括以下步骤:1. A cross-modal matching method of a cross-modal attention screening network based on dynamic routing, characterized in that the method comprises the following steps: S1.构建动态路由器,根据图像的复杂程度选择相应的注意力区域块,从而具有区域数量决策的能力;S1. Build a dynamic router to select the corresponding attention area block according to the complexity of the image, so as to have the ability to decide the number of areas; S2.结合S1中的动态路由器,设计一个动态路由注意力模块,通过对定义的注意力区域数量构造不同的邻接掩码来解决参数冗余和计算问题;S2. Combined with the dynamic router in S1, a dynamic routing attention module is designed to solve the parameter redundancy and calculation problems by constructing different adjacency masks for the defined number of attention regions; S3.构建跨模态筛选模块,保留有意义的交互特征,过滤不相关的信息,抑制无意义的图像-文本对的干扰,学习图像和文本之间的语义关系;S3. Construct a cross-modal filtering module to retain meaningful interactive features, filter irrelevant information, suppress the interference of meaningless image-text pairs, and learn the semantic relationship between images and texts; S4.结合S2中的模块和S3中的模块构建基于动态路由的跨模态注意力筛选网络的跨模态匹配方法的整体架构;S4. Combining the modules in S2 and S3 to construct the overall architecture of the cross-modal matching method based on the dynamic routing cross-modal attention screening network; S5.基于动态路由的跨模态注意力筛选网络的跨模态匹配方法的训练;S5. Training of cross-modal matching method based on dynamic routing cross-modal attention screening network; 所述S1的具体过程为:The specific process of S1 is as follows: 为了实现对每个图像的动态选择,我们考虑了一个由多块结构组成的网络,其中每个块都配有不同设置的模块,具体地说,给定图像特征V∈Rd×m,路由空间可以定义为A=[A0,...,Ab],b表示注意力区域块的数量,d是图像特征的维度,经过路由后的特征可以定义为:In order to achieve dynamic selection for each image, we consider a network composed of multiple blocks, where each block is equipped with modules with different settings. Specifically, given the image feature V∈Rd ×m , the routing space can be defined as A=[ A0 ,..., Ab ], where b represents the number of attention area blocks and d is the dimension of the image feature. 
The feature after routing can be defined as: 其中α为路由器预测的选择概率,A为单元操作的集合,在每个节点内设计一个基本单元和相应的路由来选择特定的区域特征,对于每幅图像,我们设计了3种不同数量的候选区域块,分别选择置信度分数排名最高的前g个区域,图像特征将用于单元内部和路线内部的特征转换;Where α is the selection probability predicted by the router, A is the set of unit operations, and a basic unit and corresponding routes are designed in each node to select specific regional features. For each image, we designed three different numbers of candidate region blocks and selected the top g regions with the highest confidence scores. Image features will be used for feature conversion within the unit and within the route. 我们提出的软路由器可以看作是一个块决策过程,根据图像的全局含义生成每条路径的路由概率,给定图像特征V∈Rd×m,对每个注意力块的选择预测概率α∈Rb可以由下式得到:The soft router we proposed can be viewed as a block decision process that generates the routing probability of each path based on the global meaning of the image. Given the image features V∈Rd ×m , the selection prediction probability α∈Rb for each attention block can be obtained as follows: V*=softmax(FC2(V))V (2)V * = softmax(FC 2 (V))V (2) α=relu(tanh(FC1(V*))) (3)α=relu(tanh(FC 1 (V * ))) (3) 其中,FC为全连接函数;Among them, FC is the fully connected function; 所述S2的具体过程为:The specific process of S2 is: 在动态路由注意力模块中,我们采用自注意力和身份映射作为每个单元的实现,然而,自我关注的点积操作会产生昂贵的计算和巨大的内存占用,在这种情况下,减少大量的计算量是至关重要的,因此,我们调整了每个输入图像的注意区域数量,这是与现有的基于自我注意的方法的关键区别,通过重新审视标准自我注意的定义,可以得到区域注意力权重:In the dynamic routing attention module, we adopt self-attention and identity mapping as the implementation of each unit. However, the dot product operation of self-attention will produce expensive calculations and huge memory usage. In this case, it is crucial to reduce the large amount of calculations. Therefore, we adjust the number of attention regions for each input image, which is a key difference from existing self-attention based methods. 
By revisiting the definition of standard self-attention, the regional attention weights can be obtained: 其中,ηij测量第j个位置对第i个位置的影响,m表示图像中区域的个数, ηij可以看作是一个图像种不同区域间的完全连接图,Among them, η ij measures the influence of the j-th position on the i-th position, m represents the number of regions in the image, η ij can be regarded as a fully connected graph between different regions of an image. 为了获得不同的关注区域的特点,我们需要限制每个输入图像的区域连接,因此通过引入一个邻接mask M∈Rm×m,可以得到新的区域注意力权重:In order to obtain the characteristics of different attention areas, we need to limit the regional connections of each input image. Therefore, by introducing an adjacency mask M∈R m×m , we can get a new regional attention weight: M是二进制的值,当它在目标元素的注意区域内时,将其设置为1,因此,将注意力操作限制在一定数量的图像区域内,以探讨模内语义关系,最后给出了路由器注意模块的输出,如下式所示:M is a binary value, which is set to 1 when it is within the attention region of the target element. Therefore, the attention operation is limited to a certain number of image regions to explore the intra-module semantic relationship. Finally, the output of the router attention module is given as shown in the following formula: 其中,这样,图像中区域的数量就受到了限制,这将大大降低计算复杂度和数据冗余带来的误差。in, In this way, the number of regions in the image is limited, which will greatly reduce the computational complexity and errors caused by data redundancy. 2.根据权利要求1所述的基于动态路由的跨模态注意力筛选网络的跨模态匹配方法,其特征在于,所述S3的具体过程为:2. According to the cross-modal matching method of the cross-modal attention screening network based on dynamic routing according to claim 1, it is characterized in that the specific process of S3 is: 为了更好地聚集共享语义,弥合视觉和文本之间的语义鸿沟,提出了一种跨模态筛选模块来抑制无意义对齐的干扰,从而进一步减少了整个模型的时间消耗,特别地,给定局部和全局特征表示,我们可以由下式计算一个新的合成特征表示:In order to better aggregate shared semantics and bridge the semantic gap between vision and text, a cross-modal filtering module is proposed to suppress the interference of meaningless alignment, thereby further reducing the time consumption of the entire model. 
In particular, given the local and global feature representations, a new composite feature representation can be computed as:

X* = cat((X, duplicate(Xg)), 1) (7)

E* = cat((E, duplicate(Eg)), 1) (8)

To ensure that Xg (Eg) can be concatenated with X (E), it is duplicated across multiple rows and expanded into the matrix X* (E*). We then compute the shared semantic representation between each pair with the cosine similarity function:

s_ij = x_i*ᵀe_j* / (‖x_i*‖‖e_j*‖) (9)

where s_ij denotes the correlation between the i-th region and the j-th word, and x_i* and e_j* are the i-th row of X* and the j-th row of E*. Given an image query X*, the attention weight of each region over the word set is computed as:

α_ij = exp(λs_ij) / Σ_{j=1}^{n} exp(λs_ij) (10)

where λ is a factor controlling the smoothness of the attention distribution. The text attention feature Le is obtained as a weighted combination of the word features:

Le_i = Σ_{j=1}^{n} α_ij e_j* (11)

Similarly, given a text query, the attention weight of each word over the image regions is computed as:

α'_ij = exp(λs_ij) / Σ_{i=1}^{m} exp(λs_ij) (12)

and the image-level attention feature is obtained as a weighted combination of the image region features:

Lx_j = Σ_{i=1}^{m} α'_ij x_i* (13)

The attention semantic feature L (image-level Lx / text-level Le) is then mapped to generate a filter vector Lα and a reset vector Lβ:

Lα = tanh(MLP(L)) (14)

Lβ = MLP(L) (15)

The reset feature is then obtained by

R = Relu(Q⊙Lα + Lβ) (16)

where Q = E* or X*. Finally, the image-text similarity score can be computed as:

S = sigmoid(FC(μRx + εRe)) (17).
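The cosine-similarity cross-attention and the filter/reset gating described above can be sketched together. This is a minimal numpy illustration under stated assumptions: the λ value is arbitrary, and a single linear map W stands in for the patent's MLPs.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_text(X, E, lam=9.0):
    """Cross-modal attention sketch: each image region gathers a
    cosine-similarity-weighted combination of word features.

    X : (m, d) region features; E : (n, d) word features.
    lam is the smoothness factor (the value here is an assumption).
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    S = Xn @ En.T                  # (m, n) cosine similarities s_ij
    A = softmax(lam * S, axis=1)   # attention over words for each region
    return A @ E                   # (m, d) attended text features Le

def reset_feature(Q, L, W):
    """Filter/reset gating sketch: L is mapped to a filter vector
    (tanh) and a reset vector, then combined with the query features Q
    as R = relu(Q * filter + reset)."""
    l_alpha = np.tanh(L @ W)                      # filter vector
    l_beta = L @ W                                # reset vector
    return np.maximum(0.0, Q * l_alpha + l_beta)  # reset feature R
```

The text-query direction is symmetric: normalize over regions instead of words and combine region features. The relu in the gating is what suppresses (filters out) dimensions with weak cross-modal support.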
3. The cross-modal matching method based on a cross-modal attention screening network with dynamic routing according to claim 1, wherein the specific process of S4 is:

The cross-modal matching method comprises a dynamic routing attention module, a cross-modal filtering module, and the cross-modal attention screening network with dynamic routing itself.

4. The cross-modal matching method based on a cross-modal attention screening network with dynamic routing according to claim 1, wherein the specific process of S5 is:

The network is trained as follows: our experiments are implemented with the PyTorch framework on a single Nvidia Tesla P100 GPU. For each image, a Faster-RCNN model with a ResNet-101 backbone is used for region detection and feature extraction; we select the K=36 regions with the highest confidence scores, with a 1024-dimensional feature per region. For each text, we use a BERT-Base model (12 layers, hidden size 768, 12 heads) to obtain the original 768-dimensional word embeddings. The latent common embedding dimension is set to 1024, and the Adam optimizer is used to train the model for 20 and 30 epochs on the MSCOCO and Flickr30k datasets, respectively.
The learning rate is initially set to 5e-9 and dropped by a factor of 10 every 10 or 15 epochs, respectively. The margin parameter α is set to 0.2, the mini-batch size is 64, and the gradient clipping threshold is 2.0.
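The step decay described above can be written as a small schedule function. The closed form below is an assumption consistent with "dropped by a factor of 10 every 10 or 15 epochs"; it is a sketch, not the patent's training code.

```python
def learning_rate(epoch, base_lr=5e-9, decay_every=10):
    """Step schedule sketch: start at base_lr and drop by a factor of
    10 every `decay_every` epochs (10 for MSCOCO, 15 for Flickr30k,
    per the training setup above)."""
    return base_lr * (0.1 ** (epoch // decay_every))
```

In PyTorch this corresponds to wrapping the Adam optimizer in a `StepLR` scheduler with `step_size=decay_every` and `gamma=0.1`.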
CN202210364577.5A 2022-04-08 2022-04-08 Cross-modal matching method based on cross-modal attention screening network with dynamic routing Active CN114676228B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210364577.5A CN114676228B (en) 2022-04-08 2022-04-08 Cross-modal matching method based on cross-modal attention screening network with dynamic routing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210364577.5A CN114676228B (en) 2022-04-08 2022-04-08 Cross-modal matching method based on cross-modal attention screening network with dynamic routing

Publications (2)

Publication Number Publication Date
CN114676228A CN114676228A (en) 2022-06-28
CN114676228B (en) 2025-03-07

Family

ID=82077860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210364577.5A Active CN114676228B (en) 2022-04-08 2022-04-08 Cross-modal matching method based on cross-modal attention screening network with dynamic routing

Country Status (1)

Country Link
CN (1) CN114676228B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115392389B (en) * 2022-09-01 2023-08-29 北京百度网讯科技有限公司 Cross-modal information matching, processing method, device, electronic device and storage medium
CN116702904A (en) * 2023-04-28 2023-09-05 上海孚典智能科技有限公司 Stage-by-stage artificial intelligence algorithm reasoning calculation method and system
CN119089259B (en) * 2024-08-01 2025-09-26 广州大有网络科技有限公司 A hybrid modality expert emotion recognition method and system

Citations (2)

Publication number Priority date Publication date Assignee Title
CN112084358A (en) * 2020-09-04 2020-12-15 中国石油大学(华东) Image-text matching method based on regional enhanced network with theme constraint
CN113065012A (en) * 2021-03-17 2021-07-02 山东省人工智能研究院 A Graphical and Text Analysis Method Based on Multimodal Dynamic Interaction Mechanism

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US8140335B2 (en) * 2007-12-11 2012-03-20 Voicebox Technologies, Inc. System and method for providing a natural language voice user interface in an integrated voice navigation services environment
US11080656B2 (en) * 2019-04-11 2021-08-03 Prime Research Solutions LLC Digital screening platform with precision threshold adjustment

Also Published As

Publication number Publication date
CN114676228A (en) 2022-06-28

Similar Documents

Publication Publication Date Title
CN112966127B (en) Cross-modal retrieval method based on multilayer semantic alignment
CN114297473B (en) News event searching method and system based on multistage image-text semantic alignment model
CN114676228B (en) Cross-modal matching method based on cross-modal attention screening network with dynamic routing
CN114969278A (en) Knowledge enhancement graph neural network-based text question-answering model
CN115331075B (en) Antagonistic multi-mode pre-training method with knowledge enhancement of multi-mode scene graph
CN109992779B (en) Emotion analysis method, device, equipment and storage medium based on CNN
CN116097250A (en) Layout aware multimodal pre-training for multimodal document understanding
CN110322446A (en) A kind of domain adaptive semantic dividing method based on similarity space alignment
CN114969367B (en) Cross-language entity alignment method based on multi-aspect sub-task interaction
CN116128024A (en) Multi-perspective comparative self-supervised attribute network outlier detection method
CN115098646A (en) A multi-level relationship analysis and mining method for graphic data
CN118197402B (en) Method, device and equipment for predicting drug target relation
Xue et al. An effective linguistic steganalysis framework based on hierarchical mutual learning
Li et al. WDAN: A weighted discriminative adversarial network with dual classifiers for fine-grained open-set domain adaptation
Gao et al. A hierarchical recurrent approach to predict scene graphs from a visual‐attention‐oriented perspective
CN117539999A (en) Cross-modal joint coding-based multi-modal emotion analysis method
CN113792144A (en) A Text Classification Method Based on Semi-Supervised Graph Convolutional Neural Networks
CN116881416B (en) Instance-level cross-modal retrieval method based on relational reasoning and cross-modal independent matching network
Peng et al. A new self-supervised task on graphs: Geodesic distance prediction
CN119806502B (en) A code completion method based on retrieval enhancement and multimodality
CN119068259B (en) Pathological image classification method and system based on online pseudo-supervision and dynamic mutual learning
Deng et al. RETRACTED ARTICLE: Multimedia data stream information mining algorithm based on jointed neural network and soft clustering
CN118587494A (en) An image classification method based on CNN neural network
CN117150129A (en) Recommendation information determining method, device, computer equipment and storage medium
Huangfu et al. Question-guided graph convolutional network for visual question answering based on object-difference

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant