
CN114821223B - Pre-training image text model processing method and image-text retrieval system - Google Patents

Pre-training image text model processing method and image-text retrieval system

Info

Publication number
CN114821223B
CN114821223B (application CN202210327383.8A)
Authority
CN
China
Prior art keywords
image
text
masked
training
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210327383.8A
Other languages
Chinese (zh)
Other versions
CN114821223A (en)
Inventor
季葛鹏
高德宏
宁伟
仇光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Alibaba Overseas Internet Industry Co ltd
Original Assignee
Hangzhou Alibaba Overseas Internet Industry Co ltd
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Alibaba Overseas Internet Industry Co ltd, Alibaba China Co Ltd filed Critical Hangzhou Alibaba Overseas Internet Industry Co ltd
Priority to CN202210327383.8A priority Critical patent/CN114821223B/en
Publication of CN114821223A publication Critical patent/CN114821223A/en
Application granted granted Critical
Publication of CN114821223B publication Critical patent/CN114821223B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a pre-training image text model processing method and an image-text retrieval system. The method comprises: obtaining mask training sample pairs in which words and image blocks of an image-text sample pair are masked; inputting the mask training sample pairs into a pre-training image text model, which comprises a multi-stage downsampling encoder and a multi-stage upsampling decoder; obtaining the loss values output for the masked words, the masked image blocks, and the image-text task; and adjusting parameters of the pre-training image text model according to the loss values. By combining patch embedding of the image with a model structure of a multi-stage downsampling encoder and a stage-by-stage corresponding upsampling decoder, the invention achieves pixel-level reconstruction of the masked image blocks in the pre-training vision-language network. Further, by combining a self-built residual sub-network, which embeds the input image and text, with the above encoder-decoder structure, end-to-end multi-modal pre-training is achieved.

Description

Pre-training image text model processing method and image-text retrieval system
Technical Field
The disclosure relates to the field of deep learning, in particular to a pre-training image text model processing method and an image-text retrieval system.
Background
With the advent of the information age, multimedia data (including text, images, voice, video, etc.) on the internet has deeply penetrated every aspect of people's daily lives. How to efficiently parse, from massive multimedia data, content that accords with human semantic understanding, and how to give accurate, relevant feedback according to the behavior habits of specific users, have become research hotspots in academia and industry in recent years.
Traditional single-modality techniques, such as pure image or pure text retrieval, can no longer meet increasingly diverse demands because of their single data form. In contrast, receiving diversified perceptions allows artificial agents to understand things more comprehensively and efficiently, which also conforms more closely to the multi-sensory cognition of humans.
Therefore, a scheme for mining image-text related information more deeply and accurately is needed.
Disclosure of Invention
The technical problem to be solved by the present disclosure is to provide a pre-training image text model processing method that achieves pixel-level reconstruction of masked image blocks in a pre-training vision-language network through patch embedding of the image and a model structure combining a multi-stage downsampling encoder with a stage-by-stage corresponding upsampling decoder. Further, by combining a self-built residual sub-network, which embeds the input image and text, with the above encoder-decoder structure, end-to-end multi-modal pre-training is achieved.
According to a first aspect of the present disclosure, there is provided a pre-training image text model processing method, including: obtaining a pair of mask training samples in which words and image blocks of an image-text sample pair are masked; inputting the pair of mask training samples into the pre-training image text model; obtaining loss values output by the pre-training image text model for the masked words, the masked image blocks, and the image-text task, wherein the pre-training image text model includes a multi-stage downsampling encoder and a multi-stage upsampling decoder; and adjusting parameters of the pre-training image text model according to the loss values.
Optionally, inputting the pair of mask training samples into the pre-training image text model includes generating text embedding vectors and image embedding vectors from the pair of mask training samples using an embedding transformation sub-network, wherein an image in the pair of image text samples is divided into a plurality of blocks, and generating the image embedding vectors based on the divided blocks.
Optionally, inputting the pair of mask training samples into the pre-training image text model includes performing the following operations in each stage of the downsampling encoder: concatenating the text embedding vector with the flattened image embedding vector to obtain a composite vector; spatially reducing the composite vector and feeding it into an encoder unit consisting of a multi-head attention sub-network and a feed-forward network; and splitting the composite vector processed by the encoder unit to obtain the downsampled text embedding vector, the remainder being reshaped to obtain the downsampled image embedding vector.
Optionally, in each stage of the downsampling encoder, the image embedding vector is processed by a convolution module prior to flattening.
Optionally, inputting the pair of mask training samples into the pre-training image text model includes feeding the multi-stage downsampled image embedding vectors into the multi-stage upsampling decoder to obtain processed image embedding vectors of the same dimensions as the input image.
Optionally, the loss value of the pre-training image text model for the masked image blocks is obtained using the processed image embedding vectors for pixel-level reconstruction of the masked image blocks.
Optionally, the number of partitioned patches occupied by a masked image block is set based on the granularity of the image features to be extracted.
Optionally, obtaining the loss values output by the pre-training image text model for the masked words, the masked image blocks, and the image-text task includes calculating the loss values for the masked words and the image-text task based on the output of the multi-stage downsampling encoder, and calculating the loss value for the masked image blocks based on the output of the multi-stage upsampling decoder.
According to a second aspect of the present disclosure, there is provided an image-text retrieval system, including a query information acquisition module configured to acquire text and/or image information input by a user, and a pre-training image text model obtained according to the method of the first aspect, configured to output matched image-text information based on the text and/or image information input by the user.
According to a third aspect of the present disclosure there is provided a computing device comprising a processor and a memory having executable code stored thereon which when executed by the processor causes the processor to perform the method of the first aspect described above.
According to a fourth aspect of the present disclosure there is provided a non-transitory machine-readable storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform the method as described in the first aspect above.
Therefore, by combining end-to-end image patching, a variable-granularity mask strategy, and multi-stage pyramid encoding with subsequent inverted-pyramid decoding, the scheme achieves image-text information learning at variable granularity, and is particularly suitable for application scenarios involving finer-granularity feature learning, such as fashion images.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary embodiments of the disclosure as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout exemplary embodiments of the disclosure.
FIG. 1 shows a schematic flow chart of a pre-training image text model processing method according to one embodiment of the invention.
Fig. 2 shows an example of masking training sample pairs.
Figures 3A-B illustrate two embodiments of the pre-training network architecture of the present invention.
FIG. 4 illustrates one example of acquiring a text token and an image patch based on an original image text sample pair.
Fig. 5 illustrates a masked vision-language transformer for cross-modal representation of fashion.
Fig. 6 shows a comparative example of prior art and the PVT-based masking strategy of the present invention.
Fig. 7 shows an example of the image-text retrieval system of the invention.
FIG. 8 illustrates a schematic diagram of a computing device that may be used to implement the pre-training image text model processing method described above, according to one embodiment of the invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As described above, how to efficiently parse, from massive multimedia data, content that accords with human semantic understanding, and give accurate, relevant feedback (including application scenarios such as search, recommendation, and advertising) according to the behavior habits of specific users, has become a research hotspot in academia and industry in recent years.
Traditional single-modality techniques (e.g., pure image or pure text retrieval) can no longer meet increasingly diverse demands because of their single data form. In contrast, receiving diversified perceptions allows artificial agents to understand things more comprehensively and efficiently, which also conforms more closely to the multi-sensory cognition of humans. Multi-modal techniques are likewise favored by industry for their superior performance on many semantically rich understanding tasks.
In the context of deep learning, how to make a machine give feedback in a way similar to human intelligence is a hot problem in the field. Existing multi-modal pre-training solutions are mostly aimed at improving the Transformer module, and in application mostly focus on image-text tasks in the general domain. There is a lack of an optimized solution for accurately learning fine-grained image information and relating it to text.
Therefore, the method achieves pixel-level reconstruction of the masked image blocks in the pre-training vision-language network through patch embedding of the image and a model structure combining a multi-stage downsampling encoder with a stage-by-stage corresponding upsampling decoder. Further, by combining a self-built residual sub-network, which embeds the input image and text, with the above encoder-decoder structure, end-to-end multi-modal pre-training is achieved. A system using this model can achieve more accurate image-text retrieval, image classification, and even pixel-level reconstruction of the masked part, and is particularly suitable for various e-commerce application scenarios.
FIG. 1 shows a schematic flow chart of a pre-training image text model processing method according to one embodiment of the invention.
In step S110, a pair of mask training samples is acquired by masking words and image blocks in an image-text sample pair. Fig. 2 shows an example of a pair of mask training samples. As shown, the image-text sample pair may be the image shown on the left side of Fig. 2 and its corresponding textual description (the illustrated "…'S SLEEVELESS LONG DRESS"), and the mask training sample pair (e.g., the illustrated T and V) may be obtained by masking specific image blocks in the image (e.g., the illustrated dashed boxes) and masking specific words in the textual description (e.g., the illustrated [MASK]).
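The acquisition of a mask training sample pair can be sketched as follows. This is a minimal pure-Python illustration; the tokenizer, mask ratios, and helper names are hypothetical and not taken from the patent:

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, ratio, rng):
    """Randomly replace a fraction `ratio` of the tokens with [MASK]."""
    n = max(1, int(len(tokens) * ratio))
    idx = set(rng.sample(range(len(tokens)), n))
    return [MASK if i in idx else t for i, t in enumerate(tokens)]

def mask_patches(num_patches, ratio, rng):
    """Return the indices of image patches to mask (zero-fill later)."""
    n = max(1, int(num_patches * ratio))
    return sorted(rng.sample(range(num_patches), n))

rng = random.Random(0)
text = "women s sleeveless long dress".split()     # toy description
masked_text = mask_tokens(text, ratio=0.3, rng=rng)
masked_idx = mask_patches(num_patches=16, ratio=0.25, rng=rng)  # 4x4 grid
```

Together, `masked_text` and `masked_idx` play the roles of the illustrated T and V: a description with some words replaced by [MASK] and an image with some blocks selected for masking.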
In step S120, the pair of mask training samples is input into the pre-training image text model, which may include a multi-stage downsampling encoder and a multi-stage upsampling decoder, and the loss values output by the model for the masked words, the masked image blocks, and the image-text task are thereby obtained. Subsequently, in step S130, parameters of the pre-training image text model are adjusted according to the loss values.
It will be appreciated that the pre-training image text model of the present application has a pyramid-type multi-stage encoder (i.e., the feature map shrinks stage by stage) and an inverted-pyramid-type multi-stage decoder (i.e., the feature map is enlarged stage by stage), and may have three different training tasks: a prediction task for the masked words (i.e., a language task, such as the MLM task described below), an image-language task (also referred to as a vision-language or VL task, such as the ITM task described below), and a task for the masked image blocks (i.e., a vision task). In the present application, the vision task may in particular be implemented as a masked image reconstruction task (i.e., the MIR task described in detail below).
In one embodiment, the output of the multi-level encoder may be used directly for language tasks, VL tasks, and visual tasks. The image-language pre-training network thus trained can also perform well in downstream tasks.
In a preferred embodiment, the image embedding vectors output by the multi-stage encoder may be further fed into the multi-stage upsampling decoder to obtain processed image embedding vectors of the same dimensions as the input image. In this case, obtaining the loss values output by the pre-training image text model for the masked words, the masked image blocks, and the image-text task may include calculating the loss values for the masked words and the image-text task based on the output of the multi-stage downsampling encoder, and calculating the loss value for the masked image blocks based on the output of the multi-stage upsampling decoder. Thereby, the loss value of the pre-training image text model for the masked image blocks is obtained using the processed image embedding vectors for pixel-level reconstruction of the masked image blocks.
Figures 3A-B illustrate two embodiments of the pre-training network architecture of the present invention. As shown in Figs. 3A and 3B, the input text and the image described by that text undergo certain processing to obtain a number of text tokens and image patches, which are then converted into text embeddings and image embeddings ("embedding vectors"). The image embedding vectors and the text embedding vectors are then fed into a multi-modal PVT (Pyramid Vision Transformer). Here, "multi-modal" indicates that the PVT, a pyramid model structure consisting of multi-stage downsampling Transformer encoders, processes both text-type and image-type input. In the example of Fig. 3A, the loss values may be calculated directly based on the output of the multi-modal PVT, thereby implementing the language task, VL task, and vision task of the pre-training model. In the preferred embodiment shown in Fig. 3B, the image embedding vectors output by the multi-modal PVT may then be upsampled through an inverted-pyramid-type multi-stage Transformer decoder, thereby obtaining processed image embedding vectors of the same dimensions as the input image. Subsequently, the MIR (masked image reconstruction) task of pixel-level reconstruction can be realized.
FIG. 4 illustrates one example of acquiring text tokens and image patches based on an original image-text sample pair. As shown, the original sample includes an image (at the lower right of Fig. 4, a man wearing a hooded zip-up shirt and jeans) and a description of that image (at the lower left of Fig. 4: "Long sleeve hoodie in black. Drawstring closure at hood. Zip closure and patch pockets at front. Rib knit sleeve cuffs and hem…"). The original sample may be obtained from an e-commerce website selling clothing, where the image is a product picture and the description is the product title or description.
Subsequently, to obtain text tokens, the input text may first be tokenized word by word to obtain a token sequence, and whole-word masking may then be applied to obtain masked text tokens.
Similarly, the patch processing of an image differs from the RoI (region of interest) approach common in the art: the image is segmented into small blocks (usually squares) of the same pixel size, and each patch can be regarded as an "image token". For each patch, the output of PatchNet (the patch network) may be taken as the patch features, which may then be masked to obtain masked image patches. These patches are naturally ordered, and their spatial locations can be used in subsequent processing to position the embedding vectors (see Fig. 5).
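A rough sketch of this patch partition, using a pure-Python nested-list "image" with illustrative sizes:

```python
def partition_into_patches(image, P):
    """Split an H x W image (list of rows) into non-overlapping P x P patches,
    returned in row-major order so each patch keeps an implicit spatial index."""
    H, W = len(image), len(image[0])
    assert H % P == 0 and W % P == 0, "image size must be divisible by patch size"
    patches = []
    for by in range(0, H, P):
        for bx in range(0, W, P):
            patches.append([row[bx:bx + P] for row in image[by:by + P]])
    return patches

# 8x8 "image" whose pixel value equals its linear index; patch size P = 4
img = [[y * 8 + x for x in range(8)] for y in range(8)]
patches = partition_into_patches(img, P=4)
print(len(patches))  # HW / P^2 = 64 / 16 = 4
```

The row-major ordering is what gives each patch the natural spatial position used for positional embedding later.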
In one embodiment of the invention, the text tokens and image patches may be fed into a standard ResNet to obtain the text and image embedding vectors. In yet another, more preferred embodiment, the embedding vectorization of the input sample pairs may be achieved using an embedding transformation sub-network included in the pre-training image text model of the present invention. Thus, inputting the pair of mask training samples into the pre-training image text model may include generating text embedding vectors and image embedding vectors from the pair of mask training samples using an embedding transformation sub-network (e.g., a self-built ResNet of simplified structure), wherein an image in the pair of image-text samples is partitioned into a plurality of patches, and the image embedding vectors are generated based on the partitioned patches.
Thus, the input to the multi-stage downsampling encoder is the embedding shown in Figs. 3A-B, namely a text embedding vector and an image embedding vector. By using a self-built embedding transformation sub-network (e.g., a residual network with a simplified structure compared to standard ResNet), offline pre-processing of the input image-language pairs can be avoided, enabling online execution of downstream tasks. Further, although the embedding transformation sub-network of the present invention has a simplified structure, it can still perform excellently in downstream tasks, since it is included within the pre-training image text model of the present invention, i.e., the parameters of the sub-network are trained together with those of the downstream PVT and the optional multi-stage decoder.
In the present invention, each stage of the downsampling encoder can be regarded as a Transformer encoder capable of downsampling. To this end, inputting the pair of mask training samples into the pre-training image text model may include, in each stage of the downsampling encoder: concatenating the text embedding vector with the flattened image embedding vector into a composite vector; spatially reducing the composite vector and feeding it into an encoder unit composed of a multi-head attention sub-network and a feed-forward network; and splitting the composite vector processed by the encoder unit to obtain the downsampled text embedding vector, the remainder being reshaped to obtain the downsampled image embedding vector. Thus, each stage concatenates the text and image embedding vectors, processes them in the encoder, and after processing performs splitting and shape reconstruction (the latter only for the image embedding). Preferably, in each stage of the downsampling encoder, the image embedding vector is processed by a convolution module (e.g., a Conv2D block) prior to flattening, exploiting the downsampling-friendly nature of the convolution operation. Further, the image embedding vector that has been downsampled in multiple stages may be fed into the multi-stage upsampling decoder described above in connection with Fig. 3B to obtain a processed image embedding vector of the same dimensions as the input image. The processed image embedding vector may then be used for pixel-level reconstruction of the masked image blocks, from which the loss value of the pre-training image text model for the masked image blocks is obtained.
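The concatenate-encode-split flow of one downsampling stage can be sketched shape-wise as follows. This is pure Python with illustrative names; the `encode` stub stands in for the multi-head attention and feed-forward processing:

```python
def encode(z):
    """Identity stand-in for the Transformer encoder unit
    (multi-head attention sub-network + feed-forward network)."""
    return z

def stage(text_emb, image_emb_2d):
    """One encoder stage, tracking shapes only: flatten the image embedding,
    concatenate with the text embedding, 'encode', then split and reshape."""
    L = len(text_emb)                                 # number of text positions
    h, w = len(image_emb_2d), len(image_emb_2d[0])
    flat = [v for row in image_emb_2d for v in row]   # flatten h*w positions
    z = text_emb + flat                               # composite vector <m; n>
    z = encode(z)                                     # encoder unit (stub)
    text_out, img_flat = z[:L], z[L:]                 # split text part off
    img_out = [img_flat[i * w:(i + 1) * w] for i in range(h)]  # reshape rest
    return text_out, img_out

t = ["t%d" % i for i in range(3)]                      # 3 text positions
v = [["v%d%d" % (y, x) for x in range(4)] for y in range(2)]  # 2x4 image grid
t2, v2 = stage(t, v)
```

Only the image part is reshaped back to a 2D grid, matching the description that shape reconstruction applies to the image embedding alone.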
In one embodiment, the masked image block size may differ from the patch size. In other words, the number of partitioned patches occupied by a masked image block may be set based on the granularity of the image features to be extracted. For example, whether a masked image block covers 1 (1x1), 4 (2x2), 9 (3x3), or even more patches may be determined based on the setting of the α value described below.
To better illustrate the principles of the present invention, a preferred end-to-end model embodiment, including details of the encoder structure, will be described below in conjunction with Fig. 5. Fig. 5 shows one example of an end-to-end VL pre-training model in accordance with the present invention.
Fig. 5 shows a masked vision-language transformer (M-ViLT) for cross-modal representation of fashion, especially in the e-commerce domain. In the model of the invention, PVT (Pyramid Vision Transformer) is used in place of BERT, and a Masked Image Reconstruction (MIR) strategy is introduced into PVT, making M-ViLT the first end-to-end framework in the fashion field. M-ViLT is a scalable and convenient architecture that accepts raw multi-modal input without additional pre-processing models (e.g., ResNet), implicitly models vision-language alignment, and can be easily generalized to various downstream matching and generation tasks.
In order to solve the problems of insufficient granularity and poor transferability in the prior art, the invention constructs an end-to-end VL framework specifically for the fashion field. The overall flow of M-ViLT is shown in Fig. 5. The example of Fig. 5 uses a four-stage encoder and generates features of different sizes. The two keys of the proposed architecture are the multi-modal encoder and the pre-training targets.
Multi-modal encoder
As shown in Fig. 5, M-ViLT accepts visual and linguistic input. On the language side, the title of a fashion product (e.g., a clothing or accessory item) is first tokenized, and the title tokens are randomly masked at a mask ratio $r_l$ using the special token $[\mathrm{MASK}]$. A sequence of word tokens is obtained after this masking process. Then, a special $[\mathrm{CLS}]$ token is inserted at the head of the sequence. Furthermore, if the sequence is shorter than 128 tokens, it is padded to the uniform length $L$ using $[\mathrm{PAD}]$ tokens. This process generates the language input ids $T$.
Visually, consider a visual input $I \in \mathbb{R}^{H \times W \times 3}$, where $H$ and $W$ denote the height and width of the input. This input is divided into latticed patches $V = \{v_1, v_2, \ldots, v_Q\}$, where $Q = HW/P^2$ is the total number of patches and $P$ is the patch size. Similarly, a mask ratio $r_v$ is used to mask the partitioned patches. More detail on the masking strategies for the language and visual parts is provided in the description of the "Pre-training targets" below. The multi-modal input is embedded and fed into the following four VL interaction stages (i.e., $k \in \{1, 2, 3, 4\}$). In the first stage, the text embedding vector $T_1$ and the visual embedding vector $V_1$ are generated from the given inputs $T$ and $V$, respectively. For the subsequent stages, only the $k$-th stage is described for brevity. As shown at the bottom of Fig. 5, the text embedding vector $T_k$ is first mapped to the language hidden feature $m_k$ (the linear embedding in the illustration):

$$m_k = T_k W_k^{m} + E_k^{m},$$

where $W_k^{m}$ and $E_k^{m}$ are a learnable linear embedding matrix and a learnable positional embedding matrix, respectively, and $D_k$ is the size of the hidden feature embedding. The visual embedding vector $V_k$ has spatial size $\frac{H}{R_{k-1}} \times \frac{W}{R_{k-1}}$ (with $R_0 = 1$), where $R_k$ denotes the spatial reduction factor of the visual embedding after the $k$-th stage. To obtain pyramid visual features, $V_k$ is then embedded and flattened into the visual hidden feature $n_k$ by a two-dimensional projection (the Conv2D block in the illustration). Specifically, the projection uses a convolution kernel of size $K_k$ with stride $S_k$, which forces the network to reduce the equivalent spatial dimension from $\frac{H}{R_{k-1}} \times \frac{W}{R_{k-1}}$ to $\frac{H}{R_k} \times \frac{W}{R_k}$ (note $R_k = R_{k-1} S_k$ per the table below). This can be expressed as:

$$n_k = \mathrm{Flatten}\big(\mathrm{Conv2D}(V_k;\, K_k, S_k)\big) + E_k^{n},$$

where $E_k^{n}$ denotes the positional embedding matrix. Subsequently, the two VL hidden features are concatenated (Concat) as $z_k = \langle m_k; n_k \rangle$ and fed into a stack of $M_k$ VL Transformer encoders. Each encoder contains a multi-head self-attention layer with spatial reduction (the "reduction" box in the illustration), a multi-layer perceptron, and layer normalization. Finally, the encoded multi-modal feature $z_{k+1} = \langle m_{k+1}; n_{k+1} \rangle$ is obtained and split into a language part $T_{k+1} = m_{k+1}$ and a visual part $V_{k+1} = \mathrm{Reshape}(n_{k+1})$, where the $\mathrm{Reshape}(\cdot)$ operation restores the spatial dimensions of the given feature.
After the four VL interaction stages, four text embedding vectors $\{T_{k+1}\}_{k=1}^{4}$ and four pyramid visual embedding vectors $\{V_{k+1}\}_{k=1}^{4}$ are generated. The following table shows the detailed hyperparameter settings.
Hyperparameter          k=1   k=2   k=3   k=4
Number of layers M_k     2     2     2     2
Hidden dimension D_k    64   128   320   512
Reduction factor R_k     4     8    16    32
Kernel size K_k          4     2     2     2
Stride S_k               4     2     2     2
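As a quick sanity check of the table, assuming a 224x224 input (the input resolution is an assumption, not stated here), the per-stage strides $S_k$ produce the pyramid of feature-map sizes and cumulative reduction factors $R_k$:

```python
S = [4, 2, 2, 2]            # per-stage strides S_k from the table
H = W = 224                 # assumed input resolution (illustrative)
sizes, reductions = [], []
h, w, r = H, W, 1
for s in S:
    h, w, r = h // s, w // s, r * s   # each stage divides the spatial size by S_k
    sizes.append((h, w))
    reductions.append(r)
print(sizes)       # [(56, 56), (28, 28), (14, 14), (7, 7)]
print(reductions)  # [4, 8, 16, 32] -- matches the R_k row of the table
```

The cumulative product of the strides reproduces the tabulated reduction factors, confirming R_k = R_{k-1} * S_k.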
Pre-training targets
To obtain sufficiently discriminative multi-modal representations, three pre-training tasks are employed to establish relationships within and between the raw VL modalities: a vision task (implemented as masked image reconstruction, MIR), a language task (implemented as masked language modeling, MLM), and a vision-language task (implemented as image-text matching, ITM).
Task 1 Mask Image Reconstruction (MIR).
The present invention attempts to construct pixel-to-pixel relationships from the perspective of a generation task, thereby promoting the expressiveness of the visual representation, and designs Masked Image Reconstruction (MIR) to achieve pixel-level reconstruction. To help the model learn better through MIR, a flexible masking strategy is designed around the pyramid features of the PVT architecture: the input image is masked according to a masking-unit matrix composed of small-granularity patches. Given the patch sequence $V = \{v_1, \ldots, v_Q\}$, the masked sequence $\bar{V}$ may be defined as:

$$\bar{V} = \mathcal{M}\big(V, \Phi;\, [\mathrm{ZERO}]\big),$$

where $\mathcal{M}(\cdot)$ denotes the masking-strategy function (or process), $q$ indexes a randomly selected masking-unit region, and $[\mathrm{ZERO}]$ denotes filling the selected region with the pixel value zero (0). The masking units $u_q$ are derived from the following index function:

$$u_q = \begin{cases} [\mathrm{ZERO}], & q \in \Phi, \\ v_q, & q \notin \Phi, \end{cases} \qquad q = 1, \ldots, \hat{Q},$$

where each value in the integer set $\Phi$ is randomly selected from the range $[1, \hat{Q}]$ at the mask ratio $r_v$, and $\hat{Q}$ is the total number of masking units. Fig. 6 shows a comparison between the prior art and the PVT-based masking strategy of the present invention. Compared with the ViT-based strategy on the left of the figure, which masks exactly one patch (of size $P \times P$), the PVT-based method is more flexible: as shown on the right of the figure, it can build on finer-granularity patches and flexibly select the mask coefficient. For example, the mask coefficient $\alpha$ may be set between 1 and 8, causing the underlying masking unit to mask an $(\alpha P)^2$ region of the image. In an implementation of the present invention, $\alpha = 4$ may be set by default to capture finer-granularity semantics.
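The index-function masking can be sketched as follows. Pure Python; the grid size, ratio, and seed are illustrative:

```python
import random

def mask_units(num_units, r_v, rng):
    """Select masking units at ratio r_v; selected units become [ZERO],
    the rest keep their original patch content v_q."""
    k = max(1, int(num_units * r_v))
    phi = set(rng.sample(range(num_units), k))
    # u_q = [ZERO] if q in Phi else v_q  (the index function)
    return ["[ZERO]" if q in phi else "v%d" % q for q in range(num_units)]

# alpha = 4 -> each masking unit covers an (alpha*P)^2 pixel region,
# i.e. alpha^2 = 16 base patches of size P x P
alpha, P = 4, 16
unit_pixels = (alpha * P) ** 2
units = mask_units(num_units=9, r_v=0.3, rng=random.Random(1))
```

With `alpha = 4` and a hypothetical patch size of 16 pixels, each masking unit zeroes a 64x64 pixel region.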
Since the smooth-$\ell_1$ loss is less sensitive to outliers, it may be chosen as the pre-training target for reconstructing the entire image from the masked sequence $\bar{V}$. The pre-training task is defined as:

$$I' = \mathcal{D}\big(\{V_{k+1}\}_{k=1}^{4};\, W_{\mathrm{MIR}}\big), \qquad \mathcal{L}_{\mathrm{MIR}} = \sum_{(x,y)} \mathrm{smooth}\text{-}\ell_1\big(I'_{(x,y)} - I_{(x,y)}\big),$$

where $I'_{(x,y)}$ and $I_{(x,y)}$ denote the pixels at coordinates $(x, y)$ of the reconstructed image $I'$ and the input image $I$, respectively, and the reconstruction is parameterized by the learnable weights $W_{\mathrm{MIR}}$. The function $\mathcal{D}(\cdot)$ denotes a standard four-level U-Net decoder, which accepts the four pyramid visual embedding vectors $\{V_{k+1}\}_{k=1}^{4}$ as input. In other words, although not shown in Fig. 5, the visual embedding vector fed into the MIR head is not the vector output by the fourth stage shown in Fig. 5, but the vector output by the subsequent four-stage inverted-pyramid decoder shown in Fig. 3B. Therefore, when computing the MIR task loss, the finally obtained processed visual embedding vector has passed through the four-level pyramid encoder and the four-level inverted-pyramid decoder, and is obtained through the U-Net network.
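A minimal sketch of the per-pixel smooth-l1 penalty summed over the image. Pure Python; the beta=1 breakpoint follows the common convention and is an assumption, not stated in this passage:

```python
def smooth_l1(x, beta=1.0):
    """Smooth-l1: 0.5*x^2/beta for |x| < beta, else |x| - 0.5*beta.
    Quadratic near zero, linear for outliers (hence less outlier-sensitive)."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta

def mir_loss(recon, target):
    """Sum of smooth-l1 over corresponding pixels of two flat images."""
    return sum(smooth_l1(a - b) for a, b in zip(recon, target))

loss = mir_loss([0.0, 1.0, 3.0], [0.5, 1.0, 1.0])  # toy 3-pixel images
```

A large pixel error (here 2.0) contributes only linearly, which is the stated reason for preferring smooth-l1 over a plain squared error.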
Task 2 image-text matching (ITM).
Additional classification embedded vectors in the last stage text embedded vector T 4 may be used to couple representations from VL multi-modalities. Can utilize functionsTo represent Fully Connected (FC) and softmax layers, parameterized by weights W ITM.Outputting a bi-classification probability vectorRepresenting that the input image and text description match (i.e., are facing) or do not match (i.e., are negative). The positive pair is selected from the same product category, while the negative pair is randomly selected from different items. The task is ultimately constrained by a binary cross entropy loss function:
where y_ITM represents the ground-truth label, i.e., 1 denotes a matched pair and 0 denotes a non-matched pair.
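The FC + softmax matching head and its binary cross-entropy loss can be sketched as follows; the weight shapes and names are illustrative assumptions, not the patented parameterization.

```python
import numpy as np

def itm_head(cls_vec, W, b):
    """FC layer followed by softmax, yielding a two-class probability
    vector over {match (positive pair), no match (negative pair)}."""
    logits = cls_vec @ W + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

def itm_bce(p_match, y_itm):
    """Binary cross entropy against the ground-truth label y_ITM in {0, 1}."""
    eps = 1e-12
    return float(-(y_itm * np.log(p_match + eps)
                   + (1 - y_itm) * np.log(1.0 - p_match + eps)))
```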
It should be appreciated that in other implementations, other tasks, such as text-image matching (TIM), may also be used to train the model's learning of information across multiple modalities.
Task 3: Masked Language Modeling (MLM)
Original text tokens may be randomly replaced with a special token [MASK]. The goal of MLM is to predict the text content of the masked tokens using the unmasked tokens and patches. Given a token sequence T = {t_1, …, t_L}, the masked sequence is denoted T_\i = {t_1, …, [MASK]_i, …, t_L}. Cross-entropy loss can be used to model this goal:
where the predicted probability for each masked token [MASK]_i is computed from T_\i by a classifier parameterized by W_MLM. The final pre-training objective of the proposed M-ViLT is the combination of the three objectives:

L = L_MIR + L_ITM + L_MLM
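A minimal sketch of the token-masking step is given below. The 15% default ratio is a conventional choice borrowed from BERT-style MLM, not a value stated in this document, and the function name is an assumption for the sketch.

```python
import random

def mask_tokens(tokens, mask_ratio=0.15, mask_token="[MASK]"):
    """Replace a random subset of tokens with [MASK]; return the masked
    sequence T_\\i together with the masked positions, whose original
    tokens are the prediction targets of the MLM cross-entropy loss."""
    n = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(random.sample(range(len(tokens)), n))
    masked = list(tokens)
    for i in positions:
        masked[i] = mask_token
    return masked, positions
```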
Overall, M-ViLT according to the above preferred embodiment of the present invention enables end-to-end training: an end-to-end framework is provided that includes the feature extractors and the multimodal matching network. The framework can further integrate a large amount of prior information from the e-commerce domain (such as merchant, product, and category information), thereby improving multimodal matching performance on downstream tasks in that domain. Furthermore, the pixel-level image reconstruction achieved through the MIR task gives the model a finer-grained understanding of the image side, yielding a clear performance improvement.
In the multimodal encoder part, the image and text information is encoded, and by introducing a PVT pre-trained model, the image-text multimodal information interacts fully within the PVT. On the pre-training side, three pre-training tasks are introduced: the MLM task is mainly responsible for language-model learning, the ITM task for image-text matching, and the MIR task for image reconstruction.
The pre-trained image-text model processed according to the present invention can be used for different downstream tasks, such as mutual retrieval of images and text, image classification, and image generation.
For example, one downstream task may be text-image retrieval (TIR). The TIR task requires the model to find the text with the highest similarity value for each query image. Specifically, a product title and its corresponding image may be treated as a positive image-text pair, while negative pairs are randomly selected from a pool of non-matching images. To increase accuracy, the set of image-text candidates (i.e., one positive pair and 100 negative pairs) may be limited to the same subcategory so that the candidates are as similar as possible.
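Evaluation over such a candidate set (one positive plus sampled negatives) typically ranks the candidates by embedding similarity. The cosine-similarity ranking below is a generic sketch under that assumption, not the patent's scoring function.

```python
import numpy as np

def rank_of_positive(query, candidates, positive_idx=0):
    """Rank candidates by cosine similarity to the query and return the
    1-based rank of the known positive (rank 1 = retrieved first)."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    order = np.argsort(-(c @ q))
    return int(np.nonzero(order == positive_idx)[0][0]) + 1
```

With one positive pair and 100 negatives, a perfect model yields rank 1, and Recall@1 is the fraction of queries whose positive is ranked first.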
Another downstream task is image-text retrieval (ITR). As the inverse of the TIR task, the ITR task aims to retrieve the matching images for given product-description text entries. These bidirectional retrieval tasks (i.e., TIR and ITR) can be the main cross-modal application scenario.
Other downstream tasks may also include category recognition tasks and a masked image generation (MIG) task. Category recognition may include main-category recognition (M-CR) and sub-category recognition (S-CR). These tasks query the specific category of a product, and the different downstream tasks should equip the model with the ability to recognize differences at different levels of granularity: after the last text embedding vector T_4, two separate FC layers may be added to generate the final probabilities for the two recognition tasks, a process which requires additional fine-tuning with recognition labels. The MIG task, in turn, may be regarded as a pixel-level reconstruction task.
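The two separate FC heads on top of T_4 can be sketched as follows; the weight names and the softmax normalization are illustrative assumptions for the sketch.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def category_heads(t4_cls, W_main, W_sub):
    """Two independent FC layers on the final text embedding T_4: one for
    main-category recognition (M-CR), one for sub-category recognition (S-CR)."""
    return softmax(t4_cls @ W_main), softmax(t4_cls @ W_sub)
```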
After further training on these downstream tasks, the pre-trained image-text model of the present invention may be deployed in an image-text retrieval system. To this end, the invention may further be implemented as an image-text retrieval system; Fig. 7 shows an example. The image-text retrieval system may comprise a query information acquisition module for acquiring text and/or image information input by a user. Further, the system may comprise a pre-trained image-text model obtained according to the method described above, for outputting matching image-text information based on the text and/or image information input by the user. The pre-trained image-text model can be deployed in the retrieval system of an e-commerce website so that more accurate information is returned when a user performs image-text retrieval. Because the pre-trained image-text model processed according to the invention captures the interrelationship between images and text at a deep level, the accuracy of image-text retrieval is improved.
FIG. 8 illustrates a schematic diagram of a computing device that may be used to implement the pre-training image text model processing method described above, according to one embodiment of the invention.
Referring to fig. 8, a computing device 800 includes a memory 810 and a processor 820.
Processor 820 may be a multi-core processor or may include multiple processors. In some embodiments, processor 820 may comprise a general-purpose main processor and one or more special-purpose coprocessors, such as a graphics processing unit (GPU) or a digital signal processor (DSP). In some embodiments, processor 820 may be implemented using custom circuitry, for example an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).
Memory 810 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by processor 820 or other modules of the computer. The persistent storage may be a readable and writable storage device, i.e., a non-volatile memory device that does not lose stored instructions and data even after the computer is powered down. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is employed as the persistent storage. In other embodiments, the persistent storage may be a removable storage device (e.g., a diskette or optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory. The system memory may store instructions and data required by some or all of the processors at runtime. Furthermore, memory 810 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic disks and/or optical disks may also be employed. In some implementations, memory 810 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM or dual-layer DVD-ROM), a read-only Blu-ray disc, a super-density optical disc, a flash memory card (e.g., an SD card, mini SD card, or micro-SD card), or a magnetic floppy disk. Computer-readable storage media do not contain carrier waves or transient electronic signals transmitted wirelessly or over wires.
The memory 810 has stored thereon executable code that, when processed by the processor 820, causes the processor 820 to perform the pre-training image text model processing method described above.
The pre-trained image-text model processing method according to the invention, and the image-text retrieval system obtained therefrom, have been described in detail above with reference to the accompanying drawings.
The present invention innovatively eliminates the need for complex preprocessing of the image in the multimodal encoder part (i.e., it replaces a standard ResNet with a custom simplified residual network). M-ViLT achieves offline/online logical consistency by means of end-to-end online patch partitioning of the image combined with the masking strategy, which facilitates practical deployment of the model, for example in the e-commerce domain. Furthermore, the MIR task defined by the invention forces a finer-grained understanding of the image through pixel-level feature reconstruction, meeting the needs of the e-commerce domain, which relies on fine-grained image understanding.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for performing the steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) that, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A pre-trained image-text model processing method, comprising:
obtaining a masked training sample pair in which words and image blocks of an image-text sample pair are masked, wherein the masked training sample pair is obtained by masking specific image blocks in the image and masking specific words in the text description;
inputting the masked training sample pair into the pre-trained image-text model, and obtaining loss values output by the pre-trained image-text model for the masked words, the masked image blocks, and an image-text task, wherein the pre-trained image-text model comprises a multi-stage downsampling encoder and a multi-stage upsampling decoder, the loss values for the masked words and the image-text task are obtained based on the output of the multi-stage downsampling encoder, and the loss value for the masked image blocks is obtained based on the output of the multi-stage upsampling decoder; and
adjusting parameters in the pre-trained image-text model according to the loss values.

2. The method of claim 1, wherein inputting the masked training sample pair into the pre-trained image-text model comprises:
generating a text embedding vector and an image embedding vector from the masked training sample pair using an embedding transformation subnetwork, wherein the image in the image-text sample pair is divided into a plurality of blocks and the image embedding vector is generated based on the divided blocks.

3. The method of claim 2, wherein inputting the masked training sample pair into the pre-trained image-text model comprises performing the following operations in each stage of the downsampling encoder:
concatenating the text embedding vector with the flattened image embedding vector to obtain a composite vector;
spatially reducing the composite vector and feeding it into an encoder unit consisting of a multi-head attention subnetwork and a feed-forward network; and
splitting the composite vector processed by the encoder unit to obtain the downsampled text embedding vector, and reconstructing the remaining part to obtain the downsampled image embedding vector.

4. The method of claim 3, wherein, in each stage of the downsampling encoder, the image embedding vector is processed by a convolution module before flattening.

5. The method of claim 3, wherein inputting the masked training sample pair into the pre-trained image-text model comprises:
feeding the image embedding vector after multi-stage downsampling into the multi-stage upsampling decoder to obtain a processed image embedding vector with the same dimensions as the input image.

6. The method of claim 5, wherein the processed image embedding vector is used to obtain the loss value of the pre-trained image-text model for the masked image blocks, for pixel-level reconstruction of the masked image blocks.

7. The method of claim 5, wherein the number of divided blocks occupied by a masked image block is set based on the granularity of the image features to be extracted.

8. An image-text retrieval system, comprising:
a query information acquisition module for acquiring text and/or image information input by a user; and
a pre-trained image-text model obtained according to the method of any one of claims 1-7, for outputting matching image-text information based on the text and/or image information input by the user.

9. A computing device, comprising:
a processor; and
a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method of any one of claims 1-7.

10. A non-transitory machine-readable storage medium having executable code stored thereon which, when executed by a processor of an electronic device, causes the processor to perform the method of any one of claims 1-7.
CN202210327383.8A 2022-03-30 2022-03-30 Pre-training image text model processing method and image-text retrieval system Active CN114821223B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210327383.8A CN114821223B (en) 2022-03-30 2022-03-30 Pre-training image text model processing method and image-text retrieval system

Publications (2)

Publication Number Publication Date
CN114821223A CN114821223A (en) 2022-07-29
CN114821223B true CN114821223B (en) 2025-07-08

Family

ID=82532592

Country Status (1)

Country Link
CN (1) CN114821223B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114186564A (en) * 2021-11-05 2022-03-15 北京百度网讯科技有限公司 Pre-training method, device and electronic device for semantic representation model

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11210554B2 (en) * 2019-03-21 2021-12-28 Illumina, Inc. Artificial intelligence-based generation of sequencing metadata
CN112001180A (en) * 2020-07-14 2020-11-27 北京百度网讯科技有限公司 Multi-mode pre-training model acquisition method and device, electronic equipment and storage medium
CN113239925A (en) * 2021-05-24 2021-08-10 北京有竹居网络技术有限公司 Text detection model training method, text detection method, device and equipment
CN114023306B (en) * 2022-01-04 2022-04-12 阿里云计算有限公司 Processing method for pre-training language model and spoken language understanding system


Also Published As

Publication number Publication date
CN114821223A (en) 2022-07-29


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240229

Address after: Room 303, 3rd Floor, Building 5, No. 699 Wangshang Road, Changhe Street, Binjiang District, Hangzhou City, Zhejiang Province, 310052

Applicant after: Hangzhou Alibaba Overseas Internet Industry Co.,Ltd.

Country or region after: China

Address before: 310052 room 508, 5th floor, building 4, No. 699 Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Applicant before: Alibaba (China) Co.,Ltd.

Country or region before: China

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant