Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As described above, how to efficiently extract content that matches human semantic understanding from massive multimedia data, and how to give accurate, relevant content feedback (in application scenarios such as search, recommendation, and advertising) according to the behavior habits of specific users, has become a research hotspot in academia and industry in recent years.
Traditional single-modality techniques (e.g., image-only or text-only retrieval) have failed to meet increasingly diverse demands due to their single data form. In contrast, receiving diversified sensory input allows artificial agents to understand objects more comprehensively and efficiently, which also more closely matches the multi-sensory cognitive approach of humans. Multimodal techniques are likewise favored by industry for their superior performance in a number of rich semantic understanding tasks.
In the context of deep learning, how to make a machine give feedback in a way similar to human intelligence is a hot problem in the field. Existing multimodal pre-training solutions are mostly aimed at improving the Transformer module, and their applications are mostly focused on image-text tasks in the general domain. There is a lack of optimized solutions for accurately learning fine-grained image information and relating it to text.
Therefore, the method realizes pixel-level reconstruction of masked image blocks in a pre-training image language network through patch embedding of the image and a model structure comprising a multi-stage downsampling encoder and a stage-by-stage corresponding upsampling decoder. Further, by combining a self-built residual sub-network, which implements the embedding of the input image and text, with the encoder-decoder structure above, end-to-end multimodal pre-training is implemented. A system using the model can realize more accurate image-text retrieval, image classification, and even pixel-level reconstruction of the masked part, and is particularly suitable for various e-commerce application scenarios.
FIG. 1 shows a schematic flow chart of a pre-training image text model processing method according to one embodiment of the invention.
In step S110, a masked training sample pair, in which words and image blocks of an image text sample pair are masked, is acquired. FIG. 2 shows an example of masking a training sample pair. As shown, the image text sample pair may be the image shown on the left side of FIG. 2 and the corresponding textual description (the illustrated "WORD'S SLEEVELESS LONG DRESS"), and the masked training sample pair (e.g., the illustrated T and V) may be obtained by masking specific image blocks in the image (e.g., the illustrated dashed boxes) and masking specific words in the textual description (e.g., the illustrated [MASK]).
In step S120, the masked training sample pair is input into the pre-training image text model, which may include a multi-stage downsampling encoder and a multi-stage upsampling decoder, and loss values output by the pre-training image text model for the masked words, masked image blocks, and image text tasks are thereby obtained. Subsequently, in step S130, parameters in the pre-training image text model are adjusted according to the loss values.
It will be appreciated that the pre-training image text model of the present application has a pyramid-type multi-stage encoder (i.e., the feature map becomes smaller and smaller) and an inverted-pyramid-type multi-stage decoder (i.e., the scaled feature map grows back step by step), and may have three different training tasks: a prediction task for masked words (i.e., a language task, such as the MLM task described below), an image-language task (also referred to as a vision-language or VL task, such as the ITM task described below), and a task for masked image blocks (i.e., a vision task). In the present application, the vision task may in particular be implemented as a masked image reconstruction task (i.e., the MIR task described in detail below).
In one embodiment, the output of the multi-level encoder may be used directly for language tasks, VL tasks, and visual tasks. The image-language pre-training network thus trained can also perform well in downstream tasks.
In a preferred embodiment, the image embedding vectors output by the multi-stage encoder may be further fed into the multi-stage upsampling decoder to obtain processed image embedding vectors of the same dimensions as the input image. In this case, obtaining the loss values output by the pre-training image text model for the masked words, the masked image blocks, and the image text task may include calculating the loss values for the masked words and the image text task based on the output of the multi-stage downsampling encoder, and calculating the loss value for the masked image blocks based on the output of the multi-stage upsampling decoder. Thereby, the loss value of the pre-training image text model for the masked image blocks is obtained using the processed image embedding vectors, enabling pixel-level reconstruction of the masked image blocks.
FIGS. 3A-B illustrate two embodiments of the pre-training network architecture of the present invention. As shown in FIGS. 3A and 3B, the input text and the image that is the object of the text description are subjected to certain processing to obtain a plurality of text tokens and a plurality of image patches, which are then converted into a text embedding (also "embedding vector") and an image embedding. The image embedding vector and the text embedding vector are then fed into a multi-modality PVT (pyramid vision Transformer). Here, the multi-modality PVT refers to a pyramid model structure, consisting of multi-stage downsampling Transformer encoders, that processes both text-type and image-type input. In the example of FIG. 3A, the loss values may be calculated directly based on the output of the multi-modality PVT, thereby implementing the language task, VL task, and image task of the pre-training model. In the preferred embodiment shown in FIG. 3B, the image embedding vectors output by the multi-modality PVT may then be upsampled through an inverted-pyramid-type multi-stage Transformer decoder, thereby obtaining processed image embedding vectors of the same dimension as the input image. Subsequently, the MIR (masked image reconstruction) task of pixel-level reconstruction can be achieved.
FIG. 4 illustrates one example of acquiring text tokens and image patches based on an original image text sample pair. As shown, the original sample includes an image (the image of a man wearing a hooded zipper shirt and jeans at the lower right side of FIG. 4) and a description of that image (the description at the lower left side of FIG. 4: "Long sleeve hoodie in black. Drawstring closure at hood. Zip closure and patch pockets at front. Rib knit sleeve cuffs and hem……", i.e., a black long-sleeve hoodie, a hood with drawstring, front zipper and pockets, and rib-knit cuffs and hem). The original sample can be obtained from an e-commerce website selling clothing, with the image being a commodity picture and the description being a commodity title or description.
Subsequently, to obtain text tokens, the input text may first be tokenized word by word to obtain a token sequence, to which whole-word masking is then applied to obtain masked text tokens.
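The whole-word masking step described above can be sketched as follows. This is a minimal illustration under assumed conventions: the sub-token representation, the "[MASK]" token string, and the mask ratio are assumptions for this sketch, not the exact implementation.

```python
import random

def whole_word_mask(word_subtokens, mask_ratio=0.15, seed=0):
    """Whole-word masking: every sub-token of a chosen word is replaced by [MASK].

    word_subtokens: list of words, each a list of sub-tokens,
    e.g. [["sleeve", "##less"], ["long"], ["dress"]].
    """
    rng = random.Random(seed)
    n = len(word_subtokens)
    n_mask = max(1, int(n * mask_ratio))          # number of whole words to mask
    chosen = set(rng.sample(range(n), n_mask))    # word indices chosen for masking
    out = []
    for i, subs in enumerate(word_subtokens):
        # mask all sub-tokens of a chosen word, keep others as-is
        out.extend(["[MASK]"] * len(subs) if i in chosen else subs)
    return out

words = [["sleeve", "##less"], ["long"], ["dress"]]
seq = whole_word_mask(words, mask_ratio=0.34)
```

The key point of whole-word masking, as opposed to per-token masking, is that a word is never partially masked: its sub-tokens are masked together or not at all.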
Similarly, the patch processing of an image differs from the RoI (region of interest) method common in the art: instead, the image is segmented into small blocks of the same pixel size (usually squares), and each patch can be regarded as an "image token". For each patch, the output of PatchNet may be taken as the patch features, which may then be masked to obtain masked image patches. These patches are naturally ordered, and the spatial locations of the patches can be used for positional embedding in subsequent processing (see FIG. 5).
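The patch segmentation described above can be sketched in a few lines. This is an illustrative sketch only (the image is represented as nested lists of scalars rather than an actual tensor), showing the row-major ordering that makes the patches "naturally ordered":

```python
def to_patches(image, P):
    """Split an H x W image (list of rows) into flattened P x P patches, row-major."""
    H, W = len(image), len(image[0])
    assert H % P == 0 and W % P == 0, "image size must be divisible by patch size"
    patches = []
    for by in range(0, H, P):          # patch-grid row
        for bx in range(0, W, P):      # patch-grid column
            patches.append([image[by + dy][bx + dx]
                            for dy in range(P) for dx in range(P)])
    return patches  # len(patches) == (H // P) * (W // P)

img = [[y * 4 + x for x in range(4)] for y in range(4)]  # toy 4x4 "image"
patches = to_patches(img, 2)  # four 2x2 patches
```

Each patch's index in the returned list directly encodes its spatial location, which is what allows positional embeddings to be attached in later stages.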
In one embodiment of the invention, the text tokens and image patches may be fed into a standard ResNet to obtain text and image embedding vectors. In yet another, more preferred embodiment, the embedding of incoming sample pairs may be achieved using an embedding transformation sub-network included in the pre-training image text model of the present invention. Thus, inputting the masked training sample pair into the pre-training image text model may include generating text embedding vectors and image embedding vectors from the masked training sample pair using an embedding transformation sub-network (e.g., a self-designed ResNet of simplified structure), wherein an image in the image text sample pair is partitioned into a plurality of patches and the image embedding vectors are generated based on the partitioned patches.
The input to the multi-stage downsampling encoder is thus the embedding shown in FIGS. 3A-B, namely a text embedding vector and an image embedding vector. By using a self-designed embedding transformation sub-network (e.g., a residual network with a simplified structure compared to standard ResNet), offline preprocessing of the input image language pairs can be avoided, thereby enabling online execution of downstream tasks. Further, while the embedding transformation sub-network of the present invention has a simplified structure, it is still capable of performing excellently in downstream tasks, since it can be seen as included within the pre-training image text model of the present invention, i.e., the parameters of the sub-network can be trained together with the parameters of the downstream PVT and the optional multi-stage decoder.
In the present invention, each stage of the downsampling encoder can be regarded as a Transformer encoder capable of downsampling. To this end, inputting the masked training sample pair into the pre-training image text model may include, in each stage of the downsampling encoder: concatenating the flattened image embedding vector with the text embedding vector into a composite vector; spatially reducing the composite vector and feeding it into an encoder unit composed of a multi-head attention sub-network and a feed-forward network; splitting the composite vector processed by the encoder unit to obtain the downsampled text embedding vector; and reshaping the remainder to obtain the downsampled image embedding vector. Thus, each stage of processing is completed by concatenating the text and image embedding vectors into the encoder and, after processing, performing splitting and shape reconstruction (only for the image embedding). Preferably, in each stage of the downsampling encoder, the image embedding vector is processed by a convolution module (e.g., a Conv2D block) prior to the flattening process, so that the processing of the image embedding exploits the downsampling-friendly nature of the convolution operation. Further, the image embedding vector that has been downsampled in multiple stages may be fed into the multi-stage upsampling decoder, as described above in connection with FIG. 3B, to obtain a processed image embedding vector of the same dimension as the input image. The processed image embedding vector may then be used to obtain the loss value of the pre-training image text model for the masked image blocks, for pixel-level reconstruction of the masked image blocks.
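The per-stage flow above (convolve and flatten, concatenate with the text tokens, encode, then split and reshape) can be sketched purely as shape bookkeeping. The numbers below are illustrative assumptions: a 224×224 input, a text length of 128, and a stage-1 stride of 4 taken from the hyperparameter settings given later in this description; the helper is not the actual implementation.

```python
def stage_shapes(H, W, L, stride):
    """Shape bookkeeping for one downsampling-encoder stage.

    The image map (H, W) is spatially reduced by `stride` via the convolution
    module, flattened into h*w image tokens, and concatenated with the L text
    tokens to form the composite sequence fed to the encoder unit. After
    encoding, the first L positions are split off as text and the remaining
    h*w positions are reshaped back to an (h, w) map.
    """
    h, w = H // stride, W // stride     # spatial reduction by the Conv2D block
    composite_len = L + h * w           # text tokens + flattened image tokens
    return (h, w), composite_len

(h, w), n = stage_shapes(224, 224, L=128, stride=4)   # stage 1 of the sketch
```

This makes explicit why the split is unambiguous: the text portion always occupies a fixed L positions of the composite sequence, and everything after it reshapes back to a spatial map.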
In one embodiment, the masked image block size may be different from the size of the patch. In other words, the number of partitioned blocks occupied by the masked image block may be set based on the granularity of the image feature to be extracted. For example, it may be determined whether 1 (1 x 1), 4 (2 x 2), or 9 (3 x 3), or even more patches are to be masked for the masked image block based on the setting of the α value as described below.
In order to better illustrate the principles of the present invention, a preferred end-to-end model embodiment of the present invention will be described below in conjunction with FIG. 5, which details the encoder structure. FIG. 5 shows one example of an end-to-end VL pre-training model in accordance with the present invention.
FIG. 5 shows a masked vision-language Transformer (M-ViLT) for cross-modal representation of fashion, especially in the e-commerce domain. In the model of the invention, PVT (pyramid vision Transformer) is utilized to replace BERT, and a Mask Image Reconstruction (MIR) strategy is introduced into PVT, so that M-ViLT is the first end-to-end framework in the fashion field. M-ViLT is an extensible and convenient architecture that accepts raw multimodal input without additional pre-processing models (e.g., ResNet), implicitly models vision-language alignment, and can be easily generalized to various downstream matching and generation tasks.
In order to solve the problems of insufficient granularity and poor transferability in the prior art, the invention specifically constructs an end-to-end VL framework for the fashion field. The overall flow of M-ViLT is shown in FIG. 5. The example of FIG. 5 uses a four-stage encoder and generates features of different sizes. The two keys of the proposed architecture are the multi-modal encoder and the pre-training objectives.
Multi-modal encoder
As shown in FIG. 5, M-ViLT accepts visual and linguistic input. On the language side, titles of fashion products (e.g., articles of clothing, accessories, etc.) are first tokenized, and the tokens of the title are randomly masked at a mask ratio r_l using the specific token [MASK]. A sequence of word tokens is obtained after the masking process. Then, a specific [CLS] token is inserted at the head of this sequence. Furthermore, if the length is less than 128, the sequence is padded to a uniform length L using [PAD] tokens. This process generates the language input ids.
On the visual side, one can consider a visual input $V \in \mathbb{R}^{H \times W \times 3}$, where H and W represent the height and width of the given input. This input is divided into multiple latticed patches $V_p \in \mathbb{R}^{N \times (P^2 \cdot 3)}$, where $N = HW/P^2$ is the total number of patches and P represents the patch size. Similarly, a mask ratio $r_v$ is used to mask the partitioned patches. More detailed information about the masking policies for the language and visual parts described above is provided in the following description of the "Pre-training objectives". The multimodal input is embedded and fed into the following four VL interaction stages (i.e., $k \in \{1, 2, 3, 4\}$). In the first stage, a text embedding vector $T_1$ and a visual embedding vector $V_1$ are generated from the given inputs (T and V), respectively. Regarding the subsequent stages, only the k-th stage is considered to keep the description concise. As shown at the bottom of FIG. 5, the text embedding vector $T_k$ is first embedded into the language hidden feature $m_k$ (i.e., the linear embedding in the illustration), with the formula:

$$m_k = T_k W_k + E_k^{T}$$

where $W_k$ and $E_k^{T}$ are a learnable linear embedding matrix and a learnable positional embedding matrix, respectively, and $D_k$ is the size of the hidden feature embedding, i.e., $m_k \in \mathbb{R}^{L \times D_k}$. The visual embedding vector is $V_k \in \mathbb{R}^{\frac{H}{R_k} \times \frac{W}{R_k} \times D_k}$, where $R_k$ represents the spatial reduction factor of the visual embedding. To obtain pyramid visual features, $V_k$ is then embedded and flattened into the visual hidden feature $n_k$ by a two-dimensional projection (i.e., the Conv2D block in the illustration). In particular, the projection uses a convolution kernel with kernel size $K_k$ and stride $S_k$, forcing the network to reduce the equivalent spatial dimension from $\frac{H}{R_{k-1}} \times \frac{W}{R_{k-1}}$ to $\frac{H}{R_k} \times \frac{W}{R_k}$. This can be expressed as:

$$n_k = \mathrm{Flatten}\big(\mathrm{Conv2D}(V_k; K_k, S_k)\big) + E_k^{V}$$

where $E_k^{V}$ represents the positional embedding matrix. Subsequently, the two VL hidden features $z_k = \langle m_k; n_k \rangle$ are concatenated (Concat) and fed into a plurality ($M_k$) of VL Transformer encoders. Each encoder contains a multi-head self-attention layer with spatial reduction (i.e., the illustrated Reduction box), a multi-layer perceptron, and layer normalization. Finally, the encoded multi-modal feature $z_{k+1} = \langle m_{k+1}; n_{k+1} \rangle$ is obtained and divided into a language portion $T_{k+1} = m_{k+1}$ and a visual portion $V_{k+1} = \mathrm{Reshape}(n_{k+1})$, where the Reshape(·) operation restores the spatial dimensions of the given feature.
After the four VL interaction stages, four text embedding vectors and four pyramid visual embedding vectors are generated, one per stage. The following table shows the detailed hyperparameter settings.

| Hyperparameter | k=1 | k=2 | k=3 | k=4 |
| --- | --- | --- | --- | --- |
| Number of layers M_k | 2 | 2 | 2 | 2 |
| Hidden dimension D_k | 64 | 128 | 320 | 512 |
| Reduction factor R_k | 4 | 8 | 16 | 32 |
| Kernel size K_k | 4 | 2 | 2 | 2 |
| Stride S_k | 4 | 2 | 2 | 2 |
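Given the reduction factors R_k above, the spatial size of the visual feature map at each stage follows directly by division; this can be checked in a few lines (the 224×224 input resolution is an assumed example, not stated in the table):

```python
R = {1: 4, 2: 8, 3: 16, 4: 32}   # reduction factor R_k per stage
H = W = 224                       # assumed input resolution for illustration

# spatial size of the visual feature map after stage k: (H/R_k, W/R_k)
sizes = {k: (H // r, W // r) for k, r in R.items()}
# Each stage halves the previous map (K_k = S_k = 2 for k >= 2),
# so the features form a pyramid: 56 -> 28 -> 14 -> 7.
```

Note the pyramid structure: the stage-1 Conv2D (kernel 4, stride 4) performs the initial 4x reduction, and each later stage (kernel 2, stride 2) halves the map again.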
Pre-training objectives
To obtain a multimodal representation with sufficient discrimination, three pre-training tasks are employed to establish relationships within and between the most primitive VL modalities, including visual tasks (implemented as mask image reconstruction, MIR), linguistic tasks (implemented as mask language modeling, MLM), and visual linguistic tasks (implemented as image-text matching, ITM).
Task 1 Mask Image Reconstruction (MIR).
The present invention attempts to construct pixel-to-pixel relationships from the perspective of a generation task, thereby facilitating scalability of the visual representation, and designs Mask Image Reconstruction (MIR) to achieve pixel-level reconstruction. To help the model learn better through MIR, a flexible masking strategy is designed around the pyramid features of the PVT architecture: the input image is masked according to a masking-unit matrix that contains small-granularity patches. Given the patch sequence $V_p$, the masked sequence $V_{\setminus \Phi}$ may be defined as:

$$V_{\setminus \Phi} = f_{\mathrm{mask}}\big(V_p, \Phi, [\mathrm{ZERO}]\big)$$

where $f_{\mathrm{mask}}(\cdot)$ represents the function (or process) of the masking strategy, $q$ indexes a randomly selected masking-unit region, and $[\mathrm{ZERO}]$ represents filling the selected region with the pixel value zero (0). The masking units are derived from the following index function:

$$\Phi = \{\, q_i \,\}_{i=1}^{\lfloor r_v Q \rfloor}, \qquad q_i \in [1, Q]$$

where each value in the integer set $\Phi$ is randomly selected from the range $[1, Q]$ at the mask ratio $r_v$, and $Q$ is the total number of masking units. FIG. 6 shows a comparative example of the prior art and the PVT-based masking strategy of the present invention. Compared with the ViT-based masking strategy on the left side of the figure, which masks only one patch (of size P×P), the PVT-based method is more flexible. As shown on the right side of the figure, the PVT-based method can operate on finer-granularity patches and flexibly select the mask coefficient. For example, the mask coefficient α may be set from 1 to 8, thereby causing the underlying masking unit to mask an (α·P)² region of the image. In an implementation of the present invention, α=4 may be set by default to capture finer-granularity semantics.
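Under this strategy, one masking unit covers an α×α block of patches. A minimal sketch of selecting masking units at ratio r_v and collecting the patch coordinates they zero out (the grid sizes, seed, and coordinate layout are illustrative assumptions):

```python
import random

def mask_units(n_units_side, alpha, r_v, seed=0):
    """Pick masking units on an (n_units_side x n_units_side) unit grid.

    Each unit covers an alpha x alpha block of patches; returns the set of
    masked patch coordinates on the full patch grid, i.e. the region that
    would be filled with [ZERO].
    """
    rng = random.Random(seed)
    Q = n_units_side * n_units_side                    # total masking units
    phi = rng.sample(range(Q), max(1, int(Q * r_v)))   # the integer set Phi
    masked = set()
    for q in phi:
        uy, ux = divmod(q, n_units_side)               # unit -> grid position
        for dy in range(alpha):
            for dx in range(alpha):
                # every patch inside the chosen unit is masked
                masked.add((uy * alpha + dy, ux * alpha + dx))
    return masked

m = mask_units(n_units_side=4, alpha=4, r_v=0.25)      # alpha=4 as in the text
```

With α=4, each selected unit zeroes a 4×4 block of patches, which is exactly what makes the masked region (α·P)² pixels rather than a single P×P patch.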
Since the smooth-$\ell_1$ loss is less sensitive to outliers, it may be chosen as the pre-training target to reconstruct the entire image from the masked sequence $V_{\setminus \Phi}$. The pre-training task is defined as:

$$\mathcal{L}_{\mathrm{MIR}} = \sum_{(x,y)} \mathrm{smooth}\text{-}\ell_1\big(I'_{(x,y)} - I_{(x,y)}\big), \qquad I' = \mathcal{D}_{W_{\mathrm{MIR}}}\big(\{V_k\}_{k=1}^{4}\big)$$

where $I'_{(x,y)}$ and $I_{(x,y)}$ represent the pixels at coordinates (x, y) of the reconstructed image I′ and the input image I, respectively, and the reconstruction is parameterized by a learnable weight $W_{\mathrm{MIR}}$. The function $\mathcal{D}_{W_{\mathrm{MIR}}}(\cdot)$ represents a standard four-level U-Net decoder, which accepts the four pyramid visual embedding vectors as input. In other words, although not shown in FIG. 5, the visual embedding vector fed into MIR is not the vector output from the fourth stage shown in FIG. 5, but the vector output from the subsequent four-stage inverted-pyramid decoder shown in FIG. 3B. Therefore, in computing the MIR task loss, the finally obtained processed visual embedding vector has passed through the four-level pyramid encoder and the four-level inverted-pyramid decoder, and is obtained through the U-Net network.
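The smooth-ℓ1 term above has the standard piecewise form (quadratic near zero, linear in the tails), which is what makes it less sensitive to outliers than a squared loss. A minimal sketch over flattened pixel lists (the images and the mean reduction are illustrative):

```python
def smooth_l1(diff, beta=1.0):
    """Standard smooth-L1: 0.5*d^2/beta for |d| < beta, else |d| - 0.5*beta."""
    d = abs(diff)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

def mir_loss(reconstructed, original):
    """Mean smooth-L1 over pixel pairs of the reconstructed and input images."""
    diffs = [smooth_l1(a - b) for a, b in zip(reconstructed, original)]
    return sum(diffs) / len(diffs)

loss = mir_loss([0.5, 2.0, 1.0], [0.0, 0.0, 1.0])
```

Note how the large residual (2.0) contributes only linearly (1.5) rather than quadratically (2.0), illustrating the outlier robustness the text appeals to.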
Task 2 image-text matching (ITM).
The additional classification embedding vector in the last-stage text embedding vector $T_4$ may be used to couple the representations from the VL modalities. A function $\mathcal{F}_{W_{\mathrm{ITM}}}(\cdot)$ may be used to represent the fully connected (FC) and softmax layers, parameterized by the weights $W_{\mathrm{ITM}}$. It outputs a two-class probability vector $p_{\mathrm{ITM}}$ representing whether the input image and text description match (i.e., a positive pair) or do not match (i.e., a negative pair). Positive pairs are selected from the same product category, while negative pairs are randomly selected from different items. The task is ultimately constrained by a binary cross-entropy loss function:

$$\mathcal{L}_{\mathrm{ITM}} = -\big[\, y_{\mathrm{ITM}} \log p_{\mathrm{ITM}} + (1 - y_{\mathrm{ITM}}) \log (1 - p_{\mathrm{ITM}}) \,\big]$$

where $y_{\mathrm{ITM}}$ represents the ground-truth label, i.e., 1 represents a matched pair and 0 represents a non-matched pair.
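The binary cross-entropy constraint for ITM reduces to a one-line expression over the matching probability. A numeric sketch (the probability values are illustrative, not outputs of the model):

```python
import math

def itm_loss(p_match, y):
    """Binary cross-entropy: y=1 for a matched pair, y=0 for a non-matched pair."""
    return -(y * math.log(p_match) + (1 - y) * math.log(1 - p_match))

# a confident correct prediction costs little; a confident wrong one costs a lot
low = itm_loss(0.9, 1)    # model says "match" and the pair matches
high = itm_loss(0.9, 0)   # model says "match" but the pair does not
```

This asymmetry is exactly what pushes the [CLS]-based head to separate positive pairs (same product category) from randomly drawn negatives.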
It should be appreciated that in other implementations, other tasks such as text-to-image matching (TIM) may also be used to train information learning between multiple modalities of a model.
Task 3 Mask Language Modeling (MLM).
The original text tokens may be randomly replaced with the specific token [MASK]. The goal of MLM is to predict the text content of the masked tokens using the unmasked tokens and patches. Given a token sequence $T = \{t_1, \dots, t_L\}$, the masked sequence is denoted $T_{\setminus i} = \{t_1, \dots, [\mathrm{MASK}]_i, \dots, t_L\}$. Cross-entropy loss can be used to model this goal:

$$\mathcal{L}_{\mathrm{MLM}} = -\sum_i \log P_{W_{\mathrm{MLM}}}\big([\mathrm{MASK}]_i \mid T_{\setminus i}\big)$$

where $P_{W_{\mathrm{MLM}}}(\cdot)$ represents the predicted probability for each masked token $[\mathrm{MASK}]_i$ given $T_{\setminus i}$, with the classifier parameterized by $W_{\mathrm{MLM}}$. The final pre-training objective of the proposed M-ViLT is the combination of the three objectives:

$$\mathcal{L} = \mathcal{L}_{\mathrm{MIR}} + \mathcal{L}_{\mathrm{ITM}} + \mathcal{L}_{\mathrm{MLM}}$$
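The combined objective is the sum of the three task losses; in a training loop it would look like the following sketch (the names and the optional weight tuple are illustrative assumptions, not part of the original objective, which is a plain sum):

```python
def total_loss(l_mir, l_itm, l_mlm, weights=(1.0, 1.0, 1.0)):
    """Final pre-training objective: weighted sum of the MIR, ITM and MLM losses.

    Equal weights (the default) reproduce the plain sum described in the text;
    the weight tuple is only an illustrative knob for experimentation.
    """
    w_mir, w_itm, w_mlm = weights
    return w_mir * l_mir + w_itm * l_itm + w_mlm * l_mlm

l = total_loss(0.4, 0.3, 0.2)
```

Because the three losses are simply summed, each task's gradients flow back through the shared encoder (and, for MIR, through the decoder as well) in a single backward pass.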
Overall, M-ViLT, in accordance with the above preferred embodiment of the present invention, enables end-to-end training, i.e., an end-to-end framework is provided that includes feature extractors and a multimodal matching network. The framework can be particularly integrated with a large amount of prior information (such as information of merchants, commodities, categories and the like) in the electronic commerce field, so that the multi-mode matching effect of the electronic commerce field is further improved in a downstream task. Further, pixel-level image reconstruction achieved through the MIR task can enable a model to have finer granularity understanding of the image side, and therefore performance improvement of the model is obvious.
In the multi-modal encoder part, the image and text information is encoded, and by introducing the PVT pre-training model, the image-text multimodal information interacts fully within PVT. In terms of pre-training tasks, three tasks are introduced: the MLM task is mainly responsible for language model learning, the ITM task is mainly responsible for image-text matching learning, and the MIR task is mainly responsible for image reconstruction learning.
The pre-trained image text model processed according to the present invention can be used for different downstream tasks, such as image-text mutual retrieval, image classification, and image generation.
For example, one downstream task may be text-to-image retrieval (TIR). The TIR task requires the model to find the text with the highest similarity value to the different query images. In particular, the product title and its corresponding image may be treated as a positive image-text pair, while the negative image pair is randomly selected from a pool of non-matching images. To increase accuracy, a set of image-text candidates (i.e., one positive pair and 100 negative pairs) may be limited to the same subcategory so that they are as similar as possible.
Another downstream task is image-text retrieval (ITR). As an inverse of the TIR task, the ITR task aims to retrieve matching images for a given sequence of merchandise description text entries. The bi-directional retrieval tasks (i.e., TIR and ITR) above can be the main application scenario across modalities.
Other downstream tasks may also include category recognition tasks and a Mask Image Generation (MIG) task. Category recognition may include main-class recognition (M-CR) and sub-class recognition (S-CR). These tasks supply the specific class of a query product, and the model should acquire, through the different downstream tasks, the ability to identify differences at different levels of granularity. After the last text embedding vector $T_4$, two separate FC layers may be added to generate the final probabilities for the two different recognition tasks, where this process requires additional fine-tuning using the recognition labels. The MIG task, in turn, may be considered a pixel-level reconstruction task.
After further training for these downstream tasks, the pre-trained image text model of the present invention may be deployed in an image-text retrieval system. To this end, the invention can be further implemented as an image-text retrieval system. FIG. 7 shows an example of the image-text retrieval system of the invention. The image-text retrieval system may comprise a query information acquisition module for acquiring text and/or image information input by a user. Further, the image-text retrieval system may comprise a pre-trained image text model (PLM) obtained according to the method described above, for outputting matching image-text information based on the text and/or image information entered by the user. The pre-training image text model can be arranged in the retrieval system of an e-commerce website so that more accurate information is returned when a user performs image-text retrieval. The pre-training image text model processed according to the invention can deeply capture the interrelationship between images and text, thereby improving the accuracy of image-text retrieval.
FIG. 8 illustrates a schematic diagram of a computing device that may be used to implement the pre-training image text model processing method described above, according to one embodiment of the invention.
Referring to fig. 8, a computing device 800 includes a memory 810 and a processor 820.
Processor 820 may be a multi-core processor or may include multiple processors. In some embodiments, processor 820 may comprise a general-purpose main processor and one or more special coprocessors, such as a graphics processing unit (GPU), a digital signal processor (DSP), etc. In some embodiments, processor 820 may be implemented using custom circuitry, for example, an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).
Memory 810 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by processor 820 or other modules of the computer. The persistent storage may be a readable and writable storage device, i.e., a non-volatile memory device that does not lose stored instructions and data even after the computer is powered down. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is employed as the persistent storage. In other embodiments, the persistent storage may be a removable storage device (e.g., a diskette or optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory. The system memory may store instructions and data required by some or all of the processors at runtime. Furthermore, memory 810 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic disks and/or optical disks may also be employed. In some implementations, memory 810 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., SD card, mini SD card, micro-SD card, etc.), a magnetic floppy disk, and the like. The computer-readable storage medium does not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 810 has stored thereon executable code that, when processed by the processor 820, causes the processor 820 to perform the pre-training image text model processing method described above.
The pre-training image text model processing method and the resulting image-text retrieval system according to the invention have been described in detail hereinabove with reference to the accompanying drawings.
The present invention innovatively eliminates the need for complex preprocessing of the image in the multi-modal encoder part (i.e., replacing a standard ResNet with a self-designed simplified residual network). By performing patch segmentation of the picture end-to-end and online, combined with the masking strategy, M-ViLT achieves logical consistency between offline and online processing, which benefits the actual deployment of models, such as deployment in the e-commerce field. Furthermore, the MIR task defined by the invention forces a finer-grained understanding of the image through pixel-level feature reconstruction, so that it meets the requirements of the e-commerce field, which relies on understanding images at finer granularity.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for performing the steps defined in the above-mentioned method of the invention.
Or the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) that, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.