CN114357231B - Text-based image retrieval method and device and readable storage medium - Google Patents
Info
- Publication number
- CN114357231B CN202210221464.XA
- Authority
- CN
- China
- Prior art keywords
- text
- feature
- image
- matrix
- initial
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application provides a text-based image retrieval method, a text-based image retrieval device and a readable storage medium. The method comprises the following steps: acquiring a retrieval text and initial image features of a plurality of candidate images; converting the retrieval text into a digital matrix, extracting initial text features according to the digital matrix, and performing residual connection on the digital matrix and the initial text features to obtain enhanced text features; respectively fusing the enhanced text features with each initial image feature to obtain a corresponding first feature matrix, and respectively fusing the initial text features with each initial image feature to obtain a corresponding second feature matrix; simultaneously inputting the second feature matrix and the first feature matrix which are fused with the same initial image features into a feature exchange network to obtain a corresponding exchange feature matrix; and inputting all the exchange feature matrices into a head prediction network to obtain a target image. The method enables the features of the two modalities to exchange information effectively and to be trained jointly, improves the generalization ability of the model, strengthens the association between the text and the images, and improves the retrieval precision.
Description
Technical Field
The present application relates to the field of image retrieval, and in particular, to a text-based image retrieval method and apparatus, and a readable storage medium.
Background
Image retrieval methods mainly fall into two categories: searching images by text and searching images by images.
Searching images by text first establishes a dictionary comprising a plurality of labels; each image in an image material library is described manually with labels from the dictionary, the existing labels are extracted from the retrieval text, and image retrieval is realized by using these labels to locate the target images in the image material library. This approach not only requires the construction of a dictionary comprising a large number of labels, but also requires a great deal of manpower to manually label every image in the image material library, which is extremely inefficient.
Searching images by images mainly uses image feature extraction technology: an image material library comprising a large number of images is constructed, and the features of the image to be searched are extracted and compared with the features of the images in the library. Before convolutional neural networks, image features were mainly extracted by manually designed algorithms such as Haar, GIST and SIFT.
A Convolutional Neural Network (CNN) is a machine learning network that delivers top performance on visual tasks in many fields, and in image retrieval its performance far exceeds that of traditional methods. Increasing the number of convolutional layers (going deeper) can improve the performance of a CNN in image feature extraction. However, feature extraction directly using a CNN is mainly limited to searching images by images, because a CNN cannot be used to extract features of text, and text features extracted with other machine learning algorithms do not follow the same probability distribution as the image features, so they cannot be compared directly.
Multimodal machine learning has since been developed. It differs from earlier machine learning in that it can receive information in different modalities, such as text, images, audio and video, so searching images by text becomes possible.
However, current multimodal machine learning still depends on manually labeled data: unsupervised learning on unlabeled data is not possible, and the generalization ability of the model is insufficient when labeled data are scarce. Moreover, the fusion of text and images is mainly confined to the training process; when the features of the text and the image are extracted, they are extracted independently without effective information exchange, so the features of the text and the image cannot be fully utilized. In addition, some multimodal machine learning approaches perform large-scale pre-training, which requires very large computing resources and cannot be widely deployed.
Disclosure of Invention
The method extracts the features of the retrieval text and the candidate images separately, lets the features of the two modalities exchange information effectively through feature fusion, improves the generalization ability of the model under a small amount of sample data, performs association training with an improved Transformer structure, strengthens the association between the retrieval text and the candidate images, and improves the ability to retrieve images from a text description.
In a first aspect, an embodiment of the present application provides a text-based image retrieval method, including the following steps:
acquiring a retrieval text and initial image features of a plurality of candidate images;
converting the retrieval text into a digital matrix, extracting initial text features according to the digital matrix, and performing residual connection on the digital matrix and the initial text features to obtain enhanced text features;
respectively fusing the enhanced text features with each initial image feature to obtain a corresponding first feature matrix, and respectively fusing the initial text features with each initial image feature to obtain a corresponding second feature matrix;
inputting the second feature matrix and the first feature matrix which are fused with the same initial image feature into a feature exchange network at the same time to obtain a corresponding exchange feature matrix, wherein the feature exchange network comprises a text processing network and an image processing network which are parallel, the text processing network and the image processing network comprise the same number of Transformer layers, query features in each corresponding Transformer layer of the text processing network and the image processing network are exchanged, and the query features are obtained by performing a linear transformation on the input of each Transformer layer;
and inputting all the exchange feature matrices into a head prediction network to obtain at least one target image.
In a second aspect, an embodiment of the present application provides a text-based image retrieval apparatus, configured to implement the text-based image retrieval method in the first aspect, where the apparatus includes the following modules:
the acquisition module is used for acquiring a retrieval text and initial image features of a plurality of candidate images;
the text feature extraction module is used for converting the retrieval text into a digital matrix, extracting initial text features according to the digital matrix, and performing residual connection on the digital matrix and the initial text features to obtain enhanced text features;
the feature fusion module is used for respectively fusing the enhanced text features with each initial image feature to obtain a corresponding first feature matrix, and respectively fusing the initial text features with each initial image feature to obtain a corresponding second feature matrix;
the feature exchange module is used for inputting the second feature matrix and the first feature matrix which are fused with the same initial image features into a feature exchange network at the same time to obtain a corresponding exchange feature matrix, wherein the feature exchange network comprises a text processing network and an image processing network which are parallel, the text processing network and the image processing network comprise the same number of Transformer layers, query features in each corresponding Transformer layer of the text processing network and the image processing network are exchanged, and the query features are obtained by performing a linear transformation on the input of each Transformer layer;
and the prediction module is used for inputting all the exchange feature matrices into a head prediction network to obtain at least one target image.
In a third aspect, an embodiment of the present application provides an electronic apparatus, including a memory and a processor, where the memory stores a computer program, and the processor is configured to execute the computer program to perform the text-based image retrieval method according to any of the embodiments of the present application.
In a fourth aspect, the present application provides a readable storage medium having stored therein a computer program comprising program code for controlling a process to execute a process, the process comprising the text-based image retrieval method according to any of the above application embodiments.
The main contributions and innovation points of the present application are as follows:
according to the text-based image retrieval method of the present application, the features of the retrieval text and of the candidate images are extracted separately, so that the features of the two modalities exchange information effectively through feature fusion, the generalization ability of the model under a small amount of sample data is improved, an improved Transformer structure is used for association training, the association between the retrieval text and the candidate images is strengthened, and the ability to retrieve images from a text description is improved. In some embodiments, the YOLOv5 model is pre-trained on the COCO dataset to reduce resource usage.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below. Other features, objects and advantages of the application will become apparent from the description, the drawings and the claims.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow chart of a text-based image retrieval method according to an embodiment of the application;
FIG. 2 is a schematic diagram of a feature exchange network according to an embodiment of the present application;
FIG. 3 is a block diagram of a text-based image retrieval apparatus according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the present specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of one or more embodiments of the specification, as detailed in the claims which follow.
It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described herein. In some other embodiments, the method may include more or fewer steps than those described herein. Moreover, a single step described in this specification may be broken down into multiple steps for description in other embodiments; multiple steps described in this specification may be combined into a single step in other embodiments.
Example one
The present embodiment provides a text-based image retrieval method, which mainly includes steps S1-S5 as shown in fig. 1.
Step S1: a retrieval text and the initial image features of a plurality of candidate images are acquired.
In the step, a retrieval text and existing candidate images are mainly obtained, and initial image features of each candidate image are extracted, so that image retrieval is performed by using the retrieval text.
The retrieval text is a text description of a target image which is required to be acquired, and the target image is screened out from all candidate images through the description of the retrieval text on the target image.
Specifically, when a certain image or a certain type of image needs to be acquired, the image retrieval is performed by inputting a retrieval text, the candidate images are images stored in an image material library in advance, and initial image features of each candidate image can be extracted by using pre-training models for image detection.
In some embodiments, the pre-training model may adopt the YOLOv5 model. The YOLOv5 model is a model applied to image object recognition. Object recognition involves two main steps: one is to frame each object in the image with a box, and the other is to recognize the class of the object in each box; for the class of the object in each box, the model gives a confidence level indicating how confident it is in that class.
For example, the YOLOv5 model may be pre-trained with labeled training images, and each candidate image is then input into the pre-trained YOLOv5 model to obtain the corresponding initial image features. In some embodiments, the COCO dataset is used directly to train the YOLOv5 model; owing to the characteristics of the YOLOv5 model, this does not require excessive computing resources compared with other object recognition models.
Specifically, training is performed with the YOLOv5 algorithm. First, a confidence threshold is set and the boxes above it are selected, so that 10 to 36 boxes are kept, and the image regions inside these boxes are used to represent the features of the whole picture. Then an average pooling operation is performed on the picture information in each selected box. Because the boxes differ in size, the sizes after average pooling also differ, so the maximum height and width must be found first; assuming they are H and W, boxes whose height or width is less than H or W are zero-padded, that is, zeros are filled above, below, to the left and to the right until the picture information reaches that size. Assuming X boxes are selected, X pixel matrices of size H × W × 3 are finally obtained (3 representing the intensities of the three primary colors red, green and blue). The padded box features are then arranged from left to right in the order from the top left to the bottom right of the picture and concatenated (a concat operation) into one matrix with height H, width X × W, and three pixel channels.
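The assembly of box features described above can be sketched roughly as follows in PyTorch; the confidence threshold, the pooling kernel size and all tensor names are illustrative assumptions rather than the patent's reference implementation.

```python
import torch
import torch.nn.functional as F

def assemble_box_features(box_crops, confidences, conf_threshold=0.5):
    """box_crops: list of (H_i, W_i, 3) pixel tensors cut out by the detector boxes.
    confidences: per-box confidence scores output by YOLOv5."""
    # Keep only the boxes above the confidence threshold (10 to 36 boxes in the text).
    kept = [b for b, c in zip(box_crops, confidences) if c >= conf_threshold]
    # Average-pool each crop; the pooled sizes still differ because the boxes differ.
    pooled = [F.avg_pool2d(b.permute(2, 0, 1).float(), kernel_size=2) for b in kept]
    # Find the maximum height H and width W among the pooled crops.
    H = max(p.shape[1] for p in pooled)
    W = max(p.shape[2] for p in pooled)
    padded = []
    for p in pooled:
        dh, dw = H - p.shape[1], W - p.shape[2]
        # Zero padding: fill zeros left/right and above/below until the crop is H x W.
        padded.append(F.pad(p, (dw // 2, dw - dw // 2, dh // 2, dh - dh // 2)))
    # Arrange the padded crops from left to right and concatenate along the width,
    # giving a matrix with 3 channels, height H and width X * W for X kept boxes.
    return torch.cat(padded, dim=2)
```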
Step S2: converting the retrieval text into a digital matrix, extracting initial text features according to the digital matrix, and performing residual connection on the digital matrix and the initial text features to obtain enhanced text features.
In this step, the retrieval text is first converted into a digital matrix, the initial text features are extracted from the digital matrix, and the digital matrix and the initial text features are residual-connected to obtain the enhanced text features.
It is worth mentioning that enhancing the initial text features through residual connection helps the deep neural network overcome the degradation problem and converge faster.
Specifically, the retrieval text is converted into a digital matrix by the tokenizer of the Albert pre-training model, and the initial text features are then extracted from the digital matrix. The initial text features are input into a convolutional neural network, their dimension is raised or reduced so that their size matches that of the digital matrix, the values at the same positions of the initial text features and the digital matrix are added, and the enhanced text features are output. The initial text features are the Albert pre-training model's summary of word meaning and include various features of the retrieval text, such as the position of each word, the connotation of the word and its context, and can be used for various subsequent tasks.
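A minimal PyTorch sketch of this residual enhancement step is given below, assuming the digital matrix is already represented as a (batch, length, dim) tensor and that a 1×1 convolution performs the dimension raising or reducing; these choices are assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class TextFeatureEnhancer(nn.Module):
    def __init__(self, token_dim, feat_dim):
        super().__init__()
        # 1x1 convolution that raises or reduces the dimension of the initial text
        # features so that their size matches the digital matrix.
        self.proj = nn.Conv1d(feat_dim, token_dim, kernel_size=1)

    def forward(self, digital_matrix, initial_text_features):
        # digital_matrix:        (batch, seq_len, token_dim)
        # initial_text_features: (batch, seq_len, feat_dim)
        projected = self.proj(initial_text_features.transpose(1, 2)).transpose(1, 2)
        # Residual connection: add the values at the same positions.
        return digital_matrix + projected
```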
Albert is a lightweight pre-training model improved from the well-known open-source text pre-training model BERT; it largely retains the performance of BERT while reducing the number of parameters. The Albert model can perform unsupervised machine learning, that is, it does not require manual data labeling and can be trained on data crawled from the web. Trained on massive web data, Albert can fully analyze the internal logic and meaning of text, which gives the network strong generalization ability, so good results can be achieved on other tasks by fine-tuning with only a small amount of labeled data. Albert also optimizes the network parameters, greatly reducing the use of computing resources while maintaining the original effect.
The pre-training model for image detection used in step S1 and the Albert pre-training model used in step S2 both require task pre-training by training data to obtain the corresponding pre-training models.
Specifically, for the pre-training model that extracts the initial text features of the retrieval text, a matching task is performed: the original text descriptions of some images are replaced with the text descriptions of other images. For example, for an image of an elephant whose original description is "an elephant", the description is randomly replaced with that of another image, such as "a piece of grass", so that the image no longer matches its text description and forms a negative example; the images whose descriptions are not replaced form positive examples. The attention output of the first feature matrix is connected to a fully connected layer, and after cross entropy a 0 or 1 is output to indicate whether the text and the image match.
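One possible reading of this matching head is sketched below in PyTorch: the attention output over the first feature matrix is pooled, passed through a fully connected layer, and trained with cross entropy against a 0/1 match label. The pooling choice and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class MatchingHead(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, 2)      # two classes: not matched / matched
        self.criterion = nn.CrossEntropyLoss()

    def forward(self, first_feature_matrix, labels):
        # first_feature_matrix: (batch, seq_len, hidden_dim); labels: (batch,) with 0 or 1.
        pooled = first_feature_matrix.mean(dim=1)   # pool the attention output over tokens
        logits = self.fc(pooled)
        return self.criterion(logits, labels), logits
```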
For the object detection model that extracts the initial image features of the candidate images, part of the image and part of the text are masked with the special symbol MASK; 15% of the text and of the image are masked. A masked image region has a 90% probability of being set to 0 and a 10% probability of remaining unchanged. For the text, 80% of the masked tokens become MASK, 10% become random words and 10% remain unchanged. The output at masked positions first passes through a fully connected layer and then produces a probability distribution giving the probability of each word, so as to predict the masked word.
For masked pictures, the picture output is likewise connected to a fully connected layer to predict the picture's color distribution. The color distribution means that the pixels of the image are composed of the three primary colors red, green and blue, and different colors are obtained by adjusting the brightness of each, where brightness ranges from 0 to 255. Counting the brightness values yields a color distribution, e.g. 10 occurrences of green with brightness 0, 20 occurrences of green with brightness 1, and so on. The color distribution of the correct answer (the unmasked image from YOLOv5) is also counted, and the KL divergence between the two distributions is used as the loss.
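The color-distribution loss can be computed roughly as below (PyTorch); the histogram binning and the exact region tensors are assumptions for illustration.

```python
import torch

def color_distribution(region, bins=256):
    # region: (..., 3) pixel values in [0, 255]; count the brightness histogram of
    # each of the red, green and blue channels and normalize to a distribution.
    hists = [torch.histc(region[..., c].float(), bins=bins, min=0, max=255)
             for c in range(3)]
    hist = torch.cat(hists)
    return hist / hist.sum().clamp(min=1.0)

def color_kl_loss(predicted_region, unmasked_region, eps=1e-8):
    p = color_distribution(unmasked_region)   # distribution of the "correct answer"
    q = color_distribution(predicted_region)  # distribution predicted by the model
    # KL divergence between the two color distributions, used as the loss.
    return torch.sum(p * torch.log((p + eps) / (q + eps)))
```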
Step S3: and respectively fusing the enhanced text features with each initial image feature to obtain a corresponding first feature matrix, and respectively fusing the initial text features with each initial image feature to obtain a corresponding second feature matrix.
In this step, the features of the retrieval text and of the candidate images are cross-connected so that data of the two modalities can be fully exchanged and the feature information in the retrieval text and in the candidate images can be effectively utilized. The enhanced text features are fused with each initial image feature to obtain a corresponding first feature matrix, and the initial text features are fused with each initial image feature to obtain a corresponding second feature matrix.
Specifically, "respectively fusing the enhanced text features with each of the initial image features to obtain a corresponding first feature matrix" includes: converting the enhanced text features into a one-dimensional text vector, making the one-dimensional text vector the same size as the initial image features through a fully connected layer, and then fusing it with each initial image feature to obtain the corresponding first feature matrix.
Likewise, "fusing the initial text features with each of the initial image features to obtain a corresponding second feature matrix" includes: converting each initial image feature into a corresponding one-dimensional image vector, making each one-dimensional image vector the same size as the initial text features through a fully connected layer, and then fusing each one-dimensional image vector with the initial text features to obtain the corresponding second feature matrix.
Step S4: inputting the second feature matrix and the first feature matrix which are fused with the same initial image features into a feature exchange network at the same time to obtain a corresponding exchange feature matrix.
In this step, a specially designed feature exchange network is used to exchange, between the two modalities, the information expressed by the second feature matrix and the first feature matrix that fuse the same initial image features.
The feature exchange network comprises a text processing network and an image processing network which are parallel; the text processing network and the image processing network comprise the same number of Transformer layers, which are connected in sequence according to their hierarchical order. The query features in each corresponding Transformer layer of the text processing network and the image processing network are exchanged, and the query features are obtained by linearly transforming the input of each Transformer layer.
The feature exchange network structure in this embodiment is shown in FIG. 2 and comprises a parallel text processing network and image processing network. If several Transformer layers are connected in sequence, the input of each Transformer layer is the output of the previous one, except that the input of the first Transformer layer of the text processing network is the first feature matrix and the input of the first Transformer layer of the image processing network is the second feature matrix.
The step of "simultaneously inputting the second feature matrix and the first feature matrix which are fused with the same initial image features into a feature exchange network to obtain a corresponding exchange feature matrix" comprises: inputting the first feature matrix into a Transformer layer of the text processing network and performing linear transformations to obtain a word query feature, a word key feature and a word value feature of each word; inputting the second feature matrix into a Transformer layer of the image processing network and performing linear transformations to obtain a pixel query feature, a pixel key feature and a pixel value feature of each pixel; exchanging the word query features and the pixel query features output by Transformer layers of the same level; linearly transforming the word query feature of the word to be processed and the word key feature of each of the other words to obtain a word relevance corresponding to each of the other words, multiplying the word value feature of the word to be processed by all the word relevances and summing to obtain the word expression of the word to be processed, traversing to obtain the word expression of each word to be processed to form an initial exchange feature matrix, and converting the initial exchange feature matrix to obtain exchange feature data.
Specifically, the input of each Transformer layer in the text processing network is subjected to a first, a second and a third linear transformation to obtain the word query feature (query), word key feature (key) and word value feature (value) of each word; and the input of each Transformer layer in the image processing network is subjected to a first, a second and a third linear transformation to obtain the image query feature (query), image key feature (key) and image value feature (value) of each image.
Then the query features in each corresponding Transformer layer of the text processing network and the image processing network are exchanged, that is, the query features of each word and the query features of each pixel are exchanged, associating words with pixels.
After the exchange, each word in turn is taken as the word to be processed and input into a multi-head attention mechanism to obtain its word expression. Specifically, the word query feature of the word to be processed and the word key feature of each of the other words are linearly transformed to obtain the word relevance corresponding to each of the other words, and the word value feature of the word to be processed is multiplied by all the word relevances and summed to obtain the word expression of the word to be processed.
Then the word expressions of all the words to be processed are obtained by traversal to form the initial exchange feature matrix, and the initial exchange feature matrix is converted to obtain the exchange feature data. Specifically, the conversion applies, in sequence, residual connection, normalization, full connection, residual connection and normalization to the initial exchange feature matrix to obtain the exchange feature matrix. The conversion is the same as in a conventional Transformer layer and is not described in detail here.
It is worth mentioning that words and pixels are associated by exchanging the query features of the words and the pixels: each word is associated with one or more pixels, and the pixels associated with all the words together form the features of the target image described by the retrieval text. The initial image features of the candidate images are matched one by one against these target-image features, and the candidate image with the highest matching degree, or with a matching degree above a set threshold, is selected as the target image corresponding to the retrieval text.
For example, the first feature matrix is input into the text processing network to generate the word query feature (query), word key feature (key) and word value feature (value) of each word, and the image query feature (query), image key feature (key) and image value feature (value) of the candidate image are obtained in the same way. After the queries of the words and the pixels are exchanged, each word in turn is taken as the word to be processed; the query of the word to be processed and the key of each of the other words are subjected to a linear transformation (an open-source linear regression model can be used), and the result of the linear transformation represents the relevance of the two words; all the relevances are multiplied by the value of the word and the products are summed; the word expression of each word to be processed is obtained by traversal to form the initial exchange feature matrix, which is converted to obtain the exchange feature data.
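A condensed, single-head sketch of one pair of query-exchange Transformer layers is given below (PyTorch). It assumes the first and second feature matrices have the same sequence length and width after fusion, reads "exchange" as each branch attending with the other branch's queries, and uses a standard post-norm block; all of these are interpretive assumptions, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class QueryExchangeLayer(nn.Module):
    def __init__(self, dim, ffn_dim=2048):
        super().__init__()
        # Linear transformations producing query / key / value for the text branch (words)
        # and the image branch (pixels).
        self.word_q, self.word_k, self.word_v = (nn.Linear(dim, dim) for _ in range(3))
        self.pix_q, self.pix_k, self.pix_v = (nn.Linear(dim, dim) for _ in range(3))
        self.norm_t1, self.norm_t2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm_i1, self.norm_i2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn_t = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))
        self.ffn_i = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))

    @staticmethod
    def attend(q, k, v):
        # Relevance of every token pair, then a weighted sum over the value features.
        rel = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return rel @ v

    def forward(self, text_x, image_x):
        # text_x: (batch, seq_len, dim) first-feature-matrix branch;
        # image_x: (batch, seq_len, dim) second-feature-matrix branch.
        wq, wk, wv = self.word_q(text_x), self.word_k(text_x), self.word_v(text_x)
        pq, pk, pv = self.pix_q(image_x), self.pix_k(image_x), self.pix_v(image_x)
        # Query exchange: the text branch attends with the pixel queries and vice versa.
        text_out = self.attend(pq, wk, wv)
        image_out = self.attend(wq, pk, pv)
        # Residual connection, normalization, full connection, residual connection, normalization.
        text_x = self.norm_t1(text_x + text_out)
        text_x = self.norm_t2(text_x + self.ffn_t(text_x))
        image_x = self.norm_i1(image_x + image_out)
        image_x = self.norm_i2(image_x + self.ffn_i(image_x))
        return text_x, image_x
```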
In some embodiments, the text processing network and the image processing network may each contain a larger (but equal) number of Transformer layers to allow more information exchange. For example, experiments and experience show that eight sequentially connected Transformer layers give the best information exchange effect.
Step S5: inputting all the exchange feature matrices into a head prediction network to obtain at least one target image.
In this step, the head prediction network is used to determine the matching degree between the text corresponding to each exchange feature matrix and the candidate image, and the candidate image with the highest matching degree, or with a matching degree exceeding a set threshold, is output as the target image.
Taking steps S1-S5 as a whole image retrieval model, a training task is constructed to fine-tune the model. Because the original image retrieval model is trained on sample data whose distribution is similar to, but not exactly the same as, that of the actual image retrieval task, the model needs to be retrained with image data corresponding to the actual task to adjust the distribution and achieve the best effect.
Specifically, a multiple-choice task is set for the image retrieval model. Each question comprises four options, each option being a data pair consisting of an image and a text; three of the texts do not match the image and one does. The data pairs are input into the image retrieval model trained for the actual task; the exchange feature matrix output by the last Transformer layer of the text processing network is connected to the head prediction network, namely a linear fully connected layer, which outputs the matching degree of the text and the image. After the matching degree of each data pair is obtained, a softmax operation is applied to the matching degrees, and the image with the highest matching degree, or with a matching degree exceeding a certain threshold, is output as the target image.
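The head prediction and ranking step might look like the following sketch (PyTorch), where the exchange feature matrix from the last text-branch Transformer layer is pooled and scored by a linear fully connected layer; the pooling and thresholding details are assumptions.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        # Linear fully connected layer giving one matching degree per text-image pair.
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, exchange_feature_matrices):
        # exchange_feature_matrices: (num_candidates, seq_len, hidden_dim)
        pooled = exchange_feature_matrices.mean(dim=1)
        scores = self.fc(pooled).squeeze(-1)      # matching degree of each candidate image
        return torch.softmax(scores, dim=0)       # softmax over the candidates

# Usage sketch: pick the candidate with the highest score, or all above a threshold.
# probs = PredictionHead(512)(all_exchange_matrices); target_index = probs.argmax()
```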
Example two
Based on the same concept, the present embodiment further provides a text-based image retrieval apparatus, configured to implement the text-based image retrieval method described in the first embodiment, and with specific reference to fig. 3, the apparatus includes the following modules:
the acquisition module is used for acquiring a retrieval text and initial image features of a plurality of candidate images;
the text feature extraction module is used for converting the retrieval text into a digital matrix, extracting initial text features according to the digital matrix, and performing residual connection on the digital matrix and the initial text features to obtain enhanced text features;
the feature fusion module is used for respectively fusing the enhanced text features with each initial image feature to obtain a corresponding first feature matrix, and respectively fusing the initial text features with each initial image feature to obtain a corresponding second feature matrix;
the feature exchange module is used for inputting the second feature matrix and the first feature matrix which are fused with the same initial image features into a feature exchange network at the same time to obtain a corresponding exchange feature matrix, wherein the feature exchange network comprises a text processing network and an image processing network which are parallel, the text processing network and the image processing network comprise the same number of Transformer layers, query features in each corresponding Transformer layer of the text processing network and the image processing network are exchanged, and the query features are obtained by performing a linear transformation on the input of each Transformer layer;
and the prediction module is used for inputting all the exchange feature matrices into a head prediction network to obtain at least one target image.
EXAMPLE III
The present embodiment further provides an electronic device, referring to fig. 4, comprising a memory 404 and a processor 402, wherein the memory 404 stores a computer program, and the processor 402 is configured to run the computer program to perform the steps of any one of the text-based image retrieval methods in the above embodiments.
Specifically, the processor 402 may include a Central Processing Unit (CPU) or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more integrated circuits implementing the embodiments of the present application.
The processor 402 implements any of the text-based image retrieval methods in the above embodiments by reading and executing computer program instructions stored in the memory 404.
Optionally, the electronic apparatus may further include a transmission device 406 and an input/output device 408, where the transmission device 406 is connected to the processor 402, and the input/output device 408 is connected to the processor 402.
The transmitting device 406 may be used to receive or transmit data via a network. Specific examples of the network described above may include a wired or wireless network provided by a communication provider of the electronic device. In one example, the transmission device includes a Network adapter (NIC) that can be connected to other Network devices through a base station to communicate with the internet. In one example, the transmitting device 406 may be a Radio Frequency (RF) module, which is used to communicate with the internet in a wireless manner.
The input-output device 408 is used to input or output information. In this embodiment, the input information may be a current data table such as an epidemic situation tune document, feature data, a template table, and the like, and the output information may be a feature fingerprint, a fingerprint template, text classification recommendation information, a file template configuration mapping table, a file template configuration information table, and the like.
Alternatively, in this embodiment, the processor 402 may be configured to execute the following steps by a computer program:
acquiring a retrieval text and initial image features of a plurality of candidate images;
converting the retrieval text into a digital matrix, extracting initial text features according to the digital matrix, and performing residual connection on the digital matrix and the initial text features to obtain enhanced text features;
respectively fusing the enhanced text features with each initial image feature to obtain a corresponding first feature matrix, and respectively fusing the initial text features with each initial image feature to obtain a corresponding second feature matrix;
simultaneously inputting the second feature matrix and the first feature matrix which are fused with the same initial image features into a feature exchange network to obtain a corresponding exchange feature matrix, wherein the feature exchange network comprises a text processing network and an image processing network which are parallel, the text processing network and the image processing network comprise the same number of Transformer layers, query features in each corresponding Transformer layer of the text processing network and the image processing network are exchanged, and the query features are obtained by performing a linear transformation on the input of each Transformer layer;
and inputting all the exchange feature matrices into a head prediction network to obtain at least one target image.
It should be noted that, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiment and optional implementation manners, and details of this embodiment are not described herein again.
In addition, with reference to any one of the foregoing embodiments of the text-based image retrieval method, the embodiments of the present application may be implemented as a computer program product. The computer program product comprises software code portions for performing a method for text based image retrieval implementing any of the above embodiments when the computer program product is run on a computer.
In addition, in combination with any one of the text-based image retrieval methods in the first embodiment, the embodiments of the present application may be implemented by providing a readable storage medium. The readable storage medium has a computer program stored thereon; the computer program when executed by a processor implements any of the text based image retrieval methods of the above embodiments.
In general, the various embodiments may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. Some aspects of the invention may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
Embodiments of the invention may be implemented by computer software executable by a data processor of the mobile device, such as in a processor entity, or by hardware, or by a combination of software and hardware. Computer software or programs (also referred to as program products) including software routines, applets and/or macros can be stored in any device-readable data storage medium and they include program instructions for performing particular tasks. The computer program product may comprise one or more computer-executable components configured to perform embodiments when the program is run. The one or more computer-executable components may be at least one software code or a portion thereof. Further in this regard it should be noted that any block of the logic flow as in the figures may represent a program step, or an interconnected logic circuit, block and function, or a combination of a program step and a logic circuit, block and function. The software may be stored on physical media such as memory chips or memory blocks implemented within the processor, magnetic media such as hard or floppy disks, and optical media such as, for example, DVDs and data variants thereof, CDs. The physical medium is a non-transitory medium.
It should be understood by those skilled in the art that the technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered to be within the scope described in this specification.
The above examples only express several embodiments of the present application, and the description thereof is more specific and detailed, but not to be construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, and these are all within the scope of protection of the present application. Therefore, the protection scope of the present application should be subject to the appended claims.
Claims (12)
1. A text-based image retrieval method is characterized by comprising the following steps:
acquiring a retrieval text and initial image features of a plurality of candidate images;
converting the retrieval text into a digital matrix, extracting initial text features according to the digital matrix, and performing residual connection on the digital matrix and the initial text features to obtain enhanced text features;
respectively fusing the enhanced text features with each initial image feature to obtain a corresponding first feature matrix, and respectively fusing the initial text features with each initial image feature to obtain a corresponding second feature matrix;
inputting the second feature matrix and the first feature matrix which are fused with the same initial image feature into a feature exchange network at the same time to obtain a corresponding exchange feature matrix, wherein the feature exchange network comprises a text processing network and an image processing network which are parallel, the text processing network and the image processing network comprise the same number of Transformer layers, query features in each corresponding Transformer layer of the text processing network and the image processing network are exchanged, and the query features are obtained by performing a linear transformation on the input of each Transformer layer;
and inputting all the exchange feature matrices into a head prediction network to obtain at least one target image.
2. The method of claim 1, wherein the method of obtaining the initial image features of each candidate image comprises: pre-training a YOLOv5 model with labeled training images, and inputting each candidate image into the pre-trained YOLOv5 model to obtain the corresponding initial image features.
3. The text-based image retrieval method of claim 1, wherein converting the retrieval text into a digital matrix and extracting initial text features from the digital matrix comprises: inputting the retrieval text into an Albert pre-training model, converting the retrieval text into a digital matrix through the tokenizer of the Albert pre-training model, and extracting the initial text features according to the digital matrix.
4. The text-based image retrieval method of claim 1, wherein inputting the second feature matrix and the first feature matrix which are fused with the same initial image features into a feature exchange network at the same time to obtain a corresponding exchange feature matrix comprises: inputting the first feature matrix into a Transformer layer of the text processing network and performing linear transformations to obtain a word query feature, a word key feature and a word value feature of each word; inputting the second feature matrix into a Transformer layer of the image processing network and performing linear transformations to obtain a pixel query feature, a pixel key feature and a pixel value feature of each pixel; exchanging the word query features and pixel query features obtained by the linear transformations of Transformer layers of the same level; linearly transforming the word query feature of the word to be processed and the word key feature of each of the other words to obtain a word relevance corresponding to each of the other words, multiplying the word value feature of the word to be processed by all the word relevances and summing to obtain the word expression of the word to be processed, traversing to obtain the word expression of each word to be processed to form an initial exchange feature matrix, and converting the initial exchange feature matrix to obtain exchange feature data.
5. The text-based image retrieval method according to claim 4, wherein the input of each Transformer layer in the text processing network is subjected to a first linear transformation, a second linear transformation and a third linear transformation to obtain the word query feature, word key feature and word value feature of each word; and the input of each Transformer layer in the image processing network is subjected to a first linear transformation, a second linear transformation and a third linear transformation to obtain the image query feature, image key feature and image value feature of each image.
6. The text-based image retrieval method of claim 4, wherein converting the initial exchange feature matrix to obtain exchange feature data comprises: sequentially performing residual connection, a normalization operation, full connection, residual connection and a normalization operation on the initial exchange feature matrix to obtain the exchange feature matrix.
7. The text-based image retrieval method of claim 1, wherein "performing residual connection on the digital matrix and the initial text features to obtain enhanced text features" comprises: inputting the initial text features into a convolutional neural network, raising or reducing the dimension of the initial text features so that they are the same size as the digital matrix, and adding the values of the initial text features and the digital matrix at the same positions to obtain the enhanced text features.
8. The method of claim 1, wherein fusing the enhanced text features with each of the initial image features to obtain a corresponding first feature matrix comprises: converting the enhanced text features into a one-dimensional text vector, making the one-dimensional text vector the same size as the initial image features through a fully connected layer, and then fusing it with each initial image feature to obtain the corresponding first feature matrix.
9. The method according to claim 1, wherein fusing the initial text features with each of the initial image features to obtain a corresponding second feature matrix comprises: converting each initial image feature into a corresponding one-dimensional image vector, making each one-dimensional image vector the same size as the initial text features through a fully connected layer, and then fusing each one-dimensional image vector with the initial text features to obtain the corresponding second feature matrix.
10. A text-based image retrieval apparatus, comprising:
the acquisition module is used for acquiring a retrieval text and initial image features of a plurality of candidate images;
the text feature extraction module is used for converting the retrieval text into a digital matrix, extracting initial text features according to the digital matrix, and performing residual connection on the digital matrix and the initial text features to obtain enhanced text features;
the feature fusion module is used for respectively fusing the enhanced text features with each initial image feature to obtain a corresponding first feature matrix, and respectively fusing the initial text features with each initial image feature to obtain a corresponding second feature matrix;
the feature exchange module is used for inputting the second feature matrix and the first feature matrix which are fused with the same initial image features into a feature exchange network at the same time to obtain a corresponding exchange feature matrix, wherein the feature exchange network comprises a text processing network and an image processing network which are parallel, the text processing network and the image processing network comprise the same number of Transformer layers, query features in each corresponding Transformer layer of the text processing network and the image processing network are exchanged, and the query features are obtained by performing a linear transformation on the input of each Transformer layer;
and the prediction module is used for inputting all the exchange feature matrices into a head prediction network to obtain at least one target image.
11. An electronic device comprising a memory and a processor, wherein the memory has a computer program stored therein, and the processor is configured to run the computer program to perform the text-based image retrieval method according to any one of claims 1 to 9.
12. A readable storage medium, characterized in that a computer program is stored therein, the computer program comprising program code for controlling a process to execute a process, the process comprising the text based image retrieval method according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210221464.XA CN114357231B (en) | 2022-03-09 | 2022-03-09 | Text-based image retrieval method and device and readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210221464.XA CN114357231B (en) | 2022-03-09 | 2022-03-09 | Text-based image retrieval method and device and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114357231A CN114357231A (en) | 2022-04-15 |
CN114357231B true CN114357231B (en) | 2022-06-28 |
Family
ID=81094578
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210221464.XA Active CN114357231B (en) | 2022-03-09 | 2022-03-09 | Text-based image retrieval method and device and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114357231B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114973294B (en) * | 2022-07-28 | 2022-10-21 | 平安科技(深圳)有限公司 | Image-text matching method, device, equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020242383A1 (en) * | 2019-05-28 | 2020-12-03 | Active Intelligence Pte Ltd | Conversational diaglogue system and method |
CN112069399A (en) * | 2020-08-25 | 2020-12-11 | 中国人民大学 | Personalized search system based on interactive matching |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1635544A (en) * | 2004-12-09 | 2005-07-06 | 王为民 | Method for making deformation pattern of object |
CN109376261B (en) * | 2018-10-29 | 2019-09-24 | 山东师范大学 | Modality-independent retrieval method and system based on mid-level text semantic enhancement space |
CN114049408B (en) * | 2021-11-15 | 2024-07-12 | 哈尔滨工业大学(深圳) | Depth network method for accelerating multi-mode MR imaging and imaging method thereof |
- 2022-03-09: CN application CN202210221464.XA, patent CN114357231B (en), status Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020242383A1 (en) * | 2019-05-28 | 2020-12-03 | Active Intelligence Pte Ltd | Conversational diaglogue system and method |
CN112069399A (en) * | 2020-08-25 | 2020-12-11 | 中国人民大学 | Personalized search system based on interactive matching |
Also Published As
Publication number | Publication date |
---|---|
CN114357231A (en) | 2022-04-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |