CN119206743B - Adaptive equalization-aware change captioning method - Google Patents
Adaptive equalization-aware change captioning method
- Publication number
- CN119206743B (application CN202411710207.8A)
- Authority
- CN
- China
- Prior art keywords
- attention
- dimension
- query
- image
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19147—Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/18—Extraction of features or characteristics of the image
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19173—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/1918—Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to the technical field of change captioning, and in particular to an adaptive equalization-aware change captioning method. The method comprises: obtaining the training-set visual features of the change-captioning data and constructing a vocabulary and a tag file; in the first stage of the encoder, obtaining context-enhanced image features; in the second stage of the encoder, obtaining the common features of the two images; in the third stage of the encoder, extracting the difference features of the images and splicing the two groups of difference features to construct two contrastive representations of the difference, and finally modeling the overall difference against each original image respectively; in the first stage of the decoder, establishing grammatical correlations for each word in the text data through a text-dependent model; and in the second stage of the decoder, obtaining a set of captions describing the image difference.
Description
Technical Field
The invention relates to the technical field of change captioning, and in particular to an adaptive equalization-aware change captioning method.
Background
Change captioning aims to describe the semantic changes between a pair of similar images in compact natural language. The task has many practical applications, such as environmental monitoring, tracking pathological changes, and street-view change detection. Localizing differences and describing them in natural language is therefore a highly challenging problem in the multi-modal domain.
Previous work on change captioning has mainly compared the two images globally or locally in space, locating the changed regions in the encoder. The global approach is the most intuitive: the two images are subtracted directly to obtain the changed features. In general, however, the two images are not feature-aligned, and objects may even be inconspicuous or occluded. Local approaches instead perform feature matching: because the shared regions of a pair of similar images usually occupy a large proportion and are easy to extract, cross attention is used to extract the common features and the changed features are then obtained from the difference. This approach is easily affected by extreme viewpoint changes, and pseudo-changes in object appearance and position can mask the real change. The drawbacks of both approaches produce noisy difference features that are detrimental to sentence generation in the subsequent decoder.
Disclosure of Invention
In view of the above, the present invention provides an adaptive equalization-aware change captioning method, which improves the accuracy of the semantic information while ensuring that the generated captions are complete and fluent.
In a first aspect, the present invention provides an adaptive equalization-aware change captioning method, the method comprising:
Step 1: acquire the image data and text data for change captioning, and obtain the training-set visual features of the pre-change and post-change images;
Step 2: take the image data and the text data as input and construct a vocabulary and a tag file;
Step 3: input the i-th pair of image features from the training-set visual features into the encoder; in the first stage of the encoder, the two image features pass through the context fusion module (Context Fusion Module), which enhances the feature representation of each pixel with context information in the horizontal and vertical directions, yielding a pair of context-enhanced image features;
Step 4: in the second stage of the encoder, when the context-enhanced image features pass through the context fusion module, the common features are extracted twice, once with the pre-change features as the query and once with the post-change features as the query, giving two groups of common features;
Step 5: in the third stage of the encoder, extract the difference features of the images from the common features based on differential expression (Differential Expression): one image difference is obtained with the pre-change features as the query and the other with the post-change features as the query; the two groups of difference features are spliced to construct two contrastive representations of the difference, and the overall difference is finally modeled against each of the two original image features respectively;
Step 6: in the first stage of the decoder, establish grammatical correlations for each word in the text data through a text-dependent model (Text-Dependent Model);
Step 7: in the second stage of the decoder, put the grammatically related text information and the difference representations obtained above into a Transformer to obtain a set of captions describing the image difference.
Optionally, the step 1 includes:
The image data are two groups of similar images, the pre-change group (images) and the post-change group (sc_images), where the i-th pair consists of the i-th pre-change image and the i-th post-change image. The text data comprise the description of each image pair and the parts of speech of the description. The two groups of image data are each divided into a training set, a verification set and a test set in the ratio 8:1:1, and the text information corresponding to the image data is divided into the same three parts. The image data and text data are preprocessed and the visual features are extracted, giving the training-set visual features;
The visual features are extracted by reading images in batches from a specified directory, standardizing the input images by scaling pixel values to the range [0, 1] and normalizing with a mean and standard deviation, extracting features with a pre-trained ResNet model, converting the features into NumPy arrays, and saving them as .npy files.
Optionally, the step 2 includes:
Step 21: load, from the text data, the json files containing the image descriptions, the dependency relations of the descriptions, and the dataset split, then save the image names in an all_imgs file, obtaining statistics including Total images, Total captioned images, Total captions, Total train images and Total train captions;
Step 22: construct the required vocabulary dictionary. First traverse the text sequences and tokenize them with the given separators, selectively keeping or removing specified symbols; add the special marks <START> and <END> at the beginning and end of each token list; count the frequency of every token to update the dictionary; sort the counted words, set a user-defined minimum frequency, add the words whose frequency is greater than or equal to the minimum frequency to the dictionary, and assign an index to each entry;
Step 23: construct the required dependency vocabulary. A head_tags dictionary defines the dependency tags and their indexes, where each tag represents a different grammatical role; a stoi dictionary stores the special tags and their indexes. The dependency tags are added to the stoi dictionary and each tag is assigned an index, so the constructed dependency-tag vocabulary has the format {tag: index}. Dependency analysis of the vocabulary helps the model understand sentence structure and grammatical relations, and converting tags into indexes facilitates subsequent model input and processing;
Step 24: convert the token list into the corresponding index list through the Encode function, checking whether each token exists in the vocabulary; if not, decide according to the settings whether to replace it with <UNK>. Finally, store the encoded descriptions, the dependency information and the index information in an HDF5 file.
Optionally, the step 3 includes:
Step 31: the input image features have dimension (B, C, H, W) and are reshaped with view into a feature map of dimension (B, H×W, C); a preprocessing feature map is then obtained through an nn.Sequential block containing a fully connected layer, layer normalization and Dropout, whose linear transformation has C input features; the feature map of the other image is obtained in the same way;
Step 32: the preprocessed feature map is taken as input, and weighted features are generated through the convolution and cosine-similarity calculations of the context fusion module, in five steps:
(321) Generate query, key and value: taking the preprocessed feature map as input, the query, key and value are obtained by convolutions with a kernel size of 1; the resulting features all share the same dimension and are then reshaped for the attention computation;
(322) Compute the attention scores in the horizontal and vertical directions and apply a negative-infinity mask: the query and key are used to compute the attention scores in each direction; for the horizontal direction, the similarity between query and key is obtained by matrix multiplication and the negative-infinity mask is added; after the computation, the result is reshaped and its dimensions rearranged, the horizontal and vertical scores having dimension (B, H, W, W);
(323) Combine the attention scores: the horizontal and vertical attention scores are concatenated with torch.cat and normalized with a softmax function to obtain the attention weight of each position;
(324) Separate the attention results: the combined attention scores are split back into horizontal and vertical attention, and each is reshaped for subsequent processing; the horizontal attention is converted into a suitable shape through permute and reshape operations, the vertical attention is processed in the same way, and the dimension is (B×W, H, W);
(325) Weighted value calculation: the values are weighted by the computed attention scores; the value and the attention score are multiplied by matrix multiplication to obtain the horizontal and vertical outputs out_H and out_W, and after the computation the results are reshaped and their dimensions rearranged to give the final outputs.
Optionally, the step 4 includes:
The extraction of common features in the context fusion module is divided into three steps:
(41) Feature processing: one image's context-enhanced features serve as the query and the other image's features serve as the key and value, invoked through a layer_module function call; convolutions are then applied to (Q, K, V) separately and the resulting feature maps are reshaped to obtain a new feature map (Q, K, V); the convolution has a kernel size of 3, padding 1, stride 1, no bias, and 512 groups;
(42) Attention-score calculation: the attention score att_score = QK^T / sqrt(d_k) measures the similarity or relevance between the query and all keys; the score is scaled by sqrt(d_k), the dimension of the key, to keep oversized dot products from causing vanishing gradients in the softmax; the score is then converted into a probability distribution with a softmax function and dropout is applied;
(43) Context-representation calculation: the context representation is obtained as the value weighted by the attention probability distribution, and the common feature is obtained by adding this weighted result to the query; the same computation is then repeated with the roles of the two images swapped (the other image's features as the query, the first image's features as the key and value), giving the second common feature; both expressions follow the scaled dot-product attention form.
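The two expressions can be written compactly in standard scaled dot-product attention notation. This is a hedged reconstruction: the original symbols are not legible in this text, so the subscripts bef and aft (pre-change and post-change image) and the names Q, K, V and d_k are introduced here only for illustration.

% Assumed reconstruction: each common feature is the attention-weighted value
% plus the query, with the two images exchanging the query role.
\[
C_{\mathrm{bef}} = \mathrm{softmax}\!\left(\frac{Q_{\mathrm{bef}}K_{\mathrm{aft}}^{\top}}{\sqrt{d_k}}\right)V_{\mathrm{aft}} + Q_{\mathrm{bef}},
\qquad
C_{\mathrm{aft}} = \mathrm{softmax}\!\left(\frac{Q_{\mathrm{aft}}K_{\mathrm{bef}}^{\top}}{\sqrt{d_k}}\right)V_{\mathrm{bef}} + Q_{\mathrm{aft}}.
\]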
Optionally, the step 5 includes:
Step 51: the difference features in the differential expression are extracted from the common features obtained above: the common feature obtained with the pre-change image as the query is reshaped to give the image difference based on the pre-change query, and likewise the common feature obtained with the post-change image as the query gives the image difference based on the post-change query;
Step 52: the data are preprocessed before the difference information is passed to the decoder, in three steps:
(521) Splice the difference features: the two groups of difference image features are concatenated along the dim = -1 dimension; a fully connected layer then applies a linear transformation with C input features to the result of dimension (B, H×W, C); to prevent over-fitting, a Dropout layer with a discard rate of 10% is added and a ReLU activation function is used;
(522) Combine the input and difference features: the overall difference is concatenated with each of the original image features on dim = 1 to form two new features of dimension (B, C, H, W);
(523) Convolution calculation: the new features are passed through a convolution with an input channel count of C, a kernel size of 1 and padding 0; the channels are then divided into 32 groups and normalized with a group normalization layer to obtain the final result.
Optionally, the step 6 includes:
Step 61: before correlations are established for the text, the text data must be preprocessed. The description information seq, the dependency information dep and the mask information mask of the images all have dimension (N, L). Word embedding is first applied to the image descriptions in the text data, with a vocabulary size of 76 and an embedding vector size of 300, and index 0 is reserved for padding, giving seq with dimension (N, L, D). Positional information is then added to seq to help the model understand the order of the elements in the sequence: a positional-encoding tensor of size (128, 500) is created at initialization, sliced to the sequence length and added to the input seq so that the shape of the final input is unchanged. Finally, a fully connected layer applies a linear transformation with D input features and 512 output features, giving a seq tensor of dimension (N, L, 512);
Step 62: connections between words are established for the description information seq through self-attention, in four steps:
(621) Map seq to query, key and value and adjust dimensions: the input seq is mapped into the query, key and value spaces with fully-connected-layer linear transformations (512 input features, 512 output features, output dimension (N, L, 512)); the query, key and value tensors are then converted into a format suitable for attention calculation by a dimension-transformation method that keeps the first two dimensions, introduces a new dimension, and splices the result;
(622) Compute the attention score by dot-multiplying the query and the key, i.e. att_score = QK^T / sqrt(d_k), which measures the similarity or relevance between the query and all keys; the score is scaled by sqrt(d_k), the dimension of the key, to avoid vanishing gradients in the softmax caused by oversized dot products; the score is then converted into a probability distribution with a softmax function and dropout is applied, discarding part of the attention weights;
(623) Calculating the context representation, and obtaining the context representation according to the weighted value of the attention probability distribution;
(624) Shape adjustment, the shape of the context layer is adjusted to the form of (N, L, 512), and the result is returned.
Optionally, the step 7 includes:
Step 71: before correlations are established between the difference representations and the text data, the description information seq, which already carries its internal correlations, is added to the hidden state of the decoder, and the sum is layer-normalized to improve the convergence speed and stability of the model; the seq dimension is (N, L, 512);
Step 72: cross attention is used to establish the relationship between the text and the image difference features, in four steps:
(721) Map seq into the query and the difference representations into the key and value, with the corresponding dimension adjustments: the query, key and value are obtained with fully-connected-layer linear transformations (512 input features, 512 output features, output dimension (N, L, 512)); the query, key and value tensors are then converted into a format suitable for attention calculation by a dimension-transformation method that keeps the first two dimensions, introduces a new dimension, and finally splices the result;
(722) Compute the attention score by dot-multiplying the query and the key, giving att_score = QK^T / sqrt(d_k), which measures the similarity or relevance between the query and all keys; the score is scaled by sqrt(d_k) to avoid vanishing gradients caused by oversized dot products; if att_mask is not empty it is added to att_score; the score is then converted into a probability distribution with a softmax function; an average attention score of each query over all keys, att_score1, is obtained by averaging the attention scores along the second dimension; and dropout is applied to the attention scores, discarding part of the attention weights;
(723) Finally, the shape of the context layer is adjusted to (N, L, 512); att is connected to the hidden state of the decoder through a residual connection, and the sum is layer-normalized;
(724) Feature processing: a linear transformation with 512 input and output features is applied to att and activated with a GELU activation function; a second linear transformation with 512 input and output features and dropout is then applied, the result is added through a residual connection, layer normalization is applied, and the result is returned.
In a second aspect, an embodiment of the present invention provides a computer readable storage medium, where the computer readable storage medium includes a stored program which, when run, controls a device in which the computer readable storage medium is located to perform the adaptive equalization-aware change captioning method of the first aspect or any possible implementation of the first aspect.
In a third aspect, an embodiment of the present invention provides an electronic device comprising one or more processors, a memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory and comprise instructions that, when executed by the device, cause the device to perform the adaptive equalization-aware change captioning method of the first aspect or any possible implementation of the first aspect.
In the method, the image data and text data for change captioning are acquired to obtain the training-set visual features; the image data and text data are taken as input to construct a vocabulary and a tag file; the i-th pair of training-set image features is input to the encoder, whose first stage enhances the feature representation of each pixel with context information in the horizontal and vertical directions; in the second stage, the common features are extracted with each image's features in turn as the query; in the third stage, the difference features are extracted based on differential expression, the two groups of difference features are spliced to construct two contrastive representations of the difference, and the overall difference is modeled against each original image; in the first stage of the decoder, grammatical correlations are established for each word in the text data through the text-dependent model; and in the second stage of the decoder, the text information and the difference representations are fed into a Transformer to obtain a set of captions describing the image difference. In this way, the generated captions remain complete and fluent according to the grammar rules and context information, visual information and language information are combined effectively, and the accuracy of the semantic information is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of an adaptive equalization-aware change captioning method according to an embodiment of the present invention;
fig. 2 is a schematic architecture diagram of an adaptive equalization sensing network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a context fusion module according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a common feature fusion module according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of an input image pair for change captioning according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of a change-caption output according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this embodiment of the invention, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "a and/or b" may mean that a exists alone, that a and b both exist, or that b exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
The term "if" as used herein may be interpreted, depending on the context, as "when", "upon", "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted, depending on the context, as "when it is determined", "in response to determining", "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)".
Fig. 1 is a flowchart of an adaptive equalization-aware change captioning method according to an embodiment of the present invention. As shown in fig. 1, the method includes:
Step 1: acquire the image data and text data for change captioning, and obtain the training-set visual features of the pre-change and post-change images.
In the embodiment of the present invention, as shown in fig. 2, step 1 includes:
The image data are two groups of similar images, the pre-change group (images) and the post-change group (sc_images), where the i-th pair consists of the i-th pre-change image and the i-th post-change image. The text data comprise the description of each image pair and the parts of speech of the description. The two groups of image data are each divided into a training set, a verification set and a test set in the ratio 8:1:1, and the text information corresponding to the image data is divided into the same three parts. The image data and text data are preprocessed and the visual features are extracted, giving the training-set visual features;
The visual features are extracted by reading images in batches from a specified directory, standardizing the input images by scaling pixel values to the range [0, 1] and normalizing with a mean and standard deviation, extracting features with a pre-trained ResNet model, converting the features into NumPy arrays, and saving them as .npy files.
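The preprocessing and feature-extraction pipeline described above can be sketched as follows. This is a minimal sketch under assumptions: the ResNet variant (ResNet-101), the input resolution, the directory names data/images and data/sc_images, and the output file names are illustrative and are not specified by the patent.

import os
import numpy as np
import torch
from PIL import Image
from torchvision import models, transforms

# Assumed preprocessing: scale pixel values to [0, 1], then normalize with ImageNet mean/std.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),                      # scales pixel values to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Pretrained ResNet with the classification head removed, so the output is a
# spatial feature map rather than class logits.
backbone = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
extractor = torch.nn.Sequential(*list(backbone.children())[:-2]).eval()

@torch.no_grad()
def extract_dir(image_dir: str, out_file: str, batch_size: int = 32) -> None:
    names = sorted(f for f in os.listdir(image_dir) if f.endswith(".png"))
    feats = []
    for i in range(0, len(names), batch_size):
        batch = torch.stack([
            preprocess(Image.open(os.path.join(image_dir, n)).convert("RGB"))
            for n in names[i:i + batch_size]
        ])
        feats.append(extractor(batch).cpu().numpy())   # (B, C, H, W)
    np.save(out_file, np.concatenate(feats, axis=0))   # stored as a .npy file

# Hypothetical directory layout for the pre-change and post-change image sets.
extract_dir("data/images", "feats_before.npy")
extract_dir("data/sc_images", "feats_after.npy")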
Step 2: take the image data and the text data as input and construct a vocabulary and a tag file.
In the embodiment of the present invention, step 2 includes:
Step 21: load, from the text data, the json files containing the image descriptions, the dependency relations of the descriptions, and the dataset split, then save the image names in an all_imgs file, obtaining statistics including Total images, Total captioned images, Total captions, Total train images and Total train captions;
Step 22: construct the required vocabulary dictionary. First traverse the text sequences and tokenize them with the given separators, selectively keeping or removing specified symbols; add the special marks <START> and <END> at the beginning and end of each token list; count the frequency of every token to update the dictionary; sort the counted words, set a user-defined minimum frequency, add the words whose frequency is greater than or equal to the minimum frequency to the dictionary, and assign an index to each entry;
Step 23: construct the required dependency vocabulary. The head_tags dictionary defines the dependency tags and their indexes, where each tag represents a different grammatical role, such as punctuation (punct), subject (nsubj) or direct object (dobj); the stoi dictionary stores the special tags and their indexes, such as <PAD> and <UNK>. The dependency tags are added to the stoi dictionary and each tag is assigned an index, so the constructed dependency-tag vocabulary has the format {tag: index}. Dependency analysis of the vocabulary helps the model understand sentence structure and grammatical relations, and converting tags into indexes facilitates subsequent model input and processing;
Step 24: convert the token list into the corresponding index list through the Encode function, checking whether each token exists in the vocabulary; if not, decide according to the settings whether to replace it with <UNK>. Because the model can usually handle only numerical input, the text data must be converted into the corresponding index representation. Finally, the encoded descriptions, the dependency information and the index information are stored in an HDF5 file.
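A minimal sketch of the vocabulary construction and index encoding of steps 22 to 24. The separator, the minimum frequency, the dependency-tag subset, the example captions and the output file name encoded_captions.h5 are illustrative assumptions, and h5py is assumed for the HDF5 storage.

from collections import Counter
import h5py

def build_vocab(captions, min_freq=1, delimiter=" "):
    """Build a {word: index} vocabulary with <START>/<END>/<PAD>/<UNK> marks."""
    counter = Counter()
    for cap in captions:
        tokens = ["<START>"] + cap.strip().split(delimiter) + ["<END>"]
        counter.update(tokens)
    stoi = {"<PAD>": 0, "<UNK>": 1}
    # Keep only tokens whose frequency reaches the (user-defined) minimum.
    for word, freq in sorted(counter.items(), key=lambda kv: -kv[1]):
        if freq >= min_freq and word not in stoi:
            stoi[word] = len(stoi)
    return stoi

def encode(tokens, stoi, use_unk=True):
    """Convert a token list into an index list; unknown tokens map to <UNK>."""
    ids = []
    for tok in tokens:
        if tok in stoi:
            ids.append(stoi[tok])
        elif use_unk:
            ids.append(stoi["<UNK>"])
        else:
            raise KeyError(f"token {tok!r} not in vocabulary")
    return ids

# Example dependency-label vocabulary in the {tag: index} format described above.
head_tags = ["punct", "nsubj", "dobj"]          # illustrative subset
dep_stoi = {"<PAD>": 0, "<UNK>": 1}
for tag in head_tags:
    dep_stoi[tag] = len(dep_stoi)

captions = ["the small cube moved", "a red ball was added"]   # illustrative captions
stoi = build_vocab(captions)
with h5py.File("encoded_captions.h5", "w") as f:              # hypothetical output file
    ids = encode(["<START>"] + captions[0].split() + ["<END>"], stoi)
    f.create_dataset("caption_0", data=ids)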
Step 3: input the i-th pair of image features from the training-set visual features into the encoder; in the first stage of the encoder, the two image features pass through the context fusion module (Context Fusion Module), which enhances the feature representation of each pixel with context information in the horizontal and vertical directions, yielding a pair of context-enhanced image features.
In the embodiment of the present invention, step 3 includes:
Step 31: the input image features have dimension (B, C, H, W) and are reshaped with view into a feature map of dimension (B, H×W, C); a preprocessing feature map is then obtained through an nn.Sequential block containing a fully connected layer, layer normalization and Dropout, whose linear transformation has C input features; the feature map of the other image is obtained in the same way;
Step 32: the preprocessed feature map is taken as input, and weighted features are generated through the convolution and cosine-similarity calculations of the context fusion module, in five steps, as shown in fig. 3:
(321) Generate query, key and value: taking the preprocessed feature map as input, the query, key and value are obtained by convolutions with a kernel size of 1; the resulting features all share the same dimension and are then reshaped for the attention computation;
(322) Compute the attention scores in the horizontal and vertical directions and apply a negative-infinity mask: the query and key are used to compute the attention scores in each direction; for the horizontal direction, the similarity between query and key is obtained by matrix multiplication and the negative-infinity mask is added so that certain positions cannot influence the attention; after the computation, the result is reshaped and its dimensions rearranged, the horizontal and vertical scores having dimension (B, H, W, W);
(323) Combine the attention scores: the horizontal and vertical attention scores are concatenated with torch.cat and normalized with a softmax function (dim = 3) to obtain the attention weight of each position;
(324) Separate the attention results: the combined attention scores are split back into horizontal and vertical attention, and each is reshaped for subsequent processing; the horizontal attention is converted into a suitable shape through permute and reshape operations, the vertical attention is processed in the same way, and the dimension is (B×W, H, W).
(325) Weighted value calculation: the values are weighted by the computed attention scores; the value and the attention score are multiplied by matrix multiplication to obtain the horizontal and vertical outputs out_H and out_W, and after the computation the results are reshaped and their dimensions rearranged to give the final outputs.
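A compact sketch of the horizontal/vertical attention of steps (321) to (325), written as a criss-cross-style module. The module name, the channel-reduction factor for the query and key, the placement of the negative-infinity mask on the horizontal self-position, and the residual connection to the input are assumptions where the description above is not explicit.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextFusion(nn.Module):
    """Sketch of the horizontal/vertical (criss-cross style) context attention."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)

        # Horizontal attention: each pixel attends to the pixels in its row.
        q_h = q.permute(0, 2, 3, 1).reshape(b * h, w, -1)             # (B*H, W, C')
        k_h = k.permute(0, 2, 3, 1).reshape(b * h, w, -1)
        att_h = torch.bmm(q_h, k_h.transpose(1, 2)).view(b, h, w, w)   # (B, H, W, W)

        # Vertical attention: each pixel attends to the pixels in its column.
        q_w = q.permute(0, 3, 2, 1).reshape(b * w, h, -1)              # (B*W, H, C')
        k_w = k.permute(0, 3, 2, 1).reshape(b * w, h, -1)
        att_w = torch.bmm(q_w, k_w.transpose(1, 2)).view(b, w, h, h)
        att_w = att_w.permute(0, 2, 1, 3)                              # (B, H, W, H)

        # Assumed mask: -inf on the horizontal self-position so that it does not
        # contribute twice when the two directions are concatenated.
        mask = torch.diag(torch.full((w,), float("-inf"), device=x.device))
        att_h = att_h + mask

        # Concatenate both directions and normalize jointly with softmax.
        att = F.softmax(torch.cat([att_h, att_w], dim=3), dim=3)       # (B, H, W, W+H)
        att_h, att_w = att[..., :w], att[..., w:]

        # Weight the values with the attention scores in each direction.
        v_h = v.permute(0, 2, 3, 1).reshape(b * h, w, c)
        out_h = torch.bmm(att_h.reshape(b * h, w, w), v_h).view(b, h, w, c)
        v_w = v.permute(0, 3, 2, 1).reshape(b * w, h, c)
        out_w = torch.bmm(att_w.permute(0, 2, 1, 3).reshape(b * w, h, h), v_w)
        out_w = out_w.view(b, w, h, c).permute(0, 2, 1, 3)

        return (out_h + out_w).permute(0, 3, 1, 2) + x                 # assumed residual

x = torch.randn(2, 512, 14, 14)
print(ContextFusion(512)(x).shape)   # torch.Size([2, 512, 14, 14])

In the first encoder stage, each image of a pair would be passed through this module to obtain the context-enhanced features used by the later stages.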
Step 4: in the second stage of the encoder, when the context-enhanced image features pass through the context fusion module, the common features are extracted twice, once with the pre-change features as the query and once with the post-change features as the query, giving two groups of common features.
In the embodiment of the invention, more similar information is fused in the second stage of the encoder and the correlation between the two images is stronger, which overcomes the influence of extreme viewpoints, so the extracted common features are highly accurate.
In the embodiment of the present invention, as shown in fig. 4, step 4 includes:
The extraction of common features in the context fusion module is divided into three steps:
(41) Feature processing: one image's context-enhanced features serve as the query and the other image's features serve as the key and value, invoked through a layer_module function call; convolutions are then applied to (Q, K, V) separately and the resulting feature maps are reshaped to obtain a new feature map (Q, K, V); the convolution has a kernel size of 3, padding 1, stride 1, no bias, and 512 groups;
(42) Attention-score calculation: the attention score att_score = QK^T / sqrt(d_k) measures the similarity or relevance between the query and all keys; the score is scaled by sqrt(d_k), the dimension of the key, to keep oversized dot products from causing vanishing gradients in the softmax; the score is then converted into a probability distribution with a softmax function and dropout is applied;
(43) Context-representation calculation: the context representation is obtained as the value weighted by the attention probability distribution, and the common feature is obtained by adding this weighted result to the query; the same computation is then repeated with the roles of the two images swapped (the other image's features as the query, the first image's features as the key and value), giving the second common feature; both expressions take the form common = softmax(QK^T / sqrt(d_k)) V + Q, computed once with each image's features as the query.
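A sketch of the common-feature extraction of steps (41) to (43), with one image's features as the query and the other's as key and value. The grouped 3x3 projections, the scaling by sqrt(d_k) and the final sum with the query follow the description above; the class name, the channel count and the dropout rate are illustrative assumptions, and in practice the two calls would likely share one module's weights.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonFeature(nn.Module):
    """Sketch: extract features common to both images with cross attention."""

    def __init__(self, channels: int = 512, dropout: float = 0.1):
        super().__init__()
        # Grouped 3x3 projections for Q, K, V (padding 1, stride 1, no bias),
        # as described in step (41).
        def conv():
            return nn.Conv2d(channels, channels, 3, stride=1, padding=1,
                             bias=False, groups=channels)
        self.q_proj, self.k_proj, self.v_proj = conv(), conv(), conv()
        self.dropout = nn.Dropout(dropout)

    def forward(self, query_feat: torch.Tensor, other_feat: torch.Tensor):
        b, c, h, w = query_feat.shape
        # One image supplies the query, the other supplies key and value.
        q = self.q_proj(query_feat).flatten(2).transpose(1, 2)   # (B, H*W, C)
        k = self.k_proj(other_feat).flatten(2).transpose(1, 2)
        v = self.v_proj(other_feat).flatten(2).transpose(1, 2)

        # Scaled dot-product attention: att = softmax(Q K^T / sqrt(d_k)).
        att = torch.matmul(q, k.transpose(-2, -1)) / (c ** 0.5)
        att = self.dropout(F.softmax(att, dim=-1))

        context = torch.matmul(att, v)            # attention-weighted context
        common = context + q                      # summed with the query
        return common                             # (B, H*W, C)

feat_bef = torch.randn(2, 512, 14, 14)
feat_aft = torch.randn(2, 512, 14, 14)
common_bef = CommonFeature()(feat_bef, feat_aft)   # pre-change features as query
common_aft = CommonFeature()(feat_aft, feat_bef)   # post-change features as query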
Step 5: in the third stage of the encoder, extract the difference features of the images from the common features based on differential expression (Differential Expression): one image difference is obtained with the pre-change features as the query and the other with the post-change features as the query; the two groups of difference features are spliced to construct two contrastive representations of the difference, and the overall difference is finally modeled against each of the two original image features respectively.
In the embodiment of the present invention, step 5 includes:
Step 51: the difference features in the differential expression are extracted from the common features obtained above: the common feature obtained with the pre-change image as the query is reshaped to give the image difference based on the pre-change query, and likewise the common feature obtained with the post-change image as the query gives the image difference based on the post-change query;
Step 52: the data are preprocessed before the difference information is passed to the decoder, in three steps:
(521) Splice the difference features: the two groups of difference image features are concatenated along the dim = -1 dimension; a fully connected layer then applies a linear transformation with C input features to the result of dimension (B, H×W, C); to prevent over-fitting, a Dropout layer with a discard rate of 10% is added and a ReLU activation function is used, so that the model can learn more complex features;
(522) Combine the input and difference features: the overall difference is concatenated with each of the original image features on dim = 1 to form two new features of dimension (B, C, H, W);
(523) Convolution calculation: the new features are passed through a convolution with an input channel count of C, a kernel size of 1 and padding 0; the channels are then divided into 32 groups and normalized with a group normalization layer to obtain the final result.
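A sketch of the difference-fusion preprocessing of steps (521) to (523). The exact channel bookkeeping around the concatenations is not fully legible in this text, so the sketch assumes that the spliced differences are projected back to C channels and that the 1x1 convolution takes the 2C-channel concatenation of image and difference as input; the names and sizes are illustrative.

import torch
import torch.nn as nn

class DifferenceFusion(nn.Module):
    """Sketch of steps (521)-(523): fuse the difference maps with each image."""

    def __init__(self, channels: int = 512):
        super().__init__()
        # (521) linear projection of the concatenated differences + Dropout + ReLU.
        self.project = nn.Sequential(
            nn.Linear(2 * channels, channels),
            nn.Dropout(0.1),
            nn.ReLU(inplace=True),
        )
        # (523) 1x1 convolution followed by 32-group normalization.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1, padding=0),
            nn.GroupNorm(32, channels),
        )

    def forward(self, diff_bef, diff_aft, img_bef, img_aft):
        b, c, h, w = img_bef.shape
        # (521) splice the two difference features along the last dimension.
        diff = torch.cat([diff_bef, diff_aft], dim=-1)        # (B, H*W, 2C)
        diff = self.project(diff)                             # (B, H*W, C)
        diff = diff.transpose(1, 2).reshape(b, c, h, w)

        # (522) connect the overall difference with each original image feature.
        fused_bef = self.fuse(torch.cat([img_bef, diff], dim=1))
        fused_aft = self.fuse(torch.cat([img_aft, diff], dim=1))
        return fused_bef, fused_aft                           # (B, C, H, W)

diff_b = torch.randn(2, 14 * 14, 512)
diff_a = torch.randn(2, 14 * 14, 512)
img_b = torch.randn(2, 512, 14, 14)
img_a = torch.randn(2, 512, 14, 14)
out_b, out_a = DifferenceFusion()(diff_b, diff_a, img_b, img_a)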
Step 6: in the first stage of the decoder, establish grammatical correlations for each word in the text data through a text-dependent model (Text-Dependent Model).
Based on these correlations, the model can better understand the complex structure of a sentence and thus identify which object has changed.
In the embodiment of the present invention, step 6 includes:
Step 61: before correlations are established for the text, the text data must be preprocessed. The description information seq, the dependency information dep and the mask information mask of the images all have dimension (N, L). Word embedding is first applied to the image descriptions in the text data, with a vocabulary size of 76 and an embedding vector size of 300, and index 0 is reserved for padding, giving seq with dimension (N, L, D), for example (128, 24, 300). Positional information is then added to seq to help the model understand the order of the elements: a positional-encoding tensor of size (128, 500) is created at initialization, sliced to the sequence length and added to the input seq so that the shape of the final input is unchanged. Finally, a fully connected layer applies a linear transformation with D input features and 512 output features, giving a seq tensor of dimension (N, L, 512);
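A sketch of the text preprocessing of step 61: word embedding with padding index 0, added positional information, and a linear projection to the 512-dimensional model space. The positional-encoding scheme is not legible in this text, so a learned positional table of assumed maximum length 500 is used for illustration.

import torch
import torch.nn as nn

class CaptionEmbedding(nn.Module):
    """Sketch of step 61: word embedding + positional information + projection."""

    def __init__(self, vocab_size=76, embed_dim=300, model_dim=512, max_len=500):
        super().__init__()
        # Index 0 is reserved for padding, as described above.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # Assumed: a learned positional table, sliced to the sequence length.
        self.pos = nn.Parameter(torch.zeros(max_len, embed_dim))
        self.proj = nn.Linear(embed_dim, model_dim)   # D -> 512

    def forward(self, seq: torch.Tensor) -> torch.Tensor:   # seq: (N, L) int64
        n, l = seq.shape
        x = self.embed(seq)                  # (N, L, D)
        x = x + self.pos[:l].unsqueeze(0)    # add positional information
        return self.proj(x)                  # (N, L, 512)

seq = torch.randint(0, 76, (128, 24))
print(CaptionEmbedding()(seq).shape)   # torch.Size([128, 24, 512])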
Step 62, establishing a connection between words for the descriptive information seq through self-attention processing, and dividing the connection into four steps:
(621) Map seq to query, key and value and adjust dimensions: the input seq is mapped into the query, key and value spaces with fully-connected-layer linear transformations (512 input features, 512 output features, output dimension (N, L, 512)); the query, key and value tensors are then converted into a format suitable for attention calculation by a dimension-transformation method that keeps the first two dimensions, introduces a new dimension, and splices the result;
(622) Compute the attention score by dot-multiplying the query and the key, i.e. att_score = QK^T / sqrt(d_k), which measures the similarity or relevance between the query and all keys; the score is scaled by sqrt(d_k), the dimension of the key, to avoid vanishing gradients in the softmax caused by oversized dot products; the score is then converted into a probability distribution with a softmax function and dropout is applied, discarding part of the attention weights;
(623) Calculating the context representation, and obtaining the context representation according to the weighted value of the attention probability distribution;
(624) Shape adjustment, the shape of the context layer is adjusted to the form of (N, L, 512), and the result is returned.
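A sketch of the self-attention of steps (621) to (624) over the caption sequence, including the dimension transformation into a multi-head layout and back to (N, L, 512). The number of heads and the padding-mask handling are assumptions not fixed by the description above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CaptionSelfAttention(nn.Module):
    """Sketch of steps (621)-(624): self-attention over the caption sequence."""

    def __init__(self, dim: int = 512, heads: int = 8, dropout: float = 0.1):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.q_proj = nn.Linear(dim, dim)   # 512 -> 512 mappings for Q, K, V
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.dropout = nn.Dropout(dropout)

    def split_heads(self, x: torch.Tensor) -> torch.Tensor:
        # (N, L, 512) -> (N, heads, L, head_dim): keep the first two dimensions
        # and introduce a new head dimension, as described in (621).
        n, l, _ = x.shape
        return x.view(n, l, self.heads, self.head_dim).transpose(1, 2)

    def forward(self, seq, mask=None):
        q, k, v = map(self.split_heads,
                      (self.q_proj(seq), self.k_proj(seq), self.v_proj(seq)))
        # (622) scaled dot product: att_score = Q K^T / sqrt(d_k).
        score = torch.matmul(q, k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        if mask is not None:                      # padding positions -> -inf
            score = score.masked_fill(mask[:, None, None, :] == 0, float("-inf"))
        att = self.dropout(F.softmax(score, dim=-1))
        # (623)-(624) weighted values, then back to the (N, L, 512) layout.
        ctx = torch.matmul(att, v).transpose(1, 2).reshape(seq.shape)
        return ctx

seq = torch.randn(128, 24, 512)
print(CaptionSelfAttention()(seq).shape)   # torch.Size([128, 24, 512])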
Step 7: in the second stage of the decoder, put the grammatically related text information and the difference representations obtained above into a Transformer to obtain a set of captions describing the image difference.
In the embodiment of the present invention, step 7 includes:
Step 71: before correlations are established between the difference representations and the text data, the description information seq, which already carries its internal correlations, is added to the hidden state of the decoder, and the sum is layer-normalized to improve the convergence speed and stability of the model; the seq dimension is (N, L, 512);
Step 72: cross attention is used to establish the relationship between the text and the image difference features, in four steps:
(721) Map seq into the query and the difference representations into the key and value, with the corresponding dimension adjustments: the query, key and value are obtained with fully-connected-layer linear transformations (512 input features, 512 output features, output dimension (N, L, 512)); the query, key and value tensors are then converted into a format suitable for attention calculation by a dimension-transformation method that keeps the first two dimensions, introduces a new dimension, and finally splices the result;
(722) Compute the attention score by dot-multiplying the query and the key, giving att_score = QK^T / sqrt(d_k), which measures the similarity or relevance between the query and all keys; the score is scaled by sqrt(d_k) to avoid vanishing gradients caused by oversized dot products; if att_mask is not empty it is added to att_score; the score is then converted into a probability distribution with a softmax function; an average attention score of each query over all keys, att_score1, is obtained by averaging the attention scores along the second dimension; and dropout is applied to the attention scores, discarding part of the attention weights;
(723) Finally, the shape of the context layer is adjusted to (N, L, 512); att is connected to the hidden state of the decoder through a residual connection, and the sum is layer-normalized;
(724) Feature processing: a linear transformation with 512 input and output features is applied to att and activated with a GELU activation function; a second linear transformation with 512 input and output features and dropout is then applied, the result is added through a residual connection, layer normalization is applied, and the result is returned.
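A sketch of the residual, layer-normalization and feed-forward processing of steps (723) and (724), applied to the cross-attention output att and the decoder hidden state. The class name and dropout rate are illustrative; the attention computation itself is omitted because it follows the same scaled dot-product form sketched earlier.

import torch
import torch.nn as nn

class DecoderOutputBlock(nn.Module):
    """Sketch of (723)-(724): residual + layer norm around attention and FFN."""

    def __init__(self, dim: int = 512, dropout: float = 0.1):
        super().__init__()
        self.norm_att = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(              # 512 -> 512 with GELU activation
            nn.Linear(dim, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
            nn.Dropout(dropout),
        )

    def forward(self, att: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # (723) residual connection between the attention output and the
        # decoder hidden state, followed by layer normalization.
        x = self.norm_att(att + hidden)
        # (724) feed-forward transformation with a second residual connection.
        return self.norm_ffn(x + self.ffn(x))

att = torch.randn(128, 24, 512)
hidden = torch.randn(128, 24, 512)
print(DecoderOutputBlock()(att, hidden).shape)   # torch.Size([128, 24, 512])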
The invention consists of an encoder and a decoder. In the encoder, Context Fusion focuses the context onto each feature, helping to attend to inconspicuous local variations and enhancing the robustness of the model to noise and interference. In addition, because feature learning is carried out at different scales, local and global features complement each other, so information from the details to the global view is captured better and the model's perception of various changes in different scenes is improved. Common-feature extraction captures the unchanged features by matching the similar features of a pair of images and then separating them from the original image pair, resulting in stable difference features. The decoder is divided into two stages. The Text-Dependent Model establishes directional correlations between the individual words of the text through grammar; the model can thus further understand the semantics of the text, recognize verbs and their modifiers in a sentence, and judge the tendency of the change described by the text. Therefore, in the process of generating the caption, the model can produce natural and fluent sentences according to the grammar rules and the context information. Caption generation feeds the grammatically related text information and the difference feature map into a Transformer, effectively combining visual and language information and thus generating more accurate and expressive captions.
In the embodiment of the invention, as shown in fig. 5 and fig. 6, 40,000 image pairs are selected for the change-captioning results on CLEVR-Change; the pairs include color, position and material changes and cover complex conditions such as different viewpoints, different object densities, different degrees of occlusion and different light intensities. In terms of generating a difference localization map, the method of the invention achieves good localization under these complex conditions and can be applied to environmental monitoring, tracking pathological changes, satellite image detection, street-view change detection and other fields.
The experimental comparison between the method of the invention and other models on CLEVR-Change is shown in Tables 1 to 4; the evaluation metrics are BLEU_4, METEOR, ROUGE_L, CIDEr and SPICE.
TABLE 1
TABLE 2
TABLE 3
TABLE 4
Table 1 shows the overall performance evaluation, in which the METEOR, ROUGE_L and CIDEr metrics are higher than those of the DUDA+TIGR, M-VAM, R3Net+SSP, SRDRL+AVS and IFDC models. On the ROUGE_L metric the invention improves by 0.3 over the highest competitor, SRDRL+AVS, and on CIDEr it improves by 0.6 over the highest competitor, R3Net+SSP. Tables 2 to 4 evaluate the color (C), texture (T), addition (A), deletion (D) and movement (M) attributes.
Table 2 shows the CIDEr evaluation, where the results on the color, material and deletion attributes are all higher than those of the other mainstream models: the color result improves by 1.1 over the highest competitor, R3Net+SSP; the material result improves by 0.8 over R3Net+SSP; and the deletion result improves by 1.5 over SRDRL+AVS.
Table 3 shows the SPICE evaluation; the results of the invention on the color and deletion attributes are higher than those of the other mainstream models, the color result improving by 0.3 and the deletion result by 0.1 over the highest competitor, SRDRL+AVS.
Table 4 shows the METEOR evaluation; the results of the invention on the color, material and addition attributes are all higher than those of the other mainstream models, the color result improving by 0.2, the material result by 0.3 and the addition result by 0.1 over the highest competitor, SRDRL+AVS. These data further demonstrate that the method of the invention performs better.
Various steps of embodiments of the present invention may be performed by an electronic device. Electronic devices include, but are not limited to, cell phones, tablet computers, portable PCs, desktops, and the like.
In the method, the image data and text data for change captioning are acquired to obtain the training-set visual features; the image data and text data are taken as input to construct a vocabulary and a tag file; the i-th pair of training-set image features is input to the encoder, whose first stage enhances the feature representation of each pixel with context information in the horizontal and vertical directions; in the second stage, the common features are extracted with each image's features in turn as the query; in the third stage, the difference features are extracted based on differential expression, the two groups of difference features are spliced to construct two contrastive representations of the difference, and the overall difference is modeled against each original image; in the first stage of the decoder, grammatical correlations are established for each word in the text data through the text-dependent model; and in the second stage of the decoder, the text information and the difference representations are fed into a Transformer to obtain a set of captions describing the image difference. In this way, the generated captions remain complete and fluent according to the grammar rules and context information, visual information and language information are combined effectively, and the accuracy of the semantic information is improved.
An embodiment of the present invention provides a computer readable storage medium that includes a stored program which, when run, controls the electronic device in which the computer readable storage medium is located to perform the adaptive equalization-aware change captioning method of the embodiments described above.
Fig. 7 is a schematic diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 7, the electronic device 21 includes a processor 211, a memory 212, and a computer program 213 stored in the memory 212 and capable of running on the processor 211; when executed by the processor 211, the computer program 213 implements the adaptive equalization-aware change captioning method of the embodiments, which is not repeated here.
The electronic device 21 includes, but is not limited to, a processor 211, a memory 212. It will be appreciated by those skilled in the art that fig. 7 is merely an example of the electronic device 21 and is not meant to be limiting of the electronic device 21, and may include more or fewer components than shown, or may combine certain components, or different components, e.g., the electronic device may further include an input-output device, a network access device, a bus, etc.
The processor 211 may be a central processing unit (Central Processing Unit, CPU), or may be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 212 may be an internal storage unit of the electronic device 21, such as a hard disk or memory of the electronic device 21. The memory 212 may also be an external storage device of the electronic device 21, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card or a flash card (Flash Card) provided on the electronic device 21. Further, the memory 212 may include both an internal storage unit and an external storage device of the electronic device 21. The memory 212 is used to store the computer programs and other programs and data required by the network device, and may also be used to temporarily store data that has been output or is to be output.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The foregoing describes only preferred embodiments of the invention and is not intended to limit it; any modification, equivalent replacement, improvement or the like made within the spirit and principles of the invention shall fall within its scope of protection.
Claims (8)
1. An adaptive equalization-aware multi-variant subtitle method, the method comprising:
Step 1, acquiring the image data and text data of the variable captions to obtain the two groups of training-set visual features of the variable captions;
Step 2, taking the image data and the text data as input to construct a vocabulary and a tag file;
Step 3, inputting the i-th image features of the two groups of training-set visual features into the encoder; in the first stage of the encoder, the image features pass through the context fusion module (Context Fusion Module), and the feature representation of each pixel is enhanced with context information in the horizontal and vertical directions to obtain the enhanced image features;
Step 4, in the second stage of the encoder, when the enhanced image features pass through the context fusion module, common features are extracted twice, with each of the two image features serving in turn as the query feature to obtain the two groups of common features;
Step 5, in the third stage of the encoder, extracting the difference features of the images from the common features on the basis of differential expression (Differential Expression): one image difference is obtained with the first image's features as the query feature, and the other with the second image's features as the query feature; the two groups of difference features are spliced to construct two contrast representations of the difference, and the overall difference is finally modelled against each of the original image features;
Step 6, in the first stage of the decoder, establishing grammatical correlations for each word in the text data through a text-dependency model (Text-Dependent Model);
Step 7, in the second stage of the decoder, combining the text information processed by the context fusion module with the modelled image difference and feeding the result into a transformer to obtain a group of caption information describing the image difference;
the step 3 comprises the following steps:
Step 31, the input image features have dimension (B, C, H, W); a feature map of dimension (B, H×W, C) is obtained through view, and the preprocessed feature map is obtained through an nn.Sequential block containing a fully connected layer, layer normalization and Dropout, whose linear transformation has input feature size C; the feature map of the second image is obtained in the same way;
Step 32, taking the preprocessed feature map as input, generating the weighted features through the convolution and cosine-similarity calculations of the context fusion module, which comprises the following five steps:
(321) Generating the query, key and value: with the preprocessed feature map as input, the query, key and value are obtained by convolution; the resulting feature dimensions are then converted, and the convolution kernel size is 1;
(322) Calculating the attention scores in the horizontal and vertical directions and applying a negative-infinity mask: the similarity between the query and the key is obtained by matrix multiplication in each direction and the negative-infinity mask is added; after the calculation, the result is reshaped and its dimensions rearranged, the dimensions for the horizontal and vertical directions being (B, H, W, W);
(323) Combining the attention scores: the horizontal and vertical attention scores are concatenated with torch.cat and normalized with a softmax function to obtain the attention weight of each position;
(324) Separating the attention results: the combined attention scores are split back into horizontal attention and vertical attention, and each is reshaped for subsequent processing; the horizontal attention is converted into the proper shape through permute and contiguous, the vertical attention is processed in the same way, and the resulting dimension is (B×W, H, W);
(325) Weighting the values: the values are weighted by the calculated attention scores, and the values and attention scores are multiplied by matrix multiplication to obtain the horizontal and vertical outputs out_H and out_W; after the calculation, the shapes are adjusted and the dimensions rearranged to obtain the final enhanced image features;
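The five sub-steps above follow a criss-cross (row/column) attention pattern. Below is a minimal sketch under that reading, assuming single-head attention, a C/8 channel reduction for the query and key, and the negative-infinity mask applied to the vertical scores; the exact reshapes, mask placement and channel counts of the original code may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrissCrossAttention(nn.Module):
    """Sketch of row/column (criss-cross) attention with a -inf diagonal mask."""
    def __init__(self, channels: int):
        super().__init__()
        reduced = max(channels // 8, 1)
        self.query_conv = nn.Conv2d(channels, reduced, kernel_size=1)
        self.key_conv = nn.Conv2d(channels, reduced, kernel_size=1)
        self.value_conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                                    # x: (B, C, H, W)
        b, _, h, w = x.shape
        q, k, v = self.query_conv(x), self.key_conv(x), self.value_conv(x)

        # horizontal scores: each pixel vs. every pixel in its row -> (B, H, W, W)
        e_h = torch.matmul(q.permute(0, 2, 3, 1), k.permute(0, 2, 1, 3))
        # vertical scores: each pixel vs. every pixel in its column -> (B, W, H, H)
        e_v = torch.matmul(q.permute(0, 3, 2, 1), k.permute(0, 3, 1, 2))
        # -inf on the diagonal so the centre pixel is only counted once
        e_v = e_v + torch.diag(torch.full((h,), float('-inf'), device=x.device))

        # combine both directions and normalise jointly with softmax
        scores = torch.cat([e_h, e_v.permute(0, 2, 1, 3)], dim=-1)   # (B, H, W, W+H)
        attn = F.softmax(scores, dim=-1)
        att_h, att_v = attn[..., :w], attn[..., w:]

        # weight the values along each direction
        out_h = torch.matmul(att_h, v.permute(0, 2, 3, 1))           # (B, H, W, C)
        out_v = torch.matmul(att_v.permute(0, 2, 1, 3), v.permute(0, 3, 2, 1))
        out = out_h + out_v.permute(0, 2, 1, 3)                      # (B, H, W, C)
        return out.permute(0, 3, 1, 2) + x                           # (B, C, H, W)

feats = torch.randn(2, 512, 14, 14)
print(CrissCrossAttention(512)(feats).shape)   # torch.Size([2, 512, 14, 14])
```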
The step 4 comprises the following steps:
the extraction of common features in the context fusion module is divided into three steps:
(41) Feature processing: one image's enhanced features serve as the query and the other's as the key and value through the layer_module function call; convolution is then applied to (Q, K, V) and the resulting feature maps are reshaped to obtain the new feature maps (Q, K, V); the convolution kernel size is 3, the padding is 1, the stride is 1, the bias is False, and the grouping is 512;
(42) Calculating the attention score: att_score = Q·K^T/√d_k, which measures the similarity or relevance between the query and all keys; the attention score is scaled to avoid the gradient of the softmax function vanishing when the dot-product results become too large, where d_k is the dimension of the key; the score is converted into a probability distribution using a softmax function and dropout is applied;
(43) Calculating the context expression: the context expression is obtained as the values weighted by the attention probability distribution, and the weighted result is summed with the query to obtain the first common feature; similarly, with the roles of the two images exchanged (the second image's features as the query, the first's as the key and value), the second common feature is obtained; both expressions are of the form softmax(Q·K^T/√d_k)·V + Q, with Q, K and V taken from the corresponding image features.
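A minimal sketch of the cross-attention used for common-feature extraction, in which one image's features act as the query and the other's as the key and value, and the attended context is summed back onto the query as in the expressions above. The grouped 3×3 convolutions of step (41) are omitted, and the tensor shapes in the example are illustrative assumptions.

```python
import math
import torch
import torch.nn.functional as F

def common_features(x_query, x_kv, dropout_p=0.1):
    """Scaled dot-product cross-attention; returns attended context + query."""
    d_k = x_kv.size(-1)
    scores = torch.matmul(x_query, x_kv.transpose(-2, -1)) / math.sqrt(d_k)
    probs = F.dropout(F.softmax(scores, dim=-1), p=dropout_p)
    context = torch.matmul(probs, x_kv)           # values weighted by attention
    return context + x_query                      # summed with the query

x1 = torch.randn(2, 196, 512)     # "before" image features (B, H*W, C), assumed shape
x2 = torch.randn(2, 196, 512)     # "after" image features
x_com1 = common_features(x1, x2)  # x1 as query, x2 as key/value
x_com2 = common_features(x2, x1)  # x2 as query, x1 as key/value
print(x_com1.shape, x_com2.shape)
```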
2. The method according to claim 1, wherein the step 1 comprises:
The image data are two groups of similar images, the i-th pre-change and post-change images respectively; the text data comprise the description information of the images and the part of speech of the description information; the two groups of image data, images and sc_images, are each divided into a training set, a verification set and a test set in the ratio 8:1:1, and the corresponding text information is divided into the same three parts; the image data and text data are preprocessed and the visual features are extracted to obtain the training-set visual features;
The visual features are extracted by reading images in batches from a specified directory, standardizing the input images, scaling pixel values to the range [0,1], normalizing with a mean and a standard deviation, extracting features with a pre-trained ResNet model, converting the features into NumPy arrays, and storing them as a .npy file.
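A hedged sketch of this preprocessing and feature-extraction step. The claim does not fix the ResNet variant, the input resolution or the normalisation statistics, so resnet101, 224×224 inputs and the ImageNet mean/standard deviation used below are assumptions, as is a recent torchvision with the weights-enum API.

```python
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ImageNet statistics are an assumption; the claim only states mean/std normalisation.
preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),                                   # scales pixel values to [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])

resnet = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2]).eval()  # drop pool + fc

@torch.no_grad()
def extract_features(image_paths, out_file="features.npy"):
    feats = []
    for path in image_paths:
        img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        feats.append(backbone(img).squeeze(0).numpy())   # (2048, 7, 7) per image
    np.save(out_file, np.stack(feats))                   # stored as a .npy file
    return out_file
```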
3. The method according to claim 1, wherein the step 2 comprises:
Step 21, loading the descriptions in the text data, the dependency relations of the descriptions, and the json file of the divided dataset, then saving the names of the image data in an all_imgs file to obtain statistics including Total images, Total captioned images, Total captions, Total train images and Total train captions;
Step 22, constructing the required vocabulary dictionary: the text sequence is first traversed and split into words according to the given separators, with specified symbols selectively kept or removed; the special marks <START> and <END> are added at the beginning and end of the list; the occurrence frequency of each word is counted from the word-splitting results to update the dictionary; the counted words are sorted, a customizable minimum frequency is set, words whose frequency is greater than or equal to the minimum frequency are added to the dictionary, and indexes are assigned;
Step 23, constructing the required dependency vocabulary: a head_tags dictionary defines the dependency tags and their indexes, each tag representing a different grammatical role; a stoi dictionary stores the special tags and their indexes; the dependency tags are added to the stoi dictionary and each tag is assigned an index, so that the constructed dependency-tag vocabulary has the format {tag: index}; the vocabulary dependency analysis helps the model understand the structure and grammatical relations of sentences, and the tags are converted into indexes to facilitate subsequent model input and processing;
Step 24, converting the token list into the corresponding index list through the encode function, checking whether each token exists in the vocabulary and, if not, deciding according to the settings whether to replace it with <UNK>; finally, the encoded descriptions, the dependency information and the index information are stored in the HDF5 file.
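A minimal sketch of the vocabulary construction and encoding of steps 22 and 24, assuming whitespace tokenisation, a <NULL> padding token at index 0 and <UNK> at index 1; the actual separators, special-token indexes and HDF5 layout of the original may differ.

```python
from collections import Counter
import h5py

def build_vocab(captions, min_freq=1):
    counter = Counter()
    for cap in captions:
        counter.update(["<START>"] + cap.lower().split() + ["<END>"])
    vocab = {"<NULL>": 0, "<UNK>": 1}           # padding/unknown indexes are assumptions
    for word, freq in sorted(counter.items()):
        if freq >= min_freq and word not in vocab:
            vocab[word] = len(vocab)            # assign the next free index
    return vocab

def encode(tokens, vocab, allow_unk=True):
    idx = []
    for tok in tokens:
        if tok not in vocab and not allow_unk:
            raise KeyError(f"token {tok!r} not in vocabulary")
        idx.append(vocab.get(tok, vocab["<UNK>"]))
    return idx

captions = ["a red car appears on the road", "the tree has been removed"]
vocab = build_vocab(captions)
encoded = [encode(["<START>"] + c.split() + ["<END>"], vocab) for c in captions]

with h5py.File("captions.h5", "w") as f:        # store the encoded descriptions
    for i, row in enumerate(encoded):
        f.create_dataset(f"caption_{i}", data=row)
```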
4. The method according to claim 1, wherein the step 5 comprises:
Step 51, the difference features in the differential expression are extracted from the common features obtained above: the dimension of one common feature is converted to obtain the image difference that takes the first image's features as the query feature, and, in the same way, the dimension of the other common feature is converted to obtain the image difference that takes the second image's features as the query feature;
Step 52, preprocessing the data before the difference information is transmitted to the decoder, which is divided into three steps:
(521) Splicing the difference features: the two groups of difference image features are spliced along the dim=-1 dimension; the result, of dimension (B, H×W, C), is linearly transformed by a fully connected layer with input feature size C; to prevent overfitting, a Dropout layer with a 10% drop rate is added, and a ReLU activation function is used;
(522) Combining the input and difference features: the spliced difference is concatenated with each of the original image features along dim=1 to form two new features of dimension (B, C, H, W);
(523) Convolution calculation: the new features are processed by a convolution whose input channel number is C, with kernel size 1 and padding 0; the channels are then divided into 32 groups and normalized by a group normalization layer to obtain the final result.
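A sketch of the difference-fusion processing of steps (521)–(523): splice the two difference maps, project them with a linear layer, 10% Dropout and ReLU, concatenate the projected difference with each original image feature along the channel axis, then fuse with a 1×1 convolution followed by 32-group normalization. Some channel counts are elided in the claim, so the values below (512 channels throughout) are assumptions, as are the module and variable names.

```python
import torch
import torch.nn as nn

class DifferenceFusion(nn.Module):
    def __init__(self, channels=512):
        super().__init__()
        c = channels
        self.project = nn.Sequential(                # linear + Dropout(0.1) + ReLU
            nn.Linear(2 * c, c), nn.Dropout(0.1), nn.ReLU())
        self.fuse = nn.Sequential(                   # 1x1 conv then 32-group norm
            nn.Conv2d(2 * c, c, kernel_size=1, padding=0),
            nn.GroupNorm(32, c))

    def forward(self, x1, x2, diff1, diff2):
        # x1, x2: original features (B, C, H, W); diff1, diff2: (B, H*W, C)
        b, c, h, w = x1.shape
        diff = self.project(torch.cat([diff1, diff2], dim=-1))   # (B, H*W, C)
        diff = diff.transpose(1, 2).reshape(b, c, h, w)          # back to (B, C, H, W)
        fused1 = self.fuse(torch.cat([x1, diff], dim=1))         # concat on channels
        fused2 = self.fuse(torch.cat([x2, diff], dim=1))
        return fused1, fused2

f1, f2 = torch.randn(2, 512, 14, 14), torch.randn(2, 512, 14, 14)
d1, d2 = torch.randn(2, 196, 512), torch.randn(2, 196, 512)
out1, out2 = DifferenceFusion()(f1, f2, d1, d2)
print(out1.shape, out2.shape)   # both torch.Size([2, 512, 14, 14])
```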
5. The method according to claim 1, wherein the step 6 comprises:
Step 61, before correlations are established for the text, the text data must be preprocessed; the description information seq, the dependency information dep and the mask information mask of the image all have dimension (N, L); word embedding is first applied to the image description information in the text data, with a vocabulary size of 76, an embedding vector size of 300 and index 0 designated for padding, giving seq with dimension (N, L, D); position information is then added to the description information seq to help the model understand the order of the elements in the sequence: a position-encoding tensor of size (128, 500) is created at initialization, and the required portion is extracted and added to the input seq so that the shape of the final input is unchanged; finally, a linear transformation is applied through a fully connected layer with input feature size D and output feature size 512 to obtain the seq tensor of dimension (N, L, 512);
Step 62, establishing connections between words of the description information seq through self-attention, which is divided into four steps:
(621) Mapping seq to the query, key and value and adjusting dimensions: the input seq is mapped into the query, key and value spaces through fully connected linear transformations with input and output feature size 512, giving outputs of dimension (N, L, 512); the query, key and value tensors are then converted into a format suitable for attention calculation by a dimension transformation that extracts the first two dimensions of the tensor, adds a new dimension, and splices them;
(622) Calculating the attention score by dot-multiplying the query and key, i.e. att_score = Q·K^T/√d_k, which measures the similarity or relevance between the query and all keys; the score is scaled to avoid the gradient of the softmax function vanishing when the dot-product results become too large, where d_k is the dimension of the key; the score is then converted into a probability distribution with the softmax function and dropout is applied, discarding part of the attention weights;
(623) Calculating the context representation, and obtaining the context representation according to the weighted value of the attention probability distribution;
(624) Shape adjustment, the shape of the context layer is adjusted to the form of (N, L, 512), and the result is returned.
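A minimal sketch of steps 61–62: embed the description tokens (vocabulary size 76, embedding size 300, index 0 for padding), add positional information, project to 512 dimensions and apply single-head self-attention with dropout. A learned positional parameter is used here for brevity; the claim's fixed position-encoding tensor of size (128, 500) and its multi-head reshaping are simplified away.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextSelfAttention(nn.Module):
    """Sketch: embed words, add positions, project to 512-d, self-attend."""
    def __init__(self, vocab_size=76, embed_dim=300, model_dim=512, max_len=500):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.pos = nn.Parameter(torch.zeros(max_len, embed_dim))  # learned positions (assumption)
        self.to_model = nn.Linear(embed_dim, model_dim)
        self.q = nn.Linear(model_dim, model_dim)
        self.k = nn.Linear(model_dim, model_dim)
        self.v = nn.Linear(model_dim, model_dim)
        self.drop = nn.Dropout(0.1)

    def forward(self, seq):                       # seq: (N, L) token indices
        n, l = seq.shape
        x = self.to_model(self.embed(seq) + self.pos[:l])   # (N, L, 512)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        probs = self.drop(F.softmax(scores, dim=-1))
        return torch.matmul(probs, v)             # context representation (N, L, 512)

tokens = torch.randint(0, 76, (4, 20))
print(TextSelfAttention()(tokens).shape)          # torch.Size([4, 20, 512])
```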
6. The method according to claim 1, wherein the step 7 comprises:
Step 71, before establishing relevance between the modelled difference and the text data, the description information seq that already carries its internal correlations is added to the hidden state of the decoder, and layer normalization is applied to their sum to improve the convergence speed and stability of the model, the seq dimension being (N, L, 512);
step 72, using cross attention to establish a relationship between text and image difference features, which is divided into four steps:
(721) Mapping the sequence into the query and the image difference features into the keys and values, with corresponding dimension adjustment: the query, key and value are produced by fully connected linear transformations with input and output feature size 512 and output dimension (N, L, 512); the query, key and value tensors are converted into a format suitable for attention calculation by a dimension transformation that extracts the first two dimensions of the tensor, adds a new dimension, and finally splices them;
(722) Calculating the attention score by dot-multiplying the query and key, i.e. att_score = Q·K^T/√d_k, which measures the similarity or relevance between the query and all keys and is scaled to avoid gradient vanishing when the dot-product results become too large; if att_mask is not empty, it is added to att_score; the score is then converted into a probability distribution with the softmax function, the average of the attention scores along the second dimension is taken as att_score1 (the average attention of each query over all keys), and dropout is applied to the attention scores, discarding part of the attention weights;
(723) The shape of the context layer is adjusted to (N, L, 512), att is connected with the hidden state of the decoder through a residual connection, and layer normalization is applied to their sum;
(724) Feature processing: a linear transformation with input and output feature size 512 is applied to att and activated with a GELU activation function; a second linear transformation with input and output feature size 512 and dropout is then applied, the result is added through a residual connection, layer normalization is performed, and the result is returned.
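A minimal sketch of the decoder cross-attention of steps 71–72: text hidden states act as queries over the image-difference features, followed by a residual connection with layer normalization and a GELU feed-forward block with dropout and a second normalization. The attention-score averaging used for att_score1 is omitted, and single-head attention is an assumption.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextImageCrossAttention(nn.Module):
    """Sketch of decoder cross-attention: text queries attend over image differences."""
    def __init__(self, dim=512, dropout=0.1):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
                                nn.Linear(dim, dim), nn.Dropout(dropout))
        self.drop = nn.Dropout(dropout)

    def forward(self, text_hidden, image_diff, att_mask=None):
        # text_hidden: (N, L, 512) decoder states; image_diff: (N, H*W, 512)
        q = self.q(text_hidden)
        k, v = self.k(image_diff), self.v(image_diff)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        if att_mask is not None:                      # optional additive mask
            scores = scores + att_mask
        att = torch.matmul(self.drop(F.softmax(scores, dim=-1)), v)
        att = self.norm1(att + text_hidden)           # residual connection + layer norm
        return self.norm2(att + self.ff(att))         # feed-forward block + second norm

text = torch.randn(4, 20, 512)
diff = torch.randn(4, 196, 512)
print(TextImageCrossAttention()(text, diff).shape)    # torch.Size([4, 20, 512])
```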
7. A computer readable storage medium, characterized in that the computer readable storage medium comprises a stored program, wherein the program when run controls a device in which the computer readable storage medium is located to perform the adaptive equalization-aware multi-variant subtitle method of any one of claims 1 to 6.
8. An electronic device comprising one or more processors, a memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions that, when executed by the device, cause the device to perform the adaptive equalization-aware multi-variant subtitle method of any of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202411710207.8A CN119206743B (en) | 2024-11-27 | 2024-11-27 | Self-adaptive equalization sensing multi-transformation subtitle method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN119206743A CN119206743A (en) | 2024-12-27 |
CN119206743B true CN119206743B (en) | 2025-03-04 |
Family
ID=94076396
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202411710207.8A Active CN119206743B (en) | 2024-11-27 | 2024-11-27 | Self-adaptive equalization sensing multi-transformation subtitle method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN119206743B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114581690A (en) * | 2022-03-14 | 2022-06-03 | 昆明理工大学 | Image pair difference description method based on encoder-decoder |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111723937A (en) * | 2019-03-21 | 2020-09-29 | 北京三星通信技术研究有限公司 | Method, apparatus, device and medium for generating description information of multimedia data |
WO2021101231A1 (en) * | 2019-11-22 | 2021-05-27 | Samsung Electronics Co., Ltd. | Event recognition on photos with automatic album detection |
KR102387895B1 (en) * | 2020-01-08 | 2022-04-18 | 인하대학교 산학협력단 | Parallel image caption system and method using 2d masked convolution |
CN113806587A (en) * | 2021-08-24 | 2021-12-17 | 西安理工大学 | A video description text generation method based on multimodal feature fusion |
US12210825B2 (en) * | 2021-11-18 | 2025-01-28 | Adobe Inc. | Image captioning |
KR20240018968A (en) * | 2022-08-03 | 2024-02-14 | 현대자동차주식회사 | Learning method of image captioning model and computer-readable recording media |
CN118918336A (en) * | 2023-05-08 | 2024-11-08 | 中国科学院信息工程研究所 | Image change description method based on visual language model |
CN116612365B (en) * | 2023-06-09 | 2024-01-23 | 匀熵智能科技(无锡)有限公司 | Image subtitle generating method based on target detection and natural language processing |
CN117036967B (en) * | 2023-10-08 | 2024-01-19 | 江西师范大学 | Remote sensing image description method for channel attention of non-visual perception area |
- 2024-11-27 CN CN202411710207.8A patent/CN119206743B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN119206743A (en) | 2024-12-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||