CN112016493B

CN112016493B - Image description method, device, electronic equipment and storage medium

Info

Publication number: CN112016493B
Application number: CN202010916526.XA
Authority: CN
Inventors: 张晋烽; 张启; 韩金金; 王兴宝; 雷琴辉; 王智胜; 王雪初
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2020-09-03
Filing date: 2020-09-03
Publication date: 2024-08-23
Anticipated expiration: 2040-09-03
Also published as: CN112016493A

Abstract

The embodiment of the invention provides an image description method, an image description device, electronic equipment and a storage medium, wherein the method comprises the following steps: determining target information in an image to be described; the target information includes a plurality of targets and relationships therebetween; determining attention information to be described currently from target information based on a previous text sequence, generating a current text sequence based on the attention information, wherein the previous text sequence and the current text sequence are text sequences obtained by a previous description task and a current description task respectively, and the current text sequence comprises the previous text sequence; and taking the text sequence obtained by the last description task as the description text of the image to be described. According to the method, the device, the electronic equipment and the storage medium provided by the embodiment of the invention, the description text obtained by image description can be attached to the interest points and the knowledge points which need to be noticed by the user, so that the richness and pertinence of the image description are ensured, and the interactivity with the user is improved.

Description

Image description method, device, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of computer vision, in particular to an image description method, an image description device, electronic equipment and a storage medium.

Background

Image description (Image capture), which refers to the generation of descriptive text about the content of an Image for a given Image using computer vision techniques, thereby enabling the conversion from picture pixels to text.

The descriptive text generated by the existing image description method usually presents the same sentence pattern, so that the generated descriptive text is poor in style and content diversity, the description of the image is not rich and vivid enough, and the interest points and knowledge points which need to be noticed can not be observed by the user.

Disclosure of Invention

The embodiment of the invention provides an image description method, an image description device, electronic equipment and a storage medium, which are used for solving the problems of single descriptive text style and content and poor interactivity with users generated by the existing image description method.

In a first aspect, an embodiment of the present invention provides an image description method, including:

Determining target information in an image to be described; the target information comprises a plurality of targets and relations among the targets;

Determining current attention information to be described from the target information based on a previous text sequence, and generating a current text sequence based on the attention information, wherein the previous text sequence and the current text sequence are text sequences obtained by a previous description task and a current description task respectively, and the current text sequence comprises the previous text sequence;

And taking a text sequence obtained by the last description task as a description text of the image to be described.

Optionally, the determining, based on the previous text sequence, attention information to be described currently from the target information, and generating a current text sequence based on the attention information specifically includes:

Updating the attention information to be described last time into the attention information to be described currently based on the last text sequence; wherein the attention information to be described for the first time is determined based on the target information;

And generating a current text sequence based on the last text sequence based on the attention information to be described currently.

Optionally, the method for determining attention information to be described for the first time includes the following steps:

Extracting interest features from the scene graph corresponding to the target information to obtain the interest features of each node in the scene graph;

Based on the attention distribution of the interest feature of each node in the scene graph, sampling the relevant node of each node, determining the target feature of each node based on the sampled relevant node of each node, and taking the target feature of each node as the attention information to be described for the first time.

Optionally, based on the previous text sequence, updating the attention information to be described last time to the attention information to be described currently, which specifically includes:

Based on the last text sequence, carrying out target updating on the target characteristics updated after the last description task is ended, and obtaining target characteristics after the last description task is ended; the target feature is a feature representation of the target information;

And carrying out target flow update on the attention information to be described last time based on the last text sequence and the target characteristics after the last description task is finished, so as to obtain the attention information to be described currently.

Optionally, based on the last text sequence, performing target update on the target feature updated after the last description task is ended, to obtain the target feature after the last description task is ended, which specifically includes:

And determining a current feature updating parameter based on the correlation between the last text sequence and the target feature updated after the last description task is ended, and updating the target feature updated after the last description task is ended based on the current feature updating parameter to obtain the target feature after the last description task is ended.

Optionally, based on the last text sequence and the target feature after the last description task is finished, performing target flow update on the attention information to be described last time to obtain the attention information to be described currently, which specifically includes:

Determining the flow probability of the last description task corresponding to each information flow state based on the correlation between the last text sequence and each information flow state and the correlation between the target feature after the last description task is finished and each information flow state;

based on the representation of each information flow state and the flow probability, carrying out target flow update on the attention information to be described last time to obtain the attention information to be described currently;

wherein the representation of each information flow state is determined based on the target information and a current number of updates.

Optionally, the target information further includes a preset relationship between targets and/or a target attribute.

In a second aspect, an embodiment of the present invention provides an image description apparatus, including:

an information determining unit for determining target information in the image to be described; the target information comprises a plurality of targets and relations among the targets;

the text generation unit is used for determining attention information to be described currently from the target information based on a previous text sequence, generating a current text sequence based on the attention information, wherein the previous text sequence and the current text sequence are text sequences obtained by a previous description task and a current description task respectively, and the current text sequence comprises the previous text sequence; and taking a text sequence obtained by the last description task as a description text of the image to be described.

In a third aspect, an embodiment of the present invention provides an electronic device, including a processor, a communication interface, a memory, and a bus, where the processor, the communication interface, and the memory are configured to communicate with each other through the bus, and the processor may invoke a logic command in the memory to perform the steps of the image description method as provided in the first aspect.

In a fourth aspect, embodiments of the present invention provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the image description method as provided in the first aspect.

According to the image description method, the device, the electronic equipment and the storage medium, in the image description process, the current attention information to be described is determined from a plurality of targets and relations among the targets according to the previous text sequence, the current text sequence is generated, the description text of the image to be described is finally obtained, and the description text obtained by image description can be attached to interest points and knowledge points which need to be noticed by a user in a mode of simulating attention conversion circulation of people during image description, so that richness and pertinence of image description are ensured, and interactivity with the user is improved.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic flow chart of an image description method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of an image to be described according to an embodiment of the present invention;

Fig. 3 is a flow chart of a text sequence determining method according to an embodiment of the present invention;

fig. 4 is a flow chart of a text generation method based on a transducer according to an embodiment of the present invention;

Fig. 5 is a flowchart of a method for determining attention information to be described for the first time according to an embodiment of the present invention;

FIG. 6 is a scene graph of an image to be described according to an embodiment of the present invention;

Fig. 7 is a flowchart of a method for updating attention information according to an embodiment of the present invention;

Fig. 8 is a schematic structural diagram of an image description device according to an embodiment of the present invention;

Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

When describing an image, in the prior art, a description model is generally established by adopting technologies such as CNN (Convolutional Neural Networks, convolutional neural network), RNN (Recurrent Neural Network, cyclic neural network), attention mechanism and the like, features in the image are encoded to obtain feature vectors of a fixed dimension space, and then the feature vectors are decoded based on a decoder in a machine translation technology, so that a description text is generated, and the generated description text is fixed in sentence form and single in content.

In order to overcome the disadvantages of the prior art, fig. 1 is a schematic flow chart of an image description method according to an embodiment of the present invention, as shown in fig. 1, the method includes:

Step 110, determining target information in an image to be described; the target information includes a plurality of targets and relationships therebetween.

In particular, the image to be described is an image that needs to be converted to descriptive text by means of computer vision techniques. The object information in the image to be described includes a plurality of objects in the image to be described, and a relationship between the respective objects.

The object may be an object existing in the image to be described, such as a person, a plant, an animal, a scene, a building, etc., or may be an abstract object existing in the image to be described, such as a graph, a logo, etc. The relationship between the targets may specifically be a positional relationship, a magnitude relationship, a movement relationship, a dependency relationship, or other various types of relationships between the targets.

For example, fig. 2 is a schematic diagram of an image to be described provided in an embodiment of the present invention, as shown in fig. 2, in a running process of an automobile, an automobile monitoring system may detect that a driver is not wearing a safety belt according to a state of a part, so as to display a warning sign of the safety belt not wearing on an automobile display screen. Because the driver's cognition level in the automobile is different, the driver may not be able to accurately recognize the state monitoring information given on the display screen of the automobile, and at this time, the driver needs to recognize the reminding mark by means of the intelligent auxiliary driving system. When the reminding identification is identified, the intelligent auxiliary driving system needs to provide the target information for the driver, wherein the target information comprises not only the target objects of 'driver' and 'safety belt', but also the relation between the target objects, namely the relation between 'driver' and 'safety belt' which exists in a 'tying' way.

For determining the target information in the image to be described, target detection can be firstly carried out on the image to be described to obtain each target object in the image to be described, and then every two target objects are input into a pre-trained relationship classification model, so that the relationship before every two target objects output by the relationship classification model is obtained.

Further, an end-to-end YOLO model (You Only Look Once, object detection model) may be used to perform object detection, so as to obtain position frame information of a plurality of objects contained in the image to be described and corresponding semantic tags. For example, automotive vehicle fault warning lights are typically constructed from line, circle, rectangle, and cartoon character graphics. When the target detection is carried out, the figure of the person in the reminding mark of the safety belt is detected and obtained, and the inclined rectangle is arranged, and the corresponding semantic labels are respectively a driver and the safety belt.

Further, regression frames (bounding boxes) of a plurality of targets obtained by target detection can be paired in pairs, and relationship classification can be performed by adopting a pre-trained relationship classification model. The relational classification model may be a model that employs stacked multiple convolution layers to construct a multi-classification model. And classifying the relation between the targets by using a relation classification model, and taking the classification result with the highest confidence as the final result of the relation between the two targets. For example, the relationship between the figure of the person in the seat belt unbuckled mark and the obliquely placed rectangle can be "back-attached", "attached" and "holding", etc., and the classification result with the highest confidence is "attached" as the final classification result according to the relationship classification model.

Step 120, determining attention information to be described currently from target information based on a previous text sequence, and generating a current text sequence based on the attention information, wherein the previous text sequence and the current text sequence are text sequences obtained by a previous description task and a current description task respectively, and the current text sequence comprises the previous text sequence;

And taking the text sequence obtained by the last description task as the description text of the image to be described.

Specifically, the description text is a sentence expressing target information in an image to be described in natural language. After the target information in the image to be described is determined, the target information is summarized and arranged into natural sentences conforming to understanding and expression habits of users.

When a person observes a description image, usually, the relationship between objects in the image is taken as a focus point, and an image description sequence is formed according to knowledge and experience, in the process of forming a description text, the person can change the attention on the objects in the image and the relationship between the objects one by one according to the image description sequence, and the focus of the current time description is determined through the circulation of the attention, so that the description of all the objects in the image and the relationship between the objects is realized.

In the embodiment of the invention, the descriptive text obtained by image description can be specifically expressed in a text sequence form, the generation of the descriptive text can simulate the process of image description by a person, the descriptive text is divided into a process of generating words one by one in the text sequence, and each time new word is generated and the text sequence is updated in the process, the descriptive text can be regarded as a descriptive task in the whole image description task. With reference to the image description habits of a person, each description task may correspond to a different point of interest, i.e. each time the attention information to be described is different. The attention information to be described each time may be focused specifically on a portion not yet described in the target information so as to avoid repetitive description or omission of description, or may be focused on a portion of the target information where the description order is arranged after a portion described previously so as to ensure continuity and smoothness of the description order.

Further, the last text sequence, namely the text sequence obtained by the last description task, contains the attention information in the previous description task and the description, and can guide the current description task to determine the point of attention needed in the current time from the target information, so as to determine the attention information in the current description task. On the basis of this, a current text sequence is generated on the basis of the attention information to be described currently. It should be noted that the current text sequence is generated based on the previous text sequence, for example, the previous text sequence is "lying on the lawn", and the current text sequence is "lying on the lawn with a dog".

By executing the description task for a plurality of times, the text sequence obtained by the last description task can cover all objects in the image to be described and the relation between the objects without repetition and omission, and the text sequence is used as the description text of the image to be described.

According to the image description method provided by the embodiment of the invention, in the image description process, the attention information to be described at the current time is determined from a plurality of targets and the relation between the targets according to the text sequence generated at the last time, the text sequence at the current time is generated, and finally the description text of the image to be described is obtained, so that the description text obtained by image description can be attached to interest points and knowledge points which need to be noticed by a user for observation through a mode of simulating the attention conversion circulation of a person during image description, the richness and pertinence of the image description are ensured, and the interactivity with the user is improved.

Based on the above embodiment, step 120 specifically includes:

Inputting the target information into the image description model to obtain a description text output by the image description model; the image description model is obtained by training based on target information in a sample image and sample description text of the sample image;

The image description model is used for determining attention information to be described currently from target information based on the last text sequence, generating the current text sequence based on the attention information, and taking the text sequence obtained by the last description task as the description text of the image to be described.

Specifically, the execution of step 120 may be performed by a pre-trained image description model. After the target information is determined, multiple targets and relationships among the targets can be described by using an image description model, attention information to be described currently is determined from the target information according to a previous text sequence during each description, a current text sequence is generated based on the attention information, and the text sequence obtained by the last description task is used as a description text of the image to be described.

Before step 120 is performed, an image description model may be trained in advance, and specifically, the image description model may be trained as follows: first, a large number of sample images are collected, and sample target information in each sample image is determined, the sample target information including a plurality of sample targets and relationships therebetween. And describing a plurality of sample targets in each sample image and the relation between the sample targets by using natural language in a manual mode to obtain corresponding sample description text. Then, each sample image and the sample description text of each sample image are input into the initial model for training, so that an image description model is obtained.

Based on the foregoing embodiments, fig. 3 is a schematic flow chart of a text sequence determining method according to an embodiment of the present invention, as shown in fig. 3, step 120 specifically includes:

Step 121, updating the attention information to be described last time to the attention information to be described currently based on the last text sequence; wherein the attention information to be described for the first time is determined based on the target information;

Specifically, before the current description task is executed, based on the previous text sequence, the relationship between the object and the object already described in the previous description task may be analyzed, and then, in combination with the relationship between the object and the object already described in the previous description task, attention information to be described in the previous time in the object information is subjected to attention point transfer, and the attention point of the current description task is updated, so as to obtain the attention information to be described currently. By updating the attention information before each description task is executed, each description task can be made to pay more attention to the relationship between the object to be described currently and the object in the object information.

It should be noted that, when the attention information to be described is acquired for the first time, since the text sequence and the attention information before the task to be described for the first time do not exist, the feature extraction can be directly performed on the target information so as to obtain the attention information to be described for the first time, and then when the task to be described for the next time is performed, the current attention information to be described is determined according to the previous text sequence and the attention information to be described for the last time.

Step 122, based on the attention information to be described currently, generating a current text sequence based on the previous text sequence.

Specifically, new word segmentation may be generated based on the previous text sequence according to the attention information to be described currently, so as to obtain the current text sequence. Further, in order to improve the expression capability of the description text of the image to be described, a language model with a structure such as a transducer can be used to generate a text sequence, so that the description text with higher expression capability can be obtained.

According to the image description method provided by the embodiment of the invention, the attention information to be described last time is updated based on the last text sequence, so that the current attention information to be described is obtained, the image description of the current word is realized, the current text sequence is obtained, and the attention circulation in the image description process is realized through successive updating of the attention information, so that the generated description text of the image to be described has richer expression capability.

Based on any of the above embodiments, fig. 4 is a schematic flow chart of a text generation method based on a transducer according to an embodiment of the present invention, where, as shown in fig. 4, when describing an image to be described, the current time step is t, and the last text sequence Y ^t-1＝{y_o,y₁,…,y_t-1 and the attention information to be described currently are respectively performed by LSTM (Long Short-Term Memory)Coding is carried out, and the hidden layer vector of the last text sequence and the hidden layer vector of the attention information to be described currently are obtained. Here, y _o,y₁,…,y_t-1 is the word segmentation in the text sequence generated at each of time steps 0 through t-1,For the vector representation of the current time step t for the attention information of each node in the target information corresponding scene graph, N is the number of nodes in the scene graph. Then, coding and converting hidden layer vectors of the two layers through a multi-layer perceptron (Multilayer Perceptron, MLP) respectively, so as to obtain vectors under the same dimension space, superposing the vectors on the basis, and inputting the superposed vectors and the last text sequence into a pre-trained transform model to obtain the current text sequence.

The transducer model herein may be implemented using an open source xlnet architecture. The transducer model is obtained by pre-training on a large-scale corpus, has strong learning and expression capability, and can be applied to richer vocabulary expression description texts.

Based on any of the above embodiments, fig. 5 is a flowchart of a method for determining attention information to be described for the first time according to an embodiment of the present invention, where, as shown in fig. 5, the method for determining attention information to be described for the first time includes the following steps:

And 510, extracting interest features from the scene graph corresponding to the target information to obtain the interest features of each node in the scene graph.

Specifically, the target information includes a plurality of targets and relationships between the plurality of targets, and the relationships between the targets and the targets may be represented by a scene graph. A scene graph is a graph result consisting of nodes and connections between nodes. The scene graph corresponding to the target information comprises nodes and connecting lines among the nodes, the target objects in the target information and the relations among the target objects correspond to the nodes in the scene graph, and the connecting lines among the nodes in the scene graph represent the directions among the nodes. The scene graph may be expressed as the following formula:

G＝{x_i}

Where G is a scene graph, x _i is any node in the scene graph, and may be a target node, an attribute node, or an edge node, and i is a label of the node.

For example, fig. 6 is a schematic diagram of an image to be described according to an embodiment of the present invention, where the schematic diagram includes 2 target nodes and 1 edge node as shown in fig. 6. The 2 target nodes are respectively a driver and a safety belt, and the 1 side node is tethered.

The scene graph is used as a carrier of the relation between the semantic label of each object in the image to be described and the object, and can display the object information very completely. Meanwhile, the scene graph is used as a topological relation graph to be applied to feature extraction based on a graph neural network, and the information of relation connection between original targets can be well extracted, so that the realization of a subsequent image description task is facilitated.

In order to ensure that the description of the image to be described has strong correlation with the image as a whole, namely, the relation between important targets in the image can be focused, the nodes needing more focusing in the image description and the characteristics of edges between the nodes in the scene graph can be extracted, so that the interest characteristics of each node in the scene graph can be obtained. Further, the attention matrix of the nodes and edges to be noted of the Laplace matrix can be calculated by using the Laplace matrix corresponding to the scene graph, so that the interest features of each node to be noted are abstracted and obtained, and the interest features are expressed as follows:

X_importance＝σ(f(L))·(L)·(G)

L＝D-A

Wherein G is a scene graph, L is a Laplace matrix corresponding to the scene graph G, X _importance is an interest feature matrix formed by interest features of each node in the scene graph G, A is an adjacent matrix of the scene graph, D is a degree matrix of the scene graph, f is a feature transformation function, the feature transformation function can be obtained through multi-layer perceptron learning, and the importance of the node or edge which should be noticed by the Laplace matrix L is obtained after the output of f is activated by an activation function sigma.

Step 520, based on the attention distribution of the interest feature of each node in the scene graph, sampling the relevant node of each node, determining the target feature of each node based on the sampled relevant node of each node, and taking the target feature of each node as the attention information to be described for the first time.

Specifically, the interesting features of each node in the scene graph G are still distributed in a sparse high-dimensional space, and in order to generate descriptive sentences by using the interesting features of the nodes conveniently, the interesting features can be mapped into a dense low-dimensional space.

Here, the attention profile is a priori profile based on the correlation between each node and its neighbor nodes in the scene graph. The attention profile can be obtained in particular by the following method: attention distribution is obtained by applying an attention mechanism algorithm to the interest feature matrix X _importance, and attention query values of each node X ⁱ are obtained by respectively carrying out linear transformation on the interest features of each node X ⁱ through { M _query;M_key;M_value }, wherein the attention query values of the nodes are obtained by adopting the attention mechanism algorithmAttention key valueAnd attention valueWherein, M _query is an attention query transformation matrix, M _key is an attention key value transformation matrix, M _value is an attention value transformation matrix, and all three attention transformation matrices are obtained through training. On this basis, the attention coefficients of the nodes x ⁱ and x ^j are calculated, and the attention coefficients of the nodes x ⁱ and x ^j represent the degree of correlation between the nodes x ⁱ and x ^j, where the conversion can be usedAndThe correlation between these is used as the attention factor of the two nodes. After the attention coefficients of the node x _i and other nodes in the scene graph are calculated, normalization processing is performed, and the attention coefficients after normalization processing are used as the attention distribution of the interest feature of the node x ⁱ and used for randomly sampling related nodes from neighbor nodes of each node.

After obtaining the relevant node of each node through random sampling, determining the target feature of each node based on the relevant node of each node, wherein the relevant node sequence of any node x ⁱ can be expressed as:

in the method, in the process of the invention, The 1 st to j-th relevant nodes of the node x ⁱ are obtained for random sampling according to the attention profile.

Then, the relevant node sequences obtained by sampling can be directly aggregated to obtain the target feature h ⁱ of the node x ⁱ, and the target feature h ⁱ is used as the attention information to be described for the first time, and can be expressed as the following formula:

Where samp_ neigh is the number of sampled related node sequences and M _f is the linear variation of the band offset term for achieving the feature transformation from high-dimensional sparse to low-dimensional dense. j is the sequence number of the relevant node, Is the corresponding attention value for the relevant node x ^j.

According to the image description method provided by the embodiment of the invention, the target characteristics of each node are determined by a sampling mode based on the attention distribution, so that not only the node information related to the node before is considered, but also the node information not related to the node before is combined, and the information of other nodes in the scene graph is better aggregated to the target characteristics of the currently researched node, so that the characteristic representation capability of the target characteristics of each node in the scene graph is improved.

Based on any of the above embodiments, fig. 7 is a flowchart of a method for updating attention information according to an embodiment of the present invention, as shown in fig. 7, step 121 specifically includes:

And 1211, performing target updating on the target features updated after the last description task is ended based on the last text sequence, so as to obtain the target features after the last description task is ended.

Specifically, all nodes in the scene graph should be expressed by descriptive text, and no phenomena of deletion or repetition can occur. After a target node is described in the scene graph, features associated with the target node in the scene graph may be updated. Here, the target feature, that is, the feature representation of the target information, may be obtained by extracting the feature of the target information in the image to be described before the task is first described, and then the target feature after the task is described each time is obtained by updating the target feature after the task is described last time. It should be noted that, the target feature before the task is first described herein may be consistent with the attention information to be first described.

And the target characteristics updated after the last description task is ended can be updated according to the information which is reflected by the last text sequence and is already described in the last description task, so that the target characteristics after the last description task is ended are obtained. For example, if the previous text sequence is "a dog lying on the lawn", the features associated with "lawn", "lying on the lawn" and "dog" in the target features updated after the last description task is finished can be weakened, and the feature representation of the previous text sequence is updated as the already described information into the updated target features after the last description task is finished, so that the target features after the last description task is finished are obtained.

And step 1212, performing target flow update on the attention information to be described last time based on the last text sequence and the target characteristics after the last description task is finished, so as to obtain the attention information to be described currently.

In particular, the generation of the text sequence is related not only to the node features in the scene graph, but also to the information flow state in the scene graph. The information flow state in the scene graph refers to the expected node description order in the scene graph. For example, if the node of current interest is a target node, then the next node to be accessed is likely to be a relationship node or attribute node that connects to the target node, rather than another target node that is not connected to the target node, according to the information flow of the scene graph.

And predicting the information flow state of the scene graph node to be described according to the last text sequence and the target characteristics after the last description task is finished, so that the attention information to be described last time is updated to the attention information to be described currently.

Based on any of the above embodiments, step 1211 specifically includes:

Specifically, after the last text sequence is obtained, the current feature updating parameter can be determined according to the correlation between the last text sequence and the target feature updated after the last description task is finished, and then the target feature updated after the last description task is finished is updated based on the current feature updating parameter, at this time, feature information with high association degree with the described node feature can be selectively weakened, so that repeated description is avoided, and the described information reflected by the last text sequence can be supplemented, so that association between the scene graph feature and the generated text sequence is deepened. Here, the current feature update parameter may specifically be a weight parameter that weakens a part of features in the target feature updated after the end of the last description task, and/or a weight parameter that enhances a part of features in the target feature updated after the end of the last description task.

Assuming that the description time step is t, the generated last text sequence is { y ₁,y₂,····,y_t-1 }, and the updated target feature after the last description task is finished isWhere N is the number of nodes in the scene graph.

In order to facilitate the calculation of the correlation between the last text sequence and the target feature updated after the last description task ends, the last text sequence and the target feature updated after the last description task ends may be first converted into a space of the same dimension. Here, the dimension conversion may be achieved by a BILSTM layer (Bi-directional Long Short-Term Memory) and a linear transformation layer. For example, converting the last text sequence may be embodied as the following formula:

y^t-1＝BILSTM({y₁,y₂,····,y_t-1})

Where y ^t-1 is the feature representation of the last text sequence after dimension conversion.

The current feature update parameters may include a current feature removal matrix and/or a current feature addition matrix. The current feature elimination matrix can be used for weakening the feature with high relevance to the last text sequence in the target feature updated after the last description task is finished, and the current feature addition matrix can be used for supplementing the described feature represented by the last text sequence in the target feature updated after the last description task is finished. It is assumed that the correlation between the last text sequence and the target feature of the i-th node updated after the last description task was completed can be usedRepresenting the current feature elimination parameter as matrix M _clear and the current feature addition parameter as matrix M _add, the current feature elimination matrix and the current feature addition matrix can be formulated as:

Wherein I is an identity matrix.

Correlation ofThe method can be solved by using a multi-layer sensing mechanism (Multilayer Perceptron, MLP), and can be regulated to be between 0 and 1 after being calculated by algorithms such as distance similarity, cosine similarity and the like, and the method is to be noted,The greater the value of (c), the weaker the correlation with the last text sequence in the target feature updated after the last description task ends. Solving for correlations by MLP can be formulated here as:

Wherein M _{clear_y} is a preset text sequence feature removal parameter, M _{clear_g} is a preset image target feature removal parameter, And (5) the target characteristics of the ith node updated after the last description task is finished.

Updating the target feature updated after the last description task is finished based on the current feature updating parameters, and the target feature can be expressed as follows:

in the method, in the process of the invention, And (5) the target characteristics of the ith node updated after the last description task is finished. The current feature elimination parameter M _clear and the target feature of the i-th node updated after the last description task is finishedMultiplication realizes the weakening of the characteristic with high correlation with the last text sequence in the target characteristic updated after the last description task is finished; the multiplication of the current feature addition parameter M _add with the feature representation of the previous text sequence achieves the supplementing of already described information embodied by the previous text sequence, which facilitates the incorporation of previously generated information in the current image description task.

For example, the last text sequence is "lay on grasslands", and the last text sequence is "lay on grasslands puppies". When the current text sequence is generated, as the correlation between the characteristic information of grasslands and lying in the target characteristics updated after the last description task is finished and the last text sequence is higher, the characteristic information of grasslands and lying in the target characteristics updated after the last description task is finished can be weakened by multiplying the current characteristic clearing parameter with the target characteristics updated after the last description task is finished, so that repeated description of the grasslands and lying is avoided.

In addition, the current feature addition parameter can be multiplied with the feature representation of the last text sequence, and the described information reflected by the last text sequence is added into the updated target feature after the last description task is finished, so that the newly generated updated target feature after the last description task is finished can also contain the described information as the above information of the current image description task, and the information can be better combined with the generated text sequence in the execution process of the current image description task, thereby improving the smoothness of the newly generated text sequence.

According to the image description method provided by the embodiment of the invention, the described scene graph characteristics are updated by adopting the scene graph characteristic updating mechanism, and partial characteristic information is selectively weakened and/or enhanced, so that all nodes in the scene graph are expressed by the description text, and the occurrence of deletion or repetition is avoided.

Based on any of the above embodiments, step 1212 specifically includes:

determining a flow probability of the last description task corresponding to each information flow state based on the correlation between the last text sequence and each information flow state and the correlation between the target feature after the last description task is finished and each information flow state;

wherein the representation of each information flow state is determined based on the target information and the current number of updates.

Specifically, the scene graph of the image to be described is described in real time, and when the current text sequence is generated, the description sequence of the scene graph nodes to be described is controlled according to the described scene graph nodes so as to simulate the transfer process of human attention and reflect the description sequence expected by a user.

The information flow state in the scene graph refers to the expected node description sequence in the scene graph, and can be determined according to the target information and the current update times. Generally, information flow states can be divided into four different operations, corresponding to four different scene graph flow forms:

a first operation (op_1), when the representation of the information flow state is to describe the same node using a plurality of words:

A second operation (op_2) when the representation of the information flow state is such that the description order in the scene graph transitions clockwise from one node to the next:

A third operation (op_3) when the representation of the information flow state is such that the description order in the scene graph transitions counter-clockwise from one node to the next:

A fourth operation (op_4) when the representation of the information flow state is such that the description order in the scene graph transitions clockwise from one node across nodes to the next node:

in the above formula, t is a description time step, represents the current update times, alpha is an initial parameter of an information flow state, and L is a Laplacian matrix of a scene graph corresponding to target information.

In the embodiment of the invention, the probabilities P (op_1), P (op_2), P (op_3) and P (op_4) corresponding to four information flow states of the last description can be calculated according to the attention mechanism respectively, and then the nodes of the current description can be predicted according to the information flow states of the last description. The way in which the probability of the information flow state is calculated is described below by calculating P (op_1).

According to the attention mechanism, the last text sequence { y ₁,y₂,…,y_t-1 } and the target characteristics after the last description task are finishedRespectively multiplied byObtaining the last text sequence and the attention mechanism feature vector of the target feature after the last description task is finished:

Wherein, For the attention query transformation matrix in the first information flow state,For the first information flow state attention key value transformation matrix,For the first information flow state, the attention value transformation matrix, the three attention transformation matrices can be obtained through training.

On the basis, calculate the attention characteristic valueWherein Q _h is a query matrix describing target features after the task is finished last timeK _y is the key value matrix of the last text sequenceD is the vector dimension of the Q _h matrix and the K _y matrix.

The attention characteristic value may be normalized by means of, for example, a softmax function, etc., resulting in a flow probability P (op_1) that last described the flow state corresponding to the first information.

Similarly, P (op_2), P (op_3) and P (op_4) can be solved.

According to the representation of each information flow state and the flow probability, the attention information Z ^t-1 to be described last time is updated to obtain the attention information Z ^t to be described last time, and the current information is expressed as follows by a formula:

It should be noted that, the attention information Z ⁰ to be described for the first time may be determined according to the initial target feature of each node, for example, Z ⁰ is the initial target feature of each node

According to the image description method provided by the embodiment of the invention, the attention mechanism of the graph information flow is adopted, the description sequence of the scene graph nodes to be described is controlled according to the described scene graph nodes, the transfer process of human attention is simulated, and the method is closer to the expected description sequence.

Based on any of the above embodiments, the target information further includes a preset relationship between the targets and/or a target attribute.

Specifically, in order to better determine the interest points and knowledge points of the user in the image to be described, a preset relationship between the objects may be added to the determined object information in the image to be described. For example, automotive airbag system malfunction alerting identifiers are typically constructed using a circular icon and a cartoon character graphic. If it is described by computer vision, the description text is "driver holds the air bag", and the recognition result shows that it is not in line with the fact. The description text conforming to the correct understanding of the user should be "the airbag protects the driver", so that a preset relationship "protection" can be added between the determined "airbag" and the determined "driver", so that the relationship between the targets in the image description is more accurate.

When describing some targets in the image to be described, there are some attribute descriptions in cultural background, and the unidentified targets can be analogized with similar targets identified before, and the analogized is taken as a target attribute and is also added into target information. For example, automotive electronic security systems are generally identified using system abbreviations and circular patterns. In particular, an anti-lock braking system (Antilock Brake System, ABS) malfunction alerting indicator for an automobile is typically constructed using the letter ABS and a circular pattern.

Because the ABS system belongs to the basic configuration of the automobile, ABS fault alarm identifiers are more common in the automobile monitoring system, while other types of electronic safety systems are configured according to different types of vehicles, corresponding fault alarm identifiers are not common in the automobile monitoring system, and if the image description is performed by using a computer vision technology, the fault alarm identifiers may not be recognized. If the ABS fault warning indicator has been identified before, the analog relationship "belonging to the vehicle electronic safety system" between the other electronic safety systems and the ABS system can be added to the target information as an attribute of the node with the letter indicator. When describing other icons displayed in the automobile monitoring system, the automobile electronic safety system fault can be accurately identified.

According to the image description method provided by the embodiment of the invention, the preset relation and/or the object attribute among the objects are added in the object information, so that a user can update the object information of the image to be described by himself, and the description text can attach to the interest points and the knowledge points which the user needs to notice in observation according to the updated object information, thereby ensuring the richness of the image description.

Based on any of the above embodiments, the nodes in the target information correspondence scene graph may include target nodes (objects), attribute nodes (attribution), and edge nodes (relationships). The target node represents a target object in the image to be described; the attribute node represents an attribute of the object, such as a color attribute, a shape attribute, a semantic attribute, or the like; the edge nodes represent the relationship between the object and the object, including the position relationship, the size relationship, the motion relationship, the dependency relationship and the like. According to the function of the edge nodes, the edge nodes can be divided into three types: edge nodes of the target pointing attribute, edge nodes of the target pointing relationship, and edge nodes of the relationship pointing target.

It should be noted that, the attribute of the object corresponding to the attribute node and the relationship between the objects corresponding to the edge nodes may be obtained through direct recognition and detection of the image to be described, or may be additionally added in combination with the object and a preset relationship or attribute, which is not particularly limited in the embodiment of the present invention.

Based on any of the above embodiments, fig. 8 is a schematic structural diagram of an image description device according to an embodiment of the present invention, as shown in fig. 8, where the device includes:

An information determining unit 810 for determining target information in an image to be described; the target information includes a plurality of targets and relationships therebetween;

A text generating unit 820, configured to determine attention information to be described currently from the target information based on a previous text sequence, and generate a current text sequence based on the attention information, where the previous text sequence and the current text sequence are text sequences obtained by the previous description task and the current description task, respectively, and the current text sequence includes the previous text sequence; and taking the text sequence obtained by the last description task as the description text of the image to be described.

According to the image description device provided by the embodiment of the invention, in the image description process, the text generation unit determines the attention information to be described at the current time according to the text sequence generated at the last time and the plurality of targets and the relation among the targets determined by the information determination unit, generates the text sequence at the current time, finally obtains the description text of the image to be described, and enables the description text obtained by the image description to be attached to the interest points and the knowledge points which need to be noticed by the user for observation by simulating the attention conversion circulation mode of people during the image description, thereby ensuring the richness and pertinence of the image description and improving the interactivity with the user.

Based on any of the above embodiments, the text generation unit 820 includes:

an attention updating subunit, configured to update, based on a previous text sequence, the attention information to be described last time to the attention information to be described currently; wherein the attention information to be described for the first time is determined based on the target information;

and the text sequence generation subunit is used for generating a current text sequence based on the current attention information to be described.

Based on any of the above embodiments, the apparatus further includes a target feature extraction unit configured to:

Based on the attention distribution of the interest feature of each node in the scene graph, sampling the relevant nodes of each node, determining the target feature of each node based on the sampled relevant nodes of each node, and taking the target feature of each node as the attention information to be described for the first time.

Based on any of the above embodiments, the attention update subunit specifically includes:

the target feature updating module is used for carrying out target updating on the target features updated after the last description task is ended based on the last text sequence to obtain the target features after the last description task is ended; the target feature is a feature representation of the target information;

And the target flow updating module is used for carrying out target flow updating on the attention information to be described last time based on the last text sequence and the target characteristics after the last description task is finished, so as to obtain the attention information to be described currently.

Based on any of the above embodiments, the target feature update module is specifically configured to:

Based on any of the above embodiments, the target flow update module is specifically configured to:

Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, as shown in fig. 9, the electronic device may include: processor (Processor) 910, communication interface (Communications Interface) 920, memory (Memory) 930, and communication bus (Communications Bus) 940, wherein Processor 910, communication interface 920, memory 930 complete communication with each other through communication bus 940. Processor 910 may invoke logic commands in memory 930 to perform the following method:

Determining target information in an image to be described; the target information includes a plurality of targets and relationships therebetween; determining attention information to be described currently from target information based on a previous text sequence, generating a current text sequence based on the attention information, wherein the previous text sequence and the current text sequence are text sequences obtained by a previous description task and a current description task respectively, and the current text sequence comprises the previous text sequence; and taking the text sequence obtained by the last description task as the description text of the image to be described.

In addition, the logic commands in the memory 930 described above may be implemented in the form of software functional units and may be stored in a computer readable storage medium when sold or used as a stand alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in the form of a software product stored in a storage medium, comprising several commands for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

Embodiments of the present invention also provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the methods provided by the above embodiments, for example, comprising:

The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several commands for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. An image description method, comprising:

Determining current attention information to be described from the target information based on a previous text sequence, and generating a current text sequence based on the attention information, wherein the previous text sequence and the current text sequence are text sequences obtained by a previous description task and a current description task respectively, and the current text sequence comprises the previous text sequence; the last text sequence comprises the attention information of the last description task and the previous description task;

Taking a text sequence obtained by the last description task as a description text of the image to be described;

the method for generating the current text sequence based on the previous text sequence determines attention information to be described currently from the target information and generates the current text sequence based on the attention information specifically comprises the following steps:

generating a current text sequence based on the previous text sequence based on the attention information to be described currently;

Based on the last text sequence, the method updates the attention information to be described last time to the attention information to be described currently, and specifically comprises the following steps:

Based on the last text sequence and the target characteristics after the last description task is finished, carrying out target flow update on the attention information to be described last time to obtain the attention information to be described currently;

and performing target flow update on the attention information to be described last time based on the last text sequence and the target characteristics after the last description task is finished to obtain the attention information to be described currently, wherein the method specifically comprises the following steps of:

2. The image description method according to claim 1, wherein the method of determining attention information to be described for the first time includes the steps of:

3. The image description method according to claim 1, wherein the updating the target feature updated after the last description task is completed based on the last text sequence to obtain the target feature after the last description task is completed specifically includes:

4. A method of image description according to any one of claims 1 to 3, wherein the target information further comprises a preset relationship between targets and/or target properties.

5. An image description apparatus, comprising:

The text generation unit is used for determining attention information to be described currently from the target information based on a previous text sequence, generating a current text sequence based on the attention information, wherein the previous text sequence and the current text sequence are text sequences obtained by a previous description task and a current description task respectively, and the current text sequence comprises the previous text sequence; the last text sequence comprises the attention information of the last description task and the previous description task; taking a text sequence obtained by the last description task as a description text of the image to be described;

the text generation unit includes:

An attention updating subunit, configured to update, based on a previous text sequence, the attention information to be described last time to the attention information to be described currently; the attention information to be described for the first time is determined based on the attention distribution of the interest features of each node in the scene graph corresponding to the target information;

a text sequence generation subunit, configured to generate a current text sequence based on the previous text sequence based on the attention information to be described currently;

The attention update subunit specifically includes:

the target feature updating module is used for updating the target feature updated after the last description task is ended based on the last text sequence to obtain the target feature after the last description task is ended; the target feature is a feature representation of the target information;

The target flow updating module is used for carrying out target flow updating on the attention information to be described last time based on the last text sequence and the target characteristics after the last description task is finished, so as to obtain the attention information to be described currently;

the target flow updating module is specifically configured to:

6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the image description method according to any one of claims 1 to 4 when the computer program is executed.

7. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the image description method according to any one of claims 1 to 4.