CN109145970B

CN109145970B - Image-based question and answer processing method and device, electronic equipment and storage medium

Info

Publication number: CN109145970B
Application number: CN201810884844.5A
Authority: CN
Inventors: 高鹏; 李鸿升; 王晓刚
Original assignee: Beijing Sensetime Technology Development Co Ltd
Current assignee: Beijing Sensetime Technology Development Co Ltd
Priority date: 2018-08-06
Filing date: 2018-08-06
Publication date: 2021-01-12
Anticipated expiration: 2038-08-06
Also published as: CN109145970A

Abstract

The present disclosure relates to an image-based question and answer processing method and apparatus, an electronic device, and a storage medium. The method includes: acquiring image features of a target image and question features of a target question; determining a dynamic convolution kernel based on the question features, and performing a convolution operation on the image features based on the dynamic convolution kernel to obtain a fusion feature; and determining an answer to the target question in the target image based on the fusion feature. Embodiments of the disclosure can improve the prediction accuracy of the answer.

Description

Image-based question and answer processing method and device, electronic equipment and storage medium

Technical Field

The embodiment of the disclosure relates to the field of computer vision, in particular to a question and answer processing method and device based on an image, electronic equipment and a storage medium.

Background

Image recognition technology can be applied in many fields. For example, combining computer vision with language analysis enables image annotation, text-based image search, visual question answering, and the like. Given a picture and a question about that picture, a visual question-answering system can produce an answer in text form.

In existing research on the visual question-answering task, a bilinear pooling method is first used to obtain a fusion feature from the image features and the question features, and the final answer is then predicted based on the fusion feature. However, this feature fusion is performed at a late stage of the visual question-answering model, so the picture features and question features cannot be used effectively.

Disclosure of Invention

The embodiment of the disclosure provides a question and answer processing method and device based on an image, electronic equipment and a storage medium, wherein the question and answer processing method and device can effectively utilize image characteristics and question characteristics and improve recognition accuracy.

According to a first aspect of the embodiments of the present disclosure, there is provided a question-answering processing method based on an image, including:

acquiring image features of a target image and acquiring question features of a target question, wherein the target question is a question about the target image;

determining a dynamic convolution kernel based on the question features, and performing a convolution operation on the image features based on the dynamic convolution kernel to obtain a fusion feature;

determining an answer to the target question in the target image based on the fusion feature.

In an embodiment of the present disclosure, the performing a convolution operation on the image feature based on the dynamic convolution kernel to obtain a fused feature includes:

inputting the dynamic convolution kernel to each of N convolution modules, wherein the convolution modules are connected in sequence to form a linear connection structure, and N is an integer greater than 1;

performing, through the first convolution module in the linear connection structure, a convolution operation on the image features based on the dynamic convolution kernel to obtain a first output result;

and performing, through the i-th convolution module in the linear connection structure, a convolution operation on the output result of the (i-1)-th convolution module based on the dynamic convolution kernel, and obtaining the fusion feature through the N-th convolution module, wherein i is an integer greater than 1 and less than or equal to N.

In an embodiment of the present disclosure, each convolution module includes at least one convolution unit group, where performing convolution operations by each convolution module includes:

dividing the input data of a convolution module into at least one group of feature sub-data, wherein the number of groups of feature sub-data corresponds to the number of convolution unit groups, and the input data comprises the image features or the output result of the convolution module preceding this convolution module;

allocating each group of feature sub-data to a corresponding convolution unit group, allocating a dynamic convolution kernel to at least one convolution unit group, and allocating a preset convolution kernel to each convolution unit group other than the at least one convolution unit group allocated the dynamic convolution kernel;

and performing, through each convolution unit group, a convolution operation on the input feature sub-data based on the allocated dynamic convolution kernel or preset convolution kernel.

In an embodiment of the present disclosure, the performing, by each convolution unit group, a convolution operation on the input feature sub-data based on the allocated dynamic convolution kernel or preset convolution kernel includes:

selecting a first part of data from first characteristic data obtained by a convolution unit group which performs convolution operation by using a dynamic convolution kernel, wherein the first characteristic data is data generated by a convolution unit which performs convolution operation by using the dynamic convolution kernel;

replacing second part data of second characteristic data obtained by a convolution unit group which performs convolution operation by using a preset convolution kernel with the first part data to obtain third characteristic data;

and obtaining an output result of the corresponding convolution module based on the first characteristic data and the third characteristic data.
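The replacement step above can be illustrated with a minimal sketch. The text does not say which part of each feature is selected or how the final output is assembled, so the sketch assumes that the "first part" and "second part" are leading slices of equal length and that the module output is a simple concatenation. The function names `replace_part` and `module_output` are hypothetical.

```python
def replace_part(first_feature, second_feature, part_len):
    """Build the third feature data: `second_feature` with its leading
    `part_len` entries replaced by the leading entries of `first_feature`.
    Using the leading slice is an assumption made for illustration."""
    if part_len > min(len(first_feature), len(second_feature)):
        raise ValueError("part_len exceeds the feature length")
    first_part = first_feature[:part_len]
    return first_part + second_feature[part_len:]


def module_output(first_feature, second_feature, part_len):
    """Combine first and third feature data (concatenation is assumed)."""
    third_feature = replace_part(first_feature, second_feature, part_len)
    return first_feature + third_feature


# first: output of a unit group that used the dynamic kernel
# second: output of a unit group that used a preset kernel
first = [10, 11, 12, 13]
second = [20, 21, 22, 23]
third = replace_part(first, second, 2)
combined = module_output(first, second, 2)
```

In this sketch the group that used a preset kernel receives question-dependent values from the dynamic group, while the dynamic group's own output is left unchanged.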

respectively and correspondingly allocating each group of characteristic subdata to each convolution unit group, and respectively allocating a dynamic convolution kernel to each convolution unit group;

and performing convolution operation on the input feature sub-data based on the allocated dynamic convolution kernel through each convolution unit group.

In an embodiment of the present disclosure, the performing, by each convolution unit group, a convolution operation on the input feature sub-data based on the allocated dynamic convolution kernel includes:

acquiring characteristic data output by each convolution unit group;

and obtaining the output result of the corresponding convolution module based on the characteristic data output by each convolution unit group.

In an embodiment of the present disclosure, the determining an answer to the target question in the target image based on the fused feature includes:

and analyzing and identifying the fusion characteristics by using a preset classification model to obtain an answer aiming at the target question.

obtaining an attention weight matrix corresponding to the fusion feature;

weighted summation of corresponding features of the fused features based on the attention weight matrix;

and analyzing and identifying the fusion characteristics after the weighted summation by using a preset classification model to obtain an answer aiming at the target problem.
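As a rough illustration of the attention-weighted summation, the sketch below pools a set of per-location feature vectors into one vector. How the attention weight matrix is computed is not specified in the text; here it is assumed that one raw score per spatial location is already available and that the scores are normalized with a softmax. The names `softmax` and `attention_pool` are hypothetical helpers.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, scores):
    """Weighted sum of per-location feature vectors.

    features: list of vectors, one per spatial location of the fusion feature
    scores:   one raw attention score per location (assumed to be given)
    """
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


features = [[1.0, 0.0], [0.0, 1.0]]
pooled = attention_pool(features, [0.0, 0.0])  # equal scores give a plain average
```

The pooled vector would then be passed to the preset classification model in place of the full fusion feature.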

In an embodiment of the present disclosure, the preset classification model includes a deep learning neural network model.

In an embodiment of the present disclosure, the determining a dynamic convolution kernel based on the question features includes:

processing the question features using a fully-connected neural network to obtain the dynamic convolution kernel.

In an embodiment of the present disclosure, the acquiring image features of a target image and acquiring question features of a target question includes:

extracting the image features of the target image using a first model, and extracting the question features of the target question using a second model.

In an embodiment of the disclosure, the first model comprises a deep convolutional network model and the second model comprises a recurrent neural network model.

In an embodiment of the disclosure, the method further comprises:

acquiring a target image from the stored image data and/or receiving a transmitted target image;

receiving an input target question, and/or identifying a target question in a target image.

According to a second aspect of the embodiments of the present disclosure, there is provided an image-based question-answering processing apparatus including:

a feature acquisition module configured to acquire image features of a target image and acquire question features of a target question, wherein the target question is a question about the target image;

a feature fusion module configured to determine a dynamic convolution kernel based on the question features and perform a convolution operation on the image features based on the dynamic convolution kernel to obtain a fusion feature;

a determination module configured to determine an answer in the target image to the target question based on the fused features.

In the embodiment of the present disclosure, the feature fusion module includes N convolution modules, and the convolution modules are connected in sequence to form a linear connection structure, where N is an integer greater than 1; and

The feature fusion module is further configured to input the dynamic convolution kernels to the N convolution modules respectively, and perform convolution operation of the image features based on the dynamic convolution kernels through a first convolution module in the linear connection structure to obtain a first output result; and

In the embodiment of the present disclosure, each convolution module includes at least one convolution unit group;

the feature fusion module is further configured to divide input data input to each convolution module into at least one group of feature sub-data, respectively and correspondingly allocate each group of feature sub-data to each convolution unit group, allocate a dynamic convolution kernel to at least one convolution unit group, and execute convolution operation on the input feature sub-data based on the allocated dynamic convolution kernel or a preset convolution kernel through each convolution unit group;

each convolution unit group except the at least one convolution unit group allocated with the dynamic convolution kernel is allocated with a preset convolution kernel;

and the number of groups of the feature sub-data corresponds to the number of groups of the convolution unit groups, and the input data includes the image feature or an output result of a convolution module before the convolution module.

In an embodiment of the present disclosure, the feature fusion module is further configured to select a first part of data from first feature data obtained from a convolution unit group that performs a convolution operation using a dynamic convolution kernel, where the first feature data is data generated by a convolution unit that performs a convolution operation using the dynamic convolution kernel;

the characteristic fusion module is further configured to divide input data input to each convolution module into at least one group of characteristic sub-data, respectively and correspondingly allocate each group of characteristic sub-data to each convolution unit group, respectively allocate a dynamic convolution kernel to each convolution unit group, and execute convolution operation on the input characteristic sub-data based on the allocated dynamic convolution kernel through each convolution unit group;

the number of groups of the feature sub-data corresponds to the number of groups of the convolution unit groups, and the input data includes the image feature or an output result of a convolution module before the convolution module.

In an embodiment of the present disclosure, the feature fusion module is further configured to obtain an output result of the corresponding convolution module based on the feature data output by each convolution unit group.

In an embodiment of the present disclosure, the determining module is further configured to analyze and identify the fusion features by using a preset classification model, so as to obtain an answer to the target question.

In an embodiment of the present disclosure, the determining module is further configured to obtain an attention weight matrix corresponding to the fusion feature, perform weighted summation on the corresponding feature of the fusion feature based on the attention weight matrix, and perform analysis and identification on the fusion feature after the weighted summation by using a preset classification model to obtain an answer to the target question.

In an embodiment of the present disclosure, the feature fusion module is further configured to process the question features using a fully-connected neural network to obtain the dynamic convolution kernel.

In an embodiment of the present disclosure, the feature acquisition module is further configured to extract the image features of the target image using a first model, and to extract the question features of the target question using a second model.

In an embodiment of the present disclosure, the apparatus further includes:

an input module configured to acquire a target image from stored image data and/or receive a transmitted target image, and configured to receive an input target question and/or identify a target question in a target image.

According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to perform the method of any of the above embodiments.

According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method of any one of the above embodiments.

In the embodiment of the present disclosure, after the image feature and the question feature are obtained, a dynamic convolution kernel is determined according to the question feature to perform a convolution operation of the image feature, and since the question feature is fused in the dynamic convolution kernel, the image and the question information can be fused tightly in the generated fusion feature (visual feature), thereby improving the recognition accuracy of the answer.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.

FIG. 1 illustrates a flow diagram of a method of image-based question answering processing in accordance with an embodiment of the present disclosure;

FIG. 2 shows a block diagram of a convolution module in accordance with an embodiment of the present disclosure;

FIG. 3 illustrates a flowchart of step S300 in the image-based question answering processing method according to an embodiment of the present disclosure;

FIG. 4 is a block diagram illustrating convolution modules connected in sequence in a linear connection structure according to an embodiment of the disclosure;

FIG. 5 illustrates a flow diagram for performing convolution operations by convolution modules in accordance with an embodiment of the present disclosure;

FIG. 6 illustrates a flow chart for determining an answer to the target question in the target image based on the fused features according to an embodiment of the present disclosure;

FIG. 7 shows a process diagram of an image-based question-answering method according to an embodiment of the present disclosure;

FIG. 8 shows a block diagram of an image-based question answering processing apparatus according to an embodiment of the present disclosure;

FIG. 9 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure;

FIG. 10 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure.

Detailed Description

Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist. For example, "A and/or B" may mean: A exists alone, both A and B exist, or B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality. For example, "including at least one of A, B, and C" may mean including any one or more elements selected from the group consisting of A, B, and C.

Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.

It is understood that the method embodiments of the present disclosure described above can be combined with one another to form combined embodiments without departing from their underlying principles and logic; due to space limitations, the details are not repeated in the present disclosure.

In addition, the present disclosure also provides an image-based question and answer processing device, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any image-based question and answer processing method provided by the present disclosure. For the corresponding technical solutions and descriptions, refer to the corresponding descriptions in the method section, which are not repeated here.

The embodiment of the disclosure provides an image-based question and answer processing method, which can be applied in the field of computer vision to answer questions about an image. The embodiment of the disclosure can determine the dynamic convolution kernel used for the convolution operation from the acquired question features. Because the dynamic convolution kernel carries feature information related to the question, the question features and the image features are effectively fused when the convolution operation is performed, so that the resulting visual features improve the accuracy with which the answer is identified.

Fig. 1 shows a flowchart of an image-based question-answering processing method according to an embodiment of the present disclosure, wherein the image-based question-answering processing method may include, as shown in fig. 1:

S100: acquiring image features of a target image and acquiring question features of a target question, wherein the target question is a question about the target image;

S200: determining a dynamic convolution kernel based on the question features, and performing a convolution operation on the image features based on the dynamic convolution kernel to obtain a fusion feature;

S300: determining an answer to the target question in the target image based on the fusion feature.
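The data flow through steps S100 to S300 can be sketched as follows. This is only an outline of the control flow: every stage is passed in as a callable, and the toy stand-ins below (`encode_image`, `encode_question`, `make_kernel`, `convolve`, `classify`) are hypothetical placeholders that make the sketch runnable, not the models described in this disclosure.

```python
def answer_question(image, question, encode_image, encode_question,
                    make_kernel, convolve, classify):
    """Run the three steps in order and return the predicted answer."""
    image_feature = encode_image(image)                # S100
    question_feature = encode_question(question)       # S100
    kernel = make_kernel(question_feature)             # S200: kernel depends on the question
    fusion_feature = convolve(image_feature, kernel)   # S200
    return classify(fusion_feature)                    # S300


# Toy stand-ins (all hypothetical, chosen only so the sketch executes).
def encode_image(image):
    return image  # treat the raw grid as the image feature


def encode_question(question):
    return [float(len(question.split()))]  # one number: the word count


def make_kernel(question_feature):
    return question_feature[0]  # a 1x1 kernel, i.e. a single weight


def convolve(feature, weight):
    return [[weight * v for v in row] for row in feature]  # 1x1 convolution


def classify(fusion_feature):
    total = sum(sum(row) for row in fusion_feature)
    return "yes" if total > 5 else "no"


toy_image = [[0, 1], [1, 1]]
stages = (encode_image, encode_question, make_kernel, convolve, classify)
answer_long = answer_question(toy_image, "is it bright", *stages)
answer_short = answer_question(toy_image, "dark", *stages)
```

The point of the sketch is that the same image yields different fusion features, and possibly different answers, when the question changes, because the kernel applied to the image is derived from the question.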

The image-based question-answering processing method in the embodiment of the disclosure can obtain an answer to a question about an image from the acquired image and question. The question may concern the number, color, status, or other information of an object in the image. For example, for an image that includes bicycles, the question may ask how many bicycles there are, or what color a bicycle is. The method provided by the embodiment of the disclosure can identify the answer to such a question in the image. The above are merely examples of applications of the embodiments of the present disclosure and are not intended to limit them.

The embodiments of the present disclosure are explained in detail below. A target image and a target question about the target image may be acquired first. The target image includes at least one picture and the target question includes at least one question; that is, the numbers of images and questions are not limited in the embodiments of the present disclosure. One image may correspond to multiple questions, multiple images may correspond to one question, or other forms are possible. In all of these cases, the embodiments of the present disclosure process the question features and the image features in substantially the same manner.

The method for acquiring the target image in the embodiment of the present disclosure may include: obtaining a target image from stored image data, and/or receiving a transmitted target image. That is, the target image may be an image selected from stored image data: selection information input by a user may be received through an input component, and the selected target image may be determined according to the selection information. The input component may include at least one of a touch screen, a touch pad, a keyboard, an audio input device, and the like; in other embodiments, the input component may include other devices for inputting and receiving the selection information. In addition, the target image may be obtained by communicating with other electronic devices, for example by receiving it through a wired communication module and/or a wireless communication module. The wireless communication module may transmit data via Bluetooth, Wi-Fi, NFC, or the like, and the wired communication module may transmit data via an optical fiber, a coaxial cable, or another wired cable.

In addition, in the embodiment of the present disclosure, the manner of obtaining the target question may include: receiving an input target question, and/or identifying a target question in a target image. The target question may be based on information input through the input component, or on information received by the wired communication module and/or the wireless communication module. The target question may also be contained in the target image, or may itself be in the form of a picture; in these cases, the target question in the picture may be recognized by a character recognition method.

After the target image and the target question are acquired, the image features of the target image and the question features of the target question may be acquired through step S100. The image features and the question features may each be expressed in matrix form. The image features may comprise parameters such as the gray values of the pixels of the target image, and the question features may comprise the language information in the target question.

In this embodiment of the present disclosure, acquiring the image features of the target image and the question features of the target question may include: extracting the image features of the target image using a first model, and extracting the question features of the target question using a second model. The first model may include a deep convolutional network model, and the second model may include a recurrent neural network model. That is, feature information in the target image may be extracted using the deep convolutional network model to obtain the image features, and language information in the target question may be extracted using the recurrent neural network model to obtain the question features. The first model and the second model in the embodiments of the present disclosure may also be other types of machine learning network models.

In addition, in other embodiments of the present disclosure, the image features of the target image may be identified through a first preset algorithm. For example, the first preset algorithm may include a wavelet-moment-based algorithm, a fractal-feature-based algorithm, and the like; any method capable of identifying image features in the target image is applicable to the embodiments of the present disclosure. Likewise, the question features of the target question may be obtained through a second preset algorithm, which may include a character recognition algorithm, a character segmentation algorithm, and the like.

After the image features and the question features are obtained, step S200 may be performed; that is, a dynamic convolution kernel used to perform the convolution operation may be determined based on the question features. In image processing, given an input image, each pixel of the output image is a weighted average of the pixels in a small region of the input image; the function that defines the weights is called a convolution kernel. In the embodiments of the present disclosure, the parameters of the convolution kernel are determined from the question features, forming a dynamic convolution kernel that depends on the question. The question features and the image features are therefore effectively fused when the convolution operation is performed on the image features, and the resulting fusion feature retains the spatial information of the target image while tightly integrating the question information.
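To make the weighted-average view of a convolution kernel concrete, the following minimal sketch slides a small kernel over a 2D grid without padding. As is common in convolutional networks, the kernel is not flipped (strictly speaking this is cross-correlation). The function name `conv2d_valid` and the example values are chosen for illustration only.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding) and return, for each
    position, the weighted sum of the pixels under the kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += kernel[i][j] * image[r + i][c + j]
            row.append(acc)
        output.append(row)
    return output


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
# Equal weights of 0.25 make each output pixel the mean of a 2x2 region.
mean_kernel = [[0.25, 0.25],
               [0.25, 0.25]]
result = conv2d_valid(image, mean_kernel)
```

A static kernel like `mean_kernel` is fixed in advance; a dynamic kernel in the sense of this disclosure would have its weights computed from the question features instead.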

In the embodiment of the present disclosure, the question features may be processed using a fully-connected neural network to obtain the dynamic convolution kernel; that is, the question features are passed through fully-connected layers to produce the kernel. The question features obtained by the second model may take the form of a one-dimensional matrix (a vector), and the dynamic convolution kernel obtained through the fully-connected layers may take the form of a multi-dimensional matrix, for example two-dimensional or three-dimensional. After the dynamic convolution kernel is determined, the convolution operation on the image features may be performed based on it, yielding a fusion feature in which the question information and the image information are fused. By classifying the fusion feature, answer information for the target question in the target image can be obtained.
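The mapping from a one-dimensional question feature to a two-dimensional kernel can be sketched with a single fully-connected layer followed by a reshape. The layer sizes, weights, and function names below (`fully_connected`, `reshape_kernel`, `dynamic_kernel`) are hypothetical; in practice the weights would be learned during training.

```python
def fully_connected(x, weights, bias):
    """One fully-connected layer: y = W x + b (one weight row per output)."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def reshape_kernel(flat, k):
    """Reshape a flat list of k*k values into a k x k kernel."""
    if len(flat) != k * k:
        raise ValueError("expected k*k values")
    return [flat[i * k:(i + 1) * k] for i in range(k)]


def dynamic_kernel(question_feature, weights, bias, k):
    """Generate a k x k convolution kernel from a question feature vector."""
    return reshape_kernel(fully_connected(question_feature, weights, bias), k)


# Hypothetical sizes: a 3-dimensional question feature and a 2x2 kernel.
weights = [[1, 0, 0],
           [0, 1, 0],
           [0, 0, 1],
           [1, 1, 1]]
bias = [0, 0, 0, 0]
kernel_a = dynamic_kernel([1, 0, 2], weights, bias, 2)
kernel_b = dynamic_kernel([0, 1, 0], weights, bias, 2)
```

Two different question features produce two different kernels from the same layer, which is what makes the kernel "dynamic".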

The fusion features can be input into a trained classifier or a classification module, and the classification model corresponding to the classification module is used for classifying and learning the fusion features, so that the answer of the target question is finally obtained. Wherein the classification model may comprise a deep learning neural network model, wherein the model may be a trained classification model with high recognition accuracy.

Based on the above, the answer to the target question in the target image can be identified. Because the convolution kernel used for the convolution operation is obtained from the question features, the question features and the image features are effectively fused during the convolution itself rather than through bilinear fusion, and the recognition accuracy, that is, the prediction accuracy, can be improved.

In order to clearly illustrate the embodiments of the present disclosure, the embodiments of the present disclosure are described in detail below, respectively.

As described above, after the dynamic convolution kernel is determined, the convolution operation on the image features can be performed with the dynamic convolution kernel. In the embodiment of the present disclosure, the dynamic convolution kernel and the image features may be input into a convolution module, which performs the convolution operation. The embodiment of the disclosure may include at least one convolution module. When one convolution module is included, that module performs the convolution operation on the image features based on the dynamic convolution kernel to obtain the fusion feature. When a plurality of convolution modules are included, they may be connected in sequence to form a linear connection structure, so that the convolution modules in the linear connection structure perform convolution operations one after another based on the dynamic convolution kernel to obtain the fusion feature.
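The linear connection structure can be sketched as a loop in which each module convolves the previous module's output with the same dynamic kernel. For brevity the sketch uses a one-dimensional convolution without padding as a stand-in for a full convolution module; `conv1d_valid` and `run_linear_chain` are hypothetical names.

```python
def conv1d_valid(signal, kernel):
    """1D convolution without padding (kernel not flipped)."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def run_linear_chain(image_feature, dynamic_kernel, n_modules):
    """Apply n_modules convolution modules in sequence.

    The first module receives the image feature; module i receives the
    output of module i-1; the last module's output is the fusion feature.
    """
    if n_modules < 1:
        raise ValueError("at least one convolution module is required")
    output = image_feature
    for _ in range(n_modules):
        output = conv1d_valid(output, dynamic_kernel)
    return output


fused = run_linear_chain([1, 2, 3, 4], [1, 1], n_modules=2)
```

With the kernel `[1, 1]`, the first module produces `[3, 5, 7]` and the second produces `[8, 12]`, so the question-derived kernel influences every stage of the chain.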

Fig. 2 illustrates a block diagram of a convolution module according to an embodiment of the present disclosure. The convolution module may include at least one convolution unit group, through which the corresponding convolution operation is performed. Fig. 2 shows 8 convolution unit groups; other embodiments may include any number of convolution unit groups, which is not limited by the present disclosure. Each convolution unit group may include at least one convolution unit (A1, A2, A3, …). Fig. 2 shows three convolution units per group, but the number of convolution units in other embodiments may be set as required and does not limit the embodiments of the present disclosure.

The convolution units within each convolution unit group may be connected in sequence. Input data fed to the convolution module may be divided into multiple groups of feature sub-data. The first convolution unit A1 of each group performs a convolution operation on the feature sub-data it receives and outputs the result to the second convolution unit A2; each subsequent unit performs a further convolution operation on the result of the previous unit, and the last unit yields the first feature data. The input data of a convolution module may be the image features or the output of a previous convolution module. The resulting first feature data may be used to determine the final fusion feature, or may serve as input data for the next convolution module. When an embodiment includes only one convolution module, its input data is the image features and the first feature data it outputs is the fusion feature. When multiple convolution modules are included, the input data of each convolution module may be the image features or the output result of the previous convolution module, and the first feature data output by each convolution module may be used to determine the fusion feature.

In the embodiment of the present disclosure, at least one convolution unit group in each convolution module performs the convolution operation using the dynamic convolution kernel. For example, only one convolution unit group, a part of the convolution unit groups, or all of the convolution unit groups may perform the convolution operation using the dynamic convolution kernel. In fig. 2, for example, the first convolution unit group performs the convolution operation using the dynamic convolution kernel, and the remaining convolution unit groups may perform the convolution operation using a preset convolution kernel. The preset convolution kernel may be a static convolution kernel, that is, a convolution kernel determined for the corresponding convolution unit group by back propagation, or may be a convolution kernel set in advance. With this configuration, at least a part of the convolution unit groups in the convolution module performs the convolution operation using the dynamic convolution kernel, so that the question feature and the image feature can be fused, while the large amount of computation caused by introducing the question feature through the dynamic convolution kernel is reduced and the operation speed is improved. Further, within a convolution unit group that uses the dynamic convolution kernel, at least one convolution unit may perform the convolution operation using the dynamic convolution kernel, which may further reduce the amount of computation and improve the operation accuracy.
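The saving from restricting the dynamic convolution kernel to some of the convolution unit groups can be illustrated with a rough count of the kernel weights that have to be derived from the question feature. The sketch below is only an illustration under stated assumptions: the 512 channels and 3-by-3 kernel size are hypothetical values not given in the disclosure, each group is assumed to map its share of the input channels to the same number of output channels, and the function name `dynamic_kernel_params` is made up for this example. Only the 8 groups are taken from fig. 2.

```python
def dynamic_kernel_params(channels, groups, dynamic_groups, kernel_size):
    """Count the kernel weights that must be predicted from the question feature.

    Assumption: every group maps channels/groups input channels to the same
    number of output channels with a square kernel of side kernel_size.
    """
    if channels % groups != 0:
        raise ValueError("channels must be divisible by groups")
    if not 0 <= dynamic_groups <= groups:
        raise ValueError("dynamic_groups must lie between 0 and groups")
    per_group = channels // groups
    return dynamic_groups * per_group * per_group * kernel_size * kernel_size


# Hypothetical sizes: 512 channels, 3x3 kernels; 8 groups as in fig. 2.
full = dynamic_kernel_params(512, 1, 1, 3)        # no grouping at all
all_groups = dynamic_kernel_params(512, 8, 8, 3)  # every group is dynamic
one_group = dynamic_kernel_params(512, 8, 1, 3)   # only the first group is dynamic
print(full, all_groups, one_group)  # 2359296 294912 36864
```

Under these assumed sizes, making only one of eight groups dynamic reduces the number of question-dependent weights by a factor of 64 compared with an ungrouped dynamic kernel. The count covers predicted weights only and is not a measurement of running time.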

In addition, as described above, the embodiment of the present disclosure may also include a plurality of convolution modules. The plurality of convolution modules may be connected in sequence to form a linear structure, and the convolution operation using the dynamic convolution kernel is performed by each convolution module in turn. The input data of the first convolution module may be the image feature, on which the convolution operation is performed using the dynamic convolution kernel to obtain a first output result; the second convolution module then performs the convolution operation on the first output result according to the dynamic convolution kernel, and so on, until the fusion feature is output by the last convolution module. Fig. 3 shows a flowchart of step S300 of the image-based question answering method according to an embodiment of the present disclosure, wherein step S300 may include:

S301: inputting the dynamic convolution kernel to each of N convolution modules, wherein the N convolution modules are connected in sequence to form a linear connection structure, and N is an integer greater than 1;

S302: performing, through the first convolution module in the linear connection structure, the convolution operation on the image feature based on the dynamic convolution kernel to obtain a first output result;

S303: performing, through the i-th convolution module in the linear connection structure, the convolution operation on the output result of the (i-1)-th convolution module based on the dynamic convolution kernel, and obtaining the fusion feature through the N-th convolution module, wherein i is an integer greater than 1 and less than or equal to N.
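The chaining described in steps S301 to S303 can be sketched as a simple loop. This is a minimal illustration rather than the disclosed implementation: it assumes 1-by-1 convolutions so that a kernel is just a matrix, it models each module as one static convolution followed by one dynamic convolution, and the names `pointwise_conv`, `make_module` and `run_linear_structure` are hypothetical.

```python
import numpy as np


def pointwise_conv(x, kernel):
    """1x1 convolution: x has shape (C_in, H, W), kernel has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", kernel, x)


def make_module(static_kernel):
    """Hypothetical module: a static 1x1 convolution followed by a dynamic one."""
    def module(x, dynamic_kernel):
        y = pointwise_conv(x, static_kernel)
        return pointwise_conv(y, dynamic_kernel)
    return module


def run_linear_structure(image_feat, dynamic_kernel, modules):
    """Feed the image feature through the modules in order (S302, then S303)."""
    out = image_feat
    for module in modules:
        # Every module receives the same dynamic kernel (S301).
        out = module(out, dynamic_kernel)
    return out  # output of the N-th module, i.e. the fusion feature


rng = np.random.default_rng(0)
C, H, W, N = 4, 5, 5, 3
image_feat = rng.standard_normal((C, H, W))
dynamic_kernel = rng.standard_normal((C, C))  # stands in for the question-derived kernel
static_kernels = [rng.standard_normal((C, C)) for _ in range(N)]
modules = [make_module(k) for k in static_kernels]
fused = run_linear_structure(image_feat, dynamic_kernel, modules)
print(fused.shape)
```

Here the random `dynamic_kernel` merely stands in for a kernel determined from the question feature; how that kernel is produced is outside the scope of this sketch.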

Fig. 4 is a block diagram illustrating convolution modules connected in sequence to form a linear connection structure according to an embodiment of the disclosure. The linear connection structure may include a plurality of convolution modules (QGHC modules) connected in sequence, such as the 3 convolution modules B, C and D. Other embodiments of the present disclosure may include other numbers of convolution modules, such as 1 convolution module or more than 1 convolution module.

The dynamic convolution kernel obtained in step S200 may be input into each of the convolution modules B, C and D, so that each convolution module performs the corresponding convolution operation according to the dynamic convolution kernel. The convolution module B may receive the image feature obtained in step S100 and perform the convolution operation on the image feature using the input dynamic convolution kernel to obtain a first output result. Within the convolution module B, at least one convolution unit group performs the convolution operation on the image feature using the dynamic convolution kernel, and the remaining convolution unit groups perform the convolution operation using a preset convolution kernel, the preset convolution kernel being unrelated to the question feature. The convolution module C may then perform the convolution operation on the first output result using the dynamic convolution kernel to obtain a second output result. Likewise, within the convolution module C, at least one convolution unit group performs the convolution operation on the first output result using the dynamic convolution kernel, and the remaining convolution unit groups use a preset convolution kernel that is unrelated to the question feature. Finally, the convolution module D may perform the convolution operation on the second output result using the dynamic convolution kernel to obtain the fusion feature. Within the convolution module D, at least one convolution unit group performs the convolution operation on the second output result using the dynamic convolution kernel, and the remaining convolution unit groups use a preset convolution kernel that is unrelated to the question feature.

Because the embodiment of the present disclosure performs multiple convolution operations through a plurality of convolution modules, the image feature and the question feature can be fully fused, which improves the accuracy of predicting the answer to the question. Meanwhile, in each convolution module only a part of the convolution units performs the convolution operation using the dynamic convolution kernel, which reduces the amount of computation and improves the operation speed.

As described above, each convolution module in the embodiment of the present disclosure includes at least one convolution unit group, and fig. 5 shows a flowchart of performing a convolution operation by each convolution module in the embodiment of the present disclosure, where performing a convolution operation by each convolution module may include:

S3001: dividing the input data input into a convolution module into at least one group of feature sub-data, wherein the number of groups of feature sub-data corresponds to the number of convolution unit groups, and the input data comprises the image feature or the output result of the convolution module preceding this convolution module;

S3002: allocating each group of feature sub-data to a corresponding convolution unit group, allocating the dynamic convolution kernel to at least one convolution unit group, and allocating a preset convolution kernel to each convolution unit group other than the at least one convolution unit group to which the dynamic convolution kernel is allocated;

S3003: performing, through each convolution unit group, the convolution operation on the input feature sub-data based on the allocated dynamic convolution kernel or preset convolution kernel.
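Steps S3001 to S3003 can be sketched as follows. This is a simplified illustration under stated assumptions: the input is split evenly along the channel axis, each convolution unit group is collapsed into a single 1-by-1 convolution instead of a chain of units, and the names `run_groups` and `dynamic_ids` are hypothetical.

```python
import numpy as np


def pointwise_conv(x, kernel):
    """1x1 convolution: x has shape (C_in, H, W), kernel has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", kernel, x)


def run_groups(x, dynamic_kernel, static_kernels, dynamic_ids):
    """Split x into groups, assign a kernel to each group, and convolve."""
    num_groups = len(static_kernels)
    # S3001: one group of feature sub-data per convolution unit group.
    sub_feats = np.split(x, num_groups, axis=0)
    # S3002: dynamic kernel for the chosen groups, preset kernel for the rest.
    kernels = [
        dynamic_kernel if g in dynamic_ids else static_kernels[g]
        for g in range(num_groups)
    ]
    # S3003: each group convolves its own sub-data with its allocated kernel.
    return [pointwise_conv(sub, k) for sub, k in zip(sub_feats, kernels)]


rng = np.random.default_rng(0)
groups, per_group, H, W = 4, 2, 3, 3
x = rng.standard_normal((groups * per_group, H, W))
dynamic_kernel = rng.standard_normal((per_group, per_group))
static_kernels = [rng.standard_normal((per_group, per_group)) for _ in range(groups)]
outputs = run_groups(x, dynamic_kernel, static_kernels, dynamic_ids={0})
print(len(outputs), outputs[0].shape)
```

In this example only group 0 uses the dynamic kernel; passing a larger set as `dynamic_ids` corresponds to the variants in which several or all groups are dynamic.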

That is, when a convolution module receives input data, the input data can be divided into as many groups of feature sub-data as there are convolution unit groups in the convolution module. Dividing the input data in this way reduces the dimension of the data handled by each group, which further reduces the data volume and simplifies the operation.

At least one convolution unit group in the convolution module performs the convolution operation using the dynamic convolution kernel, and within such a convolution unit group at least one convolution unit performs the convolution operation using the dynamic convolution kernel. The other convolution units or convolution unit groups perform the convolution operation using a preset convolution kernel, and the convolution result is finally obtained.

In addition, in order to enhance fusion between the problem feature and the image feature, the mixing of the corresponding results may be performed by the feature mixing unit while each convolution cell group performs the convolution operation. Fig. 5 shows a flowchart of step S3003 according to the embodiment of the present disclosure, which may further include:

S30031: selecting a first part of data from first feature data obtained by a convolution unit group that performs the convolution operation using the dynamic convolution kernel, wherein the first feature data is data generated by a convolution unit that performs the convolution operation using the dynamic convolution kernel;

S30032: replacing a second part of data of second feature data, obtained by a convolution unit group that performs the convolution operation using a preset convolution kernel, with the first part of data to obtain third feature data;

S30033: obtaining the output result of the corresponding convolution module based on the first feature data and the third feature data.

As shown in fig. 2, a feature mixing unit A4 may further be included between the second convolution unit A2 and the third convolution unit A3 of each convolution unit group. The result output by the second convolution unit A2 of the first convolution unit group is the first feature data, and the results output by the second and third convolution unit groups are the second feature data. It should be noted that, in the embodiment of the present disclosure, the results obtained by convolution unit groups that perform the convolution operation using the dynamic convolution kernel are collectively referred to as first feature data, that is, the first feature data is data generated by convolution units that perform the convolution operation using the dynamic convolution kernel. The results obtained by convolution unit groups that perform the convolution operation using a preset convolution kernel are collectively referred to as second feature data. The first feature data and the second feature data obtained by different convolution units may differ; the terms only distinguish the results of convolution with the dynamic convolution kernel from the results of convolution with the preset convolution kernel. To improve the degree of fusion between the question and the image feature, the feature mixing unit may select a first part of data from the first feature data obtained by the convolution unit A2 of the first convolution unit group, and use it to replace a part of the second feature data output by the second and third convolution unit groups, forming third feature data. In this way the remaining convolution unit groups, which perform the convolution operation using the preset convolution kernel, also carry features related to the question information, that is, the third feature data also includes features related to the question information. Then the third convolution unit A3 of each group performs the convolution operation on the first feature data or the third feature data respectively, so as to obtain the fusion feature.
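The replacement performed by the feature mixing unit (steps S30031 and S30032) can be sketched with array slicing. The choice of which part is copied is an assumption made here for illustration: the sketch takes the leading channels of the first feature data and overwrites the leading channels of the second feature data. The function name `mix_features` and the final concatenation used for S30033 are likewise hypothetical simplifications; in fig. 2 the mixed data would first pass through the third convolution unit.

```python
import numpy as np


def mix_features(first_feat, second_feat, num_channels):
    """Copy the leading channels of first_feat into a copy of second_feat."""
    first_part = first_feat[:num_channels]    # S30031: select the first part of data
    third_feat = second_feat.copy()           # leave the second feature data untouched
    third_feat[:num_channels] = first_part    # S30032: replace the second part of data
    return third_feat


# Ones mark question-aware data (dynamic kernel); zeros mark data from a preset kernel.
first_feat = np.ones((4, 2, 2))
second_feat = np.zeros((4, 2, 2))
third_feat = mix_features(first_feat, second_feat, num_channels=1)

# S30033, assuming the module output is the channel-wise concatenation.
module_output = np.concatenate([first_feat, third_feat], axis=0)
print(third_feat[:, 0, 0], module_output.shape)  # [1. 0. 0. 0.] (8, 2, 2)
```

After mixing, one channel of the statically convolved group carries data that depends on the question, which is how the groups using a preset convolution kernel come to include question-related features.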

In the embodiment of the present disclosure, the number of convolution units in the convolution unit groups of each convolution module may be the same or different. Steps S30032 to S30033 may be performed on convolution units of the same level, in which case at least one convolution unit of that level performs the convolution operation using the dynamic convolution kernel, so that part of the data containing the question feature can be substituted into the feature data output by the other convolution units. In other embodiments of the present disclosure, the steps may also be performed on convolution units of different levels; similarly, at least one of these convolution units performs the convolution operation using the dynamic convolution kernel, so that part of the data containing the question feature can be substituted into the feature data output by the other convolution units. After the partial data has been replaced, the subsequent convolution units continue the convolution operation, and the output result of the corresponding convolution module is finally obtained.

With this configuration, data that has been fused with the question feature can be substituted into the other convolution unit groups, which perform the convolution operation using static convolution kernels. Since not every convolution unit group has to perform the convolution operation with the dynamic convolution kernel, the operation speed is greatly improved, while feature fusion is still effectively achieved and the prediction accuracy is improved.

In addition, in the embodiment of the present disclosure, each convolution unit group in the convolution module may be assigned a dynamic convolution kernel, that is, each convolution unit group in the convolution module may perform a convolution operation on the corresponding input feature sub-data by using the dynamic convolution kernel. And at least one convolution unit in each convolution unit group performs a convolution operation using the dynamic convolution kernel. At this time, performing the convolution operation by each of the convolution modules may include:

As described above, each convolution unit group in each convolution module in the embodiment of the present disclosure may be assigned the dynamic convolution kernel. The feature data that each convolution unit group obtains by performing the convolution operation on its feature sub-data with the dynamic convolution kernel then contains both question information and image information, and the output result obtained from the feature data output by the convolution unit groups can fully fuse the question feature and the image feature. Performing the convolution operation on the input feature sub-data based on the allocated dynamic convolution kernel through each convolution unit group may include acquiring the feature data output by each convolution unit group, and obtaining the output result of the corresponding convolution module based on the feature data output by each convolution unit group. The image information and the question information are fused in the output result, and the fusion feature is finally obtained from the output result of the last convolution module.

Wherein obtaining the output result of the corresponding convolution module based on the feature data output by each convolution unit group may include connecting and combining the feature data output by each convolution unit group to form an output result, and may also include: and replacing a part of the data in the feature data output by one of the convolution unit groups with the feature data output by the other convolution unit groups, so as to form an output result by utilizing the combination of the replaced feature data, and finally obtaining the fusion feature through the output result obtained by the last convolution module. The problem information and the image information can be more effectively fused through the configuration, and the prediction precision of the answer is improved.

After the fusion features are obtained, the fusion features can be analyzed and identified by using a preset classification model corresponding to a classification module, so as to obtain answers for the target questions. The classification model may include a deep learning network model, which may perform classification training on the input fusion features to finally obtain answer information for the target question. Those skilled in the art can construct a classification model according to different settings or requirements, that is, the functional structure of the classification model can be realized by using the prior art means, and the embodiment of the present disclosure is not described in detail.
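Since the disclosure leaves the classification model open, the following is only a stand-in to show the shape of the step: a single linear layer followed by softmax over a fixed list of candidate answers. The layer, the hand-picked weights, the candidate answers and the name `predict_answer` are all assumptions for illustration; a trained deep learning network would take their place in practice.

```python
import numpy as np


def predict_answer(feature, weight, bias, answers):
    """Score each candidate answer and return the most probable one."""
    logits = weight @ feature + bias
    shifted = logits - logits.max()               # for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return answers[int(np.argmax(probs))], probs


answers = ["1", "2", "3"]                          # hypothetical candidate answers
weight = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]])  # hand-picked, not trained
bias = np.zeros(3)
answer, probs = predict_answer(np.array([1.0, 0.0]), weight, bias, answers)
print(answer)  # 2
```

With these weights the logits are 0, 2 and 1, so the second candidate receives the highest probability.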

In addition, before classification is performed on the fusion feature, the fusion feature can be weighted according to weights determined for it, which further improves the prediction accuracy.

Fig. 6 is a flowchart illustrating determining an answer to the target question in the target image based on the fused feature in the embodiment of the present disclosure, that is, step S300 in the embodiment of the present disclosure may include:

S304: obtaining an attention weight matrix corresponding to the fusion feature;

S305: performing a weighted summation of the corresponding features of the fusion feature based on the attention weight matrix;

S306: analyzing and identifying the weighted and summed fusion feature using a preset classification model to obtain an answer to the target question.

In the embodiment of the present disclosure, each parameter in the attention weight matrix represents the weight value of the corresponding parameter in the fusion feature. The fusion feature output by the convolution module may be added to the question feature at each of the different image regions in turn, and a 1-by-1 convolution kernel followed by a softmax operation is then used to map the addition result to the attention weight matrix. That is, the attention weight matrix in the embodiment of the present disclosure may be obtained from the addition of the fusion feature and the question feature. The question feature may be obtained by linear processing of the target question, or may be obtained through a recurrent neural network, which is not limited in this disclosure.

After the attention weight matrix is determined, weighted summation of the weights of the attention weight matrix and the fusion features can be performed, so that the optimized fusion features can be obtained. Then, the optimized fusion features can be analyzed and identified by using a classification model, and an answer to the target question is obtained. In this case, the weight of each parameter of the fusion feature is expressed by the attention weight matrix, and the fusion feature is optimized based on the attention weight matrix and then identified and classified, thereby further improving the prediction accuracy.
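The attention step can be sketched as follows. The sketch rests on several assumptions that the disclosure does not fix: the question feature has already been projected to the same channel count as the fusion feature, a single attention map is produced, the softmax runs over all spatial positions, and the weighted summation pools the fusion feature over those positions. The names `attention_pool` and `w_1x1` are hypothetical.

```python
import numpy as np


def attention_pool(fused, question, w_1x1):
    """fused: (C, H, W); question: (C,); w_1x1: (C,) weights of a 1x1 convolution."""
    # Add the question feature to the fusion feature at every image region.
    added = fused + question[:, None, None]
    # 1x1 convolution mapping C channels to a single score per region.
    scores = np.einsum("c,chw->hw", w_1x1, added)
    # Softmax over all regions gives the attention weight matrix.
    flat = scores.reshape(-1)
    flat = np.exp(flat - flat.max())
    attn = (flat / flat.sum()).reshape(scores.shape)
    # Weighted summation of the fusion feature with the attention weights.
    pooled = np.einsum("hw,chw->c", attn, fused)
    return attn, pooled


rng = np.random.default_rng(0)
C, H, W = 4, 3, 3
fused = rng.standard_normal((C, H, W))
question = rng.standard_normal(C)
w_1x1 = rng.standard_normal(C)
attn, pooled = attention_pool(fused, question, w_1x1)
print(attn.shape, pooled.shape)
```

The pooled vector plays the role of the optimized fusion feature that is passed on to the classification model.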

In order to more completely and clearly embody the embodiment of the present disclosure, the embodiment of the present disclosure is exemplified below with reference to fig. 7, and fig. 7 shows a process schematic diagram of a question answering processing method based on an image according to the embodiment of the present disclosure.

First, a target image and a target question are acquired, where the target question is "How many bicycles are included?". After the target image and the target question are obtained, the image feature corresponding to the target image can be obtained using the first model, and the question feature corresponding to the target question can be obtained using the second model. The dynamic convolution kernel is then obtained based on the question feature and input to each of the three convolution modules B, C and D. The convolution module B performs the convolution operation on the image feature based on the dynamic convolution kernel to obtain a first output result, the convolution module C performs the convolution operation on the first output result based on the dynamic convolution kernel to obtain a second output result, and the convolution module D performs the convolution operation on the second output result based on the dynamic convolution kernel to obtain the fusion feature. Each convolution module can also introduce a preset convolution kernel to reduce the amount of computation. After the fusion feature is obtained, an attention weight matrix can be obtained based on the fusion feature and the linearly processed question feature, an optimized fusion feature is obtained by the weighted summation of the attention weight matrix and the fusion feature, and the classification model then classifies the optimized fusion feature to obtain the answer "2" to the target question. The above is merely an exemplary illustration of the embodiments of the disclosure and is not to be construed as a limitation thereof.

In summary, in the embodiment of the present disclosure, after the image feature and the question feature are obtained, the dynamic convolution kernel is determined according to the question feature to perform the convolution operation of the image feature, and since the question feature is fused in the dynamic convolution kernel, the image and the question information can be closely fused in the generated fusion feature (visual feature), so that the accuracy of identifying the answer to the target question is improved. In addition, the convolution module in the embodiment of the disclosure can introduce a preset convolution kernel irrelevant to the problem, and can reduce the operation amount and save the operation time on the premise of ensuring the close fusion of the image and the problem.

It will be understood by those skilled in the art that in the method of the present invention, the order of writing the steps does not imply a strict order of execution and any limitations on the implementation, and the specific order of execution of the steps should be determined by their function and possible inherent logic.

Fig. 8 is a block diagram illustrating an image-based question answering processing apparatus according to an embodiment of the present disclosure, which may include, as illustrated in fig. 8:

a feature obtaining module 10 configured to obtain an image feature of a target image and obtain a question feature of a target question, wherein the target question is a question for the target image;

a feature fusion module 20 configured to determine a dynamic convolution kernel based on the problem feature and perform a convolution operation on the image feature based on the dynamic convolution kernel to obtain a fusion feature;

a determination module 30 configured to determine an answer in the target image to the target question based on the fused features.

In an embodiment of the present disclosure, the apparatus further includes:

In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.

Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a non-volatile computer readable storage medium.

An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to perform the above method.

Fig. 9 is a block diagram illustrating an electronic device 800 in accordance with an example embodiment. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like terminal.

Referring to fig. 9, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.

The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.

The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.

The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800, the relative positioning of components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.

Fig. 10 is a block diagram illustrating an electronic device 1900 according to an example embodiment. For example, the electronic device 1900 may be provided as a server. Referring to fig. 10, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.

The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.

In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.

The present disclosure may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.

The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.

The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.

The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), is personalized by utilizing the state information of the computer-readable program instructions, and the electronic circuitry can execute the computer-readable program instructions to implement aspects of the present disclosure.

Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the techniques in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. An image-based question and answer processing method, characterized by comprising the following steps:

determining a dynamic convolution kernel based on the problem feature, and performing a convolution operation on the image feature based on the dynamic convolution kernel to obtain a fused feature; wherein the performing a convolution operation on the image feature based on the dynamic convolution kernel to obtain a fused feature includes: performing the convolution operation on the image feature sequentially, according to the dynamic convolution kernel, by using convolution modules on a linear connection structure to obtain the fused feature, wherein a plurality of convolution modules are connected in sequence to form the linear connection structure, and the input data of each convolution module comprises the image feature or the output result of the convolution module preceding that convolution module;
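As an illustration only, the linear connection structure recited above can be sketched as a chain in which each convolution module consumes the output of the previous one. The sketch assumes 1x1 kernels (so each kernel is a plain C_out x C_in matrix applied at every spatial position) and treats the dynamic kernels as already given; the names `conv1x1` and `fuse` are hypothetical and do not come from the disclosure.

```python
from typing import List

Vector = List[float]
FeatureMap = List[Vector]   # one channel vector per spatial position
Kernel = List[Vector]       # assumed 1x1 kernel: a (C_out x C_in) matrix


def conv1x1(features: FeatureMap, kernel: Kernel) -> FeatureMap:
    """Apply a 1x1 convolution: the same linear map at every position."""
    return [
        [sum(w * x for w, x in zip(row, pixel)) for row in kernel]
        for pixel in features
    ]


def fuse(image_feature: FeatureMap, dynamic_kernels: List[Kernel]) -> FeatureMap:
    """Run the modules in sequence; the first takes the image feature,
    every later one takes the output of the module before it."""
    data = image_feature
    for kernel in dynamic_kernels:
        data = conv1x1(data, kernel)
    return data


# Two spatial positions with two channels each (toy values).
image_feature = [[1.0, 2.0], [3.0, -1.0]]
# One hypothetical dynamic kernel per module.
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [1.0, -1.0]],
]
fused = fuse(image_feature, kernels)
print(fused)
```

A practical network would typically add nonlinearities and normalization between modules; they are omitted here because the claim does not recite them.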

2. The method of claim 1, wherein the performing a convolution operation on the image feature based on the dynamic convolution kernel to obtain a fused feature comprises:

inputting the dynamic convolution kernels to N convolution modules respectively, wherein N is an integer greater than 1;

3. The method of claim 2, wherein each convolution module includes at least one convolution unit group, and wherein performing a convolution operation by each convolution module includes:

dividing the input data of a convolution module into at least one group of feature sub-data, wherein the number of groups of the feature sub-data corresponds to the number of the convolution unit groups;

4. The method of claim 3, wherein performing, by each convolution unit group, a convolution operation on the input feature sub-data based on the assigned dynamic convolution kernel or the preset convolution kernel comprises:

5. The method of claim 2, wherein each convolution module includes at least one convolution unit group, and wherein performing a convolution operation by each convolution module includes:

6. The method of claim 5, wherein performing, by each convolution unit group, a convolution operation on the input feature sub-data based on the assigned dynamic convolution kernel comprises:

acquiring the feature data output by each convolution unit group;

7. The method of claim 1, wherein the determining an answer to the target question in the target image based on the fused feature comprises:

8. The method of claim 1, wherein the determining an answer to the target question in the target image based on the fused feature comprises:

obtaining an attention weight matrix corresponding to the fusion feature;

9. The method of claim 7 or 8, wherein the preset classification model comprises a deep learning neural network model.

10. The method of claim 1, wherein determining the dynamic convolution kernel based on the problem feature comprises:

11. The method of claim 1, wherein the obtaining image features of a target image and obtaining problem features for a target problem of the target image comprises:

12. The method of claim 11, wherein the first model comprises a deep convolutional network model and the second model comprises a recurrent neural network model.

13. The method of claim 1, further comprising:

14. An image-based question-answering processing apparatus, comprising:

a feature fusion module configured to determine a dynamic convolution kernel based on the problem feature and perform a convolution operation on the image feature based on the dynamic convolution kernel to obtain a fused feature; the feature fusion module is further configured to perform the convolution operation on the image feature sequentially, according to the dynamic convolution kernel, by using convolution modules on a linear connection structure to obtain the fused feature, wherein a plurality of convolution modules are sequentially connected to form the linear connection structure, and the input data of each convolution module comprises the image feature or the output result of the convolution module preceding that convolution module;

15. The apparatus of claim 14, wherein the feature fusion module comprises N convolution modules, where N is an integer greater than 1; and is

16. The apparatus of claim 15, wherein each convolution module comprises at least one convolution unit group;

and the number of groups of the feature sub-data corresponds to the number of the convolution unit groups.
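The grouping described in the claims (input data divided into as many groups of feature sub-data as there are convolution unit groups, each group convolved with its assigned dynamic kernel or with a preset kernel) might be sketched as follows for a single spatial position. The equal, contiguous channel split, the 1x1 kernels, and the concatenation of group outputs are assumptions made for illustration; `split_channels`, `linear_unit`, and `grouped_module` are hypothetical names.

```python
from typing import Callable, List

Vector = List[float]
Unit = Callable[[Vector], Vector]


def split_channels(pixel: Vector, groups: int) -> List[Vector]:
    """Split a channel vector into `groups` equal, contiguous slices."""
    if len(pixel) % groups != 0:
        raise ValueError("channel count must be divisible by group count")
    size = len(pixel) // groups
    return [pixel[i * size:(i + 1) * size] for i in range(groups)]


def linear_unit(kernel: List[Vector]) -> Unit:
    """Build a unit group that applies an assumed 1x1 kernel (C_out x C_in)."""
    def apply(sub: Vector) -> Vector:
        return [sum(w * x for w, x in zip(row, sub)) for row in kernel]
    return apply


def grouped_module(pixel: Vector, units: List[Unit]) -> Vector:
    """Feed one group of sub-data to each unit group, then concatenate
    the group outputs (the concatenation is an assumption)."""
    subs = split_channels(pixel, len(units))
    out: Vector = []
    for unit, sub in zip(units, subs):
        out.extend(unit(sub))
    return out


dynamic = linear_unit([[2.0, 0.0], [0.0, 2.0]])   # stands in for a question-derived kernel
preset = linear_unit([[1.0, 1.0], [1.0, -1.0]])   # stands in for a fixed preset kernel
result = grouped_module([1.0, 2.0, 3.0, 4.0], [dynamic, preset])
print(result)
```

Mixing a dynamic unit group with a preset one, as in the usage lines, keeps part of the computation independent of the question while the rest is conditioned on it.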

17. The apparatus according to claim 16, wherein the feature fusion module is further configured to select a first portion of data from first feature data obtained from a group of convolution units that perform convolution operations with a dynamic convolution kernel, the first feature data being data generated by a convolution unit that performs convolution operations with the dynamic convolution kernel;

18. The apparatus of claim 15, wherein each convolution module comprises at least one convolution unit group;

19. The apparatus according to claim 18, wherein the feature fusion module is further configured to obtain an output result of the corresponding convolution module based on the feature data output by each convolution unit group.

20. The apparatus according to claim 14, wherein the determining module is further configured to perform an analysis recognition on the fused features by using a preset classification model to obtain an answer to the target question.

21. The apparatus according to claim 14, wherein the determining module is further configured to obtain an attention weight matrix corresponding to the fused features, perform weighted summation on the corresponding features of the fused features based on the attention weight matrix, and perform analysis and identification on the fused features after weighted summation by using a preset classification model to obtain an answer to the target question.
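A minimal sketch of the weighted summation in claim 21 follows. The claim does not state how the attention weight matrix is obtained, so the sketch assumes that one raw score per spatial position is already available and normalizes the scores with a softmax; `softmax` and `attend` are hypothetical names, and the classification model that would consume the pooled vector is omitted.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(fused: List[Vector], scores: Vector) -> Vector:
    """Weighted sum of the per-position fused features."""
    weights = softmax(scores)
    channels = len(fused[0])
    return [
        sum(w * pixel[c] for w, pixel in zip(weights, fused))
        for c in range(channels)
    ]


# Two spatial positions, two channels; equal scores give equal weights.
fused = [[1.0, 0.0], [3.0, 2.0]]
pooled = attend(fused, [0.0, 0.0])
print(pooled)
```

With equal scores the result is simply the average over positions; unequal scores shift the pooled vector toward the positions that receive more attention.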

22. The apparatus of claim 20 or 21, wherein the preset classification model comprises a deep learning neural network model.

23. The apparatus of claim 14, wherein the feature fusion module is further configured to process the problem feature using a fully-connected neural network to obtain the dynamic convolution kernel.
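One way to picture claim 23 is a single fully-connected layer that maps the problem feature vector to the flattened entries of a kernel, which is then reshaped. The single-layer depth, the 1x1 kernel shape, and all weight values below are assumptions chosen for illustration; `dense` and `make_dynamic_kernel` are hypothetical names.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def dense(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Fully-connected layer: y = W x + b."""
    return [
        sum(w * v for w, v in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def make_dynamic_kernel(question_feature: Vector, weights: Matrix,
                        bias: Vector, c_out: int, c_in: int) -> Matrix:
    """Map the problem feature to c_out * c_in values and reshape them
    into an assumed 1x1 kernel of shape (c_out x c_in)."""
    flat = dense(question_feature, weights, bias)
    if len(flat) != c_out * c_in:
        raise ValueError("layer output size must equal c_out * c_in")
    return [flat[r * c_in:(r + 1) * c_in] for r in range(c_out)]


# Toy problem feature and hypothetical layer parameters.
question_feature = [1.0, -1.0]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
bias = [0.0, 0.0, 0.5, 0.0]
kernel = make_dynamic_kernel(question_feature, weights, bias, c_out=2, c_in=2)
print(kernel)
```

Because the kernel is computed from the problem feature, two different questions about the same image yield different kernels and therefore different fused features.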

24. The apparatus of claim 14, wherein the feature extraction module is further configured to extract image features of the target image using a first model and extract problem features in the target problem using a second model.

25. The apparatus of claim 24, wherein the first model comprises a deep convolutional network model and the second model comprises a recurrent neural network model.

26. The apparatus of claim 14, further comprising:

27. An electronic device, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to perform the method of any one of claims 1 to 13.

28. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 13.