CN113095331A

CN113095331A - Visual question answering method, system and equipment for appearance defects of electric equipment and storage medium thereof

Info

Publication number: CN113095331A
Application number: CN202110436318.4A
Authority: CN
Inventors: 赵冲; 沈奥; 卫星; 葛久松; 韩知渊; 帅竞贤; 陆阳; 侯宝华; 康旭
Original assignee: Hefei University of Technology
Current assignee: Hefei University of Technology
Priority date: 2021-04-22
Filing date: 2021-04-22
Publication date: 2021-07-09

Abstract

The invention provides a visual question answering method for appearance defects of electric equipment, comprising the steps of: acquiring an image of appearance defects of electric equipment, and preprocessing the image; extracting features from the processed image to obtain image features; obtaining question information related to the image, and performing feature extraction on the question information to obtain text features; fusing the image features and the text features based on a bilinear pooling network to obtain multimodal features; establishing and training a visual question answering model to obtain the trained visual question answering model; and inputting the multimodal features into the trained visual question answering model to obtain a visual question answering result. The method can screen and evaluate appearance defects of electric equipment through images, so that the running state of the electric equipment can be monitored in real time.

Description

A visual question answering method, system, device and storage medium for appearance defects of electrical equipment

Technical Field

The present invention relates to the technical field of visual question answering, and in particular to a visual question answering method, system, device and storage medium for appearance defects of electrical equipment.

Background

In recent years, with the continuous expansion of China's industrial and technology sectors and of its urban areas, civil, commercial and military demand for electric power has grown day by day, and the power grid has been continuously upgraded and expanded to meet that demand. At present, equipment coverage in China is very large and the number of substation assets is huge: power stations, conductors, towers and the like have increased greatly in number and are widely distributed over long distances. Such long transmission lines are almost entirely outdoors and are very vulnerable to bad weather, animal and plant intrusion, floating debris and similar factors, which lead to equipment failure, aging and damage. On the one hand, experienced operation and maintenance personnel are needed to identify defects; on the other hand, relying solely on manual work makes it difficult to obtain the defect status of equipment in real time or preventively. Manual inspection also incurs large labor and maintenance costs and threatens the safety of frontline workers. Therefore, to improve the safety of personnel and equipment in unattended or minimally attended power stations, the operating status and hidden faults of substation equipment need to be monitored in real time.

With the increasing intelligence of power stations and the gradual spread of intelligent operation and inspection technology, artificial intelligence techniques such as image recognition and text mining have begun to be used to assist in judging equipment health. At present, power station operation and maintenance personnel usually use hand-held devices such as cameras and video cameras to capture images of the main electrical equipment in the station; in addition, a large number of inspection robots and distributed cameras have been deployed in power stations for image acquisition and on-site monitoring. Visual images can reveal equipment defects, which typically include oil leakage, rust, circuit breaker open/closed state, and damaged components. However, work on appearance defects of electrical equipment currently remains at the stage of perceptual intelligence, in which the computer has abilities similar to human vision and hearing: what is heard corresponds to speech recognition, and what is seen corresponds to image classification, detection and semantic segmentation. It has not yet risen to the stage of cognitive intelligence, which emphasizes knowledge and reasoning and requires machines to understand and think. Cognitive intelligence involves semantic understanding, knowledge representation, associative reasoning, intelligent question answering, autonomous learning and so on. It is related to human language, knowledge and logic, and is the highest stage of artificial intelligence. Yet visual question answering, one application of cognitive intelligence, is still almost unexplored in research on appearance defects of electrical equipment.

VQA (Visual Question Answering) means that, given a picture and a natural language question about that picture, the computer can produce a correct answer. This is a typical multimodal problem that combines CV (Computer Vision) and NLP (Natural Language Processing) techniques: the computer must learn to understand images and text at the same time. For this reason, the concept of VQA was not formally proposed until 2015, when the related technologies achieved breakthrough progress. Although visual question answering has developed greatly in the past two years, two major problems remain. The first is insufficient training data: although existing visual question answering datasets have reached a scale of about 1 million, this is still far smaller than the datasets for traditional tasks such as image classification and object detection; moreover, visual question answering models often have tens of millions of parameters, so a small amount of training data often fails to bring out the full performance of the model. The second is that the answers provided by the machine are not interpretable: because deep learning models are black boxes, the reasons behind the machine's answer are often difficult to give. As a result, there is still very little research on the promotion and use of visual question answering in industry.

Therefore, to solve the problems in the prior art that heavy manual operation causes high rates of false detection and missed detection, and that traditional inspection equipment cannot evaluate and analyze hidden dangers of transmission lines and power transmission and transformation equipment, the present invention provides a visual question answering method, system, device and storage medium for appearance defects of electrical equipment.

Summary of the Invention

In view of the above shortcomings of the prior art, the purpose of the present invention is to provide a visual question answering method, system, device and storage medium for appearance defects of electrical equipment, to solve the problems in the prior art that heavy manual operation causes high rates of false detection and missed detection, and that traditional inspection equipment cannot evaluate and analyze hidden dangers of transmission lines and power transmission and transformation equipment.

To achieve the above and other related purposes, the present invention provides a visual question answering method for appearance defects of electrical equipment, comprising the steps of:

acquiring an image of an appearance defect of electrical equipment, and preprocessing the image;

performing feature extraction on the processed image to obtain image features;

obtaining, according to the image, question information related to the image, and performing feature extraction on the question information to obtain text features;

fusing the image features and the text features based on a bilinear pooling network to obtain multimodal features;

establishing a visual question answering model and training it to obtain the trained visual question answering model;

inputting the multimodal features into the trained visual question answering model to obtain a visual question answering result.

In an embodiment of the present invention, preprocessing the image includes the steps of:

screening the images to remove images that do not contain detection objects;

performing dark channel dehazing and adaptive gamma correction on the screened image to improve the clarity and contrast of the image;

performing position transformation and color transformation on the processed image to enrich the information of the image;

cropping the transformed image so that its size meets the size requirement for feature extraction;

labeling the cropped image to obtain a labeled image;

manually inspecting the labeled image to remove abnormal data from the labeled image.

In an embodiment of the present invention, performing feature extraction on the processed image to obtain image features includes the steps of:

detecting the processed image with an object detection model based on a convolutional neural network to identify objects in the image;

performing feature extraction on the image according to the identified objects to obtain the image features.

In an embodiment of the present invention, performing feature extraction on the question information to obtain text features includes the steps of:

vectorizing a corpus to obtain word vectors;

vectorizing the question information, according to the word vectors, through the text vectorization layer of an LSTM-Attention neural network to obtain a sentence sequence;

inputting the sentence sequence into an LSTM network, and semantically encoding the sentence sequence through the LSTM network to obtain hidden-layer output information, expressed as:

h_it = LSTM(x_it), t ∈ [1, m]

where m is the number of words in the sentence sequence, x_it is the word input into the LSTM network at time t, and h_it is the hidden-layer output information output by the LSTM network at time t;

performing a nonlinear transformation and a normalization operation on the hidden-layer output information through the word Attention layer of the LSTM-Attention neural network to obtain the weight coefficients of the word-level hidden-layer output, and obtaining a word attention matrix according to the weight coefficients, where α_it is the weight coefficient of the word-level hidden-layer output and s_i is the sentence vector of the word attention matrix;

inputting the sentence vector into the LSTM network, the LSTM network updating the sentence vector according to the hidden-layer output information of the previous time step to obtain the updated sentence vector;

performing a nonlinear transformation and a normalization operation on the updated sentence vector through the sentence Attention layer of the LSTM-Attention neural network to obtain the weight coefficients of the sentence-level hidden-layer output, and obtaining a sentence attention matrix according to the weight coefficients, where α_i is the weight coefficient of the sentence-level hidden-layer output and h_i is the updated sentence vector;

performing feature extraction on the question information according to the word attention matrix and the sentence attention matrix to obtain the text features.
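The word Attention layer is described above only in words (a nonlinear transformation, a normalization, then a weighted combination), so the following is a minimal sketch of one common way to realize it, assuming a tanh projection followed by a softmax. The parameters `W`, `b` and `u` and the function name `word_attention` are hypothetical and do not come from the text; only α_it, h_it and s_i do.

```python
import numpy as np

def word_attention(H, W, b, u):
    """Attention pooling over the hidden states of one sentence.

    H: (m, d) hidden states h_it for t = 1..m (from the LSTM)
    W: (d, d), b: (d,), u: (d,)  hypothetical learned parameters
    Returns (alpha, s): weights alpha_it of shape (m,) and
    sentence vector s_i of shape (d,).
    """
    scores = np.tanh(H @ W + b) @ u        # nonlinear transformation, shape (m,)
    scores = scores - scores.max()         # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # normalization (softmax)
    s = alpha @ H                          # weighted sum of the hidden states
    return alpha, s

# Toy usage with random values (assumed sizes: 5 words, 8 hidden units).
rng = np.random.default_rng(0)
m, d = 5, 8
H = rng.normal(size=(m, d))
alpha, s = word_attention(H, rng.normal(size=(d, d)), np.zeros(d), rng.normal(size=d))
```

The sentence Attention layer can be sketched the same way by passing the updated sentence vectors h_i in place of the word-level hidden states.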

In an embodiment of the present invention, inputting the sentence sequence into the LSTM network and semantically encoding the sentence sequence through the LSTM network to obtain the hidden-layer output information further includes the steps of:

using the sentence sequence as the input node data of the LSTM network;

learning the context of the sentence sequence through the LSTM network to obtain the context information of the sentence sequence;

semantically encoding the sentence sequence according to the context information.

In an embodiment of the present invention, fusing the image features and the text features based on a bilinear pooling network to obtain multimodal features includes the steps of:

mapping the image features and the text features to the same feature space, so that the dimension of the image features matches the dimension of the text features;

inputting the dimension-matched image features and text features into the bilinear pooling network to output the fused multimodal features, expressed as:

q = Fusion(m, n)

where m is the image feature, n is the text feature, Fusion denotes feature fusion, and q is the multimodal feature.
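The text does not say which bilinear pooling variant Fusion uses. As an illustration only, the sketch below assumes a low-rank (Hadamard-product) approximation of bilinear pooling: both inputs are projected into a shared space, multiplied element-wise, pooled over image regions, and projected to the output. The projection matrices `U`, `V`, `P` and all sizes other than 2048 are hypothetical.

```python
import numpy as np

def fusion(m, n, U, V, P):
    """q = Fusion(m, n) via low-rank bilinear pooling (an assumed variant).

    m: (K, 2048) image features, one row per detected region
    n: (d_t,)    text feature of the question
    U: (2048, r), V: (d_t, r), P: (r, d_q)  hypothetical learned projections
    """
    img = np.tanh(m @ U)         # (K, r): image regions mapped to the shared space
    txt = np.tanh(n @ V)         # (r,):   text mapped to the same space
    joint = img * txt            # (K, r): element-wise interaction per region
    pooled = joint.sum(axis=0)   # (r,):   sum-pool over the K regions
    return pooled @ P            # (d_q,): multimodal feature q

# Toy usage with random values; sizes other than 2048 are assumptions.
rng = np.random.default_rng(0)
K, d_t, r, d_q = 36, 512, 64, 128
m = rng.normal(size=(K, 2048))
n = rng.normal(size=d_t)
q = fusion(m, n,
           rng.normal(size=(2048, r)) * 0.01,
           rng.normal(size=(d_t, r)) * 0.01,
           rng.normal(size=(r, d_q)))
```

The first two projections correspond to the step of mapping both features into the same feature space so that their dimensions match before they are combined.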

In an embodiment of the present invention, inputting the multimodal features into the trained visual question answering model to obtain a visual question answering result includes the steps of:

inputting the multimodal features into the trained visual question answering model;

determining candidate answers according to the multimodal features, the candidate answers forming a candidate domain;

predicting the probability distribution of the correct answer over the candidate domain;

taking, according to the probability distribution, the candidate answer with the maximum probability as the visual question answering result.
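The answer selection steps above can be sketched as a softmax over per-candidate scores followed by an argmax. This is a minimal illustration under the assumption that the model emits one raw score per candidate; the candidate strings and score values are made up for the example.

```python
import math

def predict_answer(scores, candidates):
    """Turn raw scores into a probability distribution over the
    candidate domain and return the most probable candidate."""
    top = max(scores)
    exps = [math.exp(z - top) for z in scores]   # shifted for stability
    total = sum(exps)
    probs = [e / total for e in exps]            # probability distribution
    best = max(range(len(probs)), key=probs.__getitem__)
    return candidates[best], probs

# Hypothetical candidate domain and model scores.
candidates = ["oil leak", "rust", "no defect"]
answer, probs = predict_answer([0.2, 2.1, -0.5], candidates)
print(answer)  # rust
```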

The present invention also provides a visual question answering system for appearance defects of electrical equipment, including:

an image acquisition module, configured to acquire an image of an appearance defect of electrical equipment and to preprocess the image;

an image feature extraction module, configured to perform feature extraction on the processed image to obtain image features;

a text feature extraction module, configured to obtain, according to the image, question information related to the image, and to perform feature extraction on the question information to obtain text features;

a feature fusion module, configured to fuse the image features and the text features based on a bilinear pooling network to obtain multimodal features;

a model building module, configured to establish a visual question answering model and train it to obtain the trained visual question answering model;

a visual question answering module, configured to input the multimodal features into the trained visual question answering model to obtain a visual question answering result.

In an embodiment of the present invention, a visual question answering device for appearance defects of electrical equipment includes a processor coupled to a memory, the memory storing program instructions; when the program instructions stored in the memory are executed by the processor, the visual question answering method for appearance defects of electrical equipment described above is implemented.

In an embodiment of the present invention, a computer-readable storage medium includes a program which, when run on a computer, causes the computer to execute the visual question answering method for appearance defects of electrical equipment described above.

As described above, in the visual question answering method, system, device and storage medium for appearance defects of electrical equipment provided by the present invention, image features are extracted from labeled images of electrical equipment scenes and fused with the text features of the question information to obtain the corresponding answer prediction, giving good question answering performance. In addition, the method allows the operating state of substation equipment to be monitored in real time, which helps prevent false detection and missed detection and improves work efficiency.

Description of the Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flowchart of a visual question answering method for appearance defects of electrical equipment provided by the present invention.

FIG. 2 is a schematic flowchart of step S1 in an embodiment of the present invention.

FIG. 3 is a schematic flowchart of step S3 in an embodiment of the present invention.

FIG. 4 is a schematic flowchart of step S33 in an embodiment of the present invention.

FIG. 5 is a schematic flowchart of step S6 in an embodiment of the present invention.

FIG. 6 is a schematic structural diagram of a visual question answering system for appearance defects of electrical equipment provided by the present invention.

Description of reference numerals

11 Image acquisition module            14 Feature fusion module
12 Image feature extraction module     15 Model building module
13 Text feature extraction module      16 Visual question answering module

Detailed Description

The embodiments of the present invention are described below through specific examples, and those skilled in the art can readily understand other advantages and effects of the present invention from the contents disclosed in this specification. The present invention can also be implemented or applied through other specific embodiments, and the details in this specification can be modified or changed from different viewpoints and for different applications without departing from the spirit of the present invention. It should be noted that the following embodiments and the features in the embodiments may be combined with each other where they do not conflict.

It should also be noted that the drawings provided with the following embodiments illustrate the basic concept of the present invention only schematically. They show only the components related to the present invention and are not drawn according to the number, shape and size of the components in an actual implementation; in an actual implementation the form, quantity and proportion of each component may be changed freely, and the component layout may be more complicated.

As shown in FIG. 1, the present invention provides a visual question answering method for appearance defects of electrical equipment, comprising the steps of:

S1. acquiring an image of an appearance defect of electrical equipment, and preprocessing the image;

S2. performing feature extraction on the processed image to obtain image features;

S3. obtaining, according to the image, question information related to the image, and performing feature extraction on the question information to obtain text features;

S4. fusing the image features and the text features based on a bilinear pooling network to obtain multimodal features;

S5. establishing a visual question answering model and training it to obtain the trained visual question answering model;

S6. inputting the multimodal features into the trained visual question answering model to obtain a visual question answering result.

As shown in FIG. 2, step S1 further includes:

S11. screening the images to remove images that do not contain detection objects;

S12. performing dark channel dehazing and adaptive gamma correction on the screened image to improve the clarity and contrast of the image;

S13. performing position transformation and color transformation on the processed image to enrich the information of the image;

S14. cropping the transformed image so that its size meets the size requirement for feature extraction;

S15. labeling the cropped image to obtain a labeled image;

S16. manually inspecting the labeled image to remove abnormal data from the labeled image.

In an embodiment of the present invention, in step S11 the images are screened to remove images that do not contain detection objects, where the detection objects include glass insulators, composite insulators, conductors, suspension clamps, vibration dampers, towers, bolts, line corridors, and other related foreign objects and defects.

In an embodiment of the present invention, in step S12 a dehazing algorithm is used to perform dark channel dehazing on hazy images to improve their clarity, and adaptive gamma correction is applied to poorly exposed images to improve their contrast, thereby improving the accuracy of subsequent image segmentation and information quantification.
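Step S12 names two operations without giving formulas, so the sketch below makes its own assumptions. `dark_channel` computes only the dark channel map used by dark-channel-prior dehazing (the atmospheric light and transmission estimation of a full dehazing pipeline are not shown), and `adaptive_gamma` uses one simple heuristic, choosing gamma so that the mean brightness maps to 0.5, which the text does not specify. Images are assumed to be float arrays in [0, 1].

```python
import numpy as np

def dark_channel(img, patch=3):
    """Dark channel of an RGB image: minimum over the color channels,
    then minimum over a patch x patch neighborhood (patch assumed odd).
    img: float array of shape (H, W, 3) with values in [0, 1]."""
    min_rgb = img.min(axis=2)
    r = patch // 2
    padded = np.pad(min_rgb, r, mode="edge")
    h, w = min_rgb.shape
    out = np.full((h, w), np.inf)
    for dy in range(patch):              # sliding-window minimum
        for dx in range(patch):
            out = np.minimum(out, padded[dy:dy + h, dx:dx + w])
    return out

def adaptive_gamma(img, eps=1e-6):
    """Heuristic (assumed): choose gamma so that mean ** gamma == 0.5,
    which brightens dark images and darkens bright ones."""
    mean = float(np.clip(img.mean(), eps, 1.0 - eps))
    gamma = np.log(0.5) / np.log(mean)
    return np.clip(img, 0.0, 1.0) ** gamma, gamma

# Toy usage: a white image with a single dark-ish pixel.
img = np.ones((5, 5, 3))
img[2, 2] = [0.1, 0.5, 0.9]
dc = dark_channel(img, patch=3)

# A uniformly dark image is lifted to mid-gray.
dark = np.full((4, 4, 3), 0.25)
corrected, gamma = adaptive_gamma(dark)
```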

In an embodiment of the present invention, in step S13 position transformation and color transformation are applied to the image, while keeping its basic features unchanged, to enrich the information of the image. The position transformation includes rotation, translation and horizontal mirroring. Rotation means taking any point in the image as the rotation center and rotating all pixels of the image by the same angle, with positions left without pixels filled with zeros; because the images contain continuous targets such as conductors and insulators, rotating the image increases the angular diversity of the targets without changing their features. Translation means moving all pixels by random steps in the vertical and horizontal directions to increase the diversity of target positions. Horizontal mirroring means swapping the pixels on the two sides of the image about its central axis, which increases the diversity of shooting directions. In this embodiment, the color transformation includes color jitter, contrast adjustment, brightness adjustment and saturation adjustment. Color jitter means randomly selecting one of the three RGB channels of the image and shifting its values up or down; in this embodiment the shift does not exceed 20, for example 16. The contrast, brightness and saturation adjustments mainly simulate changes in sunlight intensity in real scenes.
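A few of the transformations in step S13 can be sketched with plain array operations. This is an illustrative sketch, not the patented implementation: rotation is omitted because it needs interpolation, the function names are hypothetical, and images are assumed to be `uint8` arrays of shape (H, W, 3). The default shift of 16 follows the example value in the text.

```python
import numpy as np

def horizontal_mirror(img):
    """Swap pixels about the vertical central axis."""
    return img[:, ::-1]

def translate(img, dy, dx):
    """Shift the image by (dy, dx) pixels; vacated positions are zero-filled."""
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    src_y = slice(max(0, -dy), min(h, max(0, h - dy)))
    dst_y = slice(max(0, dy), min(h, max(0, h + dy)))
    src_x = slice(max(0, -dx), min(w, max(0, w - dx)))
    dst_x = slice(max(0, dx), min(w, max(0, w + dx)))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out

def color_jitter(img, rng, max_shift=16):
    """Shift one randomly chosen RGB channel by an integer offset
    drawn from [-max_shift, max_shift], clipping to the uint8 range."""
    out = img.astype(np.int16)
    channel = int(rng.integers(0, 3))
    shift = int(rng.integers(-max_shift, max_shift + 1))
    out[..., channel] += shift
    return np.clip(out, 0, 255).astype(np.uint8)

# Toy usage on a small synthetic image.
rng = np.random.default_rng(0)
img = np.arange(4 * 5 * 3, dtype=np.uint8).reshape(4, 5, 3)
mirrored = horizontal_mirror(img)
shifted = translate(img, 1, 0)
gray = np.full((4, 5, 3), 128, dtype=np.uint8)
jittered = color_jitter(gray, rng)
```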

In an embodiment of the present invention, step S2 further includes:

S21. detecting the processed image with an object detection model based on a convolutional neural network to identify objects in the image;

S22. performing feature extraction on the image according to the identified objects to obtain image features.

In an embodiment of the present invention, Faster-RCNN, an object detection model based on a convolutional neural network, is used to identify objects in the image, and feature extraction is performed on the image according to the identified objects to obtain the image features, expressed as:

V^h = RCNN(I)    (1)

where I is the image, RCNN denotes detection and recognition on the image, and V^h is the extracted image feature.

In this embodiment, the extracted image features are represented as a K×2048 matrix, meaning that each image yields K vectors, each of dimension 2048.

As shown in FIG. 3, step S3 further includes:

S31. vectorizing a corpus to obtain word vectors;

S32. vectorizing the question information, according to the word vectors, through the text vectorization layer of the LSTM-Attention neural network to obtain a sentence sequence;

S33. inputting the sentence sequence into the LSTM network, and semantically encoding the sentence sequence through the LSTM network to obtain hidden-layer output information;

S34. performing a nonlinear transformation and a normalization operation on the hidden-layer output information through the word Attention layer of the LSTM-Attention neural network to obtain the weight coefficients of the word-level hidden-layer output, and obtaining a word attention matrix according to the weight coefficients;

S35. inputting the sentence vector of the word attention matrix into the LSTM network, the LSTM network updating the sentence vector according to the hidden-layer output information of the previous time step to obtain the updated sentence vector;

S36. performing a nonlinear transformation and a normalization operation on the updated sentence vector through the sentence Attention layer of the LSTM-Attention neural network to obtain the weight coefficients of the sentence-level hidden-layer output, and obtaining a sentence attention matrix according to the weight coefficients;

S37. performing feature extraction on the question information according to the word attention matrix and the sentence attention matrix to obtain text features.

In an embodiment of the present invention, in step S31 the unsupervised Skip-Gram model of word2vec is used to vectorize the corpus to obtain word vectors. The question information is then vectorized, according to the word vectors, through the text vectorization layer of the LSTM-Attention neural network.
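As background for step S31, Skip-Gram learns word vectors by predicting the context words around each center word. The sketch below shows only how the (center, context) training pairs are formed; the embedding training itself is not shown, and the function name, window size and example question are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return the (center, context) pairs that a Skip-Gram model
    would be trained on for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                      # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical tokenized question about an inspection image.
tokens = ["is", "the", "insulator", "damaged"]
pairs = skipgram_pairs(tokens, window=1)
print(len(pairs))  # 6
```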

As shown in Figure 4, step S33 further includes:

S331. Use the sentence sequence as the input node data of the LSTM network;

S332. Learn the preceding and following context of the sentence sequence through the LSTM network, so as to obtain the contextual information between sentence sequences;

S333. Semantically encode the sentence sequence according to that contextual information.

In an embodiment of the present invention, the sentence sequence $(x_{i1}, x_{i2}, \ldots, x_{it})$ is first used as the input node data of the LSTM network, where $x_{i1}, x_{i2}, \ldots, x_{it}$ are the individual words of the sentence sequence. The LSTM network then learns the preceding and following context of the sentence sequence from this input and semantically encodes the sequence according to that context. At each time step the LSTM network outputs the hidden layer output information of the corresponding node, expressed as:

$$h_{it} = \mathrm{LSTM}(x_{it}), \quad t \in [1, m] \tag{2}$$

where $m$ is the number of words in the sentence sequence, $x_{it}$ is the word input to the LSTM network at time $t$, and $h_{it}$ is the hidden layer output information produced by the LSTM network at time $t$.
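Formula (2) can be made concrete with a small sketch of a standard LSTM cell that is applied word by word and returns one hidden vector per time step. Everything below is an assumption for illustration: the gate layout is the textbook LSTM, the weights are random stand-ins for trained parameters, and the names `lstm_step`, `encode` and `init_params` are hypothetical and do not come from the patent.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state and cell state."""
    z = x + h_prev  # list concatenation: [x_t ; h_{t-1}]
    gates = {}
    for name in ("i", "f", "o", "g"):
        W, b = params[name]
        pre = [a + bias for a, bias in zip(matvec(W, z), b)]
        act = math.tanh if name == "g" else sigmoid
        gates[name] = [act(p) for p in pre]
    # Cell state: forget part of the old state, add the gated candidate.
    c = [f * cp + i * g
         for f, cp, i, g in zip(gates["f"], c_prev, gates["i"], gates["g"])]
    h = [o * math.tanh(ci) for o, ci in zip(gates["o"], c)]
    return h, c


def encode(sequence, params, hidden_size):
    """Return the hidden vectors h_1..h_m for word vectors x_1..x_m."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    outputs = []
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
        outputs.append(h)
    return outputs


def init_params(input_size, hidden_size, seed=0):
    """Random placeholder weights; a real model would learn these."""
    rng = random.Random(seed)

    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(input_size + hidden_size)]
                for _ in range(hidden_size)]

    return {name: (mat(), [0.0] * hidden_size) for name in "ifog"}


# Three hypothetical 3-dimensional word vectors (m = 3).
words = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]
params = init_params(input_size=3, hidden_size=2)
hidden_states = encode(words, params, hidden_size=2)
print(hidden_states)
```

Each hidden vector depends on the current word and on the state carried over from earlier words, which is how the network captures the context referred to in steps S331 to S333.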

In an embodiment of the present invention, for step S34, each word contributes differently to the question information. To extract the features of the important words, the word Attention layer of the LSTM-Attention neural network applies a nonlinear transformation and a normalization operation to the hidden layer output information to obtain the weight coefficients of the word-level hidden layer output, expressed as:

$$u_{it} = \tanh(W_w h_{it} + b_w) \tag{3}$$

(Formula (4), the normalization that produces $\alpha_{it}$, is not reproduced in this text.)

where $\tanh$ is the nonlinear function, $W_w$ is the weight of the hidden layer of the LSTM network, $b_w$ is the bias of the hidden layer of the LSTM network, $u_{it}$ is the hidden representation obtained by nonlinearly transforming the hidden layer output information $h_{it}$, $u_w$ is the randomly initialized attention mechanism matrix, and $\alpha_{it}$ is the weight coefficient of the word-level hidden layer output.

The word attention mechanism matrix is then obtained from the word-level weight coefficients and the hidden layer output information, expressed as:

$$s_i = \sum_t \alpha_{it} h_{it} \tag{5}$$

where $s_i$ is the sentence vector in the word attention mechanism matrix.
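The word-level attention step can be sketched as follows. Formula (4) is not reproduced in the text, so the normalization here is assumed to be a softmax over the scores $u_{it} \cdot u_w$, with $u_w$ treated as a context vector; both are assumptions, as are the function name `word_attention` and the toy values.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def word_attention(H, W_w, b_w, u_w):
    """Return (alpha, s): attention weights and the weighted sentence vector."""
    # Formula (3): u_it = tanh(W_w h_it + b_w)
    U = [[math.tanh(dot(row, h_t) + b) for row, b in zip(W_w, b_w)] for h_t in H]
    # Assumed form of formula (4): softmax over the scores u_it . u_w
    scores = [dot(u, u_w) for u in U]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Formula (5): s_i = sum_t alpha_it * h_it
    dim = len(H[0])
    s = [sum(a * h_t[k] for a, h_t in zip(alpha, H)) for k in range(dim)]
    return alpha, s


# Hypothetical hidden states for a three-word sentence.
H = [[0.2, -0.1], [0.9, 0.4], [0.0, 0.3]]
W_w = [[1.0, 0.0], [0.0, 1.0]]  # placeholder weights
b_w = [0.0, 0.0]
u_w = [1.0, 1.0]                # placeholder context vector
alpha, s = word_attention(H, W_w, b_w, u_w)
print(alpha, s)
```

Because the weights sum to one, the sentence vector is a weighted average of the hidden states, dominated by the words with the highest scores.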

In an embodiment of the present invention, for step S36, different sentences also contribute differently to the importance of the question information. The sentence Attention layer of the LSTM-Attention neural network therefore applies a nonlinear transformation and a normalization operation to the updated sentence vector to obtain the weight coefficients of the sentence-level hidden layer output, expressed as:

$$u_i = \tanh(W_w h_i + b_w) \tag{6}$$

(Formula (7), the normalization that produces $\alpha_i$, is not reproduced in this text.)

where $h_i$ is the sentence vector updated by the LSTM network, $u_i$ is the hidden representation obtained by nonlinearly transforming the updated sentence vector $h_i$, $u_s$ is the randomly initialized sentence attention mechanism matrix, and $\alpha_i$ is the weight coefficient of the sentence-level hidden layer output.

The sentence attention mechanism matrix is then obtained from the sentence-level weight coefficients and the updated sentence vectors, expressed as:

$$v = \sum_i \alpha_i h_i \tag{8}$$
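The normalization that links formulas (6) and (8) is not reproduced in the text. One form consistent with the definitions of $u_s$ and $\alpha_i$, assuming a softmax over the sentences with $u_s$ used as a context vector (an assumption, not a statement of the patent's formula), would be:

```latex
\alpha_i = \frac{\exp\left(u_i^{\top} u_s\right)}{\sum_{j} \exp\left(u_j^{\top} u_s\right)},
\qquad \sum_i \alpha_i = 1
```

Under this assumption the vector $v$ in formula (8) is a weighted average of the updated sentence vectors $h_i$.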

In an embodiment of the present invention, after the text features have been obtained by feature extraction on the question information according to the word attention mechanism matrix and the sentence attention mechanism matrix, a nonlinear layer is added before the Softmax classification layer in order to compress the high-dimensional text features to a lower dimension and strengthen the nonlinear capacity of the data. This layer maps the feature vector to a vector of length $c$, and the Softmax classification layer then computes the probability distribution over the corresponding categories.

[Formula (9) and the symbol for the extracted text feature vector are not reproduced in this text.]

In that formula, $w_c$ and $b_c$ are the weight and bias of the nonlinear layer, respectively, and $y_i$ is the output probability distribution.
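A nonlinear layer followed by Softmax can be sketched as below. Since formula (9) is not reproduced in the text, the `tanh` activation, the function name `classify` and all numeric values are assumptions chosen only to illustrate mapping a feature vector to a length-c probability distribution.

```python
import math


def softmax(z):
    top = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(feature, w_c, b_c):
    """Map a feature vector to c class probabilities."""
    # Nonlinear layer: project to length c, then apply tanh (assumed activation).
    hidden = [math.tanh(sum(w * f for w, f in zip(row, feature)) + b)
              for row, b in zip(w_c, b_c)]
    return softmax(hidden)


# Hypothetical 4-dimensional text feature and c = 3 classes.
feature = [0.3, -0.2, 0.8, 0.1]
w_c = [[0.5, 0.1, -0.3, 0.0],
       [-0.2, 0.4, 0.9, 0.1],
       [0.0, -0.6, 0.2, 0.7]]
b_c = [0.0, 0.0, 0.0]
probs = classify(feature, w_c, b_c)
print(probs)
```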

In an embodiment of the present invention, step S4 further includes:

S41. Map the image features and the text features to the same feature space, so that the dimension of the image features matches the dimension of the text features;

S42. Input the dimension-matched image features and text features into the bilinear pooling network to obtain the fused multimodal features.

Feature fusion with a bilinear pooling network requires the dimensions of the different features to match. The image features and the text features are therefore first mapped to the same feature space so that they have the same dimension, and the two features are then input into the bilinear pooling network to obtain the fused multimodal features, expressed as:

$$q = \mathrm{Fusion}(m, n) \tag{10}$$

where $m$ is the image feature, $n$ is the text feature, $\mathrm{Fusion}$ denotes feature fusion, and $q$ is the fused multimodal feature.
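Steps S41 and S42 can be illustrated with a plain outer-product form of bilinear pooling. The patent does not state which bilinear pooling variant it uses, so this is only one possible reading; the projection matrices `W_img` and `W_txt`, the function names and the toy vectors are hypothetical.

```python
def project(W, v):
    """Linear projection of v into the shared feature space."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def bilinear_fusion(m, n, W_img, W_txt):
    """Fuse image feature m and text feature n by an outer product."""
    m_p = project(W_img, m)  # S41: map both features to the same dimension
    n_p = project(W_txt, n)
    if len(m_p) != len(n_p):
        raise ValueError("projected features must have the same dimension")
    # S42: every pairwise product m_p[i] * n_p[j], flattened into one vector.
    return [a * b for a in m_p for b in n_p]


m = [1.0, 2.0, 3.0]                 # hypothetical 3-d image feature
n = [0.5, -1.0]                     # hypothetical 2-d text feature
W_img = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0]]           # projects 3-d -> 2-d
W_txt = [[1.0, 0.0],
         [0.0, 1.0]]                # keeps the text feature 2-d
q = bilinear_fusion(m, n, W_img, W_txt)
print(q)
```

The fused vector grows with the square of the shared dimension, which is why practical systems often use compact approximations of the outer product instead.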

As shown in Figure 5, step S6 further includes:

S61. Input the fused multimodal features into the trained visual question answering model;

S62. Determine the candidate answers according to the multimodal features, the candidate answers forming a candidate domain;

S63. Predict the probability distribution of the correct answer over the candidate domain;

S64. According to the probability distribution, take the candidate answer with the maximum probability as the visual question answering result.

In an embodiment of the present invention, after the multimodal features are input into the trained visual question answering model, the candidate answers are determined, and these candidate answers form the candidate domain. The probability distribution of the correct answer over the candidate domain is predicted first, and the candidate answer with the maximum probability is then taken as the final visual question answering result.

[Formula (11) and the symbol for the correct answer are not reproduced in this text.]

In that formula, $I$ is the acquired image, $Q$ is the question information related to the image, $M$ is the candidate domain, and $P$ is the probability distribution.
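Steps S63 and S64 amount to taking the candidate with the highest predicted probability, which is commonly written as an argmax of P over the candidate domain M given I and Q. A minimal sketch follows; the function name `pick_answer`, the candidate answers and the probabilities are made up for illustration.

```python
def pick_answer(candidates, probabilities):
    """Return the candidate answer with the highest predicted probability."""
    if len(candidates) != len(probabilities):
        raise ValueError("one probability is required per candidate answer")
    best = max(range(len(candidates)), key=probabilities.__getitem__)
    return candidates[best]


# Hypothetical candidate domain M and model output P.
candidates = ["rust", "crack", "no defect"]
probabilities = [0.2, 0.7, 0.1]
result = pick_answer(candidates, probabilities)
print(result)
```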

The division of the above method into steps is only for clarity of description. In implementation, steps may be combined into a single step, or a step may be split into several steps; as long as the same logical relationship is preserved, such variants fall within the protection scope of the present invention. Adding insignificant modifications to the algorithm or process, or introducing insignificant designs, without changing the core design of the algorithm and process, also falls within the protection scope of the present invention.

As shown in Figure 6, the present invention also provides a visual question answering system for appearance defects of electrical equipment, comprising an image acquisition module 11, an image feature extraction module 12, a text feature extraction module 13, a feature fusion module 14, a model building module 15 and a visual question answering module 16. The image acquisition module 11 acquires an image of the appearance defect of the electrical equipment and preprocesses the image. The image feature extraction module 12 performs feature extraction on the processed image to obtain image features. The text feature extraction module 13 obtains, according to the image, question information related to the image and performs feature extraction on the question information to obtain text features. The feature fusion module 14 fuses the image features with the text features based on a bilinear pooling network to obtain multimodal features. The model building module 15 builds and trains a visual question answering model to obtain the trained visual question answering model. The visual question answering module 16 inputs the multimodal features into the trained visual question answering model to obtain the visual question answering result.
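The data flow between the modules can be sketched as a simple composition of callables. This is only an illustration of how modules 11 to 16 connect: the class name `VqaSystem`, its field names and the stub functions are hypothetical, and the stubs do no real image or text processing.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class VqaSystem:
    acquire_image: Callable[[Any], Any]           # module 11: acquire and preprocess
    extract_image_features: Callable[[Any], Any]  # module 12
    extract_text_features: Callable[[str], Any]   # module 13
    fuse: Callable[[Any, Any], Any]               # module 14
    answer: Callable[[Any], str]                  # module 16, wrapping the model trained by module 15

    def run(self, raw_image: Any, question: str) -> str:
        image = self.acquire_image(raw_image)
        m = self.extract_image_features(image)
        n = self.extract_text_features(question)
        q = self.fuse(m, n)
        return self.answer(q)


# Stub modules that only demonstrate the wiring.
system = VqaSystem(
    acquire_image=lambda raw: raw,
    extract_image_features=lambda img: [float(len(img))],
    extract_text_features=lambda text: [float(len(text.split()))],
    fuse=lambda m, n: [a * b for a in m for b in n],
    answer=lambda q: "defect" if q[0] > 4 else "no defect",
)
result = system.run("abc", "is it cracked")
print(result)
```

Keeping each module behind a plain callable makes it possible to replace one stage, such as the fusion step, without touching the others.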

Further, the text feature extraction module 13 includes a vectorization unit, a text vector unit, a semantic encoding unit, a word Attention unit, an update unit, a sentence Attention unit and a feature extraction unit. The vectorization unit vectorizes the corpus to obtain word vectors. The text vector unit vectorizes the question information, according to the word vectors, through the text vector layer of the LSTM-Attention neural network to obtain a sentence sequence. The semantic encoding unit inputs the sentence sequence into the LSTM network, which semantically encodes the sentence sequence to obtain the hidden layer output information. The word Attention unit applies, through the word Attention layer of the LSTM-Attention neural network, a nonlinear transformation and a normalization operation to the hidden layer output information to obtain the weight coefficients of the word-level hidden layer output, and obtains the word attention mechanism matrix from those weight coefficients. The update unit inputs the sentence vector into the LSTM network, which updates the sentence vector according to the hidden layer output information of the previous time step to obtain the updated sentence vector. The sentence Attention unit applies, through the sentence Attention layer of the LSTM-Attention neural network, a nonlinear transformation and a normalization operation to the updated sentence vector to obtain the weight coefficients of the sentence-level hidden layer output, and obtains the sentence attention mechanism matrix from those weight coefficients. The feature extraction unit performs feature extraction on the question information according to the word attention mechanism matrix and the sentence attention mechanism matrix to obtain the text features.

It should be noted that, in order to highlight the innovative part of the present invention, this embodiment does not introduce modules that are not closely related to solving the technical problem addressed by the present invention. This does not mean that no other modules exist in this embodiment.

In addition, those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working process of the system described above may be found in the corresponding process of the foregoing method embodiments and is not repeated here. In the embodiments provided by the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into modules is only a division by logical function, and other divisions are possible in an actual implementation: multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or of another form.

Modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed over multiple network modules. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional modules in the embodiments of the present invention may be integrated into one processing module, each module may exist physically on its own, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.

In an embodiment of the present invention, a visual question answering device for appearance defects of electrical equipment includes a processor coupled to a memory. The memory stores program instructions which, when executed by the processor, implement the visual question answering method for appearance defects of electrical equipment.

As described above, the present invention provides a visual question answering method, system, device and storage medium for appearance defects of electrical equipment. Image features are extracted from labeled images of electrical equipment scenes and fused with the text features of the question information to obtain the corresponding answer prediction, giving good question answering performance. The method also allows the operating state of substation equipment to be monitored in real time, which prevents false detections and missed detections and improves work efficiency. The present invention therefore effectively overcomes various shortcomings of the prior art and has high industrial value.

The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims

1. A visual question answering method for appearance defects of electric equipment is characterized by comprising the following steps:

acquiring an image of the appearance defect of the power equipment, and preprocessing the image;

performing feature extraction on the processed image to obtain image features;

obtaining, according to the image, question information related to the image, and performing feature extraction on the question information to obtain text features;

fusing the image features and the text features based on a bilinear pooling network to obtain multi-modal features;

establishing a visual question-answer model and training the visual question-answer model to obtain the trained visual question-answer model;

and inputting the multi-modal features into the trained visual question-answering model to obtain a visual question-answering result.

2. The visual question answering method for the appearance defects of the electric power equipment as claimed in claim 1, wherein the preprocessing of the image comprises the steps of:

screening the images to remove images not containing the detection object;

carrying out dark channel defogging and self-adaptive gamma correction processing on the screened image so as to improve the definition and contrast of the image;

performing position transformation and color transformation on the processed image to enhance the information of the image;

cutting the transformed image to enable the specification size of the image to meet the specification size requirement when the image is subjected to feature extraction;

marking the cut image to obtain a marked image;

and carrying out manual inspection on the marked image to remove abnormal data in the marked image.

3. The visual question answering method for the appearance defects of the electric power equipment according to claim 1, wherein the step of performing feature extraction on the processed image to obtain image features comprises the following steps:

detecting the processed image through a target detection model based on a convolutional neural network to identify an object in the image;

and according to the identified object, performing feature extraction on the image to obtain the image feature.

4. The visual question answering method for the appearance defects of the electric power equipment according to claim 1, wherein the step of performing feature extraction on the question information to obtain text features comprises the following steps:

vectorizing the corpus to obtain a word vector;

vectorizing the question information through a text vector layer of an LSTM-Attention neural network according to the word vector to obtain a sentence sequence;

inputting the sentence sequence into an LSTM network, and performing semantic coding on the sentence sequence through the LSTM network to obtain hidden layer output information, wherein the hidden layer output information is expressed as:

$$h_{it} = \mathrm{LSTM}(x_{it}), \quad t \in [1, m]$$

wherein $m$ is the number of words in the sentence sequence, $x_{it}$ is the word input to the LSTM network at time $t$, and $h_{it}$ is the hidden layer output information output by the LSTM network at time $t$;

carrying out nonlinear transformation and normalization operation on the hidden layer output information through the word Attention layer of the LSTM-Attention neural network to obtain a weight coefficient output by the word level hidden layer, and obtaining a word Attention mechanism matrix according to the weight coefficient, wherein the expression is as follows:

wherein $\alpha_{it}$ is the weight coefficient output by the word-level hidden layer, and $s_i$ is the sentence vector in the word attention mechanism matrix;

inputting the sentence vector into the LSTM network, and updating the sentence vector by the LSTM network according to the hidden layer output information at the previous moment to obtain the updated sentence vector;

through the sentence Attention layer of the LSTM-Attention neural network, carrying out nonlinear transformation and normalization operation on the updated sentence vector to obtain a weight coefficient output by the sentence-level hidden layer, and obtaining a sentence Attention mechanism matrix according to the weight coefficient, wherein the expression is as follows:

wherein $\alpha_i$ is the weight coefficient output by the sentence-level hidden layer, and $h_i$ is the updated sentence vector;

and according to the word attention mechanism matrix and the sentence attention mechanism matrix, performing feature extraction on the question information to obtain the text features.

5. The visual question-answering method for the appearance defects of the electric power equipment as claimed in claim 4, wherein the step of inputting the sentence sequence into an LSTM network, and semantically coding the sentence sequence through the LSTM network to obtain hidden layer output information further comprises the steps of:

taking the sentence sequence as input node data of the LSTM network;

learning the preceding and following context of the sentence sequences through the LSTM network to obtain the contextual information between the sentence sequences;

and performing semantic coding on the sentence sequence according to the contextual information.

6. The visual question-answering method for the appearance defects of the electric power equipment as claimed in claim 1, wherein the fusion of the image features and the text features based on the bilinear pooling network to obtain multi-modal features comprises the steps of:

mapping the image features and the text features to the same feature space, such that dimensions of the image features match dimensions of the text features;

inputting the image features and the text features with matched dimensions into the bilinear pooling network to output the fused multi-modal features, which are expressed as:

$$q = \mathrm{Fusion}(m, n)$$

wherein $m$ is the image feature, $n$ is the text feature, $\mathrm{Fusion}$ denotes feature fusion, and $q$ is the multi-modal feature.

7. The visual question-answering method for the appearance defects of the electric power equipment as claimed in claim 1, wherein the step of inputting the multi-modal features into the trained visual question-answering model to obtain a visual question-answering result comprises the steps of:

inputting the multi-modal features into the trained visual question-answering model;

determining candidate answers according to the multi-modal features, wherein the candidate answers form a candidate domain;

predicting the probability distribution of the correct answer over the candidate domain;

and according to the probability distribution, taking the candidate answer corresponding to the maximum probability as the visual question-answering result.

8. A visual question-answering system for appearance defects of electrical equipment, the system comprising:

the image acquisition module is used for acquiring an image of the appearance defect of the power equipment and preprocessing the image;

the image feature extraction module is used for extracting features of the processed image to obtain image features;

the text feature extraction module is used for acquiring question information related to the image according to the image and extracting features of the question information to obtain text features;

the feature fusion module is used for fusing the image features and the text features based on a bilinear pooling network to obtain multi-modal features;

the model establishing module is used for establishing a visual question-answer model and training the visual question-answer model to obtain the trained visual question-answer model;

and the visual question-answering module is used for inputting the multi-modal features into the trained visual question-answering model to obtain a visual question-answering result.

9. A visual question answering device for appearance defects of electric equipment is characterized in that: comprising a processor coupled with a memory, the memory storing program instructions that, when executed by the processor, implement the method of any of claims 1 to 7.

10. A computer-readable storage medium characterized by: comprising a program which, when run on a computer, causes the computer to perform the method of any one of claims 1 to 7.