WO2024227415A1

WO2024227415A1 - Question answering method and apparatus, and device and storage medium

Info

Publication number: WO2024227415A1
Application number: PCT/CN2024/089651
Authority: WO
Inventors: 潘俊文; 郭少博; 黄凯
Original assignee: 北京有竹居网络技术有限公司
Priority date: 2023-05-04
Filing date: 2024-04-24
Publication date: 2024-11-07
Also published as: CN116628150A

Abstract

Provided are a question answering method and apparatus, and a device and a storage medium. The question answering method comprises: in response to a question answering initiation operation being detected, using a device of a user to capture image data and a question for the image data (210); extracting text information from the image data (220); acquiring expanded information associated with the text information (230); and determining a target answer for the question on the basis of the image data and the expanded information (240). In this way, in a question answering scenario of multi-modal data, a knowledge base can be introduced to expand the capability of accurately answering a question. Thus, an instant and accurate question answering service can still be provided for a user even if image data is incomplete and insufficient.

Description

Method, device, equipment and storage medium for question answering

This application claims priority to Chinese invention patent application No. 202310492380.4, entitled “Method, device, equipment and storage medium for question answering” and filed on May 4, 2023, the entire contents of which are incorporated herein by reference.

Technical Field

Example embodiments of the present disclosure generally relate to the field of computers, and more particularly to methods, apparatuses, devices, and computer-readable storage media for question answering.

Background Art

With the rapid development of information technology, more and more applications provide question-answering functions, which bring considerable convenience to users. An application with a question-answering function can output an answer corresponding to voice or text input by the user. An application with a multimodal visual question answering (VQA) function can additionally take an image input by the user and, based on a spoken question, output answer audio for that image.

Summary of the Invention

In a first aspect of the present disclosure, a question answering method is provided. The method includes: in response to detecting a question-answering initiation operation, capturing, using a device of a user, image data and a question for the image data; extracting text information from the image data; acquiring extended information associated with the text information; and determining a target answer to the question based on the image data and the extended information.

In a second aspect of the present disclosure, an apparatus for question answering is provided. The apparatus includes: a data capture module configured to capture, using a device of a user, image data and a question for the image data in response to detecting a question-answering initiation operation; a text information extraction module configured to extract text information from the image data; an extended information acquisition module configured to acquire extended information associated with the text information; and a target answer determination module configured to determine a target answer to the question based on the image data and the extended information.

In a third aspect of the present disclosure, an electronic device is provided. The device includes at least one processing unit and at least one memory, the at least one memory being coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit. The instructions, when executed by the at least one processing unit, cause the electronic device to perform the method of the first aspect.

In a fourth aspect of the present disclosure, a computer-readable storage medium is provided. A computer program is stored on the medium, and the computer program, when executed by a processor, implements the method of the first aspect.

In a fifth aspect of the present disclosure, a computer program product is provided. The computer program product is tangibly stored in a computer storage medium and includes computer-executable instructions which, when executed by a device, cause the device to perform the method of the first aspect.

It should be understood that the content described in this section is not intended to identify key or essential features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.

Brief Description of the Drawings

The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements, wherein:

FIG. 1 shows a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;

FIG. 2 shows a flowchart of a question-answering process according to some embodiments of the present disclosure;

FIG. 3 shows a schematic diagram of a question-and-answer interface according to some embodiments of the present disclosure;

FIG. 4 shows a schematic diagram of a question-answering flow according to some embodiments of the present disclosure;

FIG. 5 shows a schematic structural block diagram of an apparatus for question answering according to some embodiments of the present disclosure; and

FIG. 6 shows a block diagram of an electronic device in which one or more embodiments of the present disclosure can be implemented.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the scope of protection of the present disclosure.

In the description of the embodiments of the present disclosure, the term "including" and similar terms should be understood as open-ended inclusion, that is, "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The term "some embodiments" should be understood as "at least some embodiments". The terms "first", "second", and the like may refer to different or the same objects. Other explicit and implicit definitions may also be included below. As used herein, the term "model" may denote an association between pieces of data. For example, such an association may be obtained based on a variety of technical solutions that are currently known and/or will be developed in the future.

Herein, unless explicitly stated otherwise, performing a step “in response to A” does not mean that the step is performed immediately after “A”; one or more intermediate steps may be included.

It should be understood that the data involved in this technical solution (including but not limited to the data itself and the acquisition or use of the data) shall comply with the requirements of applicable laws, regulations, and related provisions.

It should be understood that, before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed, in an appropriate manner and in accordance with relevant laws and regulations, of the type, scope of use, and usage scenarios of the personal information involved in the present disclosure, and the user's authorization should be obtained.

For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly inform the user that the requested operation will require acquiring and using the user's personal information. The user can then autonomously choose, according to the prompt information, whether to provide personal information to the software or hardware, such as an electronic device, application, server, or storage medium, that performs the operations of the technical solution of the present disclosure.

As an optional but non-limiting embodiment, the prompt information may be sent to the user, in response to receiving the user's active request, by way of a pop-up window in which the prompt information is presented as text. In addition, the pop-up window may carry a selection control that allows the user to choose "agree" or "disagree" to providing personal information to the electronic device.

It should be understood that the above process of notifying the user and obtaining the user's authorization is merely illustrative and does not limit the embodiments of the present disclosure. Other approaches that satisfy relevant laws and regulations may also be applied to the embodiments of the present disclosure.

As used herein, a "model" can learn the association between inputs and corresponding outputs from training data, so that after training is completed it can generate a corresponding output for a given input. A model may be generated based on machine learning technology. Deep learning is a class of machine learning algorithms that processes inputs and provides corresponding outputs by using multiple layers of processing units. A neural network model is one example of a model based on deep learning. Herein, a "model" may also be referred to as a "machine learning model", "learning model", "machine learning network", or "learning network", and these terms are used interchangeably.

A "neural network" is a machine learning network based on deep learning. A neural network is capable of processing inputs and providing corresponding outputs, and typically includes an input layer, an output layer, and one or more hidden layers between the input layer and the output layer. Neural networks used in deep learning applications typically include many hidden layers, thereby increasing the depth of the network. The layers of a neural network are connected in sequence so that the output of one layer is provided as the input to the next layer, where the input layer receives the input of the neural network and the output of the output layer serves as the final output of the neural network. Each layer of a neural network includes one or more nodes (also called processing nodes or neurons), each of which processes input from the previous layer.
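The layer-by-layer flow described above can be sketched as follows. This is a minimal illustration only, assuming fully connected layers with a ReLU activation on the hidden layers; the function names `dense`, `relu`, and `forward` and the weight values are hypothetical and do not come from the disclosure.

```python
from typing import List, Tuple

Vector = List[float]
Layer = Tuple[List[Vector], Vector]  # (weights: one row per node, biases)


def dense(inputs: Vector, weights: List[Vector], biases: Vector) -> Vector:
    """Each node computes a weighted sum of the previous layer's outputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values: Vector) -> Vector:
    """Assumed hidden-layer activation: clamp negative values to zero."""
    return [max(0.0, v) for v in values]


def forward(x: Vector, layers: List[Layer]) -> Vector:
    """Feed the output of each layer into the next; the last layer's output is final."""
    for index, (weights, biases) in enumerate(layers):
        x = dense(x, weights, biases)
        if index < len(layers) - 1:  # activation on hidden layers only
            x = relu(x)
    return x


# One hidden layer with two nodes, then an output layer with one node.
example_layers: List[Layer] = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.1]),
]
example_output = forward([2.0, 1.0], example_layers)
```

Adding more entries to `example_layers` corresponds to adding hidden layers, that is, to increasing the depth of the network.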

Generally, machine learning can be roughly divided into three stages: a training stage, a testing stage, and an application stage (also called an inference stage). In the training stage, a given model is trained using a large amount of training data, and its parameter values are updated iteratively until the model can consistently produce, from the training data, inferences that meet the expected goal. Through training, the model can be considered to have learned the association between inputs and outputs (also called the input-to-output mapping) from the training data. The parameter values of the trained model are then fixed. In the testing stage, test inputs are applied to the trained model to check whether it provides correct outputs, thereby determining the performance of the model. In the application stage, the model is used to process actual inputs based on the parameter values obtained from training and to determine the corresponding outputs.
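The three stages can be illustrated with a deliberately tiny model. The sketch below assumes a one-parameter linear model `y = w * x` fitted by gradient descent on a mean squared error; the model, the data, and the names `train`, `evaluate`, and `predict` are illustrative assumptions and are unrelated to the models used in the disclosure.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input, expected output)


def train(data: List[Sample], lr: float = 0.05, epochs: int = 200) -> float:
    """Training stage: iteratively update the parameter w to reduce the error."""
    w = 0.0
    for _ in range(epochs):
        # Gradient of the mean squared error of w * x against y.
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def evaluate(w: float, data: List[Sample], tol: float = 0.1) -> bool:
    """Testing stage: check the trained model against held-out samples."""
    return all(abs(w * x - y) <= tol for x, y in data)


def predict(w: float, x: float) -> float:
    """Application (inference) stage: process an actual input with fixed parameters."""
    return w * x


trained_w = train([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
passed = evaluate(trained_w, [(4.0, 8.0)])
prediction = predict(trained_w, 5.0)
```

After `train` returns, `trained_w` is no longer updated, which mirrors the statement that the parameter values of the trained model are fixed before testing and application.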

As briefly mentioned above, an application with a multimodal visual-language question-answering function can, given an image input by the user, play answer audio for that image based on a spoken question input by the user.

Visual-language question answering is a multimodal understanding task that requires understanding visual content and then answering a question posed in natural language. Traditionally, an application with a multimodal visual-language question-answering function uses a trained question-answering model to implement this task.

However, in some cases, the image data collected by the user may suffer from problems such as poor quality or incomplete content. For example, the captured image data may be blurry, the object the user wishes to ask about may not be fully captured in the image data, or some text in the image data may be in a small font. It may be difficult to generate a correct answer from such image data. For some groups of people in particular, such as visually impaired people, the demand for visual-language question answering is greater, yet because of their limited vision these users often find it harder to tell whether the quality of the captured image data is sufficient for question answering. It is therefore desirable that, in question-answering scenarios based on visual data, an accurate answer can still be obtained even from poor-quality image data.

Embodiments of the present disclosure provide an improved solution for question answering. According to this solution, image data and a question for the image data are captured. Extended information associated with text information in the image data is acquired. An answer to the question is determined based on the image data and the extended information. In this way, a knowledge base can be introduced in a multimodal question-answering scenario to extend the ability to answer questions accurately. As a result, an instant and accurate question-answering service can be provided to the user even when the image data is incomplete or insufficient.
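The overall flow (capture, text extraction, extended-information lookup, answer determination) can be sketched as below. This is a rough sketch under strong assumptions: the image is represented only by the text strings visible in it, the knowledge base is an in-memory dictionary, and the answer model is a toy function. All names (`Capture`, `extract_text`, `fetch_extended_info`, `toy_answer_model`) and the knowledge-base entry are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

AnswerModel = Callable[[str, List[str], List[str]], str]


@dataclass
class Capture:
    image_texts: List[str]  # stand-in for image data: text visible in the image
    question: str


def extract_text(capture: Capture) -> List[str]:
    """Stand-in for text extraction (e.g. OCR): keep non-empty strings."""
    return [t.strip() for t in capture.image_texts if t.strip()]


def fetch_extended_info(texts: List[str], knowledge_base: Dict[str, str]) -> List[str]:
    """Look up extended information associated with each extracted text."""
    return [knowledge_base[t] for t in texts if t in knowledge_base]


def toy_answer_model(question: str, image_texts: List[str], extended: List[str]) -> str:
    """Toy model: return the first extended entry, or a fallback if none exists."""
    return extended[0] if extended else "unknown"


def answer_question(capture: Capture, knowledge_base: Dict[str, str],
                    model: AnswerModel) -> str:
    texts = extract_text(capture)
    extended = fetch_extended_info(texts, knowledge_base)
    return model(capture.question, capture.image_texts, extended)


# Made-up knowledge-base entry; not real product or medical information.
kb = {"ExampleMed": "ExampleMed dosage: one tablet (made-up entry)"}
capture = Capture(image_texts=[" ExampleMed ", ""], question="What is the dosage?")
answer = answer_question(capture, kb, toy_answer_model)
```

In this sketch the image shows only a product name, so the answer comes entirely from the knowledge-base entry, which is the situation the solution targets: the image alone is insufficient, and the extended information fills the gap.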

Moreover, the question-answering solution proposed in the present disclosure can effectively assist users, especially people with persistent or temporary visual impairment, in accurately performing multimodal visual-language question answering. It should be understood that the solution provided by the embodiments of the present disclosure can provide convenience for specific groups of people, but this does not imply any discrimination against specific groups of people.

FIG. 1 shows a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented. In the example environment 100, an application 120 is installed in a terminal device 110. A user 140 can interact with the application 120 via the terminal device 110 and/or its attached devices. The application 120 is an application having at least a question-answering function.

In some embodiments, the terminal device 110 communicates with a server 130 to provision services for the application 120. The terminal device 110 may be any type of mobile terminal, fixed terminal, or portable terminal, including a mobile phone, a desktop computer, a laptop computer, a notebook computer, a netbook computer, a tablet computer, a media computer, a multimedia tablet, a personal communication system (PCS) device, a personal navigation device, a personal digital assistant (PDA), an audio/video player, a digital camera/camcorder, a positioning device, a television receiver, a radio broadcast receiver, an e-book device, a gaming device, or any combination of the foregoing, including accessories and peripherals of these devices or any combination thereof. In some embodiments, the terminal device 110 can also support any type of user-facing interface (such as "wearable" circuitry).

The terminal device 110 may include, for example, a sensor of an appropriate type for detecting user gestures. For example, the terminal device 110 may include a touch screen for detecting various types of gestures made by the user on the touch screen. Alternatively or additionally, the terminal device 110 may include other appropriate types of sensing devices, such as a proximity sensor, to detect various types of gestures made by the user within a predetermined distance above the screen. The terminal device 110 may also include, for example, a sound collection device (such as a microphone) for collecting user audio, a sound playback device (such as a speaker) for playing audio, an image collection device (such as a camera) for collecting images, and a display device (such as a display screen, which may be a touch screen) for displaying the interface.

The server 130 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks, and big data and artificial intelligence platforms. The server 130 may include, for example, a computing system/server, such as a mainframe, an edge computing node, or a computing device in a cloud environment. The server 130 may provide back-end services for the application 120 in the terminal device 110.

In some embodiments discussed below, the question-answering function can be implemented by multiple models with various functions. One or more of these models may be deployed remotely in the server 130, and the terminal device 110 can use them to implement the corresponding functions by communicating with the server 130. In this way, the resources and power of the terminal device 110 can be saved, and the more powerful resources of the server can be used to improve computing efficiency. In some embodiments, one or more of these models may also be deployed locally on the terminal device 110. The deployment can be chosen according to the actual situation.
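One way to picture the choice between local and remote deployment is a small dispatcher that prefers a locally deployed model and otherwise falls back to a remote one. The sketch below is an assumption made for illustration: the class `ModelRouter`, its local-first policy, and the model names are hypothetical, and the remote call is represented by an ordinary function rather than real network communication with the server 130.

```python
from typing import Callable, Dict

ModelFn = Callable[[str], str]


class ModelRouter:
    """Dispatch a request to a locally deployed model if present, else a remote one."""

    def __init__(self, local: Dict[str, ModelFn], remote: Dict[str, ModelFn]) -> None:
        self.local = local
        self.remote = remote  # each entry stands in for a call to the server

    def run(self, model_name: str, payload: str) -> str:
        if model_name in self.local:
            return self.local[model_name](payload)
        if model_name in self.remote:
            return self.remote[model_name](payload)
        raise KeyError(f"no deployment found for model {model_name!r}")


router = ModelRouter(
    local={"ocr": lambda p: "local-ocr:" + p},
    remote={"ocr": lambda p: "remote-ocr:" + p, "qa": lambda p: "remote-qa:" + p},
)
ocr_result = router.run("ocr", "image-bytes")
qa_result = router.run("qa", "question")
```

A real deployment would likely weigh latency, battery, and connectivity when choosing a side; the local-first rule here is only the simplest possible policy.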

In some embodiments, in the environment 100 of FIG. 1, if the application 120 is active, the terminal device 110 may present an interface 150 of the application 120. Via the interface 150, the application 120 can provide the user 140 with one or more services related to the question-answering function, including collecting voice, collecting images, playing voice, displaying text, and the like.

It should be understood that the structure and functionality of the environment 100 are described for exemplary purposes only and do not imply any limitation on the scope of the present disclosure.

Some example embodiments of the present disclosure will be described below with continued reference to the accompanying drawings.

FIG. 2 shows a flowchart of a question-answering process 200 according to some embodiments of the present disclosure. The process 200 may be implemented at the terminal device 110. For ease of discussion, the process 200 will be described with reference to the environment 100 of FIG. 1.

At block 210, the terminal device 110 captures image data and a question for the image data in response to detecting a question-answering initiation operation.

In some embodiments, the terminal device 110 directly detects a question-answering initiation operation performed by the user. For example, the terminal device 110 may determine that a question-answering initiation operation is detected in response to detecting a question-answering initiation voice command (e.g., "turn on the question-and-answer function"). As another example, the terminal device 110 may determine that a question-answering initiation operation is detected in response to detecting a preset operation on a hardware button (e.g., a press operation or a long-press operation). In some embodiments, after detecting the question-answering initiation operation, the terminal device 110 runs the application 120 having the question-answering function and captures the image data and the question for the image data.
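The two detection paths above (a voice command and a preset hardware-button operation) can be sketched as a single predicate over input events. The event structure `InputEvent`, the exact phrase matching, and the operation names below are assumptions made for illustration; the disclosure does not specify how events are represented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputEvent:
    kind: str   # assumed values: "voice" or "button"
    value: str  # recognized transcript, or the name of the button operation


# Hypothetical configuration of what counts as an initiation operation.
START_PHRASES = {"turn on the question-and-answer function"}
START_BUTTON_OPS = {"press", "long_press"}


def is_initiation(event: InputEvent) -> bool:
    """Return True if the event should be treated as a question-answering initiation."""
    if event.kind == "voice":
        return event.value.strip().lower() in START_PHRASES
    if event.kind == "button":
        return event.value in START_BUTTON_OPS
    return False


voice_start = is_initiation(InputEvent("voice", "Turn on the question-and-answer function"))
button_start = is_initiation(InputEvent("button", "long_press"))
unrelated = is_initiation(InputEvent("voice", "what time is it"))
```

Exact string matching is used only to keep the sketch short; a practical system would more likely rely on a keyword-spotting or intent-recognition component.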

The terminal device 110 may capture the image data through an image collection device. The image data may be in any form (a still image, a video clip, etc.), any resolution, and any format (e.g., PNG, JPG). Alternatively or additionally, the image data may be data pre-stored in the terminal device 110.

The question may take multiple forms; that is, the terminal device 110 may capture the question in multiple forms. For example, the question may be captured in text form, in which case the terminal device 110 directly obtains the text sequence of the question input by the user (for example, the text sequence "How many cups are there in the image?"). To make question answering convenient and easy to operate, the question may also be captured in voice form, in which case the terminal device 110 captures voice data of the user's question through a sound collection device. The voice data may be in any language (e.g., Chinese, English, Japanese), of any duration (e.g., 3 s, 5 s), and of any timbre. It should be understood that the question may also be captured in any other appropriate form.

Where the question is captured in voice form, in some embodiments, after starting the application 120, the terminal device 110 may present, through the display device, a question-and-answer interface that includes at least a recording control. The terminal device 110 may determine that a question-answering initiation operation is detected in response to detecting a predetermined operation on the recording control, and then capture the image data and the voice data. The predetermined operation on the recording control may include, for example, a click operation, a slide operation, or a long-press operation, and is not limited here. In some embodiments, the predetermined operation on the recording control may also be initiated by voice or by other instructions.

In some embodiments, while capturing the image data and the voice data, the terminal device 110 may stop capturing in response to receiving a capture end operation. Specifically, the terminal device 110 may determine that a capture end operation is detected in response to detecting a voice command such as "stop collecting data". The terminal device 110 may also determine that a capture end operation is detected in response to detecting a preset operation on a hardware button (e.g., a press operation or a long-press operation), or in response to detecting another predetermined operation on the recording control in the recording interface (e.g., a click operation or a release of a press).

FIG. 3 shows a schematic diagram of a question-and-answer interface 300 according to some embodiments of the present disclosure. The question-and-answer interface 300 (also referred to herein as the recording interface) may include a control display area 330, which presents at least a recording control 332. The terminal device 110 may determine that a question-answering initiation operation is detected in response to detecting a predetermined operation on the recording control 332. The terminal device 110 may then display text such as "Recording" in a text prompt area 310 to indicate to the user that the terminal device 110 is currently capturing data.

In some embodiments, in response to receiving the predetermined operation, the terminal device 110 changes the presentation of the recording control 332 (e.g., its color or size) to indicate that the terminal device 110 is capturing voice data and image data. Correspondingly, in response to receiving a capture end operation, the terminal device 110 may stop capturing the voice data and the image data and switch the presentation of the recording control 332 back to its state before the capture.
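The start and stop behavior of the recording control can be modelled as a two-state machine. The following sketch is illustrative only: the class `RecordingControl`, the `highlighted` flag standing in for the changed color or size, and clearing the prompt text when capture ends are assumptions, not details taken from the disclosure.

```python
from enum import Enum


class CaptureState(Enum):
    IDLE = "idle"
    RECORDING = "recording"


class RecordingControl:
    """Tracks whether capture is in progress and what the interface should show."""

    def __init__(self) -> None:
        self.state = CaptureState.IDLE
        self.prompt_text = ""  # content of the text prompt area

    @property
    def highlighted(self) -> bool:
        # Stand-in for the changed presentation (color, size) during capture.
        return self.state is CaptureState.RECORDING

    def on_predetermined_operation(self) -> None:
        """E.g. a click or long press on the control: start capturing."""
        if self.state is CaptureState.IDLE:
            self.state = CaptureState.RECORDING
            self.prompt_text = "Recording"

    def on_capture_end(self) -> None:
        """E.g. releasing the press: stop capturing and restore the presentation."""
        if self.state is CaptureState.RECORDING:
            self.state = CaptureState.IDLE
            self.prompt_text = ""  # assumption: the prompt is cleared on stop


control = RecordingControl()
control.on_predetermined_operation()
state_during_capture = (control.highlighted, control.prompt_text)
control.on_capture_end()
state_after_capture = (control.highlighted, control.prompt_text)
```

Guarding each transition on the current state makes repeated start or stop events harmless, which matters when the same operation can arrive from the touch screen, a hardware button, or a voice command.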

In some embodiments, the terminal device 110 may convert the captured voice data into text and display it in the text prompt area 310. As shown in FIG. 3, after capturing the voice data, the terminal device 110 displays the corresponding text, "the dosage of this medicine", in the text prompt area 310. At the same time, the terminal device 110 may present the image data it is currently capturing in an image display area 320. The voice-to-text conversion may be implemented using speech-to-text technology and may be performed locally on the terminal device 110 or on a remote server.

In some embodiments, to ensure that the question indicated by the voice data is determined accurately, the terminal device 110 may pre-process the captured voice data to remove noise unrelated to the question (such as ambient sound).
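The disclosure does not say which noise-removal method is used. As one naive possibility, the sketch below applies an amplitude gate: it estimates a noise floor from the first few samples, which are assumed to contain only ambient sound, and zeroes every sample at or below that level. The function names and the sample values are hypothetical, and practical systems typically use more capable techniques.

```python
from typing import List, Sequence


def estimate_noise_floor(samples: Sequence[float], lead_in: int) -> float:
    """Assume the first `lead_in` samples are ambient sound; take their peak level."""
    return max((abs(s) for s in samples[:lead_in]), default=0.0)


def noise_gate(samples: Sequence[float], threshold: float) -> List[float]:
    """Zero out samples whose magnitude does not exceed the threshold."""
    return [s if abs(s) > threshold else 0.0 for s in samples]


raw_audio = [0.01, -0.02, 0.01, 0.5, -0.6, 0.02]
floor = estimate_noise_floor(raw_audio, lead_in=3)
cleaned_audio = noise_gate(raw_audio, floor)
```

Here the quiet lead-in and the trailing low-level sample are suppressed while the louder samples, which stand in for the spoken question, pass through unchanged.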

Referring back to FIG. 2, at block 220, the terminal device 110 extracts text information from the image data.

For example, the terminal device 110 may use optical character recognition (OCR) technology to extract the text information from the image data. Optical character recognition refers to the process of analyzing and recognizing image data to obtain text information. With OCR, the image data undergoes image processing steps such as image preprocessing, text detection, and text recognition, through which its text information is extracted. Specifically, image preprocessing usually corrects imaging problems in the image data. Common preprocessing steps include geometric transformation (perspective, warping, rotation, etc.), distortion correction, deblurring, image enhancement, and lighting correction. Text detection determines the location, extent, and layout of the text, and usually also includes layout analysis and text line detection. Text recognition builds on text detection to recognize the text content and extract the text information from the image data.

在一些实施例中，终端设备110可以根据OCR技术利用经训练的图像处理模型从图像数据提取文本信息。具体地，终端设备110可以将图像数据输入经训练的图像处理模型中，并获取该图像处理模型输出的文本信息。图像处理模型例如可以包括卷积神经网络(CNN)、前馈神经网络(FNN)、全连接神经网络(FCN)、生成对抗网络(GAN)、循环神经网络(RNN)等任意适当的模型中的一个或多个。In some embodiments, the terminal device 110 can extract text information from the image data using a trained image processing model based on OCR technology. Specifically, the terminal device 110 can input the image data into the trained image processing model and obtain the text information output by the image processing model. The image processing model can include, for example, one or more of any suitable models such as a convolutional neural network (CNN), a feedforward neural network (FNN), a fully connected neural network (FCN), a generative adversarial network (GAN), and a recurrent neural network (RNN).

在一些实施例中，图像处理模型例如可以被部署在服务器130中，其中，服务器130可以是远程服务器(例如云端)。终端设备110可以通过与服务器130之间的通信利用经训练的图像处理模型以获取文本信息。具体地，终端设备110可以将捕获到的图像数据发送至服务器130，由服务器130利用经训练的图像处理模型基于图像数据输出文本信息。终端设备110可以从服务器130处获取文本信息。在一些实施例中，经训练的图像处理模型例如还可以被部署在终端设备110本地，终端设备110可直接利用部署在本地的经训练的图像处理模型基于捕获到的图像数据提取文本信息。In some embodiments, the image processing model may be deployed in the server 130, for example, wherein the server 130 may be a remote server (e.g., a cloud server). The terminal device 110 may utilize the trained image processing model to obtain text information through communication with the server 130. Specifically, the terminal device 110 may send the captured image data to the server 130, and the server 130 may output text information based on the image data using the trained image processing model. The terminal device 110 may obtain text information from the server 130. In some embodiments, the trained image processing model may also be deployed locally on the terminal device 110, for example, and the terminal device 110 may directly utilize the trained image processing model deployed locally to extract text information based on the captured image data.

在框230，终端设备110获取与文本信息相关联的扩展信息。At block 230, the terminal device 110 obtains extended information associated with the text information.

在本公开的实施例中，在用户期望针对图像数据进行提问的场景下，不仅依赖于图像数据本身，还期望能够扩展出更多信息来辅助回答提问。In the embodiments of the present disclosure, in the scenario where the user wants to ask questions about the image data, it is not only necessary to rely on the image data itself, but also to expand more information to assist in answering the questions.

在一些实施例中，终端设备110可以默认总是获取与图像数据中的文本信息相关联的扩展信息。在一些实施例中，终端设备110可以确定从图像数据是否能够确定针对提问的目标回答。具体地，终端设备110可以对提问进行意图识别，以确定与提问相关的意图。具体地，终端设备110可以基于词典以及模版的规则方法识别意图。不同的意图会有不同的领域词典，比如书名、歌曲名、商品名、物体名等。终端设备110可以根据用户的意图和词典的匹配程度或者重合程度来进行判断。终端设备110还可以基于机器学习模型对用户意图进行判别。终端设备110可以通过机器学习和深度学习的方法，对已经标注好的领域语料进行训练学习，得到意图识别的模型(例如基于fastText的模型)。终端设备110进而基于该模型识别输入的提问所指示的意图。可以理解，终端设备110可以是本地确定提问所指示的意图，也可以是将提问发送至服务器130以由服务器130确定提问所指示的意图。In some embodiments, the terminal device 110 may always obtain extended information associated with text information in the image data by default. In some embodiments, the terminal device 110 may determine whether the target answer to the question can be determined from the image data. Specifically, the terminal device 110 may perform intent recognition on the question to determine the intent related to the question. Specifically, the terminal device 110 may recognize the intent based on the rule method of the dictionary and the template. Different intents will have different domain dictionaries, such as book titles, song titles, product names, object names, etc. The terminal device 110 may make a judgment based on the degree of match or overlap between the user's intent and the dictionary. The terminal device 110 may also discriminate the user's intent based on a machine learning model. The terminal device 110 may train and learn the already annotated domain corpus through machine learning and deep learning methods to obtain a model for intent recognition (e.g., a model based on fastText). The terminal device 110 then recognizes the intent indicated by the input question based on the model. It can be understood that the terminal device 110 may determine the intent indicated by the question locally, or send the question to the server 130 so that the server 130 determines the intent indicated by the question.
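The dictionary-based rule method described above can be illustrated with a minimal sketch. It scores each intent by how many question tokens overlap with that intent's domain dictionary and returns the best match. The intent names, the dictionary entries, and the function `recognize_intent` are all hypothetical and chosen only for illustration; the disclosure does not specify them, and a trained classifier (for example one based on fastText) could replace this rule.

```python
from typing import Optional

# Hypothetical domain dictionaries; the entries are illustrative only.
INTENT_DICTIONARIES = {
    "drug_dosage": {"dosage", "dose", "medicine", "drug", "tablets"},
    "object_count": {"how", "many", "count", "number"},
    "object_name": {"what", "name", "called"},
}


def recognize_intent(question: str) -> Optional[str]:
    """Return the intent whose dictionary overlaps most with the question.

    Returns None when no dictionary shares any token with the question.
    """
    tokens = set(question.lower().replace("?", " ").split())
    best_intent, best_overlap = None, 0
    for intent, vocabulary in INTENT_DICTIONARIES.items():
        overlap = len(tokens & vocabulary)
        if overlap > best_overlap:
            best_intent, best_overlap = intent, overlap
    return best_intent
```

With these assumed dictionaries, a question such as "the dosage of this medicine" maps to the hypothetical `drug_dosage` intent, while a question with no matching token yields `None`, which a caller could treat as a signal to fall back to a learned model.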

终端设备110获取到提问指示的意图后，可以基于该意图以及图像数据和/或文本信息判断从图像数据是否能够确定针对提问的目标回答。After obtaining the intent indicated by the question, the terminal device 110 can determine, based on the intent and on the image data and/or the text information, whether the target answer to the question can be determined from the image data.

如果基于意图和图像数据确定从图像数据能够确定针对提问的目标回答，终端设备110可以直接基于图像数据和提问确定对应的目标回答而无需获取扩展信息。例如，如果终端设备110捕获到的提问为“有几个杯子”对应的语音数据，终端设备110可以对该语音数据进行识别，确定该提问对应的意图是确定图像数据中杯子的数量。终端设备110进而基于该意图以及捕获到的图像数据判断从图像数据能够确定针对提问的目标回答。终端设备110可以直接对图像数据进行识别，确定图像数据中包括的杯子的数量为3个，即该意图对应的回答为“3个”。If it is determined based on the intent and the image data that the target answer to the question can be determined from the image data, the terminal device 110 can directly determine the corresponding target answer based on the image data and the question without obtaining extended information. For example, if the question captured by the terminal device 110 is voice data corresponding to "how many cups are there", the terminal device 110 can recognize the voice data and determine that the intent corresponding to the question is to determine the number of cups in the image data. The terminal device 110 then determines based on the intent and the captured image data that the target answer to the question can be determined from the image data. The terminal device 110 can directly recognize the image data and determine that the number of cups included in the image data is 3, that is, the answer corresponding to the intent is "3".

又例如，如果终端设备110捕获到的提问为“药品的用药剂量”对应的语音数据，终端设备110可以对该语音数据进行识别，确定该提问对应的意图是确定图像数据中药品对应的用药剂量。在图像数据中药品包装上的文字过小、药品包装上未包含用药剂量或者药品包装上的文字模糊等情况下，终端设备110可以基于该意图以及捕获到的图像数据中提取得到的文本信息判断从图像数据无法确定针对提问的目标回答。在确定从图像数据无法确定针对提问的目标回答的情况下，终端设备110可以获取与文本信息相关联的扩展信息。As another example, if the question captured by the terminal device 110 is voice data corresponding to "dosage of the drug", the terminal device 110 can recognize the voice data and determine that the intent corresponding to the question is to determine the dosage of the drug in the image data. Where the text on the drug packaging in the image data is too small, the drug packaging does not state the dosage, or the text on the drug packaging is blurred, the terminal device 110 can determine, based on the intent and the text information extracted from the captured image data, that the target answer to the question cannot be determined from the image data. Upon determining that the target answer to the question cannot be determined from the image data, the terminal device 110 can obtain extended information associated with the text information.
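The decision in the two examples above (the cup count can be answered from the image, the dosage cannot) can be sketched as a simple check. This is only one possible heuristic under stated assumptions: the intent names, the cue phrases in `REQUIRED_CUES`, and the function `needs_extended_info` are hypothetical, and the disclosure does not prescribe how the determination is made.

```python
from typing import Sequence

# Hypothetical: intents that can be answered by looking at the image alone.
VISUAL_INTENTS = {"object_count"}

# Hypothetical: phrases that should appear in the extracted text for the
# image to be considered sufficient for a given intent.
REQUIRED_CUES = {
    "drug_dosage": ("dosage", "at a time", "times a day", "per day"),
}


def needs_extended_info(intent: str, ocr_text_lines: Sequence[str]) -> bool:
    """Return True when the extracted text is judged insufficient."""
    if intent in VISUAL_INTENTS:
        return False
    cues = REQUIRED_CUES.get(intent, ())
    text = " ".join(ocr_text_lines).lower()
    # Without any matching cue (or for an unknown intent), fall back to
    # fetching extended information.
    return not any(cue in text for cue in cues)
```

In this sketch an unknown intent defaults to fetching extended information, which mirrors the embodiment in which extended information is always obtained by default.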

在一些实施例中，终端设备110可以访问知识库，并基于文本信息从知识库中取得扩展信息。知识库可以包括知识图谱。具体地，终端设备110可以从文本信息确定至少一个关键词，并从知识库取得与至少一个关键词相关联的扩展信息。终端设备110可以将关键词匹配到知识图谱中的实体、关系、属性等，以获得对应的信息作为扩展信息。知识图谱是计算机科学中的重要数据表示形式，在知识图谱中，节点表示实体，节点与节点之间的边表示实体与实体之间的关系，节点和边均可以具有各自的属性，该属性即为实体或关系的属性。可以理解，知识库还可以为其他任意合适的知识库和/或数据库等，这里不做限制。In some embodiments, the terminal device 110 can access the knowledge base and obtain extended information from the knowledge base based on the text information. The knowledge base may include a knowledge graph. Specifically, the terminal device 110 can determine at least one keyword from the text information and obtain extended information associated with at least one keyword from the knowledge base. The terminal device 110 can match the keywords to entities, relationships, attributes, etc. in the knowledge graph to obtain corresponding information as extended information. The knowledge graph is an important data representation form in computer science. In the knowledge graph, nodes represent entities, and the edges between nodes represent the relationships between entities. Both nodes and edges can have their own attributes, which are the attributes of the entity or relationship. It can be understood that the knowledge base can also be any other suitable knowledge base and/or database, etc., which is not limited here.
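As a rough illustration of matching keywords against a knowledge graph, the sketch below stores the graph as subject-relation-object triples and returns the attributes of every entity that matches a keyword. The `Triple` type, the relation names, the exact-match rule, and the sample entries (adapted from the drug example used in this description) are assumptions made for illustration; a real knowledge base would typically use a graph store and fuzzier entity linking.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Triple:
    subject: str   # entity (node)
    relation: str  # edge or attribute name
    obj: str       # related entity or attribute value


# Hypothetical miniature knowledge graph.
KNOWLEDGE_GRAPH = [
    Triple("Huanglian Shangqing Tablets", "effect",
           "dispels wind and clears heat, purges fire and relieves pain"),
    Triple("Huanglian Shangqing Tablets", "usage_and_dosage",
           "oral, 6 tablets at a time, twice a day"),
    Triple("Huanglian Shangqing Tablets", "category", "OTC"),
]


def extract_keywords(text_lines: Sequence[str],
                     graph: Sequence[Triple]) -> List[str]:
    """Keep only the extracted text lines that name an entity in the graph."""
    entities = {triple.subject for triple in graph}
    return [line for line in text_lines if line in entities]


def get_extended_info(keywords: Sequence[str],
                      graph: Sequence[Triple]) -> Dict[str, Dict[str, str]]:
    """Collect relation/value pairs for every entity matching a keyword."""
    info: Dict[str, Dict[str, str]] = {}
    for triple in graph:
        if triple.subject in keywords:
            info.setdefault(triple.subject, {})[triple.relation] = triple.obj
    return info
```

For instance, passing the extracted lines "XX brand", "Huanglian Shangqing Tablets" and "OTC" to `extract_keywords` keeps only the drug name, and `get_extended_info` then returns its effect, dosage, and category entries.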

为了能够获取到更精准的扩展信息，终端设备110预先的知识库可以包括对应多个领域的多个候选知识库。多个候选知识库包括但不限于对应医学领域的医学知识库、对应农业的农学知识库、对应宠物的宠物知识库、对应食品的食品知识库等等。终端设备110可以确定图像数据或文本信息对应的目标领域，进而从多个候选知识库确定与目标领域对应的目标知识库。示例性的，若图像数据的文本信息中包含药品的名称，则终端设备110可以确定文本信息对应的领域为医学领域，进而从多个候选知识库中将对应医学领域的医学知识库确定为目标知识库。In order to obtain more accurate extended information, the knowledge base prepared in advance for the terminal device 110 may include multiple candidate knowledge bases corresponding to multiple fields. The multiple candidate knowledge bases include, but are not limited to, a medical knowledge base corresponding to the medical field, an agronomic knowledge base corresponding to agriculture, a pet knowledge base corresponding to pets, a food knowledge base corresponding to food, and the like. The terminal device 110 may determine the target field corresponding to the image data or text information, and then determine the target knowledge base corresponding to the target field from the multiple candidate knowledge bases. Exemplarily, if the text information of the image data contains the name of a drug, the terminal device 110 may determine that the field corresponding to the text information is the medical field, and then determine the medical knowledge base corresponding to the medical field as the target knowledge base from the multiple candidate knowledge bases.

进一步地，终端设备110可以从目标知识库取得与至少一个关键词相关联的扩展信息。类似的，若目标知识库为知识图谱，终端设备110可以将至少一个关键词与知识图谱中的实体、关系、属性等进行匹配，以获得对应的信息作为扩展信息。Furthermore, the terminal device 110 may obtain extended information associated with at least one keyword from the target knowledge base. Similarly, if the target knowledge base is a knowledge graph, the terminal device 110 may match at least one keyword with entities, relationships, attributes, etc. in the knowledge graph to obtain corresponding information as extended information.

在一些实施例中，终端设备110还可以基于图像数据获取扩展信息。具体地，终端设备110可以对图像数据进行图像识别，进而基于图像识别的结果从图像数据对应的领域的目标知识库中取得扩展信息。示例性的，若对图像数据进行图像识别的识别结果指示图像数据中包含猫咪，则终端设备110可以从多个候选知识库中将对应宠物的宠物知识库确定为目标知识库，进而从目标知识库中取得与猫咪相关联的扩展信息。In some embodiments, the terminal device 110 may also obtain extended information based on the image data. Specifically, the terminal device 110 may perform image recognition on the image data, and then obtain extended information from a target knowledge base in a field corresponding to the image data based on the result of the image recognition. Exemplarily, if the recognition result of performing image recognition on the image data indicates that the image data contains a cat, the terminal device 110 may determine a pet knowledge base corresponding to the pet as a target knowledge base from multiple candidate knowledge bases, and then obtain extended information associated with the cat from the target knowledge base.
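Selecting a target knowledge base from several candidate knowledge bases can be sketched as a small routing step. The labels passed in may come either from the extracted text or from an image recognition result. The cue words in `DOMAIN_CUES`, the contents of `CANDIDATE_KNOWLEDGE_BASES`, and both function names are hypothetical placeholders; the disclosure does not state how the target field is determined.

```python
from typing import Dict, Optional, Sequence, Tuple

# Hypothetical cue words per field.
DOMAIN_CUES = {
    "medical": {"tablets", "capsules", "otc"},
    "pet": {"cat", "dog"},
    "food": {"ingredients", "calories"},
}

# Hypothetical candidate knowledge bases, one per field.
CANDIDATE_KNOWLEDGE_BASES: Dict[str, Dict[str, str]] = {
    "medical": {"huanglian shangqing tablets": "6 tablets at a time, twice a day"},
    "pet": {"cat": "placeholder entry about cats"},
    "food": {},
}


def determine_target_domain(labels: Sequence[str]) -> Optional[str]:
    """Pick the field whose cue words overlap most with the labels."""
    tokens = set()
    for label in labels:
        tokens.update(label.lower().split())
    scores = {domain: len(tokens & cues) for domain, cues in DOMAIN_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def select_knowledge_base(labels: Sequence[str]) -> Tuple[Optional[str], Dict[str, str]]:
    """Return the target field and its knowledge base (empty if no match)."""
    domain = determine_target_domain(labels)
    return domain, CANDIDATE_KNOWLEDGE_BASES.get(domain, {})
```

Under these assumptions, text labels such as "Huanglian Shangqing Tablets" and "OTC" route to the medical knowledge base, while an image recognition label "cat" routes to the pet knowledge base.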

在框240，终端设备110基于图像数据和扩展信息来确定针对提问的目标回答。At block 240, the terminal device 110 determines a target answer to the question based on the image data and the extended information.

在一些实施例中，终端设备110可以在本地利用图像数据、扩展信息来确定针对提问的回答。具体地，终端设备110例如可以利用经训练的问答模型基于图像数据、扩展信息以及提问确定针对图像数据的回答。在一些实施例中，在提问是以与语音形式捕获的提问的情况下，终端设备110可以先将提问对应的语音数据转换成文本序列。终端设备110可以基于语音技术(例如，自动语音识别(ASR)技术)将语音数据转换成文本序列，进而向问答模型提供图像数据、扩展信息以及文本序列。 In some embodiments, the terminal device 110 can locally use the image data and the extended information to determine the answer to the question. Specifically, the terminal device 110 can, for example, use the trained question-answering model to determine the answer to the image data based on the image data, the extended information and the question. In some embodiments, when the question is a question captured in the form of voice, the terminal device 110 can first convert the voice data corresponding to the question into a text sequence. The terminal device 110 can convert the voice data into a text sequence based on voice technology (e.g., automatic speech recognition (ASR) technology), and then provide the image data, extended information and text sequence to the question-answering model.

在一些实施例中，终端设备110可以将提问对应的文本序列、图像数据以及扩展信息一同输入至经训练的问答模型中，以让问答模型输出与提问相对应的回答。在这种情况下，问答模型是多模态的问答模型，其输入包括图像模态和文本模态的数据。备选地或附加地，在一些实施例中，终端设备110还可以将文本序列、图像数据中提取出的文本信息以及扩展信息一同输入至经训练的问答模型中，以让问答模型输出与提问相对应的回答。在这种情况下，问答模型可以不是多模态的问答模型，其输入是文本模态的数据。In some embodiments, the terminal device 110 may input the text sequence, image data, and extended information corresponding to the question into a trained question-answering model, so that the question-answering model outputs an answer corresponding to the question. In this case, the question-answering model is a multimodal question-answering model, and its input includes data in image modality and text modality. Alternatively or additionally, in some embodiments, the terminal device 110 may also input the text sequence, text information extracted from the image data, and extended information into a trained question-answering model, so that the question-answering model outputs an answer corresponding to the question. In this case, the question-answering model may not be a multimodal question-answering model, and its input is data in text modality.
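The two input variants described above (image plus text for a multimodal model, or text only) can be sketched with a single helper that assembles the model input. The `ModelInput` container, the prompt layout, and `build_model_input` are hypothetical; the disclosure does not define a prompt format, and the question-answering model itself is not shown.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass
class ModelInput:
    prompt: str                    # text-modality part of the input
    image: Optional[bytes] = None  # set only when a multimodal model is used


def build_model_input(question: str,
                      extended_info: Dict[str, str],
                      ocr_text_lines: Optional[Sequence[str]] = None,
                      image: Optional[bytes] = None) -> ModelInput:
    """Combine the question, extended information, and image and/or text."""
    if not ocr_text_lines and image is None:
        raise ValueError("provide extracted text, image data, or both")
    parts = []
    if ocr_text_lines:
        parts.append("Text extracted from the image:")
        parts.extend(f"- {line}" for line in ocr_text_lines)
    parts.append("Extended information:")
    parts.extend(f"- {key}: {value}" for key, value in extended_info.items())
    parts.append(f"Question: {question}")
    return ModelInput(prompt="\n".join(parts), image=image)
```

Passing only `ocr_text_lines` yields a text-only input suitable for a language model, whereas passing `image` keeps the raw image data alongside the prompt for a multimodal model.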

可以理解，经训练的问答模型可以被部署在服务器130，也可以被部署在本地。如果问答模型被部署在服务器，终端设备110可以通过与服务器130之间的通信利用经训练的问答模型以实现问答功能。如果问答模型被部署在本地，终端设备110可以直接利用被部署在本地的经训练的问答模型实现问答功能。It is understood that the trained question-answering model can be deployed on the server 130 or locally. If the question-answering model is deployed on the server, the terminal device 110 can use the trained question-answering model to implement the question-answering function through communication with the server 130. If the question-answering model is deployed locally, the terminal device 110 can directly use the trained question-answering model deployed locally to implement the question-answering function.

问答模型例如可以是语言模型(LM)，语言模型通过从大量语料中学习，能够具备问答能力。语言模型可以包括统计语言模型和神经网络语言模型，其中，相较于统计语言模型，神经网络语言模型具有更为强大的泛化能力以及预测能力。在一些实施例中，为更好地实现确定问答功能，利用的经训练的问答模型为神经网络语言模型。The question-answering model may be, for example, a language model (LM), which can have question-answering capabilities by learning from a large amount of corpus. The language model may include a statistical language model and a neural network language model, wherein the neural network language model has a more powerful generalization and prediction capability than the statistical language model. In some embodiments, in order to better implement the determination of the question-answering function, the trained question-answering model used is a neural network language model.

进一步地，由于神经网络语言模型的问答能力可以随着用于训练的数据量以及模型参数量的增多而提升，在一些实施例中，为了确定出更为准确的回答，终端设备110利用具有大规模参数、数据量和计算量的神经网络语言模型，以满足具体应用中的问答质量要求。当语言模型的规模达到一定程度(例如，由更多数据量来训练)时，就会具有符合应用期望的认知、常识以及逻辑推理能力。在一些实施例中，为保证模型的效果，可以通过预训练的方式来确定模型的参数权重。Furthermore, since the question-answering capability of the neural network language model can be improved as the amount of data used for training and the amount of model parameters increase, in some embodiments, in order to determine a more accurate answer, the terminal device 110 uses a neural network language model with large-scale parameters, data volume, and computing volume to meet the quality requirements of question-answering in specific applications. When the scale of the language model reaches a certain level (for example, trained by more data), it will have cognitive, common sense, and logical reasoning capabilities that meet the expectations of the application. In some embodiments, in order to ensure the effectiveness of the model, the parameter weights of the model can be determined by pre-training.

在一些实施例中，终端设备110还可以从图像数据确定针对提问的候选回答，并基于扩展信息来执行针对候选回答的校正，得到目标回答。校正例如可以包括纠错和补全中的至少一项。具体地，终端设备110例如可以利用经训练的问答模型基于图像数据以及提问确定针对图像数据的候选回答。同样的，在提问是以与语音形式捕获的提问的情况下，终端设备110需要先将提问对应的语音数据转换成文本序列，进而将文本序列与图像数据一同输入至经训练的问答模型中，以让问答模型输出与提问相对应的候选回答。对于基于图像数据的候选回答的校正(例如，纠错和/或补全)可以是由问答模型自动实现。In some embodiments, the terminal device 110 may also determine a candidate answer to the question from the image data, and perform correction on the candidate answer based on the extended information to obtain the target answer. The correction may include, for example, at least one of error correction and completion. Specifically, the terminal device 110 may, for example, use the trained question-answering model to determine a candidate answer for the image data based on the image data and the question. Similarly, in the case where the question is captured in the form of voice, the terminal device 110 needs to first convert the voice data corresponding to the question into a text sequence, and then input the text sequence and the image data into the trained question-answering model, so that the question-answering model outputs the candidate answer corresponding to the question. Correction (e.g., error correction and/or completion) of the candidate answer obtained from the image data can be performed automatically by the question-answering model.

在图像数据的质量较差的情况下，终端设备110直接基于图像数据确定的候选回答可能是错误的。因此，终端设备110进而可以基于扩展信息对候选回答进行校正，以得到针对提问的准确的目标回答。终端设备110可以基于扩展信息中与提问对应的信息，对候选回答进行校正以得到正确的目标回答。In the case of poor quality of image data, the candidate answers determined directly by the terminal device 110 based on the image data may be wrong. Therefore, the terminal device 110 can further correct the candidate answers based on the extended information to obtain an accurate target answer to the question. The terminal device 110 can correct the candidate answers based on the information corresponding to the question in the extended information to obtain the correct target answer.

示例性的，在提问为“药品名称以及作用”的情况下，若图像数据存在曝光不当的问题，图像数据中药品包装上的文字区域可能是不清晰的。终端设备110基于这样的图像数据得到的候选回答可能是不完整的，例如候选回答可以为“黄连上清片，可以散风清热”。终端设备110利用扩展信息对候选回答进行补全后可以得到目标回答为“黄连上清片，可以散风清热，泻火止痛”。由此，可以利用扩展信息对候选回答进行补全式的校正，有助于提升回答的完整性。For example, when the question is "name and function of the drug", if the image data has an improper exposure problem, the text area on the drug packaging in the image data may be unclear. The candidate answer obtained by the terminal device 110 based on such image data may be incomplete. For example, the candidate answer may be "Huanglian Shangqing Tablets can dispel wind and clear heat." After the terminal device 110 uses the extended information to complete the candidate answer, the target answer can be "Huanglian Shangqing Tablets can dispel wind and clear heat, purge fire and relieve pain." Therefore, the extended information can be used to complete the candidate answer, which helps to improve the completeness of the answer.

示例性的，在提问为“这个药的用药剂量”的情况下，若图像数据中药品包装上的文字不包含相应内容或者对应的文字过小，终端设备110基于这样的图像数据得到的候选回答可能是错误的。例如候选回答可以为“24片，2板”。终端设备110利用扩展信息对候选回答进行纠错后可以得到目标回答为“一次6片，一日两次”。由此，可以利用扩展信息对候选回答进行纠错式的校正，有助于提升回答的正确性。For example, when the question is "What is the dosage of this medicine?", if the text on the medicine package in the image data does not contain the corresponding content or the corresponding text is too small, the candidate answer obtained by the terminal device 110 based on such image data may be wrong. For example, the candidate answer may be "24 tablets, 2 plates". After the terminal device 110 uses the extended information to correct the candidate answer, the target answer may be "6 tablets at a time, twice a day". Therefore, the extended information can be used to correct the candidate answer, which helps to improve the correctness of the answer.
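The difference between completion and error correction in the two examples above can be made concrete with a deliberately simple rule. In the disclosure the correction may be performed automatically by the question-answering model; the string-prefix rule and the function `correct_candidate` below are only a hypothetical stand-in that assumes a single authoritative reference string has already been taken from the extended information.

```python
from typing import Tuple


def correct_candidate(candidate: str, reference: str) -> Tuple[str, str]:
    """Correct a candidate answer against reference text.

    Returns the corrected answer and the kind of correction applied:
    "unchanged", "completion", or "error_correction".
    """
    if candidate == reference:
        return candidate, "unchanged"
    if reference.startswith(candidate):
        # The candidate is a truncated prefix of the reference: complete it.
        return reference, "completion"
    # The candidate disagrees with the reference: replace it.
    return reference, "error_correction"
```

Under this rule, a truncated description of the drug's effect is treated as a completion case, and a package-size string returned in place of the dosage is treated as an error-correction case.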

参考图4，图4示出了根据本公开的一些实施例的回答的流程400的示意图。流程400可以被实现在终端设备110处。为便于讨论，将参考图1的环境100来描述流程400。注意，图4中示出的图像数据401、提问402、模型输入403以及回答404仅是为了解释说明的示例，而不是指示任何限制。 Referring to FIG. 4 , FIG. 4 shows a schematic diagram of a process 400 of answering according to some embodiments of the present disclosure. The process 400 may be implemented at the terminal device 110. For ease of discussion, the process 400 will be described with reference to the environment 100 of FIG. 1 . Note that the image data 401, question 402, model input 403, and answer 404 shown in FIG. 4 are merely examples for explanation and do not indicate any limitation.

在一些实施例中，在获取到图像数据401后，终端设备110对图像数据401执行文本识别410，以从中提取文本信息，例如从图像数据401中提取到文本信息“XX品牌”、“黄连上清片”、“24片·2板”、“散风清热，泻火止痛”以及“OTC”等。终端设备110从知识库420(例如，医学相关的知识图谱)取得与文本信息相关联的扩展信息。例如，可以从知识库420获取扩展信息“作用功效：本品具有散风清热、泻火止痛的作用。可治疗急性结膜炎、急性化脓性中耳炎、牙宣、喉痹以及口疮、复发性口疮”和“用法用量：口服用药，一次6片，一日2次。建议患者在医师指导下用药”，等等。In some embodiments, after acquiring the image data 401, the terminal device 110 performs text recognition 410 on the image data 401 to extract text information therefrom, for example, extracting the text information "XX brand", "Huanglian Shangqing Tablets", "24 tablets·2 plates", "disperse wind and clear heat, purge fire and relieve pain" and "OTC" from the image data 401. The terminal device 110 obtains extended information associated with the text information from the knowledge base 420 (for example, a medical-related knowledge graph). For example, the following extended information can be obtained from the knowledge base 420: "Effects and efficacy: This product has the effects of dispersing wind and clearing heat, purging fire and relieving pain. It can treat acute conjunctivitis, acute suppurative otitis media, toothache, laryngeal paralysis, as well as mouth ulcers and recurrent mouth ulcers" and "Usage and dosage: oral medication, 6 tablets at a time, twice a day. Patients are advised to take the medicine under the guidance of a physician", and so on.

在一些实施例中，如图4所示，终端设备110可以将文本信息、扩展信息以及提问402一同确定为问答模型的模型输入403。在其他示例，终端设备110也可以基于图像数据401、扩展信息以及提问402(例如，其可以是语音的形式)来确定为问答模型的模型输入403。在确定模型输入403时，终端设备110可以将捕获的语音数据识别为文字，并确定提问的意图。In some embodiments, as shown in FIG4 , the terminal device 110 may determine the text information, the extended information, and the question 402 together as the model input 403 of the question-answering model. In other examples, the terminal device 110 may also determine the model input 403 of the question-answering model based on the image data 401, the extended information, and the question 402 (for example, which may be in the form of voice). When determining the model input 403, the terminal device 110 may recognize the captured voice data as text and determine the intention of the question.

在确定模型输入403后，终端设备110将模型输入403提供给问答模型430。问答模型430可以基于模型输入403来确定与提问对应的回答404。终端设备110可以获取问答模型输出的回答404。例如，输出的回答404可以为“一次6片，一日两次”。After determining the model input 403, the terminal device 110 provides the model input 403 to the question-answering model 430. The question-answering model 430 may determine an answer 404 corresponding to the question based on the model input 403. The terminal device 110 may obtain the answer 404 output by the question-answering model. For example, the output answer 404 may be "6 tablets at a time, twice a day".

在一些实施例中，终端设备110确定回答后，可以通过声音播放装置以语音形式播放回答。如图3所示，终端设备110可以通过扬声器播放回答音频。在一些实施例中，回答可以是文本形式。终端设备110可以通过语音合成(TTS)将文本转换为语音进行输出。这样，可以方便用户，特别是有视力障碍的用户快速获知答案。In some embodiments, after the terminal device 110 determines the answer, the answer can be played in voice form through a sound playing device. As shown in Figure 3, the terminal device 110 can play the answer audio through a speaker. In some embodiments, the answer can be in text form. The terminal device 110 can convert the text into voice through speech synthesis (TTS) for output. In this way, it is convenient for users, especially users with visual impairments, to quickly know the answer.

在一些实施例中，备选地，终端设备110还可以通过显示屏以文字形式呈现回答。在一些实施例中，终端设备110还可以附加震动形式以及视觉形式输出回答。视觉形式例如可以包括放大图像、突出显示图像等等。示例性的，在用户输入的语音数据指示查询图像数据中某一物体的名称时，终端设备110在播放包含该物体名称的回答音频时，可以在显示屏上放大图像数据以突出显示该物体。In some embodiments, alternatively, the terminal device 110 may also present the answer in text form on a display screen. In some embodiments, the terminal device 110 may additionally output the answer in a vibration form and a visual form. The visual form may include, for example, enlarging the image, highlighting the image, and so on. Exemplarily, when the voice data input by the user indicates a query for the name of an object in the image data, the terminal device 110 may, while playing the answer audio containing the name of the object, magnify the image data on the display screen to highlight the object.

根据本公开的实施例，在所捕获的图像质量不佳导致无法完成问答的情况下，相比于要求用户重新采集符合要求的图像，本公开所提出的方案能够显著提升问答的效率和准确性。以此方式，能够在多模态数据的问答场景中，引入知识库来扩展对提问的准确回答能力。由此，可以在图像数据不全、不足时也能够为用户提供即时、准确的问答服务。According to the embodiments of the present disclosure, when the captured image quality is poor and the question and answer cannot be completed, the solution proposed by the present disclosure can significantly improve the efficiency and accuracy of the question and answer, compared with requiring the user to re-collect images that meet the requirements. In this way, in the question and answer scenario of multimodal data, a knowledge base can be introduced to expand the ability to accurately answer questions. As a result, it is possible to provide users with instant and accurate question and answer services even when the image data is incomplete or insufficient.

图5示出了根据本公开的一些实施例的问答的装置500的示意性结构框图。装置500例如可以被实现在或被包括在终端设备110中。装置500中的各个模块/组件可以由硬件、软件、固件或者它们的任意组合来实现。Fig. 5 shows a schematic structural block diagram of a question-answering apparatus 500 according to some embodiments of the present disclosure. The apparatus 500 may be implemented in or included in the terminal device 110. Each module/component in the apparatus 500 may be implemented by hardware, software, firmware or any combination thereof.

如图所示，装置500包括数据捕获模块510，被配置为响应于检测到问答发起操作，利用用户的设备捕获图像数据和针对图像数据的提问。装置500还包括文本信息提取模块520，被配置为从图像数据提取文本信息。装置500还包括扩展信息获取模块530，被配置为获取与文本信息相关联的扩展信息。装置500还包括目标回答确定模块540，被配置为基于图像数据和扩展信息来确定针对提问的目标回答。As shown in the figure, the device 500 includes a data capture module 510, which is configured to capture image data and questions for the image data using the user's device in response to detecting a question-and-answer initiation operation. The device 500 also includes a text information extraction module 520, which is configured to extract text information from the image data. The device 500 also includes an extended information acquisition module 530, which is configured to acquire extended information associated with the text information. The device 500 also includes a target answer determination module 540, which is configured to determine a target answer for the question based on the image data and the extended information.

在一些实施例中，扩展信息获取模块530，包括：关键词确定模块，被配置为从文本信息确定至少一个关键词；以及扩展信息取得模块，被配置为从知识库取得与至少一个关键词相关联的扩展信息。In some embodiments, the extended information acquisition module 530 includes: a keyword determination module configured to determine at least one keyword from text information; and an extended information acquisition module configured to acquire extended information associated with the at least one keyword from a knowledge base.

在一些实施例中，知识库包括对应多个领域的多个候选知识库，并且扩展信息取得模块包括：目标领域确定模块，被配置为确定图像数据或文本信息对应的目标领域；目标知识库确定模块，被配置为从多个候选知识库确定与目标领域对应的目标知识库；以及信息取得模块，被配置为从目标知识库取得与至少一个关键词相关联的扩展信息。In some embodiments, the knowledge base includes multiple candidate knowledge bases corresponding to multiple fields, and the extended information acquisition module includes: a target field determination module, configured to determine the target field corresponding to the image data or text information; a target knowledge base determination module, configured to determine the target knowledge base corresponding to the target field from multiple candidate knowledge bases; and an information acquisition module, configured to acquire extended information associated with at least one keyword from the target knowledge base.

在一些实施例中，知识库包括知识图谱。In some embodiments, the knowledge base includes a knowledge graph.

在一些实施例中，扩展信息获取模块530，包括：确定模块，被配置为确定从图像数据是否能够确定针对提问的目标回答；以及获取模块，被配置为如果从图像数据无法确定针对提问的目标回答，获取扩展信息。In some embodiments, the extended information acquisition module 530 includes: a determination module configured to determine whether the target answer to the question can be determined from the image data; and an acquisition module configured to acquire the extended information if the target answer to the question cannot be determined from the image data.

在一些实施例中，目标回答确定模块540，包括：候选回答确定模块，被配置为从图像数据确定针对提问的候选回答；以及目标回答获得模块，被配置为基于扩展信息来执行针对候选回答的校正，得到目标回答，校正包括纠错和补全中的至少一项。In some embodiments, the target answer determination module 540 includes: a candidate answer determination module, configured to determine candidate answers to the question from image data; and a target answer acquisition module, configured to perform correction on the candidate answers based on extended information to obtain the target answer, the correction including at least one of error correction and completion.

在一些实施例中，目标回答是利用经训练的问答模型来确定的，问答模型的模型输入包括图像数据和文本信息中的至少一项、扩展信息和提问。In some embodiments, the target answer is determined using a trained question-answering model, the model input of the question-answering model including at least one of image data and text information, extended information, and a question.

在一些实施例中，提问包括以语音形式捕获的提问。In some embodiments, the question comprises a question captured in speech form.

装置500中所包括的单元可以利用各种方式来实现，包括软件、硬件、固件或其任意组合。在一些实施例中，一个或多个单元可以使用软件和/或固件来实现，例如存储在存储介质上的机器可执行指令。除了机器可执行指令之外或者作为替代，装置500中的部分或者全部单元可以至少部分地由一个或多个硬件逻辑组件来实现。作为示例而非限制，可以使用的示范类型的硬件逻辑组件包括现场可编程门阵列(FPGA)、专用集成电路(ASIC)、专用标准品(ASSP)、片上系统(SOC)、复杂可编程逻辑器件(CPLD)，等等。The units included in the device 500 can be implemented in various ways, including software, hardware, firmware, or any combination thereof. In some embodiments, one or more units can be implemented using software and/or firmware, such as machine executable instructions stored on a storage medium. In addition to or as an alternative to machine executable instructions, some or all of the units in the device 500 can be implemented at least in part by one or more hardware logic components. As an example and not limitation, exemplary types of hardware logic components that can be used include field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chips (SOCs), complex programmable logic devices (CPLDs), and the like.

图6示出了其中可以实施本公开的一个或多个实施例的电子设备600的框图。应当理解，图6所示出的电子设备600仅仅是示例性的，而不应当构成对本文所描述的实施例的功能和范围的任何限制。图6所示出的电子设备600可以用于实现图1的终端设备110。FIG. 6 shows a block diagram of an electronic device 600 in which one or more embodiments of the present disclosure may be implemented. It should be understood that the electronic device 600 shown in FIG. 6 is merely exemplary and should not constitute any limitation on the functionality and scope of the embodiments described herein. The electronic device 600 shown in FIG. 6 may be used to implement the terminal device 110 of FIG. 1.

如图6所示，电子设备600是通用电子设备的形式。电子设备600的组件可以包括但不限于一个或多个处理器或处理单元610、存储器620、存储设备630、一个或多个通信单元640、一个或多个输入设备650以及一个或多个输出设备660。处理单元610可以是实际或虚拟处理器并且能够根据存储器620中存储的程序来执行各种处理。在多处理器系统中，多个处理单元并行执行计算机可执行指令，以提高电子设备600的并行处理能力。As shown in FIG6 , the electronic device 600 is in the form of a general electronic device. The components of the electronic device 600 may include, but are not limited to, one or more processors or processing units 610, a memory 620, a storage device 630, one or more communication units 640, one or more input devices 650, and one or more output devices 660. The processing unit 610 may be an actual or virtual processor and is capable of performing various processes according to a program stored in the memory 620. In a multi-processor system, multiple processing units execute computer executable instructions in parallel to improve the parallel processing capability of the electronic device 600.

The electronic device 600 typically includes a plurality of computer storage media. Such media may be any available media accessible to the electronic device 600, including but not limited to volatile and non-volatile media, and removable and non-removable media. The memory 620 may be a volatile memory (e.g., a register, a cache, a random access memory (RAM)), a non-volatile memory (e.g., a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory), or some combination thereof. The storage device 630 may be a removable or non-removable medium and may include a machine-readable medium, such as a flash drive, a magnetic disk, or any other medium that can be used to store information and/or data (e.g., training data for training) and that can be accessed within the electronic device 600.

The electronic device 600 may further include additional removable/non-removable, volatile/non-volatile storage media. Although not shown in FIG. 6, a magnetic disk drive for reading from or writing to a removable, non-volatile magnetic disk (e.g., a "floppy disk") and an optical disc drive for reading from or writing to a removable, non-volatile optical disc may be provided. In these cases, each drive may be connected to a bus (not shown) by one or more data media interfaces. The memory 620 may include a computer program product 625 having one or more program modules configured to perform the various methods or actions of the various embodiments of the present disclosure.

The communication unit 640 enables communication with other electronic devices through a communication medium. Additionally, the functions of the components of the electronic device 600 may be implemented by a single computing cluster or by multiple computing machines that are able to communicate over communication connections. Accordingly, the electronic device 600 may operate in a networked environment using logical connections to one or more other servers, network personal computers (PCs), or other network nodes.

The input device 650 may be one or more input devices, such as a mouse, a keyboard, or a trackball. The output device 660 may be one or more output devices, such as a display, a speaker, or a printer. As needed, the electronic device 600 may also communicate, through the communication unit 640, with one or more external devices (not shown) such as storage devices and display devices, with one or more devices that enable a user to interact with the electronic device 600, or with any device (e.g., a network card or a modem) that enables the electronic device 600 to communicate with one or more other electronic devices. Such communication may be performed via an input/output (I/O) interface (not shown).

According to an exemplary implementation of the present disclosure, a computer-readable storage medium is provided, on which computer-executable instructions are stored, wherein the computer-executable instructions are executed by a processor to implement the method described above. According to an exemplary implementation of the present disclosure, a computer program product is also provided, which is tangibly stored on a non-transitory computer-readable medium and includes computer-executable instructions that are executed by a processor to implement the method described above.

Various aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, apparatuses, devices, and computer program products implemented according to the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processing unit of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processing unit of the computer or other programmable data processing apparatus, create an apparatus for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium, where they cause a computer, a programmable data processing apparatus, and/or other devices to operate in a specific manner, such that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.

The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus, or another device, so that a series of operational steps is performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, such that the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.

The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various implementations of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions that contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.

Various implementations of the present disclosure have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed implementations. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described implementations. The terms used herein were chosen to best explain the principles of the implementations, their practical applications, or their improvements over technologies available in the market, or to enable others of ordinary skill in the art to understand the implementations disclosed herein.

Claims

A method for question answering, comprising:

in response to detecting a question answering initiation operation, capturing, using a device of a user, image data and a question about the image data;

extracting text information from the image data;

acquiring extended information associated with the text information; and

determining a target answer to the question based on the image data and the extended information.
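For illustration only, the four steps recited above can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not the claimed implementation: `Capture`, `extract_text`, `acquire_extended_info`, `answer_question`, and the `answer_fn` callable are hypothetical names, the image is represented by the text it contains (a real system would run text recognition on pixels), and the knowledge base is a plain dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Capture:
    """Hypothetical container for what the user's device captured."""
    image_text: str  # stand-in for the image; a real system would run OCR on pixels
    question: str


def extract_text(capture: Capture) -> str:
    # Text extraction step; in this sketch the "image" already carries its text.
    return capture.image_text.strip()


def acquire_extended_info(text: str, knowledge_base: Dict[str, str]) -> List[str]:
    # Return every knowledge-base entry whose key occurs in the extracted text.
    lowered = text.lower()
    return [fact for key, fact in knowledge_base.items() if key.lower() in lowered]


# Assumed signature of the answering component: (question, text, extended) -> answer.
AnswerFn = Callable[[str, str, List[str]], str]


def answer_question(capture: Capture,
                    knowledge_base: Dict[str, str],
                    answer_fn: AnswerFn) -> str:
    text = extract_text(capture)
    extended = acquire_extended_info(text, knowledge_base)
    # The answer is determined from both the image content and the extended information.
    return answer_fn(capture.question, text, extended)
```

Passing the answering component in as a callable keeps the sketch independent of any particular question-answering model.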

The method according to claim 1, wherein acquiring the extended information comprises:

determining at least one keyword from the text information; and

acquiring, from a knowledge base, the extended information associated with the at least one keyword.

The method according to claim 2, wherein the knowledge base comprises a plurality of candidate knowledge bases corresponding to a plurality of domains, and acquiring the extended information from the knowledge base comprises:

determining a target domain corresponding to the image data or the text information;

determining, from the plurality of candidate knowledge bases, a target knowledge base corresponding to the target domain; and

acquiring, from the target knowledge base, the extended information associated with the at least one keyword.
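As a hedged illustration of selecting one knowledge base per domain, the following sketch routes a keyword lookup through a domain chosen from the extracted text. The domain names, cue words, knowledge entries, and function names are all hypothetical, and the cue-word overlap heuristic is only one simple way a target domain might be determined.

```python
from typing import Dict, List, Set

# Hypothetical candidate knowledge bases: domain -> {keyword -> fact}.
CANDIDATE_KBS: Dict[str, Dict[str, str]] = {
    "art": {"mona lisa": "Painted by Leonardo da Vinci."},
    "botany": {"ginkgo": "Ginkgo biloba is a deciduous tree."},
}

# Hypothetical cue words used to guess the domain of the extracted text.
DOMAIN_CUES: Dict[str, Set[str]] = {
    "art": {"painting", "oil", "canvas", "museum"},
    "botany": {"leaf", "tree", "species", "flower"},
}


def determine_target_domain(text: str) -> str:
    # Choose the domain whose cue words overlap most with the words of the text.
    words = set(text.lower().split())
    return max(DOMAIN_CUES, key=lambda d: len(words & DOMAIN_CUES[d]))


def lookup_extended_info(text: str, keywords: List[str]) -> List[str]:
    # Select the target knowledge base first, then query it by keyword.
    target_kb = CANDIDATE_KBS[determine_target_domain(text)]
    return [target_kb[k.lower()] for k in keywords if k.lower() in target_kb]
```

Restricting the lookup to one domain keeps a keyword that is ambiguous across domains from pulling in unrelated entries.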

The method according to claim 2, wherein the knowledge base comprises a knowledge graph.

determining whether a target answer to the question can be determined from the image data; and

if the target answer to the question cannot be determined from the image data, acquiring the extended information.
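The conditional acquisition described in the two steps above can be sketched as follows. This is an assumption-laden illustration: `answer_with_fallback` and its three callables are hypothetical, and "cannot be determined" is modeled simply as the image-only step returning `None`.

```python
from typing import Callable, List, Optional


def answer_with_fallback(
    question: str,
    image_text: str,
    try_image_only: Callable[[str, str], Optional[str]],
    fetch_extended: Callable[[str], List[str]],
    answer_with_extended: Callable[[str, str, List[str]], str],
) -> str:
    # First try to answer from the image content alone.
    direct = try_image_only(question, image_text)
    if direct is not None:
        return direct
    # Only when that fails is the extended information acquired.
    extended = fetch_extended(image_text)
    return answer_with_extended(question, image_text, extended)
```

Under this arrangement the knowledge-base lookup is skipped entirely whenever the image is sufficient.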

The method of claim 1, wherein determining a target answer to the question comprises:

determining a candidate answer to the question from the image data; and

performing, based on the extended information, correction on the candidate answer to obtain the target answer, wherein the correction includes at least one of error correction and completion.
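A minimal sketch of the two kinds of correction, error correction and completion, is given below. It assumes the extended information has already been reduced to a list of known entity names. `correct_candidate` is a hypothetical helper, and fuzzy string matching from Python's standard `difflib` module is swapped in here as one simple technique, not as the method the disclosure prescribes.

```python
import difflib
from typing import List


def correct_candidate(candidate: str, known_entities: List[str], cutoff: float = 0.6) -> str:
    """Correct a candidate answer against entity names from the extended information."""
    lowered = candidate.lower()
    # Completion: the candidate is a fragment of a longer known entity.
    for entity in known_entities:
        if lowered and lowered in entity.lower() and len(entity) > len(candidate):
            return entity
    # Error correction: the candidate is a near miss of a known entity.
    close = difflib.get_close_matches(candidate, known_entities, n=1, cutoff=cutoff)
    if close:
        return close[0]
    # No supporting entity found: keep the candidate unchanged.
    return candidate
```

For example, a misread name taken from a blurry label can be snapped to the closest entity in the knowledge base, while a truncated name can be completed.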

The method according to claim 1, wherein the target answer is determined using a trained question-answering model, the model input of the question-answering model including at least one of the image data and the text information, the extended information, and the question.

The method of claim 1, wherein the question comprises a question captured in speech form.

An apparatus for question answering, comprising:

a data capture module, configured to capture, using a device of a user, image data and a question about the image data in response to detecting a question answering initiation operation;

a text information extraction module, configured to extract text information from the image data;

an extended information acquisition module, configured to acquire extended information associated with the text information; and

a target answer determination module, configured to determine a target answer to the question based on the image data and the extended information.

The apparatus according to claim 9, wherein the extended information acquisition module comprises:

a keyword determination module, configured to determine at least one keyword from the text information; and

an extended information acquisition module, configured to acquire, from a knowledge base, the extended information associated with the at least one keyword.

The apparatus according to claim 10, wherein the knowledge base comprises a plurality of candidate knowledge bases corresponding to a plurality of domains, and the extended information acquisition module comprises:

a target domain determination module, configured to determine a target domain corresponding to the image data or the text information;

a target knowledge base determination module, configured to determine, from the plurality of candidate knowledge bases, a target knowledge base corresponding to the target domain; and

an information acquisition module, configured to acquire, from the target knowledge base, the extended information associated with the at least one keyword.

The apparatus according to claim 10, wherein the knowledge base comprises a knowledge graph.

a determination module configured to determine whether a target answer to the question can be determined from the image data; and

an acquisition module, configured to acquire the extended information if the target answer to the question cannot be determined from the image data.

The apparatus according to claim 9, wherein the target answer determination module comprises:

a candidate answer determination module configured to determine candidate answers to the question from the image data; and

a target answer obtaining module, configured to perform correction on the candidate answer based on the extended information to obtain the target answer, wherein the correction includes at least one of error correction and completion.

The apparatus according to claim 9, wherein the target answer is determined using a trained question-answering model, a model input of the question-answering model comprising at least one of the image data and the text information, the extended information, and the question.

The apparatus of claim 9, wherein the question comprises a question captured in speech form.

An electronic device, comprising:

at least one processing unit; and

at least one memory, the at least one memory being coupled to the at least one processing unit and storing instructions for execution by the at least one processing unit, the instructions, when executed by the at least one processing unit, causing the electronic device to perform the method according to any one of claims 1 to 8.

A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 8.

A computer program product, tangibly stored in a computer storage medium and comprising computer-executable instructions which, when executed by a device, cause the device to perform the method according to any one of claims 1 to 8.