CN117453887A - Document-based question answering method and device, storage medium and computer equipment - Google Patents
Document-based question answering method and device, storage medium and computer equipment
- Publication number
- CN117453887A (application number CN202311495012.1A)
- Authority
- CN
- China
- Prior art keywords
- question
- data
- document
- answer
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Human Computer Interaction (AREA)
- Databases & Information Systems (AREA)
- Computational Linguistics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The disclosure provides a document-based question answering method and device, a storage medium, and computer equipment. The method includes: acquiring document data from a plurality of databases; mining question-answer data pairs based on a text generation model corresponding to each category of document data, to obtain a plurality of question-answer data pairs; generating a first training data set from each piece of document data and its question-answer data pairs; acquiring a second training data set, which is obtained by correcting abnormal answer data of a document question-answering model; training the document question-answering model with a target training data set composed of the first training data set and the second training data set, to obtain a trained document question-answering model; and acquiring a target question and an associated target document, and answering the target question against the target document based on the trained document question-answering model to obtain an answer result. The method can improve the accuracy of document-based question answering.
Description
Technical Field
The disclosure relates to the technical field of artificial intelligence, and in particular to a document-based question answering method and device, a storage medium, and computer equipment.
Background
In recent years, artificial intelligence technology has developed rapidly and brought great convenience to people's work and daily life. Pre-trained large models based on deep learning and attention mechanisms have been widely applied in fields such as chat applications, text processing, and image processing, improving the efficiency and experience of data processing in these fields.
The document question-answering model is a specific application of the pre-trained large model in document-based question answering scenarios. It gives answers according to the document information and the question input by a user, and therefore has higher accuracy requirements than the large language models used in chat robots.
Because of these accuracy requirements, the document question-answering model needs to be continuously tuned to improve its accuracy and generalization ability. In the related art, however, the tuning data of the document question-answering model is constructed manually, which greatly reduces tuning efficiency. The low tuning efficiency in turn prevents the model from maintaining good accuracy, so that the accuracy of document-based question answering is insufficient.
Disclosure of Invention
The embodiments of the disclosure provide a document-based question answering method and device, a storage medium, and computer equipment.
According to an aspect of the present disclosure, there is provided a document-based question answering method, the method including:
acquiring document data of a plurality of categories from a plurality of target databases, wherein the target databases are databases associated with a document question-answer model to be trained;
mining question-answer data pairs of the document data of the multiple categories based on a text generation model corresponding to the document data of each category to obtain multiple question-answer data pairs corresponding to the document data;
generating a first training data set according to each document data and the corresponding question-answer data pairs;
acquiring a second training data set, wherein the second training data set is a training data set obtained by correcting abnormal response data of a document question-answer model;
training the document question-answer model by adopting a target training data set consisting of the first training data set and the second training data set to obtain a trained document question-answer model;
and acquiring a target question and an associated target document, and performing question answering on the target question and the target document based on the trained document question-answer model to obtain an answer result.
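Taken together, the steps above describe a data pipeline. The sketch below is a hypothetical, simplified rendering of it in Python: every function name is illustrative, the "text generation model" is a stub that returns templated text, and prompt-word mining is reduced to picking the longest words. It is intended only to show how the first and second training data sets combine into the target training data set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen makes pairs hashable, so sets can dedupe them
class QAPair:
    question: str
    answer: str


def extract_prompt_words(doc: str, top_k: int = 3) -> list[str]:
    # Toy stand-in for semantic/word-sense analysis: take the longest words.
    words = {w.strip(".,").lower() for w in doc.split()}
    return sorted(words, key=len, reverse=True)[:top_k]


def mine_qa_pairs(doc: str, generate) -> list[QAPair]:
    # One text-generation call per mined prompt word.
    return [generate(doc, p) for p in extract_prompt_words(doc)]


def build_first_training_set(docs: list[str], generate) -> list[QAPair]:
    seen: set[QAPair] = set()
    out: list[QAPair] = []
    for doc in docs:
        for qa in mine_qa_pairs(doc, generate):
            if qa not in seen:  # de-duplication step
                seen.add(qa)
                out.append(qa)
    return out


def stub_generate(doc: str, prompt: str) -> QAPair:
    # Stub "text generation model": returns a templated question and answer.
    return QAPair(f"What does the document say about {prompt}?",
                  f"The document mentions {prompt}.")


# Second training set: pairs obtained by correcting abnormal answers.
corrected = [QAPair("Q_fixed", "A_fixed")]

target_set = build_first_training_set(
    ["Payment limits apply to corporate accounts.",
     "Refunds are processed within seven days."],
    stub_generate) + corrected
```

In a real system the stub would be replaced by calls to a per-category text generation model, and the corrected pairs would come from the abnormal-answer correction flow.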
According to an aspect of the present disclosure, there is provided a document-based question answering apparatus, the apparatus including:
the first acquisition unit is used for acquiring document data of a plurality of categories from a plurality of target databases, wherein the target databases are databases associated with a document question-answer model to be trained;
the mining unit is used for mining the question-answer data pairs of the document data of the multiple categories based on the text generation model corresponding to the document data of each category to obtain multiple question-answer data pairs corresponding to each document data;
a generating unit, configured to generate a first training data set according to each document data and the corresponding question-answer data pairs;
the second acquisition unit is used for acquiring a second training data set, wherein the second training data set is a training data set obtained by correcting abnormal response data of the document question-answer model;
the training unit is used for training the document question-answer model by adopting a target training data set formed by the first training data set and the second training data set to obtain a trained document question-answer model;
and the answering unit is used for acquiring the target question and the associated target document, and performing question answering on the target question and the target document based on the trained document question-answer model to obtain an answer result.
Optionally, in some embodiments, the mining unit includes:
the mining subunit is used for mining the prompt words of the document data of the plurality of categories to obtain a plurality of prompt words corresponding to each document data;
and the first generation subunit is used for carrying out question-answer text generation on a plurality of prompt words corresponding to each document data based on a text generation model corresponding to each category of the document data to obtain a plurality of question-answer data pairs corresponding to each document data.
Optionally, in some embodiments, the first generation subunit includes:
the first determining module is used for determining a text generation model calling interface corresponding to the document data of each category;
and the first generation module is used for simultaneously calling a plurality of text generation model calling interfaces to generate question-answer texts of the corresponding prompt words so as to obtain a plurality of question-answer data pairs corresponding to each document data.
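The "simultaneously calling a plurality of call interfaces" step can be sketched with a thread pool. Everything below is illustrative: `call_interface` stands in for a real per-category text-generation-model call interface (e.g. an RPC or HTTP endpoint), which the patent does not specify.

```python
from concurrent.futures import ThreadPoolExecutor


def call_interface(category: str, prompt: str) -> tuple[str, str]:
    # Stand-in for a per-category text-generation-model call interface;
    # a real system would issue an RPC or HTTP request here.
    return (f"[{category}] question about {prompt}",
            f"[{category}] answer about {prompt}")


def generate_concurrently(jobs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    # Invoke several interfaces at the same time; pool.map preserves job order.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda job: call_interface(*job), jobs))


pairs = generate_concurrently([("finance", "payment limits"),
                               ("legal", "refund policy")])
```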
Optionally, in some embodiments, the mining subunit comprises:
The analysis module is used for carrying out semantic analysis and word sense analysis on the document data of the plurality of categories to obtain analysis results;
and the second determining module is used for determining a plurality of prompt words corresponding to each document data according to the analysis result.
Optionally, the document-based question answering device provided in the present disclosure further includes:
the first acquisition subunit is used for acquiring an evaluation data set for evaluating the document question-answer model, wherein the evaluation data set comprises a plurality of evaluation question-answer data pairs;
the extension subunit is used for performing approximate extension on the evaluation question data in the evaluation question-answer data pairs to obtain an evaluation question-answer data pair set associated with each evaluation question-answer data pair;
and the evaluation subunit is used for evaluating the document question-answer model according to a plurality of evaluation question-answer data pair sets associated with the plurality of evaluation question-answer data pairs to obtain an evaluation result.
Optionally, in some embodiments, the expansion subunit comprises:
the extraction module is used for extracting prompt words from the question data in the evaluation question-answer data pair to obtain the question prompt words corresponding to each question data;
the expansion module is used for carrying out approximate expansion on the problem data based on the problem prompt words to obtain a plurality of pieces of approximate problem data;
and the second generation module is used for generating an evaluation question-answer data pair set associated with each evaluation question-answer data pair according to the plurality of pieces of approximate question data and the answer data corresponding to the question data.
Optionally, in some embodiments, the expansion module includes:
the analysis sub-module is used for carrying out semantic analysis on the problem data and determining a target text generation model corresponding to the problem data according to a semantic analysis result;
and the generation sub-module is used for generating the problem data of the problem prompt word based on the target text generation model to obtain a plurality of pieces of approximate problem data.
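As a toy illustration of approximate expansion, the fixed templates below stand in for the target text generation model (the claims do not specify how paraphrases are produced); each approximate question keeps the original answer data, so the expanded pairs form the evaluation question-answer data pair set.

```python
def expand_question(prompt_word: str, n: int = 3) -> list[str]:
    # Toy paraphrase templates standing in for the target text generation model.
    templates = [
        "Could you explain {p}?",
        "What does the document state regarding {p}?",
        "Please describe {p}.",
    ]
    return [t.format(p=prompt_word) for t in templates[:n]]


def build_eval_pair_set(question: str, answer: str,
                        prompt_word: str) -> list[tuple[str, str]]:
    # Every approximate question is paired with the original answer data.
    return [(q, answer) for q in [question] + expand_question(prompt_word)]


pair_set = build_eval_pair_set("What is the refund window?",
                               "Seven days.", "the refund window")
```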
Optionally, in some embodiments, the evaluating subunit includes:
the computing module is used for computing semantic similarity between elements in each evaluation question-answer data pair set and corresponding evaluation question-answer data pairs;
the dividing module is used for dividing the elements in each evaluation question-answer data pair set into a plurality of categories according to the semantic similarity;
the evaluation module is used for evaluating the document question-answer model based on evaluation question-answer data of a plurality of categories to obtain a plurality of sub-evaluation results;
and the third determining module is used for determining the evaluation result of the document question-answer model according to the plurality of sub-evaluation results.
Optionally, in some embodiments, the third determining module includes:
the first computing sub-module is used for computing the weight coefficient of each sub-evaluation result according to the semantic similarity;
and the second computing sub-module is used for carrying out weighted computation on the plurality of sub-evaluation results based on the weight coefficient to obtain the evaluation result of the document question-answer model.
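One hedged reading of this weighting scheme: compute a similarity weight per sub-evaluation result, then take the normalised weighted sum. The Jaccard word overlap below is a cheap lexical stand-in for the semantic similarity the claims mention; a real system would more likely compare sentence embeddings.

```python
def jaccard(a: str, b: str) -> float:
    # Cheap lexical stand-in for semantic similarity between two texts.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def weighted_evaluation(sub_results: list[tuple[float, float]]) -> float:
    # sub_results holds (sub_evaluation_score, similarity_weight) pairs;
    # weights are normalised before the weighted sum.
    total = sum(w for _, w in sub_results)
    return sum(score * w for score, w in sub_results) / total


score = weighted_evaluation([(1.0, 0.9), (0.5, 0.6), (0.0, 0.3)])
```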
Optionally, in some embodiments, the document-based question answering apparatus provided by the present disclosure further includes:
the second generation subunit is used for generating accurate target answer data of the target evaluation problem when the answer result of the document question-answer model to the target evaluation problem is identified to be inconsistent with the answer data corresponding to the target evaluation problem in the process of evaluating the document question-answer model;
and the adding subunit is used for adding the abnormal response data pair consisting of the target evaluation problem and the target response data into the second training data set.
Optionally, in some embodiments, the document-based question answering apparatus provided by the present disclosure further includes:
the second obtaining subunit is used for obtaining a plurality of template answer data pairs, wherein the template answer data pairs comprise template answers which indicate that the document question-answering model cannot give accurate answers of corresponding questions;
The training unit is further configured to:
and training the document question-answer model by adopting a target training data set formed by the first training data set, the second training data set and the plurality of template answer data pairs.
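A minimal sketch of assembling the target training data set from the three sources, assuming simple (question, answer) tuples; the template answer text is invented for illustration. Including such template pairs teaches the model to decline rather than fabricate an answer when the document lacks the information.

```python
# Hypothetical template answer; the patent does not give the exact wording.
CANNOT_ANSWER = ("Sorry, the provided document does not contain enough "
                 "information to answer this question.")


def build_target_training_set(first_set, second_set, unanswerable_questions):
    # Combine the mined set, the corrected set, and template answer pairs.
    template_pairs = [(q, CANNOT_ANSWER) for q in unanswerable_questions]
    return list(first_set) + list(second_set) + template_pairs


target = build_target_training_set(
    [("Q1", "A1")],                        # first training data set
    [("Q2", "A2 (corrected)")],            # second training data set
    ["What is the CEO's favourite colour?"])
```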
Optionally, in some embodiments, the generating unit includes:
a third generating subunit, configured to generate a plurality of candidate training data according to each document data and the corresponding plurality of question-answer data;
and the processing subunit is used for carrying out format detection and de-duplication processing on the plurality of candidate training data to obtain a first training data set.
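Format detection and de-duplication might look like the following sketch, assuming candidate training records are dicts with document/question/answer fields; the actual record format is not given in the patent.

```python
import hashlib
import json


def is_well_formed(record: dict) -> bool:
    # Format detection: required fields present as non-empty strings.
    return all(isinstance(record.get(k), str) and record[k].strip()
               for k in ("document", "question", "answer"))


def deduplicate(records: list[dict]) -> list[dict]:
    # Hash a canonical JSON serialisation of each record to spot duplicates.
    seen, kept = set(), []
    for rec in records:
        key = hashlib.sha1(json.dumps(rec, sort_keys=True).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


def clean_candidates(candidates: list[dict]) -> list[dict]:
    return deduplicate([r for r in candidates if is_well_formed(r)])


cleaned = clean_candidates([
    {"document": "D", "question": "Q", "answer": "A"},
    {"document": "D", "question": "Q", "answer": "A"},   # exact duplicate
    {"document": "D", "question": "Q", "answer": ""},    # malformed record
])
```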
Optionally, in some embodiments, the training unit comprises:
a fourth generation subunit, configured to generate a mirrored environment for training the document question-answer model;
the training subunit is used for training the document question-answer model in the mirror image environment based on a target training data set formed by the first training data set and the second training data set to obtain model parameters after training of the document question-answer model;
and the updating subunit is used for updating the document question-answer model based on the trained model parameters.
According to an aspect of the present disclosure, there is provided a computer device including a memory storing a computer program and a processor which, when executing the computer program, implements the document-based question answering method described above.
According to an aspect of the present disclosure, there is provided a storage medium storing a computer program which, when executed by a processor, implements the document-based question answering method as described above.
According to an aspect of the present disclosure, there is provided a computer program product comprising a computer program which is read and executed by a processor of a computer device, causing the computer device to perform the document-based question answering method as described above.
According to the document-based question answering method provided by the embodiment of the disclosure, document data of a plurality of categories are obtained from a plurality of target databases, wherein the target databases are databases associated with a document question answering model to be trained; mining question-answer data pairs of the document data of a plurality of categories based on the text generation model corresponding to the document data of each category to obtain a plurality of question-answer data pairs corresponding to each document data; generating a first training data set according to each document data and a plurality of question-answer data pairs; acquiring a second training data set, wherein the second training data set is a training data set obtained by correcting abnormal response data of a document question-answer model; training the document question-answering model by adopting a target training data set consisting of the first training data set and the second training data set to obtain a trained document question-answering model; and obtaining target questions and associated target documents, and answering the questions of the target questions and the target documents based on the trained document question-answering model to obtain answering results.
In this way, after obtaining the source document data of multiple categories, the embodiment of the disclosure may automatically generate the training data set for training the document question-answer model based on the text generation model corresponding to the source document data of each category. In addition, training data obtained by correcting the abnormal response data of the document question-answer model can be further obtained to supplement the training data set. Therefore, compared with the manual generation of the training data set for training the document question-answer model, the efficiency of generating the training data of the document question-answer model can be greatly improved, and the training efficiency of the document question-answer model can be further improved. Therefore, the higher tuning frequency of the document question-answering model can be realized, the accuracy of the document question-answering model can be further ensured, and the accuracy of the document-based question answering is improved.
Additional features and advantages of the disclosure will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the disclosure. The objectives and other advantages of the disclosure will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the disclosed embodiments and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain, without limitation, the disclosed embodiments.
FIG. 1 is a system architecture diagram to which a document-based question answering method of an embodiment of the present disclosure is applied;
FIG. 2 is a schematic diagram of the architecture of the question-answering system of the present disclosure;
FIG. 3 is a flow diagram of a document-based question answering method provided by the present disclosure;
FIG. 4 is a schematic diagram of a flow frame for implementing automatic generation of a tuning dataset of a document question-answer model by using an ETL frame in an embodiment of the disclosure;
FIG. 5 is a schematic diagram of a process for evaluating a document question-answer model in the present disclosure;
FIG. 6 is a schematic diagram of a system architecture of a system for tuning a document question-answer model provided by the present disclosure;
FIG. 7 is a flow chart of a method for training a document question-answering model provided by the present disclosure;
FIG. 8 is a schematic diagram of a document-based question answering apparatus according to an embodiment of the present disclosure;
FIG. 9 is a terminal block diagram implementing methods according to one embodiment of the present disclosure;
fig. 10 is a server block diagram implementing methods according to one embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, the present disclosure will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present disclosure.
Before proceeding to a further detailed description of the disclosed embodiments, the terms involved in the embodiments are explained; the following explanations apply:
- Large model (large language model, LLM): an artificial intelligence model for natural language processing that learns and understands the structure, semantics, and grammar rules of a language.
- Intelligent question-answering: an application based on artificial intelligence technology that aims to answer natural language questions posed by a user. Such systems use natural language processing, machine learning, knowledge-graph techniques, and the like to provide accurate and useful answers by analyzing the question and searching related knowledge bases, documents, or internet resources.
- Document question-answering model: a large model used in intelligent question-answering scenarios that processes an input document and an input question and outputs the answer corresponding to the input question.
- Tuning: generally refers to large-model tuning, that is, a series of techniques and methods for training and optimizing large-scale deep learning models. Since large models often have a large number of parameters and complex structures, tuning aims to improve the performance and efficiency of the model.
- Data ETL: the process of extracting, transforming, and loading data, i.e., extracting data from a source system, transforming it, and loading it into a target system. ETL is an important link in data warehousing and data integration.
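As a toy illustration of the three ETL stages (function names and the record schema are invented for this sketch):

```python
def extract(source_rows: list[str]) -> list[str]:
    # Extract: pull raw rows from the source system, skipping empty ones.
    return [row for row in source_rows if row.strip()]


def transform(row: str) -> dict:
    # Transform: normalise a raw row into the target schema.
    return {"text": row.strip().lower()}


def load(warehouse: list[dict], record: dict) -> None:
    # Load: write the transformed record into the target system.
    warehouse.append(record)


warehouse: list[dict] = []
for row in extract(["  Annual Report ", "", "FAQ Page"]):
    load(warehouse, transform(row))
```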
- Text generation model (Generative Pre-trained Transformer, GPT): here, specifically a generative pre-trained transformer model, a deep learning model for text generation pre-trained on internet data.
Compared with the large language models used in applications such as chat robots, which only need to generate content according to prompt words, a document question-answering model is generally used to answer a user's targeted questions in a specific scenario, and therefore needs higher accuracy. In the related art, maintainers of a document question-answer model must continuously tune it to keep improving its accuracy.
However, the related art lacks an efficient way to automatically generate a high-quality, large-scale training data set for training the document question-answer model; the data set can only be constructed manually by the model's maintainers. As a result, the tuning efficiency of the document question-answer model is low, the model cannot maintain a high answering level in real time, and the user experience of the document question-answering system suffers. To solve this problem of low training efficiency, the disclosure provides a document-based question answering method that improves the efficiency of generating the training data set, and thus the training efficiency of the document question-answer model. The higher training efficiency in turn ensures that the model always maintains a high tuning frequency, which improves the accuracy of document-based question answering.
System architecture and scenario description applied to embodiments of the present disclosure
FIG. 1 is a system architecture diagram to which a document-based question answering method according to an embodiment of the present disclosure is applied. It includes a terminal 140, the internet 130, a gateway 120, a server 110, etc.
The terminal 140 includes various forms such as a desktop computer, a laptop computer, a PDA (personal digital assistant), a mobile phone, a vehicle-mounted terminal, a home theater terminal, a dedicated terminal, an intelligent voice interaction device, a smart home appliance, an aircraft, and the like. The terminal may be a single device or a set of multiple devices. The terminal 140 may communicate with the internet 130 in a wired or wireless manner to exchange data.
Server 110 refers to a computer system that can provide certain services to terminal 140. The server 110 is required to have higher stability, security, performance, etc. than the general terminal 140. The server 110 may be one high-performance computer in a network platform, a cluster of multiple high-performance computers, a portion of one high-performance computer (e.g., a virtual machine), a combination of portions of multiple high-performance computers (e.g., virtual machines), etc.
Gateway 120 is also known as an inter-network connector or protocol converter. The gateway implements network interconnection at the transport layer and is a computer system or device that acts as a translator between two systems using different communication protocols, data formats, or languages, or even entirely different architectures. The gateway may also provide filtering and security functions. A message sent by the terminal 140 to the server 110 passes through the gateway 120 to the corresponding server 110, and a message sent by the server 110 to the terminal 140 likewise passes through the gateway 120 to the corresponding terminal 140.
The document-based question answering method provided in the embodiments of the present disclosure may be implemented in the foregoing terminal 140, may be implemented in the foregoing server 110, or may be implemented in part in the terminal 140 and in part in the server 110.
When the document-based question answering method provided by the embodiment of the present disclosure is implemented in the terminal 140, the terminal 140 may acquire document data of a plurality of categories from a plurality of target databases, the target databases being databases associated with a document question-answer model to be trained. Question-answer data pair mining is performed on the document data of the multiple categories based on the text generation model, deployed in the terminal 140, corresponding to each category of document data, so as to obtain a plurality of question-answer data pairs corresponding to each document data. The terminal 140 then generates a first training data set from each document data and the corresponding plurality of question-answer data pairs. Next, the terminal 140 acquires a second training data set, which is a training data set obtained by correcting abnormal response data of the document question-answer model. Finally, the terminal 140 trains the document question-answer model using a target training data set formed from the first training data set and the second training data set, obtaining a trained document question-answer model. The terminal 140 then obtains a target question and its associated target document, and performs question answering on the target question and the target document based on the trained document question-answer model, thereby obtaining an answer result.
When the document-based question answering method provided by the embodiment of the present disclosure is implemented in the server 110, the server 110 may obtain document data of a plurality of categories from a plurality of target databases, the target databases being databases associated with a document question-answer model to be trained. Question-answer data pair mining is performed on the document data of the multiple categories based on the text generation model, deployed in the server 110, corresponding to each category of document data, so as to obtain a plurality of question-answer data pairs corresponding to each document data. The server 110 then generates a first training data set from each document data and the corresponding plurality of question-answer data pairs. Next, the server 110 acquires a second training data set, which is a training data set obtained by correcting abnormal response data of the document question-answer model. Finally, the server 110 trains the document question-answer model using a target training data set formed from the first training data set and the second training data set, obtaining a trained document question-answer model. The server 110 then obtains a target question and its associated target document, and performs question answering based on the trained document question-answer model to obtain an answer result.
When one part of the document-based question answering method provided in the embodiment of the present disclosure is implemented in the server 110 and another part is implemented in the terminal 140, the terminal 140 may obtain document data of a plurality of categories from a plurality of target databases, the target databases being databases associated with the document question-answer model to be trained. The terminal 140 then transmits the acquired multiple categories of document data to the server 110, so that the server 110 performs question-answer data pair mining on the multiple categories of document data based on the text generation model, deployed in the server 110, corresponding to each category of document data, obtaining a plurality of question-answer data pairs corresponding to each document data. The server 110 then transmits the mined question-answer data pairs to the terminal 140. The terminal 140 generates a first training data set according to each document data and the corresponding plurality of question-answer data pairs. Next, the terminal 140 acquires a second training data set, which is a training data set obtained by correcting abnormal response data of the document question-answer model. Finally, the terminal 140 trains the document question-answer model using a target training data set formed from the first training data set and the second training data set, obtaining a trained document question-answer model. The terminal 140 then obtains a target question and its associated target document, and performs question answering on the target question and the target document based on the trained document question-answer model, thereby obtaining an answer result.
The embodiment of the disclosure can be applied to various scenes, such as various applications or question-answering systems in communities of various types. When the question-answering system receives a question sent by an object, a document corresponding to the question can be searched in a system server, the searched document and the question are sent to a document question-answering model for processing, answer data returned by the document question-answering model is received, and then the question-answering system can display the answer data returned by the document question-answering model to the question object. In the process of using the document question-answering model to answer the question sent by the question object by the document question-answering system, the document question-answering model can be continuously adjusted by the document-based question-answering method provided by the disclosure, so that the accuracy of the document question-answering model is continuously improved, and the accuracy of the document-based question-answering is further improved. By adopting the document-based question answering method to tune the document question answering model, high-quality and large-scale tuning data for tuning the document question answering model can be efficiently obtained, so that the tuning efficiency of the document question answering model can be improved, and further, the document question answering model can always keep good accuracy.
Fig. 2 is a schematic diagram of the architecture of the question-answering system in the present disclosure. As shown, the question-answering system includes a question-answering client 210, a question-answering server 220, and a question-answering model 230. The question-answering client may be a client loaded in the terminal 140, and specifically may be an application program or a web page. The question-answering server 220 is the server corresponding to the question-answering system; it may be connected to a plurality of question-answering clients 210, and it is connected with the question-answering model 230 to apply the functions of the question-answering model 230. The question-answering server 220 searches for corresponding document data according to the questions transmitted from the question-answering client 210, and then transmits the received questions and the retrieved document data to the question-answering model for answer generation. The specific structure of the question-answering server 220 may be identical to that of the server 110. The question-answering model 230 may generate corresponding answer text according to the questions and document data transmitted from the question-answering server 220, and return the answer text to the question-answering server 220, so that the question-answering server 220 further returns the answer text to the question-answering client 210 for display.
General description of embodiments of the disclosure
According to one embodiment of the present disclosure, a document-based question answering method is provided. The method can be used in scenarios where questions need to be answered on the basis of documents, such as question-answering systems in various applications or communities.
As shown in fig. 3, a flow chart of the document-based question answering method provided in the present disclosure is shown. The method can be applied to a document-based question answering device which can be integrated in computer equipment, and the computer equipment can be a terminal or a server. The document-based question answering method may include:
in step 310, document data of a plurality of categories is obtained from a plurality of target databases.
The embodiment of the disclosure provides a document-based question answering method, in particular a method for tuning a document question-answer model. The method aims to realize automatic generation of the tuning data set of the document question-answer model by utilizing the text processing capability of large models, so as to solve the problems in the related art that, because the tuning data set is generated manually, the generation efficiency of the tuning data set is low, the tuning efficiency of the document question-answer model is low, and the accuracy of the document question-answer model cannot be guaranteed in real time. The document-based question answering method provided in the present disclosure is described in detail below.
The document question-answer model may be a large model based on deep learning technology that takes a question and a document as input and outputs an answer corresponding to the question; such models may be divided into extractive and generative types. The document question-answer model can be applied in specialized fields to answer the questions of querying objects. For example, in community communication scenarios after some new technologies, standards, papers and products are released, the community participants are numerous and their questions vary widely, while the number of background staff in the community is limited, so answering the questions posted in the community one by one manually is inefficient and consumes a large amount of labor cost. In such cases, the document question-answer model can be used to improve the efficiency of answering questions and reduce community maintenance costs. Owing to the specialized environments in which the document question-answer model is often applied, its accuracy requirements are generally higher than those of the large language models used in chat applications. Questioners generally do not demand professional rigor of answers during daily chat, but in professional community discussions, a wrong answer or an unanswered question affects the community members' evaluation of the community's professionalism. Thus, in these cases, constant tuning of the document question-answer model in use is required.
When the document question-answer model needs to be tuned, a tuning data set required for tuning the document question-answer model can be generated, and then the document question-answer model is tuned according to the generated tuning data set. In the related art, a data set is generally collected manually or by means of data acquisition software. Specifically, data may be obtained from a number of public data sets, such as those provided by research institutions, academia, or other organizations; alternatively, open source data may be obtained from large-scale datasets provided by some open source projects; alternatively, some data generation techniques may be used, such as a generative adversarial network (Generative Adversarial Network, GAN), to generate corresponding data. However, neither the quantity nor the scale of the data collected in these ways can be guaranteed, and much of the data lacks labels and annotations and requires additional manual labeling, so that the generation efficiency of the tuning data set is low.
In the embodiment of the disclosure, when the tuning data set of the document question-answer model needs to be generated, a plurality of target databases associated with the document question-answer model to be tuned can be identified first. The document question-answer model may be one applied in the communication field, the quantum field, the liquid crystal field, the computer programming field, or another field; for concreteness, suppose the document question-answer model to be tuned is applied in the computer programming field. The databases associated with the document question-answer model may be determined as target databases according to the application field information of the model; for example, the target databases may include, but are not limited to, the background knowledge base of a community application in the field of computer programming, a Java (a computer language) question library, a Python (a computer language) knowledge base, the databases of some open source project hosting platforms, and the like. After the plurality of target databases associated with the document question-answer model to be trained are found, source data may be acquired from the plurality of target databases. The source data may be document data of a plurality of categories, for example product introduction documents, technical documents, Chinese documents, English documents, code documents, and other categories of document data.
After the document data of a plurality of categories are obtained from the plurality of target databases, the obtained document data can be stored in a local data warehouse according to certain storage rules, for use when generating the tuning data set of the document question-answer model. In this embodiment, the local data warehouse adopts an ETL scheme to pull, convert and load the data. Specifically, this can be implemented by three core components: an extractor (Dataset-Extractor), a converter (Dataset-Converter) and a loader (Dataset-Loader). Pulling source data from the multiple target databases may be implemented by the extractor. The extractor not only acquires source data from the target databases but also supports incremental updates of the source data, avoiding the resource waste caused by repeatedly downloading data. After extracting the source data from the plurality of target databases, the extractor may segment the extracted source data and then extract the desired data from the segmented data.
In addition, the extractor can also provide functions of real-time caching and data recovery in the process of acquiring the source data. When abnormal conditions occur, the real-time cache can help to save the extracted document data, and data loss is avoided. Meanwhile, the data recovery function can also carry out data recovery under abnormal conditions, so that the integrity of the acquired document data is ensured.
Compared with manual data extraction, the automatic data extraction function of the data extractor can greatly improve the efficiency of document data extraction from various target databases. Moreover, the real-time caching and data recovery functions provided by the extractor can also ensure the integrity and stability of the extracted data.
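As an illustration of the extractor's incremental pulling, real-time caching and data recovery described above, the following is a minimal Python sketch; the class name, the fingerprint-based incremental check, and the in-memory cache are assumptions for illustration, since the disclosure does not specify an implementation.

```python
import hashlib


class DatasetExtractor:
    """Minimal sketch of the extractor: pulls source documents from several
    target databases, skipping documents already seen (incremental update)
    and caching extracted records so they can be recovered after a failure."""

    def __init__(self):
        self.seen_hashes = set()   # fingerprints of already-extracted documents
        self.cache = []            # real-time cache of extracted records

    def _fingerprint(self, doc: str) -> str:
        # Content hash stands in for whatever change-detection the extractor uses.
        return hashlib.sha256(doc.encode("utf-8")).hexdigest()

    def extract(self, databases: dict) -> list:
        """databases maps a database name to a list of raw document strings."""
        extracted = []
        for name, docs in databases.items():
            for doc in docs:
                fp = self._fingerprint(doc)
                if fp in self.seen_hashes:      # incremental: skip known documents
                    continue
                self.seen_hashes.add(fp)
                record = {"source": name, "text": doc}
                self.cache.append(record)       # cache before handing data on
                extracted.append(record)
        return extracted

    def recover(self) -> list:
        """Return the cached records, e.g. after an abnormal interruption."""
        return list(self.cache)
```

Calling `extract` a second time with overlapping data returns only the new documents, which is the incremental behavior the text describes.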
Step 320, mining question-answer data pairs of the document data of a plurality of categories based on the text generation model corresponding to the document data of each category, so as to obtain a plurality of question-answer data pairs corresponding to each document data.
After the plurality of categories of document data are acquired from the plurality of target databases, a converter can be employed to automatically convert the acquired document data. In the embodiment of the disclosure, the converter implements a proxy layer for large-model frameworks, and the interfaces of different large models are adapted in the proxy layer so as to enable calls to the different large models. In the embodiment of the disclosure, the large model is specifically used to generate question-answer data pairs for the acquired document data, so the large model may specifically be a text generation model such as GPT. In particular, the plurality of different text generation models in embodiments of the present disclosure may be trained, mature text generation models, each having good text semantic understanding and analysis capabilities and the ability to generate corresponding text data based on the results of the semantic understanding and analysis.
In order to ensure the accuracy of question-answer data pairs obtained by mining the text generation model based on the document data, a special text generation model corresponding to the document data of each category can be accessed into the converter. The dedicated text generating models corresponding to different types of document data can have the same model structure, but the different text generating models can be trained based on different training sample data. For example, these text generation models are trained using a plurality of sample text data of different fields, so that the processing power of the text data for a particular field is learned separately. For example, some text generation models have good text processing capabilities in the biomedical field, while some text generation models have good text processing capabilities in the computer programming field, etc.
Therefore, in the embodiment of the disclosure, after the document data of a plurality of categories are obtained from the plurality of target databases, question-answer data pair mining can be performed on the document data of the plurality of categories based on the text generation model corresponding to each category of document data, so as to obtain a plurality of question-answer data pairs corresponding to each document data. That is, the text generation model corresponding to each category of document data can be used to understand the document data of that category and then generate question-answer data pairs for it. Here, a question-answer data pair may specifically include a question text and the answer text corresponding to that question text. By fully utilizing the text understanding capability of the text generation model, question-answer data pairs comprising question texts and accurate corresponding answer texts can be generated automatically. Compared with manually generating the question-answer data pairs corresponding to document data as in the related art, this greatly improves the generation efficiency of question-answer data pairs, and thereby improves the generation efficiency of the tuning data set used to tune the document question-answer model.
In some embodiments, mining question-answer data pairs of document data of a plurality of categories based on a text generation model corresponding to the document data of each category to obtain a plurality of question-answer data pairs corresponding to each document data, including:
mining the prompt words of the document data of the multiple categories to obtain multiple prompt words corresponding to each document data;
and generating question-answering text for a plurality of prompt words corresponding to each document data based on a text generation model corresponding to each category of document data, so as to obtain a plurality of question-answering data pairs corresponding to each document data.
In the embodiment of the disclosure, performing question-answer data pair mining on the document data of a plurality of categories based on the text generation model corresponding to each category of document data specifically means first performing prompt word mining on the document data of the plurality of categories to obtain a plurality of prompt words corresponding to each document data. Then, based on the text generation model corresponding to each category of document data, question-answer text generation is performed on the plurality of prompt words corresponding to each document data, so as to obtain a plurality of question-answer data pairs corresponding to each document data. The prompt word mining on each category of document data may be implemented by a preset prompt word mining algorithm or by a trained prompt word mining model.
That is, in the embodiment of the disclosure, the task of mining question-answer data pairs from document data may be realized jointly by a prompt word mining model and a text generation model that generates question-answer data pairs based on the prompt words. This avoids the problem that an overly complex task makes the model training process difficult to fit, which would leave the trained model insufficiently accurate. Splitting the task of mining question-answer data pairs directly from document data into two simpler subtasks, namely mining prompt words from the document data and then generating question-answer data pairs based on the mined prompt words, and training a model for each subtask, reduces the difficulty of acquiring training sample data, improves the rationality of model usage, and improves the accuracy of the overall task processing.
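The two-subtask split described above can be sketched as follows; the `mine_prompts` heuristic and the templated `generate_qa_pair` are placeholders standing in for the prompt word mining model and the text generation model, which the disclosure does not specify.

```python
def mine_prompts(document: str) -> list:
    """Subtask 1 (placeholder): mine prompt words from a document.
    A trivial word-length heuristic stands in for the prompt word
    mining algorithm/model described in the text."""
    words = [w.strip(".,") for w in document.split()]
    return [w for w in words if len(w) > 6][:3]


def generate_qa_pair(document: str, prompt_word: str) -> dict:
    """Subtask 2 (placeholder): call a text generation model with a prompt
    word to produce a question-answer pair. A fixed template stands in
    for the actual model call."""
    return {
        "question": f"What does the document say about {prompt_word}?",
        "answer": f"(model-generated answer about {prompt_word})",
    }


def mine_qa_pairs(document: str) -> list:
    """Chain the two subtasks: prompt mining, then QA generation."""
    return [generate_qa_pair(document, p) for p in mine_prompts(document)]
```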
In some embodiments, performing question-answer text generation on a plurality of prompt words corresponding to each document data based on a text generation model corresponding to each category of document data to obtain a plurality of question-answer data pairs corresponding to each document data, including:
determining a text generation model calling interface corresponding to the document data of each category;
And simultaneously calling a plurality of text generation model calling interfaces to generate question-answer texts for the corresponding prompt words, so as to obtain a plurality of question-answer data pairs corresponding to each document data.
In the embodiment of the present disclosure, as described above, call interfaces corresponding to multiple text generation models may be connected to the proxy layer in the converter. Thus, when question-answer text generation is performed on the plurality of prompt words corresponding to each document data based on the text generation model corresponding to each category of document data, the text generation model call interface corresponding to each category of document data can be determined first. Then, multiple text generation model call interfaces can be called simultaneously to perform question-answer text generation on the corresponding prompt words, so as to obtain a plurality of question-answer data pairs corresponding to each document data. In other words, each call interface routes the prompt words mined from its category of document data to the corresponding text generation model, which generates the question-answer text for those prompt words.
In the embodiment of the disclosure, multiple text generation models are called in parallel through the multiple text generation model call interfaces to generate question-answer data pairs, realizing concurrent multitask processing, which can greatly improve the generation efficiency of the tuning data set used to tune the document question-answer model.
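A minimal sketch of the concurrent interface calling using a thread pool; the per-category call interfaces and the returned structure are hypothetical, standing in for the proxy-layer adapters the disclosure describes.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical per-category call interfaces; each would wrap the API of a
# different text generation model adapted in the converter's proxy layer.
def call_code_model(prompt: str) -> dict:
    return {"q": f"Q[{prompt}]", "a": f"A[{prompt}]"}


def call_doc_model(prompt: str) -> dict:
    return {"q": f"Q[{prompt}]", "a": f"A[{prompt}]"}


INTERFACES = {"code": call_code_model, "document": call_doc_model}


def generate_concurrently(tasks: list) -> list:
    """tasks: list of (category, prompt_word) tuples. Dispatch each prompt
    to the call interface for its document category and run the calls in
    parallel; results come back in submission order."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(INTERFACES[cat], p) for cat, p in tasks]
        return [f.result() for f in futures]
```

In a real deployment each interface function would issue a network request to a different model endpoint, which is where concurrent calling pays off.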
In some embodiments, performing prompt word mining on a plurality of types of document data to obtain a plurality of prompt words corresponding to each document data, including:
carrying out semantic analysis and word sense analysis on the document data of a plurality of categories to obtain analysis results;
and determining a plurality of prompt words corresponding to each document data according to the analysis result.
In the embodiment of the disclosure, the specific process of performing the prompt word mining on the document data of the plurality of categories may be that semantic analysis and word sense analysis are performed on the document data of the plurality of categories to obtain corresponding analysis results. The semantic analysis can identify the importance of texts of different sentences or different paragraphs in each document data, so that unimportant parts can be removed, and prompt word extraction is carried out on the important parts. Therefore, the extraction efficiency of the prompt words can be improved, and the extraction accuracy of the prompt words can be improved. After the key sentences and the key paragraphs in the document data are determined, word division can be further performed on sentence contents in the key sentences or the key paragraphs, and word meaning analysis is performed on a plurality of words obtained by division to determine the key words in the key sentences or the key paragraphs.
After the semantic analysis and word sense analysis are carried out on the document data of a plurality of categories to obtain an analysis result, a plurality of prompt words corresponding to each document data can be further determined according to the keywords in the analysis result.
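A toy sketch of the analysis step above, assuming a crude sentence-length heuristic in place of real semantic importance analysis and simple frequency ranking in place of word sense analysis; a production system would use trained models for both.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}


def mine_prompt_words(document: str, top_k: int = 5) -> list:
    """Keep the longer (treated as more 'important') sentences as key
    sentences, split them into words, and rank the non-stopword terms by
    frequency as candidate prompt words."""
    sentences = [s for s in re.split(r"[.!?]", document) if s.strip()]
    avg_len = sum(len(s) for s in sentences) / len(sentences)
    key_sentences = [s for s in sentences if len(s) >= avg_len]
    words = []
    for s in key_sentences:
        words += [w.lower() for w in re.findall(r"[A-Za-z]+", s)
                  if w.lower() not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(top_k)]
```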
Step 330, a first training data set is generated according to each document data and the corresponding plurality of question-answer data pairs.
After the plurality of question-answer data pairs corresponding to each document data are generated, a tuning data set for tuning the document question-answer model may be further generated based on the plurality of question-answer data pairs corresponding to each document data. In the embodiment of the present disclosure, a tuning data set generated from source data acquired from a plurality of target databases may be referred to as a first training data set to be distinguished from a tuning data set obtained by correcting abnormal response data hereinafter.
The specific process of generating the tuning data set for tuning the document question-answer model based on the multiple question-answer data pairs corresponding to each document data may be to construct multiple input-output data pairs. An input-output data pair comprises input data and output data; the input data comprises the document data and the question data of a question-answer data pair, and the output data is the answer data of the corresponding question-answer data pair. Because a plurality of question-answer data pairs can be mined from one document, a plurality of input-output data pairs, i.e., a plurality of training data, can be constructed based on one document. In this way, a large number of input-output data pairs may be generated by traversing each document data, and these may form the first training data set.
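The construction of input-output data pairs described above can be sketched as follows, assuming a simple dictionary representation for training examples (the disclosure does not prescribe a storage format):

```python
def build_training_data(document: str, qa_pairs: list) -> list:
    """Each training example pairs (document + question) as input with the
    answer as output; one document yields one example per QA pair."""
    return [
        {"input": {"document": document, "question": qa["question"]},
         "output": qa["answer"]}
        for qa in qa_pairs
    ]
```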
In some embodiments, generating a first training data set from each document data and a corresponding plurality of question-answer data pairs includes:
generating a plurality of candidate training data according to each document data and a corresponding plurality of question-answer data pairs;
and performing format detection and de-duplication processing on the plurality of candidate training data to obtain a first training data set.
In the embodiment of the present disclosure, after generating a plurality of candidate training data (i.e., the aforementioned input/output data pair) according to each document data and a plurality of corresponding question-answer data, format detection and deduplication may be further performed on each candidate training data. And deleting the candidate training data which does not meet the preset format requirement and is repeated, so that a first training data set is obtained. The quality of the generated tuning data set can be improved by performing format detection and deduplication on the generated plurality of candidate training data.
In the embodiment of the disclosure, the whole process of generating the first training data set according to each document data and the corresponding multiple question-answer data pairs may be implemented by the loader. The loader, when generating and deriving the first training data set, may perform format detection on the generated candidate training data to exclude abnormal data, such as format errors or invalidity, that occur during the generation process. In addition, repeated data generated in the process of generating the tuning data can be screened and removed, so that the quality and the integrity of the derived tuning data set can be ensured.
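A minimal sketch of the loader's format detection and deduplication, assuming the dictionary-based example format used above; the required-key check is a stand-in for whatever preset format requirement a concrete implementation would enforce.

```python
import json

REQUIRED_KEYS = {"input", "output"}


def is_well_formed(example: dict) -> bool:
    """Format check: required keys present with non-empty values."""
    return (REQUIRED_KEYS <= set(example) and
            all(example[k] for k in REQUIRED_KEYS))


def clean_dataset(candidates: list) -> list:
    """Drop malformed examples, then deduplicate while keeping order."""
    seen, cleaned = set(), []
    for ex in candidates:
        if not is_well_formed(ex):
            continue                            # format detection: discard
        key = json.dumps(ex, sort_keys=True)    # canonical form for dedup
        if key in seen:
            continue                            # duplicate: discard
        seen.add(key)
        cleaned.append(ex)
    return cleaned
```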
Step 340, a second training data set is acquired.
In order to further improve the tuning effect on the document question-answer model, after a first training data set for tuning training on the document question-answer model is generated according to the document data acquired from the plurality of target databases, a second training data set for tuning training on the document question-answer model can be further acquired. The second training data set comprises training data sets obtained by correcting abnormal response data of the document question-answer model.
Specifically, in the embodiment of the present disclosure, a badcase library storing the second training data set may be set in the document-based question answering apparatus. A badcase can be understood as a case in which an abnormal answer is generated during use of the document question-answer model, or as a case in which the answer output by the document question-answer model is inaccurate; a badcase can also be a case in which an answer output by the document question-answer model during evaluation does not match the label data in the evaluation data. These cases clearly reflect accuracy loopholes in the document question-answer model, so the badcases can be used to tune the document question-answer model in a reverse, targeted manner.
The data for correcting the abnormal response data in a badcase can be obtained from an evaluation data set, or from correction data submitted by a using object of the model. For example, after answering a question posed by a using object, the question-answering system presents an answer result evaluation page to the using object. On this page the using object can evaluate the accuracy of the answer data of the document question-answer model, e.g., as satisfied or unsatisfied. When the using object's evaluation of the answer data is unsatisfied, a recommended answer input interface can then be displayed to the using object, so that the using object can enter what it considers to be the correct answer. After receiving the corrected answer input by the using object, the question-answering system may construct tuning data based on the question input by the using object, the document retrieved for the input question, and the corrected answer input by the using object, and store the tuning data in the badcase library to update the second training data set. Alternatively, when the document question-answer model is evaluated with an evaluation data set and the output of the document question-answer model is detected to differ from the label data in the evaluation data, that evaluation data can be stored in the badcase library to update the second training data set.
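The two badcase sources described above (an unsatisfied user's submitted correction, and a mismatch against evaluation label data) can be sketched as one collection function; the argument names and the returned example format are assumptions for illustration.

```python
def collect_badcase(question: str, document: str, model_answer: str,
                    user_feedback: str = None,
                    reference_answer: str = None):
    """Return a corrected training example when the model answer is judged
    abnormal, either because a dissatisfied user supplied a correction or
    because the answer mismatches evaluation label data; otherwise None."""
    if user_feedback == "unsatisfied" and reference_answer:
        corrected = reference_answer        # correction typed by the user
    elif reference_answer and model_answer != reference_answer:
        corrected = reference_answer        # evaluation label mismatch
    else:
        return None                         # answer considered acceptable
    return {"input": {"document": document, "question": question},
            "output": corrected}
```

Examples returned by this function would be appended to the badcase library to keep the second training data set up to date.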
In the embodiment of the disclosure, in the process of deriving the tuning data set of the document question-answer model, the loader may perform data combination on the second training data set in the badcase library and the first training data set generated according to the document data acquired from the multiple target databases, so as to obtain a final target tuning data set for tuning the document question-answer model.
Fig. 4 is a schematic diagram of a flow framework for automatically generating the tuning data set of the document question-answer model using an ETL framework in an embodiment of the disclosure. As shown, the extractor 410 extracts source document data from multiple databases; the converter 420 then generates question-answer data pairs corresponding to each piece of document data, using the text processing capability of the text generation model 440 to generate accurate question-answer data pairs in parallel and at scale. Finally, the loader 430 cleans, deduplicates, and merges the large number of generated question-answer data pairs, and derives the tuning data set for tuning the document question-answer model.
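The extractor → converter → loader flow of Fig. 4 can be sketched as below, assuming `generate_qa_pairs` stands in for the text generation model call (a real system would invoke a large language model here); the function names are illustrative, not from the patent.

```python
# Minimal sketch of the ETL tuning-data pipeline (extractor 410 ->
# converter 420 -> loader 430).

def extract(databases):
    """Extractor: pull source document data from multiple databases."""
    return [doc for db in databases for doc in db]

def generate_qa_pairs(document):
    """Converter (stub): produce question-answer pairs for one document."""
    return [{"question": f"What does '{document}' describe?",
             "answer": document, "document": document}]

def load(qa_pairs, badcase_set):
    """Loader: clean, deduplicate, and merge with the badcase set."""
    seen, cleaned = set(), []
    for pair in qa_pairs + badcase_set:
        key = (pair["question"], pair["answer"])
        if pair["question"] and key not in seen:  # drop empty / duplicate pairs
            seen.add(key)
            cleaned.append(pair)
    return cleaned

databases = [["doc about kangaroos"],
             ["doc about crocodiles", "doc about kangaroos"]]  # one duplicate
docs = extract(databases)
pairs = [p for d in docs for p in generate_qa_pairs(d)]
tuning_set = load(pairs, badcase_set=[])
print(len(tuning_set))  # the duplicate document's pair is removed
```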
Step 350: train the document question-answer model using a target training data set consisting of the first training data set and the second training data set.
After the first training data set is generated according to the document data of the multiple categories obtained from the multiple target databases and the second training data set is obtained, the target training data set for tuning the document question-answer model can be determined from the first and second training data sets. The document question-answer model can then be trained with the target training data set, thereby tuning the document question-answer model.
In some embodiments, before training the document question-answer model using a target training data set consisting of the first training data set and the second training data set, the method further includes:
acquiring a plurality of template answer data pairs, wherein each template answer data pair comprises a template answer indicating that the document question-answer model cannot give an accurate answer to the corresponding question;
training the document question-answer model by adopting a target training data set consisting of a first training data set and a second training data set, wherein the training comprises the following steps:
and training the document question-answer model by adopting a target training data set formed by the first training data set, the second training data set and a plurality of template answer data pairs.
Because of the specificity of the application scenario of the document question-answer model, its answer data is generated based on a particular document; that is, answers cannot be generated for every possible question. In some cases, however, the question posed by a user of the question-answering system may be abnormal, such as garbled text, or a question entirely outside the application field of the question-answering system. An accurate answer cannot then be given based on the documents configured in the question-answering system. In this case, to ensure the accuracy of the question-answering model and to improve the user experience, the document question-answer model may feed back refusal text. The refusal text may be a pre-generated template answer, such as "Sorry, we cannot learn the answer to the question."
Therefore, for the document question-answer model to learn this refusal capability, a certain number of refusal samples are needed to train the document question-answer model. Specifically, a plurality of template answer data pairs can be obtained, where each template answer data pair comprises a template answer indicating that the document question-answer model cannot give an accurate answer to the corresponding question. That is, a template answer data pair may include a piece of document data, question data unrelated to that document data, and a template answer. When training the document question-answer model with the template answer data, the document data and the unrelated question data are taken as the input of the document question-answer model, and the template answer is taken as the target output.
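Assembling such refusal samples might look like the following sketch. The pairing logic and the template text are assumptions for illustration; the patent only specifies that a document, an unrelated question, and a template answer form one pair.

```python
# Hedged sketch of building template-answer (refusal) training pairs:
# document data plus an unrelated question, mapped to a fixed refusal template.

REFUSAL_TEMPLATE = "Sorry, we cannot learn the answer to the question."

def build_refusal_pairs(documents, unrelated_questions):
    """Pair each document with a question unrelated to it; the target
    output is the refusal template rather than a document-grounded answer."""
    pairs = []
    for doc, question in zip(documents, unrelated_questions):
        pairs.append({
            "input": {"document": doc, "question": question},
            "output": REFUSAL_TEMPLATE,   # the model learns to refuse
        })
    return pairs

docs = ["A study of Eukangaroo habitats."]
questions = ["In which areas is the crocodile distributed?"]  # unrelated
refusal_set = build_refusal_pairs(docs, questions)
print(refusal_set[0]["output"])
```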
In some embodiments, training the document question-answer model using a target training dataset comprised of a first training dataset and a second training dataset, comprises:
generating a mirror image environment for training a document question-answer model;
training a document question-answer model in a mirror image environment based on a target training data set consisting of the first training data set and the second training data set to obtain model parameters after training of the document question-answer model;
and updating the document question-answering model based on the trained model parameters.
In the embodiment of the disclosure, to prevent the tuning process from affecting use of the model, the specific process of training the document question-answer model with the target training data set can tune the document question-answer model via a mirror-environment-based method.
Specifically, a mirror environment for training the document question-answer model can be generated first; the mirror environment contains mirror data of the document question-answer model. The document question-answer model is then trained in the mirror environment based on the target training data set to obtain trained model parameters. After training in the mirror environment yields a trained document question-answer model, the trained model can be further evaluated. If the evaluation result does not reach a preset model evaluation index, the document question-answer model is tuned again; if it does, the model parameters trained in the mirror environment can be exported, and the parameters of the document question-answer model deployed online are updated based on the exported parameters, so that document-based questions can be answered.
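The gated update loop just described can be sketched as follows. The training and evaluation functions are stubs and the evaluation index value is an assumption; only the control flow (train a mirror copy, evaluate, export parameters only on success) comes from the text above.

```python
# Sketch of mirror-environment tuning: train a copy of the model, evaluate
# it, and only export parameters to the online model when the evaluation
# result reaches the preset model evaluation index.

EVALUATION_INDEX = 0.8   # preset model evaluation index (assumed value)

def train_in_mirror(mirror_params, dataset):
    """Stand-in for training: each sample nudges a quality score upward."""
    mirror_params = dict(mirror_params)
    mirror_params["quality"] = min(1.0,
                                   mirror_params["quality"] + 0.1 * len(dataset))
    return mirror_params

def evaluate(params):
    return params["quality"]   # stub: evaluation score equals quality

online_params = {"quality": 0.5}
mirror_params = dict(online_params)      # mirror data of the online model

target_dataset = ["sample"] * 4
mirror_params = train_in_mirror(mirror_params, target_dataset)

if evaluate(mirror_params) >= EVALUATION_INDEX:
    online_params = mirror_params        # export parameters, update online model
print(online_params["quality"])
```

The key property is that `online_params` is never mutated during training: the online model keeps serving until the mirror copy passes evaluation.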
In some embodiments, after training the document question-answer model using the target training data set consisting of the first training data set and the second training data set, the method further includes:
acquiring an evaluation data set for evaluating the document question-answer model, wherein the evaluation data set comprises a plurality of evaluation question-answer data pairs;
performing approximation expansion on the evaluation question data in the evaluation question-answer data pairs to obtain an evaluation question-answer data pair set associated with each evaluation question-answer data pair;
and evaluating the document question-answer model according to the plurality of sets of evaluation question-answer data pairs associated with the plurality of evaluation question-answer data pairs to obtain an evaluation result.
In the embodiment of the disclosure, after the document question-answer model is tuned based on the target tuning data set, it can be further evaluated to determine whether its model effect meets the standard for online use. In the related art, after tuning a document question-answer model, evaluation is generally performed by running a fixed case set. Because the data set used to evaluate the document question-answer model is fixed, the model becomes sensitive to that evaluation data set, and the quality of its answers to other questions degrades. For example, for questions not covered by the evaluation set, it may be impossible to determine whether the model's output is a badcase, which ultimately reduces the accuracy and generalization ability of the document question-answer model.
Therefore, in the embodiment of the disclosure, when evaluating the tuned document question-answer model, an evaluation data set for evaluating the model can be obtained first; the evaluation data set comprises a plurality of evaluation question-answer data pairs. Of course, the evaluation data set also comprises the document data corresponding to each evaluation question-answer data pair. The evaluation question data in each pair can then be approximately expanded to obtain a plurality of expanded question data corresponding to each evaluation question. For example, when the evaluation question is "In which areas is the Eukangaroo distributed?", several expanded questions can be generated, such as "Which areas have the Eukangaroo?", "In which areas does the Eukangaroo live?", "I want to know where there are Eukangaroos?", and "Where is there a Eukangaroo?".
New evaluation question-answer data pairs can then be formed from the expanded question data and the answer data of the original evaluation question-answer data pair, and the new pairs together with the original pair can be determined to be the set of evaluation question-answer data pairs associated with that evaluation question-answer data pair.
After the associated pairs are generated for each evaluation question-answer data pair by this question expansion method, the document question-answer model can be evaluated with a new evaluation data set formed from the plurality of associated evaluation question-answer data pairs. This avoids the drop in answer accuracy on other questions caused by evaluating the document question-answer model with a fixed evaluation data set, and thus improves model accuracy.
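The expansion of one evaluation pair into an associated set can be sketched as below. The paraphrase list is hard-coded here for illustration; the patent generates the paraphrases with a text generation model.

```python
# Minimal sketch of approximate expansion of an evaluation question-answer
# pair: each paraphrase keeps the original answer, and the expanded pairs
# plus the original pair form the associated evaluation set.

def expand_pair(eval_pair, paraphrases):
    expanded = [{"question": q, "answer": eval_pair["answer"]}
                for q in paraphrases]
    return [eval_pair] + expanded    # associated evaluation question-answer set

original = {"question": "In which areas is the Eukangaroo distributed?",
            "answer": "The Eukangaroo is distributed in the XXX region."}
paraphrases = ["Which areas have the Eukangaroo?",
               "In which areas does the Eukangaroo live?",
               "Where is there a Eukangaroo?"]
associated_set = expand_pair(original, paraphrases)
print(len(associated_set))  # original pair plus three expansions
```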
In the specific process of evaluating the document question-answer model with the expanded evaluation data set, the evaluation result can be obtained by manual evaluation, or by large-model technology: the evaluation document, the evaluation question, and the answer produced by the document question-answer model are input directly into a GPT model, which evaluates the answer quality and outputs indexes such as accuracy and recall.
FIG. 5 is a schematic diagram of a process for evaluating a document question-answer model in the present disclosure. As shown, the data set for evaluating the document question-answer model includes a general evaluation question set 510 and a service high-frequency question set 520. First, questions are extracted from these two evaluation data sets to obtain the evaluation questions 530. The evaluation questions 530 are then expanded using the foregoing question-expansion method to obtain a plurality of expanded questions 540, and the evaluation question set is updated based on the expanded questions 540 to obtain an expanded evaluation question set 550. The expanded evaluation question set 550 is then used to pose questions to the document question-answer model under evaluation, yielding the answer data 560 output by the model for each question. The answer data 560 may be evaluated both by a model and manually: model evaluation yields the model evaluation result 570, and manual evaluation yields the manual evaluation result 580. The model evaluation result 570 and the manual evaluation result 580 may then be weighted and summed to obtain the final evaluation result of the document question-answer model.
In some embodiments, performing approximate expansion on the evaluation question data in the evaluation question-answer data pairs to obtain the set of evaluation question-answer data pairs associated with each evaluation question-answer data pair includes:
extracting prompt words from the question data in the evaluation question-answer data pairs to obtain a question prompt word corresponding to each piece of question data;
performing approximate expansion on the question data based on the question prompt words to obtain a plurality of pieces of approximate question data;
and generating the set of evaluation question-answer data pairs associated with each evaluation question-answer data pair according to the plurality of pieces of approximate question data and the answer data corresponding to the question data.
In the embodiment of the disclosure, the specific process of approximately expanding the question data extracted from the evaluation question-answer data pairs may be as follows. First, prompt words are extracted from the question data in each evaluation question-answer data pair to obtain the question prompt word corresponding to each piece of question data. The question data can then be approximately expanded based on the question prompt words to obtain a plurality of approximate questions. Further, the set of evaluation question-answer data pairs associated with each evaluation question-answer data pair may be generated according to the approximate question data and the answer data corresponding to the original question data.
Specifically, suppose the question data is the aforementioned "In which areas is the Eukangaroo distributed?". Extracting prompt words from this question yields two prompt words: "Eukangaroo" and "distributed". The prompt word "Eukangaroo" can be kept unchanged, while for "distributed" paraphrases or synonyms can be generated, such as "living", "having", and "present".
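The prompt-word based expansion for this example can be sketched as follows. The synonym table and the toy extraction function are assumptions; the patent performs both steps with a text generation model.

```python
# Sketch of prompt-word expansion: extract key terms from a question, keep
# the entity term fixed, and substitute synonyms for the predicate term.

SYNONYMS = {"distributed": ["living", "found", "present"]}

def extract_prompt_words(question):
    """Toy extraction: entity and predicate for this example question."""
    return {"entity": "Eukangaroo", "predicate": "distributed"}

def expand_question(question):
    words = extract_prompt_words(question)
    entity = words["entity"]                       # kept unchanged
    variants = []
    for synonym in SYNONYMS.get(words["predicate"], []):
        variants.append(f"In which areas is the {entity} {synonym}?")
    return variants

approximations = expand_question("In which areas is the Eukangaroo distributed?")
print(approximations)
```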
In some embodiments, performing approximate expansion on the question data based on the question prompt words to obtain a plurality of pieces of approximate question data includes:
carrying out semantic analysis on the question data, and determining a target text generation model corresponding to the question data according to the semantic analysis result;
and performing question data generation on the question prompt words based on the target text generation model to obtain a plurality of pieces of approximate question data.
In the embodiment of the disclosure, the approximate expansion of the question data based on the prompt words extracted from it can be implemented with large-model technology: a text generation model generates question text from the prompt words, yielding a plurality of corresponding pieces of approximate question data.
To avoid generating inaccurate approximate questions whose semantics deviate severely from the original, which would make the evaluation samples insufficiently accurate, the embodiment of the disclosure can perform semantic analysis on the question data before generating question text with a text generation model. From the semantic analysis result, semantic information such as the application field and scenario of the question can be determined. A target text generation model matching the question data can then be selected from a plurality of text generation models according to the semantic analysis result, and question data generation is performed on the question prompt words based on the target text generation model to obtain a plurality of pieces of approximate question data.
In some embodiments, evaluating the document question-answer model according to the plurality of sets of evaluation question-answer data pairs associated with the plurality of evaluation question-answer data pairs to obtain an evaluation result includes:
calculating semantic similarity between elements in each evaluation question-answer data pair set and corresponding evaluation question-answer data pairs;
dividing elements in each evaluation question-answer data pair set into a plurality of categories according to semantic similarity;
evaluating the document question-answer model based on the plurality of categories of evaluation question-answer data to obtain a plurality of sub-evaluation results;
and determining the evaluation result of the document question-answer model according to the plurality of sub-evaluation results.
In the embodiment of the disclosure, before evaluating the document question-answer model with the sets of evaluation question-answer data pairs associated with the plurality of evaluation question-answer data pairs, the semantic similarity between the elements of each associated set (i.e., the evaluation question-answer data pairs it contains) and the original evaluation question-answer data pair corresponding to that set may be calculated. That is, the semantic similarity between each expanded question obtained by the expansion and its corresponding initial question is computed. The elements of each set can then be divided into a plurality of categories according to semantic similarity, for example 0-60% similarity as one category, 60-80% as another, and 80-100% as a third. After the elements are divided into categories, the document question-answer model can be evaluated on each category of evaluation data separately, yielding a plurality of sub-evaluation results corresponding to the categories.
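The similarity binning described above can be sketched as below, using the 0-60% / 60-80% / 80-100% bands from the example. The similarity scores are given directly; a real system would compute them with a semantic similarity model.

```python
# Sketch of dividing expanded evaluation pairs into categories by their
# semantic similarity to the original evaluation pair.

def categorize(pairs_with_similarity):
    bands = {"low": [], "mid": [], "high": []}
    for pair, sim in pairs_with_similarity:
        if sim < 0.6:
            bands["low"].append(pair)     # 0-60% similarity
        elif sim < 0.8:
            bands["mid"].append(pair)     # 60-80% similarity
        else:
            bands["high"].append(pair)    # 80-100% similarity
    return bands

expanded = [("Where does it live?", 0.55),
            ("Which areas have it?", 0.72),
            ("Where is it distributed?", 0.95)]
bands = categorize(expanded)
print({k: len(v) for k, v in bands.items()})
```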
After the document question-answer model has been evaluated with the different categories of evaluation data to obtain the plurality of sub-evaluation results, the overall evaluation result of the document question-answer model can be determined from those sub-evaluation results.
In some embodiments, determining an evaluation result of the document question-answer model according to the plurality of sub-evaluation results comprises:
calculating a weight coefficient of each sub-evaluation result according to the semantic similarity;
and carrying out weighted calculation on the plurality of sub-evaluation results based on the weight coefficient to obtain the evaluation result of the document question-answer model.
In the embodiment of the disclosure, after the plurality of sub-evaluation results obtained by evaluating the document question-answer model with the different categories of evaluation data are determined, the final evaluation result can be calculated from them as follows: the weight coefficient of each sub-evaluation result is calculated according to the semantic similarity, and the sub-evaluation results are then weighted and summed with their corresponding weight coefficients to obtain the evaluation result of the document question-answer model.
Specifically, if the document question-answer model can output accurate answers even for expanded evaluation data with low semantic similarity to the original evaluation data, its generalization ability and accuracy are higher. Therefore, the weight coefficient of each sub-evaluation result can be set in inverse proportion to the semantic similarity between the expanded and original evaluation data: the lower the semantic similarity, the higher the corresponding weight coefficient. This yields a more accurate evaluation result for the document question-answer model.
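One way to realize the inverse relation is sketched below. The specific weighting formula (`1 - similarity`, normalized) is an assumption; the patent only states that lower similarity should receive a higher weight.

```python
# Sketch of weighting sub-evaluation results in inverse proportion to
# semantic similarity: lower similarity -> higher weight.

def weighted_evaluation(sub_results):
    """sub_results: list of (mean_similarity_of_category, sub_score)."""
    raw = [(1.0 - sim, score) for sim, score in sub_results]  # inverse relation
    total = sum(w for w, _ in raw)
    return sum(w * score for w, score in raw) / total

# low-similarity category scored 0.9, high-similarity category scored 0.7;
# the low-similarity score dominates the weighted result.
result = weighted_evaluation([(0.5, 0.9), (0.9, 0.7)])
print(round(result, 4))
```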
In some embodiments, when evaluating the document question-answer model, in addition to evaluating its ability to give accurate answers, it is desirable to further evaluate whether it can correctly refuse to answer. Thus, in the expanded evaluation set, one or more invalid approximate questions may also be generated, such as "In which areas is the crocodile distributed?". Clearly, the related literature on the study of the Eukangaroo contains no data on crocodile distribution, so the document question-answer model should output the refusal answer template data. If the document question-answer model outputs the refusal answer template data, it has a certain refusal capability; otherwise, corresponding tuning data needs to be generated to further tune the document question-answer model so that it learns the refusal capability.
In some embodiments, after evaluating the document question-answer model according to the plurality of sets of evaluation question-answer data pairs associated with the plurality of evaluation question-answer data pairs to obtain the evaluation result, the method further includes:
in the process of evaluating the document question-answer model, when it is identified that the model's answer to a target evaluation question is inconsistent with the answer data corresponding to that question, generating accurate target answer data for the target evaluation question;
and adding the abnormal answer data pair consisting of the target evaluation question and the target answer data to the second training data set.
When the document question-answer model is evaluated with the sets of associated evaluation question-answer data pairs obtained after expansion, the answer output by the model may match the answer data in an evaluation pair, or it may not, in which case the aforementioned badcase occurs. In the process of evaluating the document question-answer model, when it is identified that the model's answer to a target evaluation question is inconsistent with the answer data corresponding to that question, accurate target answer data for the target evaluation question can be generated. An abnormal answer data pair is then formed from the target evaluation question and the target answer data and added to the second training data set, i.e., the badcase library, so that the next time the document question-answer model is tuned, the abnormal answer data pair can be used to tune it in a targeted manner.
Fig. 6 is a system architecture diagram of a system for tuning a document question-answer model provided by the present disclosure. As shown, the system includes a tuning data generation system 610, a model training system 620, a model evaluation system 630, and a badcase library 640. The tuning data generation system 610 also interfaces with a large model invocation interface 650 to invoke the text processing capabilities of the large model. Specifically, the tuning data generation system 610 automatically acquires source document data based on ETL technology, and then automatically invokes the large model through the large model invocation interface 650 to generate a tuning data set corresponding to the source document data. In addition, the tuning data generation system 610 can automatically obtain, from the badcase library 640, tuning data generated from badcases that arose during model evaluation and model use, and combine it with the tuning data generated from the source document data by the large model into a single tuning data set for tuning the document question-answer model. After receiving the tuning data set derived by the tuning data generation system 610, the model training system 620 automatically performs tuning training on the document question-answer model to obtain a tuned model. Once the model training system 620 completes tuning training, the model evaluation system 630 is triggered to automatically acquire an evaluation data set, expand it into an expanded evaluation data set, automatically evaluate the document question-answer model with the expanded set based on large-model technology, and output the evaluation result.
Step 360: obtain the target question and the associated target document, and answer the target question based on the trained document question-answer model and the target document to obtain the answer result.
When the document question-answer model has been trained with the first training data set automatically generated by the large language model and the acquired second training data set, the trained model is evaluated; once the evaluation result shows that the model's effect meets the preset standard, the model can be deployed online to answer questions.
Specifically, the trained document question-answer model may be deployed in the aforementioned question-answering system, where the question-answering server receives the target question sent by the question-answering client, e.g. "Where does the Eukangaroo live?". Based on the question, the question-answering server retrieves the target document corresponding to the target question; the target document may specifically be a document that researches and analyzes the Eukangaroo. The question-answering server then sends the target question and the target document to the document question-answer model, which analyzes and predicts from the input target document and target question to produce an answer, for example "The Eukangaroo lives in the XXX region". The document question-answer model then sends the answer to the question-answering server, which forwards it to the question-answering client for display.
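This serving flow can be sketched as below. `retrieve_document` and `DocumentQAModel` are illustrative stubs (a real deployment would use a retrieval index and the tuned model); the stored document text is invented and deliberately keeps the elided "XXX region" placeholder.

```python
# Minimal sketch of the serving flow: the server receives a target question,
# retrieves the associated target document, and feeds both to the model.

DOCUMENT_STORE = {
    "Eukangaroo": "Research document: the Eukangaroo lives in the XXX region."
}

def retrieve_document(question):
    """Stub retrieval: match the question against stored document topics."""
    for topic, doc in DOCUMENT_STORE.items():
        if topic.lower() in question.lower():
            return doc
    return None

class DocumentQAModel:
    def answer(self, question, document):
        if document is None:   # no grounding document -> refusal template
            return "Sorry, we cannot learn the answer to the question."
        return document.split(": ", 1)[1]   # stub: echo the document body

model = DocumentQAModel()
question = "Where does the Eukangaroo live?"
document = retrieve_document(question)
answer = model.answer(question, document)
print(answer)
```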
According to the document-based question answering method provided by the embodiment of the disclosure: document data of a plurality of categories is obtained from a plurality of target databases, where the target databases are databases associated with the document question-answer model to be trained; question-answer data pair mining is performed on the document data of the plurality of categories based on the text generation model corresponding to each category of document data, obtaining a plurality of question-answer data pairs corresponding to each piece of document data; a first training data set is generated from each piece of document data and its question-answer data pairs; a second training data set is acquired, the second training data set being obtained by correcting the abnormal answer data of the document question-answer model; the document question-answer model is trained with a target training data set consisting of the first and second training data sets to obtain a trained model; and the server 110 obtains the target question and the associated target document and answers the question based on the trained document question-answer model to obtain the answer result.
In this way, after obtaining source document data of multiple categories, the embodiment of the disclosure can automatically generate the training data set for training the document question-answer model based on the text generation model corresponding to each category of source document data. In addition, training data obtained by correcting the abnormal answer data of the document question-answer model can be acquired to supplement the training data set. Compared with manually generating the training data set, this greatly improves the efficiency of generating training data for the document question-answer model, and thus its training efficiency. The accuracy of document-based question answering can therefore be improved by increasing the tuning frequency of the document question-answer model.
Detailed description of the embodiments of the disclosure in connection with a specific application scenario
Fig. 7 is a flowchart of a training method of a document question-answer model provided by the present disclosure. The training method will be described in detail below with reference to the execution subject of each step. The method specifically includes the following steps:
in step 701, an extractor of the tuning data generating system acquires source document data from a plurality of open source data sets, and segments the source document data to obtain segmented data.
The training method of the document question-answer model of this embodiment will be described based on the system architecture diagram shown in fig. 6. In this embodiment, when the document question-answer model needs tuning, the extractor in the tuning data generation system automatically acquires source document data from a plurality of open source data sets, specifically including the background knowledge base of the question-answering system, Java interview questions, and a Python knowledge base. After acquiring the source document data, the extractor segments it according to a certain format to obtain multiple segments of data (or segmented document data).
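The segmentation step might look like the following sketch. The segment size and paragraph-based splitting are assumptions; the patent only says the data is segmented "according to a certain format".

```python
# Sketch of the extractor's segmentation: split source document text on
# blank lines, then pack paragraphs into bounded-size segments.

def segment_document(text, max_chars=100):
    """Split on blank lines first, then pack paragraphs into segments."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    segments, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 1 > max_chars:
            segments.append(current)   # segment full: start a new one
            current = para
        else:
            current = f"{current}\n{para}".strip()
    if current:
        segments.append(current)
    return segments

source = ("Java interview question one." + "\n\n" +
          "A much longer paragraph " * 5 + "\n\n" +
          "Python knowledge base entry.")
segments = segment_document(source, max_chars=80)
print(len(segments))
```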
The extractor in the embodiment of the disclosure can automatically extract the needed data, reducing the time and cost of manual extraction. Beyond data extraction, the extractor also supports real-time caching and data recovery. When an abnormal situation occurs, the real-time cache helps save already-extracted data to avoid data loss, and the data recovery function makes recovering data in an abnormal situation easier and faster.
Step 702: the converter of the tuning data generation system calls a plurality of text generation models to mine question-answer data from the segmented data, obtaining question-answer data pairs corresponding to each segment.
After the extractor of the tuning data generating system automatically extracts the complete source document data from the open source databases and segments it into a plurality of pieces of segment data, the converter of the tuning data generating system can generate, for each piece of segment data, the corresponding question-answer data pairs.
Specifically, the converter of the tuning data generating system implements a set of proxy layers for the large language model framework, in which interfaces of different large language models are adapted. The converter can call the plurality of large language models simultaneously through the interfaces of the plurality of large language models to generate corresponding question-answer data pairs for different segmented data. Meanwhile, the converter of the tuning data generation system also supports automatic storage of the converted data, can support custom field setting, and can automatically store the converted data into a corresponding data table field.
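The proxy-layer idea can be sketched as a thin adapter interface over heterogeneous model APIs, with segments fanned out across models in parallel. The adapter names and the stub implementation below are hypothetical; a real adapter would call the vendor's API:

```python
import concurrent.futures
from abc import ABC, abstractmethod

class LLMAdapter(ABC):
    """Uniform interface adapting different large-language-model APIs."""
    @abstractmethod
    def generate_qa_pair(self, segment: str) -> tuple[str, str]: ...

class StubAdapter(LLMAdapter):
    """Placeholder adapter for illustration; a real one would call a vendor API."""
    def __init__(self, name: str):
        self.name = name

    def generate_qa_pair(self, segment: str) -> tuple[str, str]:
        return (f"[{self.name}] Q about: {segment[:20]}", f"[{self.name}] A")

def convert(segments: list[str], adapters: list[LLMAdapter]) -> list[tuple[str, str]]:
    """Distribute segments round-robin across adapters and call them in parallel."""
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(adapters[i % len(adapters)].generate_qa_pair, seg)
            for i, seg in enumerate(segments)
        ]
        return [f.result() for f in futures]

pairs = convert(["seg one", "seg two"], [StubAdapter("model-a"), StubAdapter("model-b")])
print(pairs)
```

Because results are collected in submission order, the output pairs stay aligned with the input segments even though the calls run concurrently.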
In step 703, the loader of the tuning data generating system cleans and filters the question-answer data pair corresponding to each generated segment data, and combines the filtered data with the data obtained from the badcase library to obtain the tuning data set.
Further, the loader in the tuning data generating system performs data cleaning and filtering on the question-answer data generated by the converter of the tuning data generating system, using preset data cleaning rules. The cleaning process may specifically screen out abnormal data, such as repeated data or data with a wrong format, and the filtering operation may specifically be deduplication, thereby improving the quality of the tuning data set. The loader then combines the filtered data with the data obtained from the badcase library, and the combined data may be deduplicated again, so that both the quantity and the quality of the tuning data set used to train the document question-answer model are improved to a greater extent.
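A minimal sketch of this clean-filter-merge step; the well-formedness rule and the choice of the question text as the deduplication key are illustrative assumptions, not the disclosure's preset rules:

```python
def clean_and_merge(generated: list[tuple[str, str]],
                    badcase: list[tuple[str, str]],
                    min_len: int = 5) -> list[tuple[str, str]]:
    """Screen out malformed question-answer pairs, then merge the filtered
    data with badcase-library data and deduplicate the combined set."""
    def well_formed(pair: tuple[str, str]) -> bool:
        q, a = pair
        return bool(q.strip()) and bool(a.strip()) and len(q) >= min_len

    filtered = [p for p in generated if well_formed(p)]
    seen, deduped = set(), []
    for pair in filtered + list(badcase):   # merge, badcase data last
        key = pair[0].strip().lower()       # dedupe on normalized question text
        if key not in seen:
            seen.add(key)
            deduped.append(pair)
    return deduped
```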
The loader of the tuning data generation system in the embodiment can automatically complete post-processing, merging and exporting of the large-scale data set, so that the time and cost for manually processing the data are saved. In addition, the data combination supporting multiple knowledge bases also greatly improves the quality and quantity of the data sets.
Step 704, the loader of the tuning data generating system exports the tuning data set to the model training system.
And after the loader of the tuning data generating system performs data merging, cleaning and deduplication operations, the finally generated tuning data set is exported to a model training system for tuning training of the document question-answering model.
The tuning data generation system provided by the embodiment of the disclosure can automatically complete the data preparation work: it provides a portable one-key operation that automatically completes data preparation, including data grabbing, data cleaning, and data arrangement, saving time and effort. It also supports flexible insertion of data cleaning rules, so that data can be cleaned and converted according to custom rules as required, ensuring the accuracy and consistency of the data. In addition, the prompt words for the GPT model can be customized to generate question-answer data pairs meeting the requirements. Parallel processing is supported, improving the efficiency and speed of data generation. A data recovery function ensures the safety of the data during cleaning, and a single cleaning task can be executed independently, which is convenient for debugging and optimizing that specific task.
Step 705, the model training system trains the document question-answer model according to the received tuning data set.
After receiving the tuning data set generated by the tuning data generating system, the model training system can train the document question-answering model based on the tuning data set. This training process has been described in detail in the previous examples and will not be described in detail here.
Step 706, after the model training system finishes training the document question-answer model, sending an evaluation instruction to the model evaluation system, and providing a calling interface of the trained document question-answer model to the model evaluation system.
The model training system can detect the training progress in real time in the process of training the document question-answer model based on the tuning data set. And after the completion of the training of the document question-answer model is detected, an evaluation instruction can be sent to the model evaluation system so as to trigger the model evaluation system to evaluate the trained document question-answer model.
Meanwhile, the model training system can also provide a calling interface of the trained document question-answer model for the model evaluating system, so that the model evaluating system can call the trained document question-answer model when performing model evaluation.
Step 707, after receiving the evaluation instruction, the model evaluation system acquires an evaluation data set.
The model evaluation system may begin to automatically acquire an evaluation dataset after receiving an evaluation instruction, where the acquired evaluation dataset may be referred to as a seed dataset. Specifically, the general evaluation problem set and the service high-frequency problem set can be included as seed data sets. In some cases, the two data sets may be selected and used according to a certain proportion, so as to obtain a seed data set.
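Mixing the two problem sets by proportion can be sketched as below; the default ratio, target size, and fixed random seed are assumptions for reproducibility of the illustration, not values from the disclosure:

```python
import random

def build_seed_set(general_set: list, business_set: list,
                   general_ratio: float = 0.5, size: int = 100,
                   seed: int = 0) -> list:
    """Sample from the general evaluation problem set and the service
    high-frequency problem set according to a configurable proportion."""
    rng = random.Random(seed)
    n_general = min(int(size * general_ratio), len(general_set))
    n_business = min(size - n_general, len(business_set))
    return rng.sample(general_set, n_general) + rng.sample(business_set, n_business)
```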
In step 708, the model evaluation system expands the evaluation data set to obtain an expanded evaluation data set.
After the seed data set for evaluating the trained document question-answer model is obtained, the model evaluation system can further construct prompt words through the large language model, and random variation is carried out on the problems in the seed data set according to the prompt words, so that the evaluation data set is expanded, and an expanded evaluation data set is obtained.
Specifically, for the question data in each piece of seed data, a plurality of approximate questions can be generated through the prompt words, and then an extended evaluation data set is determined according to the combination of the plurality of approximate questions and corresponding answers.
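The expansion step can be sketched as prompting a language model for paraphrases of each seed question and pairing every variant with the original answer. The `paraphrase` callable below is a stand-in for the real large-language-model call driven by the constructed prompt words:

```python
from typing import Callable

def expand_evaluation_set(seed_pairs: list[tuple[str, str]],
                          paraphrase: Callable[[str, int], list[str]],
                          n_variants: int = 3) -> list[tuple[str, str]]:
    """For each (question, answer) seed pair, generate approximate questions
    and keep the original answer for every variant."""
    expanded = []
    for question, answer in seed_pairs:
        variants = paraphrase(question, n_variants)  # stand-in for an LLM call
        for variant in [question] + variants:
            expanded.append((variant, answer))
    return expanded

# Toy paraphraser used only to make the sketch runnable.
def toy_paraphrase(q: str, n: int) -> list[str]:
    return [f"{q} (variant {i + 1})" for i in range(n)]

print(expand_evaluation_set([("What is RAG?", "Retrieval-augmented generation.")],
                            toy_paraphrase, n_variants=2))
```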
In step 709, the model evaluation system calls the trained document question-answer model to evaluate based on the extended evaluation data, and answer data output by the document question-answer model is obtained.
Further, the model evaluation system can call the trained document question-answer model to perform evaluation based on the extended evaluation data. Specifically, the trained document question-answer model deployed on a local or test server is called, the question data and the document data in the extended evaluation data are input into the trained document question-answer model, and the answer data output by the trained document question-answer model is received.
Step 710, the model evaluation system calls a large model to evaluate the answer data output by the document question-answer model, and an evaluation result is obtained.
After receiving answer data returned by the trained document question-answer model, the model evaluation system can further call the large model to combine question data, document data, question prompting words and the like of the evaluation data with the answer data output by the document question-answer model, and then construct the prompting words for evaluation so that the large model evaluates the quality of the answer data and obtains an evaluation result output by the large model. The evaluation result may specifically include indexes such as accuracy and recall rate.
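Constructing the prompting words for evaluation can be sketched as assembling the question, document, reference answer, and model answer into a single judging prompt. The wording and the 1-5 scoring rubric below are illustrative assumptions, not the patented prompt text:

```python
def build_judge_prompt(question: str, document: str,
                       reference_answer: str, model_answer: str) -> str:
    """Assemble an evaluation prompt for the judging large model."""
    return (
        "You are grading a document-based question-answering system.\n"
        f"Document:\n{document}\n\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Model answer: {model_answer}\n\n"
        "Rate the model answer for accuracy and completeness on a 1-5 scale, "
        "and reply with the score only."
    )
```

The judge's numeric replies can then be aggregated into indexes such as accuracy and recall rate.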
Step 711, the model evaluation system adds the evaluation data corresponding to the badcase data detected in the evaluation process into a badcase library.
In addition, the model evaluation system can evaluate the trained document question-answer model to determine its quality, and specific loopholes existing in the question-answer model can be found in the evaluation process, where the specific loopholes are embodied in the form of badcase data. The model evaluation system can add the evaluation data corresponding to the badcase data detected in the evaluation process into the badcase library, so that this data is used the next time the document question-answer model is tuned, thereby improving the quality of the document question-answer model in a targeted manner.
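Detecting and recording badcases can be sketched as a simple threshold on the judge score; the score field names and the threshold value are hypothetical:

```python
def record_badcases(eval_results: list[dict], badcase_store: list,
                    threshold: int = 3) -> list:
    """Append evaluation items whose judge score falls below a threshold
    to the badcase library, for use in the next tuning round."""
    for item in eval_results:
        if item["score"] < threshold:
            badcase_store.append((item["question"], item["reference_answer"]))
    return badcase_store
```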
Apparatus and device descriptions of embodiments of the present disclosure
It will be appreciated that, although the steps in the various flowcharts described above are shown in succession in the order indicated by the arrows, the steps are not necessarily executed in that order. Unless explicitly stated in the present embodiment, the order of the steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts described above may include a plurality of sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and the order of execution of these sub-steps or stages is not necessarily sequential; they may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps.
In the various embodiments of the present disclosure, when related processing is performed according to data related to characteristics of a target object, such as attribute information or attribute information set of the target object, permission or consent of the target object is obtained first, and the collection, use, processing, etc. of the data complies with relevant laws and regulations and standards of the related region. In addition, when the embodiment of the application needs to acquire the attribute information of the target object, the independent permission or independent consent of the target object is acquired through a popup window or a jump to a confirmation page or the like, and after the independent permission or independent consent of the target object is explicitly acquired, the necessary target object related data for enabling the embodiment of the application to normally operate is acquired.
Fig. 8 is a schematic structural diagram of a document-based question answering apparatus 800 according to an embodiment of the present disclosure.
The device comprises:
a first obtaining unit 810, configured to obtain document data of a plurality of categories from a plurality of target databases, where the target databases are databases associated with a document question-answer model to be trained;
a mining unit 820, configured to mine question-answer data pairs for document data of multiple categories based on text generation models corresponding to the document data of each category, so as to obtain multiple question-answer data pairs corresponding to each document data;
a generating unit 830, configured to generate a first training data set according to each document data and a plurality of question-answer data pairs corresponding to the document data;
a second obtaining unit 840, configured to obtain a second training data set, where the second training data set is a training data set obtained by correcting abnormal response data of the document question-answer model;
the training unit 850 is configured to train the document question-answer model by using a target training data set composed of the first training data set and the second training data set, so as to obtain a trained document question-answer model;
and the answering unit 860 is configured to obtain a target question and an associated target document, and answer the question based on the trained document question-answering model, thereby obtaining an answer result.
Optionally, in some embodiments, the mining unit includes:
the mining subunit is used for mining the prompt words of the document data of the plurality of categories to obtain a plurality of prompt words corresponding to each document data;
and the first generation subunit is used for generating question-answer texts of a plurality of prompt words corresponding to each document data based on the text generation model corresponding to each category of document data, so as to obtain a plurality of question-answer data pairs corresponding to each document data.
Optionally, in some embodiments, the first generation subunit includes:
the first determining module is used for determining a text generation model calling interface corresponding to the document data of each category;
the first generation module is used for simultaneously calling a plurality of text generation model calling interfaces to generate question-answer texts of the corresponding prompt words, and a plurality of question-answer data pairs corresponding to each document data are obtained.
Optionally, in some embodiments, the mining subunit comprises:
the analysis module is used for carrying out semantic analysis and word sense analysis on the document data of a plurality of categories to obtain analysis results;
and the second determining module is used for determining a plurality of prompt words corresponding to each document data according to the analysis result.
Optionally, the document-based question answering device provided in the present disclosure further includes:
the first acquisition subunit is used for acquiring an evaluation data set for evaluating the document question-answer model, wherein the evaluation data set comprises a plurality of evaluation question-answer data pairs;
the extension subunit is used for performing approximate extension on the evaluation question data in the evaluation question-answer data pairs to obtain an evaluation question-answer data pair set associated with each evaluation question-answer data pair;
and the evaluation subunit is used for evaluating the document question-answer model according to a plurality of evaluation question-answer data pair sets associated with the plurality of evaluation question-answer data pairs to obtain an evaluation result.
Optionally, in some embodiments, the expansion subunit comprises:
the extraction module is used for extracting prompt words from the question data in the evaluation question-answer data pair to obtain the question prompt words corresponding to each question data;
the expansion module is used for carrying out approximate expansion on the problem data based on the problem prompt words to obtain a plurality of pieces of approximate problem data;
and the second generation module is used for generating an evaluation question-answer data pair set associated with each evaluation question-answer data pair according to the plurality of pieces of approximate question data and answer data corresponding to the question data.
Optionally, in some embodiments, the expansion module includes:
the analysis sub-module is used for carrying out semantic analysis on the problem data and determining a target text generation model corresponding to the problem data according to a semantic analysis result;
and the generation sub-module is used for generating the problem data of the problem prompt word based on the target text generation model to obtain a plurality of pieces of approximate problem data.
Optionally, in some embodiments, the evaluating subunit includes:
the computing module is used for computing semantic similarity between elements in each evaluation question-answer data pair set and corresponding evaluation question-answer data pairs;
the dividing module is used for dividing the elements in each evaluation question-answer data pair set into a plurality of categories according to the semantic similarity;
the evaluation module is used for evaluating the document question-answer model based on the evaluation question-answer data of a plurality of categories to obtain a plurality of sub-evaluation results;
and the third determining module is used for determining the evaluation result of the document question-answer model according to the plurality of sub-evaluation results.
Optionally, in some embodiments, the third determining module includes:
the first computing sub-module is used for computing the weight coefficient of each sub-evaluation result according to the semantic similarity;
And the second computing sub-module is used for carrying out weighted computation on the plurality of sub-evaluation results based on the weight coefficient to obtain the evaluation result of the document question-answer model.
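The weighted calculation performed by these two computing sub-modules can be sketched as follows; normalizing the similarity-derived weights so that they sum to one is one plausible choice for this sketch, not a requirement stated by the disclosure:

```python
def combine_sub_results(sub_results: list[tuple[float, float]]) -> float:
    """Combine (semantic_similarity, sub_score) pairs into one evaluation
    result, weighting each sub-result by its normalized similarity."""
    total_sim = sum(sim for sim, _ in sub_results)
    if total_sim == 0:
        return 0.0
    return sum((sim / total_sim) * score for sim, score in sub_results)
```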
Optionally, in some embodiments, the document-based question answering apparatus provided by the present disclosure further includes:
the second generation subunit is used for generating accurate target answer data of the target evaluation problem when the answer result of the document question-answer model on the target evaluation problem is recognized to be inconsistent with the answer data corresponding to the target evaluation problem in the process of evaluating the document question-answer model;
and the adding subunit is used for adding the abnormal response data pair consisting of the target evaluation problem and the target answer data into the second training data set.
Optionally, in some embodiments, the document-based question answering apparatus provided by the present disclosure further includes:
the second acquisition subunit is used for acquiring a plurality of template answer data pairs, wherein the template answer data pairs comprise template answers which indicate that the document question-answer model cannot give accurate answers of corresponding questions;
training unit, still be used for:
and training the document question-answer model by adopting a target training data set formed by the first training data set, the second training data set and a plurality of template answer data pairs.
Optionally, in some embodiments, the generating unit includes:
a third generation subunit, configured to generate a plurality of candidate training data according to each document data and a plurality of corresponding question-answer data;
and the processing subunit is used for carrying out format detection and de-duplication processing on the plurality of candidate training data to obtain a first training data set.
Optionally, in some embodiments, the training unit comprises:
the fourth generation subunit is used for generating a mirror image environment for training the document question-answer model;
the training subunit is used for training the document question-answer model in the mirror image environment based on a target training data set consisting of the first training data set and the second training data set to obtain model parameters after training of the document question-answer model;
and the updating subunit is used for updating the document question-answer model based on the trained model parameters.
Referring to fig. 9, fig. 9 is a block diagram of a portion of a terminal 140 implementing a document-based question answering method according to an embodiment of the present disclosure. The terminal 140 includes: Radio Frequency (RF) circuitry 910, memory 915, input unit 930, display unit 940, sensor 950, audio circuitry 960, wireless fidelity (WiFi) module 970, processor 980, and power source 990. It will be appreciated by those skilled in the art that the terminal 140 structure shown in fig. 9 does not constitute a limitation on a cell phone or computer, and the terminal may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
The RF circuit 910 may be used for receiving and transmitting signals during a message or a call; in particular, after downlink information of a base station is received, it is handed to the processor 980 for processing, and uplink data is sent to the base station.
The memory 915 may be used to store software programs and modules, and the processor 980 executes various functional applications of the terminal, including document-based question answering, by running the software programs and modules stored in the memory 915.
The input unit 930 may be used to receive input numerical or character information and to generate key signal inputs related to setting and function control of the terminal. In particular, the input unit 930 may include a touch panel 931 and other input devices 932.
The display unit 940 may be used to display input information or provided information and various menus of the terminal. The display unit 940 may include a display panel 941.
Audio circuitry 960, speaker 961, microphone 962 may provide an audio interface.
In this embodiment, the processor 980 included in the terminal 140 may perform the document-based question answering method of the previous embodiment.
The terminal 140 of the embodiments of the present disclosure includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, an intelligent home appliance, a vehicle-mounted terminal, an aircraft, etc.
Fig. 10 is a block diagram of a portion of a server 110 implementing a document-based question answering method according to an embodiment of the present disclosure. The server 110 may vary considerably in configuration or performance and may include one or more central processing units (Central Processing Unit, CPU) 1022 (e.g., one or more processors), storage 1032, and one or more storage media 1030 (e.g., one or more mass storage devices) storing applications 1042 or data 1044, where the storage 1032 and the storage media 1030 may be transient or persistent storage. The program stored on a storage medium 1030 may include one or more modules (not shown), each of which may include a series of instruction operations on the server 110. Further, the central processor 1022 may be configured to communicate with the storage medium 1030 and execute the series of instruction operations in the storage medium 1030 on the server 110.
The server 110 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input/output interfaces 1058, and/or one or more operating systems 1041, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The central processor 1022 in the server 110 may be used to perform the document-based question answering method of embodiments of the present disclosure.
The embodiments of the present disclosure also provide a storage medium storing program codes for executing the document-based question answering method of the foregoing embodiments.
The disclosed embodiments also provide a computer program product comprising a computer program. The processor of the computer device reads the computer program and executes it, causing the computer device to execute a document-based question answering method implementing the above.
The terms "first," "second," "third," "fourth," and the like in the description of the present disclosure and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein, for example. Furthermore, the terms "comprises," "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this disclosure, "at least one" means one or more, and "a plurality" means two or more. "and/or" for describing the association relationship of the association object, the representation may have three relationships, for example, "a and/or B" may represent: only a, only B and both a and B are present, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
It should be understood that in the description of the embodiments of the present disclosure, the meaning of a plurality (or multiple) is two or more, and that greater than, less than, exceeding, etc. is understood to not include the present number, and that greater than, less than, within, etc. is understood to include the present number.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units may be stored in a removable storage medium if implemented in the form of software functional units and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present disclosure may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present disclosure. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
It should also be appreciated that the various implementations provided by the embodiments of the present disclosure may be arbitrarily combined to achieve different technical effects.
The above is a specific description of the embodiments of the present disclosure, but the present disclosure is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present disclosure, and are included in the scope of the present disclosure as defined in the claims.
Claims (17)
1. A method of document-based question answering, the method comprising:
acquiring document data of a plurality of categories from a plurality of target databases, wherein the target databases are databases associated with a document question-answer model to be trained;
mining question-answer data pairs of the document data of the multiple categories based on a text generation model corresponding to the document data of each category to obtain multiple question-answer data pairs corresponding to the document data;
generating a first training data set according to each document data and the corresponding question-answer data pairs;
acquiring a second training data set, wherein the second training data set is a training data set obtained by correcting abnormal response data of a document question-answer model;
Training the document question-answer model by adopting a target training data set consisting of the first training data set and the second training data set to obtain a trained document question-answer model;
and obtaining a target problem and an associated target document, and carrying out problem answering on the target problem and the target document based on the trained document question-answering model to obtain an answering result.
2. The method according to claim 1, wherein the mining of the question-answer data pairs of the document data of the plurality of categories based on the text generation model corresponding to the document data of each category to obtain the question-answer data pairs corresponding to each document data includes:
performing prompt word mining on the document data of the multiple categories to obtain multiple prompt words corresponding to each document data;
and generating a question-answer text for a plurality of prompt words corresponding to each document data based on a text generation model corresponding to each category of document data, so as to obtain a plurality of question-answer data pairs corresponding to each document data.
3. The method according to claim 2, wherein the generating the question-answer text for the plurality of prompt words corresponding to each document data based on the text generation model corresponding to each category of document data to obtain a plurality of question-answer data pairs corresponding to each document data includes:
Determining a text generation model calling interface corresponding to the document data of each category;
and simultaneously calling a plurality of text generation model calling interfaces to generate question-answer texts for the corresponding prompt words, so as to obtain a plurality of question-answer data pairs corresponding to each document data.
4. The method according to claim 2, wherein the performing the prompt word mining on the document data in the plurality of categories to obtain a plurality of prompt words corresponding to each document data includes:
carrying out semantic analysis and word sense analysis on the document data of the multiple categories to obtain analysis results;
and determining a plurality of prompt words corresponding to each document data according to the analysis result.
5. The method of claim 1, wherein the training the document question-answer model using the target training data set comprising the first training data set and the second training data set further comprises:
acquiring an evaluation data set for evaluating the document question-answer model, wherein the evaluation data set comprises a plurality of evaluation question-answer data pairs;
performing approximation expansion on the evaluation question data in the evaluation question-answer data pairs to obtain an evaluation question-answer data pair set associated with each evaluation question-answer data pair;
and evaluating the document question-answer model according to a plurality of evaluation question-answer data pair sets associated with the plurality of evaluation question-answer data pairs to obtain an evaluation result.
6. The method of claim 5, wherein the performing the approximate expansion on the evaluation question data in the evaluation question-answer data pairs to obtain the evaluation question-answer data pair set associated with each evaluation question-answer data pair includes:
extracting prompt words from the question data in the evaluation question-answer data pairs to obtain a question prompt word corresponding to each piece of question data;
performing approximate expansion on the question data based on the question prompt words to obtain a plurality of pieces of approximate question data;
and generating an evaluation question-answer data pair set associated with each evaluation question-answer data pair according to the plurality of pieces of approximate question data and answer data corresponding to the question data.
7. The method of claim 6, wherein the performing the approximate expansion on the question data based on the question prompt words to obtain a plurality of pieces of approximate question data includes:
carrying out semantic analysis on the question data, and determining a target text generation model corresponding to the question data according to a semantic analysis result;
and performing text generation on the question data carrying the question prompt words based on the target text generation model to obtain a plurality of pieces of approximate question data.
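The approximate expansion of claims 6 and 7 can be sketched as follows. The template list is a hypothetical stand-in for the target text generation model; a real system would generate paraphrases with the model selected by semantic analysis.

```python
def expand_question(question, prompt_word, n=3):
    """Sketch of claim 7: produce n approximate variants of a question,
    each keeping the question prompt word (or restating the question)."""
    templates = [
        "Could you explain {w}?",
        "What does the document say about {w}?",
        "In other words: {q}",
    ]
    return [t.format(w=prompt_word, q=question) for t in templates[:n]]
```

Pairing each variant with the original answer data then yields the evaluation question-answer data pair set of claim 6.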
8. The method of claim 5, wherein the evaluating the document question-answer model according to a plurality of evaluation question-answer data pair sets associated with the plurality of evaluation question-answer data pairs to obtain an evaluation result comprises:
calculating semantic similarity between elements in each evaluation question-answer data pair set and corresponding evaluation question-answer data pairs;
dividing the elements in each evaluation question-answer data pair set into a plurality of categories according to the semantic similarity;
evaluating the document question-answer model based on the evaluation question-answer data of a plurality of categories to obtain a plurality of sub-evaluation results;
and determining the evaluation result of the document question-answer model according to the plurality of sub-evaluation results.
9. The method of claim 8, wherein the determining an evaluation result of the document question-answer model from the plurality of sub-evaluation results comprises:
calculating a weight coefficient of each sub-evaluation result according to the semantic similarity;
and carrying out weighted calculation on the plurality of sub-evaluation results based on the weight coefficient to obtain the evaluation result of the document question-answer model.
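The weighted calculation of claim 9 can be illustrated as below; normalising each category's mean semantic similarity into a weight is one plausible reading, not necessarily the patented formula.

```python
def weighted_evaluation(sub_results, similarities):
    """Sketch of claim 9: combine per-category sub-evaluation scores,
    weighting each by its (normalised) semantic similarity."""
    total = sum(similarities)
    weights = [s / total for s in similarities]          # weights sum to 1
    return sum(w * r for w, r in zip(weights, sub_results))
```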
10. The method according to claim 5, wherein after the evaluating the document question-answer model according to a plurality of evaluation question-answer data pair sets associated with the plurality of evaluation question-answer data pairs to obtain an evaluation result, the method further comprises:
in the process of evaluating the document question-answer model, when it is identified that an answer result of the document question-answer model on a target evaluation question is inconsistent with the answer data corresponding to the target evaluation question, generating accurate target answer data for the target evaluation question;
and adding an abnormal answer data pair consisting of the target evaluation question and the target answer data to the second training data set.
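The abnormal-answer collection of claim 10 can be sketched as follows. Exact string comparison and a pre-built dictionary of corrected answers are simplifying assumptions; the claim itself leaves open how inconsistency is identified and how the accurate target answer data are generated.

```python
def collect_abnormal_pairs(model_answers, eval_pairs, corrected_answers):
    """Sketch of claim 10: when the model's answer disagrees with the
    reference, pair the question with a corrected answer so the pair can
    be added to the second training data set."""
    second_set_additions = []
    for (question, reference), predicted in zip(eval_pairs, model_answers):
        if predicted != reference:                   # abnormal answer detected
            corrected = corrected_answers[question]  # accurate target answer
            second_set_additions.append((question, corrected))
    return second_set_additions
```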
11. The method of any one of claims 1 to 10, wherein before the training the document question-answer model using the target training data set consisting of the first training data set and the second training data set, the method further comprises:
acquiring a plurality of template answer data pairs, wherein the template answer data pairs comprise template answers indicating that the document question-answer model cannot give an accurate answer to the corresponding question;
the training the document question-answer model by adopting a target training data set consisting of the first training data set and the second training data set comprises the following steps:
and training the document question-answer model by adopting a target training data set formed by the first training data set, the second training data set and the plurality of template answer data pairs.
12. The method of any one of claims 1 to 10, wherein the generating a first training data set from each of the document data and the corresponding plurality of question-answer data pairs comprises:
generating a plurality of candidate training data according to each document data and the corresponding question-answer data pairs;
and performing format detection and de-duplication processing on the plurality of candidate training data to obtain a first training data set.
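The format detection and de-duplication of claim 12 can be sketched as below, assuming each candidate is a `(question, answer)` pair of non-empty strings; the real format check is unspecified in the claim.

```python
def build_first_training_set(candidates):
    """Sketch of claim 12: keep only well-formed (question, answer) string
    pairs and drop exact duplicates while preserving order."""
    seen = set()
    cleaned = []
    for item in candidates:
        if (isinstance(item, tuple) and len(item) == 2
                and all(isinstance(x, str) and x.strip() for x in item)):
            if item not in seen:        # de-duplication
                seen.add(item)
                cleaned.append(item)
    return cleaned
```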
13. The method of claim 1, wherein training the document question-answer model using a target training dataset comprising the first training dataset and the second training dataset comprises:
generating a mirror image environment for training the document question-answer model;
training a document question-answer model in the mirror image environment based on a target training data set formed by the first training data set and the second training data set to obtain model parameters after training of the document question-answer model;
and updating the document question-answer model based on the trained model parameters.
14. A document-based question answering apparatus, the apparatus comprising:
the first acquisition unit is used for acquiring document data of a plurality of categories from a plurality of target databases, wherein the target databases are databases associated with a document question-answer model to be trained;
the mining unit is used for mining the question-answer data pairs of the document data of the multiple categories based on the text generation model corresponding to the document data of each category to obtain multiple question-answer data pairs corresponding to each document data;
a generating unit, configured to generate a first training data set according to each document data and the corresponding question-answer data pairs;
the second acquisition unit is used for acquiring a second training data set, wherein the second training data set is a training data set obtained by correcting abnormal response data of the document question-answer model;
the training unit is used for training the document question-answer model by adopting a target training data set formed by the first training data set and the second training data set to obtain a trained document question-answer model;
and the answering unit is used for acquiring a target question and an associated target document, and performing question answering on the target question and the target document based on the trained document question-answer model to obtain an answer result.
15. A storage medium storing a computer program, wherein the computer program when executed by a processor implements the document-based question answering method according to any one of claims 1 to 13.
16. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the document-based question answering method according to any one of claims 1 to 13 when executing the computer program.
17. A computer program product comprising a computer program which, when read and executed by a processor of a computer device, causes the computer device to perform the document-based question answering method according to any one of claims 1 to 13.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311495012.1A CN117453887A (en) | 2023-11-09 | 2023-11-09 | Document-based question answering method and device, storage medium and computer equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117453887A true CN117453887A (en) | 2024-01-26 |
Family
ID=89596507
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311495012.1A Pending CN117453887A (en) | 2023-11-09 | 2023-11-09 | Document-based question answering method and device, storage medium and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117453887A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118155630A (en) * | 2024-04-01 | 2024-06-07 | 青岛海尔空调器有限总公司 | Voice interaction method and device based on large language model and intelligent voice equipment |
CN119066155A (en) * | 2024-11-06 | 2024-12-03 | 浙江大学 | A large language model training method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||