CN108304437B - Automatic question answering method, device and storage medium

Info

Publication number: CN108304437B
Application number: CN201710872147.3A
Authority: CN
Inventors: 张想; 冯启航; 柯玉耿; 林强
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2017-09-25
Filing date: 2017-09-25
Publication date: 2020-01-31
Anticipated expiration: 2037-09-25
Also published as: CN108304437A

Abstract

The embodiment of the invention discloses an automatic question answering method, device and storage medium. A plurality of question-answer pairs are formed based on social data on a social platform, each question-answer pair comprising a question and the answer corresponding to the question; an inverted index of the questions and their phrases is established; a retrieval question is obtained, and similar questions close to the retrieval question are determined according to the question phrases of the retrieval question and the inverted index; candidate answers of the retrieval question are obtained according to the similar questions and the question-answer pairs, yielding a candidate answer set of the retrieval question; and a target answer of the retrieval question is selected from the candidate answer set.

Description

Automatic question answering method, device and storage medium

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to an automatic question answering method, device and storage medium.

Background

A chat robot system is an artificial intelligence system that can be online at any time through communication means and communicate with people in natural language. A chat robot system is essentially an automatic Question Answering (QA) system. An automatic question answering system, also called a question answering system, is a computer processing system that stores a large corpus and automatically retrieves answers to users' questions.

Specifically, after the user inputs a question, the chat robot system searches the database for an answer matching the question, and then outputs the searched answer to answer the question input by the user, thereby implementing chat.

However, in current chat robot systems the answer often does not match the question and the relevance of the answer is poor, which reduces the accuracy of the answers output by the chat robot system.

Disclosure of Invention

The embodiment of the invention provides an automatic question answering method, device and storage medium, which can improve the accuracy of the answers output by a chat robot system.

The embodiment of the invention provides an automatic question answering method, which comprises the following steps:

forming a plurality of question-answer pairs based on social data on a social platform, wherein the question-answer pairs comprise questions and answers corresponding to the questions;

establishing an inverted index of the questions and their phrases;

acquiring a retrieval question, and determining similar questions close to the retrieval question according to the question phrases of the retrieval question and the inverted index;

obtaining candidate answers of the retrieval questions according to the similar questions and the question-answer pairs to obtain a candidate answer set of the retrieval questions;

and selecting a target answer of the retrieval question from the candidate answer set.

Correspondingly, the embodiment of the invention also provides an automatic question answering device, which comprises:

a question-answer pair forming unit, used for forming a plurality of question-answer pairs based on social data on a social platform, wherein the question-answer pairs comprise questions and answers corresponding to the questions;

an index establishing unit, used for establishing an inverted index of the questions and their phrases;

a question acquisition unit, used for acquiring a retrieval question and determining similar questions close to the retrieval question according to the question phrases of the retrieval question and the inverted index;

the candidate answer obtaining unit is used for obtaining candidate answers of the retrieval questions according to the similar questions and the question-answer pairs to obtain a candidate answer set of the retrieval questions;

and the answer selecting unit is used for selecting the target answer of the retrieval question from the candidate answer set.

Accordingly, an embodiment of the present invention further provides a storage medium, where the storage medium stores instructions, and the instructions, when executed by a processor, implement the automatic question answering method provided in any of the embodiments of the present invention.

In the embodiment of the invention, a plurality of question-answer pairs are formed based on social data on a social platform; an inverted index of the questions and their phrases is established; a retrieval question is obtained, and similar questions close to the retrieval question are determined according to the question phrases of the retrieval question and the inverted index; candidate answers of the retrieval question are obtained according to the similar questions and the question-answer pairs, yielding a candidate answer set; and a target answer of the retrieval question is selected from the candidate answer set. With this scheme, similar questions close to the retrieval question are queried first, the answers corresponding to the similar questions are then queried, and the most appropriate answer is selected from those answers. The scheme can therefore output an answer that matches the retrieval question, which improves the accuracy and quality of the answers output by the chat robot system.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

FIG. 1a is a schematic flow chart of an automatic question answering method according to an embodiment of the present invention;

FIG. 1b is a schematic diagram of a sentence synthesis process provided by an embodiment of the present invention;

FIG. 1c is a diagram of a sentence vector derived from a word vector according to an embodiment of the present invention;

FIG. 1d is a schematic diagram illustrating the computation of sentence similarity based on a convolutional neural network according to an embodiment of the present invention;

FIG. 1e is a diagram illustrating word co-occurrence statistics provided by an embodiment of the present invention;

FIG. 2a is a schematic diagram of a scenario of an automatic question answering system according to an embodiment of the present invention;

FIG. 2b is a schematic flow chart of an automatic question answering method according to an embodiment of the present invention;

FIG. 2c is a diagram of a robotic chat interface provided by embodiments of the present invention;

FIG. 2d is another schematic diagram of a robotic chat interface provided by embodiments of the invention;

FIG. 3a is an architecture diagram of an automated question answering system provided by an embodiment of the present invention;

FIG. 3b is a schematic structural diagram of a sorting system according to an embodiment of the present invention;

FIG. 4a is a first schematic structural diagram of an automatic question answering device according to an embodiment of the present invention;

FIG. 4b is a second schematic structural diagram of the automatic question answering device according to an embodiment of the present invention;

FIG. 4c is a third schematic structural diagram of the automatic question answering device according to an embodiment of the present invention;

FIG. 4d is a fourth schematic structural diagram of the automatic question answering device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the drawings in the embodiments of the present invention. The described embodiments are only some of the embodiments of the present invention, rather than all of them.

The embodiment of the invention provides an automatic question answering method, device and storage medium, which are respectively explained in detail below.

Embodiment 1

This embodiment is described from the perspective of an automatic question answering device, which may be integrated in one or more entities; for example, the automatic question answering device may be integrated in a server or the like.

An automatic question answering method includes: forming a plurality of question-answer pairs based on social data on a social platform; establishing an inverted index of the questions and their phrases; obtaining a retrieval question, and determining similar questions close to the retrieval question according to the question phrases of the retrieval question and the inverted index; obtaining candidate answers of the retrieval question according to the similar questions and the question-answer pairs to obtain a candidate answer set of the retrieval question; and selecting a target answer of the retrieval question from the candidate answer set.

As shown in fig. 1a, the specific process of the automatic question answering method may be as follows:

101. Forming a plurality of question-answer pairs based on social data on the social platform, the question-answer pairs including questions and their corresponding answers.

The social platform is a platform on which users share information such as their own news, moods, and feelings, for example a social platform based on instant messaging. In addition, the social platform may also include a chat robot or a question answering system; that is, the chat robot or question answering system can itself be a social platform on which social information is exchanged.

Social data on a social platform is social information data interacted by a User on the social platform, and the social data can include UGC (User Generated Content) on the social platform. For example, the social data may include content (e.g., text content, etc.) posted by the user on the social platform, and other comment information or reply content for the content by the user on the social platform.

In this embodiment, a question-answer pair, also called a QA pair, refers to a pair of social data items in question-and-answer form, such as a pair of text data items.

In this embodiment, massive social data on the social platform can be obtained, and the social data is then decomposed to form question-answer pairs. That is, the step of "forming a plurality of question-answer pairs based on social data on a social platform" may include:

acquiring social data on a social platform;

and performing question and answer decomposition on the social data to form a plurality of question and answer pairs.

In practical applications, in order to reduce the processing load and improve the quality of answers, data filtering (i.e., data cleansing) may first be performed on the social data, for example to filter out private data or sensitive data, and question-and-answer decomposition is then performed on the filtered social data to form a plurality of question-answer pairs. That is, the step of "performing question and answer decomposition on the social data" may include:

Performing data filtering on the social data to obtain filtered social data;

and performing question and answer disassembling on the filtered social data.

Specifically, the social data may be segmented into phrases, and the phrase data is then filtered according to a vocabulary filtering principle. The vocabulary filtering principle may be set according to actual requirements; for example, named entity data such as person names, place names, and organization names may be filtered, privacy data such as phone numbers, instant messaging account numbers, and financial account numbers (e.g., bank card numbers) may be filtered, and uncivil vocabulary such as profanity may be filtered. That is, the step of "performing data filtering on the social data" may include:

Performing word segmentation processing on the social data to obtain phrase data corresponding to the social data;

and performing data filtering on the phrase data corresponding to the social data according to a preset vocabulary filtering principle.
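The segmentation and filtering steps above can be sketched as follows. This is only an illustrative sketch under stated assumptions: whitespace splitting stands in for a real word segmenter, and the pattern `PRIVATE_NUMBER`, the set `BLOCKED_WORDS`, and the function names are hypothetical; the embodiment leaves the concrete filtering rules to actual requirements. Named-entity filtering is omitted because it would need an entity recognizer.

```python
import re

# Hypothetical rules: the concrete vocabulary filtering principle is not specified.
PRIVATE_NUMBER = re.compile(r"^\+?\d{7,15}$")  # assumed shape of phone/account numbers
BLOCKED_WORDS = {"damn"}                        # assumed list of uncivil words


def filter_phrases(phrases):
    """Drop phrases that look like private numbers or appear in the blocklist."""
    kept = []
    for phrase in phrases:
        if PRIVATE_NUMBER.match(phrase):
            continue
        if phrase.lower() in BLOCKED_WORDS:
            continue
        kept.append(phrase)
    return kept


def clean_social_text(text):
    """Segment by whitespace (stand-in for a word segmenter), then filter."""
    return filter_phrases(text.split())
```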

The embodiment can filter data in an off-line state; for example, the offline processing system may be used to obtain social data on the social platform, and then filter the social data.

102. Establishing an inverted index of the questions and their phrases.

An inverted index locates records by attribute value: each entry in the index table includes an attribute value and the addresses of the records having that attribute value.

In this embodiment, the inverted index of the questions and their phrases means that a question is located by the phrases it contains. The inverted index may include a plurality of index pairs (also called index entries), each index pair including an index key and the index item corresponding to that index key, where the index key may be a phrase in a question and the index item may be the question containing that phrase. Establishing the inverted index of the questions and their phrases therefore means establishing index pairs that represent the correspondence between phrases and questions. Specifically, the step of "establishing an inverted index of the questions and their phrases" may include:

performing word segmentation processing on the questions in the question-answer pairs to obtain word groups of the questions;

and establishing index pairs according to the question and the phrases of the question, wherein each index pair comprises an index key and an index item corresponding to the index key, the index key is a phrase of the question, and the index item may be the question.

For example, a question Q may be subjected to word segmentation to obtain phrases q1, q2, …, qn, and the index pairs (q1, Q), (q2, Q), …, (qn, Q) may then be established to obtain the inverted index of the question and its phrases.

After the inverted index of the questions and their phrases is established, similar questions close to the retrieval question can be queried based on the inverted index.
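The index construction described above can be sketched as follows. This is a minimal illustration, assuming whitespace splitting in place of a real word segmenter; the names `segment` and `build_inverted_index` are hypothetical and do not come from the embodiment.

```python
from collections import defaultdict


def segment(question):
    """Stand-in word segmenter: lowercases and splits on whitespace."""
    return question.lower().split()


def build_inverted_index(questions):
    """Map each phrase (index key) to the set of questions (index items) containing it."""
    index = defaultdict(set)
    for question in questions:
        for phrase in segment(question):
            index[phrase].add(question)
    return index


# Two toy questions that share the phrases "how", "to" and "rice".
index = build_inverted_index(["how to cook rice", "how to wash rice"])
```

Storing a set of questions per phrase keeps every pair (qi, Q) while avoiding duplicates when a phrase occurs twice in the same question.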

103. Acquiring a retrieval question, and determining similar questions close to the retrieval question according to the question phrases of the retrieval question and the inverted index.

(1) Acquiring a retrieval question:

The retrieval question is the question for which an answer needs to be retrieved. It may be obtained in various ways, for example from a sentence input by a user; specifically, the sentence input by the user may be taken as the retrieval question.

The sentence input by the user may be one or more pieces of content input by the user. The content of the sentence may be text content or the like; the sentence may be composed of phrases, and may be a complete utterance or an incomplete one.

Optionally, in order to improve the accuracy and quality of the answer, the embodiment may also synthesize the sentence input by the user by using syntactic analysis, so as to obtain an accurate retrieval question. Specifically, the step "obtaining a retrieval question" may include:

acquiring a sentence currently input by a user and a historical sentence input by the user before;

when the sentence contains neither a subject-predicate structure nor a verb-object structure, and the historical sentence contains a subject-predicate structure or a verb-object structure, taking the subject or the object in the historical sentence as the subject of the sentence to synthesize a new sentence;

the new sentence is taken as a retrieval question.

For example, after the user inputs 'apple is good to eat' and then inputs 'red', 'apple is good to eat' is the historical sentence and 'red' is the sentence currently input by the user.

In this embodiment, the sentence input by the user may refer to a sentence input by the user through the robot chat client on the terminal. For example, when a user inputs message content in an input box in the robot chat interface, the terminal sends the message content to the server, and the server can receive the message content at this time.

This embodiment may perform syntactic analysis on the sentence currently input by the user and the historical sentence previously input by the user to obtain syntactic analysis results, and then determine, based on those results, whether the current sentence contains a subject-predicate or verb-object structure and whether the historical sentence contains a subject-predicate or verb-object structure.

Syntactic analysis (parsing) refers to analyzing the grammatical function of the words in a sentence. For example, parsing "I came late" yields: "I" is the subject, "came" is the predicate, and "late" is the complement.

In this embodiment, when the sentence currently input by the user contains neither a subject-predicate structure nor a verb-object structure, and a historical sentence previously input by the user contains a subject-predicate structure or a verb-object structure, the subject or the object in the historical sentence may be used as the subject of the current sentence to synthesize a new sentence.

For example, the user inputs 'apple is eaten' in the first round and 'red' in the second round. Syntactic analysis shows that the sentence input in the second round (i.e., the sentence currently input by the user) contains neither a subject-predicate structure nor a verb-object structure, while the sentence input in the first round (i.e., the historical sentence) contains a subject. The subject of the first-round sentence can therefore be used as the subject of the current sentence, splicing the two into the new sentence 'apple red', which is then used as the retrieval question.

Optionally, in practical applications the above synthesis may fail. In order to improve the accuracy of the retrieval question, the method of this embodiment further includes:

when the synthesis of a new sentence fails, extracting corresponding keywords in the sentence according to the word type of the phrase;

replacing the target phrase with the same word type as the keyword in the historical sentence by the keyword to obtain a replaced sentence;

and taking the replaced sentence as the retrieval question.

The word type of a phrase may be divided based on the part of speech of the phrase; for example, phrases may be divided into nouns, verbs, adjectives, numerals, quantifiers, pronouns, function words, and the like. In addition, phrases may be divided based on their meaning; for example, phrases may be divided into person names, place names, organization names, and so on.

In the embodiment, corresponding keywords can be extracted from the sentences based on the part of speech type priority; for example, the priority of the noun may be set to be higher than the priority of the verb, at this time, a phrase with the word type of the noun in the sentence may be extracted as the keyword, and if the phrase is not extracted, a phrase with the word type of the verb in the sentence may be extracted as the keyword, that is, the keyword is extracted according to the priorities of extracting the noun first and then extracting the verb.

Optionally, when extracting nouns, keywords can further be extracted from the sentence with entity nouns (such as person names, place names, and organization names) given higher priority. Among entity nouns, keywords can be extracted in descending priority of person name, place name, and organization name.

Specifically, a phrase whose word type is a person name may be extracted from the sentence as the keyword; if none is extracted, a phrase whose word type is a place name is extracted; if none is extracted, a phrase whose word type is an organization name is extracted; if none is extracted, the noun with the highest tf-idf value in the sentence is extracted; and if none is extracted, a phrase whose word type is a verb is extracted.

After a keyword is successfully extracted from the sentence, the target phrase in the historical sentence that has the same word type as the keyword can be replaced with the keyword. For example, if the extracted keyword is a person name, place name, or organization name, the person name, place name, or organization name in the historical sentence may be replaced with the keyword; if the extracted keyword is a verb, the predicate or verb in the historical sentence may be replaced with the keyword; and if the extracted keyword is a non-entity noun, the subject or object in the historical sentence may be replaced with the keyword.

For example, the user inputs 'go to Beijing and go bad' in the first round and 'go to Shanghai' in the second round. Syntactic analysis shows that the sentence input in the second round (i.e., the sentence currently input by the user) contains neither a subject-predicate structure nor a verb-object structure, while the sentence input in the first round (i.e., the historical sentence) contains a verb-object structure, so synthesis by the above subject-substitution approach fails. In that case, the entity noun 'Shanghai' can be extracted from the second-round sentence as the keyword, and the entity noun 'Beijing' in the first-round sentence is replaced with the keyword 'Shanghai', which synthesizes the new sentence 'go to Shanghai and go bad'. Finally, the new sentence is used as the retrieval question.

Optionally, in this embodiment, when sentence synthesis by keyword replacement fails, a new sentence may be synthesized based on a statistical synthesis strategy to serve as the retrieval question. For example, corresponding keywords may be extracted from the historical sentences and then spliced into a new sentence. This embodiment may compute the idf values of the phrases in the historical sentences and then select keywords according to those idf values; for example, the word with the highest globally computed idf value in the historical sentences is selected as the keyword.

According to the above description of the manner of acquiring the search question, the present embodiment may define three strategies to synthesize a new sentence as the search question. The three strategies include:

Syntactic analysis synthesis strategy 1: when the sentence currently input by the user contains neither a subject-predicate structure nor a verb-object structure, and the historical sentence previously input by the user contains a subject-predicate structure or a verb-object structure, the subject or object in the historical sentence is used as the subject of the currently input sentence.

Syntactic analysis synthesis strategy 2: when the sentence currently input by the user contains neither a subject-predicate structure nor a verb-object structure, and the historical sentence previously input by the user contains a subject-predicate structure or a verb-object structure, corresponding keywords are extracted from the current sentence according to the word types of its phrases, and the target phrases in the historical sentence that have the same word type as the keywords are replaced with the keywords.

For example, corresponding keywords are extracted from the current sentence in descending priority of person names, place names, organization names, the noun with the highest tf-idf value, and verbs, and the corresponding phrases in the historical sentence are then replaced. Specifically:

Case 1. If a person name, place name, or organization name is extracted as the keyword, the entity of the same type is looked up directly in the query and replaced.

Case 2. If a noun is extracted, its similarity to the candidate subjects or objects is calculated, and the phrase to be replaced is selected accordingly.

Case 3. If a verb is extracted, the predicate or verb is replaced.

Statistical synthesis strategy: corresponding keywords are extracted from the historical sentences and then spliced into a new sentence. Specifically, the idf values of the phrases in the historical sentences are computed, and keywords are then selected according to those idf values; for example, the word with the highest globally computed idf value in the historical sentences is selected as the keyword.
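The idf-based keyword selection of the statistical synthesis strategy can be illustrated with the sketch below. It assumes the common definition idf(w) = log(N / df(w)), where N is the number of sentences in a reference corpus and df(w) is the number of sentences containing w; the embodiment does not state which idf formula is used. The function names and the `default_idf` value for unseen words are hypothetical.

```python
import math


def idf_table(corpus):
    """Compute idf(w) = log(N / df(w)) over a corpus of tokenized sentences."""
    n = len(corpus)
    df = {}
    for tokens in corpus:
        for word in set(tokens):
            df[word] = df.get(word, 0) + 1
    return {word: math.log(n / count) for word, count in df.items()}


def top_idf_keyword(history_tokens, idf, default_idf=0.0):
    """Return the word of the historical sentence with the highest idf value.

    Words missing from the table get default_idf (an assumption of this sketch).
    """
    if not history_tokens:
        return None
    return max(history_tokens, key=lambda w: idf.get(w, default_idf))


corpus = [
    ["the", "apple", "is", "red"],
    ["the", "sky", "is", "blue"],
    ["the", "apple", "pie"],
]
idf = idf_table(corpus)
keyword = top_idf_keyword(["the", "apple", "red"], idf)
```

Here "red" occurs in only one of the three sentences, so it has the highest idf and is selected.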

Referring to fig. 1b, the flow of sentence synthesis of the present embodiment is as follows:

1031. Obtaining the current sentence input by the user and a historical sentence input by the user before.

The current sentence is the content currently input by the user through the robot chat client on the terminal, and the historical sentence is the content previously input by the user through the robot chat client on the terminal.

1032. Performing syntactic analysis on the current sentence and the historical sentence respectively to obtain syntactic analysis results.

1033. Judging, according to the syntactic analysis results, whether the current sentence contains neither a subject-predicate structure nor a verb-object structure and whether the historical sentence contains a subject-predicate structure or a verb-object structure; if yes, going to step 1034; otherwise no synthesis is needed and the synthesis flow ends.

1034. Performing sentence synthesis using syntactic analysis synthesis strategy 1.

1035. Determining whether the synthesis is successful; if not, executing step 1036; if so, the synthesis has succeeded and the synthesis flow ends.

1036. Performing sentence synthesis using syntactic analysis synthesis strategy 2.

1037. Determining whether the synthesis is successful; if not, executing step 1038; if so, the synthesis has succeeded and the synthesis flow ends.

1038. Performing sentence synthesis using the statistical synthesis strategy.

1039. Determining whether the synthesis is successful; if not, executing step 1040; if so, the synthesis has succeeded and the synthesis flow ends.

1040. Determining that synthesis is not possible.
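The control flow of steps 1033 to 1040 amounts to trying the three strategies in order and stopping at the first success. The sketch below shows only that cascade; the parser-based check and the real strategies are passed in as callables, and `prepend_history_subject` is a toy stand-in that treats the first word of the historical sentence as its subject. All names are hypothetical, and returning `None` at step 1040 is an assumption, since the flow does not say what is used as the retrieval question in that case.

```python
def build_retrieval_question(current, history, needs_synthesis, strategies):
    """Run the synthesis flow and return the retrieval question, or None on failure.

    needs_synthesis(current, history) -> bool corresponds to the check in step 1033.
    Each strategy is a callable (current, history) -> str or None, where None
    signals that the strategy failed (steps 1035, 1037 and 1039).
    """
    if not needs_synthesis(current, history):
        return current  # no synthesis needed: use the input sentence directly
    for strategy in strategies:
        new_sentence = strategy(current, history)
        if new_sentence is not None:
            return new_sentence
    return None  # step 1040: synthesis is not possible


def prepend_history_subject(current, history):
    """Toy stand-in for strategy 1: first word of the history acts as the subject."""
    words = history.split()
    return f"{words[0]} {current}" if words else None


question = build_retrieval_question(
    "red",
    "apple is eaten",
    needs_synthesis=lambda cur, hist: True,            # stand-in for the parser check
    strategies=[lambda cur, hist: None,                 # a strategy that fails
                prepend_history_subject],               # falls through to this one
)
```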

(2) Determining the similar questions:

A similar question close to the retrieval question is a question that is retrieved based on the inverted index and matches a question phrase of the retrieval question, such as a question that contains a phrase similar to or the same as the question phrase.

The inverted index of the questions and their phrases in this embodiment may include index pairs, where each index pair includes an index key and an index item corresponding to the index key, the index key is a phrase of a question, and the index item may be the question. In this case, the similar questions are the questions in the index pairs that match the question phrases of the retrieval question, specifically the questions in the index pairs whose index key matches a question phrase of the retrieval question. That is, the step of "determining similar questions close to the retrieval question according to the question phrases of the retrieval question and the inverted index" may include:

querying the index pairs for questions that match the question phrases of the retrieval question to obtain similar questions close to the retrieval question.

When the retrieval question has multiple question phrases, the questions matching each question phrase can be looked up in the index pairs, so that a group of similar questions close to the retrieval question is obtained.

For example, after the retrieval question A is segmented, phrases a1, a2, …, ai, …, an are obtained. The index pairs whose index keys match a1, a2, …, an are then looked up respectively, and the question in each index pair whose index key matches ai is taken as a similar question close to the retrieval question A, so that a series of similar questions is obtained. For example, when a different question is found for each question phrase, similar questions Q1, Q2, …, Qn are obtained.
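The lookup described above can be sketched as the union of the index items found for each question phrase. The sketch assumes exact matching between phrases and index keys and whitespace segmentation, whereas the embodiment also allows similar phrases to match; `find_similar_questions` and the toy `index` are hypothetical.

```python
def find_similar_questions(retrieval_question, index):
    """Collect every indexed question that shares at least one phrase with the retrieval question."""
    similar = set()
    for phrase in retrieval_question.lower().split():
        similar |= index.get(phrase, set())
    return similar


# Toy inverted index: phrase (index key) -> questions containing that phrase.
index = {
    "cook": {"how to cook rice"},
    "rice": {"how to cook rice", "how to wash rice"},
    "wash": {"how to wash rice"},
}
```

For instance, `find_similar_questions("cook noodles", index)` finds only the question indexed under "cook", because "noodles" has no index pair.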

104. Obtaining candidate answers of the retrieval question according to the similar questions and the question-answer pairs to obtain a candidate answer set of the retrieval question.

After a plurality of similar questions are obtained, the answer corresponding to each similar question can be queried in the question-answer pairs, so that a plurality of candidate answers are obtained and a candidate answer set is formed.

Specifically, the matching question-answer pairs, in which the question matches a similar question, may be determined among the question-answer pairs, and the answers in the matching question-answer pairs may then be used as candidate answers for the retrieval question. That is, the answer in a question-answer pair whose question matches a similar question may be used as a candidate answer for the retrieval question; for example, the answer in a question-answer pair whose question is similar to or the same as the similar question may be used as a candidate answer for the retrieval question.

For example, if the question-answer pairs are (Q1, A1), (Q2, A2), …, (Qi, Ai), …, (Qn, An) and the similar questions are Q1, Q2, …, Qi, then the answer corresponding to each similar question may be obtained from the question-answer pairs to obtain the candidate answer set {A1, A2, …, Ai}.
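As an illustration of this step, the sketch below gathers the answers of all question-answer pairs whose question is one of the similar questions. It assumes exact question matching and removes duplicate answers, neither of which is prescribed by the embodiment; the function name is hypothetical.

```python
def collect_candidate_answers(similar_questions, qa_pairs):
    """Return the answers of every question-answer pair whose question is a similar question."""
    wanted = set(similar_questions)
    candidates = []
    for question, answer in qa_pairs:
        # Skip duplicates so each candidate answer appears once (an assumption).
        if question in wanted and answer not in candidates:
            candidates.append(answer)
    return candidates


qa_pairs = [("Q1", "A1"), ("Q2", "A2"), ("Q3", "A3")]
candidate_set = collect_candidate_answers(["Q1", "Q2"], qa_pairs)
```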

105. Selecting a target answer of the retrieval question from the candidate answer set.

For example, one or more answers may be selected from the candidate answer set {A1, A2, …, Ai} as final answers to the retrieval question.

In practical application, the candidate answers can be scored, then the candidate answers are ranked based on scores of the candidate answers to obtain a ranked candidate answer set, and finally, a final answer is selected from the ranked candidate answer set. That is, the step of "selecting the target answer of the search question from the candidate answer set" may include:

scoring the candidate answers in the candidate answer set to obtain scores of the candidate answers;

sorting the candidate answers in the candidate answer set according to the scores of the candidate answers to obtain a sorted candidate answer set;

and selecting a target answer of the retrieval question from the sorted candidate answer set.

The candidate answers may be sorted in various ways, for example from the highest score to the lowest, or from the lowest score to the highest.

In this embodiment, the candidate answer arranged at the front or the back may be selected as the target answer of the search question. The concrete mode can be set according to actual requirements.
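The score, sort, and select steps can be sketched generically as follows. The scoring function is passed in as a parameter because this passage does not fix how candidate answers are scored; `rank_candidates`, `select_target_answers`, and `top_k` are hypothetical names, and the example uses answer length purely as a placeholder score.

```python
def rank_candidates(candidates, score_fn, descending=True):
    """Score each candidate answer and return the answers sorted by score."""
    scored = [(score_fn(answer), answer) for answer in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=descending)
    return [answer for _, answer in scored]


def select_target_answers(candidates, score_fn, top_k=1):
    """Pick the top_k answers from the front of the descending ranking."""
    return rank_candidates(candidates, score_fn)[:top_k]


# Placeholder scoring by length, only to show the mechanics.
ranked = rank_candidates(["aa", "a", "aaa"], len)
```

With an ascending sort (`descending=False`), the target answer would instead be taken from the back of the list, which corresponds to the "front or back" choice mentioned above.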

In order to improve the correlation between the output answer and the retrieval question and improve the quality of the answer, the present embodiment may select the target answer of the retrieval question in the following manner:

(1) Based on the similarity between the answers and the question:

Specifically, sentence similarity information between the candidate answers in the candidate answer set and the retrieval question can be obtained, and a target answer of the retrieval question is selected from the candidate answer set according to that sentence similarity information.

The sentence similarity information is used for representing the similarity between two sentences; the embodiment may use a vector space model to calculate the sentence similarity, where the sentence similarity information includes: vector similarity between sentence vectors. The vector similarity between sentence vectors can be measured by using cosine values (i.e. cosine similarity) of included angles between vectors, distances between vectors (such as euclidean distance, manhattan distance, and the like) and the like; that is, the vector similarity between sentence vectors may include: cosine similarity between sentence vectors, distance between sentence vectors, etc.

That is, the step of "obtaining sentence similarity information between the answers in the candidate answer set and the retrieval question" may include:

obtaining answer sentence vectors corresponding to candidate answers in the candidate answer set and question sentence vectors corresponding to retrieval questions;

obtaining the vector similarity between the answer sentence vector and the question sentence vector;

In this case, the step of "selecting the target answer of the retrieval question from the candidate answer set according to the sentence similarity information between the candidate answers in the candidate answer set and the retrieval question" may include: selecting a target answer of the retrieval question from the candidate answer set according to the vector similarity between the answer sentence vector and the question sentence vector.

The embodiment can obtain the vector similarity, such as cosine similarity, between the answer sentence vector of each candidate answer in the candidate answer set and the question sentence vector, and then select the final answer to the retrieval question from the candidate answer set based on the vector similarity between the answer sentence vector of each candidate answer and the question sentence vector.

For example, the candidate answers in the candidate answer set may be scored according to the vector similarity between the answer sentence vector corresponding to each candidate answer in the candidate answer set and the question sentence vector, then the candidate answers are ranked based on the scores of the candidate answers in the set to obtain a ranked candidate answer set, and the target answer of the retrieval question is selected from the ranked candidate answer set.

In this embodiment, there may be multiple ways to obtain a sentence vector, and in order to form an accurate sentence vector and accurately calculate the vector similarity, this embodiment may obtain the sentence vector as follows:

(1-1) obtaining a sentence vector based on the word vector:

Specifically, the step of "obtaining an answer sentence vector corresponding to a candidate answer in the candidate answer set and a question sentence vector corresponding to the retrieval question" may include:

acquiring word vectors corresponding to answer phrases of candidate answers in the candidate answer set, and acquiring answer sentence vectors corresponding to the candidate answers according to the word vectors corresponding to the answer phrases;

and acquiring word vectors corresponding to the question phrases of the retrieval question, and acquiring the question sentence vector corresponding to the retrieval question according to the word vectors corresponding to the question phrases.

The word vectors may be obtained by training on data; for example, they may be trained on a preset number of question-answer pairs (QA pairs) using the word2vec tool. Specifically, a preset number (e.g., 100 million) of question-answer pairs may be selected as training data; word vector training is then performed on the answer phrases of the candidate answers based on the training data to obtain the word vectors corresponding to the answer phrases, and word vector training is performed on the question phrases of the retrieval question based on the training data to obtain the word vectors corresponding to the question phrases. The question-answer pairs may be question-answer pairs formed based on social data on the social platform.

In practical application, the dimension of the word vectors can be preset during training, so that word vectors of the corresponding dimension are formed and sentence vectors of the corresponding dimension can in turn be obtained. That is, word vector training may be performed on the answer phrases based on the training data and the preset vector dimension, and on the question phrases based on the training data and the preset vector dimension.

After the word vectors are obtained, the sentence vector may be formed by vector addition; preferably, the sentence vector is obtained as a weighted sum of the word vectors. For example, given word vectors W1, W2, W3, ..., Wn of the same dimension, the sentence vector can be computed as $S = X_1 W_1 + X_2 W_2 + X_3 W_3 + \dots + X_n W_n$, where S is the sentence vector and Xi is the weight corresponding to the word vector Wi.

For example, referring to fig. 1c, suppose the word vectors are the 100-dimensional vectors W1, W2, W3, W4, W5, and W6; then W1 to W6 can be weighted and summed to form a 100-dimensional sentence vector, by the formula $S = X_1 W_1 + X_2 W_2 + X_3 W_3 + X_4 W_4 + X_5 W_5 + X_6 W_6$.
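The weighted summation of word vectors can be illustrated with a short sketch. The source does not say how the weights Xi are chosen, so they are simply passed in; the function name `weighted_sentence_vector` and the toy 3-dimensional values are assumptions made for illustration.

```python
from typing import List, Sequence


def weighted_sentence_vector(
    word_vectors: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> List[float]:
    """Return S = X1*W1 + X2*W2 + ... + Xn*Wn for same-dimension word vectors."""
    if not word_vectors:
        raise ValueError("at least one word vector is required")
    if len(word_vectors) != len(weights):
        raise ValueError("need exactly one weight per word vector")
    dim = len(word_vectors[0])
    sentence = [0.0] * dim
    for vector, weight in zip(word_vectors, weights):
        if len(vector) != dim:
            raise ValueError("all word vectors must share one dimension")
        for k, component in enumerate(vector):
            sentence[k] += weight * component
    return sentence


# Toy 3-dimensional vectors and weights (illustrative values only).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 2.0, 0.0]]
X = [0.5, 0.25, 0.25]
S = weighted_sentence_vector(W, X)
```

Setting every weight to 1 reduces this to the plain vector addition that the text mentions as the simpler alternative.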

(1-2) obtaining sentence vectors based on convolutional neural network

Method (1-1) may lose the order information of the words, which makes the sentence vector inaccurate. Therefore, to improve the accuracy of the sentence vector, this embodiment may use a convolutional neural network model to obtain the sentence vectors, for example by:

representing the retrieval question as a corresponding question matrix, and performing convolution processing on the question matrix based on a convolutional neural network model to obtain the question sentence vector corresponding to the retrieval question;

and representing the candidate answers in the candidate answer set as corresponding answer matrices, and performing convolution processing on the answer matrices based on the convolutional neural network model to obtain the answer sentence vectors corresponding to the candidate answers.

In this embodiment, the matrices may be obtained based on word vectors. That is, the step of "representing the retrieval question as a corresponding question matrix" may include: acquiring the word vectors corresponding to the question phrases of the retrieval question, and then obtaining the question matrix corresponding to the retrieval question based on those word vectors.

The step of representing the candidate answers in the candidate answer set as the corresponding answer matrix may include: and acquiring a word vector corresponding to an answer word of the candidate answer in the candidate answer set, and then acquiring an answer matrix corresponding to the candidate answer based on the word vector corresponding to the answer word.

For example, knowing the word vector for each word, and the word vector being 100 dimensions, and assuming that there are 50 words in a sentence at the maximum, a sentence matrix of 50 x 100 can be formed or constructed.
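A sentence matrix of this kind could be assembled as in the sketch below. The source only states the 50 × 100 shape; padding short sentences with zero rows, truncating long ones, and mapping unknown words to zero vectors are assumptions made here, as is the function name `build_sentence_matrix`.

```python
from typing import Dict, List


def build_sentence_matrix(
    words: List[str],
    word_vectors: Dict[str, List[float]],
    max_words: int = 50,
    dim: int = 100,
) -> List[List[float]]:
    """Stack word vectors row by row into a max_words x dim matrix."""
    zero = [0.0] * dim
    rows: List[List[float]] = []
    for word in words[:max_words]:  # assumption: truncate overly long sentences
        # assumption: a word without a trained vector becomes a zero row
        rows.append(list(word_vectors.get(word, zero)))
    while len(rows) < max_words:  # assumption: pad short sentences with zero rows
        rows.append(list(zero))
    return rows


# Toy vocabulary with constant 100-dimensional vectors (illustrative only).
vectors = {"you": [0.1] * 100, "eat": [0.2] * 100}
matrix = build_sentence_matrix(["you", "eat", "unknown"], vectors)
```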

The word vector may be obtained by training a word through sample data, for example, by performing vector training on words through a preset number of questions and answers.

Preferably, the embodiment may use a plurality of different convolution kernels to perform convolution operation on the matrix to obtain the corresponding sentence vector. For example, different convolution kernels may be respectively used to perform convolution operations on the matrices to obtain convolution results corresponding to the different convolution kernels, and then corresponding sentence vectors are constructed based on the convolution results corresponding to the different convolution kernels.

That is, the step of "performing convolution processing on the question matrix based on the convolutional neural network model to obtain the question sentence vector corresponding to the retrieval question" may include:

performing convolution operations on the question matrix with a plurality of different convolution kernels respectively to obtain convolution results corresponding to the different convolution kernels;

and constructing the question sentence vector corresponding to the retrieval question according to the convolution results corresponding to the different convolution kernels.

The step of performing convolution processing on the answer matrix based on the convolutional neural network model to obtain an answer sentence vector corresponding to the candidate answer may include:

performing convolution operation on the answer matrix by adopting a plurality of different convolution kernels respectively to obtain convolution results corresponding to the different convolution kernels;

and constructing an answer sentence vector corresponding to the candidate answer according to the convolution results corresponding to different convolution kernels.

In this embodiment, after obtaining convolution results corresponding to different convolution kernels, pooling may be performed on the convolution result corresponding to each convolution kernel to obtain a feature value corresponding to each convolution kernel, and then, a corresponding sentence vector may be constructed according to the feature value corresponding to each convolution kernel. For example, after convolution results corresponding to different convolution kernels are obtained, pooling processing may be performed on the convolution result corresponding to each convolution kernel to obtain a feature value corresponding to each convolution kernel, and then a question sentence vector corresponding to the retrieval question or an answer sentence vector corresponding to the candidate answer is constructed based on the feature value corresponding to each convolution kernel.

For example, suppose a sentence forms a 50 × 100 matrix S, and the convolution kernels are weight matrices of sizes 1 × 100, 2 × 100, 3 × 100, and 5 × 100, with 500 kernels of each size (2000 kernels in total). Feature extraction is then performed on the sentence matrix, and a vectorized representation of the sentence (a vector of length 2000) is finally generated through convolution, nonlinear transformation, and pooling operations.

Take a 3 × 100 weight matrix M as an example convolution kernel, with the sentence forming a 50 × 100 matrix S. The convolution slides the 3 × 100 window M along the rows of S over 50 − 3 + 1 = 48 positions. At each position, the convolution of the weight matrix M with the part of the sentence matrix covered by the window (the dark rectangle in fig. 1d) is calculated, so 48 results are generated over the whole sliding process. The number of results differs for the other kernel sizes: 50 for size 1, 49 for size 2, and 46 for size 5. Max pooling is then applied to the results of each kernel, yielding one feature value per kernel. Since 2000 kernels are used in total, a 2000-dimensional abstract sentence representation, that is, a sentence vector, is generated.
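The convolution and max-pooling procedure can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the kernel weights are random stand-ins for trained parameters, `tanh` is assumed as the nonlinear transformation (the source does not name one), and the demo uses 4 kernels per size instead of 500 to keep it fast. All function names are hypothetical.

```python
import math
import random
from typing import List, Sequence

Matrix = List[List[float]]


def convolve_and_pool(sentence: Matrix, kernel: Matrix) -> float:
    """Slide an h x dim kernel down the sentence matrix and max-pool the results."""
    height = len(kernel)
    positions = len(sentence) - height + 1  # e.g. 50 - 3 + 1 = 48
    best = -math.inf
    for start in range(positions):
        total = 0.0
        for r in range(height):
            total += sum(x * w for x, w in zip(sentence[start + r], kernel[r]))
        best = max(best, math.tanh(total))  # assumption: tanh nonlinearity
    return best


def cnn_sentence_vector(sentence: Matrix, kernels: List[Matrix]) -> List[float]:
    """One max-pooled feature per kernel; together they form the sentence vector."""
    return [convolve_and_pool(sentence, kernel) for kernel in kernels]


def random_kernels(
    heights: Sequence[int], per_height: int, dim: int, rng: random.Random
) -> List[Matrix]:
    """Untrained stand-in weights; a real model would learn these."""
    return [
        [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(h)]
        for h in heights
        for _ in range(per_height)
    ]


rng = random.Random(0)
sentence = [[rng.uniform(-1.0, 1.0) for _ in range(100)] for _ in range(50)]
kernels = random_kernels(heights=(1, 2, 3, 5), per_height=4, dim=100, rng=rng)
vector = cnn_sentence_vector(sentence, kernels)  # 4 sizes x 4 kernels = 16 features
window_counts = [len(sentence) - h + 1 for h in (1, 2, 3, 5)]
```

With `per_height=500` the same code would produce the 2000-dimensional vector described in the text, at a correspondingly higher cost.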

After the answer sentence vector and the question sentence vector are obtained, the method of this embodiment can obtain the vector similarity between the two sentence vectors, such as the cosine similarity. Cosine similarity evaluates the similarity of two vectors by calculating the cosine of the angle between them: the smaller the angle, the closer the cosine is to 1, the more closely the directions of the two vectors agree, and the more similar the vectors are.

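For example, assume that vector A is (A1, A2, ..., An) and vector B is (B1, B2, ..., Bn); the cosine similarity between vectors A and B can be calculated by the following formula:

$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\,\sqrt{\sum_{i=1}^{n} B_i^{2}}}$$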

For example, the cosine similarity between the answer sentence vector corresponding to each candidate answer in the candidate answer set and the question sentence vector corresponding to the retrieval question may be calculated; the candidate answers are then scored based on the cosine similarity (for example, the larger the cosine value, the higher the score), ranked according to their scores (for example, from high to low), and a target answer of the retrieval question is selected from the ranked set.
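Scoring and ranking by cosine similarity can be sketched as below. The sentence vectors are toy 2-dimensional values chosen for illustration, and returning 0.0 for a zero vector is an assumption made here to avoid division by zero; the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of the same dimension."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_a * norm_b)


def rank_by_cosine(
    question_vector: Sequence[float],
    answer_vectors: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score each candidate by cosine similarity and sort from high to low."""
    scored = [
        (name, cosine_similarity(question_vector, vector))
        for name, vector in answer_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy sentence vectors (illustrative values only).
question = [1.0, 0.0]
answers = {"A1": [2.0, 0.0], "A2": [0.0, 3.0], "A3": [1.0, 1.0]}
ranking = rank_by_cosine(question, answers)
```

Because cosine similarity ignores vector length, A1 scores highest here even though its vector is twice as long as the question vector.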

(2) Based on word co-occurrence statistics:

the embodiment can count the number of times that the words in the answer and the words in the question co-occur pairwise and the number of times that each word in the question occurs, and then select the answer of the retrieval question based on the statistical result. That is, the step of "selecting the target answer of the search question from the candidate answer set" may include:

acquiring the question occurrence count of each question phrase in the retrieval question and the question-answer co-occurrence count of each answer phrase in the candidate answers;

and selecting a target answer of the retrieval question from the candidate answer set according to the question occurrence counts of the question phrases in the retrieval question and the question-answer co-occurrence counts of the answer phrases in the candidate answers.

The question occurrence count is the number of times a question phrase appears in the questions of the question-answer pairs, i.e., the number of question-answer pairs whose question contains the question phrase.

The question-answer co-occurrence count is the number of times an answer phrase of the answer and a question phrase of the retrieval question co-occur in the question-answer pairs, i.e., the number of question-answer pairs whose question contains the question phrase and whose answer contains the answer phrase.

For example, suppose the phrases obtained by segmenting the question Q are q1, q2, ..., qi, ..., qn. If 800 question-answer pairs have a question containing q1, the question occurrence count of q1 is 800; if 789 question-answer pairs have a question containing q2, the question occurrence count of q2 is 789; if m question-answer pairs have a question containing qi, the question occurrence count of qi is m; and so on, so that the question occurrence count of every question phrase in the question can be obtained.

For another example, suppose the phrases obtained by segmenting the question Q are q1, q2, ..., qi, ..., qn, and the phrases obtained by segmenting the candidate answer A are a1, a2, ..., aj, ..., am. If there are k question-answer pairs whose question contains qi and whose answer contains aj, then the question-answer co-occurrence count of qi and aj is k.
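The two counts defined above can be gathered in a single pass over the question-answer pairs, as in this sketch. It assumes the pairs are already segmented into phrases; the three-pair corpus and the function name `count_statistics` are illustrative only.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# (question phrases, answer phrases), assumed to be segmented already
QAPair = Tuple[List[str], List[str]]


def count_statistics(qa_pairs: Iterable[QAPair]) -> Tuple[Counter, Counter]:
    """Return (question occurrence counts, question-answer co-occurrence counts)."""
    occurrence: Counter = Counter()     # Count(q): pairs whose question contains q
    co_occurrence: Counter = Counter()  # Count(q, a): q in question and a in answer
    for question_phrases, answer_phrases in qa_pairs:
        # Sets ensure each question-answer pair is counted at most once per phrase.
        question_set, answer_set = set(question_phrases), set(answer_phrases)
        for q in question_set:
            occurrence[q] += 1
            for a in answer_set:
                co_occurrence[(q, a)] += 1
    return occurrence, co_occurrence


# Tiny illustrative corpus.
corpus = [
    (["you", "eat"], ["eat", "not"]),
    (["you", "sleep"], ["not"]),
    (["eat", "what"], ["rice"]),
]
occurrence, co_occurrence = count_statistics(corpus)
```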

Referring to fig. 1e, word occurrence counts may be collected for the retrieval question Q, the candidate answer A1, and the candidate answer A2, namely the number of times each word in a candidate answer co-occurs with each word in the retrieval question, and the number of times each word in the retrieval question occurs. The table in fig. 1e shows a question containing "eat" for which the good answer contains "eat" and the bad answer contains "electricity"; according to the statistics, the co-occurrence counts of "eat" with the four words of the question exceed those of "electricity" with the same four words.

After obtaining the question occurrence counts of the question phrases and the question-answer co-occurrence counts of the answer phrases in the candidate answers, this embodiment may use them to distinguish good candidate answers from bad ones, so as to select the best answer to the retrieval question.

For example, the candidate answers in the candidate answer set may be scored according to the question occurrence counts of the question phrases in the retrieval question and the question-answer co-occurrence counts of the answer phrases in the candidate answers; the candidate answers may then be ranked based on their scores; and finally, the answer to the retrieval question may be selected from the ranked candidate answer set.

Preferably, in this embodiment, after counting the number of times of occurrence of the word, a number ratio between the number of times of co-occurrence of the answers to the questions and the number of times of occurrence of the questions may be calculated, then, a probability that each candidate answer is used as the target answer is obtained based on the number ratio, and finally, the target answer is selected based on the probability. That is, the step of selecting the target answer of the search question from the candidate answer set according to the number of occurrences of the question in the question phrase in the search question and the number of co-occurrences of the question answers in the answer words in the candidate answer may include:

acquiring the ratio of the question-answer co-occurrence count of each answer phrase in the candidate answer to the question occurrence count of each question phrase in the retrieval question;

obtaining the target answer probability corresponding to the candidate answer according to the ratio, wherein the target answer probability is the probability that the candidate answer is used as the target answer of the retrieval question;

and selecting the target answer of the retrieval question from the candidate answer set according to the target answer probability corresponding to the answer.

For example, suppose the phrases obtained by segmenting the question Q are q1, q2, ..., qn, and the phrases obtained by segmenting a candidate answer A are a1, a2, ..., am. The question occurrence count Count(qi), that is, the number of question-answer pairs whose question contains qi, and the question-answer co-occurrence count Count(qi, aj), that is, the number of question-answer pairs whose question contains qi and whose answer contains aj, are counted, which gives the ratio Count(qi, aj)/Count(qi). The target answer probability of A is then obtained by weighting each ratio by 1/t and summing, where t = n × m is the number of (qi, aj) phrase pairs:

$$\mathrm{Score}(Q, A) = \frac{1}{t}\sum_{i=1}^{n}\sum_{j=1}^{m}\frac{\mathrm{Count}(q_i, a_j)}{\mathrm{Count}(q_i)}$$

Referring to FIG. 1e, Score (Q, A1), Score (Q, A2) may be calculated. The specific process is as follows:

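$$\mathrm{Score}(Q, A1) = \frac{1}{8}\left[\frac{\mathrm{Count}(\text{you}, \text{eat})}{\mathrm{Count}(\text{you})} + \frac{\mathrm{Count}(\text{eat}, \text{eat})}{\mathrm{Count}(\text{eat})} + \frac{\mathrm{Count}(\text{provided}, \text{eat})}{\mathrm{Count}(\text{provided})} + \frac{\mathrm{Count}(\text{do}, \text{eat})}{\mathrm{Count}(\text{do})} + \frac{\mathrm{Count}(\text{you}, \text{not})}{\mathrm{Count}(\text{you})} + \frac{\mathrm{Count}(\text{eat}, \text{not})}{\mathrm{Count}(\text{eat})} + \frac{\mathrm{Count}(\text{provided}, \text{not})}{\mathrm{Count}(\text{provided})} + \frac{\mathrm{Count}(\text{do}, \text{not})}{\mathrm{Count}(\text{do})}\right] = 0.055$$

$$\mathrm{Score}(Q, A2) = \frac{1}{8}\left[\frac{\mathrm{Count}(\text{you}, \text{electricity})}{\mathrm{Count}(\text{you})} + \frac{\mathrm{Count}(\text{eat}, \text{electricity})}{\mathrm{Count}(\text{eat})} + \frac{\mathrm{Count}(\text{provided}, \text{electricity})}{\mathrm{Count}(\text{provided})} + \frac{\mathrm{Count}(\text{do}, \text{electricity})}{\mathrm{Count}(\text{do})} + \frac{\mathrm{Count}(\text{you}, \text{not})}{\mathrm{Count}(\text{you})} + \frac{\mathrm{Count}(\text{eat}, \text{not})}{\mathrm{Count}(\text{eat})} + \frac{\mathrm{Count}(\text{provided}, \text{not})}{\mathrm{Count}(\text{provided})} + \frac{\mathrm{Count}(\text{do}, \text{not})}{\mathrm{Count}(\text{do})}\right] = 0.017$$

From the above, Score(Q, A1) > Score(Q, A2), so A1 is a more suitable answer to Q than A2.

After the target answer probability Score of each candidate answer A is obtained, each candidate answer in the candidate answer set may be scored based on its Score (the larger the Score, the higher the score); the candidate answers are then sorted, and the target answer is selected from the sorted set, for example by selecting the top-ranked candidate answer.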
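The target answer probability can be sketched as an average of co-occurrence ratios over all (question phrase, answer phrase) pairs, matching the 1/8 weighting used in the worked example above for 4 question words and 2 answer words. The counts below are made-up illustrative numbers, not the values of fig. 1e, and skipping question phrases that never occur is an assumption made here.

```python
from typing import List, Mapping, Tuple


def target_answer_probability(
    question_phrases: List[str],
    answer_phrases: List[str],
    occurrence: Mapping[str, int],
    co_occurrence: Mapping[Tuple[str, str], int],
) -> float:
    """Score(Q, A) = (1/t) * sum of Count(q, a) / Count(q) over all t phrase pairs."""
    t = len(question_phrases) * len(answer_phrases)
    if t == 0:
        return 0.0
    total = 0.0
    for q in question_phrases:
        q_count = occurrence.get(q, 0)
        if q_count == 0:
            continue  # assumption: an unseen question phrase contributes nothing
        for a in answer_phrases:
            total += co_occurrence.get((q, a), 0) / q_count
    return total / t


# Hypothetical counts for a two-word question and two candidate answers.
occurrence = {"you": 100, "eat": 50}
co_occurrence = {
    ("you", "eat"): 10, ("eat", "eat"): 20,
    ("you", "not"): 5, ("eat", "not"): 5,
    ("you", "electricity"): 1,
}
question = ["you", "eat"]
candidates = {"A1": ["eat", "not"], "A2": ["electricity", "not"]}
scores = {
    name: target_answer_probability(question, phrases, occurrence, co_occurrence)
    for name, phrases in candidates.items()
}
best = max(scores, key=scores.get)
```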

As can be seen from the above, the embodiment of the present invention employs a plurality of question-answer pairs formed based on social data on a social platform, where the question-answer pairs include questions and answers corresponding to the questions, then establishes an inverted index of the questions and phrases thereof, obtains a search question, determines similar questions to the search question according to the question phrases of the search question and the inverted index, obtains candidate answers to the search question according to the similar questions and the question-answer pairs, obtains a candidate answer set of the search question, and selects a target answer to the search question from the candidate answer set. According to the scheme, similar questions similar to the retrieval question can be inquired firstly, answers corresponding to the similar questions can be inquired, and then the most appropriate answer is selected from the answers of the similar questions, so that the scheme can output the answer matched with the retrieval question, and the accuracy and the quality of the answer output by the chat robot system are improved.

Example II,

The method described in Example I is further detailed below by way of example.

The embodiment of the invention introduces the automatic question-answering method provided by the invention by taking the example that the automatic question-answering device is integrated in a server.

Referring to fig. 2a, an embodiment of the present invention provides an automatic question-answering system, which includes a server and a terminal, where the automatic question-answering device is integrated in the server, and the server and the terminal are connected through a network.

As shown in fig. 2b, a specific process of the automatic question answering method may be as follows:

201. The server acquires social data on the social platform and performs data filtering on the social data.

Social data on a social platform is social information data interacted by a User on the social platform, and the social data can include UGC (User Generated Content) on the social platform.

The server in this embodiment may filter phrase data according to a vocabulary filtering principle, where the vocabulary filtering principle may be set according to actual requirements, for example, named entity data such as names of people, place names, and organization names may be filtered, privacy data such as phone numbers, instant messaging account numbers, financial account numbers (e.g., bank card numbers) may be filtered, and illegitimate vocabulary data such as dirty words may be filtered.
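A vocabulary filter of this kind might look like the sketch below. Everything specific in it is an assumption: the regular expressions for phone and card numbers, the word lists, whitespace tokenization, and the choice to delete matching tokens rather than discard the whole sentence are placeholders for whatever rules an actual deployment would configure.

```python
import re
from typing import Set

# Hypothetical patterns and word lists; real rules would be set per deployment.
CARD_PATTERN = re.compile(r"\b\d{16,19}\b")   # long digit runs, e.g. bank card numbers
PHONE_PATTERN = re.compile(r"\b\d{7,11}\b")   # shorter digit runs, e.g. phone numbers
BLOCKED_WORDS: Set[str] = {"badword"}
NAMED_ENTITIES: Set[str] = {"Alice", "Shenzhen"}


def filter_sentence(sentence: str) -> str:
    """Remove privacy data, named entities and blocked vocabulary from one sentence."""
    cleaned = CARD_PATTERN.sub("", sentence)
    cleaned = PHONE_PATTERN.sub("", cleaned)
    kept = [
        word
        for word in cleaned.split()  # assumption: whitespace-delimited tokens
        if word not in BLOCKED_WORDS and word not in NAMED_ENTITIES
    ]
    return " ".join(kept)


filtered_phone = filter_sentence("call Alice at 13800138000 now")
filtered_card = filter_sentence("card 6222020200112345678 badword ok")
```

In practice the named-entity list would more likely come from a named-entity recognizer than from a fixed set.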

202. The server forms a plurality of question-answer pairs according to the filtered social data and stores the question-answer pairs.

Here, a question-answer pair, also called a QA pair, refers to a pair of social data items, such as the text data of a question and the text data of the answer corresponding to the question.

203. The server performs word segmentation on the questions in the question-answer pairs and establishes an inverted index of the questions and their phrases.

The inverted index may include a plurality of index pairs, each consisting of an index keyword and the index item corresponding to that keyword, where the index keyword may be a phrase in a question and the index item may be the question containing that phrase.

For example, the server may perform word segmentation processing on the question in the question-answer pair to obtain a word group of the question; and establishing an index pair according to the problem and the phrase of the problem, wherein the index pair comprises an index key word and an index item corresponding to the index key word, the index key word is the phrase of the problem, and the index item can be the problem.
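Building the inverted index can be sketched as follows. The whitespace segmenter is a stand-in assumed for this English illustration (Chinese text would need a real word segmenter), and the function names and sample questions are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set


def whitespace_segment(text: str) -> List[str]:
    """Stand-in segmenter; Chinese text would need a real word segmenter."""
    return text.lower().split()


def build_inverted_index(
    questions: Iterable[str],
    segment: Callable[[str], List[str]] = whitespace_segment,
) -> Dict[str, Set[str]]:
    """Map each phrase (index keyword) to the questions (index items) containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for question in questions:
        for phrase in segment(question):
            index[phrase].add(question)
    return dict(index)


questions = ["have you eaten", "what did you eat", "is it raining"]
index = build_inverted_index(questions)
```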

204. The terminal sends the message content currently input by the user to the server.

For example, referring to fig. 2c, after the user opens the robot chat application, the user may enter a conversation interface with the robot by clicking a pet on the application interface, the user may input message content in an input box of the conversation interface, and after the user inputs the content and clicks a send button, the terminal may send the message content currently input by the user to the server, so that the server returns a feedback message content of the content, that is, an answer. For example, the user inputs "do you have a meal" in the dialog input box, and the terminal transmits the content to the server.

205. The server obtains a retrieval question from the message content currently input by the user.

For example, the server may use the sentence in the message content as the retrieval question. For example, "do you have a meal" can be directly used as the retrieval question Q.

For another example, to improve the accuracy and quality of the answer, this embodiment may perform syntactic analysis on the sentences in the message content so as to obtain an accurate retrieval question. Specifically, the server may obtain the current sentence in the content currently input by the user and a historical sentence from the content previously input by the user, and then perform syntactic analysis on each of them to determine whether the current sentence contains a subject-predicate or verb-object structure and whether the historical sentence contains a subject-predicate or verb-object structure.

If the current sentence contains neither a subject-predicate structure nor a verb-object structure, and the historical sentence contains a subject-predicate or verb-object structure, the synthesis strategy described above can be adopted to synthesize a new sentence as the retrieval question. For the specific sentence synthesis process, refer to the description in Example I, which is not repeated here.

206. The server finds the questions matching the question phrases of the retrieval question based on the inverted index, and obtains a group of similar questions close to the retrieval question.

The inverted index of the questions and their phrases may comprise index pairs, each including an index keyword and the index item corresponding to that keyword, where the index keyword is a phrase of a question and the index item may be the question.

For example, the retrieval question Q may be segmented into the phrases q1, q2, and q3, and the inverted index may include the index pairs (q1, Q1), (q2, Q2), (q3, Q3), .... The server can then determine from the inverted index that the questions similar to Q are Q1, Q2, and Q3.
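Looking up similar questions through the inverted index can be sketched as below. The source only says that questions matching the question phrases are found; ranking the hits by the number of shared phrases, breaking ties alphabetically, and keeping the top `top_k` are assumptions made for this illustration.

```python
from collections import Counter
from typing import Dict, List, Set


def find_similar_questions(
    query_phrases: List[str],
    inverted_index: Dict[str, Set[str]],
    top_k: int = 3,
) -> List[str]:
    """Return indexed questions that share at least one phrase with the query."""
    hits: Counter = Counter()
    for phrase in set(query_phrases):
        for question in inverted_index.get(phrase, ()):
            hits[question] += 1
    # assumption: more shared phrases means closer; ties are broken alphabetically
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [question for question, _ in ranked[:top_k]]


# Hypothetical index: phrase -> questions containing it.
index = {
    "you": {"have you eaten", "what did you eat"},
    "eaten": {"have you eaten"},
    "raining": {"is it raining"},
}
similar = find_similar_questions(["have", "you", "eaten"], index)
```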

207. The server queries the question-answer pairs for the answers matching the similar questions, and takes the matching answers as candidate answers of the retrieval question to obtain a candidate answer set of the retrieval question.

Specifically, the server may determine, among the question-answer pairs, matching question-answer pairs in which the questions match the similar questions, and then take the answers in the matching question-answer pairs as candidate answers to the search questions.

For example, assuming that the question-answer pairs include (Q1, A1), (Q2, A2), (Q3, A3), ..., the answers to the similar questions Q1, Q2, and Q3 are determined from the question-answer pairs. The candidate answer set {A1, A2, A3} of the retrieval question Q is thereby obtained.

208. The server scores the candidate answers in the candidate answer set to obtain the score of each candidate answer.

For example, the server may score the answers in the candidate answer set {A1, A2, A3} respectively; taking a 100-point scale as an example, assume that the scores of A1, A2, and A3 are 80, 76, and 79 points, respectively.
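Collecting the candidate answer set from the matching question-answer pairs can be sketched as follows. The function name `collect_candidate_answers` is hypothetical, and dropping duplicate answers while keeping first-seen order is an assumption added here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def collect_candidate_answers(
    similar_questions: Iterable[str],
    qa_pairs: Iterable[Tuple[str, str]],
) -> List[str]:
    """Gather the answers of every question-answer pair whose question is similar."""
    answers_by_question: Dict[str, List[str]] = defaultdict(list)
    for question, answer in qa_pairs:
        answers_by_question[question].append(answer)
    candidates: List[str] = []
    for question in similar_questions:
        for answer in answers_by_question.get(question, []):
            if answer not in candidates:  # assumption: keep each answer only once
                candidates.append(answer)
    return candidates


qa_pairs = [("Q1", "A1"), ("Q2", "A2"), ("Q3", "A3"), ("Q4", "A4")]
candidate_set = collect_candidate_answers(["Q1", "Q2", "Q3"], qa_pairs)
```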

The scoring manner for the candidate answer may be various, for example, including:

(1) based on the similarity between the answers and the questions:

specifically, the server may obtain sentence similarity information between the candidate answers in the candidate answer set and the retrieval question; and scoring the candidate answers according to sentence similarity information between the candidate answers in the candidate answer set and the retrieval question.

For example, the server may obtain an answer sentence vector corresponding to the candidate answer in the candidate answer set and a question sentence vector corresponding to the retrieval question, obtain a vector similarity between the answer sentence vector and the question sentence vector, and then score the candidate answers based on the vector similarity between the answer sentence vector and the question sentence vector. In practical application, the vector similarity between sentences can be calculated by Word2vec algorithm.

The specific scoring rule may be set according to actual requirements, for example, when the vector similarity is cosine similarity, the larger the cosine value between the vector of the answer and the vector of the question is, the higher the score is given to the answer.

The sentence vector may be obtained based on word vectors; for example, the word vectors corresponding to the phrases of a sentence are obtained, and the sentence vector corresponding to the sentence is obtained according to those word vectors.

In addition, to improve the accuracy of the sentence vector, the sentence vector may also be obtained based on a convolutional neural network; for example, a sentence is represented as a corresponding sentence matrix, and convolution processing is performed on the sentence matrix based on a convolutional neural network model with a plurality of different convolution kernels to obtain the sentence vector corresponding to the sentence.

For details of acquiring the word vectors, the sentence vectors, and the vector similarity, refer to the description in Example I.

(2) Based on word co-occurrence statistics:

for example, the number of times that a word in the answer and a word in the question co-occur two by two, and the number of times that each word in the question occurs may be counted, and then the candidate answers may be scored based on the statistical result.

Specifically, the question occurrence count of each question phrase in the retrieval question and the question-answer co-occurrence count of each answer phrase in the candidate answers are obtained, and the candidate answers are scored according to these counts.

Preferably, after counting the number of times of occurrence of the word, the server in this embodiment may further calculate a number ratio between the number of times of co-occurrence of the answers to the question and the number of times of occurrence of the question, then obtain a probability that each candidate answer is used as the target answer based on the number ratio, and finally score the candidate answers based on the probabilities.

In practical application, when the probability of the target answer of the candidate answer is higher, the score of the candidate answer is higher.

Specifically, the acquisition of the number of occurrences, the acquisition of the number ratio, and the acquisition of the probability of the target answer may refer to the detailed description of example .

209. The server ranks the candidate answers in the candidate answer set according to the scores of the candidate answers.

For example, the server may rank the candidate answers in order of score from high to low. Given the scores of A1, A2 and A3, the ranked candidate answer set may be {A1, A3, A2}.

210. The server selects a target answer to the search question from the ranked candidate answer set and sends the target answer to the terminal.

In some other embodiments, the last-ranked candidate answer may also be selected as the target answer to the search question, for example when the candidate answers are ranked in order of score from low to high.

For example, the server may select A1 from the ranked candidate answer set {A1, A3, A2} as the best answer, that is, the target answer to the search question Q, and then send A1 to the terminal. Upon receiving A1, the terminal may display it in the robot chat interface, thereby implementing robot chat.

For example, referring to fig. 2d, when the message content input by the user in the input box of the robot chat interface is "have you eaten", the terminal sends the message content to the server, and the server takes the message content as the search question Q. By querying the similar questions of the search question Q and their answers, the server obtains a candidate answer list {A1 = "no-eat", A2 = "no-eat", A3 = "no-eat fruit"}. The server may then score each candidate answer and rank the candidate answers in the list to obtain a ranked candidate answer list {A1, A3, A2}. The server may select the top-ranked candidate answer A1 from the list as the best answer to the search question Q and send A1 to the terminal. Referring to fig. 2d, the terminal displays A1, "no eat", on the robot chat interface.
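Steps 209 and 210 amount to a sort followed by a selection. The sketch below assumes the scores have already been computed; the numeric score values and the function name are hypothetical.

```python
def rank_candidates(scores):
    """Return candidate identifiers ordered by score, highest first."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical scores for the three candidate answers.
scores = {"A1": 0.9, "A2": 0.3, "A3": 0.6}
ranked = rank_candidates(scores)
target = ranked[0]  # top-ranked candidate becomes the target answer
```

If the list were instead sorted from low to high, the target answer would be taken from the end of the list, as noted above.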

As can be seen from the above, the embodiment of the present invention may form a plurality of question-answer pairs based on social data on a social platform, where each question-answer pair includes a question and the answer corresponding to the question. An inverted index of the questions and their phrases is then established, a search question is obtained, and similar questions close to the search question are determined according to the question phrases of the search question and the inverted index. Candidate answers to the search question are obtained according to the similar questions and the question-answer pairs, yielding a candidate answer set for the search question. The candidate answers in the set are then scored based on a sentence vector similarity algorithm, a convolutional neural network similarity algorithm, or word co-occurrence statistics, the candidate answers are ranked by score, and finally a target answer to the search question is selected from the ranked candidate answer set. The scheme can output answers that match the search question, improving the accuracy and quality of the answers output by the chat robot system.

Example III

The methods described in Examples I and II are illustrated in detail below by way of example.

The embodiment of the invention provides an automatic question answering system. Referring to fig. 3a, the automatic question answering system may include an online retrieval system, an offline processing system and a ranking system.

The automatic question answering system realizes the chat process as follows:

For example, the offline processing system may filter out named entities such as personal names, place names and organization names, private data such as phone numbers, instant messaging account numbers and bank card numbers, and uncivilized words such as profanity. For the formation of QA pairs and data cleaning, refer to the detailed description in the foregoing embodiment.

The online retrieval system builds an inverted index over the massive set of input QA pairs to facilitate retrieval, so that the question most similar to the question Q input by the user can be found by retrieval. Specifically, the questions in the QA pairs are segmented into words, and an inverted index between each question and its words is established, in which the index key is a phrase of the question and the indexed object is the question. For the establishment of the inverted index and the query for similar questions, refer to the detailed description in the foregoing embodiment.
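The inverted index and the similar-question lookup can be sketched as below. This is only an illustration under stated assumptions: questions are assumed to be already segmented into words, the function names and the `top_k` parameter are hypothetical, and ranking similar questions by the number of shared phrases is a simple stand-in for whatever similarity measure an implementation would use.

```python
from collections import Counter, defaultdict


def build_inverted_index(questions):
    """Map each phrase (index key) to the ids of the questions containing it (index items)."""
    index = defaultdict(set)
    for qid, words in enumerate(questions):
        for word in words:
            index[word].add(qid)
    return index


def find_similar(index, query_words, top_k=2):
    """Return ids of the questions sharing the most phrases with the query (assumed measure)."""
    hits = Counter()
    for word in set(query_words):
        for qid in index.get(word, ()):
            hits[qid] += 1
    return [qid for qid, _ in hits.most_common(top_k)]


# Hypothetical segmented questions taken from QA pairs.
questions = [
    ["have", "you", "eaten"],
    ["are", "you", "busy"],
    ["what", "time", "is", "it"],
]
index = build_inverted_index(questions)
similar_ids = find_similar(index, ["you", "eaten", "lunch"])
```

Only questions that share at least one phrase with the query are ever touched, which is the point of indexing by phrase rather than scanning every QA pair.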

The online retrieval system receives the message content input by the user and sent by the terminal, and then obtains the retrieval question Q input by the user based on the message content. It queries the massive QA pairs for the answers to the questions most similar to the question Q, takes them as candidate answers to the retrieval question Q, and thereby obtains a candidate answer list. The online retrieval system outputs the candidate answer list and the retrieval question Q input by the user to the ranking system.

The ranking system includes a scoring module and a ranking module, where the scoring module is used to score the candidate answers in the candidate answer list. Referring to fig. 3b, the scoring module may score the candidate answers based on the following three algorithms:

(1) Word co-occurrence statistics:

The number of times that words in the answer and words in the question co-occur pairwise, and the number of times that each word in the question appears, are counted, and the answer to the retrieval question is then selected based on these statistics. Specifically, the question occurrence count of the question phrases in the retrieval question and the question-answer co-occurrence count of the answer words in the candidate answers may be obtained, and the ratio of the co-occurrence count to the occurrence count is calculated. The probability that each candidate answer is the target answer is then obtained based on this ratio, and finally the candidate answers are scored based on the probabilities.

(2) Word2vec vector similarity calculation:

Word vectors for the phrases in the answers and questions are trained on a preset number of question-answer pairs (QA pairs) using the word2vec tool. An answer sentence vector corresponding to each candidate answer is obtained from the word vectors of its answer phrases, and a question sentence vector corresponding to the retrieval question is obtained from the word vectors of its question phrases. The cosine similarity between the answer sentence vector and the question sentence vector is then calculated, and each candidate answer is scored based on this cosine similarity.

For example, after the word vectors are obtained, the sentence vector may be formed by vector addition; preferably, the word vectors are combined by weighted summation to obtain the sentence vector. For example, given word vectors W1, W2, W3, ..., Wn of the same dimension, the sentence vector may be computed as $S = X_1 W_1 + X_2 W_2 + X_3 W_3 + \dots + X_n W_n = \sum_{i=1}^{n} X_i W_i$, where S is the sentence vector and Xi is the weight corresponding to the word vector Wi.
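The weighted summation and the cosine similarity can be sketched in a few lines. The word vectors and weights below are hypothetical toy values; in practice the word vectors would come from a trained word2vec model, and how the weights are chosen is not specified here.

```python
import math


def sentence_vector(word_vectors, weights):
    """Weighted sum of same-dimension word vectors: S = sum_i X_i * W_i."""
    dim = len(word_vectors[0])
    return [
        sum(x * w[k] for w, x in zip(word_vectors, weights))
        for k in range(dim)
    ]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical 2-dimensional word vectors and weights.
question_vec = sentence_vector([[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
answer_vec = sentence_vector([[1.0, 1.0]], [1.0])
similarity = cosine_similarity(question_vec, answer_vec)
```

Setting every weight to 1 reduces the weighted summation to the plain vector addition mentioned above.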

(3) Similarity calculation based on a convolutional neural network:

Because word2vec weighted sentence vectors lose word order information, this embodiment may employ a convolutional neural network model and capture word order by setting convolution kernels of different sizes. Specifically, a sentence is represented as a corresponding matrix, and the matrix is convolved based on the convolutional neural network model and a plurality of different convolution kernels to obtain the sentence vector corresponding to the sentence. For example, the answer matrix and the question matrix are each convolved based on the convolutional neural network model and a plurality of different convolution kernels to obtain the answer sentence vector corresponding to the candidate answer and the question sentence vector corresponding to the retrieval question. After the sentence vectors are obtained, the vector similarity between the answer sentence vector and the question sentence vector may be calculated, and the candidate answers are then scored based on this vector similarity.
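The idea of convolving a sentence matrix with kernels of different sizes can be sketched as follows. This is an untrained toy illustration: the kernels are fixed hypothetical values rather than learned parameters, each kernel is assumed to span the full word-vector width, and max pooling of each kernel's responses is an assumption about how the convolution results are combined into a sentence vector.

```python
def conv_feature(matrix, kernel):
    """Slide one kernel over consecutive word rows and max-pool the responses."""
    height = len(kernel)
    responses = []
    for start in range(len(matrix) - height + 1):
        window = matrix[start:start + height]
        responses.append(sum(
            w * x
            for k_row, m_row in zip(kernel, window)
            for w, x in zip(k_row, m_row)
        ))
    # A sentence shorter than the kernel yields no response.
    return max(responses) if responses else 0.0


def cnn_sentence_vector(matrix, kernels):
    """Build the sentence vector from one pooled feature per kernel."""
    return [conv_feature(matrix, kernel) for kernel in kernels]


# Hypothetical sentence matrix: 3 words, 2-dimensional word vectors.
sentence_matrix = [[1, 0], [0, 1], [1, 1]]
# Hypothetical kernels covering 1 word and 2 consecutive words.
kernels = [
    [[1, 1]],
    [[1, 0], [0, 1]],
]
vec = cnn_sentence_vector(sentence_matrix, kernels)
```

A kernel covering two or more rows responds to specific sequences of adjacent words, which is how word order enters the sentence vector.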

Referring to fig. 3b, after scoring each candidate answer, the ranking module may rank the answers in the candidate answer list based on the score of each candidate answer, and output the ranked candidate answer list.

After the ranked candidate answer list is output, the automatic question answering system may select the candidate answer at a corresponding position in the list as the final answer to the retrieval question Q. For example, the first candidate answer in the list is selected as the final answer to the retrieval question Q.

As can be seen from the above, the automatic question answering system provided by the embodiment of the present invention can form a plurality of question-answer pairs based on social data on a social platform, where each question-answer pair includes a question and the corresponding answer. An inverted index of the questions and their phrases is then established, a retrieval question is obtained, and similar questions close to the retrieval question are determined according to the question phrases of the retrieval question and the inverted index. Candidate answers to the retrieval question are obtained according to the similar questions and the question-answer pairs, yielding a candidate answer set for the retrieval question. The candidate answers in the set are then scored based on a sentence vector similarity algorithm, a convolutional neural network similarity algorithm, or word co-occurrence statistics, the candidate answers are ranked by score, and finally a target answer to the retrieval question is selected from the ranked candidate answer set. The scheme can output answers that match the retrieval question, improving the accuracy and quality of the answers output by the chat robot system.

Example IV

In order to better implement the above method, an embodiment of the present invention further provides an automatic question answering device. As shown in fig. 4a, the device may include a question-answer pair forming unit 401, an index establishing unit 402, a question acquiring unit 403, a candidate answer obtaining unit 404 and an answer selecting unit 405, as follows:

(1) Question-answer pair forming unit 401;

The question-answer pair forming unit 401 is configured to form a plurality of question-answer pairs based on social data on the social platform, where the question-answer pairs include questions and answers corresponding to the questions.

The social platform is a platform on which users can share information such as their own news, moods and feelings, for example a social platform based on instant messaging.

In this embodiment, a question-answer pair, also called a QA pair, refers to a pair of social data items, such as text data, consisting of one question and one answer.

In order to reduce the processing amount and improve the quality of answers, the question-answer pair forming unit 401 may further perform data filtering on the social data to obtain filtered social data, and perform question-answer splitting on the filtered social data to form a plurality of question-answer pairs.

For example, the question-answer pair forming unit 401 may perform word segmentation on the social data and then filter the phrase data according to a word filtering principle. For example, named entity data such as personal names, place names and organization names may be filtered out, private data such as phone numbers, instant messaging account numbers and financial account numbers (e.g., bank card numbers) may be filtered out, and uncivilized words such as profanity may be filtered out.
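The data filtering step can be sketched as below. Everything specific in this sketch is an assumption: the digit-run pattern used to detect phone or account numbers, the blocked-word list, the whitespace segmentation, and the choice to discard a whole question-answer pair when either side contains filtered content (an implementation could instead remove only the offending phrases).

```python
import re

# Hypothetical pattern: a run of 7 to 11 digits is treated as a phone or account number.
NUMBER_RE = re.compile(r"\b\d{7,11}\b")
# Hypothetical list of uncivilized words.
BLOCKED_WORDS = {"damn"}


def is_clean(sentence, named_entities=frozenset()):
    """Return True if the sentence has no private number, blocked word or named entity."""
    if NUMBER_RE.search(sentence):
        return False
    words = sentence.lower().split()  # whitespace split stands in for word segmentation
    return not any(w in BLOCKED_WORDS or w in named_entities for w in words)


def filter_pairs(pairs, named_entities=frozenset()):
    """Keep only the QA pairs whose question and answer are both clean."""
    return [
        (q, a)
        for q, a in pairs
        if is_clean(q, named_entities) and is_clean(a, named_entities)
    ]


raw_pairs = [
    ("call me at 13800138000", "ok"),
    ("have you eaten", "not yet"),
    ("where is alice", "damn"),
]
clean_pairs = filter_pairs(raw_pairs, named_entities={"alice"})
```

In a real system the named entities would come from a named entity recognizer rather than a fixed set.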

(2) Index establishing unit 402;

The index establishing unit 402 is configured to establish an inverted index of the questions and their phrases.

The inverted index of the questions and their phrases in this embodiment may be an index in which a question is located through the phrases it contains. The inverted index may include a plurality of index entries (index pairs), each including an index key and an index item corresponding to the index key, where the index key may be a phrase in a question and the index item may be the question containing that phrase. Establishing the inverted index of the questions and their phrases in this embodiment therefore means establishing index pairs (index entries) that represent the correspondence between phrases and questions.

That is, the index establishing unit 402 may be specifically configured to perform word segmentation on the questions in the question-answer pairs to obtain the phrases of each question, and to establish index pairs according to the questions and their phrases, where each index pair includes an index key and an index item corresponding to the index key, the index key being a phrase of the question and the index item being the question.

(3) A question acquisition unit 403;

The question acquiring unit 403 is configured to acquire a retrieval question and determine similar questions close to the retrieval question according to the question phrases of the retrieval question and the inverted index.

Referring to fig. 4b, the question acquiring unit 403 may include a retrieval question acquisition subunit 4031 and a similar question acquisition subunit 4032;

a retrieval question acquisition subunit 4031, configured to:

take the new sentence as the retrieval question;

a similar question acquisition subunit 4032, configured to determine similar questions close to the retrieval question according to the question phrases of the retrieval question and the inverted index.

The historical sentence may be a sentence previously input by the user, for example, the sentence input by the user the previous time.

The retrieval question acquisition subunit 4031 may further be configured to:

when the synthesis of a new sentence fails, extract corresponding keywords from the sentence according to the word types of the phrases in the sentence, replace the target phrases in the historical sentence that have the same word type as the keywords with the keywords to obtain a replaced sentence, and take the replaced sentence as the retrieval question.

For another example, the retrieval question acquisition subunit 4031 may further be configured to synthesize a new sentence as the retrieval question based on a statistical synthesis strategy when synthesizing a sentence by keyword replacement fails. For example, corresponding keywords may be extracted from the historical sentences, and the keywords may then be concatenated into a new sentence.
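The keyword replacement strategy can be sketched as below. The sketch assumes both sentences are already segmented and tagged with word types; the tag names, the function name, the choice of nouns as the keyword word type, and returning `None` to signal failure are all hypothetical. If several keywords share a word type, this simplified version keeps only the last one.

```python
def replace_keywords(sentence, history, keyword_types=("noun",)):
    """Replace phrases in the historical sentence with same-type keywords from the new sentence.

    Both arguments are lists of (word, word_type) tuples. Returns the replaced
    sentence, or None when no replacement could be made.
    """
    keywords = {wtype: word for word, wtype in sentence if wtype in keyword_types}
    if not keywords:
        return None
    replaced = [(keywords.get(wtype, word), wtype) for word, wtype in history]
    return replaced if replaced != history else None


# Hypothetical tagged sentences: the user first asks a full question,
# then follows up with a fragment.
history = [("do", "verb"), ("you", "pronoun"), ("like", "verb"), ("apples", "noun")]
sentence = [("and", "conj"), ("bananas", "noun")]
result = replace_keywords(sentence, history)
retrieval_question = " ".join(word for word, _ in result)
```

When the function returns `None`, a caller could fall back to the statistical synthesis strategy described above.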

(4) A candidate answer obtaining unit 404;

the candidate answer obtaining unit 404 is configured to obtain candidate answers to the search question according to the similar question and the question-answer pair, so as to obtain a candidate answer set of the search question.

Specifically, the candidate answer obtaining unit 404 may determine a matching question-answer pair in which the question matches the similar question among the question-answer pairs, and then take the answer in the matching question-answer pair as the candidate answer for the search question. That is, the answers in question-answer pairs in which questions match similar questions may be used as candidate answers to the search questions.
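Collecting the candidate answer set from the matching question-answer pairs can be sketched as follows. The function name is hypothetical, matching is done by exact string equality, and dropping duplicate answers is an assumption of this sketch.

```python
def candidate_answers(similar_questions, qa_pairs):
    """Collect answers of QA pairs whose question matches a similar question."""
    similar = set(similar_questions)
    seen = set()
    candidates = []
    for question, answer in qa_pairs:
        if question in similar and answer not in seen:
            seen.add(answer)  # keep each distinct answer once, in first-seen order
            candidates.append(answer)
    return candidates


# Hypothetical QA pairs and similar questions.
qa_pairs = [
    ("have you eaten", "not yet"),
    ("have you eaten", "yes"),
    ("are you busy", "no"),
    ("did you eat", "not yet"),
]
answers = candidate_answers(["have you eaten", "did you eat"], qa_pairs)
```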

(5) An answer selecting unit 405;

an answer selecting unit 405 is configured to select a target answer of the search question from the candidate answer set.

For example, the answer selecting unit 405 may be configured to score the candidate answers in the candidate answer set to obtain scores of the candidate answers, sort the candidate answers in the candidate answer set according to the scores of the candidate answers to obtain a sorted candidate answer set, and select the target answer of the retrieval question from the sorted candidate answer set.

In order to improve the correlation between the output answer and the search question and improve the quality of the answer, the present embodiment may select the target answer of the search question in the following manner.

For example, referring to fig. 4c, the answer selecting unit 405 may include:

a similarity obtaining subunit 4051, configured to obtain sentence similarity information between the candidate answer in the candidate answer set and the retrieval question;

the answer selecting subunit 4052 is configured to select a target answer to the search question from the candidate answer set according to the sentence similarity information between the candidate answer and the search question in the candidate answer set.

Specifically, the similarity obtaining subunit 4051 is configured to: obtaining answer sentence vectors corresponding to candidate answers in the candidate answer set and question sentence vectors corresponding to retrieval questions; obtaining the vector similarity between the answer sentence vector and the question sentence vector;

the answer selecting subunit 4052 is configured to select a target answer to the search question from the candidate answer set according to the vector similarity between the answer sentence vector and the question sentence vector.

The sentence vectors may be obtained in multiple ways. For example, the similarity obtaining subunit 4051 may be configured to obtain the word vectors corresponding to the answer phrases of a candidate answer in the candidate answer set and obtain the answer sentence vector corresponding to the candidate answer according to these word vectors, and to obtain the word vectors corresponding to the question phrases of the retrieval question and obtain the question sentence vector corresponding to the retrieval question according to these word vectors.

For another example, the similarity obtaining subunit 4051 may be configured to:

and representing the candidate answers in the candidate answer set into corresponding answer matrixes, and carrying out convolution processing on the answer matrixes based on the convolutional neural network model to obtain answer sentence vectors corresponding to the candidate answers.

In order to improve the accuracy of the sentence vector, the embodiment may further use a plurality of different convolution kernels to perform convolution operation on the matrix to obtain a corresponding sentence vector. For example:

the similarity obtaining subunit 4051 may be configured to:

performing convolution operations on the question matrix with a plurality of different convolution kernels respectively to obtain the convolution results corresponding to the different convolution kernels, and constructing the question sentence vector corresponding to the retrieval question according to the convolution results corresponding to the different convolution kernels;

Referring to fig. 4d, the answer selecting unit 405 in this embodiment may include:

a frequency obtaining subunit 4053, configured to obtain the question occurrence count of the question phrases in the search question and the question-answer co-occurrence count of the answer words in the candidate answers; the question occurrence count is the number of times the question phrases occur in the question-answer pairs, and the question-answer co-occurrence count is the number of times the answer phrases in the candidate answers and the question phrases in the search question co-occur pairwise in the question-answer pairs;

the answer selecting subunit 4054 is configured to select a target answer to the search question from the candidate answer set according to the question occurrence count of the question phrases in the search question and the question-answer co-occurrence count of the answer words in the candidate answers.

The answer selecting sub-unit 4054 may be configured to:

In a specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily and implemented as one or several entities. For the specific implementation of the above units, refer to the foregoing method embodiments, which are not described herein again.

The automatic question answering device may be implemented by one or more entities, for example integrated in a server. As another example, the automatic question answering device may be implemented by an offline processing server, an online processing server, a ranking server, or the like.

As can be seen from the above, in the automatic question answering device of the embodiment of the present invention, the question-answer pair forming unit 401 forms a plurality of question-answer pairs based on social data on a social platform, where each question-answer pair includes a question and the answer corresponding to the question. The index establishing unit 402 then establishes an inverted index of the questions and their phrases. The question acquiring unit 403 acquires the search question and determines similar questions close to the search question according to the question phrases of the search question and the inverted index. The candidate answer obtaining unit 404 obtains candidate answers to the search question according to the similar questions and the question-answer pairs to obtain a candidate answer set for the search question, and the answer selecting unit 405 selects a target answer to the search question from the candidate answer set. With this scheme, similar questions close to the search question are queried first, the answers corresponding to the similar questions are then retrieved, and the most appropriate answer is selected from among them. The scheme can therefore output an answer that matches the search question, improving the accuracy and quality of the answers output by the chat robot system.

It will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing associated hardware, and the program may be stored in a computer-readable storage medium, which may include a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or the like.

The automatic question answering method and device provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help in understanding the method and core idea of the present invention. Meanwhile, those skilled in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this description should not be construed as limiting the present invention.

Claims

1. An automatic question answering method, characterized by comprising:

establishing an inverted index of the question and the phrase thereof;

acquiring the question occurrence count of the question phrases in the retrieval question and the question-answer co-occurrence count of the answer words in the candidate answers; the question occurrence count is the number of times the question phrases occur in the question-answer pairs, and the question-answer co-occurrence count is the number of times the answer phrases in the candidate answers and the question phrases in the retrieval question co-occur pairwise in the question-answer pairs;

and selecting a target answer to the retrieval question from the candidate answer set according to the question occurrence count of the question phrases in the retrieval question and the question-answer co-occurrence count of the answer words in the candidate answers.

2. The automatic question-answering method according to claim 1, wherein selecting a target answer to the search question from the candidate answer set comprises:

obtaining sentence similarity information between the candidate answers and the retrieval questions in the candidate answer set;

and selecting a target answer of the retrieval question from the candidate answer set according to sentence similarity information between the candidate answer and the retrieval question in the candidate answer set.

3. The automatic question-answering method according to claim 2, wherein obtaining sentence similarity information between answers in the candidate answer set and the search question comprises:

acquiring the vector similarity between the answer sentence vector and the question sentence vector;

the selecting a target answer of the retrieval question from the candidate answer set according to sentence similarity information between the candidate answer and the retrieval question in the candidate answer set comprises: and selecting the target answer of the retrieval question from the candidate answer set according to the vector similarity between the answer sentence vector and the question sentence vector.

4. The method of claim 3, wherein obtaining an answer sentence vector corresponding to the candidate answer in the candidate answer set and a question sentence vector corresponding to the search question comprises:

acquiring word vectors corresponding to answer phrases of candidate answers in a candidate answer set, and acquiring answer sentence vectors corresponding to the candidate answers according to the word vectors corresponding to the answer phrases;

5. The method of claim 3, wherein obtaining an answer sentence vector corresponding to the candidate answer in the candidate answer set and a question sentence vector corresponding to the search question comprises:

6. The automatic question-answering method according to claim 5, wherein the convolution processing is performed on the question matrix based on the convolutional neural network model to obtain a question sentence vector corresponding to the retrieval question, and the method comprises the following steps:

constructing a question sentence vector corresponding to the retrieval question according to convolution results corresponding to different convolution kernels;

performing convolution processing on an answer matrix based on the convolutional neural network model to obtain an answer sentence vector corresponding to the candidate answer, including:

7. The automatic question-answering method according to claim 1, wherein selecting a target answer to the search question from the candidate answer set based on the number of occurrences of the question in the question phrase in the search question and the number of co-occurrences of the question answers to the answer words in the candidate answers, comprises:

8. The automatic question-answering method according to any one of claims 1 to 7, wherein obtaining a search question comprises:

when the sentence does not contain a subject and an object and the historical sentence contains a subject or an object, taking the subject or the object in the historical sentence as the subject of the sentence to synthesize a new sentence;

and taking the new sentence as the retrieval question.

9. The automatic question-answering method according to claim 8, wherein said obtaining a search question further comprises:

when the synthesis of a new sentence fails, extracting corresponding keywords from the sentence according to the word type of the phrase in the sentence;

replacing the target word group with the same word type as the keyword in the historical sentence with the keyword to obtain a replaced sentence;

and taking the replaced sentence as the retrieval question.

10. An automatic question answering device, characterized by comprising:

the answer selecting unit comprises:

the frequency acquisition subunit is used for acquiring the question occurrence count of the question phrases in the retrieval question and the question-answer co-occurrence count of the answer words in the candidate answers; the question occurrence count is the number of times the question phrases occur in the question-answer pairs, and the question-answer co-occurrence count is the number of times the answer phrases in the candidate answers and the question phrases in the retrieval question co-occur pairwise in the question-answer pairs;

and the answer selecting subunit is used for selecting the target answer to the retrieval question from the candidate answer set according to the question occurrence count of the question phrases in the retrieval question and the question-answer co-occurrence count of the answer words in the candidate answers.

11. The automatic question answering device according to claim 10, wherein said answer selecting unit comprises:

the similarity obtaining subunit is used for obtaining sentence similarity information between the candidate answers in the candidate answer set and the retrieval question;

and the answer selecting subunit is used for selecting the target answer of the retrieval question from the candidate answer set according to the sentence similarity information between the candidate answer and the retrieval question in the candidate answer set.

12. The automatic question answering device according to claim 11, wherein the similarity obtaining subunit is operable to: obtaining answer sentence vectors corresponding to candidate answers in the candidate answer set and question sentence vectors corresponding to retrieval questions; acquiring the vector similarity between the answer sentence vector and the question sentence vector;

the answer selecting subunit is used for: and selecting the target answer of the retrieval question from the candidate answer set according to the vector similarity between the answer sentence vector and the question sentence vector.

13. The automatic question answering device according to claim 12, wherein the similarity obtaining subunit is operable to: acquire word vectors corresponding to answer phrases of candidate answers in the candidate answer set, and acquire answer sentence vectors corresponding to the candidate answers according to the word vectors corresponding to the answer phrases; and acquire word vectors corresponding to question phrases of the retrieval question, and acquire a question sentence vector corresponding to the retrieval question according to the word vectors corresponding to the question phrases.

14. The automatic question answering device according to claim 12, wherein the similarity obtaining subunit is operable to: represent the retrieval question as a corresponding question matrix, and perform convolution processing on the question matrix based on a convolutional neural network model to obtain a question sentence vector corresponding to the retrieval question; and represent the candidate answers in the candidate answer set as corresponding answer matrices, and perform convolution processing on the answer matrices based on the convolutional neural network model to obtain answer sentence vectors corresponding to the candidate answers.

15. The automatic question answering device according to claim 14, wherein the similarity obtaining subunit is configured to:

16. The automatic question answering device according to claim 10, wherein the answer selecting subunit is configured to:

17. The automatic question answering device according to any one of claims 10 to 16, wherein the question acquisition unit includes:

a retrieval question acquisition subunit operable to:

take the new sentence as the retrieval question;

and a similar question acquisition subunit, used for determining similar questions close to the retrieval question according to the question phrases of the retrieval question and the inverted index.

18. The automatic question answering device according to claim 17, wherein the retrieval question acquisition subunit is further operable to:

19. A storage medium, characterized in that the storage medium stores instructions which, when executed by a processor, implement the automatic question answering method according to any one of claims 1 to 9.