CN105095444A

CN105095444A - Information acquisition method and device

Info

Publication number: CN105095444A
Application number: CN201510441024.5A
Authority: CN
Inventors: 霍华荣; 马艳军; 吴华
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2015-07-24
Filing date: 2015-07-24
Publication date: 2015-11-25

Abstract

The application discloses an information acquisition method and device. An embodiment of the method includes: acquiring a plurality of question-answer pairs in a data set, extracting at least one question word and at least one answer word of each question-answer pair; determining a context of the question word and the answer word; taking the question word, the answer word and the context as a training sample, training a preset model, and obtaining a word vector set; receiving question information to be responded; based on the word vector set, acquiring answer information matching the question information from the data set. The information acquisition method and device may train the word vector through evaluating a correlation of the question-answer pairs from a semantic aspect, and a training speed and accuracy of the word vector are improved. Based on that when the word vector acquires the matched information, a complicated supervised neural network training is not needed, the speed and accuracy of information acquisition can be improved.

Description

Information acquisition method and device

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for acquiring information.

Background

At present, the way for people to obtain information on the internet is mainly a search engine, and a user needs to browse a large number of webpages to obtain answers, so that the efficiency is low. The deep question-answering technology enables the search to be more intelligent, provides more accurate answers for users, and reduces the cost for the users to obtain information. With the development of online question and answer websites such as "Baidu know", a large amount of user-generated data in the form of question and answer is generated, and provides data support for deep question and answer.

However, these questions and answers are of variable quality, and mainly include the following two questions: "escape" and "verbose". "escaping" means that the main content of the answer is irrelevant to the question and the answer is not asked. The term "lengthy" means that the content of the answer is too long, and besides the answer sentence, there are a large number of indirect answer sentences such as irrelevant and supplementary explanation.

Disclosure of Invention

In view of the above-mentioned drawbacks and deficiencies of the prior art, it would be desirable to provide a solution that improves the speed and accuracy of information acquisition. In order to achieve one or more of the above objects, the present application provides an information acquisition method and apparatus.

In a first aspect, the present application provides an information obtaining method, including: acquiring a plurality of question-answer pairs in a data set, and extracting at least one question word and at least one answer word of each question-answer pair; determining a context of the question words and the answer words; taking the question words, the answer words and the context as training samples, and training a preset model to obtain a word vector set; receiving question information to be responded; and acquiring answer information matched with the question information from the data set based on the word vector set.

In a second aspect, the present application provides an information acquisition apparatus, the apparatus comprising: the extraction module is used for acquiring a plurality of question-answer pairs in the data set and extracting at least one question word and at least one answer word of each question-answer pair; the determining module is used for determining the contexts of the question words and the answer words; the training module is used for training a preset model by taking the question words, the answer words and the context as training samples to obtain a word vector set; the receiving module is used for receiving the question information to be responded; and the acquisition module is used for acquiring answer information matched with the question information from the data set based on the word vector set.

The information acquisition method and the information acquisition device provided by the application can firstly acquire a plurality of question-answer pairs in a data set, extract at least one question word and at least one answer word of each question-answer pair, then determine the question word and the context of the answer word, then use the question word, the answer word and the context as training samples, train a preset model, obtain a word vector set, finally receive question information to be responded, and based on the word vector set, obtain answer information matched with the question information in the data set. The word vector is trained by evaluating the relevance of the question-answer pairs from the aspect of semantics, and the word vector training speed and accuracy are improved. When the matching information is obtained based on the word vector, complex supervised neural network training is not needed, and the speed and the accuracy of information obtaining are improved.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 illustrates an exemplary system architecture 100 to which embodiments of the present application may be applied;

FIG. 2 illustrates a flow diagram according to one embodiment of an information acquisition method provided herein;

FIG. 3 illustrates a flow diagram according to another embodiment of an information acquisition method provided herein;

FIG. 4 illustrates a flow diagram according to yet another embodiment of an information acquisition method provided herein;

FIG. 5 illustrates a flow diagram for one embodiment of a method for obtaining answer information from a dataset that matches question information, in accordance with the present application;

FIG. 6 is a functional block diagram of an embodiment of an information acquisition apparatus 600 provided in accordance with the present application; and

fig. 7 shows a schematic structural diagram of a computer system 700 suitable for implementing a terminal device or a server according to an embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.

Referring to FIG. 1, an exemplary system architecture 100 to which embodiments of the present application may be applied is shown.

As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, a network 103, and a server 104. The network 103 serves as a medium for providing communication links between the terminal devices 101, 102 and the server 104. Network 103 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.

The user 110 may use the terminal devices 101, 102 to interact with the server 104 over the network 103 to receive or send messages or the like. For example, the user can acquire answer information or the like matching question information to be responded from the server 104 through the network 103 via the terminal apparatuses 101, 102. Various communication client applications, such as an instant messaging tool, a mailbox client, social platform software, and the like, may be installed on the terminal devices 101 and 102.

The terminal devices 101, 102 may be various electronic devices including, but not limited to, personal computers, smart phones, tablets, personal digital assistants, and the like.

The server 104 may be a server that provides various services. The server can store, analyze and the like the received data and feed back the processing result to the terminal equipment.

It should be noted that the information obtaining method provided in the embodiment of the present application may be executed by the terminal devices 101 and 102, or may be executed by the server 104; the information acquisition device may be provided in the terminal apparatuses 101 and 102 or in the server 104. In some embodiments, a preset model may be trained in the server 104, and the resulting set of word vectors may be stored in the terminal device 101, 102 for obtaining answer information matching the question information. For example, if the network 103 is unblocked, the server 104 may obtain answer information matching with the question information to be responded from the data set and return the answer information, and if no network exists or the network 103 is unblocked, the terminal device 101, 102 may directly obtain answer information matching with the question information to be responded from the data set.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

With further reference to fig. 2, a flow 200 of one embodiment of an information acquisition method provided herein is shown.

As shown in fig. 2, in step 201, a plurality of question-answer pairs in a data set are obtained, and at least one question word and at least one answer word of each question-answer pair are extracted.

To obtain information based on a word vector, samples need to be gathered first to train the word vector. In this embodiment, a plurality of question-answer pairs in the data set may be first obtained as samples of the training word vector. The data set may be, for example, a pre-constructed database containing a large number of challenge-response pairs. For example, the challenge-response pairs may be obtained from the network and stored in a data set. The data set may be stored in a server or a terminal.

After a plurality of question-answer pairs in the data set are obtained, at least one question word and at least one answer word of each question-answer pair can be further extracted. Each question-answer pair may be composed of a question sentence and one or more answer sentences. In this embodiment, the question sentences and the answer sentences in the question-answer pairs may be split into question sentences and answer sentences composed of one or more words, respectively. Therefore, the words in each question sentence and each answer sentence can be extracted and used as the question words or the answer words respectively.

In an optional implementation manner of this embodiment, the question-answer pairs may include a voice question-answer pair and a text question-answer pair. When the voice question-answer pairs are obtained, the voice question-answer pairs can be converted into character question-answer pairs through a voice recognition technology. Further, at least one question word and at least one answer word of the converted text question-answer pair are extracted.

In an optional implementation manner of this embodiment, in order to distinguish the question words from the answer words, a first prefix may be added to each question word, and a second prefix may be added to each answer word. For example, a prefix "Q" may be added for each question word and a prefix "a" for each answer word.

In step 202, the context of the question words and answer words is determined.

After at least one question word and at least one answer word of each question-answer pair are extracted in step 201, the context of the question words and the answer words can be further determined. For example, the question words and the answer words in the same question-answer pair may be set to the same context by a predetermined rule, or the contexts of the question words and the answer words in the original question sentence and answer sentence may also be set to the contexts of the question words and the answer words. In one implementation, the length of the context for the question words and answer words may be set. The contexts of the question words and the answer words may be set to the same length (e.g., 7), or the contexts of the question words and the answer words may be set to different lengths.

In step 203, the question words, the answer words, and the context are used as training samples to train a preset model, so as to obtain a word vector set.

After the question words and the answer words are obtained in the data set and the context of the question words and the answer words is determined, the data can be used as sample data to train a preset training model. Since the final purpose of the training is to determine the word vector set, the word vector set can be regarded as an unknown parameter in the preset model, and then the preset model is trained. When the parameters enable the preset model to meet specific training targets, the parameters at the moment can be regarded as a word vector set needing to be determined.

In an optional implementation manner of this embodiment, each word vector may be a low-dimensional real number vector with a dimension not greater than 1000. For example, the specific form of the finally determined word vector may be the following form: low-dimensional real vectors of [0.355, -0.687, -0.168, 0.103, -0.231, ] with dimensions generally not exceeding an integer of 1000. If the dimension is too small, the difference between each word cannot be sufficiently expressed, and if the dimension is too large, the calculation amount is large. Optionally, the dimension of the word vector may be between 50 and 1000, so that both accuracy and computational efficiency may be taken into account.

In step 204, question information to be responded to is received.

After the word vector is obtained in step 203, information may be obtained based on the word vector. First, question information to be responded to may be received. Specifically, the user may input a question that the user wants to query in a search box of the browser, for example, the question may be a keyword, or a complete sentence. The content input by the user can be used as the question information to be responded.

In step 205, answer information matching the question information is obtained from the data set based on the word vector set.

In this embodiment, after the search system receives the question information to be responded, which is input by the user, the search system may first search the question information in the data set to obtain a plurality of answer information corresponding to the question information. And then retrieving one or more answer information matched with the question information from the plurality of answer information. For example, a degree of matching of the question information with each answer information may be calculated based on the word vector set, and answer information whose degree of matching satisfies a preset condition (e.g., greater than 80%) may be taken as answer information that matches the question information.

In an optional implementation manner of the embodiment, the obtained one or more answer information matched with the question information may be presented in the terminal for the user to view.

The information obtaining method provided in this embodiment may first obtain a plurality of question-answer pairs in a data set, extract at least one question word and at least one answer word of each question-answer pair, then determine a context of the question word and the answer word, then train a preset model using the question word, the answer word, and the context as training samples, obtain a word vector set, finally receive question information to be responded, and obtain answer information matched with the question information from the data set based on the word vector set, train a word vector by evaluating a correlation of the question-answer pair in a semantic aspect, thereby improving a word vector training speed and accuracy. When the matching information is obtained based on the word vector, complex supervised neural network training is not needed, and the speed and the accuracy of information obtaining are improved.

With further reference to fig. 3, a flow 300 is shown in accordance with another embodiment of the information acquisition method provided herein.

As shown in fig. 3, in step 301, a plurality of question-answer pairs in a data set are obtained, and at least one question word and at least one answer word of each question-answer pair are extracted.

In this embodiment, step 301 in the implementation process 300 is substantially the same as step 201 in the implementation process 200, and is not described herein again.

In step 302, the context of each question word is determined.

In this embodiment, the context of each question word may be determined first. For example, the original context of the question sentence in which each question word is located may be determined as the context of each question word. Such as for the following question-answer pairs: "do a associative notebook work well? "and" the associated computer person feels good ", the context of the question word" notebook "can be determined as" do it well with the association ".

In step 303, the context of any question word is determined as the context of all answer words of the question-answer pair in which the question word is located.

In this embodiment, the context of any question word may be determined as the contexts of all answer words of the question-answer pair in which the question word is located. In a question-answer pair, one question sentence may correspond to one or more answer sentences, i.e., each question word may correspond to one or more answer words. In this embodiment, the same context may be set for any question word in a question-answer pair and all the answer words corresponding to the question word. Specifically, the context of any question word may be determined as the contexts of all answer words of the question-answer pair in which the question word is located. For example, for the following question-answer pairs: "do a associative notebook work well? If the context of the question word "notebook" is "good for association", the context of each answer word "association", "computer", "personal", "feeling", and "good for association" can be set to be the same as the context of the question word "notebook".

In step 304, the question words, the answer words, and the context are used as training samples to train a preset model, so as to obtain a word vector set.

In this embodiment, the preset model may be a function of:

<math> <mrow> <mi>p</mi> <mo>=</mo> <munder> <mo>Σ</mo> <mrow> <mo><</mo> <mi>q</mi> <mo>,</mo> <mi>a</mi> <mo>></mo> <mo>&Element;</mo> <mi>D</mi> </mrow> </munder> <munderover> <mo>Σ</mo> <mrow> <mi>i</mi> <mo>=</mo> <mn>1</mn> </mrow> <mrow> <mo>|</mo> <mi>q</mi> <mo>|</mo> </mrow> </munderover> <mrow> <mo>(</mo> <mi>log</mi> <mi>p</mi> <mo>(</mo> <mrow> <msub> <mi>q</mi> <mi>i</mi> </msub> <mo>|</mo> <msub> <mi>C</mi> <mrow> <mi>q</mi> <mi>i</mi> </mrow> </msub> </mrow> <mo>)</mo> <mo>+</mo> <munderover> <mo>Σ</mo> <mrow> <mi>j</mi> <mo>=</mo> <mn>1</mn> </mrow> <mrow> <mo>|</mo> <mi>a</mi> <mo>|</mo> </mrow> </munderover> <mo>(</mo> <mi>log</mi> <mi>p</mi> <mo>(</mo> <msub> <mi>a</mi> <mi>j</mi> </msub> <mo>|</mo> <msub> <mi>C</mi> <mrow> <mi>q</mi> <mi>i</mi> </mrow> </msub> <mo>)</mo> <mo>)</mo> </mrow> </mrow> </math>

wherein,<q,a>for a question-answer pair in the data set D, | q | is the number of question words in the question-answer pair, q | is the number of question words in the question-answer pair_iVector for the ith question word in the question-answer pair, C_qiIs the context vector of the ith question word in the question-answer pair, | a | is the number of answer words in the question-answer pair, a_jThe vector of the j-th answer word in the question-answer pair; p (q)_i|C_qi) And p (a)_j|C_qi) Determined by the following equation:

w is the vector of any word, C_wIs a context vector of the word, w_uA vector for the u-th word in said data set D, C_wuAnd V is the context vector of the u-th word, and is the number of words contained in the data set D.

As can be seen from the above formula, in this embodiment, a question-answer pair is a vector q of question words_iWhen the answer words are determined, the context vectors of all the answer words are the same and are the context vector C of the answer word_qi。

In an alternative implementation manner of this embodiment, the word vector set may be determined by taking the function maximization as a training target.

In step 305, question information to be responded to is received.

In step 306, answer information matching the question information is obtained from the data set based on the word vector set.

In the present embodiment, the steps 305 and 306 in the implementation process 300 are respectively substantially the same as the steps 204 and 205 in the implementation process 200, and are not described herein again.

In this embodiment, after determining the context of the question word, the context of any question word may be determined as the contexts of all answer words of the question-answer pair in which the question word is located, and further, the question word, the answer words, and the contexts are used as training samples to train a preset model to obtain a word vector set, so that the accuracy of the word vector may be improved.

With further reference to fig. 4, a flow 400 is shown in accordance with another embodiment of the information acquisition method provided herein.

As shown in fig. 4, in step 401, a plurality of question-answer pairs in a data set are obtained, and at least one question word and at least one answer word of each question-answer pair are extracted.

In this embodiment, step 401 in the implementation process 400 is substantially the same as step 201 in the implementation process 200, and is not described herein again.

In step 402, the context of each answer word is determined.

In this embodiment, the context of each answer word may be determined first. For example, the original context of the answer sentence in which each answer word is located may be determined as the context of each answer word. Such as for the following question-answer pairs: "do a associative notebook work well? "and" the associated computer person feels good ", it can be determined that the context of the answer word" computer "is" the associated person feels good ". Alternatively or preferably, the context of the above answer word "computer" may be "associative personal feeling" when the length (e.g., 7) of the context of the answer word is set.

In step 403, the context of any answer word is determined as the context of all the answer words of the question-answer pair in which the answer word is located.

In this embodiment, the context of any answer word may be determined as the contexts of all the answer words of the question-answer pair in which the answer word is located. In a question-answer pair, an answer sentence may correspond to one or more question words. In this embodiment, the same context may be set for any answer word in one question-answer pair and all the question words corresponding to the answer word. Specifically, the context of any answer word may be determined as the contexts of all the answer words of the answer pair in which the answer word is located. For example, for the following question-answer pairs: "do a associative notebook work well? When the 'associated computer feels good', and the context of the answer word 'computer' may be 'associated personal feels good', the contexts of the question words 'association', 'notebook', and 'good do' may all be set to be the same as the context of the answer word 'computer', and 'associated personal feels good'. Alternatively or preferably, the context of the above answer word "computer" may be "associative personal feeling" when the length (e.g., 7) of the context of the answer word is set. At this time, the contexts of the respective question words "associate", "notebook", and "do it with" can be set to the same context "associated personal feeling" as the answer word "computer".

In step 404, the question words, the answer words, and the context are used as training samples to train a preset model, so as to obtain a word vector set.

In this embodiment, the preset model may be a function of:

<math> <mrow> <mi>p</mi> <mo>=</mo> <munder> <mo>Σ</mo> <mrow> <mo><</mo> <mi>q</mi> <mo>,</mo> <mi>a</mi> <mo>></mo> <mo>&Element;</mo> <mi>D</mi> </mrow> </munder> <munderover> <mo>Σ</mo> <mrow> <mi>j</mi> <mo>=</mo> <mn>1</mn> </mrow> <mrow> <mo>|</mo> <mi>q</mi> <mo>|</mo> </mrow> </munderover> <mrow> <mo>(</mo> <mi>log</mi> <mi>p</mi> <mo>(</mo> <mrow> <msub> <mi>a</mi> <mi>j</mi> </msub> <mo>|</mo> <msub> <mi>C</mi> <mrow> <mi>a</mi> <mi>j</mi> </mrow> </msub> </mrow> <mo>)</mo> <mo>+</mo> <munderover> <mo>Σ</mo> <mrow> <mi>i</mi> <mo>=</mo> <mn>1</mn> </mrow> <mrow> <mo>|</mo> <mi>a</mi> <mo>|</mo> </mrow> </munderover> <mo>(</mo> <mi>log</mi> <mi>p</mi> <mo>(</mo> <msub> <mi>q</mi> <mi>i</mi> </msub> <mo>|</mo> <msub> <mi>C</mi> <mrow> <mi>a</mi> <mi>j</mi> </mrow> </msub> <mo>)</mo> <mo>)</mo> </mrow> </mrow> </math>

wherein,<q,a>for a question-answer pair in the data set D, | q | is the number of question words in the question-answer pair, q | is the number of question words in the question-answer pair_iVector for the ith question word in the question-answer pair, C_ajIs the context of the jth answer word in the question-answer pair, | a | is the number of answer words in the question-answer pair, a_jThe vector of the j-th answer word in the question-answer pair; p (a)_j|C_aj) And p (q)_i|C_aj) Determined by the following equation:

w is the vector of any word, C_wIs a context vector of the word, w_uAs a vector of the u-th word in the data set D, C_wuAnd V is the context vector of the u-th word, and is the number of words contained in the data set D.

As can be seen from the above formula, in the present embodiment, the vector a of the answer word in a question-answer pair_jWhen the answer words are determined, the context vectors of all the answer words are the same and are the context vector C of the answer word_aj。

In step 405, issue information to be responded to is received.

In step 406, answer information matching the question information is obtained from the dataset based on the set of word vectors.

In the present embodiment, the steps 405 and 406 in the implementation process 400 are substantially the same as the steps 204 and 205 in the implementation process 200, and are not described herein again.

In this embodiment, after determining the context of the answer word, the context of any answer word may be determined as the contexts of all question words of the question-answer pair in which the answer word is located, and further, the question word, the answer word, and the contexts are used as training samples to train a preset model to obtain a word vector set, so as to improve the accuracy of the word vector.

With further reference to fig. 5, a flow 500 of one embodiment of a method of obtaining answer information from a dataset that matches question information provided in accordance with the present application is illustrated.

As shown in fig. 5, in step 501, a question sentence vector of the question information and an answer sentence vector of each answer information in the data set are constructed from the word vector set.

In this embodiment, answer information matched with the question information to be responded may be acquired from the dataset based on the word vector obtained by training. Specifically, a question sentence vector of the question information and an answer sentence vector of each answer information in the data set may be first constructed from the word vector set.

In one implementation, the question sentence vector of the question information and the answer sentence vector of each answer information in the data set may be constructed according to the following formulas:

where s is the vector of any sentence, m is the length of the sentence, and w_iA word vector for the ith word in the sentence, c_iThe number of times the ith word in the sentence appears in the dataset.

First, the question information and the answer information in the data set may be split into a plurality of words. When the question information or the answer information is split, if the question information or the answer information is a sentence consisting of a plurality of words, the question information or the answer information can be split into the plurality of words according to a general grammar rule; if the question information or the answer information is a word, the word may be regarded as a split word. In this way, each question information or answer information may be split into at least one word. Each word may then be represented by a trained vector. Then, the question information and each answer information in the data set to be responded may be constructed as a question sentence vector and an answer sentence vector containing one or more word vectors using the above formula.

In step 502, the relevance of the question sentence vector to each answer sentence vector is calculated.

After the question sentence vector and the answer sentence vector of each answer information in the data set are obtained in step 501, the correlation between the question sentence vector and the answer sentence vector of each answer information in the data set may be calculated. The relevance can represent the degree of relevance between two sentence vectors, and the larger the relevance value is, the more similar the two sentence vectors are, and the value range can be [ -1,1 ]. When the correlation is 1, the two sentence vectors can be considered to be identical. And when the similarity is-1, the two sentence vectors can be considered to be completely different.

In one implementation, the relevance of the question sentence vector to the answer sentence vector may be calculated according to the following formula:

wherein s is_aAnd s_bVectors, Score(s), for question sentence a and answer sentence b, respectively_a,s_b) Is a question sentence vector s_aAnd answer sentence vector s_bλ is a constant between 0.18 and 0.24, C(s)_a,s_b) Is the number of word co-occurrences between sentence a and sentence b.

In step 503, answer information matching the question information is determined based on the correlation.

After the correlation between the question sentence vector and each answer sentence vector is obtained through calculation in step 502, one or more pieces of answer information matching the question information may be determined according to a specific value of the correlation. In one possible implementation, answer information corresponding to an answer sentence vector having the highest correlation with the question sentence vector may be determined as answer information matching the question information. In another implementation, a relevance threshold may be preset, and answer information corresponding to an answer sentence vector whose relevance to the question sentence vector is greater than or equal to the above threshold may be determined as answer information matching the question information.

In the information obtaining method in this embodiment, a question sentence vector of the question information and an answer sentence vector of each answer information in the data set may be first constructed, then, a correlation between the question sentence vector and each answer sentence vector may be calculated, and finally, the answer information matched with the question information may be determined based on the correlation. When the matching information is obtained based on the word vector, complex supervised neural network training is not needed, and the speed and the accuracy of information obtaining are improved.

It should be noted that while the operations of the method of the present invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the steps depicted in the flowcharts may change the order of execution. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.

With further reference to fig. 6, a functional module architecture diagram of one embodiment of an information acquisition apparatus 600 provided herein is shown.

As shown in fig. 6, the information acquisition apparatus 600 provided in the present embodiment includes: an extraction module 610, a determination module 620, a training module 630, a receiving module 640, and an acquisition module 650. The extracting module 610 is configured to obtain a plurality of question-answer pairs in the data set, and extract at least one question word and at least one answer word of each question-answer pair; the determining module 620 is used for determining the context of the question words and the answer words; the training module 630 is configured to train a preset model by using the question words, the answer words, and the context as training samples, so as to obtain a word vector set; the receiving module 640 is used for receiving the question information to be responded; the obtaining module 650 is configured to obtain answer information matching the question information from the data set based on the word vector set.

In an optional implementation manner of this embodiment, the determining module 620 is configured to determine the contexts of the question words and the answer words according to the following steps: determining a context for each question word; and determining the context of any question word as the contexts of all answer words of the question-answer pair in which the question word is positioned.

In another optional implementation manner of this embodiment, the preset model is a function of:

In another alternative implementation manner of this embodiment, the determining module 620 is configured to determine the contexts of the question words and the answer words according to the following steps: determining a context of each answer word; and determining the context of any answer word as the contexts of all the question words of the question-answer pair in which the answer word is positioned.

In another optional implementation manner of this embodiment, the training module 630 is configured to train the preset model according to the following steps to obtain a word vector set: and determining a word vector set by taking the preset model maximization as a training target.

In another optional implementation manner of this embodiment, the obtaining module 650 includes: the construction submodule is used for constructing question sentence vectors of the question information and answer sentence vectors of the answer information in the data set according to the word vector set; the calculation submodule is used for calculating the correlation between the question sentence vector and each answer sentence vector; and the determining submodule is used for determining answer information matched with the question information based on the correlation.

In another optional implementation manner of this embodiment, the constructing sub-module is configured to construct the question sentence vector of the question information and the answer sentence vector of each answer information in the data set according to the following formula:

In another optional implementation of this embodiment, the calculation sub-module is configured to calculate the relevance of the question sentence vector and the answer sentence vector according to the following formula:

In another optional implementation manner of this embodiment, the question-answer pairs include a voice question-answer pair and a text question-answer pair; the device still includes: and the conversion module is used for converting the voice question-answer pairs into character question-answer pairs.

It should be understood that the units or modules recited in the information acquisition apparatus shown in fig. 6 correspond to the respective steps in the method described with reference to fig. 2 to 5. Thus, the operations and features described above for the method are also applicable to the apparatus shown in fig. 6 and the modules included therein, and are not described again here.

The information obtaining apparatus provided in this embodiment may first obtain, by the extraction module, a plurality of question-answer pairs in a data set, extract at least one question word and at least one answer word of each question-answer pair, then determine, by the determination module, a context of the question word and the answer word, then, the training module trains a preset model by using the question word, the answer word, and the context as a training sample, to obtain a word vector set, and finally, the receiving module receives question information to be responded, and the obtaining module obtains, based on the word vector set, answer information matched with the question information from the data set, and trains a word vector by evaluating, in a semantic aspect, a correlation of the question-answer pairs, thereby improving a word vector training speed and accuracy. When the matching information is obtained based on the word vector, complex supervised neural network training is not needed, and the speed and the accuracy of information obtaining are improved.

Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use in implementing a terminal device or server of an embodiment of the present application.

As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM703, various programs and data necessary for the operation of the system 700 are also stored. The CPU701, the ROM702, and the RAM703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The unit modules described in the embodiments of the present application may be implemented by software or hardware. The described unit modules may also be provided in a processor, and may be described as: a processor includes an extraction module, a determination module, a training module, a reception module, and an acquisition module. The names of the unit modules do not form a limitation on the unit modules themselves in some cases, for example, the extraction module may also be described as "a module for acquiring a plurality of question-answer pairs in a data set, extracting at least one question word and at least one answer word of each question-answer pair".

As another aspect, the present application also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the apparatus in the above-described embodiments; or it may be a separate computer-readable storage medium not incorporated in the terminal. The computer-readable storage medium stores one or more programs, which are used by one or more processors to execute the information acquisition methods described in the present application.

The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by a person skilled in the art that the scope of the invention as referred to in the present application is not limited to the embodiments with a specific combination of the above-mentioned features, but also covers other embodiments with any combination of the above-mentioned features or their equivalents without departing from the inventive concept. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims

1. An information acquisition method, characterized in that the method comprises:

acquiring a plurality of question-answer pairs in a data set, and extracting at least one question word and at least one answer word of each question-answer pair;

determining a context of the question words and the answer words;

taking the question words, the answer words and the context as training samples, and training a preset model to obtain a word vector set;

receiving question information to be responded;

and acquiring answer information matched with the question information from the data set based on the word vector set.

2. The method of claim 1, wherein the determining the context of the question words and the answer words comprises:

determining a context for each question word;

and determining the context of any question word as the contexts of all answer words of the question-answer pair in which the question word is positioned.

3. The method of claim 1, wherein the determining the context of the question words and the answer words comprises:

determining a context of each answer word;

and determining the context of any answer word as the contexts of all the question words of the question-answer pair in which the answer word is positioned.

4. The method of claim 1, wherein the training the predetermined model to obtain a set of word vectors comprises:

and determining the word vector set by taking the maximization of the preset model as a training target.

5. The method of claim 1, wherein the obtaining answer information from the dataset that matches the question information based on the set of word vectors comprises:

according to the word vector set, constructing question sentence vectors of the question information and answer sentence vectors of each answer information in the data set;

calculating the relevance of the question sentence vector and each answer sentence vector;

based on the correlation, answer information that matches the question information is determined.

6. The method of claim 5, wherein constructing the question sentence vector of the question information and the answer sentence vector of each answer information in the data set comprises:

and constructing a sentence vector according to the word vector of each word in the sentence and the occurrence frequency of each word in the data set.

7. The method of claim 5, wherein the calculating the relevance of the question sentence vector to the answer sentence vector comprises:

and determining the relevance according to the question sentence vector, the answer sentence vector and the word co-occurrence times between the question sentence and the answer sentence.

8. The method according to any one of claims 1-7, wherein said question-answer pairs comprise a voice question-answer pair and a text question-answer pair;

the method further comprises the following steps:

and converting the voice question-answer pair into a text question-answer pair.

9. The method of claim 8, further comprising:

and adding a first prefix for each question word and adding a second prefix for each answer word.

10. The method of claim 9, further comprising:

presenting one or more answer information that match the question information.

11. The method of claim 10, wherein each word vector is a low-dimensional real vector having a dimension of no more than 1000.

12. An information acquisition apparatus, characterized in that the apparatus comprises:

the extraction module is used for acquiring a plurality of question-answer pairs in the data set and extracting at least one question word and at least one answer word of each question-answer pair;

the determining module is used for determining the contexts of the question words and the answer words;

the training module is used for training a preset model by taking the question words, the answer words and the context as training samples to obtain a word vector set;

the receiving module is used for receiving the question information to be responded;

and the acquisition module is used for acquiring answer information matched with the question information from the data set based on the word vector set.

13. The apparatus of claim 12, wherein the determining module is configured to determine the context of the question word and the answer word by:

determining a context for each question word;

14. The apparatus of claim 12, wherein the determining module is configured to determine the context of the question word and the answer word by:

determining a context of each answer word;

15. The apparatus of claim 12, wherein the training module is configured to train the preset model to obtain a word vector set according to the following steps:

16. The apparatus of claim 12, wherein the obtaining module comprises:

the construction submodule is used for constructing question sentence vectors of the question information and answer sentence vectors of each answer information in the data set according to the word vector set;

a calculation submodule for calculating the correlation between the question sentence vector and each answer sentence vector;

and the determining submodule is used for determining answer information matched with the question information based on the correlation.

17. The apparatus according to any one of claims 12-16, wherein said challenge-response pairs comprise a voice challenge-response pair and a text challenge-response pair;

the device further comprises:

and the conversion module is used for converting the voice question-answer pair into a character question-answer pair.