Disclosure of Invention
The invention provides a semantic retrieval method, a semantic retrieval apparatus and an electronic device, which can effectively solve the problems that existing retrieval methods cannot understand the query intention and that the retrieval results fail to meet users' requirements.
A semantic retrieval method comprising:
receiving query information sent by a user;
correcting the text in the query information to obtain a corrected text;
performing user intention analysis on the corrected text based on a question template library, and determining a first score of the identified user intention, wherein the user intention comprises simple fact question answering and common question answering;
for simple fact question answering, retrieving based on a pre-constructed knowledge graph to obtain a first candidate answer set, and determining a second score of each candidate answer in the first candidate answer set according to relevance;
for common question answering, retrieving based on pre-constructed vectorized FAQ question-answer pairs to obtain a second candidate answer set, and determining a third score of each candidate answer in the second candidate answer set according to relevance;
and ranking the candidate answers according to the first score, the second score and the third score to obtain an answer.
Further, correcting the text in the query information to obtain a corrected text, including:
adopting a Chinese word segmenter to segment the text into words, and performing error detection at both character granularity and word granularity to generate a candidate set of suspected error positions;
traversing all suspected error positions, looking up similar-pronunciation and similar-form words from a pre-stored dictionary to replace the words at the suspected error positions, and calculating sentence perplexity through a language model;
sorting the replacement results according to the sentence perplexity to obtain an optimal corrected word;
and generating the corrected text according to the optimal corrected word.
Further, for simple fact question answering, retrieving based on a pre-constructed knowledge graph to obtain a first candidate answer set comprises:
extracting entity information, relationship information and attribute information from the corrected text, using a synonym dictionary to link the entity information, relationship information or attribute information to the corresponding entity, relationship or attribute in the knowledge graph, and generating an SQL query statement;
and filling the extracted words into the corresponding word slots of the SQL query statement, and executing the query to obtain the first candidate answer set.
Further, for common question answering, retrieving based on pre-constructed vectorized FAQ question-answer pairs to obtain a second candidate answer set includes:
and performing text vectorization on the corrected text, searching for similar vectors among the vectorized FAQ question-answer pairs, obtaining corresponding answers, and generating the second candidate answer set.
Further, searching for similar vectors among the vectorized FAQ question-answer pairs comprises:
calculating the similarity between the vectorized corrected text and the questions in the vectorized FAQ question-answer pairs, and returning the answer corresponding to the question with the highest similarity; and/or
calculating the similarity between the vectorized corrected text and the answers in the vectorized FAQ question-answer pairs, and returning the answer with the highest similarity.
Further, ranking the candidate answers according to the first score, the second score and the third score to obtain an answer includes:
performing weighted summation of the first score and the second score for simple fact question answering to obtain a fourth score of each candidate answer in the first candidate answer set;
performing weighted summation of the first score and the third score for common question answering to obtain a fifth score of each candidate answer in the second candidate answer set;
sorting all the candidate answers according to the fourth score and the fifth score, and selecting the top-ranked answer;
and generating an answer fed back to the user according to the selected answer and an answer template.
Further, the question template library is pre-constructed in the following way:
collecting historical user query information, and constructing the question template library according to the user query information;
the vectorized FAQ question-answer pairs are pre-constructed in the following way:
collecting users' frequently asked questions, preparing standard answers, and vectorizing the frequently asked questions and the standard answers to obtain the vectorized FAQ question-answer pairs.
A semantic retrieval apparatus comprising:
the receiving module is used for receiving query information sent by a user;
the error correction module is used for correcting the text in the query information to obtain a corrected text;
an intent determination module, configured to perform user intention analysis on the corrected text based on a question template library and determine a first score of the identified user intention, the user intention including simple fact question answering and common question answering;
the first retrieval module is used for, for simple fact question answering, retrieving based on a pre-constructed knowledge graph to obtain a first candidate answer set, and determining a second score of each candidate answer in the first candidate answer set according to relevance;
the second retrieval module is used for, for common question answering, retrieving based on pre-constructed vectorized FAQ question-answer pairs to obtain a second candidate answer set, and determining a third score of each candidate answer in the second candidate answer set according to relevance;
and the answer generation module is used for ranking the candidate answers according to the first score, the second score and the third score to obtain an answer.
An electronic device comprises a processor and a memory, wherein the memory stores a plurality of instructions, and the processor is configured to read the instructions and execute the semantic retrieval method described above.
A computer-readable storage medium has stored thereon a plurality of instructions that can be read by a processor to perform the semantic retrieval method described above.
The semantic retrieval method, the semantic retrieval apparatus and the electronic device have at least the following beneficial effects:
(1) compared with keyword-based retrieval, natural language understanding at the semantic level can better match the user's real intention, improve retrieval efficiency and accuracy, and better satisfy the user's query requirements;
(2) based on the synonym dictionary, the identified entities, attributes and relationships can be given a normalized description, and entities that are described in a non-standard or inaccurate way in the user's query sentence are normalized as well; this avoids the problem that an entity cannot be correctly linked to its node in the knowledge graph because its description is not standardized, and improves the robustness of the knowledge-graph-based retrieval system;
(3) for non-simple factual queries such as FAQs, the answer that best matches the user's intention can be retrieved through a semantic-level vectorized retrieval service.
Detailed Description
In order to better understand the technical solution, it will be described in detail below with reference to the drawings and specific embodiments.
Referring to fig. 1, in some embodiments, there is provided a semantic retrieval method comprising:
step S101, receiving query information sent by a user;
step S102, correcting the text in the query information to obtain a corrected text;
step S103, performing user intention analysis on the corrected text based on a question template library, and determining a first score of the identified user intention, wherein the user intention comprises simple fact question answering and common question answering;
step S104, for simple fact question answering, retrieving based on a pre-constructed knowledge graph to obtain a first candidate answer set, and determining a second score of each candidate answer in the first candidate answer set according to relevance;
step S105, for common question answering, retrieving based on pre-constructed vectorized FAQ question-answer pairs to obtain a second candidate answer set, and determining a third score of each candidate answer in the second candidate answer set according to relevance;
and step S106, ranking the candidate answers according to the first score, the second score and the third score to obtain an answer.
Compared with keyword-based retrieval, the semantic retrieval method provided by this embodiment can better match the user's real intention, improves retrieval efficiency and accuracy, and better satisfies the user's query requirements.
Specifically, before the above method is performed, a question template library, a knowledge graph, and vectorized FAQ (Frequently Asked Questions) question-answer pairs are constructed in advance.
The knowledge graph adopts the entity-relationship-entity triple form, which can organize a large amount of discrete information in a structured way. For example, in the triple "high-tech enterprise certification - handling hours - weekdays 09:00-12:00 a.m., 13:30-17:00 p.m.", the head entity is "high-tech enterprise certification", the tail entity is "weekdays 09:00-12:00 a.m., 13:30-17:00 p.m.", and the relationship between the two entities is "handling hours".
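As an illustrative sketch only (the patent does not prescribe a storage engine), such triples can be kept in a relational table so that the SQL query statements described later can run against it; the table and column names below are assumptions for illustration.

```python
import sqlite3

# Minimal sketch: store entity-relation-entity triples in a relational table
# so that later SQL queries can run against it. Table and column names are
# illustrative assumptions, not prescribed by the method.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (head TEXT, relation TEXT, tail TEXT)")
conn.execute(
    "INSERT INTO triples VALUES (?, ?, ?)",
    ("high-tech enterprise certification",
     "handling hours",
     "weekdays 09:00-12:00 a.m., 13:30-17:00 p.m."),
)

# Look up the tail entity given a head entity and a relation.
row = conn.execute(
    "SELECT tail FROM triples WHERE head = ? AND relation = ?",
    ("high-tech enterprise certification", "handling hours"),
).fetchone()
print(row[0])  # -> weekdays 09:00-12:00 a.m., 13:30-17:00 p.m.
```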
FAQ question-answer pairs generally cover the most common questions and answers in service handling. Users' frequently asked questions can be collected and annotated manually, corresponding standard answers can be prepared, and the questions and their answers are then vectorized with a unified semantic vectorization model to obtain the vectorized FAQ question-answer pairs. Common vectorization schemes include BM25, TF-IDF and the like, as well as deep-learning semantic models such as BERT. After vectorization, a vector search tool such as Faiss or Annoy can be used to perform fast matching and retrieval.
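A minimal sketch of building and querying vectorized FAQ question-answer pairs, assuming a TF-IDF vectorizer in place of a production BERT encoder and brute-force cosine similarity in place of a Faiss or Annoy index; the sample questions and answers are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative FAQ pairs (invented sample data).
faq = [
    ("How do I apply for high-tech enterprise certification?",
     "Submit the application through the online service hall."),
    ("What materials are needed for certification?",
     "A business license and audited financial statements are required."),
]
questions = [q for q, _ in faq]

# In production a unified semantic encoder (e.g. BERT) and a vector index
# would be used; TF-IDF keeps this sketch self-contained.
vectorizer = TfidfVectorizer().fit(questions)
question_vecs = vectorizer.transform(questions)

def retrieve(query: str, top_k: int = 1):
    """Return the top_k (answer, similarity) pairs for a user query."""
    query_vec = vectorizer.transform([query])
    sims = cosine_similarity(query_vec, question_vecs)[0]
    ranked = sims.argsort()[::-1][:top_k]
    return [(faq[i][1], float(sims[i])) for i in ranked]

print(retrieve("how to apply for certification"))
```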
During cold start, the user's habitual query phrasings, such as "what is the phone number of xxx" or "where is the address of xxx", may be collected through various channels, such as a manual service window or e-mail. The question template library is constructed from the collected historical user query information.
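A small sketch of how a question template library harvested from such historical queries might be represented and matched; the regular-expression templates, intent labels and slot names are illustrative assumptions, not the patent's prescribed format.

```python
import re

# Illustrative question templates harvested from historical queries.
# Each template maps a phrasing pattern to an intent and a word slot.
TEMPLATES = [
    (re.compile(r"what is the phone number of (?P<entity>.+)"), "simple_fact"),
    (re.compile(r"where is the address of (?P<entity>.+)"), "simple_fact"),
    (re.compile(r"how (do|to) .+"), "common_question"),
]

def match_template(query: str):
    """Return (intent, slots) for the first matching template, else None."""
    for pattern, intent in TEMPLATES:
        m = pattern.match(query.lower().strip("?"))
        if m:
            return intent, m.groupdict()
    return None

print(match_template("What is the phone number of the service hall?"))
```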
In some embodiments, referring to fig. 2, in step S102, performing error correction on the text in the query information to obtain a corrected text, including:
step S1021, a Chinese word segmenter is adopted to segment the text into words, error detection is performed at character granularity and word granularity, and a candidate set of suspected error positions is generated;
step S1022, all suspected error positions are traversed, similar-pronunciation and similar-form words are looked up from a pre-stored dictionary to replace the words at the suspected error positions, and sentence perplexity is calculated through a language model;
step S1023, the replacement results are sorted according to the sentence perplexity to obtain an optimal corrected word;
and step S1024, generating the corrected text according to the optimal corrected words.
Chinese text error correction is required because the user's input may contain wrongly written characters, colloquial descriptions and non-standard terms (for example, "high-tech enterprise" may be shortened to various abbreviated forms). Error correction has two main steps: error detection and error correction. In the error detection step, the text is first segmented by a Chinese word segmenter; because wrongly written characters in a sentence often cause incorrect segmentation, errors are detected at both character granularity and word granularity, and the suspected errors found at the two granularities are merged into a candidate set of suspected error positions. In the error correction step, all suspected error positions are traversed, the words at the error positions are replaced with similar-pronunciation and similar-form words, the sentence perplexity is calculated through a language model, and the results of all candidates are compared and sorted to obtain the optimal corrected words. This text error correction approach has the advantages of being controllable, flexible and fast, and of occupying few resources.
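The sketch below illustrates the detect-replace-rescore loop under simplifying assumptions: a toy confusion dictionary of similar words stands in for the similar-pronunciation/similar-form dictionary, and an add-one-smoothed bigram model trained on a tiny in-code corpus stands in for a real language model and segmenter.

```python
import math
from collections import Counter

# Tiny corpus standing in for a real language model's training data.
CORPUS = ["apply for enterprise certification",
          "enterprise certification handling hours"]

def bigram_counts(sentences):
    uni, bi = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

UNI, BI = bigram_counts(CORPUS)
VOCAB = len(UNI)

def perplexity(sentence: str) -> float:
    """Add-one-smoothed bigram perplexity; lower means more fluent."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    logp = 0.0
    for a, b in zip(toks, toks[1:]):
        p = (BI[(a, b)] + 1) / (UNI[a] + VOCAB)
        logp += math.log(p)
    return math.exp(-logp / (len(toks) - 1))

# Toy confusion dictionary of similar-pronunciation / similar-form words.
CONFUSION = {"certifcation": ["certification"], "enterprize": ["enterprise"]}

def correct(sentence: str) -> str:
    """Try replacements at suspected positions and keep the lowest-perplexity variant."""
    best, best_ppl = sentence, perplexity(sentence)
    words = sentence.split()
    for i, w in enumerate(words):
        for cand in CONFUSION.get(w, []):
            variant = " ".join(words[:i] + [cand] + words[i + 1:])
            ppl = perplexity(variant)
            if ppl < best_ppl:
                best, best_ppl = variant, ppl
    return best

print(correct("apply for enterprize certification"))
```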
In some embodiments, in step S103, whether the user intention is simple fact question answering or common question answering is identified through short text classification; to improve the robustness of the system, each identified intention is scored rather than chosen exclusively, and this score is the first score. A higher first score indicates a greater likelihood that the intention matches the user's real query intention.
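A minimal illustration of scoring intentions rather than making a hard decision, assuming a TF-IDF plus logistic-regression short-text classifier whose class probabilities serve as the first score; the training examples and labels are invented, and a production system would rely on the question template library and a stronger model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training snippets labelled with the two intent classes.
texts = [
    "what is the phone number of the service hall",
    "where is the address of the tax office",
    "how do I apply for certification",
    "what materials are needed for the application",
]
labels = ["simple_fact", "simple_fact", "common_question", "common_question"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

query = "where is the office of the administration"
# Each intention receives a first score instead of a hard decision.
probs = dict(zip(clf.classes_, clf.predict_proba([query])[0]))
print(probs)
```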
In some embodiments, referring to fig. 3, in step S104, for simple fact question answering, retrieving based on a pre-constructed knowledge graph to obtain a first candidate answer set includes:
step S1041, extracting entity information, relationship information and attribute information from the corrected text, using a synonym dictionary to link them to the corresponding entity, relationship or attribute in the knowledge graph, and generating an SQL query statement;
step S1042, filling the extracted words into the corresponding word slots of the SQL query statement, and executing the query to obtain the first candidate answer set.
Specifically, the entity linking step includes two parts: recognition and disambiguation. The recognition part mainly uses the entity recognition of lexical analysis to obtain the entities and relationship attributes in the user query; for some specialized domains, a domain dictionary is also added to the lexical analysis. The disambiguation part mainly searches the knowledge graph for the recognized entities, including their aliases, abbreviations and the like, to form a candidate entity set, and then uses a Learning to Rank method to select the appropriate entity from the candidate set.
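A simplified sketch of linking and slot filling, assuming a small synonym dictionary, a triples table like the one sketched earlier, and a parameterized SQL template; dictionary lookup stands in for a full recognition and Learning-to-Rank disambiguation pipeline, and all names and data are illustrative.

```python
import sqlite3

# Illustrative synonym dictionary mapping surface forms to canonical names.
SYNONYMS = {
    "high-tech certification": "high-tech enterprise certification",
    "office hours": "handling hours",
}

def normalize(mention: str) -> str:
    """Link a recognized mention to its canonical knowledge-graph name."""
    return SYNONYMS.get(mention, mention)

# SQL template whose word slots are filled by the linked entity and relation.
SQL_TEMPLATE = "SELECT tail FROM triples WHERE head = ? AND relation = ?"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (head TEXT, relation TEXT, tail TEXT)")
conn.execute("INSERT INTO triples VALUES (?, ?, ?)", (
    "high-tech enterprise certification", "handling hours",
    "weekdays 09:00-12:00 a.m., 13:30-17:00 p.m."))

def query_kg(entity_mention: str, relation_mention: str):
    head = normalize(entity_mention)
    relation = normalize(relation_mention)
    rows = conn.execute(SQL_TEMPLATE, (head, relation)).fetchall()
    return [r[0] for r in rows]  # first candidate answer set

print(query_kg("high-tech certification", "office hours"))
```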
A second score of each candidate answer in the first candidate answer set is determined according to relevance; a higher second score indicates a higher likelihood that the retrieval result meets the user's requirements.
In some embodiments, referring to fig. 4, in step S105, for common question answering, retrieving based on pre-constructed vectorized FAQ question-answer pairs to obtain a second candidate answer set includes:
step S1051, performing text vectorization on the corrected text;
step S1052, searching for similar vectors among the vectorized FAQ question-answer pairs, obtaining corresponding answers, and generating the second candidate answer set.
Wherein searching for similar vectors among the vectorized FAQ question-answer pairs comprises:
calculating the similarity between the vectorized corrected text and the questions in the vectorized FAQ question-answer pairs, and returning the answer corresponding to the question with the highest similarity; and/or
calculating the similarity between the vectorized corrected text and the answers in the vectorized FAQ question-answer pairs, and returning the answer with the highest similarity.
That is, the methods for searching for similar vectors among the vectorized FAQ question-answer pairs include similar-question matching and question-answer matching. In practical applications, similar-question matching is mainly adopted, and the vectorization mainly uses BERT-based semantic-level vectors.
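A short sketch of the two matching modes, assuming the query, the FAQ questions and the FAQ answers have already been encoded as vectors (for example by a BERT encoder); the toy vectors and answer strings below are invented.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy pre-computed vectors standing in for BERT sentence embeddings.
query_vec = np.array([0.9, 0.1, 0.2])
faq_question_vecs = [np.array([0.8, 0.2, 0.1]), np.array([0.1, 0.9, 0.3])]
faq_answer_vecs = [np.array([0.7, 0.3, 0.2]), np.array([0.2, 0.8, 0.4])]
answers = ["Answer to FAQ question 1", "Answer to FAQ question 2"]

# Similar-question matching: compare the query with the FAQ questions.
qq_best = max(range(len(answers)),
              key=lambda i: cosine(query_vec, faq_question_vecs[i]))
# Question-answer matching: compare the query with the FAQ answers directly.
qa_best = max(range(len(answers)),
              key=lambda i: cosine(query_vec, faq_answer_vecs[i]))

print(answers[qq_best], "|", answers[qa_best])
```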
A third score of each candidate answer in the second candidate answer set is determined according to relevance; a higher third score indicates a higher likelihood that the retrieval result meets the user's requirements.
In some embodiments, referring to fig. 5, in step S106, ranking the candidate answers according to the first score, the second score and the third score to obtain an answer includes:
step S1061, performing weighted summation of the first score and the second score for simple fact question answering to obtain a fourth score of each candidate answer in the first candidate answer set;
step S1062, performing weighted summation of the first score and the third score for common question answering to obtain a fifth score of each candidate answer in the second candidate answer set;
step S1063, sorting all candidate answers in descending order of the fourth score and the fifth score, and selecting the top-ranked answer;
and step S1064, generating an answer fed back to the user according to the selected answer and an answer template.
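A minimal sketch of the weighted fusion and ranking in steps S1061-S1064; the candidate scores, fusion weights and answer template are illustrative assumptions rather than values prescribed by the method.

```python
# Illustrative candidates: (answer text, intent score, retrieval score).
kg_candidates = [("weekdays 09:00-12:00 a.m., 13:30-17:00 p.m.", 0.7, 0.9)]
faq_candidates = [("Submit the application through the online service hall.", 0.3, 0.8)]

# Assumed fusion weights for the intent score and the retrieval score.
W_INTENT, W_RETRIEVAL = 0.4, 0.6

scored = []
for answer, first_score, second_score in kg_candidates:
    fourth_score = W_INTENT * first_score + W_RETRIEVAL * second_score
    scored.append((answer, fourth_score))
for answer, first_score, third_score in faq_candidates:
    fifth_score = W_INTENT * first_score + W_RETRIEVAL * third_score
    scored.append((answer, fifth_score))

# Sort in descending order of score and take the top-ranked answer.
best_answer, best_score = max(scored, key=lambda x: x[1])
print(f"The handling hours are: {best_answer}")  # answer template + selected answer
```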
After the system goes online, the logs are checked regularly, new questions raised by users are collected and annotated, standard answers are prepared and vectorized, and the results are added to the vectorized FAQ question-answer pairs and/or used to update the knowledge graph, so that continuous optimization is achieved.
In some embodiments, referring to fig. 6, there is provided a semantic retrieval apparatus including:
a receiving module 201, configured to receive query information sent by a user;
the error correction module 202 is configured to correct errors of the text in the query information to obtain a corrected text;
an intent determination module 203, configured to perform user intention analysis on the corrected text based on a question template library, and determine a first score of the identified user intention, where the user intention includes simple fact question answering and common question answering;
the first retrieval module 204 is configured to, for simple fact question answering, retrieve based on a pre-constructed knowledge graph to obtain a first candidate answer set, and determine a second score of each candidate answer in the first candidate answer set according to relevance;
the second retrieval module 205 is configured to, for common question answering, retrieve based on pre-constructed vectorized FAQ question-answer pairs to obtain a second candidate answer set, and determine a third score of each candidate answer in the second candidate answer set according to relevance;
and the answer generating module 206 is configured to rank the candidate answers according to the first score, the second score and the third score to obtain an answer.
Specifically, the error correction module 202 is further configured to segment the text with a Chinese word segmenter and perform error detection at character granularity and word granularity to generate a candidate set of suspected error positions; traverse all suspected error positions, look up similar-pronunciation and similar-form words from a pre-stored dictionary to replace the words at the suspected error positions, and calculate sentence perplexity through a language model; sort the replacement results according to the sentence perplexity to obtain an optimal corrected word; and generate the corrected text according to the optimal corrected word.
The first retrieval module 204 is further configured to extract entity information, relationship information and attribute information from the corrected text, link them to the corresponding entity, relationship or attribute in the knowledge graph using a synonym dictionary, and generate an SQL query statement; and to fill the extracted words into the corresponding word slots of the SQL query statement and execute the query to obtain the first candidate answer set.
The second retrieval module 205 is further configured to perform text vectorization on the corrected text, search for similar vectors among the vectorized FAQ question-answer pairs, obtain corresponding answers, and generate the second candidate answer set.
The second retrieval module 205 is further configured to calculate the similarity between the vectorized corrected text and the questions in the vectorized FAQ question-answer pairs and return the answer corresponding to the question with the highest similarity; and/or to calculate the similarity between the vectorized corrected text and the answers in the vectorized FAQ question-answer pairs and return the answer with the highest similarity.
The answer generating module 206 is further configured to perform weighted summation of the first score and the second score for simple fact question answering to obtain a fourth score of each candidate answer in the first candidate answer set; perform weighted summation of the first score and the third score for common question answering to obtain a fifth score of each candidate answer in the second candidate answer set; sort all the candidate answers according to the fourth score and the fifth score and select the top-ranked answer; and generate an answer fed back to the user according to the selected answer and an answer template.
For the specific working principle, please refer to the above method embodiments, which are not described herein again.
Referring to fig. 7, in some embodiments, there is further provided an electronic device including a processor 301 and a memory 302, where the memory 302 stores a plurality of instructions, and the processor 301 is configured to read the plurality of instructions and execute the semantic retrieval method described above, for example, including: receiving query information sent by a user; correcting the text in the query information to obtain a corrected text; performing user intention analysis on the corrected text based on a question template library, and determining a first score of the identified user intention, wherein the user intention comprises simple fact question answering and common question answering; for simple fact question answering, retrieving based on a pre-constructed knowledge graph to obtain a first candidate answer set, and determining a second score of each candidate answer in the first candidate answer set according to relevance; for common question answering, retrieving based on pre-constructed vectorized FAQ question-answer pairs to obtain a second candidate answer set, and determining a third score of each candidate answer in the second candidate answer set according to relevance; and ranking the candidate answers according to the first score, the second score and the third score to obtain an answer.
In some embodiments, there is also provided a computer-readable storage medium storing a plurality of instructions that are readable by a processor to perform the semantic retrieval method described above, for example, comprising: receiving query information sent by a user; correcting the text in the query information to obtain a corrected text; performing user intention analysis on the corrected text based on a question template library, and determining a first score of the identified user intention, wherein the user intention comprises simple fact question answering and common question answering; for simple fact question answering, retrieving based on a pre-constructed knowledge graph to obtain a first candidate answer set, and determining a second score of each candidate answer in the first candidate answer set according to relevance; for common question answering, retrieving based on pre-constructed vectorized FAQ question-answer pairs to obtain a second candidate answer set, and determining a third score of each candidate answer in the second candidate answer set according to relevance; and ranking the candidate answers according to the first score, the second score and the third score to obtain an answer.
In summary, the semantic retrieval method, the semantic retrieval apparatus and the electronic device provided in the embodiments have at least the following advantages:
(1) compared with keyword-based retrieval, natural language understanding at the semantic level can better match the user's real intention, improve retrieval efficiency and accuracy, and better satisfy the user's query requirements;
(2) based on the synonym dictionary, the identified entities, attributes and relationships can be given a normalized description, and entities that are described in a non-standard or inaccurate way in the user's query sentence are normalized as well; this avoids the problem that an entity cannot be correctly linked to its node in the knowledge graph because its description is not standardized, and improves the robustness of the knowledge-graph-based retrieval system;
(3) for non-simple factual queries such as FAQs, the answer that best matches the user's intention can be retrieved through a semantic-level vectorized retrieval service.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.