CN115186065A - Target word retrieval method and device

Info

Publication number: CN115186065A
Application number: CN202210842766.9A
Authority: CN
Inventors: 綦红镀
Original assignee: Bank of China Ltd
Current assignee: Bank of China Ltd
Priority date: 2022-07-18
Filing date: 2022-07-18
Publication date: 2022-10-14

Abstract

The present application discloses a method and device for retrieving target words, applicable to the field of artificial intelligence. The method comprises: obtaining an original search phrase; determining a candidate set according to the original search phrase; and querying each query phrase in the candidate set to determine the query result corresponding to each query phrase. By introducing latent semantics, the method extracts the latent semantic information contained in the corpus of the retrieval database and, starting from the original query, constructs a candidate set of semantically similar queries. This addresses the lack of semantic matching in existing full-text search and improves both the intelligence of full-text search and the user experience.

Description

Method and device for retrieving target words

Technical Field

The present application relates to the field of computer technology, and in particular to a method and device for retrieving target words.

Background

Full-text search retrieves arbitrary content from whole books or whole articles stored in a database. It can return information about chapters, sections, paragraphs, sentences, and words in the full text as needed, much as if every word in a book carried a label, and it also supports various kinds of statistics and analysis. Commonly used full-text search engines such as ES compute relevance with the TF-IDF (term frequency-inverse document frequency) algorithm at the bottom layer. This approach is limited: TF-IDF cannot uncover deeper semantic relationships between words, so a traditional ES search engine cannot handle the case where several words share one meaning. It remains at the level of keyword matching and cannot offer users semantic-level retrieval. For example, when a user searches for "automobile", a traditional full-text search returns only records containing the word "automobile", although records containing the word "car" may also be what the user wants.

In other words, with current methods for retrieving target words, search results tend to stick to the literal wording of the user's request and fail to capture the real intent behind it. Such methods have poor recall and low precision.

Summary of the Invention

In view of this, embodiments of the present application provide a method and device for retrieving target words, aiming to achieve accurate full-text retrieval of target words.

In a first aspect, an embodiment of the present application provides a method for retrieving target words, the method comprising:

obtaining an original search phrase;

determining a candidate set according to the original search phrase, wherein the candidate set includes the original search phrase and a plurality of first search phrases, each first search phrase being a phrase semantically similar to the original search phrase; and

querying each query phrase in the candidate set, and determining the query result corresponding to each query phrase.

Optionally, determining the candidate set according to the original search phrase includes:

obtaining a latent semantic computing model;

determining a plurality of first search phrases by means of the latent semantic computing model and a first rule, wherein each first search phrase is a phrase semantically similar to the original search phrase, and the first rule is used to determine the number of first search phrases; and

combining the original search phrase with the plurality of first search phrases to form the candidate set.

Optionally, determining the query result corresponding to each query phrase includes:

determining a text search record according to the query phrase, the text search record including text-related words; and

determining the query result according to a second rule, the second rule being used to determine the number of text-related words in the query result.

Optionally, after determining the query result corresponding to each query phrase, the method further includes:

merging the query results corresponding to the query phrases, and using the merged result as the final query result set.

Optionally, the latent semantic computing model is a latent semantic analysis (LSA) model or a word vector model.

In a second aspect, an embodiment of the present application provides a device for retrieving target words, the device comprising:

an original search phrase acquisition module, configured to obtain an original search phrase;

a candidate set determination module, configured to determine a candidate set according to the original search phrase, wherein the candidate set includes the original search phrase and a plurality of first search phrases, each first search phrase being a phrase semantically similar to the original search phrase; and

a query result determination module, configured to query each query phrase in the candidate set and determine the query result corresponding to each query phrase.

Optionally, the candidate set determination module includes:

a computing model acquisition module, configured to obtain a latent semantic computing model;

a first search phrase determination module, configured to determine a plurality of first search phrases by means of the latent semantic computing model and a first rule, wherein each first search phrase is a phrase semantically similar to the original search phrase, and the first rule is used to determine the number of first search phrases; and

a candidate set forming module, configured to combine the original search phrase with the plurality of first search phrases to form the candidate set.

Optionally, the query result determination module includes:

a text search record determination module, configured to determine a text search record according to the query phrase, the text search record including text-related words; and

a query result determination module, configured to determine the query result according to a second rule, the second rule being used to determine the number of text-related words in the query result.

Optionally, the device further includes:

a merging module, configured to merge the query results corresponding to the query phrases and use the merged result as the final query result set.

Embodiments of the present application provide a method and device for retrieving target words. When the method is executed, an original search phrase is obtained; a candidate set is determined according to the original search phrase; and each query phrase in the candidate set is queried to determine its corresponding query result. Thus, after the user enters a target search phrase, the system analyzes it to obtain synonymous phrases and uses the original text to be searched together with several semantically similar text records as search conditions, obtaining multiple search results. The target words are thereby searched comprehensively: the results combine keyword retrieval with semantic retrieval, improving the recall and precision of existing full-text search.

Brief Description of the Drawings

To describe the technical solutions of the embodiments or of the prior art more clearly, the drawings needed for that description are briefly introduced below. The drawings described below show only some embodiments of the present application; a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a flowchart of a method for retrieving target words provided by an embodiment of the present application;

FIG. 2 is a flowchart of a method for retrieving target words provided by an embodiment of the present application;

FIG. 3 is a schematic structural diagram of a device for retrieving target words provided by an embodiment of the present application.

Detailed Description

To help those skilled in the art better understand the solution of the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present application.

As noted above, commonly used full-text search engines such as ES compute relevance with the TF-IDF algorithm at the bottom layer. The inventor found through research, however, that this approach is limited: TF-IDF cannot uncover deeper semantic relationships between words, so this kind of retrieval cannot capture the real intent behind the user's input and suffers from poor recall and low precision.

To solve this problem, embodiments of the present application provide a method and device for retrieving target words. When the method is executed, an original search phrase is obtained; a candidate set is determined according to the original search phrase; and each query phrase in the candidate set is queried to determine its corresponding query result. Thus, after the user enters a target search phrase, the system analyzes it to obtain synonymous phrases and uses the original text to be searched together with several semantically similar text records as search conditions, obtaining multiple search results. The target words are thereby searched comprehensively: the results combine keyword retrieval with semantic retrieval, improving the recall and precision of existing full-text search.

The method provided by the embodiments of the present application is executed by a search engine and a background server; for example, the background server includes a retrieval system with retrieval and integration functions. After obtaining the semantic computing model from the retrieval database, the retrieval system analyzes the original search phrase to obtain semantically similar phrases, submits the query objects to a search engine such as ES, and forms the query results according to relevance. The background server may be a single server or a server cluster composed of multiple servers.

The method for retrieving target words provided by the present application is described below through an embodiment. Refer to FIG. 1, which is a flowchart of the method for retrieving target words provided by an embodiment of the present application. The method includes:

S101: Obtain the original search phrase.

The original search phrase is the initial target phrase to be searched. In a specific application scenario, it may be entered by the user or set by the system according to query requirements.

S102: Determine a candidate set according to the original search phrase.

The candidate set includes the original search phrase and a plurality of first search phrases, each first search phrase being a phrase semantically similar to the original search phrase.

Suppose the original search phrase is "big data research", and the latent semantic model finds that the two records in the retrieval database most semantically similar to it are "research" and "learning"; these two records are then the first search phrases. The system places the original search phrase and the first search phrases into the candidate set. In practice, the system may attach a different flag to each first search phrase according to its degree of relevance to the original search phrase; these flags can later be used to distinguish search results of differing similarity.

How the candidate set is determined from the original search phrase is detailed later and is not repeated here.

S103: Query each query phrase in the candidate set, and determine the query result corresponding to each query phrase.

A query phrase is a phrase from the candidate set that is queried in turn. The original search phrase and the first search phrases are taken from the candidate set in sequence and used one at a time as the query phrase, and the query result corresponding to each query phrase is determined.

In practice, the system may merge the query results corresponding to the query phrases to form the final query result set.
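
Steps S101 to S103 can be sketched roughly as follows. This is only an illustrative outline, not the implementation described in the application: the function name `retrieve` and the two callables `expand` (standing in for the latent semantic model) and `search` (standing in for a full-text engine such as ES) are hypothetical names introduced here.

```python
from typing import Callable, Dict, List


def retrieve(
    original_phrase: str,
    expand: Callable[[str], List[str]],   # hypothetical: returns similar phrases
    search: Callable[[str], List[str]],   # hypothetical: returns matching records
) -> Dict[str, List[str]]:
    """Run S101-S103 for one original search phrase."""
    # S102: the candidate set is the original phrase plus its similar phrases.
    candidate_set = [original_phrase]
    for phrase in expand(original_phrase):
        if phrase not in candidate_set:
            candidate_set.append(phrase)
    # S103: query every phrase in the candidate set, one at a time.
    return {phrase: search(phrase) for phrase in candidate_set}
```

Passing the expansion and search steps in as callables keeps the sketch independent of any particular model or search engine.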

The method for retrieving target words provided by the embodiments of the present application is described in detail below. Refer to FIG. 2, which is another schematic flowchart of the retrieval of target words provided by an embodiment of the present application. The specific process is as follows:

S201: Obtain the original search phrase.

The system obtains the original search phrase to be searched.

S202: Obtain a latent semantic computing model.

A latent semantic model is computed offline from the corpus in the retrieval database. The model may be an LSA model or a Word2vec model; no limitation is imposed here. The following steps use LSA as an example and specifically include:

Analyze the document collection and build a word-text matrix A. Suppose A is an m×n text data matrix (n << m), meaning that the corpus contains m words and n documents.

Perform singular value decomposition (SVD) on the word-text matrix, reduce the dimension of the decomposed matrices, and use the reduced matrices to build the latent semantic space (LSA) model. The formula is:

$$A_{m \times n} \approx U_{m \times k} \, \Sigma_{k \times k} \, V_{k \times n}^{T}$$

In the formula, $A_{m \times n}$ is the m×n text data matrix. The formula applies truncated singular value decomposition to the large matrix A, factoring it into the product of three matrices: $U_{m \times k}$ is the word-topic matrix and $\Sigma_{k \times k} V_{k \times n}^{T}$ is the topic-text matrix. A can be decomposed into k singular values, where k can be taken as the number of topics. After sorting, the r largest singular values are selected, with r calculated from the following formula:

$$\frac{P_r}{P} \geq 95\%$$

where $P_r$ is the sum of squares of the r largest singular values on the diagonal matrix and $P$ is the sum of squares of all singular values on the diagonal matrix. The resulting r retains more than 95% of the information in the original matrix, and r is much smaller than k.

In this way,

$$U_{m \times r} \, \Sigma_{r \times r} \, V_{r \times n}^{T}$$

approximates the matrix A. U is the word-topic matrix; each of its columns represents a latent meaning, formed by combining the m words with different weights. In the word-document matrix A, rows represent words and columns represent documents, and an element is generally the number of times the word occurs in the document. Because the columns of U are mutually independent, the r latent meanings span a semantic space. Each row of U corresponds to a word, and the larger an entry, the more strongly the word is related to that latent meaning, so $U_{m \times r}$ shows the relationship between words and meanings. Each row of $V^{T}$ represents a topic, and each non-zero element expresses the relevance of a topic to a document, so $V_{r \times n}^{T}$ shows the relevance of texts to topics. $\Sigma V^{T}$ is the topic-document matrix: each of its columns represents one document mapped into the semantic space. Each singular value in $\Sigma$ indicates the importance of the corresponding latent meaning, and the matrix $\Sigma$ expresses the relationship between article topics and keywords.
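
The rank selection described above (keep enough of the largest values that their squared sum covers at least 95% of the total) might be sketched as follows. The function name `choose_rank` and the choice to return the smallest qualifying r are assumptions made for illustration; the application only states the 95% condition. The singular values are taken as given, for example from any SVD routine.

```python
from typing import Sequence


def choose_rank(singular_values: Sequence[float], threshold: float = 0.95) -> int:
    """Return the smallest r whose ratio P_r / P reaches the threshold."""
    ordered = sorted(singular_values, reverse=True)
    total = sum(s * s for s in ordered)          # P: squared sum of all values
    if total == 0:
        raise ValueError("all singular values are zero")
    running = 0.0                                # P_r: squared sum of the r largest
    for r, s in enumerate(ordered, start=1):
        running += s * s
        if running / total >= threshold:
            return r
    return len(ordered)
```

With singular values 10, 3, 1, and 0.5, the squared sums are 100 of 110.25 for r = 1 (about 90.7%) and 109 of 110.25 for r = 2 (about 98.9%), so two latent dimensions are kept.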

S203: Determine a plurality of first search phrases by means of the latent semantic computing model and the first rule.

The first rule is used to determine the number of first search phrases. In a specific application scenario, the first rule may be set by the user or by the system according to query requirements.

In some possible implementations, users can select or discard the semantically similar search records offered by the system according to their own search needs, or combine different first search phrases with the original search phrase as required. For example, when the user deletes the second most similar search phrase, the third most similar search phrase can automatically take its place. The user may also set the first search phrases directly, either by choosing other semantically similar phrases found by the system or by typing in a first search phrase.
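
One way to picture the first rule together with the backfill behaviour is the small sketch below. It assumes the similar phrases are already sorted from most to least similar; the function name `select_first_phrases` and its parameters are hypothetical.

```python
from typing import Iterable, List


def select_first_phrases(
    ranked: List[str],              # similar phrases, most similar first
    count: int,                     # number required by the first rule
    excluded: Iterable[str] = (),   # phrases the user has removed
) -> List[str]:
    """Keep the top `count` phrases, skipping any the user removed."""
    blocked = set(excluded)
    kept = [phrase for phrase in ranked if phrase not in blocked]
    return kept[:count]
```

Because removed phrases are filtered out before the cut, the next most similar phrase moves up automatically when one is deleted.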

For example, when the first rule sets the number of first search phrases to 2, the system uses the latent semantic model to find the two records in the retrieval database that are most semantically similar to the phrase to be searched. Suppose the query is query = "big data research"; the two most similar records computed by the latent semantic model are similar_top_2 = {"hadoop research", "spark learning"}. Here, "research" and "learning" are the first search phrases corresponding to the current first rule.

Hadoop and Spark serve as the representative flags of the two first search phrases. In practice, the system can use these flags to distinguish first search phrases of differing similarity, and in some possible implementations it can set and adaptively modify the flags.

Both Hadoop and Spark are big data frameworks. Spark is a fast, general-purpose computing engine designed for large-scale data processing. Hadoop is a distributed system infrastructure developed by the Apache Software Foundation that processes data reliably, efficiently, and scalably. In application, the data retrieved by the system can be processed further with these two frameworks.

As a further optimization, S203 uses the latent semantic model to find the top N records in the retrieval database that are most semantically similar to the phrase to be searched, specifically as follows:

For a given query, a pseudo-document is constructed from the words $A_q$ contained in the query: $V_q = A_q U \Sigma^{-1}$. The cosine similarity between this pseudo-document and each column of $V^{T}$ is then computed to obtain the N documents most similar to the query. Let $V_t$ be the text vector corresponding to the t-th column of $V^{T}$; the cosine similarity between the pseudo-document vector $V_q$ and $V_t$ is:

$$\cos(V_q, V_t) = \frac{V_q \cdot V_t}{\lVert V_q \rVert \, \lVert V_t \rVert}$$

where $V_q$ and $V_t$ are the vector representations of the query text and of the text corresponding to the t-th column of the semantic space matrix, $\lVert V_q \rVert$ and $\lVert V_t \rVert$ are the norms of $V_q$ and $V_t$, and $\cos(V_q, V_t)$ is the cosine similarity between the query vector and the document vector.
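
The folding-in and ranking step can be illustrated with a small pure-Python sketch. It assumes the query is given as a term-count vector over the same m words as A, that U is stored as m rows of topic weights, that the singular values are passed as a plain list, and that each document is supplied as its own topic vector. The names `fold_in`, `cosine`, and `top_n` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def fold_in(a_q: Vector, u: Sequence[Vector], sigma: Vector) -> List[float]:
    """Compute V_q = A_q U Sigma^{-1} for a query term-count vector a_q."""
    return [
        sum(a_q[i] * u[i][j] for i in range(len(a_q))) / sigma[j]
        for j in range(len(sigma))
    ]


def cosine(x: Vector, y: Vector) -> float:
    """Cosine similarity; defined as 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def top_n(v_q: Vector, doc_vectors: Sequence[Vector], n: int) -> List[Tuple[int, float]]:
    """Return (document index, similarity) for the n most similar documents."""
    scored = [(t, cosine(v_q, v_t)) for t, v_t in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]
```

Here N corresponds to the count set by the first rule, so `top_n(fold_in(a_q, u, sigma), doc_vectors, 2)` would yield the two nearest records under these assumptions.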

S204: Combine the original search phrase with the plurality of first search phrases to form the candidate set.

The original search phrase and the first search phrases selected under the first rule are merged into the candidate set. For example, merging the original search phrase "big data research" with the first search phrases "hadoop research" and "spark learning" gives candidate_list = {"big data research", "hadoop research", "spark learning"}.
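
A candidate set that also records how each phrase ranks against the original could be represented as in the sketch below. The `Candidate` class, its `rank` field, and `build_candidate_set` are illustrative assumptions; the application only says that phrases of differing similarity may carry different flags.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    phrase: str
    rank: int  # 0 = original phrase, 1 = most similar, 2 = next, ...


def build_candidate_set(original: str, first_phrases: List[str]) -> List[Candidate]:
    """Merge the original phrase and the first search phrases, in rank order."""
    candidates = [Candidate(original, 0)]
    seen = {original}
    for phrase in first_phrases:
        if phrase not in seen:          # skip repeats so each phrase is queried once
            seen.add(phrase)
            candidates.append(Candidate(phrase, len(candidates)))
    return candidates
```

Keeping the rank alongside the phrase makes it possible to group or order results by similarity later on.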

S205: Query each query phrase in the candidate set, and determine a text search record according to the query phrase.

A text search record includes a plurality of text-related words, that is, words found by the query that are relevant to the query phrase. The text search record is the set formed by these relevant words.

Specifically, the system takes the phrases to be queried from the candidate set in turn and determines the text search record for the current phrase. For example, when the query text is the original search phrase, query1 = "big data research", the corresponding text search record may be "Research on the current state of big data development", "Research on big data related components", "Relationship between big data and artificial intelligence", "Exploring the prospects of big data development", and "Application expansion of big data". The text search record is the full-text search result for the current query text; that is, if the full text contains N items directly related to the current query phrase, the text search record may have N entries.

In some possible implementations, the system sorts the relevant words by their relevance to the query phrase. That is, while collecting text-related phrases, the system orders them by relevance to form an ordered text search record. The text-related words in the record then stand in an order of relevance: judged against the query phrase "big data research", "Research on the current state of big data development" is more relevant than "Research on big data related components", and relevance decreases further down the list.

In other possible implementations, the system only searches the full text with the query phrase and does not sort the relevant words during retrieval, so the text-related words in the resulting text search record have no relevance order. To extract text-related words by relevance in this case, once the system receives the count required by the second rule it can either sort the text-related words in the text search record or apply a relevance threshold to filter them directly, and thereby obtain the number of text-related words specified by the second rule. In other words, the system selects from the unordered text search record a number of text-related phrases whose relevance meets the standard.

S206: Determine the query result according to the second rule.

The second rule is used to determine the number of text-related words in the query result. In a specific application scenario, the second rule may be set by the user or by the system according to query requirements.

For example, when the second rule sets the number of text-related phrases to 3, the system takes the three most relevant text records to form the query result. Suppose the text search record for the current query phrase is "Research on the current state of big data development", "Research on big data related components", "Relationship between big data and artificial intelligence", "Exploring the prospects of big data development", and "Application expansion of big data". Applying the selection rule for text-related phrases described in step S205 and taking the three most relevant text records gives the result set result1 = {"Research on the current state of big data development", "Research on big data related components", "Relationship between big data and artificial intelligence"}.
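
The second rule can be pictured as a top-k selection over scored records, optionally combined with the relevance threshold mentioned in S205. The sketch below assumes each record arrives with a numeric relevance score; the `Hit` type, the function name `apply_second_rule`, and the `min_score` parameter are hypothetical.

```python
from typing import List, Optional, Tuple

Hit = Tuple[str, float]  # (record text, relevance score)


def apply_second_rule(
    hits: List[Hit],
    count: int,                          # number required by the second rule
    min_score: Optional[float] = None,   # optional relevance threshold
) -> List[str]:
    """Return up to `count` record texts, most relevant first."""
    if min_score is not None:
        hits = [hit for hit in hits if hit[1] >= min_score]
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    return [text for text, _ in ranked[:count]]
```

Sorting inside the function means it works the same whether or not the search engine has already returned the records in relevance order.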

S207: Merge the query results corresponding to the query phrases, and use the merged result as the final query result set.

Following step S206 above, the query results corresponding to the query texts in the candidate set are determined.

For example, each query text is taken from the candidate set in turn and submitted to a full-text search engine such as ES, and the three most relevant text records form its query result. With the query texts query1 = "big data research", query2 = "hadoop research", and query3 = "spark learning", the corresponding result sets are result1 = {"Research on the current state of big data development", "Research on big data related components", "Relationship between big data and artificial intelligence"}, result2 = {"hadoop research", "hadoop technology research", "hadoop quick start"}, and result3 = {"spark learning", "spark study notes", "spark basic tutorial"}. In this step, the query results for all query phrases are merged into the search result set: result_list = {"Research on the current state of big data development", "Research on big data related components", "Relationship between big data and artificial intelligence", "hadoop research", "hadoop technology research", "hadoop quick start", "spark learning", "spark study notes", "spark basic tutorial"}.
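
The merge in S207 amounts to concatenating the per-phrase result lists in candidate order. A minimal sketch follows; dropping records that appear under more than one query phrase is an assumption added here, since the application does not say how duplicates are handled, and the name `merge_results` is hypothetical.

```python
from typing import Iterable, List, Set


def merge_results(result_sets: Iterable[List[str]]) -> List[str]:
    """Concatenate result lists in order, keeping the first copy of each record."""
    merged: List[str] = []
    seen: Set[str] = set()
    for results in result_sets:
        for record in results:
            if record not in seen:
                seen.add(record)
                merged.append(record)
    return merged
```

Because the original phrase's results come first in the input, they also come first in the merged list, followed by the results of progressively less similar phrases.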

In practice, the query results can be displayed by category in the client-side search interface. For example, the original retrieval phrase can be distinguished from first retrieval phrases of differing similarity: the query results of the original retrieval phrase are placed in the first row, and the query results of the first retrieval phrases follow in later rows in order of decreasing similarity. Alternatively, the system can assign a different color to each phrase so that the phrases are distinguished in the display interface.
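One way to derive the row order described above is sketched here. The function name `order_for_display` and the similarity values are hypothetical; the text states only that rows after the first are ordered by decreasing similarity to the original retrieval phrase.

```python
from typing import Dict, List


def order_for_display(original: str, similarity: Dict[str, float]) -> List[str]:
    """Return phrases in display order: original first, then by similarity."""
    # Rank the first retrieval phrases from most to least similar.
    ranked = sorted(similarity, key=lambda phrase: similarity[phrase], reverse=True)
    return [original] + ranked


# Hypothetical similarity scores between each phrase and the original phrase.
rows = order_for_display(
    "big data research",
    {"spark learning": 0.71, "hadoop research": 0.83},
)
```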

The foregoing describes some specific implementations of the retrieval method based on latent semantic analysis provided by the embodiments of the present application. On this basis, the present application also provides a corresponding device, which is introduced below from the perspective of functional modules.

Please refer to FIG. 3, which is a schematic structural diagram of a device for retrieving a target word according to an embodiment of the present application.

In this embodiment, the device may include:

an original retrieval phrase acquisition module 300, configured to acquire an original retrieval phrase;

a candidate set determination module 301, configured to determine a candidate set according to the original retrieval phrase, wherein the candidate set includes the original retrieval phrase and a plurality of first retrieval phrases, and the first retrieval phrases are phrases semantically similar to the original retrieval phrase;

a query result determination module 302, configured to query each query phrase in the candidate set against the target text, and to determine the query result corresponding to each query phrase.
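The three modules can be pictured as methods on a single object, as in the sketch below. Everything here is illustrative: the class `RetrievalDevice`, its method names, and the `toy_expand` and `toy_search` stand-ins are hypothetical. In a real system the expansion step would call a latent semantic model and the search step would call a full-text engine such as ES.

```python
from typing import Callable, Dict, List


class RetrievalDevice:
    """Illustrative grouping of modules 300, 301 and 302."""

    def __init__(
        self,
        expand: Callable[[str], List[str]],
        search: Callable[[str], List[str]],
    ) -> None:
        self._expand = expand  # stand-in for the latent semantic model
        self._search = search  # stand-in for the full-text search engine

    def acquire_phrase(self, raw_input: str) -> str:
        # Module 300: acquire the original retrieval phrase.
        # Stripping whitespace is an assumption of this sketch.
        return raw_input.strip()

    def determine_candidates(self, original: str) -> List[str]:
        # Module 301: original phrase plus semantically similar phrases.
        return [original] + self._expand(original)

    def determine_results(self, candidates: List[str]) -> Dict[str, List[str]]:
        # Module 302: query every phrase in the candidate set.
        return {phrase: self._search(phrase) for phrase in candidates}

    def retrieve(self, raw_input: str) -> Dict[str, List[str]]:
        original = self.acquire_phrase(raw_input)
        return self.determine_results(self.determine_candidates(original))


# Toy data standing in for the model output and the search index.
SIMILAR = {"big data research": ["hadoop research", "spark learning"]}
INDEX = {
    "big data research": ["Research on big data related components"],
    "hadoop research": ["hadoop technology research"],
    "spark learning": ["spark study notes"],
}


def toy_expand(phrase: str) -> List[str]:
    return SIMILAR.get(phrase, [])


def toy_search(phrase: str) -> List[str]:
    return INDEX.get(phrase, [])


device = RetrievalDevice(expand=toy_expand, search=toy_search)
results = device.retrieve("  big data research  ")
```

Passing the expansion and search steps in as callables keeps the sketch independent of any particular model or search engine.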

Optionally, the device further includes:

The latent semantic computation model is a latent semantic analysis model or a word vector model.

It should be noted that the method and device for retrieving a target word provided by the present invention can be used in the field of artificial intelligence. This is only an example and does not limit the fields in which the method and device can be applied.

The method and device for retrieving a target word provided by the present application have been described in detail above. The embodiments in this specification are described progressively: each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant details can be found in the description of the method. It should be pointed out that those of ordinary skill in the art may make improvements and modifications to the present application without departing from its principles, and such improvements and modifications also fall within the protection scope of the claims of the present application.

It should also be noted that, in this specification, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "comprising a ..." does not preclude the presence of additional identical elements in the process, method, article, or device that includes the element.

The above are only preferred embodiments of the present application and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims

1. A method for retrieving a target word, the method comprising:

acquiring an original retrieval phrase;

determining a candidate set according to the original retrieval phrase, wherein the candidate set comprises the original retrieval phrase and a plurality of first retrieval phrases, and the first retrieval phrases are phrases semantically similar to the original retrieval phrase;

and querying each query phrase in the candidate set, and determining a query result corresponding to each query phrase.

2. The method of claim 1, wherein said determining a candidate set according to the original retrieval phrase comprises:

acquiring a latent semantic computation model;

determining a plurality of first retrieval phrases through the latent semantic computation model and a first rule, wherein the first retrieval phrases are phrases semantically similar to the original retrieval phrase, and the first rule is used for determining the number of first retrieval phrases;

merging the original retrieval phrase with the plurality of first retrieval phrases to form the candidate set.

3. The method for retrieving a target word according to claim 1, wherein the determining a query result corresponding to each query phrase comprises:

determining text search records according to the query phrase, wherein the text search records comprise text-related phrases;

and determining the query result according to a second rule, wherein the second rule is used for determining the number of text-related phrases in the query result.

4. The method for retrieving a target word according to claim 1, wherein after determining the query result corresponding to each query phrase, the method further comprises:

merging the query results corresponding to each query phrase, and taking the merged result as a final query result set.

5. The method for retrieving a target word according to claim 2, wherein the latent semantic computation model is a latent semantic analysis model or a word vector model.

6. An apparatus for retrieving a target word, the apparatus comprising:

an original retrieval phrase acquisition module, configured to acquire an original retrieval phrase;

a candidate set determination module, configured to determine a candidate set according to the original retrieval phrase, wherein the candidate set comprises the original retrieval phrase and a plurality of first retrieval phrases, and the first retrieval phrases are phrases semantically similar to the original retrieval phrase;

and a query result determination module, configured to query each query phrase in the candidate set and determine a query result corresponding to each query phrase.

7. The apparatus of claim 6, wherein the candidate set determination module comprises:

a computation model acquisition module, configured to acquire a latent semantic computation model;

a first retrieval phrase determination module, configured to determine a plurality of first retrieval phrases according to the latent semantic computation model and a first rule, wherein the first retrieval phrases are phrases semantically similar to the original retrieval phrase, and the first rule is used for determining the number of first retrieval phrases;

and a candidate set forming module, configured to merge the original retrieval phrase with the plurality of first retrieval phrases to form the candidate set.

8. The apparatus of claim 6, wherein the query result determination module comprises:

a text search record determination module, configured to determine text search records according to the query phrase, wherein the text search records comprise text-related phrases;

and a query result determination module, configured to determine the query result according to a second rule, wherein the second rule is used for determining the number of text-related phrases in the query result.

9. The apparatus of claim 6, further comprising:

a merging module, configured to merge the query results corresponding to each query phrase and take the merged result as a final query result set.

10. The apparatus of claim 7, wherein the latent semantic computation model is a latent semantic analysis model or a word vector model.