CN107247745B

CN107247745B - A kind of information retrieval method and system based on pseudo-linear filter model

Info

Publication number: CN107247745B
Application number: CN201710370190.XA
Authority: CN
Inventors: 何婷婷; 潘敏; 简芳洪; 毛智明
Original assignee: Central China Normal University
Current assignee: Central China Normal University
Priority date: 2017-05-23
Filing date: 2017-05-23
Publication date: 2018-07-03
Anticipated expiration: 2037-05-23
Also published as: CN107247745A

Abstract

The present invention provides an information retrieval method based on a pseudo-relevance feedback model, which includes merging the word relevance into the pseudo-relevance feedback model to realize information retrieval, including generating query expansion words in a pseudo-relevance document collection, respectively generating candidate extensions The query expansion words characterized by the importance of the word and the query expansion words characterized by the correlation between the candidate expansion words and the query subject words are combined into the original query expansion words to complete the final information retrieval; When querying extended words characterized by the correlation between the expanded word and the query subject word, the kernel function is used to calculate the correlation between the query word and the candidate word appearing in different positions in the document. The present invention can not only highlight the distribution of query words and candidate words, select candidate words with a higher degree of correlation with query subject words, but also locate more accurate candidate words, improve extended query and final retrieval due to the additional correlation information. accuracy.

Description

An information retrieval method and system based on a pseudo-correlation feedback model

技术领域technical field

本发明属于信息检索技术领域，特别涉及一种将核函数词相关度融合到伪相关反馈模型中的信息检索方法及系统。The invention belongs to the technical field of information retrieval, and in particular relates to an information retrieval method and system which integrates the kernel function word correlation into a pseudo-correlation feedback model.

背景技术Background technique

在信息竞争趋势日盛的年代，借助搜索引擎浏览和获取所需信息是人们日常生活的重要组成部分。然而，网络资源异常丰富，信息总量迅速膨胀，使得用户难以高效和准确的获取并识别重要信息，信息处理技术迫切需要一种更为有效的理论和方法来处理日益增长的海量数据。信息检索作为经典的文本处理技术，能够适应这一要求并迅速成为当前信息处理研究领域中的研究热点。In the age of information competition, browsing and obtaining desired information with the help of search engines is an important part of people's daily life. However, the extremely rich network resources and the rapid expansion of the total amount of information make it difficult for users to efficiently and accurately obtain and identify important information. Information processing technology urgently needs a more effective theory and method to deal with the growing mass of data. As a classic text processing technology, information retrieval can adapt to this requirement and quickly become a research hotspot in the current information processing research field.

信息检索(Information Retrieval)是指信息按一定的方式组织起来，并根据信息用户的需要找出有关的信息的过程和技术。信息检索过程可以简单地描述为：用户根据其信息需求，组织一个查询字符串提交给信息检索系统，信息检索系统在文档集中检索出与查询相关的文档子集返回给用户。具体来说是指给定一组特定的查询主题，通过某种信息检索模型，对目标中的所有文档与查询主题进行相关度计算，并将每个文档按得分从大到小的顺序返回，返回的结果中文档越靠前说明该文档与查询主题越相关。经过近半个世纪的研究发展，一些有效的信息检索模型陆续提出并逐渐应用到相关的系统中。其中，影响比较大的检索模型包括：布尔逻辑模型、向量空间模型、概率模型、语言模型以及新近提出来的基于监督学习的检索模型。Information Retrieval (Information Retrieval) refers to the process and technology of organizing information in a certain way and finding relevant information according to the needs of information users. The information retrieval process can be simply described as: the user organizes a query string and submits it to the information retrieval system according to his information needs, and the information retrieval system retrieves a subset of documents related to the query from the document set and returns them to the user. Specifically, given a set of specific query topics, through a certain information retrieval model, the relevance calculation is performed on all the documents in the target and the query topics, and each document is returned in order of scores from large to small, The higher the document in the returned results, the more relevant the document is to the query subject. After nearly half a century of research and development, some effective information retrieval models have been proposed and gradually applied to related systems. Among them, the retrieval models with relatively large influence include: Boolean logic model, vector space model, probability model, language model and the retrieval model based on supervised learning proposed recently.

在实际的信息检索应用中，用户的查询请求与系统反馈的查询结果之间存在一定的偏差，造成检索系统的性能下降。所以，信息检索往往是一个反复的过程，用户常常需要经过多次的查询调整才能获得满意的检索结果。查询扩展技术通过对用户的初始查询进行扩展和重构, 较好的解决了用户查询用词与文档用词不匹配以及用户表达不完整的问题，因而被广泛应用于信息检索领域。简单地说来，查询扩展就是检索系统在进行检索之前，先根据扩展词表，自动把用户查询中的关键词的同义词或者近义词扩展进而形成新的查询，然后再进行检索。In the actual information retrieval application, there is a certain deviation between the user's query request and the query result fed back by the system, which causes the performance of the retrieval system to decline. Therefore, information retrieval is often an iterative process, and users often need to go through multiple query adjustments to obtain satisfactory retrieval results. Query expansion technology, by expanding and reconstructing the user's initial query, better solves the problems of mismatch between user query words and document words and incomplete user expression, so it is widely used in the field of information retrieval. To put it simply, query expansion means that before searching, the retrieval system automatically expands the synonyms or near synonyms of the keywords in the user query according to the extended vocabulary to form a new query, and then performs the retrieval.

伪相关反馈的出现是为了使检索系统更加有效，使检索结果更好地满足用户的查询请求。其主要机制是，系统默认自己检索出的结果中含有大量与用户查询主题相关的文档，从中取出前N篇作为相关文档，对查询进行调整或扩展。The appearance of pseudo-correlation feedback is to make the retrieval system more effective, so that the retrieval results can better meet the user's query request. The main mechanism is that the system defaults that the results retrieved by itself contain a large number of documents related to the subject of the user's query, and the first N articles are taken out as relevant documents to adjust or expand the query.

通常，影响一个检索系统的性能有很多因素，其中最为关键的是信息检索策略，包括文档和查询条件的表示方法、评价文档和查询相关性的匹配策略、查询结果的排序方法和用户进行相关反馈的机制等。Usually, there are many factors that affect the performance of a retrieval system, the most critical of which is the information retrieval strategy, including the representation method of documents and query conditions, the matching strategy for evaluating the relevance of documents and queries, the sorting method of query results, and relevant feedback from users. mechanism etc.

随着高速的互联网发展，海量的信息堆叠，信息的搜索精度成为所有用户关注的第一要点，现在想通过信息检索工具找到用户想要的东西变得越来越困难了，同时各种信息的过度泛滥，使得用户不得不花费更多的时间来甄别哪些信息对于用户来说是有价值的。现有信息检索方法普遍存在的问题是检索平均精度不高，即使目前最好的检索模型的平均精度也只有 30％，信息检索精度的提高还有很长的路要走。而信息检索已经深入到人类生活的各个方面，大部分人每天用百度、谷歌等搜索工具查找需要的各种资料，解决各种现实问题。2010年中国网页搜索的请求量规模达到了600多亿次，而到了2016年，仅百度一个天的搜索请求量就达到了60亿次，在如此大量的检索需求下，信息检索的平均精度每一个百分点的提升将为获取所需要的信息节省大量的时间和精力，其产生的价值非凡。各大互联网公司也在不断追求更低成本更高效率的信息检索技术。With the rapid development of the Internet and the accumulation of massive information, the accuracy of information search has become the first point of concern for all users. Now it is becoming more and more difficult to find what users want through information retrieval tools. At the same time, various information Excessive flooding makes users have to spend more time identifying which information is valuable to users. The common problem of existing information retrieval methods is that the average retrieval accuracy is not high. Even the average accuracy of the current best retrieval model is only 30%, and there is still a long way to go to improve the information retrieval accuracy. However, information retrieval has penetrated into all aspects of human life. Most people use Baidu, Google and other search tools to find various information and solve various practical problems every day. In 2010, the number of web search requests in China reached more than 60 billion times, and in 2016, the number of search requests per day on Baidu alone reached 6 billion times. With such a large number of search requests, the average accuracy of information retrieval is A one percent improvement will save a lot of time and effort in getting the information you need, and the value it yields is extraordinary. Major Internet companies are also constantly pursuing lower-cost and more efficient information retrieval technologies.

发明内容Contents of the invention

本发明所要解决的问题是，优化查询扩展最终以提高检索的平均精度。The problem to be solved by the present invention is to optimize query expansion to finally improve the average precision of retrieval.

本发明提供了一种基于伪相关反馈模型的信息检索方法，将词相关度融合到伪相关反馈模型中实现信息检索，包括在伪相关文档集合中生成查询扩展词的时候，分别生成以候选扩展词的重要度为特征的查询扩展词和以候选扩展词与查询主题词的相关度为特征的查询扩展词，再将两者结合到原查询扩展词中，完成最终的信息检索；生成以候选扩展词与查询主题词的相关度为特征的查询扩展词时，采用核函数计算文档中出现在不同位置上的查询词和候选词之间的相关度。The present invention provides an information retrieval method based on a pseudo-relevance feedback model, which integrates word relevance into the pseudo-relevance feedback model to realize information retrieval, including generating query expansion words in a pseudo-relevance document collection, respectively generating candidate extensions The query expansion words characterized by the importance of the word and the query expansion words characterized by the correlation between the candidate expansion words and the query subject words are combined into the original query expansion words to complete the final information retrieval; When querying extended words characterized by the correlation between the expanded word and the query subject word, the kernel function is used to calculate the correlation between the query word and the candidate word appearing in different positions in the document.

而且，将词相关度融合到伪相关反馈模型中实现信息检索，实现方式如下，Moreover, the word correlation is fused into the pseudo-correlation feedback model to realize information retrieval, and the implementation method is as follows,

当用户提交查询主题时，将查询主题进行预处理得到查询关键词Q，D为所有目标文档， N_D为目标文档集合D的文档总数，通过预设的检索权重模型计算查询关键词Q与目标文档集合D中的每一个文档的得分，按得分结果从高到低排列得到第一次查询结果；设根据伪相关反馈的方式取出目标文档集合D中的前N篇文档作为伪相关文档集合D₁，在进行查询扩展词选择的时候进行以下步骤，When a user submits a query subject, the query subject is preprocessed to obtain the query keyword Q, D is all target documents, N _D is the total number of documents in the target document set D, and the query keyword Q and target are calculated through the preset retrieval weight model The score of each document in the document collection D is arranged according to the score results from high to low to obtain the first query result; assume that the first N documents in the target document collection D are taken out according to the pseudo-relevant feedback method as the pseudo-related document collection D _1. Perform the following steps when selecting query expansion words,

步骤1，将伪相关文档集合D₁中每篇文档中所有的词作为扩展候选词，分别计算各扩展候选词t_j本身在伪相关文档集合D₁的文档d_i中的重要度得分得到各文档d_i的重要度向量如下，Step 1, take all the words in each document in the pseudo-related document collection D ₁ as extended candidate words, and calculate the importance score of each extended candidate word t _j in the document d _i of the pseudo-related document collection D ₁ Get the importance vector of each document d _i as follows,

其中，i＝1,2,3…,N，j＝1,2,3…,n；Among them, i=1,2,3...,N, j=1,2,3...,n;

计算扩展候选词在所有文档中的重要度得分向量如下，Calculate the importance score vector of the extended candidate word in all documents as follows,

将中每个扩展候选词的重要度得分取出后按从大到小的顺序排序，将得分最大的前n₁个值在对应的扩展候选词选取出来构成重要度查询扩展词集合Q₁，用多项式V₁表示重要度查询扩展词集合Q₁中的每个词和该词对应的重要度得分；Will After the importance _score of each extended candidate word in The corresponding extended candidate words are selected to form the importance query expansion word set Q ₁ , and the polynomial V ₁ is used to represent each word in the importance query expansion word set Q ₁ and the corresponding importance score of the word;

步骤2，将伪相关文档集合D₁中每篇文档中所有的词作为扩展候选词，分别根据共现位置和次数采用核函数计算各扩展候选词t_j与查询关键词Q在文档d_i中的相关度得分得到各文档d_i的相关度向量如下，Step 2, take all the words in each document in the pseudo-related document set _D1 as extended candidate words, and use the kernel function to calculate each extended candidate word _tj and query keyword Q in the document d _i according to the co-occurrence position and frequency respectively Relevance Score for Get the correlation vector of each document d _i as follows,

计算扩展候选词在所有文档中的相关度得分向量如下，Calculate the vector of relevance scores of the extended candidate across all documents as follows,

将中每个扩展候选词的相关度得分取出后按从大到小的顺序排序，将得分最大的前n₁个值在对应的扩展候选词选取出来构成相关度查询扩展词集合Q₁'，用多项式V₁'来表示查询扩展词集合Q₁'中的每个词和该词对应的相关度得分；Will After taking out the correlation score of each extended candidate word, it is sorted in descending order, and the top n ₁ values with the largest score are in the Corresponding expansion candidate words are selected to form the relevance query expansion word set Q ₁ ′, and the polynomial V ₁ ′ is used to represent each word in the query expansion word set Q _{1 ′} and the correlation score corresponding to the word;

步骤3，将步骤1和步骤2中所得多项式V₁和V₁'归一化后再进行线性组合，得到新的查询词多项式V如下，Step 3, normalize the polynomials V ₁ and V ₁ ' obtained in steps 1 and 2 and then perform linear combination to obtain a new query word polynomial V as follows,

V＝(1-γ)×||V₁||+γ×||V₁'||V＝(1-γ)×||V ₁ ||+γ×||V ₁ '||

其中，||X||表示对向量X进行归一化运算，γ为调节因子；Among them, ||X|| represents the normalization operation on the vector X, and γ is the adjustment factor;

步骤4，根据步骤3所得查询词多项式V按每个词项的系数从大到小排序，将系数最大的前n₁个词项取出得到新的扩展词集合 Step 4: According to the query word polynomial V obtained in step 3, sort the coefficients of each term from large to small, and take out the first n ₁ terms with the largest coefficients to obtain a new set of expanded words

步骤5，设查询关键词Q包括查询词q_s，s＝1,2,3…,m，将查询关键词Q表示为多项式V_Q，每个查询词的系数值设为1.0；将步骤4中得到的扩展词集合用多项式V'表示，Step 5, set the query keyword Q to include the query word q _s , s=1,2,3...,m, express the query keyword Q as a polynomial V _Q , and set the coefficient value of each query word to 1.0; set step 4 The set of expanded words obtained from Expressed by the polynomial V',

将查询多项式V_Q和查询扩展词多项式V'归一化后进行线性组合，到新的查询词多项式K 如下，The query polynomial V _Q and the query expansion word polynomial V' are normalized and linearly combined to obtain a new query word polynomial K as follows,

K＝α×||V_Q||+β×||V'||K＝α×||V _Q ||+β×||V'||

其中，α和β为调节因子；Among them, α and β are adjustment factors;

步骤6，根据步骤5所得查询词多项式K得到新的查询关键词集合Q'，使用新查询关键词集合Q'和Q'中每个查询词在查询词多项式K中对应的权重，采用预设的检索权重模型进行第二次信息检索，得到的查询结果作为最终信息检索结果。Step 6: Obtain a new query keyword set Q' according to the query word polynomial K obtained in step 5, use the weight corresponding to each query word in the query word polynomial K in the new query keyword set Q' and Q', and use the preset The retrieval weight model is used for the second information retrieval, and the obtained query results are taken as the final information retrieval results.

而且，步骤1中，重要度得分采用TFIDF、BM25或RM3方式求取。Also, in step 1, the importance score Use TFIDF, BM25 or RM3 to obtain.

而且，步骤2中，计算各扩展候选词t_j与查询关键词Q在文档d_i中的相关度得分实现如下，Moreover, in step 2, calculate the correlation score between each extended candidate word t _j and the query keyword Q in the document d _i The implementation is as follows,

设t_r和q_s在某个文档d_i中共现，表示为计算如下，Let t _r and q _s co-occur in some document d _i , expressed as Calculated as follows,

其中，表示t_r与q_s在文档d_i中的相关度，表示文档d_i中的共现频率，表示文档d_i中的共现反文档频率；in, Indicates the correlation between t _r and q _s in document d _i , Indicates that in document d _i the co-occurrence frequency of Indicates that in document d _i The co-occurrence anti-document frequency of ;

计算得出t_r与查询关键词Q在文档d_i中的相关度， Calculate the correlation between t _r and the query keyword Q in the document d _i ,

而且，文档d_i中的共现频率计算如下，Also, in document d _i The co-occurrence frequency of is calculated as follows,

其中，M和L分别表示t_r与q_s在文档d_i中出现的次数，表示文档d_i中出现的第k1个t_r，表示文档d_i中出现的第k2个q_s，k1＝1,2,3…,M，k2＝1,2,3…,L；是指以核函数体现的与的位置邻近程度。Among them, M and L represent the number of occurrences of t _r and q _s in document d _i respectively, Indicates the k1th t _r appearing in document d _i , Indicates the k2th q _s appearing in document d _i , k1=1,2,3...,M, k2=1,2,3...,L; refers to the embodiment of the kernel function and proximity of the location.

而且，核函数为高斯函数或三角函数。Also, the kernel function is a Gaussian function or a trigonometric function.

而且，核函数为高斯函数时，计算如下，Moreover, when the kernel function is a Gaussian function, the calculation is as follows,

其中，p_t和p_q分别表示与在文档中的位置值，σ是调节参数。Among them, p _t and p _q represent respectively and In the positional value in the document, σ is the tuning parameter.

而且，文档d_i中的共现反文档频率计算如下，Also, in document d _i The co-occurrence anti-document frequency of Calculated as follows,

其中，表示当时，在文档d_i中的共现的总次数。in, means when hour, The total number of co-occurrences in document d _i .

而且，所述预设的检索权重模型为基于向量空间模型、概率模型或语言模型。Moreover, the preset retrieval weight model is based on a vector space model, a probability model or a language model.

本发明还提供一种基于伪相关反馈模型的信息检索系统，包括计算机或服务器，在计算机或服务器上执行如上所述的方法。The present invention also provides an information retrieval system based on a pseudo-correlation feedback model, which includes a computer or a server, and executes the above-mentioned method on the computer or server.

依照本发明所提供的将核函数词相关度信息融到伪相关反馈模型中的信息检索方法，能够克服传统的伪相关反馈模型只考虑词频信息的不足。此外，通过核函数计算文档中出现在不同位置上的查询词和候选词之间的相关度，既能突出查询词和候选词的分布情况，选取与查询主题词相关程度更高的候选词，还能因为附加的相关度信息，从而定位更加精准的候选词、提高扩展查询及最终检索的平均精度。在多个国际信息检索评价标准数据集上的与国际上最好的多个模型的对比实验结果表明，依照本发明提供的将核函数的词相关度信息融入到伪相关反馈模型中的信息检索方法，在检索精确度上实现了显著的提升，达到国际领先水平。According to the information retrieval method provided by the present invention, which integrates the kernel function word correlation information into the pseudo-relevance feedback model, it can overcome the deficiency that the traditional pseudo-correlation feedback model only considers word frequency information. In addition, the correlation between query words and candidate words that appear in different positions in the document is calculated by the kernel function, which can not only highlight the distribution of query words and candidate words, but also select candidate words that are more relevant to the query subject words. Because of the additional relevance information, more accurate candidate words can be located, and the average precision of extended query and final retrieval can be improved. The experimental results of comparative experiments on multiple international information retrieval evaluation standard data sets with the best models in the world show that the information retrieval method that integrates the word relevance information of the kernel function into the pseudo-relevance feedback model provided by the present invention method, achieved a significant improvement in retrieval accuracy and reached the international leading level.

附图说明Description of drawings

图1为本发明实施例完整的信息检索过程流程图。FIG. 1 is a flowchart of a complete information retrieval process in an embodiment of the present invention.

具体实施方法Specific implementation method

本发明要解决的核心问题是：提出一种使用核函数来体现用户查询词与文档候选词之间的分布情况和两者之间的相关度，并把这种相关度作为附加权重融合到伪相关反馈模型中，实现查询扩展以提高检索的精确度。The core problem to be solved by the present invention is to propose a kernel function to reflect the distribution of user query words and document candidate words and the degree of correlation between the two, and to use this degree of correlation as an additional weight in the pseudo In the relevant feedback model, query expansion is implemented to improve the accuracy of retrieval.

以下结合附图和实施例，详细描述本发明的将核函数词相关度融合到伪相关反馈模型中的信息检索方法。The following describes in detail the information retrieval method of the present invention, which integrates the word correlation degree of the kernel function into the pseudo-correlation feedback model, in conjunction with the accompanying drawings and embodiments.

本发明针对经典方法中不尽合理的词汇独立假设，提出将词与词之间的相互关联关系考虑进来。通过对文档集合中数据的一些统计信息(比如上下文信息等反映词汇搭配使用关系的信息)的有效利用，结合查询条件设计相关技术方案来获得能够反映查询条件主题并由查询条件所触发的词汇，也就是利用这些信息来更准确的捕获用户的信息需求。Aiming at the unreasonable assumption of vocabulary independence in the classical method, the present invention proposes to take the correlation between words into consideration. Through the effective use of some statistical information of the data in the document collection (such as context information and other information reflecting the relationship between vocabulary collocations), and related technical solutions combined with query conditions to obtain vocabulary that can reflect the subject of the query conditions and be triggered by the query conditions, That is to use this information to more accurately capture the user's information needs.

本发明方法中采用的核函数原本是把原坐标系里线性不可分的数据用Kernel投影到另一个空间，尽量使得数据在新的空间里线性可分。而在本发明方法中将用它来评估一篇文档中两个词的相关程度。The kernel function adopted in the method of the present invention is to project the linearly inseparable data in the original coordinate system to another space with Kernel, so as to make the data linearly separable in the new space as much as possible. In the method of the present invention, it will be used to evaluate the degree of correlation between two words in a document.

参见附图1，实施例的流程为，当用户根据相关查询主题进行检索时：Referring to accompanying drawing 1, the procedure of the embodiment is, when the user searches according to the relevant query subject:

信息检索系统会根据目标文档集合建立查询索引，当用户提交相关查询主题时，系统会将查询主题进行预处理为查询关键词Q(Q是一个集合，一般包含多个主题词q₁、q₂、q₃等)， D为所有目标文档，N_D为目标文档集合D的文档总数。随后，检索系统会通过某种预设的检索权重模型(如TFIDF、BM25、RM3等)方式计算查询关键词Q与文档集合D中的每一个文档的得分，按得分结果从高到低排列得到第一次查询结果。根据伪相关反馈的原理，检索系统将取出文档集合D的第一次查询结果文档中的前N(在大量相关研究文献中，N一般为取值为10、20或30)篇文档作为伪相关文档集合D₁，N小于等于N_D，本领域技术人员可预设取值。在检索系统得到第一次查询所产生的伪相关文档集合D₁，并进行查询扩展词选择的时候进行以下步骤，The information retrieval system will establish a query index based on the target document collection. When the user submits a relevant query topic, the system will preprocess the query topic into query keywords Q (Q is a collection, generally including multiple subject words q ₁ , q ₂ , q _3, etc.), D is all target documents, and N _D is the total number of documents in the target document collection D. Subsequently, the retrieval system will calculate the query keyword Q and the score of each document in the document collection D through a certain preset retrieval weight model (such as TFIDF, BM25, RM3, etc.), and arrange the score results from high to low to obtain The result of the first query. According to the principle of pseudo-relevance feedback, the retrieval system will take out the first N (in a large number of relevant research literature, N is generally 10, 20 or 30) documents in the first query result documents of the document collection D as pseudo-relevance For the document collection D ₁ , N is less than or equal to N _D , and those skilled in the art can preset values. When the retrieval system obtains the pseudo-related document set D ₁ generated by the first query, and selects query expansion words, the following steps are performed,

步骤1，分别计算伪相关文档集合D₁中每篇文档中所有的词(即扩展候选词)本身的重要度得分，该重要度得分可以通过词的词频及逆文档词频(如TFIDF、BM25、RM3等)计算获得，再将不同文档中相同的词重要度得分以词向量的方式进行累加后除以D₁中的文档数N，即可得到所有的扩展候选词的重要度得分向量将向量中元素的得分按从大到小的顺序排列，取出前n₁(n₁一般为取值为10、20、30或50，本领域技术人员可预设取值)个得分在中所对应的词，得到重要度扩展候选词集合Q₁，可用一个多项式V₁来表示集合Q₁中的每个词和该词对应的重要度得分。Step 1, respectively calculate the importance score of all the words (i.e. extended candidate words) in each document in the pseudo-related document collection _D1 , the importance score can be obtained by the word frequency of the word and the reverse document word frequency (such as TFIDF, BM25, RM3, etc.), and then accumulate the same word importance scores in different documents in the form of word vectors and divide by the number of documents N in _D1 to obtain the importance score vectors of all extended candidate words Arrange the scores of the elements in the vector in descending order, and take out the first n ₁ (n ₁ is generally a value of 10, 20, 30 or 50, and those skilled in the art can preset the value) scores in The corresponding word in , get the importance expansion candidate word set Q ₁ , and a polynomial V ₁ can be used to represent each word in the set Q ₁ and the corresponding importance score of the word.

本发明中，将N篇伪相关文档集合D₁中的每篇文档看作词袋模型，以词向量的方式表示，其中第i篇文档的相关度向量公式如下所示。In the present invention, each document in the set of N pseudo-related documents _D1 is regarded as a bag-of-words model, which is expressed in the form of a word vector, and the correlation vector formula of the i-th document is as follows.

在上述公式中，表示伪相关文档集合D₁中的第i篇文档(i＝1,2,3…,N)d_i的词向量表达，t₁、t₂、t₃、…、t_n为伪相关文档集合D₁中所有文档中的所有词，n表示这些词的总数，即伪相关文档集合D₁中所有词的个数；表示对应的t₁、t₂、t₃、…、 t_n在文档d_i中的权重得分(也称重要度得分，权重用来表现扩展候选词的重要度)。某个词的重要度得分通过该词的词频及逆文档词频等信息(如TFIDF、BM25、RM3等)计算获得，如在使用TFIDF的方式计算文档d_i中词条t_j的重要度时，In the above formula, Represents the word vector expression of the i-th document (i=1,2,3...,N)d _i in the pseudo-related document collection _D 1 , t ₁ , t ₂ , t ₃ ,..., t _n are pseudo-related document collections All words in all documents in D ₁ , n represents the total number of these words, that is, the number of all words in the pseudo-related document collection D ₁ ; Indicates the corresponding weight scores of t ₁ , t ₂ , t ₃ _, _. The importance score of a word is obtained by calculating the word frequency of the word and the inverse document word frequency (such as TFIDF, BM25, RM3, etc.). For example, when using TFIDF to calculate the importance of term t _j in document d _i ,

其中，指某个词条t_j在文档d_i中的重要度得分(j＝1,2,3…,n)，TF(t_j,d)指词条t_j在文档d_i中出现的频率(次数)，N_D为目标文档集合D的文档总数，df(t_j)是伪相关集合D₁中，包含了词条t_j的文档个数。in, Refers to the importance score of a term t _j in document d _i (j=1,2,3...,n), TF(t _j ,d) refers to the frequency of term t _j appearing in document d _i ( times), N _D is the total number of documents in the target document set D, and df(t _j ) is the number of documents in the pseudo-related set D ₁ that contain the term t _j .

依照公式(2)，N个文档中的每个文档d_i都可以表示成相应的词的重要度的向量形式并对每个文档向量进行累加求和后再除以伪相关文档总数N，得到所有词条在所有文档中的重要度得分向量如公式(3)所示：According to formula (2), each document d _i in the N documents can be expressed as a vector form of the importance of the corresponding word And accumulate and sum each document vector and then divide it by the total number of pseudo-related documents N to get the importance score vector of all terms in all documents As shown in formula (3):

将中每个词的重要度得分取出后按从大到小的顺序排序，将得分最大的前n₁个值在对应的词选取出来构成重要度查询扩展词集合Q₁。为了方便后面的计算，用多项式V₁来表示集合Q₁中的每个词和该词对应的重要度得分，如公式(4)所示。Will After the importance score of each word in is taken out, it is sorted in descending order, and the top n ₁ values with the largest score are in The corresponding words are selected to form an importance query expansion word set Q ₁ . In order to facilitate subsequent calculations, a polynomial V ₁ is used to represent each word in the set Q ₁ and the corresponding importance score of the word, as shown in formula (4).

在公式(4)中，qh₁、qh₂、qh₃、…、表示Q₁中每个具体的扩展候选词(一共有n₁个)，wh₁、wh₂、wh₃、…、表示对应的扩展候选词在中的得分。In formula (4), qh ₁ , qh ₂ , qh ₃ , ..., Indicates each specific expansion candidate word in Q ₁ (a total of n ₁ ), wh ₁ , wh ₂ , wh ₃ ,..., Indicates that the corresponding extended candidate words are in score in .

步骤2，通过依次计算伪相关文档集合D₁中每篇文档中所有的词(即扩展候选词)与查询词之间的相关度得分，相关度得分根据每篇文档中查询词与扩展候选词的位置由核函数计算获得，再将不同文档中相同的词的得分累加，即可得到所有扩展候选词与查询词的相关度得分向量将向量中元素的得分按从大到小的顺序排列，取出前n₁(n₁一般为取值为10、 20、30或50)个得分在中所对应的词，得到相关度扩展候选词集合Q₁'，这里我们用一个多项式V₁'来表示集合Q₁'中的每个词和该词的相关度得分。Step 2, by sequentially calculating the correlation score between all the words (i.e. extended candidate words) and the query word in each document in the pseudo-related document set _D1 , the correlation score is based on the query word and the extended candidate word in each document The position of is calculated by the kernel function, and then the scores of the same word in different documents are accumulated to obtain the correlation score vector of all extended candidate words and query words Arrange the scores of the elements in the vector in descending order, and take out the first n ₁ (n ₁ is generally 10, 20, 30 or 50) scores in Corresponding words in , get the set Q ₁ ' of candidate words for correlation expansion, here we use a polynomial V ₁ ' to represent the correlation score between each word in the set Q ₁ ' and the word.

为了便于解释，给定扩展候选词t_r和查询词q_s，(其中，r＝1,2,3…,n，n为伪相关文档集合D₁中所有词的个数，s＝1,2,3…,m，m为查询关键词Q集合中的词个数)，如果t_r和q_s在某个文档d_i中共现，在此表示为则它们存在一个共现权重(即相关度)。由于t_r和q_s可能在一篇文档中的多个位置出现，因此不能简单地以的共现次数表示t_r与q_s在文档 d_i中的相关度，为了更合理的衡量它，本发明进一步提出如下公式：For ease of explanation, given extended candidate word t _r and query word q _s , (where r=1, 2, 3..., n, n is the number of all words in the pseudo-related document collection D ₁ , s=1, 2,3...,m, m is the number of words in the set of query keywords Q), if t _r and q _s co-occur in a certain document d _i , expressed as Then they have a co-occurrence weight (ie correlation). Since t _r and q _s may appear in multiple places in a document, they cannot simply be The number of co-occurrences of represents the correlation between t _r and q _s in document d _i . In order to measure it more reasonably, the present invention further proposes the following formula:

在公式(5)中，表示t_r与q_s在文档d_i中的相关度。In formula (5), Indicates the correlation between t _r and q _s in document d _i .

在公式(5)中，表示文档d_i中的共现频率，其具体的计算公式如下：In formula (5), Indicates that in document d _i The co-occurrence frequency of , the specific calculation formula is as follows:

在公式(6)中，M和L分别表示t_r与q_s在文档d_i中出现的次数，表示文档d_i中出现的第k1个t_r，表示文档d_i中出现的第k2个q_s，k1＝1,2,3…,M，k2＝1,2,3…,L。其中，Kernel()表示核函数，核函数是一类可以通过词的位置信息来衡量两个词之间的邻近关系，当两个词共现的位置更接近，其邻近关系更强，即相关程度更高。如高斯函数，三角函数等在很多场景都非常有效。实施例的是指以高斯核函数(具体实施时也可以采用其他核函数)来体现与的位置邻近程度，如公式(7)：In formula (6), M and L represent the number of occurrences of t _r and q _s in document d _i respectively, Indicates the k1th t _r appearing in document d _i , Indicates the k2th q _s appearing in document d _i , k1=1, 2, 3..., M, k2=1, 2, 3..., L. Among them, Kernel() represents the kernel function. The kernel function is a kind of adjacent relationship between two words that can be measured by the position information of the word. When the co-occurrence position of the two words is closer, the adjacent relationship is stronger, that is, the correlation to a higher degree. Such as Gaussian functions, trigonometric functions, etc. are very effective in many scenarios. Example of It refers to the Gaussian kernel function (other kernel functions can also be used in specific implementation) to reflect and The proximity of the location, such as the formula (7):

其中，p_t和p_q分别表示与在文档中的位置值(即该词在文档中的出现序号，为一个正整数)，σ是一个调节参数，用于调节高斯函数的分布，σ的取值范围在10到100之前，在具体实施例中优选取50。Among them, p _t and p _q represent respectively and The position value in the document (that is, the sequence number of the word in the document, which is a positive integer), σ is an adjustment parameter, which is used to adjust the distribution of the Gaussian function, and the value range of σ is between 10 and 100. In the specific Preferably take 50 in the embodiment.

在公式(5)中，表示文档d_i中的共现反文档频率，其具体计算方式如下：In formula (5), Indicates that in document d _i The co-occurrence anti-document frequency of is calculated as follows:

公式(5)给出了t_r与q_s在文档d_i中的相关度由于q_s是查询关键词集合Q中的一个查询词，通过公式(5)，可以得出t_r与查询关键词Q在文档d_i中的相关度，本发明用来表示，具体计算公式如下：Equation (5) gives the correlation between t _r and q _s in document d _i Since q _s is a query term in the query keyword set Q, by formula (5), the correlation between t _r and the query keyword Q in the document d _i can be obtained, and the present invention uses To express, the specific calculation formula is as follows:

依照公式(9)，N篇伪相关文档集合D₁中的第i篇文档d_i都可以表示成相应的扩展候选词与查询词的相关度向量形式即具体公式如下。According to the formula (9), the i-th document d _i in the set of N pseudo-related documents D ₁ can be expressed in the form of the correlation vector between the corresponding extended candidate words and query words, that is, The specific formula is as follows.

接下来对每个文档相关度向量进行累加求和后再除以伪相关文档总数N，最终得到所有词条在所有文档中的相关度得分向量如公式(11)所示：Next, for each document relevance vector Carry out cumulative summation and then divide by the total number of pseudo-related documents N, and finally get the relevance score vector of all terms in all documents As shown in formula (11):

将中每个词的相关度得分取出后按从大到小的顺序排序，将得分最大的前n₁个值在对应的词选取出来构成相关度查询扩展词集合Q₁'。为了方便后面的计算，用多项式V₁'来表示集合Q₁'中的每个词和该词对应的相关度得分，如公式(12)所示。Will After taking out the correlation score of each word, sort it in descending order, and put the top n ₁ values with the largest score in Corresponding words are selected to form the set Q ₁ ' of relevance query expansion words. To facilitate subsequent calculations, the polynomial V ₁ ' is used to represent each word in the set Q ₁ ' and the correlation score corresponding to the word, as shown in formula (12).

在公式(12)中，qh₁'、qh'₂、qh₃'、…、表示Q₁'中每个具体的扩展词(一共有n₁个)， wh′₁、wh′₂、wh′₃、...、表示对应的扩展词在中的得分。In formula (12), qh ₁ ', qh' ₂ , qh ₃ ', ..., Indicate each specific expansion word in Q ₁ ' (n ₁ in total), wh′ ₁ , wh′ ₂ , wh′ ₃ , ..., Indicates that the corresponding expansion word is in score in .

步骤3，将步骤1和步骤2中得到的查询扩展词多项式V₁和V₁'归一化后再进行线性组合得到新的查询词多项式V，具体组合方式如公式(13)所示。Step 3: Normalize the query expansion term polynomials V ₁ and V ₁ ′ obtained in Step 1 and Step 2, and then linearly combine them to obtain a new query term polynomial V. The specific combination method is shown in formula (13).

V＝(1-γ)×||V₁||+γ×||V₁'|| 公式(13)V＝(1-γ)×||V ₁ ||+γ×||V ₁ '|| Formula (13)

在公式(13)中，||X||表示对向量X进行归一化运算，归一化的目的是统一量纲，即将向量中每个元素的值规范到区间[0,1.0]中，方便后续的参数调节。归一化有多种方法可以实现，本实施例中采用的是除以最大值法，即每个元素归一化后的值为该元素原来的值除以向量中元素的最大值。例如有一个向量[1,2,3,4]，有4个元素，元素的最大值是4，那么对这个向量进行除以最大值法归一化后的结果为即[0.25,0.5,0.75,1]，可以看到原向量中的所有的值都规范到区间[0,1.0]中了。In formula (13), ||X|| represents the normalization operation on the vector X. The purpose of normalization is to unify the dimension, that is, to standardize the value of each element in the vector to the interval [0,1.0]. It is convenient for subsequent parameter adjustment. Normalization can be realized in many ways. In this embodiment, the method of dividing by the maximum value is adopted, that is, the normalized value of each element is divided by the original value of the element by the maximum value of the element in the vector. For example, there is a vector [1,2,3,4] with 4 elements, and the maximum value of the element is 4, then the normalized result of dividing this vector by the maximum value method is That is [0.25,0.5,0.75,1], you can see that all the values in the original vector are normalized to the interval [0,1.0].

公式(13)中的调节因子γ的取值范围为0到1.0，它的功能是用来平衡扩展词的重要度得分和扩展词与查询词之前的相关度得分，在具体应用时，可以预先用测试数据在需要应用的目标文档集上测试得出γ的最优值。The adjustment factor γ in the formula (13) ranges from 0 to 1.0, and its function is to balance the importance score of the extended word and the correlation score between the extended word and the query word. In specific applications, it can be adjusted in advance Use the test data to test on the target document set that needs to be applied to get the optimal value of γ.

步骤4，根据步骤3中的多项式V按每个词项的系数(综合权重得分)从大到小排序，将系数最大的前n₁个词项取出得到新的扩展词集合即为最终的查询扩展词集合。Step 4, according to the polynomial V in step 3, sort the coefficient (comprehensive weight score) of each term from large to small, and take out the first n+ ₁ terms with the largest coefficient to obtain a new extended word set That is, the final set of query expansion words.

步骤5，将原查询关键词集合Q表示为多项式V_Q，多项式V_Q中的每个项为Q中的每个查询词q_s，s＝1,2,3…,m，每个项的系数值设置为1.0，则可表示为Step 5, express the original query keyword set Q as a polynomial V _Q , each item in the polynomial V _Q is each query word q _s in Q, s=1,2,3...,m, each item's If the coefficient value is set to 1.0, it can be expressed as

V_Q＝1.0×q₁+1.0×q₂+1.0×q₃+...+1.0×q_m 公式(14)V _Q ＝1.0×q ₁ +1.0×q ₂ +1.0×q ₃ +...+1.0×q _m formula (14)

接着，将步骤4中得到的扩展词集合也用多项式V'来表示，多项式V'中的每个项为中的每个具体的查询扩展词，每个项(词项)的系数为该词项在步骤4中多项式V中对应的值，Next, the expanded word set obtained in step 4 Also represented by a polynomial V', each term in the polynomial V' is For each specific query expansion term in , the coefficient of each item (term) is the corresponding value of the term in the polynomial V in step 4,

其中，q'₂、q'₃、…、表示中每个具体的扩展词(一共有n₁个)，w'₁、w'₂、w'₃、…、表示对应的扩展词在查询词多项式V中的得分。Among them, q' ₂ , q' ₃ ,..., express Each specific expansion word in (a total of n ₁ ), w' ₁ , w' ₂ , w' ₃ ,..., Indicates the score of the corresponding expanded word in the query word polynomial V.

将查询多项式V_Q和查询扩展词多项式V'归一化后再次进行线性组合得到新的查询词多项式K，具体组合方式如公式(16)所示。The query polynomial V _Q and the query expansion word polynomial V' are normalized and linearly combined again to obtain a new query word polynomial K. The specific combination method is shown in formula (16).

K＝α×||V_Q||+β×||V'|| 公式(16)K＝α×||V _Q ||+β×||V'|| formula (16)

在公式(16)中采用了与步骤3一致的归一化方法，公式中的调节因子α一般取固定值1.0,调节因子β的取值范围为0到1.0，它的功能是用来平衡原查询词与扩展查询词之前的权重，具体实施时可设置为经验值。In the formula (16), the normalization method consistent with step 3 is adopted. The adjustment factor α in the formula generally takes a fixed value of 1.0, and the value range of the adjustment factor β is 0 to 1.0. Its function is to balance the original The weight between the query word and the extended query word can be set as an empirical value during specific implementation.

步骤6，根据步骤5可以得到新的查询关键词集合Q'，Q'中的每个查询词即为查询词多项式K中每个词项。使用新查询关键词集合Q'和Q'中每个查询词在查询词多项式K中对应的权重进行第二次信息检索(与第一次检索所采用同一个检索模型)，即再次计算Q'与目标文档集合D中每篇文档的得分，得到的查询结果即为最终信息检索结果。Step 6, according to step 5, a new query keyword set Q' can be obtained, and each query word in Q' is each term in the query word polynomial K. Use the weight of each query term in the query term polynomial K in the new query keyword set Q' and Q' to perform the second information retrieval (the same retrieval model as the first retrieval), that is, calculate Q' again and the score of each document in the target document set D, the query result obtained is the final information retrieval result.

在进行第二次检索时，查询词为新产生的查询关键词集合Q'，在计算查询词与每个文档的得分时，每个查询词的权重为该词在查询词多项式K中的系数，而在进行第一次检索每个查询词的权重为1.0。When performing the second search, the query word is the newly generated query keyword set Q', and when calculating the score of the query word and each document, the weight of each query word is the coefficient of the word in the query word polynomial K , while each query term is given a weight of 1.0 during the first retrieval.

具体实施时，本领域技术人员可采用软件技术实现以上流程的自动运行。相应地，如果提供一种基于伪相关反馈模型的信息检索系统，包括计算机或服务器，在计算机或服务器上执行以上流程将词相关度融合到伪相关反馈模型中实现信息检索，也应当在本发明的保护范围内。During specific implementation, those skilled in the art can use software technology to realize the automatic operation of the above process. Correspondingly, if an information retrieval system based on a pseudo-relevance feedback model is provided, including a computer or a server, and the above process is executed on the computer or server to integrate the word correlation into the pseudo-relevance feedback model to realize information retrieval, it should also be included in the present invention within the scope of protection.

例如，信息检索的开发环境为Java、Python开发环境，开发支持库为Lucene。For example, the development environment for information retrieval is Java and Python development environments, and the development support library is Lucene.

信息检索框架可为基于向量空间模型、概率模型、语言模型等伪相关反馈信息检索框架。The information retrieval framework can be based on vector space models, probability models, language models and other pseudo-relevant feedback information retrieval frameworks.

为了验证本发明方法的实际效果，在多个标准数据集上的做了对比实验，对比实验分两组，一组采用标准的Rocchio伪相关反馈信息检索模型，另外一组采用结合了本发明方法的 Rocchio伪相关反馈信息检索模型简称为KRC。本实验使用了六个标准的国际数据集，它们包括AP88-89、AP90、DISK1&2、DISK4&5、WT2G和WT10G,这些数据集的信息如下表(表1)所示：In order to verify the actual effect of the method of the present invention, a comparative experiment has been done on a plurality of standard data sets, and the comparative experiment is divided into two groups, one group adopts the standard Rocchio pseudo-correlation feedback information retrieval model, and the other group adopts the combination of the method of the present invention The Rocchio pseudo-relevance feedback information retrieval model is abbreviated as KRC. This experiment uses six standard international datasets, including AP88-89, AP90, DISK1&2, DISK4&5, WT2G, and WT10G. The information of these datasets is shown in the following table (Table 1):

数据集名称dataset name 文档总数total number of documents 大小size 查询主题编号Query subject number 查询主题数Number of query topics AP90AP90 78,32178,321 0.23Gb0.23Gb 51-10051-100 5050 AP88-89AP88-89 164,597164,597 0.50Gb0.50Gb 51-10051-100 5050 DISK1&2DISK1&2 741,856741,856 2.03Gb2.03Gb 51-20051-200 150150 DISK4&5DISK4&5 528,155528,155 1.85Gb1.85Gb 301-450301-450 150150 WT2GWT2G 247,491247,491 2.14Gb2.14Gb 401-450401-450 5050 WT10GWT10G 1,692,0961,692,096 10Gb10Gb 451-550451-550 100 100

表1 六个数据集的基本信息Table 1 Basic information of the six datasets

在对比实验中，采用了本发明方法中的核函数选用了高斯核函数(也可以采用其他的核函数)，高斯核函数中σ值取50。为了使实验更加公平，查询扩展词个数N1分别选取了10、 20、30和50四种情况，不同情况下的实验结果如下表(表2)所示：In the comparison experiment, the kernel function in the method of the present invention is adopted to select a Gaussian kernel function (other kernel functions can also be used), and the value of σ in the Gaussian kernel function is 50. In order to make the experiment more fair, the number of query expansion words N1 was selected in four cases: 10, 20, 30 and 50 respectively. The experimental results in different cases are shown in the following table (Table 2):

表2 Rocchio与KRC模型在六个标准数据集上的平均精度(MAP)对比Table 2 Comparison of average precision (MAP) between Rocchio and KRC models on six standard datasets

在表2中，第二列中的Rocchio模型没有采用本发明方法，而KRC模型是采用了本发明方法的Rocchio模型，MAP为检索结果的平均精度，从表中可以观察得出，本发明方法在Rocchio伪相关模型上的检索精度提升效果显著，表明本发明技术方案有效。In table 2, the Rocchio model in the second column does not adopt the method of the present invention, and the KRC model is the Rocchio model that has adopted the method of the present invention, and MAP is the average precision of retrieval result, can observe from the table, the method of the present invention The retrieval accuracy improvement effect on the Rocchio pseudo-correlation model is remarkable, indicating that the technical solution of the present invention is effective.

Claims

1. An information retrieval method based on a pseudo-relevance feedback model, characterized in that: the word relevance is fused into the pseudo-relevance feedback model to realize information retrieval, including generating query expansion words in a pseudo-relevance document collection, respectively generating the following The query expansion words characterized by the importance of the candidate expansion words and the query expansion words characterized by the correlation between the candidate expansion words and the query subject words are combined into the original query expansion words to complete the final information retrieval; generate When the query expansion word is characterized by the correlation between the candidate expansion word and the query subject word, the kernel function is used to calculate the correlation between the query word and the candidate word appearing in different positions in the document;

Said fusion of word relevance into the pseudo-correlation feedback model to realize information retrieval, the implementation is as follows,

When a user submits a query subject, the query subject is preprocessed to obtain the query keyword Q, D is all target documents, N _D is the total number of documents in the target document set D, and the query keyword Q and target are calculated through the preset retrieval weight model The score of each document in the document collection D is arranged according to the score results from high to low to obtain the first query result; assume that the first N documents in the target document collection D are taken out according to the pseudo-relevant feedback method as the pseudo-related document collection D _1. Perform the following steps when selecting query expansion words,

Step 1, take all the words in each document in the pseudo-related document collection D ₁ as extended candidate words, and calculate the importance score of each extended candidate word t _j in the document d _i of the pseudo-related document collection D ₁ Get the importance vector of each document d _i as follows,

Among them, i=1,2,3...,N, j=1,2,3...,n;

Calculate the importance score vector of the extended candidate word in all documents as follows,

Will After the importance _score of each extended candidate word in The corresponding extended candidate words are selected to form the importance query expansion word set Q ₁ , and the polynomial V ₁ is used to represent each word in the importance query expansion word set Q ₁ and the corresponding importance score of the word;

Step 2, take all the words in each document in the pseudo-related document set D ₁ as extended candidate words, and use the kernel function to calculate each extended candidate word t _r and the query keyword Q in the document d _i according to the co-occurrence position and frequency respectively Relevance Score for Get the correlation vector of each document d _i as follows,

Among them, i=1,2,3...,N, r=1,2,3...,n;

Calculate the vector of relevance scores of the extended candidate across all documents as follows,

Will After taking out the correlation score of each extended candidate word, it is sorted in descending order, and the top n ₁ values with the largest score are in the Corresponding extended candidate words are selected to form a relevance query expansion word set Q′ ₁ , and the polynomial V ₁ ′ is used to represent each word in the query expansion word set Q′ ₁ and the corresponding correlation score of the word;

Step 3, normalize the polynomials V ₁ and V ₁ ' obtained in steps 1 and 2 and then perform linear combination to obtain a new query word polynomial V as follows,

V＝(1-γ)×||V ₁ ||+γ×||V ₁ '||

Among them, ||X|| represents the normalization operation on the vector X, and γ is the adjustment factor;

Step 4: According to the query word polynomial V obtained in step 3, sort the coefficients of each term from large to small, and take out the first n ₁ terms with the largest coefficients to obtain a new set of expanded words

Step 5, set the query keyword Q to include the query word q _s , s=1,2,3...,m, express the query keyword Q as a polynomial V _Q , and set the coefficient value of each query word to 1.0; set step 4 The set of expanded words obtained from Expressed by the polynomial V',

The query polynomial V _Q and the query expansion word polynomial V' are normalized and linearly combined, and the new query word polynomial K is as follows,

K＝α×||V _Q ||+β×||V'||

Among them, α and β are adjustment factors;

Step 6: Obtain a new query keyword set Q' according to the query word polynomial K obtained in step 5, use the weight corresponding to each query word in the query word polynomial K in the new query keyword set Q' and Q', and use the preset The retrieval weight model is used for the second information retrieval, and the obtained query results are taken as the final information retrieval results.

2. The information retrieval method based on the pseudo correlation feedback model according to claim 1, characterized in that: in step 1, the importance score Use TFIDF, BM25 or RM3 to obtain.

3. according to the described information retrieval method based on pseudo correlation feedback model of claim 1, it is characterized in that: in step 2, calculate the relevancy score of each expanded candidate word t _r and query keyword Q in document d _i The implementation is as follows,

Let t _r and q _s co-occur in some document d _i , expressed as Calculated as follows,

in, Indicates the correlation between t _r and q _s in document d _i , Indicates that in document d _i the co-occurrence frequency of Indicates that in document d _i The co-occurrence anti-document frequency of ;

Calculate the correlation between t _r and the query keyword Q in the document d _i ,

4. according to the described information retrieval method based on pseudo correlation feedback model of claim 3, it is characterized in that: in document d _i The co-occurrence frequency of is calculated as follows,

Among them, M and L represent the number of occurrences of t _r and q _s in document d _i respectively, Indicates the k1th t _r appearing in document d _i , Indicates the k2th q _s appearing in document d _i , k1=1,2,3...,M, k2=1,2,3...,L; refers to the embodiment of the kernel function and proximity of the location.

5. The information retrieval method based on the pseudo correlation feedback model according to claim 4, characterized in that: the kernel function is a Gaussian function or a trigonometric function.

6. according to the described information retrieval method based on pseudo correlation feedback model of claim 5, it is characterized in that: when kernel function is Gaussian function, calculate as follows,

Among them, p _t and p _q represent respectively and In the positional value in the document, σ is the tuning parameter.

7. The information retrieval method based on the pseudo-correlation feedback model according to claim 4, characterized in that: in the document d _i The co-occurrence anti-document frequency of Calculated as follows,

in, means when hour, The total number of co-occurrences in document d _i .

8. According to claim 1 or 2 or 3 or 4 or 5 or 6 or 7 described information retrieval method based on pseudo correlation feedback model, it is characterized in that: described preset retrieval weight model is based on vector space model, probability model or language models.

9. An information retrieval system based on a pseudo-relevant feedback model, characterized in that it comprises a computer or a server, and the method according to claims 1 to 8 is executed on the computer or server.