CN1996316A - Search engine searching method based on web page correlation

Info

Publication number: CN1996316A
Application number: CN 200710056425
Authority: CN
Inventors: 侯越先
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2007-01-09
Filing date: 2007-01-09
Publication date: 2007-07-11

Abstract

A search engine search method based on web page correlation. The method provides two result lists to the user within a single query. Using the information supplied by the user's first click, it addresses both synonymy (several words with one meaning) and polysemy (one word with several meanings), and thereby the inability of keyword-based search engines to determine the user's query intent accurately. The user receives web pages that are related both to the keywords and to the page the user showed interest in, with no added operational complexity. In addition, updating the difference matrix with click data judges the difference between web pages from a new angle: the difference is a statistical one that emerges from a large volume of data and reflects judgments made by a large number of search engine users while using the search engine. The invention therefore relies on page-level correlation (difference) analysis that is statistically stationary, and can provide a user with statistically optimized service without long-term tracking of that particular user's behavior.

Description

A Search Engine Search Method Based on Web Page Correlation

Technical field

The invention belongs to the technical field of search engine search in computer networks, and in particular relates to a search engine search method based on web page correlation.

Background art

Search engine technology uses a combination of keywords to find relevant information on the network, ranks that information by how well it matches the keywords, and returns it to the user. With the rapid development of the Internet, search engines have become the main way for network users to obtain network resources. In recent years a wide variety of search engines have appeared around the world, and they play an important role in how people obtain information. The main search engines today can be divided into directory search engines and keyword-based search engines. A directory search engine pre-classifies the web page library; the user then chooses the category of page needed and searches in the corresponding directory. The most representative directory search engine at present is Yahoo [http://www.yahoo.com]. However, delivering a best set of search results usually requires a very fine classification granularity, and applying existing manual or automatic classification techniques to the massive volume of network information is unrealistic. Moreover, even if a search engine did provide very fine categories, the user's selection process would become very complicated, and there is no guarantee that the user's judgment matches the search engine's existing classification.

Most search engines on the Internet currently use keyword-based query technology; typical representatives are Google [http://www.google.com] and Baidu [http://www.baidu.com].

The volume of information that such search engines collect and index programmatically is extremely large, while users' queries mostly consist of only a few words. Because words are inherently ambiguous, it is hard for the search engine to determine what the user needs. The result is a huge number of search results whose relevance cannot be guaranteed, so users must spend considerable effort browsing and filtering them. In short, the quality of the information returned by current search engines is not very high.

In addition, the ranking algorithms used by search engines generally include the following.

(1) Ranking algorithms based on word frequency statistics. Many early search engines ranked by word frequency. The calculation of a word's weight generally takes into account where the word appears in the page; for example, a word in the title is weighted higher than a word in the body. However, because the volume of network resources is huge, two pages with the same word frequency may differ greatly in quality, and computing the relevance between a page and a keyword from word frequency is unreliable. The limitations of this kind of algorithm are therefore obvious.

(2) Ranking algorithms based on hyperlink analysis. Citation analysis in traditional information retrieval theory is one of the important methods for determining the authority of academic literature: authority is determined from the quantity and quality of citations. Ranking algorithms based on hyperlink analysis borrow this idea for computing the importance of network documents. Using the hyperlink structure of the network itself, they assign every web page an importance rank based on how many times the page is cited and on the importance of the citing pages, and use that rank to help optimize ranking. However, such an algorithm yields the importance rank of the page itself, not the relevance between the page and the keywords the user queried. Query results therefore often contain pages of high quality that are not necessarily relevant to the user's query needs.

Summary of the invention

To solve the above problems, the object of the present invention is to provide a search engine search method based on web page correlation that accurately identifies the user's needs without increasing operational complexity, thereby improving the correlation between the search engine's results and the user's needs.

To achieve this object, the search engine search method based on web page correlation provided by the present invention comprises the following steps carried out in order:

(1) While the search engine is running, record the click behavior data of network users on the search engine's result lists over a period of time;

(2) Calculate the degree of difference between all web pages with a method based on the vector space model, and save it;

(3) Update the degrees of difference between all web pages obtained in step 2 with the click data recorded in step 1;

(4) Treat the degrees of difference obtained in step 3 as distances between web pages, and reduce the dimensionality of these distance data with a dimensionality reduction algorithm, thereby obtaining a low-dimensional geometric representation of the inter-page difference data;

(5) When the search engine receives a query request from a user, perform the following steps:

(a) The search engine accepts the query keywords entered by the user, obtains an initial query result list for those keywords with some relevance calculation method, and submits the list to the user;

(b) After viewing the initial query result list, the user clicks a link of interest;

(c) The search engine records the first link the user clicks and marks the corresponding web page as the target web page. Using the low-dimensional geometric representation obtained in step 4, it then calculates the degree of difference between the target web page and the web pages corresponding to all links in the initial query result list, and arranges those pages in order of increasing difference to form a new query result;

(d) The new query result is submitted to the user. It is the final query result: related to the first web page the user clicked and highly related to the query keywords the user entered.

In step 1, the recording period is one month per cycle, with long-term dynamic tracking.

The search engine search method based on web page correlation provided by the present invention has the following beneficial effects:

1. The invention provides two result lists to the user within a single query. Using the information supplied by the user's first click, it addresses both synonymy and polysemy, and thereby the inability of keyword-based search engines to determine the user's query intent accurately. Providing a second result list based on the user's first click gives the user web pages that are related both to the keywords and to the page the user showed interest in, with no added operational complexity.

2. Experience and intuition suggest that only similar, highly related web pages are likely to be visited together by a user, so click data contain users' judgments about the difference between web pages. Updating the difference matrix with click data judges the difference between web pages from a new angle: the difference is a statistical one that emerges from a large volume of data and reflects judgments made by a large number of search engine users while using the search engine. The invention therefore relies on page-level correlation (difference) analysis that is statistically stationary, and can provide a user with statistically optimized service without long-term tracking of that particular user's behavior.

Detailed description

The search engine search method based on web page correlation provided by the present invention collects users' click behavior data to determine the type of information content the user actually needs, and at the same time uses the click data as one basis for judging the correlation between web pages, thereby improving the relevance of query results to user needs.

Users of a search engine generally do not click links in the result list at random; they make a choice, so click data constitute information-rich implicit feedback. Because users tend to click the links that match their needs, a search engine can infer the user's immediate need by tracking the links the user clicks, and so resolve the ambiguity of query terms. For example, a search engine can provide a dynamic query result that is related both to the query terms and to the content of the link the user has just clicked. This determines the meaning the user intended by the query terms and adapts the search results to the user's needs.

Within a single query, a user's need is usually fairly narrow, and on the whole the user does not click without reason, so the links clicked together during one query are strongly related to one another. The invention stores this co-click information in an n×n matrix and uses it as the basis for updating the correlation between web pages. That is, the invention maintains the differences in web page content derived from the click data of a large number of users and, for each query request, identifies the query topic and query intent by tracking the user's click together with that difference information. It finally provides the user with a final query result that is related to the first web page the user clicked and highly related to the query keywords the user entered.

The search engine search method based on web page correlation provided by the present invention is described in detail below.

The method comprises the following steps carried out in order:

(1) While the search engine is running, record the click behavior data of network users on the search engine's result lists over a period of time. Because click behavior data must accumulate, this step needs to run for some time alongside the search engine.

(2) Calculate the degree of difference between all web pages with a method based on the vector space model, and save it. The degree of difference between web pages is the opposite of their degree of correlation; it is a quantitative definition of how much two pages differ. The higher the correlation between two pages, the smaller their difference.

In this process, the difference matrix D is first established and then kept up to date, which requires maintaining the following data structures:

Co-visit count matrix A: an n×n symmetric matrix that stores, for every pair of web pages, the number of times the two pages were visited together.

Click count vector B: an n×1 vector whose elements b_i are non-negative integers in [0, +∞); each element stores the total number of clicks received by the corresponding web page.

Initial difference matrix D⁰: an n×n symmetric matrix computed from the vector space model. Let Doc = {doc_i | 1 ≤ i ≤ n} denote a set of web pages. Under the vector space model, each web page doc_i can be represented as a vector doc_i, and the element d_ij⁰ in row i and column j of D⁰ can be defined as:

$d_{ij}^{0} \equiv \dfrac{\left\| \dfrac{doc_i}{\|doc_i\|_2} - \dfrac{doc_j}{\|doc_j\|_2} \right\|_2}{\max_{k,l} \left\| \dfrac{doc_k}{\|doc_k\|_2} - \dfrac{doc_l}{\|doc_l\|_2} \right\|_2} \qquad (1)$

Here ‖·‖₂ is the 2-norm. By definition, d_ij⁰ is a normalized value in [0, 1], and the elements of D⁰ satisfy the metric axioms (satisfying the metric axioms is a necessary property for D to admit a geometric embedding).
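Equation (1) can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `initial_difference_matrix` and the input format (a list of equal-length term-weight vectors, none of them all-zero) are assumptions, since the text does not specify how the vector space model vectors are built.

```python
import math

def initial_difference_matrix(docs):
    """Build D0 per equation (1) from VSM vectors (hypothetical input format)."""
    # Scale every document vector to unit 2-norm (assumes no all-zero vector).
    unit = []
    for vec in docs:
        norm = math.sqrt(sum(x * x for x in vec))
        unit.append([x / norm for x in vec])
    n = len(unit)
    # Pairwise Euclidean distance between the unit vectors.
    raw = [[math.dist(unit[i], unit[j]) for j in range(n)] for i in range(n)]
    # Divide by the largest pairwise distance so every entry lies in [0, 1].
    largest = max(max(row) for row in raw)
    if largest == 0:
        return raw  # all pages identical in direction; nothing to normalize
    return [[x / largest for x in row] for row in raw]

D0 = initial_difference_matrix([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
```

In this small example pages 0 and 2 point in the same direction, so their difference is 0, while page 1 is orthogonal to both and sits at the maximum difference of 1.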

Click difference matrix C: an n×n matrix whose elements are defined directly as

$c_{ij} \equiv \begin{cases} 1 - \dfrac{a_{ij}}{\max\{b_i, b_j\}}, & i \neq j \\ 0, & i = j \end{cases} \qquad (2)$
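A minimal Python sketch of equation (2) follows. The function name is hypothetical, and one detail is an assumption not covered by the text: when neither page has received any click, max{b_i, b_j} is 0 and the ratio is undefined, so the sketch treats such a pair as maximally different (c_ij = 1).

```python
def click_difference_matrix(A, B):
    """Compute C per equation (2) from co-visit counts A and click counts B."""
    n = len(B)
    C = [[0.0] * n for _ in range(n)]  # diagonal stays 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            denom = max(B[i], B[j])
            # Assumption: with no clicks on either page the pair carries no
            # evidence of relatedness, so it is treated as maximally different.
            C[i][j] = 1.0 - A[i][j] / denom if denom > 0 else 1.0
    return C

A = [[0, 2, 0, 0], [2, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
B = [4, 2, 0, 0]
C = click_difference_matrix(A, B)
```

Pages 0 and 1 were opened together in 2 of the 4 sessions that clicked page 0, which gives c_01 = 0.5; pairs that were never co-clicked get the maximum value 1.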

Difference matrix D: an n×n symmetric matrix. The element d_ij in row i and column j stores the difference between the i-th and the j-th web page, and is defined as

$d_{ij} \equiv \begin{cases} w \cdot c_{ij} + (1 - w) \cdot d_{ij}^{0}, & i \neq j \\ 0, & i = j \end{cases} \qquad (3)$

Here w is a user parameter, with 0 < w < 1 in normal operation. In the initial state w is set to 0, and its value is raised gradually as the system's running time increases. After a sufficiently long time, w may be set to 1. w can also be adjusted for special needs: if some web pages have received only a few clicks, their click data are less reliable, so w can be given a smaller value, in which case the difference is determined mainly by the value computed with the vector space model (VSM) method.
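The blend in equation (3) is a per-element weighted average, as this short Python sketch illustrates. The function name is hypothetical, and a single global w is assumed here for simplicity even though the text also allows w to be adjusted for individual pages.

```python
def blend_difference(C, D0, w):
    """Combine click-based and VSM-based differences per equation (3)."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    n = len(D0)
    return [
        [0.0 if i == j else w * C[i][j] + (1.0 - w) * D0[i][j] for j in range(n)]
        for i in range(n)
    ]

C = [[0.0, 0.5], [0.5, 0.0]]
D0 = [[0.0, 1.0], [1.0, 0.0]]
D_start = blend_difference(C, D0, 0.0)   # initial state: VSM only
D_later = blend_difference(C, D0, 0.25)  # some click evidence mixed in
```

With w = 0 the result equals D⁰, matching the initial state described above; raising w shifts weight toward the click-derived differences.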

Compressed representation Y of D: an n×d matrix obtained by processing D with a dimensionality reduction algorithm. The element d_ij of D is represented as the distance between the i-th and j-th row vectors of Y, so the difference between any two web pages can be expressed as the Euclidean distance between vectors in Y.
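The text does not name the dimensionality reduction algorithm. As one possible choice, the sketch below uses classical multidimensional scaling (MDS), with the leading eigenpairs found by power iteration so that only the standard library is needed. The function name, the iteration count, and the choice of MDS itself are assumptions. If D is not exactly Euclidean, some eigenvalues may be negative; the sketch then sets the corresponding coordinate to zero, so the distances in Y only approximate D.

```python
import math

def _matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def classical_mds(D, d, iters=200):
    """Embed n items in d dimensions so Euclidean distances approximate D."""
    n = len(D)
    # Double-centre the squared distances: G = -1/2 * J * D^2 * J.
    sq = [[D[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in sq]
    grand = sum(row_mean) / n
    G = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand)
          for j in range(n)] for i in range(n)]
    Y = [[0.0] * d for _ in range(n)]
    for k in range(d):
        # Power iteration for the dominant eigenpair of the deflated G.
        v = [1.0 / (i + 1) for i in range(n)]  # fixed, deterministic start
        for _ in range(iters):
            w = _matvec(G, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # nothing left to extract
            v = [x / norm for x in w]
        lam = sum(x * y for x, y in zip(v, _matvec(G, v)))  # Rayleigh quotient
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            Y[i][k] = scale * v[i]
        # Deflate so the next pass finds the next eigenpair.
        for i in range(n):
            for j in range(n):
                G[i][j] -= lam * v[i] * v[j]
    return Y

D = [[0.0, 3.0, 4.0], [3.0, 0.0, 5.0], [4.0, 5.0, 0.0]]
Y = classical_mds(D, 2)
```

The example distances come from a 3-4-5 right triangle, so a two-dimensional embedding reproduces them up to rounding. A production system handling n pages would use an optimized eigensolver rather than this O(n²) per-iteration loop.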

(3) Update the degrees of difference between all web pages obtained in step 2 with the click data recorded in step 1. The difference between any two web pages is updated as follows: (a) analyze the click data recorded in step 1; whenever the click data show that the two pages appeared in the result list of the same query and both were opened by the user who issued that query, increment the co-click count of the two pages by 1. Once all click data from step 1 have been processed, this yields the total co-click count of the two pages over the period covered by step 1.

(4) Treat the degrees of difference obtained in step 3 as distances between web pages, and reduce the dimensionality of these distance data with a dimensionality reduction algorithm, thereby obtaining a low-dimensional geometric representation of the inter-page difference data. At this point the search engine has the data it needs to compute inter-page differences when producing query results.

In steps 3 and 4 above, the difference matrix is updated periodically. The update process is as follows:

1. Generate the initial difference matrix D⁰ from the vector space model.

2. For each query event, generate the query result set by some method (the specific algorithm is not constrained). The links in the result set are presented to the user in order, each accompanied by a summary of the corresponding web page.

3. After viewing the list, the user clicks a number of links according to the need at the time. The search engine records the clicked links and increments the co-visit count between the clicked web pages by 1, as follows. For clicked web pages i and j, execute

a_ij = a_ij + 1 (4)

b_i = b_i + 1 (5)

b_j = b_j + 1 (6)

If only one web page i is opened, execute

b_i = b_i + 1 (7)

4. The search engine periodically recomputes D from A, B and D⁰, and then reduces the dimensionality of D to obtain its compressed geometric representation Y. The difference between web pages is thus expressed as Euclidean distance in a d-dimensional embedding space, with d ≪ n.

5. When a new web page is added, the system calculates the difference between the new page and the other pages with the method based on the vector space model, and sets the w parameter of that page to 0. Once the page has received a sufficient number of clicks, w is adjusted to a reasonable non-zero value.
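The counting rule in update step 3 (equations (4) to (7)) can be sketched as below. The function name and the session format (a list of page indices clicked during one query) are hypothetical. The text does not say how b_i should be counted when three or more pages are clicked in one query; this sketch assumes each clicked page's count rises by exactly 1 per query, which keeps a_ij ≤ max{b_i, b_j} and therefore keeps c_ij in equation (2) within [0, 1].

```python
from itertools import combinations

def record_session(A, B, clicked):
    """Update co-visit counts A and click counts B for one query's clicks."""
    pages = sorted(set(clicked))  # ignore repeat clicks on the same page
    for i in pages:
        B[i] += 1                 # equations (5), (6), (7)
    for i, j in combinations(pages, 2):
        A[i][j] += 1              # equation (4)
        A[j][i] += 1              # mirror entry keeps A symmetric

n = 3
A = [[0] * n for _ in range(n)]
B = [0] * n
record_session(A, B, [0, 2])  # one query in which pages 0 and 2 were opened
record_session(A, B, [0])     # another query in which only page 0 was opened
```

After these two sessions page 0 has two clicks, page 2 has one, and the pair (0, 2) has one co-visit.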

(a) The search engine accepts the query keywords entered by the user, obtains an initial query result list for those keywords with some relevance calculation method, and submits the list to the user.

In this step, whenever a user uses the search engine, the following process is carried out for each query request:

1. Generate the initial query result set r with a method based on the vector space model. Suppose r contains m web pages at this point.

2. After the user examines the initial query results and clicks a link, the search engine records that link (the corresponding page is called the target web page; let its ID in the web page library be i). It then calculates the difference between the target web page i and the other web pages in r (that is, the distance between the corresponding row vectors of Y), obtaining the difference vector $d_{i} \equiv [d_{ij_{1}}, d_{ij_{2}}, \ldots, d_{ij_{m}}]^{T}$. (Alternatively, the difference between the target web page and all other web pages can be calculated, and the pages with the smallest difference taken as an extension of the query result set.)

3. Arrange the web pages in r in ascending order of the corresponding difference in d_i and submit them to the user. This is the final result the search engine presents to the user.
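The re-ranking in steps 2 and 3 amounts to sorting the initial result set by Euclidean distance to the clicked page in the embedding Y. A minimal sketch, with a hypothetical function name and a made-up four-page embedding:

```python
import math

def rerank(Y, result_ids, target_id):
    """Order result_ids by ascending distance to the target page in Y."""
    return sorted(result_ids, key=lambda j: math.dist(Y[target_id], Y[j]))

# Hypothetical 2-dimensional embedding of four pages (rows of Y).
Y = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
initial_results = [1, 0, 3, 2]  # order produced by the initial ranking
final_results = rerank(Y, initial_results, target_id=0)
```

Because each distance needs only d multiplications with d ≪ n, the second result list can be produced quickly after the user's first click. The clicked page itself has distance 0 and therefore stays at the top of the list.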

Claims

1. A search engine search method based on web page correlation, characterized in that the method comprises the following steps carried out in order:

(1) while the search engine is running, recording the click behavior data of network users on the search engine's result lists over a period of time;

(2) calculating the degree of difference between all web pages with a method based on the vector space model, and saving it;

(3) updating the degrees of difference between all web pages obtained in step 2 with the click data recorded in step 1;

(4) treating the degrees of difference obtained in step 3 as distances between web pages, and reducing the dimensionality of these distance data with a dimensionality reduction algorithm, thereby obtaining a low-dimensional geometric representation of the inter-page difference data;

(5) when the search engine receives a query request from a user, performing the following steps:

(a) the search engine accepts the query keywords entered by the user, obtains an initial query result list for those keywords with some relevance calculation method, and submits the list to the user;

(b) after viewing the initial query result list, the user clicks a link of interest;

(c) the search engine records the first link the user clicks and marks the corresponding web page as the target web page, then, using the low-dimensional geometric representation of the inter-page difference data obtained in step 4, calculates the degree of difference between the target web page and the web pages corresponding to all links in the initial query result list, and arranges those pages in order of increasing difference to form a new query result;

(d) the new query result is submitted to the user, this query result being the final query result that is related to the first web page the user clicked and highly related to the query keywords the user entered.

2. The search engine search method based on web page correlation according to claim 1, characterized in that the recording period in step 1 is one month per cycle, with long-term dynamic tracking.