CN105488209A

CN105488209A - Method and device for analyzing word weight

Info

Publication number: CN105488209A
Application number: CN201510921247.1A
Authority: CN
Inventors: 陈进平
Original assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2015-12-11
Filing date: 2015-12-11
Publication date: 2016-04-13
Anticipated expiration: 2035-12-11
Also published as: CN105488209B

Abstract

The invention discloses a word weight analysis method and device, relates to the technical field of the Internet, and solves the problem that existing methods for determining term weights cannot accurately determine the weights of the terms in a query in an Internet search engine environment. The method includes: obtaining <query, title> pairs; counting the occurrence information of each word in the word segments of the query in each <query, title> pair; calculating the occurrence probability of each word in the same word segment according to the occurrence information; and determining the weight of each word in the same word segment according to the occurrence probability of each word in that word segment. The invention is mainly used for determining the term weights of queries in a search engine and improving the search quality of the search engine.

Description

A word weight analysis method and device

Technical field

The invention relates to the technical field of the Internet, and in particular to a word weight analysis method and device.

Background

With the development of the Internet, the total amount of data stored online has become enormous. To let users find the content they need quickly and accurately, providers of Internet search services need to optimize the search quality of their search engines. A weight is the evaluation value that a search engine assigns to a web page; it reflects the importance of the page, and the higher the weight, the more the page is trusted and recognized by the search engine. When using a search engine, users submit query content in the search box. This content is usually called a query, and the search engine must use it to retrieve useful information from massive amounts of data. A query contains different words (terms), and each term differs in how important it is for obtaining useful results. To obtain the target results accurately from a query, the search engine therefore needs to take into account the importance of each term in the query, that is, to use the weights of the terms in the query when searching for the target results.

Existing methods for determining term weights usually rely on co-clicks, parts of speech, and named entities. These methods are not based on how users actually obtain content through search engines in the Internet environment, so the term weights they produce have limited reference value in the field of Internet search. How to determine term weights in an Internet search engine environment has therefore become an urgent problem for Internet search engines.

Summary of the invention

In view of this, the present invention proposes a word weight analysis method and device, the main purpose of which is to solve the problem that existing methods for determining term weights cannot accurately determine the weights of the terms in a query in an Internet search engine environment.

According to a first aspect of the present invention, a word weight analysis method is provided, comprising:

obtaining <query, title> pairs;

counting the occurrence information of each word in the word segments of the query in each <query, title> pair;

calculating the occurrence probability of each word in the same word segment according to the occurrence information;

determining the weight of each word in the same word segment according to the occurrence probability of each word in the same word segment.

Further, obtaining the <query, title> pairs includes:

obtaining user click logs, which include all queries submitted by users and all titles obtained;

organizing the click logs so that each query submitted by a user is matched one-to-one with a title obtained by clicking a url for that query, forming <query, title> pairs.

Further, counting the occurrence information of each word in the word segments of the query in the <query, title> pair includes:

obtaining all word segments of the query in the <query, title> pair, where the word segments include each individual word in the query and each phrase composed of two or more adjacent words;

counting the occurrence information of each word in all word segments of the query.

Further, counting the occurrence information of each word in all word segments of the query includes:

judging whether each word in a word segment of the query appears in the title that corresponds to the query in its <query, title> pair;

recording the occurrence information of each word in the word segments of the query according to the judgment result, the occurrence information being represented by a preset occurrence symbol and a preset non-occurrence symbol.

Further, calculating the occurrence probability of each word in the same word segment according to the occurrence information includes:

obtaining the total number of titles corresponding to the same word segment;

obtaining the number of times each word in the same word segment appears in all of the corresponding titles;

dividing that number of times by the total number of corresponding titles to obtain the occurrence probability of each word in the same word segment in all of the corresponding titles.

Further, determining the weight of each word in the same word segment according to the occurrence probability of each word in the same word segment includes:

using the occurrence probability of each word in the same word segment in all of the corresponding titles as the weight of that word in the same word segment.

According to a second aspect of the present invention, a word weight analysis device is provided, comprising:

an acquisition unit, configured to obtain <query, title> pairs;

a statistical unit, configured to count the occurrence information of each word in the word segments of the query in the <query, title> pairs obtained by the acquisition unit;

a calculation unit, configured to calculate the occurrence probability of each word in the same word segment according to the occurrence information;

a determining unit, configured to determine the weight of each word in the same word segment according to the occurrence probability of each word in the same word segment calculated by the calculation unit.

Further, the acquisition unit includes:

an acquisition module, configured to obtain user click logs, which include all queries submitted by users and all titles obtained;

a collating module, configured to organize the click logs obtained by the acquisition module so that each query submitted by a user is matched one-to-one with a title obtained by clicking a url for that query, forming <query, title> pairs.

Further, the statistical unit includes:

a segmentation module, configured to obtain all word segments of the query in the <query, title> pair, where the word segments include each individual word in the query and each phrase composed of two or more adjacent words;

a statistical module, configured to count the occurrence information of each word in all word segments of the query obtained by the segmentation module.

Further, the statistical unit is also configured to judge whether each word in a word segment of the query appears in the title that corresponds to the query in its <query, title> pair, and to record the occurrence information of each word in the word segments of the query according to the judgment result, the occurrence information being represented by a preset occurrence symbol and a preset non-occurrence symbol.

Further, the calculation unit includes:

a counting module, configured to obtain the total number of titles corresponding to the same word segment;

the counting module being further configured to obtain the number of times each word in the same word segment appears in all of the corresponding titles;

a calculation module, configured to divide that number of times by the total number of corresponding titles to obtain the occurrence probability of each word in the same word segment in all of the corresponding titles.

Further, the determining unit is configured to use the occurrence probability of each word in the same word segment in all of the corresponding titles as the weight of that word in the same word segment.

With the above technical solution, the word weight analysis method and device provided by the embodiments of the present invention can obtain <query, title> pairs from users' large-scale use of Internet search engines, count the occurrence information of each word in the word segments of the queries, calculate the occurrence probability of each word in the same word segment according to that occurrence information, and determine the weight of each word in the same word segment according to its occurrence probability. In the prior art, the weights of the words in a search query cannot be determined on the basis of how content is obtained through search engines in the Internet environment, so the word weights are determined inaccurately, which in turn affects the accuracy of the search results. Compared with this defect in the prior art, the present invention takes the logs formed by users' large-scale search engine clicks as its basis and can accurately determine the weights of the words in a search query in an Internet search engine environment, thereby effectively improving the accuracy of the search results.

The above description is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the contents of the description, and so that the above and other objects, features, and advantages of the invention are more readily apparent, specific embodiments of the invention are set out below.

Description of drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered as limiting the invention. Throughout the drawings, the same reference symbols designate the same components. In the drawings:

FIG. 1 shows a flowchart of a word weight analysis method provided by an embodiment of the present invention;

FIG. 2 shows a block diagram of a word weight analysis device provided by an embodiment of the present invention;

FIG. 3 shows a block diagram of another word weight analysis device provided by an embodiment of the present invention;

FIG. 4 shows a block diagram of another word weight analysis device provided by an embodiment of the present invention;

FIG. 5 shows a block diagram of another word weight analysis device provided by an embodiment of the present invention.

Detailed description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited by the embodiments set forth herein. Rather, these embodiments are provided so that the present disclosure can be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.

When using a search engine, a user submits a query. The query contains different terms, and each term differs in how important it is for obtaining useful results. To obtain the target results accurately from a query, the search engine therefore needs to take into account the importance of each term in the query, that is, to use the weights of the terms in the query when searching for the target results. Existing methods for determining term weights usually rely on co-clicks, parts of speech, and named entities. These methods are not based on how users actually obtain content through search engines in the Internet environment, so the term weights they produce have limited reference value in the field of Internet search.

To solve the above problems, an embodiment of the present invention provides a word weight analysis method that can accurately determine the weight of each keyword (term) in a query submitted by a user, based on the Internet search engine environment. As shown in FIG. 1, the method includes:

101. Obtain <query, title> pairs.

When a user searches for content with a search engine, the user submits a query containing keywords (terms). The search engine matches the query to a number of relevant titles for the user to click and view. Once the user clicks a relevant title, the embodiment of the present invention can combine the submitted query and the clicked title into a <query, title> pair.

102. Count the occurrence information of each word in the word segments of the query in each <query, title> pair.

When a search engine searches the Internet for content matching a query submitted by a user, it needs to adjust its search strategy according to the importance of each term in the query. The more often a term of the query appears in the titles corresponding to that query, the more important the term is within the query. The embodiment of the present invention therefore performs step 102: over a large number of <query, title> pairs, it counts the occurrence information of each word in the word segments of the queries, and uses that occurrence information to determine the importance of each word in a word segment.

103. Calculate the occurrence probability of each word in the same word segment according to the occurrence information.

Because the embodiment of the present invention counts over a large number of <query, title> pairs, the queries involved contain many identical word segments. Take the word segment ABC as an example. Among all queries that contain the word segment ABC, some of the corresponding titles contain term-A and some do not; some contain term-B and some do not; some contain term-C and some do not. In other words, the words in the same word segment have different occurrence probabilities in the titles corresponding to all queries that contain that word segment, so the words in the same word segment also differ in importance. The embodiment of the present invention therefore performs step 103: it calculates the occurrence probability of each word in the same word segment from the occurrence information of that word in the titles corresponding to all queries that contain the word segment.

104. Determine the weight of each word in the same word segment according to the occurrence probability of each word in the same word segment.

Weight is a relative concept: the weight of an indicator is its relative importance in an overall evaluation. In the embodiment of the present invention, the weight of a term is its relative importance within the word segment of the query in which it occurs, and a more important term has a higher probability of appearing in the titles corresponding to the queries that contain its word segment. Once step 103 has calculated the occurrence probability of each word in the same word segment in the titles corresponding to all queries that contain that word segment, the weight of each word in the same word segment can be determined from its occurrence probability. The search engine can then adjust its search strategy according to the term weights determined from large-scale statistics over <query, title> pairs, improving the accuracy of the search results.

The word weight analysis method provided by the embodiment of the present invention can obtain <query, title> pairs from users' large-scale use of Internet search engines, count the occurrence information of each word in the word segments of the queries, calculate the occurrence probability of each word in the same word segment according to that occurrence information, and determine the weight of each word in the same word segment according to its occurrence probability. In the prior art, the weights of the words in a search query cannot be determined on the basis of how content is obtained through search engines in the Internet environment, so the word weights are determined inaccurately, which in turn affects the accuracy of the search results. Compared with this defect in the prior art, the present invention takes the logs formed by users' large-scale search engine clicks as its basis and can accurately determine the weights of the words in a search query in an Internet search engine environment, thereby effectively improving the accuracy of the search results.

For a better understanding of the method shown in FIG. 1, and as a refinement and extension of the above implementation, the steps in FIG. 1 are described in detail below.

Users generate a large volume of click logs while using the Internet. The click log information includes data such as the query submitted by the user in the search engine, the uniform resource locator (url) clicked for that query, and the title corresponding to the url. Because a query submitted by a user and the title obtained by clicking a url for that query usually correspond to each other, large-scale statistics over click log information provide the data basis for determining the weights of search keywords (terms) in an Internet search engine environment. When submitting a query, a user sometimes clicks several urls and obtains several related titles, and these titles differ in quality, that is, in how well they match the query. The embodiment of the present invention therefore organizes the obtained click logs and matches each query in the click log with a title one-to-one to obtain <query, title> pairs. Because a user may click several urls for one query and obtain several corresponding titles, the same query may appear in several of the resulting <query, title> pairs.
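The organizing step described above can be sketched as follows. This is only an illustration under stated assumptions: the `ClickRecord` layout, the function name `build_query_title_pairs`, the example urls, and the title "ACG" are all hypothetical and do not come from the patent, which does not specify a log format.

```python
from typing import Iterable, List, NamedTuple, Tuple


class ClickRecord(NamedTuple):
    """Hypothetical click-log record: one clicked url for one submitted query."""
    query: str
    url: str
    title: str


def build_query_title_pairs(click_log: Iterable[ClickRecord]) -> List[Tuple[str, str]]:
    """Emit one (query, title) pair per click that has both a query and a title."""
    pairs = []
    for record in click_log:
        # Skip incomplete records; a query clicked several times yields several pairs.
        if record.query and record.title:
            pairs.append((record.query, record.title))
    return pairs


# Made-up log: the query "ABC" was followed by clicks on two different urls.
log = [
    ClickRecord("ABC", "http://example.com/1", "CDEF"),
    ClickRecord("ABC", "http://example.com/2", "ACG"),
    ClickRecord("ABCDE", "http://example.com/3", "FGACDHJ"),
]
pairs = build_query_title_pairs(log)
```

Here the query "ABC" contributes two pairs, which matches the statement that one query may appear in several <query, title> pairs.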

After a user submits a query, the search engine needs to adjust its search strategy according to the relative importance, that is, the weight, of each term (keyword) in the query in order to obtain accurate search results. The importance of each term in a query can be expressed by how the term appears in the titles corresponding to the query: across a large number of queries, the more often a term appears in the corresponding titles, the more important the term is. Queries contain many different word segments, where the word segments of a query are each individual term in the query and each phrase composed of two or more adjacent terms, and different queries may contain the same word segment. For a given word segment, the more often a term in that word segment appears in the titles corresponding to all queries that contain the word segment, the more important the term is within the word segment. The embodiment of the present invention therefore counts the occurrence information of each term in the word segments of all queries.

To count this occurrence information, the embodiment of the present invention performs word segmentation on all queries. That is, it processes all <query, title> pairs, segments each query into words, obtains each individual term in the query and each phrase composed of two or more adjacent terms (the word segments described above), and counts the occurrence information of each term of each word segment in the corresponding title.
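Enumerating the word segments of one query can be sketched as below. The sketch assumes the query has already been split into terms by some word segmenter, which the patent does not name; the function name `word_segments` is hypothetical.

```python
from typing import List, Sequence, Tuple


def word_segments(terms: Sequence[str]) -> List[Tuple[str, ...]]:
    """Return every run of adjacent terms: single terms first, then longer phrases."""
    n = len(terms)
    return [
        tuple(terms[start:start + size])
        for size in range(1, n + 1)          # segment length in terms
        for start in range(n - size + 1)     # leftmost position of the segment
    ]


# The letters stand for already-segmented terms, as in the patent's examples.
segments_abc = word_segments(["A", "B", "C"])
```

A query of n terms yields n(n+1)/2 word segments, so a three-term query yields 6 and a five-term query yields 15.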

The occurrence information of each term in the word segments of a query can be represented by a preset occurrence symbol and a preset non-occurrence symbol. That is, for each term in a word segment of the query, it is judged whether the term appears in the title that corresponds to the query in its <query, title> pair. If it appears, the preset occurrence symbol is recorded; if it does not, the preset non-occurrence symbol is recorded. For example, in the pair <query: ABCD, title: CDEFG>, one word segment of the query is ABC. In this word segment, term-A does not appear in the title CDEFG and is recorded with the non-occurrence symbol 0; term-B does not appear in the title CDEFG and is also recorded as 0; term-C appears in the title CDEFG and is recorded with the occurrence symbol 1. The occurrence information of the terms in the word segment ABC can therefore be written as ABC:001.
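The 0/1 encoding in this example can be reproduced with a short sketch. It assumes both the word segment and the title are available as lists of terms; the function name `occurrence_flags` is hypothetical, and "1" and "0" are the occurrence and non-occurrence symbols used in the text.

```python
from typing import Iterable, Sequence


def occurrence_flags(segment: Sequence[str], title_terms: Iterable[str]) -> str:
    """Encode, term by term, whether each term of the segment occurs in the title."""
    title_set = set(title_terms)  # set membership makes each lookup cheap
    return "".join("1" if term in title_set else "0" for term in segment)


# Pair from the text: <query: ABCD, title: CDEFG>, word segment ABC.
flags_abc = occurrence_flags(["A", "B", "C"], ["C", "D", "E", "F", "G"])
```

With the pair from the text, `flags_abc` is "001", matching ABC:001.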

Once the occurrence information of each term in the word segments of the query in each <query, title> pair has been determined in this way, the occurrence probability of each term in the same word segment can be calculated. Specifically, the calculation first requires the total number of titles corresponding to the same word segment. For a given word segment, this is the total number of <query, title> pairs whose query contains that word segment. In some of these pairs the title contains a given term of the word segment, and in others it does not. So after the total number of titles corresponding to the word segment has been obtained, the number of times each term of the word segment appears in those titles is also needed, that is, the number of those titles that contain the term. Dividing the number of times each term of the word segment appears in the titles by the total number of corresponding titles gives the occurrence probability of each term of the word segment in all of the corresponding titles.
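The division described above can be sketched as follows, taking as input the 0/1 strings recorded for one word segment across all pairs whose query contains it. The function name `occurrence_probabilities` and the string-based input format are assumptions made for illustration.

```python
from typing import List, Sequence


def occurrence_probabilities(flag_strings: Sequence[str]) -> List[float]:
    """For one word segment, divide each term's title hits by the number of titles."""
    total = len(flag_strings)  # number of <query, title> pairs containing the segment
    if total == 0:
        raise ValueError("the word segment has no corresponding titles")
    counts = [0] * len(flag_strings[0])
    for flags in flag_strings:
        for position, flag in enumerate(flags):
            if flag == "1":
                counts[position] += 1
    return [count / total for count in counts]


# Two recorded values for the same three-term segment, e.g. 001 and 101.
probabilities = occurrence_probabilities(["001", "101"])
```

For the inputs 001 and 101 the result is 0.5, 0.0, and 1.0 for the first, second, and third term respectively.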

For a given word segment, the more frequently one of its terms appears in the titles corresponding to the queries that contain it, the more important that term is. The weight of each term in the same word segment can therefore be determined from the calculated occurrence probability of each term in that word segment. As an optional implementation, the embodiment of the present invention may use the occurrence probability of each term of the word segment in all of its corresponding titles directly as the weight of that term in the word segment.

For a better understanding of the above method, the process is described in detail below using two <query, title> pairs as an example: <query: ABC, title: CDEF> and <query: ABCDE, title: FGACDHJ>. If a term of the query appears in the corresponding title, it is recorded with the occurrence symbol 1; if it does not appear, it is recorded with the non-occurrence symbol 0.

To count how each term in the word segments of the query appears in a <query, title> pair, the query is first segmented to obtain all of its word segments, and then the occurrence of each term in each word segment is recorded. That is, each word segment of the query is output as a key, and the occurrence of the terms of that word segment in the corresponding title is output as the value. The results are as follows:

1) In the pair <query: ABC, title: CDEF>:

Segments containing 1 term: A:0, B:0, C:1

Segments containing 2 terms: AB:00, BC:01

Segments containing 3 terms: ABC:001

2) In the pair <query: ABCDE, title: FGACDHJ>:

Segments containing 1 term: A:1, B:0, C:1, D:1, E:0

Segments containing 2 terms: AB:10, BC:01, CD:11, DE:10

Segments containing 3 terms: ABC:101, BCD:011, CDE:110

Segments containing 4 terms: ABCD:1011, BCDE:0110

Segments containing 5 terms: ABCDE:10110
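The key/value output listed above can be reproduced with a short sketch. This is only an illustration under stated assumptions: the function name `segment_occurrences` is hypothetical, the query and title are assumed to be already split into terms, and a word segment is taken to be any run of adjacent query terms. Keys are tuples of terms rather than concatenated strings so that real multi-character terms stay unambiguous.

```python
def segment_occurrences(query_terms, title_terms):
    """Yield (segment, value) for every run of adjacent query terms.

    segment is a tuple of terms (the key); value is a string of
    1/0 symbols marking whether each term appears in the title.
    Hypothetical helper, not taken from the patent text.
    """
    title = set(title_terms)
    # Occurrence symbol per query term: "1" if present in the title, else "0".
    bits = ["1" if term in title else "0" for term in query_terms]
    n = len(query_terms)
    for length in range(1, n + 1):          # segments of 1..n terms
        for start in range(n - length + 1):  # every start position
            key = tuple(query_terms[start:start + length])
            value = "".join(bits[start:start + length])
            yield key, value


# The two example pairs, with each letter treated as one term.
first = dict(segment_occurrences(list("ABC"), list("CDEF")))
second = dict(segment_occurrences(list("ABCDE"), list("FGACDHJ")))
```

With these inputs, `first[("A", "B", "C")]` is `"001"` and `second[("A", "B", "C")]` is `"101"`, matching the listed values.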

After all <query, title> pairs have been processed, identical word segments are merged according to the occurrence information of each term in them; that is, the occurrence probability of each term in the same word segment is calculated. Take the word segment ABC as an example. In the pair <query: ABC, title: CDEF> its value is 001, and in the pair <query: ABCDE, title: FGACDHJ> its value is 101. Term-A does not appear in the title of the first pair but does appear in the title of the second, so the probability that term-A appears in the title is 0.5. Similarly, the probability for term-B is 0 and the probability for term-C is 1. For the word segment ABC, the probabilities that its terms appear in the search results are therefore ABC: 0.5, 0, 1. From these statistics, when a user submits a query containing ABC to the search engine, the importance of the terms to consider during the search is ordered term-C > term-A > term-B.
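The merging step for the word segment ABC can be sketched as follows. The function name `occurrence_probabilities` and the record layout (a segment key paired with its 1/0 value string) are assumptions made for illustration; the computation itself is the one described above, namely the number of titles in which a term appears divided by the number of titles for that segment.

```python
from collections import defaultdict


def occurrence_probabilities(records):
    """Merge identical segments and return per-term occurrence probabilities.

    records: iterable of (segment, value) pairs, e.g. ("ABC", "101").
    Hypothetical helper, not taken from the patent text.
    """
    totals = defaultdict(int)  # number of titles seen for each segment
    hits = {}                  # per-position occurrence counts per segment
    for segment, value in records:
        totals[segment] += 1
        counts = hits.setdefault(segment, [0] * len(value))
        for i, bit in enumerate(value):
            if bit == "1":
                counts[i] += 1
    return {seg: [c / totals[seg] for c in hits[seg]] for seg in totals}


records = [("ABC", "001"), ("ABC", "101")]
probs = occurrence_probabilities(records)

# Rank the terms of ABC by weight, using the probability directly as the weight.
ranking = sorted(zip("ABC", probs["ABC"]), key=lambda pair: pair[1], reverse=True)
```

Here `probs["ABC"]` is `[0.5, 0.0, 1.0]` and `ranking` orders the terms C, A, B, which agrees with the worked example.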

Of course, the above uses only two <query, title> pairs for illustration. The resulting probabilities are not representative and serve only to make the analysis process clear. In actual analysis, <query, title> pairs must be counted on a large scale in the manner described above to obtain a reliable occurrence probability for each term in a word segment. For example, suppose that counting a large number of <query, title> pairs yields data such as word segment ABC: 0.7, 0.3, 0.9. This means that, across all queries containing the word segment ABC, the probability that a clicked title contains term-A is 0.7, the probability that it contains term-B is 0.3, and the probability that it contains term-C is 0.9. Term-A and term-C can therefore be regarded as relatively important, and term-B as relatively unimportant.

With the word weight analysis method described in this embodiment, word segments in the Internet search environment, together with the probabilities that the terms they contain appear in titles, can be mined on a large scale. Consider the following two word segments: a) 番茄鱼汤 ("tomato fish soup"): 0.75, 0.82; b) 鱼汤好吗 ("is fish soup good"): 0.78, 0.51. Entry a) indicates that, of the titles clicked for all queries containing 番茄鱼汤, 75% contain 番茄 ("tomato") and 82% contain 鱼汤 ("fish soup"). Because 番茄 has the synonym 西红柿 (also "tomato"), the actual probability that these titles refer to tomato is even higher. Entry b) indicates that, in the titles clicked for all queries containing 鱼汤好吗, 鱼汤 appears more often than 好吗 ("is it good"), so 鱼汤 is more important than 好吗.

This embodiment uses <query, title> pairs to count whether each term of a query appears in the title and outputs this occurrence information as the value of each word segment. It then uses the values of each word segment to compute the occurrence probability in titles of each term in the same word segment, which gives the weight information for each term in the segment. Because these weights are determined from click log information gathered in a large-scale Internet search environment, they can effectively improve the search quality of the search engine.

Further, as an implementation of the method shown in FIG. 1, an embodiment of the invention provides a word weight analysis device. As shown in FIG. 2, the device includes an acquisition unit 21, a statistics unit 22, a calculation unit 23, and a determination unit 24, wherein:

the acquisition unit 21 is configured to obtain <query, title> pairs;

the statistics unit 22 is configured to count the occurrence information of each word in the word segments of the query in the <query, title> pairs obtained by the acquisition unit 21;

the calculation unit 23 is configured to calculate the occurrence probability of each word in the same word segment according to the occurrence information counted by the statistics unit 22;

the determination unit 24 is configured to determine the weight of each word in the same word segment according to the occurrence probability of each word in the same word segment calculated by the calculation unit 23.

Further, as shown in FIG. 3, the acquisition unit 21 includes:

an acquisition module 211, configured to obtain user click logs, the click logs including all queries submitted by users and all titles obtained;

a collating module 212, configured to collate the click logs obtained by the acquisition module 211 and to place each query submitted by a user in one-to-one correspondence with the title obtained by clicking a url for that query, forming a <query, title> pair.
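One way the collating step might look is sketched below. The patent does not specify a click log format, so everything concrete here is an assumption: each log record is taken to be a (query, clicked url) tuple, titles are looked up from a hypothetical `title_by_url` mapping, and the function name `build_query_title_pairs` is invented for illustration.

```python
def build_query_title_pairs(click_records, title_by_url):
    """Pair each submitted query with the title of the url clicked for it.

    click_records: iterable of (query, clicked_url) tuples (assumed format).
    title_by_url:  dict mapping a url to its page title (assumed lookup).
    """
    pairs = []
    for query, url in click_records:
        title = title_by_url.get(url)
        # Skip clicks whose title is unknown; only complete pairs are kept.
        if title is not None:
            pairs.append((query, title))
    return pairs


# Hypothetical sample data, using the letters of the worked example.
log = [("ABC", "http://example.com/1"), ("ABCDE", "http://example.com/2")]
titles = {"http://example.com/1": "CDEF", "http://example.com/2": "FGACDHJ"}
pairs = build_query_title_pairs(log, titles)
```

Each click produces its own pair, so a query clicked on several results contributes several <query, title> pairs to the later counting.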

Further, as shown in FIG. 4, the statistics unit 22 includes:

a segmentation module 221, configured to obtain all word segments of the query in a <query, title> pair, the word segments including each word in the query and each phrase formed by two or more adjacent words;

a statistics module 222, configured to count the occurrence information of each word in all the word segments of the query obtained by the segmentation module 221.

Further, the statistics unit 22 is also configured to determine whether each word in a word segment of the query appears in the title corresponding to the query in its <query, title> pair, and to count the occurrence information of each word in the word segment according to the result of that determination, the occurrence information being represented by a preset occurrence symbol and a preset non-occurrence symbol.

Further, as shown in FIG. 5, the calculation unit 23 includes:

a counting module 231, configured to obtain the total number of titles corresponding to the same word segment;

the counting module 231 being further configured to obtain the number of times each word in the same word segment appears in all of the corresponding titles;

a calculation module 232, configured to divide that number of times by the total number of corresponding titles to obtain the occurrence probability of each word in the same word segment in all of the corresponding titles.

Further, the determination unit 24 is configured to use the occurrence probability of each word in the same word segment in all of the corresponding titles as the weight of that word in the same word segment.

The word weight analysis device provided by this embodiment can obtain <query, title> pairs as users make large-scale use of an Internet search engine, count the occurrence information of each word in the word segments of the query, calculate the occurrence probability of each word in the same word segment from that information, and determine the weight of each word in the same word segment from its occurrence probability. In the prior art, by contrast, the weights of words in a search query cannot be determined on the basis of content obtained through a search engine in the Internet environment, so the word weights of search terms are determined inaccurately, which in turn degrades the accuracy of search results. Compared with this defect of the prior art, the invention takes the logs generated by users' large-scale search engine clicks as its basis and can accurately determine the weights of words in a search query in the Internet search engine environment, thereby effectively improving the accuracy of search results.

In addition, this embodiment uses <query, title> pairs to count whether each term of a query appears in the title and outputs this occurrence information as the value of each word segment. It then uses the values of each word segment to compute the occurrence probability in titles of each term in the same word segment, which gives the weight information for each term in the segment. Because these weights are determined from click log information gathered in a large-scale Internet search environment, they can effectively improve the search quality of the search engine.

In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the relevant descriptions of the other embodiments.

It will be understood that related features of the above method and device may be cross-referenced. In addition, terms such as "first" and "second" in the above embodiments are used to distinguish the embodiments and do not indicate that any embodiment is superior or inferior.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.

The algorithms and displays presented herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. The structure required to construct such systems is apparent from the above description. Furthermore, the invention is not directed to any particular programming language. It should be understood that the invention described herein may be implemented in a variety of programming languages, and that any description of a specific language above is given to disclose the best mode of the invention.

Numerous specific details are set forth in the description provided herein. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.

Similarly, it should be understood that, in the above description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art will understand that the modules of the device in an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.

Furthermore, those skilled in the art will understand that, although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to fall within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the word weight analysis device according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the methods described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such a signal may be downloaded from an Internet site, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.

Claims

1. A method for analyzing word weight, characterized in that the method comprises:

obtaining <query, title> pairs;

counting occurrence information of each word in the word segments of the query in the <query, title> pairs;

calculating the occurrence probability of each word in the same word segment according to the occurrence information;

determining the weight of each word in the same word segment according to the occurrence probability of each word in the same word segment.

2. The method according to claim 1, wherein obtaining the <query, title> pairs comprises:

obtaining user click logs, the click logs including all queries submitted by users and all titles obtained;

collating the click logs and placing each query submitted by a user in one-to-one correspondence with the title obtained by clicking a url for that query, forming a <query, title> pair.

3. The method according to claim 1, wherein counting the occurrence information of each word in the word segments of the query in the <query, title> pairs comprises:

obtaining all word segments of the query in a <query, title> pair, the word segments including each word in the query and each phrase formed by two or more adjacent words;

counting the occurrence information of each word in all the word segments of the query.

4. The method according to claim 3, wherein counting the occurrence information of each word in all the word segments of the query comprises:

determining whether each word in a word segment of the query appears in the title corresponding to the query in its <query, title> pair;

counting the occurrence information of each word in the word segment of the query according to the result of the determination, the occurrence information being represented by a preset occurrence symbol and a preset non-occurrence symbol.

5. The method according to claim 4, wherein calculating the occurrence probability of each word in the same word segment according to the occurrence information comprises:

obtaining the total number of titles corresponding to the same word segment;

obtaining the number of times each word in the same word segment appears in all of the corresponding titles;

dividing the number of times by the total number of corresponding titles to obtain the occurrence probability of each word in the same word segment in all of the corresponding titles.

6. The method according to claim 5, wherein determining the weight of each word in the same word segment according to the occurrence probability of each word in the same word segment comprises:

using the occurrence probability of each word in the same word segment in all of the corresponding titles as the weight of that word in the same word segment.

7. A device for analyzing word weight, characterized in that the device comprises:

an acquisition unit, configured to obtain <query, title> pairs;

a statistics unit, configured to count occurrence information of each word in the word segments of the query in the <query, title> pairs obtained by the acquisition unit;

a calculation unit, configured to calculate the occurrence probability of each word in the same word segment according to the occurrence information counted by the statistics unit;

a determination unit, configured to determine the weight of each word in the same word segment according to the occurrence probability of each word in the same word segment calculated by the calculation unit.

8. The device according to claim 7, wherein the acquisition unit comprises:

an acquisition module, configured to obtain user click logs, the click logs including all queries submitted by users and all titles obtained;

a collating module, configured to collate the click logs obtained by the acquisition module and to place each query submitted by a user in one-to-one correspondence with the title obtained by clicking a url for that query, forming a <query, title> pair.

9. The device according to claim 7, wherein the statistics unit comprises:

a segmentation module, configured to obtain all word segments of the query in a <query, title> pair, the word segments including each word in the query and each phrase formed by two or more adjacent words;

a statistics module, configured to count the occurrence information of each word in all the word segments of the query obtained by the segmentation module.

10. The device according to claim 9, wherein the statistics unit is further configured to determine whether each word in a word segment of the query appears in the title corresponding to the query in its <query, title> pair, and to count the occurrence information of each word in the word segment of the query according to the result of the determination, the occurrence information being represented by a preset occurrence symbol and a preset non-occurrence symbol.

11. The device according to claim 10, wherein the calculation unit comprises:

a counting module, configured to obtain the total number of titles corresponding to the same word segment;

the counting module being further configured to obtain the number of times each word in the same word segment appears in all of the corresponding titles;

a calculation module, configured to divide the number of times by the total number of corresponding titles to obtain the occurrence probability of each word in the same word segment in all of the corresponding titles.

12. The device according to claim 11, wherein the determination unit is configured to use the occurrence probability of each word in the same word segment in all of the corresponding titles as the weight of that word in the same word segment.