[go: up one dir, main page]

CN103186650A - Searching method and device - Google Patents

Searching method and device Download PDF

Info

Publication number
CN103186650A
CN103186650A CN2011104611284A CN201110461128A CN103186650A CN 103186650 A CN103186650 A CN 103186650A CN 2011104611284 A CN2011104611284 A CN 2011104611284A CN 201110461128 A CN201110461128 A CN 201110461128A CN 103186650 A CN103186650 A CN 103186650A
Authority
CN
China
Prior art keywords
search
list
etymology
record
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011104611284A
Other languages
Chinese (zh)
Other versions
CN103186650B (en
Inventor
简勤
郭正平
陈健骥
何丹
赖航
肖巍
郑长松
王全礼
杨俊拯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Group Sichuan Co Ltd
Original Assignee
China Mobile Group Sichuan Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Group Sichuan Co Ltd filed Critical China Mobile Group Sichuan Co Ltd
Priority to CN201110461128.4A priority Critical patent/CN103186650B/en
Publication of CN103186650A publication Critical patent/CN103186650A/en
Application granted granted Critical
Publication of CN103186650B publication Critical patent/CN103186650B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

本发明提供了一种搜索方法和装置。其中的搜索方法包括:为同一类文档设置关键属性和关键属性权重,计算各个文档的关键属性分值;建立索引倒排表;索引倒排表中的文档列表中的每条记录中均包括一个文档的文档编号和关键属性分值;所述文档列表由有序列表和无序列表组成,所述有序列表中包括n个关键属性分值最大、且按关键属性分值从大到小顺序排列的记录;其中,所述n为预先确定的值;根据用户输入的检索字符串生成相应的词源,根据所生成的词源对所述索引倒排表进行检索,并根据用户输入的搜索结果范围优先从所生成的词源对应的有序列表中获取记录,以得到所需的搜索结果和相关结果总数。应用本发明可以提高搜索速度,降低系统资源的占用。

Figure 201110461128

The invention provides a search method and device. The search method includes: setting key attributes and key attribute weights for the same type of documents, calculating the key attribute scores of each document; building an index posting list; each record in the document list in the index posting list includes a The document number and key attribute score of the document; the document list is composed of an ordered list and an unordered list, and the ordered list includes the n key attribute scores with the largest value, and the order of the key attribute scores is from large to small Arranged records; wherein, the n is a predetermined value; generate a corresponding etymology according to the search character string input by the user, retrieve the index posting list according to the generated etymology, and search according to the search string entered by the user Result scopes preferentially fetch records from the generated etymologically-corresponding ordered list to obtain the desired total number of search results and related results. The application of the invention can increase the search speed and reduce the occupation of system resources.

Figure 201110461128

Description

一种搜索方法和装置A search method and device

技术领域 technical field

本发明涉及数据业务技术领域,尤其涉及一种搜索方法和装置。The invention relates to the technical field of data services, in particular to a search method and device.

背景技术 Background technique

现有的搜索引擎一般都是单方面基于文本相似度进行排序,索引的存储一般是基于键值对<KEY,DocList>的形式。其中,KEY表示关键字,DocList表示包含关键字KEY的文档列表。DocList中的每个元素是一个文档对象,用于存放一个文档的基本信息,例如:该文档的ID、该关键字KEY在该文档中出现的次数以及出现的位置等信息。当用户输入一个关键字之后,将先根据该关键字进行检索;当检索到相对应的关键字后,再整体计算与该关键字相对应的DocList中所有文档在此关键字下的得分;然后,依据上述得分对所有的搜索结果整体进行排序,再在排序后的搜索结果中读取用户所需条数的搜索结果,将读取的搜索结果返回给用户。Existing search engines generally sort based on text similarity unilaterally, and index storage is generally based on the form of key-value pairs <KEY, DocList>. Among them, KEY represents a keyword, and DocList represents a list of documents containing the keyword KEY. Each element in the DocList is a document object, which is used to store the basic information of a document, such as: the ID of the document, the number of occurrences of the keyword KEY in the document, and the position where it appears. When the user enters a keyword, it will first search based on the keyword; when the corresponding keyword is retrieved, then the overall calculation of the scores of all documents in the DocList corresponding to the keyword under this keyword; and then , sort all the search results as a whole according to the above scores, and then read the search results of the number required by the user from the sorted search results, and return the read search results to the user.

例如,现有技术中存在一种综合搜索结果的排序方法,在该方法中,将采用排序算法计算垂直搜索引擎的综合值,并根据综合直对该垂直搜索引擎进行排序,汇总所有垂直搜索引擎排序结果,生成最终的搜索结果。For example, there is a sorting method for comprehensive search results in the prior art. In this method, a sorting algorithm will be used to calculate the comprehensive value of vertical search engines, and the vertical search engines will be sorted according to the comprehensive value, and all vertical search engines will be summarized. Sort the results to generate the final search results.

但是,上述排序方法是将搜索结果得分的计算过程放在搜索过程中进行,且每次都是进行全量排序,从而对CPU和内存资源的占用量较大,搜索复杂度较高,搜索速度较慢,且没有考虑信息的附加价值属性,因而无法保证价值高的信息排序一定靠前。同时,由于该方法中需要存储得分计算依据的因子以及综合得分计算因子,从而对存储空间也有较高的要求。However, the above sorting method puts the calculation process of the search result score in the search process, and performs full sorting each time, thus occupying a large amount of CPU and memory resources, high search complexity, and slow search speed. It is slow, and does not consider the value-added attribute of information, so it cannot guarantee that the information with high value will be ranked first. At the same time, since the method needs to store the factors for calculating the score and the factors for calculating the comprehensive score, it also requires a relatively high storage space.

此外,现有技术中还存在一种基于搜索引擎的搜索结果排序方法。在该方法中,主要是根据事先配置的网络资源权重,同时根据用户输入的关键字在资源中的文本权重,综合计算每条相关资源得分,然后进行全量排序,以生成搜索结果。但是,在该方法中,虽然考虑到信息的附加价值属性,但是在搜索过程中需要计算各文档得分,同时还需要对搜索的结果进行全量排序,从而对CPU和内存资源的占用量也较大,搜索复杂度也较高,搜索速度也较慢。同时,由于该方法中也需要存储得分计算依据的因子以及综合得分计算因子,从而对存储空间也有较高的要求In addition, there is also a method for sorting search results based on a search engine in the prior art. In this method, the score of each relevant resource is comprehensively calculated based on the pre-configured network resource weight and the text weight of the keyword input by the user, and then the full ranking is performed to generate search results. However, in this method, although the additional value attribute of information is taken into account, the scores of each document need to be calculated during the search process, and the search results also need to be fully sorted, thus occupying a large amount of CPU and memory resources. , the search complexity is also higher, and the search speed is also slower. At the same time, because this method also needs to store the factors for calculating the score and the factors for calculating the comprehensive score, it also has high requirements for storage space.

综上可知,在现有技术中的搜索方法中,由于文档得分的计算和排序都是放在搜索过程中完成,且是针对全量进行排序;而且,索引组织的结构也没有考虑用户的搜索习惯,造成存储的信息量较大;另外,每次搜索均需针对全量进行,从而大大增加了内存和CPU等系统稀缺资源的负担。同时,现有技术中的搜索方法的搜索复杂度一般都较高,搜索响应速度也较慢。此外,现有技术中的搜索方法中进行排序的依据是文本相关度,而并未考虑能表达信息价值的关键属性,因此导致排序方式单一、用户友好性不足等问题。To sum up, in the search methods in the prior art, since the calculation and sorting of the document scores are done in the search process, and the sorting is done for the whole volume; moreover, the structure of the index organization does not consider the user's search habits , resulting in a large amount of stored information; in addition, each search needs to be performed on the full amount, which greatly increases the burden on scarce system resources such as memory and CPU. At the same time, the search methods in the prior art generally have high search complexity and slow search response speed. In addition, the search method in the prior art is based on text relevance, without considering the key attributes that can express the value of information, thus resulting in single sorting methods and insufficient user-friendliness.

发明内容 Contents of the invention

有鉴于此,本发明提供了一种搜索方法和装置,从而可提高搜索速度,降低系统资源的占用。In view of this, the present invention provides a search method and device, thereby improving the search speed and reducing the occupation of system resources.

本发明采用的技术方案具体是这样实现的:The technical scheme that the present invention adopts is specifically realized like this:

一种搜索方法,该方法包括:A method of searching, the method comprising:

A、为同一类文档设置至少一个关键属性及相应的关键属性权重,并根据所述关键属性和关键属性权重计算各个文档的关键属性分值KFScore;A. Set at least one key attribute and the corresponding key attribute weight for the same type of document, and calculate the key attribute score KFScore of each document according to the key attribute and key attribute weight;

B、将全部待检索文档以词源Term为关键字进行索引,建立以Term为关键字索引、以包含该Term的文档的总数TotalCount和包含该Term的文档列表DocList为值的索引倒排表;所述文档列表中的每条记录中均包括一个文档的文档编号和关键属性分值;所述文档列表由有序列表和无序列表组成,所述有序列表中包括n个关键属性分值最大、且按关键属性分值从大到小顺序排列的记录;其中,所述n为预先确定的值;B, all documents to be retrieved are indexed with etymology Term as a keyword, set up an index posting table with Term as a keyword index, with the total TotalCount of documents containing the Term and the list of documents DocList containing the Term as values; Each record in the document list includes a document number and a key attribute score; the document list is composed of an ordered list and an unordered list, and the ordered list includes n key attribute scores The records that are the largest and are arranged in descending order of key attribute scores; wherein, the n is a predetermined value;

C、根据用户输入的检索字符串生成相应的词源,根据所生成的词源对所述索引倒排表进行检索,并根据用户输入的搜索结果范围优先从所生成的词源对应的有序列表中获取记录,以得到所需的搜索结果和相关结果总数。C. Generate the corresponding etymology according to the search character string input by the user, search the index posting list according to the generated etymology, and prioritize the sequence corresponding to the generated etymology according to the range of search results input by the user Get the records in the list to get the desired total number of search results and related results.

本发明中还提供了一种搜索装置,该搜索装置包括:倒排表生成模块、存储模块、词源生成模块和检索模块;The present invention also provides a search device, which includes: an inverted list generation module, a storage module, an etymology generation module and a retrieval module;

所述倒排表生成模块,用于将全部待检索文档以词源Term为关键字进行索引,建立以Term为关键字索引、以包含该Term的文档的总数TotalCount和包含该Term的文档列表DocList为值的索引倒排表;所述文档列表中的每条记录中均包括一个文档的文档编号和关键属性分值;所述文档列表由有序列表和无序列表组成,所述有序列表中包括n个关键属性分值最大、且按关键属性分值从大到小顺序排列的记录;其中,所述n为预先确定的值;将所述索引倒排表发送给存储模块;The inverted table generation module is used to index all documents to be retrieved with etymology Term as a keyword, and establishes Term as a keyword index, with the total TotalCount of documents containing the Term and the list of documents DocList containing the Term is an index posting list of values; each record in the document list includes a document number and a key attribute score; the document list is composed of an ordered list and an unordered list, and the ordered list Including n key attribute scores maximum, and according to the records of key attribute scores arranged in order from large to small; wherein, the n is a predetermined value; the index posting list is sent to the storage module;

所述存储模块,用于存储所述索引倒排表;The storage module is used to store the index posting list;

所述词源生成模块,用于根据用户输入的检索字符串生成相应的词源;将所述词源以及用户输入的搜索结果范围发送给检索模块;The etymology generation module is used to generate a corresponding etymology according to the search character string input by the user; the etymology and the range of search results input by the user are sent to the retrieval module;

所述检索模块,用于根据所生成的词源对存储模块中存储的索引倒排表进行检索,并根据用户输入的搜索结果范围优先从所生成的词源对应的有序列表中获取记录,以得到所需的搜索结果和相关结果总数;将所述搜索结果和相关结果总数发送给用户。The retrieval module is used to retrieve the index posting list stored in the storage module according to the generated etymology, and obtain records from the ordered list corresponding to the generated etymology according to the range of search results input by the user, To obtain the desired total number of search results and related results; sending the search results and the total number of related results to the user.

由上述技术方案可见,本发明中由于为文档设置了关键属性和关键属性权重,因此可计算各个文档的KFScore,并根据各个文档的KFScore建立一个索引倒排表,使得该索引倒排表中的文档列表由有序列表和无序列表组成。当用户输入的检索字符串时,可生成相应的词源,并根据所生成的词源对索引倒排表进行检索,根据用户输入的搜索结果范围优先从所生成的词源对应的有序列表中获取记录,以得到所需的搜索结果和相关结果总数,从而可在保证搜索结果的相关性(即有效性)的前提下,多方面衡量各种信息的价值,以使得有价值的信息的排名靠前,因而可大大提高搜索速度,提高搜索引擎的搜索响应时间;同时,还可降低CPU和内存等系统资源的占用,从而节约了大量的硬、软件资源。It can be seen from the above technical scheme that in the present invention, since the key attribute and the key attribute weight are set for the document, the KFScore of each document can be calculated, and an index posting list is established according to the KFScore of each document, so that the index posting list in the index posting list Document lists consist of ordered and unordered lists. When the user enters a search string, the corresponding etymology can be generated, and the index inverted list can be searched according to the generated etymology, and the ordered list corresponding to the generated etymology can be prioritized according to the range of search results entered by the user In order to obtain the required search results and the total number of relevant results, the value of various information can be measured in many ways under the premise of ensuring the relevance (that is, validity) of the search results, so that the value of valuable information The top ranking can greatly increase the search speed and the search response time of the search engine; at the same time, it can also reduce the occupation of system resources such as CPU and memory, thereby saving a lot of hardware and software resources.

附图说明 Description of drawings

图1是本发明中的搜索方法的流程图。Fig. 1 is a flowchart of the search method in the present invention.

图2为本发明中文档的信息结构图。FIG. 2 is an information structure diagram of a document in the present invention.

图3为本发明中索引倒排表的结构示意图。Fig. 3 is a schematic structural diagram of an indexed inverted list in the present invention.

图4为本发明中单Term搜索方法的流程图。Fig. 4 is a flow chart of the single Term search method in the present invention.

图5为本发明具体实例中的索引倒排表的结构示意图。Fig. 5 is a schematic structural diagram of an index posting list in a specific example of the present invention.

图6为本发明中多Term搜索方法的流程图。Fig. 6 is a flow chart of the multi-term search method in the present invention.

图7为本发明中多层搜索方法的流程图。Fig. 7 is a flow chart of the multi-layer search method in the present invention.

图8为本发明中步骤702的一种实现方法的流程图。FIG. 8 is a flowchart of an implementation method of step 702 in the present invention.

图9为本发明中的搜索装置的结构示意图。Fig. 9 is a schematic structural diagram of the search device in the present invention.

具体实施方式 Detailed ways

为使本发明的目的、技术方案和优点表达得更加清楚明白,下面结合附图及具体实施例对本发明再作进一步详细的说明。In order to make the object, technical solution and advantages of the present invention more clearly, the present invention will be further described in detail below in conjunction with the accompanying drawings and specific embodiments.

图1是本发明中的搜索方法的流程图。Fig. 1 is a flowchart of the search method in the present invention.

如图1所示,该方法包括:As shown in Figure 1, the method includes:

步骤101,为同一类文档设置至少一个关键属性(KeyField)并预先设置各个关键属性的关键属性权重。Step 101, setting at least one key attribute (KeyField) for the same type of document and pre-setting the key attribute weight of each key attribute.

在数据业务领域中,可将描述一个事物的信息集合称之为一个文档(Document)。图2为本发明中文档的信息结构图。如图2所示,在本发明的技术方案中,每个文档中都包括多个词源(Term)和多个属性(Field);其中,所述属性可以由一个或多个Term组成。例如,一个文档中可以包括“标题”、“内容”等属性,其中的“标题”属性可以由一个或多个Term组成。再例如,图2中的属性1由N个词源组成,属性J由M个词源组成等。In the field of data business, a collection of information describing a thing can be called a document (Document). FIG. 2 is an information structure diagram of a document in the present invention. As shown in FIG. 2 , in the technical solution of the present invention, each document includes multiple etymologies (Term) and multiple attributes (Field); wherein, the attributes may be composed of one or more Term. For example, a document may include attributes such as "title" and "content", wherein the "title" attribute may be composed of one or more Term. For another example, attribute 1 in FIG. 2 is composed of N etymologies, attribute J is composed of M etymologies, and so on.

另外,在本发明的技术方案中,还将为每个文档设置一个或多个用于标识该文档的重要信息的关键属性,且同一类文档具有相同的关键属性。如下的表1为本发明中关键属性的示例说明。In addition, in the technical solution of the present invention, one or more key attributes for identifying important information of the document will be set for each document, and documents of the same type have the same key attributes. Table 1 below is an illustration of the key attributes of the present invention.

  文档类别 Document category   关键属性1 Key attribute 1   关键属性2 Key attribute 2 ......  …   商品类文档 Commodity documents   商品价格 commodity price   购买次数 Number of purchases ......  …   论文类文档 Thesis documents   引用次数 Citations   下载次数 download times ......  …   音乐类文档 Music Documentation   试听次数 Trial times   下载次数 download times ......  …   ...... ...   ...... ...   ...... ... ......  …

表1Table 1

如表1所述,对于商品类文档,可以将商品价格和/或购买次数设置为关键属性;对于论文类文档,可以将引用次数和/或下载次数作为关键属性;对于音乐类文档,可以将试听次数和/或下载次数设置为关键属性。As described in Table 1, for commodity documents, commodity price and/or purchase times can be set as key attributes; for paper documents, citation times and/or download times can be used as key attributes; for music documents, you can set Trials and/or downloads are set as key attributes.

在设置关键属性之后,还将根据实际应用情况为所设置的各个关键属性预先确定相应的权重Weight(可称为关键属性权重)。After the key attributes are set, corresponding weights Weight (may be referred to as key attribute weights) will be predetermined for each set key attribute according to actual application conditions.

步骤102,根据所述关键属性和关键属性权重计算各个文档的关键属性分值(KFScore)。Step 102, calculating the key attribute score (KFScore) of each document according to the key attribute and the key attribute weight.

在设置上述关键属性和关键属性权重之后,即可根据所设置的关键属性和关键属性权重计算各个文档的关键属性分值(KFScore)。After the above key attributes and key attribute weights are set, the key attribute score (KFScore) of each document can be calculated according to the set key attributes and key attribute weights.

例如,可通过如下所述的公式(1)来计算一个文档的KFScore:For example, the KFScore of a document can be calculated by formula (1) as follows:

KFScore=KeyField1*W1+KeyField2*W2+......+KeyFieldX*WX KFScore=KeyField 1 *W 1 +KeyField 2 *W 2 +......+KeyField X *W X

其中,W1+W2+......+WX=1    (1)Wherein, W 1 +W 2 +...+W X =1 (1)

所述KeyField1表示该文档第1个KeyField的取值,所述W1表示与该文档第1个KeyField相对应的Weight的取值,所述KeyFieldx表示所述文档第x个KeyField的取值,所述Wx表示与该文档第x个KeyField相对应的Weight的取值,依此类推。The KeyField 1 represents the value of the first KeyField of the document, the W 1 represents the value of the Weight corresponding to the first KeyField of the document, and the KeyField x represents the value of the x-th KeyField of the document , the W x represents the value of Weight corresponding to the xth KeyField of the document, and so on.

步骤103,将全部待检索文档以Term为关键字(Key)进行索引,建立以Term为关键字索引、以包含该Term的文档的总数(TotalCount)和包含该Term的文档列表(DocList)为值(Value)的索引倒排表InvertIndexList。Step 103, all the documents to be retrieved are indexed with Term as the keyword (Key), and the index is set up with Term as the keyword, with the total number (TotalCount) of the documents containing the Term and the document list (DocList) containing the Term as the value (Value) index inverted list InvertIndexList.

在本步骤中,将建立一个索引倒排表。具体来说,将建立一个以Term为关键字索引、以键值对<TotalCount,DocList>为值的索引倒排表,该索引倒排表可称为InvertIndexList。In this step, an index posting list will be created. Specifically, an index inversion list with Term as the keyword index and key-value pair <TotalCount, DocList> as the value will be established. The index inversion list can be called InvertIndexList.

图3为本发明中索引倒排表的结构示意图。如图3所示,本发明中的索引倒排表InvertIndexList中包括一条或多条记录,每条记录均可以键值对<Key,Value>方式存储,每条记录中可包括两个字段:关键字(Key)字段和键值(Value)字段。其中,所述Key字段用于存储文档所对应的Term,Value字段用于存储与所述Key字段中的Term相对应的键值对<TotalCount,DocList>。其中,DocList为包含该Term的文档列表,而TotalCount为DocList中的文档总数。Fig. 3 is a schematic structural diagram of an indexed inverted list in the present invention. As shown in Figure 3, the index inverted table InvertIndexList among the present invention includes one or more records, and each record can be stored in a key-value pair <Key, Value> mode, and can include two fields in each record: the key Word (Key) field and key value (Value) field. Wherein, the Key field is used to store the Term corresponding to the document, and the Value field is used to store the key-value pair <TotalCount, DocList> corresponding to the Term in the Key field. Among them, DocList is a list of documents containing the Term, and TotalCount is the total number of documents in DocList.

如图3所示,在本发明的具体实施例中,所述DocList可以为链表,而TotalCount则为该链表中的元素总数。所述链表中包括多条记录,每条记录均以键值对<KFScore,Did>方式存储,每个键值对都对应于一个包含所对应的Term的文档。其中,所述KFScore表示该文档的关键属性分值,可通过步骤102中的公式(1)计算得到;所述Did则表示该文档的标识,例如,文档的编号。As shown in FIG. 3 , in a specific embodiment of the present invention, the DocList may be a linked list, and TotalCount is the total number of elements in the linked list. The linked list includes multiple records, and each record is stored as a key-value pair <KFScore, Did>, and each key-value pair corresponds to a document containing the corresponding Term. Wherein, the KFScore represents the key attribute score of the document, which can be calculated by the formula (1) in step 102; the Did represents the identification of the document, for example, the serial number of the document.

此外,在本发明的具体实施例中,对于索引倒排表InvertIndexList中的每个DocList,均可预先按照KFScore的大小对同一个DocList中的记录进行部分排序,即先在该DocList中确定前n个KFScore最大的记录,并将所述前n个KFScore最大的记录按KFScore大小顺序排列,组成一个有序列表(OrderedList);该DocList中的其它记录组成一个无序列表(DisorderedList)。因此,本发明的具体实施例中每一个DocList均是由有序列表和无序列表组成,所述有序列表中包括n个关键属性分值最大、且按关键属性分值从大到小顺序排列的记录。其中,所述n为预先确定的值,因此,所述n即为OrderedList的长度值。In addition, in a specific embodiment of the present invention, for each DocList in the index inverted list InvertIndexList, the records in the same DocList can be partially sorted in advance according to the size of the KFScore, that is, the first n KFScore maximum records, and the first n KFScore maximum records are arranged in order of KFScore size to form an ordered list (OrderedList); other records in the DocList form an unordered list (DisorderedList). Therefore, in the specific embodiment of the present invention, each DocList is composed of an ordered list and an unordered list, and the ordered list includes n key attribute scores that are the largest, and are ordered from large to small according to the key attribute scores. sorted records. Wherein, the n is a predetermined value, therefore, the n is the length value of the OrderedList.

由此可知,在本发明的具体实施例中,无需对DocList中的所有记录都按KFScore大小进行排序,而只需进行部分排序,即只需确定前n个KFScore最大的记录,并将前n个KFScore最大的记录(即有序列表中的记录)按KFScore大小进行排序即可,该DocList中的其它记录(即无序列表中的记录)则可以不进行排序。It can be seen that, in a specific embodiment of the present invention, it is not necessary to sort all the records in the DocList by KFScore size, but only need to perform partial sorting, that is, only need to determine the record with the largest KFScore of the first n, and sort the first n The record with the largest KFScore (that is, the record in the ordered list) can be sorted according to the size of the KFScore, and the other records in the DocList (that is, the records in the unordered list) can not be sorted.

步骤104,根据用户输入的检索字符串生成相应的词源,根据所生成的词源对所述索引倒排表进行检索,并根据用户输入的搜索结果范围优先从所生成的词源对应的有序列表中获取记录,以得到所需的搜索结果和相关结果总数。Step 104, generate the corresponding etymology according to the search character string input by the user, search the index posting list according to the generated etymology, and prioritize the corresponding etymology from the generated etymology according to the range of search results input by the user. Records in the sequence table to get the desired search results and the total number of related results.

在本发明的技术方案中,在通过上述步骤103建立索引倒排表之后,即可根据用户输入的检索字符串和搜索结果范围对所述索引倒排表进行检索,以得到所需的搜索结果和相关结果总数。In the technical solution of the present invention, after the index posting list is established through the above step 103, the index posting list can be retrieved according to the search character string and the search result range input by the user to obtain the desired search result and the total number of related results.

具体来说,首先可根据用户输入的检索字符串生成相应的词源。Specifically, firstly, a corresponding etymology can be generated according to the search character string input by the user.

例如:当用户所输入的检索字符串为“刘德华”,则可根据不同的转换规则将该检索字符串“刘德华”转换成相对应的词源:“刘德华”或“liudehua”。For example: when the search string entered by the user is "Andy Lau", the search string "Andy Lau" can be converted into the corresponding etymology: "Andy Lau" or "liudehua" according to different conversion rules.

当用户所输入的检索字符串为“刘德华天意”,则可根据不同的转换规则将该检索字符串“刘德华天意”转换成相对应的词源:“刘德华天意”或“liudehuatianyi”;或者可以将该检索字符串“刘德华天意”转换成相应的两个词源:“刘德华”和“天意”。When the search string entered by the user is "Andy Lau's will", the search string "Andy Lau's will" can be converted into the corresponding etymology according to different conversion rules: "Andy Lau's will" or "liudehuatianyi"; The search string "Andy Lau God's Will" is converted into two corresponding etymologies: "Andy Lau" and "God's Will".

在生成相应词源之后,即可根据该词源对所述索引倒排表进行检索,判断所述索引倒排表中是否存储有上述词源。After the corresponding etymology is generated, the index postings can be retrieved according to the etymology, and it is judged whether the above etymology is stored in the index postings.

当所述索引倒排表中存储有该词源时,则可读取该词源所对应的DocList,并根据用户输入的搜索结果范围优先从所述DocList的有序列表中获取记录,以得到所需的搜索结果和相关结果总数。When the etymology is stored in the index posting table, the DocList corresponding to the etymology can be read, and the record is preferentially obtained from the ordered list of the DocList according to the search result range input by the user, to obtain The total number of desired search and related results.

其中,所述搜索结果范围为用户所期望的搜索结果所在的范围。因此,所述搜索结果范围包括:起始位置(InitialPositon)和条数(Count),所述搜索结果范围的最大值即为:(InitialPositon+Count)-1。其中,所述起始位置为搜索结果的起始位置,所述条数为搜索结果的条数。Wherein, the range of the search result is the range where the search result desired by the user is located. Therefore, the search result range includes: an initial position (InitialPositon) and the number of items (Count), and the maximum value of the search result range is: (InitialPositon+Count)-1. Wherein, the starting position is the starting position of the search result, and the number of items is the number of the search result.

例如,当用户输入检索字符串进行搜索后,将会通过网页的形式查看搜索结果。以每个网页显示Count=30条搜索结果为例,当用户希望查看第1页的搜索结果时,InitialPositon=1,Count=30,所述搜索结果范围即为:第1~30条搜索结果;而当用户在查看完第1页的搜索结果后点击“下一页”,希望查看第2页的搜索结果时,则InitialPositon=31,Count=30,所述搜索结果范围即为:第31~60条搜索结果。依此可类推。For example, after a user enters a search string to perform a search, the search result will be viewed in the form of a web page. Take each webpage displaying Count=30 search results as an example, when the user wants to view the search results on the first page, InitialPositon=1, Count=30, the range of the search results is: the first to 30 search results; And when the user clicks "next page" after viewing the search results of the first page, and wishes to view the search results of the second page, then InitialPositon=31, Count=30, the range of the search results is: the 31st~ 60 search results. And so on.

因此,在本发明的具体实施例中,所述根据用户输入的搜索结果范围优先从所述DocList的有序列表中获取记录,以得到所需的搜索结果可以包括:Therefore, in a specific embodiment of the present invention, said obtaining records from the ordered list of DocList preferentially according to the range of search results input by the user, so as to obtain the required search results may include:

当所述搜索结果范围的最大值小于或等于有序列表的长度n时,此时的搜索结果范围包含于所述有序列表中,因此,可直接从有序列表中的第InitialPositon条记录开始,读取Count条记录作为搜索结果,而无需读取无序列表中的记录。由于有序列表中的记录均已按KFScore的大小顺序排列,因此所读取的Count条记录即为按KFScore的大小顺序排列的搜索结果。When the maximum value of the search result range is less than or equal to the length n of the ordered list, the search result range at this time is included in the ordered list, so it can directly start from the InitialPositon record in the ordered list , to read Count records as search results without reading the records in the unordered list. Since the records in the ordered list have been arranged in the order of KFScore size, the Count records read are the search results arranged in order of KFScore size.

当所述搜索结果范围的起始位置位于有序列表中(即InitialPositon≤n),且所述搜索结果范围的最大值大于有序列表的长度n时,此时的搜索结果范围的一部分位于有序列表中,另一部分位于无序列表中,此时,可从有序列表中的第InitialPositon条记录开始,读取(n-InitialPositon+1)条记录,然后将所述无序列表中的记录按KFScore的大小顺序排列,并从所述无序列表中的第1条记录开始,读取[Count-(n-InitialPositon+1)]条记录;将所读取的Count条记录作为搜索结果。此时,所读取的Count条记录也是按KFScore的大小顺序排列的搜索结果。When the starting position of the search result range is in the ordered list (that is, InitialPositon≤n), and the maximum value of the search result range is greater than the length n of the ordered list, a part of the search result range at this time is located in In the sequence list, the other part is located in the unordered list. At this time, the (n-InitialPositon+1) record can be read from the InitialPositon record in the ordered list, and then the records in the unordered list Arrange according to the size order of KFScore, and read [Count-(n-InitialPositon+1)] records starting from the first record in the unordered list; use the read Count records as the search result. At this time, the Count records read are also the search results arranged in order of KFScore size.

当所述搜索结果范围的起始位置位于无序列表中(即InitialPositon>n)时,此时的搜索结果范围都位于所述无序列表中,因此,可将所述无序列表中的记录按KFScore的大小顺序排列,然后从排序后的无序列表中的第(InitialPositon-n)条记录开始,读取Count条记录作为搜索结果。此时,所读取的Count条记录也是按KFScore的大小顺序排列的搜索结果。When the starting position of the search result range is in the unordered list (that is, InitialPositon>n), the search result range at this time is all located in the unordered list, so the records in the unordered list can be Arrange according to the size of KFScore, and then start from the (InitialPositon-n)th record in the sorted unordered list, read Count records as the search results. At this time, the Count records read are also the search results arranged in order of KFScore size.

通过上述的方法,即可获得所需的搜索结果。Through the above methods, the desired search results can be obtained.

考虑到用户的搜索习惯,一般情况下用户只会关心排序靠前的数十条搜索结果。根据权威机构对用户的搜索习惯的统计结果分析可知,超过90%的用户一般只会查看搜索结果的前两页(一般为20条左右的搜索结果)。因此,在搜索数据量为数亿级别的情况下,如果有序列表的长度n的取值超过一定的值(例如,n=100),则在绝大部分情况下,用户所输入的搜索结果范围都包含在有序列表之中,从而可以直接从有序列表中获取所需条数的且已按KFScore的大小顺序排列的搜索结果,而无需再对搜索结果进行排序,所需内存空间较少,CPU运算次数较少,响应时间较高,且占用的系统资源较少,从而可以大大提高搜索引擎的搜索响应时间,同时还可节约大量的硬、软件资源。Considering the user's search habits, generally the user only cares about the top dozens of search results. According to the analysis of statistical results of users' search habits by authoritative organizations, more than 90% of users generally only view the first two pages of search results (generally about 20 search results). Therefore, when the amount of search data is hundreds of millions of levels, if the value of the length n of the ordered list exceeds a certain value (for example, n=100), then in most cases, the search results entered by the user The ranges are included in the ordered list, so that the required number of search results that have been arranged in order of the size of the KFScore can be obtained directly from the ordered list, without the need to sort the search results, and the required memory space is relatively small. The number of CPU operations is less, the response time is higher, and the system resources occupied are less, which can greatly improve the search response time of the search engine, and at the same time save a lot of hardware and software resources.

更进一步地,在本发明的具体实施例中,当所述索引倒排表中未存储根据用户输入的检索字符串所生成的词源时,则可向用户返回空集作为搜索结果,表示未搜索到相关的文档。Furthermore, in a specific embodiment of the present invention, when the etymology generated according to the search string input by the user is not stored in the index posting table, an empty set may be returned to the user as the search result, indicating that there is no Find related documents.

另外,在本发明的技术方案中,上述步骤104可以有多种实现方式,即可以使用多种搜索方法来实现上述的步骤104。以下将以多个具体实施例的方式对各种搜索方法分别进行介绍。In addition, in the technical solution of the present invention, the above step 104 can be implemented in various manners, that is, various search methods can be used to implement the above step 104. In the following, various search methods will be introduced respectively in the form of multiple specific embodiments.

实施例一:单Term搜索方法。Embodiment 1: Single Term search method.

在本实施例中,所述单Term搜索方法是指根据用户输入的检索字符串只生成一个相应的词源时进行搜索的方法。图4为本发明中单Term搜索方法的流程图。如图4所示,所述单Term搜索方法可包括如下所述的步骤:In this embodiment, the single-term search method refers to a method of searching when only one corresponding etymology is generated according to the search character string input by the user. Fig. 4 is a flow chart of the single Term search method in the present invention. As shown in Figure 4, the single Term search method may include the steps as follows:

步骤401,根据用户输入的检索字符串生成一个相应的词源。Step 401, generate a corresponding etymology according to the search character string input by the user.

在本步骤中,可以将用户输入的检索字符串作为一个词,直接将该词作为该检索字符串相应的词源。例如,当用户所输入的检索字符串为“刘德华”时,生成与该检索字符串相对应的词源:“刘德华”。In this step, the search character string input by the user may be used as a word, and the word may be directly used as the corresponding etymology of the search character string. For example, when the search character string input by the user is "Andy Lau", an etymology corresponding to the search character string: "Andy Lau" is generated.

步骤402,根据所生成的词源检索索引倒排表,判断所述索引倒排表中是否存储有该词源;如果是,则执行步骤403;否则,执行步骤415。Step 402 , according to the generated etymology retrieval index posting table, judge whether the etymology is stored in the index posting table; if yes, execute step 403 ; otherwise, execute step 415 .

步骤403,从所述索引倒排表中读取与所述词源相对应的文档列表。Step 403, read the document list corresponding to the etymology from the index posting list.

由于索引倒排表中的每个词源所对应的值(Value)中都包括一个文档列表(DocList),因此,在步骤中可根据上述所生成的词源从所述索引倒排表中读取与所述词源相对应的文档列表。Because all include a list of documents (DocList) in the corresponding value (Value) of each etymology in the index posting list, therefore, can read from the described index posting list according to the above-mentioned etymology generated in the step Get a list of documents corresponding to said etymology.

步骤404,对用户输入的搜索结果范围进行解析,获得起始位置和条数。Step 404, analyzing the search result range input by the user to obtain the starting position and the number of items.

在本步骤中,可通过对用户所输入的搜索结果范围进行解析,获得起始位置InitialPositon和条数Count。In this step, the initial position InitialPositon and the number of items Count can be obtained by analyzing the search result range input by the user.

步骤405,判断(InitialPositon+Count)-1>n是否成立,如果是,则执行步骤406;否则,执行步骤412;Step 405, judging whether (InitialPositon+Count)-1>n holds true, if yes, then execute step 406; otherwise, execute step 412;

由于所述搜索结果范围的最大值为(InitialPositon+Count)-1,因此在本步骤中,将首先判断搜索结果范围的最大值是否大于文档列表的有序列表的长度n,即判断(InitialPositon+Count)-1>n是否成立。Because the maximum value of the search result range is (InitialPositon+Count)-1, so in this step, it will first be judged whether the maximum value of the search result range is greater than the length n of the ordered list of the document list, that is, to judge (InitialPositon+Count) Count)-1>n is established.

当搜索结果范围的最大值大于文档列表的有序列表的长度时,则说明所述搜索结果范围已经超出有序列表的范围,此时,仅从有序列表中将无法读取全部所需条数的搜索结果,因此,将继续执行步骤406;When the maximum value of the search result range is greater than the length of the ordered list of the document list, it means that the search result range has exceeded the range of the ordered list. At this time, it will not be possible to read all the required items from the ordered list Number of search results, therefore, will continue to perform step 406;

当搜索结果范围的最大值小于或等于文档列表的有序列表的长度时,则说明所述搜索结果范围并未超出有序列表的范围,此时,仅从有序列表中即可读取全部所需的搜索结果,因此可以执行步骤412,从有序列表中直接获取所需条数的搜索结果。When the maximum value of the search result range is less than or equal to the length of the ordered list of the document list, it means that the search result range does not exceed the range of the ordered list. At this time, all documents can be read only from the ordered list. required search results, therefore step 412 can be executed to directly obtain the required number of search results from the ordered list.

步骤406,判断InitialPositon>n是否成立,如果是,则执行步骤410;否则,执行步骤407;Step 406, judging whether InitialPositon>n holds true, if yes, execute step 410; otherwise, execute step 407;

在本步骤中,将判断搜索结果范围中的InitialPositon是否大于有序列表的长度n。In this step, it will be judged whether the InitialPositon in the search result range is greater than the length n of the ordered list.

当搜索结果范围中的InitialPositon大于有序列表的长度时,则说明所述搜索结果范围的InitialPositon已经位于无序列表中,所述搜索结果均位于无序列表中,所述有序列表中并不包括搜索结果,因此可执行步骤410;When the InitialPositon in the search result range is greater than the length of the ordered list, it means that the InitialPositon in the search result range is already in the unordered list, the search results are all in the unordered list, and the ordered list does not Include search results, so step 410 can be performed;

当搜索结果范围中的InitialPositon小于或等于有序列表的长度时,则说明所述搜索结果范围的InitialPositon仍位于有序列表中,即有部分搜索结果位于有序列表中,但也有部分搜索结果位于无序列表中。此时可执行步骤407。When the InitialPositon in the search result range is less than or equal to the length of the ordered list, it means that the InitialPositon in the search result range is still in the ordered list, that is, some search results are in the ordered list, but some search results are in the in an unordered list. At this time, step 407 can be executed.

步骤407,从有序列表中的第InitialPositon条记录开始,读取(n-InitialPositon+1)条记录加入到搜索结果列表(ResultList)中。Step 407, starting from the InitialPositon'th record in the ordered list, read (n-InitialPositon+1) records and add them to the search result list (ResultList).

步骤408,将所述无序列表中的记录按KFScore的大小顺序排列。Step 408, arrange the records in the unordered list according to the size of the KFScore.

步骤409,从所述无序列表中读取[Count-(n-InitialPositon+1)]条记录加入到ResultList中,执行步骤413。Step 409, read [Count-(n-InitialPositon+1)] records from the unordered list and add them to ResultList, and execute step 413.

步骤410,将所述无序列表中的记录按KFScore的大小顺序排列。Step 410, arrange the records in the unordered list according to the size of the KFScore.

步骤411,从排序后的无序列表中的第(InitialPositon-n)条记录开始,读取Count条记录加入到ResultList中,执行步骤413。Step 411, starting from the (InitialPositon-n)th record in the sorted unordered list, read Count records and add them to the ResultList, and execute step 413.

步骤412,从有序列表中的第InitialPositon条记录开始,读取Count条记录加入到ResultList中,执行步骤413。Step 412, starting from the InitialPositon record in the ordered list, read Count records and add them to the ResultList, and execute step 413.

步骤413,将所述ResultList作为搜索结果,并将从所述索引倒排表中读取的与所述词源相对应的文档列表的文档总数TotalCount作为相关结果总数SumCount。In step 413, the ResultList is used as the search result, and the total number of documents TotalCount of the document list corresponding to the etymology read from the index posting list is used as the total number of related results SumCount.

步骤414,将所述搜索结果和相关结果总数返回给用户。结束流程。Step 414, returning the search result and the total number of related results to the user. End the process.

步骤415,将空集作为搜索结果,设SumCount=0,将所述空集和相关结果总数返回给用户。结束流程。Step 415, take the empty set as the search result, set SumCount=0, and return the empty set and the total number of related results to the user. End the process.

通过上述的步骤401~415,即可实现单Term搜索。Through the above steps 401-415, a single-term search can be realized.

以下将以音乐搜索领域的具体实例为例,对上述的单Term搜索方法进行进一步的介绍。The following will take a specific example in the field of music search as an example to further introduce the above-mentioned single-term search method.

具体实例一:Specific example one:

假设在音乐搜索领域中待检索的音乐文档有一千万个,每个文档的音乐信息包括:歌曲名(SongName)、歌手名(SingerName)、专辑名(AlbumName)、歌词、歌曲的试听量(ListenCount)和业务量(BusinessCount)。Suppose there are 10 million music files to be retrieved in the field of music search, and the music information of each file includes: song name (SongName), singer name (SingerName), album name (AlbumName), lyrics, song trial listening volume ( ListenCount) and business volume (BusinessCount).

在本发明的实例中,可将所述ListenCount和BusinessCount设置为上述音乐文档的关键属性,关键属性权重分别设置为0.3和0.7,则各个音乐文档的关键属性分值(KFScore)可通过如下所述公式计算:In the example of the present invention, described ListenCount and BusinessCount can be set as the key attribute of above-mentioned music file, and key attribute weight is set to 0.3 and 0.7 respectively, then the key attribute score value (KFScore) of each music file can be passed as follows Formula calculation:

KFScore=ListenCount*0.3+BusinessCount*0.7KFScore=ListenCount*0.3+BusinessCount*0.7

根据上述待检索的音乐文档、关键属性和关键属性分值可建立索引倒排表InvertIndexList,并可将索引倒排表中的有序列表的长度n设置为100。According to the above music files to be retrieved, key attributes and key attribute scores, an index inverted list InvertIndexList can be established, and the length n of the ordered list in the index inverted list can be set to 100.

图5为本发明具体实例中的索引倒排表的结构示意图。如图5所示,词源“刘德华”所对应的值为:<10000,DocList>,因此,与词源“刘德华”相对应的DocList中的文档总数TotalCount=10000,该DocList中的有序列表的长度n=100,即有序列表中包括100条记录;无序列表的长度为9900,即无序列表中包括9900条记录;同理,词源“我和你”所对应的值为:<9870,DocList>,因此,与词源“我和你”相对应的DocList中的文档总数TotalCount=9870;......;词源“天意”所对应的值为:<60,DocList>,因此,与词源“天意”相对应的DocList中的文档总数TotalCount=60;其中,可再假设在上述60个文档中,有20个文档中的歌手名为“刘德华”。Fig. 5 is a schematic structural diagram of an index posting list in a specific example of the present invention. As shown in Figure 5, the value corresponding to the etymology "Andy Lau" is: <10000, DocList>, therefore, the total number of documents TotalCount=10000 in the DocList corresponding to the etymology "Andy Lau", the ordered list in this DocList The length n=100, that is, the ordered list includes 100 records; the length of the unordered list is 9900, that is, the unordered list includes 9900 records; similarly, the corresponding value of the etymology "I and you" is: <9870, DocList>, therefore, the total number of documents TotalCount=9870 in the DocList corresponding to the etymology "I and you"; ...; The corresponding value of the etymology "God's Will" is: <60, DocList >, therefore, the total number of documents TotalCount=60 in the DocList corresponding to the etymology "God's Will"; wherein, it can be further assumed that among the above-mentioned 60 documents, the singer in 20 documents is named "Andy Lau".

当用户输入的检索字符串为“刘德华”,且用户输入的搜索结果范围为:第31~50条搜索结果时,可使用如下所述的单Term搜索方法进行搜索:When the search string entered by the user is "Andy Lau", and the search result range entered by the user is: the 31st to 50th search results, the following single-term search method can be used to search:

根据用户输入的检索字符串生成相应的词源“刘德华”;对用户输入的搜索结果范围进行解析,获知起始位置InitialPositon=31,条数Count=20;Generate the corresponding etymology "Andy Lau" according to the search character string input by the user; analyze the search result range input by the user, and learn that the starting position InitialPositon=31, and the number of entries Count=20;

根据所生成的词源“刘德华”检索图5所示的索引倒排表,在索引倒排表中查找到与上述生成的词源相对应的词源“刘德华”;该词源所对应的值为:<10000,DocList>,因此,与词源“刘德华”相对应的DocList中的文档总数TotalCount=10000,该DocList中的有序列表的长度n=100,即有序列表中包括100条记录;无序列表的长度为9900,即无序列表中包括9900条记录。Search the index posting table shown in Figure 5 according to the generated etymology "Andy Lau", and find the etymology "Andy Lau" corresponding to the above-mentioned generated etymology in the index posting table; the value corresponding to the etymology is: <10000, DocList>, therefore, the total number of documents in the DocList corresponding to the etymology "Andy Lau" TotalCount=10000, the length of the ordered list in the DocList is n=100, that is, 100 records are included in the ordered list ; The length of the unordered list is 9900, that is, the unordered list includes 9900 records.

由于(InitialPositon+Count)-1=31+20-1<n=100,因此搜索结果范围包含于所述有序列表中,所以可直接从有序列表中的第31条记录开始,读取20条记录作为搜索结果;同时,相关结果总数为SumCount=TotalCount=10000。Since (InitialPositon+Count)-1=31+20-1<n=100, the search result range is included in the ordered list, so you can directly start from the 31st record in the ordered list and read 20 records as search results; meanwhile, the total number of relevant results is SumCount=TotalCount=10000.

由于有序列表中的记录均已按KFScore的大小顺序排列,因此所读取的20条记录即为按KFScore的大小顺序排列的搜索结果。Since the records in the ordered list have been arranged in order of size of KFScore, the read 20 records are the search results arranged in order of size of KFScore.

实施例二:多Term搜索方法。Embodiment 2: Multi-Term search method.

在实际应用场景中,用户有可能会输入多个关键字进行搜索,希望能较精确地定位所需查找的内容,同时,有些用户还可能会对检索结果有特定的要求,例如,需要查找既包括关键字A又包括关键字B的信息,或者需要查找包括关键字A或关键字B的信息,或者需要查找包括关键字A但不包括关键字B的信息等等。因此,用户所输入的多个关键字之间还有可能存在与(AND)、或(OR)和差(SUB)三种逻辑运算情况。In actual application scenarios, users may enter multiple keywords to search, hoping to locate the content they want to find more accurately. At the same time, some users may also have specific requirements for the search results. Information that includes keyword A and keyword B, or information that includes keyword A or keyword B needs to be found, or information that includes keyword A but does not include keyword B needs to be found, and so on. Therefore, there may be three logical operations of AND (AND), OR (OR) and difference (SUB) among the multiple keywords input by the user.

为了满足上述用户的需求,在本发明的技术方案中,还提出了一种多Term搜索方法。在本实施例中,所述多Term搜索方法是指根据用户输入的检索字符串生成多个相应的词源时进行搜索的方法。图6为本发明中多Term搜索方法的流程图。如图6所示,所述多Term搜索方法可包括如下所述的步骤:In order to meet the needs of the above users, in the technical solution of the present invention, a multi-term search method is also proposed. In this embodiment, the multi-term search method refers to a method of searching when multiple corresponding etymologies are generated according to the search character string input by the user. Fig. 6 is a flow chart of the multi-term search method in the present invention. As shown in Figure 6, the multi-Term search method may include the steps as follows:

步骤600,在索引倒排表的文档列表中设置Y段读取范围。Step 600, setting the Y-segment reading range in the document list of the index posting list.

在本发明的具体实施例中,将采用分段读取的策略进行检索,以尽量避免每次检索都在全量数据上进行操作。因此,在本步骤中,将在索引倒排表的文档列表中预先设置Y段读取范围。其中,Y≥2。In a specific embodiment of the present invention, the segmented read strategy is used for retrieval, so as to avoid operating on the full amount of data for each retrieval as much as possible. Therefore, in this step, the reading range of the Y segment will be preset in the document list of the index posting list. Among them, Y≥2.

例如,当Y=2时,将在索引倒排表的文档列表中预先设置2段读取范围。此时,可将文档列表的有序列表作为第1段读取范围,而将整个文档列表作为第2段读取范围。For example, when Y=2, a 2-segment reading range will be preset in the document list of the index posting list. At this time, the ordered list of the document list can be used as the first reading range, and the entire document list can be used as the second reading range.

再例如,当Y=3时,将在索引倒排表的文档列表中预先设置3段读取范围。此时,可将文档列表的有序列表作为第1段读取范围,将文档列表的无序列表作为第2段读取范围,而将整个文档列表作为第3段读取范围。For another example, when Y=3, the reading range of 3 segments will be preset in the document list of the index posting list. At this time, the ordered list of the document list can be used as the reading range of the first paragraph, the unordered list of the document list can be used as the reading range of the second paragraph, and the entire document list can be used as the reading range of the third paragraph.

在本发明的具体实施例中,还可以根据实际应用情况使用其它方法设置读取范围,具体的设置方法在此不再赘述。In a specific embodiment of the present invention, other methods can also be used to set the reading range according to the actual application situation, and the specific setting method will not be repeated here.

步骤601,根据用户输入的检索字符串生成一个词源队列和查询逻辑序列。Step 601, generate an etymology queue and query logic sequence according to the search character string input by the user.

在本步骤中,将对用户输入的检索字符串进行解析,生成一个词源队列TermArray和一个查询逻辑序列SetOperators。其中,所述词源队列TermArray中可以包括x个词源,可以表示为TermArray{Term1,Term2,...,Termx}。所述查询逻辑序列SetOperators中包括(x-1)个查询逻辑symbol,可以表示为SetOperators{symbol1,symbol2,...,symbolx-1}。其中,所述词源队列中各个词源之间的逻辑关系由所述查询逻辑序列中相应的查询逻辑表示。例如,所述查询逻辑序列中的symbol1表示Term1与Term2之间的逻辑关系,symbol2表示Term3与之前的Term1、Term2之间的逻辑关系,依次类推。另外,所述查询逻辑symbol的取值为:AND、OR或SUB。In this step, the search string entered by the user will be parsed to generate an etymology queue TermArray and a query logic sequence SetOperators. Wherein, the etymology queue TermArray may include x etymologies, which may be expressed as TermArray{Term 1 , Term 2 , . . . , Term x }. The query logic sequence SetOperators includes (x-1) query logic symbols, which can be expressed as SetOperators {symbol 1 , symbol 2 , . . . , symbol x-1 }. Wherein, the logical relationship among the various etymologies in the etymology queue is represented by the corresponding query logic in the query logic sequence. For example, symbol 1 in the query logic sequence represents the logical relationship between Term 1 and Term 2 , symbol 2 represents the logical relationship between Term 3 and the previous Term 1 and Term 2 , and so on. In addition, the value of the query logic symbol is: AND, OR or SUB.

例如,当用户所输入的检索字符串为“刘德华OR天意”时,可对上述检索字符串进行解析后生成词源队列TermArray{刘德华,天意}和一个查询逻辑序列SetOperators{symbol1},其中,symbol1的取值为OR。For example, when the search string entered by the user is "Andy Lau OR God's Will", the above search string can be parsed to generate an etymology queue TermArray{Andy Lau, God's Will} and a query logic sequence SetOperators{symbol 1 }, wherein, The value of symbol 1 is OR.

步骤602,对用户输入的搜索结果范围进行解析,获得起始位置和条数。Step 602: Analyze the search result range input by the user to obtain the starting position and the number of items.

在本步骤中,可通过对用户所输入的搜索结果范围进行解析,获得起始位置InitialPositon和条数Count。In this step, the initial position InitialPositon and the number of items Count can be obtained by analyzing the search result range input by the user.

步骤603,判断TermArray的长度是否大于1;如果是,则执行步骤604;否则,执行步骤616。Step 603, judge whether the length of TermArray is greater than 1; if yes, execute step 604; otherwise, execute step 616.

在本步骤中,将首先判断TermArray的长度是否大于1;如果是,则说明TermArray中至少包括两个以上的词源,将继续执行步骤604;如果不是,说明TermArray中仅包括一个词源,因此可以仅使用单Term搜索方法进行搜索,而不必使用多Term搜索方法,因此执行步骤616。In this step, at first will judge whether the length of TermArray is greater than 1; If yes, then explain that TermArray includes more than two etymology at least, will continue to perform step 604; If not, only include an etymology in the explanation TermArray, therefore It is possible to search only by using a single-term search method, instead of using a multi-term search method, so step 616 is executed.

从上述步骤603可知,在本发明的具体实施例中,所述单Term搜索方式为多Term搜索方式的一种特殊情况。It can be seen from the above step 603 that, in a specific embodiment of the present invention, the single-term search method is a special case of the multi-term search method.

步骤604,读取TermArray中的前两个词源,并设置初始值为2的变量i和初始值为1的变量i。Step 604, read the first two etymologies in TermArray, and set the variable i with an initial value of 2 and the variable i with an initial value of 1.

在本步骤中,将直接从TermArray中读取前两个词源,即词源Term1和Term2In this step, the first two etymologies, Term 1 and Term 2 , will be read directly from the TermArray.

另外,在本步骤中还可以预先设置两个变量:初始值为2的变量i和初始值为1的变量j。其中,i可用于表示当前所读取的是第i个词源,而j则可用于表示当前读取范围的段数,即表示当前读取范围为第j段读取范围。In addition, in this step, two variables may also be preset: a variable i with an initial value of 2 and a variable j with an initial value of 1. Among them, i can be used to indicate that the current reading is the i-th etymology, and j can be used to indicate the number of paragraphs in the current reading range, that is, it means that the current reading range is the reading range of the jth paragraph.

步骤605,根据上述所读取的两个词源分别检索索引倒排表,从所述索引倒排表中分别读取与所述两个词源相对应的文档列表中第j段读取范围内的记录,将所读取的记录分别存储于第一检索结果列表(ResultList1)和第二检索结果列表(ResultList2)中。Step 605: Retrieve the index posting list according to the two etymologies read above, and respectively read the reading range of the jth paragraph in the document list corresponding to the two etymologies from the index posting list Records in , store the read records in the first search result list (ResultList1) and the second search result list (ResultList2) respectively.

具体来说,在本步骤中可将上述所读取的两个词源分别看成两个独立的词源,根据每个词源分别检索索引倒排表。如果索引倒排表中存储有上述两个词源,则从所述索引倒排表中分别读取与所述两个词源相对应的文档列表中第j段读取范围内的记录。然后,将所读取的第一个词源(即Term1)相对应的文档列表中第j段读取范围内的记录存储于ResultList1中,将所读取的第二个词源(即Term2)相对应的文档列表中第j段读取范围内的记录存储于ResultList2中。Specifically, in this step, the two etymologies read above can be regarded as two independent etymologies, and the index posting list is retrieved according to each etymology. If the above two etymologies are stored in the index posting list, the records within the read range of the jth paragraph in the document list corresponding to the two etymologies are respectively read from the index posting list. Then, the records in the reading scope of the first section j in the document list corresponding to the first etymology (i.e. Term 1 ) that are read are stored in ResultList1, and the second etymology (i.e. Term 1) that is read is stored in ResultList1. 2 ) Records within the reading range of the jth paragraph in the corresponding document list are stored in ResultList2.

在本步骤中,此时的j=1,因此所读取的均为文档列表中第1段读取范围内的记录。如果在步骤600中将所述第1段读取范围设置为有序列表,则此时所读取的均为文档列表的有序列表中的记录。In this step, j=1 at this time, so all the records read are within the reading range of the first paragraph in the document list. If in step 600 the reading scope of the first segment is set as an ordered list, all the records read at this time are the records in the ordered list of the document list.

更进一步的,如果索引倒排表中未存储上述两个词源中的任意一个词源,则可将该未存储的词源所对应的检索结果列表设置为空集。例如,如果索引倒排表中未存储上述两个词源中的第一个词源(即Term1),则将该词源所对应的检索结果列表ResultList1设置为空集;如果索引倒排表中未存储上述两个词源中的第二个词源(即Term2),则将该词源所对应的检索结果列表ResultList2设置为空集。Furthermore, if any one of the above two etymologies is not stored in the index posting list, the retrieval result list corresponding to the unstored etymology can be set as an empty set. For example, if the first etymology (i.e. Term 1 ) in the above two etymologies is not stored in the index posting list, then the retrieval result list ResultList1 corresponding to the etymology is set to an empty set; if the index posting list If the second etymology (ie Term 2 ) of the above two etymologies is not stored in , then the retrieval result list ResultList2 corresponding to the etymology is set as an empty set.

步骤606,读取查询逻辑序列SetOperators中的第(i-1)个逻辑符号symboli-1Step 606, read the (i-1)th logic symbol symbol i-1 in the query logic sequence SetOperators.

步骤607,判断所读取的逻辑符号的取值;当所述逻辑符号的取值为AND时,执行步骤608;当所述逻辑符号的取值为OR时,执行步骤609;当所述逻辑符号的取值为SUB时,执行步骤610。Step 607, judge the value of the logical symbol read; when the value of the logical symbol is AND, execute step 608; when the value of the logical symbol is OR, execute step 609; When the value of the symbol is SUB, step 610 is executed.

步骤608,使用与(AND)逻辑合并ResultList1和ResultList2,合并结果加入到搜索结果列表(ResultList)中;执行步骤611。In step 608, ResultList1 and ResultList2 are merged using AND (AND) logic, and the merged result is added to the search result list (ResultList); step 611 is executed.

在本步骤中,将使用与逻辑合并ResultList1和ResultList2,并将合并结果加入到ResultList中。In this step, ResultList1 and ResultList2 will be merged using AND logic, and the merged result will be added to ResultList.

由于ResultList1和ResultList2中的每条记录都包括两项属性:关键属性得分(KFScore)和文档标识(Did),而进行与逻辑合并的目的在于从两个检索结果列表中找出Did相同的记录。由于KFScore的特殊性可知,如果两个文档的KFScore相同,则这两个文档有可能属于同一文档,而如果两个文档的KFScore不同,则一定不是同一文档。Since each record in ResultList1 and ResultList2 includes two attributes: key attribute score (KFScore) and document identifier (Did), the purpose of logical merging is to find out records with the same Did from the two retrieval result lists. Due to the particularity of KFScore, if the KFScores of two documents are the same, the two documents may belong to the same document, and if the KFScores of the two documents are different, they must not be the same document.

基于上述的原因,在本发明的具体实施例中,所述使用与逻辑合并ResultList1和ResultList2,并将合并结果加入到ResultList中可以包括:Based on the above reasons, in a specific embodiment of the present invention, the use and logic of merging ResultList1 and ResultList2, and adding the merged result to ResultList may include:

逐一比较ResultList1和ResultList2中的各条记录,将两个检索结果列表中KFScore和Did相同的两条记录作为一个搜索结果加入到搜索结果列表(ResultList)中。Compare the records in ResultList1 and ResultList2 one by one, and add the two records with the same KFScore and Did in the two search result lists as a search result to the search result list (ResultList).

步骤609,使用或(OR)逻辑合并ResultList1和ResultList2,合并结果加入到搜索结果列表(ResultList)中;执行步骤611。Step 609, use OR logic to merge ResultList1 and ResultList2, and add the merged result to the search result list (ResultList); execute step 611.

在本步骤中,将使用或逻辑合并ResultList1和ResultList2,并将合并结果加入到ResultList中。In this step, the OR logic will be used to merge ResultList1 and ResultList2, and the merged result will be added to ResultList.

例如,在本发明的具体实施例中,所述使用或逻辑合并ResultList1和ResultList2,并将合并结果加入到ResultList中可以包括:For example, in a specific embodiment of the present invention, the use or logic to merge ResultList1 and ResultList2, and adding the merged result to ResultList may include:

将所述ResultList1和ResultList2中的各条记录加入到搜索结果列表(ResultList)中,如果有两条记录的KFScore和Did均相同,则在ResultList中删除其中的一条记录。Add each record in the ResultList1 and ResultList2 to the search result list (ResultList), if there are two records with the same KFScore and Did, delete one of the records in the ResultList.

步骤610,使用差(SUB)逻辑合并ResultList1和ResultList2,合并结果加入到搜索结果列表(ResultList)中;执行步骤611。Step 610, use difference (SUB) logic to merge ResultList1 and ResultList2, and add the merged result to the search result list (ResultList); execute step 611.

在本步骤中,将使用差逻辑合并ResultList1和ResultList2,并将合并结果加入到ResultList中。In this step, ResultList1 and ResultList2 will be merged using difference logic, and the merged result will be added to ResultList.

例如,在本发明的具体实施例中,所述使用差逻辑合并ResultList1和ResultList2,并将合并结果加入到ResultList中可以包括:For example, in a specific embodiment of the present invention, the use of difference logic to merge ResultList1 and ResultList2, and adding the merged result to ResultList may include:

从所述ResultList1中去除ResultList2中存储的各条记录,将进行去除操作后的ResultList1中的记录加入到搜索结果列表(ResultList)中。Each record stored in ResultList2 is removed from the ResultList1, and the removed records in ResultList1 are added to the search result list (ResultList).

步骤611,判断TermArray中是否还有Term未使用;如果是,则执行步骤612;否则,执行步骤613。Step 611 , judging whether there are unused Term in TermArray; if yes, go to step 612 ; otherwise, go to step 613 .

步骤612,设i=i+1,读取TermArray中的第i个词源;清空ResultList1和ResultList2,并将ResultList中的记录复制到ResultList2中;根据所读取的词源检索所述索引倒排表,从所述索引倒排表中读取与所读取的词源相对应的文档列表中第j段读取范围内的记录,将所读取的记录存储于ResultList1中;返回执行步骤606。Step 612, set i=i+1, read the i-th etymology in TermArray; Clear ResultList1 and ResultList2, and copy the records in ResultList to ResultList2; Retrieve the index according to the etymology read table, read the records in the jth paragraph reading range in the document list corresponding to the etymology read from the index posting list, store the read records in ResultList1; return to execute step 606 .

具体来说,在本步骤中可先设置i=i+1,以从TermArray中读取下一个词源(即第i个词源);同时,还将ResultList1和ResultList2清空,并将之前的搜索结果(即ResultList中的记录)复制到ResultList2中,以便于在后续搜索过程中将之前的搜索结果与所读取的下一个词源所对应的搜索结果进行逻辑操作。Specifically, in this step, i=i+1 can be set earlier to read the next etymology (i.e. the i-th etymology) from TermArray; The results (that is, the records in ResultList) are copied to ResultList2, so that the previous search results and the search results corresponding to the next etymology read can be logically operated in the subsequent search process.

然后,再根据所读取的词源检索索引倒排表。如果索引倒排表中存储有上述所读取的词源,则从所述索引倒排表中读取与所述词源相对应的文档列表中第j段读取范围内的记录,并将所读取的记录存储于ResultList1中,再返回执行步骤606。Then, the index posting list is retrieved according to the etymology read. If the above-mentioned etymology read is stored in the index posting list, then read the record in the jth paragraph reading range in the document list corresponding to the etymology from the index posting list, and The read records are stored in ResultList1, and then return to step 606.

在本步骤中,如果此时的j=1,则所读取的为文档列表中第1段读取范围内的记录。而如果在步骤600中将所述第1段读取范围设置为有序列表,则此时所读取的将为文档列表的有序列表中的记录。同理,如果此时的j=2,则所读取的为文档列表中第2段读取范围内的记录。而如果在步骤600中将所述第2段读取范围设置为整个文档列表,则此时所读取的将为整个文档列表中的记录。依次类推,在此不再赘述。In this step, if j=1 at this time, what is read is the record within the reading range of the first paragraph in the document list. However, if the reading scope of the first paragraph is set as an ordered list in step 600, then the records in the ordered list of the document list will be read at this time. Similarly, if j=2 at this time, what is read is the record within the reading range of the second paragraph in the document list. However, if the reading range of the second paragraph is set to the entire document list in step 600, then the records in the entire document list will be read at this time. And so on, no more details here.

更进一步的,如果索引倒排表中未存储上述所读取的第i个词源,则可将所述ResultList1设置为空集,再返回执行步骤606。Furthermore, if the read i-th etymology is not stored in the index posting list, the ResultList1 may be set as an empty set, and then return to step 606 .

步骤613,判断是否满足结束条件;如果是,则执行步骤617;否则,执行步骤614。Step 613, judge whether the end condition is satisfied; if yes, execute step 617; otherwise, execute step 614.

其中,在本发明的具体实施例中,所述结束条件可以为:Wherein, in a specific embodiment of the present invention, the end condition may be:

ResultList中所存储的记录的数目大于搜索结果范围的最大值:(InitialPositon+Count)-1;或者,j等于Y。The number of records stored in ResultList is greater than the maximum value of the search result range: (InitialPositon+Count)-1; or, j is equal to Y.

在本发明的技术方案中,当ResultList中所存储的记录的数目大于搜索结果范围的最大值(InitialPositon+Count)-1时,则表示已经检索到足够数量的搜索结果,因此满足了结束搜索的条件。而当j大于Y时,则表示已经进行了最后一段的搜索,此时也可以结束搜索,从而满足结束条件。In the technical scheme of the present invention, when the number of records stored in the ResultList is greater than the maximum value (InitialPositon+Count)-1 of the search result range, it means that a sufficient number of search results have been retrieved, thus satisfying the requirement for ending the search condition. And when j is greater than Y, it means that the last segment of the search has been carried out, and the search can also be ended at this time, thereby satisfying the end condition.

步骤614,清空ResultList1和ResultList2;读取TermArray中的前两个词源,并设置i=2且j=j+1;根据上述所读取的两个词源分别检索索引倒排表,从所述索引倒排表中分别读取与所述两个词源相对应的文档列表中第j段读取范围内的记录,分别存储于第一检索结果列表(ResultList1)和第二检索结果列表(ResultList2)中。Step 614, empty ResultList1 and ResultList2; Read the first two etymologies in the TermArray, and set i=2 and j=j+1; Retrieve the index inverted list respectively according to the two etymologies read above, from the In the above-mentioned index posting list, read respectively the record in the jth paragraph reading range in the document list corresponding to the two etymology, store in the first retrieval result list (ResultList1) and the second retrieval result list ( ResultList2).

由于步骤613的判断结果是结束条件未满足,则表示ResultList中所存储的记录的数目小于所需的条数,也就是说,在上一阶段的检索过程中未检索到足够数量的搜索结果,所以需要进行本阶段(即第j阶段)的检索过程。Since the judgment result of step 613 is that the end condition is not met, it means that the number of records stored in the ResultList is less than the required number of bars, that is, in the retrieval process of the previous stage, a sufficient number of search results has not been retrieved, Therefore, it is necessary to carry out the retrieval process of this stage (that is, the jth stage).

因此,在本步骤中,将根据上述所读取的两个词源分别再次检索索引倒排表。如果索引倒排表中存储有上述两个词源,则从所述索引倒排表中分别读取与所述两个词源相对应的文档列表中第j段读取范围内的记录。然后,将所读取的第一个词源(即Term1)相对应的文档列表中第j段读取范围内的记录存储于ResultList1中,将所读取的第二个词源(即Term2)相对应的文档列表中第j段读取范围内的记录存储于ResultList2中。Therefore, in this step, the index posting list will be searched again according to the two etymologies read above. If the above two etymologies are stored in the index posting list, the records within the read range of the jth paragraph in the document list corresponding to the two etymologies are respectively read from the index posting list. Then, the records in the reading scope of the first section j in the document list corresponding to the first etymology (i.e. Term 1 ) that are read are stored in ResultList1, and the second etymology (i.e. Term 1) that is read is stored in ResultList1. 2 ) Records within the reading range of the jth paragraph in the corresponding document list are stored in ResultList2.

步骤615,分别将所述ResultList1和ResultList2中的记录按KFScore的大小顺序排列;返回执行步骤606。Step 615, respectively arrange the records in the ResultList1 and ResultList2 according to the size of the KFScore; return to execute step 606.

步骤616,使用上述单Term搜索方法,根据所述TermArray中的词源检索索引倒排表,得到搜索结果和相关结果总数;执行步骤618。Step 616, using the above-mentioned single-term search method, according to the etymology in the TermArray, retrieve the index posting list to obtain the search results and the total number of related results; execute step 618.

步骤617,从所述ResultList中读取所需的记录作为搜索结果,将ResultList中所存储的记录的数目作为相关结果总数(SumCount)。Step 617, read the required records from the ResultList as search results, and use the number of records stored in the ResultList as the total number of relevant results (SumCount).

例如,在本发明的具体实施例中,所述从所述ResultList中读取所需的记录作为搜索结果包括:For example, in a specific embodiment of the present invention, reading the required records from the ResultList as search results includes:

当SumCount≥(InitialPositon+Count)-1时,从所述ResultList中的第InitialPositon条记录开始,读取Count条记录作为搜索结果;When SumCount≥(InitialPositon+Count)-1, start from the first InitialPositon record in the ResultList, and read the Count record as the search result;

当InitialPositon≤SumCount<(InitialPositon+Count)-1时,则表明ResultList中的记录的数目不足,此时,从所述ResultList中的第InitialPositon条记录开始,读取(SumCount-InitialPositon+1)条记录作为搜索结果,When InitialPositon≤SumCount<(InitialPositon+Count)-1, it indicates that the number of records in the ResultList is insufficient. At this time, start from the InitialPositon record in the ResultList and read (SumCount-InitialPositon+1) records As a search result,

当SumCount<InitialPositon时,则表明ResultList中的记录的数目太少,不存在所需的搜索结果,此时,可将空集作为搜索结果。When SumCount<InitialPositon, it indicates that the number of records in the ResultList is too small, and there is no desired search result. At this time, the empty set can be used as the search result.

步骤618,向用户返回搜索结果和SumCount。Step 618, return the search result and SumCount to the user.

在上述的多Term搜索方法中,使用了分段搜索的策略。例如,当预先设置了2段读取范围,且将文档列表的有序列表作为第1段读取范围,而将整个文档列表作为第2段读取范围时,将先搜索文档列表中的有序列表;如果从有序列表中得到的检索结果的数目大于或等于所需的条数,则说明第一段的检索过程中已经检索到足够数量的搜索结果,此时无需对无序列表进行查找,而可以直接给出已经排序的搜索结果;只有当有序列表中检索结果的数目小于所需的条数时,即第一段的检索过程中未检索到足够数量的搜索结果时,才会进行第二阶段的搜索,从文档列表中的所有记录中进行搜索。In the above-mentioned multi-term search method, a segmentation search strategy is used. For example, when the 2-segment reading range is set in advance, and the ordered list of the document list is used as the 1st reading range, and the entire document list is used as the 2nd reading range, the document list will be searched first. Sequence list; if the number of search results obtained from the ordered list is greater than or equal to the required number of items, it means that a sufficient number of search results have been retrieved in the first paragraph of the search process, and there is no need to search for the unordered list at this time. search, but can directly give the sorted search results; only when the number of search results in the ordered list is less than the required number of items, that is, when a sufficient number of search results are not retrieved in the first paragraph of the search process, the A second stage of the search is performed, searching from all records in the document list.

因此,使用上述的多Term搜索方法,可以尽量避免每次均在全量数据集上(例如,整个文档列表中)进行检索操作,因而可有效地缩小搜索空间,减少搜索的复杂度。Therefore, using the above-mentioned multi-term search method, it is possible to avoid performing retrieval operations on the entire data set (for example, the entire document list) each time as much as possible, thereby effectively reducing the search space and reducing the complexity of the search.

例如,在现有技术中,如果搜索数量的级别为亿级别(例如,网页或新闻搜索一般都是T级别),则目前开源搜索(例如,Lucene)中的每次搜索均会对搜索的结果集(比如百万级)进行全量排序,然后在此基础上获取用户搜索的结果集。而如果使用上述的多Term搜索方法,将文档列表的有序列表作为第1段读取范围,则大多数情况下在第一阶段搜索过程结束时即能找到满足用户需要的搜索结果,而不用进行第二阶段的搜索,因此可急剧减少所需搜索的数据量,而且不用在检索过程中对搜索结果进行排序,从而大大提高了搜索响应速度(一般可以比Lucene的搜索响应速度高一个数量级),而且也可以大大减少对内存和CPU资源的占用。For example, in the prior art, if the level of search quantity is 100 million (for example, web page or news search is generally T level), each search in the current open source search (for example, Lucene) will search the results Sets (such as million levels) are fully sorted, and then the result set searched by users is obtained on this basis. However, if the above-mentioned multi-term search method is used, and the ordered list of the document list is used as the reading range of the first paragraph, then in most cases, the search results that meet the user's needs can be found at the end of the first stage of the search process, without using The second stage of search is performed, so the amount of data required to be searched can be drastically reduced, and there is no need to sort the search results during the retrieval process, thereby greatly improving the search response speed (generally an order of magnitude higher than Lucene's search response speed) , and can also greatly reduce the occupation of memory and CPU resources.

通过上述的步骤601~618,即可实现多Term搜索。Through the above steps 601-618, multi-term search can be realized.

以下将以音乐搜索领域的具体实例为例,对上述的多Term搜索方法进行进一步的介绍。The following will take a specific example in the field of music search as an example to further introduce the above-mentioned multi-term search method.

具体实例二:Specific example two:

为了简便起见,在该具体实例中,仍使用具体实例一中的基本设定,并建立图5所示的索引倒排表,索引倒排表中的有序列表的长度n设置为100。另外,设在索引倒排表的文档列表中设置了2段读取范围(即Y=2),且将文档列表的有序列表作为第1段读取范围,而将整个文档列表作为第2段读取范围。For the sake of simplicity, in this specific example, the basic settings in the first specific example are still used, and the index posting list shown in FIG. 5 is established, and the length n of the ordered list in the index posting list is set to 100. In addition, it is assumed that a 2-segment reading range is set in the document list of the index posting list (that is, Y=2), and the ordered list of the document list is used as the first reading range, and the entire document list is used as the second segment. Segment read range.

当用户输入的检索字符串为“刘德华AND天意”,且用户输入的搜索结果范围为:第1~10条搜索结果时,可使用如下所述的多Term搜索方法进行搜索:When the search string entered by the user is "Andy Lau AND God's Will", and the search result range entered by the user is: the 1st to 10th search results, the following multi-term search method can be used to search:

根据用户输入的检索字符串生成一个词源队列TermArray{刘德华,天意}和查询逻辑序列SetOperators{AND};Generate an etymology queue TermArray{Andy Lau, Providence} and a query logic sequence SetOperators{AND} according to the search string entered by the user;

对用户输入的搜索结果范围进行解析,获知起始位置InitialPositon=1,条数Count=10;Analyze the range of search results entered by the user, and learn that the starting position InitialPositon=1, and the number of entries Count=10;

分别根据所生成的词源“刘德华”和“天意”检索图5所示的索引倒排表,查找到与上述生成的相对应的词源“刘德华”和“天意”;Retrieve the index inverted list shown in Figure 5 according to the generated etymology "Andy Lau" and "God's Will" respectively, and find the corresponding etymology "Andy Lau" and "God's Will" generated above;

根据图5所示的索引倒排表可知,对于词源“刘德华”来说,其所对应的值为:<10000,DocList>,因此,与词源“刘德华”相对应的DocList中的文档总数TotalCount=10000。所以,将从所述索引倒排表中读取与词源“刘德华”相对应的DocList中的有序列表,将所述有序列表中的所有100条记录均存储于ResultList1中;According to the index inverted table shown in Figure 5, for the etymology "Andy Lau", its corresponding value is: <10000, DocList>, therefore, the total number of documents in the DocList corresponding to the etymology "Andy Lau" TotalCount=10000. Therefore, the ordered list in the DocList corresponding to the etymology "Andy Lau" will be read from the index inverted list, and all 100 records in the ordered list will be stored in ResultList1;

根据图5所示的索引倒排表可知,对于词源“天意”来说,其所对应的值为:<60,DocList>,因此,与词源“天意”相对应的DocList中的文档总数TotalCount=60;由于60<100,所以,与词源“天意”相对应的DocList中的文档均存储在有序列表中;所以,将从所述索引倒排表中读取与词源“天意”相对应的DocList中的有序列表,将所述有序列表中的所有60条记录均存储于ResultList2中;According to the index posting table shown in Figure 5, it can be seen that for the etymology "God's Will", its corresponding value is: <60, DocList>, therefore, the total number of documents in the DocList corresponding to the etymology "God's Will" TotalCount=60; Owing to 60<100, so, the document in the DocList corresponding to etymology " God's will " is all stored in the ordered list; "The ordered list in the corresponding DocList, all 60 records in the described ordered list are all stored in ResultList2;

从逻辑序列SetOperators中读取第1个逻辑符号,因为该逻辑符号为AND,因此使用AND逻辑合并ResultList1和ResultList2,合并结果加入到ResultList中;Read the first logical symbol from the logical sequence SetOperators, because the logical symbol is AND, so use AND logic to merge ResultList1 and ResultList2, and the merged result is added to ResultList;

由于“天意”相对应的上述60个文档中,有20个文档中的歌手名为“刘德华”,因此,上述ResultList将存储有20条记录,相关结果总数即为:SumCount=20。Since 20 of the above 60 documents corresponding to "God's Will" have singers named "Andy Lau", the above ResultList will store 20 records, and the total number of related results is: SumCount=20.

由于TermArray中所有的Term均已使用,而且(InitialPositon+Count)-1=1+10-1<SumCount,即搜索结果范围包含于所述有序列表中,因此已经满足结束条件,此时,可直接从所述ResultList中的第InitialPositon=1条记录开始,读取Count=10条记录作为搜索结果;Since all the Term in TermArray have been used, and (InitialPositon+Count)-1=1+10-1<SumCount, that is, the search result range is included in the ordered list, so the end condition has been satisfied, at this moment, Directly start from the first InitialPositon=1 record in the ResultList, and read Count=10 records as the search result;

最后,将向用户返回上述搜索结果和相关结果总数SumCount=20。Finally, the total SumCount=20 of the above search results and related results will be returned to the user.

实施例三:分层搜索方法。Embodiment 3: Hierarchical search method.

在本实施例中,所述分层搜索方法是指按照预先设定的规则将整个搜索过程分成多层搜索过程,每个层次的搜索过程之间有严格的区分,不同层之间进行搜索时的字段或搜索规则是不同的。一般来说,上一层的得分均高于下一层,每一层内部再指定按照信息关键属性排序的规则,这样划分分层搜索方法既可以保持搜索的相关性,同时又保障重要信息搜索后结果排名靠前的问题。In this embodiment, the hierarchical search method refers to dividing the entire search process into multi-layer search processes according to preset rules, and there is a strict distinction between the search processes at each level. When searching between different layers, The fields or search rules are different. Generally speaking, the score of the upper layer is higher than that of the lower layer, and the rules for sorting according to the key attributes of the information are specified inside each layer. In this way, the hierarchical search method can not only maintain the relevance of the search, but also ensure the search of important information. Post-results top-ranked questions.

图7为本发明中多层搜索方法的流程图。如图7所示,所述多层搜索方法可包括如下所述的步骤:Fig. 7 is a flow chart of the multi-layer search method in the present invention. As shown in Figure 7, the multi-layer search method may include the following steps:

步骤701,根据预先设定的搜索排序规则预先设置多层搜索过程,并设置各层搜索过程的优先级以及各层搜索过程的分层搜索分值范围。Step 701, pre-setting multi-layer search processes according to preset search ordering rules, and setting the priority of each layer search process and the hierarchical search score range of each layer search process.

在本步骤中,可以根据业务特性预先设定相应的搜索排序规则,例如,在本发明的具体实施例中,针对音乐搜索领域,可以预先设定如下的搜索排序规则:精确搜索、全拼搜索以及分词搜索。根据上述搜索排序规则可知,针对音乐搜索领域,可先进行精确搜索,如果未获得足够数量的搜索结果则可以再进行全拼搜索,如果仍未获得足够数量的搜索结果则最后可再进行分词搜索,从而尽量获得足够数量的用户所需的搜索结果。In this step, corresponding search sorting rules can be preset according to business characteristics. For example, in a specific embodiment of the present invention, for the field of music search, the following search sorting rules can be preset: precise search, spelling search And participle search. According to the above search sorting rules, we can know that in the field of music search, precise search can be carried out first, if there is not enough search results, then full spell search can be carried out, and if there is still not enough search results, word segmentation search can be carried out finally , trying to get the search results desired by a sufficient number of users.

因此,根据上述预先设定的搜索排序规则,可以设置多层搜索过程。例如,在音乐搜索领域,可以设置如下的三层搜索过程:精确层搜索过程、全拼层搜索过程和分词层搜索过程,而且,各层搜索过程的优先级按从高到低的顺序为:精确层搜索过程、全拼层搜索过程、分词层搜索过程。Therefore, according to the aforementioned preset search ordering rules, a multi-layer search process can be set. For example, in the field of music search, the following three-layer search process can be set: the precise layer search process, the full spelling layer search process and the word segmentation layer search process, and the priority of each layer search process is in descending order: Exact layer search process, full spell layer search process, word segmentation layer search process.

更进一步的,在本发明的具体实施例中,还可将上述的每层搜索过程再次分为多个子层搜索过程,并设定各个子层搜索过程的优先级。例如,可将上述精确层搜索过程分为按优先级高低顺序排列的三个子层搜索过程:歌曲子层搜索过程、歌手子层搜索过程和专辑子层搜索过程。即在进行精确层搜索过程时,将先进行歌曲子层搜索过程,然后再进行歌手子层搜索过程,最后再进行专辑子层搜索过程,以完成所述精确层搜索过程。同理,也可将所述全拼层搜索过程和分词层搜索过程也分为上述三个子层搜索过程。在此不再赘述。Furthermore, in a specific embodiment of the present invention, the above-mentioned search process of each layer may be further divided into multiple sub-layer search processes, and the priority of each sub-layer search process may be set. For example, the above precise layer search process can be divided into three sub-layer search processes arranged in order of priority: song sub-layer search process, singer sub-layer search process and album sub-layer search process. That is, when performing the precise layer search process, the song sub-layer search process will be performed first, then the singer sub-layer search process will be performed, and finally the album sub-layer search process will be performed to complete the precise layer search process. Similarly, the search process at the full spelling layer and the search process at the word segmentation layer can also be divided into the above three sub-layer search processes. I won't repeat them here.

此外,在本发明的具体实施例中,还可进一步预先设置各层搜索过程的分层搜索分值范围,从而便于后续所获得的检索结果的排序。例如,当设置有3层搜索过程,且各层搜索过程的优先级按从高到低的顺序为:第一层搜索过程、第二层搜索过程、第三层搜索过程时,可将第一层搜索过程的分层搜索分值范围设为:[A1,A2],即表示在第一层搜索过程中所得到的所有检索结果的分层搜索分值都将在A1和A2之间;将第二层搜索过程的分层搜索分值范围设为:[B1,B2],将第三层搜索过程的分层搜索分值范围设为:[C1,C2]。其中,A1>A2>B1>B2>C1>C2。因此,第一层搜索过程中所获得的任意一个检索结果的搜索分数都将高于第二、三层搜索过程中获得的检索结果。通过上述的方法,从而可以在保持检索结果的相关性的同时,还可保证关于重要信息的检索结果排名比较靠前。In addition, in a specific embodiment of the present invention, the hierarchical search score ranges of each layer search process can be further preset, so as to facilitate the sorting of subsequent retrieved results. For example, when there are 3 layers of search processes, and the priorities of each layer of search processes in descending order are: the first layer search process, the second layer search process, and the third layer search process, the first The hierarchical search score range of the layer search process is set to: [A1, A2], which means that the hierarchical search scores of all retrieval results obtained in the first layer search process will be between A1 and A2; The hierarchical search score range of the second-level search process is set to: [B1, B2], and the hierarchical search score range of the third-level search process is set to: [C1, C2]. Wherein, A1>A2>B1>B2>C1>C2. Therefore, the search score of any retrieval result obtained in the first-level search process will be higher than that of the retrieval results obtained in the second and third-level search processes. Through the above method, while maintaining the relevance of the retrieval results, it can also ensure that the ranking of the retrieval results about important information is relatively high.

步骤702,当用户输入检索字符串时,按照优先级的高低顺序进行各层搜索。Step 702, when the user inputs a search character string, each layer is searched in order of priority.

在本步骤中,当用户输入检索字符串时,将进行分层的搜索过程,即按照优先级的高低顺序先进行优先级最高的搜索,然后再进行优先级次高的搜索,并依此类推,直到搜索到足够数量的用户所需的搜索结果。In this step, when the user enters the search string, a hierarchical search process will be performed, that is, the highest priority search will be performed first, and then the second highest priority search will be performed, and so on. , until a sufficient number of user-desired search results are found.

在本发明的技术方案中,上述步骤702可以有多种具体的实现方式。以下将以其中的一种具体实现方式为例,对本发明的技术方案进行介绍。In the technical solution of the present invention, the above step 702 may have multiple specific implementation manners. The technical solutions of the present invention will be introduced below by taking one of the specific implementation manners as an example.

图8为本发明中步骤702的一种实现方法的流程图。如图8所示,上述步骤702可以包括如下所述的步骤:FIG. 8 is a flowchart of an implementation method of step 702 in the present invention. As shown in FIG. 8, the above step 702 may include the following steps:

步骤801,根据预先设置的各层搜索过程对用户输入的检索字符串进行解析,分别生成与各层搜索过程相对应的词源队列和查询逻辑序列。Step 801: Analyze the search character string input by the user according to the pre-set search process at each level, and generate etymology queues and query logic sequences corresponding to the search process at each level.

在本发明的技术方案中,由于预先设置了多层搜索过程,而每层搜索过程所使用的搜索方法并不一定相同,因此在本步骤中,可对用户输入的检索字符串进行解析,从而分别生成与各层搜索过程相对应的词源队列和查询逻辑序列。In the technical solution of the present invention, since the multi-layer search process is preset, the search methods used in each layer search process are not necessarily the same, so in this step, the search character string input by the user can be parsed, thereby Generate etymology queues and query logic sequences corresponding to the search process of each layer respectively.

进一步的,在本发明的具体实施例中,如果所生成的词源队列中只有一个词源,则与该词源队列对应的查询逻辑序列为空集。Further, in a specific embodiment of the present invention, if there is only one etymology in the generated etymology queue, the query logic sequence corresponding to the etymology queue is an empty set.

例如,如果预先设置的多层搜索过程为:精确层搜索过程、全拼层搜索过程和分词层搜索过程,而用户所输入的检索字符串为“刘德华天意”,则对用户所输入的检索字符串进行解析后,所得到与精确层搜索过程相对应的词源队列中只有一个词源“刘德华天意”,与该词源队列对应的查询逻辑序列为空集;所得到与全拼层搜索过程相对应的词源队列中只有一个词源“liudehuatianyi”,与该词源队列对应的查询逻辑序列为空集;所得到与分词层搜索过程相对应的词源队列中有两个词源:“刘德华”和“天意”,与该词源队列对应的查询逻辑序列中有一个逻辑符号:OR。For example, if the pre-set multi-layer search process is: the precise layer search process, the full spelling layer search process and the word segmentation layer search process, and the search character string input by the user is "Andy Lau God Will", then the search character input by the user After parsing the string, there is only one etymology "Andy Lau" in the etymology queue corresponding to the precise layer search process, and the query logic sequence corresponding to the etymology queue is an empty set; There is only one etymology "liudehuatianyi" in the corresponding etymology queue, and the query logic sequence corresponding to this etymology queue is an empty set; there are two etymologies in the resulting etymology queue corresponding to the word segmentation layer search process: " Andy Lau" and "God's will", there is a logic symbol in the query logic sequence corresponding to the etymology queue: OR.

步骤802,对用户输入的搜索结果范围进行解析,获得起始位置和条数。Step 802: Analyze the search result range input by the user to obtain the starting position and the number of items.

在本步骤中,可通过对用户所输入的搜索结果范围进行解析,获得起始位置InitialPositon和条数Count。In this step, the initial position InitialPositon and the number of items Count can be obtained by analyzing the search result range input by the user.

步骤803,根据预先设定的优先级,将当前未执行的且优先级最高的搜索过程确定为当前搜索过程。Step 803, according to the preset priority, determine the currently unexecuted search process with the highest priority as the current search process.

在本步骤中,需要确定当前搜索过程为哪一层搜索过程。因此,可以根据预先设定的优先级,确定当前搜索过程为当前未执行的且优先级最高的搜索过程。例如,如果是第一次执行搜索过程,则当前搜索过程为优先级最高的搜索过程;如果是第二次执行搜索过程,则当前搜索过程为优先级次高的搜索过程;并依此类推。In this step, it is necessary to determine which layer the current search process is. Therefore, according to the preset priority, it can be determined that the current search process is the search process that is not currently executed and has the highest priority. For example, if the search process is executed for the first time, the current search process is the search process with the highest priority; if the search process is performed for the second time, the current search process is the search process with the second highest priority; and so on.

步骤804,根据与当前搜索过程相对应的词源队列和查询逻辑序列对索引倒排表进行检索,将检索结果存储在分层检索结果集合中并得到当前分层检索结果总数。Step 804: Search the index posting list according to the etymology queue and query logic sequence corresponding to the current search process, store the search results in the hierarchical search result set and obtain the total number of current hierarchical search results.

由于在步骤801中已经生成了与各层搜索过程相对应的词源队列和查询逻辑序列,因此在本步骤中,可直接根据当前搜索过程相对应的词源队列和查询逻辑序列对索引倒排表进行检索,将检索结果存储在分层检索结果集合(LayerResultList),从而得到分层检索结果集合;同时,还可得到当前的分层检索结果集合中的检索结果的总数,即当前分层检索结果总数(LayerResultCount)。Since the etymology queue and query logic sequence corresponding to each layer of search process have been generated in step 801, in this step, the index can be reversed directly according to the etymology queue and query logic sequence corresponding to the current search process table, and store the search results in the layered search result set (LayerResultList), so as to obtain the layered search result set; at the same time, the total number of search results in the current layered search result set can also be obtained, that is, the current layered search The total number of results (LayerResultCount).

另外,在本步骤的当前搜索过程中,可使用上述的多Term搜索方式对索引倒排表进行检索,具体检索过程在此不再赘述。In addition, in the current search process of this step, the above-mentioned multi-term search method can be used to search the index posting list, and the specific search process will not be repeated here.

步骤805,根据各个检索结果的KFScore以及当前搜索过程的分层搜索分值范围,计算分层检索结果集合中各个检索结果的分层搜索分值。Step 805: Calculate the hierarchical search score of each retrieval result in the hierarchical retrieval result set according to the KFScore of each retrieval result and the hierarchical search score range of the current search process.

在本发明的具体实施例中,可以使用多种计算方法来计算分层检索结果集合中各个检索结果的分层搜索分值。In a specific embodiment of the present invention, various calculation methods may be used to calculate the hierarchical search score of each search result in the hierarchical search result set.

例如,可根据当前搜索过程的分层搜索分值范围以及分层检索结果集合中各个检索结果按照KFScore从大到小的排列顺序,为各个检索结果设置相应的分层搜索分值。其它的计算方法在此不再赘述。For example, according to the hierarchical search score range of the current search process and the sequence of each search result in the hierarchical search result set according to the KFScore from large to small, a corresponding hierarchical search score can be set for each retrieval result. Other calculation methods will not be repeated here.

步骤806,将分层检索结果集合中的检索结果按照分层搜索分值的大小加入到总检索结果集合中,并删除重复的检索结果;计算得到当前的总检索结果总数,并清空所述分层检索结果集合。Step 806: Add the search results in the hierarchical search result set to the total search result set according to the hierarchical search score, and delete duplicate search results; calculate the total number of current total search results, and clear the score A collection of layer retrieval results.

在本步骤中,可将步骤804中得到的分层检索结果集合中的检索结果都按照分层搜索分值的大小插入到总检索结果集合(ResultList)中的检索结果中,并删除重复的检索结果。其中,所述总检索结果集合的初始值为空集,即初始情况下的总检索结果集合中并未存储检索结果。因此,如果当前搜索过程为第一层搜索过程,则将分层检索结果集合中的检索结果加入到总检索结果集合中时,不会存在重复的检索结果,即重复的检索结果的数目为0。In this step, all the search results in the hierarchical search result set obtained in step 804 can be inserted into the search results in the total search result set (ResultList) according to the size of the hierarchical search score, and duplicate search results can be deleted. result. Wherein, the initial value of the total retrieval result set is an empty set, that is, no retrieval result is stored in the total retrieval result set under initial conditions. Therefore, if the current search process is the first-level search process, when the search results in the layered search result set are added to the total search result set, there will be no repeated search results, that is, the number of repeated search results is 0 .

此外,在本发明的具体实施例中,还需计算得到当前的总检索结果总数(SearchResultCount),即当前的总检索结果集合中的检索结果的总数。例如,可以直接对当前的总检索结果集合进行统计,得到检索结果的总数;或者,还可以将本次计算前的总检索结果总数加上当前分层检索结果总数,并减去重复的检索结果的数目,从而得到当前的总检索结果总数。In addition, in a specific embodiment of the present invention, it is also necessary to calculate the current total number of search results (SearchResultCount), that is, the total number of search results in the current total search result set. For example, the current total search result set can be directly counted to obtain the total number of search results; or, the total number of search results before this calculation can be added to the total number of current layered search results, and repeated search results can be subtracted to obtain the current total number of search results.

步骤807,判断是否满足结束条件;如果是,则执行步骤808;否则,返回执行步骤803。Step 807, judge whether the end condition is satisfied; if yes, execute step 808; otherwise, return to execute step 803.

其中,在本发明的具体实施例中,所述结束条件可以为:Wherein, in a specific embodiment of the present invention, the end condition may be:

当前的总检索结果总数大于搜索结果范围的最大值,或者当前搜索过程为最后一层搜索过程。The current total number of search results is greater than the maximum value of the search result range, or the current search process is the last layer search process.

在本发明的技术方案中,当当前的总检索结果总数大于搜索结果范围(InitialPositon+Count)时,则表示已经检索到足够数量的搜索结果,因此满足了结束搜索的条件。而当当前搜索过程为最后一层搜索过程时,此时也可以结束搜索,从而满足结束条件。In the technical solution of the present invention, when the current total number of search results is greater than the search result range (InitialPositon+Count), it means that a sufficient number of search results have been retrieved, thus satisfying the condition for ending the search. And when the current search process is the last layer search process, the search can also be ended at this time, so that the end condition is met.

步骤808,根据搜索结果范围从所述总检索结果集合中读取所需条数的检索结果作为搜索结果;将当前的总检索结果总数作为相关结果总数(SumCount)。Step 808, according to the range of search results, read a required number of search results from the total search result set as search results; use the current total number of search results as the total number of relevant results (SumCount).

例如,在本发明的具体实施例中,所述根据搜索结果范围从所述总检索结果集合中读取所需条数的检索结果作为搜索结果包括:For example, in a specific embodiment of the present invention, reading a required number of search results from the total search result set according to the search result range as search results includes:

当SumCount≥(InitialPositon+Count)-1时,从所述总检索结果集合中的第InitialPositon条记录开始,读取Count条记录作为搜索结果;When SumCount≥(InitialPositon+Count)-1, start from the first InitialPositon record in the total retrieval result set, read the Count record as the search result;

当InitialPositon≤SumCount<(InitialPositon+Count)-1时,则表明总检索结果集合中的记录的数目不足,此时,从所述总检索结果集合中的第InitialPositon条记录开始,读取(SumCount-InitialPositon+1)条记录作为搜索结果;When InitialPositon≤SumCount<(InitialPositon+Count)-1, then show that the number of records in the total retrieval result collection is insufficient, at this moment, from the InitialPositon article record in the described total retrieval result collection, read (SumCount- InitialPositon+1) records as search results;

当SumCount<InitialPositon时,则表明总检索结果集合中的记录的数目太少,不存在所需的搜索结果,此时,可将空集作为搜索结果。When SumCount<InitialPositon, it indicates that the number of records in the total search result set is too small, and there is no desired search result. At this time, the empty set can be used as the search result.

步骤809,向用户返回搜索结果和搜索结果总数。Step 809, return the search result and the total number of search results to the user.

通过上述的步骤701~702,即可实现多层搜索。Through the above steps 701-702, multi-layer search can be realized.

以下将以音乐搜索领域的具体实例为例,对上述的多层搜索方法进行进一步的介绍。The following will take a specific example in the field of music search as an example to further introduce the above-mentioned multi-layer search method.

具体实例三:Specific example three:

为了简便起见,在该具体实例中,仍使用具体实例一中的基本设定,并建立图5所示的索引倒排表,索引倒排表中的有序列表的长度n设置为100。For the sake of simplicity, in this specific example, the basic settings in the first specific example are still used, and the index posting list shown in FIG. 5 is established, and the length n of the ordered list in the index posting list is set to 100.

当用户输入的检索字符串为“刘德华天意”,且用户输入的搜索结果范围为:第1~50条搜索结果时,可使用如下所述的多层搜索方法进行搜索:When the search string entered by the user is "Andy Lau's providence" and the range of search results entered by the user is: the 1st to 50th search results, the following multi-layer search method can be used to search:

在本具体实例中,首先根据音乐搜索领域中的业务特性预先设置三层搜索过程:精确层搜索过程、全拼层搜索过程和分词层搜索过程。而且,各层搜索过程的优先级按从高到低的顺序为:精确层搜索过程、全拼层搜索过程、分词层搜索过程。然后,再将上述各层搜索过程都按优先级按从高到低的顺序分为三个子层搜索过程:歌曲子层搜索过程、歌手子层搜索过程和专辑子层搜索过程。In this specific example, three-level search processes are preset according to the business characteristics in the music search field: precise level search process, full-level search process and word-segmentation level search process. Moreover, the priorities of the search process at each level are in descending order: the search process at the precise level, the search process at the full spelling level, and the search process at the word segmentation level. Then, the above-mentioned search processes of each layer are all divided into three sub-layer search processes according to the order of priority from high to low: song sub-layer search process, singer sub-layer search process and album sub-layer search process.

根据上述的设置情况,预先设置各层搜索过程的分层搜索分值范围:将精确层搜索过程的分层搜索分值范围设为[A1,A2];将全拼层搜索过程的分层搜索分值范围设为[B1,B2],将分词层搜索过程的分层搜索分值范围设为[C1,C2];其中,A1>A2>B1>B2>C1>C2。According to the above setting situation, pre-set the hierarchical search score range of each layer search process: set the hierarchical search score range of the precise layer search process to [A1, A2]; The score range is set to [B1, B2], and the hierarchical search score range of the word segmentation layer search process is set to [C1, C2]; wherein, A1>A2>B1>B2>C1>C2.

完成上述设定之后,可根据预先设置的各层搜索过程对用户输入的检索字符串“刘德华天意”进行解析,分别生成与各层搜索过程相对应的词源队列和查询逻辑序列,分别为:After the above settings are completed, the search string "Andy Lau God Will" input by the user can be analyzed according to the pre-set search process of each layer, and the etymology queue and query logic sequence corresponding to the search process of each layer are respectively generated, respectively:

与精确层搜索过程相对应的词源队列中只有一个词源“刘德华天意”,与该词源队列对应的查询逻辑序列为空集;There is only one etymology "Andy Lau God Will" in the etymology queue corresponding to the precise layer search process, and the query logic sequence corresponding to the etymology queue is an empty set;

与全拼层搜索过程相对应的词源队列中只有一个词源“liudehuatianyi”,与该词源队列对应的查询逻辑序列为空集;There is only one etymology "liudehuatianyi" in the etymology queue corresponding to the full-level search process, and the query logic sequence corresponding to the etymology queue is an empty set;

与分词层搜索过程相对应的词源队列中有两个词源:“刘德华”和“天意”,与该词源队列对应的查询逻辑序列中有一个逻辑符号:AND。There are two etymologies in the etymology queue corresponding to the word segmentation layer search process: "Andy Lau" and "God's will", and there is a logic symbol in the query logic sequence corresponding to this etymology queue: AND.

对用户输入的搜索结果范围进行解析,获知起始位置InitialPositon=1,条数Count=50;Analyze the range of search results entered by the user, and learn that the starting position InitialPositon=1, and the number of entries Count=50;

由于精确层搜索过程中的歌曲子层搜索过程的优先级最高,因此可将歌曲子层搜索过程确定为当前搜索过程。Since the song sub-layer search process has the highest priority in the precise layer search process, the song sub-layer search process can be determined as the current search process.

在歌曲子层搜索过程中,根据词源“刘德华天意”使用单Term搜索方式检索图5所示的索引倒排表。由于未能从所述索引倒排表查找到相对应的词源“刘德华天意”,因此返回空的搜索结果;所以,总检索结果集合为空集,当前的总检索结果总数为0。In the song sub-layer search process, according to the etymology "Andy Lau's providence", the index posting list shown in Figure 5 is retrieved using a single-term search method. Since the corresponding etymology "Andy Lau's providence" cannot be found from the index posting table, an empty search result is returned; therefore, the total search result set is an empty set, and the current total search result is 0.

由于当前的总检索结果总数小于搜索结果范围,且当前搜索过程也不是最后一层搜索过程,因此不满足结束条件,将继续执行后续的检索过程。此时,当前未执行的且优先级最高的搜索过程为歌手子层搜索过程,因此可将歌手子层搜索过程确定为当前搜索过程,进行后续的检索。Since the current total number of search results is less than the range of search results, and the current search process is not the last search process, the end condition is not satisfied, and the subsequent search process will continue. At this time, the currently unexecuted search process with the highest priority is the singer sub-layer search process, so the singer sub-layer search process can be determined as the current search process for subsequent retrieval.

同理,在歌手子层搜索过程和专辑子层搜索过程中,也都未能从所述索引倒排表查找到相对应的词源“刘德华天意”,因此返回空的搜索结果,所以仍不满足结束条件,将继续进行后续的全拼层搜索过程。In the same way, in the singer sub-level search process and the album sub-level search process, the corresponding etymology "Andy Lau's will" cannot be found from the index inverted list, so an empty search result is returned, so it is still not available. If the end condition is met, the subsequent full layer search process will continue.

由于全拼层搜索过程中的词源为“liudehuatianyi”,因此在进行全拼层搜索过程中的各个子层搜索过程时,也都未能从所述索引倒排表查找到相对应的词源“liudehuatianyi”,因此返回空的搜索结果,所以仍不满足结束条件,将继续进行后续的分词层搜索过程。Since the etymology in the full-level search process is "liudehuatianyi", the corresponding etymology cannot be found from the index inverted table when performing the search process of each sub-layer in the full-level search process "liudehuatianyi", therefore returns an empty search result, so the end condition is still not met, and the subsequent word segmentation layer search process will continue.

在分词层搜索过程中,将先进行歌曲子层搜索过程,然后再进行歌手子层搜索过程,最后再进行专辑子层搜索过程。In the word segmentation layer search process, the song sub-layer search process will be carried out first, then the artist sub-layer search process will be carried out, and the album sub-layer search process will be carried out at last.

由于分词层搜索过程中有两个词源:“刘德华”和“天意”,且与该词源队列对应的查询逻辑序列中有一个逻辑符号:AND,因此,在分词层搜索过程的各个子层搜索过程中,均可从所述索引倒排表查找到相对应的词源,并获得相应的搜索结果。Since there are two etymologies in the word segmentation layer search process: "Andy Lau" and "God's Will", and there is a logical symbol in the query logic sequence corresponding to the etymology queue: AND, therefore, in each sublayer of the word segmentation layer search process During the search process, the corresponding etymology can be found from the index inverted list, and corresponding search results can be obtained.

当分词层搜索过程中所得到的搜索结果的数目大于50条时,将结束搜索过程,并根据搜索结果范围从所述总检索结果集合中读取所需条数的检索结果作为搜索结果,计算搜索结果总数,然后向用户返回搜索结果和搜索结果总数。When the number of search results obtained in the word segmentation layer search process is greater than 50, the search process will be terminated, and the search results of the required number of items will be read from the total search result collection according to the search result range as the search results, and calculated The total number of search results, and then return the search results and the total number of search results to the user.

而当分词层搜索过程结束时的搜索结果的数目仍然小于50条时,由于分词层搜索过程已经是最后一层搜索过程,因此同样满足结束条件,此时,可计算搜索结果总数,然后向用户返回搜索结果和搜索结果总数。And when the number of search results when the word segmentation layer search process ends is still less than 50, because the word segmentation layer search process has been the last layer of search process, it also satisfies the end condition. At this time, the total number of search results can be calculated, and then to the user Returns the search results and the total number of search results.

在本发明的技术方案中,还提出了一种搜索装置。图9为本发明中的搜索装置的结构示意图。如图9所示,本发明中的搜索装置包括:倒排表生成模块901、存储模块902、词源生成模块903、检索模块904。In the technical solution of the present invention, a search device is also proposed. Fig. 9 is a schematic structural diagram of the search device in the present invention. As shown in FIG. 9 , the search device in the present invention includes: a posting list generation module 901 , a storage module 902 , an etymology generation module 903 , and a retrieval module 904 .

所述倒排表生成模块901,用于将全部待检索文档以词源Term为关键字进行索引,建立以Term为关键字索引、以包含该Term的文档的总数TotalCount和包含该Term的文档列表DocList为值的索引倒排表;所述文档列表中的每条记录中均包括一个文档的文档编号和关键属性分值;所述文档列表由有序列表和无序列表组成,所述有序列表中包括n个关键属性分值最大、且按关键属性分值从大到小顺序排列的记录;其中,所述n为预先确定的值;将所述索引倒排表发送给存储模块902;The posting table generation module 901 is used to index all documents to be retrieved with the etymology Term as the keyword, and establish the Term as the keyword index, the total TotalCount of the documents containing the Term and the list of documents containing the Term DocList is an index posting list of values; each record in the document list includes a document number and a key attribute score; the document list is composed of an ordered list and an unordered list, and the ordered The list includes n key attribute scores that are the largest and are arranged in descending order according to the key attribute scores; wherein, the n is a predetermined value; the index inverted list is sent to the storage module 902;

所述存储模块902,用于存储所述索引倒排表;The storage module 902 is configured to store the index posting list;

所述词源生成模块903,用于根据用户输入的检索字符串生成相应的词源;将所述词源以及用户输入的搜索结果范围发送给检索模块904;The etymology generating module 903 is configured to generate a corresponding etymology according to the search character string input by the user; send the etymology and the search result range input by the user to the retrieval module 904;

所述检索模块904,用于根据所生成的词源对存储模块902中存储的索引倒排表进行检索,并根据用户输入的搜索结果范围优先从所生成的词源对应的有序列表中获取记录,以得到所需的搜索结果和相关结果总数;将所述搜索结果和相关结果总数发送给用户。The retrieval module 904 is used for retrieving the index posting list stored in the storage module 902 according to the generated etymology, and preferentially obtains from the ordered list corresponding to the generated etymology according to the range of search results input by the user record to obtain the desired total of search results and related results; and send the search results and the total of related results to the user.

综上所述,在本发明的技术方案中,由于为同一类文档设置了关键属性和关键属性权重,因此可计算各个文档的关键属性分值,然后根据上述关键属性分值建立索引倒排表,使得该索引倒排表中的文档列表由有序列表和无序列表组成,且有序列表中包括n个关键属性分值最大、且按关键属性分值从大到小顺序排列的记录。由于文档的关键属性分值在建立索引倒排表时即已完成,而且还对索引倒排表中所存储的记录提前完成了部分排序操作,因此可使得在根据用户输入的检索字符串对所述索引倒排表进行检索时,用户所输入的搜索结果范围在绝大部分情况下都包含在有序列表之中,从而可以直接从有序列表中获取所需条数的搜索结果,保证了大多数的搜索结果可以直接从有序列表中获得,而且所获得的搜索结果已按KFScore的大小顺序排列,因此无需再对搜索结果进行排序,从而大大减少了全量数据得分计算以及全量排序操作,大大减少了CPU的运算次数,显著提高了搜索速度,使得搜索速度比现有技术中的搜索速度提高了一个数量级,而且也大大提高了搜索引擎的搜索响应时间。To sum up, in the technical solution of the present invention, since the key attributes and key attribute weights are set for the same type of documents, the key attribute scores of each document can be calculated, and then the index posting list can be established according to the above key attribute scores , so that the document list in the index posting list is composed of an ordered list and an unordered list, and the ordered list includes n records with the largest key attribute scores and arranged in descending order of key attribute scores. Since the key attribute score of the document has been completed when the index posting list is established, and part of the sorting operation has been completed in advance for the records stored in the index posting list, it is possible to sort all the records according to the search string entered by the user. When retrieving the above index inverted list, the range of search results entered by the user is included in the ordered list in most cases, so that the required number of search results can be directly obtained from the ordered list, ensuring Most of the search results can be obtained directly from the ordered list, and the obtained search results have been arranged in the order of KFScore size, so there is no need to sort the search results, which greatly reduces the calculation of the full data score and the full sort operation. The number of operations of the CPU is greatly reduced, the search speed is significantly improved, and the search speed is increased by an order of magnitude compared with the search speed in the prior art, and the search response time of the search engine is also greatly improved.

另外,在本发明的索引倒排表中,所存储的是各个文档的关键属性分值,无需存放计算得分所需的因子,因此减少了存放的字段,大大节省了所占用的内存资源,从而节约了大量的硬、软件资源。In addition, in the index posting table of the present invention, what is stored is the key attribute score of each document, and there is no need to store the factors required for calculating the score, so the stored fields are reduced, and the occupied memory resources are greatly saved, thereby Save a lot of hardware and software resources.

此外,在本发明的索引倒排表中,对有序列表中的记录进行排序的依据是各个文档的关键属性和关键属性分值,而各个文档的关键属性和关键属性分值均可根据实际应用情况预先进行设置,因此可以充分考虑文本相关性和信息相对价值,使得搜索结果的排序更贴近用户需求;同时,还可以灵活设置关键属性和关键属性分值,从而可以在保证文本相关性的基础上定制搜索结果的排序操作。In addition, in the index inverted list of the present invention, the basis for sorting the records in the ordered list is the key attribute and key attribute score of each document, and the key attribute and key attribute score of each document can be based on the actual The application situation is set in advance, so the relevance of text and the relative value of information can be fully considered, making the ranking of search results closer to user needs; at the same time, key attributes and key attribute scores can be flexibly set, so that text relevance can be ensured Based on the sorting operation of custom search results.

以上所述仅为本发明的较佳实施例而已,并不用以限制本发明,凡在本发明的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本发明保护的范围之内。The above descriptions are only preferred embodiments of the present invention, and are not intended to limit the present invention. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the present invention shall be included in the present invention. within the scope of protection.

Claims (25)

1. a searching method is characterized in that, this method comprises:
A, for same at least one determinant attribute of class document setup and corresponding determinant attribute weight, and according to the determinant attribute score value KFScore of described determinant attribute and each document of determinant attribute weight calculation;
B, be that key word carries out index with whole documents to be retrieved with etymology Term, to set up with Term be key word index, be the index inverted list of value with the total TotalCount of the document that comprises this Term and the lists of documents DocList that comprises this Term; Include document code and the determinant attribute score value of a document in every record in the described lists of documents; Described lists of documents is made up of ordered list and unordered list, comprises n determinant attribute score value maximum and the record of arranging by determinant attribute score value descending order in the described ordered list; Wherein, described n is predetermined value;
C, the searching character string of importing according to the user generate corresponding etymology, according to the etymology that generates described index inverted list is retrieved, and preferentially from the ordered list of the etymology correspondence that generates, obtain record according to the Search Results scope of user input, to obtain required Search Results and correlated results sum.
2. method according to claim 1 is characterized in that, the formula that calculates the KFScore of document is:
KFScore=KeyField 1*W 1+KeyField 2*W 2+......+KeyField X*W X
Wherein, W 1+ W 2+ ...+W X=1; Described KeyField xThe value of representing x KeyField of described document, described W xThe value of representing the Weight corresponding with x KeyField of described document.
3. method according to claim 1 is characterized in that,
Described Search Results scope comprises: reference position InitialPositon and bar are counted Count;
The maximal value of described Search Results scope is: (InitialPositon+Count)-1.
4. method according to claim 3 is characterized in that, described Search Results scope according to user input is preferentially obtained record from the ordered list of the etymology correspondence that generates, comprise to obtain required Search Results:
When the maximal value of described Search Results scope was less than or equal to the length n of ordered list of the etymology correspondence that generates, the InitialPositon bar start-of-record from described ordered list read Count bar record as Search Results;
Be arranged in the ordered list of the etymology correspondence that generates when the reference position of described Search Results scope, and the maximal value of described Search Results scope is during greater than the length n of described ordered list, InitialPositon bar start-of-record from described ordered list reads (n-InitialPositon+1) bar record; Record in the unordered list of the etymology correspondence that generates is arranged by the size order of KFScore, and the 1st start-of-record from described unordered list, read [Count-(n-InitialPositon+1)] bar record; The Count bar that reads is recorded as Search Results;
When the reference position of described Search Results scope is arranged in the unordered list of the etymology correspondence that generates, record in the described unordered list being pressed the size order of KFScore arranges, (InitialPositon-n) bar start-of-record from the unordered list after the ordering reads Count bar record as Search Results then.
5. method according to claim 3 is characterized in that, also further comprises among the step C:
When etymology that the searching character string of not storing in the described index inverted list according to user input generates, with empty set as Search Results.
6. method according to claim 1 is characterized in that, described step C comprises:
C1, the searching character string of importing according to the user generate a corresponding etymology;
C2, according to the etymology search index inverted list that generates, judge in the described index inverted list whether store this etymology; If, execution in step c3 then; Otherwise, execution in step c15;
C3, from described index inverted list, read the lists of documents corresponding with described etymology;
C4, the Search Results scope of user input is resolved, obtain reference position InitialPositon and bar and count Count;
Whether c5, judgement (InitialPositon+Count)-1>n set up, if, execution in step c6 then; Otherwise, execution in step c12; Wherein, described n is the length of ordered list;
C6, judge whether InitialPositon>n sets up, if, execution in step c10 then; Otherwise, execution in step c7;
C7, the InitialPositon bar start-of-record from ordered list read (n-InitialPositon+1) bar record and join among the search result list ResultList;
C8, the size order that KFScore pressed in the record in the described unordered list are arranged;
C9, from described unordered list, read [Count-(n-InitialPositon+1)] bar record and join among the ResultList execution in step c13;
C10, the size order that KFScore pressed in the record in the described unordered list are arranged;
C11, (InitialPositon-n) bar start-of-record from the unordered list after the ordering read Count bar record and join among the ResultList execution in step c13;
C12, the InitialPositon bar start-of-record from ordered list read Count bar record and join among the ResultList execution in step c13;
C13, with described ResultList as Search Results, and the total number of documents TotalCount of the lists of documents corresponding with described etymology that will read from described index inverted list is as the correlated results sum;
C14, described Search Results and correlated results sum are returned to the user.Process ends;
C15, with empty set as Search Results, establish the correlated results sum and equal 0, described empty set and correlated results sum are returned to the user, process ends.
7. method according to claim 1 is characterized in that, described step C comprises:
C1, Y section read range is set in the lists of documents of index inverted list; Wherein, Y 〉=2;
C2, the searching character string of importing according to the user generate an etymology formation TermArray and query logic sequence SetOperators;
C3, the Search Results scope of user input is resolved, obtain reference position InitialPositon and bar and count Count;
C4, when the length of TermArray greater than 1 the time, read the first two words source among the TermArray, and initial value is set is that 2 variable i and initial value are 1 variable j;
C5, according to two etymologies that read search index inverted lists respectively, from described index inverted list, read the record in the j section read range in the lists of documents corresponding with described two etymologies respectively, the record that reads is stored in respectively among first result for retrieval tabulation ResultList1 and second result for retrieval tabulation ResultList2;
C6, read (i-1) the individual logical symbol among the SetOperators;
The value of the logical symbol that C7, judgement are read; When the value of described logical symbol is AND, execution in step C8; When the value of described logical symbol is OR, execution in step C9; When the value of described logical symbol is SUB, execution in step C10;
C8, use and logic merge ResultList1 and ResultList2, and amalgamation result joins among the search result list ResultList; Execution in step C11;
C9, use or logic merge ResultList1 and ResultList2, and amalgamation result joins among the ResultList; Execution in step C11;
C10, use difference logic merge ResultList1 and ResultList2, and amalgamation result joins among the ResultList; Execution in step C11;
C11, judge among the TermArray whether to also have etymology not use; If, execution in step C12 then; Otherwise, execution in step C13;
C12, establish i=i+1, read i etymology among the TermArray; Empty ResultList1 and ResultList2, and the record among the ResultList is copied among the ResultList2; Retrieve described index inverted list according to the etymology that reads, from described index inverted list, read the record in the j section read range in the lists of documents corresponding with the etymology that reads, the record that reads is stored among the ResultList1; Return execution in step C6;
C13, judge whether to satisfy termination condition; If, execution in step C16 then; Otherwise, execution in step C14;
C14, empty ResultList1 and ResultList2; Read the first two words source among the TermArray, and i=2 and j=j+1 are set; According to above-mentioned two etymologies difference search index inverted lists that read, from described index inverted list, read the record in the j section read range in the lists of documents corresponding with described two etymologies respectively, be stored in respectively among ResultList1 and the ResultList2;
C15, the size order of the record among described ResultList1 and the ResultList2 being pressed KFScore are respectively arranged; Return execution in step C6;
C16, with the number of the record stored in the set of total result for retrieval as the correlated results sum;
C17, return Search Results and correlated results sum to the user.
8. method according to claim 7 is characterized in that,
When Y=2, with the ordered list of lists of documents as the 1st section read range, with the entire document tabulation as the 2nd section read range.
9. method according to claim 7 is characterized in that,
When Y=3, with the ordered list of lists of documents as the 1st section read range, with the unordered list of lists of documents as the 2nd section read range, with the entire document tabulation as the 3rd section read range.
10. method according to claim 7 is characterized in that, also further comprises among the step C5:
If do not store any one etymology in described two etymologies in the index inverted list, the corresponding result for retrieval tabulation of the etymology of then not storing is set to empty set.
11. method according to claim 7 is characterized in that, described use and logic merge ResultList1 and ResultList2, and amalgamation result joined among the ResultList comprise:
Compare each the bar record among ResultList1 and the ResultList2 one by one, KFScore two records identical with Did in two result for retrieval tabulations are joined among the ResultList as a Search Results.
12. method according to claim 7 is characterized in that, described use or logic merge ResultList1 and ResultList2, and amalgamation result is joined among the ResultList and can comprise:
Each bar record among described ResultList1 and the ResultList2 is joined among the ResultList, if there is the KFScore of two records all identical with Did, then in ResultList, delete a record wherein.
13. method according to claim 7 is characterized in that, described use difference logic merges ResultList1 and ResultList2, and amalgamation result is joined among the ResultList and can comprise:
Remove each bar record of storing among the ResultList2 from described ResultList1, the record that will remove among the ResultList1 after the operation joins among the ResultList.
14. method according to claim 7 is characterized in that, also further comprises among the described step C12:
If do not store above-mentioned i the etymology that reads in the index inverted list, then described ResultList1 is set to empty set, returns execution in step C6 again.
15. method according to claim 7 is characterized in that, described termination condition is:
The maximal value of the number great-than search range of results of the record of storing among the ResultList; Perhaps, j equals Y.
16. method according to claim 7 is characterized in that, describedly reads required record comprise as Search Results from described ResultList:
When SumCount 〉=(InitialPositon+Count)-1, the InitialPositon bar start-of-record from described ResultList reads Count bar record as Search Results;
When InitialPositon≤SumCount<(InitialPositon+Count)-1, the InitialPositon bar start-of-record from described ResultList reads (SumCount-InitialPositon+1) bar record as Search Results;
When SumCount<InitialPositon, with empty set as Search Results.
17. method according to claim 1 is characterized in that, described step C comprises:
Set in advance the multilayer search procedure according to predefined searching order rule, and the priority of each layer search procedure and the hierarchical search score value scope of each layer search procedure are set;
When the user imports searching character string, carry out each layer search in proper order according to the height of priority.
18. method according to claim 17 is characterized in that, describedly sets in advance the multilayer search procedure according to predefined searching order rule, and the priority that each layer search procedure be set comprises:
In the music searching field, three layers of following search procedure are set: accurate layer search procedure, spelling layer search procedure and participle layer search procedure; And the priority of each layer search procedure by order from high to low is: accurate layer search procedure, spelling layer search procedure, participle layer search procedure;
19. method according to claim 18 is characterized in that, this method also further comprises:
Every layer of search procedure is divided into a plurality of sublayers search procedure, and sets the priority of each sublayer search procedure.
20. method according to claim 17 is characterized in that, the described hierarchical search score value scope that each layer search procedure is set comprises:
When the setting search procedure that haves three layers, and the priority of each layer search procedure by from high to low order is: when ground floor search procedure, second layer search procedure, the 3rd layer of search procedure,
The hierarchical search score value scope of ground floor search procedure is made as: [A1, A2]; The hierarchical search score value scope of second layer search procedure is made as: [B1, B2]; The hierarchical search score value scope of the 3rd layer of search procedure is made as: [C1, C2]; Wherein, A1>A2>B1>B2>C1>C2.
21. method according to claim 17 is characterized in that, and is described when the user imports searching character string, carries out each layer search in proper order according to the height of priority and comprises:
Each layer search procedure that Z1, basis set in advance resolved the searching character string of user's input, generates etymology formation and the query logic sequence corresponding with each layer search procedure respectively
Z2, the Search Results scope of user input is resolved, obtain reference position and bar number;
Z3, according to priority preset, current search procedure unenforced and that priority is the highest is defined as the current search process;
Z4, the index inverted list is retrieved according to etymology formation and the query logic sequence corresponding with the current search process, be stored in result for retrieval in the set of layering result for retrieval and obtain current layering result for retrieval sum
Z5, according to the KFScore of each result for retrieval and the hierarchical search score value scope of current search process, calculate the hierarchical search score value of each result for retrieval in the set of layering result for retrieval.
Z6, the size of the result for retrieval in the layering result for retrieval set according to the hierarchical search score value joined in total result for retrieval set, and the result for retrieval that repeats of deletion; Calculate current total result for retrieval sum, and empty described layering result for retrieval set;
Z7, judge whether to satisfy termination condition; If, execution in step Z8 then; Otherwise, return execution in step Z3;
Z8, from described total result for retrieval set, read the result for retrieval of required number as Search Results according to the Search Results scope; With current total result for retrieval sum as correlated results sum SumCount;
Z9, return Search Results and Search Results sum to the user.
22. method according to claim 21 is characterized in that, and is described according to the KFScore of each result for retrieval and the hierarchical search score value scope of current search process, the hierarchical search score value that calculates each result for retrieval in the set of layering result for retrieval comprises:
According to each result for retrieval in the hierarchical search score value scope of current search process and the set of layering result for retrieval according to KFScore putting in order from big to small, for each result for retrieval arranges corresponding hierarchical search score value.
23. method according to claim 21 is characterized in that, described termination condition can for:
The maximal value of current total result for retrieval sum great-than search range of results; Perhaps, the current search process is last one deck search procedure.
24. method according to claim 21 is characterized in that, the described result for retrieval that reads required number according to the Search Results scope from described total result for retrieval set comprises as Search Results:
When SumCount 〉=(InitialPositon+Count)-1, the InitialPositon bar start-of-record from described total result for retrieval set reads Count bar record as Search Results;
When InitialPositon≤SumCount<(InitialPositon+Count)-1, the InitialPositon bar start-of-record from described total result for retrieval set reads (SumCount-InitialPositon+1) bar record as Search Results;
When SumCount<InitialPositon, with empty set as Search Results.
25. a searcher is characterized in that, this searcher comprises: inverted list generation module, memory module, etymology generation module and retrieval module;
Described inverted list generation module, being used for whole documents to be retrieved is that key word carries out index with etymology Term, and foundation is key word index with Term, be the index inverted list of value with the total TotalCount of the document that comprises this Term and the lists of documents DocList that comprises this Term; Include document code and the determinant attribute score value of a document in every record in the described lists of documents; Described lists of documents is made up of ordered list and unordered list, comprises n determinant attribute score value maximum and the record of arranging by determinant attribute score value descending order in the described ordered list; Wherein, described n is predetermined value; Described index inverted list is sent to memory module;
Described memory module is used for the described index inverted list of storage;
Described etymology generation module is used for generating corresponding etymology according to the searching character string of user's input; The Search Results scope of described etymology and user's input is sent to retrieval module;
Described retrieval module, be used for retrieving according to the index inverted list that the etymology that generates is stored memory module, and preferentially from the ordered list of the etymology correspondence that generates, obtain record according to the Search Results scope of user input, to obtain required Search Results and correlated results sum; Described Search Results and correlated results sum are sent to the user.
CN201110461128.4A 2011-12-30 2011-12-30 A kind of searching method and device Active CN103186650B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110461128.4A CN103186650B (en) 2011-12-30 2011-12-30 A kind of searching method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110461128.4A CN103186650B (en) 2011-12-30 2011-12-30 A kind of searching method and device

Publications (2)

Publication Number Publication Date
CN103186650A true CN103186650A (en) 2013-07-03
CN103186650B CN103186650B (en) 2016-05-25

Family

ID=48677819

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110461128.4A Active CN103186650B (en) 2011-12-30 2011-12-30 A kind of searching method and device

Country Status (1)

Country Link
CN (1) CN103186650B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649302A (en) * 2015-10-28 2017-05-10 腾讯科技(深圳)有限公司 Search sequencing method and device
CN106649650A (en) * 2016-12-10 2017-05-10 宁波思库网络科技有限公司 Demand information two-way matching method
CN106909647A (en) * 2017-02-21 2017-06-30 福建榕基软件股份有限公司 A kind of data retrieval method and device
CN109388690A (en) * 2017-08-10 2019-02-26 阿里巴巴集团控股有限公司 Text searching method, inverted list generation method and system for text retrieval
WO2019080412A1 (en) * 2017-10-27 2019-05-02 平安科技(深圳)有限公司 Data service method, electronic device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050210006A1 (en) * 2004-03-18 2005-09-22 Microsoft Corporation Field weighting in text searching
US20080288483A1 (en) * 2007-05-18 2008-11-20 Microsoft Corporation Efficient retrieval algorithm by query term discrimination
CN101460949A (en) * 2006-06-01 2009-06-17 微软公司 Indexing documents for information retrieval based on additional feedback fields
CN101685455A (en) * 2008-09-28 2010-03-31 华为技术有限公司 Method and system of data retrieval
US20110022600A1 (en) * 2009-07-22 2011-01-27 Ecole Polytechnique Federale De Lausanne Epfl Method of data retrieval, and search engine using such a method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050210006A1 (en) * 2004-03-18 2005-09-22 Microsoft Corporation Field weighting in text searching
CN101460949A (en) * 2006-06-01 2009-06-17 微软公司 Indexing documents for information retrieval based on additional feedback fields
US20080288483A1 (en) * 2007-05-18 2008-11-20 Microsoft Corporation Efficient retrieval algorithm by query term discrimination
CN101685455A (en) * 2008-09-28 2010-03-31 华为技术有限公司 Method and system of data retrieval
US20110022600A1 (en) * 2009-07-22 2011-01-27 Ecole Polytechnique Federale De Lausanne Epfl Method of data retrieval, and search engine using such a method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
励子闰等: "基于全文检索引擎的信息检索技术的应用研究", 《计算机与数字工程》, vol. 36, no. 9, 31 December 2008 (2008-12-31), pages 81 - 85 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649302A (en) * 2015-10-28 2017-05-10 腾讯科技(深圳)有限公司 Search sequencing method and device
CN106649650A (en) * 2016-12-10 2017-05-10 宁波思库网络科技有限公司 Demand information two-way matching method
CN106649650B (en) * 2016-12-10 2020-08-18 宁波财经学院 Bidirectional matching method for demand information
CN106909647A (en) * 2017-02-21 2017-06-30 福建榕基软件股份有限公司 A kind of data retrieval method and device
CN106909647B (en) * 2017-02-21 2020-01-03 福建榕基软件股份有限公司 Data retrieval method and device
CN109388690A (en) * 2017-08-10 2019-02-26 阿里巴巴集团控股有限公司 Text searching method, inverted list generation method and system for text retrieval
WO2019080412A1 (en) * 2017-10-27 2019-05-02 平安科技(深圳)有限公司 Data service method, electronic device and storage medium

Also Published As

Publication number Publication date
CN103186650B (en) 2016-05-25

Similar Documents

Publication Publication Date Title
CN101506767B (en) Relative to taxonomic hierarchies classify such as document and/or cluster object and from this classification derive data structure
Jäschke et al. Tag recommendations in folksonomies
JP5332477B2 (en) Automatic generation of term hierarchy
US10430448B2 (en) Computer-implemented method of and system for searching an inverted index having a plurality of posting lists
JP5353173B2 (en) Determining the concreteness of a document
CN110888990A (en) Text recommending methods, devices, equipment and media
US20080162455A1 (en) Determination of document similarity
CN104794242B (en) Searching method
JP5391632B2 (en) Determining word and document depth
JP2009093653A (en) Refine search space according to user input
JP2009193584A (en) Determining words related to a word set
US9971828B2 (en) Document tagging and retrieval using per-subject dictionaries including subject-determining-power scores for entries
CN103942198B (en) For excavating the method and apparatus being intended to
WO2007085187A1 (en) Method of data retrieval, method of generating index files and search engine
CN103186650B (en) A kind of searching method and device
CN117171331B (en) Professional field information interaction method, device and equipment based on large language model
US20130346385A1 (en) System and method for a purposeful sharing environment
CN116610853A (en) Search recommendation method, search recommendation system, computer device, and storage medium
TWI290687B (en) System and method for search information based on classifications of synonymous words
CN103942232B (en) For excavating the method and apparatus being intended to
CN108509571B (en) General method for webpage information data mining
CN104252487A (en) Method and device for generating entry information
Lipczak et al. The impact of resource title on tags in collaborative tagging systems
Das et al. Leveraging collaborative tagging for web item design
JP2013168177A (en) Information provision program, information provision apparatus, and provision method of retrieval service

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant