CN103186650A - Searching method and device - Google Patents
Searching method and device Download PDFInfo
- Publication number
- CN103186650A CN103186650A CN2011104611284A CN201110461128A CN103186650A CN 103186650 A CN103186650 A CN 103186650A CN 2011104611284 A CN2011104611284 A CN 2011104611284A CN 201110461128 A CN201110461128 A CN 201110461128A CN 103186650 A CN103186650 A CN 103186650A
- Authority
- CN
- China
- Prior art keywords
- search
- list
- etymology
- record
- result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 274
- 230000008569 process Effects 0.000 claims description 152
- 238000004364 calculation method Methods 0.000 claims description 9
- 230000000875 corresponding effect Effects 0.000 claims 14
- 230000002596 correlated effect Effects 0.000 claims 10
- 238000005267 amalgamation Methods 0.000 claims 6
- 230000015572 biosynthetic process Effects 0.000 claims 3
- 230000008676 import Effects 0.000 claims 2
- 238000005284 basis set Methods 0.000 claims 1
- 238000012217 deletion Methods 0.000 claims 1
- 230000037430 deletion Effects 0.000 claims 1
- 230000011218 segmentation Effects 0.000 description 17
- 238000010586 diagram Methods 0.000 description 8
- 230000004044 response Effects 0.000 description 7
- 238000006243 chemical reaction Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000008520 organization Effects 0.000 description 1
Images
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Document Processing Apparatus (AREA)
Abstract
本发明提供了一种搜索方法和装置。其中的搜索方法包括:为同一类文档设置关键属性和关键属性权重,计算各个文档的关键属性分值;建立索引倒排表;索引倒排表中的文档列表中的每条记录中均包括一个文档的文档编号和关键属性分值;所述文档列表由有序列表和无序列表组成,所述有序列表中包括n个关键属性分值最大、且按关键属性分值从大到小顺序排列的记录;其中,所述n为预先确定的值;根据用户输入的检索字符串生成相应的词源,根据所生成的词源对所述索引倒排表进行检索,并根据用户输入的搜索结果范围优先从所生成的词源对应的有序列表中获取记录,以得到所需的搜索结果和相关结果总数。应用本发明可以提高搜索速度,降低系统资源的占用。
The invention provides a search method and device. The search method includes: setting key attributes and key attribute weights for the same type of documents, calculating the key attribute scores of each document; building an index posting list; each record in the document list in the index posting list includes a The document number and key attribute score of the document; the document list is composed of an ordered list and an unordered list, and the ordered list includes the n key attribute scores with the largest value, and the order of the key attribute scores is from large to small Arranged records; wherein, the n is a predetermined value; generate a corresponding etymology according to the search character string input by the user, retrieve the index posting list according to the generated etymology, and search according to the search string entered by the user Result scopes preferentially fetch records from the generated etymologically-corresponding ordered list to obtain the desired total number of search results and related results. The application of the invention can increase the search speed and reduce the occupation of system resources.
Description
技术领域 technical field
本发明涉及数据业务技术领域,尤其涉及一种搜索方法和装置。The invention relates to the technical field of data services, in particular to a search method and device.
背景技术 Background technique
现有的搜索引擎一般都是单方面基于文本相似度进行排序,索引的存储一般是基于键值对<KEY,DocList>的形式。其中,KEY表示关键字,DocList表示包含关键字KEY的文档列表。DocList中的每个元素是一个文档对象,用于存放一个文档的基本信息,例如:该文档的ID、该关键字KEY在该文档中出现的次数以及出现的位置等信息。当用户输入一个关键字之后,将先根据该关键字进行检索;当检索到相对应的关键字后,再整体计算与该关键字相对应的DocList中所有文档在此关键字下的得分;然后,依据上述得分对所有的搜索结果整体进行排序,再在排序后的搜索结果中读取用户所需条数的搜索结果,将读取的搜索结果返回给用户。Existing search engines generally sort based on text similarity unilaterally, and index storage is generally based on the form of key-value pairs <KEY, DocList>. Among them, KEY represents a keyword, and DocList represents a list of documents containing the keyword KEY. Each element in the DocList is a document object, which is used to store the basic information of a document, such as: the ID of the document, the number of occurrences of the keyword KEY in the document, and the position where it appears. When the user enters a keyword, it will first search based on the keyword; when the corresponding keyword is retrieved, then the overall calculation of the scores of all documents in the DocList corresponding to the keyword under this keyword; and then , sort all the search results as a whole according to the above scores, and then read the search results of the number required by the user from the sorted search results, and return the read search results to the user.
例如,现有技术中存在一种综合搜索结果的排序方法,在该方法中,将采用排序算法计算垂直搜索引擎的综合值,并根据综合直对该垂直搜索引擎进行排序,汇总所有垂直搜索引擎排序结果,生成最终的搜索结果。For example, there is a sorting method for comprehensive search results in the prior art. In this method, a sorting algorithm will be used to calculate the comprehensive value of vertical search engines, and the vertical search engines will be sorted according to the comprehensive value, and all vertical search engines will be summarized. Sort the results to generate the final search results.
但是,上述排序方法是将搜索结果得分的计算过程放在搜索过程中进行,且每次都是进行全量排序,从而对CPU和内存资源的占用量较大,搜索复杂度较高,搜索速度较慢,且没有考虑信息的附加价值属性,因而无法保证价值高的信息排序一定靠前。同时,由于该方法中需要存储得分计算依据的因子以及综合得分计算因子,从而对存储空间也有较高的要求。However, the above sorting method puts the calculation process of the search result score in the search process, and performs full sorting each time, thus occupying a large amount of CPU and memory resources, high search complexity, and slow search speed. It is slow, and does not consider the value-added attribute of information, so it cannot guarantee that the information with high value will be ranked first. At the same time, since the method needs to store the factors for calculating the score and the factors for calculating the comprehensive score, it also requires a relatively high storage space.
此外,现有技术中还存在一种基于搜索引擎的搜索结果排序方法。在该方法中,主要是根据事先配置的网络资源权重,同时根据用户输入的关键字在资源中的文本权重,综合计算每条相关资源得分,然后进行全量排序,以生成搜索结果。但是,在该方法中,虽然考虑到信息的附加价值属性,但是在搜索过程中需要计算各文档得分,同时还需要对搜索的结果进行全量排序,从而对CPU和内存资源的占用量也较大,搜索复杂度也较高,搜索速度也较慢。同时,由于该方法中也需要存储得分计算依据的因子以及综合得分计算因子,从而对存储空间也有较高的要求In addition, there is also a method for sorting search results based on a search engine in the prior art. In this method, the score of each relevant resource is comprehensively calculated based on the pre-configured network resource weight and the text weight of the keyword input by the user, and then the full ranking is performed to generate search results. However, in this method, although the additional value attribute of information is taken into account, the scores of each document need to be calculated during the search process, and the search results also need to be fully sorted, thus occupying a large amount of CPU and memory resources. , the search complexity is also higher, and the search speed is also slower. At the same time, because this method also needs to store the factors for calculating the score and the factors for calculating the comprehensive score, it also has high requirements for storage space.
综上可知,在现有技术中的搜索方法中,由于文档得分的计算和排序都是放在搜索过程中完成,且是针对全量进行排序;而且,索引组织的结构也没有考虑用户的搜索习惯,造成存储的信息量较大;另外,每次搜索均需针对全量进行,从而大大增加了内存和CPU等系统稀缺资源的负担。同时,现有技术中的搜索方法的搜索复杂度一般都较高,搜索响应速度也较慢。此外,现有技术中的搜索方法中进行排序的依据是文本相关度,而并未考虑能表达信息价值的关键属性,因此导致排序方式单一、用户友好性不足等问题。To sum up, in the search methods in the prior art, since the calculation and sorting of the document scores are done in the search process, and the sorting is done for the whole volume; moreover, the structure of the index organization does not consider the user's search habits , resulting in a large amount of stored information; in addition, each search needs to be performed on the full amount, which greatly increases the burden on scarce system resources such as memory and CPU. At the same time, the search methods in the prior art generally have high search complexity and slow search response speed. In addition, the search method in the prior art is based on text relevance, without considering the key attributes that can express the value of information, thus resulting in single sorting methods and insufficient user-friendliness.
发明内容 Contents of the invention
有鉴于此,本发明提供了一种搜索方法和装置,从而可提高搜索速度,降低系统资源的占用。In view of this, the present invention provides a search method and device, thereby improving the search speed and reducing the occupation of system resources.
本发明采用的技术方案具体是这样实现的:The technical scheme that the present invention adopts is specifically realized like this:
一种搜索方法,该方法包括:A method of searching, the method comprising:
A、为同一类文档设置至少一个关键属性及相应的关键属性权重,并根据所述关键属性和关键属性权重计算各个文档的关键属性分值KFScore;A. Set at least one key attribute and the corresponding key attribute weight for the same type of document, and calculate the key attribute score KFScore of each document according to the key attribute and key attribute weight;
B、将全部待检索文档以词源Term为关键字进行索引,建立以Term为关键字索引、以包含该Term的文档的总数TotalCount和包含该Term的文档列表DocList为值的索引倒排表;所述文档列表中的每条记录中均包括一个文档的文档编号和关键属性分值;所述文档列表由有序列表和无序列表组成,所述有序列表中包括n个关键属性分值最大、且按关键属性分值从大到小顺序排列的记录;其中,所述n为预先确定的值;B, all documents to be retrieved are indexed with etymology Term as a keyword, set up an index posting table with Term as a keyword index, with the total TotalCount of documents containing the Term and the list of documents DocList containing the Term as values; Each record in the document list includes a document number and a key attribute score; the document list is composed of an ordered list and an unordered list, and the ordered list includes n key attribute scores The records that are the largest and are arranged in descending order of key attribute scores; wherein, the n is a predetermined value;
C、根据用户输入的检索字符串生成相应的词源,根据所生成的词源对所述索引倒排表进行检索,并根据用户输入的搜索结果范围优先从所生成的词源对应的有序列表中获取记录,以得到所需的搜索结果和相关结果总数。C. Generate the corresponding etymology according to the search character string input by the user, search the index posting list according to the generated etymology, and prioritize the sequence corresponding to the generated etymology according to the range of search results input by the user Get the records in the list to get the desired total number of search results and related results.
本发明中还提供了一种搜索装置,该搜索装置包括:倒排表生成模块、存储模块、词源生成模块和检索模块;The present invention also provides a search device, which includes: an inverted list generation module, a storage module, an etymology generation module and a retrieval module;
所述倒排表生成模块,用于将全部待检索文档以词源Term为关键字进行索引,建立以Term为关键字索引、以包含该Term的文档的总数TotalCount和包含该Term的文档列表DocList为值的索引倒排表;所述文档列表中的每条记录中均包括一个文档的文档编号和关键属性分值;所述文档列表由有序列表和无序列表组成,所述有序列表中包括n个关键属性分值最大、且按关键属性分值从大到小顺序排列的记录;其中,所述n为预先确定的值;将所述索引倒排表发送给存储模块;The inverted table generation module is used to index all documents to be retrieved with etymology Term as a keyword, and establishes Term as a keyword index, with the total TotalCount of documents containing the Term and the list of documents DocList containing the Term is an index posting list of values; each record in the document list includes a document number and a key attribute score; the document list is composed of an ordered list and an unordered list, and the ordered list Including n key attribute scores maximum, and according to the records of key attribute scores arranged in order from large to small; wherein, the n is a predetermined value; the index posting list is sent to the storage module;
所述存储模块,用于存储所述索引倒排表;The storage module is used to store the index posting list;
所述词源生成模块,用于根据用户输入的检索字符串生成相应的词源;将所述词源以及用户输入的搜索结果范围发送给检索模块;The etymology generation module is used to generate a corresponding etymology according to the search character string input by the user; the etymology and the range of search results input by the user are sent to the retrieval module;
所述检索模块,用于根据所生成的词源对存储模块中存储的索引倒排表进行检索,并根据用户输入的搜索结果范围优先从所生成的词源对应的有序列表中获取记录,以得到所需的搜索结果和相关结果总数;将所述搜索结果和相关结果总数发送给用户。The retrieval module is used to retrieve the index posting list stored in the storage module according to the generated etymology, and obtain records from the ordered list corresponding to the generated etymology according to the range of search results input by the user, To obtain the desired total number of search results and related results; sending the search results and the total number of related results to the user.
由上述技术方案可见,本发明中由于为文档设置了关键属性和关键属性权重,因此可计算各个文档的KFScore,并根据各个文档的KFScore建立一个索引倒排表,使得该索引倒排表中的文档列表由有序列表和无序列表组成。当用户输入的检索字符串时,可生成相应的词源,并根据所生成的词源对索引倒排表进行检索,根据用户输入的搜索结果范围优先从所生成的词源对应的有序列表中获取记录,以得到所需的搜索结果和相关结果总数,从而可在保证搜索结果的相关性(即有效性)的前提下,多方面衡量各种信息的价值,以使得有价值的信息的排名靠前,因而可大大提高搜索速度,提高搜索引擎的搜索响应时间;同时,还可降低CPU和内存等系统资源的占用,从而节约了大量的硬、软件资源。It can be seen from the above technical scheme that in the present invention, since the key attribute and the key attribute weight are set for the document, the KFScore of each document can be calculated, and an index posting list is established according to the KFScore of each document, so that the index posting list in the index posting list Document lists consist of ordered and unordered lists. When the user enters a search string, the corresponding etymology can be generated, and the index inverted list can be searched according to the generated etymology, and the ordered list corresponding to the generated etymology can be prioritized according to the range of search results entered by the user In order to obtain the required search results and the total number of relevant results, the value of various information can be measured in many ways under the premise of ensuring the relevance (that is, validity) of the search results, so that the value of valuable information The top ranking can greatly increase the search speed and the search response time of the search engine; at the same time, it can also reduce the occupation of system resources such as CPU and memory, thereby saving a lot of hardware and software resources.
附图说明 Description of drawings
图1是本发明中的搜索方法的流程图。Fig. 1 is a flowchart of the search method in the present invention.
图2为本发明中文档的信息结构图。FIG. 2 is an information structure diagram of a document in the present invention.
图3为本发明中索引倒排表的结构示意图。Fig. 3 is a schematic structural diagram of an indexed inverted list in the present invention.
图4为本发明中单Term搜索方法的流程图。Fig. 4 is a flow chart of the single Term search method in the present invention.
图5为本发明具体实例中的索引倒排表的结构示意图。Fig. 5 is a schematic structural diagram of an index posting list in a specific example of the present invention.
图6为本发明中多Term搜索方法的流程图。Fig. 6 is a flow chart of the multi-term search method in the present invention.
图7为本发明中多层搜索方法的流程图。Fig. 7 is a flow chart of the multi-layer search method in the present invention.
图8为本发明中步骤702的一种实现方法的流程图。FIG. 8 is a flowchart of an implementation method of
图9为本发明中的搜索装置的结构示意图。Fig. 9 is a schematic structural diagram of the search device in the present invention.
具体实施方式 Detailed ways
为使本发明的目的、技术方案和优点表达得更加清楚明白,下面结合附图及具体实施例对本发明再作进一步详细的说明。In order to make the object, technical solution and advantages of the present invention more clearly, the present invention will be further described in detail below in conjunction with the accompanying drawings and specific embodiments.
图1是本发明中的搜索方法的流程图。Fig. 1 is a flowchart of the search method in the present invention.
如图1所示,该方法包括:As shown in Figure 1, the method includes:
步骤101,为同一类文档设置至少一个关键属性(KeyField)并预先设置各个关键属性的关键属性权重。
在数据业务领域中,可将描述一个事物的信息集合称之为一个文档(Document)。图2为本发明中文档的信息结构图。如图2所示,在本发明的技术方案中,每个文档中都包括多个词源(Term)和多个属性(Field);其中,所述属性可以由一个或多个Term组成。例如,一个文档中可以包括“标题”、“内容”等属性,其中的“标题”属性可以由一个或多个Term组成。再例如,图2中的属性1由N个词源组成,属性J由M个词源组成等。In the field of data business, a collection of information describing a thing can be called a document (Document). FIG. 2 is an information structure diagram of a document in the present invention. As shown in FIG. 2 , in the technical solution of the present invention, each document includes multiple etymologies (Term) and multiple attributes (Field); wherein, the attributes may be composed of one or more Term. For example, a document may include attributes such as "title" and "content", wherein the "title" attribute may be composed of one or more Term. For another example,
另外,在本发明的技术方案中,还将为每个文档设置一个或多个用于标识该文档的重要信息的关键属性,且同一类文档具有相同的关键属性。如下的表1为本发明中关键属性的示例说明。In addition, in the technical solution of the present invention, one or more key attributes for identifying important information of the document will be set for each document, and documents of the same type have the same key attributes. Table 1 below is an illustration of the key attributes of the present invention.
表1Table 1
如表1所述,对于商品类文档,可以将商品价格和/或购买次数设置为关键属性;对于论文类文档,可以将引用次数和/或下载次数作为关键属性;对于音乐类文档,可以将试听次数和/或下载次数设置为关键属性。As described in Table 1, for commodity documents, commodity price and/or purchase times can be set as key attributes; for paper documents, citation times and/or download times can be used as key attributes; for music documents, you can set Trials and/or downloads are set as key attributes.
在设置关键属性之后,还将根据实际应用情况为所设置的各个关键属性预先确定相应的权重Weight(可称为关键属性权重)。After the key attributes are set, corresponding weights Weight (may be referred to as key attribute weights) will be predetermined for each set key attribute according to actual application conditions.
步骤102,根据所述关键属性和关键属性权重计算各个文档的关键属性分值(KFScore)。
在设置上述关键属性和关键属性权重之后,即可根据所设置的关键属性和关键属性权重计算各个文档的关键属性分值(KFScore)。After the above key attributes and key attribute weights are set, the key attribute score (KFScore) of each document can be calculated according to the set key attributes and key attribute weights.
例如,可通过如下所述的公式(1)来计算一个文档的KFScore:For example, the KFScore of a document can be calculated by formula (1) as follows:
KFScore=KeyField1*W1+KeyField2*W2+......+KeyFieldX*WX KFScore=KeyField 1 *W 1 +KeyField 2 *W 2 +......+KeyField X *W X
其中,W1+W2+......+WX=1 (1)Wherein, W 1 +W 2 +...+W X =1 (1)
所述KeyField1表示该文档第1个KeyField的取值,所述W1表示与该文档第1个KeyField相对应的Weight的取值,所述KeyFieldx表示所述文档第x个KeyField的取值,所述Wx表示与该文档第x个KeyField相对应的Weight的取值,依此类推。The KeyField 1 represents the value of the first KeyField of the document, the W 1 represents the value of the Weight corresponding to the first KeyField of the document, and the KeyField x represents the value of the x-th KeyField of the document , the W x represents the value of Weight corresponding to the xth KeyField of the document, and so on.
步骤103,将全部待检索文档以Term为关键字(Key)进行索引,建立以Term为关键字索引、以包含该Term的文档的总数(TotalCount)和包含该Term的文档列表(DocList)为值(Value)的索引倒排表InvertIndexList。
在本步骤中,将建立一个索引倒排表。具体来说,将建立一个以Term为关键字索引、以键值对<TotalCount,DocList>为值的索引倒排表,该索引倒排表可称为InvertIndexList。In this step, an index posting list will be created. Specifically, an index inversion list with Term as the keyword index and key-value pair <TotalCount, DocList> as the value will be established. The index inversion list can be called InvertIndexList.
图3为本发明中索引倒排表的结构示意图。如图3所示,本发明中的索引倒排表InvertIndexList中包括一条或多条记录,每条记录均可以键值对<Key,Value>方式存储,每条记录中可包括两个字段:关键字(Key)字段和键值(Value)字段。其中,所述Key字段用于存储文档所对应的Term,Value字段用于存储与所述Key字段中的Term相对应的键值对<TotalCount,DocList>。其中,DocList为包含该Term的文档列表,而TotalCount为DocList中的文档总数。Fig. 3 is a schematic structural diagram of an indexed inverted list in the present invention. As shown in Figure 3, the index inverted table InvertIndexList among the present invention includes one or more records, and each record can be stored in a key-value pair <Key, Value> mode, and can include two fields in each record: the key Word (Key) field and key value (Value) field. Wherein, the Key field is used to store the Term corresponding to the document, and the Value field is used to store the key-value pair <TotalCount, DocList> corresponding to the Term in the Key field. Among them, DocList is a list of documents containing the Term, and TotalCount is the total number of documents in DocList.
如图3所示,在本发明的具体实施例中,所述DocList可以为链表,而TotalCount则为该链表中的元素总数。所述链表中包括多条记录,每条记录均以键值对<KFScore,Did>方式存储,每个键值对都对应于一个包含所对应的Term的文档。其中,所述KFScore表示该文档的关键属性分值,可通过步骤102中的公式(1)计算得到;所述Did则表示该文档的标识,例如,文档的编号。As shown in FIG. 3 , in a specific embodiment of the present invention, the DocList may be a linked list, and TotalCount is the total number of elements in the linked list. The linked list includes multiple records, and each record is stored as a key-value pair <KFScore, Did>, and each key-value pair corresponds to a document containing the corresponding Term. Wherein, the KFScore represents the key attribute score of the document, which can be calculated by the formula (1) in
此外,在本发明的具体实施例中,对于索引倒排表InvertIndexList中的每个DocList,均可预先按照KFScore的大小对同一个DocList中的记录进行部分排序,即先在该DocList中确定前n个KFScore最大的记录,并将所述前n个KFScore最大的记录按KFScore大小顺序排列,组成一个有序列表(OrderedList);该DocList中的其它记录组成一个无序列表(DisorderedList)。因此,本发明的具体实施例中每一个DocList均是由有序列表和无序列表组成,所述有序列表中包括n个关键属性分值最大、且按关键属性分值从大到小顺序排列的记录。其中,所述n为预先确定的值,因此,所述n即为OrderedList的长度值。In addition, in a specific embodiment of the present invention, for each DocList in the index inverted list InvertIndexList, the records in the same DocList can be partially sorted in advance according to the size of the KFScore, that is, the first n KFScore maximum records, and the first n KFScore maximum records are arranged in order of KFScore size to form an ordered list (OrderedList); other records in the DocList form an unordered list (DisorderedList). Therefore, in the specific embodiment of the present invention, each DocList is composed of an ordered list and an unordered list, and the ordered list includes n key attribute scores that are the largest, and are ordered from large to small according to the key attribute scores. sorted records. Wherein, the n is a predetermined value, therefore, the n is the length value of the OrderedList.
由此可知,在本发明的具体实施例中,无需对DocList中的所有记录都按KFScore大小进行排序,而只需进行部分排序,即只需确定前n个KFScore最大的记录,并将前n个KFScore最大的记录(即有序列表中的记录)按KFScore大小进行排序即可,该DocList中的其它记录(即无序列表中的记录)则可以不进行排序。It can be seen that, in a specific embodiment of the present invention, it is not necessary to sort all the records in the DocList by KFScore size, but only need to perform partial sorting, that is, only need to determine the record with the largest KFScore of the first n, and sort the first n The record with the largest KFScore (that is, the record in the ordered list) can be sorted according to the size of the KFScore, and the other records in the DocList (that is, the records in the unordered list) can not be sorted.
步骤104,根据用户输入的检索字符串生成相应的词源,根据所生成的词源对所述索引倒排表进行检索,并根据用户输入的搜索结果范围优先从所生成的词源对应的有序列表中获取记录,以得到所需的搜索结果和相关结果总数。
在本发明的技术方案中,在通过上述步骤103建立索引倒排表之后,即可根据用户输入的检索字符串和搜索结果范围对所述索引倒排表进行检索,以得到所需的搜索结果和相关结果总数。In the technical solution of the present invention, after the index posting list is established through the
具体来说,首先可根据用户输入的检索字符串生成相应的词源。Specifically, firstly, a corresponding etymology can be generated according to the search character string input by the user.
例如:当用户所输入的检索字符串为“刘德华”,则可根据不同的转换规则将该检索字符串“刘德华”转换成相对应的词源:“刘德华”或“liudehua”。For example: when the search string entered by the user is "Andy Lau", the search string "Andy Lau" can be converted into the corresponding etymology: "Andy Lau" or "liudehua" according to different conversion rules.
当用户所输入的检索字符串为“刘德华天意”,则可根据不同的转换规则将该检索字符串“刘德华天意”转换成相对应的词源:“刘德华天意”或“liudehuatianyi”;或者可以将该检索字符串“刘德华天意”转换成相应的两个词源:“刘德华”和“天意”。When the search string entered by the user is "Andy Lau's will", the search string "Andy Lau's will" can be converted into the corresponding etymology according to different conversion rules: "Andy Lau's will" or "liudehuatianyi"; The search string "Andy Lau God's Will" is converted into two corresponding etymologies: "Andy Lau" and "God's Will".
在生成相应词源之后,即可根据该词源对所述索引倒排表进行检索,判断所述索引倒排表中是否存储有上述词源。After the corresponding etymology is generated, the index postings can be retrieved according to the etymology, and it is judged whether the above etymology is stored in the index postings.
当所述索引倒排表中存储有该词源时,则可读取该词源所对应的DocList,并根据用户输入的搜索结果范围优先从所述DocList的有序列表中获取记录,以得到所需的搜索结果和相关结果总数。When the etymology is stored in the index posting table, the DocList corresponding to the etymology can be read, and the record is preferentially obtained from the ordered list of the DocList according to the search result range input by the user, to obtain The total number of desired search and related results.
其中,所述搜索结果范围为用户所期望的搜索结果所在的范围。因此,所述搜索结果范围包括:起始位置(InitialPositon)和条数(Count),所述搜索结果范围的最大值即为:(InitialPositon+Count)-1。其中,所述起始位置为搜索结果的起始位置,所述条数为搜索结果的条数。Wherein, the range of the search result is the range where the search result desired by the user is located. Therefore, the search result range includes: an initial position (InitialPositon) and the number of items (Count), and the maximum value of the search result range is: (InitialPositon+Count)-1. Wherein, the starting position is the starting position of the search result, and the number of items is the number of the search result.
例如,当用户输入检索字符串进行搜索后,将会通过网页的形式查看搜索结果。以每个网页显示Count=30条搜索结果为例,当用户希望查看第1页的搜索结果时,InitialPositon=1,Count=30,所述搜索结果范围即为:第1~30条搜索结果;而当用户在查看完第1页的搜索结果后点击“下一页”,希望查看第2页的搜索结果时,则InitialPositon=31,Count=30,所述搜索结果范围即为:第31~60条搜索结果。依此可类推。For example, after a user enters a search string to perform a search, the search result will be viewed in the form of a web page. Take each webpage displaying Count=30 search results as an example, when the user wants to view the search results on the first page, InitialPositon=1, Count=30, the range of the search results is: the first to 30 search results; And when the user clicks "next page" after viewing the search results of the first page, and wishes to view the search results of the second page, then InitialPositon=31, Count=30, the range of the search results is: the 31st~ 60 search results. And so on.
因此,在本发明的具体实施例中,所述根据用户输入的搜索结果范围优先从所述DocList的有序列表中获取记录,以得到所需的搜索结果可以包括:Therefore, in a specific embodiment of the present invention, said obtaining records from the ordered list of DocList preferentially according to the range of search results input by the user, so as to obtain the required search results may include:
当所述搜索结果范围的最大值小于或等于有序列表的长度n时,此时的搜索结果范围包含于所述有序列表中,因此,可直接从有序列表中的第InitialPositon条记录开始,读取Count条记录作为搜索结果,而无需读取无序列表中的记录。由于有序列表中的记录均已按KFScore的大小顺序排列,因此所读取的Count条记录即为按KFScore的大小顺序排列的搜索结果。When the maximum value of the search result range is less than or equal to the length n of the ordered list, the search result range at this time is included in the ordered list, so it can directly start from the InitialPositon record in the ordered list , to read Count records as search results without reading the records in the unordered list. Since the records in the ordered list have been arranged in the order of KFScore size, the Count records read are the search results arranged in order of KFScore size.
当所述搜索结果范围的起始位置位于有序列表中(即InitialPositon≤n),且所述搜索结果范围的最大值大于有序列表的长度n时,此时的搜索结果范围的一部分位于有序列表中,另一部分位于无序列表中,此时,可从有序列表中的第InitialPositon条记录开始,读取(n-InitialPositon+1)条记录,然后将所述无序列表中的记录按KFScore的大小顺序排列,并从所述无序列表中的第1条记录开始,读取[Count-(n-InitialPositon+1)]条记录;将所读取的Count条记录作为搜索结果。此时,所读取的Count条记录也是按KFScore的大小顺序排列的搜索结果。When the starting position of the search result range is in the ordered list (that is, InitialPositon≤n), and the maximum value of the search result range is greater than the length n of the ordered list, a part of the search result range at this time is located in In the sequence list, the other part is located in the unordered list. At this time, the (n-InitialPositon+1) record can be read from the InitialPositon record in the ordered list, and then the records in the unordered list Arrange according to the size order of KFScore, and read [Count-(n-InitialPositon+1)] records starting from the first record in the unordered list; use the read Count records as the search result. At this time, the Count records read are also the search results arranged in order of KFScore size.
当所述搜索结果范围的起始位置位于无序列表中(即InitialPositon>n)时,此时的搜索结果范围都位于所述无序列表中,因此,可将所述无序列表中的记录按KFScore的大小顺序排列,然后从排序后的无序列表中的第(InitialPositon-n)条记录开始,读取Count条记录作为搜索结果。此时,所读取的Count条记录也是按KFScore的大小顺序排列的搜索结果。When the starting position of the search result range is in the unordered list (that is, InitialPositon>n), the search result range at this time is all located in the unordered list, so the records in the unordered list can be Arrange according to the size of KFScore, and then start from the (InitialPositon-n)th record in the sorted unordered list, read Count records as the search results. At this time, the Count records read are also the search results arranged in order of KFScore size.
通过上述的方法,即可获得所需的搜索结果。Through the above methods, the desired search results can be obtained.
考虑到用户的搜索习惯,一般情况下用户只会关心排序靠前的数十条搜索结果。根据权威机构对用户的搜索习惯的统计结果分析可知,超过90%的用户一般只会查看搜索结果的前两页(一般为20条左右的搜索结果)。因此,在搜索数据量为数亿级别的情况下,如果有序列表的长度n的取值超过一定的值(例如,n=100),则在绝大部分情况下,用户所输入的搜索结果范围都包含在有序列表之中,从而可以直接从有序列表中获取所需条数的且已按KFScore的大小顺序排列的搜索结果,而无需再对搜索结果进行排序,所需内存空间较少,CPU运算次数较少,响应时间较高,且占用的系统资源较少,从而可以大大提高搜索引擎的搜索响应时间,同时还可节约大量的硬、软件资源。Considering the user's search habits, generally the user only cares about the top dozens of search results. According to the analysis of statistical results of users' search habits by authoritative organizations, more than 90% of users generally only view the first two pages of search results (generally about 20 search results). Therefore, when the amount of search data is hundreds of millions of levels, if the value of the length n of the ordered list exceeds a certain value (for example, n=100), then in most cases, the search results entered by the user The ranges are included in the ordered list, so that the required number of search results that have been arranged in order of the size of the KFScore can be obtained directly from the ordered list, without the need to sort the search results, and the required memory space is relatively small. The number of CPU operations is less, the response time is higher, and the system resources occupied are less, which can greatly improve the search response time of the search engine, and at the same time save a lot of hardware and software resources.
更进一步地,在本发明的具体实施例中,当所述索引倒排表中未存储根据用户输入的检索字符串所生成的词源时,则可向用户返回空集作为搜索结果,表示未搜索到相关的文档。Furthermore, in a specific embodiment of the present invention, when the etymology generated according to the search string input by the user is not stored in the index posting table, an empty set may be returned to the user as the search result, indicating that there is no Find related documents.
另外,在本发明的技术方案中,上述步骤104可以有多种实现方式,即可以使用多种搜索方法来实现上述的步骤104。以下将以多个具体实施例的方式对各种搜索方法分别进行介绍。In addition, in the technical solution of the present invention, the
实施例一:单Term搜索方法。Embodiment 1: Single Term search method.
在本实施例中,所述单Term搜索方法是指根据用户输入的检索字符串只生成一个相应的词源时进行搜索的方法。图4为本发明中单Term搜索方法的流程图。如图4所示,所述单Term搜索方法可包括如下所述的步骤:In this embodiment, the single-term search method refers to a method of searching when only one corresponding etymology is generated according to the search character string input by the user. Fig. 4 is a flow chart of the single Term search method in the present invention. As shown in Figure 4, the single Term search method may include the steps as follows:
步骤401,根据用户输入的检索字符串生成一个相应的词源。
在本步骤中,可以将用户输入的检索字符串作为一个词,直接将该词作为该检索字符串相应的词源。例如,当用户所输入的检索字符串为“刘德华”时,生成与该检索字符串相对应的词源:“刘德华”。In this step, the search character string input by the user may be used as a word, and the word may be directly used as the corresponding etymology of the search character string. For example, when the search character string input by the user is "Andy Lau", an etymology corresponding to the search character string: "Andy Lau" is generated.
步骤402,根据所生成的词源检索索引倒排表,判断所述索引倒排表中是否存储有该词源;如果是,则执行步骤403;否则,执行步骤415。
步骤403,从所述索引倒排表中读取与所述词源相对应的文档列表。
由于索引倒排表中的每个词源所对应的值(Value)中都包括一个文档列表(DocList),因此,在步骤中可根据上述所生成的词源从所述索引倒排表中读取与所述词源相对应的文档列表。Because all include a list of documents (DocList) in the corresponding value (Value) of each etymology in the index posting list, therefore, can read from the described index posting list according to the above-mentioned etymology generated in the step Get a list of documents corresponding to said etymology.
步骤404,对用户输入的搜索结果范围进行解析,获得起始位置和条数。
在本步骤中,可通过对用户所输入的搜索结果范围进行解析,获得起始位置InitialPositon和条数Count。In this step, the initial position InitialPositon and the number of items Count can be obtained by analyzing the search result range input by the user.
步骤405,判断(InitialPositon+Count)-1>n是否成立,如果是,则执行步骤406;否则,执行步骤412;
由于所述搜索结果范围的最大值为(InitialPositon+Count)-1,因此在本步骤中,将首先判断搜索结果范围的最大值是否大于文档列表的有序列表的长度n,即判断(InitialPositon+Count)-1>n是否成立。Because the maximum value of the search result range is (InitialPositon+Count)-1, so in this step, it will first be judged whether the maximum value of the search result range is greater than the length n of the ordered list of the document list, that is, to judge (InitialPositon+Count) Count)-1>n is established.
当搜索结果范围的最大值大于文档列表的有序列表的长度时,则说明所述搜索结果范围已经超出有序列表的范围,此时,仅从有序列表中将无法读取全部所需条数的搜索结果,因此,将继续执行步骤406;When the maximum value of the search result range is greater than the length of the ordered list of the document list, it means that the search result range has exceeded the range of the ordered list. At this time, it will not be possible to read all the required items from the ordered list Number of search results, therefore, will continue to perform
当搜索结果范围的最大值小于或等于文档列表的有序列表的长度时,则说明所述搜索结果范围并未超出有序列表的范围,此时,仅从有序列表中即可读取全部所需的搜索结果,因此可以执行步骤412,从有序列表中直接获取所需条数的搜索结果。When the maximum value of the search result range is less than or equal to the length of the ordered list of the document list, it means that the search result range does not exceed the range of the ordered list. At this time, all documents can be read only from the ordered list. required search results, therefore step 412 can be executed to directly obtain the required number of search results from the ordered list.
步骤406,判断InitialPositon>n是否成立,如果是,则执行步骤410;否则,执行步骤407;
在本步骤中,将判断搜索结果范围中的InitialPositon是否大于有序列表的长度n。In this step, it will be judged whether the InitialPositon in the search result range is greater than the length n of the ordered list.
当搜索结果范围中的InitialPositon大于有序列表的长度时,则说明所述搜索结果范围的InitialPositon已经位于无序列表中,所述搜索结果均位于无序列表中,所述有序列表中并不包括搜索结果,因此可执行步骤410;When the InitialPositon in the search result range is greater than the length of the ordered list, it means that the InitialPositon in the search result range is already in the unordered list, the search results are all in the unordered list, and the ordered list does not Include search results, so step 410 can be performed;
当搜索结果范围中的InitialPositon小于或等于有序列表的长度时,则说明所述搜索结果范围的InitialPositon仍位于有序列表中,即有部分搜索结果位于有序列表中,但也有部分搜索结果位于无序列表中。此时可执行步骤407。When the InitialPositon in the search result range is less than or equal to the length of the ordered list, it means that the InitialPositon in the search result range is still in the ordered list, that is, some search results are in the ordered list, but some search results are in the in an unordered list. At this time,
步骤407,从有序列表中的第InitialPositon条记录开始,读取(n-InitialPositon+1)条记录加入到搜索结果列表(ResultList)中。
步骤408,将所述无序列表中的记录按KFScore的大小顺序排列。
步骤409,从所述无序列表中读取[Count-(n-InitialPositon+1)]条记录加入到ResultList中,执行步骤413。
步骤410,将所述无序列表中的记录按KFScore的大小顺序排列。
步骤411,从排序后的无序列表中的第(InitialPositon-n)条记录开始,读取Count条记录加入到ResultList中,执行步骤413。
步骤412,从有序列表中的第InitialPositon条记录开始,读取Count条记录加入到ResultList中,执行步骤413。
步骤413,将所述ResultList作为搜索结果,并将从所述索引倒排表中读取的与所述词源相对应的文档列表的文档总数TotalCount作为相关结果总数SumCount。In
步骤414,将所述搜索结果和相关结果总数返回给用户。结束流程。
步骤415,将空集作为搜索结果,设SumCount=0,将所述空集和相关结果总数返回给用户。结束流程。
通过上述的步骤401~415,即可实现单Term搜索。Through the above steps 401-415, a single-term search can be realized.
以下将以音乐搜索领域的具体实例为例,对上述的单Term搜索方法进行进一步的介绍。The following will take a specific example in the field of music search as an example to further introduce the above-mentioned single-term search method.
具体实例一:Specific example one:
假设在音乐搜索领域中待检索的音乐文档有一千万个,每个文档的音乐信息包括:歌曲名(SongName)、歌手名(SingerName)、专辑名(AlbumName)、歌词、歌曲的试听量(ListenCount)和业务量(BusinessCount)。Suppose there are 10 million music files to be retrieved in the field of music search, and the music information of each file includes: song name (SongName), singer name (SingerName), album name (AlbumName), lyrics, song trial listening volume ( ListenCount) and business volume (BusinessCount).
在本发明的实例中,可将所述ListenCount和BusinessCount设置为上述音乐文档的关键属性,关键属性权重分别设置为0.3和0.7,则各个音乐文档的关键属性分值(KFScore)可通过如下所述公式计算:In the example of the present invention, described ListenCount and BusinessCount can be set as the key attribute of above-mentioned music file, and key attribute weight is set to 0.3 and 0.7 respectively, then the key attribute score value (KFScore) of each music file can be passed as follows Formula calculation:
KFScore=ListenCount*0.3+BusinessCount*0.7KFScore=ListenCount*0.3+BusinessCount*0.7
根据上述待检索的音乐文档、关键属性和关键属性分值可建立索引倒排表InvertIndexList,并可将索引倒排表中的有序列表的长度n设置为100。According to the above music files to be retrieved, key attributes and key attribute scores, an index inverted list InvertIndexList can be established, and the length n of the ordered list in the index inverted list can be set to 100.
图5为本发明具体实例中的索引倒排表的结构示意图。如图5所示,词源“刘德华”所对应的值为:<10000,DocList>,因此,与词源“刘德华”相对应的DocList中的文档总数TotalCount=10000,该DocList中的有序列表的长度n=100,即有序列表中包括100条记录;无序列表的长度为9900,即无序列表中包括9900条记录;同理,词源“我和你”所对应的值为:<9870,DocList>,因此,与词源“我和你”相对应的DocList中的文档总数TotalCount=9870;......;词源“天意”所对应的值为:<60,DocList>,因此,与词源“天意”相对应的DocList中的文档总数TotalCount=60;其中,可再假设在上述60个文档中,有20个文档中的歌手名为“刘德华”。Fig. 5 is a schematic structural diagram of an index posting list in a specific example of the present invention. As shown in Figure 5, the value corresponding to the etymology "Andy Lau" is: <10000, DocList>, therefore, the total number of documents TotalCount=10000 in the DocList corresponding to the etymology "Andy Lau", the ordered list in this DocList The length n=100, that is, the ordered list includes 100 records; the length of the unordered list is 9900, that is, the unordered list includes 9900 records; similarly, the corresponding value of the etymology "I and you" is: <9870, DocList>, therefore, the total number of documents TotalCount=9870 in the DocList corresponding to the etymology "I and you"; ...; The corresponding value of the etymology "God's Will" is: <60, DocList >, therefore, the total number of documents TotalCount=60 in the DocList corresponding to the etymology "God's Will"; wherein, it can be further assumed that among the above-mentioned 60 documents, the singer in 20 documents is named "Andy Lau".
当用户输入的检索字符串为“刘德华”,且用户输入的搜索结果范围为:第31~50条搜索结果时,可使用如下所述的单Term搜索方法进行搜索:When the search string entered by the user is "Andy Lau", and the search result range entered by the user is: the 31st to 50th search results, the following single-term search method can be used to search:
根据用户输入的检索字符串生成相应的词源“刘德华”;对用户输入的搜索结果范围进行解析,获知起始位置InitialPositon=31,条数Count=20;Generate the corresponding etymology "Andy Lau" according to the search character string input by the user; analyze the search result range input by the user, and learn that the starting position InitialPositon=31, and the number of entries Count=20;
根据所生成的词源“刘德华”检索图5所示的索引倒排表,在索引倒排表中查找到与上述生成的词源相对应的词源“刘德华”;该词源所对应的值为:<10000,DocList>,因此,与词源“刘德华”相对应的DocList中的文档总数TotalCount=10000,该DocList中的有序列表的长度n=100,即有序列表中包括100条记录;无序列表的长度为9900,即无序列表中包括9900条记录。Search the index posting table shown in Figure 5 according to the generated etymology "Andy Lau", and find the etymology "Andy Lau" corresponding to the above-mentioned generated etymology in the index posting table; the value corresponding to the etymology is: <10000, DocList>, therefore, the total number of documents in the DocList corresponding to the etymology "Andy Lau" TotalCount=10000, the length of the ordered list in the DocList is n=100, that is, 100 records are included in the ordered list ; The length of the unordered list is 9900, that is, the unordered list includes 9900 records.
由于(InitialPositon+Count)-1=31+20-1<n=100,因此搜索结果范围包含于所述有序列表中,所以可直接从有序列表中的第31条记录开始,读取20条记录作为搜索结果;同时,相关结果总数为SumCount=TotalCount=10000。Since (InitialPositon+Count)-1=31+20-1<n=100, the search result range is included in the ordered list, so you can directly start from the 31st record in the ordered list and read 20 records as search results; meanwhile, the total number of relevant results is SumCount=TotalCount=10000.
由于有序列表中的记录均已按KFScore的大小顺序排列,因此所读取的20条记录即为按KFScore的大小顺序排列的搜索结果。Since the records in the ordered list have been arranged in order of size of KFScore, the read 20 records are the search results arranged in order of size of KFScore.
实施例二:多Term搜索方法。Embodiment 2: Multi-Term search method.
在实际应用场景中,用户有可能会输入多个关键字进行搜索,希望能较精确地定位所需查找的内容,同时,有些用户还可能会对检索结果有特定的要求,例如,需要查找既包括关键字A又包括关键字B的信息,或者需要查找包括关键字A或关键字B的信息,或者需要查找包括关键字A但不包括关键字B的信息等等。因此,用户所输入的多个关键字之间还有可能存在与(AND)、或(OR)和差(SUB)三种逻辑运算情况。In actual application scenarios, users may enter multiple keywords to search, hoping to locate the content they want to find more accurately. At the same time, some users may also have specific requirements for the search results. Information that includes keyword A and keyword B, or information that includes keyword A or keyword B needs to be found, or information that includes keyword A but does not include keyword B needs to be found, and so on. Therefore, there may be three logical operations of AND (AND), OR (OR) and difference (SUB) among the multiple keywords input by the user.
为了满足上述用户的需求,在本发明的技术方案中,还提出了一种多Term搜索方法。在本实施例中,所述多Term搜索方法是指根据用户输入的检索字符串生成多个相应的词源时进行搜索的方法。图6为本发明中多Term搜索方法的流程图。如图6所示,所述多Term搜索方法可包括如下所述的步骤:In order to meet the needs of the above users, in the technical solution of the present invention, a multi-term search method is also proposed. In this embodiment, the multi-term search method refers to a method of searching when multiple corresponding etymologies are generated according to the search character string input by the user. Fig. 6 is a flow chart of the multi-term search method in the present invention. As shown in Figure 6, the multi-Term search method may include the steps as follows:
步骤600,在索引倒排表的文档列表中设置Y段读取范围。Step 600, setting the Y-segment reading range in the document list of the index posting list.
在本发明的具体实施例中,将采用分段读取的策略进行检索,以尽量避免每次检索都在全量数据上进行操作。因此,在本步骤中,将在索引倒排表的文档列表中预先设置Y段读取范围。其中,Y≥2。In a specific embodiment of the present invention, the segmented read strategy is used for retrieval, so as to avoid operating on the full amount of data for each retrieval as much as possible. Therefore, in this step, the reading range of the Y segment will be preset in the document list of the index posting list. Among them, Y≥2.
例如,当Y=2时,将在索引倒排表的文档列表中预先设置2段读取范围。此时,可将文档列表的有序列表作为第1段读取范围,而将整个文档列表作为第2段读取范围。For example, when Y=2, a 2-segment reading range will be preset in the document list of the index posting list. At this time, the ordered list of the document list can be used as the first reading range, and the entire document list can be used as the second reading range.
再例如,当Y=3时,将在索引倒排表的文档列表中预先设置3段读取范围。此时,可将文档列表的有序列表作为第1段读取范围,将文档列表的无序列表作为第2段读取范围,而将整个文档列表作为第3段读取范围。For another example, when Y=3, the reading range of 3 segments will be preset in the document list of the index posting list. At this time, the ordered list of the document list can be used as the reading range of the first paragraph, the unordered list of the document list can be used as the reading range of the second paragraph, and the entire document list can be used as the reading range of the third paragraph.
在本发明的具体实施例中,还可以根据实际应用情况使用其它方法设置读取范围,具体的设置方法在此不再赘述。In a specific embodiment of the present invention, other methods can also be used to set the reading range according to the actual application situation, and the specific setting method will not be repeated here.
步骤601,根据用户输入的检索字符串生成一个词源队列和查询逻辑序列。
在本步骤中,将对用户输入的检索字符串进行解析,生成一个词源队列TermArray和一个查询逻辑序列SetOperators。其中,所述词源队列TermArray中可以包括x个词源,可以表示为TermArray{Term1,Term2,...,Termx}。所述查询逻辑序列SetOperators中包括(x-1)个查询逻辑symbol,可以表示为SetOperators{symbol1,symbol2,...,symbolx-1}。其中,所述词源队列中各个词源之间的逻辑关系由所述查询逻辑序列中相应的查询逻辑表示。例如,所述查询逻辑序列中的symbol1表示Term1与Term2之间的逻辑关系,symbol2表示Term3与之前的Term1、Term2之间的逻辑关系,依次类推。另外,所述查询逻辑symbol的取值为:AND、OR或SUB。In this step, the search string entered by the user will be parsed to generate an etymology queue TermArray and a query logic sequence SetOperators. Wherein, the etymology queue TermArray may include x etymologies, which may be expressed as TermArray{Term 1 , Term 2 , . . . , Term x }. The query logic sequence SetOperators includes (x-1) query logic symbols, which can be expressed as SetOperators {symbol 1 , symbol 2 , . . . , symbol x-1 }. Wherein, the logical relationship among the various etymologies in the etymology queue is represented by the corresponding query logic in the query logic sequence. For example, symbol 1 in the query logic sequence represents the logical relationship between Term 1 and Term 2 , symbol 2 represents the logical relationship between Term 3 and the previous Term 1 and Term 2 , and so on. In addition, the value of the query logic symbol is: AND, OR or SUB.
例如,当用户所输入的检索字符串为“刘德华OR天意”时,可对上述检索字符串进行解析后生成词源队列TermArray{刘德华,天意}和一个查询逻辑序列SetOperators{symbol1},其中,symbol1的取值为OR。For example, when the search string entered by the user is "Andy Lau OR God's Will", the above search string can be parsed to generate an etymology queue TermArray{Andy Lau, God's Will} and a query logic sequence SetOperators{symbol 1 }, wherein, The value of symbol 1 is OR.
步骤602,对用户输入的搜索结果范围进行解析,获得起始位置和条数。Step 602: Analyze the search result range input by the user to obtain the starting position and the number of items.
在本步骤中,可通过对用户所输入的搜索结果范围进行解析,获得起始位置InitialPositon和条数Count。In this step, the initial position InitialPositon and the number of items Count can be obtained by analyzing the search result range input by the user.
步骤603,判断TermArray的长度是否大于1;如果是,则执行步骤604;否则,执行步骤616。
在本步骤中,将首先判断TermArray的长度是否大于1;如果是,则说明TermArray中至少包括两个以上的词源,将继续执行步骤604;如果不是,说明TermArray中仅包括一个词源,因此可以仅使用单Term搜索方法进行搜索,而不必使用多Term搜索方法,因此执行步骤616。In this step, at first will judge whether the length of TermArray is greater than 1; If yes, then explain that TermArray includes more than two etymology at least, will continue to perform
从上述步骤603可知,在本发明的具体实施例中,所述单Term搜索方式为多Term搜索方式的一种特殊情况。It can be seen from the
步骤604,读取TermArray中的前两个词源,并设置初始值为2的变量i和初始值为1的变量i。
在本步骤中,将直接从TermArray中读取前两个词源,即词源Term1和Term2。In this step, the first two etymologies, Term 1 and Term 2 , will be read directly from the TermArray.
另外,在本步骤中还可以预先设置两个变量:初始值为2的变量i和初始值为1的变量j。其中,i可用于表示当前所读取的是第i个词源,而j则可用于表示当前读取范围的段数,即表示当前读取范围为第j段读取范围。In addition, in this step, two variables may also be preset: a variable i with an initial value of 2 and a variable j with an initial value of 1. Among them, i can be used to indicate that the current reading is the i-th etymology, and j can be used to indicate the number of paragraphs in the current reading range, that is, it means that the current reading range is the reading range of the jth paragraph.
步骤605,根据上述所读取的两个词源分别检索索引倒排表,从所述索引倒排表中分别读取与所述两个词源相对应的文档列表中第j段读取范围内的记录,将所读取的记录分别存储于第一检索结果列表(ResultList1)和第二检索结果列表(ResultList2)中。Step 605: Retrieve the index posting list according to the two etymologies read above, and respectively read the reading range of the jth paragraph in the document list corresponding to the two etymologies from the index posting list Records in , store the read records in the first search result list (ResultList1) and the second search result list (ResultList2) respectively.
具体来说,在本步骤中可将上述所读取的两个词源分别看成两个独立的词源,根据每个词源分别检索索引倒排表。如果索引倒排表中存储有上述两个词源,则从所述索引倒排表中分别读取与所述两个词源相对应的文档列表中第j段读取范围内的记录。然后,将所读取的第一个词源(即Term1)相对应的文档列表中第j段读取范围内的记录存储于ResultList1中,将所读取的第二个词源(即Term2)相对应的文档列表中第j段读取范围内的记录存储于ResultList2中。Specifically, in this step, the two etymologies read above can be regarded as two independent etymologies, and the index posting list is retrieved according to each etymology. If the above two etymologies are stored in the index posting list, the records within the read range of the jth paragraph in the document list corresponding to the two etymologies are respectively read from the index posting list. Then, the records in the reading scope of the first section j in the document list corresponding to the first etymology (i.e. Term 1 ) that are read are stored in ResultList1, and the second etymology (i.e. Term 1) that is read is stored in ResultList1. 2 ) Records within the reading range of the jth paragraph in the corresponding document list are stored in ResultList2.
在本步骤中,此时的j=1,因此所读取的均为文档列表中第1段读取范围内的记录。如果在步骤600中将所述第1段读取范围设置为有序列表,则此时所读取的均为文档列表的有序列表中的记录。In this step, j=1 at this time, so all the records read are within the reading range of the first paragraph in the document list. If in step 600 the reading scope of the first segment is set as an ordered list, all the records read at this time are the records in the ordered list of the document list.
更进一步的,如果索引倒排表中未存储上述两个词源中的任意一个词源,则可将该未存储的词源所对应的检索结果列表设置为空集。例如,如果索引倒排表中未存储上述两个词源中的第一个词源(即Term1),则将该词源所对应的检索结果列表ResultList1设置为空集;如果索引倒排表中未存储上述两个词源中的第二个词源(即Term2),则将该词源所对应的检索结果列表ResultList2设置为空集。Furthermore, if any one of the above two etymologies is not stored in the index posting list, the retrieval result list corresponding to the unstored etymology can be set as an empty set. For example, if the first etymology (i.e. Term 1 ) in the above two etymologies is not stored in the index posting list, then the retrieval result list ResultList1 corresponding to the etymology is set to an empty set; if the index posting list If the second etymology (ie Term 2 ) of the above two etymologies is not stored in , then the retrieval result list ResultList2 corresponding to the etymology is set as an empty set.
步骤606,读取查询逻辑序列SetOperators中的第(i-1)个逻辑符号symboli-1。
步骤607,判断所读取的逻辑符号的取值;当所述逻辑符号的取值为AND时,执行步骤608;当所述逻辑符号的取值为OR时,执行步骤609;当所述逻辑符号的取值为SUB时,执行步骤610。
步骤608,使用与(AND)逻辑合并ResultList1和ResultList2,合并结果加入到搜索结果列表(ResultList)中;执行步骤611。In
在本步骤中,将使用与逻辑合并ResultList1和ResultList2,并将合并结果加入到ResultList中。In this step, ResultList1 and ResultList2 will be merged using AND logic, and the merged result will be added to ResultList.
由于ResultList1和ResultList2中的每条记录都包括两项属性:关键属性得分(KFScore)和文档标识(Did),而进行与逻辑合并的目的在于从两个检索结果列表中找出Did相同的记录。由于KFScore的特殊性可知,如果两个文档的KFScore相同,则这两个文档有可能属于同一文档,而如果两个文档的KFScore不同,则一定不是同一文档。Since each record in ResultList1 and ResultList2 includes two attributes: key attribute score (KFScore) and document identifier (Did), the purpose of logical merging is to find out records with the same Did from the two retrieval result lists. Due to the particularity of KFScore, if the KFScores of two documents are the same, the two documents may belong to the same document, and if the KFScores of the two documents are different, they must not be the same document.
基于上述的原因,在本发明的具体实施例中,所述使用与逻辑合并ResultList1和ResultList2,并将合并结果加入到ResultList中可以包括:Based on the above reasons, in a specific embodiment of the present invention, the use and logic of merging ResultList1 and ResultList2, and adding the merged result to ResultList may include:
逐一比较ResultList1和ResultList2中的各条记录,将两个检索结果列表中KFScore和Did相同的两条记录作为一个搜索结果加入到搜索结果列表(ResultList)中。Compare the records in ResultList1 and ResultList2 one by one, and add the two records with the same KFScore and Did in the two search result lists as a search result to the search result list (ResultList).
步骤609,使用或(OR)逻辑合并ResultList1和ResultList2,合并结果加入到搜索结果列表(ResultList)中;执行步骤611。
在本步骤中,将使用或逻辑合并ResultList1和ResultList2,并将合并结果加入到ResultList中。In this step, the OR logic will be used to merge ResultList1 and ResultList2, and the merged result will be added to ResultList.
例如,在本发明的具体实施例中,所述使用或逻辑合并ResultList1和ResultList2,并将合并结果加入到ResultList中可以包括:For example, in a specific embodiment of the present invention, the use or logic to merge ResultList1 and ResultList2, and adding the merged result to ResultList may include:
将所述ResultList1和ResultList2中的各条记录加入到搜索结果列表(ResultList)中,如果有两条记录的KFScore和Did均相同,则在ResultList中删除其中的一条记录。Add each record in the ResultList1 and ResultList2 to the search result list (ResultList), if there are two records with the same KFScore and Did, delete one of the records in the ResultList.
步骤610,使用差(SUB)逻辑合并ResultList1和ResultList2,合并结果加入到搜索结果列表(ResultList)中;执行步骤611。
在本步骤中,将使用差逻辑合并ResultList1和ResultList2,并将合并结果加入到ResultList中。In this step, ResultList1 and ResultList2 will be merged using difference logic, and the merged result will be added to ResultList.
例如,在本发明的具体实施例中,所述使用差逻辑合并ResultList1和ResultList2,并将合并结果加入到ResultList中可以包括:For example, in a specific embodiment of the present invention, the use of difference logic to merge ResultList1 and ResultList2, and adding the merged result to ResultList may include:
从所述ResultList1中去除ResultList2中存储的各条记录,将进行去除操作后的ResultList1中的记录加入到搜索结果列表(ResultList)中。Each record stored in ResultList2 is removed from the ResultList1, and the removed records in ResultList1 are added to the search result list (ResultList).
步骤611,判断TermArray中是否还有Term未使用;如果是,则执行步骤612;否则,执行步骤613。
步骤612,设i=i+1,读取TermArray中的第i个词源;清空ResultList1和ResultList2,并将ResultList中的记录复制到ResultList2中;根据所读取的词源检索所述索引倒排表,从所述索引倒排表中读取与所读取的词源相对应的文档列表中第j段读取范围内的记录,将所读取的记录存储于ResultList1中;返回执行步骤606。
具体来说,在本步骤中可先设置i=i+1,以从TermArray中读取下一个词源(即第i个词源);同时,还将ResultList1和ResultList2清空,并将之前的搜索结果(即ResultList中的记录)复制到ResultList2中,以便于在后续搜索过程中将之前的搜索结果与所读取的下一个词源所对应的搜索结果进行逻辑操作。Specifically, in this step, i=i+1 can be set earlier to read the next etymology (i.e. the i-th etymology) from TermArray; The results (that is, the records in ResultList) are copied to ResultList2, so that the previous search results and the search results corresponding to the next etymology read can be logically operated in the subsequent search process.
然后,再根据所读取的词源检索索引倒排表。如果索引倒排表中存储有上述所读取的词源,则从所述索引倒排表中读取与所述词源相对应的文档列表中第j段读取范围内的记录,并将所读取的记录存储于ResultList1中,再返回执行步骤606。Then, the index posting list is retrieved according to the etymology read. If the above-mentioned etymology read is stored in the index posting list, then read the record in the jth paragraph reading range in the document list corresponding to the etymology from the index posting list, and The read records are stored in ResultList1, and then return to step 606.
在本步骤中,如果此时的j=1,则所读取的为文档列表中第1段读取范围内的记录。而如果在步骤600中将所述第1段读取范围设置为有序列表,则此时所读取的将为文档列表的有序列表中的记录。同理,如果此时的j=2,则所读取的为文档列表中第2段读取范围内的记录。而如果在步骤600中将所述第2段读取范围设置为整个文档列表,则此时所读取的将为整个文档列表中的记录。依次类推,在此不再赘述。In this step, if j=1 at this time, what is read is the record within the reading range of the first paragraph in the document list. However, if the reading scope of the first paragraph is set as an ordered list in step 600, then the records in the ordered list of the document list will be read at this time. Similarly, if j=2 at this time, what is read is the record within the reading range of the second paragraph in the document list. However, if the reading range of the second paragraph is set to the entire document list in step 600, then the records in the entire document list will be read at this time. And so on, no more details here.
更进一步的,如果索引倒排表中未存储上述所读取的第i个词源,则可将所述ResultList1设置为空集,再返回执行步骤606。Furthermore, if the read i-th etymology is not stored in the index posting list, the ResultList1 may be set as an empty set, and then return to step 606 .
步骤613,判断是否满足结束条件;如果是,则执行步骤617;否则,执行步骤614。
其中,在本发明的具体实施例中,所述结束条件可以为:Wherein, in a specific embodiment of the present invention, the end condition may be:
ResultList中所存储的记录的数目大于搜索结果范围的最大值:(InitialPositon+Count)-1;或者,j等于Y。The number of records stored in ResultList is greater than the maximum value of the search result range: (InitialPositon+Count)-1; or, j is equal to Y.
在本发明的技术方案中,当ResultList中所存储的记录的数目大于搜索结果范围的最大值(InitialPositon+Count)-1时,则表示已经检索到足够数量的搜索结果,因此满足了结束搜索的条件。而当j大于Y时,则表示已经进行了最后一段的搜索,此时也可以结束搜索,从而满足结束条件。In the technical scheme of the present invention, when the number of records stored in the ResultList is greater than the maximum value (InitialPositon+Count)-1 of the search result range, it means that a sufficient number of search results have been retrieved, thus satisfying the requirement for ending the search condition. And when j is greater than Y, it means that the last segment of the search has been carried out, and the search can also be ended at this time, thereby satisfying the end condition.
步骤614,清空ResultList1和ResultList2;读取TermArray中的前两个词源,并设置i=2且j=j+1;根据上述所读取的两个词源分别检索索引倒排表,从所述索引倒排表中分别读取与所述两个词源相对应的文档列表中第j段读取范围内的记录,分别存储于第一检索结果列表(ResultList1)和第二检索结果列表(ResultList2)中。
由于步骤613的判断结果是结束条件未满足,则表示ResultList中所存储的记录的数目小于所需的条数,也就是说,在上一阶段的检索过程中未检索到足够数量的搜索结果,所以需要进行本阶段(即第j阶段)的检索过程。Since the judgment result of
因此,在本步骤中,将根据上述所读取的两个词源分别再次检索索引倒排表。如果索引倒排表中存储有上述两个词源,则从所述索引倒排表中分别读取与所述两个词源相对应的文档列表中第j段读取范围内的记录。然后,将所读取的第一个词源(即Term1)相对应的文档列表中第j段读取范围内的记录存储于ResultList1中,将所读取的第二个词源(即Term2)相对应的文档列表中第j段读取范围内的记录存储于ResultList2中。Therefore, in this step, the index posting list will be searched again according to the two etymologies read above. If the above two etymologies are stored in the index posting list, the records within the read range of the jth paragraph in the document list corresponding to the two etymologies are respectively read from the index posting list. Then, the records in the reading scope of the first section j in the document list corresponding to the first etymology (i.e. Term 1 ) that are read are stored in ResultList1, and the second etymology (i.e. Term 1) that is read is stored in ResultList1. 2 ) Records within the reading range of the jth paragraph in the corresponding document list are stored in ResultList2.
步骤615,分别将所述ResultList1和ResultList2中的记录按KFScore的大小顺序排列;返回执行步骤606。Step 615, respectively arrange the records in the ResultList1 and ResultList2 according to the size of the KFScore; return to execute
步骤616,使用上述单Term搜索方法,根据所述TermArray中的词源检索索引倒排表,得到搜索结果和相关结果总数;执行步骤618。
步骤617,从所述ResultList中读取所需的记录作为搜索结果,将ResultList中所存储的记录的数目作为相关结果总数(SumCount)。Step 617, read the required records from the ResultList as search results, and use the number of records stored in the ResultList as the total number of relevant results (SumCount).
例如,在本发明的具体实施例中,所述从所述ResultList中读取所需的记录作为搜索结果包括:For example, in a specific embodiment of the present invention, reading the required records from the ResultList as search results includes:
当SumCount≥(InitialPositon+Count)-1时,从所述ResultList中的第InitialPositon条记录开始,读取Count条记录作为搜索结果;When SumCount≥(InitialPositon+Count)-1, start from the first InitialPositon record in the ResultList, and read the Count record as the search result;
当InitialPositon≤SumCount<(InitialPositon+Count)-1时,则表明ResultList中的记录的数目不足,此时,从所述ResultList中的第InitialPositon条记录开始,读取(SumCount-InitialPositon+1)条记录作为搜索结果,When InitialPositon≤SumCount<(InitialPositon+Count)-1, it indicates that the number of records in the ResultList is insufficient. At this time, start from the InitialPositon record in the ResultList and read (SumCount-InitialPositon+1) records As a search result,
当SumCount<InitialPositon时,则表明ResultList中的记录的数目太少,不存在所需的搜索结果,此时,可将空集作为搜索结果。When SumCount<InitialPositon, it indicates that the number of records in the ResultList is too small, and there is no desired search result. At this time, the empty set can be used as the search result.
步骤618,向用户返回搜索结果和SumCount。
在上述的多Term搜索方法中,使用了分段搜索的策略。例如,当预先设置了2段读取范围,且将文档列表的有序列表作为第1段读取范围,而将整个文档列表作为第2段读取范围时,将先搜索文档列表中的有序列表;如果从有序列表中得到的检索结果的数目大于或等于所需的条数,则说明第一段的检索过程中已经检索到足够数量的搜索结果,此时无需对无序列表进行查找,而可以直接给出已经排序的搜索结果;只有当有序列表中检索结果的数目小于所需的条数时,即第一段的检索过程中未检索到足够数量的搜索结果时,才会进行第二阶段的搜索,从文档列表中的所有记录中进行搜索。In the above-mentioned multi-term search method, a segmentation search strategy is used. For example, when the 2-segment reading range is set in advance, and the ordered list of the document list is used as the 1st reading range, and the entire document list is used as the 2nd reading range, the document list will be searched first. Sequence list; if the number of search results obtained from the ordered list is greater than or equal to the required number of items, it means that a sufficient number of search results have been retrieved in the first paragraph of the search process, and there is no need to search for the unordered list at this time. search, but can directly give the sorted search results; only when the number of search results in the ordered list is less than the required number of items, that is, when a sufficient number of search results are not retrieved in the first paragraph of the search process, the A second stage of the search is performed, searching from all records in the document list.
因此,使用上述的多Term搜索方法,可以尽量避免每次均在全量数据集上(例如,整个文档列表中)进行检索操作,因而可有效地缩小搜索空间,减少搜索的复杂度。Therefore, using the above-mentioned multi-term search method, it is possible to avoid performing retrieval operations on the entire data set (for example, the entire document list) each time as much as possible, thereby effectively reducing the search space and reducing the complexity of the search.
例如,在现有技术中,如果搜索数量的级别为亿级别(例如,网页或新闻搜索一般都是T级别),则目前开源搜索(例如,Lucene)中的每次搜索均会对搜索的结果集(比如百万级)进行全量排序,然后在此基础上获取用户搜索的结果集。而如果使用上述的多Term搜索方法,将文档列表的有序列表作为第1段读取范围,则大多数情况下在第一阶段搜索过程结束时即能找到满足用户需要的搜索结果,而不用进行第二阶段的搜索,因此可急剧减少所需搜索的数据量,而且不用在检索过程中对搜索结果进行排序,从而大大提高了搜索响应速度(一般可以比Lucene的搜索响应速度高一个数量级),而且也可以大大减少对内存和CPU资源的占用。For example, in the prior art, if the level of search quantity is 100 million (for example, web page or news search is generally T level), each search in the current open source search (for example, Lucene) will search the results Sets (such as million levels) are fully sorted, and then the result set searched by users is obtained on this basis. However, if the above-mentioned multi-term search method is used, and the ordered list of the document list is used as the reading range of the first paragraph, then in most cases, the search results that meet the user's needs can be found at the end of the first stage of the search process, without using The second stage of search is performed, so the amount of data required to be searched can be drastically reduced, and there is no need to sort the search results during the retrieval process, thereby greatly improving the search response speed (generally an order of magnitude higher than Lucene's search response speed) , and can also greatly reduce the occupation of memory and CPU resources.
通过上述的步骤601~618,即可实现多Term搜索。Through the above steps 601-618, multi-term search can be realized.
以下将以音乐搜索领域的具体实例为例,对上述的多Term搜索方法进行进一步的介绍。The following will take a specific example in the field of music search as an example to further introduce the above-mentioned multi-term search method.
具体实例二:Specific example two:
为了简便起见,在该具体实例中,仍使用具体实例一中的基本设定,并建立图5所示的索引倒排表,索引倒排表中的有序列表的长度n设置为100。另外,设在索引倒排表的文档列表中设置了2段读取范围(即Y=2),且将文档列表的有序列表作为第1段读取范围,而将整个文档列表作为第2段读取范围。For the sake of simplicity, in this specific example, the basic settings in the first specific example are still used, and the index posting list shown in FIG. 5 is established, and the length n of the ordered list in the index posting list is set to 100. In addition, it is assumed that a 2-segment reading range is set in the document list of the index posting list (that is, Y=2), and the ordered list of the document list is used as the first reading range, and the entire document list is used as the second segment. Segment read range.
当用户输入的检索字符串为“刘德华AND天意”,且用户输入的搜索结果范围为:第1~10条搜索结果时,可使用如下所述的多Term搜索方法进行搜索:When the search string entered by the user is "Andy Lau AND God's Will", and the search result range entered by the user is: the 1st to 10th search results, the following multi-term search method can be used to search:
根据用户输入的检索字符串生成一个词源队列TermArray{刘德华,天意}和查询逻辑序列SetOperators{AND};Generate an etymology queue TermArray{Andy Lau, Providence} and a query logic sequence SetOperators{AND} according to the search string entered by the user;
对用户输入的搜索结果范围进行解析,获知起始位置InitialPositon=1,条数Count=10;Analyze the range of search results entered by the user, and learn that the starting position InitialPositon=1, and the number of entries Count=10;
分别根据所生成的词源“刘德华”和“天意”检索图5所示的索引倒排表,查找到与上述生成的相对应的词源“刘德华”和“天意”;Retrieve the index inverted list shown in Figure 5 according to the generated etymology "Andy Lau" and "God's Will" respectively, and find the corresponding etymology "Andy Lau" and "God's Will" generated above;
根据图5所示的索引倒排表可知,对于词源“刘德华”来说,其所对应的值为:<10000,DocList>,因此,与词源“刘德华”相对应的DocList中的文档总数TotalCount=10000。所以,将从所述索引倒排表中读取与词源“刘德华”相对应的DocList中的有序列表,将所述有序列表中的所有100条记录均存储于ResultList1中;According to the index inverted table shown in Figure 5, for the etymology "Andy Lau", its corresponding value is: <10000, DocList>, therefore, the total number of documents in the DocList corresponding to the etymology "Andy Lau" TotalCount=10000. Therefore, the ordered list in the DocList corresponding to the etymology "Andy Lau" will be read from the index inverted list, and all 100 records in the ordered list will be stored in ResultList1;
根据图5所示的索引倒排表可知,对于词源“天意”来说,其所对应的值为:<60,DocList>,因此,与词源“天意”相对应的DocList中的文档总数TotalCount=60;由于60<100,所以,与词源“天意”相对应的DocList中的文档均存储在有序列表中;所以,将从所述索引倒排表中读取与词源“天意”相对应的DocList中的有序列表,将所述有序列表中的所有60条记录均存储于ResultList2中;According to the index posting table shown in Figure 5, it can be seen that for the etymology "God's Will", its corresponding value is: <60, DocList>, therefore, the total number of documents in the DocList corresponding to the etymology "God's Will" TotalCount=60; Owing to 60<100, so, the document in the DocList corresponding to etymology " God's will " is all stored in the ordered list; "The ordered list in the corresponding DocList, all 60 records in the described ordered list are all stored in ResultList2;
从逻辑序列SetOperators中读取第1个逻辑符号,因为该逻辑符号为AND,因此使用AND逻辑合并ResultList1和ResultList2,合并结果加入到ResultList中;Read the first logical symbol from the logical sequence SetOperators, because the logical symbol is AND, so use AND logic to merge ResultList1 and ResultList2, and the merged result is added to ResultList;
由于“天意”相对应的上述60个文档中,有20个文档中的歌手名为“刘德华”,因此,上述ResultList将存储有20条记录,相关结果总数即为:SumCount=20。Since 20 of the above 60 documents corresponding to "God's Will" have singers named "Andy Lau", the above ResultList will store 20 records, and the total number of related results is: SumCount=20.
由于TermArray中所有的Term均已使用,而且(InitialPositon+Count)-1=1+10-1<SumCount,即搜索结果范围包含于所述有序列表中,因此已经满足结束条件,此时,可直接从所述ResultList中的第InitialPositon=1条记录开始,读取Count=10条记录作为搜索结果;Since all the Term in TermArray have been used, and (InitialPositon+Count)-1=1+10-1<SumCount, that is, the search result range is included in the ordered list, so the end condition has been satisfied, at this moment, Directly start from the first InitialPositon=1 record in the ResultList, and read Count=10 records as the search result;
最后,将向用户返回上述搜索结果和相关结果总数SumCount=20。Finally, the total SumCount=20 of the above search results and related results will be returned to the user.
实施例三:分层搜索方法。Embodiment 3: Hierarchical search method.
在本实施例中,所述分层搜索方法是指按照预先设定的规则将整个搜索过程分成多层搜索过程,每个层次的搜索过程之间有严格的区分,不同层之间进行搜索时的字段或搜索规则是不同的。一般来说,上一层的得分均高于下一层,每一层内部再指定按照信息关键属性排序的规则,这样划分分层搜索方法既可以保持搜索的相关性,同时又保障重要信息搜索后结果排名靠前的问题。In this embodiment, the hierarchical search method refers to dividing the entire search process into multi-layer search processes according to preset rules, and there is a strict distinction between the search processes at each level. When searching between different layers, The fields or search rules are different. Generally speaking, the score of the upper layer is higher than that of the lower layer, and the rules for sorting according to the key attributes of the information are specified inside each layer. In this way, the hierarchical search method can not only maintain the relevance of the search, but also ensure the search of important information. Post-results top-ranked questions.
图7为本发明中多层搜索方法的流程图。如图7所示,所述多层搜索方法可包括如下所述的步骤:Fig. 7 is a flow chart of the multi-layer search method in the present invention. As shown in Figure 7, the multi-layer search method may include the following steps:
步骤701,根据预先设定的搜索排序规则预先设置多层搜索过程,并设置各层搜索过程的优先级以及各层搜索过程的分层搜索分值范围。
在本步骤中,可以根据业务特性预先设定相应的搜索排序规则,例如,在本发明的具体实施例中,针对音乐搜索领域,可以预先设定如下的搜索排序规则:精确搜索、全拼搜索以及分词搜索。根据上述搜索排序规则可知,针对音乐搜索领域,可先进行精确搜索,如果未获得足够数量的搜索结果则可以再进行全拼搜索,如果仍未获得足够数量的搜索结果则最后可再进行分词搜索,从而尽量获得足够数量的用户所需的搜索结果。In this step, corresponding search sorting rules can be preset according to business characteristics. For example, in a specific embodiment of the present invention, for the field of music search, the following search sorting rules can be preset: precise search, spelling search And participle search. According to the above search sorting rules, we can know that in the field of music search, precise search can be carried out first, if there is not enough search results, then full spell search can be carried out, and if there is still not enough search results, word segmentation search can be carried out finally , trying to get the search results desired by a sufficient number of users.
因此,根据上述预先设定的搜索排序规则,可以设置多层搜索过程。例如,在音乐搜索领域,可以设置如下的三层搜索过程:精确层搜索过程、全拼层搜索过程和分词层搜索过程,而且,各层搜索过程的优先级按从高到低的顺序为:精确层搜索过程、全拼层搜索过程、分词层搜索过程。Therefore, according to the aforementioned preset search ordering rules, a multi-layer search process can be set. For example, in the field of music search, the following three-layer search process can be set: the precise layer search process, the full spelling layer search process and the word segmentation layer search process, and the priority of each layer search process is in descending order: Exact layer search process, full spell layer search process, word segmentation layer search process.
更进一步的,在本发明的具体实施例中,还可将上述的每层搜索过程再次分为多个子层搜索过程,并设定各个子层搜索过程的优先级。例如,可将上述精确层搜索过程分为按优先级高低顺序排列的三个子层搜索过程:歌曲子层搜索过程、歌手子层搜索过程和专辑子层搜索过程。即在进行精确层搜索过程时,将先进行歌曲子层搜索过程,然后再进行歌手子层搜索过程,最后再进行专辑子层搜索过程,以完成所述精确层搜索过程。同理,也可将所述全拼层搜索过程和分词层搜索过程也分为上述三个子层搜索过程。在此不再赘述。Furthermore, in a specific embodiment of the present invention, the above-mentioned search process of each layer may be further divided into multiple sub-layer search processes, and the priority of each sub-layer search process may be set. For example, the above precise layer search process can be divided into three sub-layer search processes arranged in order of priority: song sub-layer search process, singer sub-layer search process and album sub-layer search process. That is, when performing the precise layer search process, the song sub-layer search process will be performed first, then the singer sub-layer search process will be performed, and finally the album sub-layer search process will be performed to complete the precise layer search process. Similarly, the search process at the full spelling layer and the search process at the word segmentation layer can also be divided into the above three sub-layer search processes. I won't repeat them here.
此外,在本发明的具体实施例中,还可进一步预先设置各层搜索过程的分层搜索分值范围,从而便于后续所获得的检索结果的排序。例如,当设置有3层搜索过程,且各层搜索过程的优先级按从高到低的顺序为:第一层搜索过程、第二层搜索过程、第三层搜索过程时,可将第一层搜索过程的分层搜索分值范围设为:[A1,A2],即表示在第一层搜索过程中所得到的所有检索结果的分层搜索分值都将在A1和A2之间;将第二层搜索过程的分层搜索分值范围设为:[B1,B2],将第三层搜索过程的分层搜索分值范围设为:[C1,C2]。其中,A1>A2>B1>B2>C1>C2。因此,第一层搜索过程中所获得的任意一个检索结果的搜索分数都将高于第二、三层搜索过程中获得的检索结果。通过上述的方法,从而可以在保持检索结果的相关性的同时,还可保证关于重要信息的检索结果排名比较靠前。In addition, in a specific embodiment of the present invention, the hierarchical search score ranges of each layer search process can be further preset, so as to facilitate the sorting of subsequent retrieved results. For example, when there are 3 layers of search processes, and the priorities of each layer of search processes in descending order are: the first layer search process, the second layer search process, and the third layer search process, the first The hierarchical search score range of the layer search process is set to: [A1, A2], which means that the hierarchical search scores of all retrieval results obtained in the first layer search process will be between A1 and A2; The hierarchical search score range of the second-level search process is set to: [B1, B2], and the hierarchical search score range of the third-level search process is set to: [C1, C2]. Wherein, A1>A2>B1>B2>C1>C2. Therefore, the search score of any retrieval result obtained in the first-level search process will be higher than that of the retrieval results obtained in the second and third-level search processes. Through the above method, while maintaining the relevance of the retrieval results, it can also ensure that the ranking of the retrieval results about important information is relatively high.
步骤702,当用户输入检索字符串时,按照优先级的高低顺序进行各层搜索。
在本步骤中,当用户输入检索字符串时,将进行分层的搜索过程,即按照优先级的高低顺序先进行优先级最高的搜索,然后再进行优先级次高的搜索,并依此类推,直到搜索到足够数量的用户所需的搜索结果。In this step, when the user enters the search string, a hierarchical search process will be performed, that is, the highest priority search will be performed first, and then the second highest priority search will be performed, and so on. , until a sufficient number of user-desired search results are found.
在本发明的技术方案中,上述步骤702可以有多种具体的实现方式。以下将以其中的一种具体实现方式为例,对本发明的技术方案进行介绍。In the technical solution of the present invention, the
图8为本发明中步骤702的一种实现方法的流程图。如图8所示,上述步骤702可以包括如下所述的步骤:FIG. 8 is a flowchart of an implementation method of
步骤801,根据预先设置的各层搜索过程对用户输入的检索字符串进行解析,分别生成与各层搜索过程相对应的词源队列和查询逻辑序列。Step 801: Analyze the search character string input by the user according to the pre-set search process at each level, and generate etymology queues and query logic sequences corresponding to the search process at each level.
在本发明的技术方案中,由于预先设置了多层搜索过程,而每层搜索过程所使用的搜索方法并不一定相同,因此在本步骤中,可对用户输入的检索字符串进行解析,从而分别生成与各层搜索过程相对应的词源队列和查询逻辑序列。In the technical solution of the present invention, since the multi-layer search process is preset, the search methods used in each layer search process are not necessarily the same, so in this step, the search character string input by the user can be parsed, thereby Generate etymology queues and query logic sequences corresponding to the search process of each layer respectively.
进一步的,在本发明的具体实施例中,如果所生成的词源队列中只有一个词源,则与该词源队列对应的查询逻辑序列为空集。Further, in a specific embodiment of the present invention, if there is only one etymology in the generated etymology queue, the query logic sequence corresponding to the etymology queue is an empty set.
例如,如果预先设置的多层搜索过程为:精确层搜索过程、全拼层搜索过程和分词层搜索过程,而用户所输入的检索字符串为“刘德华天意”,则对用户所输入的检索字符串进行解析后,所得到与精确层搜索过程相对应的词源队列中只有一个词源“刘德华天意”,与该词源队列对应的查询逻辑序列为空集;所得到与全拼层搜索过程相对应的词源队列中只有一个词源“liudehuatianyi”,与该词源队列对应的查询逻辑序列为空集;所得到与分词层搜索过程相对应的词源队列中有两个词源:“刘德华”和“天意”,与该词源队列对应的查询逻辑序列中有一个逻辑符号:OR。For example, if the pre-set multi-layer search process is: the precise layer search process, the full spelling layer search process and the word segmentation layer search process, and the search character string input by the user is "Andy Lau God Will", then the search character input by the user After parsing the string, there is only one etymology "Andy Lau" in the etymology queue corresponding to the precise layer search process, and the query logic sequence corresponding to the etymology queue is an empty set; There is only one etymology "liudehuatianyi" in the corresponding etymology queue, and the query logic sequence corresponding to this etymology queue is an empty set; there are two etymologies in the resulting etymology queue corresponding to the word segmentation layer search process: " Andy Lau" and "God's will", there is a logic symbol in the query logic sequence corresponding to the etymology queue: OR.
步骤802,对用户输入的搜索结果范围进行解析,获得起始位置和条数。Step 802: Analyze the search result range input by the user to obtain the starting position and the number of items.
在本步骤中,可通过对用户所输入的搜索结果范围进行解析,获得起始位置InitialPositon和条数Count。In this step, the initial position InitialPositon and the number of items Count can be obtained by analyzing the search result range input by the user.
步骤803,根据预先设定的优先级,将当前未执行的且优先级最高的搜索过程确定为当前搜索过程。
在本步骤中,需要确定当前搜索过程为哪一层搜索过程。因此,可以根据预先设定的优先级,确定当前搜索过程为当前未执行的且优先级最高的搜索过程。例如,如果是第一次执行搜索过程,则当前搜索过程为优先级最高的搜索过程;如果是第二次执行搜索过程,则当前搜索过程为优先级次高的搜索过程;并依此类推。In this step, it is necessary to determine which layer the current search process is. Therefore, according to the preset priority, it can be determined that the current search process is the search process that is not currently executed and has the highest priority. For example, if the search process is executed for the first time, the current search process is the search process with the highest priority; if the search process is performed for the second time, the current search process is the search process with the second highest priority; and so on.
步骤804,根据与当前搜索过程相对应的词源队列和查询逻辑序列对索引倒排表进行检索,将检索结果存储在分层检索结果集合中并得到当前分层检索结果总数。Step 804: Search the index posting list according to the etymology queue and query logic sequence corresponding to the current search process, store the search results in the hierarchical search result set and obtain the total number of current hierarchical search results.
由于在步骤801中已经生成了与各层搜索过程相对应的词源队列和查询逻辑序列,因此在本步骤中,可直接根据当前搜索过程相对应的词源队列和查询逻辑序列对索引倒排表进行检索,将检索结果存储在分层检索结果集合(LayerResultList),从而得到分层检索结果集合;同时,还可得到当前的分层检索结果集合中的检索结果的总数,即当前分层检索结果总数(LayerResultCount)。Since the etymology queue and query logic sequence corresponding to each layer of search process have been generated in
另外,在本步骤的当前搜索过程中,可使用上述的多Term搜索方式对索引倒排表进行检索,具体检索过程在此不再赘述。In addition, in the current search process of this step, the above-mentioned multi-term search method can be used to search the index posting list, and the specific search process will not be repeated here.
步骤805,根据各个检索结果的KFScore以及当前搜索过程的分层搜索分值范围,计算分层检索结果集合中各个检索结果的分层搜索分值。Step 805: Calculate the hierarchical search score of each retrieval result in the hierarchical retrieval result set according to the KFScore of each retrieval result and the hierarchical search score range of the current search process.
在本发明的具体实施例中,可以使用多种计算方法来计算分层检索结果集合中各个检索结果的分层搜索分值。In a specific embodiment of the present invention, various calculation methods may be used to calculate the hierarchical search score of each search result in the hierarchical search result set.
例如,可根据当前搜索过程的分层搜索分值范围以及分层检索结果集合中各个检索结果按照KFScore从大到小的排列顺序,为各个检索结果设置相应的分层搜索分值。其它的计算方法在此不再赘述。For example, according to the hierarchical search score range of the current search process and the sequence of each search result in the hierarchical search result set according to the KFScore from large to small, a corresponding hierarchical search score can be set for each retrieval result. Other calculation methods will not be repeated here.
步骤806,将分层检索结果集合中的检索结果按照分层搜索分值的大小加入到总检索结果集合中,并删除重复的检索结果;计算得到当前的总检索结果总数,并清空所述分层检索结果集合。Step 806: Add the search results in the hierarchical search result set to the total search result set according to the hierarchical search score, and delete duplicate search results; calculate the total number of current total search results, and clear the score A collection of layer retrieval results.
在本步骤中,可将步骤804中得到的分层检索结果集合中的检索结果都按照分层搜索分值的大小插入到总检索结果集合(ResultList)中的检索结果中,并删除重复的检索结果。其中,所述总检索结果集合的初始值为空集,即初始情况下的总检索结果集合中并未存储检索结果。因此,如果当前搜索过程为第一层搜索过程,则将分层检索结果集合中的检索结果加入到总检索结果集合中时,不会存在重复的检索结果,即重复的检索结果的数目为0。In this step, all the search results in the hierarchical search result set obtained in
此外,在本发明的具体实施例中,还需计算得到当前的总检索结果总数(SearchResultCount),即当前的总检索结果集合中的检索结果的总数。例如,可以直接对当前的总检索结果集合进行统计,得到检索结果的总数;或者,还可以将本次计算前的总检索结果总数加上当前分层检索结果总数,并减去重复的检索结果的数目,从而得到当前的总检索结果总数。In addition, in a specific embodiment of the present invention, it is also necessary to calculate the current total number of search results (SearchResultCount), that is, the total number of search results in the current total search result set. For example, the current total search result set can be directly counted to obtain the total number of search results; or, the total number of search results before this calculation can be added to the total number of current layered search results, and repeated search results can be subtracted to obtain the current total number of search results.
步骤807,判断是否满足结束条件;如果是,则执行步骤808;否则,返回执行步骤803。
其中,在本发明的具体实施例中,所述结束条件可以为:Wherein, in a specific embodiment of the present invention, the end condition may be:
当前的总检索结果总数大于搜索结果范围的最大值,或者当前搜索过程为最后一层搜索过程。The current total number of search results is greater than the maximum value of the search result range, or the current search process is the last layer search process.
在本发明的技术方案中,当当前的总检索结果总数大于搜索结果范围(InitialPositon+Count)时,则表示已经检索到足够数量的搜索结果,因此满足了结束搜索的条件。而当当前搜索过程为最后一层搜索过程时,此时也可以结束搜索,从而满足结束条件。In the technical solution of the present invention, when the current total number of search results is greater than the search result range (InitialPositon+Count), it means that a sufficient number of search results have been retrieved, thus satisfying the condition for ending the search. And when the current search process is the last layer search process, the search can also be ended at this time, so that the end condition is met.
步骤808,根据搜索结果范围从所述总检索结果集合中读取所需条数的检索结果作为搜索结果;将当前的总检索结果总数作为相关结果总数(SumCount)。
例如,在本发明的具体实施例中,所述根据搜索结果范围从所述总检索结果集合中读取所需条数的检索结果作为搜索结果包括:For example, in a specific embodiment of the present invention, reading a required number of search results from the total search result set according to the search result range as search results includes:
当SumCount≥(InitialPositon+Count)-1时,从所述总检索结果集合中的第InitialPositon条记录开始,读取Count条记录作为搜索结果;When SumCount≥(InitialPositon+Count)-1, start from the first InitialPositon record in the total retrieval result set, read the Count record as the search result;
当InitialPositon≤SumCount<(InitialPositon+Count)-1时,则表明总检索结果集合中的记录的数目不足,此时,从所述总检索结果集合中的第InitialPositon条记录开始,读取(SumCount-InitialPositon+1)条记录作为搜索结果;When InitialPositon≤SumCount<(InitialPositon+Count)-1, then show that the number of records in the total retrieval result collection is insufficient, at this moment, from the InitialPositon article record in the described total retrieval result collection, read (SumCount- InitialPositon+1) records as search results;
当SumCount<InitialPositon时,则表明总检索结果集合中的记录的数目太少,不存在所需的搜索结果,此时,可将空集作为搜索结果。When SumCount<InitialPositon, it indicates that the number of records in the total search result set is too small, and there is no desired search result. At this time, the empty set can be used as the search result.
步骤809,向用户返回搜索结果和搜索结果总数。
通过上述的步骤701~702,即可实现多层搜索。Through the above steps 701-702, multi-layer search can be realized.
以下将以音乐搜索领域的具体实例为例,对上述的多层搜索方法进行进一步的介绍。The following will take a specific example in the field of music search as an example to further introduce the above-mentioned multi-layer search method.
具体实例三:Specific example three:
为了简便起见,在该具体实例中,仍使用具体实例一中的基本设定,并建立图5所示的索引倒排表,索引倒排表中的有序列表的长度n设置为100。For the sake of simplicity, in this specific example, the basic settings in the first specific example are still used, and the index posting list shown in FIG. 5 is established, and the length n of the ordered list in the index posting list is set to 100.
当用户输入的检索字符串为“刘德华天意”,且用户输入的搜索结果范围为:第1~50条搜索结果时,可使用如下所述的多层搜索方法进行搜索:When the search string entered by the user is "Andy Lau's providence" and the range of search results entered by the user is: the 1st to 50th search results, the following multi-layer search method can be used to search:
在本具体实例中,首先根据音乐搜索领域中的业务特性预先设置三层搜索过程:精确层搜索过程、全拼层搜索过程和分词层搜索过程。而且,各层搜索过程的优先级按从高到低的顺序为:精确层搜索过程、全拼层搜索过程、分词层搜索过程。然后,再将上述各层搜索过程都按优先级按从高到低的顺序分为三个子层搜索过程:歌曲子层搜索过程、歌手子层搜索过程和专辑子层搜索过程。In this specific example, three-level search processes are preset according to the business characteristics in the music search field: precise level search process, full-level search process and word-segmentation level search process. Moreover, the priorities of the search process at each level are in descending order: the search process at the precise level, the search process at the full spelling level, and the search process at the word segmentation level. Then, the above-mentioned search processes of each layer are all divided into three sub-layer search processes according to the order of priority from high to low: song sub-layer search process, singer sub-layer search process and album sub-layer search process.
根据上述的设置情况,预先设置各层搜索过程的分层搜索分值范围:将精确层搜索过程的分层搜索分值范围设为[A1,A2];将全拼层搜索过程的分层搜索分值范围设为[B1,B2],将分词层搜索过程的分层搜索分值范围设为[C1,C2];其中,A1>A2>B1>B2>C1>C2。According to the above setting situation, pre-set the hierarchical search score range of each layer search process: set the hierarchical search score range of the precise layer search process to [A1, A2]; The score range is set to [B1, B2], and the hierarchical search score range of the word segmentation layer search process is set to [C1, C2]; wherein, A1>A2>B1>B2>C1>C2.
完成上述设定之后,可根据预先设置的各层搜索过程对用户输入的检索字符串“刘德华天意”进行解析,分别生成与各层搜索过程相对应的词源队列和查询逻辑序列,分别为:After the above settings are completed, the search string "Andy Lau God Will" input by the user can be analyzed according to the pre-set search process of each layer, and the etymology queue and query logic sequence corresponding to the search process of each layer are respectively generated, respectively:
与精确层搜索过程相对应的词源队列中只有一个词源“刘德华天意”,与该词源队列对应的查询逻辑序列为空集;There is only one etymology "Andy Lau God Will" in the etymology queue corresponding to the precise layer search process, and the query logic sequence corresponding to the etymology queue is an empty set;
与全拼层搜索过程相对应的词源队列中只有一个词源“liudehuatianyi”,与该词源队列对应的查询逻辑序列为空集;There is only one etymology "liudehuatianyi" in the etymology queue corresponding to the full-level search process, and the query logic sequence corresponding to the etymology queue is an empty set;
与分词层搜索过程相对应的词源队列中有两个词源:“刘德华”和“天意”,与该词源队列对应的查询逻辑序列中有一个逻辑符号:AND。There are two etymologies in the etymology queue corresponding to the word segmentation layer search process: "Andy Lau" and "God's will", and there is a logic symbol in the query logic sequence corresponding to this etymology queue: AND.
对用户输入的搜索结果范围进行解析,获知起始位置InitialPositon=1,条数Count=50;Analyze the range of search results entered by the user, and learn that the starting position InitialPositon=1, and the number of entries Count=50;
由于精确层搜索过程中的歌曲子层搜索过程的优先级最高,因此可将歌曲子层搜索过程确定为当前搜索过程。Since the song sub-layer search process has the highest priority in the precise layer search process, the song sub-layer search process can be determined as the current search process.
在歌曲子层搜索过程中,根据词源“刘德华天意”使用单Term搜索方式检索图5所示的索引倒排表。由于未能从所述索引倒排表查找到相对应的词源“刘德华天意”,因此返回空的搜索结果;所以,总检索结果集合为空集,当前的总检索结果总数为0。In the song sub-layer search process, according to the etymology "Andy Lau's providence", the index posting list shown in Figure 5 is retrieved using a single-term search method. Since the corresponding etymology "Andy Lau's providence" cannot be found from the index posting table, an empty search result is returned; therefore, the total search result set is an empty set, and the current total search result is 0.
由于当前的总检索结果总数小于搜索结果范围,且当前搜索过程也不是最后一层搜索过程,因此不满足结束条件,将继续执行后续的检索过程。此时,当前未执行的且优先级最高的搜索过程为歌手子层搜索过程,因此可将歌手子层搜索过程确定为当前搜索过程,进行后续的检索。Since the current total number of search results is less than the range of search results, and the current search process is not the last search process, the end condition is not satisfied, and the subsequent search process will continue. At this time, the currently unexecuted search process with the highest priority is the singer sub-layer search process, so the singer sub-layer search process can be determined as the current search process for subsequent retrieval.
同理,在歌手子层搜索过程和专辑子层搜索过程中,也都未能从所述索引倒排表查找到相对应的词源“刘德华天意”,因此返回空的搜索结果,所以仍不满足结束条件,将继续进行后续的全拼层搜索过程。In the same way, in the singer sub-level search process and the album sub-level search process, the corresponding etymology "Andy Lau's will" cannot be found from the index inverted list, so an empty search result is returned, so it is still not available. If the end condition is met, the subsequent full layer search process will continue.
由于全拼层搜索过程中的词源为“liudehuatianyi”,因此在进行全拼层搜索过程中的各个子层搜索过程时,也都未能从所述索引倒排表查找到相对应的词源“liudehuatianyi”,因此返回空的搜索结果,所以仍不满足结束条件,将继续进行后续的分词层搜索过程。Since the etymology in the full-level search process is "liudehuatianyi", the corresponding etymology cannot be found from the index inverted table when performing the search process of each sub-layer in the full-level search process "liudehuatianyi", therefore returns an empty search result, so the end condition is still not met, and the subsequent word segmentation layer search process will continue.
在分词层搜索过程中,将先进行歌曲子层搜索过程,然后再进行歌手子层搜索过程,最后再进行专辑子层搜索过程。In the word segmentation layer search process, the song sub-layer search process will be carried out first, then the artist sub-layer search process will be carried out, and the album sub-layer search process will be carried out at last.
由于分词层搜索过程中有两个词源:“刘德华”和“天意”,且与该词源队列对应的查询逻辑序列中有一个逻辑符号:AND,因此,在分词层搜索过程的各个子层搜索过程中,均可从所述索引倒排表查找到相对应的词源,并获得相应的搜索结果。Since there are two etymologies in the word segmentation layer search process: "Andy Lau" and "God's Will", and there is a logical symbol in the query logic sequence corresponding to the etymology queue: AND, therefore, in each sublayer of the word segmentation layer search process During the search process, the corresponding etymology can be found from the index inverted list, and corresponding search results can be obtained.
当分词层搜索过程中所得到的搜索结果的数目大于50条时,将结束搜索过程,并根据搜索结果范围从所述总检索结果集合中读取所需条数的检索结果作为搜索结果,计算搜索结果总数,然后向用户返回搜索结果和搜索结果总数。When the number of search results obtained in the word segmentation layer search process is greater than 50, the search process will be terminated, and the search results of the required number of items will be read from the total search result collection according to the search result range as the search results, and calculated The total number of search results, and then return the search results and the total number of search results to the user.
而当分词层搜索过程结束时的搜索结果的数目仍然小于50条时,由于分词层搜索过程已经是最后一层搜索过程,因此同样满足结束条件,此时,可计算搜索结果总数,然后向用户返回搜索结果和搜索结果总数。And when the number of search results when the word segmentation layer search process ends is still less than 50, because the word segmentation layer search process has been the last layer of search process, it also satisfies the end condition. At this time, the total number of search results can be calculated, and then to the user Returns the search results and the total number of search results.
在本发明的技术方案中,还提出了一种搜索装置。图9为本发明中的搜索装置的结构示意图。如图9所示,本发明中的搜索装置包括:倒排表生成模块901、存储模块902、词源生成模块903、检索模块904。In the technical solution of the present invention, a search device is also proposed. Fig. 9 is a schematic structural diagram of the search device in the present invention. As shown in FIG. 9 , the search device in the present invention includes: a posting
所述倒排表生成模块901,用于将全部待检索文档以词源Term为关键字进行索引,建立以Term为关键字索引、以包含该Term的文档的总数TotalCount和包含该Term的文档列表DocList为值的索引倒排表;所述文档列表中的每条记录中均包括一个文档的文档编号和关键属性分值;所述文档列表由有序列表和无序列表组成,所述有序列表中包括n个关键属性分值最大、且按关键属性分值从大到小顺序排列的记录;其中,所述n为预先确定的值;将所述索引倒排表发送给存储模块902;The posting
所述存储模块902,用于存储所述索引倒排表;The
所述词源生成模块903,用于根据用户输入的检索字符串生成相应的词源;将所述词源以及用户输入的搜索结果范围发送给检索模块904;The
所述检索模块904,用于根据所生成的词源对存储模块902中存储的索引倒排表进行检索,并根据用户输入的搜索结果范围优先从所生成的词源对应的有序列表中获取记录,以得到所需的搜索结果和相关结果总数;将所述搜索结果和相关结果总数发送给用户。The
综上所述,在本发明的技术方案中,由于为同一类文档设置了关键属性和关键属性权重,因此可计算各个文档的关键属性分值,然后根据上述关键属性分值建立索引倒排表,使得该索引倒排表中的文档列表由有序列表和无序列表组成,且有序列表中包括n个关键属性分值最大、且按关键属性分值从大到小顺序排列的记录。由于文档的关键属性分值在建立索引倒排表时即已完成,而且还对索引倒排表中所存储的记录提前完成了部分排序操作,因此可使得在根据用户输入的检索字符串对所述索引倒排表进行检索时,用户所输入的搜索结果范围在绝大部分情况下都包含在有序列表之中,从而可以直接从有序列表中获取所需条数的搜索结果,保证了大多数的搜索结果可以直接从有序列表中获得,而且所获得的搜索结果已按KFScore的大小顺序排列,因此无需再对搜索结果进行排序,从而大大减少了全量数据得分计算以及全量排序操作,大大减少了CPU的运算次数,显著提高了搜索速度,使得搜索速度比现有技术中的搜索速度提高了一个数量级,而且也大大提高了搜索引擎的搜索响应时间。To sum up, in the technical solution of the present invention, since the key attributes and key attribute weights are set for the same type of documents, the key attribute scores of each document can be calculated, and then the index posting list can be established according to the above key attribute scores , so that the document list in the index posting list is composed of an ordered list and an unordered list, and the ordered list includes n records with the largest key attribute scores and arranged in descending order of key attribute scores. Since the key attribute score of the document has been completed when the index posting list is established, and part of the sorting operation has been completed in advance for the records stored in the index posting list, it is possible to sort all the records according to the search string entered by the user. When retrieving the above index inverted list, the range of search results entered by the user is included in the ordered list in most cases, so that the required number of search results can be directly obtained from the ordered list, ensuring Most of the search results can be obtained directly from the ordered list, and the obtained search results have been arranged in the order of KFScore size, so there is no need to sort the search results, which greatly reduces the calculation of the full data score and the full sort operation. The number of operations of the CPU is greatly reduced, the search speed is significantly improved, and the search speed is increased by an order of magnitude compared with the search speed in the prior art, and the search response time of the search engine is also greatly improved.
另外,在本发明的索引倒排表中,所存储的是各个文档的关键属性分值,无需存放计算得分所需的因子,因此减少了存放的字段,大大节省了所占用的内存资源,从而节约了大量的硬、软件资源。In addition, in the index posting table of the present invention, what is stored is the key attribute score of each document, and there is no need to store the factors required for calculating the score, so the stored fields are reduced, and the occupied memory resources are greatly saved, thereby Save a lot of hardware and software resources.
此外,在本发明的索引倒排表中,对有序列表中的记录进行排序的依据是各个文档的关键属性和关键属性分值,而各个文档的关键属性和关键属性分值均可根据实际应用情况预先进行设置,因此可以充分考虑文本相关性和信息相对价值,使得搜索结果的排序更贴近用户需求;同时,还可以灵活设置关键属性和关键属性分值,从而可以在保证文本相关性的基础上定制搜索结果的排序操作。In addition, in the index inverted list of the present invention, the basis for sorting the records in the ordered list is the key attribute and key attribute score of each document, and the key attribute and key attribute score of each document can be based on the actual The application situation is set in advance, so the relevance of text and the relative value of information can be fully considered, making the ranking of search results closer to user needs; at the same time, key attributes and key attribute scores can be flexibly set, so that text relevance can be ensured Based on the sorting operation of custom search results.
以上所述仅为本发明的较佳实施例而已,并不用以限制本发明,凡在本发明的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本发明保护的范围之内。The above descriptions are only preferred embodiments of the present invention, and are not intended to limit the present invention. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the present invention shall be included in the present invention. within the scope of protection.
Claims (25)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110461128.4A CN103186650B (en) | 2011-12-30 | 2011-12-30 | A kind of searching method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110461128.4A CN103186650B (en) | 2011-12-30 | 2011-12-30 | A kind of searching method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103186650A true CN103186650A (en) | 2013-07-03 |
CN103186650B CN103186650B (en) | 2016-05-25 |
Family
ID=48677819
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110461128.4A Active CN103186650B (en) | 2011-12-30 | 2011-12-30 | A kind of searching method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103186650B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106649302A (en) * | 2015-10-28 | 2017-05-10 | 腾讯科技(深圳)有限公司 | Search sequencing method and device |
CN106649650A (en) * | 2016-12-10 | 2017-05-10 | 宁波思库网络科技有限公司 | Demand information two-way matching method |
CN106909647A (en) * | 2017-02-21 | 2017-06-30 | 福建榕基软件股份有限公司 | A kind of data retrieval method and device |
CN109388690A (en) * | 2017-08-10 | 2019-02-26 | 阿里巴巴集团控股有限公司 | Text searching method, inverted list generation method and system for text retrieval |
WO2019080412A1 (en) * | 2017-10-27 | 2019-05-02 | 平安科技(深圳)有限公司 | Data service method, electronic device and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050210006A1 (en) * | 2004-03-18 | 2005-09-22 | Microsoft Corporation | Field weighting in text searching |
US20080288483A1 (en) * | 2007-05-18 | 2008-11-20 | Microsoft Corporation | Efficient retrieval algorithm by query term discrimination |
CN101460949A (en) * | 2006-06-01 | 2009-06-17 | 微软公司 | Indexing documents for information retrieval based on additional feedback fields |
CN101685455A (en) * | 2008-09-28 | 2010-03-31 | 华为技术有限公司 | Method and system of data retrieval |
US20110022600A1 (en) * | 2009-07-22 | 2011-01-27 | Ecole Polytechnique Federale De Lausanne Epfl | Method of data retrieval, and search engine using such a method |
-
2011
- 2011-12-30 CN CN201110461128.4A patent/CN103186650B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050210006A1 (en) * | 2004-03-18 | 2005-09-22 | Microsoft Corporation | Field weighting in text searching |
CN101460949A (en) * | 2006-06-01 | 2009-06-17 | 微软公司 | Indexing documents for information retrieval based on additional feedback fields |
US20080288483A1 (en) * | 2007-05-18 | 2008-11-20 | Microsoft Corporation | Efficient retrieval algorithm by query term discrimination |
CN101685455A (en) * | 2008-09-28 | 2010-03-31 | 华为技术有限公司 | Method and system of data retrieval |
US20110022600A1 (en) * | 2009-07-22 | 2011-01-27 | Ecole Polytechnique Federale De Lausanne Epfl | Method of data retrieval, and search engine using such a method |
Non-Patent Citations (1)
Title |
---|
励子闰等: "基于全文检索引擎的信息检索技术的应用研究", 《计算机与数字工程》, vol. 36, no. 9, 31 December 2008 (2008-12-31), pages 81 - 85 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106649302A (en) * | 2015-10-28 | 2017-05-10 | 腾讯科技(深圳)有限公司 | Search sequencing method and device |
CN106649650A (en) * | 2016-12-10 | 2017-05-10 | 宁波思库网络科技有限公司 | Demand information two-way matching method |
CN106649650B (en) * | 2016-12-10 | 2020-08-18 | 宁波财经学院 | Bidirectional matching method for demand information |
CN106909647A (en) * | 2017-02-21 | 2017-06-30 | 福建榕基软件股份有限公司 | A kind of data retrieval method and device |
CN106909647B (en) * | 2017-02-21 | 2020-01-03 | 福建榕基软件股份有限公司 | Data retrieval method and device |
CN109388690A (en) * | 2017-08-10 | 2019-02-26 | 阿里巴巴集团控股有限公司 | Text searching method, inverted list generation method and system for text retrieval |
WO2019080412A1 (en) * | 2017-10-27 | 2019-05-02 | 平安科技(深圳)有限公司 | Data service method, electronic device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN103186650B (en) | 2016-05-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101506767B (en) | Relative to taxonomic hierarchies classify such as document and/or cluster object and from this classification derive data structure | |
Jäschke et al. | Tag recommendations in folksonomies | |
JP5332477B2 (en) | Automatic generation of term hierarchy | |
US10430448B2 (en) | Computer-implemented method of and system for searching an inverted index having a plurality of posting lists | |
JP5353173B2 (en) | Determining the concreteness of a document | |
CN110888990A (en) | Text recommending methods, devices, equipment and media | |
US20080162455A1 (en) | Determination of document similarity | |
CN104794242B (en) | Searching method | |
JP5391632B2 (en) | Determining word and document depth | |
JP2009093653A (en) | Refine search space according to user input | |
JP2009193584A (en) | Determining words related to a word set | |
US9971828B2 (en) | Document tagging and retrieval using per-subject dictionaries including subject-determining-power scores for entries | |
CN103942198B (en) | For excavating the method and apparatus being intended to | |
WO2007085187A1 (en) | Method of data retrieval, method of generating index files and search engine | |
CN103186650B (en) | A kind of searching method and device | |
CN117171331B (en) | Professional field information interaction method, device and equipment based on large language model | |
US20130346385A1 (en) | System and method for a purposeful sharing environment | |
CN116610853A (en) | Search recommendation method, search recommendation system, computer device, and storage medium | |
TWI290687B (en) | System and method for search information based on classifications of synonymous words | |
CN103942232B (en) | For excavating the method and apparatus being intended to | |
CN108509571B (en) | General method for webpage information data mining | |
CN104252487A (en) | Method and device for generating entry information | |
Lipczak et al. | The impact of resource title on tags in collaborative tagging systems | |
Das et al. | Leveraging collaborative tagging for web item design | |
JP2013168177A (en) | Information provision program, information provision apparatus, and provision method of retrieval service |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |