US20090254523A1 - Hybrid term and document-based indexing for search query resolution - Google Patents
- Publication number
- US20090254523A1 (Application No. US 12/098,376)
- Authority
- US
- United States
- Prior art keywords
- bank
- term
- docids
- computer
- computers
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
Definitions
- a cluster of computers may be provided to store and provide indexing services based on the inverted index. Since a cluster may comprise a plurality of physically distinct machines, ways to distribute the index among the machines of the cluster have been developed.
- One way is a document-based distribution scheme.
- portions of the index are distributed among various computers of the cluster based on hashes of document identifiers, which are functionally unique for the purposes of identifying a particular document.
- portions of a given posting list (i.e., the list of DocIDs for a given term) are therefore spread among multiple computers.
- a query is transmitted/broadcast to all machines in the cluster, which can then separately and in parallel process the query for its fraction of the DocIDs. Since each machine is responsible for a subset of the DocIDs, each machine processes all terms against its fraction of the DocIDs, and could return documents for which it has responsibility and in which one or more of the terms appear.
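- The document-based distribution scheme described above can be sketched as follows; the hash function and cluster size are illustrative choices, not part of the disclosure:

```python
import hashlib

def machine_for_doc(doc_id, num_machines):
    # Document-based distribution: a hash of the DocID, taken modulo the
    # number of machines, selects the machine that stores this DocID.
    digest = hashlib.sha1(str(doc_id).encode()).hexdigest()
    return int(digest, 16) % num_machines

# Each machine holds a fraction of the DocIDs, so every machine must
# receive the full query; here 10 hypothetical DocIDs go to 4 machines.
assignments = {doc_id: machine_for_doc(doc_id, 4) for doc_id in range(10)}
```

Because any machine may hold DocIDs from any term's posting list, the query must be broadcast to all machines.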
- Another way to distribute the work load for a search among the computers of the cluster is a term-based distribution scheme.
- terms of the index are equally divided among the cluster's machines by, for example, using hashes of each term to obtain a term identifier (termID).
- a term from a query is sent only to the machine(s) responsible for storing that particular term. Each of those machines reads the entire posting list for the terms which are assigned to it and which appear in that query.
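- A corresponding sketch of the term-based distribution scheme, again with an illustrative hash function:

```python
import hashlib

def machine_for_term(term, num_machines):
    # Term-based distribution: a hash of the term yields a term
    # identifier (termID); the termID modulo the number of machines
    # selects the single machine holding that term's full posting list.
    term_id = int(hashlib.sha1(term.encode()).hexdigest(), 16)
    return term_id % num_machines

# A query term is routed only to its responsible machine.
routing = {term: machine_for_term(term, 8) for term in ["bank", "computer"]}
```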
- aspects include a method of distributing on a computing cluster an inverted index that comprises terms respectively associated with posting lists of document identifiers (DocIDs).
- the method comprises organizing m computers into B banks, and distributing document identifiers (DocIDs) appearing in posting lists of the inverted index among the B banks of computers, where each posting list corresponds to a search term.
- the method includes distributing portions of the DocIDs, which appear in a large posting list and are distributed to that bank, to a plurality of the computers within that bank.
- the method also comprises assigning responsibility for a small posting list to fewer of the computers of that bank, and providing for the distribution of DocIDs appearing in the small posting list, which are not already distributed to its assigned computers.
- Further aspects include a method of providing a computer cluster for hosting an inverted index, comprising providing a plurality of banks of computers forming a computer cluster, obtaining an inverted index comprising a plurality of posting lists, where each posting list corresponds to a term, and comprises respective document identifiers (DocIDs) for one or more documents of a document set in which that term appears.
- the method also comprises distributing subsets of DocIDs for documents of the document set among respective banks of computers for storage on one or more computers therein.
- For storing the subsets of DocIDs distributed to each bank, the method comprises identifying a larger posting list comprising DocIDs of the subset distributed to that bank, distributing the DocIDs of the larger posting list among plural computers of that bank, each of the plural computers for producing DocIDs of the larger posting list which were distributed to it, identifying a smaller posting list comprising DocIDs of the subset distributed to that bank, and assigning responsibility for producing DocIDs of the smaller posting list to fewer computers than the plural computers for the larger posting list.
- a method of identifying documents potentially relevant to a term-based query comprising receiving a query comprising search terms, using a computer cluster of m computers organized into B banks, the computer cluster hosting an inverted index comprising posting lists of DocIDs in which each term of a plurality of terms appears, and each computer of a respective bank stores a portion of DocIDs that are assigned to that bank and which are associated with a large posting list, and all the DocIDs assigned to that bank, which are associated with a small posting list corresponding to a term assigned to that computer.
- the method further comprises distributing the search terms to each bank.
- for any term corresponding to a small posting list, the method comprises retrieving its corresponding posting list from the computer to which it was assigned, and for any term corresponding to a large posting list, retrieving a portion of its corresponding posting list from each computer of the bank.
- Still further aspects include a method of organizing a computer cluster for supporting term-based searching of an inverted index, comprising: dividing m computers of the computer cluster into B banks and distributing selections of the document identifiers of an inverted index among the B banks. At least some of the document identifiers are distributed to fewer than all of the B banks. The method also comprises distributing the document identifiers assigned to each bank among the computers of that bank, wherein B is selected for balancing an aggregate search throughput of the computer cluster with respective search latencies for individual searches.
- FIG. 1 illustrates a first cluster architecture for an inverted index distributed on the cluster
- FIG. 2 illustrates method aspects of a first distribution of an index on the cluster of FIG. 1 ;
- FIG. 3 illustrates aspects of a run-time method useful in the cluster of FIG. 1 as configured according to FIG. 2 ;
- FIG. 4 illustrates a preferred hybrid distribution of an index on the cluster of FIG. 1 ;
- FIG. 5 illustrates aspects of a run-time method useful in the cluster of FIG. 1 as configured according to FIG. 4 ;
- FIG. 6 illustrates data flow aspects for using the cluster of FIG. 1 as configured according to FIG. 4 .
- An inverted index comprises lists of terms and corresponding lists of document identifiers (DocIDs) in which those terms appear.
- a collection of indications of what documents contain a given term is frequently called a posting list (e.g., a list of document identifiers).
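- The inverted index and posting lists described above can be sketched as follows; the sample documents are hypothetical:

```python
from collections import defaultdict

def build_inverted_index(docs):
    # docs maps DocID -> document text; the result maps each term to its
    # posting list (a sorted list of the DocIDs containing the term).
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "the quick dog",
}
index = build_inverted_index(docs)
# "the" has a comparatively large posting list; "fox" a small one.
```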
- a cluster of computers can be used to provide a capability to search an inverted index for lists of documents in which specified terms appear, and in such a cluster, each computer can take a part of producing DocIDs responsive to a query.
- the document-based distribution strategy provides reduction in latency when producing large posting lists, because the DocIDs of a large posting list are produced in parallel by more computers.
- Because the document-based distribution strategy calls for distributing documents among the computers based on DocIDs, DocIDs from any given posting list may actually be distributed among a large number of computers.
- each computer in the system performs a seek to determine whether it has DocIDs for a term of a given search.
- Such a seek may include a hard-drive seek to load a list of DocIDs for a given term, which is orders of magnitude slower than accessing solid-state memory.
- FIG. 1 illustrates a first exemplary cluster organization 100 (“cluster 100 ”) that seeks a balance between reducing latency for generation of large posting lists while also reducing unnecessary seeks induced by small posting lists.
- Cluster 100 contains m computers (illustratively numbered 110 a - 110 m ), organized into B banks 125 a - 125 B. Although it may be preferable and/or intuitive that all B banks contain the same number of computers, there is no requirement that this be the case.
- Each computer 110 a - 110 m includes a storage resource, for example one or more hard drives, and/or flash drives, or even a virtual or logical partition in a dedicated storage unit, provided the storage unit could appropriately serve data within acceptable latencies to the computer using it as a storage resource.
- such a computer may be a rack-mount server having a RAID hard drive implementation that can be configured for data protection and/or data throughput (e.g., RAID 0, 1, 5, 10, etc.)
- exemplary computer 110 a which comprises a processing resource 111 , for example a central processing unit that may include a number of independently operable processing cores and other functional resources, a chipset 112 , an I/O controller 113 , a working memory 114 (e.g., system memory), network connectivity 116 , and a storage resource 115 , which may be interfaced to the I/O controller 113 using one or more of SATA, SCSI, Infiniband, Fibre Channel, Ethernet and a PCI-E connection, for example.
- a computer 110 a would not have a dedicated monitor or user interface, but usually would be controlled through a network management system.
- a bank management server 120 can optionally be provided, which can coordinate operation of computers 110 a - 110 m in each bank and interface with cluster management server 105 . Where a bank-specific server 120 is not provided, a management process for each bank can execute on server 105 or on a designated computer in each bank 125 a - 125 B.
- the number of banks (B) can be selected based on measurements of aggregate search throughput and samples of latencies for searches resulting in larger result sets. Thus, B can be increased to decrease individual search latencies, or decreased to increase aggregate search throughput.
- Cluster 100 can also be distributed geographically such that inter-computer and inter-bank links can be of any distance. For example, these connections may be long-haul fiber connections that carry virtual LAN traffic.
- different computers within a bank or within a cluster can actually be implemented as a portion of a larger computer, in that virtualization allows separate allocation of processing resources, and/or storage resources.
- a document collection may have any number of documents, n documents.
- a document can be assigned a numerical Document Identifier (DocID) that can be any random or pseudorandom string of sufficient length to allow a high probability of distinctness among all DocIDs.
- other ways to construct DocIDs are acceptable, so long as an individual document can be identified with its ID.
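- One such construction can be sketched as follows, using a random 128-bit identifier; the width is an illustrative choice:

```python
import secrets

def new_docid(bits=128):
    # A random identifier of sufficient length makes collisions among
    # all DocIDs highly improbable.
    return secrets.randbits(bits)

doc_ids = [new_docid() for _ in range(100)]
```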
- a “term” may refer to a canonical term, which may include, for example, various forms of a given word, such as all tenses of a verb, or a stem for a number of words, or the like.
- an inverted index for terms is depicted in Table 1, where identifiers for a set of terms appear in a first column, and subsequent columns in each row list identifiers for the specific documents in which that term appears.
- Table 1 depicts that some terms will have many associated DocIDs in their posting lists, while others may have only a few. Of course, the scale of an actual implementation may be many orders of magnitude larger than this example.
- DocIDs are distributed among servers, and their respective documents can be separately stored in another repository. This architecture can be selected because the size of posting lists for some terms can be so large that simply producing a list of DocIDs within an acceptable latency is sufficiently challenging. However, in other implementations, documents themselves can be stored with their DocIDs.
- a method 200 for distributing DocIDs among a cluster 100 for the example of Table 1 includes at least logically grouping 205 the m computers of cluster 100 into B banks.
- the number B of banks can be selected based on a desired balance between latency for larger posting lists and reducing unnecessary seeks for smaller posting lists, as will be explained in further detail below.
- This grouping 205 can include, for example, providing a switch to locally interconnect computers of a given bank, and providing an uplink to a switch that serves all banks of cluster 100 .
- Other ways to group 205 computers of cluster 100 into banks include defining a VLAN for computers of a given bank, and maintaining a table of MAC addresses or IP addresses corresponding to a given bank. Such a table can be maintained by central server 105, for example. In other words, there is at least a logical hierarchy of computers within a bank and banks within cluster 100, but that hierarchy may not map directly into a hierarchy of physical connectivity.
- the n DocIDs are distributed 210 among the banks.
- One way to divide DocIDs among the banks is to perform modulo division on some or all of a hash value derived from a given DocID by the number of banks, and discriminate among the banks based on the remainder of that modulo division.
- a further step is to allocate 215 the DocIDs of a given bank among the computers of that bank.
- the allocation is a term-based allocation, and so allocation 215 may also involve an analysis to determine what terms appear in the DocIDs allocated to a bank, or such analysis can be performed in advance. For example, a hash can be performed on a term, to arrive at a hash value, and a number of bits of that hash value appropriate for the number of computers can be inspected to determine a computer of the bank to be responsible for producing DocIDs for that term (e.g., a partial posting list for that term) within that bank (e.g., by modulo division).
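- The two-level allocation described above (DocIDs to banks by distribution 210, then terms to computers within a bank by allocation 215) can be sketched as follows; sha1 is an illustrative hash choice:

```python
import hashlib

def _int_hash(value):
    # Deterministic integer hash of any value's string form.
    return int(hashlib.sha1(str(value).encode()).hexdigest(), 16)

def bank_for_docid(doc_id, num_banks):
    # Distribution 210: modulo division of a hash of the DocID by the
    # number of banks selects the bank storing that DocID.
    return _int_hash(doc_id) % num_banks

def computer_for_term(term, computers_in_bank):
    # Allocation 215: within a bank, a hash of the term selects the
    # computer responsible for that term's partial posting list there.
    return _int_hash(term) % computers_in_bank

bank = bank_for_docid(12345, num_banks=3)
computer = computer_for_term("bank", computers_in_bank=5)
```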
- the configuration of cluster 100 provides for DocIDs to be distributed among banks of cluster 100 . Then, a determination of what terms appear in the documents grouped into each bank may be undertaken such that a subset of the computers in a given bank have responsibility for producing the portion of that term's posting list in that bank. (i.e., generally a subset of the DocIDs for a term's posting list will be allocated to a given bank by DocID, and then further allocated to computers in that bank term-by-term). In one aspect, responsibility for producing DocIDs in a posting list for a term, which are assigned to a given bank may be assigned to a single computer in that bank.
- such responsibility may be distributed among the plurality of computers in the bank, for example, two computers may be allocated responsibility for the DocIDs of a given term's posting list within a bank.
- a partial posting list refers to any subset of a set of DocIDs appearing in a posting list.
- partial posting lists can be created for each bank based on DocID allocation.
- FIG. 3 illustrates method 300 for producing DocIDs for documents containing terms included in a search query.
- a first query is received 305 , the query contains one or more terms with the expectation that results relevant to those terms will be returned.
- the terms of the query are distributed 310 to all banks of cluster 100 .
- it is determined 315 which computer of that bank is responsible for producing posting list results for each term of the query. This determination can be performed by an indexing process provided on the optional local management server 120 ( FIG. 1 ). In the absence of local management server 120, this determination 315 may be performed by a search query distribution process in server 105, which also interfaces with web front end 175.
- a further alternative is for each computer 110 a - 110 m to store an index of terms for which it has partial posting list results in a main memory, so that access can be rapid, and does not require a hard drive seek.
- Each computer responsible for a given term then performs a lookup 320 to identify DocIDs associated with that given term (e.g., usually, partial posting lists), and which were allocated to that bank.
- the identified DocIDs may be termed an initial result set, and may undergo preliminary processing to reduce a number of DocIDs returned.
- each computer can process multiple terms and can intersect the partial posting lists it identified during lookup 320 to return non-duplicative results. Subsequently, each computer returns 325 identified DocIDs for its terms.
- the document results may then be received by the management server 120 for each bank, if present, and if not present then by management process(es) of server 105 , which also would be receiving document results from other banks, potentially for the same terms as the document results returned from the bank described above.
- Management process within server 105 may then further process each DocID set to provide a final result set to other functionality used in producing a final search result.
- each bank 125 a - 125 B of cluster 100 would generally produce a portion of a posting list for a given term and within each bank only a subset of computers would have performed a seek to determine whether it contained or otherwise was responsible for returning DocIDs in a posting list for that term.
- This strategy reduces a number of seeks performed by the computers of cluster 100 while allowing posting list results to be returned by multiple computers in parallel, which reduces latency for large posting lists.
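- Methods 200 and 300 together can be sketched as follows; the index contents, bank and computer counts, and hash choice are hypothetical:

```python
import hashlib

def _int_hash(value):
    return int(hashlib.sha1(str(value).encode()).hexdigest(), 16)

def build_banks(index, num_banks, per_bank):
    # Method 200: each DocID is hashed to a bank (210); within a bank,
    # each term's partial posting list is held by one hash-selected
    # computer (215). Each computer is modeled as a dict of partial
    # posting lists.
    banks = [[{} for _ in range(per_bank)] for _ in range(num_banks)]
    for term, doc_ids in index.items():
        for doc_id in doc_ids:
            b = _int_hash(doc_id) % num_banks
            c = _int_hash(term) % per_bank
            banks[b][c].setdefault(term, []).append(doc_id)
    return banks

def resolve(query_terms, banks):
    # Method 300: terms are distributed to every bank (310); in each
    # bank only the responsible computer performs the lookup (315, 320);
    # the partial posting lists are returned (325) and combined.
    results = set()
    for bank in banks:
        for term in query_terms:
            responsible = bank[_int_hash(term) % len(bank)]
            results.update(responsible.get(term, ()))
    return sorted(results)

index = {"quick": [1, 3], "dog": [2, 3]}
banks = build_banks(index, num_banks=2, per_bank=2)
```

Note that, as in the text, only one computer per bank is consulted for each term, while a term's DocIDs may still be produced by several banks in parallel.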
- a second method 400 to distribute DocIDs among the computers of cluster 100 is explained with respect to FIG. 4 .
- the available computers in the cluster are again grouped 401 into banks.
- the DocIDs for the document collection are also distributed 405 among banks of cluster 100 according to document identifiers (e.g., modulo division on a hash value for each DocID).
- the method 400 also includes differentiating between (or otherwise determining) 410, for DocIDs distributed to a given bank, whether the posting lists in which those documents appear are large or small.
- determining 410 can include a term-based analysis of whether or not partial posting lists for a respective term have a large number of DocIDs distributed to a given bank.
- differentiating/determining 410 can be performed prior to distribution 405 , such that a posting list for a given term can be judged large or small for the document collection as a whole, rather than for a portion of the document collection allocated to each bank. In such an example, this determination could control treatment of the partial posting lists in each bank for that term. In either case, within each bank, a term-by-term distinction between large versus small posting list is provided. This distinction between large versus small posting list is used to determine distribution of responsibility for producing posting list results within computers of a bank.
- subsets of DocIDs associated with a partial posting list considered large are distributed 415 among a plurality of computers in that bank.
- a subset of DocIDs is distributed to each computer of that bank.
- DocIDs for the entire posting list can be stored in a plurality of computers and responsibility for producing a given subset of those DocIDs can be allocated to each computer. For example, if each computer had sufficient storage capacity for DocIDs of an entire document collection, then the additional effort to segment the DocIDs for that document collection among these computers may not be required, even though latency reduction in producing such documents may be desirable. This may be a practical matter, for example, where a hard drive of a larger size often costs only incrementally more than a hard drive substantially smaller.
- responsibility for producing DocIDs in that posting list and present in the bank is assigned 420 to fewer computers than for a large posting list.
- responsibility is assigned to only one computer of the bank, such that DocIDs for that small posting list present in that bank would be produced only by that one computer. Because any given DocID may be present both in large and in small posting lists, DocIDs may need to be duplicated among the computers of the bank. For example, in table 1 above, it was illustrated that DocID50 appeared in posting lists for both term 1 and term 2.
- DocID50 may be distributed to computer 110 a , while responsibility for producing DocIDs present in the posting list for term 2 may be assigned to computer 110 b . As such, DocID 50 may be duplicated on both computer 110 a and computer 110 b.
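- The large/small differentiation and distribution steps 410-420 within one bank can be sketched as follows; the size threshold is hypothetical:

```python
import hashlib

def plan_distribution(partial_postings, bank_size, threshold=3):
    # Steps 410-420 within one bank: a partial posting list at or above
    # the threshold is spread over every computer of the bank (415); a
    # smaller one is assigned to a single hash-selected computer (420).
    plan = {}
    for term, doc_ids in partial_postings.items():
        if len(doc_ids) >= threshold:
            plan[term] = list(range(bank_size))
        else:
            h = int(hashlib.sha1(term.encode()).hexdigest(), 16)
            plan[term] = [h % bank_size]
    return plan

postings = {"term1": [10, 20, 30, 40], "term2": [50]}
plan = plan_distribution(postings, bank_size=4)
```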
- a “run time” method 500 for obtaining DocID results for term-based queries is illustrated in FIG. 5 and described below.
- a query is received 505 ; such query can comprise a plurality of terms, for which relevant documents are desired.
- the terms of the query are distributed 510 to each bank, and it is determined 515 whether a partial posting list for each term in each bank is either large or small (determining 515 can also be performed globally for the entire document collection, such that a posting list for a term is either large or small in all banks).
- Terms with large posting lists are distributed 520 to each computer of the bank. Terms with small posting lists are provided 525 only to the computer(s) which was assigned responsibility for producing documents for that term's partial posting list.
- After each computer has identified documents responsive to all the terms provided to it (e.g., some computers may have searched for documents of multiple partial posting lists, such as large and small partial posting lists, or multiple small posting lists), each computer can merge 530 those identified DocIDs to remove redundant DocIDs (e.g., multiple terms may appear in the same document). The merged results are returned 535 to a management process in server 120 or server 105; if results are not merged, then some redundant results may be returned, which may be acceptable in some implementations.
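- The merge 530 can be sketched as follows; the term labels and DocIDs are hypothetical:

```python
def merge_results(per_term_doc_ids):
    # Merge 530: union the DocIDs identified for each term so that a
    # document containing several query terms is returned only once.
    merged = set()
    for doc_ids in per_term_doc_ids.values():
        merged.update(doc_ids)
    return sorted(merged)

merged = merge_results({"S1": [1, 2, 50], "S2": [50, 7]})
```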
- the optional reshuffling step 526 may be applied to method 500 in the following circumstances. It was described in the background that it is known to distribute DocIDs for posting lists among computers of a cluster according to a hash value. For example, in a 100-computer cluster, a computer to receive a document can be identified by Modulo(DocID, 100). In other words, it is known to distribute DocIDs listed in a posting list among a plurality of computers; in such clusters, terms are distributed among all the computers of a cluster, and those computers having part of a term's posting list (i.e., having DocIDs in a partial posting list for that term) respond with those DocIDs. In such clusters, each DocID can be said to have an actual home on the computer storing it.
- This arrangement effectively allows the distribution of DocIDs for large posting lists in a hybrid cluster to correspond with how those DocIDs would be distributed in a prior art document based distribution cluster.
- responsibility for producing results for small posting lists within a bank is assigned to select computers (in some examples, only one computer).
- redistribution of partial posting list results within a bank for small posting lists can be undertaken prior to reporting results from a bank for a search.
- a portion of DocIDs appearing in a small posting list may still be distributed among multiple banks, and therefore, each of these banks will have at least one computer responsible for returning results for such a posting list.
- responsibility for producing DocIDs of a partial small posting list can be distributed to one computer of the bank.
- each bank receives a portion of DocIDs of a document collection, generally distributed according to DocID. Then, within a bank, large partial posting lists are distributed among all computers of that bank, and small partial posting lists are each assigned to one computer of that bank.
- posting lists are categorized as either large or small, and distribution according to this categorization is either to all computers in a bank or one computer.
- Other examples and implementations may provide more granular categorizations and assignments. For example, a number of degrees of a size for a partial posting list (i.e., a portion of a posting list present in a bank) can be established, and the larger a given partial posting list, the more computers within its bank will be assigned to produce DocIDs for it. Conversely, the smaller the partial posting list, the fewer the number of computers in a given bank will be assigned to produce postings for it.
- posting lists for a document collection could be categorized as large/medium/small, or distributions of posting lists could be formed where a first quartile of the largest posting lists could be distributed to all of a bank's computers, and quartiles of smaller posting lists could be distributed to fewer computers within a bank.
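- Such a graded assignment can be sketched as follows; the three grades and their size thresholds are hypothetical:

```python
def replication_degree(list_length, bank_size, thresholds=(1000, 100)):
    # The larger a partial posting list, the more computers of the bank
    # are assigned to produce its DocIDs; here three grades are used.
    large, medium = thresholds
    if list_length >= large:
        return bank_size               # large: every computer of the bank
    if list_length >= medium:
        return max(1, bank_size // 2)  # medium: half the bank
    return 1                           # small: a single computer
```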
- Where fewer than all computers of a bank store a partial posting list for a given term, the computers having posting list contents relevant to the term associated with the partial posting list preferentially are indexed. Such indexing allows determining at run time which computers of a bank have data responsive to a given term.
- portions of DocIDs for a document collection distributed to each bank may be approximately equal. However, this approximately equal distribution is an example, and distributions can also be made unequally among banks. For example, one bank may have more computing resources than another bank, or better network connectivity, etc. Such distinctions can be used in determining how to distribute DocIDs of a collection among banks in cluster 100 .
- computers producing posting list results may first index a table based on a term to identify a list of document identifiers (DocIDs) that correspond to that term. These DocIDs can then be used to identify respective physical locations where the documents for each DocID are stored. Since document sizes will vary, it may be convenient to provide an index of DocIDs to file locations, or alternatively an existing file system structure can be used such that DocIDs serve as file names, and the file system itself can be used to obtain the document for each DocID.
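- The file-system alternative can be sketched as follows; the root directory and file extension are hypothetical:

```python
from pathlib import Path

def document_path(doc_id, root):
    # Use the file system itself as the DocID-to-location index by
    # letting the DocID serve as the file name.
    return Path(root) / f"{doc_id}.doc"

path = document_path(50, "/srv/docs")
```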
- FIG. 6 illustrates an example dataflow diagram that summarizes aspects described above, for an example query comprising a set of three terms ⁇ S 1 , S 2 , S 3 ⁇ .
- the query is received by a management process and the terms of the query are distributed to Banks 1..b.
- all terms are distributed to each bank, as illustrated by distribution of the terms ⁇ S 1 , S 2 , S 3 ⁇ to each bank.
- within each bank, it is determined which terms correspond to large and which to small partial posting lists; this determination can include determining the computer responsible for producing DocIDs in small posting lists.
- In this example, terms S2 and S3 were determined to be small. Following these determinations, terms are distributed to the responsible computers within each bank.
- 602 shows {S1, S2} to computer Ck;
- 603 shows {S1};
- 625 shows {S1} to C1;
- 604 shows {S2} to Ck;
- 605 shows {S2};
- 606 shows {S2} to C1. Similar operation would occur in bank b, but particulars are omitted in this example.
- Each computer in each bank then performs a seek to identify DocIDs in its partial posting list for each term (as described above, a given computer may actually be storing DocIDs containing a given term, but responsibility for producing those DocIDs in response to a query may be assigned to another computer in the bank or in a different bank). Then, each computer produces result sets as follows.
- C1 produces results as follows: 616 shows R{S1}, 615 shows the union of the result set for term S1 and a partial result for S2, R{S1 ∪ p.r. S2} (i.e., computer Ck avoids producing duplicative DocIDs), and 614 shows a partial result for S2 being transmitted to computer C2.
- the computers of Bank 2 produce results as follows: 617 shows R{S2}, 619 shows R{S2 ∪ S3}, and 611 shows that partial results from C2 are sent from the collection point (e.g., a management process) to C1, which are then shown in 620 as being returned with other results R{S2 ∪ p.r. S3} from C1.
- the operation of C2 and C1 in Bank 2 with respect to results for S3 illustrates a different way to maintain transparency of the origin of results for term searching.
- a given computer can send all results to a management process (e.g., 619 shows returning a union of results for S 2 and S 3 from C 2 ), and the management process can identify portions of results that would have been from different computers in a bank, and can send those results to those computers (e.g., as shown by 611 , where partial results for S 3 are sent to C 1 ).
- In this example, bank 2 is not producing posting results for S1, which would imply that documents in which S1 appears were distributed to banks other than bank 2. For posting lists of most practical sizes, this result may be statistically unlikely, but it is nevertheless possible.
- all the terms of a given search query can be distributed to all banks, such that results from all the banks of the cluster can be collected and analyzed ( 623 ).
- such analysis would further narrow the results based on any of a variety of algorithms, and the results from 623 would then be presented (624).
- the results can be provided to a user, saved, and/or transmitted. Since the hybrid cluster can provide DocIDs for any type of further use, the particulars of such use need not be described.
- query scheduling algorithms can be provided based on how terms are allocated within each bank. For example, a bank can determine that two terms of two different queries are assigned to different computers and may schedule those terms for servicing simultaneously.
Description
- 1. Field
- The present invention generally relates to search query resolution, and more particularly to resolving search queries, such as Internet searches, using clusters of computers.
- 2. Description of Related Art
- Term-based searching of large databases to identify relevant or potentially relevant documents is an area of continued research and innovation. For example, Internet users provide term-based search queries to search engines accessing such databases to identify web pages that may be relevant to that query.
- Because of the large number of data items (a.k.a. documents) available on the Internet (and even in particular portions of it, such as the World Wide Web), techniques to distribute indexing data for these documents and the workload of searching them for relevant terms have been developed.
- To avoid actually searching documents responsively to each entered search query (which would result in unacceptable delays), an inverted index of terms appearing in the documents is provided. The inverted index provides a list of terms and a list of document identifications in which those terms appear. Each list of document identifications for a particular term is usually called a “posting list.” For some terms, the number of document identifications in the associated posting lists is very large, while for other terms, the number of documents in which those terms appear may be relatively small.
- Also, the entire index itself can be too large to store and use efficiently in one computer system, so a cluster of computers may be provided to store and provide indexing services based on the inverted index. Since a cluster may comprise a plurality of physically distinct machines, ways to distribute the index among the machines of the cluster have been developed.
- One way is a document-based distribution scheme. In a document-based distribution scheme, portions of the index are distributed among various computers of the cluster based on hashes of document identifiers, which are functionally unique for the purposes of identifying a particular document. In a DocID distribution scheme, portions of a given posting list (i.e., a list of DocIDs for a given term) are distributed among the cluster machines. At “run time”, when a query comes in, it is transmitted/broadcast to all machines in the cluster, each of which can then separately and in parallel process the query for its fraction of the DocIDs. Since each machine is responsible for a subset of the DocIDs, each machine processes all terms against its fraction of the DocIDs, and can return documents for which it has responsibility and in which one or more of the terms appear.
- Another way to distribute the workload for a search among the computers of the cluster is a term-based distribution scheme. During index building for a term-based distribution, terms of the index are equally divided among the cluster's machines by, for example, using hashes of the term to obtain a term identifier (termID). At run time, a term from a query is sent only to the machines responsible for storing that particular term. Each of those machines reads the entire posting list for the terms which are assigned to it and which appear in that query.
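The query-time routing difference between the two schemes can be sketched as follows. This is a minimal illustration, assuming a 100-machine cluster and MD5 as the hash function; neither choice is specified in the description above.

```python
import hashlib

NUM_MACHINES = 100  # illustrative cluster size, not taken from the disclosure

def _hash(s: str) -> int:
    # Stable hash of a string; MD5 is an illustrative choice.
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)

def machines_for_query_docid_scheme(terms):
    """Document-based scheme: every machine holds a fraction of the DocIDs,
    so every query term is broadcast to all machines."""
    return {t: list(range(NUM_MACHINES)) for t in terms}

def machines_for_query_term_scheme(terms):
    """Term-based scheme: a term's entire posting list lives on one machine,
    so each term is routed only to the machine owning its termID."""
    return {t: [_hash(t) % NUM_MACHINES] for t in terms}
```

The document-based rule fans each term out to the whole cluster (many seeks, low latency for long lists), while the term-based rule touches one machine per term (few seeks, but one machine reads the whole list).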
- Further innovations in providing posting lists corresponding to search terms from cluster computers are desirable to increase throughput, decrease search latency, and manage costs of the machinery providing the search results.
- Aspects include a method of distributing on a computing cluster an inverted index that comprises terms respectively associated with posting lists of document identifiers (DocIDs). The method comprises organizing m computers into B banks, and distributing the DocIDs appearing in posting lists of the inverted index among the B banks of computers, where each posting list corresponds to a search term. Within a bank of the B banks, the method includes distributing portions of the DocIDs, which appear in a large posting list and are distributed to that bank, to a plurality of the computers within that bank. Within each of the B banks, the method also comprises assigning responsibility for a small posting list to fewer of the computers of that bank, and providing for the distribution of DocIDs appearing in the small posting list, which are not already distributed to its assigned computers.
- Further aspects include a computer cluster for providing searching of an inverted index comprising posting lists of document identifiers of documents in which each term of a plurality of terms appears, the computer cluster comprises m computers organized into B banks, where each computer is operable for storing data assigned to it, and wherein each computer of a respective bank stores a portion of document identifiers that are assigned to that bank and which are associated with a large posting list, and all the document identifiers assigned to that bank which are associated with a small posting list corresponding to a term assigned to that computer.
- Further aspects include a method of providing a computer cluster for hosting an inverted index, comprising providing a plurality of banks of computers forming a computer cluster, obtaining an inverted index comprising a plurality of posting lists, where each posting list corresponds to a term, and comprises respective document identifiers (DocIDs) for one or more documents of a document set in which that term appears. The method also comprises distributing subsets of DocIDs for documents of the document set among respective banks of computers for storage on one or more computers therein.
- For storing the subsets of DocIDs distributed to each bank, the method comprises identifying a larger posting list comprising DocIDs of the subset distributed to that bank, distributing the DocIDs of the larger posting list among plural computers of that bank, each of the plural computers for producing DocIDs of the larger posting list which were distributed to it, identifying a smaller posting list comprising DocIDs of the subset distributed to that bank, and assigning responsibility for producing DocIDs of the smaller posting list to fewer computers than the plural computers for the larger posting list.
- A method of identifying documents potentially relevant to a term-based query comprises receiving a query comprising search terms, using a computer cluster of m computers organized into B banks, the computer cluster hosting an inverted index comprising posting lists of DocIDs in which each term of a plurality of terms appears, where each computer of a respective bank stores a portion of DocIDs that are assigned to that bank and which are associated with a large posting list, and all the DocIDs assigned to that bank which are associated with a small posting list corresponding to a term assigned to that computer. The method further comprises distributing the search terms to each bank. In each bank, for any term corresponding to a small posting list, the method comprises retrieving its corresponding smaller posting list from the computer to which it was assigned, and for any term corresponding to a large posting list, the method comprises retrieving a portion of its corresponding posting list from each computer of the bank.
- Still further aspects include a method of organizing a computer cluster for supporting term-based searching of an inverted index, comprising: dividing m computers of the computer cluster into B banks and distributing selections of the document identifiers of an inverted index among the B banks. At least some of the document identifiers are distributed to fewer than all of the B banks. The method also comprises distributing the document identifiers assigned to each bank among the computers of that bank, wherein B is selected for balancing an aggregate search throughput of the computer cluster with respective search latencies for individual searches.
- For a fuller understanding of aspects and examples disclosed herein, reference is made to the accompanying drawings in the following description.
- FIG. 1 illustrates a first cluster architecture for an inverted index distributed on the cluster;
- FIG. 2 illustrates method aspects of a first distribution of an index on the cluster of FIG. 1;
- FIG. 3 illustrates aspects of a run-time method useful in the cluster of FIG. 1 as configured according to FIG. 2;
- FIG. 4 illustrates a preferred hybrid distribution of an index on the cluster of FIG. 1;
- FIG. 5 illustrates aspects of a run-time method useful in the cluster of FIG. 1 as configured according to FIG. 4; and
- FIG. 6 illustrates data flow aspects for using the cluster of FIG. 1 as configured according to FIG. 4.
- The following description is presented to enable a person of ordinary skill in the art to make and use various aspects of the inventions. Descriptions of specific techniques, implementations and applications are provided only as examples. Various modifications to the examples described herein may be apparent to those skilled in the art, and the general principles defined herein may be applied to other examples and applications without departing from the scope of the invention.
- An inverted index comprises lists of terms and corresponding lists of document identifiers (DocIDs) in which those terms appear. A collection of indications of what documents contain a given term is frequently called a posting list (e.g., a list of document identifiers). Thus, an inverted index is searchable by term to identify documents having that term. In the case of large document collections, there may be many documents that contain one term, and relatively few that contain another.
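The structure just described can be held as a simple mapping from terms to posting lists. A minimal sketch follows; the terms and DocIDs are illustrative placeholders, not values taken from the disclosure.

```python
# Inverted index: each term maps to its posting list of DocIDs.
# Terms and DocIDs here are invented for illustration only.
inverted_index = {
    "apple":  ["DocID114", "DocID150", "DocID161"],
    "banana": ["DocID150", "DocID100450"],
}

def posting_list(term):
    """Look up the posting list for a term; unindexed terms yield no documents."""
    return inverted_index.get(term, [])
```

Note that one DocID can appear in several posting lists, which is what later makes duplication of DocIDs across computers necessary in the hybrid scheme.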
- It was described in the background that a cluster of computers can be used to provide a capability to search an inverted index for lists of documents in which specified terms appear, and in such a cluster, each computer can take part in producing DocIDs responsive to a query.
- The document-based distribution strategy provides a reduction in latency when producing large posting lists, because the DocIDs of a large posting list are produced in parallel by more computers. However, because the document-based distribution strategy calls for distributing documents among the computers based on DocIDs, DocIDs from any given posting list may actually be distributed among a large number of computers. Thus, generally, each computer in the system performs a seek to determine whether it has DocIDs for a term of a given search. Such a seek may include a hard-drive seek to load a list of DocIDs for a given term, which is orders of magnitude slower than accessing solid-state memory.
- These seeks can waste resources because some posting lists are comparatively small, and so DocIDs of small posting lists may not be on a large number of computers, or the number of DocIDs on each computer may be quite small. In such circumstances, the seeks on computers that do not have relevant documents are wasted, or cause a disproportionate waste of time for the amount of data produced. Of course, it may be possible to provide more and more resources for providing search capabilities in a given document collection; however, merely increasing resources can result in wasted money, in the form of capital expenditures, as well as increased maintenance costs and even utility bills. Therefore, it also is desirable to increase performance achievable with a cluster having a given number of computers.
- FIG. 1 illustrates a first exemplary cluster organization 100 (“cluster 100”) that seeks a balance between reducing latency for generation of large posting lists while also reducing unnecessary seeks induced by small posting lists. Cluster 100 contains m computers (illustratively numbered 110 a-110 m), organized into B banks 125 a-125B. Although it may be preferable and/or intuitive that all B banks contain the same number of computers, there is no requirement that this be the case.
- Each computer 110 a-110 m includes a storage resource, for example one or more hard drives, and/or flash drives, or even a virtual or logical partition in a dedicated storage unit, provided the storage unit could appropriately serve data within acceptable latencies to the computer using it as a storage resource. For example, in some cases, such a computer may be a rack-mount server having a RAID hard drive implementation that can be configured for data protection and/or data throughput (e.g., RAID 0, 1, 5, 10, etc.). Such aspects are illustrated in more detail with exemplary computer 110 a, which comprises a processing resource 111, for example a central processing unit that may include a number of independently operable processing cores and other functional resources, a chipset 112, an I/O controller 113, a working memory 114 (e.g., system memory), network connectivity 116, and a storage resource 115, which may be interfaced to the I/O controller 113 using one or more of SATA, SCSI, InfiniBand, Fibre Channel, Ethernet and a PCI-E connection, for example. Typically, such a computer 110 a would not have a dedicated monitor or user interface, but usually would be controlled through a network management system.
- A bank management server 120 can optionally be provided, which can coordinate operation of computers 110 a-110 m in each bank and interface with cluster management server 105. Where a bank-specific server 120 is not provided, a management process for each bank can execute on server 105 or on a designated computer in each bank 125 a-125B. The number of banks (B) can be selected based on measurements of aggregate search throughput and samples of latencies for searches resulting in larger result sets. Thus, the number of banks (B) can be increased to decrease individual search latencies, and decreased to increase aggregate search throughput.
- Cluster 100 can also be distributed geographically such that inter-computer and inter-bank links can be of any distance. For example, these connections may be long-haul fiber connections that carry virtual LAN traffic. On the other hand, different computers within a bank or within a cluster can actually be implemented as portions of a larger computer, in that virtualization allows separate allocation of processing resources and/or storage resources.
- Now, for the purposes of this example, a document collection may have any number of documents, n documents. A document can be assigned a numerical Document Identifier (DocID) that can be any random or pseudorandom string of sufficient length to allow a high probability of distinctness among all DocIDs. Of course, other ways to construct DocIDs are acceptable, so long as an individual document can be identified with its ID.
- Within these documents any number of terms, t terms, may appear. Here, a “term” may refer to a canonical term, which may include, for example, various forms of a given word, such as all tenses of a verb, or a stem for a number of words, or the like. For example, an inverted index for terms is depicted in Table 1, where identifiers for a set of terms appear in a first column and, in the subsequent columns of each row, identifiers for specific documents in which that term appears are listed.
- Table 1 depicts that some terms will have many associated DocIDs in their posting lists while others may have few. Of course, the scale of an actual implementation may be many orders of magnitude larger than this example. In present examples of systems and methods, DocIDs are distributed among servers, and their respective documents can be separately stored in another repository. This architecture can be selected because the size of posting lists for some terms can be so large that simply producing a list of DocIDs within an acceptable latency is sufficiently challenging. However, in other implementations, documents themselves can be stored with their DocIDs.
- TABLE 1

      Terms      Documents
      Term ID1   DocID114    DocID150     . . .   DocID161
      Term ID2   DocID150    DocID100450  . . .
      . . .
      Term IDt   DocID2487   DocID12345   . . .   DocID24322

- A method 200 for distributing DocIDs among a cluster 100 for the example of Table 1 according to a first aspect includes at least logically grouping 205 the m computers of cluster 100 into B banks. The number B of banks can be selected based on a desired balance between latency for larger posting lists and reducing unnecessary seeks for smaller posting lists, as will be explained in further detail below.
- This grouping 205 can include, for example, providing a switch to locally interconnect computers of a given bank, and providing an uplink to a switch that serves all banks of cluster 100. Other ways to group 205 computers of cluster 100 into banks include defining a VLAN for computers of a given bank, and maintaining a table of MAC addresses or IP addresses corresponding to a given bank. Such a table can be maintained by central server 105, for example. In other words, there is at least a logical hierarchy of computers within a bank and banks within cluster 100, but that hierarchy may not map directly into a hierarchy of physical connectivity.
- Once there is at least some logical grouping of computers in cluster 100 into banks, the n DocIDs are distributed 210 among the banks. One way to divide DocIDs among the banks is to perform modulo division on some or all of a hash value derived from a given DocID by the number of banks, and discriminate among the banks based on the remainder of that modulo division.
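The modulo-division distribution of step 210 can be sketched as follows. The hash function and the bank count B=5 are illustrative assumptions; the disclosure only requires that some or all of a hash of the DocID be reduced modulo the number of banks.

```python
import hashlib

NUM_BANKS = 5  # illustrative value of B, an assumption for this sketch

def bank_for_docid(doc_id: str) -> int:
    """Step 210: derive a hash value from the DocID and discriminate among
    banks by the remainder of modulo division (MD5 is an illustrative choice)."""
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
    return h % NUM_BANKS

def distribute_to_banks(doc_ids):
    """Group a collection of DocIDs into the per-bank subsets of step 210."""
    banks = {b: [] for b in range(NUM_BANKS)}
    for d in doc_ids:
        banks[bank_for_docid(d)].append(d)
    return banks
```

Because the hash is deterministic, the same DocID always lands in the same bank, which is what lets queries be resolved without consulting a global placement table.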
- Hence, the configuration of
cluster 100 provides for DocIDs to be distributed among banks ofcluster 100. Then, a determination of what terms appear in the documents grouped into each bank may be undertaken such that a subset of the computers in a given bank have responsibility for producing the portion of that term's posting list in that bank. (i.e., generally a subset of the DocIDs for a term's posting list will be allocated to a given bank by DocID, and then further allocated to computers in that bank term-by-term). In one aspect, responsibility for producing DocIDs in a posting list for a term, which are assigned to a given bank may be assigned to a single computer in that bank. In other aspects, such responsibility may be distributed among the plurality of computers in the bank, for example, two computers may be allocated responsibility for the DocIDs of a given term's posting list within a bank. For convenience, a partial posting list refers to any subset of a set of DocIDs appearing in a posting list. For example, for a term's posting list, partial posting lists can be created for each bank based on DocID allocation. - Generally the configuration of
cluster 100 and allocation of DocIDs and distribution of responsibility for producing DocID results are performed “off-line”, because the documents and the terms indexed in those documents are expected to change much less frequently than a frequency of searches using that index. Thereafter, the “run time” method of searching the index (i.e., identification of documents that contain specified terms) is performed as described inFIG. 3 below. -
FIG. 3 illustrates method 300 for producing DocIDs for documents containing terms included in a search query. A first query is received 305, the query contains one or more terms with the expectation that results relevant to those terms will be returned. The terms of the query are distributed 310 to all banks ofcluster 100. Within each bank ofcluster 100, it is determined 315, which computer of that bank is responsible for producing posting list results for each term of the query. This determination can be performed by an indexing process provided on the optional local management server 120 (FIG. 1 ). In absence oflocal management server 120, thisdetermination 310 may be performed by a search query distribution process inserver 105, which also interfaces with webfront end 175. A further alternative is for each computer 110 a-110 m to store an index of terms for which it has partial posting list results in a main memory, so that access can be rapid, and does not require a hard drive seek. - Each computer responsible for a given term then performs a
lookup 320 to identify DocIDs associated with that given term (e.g., usually, partial posting lists), and which were allocated to that bank. The identified DocIDs may be termed an initial result set, and may undergo preliminary processing to reduce a number of DocIDs returned. For example, each computer can process multiple terms and can intersect the partial posting lists it identified duringlookup 320 to return non-duplicative results. Subsequently, each computer returns 325 identified DocIDs for its terms. The document results may then be received by themanagement server 120 for each bank, if present, and if not present then by management process(es) ofserver 105, which also would be receiving document results from other banks, potentially for the same terms as the document results returned from the bank described above. Management process withinserver 105 may then further process each DocID set to provide a final result set to other functionality used in producing a final search result. - Thus, each bank 125 a-125B of
cluster 100 would generally produce a portion of a posting list for a given term and within each bank only a subset of computers would have performed a seek to determine whether it contained or otherwise was responsible for returning DocIDs in a posting list for that term. This strategy reduces a number of seeks performed by the computers ofcluster 100 while allowing posting list results to be returned by multiple computers in parallel, which reduces latency for large posting lists. - A
second method 400 to distribute DocIDs among the computers ofcluster 100 is explained with respect toFIG. 4 . In themethod 400, the available computers in the cluster are again grouped 401 into banks. The DocIDs for document collection are also distributed 405 among banks ofcluster 100 according to document identifiers (e.g., modulo division on a hash value for each DocID). Themethod 400 also includes differentiating between (or otherwise, determining) 410 for DocIDs distributed to a given bank whether posting lists in which those documents appear are large or small. In other words, afterdistribution 405, determining 410 can include a term-based analysis of whether or not partial posting lists for a respective term have a large number of DocIDs distributed to a given bank. Alternatively, differentiating/determining 410 can be performed prior todistribution 405, such that a posting list for a given term can be judged large or small for the document collection as a whole, rather than for a portion of the document collection allocated to each bank. In such an example, this determination could control treatment of the partial posting lists in each bank for that term. In either case, within each bank, a term-by-term distinction between large versus small posting list is provided. This distinction between large versus small posting list is used to determine distribution of responsibility for producing posting list results within computers of a bank. - Within a given bank, subsets of DocIDs associated with a partial posting list considered large are distributed 415 among a plurality of computers in that bank. In an example, a subset of DocIDs is distributed to each computer of that bank. Alternatively to physically storing only a subset of DocIDs in each computer, DocIDs for the entire posting list can be stored in a plurality of computers and responsibility for producing a given subset of those DocIDs can be allocated to each computer. 
For example, if each computer had sufficient storage capacity for DocIDs of an entire document collection, then the additional effort to segment the DocIDs for that document collection among these computers may not be required, even though latency reduction in producing such documents may be desirable. This may be a practical matter, for example, where a hard drive of a larger size often costs only incrementally more than a hard drive substantially smaller.
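The large-list distribution 415 above, together with the small-list assignment 420 described next, can be sketched as a per-bank placement rule. The size threshold, hash function, and bank width are assumptions for this sketch, since the disclosure leaves the large/small criterion open.

```python
import hashlib

COMPUTERS_PER_BANK = 20   # illustrative bank width, an assumption
LARGE_THRESHOLD = 1000    # assumed cutoff between "large" and "small" lists

def _hash(s: str) -> int:
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)

def place_partial_list(term, partial_list):
    """Within one bank: spread a large partial posting list over the bank's
    computers (step 415), or assign a small one to a single computer chosen
    by term hash (step 420). Returns a {computer: [DocIDs]} placement."""
    if len(partial_list) >= LARGE_THRESHOLD:
        placement = {}
        for doc_id in partial_list:           # large list: one slice per computer
            c = _hash(doc_id) % COMPUTERS_PER_BANK
            placement.setdefault(c, []).append(doc_id)
        return placement
    # small list: one responsible computer, selected by the term
    return {_hash(term) % COMPUTERS_PER_BANK: list(partial_list)}
```

A DocID appearing both in a large and in a small partial posting list can end up on more than one computer of the bank under this rule, matching the duplication the example above describes.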
- For posting lists judged to be small, and within a bank, responsibility for producing DocIDs in that posting list and present in the bank is assigned 420 to fewer computers than for a large posting list. In an example, responsibility is assigned to only one computer of the bank, such that DocIDs for that small posting list present in that bank would be produced only by that one computer. Because any given DocID may be present both in large and in small posting lists, DocIDs may need to be duplicated among the computers of the bank. For example, in Table 1 above, it was illustrated that DocID150 appeared in posting lists for both term 1 and term 2. Now assuming that DocID150 were assigned to bank 125 a, and further assuming that the posting list for term 1 was determined to be large, at least within bank 125 a, then DocID150 may be distributed to computer 110 a, while responsibility for producing DocIDs present in the posting list for term 2 may be assigned to computer 110 b. As such, DocID150 may be duplicated on both computer 110 a and computer 110 b.
- A “run time” method 500 for obtaining DocID results for term-based queries is illustrated in FIG. 5 and described below. In method 500, a query is received 505; such a query can comprise a plurality of terms for which relevant documents are desired. The terms of the query are distributed 510 to each bank, and it is determined 515 whether a partial posting list for each term in each bank is either large or small (determining 515 can also be performed globally for the entire document collection, such that a posting list for a term is either large or small in all banks). Terms with large posting lists are distributed 520 to each computer of the bank. Terms with small posting lists are provided 525 only to the computer(s) which was assigned responsibility for producing documents for that term's partial posting list. The optional step of reshuffling 526 is described below. After each computer has identified documents responsive to all the terms provided to it (e.g., some computers may have searched for documents of multiple partial posting lists, such as large and small partial posting lists, or multiple small posting lists), each computer can merge 530 those identified DocIDs to remove redundant DocIDs (e.g., multiple terms may appear in the same document). The merged results are returned 535 to a management process in server 120 or server 105; if results are not merged, then some redundant results may be returned, which may be acceptable in some implementations.
- The optional reshuffling step 526 may be applied to method 500 in the following circumstances. It was described in the background that it is known to distribute DocIDs for posting lists among computers of a cluster according to a hash value. For example, in a 100-computer cluster, a computer to receive a document can be identified by Modulo(DocID, 100). In other words, it is known to distribute DocIDs listed in a posting list among a plurality of computers, and in such clusters, terms are distributed among all the computers of a cluster and those computers having part of a term's posting list (i.e., having DocIDs in a partial posting list for that term) respond with those DocIDs. In such clusters, each DocID can be said to have an actual home on the computer storing it. In the hybrid cluster described in some embodiments herein, it may be desirable to make the hybrid cluster (e.g., 100 machines organized into 5 banks) appear to higher-level systems as a pure document-based distribution system. To do so, each DocID of a large posting list can have an actual home determined as Bank=DocID DIV 20 and Computer=DocID MOD 5. This arrangement effectively allows the distribution of DocIDs for large posting lists in a hybrid cluster to correspond with how those DocIDs would be distributed in a prior art document-based distribution cluster. However, in a hybrid cluster according to aspects disclosed herein, responsibility for producing results for small posting lists, within a bank, is assigned to select computers (in some examples, only 1 computer). For the above cluster example, DocIDs for a small posting list for a given TermID can be allocated to Bank=DocID DIV 20 and Computer=TermID MOD 5. So, if that computer returned its posting list for that TermID directly, it would be apparent that these results were not produced from a prior art document-based distribution cluster scheme.
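The reshuffle of step 526 can be sketched directly from the arithmetic of the example above (100 machines, 5 banks). DocIDs and TermIDs are treated as small integers purely for illustration, reproducing the DIV/MOD formulas as stated.

```python
def home_for_large(doc_id: int):
    """Actual home of a DocID from a large posting list, per the example:
    Bank = DocID DIV 20, Computer = DocID MOD 5."""
    return doc_id // 20, doc_id % 5

def home_for_small(doc_id: int, term_id: int):
    """Initial placement of a small posting list's DocIDs, per the example:
    Bank = DocID DIV 20, Computer = TermID MOD 5."""
    return doc_id // 20, term_id % 5

def reshuffle_small_list(doc_ids, term_id):
    """Step 526: plan the within-bank moves so that small-list results end up
    where a pure document-based distribution would have put them."""
    return {
        d: (home_for_small(d, term_id), home_for_large(d))
        for d in doc_ids
        if home_for_small(d, term_id) != home_for_large(d)
    }
```

Only DocIDs whose term-assigned computer differs from their document-based home need to move, and the bank component never changes, so the reshuffle stays local to each bank.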
- Therefore, in further aspects, redistribution of partial posting list results within a bank for small posting lists can be undertaken prior to reporting results from a bank for a search. This redistribution may include, for a termID having a small posting list in a bank, sending DocIDs from a computer assigned to produce those postings to a computer that would have had those DocIDs in a document-based cluster scheme. For example, within a given bank, there may be distribution from a computer=TermID MOD 5 to computer=DocID MOD 5.
- From the above disclosures, the following aspects concerning large posting lists in a document collection can be appreciated. First, a portion of DocIDs appearing in a given posting list will be distributed to each of the banks, and such portion can be termed a partial posting list for that term. Within each bank, there can be a 1:1 correspondence between responsibility for producing a defined portion of DocIDs of that partial posting list, and physical storage of those DocIDs.
- From the above disclosures, the following aspects concerning small posting lists can be appreciated. First, a portion of DocIDs appearing in a small posting list may still be distributed among multiple banks, and therefore, each of these banks will have at least one computer responsible for returning results for such a posting list. Within each bank, responsibility for producing DocIDs of a partial small posting list can be distributed to one computer of the bank.
- In sum, in this example, each bank receives a portion of DocIDs of a document collection, generally distributed according to DocID. Then, within a bank, large partial posting lists are distributed among all computers of that bank, and small partial posting lists are each assigned to one computer of that bank.
- This example is prototypical in the sense that posting lists are categorized as either large or small, and distribution according to this categorization is either to all computers in a bank or to one computer. Other examples and implementations may provide more granular categorizations and assignments. For example, a number of degrees of size for a partial posting list (i.e., a portion of a posting list present in a bank) can be established, and the larger a given partial posting list, the more computers within its bank will be assigned to produce DocIDs for it. Conversely, the smaller the partial posting list, the fewer computers in its bank will be assigned to produce postings for it. For example, posting lists for a document collection could be categorized as large/medium/small, or distributions of posting lists could be formed where a first quartile comprising the largest posting lists is distributed to all of a bank's computers, and quartiles of smaller posting lists are distributed to fewer computers within a bank. Where fewer than all computers of a bank store a partial posting list for a given term, the computers having posting list contents relevant to that term preferentially are indexed. Such indexing allows determination at run time of which computers of a bank have data responsive to a given term.
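A graduated size-to-fanout policy of the kind described above can be sketched as below. The tier thresholds and bank size here are hypothetical values chosen for illustration; the patent does not specify concrete numbers.

```python
def computers_for_partial_list(list_len, bank_size=20,
                               large_threshold=1000, medium_threshold=100):
    """Map a partial posting list's size to how many of a bank's computers
    will be assigned to produce DocIDs for it: large lists fan out to the
    whole bank, medium lists to a fraction, small lists to one computer.
    Thresholds are illustrative assumptions."""
    if list_len >= large_threshold:
        return bank_size                # large: all computers of the bank
    if list_len >= medium_threshold:
        return max(1, bank_size // 4)   # medium: a subset of the bank
    return 1                            # small: a single computer

print(computers_for_partial_list(5000))  # → 20
print(computers_for_partial_list(500))   # → 5
print(computers_for_partial_list(10))    # → 1
```

With such a policy, the run-time index mentioned above would record, per term, which subset of computers holds its partial posting list.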
- Also, prototypically, the portions of DocIDs for a document collection distributed to each bank may be approximately equal. However, this approximately equal distribution is an example, and distributions can also be made unequally among banks. For example, one bank may have more computing resources than another bank, or better network connectivity, etc. Such distinctions can be used in determining how to distribute DocIDs of a collection among banks in cluster 100.
- In the above description, computers producing posting list results may first index a table based on a term to identify a list of document identifiers (DocIDs) that correspond to that term. These DocIDs can then be used to identify the respective physical locations where the documents for each DocID are stored. Since document sizes will vary, it may be convenient to provide an index of DocIDs to file locations; alternatively, an existing file system structure can be used such that DocIDs serve as file names, and the file system itself can be used to obtain the document for each DocID.
FIG. 6 illustrates an example dataflow diagram that summarizes aspects described above, for an example query comprising a set of three terms {S1, S2, S3}. The query is received by a management process, and the terms of the query are distributed to Banks 1..b. In some implementations, all terms are distributed to each bank, as illustrated by distribution of the terms {S1, S2, S3} to each bank. Within each bank, it is determined which computers are assigned responsibility for, or otherwise have stored, partial posting lists responsive to each term. In an example, this determination can include determining a computer responsible for producing DocIDs in small posting lists; in this example, terms S2 and S3 were determined to be small. Following these determinations, terms are distributed to responsible computers within each bank; for example, the operation of Bank 2 is as follows.
- In Bank 2, computers produce results as follows: 617 shows R{S2}, 619 shows R{S2 ∪ S3}, and 611 shows that partial results from C2 are sent from the collection point (e.g., a management process) to C1, which are then shown in 620 as being returned with other results R{S2 ∪ P.R. S3} from C1. The operation of C2 and C1 in Bank 2 with respect to results for S3 illustrates a different way to maintain transparency of the origin of results for term searching. Rather than sending, from one computer in a bank to another, partial results that would have been resident on the destination computer in a document-based cluster, a given computer can send all results to a management process (e.g., 619 shows returning a union of results for S2 and S3 from C2), and the management process can identify portions of results that would have come from different computers in the bank and can send those results to those computers (e.g., as shown by 611, where partial results for S3 are sent to C1). In this example, Bank 2 is not producing posting results for S1, which would imply that documents in which S1 appears are distributed to banks other than Bank 2. For posting lists of most practical sizes, this result may be statistically unlikely, but it is nevertheless possible. However, all the terms of a given search query can be distributed to all banks, such that control over posting list distribution in the bank can remain more localized.
- Thus, results from all the banks of the cluster are collected and analyzed (623). Typically, such analysis would further narrow the results based on any of a variety of algorithms, and the results from 623 would then be presented (624). For example, the results can be provided to a user, saved, and/or transmitted. Since the hybrid cluster can provide DocIDs for any type of further use, the particulars of such use need not be described.
- In hybrid indexing clusters according to disclosed aspects, it will be the case that not all computers of a bank will be involved in each search processed by the cluster, or by that bank. Therefore, query scheduling algorithms can be provided based on how terms are allocated within each bank. For example, a bank can determine that two terms of two different queries are assigned to different computers and may schedule those terms for servicing simultaneously.
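The scheduling observation above reduces to a disjointness test over the computer sets assigned to each term; a minimal sketch, with the set-of-computer-indices representation assumed for illustration:

```python
def can_run_concurrently(computers_for_term_a, computers_for_term_b):
    """Two term lookups from different queries can be serviced by a bank
    simultaneously if they touch disjoint sets of the bank's computers."""
    return set(computers_for_term_a).isdisjoint(computers_for_term_b)

# A small-list term on computer 3 and another on computer 1 can overlap;
# a large-list term spanning the whole bank cannot overlap with either.
print(can_run_concurrently({3}, {1}))          # → True
print(can_run_concurrently(range(20), {3}))    # → False
```

A bank-level scheduler could use this test to pack non-conflicting term lookups from its query queue into the same service interval.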
- Many variations and enhancements to the examples and aspects disclosed herein will be apparent to those of ordinary skill in the art in view of these disclosures, and all such variations and enhancements should therefore be considered within the scope of the appended claims and their equivalents.
Claims (21)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US12/098,376 US20090254523A1 (en) | 2008-04-04 | 2008-04-04 | Hybrid term and document-based indexing for search query resolution |
Publications (1)
Publication Number | Publication Date |
---|---|
US20090254523A1 true US20090254523A1 (en) | 2009-10-08 |
Family
ID=41134185
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/098,376 Abandoned US20090254523A1 (en) | 2008-04-04 | 2008-04-04 | Hybrid term and document-based indexing for search query resolution |
Country Status (1)
Country | Link |
---|---|
US (1) | US20090254523A1 (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6957209B1 (en) * | 2000-02-29 | 2005-10-18 | Unisys Corporation | Sizing servers for database management systems via user defined workloads |
US20070250608A1 (en) * | 2001-11-08 | 2007-10-25 | Watt Charles T | System and method for dynamic server allocation and provisioning |
US20070271570A1 (en) * | 2006-05-17 | 2007-11-22 | Brown Douglas P | Managing database utilities to improve throughput and concurrency |
US7330857B1 (en) * | 1999-05-10 | 2008-02-12 | Overture Services, Inc. | Search engine with two-dimensional linearly scalable parallel architecture |
US20090083214A1 (en) * | 2007-09-21 | 2009-03-26 | Microsoft Corporation | Keyword search over heavy-tailed data and multi-keyword queries |
US20090157666A1 (en) * | 2007-12-14 | 2009-06-18 | Fast Search & Transfer As | Method for improving search engine efficiency |
US20090171867A1 (en) * | 2007-12-27 | 2009-07-02 | Microsoft Corporation | Determining quality of tier assignments |
US7693813B1 (en) * | 2007-03-30 | 2010-04-06 | Google Inc. | Index server architecture using tiered and sharded phrase posting lists |
US7702614B1 (en) * | 2007-03-30 | 2010-04-20 | Google Inc. | Index updating using segment swapping |
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110087684A1 (en) * | 2009-10-12 | 2011-04-14 | Flavio Junqueira | Posting list intersection parallelism in query processing |
US8838576B2 (en) * | 2009-10-12 | 2014-09-16 | Yahoo! Inc. | Posting list intersection parallelism in query processing |
US8478704B2 (en) | 2010-11-22 | 2013-07-02 | Microsoft Corporation | Decomposable ranking for efficient precomputing that selects preliminary ranking features comprising static ranking features and dynamic atom-isolated components |
US8620907B2 (en) | 2010-11-22 | 2013-12-31 | Microsoft Corporation | Matching funnel for large document index |
US8713024B2 (en) | 2010-11-22 | 2014-04-29 | Microsoft Corporation | Efficient forward ranking in a search engine |
US10437892B2 (en) | 2010-11-22 | 2019-10-08 | Microsoft Technology Licensing, Llc | Efficient forward ranking in a search engine |
US9529908B2 (en) | 2010-11-22 | 2016-12-27 | Microsoft Technology Licensing, Llc | Tiering of posting lists in search engine index |
US9424351B2 (en) | 2010-11-22 | 2016-08-23 | Microsoft Technology Licensing, Llc | Hybrid-distribution model for search engine indexes |
US9158767B2 (en) | 2011-04-08 | 2015-10-13 | Microsoft Technology Licensing, Llc | Lock-free indexing of documents |
US8990612B2 (en) | 2011-04-08 | 2015-03-24 | Microsoft Technology Licensing, Llc | Recovery of a document serving environment |
US9185163B2 (en) | 2011-04-08 | 2015-11-10 | Microsoft Technology Licensing, Llc | Receiving individual documents to serve |
CN102201007A (en) * | 2011-06-14 | 2011-09-28 | 悠易互通(北京)广告有限公司 | Large-scale data retrieving system |
US20130018891A1 (en) * | 2011-07-13 | 2013-01-17 | International Business Machines Corporation | Real-time search of vertically partitioned, inverted indexes |
US9171062B2 (en) * | 2011-07-13 | 2015-10-27 | International Business Machines Corporation | Real-time search of vertically partitioned, inverted indexes |
US9152697B2 (en) | 2011-07-13 | 2015-10-06 | International Business Machines Corporation | Real-time search of vertically partitioned, inverted indexes |
US20140214882A1 (en) * | 2013-01-28 | 2014-07-31 | International Business Machines Corporation | Segmenting documents within a full text index |
US9135254B2 (en) * | 2013-01-28 | 2015-09-15 | International Business Machines Corporation | Segmenting documents within a full text index |
US9087055B2 (en) * | 2013-01-28 | 2015-07-21 | International Business Machines Corporation | Segmenting documents within a full text index |
US20140372475A1 (en) * | 2013-01-28 | 2014-12-18 | International Business Machines Corporation | Segmenting documents within a full text index |
US9710287B2 (en) * | 2013-03-19 | 2017-07-18 | Pfu Limited | Information processing device, information processing system, and computer readable medium |
US20140289289A1 (en) * | 2013-03-19 | 2014-09-25 | Pfu Limited | Information processing device, information processing system, and computer readable medium |
WO2015004607A3 (en) * | 2013-07-08 | 2015-04-09 | Yandex Europe Ag | Computer-implemented method of and system for searching an inverted index having a plurality of posting lists |
US10430448B2 (en) | 2013-07-08 | 2019-10-01 | Yandex Europe Ag | Computer-implemented method of and system for searching an inverted index having a plurality of posting lists |
RU2718435C2 (en) * | 2013-07-08 | 2020-04-02 | Общество С Ограниченной Ответственностью "Яндекс" | Computer-executable method and system for searching in inverted index having plurality of wordpositions lists |
US20160275178A1 (en) * | 2013-11-29 | 2016-09-22 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for search |
US10452691B2 (en) * | 2013-11-29 | 2019-10-22 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for generating search results using inverted index |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20090254523A1 (en) | Hybrid term and document-based indexing for search query resolution | |
US7899851B2 (en) | Indexing method of database management system | |
US7370029B2 (en) | Method of changing system configuration in shared-nothing database management system | |
US9740706B2 (en) | Management of intermediate data spills during the shuffle phase of a map-reduce job | |
CN102831120B (en) | A kind of data processing method and system | |
CN101604337B (en) | Apparatus and method for hash table storage, searching | |
JP5765416B2 (en) | Distributed storage system and method | |
CN101354726B (en) | A memory metadata management method for a cluster file system | |
US20160350302A1 (en) | Dynamically splitting a range of a node in a distributed hash table | |
CN104657459A (en) | Massive data storage method based on file granularity | |
Shalita et al. | Social hash: an assignment framework for optimizing distributed systems operations on social networks | |
JPH1196190A (en) | Method for processing record from data base | |
WO2011137189A1 (en) | System and methods for mapping and searching objects in multidimensional space | |
CN1602480A (en) | Managing storage resources attached to a data network | |
US20170235809A1 (en) | System and method of using replication for additional semantically defined partitioning | |
WO2022126839A1 (en) | Cloud computing-based adaptive storage hierarchy system and method | |
US20100017429A1 (en) | Method and apparatus of distributing data in partioned databases operating on a shared-nothing architecture | |
US8819017B2 (en) | Affinitizing datasets based on efficient query processing | |
JP5464017B2 (en) | Distributed memory database system, database server, data processing method and program thereof | |
US20130013824A1 (en) | Parallel aggregation system | |
JP3777872B2 (en) | Query parallel processing system and machine-readable recording medium recording program | |
Srinivasan et al. | Techniques and Efficiencies from Building a Real-Time DBMS | |
CN111274259A (en) | Data updating method for storage nodes in distributed storage system | |
WO2022267508A1 (en) | Metadata compression method and apparatus | |
Ding et al. | Indexing strategies for graceful degradation of search quality |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: YAHOO! INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:LANG, KEVIN;LIM, SWEE;CHANG, CHOONGSOON;REEL/FRAME:020760/0521 Effective date: 20080404 |
|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |
|
AS | Assignment |
Owner name: YAHOO HOLDINGS, INC., CALIFORNIA Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:YAHOO! INC.;REEL/FRAME:042963/0211 Effective date: 20170613 |
|
AS | Assignment |
Owner name: OATH INC., NEW YORK Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:YAHOO HOLDINGS, INC.;REEL/FRAME:045240/0310 Effective date: 20171231 |