GB2366008A - Document selection - Google Patents
- Publication number
- GB2366008A GB0005092A
- Authority
- GB
- United Kingdom
- Prior art keywords
- rule
- category
- score
- combination
- field
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9532—Query formulation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A computer system for providing an on-line user conducting a search on a broad topic or category with a more sensible result set of documents derived from a set of easily definable specific categories. The system has a recursive hierarchical structure allowing the same scoring logic to be used at any level. A categorisation engine classifies the importance of specific categories for particular documents and these results are collated and indexed by a search engine. At run-time, a user's search on a broad category is translated into an equivalent search on at least one associated more specific category and typically a Boolean combination of a plurality of more specific categories. This allows the system to populate broad categories with related documents from more specific categories.
Description
DOCUMENT SELECTION
The invention relates to document selection, particularly to a method of using a computer system to categorise information using a set of rule-bases and to a document selection system.
The constantly developing World Wide Web has dramatically revolutionised communications and is able to present an extremely large amount of information to online users. Smaller intra-company computer networks, i.e. intranets, also provide a common pool of information relating to the company and are readily accessible to the company's connected employees. Intranets are used to promote knowledge-sharing in any organisation. It is possible for all the various departments that constitute an organisation to access one another's data. The problem with any network, whether the Internet or a smaller intranet, is that it is constantly growing and users can be flooded with information. One of the biggest challenges in the information age is being able to disseminate the overwhelming amount of information that is available. It is necessary to be able to access the required information as rapidly and efficiently as possible. At present, search engines provide the gateway to finding information on the Internet. Various methods of searching are used. In one method the online user inputs a number of key words that are felt to best describe the topic he wishes to obtain information on. However, due to the massive number of results that may be returned by the search engine, the user wastes a great deal of time reviewing irrelevant documents before arriving at something useful. The user is often encouraged to conduct a refined search, whereupon he may have to give some thought to redoing the search by inputting Boolean expressions or other key word combinations.
There do exist search engines which present a more intuitive user interface by presenting the results in a categorised format as opposed to a list of all the articles matching the input criteria. The idea is that the user is then able to have a reasonable overview of the information available and may drill down into the most suitable category and obtain the required information. An example of such a search engine is Yahoo™.
One problem which needs to be overcome for any search engine to operate efficiently is to sensibly populate search categories.
One of the problems with traditional categorisation systems is building sensible rules for broad categories. It is relatively straightforward to create rules for a narrow category using relevant buzzwords. However, for broader categories having broad rules it is difficult to know where to begin and the documents displayed to the user are often bland. Other categorisation systems may not define broad rules at all but will do something worse: include in the broad categories every document that occurs in any sub-category.
It is an aim of the present invention to provide a system and method of selecting documents which does not require rules to be written for so-called "broad categories".
According to one aspect of the present invention there is provided a system for selecting desired documents from a number of documents, the system comprising: means for comparing textual content of each document with a plurality of rule sets, each rule set defining a category and comprising a plurality of rules each constituted by at least one alphanumeric string, said comparing means being organised to trigger a rule when the alphanumeric string of that rule is located in the textual part, wherein each rule is associated with a weighting factor; means for generating a category score representing an importance weighting for each category based on the weighting factors for the rules which
triggered in the rule set defining the category; means for indexing each document in association with its score for each category in which it is selected; and means for translating a search parameter defined by a user and representing a broad category into a set of associated categories and returning said desired documents to the user based on the category scores of said associated categories.
Another aspect of the invention provides a method for selecting desired documents from a number of documents comprising the following steps: comparing textual content of each document with a plurality of rule sets, each rule set defining a category and comprising a plurality of rules each constituted by at least one alphanumeric string, wherein a rule is triggered when the alphanumeric string of that rule is located in the textual part, each rule being associated with a weighting factor; generating a category score representing an importance weighting for each category based on the weighting factors for the rules which triggered in the rule set defining the category; indexing each document in association with its score for each category in which it is selected; and translating a search parameter defined by a user and representing a broad category into a set of associated categories and returning said desired documents to the user based on the category scores of said associated categories.
A further aspect of the invention provides a method of populating a broad category with document data comprising the steps of: comparing the textual content of each of a plurality of documents with a plurality of rule sets, each rule set defining a category and comprising a plurality of rules each constituted by at least one alphanumeric string, wherein a rule is triggered when the alphanumeric string of that rule is located in the textual part, each rule being associated with a weighting factor; generating a category score representing an importance weighting for each category based on the weighting factors for the rules which triggered in the rule set defining the category; defining said broad category as a
set of associated categories, and including in said broad category document data defining each document selected in the associated categories.
Thus, with the present invention category scores are provided in association with each indexed document. Specific associated categories, whose rule bases are easier to define, hold documents with associated category scores. Thus, broad categories can be populated by the documents in associated categories, preferably based on category scores. In particular, category scores can be provided indicating a high importance, medium importance or low importance of a document in a particular category. In that scenario, broad categories can be populated only by those documents having a high importance in the specific categories which are associated to define the broad category.
The invention can be implemented in any suitable search environment. In particular, it is suitable for implementation on the Web, where a navigational tool can be implemented on a Web browser. This can be an ordinary Web browser, or a wireless Web browser.
The present invention will now be described by way of an example with reference to the accompanying drawings, in which:
Figure 1 shows the architecture of the present invention;
Figure 2 shows the hierarchical structure of the examples of the rule-bases in text format for six combinations;
Figure 3 shows an example of the RETAIL FINANCIAL SERVICES category;
Figures 4a-f show screenshots of rule-bases in text format for six combinations;
Figure 5 shows how category KEY:VALUE pairs are created by the categorisation engine;
Figure 6 shows how the system defines a broad category such as FILM by other more specific categories;
Figure 7a shows an example of the GUI from a user's perspective.
Figure 1 illustrates the architecture of an indexing and searching system. To categorise documents, a categorisation engine is used with a plurality of rule-bases 6 in binary format. Each rule-base is in the form of a database (for example Excel™) stored in RAM. The block marked 6 in figure 1 represents a number of rule-bases in a binary format. The blocks marked 2 and 4 denote the rule-bases in a text format and an intermediate format, discussed later. These are interim stages to the rule-bases in binary format 6. A search engine 14 receives and collates the results from the categorisation engine 10. The search engine 14 is responsible for indexing the documents. A navigation hierarchy 16 allows a user to interrogate the search engine to call up the documents he is interested in. In Figure 1, reference numeral 8 denotes a document which, of course, does not form part of the architecture itself.
The rule-bases are created by an information scientist in such a manner as to suit the requirements of a particular organisation. For example, the creator of the rule-bases for the intranet of a particular company would tailor and/or weight the rules in such a manner that would be relevant to the company's interests. More generally, construction of the rule-bases depends on the nature of the information to be searched and the expected scope of searches.
An information scientist can create the rule-bases in a simple textual manner without requiring any computer-language programming skills. This is done in the text format 2 of the rule-bases. It can be converted into a binary format 6 for storage and automated comparison purposes via an intermediate format 4 such as XML.
Before describing the rule-bases, it is important to understand the basic hierarchical structure governing the indexing of documents. Figure 2 provides a basic overview of the hierarchy. At the highest level is a category 20. A category 20 is the broadest heading into which documents are grouped. Each category contains multiple "sub-categories" that shall be referred to herein as combinations 22. As will be described later, each combination 22 is defined by a set of rules 24. However, it needs to be understood that the system has a recursive structure. Therefore, in practice, there may be combinations occurring under combinations. So a sub-combination will combine its scores up to a parent combination and then that score will be combined with other scores at that level and so on up to the top category level. Rules are the base items with no children. As a result of the recursive structure, the same statistical algorithm and scoring logic may be applied at any level, as will be described later. A summary of the more complete structure is shown below:
Category -> Rule OR Combination
Combination -> Combination OR Rule
Rule -> nothing
An example relating to the banking sector is used to describe a particular embodiment of the present invention. It should be understood that this is a non-limiting example and that the system described herein may be applied to any topic so desired by a person skilled in the art.
Figure 3 shows the RETAIL FINANCIAL SERVICES category, which contains six combinations; i.e. GENERAL, TITLE, PRODUCTS, COMPETITORS, CUSTOMERS and ALLIANCES. The rules, combinations and categories are all uniquely identified by a number.
Figures 4a-f provide examples of the text format 2 of the rule-bases. Each of these figures represents a screen shot for a particular combination indicated by the combination indicator 30. For example, figure 4c illustrates the PRODUCTS combination. The spreadsheet format, in this case Excel™, allows the information scientist to edit the rule-bases by creating rules and input criteria for their associated fields. The text format for the rule-bases comprises a predetermined set of fields as follows:
ID, field 32 - This is a number that uniquely identifies each rule within a combination.
RULES, field 34 - This is the rule itself, which the document is fired against. It comprises free-text which may be a single word (for example, see Figure 4a, rules 10, 17, 18 and 20), a phrase (for example rule 19) or a Boolean combination of words (for example rules 1, 2, 5 to 9 etc.).
FIELD, field 36 - This stipulates the location in the document where the text defined in a rule may occur for it to trigger. If "TITLE" is input into this field then the criterion is that the relevant rule must be matched within the title of the document in question. Alternatively, by inserting "ANY", the rule is triggered if it is matched anywhere within the document.
SCOPE, field 38 - This stipulates the criteria for the proximity of free-text portions defined in a Boolean combination. For example, under the PRODUCTS combination rule number 5 is "barclays AND internet". In this case, "sentence" is inserted in the scope field. Therefore, for the rule to trigger, both the words "barclays" and "internet" need to appear within the same sentence. Alternatively, if "paragraph" was inserted then the words could appear anywhere within the same paragraph for the rule to trigger.
DATA, field 40 - In this field, the information scientist inserts the data that would be readable or output to the user and which defines the rule in a humanly understandable manner. Different rules can have the same data field and in fact this is very useful in searching, as will be discussed later.
SCORE, field 42 - This is a weighting indicator as to the importance of a rule within a particular combination. It is a reflection of the interest that a rule has within a combination. In the current example, the score has a maximum value of 100. A score of 0 is possible.
Annex 1 shows an example of the rule format 4 for the RETAIL FINANCIAL SERVICES category using the intermediate (XML) format.
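The way the FIELD and SCOPE fields restrict triggering can be sketched in Python as follows. This is only an illustrative sketch, not the binary comparison performed by the categorisation engine: the `Rule` class, the `trigger_count` function, the case-insensitive substring matching and the regular expressions used to split sentences and paragraphs are all assumptions made for the example, and only Boolean AND rules are modelled.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    """Hypothetical in-memory form of one row of the text-format rule-base."""
    rule_id: int
    terms: List[str]      # all terms must be present (Boolean AND)
    field: str = "ANY"    # "ANY" or "TITLE"
    scope: str = ""       # "", "sentence" or "paragraph"
    data: str = ""
    score: int = 0


def _units(text: str, scope: str) -> List[str]:
    """Split text into the units within which all terms must co-occur."""
    if scope == "sentence":
        return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    if scope == "paragraph":
        return [p for p in re.split(r"\n\s*\n", text) if p]
    return [text]


def trigger_count(rule: Rule, title: str, body: str) -> int:
    """Count the document parts (title, body) in which the rule triggers."""
    parts = [title] if rule.field.upper() == "TITLE" else [title, body]
    count = 0
    for part in parts:
        for unit in _units(part, rule.scope):
            lowered = unit.lower()
            if all(term.lower() in lowered for term in rule.terms):
                count += 1
                break  # one trigger per document part in this sketch
    return count
```

With FIELD set to "ANY", a single-word rule such as "barclaycard" that occurs in both the title and the body is counted twice here, while the same rule with FIELD set to "TITLE" is counted once.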
Each combination in the category RETAIL FINANCIAL SERVICES is designed as: <combination name="[name]" score="100" data="" id="RETAIL FINANCIAL SERVICES/[name]"> where [name] is the combination indicator 30. Each rule is defined with the following structure: <rule id="$C$[#]" field="[insert]" score="[insert]">[rule name]</rule> where [#] is a unique rule number and the field and score are inserted as decided with reference to the rule-bases of figures 4a, b, c, d, e and f.
Categorisation Engine
The categorisation engine 10 takes a particular document 8 and fires the text contained within such a document at rule-bases 6. The rule-bases are translated from an intermediate format 4 into a binary format 6. The binary format 6 is the
format that is presented as an input 7 to the categorisation engine 10. The other input to the engine 10 is the document 8 to be categorised. The categorisation engine compares the binary format of the rule-bases to that of the document and determines which of the rules are triggered. At this binary level, each byte pattern that constitutes the input text of the document (in ASCII or some other suitable encoding) is compared against the byte patterns of the free-text defined in each RULE field 34. For each rule ID, if there is a match according to any Boolean logical combination defined in the rule, and subject to any restrictions in the FIELD or SCOPE fields of the rule, then that particular rule is triggered. The score associated with that rule is stored against the rule ID in a designated storage area. Each rule is checked against the document so that a list of scores is built up for each combination heading 30. The text is fired against all the rule bases simultaneously. For Boolean logic rules, a `hit' on part of the free-text defined in the rule leaves that part to be held in a waiting zone until the other part or parts are located in the document, in which case the rule triggers. If not, the waiting zones are simply emptied when the end of the document is reached. After the document has been fired against the rule-bases, a statistical algorithm is performed on the scores, which constitute each of the relevant combination lists. The first step is to arrange the scores that constitute each combination list into descending order. Thus, the highest score is placed first on the list and would be saved in a storage location as a variable A. The statistical algorithm then takes each subsequent score X and applies the algorithm:
ADD = X × (100 − A) / 100
The result of this mathematical manipulation is then stored in another storage location. The contents of the two storage locations are added and the result is stored in an accumulator. The process is iterated until all the scores in the list have been manipulated by the algorithm, wherein after the first iteration has been performed each subsequent iteration uses the value stored in the accumulator register for the previous iteration as the A variable in the algorithm. Once all the iterations have been performed, a total score for each combination list is calculated and stored. It is then possible to arrange all of the combination total scores into a separate list in descending order. The same statistical algorithm is applied at this categorisation level, producing a category total score reflecting the relevance of a particular category for the document.
Table 1 shows an example of how the statistical algorithm would be carried out on a list comprising the five scores 50, 30, 10, 10 and 5. For the first iteration, the values A=50 and X=30 are used to calculate a value of 15. The first value 50 is then added to the calculated 15, giving an accumulated total of 65. For the second iteration, A=65 and X=10, where the algorithm produces a result of 3.5, which is accumulated to give a value of 68.5 in the accumulator. The process is iterated until all the scores have been manipulated. The final result in the accumulator (i.e. 73.0675) is truncated to supply a total score of 73 for the list in question. In summary, the statistical algorithm may be applied at the combination level as well as at the category level.
SCORE    ADD       ACCUMULATOR
50       15        65
30       3.5       68.5
10       3.15      71.65
10       1.4175    73.0675
5        0         73.0675
Table 1
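The iteration described above can be sketched in Python as follows. The increment X × (100 − A) / 100 used here is an inference: it is the expression that reproduces every worked value quoted for Table 1 (15, 3.5, 3.15, 1.4175), and the maximum score of 100 is taken from the SCORE field description. The function names `combine_scores` and `total_score` are hypothetical.

```python
import math


def combine_scores(scores, maximum=100.0):
    """Combine a list of triggered scores into one accumulated score.

    The scores are sorted into descending order; the highest score seeds
    the accumulator A, and each later score X adds X * (maximum - A) / maximum.
    """
    ordered = sorted(scores, reverse=True)
    if not ordered:
        return 0.0
    accumulator = float(ordered[0])
    for x in ordered[1:]:
        accumulator += x * (maximum - accumulator) / maximum
    return accumulator


def total_score(scores):
    """Truncate the accumulated score, as done for the list totals."""
    return math.trunc(combine_scores(scores))


print(total_score([50, 30, 10, 10, 5]))  # the Table 1 list
```

Because each increment is scaled by the headroom left below the maximum, the accumulated score never exceeds 100 and repeated low scores add progressively less.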
Once the algorithm has been applied at the category level, the category total is then analysed and classified as falling into a range based on defined threshold values. An example of the classification of the ranges is shown in Table 2. A total score greater than 60 falls into the HIGH range, while scores of 40-60 and 20-40 indicate the MEDIUM and LOW ranges respectively. A score below 20 is discarded.
Threshold Total    Range/Value
> 60               High
40-60              Medium
20-40              Low
< 20               Discard
Table 2
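The look-up of Table 2 might be sketched in Python as below. The table does not say which range a score of exactly 40 or exactly 20 belongs to, so the boundary handling (40 counted as MEDIUM, 20 counted as LOW) is an assumption, as is the function name `classify`.

```python
def classify(total):
    """Map a truncated category total onto the ranges of Table 2.

    Returns None for totals that would be discarded.
    Boundary values 40 and 20 are assigned upward by assumption.
    """
    if total > 60:
        return "HIGH"
    if total >= 40:
        return "MEDIUM"
    if total >= 20:
        return "LOW"
    return None
```

Under these assumptions a total of 73 is classified as HIGH and a total of 9 is discarded.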
Figure 5 illustrates more clearly the operation of the categorisation engine as generally described above. A particular document 8 is shown to consist of a title 50 and a body of text 52. The same document is used in Annexe 2, which indicates the output of the categorisation engine as will now be described. The rule-bases are compared against the document and certain rules are triggered. This figure shows how the rules are triggered for two categories: RETAIL FINANCIAL SERVICES 54 and BARCLAYS 55. In Annexe 2, additional categories are exemplified. Firstly, for the RETAIL FINANCIAL SERVICES 54 category, it is evident that rules have triggered under three combinations, i.e. GENERAL 56, TITLE 62 and PRODUCTS 66. For the GENERAL combination the rule "Barclaycard" has triggered twice. The criterion in the FIELD field 36 was "ANY" (see figure 4a) and thus the rule triggered once because it was found in the title 50 and a second time because it occurred in the text 52. Each rule has a weighting of 20 and the scores are placed in designated storage areas 74 and 76. Similarly, the PRODUCTS combination (see figure 4c) has "ANY" in the FIELD field 36 and thus the rule triggers twice. However, for this combination each rule has a
weighting of 5, whose scores are stored in areas 84 and 86. The TITLE combination (see figure 4b) has "title" in the FIELD field and therefore only triggers once, i.e. in the title 50 of the document. The rule under this combination is given a weighting of 50 and is stored in a storage area 80.
In this manner a list of scores is built up under each combination depending on the corresponding rules that trigger. In the case of the GENERAL combination, the scores stored in areas 74 and 76 are used by the statistical algorithm 88 to produce a total combination score in storage area 72. Similarly, for the TITLE and PRODUCTS combinations, the results from the algorithm are stored in areas 78 and 82 respectively.
It will readily be appreciated that in practice the storage areas may be implemented as addressable memory locations in random access memory (RAM) or in any other suitable manner.
Moreover, the algorithm can be implemented by suitable logic or by a suitably programmed processor.
The iterations through the algorithms are shown below:
For the GENERAL combination:
SCORE    ADD     ACCUMULATOR
20       16      36
20       0       36
For the TITLE combination:
SCORE    ADD     ACCUMULATOR
50       0       50
For the PRODUCTS combination:
SCORE    ADD     ACCUMULATOR
5        4.75    9.75
5        0       9.75
The final accumulator score for each combination is truncated as 36, 50 and 9 and is stored in locations 72, 78 and 82 respectively.
Now, the total combination scores stored in 72, 78 and 82 constitute a new input list 79 to the algorithm 88, which produces a truncated RETAIL FINANCIAL SERVICES category score of 70. The iterations of the algorithm are shown below:
SCORE    ADD     ACCUMULATOR
50       18      68
36       2.88    70.88
9        0       70.88
The score for the RETAIL FINANCIAL SERVICES category is fed 96 into a look-up table 97 (see Table 2), which classifies the overall score of 70 as fitting into the HIGH range. Each category is associated with a unique key 94. The value 98 is combined with the key 94, which uniquely identifies each category, to produce a KEY:VALUE pair 99. For example, for RETAIL FINANCIAL SERVICES the KEY:VALUE pair is HIGH:9120.
A similar method of obtaining the value for the BARCLAYS category 55 would be used. The document is fired against the rule-bases and it is found that two rules under the TITLE combination 57 triggered. In this case, both rules under the TITLE combination are triggered within the title 50 of the document. The rule 59 triggers because "barclays" appears in the title and has a weighting of 25 that is stored in area 65. The rule 61 is a Boolean combination, where the words "barclays" AND "credit" must appear in the title 50. Rule 61 has a weighting of 50 that is stored in area 67. The statistical algorithm 88 operates on the scores and produces a truncated value of 62 that is stored in area 63. The iterations are shown below:
SCORE    ADD     ACCUMULATOR
50       12.5    62.5
25       0       62.5
For the BARCLAYS category 55, there is only one combination and therefore the statistical algorithm produces an overall score of 62 that is input 96 to the look-up table 97. The value of 62 fits into the HIGH range and this value 98 forms part of the KEY:VALUE pair 99 relating to the BARCLAYS category for this document: HIGH:6688. Therefore, for a particular document, the categorisation engine will produce a set of KEY:VALUE pairs 99 (one for each category) that are supplied to the search engine. In the example of figure 5, the creation of two KEY:VALUE pairs, i.e. HIGH:9120 and HIGH:6688, was shown. Therefore if a user wishes to access information that is a lot about RETAIL FINANCIAL SERVICES or a lot about BARCLAYS, then the search engine will present a list of documents that includes the document of the example. In fact, a synopsis of all the KEY:VALUE pairs for the example is shown in Table 3. It should be noted that the idea behind the statistical algorithm is that if a document triggers a lot of rules in one combination, it will not necessarily be sufficient to push the document over the normal categorisation threshold. Categorisation evidence is stronger if the evidence comes from a variety of combinations.
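The Figure 5 example can be traced end to end with a short Python sketch. It relies on the same assumptions as the description's worked tables suggest: an increment of X × (100 − A) / 100, truncation at every level, and Table 2 boundaries of 40 and 20 assigned upward. Representing a category as nested lists of triggered rule scores, and the names `combine`, `node_score` and `key_value`, are hypothetical choices for illustration.

```python
import math


def combine(scores):
    """Accumulate scores in descending order; each adds X * (100 - A) / 100."""
    accumulator = 0.0
    for x in sorted(scores, reverse=True):
        accumulator += x * (100.0 - accumulator) / 100.0
    return accumulator


def node_score(node):
    """A node is a triggered rule score (number) or a list of child nodes.

    The same logic is applied recursively at combination and category level.
    """
    if isinstance(node, (int, float)):
        return node
    return math.trunc(combine([node_score(child) for child in node]))


def key_value(category_key, tree):
    """Return the range and category key joined by a colon, or None if discarded."""
    total = node_score(tree)
    if total > 60:
        label = "HIGH"
    elif total >= 40:
        label = "MEDIUM"
    elif total >= 20:
        label = "LOW"
    else:
        return None
    return f"{label}:{category_key}"


# Triggered scores from Figure 5: GENERAL, TITLE and PRODUCTS combinations.
retail_financial_services = [[20, 20], [50], [5, 5]]
# BARCLAYS has a single TITLE combination with two triggered rules.
barclays = [[50, 25]]

print(key_value(9120, retail_financial_services))
print(key_value(6688, barclays))
```

Because a combination may itself contain combinations, the nested-list form extends to deeper hierarchies without changing the scoring function.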
CATEGORY                       VALUE    KEY
Barclays                       high
Retail Financial Services      high
Credit Cards                   high
Finance                        low
Currency
Interest Rates                 low
Bank of England                         1824
rm                                      18848
Inflation
Chemicals and Gases            low
Telecomms                               30400
Hardware and Technology                 34656
et Cable e ecoms - - ,Competition       32224
Table 3
Table 3 shows that the document is a lot about BARCLAYS, RETAIL FINANCIAL SERVICES and CREDIT CARDS and a little about FINANCE, INTEREST RATES and CHEMICALS AND GASES. Although the remaining categories had combinations and rules that triggered, their overall total category scores were lower than the threshold of 20 (Table 2) and they were thus regarded as being insignificant and discarded. This is confirmed by the fact that no range value is submitted for these categories to the search engine.
In the RETAIL FINANCIAL SERVICES category, it can be seen that the "barclaycard" rule of the PRODUCTS combination has triggered twice, once because it occurs in the title and once because it occurs in the text. This is because the FIELD specification in the rule-bases was "ANY". Therefore, although the rule has a weighted score of 5 points for each time it triggers the statistical algorithm provides a total score of 9 for the PRODUCTS combination. Alternatively, a single rule, subject to certain conditions, may trigger multiple times within a document and the score for the rule is incremented under each
combination. For example, in the document 8 exemplified in Annex 2 there are multiple references to the word "barclaycard". To detect this, a new rule (as shown below) can be defined under the PRODUCTS combination. In this case, the total combinational score would be increased. This is due to the list now containing multiple scores of the rule weighting, where the exact number of these scores corresponds to the number of times "barclaycard" is found in the document.
Id    Rules                    Field    Scope    Data            Score
11    MULTIPLE(barclaycard)    ANY               Barclay Card
Broad Categories
A so-called Broad Category in this connection is a category which is difficult to populate using rule-bases per se. Consider an example of a broad category such as FILM or COMPUTERS. The list of possible rules and combinations to adequately define such a category would be enormous and similarly the number of documents returned would be just as large. In addition, the rule-base creator would need to spend a great deal of time creating the rules. Moreover, it would be impossible to distinguish easily between documents about FILM in the sense of movies and FILM in the sense of cameras etc.
The system described herein simplifies the handling of such broad categories by providing in the navigation element 16 an abstract specification of all associated categories. Each broad category is defined by a selection of these associated categories that are more specific and have easily definable rule-bases.
Figure 6 shows an overall diagram of how the system translates a search for a broad category such as FILM. The user enters the search criteria FILM at the user input tool 100 of the Navigation Hierarchy element. This element contains a translation tool 110 that converts the user's selection, in this case FILM, into an
equivalent search of more specific associated categories. In practice, the search request is translated into a Boolean expression of a plurality of specific categories. It should be realised that the exact definition would be flexible for each broad category and could be edited by an information scientist at any time. For example, for the broad category FILM, appropriate specific categories may include HOLLYWOOD BLOCKBUSTERS (ID=9876), ORSON WELLES FILMS (ID=1045), INDIAN CINEMA (ID=10789), FRENCH CINEMA (ID=978) and WESTERNS (ID=1002). These specific categories are defined by their relevant rule-bases, which are simpler to create.
Consider now how this concept interacts with the categorisation scheme defined above. Consider a certain document that has been manipulated by the categorisation engine 10 and found to be a lot about FRENCH CINEMA (ID=978) and WESTERNS (ID=1002) and a little about ENVIRONMENT (ID=2012). Therefore, for the document in question the following KEY:VALUE pairs are added to the search engine 14: "HIGH:978", "HIGH:1002" and "LOW:2012". The search engine indexes 106 hold these three KEY:VALUE pairs in association with that document. All other documents manipulated by the categorisation engine are held in the indexes 106 in association with their KEY:VALUE pairs.
At run-time, the user performs a search of the system using the navigation hierarchy element 16. This presents a GUI (Graphical User Interface) with a display as in the example shown in the screen shot of Figure 7a. A Directory 200 presents a number of broad categories which the user can select, including FILM. In the example, it is assumed that an information scientist has defined the FILM category 100 using the translation tool 110 as: FILM=HIGH(9876 OR 1045 OR 10789 OR 978 OR 1002)
Therefore, if a user selects the broad FILM category by clicking on the displayed word FILM in the directory 200, this is translated into an equivalent search for all documents held in the indexes 106 in the search engine 14 that have been classified by the categorisation engine 10 as being a lot about HOLLYWOOD BLOCKBUSTERS or ORSON WELLES FILMS or INDIAN CINEMA or FRENCH CINEMA or WESTERNS. Thus broad categories are defined by taking the best results (i.e. a HIGH ranking) from other more specific categories that are well defined. The example is intended to be non-limiting and it may be decided by the information scientist that this definition for a particular broad category is not optimum. For example, the resulting document set may be too small, in which case the definition may be changed by including documents of LOW and MEDIUM value, or it may be decided that other specific categories are to be added or removed, or the Boolean logical operators comprising the definition can be changed. The translated search request 110 is sent to the search engine 14 and correlated against the indexes 106. If it is found that a particular document meets the search criteria, the document forms part of the set that is returned to the user. In particular, the document of the example will be one the documents amongst the result set returned to the user, where the two KEYVALUE pairs "HIGH:978" and "HIGH:1002" meet the translated search criterion. It should be realised that the banking and film examples presented in the description are intended to be non-limiting and it should be appreciated that the rule-bases would be adapted according to the nature of the information. It should also be realised that the certain formats described in the example, i.e. XML for the intermediate format and Excel for the textual format, are non-limiting and may be replaced by other suitable tools known to a person skilled in the art.
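The translation of a broad category into a search over specific categories can be illustrated with a small Python sketch. The in-memory dictionary standing in for the indexes 106, the document names, and the functions `translate` and `search` are all hypothetical; only the FILM definition and the KEY:VALUE pairs of the first document come from the example above, and only an OR of categories at a single range is modelled.

```python
# Broad category definitions: a range plus the IDs of the specific categories.
# FILM = HIGH(9876 OR 1045 OR 10789 OR 978 OR 1002), as in the example.
BROAD_CATEGORIES = {
    "FILM": ("HIGH", [9876, 1045, 10789, 978, 1002]),
}

# Hypothetical stand-in for the search engine indexes: document -> pairs.
INDEX = {
    "doc-A": {"HIGH:978", "HIGH:1002", "LOW:2012"},   # the example document
    "doc-B": {"LOW:978"},                             # only a little about French cinema
    "doc-C": {"HIGH:9876", "MEDIUM:1002"},
}


def translate(broad_category):
    """Expand a broad category into the set of pairs that satisfy it."""
    level, category_ids = BROAD_CATEGORIES[broad_category]
    return {f"{level}:{category_id}" for category_id in category_ids}


def search(broad_category, index):
    """Return documents holding at least one of the translated pairs (OR)."""
    wanted = translate(broad_category)
    return sorted(doc for doc, pairs in index.items() if pairs & wanted)


print(search("FILM", INDEX))
```

Widening the definition to include MEDIUM or LOW documents, as the description allows, would amount to adding further pairs to the translated set.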
Rule Input Format to Category Engine.
<xml>
  <category name="Retail Financial Services" id="\barclays\Retail Financial Services\Retail Financial Services.xls" data="">
    <combination name="general" score="100" data="" id="Retail Financial Services/general">
      <rule id="$C$7" field="ANY" score="10">barclays AND nationwide</rule>
      <rule id="$C$8" field="ANY" score="10">barclays AND cash</rule>
      <rule id="$C$9" field="ANY" score="10">barclays AND it</rule>
      <rule id="$C$10" field="ANY" score="20">barclayloan/-/'s</rule>
      <rule id="$C$11" field="ANY" score="10">barclays AND internet</rule>
      <rule id="$C$12" field="ANY" score="10">barclays AND mortgage</rule>
      <rule id="$C$13" field="ANY" score="10">barclays AND student</rule>
      <rule id="$C$14" field="ANY" score="10">barclays AND charges</rule>
      <rule id="$C$15" field="ANY" score="10">barclays AND holiday/-/s/makers</rule>
      <rule id="$C$16" field="ANY" score="20">barclaycard</rule>
      <rule id="$C$17" field="ANY" score="10">barclays AND pension</rule>
      <rule id="$C$18" field="ANY" score="10">barclays AND small business/-/es</rule>
      <rule id="$C$19" field="ANY" score="10">barclays AND graduate</rule>
      <rule id="$C$20" field="ANY" score="10">barclays AND wealth management</rule>
      <rule id="$C$21" field="ANY" score="10">barclays AND isa</rule>
      <rule id="$C$22" field="ANY" score="10">barclays AND telephone banking</rule>
      <rule id="$C$23" field="ANY" score="20">barclaycall</rule>
      <rule id="$C$24" field="ANY" score="20">barclaysconnect</rule>
      <rule id="$C$25" field="ANY" score="20">barclays merchant services</rule>
      <rule id="$C$26" field="ANY" score="20">barclaysquare</rule>
    </combination>
    <combination name="title" score="100" data="" id="Retail Financial Services/title">
      <rule id="$C$7" field="title" score="25">barclays AND nationwide</rule>
      <rule id="$C$8" field="title" score="50">barclays AND cash</rule>
      <rule id="$C$9" field="title" score="50">barclays AND it</rule>
      <rule id="$C$10" field="title" score="50">barclayloan/-/'s</rule>
      <rule id="$C$11" field="title" score="50">barclays AND internet</rule>
      <rule id="$C$12" field="title" score="50">barclays AND mortgage</rule>
      <rule id="$C$13" field="title" score="50">barclays AND student</rule>
      <rule id="$C$14" field="title" score="50">barclays AND charges</rule>
      <rule id="$C$15" field="title" score="50">barclays AND holiday/-/s/makers</rule>
      <rule id="$C$16" field="title" score="50">barclaycard</rule>
      <rule id="$C$17" field="title" score="50">barclays AND pension</rule>
      <rule id="$C$18" field="title" score="50">barclays AND small business/-/es</rule>
      <rule id="$C$19" field="title" score="50">barclays AND graduate</rule>
      <rule id="$C$20" field="title" score="50">barclays AND wealth management</rule>
      <rule id="$C$21" field="title" score="50">barclays AND isa</rule>
      <rule id="$C$22" field="title" score="50">barclays AND telephone banking</rule>
      <rule id="$C$23" field="title" score="50">barclaycall</rule>
      <rule id="$C$24" field="title" score="50">barclaysconnect</rule>
      <rule id="$C$25" field="title" score="50">barclays merchant services</rule>
      <rule id="$C$26" field="title" score="50">barclaysquare</rule>
    </combination>
    <combination name="products" score="100" data="" id="Retail Financial Services/products">
      <rule id="$C$7" field="ANY" data="Barclay Card" score="5">barclaycard</rule>
      <rule id="$C$8" field="ANY" data="Barclay Loan" score="5">barclayloan</rule>
      <rule id="$C$9" field="ANY" data="Barclay Call" score="5">barclaycall</rule>
      <rule id="$C$10" field="ANY" data="Barclays Connect" score="5">barclaysconnect</rule>
      <rule id="$C$11" field="ANY" scope="sentence" data="Barclays Internet" score="5">barclays AND internet</rule>
      <rule id="$C$12" field="ANY" scope="sentence" data="Barclays Internet" score="5">barclays AND web</rule>
      <rule id="$C$13" field="ANY" scope="sentence" data="Barclays Insurance" score="5">barclays insurance</rule>
      <rule id="$C$14" field="ANY" scope="sentence" data="Barclays Mortgages" score="5">barclays AND mortgage/-/s</rule>
      <rule id="$C$15" field="ANY" data="Barclays Internet" score="20">barclays.net</rule>
      <rule id="$C$16" field="ANY" data="Barclays ISA" score="1">isa</rule>
    </combination>
    <combination name="competitors" score="100" data="" id="Retail Financial Services/competitors">
      <rule id="$C$7" field="ANY" data="Nationwide Building Society" score="10">nationwide</rule>
      <rule id="$C$8" field="ANY" data="National Westminster Bank" score="10">natwest</rule>
      <rule id="$C$9" field="ANY" data="Midlands Bank" score="10">midlands</rule>
      <rule id="$C$10" field="ANY" data="Hong Kong Shanghai Banking Corporation" score="10">hsbc</rule>
      <rule id="$C$11" field="ANY" data="Royal Bank of Scotland" score="10">royal bank of scotland</rule>
      <rule id="$C$12" field="ANY" data="Lloyds-TSB Bank" score="10">lloyds</rule>
      <rule id="$C$13" field="ANY" data="Lloyds-TSB Bank" score="10">tsb</rule>
      <rule id="$C$14" field="ANY" data="Lloyds-TSB Bank" score="10">lloyds-tsb</rule>
      <rule id="$C$15" field="ANY" data="Bank of Scotland" score="10">bank of scotland</rule>
      <rule id="$C$16" field="ANY" data="Clydesdale Bank" score="10">clydesdale bank</rule>
    </combination>
    <combination name="customers" score="100" data="" id="Retail Financial Services/customers">
      <rule id="$C$7" field="ANY" data="Small Business" score="1">small business/-/es</rule>
      <rule id="$C$8" field="ANY" data="Students" score="1">graduate</rule>
      <rule id="$C$9" field="ANY" data="Students" score="1">student AND loan/-/s</rule>
      <rule id="$C$10" field="ANY" data="Students" score="1">student AND grant/-/s</rule>
      <rule id="$C$11" field="ANY" data="Small Business" score="1">small corporations</rule>
    </combination>
    <combination name="alliances" score="100" data="" id="Retail Financial Services/alliances">
      <rule id="$C$7" field="ANY" data="Dell Computer Corporation" score="1">dell</rule>
      <rule id="$C$8" field="ANY" data="BBC Worldwide" score="1">bbc AND worldwide</rule>
      <rule id="$C$9" field="ANY" data="Microsoft" score="1">microsoft</rule>
      <rule id="$C$10" field="ANY" data="Microsoft" score="1">webtv</rule>
      <rule id="$C$11" field="ANY" data="Cellnet" score="1">cellnet</rule>
      <rule id="$C$12" field="ANY" data="Link" score="1">link</rule>
      <rule id="$C$13" field="ANY" data="Royal National Institute for Deaf People" score="1">rnid</rule>
      <rule id="$C$14" field="ANY" data="Royal National Institute for Deaf People" score="1">institute for deaf</rule>
      <rule id="$C$15" field="ANY" data="Blind In Business" score="1">bib</rule>
      <rule id="$C$16" field="ANY" data="Blind In Business" score="1">Blind In Business</rule>
      <rule id="$C$17" field="ANY" data="Eastern Group" score="1">eastern group</rule>
      <rule id="$C$18" field="ANY" data="Unisys" score="10">unisys</rule>
    </combination>
  </category>
</xml>
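To illustrate how a rule file in this format might be read and applied, the sketch below parses a small excerpt and collects the scores of the rules that trigger for a piece of text. It is a simplified assumption, not the categorisation engine itself: the function name `triggered_scores` and the sample string are hypothetical, matching is plain case-insensitive substring search, and the `field` and `scope` attributes and the `/-/` suffix notation are ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt in the rule input format shown above.
SAMPLE_RULES = """<category name="Retail Financial Services">
  <combination name="general" score="100">
    <rule id="$C$12" field="ANY" score="10">barclays AND mortgage</rule>
    <rule id="$C$16" field="ANY" score="20">barclaycard</rule>
    <rule id="$C$17" field="ANY" score="10">barclays AND pension</rule>
  </combination>
</category>"""


def triggered_scores(rules_xml, document_text):
    """Map each combination name to the scores of its triggered rules."""
    text = document_text.lower()
    root = ET.fromstring(rules_xml)
    result = {}
    for combination in root.iter("combination"):
        hits = []
        for rule in combination.iter("rule"):
            # "a AND b" triggers only when every term is found in the text.
            terms = [t.strip().lower() for t in (rule.text or "").split(" AND ")]
            terms = [t for t in terms if t]
            if terms and all(term in text for term in terms):
                hits.append(int(rule.get("score")))
        result[combination.get("name")] = hits
    return result


print(triggered_scores(SAMPLE_RULES, "Barclays cuts mortgage rates; Barclaycard unaffected"))
```

Here the first two rules trigger and the pension rule does not, so the "general" combination collects the scores 10 and 20.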
Output of categorization engine for one document.
<document title="Barclays Newsroom. News Releases: Barclaycard sets the standards for the UK credit card market" uri="http://www.newsroom.barclays.co.uk/news/data/145.html">
  <title>Barclays Newsroom. News Releases: Barclaycard sets the standards for the UK credit card market</title>
  <category id="6688" name="Barclays" score="62" range="high" source="/barclays/barclays.xls">
    <combination id="20944" name="Title" bubble="0" score="62" source="Barclays/title">
      <rule id="66588" score="25" phrase="barclays" source="$C$7" field="FIELD_TITLE" scope="ANY" operator="PHRASE" ls="0" le="8" rs="0" re="0" />
      <rule id="67084" score="50" phrase="barclays AND credit" source="$C$11" field="FIELD_TITLE" scope="ANY" operator="AND" ls="0" le="8" rs="76" re="82" />
    </combination>
  </category>
  <category id="9120" name="Retail Financial Services" score="70" range="high" source="/barclays/Retail Financial Services/Retail Financial Services.xls">
    <combination id="26488" name="General" bubble="0" score="36" source="Retail Financial Services/general">
      <rule id="75888" score="20" phrase="barclaycard" source="$C$16" field="FIELD_TITLE" scope="ANY" operator="PHRASE" ls="34" le="45" rs="0" re="0" />
      <rule id="75888" score="20" phrase="barclaycard" source="$C$16" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="22" le="33" rs="0" re="0" />
    </combination>
    <combination id="27104" name="Title" bubble="0" score="50" source="Retail Financial Services/title">
      <rule id="78368" score="50" phrase="barclaycard" source="$C$16" field="FIELD_TITLE" scope="ANY" operator="PHRASE" ls="34" le="45" rs="0" re="0" />
    </combination>
    <combination id="27720" name="Products" bubble="1" score="9" source="Retail Financial Services/products">
      <rule id="79732" score="5" phrase="barclaycard" source="$C$7" data="Barclay Card" field="FIELD_TITLE" scope="ANY" operator="PHRASE" ls="34" le="45" rs="0" re="0" />
      <rule id="79732" score="5" phrase="barclaycard" source="$C$7" data="Barclay Card" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="22" le="33" rs="0" re="0" />
    </combination>
  </category>
  <category id="25536" name="Credit Cards" score="75" range="high" source="/Finance/credit cards/credit cards.xls">
    <combination id="87472" name="Groups" bubble="1" score="48" source="credit cards/groups">
      <rule id="255192" score="10" phrase="barclaycard" source="$C$11" data="BarclayCard" field="FIELD_TITLE" scope="ANY" operator="PHRASE" ls="34" le="45" rs="0" re="0" />
      <rule id="255316" score="25" phrase="barclaycard" source="$C$12" data="BarclayCard" field="FIELD_TITLE" scope="ANY" operator="PHRASE" ls="34" le="45" rs="0" re="0" />
      <rule id="255192" score="10" phrase="barclaycard" source="$C$11" data="BarclayCard" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="22" le="33" rs="0" re="0" />
      <rule id="255440" score="15" phrase="credit card company" source="$C$13" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="1446" le="1465" rs="0" re="0" />
    </combination>
    <combination id="88704" name="General" bubble="0" score="53" source="credit cards/general">
      <rule id="260400" score="15" phrase="credit card" source="$C$12" field="FIELD_TITLE" scope="ANY" operator="PHRASE" ls="76" le="87" rs="0" re="0" />
      <rule id="260524" score="25" phrase="credit card" source="$C$13" field="FIELD_TITLE" scope="ANY" operator="PHRASE" ls="76" le="87" rs="0" re="0" />
      <rule id="260400" score="15" phrase="credit card" source="$C$12" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="64" le="75" rs="0" re="0" />
      <rule id="261888" score="10" phrase="Platinum card" source="$C$23" data="Platinum Card" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="411" le="424" rs="0" re="0" />
      <rule id="260896" score="5" phrase="credit limit" source="$C$16" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="3953" le="3965" rs="0" re="0" />
    </combination>
  </category>
  <category id="23712" name="Finance" score="22" range="low" source="/Finance/finance.xls">
    <combination id="82544" name="General" bubble="0" score="18" source="finance/general">
      <rule id="230392" score="5" phrase="credit card" source="$C$7" data="Credit Cards" field="FIELD_TITLE" scope="ANY" operator="PHRASE" ls="76" le="87" rs="0" re="0" />
      <rule id="230516" score="10" phrase="credit card" source="$C$8" data="Credit Cards" field="FIELD_TITLE" scope="ANY" operator="PHRASE" ls="76" le="87" rs="0" re="0" />
      <rule id="230392" score="5" phrase="credit card" source="$C$7" data="Credit Cards" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="64" le="75" rs="0" re="0" />
    </combination>
    <combination id="81928" name="Indicators" bubble="1" score="5" source="finance/indicators">
      <rule id="229896" score="5" phrase="interest rates" source="$C$9" data="Interest Rates" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="438" le="452" rs="0" re="0" />
    </combination>
  </category>
  <category id="26144" name="Currency" score="9" source="/Finance/currency/currency.xls">
    <combination id="90552" name="Folding" bubble="1" score="9" source="currency/folding">
      <rule id="267344" score="5" phrase="credit card" source="$C$16" data="Credit Cards" field="FIELD_TITLE" scope="ANY" operator="PHRASE" ls="76" le="87" rs="0" re="0" />
      <rule id="267344" score="5" phrase="credit card" source="$C$16" data="Credit Cards" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="64" le="75" rs="0" re="0" />
    </combination>
  </category>
  <text>Barclaycard sets the standards for the UK credit card market 7th April 1999 Almost seven million customers will benefit from a powerful package of new initiatives unveiled today by Barclaycard - the UK's largest credit card. The package based on the results of intensive customer research includes the introduction of a new rewards scheme, free extended warranty for all customers, a new platinum card and a cut in interest rates. "These initiatives will further strengthen Barclaycard's grip on the competitive credit card market. No other credit card can offer this combination of additional benefits, powerful rewards, attractive interest rates and no hidden charges," comments John Eaton, managing director at Barclaycard. Barclaycard Rewards scheme: The new Barclaycard Rewards scheme enables customers to make unique savings on gas, electricity and telephone bills, from May 1 this year. For example, Barclaycard customers can already save 20 per cent on home telephone calls, but by redeeming Barclaycard Rewards points, cardholders can boost their savings by an extra 10 per cent. This represents a saving of up to £56 for the average customer spending £200 on household calls through BT standard charges. New deals on travel insurance (up to 15 per cent off) and AA cover (40 per cent off) are also being launched through the Barclaycard Rewards scheme. Extended warranty: Barclaycard is to become the first major credit card company to offer all cardholders free extended warranty. From April 15, 1999 customers purchasing new household appliances with their Barclaycard will benefit from one year's free extended warranty. The service applies to most new household appliances costing more than 25. To take advantage of the offer customers simply need to register a purchase within 90 days. Platinum card: Barclaycard Platinum is a top of the range extension to Barclaycard offering customers an unrivalled range of benefits. The highlight of the benefits of Barclaycard Platinum will be two year's free extended warranty on most household appliances paid for with the card. Together with the initial manufacturer's warranty period this provides a total warranty period - absolutely free of charge - of three years for Barclaycard Platinum customers. Additionally, Platinum customers will receive
favourable rates ranging from 14.9 to 17.9 per cent. Cardholders spending 6,000 or more in a year on their card will also receive a full rebate of the annual fee. By the end of this year Barclaycard expects to have issued up to 500,000 Platinum cards. Lowest Barclaycard rates: Barclaycard has also announced a reduction in interest rates of one per cent to 19.9 per cent - the lowest rate ever. Barclaycard's interest rates have fallen by three per cent in the last six months. The new standard Barclaycard APRs will range from 16.9 per cent to 19.9 per cent, depending on the amount a cardholder spends each month. In addition half of Barclaycard's customers will benefit from the fee rebate thresholds being lowered from 5,000 to 2,000. Barclaycard is also offering both new and existing cardholders the opportunity to transfer balances from other cards to their Barclaycard at an APR of 9.9 per cent for six months. "Customers need to look at the complete picture when choosing a credit card. They should look not just at the initial APR, but at the interest free period -ours is 56 days- any hidden charges, reward schemes and the range of additional benefits available. It is therefore no wonder we attracted half a million new customers last year alone," comments John Eaton. Research* conducted by Barclaycard indicates that a staggering number of cardholders in the UK are still not aware of the hidden charges imposed on them by issuers. Credit cardholders who do not have a Barclaycard could end up paying for at least one of the following hidden charges - up to 20 for a late payment, up to 15 for exceeding a credit limit, up to 10 for a duplicate statement, up to 15 for a direct debit or 5 for a copy voucher. Cardholders incurring these costs once or twice a year will find that they cancel out the benefits of limited APR special offers. "The introduction of this new package offers more value to our cardholders than ever before. These additions will ensure that Barclaycard extends its lead as the UK's number one credit card," concludes John Eaton. Notes to editors: 1. The research was conducted in March 1999 by Audience Selection on behalf of Barclaycard. 1013 telephone interviews were undertaken. 2. For every 10 spent cardholders will receive one Reward point. All profiles points will automatically be converted into Reward points of equivalent value. For further information, journalists should contact the relevant press office</text>
  <category id="28576" name="Interest Rates" score="23" range="low" source="/Finance/interest rates/interest rates.xls">
    <combination id="99792" name="Higher" bubble="1" score="10" source="interest rates/higher">
      <rule id="303800" score="10" phrase="cut in interest rates" source="$C$7" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="431" le="452" rs="0" re="0" />
    </combination>
    <combination id="101024" name="General" bubble="0" score="15" source="interest rates/general">
      <rule id="311364" score="15" phrase="interest rates" source="$C$12" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="438" le="452" rs="0" re="0" />
    </combination>
  </category>
  <category id="1824" name="Bank Of england" score="5" source="/Banking/bank of england/bank of england.xls">
    <combination id="2464" name="Policy" bubble="1" score="5" source="bank of england/policy">
      <rule id="8804" score="5" phrase="interest rates" source="$C$16" data="Interest Rates" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="438" le="452" rs="0" re="0" />
    </combination>
  </category>
  <category id="18848" name="Erm" score="5" source="/economy/euro/erm/erm.xls">
    <combination id="69608" name="Indicators" bubble="1" score="5" source="erm/indicators">
      <rule id="200260" score="5" phrase="interest rates" source="$C$8" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="438" le="452" rs="0" re="0" />
    </combination>
  </category>
  <category id="27968" name="Inflation" score="5" source="inflation/inflation">
    <combination id="99176" name="General" bubble="0" score="5" source="inflation/general">
      <rule id="299708" score="5" phrase="interest rates" source="$C$9" data="Interest Rates" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="438" le="452" rs="0" re="0" />
    </combination>
  </category>
  <category id="12768" name="Chemicals and Gases" score="20" range="low" source="/environment/chemicals and gases/chemicals and gases.xls">
    <combination id="47432" name="General" bubble="0" score="20" source="chemicals and gases/general">
      <rule id="135408" score="20" phrase="gas" source="$C$30" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="856" le="859" rs="0" re="0" />
    </combination>
  </category>
  <category id="30400" name="Telecoms" score="11" source="/telecoms/telecoms.xls">
    <combination id="115808" name="General" bubble="0" score="5" source="telecoms/general">
      <rule id="373488" score="5" phrase="telephone" source="$C$21" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="877" le="886" rs="0" re="0" />
    </combination>
    <combination id="113960" name="Companies" bubble="1" score="7" source="telecoms/companies">
      <rule id="364064" score="7" phrase="bt" source="$C$10" data="British Telecom" field="FIELD_TEXT" scope="ANY" operator="OR" ls="1215" le="1217" rs="0" re="0" />
    </combination>
  </category>
  <category id="34656" name="Hardware and Technology" score="5" source="/telecoms/technology and hardware/technology & hardware.xls">
    <combination id="139216" name="Phones" bubble="1" score="5" source="hardware and technology/phones">
      <rule id="456692" score="5" phrase="telephone" source="$C$25" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="877" le="886" rs="0" re="0" />
    </combination>
  </category>
  <category id="26752" name="Debt" score="5" source="/Finance/debt/debt.xls">
    <combination id="93632" name="Personal" bubble="1" score="5" source="debt/personal">
      <rule id="274908" score="5" phrase="bills" source="$C$9" field="FIELD_TEXT" scope="ANY" operator="PHRASE" ls="887" le="892" rs="0" re="0" />
    </combination>
  </category>
  <category id="31008" name="Cable Telecoms" score="5" source="/telecoms/cable comms/cable comms.xls">
    <combination id="118272" name="Players" bubble="1" score="5" source="cable telecoms/players">
      <rule id="382292" score="5" phrase="bt" source="$C$14" data="BT Cable" field="FIELD_TEXT" scope="ANY" operator="OR" ls="1215" le="1217" rs="0" re="0" />
    </combination>
  </category>
  <category id="32224" name="Competition" score="3" source="/telecoms/competition/competition.xls">
    <combination id="123200" name="Names" bubble="1" score="3" source="competition/names">
      <rule id="400520" score="3" phrase="bt" source="$C$9" data="bt OR british telecom" field="FIELD_TEXT" scope="ANY" operator="OR" ls="1215" le="1217" rs="0" re="0" />
    </combination>
  </category>
</document>
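The combination and category scores in the output listing are consistent with a simple iterative combination of scores taken in descending order, in which each further value X raises the running total A by X scaled by the remaining headroom (100 - A). The sketch below is an assumption inferred from the listed numbers, not a formula stated in this text; the function name `combine` and the truncation to an integer at the end of each pass are likewise assumptions.

```python
def combine(scores):
    """Fold a list of scores (each 0-100) into one score, highest value first."""
    ordered = sorted(scores, reverse=True)
    if not ordered:
        return 0
    total = float(ordered[0])  # A starts as the highest value in the list
    for x in ordered[1:]:
        # Each further value contributes in proportion to the headroom left.
        total += x * (100.0 - total) / 100.0
    return int(total)  # assumed: truncate once, at the end of the pass


# First pass: triggered rule scores -> combination scores ("Credit Cards").
groups = combine([10, 25, 10, 15])       # listing shows score="48"
general = combine([15, 25, 15, 10, 5])   # listing shows score="53"

# Second pass: the same function over combination scores -> category score.
credit_cards = combine([groups, general])  # listing shows score="75"

print(groups, general, credit_cards)  # prints 48 53 75
```

Reusing one function for both passes mirrors the recursive structure described in the abstract, where the same scoring logic applies at any level of the hierarchy.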
Claims (15)
- 1. A system for selecting desired documents from a number of documents, the system comprising: means for comparing textual content of each document with a plurality of rule sets, each rule set defining a category and comprising a plurality of rules each constituted by at least one alphanumeric string, said comparing means being organised to trigger a rule when the alphanumeric string of that rule is located in the textual part, wherein each rule is associated with a weighting factor; means for generating a category score representing an importance weighting for each category based on the weighting factors for the rules which triggered in the rule set defining the category; means for indexing each document in association with its score for each category in which it is selected; and means for translating a search parameter defined by a user and representing a broad category into a set of associated categories and returning said desired documents to the user based on the category scores of said associated categories.
- 2. A system according to claim 1, wherein the category scores are organised into bands having high importance, medium importance and low importance.
- 3. A system according to claim 2, wherein the desired documents returned to a user are those which have a category score in the band of high importance for the set of associated categories.
- 4. A system according to any preceding claim, wherein each rule set defining a category is divided into combinations, each combination comprising a plurality of rules.
- 5. A system according to claim 4, wherein said means for generating a category score is arranged to generate combination scores and comprises a processor programmed to run an algorithm which receives the weighting factors for the rules which triggered in a particular combination and which generates a combination score based on said weighting factors.
- 6. A system according to claim 5, wherein the category score is generated using said algorithm having as its inputs the combination scores generated by the first pass of the algorithm.
- 7. A system according to claim 5 or 6, wherein said algorithm is an iterative calculation based on the statistical formula [formula not reproduced in this text] performed on a list of numerical values in descending order where: for the first iteration A is the highest value in the list and X is the next value on the list; and for subsequent iterations A is the addition of the result of the previous iteration with the previous value of A.
- 8. A system according to any preceding claim, which includes at least one rule base comprising a store holding said plurality of rule sets defining respective categories.
- 9. A system according to claim 8, wherein the rule base holds in association with each rule a field identifying the location in the document where said at least one alphanumeric string is to be located for the rule to trigger.
- 10. A system according to any preceding claim, wherein at least some of said rules are constituted by a Boolean combination of alphanumeric strings which both have to be located in the document to trigger the rule.
- 11. A method for selecting desired documents from a number of documents comprising the following steps: comparing textual content of each document with a plurality of rule sets, each rule set defining a category and comprising a plurality of rules each constituted by at least one alphanumeric string, wherein a rule is triggered when the alphanumeric string of that rule is located in the textual part, each rule being associated with a weighting factor; generating a category score representing an importance weighting for each category based on the weighting factors for the rules which triggered in the rule set defining the category; indexing each document in association with its score for each category in which it is selected; and translating a search parameter defined by a user and representing a broad category into a set of associated categories and returning said desired document to the user based on the category scores of said associated categories.
- 12. A method according to claim 11, wherein the category scores are organised into bands having high importance, medium importance and low importance.
- 13. A method according to claim 11 or 12, wherein the desired documents returned to a user are those which have a category score in the band of high importance for the set of associated categories.
- 14. A method according to claim 11, 12 or 13, wherein each rule set defining a category is divided into combinations, each combination comprising a plurality of rules.
- 15. A method of populating a broad category with document data comprising the steps of: comparing the textual content of each of a plurality of documents with a plurality of rule sets, each rule set defining a category and comprising a plurality of rules each constituted by at least one alphanumeric string, wherein a rule is triggered when the alphanumeric string of that rule is located in the textual part, each rule being associated with a weighting factor; generating a category score representing an importance weighting for each category based on the weighting factors for the rules which triggered in the rule set defining the category; defining said broad category as a set of associated categories, and including in said broad category document data defining each document selected in the associated categories.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB0005092A GB2366008A (en) | 2000-03-02 | 2000-03-02 | Document selection |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| GB0005092D0 (en) | 2000-04-26 |
| GB2366008A (en) | 2002-02-27 |
Family
ID=9886859
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB0005092A Withdrawn GB2366008A (en) | 2000-03-02 | 2000-03-02 | Document selection |
Country Status (1)
| Country | Link |
|---|---|
| GB (1) | GB2366008A (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5642502A (en) * | 1994-12-06 | 1997-06-24 | University Of Central Florida | Method and system for searching for relevant documents from a text database collection, using statistical ranking, relevancy feedback and small pieces of text |
| US5943670A (en) * | 1997-11-21 | 1999-08-24 | International Business Machines Corporation | System and method for categorizing objects in combined categories |
| GB2336700A (en) * | 1998-04-24 | 1999-10-27 | Dialog Corp Plc The | Generating machine readable association files |
Also Published As
| Publication number | Publication date |
|---|---|
| GB0005092D0 (en) | 2000-04-26 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) | |