EP1073974A1

EP1073974A1 - Method and apparatus for generating machine-readable association files

Info

Publication number: EP1073974A1
Application number: EP99918127A
Authority: EP
Inventors: Rachel Hammond; Llewelyn Ignazio Fernandes
Original assignee: Dialog Corp PLC
Current assignee: SMARTLOGIK GROUP PLC
Priority date: 1998-04-24
Filing date: 1999-04-21
Publication date: 2001-02-07
Also published as: WO1999056222A1; GB9808808D0; GB2336700A

Abstract

Association files (153, 154, 155) are generated that are suitable for determining whether a data file (151) belongs to a predetermined category (A, B). A plurality of included files (156) belonging to the category are stored in combination with a plurality of excluded files (157) not belonging to the category. Included files (156) are processed to identify candidate terms for an association file (155). The suitability of candidate terms is assessed with reference to occurrences in the included files (156). In addition, the suitability is assessed with reference to occurrences in the excluded files (157) so as to provide definition terms for an association file. Thus, if a term identified as a candidate also appears frequently in the excluded files (157), it is likely to be assessed as unsuitable for inclusion within the new association file.

Description

Generating Machine-Readable Association Files

Field Of The Invention

The present invention relates to generating machine-readable association files, to facilitate associating files of machine-readable data with information types.

Background To The Invention

Traditionally, database technology has been dedicated to the organising of numerical and tabular data and it is only recently, particularly with the expansion of the Internet, that demand has grown for the retrieval of text-based files. Several facilities are available on the Internet, commonly referred to as "search engines", which assist in the location of information. The majority of these operate by performing what has become known as "free text" searching, in which a user specifies words which they believe are contained within the target file as a mechanism for instructing the system to retrieve files of interest.

Problems with this technique are well known to users of the available search engines, particularly over the Internet. A simple enquiry can generate hundreds of thousands of "hits", the majority of which will tend to be totally irrelevant to the user's needs. Furthermore, other relevant files may be missed simply because they do not contain the specific chosen words.

Procedures for classifying volumes of data so as to facilitate subsequent searching are known, but the classification process often involves manual intervention, thereby making it time consuming and prone to human error. Furthermore, except in circumstances where the documentation is considered to be extremely valuable and will be searched over a significant period of time, the cost of performing this manual exercise cannot be justified in terms of the commercial worth of the data sources being considered. As a result, much data remains effectively inaccessible and outside the realm of searchable knowledge.

Procedures are known for processing a data file so as to determine whether the data file should be associated with a particular information category. The known process requires a machine readable association file, also known as an outline file and usually identified by a file extension .OTL. In this way, it is possible for the incoming data file to be processed with reference to one or many outline files whereupon each outline file produces a numerical score value defining an extent to which the data file is relevant to the particular association category, whereafter a decision may be made on the basis of a threshold comparison. In practical systems, thousands of such outline files would be required in order to provide a useful level of categorisation. Furthermore, new outline files may be required on a regular basis therefore the generation of outline files in itself becomes a time consuming and specialised procedure. Thus, although processes are known for the automated and industrially applicable processing of data files against outline files in order to generate category associations, a problem exists in terms of defining similar procedures for the automatic generation of the machine readable association files themselves.

Summary Of The Invention

According to a first aspect of the present invention, there is provided apparatus for generating association files used to determine whether a data file belongs to a predetermined category, comprising processing means and storage means, wherein said storage means stores a plurality of included files belonging to said category and a plurality of excluded files not belonging to said category; and said processing means is configured to: process included files read from said storage means to identify candidate terms for an association file, and to assess the suitability of said candidate terms with reference to occurrences in said included files and occurrences in said excluded files so as to provide definition terms for an association file.

In a preferred embodiment, the processing means is configured to determine weighting values by processing a first probability of getting a term in included files with a second probability of getting a term in excluded files.

According to a second aspect of the present invention, there is provided a method of generating association files suitable for determining whether a data file belongs to a predetermined category, wherein a plurality of included files belonging to said category are stored in combination with a plurality of excluded files not belonging to said category, said method comprising the steps of: processing included files to identify candidate terms for an association file; and assessing the suitability of said candidate terms with reference to occurrences in said included files and occurrences in said excluded files so as to provide definition terms for an association file. In a preferred embodiment, candidates having a weighting value below a predetermined threshold are removed.

Brief Description Of The Drawings

Figure 1A shows a data distribution environment, in which data is received from a plurality of sources;

Figure 1B represents underlying principles of operation for the preferred embodiment;

Figure 2 details the data processing environment identified in Figure 1A;

Figure 3 details operations performed by the data processing facility shown in Figure 2;

Figure 4 details a central data processing facility;

Figure 5 identifies procedures for training the system shown in Figure 4;

Figure 6 details procedures for the generation of an outline file;

Figure 7 shows an example of a sampled data file;

Figure 8 shows an example of selected noun phrases;

Figure 9 details the term generation process identified in Figure 6;

Figure 10 shows an example of processed noun phrases;

Figure 11 details procedures for the frequency counting process identified in Figure 6;

Figure 12 shows a table representing stored values;

Figure 13 details procedures for calculating weighting values;

Figures 14A and 14B illustrate procedures for the non-linear processing of the weighting value determined in accordance with the procedure shown in Figure 13;

Figure 15 details procedures for stem selection, identified in Figure 6;

Figure 16 details procedures for casing selection, identified in Figure 6;

Figure 17 illustrates the effect of the procedures shown in Figures 15 and 16;

Figure 18 illustrates an association outline file;

Figure 19 illustrates a subsidiary processor of the type shown in Figure 4;

Figure 20 illustrates operations performed on the processing unit shown in Figure 19;

Figure 21 illustrates the use made of the memory storage illustrated in Figure 19;

Figure 22 details the process for associating preferred terms, identified in Figure 3;

Figure 23 details procedures for the processing of data identified in Figure 22;

Figure 24 details a triggering phase identified in Figure 23;

Figure 25 details a scoring phase identified in Figure 23;

Figure 26 details a list generation phase identified in Figure 23;

Figure 27 illustrates a table of the type generated by the processing system shown in Figure 4;

Figure 28 illustrates a linked list of the type generated by the central processing system shown in Figure 4;

Figure 29 details procedures for performing a search in response to a user selection, as identified in Figure 3;

Figure 30 illustrates a screen display inviting a user to initiate a search;

Figure 31 illustrates a screen display prompting a user to define search criteria; and

Figure 32 illustrates a screen display of titles returned to a user identifying data files.

Detailed Description Of The Preferred Embodiments

A data distribution environment is illustrated in Figure 1A, in which data received from a plurality of sources, such as sources 101, 102 and 103, is supplied to an electronic data distribution network 104. A plurality of organisations, including organisation 105, organisation 106 and organisation 107, make use of data received from data sources 101 to 103 in various ways. However, the amount of data generated by the data sources 101 to 103 is very large and the actual volume of information required by the organisations 105 to 107 tends to be a relatively small sub-set of the total volume of data available from the data sources 101 to 103. Consequently, procedures for sorting, filtering and selecting data are required in order for organisations 105 to 107 to be provided with the information that they require, derived from a significantly larger volume of available data.

In the environment shown in Figure 1A, all of the information available from the electronic data distribution network 104 is made available to organisations 105 to 107. Organisations 105 and 106 receive data directly from electronic data distribution network 104 via dedicated communication channels 108 and 109 respectively. Organisation 107, which may be a non-profit-making organisation such as an educational establishment, receives information from network 104 via the Internet 110. Organisations 105 and 106 also receive data via the Internet 110 over connections 111 and 112. For the purposes of this illustration, it is assumed that organisation 105 (Organisation A) is a large multinational company, such as an oil company, primarily interested in information relating to the oil industry. Organisation 106 (Organisation B) is a financial institution primarily interested in information relating to the transfer of funds, the value of assets and the like. Organisation 107 (Organisation C) is an educational institution and has very wide interests in many academic fields of endeavour.

Each organisation defines categories of information representing fields of information where an interest has been shown. Each category has its own category definition terms, allowing an association process to be performed which associates files of machine readable data, received from the electronic data distribution network illustrated in Figure 1A, with categories or information types. The present invention provides for the automatic and industrially applicable generation of these association files by a technical process.

Category definitions are generated by means of a training process. A human trainer selects files which fall within a particular category. It is not necessary for the trainer to explain why a file is appropriate to a particular category, given that the process will perform a learning operation and use the information that it has learnt so as to select what it considers to be similar files from incoming data files as they are received. The technical process of the present invention analyses the selected files in order to generate category definitions. These category definitions are then used to analyse new data files so that they may be associated with appropriate categories in an automated way. Thus, it could be possible for each of organisations 105, 106 and 107 to use the same or similar category name. However, given that the training process will be performed specifically for that particular organisation, the actual association files could specify very different association processes, resulting in very different collections of information being derived possibly from the same source data.

In this way, an organisation's inherent "feel" for the type of information that it finds interesting has been encoded and represented in machine readable form so as to facilitate the automated selection of information under a technical process.

Underlying principles of operation for the preferred embodiment are illustrated in Figure 1B. Data files arriving at Organisation A are represented by file icons 151. These data files are processed with reference to association files at 152. In this simplified example, two association files 153 and 154 are provided. Association file 153 includes definition terms for determining whether a data file should be associated with Category A. Similarly, association file 154 includes definition terms to determine whether a data file should be associated with Category B. Each incoming data file 151 is processed with respect to each association file 153 and 154. Thus, any data file may be associated with Category A, or it may be associated with Category B, or it may be associated with both Category A and Category B, or it may be discarded as not being associated with either of these categories. The number of data files considered in this way may be relatively large. In this way, relatively low grade data files, such as newspaper articles, may be considered such that any articles or stories that do relate to the specified categories are identified. Consequently, from a very large data source, only a relatively small proportion of the data files may be categorised for subsequent review. Thus, the collection of data files may be identified as low grade data and the collection of categorised files may be considered as high value information.

Traditionally, association files, such as files 153 and 154, have been constructed manually such that there is a significant overhead in terms of association file generation. Given this significant overhead, a problem has arisen in that it is difficult to tailor association files for specific applications. Thus, traditionally, association files such as files 153 and 154 have been used at central locations, prior to data distribution, where it is difficult to have customer-specific association file generation.
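The per-file decision described in this passage (Category A, Category B, both, or neither) can be sketched as follows. This is a minimal illustration only: the scoring rule (summing the weights of the definition terms found in the text), the threshold values, the example terms and the names `AssociationFile` and `associate` are all assumptions made for the example and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AssociationFile:
    """Hypothetical stand-in for an association (outline) file."""
    category: str
    term_weights: Dict[str, float]  # definition term -> illustrative weight
    threshold: float                # illustrative decision threshold

    def score(self, text: str) -> float:
        # Assumed scoring rule: sum the weights of the terms present in the text.
        lowered = text.lower()
        return sum(w for term, w in self.term_weights.items() if term in lowered)


def associate(text: str, association_files: List[AssociationFile]) -> List[str]:
    """Return every category whose score reaches its threshold.

    An empty list means the data file is discarded as unassociated.
    """
    return [af.category for af in association_files
            if af.score(text) >= af.threshold]


file_a = AssociationFile("Category A", {"oil industry": 0.6, "refinery": 0.5}, 0.5)
file_b = AssociationFile("Category B", {"merger": 0.4, "acquisition": 0.4}, 0.5)

print(associate("Oil industry acquisition follows merger talks", [file_a, file_b]))
print(associate("Local football results", [file_a, file_b]))
```

Because every data file is scored against every association file independently, adding a further category only requires appending another `AssociationFile` to the list.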

In the present embodiment, the association process 152 is provided at customer sites, such as sites A, B and C shown in Figure 1A, and a technical solution is provided to allow these customers to automatically generate new association files.

A new association file 155 is shown in Figure 1B. This association file represents a new category, which may be considered as Category C. This association file has been generated locally via a technical process without requiring highly trained association file developers.

A customer reviews typical examples of incoming data files and identifies them as members of the new category or non-members of the new category. Data files are considered until approximately one thousand members and one thousand non-members have been identified. In Figure 1B, example data files that are considered to be members of the new category are shown at 156 with example data files that are considered to be non-members of the category identified at 157. Procedures 158, 159 and 160 are then performed to generate the new association file.

An association file is made up of definition terms which, in the preferred embodiment, usually take the form of noun phrases. At step 158, files 156 that are members of the category are processed to identify candidate terms that could be included in the new association file. However, many noun phrases that are identified in files 156 may be poor examples of phrases that really do associate the file with the category under consideration. Thus, in order to remove these poor candidates and thereby improve the overall effectiveness of the association file, the suitability of each candidate is assessed at step 159 by comparing the number of occurrences in members 156 of the category with the number of occurrences in the non-member files 157. Thus, if a candidate term occurs frequently in files 156 but infrequently in files 157, the term is selected for inclusion in the new association file. However, if a term that occurs frequently in files 156 also occurs frequently in files 157, the term is rejected and is not included within the new association file.
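The selection principle of step 159 can be sketched in a few lines. This is a hedged illustration, not the weighting procedure of the specification: it assumes that per-term counts for the member and non-member files are already available, and it uses two hypothetical cut-offs (`min_included`, `max_excluded`) in place of the weighting values described later. The example terms and counts are invented for the demonstration.

```python
def select_definition_terms(member_counts, non_member_counts,
                            n_members, n_non_members,
                            min_included=0.2, max_excluded=0.05):
    """Keep candidates that are frequent in member files and rare elsewhere."""
    selected = []
    for term, count in member_counts.items():
        member_rate = count / n_members
        non_member_rate = non_member_counts.get(term, 0) / n_non_members
        # Reject terms that are too rare in members or too common in non-members.
        if member_rate >= min_included and non_member_rate <= max_excluded:
            selected.append(term)
    return selected


# Invented counts from 1000 member files and 1000 non-member files.
member_counts = {"oil industry": 400, "last week": 350, "pipeline": 30}
non_member_counts = {"oil industry": 10, "last week": 380}

print(select_definition_terms(member_counts, non_member_counts, 1000, 1000))
# ['oil industry']
```

Here "last week" is frequent in the member files but equally frequent in the non-member files, so it is rejected, which is the behaviour the paragraph above describes.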

Thus, terms which satisfy the criteria specified at step 159 are then built into the new association file 155 at step 160, whereafter file 155 may be supplied to the association process 152. Thereafter, incoming data files 151 may be processed to identify files that should be placed under the new category (Category C) in addition to Category A and Category B.

A data processing environment for organisation 105 is detailed in Figure 2. Data received from transmission lines 108 and 111 are supplied via respective gateways 201 and 202 to a central data processing facility 203. The central data processing facility makes information available to individual users, illustrated by user terminals 205 to 213. Terminals 205 to 208 are shown connected to a local network 214 which is in turn connected to the central data processing facility 203 via an interface 215. Similarly, terminals 209 to 213 are connected to a local network 216 which is again connected to the central data processing facility via interface 217. In general, hundreds of local networks could be connected to the central data processing facility, each having tens of user terminals connected thereto. In addition to the direct connections shown in Figure 2, other parts of the organisation may receive information from the central data processing facility 203 via Internet connections or by other communication means. Thus, facilities may be provided, for example, for allowing remote laptop computers to communicate with the central data processing facility via mobile cellular telephones.

Organisation 105 also has the capability for transferring eye readable data into machine readable data via an image scanner 218 and an optical character recognition terminal 219. Furthermore, new data may be generated internally and distributed within the organisation. Over time, data processed by facility 203 will attain a high level of value. Consequently, data processed by the facility is periodically supplied to an off-site data archive 220.

Operations performed by the central data processing facility 203 are illustrated in Figure 3. At step 301 the system is trained to categorise incoming data files according to the organisation's preferred system of categorisation. At step 302 all existing files held by the facility are categorised in accordance with the system determined at step 301. Thereafter, the processing facility continually categorises new incoming data files as they are received, as illustrated at step 303, and supplies categorised information to users as and when the information is requested, as illustrated at step 304.

Central data processing facility 203 is detailed in Figure 4. Data signals from data sources 101 to 103 are supplied to an input interface 401 via data input lines 402. Similarly, output data signals are supplied to users 111 to 117 via an output interface 403 and output wires 404. Input interface 401 and output interface 403 communicate with a central processing system 405 based on DEC Alpha integrated circuitry. The central processing system 405 also communicates with other processing systems in a distributed processing architecture. The processing facility 203 includes eight Intel circuit based processing systems 411 to 418, each implementing instructions under the control of conventional operating systems such as Windows NT.

An operator communicates with facility 203 by means of an operator terminal having a visual display unit 421 and a manually operable keyboard 422. Data files received from sources 101 to 103 are written to bulk storage devices 423 arranged as an array of magnetic disks. Data files are written to the disk array 423 after these files have been associated with preferred terms, thereby categorising the incoming data files as shown at step 303. The association processes are performed by the subsidiary processors 411 to 418. The facility includes a CD-ROM reader 425, arranged to read CD-ROMs, such as CD-ROM 426. In this way, it is possible to install executable instructions for computer system 405 and computer systems 411 to 418.

The central processing system 405 communicates with subsidiary processors 411 to 418 via an Ethernet connection 424 allowing processing requirements to be distributed between processors 411 to 418. Having addressed a subsidiary processor, such as processor 411, the transferring of data to the addressed processor is performed. Each individual incoming data file is supplied exclusively to one of the subsidiary processors. The selected subsidiary processor is then responsible for performing the association process, to identify preferred terms relevant to that particular data file so as to define categories relevant for association with the file. These associations with categories or preferred terms are added as additional data to the file itself and a file processed in this way is referred to as an associated data file. Thus, after performing the association process, the associated data file is returned to the central processing system 405 which is then responsible for writing the associated data file to the disk array 423. In this way, it is possible to scale the degree of processing capacity provided by facility 203 dependent upon the volume of data files requiring categorisation. The central processing system 405 also maintains a table of preferred terms, pointing to particular associated data files which have been identified as relevant to each of the preferred terms.

Step 301 for the training of the system to categorise incoming data is detailed in Figure 5. At step 501 a category is selected for which an association file, also referred to as an outline file, is to be generated. At step 502 a human trainer identifies sample files to be included in the category selected at step 501 and at step 503 the human trainer identifies sample files to be excluded from the category selected at step 501.

At step 504, processor 405 generates an outline file that defines the characteristics of the files included in the selected category. These association files, defined in the form of an outline file, include definition terms which are then compared against similar terms present within incoming data files, from which scores are determined, thereby allowing a selection to be made as to whether a data file is or is not associated with the respective category defined by the association file. Thereafter, a question is asked at step 505 as to whether another category is to be considered and, when answered in the affirmative, control is returned to step 501, allowing the next category (for which an outline file is to be generated) to be selected.

Eventually, all of the categories will have been considered and the question asked at step 505 will be answered in the negative. Thereafter, data structures are initialised by parsing the outline files generated by the repeated implementation of step 504.

The technical process, of the present embodiment, is provided with a category title and a plurality of data files that are considered to be associated to said category. The requirement is to produce association files or outline files in machine readable form, suitable for providing the basis for analysing new incoming data files so as to determine whether said new incoming data files should be associated with particular categories. The technical process is achieved by firstly selecting files that are associated with a defined type. In addition, files are selected that are not associated with the defined type. The technical process is then configured to analyse the interrelationship between the selected files so as to provide the association data required in order to produce an association file.

Step 504 for the generation of an outline file that defines the characteristics of data files included in the selected category is detailed in Figure 6. Sample data files excluded from the category are illustrated generally at 611 and, similarly, sample data files included in the category are shown generally at 612. It is not necessary for the same number of files to be included in each of samples 611 and 612 but, preferably, in the region of one thousand files should be included for each of these types.

Data files received from the data distribution environment 104 tend to consist of two components, namely the title of the file and its associated detail, referred to herein as the "story". Consequently, the outline files are configured to process the title data and the story data differently, and it is therefore necessary to take account of this difference when creating an outline file.

At step 601 a selection is made as to whether titles are being considered or story sections are being considered. The process is configured such that titles for all of the sample files are considered first, followed by the story sections for all of the sampled files.

At step 602 a syntactic analysis is performed on sample files 612 only; that is to say, the syntactic analysis at step 602 is only performed on sample files included within the category.

At step 603 a term generation process is performed followed by a frequency counting process at step 604. The frequency counting process considers both sample 611 and sample 612; that is to say, it considers both files included in the category along with files excluded from the category. At step 605 a weight calculation process is performed followed by a stem and casing selection process at step 606.

At step 607 a question is asked as to whether the story section has been processed which, on the first iteration, will be answered in the negative. Consequently, control is returned to step 601 resulting in story sections of the files being selected at said step. Thereafter, processes 602 to 606 are repeated, followed by the question asked at step 607 being answered in the affirmative, to the effect that the story sections have now been processed. At step 608 a rule base is generated as an association or outline (OTL) file.
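The control flow of Figure 6 (titles first, then story sections, with steps 602 to 606 repeated for each before the rule base is built at step 608) can be sketched as below. This is only an outline of the ordering: the individual steps are passed in as hypothetical callables, the dictionary keys and the `{'title': ..., 'story': ...}` file shape are assumptions for the example, and the placeholder steps do nothing except record when they ran.

```python
from typing import Callable, Dict, List

Step = Callable[..., object]


def generate_outline(included: List[dict], excluded: List[dict],
                     steps: Dict[str, Step]) -> object:
    """Run the Figure 6 ordering over files shaped like {'title': ..., 'story': ...}."""
    selected = {}
    for section in ("title", "story"):  # step 601; step 607 returns control once
        inc = [f[section] for f in included]
        exc = [f[section] for f in excluded]
        phrases = steps["syntactic_analysis"](inc)             # step 602: included only
        terms = steps["term_generation"](phrases)              # step 603
        counts = steps["frequency_counting"](terms, inc, exc)  # step 604: both samples
        weights = steps["weight_calculation"](counts)          # step 605
        selected[section] = steps["stem_and_casing_selection"](weights)  # step 606
    return steps["build_rule_base"](selected)                  # step 608


# Placeholder steps that only record the order in which they run.
calls: List[str] = []


def placeholder(label: str) -> Step:
    def run(*args):
        calls.append(label)
        return args[0]  # pass the first argument through unchanged
    return run


steps = {
    "syntactic_analysis": placeholder("602"),
    "term_generation": placeholder("603"),
    "frequency_counting": placeholder("604"),
    "weight_calculation": placeholder("605"),
    "stem_and_casing_selection": placeholder("606"),
    "build_rule_base": placeholder("608"),
}

outline = generate_outline(
    [{"title": "Oil merger", "story": "Details of the merger."}],
    [{"title": "Weather", "story": "Rain is expected."}],
    steps,
)
print(calls)
```

Passing the steps in as callables keeps the ordering separate from the details of each step, which the description covers individually with reference to later figures.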

An example of a sampled data file is shown at 701 in Figure 7. The data file includes a title section 702 along with a story section 703.

Step 602 for performing the syntactic analysis is primarily concerned with identifying noun phrases which occur within the file being considered. The noun phrases identify the informational content within the data file without being constrained by grammatical representation. The noun phrases are selected by use of a C library routine for syntactic parsing, such as "LINGUIST X". An example of noun phrases selected by process 602, when executed upon a typical sample file, is illustrated in Figure 8. The process has identified the noun phrases "oil industry", "acquisitions" and "oil industry acquisitions". These are structured within the file as shown in Figure 8. This represents noun phrases derived from a single file but, having processed typically a thousand files in this way, a very large list of noun phrases is produced.

The list of noun phrases is processed at step 603 to generate a list of terms. The list of terms is larger than the list of noun phrases because the noun phrases identified at step 602 are processed to generate stem representations in addition to the phrases themselves. Term generation process 603 is detailed in Figure 9.

At step 901 a noun phrase is selected and at step 902 a question is asked as to whether the noun phrase contains a word stem. If this question is answered in the affirmative, a stem rule is generated at step 903; otherwise step 903 is bypassed.

At step 904 a question is asked as to whether the noun phrase contains mixed upper and lower case characters. If this question is answered in the affirmative the rule is made case specific at step 905. At step 906 a question is asked as to whether another noun phrase is to be considered and when answered in the affirmative the next noun phrase is selected at step 901.

Stem rules are generated at step 903 by means of a dictionary type look-up table. Words within the noun phrases are supplied to the dictionary look-up table, which then determines whether said words are contained within the dictionary. If a word is identified within the dictionary, the process returns its appropriate stem, thereby allowing the stem rule to be generated at step 903. Alternatively, results may be produced by execution of an appropriate function.

If a noun phrase is identified containing all lower case characters, it is assumed that the words are of interest irrespective of their case representation. However, if phrases are identified that include upper and lower case characters, the case representation may be relevant, therefore a case specific selection rule is generated at step 905.
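The term generation loop of Figure 9 can be sketched as follows. This is an illustrative reading rather than the patented implementation: the `STEM_DICTIONARY` contents, the `Term` record and the function names are assumptions, and the sketch assumes that the original phrase is kept as a term alongside any stem derived from it.

```python
from dataclasses import dataclass
from typing import List

# Illustrative stand-in for the dictionary look-up table; entries are assumptions.
STEM_DICTIONARY = {"acidity": "acid", "development": "develop"}


@dataclass(frozen=True)
class Term:
    text: str
    is_stem: bool = False        # True when produced by a stem rule (step 903)
    case_specific: bool = False  # True when a case specific rule applies (step 905)


def has_mixed_case(phrase: str) -> bool:
    """Step 904: does the phrase contain both upper and lower case characters?"""
    return any(c.isupper() for c in phrase) and any(c.islower() for c in phrase)


def generate_terms(noun_phrases: List[str]) -> List[Term]:
    terms = []
    for phrase in noun_phrases:                   # steps 901 and 906
        words = phrase.split()
        case_specific = has_mixed_case(phrase)    # steps 904 and 905
        terms.append(Term(phrase, case_specific=case_specific))
        # Step 902: is any word of the phrase found in the stem dictionary?
        if any(w.lower() in STEM_DICTIONARY for w in words):
            stemmed = " ".join(STEM_DICTIONARY.get(w.lower(), w) for w in words)
            terms.append(Term(stemmed, is_stem=True,
                              case_specific=case_specific))  # step 903
    return terms


for term in generate_terms(["acidity", "development", "Doctor F Bloggs"]):
    print(term)
```

Three noun phrases yield five terms here, which is consistent with the statement that the list of terms is larger than the list of noun phrases.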

Examples of noun phrases processed in order to produce terms are illustrated in Figure 10. In the example, the noun phrase "acidity" has been identified, as shown at 1001. This word is identified as containing a stem at step 902, resulting in the stem "acid" being included in the list of terms as illustrated at 1002. Similarly, the word "development" was identified as a noun phrase, as shown at 1003, resulting in the inclusion of the stem "develop" as shown at 1004. The noun phrase "Doctor F Bloggs" has been identified and, given that upper case characters were found in the original data at step 904, the upper case characters are retained in the list of terms, thereby generating a case specific selection rule.

Step 604 for performing the frequency counting process is detailed in Figure 11. At step 1101 a term is selected and at step 1102 the number of occurrences present within the included sample of files is counted and stored as a variable I, representing the number of occurrences of the term selected at step 1101 in the sample of included files 612. At step 1103 the number of occurrences of the term selected at step 1101 existing within the excluded files 611 is counted and stored as a variable X. Variables I and X for each selected term are stored and at step 1104 a question is asked as to whether another term is to be considered. When answered in the affirmative, control is returned to step 1101 and the next term is selected. Eventually all of the terms will have been considered and the question asked at step 1104 will be answered in the negative.
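The counting loop of steps 1101 to 1104 can be sketched as below. Two points are assumptions made for the illustration: a term is counted once per file in which it appears (so that the count can later be divided by the number of files), and matching is a case-insensitive substring test. The function names and the sample texts are likewise invented for the example.

```python
from typing import Dict, List, Tuple


def count_files_containing(term: str, files: List[str]) -> int:
    """Count the files in which the term appears (assumed: once per file)."""
    needle = term.lower()
    return sum(1 for text in files if needle in text.lower())


def frequency_table(terms: List[str], included_files: List[str],
                    excluded_files: List[str]) -> Dict[str, Tuple[int, int]]:
    table = {}
    for term in terms:                                        # steps 1101 and 1104
        i = count_files_containing(term, included_files)      # step 1102: variable I
        x = count_files_containing(term, excluded_files)      # step 1103: variable X
        table[term] = (i, x)
    return table


included = ["Soil acidity rises", "Acid rain and acidity", "Merger announced"]
excluded = ["Bank merger agreed", "Merger talks stall", "Weather report"]

print(frequency_table(["acidity", "merger"], included, excluded))
# {'acidity': (2, 0), 'merger': (1, 2)}
```

Each entry pairs I with X for one term, which is the information the stored table of values holds row by row.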

A table representing stored values for I and X, calculated by repeated execution of processes 1102 and 1103, is illustrated in Figure 12. The table includes a first column 1201 representing the term selected at step 1101, a second column 1202 representing values for I calculated at step 1102 and a third column 1203 representing values calculated for X at step 1103.

In the example shown in Figure 12, the word "acidity" has been added to the list of terms as shown at row 1204. A number of occurrences have been counted to find a value for I which, in this example, has turned out to be five hundred and three. Similarly a count value from excluded files has been performed to determine a value for X which, in this example, is one. Thus, the process has found that the word "acidity" provides strong evidence to the effect that a file containing this word should be included in the category. The noun "acidity" may therefore be selected as a definition term for inclusion in the rule base. At row 1206 the word "merger" has been considered. On this occasion, two hundred and fifty nine occurrences have been found within the included files but six hundred occurrences have been found in the excluded files. Thus, this word does not provide a very good candidate for determining whether a particular data file should be included within the category under consideration.

In the resulting association outline file, the individual entries are not merely present but are weighted and structured. Weightings are applied to individual entries which, within a branching structure, may be considered as the lowest level "leaves". A collection of leaves are branched together and a branching point may also include a weighting value. Branching of this type is illustrated in Figure 8 in which each individual lowest level term is accorded weights a, b, c, d and e. Weights for each term are calculated with reference to the frequency data as represented in the table shown in Figure 12.

Procedures for calculating weighting values W are illustrated in Figure 13. A probability P is calculated representing the probability of finding a term in the data files 602 included in the particular category. This is calculated by dividing the total term count I by the total number of files in the sample. In the example shown, for the term "acidity" five hundred and three occurrences of the term in the included category have been identified and this is divided by one thousand and seventeen representing the total number of files present, giving a probability value of 0.495.

A probability Q is calculated representing the probability of the term occurring in the excluded data files 601. Thus, in the example previously considered, the term "acidity" occurred only once in the excluded category from a total of one thousand and twenty-eight files. Thus, a probability value Q is calculated of 0.001. A variable D is introduced which, when used as an exponent, enhances the division between the active and inactive terms. The value of D may be fixed or may be made user selectable in a range between 0.5 and 1. A constant A is set to 0.001 so as to prevent division by zero and to restrict the range of results to lie within a range of zero to 1.000. An importance factor F is introduced which, typically, is set to a value of 2 for a title and a phrase and is set to a value of 1 for a word. These values can be adjusted such that, for particular data sets, greater emphasis may be placed on, for example, titles or phrases within the data files being considered. The value for W is calculated from a first product multiplied by A and I.

The first product is derived from the quotient of a numerator and a denominator. The numerator is calculated by adding constant A to the product of P and 1-Q raised to the power of D. Similarly, the denominator is calculated by adding constant A to the product of Q and 1-P, again raised to the power of D.
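The calculation can be written out as a short sketch under stated assumptions. The denominator uses Q multiplied by 1-P, as recited in claims 7, 17, 27 and 37. The final scaling of the quotient by the constant A and the importance factor F is an assumption of this sketch: F is used because it is the defined factor intended to emphasise titles and phrases, and scaling by A keeps W close to the stated range of zero to 1. The function name weighting and the choice D = 1 are illustrative only.

```python
def weighting(i_count, x_count, n_included, n_excluded, d=1.0, a=0.001, f=1.0):
    """Sketch of a weighting value W from included and excluded counts.

    Assumptions (not confirmed by the text): the quotient is scaled by
    a and f, and probabilities are capped at 1.0 so that fractional
    exponents never see a negative base.
    """
    p = min(i_count / n_included, 1.0)    # probability in included files
    q = min(x_count / n_excluded, 1.0)    # probability in excluded files
    numerator = a + (p * (1.0 - q)) ** d      # A + (P(1-Q))^D
    denominator = a + (q * (1.0 - p)) ** d    # A + (Q(1-P))^D
    return a * f * numerator / denominator


# Counts taken from the "acidity" and "merger" examples in the text.
w_acidity = weighting(503, 1, 1017, 1028)
w_merger = weighting(259, 600, 1017, 1028)
```

Under these assumptions "acidity" receives a weight a little above 0.3 while "merger" receives a weight far below 0.05, which matches the qualitative conclusion drawn for the two terms.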

After calculating value W as illustrated in Figure 13, W is processed non-linearly as illustrated in Figures 14A and 14B. A variable L represents the level of maximum importance which, empirically, is set to a value of between zero and 0.3. Variable T represents a lower threshold which again, empirically, is usually set to a value of 0.05. As shown in Figure 14A, if W is greater than L then W is set equal to L, thereby placing an upper bound on the weighting value. If W is smaller than T then W is set to zero, thereby removing many irrelevant terms which only occur very infrequently.
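A minimal sketch of this non-linear step is given below, using the example values L = 0.3 and T = 0.05. Weights below the lower threshold T are set to zero and weights above L are capped at L. The function name limit_weight is illustrative.

```python
def limit_weight(w, upper=0.3, lower=0.05):
    """Cap a weight at the maximum importance L and zero it below threshold T."""
    if w > upper:
        return upper   # upper bound on the weighting value
    if w < lower:
        return 0.0     # remove terms that occur very infrequently
    return w           # values between T and L pass through unchanged
```

For example, limit_weight(0.33) returns 0.3 and limit_weight(0.01) returns 0.0.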

The result of the non-linear processing illustrated in Figure 14A is shown graphically in Figure 14B. Output values for W are plotted against input values for W. As shown, values of W less than 0.05 result in the output for W being set to zero. Similarly, for values above 0.3 the output for W is set to 0.3. This response is illustrated by solid line 1401. In an alternative embodiment an output response is used with reference to a look-up table to produce a continuous result as illustrated by dotted output response 1402.

Process 606 for stem selection is detailed in Figure 15. Process 603, expanded in Figure 9, generates terms in the form of stems and identified noun phrases. Process 606 is concerned with selecting stems and casing rules that should be retained within the association file.

At step 1501 a term is selected and at step 1502 a question is asked as to whether the term is duplicated in a stem term. If this question is answered in the affirmative, a question is asked at step 1503 as to whether the weights are similar for the original term and for its subsequently generated stem term. If this question is answered in the affirmative both the stem term and the non-stem term are retained. Alternatively, if the question asked at step 1503 is answered in the negative, to the effect that the weights are not similar, a question is asked at step 1505 as to whether the non-stem term is weighted higher than the stem term. If this question is answered in the affirmative the stem term is excluded at step 1506. Alternatively, if the question asked at step 1505 is answered in the negative, the non-stem term is excluded at step 1507.

If the question asked at step 1502 is answered in the negative, control is directed to step 1508; control is also directed to step 1508 after the execution of steps 1504 and 1506. At step 1508 a question is asked as to whether another term is to be considered and when answered in the affirmative control is returned to step 1501 for the next term to be selected.
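The stem selection logic can be sketched as follows. The sketch assumes that weights are held in a mapping from term to W and that a second mapping gives the stem of each term. The similarity tolerance of 0.05 is a hypothetical value, since the similarity criterion is determined empirically, and the function name select_stem_terms is illustrative.

```python
def select_stem_terms(weights, stem_of, tolerance=0.05):
    """Return the set of terms retained after stem selection.

    weights: {term: W}; stem_of: {term: stem}. The tolerance used to judge
    whether two weights are "similar" is an assumed value for this sketch.
    """
    retained = set(weights)
    for term, stem in stem_of.items():                 # step 1501
        if term == stem or term not in weights or stem not in weights:
            continue                                   # step 1502: no duplicate
        if abs(weights[term] - weights[stem]) <= tolerance:
            continue                                   # step 1503: keep both
        if weights[term] > weights[stem]:              # step 1505
            retained.discard(stem)                     # step 1506
        else:
            retained.discard(term)                     # step 1507
    return retained


# Hypothetical weights for illustration only.
weights = {"acids": 0.08, "acid": 0.25, "merger": 0.10}
kept = select_stem_terms(weights, {"acids": "acid"})
```

With these illustrative weights the stem term "acid" is retained and the non-stem term "acids" is excluded.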

The casing selection aspect of process 606 is detailed in Figure 16. At step 1601 a term is selected and at step 1602 a question is asked as to whether the term is duplicated with different letter cases. If this question is answered in the affirmative, a question is asked at step 1603 as to whether the weights are similar and if this question is answered in the affirmative the cased term is included. Alternatively, if the question asked at step 1603 is answered in the negative, control is directed to step 1605 where a question is asked as to whether the uncased term is weighted higher than the mixed case term. If this question is answered in the affirmative the mixed case term is excluded at step 1606. Alternatively, if the question asked at step 1605 is answered in the negative, the uncased term is excluded at step 1607. If the question asked at step 1602 is answered in the negative control is directed to step 1608 and control is also directed to said step after the termination of processes 1604 and 1606. At step 1608 a question is asked as to whether another term is to be considered and when answered in the affirmative control is returned to step 1601 for the next term to be selected. Eventually, all of the terms will have been considered and the question asked at step 1608 will be answered in the negative.

Questions are asked at step 1503 and at step 1603 as to whether weights are similar. The extent to which weights may be considered similar is determined empirically and the similarity criteria may be adjusted for particular data sets.

The purpose of the procedures shown in Figure 15 and Figure 16 is to reduce the total number of terms present within the file, effectively making it more manageable and faster to execute. Examples of their effect are illustrated in Figure 17. The term "acids" at 1701 and stem term "acid" at 1702 have been included. The processing of the steps illustrated in Figure 15 may determine that the stem term "acid" has been given a significantly higher weighting than the non-stem term "acids". Under these circumstances the term "acids" could be excluded from the file. Similarly, the process illustrated in Figure 16 may determine that "Johnson" at 1703 has a significantly higher weighting than the word "Johnson" at 1704, resulting in the word "Johnson" at 1704 being excluded from the file.

Step 610 generates a rule base as an association outline file of the type illustrated in Figure 18. Each asterisk, such as asterisk 1801, represents a level of nesting. Thus, at the highest level, there is an entry representing the start of the title, at line 1802, and a further high level entry at line 1803 representing the start of the story.

In process 301 shown in Figure 3, training of the system is performed, so as to subsequently categorise files according to an organisation's preferred system of categorisation. A number of files are selected as a sample to represent files that will be included in the selected category, as identified in process 502 shown in Figure 5. Also, in process 503, as shown in Figure 5, sample files are identified which are to be excluded from the selected category. Typically a large number of files will be required in order to identify sufficiently the probabilities of terms occurring in these files, so that subsequently received data files may be analysed correctly.

Identifying files which are to be included or excluded from a specific category can be extremely time consuming when such a large number of files are required in order to achieve sufficiently accurate analysis results. Thus, in an alternative embodiment an additional analysis is performed for measuring the granularity of probability measurements. Thus, when the addition of a small number of sample files results in a significant modification in weight calculations, it may be understood that the number of sample files included in the sample set is insufficient to achieve a highly accurate result. However, a smaller number of sample files may be considered when the marginal effect of processing one more file becomes small.

Thus, by increasing the number of sample files until the granularity of weight calculation is reduced, a point at which the number of sample files has reached a significant threshold value may be identified. A distribution density may be calculated, where the preferred result is a Bradford/Zipf curve, which is smooth, and represents a highly accurate set of data samples. Alternatively, if such a distribution density of calculated weights is irregular, or significantly different from the optimal Bradford/Zipf curve, it may be inferred that additional sample files, both included in and excluded from the selected category, may need to be identified.
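One way to picture the granularity check is to recompute the weights as sample files are added and to watch how far any weight moves between successive calculations. The sketch below is an assumption about how such a check might be expressed: the text does not specify a metric, so the largest absolute change in any weight is used, and the tolerance of 0.01 and both function names are hypothetical.

```python
def max_weight_shift(previous, current):
    """Largest absolute change in any term weight between two calculations."""
    terms = set(previous) | set(current)
    return max(
        (abs(current.get(t, 0.0) - previous.get(t, 0.0)) for t in terms),
        default=0.0,
    )


def sample_is_sufficient(weight_history, tolerance=0.01):
    """Judge whether adding more sample files still changes the weights.

    weight_history is a list of {term: W} mappings, one per successive
    sample size. The tolerance is an assumed value for this sketch.
    """
    if len(weight_history) < 2:
        return False   # nothing to compare against yet
    shift = max_weight_shift(weight_history[-2], weight_history[-1])
    return shift <= tolerance


# Hypothetical weights for one term after three successive sample sizes.
history = [{"acidity": 0.20}, {"acidity": 0.29}, {"acidity": 0.295}]
enough = sample_is_sufficient(history)
```

Here the weight settles between the second and third calculation, so under the assumed tolerance the sample set would be judged large enough.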

Subsidiary processor 411 is detailed in Figure 19. The processor includes an Intel Pentium processing unit 1901 connected to sixty-four megabytes of randomly accessible memory 1902 via a PCI bus 1903. In addition, a local disk drive 1904 and an interface circuit 1905 are connected to bus 1903. Interface circuit 1905 communicates with the TCP/IP network 424.

Random Access Memory 1902 stores instructions executable by the processing unit 1901, in addition to storing input data files received from the data sources 101 to 103 and intermediate data. Operations performed on processing unit 1901 in response to instructions read from memory 1902 are identified in Figure 20.

At step 2001 temporary memory structures are cleared and at step 2002 an OTL description file is selected. At step 2003 an item in the OTL file is identified and at step 2004 a question is asked as to whether the item selected at step 2003 is a rule definition. If this question is answered in the affirmative, a rule object is defined at step 2005. Alternatively, if the question asked at step 2004 is answered in the negative, to the effect that the item is not a rule definition, a question is asked at step 2006 as to whether the item is a word definition. If this question is answered in the affirmative, a dictionary link is created at step 2007.

At step 2008 a question is asked as to whether the item is a label and when answered in the affirmative a new entry is created in a label list, whereafter, at step 2010, a question is asked as to whether another item is present. After executing step 2005 or after executing step 2007, control is directed to step 2010.

When the question asked at step 2010 is answered in the affirmative, to the effect that another item is present, control is returned to step 2003 and the next item is identified in the OTL file. Eventually, all of the items will have been identified resulting in the question asked at step 2010 being answered in the negative. Thereafter, at step 2011 a question is asked as to whether another OTL file is present and when answered in the affirmative control is returned to step 2002 allowing the next OTL description file to be selected. Thus, this process continues until all of the OTL files have been considered resulting in the question asked at step 2011 being answered in the negative.

For each OTL file considered, by being selected at step 2002, a rulebase is generated and a plurality of such rulebases are illustrated in Figure 21. Thus, a first OTL file processed in accordance with the procedure shown in Figure 20 results in the generation of a first rulebase 2101.

Typically, for a specific installation, in the order of three thousand rulebases would be generated by executing the procedures illustrated in Figure 20. Rulebases 2101 to 2109 are stored in memory 1902, which also provides storage space for a dictionary 2121, a label list 2122 and a data buffer 2123. The dictionary stores a list of words which have importance in any of the stored rulebases. Associated with each word in the dictionary there is at least one pointer and possibly many pointers to specific entries in specific rulebases 2101 to 2109.
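The relationship between the dictionary and the rulebases resembles an inverted index. The sketch below illustrates this under simplifying assumptions: each rulebase is reduced to a list of words, and a pointer is modelled as a pair of rulebase identifier and entry position. The function name build_dictionary and the sample words are hypothetical.

```python
from collections import defaultdict


def build_dictionary(rulebases):
    """Map each word to pointers into the rulebases that use it.

    rulebases: {rulebase_id: [word, ...]}. A pointer is modelled here as a
    (rulebase_id, entry_index) pair, which is an assumption of this sketch.
    """
    dictionary = defaultdict(list)
    for rulebase_id, entries in rulebases.items():
        for index, word in enumerate(entries):
            dictionary[word].append((rulebase_id, index))
    return dict(dictionary)


# Hypothetical rulebase contents, keyed by the reference numerals used above.
rulebases = {2101: ["acidity", "oil"], 2102: ["oil", "merger"]}
dictionary = build_dictionary(rulebases)
```

A word such as "oil" that matters to two rulebases then carries two pointers, one into each rulebase.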

The step for associating preferred terms with source files is detailed in Figure 22. At step 2201 central processing system 405 obtains access to one of the subsidiary processors 411 to 418. The central processing system then expects to receive authorisation so that communication may be effected with one of the subsidiary processors and, after this connection has been established, the source file is supplied to the selected subsidiary processor at step 2203. At step 2204 the data is processed to determine associated preferred terms.

After performing process 2204, the results are transmitted back to the central processing system at step 2205 and at step 2206 data with associated preferred terms is stored and data pointers associated with the preferred terms are updated at step 2207.

Step 2204 for the processing of data to determine associated preferred terms is detailed in Figure 23. The overall processing is broken down into three major phases, consisting of a triggering phase 2301, followed by a scoring phase 2302 and, finally, a list generation phase 2303.

Triggering phase 2301 is detailed in Figure 24. At step 2401 a section of the data, such as its title, market sector or main body of text, is identified and at step 2402 an item of the identified section is selected. At step 2403 a question is asked as to whether the item indicates a new context, which may be considered as a grammatical marker in the form of a full stop, capital letter, start of a sentence or quotation marks etc. When answered in the affirmative, new context information is supplied to all rulebases at step 2404 and control is then directed to step 2407.

If the question asked at step 2403 is answered in the negative, step 2404 is bypassed and a look-up address is obtained for rule objects in rulebases from the dictionary at step 2405. Thereafter, at step 2406 all addressed objects are triggered and a multiplication of scores is effected by a score weighting factor. Thereafter, at step 2407 a question is asked as to whether another item is present and when answered in the affirmative control is returned to step 2402. Eventually, all of the items for a selected section will have been considered resulting in the question asked at step 2407 being answered in the negative. Thereafter, at step 2408 a question is asked as to whether another section is to be considered and when answered in the affirmative control is returned to step 2401. At step 2401 the next section is identified and steps 2402 to 2408 are repeated. Eventually, all of the sections will have been considered and the question asked at step 2408 will be answered in the negative.

Scoring phase 2302 is detailed in Figure 25. At step 2501 a rulebase is selected and at step 2502 a score variable is re-set to zero. At step 2503 a branch is identified for score accumulation and at step 2504 scores are accumulated from triggered rules attached to the branch. At step 2505 a question is asked as to whether another branch is to be considered and when answered in the affirmative control is returned to step 2503. A next branch is selected at step 2503 with procedure 2504 being repeated. Eventually all of the branches will have been considered resulting in the question asked at step 2505 being answered in the negative.

At step 2506 an overall score in the range of zero to one hundred is stored for the rulebase and at step 2507 a question is asked as to whether another rulebase is present. When answered in the affirmative, control is returned to step 2501 and steps 2501 to 2507 are repeated. Eventually, all of the rulebases will have been considered and the question asked at step 2507 will be answered in the negative.

Phase 2303 for the generation of a list of associated preferred terms is detailed in Figure 26. At step 2601 a rulebase is identified having a score greater than a predetermined threshold. Thus, for a particular application, a threshold may be set at forty-eight percent. At step 2602 additional triggered preferred data characteristics are identified by associating successful rulebases with parent categorisations by rulebase links.

At step 2603 lists of successful and inferred rulebases are combined to form overall lists of preferred data characteristics. Step 2603 results in data being generated by a subsidiary processor, such as processor 411, which is then supplied back to the central processing system 405.
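Steps 2601 to 2603 can be sketched as follows. The sketch assumes that each rulebase has a score between zero and one hundred, that rulebase links are held as a mapping from a rulebase to its parent categorisation, and that links are followed up through every ancestor. The transitive walk, the function name preferred_terms and the sample category names are assumptions made for illustration.

```python
def preferred_terms(scores, parents, threshold=48):
    """Combine successful rulebases with those inferred from parent links.

    scores: {rulebase: score}; parents: {rulebase: parent_rulebase}.
    Following the links through every ancestor is an assumption here.
    """
    # Step 2601: rulebases scoring above the predetermined threshold.
    successful = [name for name, score in scores.items() if score > threshold]
    inferred = []
    for name in successful:                       # step 2602
        parent = parents.get(name)
        while parent is not None and parent not in successful and parent not in inferred:
            inferred.append(parent)
            parent = parents.get(parent)
    return successful + inferred                  # step 2603


# Hypothetical scores and links for illustration only.
scores = {"OIL_INDUSTRY_NETHERLANDS": 72, "BANKING": 30}
parents = {"OIL_INDUSTRY_NETHERLANDS": "OIL_INDUSTRY", "OIL_INDUSTRY": "ENERGY"}
terms = preferred_terms(scores, parents)
```

In this illustration the one successful rulebase brings its parent and grandparent categorisations into the combined list, while the rulebase scoring below the threshold is left out.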

Central processing system 405 is responsible for constructing a table of the type shown in Figure 27 in which an entry is present for each preferred term. The specific preferred terms are stored in column 2701 and, for each of these terms, column 2702 defines a specific pointer to a position in memory associated with the central processing system 405. Specific data files are identified by file names and the number of files associated with each preferred term is variable, depending on the nature and the amount of input data being considered. Thus, in order for this data to be accessible quickly while optimising use of the storage capacity within the central processing system 405, an indication of the file name is stored in the form of a linked list, as illustrated in Figure 28.

The preferred term "OIL_INDUSTRY" has been linked to a pointer 0F8912. Address 0F8912 is the first in column 2801 of the linked list. Column 2802 identifies a particular file name and column 2803 identifies the next pointer in the list. Thus, entry 0F8912 points to a particular file with the file name "OIL_INDUSTRY_NETHERLAND_3" with a further pointer to memory location 0F8A20. At memory location 0F8A20 a new file name is provided, illustrated at column 2802, and again a new pointer is present at column 2803. Eventually, all relevant files will have been considered and the end of this list is identified by address 000000 in the pointer location in column 2803.
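Traversal of such a linked list can be sketched as follows. Memory is modelled as a mapping from address to a pair of file name and next pointer, which is an assumption of this sketch. The file name stored at address 0F8A20 is not given in the text, so a hypothetical placeholder is used, and the function name files_for_term is illustrative.

```python
END_OF_LIST = 0x000000   # address marking the end of the list


def files_for_term(term, term_table, memory):
    """Follow the linked list for a preferred term and collect file names."""
    names = []
    address = term_table[term]            # pointer stored against the term
    while address != END_OF_LIST:
        file_name, next_address = memory[address]
        names.append(file_name)
        address = next_address            # move to the next entry in the list
    return names


term_table = {"OIL_INDUSTRY": 0x0F8912}
memory = {
    0x0F8912: ("OIL_INDUSTRY_NETHERLAND_3", 0x0F8A20),
    0x0F8A20: ("EXAMPLE_FILE_2", END_OF_LIST),   # hypothetical second entry
}
names = files_for_term("OIL_INDUSTRY", term_table, memory)
```

Because each entry stores only a file name and one pointer, the number of files associated with a preferred term can grow without reserving a fixed amount of table space per term.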

In an active system, the database 423 will be continually updated and users will continually be given access to the database, all under the control of the central processing system 405.

Procedures for performing a search in response to a user request are detailed in Figure 29. At step 2901 a user logs onto the system and at step 2902 a search method is identified. At step 2903 search criteria are defined and at step 2904 the search criteria are processed to determine preferred terms. At step 2905 a list of preferred terms is supplied to the central processing system 405.

At step 2906 a question is asked as to whether a host has responded and when answered in the affirmative titles of associated data files are displayed at step 2907.

At step 2908 a question is asked as to whether the user wishes to view identified data and when answered in the affirmative the data is viewed after being downloaded over the communication channel at step 2909.

At step 2910 a question is asked as to whether another search is to be performed and when answered in the affirmative control is returned to step 2902.

Step 2902 requires the search method to be identified and in order to achieve this a user is prompted by a screen display of the type shown in Figure 30. Thus, a plurality of text boxes are presented to the user inviting the user to specify a particular search method.

Step 2903 for the defining of search criteria results in the user being prompted by a screen of the type shown in Figure 31. Terms providing the basis for the user's search are displayed in a window 3101. Preferred terms are displayed in upper case characters, such as the entry shown at position 3102.

The displaying of titles of associated files at step 2907 results in the user seeing information displayed of the type illustrated in Figure 32. Each entry, such as entry 3201, includes a check box 3202. Check boxes 3202 allow a particular item to be selected by a user such that the actual information file may be supplied to the user from the central database over a communication channel.

Claims

What We Claim Is:

1. Apparatus for generating association files used to determine whether a data file belongs to a predetermined category, comprising processing means and storage means, wherein said storage means stores a plurality of included files belonging to said category and a plurality of excluded files not belonging to said category; and said processing means is configured to:

(a) process included files read from said storage means to identify candidate terms for an association file, and to

(b) assess the suitability of said candidate terms with reference to occurrences in said included files and occurrences in said excluded files so as to provide definition terms for an association file.

2. Apparatus according to claim 1, wherein said processing means is configured to determine weighting values by processing a first probability of getting a term in included files with a second probability of getting a term in excluded files.

3. Apparatus according to claim 2, wherein said processing means is configured to process said probabilities, whereby a first function of said first probability is divided by a second function of said second probability.

4. Apparatus according to claim 3, wherein said processing means is configured such that said functions include the addition of a range restricting constant.

5. Apparatus according to claim 2, wherein said processing means is configured such that said first function is derived by multiplying said first probability by one minus said second probability to produce a first product.

6. Apparatus according to claim 5, wherein said processing means is configured such that said first product is raised to the power of an enhancement exponent.

7. Apparatus according to claim 3, wherein said processing means is configured such that said second function is derived by multiplying said second probability by one minus said first probability.

8. Apparatus according to claim 7, wherein said processing means is configured such that said second product is raised to the power of an enhancement exponent.

9. Apparatus according to any of claims 1 to 8, wherein said processing means is configured to remove candidates having a weighting value that falls below a predetermined threshold.

10. Apparatus according to any of claims 1 to 9, wherein said processing means is configured to restrain weighting values at a maximum allowable level.

11. A method of generating association files suitable for determining whether a data file belongs to a predetermined category, wherein a plurality of included files belonging to said category are stored in combination with a plurality of excluded files not belonging to said category, said method comprising the steps of: processing included files to identify candidate terms for an association file; and assessing the suitability of said candidate terms with reference to occurrences in said included files and occurrences in said excluded files so as to provide definition terms for an association file.

12. A method according to claim 11, wherein weighting values are determined by processing a first probability of getting a term in included files with a second probability of getting said term in excluded files.

13. A method according to claim 12, wherein a first function of said first probability is divided by a second function of said second probability.

14. A method according to claim 13, wherein said functions include the addition of a range restricting constant.

15. A method according to claim 12, wherein said first function is derived by multiplying the first probability by one minus the second probability to produce a first product.

16. A method according to claim 15, wherein said first product is raised to the power of an enhancement exponent.

17. A method according to claim 13, wherein said second function is derived by multiplying said second probability by one minus said first probability.

18. A method according to claim 17, wherein said second product is raised to the power of an enhancement exponent.

19. A method according to any of claims 11 to 18, wherein candidates having a weighting value below a predetermined threshold are removed.

20. A method according to any of claims 11 to 19, wherein said weighting values are restrained up to a maximum level.

21. A computer system programmed to execute stored instructions such that in response to said stored instructions said system is configured to: process a plurality of included files associated with a category to identify candidate terms for an association file; and to assess the suitability of said candidate terms with reference to occurrences in said included files and with reference to occurrences in a plurality of excluded files not associated with said category so as to provide definition terms for an association file.

22. A computer system programmed to execute stored instructions according to claim 21, configured to determine weighting values by processing a first probability of getting a term in included files with a second probability of getting a term in excluded files.

23. A computer system programmed to execute stored instructions according to claim 22, configured to process said probabilities, wherein a first function of said first probability is divided by a second function of said second probability.

24. A computer system programmed to execute stored instructions according to claim 23, configured such that said functions include the addition of a range restricting constant.

25. A computer system programmed to execute stored instructions according to claim 22, configured such that said function is derived by multiplying said first probability by one minus said second probability to produce a first product.

26. A computer system programmed to execute stored instructions according to claim 25, configured such that said first product is raised to the power of an enhancement exponent.

27. A computer system programmed to execute stored instructions according to claim 23, configured such that said second function is derived by multiplying said second probability by one minus said first probability.

28. A computer system programmed to execute stored instructions according to claim 27, configured such that said second product is raised to the power of an enhancement exponent.

29. A computer system programmed to execute stored instructions according to any of claims 21 to 28, configured to remove candidates having a weighting value that falls below a predetermined threshold.

30. A computer system programmed to execute stored instructions according to any of claims 21 to 29, configured to restrain weighting values at a maximum allowable level.

31. A computer-readable medium having computer-readable instructions executable by a computer such that, when executing said instructions, a computer performs the steps of: processing a plurality of included files associated with a category to identify candidate terms for an association file; and assessing the suitability of said candidate terms with reference to occurrences in said included files and with reference to occurrences in a plurality of excluded files not associated with said category so as to provide definition terms for an association file.

32. A computer-readable medium having computer-readable instructions according to claim 31, such that when executing said instructions a computer will also perform the step of determining weighting values by processing a first probability of getting a term in included files with a second probability of getting said term in said excluded files.

33. A computer-readable medium having computer-readable instructions according to claim 32, such that when executing said instructions a computer will also perform the step of dividing a first function of said first probability by a second function of said second probability.

34. A computer-readable medium having computer-readable instructions according to claim 33, such that when executing said instructions a computer will also perform the step of including a range restricting constant in said functions.

35. A computer-readable medium having computer-readable instructions according to claim 32, such that when executing said instructions a computer will also perform the step of multiplying the first probability by one minus the second probability to produce a first product for said first function.

36. A computer-readable medium having computer-readable instructions according to claim 35, such that when executing said instructions a computer will also perform the step of raising said first product to the power of an enhancement exponent.

37. A computer-readable medium having computer-readable instructions according to claim 33, such that when executing said instructions a computer will also perform the step of multiplying said second probability by one minus said first probability to derive said second function.

38. A computer-readable medium having computer-readable instructions according to claim 37, such that when executing said instructions a computer will also perform the step of raising said second product to the power of an enhancement exponent.

39. A computer-readable medium having computer-readable instructions according to any of claims 31 to 38, such that when executing said instructions a computer will also perform the step of removing candidates having a weighting value below a predetermined threshold.

40. A computer readable medium having computer-readable instructions according to any of claims 31 to 39, such that when executing said instructions a computer will also perform the step of restraining said weighting values up to a maximum level.