GB2336700A

GB2336700A - Generating machine readable association files

Info

Publication number: GB2336700A
Application number: GB9808808A
Authority: GB
Inventors: Rachel Hammond; Llewelyn Ignazio Fernandes
Original assignee: Dialog Corp PLC
Current assignee: Dialog Corp PLC
Priority date: 1998-04-24
Filing date: 1998-04-24
Publication date: 1999-10-27
Also published as: EP1073974A1; WO1999056222A1; GB9808808D0

Abstract

An association file is represented in the form of an outline file. A data file is processed in combination with an outline file to produce an association identifying the data file as being relevant to a particular category. The association file is generated by selecting a plurality of sample files (612) which are associated to the category. In addition, sample files (611) are selected that are not associated to the category. Preferred term candidates are identified by a process of syntactic analysis (602) from the files (612) associated to the category. Weighting values are calculated for the candidates with reference to the files (611) not associated to the category. Preferred terms are applied to an association file by analysing the resulting weighting values.

Description


GENERATING MACHINE READABLE ASSOCIATION FILES

The present invention relates to generating machine readable association files, to facilitate associating files of machine readable data with information types.

Introduction

Traditionally, database technology has been dedicated to the organising of numerical and tabular data and it is only recently, particularly with the expansion of the Internet, that demand has grown for the retrieval of text-based files. Several facilities are available on the Internet, commonly referred to as "search engines", which assist in the location of information.

The majority of these operate by performing what has become known as "free text" searching, in which a user specifies words which they believe are contained within the target file as a mechanism for instructing the system to retrieve files of interest.

Problems with this technique are well known to users of the available search engines, particularly over the Internet. A simple enquiry can generate hundreds of thousands of "hits", the majority of which will tend to be totally irrelevant to the user's needs. Furthermore, other relevant files may be missed simply because they do not contain the specific chosen words.

Procedures for classifying volumes of data so as to facilitate subsequent searching are known but the classification process often involves manual intervention thereby making it time consuming and prone to human error. Furthermore, except in circumstances where the documentation is considered to be extremely valuable and will be searched over a significant period of time, the cost of performing this manual exercise cannot be justified in terms of the commercial worth of the data sources being considered.

Similarly, such a problem would result in much data being effectively inaccessible and outside the realm of searchable knowledge.

Procedures are known for processing a data file so as to determine whether the data file should be associated with a particular information category. The known process requires a machine readable association file, also known as an outline file and usually identified by a file extension OTL. In this way, it is possible for the incoming data file to be processed with reference to one or many outline files, whereupon each outline file produces a numerical score value defining an extent to which the data file is relevant to the association categories, whereafter a decision may be made on the basis of a threshold comparison.

In practical systems, thousands of such outline files would be required in order to provide a useful level of categorisation. Furthermore, new outline files may be required on a regular basis therefore the generation of outline files in itself becomes a time consuming and specialised procedure. Thus, although procedures are known for the automated and industrially applicable processing of data files against outline files in order to generate category association, a problem exists in terms of defining similar procedures for the automatic generation of the machine readable association files themselves.

Summary of The Invention

According to a first aspect of the present invention, there is provided a method of generating a machine readable association file, to facilitate associating files of machine readable data with information categories, wherein a plurality of data files have been selected as being associated with an information category and a plurality of data files have been selected as not being associated with said information category, comprising steps of identifying preferred term candidates from said associated files; weighting said candidates with reference to said files not associated with said category; and applying preferred terms to an association file by analysing said weighting values.

The present invention provides an industrially applicable solution to the generation of outline files, by processing data files which have already been categorised, in which a proportion of the data files are associated with a category and a proportion of the data files are not associated with the category.

In a preferred embodiment, weighting values are determined by processing a first probability of getting the term in data files included in the category with a second probability of getting the term in data files excluded from the category. Preferably, a first function of said first probability is divided by a second function of said second probability. The functions may include the addition of a range restricting constant.

In a preferred embodiment, the first function is derived by multiplying the first probability by one minus the second probability to produce a first product and this first product may be raised to the power of an enhancement exponent. Preferably, the second function is derived by multiplying the second probability by one minus the first probability to produce a second product and this second product may be raised to the power of an enhancement exponent.

According to a second aspect of the present invention, there is provided apparatus configured to generate machine readable association files, to facilitate associating files of machine readable data with information categories, wherein a plurality of data files have been selected as being associated with an information category and a plurality of data files have been selected as not being associated with said information category, comprising means configured to identify preferred term candidates from said associated files; calculating means configured to calculate weighting values for said candidates with reference to said files not associated with said category; and applying means for applying preferred terms to an association file by analysing said weighting values.

Brief Description of The Drawings

Figure 1 shows a data distribution environment, in which data is received from a plurality of sources;
Figure 2 details the data processing environment identified in Figure 1;
Figure 3 details operations performed by the data processing facility shown in Figure 2;
Figure 4 details a central data processing facility;
Figure 5 identifies procedures for training the system shown in Figure 4;
Figure 6 details procedures for the generation of an outline file;
Figure 7 shows an example of a sampled data file;
Figure 8 shows an example of selected noun phrases;
Figure 9 details the term generation process identified in Figure 6;
Figure 10 shows an example of processed noun phrases;
Figure 11 details procedures for the frequency counting process identified in Figure 6;
Figure 12 shows a table representing stored values;
Figure 13 details procedures for calculating weighting values;
Figures 14A and 14B illustrate procedures for the non-linear processing of the weighting value determined in accordance with the procedure shown in Figure 13;
Figure 15 details procedures for stem selection, identified in Figure 6;
Figure 16 details procedures for casing selection, identified in Figure 6;
Figure 17 illustrates the effect of the procedures shown in Figures 15 and 16;
Figure 18 illustrates an association outline file;
Figure 19 illustrates a subsidiary processor of the type shown in Figure 4;
Figure 20 illustrates operations performed on the processing unit shown in Figure 19;
Figure 21 illustrates the use made for the memory storage illustrated in Figure 19;
Figure 22 details the process for associating preferred terms, identified in Figure 3;
Figure 23 details procedures for the processing of data identified in Figure 22;
Figure 24 details a triggering phase identified in Figure 23;
Figure 25 details a scoring phase identified in Figure 23;
Figure 26 details a list generation phase identified in Figure 23;
Figure 27 illustrates a table of the type generated by the processing system shown in Figure 4;
Figure 28 illustrates a linked list of the type generated by the central processing system shown in Figure 4;
Figure 29 details procedures for performing a search in response to a user selection, as identified in Figure 3;
Figure 30 illustrates a screen display inviting a user to initiate a search;
Figure 31 illustrates a screen display prompting a user to define search criteria; and
Figure 32 illustrates a screen display of titles returned to a user identifying data files.

Detailed Description of The Preferred Embodiments

The invention will now be described by way of example only with reference to the previously identified drawings.

A data distribution environment is illustrated in Figure 1, in which data received from a plurality of sources, such as sources 101, 102 and 103, is supplied to an electronic data distribution network 104. A plurality of organisations, including organisation 105, organisation 106 and organisation 107, make use of data received from data sources 101 to 103 in various ways. However, the amount of data generated by the data sources 101 to 103 is very large and the actual volume of information required by the organisations 105 to 107 tends to be a relatively small sub-set of the total volume of data available from the data sources 101 to 103. Consequently, procedures for sorting, filtering and selecting data are required in order for organisations 105 to 107 to be provided with the information that they require, derived from a significantly larger volume of available data.

In the environment shown in Figure 1, all of the information available from the electronic data distribution network 104 is made available to organisations 105 to 107. Organisations 105 and 106 receive data directly from electronic data distribution network 104 via dedicated communication channels 108 and 109 respectively. Organisation 107, which may be a non profit making organisation such as an educational establishment etc, receives information from network 104 via the Internet 110. Organisations 105 and 106 also receive data via the Internet 110 over connections 111 and 112.

For the purposes of this illustration, it is assumed that organisation A is a large multinational company, such as an oil company, primarily interested in information relating to the oil industry. Organisation B is a financial institution primarily interested in information relating to the transfer of funds and the value of assets etc. Organisation C is an educational institution and has very wide interests in many academic fields of endeavour.

Each organisation defines categories of information representing fields of information where an interest has been shown. Each category has its own category definition, allowing an association process to be performed which associates files of machine readable data, received from the electronic data distribution network, with categories or information types. The present invention provides for the automatic and industrially applicable generation of these association files by a technical process.

Category definitions are generated by means of a training process. A human trainer selects files which fall within a particular category. It is not necessary for the trainer to explain why a file is appropriate to a particular category given that the process will perform a learning operation and use the information that it has learnt so as to select what it considers as similar files from incoming data files as they are received. Thus, a human trainer selects files as belonging to a particular category. The technical process of the present invention analyses these files in order to generate category definitions. These category definitions are then used to analyse new data files so that they may be associated to appropriate categories in an automated way. Thus, it could be possible for each of organisations 105, 106 and 107 to use the same or similar category name. However, given that the training process will be performed specifically for that particular organisation, the actual association files could specify very different association processes, resulting in very different collections of information being derived possibly from the same source data.

In this way, an organisation's inherent "feel" for the type of information that it finds interesting has been encoded and represented in machine readable form so as to facilitate the automated selection of information under a technical process.

A data processing environment for organisation 105 is detailed in Figure 2. Data received from transmission lines 108 and 111 are supplied via respective gateways 201 and 202 to a central data processing facility 203.

The central data processing facility makes information available to individual users, illustrated by user terminals 205 to 213. Terminals 205 to 208 are shown connected to a local network 214 which is in turn connected to the central data processing facility 203 via an interface 215. Similarly, terminals 209 to 213 are connected to a local network 216 which is again connected to the central data processing facility via interface 217. In general, hundreds of local networks could be connected to the central data processing facility, each having tens of user terminals connected thereto.

In addition to the direct connection shown in Figure 2, other parts of the organisation may receive information from the central data processing facility 203 via Internet connections or by other communication means. Thus, facilities may be provided, for example for allowing remote laptop computers to communicate with the central data processing facility via GSM telephones etc.

Organisation 105 also has the capability for transferring eye readable data into machine readable data via an image scanner 218 and an optical character recognition terminal 219. Furthermore, new data may be generated internally and distributed within the organisation.

Over time, data processed by facility 203 will attain a high level of value. Consequently, data processed by the facility is periodically supplied to an off-site data archive 220.

Operations performed by the central data processing facility 203 are illustrated in Figure 3. At step 301 the system is trained to categorise incoming data files according to the organisation's preferred system of categorisation. At step 302 all existing files held by the facility are categorised in accordance with the system determined at step 301. Thereafter, the processing facility continually categorises new incoming data files as they are received, as illustrated at step 303, and supplies categorised information to users as and when the information is requested, as illustrated at step 304.

Central data processing facility 203 is detailed in Figure 4. Data signals from data sources 101 to 103 are supplied to an input interface 401 via data input lines 402. Similarly, output data signals are supplied to users 111 to 117 via an output interface 403 and output wires 404. Input interface 401 and output interface 403 communicate with a central processing system 405 based on DEC Alpha integrated circuitry. The central processing system 405 also communicates with other processing systems in a distributed processing architecture. The processing facility 203 includes eight Intel circuit based processing systems 411 to 418, each implementing instructions under the control of conventional operating systems such as Windows NT.

An operator communicates with facility 203 by means of an operator terminal having a visual display unit 421 and a manually operable keyboard 422. Data files received from sources 101 to 103 are written to bulk storage devices 423 arranged as an array of magnetic disks. Data files are written to the disk array 423 after these files have been associated with preferred terms thereby categorising the incoming data files as shown at step 303. The association processes are performed by the subsidiary processors 411 to 418.

The central processing system 405 communicates with subsidiary processors 411 to 418 via an Ethernet connection 424 allowing processing requirements to be distributed between processors 411 to 418. Having addressed a subsidiary processor, such as processor 411, the transferring of data to the addressed processor is performed. Each individual incoming data file is supplied exclusively to one of the subsidiary processors. The selected subsidiary processor is then responsible for performing the association process, to identify preferred terms relevant to that particular data file so as to define categories relevant for association with the file. These associations with categories or preferred terms are added as additional data to the file itself and a file processed in this way is referred to as an associated data file.

Thus, after performing the association process, the associated data file is returned to the central processing system 405 which is then responsible for writing the associated data file to the disk array 423.

In this way, it is possible to scale the degree of processing capacity provided by facility 203 dependent upon the volume of data files requiring categorisation. The central processing system 405 also maintains a table of preferred terms, pointing to particular associated data files which have been identified as relevant to each of the preferred terms.

Step 301 for the training of the system to categorise incoming data is detailed in Figure 5. At step 501 a category is selected for which an association file, also referred to as an outline file, is to be generated. At step 502 a human trainer identifies sample files to be included in the category selected at step 501 and at step 503 the human trainer identifies sample files to be excluded from the category selected at step 501.

At step 504 processor 405 generates an outline file that defines the characteristics of the files included in the selected category. Thereafter, a question is asked at step 505 as to whether another category is to be considered and when answered in the affirmative control is returned to step 501, allowing the next category (for which an outline file is to be generated) to be selected.

Eventually, all of the categories will have been considered and the question asked at step 505 will be answered in the negative. Thereafter, data structures are initialised by parsing the outline files generated by the repeated implementation of step 504.

The technical process, of the present embodiment, is provided with a category title and a plurality of files which are considered to be associated to said category. The requirement is to produce association files or outline files in machine readable form, suitable for providing the basis for analysing new incoming data files so as to determine whether said new incoming data files should be associated with particular categories. The technical process is achieved by firstly selecting files that are associated with a defined type. In addition, files are selected that are not associated with the defined type. The technical process is then configured to analyse the interrelationship between the selected files so as to provide the association data required in order to produce an association file.

Step 504 for the generation of an outline file that defines the characteristics of data files included in the selected category is detailed in Figure 6. Sample data files excluded from the category are illustrated generally at 611 and, similarly, sample data files included in the category are shown generally at 612. It is not necessary for the same number of files to be included in each of samples 611 and 612 but, preferably, in the region of one thousand files should be included for each of these types.

Data files received from the data distribution environment 104 tend to consist of two components, namely the title of the file and its associated detail, referred to herein as the "story". Consequently, the outline files are configured to process the title data and the story data differently, therefore it is necessary to take account of this difference when creating an outline file.

At step 601 a selection is made as to whether titles are being considered or story sections are being considered. The process is configured such that titles for all of the sample files are considered first, followed by the story sections for all of the sampled files.

At step 602 a syntactic analysis is performed on sample files 612 only; that is to say, the syntactic analysis at step 602 is only performed on sample files included within the category.

At step 603 a term generation process is performed followed by a frequency counting process at step 604. The frequency counting process considers both sample 611 and sample 612; that is to say, it considers both files included in the category along with files excluded from the category.

At step 605 a weight calculation process is performed followed by a stem and casing selection process at step 606.

At step 607 a question is asked as to whether the story section has been processed which, on the first iteration, will be answered in the negative.

Consequently, control is returned to step 601 resulting in story sections of the files being selected at said step. Thereafter, processes 602 to 606 are repeated, followed by the question asked at step 607 being answered in the affirmative, to the effect that the story sections have now been processed. At step 608 a rule base is generated as an association or outline (OTL) file.

An example of a sampled data file is shown at 701 in Figure 7. The data file includes a title section 702 along with a story section 703.

Step 602 for performing the syntactic analysis is primarily concerned with identifying noun phrases which occur within the file being considered. The noun phrases identify the informational content within the data file without being constrained by grammatical representation. The noun phrases are selected by use of a C library routine for syntactic parsing, such as "LINGUIST X".

An example of noun phrases selected by process 602, when executed upon a typical sample file, is illustrated in Figure 8. The process has identified the noun phrases "oil industry", "acquisitions" and "oil industry acquisitions". These are structured within the file as shown in Figure 8. This represents noun phrases derived from a single file but, having processed typically a thousand files in this way, a very large list of noun phrases is produced.

The list of noun phrases is processed at step 603 to generate a list of terms. The list of terms is larger than the list of noun phrases because the noun phrases identified at step 602 are processed to generate stem representations in addition to the noun phrases themselves. Term generation process 603 is detailed in Figure 9.

At step 901 a noun phrase is selected and at step 902 a question is asked as to whether the noun phrase contains a word stem. If this question is answered in the affirmative, a stem rule is generated at step 903; alternatively step 903 is bypassed.

At step 904 a question is asked as to whether the noun phrase contains mixed upper and lower case characters. If this question is answered in the affirmative the rule is made case specific at step 905. At step 906 a question is asked as to whether another noun phrase is to be considered and when answered in the affirmative the next noun phrase is selected at step 901.

Stem rules are generated at step 903 by means of a dictionary type look-up table. Words within the noun phrases are supplied to the dictionary look-up table which will then determine whether said words are contained within the dictionary. If a word is identified within the dictionary, the process returns its appropriate stem thereby allowing the stem rule to be generated at step 903. Alternatively, results may be produced by execution of an appropriation function.

If a noun phrase is identified containing all lower case characters, it is assumed that the words are of interest irrespective of their case representation. However, if phrases are identified that include upper and lower case characters, the case representation may be relevant, therefore a case specific selection rule is generated at step 905.

Examples of noun phrases processed in order to produce terms are illustrated in Figure 10. In the example, the noun phrase "acidity" has been identified, as shown at 1001. This word is identified as containing a stem at step 902, resulting in the stem "acid" being included in the list of terms as illustrated at 1002. Similarly, the word "development" was identified as a noun phrase, as shown at 1003, resulting in the inclusion of the stem "develop" as shown at 1004.

The noun phrase "Doctor F Bloggs" has been identified and, given that upper case characters were found in the original data at step 904, the upper case characters are retained in the list of terms, thereby generating a case specific selection rule.
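The term generation of Figure 9 may be sketched as follows. This is a minimal illustration only and not the implementation of the embodiment: the table `STEMS`, the function name `generate_terms` and the representation of a term as a `(text, case_specific)` pair are assumptions introduced for the example, and the dictionary look-up of step 903 is reduced to a small hard-coded table. Any upper case character is treated here as making the rule case specific.

```python
# Hypothetical stem dictionary standing in for the look-up table of step 903.
STEMS = {"acidity": "acid", "development": "develop"}


def generate_terms(noun_phrases):
    """Return a list of (text, case_specific) terms for the given noun phrases."""
    terms = []
    for phrase in noun_phrases:  # step 901: select a noun phrase
        # Step 904/905: a phrase containing upper case characters gives a
        # case specific rule (assumed test: the phrase differs from its
        # lower case form).
        case_specific = phrase != phrase.lower()
        terms.append((phrase, case_specific))
        # Steps 902/903: if any word has a dictionary stem, add a stem term.
        words = phrase.split()
        stemmed = [STEMS.get(word.lower(), word) for word in words]
        if stemmed != words:
            terms.append((" ".join(stemmed), case_specific))
    return terms


terms = generate_terms(["acidity", "development", "Doctor F Bloggs"])
for text, case_specific in terms:
    print(text, case_specific)
```

With the three noun phrases of Figure 10 as input, the sketch yields the two original words, their stems "acid" and "develop", and a single case specific entry for "Doctor F Bloggs".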

Step 604 for performing the frequency counting process is detailed in Figure 11. At step 1101 a term is selected and at step 1102 the number of occurrences present within the included sample of files is counted and stored as a variable I, representing the number of occurrences of the term selected at step 1101 in the sample of included files 612.

At step 1103 the number of occurrences of the term selected at step 1101 existing within the excluded files 611 is counted and stored as a variable X. Variables I and X for each selected term are stored and at step 1104 a question is asked as to whether another term is to be considered.

When answered in the affirmative control is returned to step 1101 and the next term is selected. Eventually all of the terms will have been considered and the question asked at step 1104 will be answered in the negative.

A table representing stored values for I and X, calculated by repeated execution of processes 1102 and 1103, is illustrated in Figure 12.

The table includes a first column 1201 representing the term selected at step 1101, a second column 1202 representing values for I calculated at step 1102 and a third column 1203 representing values calculated for X at step 1103.

In the example shown in Figure 12, the word "acidity" has been added to the list of terms as shown at row 1204. A number of occurrences have been counted to find a value for I which, in this example, has turned out to be five hundred and three. Similarly, a count has been performed on the excluded files to determine a value for X which, in this example, is one. Thus, the process has found that the word "acidity" provides strong evidence to the effect that a file containing this word should be included in the category.

At row 1206 the word "merger" has been considered. On this occasion, two hundred and fifty nine occurrences have been found within the included files but six hundred occurrences have been found in the excluded files. Thus, this word does not provide a very good candidate for determining whether a particular data file should be included within the category under consideration.
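The frequency counting of Figure 11 may be sketched as follows. The function name `count_terms`, the dictionary used for the table and the sample texts are assumptions made for the illustration. The sketch also assumes that each count is the number of files containing the term, matched without regard to case, so that the count divided by the sample size may later be treated as a probability; the description does not state whether repeated occurrences within one file are counted separately.

```python
def count_terms(terms, included_files, excluded_files):
    """Return {term: (I, X)}, the number of files containing each term."""
    table = {}
    for term in terms:  # step 1101: select a term
        needle = term.lower()
        # Step 1102: count within the sample of included files.
        i_count = sum(needle in text.lower() for text in included_files)
        # Step 1103: count within the sample of excluded files.
        x_count = sum(needle in text.lower() for text in excluded_files)
        table[term] = (i_count, x_count)
    return table


# Toy samples, invented for this example only.
included = ["Acidity rises in crude oil", "Oil merger announced", "acidity tests"]
excluded = ["Bank merger approved", "Merger talks stall", "Rates held"]

table = count_terms(["acidity", "merger"], included, excluded)
for term, (i_count, x_count) in table.items():
    print(term, i_count, x_count)
```

Each entry of the resulting table corresponds to one row of the table of Figure 12, holding the term together with its values for I and X.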

In the resulting association outline file, the individual entries are not merely present but are weighted and structured. Weightings are applied to individual entries which, within a branching structure, may be considered as the lowest level "leaves". A collection of leaves are branched together and a branching point may also include a weighting value. Branching of this type is illustrated in Figure 18 in which each individual lowest level term is accorded weights a, b, c, d and e. Weights for each term are calculated with reference to the frequency data as represented in the table shown in Figure 12.

Procedures for calculating weighting values W are illustrated in Figure 13. A probability P is calculated representing the probability of finding a term in the data files 612 included in the particular category. This is calculated by dividing the total term count I by the total number of files in the sample. In the example shown, for the term "acidity", five hundred and three occurrences of the term in the included category have been identified and this is divided by one thousand and seventeen, representing the total number of files present, giving a probability value of 0.495. A probability Q is calculated representing the probability of the term occurring in the excluded data files 611. Thus, in the example previously considered, the term "acidity" occurred only once in the excluded category from a total of one thousand and twenty-eight files. Thus, a probability value Q is calculated of 0.001.

A variable D is introduced which, when used as an exponent, enhances the division between the active and inactive terms. The value of D may be fixed or may be made user selectable in a range between 0.5 and 1. A constant A is set to 0.001 so as to prevent division by zero and to restrict the range of results to lie within a range of zero to 1.000.

An importance factor F is introduced which, typically, is set to a value of 2 for a title and a phrase and is set to a value of 1 for a word. These values can be adjusted such that, for particular data sets, greater emphasis may be placed on, for example, titles or phrases within the data files being considered.

The value for W is calculated from a first product multiplied by A and F.

The first product is derived from the quotient of a numerator and a denominator. The numerator is calculated by adding constant A to the product of P and 1-Q raised to the power of D. Similarly, the denominator is calculated by adding constant A to the product of Q and 1-P, again raised to the power D.
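The weight calculation of Figure 13 may be sketched as follows, using the figures of Figure 12. The function name `weight` and its parameter names are assumptions made for the illustration. Two readings are also assumed: the denominator is taken as Q multiplied by one minus P, following the summary of the invention, and the final multiplication is taken to be by constant A and importance factor F, which keeps the largest result close to F.

```python
def weight(i_count, n_included, x_count, n_excluded, d=1.0, a=0.001, f=1.0):
    """Weighting value W for one term (sketch under the stated assumptions)."""
    p = i_count / n_included   # probability P of the term in included files
    q = x_count / n_excluded   # probability Q of the term in excluded files
    numerator = a + (p * (1.0 - q)) ** d
    denominator = a + (q * (1.0 - p)) ** d   # a > 0 prevents division by zero
    return f * a * numerator / denominator


# Counts from the example of Figure 12; sample sizes from the description.
w_acidity = weight(503, 1017, 1, 1028)
w_merger = weight(259, 1017, 600, 1028)
print(round(w_acidity, 3), round(w_merger, 5))
```

With D set to 1 and F set to 1, the term "acidity" receives a weight of roughly 0.33, whereas "merger" receives a weight well below 0.001, which agrees with the description of the former as strong evidence for the category and the latter as a poor candidate.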

After calculating value W as illustrated in Figure 13, W is processed non-linearly as illustrated in Figures 14A and 14B. A variable L represents the level of maximum importance which, empirically, is set to a value of between zero and 0.3. Variable T represents a lower threshold which again, empirically, is usually set to a value of 0.05. As shown in Figure 14A, if W is greater than L then W is set equal to L, thereby placing an upper bound on the weighting value. If W is smaller than T then W is set to zero, thereby removing many irrelevant terms which only occur very infrequently.

The result of the non-linear processes, as illustrated in Figure 14A, is illustrated graphically in Figure 14B. Output values for W are plotted against input values for W. As shown, values of W less than 0.05 result in the output for W being set to zero. Similarly, for values above 0.3 the output for W is set to 0.3. This response is illustrated by solid line 1401. In an alternative embodiment an output response is used with reference to a look-up table to produce a non-discontinuous result as illustrated by dotted output response 1402.
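The response shown by solid line 1401 may be sketched as follows. The function name `clamp_weight` and its parameter names are assumptions; the default values of 0.05 and 0.3 are the values given for the lower threshold and for the level of maximum importance in the example of Figure 14B. The look-up table alternative of response 1402 is not modelled.

```python
def clamp_weight(w, lower=0.05, upper=0.3):
    """Non-linear processing of W: zero below the threshold, capped above."""
    if w < lower:
        return 0.0          # infrequent, irrelevant terms are removed
    return min(w, upper)    # upper bound on the weighting value


for value in (0.01, 0.2, 0.33):
    print(value, clamp_weight(value))
```

Values between the two limits pass through unchanged, so only the extremes of the weighting range are affected.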

Process 606 for stem selection is detailed in Figure 15. Process 603, expanded in Figure 9, generates terms in the form of stems and identified noun phrases. Process 606 is concerned with selecting stems and casing rules that should be retained within the association file.

At step 1501 a term is selected and at step 1502 a question is asked as to whether the term is duplicated in a stem term. If this question is answered in the affirmative, a question is asked at step 1503 as to whether the weights are similar for the original term and for its subsequently generated stem term. If this question is answered in the affirmative both the stem term and the non-stem term are retained. Alternatively, if the question asked at step 1503 is answered in the negative, to the effect that the weights are not similar, a question is asked at step 1505 as to whether the non-stem term is weighted higher than the stem term. If this question is answered in the affirmative the stem term is excluded at step 1506. Alternatively, if the question asked at step 1505 is answered in the negative, the non-stem term is excluded at step 1507.

If the question asked at step 1502 is answered in the negative, control is directed to step 1508. Control is also directed to step 1508 after the execution of steps 1504 and 1506. At step 1508 a question is asked as to whether another term is to be considered and when answered in the affirmative control is returned to step 1501 for the next term to be selected.

The casing selection aspect of process 606 is detailed in Figure 16. At step 1601 a term is selected and at step 1602 a question is asked as to whether the term is duplicated with different letter cases. If this question is answered in the affirmative, a question is asked at step 1603 as to whether the weights are similar and if this question is answered in the affirmative the cased term is included. Alternatively, if the question asked at step 1603 is answered in the negative, control is directed to step 1605 where a question is asked as to whether the uncased term is weighted higher than the mixed case term. If this question is answered in the affirmative the mixed case term is excluded at step 1606. Alternatively, if the question asked at step 1605 is answered in the negative, the uncased term is excluded at step 1607.

If the question asked at step 1602 is answered in the negative control is directed to step 1608 and control is also directed to said step after the termination of processes 1604 and 1606. At step 1608 a question is asked as to whether another term is to be considered and when answered in the affirmative control is returned to step 1601 for the next term to be selected. Eventually, all of the terms will have been considered and the question asked at step 1608 will be answered in the negative.

Questions are asked at step 1503 and at step 1603 as to whether weights are similar. The extent to which weights may be considered similar is determined empirically and the similarity criteria may be adjusted for particular data sets.

The purpose of the procedures shown in Figure 15 and Figure 16 is to reduce the total number of terms present within the file, effectively making it more manageable and faster to execute. Examples of its purpose are illustrated in Figure 17. The term "acids" at 1701 and stem term "acid" at 1702 have been included. The processing of the steps illustrated in Figure 15 may determine that the stem term "acid" has been given a significantly higher weighting than the non-stem term "acids". Under these circumstances the term "acids" could be excluded from the file. Similarly, the process illustrated in Figure 16 may determine that "Johnson" at 1703 has a significantly higher weighting than the word "johnson" at 1704, resulting in the word "johnson" being excluded from the file.
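The pruning of Figure 15 can be sketched for a pair of variants as shown below, and the same comparison is reused here for the letter-case pair of Figure 16. This is a minimal sketch under stated assumptions: the similarity test is taken to be an absolute difference within a tolerance, whereas the description says only that similarity is determined empirically, and all names and weight values are hypothetical.

```python
def prune_pair(weights: dict, first: str, second: str, tolerance: float) -> dict:
    """Return a copy of `weights` with the lower-weighted variant removed.

    If the two weights are similar (assumed: within `tolerance`), both
    variants are retained, following the affirmative branch of step 1503.
    """
    w1, w2 = weights[first], weights[second]
    kept = dict(weights)
    if abs(w1 - w2) <= tolerance:
        return kept            # similar weights: retain both
    if w1 > w2:
        del kept[second]       # first variant weighted higher
    else:
        del kept[first]        # second variant weighted higher
    return kept


# Hypothetical weights for the terms of Figure 17.
weights = {"acids": 0.08, "acid": 0.25, "Johnson": 0.3, "johnson": 0.06}
weights = prune_pair(weights, "acids", "acid", tolerance=0.05)
weights = prune_pair(weights, "Johnson", "johnson", tolerance=0.05)
```

With these illustrative weights only "acid" and "Johnson" remain, matching the outcome described for Figure 17.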

Step 610 generates a rule base as an association outline file of the type illustrated in Figure 18. Each asterisk, such as asterisk 1801, represents a level of nesting. Thus, at the highest level, there is an entry representing the start of the title, at line 1802, and a further high level entry at line 1803 representing the start of the story.

In process 301 shown in Figure 3, training of the system is performed, so as to subsequently categorise files according to an organisation's preferred system of categorisation. A number of files are selected as a sample to represent files that will be included in the selected category, as identified in process 502 shown in Figure 5. Also, in process 503, as shown in Figure 5, sample files are identified which are to be excluded from the selected category. Typically a large number of files will be required in order to sufficiently identify the probabilities of terms occurring in these files, so that subsequently received data files may be analysed correctly.

Identifying files which are to be included or excluded from a specific category can be extremely time consuming when such a large number of files are required in order to achieve sufficiently accurate analysis results. Thus, in an alternative embodiment an additional analysis is performed for measuring the granularity of probability measurements. Thus, when the addition of a small number of sample files results in a significant modification in weight calculations, it may be understood that the number of sample files included in the sample set is insufficient to achieve a highly accurate result. However, a smaller number of sample files may be considered when the marginal effect of processing one more file becomes small.

Thus, by increasing the number of sample files until the granularity of weight calculation is reduced, a point at which the number of sample files has reached a significant threshold value may be identified. A distribution density may be calculated, where the preferred result is a Bradford/Zipf curve, which is smooth, and represents a highly accurate set of data samples.

Alternatively, if such a distribution density of calculated weights is irregular, or significantly different from the optimal Bradford/Zipf curve, it may be inferred that additional sample files, both in the selected category and excluded from the selected category, may need to be identified.
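One way to observe the marginal effect of adding sample files is sketched below. The description does not specify how granularity is measured, so the largest absolute change in any term weight is used purely as an assumed measure. The function names, tolerance and weight values are hypothetical.

```python
def largest_weight_change(before: dict, after: dict) -> float:
    """Largest absolute change in any term weight between two weight tables.

    A term missing from one table is treated as having weight zero there.
    """
    terms = set(before) | set(after)
    return max(
        (abs(after.get(t, 0.0) - before.get(t, 0.0)) for t in terms),
        default=0.0,
    )


def sample_set_sufficient(before: dict, after: dict, tolerance: float = 0.01) -> bool:
    """True when adding further sample files barely altered the weights."""
    return largest_weight_change(before, after) <= tolerance


# Weights calculated before and after adding a few more sample files.
stable = sample_set_sufficient({"oil": 0.20, "gas": 0.10}, {"oil": 0.205, "gas": 0.10})
unstable = sample_set_sufficient({"oil": 0.20, "gas": 0.10}, {"oil": 0.28})
```

When the check fails, further sample files would be gathered and the weights recalculated before the check is repeated.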

Subsidiary processor 411 is detailed in Figure 19. The processor includes an Intel Pentium processing unit 1901 connected to sixty-four megabytes of randomly accessible memory 1902 via a PCI bus 1903. In addition, a local disk drive 1904 and an interface circuit 1905 are connected to bus 1903. Interface circuit 1905 communicates with the TCP/IP network 424.

Random Access Memory 1902 stores instructions executable by the processing unit 1901, in addition to storing input data files received from the data sources 101 to 103 and intermediate data. Operations performed on processing unit 1901 in response to instructions read from memory 1902 are identified in Figure 20.

At step 2001 temporary memory structures are cleared and at step 2002 an OTL description file is selected. At step 2003 an item in the OTL file is identified and at step 2004 a question is asked as to whether the item selected at step 2003 is a rule definition. If this question is answered in the affirmative, a rule object is defined at step 2005. Alternatively, if the question asked at step 2004 is answered in the negative, to the effect that the item is not a rule definition, a question is asked at step 2006 as to whether the item is a word definition. If this question is answered in the affirmative, a dictionary link is created at step 2007.

At step 2008 a question is asked as to whether the item is a label and when answered in the affirmative a new entry is created in a label list, whereafter, at step 2010, a question is asked as to whether another item is present. After executing step 2005 or after executing step 2007, control is directed to step 2010.

When the question asked at step 2010 is answered in the affirmative, to the effect that another item is present, control is returned to step 2003 and the next item is identified in the OTL file. Eventually, all of the items will have been identified resulting in the question asked at step 2010 being answered in the negative. Thereafter, at step 2011 a question is asked as to whether another OTL file is present and when answered in the affirmative control is returned to step 2002 allowing the next OTL description file to be selected.

Thus, this process continues until all of the OTL files have been considered resulting in the question asked at step 2011 being answered in the negative.
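The item-by-item loop of Figure 20 may be sketched as follows. The OTL item syntax is not reproduced in this description, so each item is represented here by a hypothetical (kind, value) pair, and a dictionary link is modelled as a (rulebase, position) pair. All names are assumptions made for illustration.

```python
from collections import defaultdict


def load_outline(items, rulebase_id, dictionary, labels):
    """Process the items of one OTL file and return its rule objects."""
    rules = []
    for position, (kind, value) in enumerate(items):   # steps 2003 and 2010
        if kind == "rule":                             # step 2004
            rules.append(value)                        # step 2005: define rule object
        elif kind == "word":                           # step 2006
            # step 2007: link the word to this entry of this rulebase
            dictionary[value].append((rulebase_id, position))
        elif kind == "label":                          # step 2008
            labels.append(value)                       # new entry in the label list
    return rules


dictionary = defaultdict(list)   # word -> pointers into rulebases
labels = []
items = [("rule", "R1"), ("word", "oil"), ("label", "OIL-INDUSTRY"), ("word", "gas")]
rules = load_outline(items, rulebase_id=0, dictionary=dictionary, labels=labels)
```

Calling the function once per OTL file, with a shared dictionary and label list, corresponds to the outer loop controlled by step 2011.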

For each OTL file considered, by being selected at step 2002, a rulebase is generated and a plurality of such rulebases are illustrated in Figure 21. Thus, a first OTL file processed in accordance with the procedure shown in Figure 20, results in the generation of a first rulebase 2101.

Typically, for a specific installation, in the order of three thousand rulebases would be generated by executing the procedures illustrated in Figure 20.

Rulebases 2101 to 2109 are stored in memory 1902, which also provides storage space for a dictionary 2121, a label list 2122 and a data buffer 2123. The dictionary stores a list of words which have importance in any of the stored rulebases. Associated with each word in the dictionary there is at least one pointer and possibly many pointers to specific entries in specific rulebases 2101 to 2109.

The step for associating preferred terms with source files is detailed in Figure 22. At step 2201 central processing system 405 obtains access to one of the subsidiary processors 411 to 418. The central processing system then expects to receive authorisation so that communication may be effected with one of the subsidiary processors and after this connection has been established, the source file is supplied to the selected subsidiary processor at step 2203. At step 2204 the data is processed to determine associated preferred terms.

After performing process 2204, the results are transmitted back to the central processing system at step 2205 and at step 2206 data with associated preferred terms is stored and data pointers associated with the preferred terms are updated at step 2207.

Step 2204 for the processing of data to determine associated preferred terms is detailed in Figure 23. The overall processing is broken down into three major phases, consisting of a triggering phase 2301, followed by a scoring phase 2302 and, finally, a list generation phase 2303.

Triggering phase 2301 is detailed in Figure 24. At step 2401 a section of the data, such as its title, market sector or main body of text, is identified and at step 2402 an item of the identified section is selected. At step 2403 a question is asked as to whether the item indicates a new context, which may be considered as a grammatical marker in the form of a full stop, capital letter, start of a sentence or quotation marks etc. When answered in the affirmative, new context information is supplied to all rulebases at step 2404 and control is then directed to step 2407.

If the question asked at step 2403 is answered in the negative, step 2404 is bypassed and a look-up address is obtained for rule objects in rulebases from the dictionary at step 2405. Thereafter, at step 2406 all addressed objects are triggered and a multiplication of scores is effected by a score weighting factor. Thereafter, at step 2407 a question is asked as to whether another item is present and when answered in the affirmative control is returned to step 2402.

Eventually, all of the items for a selected section will have been considered resulting in the question asked at step 2407 being answered in the negative. Thereafter, at step 2408 a question is asked as to whether another section is to be considered and when answered in the affirmative control is returned to step 2401.

At step 2401 the next section is identified and steps 2402 to 2408 are repeated. Eventually, all of the sections will have been considered and the question asked at step 2408 will be answered in the negative.
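The triggering phase of Figure 24 may be sketched as below. The sketch assumes that each section carries its own score weighting factor, that each trigger contributes a base score of one, and that context markers are simple punctuation items. None of these details is fixed by the description, and all names are hypothetical.

```python
from collections import defaultdict


def trigger(sections, dictionary, context_markers=frozenset({".", '"'})):
    """Trigger rule objects addressed by the items of each section.

    sections: mapping of section name -> (list of items, weighting factor)
    dictionary: mapping of word -> list of (rulebase, entry) addresses
    """
    triggered = defaultdict(float)      # address -> weighted score
    context_changes = 0
    for items, factor in sections.values():            # steps 2401 and 2408
        for item in items:                             # steps 2402 and 2407
            if item in context_markers:                # step 2403
                context_changes += 1                   # step 2404: new context
                continue
            for address in dictionary.get(item, []):   # step 2405: look-up
                triggered[address] += 1.0 * factor     # step 2406: trigger and weight
    return dict(triggered), context_changes


dictionary = {"oil": [(0, 1)], "gas": [(0, 3), (1, 0)]}
sections = {"title": (["oil", "."], 2.0), "body": (["oil", "gas", "the"], 1.0)}
triggered, context_changes = trigger(sections, dictionary)
```

Items absent from the dictionary, such as "the" in this example, address no rule objects and therefore trigger nothing.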

Scoring phase 2302 is detailed in Figure 25. At step 2501 a rulebase is selected and at step 2502 a score variable is re-set to zero. At step 2503 a branch is identified for score accumulation/accrue and at step 2504 scores are accumulated or accrued from triggered rules attached to the branch. At step 2505 a question is asked as to whether another branch is to be considered and when answered in the affirmative control is returned to step 2503. A next branch is selected at step 2503 with procedure 2504 being repeated. Eventually all of the branches will have been considered resulting in the question asked at step 2505 being answered in the negative.

At step 2506 an overall score in the range of zero to one hundred is stored for the rulebase and at step 2507 a question is asked as to whether another rulebase is present. When answered in the affirmative, control is returned to step 2501 and steps 2501 to 2507 are repeated. Eventually, all of the rulebases will have been considered and the question asked at step 2507 will be answered in the negative.
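The scoring loop of Figure 25 may be sketched as follows. The description does not say how branch scores are combined or brought into the range of zero to one hundred, so this sketch simply sums them and bounds the total. That choice, the names and the sample scores are all assumptions.

```python
def score_rulebase(branches) -> float:
    """Accumulate triggered-rule scores over every branch of one rulebase."""
    score = 0.0                             # step 2502: re-set score variable
    for branch in branches:                 # steps 2503 and 2505
        score += sum(branch)                # step 2504: accumulate branch scores
    return max(0.0, min(100.0, score))      # step 2506: overall score, 0 to 100


# Hypothetical triggered-rule scores, grouped by branch, for three rulebases.
rulebases = {
    "OIL-INDUSTRY": [[30.0, 25.0], [10.0]],
    "SPORT": [[5.0]],
    "FINANCE": [[80.0], [40.0]],
}
scores = {name: score_rulebase(branches) for name, branches in rulebases.items()}
```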

Phase 2303 for the generation of a list of associated preferred terms is detailed in Figure 26. At step 2601 a rulebase is identified having a score greater than a predetermined threshold. Thus, for a particular application, a threshold may be set at forty-eight percent. At step 2602 additional triggered preferred data characteristics are identified by associating successful rulebases with parent categorisations by rulebase links.

At step 2603 lists of successful and inferred rulebases are combined to form overall lists of preferred data characteristics. Step 2603 results in data being generated by a subsidiary processor, such as processor 411, which is then supplied back to the central processing system 405.
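The list generation phase of Figure 26 may be sketched as below, using the example threshold of forty-eight. Following parent links transitively is an assumption, as are the function name, the link table and the category names other than "OIL-INDUSTRY".

```python
def preferred_terms(scores: dict, parents: dict, threshold: float = 48.0) -> list:
    """Combine successful rulebases with those inferred through parent links."""
    # step 2601: rulebases whose score exceeds the threshold
    successful = [name for name, score in scores.items() if score > threshold]
    inferred = []
    for name in successful:                 # step 2602: follow rulebase links
        parent = parents.get(name)
        while parent is not None and parent not in successful and parent not in inferred:
            inferred.append(parent)
            parent = parents.get(parent)
    return successful + inferred            # step 2603: combined list


scores = {"OIL-INDUSTRY-NETHERLANDS": 65.0, "SPORT": 5.0}
parents = {"OIL-INDUSTRY-NETHERLANDS": "OIL-INDUSTRY", "OIL-INDUSTRY": "ENERGY"}
terms = preferred_terms(scores, parents)
```

In this example the one successful rulebase brings in its parent and grandparent categories, while the low-scoring rulebase contributes nothing.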

Central processing system 405 is responsible for constructing a table of the type shown in Figure 27 in which an entry is present for each preferred term. The specific preferred terms are stored in column 2701 and, for each of these terms, column 2702 defines a specific pointer to a position in memory associated with the central processing system 405. Specific data files are identified by file names and the number of files associated with each preferred term is variable, depending on the nature and the amount of input data being considered. Thus, in order for this data to be accessible quickly while optimising use of the storage capacity within the central processing system 405, an indication of the file name is stored in the form of a linked list, as illustrated in Figure 28.

The preferred term "OIL-INDUSTRY" has been linked to a pointer OF8912. Address OF8912 is the first in column 2801 of the linked list. Column 2802 identifies a particular file name and column 2803 identifies the next pointer in the list. Thus, entry OF8912 points to a particular file with the file name "OIL-INDUSTRY-NETHERLAND-3" with a further pointer to memory location OF8A20. At memory location OF8A20 a new file name is provided, illustrated at column 2802 and again a new pointer is present at column 2803.

Eventually, all relevant files will have been considered and the end of this list is identified by address 000000 in the pointer location in column 2803.
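Traversal of the linked list of Figure 28 may be sketched as follows. Reading the printed addresses as hexadecimal values is an assumption, the memory is modelled as a simple mapping from address to (file name, next pointer), and the second file name is invented for illustration.

```python
END_OF_LIST = 0x000000   # pointer value marking the end of a list


def files_for_term(term: str, term_table: dict, memory: dict) -> list:
    """Follow the linked list for a preferred term and collect file names."""
    names = []
    address = term_table[term]              # column 2702: first pointer
    while address != END_OF_LIST:
        file_name, next_address = memory[address]   # columns 2802 and 2803
        names.append(file_name)
        address = next_address
    return names


term_table = {"OIL-INDUSTRY": 0x0F8912}
memory = {
    0x0F8912: ("OIL-INDUSTRY-NETHERLAND-3", 0x0F8A20),
    0x0F8A20: ("OIL-INDUSTRY-EXAMPLE-7", END_OF_LIST),   # hypothetical entry
}
files = files_for_term("OIL-INDUSTRY", term_table, memory)
```

Storing only a first pointer per preferred term lets the number of associated files vary without reserving a fixed block of storage for each term.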

In an active system, the database 423 will be continually updated and users will continually be given access to the database, all under the control of the central processing system 405.

Procedures for performing a search in response to a user request are detailed in Figure 29. At step 2901 a user logs onto the system and at step 2902 a search method is identified. At step 2903 search criteria are defined and at step 2904 the search criteria are processed to determine preferred terms. At step 2905 a list of preferred terms is supplied to the central processing system 405.

At step 2906 a question is asked as to whether a host has responded and when answered in the affirmative titles of associated data files are displayed at step 2907.

At step 2908 a question is asked as to whether the user wishes to view identified data and when answered in the affirmative the data is viewed at step 2909, after being downloaded over the communication channel.

At step 2910 a question is asked as to whether another search is to be performed and when answered in the affirmative control is returned to step 2902.

Step 2902 requires the search method to be identified and in order to achieve this a user is prompted by a screen display of the type shown in Figure 30. Thus, a plurality of text boxes are presented to the user inviting the user to specify a particular search method.

Step 2903 for the defining of search criteria results in the user being prompted by a screen of the type shown in Figure 31. Terms providing the basis for the user's search are displayed in a window 3101. Preferred terms are displayed in upper case characters, such as the entry shown at position 3102.

The displaying of titles of associated files at step 2907 results in the user seeing information displayed of the type illustrated in Figure 32. Each entry, such as entry 3201, includes a check box 3202. Check boxes 3202 allow a particular item to be selected by a user such that the actual information file may be supplied to the user from the central database over a communication channel.

Claims

1. A method of generating machine readable association files, to facilitate associating files of machine readable data with information categories, wherein a plurality of data files have been selected as being associated with an information category and a plurality of data files have been selected as not being associated with said information category, comprising steps of identifying preferred term candidates from said associated files; calculating weighting values for said candidates with reference to said files not associated with said category; and applying preferred terms to an association file by analysing said weighting values.

2. A method according to claim 1, wherein said weighting values are determined by processing a first probability of getting the term in data files included in the category with a second probability of getting the term in data files excluded from the category.

3. A method according to claim 2, wherein a first function of said first probability is divided by a second function of said second probability.

4. A method according to claim 3, wherein said functions include the addition of a range restricting constant.

5. A method according to claim 2, wherein said first function is derived by multiplying the first probability by one minus the second probability to produce a first product.

6. A method according to claim 3, wherein said first product is raised to the power of an enhancement exponent.

7. A method according to claim 3, wherein said second function is derived by multiplying the second probability by one minus the first probability.

8. A method according to claim 7, wherein said second product is raised to the power of an enhancement exponent.

9. A method according to claim 1, wherein candidates having a weighting value below a predetermined threshold are removed.

10. A method according to claim 1, wherein said weighting values are restrained up to a maximum level.

11. Apparatus configured to generate machine readable association files, to facilitate associating files of machine readable data with information categories, wherein a plurality of data files have been selected as being associated with an information category and a plurality of data files have been selected as not being associated with said information category, comprising means configured to identify preferred term candidates from said associated files; calculating means configured to calculate weighting values for said candidates with reference to said files not associated with said category; and applying means for applying preferred terms to an association file by analysing said weighting values.

12. Apparatus according to claim 11, wherein said calculating means is configured to determine weighting values by processing a first probability of getting the term in data files included in the category with a second probability of getting the term in data files excluded from the category.

13. Apparatus according to claim 12, wherein said calculating means is configured to process said probabilities, in that a first function of said first probability is divided by a second function of said second probability.

14. Apparatus according to claim 13, wherein said calculating means is configured such that said functions include the addition of a range restricting constant.

15. Apparatus according to claim 12, wherein said calculating means is configured such that said first function is derived by multiplying said first probability by one minus said second probability to produce a first product.

16. Apparatus according to claim 13, wherein said calculating means is configured such that said first product is raised to the power of an enhancement exponent.

17. Apparatus according to claim 13, wherein said calculating means is configured such that said second function is derived by multiplying said second probability by one minus said first probability.

18. Apparatus according to claim 17, wherein said calculating means is configured such that said second product is raised to the power of an enhancement exponent.

19. Apparatus according to claim 11, wherein said applying means is configured to remove candidates having a weighting value which falls below a predetermined threshold.

20. Apparatus according to claim 11, wherein said calculating means is configured to restrain weighting values up to a maximum level.
