WO2006062485A1

WO2006062485A1 - A method for classifying data

Info

Publication number: WO2006062485A1
Application number: PCT/SG2004/000402
Authority: WO
Inventors: Jinyan Li
Original assignee: Agency For Science, Technology And Research
Priority date: 2004-12-08
Filing date: 2004-12-08
Publication date: 2006-06-15

Abstract

A method for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least a first and second dataset using a computer processor, the method comprising: partitioning (110) a frequent itemset space in the first dataset into convex non-overlapping intemset plateaus, each itemset plateau having an equal support value in the first dataset; and partitioning (121) each itemset plateau into non-overlapping equivalence classes such that the union of all non-overlapping equivalence classes in an itemset plateau is equal to the itemset plateau; where, the non-overlapping equivalence classes that are supported in the second dataset determine the borders of the emerging pattern plateaus; and where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.

Description

Title

A Method for Classifying Data

Technical Field

The invention concerns a method for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least first and second datasets using a computer processor

Background of the Invention

Frequent itemsets are a type of knowledge pattern that have been widely and extensively investigated in the data mining field over the last decade. Research has mainly focussed on two aspects: efficient mining of all frequent itemsets from a dataset; and efficient mining of a subset of frequent itemsets but that can be used to derive all frequent ones without accessing the original data again. Given a dataset D, an itemset is frequent if its support in D is larger than or equal to a support value. The problem usually poses heavy computation when the support value is pre-set as a small value and D is high-dimensional.

Summary of the Invention

In a first preferred aspect, there is provided a method for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least a first and second dataset using a computer processor, the method comprising: partitioning a frequent itemset space in the first dataset into convex non- overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and partitioning each itemset plateau into non-overlapping equivalence classes such that the union of all non-overlapping equivalence classes in an itemset plateau is equal to the itemset plateau; where, the non-overlapping equivalence classes that are supported in the second dataset determine the borders of the emerging pattern plateaus; and where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space. Advantageously, no access to the first and second datasets is required to calculate support in the discovery of the left bound of an emerging pattern plateau.

The right bound of the border of the emerging pattern plateau may be computed by identifying the non-overlapping equivalence classes that have a zero support in the second dataset.

The left bound of the border of the emerging pattern plateau may be computed by using the border of an itemset plateau that have a zero support in the second dataset.

The method may further comprise the step of ranking data in the datasets using an entropy method.

The method may further comprise the step of selecting a predetermined number of ranked data.

The method may further comprise the step of discretizing the selected data.

The method may further comprise the step of translating the emerging pattern space into clinical rules.

In a second aspect, there is provided a method for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least a first and second dataset using a computer processor, the method comprising: partitioning a frequent itemset space in the first dataset into convex non- overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and computing the borders of the emerging pattern plateau by computing the support of the left and right bounds of the borders in the second dataset, each emerging pattern plateau having an equal support value in the first dataset; where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.

In a third aspect, there is provided a system for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least a first and second dataset using a computer processor, the method comprising: a first partitioning module to partition a frequent itemset space in the first dataset into convex non-overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and a second partitioning module to partition each itemset plateau into non- overlapping equivalence classes such that the union of all non-overlapping equivalence classes in an itemset plateau is equal to the itemset plateau; where, the non-overlapping equivalence classes that are supported in the second dataset determine the borders of the emerging pattern plateaus; and where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.

In a fourth aspect, there is provided a system for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least a first and second dataset using a computer processor, the system comprising: a partitioning module to partition a frequent itemset space in the first dataset into convex non-overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and a processing module to compute the borders of the emerging pattern plateau by computing the support of the left and right bounds of the borders in the second dataset, each emerging pattern plateau having an equal support value in the first dataset; where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.

In a fifth aspect, there is provided a computer program product comprised of a computer-readable medium for carrying computer-executable instructions for: partitioning a frequent itemset space in the first dataset into convex non- overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; partitioning each itemset plateau into non-overlapping equivalence classes such that the union of all non-overlapping equivalence classes in an itemset plateau is equal to the itemset plateau; where, the non-overlapping equivalence classes that are supported in the second dataset determine the borders of the emerging pattern plateaus; and where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.

In a sixth aspect, there is provided a computer program product comprised of a computer-readable medium for carrying computer-executable instructions for: partitioning a frequent itemset space in the first dataset into convex non- overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and computing the borders of the emerging pattern plateau by computing the support of the left and right bounds of the borders in the second dataset, each emerging pattern plateau having an equal support value in the first dataset; where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.

The present invention provides a plateau-like partitioning method for the structural decomposition of both frequent itemset spaces and emerging pattern spaces. All these subspaces are convex spaces. Therefore, they can be concisely represented by borders. Based on these convexities, new concise representations for frequent itemset spaces and EP spaces are possible. Thus, the mining of EPs can be reduced to the mining of only boundary EPs.

Brief Description of the Drawings

An example of the invention will now be described with reference to the accompanying drawings, in which:

Figure 1 is a process flow diagram of the method for discovering emerging pattern plateau borders; and

Figure 2 is a block diagram of a general purpose computer with which the embodiments of the invention can be practiced.

Detailed Description of the Drawings

The process of identifying patterns generally is referred to as "data mining" and comprises the use of processes that, under some acceptable computational efficiency limitations, produce a particular enumeration of the required patterns. A major aspect of data mining is to discover dependencies among data, a goal that has been achieved with the use of association rules, but is also now becoming ^•practical for other types of classifiers.

Each record costs of a number of items. A record is also called a transaction or an instance.

Emerging patterns are basically conjunctions of simple conditions. Preferably, emerging patterns have four qualities: validity, novelty, potential usefulness, and understandability.

The validity of a pattern relates to the applicability of the pattern to new data. Ideally a discovered EP should be valid with some degree of certainty when applied to new data. One way of investigating this property is to test the validity of an EP after the original databases have been updated by adding a small percentage of new data. An EP may be particularly strong if it remains valid even when a large percentage of new data is incorporated into the previously processed data.

Novelty relates to whether a pattern has not been previously discovered, either by traditional statistical methods or by human experts. Usually, such a pattern involves lots of conditions or a low support level, because a human expert may know some, but not all, of the conditions involved, or because human experts tend to notice those patterns that occur frequently, but not the rare ones. Some EP's, for example, consist of astonishingly long patterns comprising more than 5, including as many as 15 conditions, when the number of attributes in a dataset is as large as

1,000. This provides new and unexpected insights into previously well-understood problems.

The potential usefulness of a pattern arises if it can be used predictively. Emerging patterns can describe trends in any two or more non-overlapping temporal datasets and significant differences in any two or more spatial datasets. In this context, a

"difference" refers to a set of conditions that most data of a class satisfy but none of the other class satisfies. A "trend" refers to a set of conditions that most data in a dataset for one time-point satisfy, but data in a dataset for another time-point do not satisfy. Accordingly, EP's may find considerable use in applications such as predicting business market trends, identifying hidden causes to some specific diseases among different racial groups, for handwriting character recognition, for distinguishing between genes that code for ribosomal proteins and those that code for other proteins, and for differentiating positive instances and negative instances, for example, "healthy" or "sick", in discrete data.

A pattern is understandable if its meaning is intuitively clear from inspecting it. The fact that an EP is a conjunction of simple conditions means that it is usually easy to understand.

Interpretation of an EP is particularly aided when facts about its ability to distinguish between two classes of data are known.

If one more condition (item) is added to a boundary EP, thereby generating a superset of the EP, the superset EP may still have the same frequency as the boundary EP in the home class. EP's having this property are called "plateau EP's," and are defined in the following way: given a boundary EP, all its supersets having the same frequency as itself are its "plateau EP's." Of course, boundary EP's are trivially plateau EP's of themselves. Unless the frequency of the EP is zero, a superset EP with this property is also necessarily an EP.

Plateau EP's as a whole can be used to define a space. All plateau EP's of all boundary EP's with the same frequency as each other are called a "plateau space" (or simply, a "P-space"). So, all EP's in a P-space are at the same significance level in terms of their occurrence in both their home class and their counterpart class. Suppose that the home frequency is n, then the P-space may be denoted a "Pn-space." All P-spaces have a useful property, called "convexity" which means that a P-space can be succinctly represented by its most general and most specific elements. The most specific elements of P-spaces contribute to the high accuracy of a classification system based on EP's.

Convexity is an important property of certain types of large collections of data that can be exploited to represent such collections concisely. If a collection is a convex space, "convexity" is said to hold. Convexity is a mathematical notion useful for concisely and losslessly representing a large space of points using a border. A point is an itemset, and many itemset spaces are seen as convex spaces.

In particular, a space of frequent itemsets is convex, and its non-overlapping subspaces after a plateau-like structural decomposition are all convex. Such global convexity and subspace convexity are also observed in a different type of itemset space that is defined on a pair of datasets.

Plateau-like Partitioning Method Structural decomposition involves partitioning a collection of frequent itemsets of a dataset into subspaces such that each of them consists of exactly all itemsets that have the same high support. These subspaces are itemset plateaus. So, a space of frequent itemsets can be viewed as a collection of plateau-like subspaces under structural decomposition. All of the subspaces have the property of convexity, and that the convexity is also possessed by the whole itemset space. Hence, not only the whole space but each of the subspaces are bordered by a pair of bounds: A left bound consisting of the most "general" itemsets, and a right bound consisting of the most "specific" ones. Thus these itemset spaces can be concisely represented by the boundary itemsets in a lossless way. A new concise representation for spaces of frequent itemsets is provided.

The right bound of the whole space are exactly the so-called maximal itemsets. The right bound of each subspace are exactly the so-called closed frequent itemset, and the left bound of each subspace are exactly the so-called key patterns or generators. All the left boundary itemsets constitute a convex space, and that each subspace is a union of non-overlapping equivalent classes. The picture drawn by the decomposition method for the space of frequent itemsets captures, in one place, many different important data mining notions. Consequently, a better and deeper understanding about the mining of frequent itemsets by different methods is obtained.

Previously, frequent itemsets were defined based on only one dataset. The definition does not use any other datasets. A structural decomposition on a different type of itemset space that is defined based on two datasets is provided by the present invention. The dataset may be contained in a database. These itemsets are called emerging patterns (EPs). Given a pair of datasets D^pos and D^neg , an EP is defined as an itemset that has a support in D^pos not below % , and its support in D"^eg is less than δ. A special case of this definition is that π = 1 and δ = 1. That is, an EP is an itemset that has a non-zero support in D^pos , but its support in D"^es is zero. Similarly as for spaces of frequent itemsets, the global convexity and the subspace convexity are also observed for spaces of emerging patterns. Based on this, a new concise representation for spaces of emerging patterns is provided. Also, methods to compute the new representations are described.

Moving from the notion of frequent itemsets to EPs, an important difference is the number of datasets used in their definitions. The definition of frequent itemset needs one dataset, but to define EPs, two datasets are required. The content and meaning of the itemset spaces change significantly. Frequent itemsets can be used to capture association rules, episode rules, sequential patterns and clusters. EPs are used in the biomedical field for discovering diagnostic biomarkers and rules, in the business side for identifying niches opportunities, and in the field of machine learning. However, the commonality which the two types of itemset spaces share is the convexity in their global spaces and in their plateau-like subspaces.

Let / = {/₁₅/₂,K ,i_m} be a set of distinct literals called items. An itemset or a pattern, is a set of items. A transaction is a non-empty set of items. A dataset is a non-empty set of transactions. An itemset X is said to be contained or included in a transaction T if X c T . An itemset X is said to be contained in a dataset D if there is r e Z⁾ such that I c T. The support of an itemset X in a dataset D, denoted sup(X, D), is the number of transactions in dataset D that contain X. An itemset X is said to be frequent in a dataset D if sup(X, D) is greater than or equal to a pre- specified equal support value min_sup. Given a dataset D and a support value min_sup, the collection of all frequent itemsets in D is called the space of frequent itemsets, and is denoted by F(min_sup, D).

In a first example, suppose a dataset D consists of five transactions as shown in the following table:

Trans. ID Items

1 a, c, d

2 b, c, e

3 a, b, c, e, f

4 b, e

5 a, b, c, e

Let min_sup = 2. Then, F(HIiIl-SUp₃P) = {0, {4 {4 {*}>{?}}

A space of frequent itemsets can be large. Next, convexity can be described in the context of itemset spaces. Convexity is an attractive property for large spaces of patterns because it is helpful for their concise and lossless representation.

DEFINITION 1

An itemset space S is convex if, for all X, Y e s such that X c T , it is the case that Z c ₁S whenever X c Z c F

The set inclusion is crucial for this definition. Different operations can also be used to define convex spaces.

For a convex space S, the collection of all most general itemsets in S as a bound of S is defined, where an itemset X is most general in S if there is no proper subset of X in S. Similarly, the collection of all most specific itemsets as another bound of S is defined, where an itemset X is most specific in S if there is no proper superset of X in S. The former bound is the left bound of S, denoted L; and the latter bound is the right bound of S₁ denoted R. The pair of left and right bound is the border of S, and is written as (L, i?) . Also defined is: [L,R]=

Y}

(L, R.) and [L,R] are two different notions. Specifically, [L,R] = S, but (L,R) is only a concise representation of the whole space S in a lossless way. Also, [{0},{0}]= {0} and i [0,0] = 0.

In a second example, let S =

{{a},

be a collection of itemsets. Then S is a convex space. Its left bound is L = {{a}, {&}}, and its right bound is R =

{b,c, d}} . So, the border of S is ({{a}, {&}}, {{a, b,

d}}) . This border is a concise representation of S as it uses only four itemsets to represent all the nine itemsets in S.

PROPOSITION 1 Let a dataset D and a support value min_sup < |X>j be given. Then F(min_sup, D) is a convex item space. Furthermore, its border is of the form ({θ}, i?) for some R. This is proved by supposing two itemsets X, Y e F(min_sup, Z)) such that X e Y . Then sup(X, D) ≥ sup(7, D) > min_ sup . Let Z be any itemset such that X c Z e Y . Then sup(Z, D) > sup(7, D) . Then sup(Z, D) ≥ min_ sup . So, Z e .P(HUn- sup) . Thus, F(min_ sup, D) is a convex space.

Since O c T for every T e D , it is the case that sup(θ, D) = \D ≥ min_ sup .

Therefore, 0 e .F(min_sup, £>). As 0 is a proper subset for any non-empty itemsets, it follows that 0 is the sole most general element in F(min_sup, D). So, the border of F(min_sup, D) is of the form ({θ}, R) , for some R.

In a third example, following the dataset D and min_sup in the first example, it is known that F(min_sup, D) contains fifteen frequent itemsets, excluding 0. But, the border of F(min_sup, D) is ({θ}, {{α&ce}}) containing only two itemsets. It is thus a concise representation of F(min_sup, D).

The results stated in Proposition 1 can also be obtained by using, for example, the Max-Miner algorithm. This algorithm is able to discover frequent maximal itemsets that are actually all itemsets in the right bound of a space of frequent itemsets, and can cover all frequent itemsets. However, convex frequent itemset spaces are able to be further decomposed systematically into convex subspaces.

The plateau-like decomposition method to systematically partition a space of frequent itemsets into non-overlapping subspaces requires defining itemset plateaus. DEFINITION 2

An itemset plateau, plateau(π , D), in a dataset D is defined as the collection of all itemsets that have the same support value % in D.

Let F(min_sup, D) be a frequent itemset space in a dataset D. Suppose these frequent itemsets have k distinct support values π_v Jt₂ K π_k . Then,

F(min_ sup, D) = Ϋ plateaufø_j , D) . i=\

Every frequent itemset bas a unique support value in D. Consequently,

Z)) = 0 for all i ≠ j . That is, F(min_sup, D) is partitioned into non-overlapping itemset plateaus.

In a fourth example, following the dataset D and min_sup in the first example, four itemset plateaus are obtained as follows, where π_λ = 5 , π₂ = 4 , π_% = 3 , and π₄ = 2 :

1. plateau(π₁,D) = {6}

2. plateau{π₂,D) = {{c},{b},{e\{be}}

3. plateau(π₃ , D) = {{a}, {ac}, {be},

4. plateau(π₄ , D) — {{ab}, {ae}, {abc}, {abe}, {ace}, {abce}}

Every plateauψ^D) is a convex space.

THEOREM 1

Let a dataset D be given. Every π_s ,

is a convex space. This is proved without loss of generality, by letting plateau^,, D) ≠ 0 . Suppose itemsets X and Y are in plateau(π_t , D) and Ic J. Then sup(X, D) = sup(Y, D) = %.. Let Z be any itemset Z such that I c Z c T. Then suρ(X, D) ≥ sup(Z, D) ≥ sup(F, D) . As sup(X, Z)) = suρ(7, D) = π_t , it follows that sup(Z, Z⁾) = TT, . Therefore the itemset Z is in

As a convex space can be concisely represented by its border, every itemset plateau can be concisely represented by a border.

In a fifth example, following the fourth example, the border of plateau(π₄ , D)

is

{{{ab\{ae}\{{abce}}) .

This border is concise because it uses only three itemsets to represent the six itemsets that have the same support level τr₄ in D.

THEOREM 1 is used for a new concise representation of frequent itemsets. Suppose F(min_sup, D) is a frequent itemset space in a dataset D, and suppose these frequent itemsets have k distinct support values 7T₁₅Tr₂K π_k . Then: k k

F(min_sup, D) = γplateau(π_t , D) =Y[Z, , R₁ ]

where (L_nR_f) is the border of plateau(π_t,D). So, the mining of all frequent itemsets can be converted to the mining of borders of all frequent itemset plateaus. This concise representation is different from those based on closed patterns or on key patterns. These borders can be efficiently discovered.

The bounds of the itemset plateaus and itemset ideas such as closed patterns and key patterns (also called generators) are related. The relationship between itemset plateaus and equivalence classes on which closed patterns and key patterns are defined is described.

The notion of equivalence classes deals with itemsets that are common to a set of transactions in a dataset. Two itemsets are considered "equivalent" if they are included in exactly the same transactions. The filter, f(P, D), of an itemset P in a dataset D as f(P, D) = {T e D \ P c I} is defined. Then an itemset Q is equivalent to an itemset P in a dataset D iff f(Q, D ) = f(P, D). Then the equivalence class [p]_D of P in a dataset D is the collection of itemsets defined as:

[P]_D = {X \ f(X,D) = f(P,D)}. PROPOSITtON 2

[P]_D is a convex itemset space, and the right bound of its border is a singleton set. This is proved by letting X, Y e [P]₀ such that X c Y. Suppose X c Z c Y. Then, f(Y, D) c f(Z, D) c f{X, D) . Since, X, Y G [P]_D , it is known that f(Y, D) = f{P, D) = f(X, D) . So f(Z, D) = f(P, D) and Z e . Thus [p]_D is convex.

Letting (L,R) be the border of [p]_D . Suppose XJ e R . Then f(X,D)= f(P,D) = f(Y,D). This implies f(XvY,D) = f(P,D). Thus X \j Ye. [P]_n . Since X, Y e R by the definition of the right bound of a border, both X and Y are most specific. Then it must be the case that X = Xu F = F. Thus R is a singleton set.

Sup(X, D) = sup(Y, D) for all X, Ye [P]₀. Therefore, the equivalence class of any itemset P must be a subset of plateau{π_t,D), where π_t = sup(P, D). Following from the definition of equivalence classes, plateau{π_t,D) is the union of non- overlapping equivalence classes. In particular, the number of equivalence classes is equal to the number of itemsets in the right bound of the border of plateau{π_t,D). In other words:

COROLLARY 1

Let a dataset D and a support value min_sup < \D be given. Then for any π ≥ min_sup such that plateau{π_t,D) ≠ 0 , it is the case that

_Vlateauip ,,D) = [L_x uZ₂ uK uL_R,{R_uR₂,K ,R_r}] = Ϋ[Z,,{i?,}]

;=1 Where fcL,fcL,K ,[R_R]_O = {[P]_O}| sup(P, D) = π}, and each (L₁₉[R₁]) is the border of the corresponding [i?,. ]_o . In a sixth example, following the fourth example, it is known that

Here, [MW]= [WL = [WL »>d

[{{be}, {ce}}, {{bee}}] = [{δc}]_β = [{ce}L are two equivalence classes in D.

Together with equivalence classes, frequent closed patterns and frequent key patterns have been widely studied in the data mining field. One main reason is that both of them can be separately used to concisely represent the whole space of frequent itemsets.

Plateau boundaries and closed patterns and key patterns are related. Traditionally, closed patterns are defined based on the notion of equivalence classes. For example, a closed pattern is the maximal (the most specific, under set inclusion) itemset of an equivalence class. In a simpler way but equivalently, a closed pattern is also defined as an itemset that has no proper superset with the same support.

As described, a plateau{π_t,D) is the collection of all itemsets whose support value in D is π₍ . The right bound of

are itemsets that are most specific in plateau(π_i}D). That is, any proper superset of any itemset in this bound is not in So, those proper supersets have a different

support level. Hence, the itemsets in the right bound of plateauψ^D) are all closed patterns. Conversely, all closed patterns with the same support value π₍. are in the right bound of plateau{it_i , D). So, the right bounds of the itemset plateaus capture exactly all closed patterns.

DEFINITION 3

Key patterns is a concept that was developed after closed patterns. Its definition is also based on equivalence classes. A pattern P is a key pattern in a dataset D iff

P e minjpj^ , where HiIn[P]₀ is the collection of all minimal (the most general, under set inclusion) itemsets in [P]₁₃. That is, P is a key pattern in D iff no proper subset of P is in [P]₀.

According to this definition and the definition for the left bound of

it can be seen that the itemsets in the left bounds of plateaus are all key patterns. Bounds of convex plateaus can be used to capture all key patterns and all closed patterns instead of confining them to smaller equivalence classes.

The itemsets in the left bounds of all plateauψ^D) form a convex space.

THEOREM 2

Let a dataset D and a support value min_sup <

be given. Suppose: k k

F(min_ sup, D) = Yplateau(π₍ , D) = Y[L₁ , R₁ ] i=l (=1 k where L₁ and R₁ are bounds of the itemset plateaus. Then Yi, is a convex k space. This is proved by letting K = YL₁ . Suppose X, Y e K such that

X a. Y . Suppose X c, Z c f . Then, Z e K is proved. The condition X c Z c F implies four cases: I c Z c U c Z =r,I = Z c F, and X= Z = Y. Only

Z € K when X cz Z cz Y needs to be proved. The other three cases are apparent.

Let sup(Z, D) = π_m . Then, Z = plateau(π_m,D). Z e L_1n is proved by contradiction. Assume Z & L_m . Then there is A e L_1n such that A c Z because plateau(π_m,D) is a convex space.

Let Y = Z u V such that V n Z = 0. As A, Z € plateau(π_m , D) , then sup (A , D ) = π „ = sup (Z , D ) . As A c Z , then sup(_^4 u F, Z⁾) = sup(7, /J⁾) . So, Y has a proper subset A u F with the same support as itself: But Y is an itemset in the left bound of a plateau. So all its proper subsets must have a larger support than Y's. This is a contradiction. So, Z <= L_m .

Then Z e K .

The contribution by frequent closed patterns and key patterns to data mining is that both of them can be separately used to concisely represent the whole space of frequent itemsets in a dataset. These results can be re-written and then compared to the subspace border representation.

Given a dataset D and a support value min_sup, denote FC_D as the set of closed itemsets that are also frequent in D. Then a concise representation of F(min_sup, D) is defined as the set FC_D enriched by the information on the support for every itemset in FC _D . This concise representation is correct because for any itemset JT c /. • If there is a Z e FC _D such that Z ^ X , then X e F(HUn-SUp₅D) and s\φ{X,D) = max{sup(r, D) \ Y e FC_DandY 3 X] ;

• otherwise, X ^ F(min_sup,_D).

To get the support for all the frequent itemsets from this concise representation, all possible X must be tested. This computation is not cheap, but does not need any access to D for the calculation of support.

Key patterns are sometimes used as an intermediate step for the discovery of closed itemsets. The key patterns themselves can constitute a concise lossless representation of frequent itemsets. FG_D is denoted as the set of key patterns that are also frequent in D. Then a concise representation of F(min_ SUp₅I)) based on FG_n is defined as:

• the set of FG_D enriched by the information on support for every itemset set in FG ^■» ; • the set GBd _D of all minimal infrequent key patterns; and

• the set of I_D of items that each occurs in D. This concise representation is correct because for any itemset X c I,

• if X g I_D or there is a Z e GBd _D such that Z c X , then X g F(min_sup,D)

• otherwise, X e F(min_sup,/J⁾) and sup(X, D) = min{sup(7₅ D)] Y e FG _D andY c X}

Frequent key patterns alone are not sufficient for a lossless concise representation of a space of frequent itemsets. Also, the support of represented frequent itemsets is not immediate or obvious.

A new concise representation for frequent itemsets is provided. It is a subspace border representation for concisely representing frequent itemset spaces. This representation is better than those based solely on FC _D or FG_D .

Suppose F(min_sup,D) is a frequent itemset space in a dataset D. As described earlier: k k

F(π∞ι_ sup, D) = Ypϊateau(π_t , D) = Y[L₁ , R₁ ]

'=1 '=1 where π_λ > π >₂ K > π_k are the k distinct support values for these frequent itemsets, and each (L_nR₁) is the border of the corresponding plαteαu(π_nD). Then the subspace border representation is defined as: (L₁ , R₁ ), / = 1,K , k enriched with information on the support of these boundary itemsets in D.

k

The mining of (L₁ , R₁ ), i = 1,K , k is efficient. Y Yi/ is the set of all frequent key k patterns in D, and Yi?,. is the set of all frequent closed patterns in D. Then, a

method can be used to discover frequent closed patterns and simultaneously k k discover both Yi,- and Yi?. since key patterns are usually used as an

J=I (=1 intermediate step for the discovery closed itemsets. Suppose C ,C₂₅K ₅C_n are the frequent closed itemsets sorted by their support in a descending order, then the first group of itemsets that have the same support is R₁ , the second group of itemsets having the same support is R₂ and so on. Finally the kth group of itemsets having the same support is R_k . Similarly, L₁, i = I₅K ,k , can be identified from the frequent key patterns.

The L₁, i = 1,K ,k , representation for F(mia_ sup, D) has two main advantages: Firstly, the generation of all represented frequent itemsets is direct. That is, for each X e [Z_; , i?, ] , X is a frequent itemset. Secondly, the support of any represented frequent itemset is clear. That is, for each X e [L₁ , R₁ ] = plateau(π,- , D), sup(jT, D) = π, .

Frequent itemsets are defined with regard to one dataset. A different type of itemsets that are related to two datasets, and convexity and plateau-like structural decomposition for this type of itemset are described. Traditionally, this type of itemset are called EPs. Frequent itemsets and emerging patterns can find different practical applications, and EPs have a broader potential.

DEFINITION 4

Given a pair of datasets D^pos and D"^eg , an itemset X is an EP if sup(x, D^pos ) ≠ 0 , but sup(x, D"^es ) = 0. This space of emerging patterns (EP space), is denoted by EP\D^pos ,-,D"^eg ). Symmetrically, an itemset Y is an EP if svφ(γ,D^neg)≠ 0 , but

0 . This space EP space is then denoted by EP(D"^e8,-,D^pos).

EPs are useful when mining a set of contrasting characteristics for a pair of datasets. For example, comparing tumor cancer cells against normal healthy cells, to discover gene expression changes for diagnosis. In another example, comparing demographic data between two states, to discover distinct social behaviour for the two populations. In a further example, comparing transaction data of a company between two consecutive years, to discover emerging purchase patterns from the customers and shrinking trends. THEOREM 3

Given a pair of datasets D^pos and D^mg , then both EP(D^P0S ,~,D"^eg) and

EP[p"^eg ,->D^p0S) are convex spaces. This is proved by firstly proving EP(D^POS ,-,D"^es) is a convex space. Let X, Y e EP(D^POS ,-,D^neg) be itemsets such that X c Y. Suppose IcZ c J. Since X e £?(£>*^<* , -D"* ) , it is known that X does not occur in D"^eg . Then all of its supersets - those itemsets more specific than X - cannot occur in D"^eg . Therefore, Z cannot occur in D^neε . So₁ sup(z,D^neg)= 0. Since 7 e EP(D^P°^S , -iD"*), it is known that

SUp(F, D^prø J > 0. Then all subsets of Y - those more general than Y - must have a non-zero support in D^pos . Therefore, Z occurs in D^pos . Consequently, Z is an EP in EP{D^POS ,-,D^neg ). Thus EP(D^P°^S ,-,D^neg ) is a convex space. The proof for the convexity of EP(D^neg ,-iD^p0S ) is similar.

So, the EP space EP{p^pos ,—D^neg ) can be concisely represented of itemsets by a border (L,R) , and [i,i?]are exactly all the EPs. Thus, the mining of EPs can be reduced to the mining of only boundary EPs.

Also, the left boundary EPs are EPs that have the biggest support in D^pos . As their support in D"^eg are all zero, the left boundary EPs are the most discriminating patterns for differentiating the two classes of data.

Example - Bioinformatics

Initially, characteristics of prostate cancer cells are captured from their gene expression profiling data. The original data has two classes of samples: 52 prostate tumor cells and 50 normal cells. Every cell sample is described by 12600 features, also known as attributes, genes, probe-and-EST variables in the bioinformatics literature. Every feature is continuous, measuring the expression level of that specific gene in the cell samples by Affymetrix U95Av2 oglinucleotide microarrays. So, the values of a feature may change significantly from one sample to another, or from the tumor class to the normal class. Genes or gene groups can be discovered that can discriminate the two classes of cell samples sharply. The 52 prostate tumor cells are denoted as D""" , and the 50 normal cells as

D^nor . Firstly, to discover the EPs, the 12600 features are ranked using an entropy method. Next, 40 top-ranked features are selected and discretized by the entropy method. After the discretization, a total of 90 items is obtained, and each transaction (each cell) is a set of 40 items. For this pair of reduced and discretized datasets, there are 24287 left boundary EPs for EP(D^nor , -iD^tum ), and 18268 left boundary EPs for EP(D^tum ,-,D^nor).

Ten left boundary EPs of Ep{D^m\-_tD^tum ) are listed in Table 1 below:

itemset X sup(X,Z>"^or) sup(X, £>""" )

{7, 21} 36(72.00%) 0

{7, 44} 36(72.00%) 0

{7, 11} 35(72.00%) 0

{7, 43} 35(72.00%) 0

{7, 39} 34(72.00%) 0

{7, 87} 34(72.00%) 0

{7, 32} 34(72.00%) 0

{11, 24} 34(72.00%) 0

{24, 29} 34(72.00%) 0

{7. 86} 33(72.00%) 0

Ten left boundary EPs of EP(D^tum ,->D"^or ) are listed in Table 2 below:

itemset X Svφ(X,D"^or) sup^/J⁾""")

{6, 23, 83} 0 43 (82.69%)

{14, 23, 36, 55, 83} 0 42 (80.77%)

{6, 16, 55, 83} 0 42 (80.77%)

{16, 36, 55, 83} 0 42 (80.77%) ■

{14, 55, 67, 83} 0 42 (80.77%)

{6, 14, 67, 83} 0 42 (80.77%)

{19, 55, 83} 0 42 (80.77%)

{4, 67, 83} 0 42 (80.77%)

{14, 55, 62} 0 42 (80.77%) {14, 36, 62} 0 42 (80.77%)

The reference numbers in the first column of the two tables represent items. For example, the reference number "7" stands for the item "38406_f_at > 499.0", saying that the expression of gene 38406_f_at is larger than 499.0. These EPs are characteristics, rather than random noise patterns, of the tumor cells or the normal cells since these EPs occur in a large percentage (for example, 72%, 82%) of samples in the tumor class or in the normal class. Furthermore, as they occur with high frequencies in either as D""" , or D"^or , but not simultaneously in both, they can serve as excellent diagnostic indicators for prostate cancer, and may even play a causal role in the cancer.

These EP patterns are translated into useful clinical rules. For example, the first EP {6, 23, 83} in Table 2 can be interpreted as the following rule: If the expression of gene 38406_f_at < 499 and that of 556_s_jit < 127 and that of gene 34678_at is between 0 and 33, then this cell is a tumor cell. This rule is satisfied by 82.69% of the tumor samples.

A New Concise Representation For EP Spaces

As described earlier, the subspaces of a space of frequent itemsets are all convex spaces after plateau-like structural decomposition. This also happens to EP spaces.

DEFINITION 5

Given a pair of datasets D^pos and D^neg , let EP(D^POS ,-,D^neg ) be the space of EPs that have a non-zero support in D^pos . A plateau-like structural decomposition partitions EP\p^pos ,-ιD^nes ) into subspaces such that each of them consists of

EPs that have the same high support in D^pos . Such a subspace is denoted by EPplateau{κ _{i y}D^pos ) where π_t is the support of the EPs in D^pos .

THEOREM 4

Given a pair of datasets D^pos and D^ms , then:

1. EP(D^POS , -,D^neg ) = YEPplateau(n_t ,D^p0S) where π_x , Jt₂ K π_k , are all

possible distinct support levels of these EPs in D^pos ; 2. EPplateau^τ_i,D^pos)n EPplateau{^_j,D^pos)= 0 for all i ≠ j; and

3. each EPplateauψi , D^pos J is a convex space.

To prove Statement 1, suppose X 6 EP(D^≠S ,-,D"^eg) . Then svφ(x,D^pos )≠ 0. Then there is an m, l ≤ m ≤ k , such that sup(X, D^pos)= π_m . Therefore, X e EPplateau(π_m , D^pos).

Thus, X e

On the other hand, suppose k

Y e YEPplateau(π_i , D^pos ). i=l

Then Y is an EP and svap{γ,D^p0S)≠ 0. Therefore, Y e Ep(D^pos,->D"^eg). Combining the above two points, Statement 1 is proved.

To prove Statement 2, for each X e EPplateau(π_t , D^pos \ sup(z, D^pos ) = π_t . As π_t ≠ π_j,X <£ EPplateauip _j,D^pos). Statement 2 is proved.

To prove Statement 3, by definition, X e EPplateauψ_t , D^pos ) is a collection of all EPs whose support in D^pos is π_t ≠ 0. Suppose two patterns XJ e

satisfy X c 7. Let I c^'Z c 7. By definition of X G EPplateauiπ_t , D^pos } sup(γ, D^pos )= π_i = sup(x, D^pos ) . Since X <= Z £ 7 , it is the case that sup(x, D^pos ) ≥ sup(z, D^pos)≥ sup(r, D^pos ). So, sup(z, D^pos )= π_r Thus, Z e EPplateau(π_t , D^pos ), and EPplateau(π_t , D^pos ) is a convex space. Statement 3 is proved.

THEOREM 4 states that a space of EPs can be partitioned into non-overlapping EP plateaus, and each of such subspaces is a convex space. This theorem is useful for a new concise representation of EP spaces. As described, a known and simple concise representation of an EP space is this space's border. Previous studies on the mining of EPs are almost all focused on the mining of such a border enriched with these boundary EPs' support. However, this representation does not provide sufficient information to derive the support for any represented EPs.

A new representation for concisely representing a space of EPs and its advantages is described. Let EP(D^POS ,-iD"^eg ) be an EP space for dataset pair D^pos and D"^eg . Using THEOREM 4, it is known that: k k

EP(D^POS ,-,D"^eg)= YEPplateaufai, D^)=Y[L₁R₁] where

1=1 /=1 π₁,π₂K π_k ≥ l , and (L₁R₁) is the border of the EPplateau(π_i,D^pas). Then,

(L₁R₁)J = I₅K ,k is defined as a concise representation of EP(p^pos ,-,D^mg). This concise representation is also called an EP plateau border representation.

An EP plateau border representation is a lossless representation for any EP spaces. Moreover, for any represented EP X, if X e EPplateauψ_ι,O^p ), then

π_t . So, the support for any represented EP is directly derivable.

Methods to discover EP plateau borders

Referring to Figure 1, a method 110 to discover EP plateau borders

(L_tR_t)j = I₅K ,k is provided. This method relies on Propositions 3 and 4 as follows:

PROPOSITION 3

Let EP{D^POS ,-J)^nes) be an EP space for pair of datasets D^pos and D^neg so

that: EP{D^P0S

(Lf R^) J = 1,K ,k_e, is the plateau border representation for this EP space; and

,k_p is

the subspace border representation of F(I₅D**"¹ J. Then:

1. k_e ≤ k_p , that is, the number of EP plateaus is less than or equal to the number of itemset plateaus in D^pos ; 2. EPplateau(π° ,D^pos)^ plateau(π^p ,D^pos) for all i and j such that π- = π^p ; and

3. Rf ci R_j,i = I₅K k_e , if the support level of the boundary itemsets in R_j is equal to the support level of the boundary EPs in Rf^F .

To prove Statement 1, given an itemset plateau, plateauψ,D^pos), of F^D*⁰").

If snp(x,D^neg)≠ 0 for each X e plateau{π,D^pos), then there is no itemset in this plateau that is an EP. Then k_e < k_p . Thus, it is clear that k_e ≤ k_p.

To prove Statement 2, let X e EPplateau{π\ , D^pos ) . Then, sup(x, D^pos ) = π' ≥ 1. So, X e F(l, D^pos ). Then there is plateau(π^p , D^pos ) and π- = πj\ Thus, EPp!ateau(π° ,D^pos)c plateau(π^P ,D^pm).

To prove Statement 3, let X e Rf , then X s plateau^ ,D^pos), where π ^P = sup(x, D^pos ) = π- . Let A G RJ , where R_j is the right bound of the border of plateauψ^p ,D^pos), be such that X Q A . Since X c ^ , it is known that SUp(^, £>^Meg ) = 0. Since A e R_j , it is known that SUp(^₃ D^pos ) = π_} ^p . Therefore, A is an EP in EPplateau(π°,D^pos). This implies A ^ T for some X'e Rf . Therefore, I c i c T is obtained. Since X'e Rf , X = A = X' is obtained. Therefore X e R_j.

Given a dataset D, a zero support space is used to call the collection of all itemsets that have a zero support in D. This space is denoted by F(1, D) .

PROPOSITION 4

F(L, D) is a convex space. This is proved by supposing itemsets J,F e F(1,D) such that X c Y. Then sup(X, D) = sup(Y, D) = 0. So, for each Z satisfying J c Z cZ, sup(Z, D) = 0 is obtained. Thus, convexity is proved. In fact, a zero support space is a special itemset plateau where the support level is zero. As zero support spaces are convex, each of them can be concisely represented by a border in the form of (£_;{/}) for some L.

Based on Propositions^'3 and 4, the following methods are provided to discover the plateau border representation for a space of emerging patterns. Let:

• a pair of datasets D^pos and D^ms ;

• the subspace border representation (L₁Ri

and • the border (x° i?° ), i = I₃K , k_p of the zero support space JF(I, D"^eg ) be given.

The border of the EP plateaus of EP(D^P0S , ->D^m8 ) is computed 110 as follows, where the border of an itemset plateau is used to derive the border of an EP plateau. First, partition 120

):

• Let [Ii ,JRj J, an itemset plateau in F(I, D^P∞ ), be partitioned 121 into v non- overlapping equivalence classes [Z^, V^, jj>7 ⁼ IK- ^v . so that

[L_t,R,]= Y [l_y, {βj], as per Corollary 1. Then the support of R_ys in D^ms

is calculated 122. If for every R_y , sxφ{RyD"^eg ) ≠ 0 , then there is no itemset in [L₁R₁ J that is a desired EP. On the other hand, without loss of generality, suppose the u of these RyS have a zero support in D^neε , and the rest of nonzero support in D^neg . By Proposition 3, these u special R^s are exactly constituted as the right bound of EPplateauψ_t ,D^pos), where π_t =sn_V{R_a,D^pos).

• The remaining problem on finding the left bound of EPplateauψi ,D^pos). To do this, the border ( jL°i?° ) of the zero support space -F(I, D"^eg ) is used, and the left bound of EPplateau(p_t ,D^pos) is obtained 123 as u , are the u

R_ϋ s that have zero support in D"^eg . The proof is to find the leftbound of

After going through all the itemset plateaus of .P(l, D^pos J, then the borders of all possible EP plateaus can be found 124. The above method does not require any access to D^pos or to D^neg to calculate support in the discovery of the left bound of an EP plateau. However, the overhead for this is the given border /Z°i?°\ of the

zero support space F\l,D"^es J. If this border is not given, a different method is used to compute the left bound of an EP plateau.

The notion of emerging patterns by DEFINITION 4 is intended to capture sharp difference between two classes of data. However, it may miss some important patterns. For example, itemsets that have high support in one dataset but very low level (≠ θ) of support in the other dataset are not covered by this definition. Due to the significant change in their support, these itemsets may be sometimes equally interesting as EPs. Next, the original definition of EPs is extended to cover these patterns.

DEFINITION 6

Given a pair of datasets D^pos and D"^es . An itemset X is an EP if svφ(x,D^pos)≥ π , but svφ(x,D^neg)< δ , where π > δ , and δ is a small integer number. This space of emerging patterns is denoted by: EP(_πT D^pos ,^δ >l D"^eg J.

Under this definition, EPs still have attractive properties. Firstly, EP spaces are convex, thus each of them can be concisely represented by a border. Secondly, EP plateau subspaces are also convex, thus a different concise representation for EP spaces can be used. Thirdly, the EP plateau borders can be efficiently computed from the borders of the π -frequent itemset plateaus in D^pos . THEOREM 5

Given a pair of datasets D^pos and Z)"* . Then both EP{^ D^pos,^s I D"^eg) and

D"^eg,^δ -I D^posJ are convex spaces. This is proved by supposing itemsets XJ≡ EP(j D^pos,^δ I D^ms), and l c F. Let l c Z c Y. Since X e EP(_π t D^pos,^δ I D"^eg), sup(x, D"^eg ) < δ is obtained. Then all of its supersets - those itemsets more specific than X - must have a support less than δ in D"^eg as well. Therefore, sxφ(z,D^ms)< δ . Since Y e EP(_πt D^pos,^δ I D"^eg), svφ\Y,D^pos ]> π is obtained. Then all its subsets - those more general than Y - must have a support ≥ π in D^pos as well. Therefore, sup(z,D^pos)> π . Thus Y is an EP in EP{_π t D^pos ,^s I D^ms ) . So EP(_π t D^{pos δ} I D"^eg ) is a convex space. The case for EP{j D"^eg,^δ I D^pos) is similar.

As described earlier, a plateau-like structural decomposition

EP\_π t D^pos ,^δ 4 D"^eg ) is defined as a collection of subspaces each consisting of all EPs that have the same support level (> π) in D^pas . Such a subspace is also called an EP plateau. It is proved in a way similar to THEOREM 4, that every EP plateau is a convex space. So₁

where (∑f ,R?^P) is the border of the ith EPplateau{π_t ,D^pos). Then:

I₁Lf ^p , Rf^p \ i = 1,K , k_e is defined as a concise representation ,

EP(_π t D^pos,^δ I D^nes). This concise representation is also called EP plateau border representation.

Let Fψ,D^pos) be a frequent itemset space of D^pos . This space can be partitioned into non-overlapping itemset plateaus as per THEOREM 1 as:

F{π,D^pos)=Yplateau{π^p,D^pos)=γ[L_pR]

J=I 7=1 where ( L_j, R_jYj = 1,K ,k_p , is the subspace border representation of

Fψ,D^posJ. Then it can be shown in a way similar to Proposition 3 that if π^p = π° F then:

• plateauψ^p ,D^pDS) covers EPplateauψ°,D^pos); and • the right bound of EPplateauψ. , D^pos ) is a subset of the right bound of plateauψ^p,D^p0S).

Another method is described based on the above, to discover the plateau border representation for an EP(_K T D^pos,^s 4- D^mg). In fact, the border of EPplateauψ_j , D^pos ) is derived from plateauψ_j , D^pos ). This is derived as follows:

• The right bound of EPplateauψ_j , D^pos ) is given by

Y_meΛ {R I SUp(R, D"^eg)< δ}, where R is the right bound of plateauψ_j,D^pos). This formulation follows via a reasoning similar to Proposition 3. In other words, the process is: For each R in the right bound of

its support in D^ms is calculated; and if sup(R,D"^eg)< δ ,

R is moved to the right bound of EPplateauψ_j, D^pos).

• The left bound of EPplateauψ_j , D^pos J is given by

Y^ min{z | [L, {Y}] = [Y]_0^ ,X <E L,X ^ Z ^ Y, sup(z, D"^eg )< δ} where R is the right bound of EPplateauψ_j , D^pos ) . This formulation follows via a reasoning similar to Corollary 1. In other words, the process is: For each Y in the right bound of EPplateauψ_j, D^pos), the border of the equivalence class

[Fj₀P₀₅ is found. Then for every X in the left bound of [T]₀^ , their support in D"^eg is calculated. If the support is less than δ , X is moved into the left bound of EPplateauψ_j , D^pos ) . Otherwise, the support in D"^es is checked for all immediate proper supersets (within the equivalence class) of X. Those whose support in D"^eg are less than δ are moved into the left bound of EPplateauψ _j, D^pos). For those whose support in D"^es are greater than or equal to δ , their immediate superset is checked in turn. And so on.

If the border of J(δ,D"^eg) is known, then the left bound of EPplateau(π_j,D^pos) can be computed by a method similar to where F[δ,D^neg ) is specially set as F^~(l,D"^es).

Concise and lossless representations for spaces of frequent itemsets is important in data mining. These representations may be based on key patterns, on closed patterns, and on plateau borders. Also, disjunction-free sets, disjunction-free generators, and generalised disjunction-free generators can be also used for concise representations. Transitions between these concise representations are possible. EP spaces possess a global convexity and plateau-like subspace convexities.

A process to generate an emerging pattern with zero support from a dataset D_n is described. The process involves mining the border (L,R) of zero support itemsets from the dataset D_n . Let D_n = JT₁, UT₂₅A ,T_p } be a dataset consisting of p transactions. To mine the border IL,R) of zero support itemsets of D_n the following steps are performed:

1. Calculate R = I ^JT₁ . If R is equal to any of J\ , i= 1, Λ ,p. Then (L,R) = ( φ,φ ) , stop.

2. Otherwise, find the maximal ones of J ^{1 1}. , i =1 , Λ , p. Denote the maximal ones as [M₁ ,M₂, Λ ,M_k), where k < p. Note that a transaction J⁷- '^s maximal in D_n if there does not exist a transaction in D_n that is a proper superset of J⁷.. 3. Use the function Border-Diff to get the border (I,]*) by calling Border-Diff

( ({φ,{R}),({φ,{M_lύM₂,A ,M_k}) )- This function is proposed in Dong & Li in the paper authored by G.Dong and J. Li entitled "Efficient mining of emerging patterns: Discovering trends and differences" in S. Chaudhuri and D. Madigan, editors; Proceedings of the Fifth ACM SIGKDD International Conference on

Knowledge Discovery and Data Mining, pages 43 to 52, 1999, ACM Press,

Computer Implementation

The embodiments of the invention are preferably implemented using a computer, such as the general-purpose computer shown in FIG. 2. In particular, the process of FIG. 1 can be implemented as software, or a computer program, executing on the computer. The method or process steps for classification of data into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space are effected by instructions in the software that are carried out by the computer. The software may be implemented as one or more modules for implementing the process steps. A module is a part of a computer program that usually performs a particular function or related functions. Also, a module can also be a packaged functional hardware unit for use with other components or modules.

In particular, the software may be stored in a computer readable medium, including the storage devices described below. The software is preferably loaded into the computer from the computer readable medium and then carried out by the computer. A computer program product includes a computer readable medium having such software or a computer program recorded on it that can be carried out by a computer. The use of the computer program product in the computer preferably effects an advantageous apparatus for classification of data into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space in accordance with the embodiments of the invention.

The computer system 200 consists of the computer 202, a video display 216, and input devices 218, 220. In addition, the computer system 200 can have any of a number of other output devices including line printers, laser printers, plotters, and other reproduction devices connected to the computer 202. The computer system 200 can be connected to one or more other computers via a communication interface 208b using an appropriate communication channel 230 such as a modem communications path, a computer network, or the like. The computer network may include a local area network (LAN), a wide area network (WAN), an Intranet, and/or the Internet.

The computer 202 itself consists of a central processing unit(s) (simply referred to as a processor hereinafter) 204, a memory 206 which may include random access memory (RAM) and read-only memory (ROM), input/output (10) interfaces 208A and 208B, a video interface 210, and one or more storage devices generally represented by a block 212 in FIG. 2. The storage device(s) 212 can consist of one or more of the following: a floppy disc, a hard disc drive, a magneto-optical disc drive, CD-ROM, magnetic tape or any other of a number of non-volatile storage devices well known to those skilled in the art. Each of the components 204 to 212 is typically connected to one or more of the other devices via a bus 214 that in turn can consist of data, address, and control buses. The video interface 210 is connected to the video display 216 and provides video signals from the computer 202 for display on the video display 216. User input to operate the computer 202 can be provided by one or more input devices 208b. For example, an operator can use the keyboard 218 and/or a pointing device such as the mouse 220 to provide input to the computer 202.

The system 200 is simply provided for illustrative purposes and other configurations can be employed without departing from the scope and spirit of the invention. Typically, the processes of the embodiments, described hereinafter, are resident as software or a program recorded on a hard disk drive (generally depicted as block 212 in FIG. 2) as the computer readable medium, and read and controlled using the processor 204. Intermediate storage of the program and pixel data and any data fetched from the network may be accomplished using the semiconductor memory 206, possibly in concert with the hard disk drive 212.

In some instances, the program may be supplied to the user encoded on a CD- ROM, DVD or a floppy disk (both generally depicted by block 212), or alternatively could be read by the user from the network via a network device connected to the computer, for example. Still further, the software can also be loaded into the computer system 200 from other computer readable medium including magnetic tape, a ROM or integrated circuit, a magneto-optical disk, a radio or infra-red transmission channel between the computer and another device, a computer readable card such as a PCMCIA card, and the Internet and Intranets including email transmissions and information recorded on websites and the like. The foregoing are merely exemplary of relevant computer readable mediums. Other computer readable mediums may be practiced without departing from the scope and spirit of the invention.

It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the scope or spirit of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects illustrative and not restrictive.

Industrial Applicability

Identifying emerging patterns is useful when mining a set of contrasting characteristics for a pair of datasets. For example, comparing tumor cancer cells against normal healthy cells, to discover gene expression changes for diagnosis. In another example, comparing demographic data between two states, to discover distinct social behaviour for the two populations. In a further example, comparing transaction data of a company between two consecutive years, to discover emerging purchase patterns from the customers and shrinking trends.

Claims

WE CLAIM:

1. A method for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least a first and second dataset using a computer processor, the method comprising: partitioning a frequent itemset space in the first dataset into convex non- overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and partitioning each itemset plateau into non-overlapping equivalence classes such that the union of all non-overlapping equivalence classes in an itemset plateau is equal to the itemset plateau; where, the non-overlapping equivalence classes that are supported in the second dataset determine the borders of the emerging pattern plateaus; and where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.

2. The method according to claim 1, wherein the right bound of the border of the emerging pattern plateau is computed by identifying the non-overlapping equivalence classes that have a zero support in the second dataset.

3. The method according to claim 1, wherein the left bound of the border of the emerging pattern plateau is computed by using the border of an itemset plateau that have a zero support in the second dataset.

4. The method according to claim 1, further comprising the step of ranking data in the datasets using an entropy method.

5. The method according to claim 4, further comprising the step of selecting a predetermined number of ranked data.

6. The method according to claim 5, further comprising the step of discretizing the selected data.

7. The method according to claim 1, further comprising the step of translating the emerging pattern space into clinical rules.

8. A method for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least a first and second dataset using a computer processor, the method comprising: partitioning a frequent itemset space in the first dataset into convex non- overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and computing the borders of the emerging pattern plateau by computing the support of the left and right bounds of the borders in the second dataset, each emerging pattern plateau having an equal support value in the first dataset; where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.

9. The method according to claim 8, further comprising the step of ranking data in the datasets using an entropy method.

10. The method according to claim 9, further comprising the step of selecting a predetermined number of ranked data.

11. The method according to claim 10, further comprising the step of discretizing the selected data.

12. The method according to claim 8, further comprising the step of translating the emerging pattern space into clinical rules.

13. A system for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least a first and second dataset using a computer processor, the method comprising: a first partitioning module to partition a frequent itemset space in the first dataset into convex non-overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and a second partitioning module to partition each itemset plateau into non- overlapping equivalence classes such that the union of all non-overlapping equivalence classes in an itemset plateau is equal to the itemset plateau; where, the non-overlapping equivalence classes that are supported in the second dataset determine the borders of the emerging pattern plateaus; and where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.

14. A system for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least a first and second dataset using a computer processor, the system comprising: a partitioning module to partition a frequent itemset space in the first dataset into convex non-overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and a processing module to compute the borders of the emerging pattern plateau by computing the support of the left and right bounds of the borders in the second dataset, each emerging pattern plateau having an equal support value in the first dataset; where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.

15. A computer program product comprised of a computer-readable medium for carrying computer-executable instructions for. partitioning a frequent itemset space in the first dataset into convex non- overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; partitioning each itemset plateau into non-overlapping equivalence classes such that the union of all non-overlapping equivalence classes in an itemset plateau is equal to the itemset plateau; where, the non-overlapping equivalence classes that are supported in the second dataset determine the borders of the emerging pattern plateaus; and where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.

16. A computer program product comprised of a computer-readable medium for carrying computer-executable instructions for: partitioning a frequent itemset space in the first dataset into convex non- overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and computing the borders of the emerging pattern plateau by computing the support of the left and right bounds of the borders in the second dataset, each emerging pattern plateau having an equal support value in the first dataset; where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.