[go: up one dir, main page]

WO2006062485A1 - A method for classifying data - Google Patents

A method for classifying data Download PDF

Info

Publication number
WO2006062485A1
WO2006062485A1 PCT/SG2004/000402 SG2004000402W WO2006062485A1 WO 2006062485 A1 WO2006062485 A1 WO 2006062485A1 SG 2004000402 W SG2004000402 W SG 2004000402W WO 2006062485 A1 WO2006062485 A1 WO 2006062485A1
Authority
WO
WIPO (PCT)
Prior art keywords
plateau
itemset
dataset
convex
space
Prior art date
Application number
PCT/SG2004/000402
Other languages
French (fr)
Inventor
Jinyan Li
Original Assignee
Agency For Science, Technology And Research
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agency For Science, Technology And Research filed Critical Agency For Science, Technology And Research
Priority to PCT/SG2004/000402 priority Critical patent/WO2006062485A1/en
Publication of WO2006062485A1 publication Critical patent/WO2006062485A1/en

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation

Definitions

  • the invention concerns a method for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least first and second datasets using a computer processor
  • Frequent itemsets are a type of knowledge pattern that have been widely and extensively investigated in the data mining field over the last decade. Research has mainly focussed on two aspects: efficient mining of all frequent itemsets from a dataset; and efficient mining of a subset of frequent itemsets but that can be used to derive all frequent ones without accessing the original data again.
  • Given a dataset D an itemset is frequent if its support in D is larger than or equal to a support value.
  • the problem usually poses heavy computation when the support value is pre-set as a small value and D is high-dimensional.
  • a method for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least a first and second dataset using a computer processor comprising: partitioning a frequent itemset space in the first dataset into convex non- overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and partitioning each itemset plateau into non-overlapping equivalence classes such that the union of all non-overlapping equivalence classes in an itemset plateau is equal to the itemset plateau; where, the non-overlapping equivalence classes that are supported in the second dataset determine the borders of the emerging pattern plateaus; and where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.
  • no access to the first and second datasets is required to calculate support in the discovery of the left bound of an emerging pattern plateau.
  • the right bound of the border of the emerging pattern plateau may be computed by identifying the non-overlapping equivalence classes that have a zero support in the second dataset.
  • the left bound of the border of the emerging pattern plateau may be computed by using the border of an itemset plateau that have a zero support in the second dataset.
  • the method may further comprise the step of ranking data in the datasets using an entropy method.
  • the method may further comprise the step of selecting a predetermined number of ranked data.
  • the method may further comprise the step of discretizing the selected data.
  • the method may further comprise the step of translating the emerging pattern space into clinical rules.
  • a method for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least a first and second dataset using a computer processor comprising: partitioning a frequent itemset space in the first dataset into convex non- overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and computing the borders of the emerging pattern plateau by computing the support of the left and right bounds of the borders in the second dataset, each emerging pattern plateau having an equal support value in the first dataset; where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.
  • a system for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least a first and second dataset using a computer processor the method comprising: a first partitioning module to partition a frequent itemset space in the first dataset into convex non-overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and a second partitioning module to partition each itemset plateau into non- overlapping equivalence classes such that the union of all non-overlapping equivalence classes in an itemset plateau is equal to the itemset plateau; where, the non-overlapping equivalence classes that are supported in the second dataset determine the borders of the emerging pattern plateaus; and where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.
  • a system for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least a first and second dataset using a computer processor comprising: a partitioning module to partition a frequent itemset space in the first dataset into convex non-overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and a processing module to compute the borders of the emerging pattern plateau by computing the support of the left and right bounds of the borders in the second dataset, each emerging pattern plateau having an equal support value in the first dataset; where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.
  • a computer program product comprised of a computer-readable medium for carrying computer-executable instructions for: partitioning a frequent itemset space in the first dataset into convex non- overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; partitioning each itemset plateau into non-overlapping equivalence classes such that the union of all non-overlapping equivalence classes in an itemset plateau is equal to the itemset plateau; where, the non-overlapping equivalence classes that are supported in the second dataset determine the borders of the emerging pattern plateaus; and where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.
  • a computer program product comprised of a computer-readable medium for carrying computer-executable instructions for: partitioning a frequent itemset space in the first dataset into convex non- overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and computing the borders of the emerging pattern plateau by computing the support of the left and right bounds of the borders in the second dataset, each emerging pattern plateau having an equal support value in the first dataset; where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.
  • the present invention provides a plateau-like partitioning method for the structural decomposition of both frequent itemset spaces and emerging pattern spaces. All these subspaces are convex spaces. Therefore, they can be concisely represented by borders. Based on these convexities, new concise representations for frequent itemset spaces and EP spaces are possible. Thus, the mining of EPs can be reduced to the mining of only boundary EPs.
  • Figure 1 is a process flow diagram of the method for discovering emerging pattern plateau borders
  • FIG. 2 is a block diagram of a general purpose computer with which the embodiments of the invention can be practiced.
  • the process of identifying patterns generally is referred to as "data mining” and comprises the use of processes that, under some acceptable computational efficiency limitations, produce a particular enumeration of the required patterns.
  • a major aspect of data mining is to discover dependencies among data, a goal that has been achieved with the use of association rules, but is also now becoming • practical for other types of classifiers.
  • Each record costs of a number of items.
  • a record is also called a transaction or an instance.
  • Emerging patterns are basically conjunctions of simple conditions. Preferably, emerging patterns have four qualities: validity, novelty, potential usefulness, and understandability.
  • the validity of a pattern relates to the applicability of the pattern to new data. Ideally a discovered EP should be valid with some degree of certainty when applied to new data. One way of investigating this property is to test the validity of an EP after the original databases have been updated by adding a small percentage of new data. An EP may be particularly strong if it remains valid even when a large percentage of new data is incorporated into the previously processed data.
  • Novelty relates to whether a pattern has not been previously discovered, either by traditional statistical methods or by human experts.
  • a pattern involves lots of conditions or a low support level, because a human expert may know some, but not all, of the conditions involved, or because human experts tend to notice those patterns that occur frequently, but not the rare ones.
  • Emerging patterns can describe trends in any two or more non-overlapping temporal datasets and significant differences in any two or more spatial datasets. In this context, a
  • “difference” refers to a set of conditions that most data of a class satisfy but none of the other class satisfies.
  • a “trend” refers to a set of conditions that most data in a dataset for one time-point satisfy, but data in a dataset for another time-point do not satisfy. Accordingly, EP's may find considerable use in applications such as predicting business market trends, identifying hidden causes to some specific diseases among different racial groups, for handwriting character recognition, for distinguishing between genes that code for ribosomal proteins and those that code for other proteins, and for differentiating positive instances and negative instances, for example, "healthy” or "sick", in discrete data.
  • the superset EP may still have the same frequency as the boundary EP in the home class.
  • EP's having this property are called “plateau EP's,” and are defined in the following way: given a boundary EP, all its supersets having the same frequency as itself are its “plateau EP's.” Of course, boundary EP's are trivially plateau EP's of themselves. Unless the frequency of the EP is zero, a superset EP with this property is also necessarily an EP.
  • Plateau EP's as a whole can be used to define a space. All plateau EP's of all boundary EP's with the same frequency as each other are called a "plateau space” (or simply, a "P-space”). So, all EP's in a P-space are at the same significance level in terms of their occurrence in both their home class and their counterpart class. Suppose that the home frequency is n, then the P-space may be denoted a "Pn-space.” All P-spaces have a useful property, called “convexity" which means that a P-space can be succinctly represented by its most general and most specific elements. The most specific elements of P-spaces contribute to the high accuracy of a classification system based on EP's.
  • Convexity is an important property of certain types of large collections of data that can be exploited to represent such collections concisely. If a collection is a convex space, "convexity" is said to hold. Convexity is a mathematical notion useful for concisely and losslessly representing a large space of points using a border. A point is an itemset, and many itemset spaces are seen as convex spaces.
  • a space of frequent itemsets is convex, and its non-overlapping subspaces after a plateau-like structural decomposition are all convex.
  • Such global convexity and subspace convexity are also observed in a different type of itemset space that is defined on a pair of datasets.
  • Plateau-like Partitioning Method Structural decomposition involves partitioning a collection of frequent itemsets of a dataset into subspaces such that each of them consists of exactly all itemsets that have the same high support. These subspaces are itemset plateaus. So, a space of frequent itemsets can be viewed as a collection of plateau-like subspaces under structural decomposition. All of the subspaces have the property of convexity, and that the convexity is also possessed by the whole itemset space. Hence, not only the whole space but each of the subspaces are bordered by a pair of bounds: A left bound consisting of the most "general" itemsets, and a right bound consisting of the most "specific" ones. Thus these itemset spaces can be concisely represented by the boundary itemsets in a lossless way. A new concise representation for spaces of frequent itemsets is provided.
  • the right bound of the whole space are exactly the so-called maximal itemsets.
  • the right bound of each subspace are exactly the so-called closed frequent itemset, and the left bound of each subspace are exactly the so-called key patterns or generators. All the left boundary itemsets constitute a convex space, and that each subspace is a union of non-overlapping equivalent classes.
  • the picture drawn by the decomposition method for the space of frequent itemsets captures, in one place, many different important data mining notions. Consequently, a better and deeper understanding about the mining of frequent itemsets by different methods is obtained.
  • An itemset or a pattern is a set of items.
  • a transaction is a non-empty set of items.
  • a dataset is a non-empty set of transactions.
  • An itemset X is said to be contained or included in a transaction T if X c T .
  • An itemset X is said to be contained in a dataset D if there is r e Z ) such that I c T.
  • the support of an itemset X in a dataset D denoted sup(X, D), is the number of transactions in dataset D that contain X.
  • An itemset X is said to be frequent in a dataset D if sup(X, D) is greater than or equal to a pre- specified equal support value min_sup. Given a dataset D and a support value min_sup, the collection of all frequent itemsets in D is called the space of frequent itemsets, and is denoted by F(min_sup, D).
  • Convexity is an attractive property for large spaces of patterns because it is helpful for their concise and lossless representation.
  • An itemset space S is convex if, for all X, Y e s such that X c T , it is the case that Z c 1 S whenever X c Z c F
  • ⁇ a ⁇ be a collection of itemsets.
  • PROPOSITION 1 Let a dataset D and a support value min_sup ⁇
  • F(min_sup, D) contains fifteen frequent itemsets, excluding 0. But, the border of F(min_sup, D) is ( ⁇ , ⁇ &ce ⁇ ) containing only two itemsets. It is thus a concise representation of F(min_sup, D).
  • Proposition 1 can also be obtained by using, for example, the Max-Miner algorithm. This algorithm is able to discover frequent maximal itemsets that are actually all itemsets in the right bound of a space of frequent itemsets, and can cover all frequent itemsets. However, convex frequent itemset spaces are able to be further decomposed systematically into convex subspaces.
  • An itemset plateau, plateau( ⁇ , D), in a dataset D is defined as the collection of all itemsets that have the same support value % in D.
  • F(min_sup, D) be a frequent itemset space in a dataset D.
  • these frequent itemsets have k distinct support values ⁇ v Jt 2 K ⁇ k . Then,
  • This border is concise because it uses only three itemsets to represent the six itemsets that have the same support level ⁇ r 4 in D.
  • THEOREM 1 is used for a new concise representation of frequent itemsets.
  • F(min_sup, D) is a frequent itemset space in a dataset D, and suppose these frequent itemsets have k distinct support values 7T 15 Tr 2 K ⁇ k . Then: k k
  • equivalence classes deals with itemsets that are common to a set of transactions in a dataset. Two itemsets are considered "equivalent” if they are included in exactly the same transactions.
  • the equivalence class [p] D of P in a dataset D is the collection of itemsets defined as:
  • plateau ⁇ t ,D is the union of non- overlapping equivalence classes.
  • the number of equivalence classes is equal to the number of itemsets in the right bound of the border of plateau ⁇ t ,D).
  • closed patterns are defined based on the notion of equivalence classes. For example, a closed pattern is the maximal (the most specific, under set inclusion) itemset of an equivalence class. In a simpler way but equivalently, a closed pattern is also defined as an itemset that has no proper superset with the same support.
  • a plateau ⁇ t ,D is the collection of all itemsets whose support value in D is ⁇ ( .
  • the right bound of are itemsets that are most specific in plateau( ⁇ i ⁇ D). That is, any proper superset of any itemset in this bound is not in So, those proper supersets have a different support level.
  • the itemsets in the right bound of plateau ⁇ D) are all closed patterns.
  • all closed patterns with the same support value ⁇ ( . are in the right bound of plateau ⁇ it i , D). So, the right bounds of the itemset plateaus capture exactly all closed patterns.
  • a pattern P is a key pattern in a dataset D iff
  • HiIn[P] 0 is the collection of all minimal (the most general, under set inclusion) itemsets in [P] 13 . That is, P is a key pattern in D iff no proper subset of P is in [P] 0 .
  • the itemsets in the left bounds of plateaus are all key patterns. Bounds of convex plateaus can be used to capture all key patterns and all closed patterns instead of confining them to smaller equivalence classes.
  • FC D the set of closed itemsets that are also frequent in D.
  • FG D is denoted as the set of key patterns that are also frequent in D.
  • F(min_ SUp 5 I)) based on FG n is defined as:
  • a new concise representation for frequent itemsets is provided. It is a subspace border representation for concisely representing frequent itemset spaces. This representation is better than those based solely on FC D or FG D .
  • F(min_sup,D) is a frequent itemset space in a dataset D.
  • C ,C 25 K 5 C n are the frequent closed itemsets sorted by their support in a descending order, then the first group of itemsets that have the same support is R 1 , the second group of itemsets having the same support is R 2 and so on. Finally the kth group of itemsets having the same support is R k .
  • Frequent itemsets are defined with regard to one dataset. A different type of itemsets that are related to two datasets, and convexity and plateau-like structural decomposition for this type of itemset are described. Traditionally, this type of itemset are called EPs. Frequent itemsets and emerging patterns can find different practical applications, and EPs have a broader potential.
  • This space of emerging patterns is denoted by EP ⁇ D pos ,-,D" eg ).
  • an itemset Y is an EP if sv ⁇ ( ⁇ ,D neg ) ⁇ 0 , but 0 .
  • This space EP space is then denoted by EP(D" e8 ,-,D pos ).
  • EPs are useful when mining a set of contrasting characteristics for a pair of datasets. For example, comparing tumor cancer cells against normal healthy cells, to discover gene expression changes for diagnosis. In another example, comparing demographic data between two states, to discover distinct social behaviour for the two populations. In a further example, comparing transaction data of a company between two consecutive years, to discover emerging purchase patterns from the customers and shrinking trends. THEOREM 3
  • EP[p" eg ,->D p0S ) are convex spaces. This is proved by firstly proving EP(D POS ,-,D" es ) is a convex space.
  • X, Y e EP(D POS ,-,D neg ) be itemsets such that X c Y.
  • EP(D P ° S ,-,D neg ) is a convex space.
  • the proof for the convexity of EP(D neg ,-iD p0S ) is similar.
  • the EP space EP ⁇ p pos ,—D neg ) can be concisely represented of itemsets by a border (L,R) , and [i,i?]are exactly all the EPs.
  • the mining of EPs can be reduced to the mining of only boundary EPs.
  • the left boundary EPs are EPs that have the biggest support in D pos . As their support in D" eg are all zero, the left boundary EPs are the most discriminating patterns for differentiating the two classes of data.
  • the original data has two classes of samples: 52 prostate tumor cells and 50 normal cells. Every cell sample is described by 12600 features, also known as attributes, genes, probe-and-EST variables in the bioinformatics literature. Every feature is continuous, measuring the expression level of that specific gene in the cell samples by Affymetrix U95Av2 oglinucleotide microarrays. So, the values of a feature may change significantly from one sample to another, or from the tumor class to the normal class. Genes or gene groups can be discovered that can discriminate the two classes of cell samples sharply.
  • the 52 prostate tumor cells are denoted as D"""
  • the 50 normal cells are denoted as D""
  • the reference numbers in the first column of the two tables represent items.
  • the reference number "7” stands for the item “38406_f_at > 499.0”, saying that the expression of gene 38406_f_at is larger than 499.0.
  • EP patterns are translated into useful clinical rules.
  • the first EP ⁇ 6, 23, 83 ⁇ in Table 2 can be interpreted as the following rule: If the expression of gene 38406_f_at ⁇ 499 and that of 556_s_jit ⁇ 127 and that of gene 34678_at is between 0 and 33, then this cell is a tumor cell. This rule is satisfied by 82.69% of the tumor samples.
  • EP(D POS ,-,D neg ) be the space of EPs that have a non-zero support in D pos .
  • a plateau-like structural decomposition partitions EP ⁇ p pos ,- ⁇ D nes ) into subspaces such that each of them consists of
  • EPs that have the same high support in D pos .
  • Such a subspace is denoted by EPplateau ⁇ i y D pos ) where ⁇ t is the support of the EPs in D pos .
  • each EPplateau ⁇ i , D pos J is a convex space.
  • X e EPplateau ⁇ t , D pos is a collection of all EPs whose support in D pos is ⁇ t ⁇ 0.
  • two patterns XJ e satisfy X c 7.
  • THEOREM 4 states that a space of EPs can be partitioned into non-overlapping EP plateaus, and each of such subspaces is a convex space. This theorem is useful for a new concise representation of EP spaces. As described, a known and simple concise representation of an EP space is this space's border. Previous studies on the mining of EPs are almost all focused on the mining of such a border enriched with these boundary EPs' support. However, this representation does not provide sufficient information to derive the support for any represented EPs.
  • EP(D POS ,-iD" eg ) be an EP space for dataset pair D pos and D" eg .
  • THEOREM 4 it is known that: k k
  • An EP plateau border representation is a lossless representation for any EP spaces. Moreover, for any represented EP X, if X e EPplateau ⁇ ⁇ ,O p ), then
  • EP ⁇ D POS ,-J) nes be an EP space for pair of datasets D pos and D neg so
  • Rf ci R j ,i I 5 K k e , if the support level of the boundary itemsets in R j is equal to the support level of the boundary EPs in Rf F .
  • a zero support space is used to call the collection of all itemsets that have a zero support in D. This space is denoted by F(1, D) .
  • the border of the EP plateaus of EP(D P0S , ->D m8 ) is computed 110 as follows, where the border of an itemset plateau is used to derive the border of an EP plateau. First, partition 120 ):
  • DEFINITION 4 The notion of emerging patterns by DEFINITION 4 is intended to capture sharp difference between two classes of data. However, it may miss some important patterns. For example, itemsets that have high support in one dataset but very low level ( ⁇ ⁇ ) of support in the other dataset are not covered by this definition. Due to the significant change in their support, these itemsets may be sometimes equally interesting as EPs. Next, the original definition of EPs is extended to cover these patterns.
  • An itemset X is an EP if sv ⁇ (x,D pos ) ⁇ ⁇ , but sv ⁇ (x,D neg ) ⁇ ⁇ , where ⁇ > ⁇ , and ⁇ is a small integer number.
  • This space of emerging patterns is denoted by: EP( ⁇ T D pos , ⁇ >l D" eg J.
  • EPs still have attractive properties. Firstly, EP spaces are convex, thus each of them can be concisely represented by a border. Secondly, EP plateau subspaces are also convex, thus a different concise representation for EP spaces can be used. Thirdly, the EP plateau borders can be efficiently computed from the borders of the ⁇ -frequent itemset plateaus in D pos . THEOREM 5
  • D" eg , ⁇ -I D pos J are convex spaces. This is proved by supposing itemsets XJ ⁇ EP(j D pos , ⁇ I D ms ), and l c F. Let l c Z c Y. Since X e EP( ⁇ t D pos , ⁇ I D" eg ), sup(x, D" eg ) ⁇ ⁇ is obtained. Then all of its supersets - those itemsets more specific than X - must have a support less than ⁇ in D" eg as well. Therefore, sx ⁇ (z,D ms ) ⁇ ⁇ .
  • EP ⁇ ⁇ t D pos , ⁇ 4 D" eg is defined as a collection of subspaces each consisting of all EPs that have the same support level (> ⁇ ) in D pas . Such a subspace is also called an EP plateau. It is proved in a way similar to THEOREM 4, that every EP plateau is a convex space. So 1
  • F ⁇ ,D pos a frequent itemset space of D pos . This space can be partitioned into non-overlapping itemset plateaus as per THEOREM 1 as:
  • R is moved to the right bound of EPplateau ⁇ j , D pos ).
  • [Fj 0 P 05 is found. Then for every X in the left bound of [T] 0 ⁇ , their support in D" eg is calculated. If the support is less than ⁇ , X is moved into the left bound of EPplateau ⁇ j , D pos ) . Otherwise, the support in D" es is checked for all immediate proper supersets (within the equivalence class) of X. Those whose support in D" eg are less than ⁇ are moved into the left bound of EPplateau ⁇ j , D pos ). For those whose support in D" es are greater than or equal to ⁇ , their immediate superset is checked in turn. And so on.
  • EPplateau( ⁇ j ,D pos ) can be computed by a method similar to where F[ ⁇ ,D neg ) is specially set as F ⁇ (l,D" es ).
  • Concise and lossless representations for spaces of frequent itemsets is important in data mining. These representations may be based on key patterns, on closed patterns, and on plateau borders. Also, disjunction-free sets, disjunction-free generators, and generalised disjunction-free generators can be also used for concise representations. Transitions between these concise representations are possible. EP spaces possess a global convexity and plateau-like subspace convexities.
  • a process to generate an emerging pattern with zero support from a dataset D n involves mining the border (L,R) of zero support itemsets from the dataset D n .
  • D n JT 1 , UT 25 A ,T p ⁇ be a dataset consisting of p transactions.
  • the embodiments of the invention are preferably implemented using a computer, such as the general-purpose computer shown in FIG. 2.
  • the process of FIG. 1 can be implemented as software, or a computer program, executing on the computer.
  • the method or process steps for classification of data into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space are effected by instructions in the software that are carried out by the computer.
  • the software may be implemented as one or more modules for implementing the process steps.
  • a module is a part of a computer program that usually performs a particular function or related functions.
  • a module can also be a packaged functional hardware unit for use with other components or modules.
  • the software may be stored in a computer readable medium, including the storage devices described below.
  • the software is preferably loaded into the computer from the computer readable medium and then carried out by the computer.
  • a computer program product includes a computer readable medium having such software or a computer program recorded on it that can be carried out by a computer.
  • the use of the computer program product in the computer preferably effects an advantageous apparatus for classification of data into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space in accordance with the embodiments of the invention.
  • the computer system 200 consists of the computer 202, a video display 216, and input devices 218, 220.
  • the computer system 200 can have any of a number of other output devices including line printers, laser printers, plotters, and other reproduction devices connected to the computer 202.
  • the computer system 200 can be connected to one or more other computers via a communication interface 208b using an appropriate communication channel 230 such as a modem communications path, a computer network, or the like.
  • the computer network may include a local area network (LAN), a wide area network (WAN), an Intranet, and/or the Internet.
  • the computer 202 itself consists of a central processing unit(s) (simply referred to as a processor hereinafter) 204, a memory 206 which may include random access memory (RAM) and read-only memory (ROM), input/output (10) interfaces 208A and 208B, a video interface 210, and one or more storage devices generally represented by a block 212 in FIG. 2.
  • the storage device(s) 212 can consist of one or more of the following: a floppy disc, a hard disc drive, a magneto-optical disc drive, CD-ROM, magnetic tape or any other of a number of non-volatile storage devices well known to those skilled in the art.
  • Each of the components 204 to 212 is typically connected to one or more of the other devices via a bus 214 that in turn can consist of data, address, and control buses.
  • the video interface 210 is connected to the video display 216 and provides video signals from the computer 202 for display on the video display 216.
  • User input to operate the computer 202 can be provided by one or more input devices 208b. For example, an operator can use the keyboard 218 and/or a pointing device such as the mouse 220 to provide input to the computer 202.
  • the system 200 is simply provided for illustrative purposes and other configurations can be employed without departing from the scope and spirit of the invention.
  • the processes of the embodiments, described hereinafter are resident as software or a program recorded on a hard disk drive (generally depicted as block 212 in FIG. 2) as the computer readable medium, and read and controlled using the processor 204.
  • Intermediate storage of the program and pixel data and any data fetched from the network may be accomplished using the semiconductor memory 206, possibly in concert with the hard disk drive 212.
  • the program may be supplied to the user encoded on a CD- ROM, DVD or a floppy disk (both generally depicted by block 212), or alternatively could be read by the user from the network via a network device connected to the computer, for example.
  • the software can also be loaded into the computer system 200 from other computer readable medium including magnetic tape, a ROM or integrated circuit, a magneto-optical disk, a radio or infra-red transmission channel between the computer and another device, a computer readable card such as a PCMCIA card, and the Internet and Intranets including email transmissions and information recorded on websites and the like.
  • the foregoing are merely exemplary of relevant computer readable mediums. Other computer readable mediums may be practiced without departing from the scope and spirit of the invention.
  • Identifying emerging patterns is useful when mining a set of contrasting characteristics for a pair of datasets. For example, comparing tumor cancer cells against normal healthy cells, to discover gene expression changes for diagnosis. In another example, comparing demographic data between two states, to discover distinct social behaviour for the two populations. In a further example, comparing transaction data of a company between two consecutive years, to discover emerging purchase patterns from the customers and shrinking trends.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least a first and second dataset using a computer processor, the method comprising: partitioning (110) a frequent itemset space in the first dataset into convex non-overlapping intemset plateaus, each itemset plateau having an equal support value in the first dataset; and partitioning (121) each itemset plateau into non-overlapping equivalence classes such that the union of all non-overlapping equivalence classes in an itemset plateau is equal to the itemset plateau; where, the non-overlapping equivalence classes that are supported in the second dataset determine the borders of the emerging pattern plateaus; and where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.

Description

Title
A Method for Classifying Data
Technical Field
The invention concerns a method for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least first and second datasets using a computer processor
Background of the Invention
Frequent itemsets are a type of knowledge pattern that have been widely and extensively investigated in the data mining field over the last decade. Research has mainly focussed on two aspects: efficient mining of all frequent itemsets from a dataset; and efficient mining of a subset of frequent itemsets but that can be used to derive all frequent ones without accessing the original data again. Given a dataset D, an itemset is frequent if its support in D is larger than or equal to a support value. The problem usually poses heavy computation when the support value is pre-set as a small value and D is high-dimensional.
Summary of the Invention
In a first preferred aspect, there is provided a method for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least a first and second dataset using a computer processor, the method comprising: partitioning a frequent itemset space in the first dataset into convex non- overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and partitioning each itemset plateau into non-overlapping equivalence classes such that the union of all non-overlapping equivalence classes in an itemset plateau is equal to the itemset plateau; where, the non-overlapping equivalence classes that are supported in the second dataset determine the borders of the emerging pattern plateaus; and where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space. Advantageously, no access to the first and second datasets is required to calculate support in the discovery of the left bound of an emerging pattern plateau.
The right bound of the border of the emerging pattern plateau may be computed by identifying the non-overlapping equivalence classes that have a zero support in the second dataset.
The left bound of the border of the emerging pattern plateau may be computed by using the border of an itemset plateau that have a zero support in the second dataset.
The method may further comprise the step of ranking data in the datasets using an entropy method.
The method may further comprise the step of selecting a predetermined number of ranked data.
The method may further comprise the step of discretizing the selected data.
The method may further comprise the step of translating the emerging pattern space into clinical rules.
In a second aspect, there is provided a method for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least a first and second dataset using a computer processor, the method comprising: partitioning a frequent itemset space in the first dataset into convex non- overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and computing the borders of the emerging pattern plateau by computing the support of the left and right bounds of the borders in the second dataset, each emerging pattern plateau having an equal support value in the first dataset; where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.
In a third aspect, there is provided a system for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least a first and second dataset using a computer processor, the method comprising: a first partitioning module to partition a frequent itemset space in the first dataset into convex non-overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and a second partitioning module to partition each itemset plateau into non- overlapping equivalence classes such that the union of all non-overlapping equivalence classes in an itemset plateau is equal to the itemset plateau; where, the non-overlapping equivalence classes that are supported in the second dataset determine the borders of the emerging pattern plateaus; and where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.
In a fourth aspect, there is provided a system for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least a first and second dataset using a computer processor, the system comprising: a partitioning module to partition a frequent itemset space in the first dataset into convex non-overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and a processing module to compute the borders of the emerging pattern plateau by computing the support of the left and right bounds of the borders in the second dataset, each emerging pattern plateau having an equal support value in the first dataset; where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.
In a fifth aspect, there is provided a computer program product comprised of a computer-readable medium for carrying computer-executable instructions for: partitioning a frequent itemset space in the first dataset into convex non- overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; partitioning each itemset plateau into non-overlapping equivalence classes such that the union of all non-overlapping equivalence classes in an itemset plateau is equal to the itemset plateau; where, the non-overlapping equivalence classes that are supported in the second dataset determine the borders of the emerging pattern plateaus; and where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.
In a sixth aspect, there is provided a computer program product comprised of a computer-readable medium for carrying computer-executable instructions for: partitioning a frequent itemset space in the first dataset into convex non- overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and computing the borders of the emerging pattern plateau by computing the support of the left and right bounds of the borders in the second dataset, each emerging pattern plateau having an equal support value in the first dataset; where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.
The present invention provides a plateau-like partitioning method for the structural decomposition of both frequent itemset spaces and emerging pattern spaces. All these subspaces are convex spaces. Therefore, they can be concisely represented by borders. Based on these convexities, new concise representations for frequent itemset spaces and EP spaces are possible. Thus, the mining of EPs can be reduced to the mining of only boundary EPs.
Brief Description of the Drawings
An example of the invention will now be described with reference to the accompanying drawings, in which:
Figure 1 is a process flow diagram of the method for discovering emerging pattern plateau borders; and
Figure 2 is a block diagram of a general purpose computer with which the embodiments of the invention can be practiced.
Detailed Description of the Drawings
The process of identifying patterns generally is referred to as "data mining" and comprises the use of processes that, under some acceptable computational efficiency limitations, produce a particular enumeration of the required patterns. A major aspect of data mining is to discover dependencies among data, a goal that has been achieved with the use of association rules, but is also now becoming practical for other types of classifiers.
Each record costs of a number of items. A record is also called a transaction or an instance.
Emerging patterns are basically conjunctions of simple conditions. Preferably, emerging patterns have four qualities: validity, novelty, potential usefulness, and understandability.
The validity of a pattern relates to the applicability of the pattern to new data. Ideally a discovered EP should be valid with some degree of certainty when applied to new data. One way of investigating this property is to test the validity of an EP after the original databases have been updated by adding a small percentage of new data. An EP may be particularly strong if it remains valid even when a large percentage of new data is incorporated into the previously processed data.
Novelty relates to whether a pattern has not been previously discovered, either by traditional statistical methods or by human experts. Usually, such a pattern involves lots of conditions or a low support level, because a human expert may know some, but not all, of the conditions involved, or because human experts tend to notice those patterns that occur frequently, but not the rare ones. Some EP's, for example, consist of astonishingly long patterns comprising more than 5, including as many as 15 conditions, when the number of attributes in a dataset is as large as
1,000. This provides new and unexpected insights into previously well-understood problems.
The potential usefulness of a pattern arises if it can be used predictively. Emerging patterns can describe trends in any two or more non-overlapping temporal datasets and significant differences in any two or more spatial datasets. In this context, a
"difference" refers to a set of conditions that most data of a class satisfy but none of the other class satisfies. A "trend" refers to a set of conditions that most data in a dataset for one time-point satisfy, but data in a dataset for another time-point do not satisfy. Accordingly, EP's may find considerable use in applications such as predicting business market trends, identifying hidden causes to some specific diseases among different racial groups, for handwriting character recognition, for distinguishing between genes that code for ribosomal proteins and those that code for other proteins, and for differentiating positive instances and negative instances, for example, "healthy" or "sick", in discrete data.
A pattern is understandable if its meaning is intuitively clear from inspecting it. The fact that an EP is a conjunction of simple conditions means that it is usually easy to understand.
Interpretation of an EP is particularly aided when facts about its ability to distinguish between two classes of data are known.
If one more condition (item) is added to a boundary EP, thereby generating a superset of the EP, the superset EP may still have the same frequency as the boundary EP in the home class. EP's having this property are called "plateau EP's," and are defined in the following way: given a boundary EP, all its supersets having the same frequency as itself are its "plateau EP's." Of course, boundary EP's are trivially plateau EP's of themselves. Unless the frequency of the EP is zero, a superset EP with this property is also necessarily an EP.
Plateau EP's as a whole can be used to define a space. All plateau EP's of all boundary EP's with the same frequency as each other are called a "plateau space" (or simply, a "P-space"). So, all EP's in a P-space are at the same significance level in terms of their occurrence in both their home class and their counterpart class. Suppose that the home frequency is n, then the P-space may be denoted a "Pn-space." All P-spaces have a useful property, called "convexity" which means that a P-space can be succinctly represented by its most general and most specific elements. The most specific elements of P-spaces contribute to the high accuracy of a classification system based on EP's.
Convexity is an important property of certain types of large collections of data that can be exploited to represent such collections concisely. If a collection is a convex space, "convexity" is said to hold. Convexity is a mathematical notion useful for concisely and losslessly representing a large space of points using a border. A point is an itemset, and many itemset spaces are seen as convex spaces.
In particular, a space of frequent itemsets is convex, and its non-overlapping subspaces after a plateau-like structural decomposition are all convex. Such global convexity and subspace convexity are also observed in a different type of itemset space that is defined on a pair of datasets.
Plateau-like Partitioning Method Structural decomposition involves partitioning a collection of frequent itemsets of a dataset into subspaces such that each of them consists of exactly all itemsets that have the same high support. These subspaces are itemset plateaus. So, a space of frequent itemsets can be viewed as a collection of plateau-like subspaces under structural decomposition. All of the subspaces have the property of convexity, and that the convexity is also possessed by the whole itemset space. Hence, not only the whole space but each of the subspaces are bordered by a pair of bounds: A left bound consisting of the most "general" itemsets, and a right bound consisting of the most "specific" ones. Thus these itemset spaces can be concisely represented by the boundary itemsets in a lossless way. A new concise representation for spaces of frequent itemsets is provided.
The right bound of the whole space are exactly the so-called maximal itemsets. The right bound of each subspace are exactly the so-called closed frequent itemset, and the left bound of each subspace are exactly the so-called key patterns or generators. All the left boundary itemsets constitute a convex space, and that each subspace is a union of non-overlapping equivalent classes. The picture drawn by the decomposition method for the space of frequent itemsets captures, in one place, many different important data mining notions. Consequently, a better and deeper understanding about the mining of frequent itemsets by different methods is obtained.
Previously, frequent itemsets were defined based on only one dataset. The definition does not use any other datasets. A structural decomposition on a different type of itemset space that is defined based on two datasets is provided by the present invention. The dataset may be contained in a database. These itemsets are called emerging patterns (EPs). Given a pair of datasets Dpos and Dneg , an EP is defined as an itemset that has a support in Dpos not below % , and its support in D"eg is less than δ. A special case of this definition is that π = 1 and δ = 1. That is, an EP is an itemset that has a non-zero support in Dpos , but its support in D"es is zero. Similarly as for spaces of frequent itemsets, the global convexity and the subspace convexity are also observed for spaces of emerging patterns. Based on this, a new concise representation for spaces of emerging patterns is provided. Also, methods to compute the new representations are described.
Moving from the notion of frequent itemsets to EPs, an important difference is the number of datasets used in their definitions. The definition of frequent itemset needs one dataset, but to define EPs, two datasets are required. The content and meaning of the itemset spaces change significantly. Frequent itemsets can be used to capture association rules, episode rules, sequential patterns and clusters. EPs are used in the biomedical field for discovering diagnostic biomarkers and rules, in the business side for identifying niches opportunities, and in the field of machine learning. However, the commonality which the two types of itemset spaces share is the convexity in their global spaces and in their plateau-like subspaces.
Let / = {/15/2,K ,im} be a set of distinct literals called items. An itemset or a pattern, is a set of items. A transaction is a non-empty set of items. A dataset is a non-empty set of transactions. An itemset X is said to be contained or included in a transaction T if X c T . An itemset X is said to be contained in a dataset D if there is r e Z) such that I c T. The support of an itemset X in a dataset D, denoted sup(X, D), is the number of transactions in dataset D that contain X. An itemset X is said to be frequent in a dataset D if sup(X, D) is greater than or equal to a pre- specified equal support value min_sup. Given a dataset D and a support value min_sup, the collection of all frequent itemsets in D is called the space of frequent itemsets, and is denoted by F(min_sup, D).
In a first example, suppose a dataset D consists of five transactions as shown in the following table:
Trans. ID Items
1 a, c, d
2 b, c, e
3 a, b, c, e, f
4 b, e
5 a, b, c, e
Let min_sup = 2. Then, F(HIiIl-SUp3P) = {0, {4 {4 {*}>{?}}
Figure imgf000011_0001
A space of frequent itemsets can be large. Next, convexity can be described in the context of itemset spaces. Convexity is an attractive property for large spaces of patterns because it is helpful for their concise and lossless representation.
DEFINITION 1
An itemset space S is convex if, for all X, Y e s such that X c T , it is the case that Z c 1S whenever X c Z c F
The set inclusion is crucial for this definition. Different operations can also be used to define convex spaces.
For a convex space S, the collection of all most general itemsets in S as a bound of S is defined, where an itemset X is most general in S if there is no proper subset of X in S. Similarly, the collection of all most specific itemsets as another bound of S is defined, where an itemset X is most specific in S if there is no proper superset of X in S. The former bound is the left bound of S, denoted L; and the latter bound is the right bound of S1 denoted R. The pair of left and right bound is the border of S, and is written as (L, i?) . Also defined is: [L,R]=
Figure imgf000011_0002
Y}
(L, R.) and [L,R] are two different notions. Specifically, [L,R] = S, but (L,R) is only a concise representation of the whole space S in a lossless way. Also, [{0},{0}]= {0} and i [0,0] = 0.
In a second example, let S =
{{a},
Figure imgf000011_0003
be a collection of itemsets. Then S is a convex space. Its left bound is L = {{a}, {&}}, and its right bound is R =
Figure imgf000011_0004
{b,c, d}} . So, the border of S is ({{a}, {&}}, {{a, b,
Figure imgf000012_0001
d}}) . This border is a concise representation of S as it uses only four itemsets to represent all the nine itemsets in S.
PROPOSITION 1 Let a dataset D and a support value min_sup < |X>j be given. Then F(min_sup, D) is a convex item space. Furthermore, its border is of the form ({θ}, i?) for some R. This is proved by supposing two itemsets X, Y e F(min_sup, Z)) such that X e Y . Then sup(X, D) ≥ sup(7, D) > min_ sup . Let Z be any itemset such that X c Z e Y . Then sup(Z, D) > sup(7, D) . Then sup(Z, D) ≥ min_ sup . So, Z e .P(HUn- sup) . Thus, F(min_ sup, D) is a convex space.
Since O c T for every T e D , it is the case that sup(θ, D) = \D ≥ min_ sup .
Therefore, 0 e .F(min_sup, £>). As 0 is a proper subset for any non-empty itemsets, it follows that 0 is the sole most general element in F(min_sup, D). So, the border of F(min_sup, D) is of the form ({θ}, R) , for some R.
In a third example, following the dataset D and min_sup in the first example, it is known that F(min_sup, D) contains fifteen frequent itemsets, excluding 0. But, the border of F(min_sup, D) is ({θ}, {{α&ce}}) containing only two itemsets. It is thus a concise representation of F(min_sup, D).
The results stated in Proposition 1 can also be obtained by using, for example, the Max-Miner algorithm. This algorithm is able to discover frequent maximal itemsets that are actually all itemsets in the right bound of a space of frequent itemsets, and can cover all frequent itemsets. However, convex frequent itemset spaces are able to be further decomposed systematically into convex subspaces.
The plateau-like decomposition method to systematically partition a space of frequent itemsets into non-overlapping subspaces requires defining itemset plateaus. DEFINITION 2
An itemset plateau, plateau(π , D), in a dataset D is defined as the collection of all itemsets that have the same support value % in D.
Let F(min_sup, D) be a frequent itemset space in a dataset D. Suppose these frequent itemsets have k distinct support values πv Jt2 K πk . Then,
F(min_ sup, D) = Ϋ plateauføj , D) . i=\
Every frequent itemset bas a unique support value in D. Consequently,
Figure imgf000013_0001
Z)) = 0 for all i ≠ j . That is, F(min_sup, D) is partitioned into non-overlapping itemset plateaus.
In a fourth example, following the dataset D and min_sup in the first example, four itemset plateaus are obtained as follows, where πλ = 5 , π2 = 4 , π% = 3 , and π4 = 2 :
1. plateau(π1,D) = {6}
2. plateau{π2,D) = {{c},{b},{e\{be}}
3. plateau(π3 , D) = {{a}, {ac}, {be},
Figure imgf000013_0002
4. plateau(π4 , D) — {{ab}, {ae}, {abc}, {abe}, {ace}, {abce}}
Every plateauψ^D) is a convex space.
THEOREM 1
Let a dataset D be given. Every πs ,
Figure imgf000013_0003
is a convex space. This is proved without loss of generality, by letting plateau^,, D) ≠ 0 . Suppose itemsets X and Y are in plateau(πt , D) and Ic J. Then sup(X, D) = sup(Y, D) = %.. Let Z be any itemset Z such that I c Z c T. Then suρ(X, D) ≥ sup(Z, D) ≥ sup(F, D) . As sup(X, Z)) = suρ(7, D) = πt , it follows that sup(Z, Z)) = TT, . Therefore the itemset Z is in
Figure imgf000013_0004
As a convex space can be concisely represented by its border, every itemset plateau can be concisely represented by a border.
In a fifth example, following the fourth example, the border of plateau(π4 , D)
Figure imgf000014_0001
is
{{{ab\{ae}\{{abce}}) .
This border is concise because it uses only three itemsets to represent the six itemsets that have the same support level τr4 in D.
THEOREM 1 is used for a new concise representation of frequent itemsets. Suppose F(min_sup, D) is a frequent itemset space in a dataset D, and suppose these frequent itemsets have k distinct support values 7T15Tr2K πk . Then: k k
F(min_sup, D) = γplateau(πt , D) =Y[Z, , R1 ]
where (LnRf) is the border of plateau(πt,D). So, the mining of all frequent itemsets can be converted to the mining of borders of all frequent itemset plateaus. This concise representation is different from those based on closed patterns or on key patterns. These borders can be efficiently discovered.
The bounds of the itemset plateaus and itemset ideas such as closed patterns and key patterns (also called generators) are related. The relationship between itemset plateaus and equivalence classes on which closed patterns and key patterns are defined is described.
The notion of equivalence classes deals with itemsets that are common to a set of transactions in a dataset. Two itemsets are considered "equivalent" if they are included in exactly the same transactions. The filter, f(P, D), of an itemset P in a dataset D as f(P, D) = {T e D \ P c I} is defined. Then an itemset Q is equivalent to an itemset P in a dataset D iff f(Q, D ) = f(P, D). Then the equivalence class [p]D of P in a dataset D is the collection of itemsets defined as:
[P]D = {X \ f(X,D) = f(P,D)}. PROPOSITtON 2
[P]D is a convex itemset space, and the right bound of its border is a singleton set. This is proved by letting X, Y e [P]0 such that X c Y. Suppose X c Z c Y. Then, f(Y, D) c f(Z, D) c f{X, D) . Since, X, Y G [P]D , it is known that f(Y, D) = f{P, D) = f(X, D) . So f(Z, D) = f(P, D) and Z e . Thus [p]D is convex.
Letting (L,R) be the border of [p]D . Suppose XJ e R . Then f(X,D)= f(P,D) = f(Y,D). This implies f(XvY,D) = f(P,D). Thus X \j Ye. [P]n . Since X, Y e R by the definition of the right bound of a border, both X and Y are most specific. Then it must be the case that X = Xu F = F. Thus R is a singleton set.
Sup(X, D) = sup(Y, D) for all X, Ye [P]0. Therefore, the equivalence class of any itemset P must be a subset of plateau{πt,D), where πt = sup(P, D). Following from the definition of equivalence classes, plateau{πt,D) is the union of non- overlapping equivalence classes. In particular, the number of equivalence classes is equal to the number of itemsets in the right bound of the border of plateau{πt,D). In other words:
COROLLARY 1
Let a dataset D and a support value min_sup < \D be given. Then for any π ≥ min_sup such that plateau{πt,D) ≠ 0 , it is the case that
Vlateauip ,,D) = [Lx uZ2 uK uLR,{RuR2,K ,Rr}] = Ϋ[Z,,{i?,}]
;=1 Where fcL,fcL,K ,[RR]O = {[P]O}| sup(P, D) = π}, and each (L19[R1]) is the border of the corresponding [i?,. ]o . In a sixth example, following the fourth example, it is known that
Figure imgf000016_0001
Here, [MW]= [WL = [WL »>d
[{{be}, {ce}}, {{bee}}] = [{δc}]β = [{ce}L are two equivalence classes in D.
Together with equivalence classes, frequent closed patterns and frequent key patterns have been widely studied in the data mining field. One main reason is that both of them can be separately used to concisely represent the whole space of frequent itemsets.
Plateau boundaries and closed patterns and key patterns are related. Traditionally, closed patterns are defined based on the notion of equivalence classes. For example, a closed pattern is the maximal (the most specific, under set inclusion) itemset of an equivalence class. In a simpler way but equivalently, a closed pattern is also defined as an itemset that has no proper superset with the same support.
As described, a plateau{πt,D) is the collection of all itemsets whose support value in D is π( . The right bound of
Figure imgf000016_0002
are itemsets that are most specific in plateau(πi}D). That is, any proper superset of any itemset in this bound is not in So, those proper supersets have a different
Figure imgf000016_0003
support level. Hence, the itemsets in the right bound of plateauψ^D) are all closed patterns. Conversely, all closed patterns with the same support value π(. are in the right bound of plateau{iti , D). So, the right bounds of the itemset plateaus capture exactly all closed patterns.
DEFINITION 3
Key patterns is a concept that was developed after closed patterns. Its definition is also based on equivalence classes. A pattern P is a key pattern in a dataset D iff
P e minjpj^ , where HiIn[P]0 is the collection of all minimal (the most general, under set inclusion) itemsets in [P]13. That is, P is a key pattern in D iff no proper subset of P is in [P]0.
According to this definition and the definition for the left bound of
Figure imgf000017_0001
it can be seen that the itemsets in the left bounds of plateaus are all key patterns. Bounds of convex plateaus can be used to capture all key patterns and all closed patterns instead of confining them to smaller equivalence classes.
The itemsets in the left bounds of all plateauψ^D) form a convex space.
THEOREM 2
Let a dataset D and a support value min_sup <
Figure imgf000017_0002
be given. Suppose: k k
F(min_ sup, D) = Yplateau(π( , D) = Y[L1 , R1 ] i=l (=1 k where L1 and R1 are bounds of the itemset plateaus. Then Yi, is a convex k space. This is proved by letting K = YL1 . Suppose X, Y e K such that
X a. Y . Suppose X c, Z c f . Then, Z e K is proved. The condition X c Z c F implies four cases: I c Z c U c Z =r,I = Z c F, and X= Z = Y. Only
Z € K when X cz Z cz Y needs to be proved. The other three cases are apparent.
Let sup(Z, D) = πm . Then, Z = plateau(πm,D). Z e L1n is proved by contradiction. Assume Z & Lm . Then there is A e L1n such that A c Z because plateau(πm,D) is a convex space.
Let Y = Z u V such that V n Z = 0. As A, Z € plateau(πm , D) , then sup (A , D ) = π „ = sup (Z , D ) . As A c Z , then sup(^4 u F, Z)) = sup(7, /J)) . So, Y has a proper subset A u F with the same support as itself: But Y is an itemset in the left bound of a plateau. So all its proper subsets must have a larger support than Y's. This is a contradiction. So, Z <= Lm .
Then Z e K .
The contribution by frequent closed patterns and key patterns to data mining is that both of them can be separately used to concisely represent the whole space of frequent itemsets in a dataset. These results can be re-written and then compared to the subspace border representation.
Given a dataset D and a support value min_sup, denote FCD as the set of closed itemsets that are also frequent in D. Then a concise representation of F(min_sup, D) is defined as the set FCD enriched by the information on the support for every itemset in FC D . This concise representation is correct because for any itemset JT c /. • If there is a Z e FC D such that Z ^ X , then X e F(HUn-SUp5D) and s\φ{X,D) = max{sup(r, D) \ Y e FCDandY 3 X] ;
• otherwise, X ^ F(min_sup,_D).
To get the support for all the frequent itemsets from this concise representation, all possible X must be tested. This computation is not cheap, but does not need any access to D for the calculation of support.
Key patterns are sometimes used as an intermediate step for the discovery of closed itemsets. The key patterns themselves can constitute a concise lossless representation of frequent itemsets. FGD is denoted as the set of key patterns that are also frequent in D. Then a concise representation of F(min_ SUp5I)) based on FGn is defined as:
• the set of FGD enriched by the information on support for every itemset set in FG » ; • the set GBd D of all minimal infrequent key patterns; and
• the set of ID of items that each occurs in D. This concise representation is correct because for any itemset X c I,
• if X g ID or there is a Z e GBd D such that Z c X , then X g F(min_sup,D)
• otherwise, X e F(min_sup,/J)) and sup(X, D) = min{sup(75 D)] Y e FG D andY c X}
Frequent key patterns alone are not sufficient for a lossless concise representation of a space of frequent itemsets. Also, the support of represented frequent itemsets is not immediate or obvious.
A new concise representation for frequent itemsets is provided. It is a subspace border representation for concisely representing frequent itemset spaces. This representation is better than those based solely on FC D or FGD .
Suppose F(min_sup,D) is a frequent itemset space in a dataset D. As described earlier: k k
F(π∞ι_ sup, D) = Ypϊateau(πt , D) = Y[L1 , R1 ]
'=1 '=1 where πλ > π >2 K > πk are the k distinct support values for these frequent itemsets, and each (LnR1) is the border of the corresponding plαteαu(πnD). Then the subspace border representation is defined as: (L1 , R1 ), / = 1,K , k enriched with information on the support of these boundary itemsets in D.
k
The mining of (L1 , R1 ), i = 1,K , k is efficient. Y Yi/ is the set of all frequent key k patterns in D, and Yi?,. is the set of all frequent closed patterns in D. Then, a
method can be used to discover frequent closed patterns and simultaneously k k discover both Yi,- and Yi?. since key patterns are usually used as an
J=I (=1 intermediate step for the discovery closed itemsets. Suppose C ,C25K 5Cn are the frequent closed itemsets sorted by their support in a descending order, then the first group of itemsets that have the same support is R1 , the second group of itemsets having the same support is R2 and so on. Finally the kth group of itemsets having the same support is Rk . Similarly, L1, i = I5K ,k , can be identified from the frequent key patterns.
The L1, i = 1,K ,k , representation for F(mia_ sup, D) has two main advantages: Firstly, the generation of all represented frequent itemsets is direct. That is, for each X e [Z; , i?, ] , X is a frequent itemset. Secondly, the support of any represented frequent itemset is clear. That is, for each X e [L1 , R1 ] = plateau(π,- , D), sup(jT, D) = π, .
Frequent itemsets are defined with regard to one dataset. A different type of itemsets that are related to two datasets, and convexity and plateau-like structural decomposition for this type of itemset are described. Traditionally, this type of itemset are called EPs. Frequent itemsets and emerging patterns can find different practical applications, and EPs have a broader potential.
DEFINITION 4
Given a pair of datasets Dpos and D"eg , an itemset X is an EP if sup(x, Dpos ) ≠ 0 , but sup(x, D"es ) = 0. This space of emerging patterns (EP space), is denoted by EP\Dpos ,-,D"eg ). Symmetrically, an itemset Y is an EP if svφ(γ,Dneg)≠ 0 , but
Figure imgf000020_0001
0 . This space EP space is then denoted by EP(D"e8,-,Dpos).
EPs are useful when mining a set of contrasting characteristics for a pair of datasets. For example, comparing tumor cancer cells against normal healthy cells, to discover gene expression changes for diagnosis. In another example, comparing demographic data between two states, to discover distinct social behaviour for the two populations. In a further example, comparing transaction data of a company between two consecutive years, to discover emerging purchase patterns from the customers and shrinking trends. THEOREM 3
Given a pair of datasets Dpos and Dmg , then both EP(DP0S ,~,D"eg) and
EP[p"eg ,->Dp0S) are convex spaces. This is proved by firstly proving EP(DPOS ,-,D"es) is a convex space. Let X, Y e EP(DPOS ,-,Dneg) be itemsets such that X c Y. Suppose IcZ c J. Since X e £?(£>*<* , -D"* ) , it is known that X does not occur in D"eg . Then all of its supersets - those itemsets more specific than X - cannot occur in D"eg . Therefore, Z cannot occur in Dneε . So1 sup(z,Dneg)= 0. Since 7 e EP(DP°S , -iD"*), it is known that
SUp(F, Dprø J > 0. Then all subsets of Y - those more general than Y - must have a non-zero support in Dpos . Therefore, Z occurs in Dpos . Consequently, Z is an EP in EP{DPOS ,-,Dneg ). Thus EP(DP°S ,-,Dneg ) is a convex space. The proof for the convexity of EP(Dneg ,-iDp0S ) is similar.
So, the EP space EP{ppos ,—Dneg ) can be concisely represented of itemsets by a border (L,R) , and [i,i?]are exactly all the EPs. Thus, the mining of EPs can be reduced to the mining of only boundary EPs.
Also, the left boundary EPs are EPs that have the biggest support in Dpos . As their support in D"eg are all zero, the left boundary EPs are the most discriminating patterns for differentiating the two classes of data.
Example - Bioinformatics
Initially, characteristics of prostate cancer cells are captured from their gene expression profiling data. The original data has two classes of samples: 52 prostate tumor cells and 50 normal cells. Every cell sample is described by 12600 features, also known as attributes, genes, probe-and-EST variables in the bioinformatics literature. Every feature is continuous, measuring the expression level of that specific gene in the cell samples by Affymetrix U95Av2 oglinucleotide microarrays. So, the values of a feature may change significantly from one sample to another, or from the tumor class to the normal class. Genes or gene groups can be discovered that can discriminate the two classes of cell samples sharply. The 52 prostate tumor cells are denoted as D""" , and the 50 normal cells as
Dnor . Firstly, to discover the EPs, the 12600 features are ranked using an entropy method. Next, 40 top-ranked features are selected and discretized by the entropy method. After the discretization, a total of 90 items is obtained, and each transaction (each cell) is a set of 40 items. For this pair of reduced and discretized datasets, there are 24287 left boundary EPs for EP(Dnor , -iDtum ), and 18268 left boundary EPs for EP(Dtum ,-,Dnor).
Ten left boundary EPs of Ep{Dm\-tDtum ) are listed in Table 1 below:
itemset X sup(X,Z>"or) sup(X, £>""" )
{7, 21} 36(72.00%) 0
{7, 44} 36(72.00%) 0
{7, 11} 35(72.00%) 0
{7, 43} 35(72.00%) 0
{7, 39} 34(72.00%) 0
{7, 87} 34(72.00%) 0
{7, 32} 34(72.00%) 0
{11, 24} 34(72.00%) 0
{24, 29} 34(72.00%) 0
{7. 86} 33(72.00%) 0
Ten left boundary EPs of EP(Dtum ,->D"or ) are listed in Table 2 below:
itemset X Svφ(X,D"or) sup^/J)""")
{6, 23, 83} 0 43 (82.69%)
{14, 23, 36, 55, 83} 0 42 (80.77%)
{6, 16, 55, 83} 0 42 (80.77%)
{16, 36, 55, 83} 0 42 (80.77%) ■
{14, 55, 67, 83} 0 42 (80.77%)
{6, 14, 67, 83} 0 42 (80.77%)
{19, 55, 83} 0 42 (80.77%)
{4, 67, 83} 0 42 (80.77%)
{14, 55, 62} 0 42 (80.77%) {14, 36, 62} 0 42 (80.77%)
The reference numbers in the first column of the two tables represent items. For example, the reference number "7" stands for the item "38406_f_at > 499.0", saying that the expression of gene 38406_f_at is larger than 499.0. These EPs are characteristics, rather than random noise patterns, of the tumor cells or the normal cells since these EPs occur in a large percentage (for example, 72%, 82%) of samples in the tumor class or in the normal class. Furthermore, as they occur with high frequencies in either as D""" , or D"or , but not simultaneously in both, they can serve as excellent diagnostic indicators for prostate cancer, and may even play a causal role in the cancer.
These EP patterns are translated into useful clinical rules. For example, the first EP {6, 23, 83} in Table 2 can be interpreted as the following rule: If the expression of gene 38406_f_at < 499 and that of 556_s_jit < 127 and that of gene 34678_at is between 0 and 33, then this cell is a tumor cell. This rule is satisfied by 82.69% of the tumor samples.
A New Concise Representation For EP Spaces
As described earlier, the subspaces of a space of frequent itemsets are all convex spaces after plateau-like structural decomposition. This also happens to EP spaces.
DEFINITION 5
Given a pair of datasets Dpos and Dneg , let EP(DPOS ,-,Dneg ) be the space of EPs that have a non-zero support in Dpos . A plateau-like structural decomposition partitions EP\ppos ,-ιDnes ) into subspaces such that each of them consists of
EPs that have the same high support in Dpos . Such a subspace is denoted by EPplateau{κ i yDpos ) where πt is the support of the EPs in Dpos .
THEOREM 4
Given a pair of datasets Dpos and Dms , then:
1. EP(DPOS , -,Dneg ) = YEPplateau(nt ,Dp0S) where πx , Jt2 K πk , are all
Figure imgf000023_0001
possible distinct support levels of these EPs in Dpos ; 2. EPplateau^τi,Dpos)n EPplateau{^j,Dpos)= 0 for all i ≠ j; and
3. each EPplateauψi , Dpos J is a convex space.
To prove Statement 1, suppose X 6 EP(D≠S ,-,D"eg) . Then svφ(x,Dpos )≠ 0. Then there is an m, l ≤ m ≤ k , such that sup(X, Dpos)= πm . Therefore, X e EPplateau(πm , Dpos).
Thus, X e
Figure imgf000024_0001
On the other hand, suppose k
Y e YEPplateau(πi , Dpos ). i=l
Then Y is an EP and svap{γ,Dp0S)≠ 0. Therefore, Y e Ep(Dpos,->D"eg). Combining the above two points, Statement 1 is proved.
To prove Statement 2, for each X e EPplateau(πt , Dpos \ sup(z, Dpos ) = πt . As πt ≠ πj,X <£ EPplateauip j,Dpos). Statement 2 is proved.
To prove Statement 3, by definition, X e EPplateauψt , Dpos ) is a collection of all EPs whose support in Dpos is πt ≠ 0. Suppose two patterns XJ e
Figure imgf000024_0002
satisfy X c 7. Let I c'Z c 7. By definition of X G EPplateauiπt , Dpos } sup(γ, Dpos )= πi = sup(x, Dpos ) . Since X <= Z £ 7 , it is the case that sup(x, Dpos ) ≥ sup(z, Dpos)≥ sup(r, Dpos ). So, sup(z, Dpos )= πr Thus, Z e EPplateau(πt , Dpos ), and EPplateau(πt , Dpos ) is a convex space. Statement 3 is proved.
THEOREM 4 states that a space of EPs can be partitioned into non-overlapping EP plateaus, and each of such subspaces is a convex space. This theorem is useful for a new concise representation of EP spaces. As described, a known and simple concise representation of an EP space is this space's border. Previous studies on the mining of EPs are almost all focused on the mining of such a border enriched with these boundary EPs' support. However, this representation does not provide sufficient information to derive the support for any represented EPs.
A new representation for concisely representing a space of EPs and its advantages is described. Let EP(DPOS ,-iD"eg ) be an EP space for dataset pair Dpos and D"eg . Using THEOREM 4, it is known that: k k
EP(DPOS ,-,D"eg)= YEPplateaufai, D^)=Y[L1R1] where
1=1 /=1 π12K πk ≥ l , and (L1R1) is the border of the EPplateau(πi,Dpas). Then,
(L1R1)J = I5K ,k is defined as a concise representation of EP(ppos ,-,Dmg). This concise representation is also called an EP plateau border representation.
An EP plateau border representation is a lossless representation for any EP spaces. Moreover, for any represented EP X, if X e EPplateauψι,Op ), then
Figure imgf000025_0001
πt . So, the support for any represented EP is directly derivable.
Methods to discover EP plateau borders
Referring to Figure 1, a method 110 to discover EP plateau borders
(LtRt)j = I5K ,k is provided. This method relies on Propositions 3 and 4 as follows:
PROPOSITION 3
Let EP{DPOS ,-J)nes) be an EP space for pair of datasets Dpos and Dneg so
that: EP{DP0S
Figure imgf000025_0002
(Lf R^) J = 1,K ,ke, is the plateau border representation for this EP space; and
,kp is
Figure imgf000025_0003
the subspace border representation of F(I5D**"1 J. Then:
1. ke ≤ kp , that is, the number of EP plateaus is less than or equal to the number of itemset plateaus in Dpos ; 2. EPplateau(π° ,Dpos)^ plateau(πp ,Dpos) for all i and j such that π- = πp ; and
3. Rf ci Rj,i = I5K ke , if the support level of the boundary itemsets in Rj is equal to the support level of the boundary EPs in RfF .
To prove Statement 1, given an itemset plateau, plateauψ,Dpos), of F^D*0").
If snp(x,Dneg)≠ 0 for each X e plateau{π,Dpos), then there is no itemset in this plateau that is an EP. Then ke < kp . Thus, it is clear that ke ≤ kp.
To prove Statement 2, let X e EPplateau{π\ , Dpos ) . Then, sup(x, Dpos ) = π' ≥ 1. So, X e F(l, Dpos ). Then there is plateau(πp , Dpos ) and π- = πj\ Thus, EPp!ateau(π° ,Dpos)c plateau(πP ,Dpm).
To prove Statement 3, let X e Rf , then X s plateau^ ,Dpos), where π P = sup(x, Dpos ) = π- . Let A G RJ , where Rj is the right bound of the border of plateauψp ,Dpos), be such that X Q A . Since X c ^ , it is known that SUp(^, £>Meg ) = 0. Since A e Rj , it is known that SUp(^3 Dpos ) = π} p . Therefore, A is an EP in EPplateau(π°,Dpos). This implies A ^ T for some X'e Rf . Therefore, I c i c T is obtained. Since X'e Rf , X = A = X' is obtained. Therefore X e Rj.
Given a dataset D, a zero support space is used to call the collection of all itemsets that have a zero support in D. This space is denoted by F(1, D) .
PROPOSITION 4
F(L, D) is a convex space. This is proved by supposing itemsets J,F e F(1,D) such that X c Y. Then sup(X, D) = sup(Y, D) = 0. So, for each Z satisfying J c Z cZ, sup(Z, D) = 0 is obtained. Thus, convexity is proved. In fact, a zero support space is a special itemset plateau where the support level is zero. As zero support spaces are convex, each of them can be concisely represented by a border in the form of (£;{/}) for some L.
Based on Propositions'3 and 4, the following methods are provided to discover the plateau border representation for a space of emerging patterns. Let:
• a pair of datasets Dpos and Dms ;
• the subspace border representation (L1Ri
Figure imgf000027_0001
and • the border (x° i?° ), i = I3K , kp of the zero support space JF(I, D"eg ) be given.
The border of the EP plateaus of EP(DP0S , ->Dm8 ) is computed 110 as follows, where the border of an itemset plateau is used to derive the border of an EP plateau. First, partition 120
Figure imgf000027_0002
):
• Let [Ii ,JRj J, an itemset plateau in F(I, DP∞ ), be partitioned 121 into v non- overlapping equivalence classes [Z^, V^, jj>7 = IK- v . so that
[Lt,R,]= Y [ly, {βj], as per Corollary 1. Then the support of Rys in Dms
is calculated 122. If for every Ry , sxφ{RyD"eg ) ≠ 0 , then there is no itemset in [L1R1 J that is a desired EP. On the other hand, without loss of generality, suppose the u of these RyS have a zero support in Dneε , and the rest of nonzero support in Dneg . By Proposition 3, these u special R^s are exactly constituted as the right bound of EPplateauψt ,Dpos), where πt =snV{Ra,Dpos).
• The remaining problem on finding the left bound of EPplateauψi ,Dpos). To do this, the border ( jL°i?° ) of the zero support space -F(I, D"eg ) is used, and the left bound of EPplateau(pt ,Dpos) is obtained 123 as u , are the u
Rϋ s that have zero support in D"eg . The proof is to find the leftbound of
Figure imgf000028_0002
After going through all the itemset plateaus of .P(l, Dpos J, then the borders of all possible EP plateaus can be found 124. The above method does not require any access to Dpos or to Dneg to calculate support in the discovery of the left bound of an EP plateau. However, the overhead for this is the given border /Z°i?°\ of the
zero support space F\l,D"es J. If this border is not given, a different method is used to compute the left bound of an EP plateau.
The notion of emerging patterns by DEFINITION 4 is intended to capture sharp difference between two classes of data. However, it may miss some important patterns. For example, itemsets that have high support in one dataset but very low level (≠ θ) of support in the other dataset are not covered by this definition. Due to the significant change in their support, these itemsets may be sometimes equally interesting as EPs. Next, the original definition of EPs is extended to cover these patterns.
DEFINITION 6
Given a pair of datasets Dpos and D"es . An itemset X is an EP if svφ(x,Dpos)≥ π , but svφ(x,Dneg)< δ , where π > δ , and δ is a small integer number. This space of emerging patterns is denoted by: EP(πT Dpos ,δ >l D"eg J.
Under this definition, EPs still have attractive properties. Firstly, EP spaces are convex, thus each of them can be concisely represented by a border. Secondly, EP plateau subspaces are also convex, thus a different concise representation for EP spaces can be used. Thirdly, the EP plateau borders can be efficiently computed from the borders of the π -frequent itemset plateaus in Dpos . THEOREM 5
Given a pair of datasets Dpos and Z)"* . Then both EP{^ Dpos,s I D"eg) and
Figure imgf000029_0001
D"eg,δ -I DposJ are convex spaces. This is proved by supposing itemsets XJ≡ EP(j Dpos,δ I Dms), and l c F. Let l c Z c Y. Since X e EP(π t Dpos,δ I D"eg), sup(x, D"eg ) < δ is obtained. Then all of its supersets - those itemsets more specific than X - must have a support less than δ in D"eg as well. Therefore, sxφ(z,Dms)< δ . Since Y e EP(πt Dpos,δ I D"eg), svφ\Y,Dpos ]> π is obtained. Then all its subsets - those more general than Y - must have a support ≥ π in Dpos as well. Therefore, sup(z,Dpos)> π . Thus Y is an EP in EP{π t Dpos ,s I Dms ) . So EP(π t Dpos δ I D"eg ) is a convex space. The case for EP{j D"eg,δ I Dpos) is similar.
As described earlier, a plateau-like structural decomposition
EP\π t Dpos ,δ 4 D"eg ) is defined as a collection of subspaces each consisting of all EPs that have the same support level (> π) in Dpas . Such a subspace is also called an EP plateau. It is proved in a way similar to THEOREM 4, that every EP plateau is a convex space. So1
Figure imgf000029_0002
where (∑f ,R?P) is the border of the ith EPplateau{πt ,Dpos). Then:
I1Lf p , Rfp \ i = 1,K , ke is defined as a concise representation ,
EP(π t Dpos,δ I Dnes). This concise representation is also called EP plateau border representation.
Let Fψ,Dpos) be a frequent itemset space of Dpos . This space can be partitioned into non-overlapping itemset plateaus as per THEOREM 1 as:
F{π,Dpos)=Yplateau{πp,Dpos)=γ[LpR]
J=I 7=1 where ( Lj, RjYj = 1,K ,kp , is the subspace border representation of
Fψ,DposJ. Then it can be shown in a way similar to Proposition 3 that if πp = π° F then:
• plateauψp ,DpDS) covers EPplateauψ°,Dpos); and • the right bound of EPplateauψ. , Dpos ) is a subset of the right bound of plateauψp,Dp0S).
Another method is described based on the above, to discover the plateau border representation for an EP(K T Dpos,s 4- Dmg). In fact, the border of EPplateauψj , Dpos ) is derived from plateauψj , Dpos ). This is derived as follows:
• The right bound of EPplateauψj , Dpos ) is given by
YmeΛ {R I SUp(R, D"eg)< δ}, where R is the right bound of plateauψj,Dpos). This formulation follows via a reasoning similar to Proposition 3. In other words, the process is: For each R in the right bound of
Figure imgf000030_0001
its support in Dms is calculated; and if sup(R,D"eg)< δ ,
R is moved to the right bound of EPplateauψj, Dpos).
• The left bound of EPplateauψj , Dpos J is given by
Y^ min{z | [L, {Y}] = [Y]0^ ,X <E L,X ^ Z ^ Y, sup(z, D"eg )< δ} where R is the right bound of EPplateauψj , Dpos ) . This formulation follows via a reasoning similar to Corollary 1. In other words, the process is: For each Y in the right bound of EPplateauψj, Dpos), the border of the equivalence class
[Fj0P05 is found. Then for every X in the left bound of [T]0^ , their support in D"eg is calculated. If the support is less than δ , X is moved into the left bound of EPplateauψj , Dpos ) . Otherwise, the support in D"es is checked for all immediate proper supersets (within the equivalence class) of X. Those whose support in D"eg are less than δ are moved into the left bound of EPplateauψ j, Dpos). For those whose support in D"es are greater than or equal to δ , their immediate superset is checked in turn. And so on.
If the border of J(δ,D"eg) is known, then the left bound of EPplateau(πj,Dpos) can be computed by a method similar to where F[δ,Dneg ) is specially set as F~(l,D"es).
Concise and lossless representations for spaces of frequent itemsets is important in data mining. These representations may be based on key patterns, on closed patterns, and on plateau borders. Also, disjunction-free sets, disjunction-free generators, and generalised disjunction-free generators can be also used for concise representations. Transitions between these concise representations are possible. EP spaces possess a global convexity and plateau-like subspace convexities.
A process to generate an emerging pattern with zero support from a dataset Dn is described. The process involves mining the border (L,R) of zero support itemsets from the dataset Dn . Let Dn = JT1, UT25A ,Tp } be a dataset consisting of p transactions. To mine the border IL,R) of zero support itemsets of Dn the following steps are performed:
1. Calculate R = I ^JT1 . If R is equal to any of J\ , i= 1, Λ ,p. Then (L,R) = ( φ,φ ) , stop.
2. Otherwise, find the maximal ones of J 1 1. , i =1 , Λ , p. Denote the maximal ones as [M1 ,M2, Λ ,Mk), where k < p. Note that a transaction J7- 's maximal in Dn if there does not exist a transaction in Dn that is a proper superset of J7.. 3. Use the function Border-Diff to get the border (I,]*) by calling Border-Diff
( ({φ,{R}),({φ,{MM2,A ,Mk}) )- This function is proposed in Dong & Li in the paper authored by G.Dong and J. Li entitled "Efficient mining of emerging patterns: Discovering trends and differences" in S. Chaudhuri and D. Madigan, editors; Proceedings of the Fifth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 43 to 52, 1999, ACM Press,
Computer Implementation
The embodiments of the invention are preferably implemented using a computer, such as the general-purpose computer shown in FIG. 2. In particular, the process of FIG. 1 can be implemented as software, or a computer program, executing on the computer. The method or process steps for classification of data into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space are effected by instructions in the software that are carried out by the computer. The software may be implemented as one or more modules for implementing the process steps. A module is a part of a computer program that usually performs a particular function or related functions. Also, a module can also be a packaged functional hardware unit for use with other components or modules.
In particular, the software may be stored in a computer readable medium, including the storage devices described below. The software is preferably loaded into the computer from the computer readable medium and then carried out by the computer. A computer program product includes a computer readable medium having such software or a computer program recorded on it that can be carried out by a computer. The use of the computer program product in the computer preferably effects an advantageous apparatus for classification of data into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space in accordance with the embodiments of the invention.
The computer system 200 consists of the computer 202, a video display 216, and input devices 218, 220. In addition, the computer system 200 can have any of a number of other output devices including line printers, laser printers, plotters, and other reproduction devices connected to the computer 202. The computer system 200 can be connected to one or more other computers via a communication interface 208b using an appropriate communication channel 230 such as a modem communications path, a computer network, or the like. The computer network may include a local area network (LAN), a wide area network (WAN), an Intranet, and/or the Internet.
The computer 202 itself consists of a central processing unit(s) (simply referred to as a processor hereinafter) 204, a memory 206 which may include random access memory (RAM) and read-only memory (ROM), input/output (10) interfaces 208A and 208B, a video interface 210, and one or more storage devices generally represented by a block 212 in FIG. 2. The storage device(s) 212 can consist of one or more of the following: a floppy disc, a hard disc drive, a magneto-optical disc drive, CD-ROM, magnetic tape or any other of a number of non-volatile storage devices well known to those skilled in the art. Each of the components 204 to 212 is typically connected to one or more of the other devices via a bus 214 that in turn can consist of data, address, and control buses. The video interface 210 is connected to the video display 216 and provides video signals from the computer 202 for display on the video display 216. User input to operate the computer 202 can be provided by one or more input devices 208b. For example, an operator can use the keyboard 218 and/or a pointing device such as the mouse 220 to provide input to the computer 202.
The system 200 is simply provided for illustrative purposes and other configurations can be employed without departing from the scope and spirit of the invention. Typically, the processes of the embodiments, described hereinafter, are resident as software or a program recorded on a hard disk drive (generally depicted as block 212 in FIG. 2) as the computer readable medium, and read and controlled using the processor 204. Intermediate storage of the program and pixel data and any data fetched from the network may be accomplished using the semiconductor memory 206, possibly in concert with the hard disk drive 212.
In some instances, the program may be supplied to the user encoded on a CD- ROM, DVD or a floppy disk (both generally depicted by block 212), or alternatively could be read by the user from the network via a network device connected to the computer, for example. Still further, the software can also be loaded into the computer system 200 from other computer readable medium including magnetic tape, a ROM or integrated circuit, a magneto-optical disk, a radio or infra-red transmission channel between the computer and another device, a computer readable card such as a PCMCIA card, and the Internet and Intranets including email transmissions and information recorded on websites and the like. The foregoing are merely exemplary of relevant computer readable mediums. Other computer readable mediums may be practiced without departing from the scope and spirit of the invention.
It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the scope or spirit of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects illustrative and not restrictive.
Industrial Applicability
Identifying emerging patterns is useful when mining a set of contrasting characteristics for a pair of datasets. For example, comparing tumor cancer cells against normal healthy cells, to discover gene expression changes for diagnosis. In another example, comparing demographic data between two states, to discover distinct social behaviour for the two populations. In a further example, comparing transaction data of a company between two consecutive years, to discover emerging purchase patterns from the customers and shrinking trends.

Claims

WE CLAIM:
1. A method for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least a first and second dataset using a computer processor, the method comprising: partitioning a frequent itemset space in the first dataset into convex non- overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and partitioning each itemset plateau into non-overlapping equivalence classes such that the union of all non-overlapping equivalence classes in an itemset plateau is equal to the itemset plateau; where, the non-overlapping equivalence classes that are supported in the second dataset determine the borders of the emerging pattern plateaus; and where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.
2. The method according to claim 1, wherein the right bound of the border of the emerging pattern plateau is computed by identifying the non-overlapping equivalence classes that have a zero support in the second dataset.
3. The method according to claim 1, wherein the left bound of the border of the emerging pattern plateau is computed by using the border of an itemset plateau that have a zero support in the second dataset.
4. The method according to claim 1, further comprising the step of ranking data in the datasets using an entropy method.
5. The method according to claim 4, further comprising the step of selecting a predetermined number of ranked data.
6. The method according to claim 5, further comprising the step of discretizing the selected data.
7. The method according to claim 1, further comprising the step of translating the emerging pattern space into clinical rules.
8. A method for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least a first and second dataset using a computer processor, the method comprising: partitioning a frequent itemset space in the first dataset into convex non- overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and computing the borders of the emerging pattern plateau by computing the support of the left and right bounds of the borders in the second dataset, each emerging pattern plateau having an equal support value in the first dataset; where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.
9. The method according to claim 8, further comprising the step of ranking data in the datasets using an entropy method.
10. The method according to claim 9, further comprising the step of selecting a predetermined number of ranked data.
11. The method according to claim 10, further comprising the step of discretizing the selected data.
12. The method according to claim 8, further comprising the step of translating the emerging pattern space into clinical rules.
13. A system for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least a first and second dataset using a computer processor, the method comprising: a first partitioning module to partition a frequent itemset space in the first dataset into convex non-overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and a second partitioning module to partition each itemset plateau into non- overlapping equivalence classes such that the union of all non-overlapping equivalence classes in an itemset plateau is equal to the itemset plateau; where, the non-overlapping equivalence classes that are supported in the second dataset determine the borders of the emerging pattern plateaus; and where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.
14. A system for classifying data of at least a first and second dataset into convex non-overlapping emerging pattern plateau borders of a convex emerging pattern space of the at least a first and second dataset using a computer processor, the system comprising: a partitioning module to partition a frequent itemset space in the first dataset into convex non-overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and a processing module to compute the borders of the emerging pattern plateau by computing the support of the left and right bounds of the borders in the second dataset, each emerging pattern plateau having an equal support value in the first dataset; where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.
15. A computer program product comprised of a computer-readable medium for carrying computer-executable instructions for. partitioning a frequent itemset space in the first dataset into convex non- overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; partitioning each itemset plateau into non-overlapping equivalence classes such that the union of all non-overlapping equivalence classes in an itemset plateau is equal to the itemset plateau; where, the non-overlapping equivalence classes that are supported in the second dataset determine the borders of the emerging pattern plateaus; and where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.
16. A computer program product comprised of a computer-readable medium for carrying computer-executable instructions for: partitioning a frequent itemset space in the first dataset into convex non- overlapping itemset plateaus, each itemset plateau having an equal support value in the first dataset; and computing the borders of the emerging pattern plateau by computing the support of the left and right bounds of the borders in the second dataset, each emerging pattern plateau having an equal support value in the first dataset; where the emerging pattern plateau borders are a lossless representation of the convex emerging pattern space.
PCT/SG2004/000402 2004-12-08 2004-12-08 A method for classifying data WO2006062485A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/SG2004/000402 WO2006062485A1 (en) 2004-12-08 2004-12-08 A method for classifying data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/SG2004/000402 WO2006062485A1 (en) 2004-12-08 2004-12-08 A method for classifying data

Publications (1)

Publication Number Publication Date
WO2006062485A1 true WO2006062485A1 (en) 2006-06-15

Family

ID=36578196

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2004/000402 WO2006062485A1 (en) 2004-12-08 2004-12-08 A method for classifying data

Country Status (1)

Country Link
WO (1) WO2006062485A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6507843B1 (en) * 1999-08-14 2003-01-14 Kent Ridge Digital Labs Method and apparatus for classification of data by aggregating emerging patterns
WO2004019264A1 (en) * 2002-08-22 2004-03-04 Agency For Science, Technology And Research Prediction by collective likelihood from emerging patterns

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6507843B1 (en) * 1999-08-14 2003-01-14 Kent Ridge Digital Labs Method and apparatus for classification of data by aggregating emerging patterns
WO2004019264A1 (en) * 2002-08-22 2004-03-04 Agency For Science, Technology And Research Prediction by collective likelihood from emerging patterns

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DONG G. ET AL: "Classification by Aggregating Emerging Patterns", PROC. SECOND INTERNATIONAL CONFERENCE ON DISCOVERY SCIENCE, December 1999 (1999-12-01), pages 30 - 42 *
DONG G. ET AL: "Efficient Mining of Emerging Patterns: Discovery and Data Mining", PROC. ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, August 1999 (1999-08-01), pages 43 - 52 *
LI J. ET AL: "Emerging Patterns and Classification", PROC. ASIAN CONFERENCE, 2000, pages 15 - 32 *
WONG L.: "Convexity in Itemset Spaces", ACADEMICA SINICA, 28 May 2004 (2004-05-28), TAIPEI, TAIWAN, Retrieved from the Internet <URL:http://www.iis.sinica.edu.tw/EVENT/HTML/CHINESE/S200405281030.html> *

Similar Documents

Publication Publication Date Title
Abu Alfeilat et al. Effects of distance measure choice on k-nearest neighbor classifier performance: a review
Pérez-Suárez et al. A review of conceptual clustering algorithms
Mittal et al. Clustering approaches for high‐dimensional databases: A review
Dong et al. Mining border descriptions of emerging patterns from dataset pairs
He et al. Discovering cluster-based local outliers
Atzmueller Subgroup discovery
US8935249B2 (en) Visualization of concepts within a collection of information
Le et al. FCloSM, FGenSM: two efficient algorithms for mining frequent closed and generator sequences using the local pruning strategy
Ressom et al. Increasing the efficiency of fuzzy logic-based gene expression data analysis
US20110082862A1 (en) Identification Disambiguation in Databases
Newaz et al. Network-based protein structural classification
Calumby et al. Multimodal retrieval with relevance feedback based on genetic programming
Hegland Data mining techniques
Chang et al. Categorical data visualization and clustering using subjective factors
Grady et al. Seminar: scalable preprocessing tools for exposomic data analysis
Chiang et al. The Chinese text categorization system with association rule and category priority
Ukey et al. Efficient continuous kNN join over dynamic high-dimensional data
Lin et al. Outlier detection for set-valued data based on rough set theory and granular computing
Lemmerich Novel techniques for efficient and effective subgroup discovery
US8250024B2 (en) Search relevance in business intelligence systems through networked ranking
Rashid et al. A top down approach to enumerate α-maximal cliques in uncertain graphs
Chen Classification of scientific networks using aggregated journal‐journal citation relations in the Journal Citation Reports
Liu et al. Efficient mining of distance‐based subspace clusters
Hetland et al. Evolutionary rule mining in time series databases
Pita et al. Probabilistic topic modeling for short text based on word embedding networks

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 04801754

Country of ref document: EP

Kind code of ref document: A1