Abstract
Discovering contrasts between collections of data is an important task in data mining. In this paper, we introduce a new type of contrast pattern, called a Minimal Distinguishing Subsequence (MDS). An MDS is a minimal subsequence that occurs frequently in one class of sequences and infrequently in sequences of another class. It is a natural way of representing strong and succinct contrast information between two sequential datasets and can be useful in applications such as protein comparison, document comparison and building sequential classification models. Mining MDS patterns is a challenging task and is significantly different from mining contrasts between relational/transactional data. One particularly important type of constraint that can be integrated into the mining process is the gap constraint. We present an efficient algorithm called ConSGapMiner (Contrast Sequences with Gap Miner), to mine all MDSs satisfying a minimum and maximum gap constraint, plus a maximum length constraint. It employs highly efficient bitset and boolean operations, for powerful gap-based pruning within a prefix growth framework. A performance evaluation with both sparse and dense datasets, demonstrates the scalability of ConSGapMiner and shows its ability to mine patterns from high dimensional datasets at low supports.
Similar content being viewed by others
References
Agrawel R, Srikant R (1995) Mining sequential patterns. In: Yu PS, Chen ALP (eds) Proceedings of the 11th international conference on data engineering, IEEE computer society, Taipei, Taiwan, pp 3–14.
Antunes C, Oliveira AL (2003) Generalization of pattern-growth methods for sequential pattern mining with gap constraints. In: Perner P, Rosenfeld A (eds) Proceedings of the 3rd international conference on machine learning and data mining in pattern recognition, lecture notes in computer science, Leipzig, Germany, pp 239–251.
Ayres J, Flannick J, Gehrke J et al (2002) Sequential pattern mining using a bitmap representation. In: Proceedings of the 8th ACM SlGKDD international conference on knowledge discovery and data mining, ACM, Edmonton, Alberta, Canada, pp 429–435.
Bailey J, Manoukian T, Ramamohanarao K (2003) Classification using constrained emerging patterns. In: Dong G, Tang C, Wang W (eds) Proceedings of the 4th international conference on advances in web-age information management, lecture notes in computer science, Chengdu, China, pp 226–237.
Bay SD, Pazzani MJ (2001) Detecting group differences: Mining contrast sets. Data Mining Know Disc 5:213–246.
Garriga GC (2003) Discovering unbounded episodes in sequential data. In: Lavrac N, Gamberger D, Blockeel H, Todorovski L (eds) Proceedings of the 7th European conference on principles and practice of knowledge discovery in databases, lecture notes in computer science, Cavtat-Dubrovnik, Croatia, pp 83–94.
Chan S, Kao B, Yip CL et al (2003) Mining emerging substrings. In: Proceedings of the 8th international conference on database systems for advanced applications, IEEE Computer Society, Kyoto, Japan, pp 119–126.
Das G, Fleischer R, Gasieniec L et al (1997) Episode matching. In: Apostolico A, Hein J (eds) Proceedings of the 8th annual symposium on combinatorial pattern matching, lecture notes in computer science, Aarhus, Denmark, pp 12–27.
Dong G, Li J (1999) Efficient mining of emerging patterns: Discovering trends and differences. In: Proceedings of the 5th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, San Diego, CA, USA, pp 43–52.
Dong G, Li J (2005) Mining border descriptions of emerging patterns from dataset pairs. Know and Inf Syst 8:178–202.
Dong G, Zhang X, Wong L et al (1999) CAEP: Classification by aggregating emerging patterns. In: Arikawa S, Furukawa K (eds) Proceedings of the 2nd international conference on discovery science, lecture notes in computer science, Tokyo, Japan, pp 30–42.
Fischer J, Raedt LD (2004) Towards optimizing conjunctive inductive queries. In: Dai H, Srikant R, Zhang C (eds) Proceedings of the 8th Pacific-Asia conference on advances in knowledge discovery and data mining, lecture notes in computer science, Sydney, Australia, pp 625–637.
Gusfield D (1997) Algorithms on strings, trees and sequences, computer science and computational biology. Cambridge University Press, Cambridge, pp 235–244.
Han J, Pei J, Mortazavi-Asl B et al (2000) Freespan: frequent pattern-projected sequential pattern mining. In: proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, Boston, MA, USA, pp 355–359.
Hirao M, Hoshino H, Shinohara A et al (2003) A practical algorithm to find the best subsequence patterns. Theor Comp Sci 292:465–479.
Ji X, Bailey J, Dong D (2005) Mining minimal distinguishing subsequence patterns with gap constraints. In: Proceedings of the 5th IEEE international conference on data mining, IEEE Computer Society, Houston, Texas USA, pp 194–201.
Lesh N, Zaki MJ, Ogihara M (2000) Scalable feature mining for sequential data. IEEE Int Syst 15:48–56.
Li J, Dong G, Ramamohanarao K (2001) Making use of the most expressive jumping emerging patterns for classification. Know Info Sys 3:131–145.
Maier D (1978) The complexity of some problems on subsequences and supersequences. J ACM 25(2):322–336.
Mannila H, Toivonen H, Verkamo AI (1995) Discovering frequent episodes in sequences. In: Fayyad UM, Uthurusamy R (eds) Proceedings of the 1st international conference on knowledge discovery and data mining, AAAI Press, Montreal, Canada, pp 210–215.
Méger N, Rigotti C (2004) Constraint-based mining of episode rules and optimal window sizes. In: Boulicaut JF, Esposito F, Giannotti F, Pcdrcschi D (eds) Proceedings of the 8th european conference on principles and practice of knowledge discovery in databases, lecture notes in computer science, Pisa, Italy, pp 313–324.
Mitchell T (1982) Generalization as search. AI Journal 18(2):203–226.
Narasimhan G, Bu C, Gao Y et al (2002) Mining protein sequences for motifs. J Comp Bio 9(5):707–720.
Pei J, Han J, Mortazavi-Asl B et al (2001) Prefixspan: mining sequential patterns by prefix-projected growth. In: Proceedings of the 17th international conference on data engineering, IEEE Computer Society, Heidelberg, Germany, pp 215–224.
Raedit LD, Kramer S (2001) The levelwise version space algorithm and its application to molecular fragment finding. In: Nebel B (ed) Proceedings of the 17th international joint conference on artificial intelligence, Morgan Kaufmann, Seattle, Washington, USA, pp 853–862.
She R, Chen F, Wang K et al (2003) Frequent-subsequence-based prediction of outer membrane proteins. In: Gctoor L, Senator TE, Domingos P, Faloutsos C (eds) Proceedings of the 9th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, Washington, DC, USA, pp 436–445.
Tronícek Z (2001) Episode matching. In: Amir A, Landau GM (eds) Proceedings of the 12th annual symposium on combinatorial pattern matching, lecture notes in computer science, Jerusalem, Israel, pp 143–146.
Tzvetkov P, Yan X, Han J (2003) TSP: mining top-K closed sequential patterns. In: Proceedings of the 3rd IEEE international conference on data mining, IEEE Computer Society, Melbourne, Florida, USA, pp 347–354.
Wang J, Han J (2004) BIDE: efficient mining of frequent closed sequences. In: Proceedings of the 20th international conference on data engineering, IEEE Computer Society, Boston, MA, USA, pp 79–90.
Webb GI, Butler S, Newlands D (2003) On detecting differences between groups. In: Getoor L, Senator TE, Domingos P, Faloutsos C (eds) Proceedings of the 9th ACM SIGKDD international conference on knowledge discovery and data mining, ACM, Washington, DC, USA, pp 256–265.
Yan X, Han J, Afshar R (2003) Clospan: mining closed sequential patterns in large databases. In: Barbara D, Kamath C (eds) Proceedings of the 3rd SIAM international conference on data mining, SIAM, San Francisco, CA, USA.
Zaki MJ (2000) Sequence mining in categorical domains: incorporating constraints. In: Proceedings of the 2000 ACM CIKM international conference on information and knowledge management, ACM, McLean, VA, USA, pp 422–429.
Zaki MJ (2001) Spade: an efficient algorithm for mining frequent sequences. Mach Learn 42(1/2):31–60.
Zhang M, Kao B, Cheung D, Yip K (2005) Mining periodic patterns with gap requirement from sequences. In: Özcan F (ed) Proceedings of the 2005 ACM SIGMOD international conference on management of data, ACM, Maryland, USA, pp 622–633 .
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Ji, X., Bailey, J. & Dong, G. Mining minimal distinguishing subsequence patterns with gap constraints. Knowl Inf Syst 11, 259–286 (2007). https://doi.org/10.1007/s10115-006-0038-2
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-006-0038-2