Abstract
Biochemical research often involves examining structural relationships in molecules, because scientists strongly believe that a molecule's structure determines its function. Traditionally, researchers have identified recurring structural patterns, or motifs, manually using domain expertise. However, with the massive influx of new biochemical data and the ability to gather data for very large molecules, there is a great need for techniques that automatically and efficiently identify commonly occurring structural patterns in molecules. Previous automated substructure discovery approaches have each introduced variations of similar underlying techniques and have embedded domain knowledge. While embedding domain knowledge improves performance for the particular domain, it complicates extension to other domains. These approaches also do not address scalability or noise, both of which are critical for macromolecules such as proteins. In this paper, we present MotifMiner, a general framework for efficiently identifying common motifs in most scientific molecular datasets. The approach combines structure-based frequent-pattern discovery with search space reduction and coordinate noise handling. We describe the framework and several algorithms, and we demonstrate the flexibility of our system by analyzing protein and drug biochemical datasets.
Cite this article
Coatney, M., Parthasarathy, S. MotifMiner: Efficient discovery of common substructures in biochemical molecules. Knowl Inf Syst 7, 202–223 (2005). https://doi.org/10.1007/s10115-003-0119-4