MotifMiner: Efficient discovery of common substructures in biochemical molecules

Matt Coatney¹ & Srinivasan Parthasarathy²

Abstract

Biochemical research often involves examining structural relationships in molecules, since scientists strongly believe in the causal relationship between structure and function. Traditionally, researchers have identified these patterns, or motifs, manually using domain expertise. However, with the massive influx of new biochemical data and the ability to gather data for very large molecules, there is a great need for techniques that automatically and efficiently identify commonly occurring structural patterns in molecules. Previous automated substructure discovery approaches have each introduced variations of similar underlying techniques and have embedded domain knowledge. While embedding domain knowledge improves performance for the particular domain, it complicates extension to other domains. These approaches also do not address scalability or noise, both of which are critical for macromolecules such as proteins. In this paper, we present MotifMiner, a general framework for efficiently identifying common motifs in most scientific molecular datasets. The approach combines structure-based frequent-pattern discovery with search space reduction and coordinate noise handling. We describe both the framework and several algorithms, and demonstrate the flexibility of our system by analyzing protein and drug biochemical datasets.

Author information

Authors and Affiliations

Computer and Information Science, The Ohio State University, Columbus, OH, USA
Matt Coatney
Department of Computer and Information Science, The Ohio State University, 395 Dreese Lab, 2015 Neil Ave., Columbus, OH, 43210, USA
Srinivasan Parthasarathy

Corresponding author

Correspondence to Srinivasan Parthasarathy.

About this article

Cite this article

Coatney, M., Parthasarathy, S. MotifMiner: Efficient discovery of common substructures in biochemical molecules. Knowl Inf Syst 7, 202–223 (2005). https://doi.org/10.1007/s10115-003-0119-4

Received: 09 December 2002
Revised: 07 February 2003
Accepted: 15 May 2003
Published: 01 February 2005
Issue Date: February 2005
DOI: https://doi.org/10.1007/s10115-003-0119-4
