MotifMiner: Efficient discovery of common substructures in biochemical molecules

Knowledge and Information Systems

Abstract

Biochemical research often involves examining structural relationships in molecules, since scientists strongly believe in the causal relationship between structure and function. Traditionally, researchers have identified these patterns, or motifs, manually using domain expertise. However, with the massive influx of new biochemical data and the ability to gather data for very large molecules, there is a great need for techniques that automatically and efficiently identify commonly occurring structural patterns in molecules. Previous automated substructure discovery approaches have each introduced variations of similar underlying techniques and have embedded domain knowledge. Embedding domain knowledge improves performance for the particular domain but complicates extension to other domains. These approaches also do not address scalability or noise, both of which are critical for macromolecules such as proteins. In this paper, we present MotifMiner, a general framework for efficiently identifying common motifs in most scientific molecular datasets. The approach combines structure-based frequent-pattern discovery with search space reduction and coordinate noise handling. We describe the framework and several algorithms, and demonstrate the flexibility of our system by analyzing protein and drug biochemical datasets.
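To make the combination of frequent-pattern discovery and coordinate noise handling concrete, here is a minimal sketch in Python. It is not MotifMiner's actual algorithm, which the abstract does not specify. It assumes molecules are lists of (element, 3D coordinate) tuples, uses simple distance binning as a stand-in for noise handling, and counts only atom-pair patterns. The function names, the `bin_width` parameter, and the toy molecules are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import dist


def pair_patterns(molecule, bin_width=0.5):
    """Return the set of (element, element, distance_bin) patterns in one molecule.

    `molecule` is a list of (element, (x, y, z)) tuples. Distances are
    discretized into bins of `bin_width`, so small coordinate noise usually
    maps two near-identical distances to the same pattern (assumed scheme).
    """
    patterns = set()
    for (e1, p1), (e2, p2) in combinations(molecule, 2):
        a, b = sorted((e1, e2))  # canonical order for the atom pair
        d_bin = int(dist(p1, p2) // bin_width)
        patterns.add((a, b, d_bin))
    return patterns


def frequent_pairs(molecules, min_support, bin_width=0.5):
    """Keep the pair patterns that occur in at least `min_support` molecules."""
    counts = Counter()
    for mol in molecules:
        # A set per molecule means each molecule contributes at most one count.
        counts.update(pair_patterns(mol, bin_width))
    return {p: c for p, c in counts.items() if c >= min_support}


# Toy data (hypothetical): two molecules sharing a C-O pair at a similar distance.
mol_a = [("C", (0.0, 0.0, 0.0)), ("O", (1.2, 0.0, 0.0)), ("N", (0.0, 1.4, 0.0))]
mol_b = [("C", (5.0, 5.0, 5.0)), ("O", (6.25, 5.0, 5.0)), ("H", (5.0, 5.0, 6.0))]

result = frequent_pairs([mol_a, mol_b], min_support=2)
print(result)  # {('C', 'O', 2): 2}
```

In a level-wise (Apriori-style) scheme, only the pairs that survive this step would be extended into larger candidate substructures, which is one common way to reduce the search space. Hard bin boundaries can still split two nearly equal distances, so a real system would need a more careful treatment of noise than this sketch provides.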

Author information

Corresponding author

Correspondence to Srinivasan Parthasarathy.

About this article

Cite this article

Coatney, M., Parthasarathy, S. MotifMiner: Efficient discovery of common substructures in biochemical molecules. Knowl Inf Syst 7, 202–223 (2005). https://doi.org/10.1007/s10115-003-0119-4

  • DOI: https://doi.org/10.1007/s10115-003-0119-4
