Trie-based Apriori Motif Discovery Approach

Isra Al-Turaiki²³,
Ghada Badr²³ &
Hassan Mathkour²³

Part of the book series: Lecture Notes in Computer Science ((LNBI,volume 7292))

Included in the following conference series:

International Symposium on Bioinformatics Research and Applications

1105 Accesses

Abstract

One of the hardest and long-standing problems in Bioinformatics is the problem of motif discovery in biological sequences. It is the problem of finding recurring patterns in these sequences. Apriori is a well-known data mining algorithm. It is used to mine frequent patterns in large datasets. In this paper, we would like to apply Apriori to the common motifs discovery problem. We propose three modifications so that we can adapt the classic Apriori to our problem. First, the Trie data structure is used to store all biological sequences under examination. Second, both of the frequent pattern extraction and the candidate generation steps are done using the same data structure, the Trie. The Trie allows to simultaneously search all possible starting points in the sequence for any occurrence of the given pattern. Third, instead of using only the support as a measure to assess frequent patterns, a new measure, the normalized information content (normIC), is proposed which is able to distinguish motifs in real promoter sequences. Preliminary experiments are conducted on Tompa’s benchmark to investigate the performance of our proposed algorithm, the Trie-based Apriori Motif Discovery (TrieAMD). Results show that our algorithm outperforms all of the tested tools on real datasets for average sensitivity.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Sequence motif finder using memetic algorithm

Article Open access 03 January 2018

Performance evaluation for MOTIFSIM

Article Open access 18 December 2018

MODSIDE: a motif discovery pipeline and similarity detector

Article Open access 19 October 2018

References

Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th International Conference on Very Large Data Bases, San Francisco, CA, USA, pp. 487–499 (1994)
Google Scholar
Badr, G.: Tries in information retrieval and syntactic pattern recognition. Ph.D. Thesis, School of Computer Science, Carleton University (June 2006)
Google Scholar
Badr, G., Turcotte, M.: Component-Based Matching for Multiple Interacting RNA Sequences. In: Chen, J., Wang, J., Zelikovsky, A. (eds.) ISBRA 2011. LNCS, vol. 6674, pp. 73–86. Springer, Heidelberg (2011)
Chapter Google Scholar
Bailey, T.L., Elkan, C.: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In: Proceedings of International Conference on Intelligent Systems for Molecular Biology, ISMB, vol. 2, pp. 28–36 (1994)
Google Scholar
Bajcsy, P., Han, J., Liu, L., Yang, J.: Survey of biodata analysis from a data mining perspective. In: Wu, X., Jain, L., Wang, J.T., Zaki, M.J., Toivonen, H.T., Shasha, D. (eds.) Data Mining in Bioinformatics, pp. 9–39. Springer, London (2005)
Chapter Google Scholar
Carvalho, A.M., Freitas, A.T., Oliveira, A.L., Sagot, M.: An efficient algorithm for the identification of structured motifs in DNA promoter sequences. IEEE/ACM Trans. Comput. Biol. Bioinformatics 3(2), 126–140 (2006)
Article Google Scholar
Goos, G., Hartmanis, J., Leeuwen, J., Crochemore, M., Iliopoulos, C.S., Mohamed, M., Sagot, M.: Longest Repeats with a Block of Don’t Cares. In: Farach-Colton, M. (ed.) LATIN 2004. LNCS, vol. 2976, pp. 271–278. Springer, Heidelberg (2004)
Chapter Google Scholar
Jiang, H., Zhao, Y., Chen, W., Zheng, W.: Searching maximal degenerate motifs guided by a compact suffix tree. Advances in Experimental Medicine and Biology 680, 19–26 (2010)
Article Google Scholar
Li, M., Ma, B., Wang, L.: Finding similar regions in many sequences. Journal of Computer and System Sciences 65, 73–96 (2002)
Article MathSciNet Google Scholar
Matys, V., Kel-Margoulis, O.V., Fricke, E., Liebich, I., Land, S., Barre-Dirrie, A., Reuter, I., Chekmenev, D., Krull, M., Hornischer, K., Voss, N., Stegmaier, P., Lewicki-Potapov, B., Saxel, H., Kel, A.E., Wingender, E.: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Research 34(Database issue), D108–D110 (2006)
Article Google Scholar
Ozer, H.G., Ray, W.C.: Informative Motifs in Protein Family Alignments. In: Giancarlo, R., Hannenhalli, S. (eds.) WABI 2007. LNCS (LNBI), vol. 4645, pp. 161–170. Springer, Heidelberg (2007)
Chapter Google Scholar
Pavesi, G., Mauri, G., Pesole, G.: An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics 17(suppl. 1), S207–S214 (2001)
Article Google Scholar
Pavesi, G., Mauri, G., Pesole, G.: In silico representation and discovery of transcription factor binding sites. Briefings in Bioinformatics 5(3), 217–236 (2004)
Article Google Scholar
Pevzner, P.A., Sze, S.H.: Combinatorial approaches to finding subtle signals in DNA sequences. In: Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pp. 269–278. AAAI Press (2000)
Google Scholar
Sagot, M.-F.: Spelling Approximate Repeated or Common Motifs Using a Suffix Tree. In: Lucchesi, C.L., Moura, A.V. (eds.) LATIN 1998. LNCS, vol. 1380, pp. 374–390. Springer, Heidelberg (1998)
Chapter Google Scholar
Sagot, M.F., Escalier, V., Viari, A., Soldano, H.: Searching for repeated words in a text allowing for mismatches and gaps. In: Baeza-Yates, R., Manber, U. (eds.) Second South American Workshop on String Processing, pp. 87–100. University of Chile (1995)
Google Scholar
Sandelin, A., Alkema, W., Engstrm, P., Wasserman, W.W., Lenhard, B.: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Research 32(Database issue), D91–D94 (2004)
Article Google Scholar
Schneider, T.D., Stormo, G.D., Gold, L.: Information content of binding sites on nucleotide sequences. Journal of Molecular Biology 188(3), 415–431 (1986)
Article Google Scholar
Tompa, M., Li, N., Bailey, T.L., Church, G.M., De Moor, B., Eskin, E., Favorov, A.V., Frith, M.C., Fu, Y., Kent, W.J., Makeev, V.J., Mironov, A.A., Noble, W.S., Pavesi, G., Pesole, G., Rgnier, M., Simonis, N., Sinha, S., Thijs, G., van Helden, J., Vandenbogaert, M., Weng, Z., Workman, C., Ye, C., Zhu, Z.: Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology 23(1), 137–144 (2005)
Article Google Scholar
Ukkonen, E.: On-Line construction of suffix trees. Algorithmica 14, 249–260 (1995)
Article MathSciNet MATH Google Scholar
Weiner, P.: Linear pattern matching algorithms. In: IEEE Conference Record of 14th Annual Symposium on Switching and Automata Theory, SWAT 2008, pp. 1–11. IEEE (October 1973)
Google Scholar
Wijaya, E., Rajaraman, K., Yiu, S., Sung, W.: Detection of generic spaced motifs using submotif pattern mining. Bioinformatics 23(12), 1476–1485 (2007)
Article Google Scholar
Jia, C., Lu, R., Chen, L.: A Frequent Pattern Mining Method for Finding Planted (l, d)-motifs of Unknown Length. In: Yu, J., Greco, S., Lingras, P., Wang, G., Skowron, A. (eds.) RSKT 2010. LNCS, vol. 6401, pp. 240–248. Springer, Heidelberg (2010)
Chapter Google Scholar

Download references

Author information

Authors and Affiliations

College of Computer and Information Sciences, King Saud University, Riyadh, Kingdom of Saudi Arabia
Isra Al-Turaiki, Ghada Badr & Hassan Mathkour

Authors

Isra Al-Turaiki
View author publications
You can also search for this author in PubMed Google Scholar
Ghada Badr
View author publications
You can also search for this author in PubMed Google Scholar
Hassan Mathkour
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Departments of Bioengineering and Electrical Engineering, University of Texas at Dallas, 75080, Richardson, TX, USA
Leonidas Bleris
Department of Computer Science and Engineering, University of Connecticut, 06269, Storrs, CT, USA
Ion Măndoiu
Department of Biological Sciences, Carnegie Mellon University, 15213, Pittsburgh, PA, USA
Russell Schwartz
School of Information Science and Engineering, Central South University, 410083, Changsha, China
Jianxin Wang

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Al-Turaiki, I., Badr, G., Mathkour, H. (2012). Trie-based Apriori Motif Discovery Approach. In: Bleris, L., Măndoiu, I., Schwartz, R., Wang, J. (eds) Bioinformatics Research and Applications. ISBRA 2012. Lecture Notes in Computer Science(), vol 7292. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30191-9_1

Download citation

DOI: https://doi.org/10.1007/978-3-642-30191-9_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-30190-2
Online ISBN: 978-3-642-30191-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics