Abstract
One of the hardest and long-standing problems in Bioinformatics is the problem of motif discovery in biological sequences. It is the problem of finding recurring patterns in these sequences. Apriori is a well-known data mining algorithm. It is used to mine frequent patterns in large datasets. In this paper, we would like to apply Apriori to the common motifs discovery problem. We propose three modifications so that we can adapt the classic Apriori to our problem. First, the Trie data structure is used to store all biological sequences under examination. Second, both of the frequent pattern extraction and the candidate generation steps are done using the same data structure, the Trie. The Trie allows to simultaneously search all possible starting points in the sequence for any occurrence of the given pattern. Third, instead of using only the support as a measure to assess frequent patterns, a new measure, the normalized information content (normIC), is proposed which is able to distinguish motifs in real promoter sequences. Preliminary experiments are conducted on Tompa’s benchmark to investigate the performance of our proposed algorithm, the Trie-based Apriori Motif Discovery (TrieAMD). Results show that our algorithm outperforms all of the tested tools on real datasets for average sensitivity.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th International Conference on Very Large Data Bases, San Francisco, CA, USA, pp. 487–499 (1994)
Badr, G.: Tries in information retrieval and syntactic pattern recognition. Ph.D. Thesis, School of Computer Science, Carleton University (June 2006)
Badr, G., Turcotte, M.: Component-Based Matching for Multiple Interacting RNA Sequences. In: Chen, J., Wang, J., Zelikovsky, A. (eds.) ISBRA 2011. LNCS, vol. 6674, pp. 73–86. Springer, Heidelberg (2011)
Bailey, T.L., Elkan, C.: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. In: Proceedings of International Conference on Intelligent Systems for Molecular Biology, ISMB, vol. 2, pp. 28–36 (1994)
Bajcsy, P., Han, J., Liu, L., Yang, J.: Survey of biodata analysis from a data mining perspective. In: Wu, X., Jain, L., Wang, J.T., Zaki, M.J., Toivonen, H.T., Shasha, D. (eds.) Data Mining in Bioinformatics, pp. 9–39. Springer, London (2005)
Carvalho, A.M., Freitas, A.T., Oliveira, A.L., Sagot, M.: An efficient algorithm for the identification of structured motifs in DNA promoter sequences. IEEE/ACM Trans. Comput. Biol. Bioinformatics 3(2), 126–140 (2006)
Goos, G., Hartmanis, J., Leeuwen, J., Crochemore, M., Iliopoulos, C.S., Mohamed, M., Sagot, M.: Longest Repeats with a Block of Don’t Cares. In: Farach-Colton, M. (ed.) LATIN 2004. LNCS, vol. 2976, pp. 271–278. Springer, Heidelberg (2004)
Jiang, H., Zhao, Y., Chen, W., Zheng, W.: Searching maximal degenerate motifs guided by a compact suffix tree. Advances in Experimental Medicine and Biology 680, 19–26 (2010)
Li, M., Ma, B., Wang, L.: Finding similar regions in many sequences. Journal of Computer and System Sciences 65, 73–96 (2002)
Matys, V., Kel-Margoulis, O.V., Fricke, E., Liebich, I., Land, S., Barre-Dirrie, A., Reuter, I., Chekmenev, D., Krull, M., Hornischer, K., Voss, N., Stegmaier, P., Lewicki-Potapov, B., Saxel, H., Kel, A.E., Wingender, E.: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Research 34(Database issue), D108–D110 (2006)
Ozer, H.G., Ray, W.C.: Informative Motifs in Protein Family Alignments. In: Giancarlo, R., Hannenhalli, S. (eds.) WABI 2007. LNCS (LNBI), vol. 4645, pp. 161–170. Springer, Heidelberg (2007)
Pavesi, G., Mauri, G., Pesole, G.: An algorithm for finding signals of unknown length in DNA sequences. Bioinformatics 17(suppl. 1), S207–S214 (2001)
Pavesi, G., Mauri, G., Pesole, G.: In silico representation and discovery of transcription factor binding sites. Briefings in Bioinformatics 5(3), 217–236 (2004)
Pevzner, P.A., Sze, S.H.: Combinatorial approaches to finding subtle signals in DNA sequences. In: Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pp. 269–278. AAAI Press (2000)
Sagot, M.-F.: Spelling Approximate Repeated or Common Motifs Using a Suffix Tree. In: Lucchesi, C.L., Moura, A.V. (eds.) LATIN 1998. LNCS, vol. 1380, pp. 374–390. Springer, Heidelberg (1998)
Sagot, M.F., Escalier, V., Viari, A., Soldano, H.: Searching for repeated words in a text allowing for mismatches and gaps. In: Baeza-Yates, R., Manber, U. (eds.) Second South American Workshop on String Processing, pp. 87–100. University of Chile (1995)
Sandelin, A., Alkema, W., Engstrm, P., Wasserman, W.W., Lenhard, B.: JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Research 32(Database issue), D91–D94 (2004)
Schneider, T.D., Stormo, G.D., Gold, L.: Information content of binding sites on nucleotide sequences. Journal of Molecular Biology 188(3), 415–431 (1986)
Tompa, M., Li, N., Bailey, T.L., Church, G.M., De Moor, B., Eskin, E., Favorov, A.V., Frith, M.C., Fu, Y., Kent, W.J., Makeev, V.J., Mironov, A.A., Noble, W.S., Pavesi, G., Pesole, G., Rgnier, M., Simonis, N., Sinha, S., Thijs, G., van Helden, J., Vandenbogaert, M., Weng, Z., Workman, C., Ye, C., Zhu, Z.: Assessing computational tools for the discovery of transcription factor binding sites. Nature Biotechnology 23(1), 137–144 (2005)
Ukkonen, E.: On-Line construction of suffix trees. Algorithmica 14, 249–260 (1995)
Weiner, P.: Linear pattern matching algorithms. In: IEEE Conference Record of 14th Annual Symposium on Switching and Automata Theory, SWAT 2008, pp. 1–11. IEEE (October 1973)
Wijaya, E., Rajaraman, K., Yiu, S., Sung, W.: Detection of generic spaced motifs using submotif pattern mining. Bioinformatics 23(12), 1476–1485 (2007)
Jia, C., Lu, R., Chen, L.: A Frequent Pattern Mining Method for Finding Planted (l, d)-motifs of Unknown Length. In: Yu, J., Greco, S., Lingras, P., Wang, G., Skowron, A. (eds.) RSKT 2010. LNCS, vol. 6401, pp. 240–248. Springer, Heidelberg (2010)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Al-Turaiki, I., Badr, G., Mathkour, H. (2012). Trie-based Apriori Motif Discovery Approach. In: Bleris, L., Măndoiu, I., Schwartz, R., Wang, J. (eds) Bioinformatics Research and Applications. ISBRA 2012. Lecture Notes in Computer Science(), vol 7292. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30191-9_1
Download citation
DOI: https://doi.org/10.1007/978-3-642-30191-9_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-30190-2
Online ISBN: 978-3-642-30191-9
eBook Packages: Computer ScienceComputer Science (R0)