Abstract
We consider the problem of on-line exact string matching of a pattern in a set of highly similar sequences. This can be useful in cases where indexing the sequences is not feasible. We present a preliminary study by restricting the problem for a specific case where we adapt the classical Morris-Pratt and Knuth-Morris-Pratt algorithms to consider borders with errors. We give an original algorithm for computing borders at Hamming distance 1. We exhibit experimental results showing that our algorithms are much faster than searching for the pattern in each sequences with a very fast on-line exact string matching algorithm.
Similar content being viewed by others
References
Aho, A.V., Corasick, M.J.: Efficient string matching: an aid to bibliographic search. Commun. ACM 18(6), 333–340 (1975)
Alatabbi, A., Barton, C., Iliopoulos, C.S., Mouchard, L.: Querying highly similar structured sequences via binary encoding and word level operations. In: Iliadis, L.S., Maglogiannis, I., Papadopoulos, H., Karatzas, K., Sioutas, S. (eds.) Proceedings of the International Workshop Artificial Intelligence Applications and Innovations (AIAI 2012) Part II, IFIP Advances in Information and Communication Technology, vol. 382, pp. 584–592. Springer, Halkidiki, Greece (2012)
Amir, A., Lewenstein, M., Porat, E.: Faster algorithms for string matching with \(k\) mismatches. J. Algorithms 50(2), 257–275 (2004)
Cao, X., Li, S.C., Tung, A.K.H.: Indexing DNA sequences using q-grams. In: Zhou, L., Ooi, B.C., Meng, X. (eds.) Proceedings of the 10th International Conference on Database Systems for Advanced Applications (DASFAA 2005), Lecture Notes in Computer Science, vol. 3453, pp. 4–16. Springer, Beijing, China (2005)
Crochemore, M., Hancart, C., Lecroq, T.: Algorithms on Strings. Cambridge University Press, Cambridge (2007)
Knuth, D.E., Morris, J.H., Jr., Pratt, V.R.: Fast pattern matching in strings. SIAM J. Comput. 6(2), 323–350 (1977)
Do, H.H., Jansson, J., Sadakane, K., Sung, W.-K.: Fast relative Lempel-Ziv self-index for similar sequences. Theor. Comput. Sci. 532, 14–30 (2014)
Faro, S., Lecroq, T.: Efficient variants of the Backward-Oracle-Matching algorithm. In: Holub, J., Zdàrek, J. (eds.) Proceedings of the Prague Stringology Conference 2008, pp. 146–160. Czech Republic, Prague (2008)
Faro, S., Lecroq, T.: The exact online string matching problem: a review of the most recent results. ACM Comput. Surv. 45(2), 13 (2013)
Franek, F., Jennings, C.G., Smyth, W.F.: A simple fast hybrid pattern-matching algorithm. J. Discrete Algorithms 5(4), 682–695 (2007)
Fritz, M.-Y., Leinonen, R., Cochrane, G., Birney, E.: Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21, 734–740 (2011)
Huang, S., Lam, T.W., Sung, W.-K., Tam, S.-L., Yiu, S.-M.: Indexing similar DNA sequences. In: Chen, B. (ed.) Proceedings of the 6th International Conference on Algorithmic Aspects in Information and Management (AAIM 2010), Lecture Notes in Computer Science, vol. 6124, pp. 180–190. Springer, Weihai, China (2010)
Iliopoulos, C.S., Alatabbi, A., Barton, C.: On the repetitive collection indexing problem. In: Proceedings of the 2012 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW), pp. 682–687. IEEE Computer Society (2012)
Iliopoulos, C.S., Mouchard, L., Rahman, M.S.: A new approach to pattern matching in degenerate DNA/RNA sequences and distributed pattern matching. Math. Comput. Sci. 1(4), 557–569 (2008)
Kuruppu, S., Puglisi, S.J., Zobel, J.: Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. In: Chávez, E., Lonardi, S. (eds.) Proceedings of the 17th International Symposium on String Processing and Information Retrieval (SPIRE 2010), Lecture Notes in Computer Science, vol. 6393, pp. 201–206. Springer, Los Cabos, Mexico (2010)
Morris, J.H., Jr, Pratt, V.R.: A linear pattern-matching algorithm. Report 40, University of California, Berkeley (1970)
Na, J.C., Park, H., Crochemore, M., Holub, J., Iliopoulos, C.S., Mouchard, L., Park, K.: Suffix tree of an alignment: an efficient index for similar data. In: Lecroq, T., Mouchard, L., (eds.) Revised Selected Papers of the 24th International Workshop On Combinatorial Algorithms (IWOCA 2013), vol. 8288 of Lecture Notes in Computer Science, Rouen, France, Springer (2013)
Na, J.C., Park, H., Lee, S., Hong, M., Lecroq, T., Mouchard, L., Park, K.: Suffix array of alignment: a practical index for similar data. In: Oren Kurland, M.L., Porat, E. (eds.) Proceedings of the 20th International Symposium on String Processing and Information Retrieval (SPIRE 2013), Lecture Notes in Computer Science, vol. 8214, pp. 243–254. Springer, Jerusalem, Israel (2013)
Panneton, F., L’Ecuyer, P., Matsumoto, M.: Improved long-period generators based on linear recurrences modulo 2. ACM Trans. Math. Softw. 32(1), 1–16 (2006)
Pottier, C., Hannequin, D., Coutant, S., Rovelet-Lecrux, A., Wallon, D., Paquet, C., Bombois, S., Pariente, J., Thomas-Anterion, C., Michon, A., Croisile, B., Etcharry-Bouyx, F., Berr, C., Dartigues, J.-F., Amouyel, P., Dauchel, H., Boutoleau-Bretonnière, C., Frebourg, T., Lambert, J.C., Campion, D., PHRC GMAJ collaborators: High frequency of potentially causative SORL1 mutations in autosomal dominant early-onset alzheimer disease. Mol. Psychiatry 17(9), 875–879 (2012)
Rahn, R., Weese, D., Reinert, K.: Journaled string tree—a scalable data structure for analyzing thousands of similar genomes on your laptop. Bioinformatics 30(24), 3499–3505 (2014)
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Nsira, N.B., Elloumi, M. & Lecroq, T. On-line String Matching in Highly Similar DNA Sequences. Math.Comput.Sci. 11, 113–126 (2017). https://doi.org/10.1007/s11786-016-0280-2
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11786-016-0280-2