[go: up one dir, main page]

Skip to main content
Log in

On-line String Matching in Highly Similar DNA Sequences

  • Published:
Mathematics in Computer Science Aims and scope Submit manuscript

Abstract

We consider the problem of on-line exact string matching of a pattern in a set of highly similar sequences. This can be useful in cases where indexing the sequences is not feasible. We present a preliminary study by restricting the problem for a specific case where we adapt the classical Morris-Pratt and Knuth-Morris-Pratt algorithms to consider borders with errors. We give an original algorithm for computing borders at Hamming distance 1. We exhibit experimental results showing that our algorithms are much faster than searching for the pattern in each sequences with a very fast on-line exact string matching algorithm.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Aho, A.V., Corasick, M.J.: Efficient string matching: an aid to bibliographic search. Commun. ACM 18(6), 333–340 (1975)

    Article  MathSciNet  MATH  Google Scholar 

  2. Alatabbi, A., Barton, C., Iliopoulos, C.S., Mouchard, L.: Querying highly similar structured sequences via binary encoding and word level operations. In: Iliadis, L.S., Maglogiannis, I., Papadopoulos, H., Karatzas, K., Sioutas, S. (eds.) Proceedings of the International Workshop Artificial Intelligence Applications and Innovations (AIAI 2012) Part II, IFIP Advances in Information and Communication Technology, vol. 382, pp. 584–592. Springer, Halkidiki, Greece (2012)

  3. Amir, A., Lewenstein, M., Porat, E.: Faster algorithms for string matching with \(k\) mismatches. J. Algorithms 50(2), 257–275 (2004)

    Article  MathSciNet  MATH  Google Scholar 

  4. Cao, X., Li, S.C., Tung, A.K.H.: Indexing DNA sequences using q-grams. In: Zhou, L., Ooi, B.C., Meng, X. (eds.) Proceedings of the 10th International Conference on Database Systems for Advanced Applications (DASFAA 2005), Lecture Notes in Computer Science, vol. 3453, pp. 4–16. Springer, Beijing, China (2005)

  5. Crochemore, M., Hancart, C., Lecroq, T.: Algorithms on Strings. Cambridge University Press, Cambridge (2007)

    Book  MATH  Google Scholar 

  6. Knuth, D.E., Morris, J.H., Jr., Pratt, V.R.: Fast pattern matching in strings. SIAM J. Comput. 6(2), 323–350 (1977)

  7. Do, H.H., Jansson, J., Sadakane, K., Sung, W.-K.: Fast relative Lempel-Ziv self-index for similar sequences. Theor. Comput. Sci. 532, 14–30 (2014)

    Article  MathSciNet  MATH  Google Scholar 

  8. Faro, S., Lecroq, T.: Efficient variants of the Backward-Oracle-Matching algorithm. In: Holub, J., Zdàrek, J. (eds.) Proceedings of the Prague Stringology Conference 2008, pp. 146–160. Czech Republic, Prague (2008)

  9. Faro, S., Lecroq, T.: The exact online string matching problem: a review of the most recent results. ACM Comput. Surv. 45(2), 13 (2013)

    Article  MATH  Google Scholar 

  10. Franek, F., Jennings, C.G., Smyth, W.F.: A simple fast hybrid pattern-matching algorithm. J. Discrete Algorithms 5(4), 682–695 (2007)

    Article  MathSciNet  MATH  Google Scholar 

  11. Fritz, M.-Y., Leinonen, R., Cochrane, G., Birney, E.: Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21, 734–740 (2011)

    Article  Google Scholar 

  12. Huang, S., Lam, T.W., Sung, W.-K., Tam, S.-L., Yiu, S.-M.: Indexing similar DNA sequences. In: Chen, B. (ed.) Proceedings of the 6th International Conference on Algorithmic Aspects in Information and Management (AAIM 2010), Lecture Notes in Computer Science, vol. 6124, pp. 180–190. Springer, Weihai, China (2010)

  13. Iliopoulos, C.S., Alatabbi, A., Barton, C.: On the repetitive collection indexing problem. In: Proceedings of the 2012 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW), pp. 682–687. IEEE Computer Society (2012)

  14. Iliopoulos, C.S., Mouchard, L., Rahman, M.S.: A new approach to pattern matching in degenerate DNA/RNA sequences and distributed pattern matching. Math. Comput. Sci. 1(4), 557–569 (2008)

    Article  MathSciNet  MATH  Google Scholar 

  15. Kuruppu, S., Puglisi, S.J., Zobel, J.: Relative Lempel-Ziv compression of genomes for large-scale storage and retrieval. In: Chávez, E., Lonardi, S. (eds.) Proceedings of the 17th International Symposium on String Processing and Information Retrieval (SPIRE 2010), Lecture Notes in Computer Science, vol. 6393, pp. 201–206. Springer, Los Cabos, Mexico (2010)

  16. Morris, J.H., Jr, Pratt, V.R.: A linear pattern-matching algorithm. Report 40, University of California, Berkeley (1970)

  17. Na, J.C., Park, H., Crochemore, M., Holub, J., Iliopoulos, C.S., Mouchard, L., Park, K.: Suffix tree of an alignment: an efficient index for similar data. In: Lecroq, T., Mouchard, L., (eds.) Revised Selected Papers of the 24th International Workshop On Combinatorial Algorithms (IWOCA 2013), vol. 8288 of Lecture Notes in Computer Science, Rouen, France, Springer (2013)

  18. Na, J.C., Park, H., Lee, S., Hong, M., Lecroq, T., Mouchard, L., Park, K.: Suffix array of alignment: a practical index for similar data. In: Oren Kurland, M.L., Porat, E. (eds.) Proceedings of the 20th International Symposium on String Processing and Information Retrieval (SPIRE 2013), Lecture Notes in Computer Science, vol. 8214, pp. 243–254. Springer, Jerusalem, Israel (2013)

  19. Panneton, F., L’Ecuyer, P., Matsumoto, M.: Improved long-period generators based on linear recurrences modulo 2. ACM Trans. Math. Softw. 32(1), 1–16 (2006)

    Article  MathSciNet  MATH  Google Scholar 

  20. Pottier, C., Hannequin, D., Coutant, S., Rovelet-Lecrux, A., Wallon, D., Paquet, C., Bombois, S., Pariente, J., Thomas-Anterion, C., Michon, A., Croisile, B., Etcharry-Bouyx, F., Berr, C., Dartigues, J.-F., Amouyel, P., Dauchel, H., Boutoleau-Bretonnière, C., Frebourg, T., Lambert, J.C., Campion, D., PHRC GMAJ collaborators: High frequency of potentially causative SORL1 mutations in autosomal dominant early-onset alzheimer disease. Mol. Psychiatry 17(9), 875–879 (2012)

  21. Rahn, R., Weese, D., Reinert, K.: Journaled string tree—a scalable data structure for analyzing thousands of similar genomes on your laptop. Bioinformatics 30(24), 3499–3505 (2014)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Thierry Lecroq.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Nsira, N.B., Elloumi, M. & Lecroq, T. On-line String Matching in Highly Similar DNA Sequences. Math.Comput.Sci. 11, 113–126 (2017). https://doi.org/10.1007/s11786-016-0280-2

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11786-016-0280-2

Keywords

Mathematics Subject Classification

Navigation