Suffix Array of Alignment: A Practical Index for Similar Data

Joong Chae Na, Heejin Park, Sunho Lee, Minsung Hong, Thierry Lecroq, Laurent Mouchard & Kunsoo Park

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8214)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

The suffix tree of alignment is an index data structure for similar strings. Given an alignment of similar strings, it stores all suffixes of the alignment, called alignment-suffixes. An alignment-suffix represents either a single suffix of one string or the suffixes of multiple strings that start at the same position in the alignment. In theory, the suffix tree of alignment makes good use of the similarity among the strings. However, suffix trees are not widely used in biological applications because of their huge space requirements; suffix arrays are used in practice instead.

In this paper we propose a space-economical version of the suffix tree of alignment, named the suffix array of alignment (SAA). Given an alignment ρ of similar strings, the SAA for ρ is a lexicographically sorted list of all the alignment-suffixes of ρ. The SAA supports pattern search as efficiently as the generalized suffix array. Our experiments show that, when indexing 11 human genome sequences, our index uses only 14% of the space used by the generalized suffix array. The space efficiency of our index increases as the number of genome sequences increases. We also present an efficient algorithm for constructing the SAA.
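As a point of reference for the generalized suffix array that the abstract compares against, the following is a minimal sketch of that baseline: every suffix of every string is stored as a (string index, start position) pair, the pairs are sorted lexicographically by suffix, and a pattern is located by binary search. It is not the authors' SAA or their construction algorithm. The function names `build_generalized_suffix_array` and `search` are hypothetical, and the naive sort that materializes each suffix is chosen for readability only, not efficiency.

```python
from typing import List, Tuple

# One entry per suffix: (index of the string, start position in that string).
Entry = Tuple[int, int]


def build_generalized_suffix_array(strings: List[str]) -> List[Entry]:
    """Naively sort all suffixes of all strings (illustration only)."""
    entries = [(k, i) for k, s in enumerate(strings) for i in range(len(s))]
    # Ties between equal suffixes are broken by (string index, position).
    entries.sort(key=lambda e: (strings[e[0]][e[1]:], e[0], e[1]))
    return entries


def search(strings: List[str], gsa: List[Entry], pattern: str) -> List[Entry]:
    """Return all (string index, position) pairs where pattern occurs."""
    m = len(pattern)

    def prefix(e: Entry) -> str:
        # First m characters of the suffix that entry e denotes.
        return strings[e[0]][e[1]:e[1] + m]

    # Lower bound: first entry whose m-character prefix is >= pattern.
    lo, hi = 0, len(gsa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(gsa[mid]) < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first entry whose m-character prefix is > pattern.
    hi = len(gsa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(gsa[mid]) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(gsa[start:lo])


if __name__ == "__main__":
    # Two similar strings that differ in a single character (assumed toy data).
    seqs = ["acgtacgt", "acgaacgt"]
    gsa = build_generalized_suffix_array(seqs)
    print(len(gsa), "entries")
    print(search(seqs, gsa, "acg"))
```

In this baseline each string contributes one entry per suffix, so the entry count grows with the total length of all strings even when the strings are nearly identical. That redundancy is what an index storing shared suffixes once, as the abstract describes for alignment-suffixes, would aim to avoid.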

The original version of this chapter was revised: The copyright line was incorrect. This has been corrected. The Erratum to this chapter is available at DOI: 10.1007/978-3-319-02432-5_33

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Sejong University, Korea
Joong Chae Na
College of Information and Communications, Hanyang University, Korea
Heejin Park
School of Computer Science and Engineering, Seoul National University, Korea
Sunho Lee, Minsung Hong & Kunsoo Park
Department of Computer Science, University of Rouen, France
Thierry Lecroq & Laurent Mouchard

Editor information

Editors and Affiliations

Faculty of Industrial Engineering and Management Technion, Technion Institute of Technology, Bloomfield Hall 308, 32000, Haifa, Israel
Oren Kurland
Bar-Ilan University, Israel
Moshe Lewenstein
Department of Computer Science, Bar-Ilan University, 52900, Ramat-Gan, Israel
Ely Porat

About this paper

Cite this paper

Na, J.C. et al. (2013). Suffix Array of Alignment: A Practical Index for Similar Data. In: Kurland, O., Lewenstein, M., Porat, E. (eds) String Processing and Information Retrieval. SPIRE 2013. Lecture Notes in Computer Science, vol 8214. Springer, Cham. https://doi.org/10.1007/978-3-319-02432-5_27

DOI: https://doi.org/10.1007/978-3-319-02432-5_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-02431-8
Online ISBN: 978-3-319-02432-5
eBook Packages: Computer Science, Computer Science (R0)
