Text Indexing with Errors

Moritz G. Maaß¹⁹ &
Johannes Nowak¹⁹

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 3537))

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

888 Accesses
17 Citations

Abstract

In this paper we address the problem of constructing an index for a text document or a collection of documents to answer various questions about the occurrences of a pattern when allowing a constant number of errors. In particular, our index can be built to report all occurrences, all positions, or all documents where a pattern occurs in time linear in the size of the query string and the number of results. This improves over previous work where the lookup time is not linear or depends upon the size of the document corpus. Our data structure has size $O\left(n\log^k n\right)$ on average and with high probability for input size n and queries with up to k errors. Additionally, we present a trade-off between query time and index complexity that achieves worst-case bounded index size and preprocessing time with linear lookup time on average.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Less Space: Indexing for Queries with Wildcards

Indexes for Document Retrieval with Relevance

Efficient Compressed Indexing for Approximate Top-k String Retrieval

References

Abouelhoda, M.I., Ohlebusch, E., Kurtz, S.: Optimal exact string matching based on suffix arrays. In: Laender, A.H.F., Oliveira, A.L. (eds.) SPIRE 2002. LNCS, vol. 2476, pp. 31–43. Springer, Heidelberg (2002)
Chapter Google Scholar
Amir, A., Keselman, D., Landau, G.M., Lewenstein, M., Lewenstein, N., Rodeh, M.: Indexing and dictionary matching with one error. J. Algorithms 37, 309–325 (2000)
Article MATH MathSciNet Google Scholar
Barkol, O., Rabani, Y.: Tighter bounds for nearest neighbor search and related problems in the cell probe model. In: Proc. 32nd ACM Symp. on Theory of Computing (STOC), pp. 388–396. ACM Press, New York (2000)
Google Scholar
Borodin, A., Ostrovsky, R., Rabani, Y.: Lower bounds for high dimensional nearest neighbor search and related problems. In: Proc. 31st ACM Symp. on Theory of Computing (STOC), pp. 312–321. ACM Press, New York (1999)
Google Scholar
Brodal, G.S., Ga̧sieniec, L.: Approximate dictionary queries. In: Hirschberg, D.S., Meyers, G. (eds.) CPM 1996. LNCS, vol. 1075, pp. 65–74. Springer, Heidelberg (1996)
Google Scholar
Buchsbaum, A.L., Goodrich, M.T., Westbrook, J.: Range searching over tree cross products. In: Paterson, M. (ed.) ESA 2000. LNCS, vol. 1879, pp. 120–131. Springer, Heidelberg (2000)
Chapter Google Scholar
Chang, W.I., Lawler, E.L.: Sublinear approximate string matching and biological applications. Algorithmica 12, 327–344 (1994)
Article MATH MathSciNet Google Scholar
Chávez, E., Navarro, G.: A metric index for approximate string matching. In: Rajsbaum, S. (ed.) LATIN 2002. LNCS, vol. 2286, pp. 181–195. Springer, Heidelberg (2002)
Chapter Google Scholar
Cobbs, A.L.: Fast approximate matching using suffix trees. In: Galil, Z., Ukkonen, E. (eds.) CPM 1995. LNCS, vol. 937, pp. 41–54. Springer, Heidelberg (1995)
Google Scholar
Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proc. 36th ACM Symp. on Theory of Computing (STOC), pp. 91–100 (2004)
Google Scholar
Demaine, E.D., López-Ortiz, A.: A linear lower bound on index size for text retrieval. In: Proc. 12th ACM-SIAM Symp. on Discrete Algorithms (SODA), pp. 289–294. ACM, New York (2001)
Google Scholar
Gabow, H.N., Bentely, J.L., Tarjan, R.E.: Scaling and related techniques for geometry problems. In: Proc. 16th ACM Symp. on Theory of Computing (STOC), pp. 135–143. ACM, New York (1984)
Google Scholar
Gabriele, A., Mignosi, F., Restivo, A., Sciortino, M.: Indexing structures for approximate string matching. In: Petreschi, R., Persiano, G., Silvestri, R. (eds.) CIAC 2003. LNCS, vol. 2653, pp. 140–151. Springer, Heidelberg (2003)
Chapter Google Scholar
Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Comp. Science and Computational Biology. Cambridge University Press, Cambridge (1997)
Book Google Scholar
Indyk, P.: Nearest neighbors in high-dimensional spaces. In: Goodman, J.E., O’Rourke, J. (eds.) Handbook of Discrete and Computational Geometry, ch. 39, 2nd edn. CRC Press LLC, Boca Raton (2004)
Google Scholar
Maaß, M.G.: Average-case analysis of approximate trie search. In: Sahinalp, S.C., Muthukrishnan, S.M., Dogrusoz, U. (eds.) CPM 2004. LNCS, vol. 3109, pp. 472–483. Springer, Heidelberg (2004)
Chapter Google Scholar
Maaß, M.G., Nowak, J.: Text indexing with errors. Technical Report TUMI0503, Fakultät für Informatik, TU München (Mar 2005)
Google Scholar
Maaß, M.G., Nowak, J.: A new method for approximate indexing and dictionary lookup with one error. Information Processing Letters (IPL) (to be published)
Google Scholar
McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)
Article MATH MathSciNet Google Scholar
Muthukrishnan, S.: Efficient algorithms for document retrieval problems. In: Proc. 13th ACM-SIAM Symp. on Discrete Algorithms (SODA), ACM/SIAM (2002)
Google Scholar
Myers, E.W.: A sublinear algorithm for approximate keyword searching. Algorithmica 12, 345–374 (1994)
Article MATH MathSciNet Google Scholar
Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys 33(1), 31–88 (2001)
Article Google Scholar
Navarro, G., Baeza-Yates, R.: A hybrid indexing method for approximate string matching. J. Discrete Algorithms 1(1), 205–209 (2000); Special issue on Matching Patterns
MathSciNet Google Scholar
Navarro, G., Baeza-Yates, R.A., Sutinen, E., Tarhio, J.: Indexing methods for approximate string matching. IEEE Data Eng. Bull. 24(4), 19–27 (2001)
Google Scholar
Nowak, J.: A new indexing method for approximate pattern matching with one mismatch. Master’s thesis, Fakultät für Informatik, Technische Universität München, Boltzmannstr. 3, D-85748 Garching (February 2004)
Google Scholar
Pittel, B.: Asymptotical growth of a class of random trees. Annals of Probability 13(2), 414–427 (1985)
Article MATH MathSciNet Google Scholar
Szpankowski, W.: Asymptotic properties of data compression and suffix trees. IEEE Transact. on Information Theory 39(5), 1647–1659 (1993)
Article MATH MathSciNet Google Scholar
Szpankowski, W.: Average Case Analysis of Algorithms on Sequences, 1st edn. Wiley-Interscience, Chichester (2000)
Google Scholar
Ukkonen, E.: Algorithms for approximate string matching. Information and Control 64, 100–118 (1985)
Article MATH MathSciNet Google Scholar
Ukkonen, E.: Approximate string-matching over suffix trees. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds.) CPM 1993. LNCS, vol. 684, pp. 228–242. Springer, Heidelberg (1993)
Chapter Google Scholar
Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14, 249–260 (1995)
Article MATH MathSciNet Google Scholar
Weiner, P.: Linear pattern matching. In: Proc. 14th IEEE Symp. on Switching and Automata Theory, pp. 1–11. IEEE, Los Alamitos (1973)
Chapter Google Scholar

Download references

Author information

Authors and Affiliations

Fakultät für Informatik, Technische Universität München, Boltzmannstraße 3, D-85748, Garching, Germany
Moritz G. Maaß & Johannes Nowak

Authors

Moritz G. Maaß
View author publications
You can also search for this author in PubMed Google Scholar
Johannes Nowak
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Georgia Institute of Technology and Università di Padova,
Alberto Apostolico
Université Paris-Est, France
Maxime Crochemore
School of Computer Science and Engineering, Seoul National University, 151-742, Seoul, Korea
Kunsoo Park

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Maaß, M.G., Nowak, J. (2005). Text Indexing with Errors. In: Apostolico, A., Crochemore, M., Park, K. (eds) Combinatorial Pattern Matching. CPM 2005. Lecture Notes in Computer Science, vol 3537. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11496656_3

Download citation

DOI: https://doi.org/10.1007/11496656_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26201-5
Online ISBN: 978-3-540-31562-9
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Text Indexing with Errors

Abstract

Access this chapter

Subscribe and save

Buy Now

Preview

Similar content being viewed by others

Less Space: Indexing for Queries with Wildcards

Indexes for Document Retrieval with Relevance

Efficient Compressed Indexing for Approximate Top-k String Retrieval

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Subscribe and save

Buy Now

Navigation

Text Indexing with Errors

Abstract

Access this chapter

Subscribe and save

Buy Now

Preview

Similar content being viewed by others

Less Space: Indexing for Queries with Wildcards

Indexes for Document Retrieval with Relevance

Efficient Compressed Indexing for Approximate Top-k String Retrieval

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation