Abstract
In this paper we address the problem of constructing an index for a text document or a collection of documents to answer various questions about the occurrences of a pattern when allowing a constant number of errors. In particular, our index can be built to report all occurrences, all positions, or all documents where a pattern occurs in time linear in the size of the query string and the number of results. This improves over previous work where the lookup time is not linear or depends upon the size of the document corpus. Our data structure has size \(O\left(n\log^k n\right)\) on average and with high probability for input size n and queries with up to k errors. Additionally, we present a trade-off between query time and index complexity that achieves worst-case bounded index size and preprocessing time with linear lookup time on average.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Abouelhoda, M.I., Ohlebusch, E., Kurtz, S.: Optimal exact string matching based on suffix arrays. In: Laender, A.H.F., Oliveira, A.L. (eds.) SPIRE 2002. LNCS, vol. 2476, pp. 31–43. Springer, Heidelberg (2002)
Amir, A., Keselman, D., Landau, G.M., Lewenstein, M., Lewenstein, N., Rodeh, M.: Indexing and dictionary matching with one error. J. Algorithms 37, 309–325 (2000)
Barkol, O., Rabani, Y.: Tighter bounds for nearest neighbor search and related problems in the cell probe model. In: Proc. 32nd ACM Symp. on Theory of Computing (STOC), pp. 388–396. ACM Press, New York (2000)
Borodin, A., Ostrovsky, R., Rabani, Y.: Lower bounds for high dimensional nearest neighbor search and related problems. In: Proc. 31st ACM Symp. on Theory of Computing (STOC), pp. 312–321. ACM Press, New York (1999)
Brodal, G.S., Ga̧sieniec, L.: Approximate dictionary queries. In: Hirschberg, D.S., Meyers, G. (eds.) CPM 1996. LNCS, vol. 1075, pp. 65–74. Springer, Heidelberg (1996)
Buchsbaum, A.L., Goodrich, M.T., Westbrook, J.: Range searching over tree cross products. In: Paterson, M. (ed.) ESA 2000. LNCS, vol. 1879, pp. 120–131. Springer, Heidelberg (2000)
Chang, W.I., Lawler, E.L.: Sublinear approximate string matching and biological applications. Algorithmica 12, 327–344 (1994)
Chávez, E., Navarro, G.: A metric index for approximate string matching. In: Rajsbaum, S. (ed.) LATIN 2002. LNCS, vol. 2286, pp. 181–195. Springer, Heidelberg (2002)
Cobbs, A.L.: Fast approximate matching using suffix trees. In: Galil, Z., Ukkonen, E. (eds.) CPM 1995. LNCS, vol. 937, pp. 41–54. Springer, Heidelberg (1995)
Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Proc. 36th ACM Symp. on Theory of Computing (STOC), pp. 91–100 (2004)
Demaine, E.D., López-Ortiz, A.: A linear lower bound on index size for text retrieval. In: Proc. 12th ACM-SIAM Symp. on Discrete Algorithms (SODA), pp. 289–294. ACM, New York (2001)
Gabow, H.N., Bentely, J.L., Tarjan, R.E.: Scaling and related techniques for geometry problems. In: Proc. 16th ACM Symp. on Theory of Computing (STOC), pp. 135–143. ACM, New York (1984)
Gabriele, A., Mignosi, F., Restivo, A., Sciortino, M.: Indexing structures for approximate string matching. In: Petreschi, R., Persiano, G., Silvestri, R. (eds.) CIAC 2003. LNCS, vol. 2653, pp. 140–151. Springer, Heidelberg (2003)
Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Comp. Science and Computational Biology. Cambridge University Press, Cambridge (1997)
Indyk, P.: Nearest neighbors in high-dimensional spaces. In: Goodman, J.E., O’Rourke, J. (eds.) Handbook of Discrete and Computational Geometry, ch. 39, 2nd edn. CRC Press LLC, Boca Raton (2004)
Maaß, M.G.: Average-case analysis of approximate trie search. In: Sahinalp, S.C., Muthukrishnan, S.M., Dogrusoz, U. (eds.) CPM 2004. LNCS, vol. 3109, pp. 472–483. Springer, Heidelberg (2004)
Maaß, M.G., Nowak, J.: Text indexing with errors. Technical Report TUMI0503, Fakultät für Informatik, TU München (Mar 2005)
Maaß, M.G., Nowak, J.: A new method for approximate indexing and dictionary lookup with one error. Information Processing Letters (IPL) (to be published)
McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)
Muthukrishnan, S.: Efficient algorithms for document retrieval problems. In: Proc. 13th ACM-SIAM Symp. on Discrete Algorithms (SODA), ACM/SIAM (2002)
Myers, E.W.: A sublinear algorithm for approximate keyword searching. Algorithmica 12, 345–374 (1994)
Navarro, G.: A guided tour to approximate string matching. ACM Computing Surveys 33(1), 31–88 (2001)
Navarro, G., Baeza-Yates, R.: A hybrid indexing method for approximate string matching. J. Discrete Algorithms 1(1), 205–209 (2000); Special issue on Matching Patterns
Navarro, G., Baeza-Yates, R.A., Sutinen, E., Tarhio, J.: Indexing methods for approximate string matching. IEEE Data Eng. Bull. 24(4), 19–27 (2001)
Nowak, J.: A new indexing method for approximate pattern matching with one mismatch. Master’s thesis, Fakultät für Informatik, Technische Universität München, Boltzmannstr. 3, D-85748 Garching (February 2004)
Pittel, B.: Asymptotical growth of a class of random trees. Annals of Probability 13(2), 414–427 (1985)
Szpankowski, W.: Asymptotic properties of data compression and suffix trees. IEEE Transact. on Information Theory 39(5), 1647–1659 (1993)
Szpankowski, W.: Average Case Analysis of Algorithms on Sequences, 1st edn. Wiley-Interscience, Chichester (2000)
Ukkonen, E.: Algorithms for approximate string matching. Information and Control 64, 100–118 (1985)
Ukkonen, E.: Approximate string-matching over suffix trees. In: Apostolico, A., Crochemore, M., Galil, Z., Manber, U. (eds.) CPM 1993. LNCS, vol. 684, pp. 228–242. Springer, Heidelberg (1993)
Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14, 249–260 (1995)
Weiner, P.: Linear pattern matching. In: Proc. 14th IEEE Symp. on Switching and Automata Theory, pp. 1–11. IEEE, Los Alamitos (1973)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Maaß, M.G., Nowak, J. (2005). Text Indexing with Errors. In: Apostolico, A., Crochemore, M., Park, K. (eds) Combinatorial Pattern Matching. CPM 2005. Lecture Notes in Computer Science, vol 3537. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11496656_3
Download citation
DOI: https://doi.org/10.1007/11496656_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26201-5
Online ISBN: 978-3-540-31562-9
eBook Packages: Computer ScienceComputer Science (R0)