[go: up one dir, main page]

Skip to main content

Bridging Lossy and Lossless Compression by Motif Pattern Discovery

  • Chapter
General Theory of Information Transfer and Combinatorics

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 4123))

Abstract

We present data compression techniques hinged on the notion of a motif, interpreted here as a string of intermittently solid and wild characters that recurs more or less frequently in an input sequence or family of sequences. This notion arises originally in the analysis of sequences, particularly biomolecules, due to its multiple implications in the understanding of biological structure and function, and it has been the subject of various characterizations and study. Correspondingly, motif discovery techniques and tools have been devised. This task is made hard by the circumstance that the number of motifs identifiable in general in a sequence can be exponential in the size of that sequence. A significant gain in the direction of reducing the number of motifs is achieved through the introduction of irredundant motifs, which in intuitive terms are motifs of which the structure and list of occurrences cannot be inferred by a combination of other motifs’ occurrences. Although suboptimal, the available procedures for the extraction of some such motifs are not prohibitively expensive. Here we show that irredundant motifs can be usefully exploited in lossy compression methods based on textual substitution and suitable for signals as well as text. Actually, once the motifs in our lossy encodings are disambiguated into corresponding lossless codebooks, they still prove capable of yielding savings over popular methods in use. Preliminary experiments with these fungible strategies at the crossroads of lossless and lossy data compression show performances that improve over popular methods (i.e. GZip) by more than 20% in lossy and 10% in lossless implementations.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

eBook
USD 15.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

Similar content being viewed by others

References

  1. Atallah, M.J., Genin, Y., Szpankowski, W.: Pattern matching image compression: algorithmic and empirical results. IEEE Transactions on PAMI 21(7), 614–629 (1999)

    Google Scholar 

  2. Apostolico, A.: On the efficiency of syntactic feature extraction and approximate schemes for data compression. In: Proceedings of the 5th International Conference on Pattern Recognition, Miami, Florida, pp. 982–984 (1980)

    Google Scholar 

  3. Apostolico, A., Atallah, M.J.: Compact recognizers of episode sequences. Information and Computation 174, 180–192 (2002)

    Article  MATH  MathSciNet  Google Scholar 

  4. Apostolico, A., Galil, Z. (eds.): Pattern Matching Algorithms. Oxford University Press, New York (1997)

    MATH  Google Scholar 

  5. Apostolico, A., Fraenkel, A.: Robust transmission of unbounded strings using Fibonacci representations. IEEE Trans. Inform. Theory 33(2), 238–245 (1987)

    Article  MATH  MathSciNet  Google Scholar 

  6. Apostolico, A., Lonardi, S.: Off-line compression by greedy textual substitution. Proceedings of the IEEE 88(11), 1733–1744 (2000)

    Article  Google Scholar 

  7. Apostolico, A., Parida, L.: Incremental paradigms of motif discovery. Journal of Computational Biology 11(1), 15–25 (2004)

    Article  Google Scholar 

  8. Apostolico, A., Parida, L.: Compression and the wheel of fortune. In: Proceedings of IEEE DCC Data Compression Conference, pp. 143–152. Computer Society Press (2003)

    Google Scholar 

  9. Apostolico, A., Comin, M., Parida, L.: Motifs in Ziv-Lempel-Welch clef. In: Proceedings of IEEE DCC Data Compression Conference, pp. 72–81. Computer Society Press (2004)

    Google Scholar 

  10. Apostolico, A., Preparata, F.P.: Data structures and algorithms for the string statistics problem. Algorithmica 15, 481–494 (1996)

    Article  MATH  MathSciNet  Google Scholar 

  11. Berger, T.: Rate Distortion Theory: A Mathematical Basis for Data Compression. Prentice Hall, Englewood Cliffs (1971)

    Google Scholar 

  12. Berger, T., Gibson, J.D.: Lossy source coding. IEEE Trans. Inform. Theory 44(6), 2693–2723 (1998)

    Article  MATH  MathSciNet  Google Scholar 

  13. Blumer, A., Blumer, J., Ehrenfeucht, A., Haussler, D., Chen, M.T., Seiferas, J.: The smallest automaton recognizing the subwords of a text. Theoretical Computer Science 40, 31–55 (1985)

    Article  MATH  MathSciNet  Google Scholar 

  14. DeAgostino, S., Storer, J.A.: On-line versus off-line computation in dynamic text compression. Inform. Process. Lett. 59(3), 169–174 (1996)

    Article  MathSciNet  Google Scholar 

  15. Fu, K.S., Booth, T.L.: Grammatical inference: introduction and survey – part I. IEEE Transactions on Systems, Man and Cybernetics 5, 95–111 (1975)

    Article  MATH  MathSciNet  Google Scholar 

  16. Fu, K.S., Booth, T.L.: Grammatical inference: introduction and survey – part II. IEEE Transactions on Systems, Man and Cybernetics 5, 112–127 (1975)

    Google Scholar 

  17. Kieffer, J.C.: A survey of the theory of source coding with a fidelity criterion. IEEE Trans. Inform. Theory 39(5), 1473–1490 (1993)

    Article  MATH  MathSciNet  Google Scholar 

  18. Kontoyiannis, I.: An implementable lossy version of Lempel-Ziv algorithm –part 1: optimality for memoryless sources. IEEE Trans. Inform. Theory 45, 2293–2305 (1999)

    Article  MATH  MathSciNet  Google Scholar 

  19. Lehman, E., Shelat, A.: Approximation algorithms for grammar based compression. In: Proceedings of the eleventh ACM-SIAM Symposium on Discrete Algorithms (SODA 2002), pp. 205–212 (2002)

    Google Scholar 

  20. Lempel, A., Ziv, J.: On the complexity of finite sequences. IEEE Trans. Inform. Theory 22, 75–81 (1976)

    Article  MATH  MathSciNet  Google Scholar 

  21. Luczak, T., Szpankowski, W.: A suboptimal lossy data compression algorithm based on approximate pattern matching. IEEE Trans. Inform. Theory 43(5), 1439–1451 (1997)

    Article  MATH  MathSciNet  Google Scholar 

  22. Neville-Manning, C., Witten, I.H., Maulsby, D.: Compression by induction of hierarchical grammars. In: DCC: Data Compression Conference, pp. 244–253. IEEE Computer Society TCC, Los Alamitos (1994)

    Google Scholar 

  23. Parida, L., Rigoutsos, I., Platt, D.: An output-sensitive flexible pattern discovery algorithm. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 131–142. Springer, Heidelberg (2001)

    Chapter  Google Scholar 

  24. Parida, L., Rigoutsos, I., Floratos, A., Platt, D., Gao, Y.: Pattern discovery on character sets and real-valued data: linear bound on irredundant motifs and polynomial time algorithms. In: Proceedings of the eleventh ACM-SIAM Symposium on Discrete Algorithms (SODA 2000), January 2000, pp. 297–308 (2000)

    Google Scholar 

  25. Rigoutsos, I., Floratos, A., Parida, L., Gao, Y., Platt, D.: The emergence of pattern discovery techniques in computational biology. Journal of Metabolic Engineering 2(3), 159–177 (2000)

    Article  Google Scholar 

  26. Sadeh, I.: On approximate string matching. In: Proceedings of DCC 1993, pp. 148–157. IEEE Computer Society Press, Los Alamitos (1993)

    Google Scholar 

  27. Storer, J.A.: Data Compression: Methods and Theory. Computer Science Press (1988)

    Google Scholar 

  28. Wang, J., Shapiro, B., Shasha, D. (eds.): Pattern Discovery in Biomolecular Data: Tools, Techniques, and Applications. Oxford University Press, Oxford (1999)

    Google Scholar 

  29. Waterman, M.: Introduction to Computational Biology. Chapman and Hall, Boca Raton (1995)

    MATH  Google Scholar 

  30. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory 23(3), 337–343 (1977)

    Article  MATH  MathSciNet  Google Scholar 

  31. Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inform. Theory 24(5), 530–536 (1978)

    Article  MATH  MathSciNet  Google Scholar 

Download references

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Apostolico, A., Comin, M., Parida, L. (2006). Bridging Lossy and Lossless Compression by Motif Pattern Discovery. In: Ahlswede, R., et al. General Theory of Information Transfer and Combinatorics. Lecture Notes in Computer Science, vol 4123. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11889342_51

Download citation

  • DOI: https://doi.org/10.1007/11889342_51

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-46244-6

  • Online ISBN: 978-3-540-46245-3

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics