Abstract
We present data compression techniques hinged on the notion of a motif, interpreted here as a string of intermittently solid and wild characters that recurs more or less frequently in an input sequence or family of sequences. This notion arises originally in the analysis of sequences, particularly biomolecules, due to its multiple implications in the understanding of biological structure and function, and it has been the subject of various characterizations and study. Correspondingly, motif discovery techniques and tools have been devised. This task is made hard by the circumstance that the number of motifs identifiable in general in a sequence can be exponential in the size of that sequence. A significant gain in the direction of reducing the number of motifs is achieved through the introduction of irredundant motifs, which in intuitive terms are motifs of which the structure and list of occurrences cannot be inferred by a combination of other motifs’ occurrences. Although suboptimal, the available procedures for the extraction of some such motifs are not prohibitively expensive. Here we show that irredundant motifs can be usefully exploited in lossy compression methods based on textual substitution and suitable for signals as well as text. Actually, once the motifs in our lossy encodings are disambiguated into corresponding lossless codebooks, they still prove capable of yielding savings over popular methods in use. Preliminary experiments with these fungible strategies at the crossroads of lossless and lossy data compression show performances that improve over popular methods (i.e. GZip) by more than 20% in lossy and 10% in lossless implementations.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Atallah, M.J., Genin, Y., Szpankowski, W.: Pattern matching image compression: algorithmic and empirical results. IEEE Transactions on PAMI 21(7), 614–629 (1999)
Apostolico, A.: On the efficiency of syntactic feature extraction and approximate schemes for data compression. In: Proceedings of the 5th International Conference on Pattern Recognition, Miami, Florida, pp. 982–984 (1980)
Apostolico, A., Atallah, M.J.: Compact recognizers of episode sequences. Information and Computation 174, 180–192 (2002)
Apostolico, A., Galil, Z. (eds.): Pattern Matching Algorithms. Oxford University Press, New York (1997)
Apostolico, A., Fraenkel, A.: Robust transmission of unbounded strings using Fibonacci representations. IEEE Trans. Inform. Theory 33(2), 238–245 (1987)
Apostolico, A., Lonardi, S.: Off-line compression by greedy textual substitution. Proceedings of the IEEE 88(11), 1733–1744 (2000)
Apostolico, A., Parida, L.: Incremental paradigms of motif discovery. Journal of Computational Biology 11(1), 15–25 (2004)
Apostolico, A., Parida, L.: Compression and the wheel of fortune. In: Proceedings of IEEE DCC Data Compression Conference, pp. 143–152. Computer Society Press (2003)
Apostolico, A., Comin, M., Parida, L.: Motifs in Ziv-Lempel-Welch clef. In: Proceedings of IEEE DCC Data Compression Conference, pp. 72–81. Computer Society Press (2004)
Apostolico, A., Preparata, F.P.: Data structures and algorithms for the string statistics problem. Algorithmica 15, 481–494 (1996)
Berger, T.: Rate Distortion Theory: A Mathematical Basis for Data Compression. Prentice Hall, Englewood Cliffs (1971)
Berger, T., Gibson, J.D.: Lossy source coding. IEEE Trans. Inform. Theory 44(6), 2693–2723 (1998)
Blumer, A., Blumer, J., Ehrenfeucht, A., Haussler, D., Chen, M.T., Seiferas, J.: The smallest automaton recognizing the subwords of a text. Theoretical Computer Science 40, 31–55 (1985)
DeAgostino, S., Storer, J.A.: On-line versus off-line computation in dynamic text compression. Inform. Process. Lett. 59(3), 169–174 (1996)
Fu, K.S., Booth, T.L.: Grammatical inference: introduction and survey – part I. IEEE Transactions on Systems, Man and Cybernetics 5, 95–111 (1975)
Fu, K.S., Booth, T.L.: Grammatical inference: introduction and survey – part II. IEEE Transactions on Systems, Man and Cybernetics 5, 112–127 (1975)
Kieffer, J.C.: A survey of the theory of source coding with a fidelity criterion. IEEE Trans. Inform. Theory 39(5), 1473–1490 (1993)
Kontoyiannis, I.: An implementable lossy version of Lempel-Ziv algorithm –part 1: optimality for memoryless sources. IEEE Trans. Inform. Theory 45, 2293–2305 (1999)
Lehman, E., Shelat, A.: Approximation algorithms for grammar based compression. In: Proceedings of the eleventh ACM-SIAM Symposium on Discrete Algorithms (SODA 2002), pp. 205–212 (2002)
Lempel, A., Ziv, J.: On the complexity of finite sequences. IEEE Trans. Inform. Theory 22, 75–81 (1976)
Luczak, T., Szpankowski, W.: A suboptimal lossy data compression algorithm based on approximate pattern matching. IEEE Trans. Inform. Theory 43(5), 1439–1451 (1997)
Neville-Manning, C., Witten, I.H., Maulsby, D.: Compression by induction of hierarchical grammars. In: DCC: Data Compression Conference, pp. 244–253. IEEE Computer Society TCC, Los Alamitos (1994)
Parida, L., Rigoutsos, I., Platt, D.: An output-sensitive flexible pattern discovery algorithm. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 131–142. Springer, Heidelberg (2001)
Parida, L., Rigoutsos, I., Floratos, A., Platt, D., Gao, Y.: Pattern discovery on character sets and real-valued data: linear bound on irredundant motifs and polynomial time algorithms. In: Proceedings of the eleventh ACM-SIAM Symposium on Discrete Algorithms (SODA 2000), January 2000, pp. 297–308 (2000)
Rigoutsos, I., Floratos, A., Parida, L., Gao, Y., Platt, D.: The emergence of pattern discovery techniques in computational biology. Journal of Metabolic Engineering 2(3), 159–177 (2000)
Sadeh, I.: On approximate string matching. In: Proceedings of DCC 1993, pp. 148–157. IEEE Computer Society Press, Los Alamitos (1993)
Storer, J.A.: Data Compression: Methods and Theory. Computer Science Press (1988)
Wang, J., Shapiro, B., Shasha, D. (eds.): Pattern Discovery in Biomolecular Data: Tools, Techniques, and Applications. Oxford University Press, Oxford (1999)
Waterman, M.: Introduction to Computational Biology. Chapman and Hall, Boca Raton (1995)
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inform. Theory 23(3), 337–343 (1977)
Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inform. Theory 24(5), 530–536 (1978)
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Apostolico, A., Comin, M., Parida, L. (2006). Bridging Lossy and Lossless Compression by Motif Pattern Discovery. In: Ahlswede, R., et al. General Theory of Information Transfer and Combinatorics. Lecture Notes in Computer Science, vol 4123. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11889342_51
Download citation
DOI: https://doi.org/10.1007/11889342_51
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-46244-6
Online ISBN: 978-3-540-46245-3
eBook Packages: Computer ScienceComputer Science (R0)