Pareto-Optimal Data Compression for Binary Classification Tasks
Figure 1. The Pareto frontier (top panel) for compressed versions $Z = g(X)$ of our warmup dataset $X \in [0,1]^2$ with classes $Y \in \{1,2\}$, showing the maximum attainable class information $I(Z, Y)$ for a given entropy $H(Z)$, mapped with the method described in this paper using the likelihood binning shown in the bottom panel.
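As an illustration of the two quantities plotted on this frontier, the sketch below estimates $H(Z)$ and $I(Z, Y)$ from paired samples of a compressed variable $Z = g(X)$. The toy dataset, the class model and the compression `g` are placeholders chosen for readability, not the paper's warmup example.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution (zero entries allowed)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def H_and_I(z, y):
    """Estimate H(Z) and I(Z;Y) in bits from paired samples of Z and Y."""
    z_vals, z_idx = np.unique(z, return_inverse=True)
    y_vals, y_idx = np.unique(y, return_inverse=True)
    joint = np.zeros((len(z_vals), len(y_vals)))
    np.add.at(joint, (z_idx, y_idx), 1)          # empirical joint counts
    joint /= joint.sum()
    pz, py = joint.sum(1), joint.sum(0)
    return entropy(pz), entropy(pz) + entropy(py) - entropy(joint.ravel())

# Toy example: X uniform on [0,1]^2, Y depends on x1 + x2, Z = g(X) bins x1 into 3 groups.
rng = np.random.default_rng(0)
X = rng.random((100_000, 2))
Y = (rng.random(100_000) < 0.5 * (X[:, 0] + X[:, 1])).astype(int) + 1   # classes 1, 2
Z = np.digitize(X[:, 0], [1/3, 2/3])             # a simple (suboptimal) compression g(X)
print(H_and_I(Z, Y))                             # one point on or below the Pareto frontier
```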
Figure 2. Sample data from Section 3. Images from MNIST (top), Fashion-MNIST (middle) and CIFAR-10 (bottom) are mapped into integers (group labels) $Z = f(X)$ retaining maximum mutual information with the class variable $Y$ (ones/sevens, shirts/pullovers and cats/dogs, respectively) for 3, 5 and 5 groups, respectively. These mappings $f$ correspond to Pareto frontier “corners”.
Figure 3. Essentially all information about $Y$ is retained if $W$ is binned into sufficiently narrow bins. Sorting the bins (left) to make the conditional probability monotonically increasing (right) changes neither this information nor the entropy.
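The claim that sorting the bins costs nothing can be checked numerically: $H(Z)$ and $I(Z, Y)$ depend only on the joint probabilities of the bins, not on which integer label each bin carries. A minimal sketch, using a random joint distribution as a stand-in for the finely binned $W$:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def info_and_entropy(joint):
    """Return (H(Z), I(Z;Y)) from a joint probability table P(Z=i, Y=j)."""
    pz, py = joint.sum(1), joint.sum(0)
    return entropy(pz), entropy(pz) + entropy(py) - entropy(joint.ravel())

rng = np.random.default_rng(1)
joint = rng.random((8, 2))                  # 8 bins of W, 2 classes
joint /= joint.sum()

p_cond = joint[:, 0] / joint.sum(1)         # P(Y=1 | Z=i) for each bin
order = np.argsort(p_cond)                  # sort bins so P(Y=1|Z) is increasing
sorted_joint = joint[order]

print(info_and_entropy(joint))
print(info_and_entropy(sorted_joint))       # identical: sorting is a mere relabeling
```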
Figure 4. The reason that the Pareto frontier can never be reached using non-contiguous bins is that swapping parts of them against parts of an intermediate bin can increase $I(Z, Y)$ while keeping $H(Z)$ constant. In this example, the binning function $g$ assigns two separate $W$-intervals (top panel) to the same bin (bin 2), as seen in the bottom panel. The shaded rectangles have widths $P_i$, heights $p_i$ and areas $P_{i1} = P_i p_i$. In the upper panel, the conditional probabilities $p_i$ are monotonically increasing because they are averages of the monotonically increasing curve $p_1(w)$.
Figure 5. Contour plot of the function $W(x_1, x_2)$ computed both exactly using Equation (27) (solid curves) and approximately using a neural network (dashed curves).
Figure 6. Cumulative distributions $F_i(w) \equiv P(W < w \mid Y = i)$ are shown for the analytic (blue/dark grey), Fashion-MNIST (red/grey) and CIFAR-10 (orange/light grey) examples. Solid curves show the observed cumulative histograms of $W$ from the neural network, and dashed curves show the fits defined by Equation (31) and Table 2.
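Curves like these can be produced directly from classifier outputs by computing the empirical conditional cumulative distribution $F_i(w) = P(W < w \mid Y = i)$ for each class. In the sketch below, the arrays `W` and `Y` are synthetic stand-ins for the network's likelihood estimates and the true labels.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
Y = rng.integers(1, 3, size=50_000)                              # stand-in labels in {1, 2}
W = np.clip(rng.beta(2, 5, size=Y.size) + 0.3 * (Y == 1), 0, 1)  # stand-in likelihood W

w_grid = np.linspace(0, 1, 200)
for i, style in [(1, "-"), (2, "--")]:
    Wi = np.sort(W[Y == i])
    F_i = np.searchsorted(Wi, w_grid) / Wi.size   # empirical P(W < w | Y = i)
    plt.plot(w_grid, F_i, style, label=f"F_{i}(w)")
plt.xlabel("w")
plt.ylabel("cumulative probability")
plt.legend()
plt.show()
```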
Figure 7. The solid curves show the actual conditional probability $P(Y=1 \mid W)$ for CIFAR-10 (where the labels $Y = 1$ and 2 correspond to “cat” and “dog”) and MNIST with 20% label noise (where the labels $Y = 1$ and 2 correspond to “1” and “7”), respectively. The color-matched dashed curves show the conditional probabilities predicted by the neural network; the reason that they are not diagonal lines $P(Y=1 \mid W) = W$ is that $W$ has been reparametrized to have a uniform distribution. If the neural network classifiers were optimal, then solid and dashed curves would coincide.
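The comparison in this figure amounts to a calibration check: bin the predicted probability $W$ and compare the empirical fraction of class-1 samples in each bin (the “actual” $P(Y=1 \mid W)$) against the mean prediction in that bin. The sketch below is illustrative, not the authors' code; the quantile binning and the synthetic `(W, Y)` data are assumptions made for the example.

```python
import numpy as np

def calibration_curve(W, Y, n_bins=20):
    """Return empirical P(Y=1 | W in bin) and mean predicted W per equal-count bin."""
    edges = np.quantile(W, np.linspace(0, 1, n_bins + 1))        # rank-uniformize W
    idx = np.clip(np.searchsorted(edges, W, side="right") - 1, 0, n_bins - 1)
    actual = np.array([np.mean(Y[idx == b] == 1) for b in range(n_bins)])
    predicted = np.array([np.mean(W[idx == b]) for b in range(n_bins)])
    return actual, predicted

# Placeholder data standing in for (W, Y) from a trained classifier:
rng = np.random.default_rng(3)
W = rng.random(100_000)
Y = (rng.random(W.size) < W).astype(int)      # 1 marks the class of interest; calibrated toy case
actual, predicted = calibration_curve(W, Y)
print(np.max(np.abs(actual - predicted)))     # small for a well-calibrated classifier
```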
Figure 8. The Pareto frontier for compressed versions $Z = g(X)$ of our four datasets $X$, showing the maximum attainable class information $I(Z, Y)$ for a given entropy $H(Z)$. The “corners” (dots) correspond to the maximum $I(Z, Y)$ attainable when binning the likelihood $W$ into a given number of bins (2, 3, …, 8 from right to left). The horizontal dotted lines show the maximum available information $I(X, Y)$ for each case, reflecting that there is simply less to learn in some examples than in others.
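A readable, if inefficient, way to reproduce such a “corner” is to discretize the uniformized $W$ finely, sort the cells so that $P(Y=1 \mid \text{cell})$ increases, and then exhaustively search over contiguous binnings into $n$ bins for the one maximizing $I(Z, Y)$. This brute-force search is only an illustrative stand-in for the paper's direct optimization of the bin boundaries.

```python
import numpy as np
from itertools import combinations

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_info(joint):
    return entropy(joint.sum(1)) + entropy(joint.sum(0)) - entropy(joint.ravel())

def best_corner(fine_joint, n_bins):
    """Max I(Z;Y) over all contiguous binnings of the rows of fine_joint into n_bins bins."""
    K = fine_joint.shape[0]
    best = -np.inf
    for cuts in combinations(range(1, K), n_bins - 1):
        edges = (0,) + cuts + (K,)
        coarse = np.array([fine_joint[a:b].sum(0) for a, b in zip(edges[:-1], edges[1:])])
        best = max(best, mutual_info(coarse))
    return best

rng = np.random.default_rng(4)
fine = rng.random((12, 2))
fine /= fine.sum()                                   # toy joint P(W-cell, Y): 12 cells, 2 classes
fine = fine[np.argsort(fine[:, 0] / fine.sum(1))]    # sort cells so P(Y=1|cell) increases
print([round(best_corner(fine, n), 4) for n in (2, 3, 4)])   # corners for 2, 3, 4 bins
```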
Figure 9. The Pareto frontier for our analytic example, computed exactly with our method (solid curve) and approximately with the Blahut–Arimoto method (dots).
Abstract
1. Introduction
Objectives and Relation to Prior Work
2. Method
2.1. Lossless Distillation for Classification Tasks
2.2. Pareto-Optimal Compression for Binary Classification Tasks
2.2.1. Uniformization of W
2.2.2. Binning W
2.2.3. Making the Conditional Probability Monotonic
2.2.4. Proof that Pareto Frontier is Spanned by Contiguous Binnings
- only the joint probabilities of the two affected bins change,
- both marginal distributions remain the same,
- the new conditional probabilities of those two bins are further apart (see the numerical sketch after this list).
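The following numerical sketch (illustrative, not the paper's proof) exhibits such a swap for a small three-bin joint distribution: transferring a small amount of class-1 mass from one bin to another and an equal amount of class-2 mass back leaves both marginals $P(Z)$ and $P(Y)$ unchanged, pushes the two conditional probabilities further apart, and strictly increases $I(Z, Y)$.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_info(joint):
    return entropy(joint.sum(1)) + entropy(joint.sum(0)) - entropy(joint.ravel())

joint = np.array([[0.10, 0.20],     # bin j: P(Y=1|Z=j) = 1/3
                  [0.30, 0.15],     # bin k: P(Y=1|Z=k) = 2/3
                  [0.15, 0.10]])    # an untouched third bin

delta = 0.02
swapped = joint.copy()
swapped[0, 0] -= delta; swapped[1, 0] += delta   # class-1 mass: bin j -> bin k
swapped[1, 1] -= delta; swapped[0, 1] += delta   # class-2 mass: bin k -> bin j

# Both marginal distributions are unchanged by construction:
assert np.allclose(joint.sum(0), swapped.sum(0)) and np.allclose(joint.sum(1), swapped.sum(1))
print(mutual_info(joint), mutual_info(swapped))  # the second value is larger
```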
2.3. Mapping the Frontier
3. Results
3.1. Analytic Warmup Example
3.2. The Pareto Frontier
3.2.1. Approximating $W$
3.2.2. Approximating $F_i(w)$
3.3. MNIST, Fashion-MNIST and CIFAR-10
3.4. Interpretation of Our Results
3.5. Real-World Issues
3.6. Performance Compared with Blahut–Arimoto Method
4. Conclusions and Discussion
4.1. Relation to Information Bottleneck
4.2. Relation to Phase Transitions in DIB Learning
4.3. Outlook
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Appendix A. Binning Can Be Practically Lossless
Appendix B. More Varying Conditional Probability Boosts Mutual Information
References
- Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1901, 2, 559–572.
- Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, Helsinki, Finland, 5–9 July 2008; pp. 1096–1103.
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Neural Information Processing Systems 2014, Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680.
- Hotelling, H. Relations between two sets of variates. Biometrika 1936, 28, 321–377.
- Eckart, C.; Young, G. The approximation of one matrix by another of lower rank. Psychometrika 1936, 1, 211–218.
- van den Oord, A.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748.
- Clark, D.G.; Livezey, J.A.; Bouchard, K.E. Unsupervised Discovery of Temporal Structure in Noisy Data with Dynamical Components Analysis. arXiv 2019, arXiv:1905.09944.
- Tegmark, M. Optimal Latent Representations: Distilling Mutual Information into Principal Pairs. arXiv 2019, arXiv:1902.03364.
- Kurkoski, B.M.; Yagi, H. Quantization of binary-input discrete memoryless channels. IEEE Trans. Inf. Theory 2014, 60, 4544–4552.
- Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. arXiv 2000, arXiv:physics/0004057.
- Tan, A.; Meshulam, L.; Bialek, W.; Schwab, D. The renormalization group and information bottleneck: A unified framework. In Proceedings of the APS Meeting Abstracts, Boston, MA, USA, 4–8 March 2019.
- Strouse, D.; Schwab, D.J. The deterministic information bottleneck. Neural Comput. 2017, 29, 1611–1630.
- Alemi, A.A.; Fischer, I.; Dillon, J.V.; Murphy, K. Deep variational information bottleneck. arXiv 2016, arXiv:1612.00410.
- Chalk, M.; Marre, O.; Tkacik, G. Relevant sparse codes with variational information bottleneck. In Proceedings of the Neural Information Processing Systems 2016, Barcelona, Spain, 5–10 December 2016; pp. 1957–1965.
- Fischer, I. The Conditional Entropy Bottleneck. 2018. Available online: https://openreview.net/forum?id=rkVOXhAqY7 (accessed on 11 December 2019).
- Amjad, R.A.; Geiger, B.C. Learning representations for neural network-based classification using the information bottleneck principle. IEEE Trans. Pattern Anal. Mach. Intell. 2019.
- Kim, I.Y.; de Weck, O.L. Adaptive weighted-sum method for bi-objective optimization: Pareto front generation. Struct. Multidiscip. Optim. 2005, 29, 149–158.
- Krizhevsky, A.; Nair, V.; Hinton, G. The CIFAR-10 Dataset. 2014. Available online: https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 11 December 2019).
- LeCun, Y.; Cortes, C.; Burges, C. MNIST Handwritten Digit Database. 2010. Available online: http://yann.lecun.com/exdb/mnist (accessed on 11 December 2019).
- Xiao, H.; Rasul, K.; Vollgraf, R. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv 2017, arXiv:1708.07747.
- Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report TR-2009; University of Toronto: Toronto, ON, Canada, 2009.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Strouse, D.; Schwab, D.J. The information bottleneck and geometric clustering. Neural Comput. 2019, 31, 596–612.
- Wu, T.; Fischer, I.; Chuang, I.; Tegmark, M. Learnability for the information bottleneck. arXiv 2019, arXiv:1907.07331.
- Wu, T.; Fischer, I.; Chuang, I.; Tegmark, M. Learnability for the information bottleneck. Entropy 2019, 21, 924.
- Thompson, J.; Garner, A.J.; Mahoney, J.R.; Crutchfield, J.P.; Vedral, V.; Gu, M. Causal asymmetry in a quantum world. Phys. Rev. X 2018, 8, 031013.
- Blahut, R. Computation of channel capacity and rate-distortion functions. IEEE Trans. Inf. Theory 1972, 18, 460–473.
- Arimoto, S. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Trans. Inf. Theory 1972, 18, 14–20.
- Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information bottleneck for Gaussian variables. J. Mach. Learn. Res. 2005, 6, 165–188.
- Rezende, D.J.; Viola, F. Taming VAEs. arXiv 2018, arXiv:1810.00597.
- Achille, A.; Soatto, S. Emergence of invariance and disentanglement in deep representations. J. Mach. Learn. Res. 2018, 19, 1947–1980.
- Still, S. Thermodynamic cost and benefit of data representations. arXiv 2017, arXiv:1705.00612.
Table 1.

| Random Vectors | What Is Distilled? | Probability Distribution: Gaussian | Probability Distribution: Non-Gaussian |
|---|---|---|---|
| 1 | Entropy | PCA | Autoencoder |
| 2 | Mutual information | CCA | Latent reps |
Table 2. Fit parameters for the cumulative distributions $F_i(w)$ defined by Equation (31).

| Experiment | Y | Fit Parameters (Equation (31)) |
|---|---|---|
| Analytic | 1 | 0.0668, −4.7685, 16.8993, −25.0849, 13.758, 0.5797, −0.2700 |
| Analytic | 2 | 0.4841, −5.0106, 5.7863, −1.5697, −1.7180, −0.3313, −0.0030 |
| Fashion-MNIST | Pullover | 0.2878, −12.9596, 44.9217, −68.0105, 37.3126, 0.3547, −0.2838 |
| Fashion-MNIST | Shirt | 1.0821, −23.8350, 81.6655, −112.2720, 53.9602, −0.4068, 0.4552 |
| CIFAR-10 | Cat | 0.9230, 0.2165, 0.0859, 6.0013, −1.0037, 0.8499; 0.6795, 0.0511, 0.6838, −1.0138, 0.9061 |
| CIFAR-10 | Dog | 0.8970, 0.2132, 0.0806, 6.0013, −1.0039, 0.8500; 0.7872, 0.0144, 0.7974, −0.9440, 0.7237 |
| MNIST | One | 3.1188, −65.224, 231.4, −320.054, 150.779, 1.1226, −0.6856 |
| MNIST | Seven | −1.0325, −47.5411, 189.895, −269.28, 127.363, −0.8219, 0.1284 |
Table 3. Class information $I(Z, Y)$ attained at a given entropy $H(Z)$ for the analytic example, using the Blahut–Arimoto (BA) method and our method.

| $H(Z)$ | $I(Z, Y)$: BA-Method | $I(Z, Y)$: Our Method |
|---|---|---|
| 0.0000 | 0.0000 | 0.0000 |
| 0.9652 | 0.3260 | 0.3421 |
| 0.9998 | 0.3506 | 0.3622 |
| 1.5437 | 0.4126 | 0.4276 |
| 1.5581 | 0.4126 | 0.4298 |
| 1.5725 | 0.4141 | 0.4314 |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).