Learnability for the Information Bottleneck
Figure 1: Accuracy for binary classification of MNIST digits 0 and 1 with 20% label noise and varying $\beta$. No learning happens for models trained at $\beta < 3.25$.

Figure 2: The Pareto frontier of the information plane, $I(X;Z)$ vs. $I(Y;Z)$, for the binary classification of MNIST digits 0 and 1 with 20% label noise described in Section 1 and Figure 1. For this problem, learning happens for models trained at $\beta > 3.25$. $H(Y) = 1$ bit, since only two of the ten digits are used, and $I(Y;Z) \le I(X;Y) \approx 0.5$ bits $< H(Y)$ because of the 20% label noise. The true frontier is differentiable; the figure shows a variational approximation that places an upper bound on both informations, horizontally offset to pass through the origin.

Figure 3: Predicted vs. experimentally identified $\beta_0$ for mixtures of Gaussians with varying class-conditional noise rates.

Figure 4: $I(Y;Z)$ vs. $\beta$ for mixture-of-Gaussians datasets with different distances between the two mixture components. The vertical lines are $\beta_{0,\text{predicted}}$ computed by the right-hand side of Equation (8). As Equation (8) does not make predictions w.r.t. class overlap, the vertical lines are always just above $\beta_{0,\text{predicted}} = 1$. However, as expected, decreasing the distance between the classes in $X$ space increases the true $\beta_0$.

Figure 5: $I(Y;Z)$ vs. $\beta$ for the MNIST binary classification with different numbers of hidden units per layer $n$ and noise rates $\rho$: (upper left) $\rho = 0.02$, (upper right) $\rho = 0.1$, (lower left) $\rho = 0.2$, (lower right) $\rho = 0.3$. The vertical lines are $\beta_0$ estimates from different methods. $n = 128$ has insufficient capacity for the problem, so its observed onset of learning is pushed higher, similar to the class-overlap case.

Figure 6: Histograms of the full MNIST training and validation sets according to $h(X)$. Both are bimodal, and the two histograms are indistinguishable. In both cases, $h(x)$ has learned to separate most of the ones into the smaller mode, but difficult examples lie in the wide valley between the two modes. See Figure 7 for all training images to the left of the red threshold line, as well as the first few images to the right of the threshold.

Figure 7: The first 5776 MNIST training-set digits when sorted by $h(x)$. The digits highlighted in red are above the threshold drawn in Figure 6.

Figure 8: $I(Y;Z)$ vs. $\beta$ for the CIFAR10 training set with 20% label noise. Each blue cross corresponds to a fully converged model with independent initialization. The vertical black line is the predicted $\beta_0 = 1.0483$ from Algorithm 1. The empirical $\beta_0 = 1.048$.
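The empirical $\beta_0$ values quoted in these captions are read off such sweeps as the smallest $\beta$ at which fully converged models attain $I(Y;Z) > 0$. A minimal sketch of that readout (the threshold `eps` is an assumption for illustration, not the authors' exact procedure):

```python
import numpy as np

def empirical_beta0(betas, i_yz, eps=1e-3):
    """Smallest beta at which I(Y;Z) rises above a small threshold.

    betas: beta values (any order), one fully converged model per value.
    i_yz:  matching I(Y;Z) estimates in bits.
    Returns None if no model learned anything.
    """
    order = np.argsort(betas)
    betas = np.asarray(betas, dtype=float)[order]
    i_yz = np.asarray(i_yz, dtype=float)[order]
    learned = i_yz > eps
    return float(betas[learned][0]) if learned.any() else None
```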
Abstract

1. Introduction
- We introduce the concept of IB-Learnability and show that when we vary $\beta$, the IB objective undergoes a phase transition from the inability to learn to the ability to learn (Section 3; the IB objective is restated after this list).
- Using the second-order variation, we derive sufficient conditions for IB-Learnability, which provide upper bounds on the learnability threshold $\beta_0$ (Section 4).
- We show that IB-Learnability is determined by the largest confident, typical, and imbalanced subset of the examples (the conspicuous subset), reveal its relationship with the slope of the Pareto frontier at the origin of the information plane $I(X;Z)$ vs. $I(Y;Z)$, and discuss its relation to model capacity (Section 5).
- We prove a deep relationship between IB-Learnability, our upper bounds on $\beta_0$, the hypercontractivity coefficient, the contraction coefficient, and the maximum correlation (Section 5).
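For reference, the objective these statements concern is the standard IB Lagrangian, restated here as background:

$$
\mathrm{IB}_\beta[p(z \mid x)] \;=\; I(X;Z) \;-\; \beta\, I(Y;Z),
$$

minimized over encoders $p(z \mid x)$, with $Z$ depending on $(X, Y)$ only through $X$. In the terminology above, $(X, Y)$ is IB$_\beta$-learnable when some encoder achieves a strictly lower objective value than the trivial solutions $p(z \mid x) = p(z)$, for which $I(X;Z) = I(Y;Z) = 0$ (Section 3.1).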
2. Related Work
3. IB-Learnability
3.1. Trivial Solutions
3.2. Necessary Condition for IB-Learnability
4. Sufficient Conditions for IB-Learnability
- We can set $h(x)$ to be nonzero on some region $\Omega_x \subseteq \mathcal{X}$ and 0 otherwise. Then we obtain the following sufficient condition (a reconstruction is sketched below):
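One form of the subset condition consistent with the special cases reported later (a reconstruction; it recovers the class-conditional closed form checked in Section 6.2.1 and the "(1) Corollary 1" column of the table in Appendix A.12) is:

$$
\beta_0 \;\le\; \tilde{\beta}_0(\Omega_x) \;=\; \frac{\dfrac{1}{p(\Omega_x)} - 1}{\displaystyle\sum_y \frac{p(y \mid \Omega_x)^2}{p(y)} \;-\; 1},
$$

so that $(X, Y)$ is IB$_\beta$-learnable whenever $\beta > \inf_{\Omega_x \subseteq \mathcal{X}} \tilde{\beta}_0(\Omega_x)$. As a sanity check, taking $\Omega_x$ to be one class of a balanced binary problem with symmetric label noise $\rho$ gives numerator $1/(1/2) - 1 = 1$ and denominator $2\left[(1-\rho)^2 + \rho^2\right] - 1 = (1-2\rho)^2$, i.e., $\tilde{\beta}_0 = (1-2\rho)^{-2}$.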
5. Discussion
5.1. The Conspicuous Subset Determines $\beta_0$
5.2. Multiple Phase Transitions
5.3. Similarity to Information Measures
5.4. Estimating Model Capacity
5.5. Learnability and the Information Plane
5.6. IB-Learnability, Hypercontractivity and Maximum Correlation
6. Estimating the IB-Learnability Condition
6.1. Estimation Algorithm
Algorithm 1: Estimating the upper bound for $\beta_0$ and identifying the conspicuous subset.
Require: Dataset $\mathcal{D}$. The number of classes is $C$.
Require: Tolerance for estimating $\beta_0$.
subroutine Get(·):
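A minimal Python sketch of this procedure, assuming the subset bound reconstructed in Section 4; the confidence-ranked subset sweep, the uniform empirical weighting of $p(x)$, and all names are implementation assumptions:

```python
import numpy as np

def beta0_upper_bound(p_y_given_x, p_y):
    """Sweep confidence-ranked candidate subsets Omega and return the
    minimum of the assumed subset bound
        beta0(Omega) = (1/p(Omega) - 1) / (sum_y p(y|Omega)^2/p(y) - 1),
    together with the minimizing subset (the candidate conspicuous subset).

    p_y_given_x: (N, C) estimated class posteriors p(y | x_i).
    p_y:         (C,) marginal class probabilities.
    """
    n, c = p_y_given_x.shape
    best, best_omega = np.inf, None
    for cls in range(c):
        order = np.argsort(-p_y_given_x[:, cls])  # most confident first
        cum = np.cumsum(p_y_given_x[order], axis=0)
        for size in range(1, n):                  # proper subsets only
            p_omega = size / n                    # uniform empirical p(x)
            p_y_omega = cum[size - 1] / size      # p(y | Omega)
            denom = float(np.sum(p_y_omega**2 / p_y)) - 1.0
            if denom <= 0:
                continue                          # subset carries no signal
            bound = (1.0 / p_omega - 1.0) / denom
            if bound < best:
                best, best_omega = bound, order[:size]
    return best, best_omega
```

With perfectly estimated posteriors on a balanced binary task with symmetric 20% label noise, the minimizer is exactly one class, and the bound evaluates to $1/(1 - 2 \cdot 0.2)^2 \approx 2.78$, in line with the table in Appendix A.12.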
6.2. Special Cases for Estimating $\beta_0$
6.2.1. Class-Conditional Label Noise
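For balanced binary classification with symmetric class-conditional label noise at rate $\rho$, the predicted threshold reduces to the closed form $\beta_0 = (1 - 2\rho)^{-2}$, which reproduces the "(1) Corollary 1" column of the table in Appendix A.12. A quick check (the closed form is inferred from those table values):

```python
# beta_0 = 1 / (1 - 2*rho)^2 for binary symmetric label noise at rate rho.
for rho in (0.02, 0.20, 0.40, 0.48):
    print(f"rho={rho:.2f}  beta_0={1.0 / (1.0 - 2.0 * rho) ** 2:.2f}")
# rho=0.02 -> 1.09, rho=0.20 -> 2.78, rho=0.40 -> 25.00, rho=0.48 -> 625.00
```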
6.2.2. Deterministic Relationships
7. Experiments
7.1. Synthetic Dataset Experiments
7.1.1. Classification with Class-Conditional Noise
7.1.2. Classification with Class Overlap
7.2. MNIST Experiments
7.3. MNIST Experiments Using Equation (2)
7.4. CIFAR10 Forgetting Experiments
8. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Appendix A
Appendix A.1. Preliminaries: First-Order and Second-Order Variations
Appendix A.2. Proof of Lemma 1
Appendix A.3. Proof of Theorem 1
Appendix A.4. Proof of Theorem 2
Appendix A.5. First- and Second-Order Variations of $\mathrm{IB}_\beta[p(z \mid x)]$
Appendix A.6. Proof of Lemma 2
Appendix A.7. Proof of Theorem 3
Appendix A.8. What IB First Learns at Its Onset of Learning
Appendix A.9. Proof of Theorem 4
Appendix A.10. Proof of Corollary 1 and Corollary 2
Appendix A.10.1. Proof of Corollary 1
Appendix A.10.2. Proof of Corollary 2
Appendix A.11. $\beta_0$, Hypercontractivity Coefficient, Contraction Coefficient, and Maximum Correlation
Appendix A.12. Experiment Details
CIFAR10 Details
|  | Plane | Auto. | Bird | Cat | Deer | Dog | Frog | Horse | Ship | Truck |
|---|---|---|---|---|---|---|---|---|---|---|
| Plane | 0.82232 | 0.00238 | 0.021 | 0.00069 | 0.00108 | 0 | 0.00017 | 0.00019 | 0.1473 | 0.00489 |
| Auto. | 0.00233 | 0.83419 | 0.00009 | 0.00011 | 0 | 0.00001 | 0.00002 | 0 | 0.00946 | 0.15379 |
| Bird | 0.03139 | 0.00026 | 0.76082 | 0.0095 | 0.07764 | 0.01389 | 0.1031 | 0.00309 | 0.00031 | 0 |
| Cat | 0.00096 | 0.0001 | 0.00273 | 0.69325 | 0.00557 | 0.28067 | 0.01471 | 0.00191 | 0.00002 | 0.0001 |
| Deer | 0.00199 | 0 | 0.03866 | 0.00542 | 0.83435 | 0.01273 | 0.02567 | 0.08066 | 0.00052 | 0.00001 |
| Dog | 0 | 0.00004 | 0.00391 | 0.2498 | 0.00531 | 0.73191 | 0.00477 | 0.00423 | 0.00001 | 0 |
| Frog | 0.00067 | 0.00008 | 0.06303 | 0.05025 | 0.0337 | 0.00842 | 0.8433 | 0 | 0.00054 | 0 |
| Horse | 0.00157 | 0.00006 | 0.00649 | 0.00295 | 0.13058 | 0.02287 | 0 | 0.83328 | 0.00023 | 0.00196 |
| Ship | 0.1288 | 0.01668 | 0.00029 | 0.00002 | 0.00164 | 0.00006 | 0.00027 | 0.00017 | 0.83385 | 0.01822 |
| Truck | 0.01007 | 0.15107 | 0 | 0.00015 | 0.00001 | 0.00001 | 0 | 0.00048 | 0.02549 | 0.81273 |
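Each row above sums to one, so the matrix is row-stochastic with rows indexing the true class and a diagonal of roughly 0.8, i.e., roughly 20% label noise. A minimal sketch of corrupting labels with such a matrix (an illustration, not the authors' exact pipeline):

```python
import numpy as np

def corrupt_labels(true_labels, Q, seed=0):
    """Sample noisy labels from a row-stochastic confusion matrix Q,
    where Q[i, j] = p(noisy label = j | true label = i)."""
    rng = np.random.default_rng(seed)
    Q = np.asarray(Q, dtype=float)
    Q = Q / Q.sum(axis=1, keepdims=True)  # renormalize away rounding error
    return np.array([rng.choice(Q.shape[1], p=Q[y]) for y in true_labels])
```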
| Noise Rate | Observed | (1) Corollary 1 | (2) Algorithm 1 (True) | (3) (True) | (4) Equation (2) | (2′) Algorithm 1 | (3′) |
|---|---|---|---|---|---|---|---|
| 0.02 | 1.06 | 1.09 | 1.09 | 1.10 | 1.08 | 1.08 | 1.10 |
| 0.04 | 1.20 | 1.18 | 1.18 | 1.21 | 1.18 | 1.19 | 1.20 |
| 0.06 | 1.26 | 1.29 | 1.29 | 1.33 | 1.30 | 1.31 | 1.33 |
| 0.08 | 1.40 | 1.42 | 1.42 | 1.45 | 1.42 | 1.43 | 1.46 |
| 0.10 | 1.52 | 1.56 | 1.56 | 1.60 | 1.55 | 1.58 | 1.60 |
| 0.12 | 1.70 | 1.73 | 1.73 | 1.78 | 1.71 | 1.73 | 1.77 |
| 0.14 | 1.99 | 1.93 | 1.93 | 1.99 | 1.90 | 1.91 | 1.95 |
| 0.16 | 2.04 | 2.16 | 2.16 | 2.24 | 2.15 | 2.15 | 2.16 |
| 0.18 | 2.41 | 2.44 | 2.44 | 2.49 | 2.43 | 2.42 | 2.49 |
| 0.20 | 2.74 | 2.78 | 2.78 | 2.86 | 2.76 | 2.77 | 2.71 |
| 0.22 | 3.15 | 3.19 | 3.19 | 3.29 | 3.19 | 3.21 | 3.29 |
| 0.24 | 3.75 | 3.70 | 3.70 | 3.83 | 3.71 | 3.75 | 3.72 |
| 0.26 | 4.40 | 4.34 | 4.34 | 4.48 | 4.35 | 4.31 | 4.17 |
| 0.28 | 5.16 | 5.17 | 5.17 | 5.37 | 5.12 | 4.98 | 4.55 |
| 0.30 | 6.34 | 6.25 | 6.25 | 6.49 | 6.24 | 6.03 | 5.58 |
| 0.32 | 8.06 | 7.72 | 7.72 | 8.02 | 7.63 | 7.19 | 7.33 |
| 0.34 | 9.77 | 9.77 | 9.77 | 10.13 | 9.74 | 8.95 | 7.37 |
| 0.36 | 12.58 | 12.76 | 12.76 | 13.21 | 12.51 | 11.11 | 10.09 |
| 0.38 | 16.91 | 17.36 | 17.36 | 17.96 | 16.97 | 14.55 | 10.49 |
| 0.40 | 24.66 | 25.00 | 25.00 | 25.99 | 25.01 | 20.36 | 17.27 |
| 0.42 | 39.08 | 39.06 | 39.06 | 40.85 | 39.48 | 30.12 | 10.89 |
| 0.44 | 64.82 | 69.44 | 69.44 | 71.80 | 76.48 | 51.95 | 21.95 |
| 0.46 | 163.07 | 156.25 | 156.26 | 161.88 | 173.15 | 114.57 | 21.47 |
| 0.48 | 599.45 | 625.00 | 625.00 | 651.47 | 838.90 | 293.90 | 8.69 |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Cite as: Wu, T.; Fischer, I.; Chuang, I.L.; Tegmark, M. Learnability for the Information Bottleneck. Entropy 2019, 21, 924. https://doi.org/10.3390/e21100924