Information-Theoretic Methods in Deep Learning: Theory and Applications

A special issue of Entropy (ISSN 1099-4300). This special issue belongs to the section "Information Theory, Probability and Statistics".

Deadline for manuscript submissions: closed (15 May 2024) | Viewed by 32585

Printed Edition Available!
A printed edition of this Special Issue is available here.

Special Issue Editors


Guest Editor
School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China
Interests: information theoretic learning; information bottleneck; deep learning; artificial general intelligence; correntropy

Guest Editor
Department of Computer Science, Vrije Universiteit Amsterdam, 1081 HV Amsterdam, The Netherlands
Interests: information theory of deep neural networks; explainable/interpretable AI; machine learning in non-stationary environments; time series analysis; brain network analysis

Guest Editor
Department of Electrical and Computer Engineering, University of Kentucky, Lexington, KY 40506, USA
Interests: machine learning for signal processing; information theoretic learning; representation learning; computer vision; computational neuroscience

Guest Editor
Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, 28 Xianning West Road, Xi'an 710049, China
Interests: information theoretic learning; artificial intelligence; cognitive science; adaptive filtering; brain machine learning; robotics

Special Issue Information

Dear Colleagues,

Information theory provides the mathematical infrastructure for quantifying and manipulating information, and it has had a significant influence on the design of efficient and reliable communication systems. Information theoretic learning (ITL) has attracted increasing attention in the field of deep learning in recent years. It provides useful descriptions of the underlying behavior of random variables or processes, which can be used to develop and analyze deep models. Novel ITL estimators and principles have been applied to a range of deep learning problems, such as mutual information neural estimation for representation learning under the information maximization principle, and the principle of relevant information for redundancy compression and graph sparsification. As a vital approach to describing performance constraints and designing mappings, ITL has essential applications in supervised, unsupervised, and reinforcement learning problems such as classification, clustering, and sequential decision making. Within this field, the information bottleneck (IB) seeks the right balance between data fit and generalization by using mutual information as both a regularizer and a cost function. IB theory helps to characterize the fundamental limits of learning problems, such as the learning performance of deep neural networks, geometric clustering, and extracting the Gaussian part of a signal.
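For readers less familiar with the formulation, the IB objective referred to above is conventionally written as a Lagrangian over the stochastic encoding p(t|x); this is the standard textbook form rather than a result specific to this Special Issue:

```latex
\min_{p(t \mid x)} \; \mathcal{L}_{\mathrm{IB}} \;=\; I(X;T) \;-\; \beta\, I(T;Y)
```

Here T is the compressed representation of the input X, Y is the relevance variable to be predicted, and β > 0 controls the trade-off between compression and preserved relevant information.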

In recent years, researchers have shown that ITL provides a powerful paradigm for analyzing neural networks, shedding light on their layered structure, generalization capabilities, and learning dynamics. For example, IB theory has demonstrated great potential for solving critical problems in deep learning, including understanding and analyzing black-box neural networks and serving as an optimization criterion for training deep neural networks. Divergence estimation is another approach with a broad range of applications, including domain shift detection, domain adaptation, generative modeling, and model regularization.
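To make the estimation theme above concrete, the following is a minimal sketch of a Donsker–Varadhan (MINE-style) mutual information lower bound in PyTorch. The critic architecture and the batch-shuffling approximation of the product of marginals are illustrative assumptions, not a prescription from any paper in this issue.

```python
import math
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """Small critic T_theta(x, z) used in the Donsker-Varadhan bound."""
    def __init__(self, x_dim, z_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def mine_lower_bound(critic, x, z):
    """I(X;Z) >= E_P[T(x,z)] - log E_{P_X x P_Z}[exp T(x,z')].

    The product of marginals is approximated by shuffling z within the batch.
    """
    joint = critic(x, z).mean()
    z_perm = z[torch.randperm(z.size(0))]
    marginal = torch.logsumexp(critic(x, z_perm), dim=0) - math.log(z.size(0))
    return joint - marginal
```

Maximizing this bound with respect to the critic parameters tightens the estimate; in representation learning it is typically maximized jointly with an encoder, which is the information maximization principle mentioned above.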

With the development of ITL theory, we believe that ITL can provide new perspectives, theories, and algorithms for the challenging problems of deep learning. This Special Issue therefore aims to report the latest developments in ITL methods and their applications. Topics of interest include, but are not limited to, the following:

  • Information-theoretic quantities and estimators;
  • Information-theoretic principles and regularization in deep neural networks;
  • Interpretation and explanation of deep learning models with information-theoretic methods;
  • Information-theoretic methods for distributed deep learning;
  • Information-theoretic methods for brain-inspired neural networks;
  • Information bottleneck in deep representation learning;
  • Representation learning beyond the information bottleneck, such as total correlation explanation and the principle of relevant information.

Dr. Shuangming Yang
Dr. Shujian Yu
Dr. Luis Gonzalo Sánchez Giraldo
Prof. Dr. Badong Chen
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Entropy is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • information theoretic learning
  • information bottleneck
  • deep learning
  • neural networks

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (10 papers)


Research

Jump to: Review

23 pages, 927 KiB  
Article
PyDTS: A Python Toolkit for Deep Learning Time Series Modelling
by Pascal A. Schirmer and Iosif Mporas
Entropy 2024, 26(4), 311; https://doi.org/10.3390/e26040311 - 31 Mar 2024
Cited by 1 | Viewed by 2419
Abstract
In this article, the topic of time series modelling is discussed. It highlights the criticality of analysing and forecasting time series data across various sectors, identifying five primary application areas: denoising, forecasting, nonlinear transient modelling, anomaly detection, and degradation modelling. It further outlines the mathematical frameworks employed in a time series modelling task, categorizing them into statistical, linear algebra, and machine- or deep-learning-based approaches, with each category serving distinct dimensions and complexities of time series problems. Additionally, the article reviews the extensive literature on time series modelling, covering statistical processes, state space representations, and machine and deep learning applications in various fields. The unique contribution of this work lies in its presentation of a Python-based toolkit for time series modelling (PyDTS) that integrates popular methodologies and offers practical examples and benchmarking across diverse datasets. Full article
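The sequence-to-point framing covered by the toolkit (compare the frame-based modelling variants in the figure list below) can be illustrated with a short windowing helper; make_seq2point_windows is a hypothetical function written for this summary, not part of the PyDTS API.

```python
import numpy as np

def make_seq2point_windows(series, targets, window):
    """Slice a 1-D series into overlapping input windows, each paired with
    the single target value aligned to the window's last sample."""
    X, y = [], []
    for end in range(window, len(series) + 1):
        X.append(series[end - window:end])
        y.append(targets[end - 1])
    return np.stack(X), np.asarray(y)

# Example: 48-sample windows predicting the target aligned to the last sample.
t = np.arange(1000, dtype=float)
X, y = make_seq2point_windows(np.sin(0.1 * t), np.cos(0.1 * t), window=48)
print(X.shape, y.shape)   # (953, 48) (953,)
```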
Show Figures

Figures 1–12: generalized time series architecture; sequence-to-point, sequence-to-subsequence, and sequence-to-sequence framing; overview of PyDTS modules and its internal data pipeline; grid search for the optimal number of input samples; DL layer architectures for the DNN, LSTM, and CNN models; predicted appliance current draw (AMPds2); forecasted power consumption for phase L1; predicted stator-winding and rotor-magnet temperatures; confusion matrices for raw, statistical, and frequency-domain features; predicted remaining cell charge; and feature ranking for the nonlinear modelling task.
24 pages, 1055 KiB  
Article
A Unifying Generator Loss Function for Generative Adversarial Networks
by Justin Veiner, Fady Alajaji and Bahman Gharesifard
Entropy 2024, 26(4), 290; https://doi.org/10.3390/e26040290 - 27 Mar 2024
Cited by 2 | Viewed by 1587
Abstract
A unifying α-parametrized generator loss function is introduced for a dual-objective generative adversarial network (GAN) that uses a canonical (or classical) discriminator loss function such as the one in the original GAN (VanillaGAN) system. The generator loss function is based on a symmetric class probability estimation type function, Lα, and the resulting GAN system is termed Lα-GAN. Under an optimal discriminator, it is shown that the generator’s optimization problem consists of minimizing a Jensen-fα-divergence, a natural generalization of the Jensen-Shannon divergence, where fα is a convex function expressed in terms of the loss function Lα. It is also demonstrated that this Lα-GAN problem recovers as special cases a number of GAN problems in the literature, including VanillaGAN, least squares GAN (LSGAN), least kth-order GAN (LkGAN), and the recently introduced (αD,αG)-GAN with αD=1. Finally, experimental results are provided for three datasets—MNIST, CIFAR-10, and Stacked MNIST—to illustrate the performance of various examples of the Lα-GAN system. Full article
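For context, the Jensen–Shannon divergence that the Jensen-fα-divergence generalizes is the standard symmetrized divergence

```latex
\mathrm{JSD}(P \,\|\, Q) \;=\; \tfrac{1}{2}\, D_{\mathrm{KL}}\!\Big(P \,\Big\|\, \tfrac{P+Q}{2}\Big) \;+\; \tfrac{1}{2}\, D_{\mathrm{KL}}\!\Big(Q \,\Big\|\, \tfrac{P+Q}{2}\Big),
```

and under an optimal discriminator the VanillaGAN generator objective reduces to minimizing 2 JSD(P_data || P_g) − log 4; the paper's analysis replaces the JSD with a Jensen-fα-divergence determined by the generator loss Lα.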
Show Figures

Figures 1–4: generated images and average FID scores versus epochs for the best-performing (αD, αG)-GANs on MNIST, CIFAR-10, and Stacked MNIST, and the corresponding generated images and FID curves for the best-performing SLkGANs.
22 pages, 1728 KiB  
Article
Ensemble Transductive Propagation Network for Semi-Supervised Few-Shot Learning
by Xueling Pan, Guohe Li and Yifeng Zheng
Entropy 2024, 26(2), 135; https://doi.org/10.3390/e26020135 - 31 Jan 2024
Cited by 2 | Viewed by 1523
Abstract
Few-shot learning aims to address the difficulty of obtaining sufficient training samples, which leads to high variance, high bias, and over-fitting. Recently, graph-based transductive few-shot learning approaches supplement the deficiency of label information via unlabeled data to make a joint prediction, which has become a new research hotspot. Therefore, in this paper, we propose a novel ensemble semi-supervised few-shot learning strategy via transductive network and Dempster–Shafer (D-S) evidence fusion, named ensemble transductive propagation networks (ETPN). First, we present homogeneity and heterogeneity ensemble transductive propagation networks to better use the unlabeled data, which introduce a preset weight coefficient and provide the process of iterative inferences during transductive propagation learning. Then, we combine the information entropy to improve the D-S evidence fusion method, which improves the stability of multi-model results fusion from the pre-processing of the evidence source. Third, we combine the L2 norm to improve an ensemble pruning approach to select individual learners with higher accuracy to participate in the integration of the few-shot model results. Moreover, interference sets are introduced to semi-supervised training to improve the anti-disturbance ability of the model. Finally, experiments indicate that the proposed approaches outperform the state-of-the-art few-shot model. The best accuracy of ETPN increases by 0.3% and 0.28% in the 5-way 5-shot, and by 3.43% and 7.6% in the 5-way 1-shot on miniImageNet and tieredImageNet, respectively. Full article
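The transductive propagation step on a k-nearest-neighbor graph follows the standard closed-form label propagation; the numpy sketch below shows only that generic building block, not the ensemble construction or the Dempster–Shafer fusion introduced in the paper.

```python
import numpy as np

def label_propagation(W, Y, alpha=0.99):
    """Closed-form propagation F* = (I - alpha * S)^(-1) Y, where
    S = D^(-1/2) W D^(-1/2) is the symmetrically normalized affinity.
    W: symmetric k-NN affinity matrix; Y: one-hot labels with zero rows
    for unlabeled (query) points."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    S = d_inv_sqrt @ W @ d_inv_sqrt
    return np.linalg.solve(np.eye(W.shape[0]) - alpha * S, Y)
```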
Show Figures

Figures 1–13: overall framework and flow chart of the ensemble model; the IG-semiTPN model and its gϕ construction; the homogeneous and heterogeneous ensemble processes; the Ho-ETPN and He-ETPN frameworks; the transductive propagation algorithm based on the k-nearest-neighbor graph; and accuracy comparisons of IG-semiTPN, ETPN (He-ETPN and Ho-ETPN), and semi-HoETPN on miniImageNet and tieredImageNet.
17 pages, 857 KiB  
Article
Deep Individual Active Learning: Safeguarding against Out-of-Distribution Challenges in Neural Networks
by Shachar Shayovitz, Koby Bibas and Meir Feder
Entropy 2024, 26(2), 129; https://doi.org/10.3390/e26020129 - 31 Jan 2024
Viewed by 1285
Abstract
Active learning (AL) is a paradigm focused on purposefully selecting training data to enhance a model’s performance by minimizing the need for annotated samples. Typically, strategies assume that the training pool shares the same distribution as the test set, which is not always valid in privacy-sensitive applications where annotating user data is challenging. In this study, we operate within an individual setting and leverage an active learning criterion which selects data points for labeling based on minimizing the min-max regret on a small unlabeled test set sample. Our key contribution lies in the development of an efficient algorithm, addressing the challenging computational complexity associated with approximating this criterion for neural networks. Notably, our results show that, especially in the presence of out-of-distribution data, the proposed algorithm substantially reduces the required training set size by up to 15.4%, 11%, and 35.1% for CIFAR10, EMNIST, and MNIST datasets, respectively. Full article
Show Figures

Figures 1–5: datasets mixing in-distribution and OOD samples; accuracy versus number of Oracle calls on MNIST and EMNIST; CIFAR10 performance with and without OOD samples; and the number of OOD samples selected for CIFAR10 in the presence of OOD data.
16 pages, 2785 KiB  
Article
Continual Reinforcement Learning for Quadruped Robot Locomotion
by Sibo Gai, Shangke Lyu, Hongyin Zhang and Donglin Wang
Entropy 2024, 26(1), 93; https://doi.org/10.3390/e26010093 - 22 Jan 2024
Cited by 2 | Viewed by 3273
Abstract
The ability to learn continuously is crucial for a robot to achieve a high level of intelligence and autonomy. In this paper, we consider continual reinforcement learning (RL) for quadruped robots, which includes the ability to continuously learn subsequent tasks (plasticity) and maintain performance on previous tasks (stability). The policy obtained by the proposed method enables robots to learn multiple tasks sequentially, while overcoming both catastrophic forgetting and loss of plasticity. At the same time, it achieves the above goals with as little modification to the original RL learning process as possible. The proposed method uses the Piggyback algorithm to select protected parameters for each task, and reinitializes the unused parameters to increase plasticity. Meanwhile, we encourage the policy network to explore by increasing the entropy of its soft network. Our experiments show that traditional continual learning algorithms cannot perform well on robot locomotion problems, and our algorithm is more stable and less disruptive to the RL training progress. Several robot locomotion experiments validate the effectiveness of our method. Full article
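The Piggyback-style parameter selection mentioned in the abstract amounts to activating a different binary mask over a shared weight tensor for each task; the layer below is a minimal illustration of that masking idea (with fixed random masks standing in for learned ones), not the authors' implementation.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Shared weight matrix with one binary mask per task; only the masked
    subset of weights is active when a given task is selected."""
    def __init__(self, in_features, out_features, n_tasks):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        # In Piggyback the masks are learned via thresholded real-valued
        # scores; fixed random masks keep this sketch short.
        self.register_buffer(
            "masks",
            (torch.rand(n_tasks, out_features, in_features) > 0.5).float())

    def forward(self, x, task_id):
        return x @ (self.weight * self.masks[task_id]).t()
```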
Show Figures

Figures 1–11: multi-task versus continual RL formulations; per-task parameter selection, locking, and reinitialization; the Leg Crash, Leg Inversion, and Leg Noise tasks; per-task performance, line-velocity tracking, and angle-velocity tracking throughout sequential training; training rewards for the Piggyback variants and for the reinitialization/entropy ablations; and learning curves in the mixed environment.
15 pages, 656 KiB  
Article
A Deep Neural Network Regularization Measure: The Class-Based Decorrelation Method
by Chenguang Zhang, Tian Liu and Xuejiao Du
Entropy 2024, 26(1), 7; https://doi.org/10.3390/e26010007 - 20 Dec 2023
Cited by 1 | Viewed by 1851
Abstract
In response to the challenge of overfitting, which may lead to a decline in network generalization performance, this paper proposes a new regularization technique, called the class-based decorrelation method (CDM). Specifically, this method views the neurons in a specific hidden layer as base learners, and aims to boost network generalization as well as model accuracy by minimizing the correlation among individual base learners while simultaneously maximizing their class-conditional correlation. Intuitively, CDM not only promotes diversity among the hidden neurons, but also enhances cohesiveness among them when processing samples from the same class. Comparative experiments conducted on various datasets using deep models demonstrate that CDM effectively reduces overfitting and improves classification performance. Full article
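A minimal sketch of the idea behind CDM, assuming a penalty on the overall covariance among hidden units combined with a reward for their class-conditional covariance, is shown below; the weighting and the exact functional form used in the paper may differ.

```python
import torch

def class_based_decorrelation(h, labels, lam=0.1):
    """Penalize covariance among hidden units overall while rewarding it
    within each class.  h: (batch, hidden) activations of the chosen
    hidden layer; labels: (batch,) integer class labels."""
    def offdiag_cov_sq(z):
        zc = z - z.mean(dim=0, keepdim=True)
        cov = zc.t() @ zc / max(z.size(0) - 1, 1)
        return (cov - torch.diag(torch.diag(cov))).pow(2).sum()

    overall = offdiag_cov_sq(h)
    within = sum(offdiag_cov_sq(h[labels == c])
                 for c in labels.unique() if (labels == c).sum() > 1)
    return lam * (overall - within)
```

The returned value would be added to the usual cross-entropy loss of the network.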
Show Figures

Figures 1–4: architecture of the CDM-based fully connected network; examples of Gaussian-noise-corrupted MNIST, CIFAR-10, and Mini-ImageNet images; covariance, class-conditional covariance, and covariance-gap curves over training for different λ; and classification accuracy as λ varies on MNIST and CIFAR-10.
21 pages, 913 KiB  
Article
Analysis of Deep Convolutional Neural Networks Using Tensor Kernels and Matrix-Based Entropy
by Kristoffer K. Wickstrøm, Sigurd Løkse, Michael C. Kampffmeyer, Shujian Yu, José C. Príncipe and Robert Jenssen
Entropy 2023, 25(6), 899; https://doi.org/10.3390/e25060899 - 3 Jun 2023
Viewed by 2097
Abstract
Analyzing deep neural networks (DNNs) via information plane (IP) theory has gained tremendous attention recently to gain insight into, among others, DNNs’ generalization ability. However, it is by no means obvious how to estimate the mutual information (MI) between each hidden layer and the input/desired output to construct the IP. For instance, hidden layers with many neurons require MI estimators with robustness toward the high dimensionality associated with such layers. MI estimators should also be able to handle convolutional layers while at the same time being computationally tractable to scale to large networks. Existing IP methods have not been able to study truly deep convolutional neural networks (CNNs). We propose an IP analysis using the new matrix-based Rényi’s entropy coupled with tensor kernels, leveraging the power of kernel methods to represent properties of the probability distribution independently of the dimensionality of the data. Our results shed new light on previous studies concerning small-scale DNNs using a completely new approach. We provide a comprehensive IP analysis of large-scale CNNs, investigating the different training phases and providing new insights into the training dynamics of large-scale neural networks. Full article
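The matrix-based Rényi entropy referred to above is computed from the eigenvalues of a trace-normalized Gram matrix; the sketch below uses a plain Gaussian kernel on vector data and omits the tensor-kernel construction that is the paper's contribution.

```python
import numpy as np

def matrix_renyi_entropy(X, sigma=1.0, alpha=1.01):
    """S_alpha(A) = log2(sum_i lambda_i(A)^alpha) / (1 - alpha), where A is
    the trace-normalized Gaussian Gram matrix of the samples X (n, d)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))
    A = K / np.trace(K)
    eigvals = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    return np.log2((eigvals ** alpha).sum()) / (1.0 - alpha)
```

In the same framework, joint entropies (and hence the mutual information used to build the information plane) are obtained from Hadamard products of such Gram matrices.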
Show Figures

Figures 1–8 and A1–A5: information planes of a small DNN (with early-stopping patience highlighted), a small CNN, and the VGG16 on MNIST and CIFAR-10 (training and test data); sanity checks comparing analytically computed and estimated entropy and mutual information for Gaussian distributions; layer-wise MI differences indicating compliance with the DPI; evolution of the kernel width during training; alternative tensor-kernel constructions; and the epoch-wise double-descent experiment.
49 pages, 10680 KiB  
Article
Multivariate Time Series Information Bottleneck
by Denis Ullmann, Olga Taran and Slava Voloshynovskiy
Entropy 2023, 25(5), 831; https://doi.org/10.3390/e25050831 - 22 May 2023
Cited by 2 | Viewed by 3236
Abstract
Time series (TS) and multiple time series (MTS) predictions have historically paved the way for distinct families of deep learning models. The temporal dimension, distinguished by its evolutionary sequential aspect, is usually modeled by decomposition into the trio of “trend, seasonality, noise”, by attempts to copy the functioning of human synapses, and more recently, by transformer models with self-attention on the temporal dimension. These models may find applications in finance and e-commerce, where even a performance increase of less than 1% has large monetary repercussions; they also have potential applications in natural language processing (NLP), medicine, and physics. To the best of our knowledge, the information bottleneck (IB) framework has not received significant attention in the context of TS or MTS analyses. One can demonstrate that a compression of the temporal dimension is key in the context of MTS. We propose a new approach with partial convolution, where a time sequence is encoded into a two-dimensional representation resembling images. Accordingly, we use the recent advances made in image extension to predict an unseen part of an image from a given one. We show that our model compares well with traditional TS models, has information-theoretic foundations, and can be easily extended to more dimensions than only time and space. An evaluation of our multiple time series–information bottleneck (MTS-IB) model proves its efficiency in electricity production, road traffic, and astronomical data representing solar activity, as recorded by NASA’s Interface Region Imaging Spectrograph (IRIS) satellite. Full article
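The image-extension view described in the abstract treats a multivariate sequence as a two-dimensional array in which the future block is masked out and then reconstructed; the helper below only illustrates that data layout and is not the partial-convolution model itself.

```python
import numpy as np

def mask_future_block(mts, horizon):
    """mts: (timesteps, channels) array treated as a 2-D 'image'.
    Returns the observed part, the binary mask hiding the last `horizon`
    rows, and the hidden target block to be predicted."""
    image = np.asarray(mts, dtype=float)
    mask = np.ones_like(image)
    mask[-horizon:, :] = 0.0          # the unseen region to be "extended"
    target = image[-horizon:, :].copy()
    return image * mask, mask, target
```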
Show Figures

Figures 1–17 and A1–A4: Markov chains of representative deep TS predictors; the analogy between the IB principle and image extension; the IRIS observation setup and MgII h/k data; cadence and event-duration histograms; the ED-LSTM/ED-GRU baseline structures; the evaluation pipeline; example direct and iterated predictions with pixel-wise errors; MTS, CV, center-assignment, and physical-feature evaluations on QS, AR, and FL data; average centroid distributions; and confusion matrices for centroid prediction under the direct and iterated procedures.
19 pages, 3673 KiB  
Article
Position-Wise Gated Res2Net-Based Convolutional Network with Selective Fusing for Sentiment Analysis
by Jinfeng Zhou, Xiaoqin Zeng, Yang Zou and Haoran Zhu
Entropy 2023, 25(5), 740; https://doi.org/10.3390/e25050740 - 30 Apr 2023
Viewed by 1730
Abstract
Sentiment analysis (SA) is an important task in natural language processing in which convolutional neural networks (CNNs) have been successfully applied. However, most existing CNNs can only extract predefined, fixed-scale sentiment features and cannot synthesize flexible, multi-scale sentiment features. Moreover, these models’ convolutional and pooling layers gradually lose local detailed information. In this study, a new CNN model based on residual network technology and attention mechanisms is proposed. This model exploits more abundant multi-scale sentiment features and addresses the loss of locally detailed information to enhance the accuracy of sentiment classification. It is primarily composed of a position-wise gated Res2Net (PG-Res2Net) module and a selective fusing module. The PG-Res2Net module can adaptively learn multi-scale sentiment features over a large range using multi-way convolution, residual-like connections, and position-wise gates. The selective fusing module is developed to fully reuse and selectively fuse these features for prediction. The proposed model was evaluated using five baseline datasets. The experimental results demonstrate that the proposed model surpassed the other models in performance. In the best case, the model outperforms the other models by up to 1.2%. Ablation studies and visualizations further revealed the model’s ability to extract and fuse multi-scale sentiment features. Full article
Show Figures

Figures 1–8: the impact of multi-scale words and phrases on text sentiment; the overall model framework; Res2Net versus PG-Res2Net residual blocks; the selective fusing module; heatmaps of multi-scale sentiment features; t-SNE visualizations of text sentiment representations with and without selective fusing; the effect of the number of residual blocks; and representative error cases.

Review

Jump to: Research

28 pages, 570 KiB  
Review
To Compress or Not to Compress—Self-Supervised Learning and Information Theory: A Review
by Ravid Shwartz Ziv and Yann LeCun
Entropy 2024, 26(3), 252; https://doi.org/10.3390/e26030252 - 12 Mar 2024
Cited by 41 | Viewed by 11399
Abstract
Deep neural networks excel in supervised learning tasks but are constrained by the need for extensive labeled data. Self-supervised learning emerges as a promising alternative, allowing models to learn without explicit labels. Information theory has shaped deep neural networks, particularly the information bottleneck principle. This principle optimizes the trade-off between compression and preserving relevant information, providing a foundation for efficient network design in supervised contexts. However, its precise role and adaptation in self-supervised learning remain unclear. In this work, we scrutinize various self-supervised learning approaches from an information-theoretic perspective, introducing a unified framework that encapsulates the self-supervised information-theoretic learning problem. This framework includes multiple encoders and decoders, suggesting that all existing work on self-supervised learning can be seen as specific instances. We aim to unify these approaches to understand their underlying principles better and address the main challenge: many works present different frameworks with differing theories that may seem contradictory. By weaving existing research into a cohesive narrative, we delve into contemporary self-supervised methodologies, spotlight potential research areas, and highlight inherent challenges. Moreover, we discuss how to estimate information-theoretic quantities and their associated empirical problems. Overall, this paper provides a comprehensive review of the intersection of information theory, self-supervised learning, and deep neural networks, aiming for a better understanding through our proposed unified approach. Full article
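As one representative instance of the objectives surveyed (a common form from the multiview literature, not the review's unified framework itself), a multiview information bottleneck for two views V1, V2 with representations Z1, Z2 can be written as

```latex
\mathcal{L} \;=\; -\,I(Z_1; V_2) \;-\; I(Z_2; V_1) \;+\; \beta \big[\, I(Z_1; V_1 \mid V_2) \;+\; I(Z_2; V_2 \mid V_1) \,\big],
```

where the first two terms preserve the information shared across views and the conditional terms compress away view-specific nuisances.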
Show Figures

Figure 1: multiview information bottleneck diagram for self-supervised, unsupervised, and supervised learning.