DOI: 10.1145/3570991.3571014

RetroKD: Leveraging Past States for Regularizing Targets in Teacher-Student Learning

Published: 04 January 2023

Abstract

Several recent works show that a higher-accuracy model is not necessarily a better teacher for every student, a problem referred to as the student-teacher "knowledge gap". The techniques they propose, as we discuss in this paper, are constrained by certain pre-conditions: (1) access to the teacher model or its architecture, (2) retraining of the teacher model, or (3) training models in addition to the teacher. Since these conditions do not hold in many settings, the applicability of such approaches is limited. In this work, we propose RetroKD, which smooths the targets of the student network by combining the student's own past-state logits with those of the teacher. We hypothesize that the resulting target is neither as hard as the teacher's target nor as easy as the student's past target. This regularization of the learning objective removes the pre-conditions required by the other methods. An extensive set of experiments against baselines on the CIFAR-10, CIFAR-100, and TinyImageNet datasets, together with a theoretical study, supports our claim. We also perform key ablation studies, including hyperparameter sensitivity, a generalization study showing the flatness of the loss landscape, and feature similarity with the teacher network.
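To make the mechanism described in the abstract concrete, below is a minimal PyTorch-style sketch of a distillation loss that blends the student's past-state logits with the teacher's logits to form a softened target. This is an illustration only, not the authors' implementation: the function names (retrokd_target, retrokd_loss), the mixing weight alpha, the temperature T, the loss weight lam, and the choice to blend in logit space are assumptions introduced here; the paper's exact formulation and schedules are given in the full text.

```python
import torch
import torch.nn.functional as F


def retrokd_target(teacher_logits, past_student_logits, alpha=0.5, T=4.0):
    """Blend teacher logits with the student's past-state logits, then soften
    with temperature T to form the distillation target (hypothetical sketch)."""
    mixed = alpha * teacher_logits + (1.0 - alpha) * past_student_logits
    return F.softmax(mixed / T, dim=-1)


def retrokd_loss(student_logits, teacher_logits, past_student_logits,
                 labels, alpha=0.5, T=4.0, lam=0.7):
    """KD-style objective: cross-entropy on hard labels plus a KL term pulling
    the student toward the blended soft target."""
    soft_target = retrokd_target(teacher_logits, past_student_logits, alpha, T)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_p_student, soft_target, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return lam * kd + (1.0 - lam) * ce


# Toy usage: batch of 8 examples, 10 classes (e.g. CIFAR-10 sized output).
student = torch.randn(8, 10, requires_grad=True)
teacher = torch.randn(8, 10)
past_student = torch.randn(8, 10)  # logits from an earlier student checkpoint
labels = torch.randint(0, 10, (8,))
loss = retrokd_loss(student, teacher, past_student, labels)
loss.backward()
```

In practice, the past-state logits would come from a checkpoint of the student saved at an earlier point in training; how the blend is weighted or scheduled over training is defined in the paper itself.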



Published In

CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)
January 2023
357 pages
ISBN: 9781450397971
DOI: 10.1145/3570991

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Knowledge Distillation
  2. Past States
  3. Regularization

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

CODS-COMAD 2023

Acceptance Rates

Overall Acceptance Rate 197 of 680 submissions, 29%

