DOI: 10.1145/3570991.3571014

RetroKD: Leveraging Past States for Regularizing Targets in Teacher-Student Learning

Published: 04 January 2023

Abstract

Several recent works show that a higher-accuracy model is not necessarily a better teacher for every student, a problem referred to as the student-teacher "knowledge gap". The techniques they propose, as we discuss in this paper, are constrained by certain pre-conditions: (1) access to the teacher model or its architecture, (2) retraining of the teacher model, or (3) training models in addition to the teacher. Since these conditions do not hold in many settings, the applicability of such approaches is limited. In this work, we propose RetroKD, which smooths the targets of the student network by combining the student's own past-state logits with those of the teacher. We hypothesize that the resulting target is neither as hard as the teacher's target nor as easy as the student's past target. This regularization of the learning objective removes the pre-conditions required by the other methods. An extensive set of experiments against baselines on the CIFAR-10, CIFAR-100, and TinyImageNet datasets, together with a theoretical study, supports our claim. We also perform key ablation studies, including hyperparameter sensitivity, a generalization study showing the flatness of the loss landscape, and feature similarity with the teacher network.
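To make the mechanism described in the abstract concrete, below is a minimal PyTorch-style sketch of a distillation loss that blends the student's past-state logits with the teacher's logits to form a softened target. This is an illustration only, not the authors' implementation: the function names (retrokd_target, retrokd_loss), the mixing weight alpha, the temperature T, the loss weight lam, and the choice to blend in logit space are assumptions introduced here; the paper's exact formulation and schedules are given in the full text.

```python
import torch
import torch.nn.functional as F


def retrokd_target(teacher_logits, past_student_logits, alpha=0.5, T=4.0):
    """Blend teacher logits with the student's past-state logits, then soften
    with temperature T to form the distillation target (hypothetical sketch)."""
    mixed = alpha * teacher_logits + (1.0 - alpha) * past_student_logits
    return F.softmax(mixed / T, dim=-1)


def retrokd_loss(student_logits, teacher_logits, past_student_logits,
                 labels, alpha=0.5, T=4.0, lam=0.7):
    """KD-style objective: cross-entropy on hard labels plus a KL term pulling
    the student toward the blended soft target."""
    soft_target = retrokd_target(teacher_logits, past_student_logits, alpha, T)
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_p_student, soft_target, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return lam * kd + (1.0 - lam) * ce


# Toy usage: batch of 8 examples, 10 classes (e.g. CIFAR-10 sized output).
student = torch.randn(8, 10, requires_grad=True)
teacher = torch.randn(8, 10)
past_student = torch.randn(8, 10)  # logits from an earlier student checkpoint
labels = torch.randint(0, 10, (8,))
loss = retrokd_loss(student, teacher, past_student, labels)
loss.backward()
```

In practice, the past-state logits would come from a checkpoint of the student saved at an earlier point in training; how the blend is weighted or scheduled over training is defined in the paper itself.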



Published In

CODS-COMAD '23: Proceedings of the 6th Joint International Conference on Data Science & Management of Data (10th ACM IKDD CODS and 28th COMAD)
January 2023
357 pages
ISBN: 9781450397971
DOI: 10.1145/3570991

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Knowledge Distillation
  2. Past States
  3. Regularization

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

CODS-COMAD 2023

Acceptance Rates

Overall Acceptance Rate 197 of 680 submissions, 29%

