SCL-IKD: intermediate knowledge distillation via supervised contrastive representation learning

Published in: Applied Intelligence

Abstract

Knowledge distillation, which extracts dark knowledge from a deep teacher model to guide the learning of a shallow student model, is useful for several tasks, including model compression and regularization. Previous research has largely focused on architecture-driven solutions for extracting information from teacher models; these solutions target a single task and fail to extract rich dark knowledge from large teacher networks when a capacity gap exists, which limits their broader applicability. Hence, in this paper, we propose a supervised contrastive learning-based intermediate knowledge distillation (SCL-IKD) technique that distills knowledge from teacher networks more effectively to train a student model for classification tasks. Unlike other approaches, SCL-IKD is model agnostic and can be applied across a variety of teacher-student architecture pairs. Experiments on several datasets show that SCL-IKD achieves 3–4% higher top-1 accuracy than several state-of-the-art baselines. Furthermore, compared with the baselines, SCL-IKD handles capacity gaps between teacher and student models better and is significantly more robust to symmetric label noise and limited data availability.
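Since the abstract describes SCL-IKD only at a high level, the following is a minimal, illustrative sketch (not the authors' implementation) of the general idea it summarizes: a supervised contrastive loss applied between a student's projected intermediate features and the corresponding teacher features, where samples sharing a class label act as positives. The function name, tensor shapes, projection assumptions, and temperature value below are all hypothetical and introduced only for illustration.

# Illustrative sketch only; all names, shapes, and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def supcon_intermediate_kd_loss(student_feats, teacher_feats, labels, temperature=0.1):
    # student_feats, teacher_feats: (B, D) projected intermediate embeddings
    # labels: (B,) integer class labels for the batch
    s = F.normalize(student_feats, dim=1)   # student anchors on the unit sphere
    t = F.normalize(teacher_feats, dim=1)   # teacher contrast set on the unit sphere
    logits = s @ t.t() / temperature        # (B, B) pairwise similarities
    # positives: teacher embeddings whose label matches the student anchor's label
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # average log-likelihood over all positives per anchor (SupCon-style)
    mean_log_prob_pos = (pos_mask * log_prob).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return -mean_log_prob_pos.mean()

In practice, a term of this kind would typically be combined with the standard cross-entropy loss on ground-truth labels and a soft-logit distillation term, with the relative weights of the losses tuned on a validation set.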



Acknowledgements

The authors gratefully acknowledge the support from the Prime Minister's Research Fellowship (PMRF) scheme of the Government of India, under which this research work was carried out. We also acknowledge the assistance of the Indian Institute of Technology Patna (IITP) Centre of Excellence in Cyber Crime Prevention against Women and Children (AI-based Tools for Women and Children Safety project) for providing the technical infrastructure used in this research.

Author information

Corresponding author

Correspondence to Saurabh Sharma.

Ethics declarations

Conflicts of interest

The authors declare no competing interests.

Ethical standard

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Sharma, S., Lodhi, S.S. & Chandra, J. SCL-IKD: intermediate knowledge distillation via supervised contrastive representation learning. Appl Intell 53, 28520–28541 (2023). https://doi.org/10.1007/s10489-023-05036-y

