Computer Science > Artificial Intelligence

arXiv:2310.03188 (cs)

[Submitted on 4 Oct 2023]

Title: Talking Models: Distill Pre-trained Knowledge to Downstream Models via Interactive Communication

Authors: Zhe Zhao, Qingyun Liu, Huan Gui, Bang An, Lichan Hong, Ed H. Chi

Abstract: Many recent breakthroughs in machine learning have been enabled by pre-trained foundation models. By scaling up model parameters, training data, and computation resources, foundation models have significantly advanced the state of the art in many applications. However, how to use these models to perform downstream tasks efficiently remains an open question. Knowledge distillation (KD), which transfers knowledge from a large teacher model to a smaller student model, has been explored to tackle this challenge. While KD has been successful in improving student model performance, recent research has found that a powerful teacher does not necessarily lead to a powerful student, because of the large capacity gap between the two. In addition, potential distribution shifts between the pre-training data and downstream tasks can make knowledge transfer in KD sub-optimal for improving downstream task performance. In this paper, we extend KD with an interactive communication process to help students of downstream tasks learn effectively from pre-trained foundation models. Our design is inspired by the way humans learn from teachers who can explain knowledge in a way that meets the students' needs. Specifically, we let each model (i.e., student and teacher) train two components: (1) an encoder that encodes the model's hidden states into a message and (2) a decoder that decodes any message into the model's own hidden states. With the encoder and decoder, not only can the teacher transfer rich information by encoding its hidden states, but the student can also send messages carrying information about downstream tasks to the teacher. Therefore, the knowledge passed from teacher to student can be tailored to the student's capacity and the downstream tasks' distributions. We conducted experiments on benchmark datasets to show that our communication mechanism outperforms state-of-the-art distillation techniques.
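The encoder and decoder roles described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not specify architectures, dimensions, or training losses, so everything below is an assumption made for illustration. The `TalkingModel` class, the random linear maps standing in for trained encoders and decoders, and the dimensions (teacher hidden size 8, student hidden size 4, message size 3) are all hypothetical. The sketch only shows how a shared message space could let two models with different hidden sizes exchange information in both directions.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Return a random weight matrix with out_dim rows and in_dim columns."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, vector):
    """Multiply a weight matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


class TalkingModel:
    """Hypothetical wrapper: a model that owns a message encoder and decoder.

    Random linear maps stand in for the trained components (an assumption).
    """

    def __init__(self, hidden_dim, message_dim, rng):
        self.hidden_dim = hidden_dim
        self.encoder = make_linear(hidden_dim, message_dim, rng)  # hidden -> message
        self.decoder = make_linear(message_dim, hidden_dim, rng)  # message -> hidden

    def encode(self, hidden):
        """Encode this model's hidden state into a message."""
        return apply_linear(self.encoder, hidden)

    def decode(self, message):
        """Decode any message into this model's own hidden space."""
        return apply_linear(self.decoder, message)


rng = random.Random(0)
MESSAGE_DIM = 3  # shared message size (assumed)
teacher = TalkingModel(hidden_dim=8, message_dim=MESSAGE_DIM, rng=rng)
student = TalkingModel(hidden_dim=4, message_dim=MESSAGE_DIM, rng=rng)

student_hidden = [0.5, -0.2, 0.1, 0.9]   # toy student hidden state
request = student.encode(student_hidden)  # student sends a message
teacher_view = teacher.decode(request)    # teacher reads it in its own hidden space
reply = teacher.encode(teacher_view)      # teacher answers in the message space
student_update = student.decode(reply)    # student reads the reply

print(len(request), len(teacher_view), len(reply), len(student_update))
```

The sketch leaves out how the teacher would combine the decoded message with its own hidden states and how the components would be trained, since the abstract does not describe either. The point is only that both models read and write the same message space, so neither needs to know the other's hidden size.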

Comments:	19 pages, 3 figures
Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2310.03188 [cs.AI]
	(or arXiv:2310.03188v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2310.03188

Submission history

From: Zhe Zhao
[v1] Wed, 4 Oct 2023 22:22:21 UTC (284 KB)
