arXiv:2305.11000 (cs)

[Submitted on 18 May 2023 (v1), last revised 19 May 2023 (this version, v2)]

Title: SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities

Authors: Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, Xipeng Qiu

Abstract: Multi-modal large language models are regarded as a crucial step towards Artificial General Intelligence (AGI) and have garnered significant interest with the emergence of ChatGPT. However, current speech-language models typically adopt the cascade paradigm, preventing inter-modal knowledge transfer. In this paper, we propose SpeechGPT, a large language model with intrinsic cross-modal conversational abilities, capable of perceiving and generating multi-modal content. With discrete speech representations, we first construct SpeechInstruct, a large-scale cross-modal speech instruction dataset. Additionally, we employ a three-stage training strategy that includes modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning. The experimental results demonstrate that SpeechGPT has an impressive capacity to follow multi-modal human instructions and highlight the potential of handling multiple modalities with one model. Demos are shown in this https URL.

Comments:	work in progress
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2305.11000 [cs.CL]
	(or arXiv:2305.11000v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2305.11000

Submission history

From: Dong Zhang
[v1] Thu, 18 May 2023 14:23:25 UTC (322 KB)
[v2] Fri, 19 May 2023 14:41:16 UTC (323 KB)
