Computer Science > Computation and Language

arXiv:2006.11477 (cs)

[Submitted on 20 Jun 2020 (v1), last revised 22 Oct 2020 (this version, v3)]

Title: wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli

Abstract: We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec 2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech recognition with limited amounts of labeled data.
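The contrastive task mentioned in the abstract can be illustrated with a small sketch. The abstract does not spell out the objective, so the following is only one plausible reading: for a single masked time step, the model must pick the true quantized latent out of a set of distractors, scored with an InfoNCE-style softmax. The use of cosine similarity, the `temperature` default of 0.1, and the names `contrastive_loss`, `context`, `target`, and `distractors` are all assumptions made for illustration, not details taken from this page.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """InfoNCE-style loss for one masked time step (illustrative sketch).

    context     -- model output at the masked position
    target      -- quantized latent that was masked at that position
    distractors -- quantized latents taken from other positions
    temperature -- softmax temperature (assumed value, not from the abstract)
    """
    # The true target sits at index 0, followed by the distractors.
    candidates = [target] + list(distractors)
    logits = [cosine_similarity(context, q) / temperature for q in candidates]

    # Numerically stable log-sum-exp over all candidates.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(z - max_logit) for z in logits)
    )

    # Negative log-probability assigned to the true target.
    return -(logits[0] - log_denominator)


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 8
    target = [rng.gauss(0, 1) for _ in range(dim)]
    distractors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(5)]

    # A context vector close to the target should score a lower loss
    # than an unrelated one in most draws.
    aligned = [t + 0.05 * rng.gauss(0, 1) for t in target]
    unrelated = [rng.gauss(0, 1) for _ in range(dim)]
    print("aligned context loss:  ", contrastive_loss(aligned, target, distractors))
    print("unrelated context loss:", contrastive_loss(unrelated, target, distractors))
```

In this sketch the loss approaches zero when the context vector points at the true target and away from every distractor, and it equals the log of the number of candidates when the model cannot tell them apart. A full training setup would average this over all masked positions and learn the quantizer jointly, which is outside the scope of this example.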

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2006.11477 [cs.CL]
	(or arXiv:2006.11477v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2006.11477

Submission history

From: Michael Auli
[v1] Sat, 20 Jun 2020 02:35:02 UTC (301 KB)
[v2] Tue, 22 Sep 2020 04:26:03 UTC (301 KB)
[v3] Thu, 22 Oct 2020 06:09:10 UTC (301 KB)
