MLS: A Large-Scale Multilingual Dataset For Speech Research

Abstract

This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research. The dataset is derived from read audiobooks from LibriVox and consists of 8 languages, including about 32K hours of English and a total of 4.5K hours for other languages. We provide baseline Automatic Speech Recognition (ASR) models and Language Models (LM) for all the languages in our dataset. We believe such a large transcribed dataset will open new avenues in ASR and Text-To-Speech (TTS) research. The dataset will be made freely available for anyone at [Link]

Index Terms: speech recognition, multilingual

Table 1: LibriVox audiobook data statistics for the top 15 languages; * - audiobooks with a mix of multiple languages

Language         Hours      Books  Speakers
English          71,506.79  12421  4214
German            3,287.48    593   244
Dutch             2,253.68    206    91
Spanish           1,438.41    285   120
French            1,333.35    224   114
Multilingual*       516.82    130    19
Portuguese          284.59     68    31
Italian             279.43     61    28
Russian             172.34     44    29
Latin               138.93     20    16
Polish              137.00     25    16
Church Slavonic     136.42      8     2
Hebrew              125.72     23    13
Japanese             97.67     38    24
Ancient Greek        69.77     43     8

1. Introduction

The success of LibriSpeech [1] as a standard, freely available Automatic Speech Recognition (ASR) benchmark is undeniable in the research community. LibriSpeech is English-only, and while benchmarks for other languages are available, they are often low-scale, scattered around different places, and rarely available under an open license. In this paper, we revisit the work which has been done with LibriSpeech, but in a multilingual manner and at a larger scale, introducing the Multilingual LibriSpeech (MLS) dataset. MLS includes 32K hours of English, and a total of 4.5K hours spread over 7 other languages. As with LibriSpeech, MLS is a read-speech dataset, which leverages LibriVox1 audiobook data, most of which is based on the Project Gutenberg2 text data. LibriVox and Project Gutenberg data are released in the public domain, which allows us to release MLS freely to everyone.

In Section 3, we detail how we created the dataset, by (i) training acoustic models on in-house data, (ii) generating pseudo-labels with these models, and (iii) retrieving the original transcript by matching pseudo-labels to available book transcripts. Section 4 details the statistics of MLS for its different languages. Section 5 introduces the language models we trained for each language. These language models are part of the MLS release. Section 6 covers some baseline ASR experiments.

2. Related Work

As with our work, LibriSpeech [1] is derived from the LibriVox data, and is distributed under an open license. It ships with about 1000 hours of labeled audio, obtained by leveraging alignments between textbooks and their read (audio) counterparts. In contrast to our work, it is only mono-lingual (English). A notable multi-lingual ASR dataset was built with the IARPA Babel Program [2]. It collected data for 24 languages, mostly from conversational telephone speech. The dataset is however not released under an open license, and focused on low-resource languages, with labeled data ranging between 25 and 65 hours per language. On the open license side, two important volunteer-supported multi-lingual speech gathering efforts are being conducted: (i) VoxForge [3], which collected data for about 15 different languages, but remains low-scale (about 300 hours in total); (ii) CommonVoice [4], a more scalable solution, with more than 30 languages available, which keeps growing, with 4500 (validated) hours currently available. Other notable multi-lingual datasets distributed under an open license are the M-AILABS [5] and the CMU Wilderness [6] datasets. M-AILABS is a lower-scale version of our work, with 9 languages collected from LibriVox, for a total of about 1000 hours available. The CMU Wilderness dataset collects readings from the New Testament, with 700 different languages available.

3. Data processing pipeline

This section describes the major steps involved in preparing the MLS dataset.

3.1. Downloading audiobooks

Table 1 shows the LibriVox audiobook data available for each language, which we measured using the LibriVox APIs3. While English is the most dominant language, we can see that a significant amount of audio hours is present in languages other than English, making this a valuable source for multilingual dataset preparation.

1 [Link]
2 [Link]
3 [Link]
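The per-language totals in Table 1 can be gathered by paging through the LibriVox audiobooks API and summing durations by language. A minimal sketch of that tally follows; the endpoint URL, query parameters, and the `language`/`totaltimesecs` field names are assumptions about the public API, not details from this paper.

```python
import json
from collections import defaultdict
from urllib.request import urlopen

# Assumed public endpoint; not specified in the paper.
API = "https://librivox.org/api/feed/audiobooks"

def fetch_page(offset, limit=50):
    """Fetch one page of audiobook records (assumed JSON schema)."""
    url = f"{API}?format=json&offset={offset}&limit={limit}"
    with urlopen(url) as resp:
        return json.load(resp).get("books", [])

def hours_per_language(records):
    """Sum audiobook durations (given in seconds) into hours per language."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["language"]] += rec.get("totaltimesecs", 0) / 3600.0
    return dict(totals)
```

For example, two German books of 7200 s and 1800 s aggregate to `{"German": 2.5}` hours; running this over all pages would reproduce a table like Table 1.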
Table 2: Statistics of the Train/Dev/Test partitions for each language. For each partition we list the total duration in hours (left), the number of speakers of each gender (middle), and the shortest and longest per-speaker durations in minutes in the dev and test sets (right).
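The transcript-retrieval step outlined in Section 3 — locating the span of a book's text that best matches an utterance's pseudo-label — amounts to a local sequence alignment in the spirit of Smith-Waterman [14]. A rough sketch using Python's difflib as a stand-in for the actual matcher; the function and its return convention are illustrative, not the paper's pipeline code.

```python
import difflib

def best_matching_span(pseudo_label, book_text):
    """Return (start, end, coverage) such that book_text.split()[start:end]
    is the window of the book best matching the pseudo-label words, and
    coverage is the fraction of pseudo-label words that were matched."""
    hyp = pseudo_label.lower().split()
    ref = book_text.lower().split()
    sm = difflib.SequenceMatcher(a=hyp, b=ref, autojunk=False)
    blocks = [b for b in sm.get_matching_blocks() if b.size > 0]
    if not blocks:
        return 0, 0, 0.0
    start = blocks[0].b                      # first matched book position
    end = blocks[-1].b + blocks[-1].size     # one past the last match
    matched = sum(b.size for b in blocks)
    return start, end, matched / max(len(hyp), 1)
```

A high coverage score indicates the pseudo-label was found in that book, recovering the original human transcript for the audio segment.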
Figure 1: Violin plots of audio segment durations in the training data for different languages

5. Language models

We have trained language models (LM) for all the languages in our dataset. Those LMs are 5-gram models trained on the training transcriptions using the KenLM toolkit [16]. The number of words of each LM and their perplexities (excluding out-of-vocabulary words) on the development-set transcriptions are listed in Table 3. We also release those models, to be potentially used as standard benchmarks when comparing only acoustic models.
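Excluding out-of-vocabulary words from perplexity means OOV tokens simply contribute no term to the average log-likelihood. A toy illustration of that convention with a unigram maximum-likelihood model (the released LMs are 5-gram KenLM models; this sketch only makes the metric concrete):

```python
import math
from collections import Counter

def unigram_ppl_excluding_oov(train_tokens, eval_tokens):
    """Perplexity of a unigram MLE model, skipping OOV eval tokens."""
    counts = Counter(train_tokens)
    total = sum(counts.values())
    log_sum, n = 0.0, 0
    for tok in eval_tokens:
        if tok in counts:                     # OOV tokens are excluded
            log_sum += math.log(counts[tok] / total)
            n += 1
    return math.exp(-log_sum / n) if n else float("inf")
```

For instance, with training tokens `a a b b` and evaluation tokens `a b zzz`, the OOV token `zzz` is skipped and the perplexity is exactly 2.0.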
2759
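When these LMs are combined with an acoustic model in a beam-search decoder (as in the baselines of Section 6), hypotheses are typically ranked by a shallow-fusion score: acoustic log-probability plus a weighted LM log-probability plus a word-insertion term. A generic sketch follows; `alpha` and `beta` are the usual tunable decoder hyper-parameters, and the exact wav2letter++ criterion is not reproduced here.

```python
def fused_score(am_logprob, lm_logprob, n_words, alpha=0.5, beta=1.0):
    """Shallow-fusion decoding score: acoustic log-prob plus weighted
    LM log-prob plus a word-insertion bonus; beam hypotheses are
    ranked by this quantity."""
    return am_logprob + alpha * lm_logprob + beta * n_words
```

Tuning `alpha` and `beta` on a validation set, as described in Section 6, trades off LM influence against hypothesis length.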
and residual connections in between. Specifically, the model has 4 groups of TDS blocks, with a 1-D convolution at the beginning of each group as a transition. Similarly, the first 3 convolutions have stride 2 so as to reach the same sub-sampling (striding) rate of 8, thus 80ms. There are 2, 2, 5, and 8 TDS blocks in each group, containing 16, 16, 24, and 32 channels, respectively. Following [19], we also apply a channel-increasing factor F = 2 in each TDS block.

For the other languages, we experimented with three model architectures of varying capacities, all of which were based on the TDS architecture with CTC loss [8]. The capacity of the models is adjusted by changing the number of TDS blocks. The smallest model architecture (60M parameters) contains two 10-channel, three 14-channel and five 18-channel TDS blocks. In the 100M parameter architecture, we use convolutions of the same width but increase the number of blocks to three, four and eight, respectively. Finally, in the largest architecture (200M parameters), we increase the numbers to five, six, and ten, respectively. We used dropout and spectral augmentation [18] to regularize the models, and we tuned dropout extensively for each language.

We have also experimented with different token sets: graphemes or sub-word tokens generated using the SentencePiece toolkit [21]. For the lowest-resource languages (Italian, Portuguese and Polish), we obtained the best results with a 60M parameter model. For these languages, we experimented with graphemes, 300 sentence pieces and 500 sentence pieces as the token set, and found graphemes to work best for Italian and Polish, and 300 sentence pieces to work best for Portuguese. For the other languages, we only experimented with 5,000 and 10,000 sentence pieces, and found 10,000 sentence pieces to work best. We obtained our best results with a 200M parameter model for all languages except Dutch, for which the 100M parameter model outperformed the others.

Finally, we use beam-search decoding in wav2letter++ to integrate external 5-gram language models, trained on the training text, together with the AMs. The decoder hyper-parameters are tuned on the validation set.

Table 4 shows the best results obtained for each language on the (averaged) Dev and Test sets, with and without a language model. There is a lot of room for improvement upon those baselines, which vary between 11% and 60% WER depending on the language.

Table 4: Baseline WER for different languages.

                No LM           5-gram LM
Language      Dev     Test    Dev     Test
English      14.45   15.18   11.90   12.64
German       16.53   17.84   14.37   15.57
Dutch        27.95   30.53   24.28   28.87
Spanish      12.96   12.46   11.40   11.07
French       18.63   19.66   16.58   18.08
Italian      32.77   36.70   24.54   28.19
Portuguese   42.47   44.45   36.88   39.55
Polish       46.92   67.23   43.61   60.32

7. Conclusions

We have presented the Multilingual LibriSpeech dataset, a large-scale multilingual speech dataset with 36.5K hours of training data spread over 8 languages. We believe this dataset will promote open research in large-scale training of ASR systems and in multilingual ASR. This dataset can also be used for Text-to-Speech (TTS) research by extending the LibriTTS [22] dataset, and by creating a larger and multilingual version for TTS research.

8. Acknowledgements

We would like to thank Steven Garan for help in data preparation and text normalization.

9. References

[1] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210.

[2] M. P. Harper, "The IARPA Babel program," [Link]/research-programs/babel.

[3] [Link], "Free speech... recognition (Linux, Windows and Mac) - [Link]," [Link], accessed 06/25/2014.

[4] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, "Common Voice: A massively-multilingual speech corpus," 2019.

[5] I. Solak, "The M-AILABS Speech Dataset," [Link]de/2019/01/the-m-ailabs-speech-dataset/, 2019, [Online; accessed 19-April-2020].

[6] A. W. Black, "CMU Wilderness multilingual speech dataset," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5971–5975.

[7] R. Collobert, C. Puhrsch, and G. Synnaeve, "Wav2letter: An end-to-end convnet-based speech recognition system," 2016.

[8] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 369–376.

[9] V. Liptchinsky, G. Synnaeve, and R. Collobert, "Letter-based speech recognition with gated convnets," 2017.

[10] V. Pratap, Q. Xu, J. Kahn, G. Avidov, T. Likhomanenko, A. Hannun, V. Liptchinsky, G. Synnaeve, and R. Collobert, "Scaling up online speech recognition using convnets," 2020.

[11] G. Synnaeve, Q. Xu, J. Kahn, T. Likhomanenko, E. Grave, V. Pratap, A. Sriram, V. Liptchinsky, and R. Collobert, "End-to-end ASR: From supervised to semi-supervised learning with modern architectures," 2019.

[12] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210.

[13] V. Manohar, D. Povey, and S. Khudanpur, "JHU Kaldi system for Arabic MGB-3 ASR challenge using diarization, audio-transcript alignment and transfer learning," in 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2017, pp. 346–352.

[14] T. Smith and M. Waterman, "Identification of common molecular subsequences," Journal of Molecular Biology, vol. 147, no. 1, pp. 195–197, 1981. [Online]. Available: [Link]/science/article/pii/0022283681900875

[15] B. E. Boser, I. M. Guyon, and V. N. Vapnik, "A training algorithm for optimal margin classifiers," in Proceedings of the Fifth Annual Workshop on Computational Learning Theory, ser. COLT '92. New York, NY, USA: Association for Computing Machinery, 1992, pp. 144–152. [Online]. Available: [Link]
[16] K. Heafield, "KenLM: Faster and smaller language model queries," in Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2011, pp. 187–197.
[17] V. Pratap, A. Hannun, Q. Xu, J. Cai, J. Kahn, G. Synnaeve,
V. Liptchinsky, and R. Collobert, “Wav2letter++: A fast open-
source speech recognition system,” in ICASSP 2019 - 2019 IEEE
International Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP), 2019, pp. 6460–6464.
[18] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," Interspeech 2019, Sep 2019. [Online]. Available: [Link]org/10.21437/interspeech.2019-2680
[19] G. Synnaeve, Q. Xu, J. Kahn, E. Grave, T. Likhomanenko,
V. Pratap, A. Sriram, V. Liptchinsky, and R. Collobert, “End-to-
end asr: from supervised to semi-supervised learning with modern
architectures,” arXiv preprint arXiv:1911.08460, 2019.
[20] A. Hannun, A. Lee, Q. Xu, and R. Collobert, “Sequence-to-
sequence speech recognition with time-depth separable convolu-
tions,” in INTERSPEECH, 2019.
[21] T. Kudo, “Subword regularization: Improving neural network
translation models with multiple subword candidates,” 2018.
[22] H. Zen, R. Clark, R. J. Weiss, V. Dang, Y. Jia, Y. Wu, Y. Zhang,
and Z. Chen, “Libritts: A corpus derived from librispeech
for text-to-speech,” in Interspeech, 2019. [Online]. Available:
[Link]