Abstract
This paper describes the design, collection, transcription and analysis of the HKUST Mandarin Telephone Speech Corpus (HKUST/MTS), 200 hours of speech from over 2100 Mandarin speakers in mainland China, collected under the DARPA EARS framework. The corpus includes speech data, transcriptions and speaker demographic information. The speech data comprise 1206 ten-minute natural Mandarin conversations between either strangers or friends, each focused on a single topic. All calls were recorded over public telephone networks and manually annotated with standard Chinese characters (GBK) as well as specific mark-ups for spontaneous speech. A file with speaker demographic information is also provided. The corpus is the largest and first of its kind for Mandarin conversational telephone speech, providing abundant and diverse samples for Mandarin speech recognition and other application-dependent tasks such as topic detection, information retrieval, keyword spotting and speaker recognition. In a 2004 evaluation by NIST, the corpus was found to improve system performance significantly.
HKUST Mandarin Telephone Transcripts, Part 1, Linguistic Data Consortium (LDC) catalog number LDC2005T32, ISBN 1-58563-352-6, http://www.ldc.upenn.edu/
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liu, Y., Fung, P., Yang, Y., Cieri, C., Huang, S., Graff, D. (2006). HKUST/MTS: A Very Large Scale Mandarin Telephone Speech Corpus. In: Huo, Q., Ma, B., Chng, ES., Li, H. (eds) Chinese Spoken Language Processing. ISCSLP 2006. Lecture Notes in Computer Science, vol 4274. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11939993_73
DOI: https://doi.org/10.1007/11939993_73
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-49665-6
Online ISBN: 978-3-540-49666-3
eBook Packages: Computer Science (R0)