Abstract
This paper describes a new Korean Text-to-Speech (TTS) system based on a large speech corpus. Conventional concatenative TTS systems still produce machine-like synthetic speech; the poor naturalness is caused by excessive prosodic modification applied to a small speech database. To cope with this problem, we utilized a dynamic unit selection method based on a large speech database that requires no prosodic modification. The proposed TTS system adopts triphones as synthesis units. We designed a new sentence set that maximizes the phonetic and prosodic coverage of Korean triphones. All utterances were segmented automatically into phonemes using a speech recognizer. With the segmented phonemes, a synthesis unit cost of zero was assigned whenever two synthesis units appeared consecutively in an utterance, which reduces the number of concatenation points at which mismatches may occur. In this paper, we show that major prosodic variations can be realized by taking prosodic phrase break strength into account. The phrase break was divided into four levels of strength based on pause length, and triphones were further classified by break strength to reflect major prosodic variations. To predict phrase break strength from text, we adopted an HMM-like Part-of-Speech (POS) sequence model, which achieved 73.5% accuracy for 4-level break strength prediction. For unit selection, a Viterbi beam search was performed to find the most appropriate triphone sequence, i.e., the one with the minimum concatenation cost of prosody and spectrum at unit boundaries. An informal listening test showed that the proposed corpus-based Korean TTS system is more natural than the conventional demisyllable-based one.
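To make the unit-selection step concrete, the following Python sketch illustrates the idea of a search over candidate triphone units that minimizes prosodic and spectral concatenation costs, with a zero cost when two candidates were recorded consecutively in the corpus. It is a minimal illustration under assumed names and features (Unit, concat_cost, select_units, a scalar f0 and spectrum value per boundary), not the authors' implementation; it also uses a full Viterbi search rather than the beam-pruned search described in the abstract.

# Minimal sketch of corpus-based unit selection (illustrative, not the paper's code).
# candidates[t] holds the corpus units that match the t-th target triphone.
from dataclasses import dataclass
from typing import List

@dataclass
class Unit:
    utterance_id: int   # corpus utterance the unit came from
    position: int       # index of the unit within that utterance
    f0: float           # boundary pitch, stands in for prosodic features
    spectrum: float     # boundary spectral value, stands in for a cepstral vector

def concat_cost(prev: Unit, curr: Unit,
                w_prosody: float = 1.0, w_spectrum: float = 1.0) -> float:
    """Concatenation cost between two units; zero if they were consecutive in the corpus."""
    if prev.utterance_id == curr.utterance_id and curr.position == prev.position + 1:
        return 0.0
    return (w_prosody * abs(prev.f0 - curr.f0)
            + w_spectrum * abs(prev.spectrum - curr.spectrum))

def select_units(candidates: List[List[Unit]]) -> List[Unit]:
    """Viterbi search for the unit sequence with minimum total concatenation cost."""
    # best[t][j] = (cumulative cost of the best path ending in candidates[t][j], backpointer)
    best = [[(0.0, -1) for _ in candidates[0]]]
    for t in range(1, len(candidates)):
        row = []
        for curr in candidates[t]:
            costs = [best[t - 1][i][0] + concat_cost(prev, curr)
                     for i, prev in enumerate(candidates[t - 1])]
            i_best = min(range(len(costs)), key=costs.__getitem__)
            row.append((costs[i_best], i_best))
        best.append(row)
    # Backtrace from the cheapest final unit.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for t in range(len(candidates) - 1, -1, -1):
        path.append(candidates[t][j])
        j = best[t][j][1]
    return list(reversed(path))

A beam version would simply prune each row of hypotheses to its lowest-cost entries before advancing to the next target triphone; target costs (e.g., matching the predicted break strength) and tuned feature weights are omitted here for brevity.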