EP1668628A1 - Method for synthesizing speech - Google Patents
Method for synthesizing speech
- Publication number
- EP1668628A1 (application EP04784355A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- match
- speech
- pitch
- speech segment
- segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
- G10L13/07—Concatenation rules
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
Definitions
- the present invention relates generally to Text-To-Speech (TTS) synthesis.
- TTS Text-To-Speech
- the invention is particularly useful for, but not necessarily limited to, determining an appropriate synthesized pronunciation of a text segment using a non-exhaustive utterance corpus.
- TTS Text to Speech
- concatenated text to speech synthesis allows electronic devices to receive an input text string and provide a converted representation of the string in the form of synthesized speech.
- a device that may be required to synthesize speech originating from a non-deterministic number of received text strings will have difficulty in providing high quality realistic synthesized speech. That is because the pronunciation of each word or syllable (for Chinese characters and the like) to be synthesized is context and location dependent. For example, a pronunciation of a word at the beginning of a sentence
- (input text string) may be drawn out or lengthened.
- the pronunciation of the same word may be lengthened even more if it occurs in the middle of a sentence where emphasis is required.
- the pronunciation of a word depends on at least tone (pitch), volume and duration.
- many languages include numerous possible pronunciations of individual syllables.
- a single syllable represented by a Chinese character (or other similar character based script) may have up to 6 different pronunciations.
- a large pre-recorded utterance waveform corpus of sentences is required. This corpus typically requires on average about 500 variations of each pronunciation if realistic speech synthesis is to be achieved.
- an utterance waveform corpus of all pronunciations for every character would be prohibitively large.
- the size of the utterance waveform corpus may be particularly limited when it is embedded in a small electronic device having a low memory capacity such as a radio telephone or a personal digital assistant.
- the algorithms used to compare the input text strings with the audio database also need to be efficient and fast so that the resulting synthesized and concatenated speech flows naturally and smoothly. Due to memory and processing speed limitations, existing TTS methods for embedded applications often result in speech that is unnatural or robotic sounding. There is therefore a need for an improved method for performing TTS to provide a natural sounding synthesized speech whilst using a non-exhaustive utterance corpus.
- the present invention is a method of performing speech synthesis that includes comparing an input text segment with an utterance waveform corpus that contains numerous speech samples. The method determines whether there is a contextual best match between the text segment and one speech sample included in the utterance waveform corpus. If there is not a contextual best match, the method determines whether there is a contextual phonetic hybrid match between the text segment and a speech sample included in the utterance waveform corpus.
- a contextual phonetic hybrid match requires a match of all implicit prosodic features in a defined prosodic feature group.
- the prosodic feature group is redefined by deleting one of the implicit prosodic features from the group.
- the prosodic feature group is successively redefined by deleting one implicit prosodic feature from the group until a match is found between the input text segment and a speech sample. When a match is found, the matched speech sample is used to generate concatenative speech.
- Fig. 1 is a block diagram of an electronic device upon which the invention may be implemented
- Fig. 2 is a flow chart illustrating a specific embodiment of the present invention used to generate concatenative speech in the Chinese language
- Fig. 3 is a flowchart illustrating the process of determining whether a contextual phonetic hybrid match exists by successively relaxing the constraints used to define a match.
- Fig. 1 there is illustrated a block diagram of an electronic device 10 upon which the invention may be implemented.
- the device 10 includes a processor 30 operatively coupled, by a common bus 15, to a text memory module 20, a Read only Memory (ROM) 40, a Random Access Memory (RAM) 50 and a waveform corpus 60.
- the processor 30 is also operatively coupled to a touch screen display 90 and an input of a speech synthesizer 70.
- An output of the speech synthesizer 70 is operatively coupled to a speaker 80.
- the text memory module is a store for storing text obtained by any receiving means possible such as by radio reception, internet, or plug in portable memory cards etc.
- the ROM stores operating code for performing the invention as described in Figures 2 and 3.
- the corpus 60 is essentially a conventional corpus, as are the speech synthesizer 70 and the speaker 80, and the touch screen display 90 is a user interface that allows for display of text stored in the text memory module 20.
- Fig. 2 is a flow chart illustrating a specific embodiment of the present invention used to generate concatenative speech 110 from an input text segment 120 in the Chinese language.
- the text segment 120 is compared with an utterance waveform corpus 60, which includes a plurality of speech samples 140, to determine whether there is a contextual best match (step S110). If a contextual best match is found between a text segment 120 and a specific speech sample 140, that specific speech sample 140 is sent to a concatenating algorithm 150 for generating the concatenative speech 110. If no contextual best match is found between the text segment 120 and a specific speech sample 140, then the text segment 120 is compared again with the utterance waveform corpus 130 to determine whether there is a contextual phonetic hybrid match (step S120).
- Fig. 3 is a flowchart illustrating the process of determining whether a contextual phonetic hybrid match exists by successively relaxing the constraints used to define a match.
- a contextual phonetic hybrid match requires a match between a text segment 120 and all of the implicit prosodic features 210 included in a defined prosodic feature group 220. If no match is found, one of the implicit prosodic features 210 is deleted from the defined prosodic feature group 220 and the group 220 is redefined as including all of the previously included features 210 less the deleted feature 210 (e.g., Step S130). The redefined prosodic feature group 220 is then compared with the text segment 120 to determine whether there is a match. The process of deleting an implicit prosodic feature 210, redefining the prosodic feature group 220, and then redetermining whether there is a contextual phonetic hybrid match, continues until a match is found (Steps S130, S140, etc., to S170).
- the matched speech sample 140 which matches the text segment 120, is sent to the concatenating algorithm 150 for generating concatenative speech 110.
- a basic phonetic match is performed matching only pinyin (Step S180).
- the utterance waveform corpus 60 is designed so that there is always at least one syllable included with the correct pinyin to match all possible input text segments 120. That basic phonetic match is then input into the concatenating algorithm 150.
- the invention is thus a multi-layer, data-driven method for controlling the prosody (rhythm and intonation) of the resulting synthesized, concatenative speech 110.
- each layer of the method includes a redefined prosodic feature group 220.
- a text segment 120 means any type of input text string or segment of coded language. It should not be limited to only visible text that is scanned or otherwise entered into a TTS system.
- the utterance waveform corpus 130 of the present invention is annotated with information concerning each speech sample 140 (usually a word) that is included in the corpus 130.
- the speech samples 140 themselves are generally recordings of actual human speech, usually digitized or analog waveforms. Annotations are thus required to identify the samples 140.
- Such annotations may include the specific letters or characters (depending on the language) that define the sample 140 as well as the implicit prosodic features 210 of the speech sample 140.
- the implicit prosodic features 210 include context information concerning how the speech sample 140 is used in a sentence.
- a speech sample 140 in the Chinese language may include the following implicit prosodic features 210: Text context: the Chinese characters immediately preceding and immediately following the annotated text of a speech sample 140.
- Pinyin: the phonetic representation of a speech sample 140. Pinyin is a standard romanization of the Chinese language using the Western alphabet.
- Tone context: the tone context of the Chinese characters immediately preceding and immediately following the annotated text of a speech sample 140.
- Co-articulation: the phonetic level representatives that immediately precede and immediately follow the annotated text of a speech sample 140, such as phonemes or sub-syllables.
- Syllable position: the position of a syllable in a prosodic phrase.
- Phrase position: the position of a prosodic phrase in a sentence. Usually the phrase position is identified as one of the three positions of sentence initial, sentence medial and sentence final.
- Character symbol: the code (e.g., ASCII code) representing the Chinese character that defines a speech sample 140.
- Length of phrase: the number of Chinese characters included in a prosodic phrase.
- each character's sound could represent a speech sample 140 and could be annotated with the above implicit prosodic features 210.
- the character 国 (pinyin guo2), as found in a recorded sentence beginning with the prosodic phrase 中国, could be annotated as follows: Text context: the character 中 immediately preceding 国 and the character immediately following it; Pinyin: guo2; Tone context: 1, 3; Co-articulation: ong, h; Syllable position: 2; Phrase position: 1; Character symbol: the character code for 国; and Length of phrase: 2.
- step S110 determines whether there is a contextual best match between a text segment 120 and a speech sample 140.
- a contextual best match is generally defined as the closest, or an exact, match of both 1 ) the letters or characters (depending on the language) of an input text segment 120 with the corresponding letters or characters of an annotated speech sample 140, and 2) the implicit prosodic features 210 of the input text segment 120 with the implicit prosodic features 210 of the annotated speech sample 140.
- a best match is determined by identifying the greatest number of consecutive syllables in the input text segment that are identical to attributes and attribute positions in each of the waveform utterances (speech samples) in the waveform corpus 60.
- only when both the letters or characters and the implicit prosodic features 210 match exactly is a speech sample 140 selected immediately as an element for use in the concatenating algorithm 150.
- the method of the present invention determines whether there is a contextual phonetic hybrid match between an input text segment 120 and a speech sample 140.
- a contextual phonetic hybrid match requires a match between a text segment 120 and all of the implicit prosodic features 210 included in a defined prosodic feature group 220. As shown in Fig.
- one embodiment of the present invention used to synthesize speech in the Chinese language uses a first defined prosodic feature group 220 that includes the implicit prosodic features 210 of pinyin, tone context, co-articulation, syllable position, phrase position, character symbol, and length of phrase (Step S120). If none of the annotated speech samples 140 found in the utterance waveform corpus 130 have identical values for each of the above features 210 as found in the input text segment 120, then the corpus 130 does not contain a speech sample 140 that is close enough to the input text segment 120 based on the matching rules as applied in Step S120. Therefore the constraints of the matching rules must be relaxed and thus broadened to include other speech samples 140 that possess the next most preferable features 210 found in the input text segment 120.
- the matching rules are broadened by deleting the one feature 210 found in the defined prosodic feature group 220 that is least likely to affect the natural prosody of the input text segment 120.
- the next most preferable features 210 found in the illustrated embodiment of the present invention include all of the features 210 defined above less the length of phrase feature 210.
- the order in which the implicit prosodic features 210 are deleted from the defined prosodic feature group 220 is determined empirically. When the features 210 are deleted in a proper order, the method of the present invention results in efficient and fast speech synthesis. The output speech therefore sounds more natural even though the utterance waveform corpus 130 may be relatively limited in size.
- the variable BestPitch may be determined based on a statistical analysis of the utterance waveform corpus 130.
- a corpus 130 may include five tones, each having an average pitch.
- Each annotated speech sample 140 in the corpus 130 may also include individual prosody information represented by the values of pitch, duration and energy. So the average values of pitch, duration and energy of the entire corpus 130 are available.
- the best pitch for a particular context may then be determined using the following formula: BestPitch = pitch_tone - nIndex × empiricalValue (Eq. 2)
- pitch_tone: the average pitch, including tone, of the utterance waveform corpus
- nIndex: the index of the text segment 120 in a prosody phrase
- empiricalValue: an empirical value based on the utterance waveform corpus.
- the empirical value of 4 is used in one particular embodiment of the present invention that synthesizes the Chinese language; however this number could vary depending on the content of a particular utterance waveform corpus 130.
- the invention is suitable for many languages.
- the implicit prosodic features 210 would need to be deleted or redefined from the examples given hereinabove.
- the feature 210 identified above as tone context would be deleted in an application of the present invention for the English language because English is not a tonal language.
- the feature 210 identified above as pinyin would likely be redefined as simply a phonetic symbol when the present invention is applied to English.
- the present invention is therefore a multi-layer, data-driven prosodic control scheme that utilizes the implicit prosodic information in an utterance waveform corpus 130.
- when searching for an appropriate speech sample 140 to match with a given input text segment 120, the method of the present invention employs a strategy based on multi-layer matching, where each layer is tried in turn until a sufficiently good match is found. By successively relaxing the constraints of each layer, the method efficiently determines whether the utterance waveform corpus 130 contains a match.
- the method is therefore particularly appropriate for embedded TTS systems where the size of the utterance waveform corpus 130 and the processing power of the system may be limited.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
A method of performing speech synthesis that includes comparing a text segment (120) with an utterance waveform corpus (60) that contains numerous speech samples (140). The method determines whether there is a contextual best match between the text segment (120) and one speech sample (140). If there is not a contextual best match, the method determines whether there is a contextual phonetic hybrid match between the text segment (120) and a speech sample (140). A contextual phonetic hybrid match requires a match of all implicit prosodic features (210) in a defined prosodic feature group (220). If a match is still not found, the prosodic feature group (220) is redefined by deleting one of the implicit prosodic features (210) from the prosodic feature group (220). The prosodic feature group (220) is successively redefined by deleting one implicit prosodic feature (210) from the group (220) until a match is found between the input text segment (120) and a speech sample (140). When a match is found, the matched speech sample (140) is used to generate concatenative speech (110).
Description
METHOD FOR SYNTHESIZING SPEECH
FIELD OF THE INVENTION The present invention relates generally to Text-To-Speech (TTS) synthesis. The invention is particularly useful for, but not necessarily limited to, determining an appropriate synthesized pronunciation of a text segment using a non-exhaustive utterance corpus.
BACKGROUND OF THE INVENTION Text to Speech (TTS) conversion, often referred to as concatenated text to speech synthesis, allows electronic devices to receive an input text string and provide a converted representation of the string in the form of synthesized speech. However, a device that may be required to synthesize speech originating from a non-deterministic number of received text strings will have difficulty in providing high quality realistic synthesized speech. That is because the pronunciation of each word or syllable (for Chinese characters and the like) to be synthesized is context and location dependent. For example, a pronunciation of a word at the beginning of a sentence
(input text string) may be drawn out or lengthened. The pronunciation of the same word may be lengthened even more if it occurs in the middle of a sentence where emphasis is required. In most languages the pronunciation of a word depends on at least tone (pitch), volume and duration. Furthermore many languages include numerous possible pronunciations of individual syllables. Typically a single syllable represented by a Chinese character (or other similar character based script) may have up to 6 different pronunciations. Furthermore, in order to provide a realistic synthesized utterance of each pronunciation, a large pre-recorded utterance waveform corpus of sentences is required. This corpus typically requires on average about 500 variations of each pronunciation if realistic speech synthesis is to be achieved. Thus an
utterance waveform corpus of all pronunciations for every character would be prohibitively large. In most TTS systems there is a need to determine the appropriate pronunciation of an input text string based on comparisons with a limited size utterance waveform corpus. The size of the utterance waveform corpus may be particularly limited when it is embedded in a small electronic device having a low memory capacity such as a radio telephone or a personal digital assistant. The algorithms used to compare the input text strings with the audio database also need to be efficient and fast so that the resulting synthesized and concatenated speech flows naturally and smoothly. Due to memory and processing speed limitations, existing TTS methods for embedded applications often result in speech that is unnatural or robotic sounding. There is therefore a need for an improved method for performing TTS to provide natural sounding synthesized speech whilst using a non-exhaustive utterance corpus.
SUMMARY OF THE INVENTION The present invention is a method of performing speech synthesis that includes comparing an input text segment with an utterance waveform corpus that contains numerous speech samples. The method determines whether there is a contextual best match between the text segment and one speech sample included in the utterance waveform corpus. If there is not a contextual best match, the method determines whether there is a contextual phonetic hybrid match between the text segment and a speech sample included in the utterance waveform corpus. A contextual phonetic hybrid match requires a match of all implicit prosodic features in a defined prosodic feature group. If a match is still not found, the prosodic feature group is redefined by deleting one of the implicit prosodic features from the prosodic feature group. The prosodic feature group is successively redefined by deleting one implicit prosodic feature from the group until a match is found between the input text segment and a speech sample. When a match is found, the matched speech sample
is used to generate concatenative speech.
BRIEF DESCRIPTION OF THE DRAWINGS Other aspects of the present invention will become apparent from the following detailed description taken together with the drawings, wherein like reference characters designate like or corresponding elements or steps throughout the drawings, in which: Fig. 1 is a block diagram of an electronic device upon which the invention may be implemented; Fig. 2 is a flow chart illustrating a specific embodiment of the present invention used to generate concatenative speech in the Chinese language; and Fig. 3 is a flowchart illustrating the process of determining whether a contextual phonetic hybrid match exists by successively relaxing the constraints used to define a match.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT OF THE INVENTION Referring to Fig. 1 there is illustrated a block diagram of an electronic device 10 upon which the invention may be implemented. The device 10 includes a processor 30 operatively coupled, by a common bus 15, to a text memory module 20, a Read Only Memory (ROM) 40, a Random Access Memory (RAM) 50 and a waveform corpus 60. The processor 30 is also operatively coupled to a touch screen display 90 and an input of a speech synthesizer 70. An output of the speech synthesizer 70 is operatively coupled to a speaker 80. As will be apparent to a person skilled in the art, the text memory module is a store for storing text obtained by any receiving means possible such as by radio reception, internet, or plug in portable memory cards etc. The ROM stores operating code for performing the invention as described in Figures 2 and 3. Also the corpus 60 is essentially a conventional corpus, as are the speech synthesizer 70 and
speaker 80, and the touch screen display 90 is a user interface that allows for display of text stored in the text memory module 20. Fig. 2 is a flow chart illustrating a specific embodiment of the present invention used to generate concatenative speech 110 from an input text segment 120 in the Chinese language. The text segment 120 is compared with an utterance waveform corpus 60, which includes a plurality of speech samples 140, to determine whether there is a contextual best match (step S110). If a contextual best match is found between a text segment 120 and a specific speech sample 140, that specific speech sample 140 is sent to a concatenating algorithm 150 for generating the concatenative speech 110. If no contextual best match is found between the text segment 120 and a specific speech sample 140, then the text segment 120 is compared again with the utterance waveform corpus 130 to determine whether there is a contextual phonetic hybrid match (step S120). Fig. 3 is a flowchart illustrating the process of determining whether a contextual phonetic hybrid match exists by successively relaxing the constraints used to define a match. A contextual phonetic hybrid match requires a match between a text segment 120 and all of the implicit prosodic features 210 included in a defined prosodic feature group 220. If no match is found, one of the implicit prosodic features 210 is deleted from the defined prosodic feature group 220 and the group 220 is redefined as including all of the previously included features 210 less the deleted feature 210 (e.g., Step S130). The redefined prosodic feature group 220 is then compared with the text segment 120 to determine whether there is a match. The process of deleting an implicit prosodic feature 210, redefining the prosodic feature group 220, and then redetermining whether there is a contextual phonetic hybrid match, continues until a match is found (Steps S130, S140, etc. to S170). When a contextual phonetic hybrid match is found, the matched speech sample 140, which matches the text segment 120, is sent to the concatenating algorithm 150 for generating concatenative speech 110.
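The layered search of Figs. 2 and 3 can be pictured as a short loop over progressively smaller prosodic feature groups. The following is a minimal illustrative sketch only, assuming that corpus entries and the target syllable are plain dictionaries keyed by the annotation features described further below; the helper and constant names are hypothetical, and the deletion order beyond "length of phrase" (which the text deletes first) is assumed.

```python
# Illustrative sketch of the hybrid-match layers (Steps S120 to S180).
# Feature names mirror the annotations described below; the relaxation order
# is an assumption except for length_of_phrase, which the text deletes first.

ALL_FEATURES = ["pinyin", "tone_context", "co_articulation", "syllable_position",
                "phrase_position", "character_symbol", "length_of_phrase"]

RELAXATION_ORDER = ["length_of_phrase", "character_symbol", "phrase_position",
                    "syllable_position", "co_articulation", "tone_context"]

def hybrid_matches(target, corpus, feature_group):
    """Corpus entries whose annotations equal the target on every feature in the group."""
    return [entry for entry in corpus
            if all(entry.get(f) == target.get(f) for f in feature_group)]

def select_sample(target, corpus):
    """Try the full prosodic feature group, then relax it one feature at a time."""
    feature_group = list(ALL_FEATURES)
    candidates = hybrid_matches(target, corpus, feature_group)   # Step S120
    for feature in RELAXATION_ORDER:
        if candidates:
            # With several candidates, Eq. 1 (below) would pick the optimal one;
            # the first candidate stands in for that choice here.
            return candidates[0]
        feature_group.remove(feature)                            # Steps S130 to S170
        candidates = hybrid_matches(target, corpus, feature_group)
    # feature_group is now ["pinyin"] only: the basic phonetic match (Step S180),
    # which the corpus is said to be designed never to fail.
    return candidates[0] if candidates else None
```

A real embedded implementation would index the corpus by pinyin rather than scan it linearly, and would fold the Step S110 contextual best match in as a preliminary layer, but the control flow would be the same.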
As shown in Fig. 3, if all of the implicit prosodic features 210 except pinyin are successively deleted from the prosodic feature group 220 and still no match is found, then a basic phonetic match is performed matching only pinyin (Step S180). In one embodiment of the present invention the utterance waveform corpus 60 is designed so that there is always at least one syllable included with the correct pinyin to match all possible input text segments 120. That basic phonetic match is then input into the concatenating algorithm 150. The invention is thus a multi-layer, data-driven method for controlling the prosody (rhythm and intonation) of the resulting synthesized, concatenative speech 110. Each layer of the method includes a redefined prosodic feature group 220. For purposes of the present invention a text segment 120 means any type of input text string or segment of coded language. It should not be limited to only visible text that is scanned or otherwise entered into a TTS system. The utterance waveform corpus 130 of the present invention is annotated with information concerning each speech sample 140 (usually a word) that is included in the corpus 130. The speech samples 140 themselves are generally recordings of actual human speech, usually digitized or analog waveforms. Annotations are thus required to identify the samples 140. Such annotations may include the specific letters or characters (depending on the language) that define the sample 140 as well as the implicit prosodic features 210 of the speech sample 140. The implicit prosodic features 210 include context information concerning how the speech sample 140 is used in a sentence. For example, a speech sample 140 in the Chinese language may include the following implicit prosodic features 210: Text context: the Chinese characters immediately preceding and immediately following the annotated text of a speech sample 140. Pinyin: the phonetic representation of a speech sample 140. Pinyin is a standard romanization of the Chinese language using the Western alphabet.
Tone context: the tone context of the Chinese characters immediately preceding and immediately following the annotated text of a speech sample 140. Co-articulation: the phonetic level representatives that immediately precede and immediately follow the annotated text of a speech sample 140, such as phonemes or sub-syllables. Syllable position: the position of a syllable in a prosodic phrase. Phrase position: the position of a prosodic phrase in a sentence. Usually the phrase position is identified as one of the three positions of sentence initial, sentence medial and sentence final. Character symbol: the code (e.g., ASCII code) representing the Chinese character that defines a speech sample 140. Length of phrase: the number of Chinese characters included in a prosodic phrase. For an example of the specific values of the above implicit prosodic features 210, consider a sentence in Chinese that begins with the two-character prosodic phrase 中国. If a spoken audio recording of that sentence were stored in an utterance waveform corpus 130, each character's sound could represent a speech sample 140 and could be annotated with the above implicit prosodic features 210. For example, the character 国 as found in that sentence could be annotated as follows: Text context: the character 中 immediately preceding 国 and the character immediately following it; Pinyin: guo2; Tone context: 1, 3; Co-articulation: ong, h; Syllable position: 2; Phrase position: 1; Character symbol: the character code for 国; and Length of phrase: 2.
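For illustration only, such an annotation record could be represented as a small data structure like the sketch below. The field names are inventions of this note rather than the patent's, and the example values mirror the 国 annotation above; the character following 国 and the exact character-code convention are not given here, so they appear as flagged placeholders.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class SpeechSampleAnnotation:
    character: str                    # the character whose waveform this sample is
    text_context: Tuple[str, str]     # characters immediately preceding / following
    pinyin: str                       # phonetic (pinyin) representation, e.g. "guo2"
    tone_context: Tuple[int, int]     # tones of the preceding / following characters
    co_articulation: Tuple[str, str]  # phonetic units immediately before / after
    syllable_position: int            # position of the syllable in its prosodic phrase
    phrase_position: int              # 1 = sentence initial in the example;
                                      # medial/final encodings are assumed
    character_symbol: int             # character code for the annotated character
    length_of_phrase: int             # number of characters in the prosodic phrase

# The 国 example above, with a placeholder for the unspecified following character.
guo_sample = SpeechSampleAnnotation(
    character="国",
    text_context=("中", "?"),         # "?" marks the character not given in the text
    pinyin="guo2",
    tone_context=(1, 3),
    co_articulation=("ong", "h"),
    syllable_position=2,
    phrase_position=1,
    character_symbol=ord("国"),       # a Unicode code point stands in for the code
    length_of_phrase=2,
)
```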
In Fig. 2, step S110 determines whether there is a contextual best match between a text segment 120 and a speech sample 140. A contextual best match is generally defined as the closest, or an exact, match of both 1) the letters or characters (depending on the language) of an input text segment 120 with the corresponding letters or characters of an annotated speech sample 140, and 2) the implicit prosodic features 210 of the input text segment 120 with the implicit prosodic features 210 of the annotated speech sample 140. In more general terms a best match is determined by identifying the greatest number of consecutive syllables in the input text segment that are identical to attributes and attribute positions in each of the waveform utterances (speech samples) in the waveform corpus 60. Only when both the letters or characters and the implicit prosodic features 210 match exactly is a speech sample 140 selected immediately as an element for use in the concatenating algorithm 150. When a contextual best match is not found, the method of the present invention then determines whether there is a contextual phonetic hybrid match between an input text segment 120 and a speech sample 140. As described above, a contextual phonetic hybrid match requires a match between a text segment 120 and all of the implicit prosodic features 210 included in a defined prosodic feature group 220. As shown in Fig. 3, one embodiment of the present invention used to synthesize speech in the Chinese language uses a first defined prosodic feature group 220 that includes the implicit prosodic features 210 of pinyin, tone context, co-articulation, syllable position, phrase position, character symbol, and length of phrase (Step S120). If none of the annotated speech samples 140 found in the utterance waveform corpus 130 have identical values for each of the above features 210 as found in the input text segment 120, then the corpus 130 does not contain a speech sample 140 that is close enough to the input text segment 120 based on the matching rules as applied in Step S120. Therefore the constraints of the matching rules must be relaxed and thus broadened to include other speech samples 140 that possess the next most
preferable features 210 found in the input text segment 120. In other words, the matching rules are broadened by deleting the one feature 210 found in the defined prosodic feature group 220 that is least likely to affect the natural prosody of the input text segment 120. For example, as shown in Step S130 in both Fig. 2 and Fig. 3, the next most preferable features 210 found in the illustrated embodiment of the present invention include all of the features 210 defined above less the length of phrase feature 210. The order in which the implicit prosodic features 210 are deleted from the defined prosodic feature group 220 is determined empirically. When the features 210 are deleted in a proper order, the method of the present invention results in efficient and fast speech synthesis. The output speech therefore sounds more natural even though the utterance waveform corpus 130 may be relatively limited in size. According to the present invention, after the utterance waveform corpus 130 has been compared with a text segment 120 using a particular defined prosodic feature group 220, it is possible that the annotations of multiple speech samples 140 will be found to match the analyzed text segment 120. In such a case, an optimal contextual phonetic hybrid match may be selected by using the following equation: diff = Wp × ((pitch - BestPitch) / BestPitch)^2 + Wd × ((dur - BestDur) / BestDur)^2 (Eq. 1) where Wp = weight of the pitch of the text segment 120; Wd = weight of the duration of the text segment 120; diff = differential value for selecting an optimal contextual phonetic hybrid match; pitch = pitch of the text segment 120; BestPitch = pitch of an ideal text segment 120; dur = duration of the text segment 120; and BestDur = duration of the ideal text segment 120.
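As a hedged illustration, Eq. 1 maps onto a small scoring function; the default weights below are placeholders, since no particular values for Wp and Wd are stated in the text.

```python
def diff_segment(pitch, dur, best_pitch, best_dur, w_p=1.0, w_d=1.0):
    """Eq. 1: weighted squared relative deviation of a candidate sample's pitch
    and duration from the ideal BestPitch and BestDur values for this context."""
    return (w_p * ((pitch - best_pitch) / best_pitch) ** 2
            + w_d * ((dur - best_dur) / best_dur) ** 2)

# Example: a candidate at 210 Hz / 0.18 s scored against ideals of 200 Hz / 0.20 s.
score = diff_segment(210.0, 0.18, best_pitch=200.0, best_dur=0.20)
```

Lower values of diff indicate candidates whose prosody deviates less from the ideal, so the candidate with the smallest value is preferred.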
In the above Equation 1, the variable BestPitch may be determined
based on a statistical analysis of the utterance waveform corpus 130. For example, a corpus 130 may include five tones, each having an average pitch. Each annotated speech sample 140 in the corpus 130 may also include individual prosody information represented by the values of pitch, duration and energy. So the average values of pitch, duration and energy of the entire corpus 130 are available. The best pitch for a particular context may then be determined using the following formula: BestPitch = pitch_tone - nIndex × empiricalValue (Eq. 2) where pitch_tone = the average pitch including tone of the utterance waveform corpus; nIndex = the index of the text segment 120 in a prosody phrase; and empiricalValue = an empirical value based on the utterance waveform corpus. The empirical value of 4 is used in one particular embodiment of the present invention that synthesizes the Chinese language; however this number could vary depending on the content of a particular utterance waveform corpus 130. Similarly, the duration of an ideal text segment 120 may be determined using the following equation: BestDur = dur_s × f_s - nIndex × empiricalValue (Eq. 3) where dur_s = the average duration of the text segment 120 without tone; nIndex = the index of the text segment 120 in a prosody phrase; f_s = a coefficient for prosody position; and empiricalValue = an empirical value based on said utterance waveform corpus. Again, the empirical value of 4 is used in one particular embodiment of the present invention that synthesizes the Chinese language; however this number could vary depending on the content of a particular utterance waveform corpus 130. The differential value for a word, diffW, may be the summation of the differential values for each syllable in the word. That may be represented in mathematical terms by the following equation: diffW = Σ_k diff_k (Eq. 4)
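A compact sketch of Eqs. 2 through 4 follows. The empirical value of 4 comes from the text above, while the argument names are illustrative assumptions rather than the patent's own identifiers.

```python
def best_pitch(pitch_tone, n_index, empirical_value=4.0):
    """Eq. 2: BestPitch = pitch_tone - nIndex x empiricalValue."""
    return pitch_tone - n_index * empirical_value

def best_dur(dur_s, f_s, n_index, empirical_value=4.0):
    """Eq. 3: BestDur = dur_s x f_s - nIndex x empiricalValue."""
    return dur_s * f_s - n_index * empirical_value

def diff_word(syllable_diffs):
    """Eq. 4: the word-level differential is the sum of its per-syllable
    differentials (each computed with Eq. 1)."""
    return sum(syllable_diffs)
```

As described next, when several candidate words match, the one whose word-level differential is lowest (Eq. 5) is chosen, optionally subject to a preset threshold.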
As described above, if several speech samples 140 are found to match a particular text segment 120, the system will choose the speech sample 140 whose differential value is lowest. That may be represented in mathematical terms by the following equation: diffW_min = min_i diffW_i (Eq. 5). Further, the method of the present invention may include the use of preset thresholds for the differential value diffW. If the differential value for a matched speech sample 140 is below a particular threshold, the method will route the matched speech sample 140 to the concatenating algorithm 150 for generating the concatenative speech 110. Otherwise, the method may require relaxing the constraints on the contextual phonetic hybrid match by deleting one of the required implicit prosodic features 210 and continuing to search for a match. Although the above description concerns a specific example of the method of the present invention for the Chinese language, the invention is suitable for many languages. For some languages the implicit prosodic features 210 would need to be deleted or redefined from the examples given hereinabove. For example, the feature 210 identified above as tone context would be deleted in an application of the present invention for the English language because English is not a tonal language. Also, the feature 210 identified above as pinyin would likely be redefined as simply a phonetic symbol when the present invention is applied to English. The present invention is therefore a multi-layer, data-driven prosodic control scheme that utilizes the implicit prosodic information in an utterance waveform corpus 130. When searching for an appropriate speech sample 140 to match with a given input text segment 120, the method of the present invention employs a strategy based on multi-layer matching, where each layer is tried in turn until a sufficiently good match is found. By successively
relaxing the constraints of each layer, the method efficiently determines whether the utterance waveform corpus 130 contains a match. The method is therefore particularly appropriate for embedded TTS systems where the size of the utterance waveform corpus 130 and the processing power of the system may be limited. Although exemplary embodiments of a method of the present invention have been illustrated in the accompanying drawings and described in the foregoing description, it is to be understood that the invention is not limited to the embodiments disclosed; rather the invention can be varied in numerous ways, particularly concerning applications in languages other than
Chinese. It should, therefore, be recognized that the invention should be limited only by the scope of the following claims.
Claims
1. A method for performing speech synthesis on a text segment, the method being performed on an electronic device, the method comprising: comparing a text segment with an utterance waveform corpus, said utterance waveform corpus comprising a plurality of speech waveform samples; determining a best match between consecutive syllables in the text segment and attributes associated with sampled speech waveform utterances, the best match being determined by identifying the greatest number of consecutive syllables that are identical to the attributes and attribute positions in each of the waveform utterances; ascertaining a suitable match for each unmatched syllable in the text segment, each unmatched syllable being a syllable that is not one of the consecutive syllables and the suitable match being determined from a comparison of prosodic features in a prosodic feature group with the attributes associated with sampled speech waveform utterances, wherein the ascertaining is characterized by successively removing the prosodic features from the prosodic feature group until there is said suitable match; and generating concatenated synthesized speech for the text segment by using the speech waveform samples in the corpus, the speech waveform samples being selected from the best match between consecutive syllables and the suitable match for each unmatched syllable.
2. The method of claim 1, wherein the prosodic features include features selected from the group consisting of text context, pinyin, tone context, co-articulation, syllable position, phrase position, character symbol, and length of phrase.
3. The method of claim 1, wherein the prosodic features comprise tone context, co-articulation, syllable position, phrase position, and character symbol.
4. The method of claim 1, further comprising the step of performing a basic phonetic match based on only pinyin after all of said other prosodic features have been successively removed.
5. The method of claim 1, wherein the step of determining includes the step of selecting an optimal contextual phonetic hybrid match when numerous best matches are found by using the formula: diff = Wp × ((pitch - BestPitch) / BestPitch)^2 + Wd × ((dur - BestDur) / BestDur)^2 where Wp = weight of the pitch of said speech segment; Wd = weight of the duration of said speech segment; diff = differential value for selecting said optimal contextual phonetic hybrid match; pitch = pitch of said speech segment; BestPitch = pitch of an ideal speech segment; dur = duration of said speech segment; and BestDur = duration of said ideal speech segment.
6. The method of claim 5, wherein the BestPitch is determined using the formula: BestPitch = pitch_tone - nIndex × empiricalValue where pitch_tone = the average pitch including tone of said utterance waveform corpus; nIndex = the index of said speech segment in a prosody phrase; and empiricalValue = an empirical value based on said utterance waveform corpus.
7. The method of claim 5, wherein the BestDur is determined using the formula: BestDur = dur_s × f_s - nIndex × empiricalValue where dur_s = the average duration of said speech segment without tone; nIndex = the index of said speech segment in a prosody phrase; f_s = the coefficient for prosody position; and empiricalValue = an empirical value based on said utterance waveform corpus.
8. The method of claim 1, wherein the step of determining includes the step of selecting an optimal contextual phonetic hybrid match when numerous suitable matches are found by using the formula: diff = Wp × ((pitch - BestPitch) / BestPitch)^2 + Wd × ((dur - BestDur) / BestDur)^2 where Wp = weight of the pitch of said speech segment; Wd = weight of the duration of said speech segment; diff = differential value for selecting said optimal contextual phonetic hybrid match; pitch = pitch of said speech segment; BestPitch = pitch of an ideal speech segment; dur = duration of said speech segment; and BestDur = duration of said ideal speech segment.
9. The method of claim 8, wherein said optimal contextual phonetic hybrid match is the match having the lowest differential value (diff).
10. The method of claim 8, wherein said differential value (diff) for selecting said optimal contextual phonetic hybrid match is compared with a preset threshold.
11. The method of claim 8, wherein the BestPitch is determined using the formula: BestPitch = pitch_tone - nIndex × empiricalValue where pitch_tone = the average pitch including tone of said utterance waveform corpus; nIndex = the index of said speech segment in a prosody phrase; and empiricalValue = an empirical value based on said utterance waveform corpus.
12. The method of claim 8, wherein the BestDur is determined using the formula:
BestDur = dur_s × f_s - nIndex × empiricalValue where dur_s = the average duration of said speech segment without tone; nIndex = the index of said speech segment in a prosody phrase; f_s = the coefficient for prosody position; and empiricalValue = an empirical value based on said utterance waveform corpus.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB031326986A CN1260704C (en) | 2003-09-29 | 2003-09-29 | Method for voice synthesizing |
PCT/US2004/030467 WO2005034082A1 (en) | 2003-09-29 | 2004-09-17 | Method for synthesizing speech |
Publications (2)
Publication Number | Publication Date |
---|---|
EP1668628A1 true EP1668628A1 (en) | 2006-06-14 |
EP1668628A4 EP1668628A4 (en) | 2007-01-10 |
Family
ID=34398359
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP04784355A Withdrawn EP1668628A4 (en) | 2003-09-29 | 2004-09-17 | Method for synthesizing speech |
Country Status (5)
Country | Link |
---|---|
EP (1) | EP1668628A4 (en) |
KR (1) | KR100769033B1 (en) |
CN (1) | CN1260704C (en) |
MX (1) | MXPA06003431A (en) |
WO (1) | WO2005034082A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112530406A (en) * | 2020-11-30 | 2021-03-19 | 深圳市优必选科技股份有限公司 | Voice synthesis method, voice synthesis device and intelligent equipment |
Families Citing this family (58)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8677377B2 (en) | 2005-09-08 | 2014-03-18 | Apple Inc. | Method and apparatus for building an intelligent automated assistant |
US9318108B2 (en) | 2010-01-18 | 2016-04-19 | Apple Inc. | Intelligent automated assistant |
US8996376B2 (en) | 2008-04-05 | 2015-03-31 | Apple Inc. | Intelligent text-to-speech conversion |
US10241752B2 (en) | 2011-09-30 | 2019-03-26 | Apple Inc. | Interface for a virtual digital assistant |
US10241644B2 (en) | 2011-06-03 | 2019-03-26 | Apple Inc. | Actionable reminder entries |
US9431006B2 (en) | 2009-07-02 | 2016-08-30 | Apple Inc. | Methods and apparatuses for automatic speech recognition |
TWI421857B (en) * | 2009-12-29 | 2014-01-01 | Ind Tech Res Inst | Apparatus and method for generating a threshold for utterance verification and speech recognition system and utterance verification system |
US8682667B2 (en) | 2010-02-25 | 2014-03-25 | Apple Inc. | User profiling for selecting user specific voice input processing information |
US10134385B2 (en) | 2012-03-02 | 2018-11-20 | Apple Inc. | Systems and methods for name pronunciation |
US9721563B2 (en) | 2012-06-08 | 2017-08-01 | Apple Inc. | Name recognition system |
KR20140008870A (en) * | 2012-07-12 | 2014-01-22 | 삼성전자주식회사 | Method for providing contents information and broadcasting receiving apparatus thereof |
US9547647B2 (en) | 2012-09-19 | 2017-01-17 | Apple Inc. | Voice-based media searching |
WO2014197334A2 (en) | 2013-06-07 | 2014-12-11 | Apple Inc. | System and method for user-specified pronunciation of words for speech synthesis and recognition |
US9430463B2 (en) | 2014-05-30 | 2016-08-30 | Apple Inc. | Exemplar-based natural language processing |
US9338493B2 (en) | 2014-06-30 | 2016-05-10 | Apple Inc. | Intelligent automated assistant for TV user interactions |
US9668121B2 (en) | 2014-09-30 | 2017-05-30 | Apple Inc. | Social reminders |
CN105989833B (en) * | 2015-02-28 | 2019-11-15 | 讯飞智元信息科技有限公司 | Multilingual mixed this making character fonts of Chinese language method and system |
US10567477B2 (en) | 2015-03-08 | 2020-02-18 | Apple Inc. | Virtual assistant continuity |
CN106157948B (en) * | 2015-04-22 | 2019-10-18 | 科大讯飞股份有限公司 | A kind of fundamental frequency modeling method and system |
US9578173B2 (en) | 2015-06-05 | 2017-02-21 | Apple Inc. | Virtual assistant aided communication with 3rd party service in a communication session |
CN105096934B (en) * | 2015-06-30 | 2019-02-12 | 百度在线网络技术(北京)有限公司 | Construct method, phoneme synthesizing method, device and the equipment in phonetic feature library |
US10671428B2 (en) | 2015-09-08 | 2020-06-02 | Apple Inc. | Distributed personal assistant |
US10747498B2 (en) | 2015-09-08 | 2020-08-18 | Apple Inc. | Zero latency digital assistant |
US10366158B2 (en) | 2015-09-29 | 2019-07-30 | Apple Inc. | Efficient word encoding for recurrent neural network language models |
US11010550B2 (en) | 2015-09-29 | 2021-05-18 | Apple Inc. | Unified language modeling framework for word prediction, auto-completion and auto-correction |
US10691473B2 (en) | 2015-11-06 | 2020-06-23 | Apple Inc. | Intelligent automated assistant in a messaging environment |
US10049668B2 (en) | 2015-12-02 | 2018-08-14 | Apple Inc. | Applying neural network language models to weighted finite state transducers for automatic speech recognition |
US10223066B2 (en) | 2015-12-23 | 2019-03-05 | Apple Inc. | Proactive assistance based on dialog communication between devices |
US10446143B2 (en) | 2016-03-14 | 2019-10-15 | Apple Inc. | Identification of voice inputs providing credentials |
US9934775B2 (en) | 2016-05-26 | 2018-04-03 | Apple Inc. | Unit-selection text-to-speech synthesis based on predicted concatenation parameters |
US9972304B2 (en) | 2016-06-03 | 2018-05-15 | Apple Inc. | Privacy preserving distributed evaluation framework for embedded personalized systems |
US10249300B2 (en) | 2016-06-06 | 2019-04-02 | Apple Inc. | Intelligent list reading |
US10049663B2 (en) | 2016-06-08 | 2018-08-14 | Apple, Inc. | Intelligent automated assistant for media exploration |
DK179309B1 (en) | 2016-06-09 | 2018-04-23 | Apple Inc | Intelligent automated assistant in a home environment |
US10192552B2 (en) | 2016-06-10 | 2019-01-29 | Apple Inc. | Digital assistant providing whispered speech |
US10586535B2 (en) | 2016-06-10 | 2020-03-10 | Apple Inc. | Intelligent digital assistant in a multi-tasking environment |
US10067938B2 (en) | 2016-06-10 | 2018-09-04 | Apple Inc. | Multilingual word prediction |
US10490187B2 (en) | 2016-06-10 | 2019-11-26 | Apple Inc. | Digital assistant providing automated status report |
US10509862B2 (en) | 2016-06-10 | 2019-12-17 | Apple Inc. | Dynamic phrase expansion of language input |
DK201670540A1 (en) | 2016-06-11 | 2018-01-08 | Apple Inc | Application integration with a digital assistant |
DK179049B1 (en) | 2016-06-11 | 2017-09-18 | Apple Inc | Data driven natural language event detection and classification |
DK179343B1 (en) | 2016-06-11 | 2018-05-14 | Apple Inc | Intelligent task discovery |
DK179415B1 (en) | 2016-06-11 | 2018-06-14 | Apple Inc | Intelligent device arbitration and control |
US10043516B2 (en) | 2016-09-23 | 2018-08-07 | Apple Inc. | Intelligent automated assistant |
CN106534528A (en) * | 2016-11-04 | 2017-03-22 | 广东欧珀移动通信有限公司 | A text information processing method, device and mobile terminal |
US10593346B2 (en) | 2016-12-22 | 2020-03-17 | Apple Inc. | Rank-reduced token representation for automatic speech recognition |
DK201770439A1 (en) | 2017-05-11 | 2018-12-13 | Apple Inc. | Offline personal assistant |
DK179496B1 (en) | 2017-05-12 | 2019-01-15 | Apple Inc. | USER-SPECIFIC Acoustic Models |
DK179745B1 (en) | 2017-05-12 | 2019-05-01 | Apple Inc. | SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT |
DK201770432A1 (en) | 2017-05-15 | 2018-12-21 | Apple Inc. | Hierarchical belief states for digital assistants |
DK201770431A1 (en) | 2017-05-15 | 2018-12-20 | Apple Inc. | Optimizing dialogue policy decisions for digital assistants using implicit feedback |
DK179560B1 (en) | 2017-05-16 | 2019-02-18 | Apple Inc. | Far-field extension for digital assistant services |
CN107481713B (en) * | 2017-07-17 | 2020-06-02 | 清华大学 | A kind of mixed language speech synthesis method and apparatus |
CN109948124B (en) * | 2019-03-15 | 2022-12-23 | 腾讯科技(深圳)有限公司 | Voice file segmentation method and device and computer equipment |
CN110942765B (en) * | 2019-11-11 | 2022-05-27 | 珠海格力电器股份有限公司 | Method, device, server and storage medium for constructing corpus |
CN111128116B (en) * | 2019-12-20 | 2021-07-23 | 珠海格力电器股份有限公司 | Voice processing method and device, computing equipment and storage medium |
US20210350788A1 (en) * | 2020-05-06 | 2021-11-11 | Samsung Electronics Co., Ltd. | Electronic device for generating speech signal corresponding to at least one text and operating method of the electronic device |
CN113393829B (en) * | 2021-06-16 | 2023-08-29 | 哈尔滨工业大学(深圳) | Chinese speech synthesis method integrating rhythm and personal information |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5970454A (en) * | 1993-12-16 | 1999-10-19 | British Telecommunications Public Limited Company | Synthesizing speech by converting phonemes to digital waveforms |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS6449622A (en) * | 1987-08-19 | 1989-02-27 | Jsp Corp | Resin foaming particle containing crosslinked polyolefin-based resin and manufacture thereof |
US5704007A (en) * | 1994-03-11 | 1997-12-30 | Apple Computer, Inc. | Utilization of multiple voice sources in a speech synthesizer |
US6134528A (en) * | 1997-06-13 | 2000-10-17 | Motorola, Inc. | Method device and article of manufacture for neural-network based generation of postlexical pronunciations from lexical pronunciations |
KR100259777B1 (en) * | 1997-10-24 | 2000-06-15 | 정선종 | Optimal synthesis unit selection method in text-to-speech system |
US7283964B1 (en) * | 1999-05-21 | 2007-10-16 | Winbond Electronics Corporation | Method and apparatus for voice controlled devices with improved phrase storage, use, conversion, transfer, and recognition |
DE60215296T2 (en) * | 2002-03-15 | 2007-04-05 | Sony France S.A. | Method and apparatus for the speech synthesis program, recording medium, method and apparatus for generating a forced information and robotic device |
JP2003295882A (en) * | 2002-04-02 | 2003-10-15 | Canon Inc | Text structure for speech synthesis, speech synthesizing method, speech synthesizer and computer program therefor |
KR100883649B1 (en) * | 2002-04-04 | 2009-02-18 | 삼성전자주식회사 | Text-to-speech device and method |
GB2388286A (en) * | 2002-05-01 | 2003-11-05 | Seiko Epson Corp | Enhanced speech data for use in a text to speech system |
CN1320482C (en) * | 2003-09-29 | 2007-06-06 | 摩托罗拉公司 | Natural voice pause in identification text strings |
-
2003
- 2003-09-29 CN CNB031326986A patent/CN1260704C/en not_active Expired - Lifetime
-
2004
- 2004-09-17 MX MXPA06003431A patent/MXPA06003431A/en not_active Application Discontinuation
- 2004-09-17 EP EP04784355A patent/EP1668628A4/en not_active Withdrawn
- 2004-09-17 KR KR1020067006170A patent/KR100769033B1/en not_active Expired - Lifetime
- 2004-09-17 WO PCT/US2004/030467 patent/WO2005034082A1/en active Application Filing
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5970454A (en) * | 1993-12-16 | 1999-10-19 | British Telecommunications Public Limited Company | Synthesizing speech by converting phonemes to digital waveforms |
Non-Patent Citations (6)
Title |
---|
HELEN M MENG ET AL: "CU VOCAL: CORPUS-BASED SYLLABLE CONCATENATION FOR CHINESE SPEECH SYNTHESIS ACROSS DOMAINS AND DIALECTS" ICSLP 2002 : 7TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING. DENVER, COLORADO, SEPT. 16 - 20, 2002, vol. 4 OF 4, 16 September 2002 (2002-09-16), pages 2373-2376, XP007011576 ISBN: 1-876346-40-X * |
HIROKAWA T ET AL: "HIGH QUALITY SPEECH SYNTHESIS SYSTEM BASED ON WAVEFORM CONCATENATION OF PHONEME SEGMENT" IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS, COMMUNICATIONS AND COMPUTER SCIENCES, ENGINEERING SCIENCES SOCIETY, TOKYO, JP, vol. 76A, no. 11, 1 November 1993 (1993-11-01), pages 1964-1970, XP000420615 ISSN: 0916-8508 * |
REN-HUA WANG ET AL.: "A CORPUS-BASED CHINESE SPEECH SYNTHESIS WITH CONTEXTUAL DEPENDENT UNIT SELECTION" IEEE INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING (ICSLP), vol. 2, 16 October 2000 (2000-10-16), pages 391-394, XP007010255 * |
See also references of WO2005034082A1 * |
WEIBIN ZHU ET AL: "Corpus building for data-driven tts systems" SPEECH SYNTHESIS, 2002. PROCEEDINGS OF 2002 IEEE WORKSHOP ON 11-13 SEPT. 2002, PISCATAWAY, NJ, USA,IEEE, 11 September 2002 (2002-09-11), pages 199-202, XP010653645 ISBN: 0-7803-7395-2 * |
WOEI-LUEN PERNG ET AL: "Image Talk: a real time synthetic talking head using one single image with Chinese text-to-speech capability" COMPUTER GRAPHICS AND APPLICATIONS, 1998. PACIFIC GRAPHICS '98. SIXTH PACIFIC CONFERENCE ON SINGAPORE 26-29 OCT. 1998, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 26 October 1998 (1998-10-26), pages 140-148, XP010315487 ISBN: 0-8186-8620-0 * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112530406A (en) * | 2020-11-30 | 2021-03-19 | 深圳市优必选科技股份有限公司 | Voice synthesis method, voice synthesis device and intelligent equipment |
Also Published As
Publication number | Publication date |
---|---|
EP1668628A4 (en) | 2007-01-10 |
MXPA06003431A (en) | 2006-06-20 |
KR20060066121A (en) | 2006-06-15 |
CN1604182A (en) | 2005-04-06 |
KR100769033B1 (en) | 2007-10-22 |
WO2005034082A1 (en) | 2005-04-14 |
CN1260704C (en) | 2006-06-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR100769033B1 (en) | Method for synthesizing speech | |
US5949961A (en) | Word syllabification in speech synthesis system | |
US6823309B1 (en) | Speech synthesizing system and method for modifying prosody based on match to database | |
US6684187B1 (en) | Method and system for preselection of suitable units for concatenative speech | |
US6029132A (en) | Method for letter-to-sound in text-to-speech synthesis | |
US6505158B1 (en) | Synthesis-based pre-selection of suitable units for concatenative speech | |
US6243680B1 (en) | Method and apparatus for obtaining a transcription of phrases through text and spoken utterances | |
US6910012B2 (en) | Method and system for speech recognition using phonetically similar word alternatives | |
EP0833304B1 (en) | Prosodic databases holding fundamental frequency templates for use in speech synthesis | |
KR100403293B1 (en) | Speech synthesizing method, speech synthesis apparatus, and computer-readable medium recording speech synthesis program | |
JP3481497B2 (en) | Method and apparatus using a decision tree to generate and evaluate multiple pronunciations for spelled words | |
EP1213705A2 (en) | Method and apparatus for speech synthesis without prosody modification | |
WO1996023298A2 (en) | System amd method for generating and using context dependent sub-syllable models to recognize a tonal language | |
JP5198046B2 (en) | Voice processing apparatus and program thereof | |
JPH0916602A (en) | Translation system and its method | |
WO2006106182A1 (en) | Improving memory usage in text-to-speech system | |
JP3576066B2 (en) | Speech synthesis system and speech synthesis method | |
Akinwonmi | Development of a prosodic read speech syllabic corpus of the Yoruba language | |
CN114999447A (en) | Speech synthesis model based on confrontation generation network and training method | |
Hendessi et al. | A speech synthesizer for Persian text using a neural network with a smooth ergodic HMM | |
Kaur et al. | BUILDING AText-TO-SPEECH SYSTEM FOR PUNJABI LANGUAGE | |
JP2005534968A (en) | Deciding to read kanji | |
GB2292235A (en) | Word syllabification. | |
Bharthi et al. | Unit selection based speech synthesis for converting short text message into voice message in mobile phones | |
JP2003345372A (en) | Method and device for synthesizing voice |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
17P | Request for examination filed |
Effective date: 20060323 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): DE FR GB IT |
|
DAX | Request for extension of the european patent (deleted) | ||
RBV | Designated contracting states (corrected) |
Designated state(s): DE FR GB IT |
|
A4 | Supplementary search report drawn up and despatched |
Effective date: 20061208 |
|
17Q | First examination report despatched |
Effective date: 20070907 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
18D | Application deemed to be withdrawn |
Effective date: 20080118 |
|
P01 | Opt-out of the competence of the unified patent court (upc) registered |
Effective date: 20230520 |